Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

Tim Donohue Thu, 13 Oct 2011 09:50:28 -0700

Hi George,

Hmm... that's a bit odd. It's definitely not a "known issue".

In fact, looking at the DSIndexer class (the class that creates and 
updates the Lucene search index), it should be doing what you expect. 
The 'buildDocumentForItem()' method takes care of indexing all Item 
content into a Lucene "Document".

https://fisheye3.atlassian.com/browse/~br=trunk/dspace/dspace/trunk/dspace-api/src/main/java/org/dspace/search/DSIndexer.java?hb=true#to1040

Specifically, it should be doing the following:
1. Initialize the Lucene Document for the Item
2. Index all Item Metadata
3. Add all the various "sort" options (so you can sort search results)
4. Locate the "TEXT" Bundle in the Item and index *all* Bitstreams in 
that Bundle.
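
As a rough illustration of those four steps, here is a minimal Python 
sketch. It is not the DSIndexer code: the Item and Bundle classes, the 
"default" and "sort_" field names, and the list-of-pairs "document" are 
all hypothetical stand-ins. The point is only that every Bitstream in 
the TEXT Bundle should end up in the document, not just the first one.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class Bundle:
    # Hypothetical stand-in for a DSpace Bundle; each entry is the
    # extracted text of one Bitstream.
    name: str
    bitstreams: List[str]


@dataclass
class Item:
    # Hypothetical stand-in for a DSpace Item.
    metadata: Dict[str, List[str]]
    bundles: List[Bundle]


def build_document_for_item(item: Item,
                            sort_fields=("title",)) -> List[Tuple[str, str]]:
    # Step 1: start an empty "document" (a list of field/value pairs).
    doc: List[Tuple[str, str]] = []

    # Step 2: index every metadata value.
    for name, values in item.metadata.items():
        for value in values:
            doc.append((name, value))

    # Step 3: add sort fields (assumed here: first value, lowercased).
    for name in sort_fields:
        values = item.metadata.get(name)
        if values:
            doc.append(("sort_" + name, values[0].lower()))

    # Step 4: add the text of *all* Bitstreams in the TEXT Bundle.
    for bundle in item.bundles:
        if bundle.name == "TEXT":
            for text in bundle.bitstreams:
                doc.append(("default", text))

    return doc
```

With an Item holding two extracted texts in its TEXT Bundle, the 
returned list contains two "default" entries, one per Bitstream.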

If you turn on debug logging, you should actually see DSIndexer report 
*every* Bitstream that it adds to the index.

So, I'm a bit at a loss as to what may be happening. It sounds like 
your "TEXT" bundle is getting all the right Bitstreams added (by 
filter-media). I'm assuming there is only *one* TEXT Bundle, right? (If 
there are multiple, that may be the issue, but DSpace itself should 
only be generating one TEXT bundle.)

The only other thing I can think of is that your 'search.maxfieldlength' 
setting is too small.  In your dspace.cfg you should see:

# Maximum number of terms indexed for a single field in Lucene.
# Default is 10,000 words - often not enough for full-text indexing.
# If you change this, you'll need to re-index for the change
# to take effect on previously added items.
# -1 = unlimited (Integer.MAX_VALUE)
search.maxfieldlength = 10000

So, it's possible that these PDFs are larger than that limit, and Lucene 
simply stops indexing content after 10,000 words. You can set this to 
"-1" if you want to disable any word-based limit (you'll need to 
re-index afterwards for existing items to pick up the change).
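
To make the suspected effect concrete, here is a small Python 
simulation. It is only a sketch under an assumption this thread does 
not confirm: that the limit is counted across everything added to one 
field of a document, so a long first Bitstream can use up the whole 
budget before the second one is reached. The function index_terms() and 
its whitespace tokenizing are hypothetical, not Lucene's behavior.

```python
def index_terms(field_values, max_field_length=10000):
    # -1 is treated as "no limit", mirroring the dspace.cfg comment.
    limit = float("inf") if max_field_length == -1 else max_field_length
    indexed = []
    for value in field_values:
        for term in value.split():
            if len(indexed) >= limit:
                # Budget used up: everything after this point is dropped.
                return indexed
            indexed.append(term.lower())
    return indexed


# First "Bitstream": 12,000 words. Second: contains the word "needle".
first = " ".join(["alpha"] * 12000)
second = "needle " + " ".join(["beta"] * 50)

capped = index_terms([first, second], 10000)
unlimited = index_terms([first, second], -1)

print("needle" in capped, "needle" in unlimited)  # prints: False True
```

Under that assumption, a search for a word that only appears in the 
second Bitstream finds nothing with the default setting, which would 
match the "only the first bitstream is searchable" symptom.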

Not sure if that helps or not! :)

- Tim

On 10/13/2011 11:28 AM, George S Kozak wrote:
> Hi. Everyone:
>
> After a bit of digging what I have discovered is that any item that has
> multiple bitstreams of PDFs, only the first bitstream added is
> searchable. The other bitstreams in the item seem to be ignored by the
> indexer. I have checked and the extracted Texts are there, so it is not
> an issue with the filter-media program.
>
> We (at Cornell) have many items with multiple bitstreams of PDFs, and so
> far all of my testing indicates only the first bitstream of the item is
> being indexed by the DSpace search engine.
>
> Is this a known issue? Is there something wrong in my configuration
> files that may be causing this?
>
> George Kozak
>
> Digital Library Specialist
>
> Cornell University Library Information Technologies (CUL-IT)
>
> 501 Olin Library
>
> Cornell University
>
> Ithaca, NY 14853
>
> 607-255-8924
>
>
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security
> threats, fraudulent activity and more. Splunk takes this data and makes
> sense of it. Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
>
>
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech

