Susan:

I am running DSpace 1.7.1...I "think" Tim may be right about my config settings 
(thanks for the suggestions, Tim).   I am going to test that out and I will let 
everyone know. 

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924


-----Original Message-----
From: Thornton, Susan M. (LARC-B702)[LITES] [mailto:[email protected]] 
Sent: Thursday, October 13, 2011 1:04 PM
To: Tim Donohue; George S Kozak
Cc: [email protected]
Subject: RE: [Dspace-tech] Strange problem with searching - More and disturbing 
information!

What version of DSpace are you running?  I just tested something completely 
unrelated this morning, but it involved adding a second document to an Item, 
then running filter media, then doing a search to see if the text in the second 
document was found - it WAS.

We are running DSpace 1.7.1. JSPUI.

Sue


Sue Walker-Thornton
(757) 864-2368

-----Original Message-----
From: Tim Donohue [mailto:[email protected]]
Sent: Thursday, October 13, 2011 12:50 PM
To: George S Kozak
Cc: [email protected]
Subject: Re: [Dspace-tech] Strange problem with searching - More and disturbing 
information!

Hi George,

Hmm..that's a bit odd. It's definitely not a "known issue".

In fact, looking at the DSIndexer class (which is the class which 
creates/updates the Lucene search index), it should be doing what you expect. 
The 'buildDocumentForItem()' method is the one that takes care of indexing all 
Item content into a Lucene "Document".

https://fisheye3.atlassian.com/browse/~br=trunk/dspace/dspace/trunk/dspace-api/src/main/java/org/dspace/search/DSIndexer.java?hb=true#to1040

Specifically, it should be doing the following:
1. Initialize the Lucene Document for the Item
2. Index all Item Metadata
3. Add in all various "sort" options (so you can sort search results)
4. Locate the "TEXT" Bundle in the Item and index *all* Bitstreams in that
   Bundle.

If you turn on Debugging you should actually see the DSIndexer report
*every* Bitstream that it adds to the index.
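
As a rough illustration of the four steps above, here is a minimal plain-Java sketch. It is not the actual DSIndexer code and does not use Lucene: the class IndexingStepsSketch, the Bundle stand-in, and the field names "sort_title" and "fulltext" are all hypothetical, and a simple Map plays the role of the Lucene "Document".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class IndexingStepsSketch {

    /** Hypothetical stand-in for a Bundle: a name plus the text of each Bitstream. */
    static final class Bundle {
        final String name;
        final List<String> bitstreamTexts;

        Bundle(String name, List<String> bitstreamTexts) {
            this.name = name;
            this.bitstreamTexts = bitstreamTexts;
        }
    }

    /** Builds a field-name -> values map that stands in for a Lucene Document. */
    static Map<String, List<String>> buildDocument(Map<String, String> metadata,
                                                   String sortTitle,
                                                   List<Bundle> bundles) {
        // 1. Initialize the document for the Item.
        Map<String, List<String>> doc = new LinkedHashMap<>();

        // 2. Index all Item metadata.
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            add(doc, entry.getKey(), entry.getValue());
        }

        // 3. Add a sort option (field name is an assumption for this sketch).
        add(doc, "sort_title", sortTitle);

        // 4. Locate the TEXT bundle and index every Bitstream in it.
        for (Bundle bundle : bundles) {
            if ("TEXT".equals(bundle.name)) {
                for (String text : bundle.bitstreamTexts) {
                    add(doc, "fulltext", text);
                }
            }
        }
        return doc;
    }

    private static void add(Map<String, List<String>> doc, String field, String value) {
        doc.computeIfAbsent(field, k -> new ArrayList<>()).add(value);
    }

    public static void main(String[] args) {
        Map<String, String> metadata = new LinkedHashMap<>();
        metadata.put("title", "Sample item");
        List<Bundle> bundles = Arrays.asList(
                new Bundle("ORIGINAL", Arrays.asList("ignored here")),
                new Bundle("TEXT", Arrays.asList("text of first PDF", "text of second PDF")));
        System.out.println(buildDocument(metadata, "sample item", bundles));
    }
}
```

The point of the sketch is step 4: every Bitstream in the TEXT bundle is added, not just the first one, which is why the behaviour George describes is unexpected.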

So, I'm a bit at a loss as to what may be happening.  It sounds like your 
"TEXT" bundle is getting all the right Bitstreams added (by filter-media). I'm 
assuming there is only *one* TEXT Bundle, right? (if there are multiple that 
may be the issue -- but DSpace itself should only be generating one TEXT 
bundle).

The only other thing I can think of is that your 'search.maxfieldlength' 
setting is too small.  In your dspace.cfg you should see:

# Maximum number of terms indexed for a single field in Lucene.
# Default is 10,000 words - often not enough for full-text indexing.
# If you change this, you'll need to re-index for the change
# to take effect on previously added items.
# -1 = unlimited (Integer.MAX_VALUE)
search.maxfieldlength = 10000

So, it could be possible that these PDFs are larger, and Lucene just stops 
indexing content after 10,000 words.  You can set this to "-1" if you want to 
disable any word-based limit.
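
To show how such a limit could produce the "only the first bitstream is searchable" symptom, here is a small plain-Java sketch. It is not Lucene or DSpace code: the class and method names are hypothetical, tokenizing on whitespace is a simplification, and it assumes that the text of all bitstreams goes into the same field and that the term limit applies to that field as a whole.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MaxFieldLengthSketch {

    /**
     * Returns the terms that would be kept for one field, assuming the texts
     * are added in order and the limit counts terms across all of them.
     * As in the dspace.cfg comment, -1 is treated as Integer.MAX_VALUE.
     */
    static List<String> indexedTerms(List<String> bitstreamTexts, int maxFieldLength) {
        int limit = (maxFieldLength == -1) ? Integer.MAX_VALUE : maxFieldLength;
        List<String> kept = new ArrayList<>();
        for (String text : bitstreamTexts) {
            for (String term : text.trim().split("\\s+")) {
                if (kept.size() >= limit) {
                    return kept; // limit reached: everything after this is dropped
                }
                if (!term.isEmpty()) {
                    kept.add(term);
                }
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> texts = Arrays.asList("alpha beta gamma", "delta epsilon");
        System.out.println(indexedTerms(texts, 3));  // [alpha, beta, gamma]
        System.out.println(indexedTerms(texts, -1)); // [alpha, beta, gamma, delta, epsilon]
    }
}
```

With a limit of 3, the first text uses up the whole budget and nothing from the second text is kept. Under the same assumptions, a first PDF of 10,000 or more words would leave no room for the later ones.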

Not sure if that helps or not! :)

- Tim


On 10/13/2011 11:28 AM, George S Kozak wrote:
> Hi. Everyone:
>
> After a bit of digging, what I have discovered is that for any item 
> that has multiple PDF bitstreams, only the first bitstream added is 
> searchable. The other bitstreams in the item seem to be ignored by the 
> indexer. I have checked and the extracted texts are there, so it is 
> not an issue with the filter-media program.
>
> We (at Cornell) have many items with multiple PDF bitstreams, and 
> so far all of my testing indicates that only the first bitstream of the 
> item is being indexed by the DSpace search engine.
>
> Is this a known issue? Is there something wrong in my configuration 
> files that may be causing this?
>
> George Kozak
> Digital Library Specialist
> Cornell University Library Information Technologies (CUL-IT)
> 501 Olin Library
> Cornell University
> Ithaca, NY 14853
> 607-255-8924
>
> ------------------------------------------------------------------------------
> All the data continuously generated in your IT infrastructure contains a
> definitive record of customers, application performance, security threats,
> fraudulent activity and more. Splunk takes this data and makes sense of it.
> Business sense. IT sense. Common sense.
> http://p.sf.net/sfu/splunk-d2d-oct
>
>
>
> _______________________________________________
> DSpace-tech mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/dspace-tech
