[Dspace-tech] Strange problem with searching - More and disturbing information!
Hi everyone,

After a bit of digging, I have discovered that for any item with multiple PDF bitstreams, only the first bitstream added is searchable. The other bitstreams in the item seem to be ignored by the indexer. I have checked, and the extracted texts are there, so it is not an issue with the filter-media program.

We (at Cornell) have many items with multiple PDF bitstreams, and so far all of my testing indicates that only the first bitstream of an item is being indexed by the DSpace search engine.

Is this a known issue? Is there something wrong in my configuration files that may be causing this?

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924

___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Hi George,

Hmm, that's a bit odd. It's definitely not a known issue. In fact, looking at the DSIndexer class (the class that creates and updates the Lucene search index), it should be doing what you expect. The 'buildDocumentForItem()' method is the one that takes care of indexing all Item content into a Lucene Document.

https://fisheye3.atlassian.com/browse/~br=trunk/dspace/dspace/trunk/dspace-api/src/main/java/org/dspace/search/DSIndexer.java?hb=true#to1040

Specifically, it should be doing the following:

1. Initialize the Lucene Document for the Item
2. Index all Item Metadata
3. Add in all the various sort options (so you can sort search results)
4. Locate the TEXT Bundle in the Item and index *all* Bitstreams in that Bundle. If you turn on debugging, you should actually see the DSIndexer report *every* Bitstream that it adds to the index.

So, I'm a bit at a loss as to what may be happening. It sounds like your TEXT bundle is getting all the right Bitstreams added (by filter-media). I'm assuming there is only *one* TEXT Bundle, right? (If there are multiple, that may be the issue, but DSpace itself should only be generating one TEXT bundle.)

The only other thing I can think of is that your 'search.maxfieldlength' setting is too small. In your dspace.cfg you should see:

    # Maximum number of terms indexed for a single field in Lucene.
    # Default is 10,000 words - often not enough for full-text indexing.
    # If you change this, you'll need to re-index for the change
    # to take effect on previously added items.
    # -1 = unlimited (Integer.MAX_VALUE)
    search.maxfieldlength = 10000

So it is possible that these PDFs are large, and Lucene simply stops indexing content after 10,000 words. You can set this to -1 if you want to disable any word-based limit.

Not sure if that helps or not! :)

- Tim
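To illustrate why a term limit could hide everything after the first bitstream, here is a minimal sketch in Python. It is a toy model and not DSpace or Lucene code: the function name `index_field` and the sample texts are hypothetical, and it assumes that the extracted texts of one item are fed into a single full-text field in order, with the limit applied to that field as a whole.

```python
def index_field(texts, max_field_length):
    """Toy model: collect terms from texts in order, stopping at the cap.

    max_field_length == -1 is treated as "no limit", mirroring the
    dspace.cfg comment quoted above. This is an illustration only,
    not how Lucene is actually implemented.
    """
    limit = float("inf") if max_field_length == -1 else max_field_length
    indexed = []
    for text in texts:
        for term in text.lower().split():
            if len(indexed) >= limit:
                # Cap reached: every remaining term is silently dropped.
                return set(indexed)
            indexed.append(term)
    return set(indexed)


# Two hypothetical extracted texts from the same item's TEXT bundle.
first = "alpha " * 10000            # 10,000 terms: fills a 10,000-term cap
second = "needle in the second bitstream"

capped = index_field([first, second], 10000)
unlimited = index_field([first, second], -1)

print("needle" in capped)      # False
print("needle" in unlimited)   # True
```

Under these assumptions, a long first PDF uses up the whole allowance, so nothing from the later bitstreams reaches the index. That would match the symptom George describes, and it is why a re-index is needed after changing the setting.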
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
What version of DSpace are you running? I just tested something completely unrelated this morning, but it involved adding a second document to an Item, running filter-media, and then doing a search to see if the text in the second document was found - it WAS. We are running DSpace 1.7.1, JSPUI.

Sue

Sue Walker-Thornton
(757) 864-2368
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Susan:

I am running DSpace 1.7.1. I think Tim may be right about my config settings (thanks for the suggestions, Tim). I am going to test that out, and I will let everyone know.

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
We also have search.maxfieldlength set to -1.

Sue

Sue Walker-Thornton
(757) 864-2368
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Tim:

You were right! I changed the config file, and now my searches are working for the other bitstreams! Thank you very much! This clears up a problem that I have had for a long time. Now I wonder what other old, and now bad, settings I have!

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
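One possible way to hunt for other stale settings is to compare a long-lived local dspace.cfg against the copy shipped with the installed release. The Python sketch below is only an illustration: the helper names are hypothetical, the sample values are made up, and the parser handles plain `key = value` lines only (it ignores backslash line continuations and `:` separators, which a real properties file may use).

```python
def parse_properties(text):
    """Parse simple 'key = value' lines, skipping blanks and '#' comments."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def differing_settings(old_text, new_text):
    """Return {key: (old_value, new_value)} for keys set differently in both."""
    old = parse_properties(old_text)
    new = parse_properties(new_text)
    return {
        key: (old[key], new[key])
        for key in sorted(old.keys() & new.keys())
        if old[key] != new[key]
    }


# Hypothetical excerpts; real files would be read from disk instead.
local_cfg = """
# carried over from an older install
search.maxfieldlength = 10000
"""
reference_cfg = """
search.maxfieldlength = -1
"""

for key, (old, new) in differing_settings(local_cfg, reference_cfg).items():
    print(f"{key}: local={old} reference={new}")
```

A difference is not necessarily a mistake, since many values are site-specific by design, so the output is best read as a list of settings to review.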