Re: [Dspace-tech] Strange problem with searching
On Wed, Oct 12, 2011 at 20:25, George S Kozak g...@cornell.edu wrote:
> I have tried running index-init and deleting the extracted text and re-running filter-media, but still no luck with the searches for this collection.

Hi,

just to make sure - did you run filter-media before or after index-init/index-update? filter-media creates text files from media, and index-* indexes them. So if you didn't run index-init or index-update after filter-media, the extracted text won't be indexed.

Regards,
~~helix84

--
All the data continuously generated in your IT infrastructure contains a definitive record of customers, application performance, security threats, fraudulent activity and more. Splunk takes this data and makes sense of it. Business sense. IT sense. Common sense. http://p.sf.net/sfu/splunk-d2d-oct
___
DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech
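The ordering described above (extract text first, then index) can be sketched as a small helper that only builds the command lines. This is a rough sketch, not an official script: the install path `/dspace` and the helper name `reindex_commands` are assumptions for illustration, and the launcher is assumed to live at `bin/dspace` under the install directory.

```python
def reindex_commands(dspace_dir, full_rebuild=False):
    """Return the two commands in the order that lets extracted text be indexed."""
    # Assumption: the DSpace command-line launcher lives at <dspace_dir>/bin/dspace.
    launcher = f"{dspace_dir}/bin/dspace"
    index_cmd = "index-init" if full_rebuild else "index-update"
    return [
        [launcher, "filter-media"],  # step 1: create the extracted-text files
        [launcher, index_cmd],       # step 2: index, so the new text is picked up
    ]


if __name__ == "__main__":
    # Only print the commands; running them is left to the administrator.
    for cmd in reindex_commands("/dspace"):
        print(" ".join(cmd))
```

Running the index step before filter-media would leave the newly extracted text out of the index until the next index run.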
Re: [Dspace-tech] Strange problem with searching
Hi, helix84:

Yes, I did run things in the correct order. That's what has stumped me. I can't figure out why these specific records are not searchable while other records are searchable.

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924
Re: [Dspace-tech] Strange problem with searching
On Thu, Oct 13, 2011 at 15:36, George S Kozak g...@cornell.edu wrote:
> Yes, I did run things in the correct order. That's what has stumped me. I can't figure out why these specific records are not searchable while other records are searchable.

I'm not sure how to help you further. Can you check if the text file in the TEXT bundle has READ access for the Anonymous group? The TEXT bundle itself also has READ access for the Anonymous group by default.

Regards,
~~helix84
[Dspace-tech] Strange problem with searching - More and disturbing information!
Hi, Everyone:

After a bit of digging, I have discovered that in any item with multiple PDF bitstreams, only the first bitstream added is searchable. The other bitstreams in the item seem to be ignored by the indexer. I have checked, and the extracted texts are there, so it is not an issue with the filter-media program. We (at Cornell) have many items with multiple PDF bitstreams, and so far all of my testing indicates that only the first bitstream of each item is being indexed by the DSpace search engine.

Is this a known issue? Is there something wrong in my configuration files that may be causing this?

George Kozak
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Hi George,

Hmm, that's a bit odd. It's definitely not a known issue. In fact, looking at the DSIndexer class (the class that creates and updates the Lucene search index), it should be doing what you expect. The 'buildDocumentForItem()' method is the one that takes care of indexing all Item content into a Lucene Document.

https://fisheye3.atlassian.com/browse/~br=trunk/dspace/dspace/trunk/dspace-api/src/main/java/org/dspace/search/DSIndexer.java?hb=true#to1040

Specifically, it should be doing the following:

1. Initialize the Lucene Document for the Item
2. Index all Item Metadata
3. Add in all the various sort options (so you can sort search results)
4. Locate the TEXT Bundle in the Item and index *all* Bitstreams in that Bundle. If you turn on debugging, you should actually see the DSIndexer report *every* Bitstream that it adds to the index.

So, I'm a bit at a loss as to what may be happening. It sounds like your TEXT bundle is getting all the right Bitstreams added (by filter-media). I'm assuming there is only *one* TEXT Bundle, right? (If there are multiple, that may be the issue, but DSpace itself should only be generating one TEXT bundle.)

The only other thing I can think of is that your 'search.maxfieldlength' setting is too small. In your dspace.cfg you should see:

# Maximum number of terms indexed for a single field in Lucene.
# Default is 10,000 words - often not enough for full-text indexing.
# If you change this, you'll need to re-index for the change
# to take effect on previously added items.
# -1 = unlimited (Integer.MAX_VALUE)
search.maxfieldlength = 10000

So, it could be that these PDFs are large, and Lucene just stops indexing content after 10,000 words. You can set this to -1 if you want to disable any word-based limit.

Not sure if that helps or not! :)

- Tim
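To illustrate why a term cap could make only the first PDF searchable, here is a toy model in Python. It is not Lucene or DSIndexer code. It assumes that the extracted text of every bitstream in the TEXT bundle feeds the same full-text field, and that the cap counts terms across that whole field; the function name `index_item` and the sample texts are made up for the example.

```python
def index_item(text_bitstreams, max_field_length=10000):
    """Toy model: all extracted texts feed one field, capped at max_field_length terms."""
    # -1 mirrors the "unlimited" convention from the dspace.cfg comment.
    limit = float("inf") if max_field_length == -1 else max_field_length
    indexed_terms = set()
    count = 0
    for text in text_bitstreams:
        for term in text.lower().split():
            if count >= limit:
                return indexed_terms  # cap reached: everything after this is dropped
            indexed_terms.add(term)
            count += 1
    return indexed_terms


first_pdf = "alpha " * 12000            # 12,000 terms, longer than the cap
second_pdf = "uniqueword in second pdf"

capped = index_item([first_pdf, second_pdf], max_field_length=10000)
unlimited = index_item([first_pdf, second_pdf], max_field_length=-1)
print("uniqueword" in capped)     # False: the first PDF used up the whole budget
print("uniqueword" in unlimited)  # True
```

Under these assumptions, a long first PDF consumes the entire budget, so later bitstreams contribute nothing to the index even though their extracted text exists.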
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
What version of DSpace are you running? I just tested something completely unrelated this morning, but it involved adding a second document to an Item, then running filter-media, then doing a search to see if the text in the second document was found - it WAS. We are running DSpace 1.7.1, JSPUI.

Sue

Sue Walker-Thornton
(757) 864-2368
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Susan:

I am running DSpace 1.7.1...I think Tim may be right about my config settings (thanks for the suggestions, Tim). I am going to test that out and I will let everyone know.

George Kozak
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
We also have search.maxfieldlength set to -1.

Sue
Re: [Dspace-tech] Strange problem with searching - More and disturbing information!
Tim:

You were right! I changed the config file and now my searches are working for the other bitstreams! Thank you very much!! This clears up a problem that I have had for a long time. Now I wonder what other old and now bad settings I have!

George Kozak
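For anyone wanting to check their own setting, the following is a minimal sketch of reading `search.maxfieldlength` from dspace.cfg-style text. It handles only simple `key = value` lines and `#` comments, not the full Java properties syntax. The helper name `read_max_field_length` and the fallback of 10000 when the key is absent are assumptions for illustration, not documented DSpace behavior.

```python
def read_max_field_length(cfg_text, default=10000):
    """Return search.maxfieldlength from dspace.cfg-style text, or default if unset."""
    # Assumption: the fallback of 10000 is illustrative, taken from the config comment.
    value = default
    for line in cfg_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        key, sep, raw = line.partition("=")
        if sep and key.strip() == "search.maxfieldlength":
            value = int(raw.strip())
    return value


sample = """
# -1 = unlimited (Integer.MAX_VALUE)
search.maxfieldlength = -1
"""

limit = read_max_field_length(sample)
print("unlimited" if limit == -1 else f"capped at {limit} terms")
```

As the config comment quoted earlier in the thread notes, a changed value only affects previously added items after a re-index.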