[Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread George S Kozak

Hi, everyone:

After a bit of digging, I have discovered that for any item with multiple 
PDF bitstreams, only the first bitstream added is searchable. The other 
bitstreams in the item seem to be ignored by the indexer. I have checked, 
and the extracted texts are there, so it is not an issue with the 
filter-media program.

We (at Cornell) have many items with multiple PDF bitstreams, and so far 
all of my testing indicates that only the first bitstream of an item is 
being indexed by the DSpace search engine.

Is this a known issue?  Is there something wrong in my configuration files that 
may be causing this?


George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924

DSpace-tech mailing list
DSpace-tech@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dspace-tech

Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread Tim Donohue

Hi George,

Hmm, that's a bit odd. It's definitely not a known issue.

In fact, looking at the DSIndexer class (the class that creates and
updates the Lucene search index), it should be doing what you expect.
The 'buildDocumentForItem()' method is the one that takes care of
indexing all Item content into a Lucene Document.

https://fisheye3.atlassian.com/browse/~br=trunk/dspace/dspace/trunk/dspace-api/src/main/java/org/dspace/search/DSIndexer.java?hb=true#to1040

Specifically, it should be doing the following:
1. Initialize the Lucene Document for the Item
2. Index all Item Metadata
3. Add in all various sort options (so you can sort search results)
4. Locate the TEXT Bundle in the Item and index *all* Bitstreams in
that Bundle.
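
To make the four steps above concrete, here is a toy Python model. It is
not the DSIndexer source: the function name, the dict layout, and the field
names ("default", "sort_...") are assumptions made up for illustration. It
only shows the intended behaviour, namely that every bitstream in the TEXT
bundle ends up in the item's document.

```python
# Toy model of the four steps listed above. This is NOT the DSIndexer
# source: the function name, dict layout, and field names are assumptions.

def build_document_for_item(item):
    doc = {"handle": item["handle"], "fields": []}   # 1. initialize the document
    for name, value in item["metadata"]:             # 2. index all metadata
        doc["fields"].append((name, value))
    for name, value in item["metadata"]:             # 3. add sort fields
        if name in item["sort_options"]:
            doc["fields"].append(("sort_" + name, value.lower()))
    for bundle in item["bundles"]:                   # 4. index the TEXT bundle
        if bundle["name"] != "TEXT":
            continue
        for bitstream in bundle["bitstreams"]:
            # Stand-in for the debug message the indexer would log.
            print("indexing bitstream:", bitstream["name"])
            doc["fields"].append(("default", bitstream["text"]))
    return doc


# Hypothetical item with two extracted-text bitstreams.
sample_item = {
    "handle": "123456789/1",
    "metadata": [("title", "A Report"), ("author", "Doe, J.")],
    "sort_options": ["title"],
    "bundles": [
        {"name": "ORIGINAL", "bitstreams": [{"name": "a.pdf", "text": ""}]},
        {"name": "TEXT", "bitstreams": [
            {"name": "a.pdf.txt", "text": "text of first pdf"},
            {"name": "b.pdf.txt", "text": "text of second pdf"},
        ]},
    ],
}

doc = build_document_for_item(sample_item)
full_text = [value for name, value in doc["fields"] if name == "default"]
```

In this model both extracted texts are added under the same field name,
which is the assumption that matters for the field-length limit discussed
further down.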

If you turn on Debugging you should actually see the DSIndexer report
*every* Bitstream that it adds to the index.

So, I'm a bit at a loss as to what may be happening. It sounds like
your TEXT bundle is getting all the right Bitstreams added (by
filter-media). I'm assuming there is only *one* TEXT Bundle, right? (If
there are several, that may be the issue, though DSpace itself should
only be generating one TEXT bundle.)

The only other thing I can think of is that your 'search.maxfieldlength'
setting is too small. In your dspace.cfg you should see:

# Maximum number of terms indexed for a single field in Lucene.
# Default is 10,000 words - often not enough for full-text indexing.
# If you change this, you'll need to re-index for the change
# to take effect on previously added items.
# -1 = unlimited (Integer.MAX_VALUE)
search.maxfieldlength = 10000

So it is possible that these PDFs are large and Lucene simply stops
indexing content after 10,000 words. You can set this to -1 if you
want to disable any word-based limit.
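
Here is a rough Python simulation of why a term limit could hide the later
bitstreams. It assumes that the text of all bitstreams goes into one shared
field and that the limit counts terms across that whole field, so a long
first PDF can use up the budget before the second one is reached. The
function names are hypothetical, and sys.maxsize is only a stand-in for
Integer.MAX_VALUE; this is not how Lucene is implemented internally.

```python
import sys


def effective_limit(max_field_length):
    # -1 means "unlimited", as described in the dspace.cfg comment.
    return sys.maxsize if max_field_length == -1 else max_field_length


def index_terms(bitstream_texts, max_field_length):
    """Return the set of terms kept when one shared field is capped."""
    budget = effective_limit(max_field_length)
    kept = set()
    for text in bitstream_texts:
        for term in text.lower().split():
            if budget <= 0:
                return kept        # limit reached: everything after is dropped
            kept.add(term)
            budget -= 1
    return kept


first_pdf = "alpha " * 12000               # long first bitstream (12,000 words)
second_pdf = "needle in second pdf"        # short second bitstream

capped = index_terms([first_pdf, second_pdf], 10000)
unlimited = index_terms([first_pdf, second_pdf], -1)
```

With the cap at 10000, the word "needle" from the second bitstream never
reaches the index in this model; with -1 it does. That would match the
symptom of only the first bitstream being searchable.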

Not sure if that helps or not! :)

- Tim

Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread Thornton, Susan M. (LARC-B702)[LITES]

What version of DSpace are you running? I just tested something completely 
unrelated this morning, but it involved adding a second document to an Item, 
then running filter-media, then doing a search to see if the text in the 
second document was found - it WAS.

We are running DSpace 1.7.1 (JSPUI).

Sue


Sue Walker-Thornton
(757) 864-2368


Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread George S Kozak

Susan:

I am running DSpace 1.7.1. I think Tim may be right about my config settings 
(thanks for the suggestions, Tim). I am going to test that out, and I will 
let everyone know.

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924


Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread Thornton, Susan M. (LARC-B702)[LITES]

We also have search.maxfieldlength set to -1.
Sue


Sue Walker-Thornton
(757) 864-2368


Re: [Dspace-tech] Strange problem with searching - More and disturbing information!

2011-10-13 Thread George S Kozak

Tim:

You were right! I changed the config file, and now my searches are working 
for the other bitstreams! Thank you very much!!
This clears up a problem that I have had for a long time. Now I wonder what 
other old and now bad settings I have!

George Kozak
Digital Library Specialist
Cornell University Library Information Technologies (CUL-IT)
501 Olin Library
Cornell University
Ithaca, NY 14853
607-255-8924

