Hi Sue:

Yes, I saw the post, and thanks for these numbers - I just wanted to make sure 
there wasn't some issue of scale that was 'hidden' under the raw item count 
(e.g., that your articles were 10 times bigger on average).
And now the indexing time is at least roughly proportional. I haven't studied 
the behavior of those parameters, so have no specific advice at the moment - 
but I'll keep your values in mind when I look next at indexing code...

Thanks,

Richard

On Jun 21, 2010, at 9:44 PM, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL 
SERVICES COMPANY] wrote:

Hi Richard,
     I don’t know if you saw my subsequent post today, but I ended up changing 
two dspace.cfg parameters and it sped up index-init considerably – it only took 
a day and a half this time.  I’m a bit worried about the impact it’s had on 
our full-text searching.  Since we had a large repository, I had our 
search.max-clauses set at 200,000 and I changed it to 4096 which is twice the 
default.  I also changed search.maxfieldlength from -1 (unlimited) to 10,000 
for the same reason.  What do you think?  See our numbers below.
Thanks a bunch,
Sue

From: Richard Rodgers [mailto:[email protected]]
Sent: Monday, June 21, 2010 1:50 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]; William L Hays
Subject: Re: [Dspace-tech] java.lang.outOfMemory error trying to run index-init

Hi Sue:

I don't have any immediate help, but I'm struck by how long the indexing job is 
taking. I had a comparison done with one of our DSpace 1.6 repositories, which 
is about half the size of yours (71,481 items) and is mostly text-based content 
(which I think yours is also?). On not particularly fast hardware, a complete 
re-index took about 5 hours - not 5 days.
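
For a rough sense of the gap, here is a minimal back-of-envelope sketch. It assumes indexing time scales linearly with item count, which is an assumption rather than something measured in this thread. The function names are hypothetical; the item counts and the 5-hour figure are the ones quoted in these messages.

```python
# Back-of-envelope projection, ASSUMING indexing time grows linearly with
# the number of items (not verified; content size and hardware also matter).

def items_per_hour(items: int, hours: float) -> float:
    """Observed indexing throughput."""
    return items / hours

def projected_hours(items: int, rate: float) -> float:
    """Time to index `items` at a given throughput."""
    return items / rate

# Smaller repository: 71,481 items re-indexed in about 5 hours.
rate = items_per_hour(71_481, 5)

# Larger repository: 140,376 items, projected at the same throughput.
estimate = projected_hours(140_376, rate)

print(f"throughput: about {rate:.0f} items/hour")
print(f"projected full re-index: about {estimate:.1f} hours")
```

Under that assumption the larger repository would be expected to finish in roughly ten hours, so a run of five or six days points to something other than raw item count.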

There may be some subtle limit in the code based on size - so to get started, I 
did a 'profile' of our repo with respect to full-text content (which I am 
assuming accounts for most of the indexing time - but I could be wrong). Here 
is the 'profile' and the queries we used to get it. I'd be interested to see 
what your repo looks like using the same metrics.
[Sue T.] Our numbers to the right of yours:

                                                       Richard        [Sue T.]
count of items:                                         71,481         140,337
count of bitstreams in text extract bundles (TEXT):     89,993         134,215
sum of all file sizes in text extract bundles:   7,695,414,829  12,804,764,306
average size of text extract bitstream:                 85,511          95,405

Queries used:

select count(bs.bitstream_id)
from bundle b, bundle2bitstream b2b, bitstream bs
where b2b.bundle_id = b.bundle_id and b2b.bitstream_id = bs.bitstream_id
and b.name = 'TEXT'


select sum(bs.size_bytes)
from bundle b, bundle2bitstream b2b, bitstream bs
where b2b.bundle_id = b.bundle_id and b2b.bitstream_id = bs.bitstream_id
and b.name = 'TEXT'
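
The two queries above, plus the average, can be gathered in a single pass. The sketch below is only an illustration. The table and column names come from the queries above, but the in-memory SQLite database and its toy rows are stand-ins for a real DSpace database, and `text_profile` is a hypothetical helper name.

```python
import sqlite3

# Toy in-memory stand-in for the three tables the queries touch.
# Only the columns used by the queries are modelled; this is NOT the
# full DSpace schema, and the rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bundle (bundle_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bitstream (bitstream_id INTEGER PRIMARY KEY, size_bytes INTEGER);
    CREATE TABLE bundle2bitstream (bundle_id INTEGER, bitstream_id INTEGER);
    INSERT INTO bundle VALUES (1, 'ORIGINAL'), (2, 'TEXT'), (3, 'TEXT');
    INSERT INTO bitstream VALUES (10, 500000), (11, 80000), (12, 100000);
    INSERT INTO bundle2bitstream VALUES (1, 10), (2, 11), (3, 12);
""")

# Same joins and filter as the queries above, with count, sum and average
# computed together over bitstreams in TEXT bundles.
PROFILE_SQL = """
    SELECT count(bs.bitstream_id), sum(bs.size_bytes), avg(bs.size_bytes)
    FROM bundle b, bundle2bitstream b2b, bitstream bs
    WHERE b2b.bundle_id = b.bundle_id
      AND b2b.bitstream_id = bs.bitstream_id
      AND b.name = 'TEXT'
"""

def text_profile(connection):
    """Return (count, total_bytes, average_bytes) for TEXT bitstreams."""
    return connection.execute(PROFILE_SQL).fetchone()

count, total, average = text_profile(conn)
print(count, total, average)
```

Against a real repository the same SELECT could be run directly in the database's own SQL client; the Python wrapper is only there to make the example self-contained.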

Thanks,

Richard


On Jun 19, 2010, at 7:50 PM, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL 
SERVICES COMPANY] wrote:


We have a large repository, currently with 140,376 Items.  Due to user 
complaints about search results, we recently turned off stemming in our DSpace 
1.5.1 search by commenting out the following line in DSAnalyzer.java:

result = new PorterStemFilter(result);

Of course then we had to run index-init to rebuild the search indexes and we’ve 
been having problems getting the job to finish.  Due to the size of our 
repository, index-init takes about 5 or 6 days to complete and now it’s failed 
twice due to the following error:

An unexpected error has been detected by Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 655360 bytes for GrET in /BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp. Out of swap space?
#
#  Internal Error (allocation.inline.hpp:42), pid=23486, tid=5
#  Error: GrET in /BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp
#
# Java VM: Java HotSpot(TM) Server VM (10.0-b19 mixed mode solaris-sparc)
# An error report file with more information is saved as:
# /dspace/hs_err_pid23486.log
#
# If you would like to submit a bug report, please visit:
#   http://java.sun.com/webapps/bugreport/crash.jsp
#
Abort - core dumped

Can someone please help us with this?  This most recent time index-init failed 
was 4½ days into the index rebuild – after indexing 104,082 out of 140,376 
items and now it looks like if we want an accurate and complete index, we’re 
going to have to start all over again with the rebuild and there’s no guarantee 
it will finish successfully.

Any help would be much appreciated!

I’m attaching the core dump and a copy of our DSRUN to this email.

Thanks in advance,
Sue


Sue Walker-Thornton
NASA Langley Research Center
Integrated Library Systems
Developer, Application & Database Administrator
ConITS Contract ~ NCI Information Systems, Inc.
130 Research Drive
Hampton, VA  23666
Office: (757) 224-4074 ~ Mobile: (757) 506-9903 ~ Fax: (757) 224-4001
email:  [email protected]

<hs_err_pid23486.log><ATT00001.c><ATT00002.c>


_______________________________________________
DSpace-tech mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/dspace-tech