Hi Richard,
I don't know if you saw my follow-up post today, but I ended up changing
two dspace.cfg parameters, and that sped up index-init considerably: it only
took a day and a half this time. I'm a bit worried about the impact the changes
have had on our full-text searching, though. Because we have a large
repository, I had search.max-clauses set to 200,000; I changed it to 4096,
which is twice the default. I also changed search.maxfieldlength from -1
(unlimited) to 10,000 for the same reason. What do you think? Our numbers are
below.
Thanks a bunch,
Sue
From: Richard Rodgers [mailto:[email protected]]
Sent: Monday, June 21, 2010 1:50 PM
To: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Cc: [email protected]; William L Hays
Subject: Re: [Dspace-tech] java.lang.outOfMemory error trying to run index-init
Hi Sue:
I don't have any immediate help, but I'm struck by how long the indexing job is
taking. I had a comparison done with one of our DSpace 1.6 repositories, which
is about half the size of yours (71,481 items) and is mostly text-based content
(which I think yours is also?). On not particularly fast hardware, a complete
re-index took about 5 hours - not 5 days.
There may be some subtle limit in the code based on size - so to get started, I
did a 'profile' of our repo with respect to full-text content (which I am
assuming accounts for most of the indexing time - but I could be wrong). Here
is the 'profile' and the queries we used to get it. I'd be interested to see
what your repo looks like using the same metrics.
[Sue T.] Our numbers are to the right of yours:
Metric                                                    Richard          Sue T.
count of items                                             71,481         140,337
count of bitstreams in text extract bundles (TEXT)         89,993         134,215
sum of all file sizes in text extract bundles       7,695,414,829  12,804,764,306
average size of text extract bitstream                     85,511          95,405
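As a quick sanity check on the figures above, the reported averages can be recomputed from the sums and counts, and the two repositories compared by volume of extracted text. This is only a sketch in Python; the dictionary keys and the function name are made up for illustration, and nothing here touches DSpace itself.

```python
# Profile figures exactly as quoted in this thread.
profiles = {
    "Richard": {"items": 71_481, "text_bitstreams": 89_993, "text_bytes": 7_695_414_829},
    "Sue": {"items": 140_337, "text_bitstreams": 134_215, "text_bytes": 12_804_764_306},
}


def average_extract_size(profile):
    """Average text-extract size in bytes, rounded to the nearest byte."""
    return round(profile["text_bytes"] / profile["text_bitstreams"])


for name, profile in profiles.items():
    print(name, "average extract size:", average_extract_size(profile))

# How much more extracted text the larger repository holds.
text_ratio = profiles["Sue"]["text_bytes"] / profiles["Richard"]["text_bytes"]
print("text volume ratio:", round(text_ratio, 2))
```

By these numbers the larger repository holds roughly 1.7 times as much extracted text, so text volume alone would not seem to explain a re-index taking 5 days rather than 5 hours (about 24 times longer).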
Queries used:
select count(bs.bitstream_id)
from bundle b, bundle2bitstream b2b, bitstream bs
where b2b.bundle_id = b.bundle_id and b2b.bitstream_id = bs.bitstream_id
and b.name = 'TEXT';

select sum(bs.size_bytes)
from bundle b, bundle2bitstream b2b, bitstream bs
where b2b.bundle_id = b.bundle_id and b2b.bitstream_id = bs.bitstream_id
and b.name = 'TEXT';
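For anyone repeating the profile, the count, sum, and average can be fetched in a single pass over the same three tables. The sketch below is an illustration only: it runs against an in-memory SQLite database from Python, with a cut-down schema containing just the columns the queries above use and a few made-up rows. A real DSpace database will have more columns and a different engine, so treat the combined query as something to adapt rather than run as-is.

```python
import sqlite3

# Minimal stand-in for the three tables the queries touch.
# Only the columns used by the queries are modelled; the rows are sample data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE bundle (bundle_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE bitstream (bitstream_id INTEGER PRIMARY KEY, size_bytes INTEGER);
    CREATE TABLE bundle2bitstream (bundle_id INTEGER, bitstream_id INTEGER);
    INSERT INTO bundle VALUES (1, 'ORIGINAL'), (2, 'TEXT'), (3, 'TEXT');
    INSERT INTO bitstream VALUES (10, 500000), (11, 80000), (12, 100000);
    INSERT INTO bundle2bitstream VALUES (1, 10), (2, 11), (3, 12);
""")

# Same joins and filter as the two queries above, written with explicit JOINs
# and returning count, sum, and average together.
PROFILE_SQL = """
    SELECT count(bs.bitstream_id), sum(bs.size_bytes), avg(bs.size_bytes)
    FROM bundle b
    JOIN bundle2bitstream b2b ON b2b.bundle_id = b.bundle_id
    JOIN bitstream bs ON b2b.bitstream_id = bs.bitstream_id
    WHERE b.name = 'TEXT'
"""

text_count, text_bytes, text_avg = conn.execute(PROFILE_SQL).fetchone()
print(text_count, text_bytes, text_avg)
```

Only the two bitstreams in TEXT bundles are counted in the sample data; the ORIGINAL bundle is filtered out by the `b.name = 'TEXT'` condition.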
Thanks,
Richard
On Jun 19, 2010, at 7:50 PM, Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL
SERVICES COMPANY] wrote:
We have a large repository, currently with 140,376 Items. Due to user
complaints about search results, we recently turned off stemming in our DSpace
1.5.1 search by commenting out the following line in DSAnalyzer.java:
result = new PorterStemFilter(result);
Of course then we had to run index-init to rebuild the search indexes and we've
been having problems getting the job to finish. Due to the size of our
repository, index-init takes about 5 or 6 days to complete and now it's failed
twice due to the following error:
An unexpected error has been detected by Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 655360 bytes for GrET in
/BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp. Out of
swap space?
#
# Internal Error (allocation.inline.hpp:42), pid=23486, tid=5
# Error: GrET in
/BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp
#
# Java VM: Java HotSpot(TM) Server VM (10.0-b19 mixed mode solaris-sparc)
# An error report file with more information is saved as:
# /dspace/hs_err_pid23486.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
Abort - core dumped
Can someone please help us with this? The most recent index-init failure came
4½ days into the rebuild, after indexing 104,082 of 140,376 items. It now looks
like we will have to start the rebuild all over again to get an accurate and
complete index, with no guarantee that it will finish successfully.
Any help would be much appreciated!
I'm attaching the core dump and a copy of our DSRUN to this email.
Thanks in advance,
Sue
Sue Walker-Thornton
NASA Langley Research Center
Integrated Library Systems
Developer, Application & Database Administrator
ConITS Contract ~ NCI Information Systems, Inc.
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074 ~ Mobile: (757) 506-9903 ~ Fax: (757) 224-4001
email: [email protected]
<hs_err_pid23486.log><ATT00001.c><ATT00002.c>