I ended up changing some DSpace config parameters and re-running index-init and this time it completed successfully, and in less than half the time it took to run it before. I'm a bit worried about the impact of changing these parameters and would like some feedback on this. Here are the changes I made to dspace.cfg (we are running version 1.5.1):
search.max-clauses: The default is 2048. The documentation says, "Higher values of search.max-clauses will enable prefix searches to work on large repositories". Since we have a large repository, and we want it to be completely full-text searchable, I had ours set at 200,000. I now have ours set at 4096. I'm not exactly sure what the impact of this change is though.

search.maxfieldlength: The default is 10000. The documentation says, "Maximum number of terms indexed for a single field in Lucene. Default is 10,000 words - often not enough for full-text indexing. If you change this, you'll need to re-index for the change to take effect on previously added items. -1 = unlimited (Integer.MAX_VALUE)." Again, since we have a large repository, and we want it to be completely full-text searchable, I had ours set at -1 (unlimited). I now have it set back at the default - 10000.

After I made these changes, I re-ran index-init and, instead of taking 5-6 days to complete, it completed in about a day and a half. It also did NOT get that memory error (below) it got the last 2 times I tried to run it with the old search.max-clauses and search.maxfieldlength settings we were using.

Here are our other search parameters in dspace.cfg:

search.operator: We have ours set at OR.
search.index.?:

##### Fields to Index for Search #####
# DC metadata elements.qualifiers to be indexed for search
# format: - search.index.[number] = [search field]:element.qualifier
#         - * used as wildcard
### changing these will change your search results, ###
### but will NOT automatically change your search displays ###
search.index.1 = author:dc.contributor.author
search.index.2 = corpauthor:dc.contributor.corpAuthor
search.index.3 = corpauthor:dc.contributor.authorAffiliation
search.index.4 = title:dc.title
search.index.5 = titlecontrolkey:dc.identifier.titleControlKey
search.index.6 = accessionnumber:dc.identifier.accessionNumber
search.index.7 = reportnumber:dc.identifier.reportNumber
search.index.8 = subjectkeyword:dc.subject.keywords

webui.browse.index.?:

webui.browse.index.1 = dateissued:item:dateissued:desc
webui.browse.index.2 = author:metadata:dc.contributor.author:text
webui.browse.index.3 = title:item:title
webui.browse.index.4 = subject:metadata:dc.subject.keywords:text
webui.browse.index.5 = dateaccessioned:item:dateaccessioned:desc
webui.browse.index.6 = corpauthor:metadata:dc.contributor.corpAuthor:text

Since the changes I made resulted in index-init completing much more quickly than before, and they seem to have gotten rid of the Memory/Out of Swap space error, I'm wondering what we lost, if anything, in our search results, or whether this should even be a concern for us.

Any suggestions/advice would be appreciated!

Thanks,
Sue

From: Thornton, Susan M. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY] [mailto:[email protected]]
Sent: Saturday, June 19, 2010 7:51 PM
To: [email protected]
Cc: Kimbrough, Glenn W. (LARC-B7)[NCI]; Warren, Douglas Lewis (LARC-B7)[NCI]; Smail, James W. (LARC-B702)[RAYTHEON TECHNICAL SERVICES COMPANY]
Subject: [Dspace-tech] java.lang.outOfMemory error trying to run index-init

We have a large repository, currently with 140,376 Items.
Due to user complaints about search results, we recently turned off stemming in our DSpace 1.5.1 search by commenting out the following line in DSAnalyzer.java:

result = new PorterStemFilter(result);

Of course then we had to run index-init to rebuild the search indexes and we've been having problems getting the job to finish. Due to the size of our repository, index-init takes about 5 or 6 days to complete and now it's failed twice due to the following error:

An unexpected error has been detected by Java Runtime Environment:
#
# java.lang.OutOfMemoryError: requested 655360 bytes for GrET in /BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp. Out of swap space?
#
# Internal Error (allocation.inline.hpp:42), pid=23486, tid=5
# Error: GrET in /BUILD_AREA/jdk6_04/hotspot/src/share/vm/utilities/growableArray.cpp
#
# Java VM: Java HotSpot(TM) Server VM (10.0-b19 mixed mode solaris-sparc)
# An error report file with more information is saved as:
# /dspace/hs_err_pid23486.log
#
# If you would like to submit a bug report, please visit:
# http://java.sun.com/webapps/bugreport/crash.jsp
#
Abort - core dumped

Can someone please help us with this? This most recent time index-init failed was 4½ days into the index rebuild - after indexing 104,082 out of 140,376 items and now it looks like if we want an accurate and complete index, we're going to have to start all over again with the rebuild and there's no guarantee it will finish successfully.

Any help would be much appreciated! I'm attaching the core dump and a copy of our DSRUN to this email.

Thanks in advance,
Sue

Sue Walker-Thornton
NASA Langley Research Center
Integrated Library Systems Developer, Application & Database Administrator
ConITS Contract ~ NCI Information Systems, Inc.
130 Research Drive
Hampton, VA 23666
Office: (757) 224-4074 ~ Mobile: (757) 506-9903 ~ Fax: (757) 224-4001
email: [email protected]
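
As a rough illustration of the search.maxfieldlength and search.max-clauses question raised at the top of this thread, here is a minimal plain-Java sketch. It does not use any Lucene or DSpace classes; the class IndexLimitSketch and its method names are hypothetical, and the behavior it simulates is an assumption based on how the quoted documentation describes the two settings: a per-field cap keeps only the first N terms of a field, and a prefix search needs roughly one clause per distinct indexed term that matches the prefix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

// Plain-Java simulation only; hypothetical names, no Lucene or DSpace API is called.
class IndexLimitSketch {

    // Simulates a per-field term cap: only the first maxFieldLength terms are kept.
    // A negative cap is treated as "unlimited", mirroring the -1 setting.
    static List<String> indexedTerms(List<String> terms, int maxFieldLength) {
        if (maxFieldLength < 0 || terms.size() <= maxFieldLength) {
            return new ArrayList<>(terms);
        }
        return new ArrayList<>(terms.subList(0, maxFieldLength));
    }

    // Simulates prefix expansion: one clause per distinct indexed term
    // that starts with the given prefix.
    static int prefixClauses(List<String> indexed, String prefix) {
        TreeSet<String> distinct = new TreeSet<>();
        for (String term : indexed) {
            if (term.startsWith(prefix)) {
                distinct.add(term);
            }
        }
        return distinct.size();
    }

    // True when a prefix search would need more clauses than the configured limit.
    static boolean exceedsMaxClauses(int clauses, int maxClauses) {
        return clauses > maxClauses;
    }

    public static void main(String[] args) {
        List<String> fullText = Arrays.asList("wind", "tunnel", "test", "of", "wing", "flutter");

        // With a cap of 4, the last two terms never reach the simulated index.
        List<String> capped = indexedTerms(fullText, 4);
        System.out.println("indexed with cap 4: " + capped);
        System.out.println("'flutter' searchable: " + capped.contains("flutter"));

        // With no cap, a "wi" prefix search expands to the distinct terms wind and wing.
        int clauses = prefixClauses(indexedTerms(fullText, -1), "wi");
        System.out.println("clauses for prefix 'wi': " + clauses);
        System.out.println("over a limit of 4096: " + exceedsMaxClauses(clauses, 4096));
    }
}
```

If the real indexer behaves like this sketch, going from -1 back to 10000 would mean that words past the first 10,000 in a long full-text document are no longer findable, which would also explain a smaller and faster index. Lowering search.max-clauses would affect only query time: very broad prefix searches on a large index could fail rather than return results. Both points are worth confirming against the Lucene version bundled with DSpace 1.5.1.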

