[ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Markus Jelsma updated NUTCH-708:
--------------------------------

Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> NutchBean: OOM due to searcher.max.hits and dedup.
> --------------------------------------------------
>
>                 Key: NUTCH-708
>                 URL: https://issues.apache.org/jira/browse/NUTCH-708
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.0.0
>        Environment: Ubuntu Linux, Java 5.
>           Reporter: Aaron Binns
>
> When searching an index we built for the National Archives (this one in
> particular: http://webharvest.gov/collections/congress110th/), we ran into
> an interesting situation.
>
> We were using searcher.max.hits=1000 in order to get faster searches. Since
> our index is sorted, the "best" documents are "at the front", so setting
> searcher.max.hits=1000 gave us a nice trade-off of search quality vs.
> response time.
>
> What I discovered was that with dedup (on site) enabled, we would get into
> a loop where searcher.max.hits capped the raw hits at 1000, and the
> deduplication code would reach the end of those 1000 results while still
> needing more, as it hadn't found enough de-duped results to satisfy the
> query. The first 6 pages of results would be fine, but on page 7 the
> NutchBean would need more than 1000 raw results to produce 60 de-duped
> results.
>
> The code:
>
>     for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
>       // get the next raw hit
>       if (rawHitNum >= hits.getLength()) {
>         // optimize query by prohibiting more matches on some excluded values
>         Query optQuery = (Query)query.clone();
>         for (int i = 0; i < excludedValues.size(); i++) {
>           if (i == MAX_PROHIBITED_TERMS)
>             break;
>           optQuery.addProhibitedTerm(((String)excludedValues.get(i)), dedupField);
>         }
>         numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("re-searching for " + numHitsRaw + " raw hits, query: " + optQuery);
>         }
>         hits = searcher.search(optQuery, numHitsRaw, dedupField, sortField, reverse);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("found " + hits.getTotal() + " raw hits");
>         }
>         rawHitNum = -1;
>         continue;
>       }
>
> The loop's exit condition was never satisfied, because rawHitNum and
> hits.getLength() are both capped by searcher.max.hits (1000) while
> hits.getTotal() is not. Meanwhile, numHitsRaw keeps doubling (rawHitsFactor
> is 2) until it reaches 2^31 or so, at which point, deep down in the search
> library code, an array is allocated using that value as its size and you
> get an OOM.
>
> We worked around the problem by abandoning the use of searcher.max.hits. I
> suppose we could have increased the value, but the index was small enough
> (~10 GB) that disabling searcher.max.hits didn't degrade the response time
> too much.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
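
For illustration, here is a minimal, self-contained sketch of the runaway
doubling described in the report. It is not the actual NutchBean code: the
cappedSearch() helper and the MaxHitsDedupSketch class are hypothetical
stand-ins for a searcher whose result window is clamped to searcher.max.hits,
and the constants simply mirror the values quoted above.

    // Sketch of NUTCH-708: hits.getLength() never exceeds the
    // searcher.max.hits cap, so the re-search branch fires on every pass
    // and numHitsRaw doubles until the (simulated) result-array allocation
    // blows up.
    public class MaxHitsDedupSketch {

      static final int MAX_HITS = 1000;       // searcher.max.hits
      static final int RAW_HITS_FACTOR = 2;   // rawHitsFactor from the report

      // Hypothetical stand-in for searcher.search(...): the index "matches"
      // far more documents than the capped window ever returns.
      static int cappedSearch(int numHitsRaw) {
        // The real library allocates a results array of numHitsRaw slots
        // here; that allocation is where the OutOfMemoryError surfaces.
        if (numHitsRaw < 0) {
          throw new OutOfMemoryError("overflowed array size: " + numHitsRaw);
        }
        return Math.min(numHitsRaw, MAX_HITS); // getLength() capped at 1000
      }

      public static void main(String[] args) {
        try {
          runDedupLoop();
        } catch (OutOfMemoryError e) {
          System.out.println("simulated OOM: " + e.getMessage());
        }
      }

      static void runDedupLoop() {
        int total = 50000;          // hits.getTotal(): uncapped match count
        int numHitsRaw = MAX_HITS;
        int length = cappedSearch(numHitsRaw);

        for (int rawHitNum = 0; rawHitNum < total; rawHitNum++) {
          if (rawHitNum >= length) {
            // Same shape as the NutchBean loop: widen the window, restart.
            numHitsRaw = numHitsRaw * RAW_HITS_FACTOR;
            System.out.println("re-searching for " + numHitsRaw + " raw hits");
            length = cappedSearch(numHitsRaw);  // still returns 1000
            rawHitNum = -1;
            continue;
          }
          // ... dedup this hit; past page 6 it never accumulates enough ...
        }
      }
    }

Run as-is, the sketch re-searches a couple dozen times until the doubled
window wraps past Integer.MAX_VALUE and trips the simulated allocation
failure; in the real searcher, the allocation fails with an OutOfMemoryError
well before the wrap, as soon as the requested array no longer fits on the
heap.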