[ https://issues.apache.org/jira/browse/NUTCH-708?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Markus Jelsma updated NUTCH-708:
--------------------------------


Bulk close of legacy issues:
http://www.lucidimagination.com/search/document/2738eeb014805854/clean_up_open_legacy_issues_in_jira

> NutchBean: OOM due to searcher.max.hits and dedup.
> --------------------------------------------------
>
>                 Key: NUTCH-708
>                 URL: https://issues.apache.org/jira/browse/NUTCH-708
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 1.0.0
>         Environment: Ubuntu Linux, Java 5.
>            Reporter: Aaron Binns
>
> While searching an index we built for the National Archives (this one in
> particular: http://webharvest.gov/collections/congress110th/), we ran into an
> interesting situation.
> We were using searcher.max.hits=1000 in order to get faster searches.  Since
> our index is sorted, the "best" documents are "at the front", so setting
> searcher.max.hits=1000 gave us a nice trade-off of search quality vs.
> response time.
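> For reference, a cap like this is set through the searcher.max.hits property.
> The snippet below is only a sketch of how we had it configured in our
> nutch-site.xml (standard Nutch/Hadoop property layout assumed; the value is
> ours, not a default):
>
>   <property>
>     <name>searcher.max.hits</name>
>     <value>1000</value>
>   </property>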
> What I discovered was that with dedup (on site) enabled, we would get into a
> loop: searcher.max.hits limited the raw hits to 1000, and the deduplication
> code would reach the end of those 1000 results while still needing more, as it
> hadn't yet found enough de-duped results to satisfy the query.
> The first 6 pages of results would be fine, but when we got to page 7, the 
> NutchBean would need more than 1000 raw results in order to get 60 de-duped 
> results.
> The code:
>     for (int rawHitNum = 0; rawHitNum < hits.getTotal(); rawHitNum++) {
>       // get the next raw hit
>       if (rawHitNum >= hits.getLength()) {
>         // optimize query by prohibiting more matches on some excluded values
>         Query optQuery = (Query)query.clone();
>         for (int i = 0; i < excludedValues.size(); i++) {
>           if (i == MAX_PROHIBITED_TERMS)
>             break;
>           optQuery.addProhibitedTerm(((String)excludedValues.get(i)),
>                                      dedupField);
>         }
>         numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("re-searching for "+numHitsRaw+" raw hits, query: "+optQuery);
>         }
>         hits = searcher.search(optQuery, numHitsRaw,
>                                dedupField, sortField, reverse);
>         if (LOG.isInfoEnabled()) {
>           LOG.info("found "+hits.getTotal()+" raw hits");
>         }
>         rawHitNum = -1;
>         continue;
>       }
> The loop's exit condition was never satisfied, since rawHitNum and
> hits.getLength() are both capped by searcher.max.hits (1000).  numHitsRaw kept
> increasing by a factor of 2 (rawHitsFactor) until it reached 2^31 or so; deep
> down in the search library code an array is allocated using that value as its
> size, and you get an OOM.
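> For illustration only, a guard along these lines inside the re-search branch
> would break that cycle (a sketch, not the actual Nutch code; the identifiers
> follow the excerpt above):
>
>         // Remember how many raw hits the previous search returned.
>         int previousLength = hits.getLength();
>         numHitsRaw = (int)(numHitsRaw * rawHitsFactor);
>         hits = searcher.search(optQuery, numHitsRaw,
>                                dedupField, sortField, reverse);
>         if (hits.getLength() <= previousLength) {
>           // searcher.max.hits (or the index itself) is capping the raw hits,
>           // so asking for a larger numHitsRaw cannot help; stop re-searching
>           // instead of doubling numHitsRaw until it overflows and OOMs.
>           break;
>         }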
> We worked around the problem by abandoning the use of searcher.max.hits.  I 
> suppose we could have increased the value, but the index was small enough 
> (~10GB) that disabling searcher.max.hits didn't degrade the response time too 
> much.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
