[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Andrzej Bialecki  updated NUTCH-455:
------------------------------------

    Fix Version/s:     (was: 1.0.0)
                   1.1

> dedup on tokenized fields is faulty
> -----------------------------------
>
>                 Key: NUTCH-455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-455
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>             Fix For: 1.1
>
>         Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252) 
> Nutch uses several index servers, and the search results from these servers
> are merged using a dedup field for deleting duplicates. The values of
> this field are cached by Lucene's FieldCacheImpl. The default is the site
> field, which is indexed and tokenized. However, for a tokenized field (for
> example "url" in Nutch), FieldCacheImpl returns an array of terms rather than
> an array of field values, so dedup'ing becomes faulty. The current FieldCache
> implementation does not respect tokenized fields and, as described above,
> caches only terms.
> So when we search using "url" as the dedup field and a Hit is constructed
> in IndexSearcher, the dedupValue becomes a token of the url (such as "www"
> or "com") rather than the whole url. This prevents using tokenized fields
> as the dedup field.
> I have written a patch for Lucene and attached it at
> http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the
> aforementioned issue with tokenized field caching. However, building such a
> cache for about 1.5M documents takes 20+ seconds. The code in
> IndexSearcher.translateHits() starts with
>   if (dedupField != null)
>     dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> and the cache is built on the first call to search in IndexSearcher.
> Long story short, I have written a patch against IndexSearcher that warms
> up the caches of the wanted fields (configurable) in its constructor. I
> think we should vote for LUCENE-252, and then commit the above patch
> together with the latest version of Lucene.
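
The behaviour described above can be illustrated with a small, self-contained Java sketch. It does not use Lucene and is not the attached patch: termBasedCache is a hypothetical stand-in that fills a per-document array by walking terms in sorted order, which is assumed here to approximate how a term-based field cache ends up with one token per document. WarmSearcher is likewise a hypothetical class that only shows the idea of building the caches for a configured list of fields in the constructor instead of on the first search.

```java
import java.util.*;

class DedupCacheSketch {
    // Hypothetical stand-in for a term-based field cache: terms are visited in
    // sorted order and each term overwrites the slot of every document that
    // contains it, so a tokenized field keeps only its last token per document.
    static String[] termBasedCache(List<String> values, boolean tokenized) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < values.size(); doc++) {
            String v = values.get(doc);
            // Assumption: a simple split on non-alphanumerics plays the analyzer.
            String[] terms = tokenized ? v.split("[^A-Za-z0-9]+") : new String[] { v };
            for (String t : terms) {
                if (t.isEmpty()) continue;
                postings.computeIfAbsent(t, k -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[values.size()];
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) cache[doc] = e.getKey();
        }
        return cache;
    }

    // Hypothetical searcher that warms the caches of configured fields eagerly,
    // so the build cost is paid at construction time, not on the first query.
    static final class WarmSearcher {
        private final Map<String, String[]> caches = new HashMap<>();

        WarmSearcher(Map<String, List<String>> index, Collection<String> warmFields) {
            for (String field : warmFields) {
                List<String> values = index.getOrDefault(field, Collections.emptyList());
                caches.put(field, termBasedCache(values, false));
            }
        }

        boolean isWarm(String field) { return caches.containsKey(field); }
    }

    public static void main(String[] args) {
        List<String> urls = Arrays.asList(
            "http://www.example.com/a", "http://www.example.org/b");
        // Tokenized: both documents collapse to the same token, "www".
        System.out.println(Arrays.toString(termBasedCache(urls, true)));
        // Untokenized: each document keeps its whole url.
        System.out.println(Arrays.toString(termBasedCache(urls, false)));

        Map<String, List<String>> index = new HashMap<>();
        index.put("url", urls);
        WarmSearcher searcher = new WarmSearcher(index, Arrays.asList("url"));
        System.out.println(searcher.isWarm("url"));
    }
}
```

In the tokenized case two different urls end up with the same dedup value, so a merge step that compares these values would wrongly drop one of them as a duplicate.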

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
