dedup on tokenized fields is faulty
-----------------------------------

                 Key: NUTCH-455
                 URL: https://issues.apache.org/jira/browse/NUTCH-455
             Project: Nutch
          Issue Type: Bug
          Components: searcher
    Affects Versions: 0.9.0
            Reporter: Enis Soztutar
             Fix For: 0.9.0


(From LUCENE-252)
Nutch uses several index servers, and the search results from these servers are
merged using a dedup field for deleting duplicates. The values of this field
are cached by Lucene's FieldCacheImpl. The default is the site field, which
is indexed and tokenized. However, for a tokenized field (for example "url" in
Nutch), FieldCacheImpl returns an array of terms rather than an array of field
values, so dedup'ing becomes faulty. The current FieldCache implementation does
not respect tokenized fields and, as described above, caches only terms.

So when we search using "url" as the dedup field and a Hit is constructed in
IndexSearcher, the dedupValue becomes a single token of the url (such as "www"
or "com") rather than the whole url. This prevents using tokenized fields as
the dedup field.
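
To illustrate the failure mode described above, here is a minimal sketch that
models a term-driven field cache without depending on Lucene. It is a
simplified model, not Lucene's actual code: the class TermCacheSketch, its
methods, and the regex-based tokenizer are all hypothetical stand-ins.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model of a field cache that is filled by walking the terms
// of a field, as opposed to reading the original field values.
final class TermCacheSketch {

    // Stand-in for an analyzer: split on anything that is not a letter or digit.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Builds a per-document cache by iterating terms in sorted order and
    // assigning each term to every document that contains it.
    static String[] cacheFromTerms(String[] fieldValues, boolean tokenized) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < fieldValues.length; doc++) {
            List<String> terms = tokenized
                    ? tokenize(fieldValues[doc])
                    : Collections.singletonList(fieldValues[doc]);
            for (String term : terms) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[fieldValues.length];
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                // With several terms per document, later terms overwrite earlier ones.
                cache[doc] = e.getKey();
            }
        }
        return cache;
    }
}
```

In this model, calling cacheFromTerms with the two values
"http://www.example.com/a" and "http://www.example.org/b" and tokenized=true
leaves "www" in both cache slots, so two distinct urls would look like
duplicates. With tokenized=false each slot holds the whole url.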

I have written a patch for Lucene and attached it to
http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the
aforementioned issue with tokenized field caching. However, building such a
cache for about 1.5M documents takes 20+ seconds. The code in
IndexSearcher.translateHits() starts with

if (dedupField != null) 
      dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);

so the cache is built on the first call to search in IndexSearcher.

Long story short, I have written a patch against IndexSearcher which warms up
the caches of the wanted fields (configurable) in the constructor. I think we
should vote for LUCENE-252, and then commit the above patch with the latest
version of Lucene.
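
The warm-up idea can be sketched roughly as follows. This is not the attached
patch: the class WarmedSearcherSketch, its loader function, and the loads
counter are hypothetical, and the loader stands in for an expensive call such
as FieldCache.DEFAULT.getStrings(reader, field).

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

// Hypothetical searcher that pays the cache-building cost at construction
// time instead of on the first search.
final class WarmedSearcherSketch {
    private final Map<String, String[]> fieldCache = new HashMap<>();
    private final Function<String, String[]> loader;
    int loads = 0; // counts how often the expensive loader actually ran

    // warmupFields would come from configuration in a real setup (assumption).
    WarmedSearcherSketch(Function<String, String[]> loader, List<String> warmupFields) {
        this.loader = loader;
        for (String field : warmupFields) {
            getStrings(field); // build the cache eagerly
        }
    }

    // Returns the cached values for a field, loading them only once.
    String[] getStrings(String field) {
        String[] values = fieldCache.get(field);
        if (values == null) {
            loads++;
            values = loader.apply(field);
            fieldCache.put(field, values);
        }
        return values;
    }
}
```

A searcher constructed with "site" in warmupFields has already loaded that
field when the first query arrives; a field that was not listed is still
loaded lazily on first use.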






-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

