[ https://issues.apache.org/jira/browse/NUTCH-455?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12479262 ]
Enis Soztutar commented on NUTCH-455:
-------------------------------------

(from LUCENE-252)

In Nutch we have three options: the first is to disallow deleting duplicates
on tokenized fields (because of FieldCache), the second is to index the
tokenized field twice (once tokenized and once untokenized), and the third is
to use LUCENE-252 and the above patch and warm the cache initially in the
index servers.

I am in favor of the third option. I think first resolving LUCENE-252, and
then proceeding with NUTCH-255, is more sensible.

> dedup on tokenized fields is faulty
> -----------------------------------
>
>                 Key: NUTCH-455
>                 URL: https://issues.apache.org/jira/browse/NUTCH-455
>             Project: Nutch
>          Issue Type: Bug
>          Components: searcher
>    Affects Versions: 0.9.0
>            Reporter: Enis Soztutar
>             Fix For: 0.9.0
>
>         Attachments: IndexSearcherCacheWarm.patch
>
>
> (From LUCENE-252)
> Nutch uses several index servers, and the search results from these servers
> are merged using a dedup field for deleting duplicates. The values of this
> field are cached by Lucene's FieldCacheImpl. The default is the site field,
> which is indexed and tokenized. However, for a tokenized field (for example
> "url" in Nutch), FieldCacheImpl returns an array of terms rather than an
> array of field values, so dedup'ing becomes faulty. The current FieldCache
> implementation does not respect tokenized fields and, as described above,
> caches only terms.
> So when we search using "url" as the dedup field and a Hit is constructed in
> IndexSearcher, the dedupValue becomes a token of the url (such as "www" or
> "com") rather than the whole url. This prevents using tokenized fields as
> the dedup field.
> I have written a patch for Lucene and attached it to
> http://issues.apache.org/jira/browse/LUCENE-252; this patch fixes the
> aforementioned issue with tokenized field caching. However, building such a
> cache for about 1.5M documents takes 20+ secs.
> The code in IndexSearcher.translateHits() starts with
>     if (dedupField != null)
>       dedupValues = FieldCache.DEFAULT.getStrings(reader, dedupField);
> so the cache is built on the first call of search in IndexSearcher.
> Long story short, I have written a patch against IndexSearcher which warms up
> the caches of the wanted (configurable) fields in its constructor. I think we
> should vote for LUCENE-252, and then commit the above patch with the latest
> version of Lucene.

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

_______________________________________________
Nutch-developers mailing list
Nutch-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/nutch-developers
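
To make the failure mode in the issue above concrete, here is a minimal, self-contained Java sketch. It does not use the Lucene or Nutch APIs: DedupCacheSketch, tokenize, termBasedCache, wholeValueCache and WarmSearcher are all hypothetical names, and the model assumes a cache that is filled by walking the field's terms in sorted order and writing each term into the slot of every document that contains it. Under that assumption a tokenized field keeps only the last term per document, so two different urls end up with the same dedup value. The nested WarmSearcher shows the warm-up idea: build the configured caches in the constructor so the first search does not pay for it.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch; none of these names come from Lucene or Nutch.
final class DedupCacheSketch {

    // Stand-in for an analyzer: splits a value into lowercase alphanumeric tokens.
    static List<String> tokenize(String value) {
        List<String> tokens = new ArrayList<>();
        for (String t : value.toLowerCase().split("[^a-z0-9]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Simplified model of a term-driven cache: visit terms in sorted order and
    // store each term in the slot of every document containing it. With several
    // terms per document, the last term visited overwrites the earlier ones.
    static String[] termBasedCache(String[] docs) {
        TreeMap<String, List<Integer>> postings = new TreeMap<>();
        for (int doc = 0; doc < docs.length; doc++) {
            for (String term : tokenize(docs[doc])) {
                postings.computeIfAbsent(term, k -> new ArrayList<>()).add(doc);
            }
        }
        String[] cache = new String[docs.length];
        for (Map.Entry<String, List<Integer>> e : postings.entrySet()) {
            for (int doc : e.getValue()) {
                cache[doc] = e.getKey();
            }
        }
        return cache;
    }

    // What dedup actually needs: one whole, untokenized value per document.
    static String[] wholeValueCache(String[] docs) {
        return docs.clone();
    }

    // Warm-up idea: build the caches for the configured fields in the
    // constructor instead of on the first search.
    static final class WarmSearcher {
        private final Map<String, String[]> index; // field -> per-document values
        private final Map<String, String[]> cache = new HashMap<>();
        int builds = 0; // number of cache builds, kept for illustration only

        // Assumes every field in warmFields exists in the index.
        WarmSearcher(Map<String, String[]> index, List<String> warmFields) {
            this.index = index;
            for (String field : warmFields) {
                getStrings(field);
            }
        }

        String[] getStrings(String field) {
            String[] values = cache.get(field);
            if (values == null) {
                builds++;
                values = termBasedCache(index.get(field));
                cache.put(field, values);
            }
            return values;
        }
    }
}
```

With the two urls http://www.example.com/a and http://www.example.com/b, termBasedCache yields "www" for both documents, so a dedup step keyed on that value would wrongly treat them as duplicates, while wholeValueCache keeps them distinct. A WarmSearcher constructed with "url" as a warm field has already built its cache once by the time the first getStrings("url") call arrives.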