[ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12552935 ]

Joseph Chen commented on NUTCH-579:
-----------------------------------

I changed the db.signature.class property, and this seems to solve the problem 
when I first do a crawl.
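
For reference, a minimal sketch of that override in conf/nutch-site.xml might look 
like the following (assuming TextProfileSignature as the replacement; 
org.apache.nutch.crawl.MD5Signature is the default):

<property>
  <name>db.signature.class</name>
  <value>org.apache.nutch.crawl.TextProfileSignature</value>
  <description>The Signature implementation used to compute page digests
  during fetching/updating.</description>
</property>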

Now I'm seeing a similar problem when I try to merge the results of two crawls. 
I performed two separate crawls using the crawl tool and then wanted to merge 
their results.  Here are the steps I took (rough example commands are sketched 
after the list):

1) Merged the segments from the two crawls
2) Inverted links
3) Merged the crawldb
4) Indexed the segments
5) Deduped the index
6) Merged the indexes
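
Roughly, the commands look like this (the directory names below are placeholders, 
not the paths I actually used):

# 1) merge the segments from both crawls into one output directory
bin/nutch mergesegs crawl-merged/segments crawl1/segments/* crawl2/segments/*

# 2) invert links over the merged segments
bin/nutch invertlinks crawl-merged/linkdb -dir crawl-merged/segments

# 3) merge the two crawldbs
bin/nutch mergedb crawl-merged/crawldb crawl1/crawldb crawl2/crawldb

# 4) index the merged segments
bin/nutch index crawl-merged/indexes crawl-merged/crawldb \
    crawl-merged/linkdb crawl-merged/segments/*

# 5) remove duplicates from the index parts
bin/nutch dedup crawl-merged/indexes

# 6) merge the index parts into one index
bin/nutch merge crawl-merged/index crawl-merged/indexes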

I noticed a problem after running the dedup step.  My original index had about 
8000 documents (corresponding to feed posts), and after the merge and dedup I 
ended up with about half that number (roughly 4000 documents).

Examining the index via Luke shows that I'm back down to one post per feed; each 
remaining document has a unique digest value. 
When I skip the dedup step (step 5), the number of documents is around 17000, 
and examining that index shows multiple posts per feed.
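
For anyone who wants to check this without Luke, here is a quick sketch (assuming 
the Lucene 2.x that ships with Nutch, and a placeholder index path) that counts 
how many documents share each digest value:

import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.document.Document;
import org.apache.lucene.index.IndexReader;

public class DigestCounts {
  public static void main(String[] args) throws Exception {
    // Path to one index part, e.g. crawl-merged/indexes/part-00000 (placeholder)
    IndexReader reader = IndexReader.open(args[0]);
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int i = 0; i < reader.maxDoc(); i++) {
      if (reader.isDeleted(i)) continue;    // skip documents removed by dedup
      Document doc = reader.document(i);
      String digest = doc.get("digest");    // stored digest written at index time
      Integer n = counts.get(digest);
      counts.put(digest, n == null ? 1 : n + 1);
    }
    // One line per distinct digest; duplicates show up as counts greater than 1
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      System.out.println(e.getValue() + "\t" + e.getKey());
    }
    reader.close();
  }
}

On the deduped index every digest should show a count of 1; on the pre-dedup 
index the duplicated digests should stand out.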

I searched for db.signature.class in DeleteDuplicates.java, which is the class 
that gets called when running bin/nutch dedup, but I didn't see any references 
to this property.

Any ideas about this issue?

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason 
> for this is that the digest, which is calculated from the content (or the URL 
> if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the 
> individual feed entries were being indexed properly, but when the dedup 
> step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in FeedIndexingFilter.java by 
> adding the following code to the filter function:
> byte[] signature = MD5Hash.digest(url.toString()).getDigest();
> doc.removeField("digest");
> doc.add(new Field("digest", StringUtil.toHexString(signature), 
> Field.Store.YES, Field.Index.NO));
> This seems to fix the issue as the index now contains the proper number of 
> documents.
> Anyone have any comments on whether this is a good solution or if there is a 
> better solution?

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
