[jira] Issue Comment Edited: (NUTCH-579) Feed plugin only indexes one post per feed due to identical digest

Yury (JIRA) Fri, 09 Jan 2009 11:38:30 -0800

    [ https://issues.apache.org/jira/browse/NUTCH-579?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12662490#action_12662490 ]


yury edited comment on NUTCH-579 at 1/9/09 11:36 AM:
-----------------------------------------------------

Hi!

I have the same problem with the feed parser. I crawl a LiveJournal feed and 
FeedParser parses it. The ParseResult contains all items of the channel, but 
the index contains only the channel header. Joseph's solution unfortunately 
doesn't work: FeedIndexingFilter processes only the channel's header.

> Feed plugin only indexes one post per feed due to identical digest
> ------------------------------------------------------------------
>
>                 Key: NUTCH-579
>                 URL: https://issues.apache.org/jira/browse/NUTCH-579
>             Project: Nutch
>          Issue Type: Bug
>          Components: indexer
>    Affects Versions: 1.0.0
>            Reporter: Joseph Chen
>
> When parsing an RSS feed, only one post will be indexed per feed.  The reason 
> for this is that the digest, which is calculated based on the content (or 
> the URL if the content is null), is always the same for each post in a feed.
> I noticed this when I was examining my Lucene indexes using Luke.  All of the 
> individual feed entries were being indexed properly, but when the dedup 
> step ran, my merged index ended up with only one document.
> As a quick fix, I simply overrode the digest in FeedIndexingFilter.java 
> by adding the following code to the filter function:
>
>     byte[] signature = MD5Hash.digest(url.toString()).getDigest();
>     doc.removeField("digest");
>     doc.add(new Field("digest", StringUtil.toHexString(signature),
>         Field.Store.YES, Field.Index.NO));
>
> This seems to fix the issue, as the index now contains the proper number of 
> documents.
> Does anyone have any comments on whether this is a good solution or if there 
> is a better solution?
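
To illustrate why the dedup step described above collapses a feed into one
document, here is a minimal, self-contained sketch. It does not use the Nutch,
Hadoop, or Lucene classes from the quoted fix; it swaps in the JDK's
java.security.MessageDigest so that it runs on its own. The class name
DigestSketch, the helper names, and the example.org URLs are all hypothetical.
It assumes, as the report states, that every post in a feed gets a digest of
the same feed content, and that dedup keeps one document per distinct digest.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;

// Hypothetical illustration only; not part of Nutch.
final class DigestSketch {

    /** Returns the hex-encoded MD5 of the UTF-8 bytes of the input. */
    static String md5Hex(String input) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] hash = md.digest(input.getBytes(StandardCharsets.UTF_8));
            StringBuilder sb = new StringBuilder(hash.length * 2);
            for (byte b : hash) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    /** Simplified dedup model: one surviving document per distinct digest. */
    static int survivorsAfterDedup(List<String> digests) {
        return new HashSet<>(digests).size();
    }

    public static void main(String[] args) {
        // Assumption: all posts share the digest of the whole feed's content.
        String feedContent = "<rss>whole feed body</rss>";
        List<String> postUrls = List.of(
            "http://example.org/feed/post-1",
            "http://example.org/feed/post-2",
            "http://example.org/feed/post-3");

        List<String> contentDigests = new ArrayList<>();
        List<String> urlDigests = new ArrayList<>();
        for (String url : postUrls) {
            contentDigests.add(md5Hex(feedContent)); // same value for every post
            urlDigests.add(md5Hex(url));             // differs per post URL
        }

        System.out.println(survivorsAfterDedup(contentDigests)); // prints 1
        System.out.println(survivorsAfterDedup(urlDigests));     // prints 3
    }
}
```

Keying the digest on the URL keeps the posts apart in this model, but it also
means two different URLs with identical content would no longer be treated as
duplicates, which may or may not be acceptable for a given crawl.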

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
