[ 
https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357
 ] 

Doğacan Güney commented on NUTCH-443:
-------------------------------------

Well... That's embarrassing. It seems I forgot to include the necessary changes 
to Indexer. Indexer has to read crawl_parse too, so that it can pick up the 
sub-urls' fetch datums. 

That seemed easy (just a couple of lines), but then I realized that there is 
another bug. (In my defense, it was present in Nutch before 443. So the bug 
was already there; I only made it worse. :)

It is a bit difficult to describe, so please bear with me. The problem goes 
like this:

In Fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and 
a LINKED datum to crawl_fetch (let's call this URL foo). ParseSegment then 
parses the empty Content and creates a parse data and an empty parse text. 
After updatedb and one more generate-fetch-parse-updatedb cycle, we have proper 
content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. URL foo will have two sets 
of (fetch datum, parse), one coming from the first segment, the other coming 
from the second segment. Since the first fetch datum is LINKED, this code in 
Indexer.reduce will cause foo to be discarded:

    if (redir != null) {
      // XXX page was redirected - what should we do?
      // XXX discard it for now
      return;
    }

And it doesn't work if we just remove this code. Remember that foo has two sets 
of (fetch datum, parse), and one of the parses contains an empty parse text. 
Since Indexer.reduce effectively picks one of the parses at random, it is 
likely that we will end up with the empty parse text for foo.
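
To make the two failure modes concrete, here is a rough, self-contained sketch. 
It is not Nutch code: ReduceSketch, Status and both methods are hypothetical 
stand-ins, and the values list only imitates what reduce might see for foo when 
two segments are indexed together (a datum status is an enum value, a parse 
text is a String). The second method is just one possible direction, not what 
any of the attached patches does.

```java
import java.util.Arrays;
import java.util.List;

final class ReduceSketch {
    // Hypothetical stand-in for the fetch datum status; not Nutch's CrawlDatum.
    enum Status { LINKED, FETCH_SUCCESS }

    // Imitates the current behaviour: a single LINKED datum among the values
    // makes the whole URL disappear from the index.
    static String indexWithDiscard(List<Object> values) {
        String text = null;
        for (Object v : values) {
            if (v == Status.LINKED) {
                return null; // treated as "page was redirected", discarded
            }
            if (v instanceof String) {
                text = (String) v; // last parse text seen wins
            }
        }
        return text;
    }

    // Assumed alternative: ignore LINKED datums and prefer a non-empty parse
    // text, falling back to an empty one only if nothing better was seen.
    static String indexPreferringFetched(List<Object> values) {
        String fallback = null;
        for (Object v : values) {
            if (v instanceof String) {
                String text = (String) v;
                if (!text.isEmpty()) {
                    return text;
                }
                fallback = text;
            }
        }
        return fallback;
    }

    public static void main(String[] args) {
        // foo as seen across both segments: LINKED + empty parse text from the
        // first one, a successful fetch + real parse text from the second.
        List<Object> foo = Arrays.<Object>asList(
                Status.LINKED, "", Status.FETCH_SUCCESS, "real page text");
        System.out.println(indexWithDiscard(foo));
        System.out.println(indexPreferringFetched(foo));
    }
}
```

With this input the first method drops foo entirely, while the second one 
keeps the parse text from the later segment. Whether "prefer the non-empty 
text" is the right rule for the real Indexer is an open question.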

This is the part that I made worse: since Indexer now has to read crawl_parse, 
it will see a lot of STATUS_LINKED datums (which are written to crawl_parse for 
outlinks) and discard a lot of useful pages in any multi-segment index job.

Sorry if the description is unnecessarily complex.



> allow parsers to return multiple Parse object, this will speed up the rss 
> parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Andrzej Bialecki 
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch, 
> NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch, NUTCH-443-draft-v5.patch, 
> NUTCH-443-draft-v6.patch, NUTCH-443-draft-v7.patch, 
> NUTCH-443.022507.patch.txt, NUTCH-443.02282007-v2.patch, 
> NUTCH-443.02282007.patch, NUTCH-443.08052007.patch, 
> parse-map-core-draft-v1.patch, parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser 
> can return multiple parse objects, that will all be indexed separately. 
> Advantage: no need to fetch all feed-items separately.
> see the discussion at 
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
