[ https://issues.apache.org/jira/browse/NUTCH-443?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12495357 ]
Doğacan Güney commented on NUTCH-443:
-------------------------------------

Well... That's embarrassing. It seems I forgot to include the necessary changes to Indexer. Indexer has to read crawl_parse too, so that it can pick up sub-urls' fetch datums.

So, that seemed easy (just a couple of lines), but then I realized that there is another bug. (Which, in my defense, was present in Nutch before 443. So the bug was there, I only made it worse :)

It is a bit difficult to describe, so please bear with me. The problem goes like this:

In fetcher, if max.redirect is 0, Nutch pushes an empty Content to content and a LINKED datum to crawl_fetch (let's call this url foo). ParseSegment parses the empty Content and creates a parse data and an empty parse text. After updatedb and one more generate-fetch-parse-updatedb cycle, we now have a proper content, parse text and parse data for foo in the new segment.

Now, assume I index both of these segments together. Url foo will have two sets of (fetch datum, parse), one coming from the first segment, the other coming from the second segment. Since the first fetch datum is LINKED, this code in Indexer.reduce will cause foo to be discarded:

    if (redir != null) {
      // XXX page was redirected - what should we do?
      // XXX discard it for now
      return;
    }

And it doesn't work if we just remove this code. Remember that foo has two sets of (fetch datum, parse), and one of the parses contains an empty parse text. Since Indexer.reduce will effectively pick one of the parses at random, it is likely that we will get an empty parse text for url foo.

This is the part that I made worse: since Indexer has to read crawl_parse, it will get a lot of STATUS_LINKED datums (which are written to crawl_parse as outlinks) and discard a lot of useful pages in any multi-segment index job.

Sorry if the description is unnecessarily complex.
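The two failure modes described in the comment can be reproduced in a small standalone sketch. This is not Nutch code. `Status`, `Value`, `reduce` and the two segment lists are hypothetical stand-ins, and "last parse text seen wins" is only an assumed model of how the reducer ends up with one of the parses. It is meant to illustrate the description, not to propose a fix.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class IndexerReduceSketch {

    // Hypothetical stand-in for the datum statuses mentioned in the comment.
    enum Status { FETCH_SUCCESS, LINKED }

    // Hypothetical stand-in for one value handed to reduce() for a URL:
    // either a datum (status set) or a parse text (parseText set).
    static final class Value {
        final Status status;
        final String parseText;

        private Value(Status status, String parseText) {
            this.status = status;
            this.parseText = parseText;
        }

        static Value datum(Status status) { return new Value(status, null); }
        static Value text(String parseText) { return new Value(null, parseText); }
    }

    // Segment 1: redirect was not followed, so foo has a LINKED datum
    // and an empty parse text. Segment 2: foo was fetched properly.
    static final List<Value> SEGMENT_1 =
        Arrays.asList(Value.datum(Status.LINKED), Value.text(""));
    static final List<Value> SEGMENT_2 =
        Arrays.asList(Value.datum(Status.FETCH_SUCCESS), Value.text("real page text"));

    static List<Value> concat(List<Value> a, List<Value> b) {
        List<Value> out = new ArrayList<>(a);
        out.addAll(b);
        return out;
    }

    // Returns the text that would be indexed for the URL, or null if the
    // URL is dropped. With discardOnLinked set, any LINKED datum drops it,
    // mirroring the "redir != null" check quoted in the comment.
    static String reduce(List<Value> values, boolean discardOnLinked) {
        Status redir = null;
        String parseText = null;
        for (Value v : values) {
            if (v.status == Status.LINKED) {
                redir = v.status;
            } else if (v.parseText != null) {
                parseText = v.parseText; // assumption: last one seen wins
            }
        }
        if (discardOnLinked && redir != null) {
            return null;
        }
        return parseText;
    }

    public static void main(String[] args) {
        // With the check, foo is dropped although segment 2 has a good parse.
        System.out.println(reduce(concat(SEGMENT_1, SEGMENT_2), true));
        // Without the check, the result depends on the order of arrival.
        System.out.println(reduce(concat(SEGMENT_1, SEGMENT_2), false));
        System.out.println(reduce(concat(SEGMENT_2, SEGMENT_1), false));
    }
}
```

In this model the first call drops foo entirely, and the last two calls differ only in arrival order: one yields the real text, the other the empty parse text from the first segment.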
> allow parsers to return multiple Parse object, this will speed up the rss parser
> --------------------------------------------------------------------------------
>
>                 Key: NUTCH-443
>                 URL: https://issues.apache.org/jira/browse/NUTCH-443
>             Project: Nutch
>          Issue Type: New Feature
>          Components: fetcher
>    Affects Versions: 0.9.0
>            Reporter: Renaud Richardet
>         Assigned To: Andrzej Bialecki
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-443-draft-v1.patch, NUTCH-443-draft-v2.patch,
>                      NUTCH-443-draft-v3.patch, NUTCH-443-draft-v4.patch,
>                      NUTCH-443-draft-v5.patch, NUTCH-443-draft-v6.patch,
>                      NUTCH-443-draft-v7.patch, NUTCH-443.022507.patch.txt,
>                      NUTCH-443.02282007-v2.patch, NUTCH-443.02282007.patch,
>                      NUTCH-443.08052007.patch, parse-map-core-draft-v1.patch,
>                      parse-map-core-untested.patch, parsers.diff
>
>
> allow Parser#parse to return a Map<String,Parse>. This way, the RSS parser
> can return multiple parse objects, that will all be indexed separately.
> Advantage: no need to fetch all feed-items separately.
> see the discussion at
> http://www.nabble.com/RSS-fecter-and-index-individul-how-can-i-realize-this-function-tf3146271.html

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.