[
https://issues.apache.org/jira/browse/NUTCH-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Markus Jelsma updated NUTCH-1174:
---------------------------------
Attachment: NUTCH-1174-1.5-1.patch
Patch for 1.5 fixing this very obscure issue. It became apparent when
unnormalized null-path URLs reappeared in our webgraph dumps but not in the
CrawlDB; at least not in its dumps.
For some reason, when dumping the CrawlDB, we saw that the URLs actually had
normalized null-paths.
Although this patch fixes the problem for the web graph (much better scores),
I'm not sure how or why the URLs in CrawlDB dumps seem to be normalized.
I'm also not sure about this patch, since it may add duplicates to the Outlinks
object. Are dupes flushed before or after outlinks processing? I would assume
before, as it saves CPU cycles when filtering etc.
> Outlinks are not properly normalized
> ------------------------------------
>
> Key: NUTCH-1174
> URL: https://issues.apache.org/jira/browse/NUTCH-1174
> Project: Nutch
> Issue Type: Bug
> Components: parser
> Affects Versions: 1.3
> Reporter: Markus Jelsma
> Assignee: Markus Jelsma
> Fix For: 1.5
>
> Attachments: NUTCH-1174-1.5-1.patch
>
>
> In ParseOutputFormat, the toUrl is read from the Outlink and processed. This
> String object is filtered, normalized, etc., but the original Outlink object
> is what actually gets added. The normalized URL in toUrl is not written back
> to the Outlink object.
> This issue adds a setUrl method to Outlink, which is used in
> ParseOutputFormat to overwrite the unnormalized URL.
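
To make the description above concrete, here is a minimal sketch of the
pattern it describes. It is not the attached patch and not the actual Nutch
code: the Outlink class below is a simplified stand-in, and normalize(),
collectWithoutWriteBack() and collectWithWriteBack() are hypothetical helpers
invented for this illustration. The only part taken from the issue is the
idea of a setUrl method on Outlink that writes the normalized URL back.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;

public class OutlinkNormalizationSketch {

    /** Simplified stand-in for an outlink; NOT the real Nutch class. */
    static class Outlink {
        private String toUrl;
        private final String anchor;

        Outlink(String toUrl, String anchor) {
            this.toUrl = toUrl;
            this.anchor = anchor;
        }

        String getToUrl() { return toUrl; }
        String getAnchor() { return anchor; }

        // The kind of setter the issue proposes adding.
        void setUrl(String toUrl) { this.toUrl = toUrl; }
    }

    /**
     * Hypothetical normalizer: turns an empty ("null") path into "/".
     * Returns null for URLs that cannot be parsed, so callers can drop them.
     */
    static String normalize(String url) {
        try {
            URI uri = new URI(url);
            String path = uri.getRawPath();
            if (uri.getHost() != null && (path == null || path.isEmpty())) {
                StringBuilder sb = new StringBuilder();
                sb.append(uri.getScheme()).append("://")
                  .append(uri.getRawAuthority()).append('/');
                if (uri.getRawQuery() != null) sb.append('?').append(uri.getRawQuery());
                if (uri.getRawFragment() != null) sb.append('#').append(uri.getRawFragment());
                return sb.toString();
            }
            return url;
        } catch (URISyntaxException e) {
            return null;
        }
    }

    /** The reported behaviour: toUrl is normalized, but the original object is added. */
    static List<Outlink> collectWithoutWriteBack(List<Outlink> outlinks) {
        List<Outlink> kept = new ArrayList<>();
        for (Outlink link : outlinks) {
            String toUrl = normalize(link.getToUrl());
            if (toUrl == null) continue; // rejected URL
            kept.add(link);              // still carries the unnormalized URL
        }
        return kept;
    }

    /** The proposed behaviour: write the normalized URL back before adding. */
    static List<Outlink> collectWithWriteBack(List<Outlink> outlinks) {
        List<Outlink> kept = new ArrayList<>();
        for (Outlink link : outlinks) {
            String toUrl = normalize(link.getToUrl());
            if (toUrl == null) continue; // rejected URL
            link.setUrl(toUrl);          // overwrite the unnormalized URL
            kept.add(link);
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Outlink> links = new ArrayList<>();
        links.add(new Outlink("http://example.com", "home"));
        System.out.println(collectWithoutWriteBack(links).get(0).getToUrl());
        System.out.println(collectWithWriteBack(links).get(0).getToUrl());
    }
}
```

Note that in this sketch two outlinks such as http://example.com and
http://example.com/ end up with the same URL after write-back, which is the
duplicate concern raised in the comment; whether duplicates are removed
elsewhere is not something this sketch answers.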