[ 
https://issues.apache.org/jira/browse/NUTCH-1174?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Markus Jelsma updated NUTCH-1174:
---------------------------------

    Attachment: NUTCH-1174-1.5-1.patch

Patch for 1.5 fixing this very obscure issue. It became apparent when 
unnormalized null-path URLs reappeared in our webgraph dumps but not in the 
CrawlDB; at least not in its dumps.

For some reason, when we dumped the CrawlDB, the URLs in it did have 
normalized null-paths.

Although this patch fixes the problem for the web graph - much better scores - 
I'm not sure how or why the URLs in CrawlDB dumps seem to be normalized.

I'm also not sure about this patch, because it may add duplicates to the 
Outlinks object. Are dupes flushed before or after outlinks processing? I 
would assume before, as that saves CPU cycles when filtering etc.
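
To illustrate the duplicate concern: two outlinks that differ only in their 
null-path become identical once normalized. The sketch below is not Nutch 
code; OutlinkDedupSketch, normalize and normalizeAndDedup are hypothetical 
names, and the normalizer is a toy that only appends "/" to a URL without a 
path. It shows one way duplicates could be dropped after normalization, 
assuming outlinks are compared by URL only.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Nutch.
class OutlinkDedupSketch {

    // Toy normalizer: appends "/" when the URL has no path component.
    // It ignores query strings and fragments, which a real normalizer handles.
    static String normalize(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // not an absolute URL, leave it alone
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        return pathStart < 0 ? url + "/" : url;
    }

    // Normalizes every URL, then drops duplicates while keeping first-seen order.
    static List<String> normalizeAndDedup(List<String> urls) {
        Set<String> seen = new LinkedHashSet<>();
        for (String url : urls) {
            seen.add(normalize(url));
        }
        return new ArrayList<>(seen);
    }
}
```

With the input "http://example.com", "http://example.com/" and 
"http://example.com/a", the first two collapse into a single entry. Whether 
Nutch already removes such duplicates at an earlier or later stage is exactly 
the open question above.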

                
> Outlinks are not properly normalized
> ------------------------------------
>
>                 Key: NUTCH-1174
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1174
>             Project: Nutch
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.3
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.5
>
>         Attachments: NUTCH-1174-1.5-1.patch
>
>
> In ParseOutputFormat, the toUrl is read from the Outlink and processed. This 
> String object is filtered, normalized, etc., but the original Outlink object 
> is what actually gets added. The normalized URL in toUrl is not written back 
> to the Outlink object.
> This issue adds a setUrl method to Outlink, which is used in 
> ParseOutputFormat to overwrite the unnormalized URL.
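
The write-back described in the issue can be sketched roughly as follows. 
This is a minimal, self-contained illustration and not the attached patch: 
SimpleOutlink is a simplified stand-in for Nutch's Outlink, and 
OutlinkWriteBack, process and normalizeNullPath are hypothetical names. Only 
the idea of a setUrl setter that overwrites the unnormalized URL comes from 
the issue text.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

// Simplified stand-in for Nutch's Outlink; not the real class.
class SimpleOutlink {
    private String toUrl;
    private final String anchor;

    SimpleOutlink(String toUrl, String anchor) {
        this.toUrl = toUrl;
        this.anchor = anchor;
    }

    String getToUrl() { return toUrl; }

    String getAnchor() { return anchor; }

    // The kind of setter the issue proposes: lets the caller overwrite the URL.
    void setUrl(String toUrl) { this.toUrl = toUrl; }
}

// Hypothetical stand-in for the outlink loop in ParseOutputFormat.
class OutlinkWriteBack {

    // Toy normalizer (an assumption): appends "/" to a URL with no path.
    // It ignores query strings and fragments.
    static String normalizeNullPath(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url;
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        return pathStart < 0 ? url + "/" : url;
    }

    // Normalizes each outlink's URL and writes the result back into the
    // Outlink object before keeping it. A null result means "rejected".
    static List<SimpleOutlink> process(List<SimpleOutlink> links,
                                       UnaryOperator<String> normalizer) {
        List<SimpleOutlink> kept = new ArrayList<>();
        for (SimpleOutlink link : links) {
            String toUrl = normalizer.apply(link.getToUrl());
            if (toUrl == null) {
                continue; // filtered out, do not keep this outlink
            }
            link.setUrl(toUrl); // without this line the unnormalized URL survives
            kept.add(link);
        }
        return kept;
    }
}
```

Without the setUrl call, the list would still hold the original objects with 
their unnormalized URLs, which matches the behaviour described in the issue.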

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
