[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505.patch

New patch. This is sort of a release candidate: if there are no objections, I 
think this patch can go in as it is.

The biggest change is that ParseData is no longer a Configurable. In the 
current (unpatched) implementation, when a parse data reaches 
ParseOutputFormat, it already contains at most db.max.outlinks.per.page 
outlinks; ParseOutputFormat then filters them and outputs whatever remains.

For example, in a situation where ignoreExternalLinks is true and the first 
hundred links (assuming db.max.outlinks.per.page is 100) are all external, no 
outlinks will be extracted, even if there are internal urls past the 100th 
outlink.

So, now parse data reads all outlinks, and ParseOutputFormat processes them and 
outputs at most db.max.outlinks.per.page outlinks (the resulting parse data 
also contains at most db.max.outlinks.per.page outlinks). I think this is a 
better approach, but it may be a bit slower.
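
To make the difference concrete, here is a minimal, self-contained sketch of 
the two orderings. It is not the code in the patch and it does not use any 
Nutch classes: the class name OutlinkLimitSketch, both method names, the 
example hosts, and the "internal only" predicate are all made up for 
illustration, and outlinks are modelled as plain strings.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

// Hypothetical illustration only; none of these names exist in Nutch.
public class OutlinkLimitSketch {

    // Old ordering: cut the list down to the limit first, then filter.
    // Links past the limit are never looked at.
    static List<String> truncateThenFilter(List<String> links, int max,
                                           Predicate<String> keep) {
        int limit = Math.max(0, Math.min(max, links.size()));
        List<String> out = new ArrayList<>();
        for (String link : links.subList(0, limit)) {
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    // New ordering: look at every link, keep the ones that pass the filter,
    // and stop once the limit has been reached.
    static List<String> filterThenTruncate(List<String> links, int max,
                                           Predicate<String> keep) {
        List<String> out = new ArrayList<>();
        for (String link : links) {
            if (out.size() >= max) {
                break;
            }
            if (keep.test(link)) {
                out.add(link);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // 100 external links followed by 2 internal ones.
        List<String> links = new ArrayList<>();
        for (int i = 0; i < 100; i++) {
            links.add("http://external.example/" + i);
        }
        links.add("http://internal.example/a");
        links.add("http://internal.example/b");

        // Stand-in for ignoreExternalLinks = true.
        Predicate<String> internalOnly =
            u -> u.startsWith("http://internal.example/");

        System.out.println(truncateThenFilter(links, 100, internalOnly).size()); // prints 0
        System.out.println(filterThenTruncate(links, 100, internalOnly).size()); // prints 2
    }
}
```

The second method has to scan past the limit whenever links are filtered 
out, which is where the "a bit slower" comes from.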

Besides this change, the UrlValidator code is cleaned up and moved into the 
org.apache.nutch.net package. Also, outlinks are no longer normalized in 
ParseOutputFormat, since they are already normalized in the Outlink 
constructor (Outlink.Outlink). There is no point in normalizing them twice.

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Priority: Minor
>         Attachments: NUTCH-505.patch, NUTCH-505_draft.patch, 
> NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation 
> system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
