[ 
https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505-v2.patch

After my last commit, I read that Sun's java.util.regex implementation is
actually faster than jakarta-oro, so I changed UrlValidator to use
java.util.regex instead. I ran some simple tests, and java.util.regex does
seem to be faster. I also added some basic optimizations to ParseOutputFormat
(initialCapacity arguments for the ArrayLists, to reduce the number of
allocations).
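
For readers without the patch at hand, here is a minimal sketch of the two
ideas above: a java.util.regex Pattern compiled once and reused, and an
ArrayList created with an initialCapacity. It is not the code from
NUTCH-505-v2.patch; the class name, method names, and the (deliberately loose)
pattern are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical illustration; not the UrlValidator or ParseOutputFormat code.
class OutlinkFilterSketch {

    // Compiled once and reused. Pattern instances are immutable and
    // thread-safe, so a static final field is safe to share.
    // Assumed pattern: scheme, host, optional port, optional path that
    // contains no whitespace, angle brackets, or square brackets.
    private static final Pattern URL_PATTERN = Pattern.compile(
        "^(https?|ftp)://([A-Za-z0-9.-]+)(:\\d+)?(/[^\\s<>\\[\\]]*)?$");

    static boolean looksValid(String url) {
        return url != null && URL_PATTERN.matcher(url).matches();
    }

    static List<String> keepValid(String[] candidates) {
        // Sizing the list up front avoids repeated growth of the backing
        // array when the upper bound on the element count is known.
        List<String> kept = new ArrayList<>(candidates.length);
        for (String candidate : candidates) {
            if (looksValid(candidate)) {
                kept.add(candidate);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        String[] outlinks = {
            "http://www.example.com/index.html",
            "http://www.variety.com/</div></a>"
        };
        System.out.println(keepValid(outlinks));
    }
}
```

Under this assumed pattern the second outlink is dropped because its path
contains angle brackets; a real validator would need considerably stricter
rules.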

Is it necessary to reopen this issue or open another one for this? I think
this change is simple enough to commit without opening a separate issue, but
feel free to disagree.

Also, I realized that UrlValidator considers
http://www.iiit.net/images/CCCCCC_line_br[1].gif invalid, even though Firefox
will display the gif (Firefox escapes the path, then fetches the escaped URL).
This doesn't seem to be a problem right now, since Nutch can't fetch these URLs
anyway, but we may consider adding some sort of smart escaping later.
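
As a rough idea of what such escaping could look like, the sketch below
percent-encodes square brackets and spaces that appear after the authority
part of a URL. This is only an assumption about one possible approach: it is
not Firefox's actual algorithm, it is not part of the patch, and the class and
method names are hypothetical. It also does not distinguish the path from a
query or fragment.

```java
// Hypothetical illustration of escaping before validation or fetching.
class PathEscapeSketch {

    static String escapePath(String url) {
        int schemeEnd = url.indexOf("://");
        if (schemeEnd < 0) {
            return url; // no scheme separator: leave untouched
        }
        int pathStart = url.indexOf('/', schemeEnd + 3);
        if (pathStart < 0) {
            return url; // no path: nothing to escape
        }
        StringBuilder escaped = new StringBuilder(url.length() + 8);
        // Copy scheme and authority unchanged (brackets there may be
        // legitimate, for example in an IPv6 literal).
        escaped.append(url, 0, pathStart);
        for (int i = pathStart; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == '[' || c == ']' || c == ' ') {
                // Percent-encode as '%' followed by two uppercase hex digits.
                escaped.append('%')
                       .append(Integer.toHexString(c).toUpperCase());
            } else {
                escaped.append(c);
            }
        }
        return escaped.toString();
    }

    public static void main(String[] args) {
        System.out.println(
            escapePath("http://www.iiit.net/images/CCCCCC_line_br[1].gif"));
        // prints http://www.iiit.net/images/CCCCCC_line_br%5B1%5D.gif
    }
}
```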

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch, 
> NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation 
> system that tests these urls and filters out garbage.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.