[ https://issues.apache.org/jira/browse/NUTCH-505?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Doğacan Güney updated NUTCH-505:
--------------------------------

    Attachment: NUTCH-505-v2.patch

After my last commit, I read that Sun's java.util.regex implementation is actually faster than jakarta-oro, so I changed UrlValidator to use java.util.regex instead. I ran some simple tests, and java.util.regex really does seem to be faster.

I also added some basic optimizations to ParseOutputFormat (passing initialCapacity arguments to the ArrayList constructors to reduce the number of allocations).

Is it necessary to reopen this issue or open another one for this? I think this change is simple enough to commit without opening a separate issue, but feel free to disagree.

Also, I realized that UrlValidator considers http://www.iiit.net/images/CCCCCC_line_br[1].gif invalid, even though Firefox will display the GIF (Firefox escapes the path and then fetches the escaped URL). This doesn't seem to be a problem right now, since Nutch can't fetch these URLs anyway, but we may consider adding some sort of smart escaping later.

> Outlink urls should be validated
> --------------------------------
>
>                 Key: NUTCH-505
>                 URL: https://issues.apache.org/jira/browse/NUTCH-505
>             Project: Nutch
>          Issue Type: Improvement
>            Reporter: Doğacan Güney
>            Assignee: Doğacan Güney
>            Priority: Minor
>             Fix For: 1.0.0
>
>         Attachments: NUTCH-505-v2.patch, NUTCH-505.patch, NUTCH-505.patch, NUTCH-505_draft.patch, NUTCH-505_draft_v2.patch
>
>
> See discussion here:
> http://www.nabble.com/fetching-http%3A--www.variety.com-%3C-div%3E%3C-a%3E-tf3961692.html
> Parse plugins may extract garbage urls from pages. We need a url validation system that tests these urls and filters out garbage.
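
For illustration, here is a rough sketch of the three ideas in the comment above: a precompiled java.util.regex pattern for path checks, a pre-sized ArrayList, and percent-escaping of a path such as the one in the iiit.net URL. It is not the code in NUTCH-505-v2.patch. The class name OutlinkUrlSketch, its method names, and the simplified path pattern are all assumptions made up for this example, and the escaping step relies on the quoting done by java.net.URI's multi-argument constructor rather than on anything in Nutch.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

// Hypothetical helper for illustration only; not part of Nutch.
class OutlinkUrlSketch {

    // Compiled once and reused: Pattern objects are immutable and thread-safe.
    // The character class is a simplified assumption, not UrlValidator's real rule.
    private static final Pattern PATH_PATTERN =
        Pattern.compile("^(/[-\\w:@&?=+,.!/~*'%$_;]*)?$");

    // True if the path contains only characters allowed by the pattern above.
    public static boolean isValidPath(String path) {
        return PATH_PATTERN.matcher(path).matches();
    }

    // Percent-escapes characters that are illegal in a URI path (e.g. '[' and ']')
    // by letting java.net.URI's multi-argument constructor quote them.
    public static String escapePath(String scheme, String host, String path) {
        try {
            return new URI(scheme, host, path, null).toASCIIString();
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build URI for path: " + path, e);
        }
    }

    // Keeps only the valid paths. The list is created with an initial capacity
    // equal to the input size, so it never has to grow while being filled.
    public static List<String> keepValid(String[] paths) {
        List<String> valid = new ArrayList<>(paths.length);
        for (String p : paths) {
            if (isValidPath(p)) {
                valid.add(p);
            }
        }
        return valid;
    }

    public static void main(String[] args) {
        String raw = "/images/CCCCCC_line_br[1].gif";
        System.out.println(isValidPath(raw));
        System.out.println(escapePath("http", "www.iiit.net", raw));
    }
}
```

With this sketch, the raw path is rejected because of the square brackets, while the escaped form (brackets turned into %5B and %5D) passes the same pattern. Whether Nutch should escape first and validate afterwards is exactly the open "smart escaping" question and is not settled here.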