[ https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12847071#action_12847071 ]
Andrzej Bialecki commented on NUTCH-800:
-----------------------------------------

I'm puzzled by your problem description. Is Nutch affected by potentially malicious URL data? URL form encoding is just a transport encoding; it doesn't make a URL inherently safe (or unsafe).

> Generator builds a URL list that is not encoded
> -----------------------------------------------
>
> Key: NUTCH-800
> URL: https://issues.apache.org/jira/browse/NUTCH-800
> Project: Nutch
> Issue Type: Bug
> Components: generator
> Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 1.0.0, 1.1
> Reporter: Jesse Campbell
>
> The URL string that is grabbed by the generator when creating the fetch list
> does not get encoded, could potentially allow unsafe execution, and breaks
> reading improperly encoded URLs from the scraped pages.
> Since we a) cannot guarantee that any site we scrape is not malicious, and b)
> likely do not have control over all content providers, we are currently
> forced to use a regex normalizer to perform the same function as a built-in
> Java class (it would be unsafe to leave alone).
> A quick solution would be to update Generator.java to utilize the
> java.net.URLEncoder static class:
> line 187:
> old: String urlString = url.toString();
> new: String urlString = URLEncoder.encode(url.toString(),"UTF-8");
> line 192:
> old: u = new URL(url.toString());
> new: u = new URL(urlString);
> The use of URLEncoder.encode could also be at the updatedb stage.
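
For reference, here is a minimal standalone sketch of how the proposed `URLEncoder.encode(url.toString(), "UTF-8")` call behaves when applied to a complete URL string. It is not Nutch code: the class name `EncodeDemo`, the helpers `encodeWhole` and `parses`, and the sample URL are all hypothetical and exist only for illustration. `URLEncoder` performs form encoding (`application/x-www-form-urlencoded`), so it also escapes the `:`, `/`, `?` and `=` delimiters.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical demo class; not part of Nutch.
class EncodeDemo {

    // Mirrors the proposed change at line 187: form-encode the whole URL string.
    static String encodeWhole(String url) {
        try {
            return URLEncoder.encode(url, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a charset every Java platform is required to support.
            throw new IllegalStateException(e);
        }
    }

    // Mirrors the proposed change at line 192: try to build a URL from the string.
    static boolean parses(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.com/a b?q=x y"; // assumed sample input
        String encoded = encodeWhole(raw);
        System.out.println(encoded);
        System.out.println("raw parses:     " + parses(raw));
        System.out.println("encoded parses: " + parses(encoded));
    }
}
```

With this sample input the encoded string is `http%3A%2F%2Fexample.com%2Fa+b%3Fq%3Dx+y`. It no longer contains a `:` separating the scheme, so `new URL(urlString)` throws `MalformedURLException`, while the raw string is accepted. Under these assumptions, encoding the whole string is a different operation from escaping only the unsafe characters within individual URL components.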