[ https://issues.apache.org/jira/browse/NUTCH-800?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12851089#action_12851089 ]
Jesse Campbell commented on NUTCH-800:
--------------------------------------

As it is right now, badly encoded URLs cause the crawler to break with exceptions. This tells me that it is not parsing the URL string properly, which makes me question whether there *could* be code injection.

Where I work, we try to be defensive: anything that comes from an outside source (in this case, URLs either entered by the user in a text file or scraped from a website) should be encoded so that code injection isn't possible, or is at least harder. I realize we're running Java and not JS, so it would not be quite as simple as dropping in an alert() call.

I also want it fixed because I don't really like the idea of using a regex normalizer to fix URLs with spaces in them. Regex implementations are also known to have had vulnerabilities in many languages.

> Generator builds a URL list that is not encoded
> -----------------------------------------------
>
>                 Key: NUTCH-800
>                 URL: https://issues.apache.org/jira/browse/NUTCH-800
>             Project: Nutch
>          Issue Type: Bug
>          Components: generator
>    Affects Versions: 0.6, 0.7, 0.7.1, 0.7.2, 0.8, 0.8.1, 0.8.2, 0.7.3, 0.9.0, 1.0.0, 1.1
>            Reporter: Jesse Campbell
>
> The URL string that is grabbed by the generator when creating the fetch list
> does not get encoded, could potentially allow unsafe execution, and breaks
> reading improperly encoded URLs from the scraped pages.
> Since we a) cannot guarantee that any site we scrape is not malicious, and b)
> likely do not have control over all content providers, we are currently
> forced to use a regex normalizer to perform the same function as a built-in
> Java class (it would be unsafe to leave the URLs alone).
>
> A quick solution would be to update Generator.java to use the
> java.net.URLEncoder utility class:
>
> line 187:
>   old: String urlString = url.toString();
>   new: String urlString = URLEncoder.encode(url.toString(), "UTF-8");
> line 192:
>   old: u = new URL(url.toString());
>   new: u = new URL(urlString);
>
> The call to URLEncoder.encode could also be made at the updatedb stage.
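
To make the proposal concrete, here is a minimal, standalone sketch of what URLEncoder.encode does to a complete URL string, next to one possible per-component alternative built on java.net.URI. It is not taken from Generator.java: the class name UrlEncodingSketch, its helper methods, and the example.com URL are all hypothetical. One thing it suggests is that encoding the whole string also escapes ':' and '/', so the result may not be accepted by new URL(...) without further handling.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical demo class; none of these names come from Nutch.
class UrlEncodingSketch {

    // Form-encodes the entire string, as in the proposed change to line 187.
    static String encodeWhole(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    // Assumed alternative: parse leniently with java.net.URL, then rebuild
    // through the multi-argument URI constructor, which quotes illegal
    // characters in each component. Caveat: that constructor always quotes
    // '%', so input that is already percent-encoded gets double-encoded.
    static String encodeComponents(String raw) {
        try {
            URL u = new URL(raw);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot encode: " + raw, e);
        }
    }

    // Reports whether java.net.URL accepts the string.
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.com/some page.html?q=a b";

        String whole = encodeWhole(raw);
        System.out.println(whole);
        // prints http%3A%2F%2Fexample.com%2Fsome+page.html%3Fq%3Da+b

        System.out.println(parsesAsUrl(whole));
        // prints false (the scheme separator ':' has been escaped)

        System.out.println(encodeComponents(raw));
        // prints http://example.com/some%20page.html?q=a%20b
    }
}
```

If the per-component approach were adopted, the double-encoding caveat would need a decision about whether scraped URLs are expected to arrive already encoded.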