Generator builds a URL list that is not encoded
-----------------------------------------------
Key: NUTCH-800
URL: https://issues.apache.org/jira/browse/NUTCH-800
Project: Nutch
Issue Type: Bug
Components: generator
Affects Versions: 1.0.0, 0.9.0, 0.8.1, 0.8, 0.7.2, 0.7.1, 0.7, 0.6, 0.8.2,
0.7.3, 1.1
Reporter: Jesse Campbell
The URL string that the generator reads when creating the fetch list is not
encoded. This could potentially allow unsafe execution, and it breaks the
reading of improperly encoded URLs from the scraped pages.
Since we a) cannot guarantee that any site we scrape is not malicious, and b)
likely do not have control over all content providers, we are currently forced
to use a regex normalizer to perform the same function as a built-in Java class
(it would be unsafe to leave the URLs as they are).
A quick solution would be to update Generator.java to use the static encode
method of the java.net.URLEncoder class:
line 187:
old: String urlString = url.toString();
new: String urlString = URLEncoder.encode(url.toString(),"UTF-8");
line 192:
old: u = new URL(url.toString());
new: u = new URL(urlString);
Alternatively, URLEncoder.encode could be applied at the updatedb stage.
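
One point worth checking before applying the change as written: URLEncoder is
meant for form data, so when it is given a complete URL it also escapes the
':' and '/' delimiters, and the result may no longer parse in new URL(...).
The sketch below is only an illustration, not code from Generator.java. The
class UrlEncodingSketch and its two methods are hypothetical names. It compares
whole-string encoding with encoding each component through the multi-argument
java.net.URI constructor. That constructor always quotes '%', so input that is
already percent-encoded would be encoded a second time.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class for illustration; not part of Nutch.
class UrlEncodingSketch {

    // Encodes the entire string as form data, including the scheme
    // separator and slashes.
    static String encodeWhole(String raw) throws UnsupportedEncodingException {
        return URLEncoder.encode(raw, "UTF-8");
    }

    // Splits the URL into components and lets java.net.URI quote the
    // characters that are illegal within each component.
    static String encodeByComponent(String raw)
            throws MalformedURLException, URISyntaxException {
        URL u = new URL(raw); // java.net.URL tolerates spaces when parsing
        URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                u.getPort(), u.getPath(), u.getQuery(), u.getRef());
        return uri.toASCIIString(); // also escapes non-ASCII characters
    }

    public static void main(String[] args) throws Exception {
        String raw = "http://example.com/a b?q=x y";
        // Prints http%3A%2F%2Fexample.com%2Fa+b%3Fq%3Dx+y
        System.out.println(encodeWhole(raw));
        // Prints http://example.com/a%20b?q=x%20y
        System.out.println(encodeByComponent(raw));
    }
}
```

If the component approach were adopted, the returned string could be passed to
new URL(...) in place of url.toString(); whether that suits the generator or
the updatedb stage better is left open here.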