Generator builds a URL list that is not encoded
-----------------------------------------------

                 Key: NUTCH-800
                 URL: https://issues.apache.org/jira/browse/NUTCH-800
             Project: Nutch
          Issue Type: Bug
          Components: generator
    Affects Versions: 1.0.0, 0.9.0, 0.8.1, 0.8, 0.7.2, 0.7.1, 0.7, 0.6, 0.8.2, 
0.7.3, 1.1
            Reporter: Jesse Campbell


The URL string that the generator reads when creating the fetch list is not 
encoded. This could potentially allow unsafe execution, and it breaks the 
handling of improperly encoded URLs found in the scraped pages.
Since we a) cannot guarantee that any site we scrape is not malicious, and b) 
likely do not have control over all content providers, we are currently forced 
to use a regex normalizer to perform the same function as a built-in Java class 
(it would be unsafe to leave the URLs alone).

A quick solution would be to update Generator.java to use the static 
encode method of java.net.URLEncoder:

line 187: 
old: String urlString = url.toString();
new: String urlString = URLEncoder.encode(url.toString(),"UTF-8");

line 192:
old: u = new URL(url.toString());
new: u = new URL(urlString);


Alternatively, URLEncoder.encode could be applied at the updatedb stage.
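
Before applying the change, it may be worth checking what URLEncoder.encode 
returns for a complete URL. URLEncoder is meant for form-encoded values, so it 
also escapes ':' and '/', and the result may no longer parse with new URL(...). 
The sketch below is only an illustration, not the Generator code: the class 
UrlEncodingSketch and its methods are hypothetical names, and the 
component-wise variant built on java.net.URI is an assumption about one 
possible alternative, not something the project has adopted.

```java
import java.io.UnsupportedEncodingException;
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;
import java.net.URLEncoder;

// Hypothetical helper class, for illustration only.
final class UrlEncodingSketch {

    // Encodes the entire string as a form value, as in the proposed change.
    static String encodeWhole(String raw) {
        try {
            return URLEncoder.encode(raw, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    // Assumed alternative: split the URL and let the multi-argument URI
    // constructor quote illegal characters in each component.
    // Caveat: this constructor always quotes '%', so input that is already
    // percent-encoded would be encoded a second time.
    static String encodeComponents(String raw) {
        try {
            URL u = new URL(raw);
            URI uri = new URI(u.getProtocol(), u.getUserInfo(), u.getHost(),
                    u.getPort(), u.getPath(), u.getQuery(), u.getRef());
            return uri.toASCIIString();
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException(e);
        }
    }

    // Returns true if java.net.URL accepts the string.
    static boolean parsesAsUrl(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        String raw = "http://example.com/a b?q=x y";
        String whole = encodeWhole(raw);
        String parts = encodeComponents(raw);
        System.out.println(whole + " parses: " + parsesAsUrl(whole));
        System.out.println(parts + " parses: " + parsesAsUrl(parts));
    }
}
```

If the whole-string result does not parse, the change at line 192 would throw 
MalformedURLException for every entry, so the encoding step would probably 
need to work on the path and query rather than on the full string.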

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.