[ 
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637563#comment-14637563
 ] 

Sebastian Nagel commented on NUTCH-2064:
----------------------------------------

Hi Markus, why not define the range(s) of characters that can safely be 
unescaped by a positive statement, as in 
[RFC 3986|https://tools.ietf.org/html/rfc3986#section-2.3]:
{quote}
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and 
%61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or 
tilde (%7E) should not be created by URI producers and, when found in a URI, 
should be decoded to their corresponding unreserved characters by URI 
normalizers.
{quote}
It's not just & and / that must stay escaped: other reserved characters must 
too, e.g. the plus sign in [http://google.com/search?q=c%2B%2B]. See also 
[Percent-encoding|https://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI]
 in Wikipedia.
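
To illustrate the positive-list idea, here is a minimal sketch, not the actual patch. It decodes a %XX escape only when the octet is one of the unreserved characters listed in the RFC quote, and leaves every other escape (%2B, %26, %2F, ...) untouched. The class and method names are hypothetical, and it does not handle non-ASCII input, which the issue itself is about.

```java
// Hypothetical helper, not part of Nutch: unescape only RFC 3986 "unreserved" characters.
final class UnreservedUnescaper {

  /** True for ALPHA / DIGIT / "-" / "." / "_" / "~" (RFC 3986, unreserved). */
  static boolean isUnreserved(int c) {
    return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
  }

  /** Value of an ASCII hex digit, or -1 if the character is not one. */
  static int hex(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  }

  /** Decodes %XX only where XX encodes an unreserved character; all else is copied as-is. */
  static String unescapeUnreserved(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
      char ch = s.charAt(i);
      // A complete escape needs two more characters after the '%'.
      if (ch == '%' && i + 2 < s.length()) {
        int hi = hex(s.charAt(i + 1));
        int lo = hex(s.charAt(i + 2));
        if (hi >= 0 && lo >= 0) {
          int c = (hi << 4) | lo;
          if (isUnreserved(c)) {
            sb.append((char) c);
            i += 3;
            continue;
          }
        }
      }
      // Reserved, non-ASCII, or malformed escapes fall through unchanged.
      sb.append(ch);
      i++;
    }
    return sb.toString();
  }
}
```

With this, the c%2B%2B query above is left alone, while an illustrative path such as /%7Euser/a%2Dz would become /~user/a-z.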

> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
>                 Key: NUTCH-2064
>                 URL: https://issues.apache.org/jira/browse/NUTCH-2064
>             Project: Nutch
>          Issue Type: Bug
>    Affects Versions: 1.10
>            Reporter: Markus Jelsma
>             Fix For: 1.11
>
>         Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
