[
https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14637563#comment-14637563
]
Sebastian Nagel commented on NUTCH-2064:
----------------------------------------
Hi Markus, why not define the range(s) of characters that can be safely
unescaped with a positive statement, as in
[RFC3986|https://tools.ietf.org/html/rfc3986#section-2.3]:
{quote}
For consistency, percent-encoded octets in the ranges of ALPHA (%41-%5A and
%61-%7A), DIGIT (%30-%39), hyphen (%2D), period (%2E), underscore (%5F), or
tilde (%7E) should not be created by URI producers and, when found in a URI,
should be decoded to their corresponding unreserved characters by URI
normalizers.
{quote}
This affects more than just & and /: for example, an escaped plus sign as in
[http://google.com/search?q=c%2B%2B] must also stay encoded. See also
[Percent-encoding|https://en.wikipedia.org/wiki/Percent-encoding#Percent-encoding_in_a_URI]
in Wikipedia.
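To illustrate the positive statement, here is a minimal sketch of a whitelist-based unescaper. It is not the attached patch and not existing Nutch code: the class name UnreservedUnescaper and its methods are hypothetical, and it only assumes the unreserved set quoted above (ALPHA, DIGIT, hyphen, period, underscore, tilde). Every other escape, including %2B, %26 and %2F, is left untouched.

```java
final class UnreservedUnescaper {

  /** True for RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~". */
  static boolean isUnreserved(int c) {
    return (c >= 'A' && c <= 'Z')
        || (c >= 'a' && c <= 'z')
        || (c >= '0' && c <= '9')
        || c == '-' || c == '.' || c == '_' || c == '~';
  }

  /** Value of an ASCII hex digit, or -1 if the character is not one. */
  static int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    return -1;
  }

  /** Decodes %XX only if the octet is an unreserved character; all other text is copied as-is. */
  static String unescapeUnreserved(String s) {
    StringBuilder sb = new StringBuilder(s.length());
    int i = 0;
    while (i < s.length()) {
      char ch = s.charAt(i);
      // A complete escape needs two more characters after the percent sign.
      if (ch == '%' && i + 2 < s.length()) {
        int hi = hexValue(s.charAt(i + 1));
        int lo = hexValue(s.charAt(i + 2));
        if (hi >= 0 && lo >= 0) {
          int octet = (hi << 4) | lo;
          if (isUnreserved(octet)) {
            sb.append((char) octet);   // safe to decode
          } else {
            sb.append(s, i, i + 3);    // keep the original escape
          }
          i += 3;
          continue;
        }
      }
      // Plain character, or a malformed/incomplete escape: copy unchanged.
      sb.append(ch);
      i++;
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    System.out.println(unescapeUnreserved("http://google.com/search?q=c%2B%2B"));
    System.out.println(unescapeUnreserved("http://example.com/%7Euser/a%2Fb%2Ehtml"));
  }
}
```

With this approach the first URL comes back unchanged, while in the second only %7E and %2E are decoded and %2F stays escaped. The advantage of the whitelist is that nobody has to enumerate every character that is unsafe to decode.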
> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch, NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.