[ https://issues.apache.org/jira/browse/NUTCH-2064?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14635824#comment-14635824 ]
Sebastian Nagel commented on NUTCH-2064:
----------------------------------------
Definitely a nice-to-have feature: it gets rid of duplicates and avoids errors
if a protocol plugin does not support non-ASCII characters. Still, -1 for the
patch so far:
* given the unfortunate discussions in NUTCH-1098, it's probably better to write
a patch from scratch (I would volunteer!)
* the patched urlnormalizer-basic unescapes more than it should according
to [RFC3986|https://tools.ietf.org/html/rfc3986#section-2.1]. Ampersand and
colon (and other reserved characters) must stay escaped, otherwise the
meaning of the URL changes:
{noformat}
% cat test_urls.txt
http://x.com/s?q=a%26b&m=10
http://x.com/show?http%3A%2F%2Fx.com%2Fb
% cat test_urls.txt | nutch plugin urlnormalizer-basic org.apache.nutch.net.urlnormalizer.basic.BasicURLNormalizer
http://x.com/s?q=a&b&m=10
http://x.com/show?http:%2F%2Fx.com%2Fb
{noformat}
* would be good to have unit tests with realistic URLs to test for such cases
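
A minimal sketch of the unescaping rule asked for above, in plain Java. It is
not the Nutch code and not a proposed patch: the class and method names are
made up for illustration, and non-ASCII handling is deliberately left out. The
assumption is that an escape is decoded only when it stands for an RFC 3986
unreserved character (letters, digits, '-', '.', '_', '~'); every other escape
stays in place and only gets upper-case hex digits.

```java
public class UnreservedUnescapeSketch {

    // RFC 3986, section 2.3: unreserved = ALPHA / DIGIT / "-" / "." / "_" / "~"
    static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Value of an ASCII hex digit, or -1 if the character is not one.
    static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    // Hypothetical helper: decode escapes of unreserved characters only.
    // Escapes of anything else (e.g. %26 '&', %3A ':', %2F '/') are kept,
    // with their hex digits upper-cased. Malformed escapes pass through.
    static String unescapeUnreserved(String s) {
        StringBuilder out = new StringBuilder(s.length());
        int i = 0;
        while (i < s.length()) {
            char c = s.charAt(i);
            if (c == '%' && i + 2 < s.length()) {
                int hi = hexValue(s.charAt(i + 1));
                int lo = hexValue(s.charAt(i + 2));
                if (hi >= 0 && lo >= 0) {
                    int value = hi * 16 + lo;
                    if (isUnreserved(value)) {
                        out.append((char) value);
                    } else {
                        out.append('%')
                           .append(Character.toUpperCase(s.charAt(i + 1)))
                           .append(Character.toUpperCase(s.charAt(i + 2)));
                    }
                    i += 3;
                    continue;
                }
            }
            out.append(c);
            i++;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String[] urls = {
            "http://x.com/s?q=a%26b&m=10",              // stays as is
            "http://x.com/show?http%3A%2F%2Fx.com%2Fb", // stays as is
            "http://x.com/%7Euser/a%2db"                // -> http://x.com/~user/a-b
        };
        for (String url : urls) {
            System.out.println(unescapeUnreserved(url));
        }
    }
}
```

With this rule the two URLs from test_urls.txt come back unchanged, so
q=a%26b stays a single query parameter instead of being split into q=a and b.
Cases like these would also make a reasonable starting point for the unit
tests mentioned in the last item.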
> URLNormalizer basic to properly encode non-ASCII characters
> -----------------------------------------------------------
>
> Key: NUTCH-2064
> URL: https://issues.apache.org/jira/browse/NUTCH-2064
> Project: Nutch
> Issue Type: Bug
> Affects Versions: 1.10
> Reporter: Markus Jelsma
> Fix For: 1.11
>
> Attachments: NUTCH-1098.patch
>
>
> NUTCH-1098 rewritten to work on trunk. Unit test is identical to 1098.