Jorge Luis Betancourt Gonzalez created NUTCH-2541:
-----------------------------------------------------
Summary: Arabic characters in the URL are not properly escaped by
the protocol-httpclient plugin
Key: NUTCH-2541
URL: https://issues.apache.org/jira/browse/NUTCH-2541
Project: Nutch
Issue Type: Bug
Components: plugin, protocol
Affects Versions: 1.14, 2.3.1
Reporter: Jorge Luis Betancourt Gonzalez
As reported in [1]:
When crawling URLs that contain Arabic characters, Nutch fails with an
{{InvalidArgumentException}}. This happens because the HTTP client library
internally uses {{java.net.URI}}, which does not support these characters
unless they are properly escaped.
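To illustrate what "properly escaped" could look like, here is a minimal sketch that percent-encodes the non-ASCII characters of a URL as UTF-8 before it is handed to an HTTP client. The class {{UrlEscapeSketch}}, the helper {{toAsciiUrl}} and the example.com URL are hypothetical and not part of Nutch or the plugin; {{java.net.URI#toASCIIString()}} is the standard JDK method. It assumes the input is otherwise a syntactically valid URL and is not necessarily how the plugin should be fixed.

```java
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class for illustration only; not part of Nutch.
class UrlEscapeSketch {

    // Returns the URL with every non-ASCII character percent-encoded as UTF-8.
    // Characters that are already legal ASCII URI characters are left untouched.
    static String toAsciiUrl(String url) throws URISyntaxException {
        return new URI(url).toASCIIString();
    }

    public static void main(String[] args) throws URISyntaxException {
        // "\u0645\u0631\u062D\u0628\u0627" is an Arabic word used as a sample path segment.
        String raw = "http://example.com/\u0645\u0631\u062D\u0628\u0627";
        System.out.println(toAsciiUrl(raw));
        // prints: http://example.com/%D9%85%D8%B1%D8%AD%D8%A8%D8%A7
    }
}
```

A URL that contains other illegal characters, such as spaces, would still make the {{URI}} constructor throw {{URISyntaxException}}, so a real fix would likely need to handle that case separately.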
[1]
https://stackoverflow.com/questions/49379007/apache-nutch-2-3-1-fetcher-giving-invalid-uri-exception/49395225?noredirect=1#comment85798974_49395225