[
https://issues.apache.org/jira/browse/NUTCH-1098?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13140170#comment-13140170
]
Markus Jelsma commented on NUTCH-1098:
--------------------------------------
Path prefixes can be overcome by using -p1 with patch. Indenting will be looked
after.
I also believe we should disable this feature by default, as existing crawls
are very much affected by this issue and need to be taken care of. One would
have to renormalize the crawldb (and linkdb and webgraphdb if in use), which is
trivial. However, the keys in existing segments are not updated, so indexing the
data won't work anymore.
> better url-normalizer basic
> ---------------------------
>
> Key: NUTCH-1098
> URL: https://issues.apache.org/jira/browse/NUTCH-1098
> Project: Nutch
> Issue Type: Improvement
> Components: fetcher
> Affects Versions: 1.3
> Environment: Any
> Reporter: Radim Kolar
> Assignee: Markus Jelsma
> Labels: encoding, url
> Fix For: 1.5
>
> Attachments: patch-urlnormalizer.diff
>
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> Basic URL normalizer lacks 2 important features:
> * Encode space in URL into %20 to unbreak httpclient and possibly others who do
>   not expect space inside URL
> * Ability to decode %33 encoding in URL. This is important for avoiding
>   duplicates
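
The two requested behaviours can be illustrated with a minimal Java sketch. This
is not the code from patch-urlnormalizer.diff; the class name UrlEscapeSketch
and the method normalize are hypothetical, and the sketch assumes that only
RFC 3986 "unreserved" characters (letters, digits, '-', '.', '_', '~') are safe
to decode, so escapes such as %2F are left untouched.

```java
// Hypothetical illustration only; not the implementation attached to NUTCH-1098.
final class UrlEscapeSketch {

    // RFC 3986 unreserved characters: decoding these does not change URL meaning.
    private static boolean isUnreserved(int c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z')
            || (c >= '0' && c <= '9')
            || c == '-' || c == '.' || c == '_' || c == '~';
    }

    // Returns the value of an ASCII hex digit, or -1 if c is not one.
    private static int hexValue(char c) {
        if (c >= '0' && c <= '9') return c - '0';
        if (c >= 'a' && c <= 'f') return c - 'a' + 10;
        if (c >= 'A' && c <= 'F') return c - 'A' + 10;
        return -1;
    }

    static String normalize(String url) {
        StringBuilder out = new StringBuilder(url.length() + 8);
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c == ' ') {
                // Feature 1: a raw space becomes %20.
                out.append("%20");
            } else if (c == '%' && i + 2 < url.length()
                    && hexValue(url.charAt(i + 1)) >= 0
                    && hexValue(url.charAt(i + 2)) >= 0) {
                int value = hexValue(url.charAt(i + 1)) * 16
                          + hexValue(url.charAt(i + 2));
                if (isUnreserved(value)) {
                    // Feature 2: %33 and "3" collapse to the same URL.
                    out.append((char) value);
                } else {
                    // Reserved or non-ASCII escapes are copied unchanged.
                    out.append(url, i, i + 3);
                }
                i += 2; // skip the two hex digits just consumed
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(normalize("http://example.com/a b"));
        System.out.println(normalize("http://example.com/%33"));
        System.out.println(normalize("http://example.com/a%2Fb"));
    }
}
```

Any URL whose normalized form differs from its stored form gets a new key,
which is why an existing crawldb would have to be renormalized and why keys in
existing segments would no longer match.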