[jira] [Commented] (NUTCH-1321) IDNNormalizer

Sebastian Nagel (JIRA) Fri, 28 Mar 2014 16:35:31 -0700

    [ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13951585#comment-13951585 ]


Sebastian Nagel commented on NUTCH-1321:
----------------------------------------

In BasicURLNormalizer, URLs are already split into parts (protocol, host, etc.), 
so we could call {{IDN.toASCII(host)}} directly. That would be more efficient 
than calling {{URLUtil.toASCII(url)}}, which splits and re-concatenates the URL 
a second time.
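
A minimal sketch of the idea, assuming the normalizer already holds the URL 
parts as separate strings. The class and method names are hypothetical and not 
taken from BasicURLNormalizer; only {{java.net.IDN}} is a real API here.

```java
import java.net.IDN;

public class HostIdnSketch {

  /**
   * Hypothetical helper: rebuilds a URL from parts that were already split,
   * converting only the host to its ASCII (punycode) form.
   */
  public static String rebuild(String protocol, String host, int port, String file) {
    // Only the host needs IDN handling; the other parts are passed through.
    String asciiHost = IDN.toASCII(host);
    StringBuilder sb = new StringBuilder();
    sb.append(protocol).append("://").append(asciiHost);
    if (port >= 0) {
      // A negative port stands for "no explicit port" in this sketch.
      sb.append(':').append(port);
    }
    sb.append(file);
    return sb.toString();
  }

  public static void main(String[] args) {
    // "b\u00fccher" is "bücher"; the escape avoids source-encoding issues.
    System.out.println(rebuild("http", "b\u00fccher.example", -1, "/path?q=1"));
  }
}
```

Since the host is converted in place, the URL is split and concatenated only 
once.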

Maybe we should move the decoding of punycoded URLs from IndexUtil to 
index-basic / BasicIndexingFilter, since the field "url" is filled there. In 
case of redirects it is filled with reprUrl, which should be decoded as well.
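
A rough sketch of what such a decoding step could look like, whether the value 
comes from the URL or from reprUrl. The class and method names are hypothetical 
and this is not the code from the attached patch; it assumes the input URL is 
already in ASCII form.

```java
import java.net.IDN;
import java.net.URI;

public class UrlFieldDecodeSketch {

  /**
   * Hypothetical helper: returns the URL with its punycoded host decoded to
   * Unicode. User info is ignored to keep the sketch short.
   */
  public static String decodeUrl(String asciiUrl) {
    URI uri = URI.create(asciiUrl);
    String host = uri.getHost();
    if (host == null) {
      // No parseable host: leave the value untouched.
      return asciiUrl;
    }
    StringBuilder sb = new StringBuilder();
    sb.append(uri.getScheme()).append("://").append(IDN.toUnicode(host));
    if (uri.getPort() >= 0) {
      sb.append(':').append(uri.getPort());
    }
    if (uri.getRawPath() != null) {
      sb.append(uri.getRawPath());
    }
    if (uri.getRawQuery() != null) {
      sb.append('?').append(uri.getRawQuery());
    }
    if (uri.getRawFragment() != null) {
      sb.append('#').append(uri.getRawFragment());
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The same helper would be applied to the URL or to reprUrl.
    System.out.println(decodeUrl("http://xn--bcher-kva.example/a?b=c"));
  }
}
```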

Regarding a port to 1.x: trunk currently does not differentiate between 'id' 
and 'url'. IDN-decoding the URL in NutchDocument may prevent documents from 
being properly deleted, cf. NUTCH-1708 for a similar problem and discussions.
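
A toy illustration of one way this could go wrong, assuming the document is 
added under the decoded URL while the delete request still uses the ASCII form. 
The map-based index is a hypothetical stand-in, not a Nutch or Solr API.

```java
import java.net.IDN;
import java.util.HashMap;
import java.util.Map;

public class IdMismatchSketch {

  // Toy index keyed by document id (hypothetical stand-in for a real index).
  private final Map<String, String> index = new HashMap<>();

  public void add(String id, String content) {
    index.put(id, content);
  }

  /** Returns true only if a document with exactly this id was removed. */
  public boolean delete(String id) {
    return index.remove(id) != null;
  }

  public int size() {
    return index.size();
  }

  public static void main(String[] args) {
    String asciiKey = "xn--bcher-kva.example/page";
    String decodedKey = IDN.toUnicode("xn--bcher-kva.example") + "/page";

    IdMismatchSketch idx = new IdMismatchSketch();
    idx.add(decodedKey, "body");               // indexed under the decoded form
    System.out.println(idx.delete(asciiKey));  // delete uses the ASCII form
    System.out.println(idx.size());            // the document is still there
  }
}
```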

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an 
> indexer so it will encode ASCII URLs to their proper Unicode equivalent.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
