[jira] [Commented] (NUTCH-1321) IDNNormalizer

Sebastian Nagel (JIRA) Fri, 20 Dec 2013 15:04:44 -0800

    [ 
https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13854647#comment-13854647
 ]


Sebastian Nagel commented on NUTCH-1321:
----------------------------------------

Sorry, I should have checked the date of patches to get the latest one. The 
right patch is correctly formatted and applies well. Thanks!

You are right regarding point 2: in 2.x 'id' is the reversed (and punycoded) 
URL. In 1.x the situation is different. But for 2.x there is definitely no 
problem. For 1.x this should be discussed.

Testing the patch failed because URLUtil.toUNICODE() returned null for 
punycoded URLs (opened NUTCH-1685).

Is there really a need for isPunycode()? In the current patch, at least, it 
checks for punycode by converting to Unicode and comparing the result with the 
original URL. It would be more efficient to convert unconditionally (the URL is 
left unchanged if it's not an internationalized domain name).
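
To illustrate the unconditional conversion, here is a minimal sketch using the 
JDK's java.net.IDN directly. The class and method names (IdnHostSketch, 
hostToUnicode) are hypothetical and not taken from the patch, and the sketch 
only handles the host name, not a full URL. It relies on IDN.toUnicode() 
returning its input unmodified when there is nothing to decode, so no prior 
isPunycode() check is needed.

```java
import java.net.IDN;

// Hypothetical illustration, not the code from idnNormalizer.patch.
final class IdnHostSketch {

    // Convert a host name to its Unicode form without testing for
    // punycode first. IDN.toUnicode() leaves hosts that contain no
    // "xn--" labels as they are, so plain ASCII hosts pass through.
    static String hostToUnicode(String host) {
        return IDN.toUnicode(host);
    }

    public static void main(String[] args) {
        // An internationalized host in its ASCII (punycode) form.
        System.out.println(hostToUnicode("xn--bcher-kva.example"));
        // A plain ASCII host, which should come back unchanged.
        System.out.println(hostToUnicode("example.org"));
    }
}
```

A caller that needs to know whether anything changed can still compare the 
result with the input afterwards, but that comparison is then optional rather 
than a precondition for the conversion.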

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an 
> indexer so it will encode ASCII URLs to their proper Unicode equivalent.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
