[ 
https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13853056#comment-13853056
 ] 

Sebastian Nagel commented on NUTCH-1321:
----------------------------------------

Hi [~ilhamikalkan],
great! Thanks! The patch looks good (not tested yet). A few comments:
# method isPunycode(url)
{code}
String[] arr = url.split("\\.");
if (arr[1].startsWith("xn--"))
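{code} fails for URLs like {{http://www.medizin.xn--uni-tbingen-xhb.de/}}: only the second dot-separated part ({{arr[1]}}, here {{medizin}}) is checked, but the Punycode label is the third one.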
# maybe we should make the decoding from Punycode to Unicode in the indexer 
scope configurable via a property such as "urlnormalizer.idn.indexer.decode". 
URLs are used both as ordinary content (tokenized field "url") and as the unique 
ID (field "id") for updating and deleting indexed documents. Some indexer 
back-ends may require the id field to be pure ASCII or Punycode.
# cosmetics: code should be formatted by 
[eclipse-codeformat.xml|http://svn.apache.org/viewvc/nutch/branches/2.x/eclipse-codeformat.xml],
 patches generated as described in 
[1|http://wiki.apache.org/nutch/Becoming_A_Nutch_Developer], 
[2|http://wiki.apache.org/nutch/HowToContribute].
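
To illustrate the first two comments, here is a minimal sketch (not the code from the patch) of a check that looks at every label of the host, plus a decode step that can be switched off. The class name {{IdnSketch}}, the helper {{hostOf}} and the method {{hostForIndex}} are hypothetical; the {{decode}} flag stands in for a property like the one suggested above and is passed as a plain boolean here instead of being read from the Nutch configuration.

```java
import java.net.IDN;
import java.net.URI;
import java.net.URISyntaxException;

// Hypothetical helper class, for illustration only (not part of the patch).
final class IdnSketch {

  // Prefix that marks a Punycode-encoded (ACE) host label.
  private static final String ACE_PREFIX = "xn--";

  /** Returns the host of the URL, or null if it cannot be parsed or has no host. */
  static String hostOf(String url) {
    try {
      return new URI(url).getHost();
    } catch (URISyntaxException e) {
      return null;
    }
  }

  /** True if any label of the host starts with "xn--" (case-insensitive). */
  static boolean isPunycode(String url) {
    String host = hostOf(url);
    if (host == null) {
      return false;
    }
    for (String label : host.split("\\.")) {
      if (label.regionMatches(true, 0, ACE_PREFIX, 0, ACE_PREFIX.length())) {
        return true;
      }
    }
    return false;
  }

  /**
   * Returns the host as it would be indexed. The decode flag is an assumption:
   * it stands in for a boolean property such as "urlnormalizer.idn.indexer.decode".
   * With decode == false the host is returned unchanged (ASCII/Punycode).
   */
  static String hostForIndex(String url, boolean decode) {
    String host = hostOf(url);
    if (host == null || !decode) {
      return host;
    }
    // java.net.IDN.toUnicode returns its input unchanged if a label cannot be decoded.
    return IDN.toUnicode(host);
  }
}
```

Because only the host is inspected, "xn--" appearing in the path or the query does not trigger the check, and {{isPunycode("http://www.medizin.xn--uni-tbingen-xhb.de/")}} is detected regardless of the position of the encoded label.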

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: Nutch-1321.patch, idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an 
> indexer so it will encode ASCII URLs to their proper Unicode equivalent.



--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
