[ https://issues.apache.org/jira/browse/NUTCH-1321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

İlhami KALKAN updated NUTCH-1321:
---------------------------------
    Attachment: idnNormalizer.patch

I added a patch file. Non-ASCII URLs are converted to Punycode by BasicURLNormalizer.java in the inject phase, and also in the parse phase while extracting outlinks. In the index phase, the Punycode URLs are converted back to Unicode.

> IDNNormalizer
> -------------
>
>                 Key: NUTCH-1321
>                 URL: https://issues.apache.org/jira/browse/NUTCH-1321
>             Project: Nutch
>          Issue Type: New Feature
>            Reporter: Markus Jelsma
>            Assignee: Markus Jelsma
>             Fix For: 1.9
>
>         Attachments: Nutch-1321.patch, idnNormalizer.patch
>
>
> Right now, IDNs are indexed as ASCII. An IDNNormalizer is to be used with an
> indexer so it will convert ASCII URLs to their proper Unicode equivalent.

--
This message was sent by Atlassian JIRA
(v6.1.4#6159)
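
For readers unfamiliar with the conversion discussed in the comment above, here is a minimal sketch of the two directions using the JDK's java.net.IDN class. It is not the code from idnNormalizer.patch: the class name IdnSketch and both method names are hypothetical, and the sketch works on bare host names only, whereas a real URL normalizer would parse the URL and convert just its host part.

```java
import java.net.IDN;

// Hypothetical illustration only; not taken from the attached patch.
final class IdnSketch {

    // Unicode host -> ASCII (Punycode) form, the direction the comment
    // describes for the inject and parse phases.
    static String toPunycode(String unicodeHost) {
        return IDN.toASCII(unicodeHost);
    }

    // ASCII (Punycode) host -> Unicode form, the direction the comment
    // describes for the index phase.
    static String toUnicode(String asciiHost) {
        return IDN.toUnicode(asciiHost);
    }

    public static void main(String[] args) {
        // \u00fc is the letter u with diaeresis, written as an escape so the
        // source does not depend on the compiler's file encoding.
        String unicodeHost = "b\u00fccher.example";
        String asciiHost = toPunycode(unicodeHost);
        System.out.println(asciiHost);
        System.out.println(toUnicode(asciiHost).equals(unicodeHost));
    }
}
```

Host names that are already plain ASCII pass through both methods unchanged, so the same calls could be applied to every host without first checking whether it is internationalized.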