[ 
https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889102#action_12889102
 ] 

Alex McLintock commented on NUTCH-18:
-------------------------------------

David, 

I believe you are looking at a different problem to the one in this issue. I 
think we ought to open up a new issue instead. 

You are specifically trying to fetch pages from a domain with "funny" 
characters in its name - whereas the original issue was simply "funny" 
characters anywhere in the URL - such as in the path. For that problem, 
percent-encoding of some kind is required at the right time.
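
As a rough sketch of what that escaping could look like (the class and method 
names below are made up for illustration, they are not existing Nutch code, and 
the right charset depends on what the server expects; the example in the issue 
description corresponds to ISO-8859-1):

```java
import java.nio.charset.Charset;

public class HighBitEscaper {

    /**
     * Percent-escapes every non-ASCII character of s using the bytes it has
     * in the given charset. ASCII characters, including existing '%'
     * escapes such as %20, are copied through unchanged.
     */
    static String escapeHighBit(String s, Charset charset) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int n = Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                // Encode this one code point and emit each byte as %xx
                byte[] bytes = s.substring(i, i + n).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02x", b & 0xff));
                }
            }
            i += n;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // \u00e7 and \u00e3 are c-cedilla and a-tilde
        String name = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        System.out.println(escapeHighBit(name, Charset.forName("ISO-8859-1")));
    }
}
```

This only covers high-bit characters; other illegal characters (spaces and so 
on) would need their own handling.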

I note that this is a Catalan domain which *does* resolve outside of Nutch/Java 
(which I found very surprising). Most FAQs I found say that accents are not 
allowed, but apparently Internationalised Domain Names do exist and are 
"experimental":

http://www.domini.cat/idn-policy.pdf
http://en.wikipedia.org/wiki/Internationalized_domain_name

"Internationalized domain names are stored in the Domain Name System as ASCII 
strings using Punycode transcription"

Now we use this code for checking whether the host exists...

Fetcher.java :
          InetAddress addr = InetAddress.getByName(u.getHost());

I can't tell right now whether that does Punycode transcription. I suspect not. 
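
If it does not, one option might be java.net.IDN, which was added in Java 6 and 
converts a host name to its ASCII (Punycode) form. A minimal sketch, assuming we 
can require Java 6; the class name, method names and example host are 
hypothetical and this is not tested against the Fetcher:

```java
import java.net.IDN;
import java.net.InetAddress;
import java.net.UnknownHostException;

public class IdnHostLookup {

    /**
     * Returns the ASCII form of a host name. Plain ASCII names come back
     * unchanged; IDN.toASCII throws IllegalArgumentException if the
     * input cannot be converted.
     */
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    /** Resolves a host after converting it to its ASCII form. */
    static InetAddress resolve(String host) throws UnknownHostException {
        return InetAddress.getByName(toAsciiHost(host));
    }

    public static void main(String[] args) {
        // \u00fc is u-umlaut; prints xn--bcher-kva.example
        System.out.println(toAsciiHost("b\u00fccher.example"));
    }
}
```

In Fetcher.java that would mean passing IDN.toASCII(u.getHost()) to 
InetAddress.getByName() instead of the raw host.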

We might look at how this tool does it, but I am not sure about the 
licensing... http://jwhoisserver.sourceforge.net/



> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
>                 Key: NUTCH-18
>                 URL: https://issues.apache.org/jira/browse/NUTCH-18
>             Project: Nutch
>          Issue Type: Bug
>          Components: fetcher
>            Reporter: Stefan Groschupf
>            Priority: Minor
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include 
> illegal characters in URLs -- specifically, characters with 
> the high bit set to produce non-English letters. In 
> addition, both Firefox and IE will accept URLs with 
> high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would 
> help if high-bit characters (and other illegal characters) 
> in URLs could be escaped (using percent-hex notation) 
> as part of the URL fix-up process, probably right after 
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit 
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
