[
https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889102#action_12889102
]
Alex McLintock commented on NUTCH-18:
-------------------------------------
David,
I believe you are looking at a different problem to the one in this issue. I
think we ought to open up a new issue instead.
You are specifically trying to fetch pages from a domain with non-ASCII
("funny") characters in its name, whereas the original issue was about such
characters anywhere in the URL, for example in the path. For that problem the
URL needs to be percent-encoded at the right point in processing.
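To illustrate what I mean by encoding the path, here is a rough sketch, not
existing Nutch code. The PathEscaper class and escapeHighBit method are names
I made up, and the choice of charset is an assumption: the example in the
original report looks like ISO-8859-1 (one byte per accented letter), but a
given server may expect UTF-8 instead. It only touches non-ASCII characters
and leaves existing %xx escapes and other illegal ASCII characters alone.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration only; not part of Nutch.
class PathEscaper {

    // Percent-escapes every non-ASCII character in s using the bytes it has
    // in the given charset. ASCII characters are copied through unchanged.
    static String escapeHighBit(String s, Charset cs) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp < 0x80) {
                out.append((char) cp);
            } else {
                byte[] bytes = s.substring(i, i + len).getBytes(cs);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02x", b & 0xff));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Assumed charset: ISO-8859-1, matching the escapes in the report.
        System.out.println(
            escapeHighBit("Altera\u00e7\u00e3o", StandardCharsets.ISO_8859_1));
    }
}
```

Where exactly this would be called in the URL fix-up process is a separate
question.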
I note that this is a Catalan domain which *does* resolve outside of Nutch/Java
(which I found very surprising).
Most FAQs I found say that accents are not allowed, but apparently
Internationalised Domain Names do exist and are "experimental":
http://www.domini.cat/idn-policy.pdf
http://en.wikipedia.org/wiki/Internationalized_domain_name
"Internationalized domain names are stored in the Domain Name System as ASCII
strings using Punycode transcription"
At the moment we use this code in Fetcher.java to check whether the host
exists:
    InetAddress addr = InetAddress.getByName(u.getHost());
I can't tell right now whether that does Punycode transcription. I suspect not.
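For what it's worth, Java 6 added java.net.IDN, whose toASCII method converts
an internationalised host name to its ASCII (Punycode) form. A minimal sketch
of converting the host before the lookup follows. The IdnHost class and
toAsciiHost helper are hypothetical names, not existing Nutch code, and I have
not tried this inside the fetcher.

```java
import java.net.IDN;

// Hypothetical helper for illustration only; not part of Nutch.
class IdnHost {

    // Returns the ASCII-compatible (Punycode) form of a host name.
    // Plain ASCII host names come back unchanged.
    static String toAsciiHost(String host) {
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // "b\u00fccher.example" is a made-up host used only as input.
        System.out.println(toAsciiHost("b\u00fccher.example"));
    }
}
```

The result could then be passed on in place of u.getHost(), e.g.
InetAddress.getByName(IdnHost.toAsciiHost(u.getHost())). Note that
IDN.toASCII throws IllegalArgumentException for input it cannot convert, so a
caller would need to handle that.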
We might look at how this tool does it, but I am not sure about the
licensing... http://jwhoisserver.sourceforge.net/
> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
> Key: NUTCH-18
> URL: https://issues.apache.org/jira/browse/NUTCH-18
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include
> illegal characters in URLs -- specifically, characters with
> the high bit set to produce non-English letters. In
> addition, both Firefox and IE will accept URLs with
> high-bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would
> help if high-bit characters (and other illegal characters)
> in URLs could be escaped (using percent-hex notation)
> as part of the URL fix-up process, probably right after
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc