[
https://issues.apache.org/jira/browse/NUTCH-18?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12889173#action_12889173
]
Reinhard Schwab commented on NUTCH-18:
--------------------------------------
When I try to open this link
http://www.altaribagorça.cat
with Firefox, I also get an error:
Address Not Found
Firefox can't find the server at www.altaribagor%c3%a7a.cat.
My Firefox percent-encodes (URL-encodes) the address.
Maybe this is the same problem you have?
Maybe it is an encoding issue?
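If the non-ASCII character is in the hostname rather than in the path, percent-encoding may not be what a resolver expects; internationalized hostnames are usually converted to their ASCII "xn--" (IDNA/Punycode) form before lookup. A minimal sketch using the JDK's java.net.IDN follows. The class and method names are made up for illustration, and this is not a claim about what Nutch or Firefox actually do with this link.

```java
import java.net.IDN;

public class IdnHostExample {
    /** Converts an internationalized hostname to its ASCII-compatible form. */
    public static String toAsciiHost(String host) {
        // IDN.toASCII processes each dot-separated label separately;
        // labels that are already ASCII (such as "www") are left unchanged.
        return IDN.toASCII(host);
    }

    public static void main(String[] args) {
        // \u00e7 is the c-cedilla from the link above, written as an escape
        // so the source file compiles regardless of its own encoding.
        String host = "www.altaribagor\u00e7a.cat";
        System.out.println(toAsciiHost(host));
    }
}
```

The converted host contains only ASCII characters, so it can be passed on to code that rejects high-bit characters.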
> Windows servers include illegal characters in URLs
> --------------------------------------------------
>
> Key: NUTCH-18
> URL: https://issues.apache.org/jira/browse/NUTCH-18
> Project: Nutch
> Issue Type: Bug
> Components: fetcher
> Reporter: Stefan Groschupf
> Priority: Minor
>
> Transferred from:
> http://sourceforge.net/tracker/index.php?func=detail&aid=1110243&group_id=59548&atid=491356
> submitted by:
> Ken Meltsner
> While spidering our intranet, I found that IIS may include
> illegal characters in URLs -- specifically, characters with
> the high bit set to produce non-English letters. In
> addition, both Firefox and IE will accept URLs with high-
> bit characters, but Java won't.
> While this may not be Nutch's (or Java's) fault, it would
> help if high-bit characters (and other illegal characters)
> in URLs could be escaped (using percent-hex notation)
> as part of the URL fix-up process, probably right after
> the hostname lower-case conversion.
> Example document name in Portuguese (with high-bit
> characters) taken from a longer URL:
> Nota%20tecnica%20-%20Alteração%20de%20escopo.doc
> and with percent-escaped characters:
> Nota%20tecnica%20-%20Altera%e7%e3o%20de%20escopo.doc
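The escaping requested in the description could be sketched roughly as below. This is only an illustration, not Nutch's URL normalizer; the names HighBitEscaper and escapeHighBit are hypothetical. The charset is a parameter because the report does not say which encoding the server used: the %e7%e3 in the example corresponds to ISO-8859-1, while UTF-8 would give %c3%a7%c3%a3 for the same letters.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class HighBitEscaper {
    /**
     * Percent-escapes spaces, control characters and everything outside
     * 7-bit ASCII. Printable ASCII, including '%', is copied unchanged,
     * so escapes already present in the input are not double-encoded.
     */
    public static String escapeHighBit(String url, Charset charset) {
        StringBuilder out = new StringBuilder(url.length() + 16);
        for (int i = 0; i < url.length(); ) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp);
            if (cp > 0x20 && cp < 0x7F) {
                out.append((char) cp);
            } else {
                // Encode the character in the chosen charset, then write
                // each resulting byte in percent-hex notation.
                byte[] bytes = url.substring(i, i + len).getBytes(charset);
                for (byte b : bytes) {
                    out.append('%').append(String.format("%02x", b & 0xFF));
                }
            }
            i += len;
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Document name from the report; \u00e7 and \u00e3 are the two
        // high-bit letters, written as escapes to avoid source-encoding issues.
        String name = "Nota%20tecnica%20-%20Altera\u00e7\u00e3o%20de%20escopo.doc";
        System.out.println(escapeHighBit(name, StandardCharsets.ISO_8859_1));
        System.out.println(escapeHighBit(name, StandardCharsets.UTF_8));
    }
}
```

With ISO-8859-1 the first line printed matches the percent-escaped name quoted in the report. A real fix-up step would also need to decide which other illegal characters to escape, which this sketch does not attempt.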
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.