Ken Krugler wrote:
I'm wondering if anybody else has encountered this problem...
I've got a funky URL: "http://-angelcries.blogspot.com/"
Note the leading '-' in the subdomain.
This works fine in Firefox, and gives no problem with the URL class.
But the URI class throws a URISyntaxException if you use a host name
with a leading '-'.
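
A minimal sketch for reproducing this with plain java.net.URI. The class
name LeadingDashHostDemo and the helper hasServerAuthority are made up for
illustration. Going by the JDK documentation, the single-argument
constructor may fall back to a registry-based authority (leaving getHost()
null) instead of throwing, while parseServerAuthority() and the
constructors that take a separate host argument insist on a server-based
authority and raise URISyntaxException. Details may vary between JDK
versions.

```java
import java.net.URI;
import java.net.URISyntaxException;

final class LeadingDashHostDemo {

    // Returns true if java.net.URI can treat the authority of the given
    // string as server-based (user-info, host, port), false otherwise.
    static boolean hasServerAuthority(String s) {
        try {
            new URI(s).parseServerAuthority();
            return true;
        } catch (URISyntaxException e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        String funky = "http://-angelcries.blogspot.com/";
        String plain = "http://angelcries.blogspot.com/";

        URI uri = new URI(funky);
        System.out.println("authority: " + uri.getAuthority());
        System.out.println("host:      " + uri.getHost());
        System.out.println("funky has server authority: " + hasServerAuthority(funky));
        System.out.println("plain has server authority: " + hasServerAuthority(plain));
    }
}
```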
I've read through the relevant portions of RFCs 3986, 2396, 1123, 1034
and 952. The general trend has been for more lenient domain name
specification over time, but even today subdomains and domains should
not start with '-'. However, it's clear that DNS servers allow the use of
a leading dash for subdomains.
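
As a rough illustration of the strict rule those RFCs describe, the
sketch below checks each label for letters, digits and hyphens, with no
hyphen at either end and at most 63 characters. The class and method
names are hypothetical, and the check is deliberately simplified (for
example, it rejects a trailing dot).

```java
final class HostLabelCheck {

    // Strict label in the RFC 1123 sense: starts and ends with a letter or
    // digit, hyphens only in between, 1 to 63 characters in total.
    static boolean isStrictLabel(String label) {
        return label.matches("[A-Za-z0-9]([A-Za-z0-9-]{0,61}[A-Za-z0-9])?");
    }

    // A host name passes if every dot-separated label is strict.
    // Simplification: a trailing dot produces an empty label and is rejected.
    static boolean isStrictHostname(String host) {
        if (host.isEmpty() || host.length() > 253) {
            return false;
        }
        for (String label : host.split("\\.", -1)) {
            if (!isStrictLabel(label)) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isStrictHostname("-angelcries.blogspot.com"));
        System.out.println(isStrictHostname("angelcries.blogspot.com"));
    }
}
```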
Since HttpClient is often used to fetch content that is browsed by
users, it would be an admirable goal to work around this problem - but
the only solution I see is to use a custom URI class instead of what's
in the JDK.
Based on what I see in other projects (e.g. Tomcat) this process of
replacing default implementations with custom versions winds up being a
path that's often taken, unfortunately, due to issues like this one.
Ken
(1) HttpClient 3.x has its own URI implementation. Sadly, it happened to
be the ugliest and most troublesome area of the entire project, one that
no one was willing to work on or even do minimal maintenance on. This is
the reason why it got replaced with the standard Java URI implementation
in HttpClient 4.x.
(2) It is simply not possible to replace java.net.URI with something
else without causing a major API breakage. I personally do not think it
is worth it.
It should be possible, though, to investigate the feasibility of
replacing the standard URI parsing routine with a more lenient one.
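
To give an idea of what a more lenient routine could look like, here is
a rough sketch that splits a URI with the generic regular expression
from RFC 3986, Appendix B, and then pulls host and port out of the
authority without validating the host labels. LenientUriParts is a
hypothetical class, not part of HttpClient or HttpCore, and the sketch
ignores percent-decoding and many corner cases.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class LenientUriParts {

    // Generic URI-splitting expression from RFC 3986, Appendix B.
    private static final Pattern GENERIC = Pattern.compile(
            "^(([^:/?#]+):)?(//([^/?#]*))?([^?#]*)(\\?([^#]*))?(#(.*))?");

    final String scheme;
    final String host;
    final int port;      // -1 if no port was given
    final String path;

    private LenientUriParts(String scheme, String host, int port, String path) {
        this.scheme = scheme;
        this.host = host;
        this.port = port;
        this.path = path;
    }

    // Splits the string without checking the host against the strict rules.
    // A non-numeric port still fails with NumberFormatException.
    static LenientUriParts parse(String s) {
        Matcher m = GENERIC.matcher(s);
        if (!m.matches()) {
            throw new IllegalArgumentException("Cannot split URI: " + s);
        }
        String authority = m.group(4);
        String host = null;
        int port = -1;
        if (authority != null) {
            // Drop optional user-info, then separate host and port.
            String hostPort = authority.substring(authority.lastIndexOf('@') + 1);
            int colon = hostPort.lastIndexOf(':');
            // A colon inside an IPv6 literal ("[::1]") is not a port separator.
            if (colon >= 0 && colon > hostPort.lastIndexOf(']')) {
                String portText = hostPort.substring(colon + 1);
                if (!portText.isEmpty()) {
                    port = Integer.parseInt(portText);
                }
                host = hostPort.substring(0, colon);
            } else {
                host = hostPort;
            }
        }
        return new LenientUriParts(m.group(2), host, port, m.group(5));
    }

    public static void main(String[] args) {
        LenientUriParts p = parse("http://-angelcries.blogspot.com/");
        System.out.println(p.scheme + " | " + p.host + " | " + p.port + " | " + p.path);
    }
}
```

The trade-off is that the caller takes over responsibility for
rejecting input that is genuinely malformed.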
Web crawlers that need to be able to handle non-standard or broken URIs
as well as tolerate other non-standard behaviors might be much better
off using HttpCore directly, possibly re-using the connection management
components from HttpClient.
Oleg