On Jun 4, 2012, at 5:41pm, Mugoma Joseph Okomba wrote:

> Hello,
> 
> While trying to use HttpClient 4.2 to download page I am getting:
> 
> java.net.URISyntaxException: Illegal character in query at index 85:
> http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories
> 
> 
> On HttpClient 3.x I get similar error:
> 
> java.lang.IllegalArgumentException: Invalid uri
> 'http://www.target.com/webapp/wcs/stores/servlet/s?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories':
> Invalid query
> 
> 
> However using the native Java download causes no error:
> 
> URL getURL = new URL(url);
> HttpURLConnection huc =  ( HttpURLConnection )  getURL.openConnection ();
> huc.setRequestMethod("GET");
> InputStream inps = null;
>       try{
>               huc.connect();
>               inps = (InputStream) huc.getInputStream();
>       }
> 
> 
> The URL is valid and accessible. How can one make HttpClient resolve such
> URL?

This issue has come up on occasion in the past: the java.net.URI class is more 
restrictive than the java.net.URL class, most browsers, and most DNS software.

In your case it's failing because '|' (vertical bar) is not considered a valid 
character by Java's URI class (which HttpClient 4.x uses internally; HttpClient 
3.x has its own, similarly strict URI class), even though the URL class accepts 
it. That has always struck me as odd, since most people talk about URLs being a 
subset of URIs :)
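
To see the difference without HttpClient in the picture, here is a minimal 
sketch using only java.net. The class and method names (PipeUriCheck, 
urlAccepts, uriErrorIndex) are made up for illustration; the URL string is the 
one from your message. Neither constructor touches the network.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class, not part of HttpClient or the JDK.
class PipeUriCheck {
    // The URL from the original question, unchanged.
    static final String TARGET =
        "http://www.target.com/webapp/wcs/stores/servlet/s"
        + "?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories";

    // True if java.net.URL accepts the string (parsing only, no connection).
    static boolean urlAccepts(String s) {
        try {
            new URL(s);
            return true;
        } catch (MalformedURLException e) {
            return false;
        }
    }

    // Index of the first character java.net.URI rejects, or -1 if it parses.
    static int uriErrorIndex(String s) {
        try {
            new URI(s);
            return -1;
        } catch (URISyntaxException e) {
            return e.getIndex();
        }
    }

    public static void main(String[] args) {
        System.out.println("URL accepts it: " + urlAccepts(TARGET));
        System.out.println("URI error index: " + uriErrorIndex(TARGET));
    }
}
```

The index reported by URI should match the "index 85" in your stack trace, 
which is the position of the first '|' in the query string.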

Going back in time, RFC1630 (T. Berners-Lee, CERN 1994) classifies the vertical 
bar (called "vline" in the spec) as a "national" character:

  national               { | } | vline | [ | ] | \ | ^ | ~

And then says:

  The "national" and "punctuation" characters do not appear in any productions 
  and therefore may not appear in URIs.

So technically speaking the URI class is doing the right thing.

You'll run into a similar issue with hostnames that start with '-', e.g. 
-angelcries.blogspot.com can be used to construct a URL, but not a URI.

Because DNS software & browsers are permissive, you'll find a number of these 
cases where web pages can't be fetched using HttpClient.
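
For the vertical bar specifically, one possible workaround is to let 
java.net.URL split the string and then rebuild it with the multi-argument 
java.net.URI constructor, which percent-encodes characters that are illegal in 
each component ('|' becomes %7C). This is only a sketch: PipeUriFix and toUri 
are hypothetical names, and it assumes the input has no existing %xx escapes, 
because this constructor always quotes '%' and would double-encode them.

```java
import java.net.MalformedURLException;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.URL;

// Hypothetical helper class, not part of HttpClient or the JDK.
class PipeUriFix {
    // Parse leniently with URL, then rebuild as a URI so that illegal
    // characters such as '|' are percent-encoded.
    // Assumption: the input contains no %xx escapes (they would be re-encoded).
    static URI toUri(String s) {
        try {
            URL url = new URL(s);
            return new URI(url.getProtocol(), url.getAuthority(),
                           url.getPath(), url.getQuery(), url.getRef());
        } catch (MalformedURLException | URISyntaxException e) {
            throw new IllegalArgumentException("Cannot convert to URI: " + s, e);
        }
    }

    public static void main(String[] args) {
        String raw = "http://www.target.com/webapp/wcs/stores/servlet/s"
            + "?searchTerm=Jared+Diamond&category=0|All|matchallany|all+categories";
        System.out.println(toUri(raw));
    }
}
```

The resulting URI can then be handed to HttpClient 4.x, e.g. via the 
HttpGet(URI) constructor. Whether the server treats %7C the same as a literal 
'|' is up to the server, so it's worth checking the response.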

-- Ken

--------------------------------------------
http://about.me/kkrugler
+1 530-210-6378




--------------------------
Ken Krugler
http://www.scaleunlimited.com
custom big data solutions & training
Hadoop, Cascading, Mahout & Solr



