On Sun, 2012-03-25 at 14:19 -0400, Uncle wrote:
> > It is not HttpClient reporting a wrong response status. It is the server
> > behaving incorrectly. I get the same 404 when accessing the location
> > directly.
>
> What do you mean "directly"?
>

Without redirect.

> > The problem is that the server does not correctly handle URI
> > fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> > state how fragments in redirect locations should be handled. So, in my
> > opinion it is a server side issue.
>
> In my opinion, if 5 clients (HttpURLConnection, HttpClient, Chrome, Safari,
> Firefox) try to hit the URL, and 4 of them do so successfully and one does
> not, the issue is with the one client, not with the server. Many URL's are
> poorly formed or ambiguous, yet most clients take extra steps to access them,
> which makes them more useful.

HttpClient is not a browser but you are certainly entitled to have a
different opinion.

> I think that HttpClient should either do that or provide facilities for
> doing so.
>

It does. One can handle redirects differently by implementing a custom
RedirectStrategy and rewriting malformed redirect URIs in a way which is
acceptable in the context of a specific application

> > The URL has illegal character(s), which is the reason why the redirect
> > fails.
>
> The Java toolkit and browsers URLEncode the URL, which avoids this problem.
> This seems like a good general approach when redirecting.
>

See above.

Oleg

> Randy
>
> On Mar 24, 2012, at 7:59 PM, Oleg Kalnichevski wrote:
>
> > On Sat, 2012-03-24 at 16:46 -0400, Uncle wrote:
> >> On Mar 24, 2012, at 2:48 PM, Oleg Kalnichevski wrote:
> >>
> >>> On Sat, 2012-03-24 at 08:50 -0400, Uncle wrote:
> >>>> Apologies if this has been addressed, I searched the archives and was
> >>>> unable to find anything directly relating to this, though it seems
> >>>> straightforward.
> >>>>
> >>>> I am trying to use httpclient to obtain the redirect URL for a url such
> >>>> as http://bit.ly/GGviSv, but I am getting a 404 error. This is a
> >>>> "permanent" redirect (code 301).
> >>>> This code:
> >>>>
> >>>>     String url = "http://bit.ly/GGviSv";
> >>>>     HttpGet httpget = new HttpGet(url);
> >>>>     HttpContext context = new BasicHttpContext();
> >>>>     HttpClient httpclient = new DefaultHttpClient();
> >>>>
> >>>>     HttpResponse response = httpclient.execute(httpget, context);
> >>>>
> >>>>     RedirectStrategy redirectStrategy = new DefaultRedirectStrategy();
> >>>>
> >>>>     log.info("isRedirected = " +
> >>>>         redirectStrategy.isRedirected(httpget, response, context));
> >>>>     for(Header header : response.getAllHeaders())
> >>>>         log.info("header: " + header);
> >>>>
> >>>>     log.info("status = " + response.getStatusLine());
> >>>>
> >>>> outputs:
> >>>>
> >>>>     isRedirected = false
> >>>>     header: Server: nginx
> >>>>     header: Date: Sat, 24 Mar 2012 12:38:43 GMT
> >>>>     header: Content-Type: text/html; charset=UTF-8
> >>>>
> >>>>
> >>>>     header: Transfer-Encoding: chunked
> >>>>     header: Connection: keep-alive
> >>>>     header: Vary: Cookie
> >>>>     header: X-CF-Powered-By: WP 1.2.0
> >>>>     header: X-Pingback: http://lavamagazine.com/xmlrpc.php
> >>>>     header: Expires: Wed, 11 Jan 1984 05:00:00 GMT
> >>>>     header: Last-Modified: Sat, 24 Mar 2012 12:38:43 GMT
> >>>>     header: Cache-Control: no-cache, must-revalidate, max-age=0
> >>>>     header: Pragma: no-cache
> >>>>     status = HTTP/1.1 404 Not Found
> >>>>
> >>>> I expected 1) isRedirected to be true, 2) the response code to be 301,
> >>>> and/or 3) the destination URL to be in the headers where I could get it.
> >>>> However, if I ignore the 404 and continue getting the URL:
> >>>>
> >>>>     HttpUriRequest currentReq = (HttpUriRequest) context.getAttribute(
> >>>>         ExecutionContext.HTTP_REQUEST );
> >>>>     HttpHost currentHost = (HttpHost)
> >>>>         context.getAttribute(ExecutionContext.HTTP_TARGET_HOST);
> >>>>     String currentUrl = (currentReq.getURI().isAbsolute()) ?
> >>>>         currentReq.getURI().toString() : (currentHost.toURI() +
> >>>>         currentReq.getURI());
> >>>>     httpclient.getConnectionManager().shutdown();
> >>>>     log.info("Redirected URL = " + currentUrl);
> >>>>
> >>>> This does the right thing and provides me with the correct URL. So, why
> >>>> the 404 error? I am processing a large quantity of URL's and need to
> >>>> accurately determine which ones are errors, redirects, etc.
> >>>>
> >>>> Thanks for any assistance.
> >>>>
> >>>> Randy
> >>>>
> >>>
> >>> As far as I can tell HttpClient correctly redirects to the new location,
> >>> but the resource is simply no longer there.
> >>>
> >>> [DEBUG] headers - >> GET /GGviSv HTTP/1.1
> >>> [DEBUG] headers - >> Host: bit.ly
> >>> [DEBUG] headers - >> Connection: Keep-Alive
> >>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> >>> [DEBUG] headers - << HTTP/1.1 301 Moved
> >>> [DEBUG] headers - << Server: nginx
> >>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:44 GMT
> >>> [DEBUG] headers - << Content-Type: text/html; charset=utf-8
> >>> [DEBUG] headers - << Connection: keep-alive
> >>> [DEBUG] headers - << Set-Cookie: _bit=4f6e1694-00156-016bf-3d1cf10a;domain=.bit.ly;expires=Thu Sep 20 18:46:44 2012;path=/; HttpOnly
> >>> [DEBUG] headers - << Cache-control: private; max-age=90
> >>> [DEBUG] headers - << Location: http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >>> [DEBUG] headers - << MIME-Version: 1.0
> >>> [DEBUG] headers - << Content-Length: 185
> >>> [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> >>> [DEBUG] headers - >> Host: lavamagazine.com
> >>> [DEBUG] headers - >> Connection: Keep-Alive
> >>> [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> >>> [DEBUG] headers - << HTTP/1.1 404 Not Found
> >>> [DEBUG] headers - << Server: nginx
> >>> [DEBUG] headers - << Date: Sat, 24 Mar 2012 18:46:45 GMT
> >>> [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> >>> [DEBUG] headers - << Transfer-Encoding: chunked
> >>> [DEBUG] headers - << Connection: keep-alive
> >>> [DEBUG] headers - << Vary: Cookie
> >>> [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> >>> [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> >>> [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> >>> [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 18:46:45 GMT
> >>> [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> >>> [DEBUG] headers - << Pragma: no-cache
> >>>
> >>> Oleg
> >>>
> >>> ---------------------------------------------------------------------
> >>> To unsubscribe, e-mail: [email protected]
> >>> For additional commands, e-mail: [email protected]
> >>>
> >>
> >> Yet, if you hit the URL:
> >>
> >> http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >>
> >> with your browser, the content comes up fine.
> >>
> >> Hitting the redirect URL with the standard Java HttpURLConnection class
> >> does not produce the 404:
> >>
> >>     String url = "http://bit.ly/GGviSv";
> >>     URL urlObj = new URL(url);
> >>     HttpURLConnection urlConnection =
> >>         (HttpURLConnection)urlObj.openConnection();
> >>     urlConnection.setRequestMethod("GET");
> >>     urlConnection.setConnectTimeout(15000);
> >>     urlConnection.setReadTimeout(30000);
> >>     urlConnection.connect();
> >>     log.info("Response code = " + urlConnection.getResponseCode());
> >>     InputStream inputStream = urlConnection.getInputStream();
> >>     log.info("Redirected URL = " + urlConnection.getURL().toString());
> >>
> >> This outputs:
> >>
> >>     Response code = 200
> >>     Redirected URL = http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2
> >>
> >> So HttpClient reports a 404, but HttpURLConnection reports a 200 and my
> >> browsers (Safari, Chrome, and Firefox) all hit the link fine.
> >>
> >
> > It is not HttpClient reporting a wrong response status. It is the server
> > behaving incorrectly. I get the same 404 when accessing the location
> > directly. The problem is that the server does not correctly handle URI
> > fragment (the #axzz1pdAzTzT2 bit). The HTTP spec does not explicitly
> > state how fragments in redirect locations should be handled. So, in my
> > opinion it is a server side issue.
> >
> > You can work around the problem by using a custom redirect strategy that
> > rewrites the redirect location and strips away the fragment if present.
> >
> > [DEBUG] headers - >> GET /features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2 HTTP/1.1
> > [DEBUG] headers - >> Host: lavamagazine.com
> > [DEBUG] headers - >> Connection: Keep-Alive
> > [DEBUG] headers - >> User-Agent: Apache-HttpClient/4.2-beta2-SNAPSHOT (java 1.5)
> > [DEBUG] headers - << HTTP/1.1 404 Not Found
> > [DEBUG] headers - << Server: nginx
> > [DEBUG] headers - << Date: Sat, 24 Mar 2012 23:31:10 GMT
> > [DEBUG] headers - << Content-Type: text/html; charset=UTF-8
> > [DEBUG] headers - << Transfer-Encoding: chunked
> > [DEBUG] headers - << Connection: keep-alive
> > [DEBUG] headers - << Vary: Cookie
> > [DEBUG] headers - << X-CF-Powered-By: WP 1.2.0
> > [DEBUG] headers - << X-Pingback: http://lavamagazine.com/xmlrpc.php
> > [DEBUG] headers - << Expires: Wed, 11 Jan 1984 05:00:00 GMT
> > [DEBUG] headers - << Last-Modified: Sat, 24 Mar 2012 23:31:10 GMT
> > [DEBUG] headers - << Cache-Control: no-cache, must-revalidate, max-age=0
> > [DEBUG] headers - << Pragma: no-cache
> >
> >
> >> Here is another URL that is problematic:
> >>
> >> http://on.wsj.com/GHGlfS
> >>
> >> this produces:
> >>
> >> org.apache.http.client.ClientProtocolException
> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:822)
> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:754)
> >>     ... snip ...
> >> Caused by: org.apache.http.ProtocolException: Invalid redirect URI: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonĂ¢??s-death-an-accident/?mod=e2tw
> >>     at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:185)
> >>     at org.apache.http.impl.client.DefaultRedirectStrategy.getLocationURI(DefaultRedirectStrategy.java:116)
> >>     at org.apache.http.impl.client.DefaultRedirectStrategy.getRedirect(DefaultRedirectStrategy.java:193)
> >>     at org.apache.http.impl.client.DefaultRequestDirector.handleResponse(DefaultRequestDirector.java:1035)
> >>     at org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:492)
> >>     at org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
> >>     ... 28 more
> >> Caused by: java.net.URISyntaxException: Illegal character in path at index 72: http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houstonĂ¢??s-death-an-accident/?mod=e2tw
> >>     at java.net.URI$Parser.fail(URI.java:2809)
> >>     at java.net.URI$Parser.checkChars(URI.java:2982)
> >>     at java.net.URI$Parser.parseHierarchical(URI.java:3066)
> >>     at java.net.URI$Parser.parse(URI.java:3014)
> >>     at java.net.URI.<init>(URI.java:578)
> >>     at org.apache.http.impl.client.DefaultRedirectStrategy.createLocationURI(DefaultRedirectStrategy.java:183)
> >>     ... 33 more
> >>
> >> The redirected URL has a special character in it (single quote), and the
> >> client doesn't handle that. The Java code that I pasted above produces
> >>
> >
> > The URL has illegal character(s), which is the reason why the redirect
> > fails.
> >
> > Oleg
> >
> >> Response code = 200
> >> Redirected URL = http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston%e2%80%99s-death-an-accident/?%3fs-death-an-accident/%3fmod=e2tw
> >>
> >> Randy
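
The workaround suggested in this thread, a custom RedirectStrategy that
rewrites the redirect location, comes down to two string operations: dropping
the fragment and percent-encoding characters that java.net.URI rejects. Below
is a minimal sketch of that rewriting step using only the JDK. The class name
RedirectLocationCleaner and its methods are hypothetical and are not part of
HttpClient; wiring them into a RedirectStrategy implementation is left out.
The sketch assumes the Location value has been decoded correctly into a Java
String and that the target server expects UTF-8 percent-encoding.

```java
import java.net.URI;
import java.nio.charset.StandardCharsets;

// Hypothetical helper (not an HttpClient class): cleans a raw Location value
// so that java.net.URI can parse it.
class RedirectLocationCleaner {

    // Characters left untouched: RFC 3986 unreserved and reserved characters,
    // plus '%' so that existing percent-escapes are not encoded a second time.
    // Assumption: '[' and ']' only occur in IPv6 host literals.
    private static final String ALLOWED =
            "abcdefghijklmnopqrstuvwxyzABCDEFGHIJKLMNOPQRSTUVWXYZ0123456789"
            + "-._~:/?#[]@!$&'()*+,;=%";

    // Drops everything from the first '#' onwards.
    static String stripFragment(String location) {
        int hash = location.indexOf('#');
        return hash >= 0 ? location.substring(0, hash) : location;
    }

    // Percent-encodes every UTF-8 byte that is not an allowed ASCII character.
    static String encodeIllegalChars(String location) {
        StringBuilder out = new StringBuilder();
        for (byte b : location.getBytes(StandardCharsets.UTF_8)) {
            char c = (char) (b & 0xFF);
            if (b >= 0 && ALLOWED.indexOf(c) >= 0) {
                out.append(c);
            } else {
                out.append(String.format("%%%02X", b & 0xFF));
            }
        }
        return out.toString();
    }

    // Strips the fragment, encodes illegal characters, then parses the result.
    // URI.create throws IllegalArgumentException if the string is still invalid.
    static URI clean(String location) {
        return URI.create(encodeIllegalChars(stripFragment(location)));
    }

    public static void main(String[] args) {
        // The two redirect locations discussed in this thread.
        System.out.println(clean(
            "http://lavamagazine.com/features/video-biking-the-ironman-melbourne-run-course/#axzz1pdAzTzT2"));
        System.out.println(clean(
            "http://blogs.wsj.com/speakeasy/2012/03/22/coroner-rules-whitney-houston\u2019s-death-an-accident/?mod=e2tw"));
    }
}
```

Whether stripping the fragment or encoding the path is acceptable depends on
the application, which is why the thread leaves this to a custom strategy
rather than to the default one.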
