On Fri, 2009-04-10 at 15:55 -0700, Rutuja Joshi wrote:
> Hello,
>
> I am working on a web crawler application and using HttpClient by Apache
> for the same. I have the following issues that I am not able to resolve.
> (This is my first post and I am not sure to what extent I can provide
> details and ask questions, so please pardon me.)
>
> 1> Whenever I try to download a PDF file using HttpClient, the PDF that
> gets downloaded is approximately half the size of the one I download
> using Firefox. The same happens with a PNG file. Both Acrobat and the
> image viewer reject the files, saying the format is invalid. There may
> be something related to compression etc., but how do I find out? I am
> reading the response as an input stream, wrapping it in a buffered
> stream and writing it to a file. So basically I am just fetching the
> raw bytes from the response. If needed, I will provide a detailed log
> (I read about the wire log; I haven't tried it, but if needed I'll try
> to produce one and provide it to you).
>
Yes, a wire log would be quite helpful, as well as a code snippet
demonstrating the way you are using the HttpClient API.

Logging guide for HttpClient 4.0:
http://hc.apache.org/httpcomponents-client/logging.html

Logging guide for HttpClient 3.x:
http://hc.apache.org/httpclient-3.x/logging.html

> 2> How do I know if the file that I am fetching is a text file or
> not?

By the Content-Type response header.

> For example, given that I do not know the file type that I am fetching,
> is there any way to know from the content-type etc. what type of file I
> have fetched?
> I tried the Content-Type header; it's the same for a normal HTML file,
> a PDF file and also for an image file.
>

In this case the server side code is broken.

> 3> Redirects - I have set followredirects = true. I have one URL that
> redirects when accessed from Firefox, but not when using HttpClient.
> The status code for some reason is 200 (OK). Does this have to be 3XX
> for HttpClient to follow redirects? The HTML dump from HttpClient is
> as follows:
>
> <html>
> <head>
> <script>
> <!--
> redirect_url="http://www.feedroom.com/";
> window.location.replace(redirect_url);
> -->
> </script>
> </head>
> <body>
> Redirecting...<br/>
> This url is deprecated. If your browser doesn't immediately redirect
> you to the new url, please click the link below:<br/>
> <a href="http://www.feedroom.com/">http://www.feedroom.com/</a>
> </body>
> </html>
>

HttpClient takes care of redirects automatically. If you do not want
that, you can always disable automatic redirect handling.

Hope this helps

Oleg

> Thanks in advance!
> Rutuja
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: [email protected]
> For additional commands, e-mail: [email protected]
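
A note on question 3: the page quoted in the thread was served with status 200 and performs its redirect in JavaScript (window.location.replace). HttpClient's automatic redirect handling acts on HTTP 3xx responses and does not execute scripts, so a crawler would have to detect this case itself. The sketch below is one possible heuristic, not part of HttpClient: the class name ScriptRedirect and its find method are hypothetical, it uses only the Java standard library, and it recognizes just two simple patterns (a literal URL, or a variable assigned a literal URL as in the dump above). Real pages may need a proper HTML/JavaScript parser.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient: looks for a script-based
// redirect target in an HTML body that was returned with status 200.
final class ScriptRedirect {

    // window.location = "url", location.href = 'url', window.location.replace("url")
    private static final Pattern LITERAL = Pattern.compile(
            "(?:window\\.)?location(?:\\.href)?\\s*(?:=|\\.replace\\s*\\()\\s*[\"']([^\"']+)[\"']");

    // window.location.replace(someVariable)
    private static final Pattern VIA_VARIABLE = Pattern.compile(
            "(?:window\\.)?location\\.replace\\s*\\(\\s*([A-Za-z_$][\\w$]*)\\s*\\)");

    static Optional<String> find(String html) {
        // Case 1: the URL is written directly as a string literal.
        Matcher literal = LITERAL.matcher(html);
        if (literal.find()) {
            return Optional.of(literal.group(1));
        }
        // Case 2: the URL is held in a variable; look up its string assignment.
        Matcher viaVariable = VIA_VARIABLE.matcher(html);
        if (viaVariable.find()) {
            Pattern assignment = Pattern.compile(
                    "\\b" + Pattern.quote(viaVariable.group(1))
                            + "\\s*=\\s*[\"']([^\"']+)[\"']");
            Matcher value = assignment.matcher(html);
            if (value.find()) {
                return Optional.of(value.group(1));
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Shortened version of the HTML dump from the thread.
        String html = "<html><head><script>\n"
                + "redirect_url=\"http://www.feedroom.com/\";\n"
                + "window.location.replace(redirect_url);\n"
                + "</script></head><body>Redirecting...</body></html>";
        System.out.println(find(html).orElse("no script redirect found"));
    }
}
```

Run against the shortened dump, main prints http://www.feedroom.com/. A crawler could call find on any 200 response with an HTML body and, if a target is returned, issue a second request for it.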

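
A note on question 2: when the server sends the same Content-Type for everything, as described in the thread, one fallback a crawler might use is to inspect the first bytes of the body. The sketch below is an assumption-laden illustration rather than anything HttpClient provides: ContentSniffer, Kind and guess are hypothetical names, only the PDF and PNG signatures are checked, and the text test is a crude heuristic (no control bytes other than tab, LF and CR). It can also serve as a quick check on the files from question 1, since a valid PDF starts with "%PDF-" and a valid PNG starts with a fixed 8-byte signature.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper: guesses a file type from the first bytes of a body.
final class ContentSniffer {

    enum Kind { PDF, PNG, TEXT, BINARY }

    // "%PDF-" marks the start of a PDF file.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // The fixed 8-byte PNG file signature.
    private static final byte[] PNG_MAGIC =
            {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    static Kind guess(byte[] head) {
        if (head.length == 0) {
            return Kind.BINARY; // nothing to inspect
        }
        if (startsWith(head, PDF_MAGIC)) {
            return Kind.PDF;
        }
        if (startsWith(head, PNG_MAGIC)) {
            return Kind.PNG;
        }
        return looksLikeText(head) ? Kind.TEXT : Kind.BINARY;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    // Crude heuristic: text contains no control bytes other than tab, LF, CR.
    private static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] head = "%PDF-1.4\n".getBytes(StandardCharsets.US_ASCII);
        System.out.println(guess(head));
    }
}
```

In practice the first few hundred bytes read from the response stream would be passed to guess before the rest of the body is written to disk.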