Hello,
I am working on a web crawler application and using Apache HttpClient
for it. I have the following issues that I have not been able to resolve.
(This is my first post and I am not sure how much detail I should
provide or what I may ask, so please pardon me.)
1> Whenever I try to download a PDF file using HttpClient, the file that
gets downloaded is approximately half the size of the one I download
using Firefox. The same happens with a PNG file. Both Acrobat and the
image viewer reject the files, saying the format is invalid. It may be
something related to compression, but how do I find out? I read the
response as an input stream, wrap it in a buffered stream, and write it
to a file, so basically I am just fetching the raw bytes from the
response. If needed, I will provide a detailed log (I have read about
the wire log but haven't tried it yet; if needed, I will try to produce
one and send it).
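For question 1, here is a minimal sketch of a byte-for-byte copy of the kind described above, using only java.io. It is an illustration under assumptions, not the crawler's actual code: the class name StreamCopy and the method copy are hypothetical, and the input stream is assumed to be whatever stream the HTTP response body provides. Two things may be worth ruling out when the output is about half the expected size: calling read() more than once per loop iteration (each call consumes a byte, so every other byte is lost), and passing binary data through a Reader or Writer, which applies a character encoding.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Hypothetical helper, for illustration only.
class StreamCopy {
    // Copies every byte from in to out unchanged and returns the byte count.
    static long copy(InputStream in, OutputStream out) throws IOException {
        byte[] buffer = new byte[8192];
        long total = 0;
        int n;
        // read(buffer) is called exactly once per iteration; its return
        // value says how many bytes of the buffer are valid.
        while ((n = in.read(buffer)) != -1) {
            out.write(buffer, 0, n);
            total += n;
        }
        out.flush();
        return total;
    }
}
```

With a real response, in would be the response body stream and out a FileOutputStream. Comparing the returned count with the Content-Length header, if the server sends one, is a quick sanity check.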
2> How do I know whether the file that I am fetching is a text file or
not? For example, given that I do not know the type of the file I am
fetching, is there any way to tell from the Content-Type header or
similar what type of file I have fetched?
I tried the Content-Type header, but it's the same for a normal HTML
file, a PDF file, and an image file.
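For question 2, when the Content-Type header does not distinguish the files, one option is to inspect the first bytes of the body. The sketch below is a heuristic under stated assumptions, not a complete type detector: the class ContentSniffer and its methods are hypothetical, only the PDF and PNG signatures are checked ("%PDF-" and the 8-byte PNG signature), and "text" simply means no NUL bytes or unusual control characters in the sample.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical helper, for illustration only.
class ContentSniffer {
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    private static final byte[] PNG_MAGIC =
        {(byte) 0x89, 'P', 'N', 'G', 0x0D, 0x0A, 0x1A, 0x0A};

    // Guesses a media type from the first bytes of a body.
    static String sniff(byte[] head) {
        if (startsWith(head, PDF_MAGIC)) return "application/pdf";
        if (startsWith(head, PNG_MAGIC)) return "image/png";
        return looksLikeText(head) ? "text/plain" : "application/octet-stream";
    }

    static boolean startsWith(byte[] data, byte[] prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    // Crude check: rejects NUL and control bytes other than tab, LF, CR.
    static boolean looksLikeText(byte[] data) {
        for (byte b : data) {
            int c = b & 0xFF;
            if (c < 0x20 && c != '\t' && c != '\n' && c != '\r') return false;
        }
        return true;
    }
}
```

A few hundred bytes from the start of the response are enough as input. HTML is reported as text/plain here because the sketch does not try to tell text formats apart.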
3> Redirects - I have set followredirects = true. I have one URL that
redirects when accessed from Firefox, but not when using HttpClient.
The status code for some reason is 200 (OK). Does it have to be 3XX
for HttpClient to follow the redirect? The HTML dump from HttpClient is
as follows:
<html>
<head>
<script>
<!--
redirect_url="http://www.feedroom.com/";
window.location.replace(redirect_url);
-->
</script>
</head>
<body>
Redirecting...<br/>
This url is deprecated. If your browser doesn't immediately redirect
you to the new url, please click the link below:<br/>
<a href="http://www.feedroom.com/">http://www.feedroom.com/</a>
</body>
</html>
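For question 3, the dump above is a 200 response whose body performs the redirect in JavaScript, so it is not an HTTP-level redirect. A browser follows it because it executes the script, while a client that only acts on 3XX responses with a Location header would stop at this page. A crawler could look for the target itself. The sketch below assumes the page uses exactly the redirect_url="..." assignment shown in the dump; the class ScriptRedirectFinder and the method findRedirect are hypothetical, and the regular expression is a heuristic, not a JavaScript parser.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only.
class ScriptRedirectFinder {
    // Matches: redirect_url = "http://..." (double-quoted value only).
    private static final Pattern REDIRECT_URL =
        Pattern.compile("redirect_url\\s*=\\s*\"([^\"]+)\"");

    // Returns the quoted redirect target, or null if the pattern is absent.
    static String findRedirect(String html) {
        Matcher m = REDIRECT_URL.matcher(html);
        return m.find() ? m.group(1) : null;
    }
}
```

If findRedirect returns a non-null URL, the crawler can queue that URL as a new request. Pages that build the target differently would need their own pattern.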
Thanks in advance!
Rutuja