Normally you'd let HttpClient handle decoding, chunked responses, etc. Then what you save is the raw content (as an array of bytes) and the response headers.

Converting the above into a parsable page is something best handled by Tika (as an example), since it will attempt to determine the charset encoding for the bytes, based on the response header, the HTML markup, and (worst case) statistics from the bytes.

The second, off-line step has nothing to do with HttpClient.
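
As a rough illustration of that off-line step, here is a minimal sketch in plain Java that only covers the first of the signals mentioned above, the charset named in the saved Content-Type header. It does not use Tika and is not how Tika works internally; the class name SavedPageDecoder, its method names, and the ISO-8859-1 fallback are all assumptions made for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper: turns saved raw bytes plus a saved Content-Type
// header value into text. Only the header is consulted; HTML meta tags
// and byte statistics are deliberately left out of this sketch.
final class SavedPageDecoder {

    // Returns the charset named in the Content-Type value, or the fallback
    // when the header is missing, has no charset parameter, or names a
    // charset this JVM does not know.
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType == null) {
            return fallback;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            if (p.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = p.substring("charset=".length()).trim();
                // Strip optional double quotes around the charset name.
                if (name.length() >= 2 && name.startsWith("\"") && name.endsWith("\"")) {
                    name = name.substring(1, name.length() - 1);
                }
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Illegal or unsupported charset name.
                    return fallback;
                }
            }
        }
        return fallback;
    }

    // Decodes the saved bytes, assuming ISO-8859-1 when the header gives no usable charset.
    static String decode(byte[] rawContent, String contentType) {
        Charset cs = charsetFromContentType(contentType, StandardCharsets.ISO_8859_1);
        return new String(rawContent, cs);
    }
}
```

A call such as SavedPageDecoder.decode(bytes, "text/html; charset=UTF-8") would decode the bytes as UTF-8. A real crawler would still want a detector for pages whose header is missing or wrong.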

-- Ken

On Feb 11, 2011, at 1:30am, CodingForever wrote:


Thank you, I appreciate it.
That works the way I want.
As you can see, I am trying to decode an HTML page off-line from its saved
header and content. I am not very experienced with HttpClient, so I could
not find the best solution for my problem.

Suppose you have a header and the raw content (gzip, deflate, possibly
chunked). I need a solution where I supply the header and the content and
then read the decoded output until the end of the page.
Can you suggest a solution for this problem?

Best Regards.

olegk wrote:

On Fri, 2011-02-11 at 00:39 -0800, CodingForever wrote:
Thanks, olegk, for the answer. I am looking into it now, but I have one
more question.
I wrote the code below. How can I get the decoded content using the header
parameters?

String s =
   "HTTP/1.1 200 OK\r\n"
   + "Server: whatever\r\n"
   + "Date: some date\r\n"
   + "Set-Cookie: c1=stuff\r\n"
   + "Transfer-Encoding: chunked\r\n"
   + "Content-Type: text/html; charset=ISO-8859-1\r\n"
   + "\r\n"
   + "5\r\n01234\r\n5\r\n56789\r\n6\r\nabcdef\r\n0\r\n\r\n test";
SessionInputBuffer inbuffer = new SessionInputBufferMockup(s, "US-ASCII");
HttpResponseParser parser = new HttpResponseParser(
   inbuffer,
   BasicLineParser.DEFAULT,
   new DefaultHttpResponseFactory(),
   new BasicHttpParams());
HttpResponse response = (HttpResponse) parser.parse();
EntityDeserializer deserializer = new EntityDeserializer(
   new LaxContentLengthStrategy());
HttpEntity entity = deserializer.deserialize(inbuffer, response);
System.out.println(EntityUtils.toString(entity, HTTP.ASCII));
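
For comparison, the chunked body in the string above can also be unwrapped without HttpCore. The following is only a minimal sketch in plain Java, not the HttpCore implementation: the class name ChunkedBodyDecoder is hypothetical, the input is assumed to be the body bytes that follow the blank line after the headers, and chunk extensions and trailer headers are ignored rather than processed.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical helper: removes "Transfer-Encoding: chunked" framing
// from a message body that is already fully in memory.
final class ChunkedBodyDecoder {

    static byte[] decode(byte[] body) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            // Each chunk starts with a hexadecimal size line ending in CRLF.
            int eol = indexOfCrlf(body, pos);
            if (eol < 0) {
                throw new IllegalArgumentException("Missing chunk-size line");
            }
            String sizeLine = new String(body, pos, eol - pos, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) {
                sizeLine = sizeLine.substring(0, semi); // drop chunk extensions
            }
            int size = Integer.parseInt(sizeLine.trim(), 16);
            pos = eol + 2;
            if (size == 0) {
                // Last chunk: anything after it (trailers, stray bytes) is ignored.
                return out.toByteArray();
            }
            if (size < 0 || pos + size > body.length) {
                throw new IllegalArgumentException("Truncated chunk");
            }
            out.write(body, pos, size);
            pos += size + 2; // skip the chunk data and its trailing CRLF
        }
    }

    // Returns the index of the next CRLF at or after 'from', or -1 if none.
    private static int indexOfCrlf(byte[] data, int from) {
        for (int i = from; i + 1 < data.length; i++) {
            if (data[i] == '\r' && data[i + 1] == '\n') {
                return i;
            }
        }
        return -1;
    }
}
```

Applied to the body of the example message, this yields the bytes of "0123456789abcdef"; the " test" after the terminating zero-size chunk is dropped. If the saved headers also carried Content-Encoding: gzip, the de-chunked bytes could then be passed through java.util.zip.GZIPInputStream.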

---

Oleg


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]








--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
e l a s t i c   w e b   m i n i n g





