Thanks, I will use Tika after the HttpClient operations.


Ken Krugler wrote:
>
> Normally you'd let HttpClient handle decoding, chunked responses, etc.
> Then what you save is the raw content (as an array of bytes) and the
> response headers.
>
> Converting the above into a parsable page is something best handled by
> Tika (as an example), since it will attempt to determine the charset
> encoding for the bytes, based on the response header, the HTML markup,
> and (worst case) statistics from the bytes.
>
> The second, off-line step has nothing to do with HttpClient.
>
> -- Ken
>
> On Feb 11, 2011, at 1:30am, CodingForever wrote:
>
>>
>> I appreciate it.
>> That is working the way I want.
>> As you can see, I am trying to decode an HTML page offline, using its
>> header and content. I am not an expert in HttpClient, so I could not
>> find the best solution for my problem.
>>
>> Suppose you have a header and content (raw content: gzip, deflate,
>> possibly chunked).
>> I need a solution where I supply the header and content, then read
>> the decoded output until the end of the page.
>> Can you offer me a solution for this problem?
>>
>> Best Regards.
>>
>> olegk wrote:
>>>
>>> On Fri, 2011-02-11 at 00:39 -0800, CodingForever wrote:
>>>> Thanks olegk for the answer. I am looking at that now, but I have a
>>>> question about the code I wrote below. How can I get the decoded
>>>> content using the header parameters?
>>>
>>> String s =
>>>     "HTTP/1.1 200 OK\r\n"
>>>     + "Server: whatever\r\n"
>>>     + "Date: some date\r\n"
>>>     + "Set-Cookie: c1=stuff\r\n"
>>>     + "Transfer-Encoding: chunked\r\n"
>>>     + "Content-Type: text/html; charset=ISO-8859-1\r\n"
>>>     + "\r\n"
>>>     + "5\r\n01234\r\n5\r\n56789\r\n6\r\nabcdef\r\n0\r\n\r\n test";
>>> SessionInputBuffer inbuffer = new SessionInputBufferMockup(s, "US-ASCII");
>>> HttpResponseParser parser = new HttpResponseParser(
>>>         inbuffer,
>>>         BasicLineParser.DEFAULT,
>>>         new DefaultHttpResponseFactory(),
>>>         new BasicHttpParams());
>>> HttpResponse response = (HttpResponse) parser.parse();
>>> EntityDeserializer deserializer = new EntityDeserializer(
>>>         new LaxContentLengthStrategy());
>>> HttpEntity entity = deserializer.deserialize(inbuffer, response);
>>> System.out.println(EntityUtils.toString(entity, HTTP.ASCII));
>>>
>>> ---
>>>
>>> Oleg
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Header-and-Content-parsing-and-saving-as-html-page-tp30897495p30899580.html
>> Sent from the HttpClient-User mailing list archive at Nabble.com.
>>
>
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g
>
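
For the offline step described above (saved response headers plus a raw byte array), here is a minimal sketch using only the JDK. It is not HttpClient or Tika code: the class name OfflineBodyDecoder and its methods are hypothetical, the header map is assumed to be keyed by lower-case header name, and the body is assumed to have had its chunking removed already. It only looks at the Content-Type header for the charset and falls back to ISO-8859-1, so it skips the markup and statistical checks that Tika would attempt.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for illustration; not part of HttpClient or Tika. */
class OfflineBodyDecoder {

    // Matches the charset parameter of a Content-Type value, quoted or not.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*\"?([^\\s;\"]+)", Pattern.CASE_INSENSITIVE);

    /** Undoes Content-Encoding (gzip or deflate) on an already de-chunked body. */
    static byte[] decompress(byte[] body, String contentEncoding) throws IOException {
        if (contentEncoding == null) {
            return body;
        }
        String enc = contentEncoding.trim().toLowerCase(Locale.ROOT);
        InputStream in;
        if (enc.equals("gzip") || enc.equals("x-gzip")) {
            in = new GZIPInputStream(new ByteArrayInputStream(body));
        } else if (enc.equals("deflate")) {
            // Assumption: zlib-wrapped deflate; some servers send raw deflate instead.
            in = new InflaterInputStream(new ByteArrayInputStream(body));
        } else {
            return body; // "identity", or an encoding this sketch does not handle
        }
        try (InputStream src = in; ByteArrayOutputStream out = new ByteArrayOutputStream()) {
            byte[] buf = new byte[4096];
            int n;
            while ((n = src.read(buf)) != -1) {
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        }
    }

    /** Reads the charset from a Content-Type value, or returns the fallback. */
    static Charset charsetFromContentType(String contentType, Charset fallback) {
        if (contentType != null) {
            Matcher m = CHARSET.matcher(contentType);
            if (m.find()) {
                try {
                    return Charset.forName(m.group(1));
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: use the fallback.
                }
            }
        }
        return fallback;
    }

    /** Decompresses the body, then decodes it to text using the header charset. */
    static String decode(Map<String, String> headers, byte[] body) throws IOException {
        byte[] plain = decompress(body, headers.get("content-encoding"));
        Charset cs = charsetFromContentType(
                headers.get("content-type"), StandardCharsets.ISO_8859_1);
        return new String(plain, cs);
    }

    public static void main(String[] args) throws IOException {
        // Build a gzip-compressed ISO-8859-1 body to stand in for a saved response.
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        try (GZIPOutputStream gz = new GZIPOutputStream(raw)) {
            gz.write("<html>caf\u00e9</html>".getBytes(StandardCharsets.ISO_8859_1));
        }
        Map<String, String> headers = new HashMap<>();
        headers.put("content-encoding", "gzip");
        headers.put("content-type", "text/html; charset=ISO-8859-1");
        System.out.println(decode(headers, raw.toByteArray()));
    }
}
```

If the header carries no charset, or the page declares a different one in its markup, the ISO-8859-1 fallback here may produce wrong characters. That is the case where handing the decompressed bytes to Tika should do better.
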
--
View this message in context: http://old.nabble.com/Header-and-Content-parsing-and-saving-as-html-page-tp30897495p30901913.html
Sent from the HttpClient-User mailing list archive at Nabble.com.
