Thanks, I will use Tika after the HttpClient operations.
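
For the archive, here is a rough JDK-only sketch of the offline step discussed below: take the saved raw body plus the saved header values, undo the chunked transfer coding, undo gzip/deflate, and pick a charset from the Content-Type header. The class name SavedBodyDecoder and its method names are made up for illustration; they are not part of HttpClient or Tika. It assumes "deflate" means zlib-wrapped data, ignores chunk extensions and trailers, and only looks at the header for the charset, whereas Tika would also inspect the markup and the bytes.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.zip.GZIPInputStream;
import java.util.zip.InflaterInputStream;

/** Hypothetical helper for decoding a saved response body; not an HttpClient or Tika API. */
public class SavedBodyDecoder {

    /** Removes chunked transfer coding. Chunk extensions and trailers are ignored. */
    public static byte[] dechunk(byte[] raw) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        int pos = 0;
        while (true) {
            int eol = indexOfCrlf(raw, pos);
            if (eol < 0) throw new IOException("Missing chunk-size line");
            String sizeLine = new String(raw, pos, eol - pos, StandardCharsets.US_ASCII);
            int semi = sizeLine.indexOf(';');
            if (semi >= 0) sizeLine = sizeLine.substring(0, semi); // drop chunk extension
            int size;
            try {
                size = Integer.parseInt(sizeLine.trim(), 16);
            } catch (NumberFormatException e) {
                throw new IOException("Bad chunk size: " + sizeLine);
            }
            if (size == 0) return out.toByteArray(); // last chunk; anything after is ignored
            int start = eol + 2;
            if (size < 0 || size > raw.length - start) throw new IOException("Truncated chunk");
            out.write(raw, start, size);
            pos = start + size + 2; // skip the chunk data and its trailing CRLF
        }
    }

    /** Undoes the Content-Encoding; unknown or missing values pass the bytes through. */
    public static byte[] decodeContent(byte[] body, String contentEncoding) throws IOException {
        String enc = contentEncoding == null ? "" : contentEncoding.trim().toLowerCase(Locale.ROOT);
        InputStream in = new ByteArrayInputStream(body);
        if (enc.equals("gzip") || enc.equals("x-gzip")) {
            in = new GZIPInputStream(in);
        } else if (enc.equals("deflate")) {
            in = new InflaterInputStream(in); // assumption: zlib-wrapped deflate
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[8192];
        for (int n; (n = in.read(buf)) != -1; ) out.write(buf, 0, n);
        return out.toByteArray();
    }

    /** Reads the charset parameter of a Content-Type value, or returns the fallback. */
    public static Charset charsetOf(String contentType, Charset fallback) {
        if (contentType != null) {
            for (String part : contentType.split(";")) {
                String p = part.trim();
                if (p.regionMatches(true, 0, "charset=", 0, 8)) {
                    try {
                        return Charset.forName(p.substring(8).trim().replace("\"", ""));
                    } catch (IllegalArgumentException e) {
                        // unknown or malformed charset name: use the fallback
                    }
                }
            }
        }
        return fallback;
    }

    private static int indexOfCrlf(byte[] b, int from) {
        for (int i = from; i + 1 < b.length; i++) {
            if (b[i] == '\r' && b[i + 1] == '\n') return i;
        }
        return -1;
    }

    public static void main(String[] args) throws IOException {
        // Same chunked body as in Oleg's example quoted below.
        byte[] raw = "5\r\n01234\r\n5\r\n56789\r\n6\r\nabcdef\r\n0\r\n\r\n"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] body = decodeContent(dechunk(raw), null);
        Charset cs = charsetOf("text/html; charset=ISO-8859-1", StandardCharsets.UTF_8);
        System.out.println(new String(body, cs)); // prints 0123456789abcdef
    }
}
```

The order matters: the transfer coding (chunked) is removed first, then the content coding (gzip/deflate), and only then are the bytes turned into characters. If HttpClient did the fetch, the first two steps are normally already done, and only the charset step is left for Tika.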

Ken Krugler wrote:
> 
> Normally you'd let HttpClient handle decoding, chunked responses, etc.  
> Then what you save is the raw content (as an array of bytes) and the  
> response headers.
> 
> Converting the above into a parsable page is something best handled by  
> Tika (as an example), since it will attempt to determine the charset  
> encoding for the bytes, based on the response header, the HTML markup,  
> and (worst case) statistics from the bytes.
> 
> The second, off-line step has nothing to do with HttpClient.
> 
> -- Ken
> 
> On Feb 11, 2011, at 1:30am, CodingForever wrote:
> 
>>
>> I appreciate it.
>> That works the way I want.
>> As you can see, I am trying to decode an HTML page offline, using the
>> headers and the content. I am not very experienced with HttpClient, so
>> I could not find the best solution for my problem.
>>
>> Suppose you have a header and the raw content (gzip, deflate, possibly
>> chunked). I need a solution where I supply the header and the content,
>> then read the decoded output until the end of the page.
>> Can you suggest a solution for this problem?
>>
>> Best Regards.
>>
>> olegk wrote:
>>>
>>> On Fri, 2011-02-11 at 00:39 -0800, CodingForever wrote:
>>>> Thanks for the answer, olegk. I am looking at that now, but I have a
>>>> question about the code I wrote below. How can I get the decoded
>>>> content using the header parameters?
>>>
>>> String s =
>>>    "HTTP/1.1 200 OK\r\n"
>>>    + "Server: whatever\r\n"
>>>    + "Date: some date\r\n"
>>>    + "Set-Cookie: c1=stuff\r\n"
>>>    + "Transfer-Encoding: chunked\r\n"
>>>    + "Content-Type: text/html; charset=ISO-8859-1\r\n"
>>>    + "\r\n"
>>>    + "5\r\n01234\r\n5\r\n56789\r\n6\r\nabcdef\r\n0\r\n\r\n test";
>>> SessionInputBuffer inbuffer = new SessionInputBufferMockup(s, "US-ASCII");
>>> HttpResponseParser parser = new HttpResponseParser(
>>>    inbuffer,
>>>    BasicLineParser.DEFAULT,
>>>    new DefaultHttpResponseFactory(),
>>>    new BasicHttpParams());
>>> HttpResponse response = (HttpResponse) parser.parse();
>>> EntityDeserializer deserializer = new EntityDeserializer(
>>>    new LaxContentLengthStrategy());
>>> HttpEntity entity = deserializer.deserialize(inbuffer, response);
>>> System.out.println(EntityUtils.toString(entity, HTTP.ASCII));
>>>
>>> ---
>>>
>>> Oleg
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>>
>>
>> -- 
>> View this message in context:
>> http://old.nabble.com/Header-and-Content-parsing-and-saving-as-html-page-tp30897495p30899580.html
>> Sent from the HttpClient-User mailing list archive at Nabble.com.
>>
> 
> --------------------------
> Ken Krugler
> +1 530-210-6378
> http://bixolabs.com
> e l a s t i c   w e b   m i n i n g

-- 
View this message in context: 
http://old.nabble.com/Header-and-Content-parsing-and-saving-as-html-page-tp30897495p30901913.html
Sent from the HttpClient-User mailing list archive at Nabble.com.