Hi Stijn,
I also tried entity.getContentEncoding(), but it returns null.
Is there any way to obtain the charset of a web page?
When we open this page in a browser such as Firefox, the browser detects the
charset, but when we request it with HttpClient or curl, we cannot get the
charset.
I think this is a big problem for a crawler: when we crawl a web page,
HttpClient gives us a stream, and we must know the charset of that page to
save it in the database. For some web pages, however, the response does not
seem to tell us the charset.
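
One possible workaround, sketched here under some assumptions: look for a
charset in the Content-Type header value first, and if there is none, scan
the first few kilobytes of the body for a charset declaration such as the
meta tag. The names CharsetSniffer, findCharset and detect are hypothetical
and not part of HttpClient. In HttpClient code the header value would
presumably come from response.getFirstHeader("Content-Type") and the bytes
from EntityUtils.toByteArray(entity). Note that entity.getContentEncoding()
reflects the Content-Encoding header (for example gzip), not the charset, so
null is expected there.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient: a sketch of header-then-meta
// charset detection using only the JDK.
final class CharsetSniffer {

    // Matches "charset=..." in a Content-Type value or inside a <meta> tag,
    // with or without quotes around the charset name.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Returns the first charset name found in the text, or null if none.
    static String findCharset(String text) {
        if (text == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(text);
        return m.find() ? m.group(1) : null;
    }

    // Tries the Content-Type header value first, then the start of the body.
    // Returns the fallback when nothing usable is declared.
    static Charset detect(String contentTypeHeader, byte[] body, Charset fallback) {
        String name = findCharset(contentTypeHeader);
        if (name == null && body != null) {
            // Assumption: the declaration sits in the first 4096 bytes.
            // ISO-8859-1 maps every byte to a char, so the ASCII markup
            // survives this provisional decoding whatever the real charset is.
            int n = Math.min(body.length, 4096);
            name = findCharset(new String(body, 0, n, StandardCharsets.ISO_8859_1));
        }
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name declared by the page.
            return fallback;
        }
    }
}
```

With the result one could decode the page as new String(bytes, charset)
before saving it. This simple scan can be fooled by the text "charset="
appearing in a script or in page content, so it is only a starting point.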
 
Khosro.


>________________________________
>From: Stijn Deknudt <[email protected]>
>To: HttpClient User Discussion <[email protected]>
>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>Sent: Tuesday, August 16, 2011 4:38 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>Hi Khosro,
>
>The Content-Type header is set (correctly) to "text/html", like Jon said.
>There's no header in the response that says anything about the
>character set, but you can obtain this information from the entity
>itself: the HTML contains the character set inside the meta tag:
><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>
>See also http://www.w3.org/International/O-charset for more
>information about the different ways to declare the character
>encoding.
>
>Kind regards,
>Stijn Deknudt.
>
>On 8/16/11, Jon Moore <[email protected]> wrote:
>> Hi,
>>
>> This is because the resource at www.annahar.com that you link to
>> returns a Content-Type header that just reads "text/html":
>>
>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>> * About to connect() to www.annahar.com port 80 (#0)
>> *   Trying 66.242.155.235... connected
>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>> > Host: www.annahar.com
>> > Accept: */*
>> >
>> < HTTP/1.1 200 OK
>> < Connection: close
>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>> < Server: Microsoft-IIS/6.0
>> < X-Powered-By: ASP.NET
>> < X-Powered-By: PHP/5.2.0
>> < Content-type: text/html
>> <
>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>                                  Dload  Upload   Total   Spent    Left  Speed
>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>> { [data not shown]
>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>> * Closing connection #0
>>
>> So httpclient is doing the right thing -- it's giving you access to
>> exactly what's in the header that's returned.
>>
>> Jon
>>
>>
>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>> <[email protected]> wrote:
>>> Hello,
>>> I use the following code to find the charset of a page, but it does not work
>>> for the page
>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>
>>> Code :
>>>  [code]
>>>
>>> try {
>>>     HttpClient httpclient = new DefaultHttpClient();
>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>     HttpGet httpget = new HttpGet(url);
>>>     HttpResponse response = httpclient.execute(httpget);
>>>     HttpEntity entity = response.getEntity();
>>>     if (entity != null) {
>>>         // Print the raw Content-Type header value, if the server sent one.
>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>         if (allHeaders.length > 0) {
>>>             System.out.println(allHeaders[0].getValue());
>>>         }
>>>     }
>>> } catch (ClientProtocolException e) {
>>>     e.printStackTrace();
>>> } catch (IOException e) {
>>>     e.printStackTrace();
>>> }
>>> [/code]
>>>
>>>
>>> The output of the above code is: text/html.
>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>> right?
>>>
>>> When I use
>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think
>>> is correct.
>>> It seems to work for some pages but not all of them. Why does this happen?
>>>
>>>
>>> Khosro.
>>
>>
>>
>
>
>-- 
>Stijn
>[email protected]
>
>
>
