Re: Obtaining charset of page from HttpResponse.

Jon Moore Tue, 16 Aug 2011 06:07:36 -0700

Hi Khosro,

Stijn is saying that you need to parse the text/html response body and
look for the <meta> tag that contains the charset. There are multiple
places the charset for an HTML webpage can be specified: please see
the link that Stijn sent for more details.
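
A rough sketch of that fallback, in case it helps: check the Content-Type header value first, and only if it carries no charset, scan the start of the body for a <meta> declaration. The class name CharsetSniffer, its method names, and the 4096-byte scan window are my own assumptions for illustration, not part of HttpClient. It also only works for ASCII-compatible encodings, so UTF-16 pages would need separate handling.

```java
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HttpClient): try the header first, then the meta tag. */
final class CharsetSniffer {

    // Matches charset=... in a Content-Type value such as "text/html; charset=UTF-8".
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([^\\s;\"']+)", Pattern.CASE_INSENSITIVE);

    // Matches the first <meta ...> tag that mentions a charset. Covers both
    // <meta http-equiv="Content-Type" content="text/html; charset=X"> and <meta charset="X">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta\\b[^>]*?charset\\s*=\\s*[\"']?([^\\s;\"'/>]+)", Pattern.CASE_INSENSITIVE);

    /** Returns the charset named in a Content-Type header value, or null if there is none. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = HEADER_CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Scans the first bytes of the body for a meta charset declaration, or returns null. */
    static String fromMetaTag(byte[] body) {
        // Assumption: the declaration appears within the first 4096 bytes.
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to one char, so ASCII markup survives this decode.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    /** Header charset if present, else meta charset, else the supplied fallback. */
    static String detect(String contentType, byte[] body, String fallback) {
        String charset = fromContentType(contentType);
        if (charset == null) {
            charset = fromMetaTag(body);
        }
        return charset != null ? charset : fallback;
    }
}
```

With HttpClient you would presumably feed it the value of response.getFirstHeader("Content-Type") (if that header is present) and the bytes from EntityUtils.toByteArray(entity), then decode the bytes with the charset it returns.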


Jon

On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
<[email protected]> wrote:
> Hi Stijn,
> I also tried entity.getContentEncoding(), but it returns null.
> Is there any way to obtain the charset of a webpage?
> When we open this page in a browser like Firefox, it detects the charset, but
> when we request it with HttpClient or curl, we cannot get the charset.
> I think this is a big problem for a crawler: when we crawl a webpage,
> HttpClient gives us a stream, and we must know the charset of that webpage
> to save it in a database, but for some webpages it seems we cannot get the
> charset.
>
> Khosro.
>
>
>>________________________________
>>From: Stijn Deknudt <[email protected]>
>>To: HttpClient User Discussion <[email protected]>
>>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>Sent: Tuesday, August 16, 2011 4:38 PM
>>Subject: Re: Obtaining charset of page from HttpResponse.
>>
>>Hi Khosro,
>>
>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>There's no header in the response that says anything about the
>>character set, but you can obtain this information from the entity
>>itself: the HTML contains the character set inside the meta tag:
>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>
>>See also http://www.w3.org/International/O-charset for more
>>information about the different ways to declare the character
>>encoding.
>>
>>Kind regards,
>>Stijn Deknudt.
>>
>>On 8/16/11, Jon Moore <[email protected]> wrote:
>>> Hi,
>>>
>>> This is because the resource at www.annahar.com that you link to
>>> returns a Content-Type header that just reads "text/html":
>>>
>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>> * About to connect() to www.annahar.com port 80 (#0)
>>> *   Trying 66.242.155.235... connected
>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>> > Host: www.annahar.com
>>> > Accept: */*
>>> >
>>> < HTTP/1.1 200 OK
>>> < Connection: close
>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>> < Server: Microsoft-IIS/6.0
>>> < X-Powered-By: ASP.NET
>>> < X-Powered-By: PHP/5.2.0
>>> < Content-type: text/html
>>> <
>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>> { [data not shown]
>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>> * Closing connection #0
>>>
>>> So httpclient is doing the right thing -- it's giving you access to
>>> exactly what's in the header that's returned.
>>>
>>> Jon
>>>
>>>
>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>> <[email protected]> wrote:
>>>> Hello,
>>>> I use the following code to find the charset of a page, but it does not work
>>>> for the page
>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>
>>>> Code :
>>>>  [code]
>>>>
>>>> try {
>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>     HttpGet httpget = new HttpGet(url);
>>>>     HttpResponse response = httpclient.execute(httpget);
>>>>     HttpEntity entity = response.getEntity();
>>>>     if (entity != null) {
>>>>         // Print the raw Content-Type header value, if the server sent one.
>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>         if (allHeaders.length > 0) {
>>>>             System.out.println(allHeaders[0].getValue());
>>>>         }
>>>>     }
>>>> } catch (ClientProtocolException e) {
>>>>     e.printStackTrace();
>>>> } catch (IOException e) {
>>>>     e.printStackTrace();
>>>> }
>>>> [/code]
>>>>
>>>>
>>>> The output of the above code is: text/html.
>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>> right?
>>>>
>>>> But when I use
>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think
>>>> is OK.
>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>
>>>>
>>>> Khosro.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>>
>>--
>>Stijn
>>[email protected]
>>
>>
>>

