Re: Obtaining charset of page from HttpResponse.

Stijn Deknudt Tue, 16 Aug 2011 06:28:03 -0700

I forgot to mention in my previous post that you can wrap the entity in
a BufferedHttpEntity if you still want to read its content as a stream:
the wrapper buffers the content in memory, so it is fetched from the
server only once even if you read the stream more than once.
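
Here is a minimal sketch of the "fetch once, sniff the meta tag, then
decode" idea, using only the JDK so it does not depend on any particular
HttpClient version. The class name MetaCharsetSniffer, the 4096-byte
sniff limit and the regular expression are my own assumptions, not part
of HttpClient; with HttpClient the InputStream would come from
entity.getContent().

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HttpClient.
final class MetaCharsetSniffer {

    // Matches charset=... inside a <meta ...> tag. Covers the http-equiv
    // form quoted in this thread and the shorter <meta charset="..."> form.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration appears near the start of the page.
    private static final int SNIFF_LIMIT = 4096;

    /** Reads the stream to the end exactly once and returns the raw bytes. */
    static byte[] readFully(InputStream in) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int n;
        while ((n = in.read(chunk)) != -1) {
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }

    /** Returns the charset named in a meta tag, or the fallback if none is usable. */
    static Charset detect(byte[] body, Charset fallback) {
        int len = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps each byte to one char, so the ASCII markup is readable
        // regardless of the real encoding of the page.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = META_CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Unknown or malformed charset name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Buffers the stream once, then decodes the same bytes with the detected charset. */
    static String decode(InputStream in, Charset fallback) throws IOException {
        byte[] body = readFully(in);
        return new String(body, detect(body, fallback));
    }

    public static void main(String[] args) throws IOException {
        // Sample page standing in for a real response body.
        byte[] page = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(StandardCharsets.UTF_8);
        System.out.println(decode(new ByteArrayInputStream(page), StandardCharsets.ISO_8859_1));
    }
}
```

The point of the design is that the bytes are kept in memory, so the
second pass (the real decoding) works on the buffer and never goes back
to the server. A real crawler would probably check the Content-Type
header first and only fall back to the meta tag when the header carries
no charset parameter.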


Kind regards,
Stijn.

On 8/16/11, Stijn Deknudt <[email protected]> wrote:
> Hi Khosro,
>
> As described in http://www.w3.org/International/O-charset, there are
> different ways to declare the character encoding. Because the site you
> mention doesn't provide the encoding in the header (see the article:
> Send the 'charset' parameter in the Content-Type header of HTTP),
> you'll need to get the entity and find the encoding yourself in the
> content.
> One way to do this is to use EntityUtils to read the content as bytes,
> search it for the Content-Type meta tag, and decode the bytes with the
> charset named there. This means you don't consume the stream directly;
> if you did, you would need to fetch the content twice: once to read
> far enough to find the character set information, and again to read
> the whole entity with that character set.
>
> Kind regards,
> Stijn.
>
> On 8/16/11, Khosro Asgharifard Sharabiani <[email protected]>
> wrote:
>> Hi Stijn,
>> I also tried entity.getContentEncoding(), but it returns null.
>> Is there any way to obtain the charset of a web page?
>> When we open this page in a browser like Firefox, it detects the
>> charset, but when we request it with HttpClient or curl, we cannot get
>> the charset.
>> I think this is a big problem for a crawler: when we crawl a web page,
>> HttpClient gives us a stream, and we must know the charset of that page
>> to save it in the database, but for some pages we cannot get it.
>>
>> Khosro.
>>
>>
>>>________________________________
>>>From: Stijn Deknudt <[email protected]>
>>>To: HttpClient User Discussion <[email protected]>
>>>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>>Sent: Tuesday, August 16, 2011 4:38 PM
>>>Subject: Re: Obtaining charset of page from HttpResponse.
>>>
>>>Hi Khosro,
>>>
>>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>There's no header in the response that says anything about the
>>>character set, but you can obtain this information from the entity
>>>itself: the HTML contains the character set inside the meta tag:
>>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>
>>>See also http://www.w3.org/International/O-charset to get more
>>>information about all different possibilities to declare the character
>>>encodings.
>>>
>>>Kind regards,
>>>Stijn Deknudt.
>>>
>>>On 8/16/11, Jon Moore <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> This is because the resource at www.annahar.com that you link to
>>>> returns a Content-Type header that just reads "text/html":
>>>>
>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>> *   Trying 66.242.155.235... connected
>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>> > Host: www.annahar.com
>>>> > Accept: */*
>>>> >
>>>> < HTTP/1.1 200 OK
>>>> < Connection: close
>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>> < Server: Microsoft-IIS/6.0
>>>> < X-Powered-By: ASP.NET
>>>> < X-Powered-By: PHP/5.2.0
>>>> < Content-type: text/html
>>>> <
>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>>> { [data not shown]
>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>>> * Closing connection #0
>>>>
>>>> So httpclient is doing the right thing -- it's giving you access to
>>>> exactly what's in the header that's returned.
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>> <[email protected]> wrote:
>>>>> Hello,
>>>>> I use the following code to find the charset of a page, but it does
>>>>> not work for the page
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>
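>>>>> Code:
>>>>> [code]
>>>>> import java.io.IOException;
>>>>> import org.apache.http.Header;
>>>>> import org.apache.http.HttpResponse;
>>>>> import org.apache.http.client.ClientProtocolException;
>>>>> import org.apache.http.client.HttpClient;
>>>>> import org.apache.http.client.methods.HttpGet;
>>>>> import org.apache.http.impl.client.DefaultHttpClient;
>>>>>
>>>>> public class ContentTypeCheck {
>>>>>     public static void main(String[] args) {
>>>>>         HttpClient httpclient = new DefaultHttpClient();
>>>>>         String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>         try {
>>>>>             HttpResponse response = httpclient.execute(new HttpGet(url));
>>>>>             // Print the raw Content-Type header value, if the server sent one
>>>>>             Header contentType = response.getFirstHeader("Content-Type");
>>>>>             if (response.getEntity() != null && contentType != null) {
>>>>>                 System.out.println(contentType.getValue());
>>>>>             }
>>>>>         } catch (ClientProtocolException e) {
>>>>>             e.printStackTrace();
>>>>>         } catch (IOException e) {
>>>>>             e.printStackTrace();
>>>>>         } finally {
>>>>>             // Release the underlying connections
>>>>>             httpclient.getConnectionManager().shutdown();
>>>>>         }
>>>>>     }
>>>>> }
>>>>> [/code]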
>>>>>
>>>>>
>>>>> The output of the above code is: text/html.
>>>>> But I think the output should be "text/html; charset=windows-1256".
>>>>> Am I right?
>>>>>
>>>>> But when I use
>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which
>>>>> I think is correct.
>>>>> It seems to work for some pages but not all of them. Why does this
>>>>> happen?
>>>>>
>>>>>
>>>>> Khosro.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>
>>>
>>>--
>>>Stijn
>>>[email protected]
>>>
>>>
>>>
>

