Hi Khosro,

Stijn is saying that you need to parse the text/html response body and look for the <meta> tag that declares the charset. The charset of an HTML page can be specified in several places (the Content-Type response header, a <meta> tag in the document, or a byte-order mark); please see the link that Stijn sent for more details.
Jon

On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani <[email protected]> wrote:
> Hi Stijn,
> I also tried entity.getContentEncoding(), but it returns null.
> Is there any way to obtain the charset of a webpage?
> When we open this page in a browser like Firefox, it renders with the right
> charset, but when we request it with HttpClient or curl, we cannot get the charset.
> I think this is a big problem for a crawler. When we crawl
> a webpage, HttpClient gives us a stream, and we must know the charset of
> that webpage to save it in a database, but it seems that for some webpages
> we cannot get the charset.
>
> Khosro.
>
>
>>________________________________
>>From: Stijn Deknudt <[email protected]>
>>To: HttpClient User Discussion <[email protected]>
>>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>Sent: Tuesday, August 16, 2011 4:38 PM
>>Subject: Re: Obtaining charset of page from HttpResponse.
>>
>>Hi Khosro,
>>
>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>There's no header in the response that says anything about the
>>character set, but you can obtain this information from the entity
>>itself: the HTML contains the character set inside the meta tag:
>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>
>>See also http://www.w3.org/International/O-charset for more
>>information about the different ways to declare character
>>encodings.
>>
>>Kind regards,
>>Stijn Deknudt.
>>
>>On 8/16/11, Jon Moore <[email protected]> wrote:
>>> Hi,
>>>
>>> This is because the resource at www.annahar.com that you link to
>>> returns a Content-Type header that just reads "text/html":
>>>
>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>> * About to connect() to www.annahar.com port 80 (#0)
>>> * Trying 66.242.155.235... connected
>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>> > Host: www.annahar.com
>>> > Accept: */*
>>> >
>>> < HTTP/1.1 200 OK
>>> < Connection: close
>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>> < Server: Microsoft-IIS/6.0
>>> < X-Powered-By: ASP.NET
>>> < X-Powered-By: PHP/5.2.0
>>> < Content-type: text/html
>>> <
>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>>
>>> So httpclient is doing the right thing -- it's giving you access to
>>> exactly what's in the header that's returned.
>>>
>>> Jon
>>>
>>>
>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>> <[email protected]> wrote:
>>>> Hello,
>>>> I use the following code to find the charset of a page, but it does not
>>>> work for the page
>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>
>>>> Code:
>>>> [code]
>>>>
>>>> try {
>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>     HttpGet httpget = new HttpGet(url);
>>>>     HttpResponse response;
>>>>     response = httpclient.execute(httpget);
>>>>     HttpEntity entity = response.getEntity();
>>>>     if (entity != null) {
>>>>         // Print the raw value of the first Content-Type header
>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>         System.out.println(allHeaders[0].getValue());
>>>>     }
>>>> } catch (ClientProtocolException e) {
>>>>     e.printStackTrace();
>>>> } catch (IOException e) {
>>>>     e.printStackTrace();
>>>> }
>>>> [/code]
>>>>
>>>>
>>>> The output of the above code is: text/html.
>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>> right?
>>>>
>>>> But when I use
>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I think
>>>> is correct.
>>>> It seems to work for some pages but not all of them. Why does this happen?
>>>>
>>>>
>>>> Khosro.
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: [email protected]
>>> For additional commands, e-mail: [email protected]
>>>
>>>
>>
>>
>>--
>>Stijn
>>[email protected]
>>
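
The approach discussed in this thread (use the charset from the Content-Type header when the server sends one, otherwise look for a <meta> declaration in the HTML itself) could be sketched roughly as below. This is only an illustration in plain Java, not HttpClient API: the class name CharsetSniffer, its method names, the 4096-byte scan limit, and the regular expressions are all assumptions made for the example, and the regex matching is far looser than what browsers actually do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HttpClient.
final class CharsetSniffer {

    // Finds "charset=<name>" in a header value or inside a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Finds whole <meta ...> tags so that only meta elements are searched.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Returns the charset name from a Content-Type value, or null if absent.
    static String fromContentType(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    // Scans the first maxBytes of the body for a <meta> charset declaration.
    static String fromMetaTag(byte[] body, int maxBytes) {
        int len = Math.min(body.length, maxBytes);
        // ISO-8859-1 maps every byte to one char, so this decode never fails
        // and leaves the ASCII markup we are searching for intact.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Header first, then <meta>, then the caller-supplied fallback.
    static Charset detect(String contentTypeHeader, byte[] body, Charset fallback) {
        String name = fromContentType(contentTypeHeader);
        if (name == null) {
            name = fromMetaTag(body, 4096); // scan limit is an arbitrary assumption
        }
        try {
            return name != null ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name on this JVM
            return fallback;
        }
    }
}
```

With HttpClient 4.x, the inputs would presumably come from response.getFirstHeader("Content-Type") (which may be null) and EntityUtils.toByteArray(entity); the bytes are then decoded with the detected Charset before being stored. Note that entity.getContentEncoding() reports the Content-Encoding header (for example gzip), not the character set, which is why it returns null for this page.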
