Hi Khosro,

Detecting the charset for an arbitrary HTML page is a non-trivial problem, and not something that is in scope for HttpClient.
E.g. sometimes the response header has no charset, and there's nothing in the HTML <meta> tag. In that case, browsers (and web crawlers) use statistical analysis to guess at the appropriate charset.

One suggestion - you can use Tika to process a web page and detect the charset.

-- Ken

On Aug 16, 2011, at 6:07am, Jon Moore wrote:

> Hi Khosro,
>
> Stijn is saying that you need to parse the text/html response body and
> look for the <meta> tag that contains the charset. There are multiple
> places the charset for an HTML webpage can be specified: please see
> the link that Stijn sent for more details.
>
> Jon
>
> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
> <[email protected]> wrote:
>> Hi Stijn,
>> I also use entity.getContentEncoding(), but it returns "null".
>> Is there any way to obtain the charset of a webpage?
>> When we browse this page in a browser like FF, it renders the charset, but when
>> we request it with HttpClient or curl, we cannot get the charset.
>> I think this is a big problem when we have a crawler, because when we crawl
>> a webpage, HttpClient gives us a stream, and we must know the charset of
>> that webpage to save it in the database, but it seems that for some webpages we
>> cannot get the charset.
>>
>> Khosro.
>>
>>
>>> ________________________________
>>> From: Stijn Deknudt <[email protected]>
>>> To: HttpClient User Discussion <[email protected]>
>>> Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>>
>>> Hi Khosro,
>>>
>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>> There's no header in the response that says anything about the
>>> character set, but you can obtain this information from the entity
>>> itself: the HTML contains the character set inside the meta tag:
>>> <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>
>>> See also http://www.w3.org/International/O-charset to get more
>>> information about all different possibilities to declare the character
>>> encodings.
>>>
>>> Kind regards,
>>> Stijn Deknudt.
>>>
>>> On 8/16/11, Jon Moore <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> This is because the resource at www.annahar.com that you link to
>>>> returns a Content-Type header that just reads "text/html":
>>>>
>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" > /dev/null
>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>> *   Trying 66.242.155.235... connected
>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>> > Host: www.annahar.com
>>>> > Accept: */*
>>>> >
>>>> < HTTP/1.1 200 OK
>>>> < Connection: close
>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>> < Server: Microsoft-IIS/6.0
>>>> < X-Powered-By: ASP.NET
>>>> < X-Powered-By: PHP/5.2.0
>>>> < Content-type: text/html
>>>> <
>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>>>
>>>> So httpclient is doing the right thing -- it's giving you access to
>>>> exactly what's in the header that's returned.
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>> <[email protected]> wrote:
>>>>> Hello,
>>>>> I use the following code to find the charset of a page, but it did not work
>>>>> for the page
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>
>>>>> Code:
>>>>> [code]
>>>>>
>>>>> try {
>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>     HttpResponse response;
>>>>>     response = httpclient.execute(httpget);
>>>>>     HttpEntity entity = response.getEntity();
>>>>>     if (entity != null) {
>>>>>         // Print the raw value of the Content-Type response header
>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>     }
>>>>> } catch (ClientProtocolException e) {
>>>>>     e.printStackTrace();
>>>>> } catch (IOException e) {
>>>>>     e.printStackTrace();
>>>>> }
>>>>> [/code]
>>>>>
>>>>>
>>>>> And the output of the above code is: text/html.
>>>>> But I think the output should be "text/html; charset=windows-1256". Am I
>>>>> right?
>>>>>
>>>>> But when I use
>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>> as the url in the code, it returns "text/html; charset=UTF-8", which I think
>>>>> is OK.
>>>>> It seems it works for some pages but not all of them. Why does this happen?
>>>>>
>>>>>
>>>>> Khosro.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>
>>>
>>> --
>>> Stijn
>>> [email protected]
>>>
>>>
>
>

--------------------------
Ken Krugler
+1 530-210-6378
http://bixolabs.com
custom data mining solutions
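
For reference, here is a minimal sketch of the lookup order discussed in this thread: the Content-Type header first, then the <meta> tag in the body, then a caller-supplied default. It is plain Java with no HttpClient or Tika dependency. The class name CharsetSniffer, its method names, and the 4096-byte scan window are assumptions made for illustration; none of them come from HttpClient. It does not attempt the statistical guessing Ken mentions, so a detector such as Tika would still be needed for pages that declare nothing.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not part of HttpClient): header first, then <meta>, then a default. */
final class CharsetSniffer {

    // Matches "charset=..." in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET_PARAM =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches a single <meta ...> tag.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    // Assumption: only the first few KB of the body are scanned for a <meta> tag.
    private static final int SCAN_LIMIT = 4096;

    /** Returns the charset name from a Content-Type header value, or null if none is given. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET_PARAM.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Scans the start of the body for a <meta> tag that declares a charset, or returns null. */
    static String fromMetaTag(byte[] body) {
        int len = Math.min(body.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to exactly one char, so ASCII markup survives decoding.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tags = META_TAG.matcher(head);
        while (tags.find()) {
            String name = fromContentType(tags.group());
            if (name != null) {
                return name;
            }
        }
        return null;
    }

    /** Header first, then <meta> tag, then the fallback if nothing usable was declared. */
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        String name = fromContentType(contentType);
        if (name == null) {
            name = fromMetaTag(body);
        }
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Malformed charset name, or one this JVM does not support.
            return fallback;
        }
    }
}
```

With HttpClient, the first argument to detect() would be the value of the Content-Type header (as printed by the code in the original question) and the second would be the bytes read from the entity. The returned Charset can then be used to decode those bytes before storing the page.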
