Hi Stijn,

I also tried entity.getContentEncoding(), but it returns null. Is there any way to obtain the charset of a web page? When we open this page in a browser such as Firefox, the browser determines the charset, but when we request the page with HttpClient or curl, we cannot get it.

I think this is a big problem for a crawler. When we crawl a web page, HttpClient gives us a stream, and we must know the charset of that page to save it in the database, but for some pages it seems we cannot get the charset.

Khosro.

>________________________________
>From: Stijn Deknudt <[email protected]>
>To: HttpClient User Discussion <[email protected]>
>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>Sent: Tuesday, August 16, 2011 4:38 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>Hi Khosri,
>
>The Content-Type header is set (correctly) to "text/html", like Jon said.
>There's no header in the response that says anything about the
>character set, but you can obtain this information from the entity
>itself: the HTML contains the character set inside the meta tag:
><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>
>See also http://www.w3.org/International/O-charset to get more
>information about all different possibilities to declare the character
>encodings.
>
>Kind regards,
>Stijn Deknudt.
>
>On 8/16/11, Jon Moore <[email protected]> wrote:
>> Hi,
>>
>> This is because the resource at www.annahar.com that you link to
>> returns a Content-Type header that just reads "text/html":
>>
>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>> * About to connect() to www.annahar.com port 80 (#0)
>> *   Trying 66.242.155.235... connected
>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>> > Host: www.annahar.com
>> > Accept: */*
>> >
>> < HTTP/1.1 200 OK
>> < Connection: close
>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>> < Server: Microsoft-IIS/6.0
>> < X-Powered-By: ASP.NET
>> < X-Powered-By: PHP/5.2.0
>> < Content-type: text/html
>> <
>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>                                  Dload  Upload   Total   Spent    Left  Speed
>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0{ [data not shown]
>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k* Closing connection #0
>>
>> So httpclient is doing the right thing -- it's giving you access to
>> exactly what's in the header that's returned.
>>
>> Jon
>>
>>
>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>> <[email protected]> wrote:
>>> Hello,
>>> I use the following code to find charset of a page,but it does not worked
>>> for page
>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>
>>> Code :
>>> [code]
>>>
>>> try {
>>>     HttpClient httpclient = new DefaultHttpClient();
>>>     String url="http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>     HttpGet httpget = new HttpGet(url);
>>>     HttpResponse response;
>>>     response = httpclient.execute(httpget);
>>>     HttpEntity entity = response.getEntity();
>>>     if (entity != null) {
>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>         System.out.println(allHeaders[0].getValue());
>>>     }
>>> } catch (ClientProtocolException e) {
>>>     e.printStackTrace();
>>> } catch (IOException e) {
>>>     e.printStackTrace();
>>> }
>>> [/code]
>>>
>>>
>>> And the output of above code is : text/html.
>>> But i think the output must be "text/html; charset=windows-1256" .Am i
>>> right?
>>>
>>> But when i use
>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>> as a url in code,it returns "text/html; charset=UTF-8" ,that i think ,it
>>> is OK.
>>> It seems ,it works for some pages not all of them.Why this happens?
>>>
>>>
>>> Khosro.
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: [email protected]
>> For additional commands, e-mail: [email protected]
>>
>>
>
>
>--
>Stijn
>[email protected]
>
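
For reference, here is a minimal sketch of the approach Stijn describes in the quoted thread: try the Content-Type header first and, if it carries no charset parameter, look for one in a <meta> tag near the start of the body. Note that getContentEncoding() reports the Content-Encoding header (for example gzip), not the character set, which is why it returns null here. The class name CharsetSniffer, its method names, the 4096-byte scan window, and the caller-supplied fallback are all assumptions made for illustration; none of this is part of HttpClient, and the regular expressions are a simplification of what browsers actually do.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of HttpClient.
final class CharsetSniffer {
    // Matches charset=... in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);
    private static final Pattern META_TAG = Pattern.compile(
            "<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Assumption: the declaration appears within the first 4096 bytes.
    private static final int SCAN_LIMIT = 4096;

    // Returns the charset name from a Content-Type value, or null if absent.
    static String fromContentType(String headerValue) {
        if (headerValue == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(headerValue);
        return m.find() ? m.group(1) : null;
    }

    // Returns the charset name declared in a <meta> tag, or null if none is found.
    static String fromMetaTag(byte[] body) {
        int len = Math.min(body.length, SCAN_LIMIT);
        // ISO-8859-1 maps every byte to one char, so ASCII markup is readable
        // even before the real encoding is known.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    // Header first, then <meta>, then the caller's fallback.
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        String name = fromContentType(contentType);
        if (name == null) {
            name = fromMetaTag(body);
        }
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.forName(name);
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name on this JVM.
            return fallback;
        }
    }
}
```

With HttpClient 4.x, the header value could come from entity.getContentType() (checking for null first) and the bytes from EntityUtils.toByteArray(entity); the body would then be decoded with new String(bytes, charset) before being stored. For a crawler, the fallback is a policy decision: pages that declare nothing in either place still need some default.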
