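I forgot to mention in my previous post that you can wrap the entity in a BufferedHttpEntity if you want to stream its content: the content is buffered in memory, so it is still fetched from the server only once, even if you read the stream a second time after finding the character set.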
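
In case it helps, here is a rough sketch of the "find the charset in the content" step. It uses only the JDK and assumes the body has already been read into a byte array once (for example with EntityUtils.toByteArray(entity)). The class name MetaCharsetSniffer, its method names and the 4096-byte limit are made up for this illustration and are not part of HttpClient; the regular expression is a simplification and not a full HTML parser.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration only; not part of HttpClient. */
final class MetaCharsetSniffer {

    // Finds "charset=<name>" as it appears in a <meta> Content-Type declaration.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([\\w.:-]+)", Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration sits near the top of the page, so only the start is scanned.
    private static final int SNIFF_LIMIT = 4096;

    /** Returns the charset declared in the markup, or the fallback if none is usable. */
    static Charset sniff(byte[] body, Charset fallback) {
        int len = Math.min(body.length, SNIFF_LIMIT);
        // ISO-8859-1 maps each byte to one char, so the ASCII markup of any
        // ASCII-compatible encoding can be searched before the real charset is known.
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher m = CHARSET.matcher(head);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Malformed or unsupported charset name: use the fallback below.
            }
        }
        return fallback;
    }

    /** Decodes the already-buffered bytes with the sniffed charset. */
    static String decode(byte[] body, Charset fallback) {
        return new String(body, sniff(body, fallback));
    }
}
```

The point is that the bytes are read from the network once and then looked at twice in memory: first to find the declaration, then to decode the whole page. For the page discussed in this thread, the meta tag declares windows-1256, so sniff() would return that charset provided the JVM supports it.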

Kind regards,
Stijn.

On 8/16/11, Stijn Deknudt <[email protected]> wrote:
> Hi Khosro,
>
> As described in http://www.w3.org/International/O-charset, there are
> different ways to specify the content encoding. Because the site you
> mention doesn't provide you the encoding in the header (see article:
> Send the 'charset' parameter in the Content-Type header of HTTP),
> you'll need to get the entity and find the encoding yourself in the
> content.
> One way to do this is to use EntityUtils to get the content, search
> for the content-type meta-tag and use the charset to convert the
> content with this information. This means you don't use the stream
> directly (if you do this you'll need to fetch the content 2 times: one
> time to consume the content until you retrieved the character set
> information, and another time to consume the whole entity with this
> character set).
>
> Kind regards,
> Stijn.
>
> On 8/16/11, Khosro Asgharifard Sharabiani <[email protected]>
> wrote:
>> Hi Stijn :
>> I also use entity.getContentEncoding() ,but it returns "null".
>> Is there any way to obtain charset of webpage?
>> When we browse this page from a browser like FF,it renders charset ,but
>> when
>> we request with HttpClient or Curl ,we can not get charset?
>> I think this is a big problem ,when we have a crawler.Because when we
>> crawl
>> of webpage ,HttpClient gives us a stream,and we must know the charset of
>> that webpage to save it in Database,but it seems in some webpage ,we can
>> not
>> get charset of that webpage.
>>
>> Khosro.
>>
>>
>>>________________________________
>>>From: Stijn Deknudt <[email protected]>
>>>To: HttpClient User Discussion <[email protected]>
>>>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>>Sent: Tuesday, August 16, 2011 4:38 PM
>>>Subject: Re: Obtaining charset of page from HttpResponse.
>>>
>>>Hi Khosri,
>>>
>>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>There's no header in the response that says anything about the
>>>character set, but you can obtain this information from the entity
>>>itself: the HTML contains the character set inside the meta tag:
>>><meta http-equiv="Content-Type" content="text/html;
>>> charset=windows-1256">
>>>
>>>See also http://www.w3.org/International/O-charset to get more
>>>information about all different possibilities to declare the character
>>>encodings.
>>>
>>>Kind regards,
>>>Stijn Deknudt.
>>>
>>>On 8/16/11, Jon Moore <[email protected]> wrote:
>>>> Hi,
>>>>
>>>> This is because the resource at www.annahar.com that you link to
>>>> returns a Content-Type header that just reads "text/html":
>>>>
>>>> $ curl -v
>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>/dev/null
>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>> * Trying 66.242.155.235... connected
>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>> GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>> User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4
>>>>> OpenSSL/0.9.7l zlib/1.2.3
>>>>> Host: www.annahar.com
>>>>> Accept: */*
>>>>>
>>>> < HTTP/1.1 200 OK
>>>> < Connection: close
>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>> < Server: Microsoft-IIS/6.0
>>>> < X-Powered-By: ASP.NET
>>>> < X-Powered-By: PHP/5.2.0
>>>> < Content-type: text/html
>>>> <
>>>> % Total % Received % Xferd Average Speed Time Time Time
>>>> Current
>>>> Dload Upload Total Spent Left
>>>> Speed
>>>> 0 0 0 0 0 0 0 0 --:--:-- --:--:--
>>>> --:--:-- 0{ [data not shown]
>>>> 100 91340 0 91340 0 0 187k 0 --:--:-- --:--:--
>>>> --:--:-- 237k* Closing connection #0
>>>>
>>>> So httpclient is doing the right thing -- it's giving you access to
>>>> exactly what's in the header that's returned.
>>>>
>>>> Jon
>>>>
>>>>
>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>> <[email protected]> wrote:
>>>>> Hello,
>>>>> I use the following code to find charset of a page,but it does not
>>>>> worked
>>>>> for page
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>
>>>>> Code :
>>>>> [code]
>>>>>
>>>>> try {
>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>     String
>>>>>     url="http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>     HttpResponse response;
>>>>>     response = httpclient.execute(httpget);
>>>>>     HttpEntity entity = response.getEntity();
>>>>>     if (entity != null) {
>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>     }
>>>>> } catch (ClientProtocolException e) {
>>>>>     e.printStackTrace();
>>>>> } catch (IOException e) {
>>>>>     e.printStackTrace();
>>>>> }
>>>>> [/code]
>>>>>
>>>>>
>>>>> And the output of above code is : text/html.
>>>>> But i think the output must be "text/html; charset=windows-1256" .Am i
>>>>> right?
>>>>>
>>>>> But when i use
>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>> as a url in code,it returns "text/html; charset=UTF-8" ,that i think
>>>>> ,it
>>>>> is OK.
>>>>> It seems ,it works for some pages not all of them.Why this happens?
>>>>>
>>>>>
>>>>> Khosro.
>>>>
>>>> ---------------------------------------------------------------------
>>>> To unsubscribe, e-mail: [email protected]
>>>> For additional commands, e-mail: [email protected]
>>>>
>>>>
>>>
>>>
>>>--
>>>Stijn
>>>[email protected]
>>>
>>>
>>>
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
