Thanks Stijn,
I think your approach of using BufferedHttpEntity is useful: it avoids fetching the content twice and still lets me find the charset of a webpage.

Khosro.
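For the archive, here is a minimal sketch of the idea in plain Java. It assumes the response body has already been read once into a byte array (with HttpClient 4.x that could come from EntityUtils.toByteArray(entity), which is not shown here). CharsetSniffer, detect, decode and SNIFF_LIMIT are hypothetical names used only for illustration; they are not HttpClient API. The sketch looks for a charset parameter in the Content-Type header value first, then for a meta tag in the first few KB of the body, and otherwise uses a caller-supplied fallback.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of HttpClient.
class CharsetSniffer {

    // charset=... as it appears in a Content-Type header value.
    private static final Pattern HEADER_CHARSET = Pattern.compile(
            "charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // charset=... inside a <meta ...> tag. Covers both
    // <meta http-equiv="Content-Type" content="...; charset=X"> and <meta charset="X">.
    private static final Pattern META_CHARSET = Pattern.compile(
            "<meta[^>]*?charset\\s*=\\s*[\"']?([A-Za-z0-9_.:-]+)",
            Pattern.CASE_INSENSITIVE);

    // Assumption: the declaration sits near the top of the document.
    private static final int SNIFF_LIMIT = 4096;

    // Returns the first charset matched by the pattern, or null if none is usable.
    private static Charset find(Pattern pattern, String text) {
        Matcher m = pattern.matcher(text);
        if (m.find()) {
            try {
                return Charset.forName(m.group(1));
            } catch (IllegalArgumentException e) {
                // Illegal or unsupported charset name: treat it as not declared.
            }
        }
        return null;
    }

    // Header first, then the meta tag in the body, then the fallback.
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        if (contentType != null) {
            Charset fromHeader = find(HEADER_CHARSET, contentType);
            if (fromHeader != null) {
                return fromHeader;
            }
        }
        // ISO-8859-1 maps every byte to exactly one char, so the ASCII markup
        // can be searched before the real charset is known.
        int len = Math.min(body.length, SNIFF_LIMIT);
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Charset fromMeta = find(META_CHARSET, head);
        return fromMeta != null ? fromMeta : fallback;
    }

    // Decodes the buffered bytes once, using whatever detect() found.
    static String decode(String contentType, byte[] body, Charset fallback) {
        return new String(body, detect(contentType, body, fallback));
    }

    public static void main(String[] args) {
        // Made-up sample page; the header value carries no charset, as in this thread.
        byte[] body = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=UTF-8\"></head>"
                + "<body>caf\u00e9</body></html>").getBytes(StandardCharsets.UTF_8);
        System.out.println(detect("text/html", body, StandardCharsets.ISO_8859_1));
        System.out.println(decode("text/html", body, StandardCharsets.ISO_8859_1));
    }
}
```

Because the bytes are kept in memory, the body is fetched only once and decoded only after the charset is known. This sketch will not find a declaration in pages encoded as UTF-16, and the regular expressions are only a rough approximation of what a browser does.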
>________________________________
>From: Stijn Deknudt <[email protected]>
>To: HttpClient User Discussion <[email protected]>; Khosro Asgharifard Sharabiani <[email protected]>
>Sent: Tuesday, August 16, 2011 5:57 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>I forgot to mention in my previous post that you can use
>BufferedHttpEntity when you want to stream the content of the entity: in
>that case the content also gets fetched only once.
>
>Kind regards,
>Stijn.
>
>On 8/16/11, Stijn Deknudt <[email protected]> wrote:
>> Hi Khosro,
>>
>> As described in http://www.w3.org/International/O-charset, there are
>> different ways to specify the content encoding. Because the site you
>> mention doesn't provide the encoding in the header (see article:
>> Send the 'charset' parameter in the Content-Type header of HTTP),
>> you'll need to get the entity and find the encoding yourself in the
>> content.
>> One way to do this is to use EntityUtils to get the content, search
>> for the content-type meta tag and use its charset to convert the
>> content. This means you don't use the stream directly (if you do,
>> you'll need to fetch the content twice: once to consume the content
>> until you have retrieved the character set information, and once more
>> to consume the whole entity with this character set).
>>
>> Kind regards,
>> Stijn.
>>
>> On 8/16/11, Khosro Asgharifard Sharabiani <[email protected]> wrote:
>>> Hi Stijn:
>>> I also use entity.getContentEncoding(), but it returns "null".
>>> Is there any way to obtain the charset of a webpage?
>>> When we browse this page from a browser like FF, it renders charset, but
>>> when we request it with HttpClient or Curl, we cannot get the charset?
>>> I think this is a big problem when we have a crawler, because when we
>>> crawl a webpage, HttpClient gives us a stream, and we must know the
>>> charset of that webpage to save it in the database, but it seems that for
>>> some webpages we cannot get the charset.
>>>
>>> Khosro.
>>>
>>>
>>>>________________________________
>>>>From: Stijn Deknudt <[email protected]>
>>>>To: HttpClient User Discussion <[email protected]>
>>>>Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>>>Sent: Tuesday, August 16, 2011 4:38 PM
>>>>Subject: Re: Obtaining charset of page from HttpResponse.
>>>>
>>>>Hi Khosro,
>>>>
>>>>The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>>There's no header in the response that says anything about the
>>>>character set, but you can obtain this information from the entity
>>>>itself: the HTML contains the character set inside the meta tag:
>>>><meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>>
>>>>See also http://www.w3.org/International/O-charset for more
>>>>information about the different ways to declare character
>>>>encodings.
>>>>
>>>>Kind regards,
>>>>Stijn Deknudt.
>>>>
>>>>On 8/16/11, Jon Moore <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> This is because the resource at www.annahar.com that you link to
>>>>> returns a Content-Type header that just reads "text/html":
>>>>>
>>>>> $ curl -v "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon" >/dev/null
>>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>>> *   Trying 66.242.155.235... connected
>>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>> > GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>> > User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4 OpenSSL/0.9.7l zlib/1.2.3
>>>>> > Host: www.annahar.com
>>>>> > Accept: */*
>>>>> >
>>>>> < HTTP/1.1 200 OK
>>>>> < Connection: close
>>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>>> < Server: Microsoft-IIS/6.0
>>>>> < X-Powered-By: ASP.NET
>>>>> < X-Powered-By: PHP/5.2.0
>>>>> < Content-type: text/html
>>>>> <
>>>>>   % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
>>>>>                                  Dload  Upload   Total   Spent    Left  Speed
>>>>>   0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
>>>>> { [data not shown]
>>>>> 100 91340    0 91340    0     0   187k      0 --:--:-- --:--:-- --:--:--  237k
>>>>> * Closing connection #0
>>>>>
>>>>> So httpclient is doing the right thing -- it's giving you access to
>>>>> exactly what's in the header that's returned.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>>> <[email protected]> wrote:
>>>>>> Hello,
>>>>>> I use the following code to find the charset of a page, but it does not
>>>>>> work for the page
>>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>>
>>>>>> Code:
>>>>>> [code]
>>>>>>
>>>>>> try {
>>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>>     HttpResponse response;
>>>>>>     response = httpclient.execute(httpget);
>>>>>>     HttpEntity entity = response.getEntity();
>>>>>>     if (entity != null) {
>>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>>     }
>>>>>> } catch (ClientProtocolException e) {
>>>>>>     e.printStackTrace();
>>>>>> } catch (IOException e) {
>>>>>>     e.printStackTrace();
>>>>>> }
>>>>>> [/code]
>>>>>>
>>>>>>
>>>>>> And the output of the above code is: text/html.
>>>>>> But I think the output must be "text/html; charset=windows-1256". Am I
>>>>>> right?
>>>>>>
>>>>>> But when I use
>>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>>> as the URL in the code, it returns "text/html; charset=UTF-8", which I
>>>>>> think is OK.
>>>>>> It seems it works for some pages but not all of them. Why does this happen?
>>>>>>
>>>>>>
>>>>>> Khosro.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>>--
>>>>Stijn
>>>>[email protected]
