Hi Ken,

Tika may well be a good option, but I have not used it yet, so I need to look into your approach further. For now, I think Stijn's suggestion to use BufferedHttpEntity is good enough.

Khosro.
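For reference, here is a minimal sketch of the "header first, then <meta> tag" lookup discussed in this thread. It uses only the JDK; CharsetSniffer and its method names are hypothetical helpers, not part of HttpClient. It assumes the response body has already been read into a byte array (for example from a buffered, repeatable entity) and that the page uses an ASCII-compatible encoding. It does not attempt the statistical guessing Ken describes.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper (not an HttpClient class): Content-Type header first, then <meta>. */
final class CharsetSniffer {

    // Matches "charset=..." in a Content-Type value or inside a <meta> tag.
    private static final Pattern CHARSET =
            Pattern.compile("charset\\s*=\\s*[\"']?([A-Za-z0-9._:-]+)", Pattern.CASE_INSENSITIVE);

    // Matches one complete <meta ...> tag.
    private static final Pattern META_TAG =
            Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);

    /** Returns the charset name from a Content-Type value, or null if none is declared. */
    static String fromContentType(String contentType) {
        if (contentType == null) {
            return null;
        }
        Matcher m = CHARSET.matcher(contentType);
        return m.find() ? m.group(1) : null;
    }

    /** Scans the first 4096 bytes of the body for a <meta> tag that declares a charset. */
    static String fromMeta(byte[] body) {
        int len = Math.min(body.length, 4096);
        // ISO-8859-1 maps every byte to one char, so ASCII markup stays readable
        // as long as the real encoding is ASCII-compatible (an assumption).
        String head = new String(body, 0, len, StandardCharsets.ISO_8859_1);
        Matcher tag = META_TAG.matcher(head);
        while (tag.find()) {
            Matcher m = CHARSET.matcher(tag.group());
            if (m.find()) {
                return m.group(1);
            }
        }
        return null;
    }

    /** Header wins over <meta>; unknown or missing names yield the fallback. */
    static Charset detect(String contentType, byte[] body, Charset fallback) {
        String name = fromContentType(contentType);
        if (name == null) {
            name = fromMeta(body);
        }
        try {
            return name != null ? Charset.forName(name) : fallback;
        } catch (IllegalArgumentException e) {
            // Illegal or unsupported charset name on this JVM.
            return fallback;
        }
    }

    public static void main(String[] args) {
        byte[] body = ("<html><head><meta http-equiv=\"Content-Type\" "
                + "content=\"text/html; charset=windows-1256\"></head></html>")
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(fromContentType("text/html")); // header declares nothing
        System.out.println(fromMeta(body));               // name taken from the <meta> tag
    }
}
```

With HttpClient, the first argument would be the value of the Content-Type response header and the second the bytes of the entity. If neither source declares a charset, the caller still has to choose a fallback or use a detector such as Tika.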
>________________________________
>From: Ken Krugler <[email protected]>
>To: HttpClient User Discussion <[email protected]>
>Sent: Tuesday, August 16, 2011 6:27 PM
>Subject: Re: Obtaining charset of page from HttpResponse.
>
>Hi Khosro,
>
>Detecting the charset for an arbitrary HTML page is a non-trivial problem, and
>not something that is in scope for HttpClient.
>
>E.g. sometimes the response header has no charset, and there's nothing in the
>HTML <meta> tag.
>
>In that case, browsers (and web crawlers) use statistical analysis to guess at
>the appropriate charset.
>
>One suggestion - you can use Tika to process a web page and detect the charset.
>
>-- Ken
>
>On Aug 16, 2011, at 6:07am, Jon Moore wrote:
>
>> Hi Khosro,
>>
>> Stijn is saying that you need to parse the text/html response body and
>> look for the <meta> tag that contains the charset. There are multiple
>> places the charset for an HTML webpage can be specified: please see
>> the link that Stijn sent for more details.
>>
>> Jon
>>
>> On Tue, Aug 16, 2011 at 8:40 AM, Khosro Asgharifard Sharabiani
>> <[email protected]> wrote:
>>> Hi Stijn:
>>> I also use entity.getContentEncoding(), but it returns "null".
>>> Is there any way to obtain charset of webpage?
>>> When we browse this page from a browser like FF, it renders charset, but
>>> when we request with HttpClient or Curl, we can not get charset?
>>> I think this is a big problem, when we have a crawler. Because when we crawl
>>> of webpage, HttpClient gives us a stream, and we must know the charset of
>>> that webpage to save it in Database, but it seems in some webpage, we can
>>> not get charset of that webpage.
>>>
>>> Khosro.
>>>
>>>
>>>> ________________________________
>>>> From: Stijn Deknudt <[email protected]>
>>>> To: HttpClient User Discussion <[email protected]>
>>>> Cc: Khosro Asgharifard Sharabiani <[email protected]>
>>>> Sent: Tuesday, August 16, 2011 4:38 PM
>>>> Subject: Re: Obtaining charset of page from HttpResponse.
>>>>
>>>> Hi Khosri,
>>>>
>>>> The Content-Type header is set (correctly) to "text/html", like Jon said.
>>>> There's no header in the response that says anything about the
>>>> character set, but you can obtain this information from the entity
>>>> itself: the HTML contains the character set inside the meta tag:
>>>> <meta http-equiv="Content-Type" content="text/html; charset=windows-1256">
>>>>
>>>> See also http://www.w3.org/International/O-charset to get more
>>>> information about all different possibilities to declare the character
>>>> encodings.
>>>>
>>>> Kind regards,
>>>> Stijn Deknudt.
>>>>
>>>> On 8/16/11, Jon Moore <[email protected]> wrote:
>>>>> Hi,
>>>>>
>>>>> This is because the resource at www.annahar.com that you link to
>>>>> returns a Content-Type header that just reads "text/html":
>>>>>
>>>>> $ curl -v
>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>> /dev/null
>>>>> * About to connect() to www.annahar.com port 80 (#0)
>>>>> * Trying 66.242.155.235... connected
>>>>> * Connected to www.annahar.com (66.242.155.235) port 80 (#0)
>>>>>> GET /content.php?priority=1&table=main&type=main&day=Mon HTTP/1.1
>>>>>> User-Agent: curl/7.16.4 (i386-apple-darwin9.0) libcurl/7.16.4
>>>>>> OpenSSL/0.9.7l zlib/1.2.3
>>>>>> Host: www.annahar.com
>>>>>> Accept: */*
>>>>>>
>>>>> < HTTP/1.1 200 OK
>>>>> < Connection: close
>>>>> < Date: Tue, 16 Aug 2011 11:50:50 GMT
>>>>> < Server: Microsoft-IIS/6.0
>>>>> < X-Powered-By: ASP.NET
>>>>> < X-Powered-By: PHP/5.2.0
>>>>> < Content-type: text/html
>>>>> <
>>>>> % Total % Received % Xferd Average Speed Time Time Time
>>>>> Current
>>>>> Dload Upload Total Spent Left
>>>>> Speed
>>>>> 0 0 0 0 0 0 0 0 --:--:-- --:--:--
>>>>> --:--:-- 0{ [data not shown]
>>>>> 100 91340 0 91340 0 0 187k 0 --:--:-- --:--:--
>>>>> --:--:-- 237k* Closing connection #0
>>>>>
>>>>> So httpclient is doing the right thing -- it's giving you access to
>>>>> exactly what's in the header that's returned.
>>>>>
>>>>> Jon
>>>>>
>>>>>
>>>>> On Tue, Aug 16, 2011 at 7:42 AM, Khosro Asgharifard Sharabiani
>>>>> <[email protected]> wrote:
>>>>>> Hello,
>>>>>> I use the following code to find charset of a page, but it does not worked
>>>>>> for page
>>>>>> "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon"
>>>>>>
>>>>>> Code :
>>>>>> [code]
>>>>>>
>>>>>> try {
>>>>>>     HttpClient httpclient = new DefaultHttpClient();
>>>>>>     String url = "http://www.annahar.com/content.php?priority=1&table=main&type=main&day=Mon";
>>>>>>     HttpGet httpget = new HttpGet(url);
>>>>>>     HttpResponse response;
>>>>>>     response = httpclient.execute(httpget);
>>>>>>     HttpEntity entity = response.getEntity();
>>>>>>     if (entity != null) {
>>>>>>         Header[] allHeaders = response.getHeaders("Content-Type");
>>>>>>         System.out.println(allHeaders[0].getValue());
>>>>>>     }
>>>>>> } catch (ClientProtocolException e) {
>>>>>>     e.printStackTrace();
>>>>>> } catch (IOException e) {
>>>>>>     e.printStackTrace();
>>>>>> }
>>>>>> [/code]
>>>>>>
>>>>>>
>>>>>> And the output of above code is : text/html.
>>>>>> But i think the output must be "text/html; charset=windows-1256". Am i
>>>>>> right?
>>>>>>
>>>>>> But when i use
>>>>>> "http://bigbrowser.blog.lemonde.fr/2011/08/03/iran-le-mossad-derriere-le-meurtre-dun-scientifique-spiegel"
>>>>>> as a url in code, it returns "text/html; charset=UTF-8", that i think, it
>>>>>> is OK.
>>>>>> It seems, it works for some pages not all of them. Why this happens?
>>>>>>
>>>>>>
>>>>>> Khosro.
>>>>>
>>>>> ---------------------------------------------------------------------
>>>>> To unsubscribe, e-mail: [email protected]
>>>>> For additional commands, e-mail: [email protected]
>>>>>
>>>>
>>>> --
>>>> Stijn
>>>> [email protected]
>>>>
>
>--------------------------
>Ken Krugler
>+1 530-210-6378
>http://bixolabs.com
>custom data mining solutions
>
