Hi Prof,

Thank you for your reply. Sorry, I left out the information below.
> Sys.getlocale()
[1] "LC_COLLATE=English_United States.1252;LC_CTYPE=English_United States.1252;LC_MONETARY=English_United States.1252;LC_NUMERIC=C;LC_TIME=English_United States.1252"

I have just noticed that Traditional Chinese characters cause the encoding
problem, while Simplified Chinese works fine.

> library(RCurl)
> theurl <- getURL("http://home.sina.com", encoding='utf8')
> Encoding(theurl)
[1] "latin1"
> txt <- readLines(con=textConnection(theurl), encoding='utf8')
> write.table(file='D:/fileas.txt', txt)
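
The "latin1" mark reported by Encoding() above suggests that the downloaded bytes may be UTF-8 while R labels them as something else. Below is a minimal base-R sketch of that situation, assuming the bytes really are UTF-8. The variable names and the two sample characters are illustrative and do not come from the original post.

```r
# Base R only. The \u escapes spell two Traditional Chinese characters, so the
# example does not depend on how this text itself is encoded.
original <- "\u7e41\u9ad4"
utf8_bytes <- charToRaw(enc2utf8(original))

# Simulate a download whose UTF-8 bytes were labelled as latin1.
mislabelled <- rawToChar(utf8_bytes)
Encoding(mislabelled) <- "latin1"

# Re-declare the encoding. This changes only the label, not the bytes.
relabelled <- mislabelled
Encoding(relabelled) <- "UTF-8"
```

Re-labelling is only appropriate when the bytes are known to be UTF-8; if they are in another encoding such as Big5, converting with iconv() would be the thing to try instead.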

When I open fileas.txt in Notepad, the Traditional Chinese characters are
readable, but when I read the file back into Rgui:
> smple <- scan('D:/fileas.txt',what='')
the characters become unrecognisable again. I wonder whether Rgui supports
Traditional Chinese characters at the moment.
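
One way to check whether the file round trip itself loses the characters is to write and read the text as explicit UTF-8 bytes. The sketch below uses only base R and a temporary file; the sample strings and variable names are illustrative, and it is assumed that the text is already valid UTF-8 inside R.

```r
# Two sample lines: one with Traditional Chinese characters (written as \u
# escapes), one plain ASCII.
lines_out <- c("\u7e41\u9ad4\u4e2d\u6587", "plain ASCII line")
path <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes as they are, without re-encoding to the native locale.
con <- file(path, open = "wb")
writeLines(enc2utf8(lines_out), con, useBytes = TRUE)
close(con)

# Read the lines back and mark them as UTF-8.
lines_in <- readLines(path, encoding = "UTF-8")
unlink(path)
```

If lines_in matches lines_out but the console still shows unreadable characters, the remaining problem is more likely the display in the current locale than the data.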

I think I need to look for a solution for translating between Chinese character sets.
Thank you.


Best,
Ryusuke

  ===============================================

Hi Ryusuke
 
 I would use the encoding parameter of htmlParse() and 
 download and parse the content in one operation:
 
     htmlParse("http://home.sina.com", encoding = "UTF-8")
 
 If you want to use getURL() in RCurl, use the .encoding parameter.
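
 The getURL() route might look roughly like the sketch below. It assumes the RCurl and XML packages are installed; fetch_paragraphs and its charset argument are hypothetical names introduced for illustration, and the "//p" query is borrowed from the quoted message further down.

```r
# Nothing is downloaded until the function is called.
fetch_paragraphs <- function(url, charset = "UTF-8") {
  # .encoding tells RCurl how to interpret the bytes of the response body.
  html <- RCurl::getURL(url, .encoding = charset)
  # asText = TRUE parses the string itself instead of treating it as a file name.
  doc <- XML::htmlParse(html, asText = TRUE, encoding = charset)
  # Return the text of every <p> element.
  XML::xpathSApply(doc, "//p", XML::xmlValue)
}
```

 A call such as fetch_paragraphs("http://home.sina.com") would then return the paragraph text, although how it prints still depends on the locale and console.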
 
  You didn't tell us the output of Sys.getlocale()
  or how your terminal/console is configured, so the above
  may vary under your configuration, but works on various
  machines for me with different settings.
 
    D.
 
 
Ryusuke Kenji wrote:
> 
> Hi All,
> 
> First method:-
> >library(XML)
> 
> >theurl <- "http://home.sina.com"
> >download.file(theurl, "tmp.html")
> 
> >txt <- readLines("tmp.html")
> 
> >txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)
> 
> >g <- xpathSApply(txt, "//p", function(x) xmlValue(x))
> 
> >head(grep(" ", g, value=T))
> 
> 
> [1] "?????? | ?????? | ENGLISH"                               
> "??????????????? ???????????????"                        
> [3] "??????? ?????????? ??????????????????(???)"              
> "?????????????????????????????? ????????????????????????"
> [5] " ???????????????????????????????????????"                "? ??????????! 
> ????? ??????! ????????????????????????!"  
> 
> 
> 
> Second method:-
> >library(RCurl)
> 
> >theurl <- getURL("http://home.sina.com",encoding='GB2312')
> 
> >Encoding(theurl)
> 
> [1]"unknown"
> 
> >txt <- readLines(con=textConnection(theurl),encoding='GB2312')
> >txt[5:10] #show the lines which occurred encoding problem.
> [1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" 
> />"
> [2] "<title>SINA.com US ????????? -??????</title>"
> [3] "<meta name=\"Keywords\" content=\"????????????, ???????????????, 
> ???????????????, ??????????????????,????????????, SINA, US, News, Chinese, 
> Asia\" />"
> [4] "<meta name=\"Description\" 
> content=\"???????????????????????????????????????, 
> ???????????????24????????????????????????????????, ????????????????????????, 
> ????????????, ??????????????????, ????????????????????????, ?????????BBS, 
> ???????????????????????????????????.\" />"
> [5] ""
> [6] "<link rel=\"stylesheet\" type=\"text/css\" href=\"http://ui.sina.com/assets/css/style_home.css\" />"
> 
> I am trying to read data from a Chinese-language website, but the Chinese
> characters are always unreadable. Is there a good way to cope with this
> encoding problem in RCurl and XML?
> 
> 
> Regards,
> Ryusuke
 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
 
 
-- 
"There are men who can think no deeper than a fact" - Voltaire
 
 
Duncan Temple Lang                dun...@wald.ucdavis.edu
Department of Statistics          work:  (530) 752-4782
4210 Mathematical Sciences Bldg.  fax:   (530) 752-7099
One Shields Ave.
University of California at Davis
Davis, CA 95616, USA
 
 
                                          