Hi All,

First method:-
>library(XML)

>theurl <- "http://home.sina.com";
>download.file(theurl, "tmp.html")

>txt <- readLines("tmp.html")

>txt <- htmlTreeParse(txt, error=function(...){}, useInternalNodes = TRUE)

>g <- xpathSApply(txt, "//p", function(x) xmlValue(x))

>head(grep(" ", g, value=T))


[1] "繁體 | 簡體 | ENGLISH"                               "女憲兵站崗 
電風扇伺候"                        
[3] "é¬¼å‰ƒé ­ç¾Žå°‘å¥³ 選美爆冷稱後(圖)"              
"(國際)性感航空廣告 俄半裸空姐洗飛機"
[5] " 四海同心慶雙十之台灣環島游"                "é è³¼æ©Ÿç¥¨! 
é€æ‚ æ¸¸å¡! 抽五星酒店和機票!"  



Second method:-
>library(RCurl)

>theurl <- getURL("http://home.sina.com",encoding='GB2312')

>Encoding(theurl)

[1]"unknown"

>txt <- readLines(con=textConnection(theurl),encoding='GB2312')
>txt[5:10] # show the lines where the encoding problem occurs
[1] "<meta http-equiv=\"Content-Type\" content=\"text/html; charset=utf-8\" />"
[2] "<title>SINA.com US 新浪網 -北美</title>"
[3] "<meta name=\"Keywords\" content=\"北美新浪, 新浪北美站, 
美國中文網, 北美中文網站,華人網站, SINA, US, News, Chinese, 
Asia\" />"
[4] "<meta name=\"Description\" 
content=\"北美地區最大的中文網絡媒體, 
為海外華人24小時不間斷提ä¾æµ·é‡è³‡è¨Š, 內容包括最新新聞, 
娛樂訊息, 實用移民資訊, 股市匯市財經信息, 高人氣BBS, 
面向å—美華人的交友平台等.\" />"
[5] ""
[6] "<link rel=\"stylesheet\" type=\"text/css\" href=\"http://ui.sina.com/assets/css/style_home.css\" />"

I am trying to read data from a Chinese-language website, but some of the Chinese 
characters always come out unreadable with both methods. Is there a good way to 
handle this encoding problem in RCurl and XML?
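
The garbled runs look as if multi-byte UTF-8 text is being shown one byte per 
character (the page's own meta tag in txt[5] above declares charset=utf-8). Below 
is a minimal base-R sketch of that effect, not a confirmed diagnosis. The string 
s is a hypothetical stand-in for one line of the page, and Latin-1 is only an 
assumed example of a single-byte encoding.

```r
# Hypothetical stand-in for page text: three Chinese characters written
# with \u escapes, so the script does not depend on the file's encoding.
s <- "\u65b0\u6d6a\u7db2"

# The raw UTF-8 bytes, as they would arrive from the server.
bytes <- charToRaw(enc2utf8(s))
length(bytes)         # 9: three characters, three bytes each

# Misreading: declare the bytes to be Latin-1, i.e. one character per byte.
wrong <- rawToChar(bytes)
Encoding(wrong) <- "latin1"
nchar(wrong)          # 9 (prints as accented-letter debris, not Chinese)

# Correct reading: declare the very same bytes to be UTF-8.
right <- rawToChar(bytes)
Encoding(right) <- "UTF-8"
nchar(right)          # 3
identical(right, s)   # TRUE
```

If that is what is happening here, the bytes themselves may be fine and only the 
declared encoding (GB2312 in my second method) may be wrong.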


Regards,
Ryusuke
                                          

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
