On Thursday, 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> Hi Milan,
>
> a <- getURL(con, .encoding = "UTF-8")
> Encoding(a)
>
> [1] "UTF-8"
> a # Here the UTF-8 text looks fine.
> htmlParse(a, encoding = "UTF-8") ### again the same encoding issue

And what if you try this:
a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
or this:
a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))

Cheers

> >> why didn't getURL() detect and set a's encoding correctly?
> I think it is an issue with the page, because other sites work fine.
>
> 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> On Thursday, 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> > Hi Milan!
> >
> > > Encoding(a)
> > [1] "unknown"
>
> Hm, here I get "UTF-8", which is my locale encoding.
>
> I've tried a little more, and I discovered that using
> a <- getURL(u, .encoding="UTF-8")
> ensures that a is in the correct encoding here. I know this is not your
> problem, but it might help: check whether Encoding(a) is set to "UTF-8"
> or not in that case, and whether this fixes things.
>
> I'm not sure how htmlParse() detects the encoding when you pass it a
> character vector, but it probably uses Encoding(a), since that's the
> only reliable information; if it is missing, maybe it falls back to what
> the contents of the file say (maybe even before what the "encoding"
> argument says), which is windows-1251, and may not be the encoding in
> which getURL() saved the character vector. The question would then be:
> why didn't getURL() detect and set a's encoding correctly?
>
> My two cents
>
> > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> > On Thursday, 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> > > Hello dear R-help mailing list.
> > >
> > > Looks like the same issue in Russian:
> > >
> > > library(RCurl)
> > > library(XML)
> > > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> > > a = getURL(u)
> > > a # Here - the Russian is fine.
> > > a2 <- htmlParse(a)
> > > a2 # Here it is a mess...
> > >
> > > None of these seem to fix it:
> > >
> > > htmlParse(a, encoding = "windows-1251")
> > > htmlParse(a, encoding = "CP1251")
> > > htmlParse(a, encoding = "cp1251")
> > > htmlParse(a, encoding = "iso8859-5")
> > >
> > > This is my locale:
> > >
> > > Sys.getlocale()
> > > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> > >
> > > Any suggestions?
> >
> > What does Encoding(a) say?
> >
> > (FWIW, here on Linux even a is not in the correct encoding:
> > <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> > "http://www.w3.org/TR/REC-html40/loose.dtd">
> > <html><head>
> > <title>ГЉГіГЇГЁГІГј îäГîêîìГГ ГІГГіГѕ ГЄГўГ ðòèð
> > Гі Гў Ìîà
> > ±ГЄГўГҐ В— 11430 îáúÿâëåГГЁГ© Г® ïðîäà æå îäà
> > îêîìà
> > Г ГІГûõ êâà ðòèð</title>
> > [...])
> >
> > Regards
> >
> > > Thank you very much in advance,
> > >
> > > Lavrentiy Eskin
> > > <http://www.eng.nvg.ru>
> > >
> > > ______________________________________________
> > > R-help@r-project.org mailing list
> > > https://stat.ethz.ch/mailman/listinfo/r-help
> > > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > > and provide commented, minimal, self-contained, reproducible code.
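
A note on the two suggestions at the top of this message (rewriting the declared charset with sub(), and converting with iconv()): they can be tried offline on a small simulated page, which may make it easier to see what each step does. The sketch below is only an illustration under stated assumptions. The page is a made-up stand-in, not the real cian.ru page, and the names title_utf8, html_utf8, html_cp1251 and fixed are hypothetical. It also assumes that the local iconv knows the name "windows-1251", as the code in this thread already does. Whether this resolves the original htmlParse() problem is not confirmed anywhere in the thread.

```r
# Offline sketch: no network access, the page is simulated (assumption).
# The Cyrillic sample word is written with \u escapes so the script is ASCII-only.
title_utf8 <- "\u041a\u0443\u043f\u0438\u0442\u044c"

# A tiny page that declares windows-1251, like the page discussed in the thread.
html_utf8 <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" ",
  "content=\"text/html; charset=windows-1251\">",
  "<title>", title_utf8, "</title></head><body></body></html>"
)

# Simulate what arrives from the server: the same page as windows-1251 bytes.
# iconv() does not mark this result, so Encoding() reports "unknown".
html_cp1251 <- iconv(html_utf8, from = "UTF-8", to = "windows-1251")
print(Encoding(html_cp1251))

# Second suggestion: convert the bytes to UTF-8 ...
fixed <- iconv(html_cp1251, from = "windows-1251", to = "UTF-8")
# ... and, combined with the first suggestion, make the declared charset
# agree with the actual encoding of the string.
fixed <- sub("windows-1251", "UTF-8", fixed, fixed = TRUE)

# Optional: parse the result if the XML package is installed.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(fixed, asText = TRUE, encoding = "UTF-8")
  print(XML::xpathSApply(doc, "//title", XML::xmlValue))
}
```

Combining both steps is a design choice of this sketch, not something proposed in the thread: after conversion, the string's bytes and the charset declared inside the page say the same thing, so the parser has no conflicting information to choose from.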