I think I will have to install Linux in a VM... That looks like the shortest way.

By the way, could you please advise how to rebuild the 'XML' package for R
against the latest libxml sources? Who could do that?
Or is it possible to build a new R package based on a parser that is not
written in C, for example one built on PyPy, Erlang and so on?

2013/2/22 Milan Bouchet-Valat <nalimi...@club.fr>

> On Thursday 21 February 2013 at 18:53 +0400, Lawr Eskin wrote:
> > I tried iconv before in various ways: same issue, and the result has
> > encoding = unknown.
> > Now I tried sub: same issue.
> This procedure works on Linux, but not on Windows:
>
> library(RCurl)
> library(XML)
> u <- "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> a <- getURL(u, .encoding="UTF-8")
> a <- iconv(a, "windows-1251", "UTF-8")
> a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
> a2
>
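The procedure quoted above depends on a live page, so here is a minimal offline sketch of the same two steps (convert from windows-1251 with iconv, then rewrite the charset declaration with sub). The in-memory page and the names title_utf8, page_utf8 and raw_1251 are made up for illustration, and the parsing step only runs if the XML package is installed. It is a sketch under those assumptions, not a confirmed fix for the Windows behaviour discussed in this thread.

```r
# A small stand-in page; the real URL is not fetched here.
title_utf8 <- "\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430"
page_utf8 <- paste0(
  "<html><head>",
  "<meta http-equiv=\"Content-Type\" ",
  "content=\"text/html; charset=windows-1251\">",
  "<title>", title_utf8, "</title></head><body></body></html>"
)

# Simulate what the server sends: the same text as windows-1251 bytes.
raw_1251 <- iconv(page_utf8, from = "UTF-8", to = "windows-1251",
                  toRaw = TRUE)[[1]]

# Step 1: interpret the bytes as windows-1251 and convert them to UTF-8.
a <- iconv(list(raw_1251), from = "windows-1251", to = "UTF-8")

# Step 2: make the charset declaration agree with the new encoding.
a <- sub("windows-1251", "UTF-8", a, fixed = TRUE)

# Step 3 (optional): parse the text if the XML package is available.
if (requireNamespace("XML", quietly = TRUE)) {
  doc <- XML::htmlParse(a, asText = TRUE, encoding = "UTF-8")
  parsed_title <- XML::xpathSApply(doc, "//title", XML::xmlValue)
}
```

Comparing parsed_title with title_utf8 on both Linux and Windows would show whether the remaining problem is in the conversion or in the parser.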
> But maybe the problem is more general, and related to conversion between
> encodings on Windows. What looks weird to me is that on Windows, I'm not
> able to save a character string to a file in UTF-8, despite what ?file
> says:
> x <- "Все права защищены"
> Encoding(x)
> # UTF-8
> cat(x, file = (con <- file("foo", "w", encoding="UTF-8"))); close(con)
> x2 <- readLines(con <- file("foo", "r", encoding="UTF-8")); close(con)
> Encoding(x2)
> # unknown
> x2
> # [1] "<U+041A><U+0443>..."
>
> I know the problem happens on write because the file cannot be read
> correctly on Linux either.
>
> This Windows machine uses Windows Server 2008 with French_France.1252
> locale.
>
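For the file example quoted above, a commonly used way to get UTF-8 bytes on disk regardless of the locale is to convert the string with enc2utf8() and write it with useBytes = TRUE, so that R does not re-encode it to the native encoding first. The sketch below assumes a temporary file (the name path is arbitrary) and checks the bytes directly rather than relying on how the console prints them. Whether it resolves the behaviour seen on that Windows Server 2008 machine has not been verified in this thread.

```r
# The same test string as in the quoted example, written with escapes
# so that the script itself does not depend on the source file encoding.
x <- paste0("\u0412\u0441\u0435 \u043f\u0440\u0430\u0432\u0430 ",
            "\u0437\u0430\u0449\u0438\u0449\u0435\u043d\u044b")
path <- tempfile(fileext = ".txt")

# Write the UTF-8 bytes as they are, without translation to the locale.
con <- file(path, open = "wb")
writeLines(enc2utf8(x), con, useBytes = TRUE)
close(con)

# Read the line back and mark it as UTF-8 instead of converting it.
x2 <- readLines(path, encoding = "UTF-8")

# Inspect the file at the byte level, independently of the console.
bytes_on_disk <- readBin(path, what = "raw", n = file.size(path))
expected_bytes <- c(charToRaw(enc2utf8(x)), as.raw(0x0a))
```

If bytes_on_disk and expected_bytes are identical on Windows too, the write step is fine and any remaining garbling comes from reading or printing.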
> > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> >         On Thursday 21 February 2013 at 18:31 +0400, Lawr Eskin wrote:
> >         > Hi Milan,
> >         >
> >         > a <- getURL(con, .encoding = "UTF-8")
> >         > Encoding(a)
> >         > > [1] "UTF-8"
> >         > a # Here the UTF-8 text looks fine.
> >         > htmlParse(a, encoding = "UTF-8") ### again the same encoding issue
> >
> >         And what if you try this:
> >         a2 <- htmlParse(sub("windows-1251", "UTF-8", a))
> >
> >         or this:
> >         a2 <- htmlParse(iconv(a, "windows-1251", "UTF-8"))
> >
> >
> >         Cheers
> >
> >
> >         > >>why didn't getURL() detect and set a's encoding correctly?
> >         > I think it is an issue with the page, because other sites work fine
> >         >
> >         > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> >         >         On Thursday 21 February 2013 at 16:04 +0400, Lawr Eskin wrote:
> >         >         > Hi Milan!
> >         >         >
> >         >         >
> >         >         > > Encoding(a)
> >         >         > [1] "unknown"
> >         >
> >         >         Hm, here I get "UTF-8", which is my locale encoding.
> >         >
> >         >         I've tried a little more, and I discovered that using
> >         >         a <- getURL(u, .encoding="UTF-8")
> >         >         ensures that a is in the correct encoding here. I know this is
> >         >         not your problem, but it might help: check whether Encoding(a)
> >         >         is set to "UTF-8" or not in that case, and whether this fixes
> >         >         things.
> >         >
> >         >         I'm not sure how htmlParse() detects the encoding when you pass
> >         >         it a character vector, but it probably uses Encoding(a), since
> >         >         that's the only reliable information; if it is missing, maybe it
> >         >         falls back to what the contents of the file say (maybe even
> >         >         before what the "encoding" argument says), which is windows-1251,
> >         >         and may not be the encoding in which getURL() saved the character
> >         >         vector. The question would then be: why didn't getURL() detect
> >         >         and set a's encoding correctly?
> >         >
> >         >
> >         >         My two cents
> >         >
> >         >
> >         >         > 2013/2/21 Milan Bouchet-Valat <nalimi...@club.fr>
> >         >         >         On Thursday 21 February 2013 at 13:16 +0400, Lawr Eskin wrote:
> >         >         >         > Hello dear R-help mailing list.
> >         >         >         >
> >         >         >         >
> >         >         >         > Looks like the same issue in Russian:
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > library(RCurl)
> >         >         >         >
> >         >         >         > library(XML)
> >         >         >         >
> >         >         >         > u = "http://www.cian.ru/cat.php?deal_type=2&obl_id=1&room1=1"
> >         >         >         >
> >         >         >         > a = getURL(u)
> >         >         >         >
> >         >         >         > a # Here - the Russian is fine.
> >         >         >         >
> >         >         >         > a2 <- htmlParse(a)
> >         >         >         >
> >         >         >         > a2 # Here it is a mess...
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > None of these seem to fix it:
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > htmlParse(a, encoding = "windows-1251")
> >         >         >         >
> >         >         >         > htmlParse(a, encoding = "CP1251")
> >         >         >         >
> >         >         >         > htmlParse(a, encoding = "cp1251")
> >         >         >         >
> >         >         >         > htmlParse(a, encoding = "iso8859-5")
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > This is my locale:
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > Sys.getlocale()
> >         >         >         >
> >         >         >         > "LC_COLLATE=Russian_Russia.1251;LC_CTYPE=Russian_Russia.1251;LC_MONETARY=Russian_Russia.1251;LC_NUMERIC=C;LC_TIME=Russian_Russia.1251"
> >         >         >         >
> >         >         >         >
> >         >         >         >
> >         >         >         > Any suggestions?
> >         >         >
> >         >         >         What does Encoding(a) say?
> >         >         >
> >         >         >
> >         >         >         (FWIW, here on Linux even a is not in the correct encoding:
> >         >         >         <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN"
> >         >         >         "http://www.w3.org/TR/REC-html40/loose.dtd">
> >         >         >         <html><head>
> >         >         >         <title>ГЉГіГЇГЁГІГј 
> > îäíîêîìíà òí
> >         ГіГѕ ГЄГўГ
> >         >         ðòèð
> >         >         >         Гі Гў ГЊГ®Г
> >         >         >         ±ГЄГўГҐ В— 11430 
> > îáúÿâëåíèé î
> >         ïðîäГ
> >         >         æå îäí
> >         >         >         îêîìí
> >         >         >         à òíûõ êâà 
> > ðòèð</title>
> >         >         >         [...])
> >         >         >
> >         >         >
> >         >         >         Regards
> >         >         >
> >         >         >
> >         >         >         > Thank you very much in advance,
> >         >         >         >
> >         >         >         >     Lavrentiy Eskin
> >         >         >
> >         >         >         >  <http://www.eng.nvg.ru>
> >         >         >         >
> >         >         >         >       [[alternative HTML version
> >         deleted]]
> >         >         >         >
> >         >         >         >
> >         ______________________________________________
> >         >         >         > R-help@r-project.org mailing list
> >         >         >         >
> >         https://stat.ethz.ch/mailman/listinfo/r-help
> >         >         >         > PLEASE do read the posting guide
> >         >         >
> >         http://www.R-project.org/posting-guide.html
> >         >         >         > and provide commented, minimal,
> >         self-contained,
> >         >         reproducible
> >         >         >         code.
> >         >         >
> >         >         >
> >         >
> >         >
> >         >
> >
> >
> >
>
>
