Re: [R] reading and frequency analysis of Spanish text

David Winsemius Wed, 05 Aug 2009 11:40:12 -0700

When I open that link in OpenOffice.org Writer and then save in "Text Encoded" format with "Unicode" encoding, the diacriticals (is that the correct font-ish term?) seem to remain intact when re-opened. When I read that file in, not with scan() but with readLines(), here is what I get for the second string:

langren.txt <- readLines("/Users/davidwinsemius/Downloads/Verdadera-spanish-stripped-1.txt", encoding="UTF-8")

 langren.txt[2]

[2] "MIGUEL FLORENCIO VAN LANGREN Matemático y cosmógrafo de su Majestad presenta las siguientes consideraciones de la Longitud por Mar y Tierra; y dice que su Padre y Abuelo fueron astrónomos y geógrafos, y en particular su padre asistió a las observaciones celestes realizadas por el famoso astrónomo Ticho Brahe, de quien recibió sus primeras observaciones, como consta por las obras del dicho Ticho. Así mismo su padre sirvió a su majestad como cosmógrafo en Flandes. Y el dicho VAN LANGREN, a imitación de sus antepasados, ha ejercitado en esas artes y descubierto cosas que no se sabían sobre la verdadera longitud por mar y tierra, apoyándose más en lo esencial que en lo especulativo. Y habiéndolo propuesto a la infanta Isabel, muy aficionada a dichas artes, ella le recomendó al rey por una carta en 1629 (página 9 de este documento), para que le encargase corregir la geografía. Su majestad lo aprobó por una real cédula, debido a los enormes errores que muestran las distancias calculadas por eminentes astrónomos y geógrafos entre Toledo y Roma, tal como se muestra en esta línea, por la cual se pueden conjeturar los errores entre lugares más distantes."
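If you want to confirm that the accented characters really survived, something along these lines might do. It is only a sketch: the temporary file and the short sample string are stand-ins I made up for the saved .txt file, and validUTF8() needs a reasonably recent R (3.3.0 or later, I believe).

```r
# Stand-in for the saved text file: one short UTF-8 line with accents.
sample_line <- "Matem\u00e1tico y cosm\u00f3grafo"
tmp <- tempfile(fileext = ".txt")
con <- file(tmp, open = "wb")
# useBytes = TRUE writes the UTF-8 bytes as they are, whatever the locale.
writeLines(enc2utf8(sample_line), con, useBytes = TRUE)
close(con)

# Read it back, marking the strings as UTF-8 (as in the readLines() call above).
txt <- readLines(tmp, encoding = "UTF-8")

Encoding(txt)    # the encoding mark R attached to each string
validUTF8(txt)   # whether the bytes form valid UTF-8
```

If validUTF8() comes back FALSE on the real file, the file was probably saved in some other encoding (latin1, say), and the encoding= argument would need to match.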


Mind you this was on a Mac so the usual cross-platform caveats apply:

> sessionInfo()
R version 2.9.1 Patched (2009-07-04 r48897)
x86_64-apple-darwin9.7.0

locale:
en_US.UTF-8/en_US.UTF-8/C/C/en_US.UTF-8/en_US.UTF-8

attached base packages:

[1] splines stats graphics grDevices utils datasets methods base


other attached packages:

[1] lattice_0.17-25 MASS_7.2-46 plotrix_2.6-4 plyr_0.1.9 Design_2.1-2 survival_2.35-4

[7] Hmisc_3.5-2

loaded via a namespace (and not attached):
[1] cluster_1.12.0 grid_2.9.1     tools_2.9.1

--
DW


On Aug 5, 2009, at 2:19 PM, Michael Friendly wrote:

For an historical paper I'm working on, I have some Spanish plaintext, presently in the form of a Word .doc file,
http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc
and also some ciphered text from the same original source. The ultimate goal is to use some frequency analysis of letters and word lengths in the plaintext to help decode the ciphered text.
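For what it's worth, the letter and word-length counts described here could be sketched roughly as follows once the text is in R as a character string. The sample string is just a fragment of the passage, and all the variable names are made up for illustration. Note that tolower() on accented capitals depends on the locale.

```r
# A short fragment of the plaintext as a stand-in for the full document.
txt <- "Matem\u00e1tico y cosm\u00f3grafo de su Majestad"

# Letter frequencies: lower-case, split into single characters,
# then drop spaces, punctuation and digits.
chars <- strsplit(tolower(txt), "")[[1]]
letters_only <- chars[!grepl("[[:space:][:punct:][:digit:]]", chars)]
letter_freq <- sort(table(letters_only), decreasing = TRUE)

# Word-length frequencies: split on runs of spaces or punctuation,
# drop empty pieces, then tabulate the number of characters per word.
words <- strsplit(txt, "[[:space:][:punct:]]+")[[1]]
words <- words[nzchar(words)]
word_len_freq <- table(nchar(words))

letter_freq
word_len_freq
```

Accented letters are counted separately from their unaccented forms here (á is not a); whether that is what you want for the cipher work is a judgment call.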
For now, I'm stuck on how to read the Spanish plaintext into R as a text string, given that it is in a Word .doc file using some form of latin1 encoding. From Word, I can Save As ... plain text (.txt), but I'm worried about losing character encoding information and I don't see anything in the list of Other encodings presented that seems helpful.
A naive attempt to read the .doc file directly gives:
> langren.sp.file <- "http://euclid.psych.yorku.ca/SCS/Gallery/images/Private/Langren/Verdadera-spanish-stripped.doc"
>
> langren.txt <- scan(langren.sp.file, encoding="latin1")
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
scan() expected 'a real', got 'ÐÏà¡±á'
>

Can someone help?
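On the scan() error quoted above: scan() defaults to what = double(), hence "expected 'a real'", and the characters it choked on look like the opening bytes of the binary container that legacy Word .doc files use, so no encoding= setting will turn the .doc into text. A rough sketch of checking for that signature follows. is_ole_file() is a hypothetical helper, not part of any package, and the two temporary files are made up for the demonstration.

```r
# Hypothetical helper: TRUE if the file starts with the 8-byte signature
# of an OLE compound file (the container used by legacy Word .doc files).
is_ole_file <- function(path) {
  ole_magic <- as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1))
  header <- readBin(path, what = "raw", n = 8L)
  identical(header, ole_magic)
}

# Demonstration on two temporary files: one starting with the signature,
# one holding ordinary text.
doc_like <- tempfile(fileext = ".doc")
writeBin(as.raw(c(0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1, 0x00)), doc_like)
txt_like <- tempfile(fileext = ".txt")
writeLines("texto llano", txt_like)

is_ole_file(doc_like)
is_ole_file(txt_like)
```

If the check is TRUE, the file needs converting to plain text first (in Word or OpenOffice.org, for instance) before readLines() or scan(what = "character") can do anything useful with it.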

--
Michael Friendly     Email: friendly AT yorku DOT ca
Professor, Psychology Dept.
York University      Voice: 416 736-5115 x66249 Fax: 416 736-5814
4700 Keele Street    http://www.math.yorku.ca/SCS/friendly.html
Toronto, ONT  M3J 1P3 CANADA


David Winsemius, MD
Heritage Laboratories
West Hartford, CT

______________________________________________
R-help mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
