Hi,

I ran into a serious issue while analysing some Bulgarian texts with the tm package. I import a tab-separated CSV file holding 22 variables, most of them text columns (character, not factors), using the read.delim function:

data<-read.delim("bigcompanies_ascii.csv",
                header=TRUE,
                quote="'",
                sep="\t",
                as.is=TRUE,
                encoding='CP1251',
                fileEncoding='CP1251')

(I also tried the above with UTF-8 encoding on a UTF-8-saved file.)

I have my list of stop words written in a separate text file, one word per line, which I read into R using the scan function:

stoplist<-scan(file='stoplist_ascii.txt',
               what='character',
               strip.white=TRUE,
               blank.lines.skip=TRUE,
               fileEncoding='CP1251',
               encoding='CP1251')

(also tried with UTF-8 here on a correspondingly encoded file)
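
To make clear what I mean by "encoding" here, below is a small base-R sketch on a made-up one-word example (not my real stop list). It assumes that iconv on the platform knows the name "CP1251". As far as I understand the documentation, the encoding argument of scan only marks strings as Latin-1 or UTF-8, while fileEncoding actually re-encodes the input, so the sketch shows the difference between an encoding mark and a real conversion:

```r
# Made-up example: "\u0438" is the Bulgarian word for "and".
# The \u escape keeps the script independent of its own file encoding.
word_utf8 <- "\u0438"
Encoding(word_utf8)                 # the declared encoding mark of the string

# Re-encode to CP1251 and back; this changes the bytes, not just the mark.
# Assumption: the platform's iconv supports the name "CP1251".
word_cp1251 <- iconv(word_utf8, from = "UTF-8", to = "CP1251")
round_trip  <- iconv(word_cp1251, from = "CP1251", to = "UTF-8")

nchar(word_utf8, type = "bytes")    # 2 bytes in UTF-8
nchar(word_cp1251, type = "bytes")  # 1 byte in CP1251
identical(round_trip, word_utf8)    # the round trip should restore the word
```

Running Encoding(stoplist) and Encoding(data$variable) in the same way would show whether the two sides carry the same mark.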

For now I am testing with a corpus built from the contents of a single variable, constructed from a VectorSource. When I run inspect on it, everything looks fine and the text displays properly, with the Unicode characters intact:

data.corpus<-Corpus(VectorSource(data$variable,encoding='UTF-8'),
                   readerControl=list(language='bulgarian'))

However, whichever encoding I select (UTF-8, or CP1251, which is the typical code page for Bulgarian texts), I cannot remove the stop words from my corpus. The problem occurs on both Linux and Windows, on every computer I use R on, so I don't think it is caused by a bad configuration. Removal of punctuation, white space, and numbers works flawlessly, but the failure to remove stop words prevents me from analysing the texts any further.
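
For reference, the result I am after can be sketched in base R, without tm, on made-up sample strings. This is only an illustration, not my actual pipeline: it assumes both the texts and the stop list can be normalised to UTF-8, and remove_stopwords is a hypothetical helper I wrote for this post, not a tm function. It matches whole whitespace-separated tokens rather than using regular-expression word boundaries:

```r
# Sketch only: base R, no tm. The sample strings are invented for illustration.
# \u escapes keep the example independent of the script file's encoding.
texts <- c("\u0442\u043e\u0432\u0430 \u0435 \u0444\u0438\u0440\u043c\u0430",   # "tova e firma"
           "\u0438 \u0442\u044f \u0435 \u0433\u043e\u043b\u044f\u043c\u0430")  # "i tya e golyama"
stoplist <- c("\u0442\u043e\u0432\u0430", "\u0435", "\u0438", "\u0442\u044f")

# Normalise both sides to UTF-8 so the comparison is like-for-like.
texts    <- enc2utf8(texts)
stoplist <- enc2utf8(stoplist)

# Hypothetical helper: split on whitespace and drop exact stop-word tokens.
remove_stopwords <- function(x, stops) {
  tokens <- strsplit(x, "[[:space:]]+")
  vapply(tokens,
         function(tok) paste(tok[!(tok %in% stops)], collapse = " "),
         character(1))
}

cleaned <- remove_stopwords(texts, stoplist)
print(cleaned)
```

With tm I would expect the equivalent outcome from removing the words in stoplist from data.corpus, which is the step that fails for me.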

Does anybody have experience with languages other than English for which no predefined stop list is available through the stopwords function? I would highly appreciate any tips and advice!

Thanks in advance,
Vince
