I have narrowed the problem down to this:

NGramTokenizer("-", control = Weka_control(min = 1, max = 4))

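For anyone who wants to hunt for such a segment in their own data, here is a rough sketch of one way to do it. It is only an illustration: find_failing_segments is a name made up for this message and is not part of RWeka, and I am assuming that the Java exception reaches R as an ordinary error that tryCatch can intercept. The tokenizer is passed in as an argument, so the helper itself does not depend on RWeka.

```r
# Hypothetical helper (not part of RWeka): apply `tokenizer` to each segment
# separately and record which (sentence, segment) pairs signal an error.
find_failing_segments <- function(sentences, tokenizer, ...) {
  hits <- list()
  for (i in seq_along(sentences)) {
    for (j in seq_along(sentences[[i]])) {
      # TRUE if the call completes, FALSE if it signals an error
      ok <- tryCatch({
        tokenizer(sentences[[i]][j], ...)
        TRUE
      }, error = function(e) FALSE)
      if (!ok) {
        hits[[length(hits) + 1L]] <- data.frame(
          sentence = i, segment = j, text = sentences[[i]][j],
          stringsAsFactors = FALSE
        )
      }
    }
  }
  do.call(rbind, hits)  # NULL when nothing failed
}

# Intended use with the objects from this thread (needs RWeka and Java):
# find_failing_segments(lseg, NGramTokenizer,
#                       control = Weka_control(min = 1, max = 4))
```

Calling the tokenizer once per segment is slow on a list of this size, but it gives the sentence index and the segment index of every input that fails, not only the first one.
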
The string "-" actually occurs as the fourth segment in the 21,226th sentence. I find this strange, since I am using the default delimiters ' \r\n\t.,;:'"()?!', which do not contain a hyphen.

Regards,
Richard

On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote
> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit
> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
>
> I am encountering the following error in RWeka:
>
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
> .jcast(tokenizer, : java.lang.StringIndexOutOfBoundsException:
> String index out of range: 1
>
> Here is the code that is causing the problem:
>
> > library(rJava)
> > (.jinit(parameters = "-Xmx3000m"))
> > library(RWeka)
> > wctrl <- Weka_control(min = 1, max = 4)
> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
>
> lseg is a list of 965193 sentences, each of which consists of one or
> more segments. For example, lseg[[1]] is
>
> [[1]]
> [1] "calculation of results xxxx activity is defined as the increase
> in radioactivity "
> [2] "in dpm"
> [3] "in the pellet "
> [4] "xxx"
> [5] ""
> [6] "caused by the addition of xx xxxx"
>
> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
> Is there any way to solve or circumvent the error? In Java
> Preferences on the Mac I have specified for applications Java SE 6
> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
>
> (Side remark: I'm surprised that it only does this for the first
> and last segments of the first sentence. Admittedly, the other
> segments have fewer than 4 grams, but that should not stop it from
> building n-grams consisting of fewer grams!)
>
> Thanks,
> Richard
> ------
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
>
> Tel.: +41 61 331 10 47
> Email: richard....@pueo-owl.ch

--
Richard R. Liu
Dittingerstr. 33
CH-4053 Basel
Switzerland

Tel.: +41 61 331 10 47
Email: richard....@pueo-owl.ch

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac