>>>>> Richard R Liu writes:

> I have narrowed the problem down to this:

>> NGramTokenizer("-", control = wctrl)
> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
>   .jcast(tokenizer, :
>   java.lang.StringIndexOutOfBoundsException: String index out of range: 1
> Indeed, the 21226th sentence contains a segment composed of a single
> hyphen.  I am using the default delimiters of the WEKA control.  The
> hyphen is thus not a delimiter.  A segment consisting of two
> consecutive hyphens ("--") does not cause the exception.

Thanks.  This seems to be a bug in Weka itself, so there is not really a
lot I can do: perhaps you can report the problem to the upstream
maintainers?

Best
-k

> Regards,
> Richard

> On Tue, 12 Jan 2010 16:50:16 +0100, Richard R. Liu wrote
>> I am running R version 2.10.1 Patched (2010-01-07 r50940) in 64-bit
>> mode under Mac OS X 10.5.8 on a MacBook Pro with 8GB RAM.
>>
>> I am encountering the following error in RWeka:
>>
>> Error in .jcall("weka/core/tokenizers/Tokenizer", "[S", "tokenize",
>>   .jcast(tokenizer, : java.lang.StringIndexOutOfBoundsException:
>>   String index out of range: 1
>>
>> Here is the code that is causing the problem:
>>
>> > library(rJava)
>> > (.jinit(parameters = "-Xmx3000m"))
>> > library(RWeka)
>> > wctrl <- Weka_control(min = 1, max = 4)
>> > lseg.4gram <- lapply(lseg, NGramTokenizer, control = wctrl)
>>
>> lseg is a list of 965193 sentences, each of which consists of one or
>> more segments.  For example, lseg[[1]] is
>>
>> [[1]]
>> [1] "calculation of results xxxx activity is defined as the increase in radioactivity "
>> [2] "in dpm"
>> [3] "in the pellet "
>> [4] "xxx"
>> [5] ""
>> [6] "caused by the addition of xx xxxx"
>>
>> lapply should build 1-, 2-, 3- and 4-grams of each sentence segment.
>> Is there any way to solve or circumvent the error?  In Java
>> Preferences on the Mac I have specified for applications Java SE 6
>> 64-bit, then J2SE 5.0 64-bit, before other 32-bit versions.
>>
>> (Side remark: I'm surprised that it only does this for the first
>> and last segments of the first sentence.  Admittedly, the other
>> segments have less than 4 grams, but that should not stop it from
>> building n-grams consisting of fewer grams!)
>>
>> Thanks,
>> Richard
>> ------
>> Richard R. Liu
>> Dittingerstr. 33
>> CH-4053 Basel
>> Switzerland
>>
>> Tel.: +41 61 331 10 47
>> Email: richard....@pueo-owl.ch

> --
> Richard R. Liu
> Dittingerstr. 33
> CH-4053 Basel
> Switzerland
> Tel.: +41 61 331 10 47
> Email: richard....@pueo-owl.ch

_______________________________________________
R-SIG-Mac mailing list
R-SIG-Mac@stat.math.ethz.ch
https://stat.ethz.ch/mailman/listinfo/r-sig-mac