On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono <suharto_angg...@yahoo.com> wrote:
> \b is word boundary.
> But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
>
>> strsplit("dia ma", "\\b")
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>> strsplit("dia ma", "\\b", perl=TRUE)
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>
> How can that be?
>
> This is the output of 'gregexpr'.
>
>> gregexpr("\\b", "dia ma")
> [[1]]
> [1] 1 2 3 4 5 6
> attr(,"match.length")
> [1] 0 0 0 0 0 0
>
>> gregexpr("\\b", "dia ma", perl=TRUE)
> [[1]]
> [1] 1 4 5 7
> attr(,"match.length")
> [1] 0 0 0 0
>
>
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I
> expect 'strsplit' to split at that points.

You can use strapply in the gsubfn package to match all words and
non-words:

library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c)
# c("dia", " ", "ma")

or all spaces and non-spaces:

strapply("dia ma", "\\s+|\\S+", c)
# c("dia", " ", "ma")
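
For comparison, here is a rough base-R sketch of the same match-instead-of-split idea. It is not part of the suggestion above and does not use gsubfn. It assumes an R version that provides regmatches(), and the variable names x, tokens and tokens_ws are only illustrative.

```r
# Assumption: regmatches() is available in the installed version of R.
x <- "dia ma"

# Match runs of word characters or runs of non-word characters,
# then extract the matched substrings for the first (only) input string.
tokens <- regmatches(x, gregexpr("\\w+|\\W+", x))[[1]]
print(tokens)

# Same idea with runs of whitespace versus runs of non-whitespace.
tokens_ws <- regmatches(x, gregexpr("\\s+|\\S+", x))[[1]]
print(tokens_ws)
```

Both calls should give the three pieces "dia", " " and "ma", since every character falls into exactly one of the two alternating classes.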