On Thu, Jul 8, 2010 at 4:15 AM, Suharto Anggono Suharto Anggono
<suharto_angg...@yahoo.com> wrote:
> \b is word boundary.
> But, unexpectedly, strsplit("dia ma", "\\b") splits character by character.
>
>> strsplit("dia ma", "\\b")
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>> strsplit("dia ma", "\\b", perl=TRUE)
> [[1]]
> [1] "d" "i" "a" " " "m" "a"
>
>
> How can that be?
>
> This is the output of 'gregexpr'.
>
>> gregexpr("\\b", "dia ma")
> [[1]]
> [1] 1 2 3 4 5 6
> attr(,"match.length")
> [1] 0 0 0 0 0 0
>
>> gregexpr("\\b", "dia ma", perl=TRUE)
> [[1]]
> [1] 1 4 5 7
> attr(,"match.length")
> [1] 0 0 0 0
>
>
> The output from gregexpr("\\b", "dia ma", perl=TRUE) is what I expect. I
> expect 'strsplit' to split at those points.

You can use strapply in the gsubfn package to match all words and non-words:

library(gsubfn)
strapply("dia ma", "\\w+|\\W+", c)     # list(c("dia", " ", "ma"))

or all spaces and non-spaces:

strapply("dia ma", "\\s+|\\S+", c)     # list(c("dia", " ", "ma"))
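
If you would rather stay in base R, the same idea can be sketched with gregexpr and regmatches. Instead of splitting on the zero-length match "\\b", it extracts the runs of word and non-word characters. This assumes an R version that provides regmatches. The helper name split_runs is made up for illustration and is not part of any package. As far as I understand the algorithm described in ?strsplit, a zero-length match makes strsplit peel off one character at a time, which would explain the character-by-character output above, so matching the pieces avoids the problem.

```r
# Hypothetical helper (name invented for this example): return, for each
# input string, its runs of word characters and non-word characters.
split_runs <- function(x, pattern = "\\w+|\\W+") {
  # gregexpr finds every match; regmatches extracts the matched text.
  regmatches(x, gregexpr(pattern, x, perl = TRUE))
}

split_runs("dia ma")[[1]]              # "dia" " "   "ma"
split_runs("dia ma", "\\s+|\\S+")[[1]] # same pieces, split on spaces vs non-spaces
```

Like strapply, this returns a list with one element per input string, so it also works on a character vector.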

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.