On 04/12/2022 00:25, Hadley Wickham wrote:
On Sun, Dec 4, 2022 at 12:50 PM Hervé Pagès <hpages.on.git...@gmail.com> wrote:
On 03/12/2022 07:21, Bert Gunter wrote:
Perhaps it is worth pointing out that looping constructs like lapply() can
be avoided and the procedure vectorized by mimicking Martin Morgan's
solution:

## s is the string to be searched.
diff(c(0,grep('b',strsplit(s,'')[[1]])))

However, Martin's solution is simpler and likely even faster as the regex
engine is unneeded:

diff(c(0, which(strsplit(s, "")[[1]] == "b"))) ## completely vectorized

This seems much preferable to me.
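
To make the which() version concrete, here is a minimal sketch on a short input. The string `s` below is a made-up example, not data from this thread; the intermediate names `chars`, `pos` and `gaps` are only there for illustration.

```r
# Hypothetical small input, chosen only for illustration.
s <- "abaaabbaaaaab"

chars <- strsplit(s, "", fixed = TRUE)[[1]]  # one element per character
pos   <- which(chars == "b")                 # index of every "b"
gaps  <- diff(c(0, pos))                     # distance from the previous "b" (or from the start)

pos   # 2 6 7 13
gaps  # 2 4 1 6
```

Prepending 0 is what makes the first gap count from the start of the string rather than being dropped by diff().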
Of all the proposed solutions, Andrew Hart's solution seems the most
efficient:

    big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

    system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
    #    user  system elapsed
    #   0.736   0.028   0.764

    system.time(diff(c(0, which(strsplit(big_string, "", fixed=TRUE)[[1]] == "b"))))
    #    user  system elapsed
    #  2.100   0.356   2.455

The bigger the string, the bigger the gap in performance.

Also, the bigger the average gap between 2 successive b's, the bigger
the gap in performance.
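
One plausible reason for both observations (my reading, not something measured in this thread): splitting on "" creates one piece per character, while splitting on "b" creates only one piece per "b". A rough sketch that counts the pieces, using a shorter string than the benchmark so it runs instantly:

```r
# Same pattern as the benchmark, but only 1000 repeats (an arbitrary choice).
s <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 1000)

chars         <- strsplit(s, "", fixed = TRUE)[[1]]
n_char_pieces <- length(chars)                                # one piece per character
n_b_pieces    <- length(strsplit(s, "b", fixed = TRUE)[[1]])  # one piece per "b"

n_char_pieces == nchar(s)   # TRUE
n_b_pieces                  # number of b's, since the string ends in "b"
n_char_pieces / n_b_pieces  # grows with the average gap between b's
```

The ratio on the last line is the factor by which the per-character split allocates more strings, which would explain why longer gaps widen the difference.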

Finally: always use fixed=TRUE in strsplit() if you don't need to use
the regex engine.
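
Besides speed, fixed=TRUE also protects against the split string being interpreted as a pattern. A small illustration (the input "a.b.c" is the usual textbook case, not something from this thread):

```r
x <- "a.b.c"

regex_split <- strsplit(x, ".")[[1]]                # "." is a regex wildcard here
fixed_split <- strsplit(x, ".", fixed = TRUE)[[1]]  # "." is a literal dot here

regex_split  # five empty strings, because every character matches "."
fixed_split  # "a" "b" "c"
```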
You can do a bit better if you are willing to use stringr:

library(stringr)
big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 500000)

system.time(nchar(strsplit(big_string, split="b", fixed=TRUE)[[1]]) + 1)
#>    user  system elapsed
#>   0.126   0.002   0.128

system.time(str_length(str_split(big_string, fixed("b"))[[1]]))
#>    user  system elapsed
#>   0.103   0.004   0.107
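
A caveat when comparing these: the split-based approach only matches the which() approach when the string ends in "b", as the benchmark string does. As far as I know, stringr's str_split() also keeps a trailing empty piece that base strsplit() drops, and the str_length() call above has no `+ 1`, so the two timed expressions do not return identical vectors. Below is a base-R sketch of a split-based version that handles a string not ending in "b"; `gaps_by_split` and `gaps_by_which` are hypothetical helper names, not functions from any package.

```r
# Hypothetical helper: gaps between successive b's via splitting on "b".
gaps_by_split <- function(s) {
  pieces <- strsplit(s, "b", fixed = TRUE)[[1]]
  # strsplit() drops a trailing empty piece, so a leftover last piece exists
  # only when the string does not end in "b"; discard it.
  if (!endsWith(s, "b")) pieces <- pieces[-length(pieces)]
  nchar(pieces) + 1
}

# Hypothetical helper: the which()-based version, used as the reference.
gaps_by_which <- function(s) {
  diff(c(0, which(strsplit(s, "", fixed = TRUE)[[1]] == "b")))
}

gaps_by_split("abaabaa")  # 2 3
gaps_by_which("abaabaa")  # 2 3
```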

(And my timings also suggest that it's time for Hervé to get a new computer :P)

LOL

Actually my timings were for

  big_string <- strrep("abaaabbaaaaabaaabaaaaaaaaaaaaaaaaaaab", 1500000)

but I mixed things up when I copy-pasted them into my email.

That said, I do still need a new laptop, and I'm in the process of getting one ;-)

H.

--
Hervé Pagès

Bioconductor Core Team
hpages.on.git...@gmail.com

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
