An optimized version of substring/substr is now in R-devel (76172).

Best,
Tomas

On 2/22/19 8:16 PM, Tomas Kalibera wrote:
On 2/20/19 7:55 PM, Toby Hocking wrote:
Update: I have observed that stringi::stri_sub is linear time complexity,
and it computes the same thing as base::substring. figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

To me this is a clear indication of a bug in substring, but again it would
be nice to have some feedback/confirmation before posting on bugzilla.

Also this suggests a fix -- just need to copy whatever stringi::stri_sub is
doing.

Thanks for the report, I am working on a patch that will address this.

I confirm there is a lot of potential for speedup. On my system,

'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'

spends 96% time in checking if the string is ascii and 3% in strlen(); if we take advantage of the pre-computed value in the ASCII bit, the speed up is about 40x. Of course, with micro-benchmarks, any performance limitation can be arbitrarily inflated, users cannot expect to see these or any close speedups in applications as a result of the patch. The patch is going to do other easy optimizations that will not complicate the code, including avoiding the strlen() call (taking advantage of pre-computed length of R character object).

Best
Tomas




On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdho...@gmail.com> wrote:

Hi all, (and especially hi to Tomas Kalibera who accepted my patch sent
yesterday)

I believe that I have found another bug, this time in the substring
function. The use case that I am concerned with is when there is a single (character scalar) text/subject, and many substrings to extract. For example

substring("AAAA", 1:4, 1:4)

or more generally,

N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)

The problem I observe is that the time complexity is quadratic in N, as
shown on this figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R

I expected the time complexity to be linear in N.

The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured
sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas
other packages (and the gregexpr function with perl=TRUE after applying the
patch discussed yesterday) have linear time complexity. I believe the
problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R

I suspect that a fix can be accomplished by optimizing the implementation
of substring, for the special case when the text/subject is a single
element (character scalar). Right now I notice that the substring R code uses rep_len so that the text/subject which is passed to the C code is a
character vector with the same length as the number of substrings to
extract. Maybe the C code is calling strlen for each of these (identical)
text/subject elements?

Anyway, it would be useful to have some feedback to make sure this is
indeed a bug before I post on bugzilla. (btw thanks Martin for signing me
up for an account)

Toby

    [[alternative HTML version deleted]]

______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel



______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel

Reply via email to