An optimized version of substring/substr is now in R-devel (76172).
Best,
Tomas
On 2/22/19 8:16 PM, Tomas Kalibera wrote:
On 2/20/19 7:55 PM, Toby Hocking wrote:
Update: I have observed that stringi::stri_sub has linear time complexity, and it computes the same thing as base::substring. Figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
Source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
To me this is a clear indication of a bug in substring, but again it would be nice to have some feedback/confirmation before posting on bugzilla. This also suggests a fix -- just copy whatever stringi::stri_sub is doing.
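The claim that the two functions compute the same thing can be checked directly (a minimal sketch, assuming the stringi package is installed; stri_sub recycles its arguments the same way substring does):

```r
library(stringi)

x <- "ABCDEFG"
# Both calls should split x into its 7 single-character substrings.
identical(substring(x, 1:7, 1:7),
          stri_sub(x, 1:7, 1:7))
```

If the two results are identical, stri_sub is a drop-in replacement for this use case, differing only in performance.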
Thanks for the report, I am working on a patch that will address this. I confirm there is a lot of potential for speedup. On my system,
'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
spends 96% of its time checking whether the string is ASCII and 3% in strlen(); if we take advantage of the pre-computed value in the ASCII bit, the speedup is about 40x. Of course, with micro-benchmarks any performance limitation can be arbitrarily inflated, so users should not expect speedups anywhere near these in applications as a result of the patch. The patch will also make other easy optimizations that do not complicate the code, including avoiding the strlen() call (taking advantage of the pre-computed length of the R character object).
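Tomas's one-liner can be turned into a small timing sketch to observe the scaling behavior yourself (a minimal illustration; the exact numbers depend on the R version and machine):

```r
# Measure how long substring() takes to extract N single-character
# substrings from a single string of length N.
time_substring <- function(N) {
  subject <- paste(rep("A", N), collapse = "")
  system.time(substring(subject, 1:N, 1:N))[["elapsed"]]
}

for (N in c(25000, 50000, 100000, 200000)) {
  cat(sprintf("N = %6d: %.3f s\n", N, time_substring(N)))
}
```

Before the patch, the elapsed time roughly quadruples each time N doubles (quadratic scaling); after the optimization it should grow roughly linearly with N.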
Best
Tomas
On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdho...@gmail.com> wrote:
Hi all (and especially hi to Tomas Kalibera, who accepted my patch sent yesterday),
I believe that I have found another bug, this time in the substring function. The use case that I am concerned with is when there is a single (character scalar) text/subject and many substrings to extract. For example,
substring("AAAA", 1:4, 1:4)
or more generally,
N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
The problem I observe is that the time complexity is quadratic in N, as shown in this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
Source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
I expected the time complexity to be linear in N.
The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas other packages (and the gregexpr function with perl=TRUE, after applying the patch discussed yesterday) have linear time complexity. I believe the problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
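The gregexpr-then-substring pattern that these packages rely on can be sketched as follows (a simplified illustration of the general technique, not the actual code of any of the packages named above):

```r
# Locate all matches of a pattern in a single subject string with
# gregexpr(), then pull them out with one vectorized substring() call.
extract_matches <- function(subject, pattern) {
  m <- gregexpr(pattern, subject, perl = TRUE)[[1]]
  if (m[1] == -1L) return(character(0))          # no matches found
  starts <- as.integer(m)
  ends   <- starts + attr(m, "match.length") - 1L
  substring(subject, starts, ends)               # one subject, many substrings
}

# All runs of digits in the subject:
extract_matches("chr1:100-200 chr2:300-400", "[0-9]+")
```

The final substring() call is exactly the "single subject, many substrings" case, which is why its quadratic behavior dominates these packages' running time on long subjects with many matches.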
I suspect that a fix can be accomplished by optimizing the implementation of substring for the special case when the text/subject is a single element (character scalar). Right now I notice that the substring R code uses rep_len, so the text/subject which is passed to the C code is a character vector with the same length as the number of substrings to extract. Maybe the C code is calling strlen for each of these (identical) text/subject elements?
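For reference, the R-level definition of substring does use rep_len to recycle the subject (this is approximately the base R 3.x source, from memory; print substring at the prompt in your own session to see the exact current definition):

```r
# Approximate R-level source of base::substring (R 3.x era).
# The rep_len() call recycles a scalar subject into a vector of
# length n before handing it to the internal C code.
substring <- function(text, first, last = 1000000L) {
  if (!is.character(text)) text <- as.character(text)
  n <- max(lt <- length(text), length(first), length(last))
  if (lt && lt < n) text <- rep_len(text, length.out = n)
  .Internal(substr(text, as.integer(first), as.integer(last)))
}
```

So when a single subject is recycled n times, any per-element work in the C code (such as a strlen() or an ASCII scan over the whole string) is repeated n times, which would explain the quadratic behavior.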
Anyway, it would be useful to have some feedback to make sure this is indeed a bug before I post on bugzilla. (Btw, thanks Martin for signing me up for an account.)
Toby
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel