An optimized version of substring/substr is now in R-devel (76172).
Best,
Tomas
On 2/22/19 8:16 PM, Tomas Kalibera wrote:
On 2/20/19 7:55 PM, Toby Hocking wrote:
Update: I have observed that stringi::stri_sub has linear time complexity, and it computes the same thing as base::substring. Figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
Source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
To me this is a clear indication of a bug in substring, but again it would be nice to have some feedback/confirmation before posting on bugzilla. This also suggests a fix -- just copy whatever stringi::stri_sub is doing.
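The claim that the two functions compute the same thing can be checked directly (a minimal sketch, assuming the stringi package is installed; stri_sub recycles its arguments the same way substring does):

```r
library(stringi)

x <- "ABCDEFG"
# Both calls should split x into its 7 single-character substrings.
identical(substring(x, 1:7, 1:7),
          stri_sub(x, 1:7, 1:7))
```

If the two results are identical, stri_sub is a drop-in replacement for this use case, differing only in performance.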
Thanks for the report, I am working on a patch that will address this. I confirm there is a lot of potential for speedup. On my system,
'N=200000; x <- substring(paste(rep("A", N), collapse=""), 1:N, 1:N)'
spends 96% of its time checking whether the string is ASCII and 3% in strlen(); if we take advantage of the pre-computed value in the ASCII bit, the speedup is about 40x. Of course, with micro-benchmarks any performance limitation can be arbitrarily inflated, so users should not expect speedups anywhere near these in applications as a result of the patch. The patch will also make other easy optimizations that do not complicate the code, including avoiding the strlen() call (taking advantage of the pre-computed length of the R character object).
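Tomas's one-liner can be turned into a small timing sketch to observe the scaling behavior yourself (a minimal illustration; the exact numbers depend on the R version and machine):

```r
# Measure how long substring() takes to extract N single-character
# substrings from a single string of length N.
time_substring <- function(N) {
  subject <- paste(rep("A", N), collapse = "")
  system.time(substring(subject, 1:N, 1:N))[["elapsed"]]
}

for (N in c(25000, 50000, 100000, 200000)) {
  cat(sprintf("N = %6d: %.3f s\n", N, time_substring(N)))
}
```

Before the patch, the elapsed time roughly quadruples each time N doubles (quadratic scaling); after the optimization it should grow roughly linearly with N.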
Best
Tomas
On Wed, Feb 20, 2019 at 11:16 AM Toby Hocking <tdho...@gmail.com> wrote:
Hi all (and especially hi to Tomas Kalibera, who accepted my patch sent yesterday),
I believe that I have found another bug, this time in the substring function. The use case that I am concerned with is when there is a single (character scalar) text/subject and many substrings to extract. For example,
substring("AAAA", 1:4, 1:4)
or more generally,
N=1000
substring(paste(rep("A", N), collapse=""), 1:N, 1:N)
The problem I observe is that the time complexity is quadratic in N, as shown in this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.png
Source:
https://github.com/tdhock/namedCapture-article/blob/master/figure-substring-bug.R
I expected the time complexity to be linear in N.
The example above may seem contrived/trivial, but it is indeed relevant to a number of packages (rex, rematch2, namedCapture) which provide functions that use gregexpr and then substring to extract the text in the captured sub-patterns. The figure
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.png
shows the issue: these packages have quadratic time complexity, whereas other packages (and the gregexpr function with perl=TRUE, after applying the patch discussed yesterday) have linear time complexity. I believe the problem is the substring function. Source for this figure:
https://github.com/tdhock/namedCapture-article/blob/master/figure-trackDb-pkgs.R
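The gregexpr-then-substring pattern that these packages rely on can be sketched as follows (a simplified illustration of the general technique, not the actual code of any of the packages named above):

```r
# Locate all matches of a pattern in a single subject string with
# gregexpr(), then pull them out with one vectorized substring() call.
extract_matches <- function(subject, pattern) {
  m <- gregexpr(pattern, subject, perl = TRUE)[[1]]
  if (m[1] == -1L) return(character(0))          # no matches found
  starts <- as.integer(m)
  ends   <- starts + attr(m, "match.length") - 1L
  substring(subject, starts, ends)               # one subject, many substrings
}

# All runs of digits in the subject:
extract_matches("chr1:100-200 chr2:300-400", "[0-9]+")
```

The final substring() call is exactly the "single subject, many substrings" case, which is why its quadratic behavior dominates these packages' running time on long subjects with many matches.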
I suspect that a fix can be accomplished by optimizing the implementation of substring for the special case when the text/subject is a single element (character scalar). Right now I notice that the substring R code uses rep_len, so the text/subject which is passed to the C code is a character vector with the same length as the number of substrings to extract. Maybe the C code is calling strlen for each of these (identical) text/subject elements?
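For reference, the R-level definition of substring does use rep_len to recycle the subject (this is approximately the base R 3.x source, from memory; print substring at the prompt in your own session to see the exact current definition):

```r
# Approximate R-level source of base::substring (R 3.x era).
# The rep_len() call recycles a scalar subject into a vector of
# length n before handing it to the internal C code.
substring <- function(text, first, last = 1000000L) {
  if (!is.character(text)) text <- as.character(text)
  n <- max(lt <- length(text), length(first), length(last))
  if (lt && lt < n) text <- rep_len(text, length.out = n)
  .Internal(substr(text, as.integer(first), as.integer(last)))
}
```

So when a single subject is recycled n times, any per-element work in the C code (such as a strlen() or an ASCII scan over the whole string) is repeated n times, which would explain the quadratic behavior.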
Anyway, it would be useful to have some feedback to make sure this is indeed a bug before I post on bugzilla. (Btw, thanks Martin for signing me up for an account.)
Toby
______________________________________________
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel