isidentical opened a new pull request, #3804:
URL: https://github.com/apache/arrow-datafusion/pull/3804
# Which issue does this PR close?
Closes #3803.
# Rationale for this change
arrow-rs already uses this technique heavily, although it combines it with the unsafe array-building APIs, which this PR deliberately avoids (the reasoning is at the end). It does not provide as much of a speed-up as the other `regex_replace` optimizations, but it is a relatively straightforward change for a decent gain (see the benchmarks section). Unless I am missing something, it is also the last remaining optimization candidate that is localized to `regex_replace` itself.
# What changes are included in this PR?
Instead of rebuilding the string array from an iterator, we now build the underlying array data ourselves, which also lets us reuse the input's existing null buffer. In the generic version this couldn't be done without recombining the underlying bitmaps of all the inputs, but since this is the specialized case we know for a fact that the only array input is `strings` (the first argument), so its validity bitmap is exactly the output's. The idea is sketched below.
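As a rough illustration (not the PR's exact code), here is a minimal sketch of the approach. It assumes arrow's `ArrayData::try_new(data_type, len, null_bit_buffer, offset, buffers, child_data)` signature, and the hypothetical `offsets`/`values` vectors stand in for the data produced while running the replacement:

```rust
use arrow::array::{Array, ArrayData, StringArray};
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Hypothetical helper: assemble the output array from the replaced
/// values while reusing the input's validity bitmap as-is.
fn assemble_output(
    input: &StringArray,
    offsets: Vec<i32>, // value offsets written during replacement
    values: Vec<u8>,   // concatenated UTF-8 bytes of the replaced values
) -> StringArray {
    let data = ArrayData::try_new(
        DataType::Utf8,
        input.len(),
        // Reuse the existing null buffer instead of rebuilding it from
        // an iterator of Option<&str>: `strings` is the only array
        // argument, so its nulls are exactly the output's nulls.
        input.data().null_buffer().cloned(),
        0,
        vec![
            Buffer::from_slice_ref(&offsets),
            Buffer::from_slice_ref(&values),
        ],
        vec![],
    )
    .expect("valid string array data");
    StringArray::from(data)
}
```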
# Are there any user-facing changes?
No; this is purely a performance optimization.
# Benchmarks
As expected, there is no speed-up (and, perhaps more noteworthy, no slowdown either) when the input array is very dense with data (few NULLs). But depending on the input's data density, we see an average speed-up of roughly 20%.
|                            | master | this PR | this PR (unsafe) | speed-up                                           |
|----------------------------|--------|---------|------------------|----------------------------------------------------|
| 100% null & 0% data        | 0.121  | 0.088   | 0.071            | 1.38x with this PR (1.71x with the unsafe version) |
| 80% null & 20% data        | 0.489  | 0.396   | 0.347            | 1.23x with this PR (1.40x with the unsafe version) |
| 50% null & 50% data        | 0.741  | 0.640   | 0.619            | 1.15x with this PR (1.20x with the unsafe version) |
| data > null (80-20, 100-0) | 1.104  | 1.101   | 1.067            | no change (<=3%, within noise)                     |
(Benchmarks were built in release mode, for a variation of the query/dataset in my [example PR](https://github.com/isidentical/arrow-datafusion/pull/3).)
I have also included a column with the speed-ups from the unsafe version, but I am not sure it makes sense to increase the number of `unsafe` spots in DataFusion for a modest extra gain (even though it is sound). I'd be happy to change this PR to use the unsafe version ([which looks like this](https://github.com/isidentical/arrow-datafusion/pull/3)) if reviewers feel it is worth it; the difference between the two constructions is sketched below.
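For context, the difference comes down to whether arrow validates the buffers we hand it. A minimal sketch, assuming `ArrayData::new_unchecked` takes the same arguments as `try_new` plus an optional precomputed null count (double-check the exact signatures for your arrow version):

```rust
use arrow::array::ArrayData;
use arrow::buffer::Buffer;
use arrow::datatypes::DataType;

/// Safe construction: arrow validates buffer lengths, offsets, and
/// UTF-8 contents, returning an error instead of a malformed array.
fn build_checked(
    len: usize,
    nulls: Option<Buffer>,
    buffers: Vec<Buffer>,
) -> arrow::error::Result<ArrayData> {
    ArrayData::try_new(DataType::Utf8, len, nulls, 0, buffers, vec![])
}

/// Unsafe construction: skips that validation. Sound here only because
/// we have just written known-valid offsets and UTF-8 bytes ourselves.
fn build_unchecked(len: usize, nulls: Option<Buffer>, buffers: Vec<Buffer>) -> ArrayData {
    unsafe { ArrayData::new_unchecked(DataType::Utf8, len, None, nulls, 0, buffers, vec![]) }
}
```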