isidentical opened a new pull request, #3614:
URL: https://github.com/apache/arrow-datafusion/pull/3614

   # Which issue does this PR close?
   
   Closes #3613.
   
   # Rationale for this change
@Dandandan noticed in #3518 that `regex_replace` with a known pattern was taking an extremely long time during the ClickBench suite. Several factors contribute to this, but the main one is how generic the `regex_replace` implementation is (it handles all 2⁴ scalar/array combinations of its arguments). Having the generic version is good for compatibility, but it makes the common cases (like the example in #3518, where the pattern is static) pay the overhead of the fully generic path.
   
   # What changes are included in this PR?
This PR adds a specialization system based on the "scalarity" of the arguments: at **runtime**, the best `regex_replace` variant for the given set of inputs is picked and executed. The system here is just the start; if the gains are large enough, we might add a third case where the replacement is also known.
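   For illustration, here is a minimal sketch of the dispatch idea (not the actual code in this PR; `PatternArg` and `regexp_replace_sketch` are made-up names): when the pattern is a scalar, the regex is compiled once and reused for every row, while the generic path has to compile it per row.

   ```rust
   // Sketch only: `PatternArg` and `regexp_replace_sketch` are illustrative
   // names, not part of DataFusion.
   use regex::Regex;

   enum PatternArg<'a> {
       Scalar(&'a str),      // pattern is a known constant
       Array(Vec<&'a str>),  // one pattern per row
   }

   fn regexp_replace_sketch(
       values: &[&str],
       pattern: PatternArg<'_>,
       replacement: &str,
   ) -> Result<Vec<String>, regex::Error> {
       match pattern {
           // Specialized path: compile the regex once and reuse it for all rows.
           PatternArg::Scalar(p) => {
               let re = Regex::new(p)?;
               Ok(values
                   .iter()
                   .map(|v| re.replace_all(v, replacement).into_owned())
                   .collect())
           }
           // Generic path: the pattern may differ per row, so it is compiled per row.
           PatternArg::Array(patterns) => {
               let mut out = Vec::with_capacity(values.len());
               for (v, p) in values.iter().zip(patterns) {
                   let re = Regex::new(p)?;
                   out.push(re.replace_all(v, replacement).into_owned());
               }
               Ok(out)
           }
       }
   }
   ```
   The real code of course operates on Arrow arrays rather than `&str` slices, but the cost difference it targets is the same: one regex compilation total versus one per row.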
   
   # Are there any user-facing changes?
This is mainly an optimization; there shouldn't be any user-facing changes.
   
   # Benchmarks
   
All benchmarks were run in `--release` mode, using the datafusion-cli crate with the `-f` option.
   
The first benchmark is Query 28 from ClickBench:
   ```sql
   SELECT
       REGEXP_REPLACE("Referer", '^https?://(?:www.)?([^/]+)/.*$', '1') AS k,
       AVG(length("Referer")) AS l,
       COUNT(*) AS c,
       MIN("Referer")
   FROM hits_1
       WHERE "Referer" <> ''
       GROUP BY k
       HAVING COUNT(*) > 100000
       ORDER BY l DESC
   LIMIT 25;
   ```
   
   |                               | Master        | This Branch   | Factor         |
   |-------------------------------|---------------|---------------|----------------|
   | Cold Run                      | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
   | Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
   | Average                       | 2.408 seconds | 0.277 seconds | 8.69x speed-up |
   
(Note: I don't have the full ClickBench data, only a 1/100-scale partition of it, so these numbers may not be fully representative.)
   
A second benchmark covers the case where both the source and the replacement are arrays; it shows a speed-up factor of 1.7x.
   
   ```sql
   -- Generate data
   --
   -- import secrets
   -- import random
   --
   -- rows = 1_000_000
   --
   -- data = {"user_id": [], "website": []}
   -- for _ in range(rows):
   --     data["user_id"].append(secrets.token_hex(8))
   --
   --     # Sometimes it is proper URL, and sometimes it is not.
   --     data["website"].append(
   --         random.choice(["http", "https", "unknown", ""])
   --         + random.choice([":", "://"])
   --         + random.choice(["google", "facebook"])
   --         + random.choice([".com", ".org", ""])
   --     )
   --
   -- import pandas as pd
   -- df = pd.DataFrame(data)
   -- df.to_parquet("data.parquet")
   
   CREATE EXTERNAL TABLE generated_data
   STORED AS PARQUET
   LOCATION 'data.parquet';
   
   -- Query 1
   EXPLAIN ANALYZE
   SELECT
       REGEXP_REPLACE("website", '^https?://(?:www.)?([^/]+)$', "user_id") AS 
encoded_website
   FROM generated_data;
   ```
   ### TODO
   - [ ] Still needs tests.

