isidentical opened a new pull request, #3614:
URL: https://github.com/apache/arrow-datafusion/pull/3614
# Which issue does this PR close?
Closes #3613.
# Rationale for this change
@Dandandan noticed in #3518 that `regex_replace` with a known pattern was
taking an extremely long time during the ClickBench suite. Several factors
contribute, but the main one is how generic the `regex_replace`
implementation is (it can handle all 2⁴ scalar/array combinations of its
arguments). Having a generic version is good for compatibility, but it makes
us pay its overhead even in common cases (like the example in #3518, where
the pattern is static).
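To make that overhead concrete, here is a minimal Rust sketch (illustrative
only, not DataFusion's actual implementation; it just uses the `regex` crate
directly, and the function names are invented) contrasting a fully generic
path, which must assume the pattern can change per row, with a path that knows
the pattern is a constant:
```rust
use regex::Regex;

// Fully generic path: because the pattern is treated as a per-row array, the
// regex is recompiled for every single value, even when it never changes.
fn replace_generic(values: &[&str], patterns: &[&str], replacement: &str) -> Vec<String> {
    values
        .iter()
        .zip(patterns)
        .map(|(value, pattern)| {
            let re = Regex::new(pattern).expect("valid pattern"); // compiled per row
            re.replace_all(value, replacement).into_owned()
        })
        .collect()
}

// Specialized path for a constant pattern: the regex is compiled once and
// reused across the whole column.
fn replace_scalar_pattern(values: &[&str], pattern: &str, replacement: &str) -> Vec<String> {
    let re = Regex::new(pattern).expect("valid pattern"); // compiled once
    values
        .iter()
        .map(|value| re.replace_all(value, replacement).into_owned())
        .collect()
}
```
For a constant pattern, the generic path pays one regex compilation per row,
which dominates the runtime of queries like the one in #3518.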
# What changes are included in this PR?
This PR adds a scalarity-based (not sure if that is a real word)
specialization system where, at **runtime**, the best `regex_replace`
variant is picked and executed for the given set of inputs. The system here
is just the start; if the gains are large enough, we might add a third case
where the replacement is also known.
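As a rough sketch of the dispatch idea (the `Scalarity` enum and function
names below are hypothetical, invented for illustration; in DataFusion the
arguments arrive as `ColumnarValue`s), the runtime selection looks roughly
like this:
```rust
use regex::Regex;

/// Illustrative stand-in for an argument that is either a single constant or
/// a per-row array (hypothetical type; in DataFusion this role is played by
/// `ColumnarValue`).
enum Scalarity<'a> {
    Scalar(&'a str),
    Array(&'a [&'a str]),
}

/// Inspect the inputs at runtime and pick the cheapest variant that fits.
fn regexp_replace(values: &[&str], pattern: Scalarity, replacement: &str) -> Vec<String> {
    match pattern {
        // Specialized case: the pattern is constant, so compile it exactly once.
        Scalarity::Scalar(p) => {
            let re = Regex::new(p).expect("valid pattern");
            values
                .iter()
                .map(|v| re.replace_all(v, replacement).into_owned())
                .collect()
        }
        // Generic fallback: the pattern may differ per row, so the compiled
        // regex cannot be hoisted out of the loop.
        Scalarity::Array(patterns) => values
            .iter()
            .zip(patterns)
            .map(|(v, p)| {
                let re = Regex::new(p).expect("valid pattern");
                re.replace_all(v, replacement).into_owned()
            })
            .collect(),
    }
}
```
The scalar arm hoists the compilation out of the loop, which is where the
speed-up in the benchmarks below comes from.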
# Are there any user-facing changes?
This is mainly an optimization; there shouldn't be any user-facing changes.
# Benchmarks
All benchmarks were run in `--release` mode (using the datafusion-cli crate
with the `-f` option).
The initial benchmark is Query 28 from ClickBench:
```sql
SELECT
    REGEXP_REPLACE("Referer", '^https?://(?:www\.)?([^/]+)/.*$', '\1') AS k,
    AVG(length("Referer")) AS l,
    COUNT(*) AS c,
    MIN("Referer")
FROM hits_1
WHERE "Referer" <> ''
GROUP BY k
HAVING COUNT(*) > 100000
ORDER BY l DESC
LIMIT 25;
```
|                               | Master        | This Branch   | Factor         |
|-------------------------------|---------------|---------------|----------------|
| Cold Run                      | 2.875 seconds | 0.318 seconds | 9.04x speed-up |
| Hot Run (6th consecutive run) | 2.252 seconds | 0.266 seconds | 8.46x speed-up |
| Average                       | 2.408 seconds | 0.277 seconds | 8.69x speed-up |
(Note: I don't have the full ClickBench dataset, just a 1/100-scale partition
of it, so these numbers might not be fully representative.)
A second benchmark, where both the source and the replacement are arrays,
shows a speed-up factor of 1.7x.
```sql
-- Generate data
--
-- import secrets
-- import random
--
-- rows = 1_000_000
--
-- data = {"user_id": [], "website": []}
-- for _ in range(rows):
--     data["user_id"].append(secrets.token_hex(8))
--
--     # Sometimes it is a proper URL, and sometimes it is not.
--     data["website"].append(
--         random.choice(["http", "https", "unknown", ""])
--         + random.choice([":", "://"])
--         + random.choice(["google", "facebook"])
--         + random.choice([".com", ".org", ""])
--     )
--
-- import pandas as pd
-- df = pd.DataFrame(data)
-- df.to_parquet("data.parquet")

CREATE EXTERNAL TABLE generated_data
STORED AS PARQUET
LOCATION 'data.parquet';

-- Query 1
EXPLAIN ANALYZE
SELECT
    REGEXP_REPLACE("website", '^https?://(?:www\.)?([^/]+)$', "user_id") AS encoded_website
FROM generated_data;
```
### TODO
- [ ] Still needs tests.