LuciferYang commented on PR #38171:
URL: https://github.com/apache/spark/pull/38171#issuecomment-1514053021
@lyy-pineapple I ran a simple test of this PR against one of our production
scenarios. The test table is a real business data table with 50 billion
records. The test SQL is as follows:
```
select count(url) from rlike_table where url not rlike ${regex1} and url not
rlike ${regex2}
```
In the test SQL, the `url` column stores website addresses, and `regex1` and
`regex2` are real user cases. (The tested SQL has been simplified; in actual
business, one SQL statement may contain 10-20 `rlike` expressions, mainly in
`filter` and `case when` conditions.)
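To make the filter logic concrete, here is a minimal sketch of what the query computes on a handful of rows. The two patterns and the sample URLs are hypothetical stand-ins (the real `regex1` and `regex2` are not shown here), and Python's `re` is used only for illustration: Spark evaluates `rlike` with Java regular expressions, so dialect details may differ.

```python
import re

# Hypothetical patterns standing in for ${regex1} and ${regex2}.
REGEX1 = re.compile(r"^https?://[^/]*\.example\.com/")
REGEX2 = re.compile(r"\.(png|jpg|gif)$")


def count_unmatched(urls):
    """Count non-null urls that match neither pattern.

    `search` is an unanchored match, which mirrors how `rlike` looks for
    the pattern anywhere in the string.
    """
    return sum(
        1
        for url in urls
        if url is not None
        and REGEX1.search(url) is None
        and REGEX2.search(url) is None
    )


# Made-up sample rows, not data from the test table.
sample = [
    "https://www.example.com/index.html",    # matches REGEX1
    "https://spark.apache.org/logo.png",     # matches REGEX2
    "https://spark.apache.org/docs/latest",  # matches neither
    None,                                    # NULL url, never counted
]

print(count_unmatched(sample))
```

Every row pays for one regex evaluation per `rlike` expression, so with 10-20 expressions per statement the matching cost is multiplied accordingly.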
The test application used 20 executors, each with 40 cores.
With this PR, the test job completes **within 3 hours**; without it, the
test job needs more than 100 hours (in fact, the job was still running after
100 hours, and I roughly estimate it would take another 30 ~ 50 hours to
complete).
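As a rough back-of-the-envelope reading of these numbers, the sketch below derives the total core count and the implied speedup. It assumes 3 hours as the upper bound with this PR and treats the 30 ~ 50 hour remainder as an estimate, not a measurement; the variable names are only for illustration.

```python
# Figures quoted in this comment.
executors = 20
cores_per_executor = 40
hours_with_pr = 3                # upper bound: "within 3 hours"
hours_without_pr_observed = 100  # job was still running at this point
extra_hours_estimate = (30, 50)  # rough estimate of the remaining time

total_cores = executors * cores_per_executor

# Lower bound on the speedup, using only what was actually observed.
min_speedup = hours_without_pr_observed / hours_with_pr

# Speedup range if the remaining-time estimate holds.
est_speedup = [
    (hours_without_pr_observed + extra) / hours_with_pr
    for extra in extra_hours_estimate
]

print(total_cores)                         # 800
print(round(min_speedup, 1))               # 33.3
print([round(s, 1) for s in est_speedup])  # [43.3, 50.0]
```

Since the run with this PR is bounded from above and the run without it is bounded from below, the observed speedup is at least about 33x on 800 cores.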
So personally, I like this one.
Also cc @cloud-fan @dongjoon-hyun FYI
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]