[GitHub] [spark] LuciferYang commented on pull request #38171: [SPARK-9213] [SQL] Improve regular expression performance (via joni)

via GitHub Tue, 18 Apr 2023 19:38:31 -0700


LuciferYang commented on PR #38171:
URL: https://github.com/apache/spark/pull/38171#issuecomment-1514053021


   @lyy-pineapple I ran a simple test of this PR against one of our production 
scenarios. The test table is a real business data table with 50 billion records. 
The test SQL is as follows:
   
   ```
   select count(url)
   from rlike_table
   where url not rlike ${regex1}
     and url not rlike ${regex2}
   ```
   
   In the test SQL, the `url` column stores website addresses, and `regex1` and 
`regex2` are taken from real user cases. (The tested SQL has been simplified; in 
actual business use, one SQL statement may contain 10-20 `rlike` expressions, 
mainly in `filter` conditions and `case when` branches.)
   
   The test application used 20 executors, each with 40 cores.
   
   With this PR, the test job completes **within 3 hours**. Without this PR, the 
same job takes more than 100 hours: it was still running after 100 hours, and I 
roughly estimate that it would need another 30 to 50 hours to finish.
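
   For a rough sense of scale, the figures above can be turned into a speedup 
range and a per-core throughput. This is only a back-of-the-envelope sketch: it 
assumes the run with the PR took the full 3 hours, that the 30 to 50 hour 
remainder estimate is accurate, and that all cores were busy for the whole run. 
The variable names are illustrative and do not come from the PR.

```python
# Figures quoted in the comment; everything derived from them is an estimate.
records = 50_000_000_000          # rows in the test table
executors = 20
cores_per_executor = 40
hours_with_pr = 3                 # "within 3 hours", treated as an upper bound
hours_elapsed_without_pr = 100    # job was still running at this point
extra_hours_estimate = (30, 50)   # rough estimate of the remaining time

total_cores = executors * cores_per_executor

# Estimated total runtime without the PR, low and high end.
hours_without_low = hours_elapsed_without_pr + extra_hours_estimate[0]
hours_without_high = hours_elapsed_without_pr + extra_hours_estimate[1]

# Speedup range implied by those estimates.
speedup_low = hours_without_low / hours_with_pr
speedup_high = hours_without_high / hours_with_pr

# Average rows processed per second per core with the PR applied.
rows_per_sec_per_core = records / (hours_with_pr * 3600) / total_cores

print(f"total cores: {total_cores}")
print(f"estimated speedup: {speedup_low:.1f}x to {speedup_high:.1f}x")
print(f"rows/s/core with the PR: {rows_per_sec_per_core:.0f}")
```

   Under these assumptions the speedup works out to roughly 43x to 50x on 800 
cores. Because the run with the PR finished within 3 hours rather than in 
exactly 3 hours, the real ratio could be somewhat higher.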
   
   So personally, I like this one.
   
   Also cc @cloud-fan @dongjoon-hyun FYI
   


