Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

via GitHub Thu, 19 Sep 2024 12:22:06 -0700


Rachelint commented on issue #11680:
URL: https://github.com/apache/datafusion/issues/11680#issuecomment-2362005473


   > > > Does anyone know what the rationale of having repartition + coalesce, 
what kind of query benefits from it
   > > 
   > > 
   > > The primary reason is scalability. Efficient aggregation requires 
multi-core CPUs to process data in parallel. To facilitate this and prevent 
contention from multiple threads altering a single hash table simultaneously 
(often managed with locks), a repartition phase is introduced.
   > > This repartition allows each thread to perform aggregation 
independently. Furthermore, pre-aggregation is employed to conduct preliminary 
calculations to streamline repartitioning, significantly reducing the volume of 
data that needs repartitioning(in cases either the cadinatlity is low or there 
are several hot keys).
   > > > The next thing is to find out when and how should I avoid 
EnforceDistribution and bypass SanityCheckPlan if needed. And ensure there is 
no other queries slow down.
   > > 
   > > 
   > > This when-and-how problem is difficult for query engines because it 
requires foreknowledge of the data characteristics on grouping keys. It's even 
harder for DataFusion since we have very limited table metadata that could help 
us with this decision.
   > > In an adaptive approach, the aggregate operator could internally start 
with one method and move to another without interacting with other 
components(the physical optimizer and other physical operators), making it more 
feasible.
   > 
   > Maybe reducing the cost of repartition is an alternative? I think the 
reason why `partial + final` lead to performance regression in some situation 
is possible that the cost of splitting batches is greater than the improvement 
of parallel merging? I am trying a poc about fusing the repartition and partial 
aggregation.
   
   It is exciting that this idea seems promising! 
   I tried some related queries in my local, and it seems faster.
   - Main
   ```
   Q16: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", 
"SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
   Query 16 iteration 0 took 5537.5 ms and returned 10 rows
   Query 16 iteration 1 took 5308.1 ms and returned 10 rows
   Query 16 iteration 2 took 5499.6 ms and returned 10 rows
   Query 16 iteration 3 took 5561.5 ms and returned 10 rows
   Query 16 iteration 4 took 5380.3 ms and returned 10 rows
   
   Q17: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", 
"SearchPhrase" LIMIT 10;
   Query 17 iteration 0 took 4961.5 ms and returned 10 rows
   Query 17 iteration 1 took 4926.8 ms and returned 10 rows
   Query 17 iteration 2 took 5000.1 ms and returned 10 rows
   Query 17 iteration 3 took 4952.6 ms and returned 10 rows
   Query 17 iteration 4 took 4922.3 ms and returned 10 rows
   ```
   
   - The poc #12526 
   ```
   Q16: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", 
"SearchPhrase" ORDER BY COUNT(*) DESC LIMIT 10;
   Query 16 iteration 0 took 5252.9 ms and returned 10 rows
   Query 16 iteration 1 took 5092.9 ms and returned 10 rows
   Query 16 iteration 2 took 5102.9 ms and returned 10 rows
   Query 16 iteration 3 took 5181.0 ms and returned 10 rows
   Query 16 iteration 4 took 5169.2 ms and returned 10 rows
   
   Q17: SELECT "UserID", "SearchPhrase", COUNT(*) FROM hits GROUP BY "UserID", 
"SearchPhrase" LIMIT 10;
   Query 17 iteration 0 took 4660.8 ms and returned 10 rows
   Query 17 iteration 1 took 4619.3 ms and returned 10 rows
   Query 17 iteration 2 took 4588.3 ms and returned 10 rows
   Query 17 iteration 3 took 4717.4 ms and returned 10 rows
   Query 17 iteration 4 took 4633.0 ms and returned 10 rows
   ```
   
   I will check it more carefully again, and if it actually works, I will 
submit a formal PR.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Improve performance of high cardinality grouping by reusing hash values [datafusion]

Reply via email to