Re: [I] Memory over-reservation when running native shuffle write [datafusion-comet]

via GitHub Thu, 29 Aug 2024 02:22:02 -0700


Kontinuation commented on issue #887:
URL: 
https://github.com/apache/datafusion-comet/issues/887#issuecomment-2317124398


   I found that the memory reserved by the native shuffle writer is based on 
[guesses of data 
sizes](https://github.com/apache/datafusion-comet/blob/0.2.0/native/core/src/execution/datafusion/shuffle_writer.rs#L226-L236)
 and [proportional to the number of 
partitions](https://github.com/apache/datafusion-comet/blob/0.2.0/native/core/src/execution/datafusion/shuffle_writer.rs#L746-L759).
 For TPC-H query 10, the schema of shuffled record batches is:
   
   | Field | Type | Slot Size |
   |--|--|--|
   |col0| Int64 | 8 * len |
   | col1 | Utf8 | 104 * len |
   | col2 | Decimal128(12, 2) | 16 * len |
   | col3 | Utf8 | 104 * len |
   | col4 | Utf8 | 104 * len |
   | col5 | Utf8 | 104 * len |
   | col6 | Utf8 | 104 * len |
   | col7 | Decimal128(36, 4) | 16 * len |
   | col8 | Boolean | len / 8 |
   
   The estimated size of each record batch is 4588544 bytes given the batch 
size of 8192. The shuffle repartitioner reserves a batch for each partition, 
given the partition number of 200 (the default value of 
`spark.sql.shuffle.partitions`), the total amount of reserved memory is 
917708800 bytes (nearly 900 MB).
   
   We have multiple cores on each executor, each core runs a shuffle writer and 
reserves its own memory. On a 6-core worker instance, the amount of reserved 
memory will be 5 GB. If we tune the number of shuffle partitions for large ETL, 
the native shuffle writer has to reserve more memory. For partition number = 
2000, the total reserved memory will be 50 GB. This is usually more than the 
total memory of the worker instance.
   
   Maybe a better approach is to always reserve `comet execution memory * 
shuffle fraction` amount of memory, and check if the reserved memory was 
exceeded each time we insert a record batch? We can estimate the amount of 
allocated memory using `batch.get_array_memory_size()`, and trigger a spill if 
we've already ingested too many batches. This would make the native shuffle 
writer adaptive to the actual size of record batches and avoid over-reservation.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Re: [I] Memory over-reservation when running native shuffle write [datafusion-comet]

Reply via email to