Csaba Ringhofer created IMPALA-12594:
----------------------------------------

             Summary: KrpcDataStreamSender's mem estimate is different than 
real usage
                 Key: IMPALA-12594
                 URL: https://issues.apache.org/jira/browse/IMPALA-12594
             Project: IMPALA
          Issue Type: Bug
          Components: Backend, Frontend
            Reporter: Csaba Ringhofer


IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are a 
few gaps between how the frontend estimates memory and how the backend 
actually allocates it.
The frontend uses the following formula:
buffer_size = num_channels * 2 * (tuple_buffer_length + 
compressed_buffer_length)
This accounts for the serialization and compression buffers of each 
OutboundRowBatch.
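
For illustration, the formula above can be written as a small Python 
function. This is only a sketch: the function name is hypothetical, and the 
parameter names are taken from the formula rather than from the actual 
frontend code.

```python
def frontend_sender_estimate(num_channels: int,
                             tuple_buffer_length: int,
                             compressed_buffer_length: int) -> int:
    """Frontend-style estimate of KrpcDataStreamSender buffer memory in bytes.

    Mirrors the formula quoted above:
      num_channels * 2 * (tuple_buffer_length + compressed_buffer_length)
    """
    per_batch_bytes = tuple_buffer_length + compressed_buffer_length
    return num_channels * 2 * per_batch_bytes


# Example with made-up sizes: 10 channels, 1 KiB serialization buffer,
# 512 B compression buffer.
estimate = frontend_sender_estimate(10, 1024, 512)
```

Note that the estimate grows linearly with the number of channels, and that 
nothing in it depends on data_stream_sender_buffer_size.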

This can both underestimate and overestimate:
1. It doesn't account for the RowBatch that each channel uses during 
partitioned exchange to collect the rows belonging to that channel:
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232

2. It ignores the adjustment of that RowBatch's capacity based on the flag 
data_stream_sender_buffer_size:
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379
This adjustment can either increase or decrease the capacity to reach the 
desired total size (16K by default).
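
A rough Python sketch of such a capacity adjustment follows. It assumes the 
capacity is simply the target size divided by the fixed-length row size, with 
a minimum of one row; that is my reading of the behavior described above, not 
a copy of the backend code, and the names are hypothetical.

```python
DEFAULT_SENDER_BUFFER_BYTES = 16 * 1024  # the 16K default mentioned above


def row_batch_capacity(fixed_row_size: int,
                       target_bytes: int = DEFAULT_SENDER_BUFFER_BYTES) -> int:
    """Rows per channel RowBatch so fixed-length data totals ~target_bytes.

    Assumption: capacity = target size / fixed row size, at least one row.
    """
    return max(1, target_bytes // max(fixed_row_size, 1))


narrow_rows = row_batch_capacity(8)      # narrow rows: capacity goes up
wide_rows = row_batch_capacity(64)       # wider rows: capacity goes down
huge_rows = row_batch_capacity(100_000)  # row larger than target: one row
```

Under these assumptions the per-channel RowBatch holds roughly 16K of 
fixed-length data regardless of row width, which the frontend formula does 
not reflect.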

Note that the adjustment above ignores var-len data, so it can massively 
underestimate the actual size in some cases. Meanwhile, the frontend logic 
calculates string sizes if stats are present. Ideally both sides would be 
improved and kept in sync, using both data_stream_sender_buffer_size and the 
string sizes for the estimate (I am not sure about collection types).
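
To show how large the var-len gap can get, here is a hedged Python sketch. 
It assumes the capacity is computed from the fixed-length row size only (as 
described above) and that each row additionally carries some average amount 
of string data. The helper names and the example sizes are made up.

```python
def row_batch_capacity(fixed_row_size: int, target_bytes: int = 16 * 1024) -> int:
    """Assumed capacity rule: target size / fixed row size, at least one row."""
    return max(1, target_bytes // max(fixed_row_size, 1))


def full_batch_bytes(fixed_row_size: int, avg_var_len_bytes: int,
                     target_bytes: int = 16 * 1024) -> int:
    """Approximate bytes held by a full RowBatch, including var-len data."""
    capacity = row_batch_capacity(fixed_row_size, target_bytes)
    return capacity * (fixed_row_size + avg_var_len_bytes)


# Hypothetical row: 16 bytes of fixed-length data plus ~1000 bytes of strings.
actual = full_batch_bytes(fixed_row_size=16, avg_var_len_bytes=1000)
overshoot = actual / (16 * 1024)  # how far past the 16K target we end up
```

With these example numbers a full batch holds about 1 MB per channel instead 
of the intended 16K, which is why an estimate that uses string stats together 
with data_stream_sender_buffer_size would be closer to real usage.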


