Csaba Ringhofer created IMPALA-12594:
----------------------------------------
Summary: KrpcDataStreamSender's mem estimate is different than
real usage
Key: IMPALA-12594
URL: https://issues.apache.org/jira/browse/IMPALA-12594
Project: IMPALA
Issue Type: Bug
Components: Backend, Frontend
Reporter: Csaba Ringhofer
IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are a
few gaps between how the frontend estimates memory and how the backend
actually allocates it.
The frontend uses the following formula:
buffer_size = num_channels * 2 * (tuple_buffer_length +
compressed_buffer_length)
This accounts for the serialization and compression buffers of each
OutboundRowBatch.
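As a rough illustration of the formula above, here is a minimal Python sketch.
The function name and the example sizes are hypothetical and are not taken from
the Impala planner code. Reading the factor 2 as "two OutboundRowBatch objects
per channel" is an interpretation of the description above, not something
confirmed here.

```python
def frontend_sender_estimate(num_channels, tuple_buffer_length,
                             compressed_buffer_length):
    """Hypothetical restatement of the frontend formula quoted in this issue.

    Each OutboundRowBatch is assumed to hold one serialization buffer and
    one compression buffer; the factor 2 is applied per channel.
    """
    per_batch = tuple_buffer_length + compressed_buffer_length
    return num_channels * 2 * per_batch


# Example with made-up sizes: 4 channels, 16 KiB for each buffer.
estimate = frontend_sender_estimate(4, 16 * 1024, 16 * 1024)
print(estimate)
```

Note that nothing in this expression depends on the per-channel RowBatch or on
data_stream_sender_buffer_size.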
This can both underestimate and overestimate:
1. It does not account for the RowBatch that each channel uses during
partitioned exchange to collect the rows belonging to that channel:
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232
2. It ignores the adjustment of the capacity of the RowBatch above, which is
based on the flag data_stream_sender_buffer_size:
https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379
This adjustment can either increase or decrease the capacity to reach the
desired total size (16K by default).
Note that the adjustment above ignores variable-length data, so it can
massively underestimate in some cases. Meanwhile, the frontend logic calculates
string sizes if stats are present. Ideally both sides would be improved and
synced to use both data_stream_sender_buffer_size and the string sizes for the
estimate (I am not sure about collection types).
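To make the variable-length problem concrete, here is a small Python sketch. It
assumes the backend derives the RowBatch capacity by dividing the target buffer
size by the fixed-length row size. That is a simplification of the linked code,
not a copy of it. All function names and the example row sizes are
hypothetical.

```python
def adjusted_capacity(target_bytes, fixed_row_size):
    """Assumed capacity rule: fit as many fixed-length rows as possible
    into target_bytes, but always allow at least one row."""
    return max(1, target_bytes // max(fixed_row_size, 1))


def approx_batch_bytes(capacity, fixed_row_size, avg_var_len_bytes=0):
    """Approximate bytes held by a full batch, including the average
    variable-length payload per row (e.g. string data)."""
    return capacity * (fixed_row_size + avg_var_len_bytes)


target = 16 * 1024  # the 16K default mentioned above
capacity = adjusted_capacity(target, fixed_row_size=16)

# Fixed-length data only: the batch matches the target size.
fixed_only = approx_batch_bytes(capacity, 16)

# Same capacity, but each row also carries about 200 bytes of string data.
with_strings = approx_batch_bytes(capacity, 16, avg_var_len_bytes=200)

print(capacity, fixed_only, with_strings)
```

Under these assumptions the capacity is chosen from the 16-byte fixed part
alone, so a full batch with strings holds many times the intended 16K.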