[ 
https://issues.apache.org/jira/browse/IMPALA-12594?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17819176#comment-17819176
 ] 

ASF subversion and git services commented on IMPALA-12594:
----------------------------------------------------------

Commit 2f14fd29c0b47fc2c170a7f0eb1cecaf6b9704f4 in impala's branch 
refs/heads/master from Csaba Ringhofer
[ https://gitbox.apache.org/repos/asf?p=impala.git;h=2f14fd29c ]

IMPALA-12433: Share buffers among channels in KrpcDataStreamSender

Before this patch, each KrpcDataStreamSender::Channel had 2
OutboundRowBatch objects, each with its own serialization and
compression buffers.

This patch switches to use a single buffer per channel. This is
enough to store the in-flight data in KRPC, while other buffers
are only used during serialization and compression which is done for
just a single channel at a time, so can be shared among channels.

Memory estimates in the planner are not changed because the existing
calculation has several issues (see IMPALA-12594).

Change-Id: I64854a350a9dae8bf3af11c871882ea4750e60b3
Reviewed-on: http://gerrit.cloudera.org:8080/20719
Tested-by: Impala Public Jenkins <[email protected]>
Reviewed-by: Kurt Deschler <[email protected]>
Reviewed-by: Zihao Ye <[email protected]>
Reviewed-by: Michael Smith <[email protected]>


> KrpcDataStreamSender's mem estimate is different than real usage
> ----------------------------------------------------------------
>
>                 Key: IMPALA-12594
>                 URL: https://issues.apache.org/jira/browse/IMPALA-12594
>             Project: IMPALA
>          Issue Type: Bug
>          Components: Backend, Frontend
>            Reporter: Csaba Ringhofer
>            Priority: Major
>
> IMPALA-6684 added memory estimates for KrpcDataStreamSender, but there are 
> a few gaps between how the frontend estimates memory and how the backend 
> actually allocates it.
> The frontend uses the following formula:
> buffer_size = num_channels * 2 * (tuple_buffer_length + 
> compressed_buffer_length)
> This accounts for the serialization and compression buffers of each 
> OutboundRowBatch.
> This can both underestimate and overestimate:
> 1. It doesn't account for the RowBatch used by channels during 
> partitioned exchange to collect rows belonging to a single channel: 
> https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L232
> 2. It ignores the adjustment to the RowBatch capacity above based on the 
> flag data_stream_sender_buffer_size: 
> https://github.com/apache/impala/blob/4c762725c707f8d150fe250c03faf486008702d4/be/src/runtime/krpc-data-stream-sender.cc#L379
> This adjustment can either increase or decrease the capacity to reach the 
> desired total size (16K by default).
> Note that the adjustment above ignores var len data, so it can massively 
> underestimate in some cases. Meanwhile the frontend logic calculates string 
> sizes if stats are present. Ideally both pieces of logic would be improved and synced 
> to use both data_stream_sender_buffer_size and the string sizes for the 
> estimate (I am not sure about collection types).
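
To make the gaps described above concrete, here is a minimal Python sketch. Only the frontend formula is taken from the issue text. The capacity adjustment is a simplified model and an assumption, not the actual backend code: it assumes the row capacity is chosen so that the fixed-length part of the rows fills data_stream_sender_buffer_size, and it reads "16K" as 16 * 1024 bytes. The function names and the example numbers are hypothetical.

```python
DEFAULT_SENDER_BUFFER_BYTES = 16 * 1024  # assumed meaning of "16K by default"


def frontend_estimate(num_channels, tuple_buffer_length, compressed_buffer_length):
    """Frontend formula quoted in the issue: 2 OutboundRowBatch per channel,
    each with a serialization buffer and a compression buffer."""
    return num_channels * 2 * (tuple_buffer_length + compressed_buffer_length)


def adjusted_row_capacity(fixed_row_size, target_bytes=DEFAULT_SENDER_BUFFER_BYTES):
    """Assumed model of the capacity adjustment: choose the number of rows
    whose fixed-length part adds up to roughly target_bytes (at least 1 row)."""
    return max(1, target_bytes // max(1, fixed_row_size))


def collected_batch_bytes(fixed_row_size, avg_var_len_bytes,
                          target_bytes=DEFAULT_SENDER_BUFFER_BYTES):
    """Approximate size of a full per-channel RowBatch once var len data
    (which the capacity adjustment ignores) is included."""
    capacity = adjusted_row_capacity(fixed_row_size, target_bytes)
    return capacity * (fixed_row_size + avg_var_len_bytes)


if __name__ == "__main__":
    # Hypothetical numbers: 10 channels, 4 KiB buffers of each kind.
    print(frontend_estimate(10, 4096, 4096))
    # Hypothetical row: 16 fixed bytes plus ~1000 bytes of string data.
    print(adjusted_row_capacity(16))
    print(collected_batch_bytes(16, 1000))
```

Under this model, a row with a small fixed-length part and large strings gets a high row capacity, so the collected batch can be far larger than the target size. That is the underestimate the issue describes for var len data.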



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
