[ 
https://issues.apache.org/jira/browse/HIVE-23736?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18048576#comment-18048576
 ] 

Stamatis Zampetakis commented on HIVE-23736:
--------------------------------------------

Although the TopN logic in ReduceSink and TopNKeyOperator are very similar 
there is an important difference between the two. The TopNKeyOperator produces 
rows instantly on a tuple-per-tuple basis without having to process the entire 
input. On the other hand, the ReduceSink operator (when TopN functionality is 
on) will process the entire input before producing any output row. This 
distinction has also a significant impact on the number of tuples that pass by 
each operator. In the worst case, the TopNKeyOperator may forward the entire 
input without cutting of any tuples while the ReduceSink will strictly produce 
only the Top N.

As explained in detail under HIVE-29322, disabling the TopN functionality from 
ReduceSink can have a negative impact in performance since a potentially large 
number of records may be sent unnecessarily from mappers to reducers.

> Disable topn in ReduceSinkOp if a TNK is introduced
> ---------------------------------------------------
>
>                 Key: HIVE-23736
>                 URL: https://issues.apache.org/jira/browse/HIVE-23736
>             Project: Hive
>          Issue Type: Improvement
>          Components: Physical Optimizer
>            Reporter: Krisztian Kasa
>            Assignee: Krisztian Kasa
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 4.0.0-alpha-1
>
>          Time Spent: 20m
>  Remaining Estimate: 0h
>
> Both the Reduce Sink and TopNKey operator has Top n key filtering 
> functionality. If TNK is introduced this functionality is done twice.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to