Takuya Kitazawa created HIVEMALL-138:
----------------------------------------

             Summary: Implement to_top_k_ordered_map
                 Key: HIVEMALL-138
                 URL: https://issues.apache.org/jira/browse/HIVEMALL-138
             Project: Hivemall
          Issue Type: New Feature
            Reporter: Takuya Kitazawa
            Assignee: Takuya Kitazawa
            Priority: Minor


As an alternative "each_top_k" functionality, let us implement 
"to_top_k_ordered_map(int k, int key, int value)" UDAF. Compared to the CLUSTER 
BY + "each_top_k" option, UDAF enables us to utilize mapper-side aggregation.

According to [~myui]:

A problem is that multiple to_top_k_ordered_map UDAFs is concurrently executed 
and memory consumption is not reduced.

to_top_k_ordered_map will become O(|article_id|*k) (or, 
O(|article_id|*k/reducers*combiner_effect_ratio) per a reducer) space 
complexity while each_top_k is O(k) (or O(k/reducers) per a reducer) space 
complexity in an operator. each_top_k internally uses priority queue (not 
sorting), assuming the given inputs are sorted by a group key using CLUSTER BY. 
Shuffle involves a scalable external sort and memory space complexity can be 
avoided.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Reply via email to