[ https://issues.apache.org/jira/browse/HIVE-4358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641355#comment-13641355 ]
Harish Butani commented on HIVE-4358:
-------------------------------------

This is to allow a PTF to process the raw input before partitioning has happened. A good example is how to perform CandidateFrequentItemSet computation. The input is a Basket(basketId, productId) table; the output is a list of Itemsets that are frequently bought together, where "frequent" is defined by a threshold parameter. The output has the form Itemset(Array<String> itemset), assuming ProductId is a String.

The way you compute this is to apply a FrequentItemSet algorithm on subsets of the input in parallel. In our prototype we implemented the DynamicItemCounting algorithm. This was executed in each mapper; the output from each mapper was a Candidate Itemset(Array<String> itemset, count). The reducer then summed the counts across all mappers and checked them against the threshold.

But from a calling perspective it still appears like a PTF invocation to a caller:

    select itemset from candidateFreqItemSets(on basket partition by itemset)

Behind the scenes we create a Plan with a PTFOp on the Map side, where the DynamicItemCounting is done, and a PTFOp on the Reduce side, where the aggregation is done.

Hope this makes sense; I realize it is very brief, and I can go over it in detail with you.

> Check for Map side processing in PTFOp is no longer valid
> ---------------------------------------------------------
>
>                 Key: HIVE-4358
>                 URL: https://issues.apache.org/jira/browse/HIVE-4358
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>            Reporter: Harish Butani
>         Attachments: HIVE-4358.D10473.1.patch
>
>
> With the changes for ReduceSinkDedup it is no longer true that a non Map-side
> PTF Operator is preceded by an ExtractOp. For example, the following query can
> produce the issue:
> {noformat}
> create view IF NOT EXISTS mfgr_price_view as
> select p_mfgr, p_brand,
> sum(p_retailprice) as s
> from part
> group by p_mfgr, p_brand;
>
> select p_mfgr, p_brand, s,
> sum(s) over w1 as s1
> from mfgr_price_view
> window w1 as (distribute by p_mfgr sort by p_brand rows between 2 preceding
> and current row);
> {noformat}
> Fix is to add an explicit flag to PTFDesc
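
To make the Map-side/Reduce-side split from the comment above concrete, here is a minimal sketch in Python. It is not Hive code and does not use any Hive API. The function names `map_side_candidates` and `reduce_side_frequent` are hypothetical, and plain pair counting is swapped in as a stand-in for the DynamicItemCounting algorithm, which the comment does not spell out. It also assumes that all rows of a given basket land in the same input split.

```python
from collections import Counter
from itertools import combinations


def map_side_candidates(rows, size=2):
    """Count candidate itemsets within one input split.

    Stand-in for the Map-side PTFOp: `rows` is an iterable of
    (basket_id, product_id) pairs seen by a single mapper.
    Assumption: plain counting of itemsets of a fixed size,
    not the DynamicItemCounting algorithm itself.
    """
    baskets = {}
    for basket_id, product_id in rows:
        baskets.setdefault(basket_id, set()).add(product_id)

    counts = Counter()
    for products in baskets.values():
        # Sort so the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(products), size):
            counts[itemset] += 1
    return counts


def reduce_side_frequent(partials, threshold):
    """Sum per-mapper counts and keep itemsets that meet the threshold.

    Stand-in for the Reduce-side PTFOp.
    """
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return sorted(list(itemset) for itemset, count in total.items()
                  if count >= threshold)


# Two hypothetical input splits of the Basket(basketId, productId) table.
split_a = [("b1", "milk"), ("b1", "bread"),
           ("b2", "milk"), ("b2", "bread"), ("b2", "eggs")]
split_b = [("b3", "bread"), ("b3", "milk"),
           ("b4", "eggs"), ("b4", "milk")]

partials = [map_side_candidates(split_a), map_side_candidates(split_b)]
frequent = reduce_side_frequent(partials, threshold=3)
print(frequent)
```

The point of the sketch is only the shape of the plan: each mapper emits (itemset, count) candidates from raw, unpartitioned input, and the threshold check can happen only after the reducer has summed the counts from all mappers.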