[ 
https://issues.apache.org/jira/browse/HIVE-4358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641355#comment-13641355
 ] 

Harish Butani commented on HIVE-4358:
-------------------------------------

This is to allow a PTF to process the raw input before partitioning has
happened.
A good example is CandidateFrequentItemSet computation. The input is a
Basket(basketId, productId) table; the output is a list of itemsets that are
frequently bought together, where "frequent" is defined by a threshold
parameter.
The output has the form Itemset(Array<String> itemset), assuming productId is
a String.

The way you compute this is to apply a FrequentItemSet algorithm on subsets of
the input in parallel.
In our prototype we implemented the DynamicItemCounting algorithm, which was
executed in each mapper; the output from each mapper was a
Candidate Itemset(Array<String> itemset, count).
The reducer then summed the counts across all mappers and checked them against
the threshold.
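
As a rough illustration of the map/reduce split described above, here is a
minimal Python sketch. It is not the prototype's code and it does not implement
DynamicItemCounting: the map side is replaced by naive counting of fixed-size
itemsets per basket, purely to show the shape of the data flow. The function
names, the `size` parameter, and the sample data are all hypothetical, and the
sketch assumes every basket falls entirely within one input split.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_side_candidates(rows, size=2):
    """Stand-in for the map-side step: count itemsets within one input split.

    rows is an iterable of (basket_id, product_id) tuples. Naive counting is
    used here in place of DynamicItemCounting (assumption for illustration).
    """
    baskets = defaultdict(set)
    for basket_id, product_id in rows:
        baskets[basket_id].add(product_id)

    counts = Counter()
    for products in baskets.values():
        # Sort so the same itemset always yields the same tuple key.
        for itemset in combinations(sorted(products), size):
            counts[itemset] += 1
    return counts


def reduce_side_frequent(partials, threshold):
    """Stand-in for the reduce-side step: sum counts, then apply the threshold."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return sorted(itemset for itemset, count in total.items() if count >= threshold)


# Two hypothetical input splits, one per mapper.
split_a = [(1, "milk"), (1, "bread"), (2, "milk"), (2, "bread"), (2, "eggs")]
split_b = [(3, "milk"), (3, "bread"), (4, "eggs"), (4, "bread")]

partials = [map_side_candidates(split_a), map_side_candidates(split_b)]
print(reduce_side_frequent(partials, threshold=3))  # [('bread', 'milk')]
```

The point of the sketch is only that the per-split step runs on raw,
unpartitioned rows and emits (itemset, count) pairs, while the final step needs
the counts for one itemset brought together, which is what partitioning by
itemset provides.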

But from the caller's perspective it still looks like a PTF invocation:

select itemset
from candidateFreqItemSets(on basket partition by itemset)

Behind the scenes we create a Plan with a PTFOp on the map side, where the
DynamicItemCounting is done, and a PTFOp on the reduce side, where the
aggregation is done.

Hope this makes sense; I realize it is very brief, and I can go over it in
detail with you.
                
> Check for Map side processing in PTFOp is no longer valid
> ---------------------------------------------------------
>
>                 Key: HIVE-4358
>                 URL: https://issues.apache.org/jira/browse/HIVE-4358
>             Project: Hive
>          Issue Type: Bug
>          Components: PTF-Windowing
>            Reporter: Harish Butani
>         Attachments: HIVE-4358.D10473.1.patch
>
>
> With the changes for ReduceSinkDedup it is no longer true that a non Map-side 
> PTF Operator is preceded by an ExtractOp. For example, the following query 
> can reproduce the issue:
> {noformat}
> create view IF NOT EXISTS mfgr_price_view as 
> select p_mfgr, p_brand, 
> sum(p_retailprice) as s 
> from part 
> group by p_mfgr, p_brand;
>         
> select p_mfgr, p_brand, s, 
> sum(s) over w1  as s1
> from mfgr_price_view 
> window w1 as (distribute by p_mfgr sort by p_brand rows between 2 preceding 
> and current row);
> {noformat}
> The fix is to add an explicit flag to PTFDesc.
