[ https://issues.apache.org/jira/browse/HIVE-4358?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13641355#comment-13641355 ]
Harish Butani commented on HIVE-4358:
-------------------------------------
This is to allow a PTF to process the raw input before partitioning has
happened.
A good example is CandidateFrequentItemSet computation.
The input is a Basket(basketId, productId) table; the output is a list of
itemsets that are frequently bought together, where "frequent" is defined by a
threshold parameter.
The output has the form Itemset(Array<String> itemset), assuming productId is
a String.
The way you compute this is to apply a FrequentItemSet algorithm on subsets of
the input in parallel.
So in our prototype we implemented the DynamicItemCounting algorithm. This got
executed in each mapper;
the output was a Candidate Itemset(Array<String> itemset, count) from each
mapper.
The reducer then summed the counts across all mappers and checked them against
the threshold.
But from the caller's perspective it still looks like an ordinary PTF
invocation:
select itemset
from candidateFreqItemSets(on basket partition by itemset)
Behind the scenes we create a Plan with a PTFOp for the Map-side where the
DynamicItemCounting is done; and a PTFOp on the reduce side where the
aggregation is done.
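
To make the two phases concrete, here is a minimal Python sketch of the
map-side / reduce-side split described above. It is not the prototype's code:
it swaps in a brute-force itemset counter for DynamicItemCounting, and the
function names, the max_size parameter, and the sample data are all
hypothetical. It also assumes that all rows of a basket land in the same
input split.

```python
from collections import Counter, defaultdict
from itertools import combinations


def map_side_candidates(rows, max_size=2):
    """Count candidate itemsets within one input split (map side).

    rows: iterable of (basket_id, product_id) pairs from a single split.
    Brute-force stand-in for DynamicItemCounting; assumes a basket never
    spans more than one split.
    """
    baskets = defaultdict(set)
    for basket_id, product_id in rows:
        baskets[basket_id].add(product_id)

    counts = Counter()
    for items in baskets.values():
        # Emit every itemset of size 1..max_size contained in the basket.
        for size in range(1, max_size + 1):
            for itemset in combinations(sorted(items), size):
                counts[itemset] += 1
    return counts


def reduce_side_frequent(partials, threshold):
    """Sum per-mapper counts and keep itemsets meeting the threshold."""
    total = Counter()
    for partial in partials:
        total.update(partial)  # Counter.update adds counts together
    return sorted(s for s, n in total.items() if n >= threshold)


# Two hypothetical input splits of the Basket(basketId, productId) table.
split_a = [(1, "milk"), (1, "bread"), (2, "milk"), (2, "bread"), (2, "eggs")]
split_b = [(3, "milk"), (3, "bread"), (4, "eggs")]

partials = [map_side_candidates(split_a), map_side_candidates(split_b)]
frequent = reduce_side_frequent(partials, threshold=3)
print(frequent)
```

The map-side function plays the role of the Map-side PTFOp, and the reduce-side
function the role of the reduce-side PTFOp; in Hive the shuffle between them
would be keyed on itemset, which is what "partition by itemset" expresses in
the query.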
Hope this makes sense; I realize it is very brief, and I can go over it in
detail with you.
> Check for Map side processing in PTFOp is no longer valid
> ---------------------------------------------------------
>
> Key: HIVE-4358
> URL: https://issues.apache.org/jira/browse/HIVE-4358
> Project: Hive
> Issue Type: Bug
> Components: PTF-Windowing
> Reporter: Harish Butani
> Attachments: HIVE-4358.D10473.1.patch
>
>
> With the changes for ReduceSinkDedup it is no longer true that a non Map-side
> PTF Operator is preceded by an ExtractOp. For example, the following query
> can produce the issue:
> {noformat}
> create view IF NOT EXISTS mfgr_price_view as
> select p_mfgr, p_brand,
> sum(p_retailprice) as s
> from part
> group by p_mfgr, p_brand;
>
> select p_mfgr, p_brand, s,
> sum(s) over w1 as s1
> from mfgr_price_view
> window w1 as (distribute by p_mfgr sort by p_brand rows between 2 preceding
> and current row);
> {noformat}
> The fix is to add an explicit flag to PTFDesc.