[
https://issues.apache.org/jira/browse/PIG-169?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12582412#action_12582412
]
Arun C Murthy commented on PIG-169:
-----------------------------------
Sigh, the one problem even with the LOVisitor is that it is hard to track what
happens to actual tuples when they get flattened explicitly in a FOREACH, are
filtered out, etc.
E.g.
INPUT = load 'input';
A = group INPUT by $0;
B = foreach A generate flatten($1);
C = stream B through `script`;
In this case B shouldn't be 'auto-flattened' since there is no 'group' at all ...
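To make the concern concrete, here is a rough Python model of the script above. It is not Pig code: tuples are modeled as Python tuples, bags as lists, and `group_by_first` and `flatten_bag` are hypothetical helpers made up for this illustration. It shows that once the bag has been flattened explicitly, a naive "a GROUP occurred upstream, so skip the first field" rule would drop real data.

```python
# Rough model of the script above: Pig tuples as Python tuples, bags as lists.
# group_by_first and flatten_bag are hypothetical helpers, not Pig APIs.

def group_by_first(rows):
    """Mimic `group INPUT by $0`: emit (key, bag of original tuples)."""
    groups = {}
    for row in rows:
        groups.setdefault(row[0], []).append(row)
    return [(key, bag) for key, bag in groups.items()]

def flatten_bag(grouped):
    """Mimic `foreach A generate flatten($1)`: emit each tuple in the bag."""
    return [t for _key, bag in grouped for t in bag]

rows = [("A", 1), ("A", 2)]
a = group_by_first(rows)  # one tuple: the key plus a bag of the two rows
b = flatten_bag(a)        # the original rows again, with no 'group' field

# A naive "GROUP seen upstream, so drop field 0" rule would lose the key:
wrongly_skipped = [t[1:] for t in b]
```

Here `b` already has the shape the user wants, so any skipping decision has to take the intervening FOREACH into account.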
> Enhance PigStorage to handle complicated Tuples (i.e. automatically flatten
> them)
> ---------------------------------------------------------------------------------
>
> Key: PIG-169
> URL: https://issues.apache.org/jira/browse/PIG-169
> Project: Pig
> Issue Type: Improvement
> Components: data
> Reporter: Arun C Murthy
> Assignee: Arun C Murthy
>
> Currently PigStorage (actually Tuple.toDelimitedString) only handles the
> simple case of straight DataAtoms as fields and borks if it has any other
> Datum as a field. It would be nice to enhance it to handle the more
> complicated cases too. Currently users _have to_ use a *flatten* to convert
> these to simpler Tuples which can then be handled by PigStorage.
> ----
> On a related note, there is an interesting caveat with GROUP/COGROUP
> operators... they result in tuples whose first field is named 'group' and
> holds the value on which the grouping has been performed.
> E.g.
> Input:
> <A, 1>
> <A, 2>
> Pig script:
> INPUT = load 'input';
> A = group INPUT by $0;
> B = stream A through `script`;
> Results in A being:
> (A, {(A, 1), (A, 2)})
> Now, if PigStorage _auto-flattens_ A it results in:
> (A, A, 1)
> (A, A, 2)
> However, user expectation is probably the straightforward:
> (A, 1)
> (A, 2)
> ---
> Alan suggested that we could use the LOVisitor infrastructure to visit nodes
> in the tree, save up information (i.e. that a GROUP/COGROUP occurred) and then
> use that information to get PigStorage to 'skip' the group field while
> auto-flattening. However it might have to be done if, and only if, PigStorage
> is auto-flattening tuples coming directly from a GROUP/COGROUP operator, i.e.
> when no other EvalSpecs are working on those tuples ...
> ---
> Thoughts?
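For illustration only, the two outcomes described above can be sketched in Python. This is not Pig's implementation: `auto_flatten` and its `skip_group` flag are hypothetical names, tuples are modeled as Python tuples and bags as lists, and crossing multiple bags is an assumption made for this sketch.

```python
from itertools import product

def auto_flatten(tup, skip_group=False):
    """Expand every bag (list) field into one output tuple per bag element.

    With skip_group=True the leading 'group' field is dropped first, on the
    assumption that the tuples inside the bag already carry the key.
    """
    fields = tup[1:] if skip_group else tup
    # Each field offers a list of alternatives: the tuples in a bag, or the
    # scalar itself wrapped as a one-element tuple.
    choices = [f if isinstance(f, list) else [(f,)] for f in fields]
    # Concatenate one alternative per field; multiple bags are crossed.
    return [sum(combo, ()) for combo in product(*choices)]

grouped = ("A", [("A", 1), ("A", 2)])
naive = auto_flatten(grouped)                      # key appears twice
expected = auto_flatten(grouped, skip_group=True)  # key appears once
```

The flag has to be set by whoever knows the tuple came straight from a GROUP/COGROUP; the function itself cannot tell a 'group' field from an ordinary first field.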