[
https://issues.apache.org/jira/browse/PIG-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211174#comment-13211174
]
Prashant Kommireddi commented on PIG-2541:
------------------------------------------
This gets tricky now, if I understand correctly. Consider the following
sequence (note that it assumes the current approach of appending source_tag to
the end):
{code}
-- Schema is undefined at this point.
A = load 'input' using PigStorage('\t', '-tagsource');
-- Schema: (col1:int, col2:long, source_tag:chararray)
B = FOREACH A GENERATE (int)$0 as col1, (long)$1 as col2,
    (chararray)$2 as source_tag;
C = GROUP B BY source_tag;
-- ... (intermediate steps elided) ...
-- Schema, let's say: (source_tag:chararray, cnt:long)
STORE N INTO 'intermediate' using PigStorage('\t', '-schema');
{code}
Now the user wants to read 'intermediate' with its stored schema and also know
the (new) source path(s).
{code}
--Mentioning -schema is not required here, included just for clarity.
A = load 'intermediate' using PigStorage('\t', '-schema -tagsource');
-- Schema: (source_tag:chararray, cnt:long, source_tag:chararray)
{code}
Auto-loading the schema would cause a conflict in the above case, because the
name 'source_tag' would appear twice. I think decoupling 'schema' from
'tagsource' would be a nice alternative, as the input path is not part of the
"real data". It is a derived field that could be treated differently from the
actual data contained in the input files. So the user would always get the
right schema for the first n-1 columns, with the nth column being the
source_tag, whose schema does not really need to be auto-loaded? This is
similar to how it would work if one had extended PigStorage to implement source
tagging.
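As a rough sketch of how a script might look with the two options decoupled:
the stored schema would cover only the data columns, and the user would name
the new tag explicitly. This assumes the semantics proposed above, which are
not implemented, with the tag arriving as an unnamed last column;
prev_source_tag and new_source_tag are hypothetical aliases.
{code}
-- Assumed result of the load: the stored schema describes
-- (source_tag:chararray, cnt:long) and the new tag is an extra,
-- unnamed column at position $2.
A = load 'intermediate' using PigStorage('\t', '-schema -tagsource');
-- Rename on projection so the stored column and the new tag cannot collide.
B = FOREACH A GENERATE source_tag as prev_source_tag, cnt,
    (chararray)$2 as new_source_tag;
{code}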
I completely agree with you on the pain of appending source_tag to the end;
it's less predictable than at the start. However, things would get complicated
in terms of maintenance when users want to switch between using and not using
source tagging. It would be great to minimize the repositioning of column
references in production jobs, which is error-prone and might result in a large
number of script changes if fields are not referenced via an alias.
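To make the repositioning concern concrete, here is a rough sketch. It assumes
a hypothetical two-column input and that '-tagsource' behaves as discussed in
this thread, with the tag appended as the last column. The prepend variant is
not implemented and appears only as a comment; A_tagged and B_tagged are
illustrative aliases.
{code}
-- No source tagging: positional references point at the data columns.
A = load 'input' using PigStorage('\t');
B = FOREACH A GENERATE (int)$0 as col1, (long)$1 as col2;

-- Tag appended (assumed current behavior): $0 and $1 still refer to the
-- same data columns, so existing references need no changes.
A_tagged = load 'input' using PigStorage('\t', '-tagsource');
B_tagged = FOREACH A_tagged GENERATE (int)$0 as col1, (long)$1 as col2,
    (chararray)$2 as source_tag;

-- Tag prepended (hypothetical alternative): every data column shifts by
-- one, so each positional reference would have to be rewritten:
-- B_tagged = FOREACH A_tagged GENERATE (chararray)$0 as source_tag,
--     (int)$1 as col1, (long)$2 as col2;
{code}
Scripts that reference fields only by alias would be unaffected either way.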
Lastly, I am leaning towards the 'append' approach but am fine with either one.
We just need to make sure this feature is easy to use and adopt.
> Automatic record provenance (source tagging) for PigStorage
> -----------------------------------------------------------
>
> Key: PIG-2541
> URL: https://issues.apache.org/jira/browse/PIG-2541
> Project: Pig
> Issue Type: Improvement
> Components: impl
> Affects Versions: 0.9.1
> Reporter: Richard Ding
> Assignee: Prashant Kommireddi
> Attachments: PIG-2541.patch
>
>
> There is a lot of interest in knowing where the data comes from when
> loading from a directory (or a set of directories). One can do it manually
> (see https://cwiki.apache.org/confluence/display/PIG/FAQ). But it will be
> more convenient for users if we implement this in the PigStorage with a
> command line option (e.g., pig.source.tagging=true/false) to turn it on/off.
> By default it will be off.