[jira] [Commented] (PIG-2541) Automatic record provenance (source tagging) for PigStorage

Prashant Kommireddi (Commented) (JIRA) Sat, 18 Feb 2012 17:12:22 -0800

    [ https://issues.apache.org/jira/browse/PIG-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211174#comment-13211174 ]


Prashant Kommireddi commented on PIG-2541:
------------------------------------------

This gets tricky now, if I understand correctly. Consider the following 
sequence (note that it assumes the current behavior of appending source_tag to 
the end of each record):

{code}
-- 1. Schema is undefined at this point.
A = load 'input' using PigStorage('\t', '-tagsource');

-- 2. Schema: (col1:int, col2:long, source_tag:chararray)
B = FOREACH A GENERATE (int)$0 as col1, (long)$1 as col2, (chararray)$2 as source_tag;

-- 3.
C = GROUP B BY source_tag;

-- .
-- .   (intermediate steps omitted)
-- .

-- n. Schema, let's say: (source_tag:chararray, cnt:long)
STORE N INTO 'intermediate' using PigStorage('\t', '-schema');
{code}

Now suppose the user wants to read 'intermediate' using the stored schema and 
also know the (new) source path(s):

{code}
-- Mentioning -schema is not required here; it is included just for clarity.
-- Schema: (source_tag:chararray, cnt:long, source_tag:chararray)
A = load 'intermediate' using PigStorage('\t', '-schema -tagsource');
{code}

There would be a conflict when auto-loading 'source_tag' in the above case: the 
stored schema already contains a field with that alias, and -tagsource adds a 
second one. I think decoupling 'schema' from 'tagsource' would be a nice 
alternative, as the input path is not part of the "real data". It is a derived 
field that could be treated differently from the actual data contained in the 
input files. The user would always get the right schema for the first n-1 
columns, with the nth column being the source_tag, for which a schema does not 
really need to be auto-loaded. This is similar to how it would work if one had 
extended PigStorage to implement source tagging.
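
To make the conflict concrete, here is a minimal Python sketch that models the 
two options. It is not PigStorage code: the function names, the tuple 
representation of a schema, and the choice of leaving the tag column unnamed 
are all assumptions made for illustration.

```python
from typing import List, Optional, Tuple

# A schema is modeled as a list of (alias, type) pairs. Hypothetical model,
# not the real Pig schema classes.
Field = Tuple[Optional[str], str]

# Alias and type taken from the example in this comment.
TAG_FIELD: Field = ("source_tag", "chararray")


def naive_schema(stored: List[Field], tagsource: bool) -> List[Field]:
    """Append a named tag column to the stored schema without checking aliases."""
    if tagsource:
        return list(stored) + [TAG_FIELD]
    return list(stored)


def decoupled_schema(stored: List[Field], tagsource: bool) -> List[Field]:
    """Use the stored schema for the data columns; leave the tag column unnamed."""
    if tagsource:
        # Assumption: the tag column gets no alias, so it cannot clash with
        # a data column. bytearray is Pig's default type for undeclared fields.
        return list(stored) + [(None, "bytearray")]
    return list(stored)


def duplicate_aliases(schema: List[Field]) -> List[str]:
    """Return aliases that occur more than once, ignoring unnamed fields."""
    seen, dups = set(), []
    for alias, _type in schema:
        if alias is None:
            continue
        if alias in seen and alias not in dups:
            dups.append(alias)
        seen.add(alias)
    return dups


# Schema stored by step n in the example above.
stored = [("source_tag", "chararray"), ("cnt", "long")]

print(duplicate_aliases(naive_schema(stored, True)))      # ['source_tag']
print(duplicate_aliases(decoupled_schema(stored, True)))  # []
```

In this model the data columns keep their stored aliases either way; only the 
handling of the derived column differs.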

I completely agree with you on the pain of appending source_tag to the end; it 
is less predictable than at the start. However, placing it at the start would 
complicate maintenance when users want to switch between using and not using 
source tagging. It would be great to minimize reference repositioning changes 
for production jobs (these are error-prone and might result in a large number 
of script changes if fields are not referenced via an alias).
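
The repositioning concern can be illustrated with a small Python sketch. The 
`tag_record` helper, the sample row, and the file path are hypothetical; the 
list indices stand in for Pig's positional references ($0, $1, ...).

```python
from typing import List


def tag_record(fields: List[str], source: str, position: str) -> List[str]:
    """Return a copy of the record with the source path added at the start or end."""
    if position == "prepend":
        return [source] + list(fields)
    if position == "append":
        return list(fields) + [source]
    raise ValueError("position must be 'prepend' or 'append'")


row = ["42", "7"]            # untagged record: $0 = "42", $1 = "7"
src = "input/part-00000"     # hypothetical source path

appended = tag_record(row, src, "append")
prepended = tag_record(row, src, "prepend")

# Appending leaves existing positional references untouched.
print(appended[0], appended[1])    # 42 7
# Prepending shifts every data column by one position.
print(prepended[0], prepended[1])  # input/part-00000 42
```

A script that refers to $0 and $1 keeps working when tagging is switched on 
with the append layout, but needs every positional reference rewritten with 
the prepend layout.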

Lastly, I am leaning towards the 'append' approach but am fine with either one. 
We just need to make sure this is an easy feature to use and adopt.
                
> Automatic record provenance (source tagging) for PigStorage
> -----------------------------------------------------------
>
>                 Key: PIG-2541
>                 URL: https://issues.apache.org/jira/browse/PIG-2541
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.9.1
>            Reporter: Richard Ding
>            Assignee: Prashant Kommireddi
>         Attachments: PIG-2541.patch
>
>
> There is a lot of interest in knowing where the data comes from when 
> loading from a directory (or a set of directories). One can do it manually 
> (see https://cwiki.apache.org/confluence/display/PIG/FAQ), but it would be 
> more convenient for users if we implemented this in PigStorage with a 
> command-line option (e.g., pig.source.tagging=true/false) to turn it on or 
> off. By default it would be off.

