[ 
https://issues.apache.org/jira/browse/PIG-2541?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13211147#comment-13211147
 ] 

Dmitriy V. Ryaboy commented on PIG-2541:
----------------------------------------

Prashant, I am thinking of the case where the loaded schema is something like 
(a:int, b:int) but, due to loading with "using PigStorage('\t', '-useSchema 
-pig.source.tagging=true')", the schema expected by the user is (a:int, b:int, 
source_tag:chararray). Since the loader doesn't report this modified schema, 
the user won't be able to access the new field. I suspect the regression wasn't 
caught because you tested the two options only separately, never combined.
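
To illustrate the point about the reported schema, here is a minimal Python sketch (not Pig's actual loader code). The function name reported_schema, its parameters, and the (name, type) tuple representation are all hypothetical; the only things taken from the comment are the example schema (a:int, b:int) and the extra field source_tag:chararray. It shows the idea that when tagging is enabled the loader would need to report the stored schema plus the tag field, rather than the stored schema unchanged.

```python
from typing import List, Tuple

# A field is modelled as a (name, type) pair. This is an illustrative
# stand-in, not Pig's real schema class.
Field = Tuple[str, str]


def reported_schema(stored: List[Field], tag_source: bool,
                    tag_at_start: bool = False) -> List[Field]:
    """Return the schema a loader would need to report to the user.

    Hypothetical helper: when source tagging is off, the stored schema is
    reported as-is; when it is on, the tag field is added so the user can
    actually reference it.
    """
    if not tag_source:
        return list(stored)
    tag: Field = ("source_tag", "chararray")
    if tag_at_start:
        return [tag] + list(stored)
    return list(stored) + [tag]


if __name__ == "__main__":
    stored = [("a", "int"), ("b", "int")]
    # Both options combined: stored schema in use AND tagging enabled.
    print(reported_schema(stored, tag_source=True))
    # Tagging disabled: the stored schema is reported unchanged.
    print(reported_schema(stored, tag_source=False))
```

The case described in the comment corresponds to the first call: if the loader keeps reporting only the stored two-field schema there, the tag field exists in the data but has no name the user can address.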

This should "just work" on the storage side, as opposed to the loader side; I 
don't think there's a problem there as long as the loader side is fixed.

Regarding the position of the tag: I really think putting it at the beginning 
is better. As I described above, putting it at the end leads to straight-up 
unpredictable results in some circumstances; avoiding that situation takes 
precedence (in my mind) over the convenience of modifying existing scripts 
(which will need to be modified anyway to take advantage of this, so in for a 
penny, in for a pound).
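
The positional argument can be sketched in a few lines of Python. This is an illustration under an assumption, not a reproduction of the earlier comment: it assumes that one of the problematic circumstances is input rows with differing numbers of fields. The helper names append_tag and prepend_tag and the file name part-00000 are hypothetical.

```python
from typing import List


def append_tag(fields: List[str], source: str) -> List[str]:
    """Put the source tag after the data fields (tag at the end)."""
    return list(fields) + [source]


def prepend_tag(fields: List[str], source: str) -> List[str]:
    """Put the source tag before the data fields (tag at the beginning)."""
    return [source] + list(fields)


if __name__ == "__main__":
    # Assumed scenario: rows in the same file have different field counts.
    rows = [["1", "2"], ["3"], ["4", "5", "6"]]
    source = "part-00000"  # hypothetical file name

    # With the tag at the end, its index depends on each row's width.
    end_positions = [append_tag(r, source).index(source) for r in rows]
    # With the tag at the beginning, its index is always 0.
    start_positions = [prepend_tag(r, source).index(source) for r in rows]

    print(end_positions)
    print(start_positions)
```

Under that assumption, a script that addresses the tag by position can only rely on a fixed index when the tag comes first.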
                
> Automatic record provenance (source tagging) for PigStorage
> -----------------------------------------------------------
>
>                 Key: PIG-2541
>                 URL: https://issues.apache.org/jira/browse/PIG-2541
>             Project: Pig
>          Issue Type: Improvement
>          Components: impl
>    Affects Versions: 0.9.1
>            Reporter: Richard Ding
>            Assignee: Prashant Kommireddi
>         Attachments: PIG-2541.patch
>
>
> There is a lot of interest in knowing where the data comes from when 
> loading from a directory (or a set of directories). One can do it manually 
> (see https://cwiki.apache.org/confluence/display/PIG/FAQ), but it would be 
> more convenient for users if we implemented this in PigStorage with a 
> command-line option (e.g., pig.source.tagging=true/false) to turn it on or 
> off. By default it would be off.
