[ https://issues.apache.org/jira/browse/PIG-5231?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16025155#comment-16025155 ]
Daniel Dai commented on PIG-5231:
---------------------------------

Vote for 3. We pick the first schema in dirs in all LoadFunc, such as OrcStorage, AvroStorage. I don't think we shall make an exception for PigStorage.

+1 for the patch.

> PigStorage with -schema may produce inconsistent outputs with more fields
> -------------------------------------------------------------------------
>
>                 Key: PIG-5231
>                 URL: https://issues.apache.org/jira/browse/PIG-5231
>             Project: Pig
>          Issue Type: Bug
>            Reporter: Koji Noguchi
>            Assignee: Koji Noguchi
>            Priority: Minor
>         Attachments: pig-5231-v01.patch
>
>
> When multiple directories are passed to PigStorage(',','-schema'), pig will
> {quote}
> No attempt to merge conflicting schemas is made during loading. The first
> schema encountered during a file system scan is used.
> {quote}
> For two directories input with schema
> file1: (f1:chararray, f2:int) and
> file2: (f1:chararray, f2:int, f3:int)
> Pig will pick the first schema from file1 and only allow f1, f2 access.
> However, output would still contain 3 fields for tuples from file2. This
> later leads to complete corrupt outputs due to shifted fields resulting in
> incorrect references.
> (This may also happen when input itself contains the delimiter.)
> If file2 schema is picked, this is already handled by filling the missing
> fields with null. (PIG-3100)

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
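
The width mismatch described in the issue can be illustrated with a small Python sketch. This is not Pig code and not the contents of pig-5231-v01.patch; the helper names `parse_line` and `fit_to_schema` are hypothetical, and dropping extra trailing fields is only an assumption about one way to keep every tuple as wide as the adopted schema. Padding missing fields with null mirrors the behavior the issue attributes to PIG-3100.

```python
from typing import List, Optional, Sequence


def parse_line(line: str, delimiter: str = ",") -> List[str]:
    """Split a raw text line into fields, as a delimiter-based loader would."""
    return line.split(delimiter)


def fit_to_schema(
    fields: Sequence[str], schema: Sequence[str]
) -> List[Optional[str]]:
    """Force a tuple to the width of the adopted schema (hypothetical helper).

    Extra trailing fields are dropped; missing fields become None.
    """
    width = len(schema)
    fitted: List[Optional[str]] = list(fields[:width])
    fitted.extend([None] * (width - len(fitted)))
    return fitted


# Schemas from the issue description.
schema_file1 = ["f1", "f2"]
schema_file2 = ["f1", "f2", "f3"]

row_file1 = parse_line("a,1")
row_file2 = parse_line("b,2,3")

# Without fitting, a row from file2 is wider than the adopted file1 schema,
# which is the inconsistency the issue reports.
print(len(row_file2), len(schema_file1))

# file1 schema adopted: the extra field from file2 is dropped.
print(fit_to_schema(row_file2, schema_file1))

# file2 schema adopted: the missing field from file1 is filled with None.
print(fit_to_schema(row_file1, schema_file2))
```

With every tuple forced to the schema width, downstream positional references stay aligned regardless of which directory's schema was encountered first.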