[ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Jerry Chen updated PIG-3404:
----------------------------

    Attachment: PIG-3404.patch

Patch for reference
                
> Improve Pig to ignore bad files or inaccessible files or folders
> ----------------------------------------------------------------
>
>                 Key: PIG-3404
>                 URL: https://issues.apache.org/jira/browse/PIG-3404
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>    Affects Versions: 0.11.2
>            Reporter: Jerry Chen
>              Labels: Rhino
>         Attachments: PIG-3404.patch
>
>
> Two use cases in Pig motivate this request:
> * A directory is used as the input of a load operation, and one or more 
> files in that directory are bad (for example, corrupted, or containing bad 
> data caused by compression).
> * A directory is used as the input of a load operation, and the current user 
> does not have permission to access some of its subdirectories or files.
> The current Pig implementation aborts the whole job in such cases. It would 
> be useful to have an option that lets the job continue, ignoring the bad 
> files or inaccessible files/folders and, ideally, logging or printing a 
> warning for each such error or violation. This requirement is not trivial: 
> for the big data sets used by large analytics applications, it is not always 
> possible to sort out the good data before processing, so ignoring a few bad 
> files may be the better choice in such situations.
> We propose an “Ignore bad files” flag to address this problem. AvroStorage 
> and the related file formats in Pig already have this flag, but it does not 
> cover all the cases mentioned above. We would improve PigStorage and the 
> related text formats to support the new flag, and improve AvroStorage and 
> related facilities to support the concept completely.
> The flag is per storage function (for example, PigStorage or AvroStorage) and 
> can be set separately for each load operation. Its value is false unless it 
> is explicitly set. Ideally, we can also provide a global Pig parameter that 
> changes the default to true for all load functions that do not set the flag 
> explicitly in the LOAD statement.
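
The proposed semantics can be sketched in a few lines. This is only an illustration of the behaviour described above, not code from PIG-3404.patch and not Pig's API: the names resolve_ignore_bad_files and read_records are hypothetical, plain UTF-8 text files stand in for Pig input splits, and it assumes that an explicitly set per-load flag takes precedence over the global default.

```python
import logging
import os
from typing import Iterator, Optional

log = logging.getLogger("ignore_bad_files_sketch")


def resolve_ignore_bad_files(load_flag: Optional[bool],
                             global_default: bool = False) -> bool:
    """Return the effective flag for one load operation.

    Assumption: a flag set explicitly on the load wins; otherwise the
    global default applies (false unless the global parameter is set).
    """
    return global_default if load_flag is None else load_flag


def read_records(directory: str, ignore_bad_files: bool) -> Iterator[str]:
    """Yield lines from every file in `directory` (hypothetical loader).

    An unreadable or undecodable file aborts the whole read unless
    `ignore_bad_files` is true, in which case it is logged and skipped.
    """
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        try:
            # Read the whole file first so that a file which fails
            # midway contributes no partial records.
            with open(path, encoding="utf-8") as handle:
                lines = handle.read().splitlines()
        except (OSError, UnicodeDecodeError) as err:
            if not ignore_bad_files:
                raise  # current behaviour: the job aborts
            log.warning("Skipping bad or inaccessible input %s: %s", path, err)
            continue
        yield from lines
```

With the flag left unset and no global override, resolve_ignore_bad_files(None) returns False and the first bad file raises, which matches the current abort behaviour; passing True makes the read continue past the bad file with a warning.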

