[ https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jerry Chen updated PIG-3404:
----------------------------
    Attachment: PIG-3404.patch

Patch for reference

> Improve Pig to ignore bad files or inaccessible files or folders
> ----------------------------------------------------------------
>
>                 Key: PIG-3404
>                 URL: https://issues.apache.org/jira/browse/PIG-3404
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>    Affects Versions: 0.11.2
>            Reporter: Jerry Chen
>              Labels: Rhino
>         Attachments: PIG-3404.patch
>
>
> Pig has the following use cases:
> * A directory is used as the input of a load operation. One or more files
> in that directory may be bad (for example, corrupted, or containing bad
> data caused by compression).
> * A directory is used as the input of a load operation. The current user
> may lack permission to access some subdirectories or files in that
> directory.
> In such cases the current Pig implementation aborts the whole Pig job.
> It would be useful to have an option that lets the job continue and ignore
> the bad files or inaccessible files/folders without aborting, ideally
> logging or printing a warning for each such error or violation. This
> requirement is not trivial: for the large data sets used by big analytics
> applications, it is not always possible to sort out the good data before
> processing, so ignoring a few bad files may be the better choice.
> We propose an “Ignore bad files” flag to address this problem.
> AvroStorage and the related file formats in Pig already have this flag,
> but it does not cover all the cases mentioned above. We would extend
> PigStorage and the related text formats to support this flag, and improve
> AvroStorage and the related facilities to support the concept completely.
> The flag is “Storage” based (for example, PigStorage or AvroStorage) and
> can be set separately for each load operation. The value of the flag is
> false if it is not explicitly set.
> Ideally, we can provide a global Pig parameter that forces the default
> value to true for all load functions, even if the flag is not explicitly
> set in the LOAD statement.
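
The flag semantics described in the issue can be illustrated with a small sketch. This is not Pig code and not the contents of PIG-3404.patch; it is plain Python written only to show the proposed behaviour. The names `resolve_ignore_bad_files`, `load_records` and `read_file` are hypothetical. Two points are assumptions the issue does not settle: that an explicit per-load setting takes precedence over the global default, and that a file which fails partway through contributes no records at all.

```python
import logging
from typing import Callable, Iterable, Iterator, Optional

log = logging.getLogger("ignore_bad_files_sketch")


def resolve_ignore_bad_files(per_load_flag: Optional[bool],
                             global_default: bool = False) -> bool:
    """Return the effective flag for one load operation.

    per_load_flag is None when the load statement does not set the flag.
    Assumption: an explicit per-load value wins over the global default.
    """
    if per_load_flag is not None:
        return per_load_flag
    return global_default


def load_records(paths: Iterable[str],
                 read_file: Callable[[str], Iterable[str]],
                 ignore_bad_files: bool) -> Iterator[str]:
    """Yield records from every path, optionally skipping unreadable inputs.

    read_file is a hypothetical reader. OSError (which includes
    PermissionError) stands in for inaccessible inputs, and ValueError
    stands in for corrupted data.
    """
    for path in paths:
        try:
            # Read the whole file first so that a failure partway through
            # does not leave a partial set of records in the output.
            records = list(read_file(path))
        except (OSError, ValueError) as exc:
            if not ignore_bad_files:
                raise  # current behaviour: the whole job aborts
            log.warning("Skipping unreadable input %s: %s", path, exc)
            continue
        yield from records
```

With the flag resolved to false, the first bad or inaccessible input raises and the job stops, which matches the current behaviour described in the issue. With it resolved to true, each such input produces one warning and the remaining inputs are still processed.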