Jerry Chen created PIG-3404:
-------------------------------

             Summary: Improve Pig to ignore bad files or inaccessible files or 
folders
                 Key: PIG-3404
                 URL: https://issues.apache.org/jira/browse/PIG-3404
             Project: Pig
          Issue Type: New Feature
          Components: data
    Affects Versions: 0.11.2
            Reporter: Jerry Chen


There are use cases in Pig:
* A directory is used as the input of a load operation. It is possible that one 
or more files in that directory are bad files (for example, corrupted or bad 
data caused by compression).
* A directory is used as the input of a load operation. The current user may 
not have permission to access some of the subdirectories or files in that 
directory.

The current Pig implementation aborts the whole Pig job in such cases. It 
would be useful to have an option that lets the job continue and ignore the 
bad or inaccessible files and folders instead of aborting, ideally logging or 
printing a warning for each such error or violation. This requirement is not a 
trivial one: for the large data sets used in analytics applications, it is not 
always possible to sort out the good data before processing, and ignoring a 
few bad files may be the better choice in such situations.
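
As a rough illustration of the requested behavior (not Pig code), the sketch 
below loads a list of input paths and, when the flag is on, skips any path 
whose read fails and records a warning instead of aborting. The class name, 
the FileReader interface and the loadAll method are hypothetical names 
invented for this example; a real implementation would live inside the load 
functions and input formats.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class IgnoreBadFilesSketch {

    /** Hypothetical stand-in for "read every record of one input file". */
    interface FileReader {
        List<String> read(String path) throws IOException;
    }

    /**
     * Reads all paths in order. If ignoreBadFiles is false, the first
     * IOException aborts the whole load (the behavior described above).
     * If it is true, the failing path is skipped and a warning is recorded.
     */
    static List<String> loadAll(List<String> paths,
                                FileReader reader,
                                boolean ignoreBadFiles,
                                List<String> warnings) throws IOException {
        List<String> records = new ArrayList<>();
        for (String path : paths) {
            try {
                // read() either returns the whole file or throws, so a bad
                // file never contributes partial records in this sketch.
                records.addAll(reader.read(path));
            } catch (IOException e) {
                if (!ignoreBadFiles) {
                    throw e; // abort, as Pig does today
                }
                warnings.add("Skipping " + path + ": " + e.getMessage());
            }
        }
        return records;
    }
}
```

Permission failures would normally surface as an IOException subclass as well, 
so in this sketch the same catch branch covers both use cases.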

We propose an “Ignore bad files” flag to address this problem. AvroStorage 
and the related file formats in Pig already have this flag, but it does not 
cover all of the cases mentioned above. We would extend PigStorage and the 
related text formats to support the new flag, and improve AvroStorage and 
related facilities to support the concept completely.

The flag is set per storage function (for example, PigStorage or AvroStorage) 
and can be set independently for each load operation. The value of the flag 
is false if it is not explicitly set. Ideally, we can also provide a global 
Pig parameter that changes the default value to true for all load functions 
that do not set the flag explicitly in the LOAD statement.
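
The precedence described above could be resolved roughly as follows. This is 
only a sketch under assumptions: the property name is a placeholder invented 
for the example and is not an existing Pig setting, and letting an explicit 
per-load value win over the global default is our reading of the proposal.

```java
import java.util.Properties;

public class IgnoreBadFilesFlag {

    // Hypothetical property name, made up for this sketch only.
    static final String GLOBAL_KEY = "example.load.ignore.bad.files";

    /**
     * perLoadValue is null when the LOAD statement does not set the flag.
     * An explicit per-load value wins; otherwise the global property is
     * used; if neither is set, the flag is false.
     */
    static boolean resolve(Boolean perLoadValue, Properties globalProps) {
        if (perLoadValue != null) {
            return perLoadValue;
        }
        return Boolean.parseBoolean(globalProps.getProperty(GLOBAL_KEY, "false"));
    }
}
```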


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira