[ 
https://issues.apache.org/jira/browse/PIG-3404?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777186#comment-13777186
 ] 

Jerry Chen commented on PIG-3404:
---------------------------------

Hi Park, sorry for the late response. I am glad that we can discuss this 
topic further in this JIRA. 

As mentioned in the JIRA description, we are taking the approach of an 
“Ignore bad files” flag for each storage, so that different storages can be 
controlled separately instead of through one global flag. In addition, in our 
use cases the current user may not have permission to access some of the 
subdirectories of the input directory; such a subdirectory can be regarded as 
a “bad directory” in concept.

Another point is the ignore ratio. We currently take an even simpler approach 
of “ignore all” or “ignore nothing”, controlled by a flag. As you mentioned, 
PIG-3059 uses a threshold to control how many bad input splits can be ignored. 
That is a good thing. The question, though, is: how many real-world cases need 
a ratio other than 0 or 1? 
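
To make the comparison concrete, here is a minimal sketch of how an on/off flag can be seen as the two end points of a ratio threshold. The class and method names (BadSplitPolicy, fromFlag, shouldAbort) are hypothetical and are not taken from the PIG-3404 or PIG-3059 patches; the sketch also assumes the total number of splits is known when the check is made.

```java
/** Hypothetical policy: abort once the share of bad splits exceeds a ratio. */
final class BadSplitPolicy {
    private final double maxErrorRatio;

    BadSplitPolicy(double maxErrorRatio) {
        if (maxErrorRatio < 0.0 || maxErrorRatio > 1.0) {
            throw new IllegalArgumentException("ratio must be in [0, 1]");
        }
        this.maxErrorRatio = maxErrorRatio;
    }

    /** Maps the boolean flag onto the two extremes: 1.0 = ignore all, 0.0 = ignore nothing. */
    static BadSplitPolicy fromFlag(boolean ignoreBadFiles) {
        return new BadSplitPolicy(ignoreBadFiles ? 1.0 : 0.0);
    }

    /** Returns true when the observed bad-split ratio is above the allowed ratio. */
    boolean shouldAbort(long badSplits, long totalSplits) {
        if (totalSplits <= 0) {
            return false; // nothing was read, so nothing can have failed
        }
        return (double) badSplits / totalSplits > maxErrorRatio;
    }
}
```

Under these assumptions, fromFlag(false) aborts on the first bad split and fromFlag(true) never aborts, so the open question is only whether intermediate values such as 0.1 are needed in practice.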

I went through the patch in PIG-3059, trying to understand how the ratio 
is enforced globally in a distributed MapReduce task environment. It seems 
that InputErrorTracker.java uses a local variable (numErrors) for error 
tracking. I may be missing something there, but it would be very helpful if 
you could explain.
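
The concern can be illustrated with a small simulation. This is only a sketch of the general issue and does not reproduce InputErrorTracker.java from the PIG-3059 patch: LocalErrorTracker and LocalVersusGlobal are hypothetical names, and each tracker instance stands in for the state held by one task in its own JVM.

```java
/** Hypothetical per-task tracker: the counter lives only inside one task. */
final class LocalErrorTracker {
    private final long maxErrors;
    private long numErrors = 0;

    LocalErrorTracker(long maxErrors) {
        this.maxErrors = maxErrors;
    }

    /** Records one error; returns true once this task alone is over its limit. */
    boolean recordError() {
        numErrors++;
        return numErrors > maxErrors;
    }

    long getNumErrors() {
        return numErrors;
    }
}

final class LocalVersusGlobal {
    /** Returns {tasks that tripped their own limit, errors summed over all tasks}. */
    static long[] simulate(int tasks, int errorsPerTask, long maxErrorsPerTask) {
        long tripped = 0;
        long total = 0;
        for (int t = 0; t < tasks; t++) {
            // Each task starts with a fresh counter and cannot see the others.
            LocalErrorTracker tracker = new LocalErrorTracker(maxErrorsPerTask);
            boolean overLimit = false;
            for (int e = 0; e < errorsPerTask; e++) {
                overLimit |= tracker.recordError();
            }
            if (overLimit) {
                tripped++;
            }
            total += tracker.getNumErrors();
        }
        return new long[] {tripped, total};
    }
}
```

With 4 tasks, 1 error each, and a per-task limit of 2, simulate(4, 1, 2) reports no task over its limit even though the job as a whole saw 4 errors. A limit checked against a task-local counter is therefore a per-task limit unless the counts are aggregated somewhere.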

Thank you too for providing the helpful information and let’s continue the 
discussion. 

                
> Improve Pig to ignore bad files or inaccessible files or folders
> ----------------------------------------------------------------
>
>                 Key: PIG-3404
>                 URL: https://issues.apache.org/jira/browse/PIG-3404
>             Project: Pig
>          Issue Type: New Feature
>          Components: data
>    Affects Versions: 0.11.2
>            Reporter: Jerry Chen
>              Labels: Rhino
>         Attachments: PIG-3404.patch
>
>
> There are use cases in Pig:
> * A directory is used as the input of a load operation. It is possible that 
> one or more files in that directory are bad files (for example, corrupted or 
> bad data caused by compression).
> * A directory is used as the input of a load operation. The current user may 
> not have permission to access any subdirectories or files of that directory.
> The current Pig implementation will abort the whole Pig job in such cases. 
> It would be useful to have an option that allows the job to continue and 
> ignore the bad files or inaccessible files/folders without aborting the job, 
> ideally logging or printing a warning for each such error or violation. This 
> requirement is not trivial: for the big data sets of large analytics 
> applications, it is not always possible to sort out the good data for 
> processing, and ignoring a few bad files may be a better choice in such 
> situations.
> We propose to use an “Ignore bad files” flag to address this problem. 
> AvroStorage and related file formats in Pig already have this flag, but it 
> does not cover all the cases mentioned above. We would improve PigStorage 
> and related text formats to support this new flag, and also improve 
> AvroStorage and related facilities to fully support the concept.
> The flag is “Storage” based (for example, PigStorage or AvroStorage) and can 
> be set for each load operation individually. The value of this flag will be 
> false if it is not explicitly set. Ideally, we can provide a global pig 
> parameter which forces the default value to true for all load functions even 
> if it is not explicitly set in the LOAD statement.
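
The default resolution described in the quoted issue text could be sketched as follows. This is an illustration only, not the patch's code: the class name IgnoreBadFilesFlag and the property key pig.load.ignore.bad.files are hypothetical, and the sketch assumes an explicit per-load setting takes precedence over the global parameter.

```java
import java.util.Properties;

/** Hypothetical helper that resolves the effective "Ignore bad files" value. */
final class IgnoreBadFilesFlag {
    // Hypothetical key for the global default; not an existing Pig property.
    static final String GLOBAL_KEY = "pig.load.ignore.bad.files";

    /**
     * perLoadSetting is null when the LOAD statement did not set the flag.
     * Order: explicit per-load value, then the global default, then false.
     */
    static boolean resolve(Boolean perLoadSetting, Properties global) {
        if (perLoadSetting != null) {
            return perLoadSetting;
        }
        return Boolean.parseBoolean(global.getProperty(GLOBAL_KEY, "false"));
    }
}
```

Under these assumptions, a load that does not mention the flag gets false unless the global parameter is set to true, while a load that sets the flag explicitly keeps its own value.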

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
