[ 
https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239193#comment-13239193
 ] 

Jonathan Coveney commented on PIG-2614:
---------------------------------------

Russell,

In Elephant-bird, there is a key 
elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like 
doing this, so find attached a patch that adds the functionality you want (note 
that it includes PIG-2551, which is more or less good to go... only because 
that patch brings in a Counter helper).

The default functionality does not change. On an error, it will die. However, 
there are now two keys that can be set:
pig.piggybank.storage.avro.bad.record.threshold
pig.piggybank.storage.avro.bad.record.min

The former sets the acceptable ratio threshold of bad records. The latter sets 
the minimum number of errors that must occur before the job can error out.
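
To make the interaction between the two keys concrete, here is a minimal 
sketch of one plausible reading. It is not the code from the attached patch: 
the class name BadRecordPolicy and its methods are hypothetical, and it assumes 
the ratio is errors divided by read attempts, and that the job fails only once 
the error count has reached the minimum and the ratio exceeds the threshold.

```java
/**
 * Hypothetical helper, not the class from the attached patch.
 * Assumption: fail only when errors >= minErrors AND errors/records > threshold.
 */
class BadRecordPolicy {
    private final double threshold; // acceptable ratio of bad records
    private final long minErrors;   // errors required before the ratio is checked
    private long records = 0;       // read attempts, good or bad
    private long errors = 0;        // failed read attempts

    BadRecordPolicy(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    /** Call once for every read attempt, whether or not it succeeds. */
    void onRecord() {
        records++;
    }

    /** Call when a read attempt fails; returns true if the job should die. */
    boolean onError() {
        errors++;
        if (errors < minErrors) {
            return false; // too few errors so far to judge the ratio
        }
        return (double) errors / records > threshold;
    }
}
```

With both values left at 0, the first error already exceeds the threshold, 
which matches the unchanged default of dying on any error.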

Here is where you come in:

Currently, the only error I log is on "reader.next()." Are there any other 
cases where errors (at least, errors indicating a bad row) can be thrown? And 
on an error, what do you want to happen? Skip the row, or return null? Skipping 
the record makes the most sense to me (also, the number of records processed 
and the number of errors thrown are now logged in Hadoop counters).
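
As a rough illustration of the "skip the row" option, here is a sketch under 
stated assumptions. SkippingLoader and its fields are hypothetical names, a 
plain java.util.Iterator stands in for the Avro reader, and plain long fields 
stand in for the Hadoop counters. It also assumes that a failed next() still 
advances past the bad row, which may not hold for every kind of corruption. 
One reason to prefer skipping: Pig's LoadFunc.getNext() uses null to signal the 
end of input, so returning null for a bad row would likely end the load early.

```java
import java.util.Iterator;

/** Hypothetical sketch: skip rows whose read throws, counting as we go. */
class SkippingLoader<T> {
    private final Iterator<T> reader; // stand-in for the Avro reader
    long recordsProcessed = 0;        // stand-in for a Hadoop counter
    long errorsThrown = 0;            // stand-in for a Hadoop counter

    SkippingLoader(Iterator<T> reader) {
        this.reader = reader;
    }

    /** Returns the next good row, or null only when the input is exhausted. */
    T getNext() {
        while (reader.hasNext()) {
            recordsProcessed++;
            try {
                return reader.next();
            } catch (RuntimeException e) {
                // Bad row: count it and try the next one.
                // Assumes the reader has moved past the failing row.
                errorsThrown++;
            }
        }
        return null;
    }
}
```

A threshold check would go in the catch block, so that the loader still dies 
once too many rows have been skipped.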

Secondly, someone needs to make tests. It currently passes the tests, but 
that's because the default threshold and min are 0. I don't know what is and 
isn't a bad Avro file, though, so yeah. Hopefully the fact that I did the work 
implementing will motivate someone to add tests ;)
                
> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>            Priority: Blocker
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, 
> pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing 
> fields.  This is very bad on 'big data,' where bad records are inevitable.  
> See discussion at 
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
>  for more theory.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
