[ https://issues.apache.org/jira/browse/PIG-2614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239193#comment-13239193 ]
Jonathan Coveney commented on PIG-2614:
---------------------------------------

Russell,

In Elephant-bird, there is a key elephantbird.mapred.input.bad.record.threshold. For whatever reason I felt like doing this, so find attached a patch that adds the functionality you want (note that it includes PIG-2551, which is more or less good to go, only because that patch brings in a Counter helper).

The default functionality does not change: on an error, it will die. However, there are now two keys that can be set:

pig.piggybank.storage.avro.bad.record.threshold
pig.piggybank.storage.avro.bad.record.min

The former sets the acceptable ratio of bad records (the threshold). The latter sets the minimum number of errors before it can error out.

Here is where you come in. Currently, the only error I log is on "reader.next()." Are there any other cases where errors (at least, errors indicating a bad row) can be thrown? And on an error, what do you want to happen? Skip the row, or return null? It seems to make sense to me to skip the record (also, the number of records processed and the number of errors thrown are logged in a Hadoop counter now).

Secondly, someone needs to make tests. It currently passes the tests, but that's because the default threshold and min are 0. I don't know what is and isn't a bad Avro file, though, so yeah. Hopefully the fact that I did the work implementing will motivate someone to add tests ;)

> AvroStorage crashes on LOADING a single bad error
> -------------------------------------------------
>
>                 Key: PIG-2614
>                 URL: https://issues.apache.org/jira/browse/PIG-2614
>             Project: Pig
>          Issue Type: Bug
>          Components: piggybank
>    Affects Versions: 0.10, 0.11
>            Reporter: Russell Jurney
>            Priority: Blocker
>              Labels: avro, avrostorage, bad, book, cutting, doug, for, my, pig, sadism
>             Fix For: 0.10, 0.11
>
>         Attachments: PIG-2614_0.patch
>
>
> AvroStorage dies when a single bad record exists, such as one with missing
> fields.
> This is very bad on 'big data,' where bad records are inevitable.
> See discussion at
> http://www.quora.com/Big-Data/In-Big-Data-ETL-how-many-records-are-an-acceptable-loss
> for more theory.
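
For reference, the threshold/min behavior described in the comment can be sketched roughly as follows. This is only an illustration, not the code in PIG-2614_0.patch. The class BadRecordTracker and its methods are hypothetical. It assumes the ratio is errors divided by records seen, that a failed read counts as a record seen, and that the job fails only once the error count has reached the minimum and the ratio exceeds the threshold. Only the two property keys are taken from the comment.

```java
import java.util.Properties;

// Hypothetical helper for illustration; not taken from the attached patch.
final class BadRecordTracker {
    // Key names as given in the comment on PIG-2614.
    static final String THRESHOLD_KEY = "pig.piggybank.storage.avro.bad.record.threshold";
    static final String MIN_KEY = "pig.piggybank.storage.avro.bad.record.min";

    private final double threshold; // acceptable ratio of bad records to records seen
    private final long minErrors;   // errors tolerated before the ratio is checked
    private long records;
    private long errors;

    BadRecordTracker(double threshold, long minErrors) {
        this.threshold = threshold;
        this.minErrors = minErrors;
    }

    // Both settings default to 0, so an unconfigured job fails on the first error.
    static BadRecordTracker fromProperties(Properties props) {
        double threshold = Double.parseDouble(props.getProperty(THRESHOLD_KEY, "0"));
        long minErrors = Long.parseLong(props.getProperty(MIN_KEY, "0"));
        return new BadRecordTracker(threshold, minErrors);
    }

    // Call once per record that was read successfully.
    void recordSuccess() {
        records++;
    }

    // Call once per record whose read threw (assumption: it still counts as a record seen).
    void recordError() {
        records++;
        errors++;
    }

    // True once the error count has reached the minimum and the error ratio exceeds the threshold.
    boolean shouldFail() {
        if (errors == 0 || errors < minErrors) {
            return false;
        }
        return (double) errors / records > threshold;
    }

    long getRecords() {
        return records;
    }

    long getErrors() {
        return errors;
    }
}
```

A loader built along these lines would call recordError() when reading a record throws, skip that record, and rethrow only when shouldFail() returns true. The two getters could feed the Hadoop counters mentioned in the comment.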