[
https://issues.apache.org/jira/browse/HCATALOG-487?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Travis Crawford updated HCATALOG-487:
-------------------------------------
Attachment: HCATALOG-487_skip_bad_records.2.patch
Updating the patch as I had to make two small changes when committing this:
* TestHCatInputFormat did not have the Apache file header, and the new
checkstyle rules caught it! I added the header.
* Adding checkstyle pulled in google-collections, which conflicts with Guava.
I excluded google-collections.
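For readers hitting the same conflict: the exclusion described above could look like the following, assuming a Maven-style build (the coordinates and build tool are illustrative; the actual HCatalog build may declare this differently):

```xml
<!-- Illustrative sketch: pull in checkstyle but keep google-collections
     off the classpath so it cannot shadow Guava's classes. -->
<dependency>
  <groupId>com.puppycrawl.tools</groupId>
  <artifactId>checkstyle</artifactId>
  <version>5.5</version>
  <exclusions>
    <exclusion>
      <groupId>com.google.collections</groupId>
      <artifactId>google-collections</artifactId>
    </exclusion>
  </exclusions>
</dependency>
```

google-collections was Guava's predecessor, and both jars define classes in `com.google.common.*`, so having both on the classpath causes unpredictable class resolution.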
> HCatalog should tolerate a user-defined amount of bad records
> -------------------------------------------------------------
>
> Key: HCATALOG-487
> URL: https://issues.apache.org/jira/browse/HCATALOG-487
> Project: HCatalog
> Issue Type: Improvement
> Reporter: Travis Crawford
> Assignee: Travis Crawford
> Fix For: 0.5
>
> Attachments: HCATALOG-487_skip_bad_records.1.patch,
> HCATALOG-487_skip_bad_records.2.patch
>
>
> HCatalog tasks currently fail when deserializing corrupt records. In some
> cases, large data sets contain a small number of corrupt records, and it's
> okay to skip them. In fact, Hadoop supports skipping bad records for exactly
> this reason.
> However, the Hadoop-native record-skipping feature (which Hive uses) is
> very coarse: it produces a large number of failed tasks and task-scheduling
> overhead, and offers limited control over the skipping behavior.
> HCatalog should have native support for skipping a user-defined number of
> bad records.
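The proposed behavior can be sketched as a simple error-tolerance counter wrapped around record deserialization. This is an illustrative sketch only; the class and method names below are hypothetical, not HCatalog's actual API:

```java
// Illustrative sketch of per-task bad-record tolerance (hypothetical names,
// not HCatalog's actual API): absorb deserialization failures up to a
// user-defined threshold, then fail the task.
public class BadRecordTolerance {
    private final long maxBadRecords; // user-defined tolerance
    private long badRecords = 0;

    public BadRecordTolerance(long maxBadRecords) {
        this.maxBadRecords = maxBadRecords;
    }

    /** Call when a record fails to deserialize. Returns normally if the
     *  error was absorbed; throws once the budget is exceeded. */
    public void tolerate(Exception cause) {
        badRecords++;
        if (badRecords > maxBadRecords) {
            throw new RuntimeException(
                "Exceeded " + maxBadRecords + " bad records", cause);
        }
        // Otherwise: skip this record and let the reader continue.
    }

    /** Number of bad records skipped so far. */
    public long skipped() {
        return badRecords;
    }
}
```

Unlike Hadoop's native skip mode, which reruns failed tasks to bisect around bad records, a counter like this skips inline within a single task attempt, avoiding the task failures and rescheduling overhead described above.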
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira