I have checked out the Hive source code and made a customized build that takes advantage of 0.19's skip-bad-records feature. It might be brute force, but it works for me :-). Just FYI, let me know if you deem this approach appropriate and I can check it in.
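To make the idea concrete, here is a minimal sketch of the kind of wiring my build does when setting up the job. Only the SkipBadRecords calls are real Hadoop 0.19 API; the wrapper class, method name, and the skip output path are made up for illustration and are not actual Hive code.

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class SkipBadRecordsExample {

  // Sketch: configure Hadoop 0.19's skipping mode on the JobConf before
  // the Hive job is submitted (class/method names here are illustrative).
  public static void enableSkipping(JobConf job) {
    // Start skipping only after 2 failed attempts of the same task.
    SkipBadRecords.setAttemptsToStartSkipping(job, 2);

    // Tolerate a narrow skip range: at most 1 record / 1 group per bad
    // position found by the framework's binary search over the split.
    SkipBadRecords.setMapperMaxSkipRecords(job, 1);
    SkipBadRecords.setReducerMaxSkipGroups(job, 1);

    // Keep the skipped records around for later analysis.
    SkipBadRecords.setSkipOutputPath(job, new Path("/tmp/hive_skipped_records"));
  }
}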
On Sat, Feb 21, 2009 at 12:01 AM, Joydeep Sen Sarma <[email protected]> wrote:

> There are certain classes of errors (out-of-memory types) that cannot be
> handled within Hive. For such cases, doing it in Hadoop would make sense.
> The other case is handling errors in user scripts. This is especially
> tricky, and we would need to borrow/use Hadoop techniques for retrying
> these.
>
> However, out-of-memory exceptions are rare, and from what we have seen,
> when they do happen it's not possible to fix them by retrying (for example,
> joins end up consuming too much memory). We have a controlled execution
> engine. If the deserializers don't barf on the input (which is also
> possible; sometimes a deserializer will try to allocate a large string and
> die), then the execution engine should not get errors other than regular
> exceptions.
>
> So most errors are regular exceptions that can be caught, and the errors
> ignored and reported in the job counter, or the job failed, as requested.
> We should do this as a first step.
>
> ------------------------------
>
> *From:* Qing Yan [mailto:[email protected]]
> *Sent:* Thursday, February 19, 2009 7:34 PM
> *To:* [email protected]
> *Subject:* Re: Error input handling in Hive
>
> Hi Zheng,
>
> I have opened a Jira (HIVE-295).
>
> IMHO there are three ways errors can be handled:
>
> 1) Always fail. One bad record and the whole job fails, which is the
> current Hive behavior.
>
> 2) Always succeed. Ignore bad records (saving them somewhere to allow
> further analysis) and the job still succeeds.
>
> 3) Succeed with a condition. Something in the middle ground, as you
> described.
>
> What can be done is to make this configurable and let the user decide
> which setting is appropriate for his application.
>
> In practice I would imagine 2) will be the most common case (e.g. a 0.1%
> error rate).
>
> BTW, just curious: since you guys already use Hive in production, how do
> you guarantee the input is 100% clean, given that Hive doesn't do any
> checking by itself?
>
> One thing I wasn't sure about is whether the error-handling logic belongs
> in the Hive layer or the Hadoop layer.
>
> Hadoop 0.19 already supports 2):
> http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Skipping+Bad+Records
> and may support 3) in the future.
>
> So the black-box way is for Hive to just expose those API calls, or, as a
> more general approach, to allow the user to add "aspects" to the JobConf
> object. Is this allowed in the Hive design?
>
> Regards,
>
> Qing
>
> On Thu, Feb 19, 2009 at 5:59 PM, Zheng Shao <[email protected]> wrote:
>
> Hi Qing,
>
> That's a good idea. Can you open a jira?
> There are lots of details before we can add that feature to Hive. For
> example, how to specify the largest amount of data corruption that can be
> accepted, by absolute number or percentage, etc. What about half-corrupted
> records in case we only need the non-corrupted part in the query, etc.
>
> Zheng
>
> On 2/19/09, Qing Yan <[email protected]> wrote:
> > Say I have some bad/ill-formatted records in the input. Is there a way
> > to configure the default Hive parser to discard those records directly
> > (e.g. when an integer column gets a string)?
> >
> > Besides, is the new skip-bad-records feature in 0.19 accessible in Hive?
> > It is a quite handy feature in the real world.
> >
> > What I see so far is that the Hive parser throws an exception and
> > ultimately causes the whole job to fail.
> >
> > Thanks for the help!
> >
> > Qing
>
> --
> Sent from Gmail for mobile | mobile.google.com
>
> Yours,
> Zheng
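For what it's worth, below is a rough sketch of how the counter-based handling discussed above (catch regular per-row exceptions, report them in a job counter, fail only past a configurable threshold) could look from the row-processing loop. The property name `hive.exec.max.bad.records`, the counter group, and the class itself are hypothetical, not existing Hive API; the threshold could just as easily be expressed as a percentage, per Zheng's question.

import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.Reporter;

// Sketch of option 3) from the thread: tolerate bad rows up to a
// configurable limit, surfacing them through a job counter.
public class BadRecordPolicy {

  public enum Counter { BAD_RECORDS }

  private final long maxBadRecords; // absolute threshold; 0 = fail on first
  private long badRecords = 0;

  public BadRecordPolicy(JobConf job) {
    // Hypothetical property name, not an existing Hive setting.
    this.maxBadRecords = job.getLong("hive.exec.max.bad.records", 0);
  }

  // Called when deserialization or an operator throws a "regular"
  // (non-fatal) exception while processing a single row.
  public void onBadRecord(Throwable t, Reporter reporter) {
    badRecords++;
    reporter.incrCounter(Counter.BAD_RECORDS, 1);
    if (badRecords > maxBadRecords) {
      // Too many failures: fall back to option 1) and fail the task.
      throw new RuntimeException("Exceeded " + maxBadRecords
          + " bad records; last error: " + t.getMessage(), t);
    }
    // Otherwise drop the record and keep going, as in option 2).
  }
}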
