I have checked out the Hive source code and made a customized build that
takes advantage of 0.19's skip-bad-records feature.
It might be brute force, but it works for me :-). Just FYI, let me know if you
deem this approach appropriate and I can check it in.
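
In case it helps, this is roughly what the customized build does when it sets
up the JobConf (the SkipBadRecords calls are the standard Hadoop 0.19 API;
where exactly the JobConf gets built in Hive may differ in your tree, and the
paths/values are only examples):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.SkipBadRecords;

public class EnableSkipBadRecords {
  public static void configure(JobConf job) {
    // Start skipping mode after the second failed attempt of a task.
    SkipBadRecords.setAttemptsToStartSkipping(job, 2);
    // Allow up to this many records/groups to be skipped around a bad one.
    SkipBadRecords.setMapperMaxSkipRecords(job, 1L);
    SkipBadRecords.setReducerMaxSkipGroups(job, 1L);
    // Keep the skipped records on HDFS so they can be inspected later.
    SkipBadRecords.setSkipOutputPath(job, new Path("/tmp/hive_skipped_records"));
  }
}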

On Sat, Feb 21, 2009 at 12:01 AM, Joydeep Sen Sarma <[email protected]> wrote:

>  There are certain classes of errors (out-of-memory types) that cannot be
> handled within Hive. For such cases, doing it in Hadoop would make sense.
> The other case is handling errors in user scripts. This is especially
> tricky, and we would need to borrow/use Hadoop techniques for retrying the
> same.
>
>
>
> However, out-of-memory exceptions are rare, and from what we have seen,
> when they do happen it's not possible to fix them by retrying (for example,
> joins end up consuming too much memory). We have a controlled execution
> engine. If the deserializers don't barf on the input (which is also
> possible; sometimes a deserializer will try to allocate a large string and
> die), then the execution engine should not get errors other than regular
> exceptions.
>
>
>
> So most errors are regular exceptions that can be caught, and the errors
> can either be ignored and reported in a job counter, or the job can be
> failed, as requested. We should do this as a first step.
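>
> A rough sketch of the kind of thing I mean, assuming we count dropped rows
> with the standard mapred Reporter (the class and counter names below are
> made up for illustration, not existing Hive code):
>
> import org.apache.hadoop.mapred.Reporter;
>
> // Hypothetical wrapper: catch per-row exceptions, report them in a job
> // counter, and only fail the task once a configured limit is exceeded.
> public class ErrorTolerantOperator {
>   private long badRows = 0;
>   private final long maxBadRows;   // 0 = fail on the first error (today's behavior)
>   private final Reporter reporter;
>
>   public ErrorTolerantOperator(long maxBadRows, Reporter reporter) {
>     this.maxBadRows = maxBadRows;
>     this.reporter = reporter;
>   }
>
>   public void process(Object row) {
>     try {
>       forward(row);                // normal per-row operator processing
>     } catch (RuntimeException e) {
>       badRows++;
>       reporter.incrCounter("Hive", "BAD_RECORDS", 1);
>       if (badRows > maxBadRows) {
>         throw e;                   // fail the job as requested
>       }
>     }
>   }
>
>   private void forward(Object row) {
>     // real per-row work would go here
>   }
> }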
>
>
>  ------------------------------
>
> *From:* Qing Yan [mailto:[email protected]]
> *Sent:* Thursday, February 19, 2009 7:34 PM
> *To:* [email protected]
> *Subject:* Re: Error input handling in Hive
>
>
>
> Hi Zheng,
>
>
>
> I have opened a JIRA (HIVE-295).
>
>
>
> IMHO there are three ways errors can be handled:
>
>
>
> 1) Always fail. One bad record and the whole job fails, which is the
> current Hive behavior.
>
>
>
> 2) Always succeed. Bad records are ignored (and saved somewhere to allow
> further analysis) and the job still succeeds.
>
>
>
> 3) Succeed with conditions. Something in the middle ground, as you described.
>
>
>
> What can be done is to make this configurable and let the user decide which
> setting is appropriate for his application.
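>
> For example, something along these lines (the property, class, and enum
> names are just placeholders I made up, not existing Hive settings):
>
> import org.apache.hadoop.mapred.JobConf;
>
> public class BadRecordPolicy {
>   public enum Mode { ALWAYS_FAIL, ALWAYS_SUCCEED, FAIL_OVER_THRESHOLD }
>
>   // Read the user's choice from the job configuration; default to the
>   // current Hive behavior of failing on the first bad record.
>   public static Mode fromConf(JobConf conf) {
>     return Mode.valueOf(conf.get("hive.exec.bad.record.policy", "ALWAYS_FAIL"));
>   }
> }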
>
>
>
> In practice I would imagine 2) will be the most common case (e.g. a 0.1% error rate).
>
>
> BTW, just curious: since you guys already use Hive in production, how do
> you guarantee the input is 100% clean, given that Hive doesn't do any
> checking by itself?
>
>
>
> One thing I wasn't sure about is whether the error-handling logic belongs
> in the Hive layer or in the Hadoop layer.
>
>
>
> Hadoop 0.19 already supports 2):
> http://hadoop.apache.org/core/docs/r0.19.0/mapred_tutorial.html#Skipping+Bad+Records
>
> and may support 3) in the future.
>
>
>
> So the black-box way is for Hive to just expose those API calls, or, as a
> more general approach, to let the user add "aspects" to the JobConf
> object. Is this allowed in the Hive design?
>
>
>
>
>
> Regards,
>
>
> Qing
>
>
> On Thu, Feb 19, 2009 at 5:59 PM, Zheng Shao <[email protected]> wrote:
>
> Hi Qing,
>
> That's a good idea. Can you open a jira?
> There are lots of details to work out before we can add that feature to
> Hive. For example, how to specify the largest amount of data corruption
> that can be accepted, by absolute number or by percentage, etc. And what
> about half-corrupted records, in case we only need the non-corrupted part
> in the query?
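>
> Just to make the question concrete, the check might look something like
> this (the class, method, and parameter names are invented for illustration):
>
> // Hypothetical helper: decide whether the job can still succeed given
> // either an absolute cap or a percentage cap on corrupted records.
> public class CorruptionTolerance {
>   public static boolean withinTolerance(long badRecords, long totalRecords,
>                                         long maxBadAbsolute, double maxBadPercent) {
>     if (maxBadAbsolute >= 0 && badRecords > maxBadAbsolute) {
>       return false;
>     }
>     if (maxBadPercent >= 0 && totalRecords > 0
>         && 100.0 * badRecords / totalRecords > maxBadPercent) {
>       return false;
>     }
>     return true;
>   }
> }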
>
>
> Zheng
>
>
>
>
> On 2/19/09, Qing Yan <[email protected]> wrote:
> > Say I have some bad/ill-formatted records in the input. Is there a way to
> > configure the default Hive parser to discard those records directly (e.g.
> > when an integer column gets a string)?
> >
> > Besides, is the new skip-bad-records feature in 0.19 accessible from Hive?
> > It is quite a handy feature in the real world.
> >
> > What I see so far is that the Hive parser throws an exception and
> > ultimately causes the whole job to fail.
> >
> > Thanks for the help!
> >
> > Qing
> >
>
> --
> Sent from Gmail for mobile | mobile.google.com
>
> Yours,
> Zheng
>
>
>
