That's a nice approach. It fits my 
Unix-solved-everything-but-needs-syntactic-sugar world-view :-) (e.g. if we had 
a 1| and 2| syntax, this would be:

0<./FOO "1| ./bar > A" "2| ./MyHandler > B" 

:-)

- milind

On Jan 18, 2011, at 10:27 AM, Julien Le Dem wrote:

> That would be nice.
> Also letting the error handler output the result to a relation would be 
> useful.
> (To let the script output application error metrics)
> For example it could (optionally) use the keyword INTO just like the SPLIT 
> operator.
> 
> FOO = LOAD ...;
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
> 
> ErrorHandler would look a little more like EvalFunc:
> 
> public interface ErrorHandler<T> {
> 
>  public T handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
> 
> public Schema outputSchema(Schema input);
> 
> }
> 
> There could be a built-in handler to output the skipped record (input: tuple, 
> funcname:chararray, errorMessage:chararray)
> 
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
> 
> Julien
> 
> On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
> 
> I was thinking about this...
> 
> We add an optional ON_ERROR clause to operators, which allows a user to
> specify error handling. The error handler would be a udf that would
> implement an interface along these lines:
> 
> public interface ErrorHandler {
> 
>  public void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws
> IOException;
> 
> }
> 
> I think it makes sense not to make this a static method, so that users can
> keep required state and, for example, have the handler throw its own
> IOException if it's been invoked too many times.
> 
> D
> 
> 
> On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan 
> <s...@yahoo-inc.com>wrote:
> 
>> Thanks for the clarification Ashutosh.
>> 
>> Implementing this in the user realm is tricky as Dmitriy states.
>> Sensitivity to error thresholds will require support from the system. We can
>> probably provide a taxonomy of records (good, bad, incomplete, etc.) to let
>> users classify each record. The system can then track counts of each record
>> type to facilitate the computation of thresholds. The last part is to allow
>> users to specify thresholds and appropriate actions (interrupt, exit,
>> continue, etc.). A possible mechanism to realize this is the
>> ErrorHandlingUDF described by Dmitriy.
>> 
>> Santhosh
>> 
>> -----Original Message-----
>> From: Ashutosh Chauhan [mailto:hashut...@apache.org]
>> Sent: Friday, January 14, 2011 7:35 PM
>> To: u...@pig.apache.org
>> Subject: Re: Exception Handling in Pig Scripts
>> 
>> Santhosh,
>> 
>> The way you are proposing, it will kill the Pig script. I think what the
>> user wants is to ignore a few "bad records", process the rest, and get
>> results. The problem is how to let the user tell Pig the definition of a
>> "bad record", and how to let them specify the threshold of bad records at
>> which Pig should fail the script.
>> 
>> Ashutosh
>> 
>> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com>
>> wrote:
>>> Sorry about the late response.
>>> 
>>> Hadoop n00b is proposing a language extension for error handling, similar
>> to the mechanisms in other well-known languages like C++, Java, etc.
>>> 
>>> For now, can't the error semantics be handled by the UDF? For exceptional
>> scenarios you could throw an ExecException with the right details. The
>> physical operator that handles the execution of UDFs traps it for you and
>> propagates the error back to the client. You can take a look at any of the
>> builtin UDFs to see how Pig handles it internally.
>>> 
>>> Santhosh
>>> 
>>> -----Original Message-----
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, January 11, 2011 10:41 AM
>>> To: u...@pig.apache.org
>>> Subject: Re: Exception Handling in Pig Scripts
>>> 
>>> Right now error handling is controlled by the UDFs themselves, and there
>> is no way to direct it externally.
>>> You can make an ErrorHandlingUDF that would take a UDF spec, invoke it,
>> trap errors, and then do the specified error-handling behavior; that's a
>> bit ugly, though.
>>> 
>>> There is a problem with trapping general exceptions of course, in that if
>> they happen 0.000001% of the time you can probably just ignore them, but if
>> they happen in half your dataset, you want the job to tell you something is
>> wrong. So this stuff gets non-trivial. If anyone wants to propose a design
>> to solve this general problem, I think that would be a welcome addition.
>>> 
>>> D
>>> 
>>> On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com>
>> wrote:
>>> 
>>>> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
>>>> date format, but when I try to get the seconds between this and
>>>> another date, say 2011-01-01, I get an error that the value is too
>>>> large to be fit into int and the process stops. Do we have something
>>>> like ifError(x-y, null, x-y)? Or would I have to implement this as a
>>>> UDF?
>>>> 
>>>> Thanks
>>>> 
>>>> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com>
>>>> wrote:
>>>> 
>>>>> Create a UDF that verifies the format, and go through a filtering
>>>>> step first.
>>>>> If you would like to save the malformed records so you can look
>>>>> at them later, you can use the SPLIT operator to route the good
>>>>> records to your regular workflow, and the bad records some place on
>> HDFS.
>>>>> 
>>>>> -D
>>>>> 
>>>>> On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com>
>> wrote:
>>>>> 
>>>>>> Hello,
>>>>>> 
>>>>>> I have a Pig script that uses PiggyBank to calculate date
>> differences.
>>>>>> Sometimes, when I get a weird date or a wrong format in the input, the
>>>>>> script throws an error and aborts.
>>>>>> 
>>>>>> Is there a way I could trap these errors and move on without
>>>>>> stopping
>>>> the
>>>>>> execution?
>>>>>> 
>>>>>> Thanks
>>>>>> 
>>>>>> PS: I'm using CDH2 with Pig 0.5
>>>>>> 
>>>>> 
>>>> 
>>> 
>> 
> 
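To make the proposal above concrete: here is a minimal, self-contained sketch of the kind of stateful handler Dmitriy describes, one that skips records until an error threshold is hit and then fails the job. The `Tuple` and `EvalFunc` stubs below stand in for the real Pig types purely so the sketch compiles on its own; the interface is copied from the thread and `CountingErrorHandler` and its limit are illustrative, not an actual Pig API.

```java
import java.io.IOException;

// Minimal stand-ins for Pig's Tuple and EvalFunc, only so this sketch is
// self-contained; the real types live in org.apache.pig.
class Tuple {}
class EvalFunc {}

// The interface as proposed in this thread.
interface ErrorHandler {
    void handle(IOException ioe, EvalFunc evalFunc, Tuple input) throws IOException;
}

// A stateful handler: tolerates up to maxErrors failures (each one silently
// skips the offending record), then rethrows to fail the job.
class CountingErrorHandler implements ErrorHandler {
    private final int maxErrors;
    private int errorCount = 0;

    CountingErrorHandler(int maxErrors) {
        this.maxErrors = maxErrors;
    }

    @Override
    public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
            throws IOException {
        errorCount++;
        if (errorCount > maxErrors) {
            // Past the threshold: propagate, which aborts the script.
            throw new IOException("Giving up after " + errorCount + " errors", ioe);
        }
        // Below the threshold: swallow the error so the record is skipped
        // and processing continues.
    }
}
```

Because the handler is an instance rather than a static method, each UDF invocation site can carry its own running error count, which is exactly the state Dmitriy's design calls for.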
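Until something like ON_ERROR exists, the workaround Dmitriy suggests further down the thread (validate first, then route bad rows aside with SPLIT) can be sketched in plain Pig Latin. `IsValidDate` is a hypothetical validation UDF and the paths are illustrative:

```pig
-- Hypothetical IsValidDate UDF; load paths and field names are illustrative.
raw = LOAD 'dates' AS (d1:chararray, d2:chararray);
SPLIT raw INTO good IF IsValidDate(d1) AND IsValidDate(d2),
               bad  IF NOT (IsValidDate(d1) AND IsValidDate(d2));
STORE bad INTO 'bad_records';   -- keep malformed rows on HDFS for inspection
-- the regular workflow continues from 'good'
```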

---
Milind Bhandarkar
mbhandar...@linkedin.com


