That's a nice approach. It fits my Unix-solved-everything-but-needs-syntactic-sugar world-view :-) (e.g. if we had a 1| and 2| syntax, this would be:

0<./FOO "1| ./bar > A" "2| ./MyHandler > B" :-)

- milind

On Jan 18, 2011, at 10:27 AM, Julien Le Dem wrote:

> That would be nice.
> Also letting the error handler output the result to a relation would be
> useful. (This would let the script output application error metrics.)
> For example, it could (optionally) use the keyword INTO, just like the
> SPLIT operator:
>
> FOO = LOAD ...;
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT MyHandler INTO A_ERRORS;
>
> ErrorHandler would look a little more like EvalFunc:
>
> public interface ErrorHandler<T> {
>
>     public T handle(IOException ioe, EvalFunc evalFunc, Tuple input)
>         throws IOException;
>
>     public Schema outputSchema(Schema input);
>
> }
>
> There could be a built-in handler that outputs the skipped record
> (input: tuple, funcname: chararray, errorMessage: chararray):
>
> A = FOREACH FOO GENERATE Bar(*) ON_ERROR SPLIT INTO A_ERRORS;
>
> Julien
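For concreteness, the built-in skipped-record handler Julien describes might look like the following minimal sketch, written against his proposed ErrorHandler<T> interface. Neither that interface nor the ON_ERROR clause exists in Pig; EvalFunc, Tuple, TupleFactory, Schema, and DataType are real Pig classes, but the class name and everything else here are assumptions.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.DataType;
    import org.apache.pig.data.Tuple;
    import org.apache.pig.data.TupleFactory;
    import org.apache.pig.impl.logicalLayer.schema.Schema;

    // Sketch only: implements the ErrorHandler<T> interface proposed above.
    public class SkippedRecordHandler implements ErrorHandler<Tuple> {

        private static final TupleFactory tupleFactory = TupleFactory.getInstance();

        // Emit (input, funcname, errorMessage) instead of failing the job;
        // under the proposal, the result would land in the relation named
        // by INTO (A_ERRORS above).
        public Tuple handle(IOException ioe, EvalFunc evalFunc, Tuple input)
                throws IOException {
            Tuple out = tupleFactory.newTuple(3);
            out.set(0, input);
            out.set(1, evalFunc.getClass().getName());
            out.set(2, ioe.getMessage());
            return out;
        }

        public Schema outputSchema(Schema input) {
            // Inner schema of the wrapped input tuple omitted for brevity.
            Schema schema = new Schema();
            schema.add(new Schema.FieldSchema("input", DataType.TUPLE));
            schema.add(new Schema.FieldSchema("funcname", DataType.CHARARRAY));
            schema.add(new Schema.FieldSchema("errorMessage", DataType.CHARARRAY));
            return schema;
        }
    }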
> On 1/16/11 12:22 AM, "Dmitriy Ryaboy" <dvrya...@gmail.com> wrote:
>
> > I was thinking about this..
> > We add an optional ON_ERROR clause to operators, which allows a user
> > to specify error handling. The error handler would be a UDF that would
> > implement an interface along these lines:
> >
> > public interface ErrorHandler {
> >
> >     public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
> >         throws IOException;
> >
> > }
> >
> > I think it makes sense not to make this a static method, so that users
> > could keep required state, and for example have the handler throw its
> > own IOException if it's been invoked too many times.
> >
> > D
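The stateful handler Dmitriy alludes to could be sketched like this, again against his proposed (non-existent) interface; the class name and the threshold value are invented for illustration.

    import java.io.IOException;

    import org.apache.pig.EvalFunc;
    import org.apache.pig.data.Tuple;

    // Sketch only: a handler that keeps state across invocations and gives
    // up once it has been called too many times, as Dmitriy suggests.
    public class ThresholdErrorHandler implements ErrorHandler {

        private static final long MAX_ERRORS = 10000; // hypothetical threshold
        private long errorCount = 0;

        public void handle(IOException ioe, EvalFunc evalFunc, Tuple input)
                throws IOException {
            if (++errorCount > MAX_ERRORS) {
                // Too many failures: stop swallowing errors and fail the job.
                throw new IOException(evalFunc.getClass().getName()
                        + " failed on " + errorCount + " records; last error: "
                        + ioe.getMessage(), ioe);
            }
            // Below the threshold: log the bad record and move on.
            System.err.println("Skipping bad record: " + ioe.getMessage());
        }
    }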
> > On Sat, Jan 15, 2011 at 11:53 PM, Santhosh Srinivasan
> > <s...@yahoo-inc.com> wrote:
>
>> Thanks for the clarification Ashutosh.
>>
>> Implementing this in the user realm is tricky, as Dmitriy states.
>> Sensitivity to error thresholds will require support from the system.
>> We can probably provide a taxonomy of records (good, bad, incomplete,
>> etc.) to let users classify each record. The system can then track
>> counts of each record type to facilitate the computation of thresholds.
>> The last part is to allow users to specify thresholds and appropriate
>> actions (interrupt, exit, continue, etc.). A possible mechanism to
>> realize this is the ErrorHandlingUDF described by Dmitriy.
>>
>> Santhosh
>>
>> -----Original Message-----
>> From: Ashutosh Chauhan [mailto:hashut...@apache.org]
>> Sent: Friday, January 14, 2011 7:35 PM
>> To: u...@pig.apache.org
>> Subject: Re: Exception Handling in Pig Scripts
>>
>> Santhosh,
>>
>> The way you are proposing it, it will kill the Pig script. I think what
>> the user wants is to ignore a few "bad records", process the rest, and
>> get results. The problem here is how to let the user tell Pig the
>> definition of a "bad record", and how to let him specify the threshold
>> percentage of bad records at which Pig should fail the script.
>>
>> Ashutosh
>>
>> On Fri, Jan 14, 2011 at 18:18, Santhosh Srinivasan <s...@yahoo-inc.com>
>> wrote:
>>> Sorry about the late response.
>>>
>>> Hadoop n00b is proposing a language extension for error handling,
>>> similar to the mechanisms in other well known languages like C++,
>>> Java, etc.
>>>
>>> For now, can't the error semantics be handled by the UDF? For
>>> exceptional scenarios you could throw an ExecException with the right
>>> details. The physical operator that handles the execution of UDFs
>>> traps it for you and propagates the error back to the client. You can
>>> take a look at any of the builtin UDFs to see how Pig handles it
>>> internally.
>>>
>>> Santhosh
>>>
>>> -----Original Message-----
>>> From: Dmitriy Ryaboy [mailto:dvrya...@gmail.com]
>>> Sent: Tuesday, January 11, 2011 10:41 AM
>>> To: u...@pig.apache.org
>>> Subject: Re: Exception Handling in Pig Scripts
>>>
>>> Right now error handling is controlled by the UDFs themselves, and
>>> there is no way to direct it externally.
>>> You can make an ErrorHandlingUDF that would take a UDF spec, invoke
>>> it, trap errors, and then apply the specified error handling
>>> behavior.. that's a bit ugly though.
>>>
>>> There is a problem with trapping general exceptions, of course: if
>>> they happen 0.000001% of the time you can probably just ignore them,
>>> but if they happen in half your dataset, you want the job to tell you
>>> something is wrong. So this stuff gets non-trivial. If anyone wants to
>>> propose a design to solve this general problem, I think that would be
>>> a welcome addition.
>>>
>>> D
>>>
>>> On Tue, Jan 11, 2011 at 12:47 AM, hadoop n00b <new2h...@gmail.com>
>>> wrote:
>>>
>>>> Thanks, I sometimes get a date like 0001-01-01. This would be a valid
>>>> date format, but when I try to get the seconds between this and
>>>> another date, say 2011-01-01, I get an error that the value is too
>>>> large to fit into an int, and the process stops. Do we have something
>>>> like ifError(x-y, null, x-y)? Or would I have to implement this as a
>>>> UDF?
>>>>
>>>> Thanks
>>>>
>>>> On Tue, Jan 11, 2011 at 11:40 AM, Dmitriy Ryaboy <dvrya...@gmail.com>
>>>> wrote:
>>>>
>>>>> Create a UDF that verifies the format, and go through a filtering
>>>>> step first (a sketch of such a UDF follows at the end of this
>>>>> thread).
>>>>> If you would like to save the malformed records so you can look
>>>>> at them later, you can use the SPLIT operator to route the good
>>>>> records to your regular workflow, and the bad records some place
>>>>> on HDFS.
>>>>>
>>>>> -D
>>>>>
>>>>> On Mon, Jan 10, 2011 at 9:58 PM, hadoop n00b <new2h...@gmail.com>
>>>>> wrote:
>>>>>
>>>>>> Hello,
>>>>>>
>>>>>> I have a Pig script that uses Piggy Bank to calculate date
>>>>>> differences. Sometimes, when I get a weird date or a wrong format
>>>>>> in the input, the script throws an error and aborts.
>>>>>>
>>>>>> Is there a way I could trap these errors and move on without
>>>>>> stopping the execution?
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> PS: I'm using CDH2 with Pig 0.5

---
Milind Bhandarkar
mbhandar...@linkedin.com
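As referenced up-thread, here is a minimal sketch of the format-verifying filter UDF Dmitriy suggests. FilterFunc and Tuple are Pig's real APIs; the class name, package, date pattern, and the SPLIT usage shown in the comment are assumptions. A sanity range check for well-formed but extreme dates like 0001-01-01 could be added in the same place.

    // Hypothetical usage (relation and field names invented):
    //   DEFINE IsValidDate com.example.IsValidDate();
    //   SPLIT raw INTO good IF IsValidDate(datefield),
    //                  bad IF NOT IsValidDate(datefield);

    import java.io.IOException;
    import java.text.ParseException;
    import java.text.SimpleDateFormat;

    import org.apache.pig.FilterFunc;
    import org.apache.pig.data.Tuple;

    public class IsValidDate extends FilterFunc {

        private final SimpleDateFormat format = new SimpleDateFormat("yyyy-MM-dd");

        public IsValidDate() {
            format.setLenient(false); // reject nonsense like 2011-13-45
        }

        // Returns true only for parseable dates, so malformed records can
        // be filtered out (or SPLIT off to HDFS) before any date arithmetic.
        public Boolean exec(Tuple input) throws IOException {
            if (input == null || input.size() == 0 || input.get(0) == null) {
                return false;
            }
            try {
                format.parse(input.get(0).toString());
                return true;
            } catch (ParseException e) {
                return false;
            }
        }
    }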