Yep, it's the finish() method. See the javadocs:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/EvalFunc.html
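
For illustration, here is a minimal sketch of the pattern discussed in this thread: open the handle lazily on the first exec() call and release it in finish(). So that it runs without Pig on the classpath, MiniEvalFunc is a hypothetical stand-in that copies only the exec/finish shape of EvalFunc, and FakeConnection is a hypothetical placeholder for a real database handle. In an actual UDF you would extend org.apache.pig.EvalFunc and exec() would take a Tuple.

```java
// Hypothetical stand-in for the exec/finish shape of Pig's EvalFunc.
// This is NOT Pig's class; it only lets the sketch compile on its own.
abstract class MiniEvalFunc<I, O> {
    abstract O exec(I input) throws Exception;

    // Cleanup hook; the default does nothing, subclasses may override.
    void finish() {}
}

// Hypothetical placeholder for a database connection.
class FakeConnection {
    static int opened = 0;          // counts how many connections were created
    private boolean closed = false;

    FakeConnection() { opened++; }

    String lookup(String key) {
        if (closed) throw new IllegalStateException("connection closed");
        return key.toUpperCase();   // pretend this is a query result
    }

    void close() { closed = true; }
}

class LookupUdf extends MiniEvalFunc<String, String> {
    private FakeConnection conn;    // stays null until the first exec() call

    @Override
    String exec(String input) {
        if (input == null) return null;
        if (conn == null) {
            conn = new FakeConnection();   // one-time initialization
        }
        return conn.lookup(input);
    }

    @Override
    void finish() {
        // Release the handle once; safe to call more than once.
        if (conn != null) {
            conn.close();
            conn = null;
        }
    }
}

class UdfLifecycleDemo {
    public static void main(String[] args) {
        LookupUdf udf = new LookupUdf();
        for (String s : new String[] {"a", "b", "c"}) {
            System.out.println(udf.exec(s));
        }
        udf.finish();
        System.out.println("connections opened: " + FakeConnection.opened);
    }
}
```

The null check in exec() keeps the connection cost to one per UDF instance rather than one per tuple. A real version would also need to handle dropped connections, which this sketch leaves out.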

On Thu, Jul 1, 2010 at 1:55 PM, Dave Viner <[email protected]> wrote:

> Hi Dmitriy,
>
> Thanks! This is very helpful!
>
> Is there a method that gets called when the UDF object is being destroyed?
> Something that allows for cleanup?
>
> Thanks again.
> Dave Viner
>
> On Thu, Jul 1, 2010 at 1:16 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Yes, I mean exec().
> >
> > The constructor will be called "at least 1 time". It will not be called
> > once per tuple -- the UDF object is created when the data starts flowing,
> > and is destroyed when it stops. So you can put things into the constructor.
> >
> > By default, a no-argument constructor gets invoked. You can make Pig use a
> > constructor that takes string arguments (strings only!) by "defining" a
> > function, like so:
> >
> > DEFINE MyFunction com.my.company.MyFunction('foo', 'bar');
> >
> > [...]
> >
> > foobar = FOREACH some_relation GENERATE MyFunction(some_field);
> >
> > This will cause the relation foobar to get populated by the results of
> > calling MyFunction.exec on some_field of every tuple in some_relation, with
> > MyFunction having been instantiated using the arguments 'foo' and 'bar'.
> > The instantiation will happen a few times on the client side (your
> > machine), while Pig tries to compile the program and send it to Hadoop, and
> > one or more times per task in Hadoop (in practice, you can pretend it's
> > just once per task).
> >
> > -Dmitriy
> >
> > On Thu, Jul 1, 2010 at 12:53 PM, Dave Viner <[email protected]> wrote:
> >
> > > @Dmitriy, you mentioned an eval() method... is that part of the UDF? Or
> > > do you mean exec()?
> > >
> > > I think my confusion may be that I'm not clear on the actual steps taken
> > > when a UDF is invoked. Clearly, the key step is to invoke the
> > > exec(Tuple input) method. But it would appear that an object is
> > > instantiated first. Are there any parameters passed to the constructor?
> > > Or is there any way to influence those parameters?
> > >
> > > Also, how many objects would be constructed? Is it one for each
> > > invocation of the UDF? Or one for each process managing the map/reduce?
> > >
> > > @Ashutosh, this is a neat patch. Reading/writing to a DB would be super
> > > helpful from within Pig. But I don't have enough Pig experience to know
> > > how to translate a StoreFunc into an EvalFunc. In your code, the
> > > constructor sets up the variables and then prepareToWrite actually
> > > handles the connection to the database. Is there some similar call in an
> > > EvalFunc which is like a "prepareToExec"?
> > >
> > > Thanks
> > > Dave Viner
> > >
> > > On Thu, Jul 1, 2010 at 11:03 AM, Ashutosh Chauhan <[email protected]> wrote:
> > >
> > > > That will be a day of rejoicing when a multi-million Oracle deployment
> > > > comes to a grinding halt by a tiny-weeny 4-line Pig script. *wink* ;)
> > > >
> > > > Ashutosh
> > > >
> > > > On Thu, Jul 1, 2010 at 10:52, Dmitriy Ryaboy <[email protected]> wrote:
> > > >
> > > > > Can you put a LOG.info and javadoc into this patch saying "watch out,
> > > > > DB connection bomb being deployed"? :)
> > > > >
> > > > > On Thu, Jul 1, 2010 at 10:48 AM, Ashutosh Chauhan <[email protected]> wrote:
> > > > >
> > > > > > There is an uncommitted Piggybank UDF which may help you.
> > > > > > https://issues.apache.org/jira/browse/PIG-1229 You can try the
> > > > > > first patch (pig-1229.2.patch by Ankur) listed on the page. It does
> > > > > > a different thing: writing rows from Pig into the DB. But you can
> > > > > > borrow the DB connection part from it.
> > > > > >
> > > > > > Note to self: I really want to get this patch committed before more
> > > > > > people reinvent the wheel of making Pig talk to a DB.
> > > > > >
> > > > > > On Thu, Jul 1, 2010 at 09:48, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > >
> > > > > > > Also -- I hope your cluster is not too big. It's really easy to
> > > > > > > DDOS your database using Hadoop.
> > > > > > >
> > > > > > > On Thu, Jul 1, 2010 at 9:47 AM, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > > >
> > > > > > > > The simplest thing you can do is to have a database handle at
> > > > > > > > the object level, set it to null, and just initialize it in
> > > > > > > > eval() if you see that it's null.
> > > > > > > > You can also init the connection in the constructor.
> > > > > > > > A static dbh will let you share it across tasks, if you persist
> > > > > > > > the JVM.
> > > > > > > > Naturally you will want to throw in some code to handle dropped
> > > > > > > > connections and all that.
> > > > > > > >
> > > > > > > > On Thu, Jul 1, 2010 at 9:01 AM, Dave Viner <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > In a custom UDF, what's the most appropriate way to
> > > > > > > > > initialize and connect to an old-fashioned RDBMS?
> > > > > > > > >
> > > > > > > > > I wrote a simple UDF which opens/closes a connection on each
> > > > > > > > > exec(), but this feels a bit like overkill. Is there an
> > > > > > > > > "init()" method that is invoked in a UDF to help with
> > > > > > > > > one-time initialization (like a database connection or SQL
> > > > > > > > > query preparation)?
> > > > > > > > >
> > > > > > > > > Thanks
> > > > > > > > > Dave Viner
