Hi Dmitriy, Thanks! This is very helpful!
Is there a method that gets called when the UDF object is being destroyed?
Something that allows for cleanup?

Thanks again.
Dave Viner

On Thu, Jul 1, 2010 at 1:16 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Yes, I mean exec().
>
> The constructor will be called "at least 1 time". It will not be called once
> per tuple -- the UDF object is created when the data starts flowing, and is
> destroyed when it stops. So you can put things into the constructor.
>
> By default, a no-argument constructor gets invoked. You can make Pig use a
> constructor that takes string arguments (strings only!) by "defining" a
> function, like so:
>
> DEFINE MyFunction com.my.company.MyFunction('foo', 'bar');
>
> [...]
>
> foobar = FOREACH some_relation GENERATE MyFunction(some_field);
>
> This will cause the relation foobar to get populated by the results of
> calling MyFunction.exec on some_field of every tuple in some_relation, with
> MyFunction having been instantiated using the arguments 'foo' and 'bar'.
> The instantiation will happen a few times on the client-side (your machine),
> while Pig tries to compile the program and send it to Hadoop, and one or
> more times per task in Hadoop (in practice, you can pretend it's just once
> per task).
>
> -Dmitriy
>
> On Thu, Jul 1, 2010 at 12:53 PM, Dave Viner <[email protected]> wrote:
>
> > @Dmitriy, you mentioned an eval() method... is that part of the UDF? Or do
> > you mean exec() ?
> >
> > I think my confusion may be that I'm not clear on the actual steps taken
> > when a UDF is invoked. Clearly, the key step is to invoke the exec(Tuple
> > input) method. But, it would appear that an object is instantiated first.
> > Are there any parameters passed to the constructor? Or is there any way to
> > influence those parameters?
> >
> > Also, how many objects would be constructed? Is it one for each invocation
> > of the UDF? Or one for each process managing the map/reduce?
> >
> > @Ashutosh, this is a neat patch.
> > Reading/writing to a DB would be super helpful from within Pig. But, I
> > don't have enough Pig experience to know how to translate a StoreFunc into
> > an EvalFunc. In your code, the constructor sets up the variables and then
> > the prepareToWrite actually handles the connection to the database. Is
> > there some similar call in an EvalFunc which is like a "prepareToExec"?
> >
> > Thanks
> > Dave Viner
> >
> > On Thu, Jul 1, 2010 at 11:03 AM, Ashutosh Chauhan <[email protected]> wrote:
> >
> > > That will be a day of rejoicing when a multi-million Oracle deployment
> > > comes to a grinding halt by tiny-weeny 4 line pig script. *wink* ;)
> > >
> > > Ashutosh
> > >
> > > On Thu, Jul 1, 2010 at 10:52, Dmitriy Ryaboy <[email protected]> wrote:
> > > > Can you put a LOG.info and javadoc into this patch saying "watch out,
> > > > DB connection bomb being deployed"? :)
> > > >
> > > > On Thu, Jul 1, 2010 at 10:48 AM, Ashutosh Chauhan <[email protected]> wrote:
> > > >
> > > >> There is an uncommitted Piggybank UDF which may help you.
> > > >> https://issues.apache.org/jira/browse/PIG-1229
> > > >> You can try the first patch ( pig-1229.2.patch by Ankur ) listed on
> > > >> the page. It does a different thing of writing rows from Pig into the
> > > >> DB. But DB connection part you can borrow from it.
> > > >>
> > > >> Note to self: I really want to get this patch committed before more
> > > >> people reinvent the wheel of making Pig talk to DB.
> > > >>
> > > >> On Thu, Jul 1, 2010 at 09:48, Dmitriy Ryaboy <[email protected]> wrote:
> > > >> > Also -- I hope your cluster is not too big. It's really easy to DDOS
> > > >> > your database using hadoop.
> > > >> >
> > > >> > On Thu, Jul 1, 2010 at 9:47 AM, Dmitriy Ryaboy <[email protected]> wrote:
> > > >> >
> > > >> >> The simplest thing you can do is to have a database handle at the
> > > >> >> object level, set it to null, and just initialize it in eval() if
> > > >> >> you see that it's null.
> > > >> >> You can also init the connection in the constructor.
> > > >> >> A static dbh will let you share it across tasks, if you persist the
> > > >> >> jvm.
> > > >> >> Naturally you will want to throw in some code to handle dropped
> > > >> >> connections and all that.
> > > >> >>
> > > >> >> On Thu, Jul 1, 2010 at 9:01 AM, Dave Viner <[email protected]> wrote:
> > > >> >>
> > > >> >>> In a custom UDF, what's the most appropriate way to initialize and
> > > >> >>> connect to an old-fashioned rdbms?
> > > >> >>>
> > > >> >>> I wrote a simple UDF which opens/closes a connection on each exec(),
> > > >> >>> but this feels a bit like overkill. Is there an "init()" method that
> > > >> >>> is invoked in a UDF to help with one-time initialization (like a
> > > >> >>> database connection or sql query preparation)?
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Dave Viner
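
For reference, here is a minimal sketch of the lazy-initialization approach suggested in the thread: keep the handle at the object level, leave it null, and open it on the first exec() call, reopening it if the connection was dropped. This is a plain-Java stand-in so that it compiles without Pig on the classpath; it does not extend Pig's EvalFunc. The names LazyHandleSketch, Handle, connector and connectCount are hypothetical and exist only for illustration; a real UDF would presumably hold something like a JDBC Connection instead.

```java
import java.util.function.Supplier;

// Plain-Java illustration only; this is NOT the Pig EvalFunc API.
class LazyHandleSketch {

    /** Hypothetical stand-in for a database handle (e.g. a JDBC Connection). */
    interface Handle {
        boolean isOpen();
        String lookup(String key);
    }

    private final Supplier<Handle> connector; // hypothetical factory that opens a handle
    private Handle handle;                    // stays null until the first exec() call
    private int connectCount;                 // number of times a handle was opened

    LazyHandleSketch(Supplier<Handle> connector) {
        // Only remember how to connect; no connection is opened here.
        this.connector = connector;
    }

    /** Called once per input value; opens the handle only if missing or dropped. */
    String exec(String field) {
        if (handle == null || !handle.isOpen()) {
            handle = connector.get();
            connectCount++;
        }
        return handle.lookup(field);
    }

    int connectCount() {
        return connectCount;
    }
}
```

With this shape, the connection cost is paid once per object rather than once per tuple, and nothing is opened during the client-side instantiations where exec() is never called. It does not address cleanup when the object is destroyed.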
