Hi Dmitriy,

Thanks!  This is very helpful!

Is there a method that gets called when the UDF object is being destroyed?
Something that allows for cleanup?

Thanks again.
Dave Viner


On Thu, Jul 1, 2010 at 1:16 PM, Dmitriy Ryaboy <[email protected]> wrote:

> Yes, I mean exec().
>
> The constructor will be called "at least 1 time". It will not be called once
> per tuple -- the UDF object is created when the data starts flowing, and is
> destroyed when it stops. So you can put things into the constructor.
>
> By default, a no-argument constructor gets invoked. You can make Pig use a
> constructor that takes string arguments (strings only!) by "defining" a
> function, like so:
>
> DEFINE MyFunction com.my.company.MyFunction('foo', 'bar')
>
> [...]
>
> foobar = FOREACH some_relation GENERATE MyFunction(some_field);
>
> This will cause the relation foobar to get populated by the results of
> calling MyFunction.exec on some_field of every tuple in some_relation, with
> MyFunction having been instantiated using the arguments 'foo' and 'bar'.
> The instantiation will happen a few times on the client-side (your machine),
> while Pig tries to compile the program and send it to Hadoop, and one or
> more times per task in Hadoop (in practice, you can pretend it's just once
> per task).
>
> -Dmitriy
>
> On Thu, Jul 1, 2010 at 12:53 PM, Dave Viner <[email protected]> wrote:
>
> > @Dmitriy, you mentioned an eval() method... is that part of the UDF?  Or do
> > you mean exec()?
> >
> > I think my confusion may be that I'm not clear on the actual steps taken
> > when a UDF is invoked.  Clearly, the key step is to invoke the exec(Tuple
> > input) method.  But, it would appear that an object is instantiated first.
> > Are there any parameters passed to the constructor?  Or is there any way to
> > influence those parameters?
> >
> > Also, how many objects would be constructed?  Is it one for each invocation
> > of the UDF?  Or one for each process managing the map/reduce?
> >
> > @Ashutosh, this is a neat patch.  Reading/writing to a DB would be super
> > helpful from within Pig.  But, I don't have enough Pig experience to know
> > how to translate a StoreFunc into an EvalFunc.  In your code, the constructor
> > sets up the variables and then prepareToWrite actually handles the
> > connection to the database. Is there some similar call in an EvalFunc which
> > is like a "prepareToExec"?
> >
> > Thanks
> > Dave Viner
> >
> >
> > On Thu, Jul 1, 2010 at 11:03 AM, Ashutosh Chauhan <
> > [email protected]> wrote:
> >
> > > That will be a day of rejoicing when a multi-million Oracle deployment
> > > comes to a grinding halt by a tiny-weeny 4-line Pig script. *wink* ;)
> > >
> > > Ashutosh
> > > On Thu, Jul 1, 2010 at 10:52, Dmitriy Ryaboy <[email protected]>
> wrote:
> > > > Can you put a LOG.info and javadoc into this patch saying "watch out, DB
> > > > connection bomb being deployed"? :)
> > > >
> > > > On Thu, Jul 1, 2010 at 10:48 AM, Ashutosh Chauhan <
> > > > [email protected]> wrote:
> > > >
> > > >> There is an uncommitted Piggybank UDF which may help you:
> > > >> https://issues.apache.org/jira/browse/PIG-1229
> > > >> You can try the first patch (pig-1229.2.patch by Ankur) listed on the
> > > >> page. It does a different thing, writing rows from Pig into the DB, but
> > > >> you can borrow the DB connection part from it.
> > > >>
> > > >> Note to self: I really want to get this patch committed before more
> > > >> people reinvent the wheel of making Pig talk to DB.
> > > >>
> > > >> On Thu, Jul 1, 2010 at 09:48, Dmitriy Ryaboy <[email protected]>
> > > wrote:
> > > >> > Also -- I hope your cluster is not too big. It's really easy to DDoS your
> > > >> > database using Hadoop.
> > > >> >
> > > >> > On Thu, Jul 1, 2010 at 9:47 AM, Dmitriy Ryaboy <
> [email protected]>
> > > >> wrote:
> > > >> >
> > > >> >> The simplest thing you can do is to have a database handle at the object
> > > >> >> level, set it to null, and just initialize it in eval() if you see that
> > > >> >> it's null.
> > > >> >> You can also init the connection in the constructor.
> > > >> >> A static dbh will let you share it across tasks, if you persist the JVM.
> > > >> >> Naturally you will want to throw in some code to handle dropped
> > > >> >> connections and all that.
> > > >> >>
> > > >> >>
> > > >> >>
> > > >> >> On Thu, Jul 1, 2010 at 9:01 AM, Dave Viner <[email protected]>
> > > wrote:
> > > >> >>
> > > >> >>> In a custom UDF, what's the most appropriate way to initialize and
> > > >> >>> connect to an old-fashioned RDBMS?
> > > >> >>>
> > > >> >>> I wrote a simple UDF which opens/closes a connection on each exec(), but
> > > >> >>> this feels a bit like overkill.  Is there an "init()" method that is
> > > >> >>> invoked in a UDF to help with one-time initialization (like a database
> > > >> >>> connection or SQL query preparation)?
> > > >> >>>
> > > >> >>> Thanks
> > > >> >>> Dave Viner
> > > >> >>>
> > > >> >>
> > > >> >>
> > > >> >
> > > >>
> > > >
> > >
> >
>
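
For anyone reading this thread later, here is a minimal sketch of the pattern discussed above: keep the handle at the object level, leave it null in the constructor, and open it on the first exec() call. Opening it lazily rather than in the constructor matters because the constructor also runs a few times on the client side while Pig compiles the script. To keep the sketch runnable without Pig or a JDBC driver, it uses plain Java: Handle, LazyLookup, and the Supplier-based factory are hypothetical names made up for illustration, not Pig APIs. In a real UDF the class would extend EvalFunc, exec() would take a Tuple, and the handle would be a java.sql.Connection. As far as I know EvalFunc also has a finish() method meant for cleanup, which is what finish() below imitates, but please check the javadoc of the Pig version you run.

```java
import java.util.function.Supplier;

// Hypothetical stand-in for a database handle (a real UDF would hold a java.sql.Connection).
interface Handle extends AutoCloseable {
    String lookup(String key);

    @Override
    void close();
}

// Sketch of the lazy-initialization pattern only; this is NOT a Pig class.
class LazyLookup {
    private final Supplier<Handle> factory; // assumption: describes how to open the handle
    private Handle handle;                  // stays null until the first exec() call
    int opens = 0;                          // counts how often the handle was opened

    // Keep the constructor cheap: it may run several times, including on the client.
    LazyLookup(Supplier<Handle> factory) {
        this.factory = factory;
    }

    // Called once per input record; opens the handle only if it is not open yet.
    String exec(String key) {
        if (handle == null) {
            handle = factory.get();
            opens++;
        }
        return handle.lookup(key);
    }

    // Cleanup hook; safe to call more than once.
    void finish() {
        if (handle != null) {
            handle.close();
            handle = null;
        }
    }
}

class LazyLookupDemo {
    public static void main(String[] args) {
        // Fake handle that upper-cases the key instead of querying a database.
        LazyLookup udf = new LazyLookup(() -> new Handle() {
            public String lookup(String key) { return key.toUpperCase(); }
            public void close() { }
        });
        System.out.println(udf.exec("foo"));
        System.out.println(udf.exec("bar"));
        System.out.println("handle opened " + udf.opens + " time(s)");
        udf.finish();
    }
}
```

The sketch leaves out the dropped-connection handling mentioned above; a real version would need to detect a dead handle in exec() and reopen it.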
