Re: UDF and rdbms lookups

Dmitriy Ryaboy Thu, 01 Jul 2010 14:15:31 -0700

Yep, it's the finish() method.

See javadocs:
http://hadoop.apache.org/pig/javadoc/docs/api/org/apache/pig/EvalFunc.html
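
For illustration, here is a minimal sketch of the pattern discussed in this thread: keep the handle null, open it on the first exec() call, and release it in finish(). It deliberately does not depend on Pig, so it compiles on its own. LookupUdfSketch and FakeDbHandle are hypothetical stand-ins; a real UDF would extend EvalFunc, take a Tuple in exec(), and hold something like a java.sql.Connection. Opening the handle lazily rather than in the constructor is a design choice made here because the constructor also runs on the client while Pig compiles the script.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a database handle (assumption: a real UDF
// would hold a java.sql.Connection or similar instead).
class FakeDbHandle implements AutoCloseable {
    private final Map<String, String> rows = new HashMap<>();
    private boolean closed = false;

    FakeDbHandle(String url) {
        // Pretend these rows live in the database behind 'url'.
        rows.put("1", "alice");
        rows.put("2", "bob");
    }

    String lookup(String key) {
        if (closed) {
            throw new IllegalStateException("handle is closed");
        }
        return rows.get(key);
    }

    boolean isClosed() {
        return closed;
    }

    @Override
    public void close() {
        closed = true;
    }
}

// Mirrors the shape of an EvalFunc (constructor, exec, finish) without
// depending on Pig. The class name and String-based exec() are assumptions.
class LookupUdfSketch {
    private final String url;   // would come from DEFINE's string arguments
    private FakeDbHandle dbh;   // stays null until the first exec() call
    private int openCount = 0;  // how many times a handle was opened

    LookupUdfSketch(String url) {
        // Cheap constructor: only stores configuration, opens nothing.
        this.url = url;
    }

    String exec(String key) {
        if (dbh == null) {
            // Lazy init: runs once per object, not once per input.
            dbh = new FakeDbHandle(url);
            openCount++;
        }
        return dbh.lookup(key);
    }

    void finish() {
        // Cleanup hook: release the handle if one was ever opened.
        if (dbh != null) {
            dbh.close();
            dbh = null;
        }
    }

    int openCount() {
        return openCount;
    }

    public static void main(String[] args) {
        LookupUdfSketch udf = new LookupUdfSketch("jdbc:example://host/db");
        System.out.println(udf.exec("1"));
        System.out.println(udf.exec("2"));
        udf.finish();
    }
}
```

Both exec() calls in main share one handle, and finish() is safe to call even if exec() never ran. Handling of dropped connections is left out of this sketch.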




On Thu, Jul 1, 2010 at 1:55 PM, Dave Viner <[email protected]> wrote:

> Hi Dmitriy,
>
> Thanks!  This is very helpful!
>
> Is there a method that gets called when the UDF object is being destroyed?
> Something that allows for cleanup?
>
> Thanks again.
> Dave Viner
>
>
> On Thu, Jul 1, 2010 at 1:16 PM, Dmitriy Ryaboy <[email protected]> wrote:
>
> > Yes, I mean exec().
> >
> > The constructor will be called "at least 1 time". It will not be called
> > once per tuple -- the UDF object is created when the data starts flowing,
> > and is destroyed when it stops. So you can put things into the constructor.
> >
> > By default, a no-argument constructor gets invoked. You can make Pig use
> > a constructor that takes string arguments (strings only!) by "defining" a
> > function, like so:
> >
> > DEFINE MyFunction com.my.company.MyFunction('foo', 'bar');
> >
> > [...]
> >
> > foobar = FOREACH some_relation GENERATE MyFunction(some_field);
> >
> > This will cause the relation foobar to get populated by the results of
> > calling MyFunction.exec on some_field of every tuple in some_relation,
> > with MyFunction having been instantiated using the arguments 'foo' and
> > 'bar'. The instantiation will happen a few times on the client side (your
> > machine), while Pig tries to compile the program and send it to Hadoop, and
> > one or more times per task in Hadoop (in practice, you can pretend it's
> > just once per task).
> >
> > -Dmitriy
> >
> > On Thu, Jul 1, 2010 at 12:53 PM, Dave Viner <[email protected]> wrote:
> >
> > > @Dmitriy, you mentioned an eval() method... is that part of the UDF?  Or
> > > do you mean exec()?
> > >
> > > I think my confusion may be that I'm not clear on the actual steps
> > > taken when a UDF is invoked.  Clearly, the key step is to invoke the
> > > exec(Tuple input) method.  But it would appear that an object is
> > > instantiated first.  Are there any parameters passed to the constructor?
> > > Or is there any way to influence those parameters?
> > >
> > > Also, how many objects would be constructed?  Is it one for each
> > > invocation of the UDF?  Or one for each process managing the map/reduce?
> > >
> > > @Ashutosh, this is a neat patch.  Reading/writing to a DB would be
> > > super helpful from within Pig.  But I don't have enough Pig experience to
> > > know how to translate a StoreFunc into an EvalFunc.  In your code, the
> > > constructor sets up the variables and then prepareToWrite actually
> > > handles the connection to the database.  Is there some similar call in
> > > an EvalFunc, something like a "prepareToExec"?
> > >
> > > Thanks
> > > Dave Viner
> > >
> > >
> > > On Thu, Jul 1, 2010 at 11:03 AM, Ashutosh Chauhan <
> > > [email protected]> wrote:
> > >
> > > > That will be a day of rejoicing when a multi-million Oracle deployment
> > > > comes to a grinding halt by a teeny-weeny 4-line Pig script. *wink* ;)
> > > >
> > > > Ashutosh
> > > > On Thu, Jul 1, 2010 at 10:52, Dmitriy Ryaboy <[email protected]> wrote:
> > > > > Can you put a LOG.info and javadoc into this patch saying "watch out,
> > > > > DB connection bomb being deployed"? :)
> > > > >
> > > > > On Thu, Jul 1, 2010 at 10:48 AM, Ashutosh Chauhan <
> > > > > [email protected]> wrote:
> > > > >
> > > > >> There is an uncommitted Piggybank UDF which may help you:
> > > > >> https://issues.apache.org/jira/browse/PIG-1229
> > > > >> You can try the first patch (pig-1229.2.patch by Ankur) listed on the
> > > > >> page. It does a different thing, writing rows from Pig into the DB,
> > > > >> but you can borrow the DB connection part from it.
> > > > >>
> > > > >> Note to self: I really want to get this patch committed before more
> > > > >> people reinvent the wheel of making Pig talk to a DB.
> > > > >>
> > > > >> On Thu, Jul 1, 2010 at 09:48, Dmitriy Ryaboy <[email protected]> wrote:
> > > > >> > Also -- I hope your cluster is not too big. It's really easy to DDoS
> > > > >> > your database using Hadoop.
> > > > >> >
> > > > >> > On Thu, Jul 1, 2010 at 9:47 AM, Dmitriy Ryaboy <
> > [email protected]>
> > > > >> wrote:
> > > > >> >
> > > > >> >> The simplest thing you can do is to keep a database handle at the
> > > > >> >> object level, set it to null, and just initialize it in eval() if
> > > > >> >> you see that it's null.
> > > > >> >> You can also init the connection in the constructor.
> > > > >> >> A static dbh will let you share it across tasks, if you persist the
> > > > >> >> JVM.
> > > > >> >> Naturally you will want to throw in some code to handle dropped
> > > > >> >> connections and all that.
> > > > >> >>
> > > > >> >>
> > > > >> >>
> > > > >> >> On Thu, Jul 1, 2010 at 9:01 AM, Dave Viner <[email protected]> wrote:
> > > > >> >>
> > > > >> >>> In a custom UDF, what's the most appropriate way to initialize and
> > > > >> >>> connect to an old-fashioned RDBMS?
> > > > >> >>>
> > > > >> >>> I wrote a simple UDF which opens/closes a connection on each exec(),
> > > > >> >>> but this feels a bit like overkill.  Is there an "init()" method
> > > > >> >>> that is invoked in a UDF to help with one-time initialization (like
> > > > >> >>> a database connection or SQL query preparation)?
> > > > >> >>>
> > > > >> >>> Thanks
> > > > >> >>> Dave Viner
> > > > >> >>>
> > > > >> >>
> > > > >> >>
> > > > >> >
> > > > >>
> > > > >
> > > >
> > >
> >
>
