Re: UDF and rdbms lookups

Dmitriy Ryaboy Thu, 01 Jul 2010 13:17:59 -0700

Yes, I mean exec().

The constructor will be called "at least 1 time". It will not be called once
per tuple -- the UDF object is created when the data starts flowing, and is
destroyed when it stops. So you can put things into the constructor.


By default, a no-argument constructor gets invoked. You can make Pig use a
constructor that takes string arguments (strings only!) by "defining" a
function, like so:

DEFINE MyFunction com.my.company.MyFunction('foo', 'bar');

[...]

foobar = FOREACH some_relation GENERATE MyFunction(some_field);

This will cause the relation foobar to get populated by the results of
calling MyFunction.exec on some_field of every tuple in some_relation, with
MyFunction having been instantiated using the arguments 'foo' and 'bar'.
The instantiation will happen a few times on the client-side (your machine),
while Pig tries to compile the program and send it to Hadoop, and one or
more times per task in Hadoop (in practice, you can pretend it's just once
per task).
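
To make the lifecycle concrete, here is a rough sketch of the pattern in plain Java. It is only an illustration, not code from Pig or from the PIG-1229 patch: a real UDF would extend org.apache.pig.EvalFunc and implement exec(Tuple), whereas this stand-in takes plain Strings so it compiles without Pig on the classpath. The class name LookupFunction, the meaning given to the two constructor arguments, and the fake in-memory "connection" are all made up for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Stand-in for a Pig UDF. A real one would extend org.apache.pig.EvalFunc<String>
// and implement exec(Tuple); Strings are used here to keep the sketch Pig-free.
class LookupFunction {
    private final String jdbcUrl;   // hypothetical meaning of the first DEFINE argument
    private final String tableName; // hypothetical meaning of the second DEFINE argument

    // Stand-in for a java.sql.Connection; stays null until the first exec() call.
    private Map<String, String> handle;

    // Exposed only so the demo can show how often a "connection" was opened.
    int connectionsOpened = 0;

    // Pig can only pass String arguments to a UDF constructor.
    LookupFunction(String jdbcUrl, String tableName) {
        this.jdbcUrl = jdbcUrl;
        this.tableName = tableName;
    }

    // Called once per tuple; the "connection" is opened lazily, on first use only.
    String exec(String key) {
        if (handle == null) {
            handle = openConnection();
        }
        return handle.get(key);
    }

    // Hypothetical helper: a real UDF would call something like
    // DriverManager.getConnection(jdbcUrl) here instead of building a map.
    private Map<String, String> openConnection() {
        connectionsOpened++;
        Map<String, String> fake = new HashMap<>();
        fake.put("1", tableName + ":one");
        fake.put("2", tableName + ":two");
        return fake;
    }
}

class UdfLifecycleSketch {
    public static void main(String[] args) {
        // Mirrors DEFINE MyFunction com.my.company.MyFunction('foo', 'bar')
        LookupFunction f = new LookupFunction("foo", "bar");
        for (String key : new String[] {"1", "2", "3"}) {
            System.out.println(key + " -> " + f.exec(key));
        }
        System.out.println("connections opened: " + f.connectionsOpened);
    }
}
```

Because the handle is only opened inside exec(), the extra instantiations that happen on the client while Pig compiles the script never touch the database; only instances that actually see data do.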

-Dmitriy

On Thu, Jul 1, 2010 at 12:53 PM, Dave Viner <[email protected]> wrote:

> @Dmitriy, you mentioned an eval() method... is that part of the UDF?  Or do
> you mean exec() ?
>
> I think my confusion may be that I'm not clear on the actual steps taken
> when a UDF is invoked.  Clearly, the key step is to invoke the exec(Tuple
> input) method.  But, it would appear that an object is instantiated first.
>  Are there any parameters passed to the constructor?  Or is there any way
> to influence those parameters?
>
> Also, how many objects would be constructed?  Is it one for each invocation
> of the UDF?  Or one for each process managing the map/reduce?
>
> @Ashutosh, this is a neat patch.  Reading/writing to a DB would be super
> helpful from within Pig.  But, I don't have enough Pig experience to know
> how to translate a StoreFunc into an EvalFunc.  In your code, the
> constructor sets up the variables and then prepareToWrite actually handles
> the connection to the database. Is there some similar call in an EvalFunc
> which is like a "prepareToExec"?
>
> Thanks
> Dave Viner
>
>
> On Thu, Jul 1, 2010 at 11:03 AM, Ashutosh Chauhan <
> [email protected]> wrote:
>
> > That will be a day of rejoice when a multi-million Oracle deployment
> > comes to a grinding halt by tiny-weeny 4 line pig script. *wink* ;)
> >
> > Ashutosh
> > On Thu, Jul 1, 2010 at 10:52, Dmitriy Ryaboy <[email protected]> wrote:
> > > Can you put a LOG.info and javadoc into this patch saying "watch out, DB
> > > connection bomb being deployed"? :)
> > >
> > > On Thu, Jul 1, 2010 at 10:48 AM, Ashutosh Chauhan <
> > > [email protected]> wrote:
> > >
> > >> There is an uncommitted Piggybank UDF which may help you:
> > >> https://issues.apache.org/jira/browse/PIG-1229
> > >> You can try the first patch (pig-1229.2.patch by Ankur) listed on the
> > >> page. It does a different thing, writing rows from Pig into the DB, but
> > >> you can borrow the DB connection part from it.
> > >>
> > >> Note to self: I really want to get this patch committed before more
> > >> people reinvent the wheel of making Pig talk to DB.
> > >>
> > >> On Thu, Jul 1, 2010 at 09:48, Dmitriy Ryaboy <[email protected]>
> > wrote:
> > >> > Also -- I hope your cluster is not too big. It's really easy to DDOS
> > >> > your database using hadoop.
> > >> >
> > >> > On Thu, Jul 1, 2010 at 9:47 AM, Dmitriy Ryaboy <[email protected]>
> > >> wrote:
> > >> >
> > >> >> The simplest thing you can do is to have a database handle at the
> > >> >> object level, set it to null, and just initialize it in eval() if you
> > >> >> see that it's null.
> > >> >> You can also init the connection in the constructor.
> > >> >> A static dbh will let you share it across tasks, if you persist the
> > >> >> jvm.
> > >> >> Naturally you will want to throw in some code to handle dropped
> > >> >> connections and all that.
> > >> >>
> > >> >>
> > >> >>
> > >> >> On Thu, Jul 1, 2010 at 9:01 AM, Dave Viner <[email protected]>
> > wrote:
> > >> >>
> > >> >>> In a custom UDF, what's the most appropriate way to initialize and
> > >> >>> connect to an old-fashioned rdbms?
> > >> >>>
> > >> >>> I wrote a simple UDF which opens/closes a connection on each exec(),
> > >> >>> but this feels a bit like overkill.  Is there an "init()" method that
> > >> >>> is invoked in a UDF to help with one-time initialization (like a
> > >> >>> database connection or sql query preparation)?
> > >> >>>
> > >> >>> Thanks
> > >> >>> Dave Viner
> > >> >>>
> > >> >>
> > >> >>
> > >> >
> > >>
> > >
> >
>
