Re: UDF and rdbms lookups

Mridul Muralidharan Wed, 07 Jul 2010 13:41:08 -0700


You will need to look at the lifecycle of a UDF to better understand this.

Typically they are created (note: one or more instances!) at plan creation time (before job submission) and subsequently deserialized on the various mapper/reducer nodes to be executed (iirc).



So typically what I have in my code path is:

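---- cut start ---
// default will be false; transient, so the flag is not carried over
// when the udf is serialized and starts out false again on each node
private transient boolean initialized = false;

public T exec(Tuple input) throws IOException {
  if (!initialized) doInit();

  ...
}

private void doInit() {
  // acquire resources (sockets, rdbms connection, etc.) and initialize state
  // (create directories/files, copy from hdfs to local, etc.)

  // mark as done so later exec() calls skip the setup
  initialized = true;
}
---- cut end ---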
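
To see why the flag is transient, here is a minimal sketch that runs outside Pig. It assumes plain Java serialization as a stand-in for however instances reach the mapper/reducer nodes. The class LazyLookupUdf, its in-memory table (standing in for the rdbms connection) and the roundTrip() helper are hypothetical names used only for illustration; a real UDF would extend org.apache.pig.EvalFunc.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for a UDF; not tied to the Pig API.
class LazyLookupUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient fields are skipped by Java serialization, so a deserialized
    // copy starts with initialized == false, table == null, initCalls == 0.
    private transient boolean initialized;
    private transient Map<String, String> table;
    private transient int initCalls;

    String exec(String key) {
        if (!initialized) {
            doInit();
        }
        return table.get(key);
    }

    // Number of times doInit() has run on this instance.
    int initCalls() {
        return initCalls;
    }

    private void doInit() {
        // Placeholder for the expensive setup (rdbms connection, local files, ...).
        table = new HashMap<>();
        table.put("1", "alice");
        table.put("2", "bob");
        initCalls++;
        initialized = true;
    }

    // Serializes and deserializes in memory, mimicking shipping the instance to another node.
    static LazyLookupUdf roundTrip(LazyLookupUdf udf) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(udf);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyLookupUdf) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LazyLookupUdf udf = new LazyLookupUdf();
        udf.exec("1");
        udf.exec("2");
        LazyLookupUdf copy = roundTrip(udf);
        copy.exec("1");
        System.out.println("original init calls: " + udf.initCalls());
        System.out.println("copy init calls: " + copy.initCalls());
    }
}
```

Repeated exec() calls on one instance run the setup once, while the deserialized copy runs it again on its own first call. That is the behaviour you want when the resource (a connection, a socket) cannot be shipped along with the instance.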

If I am not wrong, each UDF invocation in Pig results in a new UDF instance being created, so use this with care: you can end up with M * N rdbms connections if there are M mappers and N invocations in a mapred job.



Regards,
Mridul

On Thursday 01 July 2010 09:31 PM, Dave Viner wrote:

In a custom UDF, what's the most appropriate way to initialize and connect
to an old-fashioned rdbms?

I wrote a simple UDF which opens/closes a connection on each exec(), but
this feels a bit like overkill.  Is there an "init()" method that is invoked
in a UDF to help with one-time initialization (like a database connection or
sql query preparation)?

Thanks
Dave Viner
