You will need to look at the lifecycle of a UDF to understand this better.
Typically, UDF instances are created (note: one or more creations!) at plan creation time (before job submission) and are subsequently deserialized on the various mapper/reducer nodes to get executed (iirc).


So what I typically have in my code path is:

---- cut start ---
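// transient, so the flag is not serialized; it defaults to false after deserialization
private transient boolean initialized = false;

exec(){
  if (!initialized) doInit();

  ...
}

doInit(){
  // acquire resources (sockets, rdbms conn, etc), initialize state (create directory/files, copy from hdfs to local, etc).

  // mark as done so later exec() calls skip initialization
  initialized = true;
}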
---- cut end ---
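
For illustration, here is a self-contained sketch of the same pattern in plain Java. It does not use the Pig API: LazyInitUdf, initCount, roundTrip and the StringBuilder "resource" are all hypothetical stand-ins, and the serialization round trip only imitates the "deserialized on the task node" step described above, which is an assumption about how the instance arrives there.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a UDF; a real one would extend Pig's EvalFunc.
class LazyInitUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: not serialized, so a deserialized copy starts with false/null.
    private transient boolean initialized;
    private transient StringBuilder resource; // placeholder for a socket or rdbms conn

    // Number of doInit() calls seen by this instance (serialized with it).
    int initCount = 0;

    String exec(String input) {
        if (!initialized) {
            doInit();
        }
        resource.append(input);
        return input;
    }

    private void doInit() {
        // Acquire resources / initialize state here.
        resource = new StringBuilder();
        initCount++;
        initialized = true;
    }

    // Serialize and deserialize a copy, imitating shipment to a task node (assumed).
    static LazyInitUdf roundTrip(LazyInitUdf udf) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
                out.writeObject(udf);
            }
            try (ObjectInputStream in =
                     new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
                return (LazyInitUdf) in.readObject();
            }
        } catch (IOException | ClassNotFoundException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        LazyInitUdf frontend = new LazyInitUdf();              // created at plan time
        LazyInitUdf backend = roundTrip(frontend);             // copy on the task side
        backend.exec("a");
        backend.exec("b");                                     // no second doInit()
        System.out.println(frontend.initCount + " " + backend.initCount); // prints "0 1"
    }
}
```

The point of the transient flag is that nothing acquired before submission leaks into the copy that actually runs: the copy sees initialized == false and does its own doInit() on the first exec().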


If I am not wrong, each UDF invocation in a Pig script results in a new UDF instance being created, so use this with care: with M mappers and N invocations in a mapred job, you can end up with M * N RDBMS connections.
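
To make the M * N figure concrete, here is a rough simulation in plain Java. It assumes, as stated above, that every invocation in the script gets its own instance in every mapper; ConnectionCountSketch, DbUdf and the openConnections counter are hypothetical and no real database is involved.

```java
import java.util.ArrayList;
import java.util.List;

class ConnectionCountSketch {
    // Hypothetical counter standing in for open rdbms connections.
    static int openConnections = 0;

    // Hypothetical UDF instance that opens one connection on its first exec().
    static class DbUdf {
        private boolean initialized = false;

        void exec() {
            if (!initialized) {
                openConnections++;
                initialized = true;
            }
        }
    }

    // Simulate M mappers, each holding one instance per invocation in the script (N).
    static int simulate(int mappers, int invocations, int rowsPerMapper) {
        openConnections = 0;
        for (int m = 0; m < mappers; m++) {
            List<DbUdf> instances = new ArrayList<>();
            for (int n = 0; n < invocations; n++) {
                instances.add(new DbUdf());
            }
            for (int row = 0; row < rowsPerMapper; row++) {
                for (DbUdf udf : instances) {
                    udf.exec();
                }
            }
        }
        return openConnections;
    }

    public static void main(String[] args) {
        System.out.println(simulate(4, 3, 1000)); // prints 12
    }
}
```

The row count does not matter here: lazy initialization keeps it at one connection per instance, but the number of instances still multiplies.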


Regards,
Mridul

On Thursday 01 July 2010 09:31 PM, Dave Viner wrote:
In a custom UDF, what's the most appropriate way to initialize and connect
to an old-fashioned rdbms?

I wrote a simple UDF which opens/closes a connection on each exec(), but
this feels a bit like overkill.  Is there an "init()" method that is invoked
in a UDF to help with one-time initialization (like a database connection or
sql query preparation)?

Thanks
Dave Viner
