You will need to look at the lifecycle of a UDF to better understand this.
Typically, UDF instances are created (note: possibly more than once!) at
plan creation time, before job submission, and subsequently deserialized
on the various mapper/reducer nodes to get executed (iirc).
So what I typically have in my code path is:
---- cut start ---
// transient, so it is not serialized; it will be false again after deserialization
private transient boolean initialized = false;
exec() {
    if (!initialized) doInit();
    ...
}
doInit() {
    // acquire resources (sockets, rdbms conn, etc.), initialize state
    // (create directory/files, copy from hdfs to local, etc.)
    initialized = true;
}
---- cut end ---
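
To make the effect of the transient flag concrete, here is a minimal, self-contained sketch. It does not use Pig at all: LazyInitUdf is a hypothetical stand-in class (a real UDF would extend org.apache.pig.EvalFunc), and plain Java serialization is assumed only as a stand-in for however the UDF actually reaches the task nodes. The initCount field exists purely for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;

// Hypothetical stand-in for a UDF; not a real Pig class.
class LazyInitUdf implements Serializable {
    private static final long serialVersionUID = 1L;

    // transient: skipped by serialization, so it is false again after deserialization
    private transient boolean initialized;
    // illustration only: how many times doInit() ran on this instance
    private transient int initCount;

    public String exec(String input) {
        if (!initialized) {
            doInit();
        }
        return input.toUpperCase();
    }

    private void doInit() {
        // A real UDF would acquire sockets, an RDBMS connection, etc. here.
        initCount++;
        initialized = true;
    }

    public int getInitCount() {
        return initCount;
    }

    // Serializes and deserializes the object in memory, mimicking a copy
    // being shipped from the client to a task node (an assumption of this sketch).
    static LazyInitUdf roundTrip(LazyInitUdf udf) throws IOException, ClassNotFoundException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (ObjectOutputStream out = new ObjectOutputStream(bytes)) {
            out.writeObject(udf);
        }
        try (ObjectInputStream in =
                 new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray()))) {
            return (LazyInitUdf) in.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LazyInitUdf planTime = new LazyInitUdf();
        planTime.exec("x");                       // initialized on the client side
        LazyInitUdf onTaskNode = roundTrip(planTime);
        onTaskNode.exec("a");                     // triggers doInit() on the copy
        onTaskNode.exec("b");                     // already initialized, no second doInit()
        System.out.println(onTaskNode.getInitCount());
    }
}
```

The point of the design is that initialization is tied to the first exec() on whichever copy actually runs, rather than to the constructor, which may run several times before the job is even submitted.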
If I am not wrong, each UDF invocation in a Pig script results in a new
UDF instance being created, so use this with care: you can end up with
M * N RDBMS connections if there are M mappers and N invocations in a
mapred job.
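
The M * N figure can be checked with a small simulation. This is only a sketch under stated assumptions: no real database or Pig runtime is involved, the Udf class and the simulate method are hypothetical names, and it assumes (as described above) one UDF instance per invocation per mapper, each opening one connection on its first exec().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

class ConnectionCountSketch {
    // Stands in for the number of open RDBMS connections (no real database here).
    static final AtomicInteger openConnections = new AtomicInteger();

    // Hypothetical UDF instance that opens one "connection" on its first exec().
    static class Udf {
        private boolean initialized;

        void exec() {
            if (!initialized) {
                openConnections.incrementAndGet();
                initialized = true;
            }
        }
    }

    // Simulates `mappers` tasks, each holding one UDF instance per invocation
    // in the script, with every instance processing `recordsPerTask` records.
    static int simulate(int mappers, int invocations, int recordsPerTask) {
        openConnections.set(0);
        for (int m = 0; m < mappers; m++) {
            List<Udf> instances = new ArrayList<>();
            for (int i = 0; i < invocations; i++) {
                instances.add(new Udf());
            }
            for (int r = 0; r < recordsPerTask; r++) {
                for (Udf udf : instances) {
                    udf.exec();
                }
            }
        }
        return openConnections.get();
    }

    public static void main(String[] args) {
        // 4 mappers, 3 invocations, 100 records each
        System.out.println(simulate(4, 3, 100)); // prints 12
    }
}
```

The record count does not affect the result; only the number of instances does, which is why lazy initialization helps per instance but does not by itself limit the total.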
Regards,
Mridul
On Thursday 01 July 2010 09:31 PM, Dave Viner wrote:
In a custom UDF, what's the most appropriate way to initialize and connect
to an old-fashioned RDBMS?
I wrote a simple UDF which opens/closes a connection on each exec(), but
this feels a bit like overkill. Is there an "init()" method that is invoked
in a UDF to help with one-time initialization (like a database connection or
sql query preparation)?
Thanks
Dave Viner