@Ian: This is a very interesting use/test case for the work you are
doing at UCI on the new more dynamic deployment model, and on how the
underlying UDF infrastructure can best support ML-model-based UDFs...!
On 11/17/19 4:56 PM, Xikui Wang wrote:
I wonder what the deployment-initialization would do?
btw, the UDF does have a deinitialize() method which is expected to be
invoked when the UDF is deinitialized, but that is ignored for now, as the
IScalarEvaluator in general does not deinitialize. To make that work, we
would need a bigger change in Hyracks to make it aware of that step. This
could be one improvement as well...
Best,
Xikui
On Sun, Nov 17, 2019 at 11:30 AM Till Westmann <[email protected]> wrote:
It seems that it'd be nice if we had a step (similar to the
initialization step) in the deployment lifecycle as well.
And I guess that we'd need a corresponding clean-up step for
un-deployment as well.
Does that make sense? If so, should we file an improvement for this?
Cheers,
Till
On 17 Nov 2019, at 9:29, Xikui Wang wrote:
The UDF interface has an initialize method which is invoked once per
lifecycle. Putting the model loading code in there can probably solve
your problem. The initialization is done per query (Hyracks job). For
example, if you run
SELECT mylib#myudf(t) FROM Tweets t;
and there are 100 tweets in the Tweets dataset, the initialization
method will be called once and the evaluate method will be invoked 100
times. In the context of feeds with attached UDFs, the initialization
happens only once, when the feed starts.
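
A minimal sketch of that lifecycle, in case it helps. Note that ScalarUdf,
SentimentModel, ClassifierUdf and LifecycleDemo are hypothetical stand-ins
made up for this illustration, not the actual AsterixDB UDF interface, and
the word-list "model" only simulates an expensive deserialization. The
point is just that the load sits in initialize() and evaluate() reuses it:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for the UDF contract described above.
// This is NOT the real AsterixDB interface, only the same three phases.
interface ScalarUdf {
    void initialize();              // once per query (job), or once when a feed starts
    String evaluate(String record); // once per record
    void deinitialize();            // clean-up hook
}

// Hypothetical model; load() stands in for deserializing a big trained object.
final class SentimentModel {
    static int loadCount = 0; // counts how often the expensive load runs
    private final List<String> positiveWords;

    private SentimentModel(List<String> positiveWords) {
        this.positiveWords = positiveWords;
    }

    static SentimentModel load() {
        loadCount++;
        // Assumption: a real UDF would read the serialized model from disk here.
        return new SentimentModel(Arrays.asList("good", "great", "love"));
    }

    String classify(String text) {
        String lower = text.toLowerCase();
        for (String word : positiveWords) {
            if (lower.contains(word)) {
                return "positive";
            }
        }
        return "other";
    }
}

// The UDF keeps the model in a field: loaded in initialize(), reused in evaluate().
final class ClassifierUdf implements ScalarUdf {
    private SentimentModel model;

    @Override
    public void initialize() {
        model = SentimentModel.load();
    }

    @Override
    public String evaluate(String record) {
        return model.classify(record);
    }

    @Override
    public void deinitialize() {
        model = null; // release the model
    }
}

final class LifecycleDemo {
    // Simulates one query: initialize once, evaluate once per record.
    // This driver calls deinitialize() itself purely for illustration.
    static List<String> runQuery(ScalarUdf udf, List<String> records) {
        udf.initialize();
        List<String> results = new ArrayList<>();
        try {
            for (String record : records) {
                results.add(udf.evaluate(record));
            }
        } finally {
            udf.deinitialize();
        }
        return results;
    }

    public static void main(String[] args) {
        List<String> tweets = Arrays.asList("I love this", "meh", "Great day");
        System.out.println(runQuery(new ClassifierUdf(), tweets));
        System.out.println("model loads: " + SentimentModel.loadCount);
    }
}
```

Keeping the model in an instance field means a new query (a new
initialize() call) would load it again, which matches the per-query
behavior described above.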
Best,
Xikui
On Sun, Nov 17, 2019 at 6:44 AM Torsten Bergh Moss <
[email protected]> wrote:
Dear developers,
I am trying to build a machine learning-based UDF for classification.
This involves loading in a model that has been trained offline, which in
practice basically is deserialization of a big object. This process of
deserialization takes a significant amount of time, but it only "needs"
to happen once, and after that the model can do the classification
rather rapidly.
Therefore, in order to avoid having to load the model every time the UDF
is called, I am wondering where in the UDF lifecycle I can do the loading
in order to achieve a "load model once, classify infinitely"-scenario,
and how to implement it. I am assuming it should be done somewhere inside
the factory-function-relationship, but I am not sure where/how and can't
seem to find a lot of documentation on it.
All help is appreciated, thanks!
Best wishes,
Torsten