Hi Torsten,

In case you want to customize the UDF framework to trigger your UDF on a batch of records, you could consider reusing the PartitionHolder that I built for the enrichment support in the ingestion project. It takes a number of records, processes them, and returns the processed results. I used it to enable hash joins on feeds and to refresh reference data per batch. That might be helpful. You can find more information here [1].
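As a rough illustration, the core of the pattern is just accumulate-then-flush. The sketch below is self-contained Java; the names are made up for this example and are not the actual PartitionHolder API:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.function.Function;

    // Hypothetical sketch of the PartitionHolder idea (not the real API):
    // buffer records up to a fixed batch size, process the whole batch in
    // one call, and hand back the processed results.
    public class BatchAccumulator<IN, OUT> {
        private final int batchSize;
        private final Function<List<IN>, List<OUT>> batchFn; // e.g. one GPU inference call
        private final List<IN> buffer = new ArrayList<>();

        public BatchAccumulator(int batchSize, Function<List<IN>, List<OUT>> batchFn) {
            this.batchSize = batchSize;
            this.batchFn = batchFn;
        }

        // Buffer one record; results come back only once a full batch is ready.
        public List<OUT> add(IN record) {
            buffer.add(record);
            return buffer.size() >= batchSize ? flush() : List.of();
        }

        // Process whatever is buffered. Call this once more at end-of-stream
        // so the final partial batch is not lost.
        public List<OUT> flush() {
            if (buffer.isEmpty()) {
                return List.of();
            }
            List<OUT> results = batchFn.apply(new ArrayList<>(buffer));
            buffer.clear();
            return results;
        }
    }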
[1] https://arxiv.org/pdf/1902.08271.pdf

Best,
Xikui

On Thu, Feb 27, 2020 at 2:35 PM Dmitry Lychagin <[email protected]> wrote:

> Torsten,
>
> I see a couple of possible approaches here:
>
> 1. Make your function operate on arrays of values instead of primitive
> values. You'll probably need a GROUP BY in your query to create an array
> (using ARRAY_AGG() or a GROUP AS variable). Then pass that array to your
> function, which would process it and also return a result array. Then
> unnest that output array to get the cardinality back. (A SQL++ sketch of
> this follows at the end of this thread.)
>
> 2. Alternatively, you could try creating a new runtime for the ASSIGN
> operator that would pass batches of input tuples to a new kind of
> function evaluator. You'll need to provide replacements for
> AssignPOperator/AssignRuntimeFactory. You'd also need to modify
> InlineVariablesRule [1] so that it doesn't inline those ASSIGNs.
>
> [1]
> https://github.com/apache/asterixdb/blob/master/hyracks-fullstack/algebricks/algebricks-rewriter/src/main/java/org/apache/hyracks/algebricks/rewriter/rules/InlineVariablesRule.java#L144
>
> Thanks,
> -- Dmitry
>
>
> On 2/27/20, 2:02 PM, "Torsten Bergh Moss" <[email protected]> wrote:
>
> Greetings everyone,
>
> I'm experimenting a lot with UDFs that perform neural-network inference,
> mainly for classifying tweets. The problem is that running the UDFs one
> record at a time severely under-utilizes the capacity of a GPU-powered
> network, and every call incurs the latency of moving data from the CPU
> to the GPU and back, which results in poor performance.
>
> Ideally it would be possible to use the UDF to process records in a
> micro-batch fashion, letting them accumulate until a certain batch size
> is reached (as large as my GPU's memory can handle) before passing the
> data along to the neural network to get the outputs.
>
> Is there a way to accomplish this with the current UDF framework
> (either in Java or Python)? If not, where would I have to start in
> order to develop such a feature?
>
> Best wishes,
> Torsten Bergh Moss
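To make the first of Dmitry's approaches concrete, here is a minimal SQL++ sketch. The Tweets dataset, its text field, the batch_no grouping key, and the classify_batch() function are all hypothetical; classify_batch is assumed to take an array of strings and return an equally long array of labels, and the grouping key determines how many records land in each batch (so it can be sized to what the GPU's memory can handle):

    -- Form batches with GROUP BY, aggregate the texts of each batch into
    -- an array, run the UDF once per batch, then UNNEST the returned
    -- array to restore per-tweet cardinality.
    SELECT VALUE label
    FROM (
        SELECT classify_batch(ARRAY_AGG(t.text)) AS labels
        FROM Tweets AS t
        GROUP BY t.batch_no
    ) AS b
    UNNEST b.labels AS label;

If each label has to be re-associated with its tweet, aggregate the tweet ids alongside the texts so that positions in the input and output arrays can be matched up after unnesting.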
