The transformer is a classification model produced by the 
NeuralNetClassification estimator of dl4j-spark-ml.  Source code here.  The 
neural net operates most efficiently when many examples are classified in 
batch.  I imagine overriding `transform` rather than `predictRaw`.   Does 
anyone know of a solution compatible with Spark 1.4 or 1.5?
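To make the idea concrete, here is a minimal sketch of what overriding `transform` for batch scoring could look like. This is an assumption-laden illustration, not dl4j-spark-ml code: `model.predictBatch` and `batchSize` are hypothetical names, and the Spark-dependent parts are shown as comments since they need a running SparkContext. The batching helper itself is plain Scala.

```scala
// Generic helper: apply a batch function over an iterator in fixed-size chunks.
// This amortizes per-call overhead by scoring many rows at once.
def inBatches[A, B](rows: Iterator[A], batchSize: Int)(f: Seq[A] => Seq[B]): Iterator[B] =
  rows.grouped(batchSize).flatMap(batch => f(batch))

// Inside the transformer, transform could be overridden roughly like this
// (hypothetical sketch; predictBatch is an assumed batch-scoring call):
//
// override def transform(df: DataFrame): DataFrame = {
//   val rdd = df.mapPartitions { rows =>
//     inBatches(rows, batchSize = 128)(batch => model.predictBatch(batch))
//   }
//   df.sqlContext.createDataFrame(rdd, transformSchema(df.schema))
// }

// Demonstration with a toy "model" that doubles its inputs in batch:
val out = inBatches(Iterator(1, 2, 3, 4, 5), batchSize = 2)(batch => batch.map(_ * 2)).toList
println(out)  // List(2, 4, 6, 8, 10)
```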

Thanks again!

From:  Reynold Xin
Date:  Friday, September 4, 2015 at 5:19 PM
To:  Eron Wright
Cc:  "dev@spark.apache.org"
Subject:  Re: (Spark SQL) partition-scoped UDF

Can you say more about your transformer?

This is a good idea, and indeed we are doing it for R already (the latest way 
to run UDFs in R is to pass the entire partition as a local R dataframe for 
users to run on). However, what works for R for simple data processing might 
not work for your high performance transformer, etc.


On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright <ewri...@live.com> wrote:
Transformers in Spark ML typically operate on a per-row basis, based on 
callUDF. For a new transformer that I'm developing, I need to transform an 
entire partition with a function, as opposed to transforming each row 
separately. The reason is that, in my case, rows must be transformed in batch 
to amortize some per-call overhead. How may I accomplish this?

One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD 
that is then converted back to a DataFrame. I'm unsure about the viability or 
consequences of that approach.

Thanks!
Eron Wright

