Transformers in Spark ML typically operate on a per-row basis, using callUDF.
For a new transformer that I'm developing, I need to transform an entire
partition with a function, rather than transforming each row separately. The
reason is that, in my case, rows must be transformed in batches to amortize
some per-call overhead. How can I accomplish this?
One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD
that is then converted back to a DataFrame. I'm unsure about the viability or
consequences of that approach.
Thanks!
Eron Wright