Transformers in Spark ML typically operate on a per-row basis, using callUDF.
For a new transformer that I'm developing, I need to transform an entire
partition with a function, rather than transforming each row separately. The
reason is that, in my case, rows must be transformed in batches to amortize
some per-call overhead. How can I accomplish this?
One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD
that is then converted back to a DataFrame. I'm unsure about the viability or
consequences of that approach.
Thanks!
Eron Wright