The transformer is a classification model produced by the NeuralNetClassification estimator of dl4j-spark-ml. Source code here. The neural net operates most efficiently when many examples are classified in batch. I imagine overriding `transform` rather than `predictRaw`. Does anyone know of a solution compatible with Spark 1.4 or 1.5?
Thanks again!

From: Reynold Xin
Date: Friday, September 4, 2015 at 5:19 PM
To: Eron Wright
Cc: "dev@spark.apache.org"
Subject: Re: (Spark SQL) partition-scoped UDF

Can you say more about your transformer? This is a good idea, and indeed we are already doing it for R (the latest way to run UDFs in R is to pass the entire partition as a local R data frame for users to operate on). However, what works for R's simple data processing might not work for your high-performance transformer, etc.

On Fri, Sep 4, 2015 at 7:08 AM, Eron Wright <ewri...@live.com> wrote:

Transformers in Spark ML typically operate on a per-row basis, based on callUDF. For a new transformer that I'm developing, I need to transform an entire partition with a function, as opposed to transforming each row separately. The reason is that, in my case, rows must be transformed in batch to amortize some per-call overhead. How may I accomplish this?

One option appears to be to invoke DataFrame::mapPartitions, yielding an RDD that is then converted back to a DataFrame. I'm unsure about the viability or consequences of that approach.

Thanks!
Eron Wright
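For reference, the mapPartitions approach raised in the thread could be sketched roughly as below. This is a hypothetical transformer written against the Spark 1.4/1.5-era DataFrame API; the class name, the `classifyBatch` helper, and the output column are all illustrative assumptions, not part of dl4j-spark-ml. It only shows the shape of the technique: drop to the underlying RDD, batch each partition through one model call, and rebuild a DataFrame with an extended schema.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Row}
import org.apache.spark.sql.types.{DoubleType, StructField, StructType}

// Hypothetical sketch of a partition-scoped transformer (Spark 1.4/1.5 era).
// Instead of a per-row UDF, each partition is materialized and classified in
// one batched model call, amortizing per-invocation overhead.
class PartitionBatchClassifier(override val uid: String) extends Transformer {
  def this() = this(Identifiable.randomUID("partitionBatch"))

  override def transformSchema(schema: StructType): StructType =
    StructType(schema.fields :+ StructField("prediction", DoubleType, nullable = false))

  override def transform(dataset: DataFrame): DataFrame = {
    val outputSchema = transformSchema(dataset.schema)
    // Transform each partition as a batch rather than row-by-row.
    val transformed = dataset.rdd.mapPartitions { rows =>
      val batch = rows.toArray                 // materialize the partition
      val predictions = classifyBatch(batch)   // one batched model call (assumed)
      batch.iterator.zip(predictions.iterator).map {
        case (row, p) => Row.fromSeq(row.toSeq :+ p)
      }
    }
    // Convert the RDD back to a DataFrame with the extended schema.
    dataset.sqlContext.createDataFrame(transformed, outputSchema)
  }

  // Placeholder for the batched model invocation (e.g. the dl4j network).
  private def classifyBatch(rows: Array[Row]): Array[Double] = ???

  override def copy(extra: ParamMap): Transformer = defaultCopy(extra)
}
```

One consequence of this approach, as the thread hints, is that it leaves the DataFrame abstraction: Catalyst cannot see into the partition function, so any column pruning or predicate pushdown must happen before the `mapPartitions` call, and materializing each partition as an array assumes partitions fit comfortably in memory.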