Hi Peng, I just added support for scalar Pandas UDFs to return a StructType as a Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836. Is that the functionality you are looking for?
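For reference, a minimal sketch of the shape SPARK-23836 enables: a scalar Pandas UDF whose per-batch function returns a pandas DataFrame, with each output column mapping onto one field of the struct result. The function body is shown running on plain pandas (no Spark session needed); the `split_stats` name and its columns are made up for illustration.

```python
import pandas as pd

# Hypothetical per-batch function: with SPARK-23836, a scalar pandas_udf
# can return a pandas DataFrame whose columns become the struct's fields.
def split_stats(v: pd.Series) -> pd.DataFrame:
    # Each column of the returned DataFrame is one field of the struct.
    return pd.DataFrame({"doubled": v * 2, "squared": v * v})

# Exercising the logic locally on a plain pandas Series:
batch = pd.Series([1, 2, 3])
out = split_stats(batch)
print(out["doubled"].tolist())  # [2, 4, 6]
print(out["squared"].tolist())  # [1, 4, 9]
```

In Spark this function would be wrapped with something like `@pandas_udf("doubled long, squared long")` and applied to a column; the exact decorator signature depends on the Spark version carrying the SPARK-23836 change.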
Bryan

On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
> Right now I'm using the columns-at-a-time mapping:
> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>
> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>> Maybe; it depends on what you're doing. It sounds like you are trying
>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>> you're doing vectorized? If not, this may not help much.
>> Just make the pandas Series into a DataFrame if you want, and a single
>> column back into a Series?
>>
>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>> pandas/Arrow is there for memory efficiency, and mapPartitions is only
>>> available on RDDs; sure, I can do everything with RDDs.
>>>
>>> But I thought that was the whole point of having pandas_udf, so that my
>>> program runs faster and consumes less memory?
>>>
>>> On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>>> Are you just applying a function to every row in the DataFrame? You
>>>> don't need pandas at all. Just get the RDD of Row from it and map a
>>>> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>>> that operates on all columns and returns a new value. mapPartitions is
>>>> also available if you want to transform an iterator of Row into
>>>> another iterator of Row.
>>>>
>>>> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>>>> It is very similar to SCALAR, but for SCALAR the output can't be a
>>>>> struct/row, and the input has to be a pd.Series, which doesn't
>>>>> support a row.
>>>>>
>>>>> I'm doing TensorFlow batch inference in Spark:
>>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>>>>
>>>>> I have to do the groupBy in order to use the apply function; I'm
>>>>> wondering why not just enable apply on a DataFrame?
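The GROUPED_MAP route peng describes above can be sketched as a DataFrame-to-DataFrame function, which Spark currently only exposes through `groupBy(...).apply(...)`, forcing an artificial grouping key. Plain pandas stands in for Spark here; `predict_batch`, the `g` key, and the columns are hypothetical stand-ins for his TensorFlow inference step.

```python
import pandas as pd

# The GROUPED_MAP shape: a whole-DataFrame in, whole-DataFrame out function.
def predict_batch(pdf: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for TensorFlow batch inference: add a prediction column.
    return pdf.assign(pred=pdf["x"] * 10)

# "g" exists only to satisfy the groupBy-then-apply API shape, which is
# exactly the complaint: the grouping itself carries no meaning here.
df = pd.DataFrame({"g": [0, 0, 1], "x": [1, 2, 3]})
result = pd.concat(predict_batch(grp) for _, grp in df.groupby("g"))
print(result["pred"].tolist())  # [10, 20, 30]
```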
>>>>>
>>>>> On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>> Are you looking for SCALAR? That lets you map one row to one row,
>>>>>> but do it more efficiently in batch. What are you trying to do?
>>>>>>
>>>>>> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>> I'm looking for a mapPartition(pandas_udf) for a pyspark DataFrame:
>>>>>>>
>>>>>>> ```
>>>>>>> @pandas_udf(df.schema, PandasUDFType.MAP)
>>>>>>> def do_nothing(pandas_df):
>>>>>>>     return pandas_df
>>>>>>>
>>>>>>> new_df = df.mapPartition(do_nothing)
>>>>>>> ```
>>>>>>>
>>>>>>> pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>>>>>> just MAP?
>>>>>>>
>>>>>>> On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>> Are you looking for @pandas_udf in Python? Or just mapPartitions?
>>>>>>>> Those exist already.
>>>>>>>>
>>>>>>>> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>> There is a nice map_partition function in R, `dapply`, so that
>>>>>>>>> the user can pass a row to a UDF.
>>>>>>>>>
>>>>>>>>> I'm wondering why we don't have that in Python?
>>>>>>>>>
>>>>>>>>> I'm trying to have a map_partition function with pandas_udf
>>>>>>>>> support.
>>>>>>>>>
>>>>>>>>> Thanks!
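Sean's mapPartitions suggestion can be sketched like this: a function that turns an iterator of rows into another iterator of rows. Plain Python dicts stand in for pyspark Row objects so the shape is visible without a Spark cluster; `transform_partition` and its fields are made up for illustration.

```python
# mapPartitions-style transform: iterator of rows in, iterator of rows out.
def transform_partition(rows):
    for row in rows:
        # Emit the original fields plus a derived one.
        yield {**row, "y": row["x"] + 1}

# One "partition", represented as a plain Python iterator:
partition = iter([{"x": 1}, {"x": 2}])
out_rows = list(transform_partition(partition))
print(out_rows)  # [{'x': 1, 'y': 2}, {'x': 2, 'y': 3}]
```

With an actual SparkSession this would be roughly `df.rdd.mapPartitions(transform_partition)` followed by `spark.createDataFrame(...)` to get back to a DataFrame, as Sean describes in the thread.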