Yeah, that seems to be what I wanted. Does the scalar Pandas UDF support a StructType as input too?
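[For context, a minimal sketch of the StructType-return feature Bryan describes in SPARK-23836: a SCALAR pandas UDF whose wrapped function returns a pd.DataFrame, which Spark maps to a struct column. The function and field names here (`to_struct`, `doubled`, `squared`) are made up for illustration; the pandas-level function is exercised directly so no cluster is needed, and the Spark registration is only indicated in comments.]

```python
import pandas as pd

# The pandas-level function a SCALAR pandas_udf would wrap. Per
# SPARK-23836, returning a pd.DataFrame maps to a StructType column;
# the DataFrame's column names must match the declared struct fields.
def to_struct(v: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"doubled": v * 2, "squared": v ** 2})

# In Spark this would be registered along these lines (check your
# Spark version supports the feature):
#   from pyspark.sql.functions import pandas_udf
#   udf = pandas_udf(to_struct, "doubled long, squared long")
#   df.select(udf(df["value"]))

# Exercised here directly on a pandas Series:
out = to_struct(pd.Series([1, 2, 3]))
print(out["doubled"].tolist())   # [2, 4, 6]
print(out["squared"].tolist())   # [1, 4, 9]
```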
On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <cutl...@gmail.com> wrote:

> Hi Peng,
>
> I just added support for scalar Pandas UDF to return a StructType as a
> Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836. Is
> that the functionality you are looking for?
>
> Bryan
>
> On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
>
>> Right now, I'm using the columns-at-a-time mapping:
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>
>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Maybe, it depends on what you're doing. It sounds like you are trying
>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>> you're doing vectorized? If not, it may not help much.
>>> Just make the pandas Series into a DataFrame if you want, and a single
>>> column back into a Series?
>>>
>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>>
>>> > pandas/arrow is for memory efficiency, and mapPartitions is only
>>> > available on RDDs; for sure I can do everything with an RDD.
>>> >
>>> > But I thought that was the whole point of having pandas_udf, so my
>>> > program runs faster and consumes less memory?
>>> >
>>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>> >
>>> >> Are you just applying a function to every row in the DataFrame? You
>>> >> don't need pandas at all. Just get the RDD of Row from it and map a
>>> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>> >> that operates on all columns and returns a new value. mapPartitions is
>>> >> also available if you want to transform an iterator of Row to another
>>> >> iterator of Row.
>>> >>
>>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>> >>
>>> >> > It is very similar to SCALAR, but for SCALAR the output can't be a
>>> >> > struct/row, and the input has to be a pd.Series, which doesn't
>>> >> > support a row.
>>> >> >
>>> >> > I'm doing TensorFlow batch inference in Spark:
>>> >> > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>> >> >
>>> >> > I have to do the groupBy in order to use the apply function; I'm
>>> >> > wondering why not just enable apply on the DataFrame?
>>> >> >
>>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>> >> >
>>> >> >> Are you looking for SCALAR? That lets you map one row to one row,
>>> >> >> but do it more efficiently in batch. What are you trying to do?
>>> >> >>
>>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>> >> >>
>>> >> >> > I'm looking for a mapPartition(pandas_udf) for a pyspark.DataFrame:
>>> >> >> >
>>> >> >> > ```
>>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>>> >> >> > def do_nothing(pandas_df):
>>> >> >> >     return pandas_df
>>> >> >> >
>>> >> >> > new_df = df.mapPartition(do_nothing)
>>> >> >> > ```
>>> >> >> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>> >> >> > just MAP?
>>> >> >> >
>>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
>>> >> >> >
>>> >> >> >> Are you looking for @pandas_udf in Python? Or just mapPartitions?
>>> >> >> >> Those exist already.
>>> >> >> >>
>>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >>> There is a nice map_partition function in R, `dapply`, so that a
>>> >> >> >>> user can pass a row to a UDF.
>>> >> >> >>>
>>> >> >> >>> I'm wondering why we don't have that in Python?
>>> >> >> >>>
>>> >> >> >>> I'm trying to have a map_partition function with pandas_udf
>>> >> >> >>> supported.
>>> >> >> >>>
>>> >> >> >>> thanks!
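[A plain-Python sketch of the RDD mapPartitions route Sean suggests in the thread: transform an iterator of rows into another iterator of rows, batching them up for model inference. `predict` and `infer_partition` are hypothetical names, and `predict` is a placeholder standing in for a real TensorFlow model call; the partition function is exercised directly here, with the Spark wiring only indicated in a comment.]

```python
def predict(batch):
    # Placeholder "model": sum the fields of each row. A real
    # implementation would call the TensorFlow model on the batch.
    return [sum(row) for row in batch]

def infer_partition(rows, batch_size=2):
    # Iterator of rows in, iterator of predictions out, calling the
    # model once per batch rather than once per row.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from predict(batch)
            batch = []
    if batch:                      # flush the final partial batch
        yield from predict(batch)

# With Spark, this would look roughly like:
#   scored = df.rdd.mapPartitions(infer_partition)
out = list(infer_partition(iter([(1, 2), (3, 4), (5, 6)])))
print(out)  # [3, 7, 11]
```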