Yeah, that seems to be what I wanted. Does the scalar Pandas UDF support a StructType as input too?
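[For context, a minimal sketch of the StructType-return feature Bryan describes in SPARK-23836: a SCALAR pandas UDF whose wrapped function returns a pd.DataFrame, which Spark maps to a struct column. The function and field names here (`to_struct`, `doubled`, `squared`) are made up for illustration; the pandas-level function is exercised directly so no cluster is needed, and the Spark registration is only indicated in comments.]

```python
import pandas as pd

# The pandas-level function a SCALAR pandas_udf would wrap. Per
# SPARK-23836, returning a pd.DataFrame maps to a StructType column;
# the DataFrame's column names must match the declared struct fields.
def to_struct(v: pd.Series) -> pd.DataFrame:
    return pd.DataFrame({"doubled": v * 2, "squared": v ** 2})

# In Spark this would be registered along these lines (check your
# Spark version supports the feature):
#   from pyspark.sql.functions import pandas_udf
#   udf = pandas_udf(to_struct, "doubled long, squared long")
#   df.select(udf(df["value"]))

# Exercised here directly on a pandas Series:
out = to_struct(pd.Series([1, 2, 3]))
print(out["doubled"].tolist())   # [2, 4, 6]
print(out["squared"].tolist())   # [1, 4, 9]
```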
On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <cutl...@gmail.com> wrote:

> Hi Peng,
>
> I just added support for scalar Pandas UDF to return a StructType as a
> Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836. Is
> that the functionality you are looking for?
>
> Bryan
>
> On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
>
>> Right now, I'm using the columns-at-a-time mapping:
>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>
>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>>
>>> Maybe, it depends on what you're doing. It sounds like you are trying
>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>> you're doing vectorized? If not, it may not help much.
>>> Just make the pandas Series into a DataFrame if you want, and a single
>>> column back into a Series?
>>>
>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>>
>>> > pandas/arrow is for memory efficiency, and mapPartitions is only
>>> > available on RDDs; for sure I can do everything with an RDD.
>>> >
>>> > But I thought that was the whole point of having pandas_udf, so my
>>> > program runs faster and consumes less memory?
>>> >
>>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>> >
>>> >> Are you just applying a function to every row in the DataFrame? You
>>> >> don't need pandas at all. Just get the RDD of Row from it and map a
>>> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>> >> that operates on all columns and returns a new value. mapPartitions is
>>> >> also available if you want to transform an iterator of Row to another
>>> >> iterator of Row.
>>> >>
>>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>> >>
>>> >> > It is very similar to SCALAR, but for SCALAR the output can't be a
>>> >> > struct/row, and the input has to be a pd.Series, which doesn't
>>> >> > support a row.
>>> >> >
>>> >> > I'm doing TensorFlow batch inference in Spark:
>>> >> > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>> >> >
>>> >> > I have to do the groupBy in order to use the apply function; I'm
>>> >> > wondering why not just enable apply on the DataFrame?
>>> >> >
>>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>> >> >
>>> >> >> Are you looking for SCALAR? That lets you map one row to one row,
>>> >> >> but do it more efficiently in batch. What are you trying to do?
>>> >> >>
>>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>> >> >>
>>> >> >> > I'm looking for a mapPartition(pandas_udf) for a pyspark.DataFrame:
>>> >> >> >
>>> >> >> > ```
>>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>>> >> >> > def do_nothing(pandas_df):
>>> >> >> >     return pandas_df
>>> >> >> >
>>> >> >> > new_df = df.mapPartition(do_nothing)
>>> >> >> > ```
>>> >> >> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>> >> >> > just MAP?
>>> >> >> >
>>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
>>> >> >> >
>>> >> >> >> Are you looking for @pandas_udf in Python? Or just mapPartitions?
>>> >> >> >> Those exist already.
>>> >> >> >>
>>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>>> >> >> >>
>>> >> >> >>> There is a nice map_partition function in R, `dapply`, so that a
>>> >> >> >>> user can pass a row to a UDF.
>>> >> >> >>>
>>> >> >> >>> I'm wondering why we don't have that in Python?
>>> >> >> >>>
>>> >> >> >>> I'm trying to have a map_partition function with pandas_udf
>>> >> >> >>> supported.
>>> >> >> >>>
>>> >> >> >>> thanks!
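[A plain-Python sketch of the RDD mapPartitions route Sean suggests in the thread: transform an iterator of rows into another iterator of rows, batching them up for model inference. `predict` and `infer_partition` are hypothetical names, and `predict` is a placeholder standing in for a real TensorFlow model call; the partition function is exercised directly here, with the Spark wiring only indicated in a comment.]

```python
def predict(batch):
    # Placeholder "model": sum the fields of each row. A real
    # implementation would call the TensorFlow model on the batch.
    return [sum(row) for row in batch]

def infer_partition(rows, batch_size=2):
    # Iterator of rows in, iterator of predictions out, calling the
    # model once per batch rather than once per row.
    batch = []
    for row in rows:
        batch.append(row)
        if len(batch) == batch_size:
            yield from predict(batch)
            batch = []
    if batch:                      # flush the final partial batch
        yield from predict(batch)

# With Spark, this would look roughly like:
#   scored = df.rdd.mapPartitions(infer_partition)
out = list(infer_partition(iter([(1, 2), (3, 4), (5, 6)])))
print(out)  # [3, 7, 11]
```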