Cool, thanks for letting me know. But why not support dapply (http://spark.apache.org/docs/2.0.0/api/R/dapply.html) as R does, so we can just pass in a pandas DataFrame?
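The struct-returning scalar Pandas UDF discussed in the thread below could look roughly like this. The Spark wiring is only sketched in comments (it assumes a Spark build that includes SPARK-23836); the UDF body itself is plain pandas, so it can be tried without a cluster, and the `full_name`/`first`/`last` column names are made up for illustration.

```python
import pandas as pd

# Body of a scalar Pandas UDF: takes a pandas Series in, returns a pandas
# DataFrame whose columns match the fields of the declared StructType.
def split_name(full_name: pd.Series) -> pd.DataFrame:
    parts = full_name.str.split(" ", n=1, expand=True)
    return pd.DataFrame({"first": parts[0], "last": parts[1]})

# With SPARK-23836 in your Spark version, this body would be registered
# roughly as:
#
#   from pyspark.sql.functions import pandas_udf, PandasUDFType
#   split_udf = pandas_udf(split_name, "first string, last string",
#                          PandasUDFType.SCALAR)
#   df.select(split_udf(df["full_name"]).alias("name"))

result = split_name(pd.Series(["Ada Lovelace", "Alan Turing"]))
print(result["first"].tolist())
```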
On Fri, Mar 8, 2019 at 6:09 PM Li Jin <ice.xell...@gmail.com> wrote:
> Hi,
>
> Pandas UDF supports input as struct type. However, note that it will be
> turned into a Python dict, because pandas itself does not have a native
> struct type.
>
> On Fri, Mar 8, 2019 at 2:55 PM peng yu <yupb...@gmail.com> wrote:
>> Yeah, that seems like what I want. Does the scalar Pandas UDF support
>> a StructType input too?
>>
>> On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <cutl...@gmail.com> wrote:
>>> Hi Peng,
>>>
>>> I just added support for a scalar Pandas UDF to return a StructType as
>>> a Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836.
>>> Is that the functionality you are looking for?
>>>
>>> Bryan
>>>
>>> On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
>>>> Right now I'm using the columns-at-a-time mapping:
>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>>>
>>>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>>> Maybe; it depends on what you're doing. It sounds like you are trying
>>>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>>>> you're doing vectorized? If not, this may not help much.
>>>>> Just make the pandas Series into a DataFrame if you want, and turn a
>>>>> single column back into a Series?
>>>>>
>>>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>>>>> pandas/arrow is for memory efficiency, and mapPartitions is only
>>>>>> available on RDDs; sure, I can do everything with RDDs.
>>>>>>
>>>>>> But I thought that's the whole point of having pandas_udf: so my
>>>>>> program runs faster and consumes less memory?
>>>>>>
>>>>>> On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>> Are you just applying a function to every row in the DataFrame? You
>>>>>>> don't need pandas at all.
>>>>>>> Just get the RDD of Row from it, map a
>>>>>>> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>>>>>> that operates on all columns and returns a new value. mapPartitions
>>>>>>> is also available if you want to transform an iterator of Row into
>>>>>>> another iterator of Row.
>>>>>>>
>>>>>>> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>> It is very similar to SCALAR, but for SCALAR the output can't be a
>>>>>>>> struct/row, and the input has to be a pd.Series, which doesn't
>>>>>>>> support a row.
>>>>>>>>
>>>>>>>> I'm doing TensorFlow batch inference in Spark:
>>>>>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>>>>>>>
>>>>>>>> I have to do the groupBy in order to use the apply function; I'm
>>>>>>>> wondering why we don't just enable apply on a DataFrame?
>>>>>>>>
>>>>>>>> On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>> Are you looking for SCALAR? That lets you map one row to one row,
>>>>>>>>> but do it more efficiently in batch. What are you trying to do?
>>>>>>>>>
>>>>>>>>> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>>> I'm looking for a mapPartition(pandas_udf) for a pyspark.DataFrame:
>>>>>>>>>>
>>>>>>>>>> ```
>>>>>>>>>> @pandas_udf(df.schema, PandasUDFType.MAP)
>>>>>>>>>> def do_nothing(pandas_df):
>>>>>>>>>>     return pandas_df
>>>>>>>>>>
>>>>>>>>>> new_df = df.mapPartition(do_nothing)
>>>>>>>>>> ```
>>>>>>>>>>
>>>>>>>>>> pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>>>>>>>>> just MAP?
>>>>>>>>>>
>>>>>>>>>> On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com> wrote:
>>>>>>>>>>> Are you looking for @pandas_udf in Python? Or just mapPartitions?
>>>>>>>>>>> Those exist already.
>>>>>>>>>>>
>>>>>>>>>>> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com> wrote:
>>>>>>>>>>>> There is a nice map_partition function in R, `dapply`, so that a
>>>>>>>>>>>> user can pass a row to a UDF.
>>>>>>>>>>>>
>>>>>>>>>>>> I'm wondering why we don't have that in Python?
>>>>>>>>>>>>
>>>>>>>>>>>> I'm trying to have a map_partition function with pandas_udf
>>>>>>>>>>>> support.
>>>>>>>>>>>>
>>>>>>>>>>>> thanks!
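The groupBy-before-apply workaround discussed in the thread (a DataFrame-in, DataFrame-out pandas UDF via GROUPED_MAP, since no MAP type exists) can be sketched as follows. The Spark wiring is hedged in comments; the UDF body is plain pandas and runs standalone, and the `feature`/`prediction` columns and the doubling stand-in for model inference are assumptions for illustration.

```python
import pandas as pd

# DataFrame-in, DataFrame-out body, the shape a GROUPED_MAP pandas UDF
# expects. Here the "model" is a placeholder that doubles a column,
# standing in for per-batch inference (e.g. a TensorFlow model.predict).
def add_prediction(batch: pd.DataFrame) -> pd.DataFrame:
    out = batch.copy()
    out["prediction"] = out["feature"] * 2.0  # placeholder for real inference
    return out

# Spark wiring (Spark 2.3+), roughly:
#
#   from pyspark.sql.functions import (pandas_udf, PandasUDFType,
#                                      spark_partition_id)
#   udf = pandas_udf(add_prediction, "feature double, prediction double",
#                    PandasUDFType.GROUPED_MAP)
#   # group on the partition id so each task's rows arrive at the UDF
#   # together as one pandas DataFrame
#   result = df.groupBy(spark_partition_id()).apply(udf)

batch = pd.DataFrame({"feature": [1.0, 2.0, 3.0]})
print(add_prediction(batch)["prediction"].tolist())
```

Grouping on `spark_partition_id()` avoids inventing a real key just to satisfy `groupBy`, at the cost of a shuffle, which is exactly the overhead the thread is asking to eliminate.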