Re: [pyspark] dataframe map_partition

Li Jin Fri, 08 Mar 2019 15:10:24 -0800

Hi,

Pandas UDF supports struct type as input. However, note that the struct will
be turned into a Python dict, because pandas itself does not have a native
struct type.
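
To illustrate the point above, here is a minimal pandas-only sketch of what
such an input might look like inside the UDF, assuming each struct value
arrives as a Python dict as described. The field names `x` and `y` and the
helper `total` are hypothetical, and the exact representation may differ
between Spark versions.

```python
import pandas as pd

# Hypothetical batch: a struct<x: int, y: double> column as it might appear
# inside the UDF if every struct value is delivered as a Python dict.
batch = pd.Series([{"x": 1, "y": 0.5}, {"x": 2, "y": 1.5}])


def total(structs: pd.Series) -> pd.Series:
    """Add the two (hypothetical) struct fields for every row in the batch."""
    # Expand the dicts into real columns so the arithmetic stays vectorized
    # instead of looping over the dicts row by row.
    fields = pd.DataFrame(structs.tolist(), index=structs.index)
    return fields["x"] + fields["y"]


result = total(batch)
```

Because the Series holds plain dicts, its dtype is `object`; expanding it into
a pandas DataFrame first is one way to get back to column-at-a-time operations.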

On Fri, Mar 8, 2019 at 2:55 PM peng yu <[email protected]> wrote:


> Yeah, that seems to be most likely what I wanted. Does the scalar Pandas UDF
> support a StructType as input too?
>
> On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <[email protected]> wrote:
>
>> Hi Peng,
>>
>> I just added support for scalar Pandas UDF to return a StructType as a
>> Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836.
>> Is that the functionality you are looking for?
>>
>> Bryan
>>
>> On Thu, Mar 7, 2019 at 1:13 PM peng yu <[email protected]> wrote:
>>
>>> Right now, I'm using the columns-at-a-time mapping:
>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>>
>>>
>>>
>>>
>>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <[email protected]> wrote:
>>>
>>>> Maybe, it depends on what you're doing. It sounds like you are trying
>>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>>> you're doing vectorized? If not, pandas may not help much.
>>>> You could just turn the pandas Series into a DataFrame if you want, and
>>>> turn a single column back into a Series.
>>>>
>>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <[email protected]> wrote:
>>>> >
>>>> > pandas/arrow is for memory efficiency, and mapPartitions is only
>>>> > available on RDDs. For sure, I can do everything with RDDs.
>>>> >
>>>> > But I thought that was the whole point of having pandas_udf, so that my
>>>> > program runs faster and consumes less memory?
>>>> >
>>>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <[email protected]> wrote:
>>>> >>
>>>> >> Are you just applying a function to every row in the DataFrame? Then you
>>>> >> don't need pandas at all. Just get the RDD of Row from it, map a
>>>> >> UDF that makes another Row, and go back to a DataFrame. Or make a UDF
>>>> >> that operates on all columns and returns a new value. mapPartitions is
>>>> >> also available if you want to transform an iterator of Row to another
>>>> >> iterator of Row.
>>>> >>
>>>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu <[email protected]> wrote:
>>>> >> >
>>>> >> > It is very similar to SCALAR, but for SCALAR the output can't be a
>>>> >> > struct/row and the input has to be a pd.Series, which doesn't support a row.
>>>> >> >
>>>> >> > I'm doing TensorFlow batch inference in Spark:
>>>> >> > https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>>> >> >
>>>> >> > There I have to do a groupBy in order to use the apply function.
>>>> >> > I'm wondering, why not just enable apply on the df?
>>>> >> >
>>>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <[email protected]> wrote:
>>>> >> >>
>>>> >> >> Are you looking for SCALAR? That lets you map one row to one row, but
>>>> >> >> does it more efficiently in batches. What are you trying to do?
>>>> >> >>
>>>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu <[email protected]> wrote:
>>>> >> >> >
>>>> >> >> > I'm looking for a mapPartition(pandas_udf) for a
>>>> >> >> > PySpark DataFrame.
>>>> >> >> >
>>>> >> >> > ```
>>>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>>>> >> >> > def do_nothing(pandas_df):
>>>> >> >> >     return pandas_df
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > new_df = df.mapPartition(do_nothing)
>>>> >> >> > ```
>>>> >> >> > pandas_udf only supports SCALAR or GROUPED_MAP. Why not support
>>>> >> >> > just MAP?
>>>> >> >> >
>>>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <[email protected]> wrote:
>>>> >> >> >>
>>>> >> >> >> Are you looking for @pandas_udf in Python? Or just
>>>> >> >> >> mapPartition? Those exist already.
>>>> >> >> >>
>>>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <[email protected]> wrote:
>>>> >> >> >>>
>>>> >> >> >>> There is a nice map_partition function in R, `dapply`, so
>>>> >> >> >>> that the user can pass a row to a UDF.
>>>> >> >> >>>
>>>> >> >> >>> I'm wondering why we don't have that in Python?
>>>> >> >> >>>
>>>> >> >> >>> I'm trying to get a map_partition function with pandas_udf
>>>> >> >> >>> support.
>>>> >> >> >>>
>>>> >> >> >>> Thanks!
>>>>
>>>
