Hi,

Pandas UDF supports input as struct type. However, note that it will be
turned into python dict because pandas itself does not have native struct
type.
On Fri, Mar 8, 2019 at 2:55 PM peng yu <yupb...@gmail.com> wrote:

> Yeah, that seems most likely i have wanted, does the scalar Pandas UDF
> support input is a StructType too ?
>
> On Fri, Mar 8, 2019 at 2:25 PM Bryan Cutler <cutl...@gmail.com> wrote:
>
>> Hi Peng,
>>
>> I just added support for scalar Pandas UDF to return a StructType as a
>> Pandas DataFrame in https://issues.apache.org/jira/browse/SPARK-23836.
>> Is that the functionality you are looking for?
>>
>> Bryan
>>
>> On Thu, Mar 7, 2019 at 1:13 PM peng yu <yupb...@gmail.com> wrote:
>>
>>> right now, i'm using the colums-at-a-time mapping
>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/utils.py#L129
>>>
>>>
>>>
>>>
>>> On Thu, Mar 7, 2019 at 4:00 PM Sean Owen <sro...@gmail.com> wrote:
>>>
>>>> Maybe, it depends on what you're doing. It sounds like you are trying
>>>> to do row-at-a-time mapping, even on a pandas DataFrame. Is what
>>>> you're doing vectorized? may not help much.
>>>> Just make the pandas Series into a DataFrame if you want? and a single
>>>> col back to Series?
>>>>
>>>> On Thu, Mar 7, 2019 at 2:45 PM peng yu <yupb...@gmail.com> wrote:
>>>> >
>>>> > pandas/arrow is for the memory efficiency, and mapPartitions is only
>>>> available to rdds, for sure i can do everything in rdd.
>>>> >
>>>> > But i thought that's the whole point of having pandas_udf, so my
>>>> program run faster and consumes less memory ?
>>>> >
>>>> > On Thu, Mar 7, 2019 at 3:40 PM Sean Owen <sro...@gmail.com> wrote:
>>>> >>
>>>> >> Are you just applying a function to every row in the DataFrame? you
>>>> >> don't need pandas at all. Just get the RDD of Row from it and map a
>>>> >> UDF that makes another Row, and go back to DataFrame. Or make a UDF
>>>> >> that operates on all columns and returns a new value. mapPartitions
>>>> is
>>>> >> also available if you want to transform an iterator of Row to another
>>>> >> iterator of Row.
>>>> >>
>>>> >> On Thu, Mar 7, 2019 at 2:33 PM peng yu <yupb...@gmail.com> wrote:
>>>> >> >
>>>> >> > it is very similar to SCALAR, but for SCALAR the output can't be
>>>> struct/row and the input has to be pd.Series, which doesn't support a row.
>>>> >> >
>>>> >> > I'm doing tensorflow batch inference in spark,
>>>> https://github.com/yupbank/tf-spark-serving/blob/master/tss/serving.py#L108
>>>> >> >
>>>> >> > Which i have to do the groupBy in order to use the apply function,
>>>> i'm wondering why not just enable apply to df ?
>>>> >> >
>>>> >> > On Thu, Mar 7, 2019 at 3:15 PM Sean Owen <sro...@gmail.com> wrote:
>>>> >> >>
>>>> >> >> Are you looking for SCALAR? that lets you map one row to one row,
>>>> but
>>>> >> >> do it more efficiently in batch. What are you trying to do?
>>>> >> >>
>>>> >> >> On Thu, Mar 7, 2019 at 2:03 PM peng yu <yupb...@gmail.com> wrote:
>>>> >> >> >
>>>> >> >> > I'm looking for a mapPartition(pandas_udf) for  a
>>>> pyspark.Dataframe.
>>>> >> >> >
>>>> >> >> > ```
>>>> >> >> > @pandas_udf(df.schema, PandasUDFType.MAP)
>>>> >> >> > def do_nothing(pandas_df):
>>>> >> >> >     return pandas_df
>>>> >> >> >
>>>> >> >> >
>>>> >> >> > new_df = df.mapPartition(do_nothing)
>>>> >> >> > ```
>>>> >> >> > pandas_udf only support scala or GROUPED_MAP.  Why not support
>>>> just Map?
>>>> >> >> >
>>>> >> >> > On Thu, Mar 7, 2019 at 2:57 PM Sean Owen <sro...@gmail.com>
>>>> wrote:
>>>> >> >> >>
>>>> >> >> >> Are you looking for @pandas_udf in Python? Or just
>>>> mapPartition? Those exist already
>>>> >> >> >>
>>>> >> >> >> On Thu, Mar 7, 2019, 1:43 PM peng yu <yupb...@gmail.com>
>>>> wrote:
>>>> >> >> >>>
>>>> >> >> >>> There is a nice map_partition function in R `dapply`.  so
>>>> that user can pass a row to udf.
>>>> >> >> >>>
>>>> >> >> >>> I'm wondering why we don't have that in python?
>>>> >> >> >>>
>>>> >> >> >>> I'm trying to have a map_partition function with pandas_udf
>>>> supported
>>>> >> >> >>>
>>>> >> >> >>> thanks!
>>>>
>>>

Reply via email to