[GitHub] [spark] BryanCutler commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type hints in pandas UDF and rename/move inconsistent pandas UDF types

GitBox Tue, 14 Jan 2020 10:54:59 -0800

BryanCutler commented on issue #27165: [SPARK-28264][PYTHON][SQL] Support type 
hints in pandas UDF and rename/move inconsistent pandas UDF types
URL: https://github.com/apache/spark/pull/27165#issuecomment-574320076
 
 
   >On the first one they can wrap an existing function can’t they? They don’t
   need to modify it.
   
   Yeah, it's just an inconvenience for the user to define a wrapper function 
and then wrap that again in `pandas_udf`.
   
   >That's my concern too which makes me to decide not to deprecate existing 
pandas UDFs for now. If we add proposal 1 way too, then we will have 3 ways 
(old one, proposal 1, and proposal 2) which I think might be more confusing. So 
my current take is to add the type hint way first, and see if users like it or 
not in Spark 3 (without any deprecation for now)
   
   Sure, this sounds like a good path forward, @HyukjinKwon. My concerns 
apply only if type hints are required.
   
   >The current udf returns a Column which is able to bind with other SQL 
expressions whereas 
   
   Actually, `pandas_udf` returns a `UserDefinedFunction`. Only when that is 
called with `Column`s as input does it return a `Column`.
   
   >We do have methods that have similar semantics in both Scala ("strongly" 
typed Dataset transformations), in SparkR (already mentioned) and 3rd party 
extensions (sparklyr's spark_apply). None requires end users to be aware of 
nitty-gritty details of the internal execution model.
   
   @zero323 Maybe I'm thinking of this from the user's perspective, but to me 
`pandas_udf` in this context is really just a function bound with an output 
schema. So it seems like we are just talking about 
   ```python
   df.groupby("...").applyInPandas(func, schema=df.schema)
   ```
   vs
   ```python
   df.groupby("...").apply(pandas_udf(func, returnType=df.schema))
   ```
   That is just a subtle difference, so I'm not sure how this requires the 
user to know the details of internal execution.
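
To make the "function bound with an output schema" point concrete, here is a rough pure-Python sketch. It is a toy model, not Spark code: `BoundFunction`, `apply_in`, `apply`, and `subtract_min` are hypothetical names made up for illustration, groups are plain lists of dicts rather than pandas DataFrames, and the "schema" is just a list of column names. It only aims to show that the two call styles above can hand the same function and the same schema to the same machinery.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, int]
Schema = List[str]  # toy schema: just the output column names


@dataclass(frozen=True)
class BoundFunction:
    """Hypothetical stand-in for the pandas_udf result: func plus schema."""
    func: Callable[[List[Row]], List[Row]]
    schema: Schema


def _run(func, schema, groups):
    # Call func once per group and keep only the columns named in schema.
    out = []
    for rows in groups.values():
        for row in func(rows):
            out.append({name: row[name] for name in schema})
    return out


def apply_in(groups, func, schema):
    """Style 1: function and schema passed separately."""
    return _run(func, schema, groups)


def apply(groups, bound):
    """Style 2: function and schema passed as one bound object."""
    return _run(bound.func, bound.schema, groups)


def subtract_min(rows):
    # Plain user function; it knows nothing about how it will be invoked.
    low = min(r["v"] for r in rows)
    return [{"k": r["k"], "v": r["v"] - low} for r in rows]


groups = {1: [{"k": 1, "v": 5}, {"k": 1, "v": 7}], 2: [{"k": 2, "v": 3}]}
schema = ["k", "v"]

result_separate = apply_in(groups, subtract_min, schema)
result_bound = apply(groups, BoundFunction(subtract_min, schema))
```

In this toy version both results are identical, and `subtract_min` is written the same way for either style. Whether real Spark users perceive the two forms as equally simple is of course the open question in this thread.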

