leewyang commented on PR #37734:
URL: https://github.com/apache/spark/pull/37734#issuecomment-1315678614

   BTW, I'm seeing a change in behavior in `pandas_udf` when used with `limit` in the latest master branch of Spark (vs. 3.3.1), per this example code:
   ```python
   import numpy as np
   import pandas as pd
   from pyspark.sql import SparkSession
   from pyspark.sql.functions import pandas_udf
   from pyspark.sql.types import DoubleType

   spark = SparkSession.builder.getOrCreate()

   data = np.arange(0, 1000, dtype=np.float64)
   pdf = pd.DataFrame(data, columns=['x'])
   df = spark.createDataFrame(pdf)

   @pandas_udf(returnType=DoubleType())
   def times_two(x: pd.Series) -> pd.Series:
       # Prints the size of each batch handed to the UDF (visible in executor logs).
       print(x.shape)
       return x * 2

   # 3.3.1:  shape = (10,)  -- the limit is applied before the UDF runs
   # master: shape = (500,) -- the UDF sees a full partition
   df.limit(10).withColumn("x2", times_two("x")).collect()
   ```
   
   Not sure if this is a regression or an intentional change, but it does impact performance for this PR: a given model will be run against 500 rows instead of 10 (even though the final result contains only 10 rows). Basically, it looks like `limit` is now being applied _after_ running the `pandas_udf` on a full partition (500 rows here), whereas it used to be applied _before_ running the `pandas_udf`.
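   For anyone trying to reproduce this, one way to confirm where the limit lands relative to the UDF is to inspect the physical plan; this is just a sketch, and the exact operator names shown (e.g. `ArrowEvalPython`, `LocalLimit`/`GlobalLimit`) may differ across Spark versions:
   ```python
   # Print the physical plan to see whether the limit is applied before
   # or after the pandas UDF evaluation. If the ArrowEvalPython node sits
   # below the limit in the plan, the UDF runs on the full partition first.
   df.limit(10).withColumn("x2", times_two("x")).explain(mode="formatted")
   ```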

