leewyang commented on PR #37734:
URL: https://github.com/apache/spark/pull/37734#issuecomment-1315678614
BTW, I'm seeing a change in behavior in `pandas_udf` when used with
`limit` in the latest master branch of Spark (vs. 3.3.1), per this example code:
```python
import numpy as np
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import DoubleType
data = np.arange(0, 1000, dtype=np.float64)
pdf = pd.DataFrame(data, columns=['x'])
df = spark.createDataFrame(pdf)
@pandas_udf(returnType=DoubleType())
def times_two(x: pd.Series) -> pd.Series:
    print(x.shape)  # log the size of each batch the UDF receives
    return x * 2
# 3.3.1: shape = (10,)
# master: shape = (500,)
df.limit(10).withColumn("x2", times_two("x")).collect()
```
I'm not sure if this is a regression or an intentional change, but it does
impact performance for this PR, since a given model will now be run against 500
rows instead of 10 (even though the final result only shows 10 rows).
Basically, it looks like the `limit` is now being applied _after_ running
the `pandas_udf` on a full partition, whereas it used to be applied _before_
running the `pandas_udf`.
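
In case it helps with triage, here's a minimal sketch of how I'd verify this locally (building on the snippet above, and assuming a standard PySpark session bound to `spark`; the `limited` variable name is just illustrative). `explain()` should show where the limit lands relative to the `ArrowEvalPython` node in the physical plan, and round-tripping the limited rows through pandas is one way to force the limit to take effect before the UDF runs:
```python
# Inspect the physical plan: if the limit is being applied after the UDF,
# the Limit operator(s) should appear above the ArrowEvalPython node.
df.limit(10).withColumn("x2", times_two("x")).explain()

# Possible workaround (sketch): materialize the limited rows on the driver
# first, so the UDF only ever sees 10 rows. This adds a driver round trip,
# so it's only reasonable for small limits.
limited = spark.createDataFrame(df.limit(10).toPandas())
limited.withColumn("x2", times_two("x")).collect()
```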