ueshin opened a new pull request, #51546:
URL: https://github.com/apache/spark/pull/51546
### What changes were proposed in this pull request?
Skips `Row` object creation in Arrow-optimized UDTF execution.
### Why are the changes needed?
Creating `Row` objects is expensive, and Arrow-optimized UDTF
execution does not need them.
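To illustrate the kind of overhead involved, here is a minimal sketch that does not use PySpark at all. `NamedRow` is a hypothetical stand-in for a tuple subclass that carries field names; it is not `pyspark.sql.Row`, and the timings it prints only show the general cost of a per-row Python constructor call versus plain tuples.

```python
import timeit


class NamedRow(tuple):
    """Hypothetical stand-in for a row type with field names; not pyspark's Row."""

    def __new__(cls, fields, values):
        row = super().__new__(cls, values)
        row.__fields__ = fields  # extra per-row bookkeeping
        return row


FIELDS = ("i", "s")
COLUMNS = [list(range(1000)), [str(i) for i in range(1000)]]


def as_rows():
    # One Python-level constructor call per row.
    return [NamedRow(FIELDS, values) for values in zip(*COLUMNS)]


def as_tuples():
    # zip already yields plain tuples; no wrapper object is built.
    return list(zip(*COLUMNS))


if __name__ == "__main__":
    print("rows:  ", timeit.timeit(as_rows, number=200))
    print("tuples:", timeit.timeit(as_tuples, number=200))
```

Both functions return the same values; only the per-row container differs.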
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
The existing tests, and manual benchmarks.
```py
def profile(f, *args, _n=10, **kwargs):
    import cProfile
    import pstats
    import gc

    st = None
    # Warm-up runs, not profiled.
    for _ in range(5):
        f(*args, **kwargs)
    # Profile _n runs and accumulate their stats.
    for _ in range(_n):
        gc.collect()
        with cProfile.Profile() as pr:
            ret = f(*args, **kwargs)
        if st is None:
            st = pstats.Stats(pr)
        else:
            st.add(pstats.Stats(pr))
    st.sort_stats("time", "cumulative").print_stats()
    return ret


from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
from pyspark.sql.types import *

data = [
    (i if i % 1000 else None, str(i))
    for i in range(1000000)
]
schema = (
    StructType()
    .add("i", IntegerType(), nullable=True)
    .add("s", StringType(), nullable=True)
)


def to_arrow():
    return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)


def from_arrow(tbl, return_as_tuples):
    return ArrowTableToRowsConversion.convert(tbl, schema, return_as_tuples=return_as_tuples)


tbl = to_arrow()
profile(from_arrow, tbl, return_as_tuples=False)
profile(from_arrow, tbl, return_as_tuples=True)
```
- before (`return_as_tuples=False`)
```
60655810 function calls in 14.112 seconds
```
- after (`return_as_tuples=True`)
```
20328060 function calls in 5.613 seconds
```
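For reference, the ratios implied by the two totals above can be worked out directly. This is only arithmetic on the reported figures, and the variable names are illustrative. Because `profile` adds the stats of each profiled run, the totals appear to cover all `_n=10` runs and include cProfile overhead.

```python
# Totals reported by print_stats() above.
before_calls, before_secs = 60655810, 14.112  # return_as_tuples=False
after_calls, after_secs = 20328060, 5.613     # return_as_tuples=True

speedup = before_secs / after_secs
call_ratio = before_calls / after_calls
time_saved = 1 - after_secs / before_secs

print(f"speedup: {speedup:.2f}x")                  # speedup: 2.51x
print(f"function calls: {call_ratio:.2f}x fewer")  # function calls: 2.98x fewer
print(f"time saved: {time_saved:.1%}")             # time saved: 60.2%
```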
### Was this patch authored or co-authored using generative AI tooling?
No.