ueshin opened a new pull request, #51508:
URL: https://github.com/apache/spark/pull/51508

   ### What changes were proposed in this pull request?
   
   Optimizes `ArrowTableToRowsConversion.convert` to improve its performance, 
similar to https://github.com/apache/spark/pull/51482.
   
   - Calculate `fields` in advance
   - Move conversions to `columnar_data` creation
   - Build `rows` with a list comprehension to avoid repeated `list.append` calls
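
The three points above can be illustrated with a minimal, pure-Python sketch. It is not the code from this PR: the `Columns` mapping stands in for an Arrow table, and `rows_with_append`, `rows_columnar`, and `convert_column` are hypothetical names used only for illustration.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

# Hypothetical stand-in for an Arrow table: column name -> list of raw values.
Columns = Dict[str, List[Any]]
Converter = Optional[Callable[[Any], Any]]
Row = Tuple[Any, ...]


def rows_with_append(columns: Columns, converters: Dict[str, Converter]) -> List[Row]:
    """Row-at-a-time version: converter lookups and appends sit in the hot loop."""
    names = list(columns)
    n = len(columns[names[0]]) if names else 0
    rows: List[Row] = []
    for i in range(n):
        values = []
        for name in names:
            conv = converters.get(name)  # looked up again for every row
            v = columns[name][i]
            values.append(v if conv is None or v is None else conv(v))
        rows.append(tuple(values))
    return rows


def convert_column(values: List[Any], conv: Converter) -> List[Any]:
    """Convert one whole column; columns without a converter are passed through."""
    if conv is None:
        return values
    return [None if v is None else conv(v) for v in values]


def rows_columnar(columns: Columns, converters: Dict[str, Converter]) -> List[Row]:
    """Column-at-a-time version: names computed once, conversion done per column."""
    fields = list(columns)  # calculated in advance
    columnar_data = [convert_column(columns[name], converters.get(name)) for name in fields]
    # Build all rows in one comprehension instead of repeated list.append calls.
    return [tuple(vals) for vals in zip(*columnar_data)]


columns = {"i": [None, 1, 2], "s": ["0", "1", "2"]}
converters = {"s": int}
print(rows_columnar(columns, converters))
```

Both functions return the same rows; the columnar version does the per-field work once per column rather than once per value.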
   
   ### Why are the changes needed?
   
   `ArrowTableToRowsConversion.convert` has several sources of performance overhead.
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   The existing tests, and manual benchmarks.
   
   ```py
   def profile(f, *args, _n=10, **kwargs):
       import cProfile
       import pstats
       import gc
       st = None
       for _ in range(5):
           f(*args, **kwargs)
       for _ in range(_n):
           gc.collect()
           with cProfile.Profile() as pr:
               ret = f(*args, **kwargs)
           if st is None:
               st = pstats.Stats(pr)
           else:
               st.add(pstats.Stats(pr))
       st.sort_stats("time", "cumulative").print_stats()
       return ret
   
   from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
   from pyspark.sql.types import IntegerType, StringType, StructType
   
   data = [
       (i if i % 1000 else None, str(i), i)
       for i in range(1000000)
   ]
   schema = (
       StructType()
       .add("i", IntegerType(), nullable=True)
       .add("s", StringType(), nullable=True)
       .add("ii", IntegerType(), nullable=False)
   )
   
   def to_arrow():
       return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)
   
   def from_arrow(tbl):
       return ArrowTableToRowsConversion.convert(tbl, schema)
   
   tbl = to_arrow()
   profile(from_arrow, tbl)
   ```
   
   - before
   
   ```
   100983380 function calls in 24.509 seconds
   ```
   
   - after
   
   ```
   70655910 function calls in 16.947 seconds
   ```
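
To put the two totals in perspective, the short calculation below derives the relative improvement. It assumes the reported figures are cumulative over the `_n=10` profiled runs (the `profile` helper merges each run with `st.add`), so the per-run numbers are the totals divided by 10. The timings also include `cProfile` overhead, so they are best read as a relative comparison.

```python
# Totals copied from the "before" and "after" profiler output above.
before_calls, before_secs = 100_983_380, 24.509
after_calls, after_secs = 70_655_910, 16.947
runs = 10  # assumption: the default _n of the profile() helper was used

call_reduction = 1 - after_calls / before_calls  # fraction of calls removed
time_reduction = 1 - after_secs / before_secs    # fraction of time saved
speedup = before_secs / after_secs               # ratio of profiled run times

print(f"function calls reduced by {call_reduction:.1%}")
print(f"profiled time reduced by {time_reduction:.1%}")
print(f"speedup: {speedup:.2f}x")
print(f"per run: {before_secs / runs:.2f}s -> {after_secs / runs:.2f}s")
```

Under these assumptions, the change removes roughly 30% of the function calls and makes the profiled conversion about 1.45x faster.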
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

