ueshin opened a new pull request, #51542:
URL: https://github.com/apache/spark/pull/51542

   ### What changes were proposed in this pull request?
   
   Skips calling a conversion function when it is an identity function.
   
   ### Why are the changes needed?
   
   Function calls are relatively expensive in Python, so on the critical path we should skip the call when the conversion function is an identity function.
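As a rough illustration of the idea (not the actual PySpark implementation), the sketch below uses a hypothetical convention in which `None` stands for an identity conversion, so the per-value call can be skipped. The names `Converter` and `convert_rows` are assumptions made up for this example.

```python
from typing import Any, Callable, List, Optional, Sequence, Tuple

# Hypothetical convention for this sketch: None means "identity, no call needed".
Converter = Optional[Callable[[Any], Any]]


def convert_rows(
    rows: Sequence[Tuple[Any, ...]], converters: Sequence[Converter]
) -> List[Tuple[Any, ...]]:
    """Apply per-column converters, skipping the call for identity columns."""
    if all(conv is None for conv in converters):
        # Every column is an identity conversion: no per-value calls at all.
        return [tuple(row) for row in rows]
    return [
        tuple(
            value if conv is None else conv(value)
            for value, conv in zip(row, converters)
        )
        for row in rows
    ]


rows = [(1, "a"), (2, "b")]
print(convert_rows(rows, [None, str.upper]))  # [(1, 'A'), (2, 'B')]
```

With a million rows, each skipped identity call removes one function call per value, which is the kind of saving the call counts in the benchmark reflect.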
   
   ### Does this PR introduce _any_ user-facing change?
   
   No.
   
   ### How was this patch tested?
   
   Existing tests and manual benchmarks, using the following script:
   
   ```py
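   def profile(f, *args, _n=10, **kwargs):
       """Profile f(*args, **kwargs) over _n runs and print the merged stats."""
       # _n is keyword-only so that positional arguments are forwarded to f.
       import cProfile
       import pstats
       import gc
       st = None
       # Warm up so that one-time costs do not skew the profile.
       for _ in range(5):
           f(*args, **kwargs)
       for _ in range(_n):
           gc.collect()  # collect garbage before each profiled run
           with cProfile.Profile() as pr:
               ret = f(*args, **kwargs)
           # Accumulate the stats of all profiled runs.
           if st is None:
               st = pstats.Stats(pr)
           else:
               st.add(pstats.Stats(pr))
       st.sort_stats("time", "cumulative").print_stats()
       return ret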
   
   from pyspark.sql.conversion import ArrowTableToRowsConversion, LocalDataToArrowConversion
   from pyspark.sql.types import IntegerType, StringType, StructType
   
   data = [
       (i if i % 1000 else None, str(i), (i, str(i)))
       for i in range(1000000)
   ]
   schema = (
       StructType()
       .add("i", IntegerType(), nullable=True)
       .add("s", StringType(), nullable=True)
       .add("si", StructType().add("i", IntegerType()).add("s", StringType()))
   )
   
   def to_arrow():
       return LocalDataToArrowConversion.convert(data, schema, use_large_var_types=False)  # skipping the input check
   
   def from_arrow(tbl):
       return ArrowTableToRowsConversion.convert(tbl, schema)  # skipping creating rows
   
   tbl = profile(to_arrow)
   profile(from_arrow, tbl)
   ```
   
   - before
   
   ```
   140329810 function calls (140329750 primitive calls) in 12.908 seconds
   180989400 function calls (180989380 primitive calls) in 40.992 seconds
   ```
   
   - after
   
   ```
   80330750 function calls (80330690 primitive calls) in 10.347 seconds
   140989380 function calls (140989360 primitive calls) in 35.979 seconds
   ```
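To put the numbers above in relative terms, here is a small sketch that computes the percentage reduction in function calls and wall time. The `to_arrow` / `from_arrow` labels are an assumption based on the order of the `profile(...)` calls in the benchmark script; the outputs themselves are not labeled.

```python
# (function calls, seconds) copied from the cProfile summaries above.
# Assumption: the first line of each pair is to_arrow, the second is from_arrow.
before = {"to_arrow": (140_329_810, 12.908), "from_arrow": (180_989_400, 40.992)}
after = {"to_arrow": (80_330_750, 10.347), "from_arrow": (140_989_380, 35.979)}


def reduction(old: float, new: float) -> float:
    """Return the relative reduction from old to new, in percent."""
    return 100.0 * (old - new) / old


summary = {}
for name in before:
    calls_pct = round(reduction(before[name][0], after[name][0]), 1)
    secs_pct = round(reduction(before[name][1], after[name][1]), 1)
    summary[name] = (calls_pct, secs_pct)
    print(f"{name}: {calls_pct}% fewer calls, {secs_pct}% less time")
# to_arrow: 42.8% fewer calls, 19.8% less time
# from_arrow: 22.1% fewer calls, 12.2% less time
```

These are cProfile timings, which include profiler overhead, so the relative time savings without profiling may differ.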
   
   ### Was this patch authored or co-authored using generative AI tooling?
   
   No.
   

