eddyxu commented on pull request #31735:
URL: https://github.com/apache/spark/pull/31735#issuecomment-794844245


   Hi @HyukjinKwon, I ran some benchmarks using the following code:
   
   ```python
           df = self.spark.range(1, 10 ** 8, numPartitions=32)
           df = df.cache()
           df.count()
   
           @pandas_udf(ArrayType(ExampleBoxUDT()))
           def array_of_boxes(series: pd.Series) -> pd.Series:
               boxes = []
               for _, i in series.items():
                   boxes.append([ExampleBox(*([i] * 4)), ExampleBox(*([i + 1] * 4))])
               return pd.Series(boxes)
   
           @pandas_udf(ArrayType(ArrayType(FloatType())))
           def array_of_arrays(series: pd.Series) -> pd.Series:
               boxes = []
               for _, i in series.items():
                   boxes.append([[i] * 4, [i + 1] * 4])
               return pd.Series(boxes)
   
           import time
           start = time.time()
           df.withColumn("b", array_of_arrays(df.id)).count()
           print(f"Using non-UDT: {time.time() - start}")
           start = time.time()
           df.withColumn("boxes", array_of_boxes(df.id)).count()
           print(f"Using UDT: {time.time() - start}")
   ```
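
A single wall-clock measurement like the one above can be noisy. As a rough sketch only, a helper along these lines could repeat each measurement; `time_call` is a hypothetical name introduced here for illustration and is not part of Spark or of this PR.

```python
import time
from typing import Callable, List


def time_call(fn: Callable[[], object], repeats: int = 3) -> List[float]:
    """Run `fn` `repeats` times and return each wall-clock duration in seconds.

    Hypothetical helper for illustration; not part of Spark or this PR.
    """
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic clock, suited to short intervals
        fn()
        durations.append(time.perf_counter() - start)
    return durations
```

It could then be called as `time_call(lambda: df.withColumn("boxes", array_of_boxes(df.id)).count())`, comparing the minimum or median of the returned durations for the UDT and non-UDT variants.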
   
   Running it on my 16" MacBook Pro (`Intel(R) Core(TM) i7-9750H CPU @ 2.60GHz`, 32 GB RAM):
   
   Using UDT | Time
   --------- | --------
   Yes       | 0.2784s
   No        | 0.24767s
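
To put the two timings in relative terms, the arithmetic can be sketched as below. The numbers are the ones from the table; `relative_overhead` is a hypothetical helper name used only for this illustration.

```python
# Timings copied from the table above, in seconds.
udt_seconds = 0.2784
non_udt_seconds = 0.24767


def relative_overhead(with_udt: float, without_udt: float) -> float:
    """Return the extra time of the UDT run as a fraction of the non-UDT run."""
    return (with_udt - without_udt) / without_udt


overhead = relative_overhead(udt_seconds, non_udt_seconds)
print(f"UDT overhead: {overhead:.1%}")
```

On these figures the UDT version is roughly 12% slower than the nested-array version, based on one run of each.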

