ueshin opened a new pull request #32775:
URL: https://github.com/apache/spark/pull/32775


   ### What changes were proposed in this pull request?
   
   Introduces `Field` to manage dtypes and `StructField`s.
   
   `InternalFrame` already manages dtypes, but whenever it checks Spark's
data types, column names, or nullabilities, it re-runs the analysis phase
each time it needs them, which causes a performance issue.
   
   This PR introduces a `Field` class that stores the retrieved Spark data
types, column names, and nullabilities so they can be reused. Also, when
those values are already known, it simply updates and reuses them without
asking Spark.
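   The caching idea can be sketched roughly as follows. This is a minimal, hedged illustration of the approach described above, not the actual implementation; the class and attribute names here are hypothetical stand-ins, not Spark-internal APIs:
   
   ```py
   # Minimal sketch of the caching idea: store the schema metadata once
   # and reuse it, instead of re-running analysis on every access.
   # All names below are hypothetical illustrations.
   class CachedField:
       """Caches a column's Spark data type, name, and nullability."""

       def __init__(self, name, spark_type, nullable):
           self.name = name
           self.spark_type = spark_type
           self.nullable = nullable

       def copy(self, name=None, nullable=None):
           # When the new values are already known, just update the cached
           # copy instead of asking Spark to re-analyze the plan.
           return CachedField(
               name if name is not None else self.name,
               self.spark_type,
               nullable if nullable is not None else self.nullable,
           )

   # Usage: build once from an (expensive) schema lookup, then reuse.
   field = CachedField("col", "bigint", nullable=True)
   renamed = field.copy(name="col2")  # no analysis round trip needed
   ```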
   
   ### Why are the changes needed?
   
   Currently there are some performance issues in the pandas-on-Spark layer.
   
   One of them is accessing the Java DataFrame and running the analysis phase 
too many times, especially just to retrieve the current column names or data types.
   
   We should reduce the number of such unnecessary accesses.
   
   ### Does this PR introduce _any_ user-facing change?
   
   Improves the performance in the pandas-on-Spark layer:
   
   ```py
   import pyspark.pandas as ps

   df = ps.read_parquet("/path/to/test.parquet")  # contains ~75 columns
   df = df[(df["col"] > 0) & (df["col"] < 10000)]
   ```
   
   Before the PR, this took about **2.15 sec**; after the PR, about **1.15 sec**.
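   A measurement like the one above can be taken with a simple wall-clock timer along these lines. This sketch times a trivial stand-in workload so it is runnable anywhere; in the PR's benchmark the timed callable would be the pandas-on-Spark filter shown above, and `time_it` is a hypothetical helper, not part of any Spark API:
   
   ```py
   import time

   def time_it(fn):
       """Return (result, elapsed_seconds) for a single call to fn."""
       start = time.perf_counter()
       result = fn()
       return result, time.perf_counter() - start

   # Trivial stand-in workload; the PR's benchmark would time the
   # pandas-on-Spark filter expression instead.
   result, elapsed = time_it(lambda: sum(range(1000)))
   ```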
   
   ### How was this patch tested?
   
   Existing tests.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
