[GitHub] [spark] LuciferYang edited a comment on pull request #30483: [SPARK-33449][SQL] Add File Metadata cache support for Parquet and Orc

GitBox Wed, 03 Feb 2021 01:28:38 -0800


LuciferYang edited a comment on pull request #30483:
URL: https://github.com/apache/spark/pull/30483#issuecomment-772273452



   Simple test:
   ```
   // Parquet runs; for the ORC runs use spark.read.orc("/xxx/data") instead
   val df = spark.read.parquet("/xxx/data")
   
   df.createOrReplaceTempView("test_table")
   
   spark.sql("select sum(a), sum(b), sum(c) from test_table where id = 1381339").show
   spark.sql("select sum(a), sum(b), sum(c) from test_table where id = 28643411").show
   ```
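   
   A minimal sketch of how the test above could be scripted end to end. It assumes the flag can be set per session with `spark.conf.set`; this comment does not say whether it is a runtime or a static conf. The object name `FileMetaCacheCheck` and the helper `queryFor` are hypothetical. The config key, the `/xxx/data` placeholder, and the ids are the ones quoted in this comment, and the key only takes effect on a build that includes this PR.
   
   ```scala
   import org.apache.spark.sql.SparkSession

   object FileMetaCacheCheck {
     // Key as quoted in this comment; only meaningful on a build with this PR applied.
     val ParquetFlag = "spark.sql.fileMetaCache.parquet.enabled"

     // Hypothetical helper: builds the test query for one id.
     def queryFor(id: Long): String =
       s"select sum(a), sum(b), sum(c) from test_table where id = $id"

     def main(args: Array[String]): Unit = {
       val spark = SparkSession.builder().appName("file-meta-cache-check").getOrCreate()

       // Assumption: the flag is a runtime conf that can be toggled on the session.
       spark.conf.set(ParquetFlag, "true")

       // Placeholder path kept exactly as in the comment.
       val df = spark.read.parquet("/xxx/data")
       df.createOrReplaceTempView("test_table")

       // Run both queries, then compare footer reads and input size in the Spark UI.
       Seq(1381339L, 28643411L).foreach(id => spark.sql(queryFor(id)).show())

       spark.stop()
     }
   }
   ```
   
   Running it once with the flag set to `"false"` and once with `"true"` would give the two Parquet cases; swapping in `spark.read.orc` and the ORC key would give the other two.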
   
   Data Source V1:
   
   1. parquet with `spark.sql.fileMetaCache.parquet.enabled=false`
   
   **Each footer was read 4 times; each of the two queries read 6.9m of data.**
   
   
![image](https://user-images.githubusercontent.com/1475305/106707904-faf6bb00-662c-11eb-8ce8-5492af5b3528.png)
   
![image](https://user-images.githubusercontent.com/1475305/106707931-0ba73100-662d-11eb-8080-cf8885852e3c.png)
   
   2. parquet with `spark.sql.fileMetaCache.parquet.enabled=true`
   
   **Each footer was read once; the 1st query read 5m of data and the 2nd query read 3m.**
   
   
![image](https://user-images.githubusercontent.com/1475305/106707982-1e216a80-662d-11eb-9832-66728312ac08.png)
   
![image](https://user-images.githubusercontent.com/1475305/106708048-385b4880-662d-11eb-8cca-2b0e4029affa.png)
   
   
   3. orc with `spark.sql.fileMetaCache.orc.enabled=false`
   
   **Each footer was read 4 times; each of the two queries read 52.3m of data.**
   
   
![image](https://user-images.githubusercontent.com/1475305/106708161-5d4fbb80-662d-11eb-81df-656f9be55475.png)
   
![image](https://user-images.githubusercontent.com/1475305/106708209-70628b80-662d-11eb-93cd-a3cacca8f667.png)
   
   4. orc with `spark.sql.fileMetaCache.orc.enabled=true`
   
   **Each footer was read once; the 1st query read 45.5m of data and the 2nd query read 38.7m.**
   
   
![image](https://user-images.githubusercontent.com/1475305/106708235-7ce6e400-662d-11eb-861c-d8a26a247623.png)
   
![image](https://user-images.githubusercontent.com/1475305/106708265-85d7b580-662d-11eb-899e-2155104a9427.png)
   
   
   The Data Source V2 API shows similar results.
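   
   To put the numbers above side by side, here is a small sketch that computes the relative reduction in data read and in footer reads. The figures are copied from the four cases above, in the same "m" units as reported; the object name `ReadReduction` and the helper `saved` are hypothetical and not part of the PR.
   
   ```scala
   object ReadReduction {
     // Fraction saved relative to the baseline, e.g. 0.25 means 25% less.
     def saved(baseline: Double, withCache: Double): Double =
       (baseline - withCache) / baseline

     // Data read per query, as reported above (cache disabled vs. enabled).
     val parquetBaseline = 6.9
     val parquetCached = Seq(5.0, 3.0)
     val orcBaseline = 52.3
     val orcCached = Seq(45.5, 38.7)

     def main(args: Array[String]): Unit = {
       def report(name: String, baseline: Double, cached: Seq[Double]): Unit =
         cached.zipWithIndex.foreach { case (v, i) =>
           println(f"$name query ${i + 1}: ${saved(baseline, v) * 100}%.1f%% less data read")
         }

       report("parquet", parquetBaseline, parquetCached)
       report("orc", orcBaseline, orcCached)

       // Footer reads went from 4 per footer to 1 in both formats.
       println(f"footer reads: ${saved(4, 1) * 100}%.0f%% fewer")
     }
   }
   ```
   
   With these inputs the reduction in data read comes out at roughly 28% and 57% for the two parquet queries and roughly 13% and 26% for the two orc queries, with 75% fewer footer reads in both formats.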
   
   


