LuciferYang commented on pull request #30663:
URL: https://github.com/apache/spark/pull/30663#issuecomment-749378207


   @HyukjinKwon  @dongjoon-hyun 
   
   It seems that it is not easy to prove this optimization through UT. I did 
the following test, taking DataSourceV1 as an example:
   
   1. 10 files and each file has 1100 columns , each file about 400m
   2. Execute a simple query without filter `select  count(xxx) from orc_table`
   
   The key results are as follows:
   
   **without this pr** 
   
   
![image](https://user-images.githubusercontent.com/1475305/102857830-1aae9700-4464-11eb-92a6-d91a96b96e8e.png)
   
   **with this pr**
   
   
![image](https://user-images.githubusercontent.com/1475305/102857970-67926d80-4464-11eb-967c-7f7944697600.png)
   
   The `Input Size` from `96.2 MiB` to `72.3 MiB`,  the `Total Time Across All 
Tasks` from 2s to 1s


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to