[GitHub] [spark] lxian commented on pull request #31998: [WIP][SPARK-34859][SQL] parquet vectorized reader - support column index with rowIndexes

GitBox Wed, 31 Mar 2021 04:33:57 -0700


lxian commented on pull request #31998:
URL: https://github.com/apache/spark/pull/31998#issuecomment-810993995



   > I think it wouldn't affect the performance significantly when the dataset
   > is actually large with many files in production? I doubt if it's worthwhile
   > to manage a separate index.
   >
   > It would be great if we run TPC-DS or a proper benchmark and see if there
   > is a significant performance improvement.
   
   I think the performance will depend on the selectivity of the filter in the
   query. The index is only applied if there are pages that can be skipped in
   a row group. There is some benchmarking by @wangyum in
   https://github.com/apache/spark/pull/31393

   I will run a TPC-DS benchmark later to see whether there is any improvement.
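
To illustrate why the benefit depends on filter selectivity, here is a minimal sketch of page skipping based on per-page min/max statistics. It is not the Parquet or Spark implementation: `PageStats`, `rows_to_read`, and `skipped_fraction` are hypothetical names, and the integer range filter is an assumption made for the example.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class PageStats:
    """Hypothetical per-page entry of a column index (illustrative only)."""
    first_row: int   # index of the first row in the page within the row group
    row_count: int   # number of rows stored in the page
    min_value: int   # smallest column value in the page
    max_value: int   # largest column value in the page


def rows_to_read(pages: List[PageStats], lo: int, hi: int) -> List[Tuple[int, int]]:
    """Return (first_row, row_count) for pages whose [min, max] overlaps [lo, hi]."""
    return [
        (p.first_row, p.row_count)
        for p in pages
        if p.max_value >= lo and p.min_value <= hi
    ]


def skipped_fraction(pages: List[PageStats], lo: int, hi: int) -> float:
    """Fraction of rows in the row group that the filter lets us skip."""
    total = sum(p.row_count for p in pages)
    if total == 0:
        return 0.0
    kept = sum(count for _, count in rows_to_read(pages, lo, hi))
    return 1 - kept / total


# Four pages of 100 rows each, holding sorted values 0..399.
pages = [PageStats(i * 100, 100, i * 100, i * 100 + 99) for i in range(4)]

selective = skipped_fraction(pages, 150, 160)    # only the second page overlaps
unselective = skipped_fraction(pages, 0, 399)    # every page overlaps
```

With the selective filter, three of the four pages are skipped; with the unselective one, nothing is skipped, so in this sketch the index lookup would be pure overhead.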


