[GitHub] [arrow-datafusion] Dandandan commented on issue #1363: Major performance regression in reading Parquet on master

GitBox Fri, 26 Nov 2021 03:20:28 -0800


Dandandan commented on issue #1363:
URL: 
https://github.com/apache/arrow-datafusion/issues/1363#issuecomment-979899560



   @rdettai 
   
   No problem - just want to figure out what could be the reason :)!
   
   So far I tested:
   
   * master - performance regression
   * 2454e468641d4d98af211c2800c0afec2732385b  - regression
   * d2d47d38b8c1b4605272d7f917406527cdf68bc9 - fast again
   
   
   > after #1347 was merged
   
   For TPC-H, as far as I remember, collecting stats doesn't have a big effect 
right now, as the data is very evenly distributed, and I think the stats also 
don't take long to collect.
   
   To reproduce:
   
   * Create a partitioned Parquet dataset (the slowness seems to increase with the 
number of partitions, but I'm not 100% sure yet)
   * Run some queries, e.g. `cargo run --release --bin tpch -- benchmark datafusion --iterations 5 --path [data] --format parquet --query 6 --batch-size 8192 -p 16`
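
To compare the slow and fast commits listed above, a rough sketch along these lines could help. It only wraps the benchmark command from the steps above. The function names `run_bench` and `compare_commits` and the `DRY_RUN` switch are made up for this sketch, and the dataset path is whatever you pass in.

```shell
#!/bin/sh
# The two commits tested above: the first regresses, the second is fast.
COMMITS="2454e468641d4d98af211c2800c0afec2732385b d2d47d38b8c1b4605272d7f917406527cdf68bc9"

# run_bench PATH: run the benchmark command from the steps above on PATH.
# With DRY_RUN=1 the command is only printed (a convenience added for this sketch).
run_bench() {
  set -- cargo run --release --bin tpch -- benchmark datafusion \
    --iterations 5 --path "$1" --format parquet --query 6 --batch-size 8192 -p 16
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# compare_commits PATH: check out each commit in turn and benchmark it.
compare_commits() {
  for rev in $COMMITS; do
    if [ "${DRY_RUN:-0}" = "1" ]; then
      echo "git checkout $rev"
    else
      git checkout "$rev" || return 1
    fi
    run_bench "$1" || return 1
  done
}
```

After sourcing this, calling `compare_commits` with the dataset path (with `DRY_RUN=1` set first to preview the commands) should run query 6 at both commits. This assumes a clean working tree and that it is run from the directory where the `cargo run` command above already works.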
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
