[GitHub] [arrow-datafusion] alamb commented on pull request #5790: Revert pr #5020

via GitHub Fri, 31 Mar 2023 12:36:13 -0700


alamb commented on PR #5790:
URL: 
https://github.com/apache/arrow-datafusion/pull/5790#issuecomment-1492496143


   For anyone else following along the original PR that added this change was 
https://github.com/apache/arrow-datafusion/pull/5020
   
   @yahoNanJing  -- looking at the screen shot you provided, 
   
![228734458-ddd80ca3-f59d-47a6-bd55-141ed31924e3](https://user-images.githubusercontent.com/490673/229212240-66994f1b-b811-4c6c-af9b-b0f8678c8fb2.png)
   
   It looks to me like the parquet file in question is being read using 2 
streams (aka the file was opened by two different tasks which are reading it in 
parallel)
   
   Thus while the wall clock time takes only 4 seconds it may be possible that 
the total cpu time is actually 7 seconds
   
   You could potentially disable the `repartition_file_scans` option
   
   
https://docs.rs/datafusion/latest/datafusion/config/struct.OptimizerOptions.html#structfield.repartition_file_scans
   
   And see if the metrics were more like what you expected
   
   Perhaps @thinkharderdev or @tustvold  have additional thoughts they could 
share
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

[GitHub] [arrow-datafusion] alamb commented on pull request #5790: Revert pr #5020

Reply via email to