Ted-Jiang commented on PR #7620: URL: https://github.com/apache/arrow-datafusion/pull/7620#issuecomment-1751580979
> Thanks @Ted-Jiang and @suremarc
>
> Do you have any performance measurements you can share @Ted-Jiang about how much this feature increases performance for your usecase?

We are trying to fix a situation where listing file statistics from remote storage is sometimes unstable. One of our queries takes about 4 seconds on average, but sometimes it takes twice as long. We printed some logs to debug:

```
2023-09-12T06:10:59.098033Z INFO tokio-runtime-worker ThreadId(16) datafusion::execution::context: Datafusion optimize logical plan cost: 10.852243ms
2023-09-12T06:11:03.309479Z INFO tokio-runtime-worker ThreadId(16) datafusion::execution::context: Datafusion create physical plan cost: **_4.2222994s_**
2023-09-12T06:11:03.309543Z INFO tokio-runtime-worker ThreadId(16) ballista_scheduler::planner: planning query stages for job VyQVhat
2023-09-12T06:11:06.439247Z INFO tokio-runtime-worker ThreadId(20) ballista_scheduler::display: === [VyQVhat/1] Stage finished, physical plan with metrics ===
```

We saw that half of the time was spent creating the physical plan 🤣 (btw, we pass the logical plan LIST_TABLE to DataFusion from our front end, which is written in Java).

After enabling the cache, the problem is fixed and the query time stays stable.

Btw, we handle _consistency_ on our Java side: our system builds a kind of materialized view, so if we change the source data, the path used as the cache key also changes.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at: [email protected]
