Ted-Jiang commented on PR #7620:
URL: https://github.com/apache/arrow-datafusion/pull/7620#issuecomment-1751580979

   > Thanks @Ted-Jiang and @suremarc
   > 
   > Do you have any performance measurements you can share @Ted-Jiang about how much this feature increases performance for your use case?
   
   We are trying to fix a situation where listing files and fetching their statistics from remote storage is sometimes unstable.
   One of our queries takes about 4 seconds on average, but sometimes it takes twice as long. We added some logging to debug it:
   ```
   2023-09-12T06:10:59.098033Z  INFO tokio-runtime-worker ThreadId(16) datafusion::execution::context: Datafusion optimize logical plan cost: 10.852243ms
   2023-09-12T06:11:03.309479Z  INFO tokio-runtime-worker ThreadId(16) datafusion::execution::context: Datafusion create physical plan cost: **_4.2222994s_**
   2023-09-12T06:11:03.309543Z  INFO tokio-runtime-worker ThreadId(16) ballista_scheduler::planner: planning query stages for job VyQVhat
   2023-09-12T06:11:06.439247Z  INFO tokio-runtime-worker ThreadId(20) ballista_scheduler::display: === [VyQVhat/1] Stage finished, physical plan with metrics ===
   ```
   We saw that about half of the time was spent creating the physical plan 🤣 (by the way, our front end, written in Java, passes a logical plan with LIST_TABLE to DataFusion).
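
As a rough check of the "about half" estimate, the gaps between the log timestamps can be computed directly. This is only a sketch: it uses the three timestamps from the log above, and treating the first line as the start of the observed window and the last line as its end is an assumption, not something the log states.

```python
from datetime import datetime

# Timestamps copied from the log lines above (microsecond precision, UTC).
FMT = "%Y-%m-%dT%H:%M:%S.%fZ"
events = [
    ("optimize logical plan logged", "2023-09-12T06:10:59.098033Z"),
    ("create physical plan logged", "2023-09-12T06:11:03.309479Z"),
    ("stage finished logged", "2023-09-12T06:11:06.439247Z"),
]


def gaps_seconds(events):
    """Return (label, seconds since the previous event) for each event after the first."""
    times = [(label, datetime.strptime(ts, FMT)) for label, ts in events]
    return [
        (label, (t - prev_t).total_seconds())
        for (_, prev_t), (label, t) in zip(times, times[1:])
    ]


gaps = gaps_seconds(events)
total = sum(seconds for _, seconds in gaps)
for label, seconds in gaps:
    # Share of the window between the first and last log line, not of the whole query.
    print(f"{label}: {seconds:.3f}s ({seconds / total:.0%} of observed window)")
```

The first gap lines up with the reported physical-plan cost. The window ignores any time spent before the first log line.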
   
   After enabling the cache, the problem went away and the query time stays stable.
   By the way, we handle _consistency_ on our Java side: our system builds a kind of materialized view, so if we change the source data, the path, which is the cache key, changes as well.
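
The idea of keying the cache on the path can be sketched as follows. This is a minimal illustration, not DataFusion's cache API: the class `ListFilesCache`, the `list_remote` callback, and the `s3://bucket/view/...` paths are all hypothetical names introduced for the example.

```python
from typing import Callable, Dict, List


class ListFilesCache:
    """Hypothetical path-keyed cache in front of an expensive remote listing call."""

    def __init__(self, list_remote: Callable[[str], List[str]]):
        self._list_remote = list_remote
        self._entries: Dict[str, List[str]] = {}
        self.remote_calls = 0  # counts how often remote storage is actually hit

    def list_files(self, path: str) -> List[str]:
        # Only call remote storage the first time a given path is seen.
        if path not in self._entries:
            self.remote_calls += 1
            self._entries[path] = self._list_remote(path)
        return self._entries[path]


# Stand-in for remote storage; the paths are made up for illustration.
fake_store = {
    "s3://bucket/view/v1/": ["part-0.parquet"],
    "s3://bucket/view/v2/": ["part-0.parquet", "part-1.parquet"],
}

cache = ListFilesCache(lambda p: list(fake_store[p]))
first = cache.list_files("s3://bucket/view/v1/")   # miss: goes to remote storage
second = cache.list_files("s3://bucket/view/v1/")  # hit: served from the cache
# Changed source data is written under a new path, so the key changes
# and the old entry is never consulted for the new data.
third = cache.list_files("s3://bucket/view/v2/")   # miss: new key
print(cache.remote_calls)
```

Because new data always arrives under a new path, no explicit invalidation is needed in this sketch. Entries for old paths simply stop being requested.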

