BlakeOrth commented on issue #16365:
URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3189937566
@alamb I think additional observability tooling is almost always a positive development. That being said, I have to be completely honest with you and note that I'm ultimately an API user, not a CLI user. I've been using a hacky-instrumented CLI here to provide a common tool and examples of potential improvements. The CLI and my use case both leverage the `ListingTable`, which is where I'm personally interested in driving performance improvements for tables on "high latency" storage.

Exposing additional metrics around where DataFusion is spending its time at the API level (and in turn through the CLI) does seem very useful to me, though. I personally had to rely on a mix of production metrics for our object storage, off-CPU time profiling, and the aforementioned hacked-in timing instrumentation to understand that listing files and collecting their object metadata was taking a non-trivial amount of time, especially in Hive-partitioned contexts. Better metrics should, in theory, eliminate the need for much of that toil.

> better diagnose why this command takes so long (and thus how we can make it better)

I don't currently have any true insights (just some educated guesses) as to why table creation is taking so long, but I have also done very little investigation there as of now. I'm personally more interested in improving performance against existing tables (query performance more than write performance, currently) and consider the table creation step to effectively be a one-time cost.

I'm happy to share or better clarify any of my current findings. I'm just not entirely sure of the best avenue to do so, since this is a pretty active project and the core maintainers already seem busy. I can open draft PRs for the couple of POCs I've thrown together, highlight the existing areas of code I've added hacky instrumentation around, etc., if it helps further the discussion.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: github-h...@datafusion.apache.org