BlakeOrth commented on issue #16365:
URL: https://github.com/apache/datafusion/issues/16365#issuecomment-3189937566

   @alamb I think additional observability tooling is almost always a positive 
development. That being said, I have to be completely honest with you and note 
that I'm ultimately an API user, not a CLI user. I've been using a hackily 
instrumented CLI here to provide a common tool and examples of potential 
improvements. The CLI and my use case both rely on `ListingTable`, which is 
where I'm personally interested in driving performance improvements for tables 
on "high latency" storage.
   
   Exposing additional metrics around where DataFusion is spending its time at 
the API level (and in turn through the CLI) does seem very useful to me though. 
I personally had to rely on a mix of production metrics for our object storage, 
off-CPU-time profiling, and the aforementioned hacked-in timing 
instrumentation to understand that listing files and collecting their object 
metadata was taking a non-trivial amount of time, especially in 
Hive-partitioned contexts. Better metrics should, in theory, eliminate the need 
for much of that toil.
   
   > better diagnose why this command takes so long (and thus how we can make 
it better)
   
   I don't currently have any true insights (just some educated guesses) as to 
why table creation is taking so long, but I also have done very little 
investigation there as of now. I'm personally more interested in improving 
performance (more query performance than write performance currently) for 
existing tables and consider the table creation step to effectively be a 
one-time cost. I'm happy to share or better clarify any of my current findings. 
I'm just not entirely sure of the best avenue to do so, since this is a pretty 
active project and the core maintainers seem busy already. I can open draft PRs 
for the couple of POCs I've thrown together, highlight the areas of code I've 
added hacky instrumentation around, etc., if it helps further the discussion.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

