Re: [I] Keynote presentation for SiMoD workshop at SIGMOD 2024 [datafusion]

via GitHub Mon, 13 May 2024 04:02:00 -0700


alamb commented on issue #10481:
URL: https://github.com/apache/datafusion/issues/10481#issuecomment-2107276815


   Here are some notes I have on what I want to talk about
   
   interfaces and then paradoxically allowed us to narrow the scope of 
potential optimizations (e.g. compute kernels) and have people focus on 
different areas. 
   
   Things we didn't implement:
   * Our own file formats (instead focused on Parquet, Avro, Arrow, JSON, CSV)
   * Our own memory format (Arrow is used not just externally but internally)
   * Our own thread pool (used the standard one, tokio, instead)
   * Morsel-driven parallelism (pull / exchange based instead)
   * A buffer pool (standard I/O instead)
   * Latest / greatest window aggregate fanciness (todo get paper link)
   
   Providing simple built-in defaults, but hooks for more specialized 
implementations.
   Keeps DF simple, allows plugging in custom:
   * Catalog
   * Memory / disk manager
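
   To make the "simple default plus hook" idea concrete, here is a minimal, 
self-contained sketch. All names in it (`MemoryManager`, `UnboundedManager`, 
`LimitedManager`, `EngineConfig`) are hypothetical and chosen for 
illustration only; they are not DataFusion's actual API.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical hook: decides whether an operator may reserve more memory.
trait MemoryManager: Send + Sync {
    fn try_reserve(&self, bytes: usize) -> Result<(), String>;
}

/// Simple built-in default: never refuses a reservation.
struct UnboundedManager;

impl MemoryManager for UnboundedManager {
    fn try_reserve(&self, _bytes: usize) -> Result<(), String> {
        Ok(())
    }
}

/// A more specialized implementation a user might plug in: a hard limit.
struct LimitedManager {
    limit: usize,
    used: AtomicUsize,
}

impl LimitedManager {
    fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }
}

impl MemoryManager for LimitedManager {
    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        // Atomically add `bytes` only if the new total stays within the limit.
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= self.limit)
            })
            .map(|_| ())
            .map_err(|used| {
                format!(
                    "cannot reserve {} bytes: {} of {} already used",
                    bytes, used, self.limit
                )
            })
    }
}

/// Hypothetical engine configuration: works out of the box with the default,
/// but the memory manager can be swapped for a specialized one.
struct EngineConfig {
    memory: Arc<dyn MemoryManager>,
}

impl Default for EngineConfig {
    fn default() -> Self {
        Self { memory: Arc::new(UnboundedManager) }
    }
}

impl EngineConfig {
    fn with_memory_manager(mut self, memory: Arc<dyn MemoryManager>) -> Self {
        self.memory = memory;
        self
    }
}
```

   A caller who does nothing gets `EngineConfig::default()`; a caller with 
stricter needs writes 
`EngineConfig::default().with_memory_manager(Arc::new(LimitedManager::new(100)))` 
and the rest of the engine is unchanged.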
   
   
   Things we did: places we spent time and complexity
   * normalized keys / row format
   * optimizing parquet reader
   * optimizing hashing 
   * plan representation (logical plans, exprs, etc)
   * function library
   * ListingTable (maybe this should have been more 
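
   For readers unfamiliar with normalized keys, the general idea is to encode 
a multi-column sort key as one byte string whose byte-wise order matches the 
row order. The sketch below shows that technique for two `i32` columns only. 
The function names are hypothetical, and this is not necessarily the row 
format DataFusion uses (a real format also has to handle nulls, 
variable-length values, and descending order).

```rust
/// Encode an i32 so that comparing the bytes lexicographically
/// gives the same result as comparing the numbers.
fn encode_i32(v: i32) -> [u8; 4] {
    // Flip the sign bit so negatives sort before non-negatives,
    // then use big-endian so the most significant byte comes first.
    ((v as u32) ^ 0x8000_0000).to_be_bytes()
}

/// Build a normalized key for a two-column (i32, i32) sort key
/// by concatenating the per-column encodings.
fn normalized_key(a: i32, b: i32) -> [u8; 8] {
    let mut key = [0u8; 8];
    key[..4].copy_from_slice(&encode_i32(a));
    key[4..].copy_from_slice(&encode_i32(b));
    key
}

/// Sort rows by comparing only their normalized keys (plain byte comparison).
fn sort_rows(rows: &mut Vec<(i32, i32)>) {
    rows.sort_by_key(|&(a, b)| normalized_key(a, b));
}
```

   The payoff is that a multi-column comparison becomes a single byte 
comparison, at the cost of building the key up front.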
   
   Things I would do differently next time:
   * Keep ListingTable out of the core
   * UDFs from the start
   


