alamb commented on issue #10481: URL: https://github.com/apache/datafusion/issues/10481#issuecomment-2107276815
Here are some notes I have on what I want to talk about

Interfaces, which paradoxically allowed us to narrow the scope of potential optimizations (e.g. compute kernels) and have people focus on different areas.

Things we didn't implement:
* File formats (instead focused on Parquet, avro, arrow, json, csv)
* Memory format: Arrow (not just externally but internally)
* Thread pool: standard (tokio) vs our own thread pool
* pull / exchange rather than morsel driven parallelism
* standard I/O rather than buffer pool
* latest / greatest window aggregates fanciness (todo get paper link)

Providing simple built-in defaults, but hooks for more specialized implementations
Keeps DF simple, allows
* Catalog
* memory / disk manager

Things we did: places we spent time and complexity
* normalized keys / row format
* optimizing parquet reader
* optimizing hashing
* plan representation (logical plans, exprs, etc)
* function library
* ListingTable (maybe this should have been more

Things I would do differently next time:
* Keep listing table out of the core
* UDFs from the start
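
To make the "simple built-in defaults, but hooks for more specialized implementations" point concrete, here is a minimal sketch of the pattern in Rust. Everything in it (`MemoryManager`, `UnboundedMemory`, `FixedBudget`, `Engine`, `with_memory_manager`, `run_operator`) is a hypothetical name invented for illustration, not DataFusion's actual API; it only shows the shape of "trait as the hook, trivial implementation as the default".

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical extension point (not DataFusion's real trait):
/// decides whether an operator may reserve `bytes` of memory.
pub trait MemoryManager: Send + Sync {
    fn try_reserve(&self, bytes: usize) -> Result<(), String>;
}

/// The simple built-in default: never refuses a reservation.
pub struct UnboundedMemory;

impl MemoryManager for UnboundedMemory {
    fn try_reserve(&self, _bytes: usize) -> Result<(), String> {
        Ok(())
    }
}

/// A more specialized implementation a user could plug in:
/// reservations fail once a fixed budget would be exceeded.
pub struct FixedBudget {
    limit: usize,
    used: AtomicUsize,
}

impl FixedBudget {
    pub fn new(limit: usize) -> Self {
        Self { limit, used: AtomicUsize::new(0) }
    }
}

impl MemoryManager for FixedBudget {
    fn try_reserve(&self, bytes: usize) -> Result<(), String> {
        // Atomically add `bytes` only if the new total stays within the limit.
        self.used
            .fetch_update(Ordering::SeqCst, Ordering::SeqCst, |used| {
                used.checked_add(bytes).filter(|total| *total <= self.limit)
            })
            .map(|_| ())
            .map_err(|used| {
                format!("cannot reserve {bytes} bytes: {used} of {} already used", self.limit)
            })
    }
}

/// Hypothetical engine: works out of the box, but the manager can be swapped.
pub struct Engine {
    memory: Arc<dyn MemoryManager>,
}

impl Engine {
    /// Uses the built-in default so simple users need no configuration.
    pub fn new() -> Self {
        Self { memory: Arc::new(UnboundedMemory) }
    }

    /// The hook: replace the default with a specialized implementation.
    pub fn with_memory_manager(mut self, memory: Arc<dyn MemoryManager>) -> Self {
        self.memory = memory;
        self
    }

    /// Stand-in for running an operator that needs `bytes_needed` of memory.
    pub fn run_operator(&self, bytes_needed: usize) -> Result<(), String> {
        self.memory.try_reserve(bytes_needed)
    }
}
```

With this shape, `Engine::new()` accepts any reservation, while `Engine::new().with_memory_manager(Arc::new(FixedBudget::new(100)))` starts rejecting reservations once more than 100 bytes would be in use. The core engine code stays the same in both cases, which is the sense in which the default keeps things simple while the trait leaves room for specialization.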
