andygrove commented on PR #3932: URL: https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4236533524
> First, thanks for the quick response. I appreciate it. On the AI side, I think it's better to use the best tools available and be honest about our processes so that we can mature our practices and focus as an industry. To address your questions...

Agreed. I use AI extensively. The main challenge for this project is that the contribution velocity exceeds review capacity.

> The motivation on my side is that my day-job employer is a significant user of Delta, and I find the current state and future direction of Delta Uniform, particularly its openness, a bit unclear. It is important for us to preserve vendor flexibility within our Spark stacks, and having a viable accelerator outside of Databricks is a key part of that. This work is a step in that direction.

Adding Delta Lake support makes Comet appealing to a wider audience, which hopefully leads to more contributors/maintainers over time.

> From a maintainability perspective, I have a couple of thoughts. The design of this PR intentionally minimizes direct reliance on delta-rs by using the kernel only for scan planning, not execution. It also has fairly extensive test cases to detect regressions, but as you know that has its own limitations. As long as Comet continues to directly support Parquet, this approach should remain relatively stable over time.

Makes sense.

> That said, there is an opportunity to move toward a more pluggable architecture. For example, a third-party library, such as a Delta or Hudi provider, could implement a native scan planning interface exposed by Comet. This would allow dependencies and integrations to be fully externalized and would shift the maintenance burden to the plugin owner.

Interesting idea. We tried something like this in the past with the Java implementation of Iceberg. It led to some challenges with circular dependencies. It would be worth creating an issue to discuss.
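To make the provider idea concrete, here is a minimal sketch of what a native scan-planning interface could look like on the Rust side. All names here (`TableFormatProvider`, `ScanPlan`, `FileSlice`, `ToyDeltaProvider`) are hypothetical, not existing Comet or delta-rs APIs; the point is only that the plugin contract covers planning (which files/ranges to read) while execution stays in Comet's existing Parquet readers:

```rust
/// A contiguous byte range of a Parquet file to scan.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
pub struct FileSlice {
    pub path: String,
    pub start: u64,
    pub length: u64,
}

/// The result of planning: the Parquet slices the engine should read.
#[derive(Debug, Default)]
pub struct ScanPlan {
    pub slices: Vec<FileSlice>,
}

/// A table-format provider (Delta, Hudi, ...) implements only planning;
/// execution remains inside the engine's native Parquet readers. This is
/// what lets the format-specific dependencies live in the plugin crate.
pub trait TableFormatProvider {
    fn format_name(&self) -> &str;
    fn plan_scan(&self, table_root: &str) -> ScanPlan;
}

/// Toy provider that pretends every table has a single data file.
pub struct ToyDeltaProvider;

impl TableFormatProvider for ToyDeltaProvider {
    fn format_name(&self) -> &str {
        "delta"
    }

    fn plan_scan(&self, table_root: &str) -> ScanPlan {
        ScanPlan {
            slices: vec![FileSlice {
                path: format!("{table_root}/part-00000.parquet"),
                start: 0,
                length: 1024,
            }],
        }
    }
}

fn main() {
    // A registry of boxed providers, keyed by format name, is how
    // third-party crates could plug in without a compile-time dependency
    // from the engine back to the format library.
    let providers: Vec<Box<dyn TableFormatProvider>> = vec![Box::new(ToyDeltaProvider)];
    let plan = providers[0].plan_scan("/tables/events");
    println!(
        "{} file slice(s) planned for format '{}'",
        plan.slices.len(),
        providers[0].format_name()
    );
}
```

Keeping the trait object boundary here (rather than linking delta-rs into the engine) is also what would avoid the circular-dependency problem seen with the Java Iceberg integration, since the plugin depends on Comet's interface crate and not the reverse.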
> Longer term, I would like to see [IndexTables](https://indextables.io) and Comet become compatible to help accelerate joins and such on plain Spark. Achieving that would likely require a more robust plugin model that supports not just scan planning, but also FFI-based columnar streaming. That is a more involved effort and likely a ways out, given the current state of my codebase.
>
> Love your thoughts, and of course no hard feelings if this doesn't align with where you want to focus your product.

Oh, it's definitely not *my* product. Let's see what other maintainers have to say. Adding Delta Lake support would be great for Comet's future. My concern is just over maintenance going forward. However, the feature is marked as experimental and disabled by default, so it could always be removed in the future if we get into a situation where the code is no longer maintained and causing issues.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
