schenksj commented on PR #3932: URL: https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4236474713
> Thanks @schenksj. Could you fix the linter issues (see contributor guide for instructions).
>
> Thanks for acknowledging that this was written by AI. This is a very large PR for a significant new feature. Adding support for Delta Lake certainly has value, but we need to consider who is going to maintain this code going forward. I am concerned that if we merge this and then there are changes in the delta-lake-rs dependency in the future then it could cause an extra maintenance burden on the existing maintainers, who are more focused on Iceberg support and have been contributing to Iceberg as well.
>
> Could you tell me more about the motivation for this work? Do you have any suggestions for how this could be maintained in the future?

Hi Andy,

First, thanks for the quick response; I appreciate it. On the AI side, I think it's better to use the best tools available and be honest about our processes, so that we can mature our practices and focus as an industry. To address your questions:
The motivation on my side is that my day-job employer is a significant user of Delta, and I find the current state and future direction of Delta UniForm, particularly its openness, a bit unclear. It is important for us to preserve vendor flexibility within our Spark stacks, and having a viable accelerator outside of Databricks is a key part of that. This work is a step in that direction.

From a maintainability perspective, I have a couple of thoughts. The design of this PR intentionally minimizes direct reliance on delta-rs by using the kernel only for scan planning, not execution. It also has fairly extensive test cases to detect regressions, though as you know that has its own limitations. As long as Comet continues to directly support Parquet, this approach should remain relatively stable over time.

That said, there is an opportunity to move toward a more pluggable architecture. For example, a third-party library, such as a Delta or Hudi provider, could implement a native scan planning interface exposed by Comet. This would allow dependencies and integrations to be fully externalized and would shift the maintenance burden to the plugin owner.

Longer term, I would like to see [IndexTables](https://indextables.io) and Comet become compatible to help accelerate joins and similar operations on plain Spark. Achieving that would likely require a more robust plugin model that supports not just scan planning but also FFI-based columnar streaming. That is a more involved effort and likely a ways out, given the current state of my codebase.

I'd love your thoughts, and of course no hard feelings if this doesn't align with where you want to focus the project.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
