andygrove commented on PR #3932:
URL: 
https://github.com/apache/datafusion-comet/pull/3932#issuecomment-4236533524

   > First, thanks for the quick response. I appreciate it. On the AI side, I 
think it's better to use the best tools available and be honest about our 
processes so that we can mature our practices and focus as an industry. To 
address your questions...
   
   Agreed. I use AI extensively. The main challenge for this project is that 
contribution velocity exceeds review capacity. 
   
   > The motivation on my side is that my day-job employer is a significant 
user of Delta, and I find the current state and future direction of Delta 
UniForm, particularly its openness, a bit unclear. It is important for us to 
preserve vendor flexibility within our Spark stacks, and having a viable 
accelerator outside of Databricks is a key part of that. This work is a step in 
that direction.
   
   Adding Delta Lake support makes Comet appealing to a wider audience, which 
hopefully leads to more contributors/maintainers over time.
   
   > From a maintainability perspective, I have a couple of thoughts. The 
design of this PR intentionally minimizes direct reliance on delta-rs by using 
the kernel only for scan planning, not execution. It also has fairly extensive 
test cases to detect regressions, but as you know that has its own limitations. 
As long as Comet continues to directly support Parquet, this approach should 
remain relatively stable over time.
   
   Makes sense.
   
   > That said, there is an opportunity to move toward a more pluggable 
architecture. For example, a third-party library, such as a Delta or Hudi 
provider, could implement a native scan planning interface exposed by Comet. 
This would allow dependencies and integrations to be fully externalized and 
would shift the maintenance burden to the plugin owner.
   
   Interesting idea. We tried something like this in the past with the Java 
implementation of Iceberg, and it led to some challenges with circular 
dependencies. It would be worth creating an issue to discuss.
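   As a rough illustration of the pluggable scan-planning idea, here is a 
minimal Rust sketch. All names here (`TableScanProvider`, `ScanPlan`, the 
registry) are hypothetical, not actual Comet APIs, and a real implementation 
would cross the JVM/native boundary rather than live in one process. The point 
is the shape: the plugin only resolves table metadata into Parquet files, and 
Comet keeps ownership of execution without linking against delta-rs or Hudi 
directly.

   ```rust
   /// Hypothetical sketch only; these are not real Comet types.
   /// A planned scan: the concrete Parquet files that Comet's existing
   /// reader would then execute against.
   #[derive(Debug, PartialEq)]
   struct ScanPlan {
       files: Vec<String>,
   }

   /// Interface a third-party format provider (Delta, Hudi, ...) would
   /// implement. Planning only; execution stays inside Comet.
   trait TableScanProvider {
       /// Format name used for registry lookup, e.g. "delta".
       fn format(&self) -> &str;
       /// Resolve a table path into the Parquet files to scan.
       fn plan_scan(&self, table_path: &str) -> ScanPlan;
   }

   /// Toy Delta provider: a real one would consult the Delta log via a
   /// kernel library inside the plugin, keeping that dependency external.
   struct DeltaProvider;

   impl TableScanProvider for DeltaProvider {
       fn format(&self) -> &str { "delta" }
       fn plan_scan(&self, table_path: &str) -> ScanPlan {
           ScanPlan { files: vec![format!("{table_path}/part-0000.parquet")] }
       }
   }

   /// Comet-side registry: plugins register themselves at startup, so
   /// Comet's core has no compile-time knowledge of any table format.
   struct ScanProviderRegistry {
       providers: Vec<Box<dyn TableScanProvider>>,
   }

   impl ScanProviderRegistry {
       fn new() -> Self { Self { providers: Vec::new() } }
       fn register(&mut self, p: Box<dyn TableScanProvider>) {
           self.providers.push(p);
       }
       fn plan(&self, format: &str, path: &str) -> Option<ScanPlan> {
           self.providers
               .iter()
               .find(|p| p.format() == format)
               .map(|p| p.plan_scan(path))
       }
   }

   fn main() {
       let mut registry = ScanProviderRegistry::new();
       registry.register(Box::new(DeltaProvider));
       let plan = registry.plan("delta", "s3://bucket/table")
           .expect("delta provider registered");
       println!("{:?}", plan.files);
   }
   ```

   The maintenance-burden shift falls out of the lookup: an unregistered 
format simply returns `None`, and removing a plugin removes its dependency 
tree with it.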
   
   > Longer term, I would like to see [IndexTables](https://indextables.io) and 
Comet become compatible to help accelerate joins and such on plain Spark. 
Achieving that would likely require a more robust plugin model that supports 
not just scan planning, but also FFI-based columnar streaming. That is a more 
involved effort and likely a ways out, given the current state of my codebase.
   > 
   > Love your thoughts, and of course no hard feelings if this doesn't align 
with where you want to focus your product.
   
   Oh, it's definitely not *my* product. Let's see what other maintainers have 
to say. Adding Delta Lake support would be great for Comet's future. My 
concern is just about maintenance going forward. However, the feature is 
marked as experimental and disabled by default, so it could always be removed 
later if we get into a situation where the code is no longer maintained and is 
causing issues.
   
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

