rdblue commented on issue #23208: [SPARK-25530][SQL] data source v2 API refactor (batch write) URL: https://github.com/apache/spark/pull/23208#issuecomment-450026352 > Isn't it the queryId discussion? No. What I'm mainly referring to are other differences that weren't discussed because of the confusion over query ID. For example, the doc proposes `buildForBatch` and `buildForStreaming` where the read side has just `build` and `toBatch` and others. Why was this chosen instead of using the same pattern as the read side? It looks inconsistent to an implementer. Is that inconsistency justified? Next, why is OutputMode passed to the data source? I understand what it means needs to be a configuration option, but batch is configured differently. Should OutputMode be configured using overwrite traits instead of passing a mode? I think that would make more sense because batch and streaming would use the same configuration method. If you want to keep OutputMode, how do updates work? If a source can't handle updates, how is that signaled? How does a source know which rows to replace for update mode? (Even if these questions are answered elsewhere, I think these concerns need to be documented in the design.) That said, I think there are some minor open questions for query ID. I agree there needs to be some way to pass it, but how it is passed hasn't been addressed yet. Why not pass it using a builder trait like other options? That way sources can choose to have it passed in. > Technically we should not block a PR just because some specific persons haven't reviewed it. . . . We can always have followup PRs I think there's some confusion here about what's being reviewed. I think this is blocked on a completed design, not a PR review. Why would we would move forward on a PR that implements a redesign without finishing the design -- especially so early in discussion and consensus building? That would introduce unnecessary code churn, and bias discussions toward what is already implemented instead of what makes the most sense. It is easier not to make changes. I think there's a bias against changes even with a prototype PR because it's annoying to rebase. In principle, the goal of discussing a design is to make it better and, in the process, build community consensus for it. If you think that your design has reached a fixed point and that you have built that consensus in the community, then of course you don't need one person to sign off when the PR to implement, assuming it has been genuinely reviewed. For this specific proposal, I don't think that the design has reached a fixed point and it is too early to conclude there is consensus for it. Even if I thought it were ready, we would not have built consensus yet. You're still adding to the design doc and we cancelled the community sync because it's a holiday week and many people are out. I understand wanting to move more quickly and how frustrating it can be when you don't get review attention. But right now, I think it is more productive to focus on being prepared for when people can give this their attention again. I recommend taking this time to work on the design doc: that doc should persuade the community that you've thought through all of the trade-offs and made reasonable choices. Personally, I'm asking a lot of questions because I haven't been convinced that alternatives have been considered.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
