rdblue commented on issue #23208: [SPARK-25530][SQL] data source v2 API 
refactor (batch write)
URL: https://github.com/apache/spark/pull/23208#issuecomment-450026352
 
 
   > Isn't it the queryId discussion?
   
   No.
   
   What I'm mainly referring to are other differences that weren't discussed 
because of the confusion over query ID. For example, the doc proposes 
`buildForBatch` and `buildForStreaming` where the read side has just `build` 
and `toBatch` and others. Why was this chosen instead of using the same pattern 
as the read side? It looks inconsistent to an implementer. Is that 
inconsistency justified?
   
   Next, why is OutputMode passed to the data source? I understand what it 
means needs to be a configuration option, but batch is configured differently. 
Should OutputMode be configured using overwrite traits instead of passing a 
mode? I think that would make more sense because batch and streaming would use 
the same configuration method.
   
   If you want to keep OutputMode, how do updates work? If a source can't 
handle updates, how is that signaled? How does a source know which rows to 
replace for update mode? (Even if these questions are answered elsewhere, I 
think these concerns need to be documented in the design.)
   
   That said, I think there are some minor open questions for query ID. I agree 
there needs to be some way to pass it, but how it is passed hasn't been 
addressed yet. Why not pass it using a builder trait like other options? That 
way sources can choose to have it passed in.
   
   > Technically we should not block a PR just because some specific persons 
haven't reviewed it. . . . We can always have followup PRs
   
   I think there's some confusion here about what's being reviewed. I think 
this is blocked on a completed design, not a PR review.
   
   Why would we would move forward on a PR that implements a redesign without 
finishing the design -- especially so early in discussion and consensus 
building? That would introduce unnecessary code churn, and bias discussions 
toward what is already implemented instead of what makes the most sense. It is 
easier not to make changes. I think there's a bias against changes even with a 
prototype PR because it's annoying to rebase.
   
   In principle, the goal of discussing a design is to make it better and, in 
the process, build community consensus for it. If you think that your design 
has reached a fixed point and that you have built that consensus in the 
community, then of course you don't need one person to sign off when the PR to 
implement, assuming it has been genuinely reviewed.
   
   For this specific proposal, I don't think that the design has reached a 
fixed point and it is too early to conclude there is consensus for it. Even if 
I thought it were ready, we would not have built consensus yet. You're still 
adding to the design doc and we cancelled the community sync because it's a 
holiday week and many people are out.
   
   I understand wanting to move more quickly and how frustrating it can be when 
you don't get review attention. But right now, I think it is more productive to 
focus on being prepared for when people can give this their attention again.
   
   I recommend taking this time to work on the design doc: that doc should 
persuade the community that you've thought through all of the trade-offs and 
made reasonable choices. Personally, I'm asking a lot of questions because I 
haven't been convinced that alternatives have been considered.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to