[GitHub] [spark] HeartSaVioR commented on pull request #28523: [SPARK-31706][SQL] add back the support of streaming update mode

GitBox Fri, 15 May 2020 06:28:30 -0700


HeartSaVioR commented on pull request #28523:
URL: https://github.com/apache/spark/pull/28523#issuecomment-629235937



   I'm curious how SS committers think about these comments upon - if they 
agree about the comments then the issue is rather about the design issue of 
streaming output mode, and I think the right way to fix is decoupling streaming 
output mode with sink.
   
   Would it break backward compatibility? If you look back branch-2.4, you'll 
find it very surprised that most built-in sink implementations "ignore" the 
output mode. The output mode provided by StreamWriteSupport.createStreamWriter 
is not used at all, with only one exception.
   
   The only exception for built-in sink is memory sink, and it doesn't deal 
with the difference between append mode and update mode. It's only used to 
truncate the memory for complete mode. Given that the sink exists most likely 
for testing, it only helps to make tests be easier to implement, nothing else. 
Also I'm strongly in favor of dropping complete mode, as I already provided 
data loss issue, and I don't think it's production-wise.
   
   I don't even think custom sinks have been respecting the output mode, as the 
API lacks of information on update mode, and complete mode is not 
production-wise.
   
   The major problem is the time. I totally understand such change may feel a 
bit huge to go with the release which has already done in RC1 though...
   
   I hope we address it in the major release (that's a good rationalization), 
but if we really want to minimize the changes for now, what about adding 
`SupportsStreamingTruncate` as internal trait as well, so that we avoid 
coupling `SupportTruncate` with complete mode and let streaming write go 
different path with the batch one?
   (Even better if we simply don't support complete mode or restrict to private 
right now, but...)
   
   As both `SupportsStreamingUpdate` and `SupportsStreamingTruncate` would be 
internal one, we will have time to revisit the streaming path and change 
without making effect of public API.
   
   That would make custom sinks only be able to append as we won't expose these 
abilities, but that's what I expect so far. Even with the new DSv2 implementing 
truncate in streaming sink looks to be very limited, as truncation should take 
place when committing, which means write tasks cannot write to the destination 
directly, not scalable.
   
   Does my proposal make sense?
   
   cc. to @tdas @zsxwing @jose-torres @brkyvz @jerryshao @gaborgsomogyi to hear 
their voices as well. Please cc. to more ppl if you have anyone to get some 
help taking a look at.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] [spark] HeartSaVioR commented on pull request #28523: [SPARK-31706][SQL] add back the support of streaming update mode

Reply via email to