HeartSaVioR commented on pull request #28523: URL: https://github.com/apache/spark/pull/28523#issuecomment-629235937
I'm curious how SS committers think about these comments upon - if they agree about the comments then the issue is rather about the design issue of streaming output mode, and I think the right way to fix is decoupling streaming output mode with sink. Would it break backward compatibility? If you look back branch-2.4, you'll find it very surprised that most built-in sink implementations "ignore" the output mode. The output mode provided by StreamWriteSupport.createStreamWriter is not used at all, with only one exception. The only exception for built-in sink is memory sink, and it doesn't deal with the difference between append mode and update mode. It's only used to truncate the memory for complete mode. Given that the sink exists most likely for testing, it only helps to make tests be easier to implement, nothing else. Also I'm strongly in favor of dropping complete mode, as I already provided data loss issue, and I don't think it's production-wise. I don't even think custom sinks have been respecting the output mode, as the API lacks of information on update mode, and complete mode is not production-wise. The major problem is the time. I totally understand such change may feel a bit huge to go with the release which has already done in RC1 though... I hope we address it in the major release (that's a good rationalization), but if we really want to minimize the changes for now, what about adding `SupportsStreamingTruncate` as internal trait as well, so that we avoid coupling `SupportTruncate` with complete mode and let streaming write go different path with the batch one? (Even better if we simply don't support complete mode or restrict to private right now, but...) As both `SupportsStreamingUpdate` and `SupportsStreamingTruncate` would be internal one, we will have time to revisit the streaming path and change without making effect of public API. That would make custom sinks only be able to append as we won't expose these abilities, but that's what I expect so far. Even with the new DSv2 implementing truncate in streaming sink looks to be very limited, as truncation should take place when committing, which means write tasks cannot write to the destination directly, not scalable. Does my proposal make sense? cc. to @tdas @zsxwing @jose-torres @brkyvz @jerryshao @gaborgsomogyi to hear their voices as well. Please cc. to more ppl if you have anyone to get some help taking a look at. ---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
