HeartSaVioR edited a comment on pull request #29715: URL: https://github.com/apache/spark/pull/29715#issuecomment-692354511
> So this DataStreamWriterV2 is used to enforce output mode for v2 streaming sinks, so that there is no backward compatibility issue?

No. That's a side improvement, not the major goal. As I commented, fixing the problems in DataStreamWriter isn't the purpose of introducing DataStreamWriterV2. The purpose is rather to provide a symmetric user experience between batch and streaming: with DataFrameWriterV2, end users can run a batch query against a catalog table on the writer side, whereas a streaming query has nothing equivalent. (I don't see an API for reading a catalog table on the reader side of a streaming query either. Do I understand correctly?)

The problems I described in the previous comment are simply problems in Structured Streaming itself - let me explain them at the end of this comment, as they might be off topic here.

I see DataFrameWriterV2 has integrated lots of other benefits (a more fluent API, a logical plan on the write node, etc.) which would be great to have in DataStreamWriterV2, but I think they're not the key part of *WriterV2. Supporting catalog tables is the major reason to have it.

Regarding the problems in Structured Streaming: I kicked the incomplete state support for continuous mode out of Structured Streaming, but I basically have concerns about "continuous mode" itself, as it rather applies hacks to work around an architectural limitation. (Plus, no one in the community cares about it.) And as I initiated in an earlier discussion (and have commented in various PRs), I think complete mode should be kicked out as well. The mode addresses some limited cases but is treated as one of the valid output modes, which adds much complexity - some operations which basically shouldn't be supported in a streaming query are supported under complete mode, and vice versa, because the mode doesn't fit naturally.
It's useful for now because Spark doesn't support true update mode on the sink - once Spark can support update mode on the sink, the content in external storage should be equivalent to what complete mode provides, without having to dump all of the outputs. We could probably simulate complete mode via a special stateful operator which only works with update mode.

Specific to micro-batch, supporting DSv1 is also a major headache - lots of the pattern matches in MicroBatchExecution exist only to support DSv1, and there are even workarounds applied for DSv1 (e.g. #29700). I remember the answer in the discussion thread that DSv1 for streaming data sources is not exposed in the public API, which is great news, but I see no action/plan to get rid of it. Is there some functionality possible in DSv1 that DSv2 cannot cover? If not, why not prioritize addressing that problem?
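To illustrate the batch/streaming asymmetry described above, here is a minimal Scala sketch. The table name `catalog.db.events`, the paths, and the `spark` session are illustrative; the streaming side shows the existing `DataStreamWriter` API, since the catalog-aware counterpart is exactly what this PR proposes to add:

```scala
// Batch side: DataFrameWriterV2 is catalog-aware and fluent.
// The user targets a catalog table directly by name.
spark.table("catalog.db.source")
  .writeTo("catalog.db.events")
  .append()

// Streaming side: no catalog-table counterpart today.
// The user falls back to the format/option/path style of DataStreamWriter.
spark.readStream
  .format("rate")
  .load()
  .writeStream
  .format("parquet")
  .option("checkpointLocation", "/tmp/checkpoint")
  .option("path", "/tmp/events")
  .start()
```

This is a sketch of the API shapes only, not a runnable job; it assumes an active `SparkSession` and an existing source table.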
