HeartSaVioR edited a comment on pull request #29715:
URL: https://github.com/apache/spark/pull/29715#issuecomment-692354511


   > So this DataStreamWriterV2 is used to enforce output mode for v2 streaming 
sinks, so that there is no backward compatibility issue?
   
   No. That's a side improvement which can be dropped, not a major goal.
   
   As I commented, fixing the problems in DataStreamWriter isn't the purpose of introducing DataStreamWriterV2. This is rather about providing a symmetric user experience between batch and streaming: with DataFrameWriterV2, end users can run a batch query against a **catalog table** on the writer side, whereas a streaming query has nothing that enables this.
   (I don't see an API for reading a catalog table on the reader side of a streaming query either. Do I understand correctly?)
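   To make the asymmetry concrete, a minimal sketch - the batch half is the existing DataFrameWriterV2 API, while the streaming half is purely hypothetical (`writeStreamTo`, the table name, and the `batchDf`/`streamingDf` DataFrames are made up just to illustrate the gap):

```scala
import org.apache.spark.sql.streaming.Trigger

// Batch: DataFrameWriterV2 can already target a catalog table on the writer side.
batchDf.writeTo("myCatalog.db.events").append()

// Streaming: no catalog-table entry point exists today. A DataStreamWriterV2
// could provide the symmetric experience, e.g. something like:
streamingDf.writeStreamTo("myCatalog.db.events")  // hypothetical API
  .trigger(Trigger.ProcessingTime("1 minute"))
  .start()
```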
   
   The problems I described in my previous comment are simply general problems in Structured Streaming - let me explain them at the end of this comment, as they might be off topic.
   
   I see DataFrameWriterV2 has integrated lots of other benefits (a more fluent API, a logical plan for the write node, etc.) which would be great to have in DataStreamWriterV2 as well, but I don't think they're the key part of *WriterV2. Supporting catalog tables is simply the major reason to have it.
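   For reference, this is the fluent builder shape DataFrameWriterV2 already offers on the batch side (real API since Spark 3.0; the catalog/table name is just a placeholder):

```scala
import org.apache.spark.sql.functions.col

df.writeTo("myCatalog.db.events")  // resolves through the catalog
  .partitionedBy(col("date"))      // declarative table layout
  .createOrReplace()               // one fluent chain down to the write node
```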
   
   Regarding the problems in Structured Streaming -
   
   I kicked the incomplete state support for continuous mode out of Structured Streaming, but I basically have concerns about "continuous mode" itself, as it's mostly applying hacks to work around an architectural limitation. (Plus, no one in the community seems to care about it.)
   
   And as I initiated a discussion on earlier (and have commented in various PRs), I think complete mode should be kicked out as well. The mode addresses some limited cases but is treated as one of the valid modes, which adds a lot of complexity: some operations which basically shouldn't be supported in a streaming query are supported under complete mode, and vice versa, because the mode doesn't fit naturally.
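   One concrete example of that special-casing, based on the documented behavior: sorting a streaming Dataset is normally disallowed, but becomes legal after an aggregation if and only if the query runs in complete mode.

```scala
import org.apache.spark.sql.functions.col

// A global sort needs the full result table, so it only makes sense in complete mode.
val ranked = inputDf.groupBy(col("word")).count().orderBy(col("count").desc)

ranked.writeStream.outputMode("complete").format("console").start()  // allowed
// ranked.writeStream.outputMode("update").format("console").start() // AnalysisException
```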
   
   It's useful for now only because Spark doesn't support true update mode on the sink side - once Spark can support update mode on the sink, the content in external storage would be equivalent to what complete mode provides, without having to dump all of the outputs on every trigger. We could probably simulate complete mode via a special stateful operator which only works with update mode.
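   For illustration, the semantic difference between the two modes on a simple aggregation (both output modes are existing Spark API; the console sink is just for demonstration):

```scala
import org.apache.spark.sql.functions.col

val counts = inputDf.groupBy(col("word")).count()

// Complete mode: the sink receives the ENTIRE result table on every trigger.
counts.writeStream.outputMode("complete").format("console").start()

// Update mode: the sink only receives rows changed since the last trigger.
// With a sink that can upsert by key, the external storage converges to the
// same content complete mode would dump, without rewriting everything.
counts.writeStream.outputMode("update").format("console").start()
```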
   
   Specific to micro-batch, supporting DSv1 is also a major headache - many of the pattern matches in MicroBatchExecution exist to support DSv1, and there are even workarounds applied specifically for DSv1 (e.g. #29700). I remember the answer in the discussion thread that DSv1 for streaming data sources is not exposed as a public API, which is great news, but I see no action/plan to get rid of it. Is there functionality possible in DSv1 that DSv2 cannot cover? If so, why not prioritize addressing that problem?
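   To show the kind of branching I mean, a paraphrased sketch (not the actual MicroBatchExecution code, just its shape - `Source.getOffset` is the DSv1 contract, `MicroBatchStream.latestOffset` the DSv2 one):

```scala
import org.apache.spark.sql.execution.streaming.Source
import org.apache.spark.sql.connector.read.streaming.MicroBatchStream

// Every spot that touches sources has to carry both code paths.
def latestOffsets(sources: Seq[AnyRef]): Seq[Option[AnyRef]] = sources.map {
  case v1: Source           => v1.getOffset              // DSv1: getOffset/getBatch
  case v2: MicroBatchStream => Option(v2.latestOffset()) // DSv2: latestOffset/planInputPartitions
}
```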

