HeartSaVioR commented on issue #22282: [SPARK-23539][SS] Add support for Kafka headers in Structured Streaming
URL: https://github.com/apache/spark/pull/22282#issuecomment-463784685
 
 
   Adding new column(s) to a DataSource can crash existing Structured Streaming queries when the new columns are propagated into the state schema. For example, as @zsxwing explained, `dropDuplicates()` (note: no explicit columns provided) deduplicates on `all columns`, so unless the query explicitly projects the columns it needs before calling `dropDuplicates()`, the state schema will change and become incompatible with the existing state.
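   To illustrate, here is a minimal sketch (the broker/topic values are placeholders, and it assumes the new field surfaces as a `headers` column on the Kafka source schema):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("dedup-sketch").getOrCreate()

// Hypothetical Kafka source; bootstrap servers and topic are placeholders.
val kafkaDf = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "host:9092")
  .option("subscribe", "topic")
  .load()

// No columns passed: deduplication keys on *all* columns of the input,
// so any newly surfaced source column (e.g. `headers`) silently becomes
// part of this operator's state schema.
val deduped = kafkaDf.dropDuplicates()
```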
   
   Even worse, if my understanding is right (please correct me if I'm missing something), the error message would not be informative: there is no check on state schema compatibility because we don't store the schema of the state explicitly (that's on my radar for a future contribution), so the query might not simply crash but could show undefined behavior.
   
   If we don't make this optional and enable the change by default, there is sadly no workaround: users will be forced to fix their queries (by passing all columns except the header column to `dropDuplicates()`, or by adding a `select` before it) to keep using their existing state.
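   Continuing the sketch above, such a fix would look roughly like this (assuming the pre-change Kafka source columns are key, value, topic, partition, offset, timestamp, timestampType):

```scala
// 1) Project away the new column before deduplicating, so the state
//    schema stays what the existing checkpoint was built with:
val dedupedViaSelect = kafkaDf
  .select("key", "value", "topic", "partition", "offset",
          "timestamp", "timestampType")
  .dropDuplicates()

// 2) Or list every column except the new one explicitly:
val dedupedViaColumns = kafkaDf.dropDuplicates(
  "key", "value", "topic", "partition", "offset",
  "timestamp", "timestampType")
```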
   
   Unfortunately, in Structured Streaming, existing state really matters and we need to take it into account every time. (Even a change made for batch queries should be considered from the Structured Streaming point of view so that it doesn't break streaming queries as a side effect.)
