HeartSaVioR edited a comment on pull request #30521:
URL: https://github.com/apache/spark/pull/30521#issuecomment-738568538


   > I meant these users need to create the table before starting the query no 
matter which behavior we decide.
   
   OK, I agree with this. I've probably muddled things up, since there were a bunch of inputs from different folks and I had to respond to all of them. My bad.
   
   > Could you give an example? For people familiar with DataFrameWriterV2, 
when they try to use APIs (such as partitionedBy and tableProperty) in 
DataStreamWriter, they will quickly notice that DataStreamWriter doesn't have 
such APIs, and notice the limitations of toTable.
   
   I would say that's not great UX, which is why I hadn't even considered it, but I agree it still makes sense. That could be something to tolerate, as long as we document the limitations as well.
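
   To make the UX gap concrete, here is a minimal sketch in plain Scala. These are illustrative stubs only (all class and method names are mine, not Spark's actual classes); they just model the two API shapes being compared: DataFrameWriterV2's table-defining methods versus a `toTable` entry point that lacks them.

   ```scala
   // Illustrative stubs: they model the API shapes under discussion,
   // they are NOT the actual Spark classes.
   class BatchWriterV2Stub(table: String) {
     private var partitions: Seq[String] = Nil
     private var props: Map[String, String] = Map.empty

     // DataFrameWriterV2-style table-defining configuration.
     def partitionedBy(cols: String*): this.type = { partitions = cols.toSeq; this }
     def tableProperty(key: String, value: String): this.type = { props += key -> value; this }

     def createPlan: String =
       s"CREATE TABLE $table PARTITIONED BY (${partitions.mkString(", ")})"
   }

   class StreamWriterStub {
     // The toTable shape: no partitionedBy/tableProperty hooks, so a user
     // coming from DataFrameWriterV2 quickly hits the limitation.
     def toTable(tableName: String): String = s"START STREAMING QUERY INTO $tableName"
   }
   ```

   A user who writes `new BatchWriterV2Stub("db.t").partitionedBy("date")` and then reaches for the same methods on the streaming side finds nothing, which is the discoverability argument above.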
   
   > If you meant adding a new method def toTable(tableName: String, 
ifNotExist: Boolean): StreamingQuery, then it might affect our future work. For 
example, we would need to explain how ifNotExist works if we add options to 
specify how to create the table, and might need to deprecate it in future.
   
   I'm not sure I understand. Could you please elaborate on which options you have in mind?
   
   I don't expect us to struggle with adding something to `toTable` in the future. We have already seen how DataFrameWriterV2 avoids making users wonder about the impact of a configuration: it enforces creating the table whenever table-related configurations (provider, table properties, partitioning) are provided. That is quite an improvement and a good "learning from history" practice we should follow. Once we provide the functionality to create a table for a streaming query, a DataStreamWriterV2 still looks right to me, rather than evolving `toTable`.
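
   A rough sketch of that "enforce creation" idea in plain Scala (hypothetical types, not Spark's actual internals): table-defining configurations are rejected on the append path, so a user never has to wonder whether they were silently ignored.

   ```scala
   // Hypothetical model of the rule: table-defining configs (provider,
   // table properties, partitioning) are only valid when creating the table.
   sealed trait WriteAction
   case object Append extends WriteAction
   case object CreateTable extends WriteAction

   def validate(action: WriteAction, hasTableDefiningConfigs: Boolean): Either[String, WriteAction] =
     action match {
       case Append if hasTableDefiningConfigs =>
         Left("provider/tableProperty/partitionedBy require creating the table")
       case other =>
         Right(other)
     }
   ```

   The point of the design is that the error surfaces at the API level instead of the configuration being dropped on the floor at runtime.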
   
   There is already an inconsistency in DataStreamWriter: the `partitionBy` configuration takes effect and is passed down in the DSv1 path (we are still discussing the right behavior, "respect the existing table's partitioning and only use it when creating the table" vs "pass it through in any case"), but in the DSv2 path `partitionBy` is currently ignored (it would probably be used for creating the table, but the latter is not possible yet). What should we do for DSv1 tables?
   
   We should resolve this confusion, and the effects should also be documented in the javadoc: creating a table vs the table already existing, DSv1 vs DSv2 (all 4 combinations should be documented).
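
   As an illustration only (this encodes the current behavior as I understand it from the discussion, not a settled spec), the four combinations can be laid out as a small matrix:

   ```scala
   // Hypothetical summary of partitionBy's effect across the four cases
   // discussed above; the DSv1/existing-table entry is still under debate.
   def partitionByEffect(dsv2: Boolean, tableExists: Boolean): String =
     (dsv2, tableExists) match {
       case (false, false) => "DSv1, new table: partitionBy is used to create the table"
       case (false, true)  => "DSv1, existing table: under discussion (respect existing vs pass through)"
       case (true,  false) => "DSv2, new table: not reachable yet (toTable cannot create a table)"
       case (true,  true)  => "DSv2, existing table: partitionBy is currently ignored"
     }
   ```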
   
   So there are different opinions across different folks:
   
   1. @zsxwing and I tend to agree on the necessity of DataStreamWriterV2, while @cloud-fan and @brkyvz seem reluctant to add it.
   2. I think end users should explicitly decide whether to create the table when starting the query, while others consider the implicit behavior a trade-off for better usability for more people.
   
   I see there is a strong opinion on point 2, so I'll step back on this as well, with two conditions: we should stop adding more confusion to DataStreamWriter (DataStreamWriterV2 should be designed), and SPARK-33638 should be a blocker for 3.2.0. Even if we define it as a blocker for 3.2.0, it will probably reach the public in the second half of next year at the earliest. There is a gap, but if we at least make a strong promise, I'm OK with that.
   
   I'm sorry, but a strong promise only means marking it as a blocker. We tend to defer something and promise to handle it later, but in many cases we get busy with other things and completely forget it. SPARK-27237 was one such case.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]



---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to