HeartSaVioR edited a comment on pull request #30521: URL: https://github.com/apache/spark/pull/30521#issuecomment-738568538
> I meant these users need to create the table before starting the query no matter which behavior we decide.

OK, I agree with this. I've probably mixed everything up, as there were a bunch of inputs from different folks and I had to defend multiple points at once. My bad.

> Could you give an example? For people familiar with DataFrameWriterV2, when they try to use APIs (such as partitionedBy and tableProperty) in DataStreamWriter, they will quickly notice that DataStreamWriter doesn't have such APIs, and notice the limitations of toTable.

I'd say that's not great UX, which is why I hadn't even considered it, but I agree it still makes sense. It could be something we tolerate, as long as we document the limitations as well.

> If you meant adding a new method def toTable(tableName: String, ifNotExist: Boolean): StreamingQuery, then it might affect our future work. For example, we would need to explain how ifNotExist works if we add options to specify how to create the table, and might need to deprecate it in future.

I'm not sure I understand. Could you elaborate on which options you have in mind? I don't expect us to struggle to add something to `toTable` in the future.

We've already seen how DataFrameWriterV2 avoids making users wonder about the impact of configurations: it enforces creation of the table when table-related configurations (provider, table properties, partitions) are provided. That's quite an improvement and a good "learning from history" practice we should follow. Once we provide the functionality to create a table for a streaming query, DataStreamWriterV2 still looks right to me, rather than evolving `toTable`.
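For illustration of the API gap being discussed, here is a rough Scala sketch (not from the PR itself; the catalog, table, and column names are hypothetical, and the calls follow the Spark 3.x DataFrameWriterV2 and DataStreamWriter APIs):

```scala
import org.apache.spark.sql.functions.col

// Batch path: DataFrameWriterV2 lets the user describe how the table
// should be created when it does not exist yet.
df.writeTo("catalog.db.events")
  .partitionedBy(col("date"))           // partitioning used at table creation
  .tableProperty("owner", "team-a")     // table property used at table creation
  .createOrReplace()

// Streaming path: toTable(tableName) exposes no equivalent knobs, so the
// user cannot express partitioning or table properties for the table that
// may get created implicitly.
val query = df.writeStream
  .toTable("catalog.db.events")
```

This is the asymmetry behind the DataStreamWriterV2 suggestion: the batch writer makes table creation explicit and configurable, while the streaming `toTable` leaves it implicit.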
There's already an inconsistency in DataStreamWriter: in the DSv1 path the `partitionBy` configuration is in effect (we're still discussing what the right behavior is, "respect the existing table's partitioning and only use `partitionBy` when creating the table" vs. "pass it in any way"), but in the DSv2 path `partitionBy` is currently ignored (it would probably be used when creating the table, but the latter isn't possible). What will we do for a DSv1 table? We should resolve the confusion, and the effects should also be documented in the javadoc: creating the table vs. the table already existing, DSv1 vs. DSv2 (all 4 combinations should be documented).

So there are different opinions across different folks:

1. @zsxwing and I tend to agree on the necessity of DataStreamWriterV2, but @cloud-fan and @brkyvz seem to mind adding it.
2. I think end users should state their decision to create the table or not when starting the query, while others consider the current behavior a trade-off for better usability for more people.

I see there's a strong claim about point 2, so I'll step back on it as well, with the claim that we should stop adding more confusion to DataStreamWriter (DataStreamWriterV2 should be designed) and that SPARK-33638 is a blocker for 3.2.0. Even if we define it as a blocker for 3.2.0, it will probably be publicly available in the second half of next year at the earliest. There's a gap, but if we at least make a strong promise, I'm OK with that. I'm sorry, but a strong promise only means marking it as a blocker. We tend to defer things and promise to handle them later, but in many cases we struggle with other work and completely forget them. SPARK-27237 was one such case.

----------------------------------------------------------------
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
