HeartSaVioR edited a comment on pull request #30521: URL: https://github.com/apache/spark/pull/30521#issuecomment-737466463
>> You have no idea of output mode in DSv1, and that's what I have been concerned about.

> We pass the output mode to the sink here:

My bad, please disregard the sentence `You have no idea of output mode in DSv1`. I still think this is not as good as what we do in DSv2: in DSv2 there's a check for availability, which works as enforcement. In both cases update mode is never supported properly, but at least for complete mode DSv2 enforces truncation.

> We create an output path if it doesn't exist since the beginning in streaming, and I have not heard complaints about this.

We recommend creating a Kafka topic before running the streaming query, since in many cases a topic created with the default configuration tends to be insufficient, and likewise I haven't heard complaints about this. Given that a table has its own configuration, isn't the Kafka topic case the fairer comparison for a table?

> Why not make the behavior of creating table consistent with creating an output path?

I don't think the two need to be consistent; otherwise we should just remove append mode from `saveAsTable` in batch queries. If we think about consistency with the batch path, it should also be possible to *not* create the table even if the table doesn't exist.

I have commented multiple times explaining why we should not create a table by default and how DataStreamWriter cannot cover all cases, so I'll just quote my comments instead.

>> Users are LAAAAZZY. As a developer, I would also prefer that people explicitly create their tables first, but plenty of users complain about that workflow.

> I agree about this, but users don't always want to create a table if it doesn't exist. That's the reason there's append in save mode, and we have no such option in the new approach.

Yes, users are lazy, but that said, they don't always want to assume a new table could be created, nor to provide all the information needed in case of table creation.
If the table exists, these provided options are meaningless and just a burden (and also quite confusing if the existing table has different options).

>> Can't we parse the string partitions as expressions?

> My bad, probably you're talking about DSv2. Even in DataFrameWriter we don't do that (please correct me if I'm mistaken) - please refer to `DataFrameWriter.partitioningAsV2`. The difference between DataFrameWriter and DataFrameWriterV2 is not only the removal of save mode: DataFrameWriter doesn't fully support DSv2 table creation - exactly the same problem I pointed out. In a batch query, you can either prevent unintentionally creating a DSv2 table with immature table properties by using save mode "append", or use DataFrameWriterV2 to create a DSv2 table with full support. There's no such option in the streaming path.

> I think leveraging the old (probably DSv1) options is not sufficient - this doesn't have full coverage of DSv2 tables - no Transform on partitioning, no properties, no options.

> Using source (via format(...)) as USE <provider> is also not intuitive - it is only effective when table creation is taking place, and it occurs implicitly.

Overall, I don't see any deep consideration of v2 tables here, whereas my initial rationale for adding the API was to enable support for v2 tables. Can we please stop thinking only about v1 tables and ensure we also cover v2 tables?
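To illustrate the gap being described (this is a sketch, not code from the PR; the table, column, and property names are made up, and the provider choice is arbitrary), compare what DataFrameWriterV2 can express at table-creation time with what the streaming path offers:

```scala
import org.apache.spark.sql.functions.{col, days}

// Batch, via DataFrameWriterV2: full v2 table creation. Provider,
// Transform-based partitioning, and table properties are all first-class.
df.writeTo("catalog.db.events")
  .using("parquet")                         // explicit provider, not an implicit format(...)
  .partitionedBy(days(col("event_ts")))     // a partition Transform, not just a column name
  .tableProperty("some.table.property", "value")
  .create()

// Streaming, via DataStreamWriter.toTable: only what DataStreamWriter
// already has - plain string partition columns, and format(...) applied
// implicitly (and only) when the table happens to be created.
df2.writeStream
  .format("parquet")
  .partitionBy("event_date")                // no Transforms, no table properties
  .option("checkpointLocation", "/path/to/ckpt")
  .toTable("catalog.db.events")
```

In the batch sketch the user states the provider and partitioning scheme explicitly as part of a deliberate create; in the streaming sketch the same decisions are taken implicitly, and only on the first run against a missing table.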
