HeartSaVioR edited a comment on pull request #30521:
URL: https://github.com/apache/spark/pull/30521#issuecomment-736957997


   > You can perform complete mode writes, which overwrites the entire data 
every time.
   
   Sorry, I probably wasn't clear. This isn't true for the DSv1 Sink interface
unless the data source resorts to the hack of requiring the output mode as a
data source option. The sink has no idea of the output mode in DSv1, and that's
what I have been concerned about: output mode is effectively a no-op for DSv1
sinks. We allow update/complete queries to be handled as append, but that's
only to avoid breaking backward compatibility with old data sources, and we
shouldn't keep doing this.
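   To make this concrete, here is a simplified sketch (not the real Spark
classes - the actual DSv1 trait is
`org.apache.spark.sql.execution.streaming.Sink`, whose `addBatch` takes a
`DataFrame`). The sink only ever sees a batch id and the batch's rows, never
the output mode:

```scala
// Simplified model of the DSv1 sink contract: addBatch receives only the
// batch id and the batch's rows. The query's output mode is never passed
// in, so an update/complete query is indistinguishable from append here.
trait Sink {
  def addBatch(batchId: Long, rows: Seq[String]): Unit
}

// A sink written against this contract can only ever append; it gets no
// signal telling it to truncate existing data for a complete-mode query.
class AppendOnlySink extends Sink {
  val stored = scala.collection.mutable.ArrayBuffer.empty[String]
  override def addBatch(batchId: Long, rows: Seq[String]): Unit =
    stored ++= rows
}
```

   In DSv2 this information is conveyed structurally instead - for example, a
writer that can handle complete mode implements `SupportsTruncate`.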
   
   I already raised a related discussion on the dev mailing list months ago,
but got no response. I wish we wouldn't ignore discussion threads on the dev
mailing list.
   
http://apache-spark-developers-list.1001551.n3.nabble.com/Output-mode-in-Structured-Streaming-and-DSv1-sink-DSv2-table-tt30216.html#a30239
   
   > Users are LAAAAZZY. As a developer, I would also prefer that people 
explicitly create their tables first, but plenty of users complain about that 
workflow.
   
   I agree about this, but users don't always want a table to be created when
it doesn't exist. That's why save mode has `append`, and we have no equivalent
in the new approach. Yes, users are lazy, and that also means they don't always
want to assume a new table could be created and provide all the information
needed for table creation. If the table already exists, those provided options
are meaningless and just a burden (and quite confusing if the existing table
has different options).
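   A toy model of the asymmetry (hypothetical names, just to make the point
concrete): with an explicit append mode no creation options are ever needed,
while with create-if-not-exists semantics the caller must always supply them
"just in case" - and they are silently ignored when the table already exists:

```scala
import scala.collection.mutable

class ToyCatalog {
  private val tables = mutable.Map.empty[String, Map[String, String]]

  // Append-style write: never creates a table, so the caller needs no
  // creation options; a missing table is an explicit error.
  def append(name: String): Either[String, Unit] =
    if (tables.contains(name)) Right(()) else Left(s"table $name does not exist")

  // Create-if-not-exists write: creation options must always be supplied,
  // and are silently dropped when the table already exists. Returns the
  // options the table actually has.
  def createIfNotExists(name: String, options: Map[String, String]): Map[String, String] =
    tables.getOrElseUpdate(name, options)
}
```

   The second caller in such a scheme has no way to tell whether its options
were applied or discarded, which is exactly the confusion described above.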
   
   > Can't we parse the string partitions as expressions?
   
   ~The DSv1 interface doesn't allow providing expressions for partitioning;
please refer to the definition of DataSource. Parsing and interpreting the
string partition columns would be entirely the data source's role. This is
quite different from what we do for DSv2. That said, we can't fully leverage
the create-table functionality of DSv2 in interfaces based on DSv1, like
DataStreamWriter.~
   
   My bad - you're probably talking about DSv2. Even in DataFrameWriter we
don't do that (please correct me if I'm mistaken) - see
`DataFrameWriter.partitioningAsV2`. The difference between DataFrameWriter and
DataFrameWriterV2 is not only the removal of save mode: DataFrameWriter doesn't
fully support DSv2 table creation either, which is the same problem I pointed
out. In a batch query you can prevent creating a wrong DSv2 table by using the
"append" save mode, or use DataFrameWriterV2 for DSv2 tables. There's no such
option in the streaming path.
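   The gap can be sketched like this (a simplified model, not Spark's actual
classes - in Spark, `DataFrameWriter.partitionBy` takes column names while
`DataFrameWriterV2.partitionedBy` takes transform expressions):

```scala
// Simplified model of partitioning transforms as DSv2 sees them.
sealed trait Transform
case class IdentityTransform(col: String) extends Transform
case class DaysTransform(col: String) extends Transform
case class BucketTransform(numBuckets: Int, col: String) extends Transform

object PartitionSketch {
  // v1-style partitioning: plain column names can only ever be mapped to
  // identity transforms (roughly what partitioningAsV2 does).
  def partitioningAsV2(colNames: Seq[String]): Seq[Transform] =
    colNames.map(IdentityTransform.apply)

  // v2-style partitioning: the caller hands over transforms directly, so
  // expressions like days(ts) or bucket(16, id) are representable.
  def partitionedBy(transforms: Transform*): Seq[Transform] = transforms
}
```

   Anything beyond identity partitioning is simply not expressible through the
string-based v1-style API, so it can never round-trip into a full DSv2 table
definition.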


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


