HeartSaVioR edited a comment on pull request #30521:
URL: https://github.com/apache/spark/pull/30521#issuecomment-737505823


   > The complete mode doesn't require truncate + insert. It's just telling the 
sink to overwrite the table entirely, and overwrite doesn't have to be 
implemented as truncate + insert. 
   
   Truncate is defined as overwrite with a where condition that is literally true. We are talking about the same thing, and my point is that the availability is checked by Spark. If that's not a big deal, OK.
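   To make the "overwrite with a condition that is literally true" point concrete, here is a minimal self-contained sketch. The trait and object names mirror the DSv2 write interfaces, but these are simplified stand-ins, not the real Spark classes:

```scala
// Sketch only: simplified stand-ins for the DSv2 write-builder interfaces.
// These are NOT the real org.apache.spark.sql.connector classes.
sealed trait Filter
case object AlwaysTrue extends Filter

trait WriteBuilder

// The contract discussed above: truncate is just overwrite
// with a filter condition that is literally true.
trait SupportsOverwrite extends WriteBuilder {
  def overwrite(filters: Array[Filter]): WriteBuilder

  def truncate(): WriteBuilder = overwrite(Array[Filter](AlwaysTrue))
}

// Toy sink showing the equivalence: truncate drops every existing row,
// so a subsequent insert leaves only the new data.
class ToySink(var rows: Seq[String]) extends SupportsOverwrite {
  override def overwrite(filters: Array[Filter]): WriteBuilder = {
    // AlwaysTrue matches every row, so all existing rows are removed.
    if (filters.contains(AlwaysTrue)) rows = Seq.empty
    this
  }

  def insert(newRows: Seq[String]): Unit = rows = rows ++ newRows
}
```

   With this toy sink, `truncate()` followed by `insert(...)` leaves only the inserted rows, which is exactly the complete-mode semantics being debated.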
   
   > IMO, table can be viewed as an alias of a path.
   
   That is limited to file-based tables - with DSv2 you should be able to match anything feasible to a table. I've seen efforts in the community to support a JDBC-specific catalog, and there was a talk about applying DSv2 to Cassandra. We could even create a Kafka-specific catalog, which I considered a bit but got stuck on the schema, as we wouldn't want to keep providing just key and value in binary form even for a Kafka table.
   
   For me, `table is an alias of a path` isn't correct, at least for DSv2.
   
   > I doubt we can support v2 table perfectly in the existing 
DataStreamWriter. It's likely we would need to add DataStreamWriterV2 similar 
to DataFrameWriterV2. 
   
   That has been the main concern. The `saveAsTable` API was initially proposed to be added to a DataStreamWriterV2, but that was rejected as not being worth a new class, hence it was added to DataStreamWriter, unlike my initial intention. This would have been cleaner if we had just followed the same path with DataStreamWriterV2 as we did for DataFrameWriterV2. DataFrameWriterV2 should be able to deal with v1 tables, so this won't be a problem for the streaming case either, and it lets us focus on the "table" concept with full v2 table support as a requirement.
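   For illustration, the batch-side DataFrameWriterV2 exposes a table-first builder (`df.writeTo("catalog.db.table").append()`), and a streaming counterpart could follow the same shape. The sketch below models only that API surface with toy types - `StreamQuery` and `DataStreamWriterV2Sketch` are invented names for this comment, not Spark classes:

```scala
// Hypothetical sketch mirroring the builder shape of the batch
// DataFrameWriterV2; none of these types exist in Spark.
final case class StreamQuery(
    table: String,
    outputMode: String,
    options: Map[String, String])

class DataStreamWriterV2Sketch(table: String) {
  private var opts = Map.empty[String, String]

  def option(key: String, value: String): this.type = {
    opts += (key -> value)
    this
  }

  // Analogue of DataFrameWriterV2.append(): emit new rows each trigger.
  def append(): StreamQuery = StreamQuery(table, "append", opts)

  // Analogue of overwrite-the-table semantics, i.e. complete mode.
  def complete(): StreamQuery = StreamQuery(table, "complete", opts)
}
```

   A caller would then write something like `new DataStreamWriterV2Sketch("cat.db.events").option("checkpointLocation", "/tmp/cp").append()`, keeping the table as the primary handle just like the batch side does.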
   
   @cloud-fan Can we please consider this again?
   
   > I prefer to focus on making v1 table work since the file format in 
streaming doesn't support DSv2
   
   IMO, supporting DSv2 for file formats is what we need to spend effort fixing ASAP. If DSv2 lacks something that makes this impossible, we should identify what is missing and fix that as well. The DSv1 streaming API isn't officially supported - it's behind a private package. That means we are not dogfooding DSv2, which is the only official way to implement a data source from the ecosystem.
   
   If we document that DataStreamWriter doesn't fully support DSv2 and "promise" DataStreamWriterV2 (a TODO comment in the codebase, a JIRA issue, etc.), I'm OK with tolerating it for now. As I said, DataFrameWriter is already in that situation - it's just that there's an alternative for batch queries, whereas for now there's no alternative for streaming queries.


----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
[email protected]


