[GitHub] rdblue commented on issue #23836: [SPARK-26915][SQL] DataFrameWriter.save() should write without schema validation

GitBox Tue, 19 Feb 2019 11:43:39 -0800

URL: https://github.com/apache/spark/pull/23836#issuecomment-465280946
 
 
   @jose-torres, I'm not saying that the default should be v1 forever.
   
   The right way to move over is to develop v1 and v2 in parallel and switch 
over when we can validate that the behavior is the same. Right now, v2 can't run 
CTAS plans so we clearly can't switch. But when v2 has all of the necessary 
logical plans, then we can start running the existing behavior tests on v2 to 
see what changes remain, like changing validation for path-based tables.
   
   Continuing to use SaveMode actually inhibits the move to v2. If write paths 
use SaveMode, then they can pass behavior tests and appear to work when they 
actually don't.
   
   Also, let me clarify my comment on using v1. I think we need to keep v1 
around until the process of moving to v2 is complete because there are code 
paths that we know can't be changed to v2 without altering behavior. For 
example, we've agreed to standardize behavior on what file sources do. Users 
will have to choose between existing behavior and using v2 for other sources.
   
   I'm not confident that all v1 behaviors will be available in v2. In v1, a 
CTAS plan can be validated against an existing table. In some cases, that CTAS 
should fail because the table exists (SQL) and in some cases, the plan that is 
created should be AppendData instead of CTAS (DataFrameWriter). Does the 
validation for AppendData work exactly the same way as validating a CTAS that 
is actually an append? My guess is that it doesn't, and that we might not want 
it to.
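   
   To make the two cases concrete, here is a toy model in plain Python. It is 
not Spark code: `choose_plan`, the `origin` labels, and the exception type are 
hypothetical names used only for illustration. It covers just the two cases 
above, SQL CTAS and a DataFrameWriter append, and nothing else.

```python
class TableAlreadyExists(Exception):
    """Raised when a create-only write targets an existing table."""


def choose_plan(origin: str, table_exists: bool) -> str:
    """Toy model: return the name of the logical plan for a table write.

    `origin` is a hypothetical label, not a Spark concept:
      "sql_ctas"      - a SQL CREATE TABLE ... AS SELECT statement
      "writer_append" - a DataFrameWriter table write in append mode
    """
    if origin not in ("sql_ctas", "writer_append"):
        raise ValueError(f"unknown origin: {origin}")
    if not table_exists:
        # Both entry points create the table when it is missing.
        return "CTAS"
    if origin == "sql_ctas":
        # SQL: the statement should fail because the table exists.
        raise TableAlreadyExists("CTAS target already exists")
    # DataFrameWriter append: the plan should be AppendData, not CTAS.
    return "AppendData"
```

The point of the sketch is that the same starting plan has two different 
outcomes once the table exists, so validation written for one outcome is not 
automatically correct for the other.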
   
   I think the final solution is to introduce a new write API that always uses 
v2 and makes it obvious what plan will be used. I've proposed such an API in 
the logical plans SPIP. Moving users to that API and eventually deprecating the 
DataFrameWriter API will take care of migrating the last few cases (which 
should be minor) from v1 to v2.

