Github user danielvdende commented on the issue:

    https://github.com/apache/spark/pull/20057
  
    Hi guys, @Fokko @gatorsmile, I completely agree with what @Fokko mentioned: 
our main reason for wanting to move away from Sqoop is stability, and to get 
rid of MapReduce in preparation for our move to Kubernetes (or something 
similar). We've also found Spark to be much faster than Sqoop. As for why we 
need the feature in this PR: we have some tables in PostgreSQL that are linked 
by foreign keys, and we have specified a schema for these tables. If we use 
the drop-and-recreate option, Spark infers the schema itself, overriding our 
PostgreSQL schema. The two should of course match up, but I personally don't 
like that Spark can do this (and that you can't explicitly tell it not to).
    
    Because of this behaviour, we currently need 2 tasks in Airflow (as 
@Fokko mentioned) to ensure the tables are truncated but the schema stays in 
place. This PR would let us specify, in a single idempotent (Airflow) task, 
that we want to truncate the table before loading new data into it. The 
CASCADE option ensures that we don't break foreign key relations and cause 
errors.
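    For illustration, the difference boils down to Spark issuing a TRUNCATE 
statement (which keeps the PostgreSQL-defined schema and constraints) instead 
of a DROP/CREATE. A minimal sketch of the SQL this implies — the helper name 
here is hypothetical, not the PR's actual code:

    ```python
    def truncate_query(table, cascade=False):
        """Build the PostgreSQL statement a truncate-based overwrite would
        issue instead of dropping and recreating the table.

        TRUNCATE ... ONLY limits the operation to the named table; adding
        CASCADE also empties tables that reference it via foreign keys, so
        the write doesn't fail on FK constraints."""
        query = "TRUNCATE TABLE ONLY {}".format(table)
        if cascade:
            query += " CASCADE"
        return query

    # With CASCADE, referencing tables are emptied too instead of raising
    # a foreign-key violation.
    print(truncate_query("events", cascade=True))
    ```

    On the Spark side this would presumably pair with an overwrite write 
along the lines of `df.write.mode("overwrite").option("truncate", "true")
.jdbc(url, table, props)`, with the new option from this PR toggling the 
CASCADE behaviour — exact option names per the PR, of course.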
    
    To be clear, this therefore isn't emulating a Sqoop feature (a Sqoop 
task isn't idempotent), but is in fact an improvement on what Sqoop offers.

