Github user danielvdende commented on the issue:
https://github.com/apache/spark/pull/20057
Hi guys, @Fokko @gatorsmile, completely agree with what @Fokko mentioned,
our main reason for wanting to get away from Sqoop is also for stability
reasons and to get rid of MapReduce in preparation for our move to Kubernetes
(or something similar). We've also found Spark to be much faster than Sqoop. In
terms of why we need the feature in this PR: we have some tables in PostgreSQL
that have foreign keys linking them. We have also specified a schema for these
tables. If we use the drop-and-recreate option, Spark infers the schema itself
and recreates the table with it, overriding the schema we defined in
PostgreSQL. The two should match up, of course, but I don't like that Spark can
do this at all (and that there is no way to explicitly tell it not to).
Because of this behaviour, we currently require 2 tasks in Airflow (as
@Fokko mentioned) to ensure the tables are truncated, but the schema stays in
place. This PR would enable us to specify in a single, idempotent (Airflow)
task that we want to truncate the table before writing new data to it. The
cascade option lets us truncate without breaking foreign-key relations and
triggering errors.
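To make the difference concrete, here is a minimal, self-contained sketch (plain Python/sqlite3 rather than Spark, and with hypothetical table names) of why a truncate-style refresh preserves constraints that a drop-and-recreate from an inferred schema silently loses:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")
conn.execute("CREATE TABLE parent (id INTEGER PRIMARY KEY)")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES parent(id))"
)
conn.execute("INSERT INTO parent VALUES (1)")
conn.execute("INSERT INTO child VALUES (1, 1)")

# Drop-and-recreate from an *inferred* schema: the writer only sees column
# names and types, not constraints, so the foreign key is lost.
conn.execute("DROP TABLE child")
conn.execute("CREATE TABLE child (id INTEGER, parent_id INTEGER)")
fk_after_recreate = conn.execute("PRAGMA foreign_key_list(child)").fetchall()

# Truncate-style refresh (SQLite has no TRUNCATE, so DELETE stands in):
# the table definition, including its foreign key, stays intact.
conn.execute("DROP TABLE child")
conn.execute(
    "CREATE TABLE child ("
    " id INTEGER PRIMARY KEY,"
    " parent_id INTEGER REFERENCES parent(id))"
)
conn.execute("INSERT INTO child VALUES (1, 1)")
conn.execute("DELETE FROM child")
fk_after_truncate = conn.execute("PRAGMA foreign_key_list(child)").fetchall()

print(len(fk_after_recreate))  # 0 -> constraint gone after recreate
print(len(fk_after_truncate))  # 1 -> constraint preserved after truncate
```

The same reasoning applies to our PostgreSQL tables: truncating (with cascade, where foreign keys point at the table) keeps the schema we defined, while drop-and-recreate replaces it with whatever Spark infers.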
To be clear, this therefore isn't emulating a Sqoop feature (as a Sqoop
task isn't idempotent), but is in fact improving on what Sqoop offers.