Github user Fokko commented on the issue:
https://github.com/apache/spark/pull/20057
Hi @gatorsmile, thanks for putting it to the test. The main reasons why I
personally dislike Sqoop are:
- **Legacy.** The old MapReduce model should be buried in the coming years. As
a data engineering consultant, I see more and more people questioning the whole
Hadoop stack. With Sqoop you still need to run MapReduce jobs, and this isn't
easy on other platforms such as Kubernetes.
- **Stability.** I see Sqoop jobs fail quite often, and there is no clean way
of retrying them atomically. For example, when we run a Sqoop job from Airflow,
we cannot simply retry the operation: before importing data from an RDBMS to
HDFS, we have to make sure that the target directory of the previous run has
been deleted (see the sketch below for an idempotent alternative with Spark JDBC).
This is also where Spark JDBC comes in. For example, in the future we would
like to be able to delete single partitions, but this is still work in
progress. Maybe @danielvdende can elaborate a bit on their use case.
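
To illustrate the retry point, here is a minimal sketch (not from this PR; the connection details, table name, and output path are all made up) of how a Spark JDBC import can be made idempotent by overwriting the target directory, so a failed run can simply be retried without manual cleanup:

```scala
import org.apache.spark.sql.{SaveMode, SparkSession}

object JdbcImport {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("jdbc-import")
      .getOrCreate()

    // Hypothetical connection details; the JDBC driver (here PostgreSQL)
    // must be on the classpath.
    val df = spark.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://db-host:5432/sales")
      .option("dbtable", "public.orders")
      .option("user", "etl")
      .option("password", sys.env("DB_PASSWORD"))
      .load()

    // SaveMode.Overwrite replaces the target directory on each run, so
    // retrying after a failure needs no manual deletion of stale output.
    df.write
      .mode(SaveMode.Overwrite)
      .parquet("hdfs:///data/raw/orders")

    spark.stop()
  }
}
```

Because the write is a full overwrite, an Airflow retry of this task is safe; the finer-grained case of replacing only single partitions is the part that is still work in progress.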