Github user Fokko commented on the issue:

    https://github.com/apache/spark/pull/20057
  
    Hi @gatorsmile, thanks for putting it to the test. The main reasons why I personally dislike Sqoop are:
    
    - **Legacy.** The old MapReduce model should be buried in the coming years. As a data engineering consultant, I see more and more people questioning the whole Hadoop stack. With Sqoop you still need to run MapReduce jobs, and that isn't easy on other platforms such as Kubernetes.
    - **Stability.** I see Sqoop jobs fail quite often, and there is no clean way to retry them atomically. For example, when a Sqoop job runs on Airflow, we cannot simply retry the operation: before we re-import data from an RDBMS to HDFS, we first have to make sure that the target directory of the previous run has been deleted (see the sketch after the next paragraph).
    
    This is also where Spark JDBC comes in. For example, in the future we would like to be able to delete single partitions, but that is still work in progress. Maybe @danielvdende can elaborate a bit on their use case.
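    To illustrate the retry point, here is a minimal sketch of what an idempotent import can look like with Spark JDBC. The connection details, table name, and output path are made up for illustration; the point is that `SaveMode.Overwrite` lets a rerun (e.g. an Airflow retry) replace the previous output instead of failing on a non-empty target directory, which is exactly the manual cleanup Sqoop forces on us.

    ```scala
    import org.apache.spark.sql.{SaveMode, SparkSession}

    object JdbcImportSketch {
      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .appName("jdbc-import-sketch")
          .getOrCreate()

        // Hypothetical connection details for the source RDBMS.
        val orders = spark.read
          .format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/sales")
          .option("dbtable", "public.orders")
          .option("user", "reader")
          .option("password", sys.env("DB_PASSWORD"))
          .load()

        // Overwrite makes the job idempotent: rerunning it replaces the
        // target instead of colliding with a partial previous run.
        orders.write
          .mode(SaveMode.Overwrite)
          .parquet("hdfs:///data/raw/orders")

        spark.stop()
      }
    }
    ```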

