Hi,

Be careful with Spark JDBC as a replacement for Sqoop on large tables. Sqoop can handle a source table of any size, while the stock Spark JDBC reader cannot: it does provide a way to distribute the read across multiple partitions, but Spark is limited by executor memory, whereas Sqoop is limited only by HDFS space.
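To make the limitation concrete, here is a minimal pySpark sketch of the stock JDBC partitioned read (the URL, table, column, and bounds are placeholders, not a real setup): numPartitions splits the scan into bounded range queries, but each partition still has to fit in an executor's working memory.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("jdbc-read").getOrCreate()

    # Placeholder connection details, for illustration only.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/mydb")
          .option("dbtable", "public.big_table")
          .option("user", "etl_user")
          .option("password", "secret")
          # Split the scan into parallel range queries on a numeric column.
          .option("partitionColumn", "id")
          .option("lowerBound", "1")
          .option("upperBound", "100000000")
          .option("numPartitions", "64")
          .load())

    # Each of the 64 partitions is read by one executor task; a skewed or
    # oversized partition can exhaust executor memory, unlike Sqoop,
    # whose mappers stream straight to HDFS.
    df.write.mode("overwrite").parquet("hdfs:///warehouse/big_table")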
As a result, I have written a Spark library (for Postgres only right now) which overcomes the core Spark JDBC limitations. It handles any workload, and in my tests it was 8 times faster than Sqoop. I have not tested it with Airflow, but it is compatible with Apache Livy and pySpark.

https://github.com/EDS-APHP/spark-postgres

On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> Hi,
>
> I am searching for a way to replace Apache Sqoop.
>
> I am analyzing SparkJDBCOperator, but I don't understand how to use it.
>
> Is it a version of the SparkSubmit operator that includes a JDBC
> connection?
>
> Do I need to include Spark code?
>
> Any example?
>
> Thanks, I am very lost.
>
> Regards,
> Iván Robla

-- 
nicolas
