On Sun, Feb 10, 2019 at 12:45:33PM +0100, Driesprong, Fokko wrote:
> Since there is also Pyspark support, it should be relatively
> straightforward to invoke the spark-postgres library from Airflow.

Yes, pyspark is supported. In version 3 of the spark-postgres library I
will improve the pyspark API, which is not user friendly right now.
Still, I have tested it successfully with pyspark, and it gives the same
performance and reliability as Scala Spark.
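For reference, here is a minimal PySpark sketch of wiring such a read
into a job. The format name "postgres" and the option keys below are
assumptions for illustration only; the exact names for the version you
deploy are in the project README, and the jar must be on the classpath
(e.g. spark-submit --jars spark-postgres.jar):

```python
# Sketch only: the format name and option keys are assumptions for
# illustration; check the spark-postgres README for the real API.

def spark_postgres_options(host, database, user, query, partitions=4):
    """Build the option map for a spark-postgres read.

    Spark DataSource options are passed as strings, so the numeric
    partition count is stringified here.
    """
    return {
        "host": host,
        "database": database,
        "user": user,
        "query": query,
        "partitions": str(partitions),
    }


def read_with_spark_postgres(spark, opts):
    """Run the read through a SparkSession.

    Requires the spark-postgres jar on the driver and executor
    classpath; returns a DataFrame.
    """
    reader = spark.read.format("postgres")  # hypothetical short name
    for key, value in opts.items():
        reader = reader.option(key, value)
    return reader.load()
```

The option-building step is kept separate from the SparkSession call so
the same dictionary can be reused from an Airflow task or a Livy batch.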

The idea might work for MySQL and Oracle, but it would require a deep
understanding of those databases and of their specific bulk-loading
connectors, the way spark-postgres relies on Postgres's COPY command.
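To make the dependency concrete, this is roughly what the COPY bulk path
looks like from Python with psycopg2 (the table and column names are
made up for illustration):

```python
# Postgres's COPY is the bulk path spark-postgres relies on: it streams
# rows into a table far faster than row-by-row INSERTs over JDBC.

def copy_sql(table, columns):
    """Render a COPY ... FROM STDIN statement for a CSV stream."""
    return "COPY {} ({}) FROM STDIN WITH (FORMAT csv)".format(
        table, ", ".join(columns))


def bulk_load(conn, table, columns, csv_stream):
    """Stream csv_stream into table over an open psycopg2 connection.

    cursor.copy_expert is psycopg2's entry point to COPY; it reads the
    CSV data from the file-like csv_stream.
    """
    with conn.cursor() as cur:
        cur.copy_expert(copy_sql(table, columns), csv_stream)
    conn.commit()
```

MySQL and Oracle have no direct COPY equivalent (LOAD DATA INFILE and
SQL*Loader are the closest analogues), which is why porting the approach
means understanding each engine's own bulk interface.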


regards

> Op za 9 feb. 2019 om 12:16 schreef Nicolas Paris <[email protected]>:
> 
> > Hi
> >
> > Be careful with Spark JDBC as a replacement for Sqoop on large
> > tables. Sqoop can handle a source table of any size, while the Spark
> > JDBC design cannot: although it provides a way to distribute the
> > read over multiple partitions, Spark is limited by executor memory,
> > whereas Sqoop is limited only by HDFS space.
> >
> > As a result, I have written a Spark library (for Postgres only right
> > now) which overcomes the core Spark JDBC limitations. It handles any
> > workload, and in my tests it was 8 times faster than Sqoop. I have
> > not tested it with Airflow, but it is compatible with Apache Livy
> > and PySpark.
> >
> > https://github.com/EDS-APHP/spark-postgres
> >
> >
> > On Fri, Feb 01, 2019 at 01:53:57PM +0100, Iván Robla Albarrán wrote:
> > > Hi,
> > >
> > > I am searching for a way to replace Apache Sqoop.
> > >
> > > I am looking at SparkJDBCOperator, but I don't understand how to
> > use it.
> > >
> > > Is it a version of the SparkSubmit operator that takes a JDBC
> > > connection?
> > >
> > > Do I need to write Spark code myself?
> > >
> > > Any example?
> > >
> > > Thanks, I am very lost.
> > >
> > > Regards,
> > > Iván Robla
> >
> > --
> > nicolas
> >

-- 
nicolas
