This will depend on multiple factors. Assuming we are talking about
significant volumes of data, I'd prefer Sqoop over Spark on YARN if
ingestion performance is the sole consideration (which is true in many
production use cases). Sqoop offers some potential optimisations,
especially around using native database batch extraction tools, that Spark
cannot take advantage of. The overhead of using MapReduce (actually a
map-only job) is insignificant over a large corpus of data. Further, in a
shared cluster, if the data volume is skewed for the given partition key,
Spark without dynamic container allocation can be significantly
inefficient from a cluster resource usage perspective. With dynamic
allocation enabled it is less so, but Sqoop still has a slight edge
because of the time Spark holds on to resources before giving them up.
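
For a sense of what that looks like, here is a minimal Sqoop sketch (the
connection URL, credentials, table, and mapper count are placeholders, not
from this thread). The --direct flag switches to the database's native
dump utility where one exists (e.g. mysqldump for MySQL), which is the
native batch extraction path mentioned above:

  sqoop import \
    --connect jdbc:mysql://dbhost:3306/sales \
    --username etl_user -P \
    --table transactions \
    --split-by txn_id \
    --num-mappers 8 \
    --direct \
    --target-dir /data/raw/transactions

Note that --direct is only supported for some databases (e.g. MySQL and
PostgreSQL); without it, Sqoop falls back to plain JDBC reads.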

If ingestion is part of a more complex DAG that relies on Spark's cache
(RDD, DataFrame, or Dataset), then Spark JDBC has a significant advantage
in being able to cache the data without persisting it to HDFS first. But
whether this translates into significantly better overall performance for
the DAG, or for the cluster, depends on the DAG's stages and their
individual performance. In general, if the ingestion stage is the main
bottleneck in the DAG, the advantage will be substantial.
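
As a rough illustration (Spark 2.x Scala API; the URL, table, columns, and
bounds are hypothetical), a partitioned JDBC read cached for downstream
stages might look like this:

  import java.util.Properties
  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("jdbc-ingest").getOrCreate()

  val props = new Properties()
  props.setProperty("user", "etl_user")
  props.setProperty("password", sys.env.getOrElse("DB_PASSWORD", ""))

  // Partition the read on a numeric column so it runs in parallel,
  // analogous to Sqoop's --split-by / --num-mappers.
  val df = spark.read.jdbc(
    "jdbc:mysql://dbhost:3306/sales", // hypothetical source
    "transactions",                   // table to read
    "txn_id",                         // numeric partition column
    1L,                               // lowerBound of txn_id
    10000000L,                        // upperBound of txn_id
    8,                                // numPartitions
    props)

  // Cache in memory for the rest of the DAG; no intermediate HDFS write.
  df.cache()

  // Downstream stages reuse the cached data directly.
  val daily = df.groupBy("txn_date").count()

The lowerBound/upperBound values only control how the partition ranges
are split; rows outside them are still read, so a rough min/max of the
split column is enough.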

Hope this provides a general direction to consider in your case.

On 25 Aug 2016 3:09 a.m., "Venkata Penikalapati" <
mail.venkatakart...@gmail.com> wrote:

> Team,
> Please help me choose between Sqoop and Spark JDBC to fetch data from an
> RDBMS. Sqoop has a lot of optimizations for fetching data; does Spark
> JDBC have those as well?
>
> I'm performing some analytics in Spark, for which the data resides in an
> RDBMS.
>
> Please guide me on this.
>
>
> Thanks
> Venkata Karthik P
>
>
