DataFrames are a higher-level API for working with tabular data; RDDs are
used underneath. You can use either and easily convert between them in your
code as necessary.
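
For example, a minimal sketch of moving between the two (assuming a
standalone Scala app; in the spark-shell, sc and sqlContext already exist,
and the app name and data here are made up):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    val sc = new SparkContext(new SparkConf().setAppName("rdd-df-demo"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    // Start with an RDD of tuples...
    val rdd = sc.parallelize(Seq(("alice", 34), ("bob", 29)))

    // ...lift it into a DataFrame with named columns...
    val df = rdd.toDF("name", "age")

    // ...and drop back to an RDD (of Rows) whenever you need to.
    val rows = df.rdd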

DataFrames provide a nice abstraction for many cases, so it may be easier
to code against them. If you're used to thinking in terms of collections
rather than tables, though, you may find RDDs more natural. DataFrames can
also be faster, since Spark will do some optimizations under the hood (the
Catalyst optimizer); in particular, because DataFrame operations execute
inside the JVM, PySpark avoids the overhead of serializing every row out to
the Python workers. DataFrames may also perform better if you're reading
structured data, such as a Hive table or Parquet files.
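
For instance, reading Parquet through the DataFrame API lets Catalyst prune
columns and push the predicate down into the scan (the path below is
hypothetical, ts is assumed to be a numeric epoch column, and this reuses
the sqlContext and implicits from the sketch above):

    // Read structured data through the DataFrame API. Catalyst can
    // prune columns and push the filter down into the Parquet scan,
    // so far less data is read and deserialized.
    val events = sqlContext.read.parquet("/data/events.parquet")  // hypothetical path
    val recent = events.select("user", "ts").filter($"ts" > 1450000000L)
    recent.show()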

I recommend you prefer DataFrames, switching over to RDDs as necessary
(when you need to perform an operation not supported by DataFrames / Spark
SQL); see the sketch below.
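
For example, a per-partition operation like this one isn't directly
expressible in the 1.5 DataFrame API, so you can drop to the RDD and come
back (this reuses df from the first sketch):

    // Drop to the RDD for a custom per-partition operation, then
    // convert the result back to a DataFrame for further SQL work.
    val tagged = df.rdd
      .mapPartitionsWithIndex { (i, rows) =>
        rows.map(r => (r.getString(0), r.getInt(1), i))
      }
      .toDF("name", "age", "partition")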

HOWEVER (and this is a big one), Spark 1.6 will have yet another API:
Datasets. The release of Spark 1.6 is currently being finalized, and I
would expect it in the next few days. You will probably want to use the new
API once it's available.
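
Based on the 1.6 release candidates, a typed Dataset sketch might look like
the following (details could still change before the release; this reuses
df and the implicits from the first sketch):

    // A Dataset gives typed, compile-time-checked access like an RDD,
    // while still going through the Catalyst optimizer like a DataFrame.
    case class Person(name: String, age: Int)

    val people = df.as[Person]          // DataFrame -> Dataset[Person]
    val adults = people.filter(_.age >= 18)
    adults.show()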


On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com>
wrote:

> Hi,
> I am a newbie to Spark and a bit confused about RDDs and DataFrames in Spark.
> Can somebody explain to me, with use cases, which one to use when?
>
> Would really appreciate the clarification .
>
> Thanks,
> Divya
>
