DataFrames are a higher-level API for working with tabular data; RDDs are used underneath. You can use either, and easily convert between them in your code as necessary.
DataFrames provide a nice abstraction for many cases, so it may be easier to code against them. Though if you're used to thinking in terms of collections rather than tables, you may find RDDs more natural.

DataFrames can also be faster, since Spark will do some optimizations under the hood; if you are using PySpark, this avoids most of the Python overhead, because DataFrame operations are executed inside the JVM rather than serialized out to Python workers. DataFrames may also perform better if you're reading structured data, such as a Hive table or Parquet files.

I recommend you prefer DataFrames, switching over to RDDs as necessary (when you need to perform an operation not supported by DataFrames / Spark SQL).

HOWEVER (and this is a big one), Spark 1.6 will have yet another API - Datasets. The release of Spark 1.6 is currently being finalized, and I would expect it in the next few days. You will probably want to use the new API once it's available.

On Sun, Dec 27, 2015 at 9:18 PM, Divya Gehlot <divya.htco...@gmail.com> wrote:
> Hi,
> I am new bee to spark and a bit confused about RDDs and DataFrames in Spark.
> Can somebody explain me with the use cases which one to use when?
>
> Would really appreciate the clarification.
>
> Thanks,
> Divya