Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to
get the community's opinion.

The context is that SchemaRDD is becoming a common data format used for
bringing data into Spark from external systems, and used for various
components of Spark, e.g. MLlib's new pipeline API. We also expect more and
more users to be programming directly against SchemaRDD API rather than the
core RDD API. SchemaRDD, through its less commonly used DSL originally
designed for writing test cases, always has the data-frame like API. In
1.3, we are redesigning the API to make the API usable for end users.


There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even
though it would contain some RDD functions like map, flatMap, etc), and
calling it Schema*RDD* while it is not an RDD is highly confusing. Instead.
DataFrame.rdd will return the underlying RDD for all RDD methods.


My understanding is that very few users program directly against the
SchemaRDD API at the moment, because they are not well documented. However,
oo maintain backward compatibility, we can create a type alias DataFrame
that is still named SchemaRDD. This will maintain source compatibility for
Scala. That said, we will have to update all existing materials to use
DataFrame rather than SchemaRDD.

Reply via email to