Hi,

We are considering renaming SchemaRDD -> DataFrame in 1.3, and wanted to get the community's opinion.
The context is that SchemaRDD is becoming a common data format for bringing data into Spark from external systems, and is used by various components of Spark, e.g. MLlib's new pipeline API. We also expect more and more users to program directly against the SchemaRDD API rather than the core RDD API. SchemaRDD has always had a data-frame-like API, through its less commonly used DSL that was originally designed for writing test cases. In 1.3, we are redesigning that API to make it usable for end users.

There are two motivations for the renaming:

1. DataFrame seems to be a more self-evident name than SchemaRDD.

2. SchemaRDD/DataFrame is actually not going to be an RDD anymore (even though it would contain some RDD functions like map, flatMap, etc.), and calling it Schema*RDD* while it is not an RDD is highly confusing. Instead, DataFrame.rdd will return the underlying RDD for all RDD methods.

My understanding is that very few users program directly against the SchemaRDD API at the moment, because it is not well documented. However, to maintain backward compatibility, we can create a type alias for DataFrame that is still named SchemaRDD. This will maintain source compatibility for Scala. That said, we will have to update all existing materials to use DataFrame rather than SchemaRDD.
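
To make the compatibility idea concrete, here is a minimal sketch of how a type alias could keep old SchemaRDD code compiling. It does not use the real Spark classes: `RDD`, `Row`, and `DataFrame` below are simplified stand-ins, and the `sketch` and `Demo` objects and the `countColumns` function are hypothetical names chosen for illustration. Where the alias would actually live in Spark is not decided here.

```scala
// Simplified stand-ins for illustration only; these are NOT the Spark classes.
object sketch {
  // Stand-in for the core RDD type.
  final class RDD[T](val elems: Seq[T]) {
    def map[U](f: T => U): RDD[U] = new RDD(elems.map(f))
  }

  final case class Row(values: Seq[Any])

  // DataFrame wraps an RDD instead of extending it.
  final class DataFrame(val schema: Seq[String], rows: Seq[Row]) {
    // Explicit access to the underlying RDD for all RDD methods.
    def rdd: RDD[Row] = new RDD(rows)

    // A few RDD-style functions can still be offered by delegating to rdd.
    def map[U](f: Row => U): RDD[U] = rdd.map(f)
  }

  // The old name becomes an alias, so existing Scala source keeps compiling.
  type SchemaRDD = DataFrame
}

object Demo {
  import sketch._

  // Hypothetical user code written against the old name.
  def countColumns(data: SchemaRDD): Int = data.schema.size

  def main(args: Array[String]): Unit = {
    val df = new DataFrame(Seq("name", "age"), Seq(Row(Seq("alice", 30))))
    println(countColumns(df))   // a DataFrame is accepted where SchemaRDD is expected
    println(df.rdd.elems.size)  // RDD operations go through df.rdd
  }
}
```

Because a Scala type alias is resolved at compile time, this would give source compatibility only; code compiled against the old class would presumably still need to be recompiled.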