[GitHub] spark pull request #13592: [SPARK-15863][SQL][DOC] Initial SQL programming g...

felixcheung Mon, 20 Jun 2016 12:02:58 -0700

Github user felixcheung commented on a diff in the pull request:

    https://github.com/apache/spark/pull/13592#discussion_r67747099
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -12,130 +12,129 @@ title: Spark SQL and DataFrames
     Spark SQL is a Spark module for structured data processing. Unlike the 
basic Spark RDD API, the interfaces provided
     by Spark SQL provide Spark with more information about the structure of 
both the data and the computation being performed. Internally,
     Spark SQL uses this extra information to perform extra optimizations. 
There are several ways to
    -interact with Spark SQL including SQL, the DataFrames API and the Datasets 
API. When computing a result
    +interact with Spark SQL including SQL and the Dataset API. When computing 
a result
     the same execution engine is used, independent of which API/language you 
are using to express the
    -computation. This unification means that developers can easily switch back 
and forth between the
    -various APIs based on which provides the most natural way to express a 
given transformation.
    +computation. This unification means that developers can easily switch back 
and forth between
    +different APIs based on which provides the most natural way to express a 
given transformation.
     
     All of the examples on this page use sample data included in the Spark 
distribution and can be run in
     the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
     ## SQL
     
    -One use of Spark SQL is to execute SQL queries written using either a 
basic SQL syntax or HiveQL.
    +One use of Spark SQL is to execute SQL queries.
     Spark SQL can also be used to read data from an existing Hive 
installation. For more on how to
     configure this feature, please refer to the [Hive Tables](#hive-tables) 
section. When running
    -SQL from within another programming language the results will be returned 
as a [DataFrame](#DataFrames).
    +SQL from within another programming language the results will be returned 
as a [DataFrame](#datasets-and-dataframes).
     You can also interact with the SQL interface using the 
[command-line](#running-the-spark-sql-cli)
     or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -## DataFrames
    +## Datasets and DataFrames
     
    -A DataFrame is a distributed collection of data organized into named 
columns. It is conceptually
    -equivalent to a table in a relational database or a data frame in 
R/Python, but with richer
    -optimizations under the hood. DataFrames can be constructed from a wide 
array of [sources](#data-sources) such
    -as: structured data files, tables in Hive, external databases, or existing 
RDDs.
    +A Dataset is a new interface added in Spark 1.6 that tries to provide the 
benefits of RDDs (strong
    +typing, ability to use powerful lambda functions) with the benefits of 
Spark SQL's optimized
    +execution engine. A Dataset can be [constructed](#creating-datasets) from 
JVM objects and then
    +manipulated using functional transformations (`map`, `flatMap`, `filter`, 
etc.).
     
    -The DataFrame API is available in 
[Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    -[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    -[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and 
[R](api/R/index.html).
    +The Dataset API is the successor of the DataFrame API, which was 
introduced in Spark 1.3. In Spark
    +2.0, Datasets and DataFrames are unified, and DataFrames are now 
equivalent to Datasets of `Row`s.
    +In fact, `DataFrame` is simply a type alias of `Dataset[Row]` in [the 
Scala API][scala-datasets].
    +However, [Java API][java-datasets] users must use `Dataset<Row>` instead.
     
    -## Datasets
    +[scala-datasets]: api/scala/index.html#org.apache.spark.sql.Dataset
    +[java-datasets]: api/java/index.html?org/apache/spark/sql/Dataset.html
     
    -A Dataset is a new experimental interface added in Spark 1.6 that tries to 
provide the benefits of
    -RDDs (strong typing, ability to use powerful lambda functions) with the 
benefits of Spark SQL's
    -optimized execution engine. A Dataset can be 
[constructed](#creating-datasets) from JVM objects and then manipulated
    -using functional transformations (map, flatMap, filter, etc.).
    +Python does not have support for the Dataset API, but due to its dynamic 
nature many of the
    +benefits are already available (e.g. you can access the field of a row by 
name naturally, as in
    +`row.columnName`). The case for R is similar.
     
    -The unified Dataset API can be used both in 
[Scala](api/scala/index.html#org.apache.spark.sql.Dataset) and
    -[Java](api/java/index.html?org/apache/spark/sql/Dataset.html). Python does 
not yet have support for
    -the Dataset API, but due to its dynamic nature many of the benefits are 
already available (i.e. you can
    -access the field of a row by name naturally `row.columnName`). Full python 
support will be added
    -in a future release.
    +Throughout this document, we will often refer to Scala/Java Datasets of 
`Row`s as DataFrames.
     
     # Getting Started
     
    -## Starting Point: SQLContext
    +## Starting Point: SparkSession
     
     <div class="codetabs">
     <div data-lang="scala"  markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/scala/index.html#org.apache.spark.sql.SQLContext) 
class, or one of its
    -descendants. To create a basic `SQLContext`, all you need is a 
SparkContext.
    +The entry point into all functionality in Spark is the 
[`SparkSession`](api/scala/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight scala %}
    -val sc: SparkContext // An existing SparkContext.
    -val sqlContext = new org.apache.spark.sql.SQLContext(sc)
    +import org.apache.spark.sql.SparkSession
    +
    +val spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate()
     
     // this is used to implicitly convert an RDD to a DataFrame.
    -import sqlContext.implicits._
    +import spark.implicits._
     {% endhighlight %}
     
     </div>
     
     <div data-lang="java" markdown="1">
     
    -The entry point into all functionality in Spark SQL is the
    -[`SQLContext`](api/java/index.html#org.apache.spark.sql.SQLContext) class, 
or one of its
    -descendants. To create a basic `SQLContext`, all you need is a 
SparkContext.
    +The entry point into all functionality in Spark is the 
[`SparkSession`](api/java/index.html#org.apache.spark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder()`:
     
     {% highlight java %}
    -JavaSparkContext sc = ...; // An existing JavaSparkContext.
    -SQLContext sqlContext = new org.apache.spark.sql.SQLContext(sc);
    -{% endhighlight %}
    +import org.apache.spark.sql.SparkSession;
     
    +SparkSession spark = SparkSession.builder()
    +  .master("local")
    +  .appName("Word Count")
    +  .config("spark.some.config.option", "some-value")
    +  .getOrCreate();
    +{% endhighlight %}
     </div>
     
     <div data-lang="python"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -[`SQLContext`](api/python/pyspark.sql.html#pyspark.sql.SQLContext) class, 
or one
    -of its decedents. To create a basic `SQLContext`, all you need is a 
SparkContext.
    +The entry point into all functionality in Spark is the 
[`SparkSession`](api/python/pyspark.sql.html#pyspark.sql.SparkSession) class. 
To create a basic `SparkSession`, just use `SparkSession.builder`:
     
     {% highlight python %}
    -from pyspark.sql import SQLContext
    -sqlContext = SQLContext(sc)
    +from pyspark.sql import SparkSession
    +
    +spark = SparkSession.builder \
    +  .master("local") \
    +  .appName("Word Count") \
    +  .config("spark.some.config.option", "some-value") \
    +  .getOrCreate()
     {% endhighlight %}
     
     </div>
     
     <div data-lang="r"  markdown="1">
     
    -The entry point into all relational functionality in Spark is the
    -`SQLContext` class, or one of its decedents. To create a basic 
`SQLContext`, all you need is a SparkContext.
    +Unlike the Scala, Java, and Python APIs, SparkR has not yet finished migrating from 
`SQLContext` to `SparkSession`, so
    +the entry point into all relational functionality in SparkR is still the
    +`SQLContext` class in Spark 2.0. To create a basic `SQLContext`, all you 
need is a `SparkContext`.
    --- End diff --
    
    Also, use of `SQLContext` is deprecated. Please see PR #13751.
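
As a rough illustration of the builder chain shown in the diff, the toy Python model below mimics how chained calls collect options and how `getOrCreate()` hands back an already-existing session instead of constructing a second one. `ToySession` and `ToyBuilder` are hypothetical names invented for this sketch. It does not import or reproduce Spark's code, and the option keys are used only as plain dictionary keys.

```python
class ToySession:
    """Hypothetical stand-in for a session object; not a Spark class."""

    _active = None  # single shared session for the whole process

    def __init__(self, options):
        self.options = dict(options)


class ToyBuilder:
    """Collects options through chained calls, then returns a session."""

    def __init__(self):
        self._options = {}

    def config(self, key, value):
        self._options[key] = value
        return self  # returning self is what makes the calls chainable

    def master(self, url):
        return self.config("spark.master", url)

    def appName(self, name):
        return self.config("spark.app.name", name)

    def getOrCreate(self):
        if ToySession._active is None:
            # No session yet: create one from the collected options.
            ToySession._active = ToySession(self._options)
        else:
            # Assumption of this toy: options are merged into the existing session.
            ToySession._active.options.update(self._options)
        return ToySession._active


first = ToyBuilder().master("local").appName("Word Count").getOrCreate()
second = ToyBuilder().config("spark.some.config.option", "some-value").getOrCreate()
print(first is second)  # both names refer to the one shared session
```

The point of the sketch is only the shape of the API: callers never construct the session directly, so code that previously passed a context into a `SQLContext` constructor can ask the builder for the shared session instead.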


