[GitHub] spark pull request: [WIP][SPARK-12069][SQL] Update documentation w...

gatorsmile Thu, 03 Dec 2015 14:42:58 -0800

Github user gatorsmile commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10060#discussion_r46624363
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -9,18 +9,51 @@ title: Spark SQL and DataFrames
     
     # Overview
     
    -Spark SQL is a Spark module for structured data processing. It provides a programming abstraction called DataFrames and can also act as distributed SQL query engine.
    +Spark SQL is a Spark module for structured data processing.  Unlike the basic Spark RDD API, the interfaces provided
    +by Spark SQL provide Spark with more about the structure of both the data and the computation being performed.  Internally,
    +Spark SQL uses this extra information to perform extra optimizations.  There are several ways to
    +interact with Spark SQL including SQL, the DataFrames API and the Datasets API.  When computing a result
    +the same execution engine is used, independent of which API/language you are using to express the
    +computation.  This unification means that developers can easily switch back and forth between the
    +various APIs based on which provides the most natural way to express a given transformation.
     
    -Spark SQL can also be used to read data from an existing Hive installation.  For more on how to configure this feature, please refer to the [Hive Tables](#hive-tables) section.
    +All of the examples on this page use sample data included in the Spark distribution and can be run in
    +the `spark-shell`, `pyspark` shell, or `sparkR` shell.
     
    -# DataFrames
    +## SQL
     
    -A DataFrame is a distributed collection of data organized into named columns. It is conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations under the hood. DataFrames can be constructed from a wide array of sources such as: structured data files, tables in Hive, external databases, or existing RDDs.
    +One use of Spark SQL is to execute SQL queries written using either a basic SQL syntax or HiveQL.
    +Spark SQL can also be used to read data from an existing Hive installation.  For more on how to
    +configure this feature, please refer to the [Hive Tables](#hive-tables) section.  When running
    +SQL from within another programming language the results will be returned as a [DataFrame](#DataFrames).
    +You can also interact with the SQL interface using the [command-line](#running-the-spark-sql-cli)
    +or over [JDBC/ODBC](#running-the-thrift-jdbcodbc-server).
     
    -The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame), [Java](api/java/index.html?org/apache/spark/sql/DataFrame.html), [Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +## DataFrames
     
    -All of the examples on this page use sample data included in the Spark distribution and can be run in the `spark-shell`, `pyspark` shell, or `sparkR` shell.
    +A DataFrame is a distributed collection of data organized into named columns. It is conceptually
    +equivalent to a table in a relational database or a data frame in R/Python, but with richer
    +optimizations under the hood. DataFrames can be constructed from a wide array of [sources](#data-sources) such
    +as: structured data files, tables in Hive, external databases, or existing RDDs.
     
    +The DataFrame API is available in [Scala](api/scala/index.html#org.apache.spark.sql.DataFrame),
    +[Java](api/java/index.html?org/apache/spark/sql/DataFrame.html),
    +[Python](api/python/pyspark.sql.html#pyspark.sql.DataFrame), and [R](api/R/index.html).
    +
    +## Datasets
    +
    +A Dataset is a new experimental interface added in Spark 1.6 that tries to provide the benefits of
    +RDDs (strong typing, ability to use powerful lambda functions) with the benifits of Spark SQL's
    --- End diff --
    
    benifits -> benefits



