[GitHub] spark pull request: [SPARK-2410][SQL][WIP] Cherry picked Hive Thri...

pwendell Sun, 20 Jul 2014 21:20:07 -0700

Github user pwendell commented on a diff in the pull request:

    https://github.com/apache/spark/pull/1399#discussion_r15154792
  
    --- Diff: docs/sql-programming-guide.md ---
    @@ -573,4 +572,170 @@ prefixed with a tick (`'`).  Implicit conversions 
turn these symbols into expres
     evaluated by the SQL execution engine.  A full list of the functions 
supported can be found in the
     [ScalaDoc](api/scala/index.html#org.apache.spark.sql.SchemaRDD).
     
    -<!-- TODO: Include the table of operations here. -->
    \ No newline at end of file
    +<!-- TODO: Include the table of operations here. -->
    +
    +## Running the Thrift JDBC server
    +
    +The Thrift JDBC server implemented here corresponds to the [`HiveServer2`](https://cwiki.apache.org/confluence/display/Hive/Setting+Up+HiveServer2) in Hive 0.12. You can test
    +the JDBC server with the beeline script that comes with either Spark or Hive 0.12.
    +
    +To start the JDBC server, run the following in the Spark directory:
    +
    +    ./sbin/start-thriftserver.sh
    +
    +The default port the server listens on is 10000.  Now you can use beeline 
to test the Thrift JDBC
    +server:
    +
    +    ./bin/beeline
    +
    +Connect to the JDBC server in beeline with:
    +
    +    beeline> !connect jdbc:hive2://localhost:10000
    +
    +Beeline will ask you for a username and password. In non-secure mode, simply enter the username on
    +your machine and a blank password. For secure mode, please follow the instructions given in the
    +[beeline documentation](https://cwiki.apache.org/confluence/display/Hive/HiveServer2+Clients).
    +
    +Configuration of Hive is done by placing your `hive-site.xml` file in 
`conf/`.
    +
    +You may also use the beeline script that comes with Hive.
    +
    +### Migration Guide for Shark Users
    +
    +#### Reducer number
    +
    +In Shark, the default number of reducers is 1 and can be tuned with the `mapred.reduce.tasks` property. In Spark SQL, the number of reducers defaults to 200 and can be customized with the `spark.sql.shuffle.partitions` property:
    +
    +```
    +SET spark.sql.shuffle.partitions=10;
    +SELECT page, count(*) c FROM logs_last_month_cached
    +GROUP BY page ORDER BY c DESC LIMIT 10;
    +```
    +
    +You may also put this property in `hive-site.xml` to override the default 
value.
    +
    +#### Caching
    +
    +The `shark.cache` table property no longer exists, and tables whose names end with `_cached` are no longer automatically cached. Instead, we provide `CACHE TABLE` and `UNCACHE TABLE` statements to let users control table caching explicitly:
    +
    +```
    +CACHE TABLE logs_last_month;
    +UNCACHE TABLE logs_last_month;
    +```
    +
    +**NOTE** `CACHE TABLE tbl` is lazy: it only marks table `tbl` as "needs to be cached if necessary", but doesn't actually cache it until a query that touches `tbl` is executed. To force the table to be cached, you may simply count the table immediately after executing `CACHE TABLE`:
    +
    +```
    +CACHE TABLE logs_last_month;
    +SELECT COUNT(1) FROM logs_last_month;
    +```
    +
    +Several caching-related features are not supported yet:
    +
    +* User-defined partition-level cache eviction policy
    +* RDD reloading
    +* In-memory cache write-through policy
    +
    +### Compatibility with Apache Hive
    +
    +#### Deploying in Existing Hive Warehouses
    +
    +The Spark SQL Thrift JDBC server is designed to be "out of the box" compatible with existing Hive
    +installations. You do not need to modify your existing Hive Metastore or change the data placement
    +or partitioning of your tables.
    +
    +#### Supported Hive Features
    +
    +Spark SQL supports the vast majority of Hive features, such as:
    +
    +* Hive query statements, including:
    + * `SELECT`
    + * `GROUP BY`
    + * `ORDER BY`
    + * `CLUSTER BY`
    + * `SORT BY`
    +* All Hive operators, including:
    + * Relational operators (`=`, `⇔`, `==`, `<>`, `<`, `>`, `>=`, `<=`, etc)
    + * Arithmetic operators (`+`, `-`, `*`, `/`, `%`, etc)
    + * Logical operators (`AND`, `&&`, `OR`, `||`, etc)
    + * Complex type constructors
    + * Mathematical functions (`sign`, `ln`, `cos`, etc)
    + * String functions (`instr`, `length`, `printf`, etc)
    +* User defined functions (UDF)
    +* User defined aggregation functions (UDAF)
    +* User defined serialization formats (SerDe's)
    +* Joins
    + * `JOIN`
    + * `{LEFT|RIGHT|FULL} OUTER JOIN`
    + * `LEFT SEMI JOIN`
    + * `CROSS JOIN`
    +* Unions
    +* Subqueries
    + * `SELECT col FROM ( SELECT a + b AS col FROM t1) t2`
    +* Sampling
    +* Explain
    +* Partitioned tables
    +* All Hive DDL Functions, including:
    + * `CREATE TABLE`
    + * `CREATE TABLE AS SELECT`
    + * `ALTER TABLE`
    +* Most Hive Data types, including:
    + * `TINYINT`
    + * `SMALLINT`
    + * `INT`
    + * `BIGINT`
    + * `BOOLEAN`
    + * `FLOAT`
    + * `DOUBLE`
    + * `STRING`
    + * `BINARY`
    + * `TIMESTAMP`
    + * `ARRAY<>`
    + * `MAP<>`
    + * `STRUCT<>`
    +
    +#### Unsupported Hive Functionality
    +
    +Below is a list of Hive features that we don't support yet. Most of these 
features are rarely  used in Hive deployments.
    --- End diff --
    
    extra space after rarely
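
A stray double space like this one is easy to miss by eye. As a minimal sketch (not part of the pull request), the Python snippet below flags runs of two or more spaces that fall mid-sentence. The function name `find_double_spaces` and the rule of ignoring spaces after `.`, `!`, `?` and `:` are assumptions made for illustration; the latter is there because the guide itself puts two spaces after a full stop in places.

```python
import re

# Two or more spaces between non-space characters. Spaces that follow
# sentence-ending punctuation are skipped (an assumed convention, since the
# guide sometimes uses two spaces after a full stop).
DOUBLE_SPACE = re.compile(r"(?<=[^\s.!?:]) {2,}(?=\S)")


def find_double_spaces(text):
    """Return (line_number, column) pairs, both 1-based, for each hit."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        # Ignore leading indentation so nested list items are not flagged.
        body = line.lstrip()
        offset = len(line) - len(body)
        for match in DOUBLE_SPACE.finditer(body):
            hits.append((lineno, offset + match.start() + 1))
    return hits


if __name__ == "__main__":
    sample = "Most of these features are rarely  used in Hive deployments."
    for lineno, column in find_double_spaces(sample):
        print(f"line {lineno}, column {column}: extra space")
```

Running it over `docs/sql-programming-guide.md` before pushing would surface the line flagged here; code blocks with aligned columns may produce false positives.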



