GitHub user yhqairqq opened a pull request:
https://github.com/apache/spark/pull/18311
Branch 2.0
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/18311.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #18311
----
commit e355ca8e828629455228b6a346d64638ab639cfa
Author: Christian Kadner <[email protected]>
Date: 2016-10-06T21:28:49Z
[SPARK-17803][TESTS] Upgrade docker-client dependency
[SPARK-17803: Docker integration tests don't run with "Docker for
Mac"](https://issues.apache.org/jira/browse/SPARK-17803)
## What changes were proposed in this pull request?
This PR upgrades the
[docker-client](https://mvnrepository.com/artifact/com.spotify/docker-client)
dependency from
[3.6.6](https://mvnrepository.com/artifact/com.spotify/docker-client/3.6.6) to
[5.0.2](https://mvnrepository.com/artifact/com.spotify/docker-client/5.0.2) to
enable _Docker for Mac_ users to run the `docker-integration-tests` out of the
box.
The very latest docker-client version is
[6.0.0](https://mvnrepository.com/artifact/com.spotify/docker-client/6.0.0), but
it pulls in one additional dependency and offers nothing we need yet.
## How was this patch tested?
The code change was tested on Mac OS X Yosemite with both _Docker Toolbox_
and _Docker for Mac_, and on Linux Ubuntu 14.04.
```
$ build/mvn -Pyarn -Phadoop-2.6 -Dhadoop.version=2.6.0 -Phive -Phive-thriftserver -DskipTests clean package
$ build/mvn -Pdocker-integration-tests -Pscala-2.11 -pl :spark-docker-integration-tests_2.11 clean compile test
```
Author: Christian Kadner <[email protected]>
Closes #15378 from ckadner/SPARK-17803_Docker_for_Mac.
(cherry picked from commit 49d11d49983fbe270f4df4fb1e34b5fbe854c5ec)
Signed-off-by: Josh Rosen <[email protected]>
commit b1a9c41e8c41c90dd15ee6f635356dd1a5bbf395
Author: Dongjoon Hyun <[email protected]>
Date: 2016-10-06T23:09:45Z
[SPARK-17750][SQL][BACKPORT-2.0] Fix CREATE VIEW with INTERVAL arithmetic
## What changes were proposed in this pull request?
Currently, Spark raises a `RuntimeException` when creating a view that uses
timestamp INTERVAL arithmetic, as shown below. The root cause is that the
arithmetic expression `TimeAdd` was emitted as a `timeadd` function call in the
generated VIEW definition, which the analyzer cannot resolve. This PR fixes the
SQL generation of the `TimeAdd` and `TimeSub` expressions.
```scala
scala> sql("CREATE TABLE dates (ts TIMESTAMP)")
scala> sql("CREATE VIEW view1 AS SELECT ts + INTERVAL 1 DAY FROM dates")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```
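As a rough sketch of the shape of the fix (a hypothetical helper, not the actual Spark source): the view definition should contain parseable interval arithmetic rather than a `timeadd(...)` call.

```scala
// Hedged sketch, not the real TimeAdd.sql implementation: illustrates the SQL
// text the expression should generate for the canonicalized view definition.
case class TimeAddSketch(startSql: String, intervalSql: String) {
  // Before the fix the generated text looked like s"timeadd($startSql, $intervalSql)",
  // which the analyzer cannot resolve; plain arithmetic syntax is analyzable.
  def sql: String = s"$startSql + $intervalSql"
}

// e.g. TimeAddSketch("`ts`", "interval 1 days").sql yields "`ts` + interval 1 days"
```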
## How was this patch tested?
Pass Jenkins with a new testcase.
Author: Dongjoon Hyun <[email protected]>
Closes #15383 from dongjoon-hyun/SPARK-17750-BACK.
commit 594a2cf6f7c74c54127b8c3947aadbe0052b404c
Author: sethah <[email protected]>
Date: 2016-10-07T04:10:17Z
[SPARK-17792][ML] L-BFGS solver for linear regression does not accept
general numeric label column types
## What changes were proposed in this pull request?
Before, we computed `instances` in LinearRegression in two spots, even
though they did the same thing. One of them did not cast the label column to
`DoubleType`. This patch consolidates the computation and always casts the
label column to `DoubleType`.
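A minimal sketch of the consolidated extraction, with hypothetical column-name parameters; the point is that the label column is always cast to `DoubleType`, so any numeric label type is accepted.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

// Hypothetical helper for illustration: select label and features in one place,
// always casting the label column to DoubleType.
def selectLabeledData(dataset: DataFrame, labelCol: String, featuresCol: String): DataFrame =
  dataset.select(col(labelCol).cast(DoubleType), col(featuresCol))
```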
## How was this patch tested?
Added a unit test to check all solvers. This test failed before this patch.
Author: sethah <[email protected]>
Closes #15364 from sethah/linreg_numeric_type.
(cherry picked from commit 3713bb199142c5e06e2e527c99650f02f41f47b1)
Signed-off-by: Yanbo Liang <[email protected]>
commit 380b099fcfe6f70b978300ea208faf630855471a
Author: Dongjoon Hyun <[email protected]>
Date: 2016-10-07T05:27:20Z
[SPARK-17612][SQL][BRANCH-2.0] Support `DESCRIBE table PARTITION` SQL syntax
## What changes were proposed in this pull request?
This is a backport of SPARK-17612. It implements the `DESCRIBE table
PARTITION` SQL syntax again; it was supported until Spark 1.6.2 but was dropped
in 2.0.0.
**Spark 1.6.2**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY
(c STRING, d STRING)")
res1: org.apache.spark.sql.DataFrame = [result: string]
scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res2: org.apache.spark.sql.DataFrame = [result: string]
scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+----------------------------------------------------------------+
|result |
+----------------------------------------------------------------+
|a string |
|b int |
|c string |
|d string |
| |
|# Partition Information |
|# col_name data_type comment |
| |
|c string |
|d string |
+----------------------------------------------------------------+
```
**Spark 2.0**
- **Before**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY
(c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
org.apache.spark.sql.catalyst.parser.ParseException:
Unsupported SQL statement
```
- **After**
```scala
scala> sql("CREATE TABLE partitioned_table (a STRING, b INT) PARTITIONED BY
(c STRING, d STRING)")
res0: org.apache.spark.sql.DataFrame = []
scala> sql("ALTER TABLE partitioned_table ADD PARTITION (c='Us', d=1)")
res1: org.apache.spark.sql.DataFrame = []
scala> sql("DESC partitioned_table PARTITION (c='Us', d=1)").show(false)
+-----------------------+---------+-------+
|col_name |data_type|comment|
+-----------------------+---------+-------+
|a |string |null |
|b |int |null |
|c |string |null |
|d |string |null |
|# Partition Information| | |
|# col_name |data_type|comment|
|c |string |null |
|d |string |null |
+-----------------------+---------+-------+
scala> sql("DESC EXTENDED partitioned_table PARTITION (c='Us',
d=1)").show(100,false)
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
|col_name
|data_type|comment|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
|a
|string
|null |
|b
|int
|null |
|c
|string
|null |
|d
|string
|null |
|# Partition Information
|
| |
|# col_name
|data_type|comment|
|c
|string
|null |
|d
|string
|null |
|
|
| |
|Detailed Partition Information CatalogPartition(
Partition Values: [Us, 1]
Storage(Location:
file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1,
InputFormat: org.apache.hadoop.mapred.TextInputFormat, OutputFormat:
org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat, Serde:
org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe, Properties:
[serialization.format=1])
Partition Parameters:{transient_lastDdlTime=1475001066})| |
|
+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+---------+-------+
scala> sql("DESC FORMATTED partitioned_table PARTITION (c='Us',
d=1)").show(100,false)
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|col_name |data_type
|comment|
+--------------------------------+---------------------------------------------------------------------------------------+-------+
|a |string
|null |
|b |int
|null |
|c |string
|null |
|d |string
|null |
|# Partition Information |
| |
|# col_name |data_type
|comment|
|c |string
|null |
|d |string
|null |
| |
| |
|# Detailed Partition Information|
| |
|Partition Value: |[Us, 1]
| |
|Database: |default
| |
|Table: |partitioned_table
| |
|Location:
|file:/Users/dhyun/SPARK-17612-DESC-PARTITION/spark-warehouse/partitioned_table/c=Us/d=1|
|
|Partition Parameters: |
| |
| transient_lastDdlTime |1475001066
| |
| |
| |
|# Storage Information |
| |
|SerDe Library:
|org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
| |
|InputFormat: |org.apache.hadoop.mapred.TextInputFormat
| |
|OutputFormat:
|org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
| |
|Compressed: |No
| |
|Storage Desc Parameters: |
| |
| serialization.format |1
| |
+--------------------------------+---------------------------------------------------------------------------------------+-------+
```
## How was this patch tested?
Pass the Jenkins tests with a new testcase.
Author: Dongjoon Hyun <[email protected]>
Closes #15351 from dongjoon-hyun/SPARK-17612-BACK.
commit 3487b020354988a91181f23b1c6711bfcdb4c529
Author: Bryan Cutler <[email protected]>
Date: 2016-10-07T07:27:55Z
[SPARK-17805][PYSPARK] Fix in sqlContext.read.text when pass in list of
paths
## What changes were proposed in this pull request?
If given a list of paths, `pyspark.sql.readwriter.text` will attempt to use
an undefined variable `paths`. This change checks whether the param `paths` is a
basestring and, if so, converts it to a list, so that the same variable `paths`
can be used in both cases.
## How was this patch tested?
Added unit test for reading list of files
Author: Bryan Cutler <[email protected]>
Closes #15379 from BryanCutler/sql-readtext-paths-SPARK-17805.
(cherry picked from commit bcaa799cb01289f73e9f48526e94653a07628983)
Signed-off-by: Reynold Xin <[email protected]>
commit 9f2eb27a425385836dba5aad61babfb1db738a73
Author: Sean Owen <[email protected]>
Date: 2016-10-07T17:31:41Z
[SPARK-17707][WEBUI] Web UI prevents spark-submit application to be finished
This expands calls to Jetty's simple `ServerConnector` constructor to
explicitly specify a `ScheduledExecutorScheduler` that makes daemon threads. It
should otherwise result in exactly the same configuration, because the other
args are copied from the constructor that is currently called.
(I'm not sure we should change the Hive Thriftserver impl, but I did
anyway.)
This also adds `sc.stop()` to the quick start guide example.
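A hedged sketch of the idea (the helper name is made up and this is not the exact Spark change): construct the `ServerConnector` with a `ScheduledExecutorScheduler` created as a daemon, so Jetty's scheduler threads cannot keep the JVM alive after the application finishes.

```scala
import org.eclipse.jetty.server.{HttpConnectionFactory, Server, ServerConnector}
import org.eclipse.jetty.util.thread.ScheduledExecutorScheduler

// Hypothetical helper: same defaults as the simple ServerConnector constructor,
// except the scheduler is explicitly created with daemon = true.
def daemonConnector(server: Server): ServerConnector =
  new ServerConnector(
    server,
    null,                                                  // executor: use the server default
    new ScheduledExecutorScheduler("ui-scheduler", true),  // daemon scheduler threads
    null,                                                  // byte buffer pool: default
    -1, -1,                                                // default acceptor/selector counts
    new HttpConnectionFactory())
```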
Existing tests; _pending_ at least manual verification of the fix.
Author: Sean Owen <[email protected]>
Closes #15381 from srowen/SPARK-17707.
(cherry picked from commit cff560755244dd4ccb998e0c56e81d2620cd4cff)
Signed-off-by: Shixiong Zhu <[email protected]>
commit f460a199e8fc78ce879b79844c6c9e340b574439
Author: Shixiong Zhu <[email protected]>
Date: 2016-10-07T18:32:39Z
[SPARK-17346][SQL][TEST-MAVEN] Add Kafka source for Structured Streaming
(branch 2.0)
## What changes were proposed in this pull request?
Backport
https://github.com/apache/spark/commit/9293734d35eb3d6e4fd4ebb86f54dd5d3a35e6db
and
https://github.com/apache/spark/commit/b678e465afa417780b54db0fbbaa311621311f15
into branch 2.0.
The only difference is the Spark version in pom file.
## How was this patch tested?
Jenkins.
Author: Shixiong Zhu <[email protected]>
Closes #15367 from zsxwing/kafka-source-branch-2.0.
commit a84d8ef375f853c5841d458a593e41b457b9e6ff
Author: Herman van Hovell <[email protected]>
Date: 2016-10-07T10:46:39Z
[SPARK-17782][STREAMING][BUILD] Add Kafka 0.10 project to build modules
## What changes were proposed in this pull request?
This PR adds the Kafka 0.10 subproject to the build infrastructure. This
makes sure the Kafka 0.10 tests are only triggered when the module or one of its
dependencies changes.
Author: Herman van Hovell <[email protected]>
Closes #15355 from hvanhovell/SPARK-17782.
commit 6d056c168c45d2decf5ffbb96d59623d52ed8490
Author: Davies Liu <[email protected]>
Date: 2016-10-07T22:03:47Z
[SPARK-17806] [SQL] fix bug in join key rewritten in HashJoin
## What changes were proposed in this pull request?
In HashJoin, we try to rewrite the join key as a Long to improve the
performance of finding a match. The rewriting part was not well tested and had a
bug that could cause wrong results when there are at least three integral
columns in the join key and the total length of the key exceeds 8 bytes.
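As a hedged illustration of the mechanism (not the Spark internals): the rewrite packs the integral key columns into a single `Long`, which only stays injective while the combined column widths fit in 8 bytes.

```scala
// Illustration only: pack two 4-byte ints into one 8-byte Long. With three or
// more integral columns whose total width exceeds 8 bytes, such a packing can
// no longer represent every key distinctly, which is the case the fix guards.
def packTwoIntKeys(a: Int, b: Int): Long =
  ((a.toLong & 0xFFFFFFFFL) << 32) | (b.toLong & 0xFFFFFFFFL)
```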
## How was this patch tested?
Added unit tests covering the rewriting with different numbers of columns
and different data types. Manually tested the reported case and confirmed that
this PR fixes the bug.
Author: Davies Liu <[email protected]>
Closes #15390 from davies/rewrite_key.
(cherry picked from commit 94b24b84a666517e31e9c9d693f92d9bbfd7f9ad)
Signed-off-by: Davies Liu <[email protected]>
commit d27df35795fac0fd167e51d5ba08092a17eedfc2
Author: jiangxingbo <[email protected]>
Date: 2016-10-10T04:52:46Z
[SPARK-17832][SQL] TableIdentifier.quotedString creates un-parseable names
when name contains a backtick
## What changes were proposed in this pull request?
The `quotedString` method in `TableIdentifier` and `FunctionIdentifier`
produces an illegal (un-parseable) name when the name contains a backtick. For
example:
```
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser._
import org.apache.spark.sql.catalyst.TableIdentifier
import org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute
val complexName = TableIdentifier("`weird`table`name", Some("`d`b`1"))
parseTableIdentifier(complexName.unquotedString) // Does not work
parseTableIdentifier(complexName.quotedString) // Does not work
parseExpression(complexName.unquotedString) // Does not work
parseExpression(complexName.quotedString) // Does not work
```
We should handle the backtick properly to make `quotedString` parseable.
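A hedged sketch of the quoting rule needed (an assumed helper, mirroring standard SQL identifier quoting): escape embedded backticks by doubling them before wrapping the name.

```scala
// Assumed helper for illustration: double any embedded backtick, then quote.
def quoteIdentifierSketch(name: String): String =
  "`" + name.replace("`", "``") + "`"

// e.g. quoteIdentifierSketch("weird`table`name") == "`weird``table``name`"
```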
## How was this patch tested?
Add new testcases in `TableIdentifierParserSuite` and
`ExpressionParserSuite`.
Author: jiangxingbo <[email protected]>
Closes #15403 from jiangxb1987/backtick.
(cherry picked from commit 26fbca480604ba258f97b9590cfd6dda1ecd31db)
Signed-off-by: Herman van Hovell <[email protected]>
commit d719e9a080a909a6a56db938750d553668743f8f
Author: Dhruve Ashar <[email protected]>
Date: 2016-10-10T15:55:57Z
[SPARK-17417][CORE] Fix # of partitions for Reliable RDD checkpointing
## What changes were proposed in this pull request?
Currently the number of partition files is limited to 10000 (the `%05d`
format). If there are more than 10000 part files, the recovery logic breaks
while recreating the RDD because the files are sorted as strings. More details
can be found in the JIRA description
[here](https://issues.apache.org/jira/browse/SPARK-17417).
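A quick illustration (with hypothetical file names) of why string sorting breaks once the `%05d` width is exceeded:

```scala
// "part-100000" sorts lexicographically before "part-99999", so string order
// no longer matches partition order once there are more than 10000 files.
val parts = Seq("part-99999", "part-100000")
parts.sorted                                 // List(part-100000, part-99999) -- wrong
parts.sortBy(_.stripPrefix("part-").toInt)   // List(part-99999, part-100000) -- intended
```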
## How was this patch tested?
I tested this patch by checkpointing an RDD, manually renaming the part
files to the old format, and then accessing the RDD; it was successfully
recreated from the old format. I also verified loading a sample Parquet file,
saving it in multiple formats (CSV, JSON, Text, Parquet, ORC), and reading them
back successfully from the saved files. I couldn't launch the unit test from my
local box, so I will wait for the Jenkins output.
Author: Dhruve Ashar <[email protected]>
Closes #15370 from dhruve/bug/SPARK-17417.
(cherry picked from commit 4bafacaa5f50a3e986c14a38bc8df9bae303f3a0)
Signed-off-by: Tom Graves <[email protected]>
commit ff9f5bbf1795d9f5b14838099dcc1bb4ac8a9b5b
Author: Davies Liu <[email protected]>
Date: 2016-10-11T02:14:01Z
[SPARK-17738][TEST] Fix flaky test in ColumnTypeSuite
## What changes were proposed in this pull request?
The default buffer size is not big enough for randomly generated MapType.
## How was this patch tested?
Ran the test 100 times; it never failed (it failed 8 times before the
patch).
Author: Davies Liu <[email protected]>
Closes #15395 from davies/flaky_map.
(cherry picked from commit d5ec4a3e014494a3d991a6350caffbc3b17be0fd)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a6b5e1dccf0be0e709d6d4113cdacb0cecce39fd
Author: Shixiong Zhu <[email protected]>
Date: 2016-10-11T17:53:07Z
[SPARK-17346][SQL][TESTS] Fix the flaky topic deletion in
KafkaSourceStressSuite
## What changes were proposed in this pull request?
A follow-up PR for SPARK-17346 to fix the flaky
`org.apache.spark.sql.kafka010.KafkaSourceStressSuite`.
Test log:
https://amplab.cs.berkeley.edu/jenkins/job/spark-master-test-sbt-hadoop-2.4/1855/testReport/junit/org.apache.spark.sql.kafka010/KafkaSourceStressSuite/_It_is_not_a_test_/
Deleting the Kafka internal topic `__consumer_offsets` appears to be flaky,
so this PR simply ignores internal topics.
## How was this patch tested?
Existing tests.
Author: Shixiong Zhu <[email protected]>
Closes #15384 from zsxwing/SPARK-17346-flaky-test.
(cherry picked from commit 75b9e351413dca0930e8545e6283874db09d8482)
Signed-off-by: Tathagata Das <[email protected]>
commit 5ec3e6680a091883369c002ae599d6b03f38c863
Author: Ergin Seyfe <[email protected]>
Date: 2016-10-11T19:51:08Z
[SPARK-17816][CORE][BRANCH-2.0] Fix ConcurrentModificationException issue
in BlockStatusesAccumulator
## What changes were proposed in this pull request?
Replaced `BlockStatusesAccumulator` with `CollectionAccumulator`, which is
thread-safe, plus a few more cleanups.
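A hedged usage sketch of the replacement type (not the internal TaskMetrics wiring): `CollectionAccumulator` is a built-in, thread-safe accumulator that gathers values from tasks into a list on the driver.

```scala
import org.apache.spark.SparkContext

// Assumed example for illustration: concurrent adds from tasks are safe, and
// the accumulated values come back as a java.util.List on the driver.
def collectValues(sc: SparkContext): java.util.List[Long] = {
  val acc = sc.collectionAccumulator[Long]("collectedValues")
  sc.parallelize(1L to 100L).foreach(v => acc.add(v))
  acc.value
}
```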
## How was this patch tested?
Tested in master branch and cherry-picked.
Author: Ergin Seyfe <[email protected]>
Closes #15425 from seyfe/race_cond_jsonprotocal_branch-2.0.
commit e68e95e947045704d3e6a36bb31e104a99d3adcc
Author: Alexander Pivovarov <[email protected]>
Date: 2016-10-12T05:31:21Z
Fix hadoop.version in building-spark.md
A couple of the mvn build examples use `-Dhadoop.version=VERSION` instead of
an actual version number.
Author: Alexander Pivovarov <[email protected]>
Closes #15440 from apivovarov/patch-1.
(cherry picked from commit 299eb04ba05038c7dbb3ecf74a35d4bbfa456643)
Signed-off-by: Reynold Xin <[email protected]>
commit f3d82b53c42a971deedc04de6950b9228e5262ea
Author: Kousuke Saruta <[email protected]>
Date: 2016-10-12T05:36:57Z
[SPARK-17880][DOC] The url linking to `AccumulatorV2` in the document is
incorrect.
## What changes were proposed in this pull request?
In `programming-guide.md`, the URL that links to `AccumulatorV2` is
`api/scala/index.html#org.apache.spark.AccumulatorV2`, but the correct one is
`api/scala/index.html#org.apache.spark.util.AccumulatorV2`.
## How was this patch tested?
manual test.
Author: Kousuke Saruta <[email protected]>
Closes #15439 from sarutak/SPARK-17880.
(cherry picked from commit b512f04f8e546843d5a3f35dcc6b675b5f4f5bc0)
Signed-off-by: Reynold Xin <[email protected]>
commit f12b74c02eec9e201fec8a16dac1f8e549c1b4f0
Author: cody koeninger <[email protected]>
Date: 2016-10-12T07:40:47Z
[SPARK-17853][STREAMING][KAFKA][DOC] make it clear that reusing group.id is
bad
## What changes were proposed in this pull request?
Documentation fix to make it clear that reusing group id for different
streams is super duper bad, just like it is with the underlying Kafka consumer.
## How was this patch tested?
I built jekyll doc and made sure it looked ok.
Author: cody koeninger <[email protected]>
Closes #15442 from koeninger/SPARK-17853.
(cherry picked from commit c264ef9b1918256a5018c7a42a1a2b42308ea3f7)
Signed-off-by: Reynold Xin <[email protected]>
commit 4dcbde48de6c46e2fd8ccfec732b8ff5c24f97a4
Author: Bryan Cutler <[email protected]>
Date: 2016-10-11T06:29:52Z
[SPARK-17808][PYSPARK] Upgraded version of Pyrolite to 4.13
## What changes were proposed in this pull request?
Upgraded to a newer version of Pyrolite which supports serialization of a
BinaryType StructField for PySpark.SQL
## How was this patch tested?
Added a unit test which fails with a raised ValueError when using the
previous version of Pyrolite 4.9 and Python3
Author: Bryan Cutler <[email protected]>
Closes #15386 from BryanCutler/pyrolite-upgrade-SPARK-17808.
(cherry picked from commit 658c7147f5bf637f36e8c66b9207d94b1e7c74c5)
Signed-off-by: Sean Owen <[email protected]>
commit 5451541d1113aa75bab80914ca51a913f6ba4753
Author: prigarg <[email protected]>
Date: 2016-10-12T17:14:45Z
[SPARK-17884][SQL] To resolve Null pointer exception when casting from
empty string to interval type.
## What changes were proposed in this pull request?
This change adds a check in the `castToInterval` method of the `Cast`
expression such that if the converted value is null, the `isNull` variable is
set to true. Earlier, the expression Cast(Literal(), CalendarIntervalType) was
throwing a NullPointerException for this reason.
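A hedged sketch of the behavior being guarded (not the generated cast code): `CalendarInterval.fromString` returns null for input it cannot parse, such as an empty string, so the cast must propagate a SQL NULL instead of dereferencing the result.

```scala
import org.apache.spark.unsafe.types.CalendarInterval

// Assumed illustration: wrap the possibly-null parse result so empty or
// malformed strings yield None (a SQL NULL) rather than a NullPointerException.
def castToIntervalSketch(s: String): Option[CalendarInterval] =
  Option(CalendarInterval.fromString(s))
```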
## How was this patch tested?
Added test case in CastSuite.scala
jira entry for detail: https://issues.apache.org/jira/browse/SPARK-17884
Author: prigarg <[email protected]>
Closes #15449 from priyankagargnitk/SPARK-17884.
(cherry picked from commit d5580ebaa086b9feb72d5428f24c5b60cd7da745)
Signed-off-by: Reynold Xin <[email protected]>
commit d55ba3063da1a5d12e3b09e55f089f16ecf327bb
Author: Hossein <[email protected]>
Date: 2016-10-12T17:32:38Z
[SPARK-17790][SPARKR] Support for parallelizing R data.frame larger than 2GB
## What changes were proposed in this pull request?
If the R data structure being parallelized is larger than `INT_MAX`, we use
files to transfer the data to the JVM. The serialization protocol mimics Python
pickling. This allows us to simply call `PythonRDD.readRDDFromFile` to create
the RDD.
I tested this on my MacBook. The following code works with this patch:
```R
intMax <- .Machine$integer.max
largeVec <- 1:intMax
rdd <- SparkR:::parallelize(sc, largeVec, 2)
```
## How was this patch tested?
* [x] Unit tests
Author: Hossein <[email protected]>
Closes #15375 from falaki/SPARK-17790.
(cherry picked from commit 5cc503f4fe9737a4c7947a80eecac053780606df)
Signed-off-by: Felix Cheung <[email protected]>
commit 050b8177e27df06d33a6f6f2b3b6a952b0d03ba6
Author: cody koeninger <[email protected]>
Date: 2016-10-12T22:22:06Z
[SPARK-17782][STREAMING][KAFKA] alternative eliminate race condition of
poll twice
## What changes were proposed in this pull request?
Alternative approach to https://github.com/apache/spark/pull/15387
Author: cody koeninger <[email protected]>
Closes #15401 from koeninger/SPARK-17782-alt.
(cherry picked from commit f9a56a153e0579283160519065c7f3620d12da3e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 5903dabc57c07310573babe94e4f205bdea6455f
Author: Brian Cho <[email protected]>
Date: 2016-10-13T03:43:18Z
[SPARK-16827][BRANCH-2.0] Avoid reporting spill metrics as shuffle metrics
## What changes were proposed in this pull request?
Fix a bug where spill metrics were being reported as shuffle metrics.
Eventually these spill metrics should be reported (SPARK-3577), but separately
from shuffle metrics. The fix itself basically reverts the line to what it was
in 1.6.
## How was this patch tested?
Cherry-picked from master (#15347)
Author: Brian Cho <[email protected]>
Closes #15455 from dafrista/shuffle-metrics-2.0.
commit ab00e410c6b1d7dafdfabcea1f249c78459b94f0
Author: Burak Yavuz <[email protected]>
Date: 2016-10-13T04:40:45Z
[SPARK-17876] Write StructuredStreaming WAL to a stream instead of
materializing all at once
## What changes were proposed in this pull request?
The CompactibleFileStreamLog materializes the whole metadata log in memory
as a String. This can cause issues when there are lots of files that are being
committed, especially during a compaction batch.
You may come across stacktraces that look like:
```
java.lang.OutOfMemoryError: Requested array size exceeds VM limit
at java.lang.StringCoding.encode(StringCoding.java:350)
at java.lang.String.getBytes(String.java:941)
at
org.apache.spark.sql.execution.streaming.FileStreamSinkLog.serialize(FileStreamSinkLog.scala:127)
```
The safer way is to write to an output stream so that we don't have to
materialize a huge string.
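A hedged sketch of the approach (with a hypothetical log-entry type): write each entry to the output stream incrementally instead of concatenating the whole log into one `String` first.

```scala
import java.io.OutputStream
import java.nio.charset.StandardCharsets

// Assumed illustration: stream entries one at a time; memory use stays bounded
// by the size of a single entry rather than the whole metadata log.
def serializeEntries(entries: Iterator[String], out: OutputStream): Unit =
  entries.foreach { entry =>
    out.write(entry.getBytes(StandardCharsets.UTF_8))
    out.write('\n')
  }
```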
## How was this patch tested?
Existing unit tests
Author: Burak Yavuz <[email protected]>
Closes #15437 from brkyvz/ser-to-stream.
(cherry picked from commit edeb51a39d76d64196d7635f52be1b42c7ec4341)
Signed-off-by: Shixiong Zhu <[email protected]>
commit d38f38a093b4dff32c686675d93ab03e7a8f4908
Author: buzhihuojie <[email protected]>
Date: 2016-10-13T05:51:54Z
minor doc fix for Row.scala
## What changes were proposed in this pull request?
minor doc fix for "getAnyValAs" in class Row
## How was this patch tested?
None.
Author: buzhihuojie <[email protected]>
Closes #15452 from david-weiluo-ren/minorDocFixForRow.
(cherry picked from commit 7222a25a11790fa9d9d1428c84b6f827a785c9e8)
Signed-off-by: Reynold Xin <[email protected]>
commit d7fa3e32421c73adfa522adfeeb970edd4c22eb3
Author: Shixiong Zhu <[email protected]>
Date: 2016-10-13T20:31:50Z
[SPARK-17834][SQL] Fetch the earliest offsets manually in KafkaSource
instead of counting on KafkaConsumer
## What changes were proposed in this pull request?
Because `KafkaConsumer.poll(0)` may update the partition offsets, this PR
just calls `seekToBeginning` to manually set the earliest offsets for the
KafkaSource initial offsets.
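A hedged sketch (simplified, not the KafkaSource internals) of fetching the earliest offsets explicitly with `seekToBeginning` rather than relying on what `poll(0)` happens to leave behind:

```scala
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.TopicPartition

// Assumed helper for illustration: assign the partitions, seek to the
// beginning, and read back the positions as the initial offsets.
def earliestOffsets(
    consumer: KafkaConsumer[Array[Byte], Array[Byte]],
    partitions: Set[TopicPartition]): Map[TopicPartition, Long] = {
  val tps = partitions.toSeq.asJava
  consumer.assign(tps)
  consumer.seekToBeginning(tps)
  partitions.map(tp => tp -> consumer.position(tp)).toMap
}
```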
## How was this patch tested?
Existing tests.
Author: Shixiong Zhu <[email protected]>
Closes #15397 from zsxwing/SPARK-17834.
(cherry picked from commit 08eac356095c7faa2b19d52f2fb0cbc47eb7d1d1)
Signed-off-by: Shixiong Zhu <[email protected]>
commit c53b8374911e801ed98c1436c384f0aef076eaab
Author: Davies Liu <[email protected]>
Date: 2016-10-14T21:45:20Z
[SPARK-17863][SQL] should not add column into Distinct
## What changes were proposed in this pull request?
When resolving an attribute in a sort, we pull up a column from the
grandchild into the child. That is wrong when the child is a `Distinct`, because
the added column changes the behavior of `Distinct`; we should not do that.
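A hedged illustration (hypothetical table `t`, spark-shell style) of the query shape involved: the sort refers to a column that is not in the `DISTINCT` output, and pulling that column up under the `Distinct` would change which rows get de-duplicated.

```scala
// Illustration only: `b` is not part of the DISTINCT projection, so it must not
// be added as an extra column under the Distinct just to resolve the ORDER BY.
spark.sql("SELECT DISTINCT a FROM t ORDER BY b")
```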
## How was this patch tested?
Added regression test.
Author: Davies Liu <[email protected]>
Closes #15489 from davies/order_distinct.
(cherry picked from commit da9aeb0fde589f7c21c2f4a32036a68c0353965d)
Signed-off-by: Yin Huai <[email protected]>
commit 2a1b10b649a8d4c077a0e19df976f1fd36b7e266
Author: Jun Kim <[email protected]>
Date: 2016-10-15T07:36:55Z
[SPARK-17953][DOCUMENTATION] Fix typo in SparkSession scaladoc
## What changes were proposed in this pull request?
### Before:
```scala
SparkSession.builder()
.master("local")
.appName("Word Count")
.config("spark.some.config.option", "some-value").
.getOrCreate()
```
### After:
```scala
SparkSession.builder()
.master("local")
.appName("Word Count")
.config("spark.some.config.option", "some-value")
.getOrCreate()
```
There was one unexpected dot!
Author: Jun Kim <[email protected]>
Closes #15498 from tae-jun/SPARK-17953.
(cherry picked from commit 36d81c2c68ef4114592b069287743eb5cb078318)
Signed-off-by: Reynold Xin <[email protected]>
commit 3cc2fe5b94d3bcdfb4f28bfa6d8e51fe67d6e1b4
Author: Dongjoon Hyun <[email protected]>
Date: 2016-10-17T05:15:47Z
[SPARK-17819][SQL][BRANCH-2.0] Support default database in connection URIs
for Spark Thrift Server
## What changes were proposed in this pull request?
Currently, Spark Thrift Server ignores the default database in the
connection URI. This PR adds support for it, as shown below.
```sql
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "create database testdb"
$ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "create table t(a
int)"
$ bin/beeline -u jdbc:hive2://localhost:10000/testdb -e "show tables"
...
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
| t | false |
+------------+--------------+--+
1 row selected (0.347 seconds)
$ bin/beeline -u jdbc:hive2://localhost:10000 -e "show tables"
...
+------------+--------------+--+
| tableName | isTemporary |
+------------+--------------+--+
+------------+--------------+--+
No rows selected (0.098 seconds)
```
## How was this patch tested?
Pass the Jenkins with a newly added testsuite.
Author: Dongjoon Hyun <[email protected]>
Closes #15507 from dongjoon-hyun/SPARK-17819-BACK.
commit ca66f52ff81c19e17ca3733eac92d66012a3ec6e
Author: Weiqing Yang <[email protected]>
Date: 2016-10-17T05:38:30Z
[MINOR][SQL] Add prettyName for current_database function
## What changes were proposed in this pull request?
Added a `prettyName` for the `current_database` function.
## How was this patch tested?
Manually.
Before:
```
scala> sql("select current_database()").show
+-----------------+
|currentdatabase()|
+-----------------+
| default|
+-----------------+
```
After:
```
scala> sql("select current_database()").show
+------------------+
|current_database()|
+------------------+
| default|
+------------------+
```
Author: Weiqing Yang <[email protected]>
Closes #15506 from weiqingy/prettyName.
(cherry picked from commit 56b0f5f4d1d7826737b81ebc4ec5dad83b6463e3)
Signed-off-by: Reynold Xin <[email protected]>
commit d1a02117862b20d0e8e58f4c6da6a97665a02590
Author: gatorsmile <[email protected]>
Date: 2016-10-17T07:29:53Z
[SPARK-17892][SQL][2.0] Do Not Optimize Query in CTAS More Than Once #15048
### What changes were proposed in this pull request?
This PR is to backport https://github.com/apache/spark/pull/15048 and
https://github.com/apache/spark/pull/15459.
However, in 2.0, we do not have a unified logical node `CreateTable` and
the analyzer rule `PreWriteCheck` is also different. To minimize the code
changes, this PR adds a new rule `AnalyzeCreateTableAsSelect`. Please treat it
as a new PR to review. Thanks!
As explained in https://github.com/apache/spark/pull/14797:
>Some analyzer rules have assumptions about logical plans, and the optimizer
may break these assumptions; we should not pass an optimized query plan into
QueryExecution (it will be analyzed again), otherwise we may hit some weird
bugs.
For example, we have a rule for decimal calculation that promotes the
precision before binary operations and uses PromotePrecision as a placeholder to
indicate that the rule should not be applied twice. But an Optimizer rule
removes this placeholder, which breaks the assumption; the rule is then applied
twice and produces a wrong result.
We should not optimize the query in CTAS more than once. For example,
```Scala
spark.range(99, 101).createOrReplaceTempView("tab1")
val sqlStmt = "SELECT id, cast(id as long) * cast('1.0' as decimal(38, 18))
as num FROM tab1"
sql(s"CREATE TABLE tab2 USING PARQUET AS $sqlStmt")
checkAnswer(spark.table("tab2"), sql(sqlStmt))
```
Before this PR, the results do not match
```
== Results ==
!== Correct Answer - 2 == == Spark Answer - 2 ==
![100,100.000000000000000000] [100,null]
[99,99.000000000000000000] [99,99.000000000000000000]
```
After this PR, the results match.
```
+---+----------------------+
|id |num |
+---+----------------------+
|99 |99.000000000000000000 |
|100|100.000000000000000000|
+---+----------------------+
```
In this PR, we do not treat the `query` in CTAS as a child. Thus, the
`query` will not be optimized when the CTAS statement is optimized. However, we
still need to analyze it to normalize and verify the CTAS in the Analyzer. We do
this in the analyzer rule `PreprocessDDL`, because so far only this rule needs
the analyzed plan of the `query`.
### How was this patch tested?
Author: gatorsmile <[email protected]>
Closes #15502 from gatorsmile/ctasOptimize2.0.
----