GitHub user felixcheung opened a pull request:
https://github.com/apache/spark/pull/17514
[SPARK-20197][SPARKR] CRAN check fail with package installation
The test failed because SPARK_HOME is not set before Spark is installed.
Also, the current directory is not equal to SPARK_HOME when tests are run with
R CMD check, unlike in Jenkins, so that test is disabled for now. (That would
also disable the test in Jenkins, so this change should not be ported to
master as-is.)
Manually ran R CMD check.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/felixcheung/spark rcrancheck
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/17514.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #17514
----
commit b14fc391893468e25de1e24d982d6f260cac59ad
Author: Reynold Xin <[email protected]>
Date: 2016-12-15T05:08:45Z
[SPARK-18869][SQL] Add TreeNode.p that returns BaseType
## What changes were proposed in this pull request?
After the bug fix in SPARK-18854, TreeNode.apply now returns TreeNode[_]
rather than a more specific type. For interactive debugging, it would be easier
to introduce a function that returns the BaseType.
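A minimal toy sketch of the idea (not Spark's actual `TreeNode`; the `Plan` class and names below are made up for illustration):
```scala
// Toy model of the typing problem: apply() loses the concrete node type,
// while a p()-style accessor preserves it, which is handier in spark-shell.
abstract class TreeNode[BaseType <: TreeNode[BaseType]] { self: BaseType =>
  def children: Seq[BaseType]
  def apply(i: Int): TreeNode[_] = children(i)   // erased element type
  def p(i: Int): BaseType = children(i)          // concrete BaseType
}

final case class Plan(name: String, children: Seq[Plan] = Nil) extends TreeNode[Plan]

object TreeNodeDemo extends App {
  val plan = Plan("root", Seq(Plan("scan"), Plan("filter")))
  // plan(1).name would not compile: apply returns TreeNode[_]
  println(plan.p(1).name)  // p returns Plan, so members are accessible
}
```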
## How was this patch tested?
N/A - this is a developer only feature used for interactive debugging. As
long as it compiles, it should be good to go. I tested this in spark-shell.
Author: Reynold Xin <[email protected]>
Closes #16288 from rxin/SPARK-18869.
(cherry picked from commit 5d510c693aca8c3fd3364b4453160bc8585ffc8e)
Signed-off-by: Reynold Xin <[email protected]>
commit d399a297d1ec9e0a3c57658cba0320b4d7fe88c5
Author: Dongjoon Hyun <[email protected]>
Date: 2016-12-15T05:29:20Z
[SPARK-18875][SPARKR][DOCS] Fix R API doc generation by adding
`DESCRIPTION` file
## What changes were proposed in this pull request?
Since Apache Spark 1.4.0, the R API documentation page has had a broken link
to the `DESCRIPTION` file because the Jekyll plugin script doesn't copy the
file. This PR aims to fix that.
- Official Latest Website:
http://spark.apache.org/docs/latest/api/R/index.html
- Apache Spark 2.1.0-rc2:
http://people.apache.org/~pwendell/spark-releases/spark-2.1.0-rc2-docs/api/R/index.html
## How was this patch tested?
Manual.
```bash
cd docs
SKIP_SCALADOC=1 jekyll build
```
Author: Dongjoon Hyun <[email protected]>
Closes #16292 from dongjoon-hyun/SPARK-18875.
(cherry picked from commit ec0eae486331c3977505d261676b77a33c334216)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 2a8de2e11ebab0cb9056444053127619d8a47d8a
Author: Felix Cheung <[email protected]>
Date: 2016-12-15T05:51:52Z
[SPARK-18849][ML][SPARKR][DOC] vignettes final check update
## What changes were proposed in this pull request?
doc cleanup
## How was this patch tested?
~~The vignettes are not building for me. I'm going to kick off a full clean
build, try again, and attach the output here for review.~~
Output html here: https://felixcheung.github.io/sparkr-vignettes.html
Author: Felix Cheung <[email protected]>
Closes #16286 from felixcheung/rvignettespass.
(cherry picked from commit 7d858bc5ce870a28a559f4e81dcfc54cbd128cb7)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit e430915fad7ffb9397a96f0ef16e741c6b4f158b
Author: Tathagata Das <[email protected]>
Date: 2016-12-15T19:54:35Z
[SPARK-18870] Disallowed Distinct Aggregations on Streaming Datasets
## What changes were proposed in this pull request?
Check whether Aggregation operators on a streaming subplan have aggregate
expressions with isDistinct = true.
## How was this patch tested?
Added unit test
Author: Tathagata Das <[email protected]>
Closes #16289 from tdas/SPARK-18870.
(cherry picked from commit 4f7292c87512a7da3542998d0e5aa21c27a511e9)
Signed-off-by: Tathagata Das <[email protected]>
commit 900ce558a238fb9d8220527d8313646fe6830695
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-15T21:17:51Z
[SPARK-18826][SS] Add 'latestFirst' option to FileStreamSource
## What changes were proposed in this pull request?
When starting a stream with a lot of backfill and `maxFilesPerTrigger`, the
user often wants to start with the most recent files first. This keeps latency
low for recent data while slowly backfilling historical data.
This PR adds a new option `latestFirst` to control this behavior. When it is
true, `FileStreamSource` sorts the files by modification time from latest to
oldest and takes the first `maxFilesPerTrigger` files as a new batch.
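A hedged usage sketch (assuming a local `SparkSession`; the path and option values below are illustrative):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("latestFirst-demo").getOrCreate()

// Read the newest files first, limiting each micro-batch to 100 files.
val stream = spark.readStream
  .format("text")
  .option("latestFirst", "true")
  .option("maxFilesPerTrigger", "100")
  .load("/data/backfill")   // illustrative path
```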
## How was this patch tested?
The added test.
Author: Shixiong Zhu <[email protected]>
Closes #16251 from zsxwing/newest-first.
(cherry picked from commit 68a6dc974b25e6eddef109f6fd23ae4e9775ceca)
Signed-off-by: Tathagata Das <[email protected]>
commit b6a81f4720752efe459860d28d7f8f738b2944c3
Author: Burak Yavuz <[email protected]>
Date: 2016-12-15T22:26:54Z
[SPARK-18888] partitionBy in DataStreamWriter in Python throws _to_seq not
defined
## What changes were proposed in this pull request?
`_to_seq` wasn't imported.
## How was this patch tested?
Added partitionBy to existing write path unit test
Author: Burak Yavuz <[email protected]>
Closes #16297 from brkyvz/SPARK-18888.
commit ef2ccf94224f00154cab7ab173d65442ecd389d7
Author: Patrick Wendell <[email protected]>
Date: 2016-12-15T22:46:00Z
Preparing Spark release v2.1.0-rc3
commit a7364a82eb0d18f92f1d8e46c1160a55bc250032
Author: Patrick Wendell <[email protected]>
Date: 2016-12-15T22:46:09Z
Preparing development version 2.1.1-SNAPSHOT
commit 08e4272872fc17c43f0dc79d329b946e8e85694d
Author: Burak Yavuz <[email protected]>
Date: 2016-12-15T23:46:03Z
[SPARK-18868][FLAKY-TEST] Deflake StreamingQueryListenerSuite: single
listener, check trigger...
## What changes were proposed in this pull request?
Use `recentProgress` instead of `lastProgress` and filter out the last
non-zero value. Also add `eventually` to the latest `assertQuery`, similar to
the first `assertQuery`.
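A hedged sketch of the idea (not the suite's actual code; it assumes a running `StreamingQuery`):
```scala
import org.apache.spark.sql.streaming.StreamingQuery

// Instead of relying on lastProgress (which may report zero rows for an
// empty trigger), scan all recent progress updates and keep the last
// non-zero numInputRows value.
def lastNonZeroInputRows(query: StreamingQuery): Option[Long] =
  query.recentProgress.toSeq.map(_.numInputRows).filter(_ > 0).lastOption
```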
## How was this patch tested?
Ran test 1000 times
Author: Burak Yavuz <[email protected]>
Closes #16287 from brkyvz/SPARK-18868.
(cherry picked from commit 9c7f83b0289ba4550b156e6af31cf7c44580eb12)
Signed-off-by: Shixiong Zhu <[email protected]>
commit ae853e8f3bdbd16427e6f1ffade4f63abaf74abb
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-16T00:15:51Z
[MINOR] Only rename SparkR tar.gz if names mismatch
## What changes were proposed in this pull request?
For release builds, R_PACKAGE_VERSION and VERSION are the same (e.g.,
2.1.0). Thus `cp` throws an error, which causes the build to fail.
## How was this patch tested?
Manually by executing the following script
```
set -o pipefail
set -e
set -x
touch a
R_PACKAGE_VERSION=2.1.0
VERSION=2.1.0
if [ "$R_PACKAGE_VERSION" != "$VERSION" ]; then
cp a a
fi
```
Author: Shivaram Venkataraman <[email protected]>
Closes #16299 from shivaram/sparkr-cp-fix.
(cherry picked from commit 9634018c4d6d5a4f2c909f7227d91e637107b7f4)
Signed-off-by: Reynold Xin <[email protected]>
commit ec31726581a43624fd47ce48f4e33d2a8e96c15c
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T00:18:20Z
Preparing Spark release v2.1.0-rc4
commit 62a6577bfa3a83783c813e74286e62b668e9af83
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T00:18:29Z
Preparing development version 2.1.1-SNAPSHOT
commit b23220fa67dd279d0b8005cb66d0875adbd3c8cb
Author: Shivaram Venkataraman <[email protected]>
Date: 2016-12-16T01:13:35Z
[MINOR] Handle fact that mv is different on linux, mac
Follow-up to
https://github.com/apache/spark/commit/ae853e8f3bdbd16427e6f1ffade4f63abaf74abb,
as `mv` throws an error on the Jenkins machines if the source and destination
are the same.
Author: Shivaram Venkataraman <[email protected]>
Closes #16302 from shivaram/sparkr-no-mv-fix.
(cherry picked from commit 5a44f18a2a114bdd37b6714d81f88cb68148f0c9)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit cd0a08361e2526519e7c131c42116bf56fa62c76
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T01:57:04Z
Preparing Spark release v2.1.0-rc5
commit 483624c2e13c8f239ee750bc149941b79800d0b0
Author: Patrick Wendell <[email protected]>
Date: 2016-12-16T01:57:11Z
Preparing development version 2.1.1-SNAPSHOT
commit d8548c8a7541bfa37761382edbb1892a145b2b71
Author: Reynold Xin <[email protected]>
Date: 2016-12-16T05:58:27Z
[SPARK-18892][SQL] Alias percentile_approx approx_percentile
## What changes were proposed in this pull request?
percentile_approx is the name used in Hive, and approx_percentile is the
name used in Presto. approx_percentile is actually more consistent with our
approx_count_distinct. Given that the cost of aliasing SQL functions is low (a
one-liner), it'd be better to just alias them so they are easier to use.
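An illustrative query showing both names resolving to the same aggregate (assuming an existing `SparkSession` named `spark`):
```scala
spark.range(0, 1000).createOrReplaceTempView("t")

// Both spellings should return the same approximate median after this change.
spark.sql(
  "SELECT percentile_approx(id, 0.5) AS hive_name, approx_percentile(id, 0.5) AS presto_name FROM t"
).show()
```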
## How was this patch tested?
Technically I could add an end-to-end test to verify this one-line change,
but it seemed too trivial to me.
Author: Reynold Xin <[email protected]>
Closes #16300 from rxin/SPARK-18892.
(cherry picked from commit 172a52f5d31337d90155feb7072381e8d5712288)
Signed-off-by: Reynold Xin <[email protected]>
commit a73201dafcf22756b8074a73e1b5da41cdf8b9a4
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-16T08:42:39Z
[SPARK-18850][SS] Make StreamExecution and progress classes serializable
## What changes were proposed in this pull request?
This PR adds StreamingQueryWrapper to make StreamExecution and the progress
classes serializable, because it is too easy for them to get captured in
closures during normal usage. If a StreamingQueryWrapper gets captured in a
closure but none of its methods are called, the Spark tasks should not fail.
If its methods are called, this PR throws a clearer error message.
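A hedged illustration of the intended behavior; `spark` is a `SparkSession` and `streamingDF` is a hypothetical streaming DataFrame:
```scala
import spark.implicits._

// Start a query; the returned handle is now a serializable wrapper.
val query = streamingDF.writeStream.format("memory").queryName("q").start()

// Capturing `query` in a task closure no longer breaks serialization,
// as long as no method is invoked on it inside the task.
spark.range(10).map { i => val captured = query; i + 1 }.collect()

// Invoking its methods on an executor, however, is expected to fail
// with a clear error message:
// spark.range(1).foreach { _ => query.stop() }
```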
## How was this patch tested?
`test("StreamingQuery should be Serializable but cannot be used in
executors")`
`test("progress classes should be Serializable")`
Author: Shixiong Zhu <[email protected]>
Closes #16272 from zsxwing/SPARK-18850.
(cherry picked from commit d7f3058e17571d76a8b4c8932de6de81ce8d2e78)
Signed-off-by: Tathagata Das <[email protected]>
commit d8ef0be83d8d032ddab79b465226ed3ff3d1eff7
Author: Takeshi YAMAMURO <[email protected]>
Date: 2016-12-16T14:44:42Z
[SPARK-18108][SQL] Fix a schema inconsistent bug that makes a parquet
reader fail to read data
## What changes were proposed in this pull request?
The vectorized parquet reader fails to read column data if the data schema
and partition schema overlap with each other and the inferred types in the
partition schema differ from those in the data schema. Example code to
reproduce this bug:
```
scala> case class A(a: Long, b: Int)
scala> val as = Seq(A(1, 2))
scala> spark.createDataFrame(as).write.parquet("/data/a=1/")
scala> val df = spark.read.parquet("/data/")
scala> df.printSchema
root
|-- a: long (nullable = true)
|-- b: integer (nullable = true)
scala> df.collect
java.lang.NullPointerException
  at org.apache.spark.sql.execution.vectorized.OnHeapColumnVector.getLong(OnHeapColumnVector.java:283)
  at org.apache.spark.sql.execution.vectorized.ColumnarBatch$Row.getLong(ColumnarBatch.java:191)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
```
The root cause is that the logical layer (`HadoopFsRelation`) and the
physical layer (`VectorizedParquetRecordReader`) make different assumptions
about the partition schema: the logical layer trusts the data schema to infer
the type of the overlapping partition columns, while the physical layer trusts
the partition schema inferred from the path string. To fix this bug, this PR
simply updates `HadoopFsRelation.schema` to respect the partition columns'
position in the data schema and their type in the partition schema.
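A hedged sketch of that schema-merging rule (an illustrative helper, not `HadoopFsRelation`'s actual code; case sensitivity is ignored here):
```scala
import org.apache.spark.sql.types.{StructField, StructType}

// Keep the data schema's column order, but take the type of overlapping
// partition columns from the partition schema; partition-only columns are
// appended at the end as before.
def mergedSchema(dataSchema: StructType, partitionSchema: StructType): StructType = {
  val overridden = dataSchema.map { f =>
    partitionSchema.find(_.name == f.name).map(p => f.copy(dataType = p.dataType)).getOrElse(f)
  }
  val partitionOnly = partitionSchema.filterNot(p => dataSchema.exists(_.name == p.name))
  StructType(overridden ++ partitionOnly)
}
```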
## How was this patch tested?
Add tests in `ParquetPartitionDiscoverySuite`
Author: Takeshi YAMAMURO <[email protected]>
Closes #16030 from maropu/SPARK-18108.
(cherry picked from commit dc2a4d4ad478fdb0486cc0515d4fe8b402d24db4)
Signed-off-by: Wenchen Fan <[email protected]>
commit df589be5443980f344d50afc8068f57ae18995de
Author: Dongjoon Hyun <[email protected]>
Date: 2016-12-16T19:30:21Z
[SPARK-18897][SPARKR] Fix SparkR SQL Test to drop test table
## What changes were proposed in this pull request?
The SparkR tests, `R/run-tests.sh`, succeed only once because
`test_sparkSQL.R` does not clean up the test table, `people`.
As a result, rows accumulate in the `people` table on every run and the
test cases fail.
The following is the failure output from the second run.
```r
Failed -------------------------------------------------------------------------
1. Failure: create DataFrame from RDD (test_sparkSQL.R#204) -------------------
collect(sql("SELECT age from people WHERE name = 'Bob'"))$age not equal to c(16).
Lengths differ: 2 vs 1
2. Failure: create DataFrame from RDD (test_sparkSQL.R#206) -------------------
collect(sql("SELECT height from people WHERE name ='Bob'"))$height not equal to c(176.5).
Lengths differ: 2 vs 1
```
## How was this patch tested?
Manual. Run `run-tests.sh` twice and check if it passes without failures.
Author: Dongjoon Hyun <[email protected]>
Closes #16310 from dongjoon-hyun/SPARK-18897.
(cherry picked from commit 1169db44bc1d51e68feb6ba2552520b2d660c2c0)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit d2a131a8482ab26ebd10121195a212b30042c72e
Author: Shixiong Zhu <[email protected]>
Date: 2016-12-16T23:04:11Z
[SPARK-18904][SS][TESTS] Merge two FileStreamSourceSuite files
## What changes were proposed in this pull request?
Merge two FileStreamSourceSuite files into one file.
## How was this patch tested?
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #16315 from zsxwing/FileStreamSourceSuite.
(cherry picked from commit 4faa8a3ec0bae4b210bc5d79918e008ab218f55a)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 001f49b7ca3a1fd19d1ca1112b1095c690bb89e9
Author: Felix Cheung <[email protected]>
Date: 2016-12-17T22:37:34Z
[SPARK-18849][ML][SPARKR][DOC] vignettes final check reorg
## What changes were proposed in this pull request?
Reorganizing content (copy/paste)
## How was this patch tested?
https://felixcheung.github.io/sparkr-vignettes.html
Previous:
https://felixcheung.github.io/sparkr-vignettes_old.html
Author: Felix Cheung <[email protected]>
Closes #16301 from felixcheung/rvignettespass2.
(cherry picked from commit 38fd163d0d2c44128bf8872d297b79edd7bd4137)
Signed-off-by: Felix Cheung <[email protected]>
commit 4b8a643f9bb74919a980f72ea72be957689ed8d5
Author: gatorsmile <[email protected]>
Date: 2016-12-18T09:02:04Z
[SPARK-18918][DOC] Missing </td> in Configuration page
### What changes were proposed in this pull request?
The configuration page looks messy now, as shown in the nightly build:
https://people.apache.org/~pwendell/spark-nightly/spark-master-docs/latest/configuration.html
Starting from the following location:

### How was this patch tested?
Attached is the screenshot generated on my local computer after the fix.
[Configuration - Spark 2.2.0
Documentation.pdf](https://github.com/apache/spark/files/659315/Configuration.-.Spark.2.2.0.Documentation.pdf)
Author: gatorsmile <[email protected]>
Closes #16327 from gatorsmile/docFix.
(cherry picked from commit c0c9e1d27a4c9ede768cfb150cdb26d68472f1da)
Signed-off-by: Sean Owen <[email protected]>
commit a5da8db85bcc0c183ec3bc15d9389b29c57cb103
Author: Yuming Wang <[email protected]>
Date: 2016-12-18T09:08:02Z
[SPARK-18827][CORE] Fix cannot read broadcast on disk
## What changes were proposed in this pull request?
A `NoSuchElementException` has been thrown since
https://github.com/apache/spark/pull/15056 when a broadcast cannot be cached
in memory, because that change does not cover the `!unrolled.hasNext` case in
the `next()` function.
This change covers `!unrolled.hasNext` and checks `hasNext` before calling
`next` in `blockManager.getLocalValues` to make it more robust.
With this pull request, we can cache and read a broadcast even when it cannot
fit in memory.
Exception log:
```
16/12/10 10:10:04 INFO UnifiedMemoryManager: Will not store broadcast_131 as the required space (1048576 bytes) exceeds our memory limit (122764 bytes)
16/12/10 10:10:04 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block broadcast_131 in memory.
16/12/10 10:10:04 WARN MemoryStore: Not enough space to cache broadcast_131 in memory! (computed 384.0 B so far)
16/12/10 10:10:04 INFO MemoryStore: Memory use = 95.6 KB (blocks) + 0.0 B (scratch space shared across 0 tasks(s)) = 95.6 KB. Storage limit = 119.9 KB.
16/12/10 10:10:04 ERROR Utils: Exception encountered
java.util.NoSuchElementException
  at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
  at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
  at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
  at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
  at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
  at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
  at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
16/12/10 10:10:04 ERROR Executor: Exception in task 1.0 in stage 86.0 (TID 134423)
java.io.IOException: java.util.NoSuchElementException
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1276)
  at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:206)
  at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:66)
  at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:66)
  at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:96)
  at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:86)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:745)
Caused by: java.util.NoSuchElementException
  at org.apache.spark.util.collection.PrimitiveVector$$anon$1.next(PrimitiveVector.scala:58)
  at org.apache.spark.storage.memory.PartiallyUnrolledIterator.next(MemoryStore.scala:700)
  at org.apache.spark.util.CompletionIterator.next(CompletionIterator.scala:30)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1$$anonfun$2.apply(TorrentBroadcast.scala:210)
  at scala.Option.map(Option.scala:146)
  at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:210)
  at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1269)
  ... 12 more
```
## How was this patch tested?
Add unit test
Author: Yuming Wang <[email protected]>
Closes #16252 from wangyum/SPARK-18827.
(cherry picked from commit 1e5c51f336b90cd1eed43e9c6cf00faee696174c)
Signed-off-by: Sean Owen <[email protected]>
commit 3080f995c690b34d131b428b6d63044ebc1f60eb
Author: gatorsmile <[email protected]>
Date: 2016-12-19T03:50:56Z
[SPARK-18703][SPARK-18675][SQL][BACKPORT-2.1] CTAS for hive serde table
should work for all hive versions AND Drop Staging Directories and Data Files
### What changes were proposed in this pull request?
This PR is to backport https://github.com/apache/spark/pull/16104 and
https://github.com/apache/spark/pull/16134.
----------
[[SPARK-18675][SQL] CTAS for hive serde table should work for all hive
versions](https://github.com/apache/spark/pull/16104)
Before Hive 1.1, when inserting into a table, Hive creates the staging
directory under a common scratch directory. After the write is finished, Hive
simply empties the table directory and moves the staging directory to it.
Since Hive 1.1, Hive creates the staging directory under the table
directory, and when moving the staging directory to the table directory it
still empties the table directory, but excludes the staging directory.
In `InsertIntoHiveTable`, we simply copied the code from Hive 1.2, which
means we always create the staging directory under the table directory, no
matter what the Hive version is. This causes problems if the Hive version is
prior to 1.1, because the staging directory will be removed by Hive when it
tries to empty the table directory.
This PR copies the code from Hive 0.13, so that we have two branches for
creating the staging directory: if the Hive version is prior to 1.1, we take
the old-style branch (create the staging directory under a common scratch
directory); otherwise, we take the new-style branch (create the staging
directory under the table directory).
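A simplified, hedged sketch of those two branches (an illustrative helper, not the PR's actual code):
```scala
// Decide where the staging directory goes, based on whether the Hive
// version predates 1.1 (old style) or not (new style).
def stagingLocation(tableDir: String, scratchDir: String, hivePriorTo11: Boolean): String =
  if (hivePriorTo11) s"$scratchDir/.hive-staging"   // old style: common scratch directory
  else s"$tableDir/.hive-staging"                   // new style: under the table directory
```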
----------
[[SPARK-18703] [SQL] Drop Staging Directories and Data Files After each
Insertion/CTAS of Hive serde Tables](https://github.com/apache/spark/pull/16134)
Below are the files/directories generated for three inserts against a Hive
table:
```
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/.part-00000.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-29_149_4298858301766472202-1/-ext-10000/part-00000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/.part-00000.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_454_6445008511655931341-1/-ext-10000/part-00000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/._SUCCESS.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/.part-00000.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/_SUCCESS
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.hive-staging_hive_2016-12-03_20-56-30_722_3388423608658711001-1/-ext-10000/part-00000
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
```
The first 18 files are temporary. We do not drop them until JVM
termination. If the JVM does not terminate properly, these temporary
files/directories will not be dropped.
Only the last two files are needed, as shown below.
```
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/.part-00000.crc
/private/var/folders/4b/sgmfldk15js406vk7lw5llzw0000gn/T/spark-41eaa5ce-0288-471e-bba1-09cc482813ff/part-00000
```
The temporary files/directories can accumulate quickly when we issue many
inserts, since each insert generates at least six files. This can eat a lot of
space and slow down JVM termination, and when the JVM does not terminate
properly, the files might not be dropped at all.
This PR drops the created staging files and temporary data files after each
insert/CTAS.
### How was this patch tested?
Added test cases.
Author: gatorsmile <[email protected]>
Closes #16325 from gatorsmile/backport-18703&18675.
commit fc1b25660d8d2ac676c0b020208bcb9b711978c8
Author: xuanyuanking <[email protected]>
Date: 2016-12-19T19:31:43Z
[SPARK-18700][SQL] Add StripedLock for each table's relation in cache
## What changes were proposed in this pull request?
As described in the scenario in
[SPARK-18700](https://issues.apache.org/jira/browse/SPARK-18700), when
cachedDataSourceTables is invalidated, the next few queries will fetch all
FileStatus objects in the listLeafFiles function. When a table has many
partitions, these jobs occupy a lot of driver memory and may eventually cause
a driver OOM.
This patch adds a StripedLock for each table's relation in the cache, rather
than for the whole cachedDataSourceTables, so that each table's cache-loading
operation is protected by its own lock.
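A hedged sketch of the locking pattern, assuming Guava's `Striped` locks (the names below are illustrative, not the patch's actual fields):
```scala
import java.util.concurrent.locks.Lock
import com.google.common.util.concurrent.Striped

object TableRelationLocks {
  // One stripe per table name: concurrent loads of different tables don't
  // block each other, while concurrent loads of the same table do.
  private val locks: Striped[Lock] = Striped.lock(100)

  def withTableLock[A](tableName: String)(body: => A): A = {
    val lock = locks.get(tableName)
    lock.lock()
    try body finally lock.unlock()
  }
}
```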
## How was this patch tested?
Add a multi-threaded table-access test in `PartitionedTablePerfStatsSuite`
and check, using metrics in `HiveCatalogMetrics`, that the table is loaded
only once.
Author: xuanyuanking <[email protected]>
Closes #16135 from xuanyuanking/SPARK-18700.
(cherry picked from commit 24482858e05bea84cacb41c62be0a9aaa33897ee)
Signed-off-by: Herman van Hovell <[email protected]>
commit c1a26b458dd353be3ab1a2b3f9bb80809cf63479
Author: Wenchen Fan <[email protected]>
Date: 2016-12-19T19:42:59Z
[SPARK-18921][SQL] check database existence with Hive.databaseExists
instead of getDatabase
## What changes were proposed in this pull request?
It's weird that we use `Hive.getDatabase` to check the existence of a
database while Hive has a `databaseExists` interface.
What's worse, `Hive.getDatabase` will produce an error message if the
database doesn't exist, which is annoying when we only want to check whether
the database exists.
This PR fixes this and uses `Hive.databaseExists` to check database
existence.
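A hedged before/after sketch (here `hive` stands for an `org.apache.hadoop.hive.ql.metadata.Hive` client; the helper names are made up):
```scala
import org.apache.hadoop.hive.ql.metadata.Hive

// Before (roughly): probe with getDatabase and treat a failure as "missing",
// which causes Hive to log an error for the non-existent database.
def databaseExistsViaGet(hive: Hive, db: String): Boolean =
  try { hive.getDatabase(db) != null } catch { case _: Exception => false }

// After: ask Hive directly through its dedicated existence check.
def databaseExistsDirect(hive: Hive, db: String): Boolean =
  hive.databaseExists(db)
```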
## How was this patch tested?
N/A
Author: Wenchen Fan <[email protected]>
Closes #16332 from cloud-fan/minor.
(cherry picked from commit 7a75ee1c9224aa5c2e954fe2a71f9ad506f6782b)
Signed-off-by: Yin Huai <[email protected]>
commit f07e989c02844151587f9a29fe77ea65facea422
Author: Josh Rosen <[email protected]>
Date: 2016-12-20T00:19:38Z
[SPARK-18928] Check TaskContext.isInterrupted() in FileScanRDD, JDBCRDD &
UnsafeSorter
## What changes were proposed in this pull request?
In order to respond to task cancellation, Spark tasks must periodically
check `TaskContext.isInterrupted()`, but this check is missing on a few
critical read paths used in Spark SQL, including `FileScanRDD`, `JDBCRDD`, and
UnsafeSorter-based sorts. This can cause interrupted / cancelled tasks to
continue running and become zombies (as also described in #16189).
This patch aims to fix this problem by adding `TaskContext.isInterrupted()`
checks to these paths. Note that I could have used `InterruptibleIterator` to
simply wrap a bunch of iterators but in some cases this would have an adverse
performance penalty or might not be effective due to certain special uses of
Iterators in Spark SQL. Instead, I inlined `InterruptibleIterator`-style logic
into existing iterator subclasses.
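A hedged sketch of the inlined check (the wrapper class is illustrative; it mirrors `InterruptibleIterator`'s approach rather than the PR's exact code):
```scala
import org.apache.spark.{TaskContext, TaskKilledException}

// Poll the kill flag on every hasNext so a cancelled task stops promptly
// instead of running to completion as a zombie.
class CancellableIterator[T](context: TaskContext, underlying: Iterator[T]) extends Iterator[T] {
  override def hasNext: Boolean = {
    if (context.isInterrupted()) throw new TaskKilledException
    underlying.hasNext
  }
  override def next(): T = underlying.next()
}
```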
## How was this patch tested?
Tested manually in `spark-shell` with two different reproductions of
non-cancellable tasks, one involving scans of huge files and another involving
sort-merge joins that spill to disk. Both causes of zombie tasks are fixed by
the changes added here.
Author: Josh Rosen <[email protected]>
Closes #16340 from JoshRosen/sql-task-interruption.
(cherry picked from commit 5857b9ac2d9808d9b89a5b29620b5052e2beebf5)
Signed-off-by: Herman van Hovell <[email protected]>
commit 2971ae564cb3e97aa5ecac7f411daed7d54248ad
Author: Josh Rosen <[email protected]>
Date: 2016-12-20T02:43:59Z
[SPARK-18761][CORE] Introduce "task reaper" to oversee task killing in
executors
## What changes were proposed in this pull request?
Spark's current task cancellation / task killing mechanism is "best effort"
because some tasks may not be interruptible or may not respond to their
"killed" flags being set. If a significant fraction of a cluster's task slots
are occupied by tasks that have been marked as killed but remain running then
this can lead to a situation where new jobs and tasks are starved of resources
that are being used by these zombie tasks.
This patch aims to address this problem by adding a "task reaper" mechanism
to executors. At a high-level, task killing now launches a new thread which
attempts to kill the task and then watches the task and periodically checks
whether it has been killed. The TaskReaper will periodically re-attempt to call
`TaskRunner.kill()` and will log warnings if the task keeps running. I modified
TaskRunner to rename its thread at the start of the task, allowing TaskReaper
to take a thread dump and filter it in order to log stacktraces from the exact
task thread that we are waiting to finish. If the task has not stopped after a
configurable timeout then the TaskReaper will throw an exception to trigger
executor JVM death, thereby forcibly freeing any resources consumed by the
zombie tasks.
This feature is flagged off by default and is controlled by four new
configurations under the `spark.task.reaper.*` namespace. See the updated
`configuration.md` doc for details.
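A hedged example of turning the feature on via `SparkConf` (only `spark.task.reaper.enabled` is shown here; the other `spark.task.reaper.*` keys are described in `configuration.md`):
```scala
import org.apache.spark.SparkConf

// The reaper is off by default; this flag enables it.
val conf = new SparkConf()
  .setAppName("task-reaper-demo")
  .set("spark.task.reaper.enabled", "true")
```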
## How was this patch tested?
Tested via a new test case in `JobCancellationSuite`, plus manual testing.
Author: Josh Rosen <[email protected]>
Closes #16189 from JoshRosen/cancellation.
commit cd297c390daedbfcaea8431dec4a37ca39dd26e3
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-12-20T21:12:16Z
[SPARK-18281] [SQL] [PYSPARK] Remove timeout for reading data through
socket for local iterator
## What changes were proposed in this pull request?
There is a timeout failure when using `rdd.toLocalIterator()` or
`df.toLocalIterator()` for a PySpark RDD or DataFrame:
```python
df = spark.createDataFrame([[1], [2], [3]])
it = df.toLocalIterator()
row = next(it)

df2 = df.repartition(1000)  # create many empty partitions, which increase materialization time and so cause a timeout
it2 = df2.toLocalIterator()
row = next(it2)
```
The cause of this issue is that we open a socket to serve the data from the
JVM side, and we set a timeout for connecting and reading through the socket
on the Python side. In Python we use a generator to read the data, so we only
connect to the socket once we start to ask for data. If we don't consume it
immediately, the connection times out.
On the other side, the materialization time for RDD partitions is
unpredictable, so we can't set a timeout for reading data through the socket;
otherwise it is very likely to fail.
## How was this patch tested?
Added tests into PySpark.
Author: Liang-Chi Hsieh <[email protected]>
Closes #16263 from viirya/fix-pyspark-localiterator.
(cherry picked from commit 95c95b71ed31b2971475aec6d7776dc234845d0a)
Signed-off-by: Davies Liu <[email protected]>
commit 3857d5ba8b195bc1eb4b75f00398535b42164ff1
Author: Burak Yavuz <[email protected]>
Date: 2016-12-20T22:19:35Z
[SPARK-18927][SS] MemorySink for StructuredStreaming can't recover from
checkpoint if location is provided in SessionConf
## What changes were proposed in this pull request?
The checkpoint location can be defined for a Structured Streaming query on a
per-query basis through the `DataStreamWriter` options, but it can also be
provided through SparkSession configurations. MemorySink should be able to
recover in both cases when the OutputMode is Complete.
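A hedged usage sketch; `spark` is a `SparkSession` and `aggregatedStreamingDF` is a hypothetical streaming aggregation result:
```scala
// Provide the checkpoint location through the session conf instead of
// the per-query writer options.
spark.conf.set("spark.sql.streaming.checkpointLocation", "/tmp/checkpoints")

val query = aggregatedStreamingDF.writeStream
  .format("memory")
  .queryName("agg_counts")
  .outputMode("complete")   // Complete mode is what the recovery applies to
  .start()
```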
## How was this patch tested?
Unit tests
Author: Burak Yavuz <[email protected]>
Closes #16342 from brkyvz/chk-rec.
(cherry picked from commit caed89321fdabe83e46451ca4e968f86481ad500)
Signed-off-by: Shixiong Zhu <[email protected]>
----