GitHub user yashbopardikar opened a pull request:
https://github.com/apache/spark/pull/15339
Branch 2.0
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.0
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15339.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15339
----
commit 0fb01496c09defa1436dbb7f5e1cbc5461617a31
Author: WangTaoTheTonic <[email protected]>
Date: 2016-08-11T22:09:23Z
[SPARK-17022][YARN] Handle potential deadlock in driver handling messages
## What changes were proposed in this pull request?
We send RequestExecutors directly to the AM instead of passing it through
yarnSchedulerBackend first, to avoid a potential deadlock.
## How was this patch tested?
manual tests
Author: WangTaoTheTonic <[email protected]>
Closes #14605 from WangTaoTheTonic/lock.
(cherry picked from commit ea0bf91b4a2ca3ef472906e50e31fd6268b6f53e)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit b4047fc21cefcf6a43c1ee88af330a042f02bebc
Author: Dongjoon Hyun <[email protected]>
Date: 2016-08-12T06:40:12Z
[SPARK-16975][SQL] Column-partition path starting '_' should be handled
correctly
Currently, Spark ignores path names starting with underscore `_` and `.`.
This causes read failures for column-partitioned file data sources whose
partition column names start with '_', e.g. `_col`.
**Before**
```scala
scala> spark.range(10).withColumn("_locality_code",
$"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
org.apache.spark.sql.AnalysisException: Unable to infer schema for
ParquetFormat at /tmp/parquet20. It must be specified manually;
```
**After**
```scala
scala> spark.range(10).withColumn("_locality_code",
$"id").write.partitionBy("_locality_code").save("/tmp/parquet")
scala> spark.read.parquet("/tmp/parquet")
res2: org.apache.spark.sql.DataFrame = [id: bigint, _locality_code: int]
```
Pass the Jenkins with a new test case.
Author: Dongjoon Hyun <[email protected]>
Closes #14585 from dongjoon-hyun/SPARK-16975-PARQUET.
(cherry picked from commit abff92bfdc7d4c9d2308794f0350561fe0ceb4dd)
Signed-off-by: Cheng Lian <[email protected]>
commit bde94cd71086fd348f3ba96de628d6df3f87dba5
Author: petermaxlee <[email protected]>
Date: 2016-08-12T06:56:55Z
[SPARK-17013][SQL] Parse negative numeric literals
## What changes were proposed in this pull request?
This patch updates the SQL parser to parse negative numeric literals as
numeric literals, instead of unary minus of positive literals.
This allows the parser to parse the minimal value for each data type, e.g.
"-32768S".
## How was this patch tested?
Updated test cases.
Author: petermaxlee <[email protected]>
Closes #14608 from petermaxlee/SPARK-17013.
(cherry picked from commit 00e103a6edd1a1f001a94d41dd1f7acc40a1e30f)
Signed-off-by: Reynold Xin <[email protected]>
commit 38378f59f2c91a6f07366aa2013522c334066c69
Author: Jagadeesan <[email protected]>
Date: 2016-08-13T10:25:03Z
[SPARK-12370][DOCUMENTATION] Documentation should link to examples …
## What changes were proposed in this pull request?
When documentation is built, it should reference examples from the same
build. There are times when the docs have links that point to files at GitHub
HEAD, which may not be valid for the current release. Changed those URLs to
point to the right tag in git using ```SPARK_VERSION_SHORT```.
…from its own release version] [Streaming programming guide]
Author: Jagadeesan <[email protected]>
Closes #14596 from jagadeesanas2/SPARK-12370.
(cherry picked from commit e46cb78b3b9fd04a50b5ae50f360db612d656a48)
Signed-off-by: Sean Owen <[email protected]>
commit a21ecc9964bbd6e41a5464dcc85db1529de14d67
Author: Luciano Resende <[email protected]>
Date: 2016-08-13T10:42:38Z
[SPARK-17023][BUILD] Upgrade to Kafka 0.10.0.1 release
## What changes were proposed in this pull request?
Update Kafka streaming connector to use Kafka 0.10.0.1 release
## How was this patch tested?
Tested via Spark unit and integration tests
Author: Luciano Resende <[email protected]>
Closes #14606 from lresende/kafka-upgrade.
(cherry picked from commit 67f025d90e6ba8c039ff45e26d34f20d24b92e6a)
Signed-off-by: Sean Owen <[email protected]>
commit 750f8804540df5ad68a732f68598c4a2dbbc4761
Author: Sean Owen <[email protected]>
Date: 2016-08-13T22:40:43Z
[SPARK-16966][SQL][CORE] App Name is a randomUUID even when
"spark.app.name" exists
## What changes were proposed in this pull request?
Don't override app name specified in `SparkConf` with a random app name.
Only set it if the conf has no app name even after options have been applied.
See also https://github.com/apache/spark/pull/14602
This is similar to Sherry302 's original proposal in
https://github.com/apache/spark/pull/14556
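As a rough sketch of the intended behaviour (illustrative only, assuming a local master):
```scala
import org.apache.spark.sql.SparkSession

// An app name set explicitly should survive session creation instead of being
// replaced by a random UUID.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("my-app")
  .getOrCreate()
assert(spark.sparkContext.appName == "my-app")
```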
## How was this patch tested?
Jenkins test, with new case reproducing the bug
Author: Sean Owen <[email protected]>
Closes #14630 from srowen/SPARK-16966.2.
(cherry picked from commit cdaa562c9a09e2e83e6df4e84d911ce1428a7a7c)
Signed-off-by: Reynold Xin <[email protected]>
commit e02d0d0852c5d56558ddfd13c675b3f2d70a7eea
Author: zero323 <[email protected]>
Date: 2016-08-14T10:59:24Z
[SPARK-17027][ML] Avoid integer overflow in PolynomialExpansion.getPolySize
## What changes were proposed in this pull request?
Replaces custom choose function with
o.a.commons.math3.CombinatoricsUtils.binomialCoefficient
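A sketch of the idea behind the fix (not the actual PolynomialExpansion code; assumes commons-math3 is on the classpath, where the class lives in `org.apache.commons.math3.util`):
```scala
import org.apache.commons.math3.util.CombinatoricsUtils

// A hand-rolled choose function computed in Int can silently overflow for large
// inputs; binomialCoefficient computes in Long and throws MathArithmeticException
// if even the Long range is exceeded, so overflow can no longer pass unnoticed.
val polySize: Long = CombinatoricsUtils.binomialCoefficient(60, 30)
```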
## How was this patch tested?
Spark unit tests
Author: zero323 <[email protected]>
Closes #14614 from zero323/SPARK-17027.
(cherry picked from commit 0ebf7c1bff736cf54ec47957d71394d5b75b47a7)
Signed-off-by: Sean Owen <[email protected]>
commit 8f4cacd3a7a077a43bc82b887498cde9f6fb20e3
Author: Junyang Qian <[email protected]>
Date: 2016-08-15T18:03:03Z
[SPARK-16508][SPARKR] Split docs for arrange and orderBy methods
This PR splits arrange and orderBy methods according to their functionality
(the former for sorting sparkDataFrame and the latter for windowSpec).



Author: Junyang Qian <[email protected]>
Closes #14522 from junyangq/SPARK-16508-0.
(cherry picked from commit 564fe614c11deb657e0ac9e6b75e65370c48b7fe)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 45036327fdbdb0167b3c53245fce9dc2be67ffe9
Author: Shixiong Zhu <[email protected]>
Date: 2016-08-15T22:55:32Z
[SPARK-17065][SQL] Improve the error message when encountering an
incompatible DataSourceRegister
## What changes were proposed in this pull request?
Add an instruction to ask the user to remove or upgrade the incompatible
DataSourceRegister in the error message.
## How was this patch tested?
Test command:
```
build/sbt -Dscala-2.10 package
SPARK_SCALA_VERSION=2.10 bin/spark-shell --packages
ai.h2o:sparkling-water-core_2.10:1.6.5
scala> Seq(1).toDS().write.format("parquet").save("foo")
```
Before:
```
java.util.ServiceConfigurationError:
org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.h2o.DefaultSource could not be instantiated
at java.util.ServiceLoader.fail(ServiceLoader.java:232)
at java.util.ServiceLoader.access$100(ServiceLoader.java:185)
at
java.util.ServiceLoader$LazyIterator.nextService(ServiceLoader.java:384)
at java.util.ServiceLoader$LazyIterator.next(ServiceLoader.java:404)
at java.util.ServiceLoader$1.next(ServiceLoader.java:480)
...
Caused by: java.lang.NoClassDefFoundError: org/apache/spark/Logging
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at
java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
...
```
After:
```
java.lang.ClassNotFoundException: Detected an incompatible
DataSourceRegister. Please remove the incompatible library from classpath or
upgrade it. Error: org.apache.spark.sql.sources.DataSourceRegister: Provider
org.apache.spark.h2o.DefaultSource could not be instantiated
at
org.apache.spark.sql.execution.datasources.DataSource.lookupDataSource(DataSource.scala:178)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:79)
at
org.apache.spark.sql.execution.datasources.DataSource.write(DataSource.scala:441)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:213)
at org.apache.spark.sql.DataFrameWriter.save(DataFrameWriter.scala:196)
...
```
Author: Shixiong Zhu <[email protected]>
Closes #14651 from zsxwing/SPARK-17065.
(cherry picked from commit 268b71d0d792f875fcfaec5314862236754a00d6)
Signed-off-by: Yin Huai <[email protected]>
commit 2e2c787bf588e129eaaadc792737fd9d2892939c
Author: Herman van Hovell <[email protected]>
Date: 2016-08-16T08:12:27Z
[SPARK-16964][SQL] Remove private[hive] from sql.hive.execution package
## What changes were proposed in this pull request?
This PR is a small follow-up to https://github.com/apache/spark/pull/14554.
This also widens the visibility of a few (similar) Hive classes.
## How was this patch tested?
No test. Only a visibility change.
Author: Herman van Hovell <[email protected]>
Closes #14654 from hvanhovell/SPARK-16964-hive.
(cherry picked from commit 8fdc6ce400f9130399fbdd004df48b3ba95bcd6a)
Signed-off-by: Reynold Xin <[email protected]>
commit 237ae54c960d52b35b4bc673609aed9998c2bd45
Author: Reynold Xin <[email protected]>
Date: 2016-08-16T08:14:53Z
Revert "[SPARK-16964][SQL] Remove private[hive] from sql.hive.execution
package"
This reverts commit 2e2c787bf588e129eaaadc792737fd9d2892939c.
commit 1c56971167a0ebb3c422ccc7cc3d6904015fe2ec
Author: Herman van Hovell <[email protected]>
Date: 2016-08-16T08:15:31Z
[SPARK-16964][SQL] Remove private[sql] and private[spark] from
sql.execution package [Backport]
## What changes were proposed in this pull request?
This PR backports https://github.com/apache/spark/pull/14554 to branch-2.0.
I have also changed the visibility of a few similar Hive classes.
## How was this patch tested?
(Only a package visibility change)
Author: Herman van Hovell <[email protected]>
Author: Reynold Xin <[email protected]>
Closes #14652 from hvanhovell/SPARK-16964.
commit 022230c20905a29483cfd4cc76b74fe5f208c2c8
Author: Felix Cheung <[email protected]>
Date: 2016-08-16T18:19:18Z
[SPARK-16519][SPARKR] Handle SparkR RDD generics that create warnings in R
CMD check
Rename RDD functions for now to avoid CRAN check warnings.
Some RDD functions share generics with DataFrame functions (hence the
problem), so after the renames we need to add new generics, for now.
unit tests
Author: Felix Cheung <[email protected]>
Closes #14626 from felixcheung/rrddfunctions.
(cherry picked from commit c34b546d674ce186f13d9999b97977bc281cfedf)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit 6cb3eab7cc49ad8b8459ddc479a900de9dea1bcf
Author: sandy <[email protected]>
Date: 2016-08-16T19:50:55Z
[SPARK-17089][DOCS] Remove api doc link for mapReduceTriplets operator
## What changes were proposed in this pull request?
Remove the API doc link for the mapReduceTriplets operator because it has
been removed from the latest API; a link that leads users to a page where
mapReduceTriplets no longer exists only confuses them, so it is better to drop it.
## How was this patch tested?
Run all the test cases

Author: sandy <[email protected]>
Closes #14669 from phalodi/SPARK-17089.
(cherry picked from commit e28a8c5899c48ff065e2fd3bb6b10c82b4d39c2c)
Signed-off-by: Reynold Xin <[email protected]>
commit 3e0163bee2354258899c82ce4cc4aacafd2a802d
Author: Herman van Hovell <[email protected]>
Date: 2016-08-17T04:35:39Z
[SPARK-17084][SQL] Rename ParserUtils.assert to validate
## What changes were proposed in this pull request?
This PR renames `ParserUtils.assert` to `ParserUtils.validate`. This is
done because this method is used to check requirements, and not to check if the
program is in an invalid state.
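An illustrative sketch of the distinction the rename draws (hypothetical helper, not the actual ParserUtils code):
```scala
// A requirement check on the parsed input should surface as a validation error,
// unlike an assertion, which signals that the program itself is in a bad state.
def validate(requirement: Boolean, message: String): Unit =
  if (!requirement) throw new IllegalArgumentException(message)

validate(List("a", "b").nonEmpty, "column list must not be empty")
```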
## How was this patch tested?
Simple rename. Compilation should do.
Author: Herman van Hovell <[email protected]>
Closes #14665 from hvanhovell/SPARK-17084.
(cherry picked from commit 4a2c375be2bcd98cc7e00bea920fd6a0f68a4e14)
Signed-off-by: Reynold Xin <[email protected]>
commit 68a24d3e7aa9b40d4557652d3179b0ccb0f8624e
Author: mvervuurt <[email protected]>
Date: 2016-08-17T06:12:59Z
[MINOR][DOC] Fix the descriptions for `properties` argument in the
documenation for jdbc APIs
## What changes were proposed in this pull request?
This should be credited to mvervuurt. The main purpose of this PR is
- simply to include the same change for the corresponding instance in
`DataFrameReader`, so the two match up, and
- to avoid verifying the PR twice (as I already did).
The documentation for both should be the same because both expect
`properties` to be the same `dict` for the same option.
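For reference, a hedged sketch of the `properties` argument the docs describe, written here in Scala (in Python it is a `dict`); the URL, table names, and credentials are placeholders, and a SparkSession named `spark` is assumed:
```scala
import java.util.Properties

val props = new Properties()
props.setProperty("user", "username")
props.setProperty("password", "secret")

// the reader and the writer take the same JDBC connection properties
val df = spark.read.jdbc("jdbc:postgresql://host:5432/db", "source_table", props)
df.write.jdbc("jdbc:postgresql://host:5432/db", "target_table", props)
```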
## How was this patch tested?
Manually building Python documentation.
This will produce the output as below:
- `DataFrameReader`

- `DataFrameWriter`

Closes #14624
Author: hyukjinkwon <[email protected]>
Author: mvervuurt <[email protected]>
Closes #14677 from HyukjinKwon/typo-python.
(cherry picked from commit 0f6aa8afaacdf0ceca9c2c1650ca26a5c167ae69)
Signed-off-by: Reynold Xin <[email protected]>
commit 22c7660a8744049e27ea4cc4c08755ac95ea43f5
Author: Kazuaki Ishizaki <[email protected]>
Date: 2016-08-17T13:34:57Z
[SPARK-15285][SQL] Generated SpecificSafeProjection.apply method grows
beyond 64 KB
## What changes were proposed in this pull request?
This PR splits the generated code for ```SafeProjection.apply``` by using
```ctx.splitExpressions()```. This is because the large code body for
```NewInstance``` may grow beyond 64KB bytecode size for ```apply()``` method.
Here is [the original PR](https://github.com/apache/spark/pull/13243) for
SPARK-15285. However, it broke the build with Scala 2.10, since Scala 2.10 does
not support a case class with a large number of members. Thus, it was reverted by [this
commit](https://github.com/apache/spark/commit/fa244e5a90690d6a31be50f2aa203ae1a2e9a1cf).
## How was this patch tested?
Added new tests using `DefinedByConstructorParams` instead of a case class
for Scala 2.10.
Author: Kazuaki Ishizaki <[email protected]>
Closes #14670 from kiszk/SPARK-15285-2.
(cherry picked from commit 56d86742d2600b8426d75bd87ab3c73332dca1d2)
Signed-off-by: Wenchen Fan <[email protected]>
commit 394d5986617e65852422afeb8d755e38795bbe25
Author: Wenchen Fan <[email protected]>
Date: 2016-08-17T16:31:22Z
[SPARK-17102][SQL] bypass UserDefinedGenerator for json format check
## What changes were proposed in this pull request?
We use reflection to convert `TreeNode` to a JSON string, and currently don't
support arbitrary objects. `UserDefinedGenerator` takes a function object, so we
should skip the JSON format test for it, or the tests can be flaky, e.g.
`DataFrameSuite.simple explode`: this test always fails with Scala 2.10 (branch-1.6
builds with Scala 2.10 by default) but passes with Scala 2.11 (the master branch
builds with Scala 2.11 by default).
## How was this patch tested?
N/A
Author: Wenchen Fan <[email protected]>
Closes #14679 from cloud-fan/json.
(cherry picked from commit 928ca1c6d12b23d84f9b6205e22d2e756311f072)
Signed-off-by: Yin Huai <[email protected]>
commit 9406f82db1e96c84bfacb4cac9b74aab6d4fde06
Author: Tathagata Das <[email protected]>
Date: 2016-08-17T20:31:34Z
[SPARK-17096][SQL][STREAMING] Improve exception string reported through the
StreamingQueryListener
## What changes were proposed in this pull request?
Currently, the stackTrace (as `Array[StackTraceElements]`) reported through
StreamingQueryListener.onQueryTerminated is useless as it has the stack trace
of where StreamingQueryException is defined, not the stack trace of the underlying
exception. For example, if a streaming query fails because of a / by zero
exception in a task, the `QueryTerminated.stackTrace` will have
```
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:211)
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
This is basically useless, as it is the location where the
StreamingQueryException was defined, not where the underlying failure occurred.
Here is the right way to reason about what should be posted through
StreamingQueryListener.onQueryTerminated:
- The actual exception could either be a SparkException, or an arbitrary
exception.
- SparkException reports the relevant executor stack trace of a failed
task as a string in the exception message. The `Array[StackTraceElements]`
returned by `SparkException.stackTrace()` is mostly irrelevant.
- For any arbitrary exception, the `Array[StackTraceElements]` returned
by `exception.stackTrace()` may be relevant.
- When there is an error in a streaming query, it's hard to reason whether
the `Array[StackTraceElements]` is useful or not. In fact, it is not clear
whether it is even useful to report the stack trace as this array of Java
objects. It may be sufficient to report the stack trace as a string, along
with the message. This is how Spark already reports executor stack traces (as
strings in the SparkException message).
- Hence, this PR simplifies the API by removing the array `stackTrace` from
`QueryTerminated`. Instead the `exception` returns a string containing the
message and the stack trace of the actual underlying exception that failed the
streaming query (i.e. not that of the StreamingQueryException). Anyone
interested in the actual stack trace as an array can always access it
through `streamingQuery.exception`, which returns the exception object.
With this change, if a streaming query fails because of a / by zero
exception in a task, the `QueryTerminated.exception` will be
```
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1
in stage 0.0 failed 1 times, most recent failure: Lost task 1.0 in stage 0.0
(TID 1, localhost): java.lang.ArithmeticException: / by zero
at
org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply$mcII$sp(StreamingQueryListenerSuite.scala:153)
at
org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
at
org.apache.spark.sql.streaming.StreamingQueryListenerSuite$$anonfun$5$$anonfun$apply$mcV$sp$4$$anonfun$apply$mcV$sp$5.apply(StreamingQueryListenerSuite.scala:153)
at
org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown
Source)
at
org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at
org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:370)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:232)
at
org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:226)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at
org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:86)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
at
org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1429)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1417)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1416)
at
scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at
org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1416)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
at
org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
...
```
It contains the relevant executor stack trace. In the case of a
non-SparkException, e.g. if the streaming source MemoryStream throws an exception,
the exception message will have the relevant stack trace.
```
java.lang.RuntimeException: this is the exception message
at
org.apache.spark.sql.execution.streaming.MemoryStream.getBatch(memory.scala:103)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:316)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$5.apply(StreamExecution.scala:313)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at
scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:241)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at
org.apache.spark.sql.execution.streaming.StreamProgress.foreach(StreamProgress.scala:25)
at
scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:241)
at
org.apache.spark.sql.execution.streaming.StreamProgress.flatMap(StreamProgress.scala:25)
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatch(StreamExecution.scala:313)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anonfun$org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches$1.apply$mcZ$sp(StreamExecution.scala:197)
at
org.apache.spark.sql.execution.streaming.ProcessingTimeExecutor.execute(TriggerExecutor.scala:43)
at
org.apache.spark.sql.execution.streaming.StreamExecution.org$apache$spark$sql$execution$streaming$StreamExecution$$runBatches(StreamExecution.scala:187)
at
org.apache.spark.sql.execution.streaming.StreamExecution$$anon$1.run(StreamExecution.scala:124)
```
Note that this change in the public `QueryTerminated` class is okay as the
APIs are still experimental.
## How was this patch tested?
Unit tests that test whether the right information is present in the
exception message reported through QueryTerminated object.
Author: Tathagata Das <[email protected]>
Closes #14675 from tdas/SPARK-17096.
(cherry picked from commit d60af8f6aa53373de1333cc642cf2a9d7b39d912)
Signed-off-by: Tathagata Das <[email protected]>
commit 585d1d95cb1c4419c716d3b3f595834927e0c175
Author: Xin Ren <[email protected]>
Date: 2016-08-17T23:31:42Z
[SPARK-17038][STREAMING] fix metrics retrieval source of 'lastReceivedBatch'
https://issues.apache.org/jira/browse/SPARK-17038
## What changes were proposed in this pull request?
StreamingSource's lastReceivedBatch_submissionTime,
lastReceivedBatch_processingTimeStart, and lastReceivedBatch_processingTimeEnd
all use data from lastCompletedBatch instead of lastReceivedBatch.
In particular, this makes it impossible to match lastReceivedBatch_records
with a batchID/submission time.
This is apparent when looking at StreamingSource.scala, lines 89-94.
## How was this patch tested?
Manually running unit tests on local laptop
Author: Xin Ren <[email protected]>
Closes #14681 from keypointt/SPARK-17038.
(cherry picked from commit e6bef7d52f0e19ec771fb0f3e96c7ddbd1a6a19b)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 91aa53239570d7c89e771050d79a1a857797498b
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-08-18T05:24:12Z
[SPARK-16995][SQL] TreeNodeException when flat mapping
RelationalGroupedDataset created from DataFrame containing a column created
with lit/expr
## What changes were proposed in this pull request?
A TreeNodeException is thrown when executing the following minimal example
in Spark 2.0.
import spark.implicits._
case class test (x: Int, q: Int)
val d = Seq(1).toDF("x")
d.withColumn("q", lit(0)).as[test].groupByKey(_.x).flatMapGroups { case (x, iter) => List[Int]() }.show
d.withColumn("q", expr("0")).as[test].groupByKey(_.x).flatMapGroups { case (x, iter) => List[Int]() }.show
The problem is in `FoldablePropagation`. The rule does
`transformExpressions` on the `LogicalPlan`. The query above contains a `MapGroups`,
which has a parameter `dataAttributes: Seq[Attribute]`. One attribute in
`dataAttributes` gets transformed into an `Alias(literal(0), _)` by
`FoldablePropagation`. `Alias` is not an `Attribute` and causes the error.
We can't easily detect such type inconsistency during transforming
expressions. A direct approach to this problem is to skip doing
`FoldablePropagation` on object operators as they should not contain such
expressions.
## How was this patch tested?
Jenkins tests.
Author: Liang-Chi Hsieh <[email protected]>
Closes #14648 from viirya/flat-mapping.
(cherry picked from commit 10204b9d29cd69895f5a606e75510dc64cf2e009)
Signed-off-by: Wenchen Fan <[email protected]>
commit 5735b8bd769c64e2b0e0fae75bad794cde3edc99
Author: Reynold Xin <[email protected]>
Date: 2016-08-18T08:37:25Z
[SPARK-16391][SQL] Support partial aggregation for reduceGroups
## What changes were proposed in this pull request?
This patch introduces a new private ReduceAggregator interface that is a
subclass of Aggregator. ReduceAggregator only requires a single associative and
commutative reduce function. ReduceAggregator is also used to implement
KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.
Note that the pull request was initially done by viirya.
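A minimal usage sketch of the API this change speeds up (assuming a SparkSession named `spark` with its implicits imported):
```scala
import spark.implicits._

// reduceGroups takes a single associative and commutative reduce function; with
// this change it is backed by ReduceAggregator, so the reduction can be applied
// partially on the map side before the shuffle.
val reduced = Seq(("a", 1), ("a", 2), ("b", 5)).toDS()
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))
reduced.show()
```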
## How was this patch tested?
Covered by original tests for reduceGroups, as well as a new test suite for
ReduceAggregator.
Author: Reynold Xin <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>
Closes #14576 from rxin/reduceAggregator.
(cherry picked from commit 1748f824101870b845dbbd118763c6885744f98a)
Signed-off-by: Wenchen Fan <[email protected]>
commit ec5f157a32f0c65b5f93bdde7a6334e982b3b83c
Author: petermaxlee <[email protected]>
Date: 2016-08-18T11:44:13Z
[SPARK-17117][SQL] 1 / NULL should not fail analysis
## What changes were proposed in this pull request?
This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 /
NULL" throws an analysis exception:
```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to
data type mismatch: differing types in '(1 / NULL)' (int and null).
```
The problem is that division type coercion did not take null type into
account.
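A quick check of the fixed behaviour (hypothetical snippet, assuming a SparkSession named `spark`):
```scala
// Before the fix this query failed analysis with the type-mismatch error above;
// with null type handled by division coercion it simply evaluates to NULL.
spark.sql("SELECT 1 / NULL").show()
```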
## How was this patch tested?
A unit test for the type coercion, and a few end-to-end test cases using
SQLQueryTestSuite.
Author: petermaxlee <[email protected]>
Closes #14695 from petermaxlee/SPARK-17117.
(cherry picked from commit 68f5087d2107d6afec5d5745f0cb0e9e3bdd6a0b)
Signed-off-by: Herman van Hovell <[email protected]>
commit 176af17a7213a4c2847a04f715137257657f2961
Author: Xin Ren <[email protected]>
Date: 2016-08-10T07:49:06Z
[MINOR][SPARKR] R API documentation for "coltypes" is confusing
## What changes were proposed in this pull request?
R API documentation for "coltypes" is confusing; found while working on
another ticket.
In the current version, http://spark.apache.org/docs/2.0.0/api/R/coltypes.html,
the parameters section lists "x" twice, which is a duplicate, and the example is
not very clear.


## How was this patch tested?
Tested manually on local machine. And the screenshots are like below:


Author: Xin Ren <[email protected]>
Closes #14489 from keypointt/rExample.
(cherry picked from commit 1203c8415cd11540f79a235e66a2f241ca6c71e4)
Signed-off-by: Shivaram Venkataraman <[email protected]>
commit ea684b69cd6934bc093f4a5a8b0d8470e92157cd
Author: Eric Liang <[email protected]>
Date: 2016-08-18T11:33:55Z
[SPARK-17069] Expose spark.range() as table-valued function in SQL
This adds analyzer rules for resolving table-valued functions, and adds one
builtin implementation for range(). The arguments for range() are the same as
those of `spark.range()`.
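A hypothetical usage example (assuming a SparkSession named `spark`):
```scala
// range() can now be called as a table-valued function directly in SQL, taking
// the same arguments as spark.range().
spark.sql("SELECT id FROM range(0, 10, 2)").show()
// the equivalent DataFrame API call:
spark.range(0, 10, 2).show()
```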
Unit tests.
cc hvanhovell
Author: Eric Liang <[email protected]>
Closes #14656 from ericl/sc-4309.
(cherry picked from commit 412dba63b511474a6db3c43c8618d803e604bc6b)
Signed-off-by: Reynold Xin <[email protected]>
commit c180d637a3caca0d4e46f4980c10d1005eb453bc
Author: petermaxlee <[email protected]>
Date: 2016-08-19T01:19:47Z
[SPARK-16947][SQL] Support type coercion and foldable expression for inline
tables
This patch improves inline table support with the following:
1. Support type coercion.
2. Support using foldable expressions. Previously only literals were
supported.
3. Improve error message handling.
4. Improve test coverage.
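To illustrate points 1 and 2 above, hypothetical queries of the kind the improved inline-table resolution is meant to accept (assuming a SparkSession named `spark`):
```scala
// literals of different but compatible types in one column are coerced
spark.sql("SELECT * FROM VALUES (1, 'a'), (2L, 'b') AS data(id, name)").show()
// foldable expressions, not just literals, are accepted as inline-table entries
spark.sql("SELECT * FROM VALUES (1 + 1, upper('a')) AS data(id, name)").show()
```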
Added a new unit test suite ResolveInlineTablesSuite and a new file-based
end-to-end test inline-table.sql.
Author: petermaxlee <[email protected]>
Closes #14676 from petermaxlee/SPARK-16947.
(cherry picked from commit f5472dda51b980a726346587257c22873ff708e3)
Signed-off-by: Reynold Xin <[email protected]>
commit 05b180faa4bd87498516c05d4769cc2f51d56aae
Author: Reynold Xin <[email protected]>
Date: 2016-08-19T02:02:32Z
HOTFIX: compilation broken due to protected ctor.
(cherry picked from commit b482c09fa22c5762a355f95820e4ba3e2517fb77)
Signed-off-by: Reynold Xin <[email protected]>
commit d55d1f454e6739ccff9c748f78462d789b09991f
Author: Nick Lavers <[email protected]>
Date: 2016-08-19T09:11:59Z
[SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace
JIRA issue link:
https://issues.apache.org/jira/browse/SPARK-16961
Changed one line of Utils.randomizeInPlace to allow elements to stay in
place.
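For context, a sketch of an unbiased Fisher-Yates shuffle, which is the behaviour the one-line change restores (illustrative only, not the actual Utils code):
```scala
import scala.util.Random

def shuffleInPlace[T](arr: Array[T], rnd: Random = new Random): Array[T] = {
  for (i <- arr.length - 1 to 1 by -1) {
    // drawing j from [0, i] inclusive lets arr(i) stay in place; drawing from
    // [0, i) instead is the off-by-one that biases the resulting permutation
    val j = rnd.nextInt(i + 1)
    val tmp = arr(j); arr(j) = arr(i); arr(i) = tmp
  }
  arr
}
```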
Created a unit test that runs a Pearson's chi squared test to determine
whether the output diverges significantly from a uniform distribution.
Author: Nick Lavers <[email protected]>
Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.
(cherry picked from commit 5377fc62360d5e9b5c94078e41d10a96e0e8a535)
Signed-off-by: Sean Owen <[email protected]>
commit e0c60f1850706faf2830b09af3dc6b52ffd9991e
Author: Reynold Xin <[email protected]>
Date: 2016-08-19T13:11:35Z
[SPARK-16994][SQL] Whitelist operators for predicate pushdown
## What changes were proposed in this pull request?
This patch changes predicate pushdown optimization rule (PushDownPredicate)
from using a blacklist to a whitelist. That is to say, operators must be
explicitly allowed. This approach is more future-proof: previously it was
possible for us to introduce a new operator and then render the optimization
rule incorrect.
This also fixes a bug: previously we allowed pushing a filter beneath a
limit, which was incorrect. That is to say, before this patch, the optimizer
would rewrite
```
select * from (select * from range(10) limit 5) where id > 3
```
into
```
select * from range(10) where id > 3 limit 5
```
## How was this patch tested?
- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql
Author: Reynold Xin <[email protected]>
Closes #14713 from rxin/SPARK-16994.
(cherry picked from commit 67e59d464f782ff5f509234212aa072a7653d7bf)
Signed-off-by: Wenchen Fan <[email protected]>
commit d0707c6baeb4003735a508f981111db370984354
Author: Kousuke Saruta <[email protected]>
Date: 2016-08-19T15:11:25Z
[SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is
enabled.
## What changes were proposed in this pull request?
If the following conditions are satisfied, executors don't load properties
in `hdfs-site.xml` and UnknownHostException can be thrown.
(1) NameNode HA is enabled
(2) spark.eventLogging is disabled or logging path is NOT on HDFS
(3) Using Standalone or Mesos for the cluster manager
(4) There is no code that loads the `HdfsConfiguration` class in the driver,
either directly or indirectly.
(5) The tasks access HDFS
(There might be some more conditions...)
For example, the following code causes an UnknownHostException when the
conditions above are satisfied.
```
sc.textFile("<path on HDFS>").collect
```
```
java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster
at
org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
at
org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
at
org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
at
org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
at
org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
at
org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
at
org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438)
at
org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
at
org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
at
org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
at scala.Option.map(Option.scala:146)
at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177)
at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at
org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: hacluster
```
But the following code doesn't cause the exception, because the `textFile`
method loads `HdfsConfiguration` indirectly.
```
sc.textFile("<path on HDFS>").collect
```
When a job includes operations that access HDFS, the
`org.apache.hadoop.conf.Configuration` object is wrapped in a
`SerializableConfiguration`, serialized, and broadcast from the driver to the
executors; each executor deserializes the object with `loadDefaults` set to false,
so HDFS-related properties must be set before the broadcast.
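A rough sketch of the mechanism described above (illustration only, not the patch itself; assumes the Hadoop HDFS client is on the classpath):
```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hdfs.HdfsConfiguration

// Loading HdfsConfiguration registers hdfs-default.xml and hdfs-site.xml as
// default resources, which is what "loading the class indirectly" achieves.
HdfsConfiguration.init()

// Executors rebuild the broadcast configuration with loadDefaults = false, so it
// only carries keys that were present on the driver side before serialization.
val executorSideView = new Configuration(false)
```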
## How was this patch tested?
Tested manually on my standalone cluster.
Author: Kousuke Saruta <[email protected]>
Closes #13738 from sarutak/SPARK-11227.
(cherry picked from commit 071eaaf9d2b63589f2e66e5279a16a5a484de6f5)
Signed-off-by: Tom Graves <[email protected]>
----