GitHub user yotingting opened a pull request:
https://github.com/apache/spark/pull/21264
Branch 2.2
## What changes were proposed in this pull request?
When I use YARN client mode, the Spark task locks up in the collect stage at `countDf.collect().map(_.getLong(0)).mkString.toLong`, with the executor blocked in:
```
sun.misc.Unsafe.park(Native Method)
java.util.concurrent.locks.LockSupport.parkNanos(LockSupport.java:215)
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2078)
java.util.concurrent.ThreadPoolExecutor.awaitTermination(ThreadPoolExecutor.java:1475)
org.apache.spark.rpc.netty.Dispatcher.awaitTermination(Dispatcher.scala:180)
org.apache.spark.rpc.netty.NettyRpcEnv.awaitTermination(NettyRpcEnv.scala:281)
org.apache.spark.executor.CoarseGrainedExecutorBackend$$anonfun$run$1.apply$mcV$sp(CoarseGrainedExecutorBackend.scala:231)
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:67)
org.apache.spark.deploy.SparkHadoopUtil$$anon$2.run(SparkHadoopUtil.scala:66)
java.security.AccessController.doPrivileged(Native Method)
javax.security.auth.Subject.doAs(Subject.java:422)
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1692)
org.apache.spark.deploy.SparkHadoopUtil.runAsSparkUser(SparkHadoopUtil.scala:66)
org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:188)
org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:284)
org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
```
## How was this patch tested?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21264.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21264
----
commit 83bdb04871248357ddbb665198c538f2df449006
Author: aokolnychyi <anton.okolnychyi@...>
Date: 2017-07-18T04:07:50Z
[SPARK-21332][SQL] Incorrect result type inferred for some decimal
expressions
## What changes were proposed in this pull request?
This PR changes the direction of expression transformation in the
DecimalPrecision rule. Previously, the expressions were transformed down, which
led to incorrect result types when decimal expressions had other decimal
expressions as their operands. The root cause of this issue was in visiting
outer nodes before their children. Consider the example below:
```
val inputSchema = StructType(StructField("col", DecimalType(26, 6)) :: Nil)
val sc = spark.sparkContext
val rdd = sc.parallelize(1 to 2).map(_ => Row(BigDecimal(12)))
val df = spark.createDataFrame(rdd, inputSchema)

// Works correctly since no nested decimal expression is involved
// Expected result type: (26, 6) * (26, 6) = (38, 12)
df.select($"col" * $"col").explain(true)
df.select($"col" * $"col").printSchema()

// Gives a wrong result since there is a nested decimal expression that should be visited first
// Expected result type: ((26, 6) * (26, 6)) * (26, 6) = (38, 12) * (26, 6) = (38, 18)
df.select($"col" * $"col" * $"col").explain(true)
df.select($"col" * $"col" * $"col").printSchema()
```
The example above gives the following output:
```
// Correct result without sub-expressions
== Parsed Logical Plan ==
'Project [('col * 'col) AS (col * col)#4]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
(col * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((col#1 * col#1), DecimalType(38,12)) AS (col * col)#4]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- (col * col): decimal(38,12) (nullable = true)

// Incorrect result with sub-expressions
== Parsed Logical Plan ==
'Project [(('col * 'col) * 'col) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Analyzed Logical Plan ==
((col * col) * col): decimal(38,12)
Project [CheckOverflow((promote_precision(cast(CheckOverflow((promote_precision(cast(col#1 as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) as decimal(26,6))) * promote_precision(cast(col#1 as decimal(26,6)))), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Optimized Logical Plan ==
Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- LogicalRDD [col#1]

== Physical Plan ==
*Project [CheckOverflow((cast(CheckOverflow((col#1 * col#1), DecimalType(38,12)) as decimal(26,6)) * col#1), DecimalType(38,12)) AS ((col * col) * col)#11]
+- Scan ExistingRDD[col#1]

// Schema
root
 |-- ((col * col) * col): decimal(38,12) (nullable = true)
```
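The precision and scale arithmetic shown in the expected result types can be sketched outside of Spark. This is a simplified model of the multiplication case of the `DecimalPrecision` rule, not Spark's actual code; `bounded` mirrors the capping that `DecimalType.bounded` performs at the maximum precision of 38:

```scala
// Simplified model of how decimal multiply result types compose when
// children are visited first (transform-up).
case class Dec(precision: Int, scale: Int)

// Cap precision and scale at the maximum of 38.
def bounded(p: Int, s: Int): Dec = Dec(math.min(p, 38), math.min(s, 38))

// Result type of d1 * d2 per the multiplication rule: p1 + p2 + 1, s1 + s2.
def multiplyType(d1: Dec, d2: Dec): Dec =
  bounded(d1.precision + d2.precision + 1, d1.scale + d2.scale)

val col = Dec(26, 6)
// Visiting children first means the inner product's widened type
// feeds the outer multiply, giving the expected (38, 18).
val inner = multiplyType(col, col)   // (26,6) * (26,6) -> (38,12)
val outer = multiplyType(inner, col) // (38,12) * (26,6) -> (38,18)
```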
## How was this patch tested?
This PR was tested with available unit tests. Moreover, there are tests to
cover previously failing scenarios.
Author: aokolnychyi <[email protected]>
Closes #18583 from aokolnychyi/spark-21332.
(cherry picked from commit 0be5fb41a6b7ef4da9ba36f3604ac646cb6d4ae3)
Signed-off-by: gatorsmile <[email protected]>
commit 99ce551a13f0918b440ddc094c3a32167d7ab3dd
Author: Burak Yavuz <brkyvz@...>
Date: 2017-07-18T04:09:07Z
[SPARK-21445] Make IntWrapper and LongWrapper in UTF8String Serializable
## What changes were proposed in this pull request?
Making those two classes Serializable will avoid serialization issues like the one below:
```
Caused by: java.io.NotSerializableException: org.apache.spark.unsafe.types.UTF8String$IntWrapper
Serialization stack:
    - object not serializable (class: org.apache.spark.unsafe.types.UTF8String$IntWrapper, value: org.apache.spark.unsafe.types.UTF8String$IntWrapper@326450e)
    - field (class: org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, name: result$2, type: class org.apache.spark.unsafe.types.UTF8String$IntWrapper)
    - object (class org.apache.spark.sql.catalyst.expressions.Cast$$anonfun$castToInt$1, <function1>)
```
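The failure mode can be reproduced outside Spark with plain Java serialization. This is a minimal sketch with illustrative class names, not Spark's:

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// A mutable holder without Serializable, like IntWrapper before this fix.
class Wrapper { var value: Int = 0 }
// The fix: the same holder marked Serializable.
class FixedWrapper extends Serializable { var value: Int = 0 }

// Attempt Java serialization and report whether it succeeds.
def serializable(obj: AnyRef): Boolean =
  try {
    new ObjectOutputStream(new ByteArrayOutputStream()).writeObject(obj)
    true
  } catch { case _: NotSerializableException => false }
```

Any closure that captures a `Wrapper` field fails the same way when the closure itself is serialized, which is exactly what happened with `Cast$$anonfun$castToInt$1`.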
## How was this patch tested?
- [x] Manual testing
- [ ] Unit test
Author: Burak Yavuz <[email protected]>
Closes #18660 from brkyvz/serializableutf8.
(cherry picked from commit 26cd2ca0402d7d49780116d45a5622a45c79f661)
Signed-off-by: Wenchen Fan <[email protected]>
commit df061fd5f93c8110107198a94e68a4e29248e345
Author: Wenchen Fan <wenchen@...>
Date: 2017-07-18T22:56:16Z
[SPARK-21457][SQL] ExternalCatalog.listPartitions should correctly handle
partition values with dot
## What changes were proposed in this pull request?
When we list partitions from the Hive metastore with a partial partition spec, we expect exact matching on the partition values. However, Hive treats the dot specially and matches any single character for it. We should apply an extra filter to drop unexpected partitions.
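The extra client-side filter can be sketched as follows (illustrative only, not the actual Spark code):

```scala
// Keep only partitions whose values exactly match the partial spec,
// dropping partitions Hive matched because it treats '.' as a
// single-character wildcard.
def matchesSpec(partition: Map[String, String], spec: Map[String, String]): Boolean =
  spec.forall { case (key, value) => partition.get(key).contains(value) }

val spec = Map("p" -> "1.1")
// Hive would also return the "1x1" partition for spec value "1.1".
val fromHive = Seq(Map("p" -> "1.1"), Map("p" -> "1x1"))
val exact = fromHive.filter(matchesSpec(_, spec)) // keeps only "1.1"
```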
## How was this patch tested?
new regression test.
Author: Wenchen Fan <[email protected]>
Closes #18671 from cloud-fan/hive.
(cherry picked from commit f18b905f6cace7686ef169fda7de474079d0af23)
Signed-off-by: gatorsmile <[email protected]>
commit 5a0a76f1648729dfa7ed0522dd2cb41ba805a2cd
Author: jinxing <jinxing6042@...>
Date: 2017-07-19T13:35:26Z
[SPARK-21414] Refine SlidingWindowFunctionFrame to avoid OOM.
## What changes were proposed in this pull request?
In `SlidingWindowFunctionFrame`, we currently add to the buffer all rows for which the input row value is equal to or less than the output row upper bound, and then drop from the buffer all rows for which the input row value is smaller than the output row lower bound.
This can make the buffer very large even though the window is small.
For example:
```
select a, b,
       sum(a) over (partition by b order by a range between 1000000 following and 1000001 following)
from table
```
We can refine the logic and just add the qualified rows into buffer.
## How was this patch tested?
Manual test:
Ran the SQL
`select shop, shopInfo, district, sum(revenue) over(partition by district order by revenue range between 100 following and 200 following) from revenueList limit 10`
against a table with 4 columns (shop: String, shopInfo: String, district: String, revenue: Int). The biggest partition is around 2 GB, containing 200k lines. The executor was configured with 2 GB of memory.
With the change in this PR, it works fine. Without this change, the exception below is thrown:
```
java.lang.OutOfMemoryError: Java heap space
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:504)
	at org.apache.spark.sql.catalyst.expressions.UnsafeRow.copy(UnsafeRow.java:62)
	at org.apache.spark.sql.execution.window.SlidingWindowFunctionFrame.write(WindowFunctionFrame.scala:201)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:365)
	at org.apache.spark.sql.execution.window.WindowExec$$anonfun$14$$anon$1.next(WindowExec.scala:289)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:341)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
```
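The refined buffering can be sketched with a two-pointer pass over a sorted partition. This is a simplified model (not the PR's code), assuming an integer ORDER BY column and a RANGE frame `[lo, hi]` relative to the current row's value:

```scala
// For each row, the buffer [start, end) holds exactly the rows whose
// value lies in [v + lo, v + hi]; rows outside the frame never enter
// the buffer, so it stays as small as the window.
def slidingSums(sorted: Vector[Long], lo: Long, hi: Long): Vector[Long] = {
  var start = 0
  var end = 0
  sorted.map { v =>
    // advance `end` to include rows <= v + hi
    while (end < sorted.length && sorted(end) <= v + hi) end += 1
    // advance `start` past rows < v + lo
    while (start < end && sorted(start) < v + lo) start += 1
    sorted.slice(start, end).sum
  }
}
```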
Author: jinxing <[email protected]>
Closes #18634 from jinxing64/SPARK-21414.
(cherry picked from commit 4eb081cc870a9d1c42aae90418535f7d782553e9)
Signed-off-by: Wenchen Fan <[email protected]>
commit 4c212eed1a4a75756216b13aab211d945e14d89b
Author: donnyzone <wellfengzhu@...>
Date: 2017-07-19T13:48:54Z
[SPARK-21441][SQL] Incorrect Codegen in SortMergeJoinExec results failures
in some cases
## What changes were proposed in this pull request?
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21441
This issue can be reproduced by the following example:
```
val spark = SparkSession
  .builder()
  .appName("smj-codegen")
  .master("local")
  .config("spark.sql.autoBroadcastJoinThreshold", "1")
  .getOrCreate()
val df1 = spark.createDataFrame(Seq((1, 1), (2, 2), (3, 3))).toDF("key", "int")
val df2 = spark.createDataFrame(Seq((1, "1"), (2, "2"), (3, "3"))).toDF("key", "str")
val df = df1.join(df2, df1("key") === df2("key"))
  .filter("int = 2 or reflect('java.lang.Integer', 'valueOf', str) = 1")
  .select("int")
df.show()
```
To conclude, the issue happens when:
(1) the SortMergeJoin condition contains CodegenFallback expressions, and
(2) in the physical plan tree, the SortMergeJoin node is the child of the root node, e.g., the Project in the example above.
This patch fixes the logic in `CollapseCodegenStages` rule.
## How was this patch tested?
Unit test and manual verification in our cluster.
Author: donnyzone <[email protected]>
Closes #18656 from DonnyZone/Fix_SortMergeJoinExec.
(cherry picked from commit 6b6dd682e84d3b03d0b15fbd81a0d16729e521d2)
Signed-off-by: Wenchen Fan <[email protected]>
commit 86cd3c08871618441c0c297da0f48ac284595697
Author: Tathagata Das <tathagata.das1565@...>
Date: 2017-07-19T18:02:07Z
[SPARK-21464][SS] Minimize deprecation warnings caused by ProcessingTime
class
## What changes were proposed in this pull request?
Use of the `ProcessingTime` class was deprecated in favor of `Trigger.ProcessingTime` in Spark 2.2. However, uses of `ProcessingTime` still cause deprecation warnings during compilation. This cannot be avoided entirely: even though it is deprecated as a public API, `ProcessingTime` instances are used internally in `TriggerExecutor`. This PR minimizes the warnings by removing its uses from tests as much as possible.
## How was this patch tested?
Existing tests.
Author: Tathagata Das <[email protected]>
Closes #18678 from tdas/SPARK-21464.
(cherry picked from commit 70fe99dc62ef636a99bcb8a580ad4de4dca95181)
Signed-off-by: Tathagata Das <[email protected]>
commit 308bce0eb60649b15836614567532460ea73bd12
Author: DFFuture <albert.zhang23@...>
Date: 2017-07-19T21:45:11Z
[SPARK-21446][SQL] Fix setAutoCommit never executed
## What changes were proposed in this pull request?
JIRA Issue: https://issues.apache.org/jira/browse/SPARK-21446
`options.asConnectionProperties` cannot contain `fetchsize`, because `fetchsize` is a Spark-only option, and Spark-only options are excluded from the connection properties.
So this PR changes the properties passed to `beforeFetch` from `options.asConnectionProperties.asScala.toMap` to `options.asProperties.asScala.toMap`.
## How was this patch tested?
Author: DFFuture <[email protected]>
Closes #18665 from DFFuture/sparksql_pg.
(cherry picked from commit c9729187bcef78299390e53cd9af38c3e084060e)
Signed-off-by: gatorsmile <[email protected]>
commit 9949fed1c45865b6e5e8ebe610789c5fb9546052
Author: Corey Woodfield <coreywoodfield@...>
Date: 2017-07-19T22:21:38Z
[SPARK-21333][DOCS] Removed invalid joinTypes from javadoc of
Dataset#joinWith
## What changes were proposed in this pull request?
Two invalid join types were mistakenly listed in the javadoc for `joinWith` in the `Dataset` class. I presume these were copied from the javadoc of `join`, but since `joinWith` returns a `Dataset<Tuple2>`, `left_semi` and `left_anti` are invalid, as they only return values from one of the datasets instead of from both.
## How was this patch tested?
I ran the following code :
```
public static void main(String[] args) {
  SparkSession spark = new SparkSession(new SparkContext("local[*]", "Test"));
  Dataset<Row> one = spark.createDataFrame(Arrays.asList(new Bean(1), new Bean(2), new Bean(3), new Bean(4), new Bean(5)), Bean.class);
  Dataset<Row> two = spark.createDataFrame(Arrays.asList(new Bean(4), new Bean(5), new Bean(6), new Bean(7), new Bean(8), new Bean(9)), Bean.class);

  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "inner").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "cross").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "outer").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "full_outer").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_outer").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "right_outer").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_semi").show(); } catch (Exception e) { e.printStackTrace(); }
  try { two.joinWith(one, one.col("x").equalTo(two.col("x")), "left_anti").show(); } catch (Exception e) { e.printStackTrace(); }
}
```
which tests all the different join types; the last two (left_semi and left_anti) threw exceptions. The same code using `join` instead of `joinWith` ran fine. The `Bean` class was just a Java bean with a single int field, `x`.
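A plain-collections analogue makes the type argument concrete (illustrative only, not Spark code):

```scala
// joinWith pairs rows from both sides, so its result element type is a
// tuple; a semi join yields only left-side rows, so there is no
// right-side value to put in the tuple.
case class Bean(x: Int)
val one = Seq(Bean(1), Bean(4), Bean(5))
val two = Seq(Bean(4), Bean(5), Bean(6))

// Analogue of an inner joinWith: a sequence of pairs.
val inner: Seq[(Bean, Bean)] =
  for (l <- two; r <- one if l.x == r.x) yield (l, r)

// Analogue of left_semi: only left rows -- no pair type is possible.
val semi: Seq[Bean] = two.filter(l => one.exists(_.x == l.x))
```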
Author: Corey Woodfield <[email protected]>
Closes #18462 from coreywoodfield/master.
(cherry picked from commit 8cd9cdf17a7a4ad6f2eecd7c4b388ca363c20982)
Signed-off-by: gatorsmile <[email protected]>
commit 88dccda393bc79dc6032f71b6acf8eb2b4b152be
Author: Dhruve Ashar <dhruveashar@...>
Date: 2017-07-21T19:03:46Z
[SPARK-21243][CORE] Limit no. of map outputs in a shuffle fetch
For configurations with external shuffle enabled, we have observed that if a very large number of blocks is fetched from a remote host, it puts the NodeManager under extra pressure and can crash it. This change introduces a configuration `spark.reducer.maxBlocksInFlightPerAddress` to limit the number of map outputs being fetched from a given remote address. The changes applied here are applicable for both scenarios: when external shuffle is enabled as well as disabled.
Ran the job with the default configuration, which does not change the existing behavior, and ran it with a few lower values (10, 20, 50, 100). The job ran fine and there is no change in the output. (I will update the metrics related to the NM in some time.)
Author: Dhruve Ashar <[email protected]>
Closes #18487 from dhruve/impr/SPARK-21243.
Author: Dhruve Ashar <[email protected]>
Closes #18691 from dhruve/branch-2.2.
commit da403b95353f064c24da25236fa7f905fa8ddca1
Author: Holden Karau <holden@...>
Date: 2017-07-21T23:50:47Z
[SPARK-21434][PYTHON][DOCS] Add pyspark pip documentation.
Update the Quickstart and RDD programming guides to mention pip.
Built docs locally.
Author: Holden Karau <[email protected]>
Closes #18698 from holdenk/SPARK-21434-add-pyspark-pip-documentation.
(cherry picked from commit cc00e99d5396893b2d3d50960161080837cf950a)
Signed-off-by: Holden Karau <[email protected]>
commit 62ca13dcaf79b85fca02de5628b607196534c605
Author: Marcelo Vanzin <vanzin@...>
Date: 2017-07-23T15:23:13Z
[SPARK-20904][CORE] Don't report task failures to driver during shutdown.
Executors run a thread pool with daemon threads to run tasks. This means
that those threads remain active when the JVM is shutting down, meaning
those tasks are affected by code that runs in shutdown hooks.
So if a shutdown hook messes with something that the task is using (e.g.
an HDFS connection), the task will fail and will report that failure to
the driver. That will make the driver mark the task as failed regardless
of what caused the executor to shut down. So, for example, if YARN
pre-empted
that executor, the driver would consider that task failed when it should
instead ignore the failure.
This change avoids reporting failures to the driver when shutdown hooks
are executing; this fixes the YARN preemption accounting, and doesn't really
change things much for other scenarios, other than reporting a more generic
error ("Executor lost") when the executor shuts down unexpectedly - which
is arguably more correct.
Tested with a hacky app running on spark-shell that tried to cause failures
only when shutdown hooks were running, verified that preemption didn't cause
the app to fail because of task failures exceeding the threshold.
Author: Marcelo Vanzin <[email protected]>
Closes #18594 from vanzin/SPARK-20904.
(cherry picked from commit cecd285a2aabad4e7db5a3d18944b87fbc4eee6c)
Signed-off-by: Wenchen Fan <[email protected]>
commit e5ec3390cbbef87fca8a27bea701a225e18b98ea
Author: DjvuLee <lihu@...>
Date: 2017-07-25T17:21:18Z
[SPARK-21383][YARN] Fix YarnAllocator allocating more resources than needed
When NodeManagers are slow to launch executors, the `missing` value can exceed the real value, which can lead YARN to allocate more resources than needed.
We add `numExecutorsRunning` when calculating `missing` to avoid this.
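The accounting change can be sketched as follows (the names are illustrative, not the exact `YarnAllocator` fields):

```scala
// Before: started-but-slow executor launches were not counted, so
// `missing` was overestimated and extra containers were requested.
def missingBefore(target: Int, pendingAllocate: Int): Int =
  target - pendingAllocate

// After: executors that are already running are subtracted as well.
def missingAfter(target: Int, pendingAllocate: Int, running: Int): Int =
  target - pendingAllocate - running
```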
Test by experiment.
Author: DjvuLee <[email protected]>
Closes #18651 from djvulee/YarnAllocate.
(cherry picked from commit 8de080d9f9d3deac7745f9b3428d97595975701d)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit c91191bed186f816b760af98218392f9a178942b
Author: Eric Vandenberg <ericvandenberg@...>
Date: 2017-07-25T18:45:35Z
[SPARK-21447][WEB UI] Spark history server fails to render compressed
inprogress history file in some cases.
Add failure handling for EOFException, which can be thrown during decompression of an in-progress Spark history file; treat it the same as the case where the last line can't be parsed.
## What changes were proposed in this pull request?
Failure handling for case of EOFException thrown within the
ReplayListenerBus.replay method to handle the case analogous to json parse fail
case. This path can arise in compressed inprogress history files since an
incomplete compression block could be read (not flushed by writer on a block
boundary). See the stack trace of this occurrence in the jira ticket
(https://issues.apache.org/jira/browse/SPARK-21447)
## How was this patch tested?
Added a unit test that specifically targets validating the failure handling
path appropriately when maybeTruncated is true and false.
Author: Eric Vandenberg <[email protected]>
Closes #18673 from ericvandenbergfb/fix_inprogress_compr_history_file.
(cherry picked from commit 06a9793793ca41dcef2f10ca06af091a57c721c4)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 1bfd1a83b5e18f42bf76c1d72cd0347ff578e9cd
Author: Marcelo Vanzin <vanzin@...>
Date: 2017-07-26T00:57:26Z
[SPARK-21494][NETWORK] Use correct app id when authenticating to external
service.
There was some code based on the old SASL handler in the new auth client
that
was incorrectly using the SASL user as the user to authenticate against the
external shuffle service. This caused the external service to not be able to
find the correct secret to authenticate the connection, failing the
connection.
In the course of debugging, I found that some log messages from the YARN
shuffle
service were a little noisy, so I silenced some of them, and also added a
couple
of new ones that helped find this issue. On top of that, I found that a
check
in the code that records app secrets was wrong, causing more log spam and
also
using an O(n) operation instead of an O(1) call.
Also added a new integration suite for the YARN shuffle service with auth
on,
and verified it failed before, and passes now.
Author: Marcelo Vanzin <[email protected]>
Closes #18706 from vanzin/SPARK-21494.
(cherry picked from commit 300807c6e3011e4d78c6cf750201d0ab8e5bdaf5)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 06b2ef01ed87add681144fe1d801718caba271af
Author: aokolnychyi <anton.okolnychyi@...>
Date: 2017-07-27T23:49:42Z
[SPARK-21538][SQL] Attribute resolution inconsistency in the Dataset API
## What changes were proposed in this pull request?
This PR contains a tiny update that removes an attribute resolution
inconsistency in the Dataset API. The following example is taken from the
ticket description:
```
spark.range(1).withColumnRenamed("id", "x").sort(col("id")) // works
spark.range(1).withColumnRenamed("id", "x").sort($"id")     // works
spark.range(1).withColumnRenamed("id", "x").sort('id)       // works
spark.range(1).withColumnRenamed("id", "x").sort("id")      // fails with:
// org.apache.spark.sql.AnalysisException: Cannot resolve column name "id" among (x);
```
The above `AnalysisException` happens because the last case calls
`Dataset.apply()` to convert strings into columns, which triggers attribute
resolution. To make the API consistent between overloaded methods, this PR
defers the resolution and constructs columns directly.
Author: aokolnychyi <[email protected]>
Closes #18740 from aokolnychyi/spark-21538.
(cherry picked from commit f44ead89f48f040b7eb9dfc88df0ec995b47bfe9)
Signed-off-by: gatorsmile <[email protected]>
commit 93790313b2e36e5e5ac4dfe13b285f03c42da111
Author: Yan Facai (颜发才) <facai.yan@...>
Date: 2017-07-28T02:10:35Z
[SPARK-21306][ML] OneVsRest should support setWeightCol
## What changes were proposed in this pull request?
add `setWeightCol` method for OneVsRest.
`weightCol` is ignored if classifier doesn't inherit HasWeightCol trait.
## How was this patch tested?
- [x] Added a unit test.
Author: Yan Facai (颜发才) <[email protected]>
Closes #18554 from facaiy/BUG/oneVsRest_missing_weightCol.
(cherry picked from commit a5a3189974ea4628e9489eb50099a5432174e80c)
Signed-off-by: Yanbo Liang <[email protected]>
commit df6cd35ecb710b99911f39b9d7d16cac08468b4d
Author: Remis Haroon <remis.haroon@...>
Date: 2017-07-29T12:26:10Z
[SPARK-21508][DOC] Fix example code provided in Spark Streaming
Documentation
## What changes were proposed in this pull request?
JIRA ticket :
[SPARK-21508](https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21508)
correcting a mistake in example code provided in Spark Streaming Custom
Receivers Documentation
The example code provided in the documentation on 'Spark Streaming Custom
Receivers' has an error.
doc link :
https://spark.apache.org/docs/latest/streaming-custom-receivers.html
```
// Assuming ssc is the StreamingContext
val customReceiverStream = ssc.receiverStream(new CustomReceiver(host, port))
val words = lines.flatMap(_.split(" "))
...
```
instead of `lines.flatMap(_.split(" "))`
it should be `customReceiverStream.flatMap(_.split(" "))`
## How was this patch tested?
This documentation change was tested manually with a Jekyll build, running the commands below:
```
jekyll build
jekyll serve --watch
```
Screenshots of the rendered pages are attached in the PR.
Author: Remis Haroon <[email protected]>
Closes #18770 from remisharoon/master.
(cherry picked from commit c14382030b373177cf6aa3c045e27d754368a927)
Signed-off-by: Sean Owen <[email protected]>
commit 24a9bace131465bf6a177f304cf8f05b0e4fe6ed
Author: Liang-Chi Hsieh <viirya@...>
Date: 2017-07-29T17:02:56Z
[SPARK-21555][SQL] RuntimeReplaceable should be compared semantically by
its canonicalized child
## What changes were proposed in this pull request?
When there are aliases (added for nested fields) as parameters in `RuntimeReplaceable`, those aliases can't be cleaned up by the analyzer rule `CleanupAliases`, because they are not in the children expressions.
An expression `nvl(foo.foo1, "value")` can be resolved to two semantically different expressions in a group-by query because they contain different aliases.
Because those aliases are not children of `RuntimeReplaceable`, which is a `UnaryExpression`, we can't trim them out by simply transforming the expressions in `CleanupAliases`.
If we want to replace the non-children aliases in `RuntimeReplaceable`, we need to add more code to `RuntimeReplaceable` and modify all its expressions, which makes the interface ugly IMO.
Considering those aliases will be replaced later during optimization and are thus harmless, this patch simply overrides `canonicalized` of `RuntimeReplaceable`.
One concern is `CleanupAliases`: it actually cannot clean up ALL aliases inside a plan. To make callers of this rule aware of that, this patch adds a comment to `CleanupAliases`.
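The idea can be sketched with a toy expression tree (illustrative classes, not Catalyst's):

```scala
sealed trait Expr { def canonicalized: Expr = this }
case class Col(name: String) extends Expr
case class Alias(child: Expr, name: String) extends Expr {
  // the canonical form drops the alias wrapper
  override def canonicalized: Expr = child.canonicalized
}
case class Replaceable(replacement: Expr) extends Expr {
  // compare by the canonicalized replacement, so differing alias names
  // inside the non-child parameters don't break semantic equality
  override def canonicalized: Expr = Replaceable(replacement.canonicalized)
}

def semanticEquals(a: Expr, b: Expr): Boolean = a.canonicalized == b.canonicalized
```

Two `Replaceable` expressions that differ only in internal alias names now canonicalize to the same form, so they group together as expected.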
## How was this patch tested?
Added test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18761 from viirya/SPARK-21555.
(cherry picked from commit 9c8109ef414c92553335bb1e90e9681e142128a4)
Signed-off-by: gatorsmile <[email protected]>
commit 66fa6bd6d48b08625ecedfcb5a976678141300bd
Author: Xingbo Jiang <xingbo.jiang@...>
Date: 2017-07-29T17:11:31Z
[SPARK-19451][SQL] rangeBetween method should accept Long value as boundary
## What changes were proposed in this pull request?
Long values can be passed to `rangeBetween` as range frame boundaries, but we silently convert them to Int values, which can cause wrong results and should be fixed.
Furthermore, we should accept any legal literal value as a range frame boundary. In this PR, we make this possible for Long values and make accepting other DataTypes easy to add.
This PR is mostly based on Herman's previous amazing work:
https://github.com/hvanhovell/spark/commit/596f53c339b1b4629f5651070e56a8836a397768
After this been merged, we can close #16818 .
## How was this patch tested?
Add new tests in `DataFrameWindowFunctionsSuite` and `TypeCoercionSuite`.
Author: Xingbo Jiang <[email protected]>
Closes #18540 from jiangxb1987/rangeFrame.
(cherry picked from commit 92d85637e7f382aae61c0f26eb1524d2b4c93516)
Signed-off-by: gatorsmile <[email protected]>
commit e2062b9c1106433799d2874dfe17e181fe1ecb5e
Author: gatorsmile <gatorsmile@...>
Date: 2017-07-30T03:35:22Z
Revert "[SPARK-19451][SQL] rangeBetween method should accept Long value as
boundary"
This reverts commit 66fa6bd6d48b08625ecedfcb5a976678141300bd.
commit 174543466934c6ced5812e2dfc7e1a18793cf0b1
Author: Marcelo Vanzin <vanzin@...>
Date: 2017-08-01T17:06:03Z
[SPARK-21522][CORE] Fix flakiness in LauncherServerSuite.
Handle the case where the server closes the socket before the full message
has been written by the client.
Author: Marcelo Vanzin <[email protected]>
Closes #18727 from vanzin/SPARK-21522.
(cherry picked from commit b133501800b43fa5c538a4e5ad597c9dc7d8378e)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 79e5805f9284c53b0c329f086190298b70f012c1
Author: Sean Owen <sowen@...>
Date: 2017-08-01T18:05:55Z
[SPARK-21593][DOCS] Fix 2 rendering errors on configuration page
## What changes were proposed in this pull request?
Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and
SPARK-15355.
## How was this patch tested?
Manually built and viewed docs with jekyll
Author: Sean Owen <[email protected]>
Closes #18793 from srowen/SPARK-21593.
(cherry picked from commit b1d59e60dee2a41f8eff8ef29b3bcac69111e2f0)
Signed-off-by: Sean Owen <[email protected]>
commit 67c60d78e4c4562fbf86b46d14b7d635aaf67e5b
Author: Devaraj K <devaraj@...>
Date: 2017-08-01T20:38:55Z
[SPARK-21339][CORE] spark-shell --packages option does not add jars to
classpath on windows
The jars added by the --packages option get added to the classpath with the "file:///" scheme. On Unix this is not a problem, since the scheme happens to end with the Unix path separator, which separates the jar name from its location on the classpath. On Windows, the jar file cannot be resolved from the classpath because of the scheme.
Windows: file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
Unix: file:///home/<user>/.ivy2/jars/<jar-name>.jar
With this PR, we avoid adding the 'file://' scheme to the packages jar files.
I have verified manually in Windows and Unix environments; with the change, the jars are added to the classpath like below:
Windows: C:\Users\<user>\.ivy2\jars\<jar-name>.jar
Unix: /home/<user>/.ivy2/jars/<jar-name>.jar
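The conversion can be sketched with standard Java APIs (a sketch of the idea, not the PR's exact code):

```scala
import java.net.URI
import java.nio.file.Paths

// Turn a file: URI into a plain platform path so no scheme ends up on
// the classpath; on Windows this yields a drive-letter path.
def toClasspathEntry(uri: String): String =
  Paths.get(URI.create(uri)).toString
```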
Author: Devaraj K <[email protected]>
Closes #18708 from devaraj-kavali/SPARK-21339.
(cherry picked from commit 58da1a2455258156fe8ba57241611eac1a7928ef)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 397f904219e7617386144aba87998a057bde02e3
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T17:59:59Z
[SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats
## What changes were proposed in this pull request?
This PR fixed a potential overflow issue in EventTimeStats.
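One standard way to avoid this kind of overflow (a sketch of the general technique, not necessarily the exact code in this PR) is to maintain a running average instead of a raw Long sum of event times:

```scala
// avg_{n+1} = avg_n + (x - avg_n) / (n + 1): no Long sum that can overflow.
def runningAvg(values: Seq[Long]): Double =
  values.zipWithIndex.foldLeft(0.0) { case (avg, (v, i)) =>
    avg + (v - avg) / (i + 1)
  }
```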
## How was this patch tested?
The new unit tests
Author: Shixiong Zhu <[email protected]>
Closes #18803 from zsxwing/avg.
(cherry picked from commit 7f63e85b47a93434030482160e88fe63bf9cff4e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 467ee8dff8494a730ef8c00aafc02266a794a1fe
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T21:02:13Z
[SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key
## What changes were proposed in this pull request?
When the watermark is not a column of `dropDuplicates`, it currently crashes. This PR fixes this issue.
## How was this patch tested?
The new unit test.
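Conceptually, deduplication keys on a subset of columns and merely carries the watermark column along. A minimal Python sketch of that semantics (hypothetical, not the streaming implementation):

```python
def drop_duplicates(rows, key_cols):
    """Keep the first row seen for each key tuple. Columns outside
    key_cols (e.g. the event-time/watermark column) are carried along
    but do not participate in deduplication. Illustration only."""
    seen, out = set(), []
    for row in rows:  # each row is a dict of column -> value
        key = tuple(row[c] for c in key_cols)
        if key not in seen:
            seen.add(key)
            out.append(row)
    return out
```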
Author: Shixiong Zhu <[email protected]>
Closes #18822 from zsxwing/SPARK-21546.
(cherry picked from commit 0d26b3aa55f9cc75096b0e2b309f64fe3270b9a5)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 690f491f6e979bc960baa05de1a66306b06dc85a
Author: Bryan Cutler <cutlerb@...>
Date: 2017-08-03T01:28:19Z
[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle
registry
## What changes were proposed in this pull request?
When using PySpark broadcast variables in a multi-threaded environment,
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race
condition can occur when broadcast variables pickled from one thread are
added to the shared `_pickled_broadcast_vars` and become part of the Python
command from another thread. This PR introduces a thread-safe pickled
registry using thread-local storage, so that when a Python command is
pickled (causing the broadcast variable to be pickled and added to the
registry), each thread has its own view of the pickle registry from which
to retrieve and clear the broadcast variables it used.
## How was this patch tested?
Added a unit test that causes this race condition using another thread.
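The core of the fix is a registry backed by thread-local storage. A sketch of the idea (hypothetical class, not PySpark's actual implementation):

```python
import threading

class ThreadLocalRegistry(threading.local):
    """Per-thread list of pickled broadcast variables: subclassing
    threading.local means __init__ runs once per thread, so one
    thread's additions never leak into another thread's command.
    Sketch of the idea behind the fix, not the real PySpark class."""
    def __init__(self):
        self.vars = []

    def add(self, bvar):
        self.vars.append(bvar)

    def clear(self):
        """Return and reset this thread's accumulated variables."""
        out, self.vars = self.vars, []
        return out
```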
Author: Bryan Cutler <[email protected]>
Closes #18823 from BryanCutler/branch-2.2.
commit 1bcfa2a0ccdc1d3c3c5075bc6e2838c69f5b2f7f
Author: Christiam Camacho <camacho@...>
Date: 2017-08-03T22:40:25Z
Fix Java SimpleApp spark application
## What changes were proposed in this pull request?
Add missing import and missing parentheses to invoke `SparkSession::text()`.
## How was this patch tested?
Built and ran the code for this application; ran jekyll locally per
docs/README.md.
Author: Christiam Camacho <[email protected]>
Closes #18795 from christiam/master.
(cherry picked from commit dd72b10aba9997977f82605c5c1778f02dd1f91e)
Signed-off-by: Sean Owen <[email protected]>
commit f9aae8ecde62fc6d92a4807c68d812bac6b207e2
Author: Andrew Ray <ray.andrew@...>
Date: 2017-08-04T07:58:01Z
[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table
with extreme values on the partition column
## What changes were proposed in this pull request?
An overflow of the difference of bounds on the partitioning column leads to
no data being read. This
patch checks for this overflow.
## How was this patch tested?
New unit test.
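The failure mode can be reproduced with plain 64-bit arithmetic: when the bounds sit near opposite ends of the long range, their difference wraps around. A sketch of the kind of guard involved (illustrative Python, not the actual patch):

```python
I64_MIN, I64_MAX = -(1 << 63), (1 << 63) - 1

def bounds_difference_overflows(lower: int, upper: int) -> bool:
    """True if (upper - lower) would overflow a signed 64-bit long,
    as happens with extreme partition-column bounds. Python ints are
    arbitrary-precision, so we check the 64-bit range explicitly.
    Illustrative check, not Spark's actual implementation."""
    return not (I64_MIN <= upper - lower <= I64_MAX)
```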
Author: Andrew Ray <[email protected]>
Closes #18800 from aray/SPARK-21330.
(cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6)
Signed-off-by: Sean Owen <[email protected]>
commit 841bc2f86d61769057fca08cebbb72a98bde00dc
Author: liuxian <liu.xian3@...>
Date: 2017-08-05T05:55:06Z
[SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as
group-by ordinal
## What changes were proposed in this pull request?
```
create temporary view data as select * from values
  (1, 1),
  (1, 2),
  (2, 1),
  (2, 2),
  (3, 1),
  (3, 2)
  as data(a, b);
```
`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is
[1, 3]); line 1 pos 10`
The cause of this failure:
if an aggregateExpression is an integer, then after the groupExpression is
replaced with that aggregateExpression, the groupExpression is still
treated as an ordinal.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by
using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
## How was this patch tested?
Added unit test case
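A minimal sketch of 1-based ordinal substitution (hypothetical Python, not Catalyst code). The actual fix makes this substitution non-re-entrant by applying it via `resolveOperators`, so an integer that was already substituted in from the select list is not mistaken for an ordinal again on re-analysis:

```python
def substitute_ordinals(select_exprs, group_exprs):
    """Replace 1-based integer GROUP BY ordinals with the matching
    select-list expression. Hypothetical illustration: the bug arose
    when an already-substituted integer literal was treated as an
    ordinal again on a second analysis pass."""
    resolved = []
    for g in group_exprs:
        if isinstance(g, int):  # bare integer -> treated as ordinal
            if not (1 <= g <= len(select_exprs)):
                raise ValueError(
                    f"GROUP BY position {g} is not in select list "
                    f"(valid range is [1, {len(select_exprs)}])")
            resolved.append(select_exprs[g - 1])
        else:
            resolved.append(g)
    return resolved
```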
Author: liuxian <[email protected]>
Closes #18779 from 10110346/groupby.
(cherry picked from commit 894d5a453a3f47525408ee8c91b3b594daa43ccb)
Signed-off-by: gatorsmile <[email protected]>
commit 098aaec304a6b4c94a364f08c2d8ef18009689d8
Author: vinodkc <vinod.kc.in@...>
Date: 2017-08-06T06:04:39Z
[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null
## What changes were proposed in this pull request?
`SQLContext.getConf(key, null)`, for a key that is not defined in the conf
and has no default value defined, throws an NPE. It happens only when the
conf entry has a value converter.
Added a null check on defaultValue inside `SQLConf.getConfString` to avoid
calling `entry.valueConverter(defaultValue)` on a null default.
## How was this patch tested?
Added unit test
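The null guard can be sketched in Python with None standing in for Scala's null (a hypothetical `get_conf` helper mirroring the idea, not the Scala `SQLConf` code):

```python
def get_conf(settings, key, default=None, converter=int):
    """Return the configured value, applying the value converter only
    when there is something to convert. Hypothetical sketch of the
    null check added to getConfString, not the Scala code."""
    if key in settings:
        return converter(settings[key])
    if default is None:
        return None  # previously: converter(None) raised the NPE
    return converter(default)
```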
Author: vinodkc <[email protected]>
Closes #18852 from vinodkc/br_Fix_SPARK-21588.
(cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9)
Signed-off-by: gatorsmile <[email protected]>