GitHub user xianbin opened a pull request:
https://github.com/apache/spark/pull/21691
Branch 2.2
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21691.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21691
----
commit 79e5805f9284c53b0c329f086190298b70f012c1
Author: Sean Owen <sowen@...>
Date: 2017-08-01T18:05:55Z
[SPARK-21593][DOCS] Fix 2 rendering errors on configuration page
## What changes were proposed in this pull request?
Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and
SPARK-15355.
## How was this patch tested?
Manually built and viewed docs with jekyll
Author: Sean Owen <[email protected]>
Closes #18793 from srowen/SPARK-21593.
(cherry picked from commit b1d59e60dee2a41f8eff8ef29b3bcac69111e2f0)
Signed-off-by: Sean Owen <[email protected]>
commit 67c60d78e4c4562fbf86b46d14b7d635aaf67e5b
Author: Devaraj K <devaraj@...>
Date: 2017-08-01T20:38:55Z
[SPARK-21339][CORE] spark-shell --packages option does not add jars to
classpath on windows
The jars pulled in by the --packages option are added to the classpath with the
"file:///" scheme. On Unix this happens to work, because the scheme ends with the
Unix path separator, which separates the jar name from its location on the
classpath. On Windows, the jar is not resolved from the classpath because of the
scheme.
Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar
With this PR, we are avoiding the 'file://' scheme to get added to the
packages jar files.
I have verified manually in Windows and Unix environments, with the change
it adds the jar to classpath like below,
Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
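For illustration, a minimal sketch of the idea in Scala (the helper name and exact
logic are hypothetical, not the code merged by this PR):
```Scala
import java.net.URI
import java.nio.file.Paths

// Hypothetical helper sketching the idea: drop the "file" scheme so the
// classpath entry is a plain local path on both Windows and Unix.
def toLocalClasspathEntry(jar: String): String = {
  val uri = new URI(jar)
  if (uri.getScheme == "file") Paths.get(uri).toString else jar
}
```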
Author: Devaraj K <[email protected]>
Closes #18708 from devaraj-kavali/SPARK-21339.
(cherry picked from commit 58da1a2455258156fe8ba57241611eac1a7928ef)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 397f904219e7617386144aba87998a057bde02e3
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T17:59:59Z
[SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats
## What changes were proposed in this pull request?
This PR fixed a potential overflow issue in EventTimeStats.
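For context, the overflow concerns the running statistics kept over event-time
values. A hedged sketch of an overflow-safe variant (class and fields are
illustrative, not the exact code in this PR):
```Scala
// Keep a running average instead of summing all event times into a Long,
// which can overflow for long-running queries.
case class EventTimeStatsSketch(
    var max: Long = Long.MinValue,
    var min: Long = Long.MaxValue,
    var avg: Double = 0.0,
    var count: Long = 0L) {
  def add(eventTime: Long): Unit = {
    max = math.max(max, eventTime)
    min = math.min(min, eventTime)
    count += 1
    avg += (eventTime - avg) / count
  }
}
```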
## How was this patch tested?
The new unit tests
Author: Shixiong Zhu <[email protected]>
Closes #18803 from zsxwing/avg.
(cherry picked from commit 7f63e85b47a93434030482160e88fe63bf9cff4e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 467ee8dff8494a730ef8c00aafc02266a794a1fe
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T21:02:13Z
[SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key
## What changes were proposed in this pull request?
When the watermark is not a column of `dropDuplicates`, right now it will
crash. This PR fixed this issue.
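A minimal sketch of the kind of query affected (the `eventTime` and `id` column
names are hypothetical):
```Scala
import org.apache.spark.sql.DataFrame

// The watermark column is intentionally not part of the dropDuplicates keys;
// before this fix such a streaming query could crash.
def dedupeIgnoringWatermark(streamingDf: DataFrame): DataFrame =
  streamingDf
    .withWatermark("eventTime", "10 seconds")
    .dropDuplicates("id")
```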
## How was this patch tested?
The new unit test.
Author: Shixiong Zhu <[email protected]>
Closes #18822 from zsxwing/SPARK-21546.
(cherry picked from commit 0d26b3aa55f9cc75096b0e2b309f64fe3270b9a5)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 690f491f6e979bc960baa05de1a66306b06dc85a
Author: Bryan Cutler <cutlerb@...>
Date: 2017-08-03T01:28:19Z
[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle
registry
## What changes were proposed in this pull request?
When using PySpark broadcast variables in a multi-threaded environment,
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race
condition can occur when broadcast variables that are pickled from one thread
get added to the shared ` _pickled_broadcast_vars` and become part of the
python command from another thread. This PR introduces a thread-safe pickled
registry using thread local storage so that when python command is pickled
(causing the broadcast variable to be pickled and added to the registry) each
thread will have their own view of the pickle registry to retrieve and clear
the broadcast variables used.
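The actual change is in PySpark's Python code; purely to illustrate the
thread-local-registry pattern it describes, here is a small sketch in Scala (all
names hypothetical):
```Scala
import scala.collection.mutable

// Each thread sees only the broadcast ids it registered itself, so pickling the
// command on one thread cannot pick up variables registered by another thread.
object PickledBroadcastRegistrySketch {
  private val pending = new ThreadLocal[mutable.Set[Long]] {
    override def initialValue(): mutable.Set[Long] = mutable.Set.empty[Long]
  }
  def register(broadcastId: Long): Unit = pending.get() += broadcastId
  def drain(): Set[Long] = {
    val ids = pending.get().toSet
    pending.get().clear()
    ids
  }
}
```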
## How was this patch tested?
Added a unit test that causes this race condition using another thread.
Author: Bryan Cutler <[email protected]>
Closes #18823 from BryanCutler/branch-2.2.
commit 1bcfa2a0ccdc1d3c3c5075bc6e2838c69f5b2f7f
Author: Christiam Camacho <camacho@...>
Date: 2017-08-03T22:40:25Z
Fix Java SimpleApp spark application
## What changes were proposed in this pull request?
Add missing import and missing parentheses to invoke `SparkSession::text()`.
## How was this patch tested?
Built and ran the code for this application; ran jekyll locally per
docs/README.md.
Author: Christiam Camacho <[email protected]>
Closes #18795 from christiam/master.
(cherry picked from commit dd72b10aba9997977f82605c5c1778f02dd1f91e)
Signed-off-by: Sean Owen <[email protected]>
commit f9aae8ecde62fc6d92a4807c68d812bac6b207e2
Author: Andrew Ray <ray.andrew@...>
Date: 2017-08-04T07:58:01Z
[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table
with extreme values on the partition column
## What changes were proposed in this pull request?
An overflow of the difference of bounds on the partitioning column leads to
no data being read. This
patch checks for this overflow.
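A hedged sketch of such an overflow check (not the exact code in the patch):
```Scala
// Computing upperBound - lowerBound with extreme Long bounds can wrap around;
// subtractExact surfaces the overflow so the partitioning code can handle it
// explicitly instead of silently producing empty partitions.
def safeBoundsDifference(lowerBound: Long, upperBound: Long): Option[Long] =
  try Some(Math.subtractExact(upperBound, lowerBound))
  catch { case _: ArithmeticException => None }
```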
## How was this patch tested?
New unit test.
Author: Andrew Ray <[email protected]>
Closes #18800 from aray/SPARK-21330.
(cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6)
Signed-off-by: Sean Owen <[email protected]>
commit 841bc2f86d61769057fca08cebbb72a98bde00dc
Author: liuxian <liu.xian3@...>
Date: 2017-08-05T05:55:06Z
[SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as
group-by ordinal
## What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is
[1, 3]); line 1 pos 10`
The cause of this failure:
When an aggregate expression is an integer literal, the corresponding group-by
expression is still treated as an ordinal after the substitution.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by
using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
## How was this patch tested?
Added unit test case
Author: liuxian <[email protected]>
Closes #18779 from 10110346/groupby.
(cherry picked from commit 894d5a453a3f47525408ee8c91b3b594daa43ccb)
Signed-off-by: gatorsmile <[email protected]>
commit 098aaec304a6b4c94a364f08c2d8ef18009689d8
Author: vinodkc <vinod.kc.in@...>
Date: 2017-08-06T06:04:39Z
[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null
## What changes were proposed in this pull request?
Calling SQLContext.getConf(key, null) for a key that is not defined in the conf
and has no default value throws an NPE. This happens only when the conf entry has
a value converter.
A null check on defaultValue was added inside SQLConf.getConfString to avoid
calling entry.valueConverter(defaultValue).
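A standalone sketch of the guard (not the real SQLConf internals; parameters are
illustrative):
```Scala
// Only run the value converter when the caller actually supplied a default;
// a null default is returned as-is instead of being fed to the converter.
def getConfString(
    settings: Map[String, String],
    converters: Map[String, String => String],
    key: String,
    defaultValue: String): String =
  settings.getOrElse(key, {
    if (defaultValue != null) {
      converters.get(key).map(_(defaultValue)).getOrElse(defaultValue)
    } else {
      defaultValue
    }
  })
```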
## How was this patch tested?
Added unit test
Author: vinodkc <[email protected]>
Closes #18852 from vinodkc/br_Fix_SPARK-21588.
(cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9)
Signed-off-by: gatorsmile <[email protected]>
commit 7a04def920438ef0e08b66a95befeec981e5571e
Author: Xianyang Liu <xianyang.liu@...>
Date: 2017-08-07T09:04:53Z
[SPARK-21621][CORE] Reset numRecordsWritten after
DiskBlockObjectWriter.commitAndGet called
## What changes were proposed in this pull request?
We should reset numRecordsWritten to zero after
DiskBlockObjectWriter.commitAndGet is called. When `revertPartialWritesAndClose`
is called, the written-record count in `ShuffleWriteMetrics` is decremented.
However, it was being decremented all the way to zero, which is wrong: only the
records written after the last `commitAndGet` call should be rolled back.
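A simplified sketch of the bookkeeping described above (not the actual
DiskBlockObjectWriter code):
```Scala
// Track records written since the last commit separately, so a revert only
// rolls back the uncommitted tail instead of zeroing everything.
class RecordCountingWriterSketch {
  private var numRecordsWritten = 0   // records since the last successful commit
  private var committedRecords = 0    // records already committed

  def recordWritten(): Unit = numRecordsWritten += 1

  def commitAndGet(): Int = {
    committedRecords += numRecordsWritten
    numRecordsWritten = 0             // the fix: reset after a successful commit
    committedRecords
  }

  def revertPartialWritesAndClose(): Int = {
    numRecordsWritten = 0             // discard only the uncommitted records
    committedRecords
  }
}
```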
## How was this patch tested?
Modified existing test.
Author: Xianyang Liu <[email protected]>
Closes #18830 from ConeyLiu/DiskBlockObjectWriter.
(cherry picked from commit 534a063f7c693158437d13224f50d4ae789ff6fb)
Signed-off-by: Wenchen Fan <[email protected]>
commit 4f0eb0c862c0362b14fc5db468f4fc08fb8a08c6
Author: Xiao Li <gatorsmile@...>
Date: 2017-08-07T16:00:01Z
[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS
### What changes were proposed in this pull request?
author: BoleynSu
closes https://github.com/apache/spark/pull/18836
```Scala
val df = Seq((1, 1)).toDF("i", "j")
df.createOrReplaceTempView("T")
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
    "cross join T t2 where t2.i = t1.i").explain(true)
}
```
The above code could cause the following exception:
```
SortMergeJoinExec should not take Cross as the JoinType
java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross
as the JoinType
at
org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
```
Our SortMergeJoinExec supports CROSS. We should not hit such an exception.
This PR is to fix the issue.
### How was this patch tested?
Modified the two existing test cases.
Author: Xiao Li <[email protected]>
Author: Boleyn Su <[email protected]>
Closes #18863 from gatorsmile/pr-18836.
(cherry picked from commit bbfd6b5d24be5919a3ab1ac3eaec46e33201df39)
Signed-off-by: Wenchen Fan <[email protected]>
commit 43f9c84b6749b2ebf802e1f062238167b2b1f3bb
Author: Andrey Taptunov <taptunov@...>
Date: 2017-08-05T05:40:04Z
[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled
FS cache
This PR replaces #18623 to do some clean up.
Closes #18623
Jenkins
Author: Shixiong Zhu <[email protected]>
Author: Andrey Taptunov <[email protected]>
Closes #18848 from zsxwing/review-pr18623.
commit fa92a7be709e78db8e8f50dca8e13855c1034fde
Author: Jose Torres <joseph-torres@...>
Date: 2017-08-07T19:27:16Z
[SPARK-21565][SS] Propagate metadata in attribute replacement.
## What changes were proposed in this pull request?
Propagate metadata in attribute replacement during streaming execution.
This is necessary for EventTimeWatermarks consuming replaced attributes.
## How was this patch tested?
new unit test, which was verified to fail before the fix
Author: Jose Torres <[email protected]>
Closes #18840 from joseph-torres/SPARK-21565.
(cherry picked from commit cce25b360ee9e39d9510134c73a1761475eaf4ac)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a1c1199e122889ed34415be5e4da67168107a595
Author: gatorsmile <gatorsmile@...>
Date: 2017-08-07T20:04:04Z
[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when
parallel fetching parameters are not properly provided.
### What changes were proposed in this pull request?
```SQL
CREATE TABLE mytesttable1
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
  dbtable 'mytesttable1',
  paritionColumn 'state_id',
  lowerBound '0',
  upperBound '52',
  numPartitions '53',
  fetchSize '10000'
)
```
The option name `paritionColumn` above is misspelled, which means the user
effectively did not provide a value for `partitionColumn`. In that case, the user
hits a confusing error.
```
AssertionError: assertion failed
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
```
### How was this patch tested?
Added a test case
Author: gatorsmile <[email protected]>
Closes #18864 from gatorsmile/jdbcPartCol.
(cherry picked from commit baf5cac0f8c35925c366464d7e0eb5f6023fce57)
Signed-off-by: gatorsmile <[email protected]>
commit 86609a95af4b700e83638b7416c7e3706c2d64c6
Author: Liang-Chi Hsieh <viirya@...>
Date: 2017-08-08T08:12:41Z
[SPARK-21567][SQL] Dataset should work with type alias
If we create a type alias for a type that works with Dataset, the alias itself
does not work with Dataset.
A reproducible case looks like:
  object C {
    type TwoInt = (Int, Int)
    def tupleTypeAlias: TwoInt = (1, 1)
  }
  Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
It throws an exception like:
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at
scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
...
This patch uses the dealiased type in many places in `ScalaReflection` to fix it.
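The core idea, sketched (illustrative; the PR touches many pattern matches in
`ScalaReflection`):
```Scala
import scala.reflect.runtime.universe._

// Resolve type aliases before matching on the type, so `type TwoInt = (Int, Int)`
// is seen as the underlying Tuple2 type.
def underlyingType(tpe: Type): Type = tpe.dealias
```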
Added test case.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18813 from viirya/SPARK-21567.
(cherry picked from commit ee1304199bcd9c1d5fc94f5b06fdd5f6fe7336a1)
Signed-off-by: Wenchen Fan <[email protected]>
commit e87ffcaa3e5b75f8d313dc995e4801063b60cd5c
Author: Wenchen Fan <wenchen@...>
Date: 2017-08-08T08:32:49Z
Revert "[SPARK-21567][SQL] Dataset should work with type alias"
This reverts commit 86609a95af4b700e83638b7416c7e3706c2d64c6.
commit d0233145208eb6afcd9fe0c1c3a9dbbd35d7727e
Author: pgandhi <pgandhi@...>
Date: 2017-08-09T05:46:06Z
[SPARK-21503][UI] Spark UI shows incorrect task status for a killed
Executor Process
The executor tab on Spark UI page shows task as completed when an executor
process that is running that task is killed using the kill command.
Added a case for ExecutorLostFailure, which was previously missing, so the default
case was executed and the task was marked as completed. The new case covers all
situations where the executor's connection to the Spark driver is lost, such as
the executor process being killed or a network failure.
## How was this patch tested?
Manually Tested the fix by observing the UI change before and after.
Before (screenshot):
https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png
After (screenshot):
https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png
Author: pgandhi <[email protected]>
Author: pgandhi999 <[email protected]>
Closes #18707 from pgandhi999/master.
(cherry picked from commit f016f5c8f6c6aae674e9905a5c0b0bede09163a4)
Signed-off-by: Wenchen Fan <[email protected]>
commit 7446be3328ea75a5197b2587e3a8e2ca7977726b
Author: WeichenXu <weichenxu123@...>
Date: 2017-08-09T06:44:10Z
[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong
wolfe line search
## What changes were proposed in this pull request?
Update breeze to 0.13.2 for an emergency bugfix in the strong Wolfe line search
https://github.com/scalanlp/breeze/pull/651
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #18797 from WeichenXu123/update-breeze.
(cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d)
Signed-off-by: Yanbo Liang <[email protected]>
commit f6d56d2f1c377000921effea2b1faae15f9cae82
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-09T06:49:33Z
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the
return value
Same PR as #18799 but for branch 2.2. The main discussion is in the other PR.
--------
When I was investigating a flaky test, I realized that many places don't
check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When
a batch is supposed to be there, the caller just ignores None rather than
throwing an error. If a bug causes a query not to generate a batch metadata file,
this behavior hides it, allows the query to keep running, and eventually deletes
the metadata logs, making the issue hard to debug.
This PR ensures that places calling HDFSMetadataLog.get always check the
return value.
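A sketch of the fail-fast calling pattern (the helper is illustrative, not the
actual code):
```Scala
// Fail loudly when a batch that must exist is missing from the metadata log,
// instead of silently ignoring a None result.
def getExistingBatch[T](lookup: Long => Option[T], batchId: Long): T =
  lookup(batchId).getOrElse(
    throw new IllegalStateException(s"Batch $batchId does not exist in the metadata log"))
```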
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #18890 from tdas/SPARK-21596-2.2.
commit 3ca55eaafee8f4216eb5466021a97604713033a1
Author: 10087686 <wang.jiaochun@...>
Date: 2017-08-09T10:45:38Z
[SPARK-21663][TESTS] test("remote fetch below max RPC message size") should
call masterTracker.stop() in MapOutputTrackerSuite
Signed-off-by: 10087686 <wang.jiaochunzte.com.cn>
## What changes were proposed in this pull request?
After the unit tests end, masterTracker.stop() should be called to free resources.
## How was this patch tested?
Run unit tests.
Author: 10087686 <[email protected]>
Closes #18867 from wangjiaochun/mapout.
(cherry picked from commit 6426adffaf152651c30d481bb925d5025fd6130a)
Signed-off-by: Wenchen Fan <[email protected]>
commit c909496983314b48dd4d8587e586b553b04ff0ce
Author: Reynold Xin <rxin@...>
Date: 2017-08-11T01:56:25Z
[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog
## What changes were proposed in this pull request?
This patch removes the unused SessionCatalog.getTableMetadataOption and
ExternalCatalog.getTableOption.
## How was this patch tested?
Removed the test case.
Author: Reynold Xin <[email protected]>
Closes #18912 from rxin/remove-getTableOption.
(cherry picked from commit 584c7f14370cdfafdc6cd554b2760b7ce7709368)
Signed-off-by: Reynold Xin <[email protected]>
commit 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c
Author: Tejas Patil <tejasp@...>
Date: 2017-08-11T20:01:00Z
[SPARK-21595] Separate thresholds for buffering and spilling in
ExternalAppendOnlyUnsafeRowArray
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported
excessive spilling to disk because the default spill threshold for
`ExternalAppendOnlyUnsafeRowArray` is quite small for the WINDOW operator. The old
behaviour of the WINDOW operator (pre https://github.com/apache/spark/pull/16909)
was to hold data in an array for the first 4096 records, after which it switched
to `UnsafeExternalSorter` and started spilling to disk once
`spark.shuffle.spill.numElementsForceSpillThreshold` was reached (or earlier if
memory was scarce due to excessive consumers).
Currently, both the switch from the in-memory array to `UnsafeExternalSorter` and
the point at which `UnsafeExternalSorter` spills to disk are controlled by a
single threshold for `ExternalAppendOnlyUnsafeRowArray`. This PR separates them to
allow more granular control.
## How was this patch tested?
Added unit tests
Author: Tejas Patil <[email protected]>
Closes #18843 from tejasapatil/SPARK-21595.
(cherry picked from commit 94439997d57875838a8283c543f9b44705d3a503)
Signed-off-by: Herman van Hovell <[email protected]>
commit 7b9807754fd43756ba852bf93590a5024f2aa129
Author: Andrew Ash <andrew@...>
Date: 2017-08-14T14:48:08Z
[SPARK-21563][CORE] Fix race condition when serializing TaskDescriptions
and adding jars
## What changes were proposed in this pull request?
Fix the race condition when serializing TaskDescriptions and adding jars by
keeping the set of jars and files for a TaskSet constant across the lifetime of
the TaskSet. Otherwise TaskDescription serialization can produce an invalid
serialization when new files/jars are added concurrently while the TaskDescription
is serialized.
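A minimal sketch of the approach (illustrative): snapshot the added jars/files
once per TaskSet so later additions cannot race with serialization.
```Scala
import scala.collection.mutable

// Copy the driver's mutable jar/file maps into immutable maps once per TaskSet;
// every TaskDescription built for this TaskSet then serializes the same snapshot.
def snapshotAddedEntries(added: mutable.Map[String, Long]): Map[String, Long] =
  added.toMap
```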
## How was this patch tested?
Additional unit test ensures jars/files contained in the TaskDescription
remain constant throughout the lifetime of the TaskSet.
Author: Andrew Ash <[email protected]>
Closes #18913 from ash211/SPARK-21563.
(cherry picked from commit 6847e93cf427aa971dac1ea261c1443eebf4089e)
Signed-off-by: Wenchen Fan <[email protected]>
commit 48bacd36c673bcbe20dc2e119cddb2a61261a394
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-14T22:06:55Z
[SPARK-21696][SS] Fix a potential issue that may generate partial snapshot
files
## What changes were proposed in this pull request?
Directly writing a snapshot file may generate a partial file. This PR
changes it to write to a temp file then rename to the target file.
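A hedged sketch of the write-then-rename pattern on a local filesystem (the actual
change operates on the streaming state store's files, so the paths and calls here
are illustrative):
```Scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the snapshot to a temporary file first, then atomically move it into
// place so readers never observe a partially written snapshot.
def writeSnapshotAtomically(targetPath: String, bytes: Array[Byte]): Unit = {
  val tmp = Paths.get(targetPath + ".tmp")
  Files.write(tmp, bytes)
  Files.move(tmp, Paths.get(targetPath), StandardCopyOption.ATOMIC_MOVE)
}
```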
## How was this patch tested?
Jenkins.
Author: Shixiong Zhu <[email protected]>
Closes #18928 from zsxwing/SPARK-21696.
(cherry picked from commit 282f00b410fdc4dc69b9d1f3cb3e2ba53cd85b8b)
Signed-off-by: Tathagata Das <[email protected]>
commit d9c8e6223f6b31bfbca33b1064ead9720cfefa10
Author: Liang-Chi Hsieh <viirya@...>
Date: 2017-08-15T05:29:15Z
[SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are
successfully removed
## What changes were proposed in this pull request?
We put the staging path into the deleteOnExit cache of `FileSystem` in case the
path cannot be removed successfully. But when we do remove the path successfully,
we do not remove it from the cache. We should do so, to keep the cache from
growing continuously.
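A sketch of the cleanup using the Hadoop `FileSystem` API (the exact call site in
the PR may differ):
```Scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Once the staging path is removed successfully, also drop it from the
// deleteOnExit cache so the cache does not keep growing.
def deleteAndForget(fs: FileSystem, stagingPath: Path): Boolean = {
  val deleted = fs.delete(stagingPath, true)
  if (deleted) {
    fs.cancelDeleteOnExit(stagingPath)
  }
  deleted
}
```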
## How was this patch tested?
Added a test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18934 from viirya/SPARK-21721.
(cherry picked from commit 4c3cf1cc5cdb400ceef447d366e9f395cd87b273)
Signed-off-by: gatorsmile <[email protected]>
commit f1accc8511cf034fa4edee0c0a5747def0df04a2
Author: Jan Vrsovsky <jan.vrsovsky@...>
Date: 2017-08-16T07:21:42Z
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)
Check the option "numFeatures" only when reading LibSVM, not when writing.
When writing, Spark was raising an exception. After the change it will ignore
the option completely. liancheng HyukjinKwon
(Maybe the usage should be forbidden when writing, in a major version
change?).
Manual test, that loading and writing LibSVM files work fine, both with and
without the numFeatures option.
Author: Jan Vrsovsky <[email protected]>
Closes #18872 from ProtD/master.
(cherry picked from commit 8321c141f63a911a97ec183aefa5ff75a338c051)
Signed-off-by: Sean Owen <[email protected]>
commit f5ede0d558e3db51867d8c1c0a12c8fb286c797c
Author: John Lee <jlee2@...>
Date: 2017-08-16T14:44:09Z
[SPARK-21656][CORE] spark dynamic allocation should not idle timeout
executors when tasks still to run
## What changes were proposed in this pull request?
Right now Spark lets go of executors when they have been idle for 60s (or a
configurable time). I have seen Spark let them go when they were idle but still
really needed, for example when the scheduler was waiting for node locality, which
can take longer than the default idle timeout. In these jobs the number of
executors drops very low (fewer than 10) while there are still around 80,000 tasks
to run.
We should consider not allowing executors to idle timeout if they are still
needed according to the number of tasks to be run.
## How was this patch tested?
Tested by manually adding executors to `executorsIdsToBeRemoved` list and
seeing if those executors were removed when there are a lot of tasks and a high
`numExecutorsTarget` value.
Code used
In `ExecutorAllocationManager.start()`
```
start_time = clock.getTimeMillis()
```
In `ExecutorAllocationManager.schedule()`
```
val executorIdsToBeRemoved = ArrayBuffer[String]()
if (now > start_time + 1000 * 60 * 2) {
  logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
  start_time += 1000 * 60 * 100
  var counter = 0
  for (x <- executorIds) {
    counter += 1
    if (counter == 2) {
      counter = 0
      executorIdsToBeRemoved += x
    }
  }
}
```
Author: John Lee <[email protected]>
Closes #18874 from yoonlee95/SPARK-21656.
(cherry picked from commit adf005dabe3b0060033e1eeaedbab31a868efc8c)
Signed-off-by: Tom Graves <[email protected]>
commit 2a9697593add425efa15d51afb501b6236a78e26
Author: Wenchen Fan <wenchen@...>
Date: 2017-08-16T16:36:33Z
[SPARK-18464][SQL][BACKPORT] support old table which doesn't store schema
in table properties
backport https://github.com/apache/spark/pull/18907 to branch 2.2
Author: Wenchen Fan <[email protected]>
Closes #18963 from cloud-fan/backport.
commit fdea642dbd17d74c8bf136c1746159acaa937d25
Author: donnyzone <wellfengzhu@...>
Date: 2017-08-18T05:37:32Z
[SPARK-21739][SQL] Cast expression should initialize timezoneId when it is
called statically to convert something into TimestampType
## What changes were proposed in this pull request?
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
This issue is caused by introducing TimeZoneAwareExpression.
When the **Cast** expression converts something into TimestampType, it
should be resolved with setting `timezoneId`. In general, it is resolved in
LogicalPlan phase.
However, there are still some places that use Cast expression statically to
convert datatypes without setting `timezoneId`. In such cases,
`NoSuchElementException: None.get` will be thrown for TimestampType.
This PR is proposed to fix the issue. We have checked the whole project and found
two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
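A hedged sketch of the calling pattern (the Cast constructor shape and the "UTC"
placeholder are illustrative, drawn from memory of the API rather than this PR's
diff):
```Scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// When building a Cast to TimestampType statically (outside the analyzer),
// pass the session time zone explicitly so timeZoneId is never left unresolved.
val castWithZone = Cast(Literal("2017-08-18 00:00:00"), TimestampType, Some("UTC"))
```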
## How was this patch tested?
unit test
Author: donnyzone <[email protected]>
Closes #18960 from DonnyZone/spark-21739.
(cherry picked from commit 310454be3b0ce5ff6b6ef0070c5daadf6fb16927)
Signed-off-by: gatorsmile <[email protected]>
commit 6c2a38a381f22029abd9ca4beab49b2473a13670
Author: Cédric Pelvet <cedric.pelvet@...>
Date: 2017-08-20T10:05:54Z
[MINOR] Correct validateAndTransformSchema in GaussianMixture and
AFTSurvivalRegression
## What changes were proposed in this pull request?
The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
did not modify the variable schema, hence only the last line had any effect. A
temporary variable is used to correctly append the two columns predictionCol
and probabilityCol.
## How was this patch tested?
Manually.
Author: Cédric Pelvet <[email protected]>
Closes #18980 from sharp-pixel/master.
(cherry picked from commit 73e04ecc4f29a0fe51687ed1337c61840c976f89)
Signed-off-by: Sean Owen <[email protected]>
----