GitHub user xianbin opened a pull request:
https://github.com/apache/spark/pull/21691
Branch 2.2
## What changes were proposed in this pull request?
(Please fill in changes proposed in this fix)
## How was this patch tested?
(Please explain how this patch was tested. E.g. unit tests, integration
tests, manual tests)
(If this patch involves UI changes, please attach a screenshot; otherwise,
remove this)
Please review http://spark.apache.org/contributing.html before opening a
pull request.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.2
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/21691.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #21691
----
commit 79e5805f9284c53b0c329f086190298b70f012c1
Author: Sean Owen <sowen@...>
Date: 2017-08-01T18:05:55Z
[SPARK-21593][DOCS] Fix 2 rendering errors on configuration page
## What changes were proposed in this pull request?
Fix 2 rendering errors on configuration doc page, due to SPARK-21243 and
SPARK-15355.
## How was this patch tested?
Manually built and viewed docs with jekyll
Author: Sean Owen <[email protected]>
Closes #18793 from srowen/SPARK-21593.
(cherry picked from commit b1d59e60dee2a41f8eff8ef29b3bcac69111e2f0)
Signed-off-by: Sean Owen <[email protected]>
commit 67c60d78e4c4562fbf86b46d14b7d635aaf67e5b
Author: Devaraj K <devaraj@...>
Date: 2017-08-01T20:38:55Z
[SPARK-21339][CORE] spark-shell --packages option does not add jars to
classpath on windows
The jars pulled in by the --packages option are added to the classpath with the
"file:///" scheme. On Unix this happens to work, because the scheme ends with the
Unix path separator, which separates the jar name from its location on the
classpath. On Windows, the jar is not resolved from the classpath because of the
scheme.
Windows : file:///C:/Users/<user>/.ivy2/jars/<jar-name>.jar
Unix : file:///home/<user>/.ivy2/jars/<jar-name>.jar
With this PR, we are avoiding the 'file://' scheme to get added to the
packages jar files.
I have verified manually in Windows and Unix environments, with the change
it adds the jar to classpath like below,
Windows : C:\Users\<user>\.ivy2\jars\<jar-name>.jar
Unix : /home/<user>/.ivy2/jars/<jar-name>.jar
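For illustration, a minimal sketch of the idea in Scala (the helper name and exact
logic are hypothetical, not the code merged by this PR):
```Scala
import java.net.URI
import java.nio.file.Paths

// Hypothetical helper sketching the idea: drop the "file" scheme so the
// classpath entry is a plain local path on both Windows and Unix.
def toLocalClasspathEntry(jar: String): String = {
  val uri = new URI(jar)
  if (uri.getScheme == "file") Paths.get(uri).toString else jar
}
```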
Author: Devaraj K <[email protected]>
Closes #18708 from devaraj-kavali/SPARK-21339.
(cherry picked from commit 58da1a2455258156fe8ba57241611eac1a7928ef)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 397f904219e7617386144aba87998a057bde02e3
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T17:59:59Z
[SPARK-21597][SS] Fix a potential overflow issue in EventTimeStats
## What changes were proposed in this pull request?
This PR fixed a potential overflow issue in EventTimeStats.
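For context, the overflow concerns the running statistics kept over event-time
values. A hedged sketch of an overflow-safe variant (class and fields are
illustrative, not the exact code in this PR):
```Scala
// Keep a running average instead of summing all event times into a Long,
// which can overflow for long-running queries.
case class EventTimeStatsSketch(
    var max: Long = Long.MinValue,
    var min: Long = Long.MaxValue,
    var avg: Double = 0.0,
    var count: Long = 0L) {
  def add(eventTime: Long): Unit = {
    max = math.max(max, eventTime)
    min = math.min(min, eventTime)
    count += 1
    avg += (eventTime - avg) / count
  }
}
```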
## How was this patch tested?
The new unit tests
Author: Shixiong Zhu <[email protected]>
Closes #18803 from zsxwing/avg.
(cherry picked from commit 7f63e85b47a93434030482160e88fe63bf9cff4e)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 467ee8dff8494a730ef8c00aafc02266a794a1fe
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-02T21:02:13Z
[SPARK-21546][SS] dropDuplicates should ignore watermark when it's not a key
## What changes were proposed in this pull request?
When the watermark is not a column of `dropDuplicates`, right now it will
crash. This PR fixed this issue.
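A minimal sketch of the kind of query affected (the `eventTime` and `id` column
names are hypothetical):
```Scala
import org.apache.spark.sql.DataFrame

// The watermark column is intentionally not part of the dropDuplicates keys;
// before this fix such a streaming query could crash.
def dedupeIgnoringWatermark(streamingDf: DataFrame): DataFrame =
  streamingDf
    .withWatermark("eventTime", "10 seconds")
    .dropDuplicates("id")
```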
## How was this patch tested?
The new unit test.
Author: Shixiong Zhu <[email protected]>
Closes #18822 from zsxwing/SPARK-21546.
(cherry picked from commit 0d26b3aa55f9cc75096b0e2b309f64fe3270b9a5)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 690f491f6e979bc960baa05de1a66306b06dc85a
Author: Bryan Cutler <cutlerb@...>
Date: 2017-08-03T01:28:19Z
[SPARK-12717][PYTHON][BRANCH-2.2] Adding thread-safe broadcast pickle
registry
## What changes were proposed in this pull request?
When using PySpark broadcast variables in a multi-threaded environment,
`SparkContext._pickled_broadcast_vars` becomes a shared resource. A race
condition can occur when broadcast variables that are pickled from one thread
get added to the shared ` _pickled_broadcast_vars` and become part of the
python command from another thread. This PR introduces a thread-safe pickled
registry using thread local storage so that when python command is pickled
(causing the broadcast variable to be pickled and added to the registry) each
thread will have their own view of the pickle registry to retrieve and clear
the broadcast variables used.
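The actual change is in PySpark's Python code; purely to illustrate the
thread-local-registry pattern it describes, here is a small sketch in Scala (all
names hypothetical):
```Scala
import scala.collection.mutable

// Each thread sees only the broadcast ids it registered itself, so pickling the
// command on one thread cannot pick up variables registered by another thread.
object PickledBroadcastRegistrySketch {
  private val pending = new ThreadLocal[mutable.Set[Long]] {
    override def initialValue(): mutable.Set[Long] = mutable.Set.empty[Long]
  }
  def register(broadcastId: Long): Unit = pending.get() += broadcastId
  def drain(): Set[Long] = {
    val ids = pending.get().toSet
    pending.get().clear()
    ids
  }
}
```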
## How was this patch tested?
Added a unit test that causes this race condition using another thread.
Author: Bryan Cutler <[email protected]>
Closes #18823 from BryanCutler/branch-2.2.
commit 1bcfa2a0ccdc1d3c3c5075bc6e2838c69f5b2f7f
Author: Christiam Camacho <camacho@...>
Date: 2017-08-03T22:40:25Z
Fix Java SimpleApp spark application
## What changes were proposed in this pull request?
Add missing import and missing parentheses to invoke `SparkSession::text()`.
## How was this patch tested?
Built and ran the code for this application; ran jekyll locally per
docs/README.md.
Author: Christiam Camacho <[email protected]>
Closes #18795 from christiam/master.
(cherry picked from commit dd72b10aba9997977f82605c5c1778f02dd1f91e)
Signed-off-by: Sean Owen <[email protected]>
commit f9aae8ecde62fc6d92a4807c68d812bac6b207e2
Author: Andrew Ray <ray.andrew@...>
Date: 2017-08-04T07:58:01Z
[SPARK-21330][SQL] Bad partitioning does not allow to read a JDBC table
with extreme values on the partition column
## What changes were proposed in this pull request?
An overflow of the difference of bounds on the partitioning column leads to
no data being read. This
patch checks for this overflow.
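A hedged sketch of such an overflow check (not the exact code in the patch):
```Scala
// Computing upperBound - lowerBound with extreme Long bounds can wrap around;
// subtractExact surfaces the overflow so the partitioning code can handle it
// explicitly instead of silently producing empty partitions.
def safeBoundsDifference(lowerBound: Long, upperBound: Long): Option[Long] =
  try Some(Math.subtractExact(upperBound, lowerBound))
  catch { case _: ArithmeticException => None }
```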
## How was this patch tested?
New unit test.
Author: Andrew Ray <[email protected]>
Closes #18800 from aray/SPARK-21330.
(cherry picked from commit 25826c77ddf0d5753d2501d0e764111da2caa8b6)
Signed-off-by: Sean Owen <[email protected]>
commit 841bc2f86d61769057fca08cebbb72a98bde00dc
Author: liuxian <liu.xian3@...>
Date: 2017-08-05T05:55:06Z
[SPARK-21580][SQL] Integers in aggregation expressions are wrongly taken as
group-by ordinal
## What changes were proposed in this pull request?
create temporary view data as select * from values
(1, 1),
(1, 2),
(2, 1),
(2, 2),
(3, 1),
(3, 2)
as data(a, b);
`select 3, 4, sum(b) from data group by 1, 2;`
`select 3 as c, 4 as d, sum(b) from data group by c, d;`
When running these two cases, the following exception occurred:
`Error in query: GROUP BY position 4 is not in select list (valid range is
[1, 3]); line 1 pos 10`
The cause of this failure:
When an aggregate expression is an integer literal, the corresponding group-by
expression is still treated as an ordinal after the substitution.
The solution:
This bug is due to re-entrance of an analyzed plan. We can solve it by
using `resolveOperators` in `SubstituteUnresolvedOrdinals`.
## How was this patch tested?
Added unit test case
Author: liuxian <[email protected]>
Closes #18779 from 10110346/groupby.
(cherry picked from commit 894d5a453a3f47525408ee8c91b3b594daa43ccb)
Signed-off-by: gatorsmile <[email protected]>
commit 098aaec304a6b4c94a364f08c2d8ef18009689d8
Author: vinodkc <vinod.kc.in@...>
Date: 2017-08-06T06:04:39Z
[SPARK-21588][SQL] SQLContext.getConf(key, null) should return null
## What changes were proposed in this pull request?
Calling SQLContext.getConf(key, null) for a key that is not defined in the conf
and has no default value throws an NPE. This happens only when the conf entry has
a value converter.
A null check on defaultValue was added inside SQLConf.getConfString to avoid
calling entry.valueConverter(defaultValue).
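A standalone sketch of the guard (not the real SQLConf internals; parameters are
illustrative):
```Scala
// Only run the value converter when the caller actually supplied a default;
// a null default is returned as-is instead of being fed to the converter.
def getConfString(
    settings: Map[String, String],
    converters: Map[String, String => String],
    key: String,
    defaultValue: String): String =
  settings.getOrElse(key, {
    if (defaultValue != null) {
      converters.get(key).map(_(defaultValue)).getOrElse(defaultValue)
    } else {
      defaultValue
    }
  })
```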
## How was this patch tested?
Added unit test
Author: vinodkc <[email protected]>
Closes #18852 from vinodkc/br_Fix_SPARK-21588.
(cherry picked from commit 1ba967b25e6d88be2db7a4e100ac3ead03a2ade9)
Signed-off-by: gatorsmile <[email protected]>
commit 7a04def920438ef0e08b66a95befeec981e5571e
Author: Xianyang Liu <xianyang.liu@...>
Date: 2017-08-07T09:04:53Z
[SPARK-21621][CORE] Reset numRecordsWritten after
DiskBlockObjectWriter.commitAndGet called
## What changes were proposed in this pull request?
We should reset numRecordsWritten to zero after
DiskBlockObjectWriter.commitAndGet is called. When `revertPartialWritesAndClose`
is called, the written-record count in `ShuffleWriteMetrics` is decremented.
However, it was being decremented all the way to zero, which is wrong: only the
records written after the last `commitAndGet` call should be rolled back.
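A simplified sketch of the bookkeeping described above (not the actual
DiskBlockObjectWriter code):
```Scala
// Track records written since the last commit separately, so a revert only
// rolls back the uncommitted tail instead of zeroing everything.
class RecordCountingWriterSketch {
  private var numRecordsWritten = 0   // records since the last successful commit
  private var committedRecords = 0    // records already committed

  def recordWritten(): Unit = numRecordsWritten += 1

  def commitAndGet(): Int = {
    committedRecords += numRecordsWritten
    numRecordsWritten = 0             // the fix: reset after a successful commit
    committedRecords
  }

  def revertPartialWritesAndClose(): Int = {
    numRecordsWritten = 0             // discard only the uncommitted records
    committedRecords
  }
}
```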
## How was this patch tested?
Modified existing test.
Author: Xianyang Liu <[email protected]>
Closes #18830 from ConeyLiu/DiskBlockObjectWriter.
(cherry picked from commit 534a063f7c693158437d13224f50d4ae789ff6fb)
Signed-off-by: Wenchen Fan <[email protected]>
commit 4f0eb0c862c0362b14fc5db468f4fc08fb8a08c6
Author: Xiao Li <gatorsmile@...>
Date: 2017-08-07T16:00:01Z
[SPARK-21647][SQL] Fix SortMergeJoin when using CROSS
### What changes were proposed in this pull request?
author: BoleynSu
closes https://github.com/apache/spark/pull/18836
```Scala
val df = Seq((1, 1)).toDF("i", "j")
df.createOrReplaceTempView("T")
withSQLConf(SQLConf.AUTO_BROADCASTJOIN_THRESHOLD.key -> "-1") {
  sql("select * from (select a.i from T a cross join T t where t.i = a.i) as t1 " +
    "cross join T t2 where t2.i = t1.i").explain(true)
}
```
The above code could cause the following exception:
```
SortMergeJoinExec should not take Cross as the JoinType
java.lang.IllegalArgumentException: SortMergeJoinExec should not take Cross
as the JoinType
at
org.apache.spark.sql.execution.joins.SortMergeJoinExec.outputOrdering(SortMergeJoinExec.scala:100)
```
Our SortMergeJoinExec supports CROSS. We should not hit such an exception.
This PR is to fix the issue.
### How was this patch tested?
Modified the two existing test cases.
Author: Xiao Li <[email protected]>
Author: Boleyn Su <[email protected]>
Closes #18863 from gatorsmile/pr-18836.
(cherry picked from commit bbfd6b5d24be5919a3ab1ac3eaec46e33201df39)
Signed-off-by: Wenchen Fan <[email protected]>
commit 43f9c84b6749b2ebf802e1f062238167b2b1f3bb
Author: Andrey Taptunov <taptunov@...>
Date: 2017-08-05T05:40:04Z
[SPARK-21374][CORE] Fix reading globbed paths from S3 into DF with disabled
FS cache
This PR replaces #18623 to do some clean up.
Closes #18623
Jenkins
Author: Shixiong Zhu <[email protected]>
Author: Andrey Taptunov <[email protected]>
Closes #18848 from zsxwing/review-pr18623.
commit fa92a7be709e78db8e8f50dca8e13855c1034fde
Author: Jose Torres <joseph-torres@...>
Date: 2017-08-07T19:27:16Z
[SPARK-21565][SS] Propagate metadata in attribute replacement.
## What changes were proposed in this pull request?
Propagate metadata in attribute replacement during streaming execution.
This is necessary for EventTimeWatermarks consuming replaced attributes.
## How was this patch tested?
new unit test, which was verified to fail before the fix
Author: Jose Torres <[email protected]>
Closes #18840 from joseph-torres/SPARK-21565.
(cherry picked from commit cce25b360ee9e39d9510134c73a1761475eaf4ac)
Signed-off-by: Shixiong Zhu <[email protected]>
commit a1c1199e122889ed34415be5e4da67168107a595
Author: gatorsmile <gatorsmile@...>
Date: 2017-08-07T20:04:04Z
[SPARK-21648][SQL] Fix confusing assert failure in JDBC source when
parallel fetching parameters are not properly provided.
### What changes were proposed in this pull request?
```SQL
CREATE TABLE mytesttable1
USING org.apache.spark.sql.jdbc
OPTIONS (
  url 'jdbc:mysql://${jdbcHostname}:${jdbcPort}/${jdbcDatabase}?user=${jdbcUsername}&password=${jdbcPassword}',
  dbtable 'mytesttable1',
  paritionColumn 'state_id',
  lowerBound '0',
  upperBound '52',
  numPartitions '53',
  fetchSize '10000'
)
```
The option name `paritionColumn` above is misspelled, which means the user
effectively did not provide a value for `partitionColumn`. In that case, the user
hits a confusing error.
```
AssertionError: assertion failed
java.lang.AssertionError: assertion failed
at scala.Predef$.assert(Predef.scala:156)
at
org.apache.spark.sql.execution.datasources.jdbc.JdbcRelationProvider.createRelation(JdbcRelationProvider.scala:39)
at
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:312)
```
### How was this patch tested?
Added a test case
Author: gatorsmile <[email protected]>
Closes #18864 from gatorsmile/jdbcPartCol.
(cherry picked from commit baf5cac0f8c35925c366464d7e0eb5f6023fce57)
Signed-off-by: gatorsmile <[email protected]>
commit 86609a95af4b700e83638b7416c7e3706c2d64c6
Author: Liang-Chi Hsieh <viirya@...>
Date: 2017-08-08T08:12:41Z
[SPARK-21567][SQL] Dataset should work with type alias
If we create a type alias for a type that works with Dataset, the alias itself
does not work with Dataset.
A reproducible case looks like:
  object C {
    type TwoInt = (Int, Int)
    def tupleTypeAlias: TwoInt = (1, 1)
  }
  Seq(1).toDS().map(_ => ("", C.tupleTypeAlias))
It throws an exception like:
type T1 is not a class
scala.ScalaReflectionException: type T1 is not a class
at
scala.reflect.api.Symbols$SymbolApi$class.asClass(Symbols.scala:275)
...
This patch uses the dealiased type in many places in `ScalaReflection` to fix it.
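The core idea, sketched (illustrative; the PR touches many pattern matches in
`ScalaReflection`):
```Scala
import scala.reflect.runtime.universe._

// Resolve type aliases before matching on the type, so `type TwoInt = (Int, Int)`
// is seen as the underlying Tuple2 type.
def underlyingType(tpe: Type): Type = tpe.dealias
```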
Added test case.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18813 from viirya/SPARK-21567.
(cherry picked from commit ee1304199bcd9c1d5fc94f5b06fdd5f6fe7336a1)
Signed-off-by: Wenchen Fan <[email protected]>
commit e87ffcaa3e5b75f8d313dc995e4801063b60cd5c
Author: Wenchen Fan <wenchen@...>
Date: 2017-08-08T08:32:49Z
Revert "[SPARK-21567][SQL] Dataset should work with type alias"
This reverts commit 86609a95af4b700e83638b7416c7e3706c2d64c6.
commit d0233145208eb6afcd9fe0c1c3a9dbbd35d7727e
Author: pgandhi <pgandhi@...>
Date: 2017-08-09T05:46:06Z
[SPARK-21503][UI] Spark UI shows incorrect task status for a killed
Executor Process
The executor tab on Spark UI page shows task as completed when an executor
process that is running that task is killed using the kill command.
Added a case for ExecutorLostFailure, which was previously missing, so the default
case was executed and the task was marked as completed. The new case covers all
situations where the executor's connection to the Spark driver is lost, such as
the executor process being killed or a network failure.
## How was this patch tested?
Manually Tested the fix by observing the UI change before and after.
Before (screenshot):
https://user-images.githubusercontent.com/22228190/28482929-571c9cea-6e30-11e7-93dd-728de5cdea95.png
After (screenshot):
https://user-images.githubusercontent.com/22228190/28482964-8649f5ee-6e30-11e7-91bd-2eb2089c61cc.png
Author: pgandhi <[email protected]>
Author: pgandhi999 <[email protected]>
Closes #18707 from pgandhi999/master.
(cherry picked from commit f016f5c8f6c6aae674e9905a5c0b0bede09163a4)
Signed-off-by: Wenchen Fan <[email protected]>
commit 7446be3328ea75a5197b2587e3a8e2ca7977726b
Author: WeichenXu <weichenxu123@...>
Date: 2017-08-09T06:44:10Z
[SPARK-21523][ML] update breeze to 0.13.2 for an emergency bugfix in strong
wolfe line search
## What changes were proposed in this pull request?
Update breeze to 0.13.2 for an emergency bugfix in the strong Wolfe line search
https://github.com/scalanlp/breeze/pull/651
## How was this patch tested?
N/A
Author: WeichenXu <[email protected]>
Closes #18797 from WeichenXu123/update-breeze.
(cherry picked from commit b35660dd0e930f4b484a079d9e2516b0a7dacf1d)
Signed-off-by: Yanbo Liang <[email protected]>
commit f6d56d2f1c377000921effea2b1faae15f9cae82
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-09T06:49:33Z
[SPARK-21596][SS] Ensure places calling HDFSMetadataLog.get check the
return value
Same PR as #18799 but for branch 2.2. The main discussion is in the other PR.
--------
When I was investigating a flaky test, I realized that many places don't
check the return value of `HDFSMetadataLog.get(batchId: Long): Option[T]`. When
a batch is supposed to be there, the caller just ignores None rather than
throwing an error. If a bug causes a query not to generate a batch metadata file,
this behavior hides it, allows the query to keep running, and eventually deletes
the metadata logs, making the issue hard to debug.
This PR ensures that places calling HDFSMetadataLog.get always check the
return value.
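A sketch of the fail-fast calling pattern (the helper is illustrative, not the
actual code):
```Scala
// Fail loudly when a batch that must exist is missing from the metadata log,
// instead of silently ignoring a None result.
def getExistingBatch[T](lookup: Long => Option[T], batchId: Long): T =
  lookup(batchId).getOrElse(
    throw new IllegalStateException(s"Batch $batchId does not exist in the metadata log"))
```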
Jenkins
Author: Shixiong Zhu <[email protected]>
Closes #18890 from tdas/SPARK-21596-2.2.
commit 3ca55eaafee8f4216eb5466021a97604713033a1
Author: 10087686 <wang.jiaochun@...>
Date: 2017-08-09T10:45:38Z
[SPARK-21663][TESTS] test("remote fetch below max RPC message size") should
call masterTracker.stop() in MapOutputTrackerSuite
Signed-off-by: 10087686 <wang.jiaochunzte.com.cn>
## What changes were proposed in this pull request?
After the unit tests end, masterTracker.stop() should be called to free resources.
## How was this patch tested?
Run unit tests.
Author: 10087686 <[email protected]>
Closes #18867 from wangjiaochun/mapout.
(cherry picked from commit 6426adffaf152651c30d481bb925d5025fd6130a)
Signed-off-by: Wenchen Fan <[email protected]>
commit c909496983314b48dd4d8587e586b553b04ff0ce
Author: Reynold Xin <rxin@...>
Date: 2017-08-11T01:56:25Z
[SPARK-21699][SQL] Remove unused getTableOption in ExternalCatalog
## What changes were proposed in this pull request?
This patch removes the unused SessionCatalog.getTableMetadataOption and
ExternalCatalog.getTableOption.
## How was this patch tested?
Removed the test case.
Author: Reynold Xin <[email protected]>
Closes #18912 from rxin/remove-getTableOption.
(cherry picked from commit 584c7f14370cdfafdc6cd554b2760b7ce7709368)
Signed-off-by: Reynold Xin <[email protected]>
commit 406eb1c2ee670c2f14f2737c32c9aa0b8d35bf7c
Author: Tejas Patil <tejasp@...>
Date: 2017-08-11T20:01:00Z
[SPARK-21595] Separate thresholds for buffering and spilling in
ExternalAppendOnlyUnsafeRowArray
## What changes were proposed in this pull request?
[SPARK-21595](https://issues.apache.org/jira/browse/SPARK-21595) reported
excessive spilling to disk because the default spill threshold for
`ExternalAppendOnlyUnsafeRowArray` is quite small for the WINDOW operator. The old
behaviour of the WINDOW operator (pre https://github.com/apache/spark/pull/16909)
was to hold data in an array for the first 4096 records, after which it switched
to `UnsafeExternalSorter` and started spilling to disk once
`spark.shuffle.spill.numElementsForceSpillThreshold` was reached (or earlier if
memory was scarce due to excessive consumers).
Currently, both the switch from the in-memory array to `UnsafeExternalSorter` and
the point at which `UnsafeExternalSorter` spills to disk are controlled by a
single threshold for `ExternalAppendOnlyUnsafeRowArray`. This PR separates them to
allow more granular control.
## How was this patch tested?
Added unit tests
Author: Tejas Patil <[email protected]>
Closes #18843 from tejasapatil/SPARK-21595.
(cherry picked from commit 94439997d57875838a8283c543f9b44705d3a503)
Signed-off-by: Herman van Hovell <[email protected]>
commit 7b9807754fd43756ba852bf93590a5024f2aa129
Author: Andrew Ash <andrew@...>
Date: 2017-08-14T14:48:08Z
[SPARK-21563][CORE] Fix race condition when serializing TaskDescriptions
and adding jars
## What changes were proposed in this pull request?
Fix the race condition when serializing TaskDescriptions and adding jars by
keeping the set of jars and files for a TaskSet constant across the lifetime of
the TaskSet. Otherwise TaskDescription serialization can produce an invalid
serialization when new files/jars are added concurrently while the TaskDescription
is serialized.
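A minimal sketch of the approach (illustrative): snapshot the added jars/files
once per TaskSet so later additions cannot race with serialization.
```Scala
import scala.collection.mutable

// Copy the driver's mutable jar/file maps into immutable maps once per TaskSet;
// every TaskDescription built for this TaskSet then serializes the same snapshot.
def snapshotAddedEntries(added: mutable.Map[String, Long]): Map[String, Long] =
  added.toMap
```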
## How was this patch tested?
Additional unit test ensures jars/files contained in the TaskDescription
remain constant throughout the lifetime of the TaskSet.
Author: Andrew Ash <[email protected]>
Closes #18913 from ash211/SPARK-21563.
(cherry picked from commit 6847e93cf427aa971dac1ea261c1443eebf4089e)
Signed-off-by: Wenchen Fan <[email protected]>
commit 48bacd36c673bcbe20dc2e119cddb2a61261a394
Author: Shixiong Zhu <shixiong@...>
Date: 2017-08-14T22:06:55Z
[SPARK-21696][SS] Fix a potential issue that may generate partial snapshot
files
## What changes were proposed in this pull request?
Directly writing a snapshot file may generate a partial file. This PR
changes it to write to a temp file then rename to the target file.
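A hedged sketch of the write-then-rename pattern on a local filesystem (the actual
change operates on the streaming state store's files, so the paths and calls here
are illustrative):
```Scala
import java.nio.file.{Files, Paths, StandardCopyOption}

// Write the snapshot to a temporary file first, then atomically move it into
// place so readers never observe a partially written snapshot.
def writeSnapshotAtomically(targetPath: String, bytes: Array[Byte]): Unit = {
  val tmp = Paths.get(targetPath + ".tmp")
  Files.write(tmp, bytes)
  Files.move(tmp, Paths.get(targetPath), StandardCopyOption.ATOMIC_MOVE)
}
```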
## How was this patch tested?
Jenkins.
Author: Shixiong Zhu <[email protected]>
Closes #18928 from zsxwing/SPARK-21696.
(cherry picked from commit 282f00b410fdc4dc69b9d1f3cb3e2ba53cd85b8b)
Signed-off-by: Tathagata Das <[email protected]>
commit d9c8e6223f6b31bfbca33b1064ead9720cfefa10
Author: Liang-Chi Hsieh <viirya@...>
Date: 2017-08-15T05:29:15Z
[SPARK-21721][SQL] Clear FileSystem deleteOnExit cache when paths are
successfully removed
## What changes were proposed in this pull request?
We put the staging path into the deleteOnExit cache of `FileSystem` in case the
path cannot be removed successfully. But when we do remove the path successfully,
we do not remove it from the cache. We should do so, to keep the cache from
growing continuously.
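A sketch of the cleanup using the Hadoop `FileSystem` API (the exact call site in
the PR may differ):
```Scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Once the staging path is removed successfully, also drop it from the
// deleteOnExit cache so the cache does not keep growing.
def deleteAndForget(fs: FileSystem, stagingPath: Path): Boolean = {
  val deleted = fs.delete(stagingPath, true)
  if (deleted) {
    fs.cancelDeleteOnExit(stagingPath)
  }
  deleted
}
```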
## How was this patch tested?
Added a test.
Author: Liang-Chi Hsieh <[email protected]>
Closes #18934 from viirya/SPARK-21721.
(cherry picked from commit 4c3cf1cc5cdb400ceef447d366e9f395cd87b273)
Signed-off-by: gatorsmile <[email protected]>
commit f1accc8511cf034fa4edee0c0a5747def0df04a2
Author: Jan Vrsovsky <jan.vrsovsky@...>
Date: 2017-08-16T07:21:42Z
[SPARK-21723][ML] Fix writing LibSVM (key not found: numFeatures)
Check the option "numFeatures" only when reading LibSVM, not when writing.
When writing, Spark was raising an exception. After the change it will ignore
the option completely. liancheng HyukjinKwon
(Maybe the usage should be forbidden when writing, in a major version
change?).
Manual test, that loading and writing LibSVM files work fine, both with and
without the numFeatures option.
Author: Jan Vrsovsky <[email protected]>
Closes #18872 from ProtD/master.
(cherry picked from commit 8321c141f63a911a97ec183aefa5ff75a338c051)
Signed-off-by: Sean Owen <[email protected]>
commit f5ede0d558e3db51867d8c1c0a12c8fb286c797c
Author: John Lee <jlee2@...>
Date: 2017-08-16T14:44:09Z
[SPARK-21656][CORE] spark dynamic allocation should not idle timeout
executors when tasks still to run
## What changes were proposed in this pull request?
Right now Spark lets go of executors when they have been idle for 60s (or a
configurable time). I have seen Spark let them go when they were idle but still
really needed, for example when the scheduler was waiting for node locality, which
can take longer than the default idle timeout. In these jobs the number of
executors drops very low (fewer than 10) while there are still around 80,000 tasks
to run.
We should consider not allowing executors to idle timeout if they are still
needed according to the number of tasks to be run.
## How was this patch tested?
Tested by manually adding executors to `executorsIdsToBeRemoved` list and
seeing if those executors were removed when there are a lot of tasks and a high
`numExecutorsTarget` value.
Code used
In `ExecutorAllocationManager.start()`
```
start_time = clock.getTimeMillis()
```
In `ExecutorAllocationManager.schedule()`
```
val executorIdsToBeRemoved = ArrayBuffer[String]()
if (now > start_time + 1000 * 60 * 2) {
  logInfo("--- REMOVING 1/2 of the EXECUTORS ---")
  start_time += 1000 * 60 * 100
  var counter = 0
  for (x <- executorIds) {
    counter += 1
    if (counter == 2) {
      counter = 0
      executorIdsToBeRemoved += x
    }
  }
}
```
Author: John Lee <[email protected]>
Closes #18874 from yoonlee95/SPARK-21656.
(cherry picked from commit adf005dabe3b0060033e1eeaedbab31a868efc8c)
Signed-off-by: Tom Graves <[email protected]>
commit 2a9697593add425efa15d51afb501b6236a78e26
Author: Wenchen Fan <wenchen@...>
Date: 2017-08-16T16:36:33Z
[SPARK-18464][SQL][BACKPORT] support old table which doesn't store schema
in table properties
backport https://github.com/apache/spark/pull/18907 to branch 2.2
Author: Wenchen Fan <[email protected]>
Closes #18963 from cloud-fan/backport.
commit fdea642dbd17d74c8bf136c1746159acaa937d25
Author: donnyzone <wellfengzhu@...>
Date: 2017-08-18T05:37:32Z
[SPARK-21739][SQL] Cast expression should initialize timezoneId when it is
called statically to convert something into TimestampType
## What changes were proposed in this pull request?
https://issues.apache.org/jira/projects/SPARK/issues/SPARK-21739
This issue is caused by introducing TimeZoneAwareExpression.
When the **Cast** expression converts something into TimestampType, it
should be resolved with setting `timezoneId`. In general, it is resolved in
LogicalPlan phase.
However, there are still some places that use Cast expression statically to
convert datatypes without setting `timezoneId`. In such cases,
`NoSuchElementException: None.get` will be thrown for TimestampType.
This PR is proposed to fix the issue. We have checked the whole project and found
two such usages (i.e., in `TableReader` and `HiveTableScanExec`).
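A hedged sketch of the calling pattern (the Cast constructor shape and the "UTC"
placeholder are illustrative, drawn from memory of the API rather than this PR's
diff):
```Scala
import org.apache.spark.sql.catalyst.expressions.{Cast, Literal}
import org.apache.spark.sql.types.TimestampType

// When building a Cast to TimestampType statically (outside the analyzer),
// pass the session time zone explicitly so timeZoneId is never left unresolved.
val castWithZone = Cast(Literal("2017-08-18 00:00:00"), TimestampType, Some("UTC"))
```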
## How was this patch tested?
unit test
Author: donnyzone <[email protected]>
Closes #18960 from DonnyZone/spark-21739.
(cherry picked from commit 310454be3b0ce5ff6b6ef0070c5daadf6fb16927)
Signed-off-by: gatorsmile <[email protected]>
commit 6c2a38a381f22029abd9ca4beab49b2473a13670
Author: Cédric Pelvet <cedric.pelvet@...>
Date: 2017-08-20T10:05:54Z
[MINOR] Correct validateAndTransformSchema in GaussianMixture and
AFTSurvivalRegression
## What changes were proposed in this pull request?
The line SchemaUtils.appendColumn(schema, $(predictionCol), IntegerType)
did not modify the variable schema, hence only the last line had any effect. A
temporary variable is used to correctly append the two columns predictionCol
and probabilityCol.
## How was this patch tested?
Manually.
Author: Cédric Pelvet <[email protected]>
Closes #18980 from sharp-pixel/master.
(cherry picked from commit 73e04ecc4f29a0fe51687ed1337c61840c976f89)
Signed-off-by: Sean Owen <[email protected]>
----