GitHub user GaryLeee opened a pull request:

    https://github.com/apache/spark/pull/18278

    Branch 2.2

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18278.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18278
    
----
commit 6cd2f16b155ce42d8e379de5ce6ced7804fbde92
Author: Takeshi Yamamuro <[email protected]>
Date:   2017-04-21T02:40:21Z

    [SPARK-20281][SQL] Print the identical Range parameters of SparkContext 
APIs and SQL in explain
    
    ## What changes were proposed in this pull request?
    This PR modified the code to print identical `Range` parameters for the SparkContext APIs and SQL in `explain` output. In the current master, both internally use `defaultParallelism` for `splits` by default, yet they print different strings in the explain output:
    
    ```
    scala> spark.range(4).explain
    == Physical Plan ==
    *Range (0, 4, step=1, splits=Some(8))
    
    scala> sql("select * from range(4)").explain
    == Physical Plan ==
    *Range (0, 4, step=1, splits=None)
    ```
    
    ## How was this patch tested?
    Added tests in `SQLQuerySuite` and modified some results in the existing 
tests.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17670 from maropu/SPARK-20281.
    
    (cherry picked from commit 48d760d028dd73371f99d084c4195dbc4dda5267)
    Signed-off-by: Xiao Li <[email protected]>

commit cddb4b7db81b01b4abf2ab683aba97e4eabb9769
Author: Herman van Hovell <[email protected]>
Date:   2017-04-21T07:05:03Z

    [SPARK-20420][SQL] Add events to the external catalog
    
    ## What changes were proposed in this pull request?
    It is often useful to be able to track changes to the `ExternalCatalog`. 
This PR makes the `ExternalCatalog` emit events when a catalog object is 
changed. Events are fired before and after the change.
    
    The following events are fired per object:
    
    - Database
      - CreateDatabasePreEvent: event fired before the database is created.
      - CreateDatabaseEvent: event fired after the database has been created.
      - DropDatabasePreEvent: event fired before the database is dropped.
      - DropDatabaseEvent: event fired after the database has been dropped.
    - Table
      - CreateTablePreEvent: event fired before the table is created.
      - CreateTableEvent: event fired after the table has been created.
      - RenameTablePreEvent: event fired before the table is renamed.
      - RenameTableEvent: event fired after the table has been renamed.
      - DropTablePreEvent: event fired before the table is dropped.
      - DropTableEvent: event fired after the table has been dropped.
    - Function
      - CreateFunctionPreEvent: event fired before the function is created.
      - CreateFunctionEvent: event fired after the function has been created.
      - RenameFunctionPreEvent: event fired before the function is renamed.
      - RenameFunctionEvent: event fired after the function has been renamed.
      - DropFunctionPreEvent: event fired before the function is dropped.
      - DropFunctionEvent: event fired after the function has been dropped.
    
    The events currently only contain the names of the modified objects. We can add more events, and more details, at a later point.
    
    A user can monitor changes to the external catalog by adding a listener to the Spark listener bus and checking for `ExternalCatalogEvent`s via the `SparkListener.onOtherEvent` hook, as sketched below. A more direct approach is to add a listener directly to the `ExternalCatalog`.
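
    A minimal, hedged sketch of the listener-bus approach (the import path for `ExternalCatalogEvent` is assumed here):
    ```scala
    import org.apache.spark.scheduler.{SparkListener, SparkListenerEvent}
    import org.apache.spark.sql.catalyst.catalog.ExternalCatalogEvent

    // Log every external-catalog change observed on the listener bus.
    class CatalogChangeLogger extends SparkListener {
      override def onOtherEvent(event: SparkListenerEvent): Unit = event match {
        case e: ExternalCatalogEvent => println(s"Catalog changed: $e")
        case _ => // ignore unrelated events
      }
    }

    // spark.sparkContext.addSparkListener(new CatalogChangeLogger())
    ```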
    
    ## How was this patch tested?
    Added the `ExternalCatalogEventSuite`.
    
    Author: Herman van Hovell <[email protected]>
    
    Closes #17710 from hvanhovell/SPARK-20420.
    
    (cherry picked from commit e2b3d2367a563d4600d8d87b5317e71135c362f0)
    Signed-off-by: Reynold Xin <[email protected]>

commit eb4d097c3c73d1aaf4cd9e17193a6b06ba273429
Author: Hervé <[email protected]>
Date:   2017-04-21T07:52:18Z

    Small rewording about history server use case
    
    Hello
    PR #10991 removed the built-in history view from Spark Standalone, so the history server is no longer useful only for YARN or Mesos.
    
    Author: Hervé <[email protected]>
    
    Closes #17709 from dud225/patch-1.
    
    (cherry picked from commit 34767997e0c6cb28e1fac8cb650fa3511f260ca5)
    Signed-off-by: Sean Owen <[email protected]>

commit aaeca8bdd4bbbad5a14e1030e1d7ecf4836e8a5d
Author: Juliusz Sompolski <[email protected]>
Date:   2017-04-21T14:11:24Z

    [SPARK-20412] Throw ParseException from visitNonOptionalPartitionSpec 
instead of returning null values.
    
    ## What changes were proposed in this pull request?
    
    If a partitionSpec is not supposed to contain optional values, a ParseException should be thrown rather than returning nulls.
    The nulls can later cause NullPointerExceptions in places that do not expect them.
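
    A minimal sketch of the rule, using a plain exception as a stand-in for Spark's `ParseException` (the helper name is illustrative):
    ```scala
    // Every key in a non-optional partition spec must be bound to a value; never return null.
    def toNonOptionalSpec(spec: Map[String, Option[String]]): Map[String, String] =
      spec.map {
        case (key, Some(value)) => key -> value
        case (key, None)        => throw new IllegalArgumentException(s"Found an empty partition key '$key'.")
      }
    ```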
    
    ## How was this patch tested?
    
    A query like "SHOW PARTITIONS tbl PARTITION(col1='val1', col2)" used to 
throw a NullPointerException.
    Now it throws a ParseException.
    
    Author: Juliusz Sompolski <[email protected]>
    
    Closes #17707 from juliuszsompolski/SPARK-20412.
    
    (cherry picked from commit c9e6035e1fb825d280eaec3bdfc1e4d362897ffd)
    Signed-off-by: Wenchen Fan <[email protected]>

commit adaa3f7e027338522e8a71ea40b3237d5889a30d
Author: Kazuaki Ishizaki <[email protected]>
Date:   2017-04-21T14:25:35Z

    [SPARK-20341][SQL] Support BigInt's value that does not fit in long value 
range
    
    ## What changes were proposed in this pull request?
    
    This PR avoids an exception in the case where `scala.math.BigInt` has a value that does not fit into the long value range (e.g. `Long.MAX_VALUE + 1`). When we run the following code with the current Spark, the exception below is thrown.
    
    This PR keeps the value as a `BigDecimal` if we detect such an overflow case by catching `ArithmeticException`.
    
    Sample program:
    ```
    case class BigIntWrapper(value:scala.math.BigInt)

    spark.createDataset(BigIntWrapper(scala.math.BigInt("10000000000000000002"))::Nil).show
    ```
    Exception:
    ```
    Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
    staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
    java.lang.RuntimeException: Error while encoding: java.lang.ArithmeticException: BigInteger out of long range
    staticinvoke(class org.apache.spark.sql.types.Decimal$, DecimalType(38,0), apply, assertnotnull(assertnotnull(input[0, org.apache.spark.sql.BigIntWrapper, true])).value, true) AS value#0
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:290)
        at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
        at org.apache.spark.sql.SparkSession$$anonfun$2.apply(SparkSession.scala:454)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:234)
        at scala.collection.immutable.List.foreach(List.scala:381)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:234)
        at scala.collection.immutable.List.map(List.scala:285)
        at org.apache.spark.sql.SparkSession.createDataset(SparkSession.scala:454)
        at org.apache.spark.sql.Agg$$anonfun$18.apply$mcV$sp(MySuite.scala:192)
        at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
        at org.apache.spark.sql.Agg$$anonfun$18.apply(MySuite.scala:192)
        at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
        at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
        at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
        at org.scalatest.Transformer.apply(Transformer.scala:22)
        at org.scalatest.Transformer.apply(Transformer.scala:20)
        at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
        at org.apache.spark.SparkFunSuite.withFixture(SparkFunSuite.scala:68)
        at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
        at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
        at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    ...
    Caused by: java.lang.ArithmeticException: BigInteger out of long range
        at java.math.BigInteger.longValueExact(BigInteger.java:4531)
        at org.apache.spark.sql.types.Decimal.set(Decimal.scala:140)
        at org.apache.spark.sql.types.Decimal$.apply(Decimal.scala:434)
        at org.apache.spark.sql.types.Decimal.apply(Decimal.scala)
        at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
        at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder.toRow(ExpressionEncoder.scala:287)
        ... 59 more
    ```
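
    A minimal sketch of the overflow-handling idea described above, not the actual `Decimal` code (names are illustrative):
    ```scala
    import java.math.{BigDecimal => JBigDecimal}

    // Try the compact long-based representation first; fall back to BigDecimal on overflow.
    def toDecimalValue(v: scala.math.BigInt): Either[Long, JBigDecimal] =
      try Left(v.bigInteger.longValueExact())
      catch { case _: ArithmeticException => Right(new JBigDecimal(v.bigInteger)) }
    ```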
    
    ## How was this patch tested?
    
    Added a new test suite to `DecimalSuite`.
    
    Author: Kazuaki Ishizaki <[email protected]>
    
    Closes #17684 from kiszk/SPARK-20341.
    
    (cherry picked from commit a750a595976791cb8a77063f690ea8f82ea75a8f)
    Signed-off-by: Wenchen Fan <[email protected]>

commit ff1f989f29c08bb5297f3aa35f30ff06e0cb8046
Author: WeichenXu <[email protected]>
Date:   2017-04-21T17:58:13Z

    [SPARK-20423][ML] fix MLOR coeffs centering when reg == 0
    
    ## What changes were proposed in this pull request?
    
    When reg == 0, MLOR has multiple solutions and we need to center the coefficients to get an identical result. However, the current implementation centers the `coefficientMatrix` by the global mean of all coefficients.
    
    In fact, the `coefficientMatrix` should be centered on each feature index separately, because, according to the MLOR probability distribution function, it can easily be proven that if `{ w0, w1, ..., w(K-1) }` make up the `coefficientMatrix`, then `{ w0 + c, w1 + c, ..., w(K-1) + c }` is an equivalent solution, where `c` is an arbitrary vector of `numFeatures` dimensions.
    Reference: https://core.ac.uk/download/pdf/6287975.pdf
    
    So we need to center the `coefficientMatrix` on each feature dimension separately.
    
    **We can also confirm this with the R library `glmnet`: when reg == 0, MLOR in `glmnet` always generates coefficients whose sum along each feature dimension is `zero`.**
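
    A small sketch of the per-feature centering described above (plain Scala arrays; `coefficients(k)(j)` is assumed to be the weight of class k for feature j):
    ```scala
    def centerPerFeature(coefficients: Array[Array[Double]]): Array[Array[Double]] = {
      val numClasses = coefficients.length
      val numFeatures = coefficients.head.length
      // Mean of each feature column across all classes.
      val featureMeans = Array.tabulate(numFeatures)(j => coefficients.map(_(j)).sum / numClasses)
      // Subtract the per-feature mean from every class's coefficient vector.
      coefficients.map(row => Array.tabulate(numFeatures)(j => row(j) - featureMeans(j)))
    }
    ```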
    
    ## How was this patch tested?
    
    Tests added.
    
    Author: WeichenXu <[email protected]>
    
    Closes #17706 from WeichenXu123/mlor_center.
    
    (cherry picked from commit eb00378f0eed6afbf328ae6cd541cc202d14c1f0)
    Signed-off-by: DB Tsai <[email protected]>

commit 6c2489c66682fdc6a886346ed980d95e6e5eefde
Author: 郭小龙 10207633 <[email protected]>
Date:   2017-04-21T19:08:26Z

    [SPARK-20401][DOC] In the official Spark configuration documentation, the 'spark.driver.supervise' configuration parameter specification and default value are necessary.
    
    ## What changes were proposed in this pull request?
    Submit the Spark job via the REST interface, e.g.:
    curl -X POST http://10.43.183.120:6066/v1/submissions/create --header "Content-Type:application/json;charset=UTF-8" --data '{
        "action": "CreateSubmissionRequest",
        "appArgs": [
            "myAppArgument"
        ],
        "appResource": "/home/mr/gxl/test.jar",
        "clientSparkVersion": "2.2.0",
        "environmentVariables": {
            "SPARK_ENV_LOADED": "1"
        },
        "mainClass": "cn.zte.HdfsTest",
        "sparkProperties": {
            "spark.jars": "/home/mr/gxl/test.jar",
            **"spark.driver.supervise": "true",**
            "spark.app.name": "HdfsTest",
            "spark.eventLog.enabled": "false",
            "spark.submit.deployMode": "cluster",
            "spark.master": "spark://10.43.183.120:6066"
        }
    }'
    
    **I want to make sure that the driver is automatically restarted if it fails with a non-zero exit code, but I cannot find the 'spark.driver.supervise' configuration parameter specification and default value in the official Spark documentation.**
    ## How was this patch tested?
    
    manual tests
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17696 from guoxiaolongzte/SPARK-20401.
    
    (cherry picked from commit ad290402aa1d609abf5a2883a6d87fa8bc2bd517)
    Signed-off-by: Sean Owen <[email protected]>

commit d68e0a3a5ec39a3cb4358aacfc2bd1c5d783e51e
Author: eatoncys <[email protected]>
Date:   2017-04-22T11:29:35Z

    [SPARK-20386][SPARK CORE] modify the log info if the block exists on the 
slave already
    
    ## What changes were proposed in this pull request?
    Modify the added memory size to memSize - originalMemSize if the block already exists on the slave, since in that case the added memory size should be memSize - originalMemSize. If originalMemSize is bigger than memSize, then the log should instead report removed memory, with the removed size being originalMemSize - memSize.
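
    A tiny sketch of the logging rule described above (names are illustrative, not the actual BlockManager code):
    ```scala
    def logBlockUpdate(blockId: String, originalMemSize: Long, memSize: Long): Unit = {
      val delta = memSize - originalMemSize
      if (delta >= 0) println(s"Added $blockId in memory (size: $delta)")   // block is new or grew
      else println(s"Removed $blockId in memory (size: ${-delta})")         // block shrank
    }
    ```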
    
    ## How was this patch tested?
    Multiple runs on existing unit tests
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: eatoncys <[email protected]>
    
    Closes #17683 from eatoncys/SPARK-20386.
    
    (cherry picked from commit 05a451491d535c0828413ce2eb06fe94571069ac)
    Signed-off-by: Sean Owen <[email protected]>

commit 807c718925dc4105b0eb176dd5f515a85b8047a2
Author: Takeshi Yamamuro <[email protected]>
Date:   2017-04-22T16:41:58Z

    [SPARK-20430][SQL] Initialise RangeExec parameters in a driver side
    
    ## What changes were proposed in this pull request?
    This PR initialises `RangeExec` parameters on the driver side.
    In the current master, the query below throws a `NullPointerException`:
    ```
    sql("SET spark.sql.codegen.wholeStage=false")
    sql("SELECT * FROM range(1)").show
    
    17/04/20 17:11:05 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
    java.lang.NullPointerException
            at org.apache.spark.sql.execution.SparkPlan.sparkContext(SparkPlan.scala:54)
            at org.apache.spark.sql.execution.RangeExec.numSlices(basicPhysicalOperators.scala:343)
            at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:506)
            at org.apache.spark.sql.execution.RangeExec$$anonfun$20.apply(basicPhysicalOperators.scala:505)
            at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
            at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
            at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
            at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
            at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
            at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
            at org.apache.spark.scheduler.Task.run(Task.scala:108)
            at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:320)
            at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
            at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ```
    
    ## How was this patch tested?
    Added a test in `DataFrameRangeSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17717 from maropu/SPARK-20430.
    
    (cherry picked from commit b3c572a6b332b79fef72c309b9038b3c939dcba2)
    Signed-off-by: Xiao Li <[email protected]>

commit cad33a7301f6e0b40b88789f0a96f9cc7ebf9d6e
Author: 郭小龙 10207633 <[email protected]>
Date:   2017-04-23T12:33:14Z

    [SPARK-20385][WEB-UI] 'Submitted Time' field: the date format needs to be formatted, in the Running Drivers table and Completed Drivers table in the master web UI.
    
    ## What changes were proposed in this pull request?
    The 'Submitted Time' field's date format **needs to be formatted** in the Running Drivers table and Completed Drivers table in the master web UI.
    Before this fix, e.g.:
    
    Completed Drivers
    | Submission ID | **Submitted Time** | Worker | State | Cores | Memory | Main Class |
    |---|---|---|---|---|---|---|
    | driver-20170419145755-0005 | **Wed Apr 19 14:57:55 CST 2017** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |
    
    Please see the attachment: https://issues.apache.org/jira/secure/attachment/12863977/before_fix.png
    
    After this fix, e.g.:
    
    Completed Drivers
    | Submission ID | **Submitted Time** | Worker | State | Cores | Memory | Main Class |
    |---|---|---|---|---|---|---|
    | driver-20170419145755-0006 | **2017/04/19 16:01:25** | worker-20170419145250-zdh120-40412 | FAILED | 1 | 1024.0 MB | cn.zte.HdfsTest |
    
    Please see the attachment: https://issues.apache.org/jira/secure/attachment/12863976/after_fix.png
    
    The 'Submitted Time' field's date format **has already been formatted** in the Running Applications table and Completed Applications table in the master web UI, **which is correct**, e.g.:
    
    Running Applications
    | Application ID | Name | Cores | Memory per Executor | **Submitted Time** | User | State | Duration |
    |---|---|---|---|---|---|---|---|
    | app-20170419160910-0000 (kill) | SparkSQL::10.43.183.120 | 1 | 5.0 GB | **2017/04/19 16:09:10** | root | RUNNING | 53 s |
    
    **The formatted time is easier to read and is consistent with the applications tables, so I think it's worth fixing.**
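
    A minimal sketch of the formatting change, assuming the UI holds the submission time as a `java.util.Date` (the helper name is illustrative):
    ```scala
    import java.text.SimpleDateFormat
    import java.util.Date

    // Render "yyyy/MM/dd HH:mm:ss" instead of relying on Date.toString
    // (which yields e.g. "Wed Apr 19 14:57:55 CST 2017").
    def formatSubmittedTime(d: Date): String =
      new SimpleDateFormat("yyyy/MM/dd HH:mm:ss").format(d)
    ```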
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: 郭小龙 10207633 <[email protected]>
    Author: guoxiaolong <[email protected]>
    Author: guoxiaolongzte <[email protected]>
    
    Closes #17682 from guoxiaolongzte/SPARK-20385.
    
    (cherry picked from commit 2eaf4f3fe3595ae341a3a5ce886b859992dea5b2)
    Signed-off-by: Sean Owen <[email protected]>

commit 2bef01f64b832a94a52c64aba0aecbbb0e7a4003
Author: Xiao Li <[email protected]>
Date:   2017-04-24T09:21:42Z

    [SPARK-20439][SQL] Fix Catalog API listTables and getTable when failed to 
fetch table metadata
    
    ### What changes were proposed in this pull request?
    
    `spark.catalog.listTables` and `spark.catalog.getTable` do not work if we are unable to retrieve the table metadata for any reason (e.g., the table serde class is not accessible or the table type is not accepted by Spark SQL). After this PR, the APIs still return the corresponding Table, just without the description and tableType.
    
    ### How was this patch tested?
    Added a test case
    
    Author: Xiao Li <[email protected]>
    
    Closes #17730 from gatorsmile/listTables.
    
    (cherry picked from commit 776a2c0e91dfea170ea1c489118e1d42c4121f35)
    Signed-off-by: Wenchen Fan <[email protected]>

commit cf16c3250e946c4f89edc999d8764e8fa3dfb056
Author: [email protected] <[email protected]>
Date:   2017-04-24T15:43:06Z

    [SPARK-18901][ML] Require in LR LogisticAggregator is redundant
    
    ## What changes were proposed in this pull request?
    
    In MultivariateOnlineSummarizer,
    
    `add` and `merge` already check the weights and feature sizes. The corresponding checks in LR are therefore redundant and are removed in this PR.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: [email protected] <[email protected]>
    
    Closes #17478 from wangmiao1981/logit.
    
    (cherry picked from commit 90264aced7cfdf265636517b91e5d1324fe60112)
    Signed-off-by: Yanbo Liang <[email protected]>

commit 30149d54cf4eadc843d7c64f3d0b52c21a3f5dda
Author: jerryshao <[email protected]>
Date:   2017-04-25T01:18:59Z

    [SPARK-20239][CORE] Improve HistoryServer's ACL mechanism
    
    ## What changes were proposed in this pull request?
    
    Currently SHS (Spark History Server) has two different ACLs:
    
    * ACL of the base URL. It is controlled by "spark.acls.enabled" or "spark.ui.acls.enabled"; with this enabled, only users configured in "spark.admin.acls" (or group) or "spark.ui.view.acls" (or group), or the user who started SHS, can list all the applications, otherwise none of them can be listed. This also affects the REST APIs that list the summary of all apps and of one app.
    * Per-application ACL. This is controlled by "spark.history.ui.acls.enabled". With this enabled, only the history admin user and the user/group who ran this app can access the details of this app.
    
    With these two ACLs, we may encounter several unexpected behaviors:
    
    1. If the base URL's ACL (`spark.acls.enable`) is enabled but user "A" has no view permission, user "A" cannot see the app list but can still access the details of its own app.
    2. If the ACL of the base URL (`spark.acls.enable`) is disabled, then user "A" can download any application's event log, even if it was not run by user "A".
    3. Changes to the Live UI's ACL affect the History UI's ACL, since they share the same conf file.
    
    The unexpected behaviors arise mainly because we have two different ACLs; ideally we should have only one to manage them all.
    
    So, to improve SHS's ACL mechanism, this PR proposes to:
    
    1. Disable "spark.acls.enable" and only use "spark.history.ui.acls.enable" for the history server (see the configuration sketch below).
    2. Check permissions for the event-log download REST API.
    
    With this PR:
    
    1. An admin user can see and download the list of all applications, as well as application details.
    2. A normal user can see the list of all applications, but can only download and check the details of applications accessible to them.
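
    A minimal configuration sketch under these semantics (keys as named above; the admin-acls key and values are illustrative):
    ```scala
    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .set("spark.history.ui.acls.enable", "true")   // per-application ACL check used by the history server
      .set("spark.history.ui.admin.acls", "admin")   // assumed admin user for the history UI
    // "spark.acls.enable" is no longer consulted by the history server after this change.
    ```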
    
    ## How was this patch tested?
    
    New UTs are added, also verified in real cluster.
    
    CC tgravescs vanzin, please help to review; this PR changes the semantics you implemented previously. Thanks a lot.
    
    Author: jerryshao <[email protected]>
    
    Closes #17582 from jerryshao/SPARK-20239.
    
    (cherry picked from commit 5280d93e6ecec7327e7fcd3d8d1cb90e01e774fc)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit fb59a195428597f50c599fff0c6521604a454400
Author: Sameer Agarwal <[email protected]>
Date:   2017-04-25T05:05:20Z

    [SPARK-20451] Filter out nested mapType datatypes from sort order in 
randomSplit
    
    ## What changes were proposed in this pull request?
    
    In `randomSplit`, it is possible that the underlying dataset doesn't guarantee the ordering of rows in its constituent partitions each time a split is materialized, which could result in overlapping splits.
    
    To prevent this, as part of SPARK-12662, we explicitly sort each input partition to make the ordering deterministic. Given that `MapType`s cannot be sorted, this patch explicitly prunes them out of the sort order. Additionally, if the resulting sort order is empty, this patch materializes the dataset to guarantee determinism.
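
    A small sketch of the pruning rule (a column is left out of the sort order if a MapType appears anywhere in its type); this is an illustration, not the patch's exact code:
    ```scala
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types._

    def containsMap(dt: DataType): Boolean = dt match {
      case _: MapType         => true
      case ArrayType(et, _)   => containsMap(et)
      case StructType(fields) => fields.exists(f => containsMap(f.dataType))
      case _                  => false
    }

    // sortOrder = df.schema.fields.filterNot(f => containsMap(f.dataType)).map(f => col(f.name))
    ```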
    
    ## How was this patch tested?
    
    Extended `randomSplit on reordered partitions` in `DataFrameStatSuite` to also test dataframes with mapTypes and nested mapTypes.
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #17751 from sameeragarwal/randomsplit2.
    
    (cherry picked from commit 31345fde82ada1f8bb12807b250b04726a1f6aa6)
    Signed-off-by: Wenchen Fan <[email protected]>

commit c18de9c045aaf7d17113f87a6b2146811b4af0eb
Author: Armin Braun <[email protected]>
Date:   2017-04-25T08:13:50Z

    [SPARK-20455][DOCS] Fix Broken Docker IT Docs
    
    ## What changes were proposed in this pull request?
    
    Just added the Maven `test` goal.
    
    ## How was this patch tested?
    
    No test needed, just a trivial documentation fix.
    
    Author: Armin Braun <[email protected]>
    
    Closes #17756 from original-brownbear/SPARK-20455.
    
    (cherry picked from commit c8f1219510f469935aa9ff0b1c92cfe20372377c)
    Signed-off-by: Sean Owen <[email protected]>

commit b62ebd91bb2c64e1ecef0f2d97db91f5ce32743b
Author: Sergey Zhemzhitsky <[email protected]>
Date:   2017-04-25T08:18:36Z

    [SPARK-20404][CORE] Using Option(name) instead of Some(name)
    
    Using Option(name) instead of Some(name) to prevent runtime failures when 
using accumulators created like the following
    ```
    sparkContext.accumulator(0, null)
    ```
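
    A quick illustration of why `Option(name)` is the safer wrapper for a possibly-null name:
    ```scala
    Some(null)    // Some(null): the null survives and can blow up later when the name is used
    Option(null)  // None: the null is absorbed safely
    ```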
    
    Author: Sergey Zhemzhitsky <[email protected]>
    
    Closes #17740 from szhem/SPARK-20404-null-acc-names.
    
    (cherry picked from commit 0bc7a90210aad9025c1e1bdc99f8e723c1bf0fbf)
    Signed-off-by: Sean Owen <[email protected]>

commit e2591c6d74081e9edad2e8982c0125a4f1d21437
Author: wangmiao1981 <[email protected]>
Date:   2017-04-25T08:30:36Z

    [SPARK-18901][FOLLOWUP][ML] Require in LR LogisticAggregator is redundant
    
    ## What changes were proposed in this pull request?
    
    This is a follow-up PR of #17478.
    
    ## How was this patch tested?
    
    Existing tests
    
    Author: wangmiao1981 <[email protected]>
    
    Closes #17754 from wangmiao1981/followup.
    
    (cherry picked from commit 387565cf14b490810f9479ff3adbf776e2edecdc)
    Signed-off-by: Yanbo Liang <[email protected]>

commit 55834a898547b00bb8de1891fd061651f941aa0b
Author: Yanbo Liang <[email protected]>
Date:   2017-04-25T17:10:41Z

    [SPARK-20449][ML] Upgrade breeze version to 0.13.1
    
    ## What changes were proposed in this pull request?
    Upgrade the breeze version to 0.13.1, which fixes some critical bugs in L-BFGS-B.
    
    ## How was this patch tested?
    Existing unit tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17746 from yanboliang/spark-20449.
    
    (cherry picked from commit 67eef47acfd26f1f0be3e8ef10453514f3655f62)
    Signed-off-by: DB Tsai <[email protected]>

commit f971ce5dd0788fe7f5d2ca820b9ea3db72033ddc
Author: ding <[email protected]>
Date:   2017-04-25T18:20:32Z

    [SPARK-5484][GRAPHX] Periodically do checkpoint in Pregel
    
    ## What changes were proposed in this pull request?
    
    Pregel-based iterative algorithms with more than ~50 iterations begin to 
slow down and eventually fail with a StackOverflowError due to Spark's lack of 
support for long lineage chains.
    
    This PR causes Pregel to checkpoint the graph periodically if the checkpoint directory is set.
    It also moves PeriodicGraphCheckpointer.scala from mllib to graphx, and moves PeriodicRDDCheckpointer.scala and PeriodicCheckpointer.scala from mllib to core.
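
    A minimal usage sketch, assuming `sc` is the active SparkContext (the interval config key named here is an assumption):
    ```scala
    // With a checkpoint directory set, Pregel can periodically truncate the graph's lineage.
    sc.setCheckpointDir("/tmp/spark-checkpoints")
    // The checkpoint interval is assumed to be configurable via "spark.graphx.pregel.checkpointInterval"
    // on the SparkConf used to create the context.
    ```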
    ## How was this patch tested?
    
    unit tests, manual tests
    
    Author: ding <[email protected]>
    Author: dding3 <[email protected]>
    Author: Michael Allman <[email protected]>
    
    Closes #15125 from dding3/cp2_pregel.
    
    (cherry picked from commit 0a7f5f2798b6e8b2ba15e8b3aa07d5953ad1c695)
    Signed-off-by: Felix Cheung <[email protected]>

commit f0de600797ff4883927d0c70732675fd8629e239
Author: Sameer Agarwal <[email protected]>
Date:   2017-04-26T00:05:20Z

    [SPARK-18127] Add hooks and extension points to Spark
    
    ## What changes were proposed in this pull request?
    
    This patch adds support for customizing the Spark session by injecting user-defined custom extensions. This allows a user to add custom analyzer rules/checks, optimizer rules, planning strategies, or even a customized parser.
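
    A hedged sketch of the extension point (the builder and inject method names reflect my reading of this patch; the rule body is a no-op placeholder):
    ```scala
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.catalyst.plans.logical.LogicalPlan
    import org.apache.spark.sql.catalyst.rules.Rule

    case class NoopRule(spark: SparkSession) extends Rule[LogicalPlan] {
      override def apply(plan: LogicalPlan): LogicalPlan = plan  // a real rule would rewrite the plan here
    }

    val spark = SparkSession.builder()
      .master("local[*]")
      .withExtensions(ext => ext.injectOptimizerRule(session => NoopRule(session)))
      .getOrCreate()
    ```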
    
    ## How was this patch tested?
    
    Unit Tests in SparkSessionExtensionSuite
    
    Author: Sameer Agarwal <[email protected]>
    
    Closes #17724 from sameeragarwal/session-extensions.
    
    (cherry picked from commit caf392025ce21d701b503112060fa016d5eabe04)
    Signed-off-by: Xiao Li <[email protected]>

commit c8803c06854683c8761fdb3c0e4c55d5a9e22a95
Author: Eric Wasserman <[email protected]>
Date:   2017-04-26T03:42:43Z

    [SPARK-16548][SQL] Inconsistent error handling in JSON parsing SQL functions
    
    ## What changes were proposed in this pull request?
    
    Change to using Jackson's `com.fasterxml.jackson.core.JsonFactory` method:
    
        public JsonParser createParser(String content)
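
    A one-line usage illustration (not taken from the patch itself):
    ```scala
    import com.fasterxml.jackson.core.JsonFactory

    // Parse directly from the String content, letting Jackson handle character decoding consistently.
    val parser = new JsonFactory().createParser("""{"a": 1}""")
    ```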
    
    ## How was this patch tested?
    
    existing unit tests
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: Eric Wasserman <[email protected]>
    
    Closes #17693 from ewasserman/SPARK-20314.
    
    (cherry picked from commit 57e1da39464131329318b723caa54df9f55fa54f)
    Signed-off-by: Wenchen Fan <[email protected]>

commit a2f5ced3236db665bb33adc1bf1f90553997f46b
Author: anabranch <[email protected]>
Date:   2017-04-26T08:49:05Z

    [SPARK-20400][DOCS] Remove References to 3rd Party Vendor Tools
    
    ## What changes were proposed in this pull request?
    
    Simple documentation change to remove explicit vendor references.
    
    ## How was this patch tested?
    
    NA
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.
    
    Author: anabranch <[email protected]>
    
    Closes #17695 from anabranch/remove-vendor.
    
    (cherry picked from commit 7a365257e934e838bd90f6a0c50362bf47202b0e)
    Signed-off-by: Sean Owen <[email protected]>

commit 612952251c5ac626e256bc2ab9414faf1662dde9
Author: Tom Graves <[email protected]>
Date:   2017-04-26T13:23:31Z

    [SPARK-19812] YARN shuffle service fails to relocate recovery DB across NFS directories
    
    ## What changes were proposed in this pull request?
    
    Change from using java Files.move to using Hadoop filesystem operations to move the directories. Java's Files.move does not work when moving directories across NFS mounts, and its documentation in fact says that if the directory has entries you should do a recursive move. We are already using the Hadoop filesystem here, so just use the local filesystem from there, as it handles this properly.
    
    Note that the DB here is actually a directory of files and not just a 
single file, hence the change in the name of the local var.
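
    A hedged sketch of moving the recovery DB directory with the Hadoop local filesystem (paths are illustrative; whether the actual patch uses rename or a copy-based move is not shown here):
    ```scala
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}

    val fs = FileSystem.getLocal(new Configuration())
    // The Hadoop local filesystem handles moves that a plain java.nio rename cannot (see the description above).
    fs.rename(new Path("/old-nfs-dir/registeredExecutors.ldb"),
              new Path("/new-nfs-dir/registeredExecutors.ldb"))
    ```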
    
    ## How was this patch tested?
    
    Ran the YarnShuffleServiceSuite unit tests. Unfortunately I couldn't easily add one here since it involves NFS.
    Ran manual tests to verify that the DB directories were properly moved across NFS-mounted directories. This has been running internally for weeks.
    
    Author: Tom Graves <[email protected]>
    
    Closes #17748 from tgravescs/SPARK-19812.
    
    (cherry picked from commit 7fecf5130163df9c204a2764d121a7011d007f4e)
    Signed-off-by: Tom Graves <[email protected]>

commit 34dec68d7eb647d997fdb27fe65d579c74b39e58
Author: Yanbo Liang <[email protected]>
Date:   2017-04-26T13:34:18Z

    [MINOR][ML] Fix some PySpark & SparkR flaky tests
    
    ## What changes were proposed in this pull request?
    Some PySpark & SparkR tests run with a tiny dataset and a tiny `maxIter`, which means they have not converged. I don't think checking intermediate results during iteration makes sense, and these intermediate results may be fragile and unstable, so we should switch to checking the converged result. We hit this issue at #17746 when we upgraded breeze to 0.13.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17757 from yanboliang/flaky-test.
    
    (cherry picked from commit dbb06c689c157502cb081421baecce411832aad8)
    Signed-off-by: Yanbo Liang <[email protected]>

commit b65858bb3cb8e69b1f73f5f2c76a7cd335120695
Author: jerryshao <[email protected]>
Date:   2017-04-26T14:01:50Z

    [SPARK-20391][CORE] Rename memory-related fields in ExecutorSummary
    
    ## What changes were proposed in this pull request?
    
    This is a follow-up of #14617 to make the names of memory-related fields more meaningful.
    
    Here, for backward compatibility, I didn't change the `maxMemory` and `memoryUsed` fields.
    
    ## How was this patch tested?
    
    Existing UT and local verification.
    
    CC squito and tgravescs .
    
    Author: jerryshao <[email protected]>
    
    Closes #17700 from jerryshao/SPARK-20391.
    
    (cherry picked from commit 66dd5b83ff95d5f91f37dcdf6aac89faa0b871c5)
    Signed-off-by: Imran Rashid <[email protected]>

commit 6709bcf6e66e99e17ba2a3b1482df2dba1a15716
Author: Michal Szafranski <[email protected]>
Date:   2017-04-26T18:21:25Z

    [SPARK-20473] Enabling missing types in ColumnVector.Array
    
    ## What changes were proposed in this pull request?
    ColumnVector implementations originally did not support some Catalyst types (float, short, and boolean). Now that they do, those types should also be added to ColumnVector.Array.
    
    ## How was this patch tested?
    Tested using existing unit tests.
    
    Author: Michal Szafranski <[email protected]>
    
    Closes #17772 from michal-databricks/spark-20473.
    
    (cherry picked from commit 99c6cf9ef16bf8fae6edb23a62e46546a16bca80)
    Signed-off-by: Reynold Xin <[email protected]>

commit e278876ba3d66d3fb249df59c3de8d78ca25c5f0
Author: Michal Szafranski <[email protected]>
Date:   2017-04-26T19:47:37Z

    [SPARK-20474] Fixing OnHeapColumnVector reallocation
    
    ## What changes were proposed in this pull request?
    OnHeapColumnVector reallocation copies data to the new storage only up to 'elementsAppended'. This variable is only updated when using the ColumnVector.appendX API, while ColumnVector.putX is more commonly used.
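
    A toy sketch of the reallocation pitfall (plain arrays, not the actual ColumnVector code): values written with put-style calls are not counted by `elementsAppended`, so the copy must cover the whole old storage.
    ```scala
    def grow(old: Array[Int], newCapacity: Int): Array[Int] = {
      val fresh = new Array[Int](newCapacity)
      System.arraycopy(old, 0, fresh, 0, old.length)  // copy everything, not just `elementsAppended` entries
      fresh
    }
    ```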
    
    ## How was this patch tested?
    Tested using existing unit tests.
    
    Author: Michal Szafranski <[email protected]>
    
    Closes #17773 from michal-databricks/spark-20474.
    
    (cherry picked from commit a277ae80a2836e6533b338d2b9c4e59ed8a1daae)
    Signed-off-by: Reynold Xin <[email protected]>

commit b48bb3ab2c8134f6b533af29a241dce114076720
Author: Weiqing Yang <[email protected]>
Date:   2017-04-26T20:54:40Z

    [SPARK-12868][SQL] Allow adding jars from hdfs
    
    ## What changes were proposed in this pull request?
    Spark 2.2 is going to be cut; it'll be great if SPARK-12868 can be resolved before that. There have been several PRs for this, like [PR#16324](https://github.com/apache/spark/pull/16324), but all of them have been inactive for a long time or have been closed.
    
    This PR adds a SparkUrlStreamHandlerFactory, which relies on the 'protocol' to choose the appropriate UrlStreamHandlerFactory, such as FsUrlStreamHandlerFactory, to create the URLStreamHandler.
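
    A minimal sketch of the delegation idea (the class name and scheme list here are illustrative, not the patch's exact code):
    ```scala
    import java.net.{URLStreamHandler, URLStreamHandlerFactory}
    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.FsUrlStreamHandlerFactory

    class SketchUrlStreamHandlerFactory extends URLStreamHandlerFactory {
      private val fsFactory = new FsUrlStreamHandlerFactory(new Configuration())
      private val fsSchemes = Set("hdfs", "webhdfs")  // assumed set of Hadoop filesystem schemes

      // Delegate Hadoop schemes to the Hadoop factory; return null so the JVM keeps its built-in handlers.
      override def createURLStreamHandler(protocol: String): URLStreamHandler =
        if (fsSchemes.contains(protocol.toLowerCase)) fsFactory.createURLStreamHandler(protocol) else null
    }

    // java.net.URL.setURLStreamHandlerFactory(new SketchUrlStreamHandlerFactory())
    ```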
    
    ## How was this patch tested?
    1. Add a new unit test.
    2. Check manually.
    Before: throws an exception: "failed unknown protocol: hdfs"
    <img width="914" alt="screen shot 2017-03-17 at 9 07 36 pm" src="https://cloud.githubusercontent.com/assets/8546874/24075277/5abe0a7c-0bd5-11e7-900e-ec3d3105da0b.png">
    
    After:
    <img width="1148" alt="screen shot 2017-03-18 at 11 42 18 am" src="https://cloud.githubusercontent.com/assets/8546874/24075283/69382a60-0bd5-11e7-8d30-d9405c3aaaba.png">
    
    Author: Weiqing Yang <[email protected]>
    
    Closes #17342 from weiqingy/SPARK-18910.
    
    (cherry picked from commit 2ba1eba371213d1ac3d1fa1552e5906e043c2ee4)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit d6efda512e9d40e0a51c03675477bfb20c6bc7ae
Author: Mark Grover <[email protected]>
Date:   2017-04-27T00:06:21Z

    [SPARK-20435][CORE] More thorough redaction of sensitive information
    
    This change does a more thorough redaction of sensitive information from logs and the UI, and adds unit tests that ensure no regressions leak sensitive information into the logs.
    
    The motivation for this change was the appearance of passwords in `SparkListenerEnvironmentUpdate` in event logs under some JVM configurations, like so:
    `"sun.java.command":"org.apache.spark.deploy.SparkSubmit ... --conf spark.executorEnv.HADOOP_CREDSTORE_PASSWORD=secret_password ..."`
    
    Previously, the redaction logic only checked whether the key matched the secret regex pattern and, if so, redacted its value. That worked for most cases; however, in the case above, the key (sun.java.command) doesn't tell much, so the value needs to be searched as well. This PR expands the check to also match values.
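
    A small sketch of the value-level redaction described above (the regex and replacement text are illustrative):
    ```scala
    val redactionPattern = "(?i)secret|password|token".r

    def redact(kvs: Seq[(String, String)]): Seq[(String, String)] = kvs.map { case (k, v) =>
      // Redact when either the key or the value matches the sensitive pattern.
      if (redactionPattern.findFirstIn(k).isDefined || redactionPattern.findFirstIn(v).isDefined)
        (k, "*********(redacted)")
      else (k, v)
    }
    ```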
    
    ## How was this patch tested?
    
    New unit tests were added that ensure no sensitive information is present in the event logs or the YARN logs. An old unit test in UtilsSuite was modified because it asserted that a non-sensitive property's value won't be redacted; however, the non-sensitive value had the literal "secret" in it, which caused it to be redacted. Simply updating the non-sensitive property's value to another arbitrary value (one without "secret" in it) fixed it.
    
    Author: Mark Grover <[email protected]>
    
    Closes #17725 from markgrover/spark-20435.
    
    (cherry picked from commit 66636ef0b046e5d1f340c3b8153d7213fa9d19c7)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 8ccb4a57c82146c1a8f8966c7e64010cf5632cb6
Author: Patrick Wendell <[email protected]>
Date:   2017-04-27T00:32:19Z

    Preparing Spark release v2.2.0-rc1

----

