GitHub user yang0228 opened a pull request:
https://github.com/apache/spark/pull/12408
[SPARK-14622] Retain lost executors status
## What changes were proposed in this pull request?
Retain history info for lost executors in the "Executors" dashboard of the
Spark Web UI.
## How was this patch tested?
Unit tests, plus manual inspection of the Web UI (screenshot attached).
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/12408.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #12408
----
commit ac1b8b302a92678bbeece6e9c7879f1cb8fdad12
Author: Nishkam Ravi <[email protected]>
Date: 2016-03-31T19:03:05Z
[SPARK-13796] Redirect error message to logWarning
## What changes were proposed in this pull request?
Redirect error message to logWarning
## How was this patch tested?
Unit tests, manual tests
JoshRosen
Author: Nishkam Ravi <[email protected]>
Closes #12052 from nishkamravi2/master_warning.
commit 446c45bd87035e20653394fcaf9dc8caa4299038
Author: gatorsmile <[email protected]>
Date: 2016-03-31T19:03:55Z
[SPARK-14182][SQL] Parse DDL Command: Alter View
This PR is to provide native parsing support for the DDL command `Alter
View`. Since its AST trees are highly similar to `Alter Table`, both
implementations are integrated into the same one.
Based on the Hive DDL document:
https://cwiki.apache.org/confluence/display/Hive/LanguageManual+DDL and
https://cwiki.apache.org/confluence/display/Hive/PartitionedViews
**Syntax:**
```SQL
ALTER VIEW view_name RENAME TO new_view_name
```
- to change the name of a view to a different name
**Syntax:**
```SQL
ALTER VIEW view_name SET TBLPROPERTIES ('comment' = new_comment);
```
- to add metadata to a view
**Syntax:**
```SQL
ALTER VIEW view_name UNSET TBLPROPERTIES [IF EXISTS] ('comment', 'key')
```
- to remove metadata from a view
**Syntax:**
```SQL
ALTER VIEW view_name ADD [IF NOT EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to add the partitioning metadata for a view.
- the syntax of partition spec in `ALTER VIEW` is identical to `ALTER
TABLE`, **EXCEPT** that it is **ILLEGAL** to specify a `LOCATION` clause.
**Syntax:**
```SQL
ALTER VIEW view_name DROP [IF EXISTS] PARTITION spec1[, PARTITION spec2, ...]
```
- to drop the related partition metadata for a view.
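As a rough illustration, these statements could be issued from Scala once the
parser lands (view and property names below are made up):
```scala
// Hypothetical view and property names; assumes a SQLContext with the new
// ALTER VIEW parsing support is in scope as `sqlContext`.
sqlContext.sql("ALTER VIEW sales_view RENAME TO sales_view_v2")
sqlContext.sql("ALTER VIEW sales_view_v2 SET TBLPROPERTIES ('comment' = 'weekly sales')")
sqlContext.sql("ALTER VIEW sales_view_v2 UNSET TBLPROPERTIES IF EXISTS ('comment')")
```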
Added the related test cases to `DDLCommandSuite`
Author: gatorsmile <[email protected]>
Author: xiaoli <[email protected]>
Author: Xiao Li <[email protected]>
Closes #11987 from gatorsmile/parseAlterView.
commit 8a333d2da859fd593bda183413630bc3757529c9
Author: jeanlyn <[email protected]>
Date: 2016-03-31T19:04:42Z
[SPARK-14243][CORE] update task metrics when removing blocks
## What changes were proposed in this pull request?
This PR uses `incUpdatedBlockStatuses` to update `updatedBlockStatuses` when
removing blocks, making sure `BlockManager` correctly updates
`updatedBlockStatuses`.
## How was this patch tested?
test("updated block statuses") in BlockManagerSuite.scala
Author: jeanlyn <[email protected]>
Closes #12091 from jeanlyn/updateBlock.
commit 4d93b653f7294698526674950d3dc303691260f8
Author: Michael Gummelt <[email protected]>
Date: 2016-03-31T19:06:16Z
[Docs] Update monitoring.md to accurately describe the history server
It looks like the docs were recently updated to reflect the History
Server's support for incomplete applications, but they still had wording that
suggested only completed applications were viewable. This fixes that.
My editor also introduced several whitespace-removal changes, which I hope
are OK, as text files shouldn't have trailing whitespace. To verify they're
purely whitespace changes, add `&w=1` to the diff URL in your browser. If this
isn't acceptable, let me know and I'll update the PR.
I also didn't think this required a JIRA. Let me know if I should create
one.
Not tested
Author: Michael Gummelt <[email protected]>
Closes #12045 from mgummelt/update-history-docs.
commit 0abee534f0ad9bbe84d8d3d3478ecaa594f1e0f4
Author: Wenchen Fan <[email protected]>
Date: 2016-03-31T19:07:19Z
[SPARK-14069][SQL] Improve SparkStatusTracker to also track executor
information
## What changes were proposed in this pull request?
Track executor information such as host and port, cache size, and running tasks.
TODO: tests
## How was this patch tested?
N/A
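For reference, a minimal sketch of reading the new executor information
(assuming the accessor names below match the final `SparkExecutorInfo` API):
```scala
// Sketch: print per-executor host/port, cache size and running task count.
// Assumes a live SparkContext `sc`; accessor names follow this PR's intent.
sc.statusTracker.getExecutorInfos.foreach { info =>
  println(s"${info.host()}:${info.port()} cacheSize=${info.cacheSize()} " +
    s"runningTasks=${info.numRunningTasks()}")
}
```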
Author: Wenchen Fan <[email protected]>
Closes #11888 from cloud-fan/status-tracker.
commit 10508f36adcb74a563010636dffcd1f68efd8468
Author: Jo Voordeckers <[email protected]>
Date: 2016-03-31T19:08:10Z
[SPARK-11327][MESOS] Dispatcher does not respect all args from the Submit
request
Supersedes https://github.com/apache/spark/pull/9752
Author: Jo Voordeckers <[email protected]>
Author: Iulian Dragos <[email protected]>
Closes #10370 from jayv/mesos_cluster_params.
commit 3cfbeb70b1feb1f3a8c4d0b2d2f3715a356c80f2
Author: Michel Lemay <[email protected]>
Date: 2016-03-31T19:15:32Z
[SPARK-13710][SHELL][WINDOWS] Fix jline dependency on Windows
## What changes were proposed in this pull request?
Exclude jline from curator-recipes since it conflicts with Scala 2.11 when
running spark-shell. This should not affect Scala 2.10, where jline is built in.
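For reference, the same kind of exclusion expressed in sbt syntax (coordinates
and version are illustrative; the actual change is in the Maven build):
```scala
// sbt sketch: exclude jline from curator-recipes so it does not clash with the
// jline version used by the Scala 2.11 REPL. Version number is illustrative.
libraryDependencies += "org.apache.curator" % "curator-recipes" % "2.4.0" exclude("jline", "jline")
```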
## How was this patch tested?
Ran spark-shell manually.
Author: Michel Lemay <[email protected]>
Closes #12043 from michellemay/spark-13710-fix-jline-on-windows.
commit e785402826dcd984d9312470464714ba6c908a49
Author: Shixiong Zhu <[email protected]>
Date: 2016-03-31T19:17:25Z
[SPARK-14304][SQL][TESTS] Fix tests that don't create temp files in the
`java.io.tmpdir` folder
## What changes were proposed in this pull request?
If I press `CTRL-C` when running these tests, the temp files will be left
in the `sql/core` folder and I need to delete them manually. It's annoying. This
PR just moves the temp files to the `java.io.tmpdir` folder and adds a name
prefix for them.
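The pattern the tests move toward looks roughly like this (a sketch, not the
exact helper used in the patch):
```scala
import java.nio.file.Files

// Sketch: create test output under java.io.tmpdir with a recognizable prefix,
// so files left behind by an interrupted run are easy to find and clean up.
val tempDir = Files.createTempDirectory("spark-test-").toFile
tempDir.deleteOnExit()
println(s"writing test data under ${tempDir.getAbsolutePath}")
```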
## How was this patch tested?
Existing Jenkins tests
Author: Shixiong Zhu <[email protected]>
Closes #12093 from zsxwing/temp-file.
commit b11887c086974dbab18b9f53e99a26bbe06e9c86
Author: sethah <[email protected]>
Date: 2016-03-31T20:00:10Z
[SPARK-14264][PYSPARK][ML] Add feature importance for GBTs in pyspark
## What changes were proposed in this pull request?
Feature importances are exposed in the python API for GBTs.
Other changes:
* Update the random forest feature importance documentation to not repeat the
decision tree docstring and instead place a reference to it.
## How was this patch tested?
Python doc tests were updated to validate GBT feature importance.
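The Python API mirrors what the Scala side already exposes; a rough Scala
equivalent (dataset and column names are illustrative):
```scala
import org.apache.spark.ml.classification.GBTClassifier

// Sketch: fit a small GBT model and inspect per-feature importances.
// Assumes a DataFrame `train` with "features" and "label" columns.
val model = new GBTClassifier().setMaxIter(10).fit(train)
println(model.featureImportances)  // a Vector of normalized importances
```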
Author: sethah <[email protected]>
Closes #12056 from sethah/Pyspark_GBT_feature_importance.
commit a7af6cd2eaf9f6ff491b9e1fabfc9c6f3d0f54bf
Author: Josh Rosen <[email protected]>
Date: 2016-03-31T20:52:59Z
[SPARK-14281][TESTS] Fix java8-tests and simplify their build
This patch fixes a compilation / build break in Spark's `java8-tests` and
refactors their POM to simplify the build. See individual commit messages for
more details.
Author: Josh Rosen <[email protected]>
Closes #12073 from JoshRosen/fix-java8-tests.
commit 8de201baedc8e839e06098c536ba31b3dafd54b5
Author: Sital Kedia <[email protected]>
Date: 2016-03-31T23:06:44Z
[SPARK-14277][CORE] Upgrade Snappy Java to 1.1.2.4
## What changes were proposed in this pull request?
Upgrade snappy to 1.1.2.4 to improve snappy read/write performance.
## How was this patch tested?
Tested by running a job on the cluster; saw 7.5% CPU savings after this
change.
Author: Sital Kedia <[email protected]>
Closes #12096 from sitalkedia/snappyRelease.
commit f0afafdc5dfee80d7e5cd2fc1fa8187def7f262d
Author: Davies Liu <[email protected]>
Date: 2016-03-31T23:40:20Z
[SPARK-14267] [SQL] [PYSPARK] execute multiple Python UDFs within single
batch
## What changes were proposed in this pull request?
This PR supports multiple Python UDFs within a single batch and also improves
performance.
```python
>>> from pyspark.sql.types import IntegerType
>>> sqlContext.registerFunction("double", lambda x: x * 2, IntegerType())
>>> sqlContext.registerFunction("add", lambda x, y: x + y, IntegerType())
>>> sqlContext.sql("SELECT double(add(1, 2)), add(double(2), 1)").explain(True)
== Parsed Logical Plan ==
'Project [unresolvedalias('double('add(1, 2)), None),unresolvedalias('add('double(2), 1), None)]
+- OneRowRelation$
== Analyzed Logical Plan ==
double(add(1, 2)): int, add(double(2), 1): int
Project [double(add(1, 2))#14,add(double(2), 1)#15]
+- Project [double(add(1, 2))#14,add(double(2), 1)#15]
   +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
      +- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
         +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
            +- OneRowRelation$
== Optimized Logical Plan ==
Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
+- EvaluatePython [add(pythonUDF1#17, 1)], [pythonUDF0#18]
   +- EvaluatePython [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- OneRowRelation$
== Physical Plan ==
WholeStageCodegen
:  +- Project [pythonUDF0#16 AS double(add(1, 2))#14,pythonUDF0#18 AS add(double(2), 1)#15]
:     +- INPUT
+- !BatchPythonEvaluation [add(pythonUDF1#17, 1)], [pythonUDF0#16,pythonUDF1#17,pythonUDF0#18]
   +- !BatchPythonEvaluation [double(add(1, 2)),double(2)], [pythonUDF0#16,pythonUDF1#17]
      +- Scan OneRowRelation[]
```
## How was this patch tested?
Added new tests.
The following script was used to benchmark 1, 2 and 3 UDFs:
```
from pyspark.sql import functions as F
from pyspark.sql.types import LongType

# assumes `sqlContext` from the PySpark shell
df = sqlContext.range(1, 1 << 23, 1, 4)
double = F.udf(lambda x: x * 2, LongType())
print df.select(double(df.id)).count()
print df.select(double(df.id), double(df.id + 1)).count()
print df.select(double(df.id), double(df.id + 1), double(df.id + 2)).count()
```
Here are the results:
N | Before | After | speed up
---- |------------ | -------------|------
1 | 22 s | 7 s | 3.1X
2 | 38 s | 13 s | 2.9X
3 | 58 s | 16 s | 3.6X
This benchmark ran locally with 4 CPUs. For 3 UDFs, it launched 12 Python
processes before this patch and 4 processes after this patch. After this patch,
it also uses less memory for multiple UDFs than before (less buffering).
Author: Davies Liu <[email protected]>
Closes #12057 from davies/multi_udfs.
commit 96941b12f8b465df21423275f3cd3ade579b4fa1
Author: Zhang, Liye <[email protected]>
Date: 2016-04-01T03:17:52Z
[SPARK-14242][CORE][NETWORK] avoid copy in compositeBuffer for frame decoder
## What changes were proposed in this pull request?
In this patch, we set the initial `maxNumComponents` to `Integer.MAX_VALUE`
instead of the default size (which is 16) when allocating the `compositeBuffer`
in `TransportFrameDecoder`, because the `compositeBuffer` introduces too many
underlying memory copies with the default `maxNumComponents` when the frame
size is large (which results in many transport messages). For details, please
refer to
[SPARK-14242](https://issues.apache.org/jira/browse/SPARK-14242).
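In Netty terms, the change amounts to something like the following sketch (the
real code lives in `TransportFrameDecoder`):
```scala
import io.netty.buffer.{ByteBufAllocator, CompositeByteBuf}

// Sketch: with the default maxNumComponents (16), appending many components
// forces Netty to consolidate (copy) them into fewer buffers; passing
// Integer.MAX_VALUE avoids those copies for large, many-message frames.
def newFrameBuffer(alloc: ByteBufAllocator): CompositeByteBuf =
  alloc.compositeBuffer(Integer.MAX_VALUE)
```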
## How was this patch tested?
spark unit tests and manual tests.
For manual tests, we can reproduce the performance issue with the following
code:
`sc.parallelize(Array(1,2,3),3).mapPartitions(a=>Array(new Array[Double](1024 * 1024 * 50)).iterator).reduce((a,b)=> a).length`
It's easy to see the performance gain, both from the running time and CPU
usage.
Author: Zhang, Liye <[email protected]>
Closes #12038 from liyezhang556520/spark-14242.
commit 1b070637fa03ab4966f76427b15e433050eaa956
Author: Cheng Lian <[email protected]>
Date: 2016-04-01T06:46:08Z
[SPARK-14295][SPARK-14274][SQL] Implements buildReader() for LibSVM
## What changes were proposed in this pull request?
This PR implements `FileFormat.buildReader()` for the LibSVM data source.
Besides that, a new interface method `prepareRead()` is added to `FileFormat`:
```scala
def prepareRead(
sqlContext: SQLContext,
options: Map[String, String],
files: Seq[FileStatus]): Map[String, String] = options
```
After migrating from `buildInternalScan()` to `buildReader()`, we lost the
opportunity to collect necessary global information, since `buildReader()`
works in a per-partition manner. For example, LibSVM needs to infer the total
number of features if the `numFeatures` data source option is not set. Any
necessary collected global information should be returned using the data source
options map. By default, this method just returns the original options
untouched.
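A sketch of how a data source such as LibSVM might use this hook, as a fragment
inside a `FileFormat` implementation (the `inferNumFeatures` helper below is
hypothetical; "numFeatures" mirrors the LibSVM data source option):
```scala
// Sketch of a FileFormat implementation overriding prepareRead() to stash
// globally inferred information into the options map.
override def prepareRead(
    sqlContext: SQLContext,
    options: Map[String, String],
    files: Seq[FileStatus]): Map[String, String] = {
  if (options.contains("numFeatures")) {
    options
  } else {
    // hypothetical helper that scans the input to count features
    options + ("numFeatures" -> inferNumFeatures(sqlContext, files).toString)
  }
}
```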
An alternative approach is to absorb `inferSchema()` into `prepareRead()`,
since schema inference is also some kind of global information gathering.
However, this approach wasn't chosen because schema inference is optional,
while `prepareRead()` must be called whenever a `HadoopFsRelation` based data
source relation is instantiated.
One unaddressed problem is that, when `numFeatures` is absent, now the
input data will be scanned twice. The `buildInternalScan()` code path doesn't
need to do this because it caches the raw parsed RDD in memory before computing
the total number of features. However, with `FileScanRDD`, the raw parsed RDD
is created in a different way (e.g. partitioning) from the final RDD.
## How was this patch tested?
Tested using existing test suites.
Author: Cheng Lian <[email protected]>
Closes #12088 from liancheng/spark-14295-libsvm-build-reader.
commit 26867ebc67edab97376c5d8fee76df294359e461
Author: Alexander Ulanov <[email protected]>
Date: 2016-04-01T06:48:36Z
[SPARK-11262][ML] Unit test for gradient, loss layers, memory management
for multilayer perceptron
1. Implement LossFunction trait and implement squared error and cross entropy loss with it
2. Implement unit test for gradient and loss
3. Implement InPlace trait and in-place layer evaluation
4. Refactor interface for ActivationFunction
5. Update of Layer and LayerModel interfaces
6. Fix random weights assignment
7. Implement memory allocation by MLP model instead of individual layers
These features decreased the memory usage and increased flexibility of
internal API.
Author: Alexander Ulanov <[email protected]>
Author: avulanov <[email protected]>
Closes #9229 from avulanov/mlp-refactoring.
commit 22249afb4a932a82ff1f7a3befea9fda5a60a3f4
Author: Yanbo Liang <[email protected]>
Date: 2016-04-01T06:49:58Z
[SPARK-14303][ML][SPARKR] Define and use KMeansWrapper for SparkR::kmeans
## What changes were proposed in this pull request?
Define and use ```KMeansWrapper``` for ```SparkR::kmeans```. It is only a
code refactor of the original ```KMeans``` wrapper.
## How was this patch tested?
Existing tests.
cc mengxr
Author: Yanbo Liang <[email protected]>
Closes #12039 from yanboliang/spark-14059.
commit 3715ecdf417b47423ff07145a5623d8d817c45ef
Author: Cheng Lian <[email protected]>
Date: 2016-04-01T09:02:48Z
[SPARK-14295][MLLIB][HOTFIX] Fixes Scala 2.10 compilation failure
## What changes were proposed in this pull request?
Fixes a compilation failure introduced in PR #12088 under Scala 2.10.
## How was this patch tested?
Compilation.
Author: Cheng Lian <[email protected]>
Closes #12107 from liancheng/spark-14295-hotfix.
commit 0b04f8fdf1614308cb3e7e0c7282f7365cc3d1bb
Author: Dilip Biswal <[email protected]>
Date: 2016-04-01T16:27:11Z
[SPARK-14184][SQL] Support native execution of SHOW DATABASE command and
fix SHOW TABLE to use table identifier pattern
## What changes were proposed in this pull request?
This PR addresses the following
1. Supports native execution of SHOW DATABASES command
2. Fixes SHOW TABLES to apply the identifier_with_wildcards pattern if
supplied.
SHOW TABLE syntax
```
SHOW TABLES [IN database_name] ['identifier_with_wildcards'];
```
SHOW DATABASES syntax
```
SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];
```
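For example, invoked from Scala (database name and patterns are made up):
```scala
// Sketch: exercise the new native commands; results come back as DataFrames.
sqlContext.sql("SHOW DATABASES LIKE 'test*'").show()
sqlContext.sql("SHOW TABLES IN default 'sales*'").show()
```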
## How was this patch tested?
Tests added in SQLQuerySuite (both hive and sql contexts) and
DDLCommandSuite
Note: Since the table name pattern was not working, tests are added in both
SQLQuerySuites (hive and sql contexts) to verify the application of the table
pattern.
Author: Dilip Biswal <[email protected]>
Closes #11991 from dilipbiswal/dkb_show_database.
commit a471c7f9eaa59d55dfff5b9d1a858f304a6b3a84
Author: sureshthalamati <[email protected]>
Date: 2016-04-01T16:33:31Z
[SPARK-14133][SQL] Throws exception for unsupported create/drop/alter index,
and lock/unlock operations.
## What changes were proposed in this pull request?
This PR throws an Unsupported Operation exception for create index, drop
index, alter index, lock table, lock database, unlock table, and unlock
database operations that are not supported in Spark SQL. Currently these
operations are executed by Hive.
Error:
spark-sql> drop index my_index on my_table;
Error in query:
Unsupported operation: drop index(line 1, pos 0)
## How was this patch tested?
Added test cases to HiveQuerySuite
yhuai hvanhovell andrewor14
Author: sureshthalamati <[email protected]>
Closes #12069 from sureshthalamati/unsupported_ddl_spark-14133.
commit 58e6bc827f1f9dc1afee07dca1bee1f56553dd20
Author: Dongjoon Hyun <[email protected]>
Date: 2016-04-01T17:36:01Z
[MINOR] [SQL] Update usage of `debug` by removing `typeCheck` and adding
`debugCodegen`
## What changes were proposed in this pull request?
This PR updates the usage comments of `debug` according to the following
commits.
- [SPARK-9754](https://issues.apache.org/jira/browse/SPARK-9754) removed
`typeCheck`.
- [SPARK-14227](https://issues.apache.org/jira/browse/SPARK-14227) added
`debugCodegen`.
## How was this patch tested?
Manual.
Author: Dongjoon Hyun <[email protected]>
Closes #12094 from dongjoon-hyun/minor_fix_debug_usage.
commit 8ba2b7f28fee39c4839e5ea125bd25f5091a3a1e
Author: jerryshao <[email protected]>
Date: 2016-04-01T17:52:13Z
[SPARK-12343][YARN] Simplify Yarn client and client argument
## What changes were proposed in this pull request?
Currently in Spark on YARN, configurations can be passed through SparkConf,
env, and command arguments; some parts are duplicated, such as the client
arguments and SparkConf. So here I propose to simplify the command arguments.
## How was this patch tested?
This patch is tested manually and with unit tests.
CC vanzin tgravescs, please help review this proposal. The original
purpose of this JIRA is to remove `ClientArguments`; through refactoring, some
arguments like `--class` and `--arg` are not so easy to replace, so here I
remove most of the command line arguments and only keep the minimal set.
Author: jerryshao <[email protected]>
Closes #11603 from jerryshao/SPARK-12343.
commit 381358fbe9afbe205299cbbea4c43148e2e69468
Author: Yanbo Liang <[email protected]>
Date: 2016-04-01T19:53:39Z
[SPARK-14305][ML][PYSPARK] PySpark ml.clustering BisectingKMeans support
export/import
## What changes were proposed in this pull request?
PySpark ml.clustering BisectingKMeans support export/import
## How was this patch tested?
doc test.
cc jkbradley
Author: Yanbo Liang <[email protected]>
Closes #12112 from yanboliang/spark-14305.
commit df68beb85de59bb6d35b2a8a3b85dbc447798bf5
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T20:00:55Z
[SPARK-13995][SQL] Extract correct IsNotNull constraints for Expression
## What changes were proposed in this pull request?
JIRA: https://issues.apache.org/jira/browse/SPARK-13995
We currently infer `IsNotNull` constraints from a logical plan's expressions
in `constructIsNotNullConstraints`. However, we don't consider the case of
(nested) `Cast`.
For example:
val tr = LocalRelation('a.int, 'b.long)
val plan = tr.where('a.attr === 'b.attr).analyze
Then, the plan's constraints will have `IsNotNull(Cast(resolveColumn(tr,
"a"), LongType))`, instead of `IsNotNull(resolveColumn(tr, "a"))`. This PR
fixes it.
Besides, as `IsNotNull` constraints are most useful for `Attribute`s, we
should recurse through any `Expression` that is null-intolerant and
construct `IsNotNull` constraints for all `Attribute`s under these expressions.
For example, consider the following constraints:
val df = Seq((1,2,3)).toDF("a", "b", "c")
df.where("a + b = c").queryExecution.analyzed.constraints
The inferred isnotnull constraints should be isnotnull(a), isnotnull(b),
isnotnull(c), instead of isnotnull(a + c) and isnotnull(c).
## How was this patch tested?
Test is added into `ConstraintPropagationSuite`.
Author: Liang-Chi Hsieh <[email protected]>
Closes #11809 from viirya/constraint-cast.
commit a884daad805a701494e87393dc307937472a985d
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T20:03:27Z
[SPARK-14191][SQL] Remove invalid Expand operator constraints
The `Expand` operator currently uses its child plan's constraints as its valid
constraints (i.e., the base of constraints). This is not correct because
`Expand` will set its group-by attributes to null values, so the nullability of
these attributes should be true.
E.g., for an `Expand` operator like:
val input = LocalRelation('a.int, 'b.int, 'c.int).where('c.attr > 10 &&
'a.attr < 5 && 'b.attr > 2)
Expand(
Seq(
Seq('c, Literal.create(null, StringType), 1),
Seq('c, 'a, 2)),
Seq('c, 'a, 'gid.int),
Project(Seq('a, 'c), input))
The `Project` operator has the constraints `IsNotNull('a)`, `IsNotNull('b)`
and `IsNotNull('c)`. But the `Expand` should not have `IsNotNull('a)` in its
constraints.
This PR is the first step for this issue and removes the invalid constraints of
the `Expand` operator.
A test is added to `ConstraintPropagationSuite`.
Author: Liang-Chi Hsieh <[email protected]>
Author: Michael Armbrust <[email protected]>
Closes #11995 from viirya/fix-expand-constraints.
commit 1e886159849e3918445d3fdc3c4cef86c6c1a236
Author: Tejas Patil <[email protected]>
Date: 2016-04-01T20:13:16Z
[SPARK-14070][SQL] Use ORC data source for SQL queries on ORC tables
## What changes were proposed in this pull request?
This patch enables use of OrcRelation for SQL queries which read data from
Hive tables. Changes in this patch:
- Added a new rule `OrcConversions` which would alter the plan to use
`OrcRelation`. In this diff, the conversion is done only for reads.
- Added a new config `spark.sql.hive.convertMetastoreOrc` to control the
conversion
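Turning the conversion on from a Scala shell might look like the following
sketch (`hqlContext` is the HiveContext used in the plans below):
```scala
// Sketch: enable conversion of metastore ORC tables to the ORC data source
// path, then check the plan; `orc_table` is the table from the example below.
hqlContext.setConf("spark.sql.hive.convertMetastoreOrc", "true")
hqlContext.sql("SELECT * FROM orc_table").explain(true)
```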
BEFORE
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#171,value#172]
+- MetastoreRelation default, orc_table, None
== Optimized Logical Plan ==
MetastoreRelation default, orc_table, None
== Physical Plan ==
HiveTableScan [key#171,value#172], MetastoreRelation default, orc_table, None
```
AFTER
```
scala> hqlContext.sql("SELECT * FROM orc_table").explain(true)
== Parsed Logical Plan ==
'Project [unresolvedalias(*, None)]
+- 'UnresolvedRelation `orc_table`, None
== Analyzed Logical Plan ==
key: string, value: string
Project [key#76,value#77]
+- SubqueryAlias orc_table
+- Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Optimized Logical Plan ==
Relation[key#76,value#77] ORC part: struct<>, data: struct<key:string,value:string>
== Physical Plan ==
WholeStageCodegen
: +- Scan ORC part: struct<>, data: struct<key:string,value:string>[key#76,value#77] InputPaths: file:/user/hive/warehouse/orc_table
```
## How was this patch tested?
- Added a new unit test. Ran existing unit tests
- Ran with production like data
## Performance gains
Ran on a production table in Facebook (note that the data was in the DWRF file
format, which is similar to ORC).
Best case: when there were no matching rows for the predicate in the query
(everything is filtered out)
```
                     CPU time      Wall time   Total wall time
                                               across all tasks
================================================================
Without the change   541_515 sec   25.0 mins   165.8 hours
With change          407 sec       1.5 mins    15 mins
```
Average case: A subset of rows in the data match the query predicate
```
                     CPU time      Wall time   Total wall time
                                               across all tasks
================================================================
Without the change   624_630 sec   31.0 mins   199.0 h
With change          14_769 sec    5.3 mins    7.7 h
```
Author: Tejas Patil <[email protected]>
Closes #11891 from tejasapatil/orc_ppd.
commit 1b829ce13990b40fd8d7c9efcc2ae55c4dbc861c
Author: Burak Yavuz <[email protected]>
Date: 2016-04-01T20:19:24Z
[SPARK-14160] Time Windowing functions for Datasets
## What changes were proposed in this pull request?
This PR adds the function `window` as a column expression.
`window` can be used to bucket rows into time windows given a time column.
With this expression, performing time series analysis on batch data, as well as
streaming data, should become much simpler.
### Usage
Assume the following schema:
`sensor_id, measurement, timestamp`
To average 5 minute data every 1 minute (window length of 5 minutes, slide
duration of 1 minute), we will use:
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ),
"sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:00-09:05:00
09:01:00-09:06:00
09:02:00-09:07:00 ...
```
Intervals will start at every `slideDuration` starting at the unix epoch
(1970-01-01 00:00:00 UTC).
To start intervals at a different point of time, e.g. 30 seconds after a
minute, the `startTime` parameter can be used.
```scala
df.groupBy(window("timestamp", â5 minutesâ, â1 minuteâ, "30
second"), "sensor_id")
.agg(mean("measurement").as("avg_meas"))
```
This will generate windows such as:
```
09:00:30-09:05:30
09:01:30-09:06:30
09:02:30-09:07:30 ...
```
Support for Python will be added in a follow-up PR after this.
## How was this patch tested?
This patch has some basic unit tests for the `TimeWindow` expression
testing that the parameters pass validation, and it also has some
unit/integration tests testing the correctness of the windowing and usability
in complex operations (multi-column grouping, multi-column projections, joins).
Author: Burak Yavuz <[email protected]>
Author: Michael Armbrust <[email protected]>
Closes #12008 from brkyvz/df-time-window.
commit 3e991dbc310a4a33eec7f3909adce50bf8268d04
Author: Liang-Chi Hsieh <[email protected]>
Date: 2016-04-01T21:02:32Z
[SPARK-13674] [SQL] Add wholestage codegen support to Sample
JIRA: https://issues.apache.org/jira/browse/SPARK-13674
## What changes were proposed in this pull request?
The Sample operator doesn't support whole-stage codegen now. This PR adds
support for it.
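A quick way to check whether Sample now participates in whole-stage codegen (a
sketch; the row count and fraction are arbitrary):
```scala
// Sketch: with this patch, the explain output for a sampled query should show
// the Sample operator inside a WholeStageCodegen block.
val df = sqlContext.range(0, 1000000)
df.sample(withReplacement = false, fraction = 0.1).explain()
```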
## How was this patch tested?
A test is added into `BenchmarkWholeStageCodegen`. Besides, all existing tests
should pass.
Author: Liang-Chi Hsieh <[email protected]>
Author: Liang-Chi Hsieh <[email protected]>
Closes #11517 from viirya/add-wholestage-sample.
commit bd7b91cefb0d192d808778e6182dcdd2c143e132
Author: zhonghaihua <[email protected]>
Date: 2016-04-01T21:23:14Z
[SPARK-12864][YARN] initialize executorIdCounter after ApplicationMaster
killed for max n…
Currently, when the number of executor failures reaches
`maxNumExecutorFailures`, the `ApplicationMaster` will be killed and another
one re-registered. At that point a new `YarnAllocator` instance is created,
but the value of the property `executorIdCounter` in `YarnAllocator` resets
to `0`, so the IDs of new executors start from `1`. These conflict with the
executors created before, which causes FetchFailedException.
This situation only occurs in yarn-client mode. For more details, [link to
jira issue
SPARK-12864](https://issues.apache.org/jira/browse/SPARK-12864)
This PR introduces a mechanism to initialize `executorIdCounter` after the
`ApplicationMaster` is killed.
Author: zhonghaihua <[email protected]>
Closes #10794 from zhonghaihua/initExecutorIdCounterAfterAMKilled.
commit e41acb757327e3226ffe312766ec759c16616588
Author: Josh Rosen <[email protected]>
Date: 2016-04-01T21:34:59Z
[SPARK-13992] Add support for off-heap caching
This patch adds support for caching blocks in the executor processes using
direct / off-heap memory.
## User-facing changes
**Updated semantics of `OFF_HEAP` storage level**: In Spark 1.x, the
`OFF_HEAP` storage level indicated that an RDD should be cached in Tachyon.
Spark 2.x removed the external block store API that Tachyon caching was based
on (see #10752 / SPARK-12667), so `OFF_HEAP` became an alias for
`MEMORY_ONLY_SER`. As of this patch, `OFF_HEAP` means "serialized and cached in
off-heap memory or on disk". Via the `StorageLevel` constructor, `useOffHeap`
can be set if `serialized == true` and can be used to construct custom storage
levels which support replication.
**Storage UI reporting**: the storage UI will now report whether in-memory
blocks are stored on- or off-heap.
**Only supported by UnifiedMemoryManager**: for simplicity, this feature is
only supported when the default UnifiedMemoryManager is used; applications
which use the legacy memory manager (`spark.memory.useLegacyMode=true`) are not
currently able to allocate off-heap storage memory, so using off-heap caching
will fail with an error when legacy memory management is enabled. Given that we
plan to eventually remove the legacy memory manager, this is not a significant
restriction.
**Memory management policies:** the policies for dividing available memory
between execution and storage are the same for both on- and off-heap memory.
For off-heap memory, the total amount of memory available for use by Spark is
controlled by `spark.memory.offHeap.size`, which is an absolute size. Off-heap
storage memory obeys `spark.memory.storageFraction` in order to control the
amount of unevictable storage memory. For example, if
`spark.memory.offHeap.size` is 1 gigabyte and Spark uses the default
`storageFraction` of 0.5, then up to 500 megabytes of off-heap cached blocks
will be protected from eviction due to execution memory pressure. If necessary,
we can split `spark.memory.storageFraction` into separate on- and off-heap
configurations, but this doesn't seem necessary now and can be done later
without any breaking changes.
**Use of off-heap memory does not imply use of off-heap execution (or
vice-versa)**: for now, the settings controlling the use of off-heap execution
memory (`spark.memory.offHeap.enabled`) and off-heap caching are completely
independent, so Spark SQL can be configured to use off-heap memory for
execution while continuing to cache blocks on-heap. If desired, we can change
this in a follow-up patch so that `spark.memory.offHeap.enabled` affects the
default storage level for cached SQL tables.
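A minimal sketch of opting into off-heap caching under these semantics
(configuration values and the sample RDD are illustrative):
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Sketch: give Spark 1 GB of off-heap memory (requires the default
// UnifiedMemoryManager) and cache an RDD with the new OFF_HEAP semantics.
val conf = new SparkConf()
  .setMaster("local[*]")
  .setAppName("offheap-cache-sketch")
  .set("spark.memory.offHeap.enabled", "true")
  .set("spark.memory.offHeap.size", "1g")
val sc = new SparkContext(conf)
sc.parallelize(1 to 1000000).persist(StorageLevel.OFF_HEAP).count()
```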
## Internal changes
- Rename `ByteArrayChunkOutputStream` to `ChunkedByteBufferOutputStream`
- It now returns a `ChunkedByteBuffer` instead of an array of byte arrays.
- Its constructor now accepts an `allocator` function which is called to
allocate `ByteBuffer`s. This allows us to control whether it allocates regular
ByteBuffers or off-heap DirectByteBuffers.
- Because block serialization is now performed during the unroll process,
a `ChunkedByteBufferOutputStream` which is configured with a `DirectByteBuffer`
allocator will use off-heap memory for both unroll and storage memory.
- The `MemoryStore`'s MemoryEntries now tracks whether blocks are stored
on- or off-heap.
- `evictBlocksToFreeSpace()` now accepts a `MemoryMode` parameter so that
we don't try to evict off-heap blocks in response to on-heap memory pressure
(or vice-versa).
- Make sure that off-heap buffers are properly de-allocated during
MemoryStore eviction.
- The JVM limits the total size of allocated direct byte buffers using the
`-XX:MaxDirectMemorySize` flag and the default tends to be fairly low (< 512
megabytes in some JVMs). To work around this limitation, this patch adds a
custom DirectByteBuffer allocator which ignores this memory limit.
Author: Josh Rosen <[email protected]>
Closes #11805 from JoshRosen/off-heap-caching.
commit 0b7d4966ca7e02f351c4b92a74789cef4799fcb1
Author: Shixiong Zhu <[email protected]>
Date: 2016-04-01T22:00:38Z
[SPARK-14316][SQL] StateStoreCoordinator should extend ThreadSafeRpcEndpoint
## What changes were proposed in this pull request?
RpcEndpoint is not thread safe and allows multiple messages to be processed
at the same time. StateStoreCoordinator should use ThreadSafeRpcEndpoint.
## How was this patch tested?
Existing unit tests.
Author: Shixiong Zhu <[email protected]>
Closes #12100 from zsxwing/fix-StateStoreCoordinator.
----