GitHub user zhuangxue opened a pull request:
https://github.com/apache/spark/pull/16191
spark decision tree
Which algorithm does Spark's decision tree use (ID3, C4.5, or CART)?
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/16191.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #16191
----
commit 16eaad9daed0b633e6a714b5704509aa7107d6e5
Author: Sean Owen <[email protected]>
Date: 2016-11-10T18:20:03Z
[SPARK-18262][BUILD][SQL] JSON.org license is now CatX
## What changes were proposed in this pull request?
Try excluding org.json:json from hive-exec dep as it's Cat X now. It may be
the case that it's not used by the part of Hive Spark uses anyway.
## How was this patch tested?
Existing tests
Author: Sean Owen <[email protected]>
Closes #15798 from srowen/SPARK-18262.
commit b533fa2b205544b42dcebe0a6fee9d8275f6da7d
Author: Michael Allman <[email protected]>
Date: 2016-11-10T21:41:13Z
[SPARK-17993][SQL] Fix Parquet log output redirection
(Link to Jira issue: https://issues.apache.org/jira/browse/SPARK-17993)
## What changes were proposed in this pull request?
PR #14690 broke parquet log output redirection for converted partitioned
Hive tables. For example, when querying parquet files written by Parquet-mr
1.6.0 Spark prints a torrent of (harmless) warning messages from the Parquet
reader:
```
Oct 18, 2016 7:42:18 PM WARNING: org.apache.parquet.CorruptStatistics: Ignoring statistics because created_by could not be parsed (see PARQUET-251): parquet-mr version 1.6.0
org.apache.parquet.VersionParser$VersionParseException: Could not parse created_by: parquet-mr version 1.6.0 using format: (.+) version ((.*) )?\(build ?(.*)\)
    at org.apache.parquet.VersionParser.parse(VersionParser.java:112)
    at org.apache.parquet.CorruptStatistics.shouldIgnoreStatistics(CorruptStatistics.java:60)
    at org.apache.parquet.format.converter.ParquetMetadataConverter.fromParquetStatistics(ParquetMetadataConverter.java:263)
    at org.apache.parquet.hadoop.ParquetFileReader$Chunk.readAllPages(ParquetFileReader.java:583)
    at org.apache.parquet.hadoop.ParquetFileReader.readNextRowGroup(ParquetFileReader.java:513)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.checkEndOfRowGroup(VectorizedParquetRecordReader.java:270)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextBatch(VectorizedParquetRecordReader.java:225)
    at org.apache.spark.sql.execution.datasources.parquet.VectorizedParquetRecordReader.nextKeyValue(VectorizedParquetRecordReader.java:137)
    at org.apache.spark.sql.execution.datasources.RecordReaderIterator.hasNext(RecordReaderIterator.scala:39)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.nextIterator(FileScanRDD.scala:162)
    at org.apache.spark.sql.execution.datasources.FileScanRDD$$anon$1.hasNext(FileScanRDD.scala:102)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.scan_nextBatch$(Unknown Source)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:372)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$24.apply(RDD.scala:803)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:319)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:283)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
```
This only happens during execution, not planning, and it doesn't matter
what log level the `SparkContext` is set to. That's because Parquet (versions <
1.9) doesn't use slf4j for logging. Note, you can tell that log redirection is
not working here because the log message format does not conform to the default
Spark log message format.
This is a regression I noted as something we needed to fix as a follow up.
It appears that the problem arose because we removed the call to
`inferSchema` during Hive table conversion. That call is what triggered the
output redirection.
## How was this patch tested?
I tested this manually in four ways:
1. Executing `spark.sqlContext.range(10).selectExpr("id as a").write.mode("overwrite").parquet("test")`.
2. Executing `spark.read.format("parquet").load(legacyParquetFile).show` for a Parquet file `legacyParquetFile` written using Parquet-mr 1.6.0.
3. Executing `select * from legacy_parquet_table limit 1` for some unpartitioned Parquet-based Hive table written using Parquet-mr 1.6.0.
4. Executing `select * from legacy_partitioned_parquet_table where partcol=x limit 1` for some partitioned Parquet-based Hive table written using Parquet-mr 1.6.0.
I ran each test with a new instance of `spark-shell` or `spark-sql`.
Incidentally, I found that test case 3 was not a regression: redirection was not occurring in the master codebase prior to #14690.
I spent some time working on a unit test, but based on my experience
working on this ticket I feel that automated testing here is far from feasible.
cc ericl dongjoon-hyun
Author: Michael Allman <[email protected]>
Closes #15538 from mallman/spark-17993-fix_parquet_log_redirection.
commit 2f7461f31331cfc37f6cfa3586b7bbefb3af5547
Author: Wenchen Fan <[email protected]>
Date: 2016-11-10T21:42:48Z
[SPARK-17990][SPARK-18302][SQL] correct several partition related
behaviours of ExternalCatalog
## What changes were proposed in this pull request?
This PR corrects several partition-related behaviors of `ExternalCatalog`:
1. default partition location should not always lower-case the partition column names in the path string (fix `HiveExternalCatalog`)
2. rename partition should not always lower-case the partition column names in the updated partition path string (fix `HiveExternalCatalog`)
3. rename partition should update the partition location only for managed tables (fix `InMemoryCatalog`)
4. create partition with an existing directory should be fine (fix `InMemoryCatalog`)
5. create partition with a non-existing directory should create that directory (fix `InMemoryCatalog`)
6. drop partition from an external table should not delete the directory (fix `InMemoryCatalog`)
## How was this patch tested?
new tests in `ExternalCatalogSuite`
Author: Wenchen Fan <[email protected]>
Closes #15797 from cloud-fan/partition.
commit e0deee1f7df31177cfc14bbb296f0baa372f473d
Author: Cheng Lian <[email protected]>
Date: 2016-11-10T21:44:54Z
[SPARK-18403][SQL] Temporarily disable flaky ObjectHashAggregateSuite
## What changes were proposed in this pull request?
Randomized tests in `ObjectHashAggregateSuite` is being flaky and breaks PR
builds. This PR disables them temporarily to bring back the PR build.
## How was this patch tested?
N/A
Author: Cheng Lian <[email protected]>
Closes #15845 from liancheng/ignore-flaky-object-hash-agg-suite.
commit a3356343cbf58b930326f45721fb4ecade6f8029
Author: Eric Liang <[email protected]>
Date: 2016-11-11T01:00:43Z
[SPARK-18185] Fix all forms of INSERT / OVERWRITE TABLE for Datasource
tables
## What changes were proposed in this pull request?
As of current 2.1, INSERT OVERWRITE with dynamic partitions against a
Datasource table will overwrite the entire table instead of only the partitions
matching the static keys, as in Hive. It also doesn't respect custom partition
locations.
This PR adds support for all these operations to Datasource tables managed
by the Hive metastore. It is implemented as follows
- During planning time, the full set of partitions affected by an INSERT or
OVERWRITE command is read from the Hive metastore.
- The planner identifies any partitions with custom locations and includes
this in the write task metadata.
- FileFormatWriter tasks refer to this custom locations map when
determining where to write for dynamic partition output.
- When the write job finishes, the set of written partitions is compared
against the initial set of matched partitions, and the Hive metastore is
updated to reflect the newly added / removed partitions.
It was necessary to introduce a method for staging files with absolute
output paths to `FileCommitProtocol`. These files are not handled by the Hadoop
output committer but are moved to their final locations when the job commits.
The overwrite behavior of legacy Datasource tables is also changed: no
longer will the entire table be overwritten if a partial partition spec is
present.
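For illustration, a hedged sketch of the kind of statement whose behavior changes (the table and column names are hypothetical, not taken from the patch; `spark` is the usual `SparkSession` as in `spark-shell`):
```scala
// Hypothetical example: with this change, only partitions matching the static key
// (part1 = 1) are overwritten; other partitions of the Datasource table are left intact,
// matching Hive semantics. Dynamic values of part2 are resolved from the query output.
spark.sql("""
  INSERT OVERWRITE TABLE datasource_table PARTITION (part1 = 1, part2)
  SELECT id, part2 FROM source_table
""")
```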
cc cloud-fan yhuai
## How was this patch tested?
Unit tests, existing tests.
Author: Eric Liang <[email protected]>
Author: Wenchen Fan <[email protected]>
Closes #15814 from ericl/sc-5027.
commit 5ddf69470b93c0b8a28bb4ac905e7670d9c50a95
Author: Yanbo Liang <[email protected]>
Date: 2016-11-11T01:13:10Z
[SPARK-18401][SPARKR][ML] SparkR random forest should support output
original label.
## What changes were proposed in this pull request?
SparkR ```spark.randomForest``` classification prediction should output the original label rather than the indexed label. This issue is very similar to [SPARK-18291](https://issues.apache.org/jira/browse/SPARK-18291).
## How was this patch tested?
Add unit tests.
Author: Yanbo Liang <[email protected]>
Closes #15842 from yanboliang/spark-18401.
commit 4f15d94cfec86130f8dab28ae2e228ded8124020
Author: Junjie Chen <[email protected]>
Date: 2016-11-11T18:37:58Z
[SPARK-13331] AES support for over-the-wire encryption
## What changes were proposed in this pull request?
The DIGEST-MD5 mechanism is used for SASL authentication and secure communication. DIGEST-MD5 supports the 3DES, DES, and RC4 ciphers. However, 3DES, DES and RC4 are relatively slow.
AES provides better performance and security by design and is a replacement for 3DES according to NIST. Apache Commons Crypto is a cryptographic library optimized with AES-NI; this patch employs Apache Commons Crypto as the enc/dec backend for SASL authentication and the secure channel to improve Spark RPC.
## How was this patch tested?
Unit tests and Integration test.
Author: Junjie Chen <[email protected]>
Closes #15172 from cjjnjust/shuffle_rpc_encrypt.
commit a531fe1a82ec515314f2db2e2305283fef24067f
Author: Vinayak <[email protected]>
Date: 2016-11-11T18:54:16Z
[SPARK-17843][WEB UI] Indicate event logs pending for processing on history
server UI
## What changes were proposed in this pull request?
The History Server UI's application listing now displays information on event logs that are currently being processed, so a user knows that an application may not appear in the list until that processing completes.
When there are no event logs under process, the application list page shows a "Last Updated" date-time at the top indicating the date-time of the last _completed_ scan of the event logs. The value is displayed to the user in his/her local time zone.
## How was this patch tested?
All unit tests pass. Particularly all the suites under
org.apache.spark.deploy.history.\* were run to test changes.
- Very first startup - Pending logs - no logs processed yet:
<img width="1280" alt="screen shot 2016-10-24 at 3 07 04 pm"
src="https://cloud.githubusercontent.com/assets/12079825/19640981/b8d2a96a-99fc-11e6-9b1f-2d736fe90e48.png">
- Very first startup - Pending logs - some logs processed:
<img width="1280" alt="screen shot 2016-10-24 at 3 18 42 pm"
src="https://cloud.githubusercontent.com/assets/12079825/19641087/3f8e3bae-99fd-11e6-9ef1-e0e70d71d8ef.png">
- Last updated - No currently pending logs:
<img width="1280" alt="screen shot 2016-10-17 at 8 34 37 pm"
src="https://cloud.githubusercontent.com/assets/12079825/19443100/4d13946c-94a9-11e6-8ee2-c442729bb206.png">
- Last updated - With some currently pending logs:
<img width="1280" alt="screen shot 2016-10-24 at 3 09 31 pm"
src="https://cloud.githubusercontent.com/assets/12079825/19640903/7323ba3a-99fc-11e6-8359-6a45753dbb28.png">
- No applications found and No currently pending logs:
<img width="1280" alt="screen shot 2016-10-24 at 3 24 26 pm"
src="https://cloud.githubusercontent.com/assets/12079825/19641364/03a2cb04-99fe-11e6-87d6-d09587fc6201.png">
Author: Vinayak <[email protected]>
Closes #15410 from vijoshi/SAAS-608_master.
commit d42bb7cc4e32c173769bd7da5b9b5eafb510860c
Author: Dongjoon Hyun <[email protected]>
Date: 2016-11-11T21:28:18Z
[SPARK-17982][SQL] SQLBuilder should wrap the generated SQL with
parenthesis for LIMIT
## What changes were proposed in this pull request?
Currently, `SQLBuilder` handles `LIMIT` by always adding `LIMIT` at the end of the generated subSQL. This causes `RuntimeException`s like the following. This PR always adds parentheses, except when `SubqueryAlias` is used together with `LIMIT`.
**Before**
``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
java.lang.RuntimeException: Failed to analyze the canonicalized SQL: ...
```
**After**
``` scala
scala> sql("CREATE TABLE tbl(id INT)")
scala> sql("CREATE VIEW v1(id2) AS SELECT id FROM tbl LIMIT 2")
scala> sql("SELECT id2 FROM v1")
res4: org.apache.spark.sql.DataFrame = [id2: int]
```
**Fixed cases in this PR**
The following two cases are the detail query plans having problematic SQL
generations.
1. `SELECT * FROM (SELECT id FROM tbl LIMIT 2)`
Please note the **FROM SELECT** part of the generated SQL below. When we don't use '()' for the limit, this fails.
```scala
# Original logical plan:
Project [id#1]
+- GlobalLimit 2
+- LocalLimit 2
+- Project [id#1]
+- MetastoreRelation default, tbl
# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
+- Project [gen_attr_0#1]
+- GlobalLimit 2
+- LocalLimit 2
+- Project [gen_attr_0#1]
+- SubqueryAlias gen_subquery_0
+- Project [id#1 AS gen_attr_0#1]
+- SQLTable default, tbl, [id#1]
# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM SELECT
`gen_attr_0` FROM (SELECT `id` AS `gen_attr_0` FROM `default`.`tbl`) AS
gen_subquery_0 LIMIT 2) AS tbl
```
2. `SELECT * FROM (SELECT id FROM tbl TABLESAMPLE (2 ROWS))`
Please note the **((~~~) AS gen_subquery_0 LIMIT 2)** part below. When we use '()' for the limit on `SubqueryAlias`, this fails.
```scala
# Original logical plan:
Project [id#1]
+- Project [id#1]
+- GlobalLimit 2
+- LocalLimit 2
+- MetastoreRelation default, tbl
# Canonicalized logical plan:
Project [gen_attr_0#1 AS id#4]
+- SubqueryAlias tbl
+- Project [gen_attr_0#1]
+- GlobalLimit 2
+- LocalLimit 2
+- SubqueryAlias gen_subquery_0
+- Project [id#1 AS gen_attr_0#1]
+- SQLTable default, tbl, [id#1]
# Generated SQL:
SELECT `gen_attr_0` AS `id` FROM (SELECT `gen_attr_0` FROM ((SELECT `id` AS
`gen_attr_0` FROM `default`.`tbl`) AS gen_subquery_0 LIMIT 2)) AS tbl
```
## How was this patch tested?
Pass the Jenkins test with a newly added test case.
Author: Dongjoon Hyun <[email protected]>
Closes #15546 from dongjoon-hyun/SPARK-17982.
commit 6e95325fc3726d260054bd6e7c0717b3c139917e
Author: Ryan Blue <[email protected]>
Date: 2016-11-11T21:52:10Z
[SPARK-18387][SQL] Add serialization to checkEvaluation.
## What changes were proposed in this pull request?
This removes the serialization test from RegexpExpressionsSuite and
replaces it by serializing all expressions in checkEvaluation.
This also fixes math constant expressions by making LeafMathExpression
Serializable and fixes NumberFormat values that are null or invalid
after serialization.
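As a hedged illustration of the round-trip idea (not the actual `checkEvaluation` code; the helper name is made up), serializing an expression before evaluating it can be as simple as:
```scala
import java.io._

// Hypothetical helper: Java-serialize an object and read it back, so that any field that
// does not survive serialization (e.g. a NumberFormat left null or invalid, as mentioned
// above) surfaces as a test failure before evaluation.
def javaSerializeRoundTrip[T <: Serializable](obj: T): T = {
  val bytes = new ByteArrayOutputStream()
  val out = new ObjectOutputStream(bytes)
  out.writeObject(obj)
  out.close()
  val in = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
  in.readObject().asInstanceOf[T]
}
```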
## How was this patch tested?
This patch only changes tests.
Author: Ryan Blue <[email protected]>
Closes #15847 from rdblue/SPARK-18387-fix-serializable-expressions.
commit ba23f768f7419039df85530b84258ec31f0c22b4
Author: Felix Cheung <[email protected]>
Date: 2016-11-11T23:49:55Z
[SPARK-18264][SPARKR] build vignettes with package, update vignettes for
CRAN release build and add info on release
## What changes were proposed in this pull request?
Changes to DESCRIPTION to build vignettes.
Changes the metadata for vignettes to generate the recommended format (which is less than 10% of the previous size). Unfortunately it does not look as nice (before - left, after - right).
[before/after screenshots omitted]
Also add information on how to run build/release to CRAN later.
## How was this patch tested?
manually, unit tests
shivaram
We need this for branch-2.1
Author: Felix Cheung <[email protected]>
Closes #15790 from felixcheung/rpkgvignettes.
commit 46b2550bcd3690a260b995fd4d024a73b92a0299
Author: sethah <[email protected]>
Date: 2016-11-12T01:38:26Z
[SPARK-18060][ML] Avoid unnecessary computation for MLOR
## What changes were proposed in this pull request?
Before this patch, the gradient updates for multinomial logistic regression
were computed by an outer loop over the number of classes and an inner loop
over the number of features. Inside the inner loop, we standardized the feature
value (`value / featuresStd(index)`), which means we performed the computation
`numFeatures * numClasses` times. We only need to perform that computation
`numFeatures` times, however. If we re-order the inner and outer loop, we can
avoid this, but then we lose sequential memory access. In this patch, we
instead lay out the coefficients in column major order while we train, so that
we can avoid the extra computation and retain sequential memory access. We
convert back to row-major order when we create the model.
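A hedged, simplified sketch of the idea (plain Scala arrays, not the actual MLOR aggregator): standardize each feature value once, and lay the coefficients out column-major so the inner loop over classes stays sequential in memory.
```scala
// Simplified sketch; coefficients are stored column-major for a numClasses x numFeatures
// matrix: coef(j * numClasses + k) holds the coefficient of feature j for class k.
def margins(features: Array[Double], featuresStd: Array[Double],
            coef: Array[Double], numClasses: Int): Array[Double] = {
  val m = new Array[Double](numClasses)
  var j = 0
  while (j < features.length) {
    if (featuresStd(j) != 0.0 && features(j) != 0.0) {
      val stdValue = features(j) / featuresStd(j)   // computed once per feature
      val offset = j * numClasses
      var k = 0
      while (k < numClasses) {
        m(k) += coef(offset + k) * stdValue         // sequential access within a column
        k += 1
      }
    }
    j += 1
  }
  m
}
```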
## How was this patch tested?
This is an implementation detail only, so the original behavior should be
maintained. All tests pass. I ran some performance tests to verify speedups.
The results are below, and show significant speedups.
## Performance Tests
**Setup**
3 node bare-metal cluster
120 cores total
384 gb RAM total
**Results**
NOTE: The `currentMasterTime` and `thisPatchTime` are times in seconds for
a single iteration of L-BFGS or OWL-QN.
| | numPoints | numFeatures | numClasses | regParam | elasticNetParam | currentMasterTime (sec) | thisPatchTime (sec) | pctSpeedup |
|----|-----------|-------------|------------|----------|-----------------|-------------------------|---------------------|------------|
| 0 | 1e+07 | 100 | 500 | 0.5 | 0 | 90 | 18 | 80 |
| 1 | 1e+08 | 100 | 50 | 0.5 | 0 | 90 | 19 | 78 |
| 2 | 1e+08 | 100 | 50 | 0.05 | 1 | 72 | 19 | 73 |
| 3 | 1e+06 | 100 | 5000 | 0.5 | 0 | 93 | 53 | 43 |
| 4 | 1e+07 | 100 | 5000 | 0.5 | 0 | 900 | 390 | 56 |
| 5 | 1e+08 | 100 | 500 | 0.5 | 0 | 840 | 174 | 79 |
| 6 | 1e+08 | 100 | 200 | 0.5 | 0 | 360 | 72 | 80 |
| 7 | 1e+08 | 1000 | 5 | 0.5 | 0 | 9 | 3 | 66 |
Author: sethah <[email protected]>
Closes #15593 from sethah/MLOR_PERF_COL_MAJOR_COEF.
commit 3af894511be6fcc17731e28b284dba432fe911f5
Author: Weiqing Yang <[email protected]>
Date: 2016-11-12T02:36:23Z
[SPARK-16759][CORE] Add a configuration property to pass caller contexts of
upstream applications into Spark
## What changes were proposed in this pull request?
Many applications take Spark as a computing engine and run on it. This PR
adds a configuration property `spark.log.callerContext` that can be used by
Spark's upstream applications (e.g. Oozie) to set up their caller contexts into
Spark. In the end, Spark will combine its own caller context with the caller
contexts of its upstream applications, and write them into Yarn RM log and HDFS
audit log.
The audit log has a config to truncate the caller contexts passed in (default 128). The caller contexts will be sent over RPC, so they should be concise. The call context written into the HDFS log and Yarn log consists of two
parts: the information `A` specified by Spark itself and the value `B` of
`spark.log.callerContext` property. Currently `A` typically takes 64 to 74
characters, so `B` can have up to 50 characters (mentioned in the doc
`running-on-yarn.md`)
## How was this patch tested?
Manual tests. I have run some Spark applications with
`spark.log.callerContext` configuration in Yarn client/cluster mode, and
verified that the caller contexts were written into Yarn RM log and HDFS audit
log correctly.
The ways to configure `spark.log.callerContext` property:
- In spark-defaults.conf:
```
spark.log.callerContext infoSpecifiedByUpstreamApp
```
- In app's source code:
```
val spark = SparkSession
.builder
.appName("SparkKMeans")
.config("spark.log.callerContext", "infoSpecifiedByUpstreamApp")
.getOrCreate()
```
When running on Spark Yarn cluster mode, the driver is unable to pass
'spark.log.callerContext' to Yarn client and AM since Yarn client and AM have
already started before the driver performs `.config("spark.log.callerContext",
"infoSpecifiedByUpstreamApp")`.
The following example shows the command line used to submit a SparkKMeans
application and the corresponding records in Yarn RM log and HDFS audit log.
Command:
```
./bin/spark-submit --verbose --executor-cores 3 --num-executors 1 --master
yarn --deploy-mode client --class org.apache.spark.examples.SparkKMeans
examples/target/original-spark-examples_2.11-2.1.0-SNAPSHOT.jar
hdfs://localhost:9000/lr_big.txt 2 5
```
Yarn RM log:
<img width="1440" alt="screen shot 2016-10-19 at 9 12 03 pm"
src="https://cloud.githubusercontent.com/assets/8546874/19547050/7d2f278c-9649-11e6-9df8-8d5ff12609f0.png">
HDFS audit log:
<img width="1400" alt="screen shot 2016-10-19 at 10 18 14 pm"
src="https://cloud.githubusercontent.com/assets/8546874/19547102/096060ae-964a-11e6-981a-cb28efd5a058.png">
Author: Weiqing Yang <[email protected]>
Closes #15563 from weiqingy/SPARK-16759.
commit bc41d997ea287080f549219722b6d9049adef4e2
Author: Guoqiang Li <[email protected]>
Date: 2016-11-12T09:49:14Z
[SPARK-18375][SPARK-18383][BUILD][CORE] Upgrade netty to 4.0.42.Final
## What changes were proposed in this pull request?
One of the important changes for 4.0.42.Final is "Support any FileRegion
implementation when using epoll transport netty/netty#5825".
In 4.0.42.Final, `MessageWithHeader` can work properly when `spark.[shuffle|rpc].io.mode` is set to epoll.
## How was this patch tested?
Existing tests
Author: Guoqiang Li <[email protected]>
Closes #15830 from witgo/SPARK-18375_netty-4.0.42.
commit 22cb3a060a440205281b71686637679645454ca6
Author: Yanbo Liang <[email protected]>
Date: 2016-11-12T14:13:22Z
[SPARK-14077][ML][FOLLOW-UP] Minor refactor and cleanup for NaiveBayes
## What changes were proposed in this pull request?
* Refactor out ```trainWithLabelCheck``` and make ```mllib.NaiveBayes```
call into it.
* Avoid capturing the outer object for ```modelType``` (see the sketch after this list).
* Move ```requireNonnegativeValues``` and
```requireZeroOneBernoulliValues``` to companion object.
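A generic, hedged sketch of the capture-avoidance idiom behind the second bullet (the class and field names are illustrative, not the real NaiveBayes code):
```scala
import org.apache.spark.rdd.RDD

class ExampleTrainer(val modelType: String) extends Serializable {
  def fit(data: RDD[Double]): Long = {
    // Copy the field into a local val so the task closure captures only the String,
    // not the whole enclosing trainer object.
    val localModelType = modelType
    data.filter(x => if (localModelType == "bernoulli") x == 0.0 || x == 1.0 else x >= 0.0)
        .count()
  }
}
```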
## How was this patch tested?
Existing tests.
Author: Yanbo Liang <[email protected]>
Closes #15826 from yanboliang/spark-14077-2.
commit 1386fd28daf798bf152606f4da30a36223d75d18
Author: Holden Karau <[email protected]>
Date: 2016-11-12T22:50:37Z
[SPARK-18418] Fix flags for make_binary_release for hadoop profile
## What changes were proposed in this pull request?
Fix the flags used to specify the hadoop version
## How was this patch tested?
Manually tested as part of https://github.com/apache/spark/pull/15659 by
having the build succeed.
cc joshrosen
Author: Holden Karau <[email protected]>
Closes #15860 from holdenk/minor-fix-release-build-script.
commit b91a51bb231af321860415075a7f404bc46e0a74
Author: Denny Lee <[email protected]>
Date: 2016-11-14T02:10:06Z
[SPARK-18426][STRUCTURED STREAMING] Python Documentation Fix for Structured
Streaming Programming Guide
## What changes were proposed in this pull request?
Update the python section of the Structured Streaming Guide from .builder()
to .builder
## How was this patch tested?
Validated documentation and successfully running the test example.
The 'Builder' object is not callable, hence changed .builder() to .builder.
Author: Denny Lee <[email protected]>
Closes #15872 from dennyglee/master.
commit 07be232ea12dfc8dc3701ca948814be7dbebf4ee
Author: Yanbo Liang <[email protected]>
Date: 2016-11-14T04:25:12Z
[SPARK-18412][SPARKR][ML] Fix exception for some SparkR ML algorithms
training on libsvm data
## What changes were proposed in this pull request?
* Fix the following exception, which is thrown when ```spark.randomForest``` (classification), ```spark.gbt``` (classification), ```spark.naiveBayes``` and ```spark.glm``` (binomial family) are fitted on libsvm data.
```
java.lang.IllegalArgumentException: requirement failed: If label column
already exists, forceIndexLabel can not be set with true.
```
See [SPARK-18412](https://issues.apache.org/jira/browse/SPARK-18412) for
more detail about how to reproduce this bug.
* Refactor out ```getFeaturesAndLabels``` to RWrapperUtils, since lots of
ML algorithm wrappers use this function.
* Drop some unwanted columns when making prediction.
## How was this patch tested?
Add unit test.
Author: Yanbo Liang <[email protected]>
Closes #15851 from yanboliang/spark-18412.
commit f95b124c68ccc2e318f6ac30685aa47770eea8f3
Author: Sean Owen <[email protected]>
Date: 2016-11-14T07:52:07Z
[SPARK-18382][WEBUI] "run at null:-1" in UI when no file/line info in call
site info
## What changes were proposed in this pull request?
Avoid reporting null/-1 file / line number in call sites if encountering
StackTraceElement without this info
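A hedged sketch of the kind of guard this implies (illustrative only, not the exact `Utils` code):
```scala
// Fall back gracefully when a StackTraceElement has no file/line info,
// instead of rendering something like "run at null:-1".
def shortCallSite(methodName: String, ste: StackTraceElement): String = {
  if (ste.getFileName != null && ste.getLineNumber >= 0) {
    s"$methodName at ${ste.getFileName}:${ste.getLineNumber}"
  } else if (ste.getFileName != null) {
    s"$methodName at ${ste.getFileName}"
  } else {
    s"$methodName at <unknown>"
  }
}
```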
## How was this patch tested?
Existing tests
Author: Sean Owen <[email protected]>
Closes #15862 from srowen/SPARK-18382.
commit ae6cddb78742be94aa0851ce719f293e0a64ce4f
Author: actuaryzhang <[email protected]>
Date: 2016-11-14T11:08:06Z
[SPARK-18166][MLLIB] Fix Poisson GLM bug due to wrong requirement of
response values
## What changes were proposed in this pull request?
The current implementation of Poisson GLM seems to allow only positive
values. This is incorrect since the support of Poisson includes the origin. The
bug is easily fixed by changing the test of the Poisson variable from
'require(y **>** 0.0' to 'require(y **>=** 0.0'.
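For clarity, the corrected requirement in sketch form (the message text is illustrative):
```scala
// Zero counts are valid Poisson observations, so the check must be >= rather than >.
def requireValidPoissonResponse(y: Double): Unit = {
  require(y >= 0.0, s"The response variable of Poisson family should be non-negative, but got $y")
}
```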
mengxr srowen
Author: actuaryzhang <[email protected]>
Author: actuaryzhang <[email protected]>
Closes #15683 from actuaryzhang/master.
commit 637a0bb88f74712001f32a53ff66fd0b8cb67e4a
Author: WangTaoTheTonic <[email protected]>
Date: 2016-11-14T11:22:36Z
[SPARK-18396][HISTORYSERVER] "Duration" column makes search results confusing; maybe we should make it unsearchable
## What changes were proposed in this pull request?
When we search data in the History Server, it checks whether any column contains the search string. Duration is represented as a long value in the table, so if we search for a simple string like "003" or "111", rows whose duration contains "003" or "111" will be shown, which does not make much sense to users.
We cannot simply convert the long value to a human-readable format like "1 h" or "3.2 min" because the values are also used for sorting. A better way to handle this is to exclude the "Duration" column from searching.
## How was this patch tested?
Manual tests.
Before ("local-1478225166651" passes the filter because its duration as a long value, "257244245", contains the search string "244"):
[screenshot omitted]
After:
[screenshot omitted]
Author: WangTaoTheTonic <[email protected]>
Closes #15838 from WangTaoTheTonic/duration.
commit 9d07ceee7860921eafb55b47852f1b51089c98da
Author: Noritaka Sekiyama <[email protected]>
Date: 2016-11-14T12:07:59Z
[SPARK-18432][DOC] Changed HDFS default block size from 64MB to 128MB
Changed HDFS default block size from 64MB to 128MB.
https://issues.apache.org/jira/browse/SPARK-18432
Author: Noritaka Sekiyama <[email protected]>
Closes #15879 from moomindani/SPARK-18432.
commit bdfe60ac921172be0fb77de2f075cc7904a3b238
Author: Tathagata Das <[email protected]>
Date: 2016-11-14T18:03:01Z
[SPARK-18416][STRUCTURED STREAMING] Fixed temp file leak in state store
## What changes were proposed in this pull request?
StateStore.get() causes temporary files to be created immediately, even if the store is not used to make updates for a new version. The temp file is not closed, as store.commit() is not called in those cases, thus keeping the output stream to the temp file open forever.
This PR fixes it by opening the temp file only when there are updates being
made.
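A hedged sketch of the lazy-open pattern described above (names are illustrative, not the actual state store internals):
```scala
import java.io.{File, FileOutputStream, OutputStream}

// Illustrative only: the delta output stream is created on the first update,
// not when the store is opened, so a read-only store leaves no dangling temp file.
class DeltaWriter(tempFile: File) {
  private var stream: Option[OutputStream] = None

  private def deltaStream(): OutputStream = stream.getOrElse {
    val s = new FileOutputStream(tempFile)
    stream = Some(s)
    s
  }

  def put(bytes: Array[Byte]): Unit = deltaStream().write(bytes)

  def commit(): Unit = stream.foreach(_.close())
}
```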
## How was this patch tested?
New unit test
Author: Tathagata Das <[email protected]>
Closes #15859 from tdas/SPARK-18416.
commit 89d1fa58dbe88560b1f2b0362fcc3035ccc888be
Author: cody koeninger <[email protected]>
Date: 2016-11-14T19:10:37Z
[SPARK-17510][STREAMING][KAFKA] config max rate on a per-partition basis
## What changes were proposed in this pull request?
Allow configuration of max rate on a per topic-partition basis.
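A hedged sketch of what a per-topic-partition rate lookup could look like (the class and method names are hypothetical, not the API added by this patch):
```scala
// Hypothetical helper: resolve the max messages-per-second for a given topic/partition,
// falling back to a default when no override is configured.
class PerTopicPartitionRate(defaultRate: Long, overrides: Map[(String, Int), Long]) {
  def maxRatePerPartition(topic: String, partition: Int): Long =
    overrides.getOrElse((topic, partition), defaultRate)
}

// Usage sketch:
val rates = new PerTopicPartitionRate(1000L, Map(("events", 0) -> 5000L))
rates.maxRatePerPartition("events", 0)  // 5000
rates.maxRatePerPartition("events", 1)  // 1000
```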
## How was this patch tested?
Unit tests.
The reporter (Jeff Nadler) said he could test on his workload, so let's
wait on that report.
Author: cody koeninger <[email protected]>
Closes #15132 from koeninger/SPARK-17510.
commit 75934457d75996be71ffd0d4b448497d656c0d40
Author: Zheng RuiFeng <[email protected]>
Date: 2016-11-14T19:42:00Z
[SPARK-11496][GRAPHX][FOLLOWUP] Add param checking for
runParallelPersonalizedPageRank
## What changes were proposed in this pull request?
Add param checking to keep in line with other algorithms.
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <[email protected]>
Closes #15876 from zhengruifeng/param_check_runParallelPersonalizedPageRank.
commit bd85603ba5f9e61e1aa8326d3e4d5703b5977a4c
Author: Nattavut Sutyanyong <[email protected]>
Date: 2016-11-14T19:59:15Z
[SPARK-17348][SQL] Incorrect results from subquery transformation
## What changes were proposed in this pull request?
Return an AnalysisException when there is a correlated non-equality predicate in a subquery and the correlated column from the outer reference is not from the immediate parent operator of the subquery. This PR prevents incorrect results from the subquery transformation in such cases.
Test cases, both positive and negative tests, are added.
## How was this patch tested?
sql/test, catalyst/test, hive/test, and scenarios that would produce incorrect results without this PR and produce correct results when the subquery transformation does happen.
Author: Nattavut Sutyanyong <[email protected]>
Closes #15763 from nsyca/spark-17348.
commit c07187823a98f0d1a0f58c06e28a27e1abed157a
Author: Michael Armbrust <[email protected]>
Date: 2016-11-15T00:46:26Z
[SPARK-18124] Observed delay based Event Time Watermarks
This PR adds a new method `withWatermark` to the `Dataset` API, which can be used to specify an _event time watermark_. An event time watermark allows the
streaming engine to reason about the point in time after which we no longer
expect to see late data. This PR also has augmented `StreamExecution` to use
this watermark for several purposes:
- To know when a given time window aggregation is finalized and thus
results can be emitted when using output modes that do not allow updates (e.g.
`Append` mode).
- To minimize the amount of state that we need to keep for on-going
aggregations, by evicting state for groups that are no longer expected to
change. We do still maintain all state if the query requires it (i.e. if the event time is not present in the `groupBy` or when running in `Complete` mode).
An example that emits windowed counts of records, waiting up to 5 minutes
for late data to arrive.
```scala
df.withWatermark("eventTime", "5 minutes")
.groupBy(window($"eventTime", "1 minute") as 'window)
.count()
.writeStream
.format("console")
.mode("append") // In append mode, we only output finalized aggregations.
.start()
```
### Calculating the watermark.
The current event time is computed by looking at the `MAX(eventTime)` seen
this epoch across all of the partitions in the query minus some user defined
_delayThreshold_. An additional constraint is that the watermark must increase
monotonically.
Note that since we must coordinate this value across partitions
occasionally, the actual watermark used is only guaranteed to be at least
`delay` behind the actual event time. In some cases we may still process
records that arrive more than delay late.
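A hedged sketch of the update rule described above (variable names are illustrative):
```scala
// Illustrative only: the new watermark is the max event time seen this epoch minus the
// user-defined delay, and it never moves backwards.
def updateWatermark(currentWatermarkMs: Long, maxEventTimeMs: Long, delayThresholdMs: Long): Long =
  math.max(currentWatermarkMs, maxEventTimeMs - delayThresholdMs)
```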
This mechanism was chosen for the initial implementation over processing
time for two reasons:
- it is robust to downtime that could affect processing delay
- it does not require syncing of time or timezones between the producer
and the processing engine.
### Other notable implementation details
- A new trigger metric `eventTimeWatermark` outputs the current value of
the watermark.
- We mark the event time column in the `Attribute` metadata using the key
`spark.watermarkDelay`. This allows downstream operations to know which column
holds the event time. Operations like `window` propagate this metadata.
- `explain()` marks the watermark with a suffix of `-T${delayMs}` to ease
debugging of how this information is propagated.
- Currently, we don't filter out late records, but instead rely on the
state store to avoid emitting records that are both added and filtered in the
same epoch.
### Remaining in this PR
- [ ] The test for recovery is currently failing as we don't record the
watermark used in the offset log. We will need to do so to ensure determinism,
but this is deferred until #15626 is merged.
### Other follow-ups
There are some natural additional features that we should consider for
future work:
- Ability to write records that arrive too late to some external store in
case any out-of-band remediation is required.
- `Update` mode so you can get partial results before a group is evicted.
- Other mechanisms for calculating the watermark. In particular a
watermark based on quantiles would be more robust to outliers.
Author: Michael Armbrust <[email protected]>
Closes #15702 from marmbrus/watermarks.
commit c31def1ddcbed340bfc071d54fb3dc7945cb525a
Author: Zheng RuiFeng <[email protected]>
Date: 2016-11-15T05:15:39Z
[SPARK-18428][DOC] Update docs for GraphX
## What changes were proposed in this pull request?
1, Add link of `VertexRDD` and `EdgeRDD`
2, Notify in `Vertex and Edge RDDs` that not all methods are listed
3, `VertexID` -> `VertexId`
## How was this patch tested?
No tests; only docs are modified.
Author: Zheng RuiFeng <[email protected]>
Closes #15875 from zhengruifeng/update_graphop_doc.
commit 86430cc4e8dbc65a091a532fc9c5ec12b7be04f4
Author: gatorsmile <[email protected]>
Date: 2016-11-15T05:21:34Z
[SPARK-18430][SQL] Fixed Exception Messages when Hitting an Invocation
Exception of Function Lookup
### What changes were proposed in this pull request?
When the exception is an invocation exception during function lookup, we
return a useless/confusing error message:
For example,
```Scala
df.selectExpr("concat_ws()")
```
Below is the error message we got:
```
null; line 1 pos 0
org.apache.spark.sql.AnalysisException: null; line 1 pos 0
```
To get a meaningful error message, we need to get the cause. The fix is exactly the same as what we did in https://github.com/apache/spark/pull/12136. After the fix, the message we get is the exception issued in the constructor of the function implementation:
```
requirement failed: concat_ws requires at least one argument.; line 1 pos 0
org.apache.spark.sql.AnalysisException: requirement failed: concat_ws
requires at least one argument.; line 1 pos 0
```
### How was this patch tested?
Added test cases.
Author: gatorsmile <[email protected]>
Closes #15878 from gatorsmile/functionNotFound.
commit d89bfc92302424406847ac7a9cfca714e6b742fc
Author: Michael Gummelt <[email protected]>
Date: 2016-11-15T07:46:54Z
[SPARK-18232][MESOS] Support CNI
## What changes were proposed in this pull request?
Adds support for CNI-isolated containers
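For example, a hedged sketch of attaching a job to a named CNI network via the configuration property referenced in the test notes below (the network name value is made up):
```scala
import org.apache.spark.SparkConf

// Illustrative sketch: "my-cni-network" is a placeholder for a real CNI network name.
val conf = new SparkConf()
  .setAppName("SparkPi")
  .set("spark.mesos.network.name", "my-cni-network")
```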
## How was this patch tested?
I launched SparkPi both with and without `spark.mesos.network.name`, and
verified the job completed successfully.
Author: Michael Gummelt <[email protected]>
Closes #15740 from mgummelt/spark-342-cni.
----