GitHub user ZakariaHili opened a pull request:
https://github.com/apache/spark/pull/15964
[SPARK-18356] [ML] Improve MLKmeans Performance
## What changes were proposed in this pull request?
Spark KMeans fit() doesn't cache the input RDD, which generates a lot of warnings:
WARN KMeans: The input data is not directly cached, which may hurt
performance if its parent RDDs are also uncached.
So, KMeans should cache the internal RDD before calling the MLlib KMeans
algorithm. This helped improve Spark KMeans performance by 14%:
https://github.com/ZakariaHili/spark/commit/a9cf905cf7dbd50eeb9a8b4f891f2f41ea672472
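As a rough sketch of the caching pattern proposed here (the helper name is illustrative, not the actual patch, which lives inside ml.clustering.KMeans.fit()): persist the feature RDD if the caller has not already, run the iterative MLlib algorithm, then unpersist.
```scala
import org.apache.spark.mllib.clustering.{KMeans => MLlibKMeans, KMeansModel}
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.rdd.RDD
import org.apache.spark.storage.StorageLevel

// Hypothetical helper sketching the caching pattern.
def runKMeansCached(instances: RDD[Vector], k: Int): KMeansModel = {
  // Only persist if the data is not already cached upstream.
  val handlePersistence = instances.getStorageLevel == StorageLevel.NONE
  if (handlePersistence) instances.persist(StorageLevel.MEMORY_AND_DISK)
  try {
    new MLlibKMeans().setK(k).run(instances)
  } finally {
    if (handlePersistence) instances.unpersist(blocking = false)
  }
}
```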
@hhbyyh
## How was this patch tested?
Passes the KMeans tests and the existing tests.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/apache/spark branch-2.1
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/15964.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #15964
----
commit 39d2fdb51233ed9b1aaf3adaa3267853f5e58c0f
Author: frreiss <[email protected]>
Date: 2016-11-02T06:00:17Z
[SPARK-17475][STREAMING] Delete CRC files if the filesystem doesn't use
checksum files
## What changes were proposed in this pull request?
When the metadata logs for various parts of Structured Streaming are stored
on non-HDFS filesystems such as NFS or ext4, the HDFSMetadataLog class leaves
hidden HDFS-style checksum (CRC) files in the log directory, one file per
batch. This PR modifies HDFSMetadataLog so that it detects the use of a
filesystem that doesn't use CRC files and removes the CRC files.
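A minimal sketch of the cleanup idea (the helper name is illustrative; the actual change is inside HDFSMetadataLog after a batch file is written): delete the hidden `.<name>.crc` sidecar that checksum-based Hadoop FileSystem implementations leave behind.
```scala
import org.apache.hadoop.fs.{FileSystem, Path}

// Remove the ".<name>.crc" file next to a metadata batch file, if present.
def deleteCrcSidecar(fs: FileSystem, batchFile: Path): Unit = {
  val crcFile = new Path(batchFile.getParent, s".${batchFile.getName}.crc")
  if (fs.exists(crcFile)) {
    fs.delete(crcFile, false) // not recursive; it is a single hidden file
  }
}
```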
## How was this patch tested?
Modified an existing test case in HDFSMetadataLogSuite to check whether
HDFSMetadataLog correctly removes CRC files on the local POSIX filesystem. Ran
the entire regression suite.
Author: frreiss <[email protected]>
Closes #15027 from frreiss/fred-17475.
(cherry picked from commit 620da3b4828b3580c7ed7339b2a07938e6be1bb1)
Signed-off-by: Reynold Xin <[email protected]>
commit e6509c2459e7ece3c3c6bcd143b8cc71f8f4d5c8
Author: Eric Liang <[email protected]>
Date: 2016-11-02T06:15:10Z
[SPARK-18183][SPARK-18184] Fix INSERT [INTO|OVERWRITE] TABLE ... PARTITION
for Datasource tables
There are a couple of issues with the current 2.1 behavior when inserting into
Datasource tables with partitions managed by Hive.
(1) OVERWRITE TABLE ... PARTITION will actually overwrite the entire table
instead of just the specified partition.
(2) INSERT|OVERWRITE does not work with partitions that have custom
locations.
This PR fixes both of these issues for Datasource tables managed by Hive.
The behavior for legacy tables or when `manageFilesourcePartitions = false` is
unchanged.
There is one other issue in that INSERT OVERWRITE with dynamic partitions
will overwrite the entire table instead of just the updated partitions, but
this behavior is pretty complicated to implement for Datasource tables. We
should address that in a future release.
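For context, a hedged illustration of the statements affected; the table and column names are made up, and the snippet assumes `sales` is a partitioned Datasource table managed by Hive and `staging_sales` already exists.
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .master("local[*]").appName("insert-partition-demo").enableHiveSupport().getOrCreate()

// (1) Overwriting one partition should replace only that partition, not the whole table.
spark.sql("""
  INSERT OVERWRITE TABLE sales PARTITION (dt = '2016-11-01')
  SELECT id, amount FROM staging_sales WHERE dt = '2016-11-01'
""")

// (2) Inserting into a partition that was added with a custom LOCATION should also work.
spark.sql("ALTER TABLE sales ADD PARTITION (dt = '2016-11-02') LOCATION '/custom/path/dt=2016-11-02'")
spark.sql("""
  INSERT INTO TABLE sales PARTITION (dt = '2016-11-02')
  SELECT id, amount FROM staging_sales WHERE dt = '2016-11-02'
""")
```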
Unit tests.
Author: Eric Liang <[email protected]>
Closes #15705 from ericl/sc-4942.
(cherry picked from commit abefe2ec428dc24a4112c623fb6fbe4b2ca60a2b)
Signed-off-by: Reynold Xin <[email protected]>
commit 85dd073743946383438aabb9f1281e6075f25cc5
Author: Reynold Xin <[email protected]>
Date: 2016-11-02T06:37:03Z
[SPARK-18192] Support all file formats in structured streaming
## What changes were proposed in this pull request?
This patch adds support for all file formats in structured streaming sinks.
This is actually a very small change thanks to all the previous refactoring
done using the new internal commit protocol API.
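As a hedged usage sketch (paths are placeholders), any built-in file format can now be used as the sink of a streaming query:
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("file-sink-demo").getOrCreate()

// The text source infers a single `value: string` column, so no schema is needed.
val input = spark.readStream.text("/tmp/stream-in")

val query = input.writeStream
  .format("json")                                  // likewise "parquet", "text", "csv", "orc"
  .option("checkpointLocation", "/tmp/stream-chk") // required for file sinks
  .start("/tmp/stream-out")

query.awaitTermination()
```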
## How was this patch tested?
Updated FileStreamSinkSuite to add test cases for json, text, and parquet.
Author: Reynold Xin <[email protected]>
Closes #15711 from rxin/SPARK-18192.
(cherry picked from commit a36653c5b7b2719f8bfddf4ddfc6e1b828ac9af1)
Signed-off-by: Reynold Xin <[email protected]>
commit 4c4bf87acf2516a72b59f4e760413f80640dca1e
Author: CodingCat <[email protected]>
Date: 2016-11-02T06:39:53Z
[SPARK-18144][SQL] logging StreamingQueryListener$QueryStartedEvent
## What changes were proposed in this pull request?
The PR fixes the bug that the QueryStartedEvent is not logged: the postToAll()
in the original code actually calls StreamingQueryListenerBus.postToAll(), which
has no listeners at all. We should post via sparkListenerBus.postToAll(s) and
this.postToAll() to trigger the local listeners as well as the listeners
registered in LiveListenerBus.
zsxwing
## How was this patch tested?
A snapshot attached to the original PR shows that QueryStartedEvent is now
logged correctly.
Author: CodingCat <[email protected]>
Closes #15675 from CodingCat/SPARK-18144.
(cherry picked from commit 85c5424d466f4a5765c825e0e2ab30da97611285)
Signed-off-by: Shixiong Zhu <[email protected]>
commit 3b624bedf0f0ecd5dcfcc262a3ca8b4e33662533
Author: Ryan Blue <[email protected]>
Date: 2016-11-02T07:08:30Z
[SPARK-17532] Add lock debugging info to thread dumps.
## What changes were proposed in this pull request?
This adds information to the web UI thread dump page about the JVM locks
held by threads and the locks that threads are blocked waiting to
acquire. This should help find cases where lock contention is causing
Spark applications to run slowly.
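For reference, a small self-contained sketch of where this lock information comes from: the java.lang.management API exposes, per thread, the monitors and ownable synchronizers it holds and the lock it is blocked on (this is the underlying JDK API, not the patch itself).
```scala
import java.lang.management.ManagementFactory

object LockInfoDemo {
  def main(args: Array[String]): Unit = {
    val threadBean = ManagementFactory.getThreadMXBean
    // Request held monitors/synchronizers and the lock each thread is blocked on.
    val infos = threadBean.dumpAllThreads(true, true)
    infos.foreach { ti =>
      val held = ti.getLockedMonitors.map(_.toString) ++ ti.getLockedSynchronizers.map(_.toString)
      val blockedOn = Option(ti.getLockInfo).map(_.toString).getOrElse("-")
      println(s"${ti.getThreadName}: holding=${held.mkString("[", ", ", "]")} blockedOn=$blockedOn")
    }
  }
}
```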
## How was this patch tested?
Tested by applying this patch and viewing the change in the web UI.

Additions:
- A "Thread Locking" column with the locks held by the thread or that are
blocking the thread
- Links from a blocked thread to the thread holding the lock
- Stack frames show where threads are inside `synchronized` blocks,
"holding Monitor(...)"
Author: Ryan Blue <[email protected]>
Closes #15088 from rdblue/SPARK-17532-add-thread-lock-info.
(cherry picked from commit 2dc048081668665f85623839d5f663b402e42555)
Signed-off-by: Reynold Xin <[email protected]>
commit ab8da1413836591fecbc75a2515875bf3e50527f
Author: Liwei Lin <[email protected]>
Date: 2016-11-02T09:10:34Z
[SPARK-18198][DOC][STREAMING] Highlight code snippets
## What changes were proposed in this pull request?
This patch uses `{% highlight lang %}...{% endhighlight %}` to highlight
code snippets in the `Structured Streaming Kafka010 integration doc` and the
`Spark Streaming Kafka010 integration doc`.
This patch consists of two commits:
- the first commit fixes only the leading spaces -- this is large
- the second commit adds the highlight instructions -- this is much simpler
and easier to review
## How was this patch tested?
SKIP_API=1 jekyll build
## Screenshots
**Before** and **After** screenshots are attached to the original PR.
Author: Liwei Lin <[email protected]>
Closes #15715 from lw-lin/doc-highlight-code-snippet.
(cherry picked from commit 98ede49496d0d7b4724085083d4f24436b92a7bf)
Signed-off-by: Sean Owen <[email protected]>
commit 176afa5e8b207e28a16e1b22280ed05c10b7b486
Author: Sean Owen <[email protected]>
Date: 2016-11-02T09:39:15Z
[SPARK-18076][CORE][SQL] Fix default Locale used in DateFormat,
NumberFormat to Locale.US
## What changes were proposed in this pull request?
Fix the locale to `Locale.US` for all usages of `DateFormat` and `NumberFormat`.
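A hedged sketch of the pattern applied across those call sites: pin the locale explicitly instead of relying on the JVM default, so parsing and formatting behave the same on every machine.
```scala
import java.text.{NumberFormat, SimpleDateFormat}
import java.util.{Date, Locale}

object LocaleDemo {
  def main(args: Array[String]): Unit = {
    // Before: formats built without a locale used the platform default,
    // so output (decimal separators, month names, ...) varied by machine.
    val dateFormat = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss", Locale.US)
    val numberFormat = NumberFormat.getInstance(Locale.US)
    println(dateFormat.format(new Date(0L)))    // stable, locale-independent output
    println(numberFormat.format(1234567.891))   // "1,234,567.891"
  }
}
```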
## How was this patch tested?
Existing tests.
Author: Sean Owen <[email protected]>
Closes #15610 from srowen/SPARK-18076.
(cherry picked from commit 9c8deef64efee20a0ddc9b612f90e77c80aede60)
Signed-off-by: Sean Owen <[email protected]>
commit 41491e54080742f6e4a1e80a72cd9f46a9336e31
Author: eyal farago <eyal farago>
Date: 2016-11-02T10:12:20Z
[SPARK-16839][SQL] Simplify Struct creation code path
## What changes were proposed in this pull request?
Simplify struct creation, especially the aspect of `CleanupAliases` which
missed some aliases when handling trees created by `CreateStruct`.
This PR includes:
1. A failing test (create struct with nested aliases, some of the aliases
survive `CleanupAliases`).
2. A fix that transforms `CreateStruct` into a `CreateNamedStruct`
constructor, effectively eliminating `CreateStruct` from all expression trees.
3. A `NamePlaceHolder` used by `CreateStruct` when column names cannot be
extracted from unresolved `NamedExpression`.
4. A new Analyzer rule that resolves `NamePlaceHolder` into a string
literal once the `NamedExpression` is resolved.
5. `CleanupAliases` code was simplified as it no longer has to deal with
`CreateStruct`'s top level columns.
## How was this patch tested?
Ran all test suites in the org.apache.spark.sql package, especially the
analysis suite, making sure the added test initially fails and that the entire
analysis package passes after applying the suggested fix.
Modified a few tests that expected `CreateStruct`, which is now transformed
into `CreateNamedStruct`.
Author: eyal farago <eyal farago>
Author: Herman van Hovell <[email protected]>
Author: eyal farago <[email protected]>
Author: Eyal Farago <[email protected]>
Author: Hyukjin Kwon <[email protected]>
Author: eyalfa <[email protected]>
Closes #15718 from hvanhovell/SPARK-16839-2.
(cherry picked from commit f151bd1af8a05d4b6c901ebe6ac0b51a4a1a20df)
Signed-off-by: Herman van Hovell <[email protected]>
commit 9be069125f7e94df9d862f307b87965baf9416e3
Author: Takeshi YAMAMURO <[email protected]>
Date: 2016-11-02T18:29:26Z
[SPARK-17683][SQL] Support ArrayType in Literal.apply
## What changes were proposed in this pull request?
This PR adds pattern-matching entries for array data in `Literal.apply`.
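A hedged sketch of what this enables (a Catalyst-internal API, so mainly relevant when constructing expressions directly):
```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// With the new pattern-matching entries, a Scala Array can be turned into a
// Catalyst literal directly; the resulting dataType is an ArrayType.
val arrayLiteral = Literal(Array(1, 2, 3))
println(arrayLiteral.dataType)
```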
## How was this patch tested?
Added tests in `LiteralExpressionSuite`.
Author: Takeshi YAMAMURO <[email protected]>
Closes #15257 from maropu/SPARK-17683.
(cherry picked from commit 4af0ce2d96de3397c9bc05684cad290a52486577)
Signed-off-by: Reynold Xin <[email protected]>
commit a885d5bbce9dba66b394850b3aac51ae97cb18dd
Author: buzhihuojie <[email protected]>
Date: 2016-11-02T18:36:20Z
[SPARK-17895] Improve doc for rangeBetween and rowsBetween
## What changes were proposed in this pull request?
Copied the description of row- and range-based frame boundaries from
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/execution/window/WindowExec.scala#L56
and added examples to show the different behavior of rangeBetween and
rowsBetween when duplicate values are involved (see the sketch below).
Please review
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before
opening a pull request.
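A hedged sketch of that duplicate-value difference (the column name and data are illustrative):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder().master("local[*]").appName("frame-demo").getOrCreate()
import spark.implicits._

val df = Seq(1, 1, 2, 3).toDF("v")
// Frames from the start of the partition up to the current row.
val rowsFrame  = Window.orderBy("v").rowsBetween(Long.MinValue, 0)
val rangeFrame = Window.orderBy("v").rangeBetween(Long.MinValue, 0)

df.select(
  $"v",
  sum($"v").over(rowsFrame).as("rows_sum"),   // counts each duplicate row separately
  sum($"v").over(rangeFrame).as("range_sum")  // treats the two 1s as peers, so both see 2
).show()
```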
Author: buzhihuojie <[email protected]>
Closes #15727 from david-weiluo-ren/improveDocForRangeAndRowsBetween.
(cherry picked from commit 742e0fea5391857964e90d396641ecf95cac4248)
Signed-off-by: Reynold Xin <[email protected]>
commit 0093257ea94d3a197ca061b54c04685d7c1f616a
Author: Xiangrui Meng <[email protected]>
Date: 2016-11-02T18:41:49Z
[SPARK-14393][SQL] values generated by non-deterministic functions
shouldn't change after coalesce or union
## What changes were proposed in this pull request?
When a user appended a column using a "nondeterministic" function to a
DataFrame, e.g., `rand`, `randn`, and `monotonically_increasing_id`, the
expected semantic is the following:
- The value in each row should remain unchanged, as if we materialize the
column immediately, regardless of later DataFrame operations.
However, since we use `TaskContext.getPartitionId` to get the partition
index from the current thread, the values from nondeterministic columns might
change if we call `union` or `coalesce` after. `TaskContext.getPartitionId`
returns the partition index of the current Spark task, which might not be the
corresponding partition index of the DataFrame where we defined the column.
See the unit tests below or JIRA for examples.
This PR uses the partition index from `RDD.mapPartitionWithIndex` instead
of `TaskContext` and fixes the partition initialization logic in whole-stage
codegen, normal codegen, and codegen fallback.
`initializeStatesForPartition(partitionIndex: Int)` was added to `Projection`,
`Nondeterministic`, and `Predicate` (codegen) and initialized right after
object creation in `mapPartitionWithIndex`. `newPredicate` now returns a
`Predicate` instance rather than a function for proper initialization.
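A hedged illustration of the semantics being fixed (a usage sketch, not the patch itself):
```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.monotonically_increasing_id

val spark = SparkSession.builder().master("local[4]").appName("nondeterministic-demo").getOrCreate()

// Append a nondeterministic column, then reuse the DataFrame after coalesce.
val withId = spark.range(0, 8, 1, 4).withColumn("id2", monotonically_increasing_id())

// Expected semantics: id2 behaves as if it were materialized when defined, so
// the two outputs below should agree. Before the fix, coalesce changed the
// partition index seen by the expression and the values could differ.
withId.show()
withId.coalesce(1).show()
```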
## How was this patch tested?
Unit tests. (Actually I'm not very confident that this PR fixed all issues
without introducing new ones ...)
cc: rxin davies
Author: Xiangrui Meng <[email protected]>
Closes #15567 from mengxr/SPARK-14393.
(cherry picked from commit 02f203107b8eda1f1576e36c4f12b0e3bc5e910e)
Signed-off-by: Reynold Xin <[email protected]>
commit bd3ea6595788a4fe5399e6c6c666618d8cb6872c
Author: Jeff Zhang <[email protected]>
Date: 2016-11-02T18:47:45Z
[SPARK-18160][CORE][YARN] spark.files & spark.jars should not be passed to
driver in yarn mode
## What changes were proposed in this pull request?
spark.files is still passed to the driver in yarn mode, so SparkContext will
still handle it, which causes the error described in the JIRA.
## How was this patch tested?
Tested manually in a 5-node cluster. As this issue only happens in a multi-node
cluster, I didn't write a test for it.
Author: Jeff Zhang <[email protected]>
Closes #15669 from zjffdu/SPARK-18160.
(cherry picked from commit 3c24299b71e23e159edbb972347b13430f92a465)
Signed-off-by: Marcelo Vanzin <[email protected]>
commit 1eef8e5cd09dfb8b77044ef9864321618e8ea8c8
Author: Steve Loughran <[email protected]>
Date: 2016-11-02T18:52:29Z
[SPARK-17058][BUILD] Add maven snapshots-and-staging profile to build/test
against staging artifacts
## What changes were proposed in this pull request?
Adds a `snapshots-and-staging` profile so that RCs of projects like Hadoop
and HBase can be used in developer-only build and test runs. There's a comment
above the profile telling people not to use this in production.
There's no attempt to do the same for SBT, as Ivy is different.
## How was this patch tested?
Tested by building against the Hadoop 2.7.3 RC1 JARs. Without the profile (and
without any local copy of the 2.7.3 artifacts), the build failed:
```
mvn install -DskipTests -Pyarn,hadoop-2.7,hive -Dhadoop.version=2.7.3
...
[INFO] ------------------------------------------------------------------------
[INFO] Building Spark Project Launcher 2.1.0-SNAPSHOT
[INFO] ------------------------------------------------------------------------
Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.pom
[WARNING] The POM for org.apache.hadoop:hadoop-client:jar:2.7.3 is missing, no dependency information available
Downloading: https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-client/2.7.3/hadoop-client-2.7.3.jar
[INFO] ------------------------------------------------------------------------
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM ........................... SUCCESS [  4.482 s]
[INFO] Spark Project Tags ................................. SUCCESS [ 17.402 s]
[INFO] Spark Project Sketch ............................... SUCCESS [ 11.252 s]
[INFO] Spark Project Networking ........................... SUCCESS [ 13.458 s]
[INFO] Spark Project Shuffle Streaming Service ............ SUCCESS [  9.043 s]
[INFO] Spark Project Unsafe ............................... SUCCESS [ 16.027 s]
[INFO] Spark Project Launcher ............................. FAILURE [  1.653 s]
[INFO] Spark Project Core ................................. SKIPPED
...
```
With the profile, the build completed
```
mvn install -DskipTests -Pyarn,hadoop-2.7,hive,snapshots-and-staging -Dhadoop.version=2.7.3
```
Author: Steve Loughran <[email protected]>
Closes #14646 from steveloughran/stevel/SPARK-17058-support-asf-snapshots.
(cherry picked from commit 37d95227a21de602b939dae84943ba007f434513)
Signed-off-by: Reynold Xin <[email protected]>
commit 2aff2ea81d260a47e7762b2990ed62a91e5d0198
Author: Reynold Xin <[email protected]>
Date: 2016-11-02T22:53:02Z
[SPARK-18214][SQL] Simplify RuntimeReplaceable type coercion
## What changes were proposed in this pull request?
RuntimeReplaceable is used to create aliases for expressions, but the way
it deals with type coercion is pretty weird (each expression is responsible for
how to handle type coercion, which does not obey the normal implicit type cast
rules).
This patch simplifies its handling by allowing the analyzer to traverse
into the actual expression of a RuntimeReplaceable.
## How was this patch tested?
- Correctness should be guaranteed by existing unit tests already
- Removed SQLCompatibilityFunctionSuite and moved it to
sql-compatibility-functions.sql
- Added a new test case in sql-compatibility-functions.sql for verifying
explain behavior.
Author: Reynold Xin <[email protected]>
Closes #15723 from rxin/SPARK-18214.
(cherry picked from commit fd90541c35af2bccf0155467bec8cea7c8865046)
Signed-off-by: Reynold Xin <[email protected]>
commit 5ea2f9e5e449c02f77635918bfcc7ba7193c97a2
Author: Wenchen Fan <[email protected]>
Date: 2016-11-03T01:05:14Z
[SPARK-17470][SQL] unify path for data source table and locationUri for
hive serde table
## What changes were proposed in this pull request?
Due to a limitation of the Hive metastore (a table location must be a directory
path, not a file path), we always store `path` for data source tables in storage
properties instead of in the `locationUri` field. However, we should not expose
this difference at the `CatalogTable` level, but just treat it as a hack in
`HiveExternalCatalog`, like how we store the table schema of data source tables
in table properties.
This PR unifies `path` and `locationUri` outside of `HiveExternalCatalog`,
both data source table and hive serde table should use the `locationUri` field.
This PR also unifies the way we handle default table location for managed
table. Previously, the default table location of hive serde managed table is
set by external catalog, but the one of data source table is set by command.
After this PR, we follow the hive way and the default table location is always
set by external catalog.
For managed non-file-based tables, we will assign a default table location
and create an empty directory for it; the table location will be removed when
the table is dropped. This is reasonable, as the metastore doesn't care whether
a table is file-based or not, and an empty table directory does no harm.
For external non-file-based tables, ideally we can omit the table location,
but due to a hive metastore issue, we will assign a random location to it, and
remove it right after the table is created. See SPARK-15269 for more details.
This is fine as it's well isolated in `HiveExternalCatalog`.
To keep the existing behaviour of the `path` option, in this PR we always
add the `locationUri` to storage properties using key `path`, before passing
storage properties to `DataSource` as data source options.
## How was this patch tested?
existing tests.
Author: Wenchen Fan <[email protected]>
Closes #15024 from cloud-fan/path.
(cherry picked from commit 3a1bc6f4780f8384c1211b1335e7394a4a28377e)
Signed-off-by: Yin Huai <[email protected]>
commit 1e29f0a0d2772efc5e9cdc9727847388a87547d4
Author: hyukjinkwon <[email protected]>
Date: 2016-11-03T03:56:30Z
[SPARK-17963][SQL][DOCUMENTATION] Add examples (extend) in each expression
and improve documentation
## What changes were proposed in this pull request?
This PR proposes to change the documentation for functions. Please refer to
the discussion at https://github.com/apache/spark/pull/15513
The changes include
- Re-indent the documentation
- Add examples/arguments in `extended` where the arguments are multiple or
have a specific format (e.g. xml/json).
For examples, the documentation was updated as below:
### Functions with single line usage
**Before**
- `pow`
``` sql
Usage: pow(x1, x2) - Raise x1 to the power of x2.
Extended Usage:
> SELECT pow(2, 3);
8.0
```
- `current_timestamp`
``` sql
Usage: current_timestamp() - Returns the current timestamp at the start
of query evaluation.
Extended Usage:
No example for current_timestamp.
```
**After**
- `pow`
``` sql
Usage: pow(expr1, expr2) - Raises `expr1` to the power of `expr2`.
Extended Usage:
Examples:
> SELECT pow(2, 3);
8.0
```
- `current_timestamp`
``` sql
Usage: current_timestamp() - Returns the current timestamp at the start
of query evaluation.
Extended Usage:
No example/argument for current_timestamp.
```
### Functions with (already) multiple line usage
**Before**
- `approx_count_distinct`
``` sql
Usage: approx_count_distinct(expr) - Returns the estimated cardinality by
HyperLogLog++.
approx_count_distinct(expr, relativeSD=0.05) - Returns the estimated
cardinality by HyperLogLog++
with relativeSD, the maximum estimation error allowed.
Extended Usage:
No example for approx_count_distinct.
```
- `percentile_approx`
``` sql
Usage:
percentile_approx(col, percentage [, accuracy]) - Returns the
approximate percentile value of numeric
column `col` at the given percentage. The value of percentage must
be between 0.0
and 1.0. The `accuracy` parameter (default: 10000) is a positive
integer literal which
controls approximation accuracy at the cost of memory. Higher value
of `accuracy` yields
better accuracy, `1.0/accuracy` is the relative error of the
approximation.
percentile_approx(col, array(percentage1 [, percentage2]...) [,
accuracy]) - Returns the approximate
percentile array of column `col` at the given percentage array.
Each value of the
percentage array must be between 0.0 and 1.0. The `accuracy`
parameter (default: 10000) is
a positive integer literal which controls approximation accuracy at
the cost of memory.
Higher value of `accuracy` yields better accuracy, `1.0/accuracy`
is the relative error of
the approximation.
Extended Usage:
No example for percentile_approx.
```
**After**
- `approx_count_distinct`
``` sql
Usage:
approx_count_distinct(expr[, relativeSD]) - Returns the estimated
cardinality by HyperLogLog++.
`relativeSD` defines the maximum estimation error allowed.
Extended Usage:
No example/argument for approx_count_distinct.
```
- `percentile_approx`
``` sql
Usage:
percentile_approx(col, percentage [, accuracy]) - Returns the
approximate percentile value of numeric
column `col` at the given percentage. The value of percentage must
be between 0.0
and 1.0. The `accuracy` parameter (default: 10000) is a positive
numeric literal which
controls approximation accuracy at the cost of memory. Higher value
of `accuracy` yields
better accuracy, `1.0/accuracy` is the relative error of the
approximation.
When `percentage` is an array, each value of the percentage array
must be between 0.0 and 1.0.
In this case, returns the approximate percentile array of column
`col` at the given
percentage array.
Extended Usage:
Examples:
> SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]
> SELECT percentile_approx(10.0, 0.5, 100);
10.0
```
## How was this patch tested?
Manually tested
**When examples are multiple**
``` sql
spark-sql> describe function extended reflect;
Function: reflect
Class: org.apache.spark.sql.catalyst.expressions.CallMethodViaReflection
Usage: reflect(class, method[, arg1[, arg2 ..]]) - Calls a method with
reflection.
Extended Usage:
Examples:
> SELECT reflect('java.util.UUID', 'randomUUID');
c33fb387-8500-4bfa-81d2-6e0e3e930df2
> SELECT reflect('java.util.UUID', 'fromString',
'a5cf6c42-0c85-418f-af6c-3e4e5b1328f2');
a5cf6c42-0c85-418f-af6c-3e4e5b1328f2
```
**When `Usage` is in single line**
``` sql
spark-sql> describe function extended min;
Function: min
Class: org.apache.spark.sql.catalyst.expressions.aggregate.Min
Usage: min(expr) - Returns the minimum value of `expr`.
Extended Usage:
No example/argument for min.
```
**When `Usage` is already in multiple lines**
``` sql
spark-sql> describe function extended percentile_approx;
Function: percentile_approx
Class:
org.apache.spark.sql.catalyst.expressions.aggregate.ApproximatePercentile
Usage:
percentile_approx(col, percentage [, accuracy]) - Returns the
approximate percentile value of numeric
column `col` at the given percentage. The value of percentage must be
between 0.0
and 1.0. The `accuracy` parameter (default: 10000) is a positive
numeric literal which
controls approximation accuracy at the cost of memory. Higher value
of `accuracy` yields
better accuracy, `1.0/accuracy` is the relative error of the
approximation.
When `percentage` is an array, each value of the percentage array
must be between 0.0 and 1.0.
In this case, returns the approximate percentile array of column
`col` at the given
percentage array.
Extended Usage:
Examples:
> SELECT percentile_approx(10.0, array(0.5, 0.4, 0.1), 100);
[10.0,10.0,10.0]
> SELECT percentile_approx(10.0, 0.5, 100);
10.0
```
**When example/argument is missing**
``` sql
spark-sql> describe function extended rank;
Function: rank
Class: org.apache.spark.sql.catalyst.expressions.Rank
Usage:
rank() - Computes the rank of a value in a group of values. The result
is one plus the number
of rows preceding or equal to the current row in the ordering of the
partition. The values
will produce gaps in the sequence.
Extended Usage:
No example/argument for rank.
```
Author: hyukjinkwon <[email protected]>
Closes #15677 from HyukjinKwon/SPARK-17963-1.
(cherry picked from commit 7eb2ca8e338e04034a662920261e028f56b07395)
Signed-off-by: gatorsmile <[email protected]>
commit 2cf39d63833ea0bf2a4c66c259409ee7808fdab6
Author: gatorsmile <[email protected]>
Date: 2016-11-03T04:01:03Z
[SPARK-18175][SQL] Improve the test case coverage of implicit type casting
### What changes were proposed in this pull request?
So far, we have limited test case coverage of implicit type casting. We
need to draw a matrix to find all the possible casting pairs.
- Reorganized the existing test cases
- Added all the possible type casting pairs
- Drew a matrix to show the implicit type casting. The table is very wide and
may be hard to review; you can also access the same table via the link to [a
google sheet](https://docs.google.com/spreadsheets/d/19PS4ikrs-Yye_mfu-rmIKYGnNe-NmOTt5DDT1fOD3pI/edit?usp=sharing).
SourceType\CastToType | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType | NumericType | IntegralType
------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------ | ------------
**ByteType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(3, 0) | ByteType | ByteType
**ShortType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(5, 0) | ShortType | ShortType
**IntegerType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(10, 0) | IntegerType | IntegerType
**LongType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(20, 0) | LongType | LongType
**DoubleType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(30, 15) | DoubleType | IntegerType
**FloatType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(14, 7) | FloatType | IntegerType
**Dec(10, 2)** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | X | X | StringType | X | X | X | X | X | X | X | DecimalType(10, 2) | Dec(10, 2) | IntegerType
**BinaryType** | X | X | X | X | X | X | X | BinaryType | X | StringType | X | X | X | X | X | X | X | X | X | X
**BooleanType** | X | X | X | X | X | X | X | X | BooleanType | StringType | X | X | X | X | X | X | X | X | X | X
**StringType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | X | StringType | DateType | TimestampType | X | X | X | X | X | DecimalType(38, 18) | DoubleType | X
**DateType** | X | X | X | X | X | X | X | X | X | StringType | DateType | TimestampType | X | X | X | X | X | X | X | X
**TimestampType** | X | X | X | X | X | X | X | X | X | StringType | DateType | TimestampType | X | X | X | X | X | X | X | X
**ArrayType** | X | X | X | X | X | X | X | X | X | X | X | X | ArrayType* | X | X | X | X | X | X | X
**MapType** | X | X | X | X | X | X | X | X | X | X | X | X | X | MapType* | X | X | X | X | X | X
**StructType** | X | X | X | X | X | X | X | X | X | X | X | X | X | X | StructType* | X | X | X | X | X
**NullType** | ByteType | ShortType | IntegerType | LongType | DoubleType | FloatType | Dec(10, 2) | BinaryType | BooleanType | StringType | DateType | TimestampType | ArrayType | MapType | StructType | NullType | CalendarIntervalType | DecimalType(38, 18) | DoubleType | IntegerType
**CalendarIntervalType** | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | X | CalendarIntervalType | X | X | X
Note: ArrayType\*, MapType\*, StructType\* are castable only when the
internal child types also match; otherwise, not castable
### How was this patch tested?
N/A
Author: gatorsmile <[email protected]>
Closes #15691 from gatorsmile/implicitTypeCasting.
(cherry picked from commit 9ddec8636c4f5e8c4592aefecec9886b409ced8f)
Signed-off-by: gatorsmile <[email protected]>
commit 965c964c2657aaf575f0e00ce6b74a8f05172c06
Author: Dongjoon Hyun <[email protected]>
Date: 2016-11-03T06:50:50Z
[SPARK-18200][GRAPHX] Support zero as an initial capacity in OpenHashSet
## What changes were proposed in this pull request?
[SPARK-18200](https://issues.apache.org/jira/browse/SPARK-18200) reports
Apache Spark 2.x raises `java.lang.IllegalArgumentException: requirement
failed: Invalid initial capacity` while running `triangleCount`. The root cause
is that `VertexSet`, a type alias of `OpenHashSet`, does not allow zero as an
initial size. This PR loosens the restriction to allow zero.
## How was this patch tested?
Pass the Jenkins test with a new test case in `OpenHashSetSuite`.
Author: Dongjoon Hyun <[email protected]>
Closes #15741 from dongjoon-hyun/SPARK-18200.
(cherry picked from commit d24e736471f34ef8f2c12766393379c4213fe96e)
Signed-off-by: Reynold Xin <[email protected]>
commit c4c5328f2ab2ddb2137e575865ced93c6bc624b1
Author: Daoyuan Wang <[email protected]>
Date: 2016-11-03T07:18:03Z
[SPARK-17122][SQL] support drop current database
## What changes were proposed in this pull request?
In Spark 1.6 and earlier, we could drop the database we are using. In Spark
2.0, the native implementation prevents us from dropping the current database,
which may break some old queries. This PR re-enables the feature.
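A hedged sketch of the restored behaviour (the database name is made up):
```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("drop-current-db").getOrCreate()

spark.sql("CREATE DATABASE tmpdb")
spark.sql("USE tmpdb")
// In Spark 2.0 this failed because tmpdb is the current database; this change
// re-enables it, matching the Spark 1.6 behaviour.
spark.sql("DROP DATABASE tmpdb")
spark.sql("USE default")
```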
## How was this patch tested?
one new unit test in `SessionCatalogSuite`.
Author: Daoyuan Wang <[email protected]>
Closes #15011 from adrian-wang/dropcurrent.
(cherry picked from commit 96cc1b5675273c276e04c4dc19ef9033a314292d)
Signed-off-by: gatorsmile <[email protected]>
commit bc7f05f5f03653c623190b8178bcbe981a41c2f3
Author: Reynold Xin <[email protected]>
Date: 2016-11-03T09:42:48Z
[SPARK-18219] Move commit protocol API (internal) from sql/core to core
module
## What changes were proposed in this pull request?
This patch moves the new commit protocol API from sql/core to core module,
so we can use it in the future in the RDD API.
As part of this patch, I also moved the specification of the random UUID for
the write path out of the commit protocol, and instead pass in a job id.
## How was this patch tested?
N/A
Author: Reynold Xin <[email protected]>
Closes #15731 from rxin/SPARK-18219.
(cherry picked from commit 937af592e65f4dd878aafcabf8fe2cfe7fa3d9b3)
Signed-off-by: Reynold Xin <[email protected]>
commit 71104c9c97a648c94e6619279ad49752c01c89c3
Author: Reynold Xin <[email protected]>
Date: 2016-11-03T09:45:54Z
[SQL] minor - internal doc improvement for InsertIntoTable.
## What changes were proposed in this pull request?
I was reading this part of the code and was really confused by the
"partition" parameter. This patch adds some documentation for it to reduce
confusion in the future.
I also looked around other logical plans but most of them are either
already documented, or pretty self-evident to people that know Spark SQL.
## How was this patch tested?
N/A - doc change only.
Author: Reynold Xin <[email protected]>
Closes #15749 from rxin/doc-improvement.
(cherry picked from commit 0ea5d5b24c1f7b29efeac0e72d271aba279523f7)
Signed-off-by: Reynold Xin <[email protected]>
commit 99891e56ea286580323fd82e303064d3c0730d85
Author: Zheng RuiFeng <[email protected]>
Date: 2016-11-03T14:45:20Z
[SPARK-18177][ML][PYSPARK] Add missing 'subsamplingRate' of pyspark
GBTClassifier
## What changes were proposed in this pull request?
Add missing 'subsamplingRate' of pyspark GBTClassifier
## How was this patch tested?
existing tests
Author: Zheng RuiFeng <[email protected]>
Closes #15692 from zhengruifeng/gbt_subsamplingRate.
(cherry picked from commit 9dc9f9a5dde37d085808a264cfb9cf4d4f72417d)
Signed-off-by: Yanbo Liang <[email protected]>
commit c2876bfbf06fe1057c4236128d41782c61685c53
Author: gatorsmile <[email protected]>
Date: 2016-11-03T15:35:36Z
[SPARK-17981][SPARK-17957][SQL] Fix Incorrect Nullability Setting to False
in FilterExec
### What changes were proposed in this pull request?
When `FilterExec` contains `isNotNull`, which could be inferred and pushed
down or specified by users, we convert the nullability of the involved columns
if the top-level expression is null-intolerant. However, this is not correct:
if the top-level expression is not a leaf expression, it could still tolerate
nulls when it has null-tolerant child expressions.
For example, consider `cast(coalesce(a#5, a#15) as double)`. Although `cast` is
a null-intolerant expression, `coalesce` is obviously null-tolerant and can
therefore absorb nulls.
When the nullability is wrong, we could generate incorrect results in
different cases. For example,
``` Scala
val df1 = Seq((1, 2), (2, 3)).toDF("a", "b")
val df2 = Seq((2, 5), (3, 4)).toDF("a", "c")
val joinedDf = df1.join(df2, Seq("a"), "outer").na.fill(0)
val df3 = Seq((3, 1)).toDF("a", "d")
joinedDf.join(df3, "a").show
```
The optimized plan is like
```
Project [a#29, b#30, c#31, d#42]
+- Join Inner, (a#29 = a#41)
:- Project [cast(coalesce(cast(coalesce(a#5, a#15) as double), 0.0) as
int) AS a#29, cast(coalesce(cast(b#6 as double), 0.0) as int) AS b#30,
cast(coalesce(cast(c#16 as double), 0.0) as int) AS c#31]
: +- Filter isnotnull(cast(coalesce(cast(coalesce(a#5, a#15) as
double), 0.0) as int))
: +- Join FullOuter, (a#5 = a#15)
: :- LocalRelation [a#5, b#6]
: +- LocalRelation [a#15, c#16]
+- LocalRelation [a#41, d#42]
```
Without the fix, it returns an empty result. With the fix, it can return a
correct answer:
```
+---+---+---+---+
| a| b| c| d|
+---+---+---+---+
| 3| 0| 4| 1|
+---+---+---+---+
```
### How was this patch tested?
Added test cases to verify the nullability changes in FilterExec. Also
added a test case for verifying the reported incorrect result.
Author: gatorsmile <[email protected]>
Closes #15523 from gatorsmile/nullabilityFilterExec.
(cherry picked from commit 66a99f4a411ee7dc94ff1070a8fd6865fd004093)
Signed-off-by: Herman van Hovell <[email protected]>
commit 4f91630c8100ee3a6fd168bc4247ca6fadd0a736
Author: Reynold Xin <[email protected]>
Date: 2016-11-03T18:48:05Z
[SPARK-18244][SQL] Rename partitionProviderIsHive ->
tracksPartitionsInCatalog
## What changes were proposed in this pull request?
This patch renames partitionProviderIsHive to tracksPartitionsInCatalog, as
the old name was too Hive specific.
## How was this patch tested?
Should be covered by existing tests.
Author: Reynold Xin <[email protected]>
Closes #15750 from rxin/SPARK-18244.
(cherry picked from commit b17057c0a69b9c56e503483d97f5dc209eef0884)
Signed-off-by: Reynold Xin <[email protected]>
commit 3e139e2390085cfb42f7136f150b0fa08c14eb61
Author: 福星 <[email protected]>
Date: 2016-11-03T19:02:01Z
[SPARK-18237][HIVE] hive.exec.stagingdir have no effect
hive.exec.stagingdir has no effect in Spark 2.0.1.
Hive confs in hive-site.xml are loaded into `hadoopConf`, so we should
use `hadoopConf` in `InsertIntoHiveTable` instead of `SessionState.conf`.
Author: 福星 <[email protected]>
Closes #15744 from ClassNotFoundExp/master.
(cherry picked from commit 16293311cdb25a62733a9aae4355659b971a3ce1)
Signed-off-by: Reynold Xin <[email protected]>
commit 569f77a11819523bdf5dc2c6429fc3399cbb6519
Author: Kishor Patil <[email protected]>
Date: 2016-11-03T21:10:26Z
[SPARK-18099][YARN] Fail if same files added to distributed cache for
--files and --archives
## What changes were proposed in this pull request?
During spark-submit, if the YARN dist cache is instructed to add the same file
under both --files and --archives, this code change ensures the Spark YARN
distributed cache behaviour is retained, i.e. it warns and fails if the same
file is mentioned in both --files and --archives.
## How was this patch tested?
Manually tested:
1. If the same jar is mentioned in --jars and --files, it will continue to
submit the job.
- basically, the functionality of [SPARK-14423] #12203 is unchanged
2. If the same file is mentioned in --files and --archives, it will fail to
submit the job.
Please review
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark before
opening a pull request.
… under archives and files
Author: Kishor Patil <[email protected]>
Closes #15627 from kishorvpatil/spark18099.
(cherry picked from commit 098e4ca9c7af61e64839a50c65be449749af6482)
Signed-off-by: Tom Graves <[email protected]>
commit 2daca62cd342203694f22232ceb026dcaf56d3d5
Author: cody koeninger <[email protected]>
Date: 2016-11-03T21:43:25Z
[SPARK-18212][SS][KAFKA] increase executor poll timeout
## What changes were proposed in this pull request?
Increase poll timeout to try and address flaky test
## How was this patch tested?
Ran existing unit tests
Author: cody koeninger <[email protected]>
Closes #15737 from koeninger/SPARK-18212.
(cherry picked from commit 67659c9afaeb2289e56fd87fafee953e8f050383)
Signed-off-by: Michael Armbrust <[email protected]>
commit af60b1ebbf5cb91dc724aad9d3d7476ce9085ac9
Author: Reynold Xin <[email protected]>
Date: 2016-11-03T22:30:45Z
[SPARK-18257][SS] Improve error reporting for FileStressSuite
## What changes were proposed in this pull request?
This patch improves error reporting for FileStressSuite, when there is an
error in Spark itself (not user code). This works by simply tightening the
exception verification, and gets rid of the unnecessary thread for starting the
stream.
Also renamed the class to FileStreamStressSuite to make it more obvious that it
is a streaming suite.
## How was this patch tested?
This is a test only change and I manually verified error reporting by
injecting some bug in the addBatch code for FileStreamSink.
Author: Reynold Xin <[email protected]>
Closes #15757 from rxin/SPARK-18257.
(cherry picked from commit f22954ad49bf5a32c7b6d8487cd38ffe0da904ca)
Signed-off-by: Reynold Xin <[email protected]>
commit 37550c49218e1890f8adc10c9549a23dc072e21f
Author: Sean Owen <[email protected]>
Date: 2016-11-04T00:27:23Z
[SPARK-18138][DOCS] Document that Java 7, Python 2.6, Scala 2.10, Hadoop <
2.6 are deprecated in Spark 2.1.0
## What changes were proposed in this pull request?
Document that Java 7, Python 2.6, Scala 2.10, Hadoop < 2.6 are deprecated
in Spark 2.1.0. This does not actually implement any of the change in
SPARK-18138, just peppers the documentation with notices about it.
## How was this patch tested?
Doc build
Author: Sean Owen <[email protected]>
Closes #15733 from srowen/SPARK-18138.
(cherry picked from commit dc4c60098641cf64007e2f0e36378f000ad5f6b1)
Signed-off-by: Reynold Xin <[email protected]>
commit 91d567150b305d05acb8543da5cbf21df244352d
Author: Herman van Hovell <[email protected]>
Date: 2016-11-04T04:59:59Z
[SPARK-18259][SQL] Do not capture Throwable in QueryExecution
## What changes were proposed in this pull request?
`QueryExecution.toString` currently captures `java.lang.Throwable`s; this
is far from a best practice and can lead to confusing situations or invalid
application states. This PR fixes this by only capturing `AnalysisException`s.
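A minimal, hedged sketch of the pattern (simplified, not the literal diff): only the expected AnalysisException is swallowed while rendering plan strings; anything else propagates.
```scala
import org.apache.spark.sql.AnalysisException

// Hypothetical helper mirroring the idea behind the change.
def planToString(render: () => String): String =
  try {
    render()
  } catch {
    // An unresolved or invalid plan is an expected condition worth printing;
    // arbitrary Throwables are no longer captured, so real bugs surface.
    case e: AnalysisException => e.toString
  }
```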
## How was this patch tested?
Added a `QueryExecutionSuite`.
Author: Herman van Hovell <[email protected]>
Closes #15760 from hvanhovell/SPARK-18259.
(cherry picked from commit aa412c55e31e61419d3de57ef4b13e50f9b38af0)
Signed-off-by: Reynold Xin <[email protected]>
----