GitHub user ShuaiW opened a pull request:

    https://github.com/apache/spark/pull/18720

    Spark 2.2.0: pip pyspark doesn't work well in notebook on Windows 10; got 
Exception: Java gateway process exited before sending the driver its port number

    - Step 1: `pip install pyspark`
    - Step 2: `jupyter notebook`
    - Step 3: initialize a Spark Context
    
    ```python
    from pyspark import SparkContext
    sc = SparkContext()
    ```
    But I got the error `Exception: Java gateway process exited before sending 
the driver its port number`. 
    
    One workaround is to point the SPARK_HOME variable at a pre-built Spark package
    directory (e.g., `C:\spark-2.2.0-bin-hadoop2.7`); with that set, step 3 works and
    a SparkContext is initialized. I compared the pyspark folder installed by pip with
    the pre-built package from the download page and found that the file structures
    differ slightly, which is likely the cause of the problem.
    
    Oddly, steps 1-3 work fine on Mac OS 10.11.6.
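
    One way to apply that workaround from inside the notebook, before creating the
    context (the path below is just an example; use wherever the pre-built package
    was extracted):

    ```python
    import os

    # Point SPARK_HOME at a pre-built Spark distribution instead of the pip layout.
    os.environ["SPARK_HOME"] = r"C:\spark-2.2.0-bin-hadoop2.7"

    from pyspark import SparkContext
    sc = SparkContext()
    ```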

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.2

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/18720.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #18720
    
----
commit 6c5b7e106895302a87cf6522d3c64c3badac699f
Author: Felix Cheung <[email protected]>
Date:   2017-05-08T06:10:18Z

    [SPARK-20626][SPARKR] address date test warning with timezone on windows
    
    ## What changes were proposed in this pull request?
    
    set timezone on windows
    
    ## How was this patch tested?
    
    unit test, AppVeyor
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17892 from felixcheung/rtimestamptest.
    
    (cherry picked from commit c24bdaab5a234d18b273544cefc44cc4005bf8fc)
    Signed-off-by: Felix Cheung <[email protected]>

commit d8a5a0d3420abbb911d8a80dc7165762eb08d779
Author: Wayne Zhang <[email protected]>
Date:   2017-05-08T06:16:30Z

    [SPARKR][DOC] fix typo in vignettes
    
    ## What changes were proposed in this pull request?
    Fix typo in vignettes
    
    Author: Wayne Zhang <[email protected]>
    
    Closes #17884 from actuaryzhang/typo.
    
    (cherry picked from commit 2fdaeb52bbe2ed1a9127ac72917286e505303c85)
    Signed-off-by: Felix Cheung <[email protected]>

commit 7b9d05ad00455daa53ae4ef1a602a6c64c2c95a4
Author: Nick Pentreath <[email protected]>
Date:   2017-05-08T10:45:00Z

    [SPARK-20596][ML][TEST] Consolidate and improve ALS recommendAll test cases
    
    Existing test cases for `recommendForAllX` methods (added in 
[SPARK-19535](https://issues.apache.org/jira/browse/SPARK-19535)) test `k < num 
items` and `k = num items`. Technically we should also test that `k > num 
items` returns the same results as `k = num items`.
    
    ## How was this patch tested?
    
    Updated existing unit tests.
    
    Author: Nick Pentreath <[email protected]>
    
    Closes #17860 from MLnick/SPARK-20596-als-rec-tests.
    
    (cherry picked from commit 58518d070777fc0665c4d02bad8adf910807df98)
    Signed-off-by: Nick Pentreath <[email protected]>

commit 23681e9ca0042328f93962701d19ca371727b0b7
Author: Xianyang Liu <[email protected]>
Date:   2017-05-08T17:25:24Z

    [SPARK-20621][DEPLOY] Delete deprecated config parameter in 'spark-env.sh'
    
    ## What changes were proposed in this pull request?
    
    Currently, `spark.executor.instances` is deprecated in `spark-env.sh`, because we
    suggest configuring it in `spark-defaults.conf` or another config file. Setting it
    in `spark-env.sh` also has no effect, so this patch removes it.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Xianyang Liu <[email protected]>
    
    Closes #17881 from ConeyLiu/deprecatedParam.
    
    (cherry picked from commit aeb2ecc0cd898f5352df0a04be1014b02ea3e20e)
    Signed-off-by: Marcelo Vanzin <[email protected]>

commit 4179ffc031a0dbca6a93255c673de800ce7393fe
Author: Hossein <[email protected]>
Date:   2017-05-08T21:48:11Z

    [SPARK-20661][SPARKR][TEST] SparkR tableNames() test fails
    
    ## What changes were proposed in this pull request?
    Cleaning existing temp tables before running tableNames tests
    
    ## How was this patch tested?
    SparkR Unit tests
    
    Author: Hossein <[email protected]>
    
    Closes #17903 from falaki/SPARK-20661.
    
    (cherry picked from commit 2abfee18b6511482b916c36f00bf3abf68a59e19)
    Signed-off-by: Yin Huai <[email protected]>

commit 54e07434968624dbb0fb80773356e614b954e52f
Author: Felix Cheung <[email protected]>
Date:   2017-05-09T05:49:40Z

    [SPARK-20661][SPARKR][TEST][FOLLOWUP] SparkR tableNames() test fails
    
    ## What changes were proposed in this pull request?
    
    Change the test to check a relative count, as is done for the catalog APIs in
    https://github.com/apache/spark/blame/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L3355.
    
    ## How was this patch tested?
    
    Unit tests; this needs to be combined with another commit containing the SQL
    change in order to verify.
    
    Author: Felix Cheung <[email protected]>
    
    Closes #17905 from felixcheung/rtabletests.
    
    (cherry picked from commit b952b44af4d243f1e3ad88bccf4af7d04df3fc81)
    Signed-off-by: Felix Cheung <[email protected]>

commit 72fca9a0a7a6dd2ab7c338fab9666b51cd981cce
Author: Peng <[email protected]>
Date:   2017-05-09T08:05:49Z

    [SPARK-11968][MLLIB] Optimize MLLIB ALS recommendForAll
    
    The recommendForAll of MLLIB ALS is very slow, and GC is a key problem of the
    current method. Each task uses the following code to hold its temporary results:
    `val output = new Array[(Int, (Int, Double))](m*n)`
    with m = n = 4096 (the default value; there is no way to change it), so `output`
    is about 4k * 4k * (4 + 4 + 8) = 256 MB. Such a large allocation causes serious
    GC pressure and frequent OOMs.
    
    Actually, we don't need to keep all the temporary results. Suppose we recommend
    the topK (around 10 or 20) products for each user; then we only need
    4k * topK * (4 + 4 + 8) bytes to hold the temporary results.
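
    For reference, the arithmetic above works out as follows (4 + 4 + 8 payload bytes
    per `(Int, (Int, Double))` entry, ignoring JVM object overhead):

    ```python
    block = 4096             # default block size (m = n = 4096)
    per_entry = 4 + 4 + 8    # Int + Int + Double, payload bytes only
    top_k = 20

    full_buffer_mb = block * block * per_entry / 1024 ** 2   # per-task buffer, old code
    topk_buffer_mb = block * top_k * per_entry / 1024 ** 2   # per-task buffer, this patch
    print(full_buffer_mb, topk_buffer_mb)                    # 256.0 1.25
    ```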
    
    Test environment: 3 workers, each with 10 cores, 30 GB of memory, and 1 executor.
    Data: 480,000 users and 17,000 items.
    
    BlockSize:      1024  2048  4096  8192
    Old method:     245s  332s  488s  OOM
    This solution:  121s  118s  117s  120s
    
    Tested with the existing unit tests.
    
    Author: Peng <[email protected]>
    Author: Peng Meng <[email protected]>
    
    Closes #17742 from mpjlu/OptimizeAls.
    
    (cherry picked from commit 8079424763c2043264f30a6898ce964379bd9b56)
    Signed-off-by: Nick Pentreath <[email protected]>

commit ca3f7edbad6a2e7fcd1c1d3dbd1a522cd0d7c476
Author: Nick Pentreath <[email protected]>
Date:   2017-05-09T08:13:15Z

    [SPARK-20587][ML] Improve performance of ML ALS recommendForAll
    
    This PR is a `DataFrame` version of #17742 for
    [SPARK-11968](https://issues.apache.org/jira/browse/SPARK-11968), improving the
    performance of the `recommendAll` methods.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: Nick Pentreath <[email protected]>
    
    Closes #17845 from MLnick/ml-als-perf.
    
    (cherry picked from commit 10b00abadf4a3473332eef996db7b66f491316f2)
    Signed-off-by: Nick Pentreath <[email protected]>

commit 4bbfad44e426365ad9f4941d68c110523b17ea6d
Author: Jon McLean <[email protected]>
Date:   2017-05-09T08:47:50Z

    [SPARK-20615][ML][TEST] SparseVector.argmax throws IndexOutOfBoundsException
    
    ## What changes were proposed in this pull request?
    
    Added a check for the number of defined values. Previously, the argmax function
    assumed that at least one value was defined whenever the vector size was greater
    than zero.
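
    A plain-Python sketch (illustrative only, not the Scala code) of the guard this
    adds; everything after the guard is deliberately simplified:

    ```python
    def sparse_argmax(size, indices, values):
        if size == 0:
            raise ValueError("argmax of an empty vector is undefined")
        if not values:
            # The previously missing check: no stored values means the vector is
            # all zeros, so the first index is a valid position of the maximum.
            return 0
        # Simplified: return the stored index with the largest stored value
        # (a full argmax would also weigh the implicit zero entries).
        best = max(range(len(values)), key=lambda i: values[i])
        return indices[best]
    ```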
    
    ## How was this patch tested?
    
    Tests were added to the existing VectorsSuite to cover this case.
    
    Author: Jon McLean <[email protected]>
    
    Closes #17877 from jonmclean/vectorArgmaxIndexBug.
    
    (cherry picked from commit be53a78352ae7c70d8a07d0df24574b3e3129b4a)
    Signed-off-by: Sean Owen <[email protected]>

commit 4b7aa0b1dbd85e2238acba45e8f94c097358fb72
Author: Yanbo Liang <[email protected]>
Date:   2017-05-09T09:30:37Z

    [SPARK-20606][ML] ML 2.2 QA: Remove deprecated methods for ML
    
    ## What changes were proposed in this pull request?
    Remove ML methods we deprecated in 2.1.
    
    ## How was this patch tested?
    Existing tests.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17867 from yanboliang/spark-20606.
    
    (cherry picked from commit b8733e0ad9f5a700f385e210450fd2c10137293e)
    Signed-off-by: Yanbo Liang <[email protected]>

commit b3309676bb83a80d38b916066d046866a6f42ef0
Author: Xiao Li <[email protected]>
Date:   2017-05-09T12:10:50Z

    [SPARK-20667][SQL][TESTS] Cleanup the cataloged metadata after completing 
the package of sql/core and sql/hive
    
    ## What changes were proposed in this pull request?
    
    So far, we do not drop all the cataloged objects after each test package.
    Sometimes we hit strange test failures because a previous test suite did not drop
    its cataloged/temporary objects (tables/functions/databases). At a minimum, we can
    clean up the environment when completing the packages `sql/core` and `sql/hive`.
    
    ## How was this patch tested?
    N/A
    
    Author: Xiao Li <[email protected]>
    
    Closes #17908 from gatorsmile/reset.
    
    (cherry picked from commit 0d00c768a860fc03402c8f0c9081b8147c29133e)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 272d2a10d70588e1f80cc6579d4ec3c44b5bbfc2
Author: Takeshi Yamamuro <[email protected]>
Date:   2017-05-09T12:22:51Z

    [SPARK-20311][SQL] Support aliases for table value functions
    
    ## What changes were proposed in this pull request?
    This PR adds parsing rules to support aliases for table value functions.
    
    ## How was this patch tested?
    Added tests in `PlanParserSuite`.
    
    Author: Takeshi Yamamuro <[email protected]>
    
    Closes #17666 from maropu/SPARK-20311.
    
    (cherry picked from commit 714811d0b5bcb5d47c39782ff74f898d276ecc59)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 08e1b78f01955c7151d9e984d392d45deced6e34
Author: Wenchen Fan <[email protected]>
Date:   2017-05-09T16:09:35Z

    [SPARK-20548][FLAKY-TEST] share one REPL instance among REPL test cases
    
    `ReplSuite.newProductSeqEncoder with REPL defined class` was flaky and frequently
    threw OOM exceptions. By analyzing the heap dump, we found the reason: in each
    test case of `ReplSuite`, we create a REPL instance, which creates a classloader
    and loads a lot of classes related to `SparkContext`. For more details, see
    https://github.com/apache/spark/pull/17833#issuecomment-298711435.
    
    In this PR, we create a new test suite, `SingletonReplSuite`, which shares one
    REPL instance among all the test cases. We then move most of the tests from
    `ReplSuite` to `SingletonReplSuite`, to avoid creating a lot of REPL instances
    and reduce the memory footprint.
    
    Test-only change.
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17844 from cloud-fan/flaky-test.
    
    (cherry picked from commit f561a76b2f895dea52f228a9376948242c3331ad)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 73aa23b8ef64960e7f171aa07aec396667a2339d
Author: Reynold Xin <[email protected]>
Date:   2017-05-09T16:24:28Z

    [SPARK-20674][SQL] Support registering UserDefinedFunction as named UDF
    
    ## What changes were proposed in this pull request?
    For some reason we don't have an API to register a UserDefinedFunction as a named
    UDF. It is a no-brainer to add one, in addition to the existing register functions
    we have.
    
    ## How was this patch tested?
    Added a test case in UDFSuite for the new API.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #17915 from rxin/SPARK-20674.
    
    (cherry picked from commit d099f414d2cb53f5a61f6e77317c736be6f953a0)
    Signed-off-by: Xiao Li <[email protected]>

commit c7bd909f67209b4d1354c3d5b0a0fb1d4e28f205
Author: Sean Owen <[email protected]>
Date:   2017-05-09T17:22:23Z

    [SPARK-19876][BUILD] Move Trigger.java to java source hierarchy
    
    ## What changes were proposed in this pull request?
    
    Simply moves `Trigger.java` to `src/main/java` from `src/main/scala`.
    See https://github.com/apache/spark/pull/17219.
    
    ## How was this patch tested?
    
    Existing tests.
    
    Author: Sean Owen <[email protected]>
    
    Closes #17921 from srowen/SPARK-19876.2.
    
    (cherry picked from commit 25ee816e090c42f0e35be2d2cb0f8ec60726317c)
    Signed-off-by: Herman van Hovell <[email protected]>

commit 9e8d23b3a2f99985ffb3c4eb67ac0a2774fa5b02
Author: Holden Karau <[email protected]>
Date:   2017-05-09T18:25:29Z

    [SPARK-20627][PYSPARK] Drop the hadoop distribution name from the Python version
    
    ## What changes were proposed in this pull request?
    
    Drop the hadoop distribution name from the Python version (PEP 440 -
    https://www.python.org/dev/peps/pep-0440/). We've been using the local version
    string to disambiguate between the different hadoop versions packaged with
    PySpark, but PEP 440 states that local versions should not be used when publishing
    upstream. Since we no longer make PySpark pip packages for different hadoop
    versions, we can simply drop the hadoop information. If at a later point we need
    to start publishing different hadoop versions, we can look at making separate
    packages or something similar.
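
    For illustration, the PEP 440 distinction at play (version strings below are
    examples, not the exact ones produced by the build):

    ```python
    # The "local version" segment is everything after the plus sign; public index
    # servers such as PyPI reject it, which is why it has to go.
    with_local_version = "2.1.0+hadoop2.7"   # fine for local builds only
    public_version = "2.2.0"                 # what a published package must use
    print(with_local_version.split("+"))     # ['2.1.0', 'hadoop2.7']
    ```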
    
    ## How was this patch tested?
    
    Ran `make-distribution` locally
    
    Author: Holden Karau <[email protected]>
    
    Closes #17885 from holdenk/SPARK-20627-remove-pip-local-version-string.
    
    (cherry picked from commit 1b85bcd9298cf84dd746fe8e91ab0b0df69ef17e)
    Signed-off-by: Holden Karau <[email protected]>

commit d191b962dc81c015fa92a38d882a8c7ea620ef06
Author: Yin Huai <[email protected]>
Date:   2017-05-09T21:47:45Z

    Revert "[SPARK-20311][SQL] Support aliases for table value functions"
    
    This reverts commit 714811d0b5bcb5d47c39782ff74f898d276ecc59.

commit 7600a7ab65777a59f3a33edef40328b6a5d864ef
Author: uncleGen <[email protected]>
Date:   2017-05-09T22:08:09Z

    [SPARK-20373][SQL][SS] Batch queries with `Dataset/DataFrame.withWatermark()` do
    not execute
    
    ## What changes were proposed in this pull request?
    
    Any Dataset/DataFrame batch query with the `withWatermark` operation does not
    execute, because the batch planner has no rule to explicitly handle the
    EventTimeWatermark logical plan. The right solution is to simply remove the plan
    node, as the watermark should not affect a batch query in any way.
    
    Changes:
    - In this PR, we add a new rule, `EliminateEventTimeWatermark`, which checks
    whether the event-time watermark should be ignored; we ignore the watermark in
    any batch query.
    
    Depends upon:
    - [SPARK-20672](https://issues.apache.org/jira/browse/SPARK-20672). We cannot add
    this rule to the analyzer directly, because the streaming query is copied to
    `triggerLogicalPlan` on every trigger, and the rule would then be applied to
    `triggerLogicalPlan` by mistake.
    
    Others:
    - A typo fix in an example.
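
    A minimal PySpark sketch of the kind of query this affects (a local session is
    assumed); with the new rule the watermark node is dropped and the aggregation
    runs as an ordinary batch job:

    ```python
    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    batch_df = (spark.createDataFrame([("2017-05-09 22:08:09", 1)], ["ts", "value"])
                .withColumn("ts", F.col("ts").cast("timestamp")))

    # On a batch DataFrame the watermark has no meaning, so it is simply ignored.
    batch_df.withWatermark("ts", "10 minutes").groupBy("value").count().show()
    ```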
    
    ## How was this patch tested?
    
    add new unit test.
    
    Author: uncleGen <[email protected]>
    
    Closes #17896 from uncleGen/SPARK-20373.
    
    (cherry picked from commit c0189abc7c6ddbecc1832d2ff0cfc5546a010b60)
    Signed-off-by: Shixiong Zhu <[email protected]>

commit 6a996b36283dcd22ff7aa38968a80f575d2f151e
Author: Yuming Wang <[email protected]>
Date:   2017-05-10T02:45:00Z

    [SPARK-17685][SQL] Make SortMergeJoinExec's currentVars null when calling
    createJoinKey
    
    ## What changes were proposed in this pull request?
    
    The following SQL query causes an `IndexOutOfBoundsException` when
    `LIMIT > 1310720`:
    ```sql
    CREATE TABLE tab1(int int, int2 int, str string);
    CREATE TABLE tab2(int int, int2 int, str string);
    INSERT INTO tab1 values(1,1,'str');
    INSERT INTO tab1 values(2,2,'str');
    INSERT INTO tab2 values(1,1,'str');
    INSERT INTO tab2 values(2,3,'str');
    
    SELECT
      count(*)
    FROM
      (
        SELECT t1.int, t2.int2
        FROM (SELECT * FROM tab1 LIMIT 1310721) t1
        INNER JOIN (SELECT * FROM tab2 LIMIT 1310721) t2
        ON (t1.int = t2.int AND t1.int2 = t2.int2)
      ) t;
    ```
    
    This pull request fixes the issue.
    
    ## How was this patch tested?
    
    unit tests
    
    Author: Yuming Wang <[email protected]>
    
    Closes #17920 from wangyum/SPARK-17685.
    
    (cherry picked from commit 771abeb46f637592aba2e63db2ed05b6cabfd0be)
    Signed-off-by: Herman van Hovell <[email protected]>

commit 7b6f3a118e973216264bbf356af2bb1e7870466e
Author: hyukjinkwon <[email protected]>
Date:   2017-05-10T05:44:47Z

    [SPARK-20590][SQL] Use Spark internal datasource if multiples are found for 
the same shorten name
    
    ## What changes were proposed in this pull request?
    
    One of the common usability problems around reading data in Spark (particularly
    CSV) is that there can often be a conflict between different readers in the
    classpath.
    
    As an example, if someone launches a 2.x spark shell with the spark-csv 
package in the classpath, Spark currently fails in an extremely unfriendly way 
(see databricks/spark-csv#367):
    
    ```bash
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    scala> val df = spark.read.csv("/foo/bar.csv")
    java.lang.RuntimeException: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat, 
com.databricks.spark.csv.DefaultSource15), please specify the fully qualified 
class name.
      at scala.sys.package$.error(package.scala:27)
      at 
org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:574)
      at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass$lzycompute(DataSource.scala:85)
      at 
org.apache.spark.sql.execution.datasources.DataSource.providingClass(DataSource.scala:85)
      at 
org.apache.spark.sql.execution.datasources.DataSource.resolveRelation(DataSource.scala:295)
      at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:178)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:533)
      at org.apache.spark.sql.DataFrameReader.csv(DataFrameReader.scala:412)
      ... 48 elided
    ```
    
    This PR proposes a simple way of fixing this error: when multiple sources match,
    default to the internal datasource if there is a single internal one (the
    datasource with the "org.apache.spark" prefix).
    
    ```scala
    scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ```scala
    scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ## How was this patch tested?
    
    Manually tested as below:
    
    ```bash
    ./bin/spark-shell --packages com.databricks:spark-csv_2.11:1.5.0
    ```
    
    ```scala
    spark.sparkContext.setLogLevel("WARN")
    ```
    
    **positive cases**:
    
    ```scala
    scala> spark.range(1).write.format("csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:44 WARN DataSource: Multiple sources found for csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    ```scala
    scala> spark.range(1).write.format("Csv").mode("overwrite").save("/tmp/abc")
    17/05/10 09:47:52 WARN DataSource: Multiple sources found for Csv 
(org.apache.spark.sql.execution.datasources.csv.CSVFileFormat,
    com.databricks.spark.csv.DefaultSource15), defaulting to the internal 
datasource (org.apache.spark.sql.execution.datasources.csv.CSVFileFormat).
    ```
    
    (newlines were inserted for readability).
    
    ```scala
    scala> 
spark.range(1).write.format("com.databricks.spark.csv").mode("overwrite").save("/tmp/abc")
    ```
    
    ```scala
    scala> 
spark.range(1).write.format("org.apache.spark.sql.execution.datasources.csv.CSVFileFormat").mode("overwrite").save("/tmp/abc")
    ```
    
    **negative cases**:
    
    ```scala
    scala> 
spark.range(1).write.format("com.databricks.spark.csv.CsvRelation").save("/tmp/abc")
    java.lang.InstantiationException: com.databricks.spark.csv.CsvRelation
    ...
    ```
    
    ```scala
    scala> 
spark.range(1).write.format("com.databricks.spark.csv.CsvRelatio").save("/tmp/abc")
    java.lang.ClassNotFoundException: Failed to find data source: 
com.databricks.spark.csv.CsvRelatio. Please find packages at 
http://spark.apache.org/third-party-projects.html
    ...
    ```
    
    Author: hyukjinkwon <[email protected]>
    
    Closes #17916 from HyukjinKwon/datasource-detect.
    
    (cherry picked from commit 3d2131ab4ddead29601fb3c597b798202ac25fdd)
    Signed-off-by: Wenchen Fan <[email protected]>

commit ef50a954882fa1911f7ede3f0aefc8fcf09c6059
Author: Josh Rosen <[email protected]>
Date:   2017-05-10T06:36:36Z

    [SPARK-20686][SQL] PropagateEmptyRelation incorrectly handles aggregate 
without grouping
    
    ## What changes were proposed in this pull request?
    
    The query
    
    ```sql
    SELECT 1 FROM (SELECT COUNT(*) WHERE FALSE) t1
    ```
    
    should return a single row of output because the subquery is an aggregate 
without a group-by and thus should return a single row. However, Spark 
incorrectly returns zero rows.
    
    This is caused by SPARK-16208 / #13906, a patch which added an optimizer 
rule to propagate EmptyRelation through operators. The logic for handling 
aggregates is wrong: it checks whether aggregate expressions are non-empty for 
deciding whether the output should be empty, whereas it should be checking 
grouping expressions instead:
    
    An aggregate with non-empty grouping expression will return one output row 
per group. If the input to the grouped aggregate is empty then all groups will 
be empty and thus the output will be empty. It doesn't matter whether the 
aggregation output columns include aggregate expressions since that won't 
affect the number of output rows.
    
    If the grouping expressions are empty, however, then the aggregate will 
always produce a single output row and thus we cannot propagate the 
EmptyRelation.
    
    The current implementation is incorrect and also misses an optimization 
opportunity by not propagating EmptyRelation in the case where a grouped 
aggregate has aggregate expressions (in other words, `SELECT COUNT(*) from 
emptyRelation GROUP BY x` would _not_ be optimized to `EmptyRelation` in the 
old code, even though it safely could be).
    
    This patch resolves this issue by modifying `PropagateEmptyRelation` to 
consider only the presence/absence of grouping expressions, not the aggregate 
functions themselves, when deciding whether to propagate EmptyRelation.
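
    A short PySpark illustration of the two cases (a local session is assumed):

    ```python
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    # No grouping: an aggregate over an empty input still yields exactly one row,
    # so EmptyRelation must not be propagated (this was the bug).
    spark.sql("SELECT COUNT(*) AS c FROM (SELECT 1 AS x) t WHERE false").show()

    # With grouping: every group is empty, so the result is empty and the plan
    # can safely be collapsed to EmptyRelation.
    spark.sql(
        "SELECT x, COUNT(*) AS c FROM (SELECT 1 AS x) t WHERE false GROUP BY x"
    ).show()
    ```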
    
    ## How was this patch tested?
    
    - Added end-to-end regression tests in `SQLQueryTest`'s `group-by.sql` file.
    - Updated unit tests in `PropagateEmptyRelationSuite`.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #17929 from JoshRosen/fix-PropagateEmptyRelation.
    
    (cherry picked from commit a90c5cd8226146a58362732171b92cb99a7bc4c7)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 3ed2f4d516ce02dfef929195778f8214703913d8
Author: zero323 <[email protected]>
Date:   2017-05-10T08:57:52Z

    [SPARK-20631][PYTHON][ML] LogisticRegression._checkThresholdConsistency 
should use values not Params
    
    ## What changes were proposed in this pull request?
    
    - Replace `getParam` calls with `getOrDefault` calls.
    - Fix exception message to avoid unintended `TypeError`.
    - Add unit tests
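
    A small sketch of the distinction behind the first bullet (a running SparkSession
    is assumed, since the estimator wraps a JVM object): `getParam` returns the
    `Param` descriptor, while `getOrDefault` returns the configured value, which is
    what a consistency check needs to compare.

    ```python
    from pyspark.sql import SparkSession
    from pyspark.ml.classification import LogisticRegression

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    lr = LogisticRegression(threshold=0.5)
    param_obj = lr.getParam("threshold")     # a pyspark.ml.param.Param object
    value = lr.getOrDefault(lr.threshold)    # 0.5, the actual value to validate
    print(type(param_obj).__name__, value)   # Param 0.5
    ```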
    
    ## How was this patch tested?
    
    New unit tests.
    
    Author: zero323 <[email protected]>
    
    Closes #17891 from zero323/SPARK-20631.
    
    (cherry picked from commit 804949c6bf00b8e26c39d48bbcc4d0470ee84e47)
    Signed-off-by: Yanbo Liang <[email protected]>

commit 7597a522b7e5be43910e86cd6f805e7e9ee08ced
Author: Alex Bozarth <[email protected]>
Date:   2017-05-10T09:20:10Z

    [SPARK-20630][WEB UI] Fixed column visibility in Executor Tab
    
    ## What changes were proposed in this pull request?
    
    #14617 added new columns to the executor table, causing the visibility checks for
    the logs and threadDump columns to toggle the wrong columns, since they used
    hard-coded column numbers.
    
    I've updated the checks to use column names instead of numbers so future updates
    don't accidentally break this again.
    
    Note: this will also need to be back-ported to 2.2, since #14617 was merged there.
    
    ## How was this patch tested?
    
    Manually tested
    
    Author: Alex Bozarth <[email protected]>
    
    Closes #17904 from ajbozarth/spark20630.
    
    (cherry picked from commit ca4625e0e58df7f02346470d22a9478d9640709d)
    Signed-off-by: Sean Owen <[email protected]>

commit 0851b6cfb8980fa8816a96026fbf0498799e296b
Author: Wenchen Fan <[email protected]>
Date:   2017-05-10T11:30:00Z

    [SPARK-20688][SQL] correctly check analysis for scalar sub-queries
    
    ## What changes were proposed in this pull request?
    
    In `CheckAnalysis`, we should call `checkAnalysis` for `ScalarSubquery` at 
the beginning, as later we will call `plan.output` which is invalid if `plan` 
is not resolved.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17930 from cloud-fan/tmp.
    
    (cherry picked from commit 789bdbe3d0d9558043872161bdfa148ec021a849)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 5f6029c7500b0c5a769c6b62879d8532a5692a50
Author: wangzhenhua <[email protected]>
Date:   2017-05-10T11:42:49Z

    [SPARK-20678][SQL] Ndv for columns not in filter condition should also be 
updated
    
    ## What changes were proposed in this pull request?
    
    In filter estimation, we update column stats for the columns that appear in the
    filter condition. However, if the number of rows decreases after the filter (i.e.,
    the overall selectivity is less than 1), we need to update (scale down) the number
    of distinct values (NDV) for all columns, whether or not they appear in the filter
    condition.
    
    This PR also fixes the inconsistent rounding mode between ndv and rowCount.
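
    A back-of-the-envelope sketch of the scaling described above (the formula here is
    only illustrative; the exact expression lives in the filter-estimation code):

    ```python
    import math

    rows_before, rows_after = 1000, 100      # the filter keeps 10% of the rows
    selectivity = rows_after / rows_before

    ndv_before = 800                         # a column NOT in the filter condition
    ndv_after = min(ndv_before, math.ceil(ndv_before * selectivity))
    print(ndv_after)                         # 80: scaled down with the row count
    ```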
    
    ## How was this patch tested?
    
    Added new tests.
    
    Author: wangzhenhua <[email protected]>
    
    Closes #17918 from wzhfy/scaleDownNdvAfterFilter.
    
    (cherry picked from commit 76e4a5566b1e9579632e03440cecd04dd142bc44)
    Signed-off-by: Wenchen Fan <[email protected]>

commit 358516dcbef5178cdc6cb4387d7f6837359946ba
Author: Xianyang Liu <[email protected]>
Date:   2017-05-10T12:56:34Z

    [MINOR][BUILD] Fix lint-java breaks.
    
    ## What changes were proposed in this pull request?
    
    This PR proposes to fix the lint-breaks as below:
    ```
    [ERROR] src/main/java/org/apache/spark/unsafe/Platform.java:[51] (regexp) 
RegexpSingleline: No trailing whitespace allowed.
    [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[45,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
    [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[62,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
    [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[78,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
    [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[92,25] 
(naming) MethodName: Method name 'ProcessingTime' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
    [ERROR] src/main/scala/org/apache/spark/sql/streaming/Trigger.java:[102,25] 
(naming) MethodName: Method name 'Once' must match pattern 
'^[a-z][a-z0-9][a-zA-Z0-9_]*$'.
    [ERROR] 
src/test/java/org/apache/spark/streaming/kinesis/JavaKinesisInputDStreamBuilderSuite.java:[28,8]
 (imports) UnusedImports: Unused import - 
org.apache.spark.streaming.api.java.JavaDStream.
    ```
    
    after:
    ```
    dev/lint-java
    Checkstyle checks passed.
    ```
    [Test Result](https://travis-ci.org/ConeyLiu/spark/jobs/229666169)
    
    ## How was this patch tested?
    
    Travis CI
    
    Author: Xianyang Liu <[email protected]>
    
    Closes #17890 from ConeyLiu/codestyle.
    
    (cherry picked from commit fcb88f9211e39c705073db5300c96ceeb3f227d7)
    Signed-off-by: Sean Owen <[email protected]>

commit 86cef4df5fd9e28a8ece4ec33376d3622de2ef69
Author: Ala Luszczak <[email protected]>
Date:   2017-05-10T15:41:04Z

    [SPARK-19447] Remove remaining references to generated rows metric
    
    ## What changes were proposed in this pull request?
    
    
    https://github.com/apache/spark/commit/b486ffc86d8ad6c303321dcf8514afee723f61f8
    left behind references to the "number of generated rows" metric that should have
    been removed.
    
    ## How was this patch tested?
    
    Existing unit tests.
    
    Author: Ala Luszczak <[email protected]>
    
    Closes #17939 from ala/SPARK-19447-fix.
    
    (cherry picked from commit 5c2c4dcce529d228a97ede0386b95213ea0e1da5)
    Signed-off-by: Herman van Hovell <[email protected]>

commit 3eb0ee06a588da5b9c08a72d178835c6e8bad36b
Author: Josh Rosen <[email protected]>
Date:   2017-05-10T23:50:57Z

    [SPARK-20685] Fix BatchPythonEvaluation bug in case of single UDF w/ 
repeated arg.
    
    ## What changes were proposed in this pull request?
    
    There's a latent corner-case bug in PySpark UDF evaluation where executing 
a `BatchPythonEvaluation` with a single multi-argument UDF where _at least one 
argument value is repeated_ will crash at execution with a confusing error.
    
    This problem was introduced in #12057: the code there has a fast path for 
handling a "batch UDF evaluation consisting of a single Python UDF", but that 
branch incorrectly assumes that a single UDF won't have repeated arguments and 
therefore skips the code for unpacking arguments from the input row (whose 
schema may not necessarily match the UDF inputs due to de-duplication of 
repeated arguments which occurred in the JVM before sending UDF inputs to 
Python).
    
    The fix here is simply to remove this special-casing: it turns out that 
the code in the "multiple UDFs" branch just so happens to work for the 
single-UDF case because Python treats `(x)` as equivalent to `x`, not as a 
single-argument tuple.
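
    A plain-Python reminder of the parenthesization point (nothing Spark-specific):

    ```python
    x = [1, 2, 3]
    assert (x) is x              # parentheses alone do not create a tuple
    assert (x,) == ([1, 2, 3],)  # only the trailing comma builds a one-element tuple
    # ...which is why the generic 'multiple UDFs' branch happens to work for a
    # single UDF too.
    ```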
    
    ## How was this patch tested?
    
    New regression test in `pyspark.python.sql.tests` module (tested and 
confirmed that it fails before my fix).
    
    Author: Josh Rosen <[email protected]>
    
    Closes #17927 from JoshRosen/SPARK-20685.
    
    (cherry picked from commit 8ddbc431d8b21d5ee57d3d209a4f25e301f15283)
    Signed-off-by: Xiao Li <[email protected]>

commit 80a57fa90be8dca4340345c09b4ea28fbf11a516
Author: Yanbo Liang <[email protected]>
Date:   2017-05-11T06:48:13Z

    [SPARK-20606][ML] Revert "[] ML 2.2 QA: Remove deprecated methods for ML"
    
    This reverts commit b8733e0ad9f5a700f385e210450fd2c10137293e.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #17944 from yanboliang/spark-20606-revert.
    
    (cherry picked from commit 0698e6c88ca11fdfd6e5498cab784cf6dbcdfacb)
    Signed-off-by: Yanbo Liang <[email protected]>

commit dd9e3b2c976a4ef3b4837590a2ba0954cf73860d
Author: Wenchen Fan <[email protected]>
Date:   2017-05-11T07:41:15Z

    [SPARK-20569][SQL] RuntimeReplaceable functions should not take extra 
parameters
    
    ## What changes were proposed in this pull request?
    
    `RuntimeReplaceable` always has a constructor with the expression to 
replace with, and this constructor should not be the function builder.
    
    ## How was this patch tested?
    
    new regression test
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #17876 from cloud-fan/minor.
    
    (cherry picked from commit b4c99f43690f8cfba414af90fa2b3998a510bba8)
    Signed-off-by: Xiao Li <[email protected]>

----


