GitHub user yintengfei opened a pull request: https://github.com/apache/spark/pull/15404
Branch 2.0

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-2.0

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/15404.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #15404

----

commit 5735b8bd769c64e2b0e0fae75bad794cde3edc99
Author: Reynold Xin <r...@databricks.com>
Date: 2016-08-18T08:37:25Z

[SPARK-16391][SQL] Support partial aggregation for reduceGroups

## What changes were proposed in this pull request?

This patch introduces a new private ReduceAggregator interface that is a subclass of Aggregator. ReduceAggregator only requires a single associative and commutative reduce function. ReduceAggregator is also used to implement KeyValueGroupedDataset.reduceGroups in order to support partial aggregation.

Note that the pull request was initially done by viirya.

## How was this patch tested?

Covered by original tests for reduceGroups, as well as a new test suite for ReduceAggregator.

Author: Reynold Xin <r...@databricks.com>
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>

Closes #14576 from rxin/reduceAggregator.

(cherry picked from commit 1748f824101870b845dbbd118763c6885744f98a)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit ec5f157a32f0c65b5f93bdde7a6334e982b3b83c
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-18T11:44:13Z

[SPARK-17117][SQL] 1 / NULL should not fail analysis

## What changes were proposed in this pull request?

This patch fixes the problem described in SPARK-17117, i.e. "SELECT 1 / NULL" throws an analysis exception:

```
org.apache.spark.sql.AnalysisException: cannot resolve '(1 / NULL)' due to data type mismatch: differing types in '(1 / NULL)' (int and null).
```

The problem is that division type coercion did not take null type into account.

## How was this patch tested?

A unit test for the type coercion, and a few end-to-end test cases using SQLQueryTestSuite.

Author: petermaxlee <petermax...@gmail.com>

Closes #14695 from petermaxlee/SPARK-17117.

(cherry picked from commit 68f5087d2107d6afec5d5745f0cb0e9e3bdd6a0b)
Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>
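For context, the failing query above can be reproduced with a minimal sketch; the local SparkSession setup here is illustrative and not part of the patch:

```scala
import org.apache.spark.sql.SparkSession

// Illustrative local session; any SparkSession behaves the same here.
val spark = SparkSession.builder().master("local[*]").appName("spark-17117-repro").getOrCreate()

// Before the fix: AnalysisException, because division type coercion did not
// handle NullType. After the fix: NULL is coerced and the result is null.
spark.sql("SELECT 1 / NULL").show()
```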
commit 176af17a7213a4c2847a04f715137257657f2961
Author: Xin Ren <iamsh...@126.com>
Date: 2016-08-10T07:49:06Z

[MINOR][SPARKR] R API documentation for "coltypes" is confusing

## What changes were proposed in this pull request?

The R API documentation for "coltypes" is confusing; I found this while working on another ticket. In the current version, http://spark.apache.org/docs/2.0.0/api/R/coltypes.html, the parameter `x` is documented twice, and the example is not very clear.

![current](https://cloud.githubusercontent.com/assets/3925641/17386808/effb98ce-59a2-11e6-9657-d477d258a80c.png)
![screen shot 2016-08-03 at 5 56 00 pm](https://cloud.githubusercontent.com/assets/3925641/17386884/91831096-59a3-11e6-84af-39890b3d45d8.png)

## How was this patch tested?

Tested manually on my local machine. The updated screenshots are below:

![screen shot 2016-08-07 at 11 29 20 pm](https://cloud.githubusercontent.com/assets/3925641/17471144/df36633c-5cf6-11e6-8238-4e32ead0e529.png)
![screen shot 2016-08-03 at 5 56 22 pm](https://cloud.githubusercontent.com/assets/3925641/17386896/9d36cb26-59a3-11e6-9619-6dae29f7ab17.png)

Author: Xin Ren <iamsh...@126.com>

Closes #14489 from keypointt/rExample.

(cherry picked from commit 1203c8415cd11540f79a235e66a2f241ca6c71e4)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit ea684b69cd6934bc093f4a5a8b0d8470e92157cd
Author: Eric Liang <e...@databricks.com>
Date: 2016-08-18T11:33:55Z

[SPARK-17069] Expose spark.range() as table-valued function in SQL

## What changes were proposed in this pull request?

This adds analyzer rules for resolving table-valued functions, and adds one builtin implementation for range(). The arguments for range() are the same as those of `spark.range()`.

## How was this patch tested?

Unit tests.

cc hvanhovell

Author: Eric Liang <e...@databricks.com>

Closes #14656 from ericl/sc-4309.

(cherry picked from commit 412dba63b511474a6db3c43c8618d803e604bc6b)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit c180d637a3caca0d4e46f4980c10d1005eb453bc
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-19T01:19:47Z

[SPARK-16947][SQL] Support type coercion and foldable expression for inline tables

## What changes were proposed in this pull request?

This patch improves inline table support with the following:

1. Support type coercion.
2. Support using foldable expressions. Previously only literals were supported.
3. Improve error message handling.
4. Improve test coverage.

## How was this patch tested?

Added a new unit test suite ResolveInlineTablesSuite and a new file-based end-to-end test inline-table.sql.

Author: petermaxlee <petermax...@gmail.com>

Closes #14676 from petermaxlee/SPARK-16947.

(cherry picked from commit f5472dda51b980a726346587257c22873ff708e3)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 05b180faa4bd87498516c05d4769cc2f51d56aae
Author: Reynold Xin <r...@databricks.com>
Date: 2016-08-19T02:02:32Z

HOTFIX: compilation broken due to protected ctor.

(cherry picked from commit b482c09fa22c5762a355f95820e4ba3e2517fb77)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit d55d1f454e6739ccff9c748f78462d789b09991f
Author: Nick Lavers <nick.lav...@videoamp.com>
Date: 2016-08-19T09:11:59Z

[SPARK-16961][CORE] Fixed off-by-one error that biased randomizeInPlace

JIRA issue link: https://issues.apache.org/jira/browse/SPARK-16961

Changed one line of Utils.randomizeInPlace to allow elements to stay in place. Created a unit test that runs a Pearson's chi-squared test to determine whether the output diverges significantly from a uniform distribution.

Author: Nick Lavers <nick.lav...@videoamp.com>

Closes #14551 from nicklavers/SPARK-16961-randomizeInPlace.

(cherry picked from commit 5377fc62360d5e9b5c94078e41d10a96e0e8a535)
Signed-off-by: Sean Owen <so...@cloudera.com>
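For intuition, here is a sketch of an unbiased Fisher-Yates shuffle of the kind the one-line fix restores; this is illustrative code, not the actual `Utils.randomizeInPlace` source:

```scala
import scala.util.Random

// Unbiased Fisher-Yates shuffle. The off-by-one bug drew the swap index j
// from 0 until i (exclusive), so position i could never keep its element;
// drawing j from 0 to i inclusive removes that bias.
def randomizeInPlace[T](arr: Array[T], rand: Random = new Random): Array[T] = {
  for (i <- (arr.length - 1) to 1 by -1) {
    val j = rand.nextInt(i + 1) // inclusive upper bound lets arr(i) stay put
    val tmp = arr(j)
    arr(j) = arr(i)
    arr(i) = tmp
  }
  arr
}
```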
commit e0c60f1850706faf2830b09af3dc6b52ffd9991e
Author: Reynold Xin <r...@databricks.com>
Date: 2016-08-19T13:11:35Z

[SPARK-16994][SQL] Whitelist operators for predicate pushdown

## What changes were proposed in this pull request?

This patch changes the predicate pushdown optimization rule (PushDownPredicate) from using a blacklist to a whitelist. That is to say, operators must now be explicitly allowed. This approach is more future-proof: previously it was possible for us to introduce a new operator and thereby render the optimization rule incorrect.

This also fixes a bug: we previously allowed pushing a filter beneath a limit, which is incorrect. That is to say, before this patch, the optimizer would rewrite

```
select * from (select * from range(10) limit 5) where id > 3
```

to

```
select * from range(10) where id > 3 limit 5
```

## How was this patch tested?

- a unit test case in FilterPushdownSuite
- an end-to-end test in limit.sql

Author: Reynold Xin <r...@databricks.com>

Closes #14713 from rxin/SPARK-16994.

(cherry picked from commit 67e59d464f782ff5f509234212aa072a7653d7bf)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
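A sketch of why the old rewrite was unsound, assuming a SparkSession named `spark`:

```scala
// A filter applied AFTER a limit must only see the limited rows.
val limited = spark.range(10).limit(5) // five rows drawn from 0..9
val filtered = limited.where("id > 3") // filters just those five rows

// If the filter were pushed beneath the limit, the plan would become
// spark.range(10).where("id > 3").limit(5), which can return up to five
// rows (4..8) rather than filtering only the five rows the limit produced.
filtered.show()
```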
commit d0707c6baeb4003735a508f981111db370984354
Author: Kousuke Saruta <saru...@oss.nttdata.co.jp>
Date: 2016-08-19T15:11:25Z

[SPARK-11227][CORE] UnknownHostException can be thrown when NameNode HA is enabled.

## What changes were proposed in this pull request?

If the following conditions are satisfied, executors don't load the properties in `hdfs-site.xml` and an UnknownHostException can be thrown:

(1) NameNode HA is enabled
(2) spark.eventLogging is disabled, or the logging path is NOT on HDFS
(3) Standalone or Mesos is used as the cluster manager
(4) No code loads the `HdfsConfiguration` class in the driver, directly or indirectly
(5) The tasks access HDFS

(There might be some more conditions...)

For example, the following code causes an UnknownHostException when the conditions above are satisfied:

```
sc.textFile("<path on HDFS>").collect
```

```
java.lang.IllegalArgumentException: java.net.UnknownHostException: hacluster
    at org.apache.hadoop.security.SecurityUtil.buildTokenService(SecurityUtil.java:378)
    at org.apache.hadoop.hdfs.NameNodeProxies.createNonHAProxy(NameNodeProxies.java:310)
    at org.apache.hadoop.hdfs.NameNodeProxies.createProxy(NameNodeProxies.java:176)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:678)
    at org.apache.hadoop.hdfs.DFSClient.<init>(DFSClient.java:619)
    at org.apache.hadoop.hdfs.DistributedFileSystem.initialize(DistributedFileSystem.java:149)
    at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:2653)
    at org.apache.hadoop.fs.FileSystem.access$200(FileSystem.java:92)
    at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:2687)
    at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:2669)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:371)
    at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:170)
    at org.apache.hadoop.mapred.JobConf.getWorkingDirectory(JobConf.java:656)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:438)
    at org.apache.hadoop.mapred.FileInputFormat.setInputPaths(FileInputFormat.java:411)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
    at org.apache.spark.SparkContext$$anonfun$hadoopFile$1$$anonfun$32.apply(SparkContext.scala:986)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
    at org.apache.spark.rdd.HadoopRDD$$anonfun$getJobConf$6.apply(HadoopRDD.scala:177)
    at scala.Option.map(Option.scala:146)
    at org.apache.spark.rdd.HadoopRDD.getJobConf(HadoopRDD.scala:177)
    at org.apache.spark.rdd.HadoopRDD$$anon$1.<init>(HadoopRDD.scala:213)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:209)
    at org.apache.spark.rdd.HadoopRDD.compute(HadoopRDD.scala:102)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:318)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:282)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:70)
    at org.apache.spark.scheduler.Task.run(Task.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.UnknownHostException: hacluster
```

But the following code doesn't cause the exception, because the `textFile` method loads `HdfsConfiguration` indirectly:

```
sc.textFile("<path on HDFS>").collect
```

When a job includes operations that access HDFS, the `org.apache.hadoop.Configuration` object is wrapped by `SerializableConfiguration`, serialized, and broadcast from the driver to the executors. Each executor deserializes the object with `loadDefaults` set to false, so HDFS-related properties must be set before the configuration is broadcast.

## How was this patch tested?

Tested manually on my standalone cluster.

Author: Kousuke Saruta <saru...@oss.nttdata.co.jp>

Closes #13738 from sarutak/SPARK-11227.

(cherry picked from commit 071eaaf9d2b63589f2e66e5279a16a5a484de6f5)
Signed-off-by: Tom Graves <tgra...@yahoo-inc.com>

commit 3276ccfac807514d5a959415bcf58d2aa6ed8fbc
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date: 2016-07-26T04:00:01Z

[SPARK-16686][SQL] Remove PushProjectThroughSample since it is handled by ColumnPruning

We push down `Project` through `Sample` in `Optimizer` by the rule `PushProjectThroughSample`. However, if the projected columns produce new output, they will see the whole data instead of the sampled data. This introduces an inconsistency between the original plan (Sample then Project) and the optimized plan (Project then Sample). In the extreme case attached in the JIRA, if the projected column is a UDF that is not supposed to see the sampled-out data, the result of the UDF will be incorrect.

Since the rule `ColumnPruning` already handles general `Project` pushdown, we don't need `PushProjectThroughSample` anymore. The rule `ColumnPruning` also avoids the described issue.

Jenkins tests.

Author: Liang-Chi Hsieh <sim...@tw.ibm.com>

Closes #14327 from viirya/fix-sample-pushdown.

(cherry picked from commit 7b06a8948fc16d3c14e240fdd632b79ce1651008)
Signed-off-by: Reynold Xin <r...@databricks.com>
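A sketch of the Sample/Project inconsistency, assuming a SparkSession `spark`; the counting UDF is hypothetical and its counter is only meaningful in local mode:

```scala
import java.util.concurrent.atomic.AtomicLong
import org.apache.spark.sql.functions.{col, udf}

// Hypothetical side-effecting UDF that records how many rows the projection sees.
val rowsSeen = new AtomicLong(0)
val observe = udf { (id: Long) => rowsSeen.incrementAndGet(); id }

val sampled = spark.range(1000).sample(withReplacement = false, fraction = 0.1)
sampled.select(observe(col("id"))).collect()
// Expected: roughly 100 rows observed. Under the buggy
// PushProjectThroughSample rule, the Project could be evaluated beneath the
// Sample, so the UDF saw all 1000 rows.
println(rowsSeen.get)
```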
commit ae89c8e170dd77e0b2adc04a2c85577f6df5cdef
Author: Sital Kedia <ske...@fb.com>
Date: 2016-08-19T18:27:30Z

[SPARK-17113] [SHUFFLE] Job failure due to Executor OOM in offheap mode

## What changes were proposed in this pull request?

This PR fixes an executor OOM in offheap mode caused by a bug in Cooperative Memory Management for UnsafeExternalSorter. UnsafeExternalSorter was checking whether a memory page is being used by upstream by comparing the base object address of the current page with the base object address of upstream. However, in the case of offheap memory allocation, the base object addresses are always null, so no spilling happened and eventually the operator would OOM.

Following is the stack trace this issue addresses:

```
java.lang.OutOfMemoryError: Unable to acquire 1220 bytes of memory, got 0
    at org.apache.spark.memory.MemoryConsumer.allocatePage(MemoryConsumer.java:120)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.acquireNewPageIfNecessary(UnsafeExternalSorter.java:341)
    at org.apache.spark.util.collection.unsafe.sort.UnsafeExternalSorter.insertRecord(UnsafeExternalSorter.java:362)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.insertRow(UnsafeExternalRowSorter.java:93)
    at org.apache.spark.sql.execution.UnsafeExternalRowSorter.sort(UnsafeExternalRowSorter.java:170)
```

## How was this patch tested?

Tested by running the failing job.

Author: Sital Kedia <ske...@fb.com>

Closes #14693 from sitalkedia/fix_offheap_oom.

(cherry picked from commit cf0cce90364d17afe780ff9a5426dfcefa298535)
Signed-off-by: Davies Liu <davies....@gmail.com>

commit efe832200f2fdf90868f5d03b45f1d75502444b3
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-20T01:14:45Z

[SPARK-17149][SQL] array.sql for testing array related functions

## What changes were proposed in this pull request?

This patch creates array.sql in SQLQueryTestSuite for testing array related functions, including:

- indexing
- array creation
- size
- array_contains
- sort_array

## How was this patch tested?

The patch itself is about adding tests.

Author: petermaxlee <petermax...@gmail.com>

Closes #14708 from petermaxlee/SPARK-17149.

(cherry picked from commit a117afa7c2d94f943106542ec53d74ba2b5f1058)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 379b1272925e534d99ddf4e4add054284900d200
Author: Srinath Shankar <srin...@databricks.com>
Date: 2016-08-20T02:54:26Z

[SPARK-17158][SQL] Change error message for out of range numeric literals

## What changes were proposed in this pull request?

Modifies the error message for numeric literals to:

    Numeric literal <literal> does not fit in range [min, max] for type <T>

## How was this patch tested?

Fixed up the error messages for literals.sql in SqlQueryTestSuite and re-ran via sbt. Also fixed up error messages in ExpressionParserSuite.

Author: Srinath Shankar <srin...@databricks.com>

Closes #14721 from srinathshankar/sc4296.

(cherry picked from commit ba1737c21aab91ff3f1a1737aa2d6b07575e36a3)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit f7458c71d3b02864acb33fc48c130a0a734e9723
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-20T05:19:38Z

[SPARK-17150][SQL] Support SQL generation for inline tables

## What changes were proposed in this pull request?

This patch adds support for SQL generation for inline tables. With this, it becomes possible to create a view that depends on inline tables.

## How was this patch tested?

Added a test case in LogicalPlanToSQLSuite.

Author: petermaxlee <petermax...@gmail.com>

Closes #14709 from petermaxlee/SPARK-17150.

(cherry picked from commit 45d40d9f66c666eec6df926db23937589d67225d)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
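As a sketch of what the improved inline table support allows, assuming a SparkSession `spark`:

```scala
// Column `score` mixes INT and DECIMAL literals (type coercion), and
// `1 + 1` is a foldable expression rather than a bare literal.
spark.sql("SELECT * FROM VALUES (1, 1), (2, 2.5), (3, 1 + 1) AS t(id, score)").show()
```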
commit 4c4c2753b1012e395ae3896396b6509d6082fdf2
Author: Liang-Chi Hsieh <sim...@tw.ibm.com>
Date: 2016-08-20T15:29:48Z

[SPARK-17104][SQL] LogicalRelation.newInstance should follow the semantics of MultiInstanceRelation

## What changes were proposed in this pull request?

Currently `LogicalRelation.newInstance()` simply creates another `LogicalRelation` object with the same parameters. However, the `newInstance()` method inherited from `MultiInstanceRelation` should return a copy of the object with unique expression ids. The current `LogicalRelation.newInstance()` can cause failures when doing a self-join.

## How was this patch tested?

Jenkins tests.

Author: Liang-Chi Hsieh <sim...@tw.ibm.com>

Closes #14682 from viirya/fix-localrelation.

(cherry picked from commit 31a015572024046f4deaa6cec66bb6fab110f31d)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 24dd9a702694db1d2c28ff4c41edac2b3112df60
Author: petermaxlee <petermax...@gmail.com>
Date: 2016-08-20T16:25:55Z

[SPARK-17124][SQL] RelationalGroupedDataset.agg should preserve order and allow multiple aggregates per column

## What changes were proposed in this pull request?

This patch fixes a longstanding issue with one of the RelationalGroupedDataset.agg functions. Even though the signature accepts a vararg of pairs, the underlying implementation turned the sequence into a map, which is neither order-preserving nor able to hold multiple aggregates per column. This change also allows users to use this function to run multiple different aggregations on a single column, e.g.

```
agg("age" -> "max", "age" -> "count")
```

## How was this patch tested?

Added a test case in DataFrameAggregateSuite.

Author: petermaxlee <petermax...@gmail.com>

Closes #14697 from petermaxlee/SPARK-17124.

(cherry picked from commit 9560c8d29542a5dcaaa07b7af9ef5ddcdbb5d14d)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
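A runnable sketch of the fixed behavior; the SparkSession `spark` and the sample data are assumed:

```scala
import spark.implicits._

// Made-up sample data.
val people = Seq(("alice", 23), ("alice", 31), ("bob", 18)).toDF("name", "age")

// Both "age" aggregates are kept, in the order given. Previously the pairs
// were collapsed into a Map, which silently dropped one of them.
people.groupBy("name").agg("age" -> "max", "age" -> "count").show()
```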
commit faff9297d154596e35de555c819049ba9a51d57d
Author: Bryan Cutler <cutl...@gmail.com>
Date: 2016-08-20T20:45:26Z

[SPARK-12666][CORE] SparkSubmit packages fix for when 'default' conf doesn't exist in dependent module

## What changes were proposed in this pull request?

Adding a "(runtime)" suffix to the dependency configuration sets a fallback configuration to be used if the requested one is not found. E.g. with the setting "default(runtime)", Ivy will look for the conf "default" in the module's ivy file and, if not found, will look for the conf "runtime". This helps with the case of using "sbt publishLocal", which does not write a "default" conf into the published ivy.xml file.

## How was this patch tested?

Used spark-submit with the --packages option for a package published locally with no default conf, and for a package resolved from Maven central.

Author: Bryan Cutler <cutl...@gmail.com>

Closes #13428 from BryanCutler/fallback-package-conf-SPARK-12666.

(cherry picked from commit 9f37d4eac28dd179dd523fa7d645be97bb52af9c)
Signed-off-by: Josh Rosen <joshro...@databricks.com>

commit 26d5a8b0dab10310ec76b91465b3b4ff465e9746
Author: Xiangrui Meng <m...@databricks.com>
Date: 2016-08-21T17:31:25Z

[MINOR][R] add SparkR.Rcheck/ and SparkR_*.tar.gz to R/.gitignore

## What changes were proposed in this pull request?

Ignore temp files generated by `check-cran.sh`.

Author: Xiangrui Meng <m...@databricks.com>

Closes #14740 from mengxr/R-gitignore.

(cherry picked from commit ab7143463daf2056736c85e3a943c826b5992623)
Signed-off-by: Xiangrui Meng <m...@databricks.com>

commit 0297896119e11f23da4b14f62f50ec72b5fac57f
Author: Junyang Qian <junya...@databricks.com>
Date: 2016-08-20T13:59:23Z

[SPARK-16508][SPARKR] Fix CRAN undocumented/duplicated arguments warnings.

This PR tries to fix all the remaining "undocumented/duplicated arguments" warnings given by the CRAN check. The one warning left is for the doc of R's `stats::glm` exported in SparkR; to mute it, we would have to also document all arguments of that non-SparkR function. Some previous conversation is in #14558.

Tested with the R unit tests and the `check-cran.sh` script (with no-test).

Author: Junyang Qian <junya...@databricks.com>

Closes #14705 from junyangq/SPARK-16508-master.

(cherry picked from commit 01401e965b58f7e8ab615764a452d7d18f1d4bf0)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit e62b29f29f44196a1cbe13004ff4abfd8e5be1c1
Author: Dongjoon Hyun <dongj...@apache.org>
Date: 2016-08-21T20:07:47Z

[SPARK-17098][SQL] Fix `NullPropagation` optimizer to handle `COUNT(NULL) OVER` correctly

## What changes were proposed in this pull request?

Currently, the `NullPropagation` optimizer replaces `COUNT` on null literals in a bottom-up fashion. During that, `WindowExpression` is not covered properly. This PR adds the missing propagation logic.

**Before**

```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
java.lang.UnsupportedOperationException: Cannot evaluate expression: cast(0 as bigint) windowspecdefinition(ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)
```

**After**

```scala
scala> sql("SELECT COUNT(1 + NULL) OVER ()").show
+----------------------------------------------------------------------------------------------+
|count((1 + CAST(NULL AS INT))) OVER (ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING)|
+----------------------------------------------------------------------------------------------+
|                                                                                             0|
+----------------------------------------------------------------------------------------------+
```

## How was this patch tested?

Pass the Jenkins test with a new test case.

Author: Dongjoon Hyun <dongj...@apache.org>

Closes #14689 from dongjoon-hyun/SPARK-17098.

(cherry picked from commit 91c2397684ab791572ac57ffb2a924ff058bb64f)
Signed-off-by: Herman van Hovell <hvanhov...@databricks.com>
commit 49cc44de3ad5495b2690633791941aa00a62b553
Author: Davies Liu <dav...@databricks.com>
Date: 2016-08-22T08:16:03Z

[SPARK-17115][SQL] decrease the threshold when split expressions

## What changes were proposed in this pull request?

In 2.0, we changed the threshold for splitting expressions from 16K to 64K, which caused very bad performance on wide tables, because the generated method can't be JIT-compiled by default (it exceeds the 8K bytecode limit). This PR decreases the threshold to 1K, based on the benchmark results for a wide table with 400 columns of LongType. It also fixes a bug around splitting expressions in whole-stage codegen (they should not be split there).

## How was this patch tested?

Added a benchmark suite.

Author: Davies Liu <dav...@databricks.com>

Closes #14692 from davies/split_exprs.

(cherry picked from commit 8d35a6f68d6d733212674491cbf31bed73fada0f)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>

commit 2add45fabeb0ea4f7b17b5bc4910161370e72627
Author: Jagadeesan <a...@us.ibm.com>
Date: 2016-08-22T08:30:31Z

[SPARK-17085][STREAMING][DOCUMENTATION AND ACTUAL CODE DIFFERS - UNSUPPORTED OPERATIONS]

Changes in the Spark Structured Streaming doc at this link: https://spark.apache.org/docs/2.0.0/structured-streaming-programming-guide.html#unsupported-operations

Author: Jagadeesan <a...@us.ibm.com>

Closes #14715 from jagadeesanas2/SPARK-17085.

(cherry picked from commit bd9655063bdba8836b4ec96ed115e5653e246b65)
Signed-off-by: Sean Owen <so...@cloudera.com>

commit 79195982a4c6f8b1a3e02069dea00049cc806574
Author: Junyang Qian <junya...@databricks.com>
Date: 2016-08-22T17:03:48Z

[SPARKR][MINOR] Fix Cache Folder Path in Windows

## What changes were proposed in this pull request?

This PR fixes the scheme of the local cache folder in Windows. The name of the environment variable should be `LOCALAPPDATA` rather than `%LOCALAPPDATA%`.

## How was this patch tested?

Manual test in Windows 7.

Author: Junyang Qian <junya...@databricks.com>

Closes #14743 from junyangq/SPARKR-FixWindowsInstall.

(cherry picked from commit 209e1b3c0683a9106428e269e5041980b6cc327f)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit 94eff08757cee70c5b31fff7095bbb1e6ebc7ecf
Author: Sean Owen <so...@cloudera.com>
Date: 2016-08-22T18:15:53Z

[SPARK-16320][DOC] Document G1 heap region's effect on spark 2.0 vs 1.6

## What changes were proposed in this pull request?

Collect the GC discussion in one section, and document findings about the G1 GC heap region size.

## How was this patch tested?

Jekyll doc build.

Author: Sean Owen <so...@cloudera.com>

Closes #14732 from srowen/SPARK-16320.

(cherry picked from commit 342278c09cf6e79ed4f63422988a6bbd1e7d8a91)
Signed-off-by: Yin Huai <yh...@databricks.com>

commit 6dcc1a3f0cc8f2ed71f7bb6b1493852a58259d2f
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date: 2016-08-22T19:53:52Z

[SPARKR][MINOR] Add Xiangrui and Felix to maintainers

## What changes were proposed in this pull request?

This change adds Xiangrui Meng and Felix Cheung to the maintainers field in the package description.

## How was this patch tested?

(Please explain how this patch was tested. E.g. unit tests, integration tests, manual tests)

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

Closes #14758 from shivaram/sparkr-maintainers.

(cherry picked from commit 6f3cd36f93c11265449fdce3323e139fec8ab22d)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit 01a4d69f309a1cc8d370ce9f85e6a4f31b6db3b8
Author: Eric Liang <e...@databricks.com>
Date: 2016-08-22T22:48:35Z

[SPARK-17162] Range does not support SQL generation

## What changes were proposed in this pull request?

The range operator previously didn't support SQL generation, which made it impossible to use in views.

## How was this patch tested?

Unit tests.

cc hvanhovell

Author: Eric Liang <e...@databricks.com>

Closes #14724 from ericl/spark-17162.

(cherry picked from commit 84770b59f773f132073cd2af4204957fc2d7bf35)
Signed-off-by: Reynold Xin <r...@databricks.com>
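Combined with the range() table-valued function from SPARK-17069 above, this enables sketches like the following; it assumes a SparkSession `spark` with a catalog that supports persistent views (e.g. Hive support), and the view name is made up:

```scala
// A persistent view over range() requires SQL generation for the range
// operator, which this commit adds.
spark.sql("CREATE OR REPLACE VIEW first_ten AS SELECT * FROM range(10)")
spark.sql("SELECT id FROM first_ten WHERE id % 2 = 0").show()
```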
commit b65b041af8b64413c7d460d4ea110b2044d6f36e
Author: Felix Cheung <felixcheun...@hotmail.com>
Date: 2016-08-22T22:53:10Z

[SPARK-16508][SPARKR] doc updates and more CRAN check fixes

- replace ``` ` ``` in code doc with `\code{thing}`
- remove added `...` for drop(DataFrame)
- fix remaining CRAN check warnings
- create doc with knitr

junyangq

Author: Felix Cheung <felixcheun...@hotmail.com>

Closes #14734 from felixcheung/rdoccleanup.

(cherry picked from commit 71afeeea4ec8e67edc95b5d504c557c88a2598b9)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit ff2f873800fcc3d699e52e60fd0e69eb01d12503
Author: Eric Liang <e...@databricks.com>
Date: 2016-08-22T23:32:14Z

[SPARK-16550][SPARK-17042][CORE] Certain classes fail to deserialize in block manager replication

## What changes were proposed in this pull request?

This is a straightforward clone of JoshRosen's original patch. I have follow-up changes to fix block replication for repl-defined classes as well, but those appear to be flaking tests, so I'm going to leave that for SPARK-17042.

## How was this patch tested?

End-to-end test in ReplSuite (also more tests in DistributedSuite from the original patch).

Author: Eric Liang <e...@databricks.com>

Closes #14311 from ericl/spark-16550.

(cherry picked from commit 8e223ea67acf5aa730ccf688802f17f6fc10907c)
Signed-off-by: Reynold Xin <r...@databricks.com>

commit 225898961bc4bc71d56f33c027adbb2d0929ae5a
Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>
Date: 2016-08-23T00:09:32Z

[SPARK-16577][SPARKR] Add CRAN documentation checks to run-tests.sh

## What changes were proposed in this pull request?

(Please fill in changes proposed in this fix)

## How was this patch tested?

This change adds CRAN documentation checks to be run as a part of `R/run-tests.sh`. As this script is also used by Jenkins, this means that we will get documentation checks on every PR going forward.

(If this patch involves UI changes, please attach a screenshot; otherwise, remove this)

Author: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

Closes #14759 from shivaram/sparkr-cran-jenkins.

(cherry picked from commit 920806ab272ba58a369072a5eeb89df5e9b470a6)
Signed-off-by: Shivaram Venkataraman <shiva...@cs.berkeley.edu>

commit eaea1c86b897d302107a9b6833a27a2b24ca31a0
Author: Cheng Lian <l...@databricks.com>
Date: 2016-08-23T01:11:47Z

[SPARK-17182][SQL] Mark Collect as non-deterministic

## What changes were proposed in this pull request?

This PR marks the abstract class `Collect` as non-deterministic since the results of `CollectList` and `CollectSet` depend on the actual order of input rows.

## How was this patch tested?

Existing test cases should be enough.

Author: Cheng Lian <l...@databricks.com>

Closes #14749 from liancheng/spark-17182-non-deterministic-collect.

(cherry picked from commit 2cdd92a7cd6f85186c846635b422b977bdafbcdd)
Signed-off-by: Wenchen Fan <wenc...@databricks.com>
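To illustrate the non-determinism, a sketch assuming a SparkSession `spark`:

```scala
import org.apache.spark.sql.functions.{col, collect_list}

// The order of values inside each collected list depends on partitioning
// and task scheduling, so two runs can produce differently ordered arrays.
// Marking Collect non-deterministic keeps the optimizer from reordering or
// deduplicating these expressions as if they were pure.
val df = spark.range(100).repartition(8)
df.groupBy(col("id") % 10).agg(collect_list(col("id"))).show(truncate = false)
```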