GitHub user xif10416s opened a pull request:

    https://github.com/apache/spark/pull/9071

    Branch 1.5

    

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.5

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9071.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9071
    
----
commit 5c749c82cb3caa5a41fd3fd49c32ab23c6f738da
Author: Wenchen Fan <[email protected]>
Date:   2015-08-19T22:04:56Z

    [SPARK-6489] [SQL] add column pruning for Generate
    
    This PR takes over https://github.com/apache/spark/pull/5358
    
    Author: Wenchen Fan <[email protected]>
    
    Closes #8268 from cloud-fan/6489.
    
    (cherry picked from commit b0dbaec4f942a47afde3490b9339ad3bd187024d)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 56a37b01fd07f4f1a8cb4e07b55e1a02cf23a5f7
Author: Eric Liang <[email protected]>
Date:   2015-08-19T22:43:08Z

    [SPARK-9895] User Guide for RFormula Feature Transformer
    
    mengxr
    
    Author: Eric Liang <[email protected]>
    
    Closes #8293 from ericl/docs-2.
    
    (cherry picked from commit 8e0a072f78b4902d5f7ccc6b15232ed202a117f9)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 321cb99caa9e63e19eeec0d04fe9d425abdb7109
Author: Reynold Xin <[email protected]>
Date:   2015-08-20T00:35:41Z

    [SPARK-9242] [SQL] Audit UDAF interface.
    
    A few minor changes:
    
    1. Improved documentation.
    2. Renamed apply(distinct....) to distinct.
    3. Changed MutableAggregationBuffer from a trait to an abstract class.
    4. Renamed returnDataType to dataType to be more consistent with other expressions.
    
    And unrelated to UDAFs:
    
    1. Renamed file names in expressions to use the suffix "Expressions" for consistency.
    2. Moved regexp-related expressions out to their own file.
    3. Renamed StringComparison => StringPredicate.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #8321 from rxin/SPARK-9242.
    
    (cherry picked from commit 2f2686a73f5a2a53ca5b1023e0d7e0e6c9be5896)
    Signed-off-by: Reynold Xin <[email protected]>

commit 16414dae03b427506b2a1ebb7d405e6fa3bdad17
Author: zsxwing <[email protected]>
Date:   2015-08-20T01:36:01Z

    [SPARK-9812] [STREAMING] Fix Python 3 compatibility issue in PySpark Streaming and some docs
    
    This PR includes the following fixes:
    1. Use `range` instead of `xrange` in `queue_stream.py` to support Python 3.
    2. Fix the issue that `utf8_decoder` returns `bytes` rather than `str` when receiving empty `bytes` in Python 3.
    3. Fix the commands in the docs so that users can copy them directly to the command line. The previous commands were broken in the middle of a path, so when copied to the command line the path would be split into two parts by the extra spaces, forcing the user to fix it manually.
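The `utf8_decoder` fix in item 2 can be sketched as follows. This is a minimal reconstruction of the pattern described above, not necessarily the exact code in the PR: in Python 2 an idiom like `s and s.decode('utf-8')` was harmless because empty `bytes` and empty `str` were the same type, but in Python 3 it silently returns `bytes` for empty input.

```python
def utf8_decoder_broken(s):
    # Python 2-era idiom: returns s unchanged when s is falsy.
    # For s == b'' under Python 3 this returns bytes, not str.
    return s and s.decode('utf-8')


def utf8_decoder(s):
    # Fixed: only None passes through; empty bytes decode to ''.
    if s is None:
        return None
    return s.decode('utf-8')
```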
    
    Author: zsxwing <[email protected]>
    
    Closes #8315 from zsxwing/SPARK-9812.
    
    (cherry picked from commit 1f29d502e7ecd6faa185d70dc714f9ea3922fb6d)
    Signed-off-by: Tathagata Das <[email protected]>

commit a3ed2c31e60b11c09f815b42c0cd894be3150c67
Author: Timothy Chen <[email protected]>
Date:   2015-08-20T02:43:26Z

    [SPARK-10124] [MESOS] Fix removing queued driver in mesos cluster mode.
    
    Currently, Spark applications can be queued to the Mesos cluster dispatcher, but when multiple jobs are in the queue we do not handle removing jobs from the buffer correctly while iterating, which causes a NullPointerException.
    
    This patch copies the buffer before iterating over it, so exceptions are not thrown when jobs are removed.
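The copy-before-iterate fix is a general pattern. A minimal Python sketch (the dispatcher itself is Scala, and these names are hypothetical):

```python
def remove_from_queue(queued_drivers, should_remove):
    # Mutating a collection while iterating over it is unsafe; iterate
    # over a shallow copy so removals do not disturb the iteration.
    for driver in list(queued_drivers):
        if should_remove(driver):
            queued_drivers.remove(driver)
    return queued_drivers
```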
    
    Author: Timothy Chen <[email protected]>
    
    Closes #8322 from tnachen/fix_cluster_mode.
    
    (cherry picked from commit 73431d8afb41b93888d2642a1ce2d011f03fb740)
    Signed-off-by: Andrew Or <[email protected]>

commit 63922fa4dd5fb4a24e6f8c984b080698ca3b0a26
Author: zsxwing <[email protected]>
Date:   2015-08-20T02:43:09Z

    [SPARK-10125] [STREAMING] Fix a potential deadlock in JobGenerator.stop
    
    Because `lazy val` uses the `this` lock, if JobGenerator.stop and JobGenerator.doCheckpoint (when JobGenerator.shouldCheckpoint has not yet been initialized) run at the same time, they may deadlock.
    
    Here are the stack traces for the deadlock:
    
    ```Java
    "pool-1-thread-1-ScalaTest-running-StreamingListenerSuite" #11 prio=5 os_prio=31 tid=0x00007fd35d094800 nid=0x5703 in Object.wait() [0x000000012ecaf000]
       java.lang.Thread.State: WAITING (on object monitor)
            at java.lang.Object.wait(Native Method)
            at java.lang.Thread.join(Thread.java:1245)
            - locked <0x00000007b5d8d7f8> (a org.apache.spark.util.EventLoop$$anon$1)
            at java.lang.Thread.join(Thread.java:1319)
            at org.apache.spark.util.EventLoop.stop(EventLoop.scala:81)
            at org.apache.spark.streaming.scheduler.JobGenerator.stop(JobGenerator.scala:155)
            - locked <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator)
            at org.apache.spark.streaming.scheduler.JobScheduler.stop(JobScheduler.scala:95)
            - locked <0x00000007b5d8ced8> (a org.apache.spark.streaming.scheduler.JobScheduler)
            at org.apache.spark.streaming.StreamingContext.stop(StreamingContext.scala:687)
    
    "JobGenerator" #67 daemon prio=5 os_prio=31 tid=0x00007fd35c3b9800 nid=0x9f03 waiting for monitor entry [0x0000000139e4a000]
       java.lang.Thread.State: BLOCKED (on object monitor)
            at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint$lzycompute(JobGenerator.scala:63)
            - waiting to lock <0x00000007b5d8cea0> (a org.apache.spark.streaming.scheduler.JobGenerator)
            at org.apache.spark.streaming.scheduler.JobGenerator.shouldCheckpoint(JobGenerator.scala:63)
            at org.apache.spark.streaming.scheduler.JobGenerator.doCheckpoint(JobGenerator.scala:290)
            at org.apache.spark.streaming.scheduler.JobGenerator.org$apache$spark$streaming$scheduler$JobGenerator$$processEvent(JobGenerator.scala:182)
            at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:83)
            at org.apache.spark.streaming.scheduler.JobGenerator$$anon$1.onReceive(JobGenerator.scala:82)
            at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    ```
    
    I can use this patch to reproduce the deadlock: https://github.com/zsxwing/spark/commit/8a88f28d1331003a65fabef48ae3d22a7c21f05f
    
    And a timeout build in Jenkins due to this deadlock: https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/1654/
    
    This PR initializes `checkpointWriter` before `eventLoop` uses it to avoid this deadlock.
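The pattern, and the fix, can be sketched in Python. Scala's `lazy val` synchronizes on the enclosing instance; here that is imitated with an explicit lock and a double-checked read. All names are hypothetical; this is an analogy, not Spark's actual code.

```python
import threading


class Generator:
    def __init__(self):
        self._lock = threading.Lock()
        self._checkpoint_writer = None
        self._stopped = threading.Event()
        # The fix: force the "lazy" field to initialize here, before the
        # event-loop thread starts, so the loop never needs self._lock.
        _ = self.checkpoint_writer
        self._loop = threading.Thread(target=self._run)
        self._loop.start()

    @property
    def checkpoint_writer(self):
        # Imitates lazy val: the first access takes the instance lock;
        # later reads return the initialized value without locking.
        if self._checkpoint_writer is None:
            with self._lock:
                if self._checkpoint_writer is None:
                    self._checkpoint_writer = object()
        return self._checkpoint_writer

    def _run(self):
        while not self._stopped.is_set():
            _ = self.checkpoint_writer  # doCheckpoint touches the lazy field
            self._stopped.wait(0.01)

    def stop(self):
        # stop() joins the loop thread while holding the instance lock,
        # like JobGenerator.stop. Without the eager initialization above,
        # the loop thread could block on self._lock inside
        # checkpoint_writer while stop() waits forever in join().
        with self._lock:
            self._stopped.set()
            self._loop.join()
```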
    
    Author: zsxwing <[email protected]>
    
    Closes #8326 from zsxwing/SPARK-10125.

commit 71aa5475597f4220e2bab6b42caf9b98f248ac99
Author: Tathagata Das <[email protected]>
Date:   2015-08-20T04:15:58Z

    [SPARK-10128] [STREAMING] Used correct classloader to deserialize WAL data
    
    Recovering Kinesis sequence numbers from the WAL leads to a ClassNotFoundException because the ObjectInputStream does not use the correct classloader, so the SequenceNumberRanges class (in the streaming-kinesis-asl package, added through spark-submit) cannot be found while deserializing. The solution is to use `Thread.currentThread().getContextClassLoader` while deserializing.
    
    Author: Tathagata Das <[email protected]>
    
    Closes #8328 from tdas/SPARK-10128 and squashes the following commits:
    
    f19b1c2 [Tathagata Das] Used correct classloader to deserialize WAL data
    
    (cherry picked from commit b762f9920f7587d3c08493c49dd2fede62110b88)
    Signed-off-by: Tathagata Das <[email protected]>

commit 675e2249472fbadecb5c8f8da6ae8ff7a1f05305
Author: Yin Huai <[email protected]>
Date:   2015-08-20T10:43:24Z

    [SPARK-10092] [SQL] Backports #8324 to branch-1.5
    
    Author: Yin Huai <[email protected]>
    
    Closes #8336 from liancheng/spark-10092/for-branch-1.5.

commit 5be517584be0c78dc4641a4aa14ea9da05ed344d
Author: Reynold Xin <[email protected]>
Date:   2015-08-20T14:53:27Z

    [SPARK-10100] [SQL] Eliminate hash table lookup if there is no grouping key in aggregation.
    
    This improves performance by ~20-30% in one of my local tests and should fix the performance regression from 1.4 to 1.5 on ss_max.
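The optimization can be sketched with a toy MAX aggregate; the names are hypothetical and the real change lives in Spark's aggregation operators, but the idea is the same.

```python
def aggregate_max(rows, key_fn=None):
    # With no grouping key, every row updates the same buffer, so probing
    # a hash table per row is pure overhead: use a single buffer directly.
    if key_fn is None:
        buf = None
        for r in rows:
            buf = r if buf is None else max(buf, r)
        return {(): buf}
    # With grouping keys, a per-key buffer lookup is unavoidable.
    buffers = {}
    for r in rows:
        k = key_fn(r)
        buffers[k] = r if k not in buffers else max(buffers[k], r)
    return buffers
```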
    
    Author: Reynold Xin <[email protected]>
    
    Closes #8332 from rxin/SPARK-10100.
    
    (cherry picked from commit b4f4e91c395cb69ced61d9ff1492d1b814f96828)
    Signed-off-by: Yin Huai <[email protected]>

commit 257e9d727874332fd192f6a993f9ea8bf464abf5
Author: MechCoder <[email protected]>
Date:   2015-08-20T17:05:31Z

    [MINOR] [SQL] Fix sphinx warnings in PySpark SQL
    
    Author: MechCoder <[email protected]>
    
    Closes #8171 from MechCoder/sql_sphinx.
    
    (cherry picked from commit 52c60537a274af5414f6b0340a4bd7488ef35280)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit a7027e6d3369a1157c53557c8215273606086d84
Author: Alex Shkurenko <[email protected]>
Date:   2015-08-20T17:16:38Z

    [SPARK-9982] [SPARKR] SparkR DataFrame fail to return data of Decimal type
    
    Author: Alex Shkurenko <[email protected]>
    
    Closes #8239 from ashkurenko/master.
    
    (cherry picked from commit 39e91fe2fd43044cc734d55625a3c03284b69f09)
    Signed-off-by: Shivaram Venkataraman <[email protected]>

commit 2f47e099d31275f03ad372483e1bb23a322044f5
Author: Cheng Lian <[email protected]>
Date:   2015-08-20T18:00:24Z

    [SPARK-10136] [SQL] Fixes Parquet support for Avro array of primitive array
    
    I caught SPARK-10136 while adding more test cases to `ParquetAvroCompatibilitySuite`. The actual bug fix code lies in `CatalystRowConverter.scala`.
    
    Author: Cheng Lian <[email protected]>
    
    Closes #8341 from liancheng/spark-10136/parquet-avro-nested-primitive-array.
    
    (cherry picked from commit 85f9a61357994da5023b08b0a8a2eb09388ce7f8)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 19b92c87a38fd3594e60e96dbf1f85e92163be36
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T18:06:31Z

    Preparing Spark release v1.5.0-rc1

commit a1785e3f50a75b81ff444b3a299db91e0a38f702
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T18:06:41Z

    Preparing development version 1.5.0-SNAPSHOT

commit 6026f4fd729f4c7158a87c5c706fde866d7aae60
Author: Josh Rosen <[email protected]>
Date:   2015-08-20T18:31:03Z

    [SPARK-10126] [PROJECT INFRA] Fix typo in release-build.sh which broke snapshot publishing for Scala 2.11
    
    The current `release-build.sh` has a typo which breaks snapshot publication for Scala 2.11. We should change the Scala version to 2.11 and clean before building a 2.11 snapshot.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #8325 from JoshRosen/fix-2.11-snapshots.
    
    (cherry picked from commit 12de348332108f8c0c5bdad1d4cfac89b952b0f8)
    Signed-off-by: Josh Rosen <[email protected]>

commit 99eeac8cca176cfb64d5fd354a0a7c279613bbc9
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T19:43:08Z

    Preparing Spark release v1.5.0-rc1

commit eac31abdf2fe89abb3dec2fa9285f918ae682d58
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T19:43:13Z

    Preparing development version 1.5.0-SNAPSHOT

commit 2e0d2a9cc3cb7021e3bdd032d079cf6c8916c725
Author: Xiangrui Meng <[email protected]>
Date:   2015-08-20T21:47:04Z

    [SPARK-10138] [ML] move setters to MultilayerPerceptronClassifier and add Java test suite
    
    Otherwise, setters do not return the self type. jkbradley avulanov
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #8342 from mengxr/SPARK-10138.
    
    (cherry picked from commit 2a3d98aae285aba39786e9809f96de412a130f39)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 560ec1268b824acc01d347a3fbc78ac16216a9b0
Author: MechCoder <[email protected]>
Date:   2015-08-20T21:56:08Z

    [SPARK-10108] Add since tags to mllib.feature
    
    Author: MechCoder <[email protected]>
    
    Closes #8309 from MechCoder/tags_feature.
    
    (cherry picked from commit 7cfc0750e14f2c1b3847e4720cc02150253525a9)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 2beea65bfbbf4a94ad6b7ca5e4c24f59089f6099
Author: Joseph K. Bradley <[email protected]>
Date:   2015-08-20T22:01:31Z

    [SPARK-9245] [MLLIB] LDA topic assignments
    
    For each (document, term) pair, return the top topic. Note that instances of (doc, term) pairs within a document (a.k.a. "tokens") are exchangeable, so we should provide an estimate per document-term pair, rather than per token.
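The per-(document, term) pooling can be sketched as follows: given token-level topic samples, exchangeable tokens for the same (doc, term) pair are pooled and the most frequent topic is reported. This is a hedged illustration with hypothetical names; the actual implementation operates on the LDA model's internals rather than on explicit tuples.

```python
from collections import Counter


def top_topic_per_doc_term(samples):
    """samples: iterable of (doc, term, topic) tuples, one per token.
    Tokens with the same (doc, term) are exchangeable, so their topic
    samples are pooled and one estimate is returned per pair."""
    counts = {}
    for doc, term, topic in samples:
        counts.setdefault((doc, term), Counter())[topic] += 1
    return {pair: c.most_common(1)[0][0] for pair, c in counts.items()}
```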
    
    CC: rotationsymmetry mengxr
    
    Author: Joseph K. Bradley <[email protected]>
    
    Closes #8329 from jkbradley/lda-topic-assignments.
    
    (cherry picked from commit eaafe139f881d6105996373c9b11f2ccd91b5b3e)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit d837d51d54510310f7bae05cd331c0c23946404c
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T22:33:04Z

    Preparing Spark release v1.5.0-rc1

commit 175c1d9c90d47a469568e14b4b90a440b8d9e95c
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T22:33:10Z

    Preparing development version 1.5.0-SNAPSHOT

commit 4c56ad772637615cc1f4f88d619fac6c372c8552
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T23:24:07Z

    Preparing Spark release v1.5.0-rc1

commit 988e838a2fe381052f3018df4f31a55434be75ea
Author: Patrick Wendell <[email protected]>
Date:   2015-08-20T23:24:12Z

    Preparing development version 1.5.1-SNAPSHOT

commit 04ef52a5bcbe8fba2941af235b8cda1255d4af8d
Author: Xiangrui Meng <[email protected]>
Date:   2015-08-21T03:01:13Z

    [SPARK-10140] [DOC] add target fields to @Since
    
    so constructor parameters and public fields can be annotated. rxin MechCoder
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #8344 from mengxr/SPARK-10140.2.
    
    (cherry picked from commit cdd9a2bb10e20556003843a0f7aaa33acd55f6d2)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit e5e601739b1ec49da95f30657bfcfc691e35d9be
Author: Alexander Ulanov <[email protected]>
Date:   2015-08-21T03:02:27Z

    [SPARK-9846] [DOCS] User guide for Multilayer Perceptron Classifier
    
    Added user guide for multilayer perceptron classifier:
      - Simplified description of the multilayer perceptron classifier
      - Example code for Scala and Java
    
    Author: Alexander Ulanov <[email protected]>
    
    Closes #8262 from avulanov/SPARK-9846-mlpc-docs.
    
    (cherry picked from commit dcfe0c5cde953b31c5bfeb6e41d1fc9b333241eb)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 817c38a0a1405c2bf407070e13e16934c777cd89
Author: Daoyuan Wang <[email protected]>
Date:   2015-08-21T19:21:51Z

    [SPARK-10130] [SQL] type coercion for IF should have children resolved first
    
    Type coercion for IF should have its children resolved first, or we could hit an unresolved exception.
    
    Author: Daoyuan Wang <[email protected]>
    
    Closes #8331 from adrian-wang/spark10130.
    
    (cherry picked from commit 3c462f5d87a9654c5a68fd658a40f5062029fd9a)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 4e72839b7b1e0b925837b49534a07188a603d838
Author: jerryshao <[email protected]>
Date:   2015-08-21T20:10:11Z

    [SPARK-10122] [PYSPARK] [STREAMING] Fix getOffsetRanges bug in PySpark-Streaming transform function
    
    Details of the bug and explanations can be seen in [SPARK-10122](https://issues.apache.org/jira/browse/SPARK-10122).
    
    tdas , please help to review.
    
    Author: jerryshao <[email protected]>
    
    Closes #8347 from jerryshao/SPARK-10122 and squashes the following commits:
    
    4039b16 [jerryshao] Fix getOffsetRanges in transform() bug

commit e7db8761bd47ed53a313eb74f901c95ca89e23fb
Author: MechCoder <[email protected]>
Date:   2015-08-21T21:19:24Z

    [SPARK-9864] [DOC] [MLlib] [SQL] Replace since in scaladoc with Since annotation
    
    Author: MechCoder <[email protected]>
    
    Closes #8352 from MechCoder/since.
    
    (cherry picked from commit f5b028ed2f1ad6de43c8b50ebf480e1b6c047035)
    Signed-off-by: Xiangrui Meng <[email protected]>

commit 14c8c0c0da1184c587f0d5ab60f1d56feaa588e4
Author: Yin Huai <[email protected]>
Date:   2015-08-21T21:30:00Z

    [SPARK-10143] [SQL] Use parquet's block size (row group size) setting as the min split size if necessary.
    
    https://issues.apache.org/jira/browse/SPARK-10143
    
    With this PR, we will set the min split size to parquet's block size (row group size) set in the conf if the min split size is smaller. So, we can avoid having too many tasks, and even useless tasks, for reading parquet data.
    
    I tested it locally. The table I have is 343MB and it is in my local FS. Because I did not set any min/max split size, the default split size was 32MB and the map stage had 11 tasks, but only three tasks actually read data. With my PR, there were only three tasks in the map stage. Here is the difference.
    
    Without this PR:
    
    ![image](https://cloud.githubusercontent.com/assets/2072857/9399179/8587dba6-4765-11e5-9189-7ebba52a2b6d.png)
    
    With this PR:
    
    ![image](https://cloud.githubusercontent.com/assets/2072857/9399185/a4735d74-4765-11e5-8848-1f1e361a6b4b.png)
    
    Even if the block size setting does not match the actual block size of the parquet file, I think it is still generally good to use parquet's block size setting if the min split size is smaller than this block size.
    
    Tested it on a cluster using
    ```
    val count = sqlContext.table("""store_sales""").groupBy().count().queryExecution.executedPlan(3).execute().count
    ```
    Basically, it reads 0 columns of table `store_sales`. My table has 1824 parquet files with sizes from 80MB to 280MB (1 to 3 row groups). Without this patch, in a 16-worker cluster, the job had 5023 tasks and took 102s. With this patch, the job had 2893 tasks and took 64s. It is still not as good as using one mapper per file (1824 tasks and 42s), but it is much better than our master.
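The arithmetic behind the 343MB local test can be sketched as follows, assuming a 128MB Parquet row group; the commit message does not state the actual row-group size, so that figure is an assumption chosen to match the reported task counts.

```python
import math

MB = 1024 * 1024


def effective_min_split(min_split_bytes, row_group_bytes):
    # Floor the min split size at the Parquet row-group size so no split
    # is smaller than one row group.
    return max(min_split_bytes, row_group_bytes)


def num_splits(file_bytes, split_bytes):
    return math.ceil(file_bytes / split_bytes)


# 32MB splits on a 343MB file: 11 tasks, but only the splits that start a
# row group actually read any data.
tasks_before = num_splits(343 * MB, 32 * MB)
# Raising the min split size to the row-group size: 3 tasks.
tasks_after = num_splits(343 * MB, effective_min_split(32 * MB, 128 * MB))
```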
    
    Author: Yin Huai <[email protected]>
    
    Closes #8346 from yhuai/parquetMinSplit.
    
    (cherry picked from commit e3355090d4030daffed5efb0959bf1d724c13c13)
    Signed-off-by: Yin Huai <[email protected]>

----

