[GitHub] spark pull request #16413: Branch 1.3

Kevy123 Mon, 26 Dec 2016 18:44:40 -0800

GitHub user Kevy123 opened a pull request:

    https://github.com/apache/spark/pull/16413


    Branch 1.3

    ## What changes were proposed in this pull request?
    
    (Please fill in changes proposed in this fix)
    
    ## How was this patch tested?
    
    (Please explain how this patch was tested. E.g. unit tests, integration 
tests, manual tests)
    (If this patch involves UI changes, please attach a screenshot; otherwise, 
remove this)
    
    Please review http://spark.apache.org/contributing.html before opening a 
pull request.


You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/spark branch-1.3

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16413.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16413
    
----
commit 60b9b96b227ff1821530e29210cc0c338ce97528
Author: Daoyuan Wang <[email protected]>
Date:   2015-03-23T03:46:16Z

    [SPARK-4985] [SQL] parquet support for date type
    
    This PR might have some issues with #3732, and it would have merge conflicts with #3820, so the review can be delayed until those two are merged.
    
    Author: Daoyuan Wang <[email protected]>
    
    Closes #3822 from adrian-wang/parquetdate and squashes the following 
commits:
    
    2c5d54d [Daoyuan Wang] add a test case
    faef887 [Daoyuan Wang] parquet support for primitive date
    97e9080 [Daoyuan Wang] parquet support for date type
    
    (cherry picked from commit 4659468f369d69e7f777130e5e3b4c5d47a624f1)
    Signed-off-by: Cheng Lian <[email protected]>

commit a29f49320d317141beef348510af83dee8d41adb
Author: Yadong Qi <[email protected]>
Date:   2015-03-23T10:16:49Z

    [SPARK-6397][SQL] Check the missingInput simply
    
    https://github.com/apache/spark/pull/5082
    
    /cc liancheng
    
    Author: Yadong Qi <[email protected]>
    
    Closes #5132 from watermen/sql-missingInput-new and squashes the following 
commits:
    
    1e5bdc5 [Yadong Qi] Check the missingInput simply
    
    (cherry picked from commit 9f3273bd9c919f6c48a95383b3d5be357c89998c)
    Signed-off-by: Cheng Lian <[email protected]>

commit 04b207815d9464401e135886fe44fa5422031cb0
Author: Volodymyr Lyubinets <[email protected]>
Date:   2015-03-24T00:00:27Z

    [SPARK-6124] Support jdbc connection properties in OPTIONS part of the query
    
    One more thing if this PR is considered to be OK - it might make sense to 
add extra .jdbc() API's that take Properties to SQLContext.
    
    Author: Volodymyr Lyubinets <[email protected]>
    
    Closes #4859 from vlyubin/jdbcProperties and squashes the following commits:
    
    7a8cfda [Volodymyr Lyubinets] Support jdbc connection properties in OPTIONS 
part of the query
    
    (cherry picked from commit bfd3ee9f76aaab3dcde71d92e2b8ca60a0e42262)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 6f10142de87c8f3725e2a306162ee143fc4ba851
Author: Cheng Lian <[email protected]>
Date:   2015-03-24T08:12:11Z

    [SPARK-6452] [SQL] Checks for missing attributes and unresolved operator 
for all types of operator
    
    In `CheckAnalysis`, `Filter` and `Aggregate` are checked in separate case clauses, so they never reach the clauses that check for unresolved operators and missing input attributes.
    
    This PR also removes the `prettyString` call when generating the error message for missing input attributes, because the result of `prettyString` doesn't contain expression IDs and may give confusing messages like
    
    > resolved attributes a missing from a
    
    cc rxin
    
    Author: Cheng Lian <[email protected]>
    
    Closes #5129 from liancheng/spark-6452 and squashes the following commits:
    
    52cdc69 [Cheng Lian] Addresses comments
    029f9bd [Cheng Lian] Checks for missing attributes and unresolved operator 
for all types of operator
    
    (cherry picked from commit 1afcf773d0cafdfd9bf106fdc5c429ed2ba3dd36)
    Signed-off-by: Michael Armbrust <[email protected]>

commit e5451432e3c7412e9bace66386b209f4b824dbcb
Author: Cong Yue <[email protected]>
Date:   2015-03-24T12:56:13Z

    Update the command to use IPython notebook
    
    Since "notebook --pylab inline" is no longer supported, update the related documentation.
    
    Author: Cong Yue <[email protected]>
    
    Closes #5111 from yuecong/patch-1 and squashes the following commits:
    
    872df76 [Cong Yue] Update the command to use IPython notebook
    
    (cherry picked from commit c12312f8b16bb8f9355d5f9e786c5a608863eb01)
    Signed-off-by: Sean Owen <[email protected]>

commit 8722369c24991d050e8b2f55c271eeab06005fe7
Author: Kousuke Saruta <[email protected]>
Date:   2015-03-24T16:13:25Z

    [SPARK-5559] [Streaming] [Test] Remove opportunity we met flakiness when running FlumeStreamSuite
    
    When we run FlumeStreamSuite on Jenkins, we sometimes get an error like the following.
    
        sbt.ForkMain$ForkError: The code passed to eventually never returned normally. Attempted 52 times over 10.094849836 seconds. Last failure message: Error connecting to localhost/127.0.0.1:23456.
            at org.scalatest.concurrent.Eventually$class.tryTryAgain$1(Eventually.scala:420)
            at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:438)
            at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
            at org.scalatest.concurrent.Eventually$class.eventually(Eventually.scala:307)
            at org.scalatest.concurrent.Eventually$.eventually(Eventually.scala:478)
            at org.apache.spark.streaming.flume.FlumeStreamSuite.writeAndVerify(FlumeStreamSuite.scala:116)
            at org.apache.spark.streaming.flume.FlumeStreamSuite.org$apache$spark$streaming$flume$FlumeStreamSuite$$testFlumeStream(FlumeStreamSuite.scala:74)
            at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply$mcV$sp(FlumeStreamSuite.scala:66)
            at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
            at org.apache.spark.streaming.flume.FlumeStreamSuite$$anonfun$3.apply(FlumeStreamSuite.scala:66)
            at org.scalatest.Transformer$$anonfun$apply$1.apply$mcV$sp(Transformer.scala:22)
            at org.scalatest.OutcomeOf$class.outcomeOf(OutcomeOf.scala:85)
            at org.scalatest.OutcomeOf$.outcomeOf(OutcomeOf.scala:104)
            at org.scalatest.Transformer.apply(Transformer.scala:22)
            at org.scalatest.Transformer.apply(Transformer.scala:20)
            at org.scalatest.FunSuiteLike$$anon$1.apply(FunSuiteLike.scala:166)
            at org.scalatest.Suite$class.withFixture(Suite.scala:1122)
            at org.scalatest.FunSuite.withFixture(FunSuite.scala:1555)
            at org.scalatest.FunSuiteLike$class.invokeWithFixture$1(FunSuiteLike.scala:163)
            at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
            at org.scalatest.FunSuiteLike$$anonfun$runTest$1.apply(FunSuiteLike.scala:175)
            at org.scalatest.SuperEngine.runTestImpl(Engine.scala:306)
            at org.scalatest.FunSuiteLike$class.runTest(FunSuiteLike.scala:175)
    
    This error is caused by the check-then-act logic used to find a free port.
    
          /** Find a free port */
          private def findFreePort(): Int = {
            Utils.startServiceOnPort(23456, (trialPort: Int) => {
              val socket = new ServerSocket(trialPort)
              socket.close()
              (null, trialPort)
            }, conf)._2
          }
    
    Removing the check-then-act is not easy, but we can reduce the chance of hitting the error by choosing a random value for the initial port instead of 23456.
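
The race can be illustrated with a small sketch. This is not the Spark test code: it is a Python illustration, and the names `random_start_port` and `find_free_port_check_then_act` are hypothetical. The probing function mirrors the bind, close, return pattern shown above. The random starting point only makes collisions between concurrent test runs less likely; it does not remove the race.

```python
import random
import socket


def random_start_port(low=1024, high=65000):
    """Pick a random non-privileged port to start probing from."""
    return random.randint(low, high)


def find_free_port_check_then_act(start_port, max_tries=100):
    """Probe upward from start_port and return the first port that binds.

    The probe socket is closed before returning, so the port is only known
    to have been free at probe time. Another process can still take it
    before the caller binds it again: that gap is the check-then-act race.
    """
    for port in range(start_port, min(start_port + max_tries, 65536)):
        probe = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        try:
            probe.bind(("127.0.0.1", port))
        except OSError:
            continue  # port is taken, try the next one
        finally:
            probe.close()
        return port
    raise RuntimeError("no free port found")
```

A test would call `find_free_port_check_then_act(random_start_port())`, so two suites running on the same machine rarely begin probing at the same port.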
    
    Author: Kousuke Saruta <[email protected]>
    
    Closes #4337 from sarutak/SPARK-5559 and squashes the following commits:
    
    16f109f [Kousuke Saruta] Added `require` to Utils#startServiceOnPort
    c39d8b6 [Kousuke Saruta] Merge branch 'SPARK-5559' of 
github.com:sarutak/spark into SPARK-5559
    1610ba2 [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into SPARK-5559
    33357e3 [Kousuke Saruta] Changed "findFreePort" method in MQTTStreamSuite 
and FlumeStreamSuite so that it can choose valid random port
    a9029fe [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into SPARK-5559
    9489ef9 [Kousuke Saruta] Merge branch 'master' of 
git://git.apache.org/spark into SPARK-5559
    8212e42 [Kousuke Saruta] Modified default port used in FlumeStreamSuite 
from 23456 to random value
    
    (cherry picked from commit 85cf0636825d1997d64d0bdc04618f29b7222da1)
    Signed-off-by: Sean Owen <[email protected]>

commit 4ff577160235c0ca82de8330702ed07293024de1
Author: Peter Rudenko <[email protected]>
Date:   2015-03-24T16:33:38Z

    [ML][docs][minor] Define LabeledDocument/Document classes in CV example
    
    To make the Cross-Validation example code snippet easier to copy and paste, LabeledDocument/Document need to be defined in it, since they are currently defined only in a previous example.
    
    Author: Peter Rudenko <[email protected]>
    
    Closes #5135 from petro-rudenko/patch-3 and squashes the following commits:
    
    5190c75 [Peter Rudenko] Fix primitive types for java examples.
    1d35383 [Peter Rudenko] [SQL][docs][minor] Define LabeledDocument/Document 
classes in CV example
    
    (cherry picked from commit 08d452801195cc6cf0697a594e98cd4778f358ee)
    Signed-off-by: Sean Owen <[email protected]>

commit bc92a2e405241542770a64adfef39dcb02e96461
Author: Xiangrui Meng <[email protected]>
Date:   2015-03-20T19:02:57Z

    [SPARK-5955][MLLIB] add checkpointInterval to ALS
    
    Add checkpointInterval to ALS to prevent:
    
    1. StackOverflow exceptions caused by long lineage,
    2. large shuffle files generated during iterations,
    3. slow recovery when some nodes fail.
    
    srowen coderxiang
    
    Author: Xiangrui Meng <[email protected]>
    
    Closes #5076 from mengxr/SPARK-5955 and squashes the following commits:
    
    df56791 [Xiangrui Meng] update impl to reuse code
    29affcb [Xiangrui Meng] do not materialize factors in implicit
    20d3f7f [Xiangrui Meng] add checkpointInterval to ALS
    
    (cherry picked from commit 6b36470c66bd6140c45e45d3f1d51b0082c3fd97)
    Signed-off-by: Xiangrui Meng <[email protected]>
    
    Conflicts:
        mllib/src/main/scala/org/apache/spark/ml/recommendation/ALS.scala

commit f0141ca385047550594541703809f5ddf75b480a
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T19:09:02Z

    [SPARK-6459][SQL] Warn when constructing trivially true equals predicate
    
    For example, one might expect the following code to work, but it does not.  
Now you will at least get a warning with a suggestion to use aliases.
    
    ```scala
    val df = sqlContext.load(path, "parquet")
    val txns = df.groupBy("cust_id").agg($"cust_id", 
countDistinct($"day_num").as("txns"))
    val spend = df.groupBy("cust_id").agg($"cust_id", 
sum($"extended_price").as("spend"))
    val rmJoin = txns.join(spend, txns("cust_id") === spend("cust_id"), "inner")
    ```
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5163 from marmbrus/selfJoinError and squashes the following commits:
    
    16c1f0b [Michael Armbrust] fix visibility
    1b57e8d [Michael Armbrust] Warn when constructing trivially true equals 
predicate
    
    (cherry picked from commit 32efadd0500f10bddf2ae8456c9e719ec52940f1)
    Signed-off-by: Michael Armbrust <[email protected]>

commit c0101d3927714fe09e791d9e0f7dc601a8f9d585
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T19:10:30Z

    [SPARK-6437][SQL] Use completion iterator to close external sorter
    
    Otherwise we will leak files when spilling occurs.
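
The idea of a completion iterator can be sketched in a few lines. This is a Python analogue under stated assumptions, not the Spark implementation: `completion_iterator` and its `on_complete` callback are hypothetical names, and unlike an iterator that only fires on exhaustion, this generator also runs the callback if the consumer closes it early.

```python
def completion_iterator(items, on_complete):
    """Yield every element of items, then run on_complete exactly once.

    The finally clause runs when the underlying iterable is exhausted,
    and also when the generator is closed before exhaustion.
    """
    try:
        for item in items:
            yield item
    finally:
        on_complete()
```

Wrapping a sorter's output this way ties the cleanup of spill files to the end of iteration instead of relying on the caller to remember it.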
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5161 from marmbrus/cleanupAfterSort and squashes the following 
commits:
    
    cb13d3c [Michael Armbrust] hint to inferencer
    cdebdf5 [Michael Armbrust] Use completion iterator to close external sorter
    
    (cherry picked from commit 26c6ce3d2947df5a294b1ad4a22fae5d31d06c19)
    Signed-off-by: Michael Armbrust <[email protected]>

commit c699e2b766a7cb9e03762bf278d7b19f631cb4e8
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T19:28:01Z

    [SPARK-6054][SQL] Fix transformations of TreeNodes that hold StructTypes
    
    Due to a recent change that made `StructType` a `Seq` we started 
inadvertently turning `StructType`s into generic `Traversable` when attempting 
nested tree transformations.  In this PR we explicitly avoid descending into 
`DataType`s to avoid this bug.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5157 from marmbrus/udfFix and squashes the following commits:
    
    26f7087 [Michael Armbrust] Fix transformations of TreeNodes that hold 
StructTypes
    
    (cherry picked from commit 3fa3d121dfec60f9768d3859e8450ee482b2d4e8)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 930b667e5ca55e5e8e658bf0912a93179b98073f
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T19:32:25Z

    Revert "[SPARK-5680][SQL] Sum function on all null values, should return 
zero"
    
    This reverts commit 93975a3786fbf4581553b347fa56fb2b7da6f861.

commit 92bf888aefc87ec434636961215480aebc0e2c15
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T20:22:46Z

    [SPARK-6375][SQL] Fix formatting of error messages.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5155 from marmbrus/errorMessages and squashes the following commits:
    
    b898188 [Michael Armbrust] Fix formatting of error messages.
    
    (cherry picked from commit 046c1e2aa459147bf592371bb9fb7a65edb182e7)
    Signed-off-by: Michael Armbrust <[email protected]>

commit df671bc36c1c57456e0b2a2391a2a276b3d38bec
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T21:08:20Z

    [SPARK-6376][SQL] Avoid eliminating subqueries until optimization
    
    Previously it was okay to throw away subqueries after analysis, as we would 
never try to use that tree for resolution again.  However, with eager analysis 
in `DataFrame`s this can cause errors for queries such as:
    
    ```scala
    val df = Seq(1,2,3).map(i => (i, i.toString)).toDF("int", "str")
    df.as('x).join(df.as('y), $"x.str" === $"y.str").groupBy("x.str").count()
    ```
    
    As a result, in this PR we defer the elimination of subqueries until the 
optimization phase.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5160 from marmbrus/subqueriesInDfs and squashes the following 
commits:
    
    a9bb262 [Michael Armbrust] Update Optimizer.scala
    27d25bf [Michael Armbrust] fix hive tests
    9137e03 [Michael Armbrust] add type
    81cd597 [Michael Armbrust] Avoid eliminating subqueries until optimization
    
    (cherry picked from commit cbeaf9ebab31a0bcbca884d4db7a791fd9edbff3)
    Signed-off-by: Michael Armbrust <[email protected]>

commit f48c16d3b3ace92eee6d02009a90f5d8b3e3dc75
Author: Michael Armbrust <[email protected]>
Date:   2015-03-24T21:10:56Z

    [SPARK-6458][SQL] Better error messages for invalid data sources
    
    Avoid unclear match errors and use `AnalysisException`.
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5158 from marmbrus/dataSourceError and squashes the following 
commits:
    
    af9f82a [Michael Armbrust] Yins comment
    90c6ba4 [Michael Armbrust] Better error messages for invalid data sources
    
    (cherry picked from commit a8f51b82968147abebbe61b8b68b066d21a0c6e6)
    Signed-off-by: Michael Armbrust <[email protected]>

commit dcf56aa8b828eeae1152a3969fd7153178c3a8a7
Author: Josh Rosen <[email protected]>
Date:   2015-03-24T21:38:20Z

    [SPARK-6209] Clean up connections in ExecutorClassLoader after failing to 
load classes (master branch PR)
    
    ExecutorClassLoader does not ensure proper cleanup of network connections 
that it opens. If it fails to load a class, it may leak partially-consumed 
InputStreams that are connected to the REPL's HTTP class server, causing that 
server to exhaust its thread pool, which can cause the entire job to hang.  See 
[SPARK-6209](https://issues.apache.org/jira/browse/SPARK-6209) for more 
details, including a bug reproduction.
    
    This patch fixes this issue by ensuring proper cleanup of these resources.  
It also adds logging for unexpected error cases.
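
The cleanup pattern described here can be sketched as follows. This is a generic Python illustration, not the ExecutorClassLoader code: `read_class_bytes`, `open_stream`, and `parse` are hypothetical names standing in for "open a connection" and "consume the response".

```python
import io


def read_class_bytes(open_stream, parse):
    """Open a stream, parse it, and always close it.

    The finally clause closes the stream even when parse raises after
    consuming only part of the input, so a failed load does not leave a
    half-read connection behind.
    """
    stream = open_stream()
    try:
        return parse(stream)
    finally:
        stream.close()
```

With this shape, a parse failure still propagates to the caller, but the stream is released first.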
    
    This PR is an extended version of #4935 and adds a regression test.
    
    Author: Josh Rosen <[email protected]>
    
    Closes #4944 from JoshRosen/executorclassloader-leak-master-branch and 
squashes the following commits:
    
    e0e3c25 [Josh Rosen] Wrap try block around getReponseCode; re-enable 
keep-alive by closing error stream
    961c284 [Josh Rosen] Roll back changes that were added to get the 
regression test to fail
    7ee2261 [Josh Rosen] Add a failing regression test
    e2d70a3 [Josh Rosen] Properly clean up after errors in ExecutorClassLoader
    
    (cherry picked from commit 7215aa745590a3eec9c1ff35d28194235a550db7)
    Signed-off-by: Andrew Or <[email protected]>

commit 586e0d924ebbfdeaf83d4d71506705c5e7aaf6f9
Author: Reynold Xin <[email protected]>
Date:   2015-03-24T23:03:55Z

    [SPARK-6428][SQL] Added explicit types for all public methods in catalyst
    
    I think after this PR, we can finally turn the rule on. There are still 
some smaller ones that need to be fixed, but those are easier.
    
    Author: Reynold Xin <[email protected]>
    
    Closes #5162 from rxin/catalyst-explicit-types and squashes the following 
commits:
    
    e7eac03 [Reynold Xin] [SPARK-6428][SQL] Added explicit types for all public 
methods in catalyst.
    
    (cherry picked from commit 73348012d4ce6c9db85dfb48d51026efe5051c73)
    Signed-off-by: Reynold Xin <[email protected]>
    
    Conflicts:
        
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/aggregates.scala
        
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/namedExpressions.scala
        
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/util/package.scala

commit de8b2d4be84e48a764cf17ea0d292f177274d5ad
Author: Kay Ousterhout <[email protected]>
Date:   2015-03-24T23:26:43Z

    [SPARK-6088] Correct how tasks that get remote results are shown in UI.
    
    It would be great to fix this for 1.3, since the fix is surgical and it makes the UI easier for users to understand.
    
    cc shivaram pwendell
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes #4839 from kayousterhout/SPARK-6088 and squashes the following 
commits:
    
    3ab012c [Kay Ousterhout] Update getting result time incrementally, 
correctly set GET_RESULT status
    f346b49 [Kay Ousterhout] Typos
    748ea6b [Kay Ousterhout] Fixed build failure
    84d617c [Kay Ousterhout] [SPARK-6088] Correct how tasks that get remote 
results are shown in the UI.
    
    (cherry picked from commit 6948ab6f8ba836446b005f2cf1cc4abc944c5053)
    Signed-off-by: Andrew Or <[email protected]>

commit e4db5a327b2c5f65c442e2ff7ec333fbee8e27e7
Author: Kay Ousterhout <[email protected]>
Date:   2015-03-24T23:29:40Z

    [SPARK-3570] Include time to open files in shuffle write time.
    
    Opening shuffle files can be very significant when the disk is
    contended, especially when using ext3. While writing data to
    a file can avoid hitting disk (and instead hit the buffer
    cache), opening a file always involves writing some metadata
    about the file to disk, so the open time can be a very significant
    portion of the shuffle write time. In one job I ran recently, the time to
    write shuffle data to the file was only 4ms for each task, but
    the time to open the file was about 100x as long (~400ms).
    
    When we add metrics about spilled data (#2504), we should ensure
    that the file open time is also included there.
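
A minimal sketch of measuring both phases is shown below. It is a Python illustration, not the Spark shuffle writer: `timed_write` is a hypothetical helper, and the measured values depend entirely on the file system and disk load.

```python
import os
import tempfile
import time


def timed_write(path, data):
    """Write data to path and return (open_ns, write_ns).

    open_ns covers only the open() call; write_ns covers the write and
    the close that flushes it. A metric that reports write_ns alone
    would miss the time spent opening the file.
    """
    t0 = time.perf_counter_ns()
    f = open(path, "wb")
    t1 = time.perf_counter_ns()
    try:
        f.write(data)
    finally:
        f.close()
    t2 = time.perf_counter_ns()
    return t1 - t0, t2 - t1
```

Reporting `open_ns + write_ns` as the write time corresponds to the change described above.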
    
    Author: Kay Ousterhout <[email protected]>
    
    Closes #4550 from kayousterhout/SPARK-3570 and squashes the following 
commits:
    
    ea3a4ae [Kay Ousterhout] Added comment about excluded open time
    fdc5185 [Kay Ousterhout] Improved comment
    42b7e43 [Kay Ousterhout] Fixed parens for nanotime
    2423555 [Kay Ousterhout] [SPARK-3570] Include time to open files in shuffle 
write time.
    
    (cherry picked from commit d8ccf655f344eed65cdaf5d9252f1b565b8406ca)
    Signed-off-by: Andrew Or <[email protected]>

commit 6af9408a998baacc2f0f1e929f189d170b5e4868
Author: Christophe Préaud <[email protected]>
Date:   2015-03-25T00:05:49Z

    [SPARK-6469] Improving documentation on YARN local directories usage
    
    Clarify the local directories usage in YARN
    
    Author: Christophe Préaud <[email protected]>
    
    Closes #5165 from preaudc/yarn-doc-local-dirs and squashes the following 
commits:
    
    6912b90 [Christophe Préaud] Fix some formatting issues.
    4fa8ec2 [Christophe Préaud] Merge remote-tracking branch 'upstream/master' into yarn-doc-local-dirs
    eaaf519 [Christophe Préaud] Clarify the local directories usage in YARN
    436fb7d [Christophe Préaud] Revert "Clarify the local directories usage in YARN"
    876ae5e [Christophe Préaud] Clarify the local directories usage in YARN
    608dbfa [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
    a49a2ce [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
    9ba89ca [Christophe Préaud] Ensure that files are fetched atomically
    54419ae [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
    c6a5590 [Christophe Préaud] Revert commit 8ea871f8130b2490f1bad7374a819bf56f0ccbbd
    7456a33 [Christophe Préaud] Merge remote-tracking branch 'upstream/master'
    8ea871f [Christophe Préaud] Ensure that files are fetched atomically

commit 8e4e2e3f8ca677fb1599ae87b17dbf5e21e02849
Author: Bill Chambers <[email protected]>
Date:   2015-03-25T05:24:35Z

    [DOCUMENTATION]Fixed Missing Type Import in Documentation
    
    Needed to import the types specifically, not the more general pyspark.sql
    
    Author: Bill Chambers <[email protected]>
    Author: anabranch <[email protected]>
    
    Closes #5179 from anabranch/master and squashes the following commits:
    
    8fa67bf [anabranch] Corrected SqlContext Import
    603b080 [Bill Chambers] [DOCUMENTATION]Fixed Missing Type Import in 
Documentation
    
    (cherry picked from commit c5cc41468e8709d09c09289bb55bc8edc99404b1)
    Signed-off-by: Reynold Xin <[email protected]>

commit 2be4255a05e7a1548f51b02f6bf62507f1c3414b
Author: Yanbo Liang <[email protected]>
Date:   2015-03-25T17:05:56Z

    [SPARK-6496] [MLLIB] GeneralizedLinearAlgorithm.run(input, initialWeights) 
should initialize numFeatures
    
    In GeneralizedLinearAlgorithm, `numFeatures` defaults to -1, so we need to update it to the correct value when we call run() to train a model.
    `LogisticRegressionWithLBFGS.run(input)` works well, but when we call `LogisticRegressionWithLBFGS.run(input, initialWeights)` to train a multiclass classification model, it throws an exception because numFeatures is not updated.
    In this PR, we just update numFeatures at the beginning of GeneralizedLinearAlgorithm.run(input, initialWeights) and add a test case.
    
    Author: Yanbo Liang <[email protected]>
    
    Closes #5167 from yanboliang/spark-6496 and squashes the following commits:
    
    8131c48 [Yanbo Liang] LogisticRegressionWithLBFGS.run(input, 
initialWeights) should initialize numFeatures
    
    (cherry picked from commit 10c78607b2724f5a64b0cdb966e9c5805f23919b)
    Signed-off-by: Sean Owen <[email protected]>

commit 6791f425d52b6c69a3675651a7e3f6622c55ce89
Author: Michael Griffiths <[email protected]>
Date:   2015-02-28T14:47:39Z

    SPARK-6063 MLlib doesn't pass mvn scalastyle check due to UTF chars in 
LDAModel.scala
    
    Remove unicode characters from MLlib file.
    
    Author: Michael Griffiths <[email protected]>
    Author: Griffiths, Michael (NYC-RPM) <[email protected]>
    
    Closes #4815 from msjgriffiths/SPARK-6063 and squashes the following 
commits:
    
    bcd7de1 [Griffiths, Michael (NYC-RPM)] Change \u201D quote marks around 
'theta' to standard single apostrophe (\x27)
    38eb535 [Michael Griffiths] Merge pull request #2 from apache/master
    b08e865 [Michael Griffiths] Merge pull request #1 from apache/master

commit 4efa6c5dd4038f0e78347180a956e2006a663262
Author: DoingDone9 <[email protected]>
Date:   2015-03-25T18:11:52Z

    [SPARK-6409][SQL] It is not necessary to avoid the old interface of Hive, because this will make some UDAFs unable to work.
    
    Spark avoids the old interface of Hive, so some UDAFs, such as "org.apache.hadoop.hive.ql.udf.generic.GenericUDAFAverage", cannot work.
    
    Author: DoingDone9 <[email protected]>
    
    Closes #5131 from DoingDone9/udaf and squashes the following commits:
    
    9de08d0 [DoingDone9] Update HiveUdfSuite.scala
    49c62dc [DoingDone9] Update hiveUdfs.scala
    98b134f [DoingDone9] Merge pull request #5 from apache/master
    161cae3 [DoingDone9] Merge pull request #4 from apache/master
    c87e8b6 [DoingDone9] Merge pull request #3 from apache/master
    cb1852d [DoingDone9] Merge pull request #2 from apache/master
    c3f046f [DoingDone9] Merge pull request #1 from apache/master
    
    (cherry picked from commit 968408b345a0e26f7ee9105a6a0c3456cf10576a)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 0cd47489362f7a8397520d28ac24f0dee04e1388
Author: Cheng Lian <[email protected]>
Date:   2015-03-26T00:40:19Z

    [SPARK-6450] [SQL] Fixes metastore Parquet table conversion
    
    The `ParquetConversions` analysis rule generates a hash map, which maps 
from the original `MetastoreRelation` instances to the newly created 
`ParquetRelation2` instances. However, `MetastoreRelation.equals` doesn't 
compare output attributes. Thus, if a single metastore Parquet table appears 
multiple times in a query, only a single entry ends up in the hash map, and the 
conversion is not correctly performed.
    
    The proper fix for this issue would be to override `equals` and `hashCode` for MetastoreRelation. Unfortunately, this breaks more tests than expected. It's possible that these tests are ill-formed from the very beginning. As the 1.3.1 release is approaching, we'd like to make the change more surgical to avoid potential regressions. The proposed fix here is to use both the metastore relations and their output attributes as keys in the hash map used in ParquetConversions.
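
The key collision and the proposed composite key can be illustrated with a toy model. This is a Python sketch, not Catalyst code: the `Relation` class and `build_conversion_map` helper are hypothetical stand-ins, with `Relation` deliberately ignoring `output` in its equality, as the description says `MetastoreRelation.equals` does.

```python
class Relation:
    """Stand-in for a metastore relation whose equality ignores output."""

    def __init__(self, table, output):
        self.table = table
        self.output = tuple(output)

    def __eq__(self, other):
        return isinstance(other, Relation) and self.table == other.table

    def __hash__(self):
        return hash(self.table)


def build_conversion_map(relations, key):
    """Map key(relation) to a label describing the converted relation."""
    return {
        key(r): "converted:%s:%s" % (r.table, ",".join(r.output))
        for r in relations
    }
```

Keying by the relation alone collapses two occurrences of the same table into one entry, while keying by `(relation, relation.output)` keeps one entry per occurrence.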
    
    Author: Cheng Lian <[email protected]>
    
    Closes #5183 from liancheng/spark-6450 and squashes the following commits:
    
    3536780 [Cheng Lian] Fixes metastore Parquet table conversion
    
    (cherry picked from commit 8c3b0052f4792d97d23244ade335676e37cb1fae)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 9edb34fc38147bd150340e006f05d346b4a40f8c
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T02:21:54Z

    [SPARK-6463][SQL] AttributeSet.equal should compare size
    
    Previously this could result in sets comparing as equal when in fact the right was a subset of the left.
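
The failure mode is easy to reproduce with plain sets. The sketch below is a Python illustration, not the `AttributeSet` source: both function names are hypothetical, and the first one assumes an equality check that only tests whether every element of the right side is contained in the left.

```python
def equals_without_size_check(left, right):
    """Containment only: true whenever right is a subset of left."""
    return all(item in left for item in right)


def equals_with_size_check(left, right):
    """Containment plus equal size: true only for equal sets."""
    return len(left) == len(right) and all(item in left for item in right)
```

For `left = {"a", "b"}` and `right = {"a"}`, the first function reports equality and the second does not.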
    
    Based on #5133 by sisihj
    
    Author: sisihj <[email protected]>
    Author: Michael Armbrust <[email protected]>
    
    Closes #5194 from marmbrus/pr/5133 and squashes the following commits:
    
    5ed4615 [Michael Armbrust] fix imports
    d4cbbc0 [Michael Armbrust] Add test cases
    0a0834f [sisihj]  AttributeSet.equal should compare size
    
    (cherry picked from commit 276ef1c3cfd44b5fc082e1a495fff22fbaf6add3)
    Signed-off-by: Michael Armbrust <[email protected]>

commit 0ba759985288f5df6940c37f5f401bc31de53a1c
Author: Davies Liu <[email protected]>
Date:   2015-03-26T07:01:24Z

    [SPARK-6536] [PySpark] Column.inSet() in Python
    
    ```
    >>> df[df.name.inSet("Bob", "Mike")].collect()
    [Row(age=5, name=u'Bob')]
    >>> df[df.age.inSet([1, 2, 3])].collect()
    [Row(age=2, name=u'Alice')]
    ```
    
    Author: Davies Liu <[email protected]>
    
    Closes #5190 from davies/in and squashes the following commits:
    
    6b73a47 [Davies Liu] Column.inSet() in Python
    
    (cherry picked from commit f535802977c5a3ce45894d89fdf59f8723f023c8)
    Signed-off-by: Reynold Xin <[email protected]>

commit 8254996557512b8bbc8fd35c550004b56144581f
Author: Michael Armbrust <[email protected]>
Date:   2015-03-26T10:46:57Z

    [SPARK-6465][SQL] Fix serialization of GenericRowWithSchema using kryo
    
    Author: Michael Armbrust <[email protected]>
    
    Closes #5191 from marmbrus/kryoRowsWithSchema and squashes the following 
commits:
    
    bb83522 [Michael Armbrust] Fix serialization of GenericRowWithSchema using 
kryo
    f914f16 [Michael Armbrust] Add no arg constructor to GenericRowWithSchema
    
    (cherry picked from commit f88f51bbd461e0a42ad7021147268509b9c3c56e)
    Signed-off-by: Cheng Lian <[email protected]>

commit 836c9216599b676ae8f421384f4f20fd35e8c53b
Author: Yash Datta <[email protected]>
Date:   2015-03-26T13:13:38Z

    [SQL][SPARK-6471]: Metastore schema should only be a subset of parquet 
schema to support dropping of columns using replace columns
    
    Currently in the parquet relation 2 implementation, an error is thrown if the merged schema is not exactly the same as the metastore schema.
    But to support cases like deleting a column using the replace columns command, we can relax the restriction so that the query still works even if the metastore schema is a subset of the merged parquet schema.
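
The relaxed check can be sketched as a subset test on field names. This is a simplified Python illustration, not the Spark schema reconciliation code: `metastore_schema_is_compatible` is a hypothetical helper, and comparing names only (ignoring types and case) is an assumption made to keep the example short.

```python
def metastore_schema_is_compatible(metastore_fields, parquet_fields):
    """Return True when every metastore field exists in the Parquet schema.

    Extra Parquet fields are allowed: they correspond to columns that
    were dropped from the metastore but still exist in the data files.
    """
    parquet_names = set(parquet_fields)
    return all(name in parquet_names for name in metastore_fields)
```

Under the stricter rule, the two lists would have to match exactly; here a metastore schema of `["id", "name"]` is accepted against files that still contain `["id", "name", "legacy_col"]`.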
    
    Author: Yash Datta <[email protected]>
    
    Closes #5141 from saucam/replace_col and squashes the following commits:
    
    e858d5b [Yash Datta] SPARK-6471: Fix test cases, add a new test case for 
metastore schema to be subset of parquet schema
    5f2f467 [Yash Datta] SPARK-6471: Metastore schema should only be a subset 
of parquet schema to support dropping of columns using replace columns
    
    (cherry picked from commit 1c05027a143d1b0bf3df192984e6cac752b1e926)
    Signed-off-by: Cheng Lian <[email protected]>

commit 5b5f0e2b08941bde2655b1aec9b2ae28c377be78
Author: guliangliang <[email protected]>
Date:   2015-03-26T13:28:56Z

    [SPARK-6491] Spark will put the current working dir to the CLASSPATH
    
    When running "bin/compute-classpath.sh", the output will be:
    
    :/spark/conf:/spark/assembly/target/scala-2.10/spark-assembly-1.3.0-hadoop2.5.0-cdh5.2.0.jar:/spark/lib_managed/jars/datanucleus-rdbms-3.2.9.jar:/spark/lib_managed/jars/datanucleus-api-jdo-3.2.6.jar:/spark/lib_managed/jars/datanucleus-core-3.2.10.jar
    Java will add the current working dir to the CLASSPATH if the leading ":" is present, which is not expected by Spark users.
    For example, if I call spark-shell in the folder /root and there exists a "core-site.xml" under /root/, Spark will use this file as the HADOOP CONF file, even if I have already set HADOOP_CONF_DIR=/etc/hadoop/conf.
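
The effect of the leading separator can be shown with a small model of classpath parsing. This is a Python sketch, not the Spark launcher script: `classpath_entries` and `append_to_classpath` are hypothetical helpers, and the sketch assumes the JVM behaviour described above, where an empty classpath entry is treated as the current working directory.

```python
def classpath_entries(classpath, sep=":"):
    """Split a classpath, showing empty entries as '.' (current dir)."""
    return ["." if entry == "" else entry for entry in classpath.split(sep)]


def append_to_classpath(classpath, entry, sep=":"):
    """Append entry without producing a leading separator."""
    return entry if classpath == "" else classpath + sep + entry
```

Building the string with `append_to_classpath`, starting from an empty value, never yields a leading `:` and therefore never introduces the implicit `.` entry.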
    
    Author: guliangliang <[email protected]>
    
    Closes #5156 from marsishandsome/Spark6491 and squashes the following 
commits:
    
    5ae214f [guliangliang] use appendToClasspath to change CLASSPATH
    b21f3b2 [guliangliang] keep the classpath order
    5d1f870 [guliangliang] [SPARK-6491] Spark will put the current working dir 
to the CLASSPATH

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---
