GitHub user nssalian opened a pull request:
https://github.com/apache/spark/pull/6860
Addition of Python example for SPARK-8320
Added Python code to the "Level of Parallelism in Data Receiving" section of
https://spark.apache.org/docs/latest/streaming-programming-guide.html.
Please review and let me know if there are any additional changes that are
needed.
Thank you.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/nssalian/spark master
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/6860.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #6860
----
commit 5a1a1075a607be683f008ef92fa227803370c45f
Author: Andrew Or <[email protected]>
Date: 2015-05-04T16:17:55Z
[MINOR] Fix python test typo?
I suspect we haven't been using anaconda in tests in a while. I wonder if this
change actually does anything, but this line as it stands looks strictly less
correct.
Author: Andrew Or <[email protected]>
Closes #5883 from andrewor14/fix-run-tests-typo and squashes the following
commits:
a3ad720 [Andrew Or] Fix typo?
commit e0833c5958bbd73ff27cfe6865648d7b6e5a99bc
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-04T18:28:59Z
[SPARK-5956] [MLLIB] Pipeline components should be copyable.
This PR added `copy(extra: ParamMap): Params` to `Params`, which makes a
copy of the current instance with a randomly generated uid and some extra param
values. With this change, we only need to implement `fit` and `transform`
without extra param values given the default implementation of `fit(dataset,
extra)`:
~~~scala
def fit(dataset: DataFrame, extra: ParamMap): Model = {
copy(extra).fit(dataset)
}
~~~
Inside `fit` and `transform`, since only the embedded values are used, I
added `$` as an alias for `getOrDefault` to make the code easier to read. For
example, in `LinearRegression.fit` we have:
~~~scala
val effectiveRegParam = $(regParam) / yStd
val effectiveL1RegParam = $(elasticNetParam) * effectiveRegParam
val effectiveL2RegParam = (1.0 - $(elasticNetParam)) * effectiveRegParam
~~~
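The `$`/`getOrDefault` idea above is easy to mimic outside Scala. A minimal Python sketch of get-or-default parameter resolution plus `copy(extra)` (class and method names here are illustrative, not the actual spark.ml API):

```python
class Params:
    """Sketch of get-or-default param resolution and copy(extra)."""
    def __init__(self, defaults):
        self._defaults = dict(defaults)   # default param values
        self._values = {}                 # explicitly set param values

    def set(self, name, value):
        self._values[name] = value
        return self

    def get_or_default(self, name):
        # Explicitly set values win; otherwise fall back to the default.
        if name in self._values:
            return self._values[name]
        return self._defaults[name]

    def copy(self, extra):
        # Like copy(extra: ParamMap): a copy with extra values overlaid.
        c = Params(self._defaults)
        c._values = {**self._values, **extra}
        return c

lr = Params({"regParam": 0.0, "elasticNetParam": 0.0})
fitted = lr.copy({"regParam": 0.1})
print(fitted.get_or_default("regParam"))         # 0.1
print(fitted.get_or_default("elasticNetParam"))  # 0.0
```

This is why `fit(dataset, extra)` can default to `copy(extra).fit(dataset)`: the copy already carries the extra values, so the core `fit` only ever reads embedded params.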
Meta-algorithms like `Pipeline` implement their own `copy(extra)`, so the
fitted pipeline model stores all copied stages (whether each stage is a
transformer or a model).
Other changes:
* `Params$.inheritValues` is moved to `Params.copyValues` and returns the
target instance.
* `fittingParamMap` was removed because the `parent` carries this
information.
* `validate` was renamed to `validateParams` to be more precise.
TODOs:
* [x] add tests for newly added methods
* [ ] update documentation
jkbradley dbtsai
Author: Xiangrui Meng <[email protected]>
Closes #5820 from mengxr/SPARK-5956 and squashes the following commits:
7bef88d [Xiangrui Meng] address comments
05229c3 [Xiangrui Meng] assert -> assertEquals
b2927b1 [Xiangrui Meng] organize imports
f14456b [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-5956
93e7924 [Xiangrui Meng] add tests for hasParam & copy
463ecae [Xiangrui Meng] merge master
2b954c3 [Xiangrui Meng] update Binarizer
465dd12 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-5956
282a1a8 [Xiangrui Meng] fix test
819dd2d [Xiangrui Meng] merge master
b642872 [Xiangrui Meng] example code runs
5a67779 [Xiangrui Meng] examples compile
c76b4d1 [Xiangrui Meng] fix all unit tests
0f4fd64 [Xiangrui Meng] fix some tests
9286a22 [Xiangrui Meng] copyValues to trained models
53e0973 [Xiangrui Meng] move inheritValues to Params and rename it to
copyValues
9ee004e [Xiangrui Meng] merge copy and copyWith; rename validate to
validateParams
d882afc [Xiangrui Meng] test compile
f082a31 [Xiangrui Meng] make Params copyable and simply handling of extra
params in all spark.ml components
commit f32e69ecc333867fc966f65cd0aeaeddd43e0945
Author: 云峤 <[email protected]>
Date: 2015-05-04T19:08:38Z
[SPARK-7319][SQL] Improve the output from DataFrame.show()
Author: 云峤 <[email protected]>
Closes #5865 from kaka1992/df.show and squashes the following commits:
c79204b [云峤] Update
a1338f6 [云峤] Update python dataFrame show test and add empty df unit
test.
734369c [云峤] Update python dataFrame show test and add empty df unit
test.
84aec3e [云峤] Update python dataFrame show test and add empty df unit
test.
159b3d5 [云峤] update
03ef434 [云峤] update
7394fd5 [云峤] update test show
ced487a [云峤] update pep8
b6e690b [云峤] Merge remote-tracking branch 'upstream/master' into df.show
30ac311 [云峤] [SPARK-7294] ADD BETWEEN
7d62368 [云峤] [SPARK-7294] ADD BETWEEN
baf839b [云峤] [SPARK-7294] ADD BETWEEN
d11d5b9 [云峤] [SPARK-7294] ADD BETWEEN
commit fc8b58195afa67fbb75b4c8303e022f703cbf007
Author: Andrew Or <[email protected]>
Date: 2015-05-04T23:21:36Z
[SPARK-6943] [SPARK-6944] DAG visualization on SparkUI
This patch adds the functionality to display the RDD DAG on the SparkUI.
This DAG describes the relationships between
- an RDD and its dependencies,
- an RDD and its operation scopes, and
- an RDD's operation scopes and the stage / job hierarchy
An operation scope here refers to the existing public APIs that created the
RDDs (e.g. `textFile`, `treeAggregate`). In the future, we can expand this to
include higher level operations like SQL queries.
*Note: This blatantly stole a few lines of HTML and JavaScript from #5547
(thanks shroffpradyumn!)*
Here's what the job page looks like:
<img
src="https://issues.apache.org/jira/secure/attachment/12730286/job-page.png"
width="700px"/>
and the stage page:
<img
src="https://issues.apache.org/jira/secure/attachment/12730287/stage-page.png"
width="300px"/>
Author: Andrew Or <[email protected]>
Closes #5729 from andrewor14/viz2 and squashes the following commits:
666c03b [Andrew Or] Round corners of RDD boxes on stage page (minor)
01ba336 [Andrew Or] Change RDD cache color to red (minor)
6f9574a [Andrew Or] Add tests for RDDOperationScope
1c310e4 [Andrew Or] Wrap a few more RDD functions in an operation scope
3ffe566 [Andrew Or] Restore "null" as default for RDD name
5fdd89d [Andrew Or] children -> child (minor)
0d07a84 [Andrew Or] Fix python style
afb98e2 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz2
0d7aa32 [Andrew Or] Fix python tests
3459ab2 [Andrew Or] Fix tests
832443c [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz2
429e9e1 [Andrew Or] Display cached RDDs on the viz
b1f0fd1 [Andrew Or] Rename OperatorScope -> RDDOperationScope
31aae06 [Andrew Or] Extract visualization logic from listener
83f9c58 [Andrew Or] Implement a programmatic representation of operator
scopes
5a7faf4 [Andrew Or] Rename references to viz scopes to viz clusters
ee33d52 [Andrew Or] Separate HTML generating code from listener
f9830a2 [Andrew Or] Refactor + clean up + document JS visualization code
b80cc52 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz2
0706992 [Andrew Or] Add link from jobs to stages
deb48a0 [Andrew Or] Translate stage boxes taking into account the width
5c7ce16 [Andrew Or] Connect RDDs across stages + update style
ab91416 [Andrew Or] Introduce visualization to the Job Page
5f07e9c [Andrew Or] Remove more return statements from scopes
5e388ea [Andrew Or] Fix line too long
43de96e [Andrew Or] Add parent IDs to StageInfo
6e2cfea [Andrew Or] Remove all return statements in `withScope`
d19c4da [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz2
7ef957c [Andrew Or] Fix scala style
4310271 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz2
aa868a9 [Andrew Or] Ensure that HadoopRDD is actually serializable
c3bfcae [Andrew Or] Re-implement scopes using closures instead of
annotations
52187fc [Andrew Or] Rat excludes
09d361e [Andrew Or] Add ID to node label (minor)
71281fa [Andrew Or] Embed the viz in the UI in a toggleable manner
8dd5af2 [Andrew Or] Fill in documentation + miscellaneous minor changes
fe7816f [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz
205f838 [Andrew Or] Reimplement rendering with dagre-d3 instead of viz.js
5e22946 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
viz
6a7cdca [Andrew Or] Move RDD scope util methods and logic to its own file
494d5c2 [Andrew Or] Revert a few unintended style changes
9fac6f3 [Andrew Or] Re-implement scopes through annotations instead
f22f337 [Andrew Or] First working implementation of visualization with
vis.js
2184348 [Andrew Or] Translate RDD information to dot file
5143523 [Andrew Or] Expose the necessary information in RDDInfo
a9ed4f9 [Andrew Or] Add a few missing scopes to certain RDD methods
6b3403b [Andrew Or] Scope all RDD methods
commit 80554111703c08e2bedbe303e04ecd162ec119e1
Author: Burak Yavuz <[email protected]>
Date: 2015-05-05T00:02:49Z
[SPARK-7243][SQL] Contingency Tables for DataFrames
Computes a pair-wise frequency table of the given columns. Also known as
cross-tabulation.
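The computation itself is simple to state. A plain-Python sketch of a pair-wise frequency table over two columns (this is not the DataFrame API; `crosstab` on a real DataFrame returns the counts as a DataFrame):

```python
from collections import Counter

def crosstab(rows, col1, col2):
    """Pair-wise frequency table (cross-tabulation) of two columns."""
    counts = Counter((r[col1], r[col2]) for r in rows)
    vals1 = sorted({r[col1] for r in rows})
    vals2 = sorted({r[col2] for r in rows})
    # One row per distinct value of col1, one column per distinct value of
    # col2; Counter returns 0 for pairs that never occur.
    return {a: {b: counts[(a, b)] for b in vals2} for a in vals1}

data = [{"age": 20, "dept": "x"},
        {"age": 20, "dept": "y"},
        {"age": 30, "dept": "x"}]
print(crosstab(data, "age", "dept"))
# {20: {'x': 1, 'y': 1}, 30: {'x': 1, 'y': 0}}
```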
cc mengxr rxin
Author: Burak Yavuz <[email protected]>
Closes #5842 from brkyvz/df-cont and squashes the following commits:
a07c01e [Burak Yavuz] addressed comments v4.1
ae9e01d [Burak Yavuz] fix test
9106585 [Burak Yavuz] addressed comments v4.0
bced829 [Burak Yavuz] fix merge conflicts
a63ad00 [Burak Yavuz] addressed comments v3.0
a0cad97 [Burak Yavuz] addressed comments v3.0
6805df8 [Burak Yavuz] addressed comments and fixed test
939b7c4 [Burak Yavuz] lint python
7f098bc [Burak Yavuz] add crosstab pyTest
fd53b00 [Burak Yavuz] added python support for crosstab
27a5a81 [Burak Yavuz] implemented crosstab
commit 678c4da0fa1bbfb6b5a0d3aced7aefa1bbbc193c
Author: Reynold Xin <[email protected]>
Date: 2015-05-05T01:03:07Z
[SPARK-7266] Add ExpectsInputTypes to expressions when possible.
This should give us better analysis-time error messages (rather than at
runtime) and automatic type casting.
Author: Reynold Xin <[email protected]>
Closes #5796 from rxin/expected-input-types and squashes the following
commits:
c900760 [Reynold Xin] [SPARK-7266] Add ExpectsInputTypes to expressions
when possible.
commit 8aa5aea7fee0ae9cd34e16c30655ee02b8747455
Author: Bryan Cutler <[email protected]>
Date: 2015-05-05T01:29:22Z
[SPARK-7236] [CORE] Fix to prevent AkkaUtils askWithReply from sleeping on
final attempt
Added a check so that if `AkkaUtils.askWithReply` is on the final attempt,
it will not sleep for the `retryInterval`. This should also prevent the thread
from sleeping for `Int.Max` when using `askWithReply` with default values for
`maxAttempts` and `retryInterval`.
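The fix amounts to skipping the sleep once no retry will follow. A hedged Python sketch of the same retry pattern (function and parameter names are illustrative, not the AkkaUtils API):

```python
import time

def ask_with_retry(send, max_attempts, retry_interval_s):
    """Retry `send` up to max_attempts times, but never sleep after the
    final attempt -- the sleep only buys time before another try."""
    last_error = None
    for attempt in range(1, max_attempts + 1):
        try:
            return send()
        except Exception as e:
            last_error = e
            if attempt < max_attempts:   # the fix: no sleep on the last attempt
                time.sleep(retry_interval_s)
    raise TimeoutError(f"no reply after {max_attempts} attempts") from last_error

# Fails twice, then succeeds on the third attempt.
calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("no reply yet")
    return "pong"

print(ask_with_retry(flaky, max_attempts=3, retry_interval_s=0.01))  # pong
```

Without the `attempt < max_attempts` guard, a huge `retry_interval` (the `Int.Max` case mentioned above) would stall the thread even after the last, already-failed attempt.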
Author: Bryan Cutler <[email protected]>
Closes #5896 from BryanCutler/askWithReply-sleep-7236 and squashes the
following commits:
653a07b [Bryan Cutler] [SPARK-7236] Fix to prevent AkkaUtils askWithReply
from sleeping on final attempt
commit e9b16e67c636a8a91ab9fb0f4ef98146abbde1e9
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-05T06:52:42Z
[SPARK-7314] [SPARK-3524] [PYSPARK] upgrade Pyrolite to 4.4
This PR upgrades Pyrolite to 4.4, which contains the bug fix for SPARK-3524
and some other performance improvements (e.g., SPARK-6288). The artifact is
still under `org.spark-project` on Maven Central since there is no official
release published there.
Author: Xiangrui Meng <[email protected]>
Closes #5850 from mengxr/SPARK-7314 and squashes the following commits:
2ed4a95 [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-7314
da3c2dd [Xiangrui Meng] remove my repo
fe7e29b [Xiangrui Meng] switch to maven central
6ddac0e [Xiangrui Meng] reverse the machine code for float/double
d2d5b5b [Xiangrui Meng] change back to 4.4
7824a9c [Xiangrui Meng] use Pyrolite 3.1
cc3903a [Xiangrui Meng] upgrade Pyrolite to 4.4-0 for testing
commit da738cffa8f7e12545b47f31dcb051f2927e4149
Author: Niccolo Becchi <[email protected]>
Date: 2015-05-05T07:54:42Z
[MINOR] Renamed variables in SparkKMeans.scala, LocalKMeans.scala and
kmeans.py to simplify readability
With the previous syntax it could look as if the reduceByKey sums the
abscissas and ordinates of some 2D points separately. This renaming should
make the example easier to understand, especially for someone just starting
with functional programming, like me.
Author: Niccolo Becchi <[email protected]>
Author: pippobaudos <[email protected]>
Closes #5875 from pippobaudos/patch-1 and squashes the following commits:
3bb3a47 [pippobaudos] renamed variables in LocalKMeans.scala and kmeans.py
to simplify readability
2c2a7a2 [Niccolo Becchi] Update SparkKMeans.scala
commit c5790a2f772168351c18bb0da51a124cee89a06f
Author: Marcelo Vanzin <[email protected]>
Date: 2015-05-05T07:56:16Z
[MINOR] [BUILD] Declare ivy dependency in root pom.
Without this, any dependency that pulls in ivy transitively may override
the version and potentially cause issues. On my machine, the hive tests
were pulling an old version of ivy, and subsequently failing with a
"NoSuchMethodError".
Author: Marcelo Vanzin <[email protected]>
Closes #5893 from vanzin/ivy-dep-fix and squashes the following commits:
ea2112d [Marcelo Vanzin] [minor] [build] Declare ivy dependency in root pom.
commit 1854ac326a9cc6014817d8df30ed0458eee5d7d1
Author: Tathagata Das <[email protected]>
Date: 2015-05-05T08:45:19Z
[SPARK-7139] [STREAMING] Allow received block metadata to be saved to WAL
and recovered on driver failure
- Enabled ReceivedBlockTracker WAL by default
- Stored block metadata in the WAL
- Optimized WALBackedBlockRDD by skipping block fetch when the block is
known to not exist in Spark
Author: Tathagata Das <[email protected]>
Closes #5732 from tdas/SPARK-7139 and squashes the following commits:
575476e [Tathagata Das] Added more tests to get 100% coverage of the
WALBackedBlockRDD
19668ba [Tathagata Das] Merge remote-tracking branch 'apache-github/master'
into SPARK-7139
685fab3 [Tathagata Das] Addressed comments in PR
637bc9c [Tathagata Das] Changed segment to handle
466212c [Tathagata Das] Merge remote-tracking branch 'apache-github/master'
into SPARK-7139
5f67a59 [Tathagata Das] Fixed HdfsUtils to handle append in local file
system
1bc5bc3 [Tathagata Das] Fixed bug on unexpected recovery
d06fa21 [Tathagata Das] Enabled ReceivedBlockTracker by default, stored
block metadata and optimized block fetching in WALBackedBlockRDD
commit 8776fe0b93b6e6d718738bcaf9838a2196e12c8a
Author: Tathagata Das <[email protected]>
Date: 2015-05-05T08:58:51Z
[HOTFIX] [TEST] Ignoring flaky tests
org.apache.spark.DriverSuite.driver should exit after finishing without
cleanup (SPARK-530)
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2267/
org.apache.spark.deploy.SparkSubmitSuite.includes jars passed in through
--jars
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2271/AMPLAB_JENKINS_BUILD_PROFILE=hadoop1.0,label=centos/testReport/
org.apache.spark.streaming.flume.FlumePollingStreamSuite.flume polling test
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/2269/
Author: Tathagata Das <[email protected]>
Closes #5901 from tdas/ignore-flaky-tests and squashes the following
commits:
9cd8667 [Tathagata Das] Ignoring tests.
commit 8436f7e98e674020007a9175973c6a1095b6774f
Author: jerryshao <[email protected]>
Date: 2015-05-05T09:01:06Z
[SPARK-7113] [STREAMING] Support input information reporting for Direct
Kafka stream
Author: jerryshao <[email protected]>
Closes #5879 from jerryshao/SPARK-7113 and squashes the following commits:
b0b506c [jerryshao] Address the comments
0babe66 [jerryshao] Support input information reporting for Direct Kafka
stream
commit 4d29867ede9a87b160c3d715c1fb02067feef449
Author: zsxwing <[email protected]>
Date: 2015-05-05T09:15:39Z
[SPARK-7341] [STREAMING] [TESTS] Fix the flaky test:
org.apache.spark.stre...
...aming.InputStreamsSuite.socket input stream
Remove non-deterministic "Thread.sleep" and use deterministic strategies to
fix the flaky failure:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-Maven-pre-YARN/hadoop.version=1.0.4,label=centos/2127/testReport/junit/org.apache.spark.streaming/InputStreamsSuite/socket_input_stream/
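Replacing `Thread.sleep` with something the test can wait on deterministically is the usual cure for this kind of flakiness. A minimal Python sketch of a `BatchCounter`-style helper (the class name comes from the commit log below; the rest is illustrative):

```python
import threading

class BatchCounter:
    """Count completed batches and let a test wait for a target count
    deterministically, instead of guessing a duration with sleep()."""
    def __init__(self):
        self._count = 0
        self._cond = threading.Condition()

    def batch_completed(self):
        # Called by the system under test whenever a batch finishes.
        with self._cond:
            self._count += 1
            self._cond.notify_all()

    def wait_until(self, target, timeout_s=10.0):
        # Returns True as soon as the count reaches target, False on timeout.
        with self._cond:
            return self._cond.wait_for(lambda: self._count >= target, timeout_s)

counter = BatchCounter()
threading.Timer(0.05, counter.batch_completed).start()
print(counter.wait_until(1))  # True, once the batch callback fires
```

The timeout is only a safety net; unlike a fixed sleep, the wait returns the moment the condition holds, so the test is both faster and deterministic.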
Author: zsxwing <[email protected]>
Closes #5891 from zsxwing/SPARK-7341 and squashes the following commits:
611157a [zsxwing] Add wait methods to BatchCounter and use BatchCounter in
InputStreamsSuite
014b58f [zsxwing] Use withXXX to clean up the resources
c9bf746 [zsxwing] Move 'waitForStart' into the 'start' method and fix the
code style
9d0de6d [zsxwing] [SPARK-7341][Streaming][Tests] Fix the flaky test:
org.apache.spark.streaming.InputStreamsSuite.socket input stream
commit fc8feaa8e94e1e611d2abb1e5e38de512961502b
Author: shekhar.bansal <[email protected]>
Date: 2015-05-05T10:09:51Z
[SPARK-6653] [YARN] New config to specify port for sparkYarnAM actor system
Author: shekhar.bansal <[email protected]>
Closes #5719 from zuxqoj/master and squashes the following commits:
5574ff7 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for
sparkYarnAM actor system
5117258 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for
sparkYarnAM actor system
9de5330 [shekhar.bansal] [SPARK-6653][yarn] New config to specify port for
sparkYarnAM actor system
456a592 [shekhar.bansal] [SPARK-6653][yarn] New configuration property to
specify port for sparkYarnAM actor system
803e93e [shekhar.bansal] [SPARK-6653][yarn] New configuration property to
specify port for sparkYarnAM actor system
commit 4222da68dc5360b7a2a8b8bdce231e887ac2f044
Author: Sandy Ryza <[email protected]>
Date: 2015-05-05T11:38:46Z
[SPARK-5112] Expose SizeEstimator as a developer api
"The best way to size the amount of memory consumption your dataset will
require is to create an RDD, put it into cache, and look at the SparkContext
logs on your driver program. The logs will tell you how much memory each
partition is consuming, which you can aggregate to get the total size of the
RDD."
-the Tuning Spark page
This is a pain. It would be much nicer to simply expose functionality for
understanding the memory footprint of a Java object.
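The idea SizeEstimator exposes can be approximated in pure Python by walking an object graph with `sys.getsizeof` (a rough sketch of the concept, not Spark's JVM-layout-aware estimator):

```python
import sys

def estimate_size(obj, _seen=None):
    """Rough deep size of a Python object graph, in bytes.
    Shared objects are counted once, as a real estimator must do."""
    if _seen is None:
        _seen = set()
    if id(obj) in _seen:          # don't double-count shared references
        return 0
    _seen.add(id(obj))
    size = sys.getsizeof(obj)     # shallow size of this object alone
    if isinstance(obj, dict):
        size += sum(estimate_size(k, _seen) + estimate_size(v, _seen)
                    for k, v in obj.items())
    elif isinstance(obj, (list, tuple, set, frozenset)):
        size += sum(estimate_size(x, _seen) for x in obj)
    return size

row = {"id": 1, "features": [0.1, 0.2, 0.3]}
print(estimate_size(row) > sys.getsizeof(row))  # True: deep size > shallow size
```

This is exactly the "how big is my dataset, per object" question the quoted Tuning page answers today by caching an RDD and reading driver logs.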
Author: Sandy Ryza <[email protected]>
Closes #3913 from sryza/sandy-spark-5112 and squashes the following commits:
8d9e082 [Sandy Ryza] Add SizeEstimator in org.apache.spark
2e1a906 [Sandy Ryza] Revert "Move SizeEstimator out of util"
93f4cd0 [Sandy Ryza] Move SizeEstimator out of util
e21c1f4 [Sandy Ryza] Remove unused import
798ab88 [Sandy Ryza] Update documentation and add to SparkContext
34c523c [Sandy Ryza] SPARK-5112. Expose SizeEstimator as a developer api
commit 51f462003b416eac92feb5a6725f6c2994389010
Author: Jihong MA <[email protected]>
Date: 2015-05-05T11:40:41Z
[SPARK-7357] Improving HBaseTest example
Author: Jihong MA <[email protected]>
Closes #5904 from JihongMA/SPARK-7357 and squashes the following commits:
7d6153a [Jihong MA] SPARK-7357 Improving HBaseTest example
commit d49735800db27239c11478aac4b0f2ec9df91a3f
Author: Imran Rashid <[email protected]>
Date: 2015-05-05T12:25:40Z
[SPARK-3454] separate json endpoints for data in the UI
Exposes data available in the UI as json over http. Key points:
* new endpoints, handled independently of existing XyzPage classes. Root
entrypoint is `JsonRootResource`
* Uses jersey + jackson for routing & converting POJOs into json
* tests against known results in `HistoryServerSuite`
* also fixes some minor issues w/ the UI -- synchronizing on access to
`StorageListener` & `StorageStatusListener`, and fixing some inconsistencies w/
the way we handle retained jobs & stages.
Author: Imran Rashid <[email protected]>
Closes #4435 from squito/SPARK-3454 and squashes the following commits:
da1e35f [Imran Rashid] typos etc.
5e78b4f [Imran Rashid] fix rendering problems
5ae02ad [Imran Rashid] Merge branch 'master' into SPARK-3454
f016182 [Imran Rashid] change all constructors json-pojo class constructors
to be private[spark] to protect us from mima-false-positives if we add fields
3347b72 [Imran Rashid] mark EnumUtil as @Private
ec140a2 [Imran Rashid] create @Private
cc1febf [Imran Rashid] add docs on the metrics-as-json api
cbaf287 [Imran Rashid] Merge branch 'master' into SPARK-3454
56db31e [Imran Rashid] update tests for mulit-attempt
7f3bc4e [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier
to start & stop http servers in sbt"
67008b4 [Imran Rashid] rats
9e51400 [Imran Rashid] style
c9bae1c [Imran Rashid] handle multiple attempts per app
b87cd63 [Imran Rashid] add sbt-revolved plugin, to make it easier to start
& stop http servers in sbt
188762c [Imran Rashid] multi-attempt
2af11e5 [Imran Rashid] Merge branch 'master' into SPARK-3454
befff0c [Imran Rashid] review feedback
14ac3ed [Imran Rashid] jersey-core needs to be explicit; move version &
scope to parent pom.xml
f90680e [Imran Rashid] Merge branch 'master' into SPARK-3454
dc8a7fe [Imran Rashid] style, fix errant comments
acb7ef6 [Imran Rashid] fix indentation
7bf1811 [Imran Rashid] move MetricHelper so mima doesnt think its exposed;
comments
9d889d6 [Imran Rashid] undo some unnecessary changes
f48a7b0 [Imran Rashid] docs
52bbae8 [Imran Rashid] StorageListener & StorageStatusListener needs to
synchronize internally to be thread-safe
31c79ce [Imran Rashid] asm no longer needed for SPARK_PREPEND_CLASSES
b2f8b91 [Imran Rashid] @DeveloperApi
2e19be2 [Imran Rashid] lazily convert ApplicationInfo to avoid memory
overhead
ba3d9d2 [Imran Rashid] upper case enums
39ac29c [Imran Rashid] move EnumUtil
d2bde77 [Imran Rashid] update error handling & scoping
4a234d3 [Imran Rashid] avoid jersey-media-json-jackson b/c of potential
version conflicts
a157a2f [Imran Rashid] style
7bd4d15 [Imran Rashid] delete security test, since it doesnt do anything
a325563 [Imran Rashid] style
a9c5cf1 [Imran Rashid] undo changes superceeded by master
0c6f968 [Imran Rashid] update deps
1ed0d07 [Imran Rashid] Merge branch 'master' into SPARK-3454
4c92af6 [Imran Rashid] style
f2e63ad [Imran Rashid] Merge branch 'master' into SPARK-3454
c22b11f [Imran Rashid] fix compile error
9ea682c [Imran Rashid] go back to good ol' java enums
cf86175 [Imran Rashid] style
d493b38 [Imran Rashid] Merge branch 'master' into SPARK-3454
f05ae89 [Imran Rashid] add in ExecutorSummaryInfo for MiMa :(
101a698 [Imran Rashid] style
d2ef58d [Imran Rashid] revert changes that had HistoryServer refresh the
application listing more often
b136e39b [Imran Rashid] Revert "add sbt-revolved plugin, to make it easier
to start & stop http servers in sbt"
e031719 [Imran Rashid] fixes from review
1f53a66 [Imran Rashid] style
b4a7863 [Imran Rashid] fix compile error
2c8b7ee [Imran Rashid] rats
1578a4a [Imran Rashid] doc
674f8dc [Imran Rashid] more explicit about total numbers of jobs & stages
vs. number retained
9922be0 [Imran Rashid] Merge branch 'master' into stage_distributions
f5a5196 [Imran Rashid] undo removal of renderJson from MasterPage, since
there is no substitute yet
db61211 [Imran Rashid] get JobProgressListener directly from UI
fdfc181 [Imran Rashid] stage/taskList
63eb4a6 [Imran Rashid] tests for taskSummary
ad27de8 [Imran Rashid] error handling on quantile values
b2efcaf [Imran Rashid] cleanup, combine stage-related paths into one
resource
aaba896 [Imran Rashid] wire up task summary
a4b1397 [Imran Rashid] stage metric distributions
e48ba32 [Imran Rashid] rename
eaf3bbb [Imran Rashid] style
25cd894 [Imran Rashid] if only given day, assume GMT
51eaedb [Imran Rashid] more visibility fixes
9f28b7e [Imran Rashid] ack, more cleanup
99764e1 [Imran Rashid] Merge branch 'SPARK-3454_w_jersey' into SPARK-3454
a61a43c [Imran Rashid] oops, remove accidental checkin
a066055 [Imran Rashid] set visibility on a lot of classes
1f361c8 [Imran Rashid] update rat-excludes
0be5120 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
2382bef [Imran Rashid] switch to using new "enum"
fef6605 [Imran Rashid] some utils for working w/ new "enum" format
dbfc7bf [Imran Rashid] style
b86bcb0 [Imran Rashid] update test to look at one stage attempt
5f9df24 [Imran Rashid] style
7fd156a [Imran Rashid] refactor jsonDiff to avoid code duplication
73f1378 [Imran Rashid] test json; also add test cases for cleaned stages &
jobs
97d411f [Imran Rashid] json endpoint for one job
0c96147 [Imran Rashid] better error msgs for bad stageId vs bad attemptId
dddbd29 [Imran Rashid] stages have attempt; jobs are sorted; resource for
all attempts for one stage
190c17a [Imran Rashid] StagePage should distinguish no task data, from
unknown stage
84cd497 [Imran Rashid] AllJobsPage should still report correct completed &
failed job count, even if some have been cleaned, to make it consistent w/
AllStagesPage
36e4062 [Imran Rashid] SparkUI needs to know about startTime, so it can
list its own applicationInfo
b4c75ed [Imran Rashid] fix merge conflicts; need to widen visibility in a
few cases
e91750a [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
56d2fc7 [Imran Rashid] jersey needs asm for SPARK_PREPEND_CLASSES to work
f7df095 [Imran Rashid] add test for accumulables, and discover that I need
update after all
9c0c125 [Imran Rashid] add accumulableInfo
00e9cc5 [Imran Rashid] more style
3377e61 [Imran Rashid] scaladoc
d05f7a9 [Imran Rashid] dont use case classes for status api POJOs, since
they have binary compatibility issues
654cecf [Imran Rashid] move all the status api POJOs to one file
b86e2b0 [Imran Rashid] style
18a8c45 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
5598f19 [Imran Rashid] delete some unnecessary code, more to go
56edce0 [Imran Rashid] style
017c755 [Imran Rashid] add in metrics now available
1b78cb7 [Imran Rashid] fix some import ordering
0dc3ea7 [Imran Rashid] if app isnt found, reload apps from FS before giving
up
c7d884f [Imran Rashid] fix merge conflicts
0c12b50 [Imran Rashid] Merge branch 'master' into SPARK-3454_w_jersey
b6a96a8 [Imran Rashid] compare json by AST, not string
cd37845 [Imran Rashid] switch to using java.util.Dates for times
a4ab5aa [Imran Rashid] add in explicit dependency on jersey 1.9 -- maven
wasn't happy before this
4fdc39f [Imran Rashid] refactor case insensitive enum parsing
cba1ef6 [Imran Rashid] add security (maybe?) for metrics json
f0264a7 [Imran Rashid] switch to using jersey for metrics json
bceb3a9 [Imran Rashid] set http response code on error, some testing
e0356b6 [Imran Rashid] put new test expectation files in rat excludes (is
this OK?)
b252e7a [Imran Rashid] small cleanup of accidental changes
d1a8c92 [Imran Rashid] add sbt-revolved plugin, to make it easier to start
& stop http servers in sbt
4b398d0 [Imran Rashid] expose UI data as json in new endpoints
commit b83091ae4589feea78b056827bc3b7659d271e41
Author: Liang-Chi Hsieh <[email protected]>
Date: 2015-05-05T13:44:02Z
[MINOR] Minor update for document
Two minor doc errors in `BytesToBytesMap` and `UnsafeRow`.
Author: Liang-Chi Hsieh <[email protected]>
Closes #5906 from viirya/minor_doc and squashes the following commits:
27f9089 [Liang-Chi Hsieh] Minor update for doc.
commit 5ffc73e68b3a6ea30c25931e9e0495a4c7e5654c
Author: zsxwing <[email protected]>
Date: 2015-05-05T14:04:14Z
[SPARK-5074] [CORE] [TESTS] Fix the flakey test 'run shuffle with map stage
failure' in DAGSchedulerSuite
Test failure:
https://amplab.cs.berkeley.edu/jenkins/job/Spark-Master-SBT/AMPLAB_JENKINS_BUILD_PROFILE=hadoop2.2,label=centos/2240/testReport/junit/org.apache.spark.scheduler/DAGSchedulerSuite/run_shuffle_with_map_stage_failure/
This is because many tests share the same `JobListener`, and the `scheduler`
isn't stopped after each test, so it is actually still running. When running
the test `run shuffle with map stage failure`, some previous test may trigger
[ResubmitFailedStages](https://github.com/apache/spark/blob/ebc25a4ddfe07a67668217cec59893bc3b8cf730/core/src/main/scala/org/apache/spark/scheduler/DAGScheduler.scala#L1120)
logic, report `jobFailed`, and override the global `failure` variable.
This PR uses `after` to call `scheduler.stop()` for each test.
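The same hygiene applies in any suite with a long-lived component. A Python `unittest` sketch of the stop-after-every-test pattern (the `Scheduler` class here is a stand-in, not Spark's DAGScheduler):

```python
import unittest

class Scheduler:
    """Stand-in for a component that keeps running between tests."""
    def __init__(self):
        self.running = True
        self.failures = []

    def report_failure(self, msg):
        if self.running:              # a live leftover scheduler can still
            self.failures.append(msg)  # report failures into later tests

    def stop(self):
        self.running = False

class SchedulerSuite(unittest.TestCase):
    def setUp(self):
        self.scheduler = Scheduler()

    def tearDown(self):
        # The fix: stop the scheduler after *every* test so a leftover
        # instance cannot override shared state seen by the next test.
        self.scheduler.stop()

    def test_map_stage_failure_is_isolated(self):
        self.scheduler.report_failure("map stage failed")
        self.assertEqual(self.scheduler.failures, ["map stage failed"])

# Run the suite programmatically (avoids unittest.main()'s sys.exit).
result = unittest.TextTestRunner().run(
    unittest.defaultTestLoader.loadTestsFromTestCase(SchedulerSuite))
print(result.wasSuccessful())  # True
```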
Author: zsxwing <[email protected]>
Closes #5903 from zsxwing/SPARK-5074 and squashes the following commits:
1e6f13e [zsxwing] Fix the flakey test 'run shuffle with map stage failure'
in DAGSchedulerSuite
commit c6d1efba29a4235130024fee9f118e6b2cb89ce1
Author: zsxwing <[email protected]>
Date: 2015-05-05T14:09:58Z
[SPARK-7350] [STREAMING] [WEBUI] Attach the Streaming tab when calling
ssc.start()
It's meaningless to display the Streaming tab before `ssc.start()`. So we
should attach it in the `ssc.start` method.
Author: zsxwing <[email protected]>
Closes #5898 from zsxwing/SPARK-7350 and squashes the following commits:
e676487 [zsxwing] Attach the Streaming tab when calling ssc.start()
commit 5ab652cdb8bef10214edd079502a7f49017579aa
Author: MechCoder <[email protected]>
Date: 2015-05-05T14:53:11Z
[SPARK-7202] [MLLIB] [PYSPARK] Add SparseMatrixPickler to SerDe
Utilities for pickling and unpickling SparseMatrices using SerDe
Author: MechCoder <[email protected]>
Closes #5775 from MechCoder/spark-7202 and squashes the following commits:
7e689dc [MechCoder] [SPARK-7202] Add SparseMatrixPickler to SerDe
commit 5995ada96b661546a80657f2c5ed20604593e4aa
Author: Hrishikesh Subramonian <[email protected]>
Date: 2015-05-05T14:57:39Z
[SPARK-6612] [MLLIB] [PYSPARK] Python KMeans parity
The following items are added to Python kmeans:
kmeans - setEpsilon, setInitializationSteps
KMeansModel - computeCost, k
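`computeCost` is just the within-set sum of squared distances (the k-means objective). A pure-Python sketch of what the new API computes (not the MLlib implementation itself):

```python
def compute_cost(points, centers):
    """Sum of squared Euclidean distances from each point to its
    nearest cluster center (the k-means objective, a.k.a. WSSSE)."""
    total = 0.0
    for p in points:
        total += min(sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers)
    return total

points = [(0.0, 0.0), (1.0, 1.0), (9.0, 8.0)]
centers = [(0.5, 0.5), (9.0, 8.0)]
print(compute_cost(points, centers))  # 1.0
```

A lower cost for the same k means a tighter clustering, which is why the method is useful for comparing runs or choosing k.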
Author: Hrishikesh Subramonian <[email protected]>
Closes #5647 from FlytxtRnD/newPyKmeansAPI and squashes the following
commits:
b9e451b [Hrishikesh Subramonian] set seed to fixed value in doc test
5fd3ced [Hrishikesh Subramonian] doc test corrections
20b3c68 [Hrishikesh Subramonian] python 3 fixes
4d4e695 [Hrishikesh Subramonian] added arguments in python tests
21eb84c [Hrishikesh Subramonian] Python Kmeans - setEpsilon,
setInitializationSteps, k and computeCost added.
commit 9d250e64dac263bcbbad6b023382ac7b5b592408
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-05T15:00:31Z
Closes #5591
Closes #5878
commit d4cb38aeb7412a353c6cbca2a9b8f9729afbaba7
Author: Alain <[email protected]>
Date: 2015-05-05T15:47:34Z
[MLLIB] [TREE] Verify size of input rdd > 0 when building meta data
Require a non-empty input RDD so that we can take the first LabeledPoint
and get the feature size.
Author: Alain <[email protected]>
Author: [email protected] <[email protected]>
Closes #5810 from AiHe/decisiontree-issue and squashes the following
commits:
3b1d08a [[email protected]] [MLLIB][tree] merge the assertion into the
evaluation of numFeatures
cf2e567 [Alain] [MLLIB][tree] Use a rdd api to verify size of input rdd > 0
when building meta data
b448f47 [Alain] [MLLIB][tree] Verify size of input rdd > 0 when building
meta data
commit 1fdabf8dcdb31391fec3952d312eb0ac59ece43b
Author: Andrew Or <[email protected]>
Date: 2015-05-05T16:37:04Z
[SPARK-7237] Many user provided closures are not actually cleaned
Note: ~140 lines are tests.
In a nutshell, we never cleaned closures the user provided through the
following operations:
- sortBy
- keyBy
- mapPartitions
- mapPartitionsWithIndex
- aggregateByKey
- foldByKey
- foreachAsync
- one of the aliases for runJob
- runApproximateJob
For more details on a reproduction and why they were not cleaned, please
see [SPARK-7237](https://issues.apache.org/jira/browse/SPARK-7237).
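The reason cleaning matters: a task that references a field through `this` drags the whole enclosing object along when serialized. A Python/pickle analogy of the gotcha (Spark's ClosureCleaner nulls such references on the JVM side; this sketch only illustrates the cost, not the mechanism):

```python
import pickle
from functools import partial

class Job:
    def __init__(self):
        self.huge_state = list(range(100_000))  # not needed by the task
        self.factor = 2

    def multiply(self, x):       # references self, so the whole Job ships
        return x * self.factor

def multiply_by(factor, x):      # module-level: only `factor` ships
    return x * factor

job = Job()
uncleaned = pickle.dumps(job.multiply)               # drags huge_state along
cleaned = pickle.dumps(partial(multiply_by, job.factor))
print(len(uncleaned) > 100 * len(cleaned))  # True: the uncleaned task is huge
```

An uncleaned closure in any of the operations listed above pays this serialization tax on every task, or fails outright if the enclosing object isn't serializable at all.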
Author: Andrew Or <[email protected]>
Closes #5787 from andrewor14/clean-more and squashes the following commits:
2f1f476 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
clean-more
7265865 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
clean-more
df3caa3 [Andrew Or] Address comments
7a3cc80 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
clean-more
6498f44 [Andrew Or] Add missing test for groupBy
e83699e [Andrew Or] Clean one more
8ac3074 [Andrew Or] Prevent NPE in tests when CC is used outside of an app
9ac5f9b [Andrew Or] Clean closures that are not currently cleaned
19e33b4 [Andrew Or] Add tests for all public RDD APIs that take in closures
commit 57e9f29e17d97ed9d0f110fb2ce5a075b854a841
Author: Andrew Or <[email protected]>
Date: 2015-05-05T16:37:49Z
[SPARK-7318] [STREAMING] DStream cleans objects that are not closures
I added a check in `ClosureCleaner#clean` to fail fast if this is detected
in the future. tdas
Author: Andrew Or <[email protected]>
Closes #5860 from andrewor14/streaming-closure-cleaner and squashes the
following commits:
8e971d7 [Andrew Or] Do not throw exception if object to clean is not closure
5ee4e25 [Andrew Or] Fix tests
eed3390 [Andrew Or] Merge branch 'master' of github.com:apache/spark into
streaming-closure-cleaner
67eeff4 [Andrew Or] Add tests
a4fa768 [Andrew Or] Clean the closure, not the RDD
commit 9f1f9b1037ee003a07ff09d60bb360cf32c8a564
Author: jerryshao <[email protected]>
Date: 2015-05-05T16:43:49Z
[SPARK-7007] [CORE] Add a metric source for ExecutorAllocationManager
Add a metric source to expose the internal status of
ExecutorAllocationManager, to better monitor the resource usage of executors
when dynamic allocation is enabled. Please help to review, thanks a lot.
Author: jerryshao <[email protected]>
Closes #5589 from jerryshao/dynamic-allocation-source and squashes the
following commits:
104d155 [jerryshao] rebase and address the comments
c501a2c [jerryshao] Address the comments
d237ba5 [jerryshao] Address the comments
2c3540f [jerryshao] Add a metric source for ExecutorAllocationManager
commit 18340d7be55a6834918956555bf820c96769aa52
Author: Burak Yavuz <[email protected]>
Date: 2015-05-05T18:01:25Z
[SPARK-7243][SQL] Reduce size for Contingency Tables in DataFrames
Reduced take size from 1e8 to 1e6.
cc rxin
Author: Burak Yavuz <[email protected]>
Closes #5900 from brkyvz/df-cont-followup and squashes the following
commits:
c11e762 [Burak Yavuz] fix grammar
b30ace2 [Burak Yavuz] address comments
a417ba5 [Burak Yavuz] [SPARK-7243][SQL] Reduce size for Contingency Tables
in DataFrames
commit ee374e89cd1f08730fed9d50b742627d5b19d241
Author: Xiangrui Meng <[email protected]>
Date: 2015-05-05T18:45:37Z
[SPARK-7333] [MLLIB] Add BinaryClassificationEvaluator to PySpark
This PR adds `BinaryClassificationEvaluator` to Python ML Pipelines API,
which is a simple wrapper of the Scala implementation. oefirouz
Author: Xiangrui Meng <[email protected]>
Closes #5885 from mengxr/SPARK-7333 and squashes the following commits:
25d7451 [Xiangrui Meng] fix tests in python 3
babdde7 [Xiangrui Meng] fix doc
cb51e6a [Xiangrui Meng] add BinaryClassificationEvaluator in PySpark
----