GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/3351
[SPARK-4477] [PySpark] remove numpy from RDDSampler
In RDDSampler, we tried to use numpy to gain better performance for poisson()
sampling, but the number of calls to random() is only about (1 + fraction) * N
in a pure Python implementation of poisson(), so there is not much performance
to gain from numpy.
numpy is not a dependency of pyspark, so relying on it can introduce problems,
such as numpy being installed on the master but missing on the slaves, as
reported in SPARK-927.
It also complicates the code a lot, so we should remove numpy from
RDDSampler.
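For reference, a pure Python poisson() in the spirit of Knuth's algorithm
could look like the sketch below (illustrative only, not the literal patch
code); it makes about 1 + mean calls to random() per sampled element in
expectation, which is where the (1 + fraction) * N figure comes from:
```
import math
import random

def poisson(mean):
    # Knuth's algorithm: multiply uniform draws together until the
    # product falls below exp(-mean); the number of draws minus one
    # is Poisson(mean) distributed.
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1
```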
closes #2313
Note: this patch includes some commits that are not yet mirrored to GitHub;
it will be fine once the mirror catches up.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark numpy
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3351.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3351
----
commit 41fce544cadce5ed314b75f368abf79ee7fcd2da
Author: Davies Liu <[email protected]>
Date: 2014-11-10T23:54:20Z
randomSplit()
commit 1715ee38b619868c284454d5058376e9e0ca09a7
Author: Davies Liu <[email protected]>
Date: 2014-11-12T22:15:47Z
address comments
commit 0d9b256a9b7aa52c37c6a952ffd68bf4441d46e5
Author: Davies Liu <[email protected]>
Date: 2014-11-12T23:00:17Z
refactor
commit 95a48aca935b2c3c9fe1af2aad674811a4c04395
Author: Davies Liu <[email protected]>
Date: 2014-11-13T18:53:11Z
Merge branch 'master' of github.com:apache/spark into randomSplit
commit c7a2007245752480dfe96316d22dc8f19ea1b1a2
Author: Davies Liu <[email protected]>
Date: 2014-11-13T19:29:43Z
switch to python implementation
commit f866bcf7e23a6d888d2113df4da3031bfe91400e
Author: Davies Liu <[email protected]>
Date: 2014-11-13T19:31:09Z
remove unneeded change
commit 4dfa2cdce6a2eaae5f5e24321e324bd0498ea49f
Author: Davies Liu <[email protected]>
Date: 2014-11-13T20:20:56Z
refactor
commit f5fdf63fe0d0ea091c01a4144585276a7db63625
Author: Davies Liu <[email protected]>
Date: 2014-11-13T20:54:14Z
fix bug with int in weights
commit 78bf997f13c6f08129671a9d6a3484620d5b37a2
Author: Davies Liu <[email protected]>
Date: 2014-11-13T21:08:10Z
fix tests, do not use numpy in randomSplit, no performance gain
commit 51649f5e5b29ab8db1c6c3fd91c6f625124ab327
Author: Davies Liu <[email protected]>
Date: 2014-11-14T06:40:40Z
remove numpy in RDDSampler
commit f58302346ad0e859994fae52299fd32b3d3ef61a
Author: Davies Liu <[email protected]>
Date: 2014-11-14T08:37:20Z
fix tests
commit cedc3b5aa43a16e2da62f12a36317f00aa1002cc
Author: Felix Maximilian Möller <[email protected]>
Date: 2014-11-18T18:08:24Z
ALS implicit: added missing parameter alpha in doc string
Author: Felix Maximilian Möller
<[email protected]>
Closes #3343 from felixmaximilian/fix-documentation and squashes the
following commits:
43dcdfb [Felix Maximilian Möller] Removed the information about the switch
implicitPrefs. The parameter implicitPrefs cannot be set in this context
because it is inherently true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc
string.
commit 8fbf72b7903b5bbec8d949151aa4693b4af26ff5
Author: Davies Liu <[email protected]>
Date: 2014-11-18T18:11:13Z
[SPARK-4435] [MLlib] [PySpark] improve classification
This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel
and SVMModel, and also supports RDDs of vectors in
LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().
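A rough usage sketch of the API described above (assuming an existing
SparkContext `sc` and a `training_data` RDD of LabeledPoint; the exact
signatures may differ from the merged code):
```
from pyspark.mllib.classification import LogisticRegressionWithSGD

model = LogisticRegressionWithSGD.train(training_data)
model.setThreshold(0.3)           # scores >= 0.3 are classified as 1
print(model.predict([1.0, 0.0]))  # predict a single vector
model.clearThreshold()            # predict() now returns raw scores
# predict an RDD of vectors, as added by this PR
print(model.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect())
```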
Author: Davies Liu <[email protected]>
Closes #3305 from davies/setThreshold and squashes the following commits:
d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into
setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
commit b54c6ab3c54e65238d6766832ea1f3fcd694f2fd
Author: Xiangrui Meng <[email protected]>
Date: 2014-11-18T18:35:29Z
[SPARK-4396] allow lookup by index in Python's Rating
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input.
However, model.predict outputs an RDD of Rating. So on the input side, users
can use r[0], r[1], r[2], while on the output side, users have to use r.user,
r.product, r.rating. We should allow lookup by index in Rating by making Rating
a namedtuple.
davies
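The change boils down to something like the following sketch, with the
field order taken from the (user, product, rating) tuples above:
```
from collections import namedtuple

Rating = namedtuple("Rating", ["user", "product", "rating"])

r = Rating(user=1, product=2, rating=5.0)
# both access styles now work
assert (r[0], r[1], r[2]) == (r.user, r.product, r.rating)
```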
Author: Xiangrui Meng <[email protected]>
Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:
543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
commit 90d72ec8502f7ec11d2fe42f08c884ad2159266f
Author: Michael Armbrust <[email protected]>
Date: 2014-11-18T20:13:23Z
[SQL] Support partitioned parquet tables that have the key in both the
directory and the file
Author: Michael Armbrust <[email protected]>
Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following
commits:
447f08c [Michael Armbrust] Support partitioned parquet tables that have the
key in both the directory and the file
commit bfebfd8b28eeb7e75292333f7885aa0830fcb5fe
Author: Kousuke Saruta <[email protected]>
Date: 2014-11-18T20:17:33Z
[SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar
name.
This PR adds a regression test for SPARK-4434.
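To illustrate the triple-slash case the regression test covers (a Python
illustration only; the actual validation logic is in Scala):
```
from urllib.parse import urlparse

# file:/path and file:///path both denote the local path /path, so a
# validator that accepts only one of the two forms rejects valid URIs.
for uri in ["file:/tmp/app.jar", "file:///tmp/app.jar"]:
    p = urlparse(uri)
    print(uri, "-> scheme:", p.scheme, "path:", p.path)
```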
Author: Kousuke Saruta <[email protected]>
Closes #3326 from sarutak/add-triple-slash-testcase and squashes the
following commits:
82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
9149027 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into add-triple-slash-testcase
c1c80ca [Kousuke Saruta] Fixed style
4f30210 [Kousuke Saruta] Modified comments
9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not
enough for Jar file
ac79906 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into add-triple-slash-testcase
6d4f47e [Kousuke Saruta] Added a test case as a regression check for
SPARK-4434
commit 80f31778820586a93d73fa15279a204611cc3c60
Author: Davies Liu <[email protected]>
Date: 2014-11-18T21:11:38Z
[SPARK-4404] remove sys.exit() in shutdown hook
If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown
hook: sys.exit() in a shutdown hook can cause a deadlock.
cc andrewor14
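A simplified sketch of the pattern in question (hypothetical names; the
real code lives in the PySpark bootstrapper):
```
import atexit
import subprocess
import sys

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def shutdown_hook():
    proc.terminate()
    # Do NOT call sys.exit() here: raising SystemExit inside an exit
    # handler while the interpreter is already shutting down can block
    # other handlers; returning normally is enough.

atexit.register(shutdown_hook)
```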
Author: Davies Liu <[email protected]>
Closes #3289 from davies/fix_bootstraper and squashes the following commits:
ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
fix_bootstraper
e04b690 [Davies Liu] remove sys.exit in hook
4d11366 [Davies Liu] remove shutdown hook if subprocess dies first
commit e34f38ff1a0dfbb0ffa4bd11071e03b1a58de998
Author: Davies Liu <[email protected]>
Date: 2014-11-18T21:37:21Z
[SPARK-4017] show progress bar in console
The progress bar will look like this:
[progress bar screenshot; image not mirrored to the mailing list]
In the right corner, the numbers are: finished tasks, running tasks, total
tasks.
After the stage has finished, it will disappear.
The progress bar is only shown if the logging level is WARN or higher (the
progress in the title is still shown); it can be turned off via
spark.driver.showConsoleProgress.
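For example, the bar could be disabled up front like this (a sketch; the
property name is taken from the description above):
```
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.showConsoleProgress", "false")
sc = SparkContext(conf=conf, appName="no-progress-bar")
```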
Author: Davies Liu <[email protected]>
Closes #3029 from davies/progress and squashes the following commits:
95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
fc49ac8 [Davies Liu] address comments
2e90f75 [Davies Liu] show multiple stages in same time
0081bcc [Davies Liu] address comments
38c42f1 [Davies Liu] fix tests
ab87958 [Davies Liu] disable progress bar during tests
30ac852 [Davies Liu] re-implement progress bar
b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
e4e7344 [Davies Liu] refactor
e1f524d [Davies Liu] revert unnecessary change
a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
5cae3f2 [Davies Liu] fix style
ea49fe0 [Davies Liu] address comments
bc53d99 [Davies Liu] refactor
e6bb189 [Davies Liu] fix logging in sparkshell
7e7d4e7 [Davies Liu] address comments
5df26bb [Davies Liu] fix style
9e42208 [Davies Liu] show progress bar in console and title
commit 010bc86e40a0e54b6850b75abd6105e70eb1af10
Author: Kay Ousterhout <[email protected]>
Date: 2014-11-18T23:01:06Z
[SPARK-4463] Add (de)select all button for add'l metrics.
This commit removes the behavior where when a user clicks
"Show additional metrics" on the stage page, all of the additional
metrics are automatically selected; now, collapsing and expanding
the additional metrics has no effect on which options are selected.
Instead, there's a "(De)select All" box at the top; checking this box
checks all additional metrics (and similarly, unchecking it unchecks
all additional metrics).
This commit is intended to be backported to 1.2, so that the additional
metrics behavior is not confusing to users.
Now when a user clicks the "Show additional metrics" menu, this is what
it looks like:
[screenshot; image not mirrored to the mailing list]
Author: Kay Ousterhout <[email protected]>
Closes #3331 from kayousterhout/SPARK-4463 and squashes the following
commits:
9e17cea [Kay Ousterhout] Added italics
b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l
metrics.
commit d2e29516f2064f93f3a9070c91fc7460706e0b0a
Author: Davies Liu <[email protected]>
Date: 2014-11-18T23:57:33Z
[SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10,
|       tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training
| our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use
| or not of the augmented representation for
| training data (i.e. whether bias features
| are activated or not).
| :param corrections: The number of corrections used in the LBFGS
|                     update (default: 10).
| :param tolerance: The convergence tolerance of iterations for
|                   L-BFGS (default: 1e-4).
|
| >>> data = [
| ... LabeledPoint(0.0, [0.0, 1.0]),
| ... LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu <[email protected]>
Closes #3307 from davies/lbfgs and squashes the following commits:
34bd986 [Davies Liu] Merge branch 'master' of
http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into
lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
commit 4a377aff2d36b64a65b54192a987aba44b8f78e0
Author: Davies Liu <[email protected]>
Date: 2014-11-19T00:17:51Z
[SPARK-3721] [PySpark] broadcast objects larger than 2G
This patch brings support for broadcasting objects larger than 2G.
pickle, zlib, FrameSerializer and Array[Byte] all cannot handle objects
larger than 2G, so this patch introduces LargeObjectSerializer to serialize
broadcast objects; the object is serialized and compressed into small
chunks. It also changes the type of Broadcast[Array[Byte]] into
Broadcast[Array[Array[Byte]]].
Testing broadcasts of objects larger than 2G is slow and memory
hungry, so this was tested manually; it could be added to SparkPerf.
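A minimal sketch of the chunking idea (helper names are hypothetical; the
actual LargeObjectSerializer differs in detail):
```
import pickle
import zlib

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; the real chunk size may differ

def serialize_in_chunks(obj):
    # compress the pickled object, then cut the byte string into
    # chunks small enough to stay clear of 2G single-buffer limits
    data = zlib.compress(pickle.dumps(obj))
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def deserialize_chunks(chunks):
    return pickle.loads(zlib.decompress(b"".join(chunks)))
```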
Author: Davies Liu <[email protected]>
Author: Davies Liu <[email protected]>
Closes #2659 from davies/huge and squashes the following commits:
7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
commit bb46046154a438df4db30a0e1fd557bd3399ee7b
Author: Xiangrui Meng <[email protected]>
Date: 2014-11-19T00:25:44Z
[SPARK-4433] fix a race condition in zipWithIndex
Spark hangs with the following code:
~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~
This is because ZippedWithIndexRDD triggers a job in getPartitions, which
causes a deadlock in DAGScheduler.getPreferredLocs (a synchronized method).
The fix is to compute `startIndices` during construction.
This should be applied to branch-1.0, branch-1.1, and branch-1.2.
pwendell
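The idea of the fix, sketched in Python (the real change is in the Scala
ZippedWithIndexRDD; names here are illustrative): compute each partition's
starting index eagerly, once, rather than triggering a job lazily inside
getPartitions:
```
def start_indices(partition_sizes):
    # partition i starts at the total number of elements in all
    # earlier partitions (an exclusive prefix sum)
    starts, total = [], 0
    for size in partition_sizes:
        starts.append(total)
        total += size
    return starts

assert start_indices([3, 4, 2]) == [0, 3, 7]
```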
Author: Xiangrui Meng <[email protected]>
Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:
c284d9f [Xiangrui Meng] fix a race condition in zipWithIndex
commit 7f22fa81ebd5e501fcb0e1da5506d1d4fb9250cf
Author: Davies Liu <[email protected]>
Date: 2014-11-19T00:37:35Z
[SPARK-4327] [PySpark] Python API for RDD.randomSplit()
```
pyspark.RDD.randomSplit(self, weights, seed=None)
Randomly splits this RDD with the provided weights.
:param weights: weights for splits, will be normalized if they don't
sum to 1
:param seed: random seed
:return: the split RDDs in a list
>>> rdd = sc.parallelize(range(10), 1)
>>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
>>> rdd1.collect()
[3, 6]
>>> rdd2.collect()
[0, 5, 7]
>>> rdd3.collect()
[1, 2, 4, 8, 9]
```
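The weight handling described in the docstring amounts to normalizing the
weights into cumulative split boundaries, roughly like this sketch (not the
literal patch code):
```
def cumulative_boundaries(weights):
    # normalize the weights so they sum to 1, then accumulate them into
    # boundaries [0, b1, ..., 1]; split i keeps the items whose random
    # value falls in [bounds[i], bounds[i+1])
    total = float(sum(weights))
    bounds, acc = [0.0], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    return bounds

print(cumulative_boundaries([0.4, 0.6, 1.0]))  # [0.0, 0.2, 0.5, 1.0]
```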
Author: Davies Liu <[email protected]>
Closes #3193 from davies/randomSplit and squashes the following commits:
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no
performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into
randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
commit 13f7b05f3fd7a34ba5dbb3d153308a910059503b
Author: Davies Liu <[email protected]>
Date: 2014-11-19T01:00:17Z
Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into
numpy
Conflicts:
python/pyspark/rddsampler.py
----