GitHub user davies opened a pull request:
https://github.com/apache/spark/pull/3351
[SPARK-4477] [PySpark] remove numpy from RDDSampler
In RDDSampler, we tried to use numpy to gain better performance for poisson()
sampling, but the number of calls to random() is only about (1 + fraction) * N
in a pure Python implementation of poisson(), so there is not much performance
to gain from numpy.
numpy is not a dependency of pyspark, so relying on it can introduce problems,
such as numpy being installed on the master but missing on the slaves, as
reported in SPARK-927.
It also complicates the code a lot, so we should remove numpy from
RDDSampler.
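For reference, a pure Python poisson() in the spirit of Knuth's algorithm
could look like the sketch below (illustrative only, not the literal patch
code); it makes about 1 + mean calls to random() per sampled element in
expectation, which is where the (1 + fraction) * N figure comes from:
```
import math
import random

def poisson(mean):
    # Knuth's algorithm: multiply uniform draws together until the
    # product falls below exp(-mean); the number of draws minus one
    # is Poisson(mean) distributed.
    threshold = math.exp(-mean)
    k, p = 0, 1.0
    while True:
        p *= random.random()
        if p <= threshold:
            return k
        k += 1
```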
closes #2313
Note: this patch includes some commits that are not yet mirrored to GitHub;
it will be fine once the mirror catches up.
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/davies/spark numpy
Alternatively you can review and apply these changes as the patch at:
https://github.com/apache/spark/pull/3351.patch
To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:
This closes #3351
----
commit 41fce544cadce5ed314b75f368abf79ee7fcd2da
Author: Davies Liu <[email protected]>
Date: 2014-11-10T23:54:20Z
randomSplit()
commit 1715ee38b619868c284454d5058376e9e0ca09a7
Author: Davies Liu <[email protected]>
Date: 2014-11-12T22:15:47Z
address comments
commit 0d9b256a9b7aa52c37c6a952ffd68bf4441d46e5
Author: Davies Liu <[email protected]>
Date: 2014-11-12T23:00:17Z
refactor
commit 95a48aca935b2c3c9fe1af2aad674811a4c04395
Author: Davies Liu <[email protected]>
Date: 2014-11-13T18:53:11Z
Merge branch 'master' of github.com:apache/spark into randomSplit
commit c7a2007245752480dfe96316d22dc8f19ea1b1a2
Author: Davies Liu <[email protected]>
Date: 2014-11-13T19:29:43Z
switch to python implementation
commit f866bcf7e23a6d888d2113df4da3031bfe91400e
Author: Davies Liu <[email protected]>
Date: 2014-11-13T19:31:09Z
remove unneeded change
commit 4dfa2cdce6a2eaae5f5e24321e324bd0498ea49f
Author: Davies Liu <[email protected]>
Date: 2014-11-13T20:20:56Z
refactor
commit f5fdf63fe0d0ea091c01a4144585276a7db63625
Author: Davies Liu <[email protected]>
Date: 2014-11-13T20:54:14Z
fix bug with int in weights
commit 78bf997f13c6f08129671a9d6a3484620d5b37a2
Author: Davies Liu <[email protected]>
Date: 2014-11-13T21:08:10Z
fix tests, do not use numpy in randomSplit, no performance gain
commit 51649f5e5b29ab8db1c6c3fd91c6f625124ab327
Author: Davies Liu <[email protected]>
Date: 2014-11-14T06:40:40Z
remove numpy in RDDSampler
commit f58302346ad0e859994fae52299fd32b3d3ef61a
Author: Davies Liu <[email protected]>
Date: 2014-11-14T08:37:20Z
fix tests
commit cedc3b5aa43a16e2da62f12a36317f00aa1002cc
Author: Felix Maximilian Möller <[email protected]>
Date: 2014-11-18T18:08:24Z
ALS implicit: added missing parameter alpha in doc string
Author: Felix Maximilian Möller
<[email protected]>
Closes #3343 from felixmaximilian/fix-documentation and squashes the
following commits:
43dcdfb [Felix Maximilian Möller] Removed the information about the switch
implicitPrefs. The parameter implicitPrefs cannot be set in this context
because it is inherently true when calling the trainImplicit method.
7d172ba [Felix Maximilian Möller] added missing parameter alpha in doc
string.
commit 8fbf72b7903b5bbec8d949151aa4693b4af26ff5
Author: Davies Liu <[email protected]>
Date: 2014-11-18T18:11:13Z
[SPARK-4435] [MLlib] [PySpark] improve classification
This PR adds setThreshold() and clearThreshold() to LogisticRegressionModel
and SVMModel, and also supports RDDs of vectors in
LogisticRegressionModel.predict(), SVMModel.predict() and NaiveBayes.predict().
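A rough usage sketch of the API described above (assuming an existing
SparkContext `sc` and a `training_data` RDD of LabeledPoint; the exact
signatures may differ from the merged code):
```
from pyspark.mllib.classification import LogisticRegressionWithSGD

model = LogisticRegressionWithSGD.train(training_data)
model.setThreshold(0.3)           # scores >= 0.3 are classified as 1
print(model.predict([1.0, 0.0]))  # predict a single vector
model.clearThreshold()            # predict() now returns raw scores
# predict an RDD of vectors, as added by this PR
print(model.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect())
```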
Author: Davies Liu <[email protected]>
Closes #3305 from davies/setThreshold and squashes the following commits:
d0b835f [Davies Liu] Merge branch 'master' of github.com:apache/spark into
setThreshold
e4acd76 [Davies Liu] address comments
2231a5f [Davies Liu] bugfix
7bd9009 [Davies Liu] address comments
0b0a8a7 [Davies Liu] address comments
c1e5573 [Davies Liu] improve classification
commit b54c6ab3c54e65238d6766832ea1f3fcd694f2fd
Author: Xiangrui Meng <[email protected]>
Date: 2014-11-18T18:35:29Z
[SPARK-4396] allow lookup by index in Python's Rating
In PySpark, ALS can take an RDD of (user, product, rating) tuples as input.
However, model.predict outputs an RDD of Rating. So on the input side, users
can use r[0], r[1], r[2], while on the output side, users have to use r.user,
r.product, r.rating. We should allow lookup by index in Rating by making Rating
a namedtuple.
davies
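The change boils down to something like the following sketch, with the
field order taken from the (user, product, rating) tuples above:
```
from collections import namedtuple

Rating = namedtuple("Rating", ["user", "product", "rating"])

r = Rating(user=1, product=2, rating=5.0)
# both access styles now work
assert (r[0], r[1], r[2]) == (r.user, r.product, r.rating)
```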
Author: Xiangrui Meng <[email protected]>
Closes #3261 from mengxr/SPARK-4396 and squashes the following commits:
543aef0 [Xiangrui Meng] use named tuple to implement ALS
0b61bae [Xiangrui Meng] Merge remote-tracking branch 'apache/master' into
SPARK-4396
d3bd7d4 [Xiangrui Meng] allow lookup by index in Python's Rating
commit 90d72ec8502f7ec11d2fe42f08c884ad2159266f
Author: Michael Armbrust <[email protected]>
Date: 2014-11-18T20:13:23Z
[SQL] Support partitioned parquet tables that have the key in both the
directory and the file
Author: Michael Armbrust <[email protected]>
Closes #3272 from marmbrus/keyInPartitionedTable and squashes the following
commits:
447f08c [Michael Armbrust] Support partitioned parquet tables that have the
key in both the directory and the file
commit bfebfd8b28eeb7e75292333f7885aa0830fcb5fe
Author: Kousuke Saruta <[email protected]>
Date: 2014-11-18T20:17:33Z
[SPARK-4075][SPARK-4434] Fix the URI validation logic for Application Jar
name.
This PR adds a regression test for SPARK-4434.
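To illustrate the triple-slash case the regression test covers (a Python
illustration only; the actual validation logic is in Scala):
```
from urllib.parse import urlparse

# file:/path and file:///path both denote the local path /path, so a
# validator that accepts only one of the two forms rejects valid URIs.
for uri in ["file:/tmp/app.jar", "file:///tmp/app.jar"]:
    p = urlparse(uri)
    print(uri, "-> scheme:", p.scheme, "path:", p.path)
```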
Author: Kousuke Saruta <[email protected]>
Closes #3326 from sarutak/add-triple-slash-testcase and squashes the
following commits:
82bc9cc [Kousuke Saruta] Fixed wrong grammar in comment
9149027 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into add-triple-slash-testcase
c1c80ca [Kousuke Saruta] Fixed style
4f30210 [Kousuke Saruta] Modified comments
9e09da2 [Kousuke Saruta] Fixed URI validation for jar file
d4b99ef [Kousuke Saruta] [SPARK-4075] [Deploy] Jar url validation is not
enough for Jar file
ac79906 [Kousuke Saruta] Merge branch 'master' of
git://git.apache.org/spark into add-triple-slash-testcase
6d4f47e [Kousuke Saruta] Added a test case as a regression check for
SPARK-4434
commit 80f31778820586a93d73fa15279a204611cc3c60
Author: Davies Liu <[email protected]>
Date: 2014-11-18T21:11:38Z
[SPARK-4404] remove sys.exit() in shutdown hook
If SparkSubmit dies first, the bootstrapper will be blocked by the shutdown
hook: sys.exit() in a shutdown hook can cause a deadlock.
cc andrewor14
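A simplified sketch of the pattern in question (hypothetical names; the
real code lives in the PySpark bootstrapper):
```
import atexit
import subprocess
import sys

proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(60)"])

def shutdown_hook():
    proc.terminate()
    # Do NOT call sys.exit() here: raising SystemExit inside an exit
    # handler while the interpreter is already shutting down can block
    # other handlers; returning normally is enough.

atexit.register(shutdown_hook)
```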
Author: Davies Liu <[email protected]>
Closes #3289 from davies/fix_bootstraper and squashes the following commits:
ea5cdd1 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
fix_bootstraper
e04b690 [Davies Liu] remove sys.exit in hook
4d11366 [Davies Liu] remove shutdown hook if subprocess dies first
commit e34f38ff1a0dfbb0ffa4bd11071e03b1a58de998
Author: Davies Liu <[email protected]>
Date: 2014-11-18T21:37:21Z
[SPARK-4017] show progress bar in console
The progress bar will look like this:
[progress bar screenshot; image not mirrored to the mailing list]
In the right corner, the numbers are: finished tasks, running tasks, total
tasks.
After the stage has finished, it will disappear.
The progress bar is only shown if the logging level is WARN or higher (the
progress in the title is still shown); it can be turned off via
spark.driver.showConsoleProgress.
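For example, the bar could be disabled up front like this (a sketch; the
property name is taken from the description above):
```
from pyspark import SparkConf, SparkContext

conf = SparkConf().set("spark.driver.showConsoleProgress", "false")
sc = SparkContext(conf=conf, appName="no-progress-bar")
```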
Author: Davies Liu <[email protected]>
Closes #3029 from davies/progress and squashes the following commits:
95336d5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
fc49ac8 [Davies Liu] address comments
2e90f75 [Davies Liu] show multiple stages in same time
0081bcc [Davies Liu] address comments
38c42f1 [Davies Liu] fix tests
ab87958 [Davies Liu] disable progress bar during tests
30ac852 [Davies Liu] re-implement progress bar
b3f34e5 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
6fd30ff [Davies Liu] show progress bar if no task finished in 500ms
e4e7344 [Davies Liu] refactor
e1f524d [Davies Liu] revert unnecessary change
a60477c [Davies Liu] Merge branch 'master' of github.com:apache/spark into
progress
5cae3f2 [Davies Liu] fix style
ea49fe0 [Davies Liu] address comments
bc53d99 [Davies Liu] refactor
e6bb189 [Davies Liu] fix logging in sparkshell
7e7d4e7 [Davies Liu] address comments
5df26bb [Davies Liu] fix style
9e42208 [Davies Liu] show progress bar in console and title
commit 010bc86e40a0e54b6850b75abd6105e70eb1af10
Author: Kay Ousterhout <[email protected]>
Date: 2014-11-18T23:01:06Z
[SPARK-4463] Add (de)select all button for add'l metrics.
This commit removes the behavior where when a user clicks
"Show additional metrics" on the stage page, all of the additional
metrics are automatically selected; now, collapsing and expanding
the additional metrics has no effect on which options are selected.
Instead, there's a "(De)select All" box at the top; checking this box
checks all additional metrics (and similarly, unchecking it unchecks
all additional metrics).
This commit is intended to be backported to 1.2, so that the additional
metrics behavior is not confusing to users.
Now when a user clicks the "Show additional metrics" menu, this is what
it looks like:
[screenshot; image not mirrored to the mailing list]
Author: Kay Ousterhout <[email protected]>
Closes #3331 from kayousterhout/SPARK-4463 and squashes the following
commits:
9e17cea [Kay Ousterhout] Added italics
b731230 [Kay Ousterhout] [SPARK-4463] Add (de)select all button for add'l
metrics.
commit d2e29516f2064f93f3a9070c91fc7460706e0b0a
Author: Davies Liu <[email protected]>
Date: 2014-11-18T23:57:33Z
[SPARK-4306] [MLlib] Python API for LogisticRegressionWithLBFGS
```
class LogisticRegressionWithLBFGS
| train(cls, data, iterations=100, initialWeights=None, corrections=10,
|       tolerance=0.0001, regParam=0.01, intercept=False)
| Train a logistic regression model on the given data.
|
| :param data: The training data, an RDD of LabeledPoint.
| :param iterations: The number of iterations (default: 100).
| :param initialWeights: The initial weights (default: None).
| :param regParam: The regularizer parameter (default: 0.01).
| :param regType: The type of regularizer used for training
| our model.
| :Allowed values:
| - "l1" for using L1 regularization
| - "l2" for using L2 regularization
| - None for no regularization
| (default: "l2")
| :param intercept: Boolean parameter which indicates the use
| or not of the augmented representation for
| training data (i.e. whether bias features
| are activated or not).
| :param corrections: The number of corrections used in the LBFGS
|                     update (default: 10).
| :param tolerance: The convergence tolerance of iterations for
|                   L-BFGS (default: 1e-4).
|
| >>> data = [
| ... LabeledPoint(0.0, [0.0, 1.0]),
| ... LabeledPoint(1.0, [1.0, 0.0]),
| ... ]
| >>> lrm = LogisticRegressionWithLBFGS.train(sc.parallelize(data))
| >>> lrm.predict([1.0, 0.0])
| 1
| >>> lrm.predict([0.0, 1.0])
| 0
| >>> lrm.predict(sc.parallelize([[1.0, 0.0], [0.0, 1.0]])).collect()
| [1, 0]
```
Author: Davies Liu <[email protected]>
Closes #3307 from davies/lbfgs and squashes the following commits:
34bd986 [Davies Liu] Merge branch 'master' of
http://git-wip-us.apache.org/repos/asf/spark into lbfgs
5a945a6 [Davies Liu] address comments
941061b [Davies Liu] Merge branch 'master' of github.com:apache/spark into
lbfgs
03e5543 [Davies Liu] add it to docs
ed2f9a8 [Davies Liu] add regType
76cd1b6 [Davies Liu] reorder arguments
4429a74 [Davies Liu] Update classification.py
9252783 [Davies Liu] python api for LogisticRegressionWithLBFGS
commit 4a377aff2d36b64a65b54192a987aba44b8f78e0
Author: Davies Liu <[email protected]>
Date: 2014-11-19T00:17:51Z
[SPARK-3721] [PySpark] broadcast objects larger than 2G
This patch brings support for broadcasting objects larger than 2G.
pickle, zlib, FrameSerializer and Array[Byte] all cannot handle objects
larger than 2G, so this patch introduces LargeObjectSerializer to serialize
broadcast objects; the object is serialized and compressed into small
chunks. It also changes the type of Broadcast[Array[Byte]] into
Broadcast[Array[Array[Byte]]].
Testing broadcasts of objects larger than 2G is slow and memory
hungry, so this was tested manually; it could be added to SparkPerf.
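A minimal sketch of the chunking idea (helper names are hypothetical; the
actual LargeObjectSerializer differs in detail):
```
import pickle
import zlib

CHUNK_SIZE = 1 << 20  # 1 MiB per chunk; the real chunk size may differ

def serialize_in_chunks(obj):
    # compress the pickled object, then cut the byte string into
    # chunks small enough to stay clear of 2G single-buffer limits
    data = zlib.compress(pickle.dumps(obj))
    return [data[i:i + CHUNK_SIZE] for i in range(0, len(data), CHUNK_SIZE)]

def deserialize_chunks(chunks):
    return pickle.loads(zlib.decompress(b"".join(chunks)))
```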
Author: Davies Liu <[email protected]>
Author: Davies Liu <[email protected]>
Closes #2659 from davies/huge and squashes the following commits:
7b57a14 [Davies Liu] add more tests for broadcast
28acff9 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
a2f6a02 [Davies Liu] bug fix
4820613 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
5875c73 [Davies Liu] address comments
10a349b [Davies Liu] address comments
0c33016 [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
6182c8f [Davies Liu] Merge branch 'master' into huge
d94b68f [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
2514848 [Davies Liu] address comments
fda395b [Davies Liu] Merge branch 'master' of github.com:apache/spark into
huge
1c2d928 [Davies Liu] fix scala style
091b107 [Davies Liu] broadcast objects larger than 2G
commit bb46046154a438df4db30a0e1fd557bd3399ee7b
Author: Xiangrui Meng <[email protected]>
Date: 2014-11-19T00:25:44Z
[SPARK-4433] fix a race condition in zipWithIndex
Spark hangs with the following code:
~~~
sc.parallelize(1 to 10).zipWithIndex.repartition(10).count()
~~~
This is because ZippedWithIndexRDD triggers a job in getPartitions, which
causes a deadlock in DAGScheduler.getPreferredLocs (a synchronized method).
The fix is to compute `startIndices` during construction.
This should be applied to branch-1.0, branch-1.1, and branch-1.2.
pwendell
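The idea of the fix, sketched in Python (the real change is in the Scala
ZippedWithIndexRDD; names here are illustrative): compute each partition's
starting index eagerly, once, rather than triggering a job lazily inside
getPartitions:
```
def start_indices(partition_sizes):
    # partition i starts at the total number of elements in all
    # earlier partitions (an exclusive prefix sum)
    starts, total = [], 0
    for size in partition_sizes:
        starts.append(total)
        total += size
    return starts

assert start_indices([3, 4, 2]) == [0, 3, 7]
```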
Author: Xiangrui Meng <[email protected]>
Closes #3291 from mengxr/SPARK-4433 and squashes the following commits:
c284d9f [Xiangrui Meng] fix a race condition in zipWithIndex
commit 7f22fa81ebd5e501fcb0e1da5506d1d4fb9250cf
Author: Davies Liu <[email protected]>
Date: 2014-11-19T00:37:35Z
[SPARK-4327] [PySpark] Python API for RDD.randomSplit()
```
pyspark.RDD.randomSplit(self, weights, seed=None)
Randomly splits this RDD with the provided weights.
:param weights: weights for splits, will be normalized if they don't
sum to 1
:param seed: random seed
:return: the split RDDs in a list
>>> rdd = sc.parallelize(range(10), 1)
>>> rdd1, rdd2, rdd3 = rdd.randomSplit([0.4, 0.6, 1.0], 11)
>>> rdd1.collect()
[3, 6]
>>> rdd2.collect()
[0, 5, 7]
>>> rdd3.collect()
[1, 2, 4, 8, 9]
```
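The weight handling described in the docstring amounts to normalizing the
weights into cumulative split boundaries, roughly like this sketch (not the
literal patch code):
```
def cumulative_boundaries(weights):
    # normalize the weights so they sum to 1, then accumulate them into
    # boundaries [0, b1, ..., 1]; split i keeps the items whose random
    # value falls in [bounds[i], bounds[i+1])
    total = float(sum(weights))
    bounds, acc = [0.0], 0.0
    for w in weights:
        acc += w / total
        bounds.append(acc)
    return bounds

print(cumulative_boundaries([0.4, 0.6, 1.0]))  # [0.0, 0.2, 0.5, 1.0]
```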
Author: Davies Liu <[email protected]>
Closes #3193 from davies/randomSplit and squashes the following commits:
78bf997 [Davies Liu] fix tests, do not use numpy in randomSplit, no
performance gain
f5fdf63 [Davies Liu] fix bug with int in weights
4dfa2cd [Davies Liu] refactor
f866bcf [Davies Liu] remove unneeded change
c7a2007 [Davies Liu] switch to python implementation
95a48ac [Davies Liu] Merge branch 'master' of github.com:apache/spark into
randomSplit
0d9b256 [Davies Liu] refactor
1715ee3 [Davies Liu] address comments
41fce54 [Davies Liu] randomSplit()
commit 13f7b05f3fd7a34ba5dbb3d153308a910059503b
Author: Davies Liu <[email protected]>
Date: 2014-11-19T01:00:17Z
Merge branch 'master' of http://git-wip-us.apache.org/repos/asf/spark into
numpy
Conflicts:
python/pyspark/rddsampler.py
----