Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
Ok, thanks. Do you see any missing task for it to be merged, then?
---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
For the "normal failure case", the one we encounter is that when the
vocabulary is too small, the embedding learning raises an exception. And it is
pretty legitimate in our use cases to have
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/17076#discussion_r104115651
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
The blocking flag is not the issue.
The issue is that there was no try / finally, so the destruction was only
done if the fitting succeeded and was skipped otherwise.
In this PR I
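The control flow at stake can be sketched in plain Scala; `FakeBroadcast` below is a hypothetical stand-in for Spark's `Broadcast` handle, used only to illustrate why the try / finally matters:

```scala
// Minimal sketch of the try / finally fix discussed above. FakeBroadcast is a
// hypothetical stand-in for a broadcast handle, not Spark's actual API.
class FakeBroadcast[T](val value: T) {
  var destroyed = false
  def destroy(): Unit = { destroyed = true }
}

def fitWithCleanup[T, R](bc: FakeBroadcast[T])(fit: T => R): R =
  try {
    fit(bc.value) // may throw, e.g. when the vocabulary is too small
  } finally {
    bc.destroy()  // now runs on the success path *and* on the exception path
  }
```

Without the `finally`, a failing fit skips `destroy()` and leaks the broadcast, which matches the accumulation described elsewhere in this thread for repeated Word2Vec runs.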
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
Actually we *did* observe a memory leak (severe enough to lead to application
failure) due to this when doing thousands of Word2Vec learning in one Spark
application, a few of them failing
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
I'm fixing this (an issue compiling Spark locally delayed me). The whole point
is that they are *not* destroyed within the method in case of exception.
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/14299#discussion_r103657660
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,21 @@ class Word2Vec extends Serializable
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/14299#discussion_r103656554
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,21 @@ class Word2Vec extends Serializable
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16805
@thunterdb @srowen could you please point us to the right reviewers /
committers regarding this proposed improvement to PipedRDD, supporting
binary streams?
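The binary-piping idea behind the proposal can be sketched with the JDK alone (this is not Spark's PipedRDD API, just an illustration of streaming raw bytes through an external command):

```scala
// Sketch: pipe raw bytes through an external process and read raw bytes back,
// with no line-based text decoding in between. For large inputs a real
// implementation would write and read on separate threads to avoid pipe
// deadlock; this minimal version is fine for small payloads.
def pipeBinary(cmd: Seq[String], input: Array[Byte]): Array[Byte] = {
  val proc = new ProcessBuilder(cmd: _*).start()
  val stdin = proc.getOutputStream
  stdin.write(input)
  stdin.close()
  val output = proc.getInputStream.readAllBytes() // JDK 9+
  proc.waitFor()
  output
}
```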
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/14299#discussion_r103421239
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -314,6 +315,20 @@ class Word2Vec extends Serializable
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
@vanzin The ticket was filed a long time ago; I updated the PR to make it
clearer. Is any manual linking in some other forge needed?
@thunterdb I have updated the review following
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
@thunterdb Copy that, working on it, and sorry for the delayed acknowledgement.
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/17076#discussion_r103254026
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/17076#discussion_r103249915
--- Diff:
mllib/src/main/scala/org/apache/spark/ml/classification/LinearSVC.scala ---
@@ -440,19 +440,9 @@ private class LinearSVCAggregator
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16279
We have backported it to our internal version of Spark anyhow. As said
above, feel free to close it if you consider it not worth officially
backporting :-)
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16279
Ok, thanks for the pointer. I do agree this is a non-critical judgement
call.
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16279
May I ask why? There were no conflicts, so no additional qualification work
is required, and this looks like a performance bug fix to me, not a new feature.
In order to adjust our contribution
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16279
@srowen sorry to bother you, but this is just the backport (without any
conflict) of #16037. Could you please validate it?
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16279
[SPARK-18471][MLLIB][BACKPORT-2.0] In LBFGS, avoid sending huge vectors of
0
## What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
Thanks for keeping up with this merge request; I've learned a lot about the
contribution process and good practices, and the next contribution will
hopefully be much more straightforward. Thanks
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r92150485
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r91949738
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r91493798
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,24 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
@sethah Agreed, that's why I suggested adding a dedicated treeAggregate
wrapper to MLlib Utils which would take care of that without fiddling with
sparsity in each seqOp and combOp. See
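The suggested wrapper can be sketched over plain Scala collections (a `Seq` of partitions stands in for an RDD; the name `aggregateWithZeroGenerator` is illustrative, not an existing MLlib API): shipping a `() => U` generator instead of a materialized zero means each partition builds its own zero locally, so no huge dense vector travels in the task closure.

```scala
// Hedged sketch of the suggested wrapper. The zero is produced by a generator
// on each "partition" instead of being serialized into every task closure.
def aggregateWithZeroGenerator[T, U](
    partitions: Seq[Seq[T]])(
    zeroGen: () => U)(
    seqOp: (U, T) => U,
    combOp: (U, U) => U): U =
  partitions
    .map(part => part.foldLeft(zeroGen())(seqOp)) // fresh zero per partition
    .reduce(combOp)
```

A real version would wrap `RDD.treeAggregate`, invoking the generator inside the tasks so that each seqOp and combOp stays oblivious to the sparsity trick.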
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r91036283
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,27 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
L-BFGS is the only optimizer I've used so far. I'm not sure how much time I
can free to take care of the other ones, but I'll try :-)
Regarding the benchmark, I'll check if we have
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
@MLnick there it is, all tests passing :-)
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/16037#discussion_r90422375
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,27 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
This seems very related to the change. I know nothing about the way PySpark
interfaces Python and Scala, and I'm surprised that changing an internal of the
Scala lib causes this. Ready to learn
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
Scala style checks fixed, and spurious cast removed.
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
@MLnick I have to add a cast of denseGrad, in the expression returned by
seqOp, from DenseVector to Vector. The only ways I could find to do it with the
API are: `Vectors.fromBreeze
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16037
Hello @MLnick,
Sorry that my push of a new version just crossed with your suggestion.
I'll test your suggestion. Thanks for having written it; I didn't get
that Sean's
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/16078
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16078
In order to reduce the mess around multiple PRs I'll close this one and
rebase the change in #16037 as you requested.
What is the right way, convenient for you, to share a proposal
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Agreed, this is too ML-specific to deserve a fix in core.
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/16038
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16078
It is necessary to address it in L-BFGS at least. We propose a solution in
core, which can legitimately be rejected as not relevant for core, and two
solutions in MLlib, one provide
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
@mridulm we don't know how to monitor the size of the serialized task. Sure,
it would not be 10 MB due to all those zeros, but we nonetheless measure a
significant increase in performance
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16078
[SPARK-18471][MLLIB] Fix huge vectors of zero send in closure in L-BFGS
## What changes were proposed in this pull request?
Introduced util TreeAggregatorWithZeroGenerator and used
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Actually, aggregating (and thus sending over the network) quite dense
SparseVectors with tens of millions of elements is not to be taken lightly. This
would require serious benchmarking. What I
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
The big doubt I have is about *Of course, the first call to axpy should
produce a dense vector, but, that's already on the executor*: the other operand
is sparse and has to be sparse
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Yes, sure, the zero Vector is very sparse. But I do not get your suggestion:
I see no way to pass a SparseVector as zero and have the type of the vector
change underway to DenseVector
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
No, this does not do the trick, as the result of the aggregation IS dense,
and the zero in (tree)aggregate has the same type as the result. Said
otherwise, in L-BFGS, we do aggregate vectors
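The typing constraint described here (the zero must have the same type as the result) can be worked around by aggregating under a common supertype; the sketch below uses minimal stand-in types, not Spark's actual Vector hierarchy, so the zero stays a cheap empty case while partial results are dense.

```scala
// Hypothetical stand-ins for MLlib's vector types, illustrating only the
// typing point: under supertype Vec, the zero can be the tiny ZeroVec while
// the aggregation result is a DenseVec.
sealed trait Vec
case object ZeroVec extends Vec // serializes small; safe to ship in a closure
final case class DenseVec(values: Vector[Double]) extends Vec

def add(a: Vec, b: Vec): Vec = (a, b) match {
  case (ZeroVec, v) => v
  case (v, ZeroVec) => v
  case (DenseVec(x), DenseVec(y)) =>
    // assumes equal lengths; a real implementation would check dimensions
    DenseVec(x.zip(y).map { case (u, w) => u + w })
}
```

With `add` as both seqOp and combOp, the first dense operand encountered on an executor densifies the running value, and `ZeroVec` never leaves the driver as anything but a marker.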
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
Maybe we could have a live talk/chat this evening Paris time / morning West
Coast time?
If I'm not mistaken both #16037 and this change fix the issue at two
different levels
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/16042
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16042
Sorry created by mistake
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16042
Fix of dev scripts and new, Criteo-specific, ones (WIP)
You can merge this pull request into a Git repository by running:
$ git pull https://github.com/AnthonyTruchet/spark dev-tools
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16040
Agreed, and we might also consider making the broadcasted variable
compatible with some Automatic Resource Management to avoid such leaks (like
https://github.com/jsuereth/scala-arm
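Since Scala 2.13 the standard library ships a resource manager in the same spirit as scala-arm; a hedged sketch with a hypothetical broadcast-like `Handle` (not Spark's `Broadcast` API) could look like this:

```scala
import scala.util.Using

// Handle is an illustrative broadcast-like resource, not Spark's API.
final class Handle[T](val value: T) {
  var destroyed = false
  def destroy(): Unit = { destroyed = true }
}

// Teach Using how to release a Handle: it calls destroy() in a finally block,
// on both the success and the exception path.
implicit def handleReleasable[T]: Using.Releasable[Handle[T]] =
  (h: Handle[T]) => h.destroy()

def use[T, R](h: Handle[T])(body: T => R): R =
  Using.resource(h)(handle => body(handle.value))
```

Wrapping every broadcast in such a loan pattern would make the leak discussed in this thread impossible by construction, at the cost of threading the combinator through the call sites.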
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16040
[SPARK-18612][MLLIB] Delete broadcasted variable in LBFGS CostFun
## What changes were proposed in this pull request?
Fix a broadcasted variable leak occurring at each invocation
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
It is related: if this PR gets accepted, then the fix for #16037
becomes trivial: replace treeAggregate with a call to
`treeAggregateWithZeroGenerator(() => (Vectors.zeros(n),
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/16038
@srowen here is the companion PR to #16037.
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16038
[SPARK-18471][CORE] New treeAggregate overload for large aggregators
## What changes were proposed in this pull request?
The zero for the aggregation used to be shipped
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15963
Once more (last time hopefully) I mistakenly fiddled with PRs. Closing this
one and replacing it with #16037.
The code style review above is taken into account in the new PR.
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/15963
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/16037
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes as a closure
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15963
Hello @srowen, sorry not to have updated this lately; I've been taken by
other emergencies at work. I'll update this on Monday. Actually I'll submit a
variant of treeAggregate in core
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15963
@srowen Here is at last the real PR for SPARK-18471.
Sorry for the noise due to GitHub fiddling...
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/15963
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes as a closure
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15931
Sure, sorry, I mistakenly pushed a badly rebased version
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/15931
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/15931
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?
The zero for the aggregation used to be shipped into a closure
Github user AnthonyTruchet closed the pull request at:
https://github.com/apache/spark/pull/15905
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15905
I missed part of my company guidelines. Closing this PR and creating a new
one shortly from my company account. Sorry for the noise.
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/15905#discussion_r88409682
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/optimization/LBFGS.scala ---
@@ -241,16 +241,28 @@ object LBFGS extends Logging
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15905
By the way, do you think this should be addressed in core or just in
each ML-specific use?
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/15905
[SPARK-18471][MLLIB] In LBFGS, avoid sending huge vectors of 0
## What changes were proposed in this pull request?
CostFun used to send a dense vector of zeroes as a closure
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15023
This is perfectly understandable, just in the same way that we err on the
side of caution by not switching to the 2.0 branch for prod just now :-)
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/15023
I'm aware that features are not generally back-ported. The point is, for us
this is a bug, preventing a deployment in production. We thus back-ported the
fix internally and now propose
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14270
Hello @markgrover @vanzin! As this is just a backport of your work, would
you please consider reviewing it?
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/15023
Backport [SPARK-5847] Allow for configuring MetricsSystem's prefix
This is a backport of #14270. Because the spark.internal.config system
does not exist in branch 1.6, a simpler
Github user AnthonyTruchet commented on the issue:
https://github.com/apache/spark/pull/14299
@pzz2011 Nothing _that_ bad if your application is just learning once and
saving / using the embeddings. But if you run hundreds or thousands of
learning runs, you'll just crash because
Github user AnthonyTruchet commented on a diff in the pull request:
https://github.com/apache/spark/pull/14299#discussion_r71723751
--- Diff:
mllib/src/main/scala/org/apache/spark/mllib/feature/Word2Vec.scala ---
@@ -313,133 +313,139 @@ class Word2Vec extends Serializable
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/14299
Ensure broadcasted variables are destroyed even in case of exception
## What changes were proposed in this pull request?
Ensure broadcasted variables are destroyed even in case
GitHub user AnthonyTruchet opened a pull request:
https://github.com/apache/spark/pull/14268
[SPARK-16440][MLlib] Destroy broadcasted variables even on driver
## What changes were proposed in this pull request?
Forgotten broadcasted variables were persisted in a previous PR