This is expensive but doable:
rdd.zipWithIndex().filter { case (_, idx) => idx >= 10 && idx < 20 }.collect()
-Xiangrui
On Thu, Jul 10, 2014 at 12:53 PM, Nick Chammas
wrote:
> Interesting question on Stack Overflow:
> http://stackoverflow.com/q/24677180/877069
>
> Basically, is there a way to ta
news20.binary's feature dimension is 1.35M, so the serialized task
size is above the default limit of 10M. You need to set
spark.akka.frameSize to, e.g., 20. Due to a bug (SPARK-1112), this
parameter is not passed to executors automatically, which causes Spark
to freeze. This was fixed in the latest master.
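For reference, a minimal sketch of setting this on the driver (assuming
the Spark 1.x SparkConf API; the app name is illustrative and the value
is in MB):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("KMeansExample") // illustrative name
  .set("spark.akka.frameSize", "20") // MB; the default is 10
val sc = new SparkContext(conf)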
SparkKMeans is a naive implementation. Please use
mllib.clustering.KMeans in practice. I created a JIRA for this:
https://issues.apache.org/jira/browse/SPARK-2434 -Xiangrui
On Thu, Jul 10, 2014 at 2:45 AM, Tathagata Das
wrote:
> I ran the SparkKMeans example (not the mllib KMeans that Sean ran) w
>> > 2) The execution was successful when run in local mode with reduced
>> > number
>> > of partitions. Does this imply issues communicating/coordinating across
>> > processes (i.e. driver, master and workers)?
>> >
>> > Thanks,
>> >
You can either use sc.wholeTextFiles and then a flatMap to reduce the
number of partitions, or give more memory to the driver process by
using --driver-memory 20g and then call RDD.repartition(small number)
after you load the data in. -Xiangrui
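A sketch of both options (paths and the partition count are illustrative):

// Option 1: one record per file, then flatten to lines.
val lines = sc.wholeTextFiles("hdfs:///path/to/dir")
  .flatMap { case (_, content) => content.split("\n") }

// Option 2: load normally (with a large --driver-memory), then shrink.
val data = sc.textFile("hdfs:///path/to/input").repartition(16)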
On Mon, Jul 7, 2014 at 7:38 PM, innowireless TaeYun K
1) The feature dimension should be a fixed number before you run
NaiveBayes. If you use bag of words, you need to handle the
word-to-index dictionary by yourself. You can either ignore the words
that never appear in training (because they have no effect in
prediction), or use hashing to randomly pr
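A sketch of the hashing idea (the dimension is arbitrary; assumes the
MLlib 1.0 Vectors factory):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

def hashedBagOfWords(words: Seq[String], dim: Int = 10000): Vector = {
  val counts = words
    .groupBy(w => ((w.hashCode % dim) + dim) % dim) // non-negative index
    .map { case (idx, ws) => (idx, ws.size.toDouble) }
  Vectors.sparse(dim, counts.toSeq)
}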
try sbt/sbt clean first
On Tue, Jul 8, 2014 at 8:25 AM, bai阿蒙 wrote:
> Hi guys,
> When I try to compile the latest source with sbt/sbt compile, I get an error.
> Can anyone help me?
>
> The following is the detail: it may be caused by TestSQLContext.scala
> [error]
> [error] while compiling:
> /d
Well, I believe this is a correct implementation but please let us
know if you run into problems. The NaiveBayes implementation in MLlib
v1.0 supports sparse data, which is usually the case for text
classification. I would recommend upgrading to v1.0. -Xiangrui
On Tue, Jul 8, 2014 at 7:20 AM, Rah
Hi Rahul,
We plan to add online model updates with Spark Streaming, perhaps in
v1.1, starting with linear methods. Please open a JIRA for Naive
Bayes. For Naive Bayes, we need to update the priors and conditional
probabilities, which means we should also remember the number of
observations for the
No, but it should be easy to add one. -Xiangrui
On Mon, Jul 7, 2014 at 12:37 AM, Ulanov, Alexander
wrote:
> Hi,
>
>
>
> Is there a method in Spark/MLlib to convert DenseVector to SparseVector?
>
>
>
> Best regards, Alexander
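A sketch of such a helper (assuming the MLlib 1.0 Vectors factory):

import org.apache.spark.mllib.linalg.{DenseVector, Vector, Vectors}

def toSparse(dv: DenseVector): Vector = {
  val nz = dv.values.zipWithIndex.filter { case (v, _) => v != 0.0 }
  Vectors.sparse(dv.size, nz.map(_._2), nz.map(_._1)) // indices, values
}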
ave1) and the second host with slave2.
>
> 2) The execution was successful when run in local mode with reduced number
> of partitions. Does this imply issues communicating/coordinating across
> processes (i.e. driver, master and workers)?
>
> Thanks,
> Bharath
>
>
>
> O
704062240-6a65
> 14/07/04 06:22:40 INFO MemoryStore: MemoryStore started with capacity 6.7
> GB.
> 14/07/04 06:22:40 INFO ConnectionManager: Bound socket to port 46901 with id
> = ConnectionManagerId(slave1,46901)
> 14/07/04 06:22:40 INFO BlockManagerMaster: Trying to register BlockManager
>
> job failed.
> So it appears that no matter what the task input-result size, the execution
> fails at the end of the stage corresponding to GradientDescent.aggregate
> (and the preceding count() in GradientDescent goes through fine). Let me
> know if you need any additional information.
>
PROCESS_LOCAL slave1 2014/07/02
> 16:01:28 35 s 0.1 s
> 1 727 SUCCESS PROCESS_LOCAL slave2 2014/07/02
> 16:01:28 33 s 99 ms
>
> Any pointers / diagnosis please?
>
>
>
>
> On Thu, Jun 19, 2014 at 10:03 AM, Bhara
3)
>>at
>> sun.launcher.LauncherHelper.getMainMethod(LauncherHelper.java:494)
>>at
>> sun.launcher.LauncherHelper.checkAndLoadMain(LauncherHelper.java:486)
>> "
>>
>> If "scalac -d classes/ SparkKMeans.scala" can't see my cla
Hi Dmitriy,
It is sweet to have the bindings, but it is very easy to downgrade the
performance with them. The BLAS/LAPACK APIs have been there for more
than 20 years and they are still the top choice for high-performance
linear algebra. I'm thinking about whether it is possible to make the
evaluat
Hi Thunder,
Please understand that both MLlib and breeze are in active
development. Before v1.0, we used jblas but in the public APIs we only
exposed Array[Double]. In v1.0, we introduced Vector that supports
both dense and sparse data and switched the backend to
breeze/netlib-java (except ALS). W
This is due to a bug in sampling, which was fixed in 1.0.1 and latest
master. See https://github.com/apache/spark/pull/1234 . -Xiangrui
On Wed, Jul 2, 2014 at 8:23 PM, x wrote:
> Hello,
>
> I am a newbie to Spark MLlib and ran into a curious case when following the
> instruction at the page below.
>
We were not ready to expose it as a public API in v1.0. Both breeze
and MLlib are in rapid development. It would be possible to expose it
as a developer API in v1.1. For now, it should be easy to define a
toBreeze method in your own project. -Xiangrui
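A sketch of such a toBreeze method (assuming breeze 0.7 and the MLlib
1.0 vector types):

import breeze.linalg.{DenseVector => BDV, SparseVector => BSV, Vector => BV}
import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vector}

def toBreeze(v: Vector): BV[Double] = v match {
  case d: DenseVector  => new BDV[Double](d.values)
  case s: SparseVector => new BSV[Double](s.indices, s.values, s.size)
}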
On Tue, Jul 1, 2014 at 12:17 PM, Koert Kuipers
Try to reduce number of partitions to match the number of cores. We
will add treeAggregate to reduce the communication cost.
PR: https://github.com/apache/spark/pull/1110
-Xiangrui
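For example (the target count is illustrative; shuffle = false avoids a
full shuffle):

val matched = data.coalesce(numPartitions = 16, shuffle = false)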
On Tue, Jul 1, 2014 at 12:55 AM, Charles Li wrote:
> Hi Spark,
>
> I am running LBFGS on our user data. The data s
You can use either bin/run-example or bin/spark-submit to run example
code. "scalac -d classes/ SparkKMeans.scala" doesn't recognize the Spark
classpath. There are examples in the official doc:
http://spark.apache.org/docs/latest/quick-start.html#where-to-go-from-here
-Xiangrui
On Tue, Jul 1, 2014 at
You were using an old version of numpy, 1.4? I think this is fixed in
the latest master. Try to replace vec.dot(target) by numpy.dot(vec,
target), or use the latest master. -Xiangrui
On Mon, Jun 30, 2014 at 2:04 PM, Sam Jacobs wrote:
> Hi,
>
>
> I modified the example code for logistic regression
Could you post the code snippet and the error stack trace? -Xiangrui
On Mon, Jun 30, 2014 at 7:03 AM, Daniel Micol wrote:
> Hello,
>
> I’m trying to use KMeans with MLLib but am getting a TaskNotSerializable
> error. I’m using Spark 0.9.1 and invoking the KMeans.run method with k = 2
> and numPar
labels can be learned), and I would also like to do cross fold
> validation.
>
> The driver doesn't seem to be using too much memory. I left it as -Xmx8g and
> it never complained.
>
> Kyle
>
>
>
> On Fri, Jun 27, 2014 at 1:18 PM, Xiangrui Meng wrote:
>>
Try to use --executor-memory 12g with spark-submit. Or you can set it
in conf/spark-defaults.conf and rsync it to all workers and then
restart. -Xiangrui
On Fri, Jun 27, 2014 at 1:05 PM, Peng Cheng wrote:
> I give up, communication must be blocked by the complex EC2 network topology
> (thou
Hi Kyle,
A few questions:
1) Did you use `setIntercept(true)`?
2) How many features?
I'm a little worried about driver's load because the final aggregation
and weights update happen on the driver. Did you check driver's memory
usage as well?
Best,
Xiangrui
On Fri, Jun 27, 2014 at 8:10 AM, Kyle
Your data source is S3 and the data is used twice. m1.large does not have very good
network performance. Please try file.count() and see how fast it goes. -Xiangrui
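Since the data is read twice, caching it after the first load may also
help; a sketch (the bucket path is illustrative):

val file = sc.textFile("s3n://my-bucket/path/to/data").cache()
file.count() // first action reads from S3 once and materializes the cache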
> On Jun 20, 2014, at 8:16 AM, mathias wrote:
>
> Hi there,
>
> We're trying out Spark and are experiencing some performance issues u
This is a planned feature for v1.1. I'm going to work on it after v1.0.1
release. -Xiangrui
> On Jun 20, 2014, at 6:46 AM, Charles Earl wrote:
>
> Looking for something like scikit's grid search module.
> C
It is because the frame size is not set correctly in executor backend. See
SPARK-1112. We are going to fix it in v1.0.1. Did you try the treeAggregate?
> On Jun 19, 2014, at 2:01 AM, Makoto Yui wrote:
>
> Xiangrui and Debasish,
>
> (2014/06/18 6:33), Debasish Das wrote:
>> I did run pretty b
Denis, I think it is fine to have PLSA in MLlib. But I'm not familiar
with the modification you mentioned since the paper is new. We may
need to spend more time to learn the trade-offs. Feel free to create a
JIRA for PLSA and we can move our discussion there. It would be great
if you can share your
> Thanks,
> Bharath
>
>
>
> On Wed, Jun 18, 2014 at 7:14 AM, Bharath Ravi Kumar
> wrote:
>>
>> Hi Xiangrui ,
>>
>> I'm using 1.0.0.
>>
>> Thanks,
>> Bharath
>>
>> On 18-Jun-2014 1:43 am, "Xiangrui Meng"
Makoto, please use --driver-memory 8G when you launch spark-shell. -Xiangrui
On Tue, Jun 17, 2014 at 4:49 PM, Xiangrui Meng wrote:
> DB, Yes, reduce and aggregate are linear.
>
> Makoto, dense vectors are used in aggregation. If you have 32
> partitions and each one sending a den
e shares the same
>> behavior as aggregate operation which is O(n)?
>>
>> Sincerely,
>>
>> DB Tsai
>> ---
>> My Blog: https://www.dbtsai.com
>> LinkedIn: https://www.linkedin.com/in/dbtsai
>>
>>
>> On Tue, Jun 17, 2014
Hi Makoto,
Are you using Spark 1.0 or 0.9? Could you go to the executor tab of
the web UI and check the driver's memory?
treeAggregate is not part of 1.0.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 2:00 PM, Xiangrui Meng wrote:
> Hi DB,
>
> treeReduce (treeAggregate) is a feature I
time, where n is the number of partitions. It would be
great if someone can help test its scalability.
Best,
Xiangrui
On Tue, Jun 17, 2014 at 1:32 PM, Makoto Yui wrote:
> Hi Xiangrui,
>
>
> (2014/06/18 4:58), Xiangrui Meng wrote:
>>
>> How many partitions did you set? If
Hi Bharath,
Thanks for posting the details! Which Spark version are you using?
Best,
Xiangrui
On Tue, Jun 17, 2014 at 6:48 AM, Bharath Ravi Kumar wrote:
> Hi,
>
> (Apologies for the long mail, but it's necessary to provide sufficient
> details considering the number of issues faced.)
>
> I'm ru
Hi Jayati,
Thanks for asking! MLlib algorithms are all implemented in Scala. It
makes it easier for us to maintain if we have the implementations in one
place. For the roadmap, please visit
http://www.slideshare.net/xrmeng/m-llib-hadoopsummit to see features
planned for v1.1. Before contributing new algo
Hi Makoto,
How many partitions did you set? If there are too many partitions,
please do a coalesce before calling ML algorithms.
Btw, could you try the tree branch in my repo?
https://github.com/mengxr/spark/tree/tree
I used tree aggregate in this branch. It should help with the scalability.
Be
ned, but the
> source code reveals that the intercept is also penalized if one is included,
> which is usually inappropriate. The developer should fix this problem.
>
> Best,
>
> Congrui
>
> -Original Message-
> From: Xiangrui Meng [mailto:men...@gmail.com]
> Sent:
1.
"examples/src/main/scala/org/apache/spark/examples/mllib/BinaryClassification.scala"
contains example code that shows how to set regParam.
2. A static method with more than 3 parameters becomes hard to
remember and hard to maintain. Please use LogisticRegressionWithSGD's
default constructor a
You can create tf vectors and then use
RowMatrix.computeColumnSummaryStatistics to get df (numNonzeros). For
tokenizer and stemmer, you can use scalanlp/chalk. Yes, it is worth
having a simple interface for it. -Xiangrui
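A sketch of the df computation (assuming an RDD of MLlib tf vectors):

import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.linalg.distributed.RowMatrix
import org.apache.spark.rdd.RDD

def documentFrequencies(tf: RDD[Vector]): Vector = {
  val mat = new RowMatrix(tf)
  mat.computeColumnSummaryStatistics().numNonzeros // df per column/term
}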
On Fri, Jun 13, 2014 at 1:21 AM, Stuti Awasthi wrote:
> Hi all,
>
>
>
> I wa
Could you try to click on that RDD and see the storage info per
partition? I tried continuously caching RDDs, so new ones kick old
ones out when there is not enough memory. I saw similar glitches but
the storage info per partition is correct. If you find a way to
reproduce this error, please creat
For broadcast data, please read
http://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
.
For one-vs-all, please read
https://en.wikipedia.org/wiki/Multiclass_classification .
-Xiangrui
On Mon, Jun 9, 2014 at 7:24 AM, littlebird wrote:
> Thank you for your reply, I don't q
Hi Tobias,
Which file system and which encryption are you using?
Best,
Xiangrui
On Sun, Jun 8, 2014 at 10:16 PM, Xiangrui Meng wrote:
> Hi dlaw,
>
> You are using breeze-0.8.1, but the spark assembly jar depends on
> breeze-0.7. If the spark assembly jar comes first on the cla
Hi dlaw,
You are using breeze-0.8.1, but the spark assembly jar depends on
breeze-0.7. If the spark assembly jar comes first on the classpath
but the method from DenseMatrix is only available in breeze-0.8.1, you
get NoSuchMethod. So,
a) If you don't need the features in breeze-0.8.1, do not
At this time, you need to do one-vs-all manually for multiclass
training. For your second question, if the algorithm is implemented in
Java/Scala/Python and designed for single machine, you can broadcast
the dataset to each worker, train models on workers. If the algorithm
is implemented in a diffe
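A sketch of that pattern (trainLocal stands in for any single-machine
trainer and is hypothetical, as are localDataset and classLabels):

// Ship the dataset to every worker once, then train one
// one-vs-all model per class label in parallel.
val bcData = sc.broadcast(localDataset)
val models = sc.parallelize(classLabels).map { label =>
  label -> trainLocal(bcData.value, label) // hypothetical trainer
}.collect()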
Yes. If k-means reached the max number of iterations, you should see
the following in the log:
KMeans reached the max number of iterations:
Best,
Xiangrui
On Fri, Jun 6, 2014 at 2:08 AM, Stuti Awasthi wrote:
> Hi all,
>
>
>
> I have a very basic question. I tried running KMeans with 10 iterati
For standalone and yarn mode, you need to install native libraries on all
nodes. The best solution is installing them to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3 . If your matrix is sparse, the native libraries cannot
help because they are for dense linear algebra. You can create RDD of
Hi Krishna,
Specifying executor memory in local mode has no effect, because all of
the threads run inside the same JVM. You can either try
--driver-memory 60g or start a standalone server.
Best,
Xiangrui
On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng wrote:
> 80M by 4 should be about 2.
80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
take that long, even on a single executor. Besides what Matei
suggested, could you also verify the executor memory in
http://localhost:4040 in the Executors tab. It is very likely the
executors do not have enough memory. In that c
Could you check whether the vectors have the same size? -Xiangrui
On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 wrote:
> what does this exception mean?
>
> 14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
> java.lang.IllegalArgumentException: requirement failed
> at scala.Predef$.r
Did you try sc.stop()?
On Tue, Jun 3, 2014 at 9:54 PM, MEETHU MATHEW wrote:
> Hi,
>
> I want to know how I can stop a running SparkContext in a proper way so that
> next time when I start a new SparkContext, the web UI can be launched on the
> same port 4040.Now when i quit the job using ctrl+z t
Hi Suela,
(Please subscribe to our user mailing list and send your questions there
in the future.) For your case, each file contains a column of numbers.
So you can use `sc.textFile` to read them first, zip them together,
and then create labeled points:
val xx = sc.textFile("/path/to/ex2x.dat").map(
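A fuller sketch of that pattern (assuming the labels live in a parallel
ex2y.dat, one number per line):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val xx = sc.textFile("/path/to/ex2x.dat").map(_.trim.toDouble)
val yy = sc.textFile("/path/to/ex2y.dat").map(_.trim.toDouble)
val points = xx.zip(yy).map { case (x, y) =>
  LabeledPoint(y, Vectors.dense(x))
}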
Yes. MLlib 1.0 supports sparse input data for linear methods. -Xiangrui
On Mon, Jun 2, 2014 at 11:36 PM, praveshjain1991
wrote:
> I am not sure. I have just been using some numerical datasets.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Using-
Hi Tobias,
One hack you can try is:
rdd.mapPartitions(iter => {
val x = new X()
iter.map(row => x.doSomethingWith(row)) ++ { x.shutdown(); Iterator.empty }
})
Best,
Xiangrui
On Thu, May 29, 2014 at 11:38 PM, Tobias Pfeiffer wrote:
> Hi,
>
> I want to use an object x in my RDD processing as
The documentation you looked at is not official, though it is from
@pwendell's website. It was for the Spark SQL release. Please find the
official documentation here:
http://spark.apache.org/docs/latest/mllib-linear-methods.html#linear-support-vector-machine-svm
It contains a working example show
You are using ec2. Did you specify the spark version when you ran
spark-ec2 script or update /root/spark after the cluster was created?
It is very likely that you are running 0.9 on ec2. -Xiangrui
On Thu, May 29, 2014 at 5:22 PM, jamborta wrote:
> Hi all,
>
> I wanted to try spark 1.0.0, because
Was the error message the same as you posted when you used `root` as
the user id? Could you try this:
1) Do not specify user id. (Default would be `root`.)
2) If it fails in the middle, try `spark-ec2 --resume launch
` to continue launching the cluster.
Best,
Xiangrui
On Thu, May 22, 2014 a
It doesn't guarantee the exact sample size. If you fix the random
seed, it would return the same result every time. -Xiangrui
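For reference, a sketch of the call (fraction and seed are illustrative):

// Roughly 10% of the elements, without replacement; fixing the
// seed makes the result repeatable.
val sampled = vertices.sample(withReplacement = false, fraction = 0.1, seed = 42L)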
On Wed, May 21, 2014 at 2:05 PM, glxc wrote:
> I have a graph and am trying to take a random sample of vertices without
> replacement, using the RDD.sample() method
>
> ve
If the RDD is cached, you can check its storage information in the
Storage tab of the Web UI.
On Wed, May 21, 2014 at 12:31 PM, yxzhao wrote:
> Thanks Xiangrui, How to check and make sure the data is distributed
> evenly? Thanks again.
> On Wed, May 21, 2014 at 2:17 PM, Xiangrui Meng [v
Many OutOfMemoryErrors in the log. Is your data distributed evenly? -Xiangrui
On Wed, May 21, 2014 at 11:23 AM, yxzhao wrote:
> I run the pagerank example processing a large data set, 5GB in size, using 48
> machines. The job got stuck at the time point: 14/05/20 21:32:17, as the
> attached log s
:
> Unfortunately, I don't have a bunch of moderately big xml files; I have one,
> really big file - big enough that reading it into memory as a single string
> is not feasible.
>
>
> On Tue, May 20, 2014 at 1:24 PM, Xiangrui Meng wrote:
>>
>> Try sc.wholeTextFiles(). It
Try sc.wholeTextFiles(). It reads the entire file into a string
record. -Xiangrui
On Tue, May 20, 2014 at 8:25 AM, Nathan Kronenfeld
wrote:
> We are trying to read some large GraphML files to use in spark.
>
> Is there an easy way to read XML-based files like this that accounts for
> partition bo
Actually there is a sliding method implemented in
mllib.rdd.RDDFunctions. Since this is not for general use cases, we
didn't include it in spark-core. You can take a look at the
implementation there and see whether it fits. -Xiangrui
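A sketch of using it (assuming the implicit conversion exported by
mllib.rdd.RDDFunctions):

import org.apache.spark.mllib.rdd.RDDFunctions._

val rdd = sc.parallelize(1 to 10, 2)
val windows = rdd.sliding(3) // windows: [1,2,3], [2,3,4], ...
windows.collect().foreach(w => println(w.mkString(",")))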
On Mon, May 19, 2014 at 10:06 PM, Mohit Jaggi wrote:
> Thanks S
Checkout the master or branch-1.0. Then the examples should be there. -Xiangrui
On Mon, May 19, 2014 at 11:36 AM, yxzhao wrote:
> Thanks Xiangrui,
>
> But I did not find the directory:
> examples/src/main/scala/org/apache/spark/examples/mllib.
> Could you give me more detail or show me one exampl
The classpath seems to be correct. Where did you link libopenblas*.so
to? The safest approach is to rename it to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3 . This is the way I made it work. -Xiangrui
On Sun, May 18, 2014 at 4:49 PM, wxhsdp wrote:
> ok
>
> Spark Executor Command: "java" "-c
Can you attach the slave classpath? -Xiangrui
On Sun, May 18, 2014 at 2:02 AM, wxhsdp wrote:
> Hi, xiangrui
>
> you said "It doesn't work if you put the netlib-native jar inside an
> assembly
> jar. Try to mark it "provided" in the dependencies, and use --jars to
> include them with spark-s
You need to include breeze-natives or netlib:all to load the native
libraries. Check the log messages to ensure native libraries are used,
especially on the worker nodes. The easiest way to use OpenBLAS is
copying the shared library to /usr/lib/libblas.so.3 and
/usr/lib/liblapack.so.3. -Xiangrui
O
Hi Andrew,
I submitted a patch and verified it solves the problem. You can
download the patch from
https://issues.apache.org/jira/browse/HADOOP-10614 .
Best,
Xiangrui
On Fri, May 16, 2014 at 6:48 PM, Xiangrui Meng wrote:
> Hi Andrew,
>
> This is the JIRA I created:
> https://issue
>
> I'm using CDH4.4.0, which I think uses the MapReduce v2 API. The .jars are
> named like this: hadoop-hdfs-2.0.0-cdh4.4.0.jar
>
> I'm also glad you were able to reproduce! Please paste a link to the Hadoop
> bug you file so I can follow along.
>
> Thanks!
> And
It doesn't work if you put the netlib-native jar inside an assembly
jar. Try to mark it "provided" in the dependencies, and use --jars to
include them with spark-submit. -Xiangrui
On Wed, May 14, 2014 at 6:12 PM, wxhsdp wrote:
> Hi, DB
>
> i tried including breeze library by using spark 1.0, it
Hi Andrew,
This is the JIRA I created:
https://issues.apache.org/jira/browse/MAPREDUCE-5893 . Hopefully
someone wants to work on it.
Best,
Xiangrui
On Fri, May 16, 2014 at 6:47 PM, Xiangrui Meng wrote:
> Hi Andre,
>
> I could reproduce the bug with Hadoop 2.2.0. Some older version of
Thu, May 15, 2014 at 3:48 PM, Xiangrui Meng wrote:
> Hi Andrew,
>
> Could you try varying the minPartitions parameter? For example:
>
> val r = sc.textFile("/user/aa/myfile.bz2", 4).count
> val r = sc.textFile("/user/aa/myfile.bz2", 8).count
>
> Best,
>
In Spark's KMeans, if no cluster center moves more than epsilon in
Euclidean distance from previous iteration, the algorithm finishes. No
further iterations are performed. For Mahout, you need to check the
documentation or the code to see what epsilon means there. -Xiangrui
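A sketch of the call in MLlib (k and the iteration cap are illustrative;
convergence by epsilon can stop it earlier):

import org.apache.spark.mllib.clustering.KMeans

val model = new KMeans()
  .setK(10)
  .setMaxIterations(50) // stops earlier once no center moves more than epsilon
  .run(data)            // data: RDD[Vector]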
On Wed, May 14, 2014 at
Hi Andrew,
Could you try varying the minPartitions parameter? For example:
val r = sc.textFile("/user/aa/myfile.bz2", 4).count
val r = sc.textFile("/user/aa/myfile.bz2", 8).count
Best,
Xiangrui
On Tue, May 13, 2014 at 9:08 AM, Xiangrui Meng wrote:
> Which hadoop versi
Are you running Spark or just Breeze? First try breeze-natives locally
with the reference blas library and see whether it works or not. Also,
do not enable multi-threading when you compile OpenBLAS
(USE_THREADS=0). -Xiangrui
On Tue, May 13, 2014 at 2:17 AM, wxhsdp wrote:
> Hi, Xiangrui
>
> i co
If you check out the master branch, there are some examples that can
be used as templates under
examples/src/main/scala/org/apache/spark/examples/mllib
Best,
Xiangrui
On Wed, May 14, 2014 at 1:36 PM, yxzhao wrote:
>
> Hello,
> I found the classfication algorithms SVM and LogisticRegression impl
Could you try `println(result.toDebugString)` right after `val
result = ...` and attach the result? -Xiangrui
On Fri, May 9, 2014 at 8:20 AM, phoenix bai wrote:
> after a couple of tests, I find that, if I use:
>
> val result = model.predict(prdctpairs)
> result.map(x =>
> x.user+","+x.prod
You need
> val raw = sc.sequenceFile(path, classOf[Text], classOf[VectorWritable])
to load the data. After that, you can do
> val data = raw.values.map(_.get)
to get an RDD of Mahout's Vector. You can use `--jars mahout-math.jar`
when you launch spark-shell to include mahout-math.
Best,
Xiangr
It depends on how you want to use the string features. For the day of
the week, you can replace it with 6 binary features indicating
Mon/Tue/Wed/Th/Fri/Sat. -Xiangrui
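A sketch of that encoding (the unlisted day, Sunday, becomes the
all-zeros baseline):

val days = Array("Mon", "Tue", "Wed", "Thu", "Fri", "Sat")

def dayFeatures(day: String): Array[Double] =
  days.map(d => if (d == day) 1.0 else 0.0)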
On Fri, May 9, 2014 at 5:31 AM, praveshjain1991
wrote:
> I have been trying to use LR in Spark's Java API. I used the dataset give
This is a known issue. Please try to reduce the number of iterations
(e.g., <35). -Xiangrui
On Fri, May 9, 2014 at 3:45 AM, phoenix bai wrote:
> Hi all,
>
> My spark code is running on yarn-standalone.
>
> the last three lines of the code as below,
>
> val result = model.predict(prdctpairs)
>
testing
> tomorrow.
>
> Thanks.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Tue, May 13, 2014 at 11:41 PM, Xiangrui Meng wrot
I don't know whether this would fix the problem. In v0.9, you need
`yarn-standalone` instead of `yarn-cluster`.
See
https://github.com/apache/spark/commit/328c73d037c17440c2a91a6c88b4258fbefa0c08
On Tue, May 13, 2014 at 11:36 PM, Xiangrui Meng wrote:
> Does v0.9 support yarn-cluster
Does v0.9 support yarn-cluster mode? I checked SparkContext.scala in
v0.9.1 and didn't see special handling of `yarn-cluster`. -Xiangrui
On Mon, May 12, 2014 at 11:14 AM, DB Tsai wrote:
> We're deploying Spark in yarn-cluster mode (Spark 0.9), and we add jar
> dependencies in command line with "-
Which hadoop version did you use? I'm not sure whether Hadoop v2 fixes
the problem you described, but it does contain several fixes to bzip2
format. -Xiangrui
On Wed, May 7, 2014 at 9:19 PM, Andrew Ash wrote:
> Hi all,
>
> Is anyone reading and writing to .bz2 files stored in HDFS from Spark with
e RDD when it is materialized & it only
> materializes in the end, then it runs out of stack.
>
> Regards
> Mayur
>
> Mayur Rustagi
> Ph: +1 (760) 203 3257
> http://www.sigmoidanalytics.com
> @mayur_rustagi
>
>
>
> On Tue, May 13, 2014 at 11:40 AM, Xiangru
You have a long lineage that causes the StackOverflow error. Try
rdd.checkpoint() and rdd.count() every 20~30 iterations.
checkpoint() can cut the lineage. -Xiangrui
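A sketch of the pattern inside an iterative loop (in Scala for
illustration; the directory, interval, and step function are hypothetical):

sc.setCheckpointDir("hdfs:///tmp/checkpoints") // any reliable storage
var rdd = initialRdd // hypothetical starting RDD
for (i <- 1 to numIterations) {
  rdd = step(rdd) // hypothetical per-iteration transformation
  if (i % 25 == 0) {
    rdd.checkpoint() // mark for checkpointing
    rdd.count()      // an action forces it and truncates the lineage
  }
}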
On Mon, May 12, 2014 at 3:42 PM, Guanhua Yan wrote:
> Dear Sparkers:
>
> I am using Python spark of version 0.9.0 to implement so
Hi Deb, feel free to add accuracy along with precision and recall. -Xiangrui
On Mon, May 12, 2014 at 1:26 PM, Debasish Das wrote:
> Hi,
>
> I see precision and recall but no accuracy in mllib.evaluation.binary.
>
> Is it already under development or it needs to be added ?
>
> Thanks.
> Deb
>
Those are warning messages instead of errors. You need to add
netlib-java:all to use native BLAS/LAPACK. But it won't work if you
include netlib-java:all in an assembly jar. It has to be a separate
jar when you submit your job. For SGD, we only use level-1 BLAS, so I
don't think native code is call
Hi Chieh-Yen,
Great to see the Spark implementation of LIBLINEAR! We will definitely
consider adding a wrapper in MLlib to support it. Is the source code
on github?
Deb, Spark LIBLINEAR uses BSD license, which is compatible with Apache.
Best,
Xiangrui
On Sun, May 11, 2014 at 10:29 AM, Debasish
Hi Diana,
SparkALS is an example implementation of ALS. It doesn't call the ALS
algorithm implemented in MLlib. M, U, and F are used to generate
synthetic data.
I'm updating the examples. In the meantime, you can take a look at the
updated MLlib guide:
http://50.17.120.186:4000/mllib-collaborativ
> cache in HDFS.
>
>
> Sincerely,
>
> DB Tsai
> ---
> My Blog: https://www.dbtsai.com
> LinkedIn: https://www.linkedin.com/in/dbtsai
>
>
> On Sun, Apr 27, 2014 at 7:34 PM, Xiangrui Meng wrote:
>>
>> Eve
partitions and giving driver more ram and see
whether it can help? -Xiangrui
On Sun, Apr 27, 2014 at 3:33 PM, John King wrote:
> I'm already using the SparseVector class.
>
> ~200 labels
>
>
> On Sun, Apr 27, 2014 at 12:26 AM, Xiangrui Meng wrote:
>>
>> How many label
How many labels does your dataset have? -Xiangrui
On Sat, Apr 26, 2014 at 6:03 PM, DB Tsai wrote:
> Which version of mllib are you using? For Spark 1.0, mllib will
> support sparse feature vector which will improve performance a lot
> when computing the distance between points and centroid.
>
> S
entioned in the error have anything to do with it?
>
>
> On Thu, Apr 24, 2014 at 7:54 PM, Xiangrui Meng wrote:
>>
>> I don't see anything wrong with your code. Could you do points.count()
>> to see how many training examples you have? Also, make sure you don
}
>
>val vector = new SparseVector(2357815, indices.toArray,
> featValues.toArray)
>
>return LabeledPoint(values(0).toDouble, vector)
>
> }
>
>
> val data = sc.textFile("data.txt")
>
> val empty = data
Do you mind sharing more code and error messages? The information you
provided is too little to identify the problem. -Xiangrui
On Thu, Apr 24, 2014 at 1:55 PM, John King wrote:
> Last command was:
>
> val model = new NaiveBayes().run(points)
>
>
>
> On Thu, Apr 24, 2014
and mapping. Just
> received this error when trying to classify.
>
>
> On Thu, Apr 24, 2014 at 4:32 PM, Xiangrui Meng wrote:
>>
>> Is your Spark cluster running? Try to start with generating simple
>> RDDs and counting. -Xiangrui
>>
>> On Thu, Apr 24,
The data array in RDD is passed by reference to jblas, so there is no data copying
in this stage. However, if jblas uses the native interface, there is a
copying overhead. I think jblas uses java implementation for at least
Level 1 BLAS, and calling native interface for Level 2 & 3. -Xiangrui
On Thu, Apr 24,
Is your Spark cluster running? Try to start with generating simple
RDDs and counting. -Xiangrui
On Thu, Apr 24, 2014 at 11:38 AM, John King
wrote:
> I receive this error:
>
> Traceback (most recent call last):
>
> File "", line 1, in
>
> File
> "/home/ubuntu/spark-1.0.0-rc2/python/pyspark/ml