Ctrl+Z will suspend the job (if you then do fg/bg you can resume it). You
need to press Ctrl+C to terminate the job!
Thanks
Best Regards
On Wed, Jun 4, 2014 at 10:24 AM, MEETHU MATHEW meethu2...@yahoo.co.in
wrote:
Hi,
I want to know how I can stop a running
Hi,
What is your ZeroMQ version? It is known to work well with 2.2.
The output of `sudo ldconfig -v | grep zmq` would be helpful in this regard.
Thanks
Prashant Sharma
On Wed, Jun 4, 2014 at 11:40 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I am trying to use Spark Streaming (1.0.0)
Hi all,
I've set up a 4-node spark cluster (the nodes are r3.large) with the
spark-ec2 script. I've been trying to run a job on this cluster, and I'm
trying to figure out why I get the following exception:
java.net.SocketException: Connection reset
at
I should add that I'm using spark 0.9.1.
Thanks!
For SSDs in r3, maybe it's better to mount with the `discard` option since
they support TRIM:
What I did for r3.large:
echo '/dev/xvdb /mnt ext4 defaults,noatime,nodiratime,discard 0 0' >> /etc/fstab
mkfs.ext4 /dev/xvdb
mount /dev/xvdb
2014-06-03 19:15 GMT+02:00 Matei Zaharia
what does this exception mean?
14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
java.lang.IllegalArgumentException: requirement failed
at scala.Predef$.require(Predef.scala:221)
at org.apache.spark.mllib.util.MLUtils$.fastSquaredDistance(MLUtils.scala:271)
at
I wanted to know the meaning of the following log message when running a
Spark Streaming job:
[spark-akka.actor.default-dispatcher-18] INFO
org.apache.spark.streaming.scheduler.JobScheduler - Total delay: 5.432 s for
time 1401870454500 ms (execution: 0.593 s)
According to my understanding,
Hi
I'm using Spark 0.9.1 and Shark 0.9.1. My dataset does not fit into the memory
I have in my cluster setup, so I also want to use disk for caching. I guess
MEMORY_ONLY is the default storage level in Spark. If that's the case, how
can I change the storage level to MEMORY_AND_DISK in Spark?
Hi, all
I've observed that sometimes when the executor finishes one task, it will
wait about 5 seconds to get another task to work on. During those 5 seconds,
the executor does nothing: CPU idle, no disk access, no network transfer. Is
that normal for Spark?
thanks!
hi, maillist:
I tried to compile Spark but it failed. Here are my compile command and
compile output:
# SPARK_HADOOP_VERSION=2.0.0-cdh4.4.0 SPARK_YARN=true sbt/sbt assembly
[warn] 18 warnings found
[info] Compiling 53 Scala sources and 1 Java source to
Could you check whether the vectors have the same size? -Xiangrui
On Wed, Jun 4, 2014 at 1:43 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
what does this exception mean?
14/06/04 16:35:15 ERROR executor.Executor: Exception in task ID 6
java.lang.IllegalArgumentException: requirement failed
It's complaining about the native library shipped with ZeroMQ, right?
That message is the JVM complaining about how it was compiled. If so,
I think it's a question for ZeroMQ?
On Wed, Jun 4, 2014 at 7:10 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
I am trying to use Spark Streaming
Just a thought... Are you trying to use the RDD as a Map?
On 3 June 2014 23:14, Doris Xin doris.s@gmail.com wrote:
Hey Amit,
You might want to check out PairRDDFunctions
http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.rdd.PairRDDFunctions.
For your use
Thanks for the reply Akhil.
I created a tar.gz with make-distribution.sh which is accessible
from all the slaves (I checked it using hadoop fs -ls /path/). Also, there
are no worker logs printed in the $SPARK_HOME/work/ directory on the workers
(which are otherwise printed if I run without
I think Mayur meant that Spark doesn't necessarily clean the closure
under Java 7 -- is that true though? I didn't know of an issue there.
Some anonymous class in your (?) OptimisingSort class is getting
serialized, which may be fine and intentional, but it is not
serializable. You haven't posted
I am not sure if it is exposed in the SBT build, but you may need the
equivalent of the 'yarn-alpha' profile from the Maven build. This
older build of CDH predates the newer YARN APIs.
See also
https://groups.google.com/forum/#!msg/spark-users/T1soH67C5M4/CmGYV8kfRkcJ
Or, use a later CDH. In
http://spark.apache.org/docs/latest/running-on-mesos.html#troubleshooting-and-debugging
If you are not able to find the logs in /var/log/mesos,
do check /tmp/mesos/ and you will see your application IDs and everything, just
like in the $SPARK_HOME/work directory.
Thanks
Best Regards
On Wed,
The error is resolved. I was using a comparator which was not serializable,
because of which it was throwing the error.
I have now switched to the Kryo serializer as it is faster than the Java serializer.
I have set the required config:
conf.set("spark.serializer",
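For reference, a minimal sketch of that setting (standard Spark config key and serializer class; the app name is made up):
import org.apache.spark.{SparkConf, SparkContext}
val conf = new SparkConf()
  .setAppName("my-app")  // example name
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
val sc = new SparkContext(conf)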
Hi,
I am a new Spark user. Please let me know how to handle the following scenario:
I have a data set with the following fields:
1. DeviceId
2. latitude
3. longitude
4. ip address
5. Datetime
6. Mobile application name
With the above data, I would like to perform the following steps:
1. Collect all
I had issues around embedded functions; here's what I have figured out. Every
inner class actually contains a field referencing the outer class. The
anonymous class actually has a this$0 field referencing the outer class,
and that is why Spark is trying to serialize the outer class.
In the Scala API, the
static inner classes do not refer to the outer class. Often people
declare them non-static by default when it's unnecessary -- a
Comparator class is typically a great example. Anonymous inner classes
declared inside a method are another example, but there again they can
be refactored into named
It is possible if you use a cartesian product to produce all possible
pairs for each IP address and 2 stages of map-reduce:
- first by pairs of points to find the total of each pair and
- second by IP address to find the pair for each IP address with the
maximum count.
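A rough sketch of those two stages (assuming points is an RDD[(String, (Double, Double))] of (ip, (lat, lon)) records; the names are made up):
import org.apache.spark.SparkContext._
// stage 1: all point pairs per IP (a per-key cartesian via self-join), counted
val pairCounts = points.join(points)
  .map { case (ip, (p1, p2)) => ((ip, p1, p2), 1) }
  .reduceByKey(_ + _)
// stage 2: keep the pair with the maximum count for each IP
val topPairPerIp = pairCounts
  .map { case ((ip, p1, p2), count) => (ip, (p1, p2, count)) }
  .reduceByKey((a, b) => if (a._3 >= b._3) a else b)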
Oleg
On 4 June 2014
Hi,
I am doing a join of two RDDs which is giving different results (when counting the
number of records) each time I run this code on the same input.
The input files are large enough to be divided into two splits. When the program
runs on two workers with a single core assigned to each, the output is consistent
You've got a conflict in the version of Jackson that is being used:
Caused by: java.lang.NoSuchMethodError:
com.fasterxml.jackson.databind.module.SimpleSerializers.init(Ljava/util/List;)V
Looks like you are using Jackson 2.x somewhere, but AFAIK all of the
Hadoop/Spark libs are still on 1.x.
Those aren't the names of the artifacts:
http://search.maven.org/#search%7Cga%7C1%7Ca%3A%22spark-streaming-twitter_2.10%22
The name is spark-streaming-twitter_2.10
On Wed, Jun 4, 2014 at 1:49 PM, Jeremy Lee
unorthodox.engine...@gmail.com wrote:
Man, this has been hard going. Six days, and I
thank you, 孟祥瑞 (Xiangrui)!
with your help I solved the problem.
I had constructed SparseVectors in a wrong way:
the first parameter of the constructor SparseVector(int size, int[] indices,
double[] values)
I had mistaken it for the length of the values array.
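For anyone else hitting this, a small sketch of the distinction using the MLlib factory method:
import org.apache.spark.mllib.linalg.Vectors
// the first argument is the length of the whole vector, not values.length
val v = Vectors.sparse(10, Array(1, 4, 7), Array(0.5, 1.0, 2.0))
// passing 3 (the number of non-zeros) here would later surface as
// "requirement failed" when vector sizes are compared, e.g. in fastSquaredDistance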
2014-06-04
bluejoe2008
From: Xiangrui Meng
Date: 2014-06-04
@Sean, the %% syntax in SBT should automatically add the Scala major
version qualifier (_2.10, _2.11 etc) for you, so that does appear to be
correct syntax for the build.
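For example, a build.sbt line like the following (the version is just an example) resolves to spark-streaming-twitter_2.10:
libraryDependencies += "org.apache.spark" %% "spark-streaming-twitter" % "1.0.0"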
I seemed to run into this issue with some missing Jackson deps, and solved
it by including the jar explicitly on the driver
Ah sorry, this may be the thing I learned for the day. The issue is
that classes from that particular artifact are missing though. Worth
interrogating the resulting .jar file with jar tf to see if it made
it in?
On Wed, Jun 4, 2014 at 2:12 PM, Nick Pentreath nick.pentre...@gmail.com wrote:
hi, folks,
is there any easier way to define a custom RDD in Java?
I am wondering if I have to define a new Java class which extends RDD from
scratch? It is really a hard job for developers!
2014-06-04
bluejoe2008
On Wed, Jun 4, 2014 at 12:31 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Ah, sorry to hear you had more problems. Some thoughts on them:
There will always be more problems, 'tis the nature of coding. :-) I try
not to bother the list until I've smacked my head against them for a few
hours,
I get a very similar stack trace and have no idea what could be causing it
(see below). I've created a SO:
http://stackoverflow.com/questions/24038908/spark-fails-on-big-jobs-with-java-io-ioexception-filesystem-closed
14/06/02 20:44:04 INFO client.AppClient$ClientActor: Executor updated:
Hello All.
I have a newbie question.
We have a use case where a huge amount of data will be coming in streams or
micro-batches of streams, and we want to process these streams according to
some business logic. We don't have to provide extremely low latency
guarantees, but batch M/R will still be
Hi Ajay, would you mind synthesizing a minimal code snippet that can
reproduce this issue and pasting it here?
On Wed, Jun 4, 2014 at 8:32 PM, Ajay Srivastava a_k_srivast...@yahoo.com
wrote:
Hi,
I am doing join of two RDDs which giving different results ( counting
number of records ) each
Hi
I'm trying to run some Spark code on a cluster but I keep running into a
java.io.StreamCorruptedException: invalid type code: AC error. My task
involves analyzing ~50GB of data (some operations involve sorting) and then
writing it out to a JSON file. I'm running the analysis on each of the
data's ~10
I think by default a task can fail up to 4 times before Spark considers the
job a failure. Are you seeing that happen? I believe that is a configurable
thing, but I don't know off the top of my head how to change it.
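If I remember right, the setting is spark.task.maxFailures (default 4), so something like the following should work, but do double-check the key:
val conf = new SparkConf().set("spark.task.maxFailures", "8")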
I've seen this error before when reading data from a large amount of files
on S3, and
On Wed, Jun 4, 2014 at 9:35 AM, Jeremy Lee unorthodox.engine...@gmail.com
wrote:
Oh, I went back to m1.large while those issues get sorted out.
Random side note, Amazon is deprecating the m1 instances in favor of m3
instances, which have SSDs and more ECUs than their m1 counterparts.
Just curious, what do you want your custom RDD to do that the normal ones
don't?
On Wed, Jun 4, 2014 at 6:30 AM, bluejoe2008 bluejoe2...@gmail.com wrote:
hi, folks,
is there any easier way to define a custom RDD in Java?
I am wondering if I have to define a new java class which
On Wed, Jun 4, 2014 at 3:33 PM, Matt Kielo mki...@oculusinfo.com wrote:
Im trying run some spark code on a cluster but I keep running into a
java.io.StreamCorruptedException: invalid type code: AC error. My task
involves analyzing ~50GB of data (some operations involve sorting) then
writing
nilmish,
To confirm your code is using kryo, go to the web ui of your application
(defaults to :4040) and look at the environment tab. If your serializer
settings are there then things should be working properly.
I'm not sure how to confirm that it works against typos in the setting, but
you
You can change the storage level on an individual RDD with
.persist(StorageLevel.MEMORY_AND_DISK), but I don't think you can change
the default persistence level for RDDs.
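A minimal sketch of the per-RDD form (rdd being any existing RDD):
import org.apache.spark.storage.StorageLevel
// keeps partitions in memory and spills the ones that don't fit to disk
val cached = rdd.persist(StorageLevel.MEMORY_AND_DISK)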
Andrew
On Wed, Jun 4, 2014 at 1:52 AM, Salih Kardan karda...@gmail.com wrote:
Hi
I'm using Spark 0.9.1 and Shark
Thanks folks. I was trying to get an RDD[multimap], so collectAsMap is what
I needed.
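In case it helps others, a small sketch of that pattern (assuming an RDD[(String, String)] named rdd):
import org.apache.spark.SparkContext._
// group values per key, then bring the result back to the driver as a Map
val multimap = rdd.groupByKey().collectAsMap()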
Best,
Amit
On Jun 4, 2014, at 6:53, Cheng Lian lian.cs@gmail.com wrote:
On Wed, Jun 4, 2014 at 5:56 AM, Amit Kumar kumarami...@gmail.com wrote:
Hi Folks,
I am new to spark -and this is probably
Yes, RDD as a map of String keys and List of string as values.
Amit
On Jun 4, 2014, at 2:46, Oleg Proudnikov oleg.proudni...@gmail.com wrote:
Just a thought... Are you trying to use use the RDD as a Map?
On 3 June 2014 23:14, Doris Xin doris.s@gmail.com wrote:
Hey Amit,
You
When you group by IP address in step 1 to get this:
(ip1,(lat1,lon1),(lat2,lon2))
(ip2,(lat3,lon3),(lat4,lon4))
How many lat/lon locations do you expect for each IP address? avg and max
are interesting.
Andrew
On Wed, Jun 4, 2014 at 5:29 AM, Oleg Proudnikov
Hi All,
I have experienced some crashing behavior with join in pyspark. When I
attempt a join with 2000 partitions in the result, the join succeeds, but
when I use only 200 partitions in the result, the join fails with the
message Job aborted due to stage failure: Master removed our application:
Since $HADOOP_HOME is deprecated, try adding it to the Mesos configuration
file.
Add `export MESOS_HADOOP_HOME=$HADOOP_HOME` to ~/.bashrc and that should
solve your error.
On Tue, Jun 3, 2014 at 8:46 PM, Marek Wiewiorka marek.wiewio...@gmail.com
wrote:
Hi All,
I've been experiencing a very strange error after upgrading from Spark 0.9
to 1.0 - it seems that the saveAsTextFile function is throwing a
java.lang.UnsupportedOperationException that I have never seen before.
Oh, this would be super useful for us too!
Actually wouldn't it be best if you could see the whole call stack on the
UI, rather than just one line? (Of course you would have to click to expand
it.)
On Wed, Jun 4, 2014 at 2:38 AM, John Salvatier jsalvat...@gmail.com wrote:
Ok, I will probably
I am also getting the exact same error, with the exact same logs, when I run
Spark 1.0.0 in coarse-grained mode.
Coarse-grained mode works perfectly with earlier versions that I tested -
0.9.1 and 0.9.0 - but seems to have undergone some modification in Spark
1.0.0.
Are you using spark-submit to run your application?
On Wed, Jun 4, 2014 at 8:49 AM, ajatix a...@sigmoidanalytics.com wrote:
I am also getting the exact error, with the exact logs when I run Spark
1.0.0
in coarse-grained mode.
Coarse grained mode works perfectly with earlier versions that I
I'm running a manually built cluster on EC2. I have mesos (0.18.2) and hdfs
(2.0.0-cdh4.5.0) installed on all slaves (3) and masters (3). I have
spark-1.0.0 on one master and the executor file is on hdfs for the slaves.
Whenever I try to launch a spark application on the cluster, it starts a
task
Actually, what the stack trace is showing is the result of an exception
being thrown by the DAGScheduler's event processing actor. What happens is
that the Supervisor tries to shut down Spark when an exception is thrown by
that actor. As part of the shutdown procedure, the DAGScheduler tries to
Exactly the same story - it used to work with 0.9.1 and does not work
anymore with 1.0.0.
I ran tests using spark-shell as well as my application (so I tested turning
on coarse mode via an env variable and SparkContext properties explicitly).
M.
2014-06-04 18:12 GMT+02:00 ajatix
Hello Alex
Thanks for the link. Yes, creating a singleton object for logging outside
the code that gets executed on the workers definitely works. The problem
that I am facing, though, is related to configuration of the logger. I don't
see any log messages in the worker logs of the application.
a)
Thanks a lot, sorry for the really late reply! (Didn't have my laptop)
This is working, but it's dreadfully slow and seems to not run in parallel?
On Mon, May 19, 2014 at 2:54 PM, Nick Pentreath nick.pentre...@gmail.com
wrote:
You need to use mapPartitions (or foreachPartition) to instantiate
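A minimal sketch of that per-partition pattern (ExpensiveClient is hypothetical, e.g. a DB or HTTP connection):
val results = rdd.mapPartitions { iter =>
  val client = new ExpensiveClient()  // created once per partition, not once per record
  iter.map(record => client.process(record))
}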
Hey, thanks a lot for reporting this. Do you mind making a JIRA with
the details so we can track it?
- Patrick
On Wed, Jun 4, 2014 at 9:24 AM, Marek Wiewiorka
marek.wiewio...@gmail.com wrote:
Exactly the same story - it used to work with 0.9.1 and does not work
anymore with 1.0.0.
I ran tests
Hey There,
This is only possible in Scala right now. However, this is almost
never needed since the core API is fairly flexible. I have the same
question as Andrew... what are you trying to do with your RDD?
- Patrick
On Wed, Jun 4, 2014 at 7:49 AM, Andrew Ash and...@andrewash.com wrote:
Just
Hey Chirag,
Those init scripts are part of the Cloudera Spark package (they are
not in the Spark project itself) so you might try e-mailing their
support lists directly.
- Patrick
On Wed, Jun 4, 2014 at 7:19 AM, chirag lakhani chirag.lakh...@gmail.com wrote:
I recently spun up an AWS cluster
Spark is already part of the distribution, and the core CDH5 parcel.
You shouldn't need extra steps unless you're doing something special.
It may be that this is the very cause of the error when trying to
install over the existing services.
On Wed, Jun 4, 2014 at 3:19 PM, chirag lakhani
Hey Jeremy,
The issue is that you are using one of the external libraries and
these aren't actually packaged with Spark on the cluster, so you need
to create an uber jar that includes them.
You can look at the example here (I recently did this for a kafka
project and the idea is the same):
I am building Spark by myself and I am using Java 7 to both build and run.
I will try with Java 6.
Thanks,
Suman.
On 6/3/2014 7:18 PM, Matei Zaharia wrote:
What Java version do you have, and how did you get Spark (did you build it
yourself by any chance or download a pre-built one)? If you
I tried building with Java 6 and also tried the pre-built packages. I am
still getting the same error.
It works fine when I run it on a machine with Solaris OS and x86
architecture.
But it does not work with Solaris OS and SPARC architecture.
Any ideas, why this would happen?
Thanks,
Thank you! The regions advice solved the problem for my friend who was getting
the "key pair does not exist" problem. I am still getting the error:
ERROR:boto:400 Bad Request
ERROR:boto:<?xml version="1.0" encoding="UTF-8"?>
<Response><Errors><Error><Code>InvalidParameterValue</Code><Message>Invalid
value 'null'
chmod 600 path/FinalKey.pem
Cheers
k/
On Wed, Jun 4, 2014 at 12:49 PM, Sam Taylor Steyer sste...@stanford.edu
wrote:
Also, once my friend logged in to his cluster he received the error
"Permissions 0644 for 'FinalKey.pem' are too open." This sounds like the
other problem described. How do we
Awesome, that worked. Thank you!
- Original Message -
From: Krishna Sankar ksanka...@gmail.com
To: user@spark.apache.org
Sent: Wednesday, June 4, 2014 12:52:00 PM
Subject: Re: Trouble launching EC2 Cluster with Spark
chmod 600 path/FinalKey.pem
Cheers
k/
On Wed, Jun 4, 2014 at 12:49
Maybe your two workers have different assembly jar files?
I just ran into a similar problem where my spark-shell was using a different
jar file than my workers - I got really confusing results.
On Jun 4, 2014 8:33 AM, Ajay Srivastava a_k_srivast...@yahoo.com wrote:
Hi,
I am doing join of two RDDs
N/M... I wrote a HadoopRDD subclass and appended one env field of the
HadoopPartition to the value in the compute function. It worked pretty well.
Thanks!
On Jun 4, 2014 12:22 AM, Xu (Simon) Chen xche...@gmail.com wrote:
I don't quite get it...
mapPartitionsWithIndex takes a function that maps an
If this isn’t the problem, it would be great if you can post the code for the
program.
Matei
On Jun 4, 2014, at 12:58 PM, Xu (Simon) Chen xche...@gmail.com wrote:
Maybe your two workers have different assembly jar files?
I just ran into a similar problem that my spark-shell is using a
Hello,
I am trying to use Spark in the following scenario:
I have code written for Hadoop and now I am trying to migrate to Spark. The
mappers and reducers are fairly complex, so I wonder if I can reuse the
map() functions I already wrote in Hadoop (Java) and use Spark to chain
them, mixing the Java
Yes, you can write some glue in Spark to call these. Some functions to look at:
- SparkContext.hadoopRDD lets you create an input RDD from an existing JobConf
configured by Hadoop (including InputFormat, paths, etc)
- RDD.mapPartitions lets you operate in all the values on one partition (block)
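A rough sketch of that glue (the input path and the mapper object are made up):
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileInputFormat, JobConf, TextInputFormat}
val jobConf = new JobConf()
FileInputFormat.setInputPaths(jobConf, "hdfs:///data/input")  // example path
val input = sc.hadoopRDD(jobConf, classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
// reuse existing Hadoop-side map logic once per partition (MyMapperLogic is hypothetical)
val mapped = input.mapPartitions(iter => iter.map { case (k, v) => MyMapperLogic.map(k, v) })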
That’s a good idea too, maybe we can change CallSiteInfo to do that.
Matei
On Jun 4, 2014, at 8:44 AM, Daniel Darabos daniel.dara...@lynxanalytics.com
wrote:
Oh, this would be super useful for us too!
Actually wouldn't it be best if you could see the whole call stack on the UI,
rather
In PySpark, the data processed by each reduce task needs to fit in memory
within the Python process, so you should use more tasks to process this
dataset. Data is spilled to disk across tasks.
I’ve created https://issues.apache.org/jira/browse/SPARK-2021 to track this —
it’s something we’ve
Will the broadcast variables be disposed automatically if the context is
stopped, or do I still need to unpersist()?
On Sat, May 31, 2014 at 1:20 PM, Patrick Wendell pwend...@gmail.com wrote:
Hey There,
You can remove an accumulator by just letting it go out of scope and
it will be garbage
Hi,
Just wondering if you can try this:
val obj = sql("select manufacturer, count(*) as examcount from pft
group by manufacturer order by examcount desc")
obj.collect()
obj.queryExecution.executedPlan.executeCollect()
and time the third line alone. It could be that Spark SQL is taking some
time to
All of these are disposed of automatically if you stop the context or exit the
program.
Matei
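A tiny sketch of that lifecycle (rdd is assumed to be an RDD[String]; unpersist() on broadcasts exists as of 1.0):
val lookup = sc.broadcast(Map("a" -> 1, "b" -> 2))
val total = rdd.map(x => lookup.value.getOrElse(x, 0)).reduce(_ + _)
lookup.unpersist()  // explicit release; stopping the context cleans it up too
sc.stop()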
On Jun 4, 2014, at 2:22 PM, Daniel Siegmann daniel.siegm...@velos.io wrote:
Will the broadcast variables be disposed automatically if the context is
stopped, or do I still need to unpersist()?
Hi Matei,
Thanks for the reply and creating the JIRA. I hear what you're saying,
although to be clear I want to still state that it seems like each reduce
task is loading significantly more data than just the records needed for
that task. The workers seem to load all data from each block
I timed the third line and here are the stage timings:
collect at SparkPlan.scala:52 - 0.5 s
mapPartitions at Exchange.scala:58 - 0.7 s
RangePartitioner at Exchange.scala:62 - 0.7 s
RangePartitioner at Exchange.scala:62 - 0.5 s
Hi,
I'm following the directions to run the Cassandra example
“org.apache.spark.examples.CassandraTest” and I get this error:
Exception in thread "main" java.lang.IncompatibleClassChangeError: Found
interface org.apache.hadoop.mapreduce.JobContext, but class was expected
at
It took me a little while to get back to this but it works now!!
I'm invoking the shell like this:
spark-shell --jars target/scala-2.10/spark-etl_2.10-1.0.jar
Once inside, I can invoke a method in my package to run the job.
val result = etl.IP2IncomeJob.job(sc)
On Tue, May 27, 2014 at 8:42
Hey Matei,
Wanted to let you know this issue appears to be fixed in 1.0.0. Great work!
-- Jeremy
Thanks Patrick!
Uberjars. Cool. I'd actually heard of them. And thanks for the link to the
example! I shall work through that today.
I'm still learning sbt and its many options... the last new framework I
learned was node.js, and I think I've been rather spoiled by npm.
At least it's not
I think the problem is that once unpacked in Python, the objects take
considerably more space, as they are stored as Python objects in a Python
dictionary. Take a look at python/pyspark/join.py and combineByKey in
python/pyspark/rdd.py. We should probably try to store these in serialized form.
When I run sbt/sbt assembly, I get the following exception. Is anyone else
experiencing a similar problem?
..
[info] Resolving org.eclipse.jetty.orbit#javax.servlet;3.0.0.v201112011016
...
[info] Updating {file:/Users/Sung/Projects/spark_06_04_14/}assembly...
[info] Resolving
Yes, thanks for updating this old thread! We heard our community's demands and
added support for Java receivers!
TD
On Wed, Jun 4, 2014 at 12:15 PM, lbustelo g...@bustelos.com wrote:
Note that what TD was referring to above is already in 1.0.0
Never mind, it turns out that this is a problem with the Pivotal Hadoop that
we are trying to compile against.
On Wed, Jun 4, 2014 at 4:16 PM, Sung Hwan Chung coded...@cs.stanford.edu
wrote:
When I run sbt/sbt assembly, I get the following exception. Is anyone else
experiencing a similar
So Python is used in many of the Spark ecosystem products, but not in
Streaming at this point. Is there a roadmap to include Python APIs in Spark
Streaming? Any time frame on this?
Thanks!
John
On Thu, May 29, 2014 at 4:19 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Quite a few people ask
We are definitely investigating a Python API for Streaming, but no announced
deadline at this point.
Matei
On Jun 4, 2014, at 5:02 PM, John Omernik j...@omernik.com wrote:
So Python is used in many of the Spark Ecosystem products, but not Streaming
at this point. Is there a roadmap to
Thank you for the response. If it helps at all: I demoed the Spark platform
for our data science team today. The idea of moving code from batch
testing, to Machine Learning systems, GraphX, and then to near-real time
models with streaming was cheered by the team as an efficiency they would
love.
If I compile Spark with CDH4.6 and enable YARN support, can it run on
CDH4.4?
On Wed, Jun 4, 2014 at 5:59 PM, Sean Owen so...@cloudera.com wrote:
I am not sure if it is exposed in the SBT build, but you may need the
equivalent of the 'yarn-alpha' profile from the Maven build. This
older
Hi All,
I am new to Spark and I am trying to run LogisticRegression (with SGD)
using MLlib on a beefy single machine with about 128GB RAM. The dataset has
about 80M rows with only 4 features, so it barely occupies 2GB on disk.
I am running the code using all 8 cores with 20G memory using
I want to use Spark to handle data from non-SQL databases (an RDF triple store,
for example).
However, I am not familiar with Scala,
so I want to know how to create an RdfTriplesRDD quickly.
2014-06-05
bluejoe2008
From: Patrick Wendell
Date: 2014-06-05 01:25
To: user
Subject: Re: is there any
Has anyone tried to use a log4j.xml instead of a log4j.properties with
Spark 0.9.1? I'm trying to run Spark Streaming on YARN and I've set the
environment variable SPARK_LOG4J_CONF to a log4j.xml file instead of a
log4j.properties file, but Spark seems to be using the default
log4j.properties
Are you using the logistic_regression.py in examples/src/main/python or
examples/src/main/python/mllib? The first one is an example of writing logistic
regression by hand and won’t be as efficient as the MLlib one. I suggest trying
the MLlib one.
You may also want to check how many iterations
Ah, is the file gzipped by any chance? We can’t decompress gzipped files in
parallel so they get processed by a single task.
It may also be worth looking at the application UI (http://localhost:4040) to
see 1) whether all the data fits in memory in the Storage tab (maybe it somehow
becomes
I'm still a Spark newbie, but I have a heavy background in languages and
compilers... so take this with a barrel of salt...
Scala, to me, is the heart and soul of Spark. Couldn't work without it.
Procedural languages like Python, Java, and all the rest are lovely when
you have a couple of
hi, all
when my Spark program accessed HDFS files,
an error happened:
Exception in thread "main" org.apache.hadoop.ipc.RemoteException: Server IPC
version 9 cannot communicate with client version 4
it seems the client was trying to connect to hadoop2 via an old Hadoop protocol,
so my question is:
80M by 4 should be about 2.5GB uncompressed. 10 iterations shouldn't
take that long, even on a single executor. Besides what Matei
suggested, could you also verify the executor memory in
http://localhost:4040 in the Executors tab. It is very likely the
executors do not have enough memory. In that
Hi Krishna,
Specifying executor memory in local mode has no effect, because all of
the threads run inside the same JVM. You can either try
--driver-memory 60g or start a standalone server.
Best,
Xiangrui
On Wed, Jun 4, 2014 at 7:28 PM, Xiangrui Meng men...@gmail.com wrote:
80M by 4 should be
Shahab,
Interesting question. A couple of points (based on the information from
your e-mail):
1. One can support the use case in Spark as a set of transformations on
a work-in-progress RDD over a span of time, with the final transformation
outputting to a processed RDD
- Spark Streaming would be
I will try both and get back to you soon!
Thanks for all your help!
Regards,
Krishna
On Wed, Jun 4, 2014 at 7:56 PM, Xiangrui Meng men...@gmail.com wrote:
Hi Krishna,
Specifying executor memory in local mode has no effect, because all of
the threads run inside the same JVM. You can either
Hey Sam,
You mentioned two problems here, did your VPC error message get fixed
or only the key permissions problem?
I noticed we had someone report a similar issue with the VPC stuff a long
time back (but there is no real resolution here):
https://spark-project.atlassian.net/browse/SPARK-1166
If