You can start with the Zeppelin project http://zeppelin-project.org/;
it currently has support for SparkContext (not streaming). Since the code
is open you can customize it for your use case. Here's a video of it:
https://www.youtube.com/watch?v=_PQbVH_aO5E&feature=youtu.be
Thanks
Best Regards
On
Looks like a configuration issue, can you paste your spark-env.sh on the
worker?
Thanks
Best Regards
On Wed, Oct 1, 2014 at 8:27 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
It would help to turn on debug level logging in log4j and see the logs.
Just looking at the error logs is not
Hi Akhil,
Thanks for your reply.
We are using CDH 5.1.3 and spark configuration is taken care by Cloudera
configuration. Please let me know if you would like to review the
configuration.
Regards,
Riyaz
On Wed, Oct 1, 2014 at 10:10 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Looks like a
In that case, fire up a spark-shell and try the following:
scala> import org.apache.spark.streaming.{Seconds, StreamingContext}
scala> import org.apache.spark.streaming.StreamingContext._
scala> val ssc = new StreamingContext("spark://YOUR-SPARK-MASTER-URI", "Streaming
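(The line above is cut off; presumably it continues with an app name and a
batch interval, something along these lines, values illustrative:)
scala> val ssc = new StreamingContext("spark://YOUR-SPARK-MASTER-URI", "StreamingApp", Seconds(1))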
Hi,
My application is to digest user logs and deduct user quotas. I need to
maintain latest states of user quotas persistently, so that latest user
quotas will not be lost.
I have tried *updateStateByKey* to generate a DStream for user quotas
and called
Hi,
Are there any code examples demonstrating spark streaming applications
which depend on states? That is, last-run *updateStateByKey* results are
used as inputs.
Thanks.
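For what it's worth, a minimal sketch of that pattern with updateStateByKey
(the usageByUser stream and the checkpoint directory are hypothetical):

import org.apache.spark.streaming.StreamingContext._

ssc.checkpoint("hdfs:///tmp/quota-checkpoints")  // stateful ops require a checkpoint dir

// previous state for a user plus this batch's usage becomes the new state
def updateQuota(usage: Seq[Long], state: Option[Long]): Option[Long] =
  Some(state.getOrElse(0L) + usage.sum)

val quotas = usageByUser.updateStateByKey(updateQuota)  // DStream[(String, Long)]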
It's much easier than all this. Spark Streaming gives you a DStream of
RDDs. You want the count for each RDD. DStream.count() gives you
exactly that: a DStream of Longs which are the counts of events in
each mini batch.
On Tue, Sep 30, 2014 at 8:42 PM, Andy Davidson
a...@santacruzintegration.com
Hi,
I have a problem with using accumulators in Spark. As seen on the Spark
website, if you want custom accumulators you can simply extend (with an
object) the AccumulatorParam trait. The problem is that I need to make that
object generic, such as this:
object SeqAccumulatorParam[B] extends
Just realized that, of course, objects can't be generic, but how do I
create a generic AccumulatorParam?
2014-10-01 12:33 GMT+02:00 Johan Stenberg johanstenber...@gmail.com:
Hi,
I have a problem with using accumulators in Spark. As seen on the Spark
website, if you want custom accumulators
If you want a single-machine 'cluster' to try all of these things, you
don't strictly need a distribution, but, it will probably save you a
great deal of time and trouble compared to setting all of this up by
hand.
Naturally I would promote CDH, as it contains Spark and Mahout and
supports them
FYI, in case anybody else has this problem, we switched to Spark 1.1
(outside CDH) and the same Spark application worked first time (once
recompiled with Spark 1.1 libs of course). I assume this is because Spark
1.1 is compiled with Hive.
On 29 September 2014 17:41, Patrick McGloin
Hi, thanks for the reply.
I added the ALS.setIntermediateRDDStorageLevel and it worked well (a little
slow, but it still did the job, and I've run the MF and got all the features).
But even if I persist with DISK_ONLY, the system monitor's memory and swap
history shows that Apache Spark is using
Calling collect on anything is almost always a bad idea. The only
exception is if you are looking to pass that data on to some other system and
never see it again :).
I would say you need to implement outlier detection on the RDD and process it
in Spark itself rather than calling collect on it.
We encountered a problem loading a huge number of small files (hundreds of
thousands of files) from HDFS in Spark. Our jobs failed over time.
This forced us to write our own loader that combines files by means of Hadoop's
CombineFileInputFormat.
It significantly reduced the number of mappers from 10
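For anyone hitting the same wall, a hedged sketch of such a loader using Hadoop
2's CombineTextInputFormat (path and split size illustrative):

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat

val hconf = new Configuration()
hconf.set("mapreduce.input.fileinputformat.split.maxsize", "134217728") // ~128 MB per split
val lines = sc.newAPIHadoopFile("hdfs:///data/small-files",
  classOf[CombineTextInputFormat], classOf[LongWritable], classOf[Text], hconf)
  .map(_._2.toString)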
Hi,
I'm implementing KryoSerializable for my custom class. Here is the class:
public class ImpressionFactsValue implements KryoSerializable {
private int hits;
public ImpressionFactsValue() {
}
public int getHits() {
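For reference, a hedged Scala sketch of the write/read pair such a class needs
(only the hits field is shown, per the snippet):

import com.esotericsoftware.kryo.{Kryo, KryoSerializable}
import com.esotericsoftware.kryo.io.{Input, Output}

class ImpressionFactsValue extends KryoSerializable {
  private var hits: Int = 0
  def getHits: Int = hits
  override def write(kryo: Kryo, output: Output): Unit = output.writeInt(hits)
  override def read(kryo: Kryo, input: Input): Unit = { hits = input.readInt() }
}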
I don't think persist is meant for end-user usage. You might want to call
saveAsTextFiles, for example, if you're saving to the file system as
strings. You can also dump the DStream to a DB -- there are samples on this
list (you'd have to do a combo of foreachRDD and mapPartitions, likely)
On
I don't think you can get a SparkContext inside an RDD function (such as
mapPartitions), but you shouldn't need to. Have you considered returning
the data read from the database from mapPartitions to create a new RDD and
then just save it to a file like normal?
For example:
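(The example was cut off; something along these lines, with the connection
helpers being hypothetical:)

val results = inputRdd.mapPartitions { rows =>
  val conn = openDbConnection()                        // one connection per partition
  val fetched = rows.map(r => lookup(conn, r)).toList  // materialize before closing
  conn.close()
  fetched.iterator
}
results.saveAsTextFile("hdfs:///tmp/db-output")        // then save like any other RDD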
Hi,
We've had some performance issues since switching to 1.1.0, and we finally found
the origin: TorrentBroadcast seems to be very slow in our setting (and it
became the default with 1.1.0).
The logs for a 4 MB variable with TorrentBroadcast (15s):
14/10/01 15:47:13 INFO storage.MemoryStore: Block
Hi,
Sorry, this question may be trivial. I'm new to Spark and GraphX. I need to
create a graph that has different types of nodes (3 types) and edges (4
types). Each type of node and edge has a different list of attributes.
1) How should I build the graph? Should I specify all types of nodes (or
I'll try my best ;-).
1/ you could create an abstract type for the types (1 on top of Vs, 1 other
on top of Es types) then use the subclasses as payload in your VertexRDD or
in your Edge. Regarding storage and files, it doesn't really matter (unless
you want to use the OOTB loading method, thus
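A hedged sketch of option 1 (the type and attribute names are hypothetical):

import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

sealed trait VProp
case class Author(name: String) extends VProp
case class Paper(title: String) extends VProp
case class Venue(city: String) extends VProp

val vertices: RDD[(VertexId, VProp)] = sc.parallelize(Seq(
  (1L, Author("ada")), (2L, Paper("graphs")), (3L, Venue("sf"))))
val edges: RDD[Edge[String]] = sc.parallelize(Seq(Edge(1L, 2L, "wrote")))
val graph: Graph[VProp, String] = Graph(vertices, edges)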
Can't you use a class in place of the object? A class can be generic:
class GenericAccumulator[B] extends AccumulatorParam[Seq[B]] {
  def zero(initial: Seq[B]): Seq[B] = Seq.empty[B]      // assumed zero value
  def addInPlace(a: Seq[B], b: Seq[B]): Seq[B] = a ++ b // assumed merge: concatenation
}
On Wed, Oct 1, 2014 at 3:38 AM, Johan Stenberg johanstenber...@gmail.com
wrote:
Just realized that, of course, objects can't be generic, but how do I
create a
Hi,
I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with
MultipleTextOutputFormat:
outRdd.saveAsNewAPIHadoopFile("/tmp", String.class, String.class,
MultipleTextOutputFormat.class);
but I'm getting this compilation error:
Bound mismatch: The generic method
Hi,
What's the relationship between Spark worker and executor memory settings
in standalone mode? Do they work independently or does the worker cap
executor memory?
Also, is the number of concurrent executors per worker capped by the number
of CPU cores configured for the worker?
Are you trying to do something along the lines of what's described here?
https://issues.apache.org/jira/browse/SPARK-3533
On Wed, Oct 1, 2014 at 10:53 AM, Tomer Benyamini tomer@gmail.com
wrote:
Hi,
I'm trying to write my JavaPairRDD using saveAsNewAPIHadoopFile with
We have a configuration of CDH 5.0, Spark 1.1.0 (standalone) and Hive 0.12.
We are trying to run some real-time analytics from Tableau 8.1 (BI tool) and
one of the dashboards was failing with the error below. Is it something to do
with the aggregate function not being supported by Spark SQL? Any help is
appreciated.
Yes exactly.. so I guess this is still an open request. Any workaround?
On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Are you trying to do something along the lines of what's described here?
https://issues.apache.org/jira/browse/SPARK-3533
On Wed, Oct 1,
I can submit a MapReduce job reading that table, although its processing
rate is also a little slower than I expected, but not as slow as Spark's.
2014-10-01 12:04 GMT+08:00 Ted Yu yuzhih...@gmail.com:
Can you launch a job which exercises TableInputFormat on the same table
without using Spark
Not that I'm aware of. I'm looking for a work-around myself!
On Wed, Oct 1, 2014 at 11:15 AM, Tomer Benyamini tomer@gmail.com
wrote:
Yes exactly.. so I guess this is still an open request. Any workaround?
On Wed, Oct 1, 2014 at 6:04 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
There is this thread on Stack Overflow
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job
about the same topic, which you may find helpful.
On Wed, Oct 1, 2014 at 11:17 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Not that I'm aware of.
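The workaround in that Stack Overflow thread boils down to the old-API
saveAsHadoopFile with a MultipleTextOutputFormat subclass; a Scala sketch
(pairRdd and the output path are hypothetical):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]  // one output file per key
}

pairRdd.saveAsHadoopFile("/tmp/out", classOf[String], classOf[String],
  classOf[RDDMultipleTextOutputFormat])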
If the missing values are 0, then you can also look into implicit
formulation...
On Tue, Sep 30, 2014 at 12:05 PM, Xiangrui Meng men...@gmail.com wrote:
We don't handle missing value imputation in the current version of
MLlib. In future releases, we can store feature information in the
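A minimal sketch of the implicit-feedback variant (ratings and parameters are
illustrative):

import org.apache.spark.mllib.recommendation.{ALS, Rating}

val ratings = sc.parallelize(Seq(Rating(1, 1, 1.0), Rating(1, 2, 3.0), Rating(2, 2, 2.0)))
val model = ALS.trainImplicit(ratings, 10, 10, 0.01, 1.0) // rank, iterations, lambda, alpha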
Hi all,
I cannot figure out why this command is not setting the driver memory (it is
setting the executor memory):
conf = (SparkConf()
        .setMaster("yarn-client")
        .setAppName("test")
        .set("spark.driver.memory", "1G")
Using TableInputFormat is not the fastest way of reading data from HBase.
Do not expect 100s of Mb per sec. You probably should take a look at M/R
over HBase snapshots.
https://issues.apache.org/jira/browse/HBASE-8369
-Vladimir Rodionov
On Wed, Oct 1, 2014 at 8:17 AM, Tao Xiao
You can't set the driver memory programmatically in client mode. In
that mode, the same JVM is running the driver, so you can't modify
command line options anymore when initializing the SparkContext.
(And you can't really start cluster mode apps that way, so the only
way to set this is through
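(Presumably through the command line; e.g., with spark-submit, script name
hypothetical:)
spark-submit --master yarn-client --driver-memory 1g my_app.py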
As far as I know, that feature is not in CDH 5.0.0
FYI
On Wed, Oct 1, 2014 at 9:34 AM, Vladimir Rodionov
vrodio...@splicemachine.com wrote:
Using TableInputFormat is not the fastest way of reading data from HBase.
Do not expect 100s of Mb per sec. You probably should take a look at M/R
over
thanks Marcelo.
What's the reason it is not possible in cluster mode, either?
On Wed, Oct 1, 2014 at 5:42 PM, Marcelo Vanzin van...@cloudera.com wrote:
You can't set the driver memory programmatically in client mode. In
that mode, the same JVM is running the driver, so you can't modify
CC user@ for indexing.
Glad you fixed it. All source code for these examples are under
SPARK_HOME/examples. For example, the converters used here are in
examples/src/main/scala/org/apache/spark/examples/pythonconverters/HBaseConverters.scala
Btw, you may find our blog post useful.
ALS still needs to load and deserialize the in/out blocks (one by one)
from disk and then construct the least squares subproblems. This all happens
in RAM. The final model is also stored in memory. -Xiangrui
On Wed, Oct 1, 2014 at 4:36 AM, Alex T chiorts...@gmail.com wrote:
Hi, thanks for the reply.
I
Because that's not how you launch apps in cluster mode; you have to do
it through the command line, or by calling directly the respective
backend code to launch it.
(That being said, it would be nice to have a programmatic way of
launching apps that handled all this - this has been brought up in
When you say "respective backend code to launch it", I thought this is
the way to do that.
thanks,
Tamas
On Wed, Oct 1, 2014 at 6:13 PM, Marcelo Vanzin van...@cloudera.com wrote:
Because that's not how you launch apps in cluster mode; you have to do
it through the command line, or by calling
No, you can't instantiate a SparkContext to start apps in cluster mode.
For Yarn, for example, you'd have to call directly into
org.apache.spark.deploy.yarn.Client; that class will tell the Yarn
cluster to launch the driver for you and then instantiate the
SparkContext.
On Wed, Oct 1, 2014 at
Hi Tamas,
Yes, Marcelo is right. The reason it doesn't make sense to set
spark.driver.memory in your SparkConf is that your application code,
by definition, *is* the driver. This means by the time you get to the code
that initializes your SparkConf, your driver JVM has already started with
Hi Sean
I guess I am missing something.
JavaDStream<String> foo =
JavaDStream<Long> c = foo.count()
This is circular. I need to get the count as an actual scalar value, not a
JavaDStream. Someone else posted pseudo-code that used foreachRDD(). This
seems to work for me.
Thanks
Andy
From:
Thank you for the replies. It makes sense for scala/java, but in python the
JVM is launched when the spark context is initialised, so it should be able
to set it, I assume.
On 1 Oct 2014 18:25, Andrew Or-2 [via Apache Spark User List]
ml-node+s1001560n15510...@n3.nabble.com wrote:
Hi Tamas,
On Tue, Sep 30, 2014 at 10:14 PM, Rick Richardson
rick.richard...@gmail.com wrote:
I am experiencing significant logging spam when running PySpark in IPython
Notebook
Exhibit A: http://i.imgur.com/BDP0R2U.png
I have taken into consideration advice from:
Hi
I am new to Spark Streaming. Can I think of JavaDStream foreachRDD() as
being 'for each mini batch'? The java doc does not say much about this
function.
Here is the background. I am writing a little test program to figure out how
to use streams. At some point I wanted to calculate an
Yes, that makes sense. It's similar to the all-reduce pattern in VW.
On Wednesday, 1 October 2014, Matei Zaharia matei.zaha...@gmail.com wrote:
Some of the MLlib algorithms do tree reduction in 1.1:
http://databricks.com/blog/2014/09/22/spark-1-1-mllib-performance-improvements.html.
You can
Yes, foreachRDD will run your 'something' for each RDD, which is what you
get for each mini-batch of input.
The operations you express on a DStream (or JavaDStream) are all,
really, for each RDD, including print(). Logging is a little harder
to reason about since the logging will happen on a
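A minimal sketch of pulling the per-batch count out as a plain value (the
stream name is hypothetical):

stream.count().foreachRDD { rdd =>
  // each RDD in the counted DStream holds a single Long: this batch's event count
  rdd.collect().foreach(n => println("events in this batch: " + n))
}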
Hi,
I found the answer to my problem, and I'm just writing to keep it as KB.
Turns out the problem wasn't related to S3 performance; it was because my SOURCE
was not fast enough. Due to the lazy nature of Spark, what I saw on the dashboard
was saveAsTextFile at FacebookProcessor.scala:46 instead of the
Thanks for your reply. Unfortunately changing the log4j.properties within
SPARK_HOME/conf has no effect on pyspark for me. When I change it in the
master or workers the log changes have the desired effect, but pyspark
seems to ignore them. I have changed the levels to WARN, changed the
appender
On Wed, Oct 1, 2014 at 11:33 AM, Akshat Aranya aara...@gmail.com wrote:
On Wed, Oct 1, 2014 at 11:00 AM, Boromir Widas vcsub...@gmail.com wrote:
1. Worker memory caps executor memory.
2. With the default config, every job gets one executor per worker. This
executor runs with all cores available to
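For reference, a hedged sketch of the relevant standalone-mode knobs (values
illustrative):
# in spark-env.sh on each worker:
SPARK_WORKER_MEMORY=16g   # caps the memory executors may claim on this worker
SPARK_WORKER_CORES=8      # caps the cores handed out to executors
# per application (spark-defaults.conf or SparkConf):
spark.executor.memory 4g  # must fit under SPARK_WORKER_MEMORY
spark.cores.max 16        # total cores this app takes across the cluster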
Thanks, Xiangrui and Debashish for your input.
Date: Wed, 1 Oct 2014 08:35:51 -0700
Subject: Re: MLLib: Missing value imputation
From: debasish.da...@gmail.com
To: men...@gmail.com
CC: ssti...@live.com; user@spark.apache.org
If the missing values are 0, then you can also look into implicit
How do you set up IPython to access pyspark in a notebook?
I did the following, and it worked for me:
$ export SPARK_HOME=/opt/spark-1.1.0/
$ export
PYTHONPATH=/opt/spark-1.1.0/python:/opt/spark-1.1.0/python/lib/py4j-0.8.2.1-src.zip
$ ipython notebook
All the logging will go into console (not in
Hi Sean
Many many thanks. This really clears a lot up for me
Andy
From: Sean Owen so...@cloudera.com
Date: Wednesday, October 1, 2014 at 11:27 AM
To: Andrew Davidson a...@santacruzintegration.com
Cc: user@spark.apache.org user@spark.apache.org
Subject: Re: can I think of JavaDStream
I was starting PySpark as a profile within IPython Notebook as per:
http://blog.cloudera.com/blog/2014/08/how-to-use-ipython-notebook-with-apache-spark/
My setup looks like:
import os
import sys
spark_home = os.environ.get('SPARK_HOME', None)
if not spark_home:
    raise ValueError('SPARK_HOME
Here is the other relevant bit of my set-up:
MASTER=spark://sparkmaster:7077
IPYTHON_OPTS="notebook --pylab inline --ip=0.0.0.0"
CASSANDRA_NODES="cassandra1|cassandra2|cassandra3"
PYSPARK_SUBMIT_ARGS="--master $MASTER --deploy-mode client --num-executors 6 --executor-memory 1g --executor-cores 1"
One indirect way to control the number of cores used in an executor is to
set spark.cores.max and set spark.deploy.spreadOut to be true. The
scheduler in the standalone cluster then assigns roughly the same number of
cores (spark.cores.max/number of worker nodes) to each executor for an
Hi,
I need to monitor some aspects of my cluster, like network and resources.
Ganglia looks like a good option for what I need.
Then, I found out that Spark has support to Ganglia.
On the Spark monitoring webpage there is this information:
To install the GangliaSink you’ll need to perform a
I'll run this test and reply with the result afterwards.
Thank you Marcelo.
Hi,
After reading some previous posts about this issue, I have increased the
java heap space to -Xms64g -Xmx64g, but still met the
java.lang.OutOfMemoryError: GC overhead limit exceeded error. Does anyone
have other suggestions?
I am reading 200 GB of data and my total memory is 120 GB, so I
Hi
How many nodes are in your cluster? It seems to me 64g does not help if each of
your nodes doesn't have that much memory.
Liquan
On Wed, Oct 1, 2014 at 1:37 PM, anny9699 anny9...@gmail.com wrote:
Hi,
After reading some previous posts about this issue, I have increased the
java heap space to
well, sort of! we make input/output formats (cascading taps, scalding
sources) available in spark, and we ported the scalding fields api to
spark. so it's for those of us that have a serious investment in
cascading/scalding and want to leverage that in spark.
blog is here:
Hi Liquan,
I have 8 workers, each with 15.7GB memory.
What you said makes sense, but if I don't increase heap space, it keeps
telling me GC overhead limit exceeded.
Thanks!
Anny
On Wed, Oct 1, 2014 at 1:41 PM, Liquan Pei [via Apache Spark User List]
ml-node+s1001560n1554...@n3.nabble.com
Pretty cool, thanks for sharing this! I've added a link to it on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects.
Matei
On Oct 1, 2014, at 1:41 PM, Koert Kuipers ko...@tresata.com wrote:
well, sort of! we make input/output formats (cascading taps,
Thank you Zhang!
I am grateful for your help!
2014-10-01 14:05 GMT-03:00 Kan Zhang kzh...@apache.org:
CC user@ for indexing.
Glad you fixed it. All source code for these examples are under
SPARK_HOME/examples. For example, the converters used here are in
I found the problem. I was manually constructing the CLASSPATH and
SPARK_CLASSPATH because I needed jars for running the cassandra lib.
For some reason that I cannot explain, it was this that was causing the
issue. Maybe one of the jars had a log4j.properties rolled up in it?
I removed almost
I'm trying to understand the intuition behind the featurize method that
Aaron used in one of his demos. I believe this feature will just work for
detecting the character set (i.e., language used).
Can someone help?
def featurize(s: String): Vector = {
val n = 1000
val result = new
Thanks Sean. This is how I set this memory. I set it when I start to run
the job
java -Xms64g -Xmx64g -cp
/root/spark/lib/spark-assembly-1.0.0-hadoop1.0.4.jar:/root/scala/lib/scala-library.jar:./target/MyProject.jar
MyClass
Is there some problem with it?
On Wed, Oct 1, 2014 at 2:03 PM, Sean
Can you use spark-submit to set the executor memory? Take a look at
https://spark.apache.org/docs/latest/submitting-applications.html.
Liquan
On Wed, Oct 1, 2014 at 2:21 PM, 陈韵竹 anny9...@gmail.com wrote:
Thanks Sean. This is how I set this memory. I set it when I start to run
the job
Hi,
I want to implement an RDD wherein the decision on the number of partitions is
based on the number of executors that have been set up. Is there some way I
can determine the number of executors within the getPartitions() call?
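A hedged sketch of one way to approximate the executor count up front when
building the RDD (Spark 1.x; getExecutorStorageStatus includes the driver,
hence the -1):

val numExecutors = sc.getExecutorStorageStatus.length - 1
val rdd = sc.parallelize(1 to 1000000, numExecutors * 4) // e.g. 4 partitions per executor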
thanks
On Wed, Oct 1, 2014 at 4:56 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Pretty cool, thanks for sharing this! I've added a link to it on the wiki:
https://cwiki.apache.org/confluence/display/SPARK/Supplemental+Spark+Projects.
Matei
On Oct 1, 2014, at 1:41 PM, Koert Kuipers
The program computes hashed bigram frequencies normalized by the total number
of bigrams, then filters out zero values. Hashing is an effective trick for
vectorizing features. Take a look at
http://en.wikipedia.org/wiki/Feature_hashing
Liquan
On Wed, Oct 1, 2014 at 2:18 PM, Soumya Simanta
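For reference, a hedged reconstruction of the truncated featurize method along
those lines (dimension and normalization per the description above):

import org.apache.spark.mllib.linalg.{Vector, Vectors}

def featurize(s: String): Vector = {
  val n = 1000
  val bigrams = s.sliding(2).toSeq
  val counts = new Array[Double](n)
  for (b <- bigrams) counts(math.abs(b.hashCode % n)) += 1.0 / bigrams.size
  Vectors.sparse(n, counts.zipWithIndex.filter(_._1 != 0.0).map(_.swap))
}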
Hi,
It appears that the step size is so high that the model is diverging with the
added noise.
Could you try by setting the step size to be 0.1 or 0.01?
Best,
Burak
- Original Message -
From: Krishna Sankar ksanka...@gmail.com
To: user@spark.apache.org
Sent: Wednesday, October 1,
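A minimal sketch of dialing the step size down, assuming the model in question
is LinearRegressionWithSGD (the training data is illustrative):

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.{LabeledPoint, LinearRegressionWithSGD}

val training = sc.parallelize(Seq(
  LabeledPoint(1.0, Vectors.dense(0.5)),
  LabeledPoint(2.0, Vectors.dense(1.1))))
val model = LinearRegressionWithSGD.train(training, 100, 0.01) // numIterations, stepSize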
Hi
We were using Horton 2.4.1 as our Hadoop distribution and have now switched to MapR.
Previously, to read a text file we would use:
test = sc.textFile("hdfs://10.48.101.111:8020/user/hdfs/test")
What would be the equivalent for MapR?
Best Regards
Santosh
Forgot to mention that I've tested that SerIntWritable and
PipelineDocumentWritable are serializable by serializing /
deserializing to/from a byte array in memory.
On Wed, Oct 1, 2014 at 1:43 PM, Timothy Potter thelabd...@gmail.com wrote:
I'm running into the following deserialization issue when
Thanks Burak. Step size 0.01 worked for b) and step=0.0001 for c) !
Cheers
k/
On Wed, Oct 1, 2014 at 3:00 PM, Burak Yavuz bya...@stanford.edu wrote:
Hi,
It appears that the step size is too high that the model is diverging with
the added noise.
Could you try by setting the step size to
There is doc on MapR:
http://doc.mapr.com/display/MapR/Accessing+MapR-FS+in+Java+Applications
-Vladimir Rodionov
On Wed, Oct 1, 2014 at 3:00 PM, Addanki, Santosh Kumar
santosh.kumar.adda...@sap.com wrote:
Hi
We were using Horton 2.4.1 as our Hadoop distribution and now switched to
MapR
Hi
We would like to do this in the PySpark environment,
i.e. something like:
test = sc.textFile("maprfs:///user/root/test") or
test = sc.textFile("hdfs:///user/root/test") or
Currently when we try
test = sc.textFile("maprfs:///user/root/test")
It throws the error
No File-System for scheme: maprfs
Best
Excellent! Thanks Andy. I will give it a go.
On Thu, Oct 2, 2014 at 12:42 AM, andy petrella [via Apache Spark User List]
ml-node+s1001560n15487...@n3.nabble.com wrote:
I'll try my best ;-).
1/ you could create a abstract type for the types (1 on top of Vs, 1 other
on top of Es types) than
Yes.. you should use maprfs://
I personally haven't used pyspark, I just used scala shell or standalone
with MapR.
I think you need to set the classpath right, adding a jar like
/opt/mapr/hadoop/hadoop-0.20.2/lib/hadoop-0.20.2-dev-core.jar to the
classpath.
Sungwook
On Wed, Oct 1,
It should just work in PySpark, the same way it does in Java / Scala apps.
Matei
On Oct 1, 2014, at 4:12 PM, Sungwook Yoon sy...@maprtech.com wrote:
Yes.. you should use maprfs://
I personally haven't used pyspark, I just used scala shell or standalone with
MapR.
I think you need to
hey guys
I am using spark 1.0.0+cdh5.1.0+41
When two users try to run spark-shell, the first user's spark-shell shows
active in the 18080 Web UI but the second user shows WAITING, and the shell
has a bunch of errors but does get to the spark-shell prompt; sc.master seems
to point to the correct master.
hey guys
Is there a way to run Spark in local mode from within Eclipse? I am running
Eclipse Kepler on a MacBook Pro with Mavericks. Like one can run Hadoop
map/reduce applications from within Eclipse and debug and learn.
thanks
sanjay
Cycling bits:
http://search-hadoop.com/m/JW1q5wxkXH/spark+eclipse&subj=Buidling+spark+in+Eclipse+Kepler
On Wed, Oct 1, 2014 at 4:35 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.invalid wrote:
hey guys
Is there a way to run Spark in local mode from within Eclipse.
I am running Eclipse
Hello Sanjay,
This can be done, and is a very effective way to debug.
1) Compile and package your project to get a fat jar
2) In your SparkConf use setJars and give the location of this jar (see the
sketch after this list). Also set your master here as local in SparkConf
3) Use this SparkConf when creating JavaSparkContext
4) Debug
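A hedged sketch of steps 2 and 3 (the jar path and app name are hypothetical):

import org.apache.spark.SparkConf
import org.apache.spark.api.java.JavaSparkContext

val conf = new SparkConf()
  .setMaster("local[*]")                      // run in-process so Eclipse can debug
  .setAppName("eclipse-debug")
  .setJars(Seq("target/my-app-assembly.jar")) // the fat jar from step 1
val jsc = new JavaSparkContext(conf)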
Parquet format seems to be comparatively better for analytic loads; it has
performance and compression benefits for large analytic workloads.
A workaround could be to use a long datatype to store the epoch timestamp value.
If you already have existing parquet files (Impala tables) then you may need
to
As Sungwook said, the classpath pointing to the mapr jar is the key for
that error.
MapR has a Spark install that hopefully makes it easier. I don't have the
instructions handy but you can asking their support about it.
-Suren
On Wed, Oct 1, 2014 at 7:18 PM, Matei Zaharia
You need to set --total-executor-cores to limit how many total cores it grabs
on the cluster. --executor-cores is just for each individual executor, but it
will try to launch many of them.
Matei
On Oct 1, 2014, at 4:29 PM, Sanjay Subramanian
sanjaysubraman...@yahoo.com.INVALID wrote:
hey
Hi,
I am using Spark v1.1.0. The default value of spark.cleaner.ttl is infinite
as per the online docs. Since a lot of shuffle files are generated in
/tmp/spark-local* and the disk is running out of space, we tested with a
smaller value of ttl. However, even when the job has completed and the timer
Awesome, thanks a TON. It works!
There is a clash in the UI port initially, but it looks like it creates a second
UI at port 4041 for the second user wanting to use the spark-shell:
14/10/01 17:34:38 INFO JettyUtils: Failed to create UI at port, 4040. Trying again.
14/10/01 17:34:38 INFO JettyUtils:
A number of the problems I want to work with generate datasets which are
too large to hold in memory. This becomes an issue when building a
FlatMapFunction and also when the data used in combineByKey cannot be held
in memory.
The following is a simple, if a little silly, example of a
Hi Everyone,
I'm working on training MLlib's Naive Bayes to classify TF/IDF vectorized
docs using Spark 1.1.0.
I've gotten this to work fine on a smaller set of data, but when I increase
the number of vectorized documents I get hung up on training. The only
messages I'm seeing are below. I'm
You are likely running into SPARK-3708
https://issues.apache.org/jira/browse/SPARK-3708, which was fixed by #2594
https://github.com/apache/spark/pull/2594 this morning.
On Wed, Oct 1, 2014 at 8:09 AM, tonsat ton...@gmail.com wrote:
We have a configuration of CDH 5.0, Spark 1.1.0 (standalone) and
Yes, the bigram in that demo only has two characters, which could
separate different character sets. -Xiangrui
On Wed, Oct 1, 2014 at 2:54 PM, Liquan Pei liquan...@gmail.com wrote:
The program computes hashing bi-gram frequency normalized by total number of
bigrams then filter out zero values.
The cost depends on the feature dimension, number of instances, number
of classes, and number of partitions. Do you mind sharing those
numbers? -Xiangrui
On Wed, Oct 1, 2014 at 6:31 PM, Mike Bernico mike.bern...@gmail.com wrote:
Hi Everyone,
I'm working on training mllib's Naive Bayes to
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0.
-Xiangrui
On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain ji...@sellpoints.com wrote:
So I am trying to print the model output from MLlib however I am only
getting things like the following:
Yeah I'm using 1.0.0 and thanks for taking the time to check!
Sent from my iPhone
On Oct 1, 2014, at 8:48 PM, Xiangrui Meng men...@gmail.com wrote:
Which Spark version are you using? It works in 1.1.0 but not in 1.0.0.
-Xiangrui
On Wed, Oct 1, 2014 at 2:13 PM, Jimmy McErlain
Hi,
I am using a custom partitioner to partition my JavaPairRDD where the key is a
String.
I use the hashCode of a sub-string of the key to derive the partition index,
but I have noticed that my partitions contain keys with a different
partition index than the one returned by the partitioner.
Another issue I am
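A common cause of keys landing in unexpected partitions is a negative hashCode;
getPartition must return a value in [0, numPartitions). A hedged sketch with the
usual normalization (the prefix length is hypothetical):

import org.apache.spark.Partitioner

class SubstringPartitioner(val numPartitions: Int) extends Partitioner {
  override def getPartition(key: Any): Int = {
    val h = key.asInstanceOf[String].take(8).hashCode
    ((h % numPartitions) + numPartitions) % numPartitions // keep the index non-negative
  }
  override def equals(other: Any): Boolean = other match {
    case p: SubstringPartitioner => p.numPartitions == numPartitions
    case _ => false
  }
}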