This is really exciting! Thanks so much for this work, I think you've
guaranteed Pig's continued vitality.
On Wednesday, August 27, 2014, Matei Zaharia matei.zaha...@gmail.com
wrote:
Awesome to hear this, Mayur! Thanks for putting this together.
Matei
On August 27, 2014 at 10:04:12 PM,
Hi Cheng,
You specify extra python files through --py-files. For example:
bin/spark-submit [your other options] --py-files helper.py main_app.py
-Andrew
2014-08-27 22:58 GMT-07:00 Chengi Liu chengi.liu...@gmail.com:
Hi,
I have two files..
main_app.py and helper.py
main_app.py calls
Hi,
I'm using a cluster with 5 nodes that each use 8 cores and 10GB of RAM
Basically I'm creating a dictionary from text, i.e. giving each word that
occurs more than n times in all texts a unique identifier.
The essential part of the code looks like this:
var texts = ctx.sql("SELECT text FROM
I am building (yet another) job server for Spark using Play! framework and
it seems like Play's akka dependency conflicts with Spark's shaded akka
dependency. Using SBT, I can force Play to use akka 2.2.3 (unshaded) but I
haven't been able to figure out how to exclude com.typesafe.akka
I'm working on this patch to visualize stages:
https://github.com/apache/spark/pull/2077
Phuoc Do
On Mon, Aug 4, 2014 at 10:12 PM, Zongheng Yang zonghen...@gmail.com wrote:
I agree that this is definitely useful.
One related project I know of is Sparkling [1] (also see talk at Spark
Hi,
I have an RDD of key-value pairs. Now I want to find the key whose value
has the largest number of elements. How should I do that?
Basically I want to select the key for which the number of items in values
is the largest.
Thank You
If you just want an arbitrary unique id attached to each record in a DStream
(no ordering etc.), then why not generate and attach a UUID to each
record?
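A minimal sketch of that idea (assuming a DStream[String] named stream; the
names are illustrative):

import java.util.UUID
// attach a random UUID to every record: unique, but no ordering guarantees
val tagged = stream.map(record => (UUID.randomUUID().toString, record))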
On Wed, Aug 27, 2014 at 4:18 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
I see an issue here.
If rdd.id is 1000 then rdd.id *
hi guys,
can someone explain or give the stupid user like me a link where i can get
the right usage of sbt and spark in order to run the examples as a
standalone app
I got to the point of running the app by sbt run path-to-the-data but still
get some error because i probably didn't tell the app
got it when I read the class reference
https://spark.apache.org/docs/0.9.1/api/core/index.html#org.apache.spark.SparkConf
conf.setMaster("local[2]")
set the master to local with 2 threads
but still get some warnings and the result (see below) is also not right, i
think
ps: by the way ... first
Hi all,
What is the expression that I should use with the Spark SQL DSL if I need to
retrieve
data with a field in a given set.
For example :
I have the following schema
case class Person(name: String, age: Int)
And I need to do something like :
personTable.where('name in Seq("foo", "bar")) ?
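In the meantime, a plain SQL string works for me as a fallback (a sketch;
assumes a Spark 1.0-era SQLContext and that the table is registered):

case class Person(name: String, age: Int)
import sqlContext.createSchemaRDD
val people = sc.parallelize(Seq(Person("foo", 30), Person("baz", 25)))
people.registerAsTable("people")
sqlContext.sql("SELECT name, age FROM people WHERE name IN ('foo', 'bar')").collect()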
Thanks for the reply. Sorry I could not ask more earlier.
Trying to use a parquet file is not working at all.
case class Rec(name:String,pv:Int)
val sqlContext=new org.apache.spark.sql.SQLContext(sc)
import sqlContext.createSchemaRDD
val
Maybe you can refer to the sliding method of RDD, but right now it's an
mllib-private method.
Look at org.apache.spark.mllib.rdd.RDDFunctions.
2014-08-26 12:59 GMT+08:00 Vida Ha v...@databricks.com:
Can you paste the code? It's unclear to me how/when the out of memory is
occurring without seeing the
Hi,
tried
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests dependency:tree > dep.txt
Attached the dep.txt for your information.
[WARNING]
[WARNING] Some problems were encountered while building the effective settings
[WARNING] Unrecognised tag: 'mirrors' (position: START_TAG
I see 0.98.5 in dep.txt
You should be good to go.
On Thu, Aug 28, 2014 at 3:16 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
tried
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests
dependency:tree > dep.txt
Attached the dep.txt for your
The file is compiling properly but when I try to run the jar file using
spark-submit, it is giving some errors. I am running Spark locally and have
downloaded a pre-built version of Spark named "For Hadoop 2 (HDP2, CDH5)". I
don't know if it is a dependency problem but I don't want to have
Not sure if this makes sense, but maybe it would be nice to have a kind of flag
available within the code that tells me whether I'm running in a normal
situation or during a recovery.
To better explain this, let's consider the following scenario:
I am processing data, let's say from a Kafka streaming, and
Hi,
I tried to start Spark but failed:
$ ./sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to
/mnt/hadoop/spark-1.0.2/sbin/../logs/spark-edhuser-org.apache.spark.deploy.master.Master-1-m133.out
failed to launch org.apache.spark.deploy.master.Master:
Failed to find
fileA=1 2 3 4, one number per line, saved in /sparktest/1/
fileB=3 4 5 6, one number per line, saved in /sparktest/2/
I want to get 3 and 4
var a = sc.textFile("/sparktest/1/").map((_,1))
var b = sc.textFile("/sparktest/2/").map((_,1))
a.filter(param => {b.lookup(param._1).length > 0}).map(_._1).foreach(println)
You need to set HADOOP_HOME. Is Spark officially supposed to work on
Windows or not at this stage? I know the build doesn't quite work there yet.
On Thu, Aug 28, 2014 at 11:37 AM, Hingorani, Vineet
vineet.hingor...@sap.com wrote:
The file is compiling properly but when I try to run the jar file using
On 08/28/2014 07:20 AM, marylucy wrote:
fileA=1 2 3 4, one number per line, saved in /sparktest/1/
fileB=3 4 5 6, one number per line, saved in /sparktest/2/
I want to get 3 and 4
var a = sc.textFile("/sparktest/1/").map((_,1))
var b = sc.textFile("/sparktest/2/").map((_,1))
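A join-based sketch of one way to do this (RDD operations can't be nested, so
calling b.lookup inside a.filter won't work on a cluster):

val a = sc.textFile("/sparktest/1/").map(x => (x, 1))
val b = sc.textFile("/sparktest/2/").map(x => (x, 1))
// join keeps only keys present in both RDDs, i.e. 3 and 4 here
a.join(b).keys.distinct().collect().foreach(println)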
Hi all,
I have an RDD which has values in the format id,date,cost.
I want to group the elements based on the id and date columns and get the sum
of the cost for each group.
Can somebody tell me how to do this?
Thanks & Regards,
Meethu M
It sounds like you are adding the same key to every element, and joining,
in order to accomplish a full cartesian join? I can imagine doing it that
way would blow up somewhere. There is a cartesian() method to do this maybe
more efficiently.
However if your data set is large, this sort of
0.98.2 is not an HBase version, but 0.98.2-hadoop2 is:
http://search.maven.org/#search%7Cgav%7C1%7Cg%3A%22org.apache.hbase%22%20AND%20a%3A%22hbase%22
On Thu, Aug 28, 2014 at 2:54 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
I need to use Spark with HBase 0.98 and tried
For your reference:
val d1 = textFile.map(line => {
val fields = line.split(",")
((fields(0), fields(1)), fields(2).toDouble)
})
val d2 = d1.reduceByKey(_+_)
d2.foreach(println)
2014-08-28 20:04 GMT+08:00 MEETHU MATHEW meethu2...@yahoo.co.in:
Hi all,
I have an RDD
If you mean your values are all a Seq or similar already, then you just
take the top 1 ordered by the size of the value:
rdd.top(1)(Ordering.by(_._2.size))
On Thu, Aug 28, 2014 at 9:34 AM, Deep Pradhan pradhandeep1...@gmail.com
wrote:
Hi,
I have an RDD of key-value pairs. Now I want to find
How can I set HADOOP_HOME if I am running the Spark on my local machine without
anything else? Do I have to install some other pre-built file? I am on Windows
7 and Spark’s official site says that it is available on Windows, I added Java
path in the PATH variable.
Vineet
From: Sean Owen
Can you copy the exact spark-submit command that you are running?
You should be able to run it locally without installing hadoop.
Here is an example on how to run the job locally.
# Run application locally on 8 cores
./bin/spark-submit \
--class org.apache.spark.examples.SparkPi \
--master
Yes, but I think at the moment there is still a dependency on Hadoop even
when not using it. See https://issues.apache.org/jira/browse/SPARK-2356
On Thu, Aug 28, 2014 at 2:14 PM, Guru Medasani gdm...@outlook.com wrote:
Can you copy the exact spark-submit command that you are running?
You
Hi there,
I'm trying to run the JavaSparkPi example on YARN with master = yarn-client but
I have a problem.
It runs smoothly when submitting the application, and the first container for
the Application Master works too.
When the job is starting and there are some tasks to do, I'm getting this warning
on the console (I'm
Can you clarify the scenario:
val ssc = new StreamingContext(sparkConf, Seconds(10))
ssc.checkpoint(checkpointDirectory)
val stream = KafkaUtils.createStream(...)
val wordCounts = lines.flatMap(_.split(" ")).map(x => (x, 1L))
val wordDstream = wordCounts.updateStateByKey[Int](updateFunc)
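(For completeness, updateFunc would have roughly the shape below — a sketch;
note the snippet pairs Long counts with an [Int] state, so the types would
need to agree:)

val updateFunc = (values: Seq[Int], state: Option[Int]) => {
  // fold this batch's counts into the running state for the key
  Some(state.getOrElse(0) + values.sum)
}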
I didn't see that problem.
Did you run this command ?
mvn -Phbase-hadoop2,hadoop-2.4,yarn -Dhadoop.version=2.4.1 -DskipTests
clean package
Here is what I got:
TYus-MacBook-Pro:spark-1.0.2 tyu$ sbin/start-all.sh
starting org.apache.spark.deploy.master.Master, logging to
I've got an RDD where each element is a long string (a whole document). I'm
using pyspark so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:
def count_partitions(id, iterator):
    c = sum(1 for _ in iterator)
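(In Scala, for comparison, this is a one-liner — a sketch, assuming an rdd in
scope:)

val counts = rdd.mapPartitionsWithIndex((id, iter) => Iterator((id, iter.size))).collect()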
A bit of an analogy with a linked list vs. a doubly linked list: it might
introduce overhead in terms of memory usage, but you could use two directed
edges to substitute for the uni-directed edge.
Thanks Sean.
Looks like there is a workaround as per the JIRA
https://issues.apache.org/jira/browse/SPARK-2356 .
http://qnalist.com/questions/4994960/run-spark-unit-test-on-windows-7.
Maybe that's worth a shot?
On Aug 28, 2014, at 8:15 AM, Sean Owen so...@cloudera.com wrote:
Yes, but I
Thank you Sean and Guru for giving the information. Btw I have to put this
line, but where should I add it? In my Scala file, below where the other
'import …' lines are written?
System.setProperty("hadoop.home.dir", "d:\\winutil\\")
Thank you
Vineet
From: Sean Owen [mailto:so...@cloudera.com]
Sent:
You should set this as early as possible in your program, before other
code runs.
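A sketch of the placement (the class and app names here are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object SimpleApp {
  def main(args: Array[String]): Unit = {
    // must run before any Spark/Hadoop code looks up hadoop.home.dir
    System.setProperty("hadoop.home.dir", "d:\\winutil\\")
    val sc = new SparkContext(new SparkConf().setAppName("Simple Project"))
    // ... rest of the application
  }
}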
On Thu, Aug 28, 2014 at 3:27 PM, Hingorani, Vineet
vineet.hingor...@sap.com wrote:
Thank you Sean and Guru for giving the information. Btw I have to put this
line, but where should I add it? In my scala file below
The following error is given when I try to add this line:
[info] Set current project to Simple Project (in build
file:/C:/Users/D062844/Desktop/HandsOnSpark/Install/spark-1.0.2-bin-hadoop2/)
[info] Compiling 1 Scala source to
C:\Users\D062844\Desktop\HandsOnSpark\Install\spark-1.0
Hi All,
Is there any way to change the delimiter from being a comma?
Some of the strings in my data contain commas as well, making it very
difficult to parse the results.
Yadid
Hi all,
Just wondering if there is a way to use logging to print to spark logs some
additional info (similar to debug in scalding).
Thanks,
Breeze author David also has a github project on CUDA binding in scala.
Do you prefer using java or scala?
On Aug 27, 2014 2:05 PM, Frank van Lankvelt f.vanlankv...@onehippo.com
wrote:
you could try looking at ScalaCL[1], it's targeting OpenCL rather than
CUDA, but that might be close
Hi,
I have just tried to apply the patch of SPARK-1297:
https://issues.apache.org/jira/browse/SPARK-1297
There are two files in it, named spark-1297-v2.txt and spark-1297-v4.txt
respectively.
When applying the 2nd one, I got Hunk #1 FAILED at 45
Can you please advise how to fix it in order
I attached patch v5 which corresponds to the pull request.
Please try again.
On Thu, Aug 28, 2014 at 9:50 AM, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
I have just tried to apply the patch of SPARK-1297:
https://issues.apache.org/jira/browse/SPARK-1297
There are two
Hi,
patch -p1 -i spark-1297-v5.txt
can't find file to patch at input line 5
Perhaps you used the wrong -p or --strip option?
The text leading up to this was:
--
|diff --git docs/building-with-maven.md docs/building-with-maven.md
|index 672d0ef..f8bcd2b 100644
|---
Hi
I'd like to announce a couple of updates to the SparkR project. In order to
facilitate better collaboration for new features and development we have a
new mailing list, issue tracker for SparkR.
- The new JIRA is hosted at https://sparkr.atlassian.net/browse/SPARKR/ and
we have migrated all
Hi Folks,
I’d like to find out tips on how to convert the RDDs inside a Spark Streaming
DStream to a set of SchemaRDDs.
My DStream contains JSON data pushed over from Kafka, and I’d like to use
SparkSQL’s JSON import function (i.e. jsonRDD) to register the JSON dataset as
a table, and perform
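Something like this sketch is what I'm picturing (assuming a Spark 1.1-era
API; the stream and table names are illustrative):

import org.apache.spark.sql.SQLContext
val sqlContext = new SQLContext(ssc.sparkContext)

jsonStream.foreachRDD { rdd =>
  val schemaRdd = sqlContext.jsonRDD(rdd) // infer the schema per micro-batch
  schemaRdd.registerTempTable("events")
  sqlContext.sql("SELECT COUNT(*) FROM events").collect().foreach(println)
}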
Hi All,
@Andrew
Thanks for the tips. I just built the master branch of Spark last
night, but am still having problems viewing history through the
standalone UI. I dug into the Spark job events directories as you
suggested, and I see at a minimum 'SPARK_VERSION_1.0.0' and
'EVENT_LOG_1'; for
Hi Yana,
The fact is that the DB writing is happening at the node level and not at the
Spark level. One of the benefits of the distributed computing nature of Spark
is enabling IO distribution as well. For example, it is much faster to have the
nodes write to Cassandra instead of having them all
bq. Spark 1.0.2
For the above release, you can download the pom.xml attached to the JIRA and
place it in the examples directory
I verified that the build against 0.98.4 worked using this command:
mvn -Dhbase.profile=hadoop2 -Phadoop-2.4,yarn -Dhadoop.version=2.4.1
-DskipTests clean package
Patch v5
Sorry for the extremely late reply. It turns out that the same error occurred
when running on yarn. However, I recently updated my project to depend on
cdh5 and the issue I was having disappeared and I am no longer setting the
userClassPathFirst to true.
@bharat-
overall, i've noticed a lot of confusion about how Spark Streaming scales -
as well as how it handles failover and checkpointing, but we can discuss
that separately.
there's actually 2 dimensions to scaling here: receiving and processing.
*Receiving*
receiving can be scaled out by
Hi Ted,
I downloaded pom.xml to examples directory.
It works, thanks!!
Regards
Arthur
[INFO]
[INFO] Reactor Summary:
[INFO]
[INFO] Spark Project Parent POM .. SUCCESS [2.119s]
[INFO] Spark Project
Hi,
I use Hadoop 2.4.1 and HBase 0.98.5 with snappy enabled in both Hadoop and
HBase.
With default setting in Spark 1.0.2, when trying to load a file I got "Class
org.apache.hadoop.io.compress.SnappyCodec not found"
Can you please advise how to enable snappy in Spark?
Regards
Arthur
scala>
Thank you Debasish.
I am fine with either Scala or Java. I would like to get a quick
evaluation on the performance gain, e.g., ALS on GPU. I would like to try
whichever library does the business :)
Best regards,
Wei
-
Wei Tan, PhD
Research Staff Member
IBM T.
Hi,
my check native result:
hadoop checknative
14/08/29 02:54:51 WARN bzip2.Bzip2Factory: Failed to load/initialize
native-bzip2 library system-native, will use pure-Java version
14/08/29 02:54:51 INFO zlib.ZlibFactory: Successfully loaded & initialized
native-zlib library
Native library
hey guys
i still try to get used to compiling and running the example code
why does the run-example script submit the class with an
org.apache.spark.examples in front of the class itself?
probably a stupid question but i would be glad if someone of you explains
by the way.. how was the
I'm not sure if this is the case, but basic monitoring is described here:
https://spark.apache.org/docs/latest/monitoring.html
If it comes to something more sophisticated I was for example able to save
some messages into local logs and view them in YARN UI via http by editing
spark source code
In many cases when I work with Map Reduce, my mapper or my reducer might
take a single value and map it to multiple keys.
The reducer might also take a single key and emit multiple values.
I don't think that functions like flatMap and reduceByKey will work, or are
there tricks I am not aware of?
I was able to recently solve this problem for standalone mode. For this mode,
I did not use a history server. Instead, I set spark.eventLog.dir (in
conf/spark-defaults.conf) to a directory in hdfs (basically this directory
should be in a place that is writable by the master and accessible globally
Hi,
Thanks for the response. I tried to use countByKey. But I am not able to
write the output to console or to a file. Neither collect() nor
saveAsTextFile() work for the Map object that is generated after
countByKey().
val x = sc.textFile(baseFile).map { line =>
val
thanks for the reply.
I was looking for something for the case when it's running outside of the
spark framework. if I declare a SparkContext or an RDD, could that print
some messages in the log?
The problem I have is that if I print something from the scala object that runs
the spark app, it does
Yes, that is an option.
I started with a function of batch time and index to generate the id as a long.
This may be faster than generating a UUID, with the added benefit of sorting
based on time.
- Original Message -
From: Tathagata Das tathagata.das1...@gmail.com
To: Soumitra Kumar
But then if you want to generate ids that are unique across ALL the records
that you are going to see in a stream (which can be potentially infinite),
then you definitely need a number space larger than long :)
TD
On Thu, Aug 28, 2014 at 12:48 PM, Soumitra Kumar kumar.soumi...@gmail.com
wrote:
Hello there,
I've a basic question on the download: which option do I need to
download for a standalone cluster?
I've a private cluster of three machines on Centos. When I click on
download it shows me the following:
Download Spark
The latest release is Spark 1.0.2, released August 5, 2014
Try using "local[n]" with n > 1, instead of "local". Since receivers take up
1 slot, and "local" is basically 1 slot, there is no slot left to process
the data. That's why nothing gets printed.
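For example (a sketch):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// one thread for the receiver, one left over to process the data
val conf = new SparkConf().setMaster("local[2]").setAppName("StreamingApp")
val ssc = new StreamingContext(conf, Seconds(10))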
TD
On Thu, Aug 28, 2014 at 10:28 AM, Verma, Rishi (398J)
rishi.ve...@jpl.nasa.gov wrote:
Hi Folks,
I’d
On Thu, Aug 28, 2014 at 7:00 AM, Rok Roskar rokros...@gmail.com wrote:
I've got an RDD where each element is a long string (a whole document). I'm
using pyspark so some of the handy partition-handling functions aren't
available, and I count the number of elements in each partition with:
def
If you aren't using Hadoop, I don't think it matters which you download.
I'd probably just grab the Hadoop 2 package.
Out of curiosity, what are you using as your data store? I get the
impression most Spark users are using HDFS or something built on top.
On Thu, Aug 28, 2014 at 4:07 PM, Sanjeev
I assume your on-demand calculations are a streaming flow? If your data
aggregated from batch isn't too large, maybe you should just save it to
disk; when your streaming flow starts you can read the aggregations back
from disk and perhaps just broadcast them. Though I guess you'd have to
restart
To emulate a Mapper, flatMap() is exactly what you want. Since it
flattens, it means you return an Iterable of values instead of 1
value. That can be a Collection containing many values, or 1, or 0.
For a reducer, to really reproduce what a Reducer does in Java, I
think you will need groupByKey()
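A sketch of both halves, using word-count shapes (assuming an RDD[String]
named lines):

// "Mapper": each input record may emit 0, 1, or many key-value pairs
val mapped = lines.flatMap(line => line.split("\\s+").map(word => (word, 1)))
// "Reducer": sees one key with all of its values, and may emit several records
val reduced = mapped.groupByKey().flatMap { case (key, values) =>
  Seq((key, values.sum))
}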
Hello Daniel, if you're not using Hadoop then why would you want to grab the
Hadoop package? CDH5 will download all the Hadoop packages and cloudera manager
too. Just curious, what happens if you start spark on an EC2 cluster; what does
it choose for the data store as default?
-Sanjeev
From: Daniel Siegmann
Do you see this error right in the beginning or after running for some time?
The root cause seems to be that somehow your Spark executors got killed,
which killed receivers and caused further errors. Please try to take a look
at the executor logs of the lost executor to find what is the root cause
If you are repartitioning to 8 partitions, and your nodes happen to have at
least 4 cores each, it's possible that all 8 partitions are assigned to only
2 nodes. Try increasing the number of partitions. Also make sure you have
executors (allocated by YARN) running on more than two nodes if you want
great question, wei. this is very important to understand from a
performance perspective. and this extends beyond kinesis - it's for any
streaming source that supports shards/partitions.
i need to do a little research into the internals to confirm my theory.
lemme get back to you!
-chris
Hi,
If I change my etc/hadoop/core-site.xml
from
<property>
<name>io.compression.codecs</name>
<value>
org.apache.hadoop.io.compress.SnappyCodec,
org.apache.hadoop.io.compress.GzipCodec,
org.apache.hadoop.io.compress.DefaultCodec,
Appeared after running for a while. I re-ran the job and this time, it
crashed with:
14/08/29 00:18:50 WARN ReceiverTracker: Error reported by receiver for
stream 0: Error in block pushing thread - java.net.SocketException: Too
many open files
Shouldn't the failed receiver get re-spawned on a
TD - Apologies, didn't realize I was replying to you instead of the list.
What does numPartitions refer to when calling createStream? I read an
earlier thread that seemed to suggest that numPartitions translates to
partitions created on the Spark side?
The comma is just the way the default toString works for Row objects.
Since SchemaRDDs are also RDDs, you can do arbitrary transformations on
the Row objects that are returned.
For example, if you'd rather the delimiter was '|':
sql("SELECT * FROM src").map(_.mkString("|")).collect()
On Thu, Aug
Hi,
I am using a cluster where each node has 16GB (this is the executor memory).
After I complete an MLlib job, the executor tab shows the following:
Memory: 142.6 KB Used (95.5 GB Total)
and individual worker nodes have the Memory Used values as 17.3 KB / 8.6 GB
(this is different for
Hi,
I fixed the issue by copying libsnappy.so to the Java JRE.
Regards
Arthur
On 29 Aug, 2014, at 8:12 am, arthur.hk.c...@gmail.com
arthur.hk.c...@gmail.com wrote:
Hi,
If I change my etc/hadoop/core-site.xml
from
<property>
<name>io.compression.codecs</name>
<value>
hi, guys
I am trying to understand how spark works with the concurrency model. I read
the following from https://spark.apache.org/docs/1.0.2/job-scheduling.html
quote
Inside a given Spark application (SparkContext instance), multiple parallel
jobs can run simultaneously if they were submitted from
Hi,
While using HiveContext.
If the hive table is created as test_datatypes(testbigint bigint, ss bigint), the
select below works fine.
For create table test_datatypes(testbigint bigint, testdec decimal(5,2)):
scala> val dataTypes = hiveContext.hql("select * from test_datatypes")
14/08/28 21:18:44 INFO
Anyone? No customers using streaming at scale?
From: Steve Nunez snu...@hortonworks.com
Date: Wednesday, August 27, 2014 at 9:08
To: user@spark.apache.org user@spark.apache.org
Subject: Reference Accounts Large Node Deployments
All,
Does anyone have specific references to customers,
Hey Sparkies...
I have an odd bug.
I am running Spark 0.9.2 on Amazon EC2 machines as a job (i.e. not in the REPL)
After a bunch of processing, I tell spark to save my rdd to S3 using:
rdd.saveAsSequenceFile(uri,codec)
That line of code hangs. By hang I mean
(a) Spark stages UI shows no update on
Hi list,
Any change on this one? I think I have seen a lot of work being done on this
lately but I am unable to forge a working solution from jira tickets. Any
example would be highly appreciated.
Tomas
Yeah, saveAsTextFile is an RDD-specific method. If you really want to use that
method, just turn the map into an RDD:
`sc.parallelize(x.toSeq).saveAsTextFile(...)`
Reading through the api-docs will present you many more alternate solutions!
Best,
Burak
- Original Message -
From: SK
Each executor reserves some memory for storing RDDs in memory, and
some for executor operations like shuffling. The number you see is
memory reserved for storing RDDs, and defaults to about 0.6 of the
total (spark.storage.memoryFraction).
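For example, to raise it (a sketch; set this before the SparkContext is
created):

val conf = new SparkConf().set("spark.storage.memoryFraction", "0.7")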
On Fri, Aug 29, 2014 at 2:32 AM, SK skrishna...@gmail.com
Hi,
Spark uses by default approximately 60% of the executor heap memory to store
RDDs. That's why you have 8.6GB instead of 16GB. The 95.5 GB is therefore the
sum of all the executors' 8.6 GB of storage memory plus the driver memory.
Best,
Burak
- Original Message -
From: SK skrishna...@gmail.com
To:
I’m currently using the Spark 1.1 branch and have been able to get the Thrift
service up and running. The quick questions were whether I should be able to use
the Thrift service to connect to SparkSQL generated tables and/or Hive tables?
As well, by any chance do we have any documents that
Hello
I'm new to Spark and playing around, but saw the following error. Could
anyone help with it?
Thanks
Gary
scala> c
res15: org.apache.spark.rdd.RDD[String] = FlatMappedRDD[7] at flatMap at
<console>:23
scala> group
res16: org.apache.spark.rdd.RDD[(String, Iterable[String])] =
(Please ignore if duplicated)
Hi,
I use Spark 1.0.2 with Hive 0.13.1
I have already set the hive mysql database to latin1;
mysql:
alter database hive character set latin1;
Spark:
scala> val hiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
scala> hiveContext.hql("create table
Hi,
Thanks for the responses. I understand that the second values in the Memory
Used column for the executors add up to 95.5 GB and the first values add up
to 17.3 KB. If 95.5 GB is the memory used to store the RDDs, then what is
17.3 KB? Is that the memory used for shuffling operations? For
It did. It failed and was respawned 4 times.
In this case, the too many open files error is a sign that you need to increase
the system-wide limit of open files.
Try adding ulimit -n 16000 to your conf/spark-env.sh.
TD
On Thu, Aug 28, 2014 at 5:29 PM, Tim Smith secs...@gmail.com wrote:
Appeared after
Click the Storage tab. You have some (tiny) RDD persisted in memory.
On Fri, Aug 29, 2014 at 5:58 AM, SK skrishna...@gmail.com wrote:
Hi,
Thanks for the responses. I understand that the second values in the Memory
Used column for the executors add up to 95.5 GB and the first values add up
to
Hi,
You can set the settings in conf/spark-env.sh like this:
export SPARK_LIBRARY_PATH=/usr/lib/hadoop/lib/native/
SPARK_JAVA_OPTS+=" -Djava.library.path=$SPARK_LIBRARY_PATH"
SPARK_JAVA_OPTS+=" -Dspark.io.compression.codec=org.apache.spark.io.SnappyCompressionCodec"