How large is your graph, and how much memory does your cluster have?
We don't have a good way to determine the *optimal* number of partitions
aside from trial and error, but to get the job to at least run to
completion, it might help to use the MEMORY_AND_DISK storage level and a
large number of partitions.
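For illustration, a minimal sketch of that suggestion, assuming a recent GraphX (1.1+) where edgeListFile accepts storage-level arguments; the path and partition count below are made up and would need tuning:

    import org.apache.spark.graphx.GraphLoader
    import org.apache.spark.storage.StorageLevel

    // Spill edges and vertices to disk when they don't fit in memory,
    // and split the edge list into many partitions.
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///path/to/edge-list",
      numEdgePartitions = 3200,   // hypothetical value; tune by trial and error
      edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,
      vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)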
Hello,
We have a graph with 100B edges, nearly 800GB in gz format.
We have 80 machines, each with 60GB of memory.
I have never seen the program run to completion.
Alcaid
2014-11-02 14:06 GMT+08:00 Ankur Dave ankurd...@gmail.com:
How large is your graph, and how much memory does your
None of your tuning will help here because the problem is actually the way
you are saving the output. If you take a look at the stack trace, it is
trying to build a single string that is too large for the VM to allocate.
The VM is not actually running out of memory; rather, the JVM cannot
allocate a single string that large.
Just a wild guess, but I had to exclude “javax.servlet.servlet-api” from my
Hadoop dependencies to run a SparkContext.
In your build.sbt:
"org.apache.hadoop" % "hadoop-common" % "..." exclude("javax.servlet", "servlet-api"),
"org.apache.hadoop" % "hadoop-hdfs" % "..." exclude("javax.servlet", "servlet-api")
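For context, a sketch of how those fragments might sit in a full build.sbt stanza (the Hadoop version here is only an example):

    libraryDependencies ++= Seq(
      "org.apache.hadoop" % "hadoop-common" % "2.4.0" exclude("javax.servlet", "servlet-api"),
      "org.apache.hadoop" % "hadoop-hdfs"   % "2.4.0" exclude("javax.servlet", "servlet-api")
    )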
Hi,
I am using Spark on YARN, specifically Spark in Python. I am trying to run:
myrdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
myrdd.getNumPartitions()
Unfortunately it seems that Spark tries to load everything into RAM, or at least
after a while of running this everything slows down and
Thank you, I would expect it to work as you write, but I am probably
experiencing it working the other way. It now seems that Spark is generally
trying to fit everything into RAM. I run Spark on YARN and I have wrapped this into
another question:
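As a side note on the sc.textFile call above, a small Scala sketch of two things worth checking (the path is the hypothetical one from the question): textFile is lazy, and an explicit MEMORY_AND_DISK persist lets partitions spill to disk rather than having to fit entirely in RAM.

    import org.apache.spark.storage.StorageLevel

    val rdd = sc.textFile("s3n://mybucket/files/*/*/*.json")
    rdd.persist(StorageLevel.MEMORY_AND_DISK)   // spill to disk instead of forcing everything into RAM
    println(rdd.partitions.length)              // only computes input splits; does not read the data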
Did you create a SQLContext?
On Sat, Nov 1, 2014 at 7:51 PM, abhinav chowdary abhinav.chowd...@gmail.com
wrote:
I have the same requirement of passing a list of values to an IN clause. When I
try to do it, I get the error below:
scala> val longList = Seq[Expression](a, b)
<console>:11: error: type
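As an aside not taken from this thread: rather than constructing Catalyst Expressions by hand, one workaround is to build the IN clause as a SQL string (the table and column names below are hypothetical):

    val values = Seq("a", "b", "c")
    val inList = values.map(v => s"'$v'").mkString(", ")
    val result = sqlContext.sql(s"SELECT * FROM events WHERE id IN ($inList)")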
Thanks for responding. This is what I initially suspected, and hence asked
why the library needed to construct the entire value buffer on a single
host before writing it out. The stacktrace appeared to suggest that user
code is not constructing the large buffer. I'm simply calling groupBy and
You can check the worker logs for more accurate information (they are
found under the work directory inside the Spark directory). I used to hit this
issue with:
- Too many open files: increasing the ulimit would solve this issue
- Akka connection timeout/frame size: setting the following while
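A sketch of the kind of settings being referred to, with illustrative values only (the frame size is in MB; raising the ulimit for "too many open files" happens at the OS level, not in Spark):

    val conf = new org.apache.spark.SparkConf()
      .set("spark.akka.frameSize", "128")   // raise the Akka frame size (MB) for large task results/messages
      .set("spark.akka.timeout", "200")     // raise the Akka communication timeout (seconds)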
Adding the libthrift jar
(http://mvnrepository.com/artifact/org.apache.thrift/libthrift/0.9.0) to
the classpath would resolve this issue.
Thanks
Best Regards
On Sat, Nov 1, 2014 at 12:34 AM, Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
Hi,
I am trying to load hive datasets using
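Regarding the libthrift suggestion above, a hedged sketch of one way to put the jar on the classpath when constructing the context (the local path and app name are hypothetical); passing it to spark-submit with --jars is the other common route:

    val conf = new org.apache.spark.SparkConf()
      .setAppName("hive-load")
      .setJars(Seq("/path/to/libthrift-0.9.0.jar"))   // ship the thrift jar with the application
    val sc = new org.apache.spark.SparkContext(conf)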
You can set HADOOP_CONF_DIR inside the spark-env.sh file
Thanks
Best Regards
On Sat, Nov 1, 2014 at 4:14 AM, ameyc ambr...@gmail.com wrote:
How do I set up HADOOP_CONF_DIR correctly when I'm running my Spark job on
YARN? My YARN environment has the correct HADOOP_CONF_DIR settings by the
saveAsTextFile means save every element of the RDD as one line of text.
It works like TextOutputFormat in Hadoop MapReduce since that's what
it uses. So you are causing it to create one big string out of each
Iterable this way.
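A small sketch of the fix implied here (RDD and path names are made up): flatten the grouped values so each element becomes its own output line, instead of letting saveAsTextFile stringify each whole Iterable.

    val grouped = pairs.groupByKey()   // pairs: RDD[(K, V)]  =>  RDD[(K, Iterable[V])]
    grouped
      .flatMap { case (k, vs) => vs.map(v => s"$k\t$v") }   // one output line per element
      .saveAsTextFile("hdfs:///path/to/output")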
On Sun, Nov 2, 2014 at 4:48 PM, Bharath Ravi Kumar reachb...@gmail.com
Hi,
Sorry to bring back this old thread.
What is the state now? Is this problem solved? How does Spark handle categorical
data now?
Regards,
Ashutosh
This operation requires two transformers:
1) Indexer, which maps string features into categorical features
2) OneHotEncoder, which flattens categorical features into binary features
We are working on the new dataset implementation, so we can easily
express those transformations. Sorry for the late reply!
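To illustrate the two steps on plain Scala collections (this is only a sketch of the idea, not the upcoming dataset API):

    val labels = Seq("red", "green", "blue")

    // 1) "Indexer": map each string feature to a categorical index
    val index: Map[String, Int] = labels.zipWithIndex.toMap

    // 2) "OneHotEncoder": flatten the categorical index into a binary vector
    def oneHot(s: String): Array[Double] =
      Array.tabulate(labels.size)(i => if (index(s) == i) 1.0 else 0.0)

    // oneHot("green") produces Array(0.0, 1.0, 0.0)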
I see the job in the web interface but don't know how to kill it.
I thought that only applied when you're trying to run a job using
spark-submit or in the shell...
On Sun, Nov 2, 2014 at 8:47 AM, Akhil Das ak...@sigmoidanalytics.com
wrote:
You can set HADOOP_CONF_DIR inside the spark-env.sh file
Thanks
Best Regards
On Sat, Nov 1, 2014 at 4:14 AM, ameyc
Thanks Sean! My novice understanding is that the 'native heap' is the
address space not allocated to the JVM heap, but I wanted to check to see
if I'm missing something. I found out my issue appeared to be actual
memory pressure on the executor machine. There was space for the JVM heap
but not
Hello,
I have written a Spark SQL application which reads data from HDFS and
queries it.
The data size is around 2GB (30 million records). The schema and query I am
running are below.
The query takes around 5+ seconds to execute.
I tried by adding
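An aside, not from this thread: for repeated queries over a 2GB dataset, caching the registered table in memory is usually the first thing to try (the table name below is hypothetical):

    sqlContext.cacheTable("records")                          // columnar in-memory cache
    val counts = sqlContext.sql("SELECT COUNT(*) FROM records")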
Did https://issues.apache.org/jira/browse/SPARK-3807 fix the issue you are seeing?
If so, please note that it will be part of 1.1.1 and 1.2.
Chirag
From: Chen Song chen.song...@gmail.com
Date: Wednesday, 15 October 2014 4:03 AM
To:
Hi,
I am running a small 6-node Spark cluster for testing purposes. Recently,
one of the nodes was filled up by temporary files and there
was no space left on its disk. Due to this my Spark jobs started failing
even though the node was shown as 'Alive' on the Spark Web UI. Once I logged
Yes, that's correct to my understanding and the probable explanation of
your issue. There are no additional limits or differences from how the JVM
works here.
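If this is running on YARN, a sketch of the knobs involved, with illustrative values only: the container has to hold the JVM heap plus native/off-heap allocations, and the overhead setting is what leaves room outside the heap.

    val conf = new org.apache.spark.SparkConf()
      .set("spark.executor.memory", "20g")                   // JVM heap per executor
      .set("spark.yarn.executor.memoryOverhead", "2048")     // MB reserved outside the heap for native allocations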
On Nov 3, 2014 4:40 AM, Paul Wais pw...@yelp.com wrote:
Thanks Sean! My novice understanding is that the 'native heap' is the
address
You can enable monitoring (Nagios) with alerts to tackle these kinds of
issues.
Thanks
Best Regards
On Mon, Nov 3, 2014 at 10:55 AM, jatinpreet jatinpr...@gmail.com wrote:
Hi,
I am running a small 6 node spark cluster for testing purposes. Recently,
one of the node's physical memory was
Hi there,
I have a PySpark job that simply takes a tab-separated CSV and outputs it
to a Parquet file. The code is based on the SQL write-parquet example
(using a different inferred schema, only 35 columns). The input files range
from 100MB to 12GB.
I have tried different block
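A Scala sketch of the same TSV-to-Parquet flow, in case it helps narrow things down (paths, column names, and schema are all hypothetical):

    import org.apache.spark.sql._

    val sqlContext = new SQLContext(sc)
    val lines = sc.textFile("hdfs:///path/to/input.tsv")

    // Explicit schema instead of inference; the real job has 35 columns.
    val schema = StructType(Seq(
      StructField("col1", StringType, nullable = true),
      StructField("col2", StringType, nullable = true)))

    val rows = lines.map(_.split("\t")).map(f => Row(f(0), f(1)))
    sqlContext.applySchema(rows, schema).saveAsParquetFile("hdfs:///path/to/output.parquet")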
Hi all, just wondering if there is a way to extract paths in GraphX. For
example, if I have the graph attached, I would like to return results
along the lines of:
101 - 103
101 - 104 - 108
102 - 105
102 - 106 - 107
http://apache-spark-user-list.1001560.n3.nabble.com/file/n17936/graph.jpg
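Not from the thread, but one low-tech sketch: if the graph is small enough to collect its edge list to the driver, and it is a DAG like the attached example, root-to-leaf paths can be enumerated with a plain DFS. The variable graph below stands for the asker's GraphX graph.

    // Build an adjacency map on the driver: src -> list of dst
    val adj: Map[Long, Seq[Long]] =
      graph.edges.map(e => (e.srcId, e.dstId)).collect()
        .groupBy(_._1).mapValues(_.map(_._2).toSeq).toMap

    // Depth-first enumeration of all paths from a vertex down to the leaves
    def paths(v: Long, prefix: List[Long]): Seq[List[Long]] = adj.get(v) match {
      case Some(children) => children.flatMap(c => paths(c, prefix :+ c))
      case None           => Seq(prefix)   // leaf: emit the accumulated path
    }

    // Roots are sources that never appear as a destination
    val roots = adj.keySet -- adj.values.flatten.toSet
    roots.foreach(r => paths(r, List(r)).foreach(p => println(p.mkString(" - "))))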