Is this what you are looking for?
1. Build Spark with the YARN profile:
http://spark.apache.org/docs/1.2.0/building-spark.html. Skip this step
if you are using a pre-packaged distribution.
2. Locate the spark-<version>-yarn-shuffle.jar. This should be under
Thanks a lot!
Can I ask why this code generates a uniform distribution?
If dist is N(0,1), the data should be N(-1, 2).
Let me know.
Thanks,
Luca
2015-02-07 3:00 GMT+00:00 Burak Yavuz brk...@gmail.com:
Hi,
You can do the following:
```
import
Hi,
I just ran into the following error during a GraphX computation. Does anyone have ideas
on this? Thanks so much!
(I think the memory is sufficient: spark.executor.memory is set to 30 GB.)
15/02/09 00:37:12 ERROR Executor: Exception in task 162.0 in stage 719.0 (TID
7653)
java.lang.OutOfMemoryError: Java
Hi,
Can someone please suggest a real-life application implemented in Spark
(something like gene sequencing) of the type shown in the code below? Basically, the
application should submit jobs from as many threads as possible. I
need a similar kind of Spark application for benchmarking.
val
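Not a full application, but a hedged sketch of the general pattern being asked about: several threads submitting independent jobs against a single SparkContext. Here `sc` is assumed to be an existing SparkContext and the per-job workload is a placeholder.
```scala
import java.util.concurrent.Executors

import scala.concurrent.duration.Duration
import scala.concurrent.{Await, ExecutionContext, Future}

// One SparkContext can accept actions from many threads; each action
// submitted from its own thread becomes an independent job in the scheduler.
implicit val ec = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(8))

val jobs = (1 to 8).map { i =>
  Future {
    // Placeholder workload; a real benchmark would run its own computation here.
    sc.parallelize(1 to 1000000).map(_ * i).count()
  }
}

val totals = Await.result(Future.sequence(jobs), Duration.Inf)
println(totals.sum)
```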
Replying to my own thread: I realized that this only happens when the
replication level is 1.
Regardless of whether I set MEMORY_ONLY, disk, or deserialized storage, I had to
set the replication level to 2 to make streaming work properly on YARN.
I still don't get why, because intuitively less
Hi All,
I have a use case where I have cached my schemaRDD and I want to launch
executors just on the partition which I know of (a prime use case for
PartitionPruningRDD).
I tried something like the following:
val partitionIdx = 2
val schemaRdd = hiveContext.table("myTable") // myTable is cached in
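A hedged sketch of how PartitionPruningRDD is typically wired up, reusing `partitionIdx` and `schemaRdd` from the snippet above (in Spark 1.2 a SchemaRDD is itself an RDD[Row]). PartitionPruningRDD is a developer API, so treat this as illustrative only.
```scala
import org.apache.spark.rdd.PartitionPruningRDD

// Keep only the partition of interest; actions on the pruned RDD schedule
// tasks for just that partition.
val pruned = PartitionPruningRDD.create(schemaRdd, idx => idx == partitionIdx)
pruned.count()  // launches a task only for partition 2
```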
Are you running in yarn-cluster or yarn-client mode?
Yarn-cluster. When I run in yarn-client mode, the driver just runs on the
machine that runs spark-submit.
Have you checked the corresponding executor logs as well? The information
you have provided here is not enough to actually understand your issue.
If you have `RDD[Array[Any]]` you can do
`rdd.map(_.mkString("\t"))`
or use some other delimiter to make it `RDD[String]`, and then call
`saveAsTextFile`.
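A self-contained sketch of that pattern, assuming `sc` is an existing SparkContext; the sample data and output path are made up:
```scala
import org.apache.spark.rdd.RDD

// Rows of mixed types.
val rdd: RDD[Array[Any]] =
  sc.parallelize(Seq(Array[Any](1, "a", 2.5), Array[Any](2, "b", 3.5)))

// Join each row's elements with a tab to get RDD[String], then write it out.
rdd.map(_.mkString("\t")).saveAsTextFile("/tmp/array-any-output")
```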
Sorry about that, yes, it should be uniformVectorRDD. Thanks Sean!
Burak
On Mon, Feb 9, 2015 at 2:05 AM, Sean Owen so...@cloudera.com wrote:
Yes the example given here should have used uniformVectorRDD. Then it's
correct.
On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini lucapug...@gmail.com
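Since the original snippet is cut off earlier in the digest, here is a hedged guess at what the corrected version might look like, assuming the goal is a uniform sample shifted and scaled into a chosen range (`sc`, the sizes, and the target range are illustrative):
```scala
import org.apache.spark.mllib.random.RandomRDDs

// uniformVectorRDD draws i.i.d. entries from U(0, 1); shifting and scaling
// each entry maps them to, e.g., U(-1, 1).
val data = RandomRDDs.uniformVectorRDD(sc, 10000L, 3)
  .map(v => v.toArray.map(x => 2.0 * x - 1.0))
```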
Hi experts!
Is there any way to run a Spark application using the java -cp command?
Thanks
Yes, like this:
/usr/lib/jvm/java-7-openjdk-i386/bin/java -cp
Hi Nicholas,
Thanks for your quick reply.
I'd like to try to build an image with create_image.sh. Then let's see how
we can launch a Spark cluster in region cn-north-1.
Guodong
On Tue, Feb 10, 2015 at 3:59 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Guodong,
spark-ec2 does not
Hi Marcelo,
Thanks for the explanation! So you mean that, in this way, only the
output of the map closure needs to be serialized so that it can be
passed along for other operations (maybe a reduce or something else)? And we
don't have to worry about Utils.funcX, because for each closure instance we
OK, good luck!
On Mon Feb 09 2015 at 6:41:14 PM Guodong Wang wangg...@gmail.com wrote:
Hi Nicholas,
Thanks for your quick reply.
I'd like to try to build an image with create_image.sh. Then let's see how
we can launch a Spark cluster in region cn-north-1.
Guodong
On Tue, Feb 10, 2015 at
The partitions parameter to textFile is the minPartitions, so there will
be at least that level of parallelism. Spark delegates to Hadoop to create
the splits for that file (yes, even for a text file on local disk rather than HDFS).
You can take a look at the code in FileInputFormat, but briefly, it will
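A quick illustration of the lower-bound behaviour (the path and the number 64 are made up; `sc` is an existing SparkContext):
```scala
// minPartitions is a lower bound: Hadoop's InputFormat may create more splits,
// but not fewer, as long as the file is splittable (e.g. not gzipped).
val defaultRdd = sc.textFile("hdfs:///data/big.txt")
val wideRdd    = sc.textFile("hdfs:///data/big.txt", 64)

println(defaultRdd.partitions.length)  // driven by the input format's split size
println(wideRdd.partitions.length)     // at least 64
```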
Hi,
We implemented an external data source by extending TableScan. We added
the classes to the classpath.
The data source works fine when run in the Spark shell,
but currently we are unable to use the same data source in the Python environment.
When we execute the following in an
Hello Spark community and Holden,
I am trying to follow Holden Karau's SparkSQL and ElasticSearch tutorial
from Spark Summit 2014. I am trying to use elasticsearch-spark 2.1.0.Beta3
and SparkSQL 1.2 together.
https://github.com/holdenk/elasticsearchspark
*(Side Note: This very nice tutorial does
`mean()` and `variance()` are not defined in `Vector`. You can use the
mean and variance implementation from commons-math3
(http://commons.apache.org/proper/commons-math/javadocs/api-3.4.1/index.html)
if you don't want to implement them. -Xiangrui
On Fri, Feb 6, 2015 at 12:50 PM, SK
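A hedged sketch of that suggestion, computing mean and variance of an MLlib Vector's values with commons-math3's StatUtils (the vector itself is made up):
```scala
import org.apache.commons.math3.stat.StatUtils
import org.apache.spark.mllib.linalg.Vectors

val v = Vectors.dense(1.0, 2.0, 3.0, 4.0)   // example vector
val values = v.toArray

val mean = StatUtils.mean(values)            // arithmetic mean
val variance = StatUtils.variance(values)    // bias-corrected sample variance
```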
Guodong,
spark-ec2 does not currently support the cn-north-1 region, but you can
follow [SPARK-4241](https://issues.apache.org/jira/browse/SPARK-4241) to
find out when it does.
The base AMI used to generate the current Spark AMIs is very old. I'm not
sure anyone knows what it is anymore. What I
Logistic regression outputs probabilities if the data fits the model
assumption. Otherwise, you might need to calibrate its output to
correctly read it. You may be interested in reading this:
http://fastml.com/classifier-calibration-with-platts-scaling-and-isotonic-regression/.
We have isotonic
No particular reason. We didn't add it in the first version. Let's add
it in 1.4. -Xiangrui
On Thu, Feb 5, 2015 at 3:44 PM, jamborta jambo...@gmail.com wrote:
hi all,
just wondering if there is a reason why it is not possible to add intercepts
for streaming regression models? I understand
Could you check the Spark UI and see whether there are RDDs being
kicked out during the computation? We cache the residual RDD after
each iteration. If we don't have enough memory/disk, it gets
recomputed, resulting in something like `t(n) = t(n-1) + const`. We
might cache the features multiple
In other words, the working command is:
/root/spark/bin/spark-submit --class com.crowdstar.etl.ParseAndClean --master
spark://ec2-54-213-73-150.us-west-2.compute.amazonaws.com:7077 --deploy-mode
cluster --total-executor-cores 4
file:///root/etl-admin/jar/spark-etl-0.0.1-SNAPSHOT.jar
Open up 'yarn-site.xml' in your Hadoop configuration. You want to add
configuration entries for yarn.nodemanager.resource.memory-mb and
yarn.scheduler.maximum-allocation-mb. Have a look here for details on how
they work:
*Command:*
sudo python ./examples/src/main/python/pi.py
*Error:*
Traceback (most recent call last):
  File "./examples/src/main/python/pi.py", line 22, in <module>
    from pyspark import SparkContext
ImportError: No module named pyspark
I think you have to run that using $SPARK_HOME/bin/pyspark /path/to/pi.py
instead of plain `python pi.py`.
On Mon, Feb 9, 2015 at 11:22 PM, Ashish Kumar ashish.ku...@innovaccer.com
wrote:
*Command:*
sudo python ./examples/src/main/python/pi.py
*Error:*
Traceback (most recent call last):
Hi, all
Can any expert show me how to change the initialCapacity of the
following?
org.apache.spark.util.collection.AppendOnlyMap
We have run into problems using Spark to process large data sets during
sort-based shuffle.
Does Spark offer a configurable parameter for
Thanks. But in spark-submit, I specified the jar file in the form
local:/spark-etl-0.0.1-SNAPSHOT.jar. It comes back with the following. What's
wrong with this?
Ey-Chih Chow
===
Date: Sun, 8 Feb 2015 22:27:17 -0800
Sending launch command to
Thanks for the info, guys.
For now I'm using the high-level consumer; I will give this one a try.
As far as the queries are concerned, checkpointing helps.
I'm still not sure what's the best way to gracefully stop the application
in yarn-cluster mode.
On 5 Feb 2015 09:38, Dibyendu Bhattacharya
I have a matrix X of type:
res39: org.apache.spark.mllib.linalg.distributed.RowMatrix =
org.apache.spark.mllib.linalg.distributed.RowMatrix@6cfff1d3
with n rows and p columns
I would like to obtain an array S of size n*1 defined as the sum of the
columns of X.
S will then be replaced by
val
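Since the request is slightly ambiguous, a hedged sketch covering both readings, with `X` being the RowMatrix above:
```scala
// Row-wise reading: sum each row's p entries, giving one value per row (n values).
val S = X.rows.map(_.toArray.sum)

// Column-wise reading: a single p-length array of per-column sums.
val columnSums = X.rows
  .map(_.toArray)
  .reduce((a, b) => a.zip(b).map { case (x, y) => x + y })
```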
Hello Everyone,
I was reading this blog post:
http://homes.esat.kuleuven.be/~bioiuser/blog/a-d3-visualisation-from-spark-as-a-service/
and was wondering if this approach can be taken to visualize streaming
data...not just historical data?
Thank you!
-Suh
Hi,
Spark 1.2 changed the APIs a bit, which is what's causing the problem with
es-spark 2.1.0.Beta3. This was addressed
a while back in es-spark proper; you can get hold of the dev build (the
upcoming 2.1.Beta4) here [1].
P.S. Do note that a lot of things have happened in
Found it - used saveAsHadoopFile
On Mon, Feb 9, 2015 at 9:11 AM, Kane Kim kane.ist...@gmail.com wrote:
Hi, how do I compress output with gzip using the Python API?
Thanks!
Hi there,
I’m trying to improve performance on a job that has GC troubles and takes
longer to compute simply because it has to recompute failed tasks. After
deferring object creation as much as possible, I’m now trying to improve memory
usage with StorageLevel.MEMORY_AND_DISK_SER and a custom
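A minimal sketch of that kind of setup, assuming the "custom" part refers to a Kryo serializer with registered classes; the `Record` class and the sample data are placeholders, not the poster's actual job:
```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

// Hypothetical record type standing in for the job's real data classes.
case class Record(id: Long, payload: Array[Byte])

val conf = new SparkConf()
  .setAppName("mem-and-disk-ser")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .registerKryoClasses(Array(classOf[Record]))

val sc = new SparkContext(conf)

// Serialized storage trades CPU for a smaller memory footprint; partitions
// that don't fit in memory spill to disk instead of being recomputed.
val records = sc.parallelize(1L to 1000000L).map(i => Record(i, new Array[Byte](16)))
records.persist(StorageLevel.MEMORY_AND_DISK_SER)
records.count()
```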
Hi guys,
I want to launch a Spark cluster in AWS, and I know there is a spark_ec2.py
script.
I am using the AWS service in China, but I cannot find the AMI in the
China region.
So I have to build one. My question is:
1. Where is the bootstrap script to create the Spark AMI? Is it here (
Hi,
Please take a look at
http://docs.aws.amazon.com/AWSEC2/latest/UserGuide/creating-an-ami-ebs.html
Cheers
Gen
On Mon, Feb 9, 2015 at 6:41 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi, I am very new to both Spark and AWS.
Say I want to install pandas on EC2 (pip install
Yes the example given here should have used uniformVectorRDD. Then it's correct.
On Mon, Feb 9, 2015 at 9:56 AM, Luca Puggini lucapug...@gmail.com wrote:
Thanks a lot!
Can I ask why this code generates a uniform distribution?
If dist is N(0,1), the data should be N(-1, 2).
Let me know.
Thanks,
Hi experts! I am trying to use Spark in my RESTful web services. I am using
the Scala Lift framework for writing web services. Here is my boot class:
class Boot extends Bootable {
def boot {
Constants.loadConfiguration
val sc = new SparkContext(new
Hi Michael,
The storage tab shows the RDD resides fully in memory (10 partitions) with
zero disk usage. Tasks for a subsequent select on this cached table show
minimal overheads (GC, queueing, shuffle write, etc.), so overhead is
not the issue. However, it is still twice as slow as reading
`func1` and `func2` never get serialized. They must exist on the other
end in the form of a class loaded by the JVM.
What gets serialized is an instance of a particular closure (the
argument to your map function). That's a separate class. The
instance of that class that is serialized contains
You'll probably only get good compression for strings when dictionary
encoding works. We don't optimize decimals in the in-memory columnar
storage, so you are likely paying for expensive serialization there.
On Mon, Feb 9, 2015 at 2:18 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Flat data of
If we define a Utils object:
object Utils {
  def func1 = { .. }
  def func2 = { .. }
}
and then in an RDD we refer to one of the functions:
rdd.map { r => Utils.func1(r) }
will Utils.func2 also get serialized or not?
Thanks,
Yitong
Could you share which data types are optimized in the in-memory storage and
how they are optimized?
On Mon, Feb 9, 2015 at 2:33 PM, Michael Armbrust mich...@databricks.com
wrote:
You'll probably only get good compression for strings when dictionary
encoding works. We don't optimize decimals
The standard way to add timestamps is java.sql.Timestamp.
On Mon, Feb 9, 2015 at 3:23 PM, jay vyas jayunit100.apa...@gmail.com
wrote:
Hi spark ! We are working on the bigpetstore-spark implementation in
apache bigtop, and want to implement idiomatic date/time usage for SparkSQL.
It appears
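A hedged sketch of the java.sql.Timestamp suggestion above, using Spark 1.2's case-class reflection (the `Visit` class, its fields, and the table name are made up for illustration; `sc` is an existing SparkContext):
```scala
import java.sql.Timestamp
import org.apache.spark.sql.SQLContext

// Hypothetical record type; a java.sql.Timestamp field maps to SparkSQL's timestamp type.
case class Visit(petId: Long, visitTime: Timestamp)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD   // implicit RDD -> SchemaRDD conversion (Spark 1.2)

val visits = sc.parallelize(Seq(Visit(1L, new Timestamp(System.currentTimeMillis()))))
visits.registerTempTable("visits")

sqlContext.sql("SELECT petId, visitTime FROM visits").collect().foreach(println)
```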
Is there an easy way to check if a Spark binary release was built with Hive
support? Are any of the prebuilt binaries on the Spark website built with Hive
support?
Thanks,
Ashic.
You could add a new ColumnType
https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/columnar/ColumnType.scala
.
PRs welcome :)
On Mon, Feb 9, 2015 at 3:01 PM, Manoj Samel manojsamelt...@gmail.com
wrote:
Hi Michael,
As a test, I have same data loaded as
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217
Yes all releases are built with -Phive except the 'without-hive' build.
On Mon, Feb 9, 2015 at 10:41 PM, Ashic Mahtab as...@live.com wrote:
Is there an easy way to check if a spark binary release was built
Awesome...thanks Sean.
From: so...@cloudera.com
Date: Mon, 9 Feb 2015 22:43:45 +
Subject: Re: Check if spark was built with hive
To: as...@live.com
CC: user@spark.apache.org
https://github.com/apache/spark/blob/master/dev/create-release/create-release.sh#L217
Yes all releases are
Hi Michael,
As a test, I have the same data loaded as another Parquet file, except with the two
decimal(14,4) columns replaced by double. With this, the on-disk size is ~345 MB,
the in-memory size is 2 GB (vs. 12 GB), and the cached query runs in half the
time of the uncached query.
Would it be possible for Spark to
Hi spark ! We are working on the bigpetstore-spark implementation in apache
bigtop, and want to implement idiomatic date/time usage for SparkSQL.
It appears that org.joda.time.DateTime isn't in SparkSQL's rolodex of
reflection types.
I'd rather not force an artificial dependency on hive dates
Hi folks, puzzled by something pretty simple:
I have a standalone cluster with default parallelism of 2, and spark-shell
running with 2 cores.
sc.textFile("README.md").partitions.size returns 2 (this makes sense)
sc.textFile("README.md").coalesce(100, true).partitions.size returns 100,
which also makes sense
The C implementation of Word2Vec updates the model using multiple threads
without locking. It is hard to implement that in a distributed way. In
the MLlib implementation, each worker holds the entire model in memory
and outputs the part of the model that gets updated. The driver still needs
to collect and
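For context, a minimal usage sketch of the MLlib Word2Vec implementation being described (the input path is a placeholder and `sc` is an existing SparkContext):
```scala
import org.apache.spark.mllib.feature.Word2Vec

// Each document is a sequence of tokens.
val corpus = sc.textFile("hdfs:///data/text8").map(_.split(" ").toSeq)

val model = new Word2Vec().setVectorSize(100).fit(corpus)

// The query word must appear in the training vocabulary.
model.findSynonyms("spark", 5).foreach { case (word, score) => println(s"$word $score") }
```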