Yes, that is correct. If you are executing a Spark program across multiple
machines, then you need to use a distributed file system (HDFS API
compatible) for reading and writing data. In your case, your setup is
across multiple machines, so what is probably happening is that the RDD
data is
Hello,
I am running a Cloudera 4-node cluster with 1 master and 3 slaves. I am
connecting to the Spark master from Scala using SparkContext. I am trying to
execute a simple Java function from the distributed jar on every Spark
worker but haven't found a way to communicate with each worker or a Spark
In my application, data parts inside an RDD partition have relations, so I
need to do some operations between them.
For example: RDD T1 has several partitions, and each partition has three parts A, B and C.
Then I transform T1 to T2. After the transform, T2 also has three parts D, E and
F, where D = A+B, E =
So, the data structure looks like:
D consists of D1, D2, D3 (DX is a partition)
and
DX consists of d1, d2, d3 (dx is the part in your context)?
what you want to do is to transform
DX to (d1 + d2, d1 + d3, d2 + d3)?
Best,
--
Nan Zhu
On Tuesday, April 8, 2014 at 8:09 AM, wxhsdp wrote:
Yes, how can I do this conveniently? I can use filter, but there will be so
many RDDs and it's not concise.
If that's the case, I think mapPartitions is what you need, but it seems that
you have to load the partition into memory as a whole with toArray:
rdd.mapPartitions { D => val p = D.toArray; ... }
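To make that concrete, here is a minimal, self-contained sketch (my own example, assuming the parts are simple numeric values and that each partition holds exactly the parts that belong together) of building the pairwise sums discussed above with mapPartitions:

import org.apache.spark.{SparkConf, SparkContext}

object PartitionPairSums {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("partition-pair-sums").setMaster("local[2]"))
    // Hypothetical input: two partitions, each holding three related parts.
    val t1 = sc.parallelize(Seq(1, 2, 3, 10, 20, 30), 2)
    // mapPartitions hands over the whole partition as an iterator, so the parts
    // that belong together can be combined; toArray loads the partition into memory.
    val t2 = t1.mapPartitions { iter =>
      val p = iter.toArray                         // e.g. Array(a, b, c)
      for {
        i <- p.indices.iterator
        j <- (i + 1) until p.length
      } yield p(i) + p(j)                          // (a+b, a+c, b+c)
    }
    t2.collect().foreach(println)
    sc.stop()
  }
}

The result stays a single RDD, so there is no need for the many filter-generated RDDs mentioned earlier in the thread.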
--
Nan Zhu
On Tuesday, April 8, 2014 at 8:40 AM, wxhsdp wrote:
Yes, how can I do this
Forgot to post the solution. I messed up the master URL. In particular, I gave
the host (master), not a URL. My bad. The error message is weird, though. Seems
like the URL regex matches master for mesos://...
No idea about the Java Runtime Environment Error.
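For anyone hitting the same thing, a hedged sketch of the intended form (the host name is a placeholder and 7077 is only the standalone default port):

import org.apache.spark.{SparkConf, SparkContext}

// The master must be a full URL with scheme and port, not just the host name.
val conf = new SparkConf()
  .setAppName("example")
  .setMaster("spark://master:7077")   // not just "master"
val sc = new SparkContext(conf)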
On Mar 26, 2014, at 3:52 PM,
Thank you for your help! Let me have a try.
Nan Zhu wrote
If that's the case, I think mapPartitions is what you need, but it seems
that you have to load the partition into memory as a whole with toArray:
rdd.mapPartitions { D => val p = D.toArray; ... }
--
Nan Zhu
On Tuesday, April
Another thing I didn't mention: the AMI and user used. Naturally, I've
created several of my own AMIs with the following characteristics, none of
which worked.
1) Enabling ssh as root as per this guide (
http://blog.tiger-workshop.com/enable-root-access-on-amazon-ec2-instance/).
When doing this, I
Hi
I'm using Spark 0.9.0.
When calling saveAsTextFile on a custom Hadoop InputFormat (loaded with
newAPIHadoopRDD), I get the error below.
If I call count, I get the correct number of records, so the
InputFormat is being read correctly... the issue only appears when trying
to
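The error itself is not quoted here, but for orientation, a minimal sketch of the newAPIHadoopFile/saveAsTextFile pattern; the paths, the stand-in TextInputFormat, and the map-to-String step are my own assumptions, not the original code:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.lib.input.TextInputFormat
import org.apache.spark.SparkContext

val sc = new SparkContext("local[2]", "newapi-save-sketch")
val conf = new Configuration()
val rdd = sc.newAPIHadoopFile(
  "hdfs://namenode:9000/input",
  classOf[TextInputFormat],
  classOf[LongWritable],
  classOf[Text],
  conf)
// Hadoop Writables are not java.io.Serializable, so converting to plain types
// before saving is a common precaution when saveAsTextFile fails but count works.
rdd.map { case (_, v) => v.toString }
  .saveAsTextFile("hdfs://namenode:9000/output")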
I was able to keep the workaround ...around... by overwriting the
generated '/root/.ssh/authorized_keys' file with a known good one, in the
'/etc/rc.local' file
On Tue, Apr 8, 2014 at 10:12 AM, Marco Costantini
silvio.costant...@granatads.com wrote:
Another thing I didn't mention. The AMI and
Hi Flavio,
I happen to be attending the 2014 Apache Conf, where I heard about a
project called Apache Phoenix, which fully leverages HBase and is supposed to
be 1000x faster than Hive. And it is not memory-bound, whereas memory sets
up a limit for Spark. It is still in the incubating group and
Flavio, the two are best at two orthogonal use cases: HBase on the
transactional side, and Spark on the analytic side. Spark is not intended
for row-based random-access updates, while being far more flexible and efficient
at dataset-scale aggregations and general computations.
So yes, you can easily
Thanks for the quick reply, Bin. Phoenix is something I'm going to try for
sure, but it seems somewhat useless if I can use Spark.
Probably, as you said, since Phoenix uses a dedicated data structure within
each HBase table, it has more effective memory usage, but if I need to
deserialize data stored in a
I tried again with the latest master, which includes the commit below, but the UI page
still shows nothing on the storage tab.
koert
commit ada310a9d3d5419e101b24d9b41398f609da1ad3
Author: Andrew Or andrewo...@gmail.com
Date: Mon Mar 31 23:01:14 2014 -0700
[Hot Fix #42] Persisted RDD disappears on
That commit did work for me. Could you confirm the following:
1) After you called cache(), did you call any actions like count() or
reduce()? If you don't materialize the RDD, it won't show up in the
storage tab (see the sketch below).
2) Did you run ./make-distribution.sh after you switched to the current master?
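To make point (1) concrete, a tiny sketch (the path is just a placeholder):

val data = sc.textFile("hdfs://namenode:9000/some/input")
val cached = data.cache()   // cache() only marks the RDD; nothing is stored yet
cached.count()              // the first action materializes and caches it
// Only after an action like this should the RDD appear on the storage tab.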
I read on the RDD paper
(http://www.cs.berkeley.edu/~matei/papers/2012/nsdi_spark.pdf) :
For example, an RDD representing an HDFS file has a partition for each block
of the file and knows which machines each block is on
And this, on http://hadoop.apache.org/docs/r1.2.1/hdfs_design.html:
To minimize
When I start spark-shell I now see
ls: cannot access /usr/local/lib/spark/lib_managed/jars/: No such file or
directory
We do not package a lib_managed with our Spark build (never did). Maybe the
logic in compute-classpath.sh that searches for datanucleus should check
for the existence of
Note that for a cached RDD in the spark shell it all works fine, but
something is going wrong with the spark-shell in our applications that
extensively cache and re-use RDDs.
On Tue, Apr 8, 2014 at 12:33 PM, Koert Kuipers ko...@tresata.com wrote:
i tried again with latest master, which includes
Sorry, I meant to say: note that for a cached RDD in the spark shell it all
works fine, but something is going wrong with the SPARK-APPLICATION-UI in
our applications that extensively cache and re-use RDDs.
On Tue, Apr 8, 2014 at 12:55 PM, Koert Kuipers ko...@tresata.com wrote:
note that for a
That commit fixed the exact problem you described. That is why I want to
confirm that you switched to the master branch. bin/spark-shell doesn't
detect code changes, so you need to run ./make-distribution.sh to
re-compile Spark first. -Xiangrui
On Tue, Apr 8, 2014 at 9:57 AM, Koert Kuipers
Yes, I call an action after cache, and I can see that the RDDs are fully
cached using context.getRDDStorageInfo, which we expose via our own API.
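For reference, a rough sketch of that kind of check (the field names follow the storageInfo output quoted later in this thread):

sc.getRDDStorageInfo.foreach { info =>
  println(s"RDD ${info.id} (${info.name}): " +
    s"${info.numCachedPartitions}/${info.numPartitions} partitions cached, " +
    s"memSize=${info.memSize}, diskSize=${info.diskSize}")
}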
I did not run make-distribution.sh; we have our own scripts to build a
distribution. However, if your question is whether I correctly deployed the
latest
Just took a quick look at the overview here (http://phoenix.incubator.apache.org/) and
the quick start guide here
(http://phoenix.incubator.apache.org/Phoenix-in-15-minutes-or-less.html).
It looks like Apache Phoenix aims to provide flexible SQL access to data,
both for transactional and analytic
Hi Ankit,
Thanks for all the work on Pig.
Finally got it working. A couple of high-level bugs right now:
- Getting it working on Spark 0.9.0
- Getting UDFs working
- Getting generate functionality working
- Exhaustive test suite for Pig on Spark
Are you maintaining a Jira somewhere?
I am
Yes, I am definitely using the latest.
On Tue, Apr 8, 2014 at 1:07 PM, Xiangrui Meng men...@gmail.com wrote:
That commit fixed the exact problem you described. That is why I want to
confirm that you switched to the master branch. bin/spark-shell doesn't
detect code changes, so you need to run
I put some println statements in BlockManagerUI.
I have RDDs that are cached in memory. I see this:
*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0
Yet at the same time I can see via our own API:
storageInfo: {
diskSize: 0,
memSize: 19944,
numCachedPartitions: 1,
numPartitions: 1
}
On Tue, Apr 8, 2014 at 2:25 PM, Koert Kuipers ko...@tresata.com wrote:
i put some println statements in BlockManagerUI
Yup, sorry about that. This error message should not produce incorrect
behavior, but it is annoying. Posted a patch to fix it:
https://github.com/apache/spark/pull/361
Thanks for reporting it!
On Tue, Apr 8, 2014 at 9:54 AM, Koert Kuipers ko...@tresata.com wrote:
when i start spark-shell i
I am trying to read a file from HDFS in the Spark shell and getting the error below.
When I create the first RDD it works fine, but when I try to do a count on that
RDD, it throws me some connection error. I have a single-node HDFS setup, and on
the same machine I have Spark running. Please help. When I run jps
Thank you -- this actually helped a lot. Strangely it appears that the task
detail view is not accurate in 0.8 -- that view shows 425ms duration for
one of the tasks, but in the driver log I do indeed see Finished TID 125 in
10940ms.
On that slow worker I see the following:
14/04/08 18:06:24
There are a couple of issues here which I was able to find out.
1: We should not use the web port which we use to access the web UI. I was using
that initially, so it was not working.
2: All requests should go to the NameNode and not anything else.
3: By replacing localhost:9000 in the above request, it
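A short sketch of the working form described in 2 and 3 (the host is a placeholder and 9000 is only a common NameNode RPC default):

// Point the path at the NameNode's RPC port, not the web UI port.
val rdd = sc.textFile("hdfs://namenode-host:9000/user/data/input.txt")
rdd.count()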
Hi All,
I want to measure the total network traffic for a Spark job, but I did
not see related information in the log. Does anybody know how to measure
it? Thanks very much in advance.
Hi All,
I have some spatial data on a Postgres machine. I want to be able
to move that data to Hadoop and do some geo-processing.
I tried using Sqoop to move the data to Hadoop, but it complained about the
position data (which it says it can't recognize).
Does anyone have any idea as to
Hello Manas,
I don't know Sqoop that much, but my best guess is that you're probably
using PostGIS, which has specific structures for Geometry and so on. And if
you need some spatial operators, my gut feeling is that things will be
harder ^^ (but a raw import won't need that...).
So I did a quick
Can Spark be configured to use SSL for all its network communication?
1) At the end of the callback.
2) Yes, we simply expose sc.getRDDStorageInfo to the user via REST.
3) Yes, exactly. We define the RDDs at startup; all of them are cached. From
that point on we only do calculations on these cached RDDs.
I will add some more println statements for storageStatusList.
Our one cached RDD in this run has id 3.
*** onStageSubmitted **
rddInfo: RDD 2 (2) Storage: StorageLevel(false, false, false, false, 1);
CachedPartitions: 0; TotalPartitions: 1; MemorySize: 0.0 B;TachyonSize: 0.0
B; DiskSize: 0.0 B
_rddInfoMap: Map(2 -> RDD 2
Not that I know of, but it would be great if that was supported. The way I
typically handle security now is to put the Spark servers in their own
subnet with strict inbound/outbound firewalls.
On Tue, Apr 8, 2014 at 1:14 PM, kamatsuoka ken...@gmail.com wrote:
Can Spark be configured to use
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is being used
and on what machines.
https://spark.apache.org/docs/0.9.0/monitoring.html
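As a hedged sketch, a conf/metrics.properties along these lines should enable it (assuming the GangliaSink described on that monitoring page; the host and port are placeholders for your gmond endpoint):

*.sink.ganglia.class=org.apache.spark.metrics.sink.GangliaSink
*.sink.ganglia.host=ganglia.example.com
*.sink.ganglia.port=8649
*.sink.ganglia.period=10
*.sink.ganglia.unit=seconds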
On Tue, Apr 8, 2014 at 12:57 PM, yxzhao yxz...@ualr.edu wrote:
Hi All,
If you want the machine that hosts the driver to also do work, you can
designate it as a worker too, if I'm not mistaken. I don't think the
driver should do work, logically, but that's not to say that the
machine it's on shouldn't do work.
--
Sean Owen | Director, Data Science | London
On Tue,
One thing you could do is create an RDD of [1,2,3] and set a partitioner
that puts all three values on their own nodes. Then .foreach() over the
RDD and call your function that will run on each node.
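As a rough sketch of that idea (the worker count and the setup function are placeholders, and foreachPartition is used here so the closure runs once per partition rather than once per element):

val numWorkers = 3                       // assumed number of worker nodes
sc.parallelize(1 to numWorkers, numWorkers).foreachPartition { _ =>
  runMySetup()                           // hypothetical per-node setup function
}
// Note: scheduling does not strictly guarantee one partition per node,
// which is why the suggestion above mentions a custom partitioner.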
Why do you need to run the function on every node? Is it some sort of
setup code that needs to
Alright, so I guess I understand now why spark-ec2 allows you to select
different instance types for the driver node and worker nodes. If the
driver node is just driving and not doing any large collect()s or heavy
processing, it can be much smaller than the worker nodes.
With regards to data
This may be unrelated to the question itself, but just FYI:
you can run your driver program on a worker node with Spark 0.9:
http://spark.apache.org/docs/latest/spark-standalone.html#launching-applications-inside-the-cluster
Best,
--
Nan Zhu
On Tuesday, April 8, 2014 at 5:11 PM, Nicholas Chammas
Hi guys,
We're going to hold a series of meetups about machine learning with
Spark in San Francisco.
The first one will be on April 24. Xiangrui Meng from Databricks will
talk about Spark, Spark/Python, feature engineering, and MLlib.
See
Dear all,
I'm very interested in this. However, the links mentioned above are not
accessible from China. Is there any other way to read the two blog pages?
Thanks a lot.
2014-04-08 12:54 GMT+08:00 prabeesh k prabsma...@gmail.com:
Hi all,
Here I am sharing a blog post for beginners about creating
Thanks Andrew, I will take a look at it.
On Tue, Apr 8, 2014 at 3:35 PM, Andrew Ash [via Apache Spark User List]
ml-node+s1001560n3920...@n3.nabble.com wrote:
If you set up Spark's metrics reporting to write to the Ganglia backend
that will give you a good idea of how much network/disk/CPU is
Anybody, please help with the above query.
It's challenging but will open a new horizon for in-memory analysis.
Hi,
I am getting a java.io.NotSerializableException while executing the
following program.
import org.apache.spark.SparkContext._
import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam
object App {
class Vector (val data: Array[Double]) {}
implicit object VectorAP
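The snippet above stops mid-definition, so the following is only a guess at the usual shape of such a program, with the custom class made Serializable, which is what a NotSerializableException typically points at; everything beyond the quoted lines is my own assumption:

import org.apache.spark.SparkContext
import org.apache.spark.AccumulatorParam

object App {
  // The class is shipped to executors inside closures/accumulators,
  // so it has to be serializable.
  class Vector(val data: Array[Double]) extends Serializable

  implicit object VectorAP extends AccumulatorParam[Vector] {
    def zero(v: Vector): Vector = new Vector(new Array[Double](v.data.length))
    def addInPlace(v1: Vector, v2: Vector): Vector = {
      for (i <- v1.data.indices) v1.data(i) += v2.data(i)
      v1
    }
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local[2]", "accumulator-sketch")
    val acc = sc.accumulator(new Vector(Array(0.0, 0.0)))
    sc.parallelize(Seq(Array(1.0, 2.0), Array(3.0, 4.0)))
      .foreach(arr => acc += new Vector(arr))
    println(acc.value.data.mkString(","))
    sc.stop()
  }
}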
Dear list,
SBT compiles fine, but when I do the following:
sbt/sbt gen-idea
import project as SBT project to IDEA 13.1
Make Project
and these errors show up:
Error:(28, 8) object FileContext is not a member of package
org.apache.hadoop.fs
import org.apache.hadoop.fs.{FileContext, FileStatus,
Hi Dong,
This is pretty much what I did. I ran into the same issue you have.
Since I'm not developing yarn-related stuff, I just excluded those two
yarn-related projects from IntelliJ, and it works. PS: you may need to
exclude the java8 project as well now.
Sincerely,
DB Tsai
I let IntelliJ read the Maven build directly and that works fine.
--
Sean Owen | Director, Data Science | London
On Wed, Apr 9, 2014 at 6:14 AM, Dong Mo monted...@gmail.com wrote:
Dear list,
SBT compiles fine, but when I do the following:
sbt/sbt gen-idea
import project as SBT project to