From API docs: Zips this RDD with another one, returning key-value
pairs with the first element in each RDD, second element in each RDD,
etc. Assumes that the two RDDs have the *same number of partitions*
and the *same number of elements in each partition* (e.g. one was made
through a map on the
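For example, a minimal sketch of two RDDs that satisfy that assumption (the numbers are made up):

val a = sc.parallelize(1 to 6, 3)      // 3 partitions, 2 elements each
val b = a.map(_ * 10)                  // same partitioning, same element counts
val zipped = a.zip(b)                  // RDD[(Int, Int)]
zipped.collect()                       // Array((1,10), (2,20), ..., (6,60))
// Zipping two RDDs with mismatched partition counts or sizes fails at runtime.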
It's this: mvn -Dhadoop.version=2.0.0-cdh4.2.1 -DskipTests clean package
On Tue, Apr 1, 2014 at 5:15 PM, Vipul Pandey vipan...@gmail.com wrote:
how do you recommend building that - it says
[ERROR] Failed to execute goal
org.apache.maven.plugins:maven-assembly-plugin:2.2-beta-5:assembly
Hi Thierry,
Your code does not work if @yh18190 wants a global counter. An RDD may have
more than one partition, and for each partition, cnt will be reset to -1. You
can try the following code:
scala> val rdd = sc.parallelize((1, 'a') :: (2, 'b') :: (3, 'c') :: (4, 'd') :: Nil)
rdd:
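If a truly global counter/index is what's needed, one possible sketch (assuming the RDD is not re-evaluated between the two passes) is to count the elements per partition first and then offset each partition's local index:

val counts  = rdd.mapPartitionsWithIndex { (i, it) => Iterator((i, it.size)) }
                 .collect().sortBy(_._1).map(_._2)
val offsets = counts.scanLeft(0L)(_ + _)        // cumulative counts per partition
val indexed = rdd.mapPartitionsWithIndex { (i, it) =>
  it.zipWithIndex.map { case (x, j) => (offsets(i) + j, x) }
}
indexed.collect()                               // (globalIndex, element) pairs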
I think multiplying by ratings is a heuristic that worked on rating-related
problems like the Netflix dataset or other ratings datasets, but the scope
of NMF is much broader than that.
@Sean please correct me in case you don't agree...
Definitely it's good to add all the rating-dataset-related
I downloaded 0.9.0 fresh and ran the mvn command - the assembly jar thus
generated also has both shaded and real version of protobuf classes
Vipuls-MacBook-Pro-3:spark-0.9.0-incubating vipul$ jar -ftv
./assembly/target/scala-2.10/spark-assembly_2.10-0.9.0-incubating-hadoop2.0.0-cdh4.2.1.jar
Hi Therry,
Thanks for the above responses. I implemented it using RangePartitioner. We
need to use one of the custom partitioners in order to perform this
task. Normally you can't maintain a counter, because count operations would have to
be performed on each partitioned block of data...
I’ve been able to get CDH5 up and running on EC2 and according to Cloudera
Manager, Spark is running healthy.
But when I try to run spark-shell, I eventually get the error:
14/04/02 07:18:18 INFO client.AppClient$ClientActor: Connecting to master
spark://ip-172-xxx-xxx-xxx:7077...
14/04/02
Thanks for the update Evan! In terms of using MLI, I see that the Github
code is linked to Spark 0.8; will it not work with 0.9 (which is what I
have set up) or higher versions?
On Wed, Apr 2, 2014 at 1:44 AM, Evan R. Sparks [via Apache Spark User List]
ml-node+s1001560n3615...@n3.nabble.com
It should be kept in mind that different implementations are rarely
strictly better, and that what works well on one type of data might
not on another. It also bears keeping in mind that several of these
differences just amount to different amounts of regularization, which
need not be a
Hi, Spark Devs:
I've encountered a problem that shows an akka.actor.ActorNotFound error
message on our Mesos mini-cluster.
mesos : 0.17.0
spark : spark-0.9.0-incubating
spark-env.sh:
#!/usr/bin/env bash
export MESOS_NATIVE_LIBRARY=/usr/local/lib/libmesos.so
export SPARK_EXECUTOR_URI=hdfs://
Heya,
Yep this is a problem in the Mesos scheduler implementation that has been
fixed after 0.9.0 (https://spark-project.atlassian.net/browse/SPARK-1052 =
MesosSchedulerBackend)
So there are several options, like applying the patch or upgrading to 0.9.1 :-/
Cheers,
Andy
On Wed, Apr 2, 2014 at 5:30 PM,
Aha, thank you for your kind reply.
Upgrading to 0.9.1 is a good choice. :)
On Wed, Apr 2, 2014 at 11:35 PM, andy petrella andy.petre...@gmail.comwrote:
Heya,
Yep this is a problem in the Mesos scheduler implementation that has been
fixed after 0.9.0
np ;-)
Can someone explain how an RDD is resilient? If one of the partitions is lost,
who is responsible for recreating that partition - is it the driver program?
Hi Guys
I would like to print the content of each line in:
JavaDStream<String> lines = ssc.socketTextStream(args[1],
    Integer.parseInt(args[2]));
JavaDStream<String> words = lines.flatMap(new
    FlatMapFunction<String, String>() {
  @Override
  public Iterable<String> call(String x) {
TL;DR
Your classes are missing on the workers, pass the jar containing the
class main.scala.Utils to the SparkContext
Longer:
I'm missing some information, like how the SparkContext is configured, but my
best guess is that you didn't provide the jars (addJars on SparkConf or
use the SC's constructor
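A minimal sketch of the SparkConf route (the app name, master URL and jar path are placeholders; main.scala.Utils is the class from this thread):

import org.apache.spark.{SparkConf, SparkContext}

val conf = new SparkConf()
  .setAppName("my-app")
  .setMaster("spark://master:7077")
  .setJars(Seq("/path/to/my-app-assembly.jar"))   // jar containing main.scala.Utils
val sc = new SparkContext(conf)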
Update: I'm now using this ghetto function to partition the RDD I get back
when I call textFile() on a gzipped file:
# Python 2.6
def partitionRDD(rdd, numPartitions):
    counter = {'a': 0}
    def count_up(x):
        counter['a'] += 1
        return counter['a']
    return (rdd.keyBy(count_up)
Sorry, perhaps I was not clear; anyway, could you try making the path in the
*List* an absolute one? e.g.
List(/home/yh/src/pj/spark-stuffs/target/scala-2.10/simple-project_2.10-1.0.jar)
In order to provide a relative path, you first need to figure out your CWD,
so you can do (to be really
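If you'd rather not hard-code the absolute path, a small sketch of deriving it from the CWD (the relative path is the one from this thread):

import org.apache.spark.SparkConf

// Resolve the relative jar path against the current working directory.
val jar = new java.io.File(
  "target/scala-2.10/simple-project_2.10-1.0.jar").getAbsolutePath
val conf = new SparkConf().setJars(Seq(jar))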
For textFile I believe we overload it and let you set a codec directly:
https://github.com/apache/spark/blob/master/core/src/test/scala/org/apache/spark/FileSuite.scala#L59
For saveAsSequenceFile yep, I think Mark is right, you need an option.
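For the text case, a quick sketch of what that looks like (the output path and data are placeholders):

import org.apache.hadoop.io.compress.GzipCodec

val rdd = sc.parallelize(1 to 100).map(_.toString)
rdd.saveAsTextFile("hdfs:///tmp/out", classOf[GzipCodec])  // writes gzip-compressed part files
val back = sc.textFile("hdfs:///tmp/out")                  // .gz parts are decompressed transparently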
On Wed, Apr 2, 2014 at 12:36 PM, Mark Hamstra
Hi Folks
I'm looking to buy some gear to run Spark. I'm quite well versed in Hadoop
server design, but there does not seem to be much Spark-related collateral
around infrastructure guidelines (or at least I haven't been able to find
them). My current thinking for server design is something
Is this a Scala-only feature?
http://spark.incubator.apache.org/docs/latest/api/pyspark/pyspark.rdd.RDD-class.html#saveAsTextFile
On Wed, Apr 2, 2014 at 5:55 PM, Patrick Wendell pwend...@gmail.com wrote:
For textFile I believe we overload it and let you set a codec directly:
There is a repartition method in pyspark master:
https://github.com/apache/spark/blob/master/python/pyspark/rdd.py#L1128
On Wed, Apr 2, 2014 at 2:44 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Update: I'm now using this ghetto function to partition the RDD I get back
when I call
Thanks for pointing that out.
On Wed, Apr 2, 2014 at 6:11 PM, Mark Hamstra m...@clearstorydata.comwrote:
First, you shouldn't be using spark.incubator.apache.org anymore, just
spark.apache.org. Second, saveAsSequenceFile doesn't appear to exist in
the Python API at this point.
On Wed,
Ah, now I see what Aaron was referring to. So I'm guessing we will get this
in the next release or two. Thank you.
On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra m...@clearstorydata.comwrote:
There is a repartition method in pyspark master:
Will be in 1.0.0
On Wed, Apr 2, 2014 at 3:22 PM, Nicholas Chammas nicholas.cham...@gmail.com
wrote:
Ah, now I see what Aaron was referring to. So I'm guessing we will get
this in the next release or two. Thank you.
On Wed, Apr 2, 2014 at 6:09 PM, Mark Hamstra
What I'd like is a way to capture the information provided on the stages
page (i.e. cluster:4040/stages via IndexPage). Looking through the
Spark code, it doesn't seem like it is possible to directly query for
specific facts such as how many tasks have succeeded or how many total
tasks there
Hi All,
I am interested in measuring the total network I/O, CPU and memory
consumed by a Spark job. I tried to find the related information in the logs and
the Web UI, but there doesn't seem to be sufficient information. Could anyone give me any
suggestions?
Thanks very much in advance.
Hi,
I want to aggregate (time-stamped) event data at daily, weekly and monthly
levels, stored in a directory in data/yyyy/mm/dd/dat.gz format. For example:
Each dat.gz file contains tuples in (datetime, id, value) format. I can
perform the aggregation as follows:
but this code doesn't seem to be
The driver stores the meta-data associated with the partition, but the
re-computation will occur on an executor. So if several partitions are
lost, e.g. due to a few machines failing, the re-computation can be striped
across the cluster making it fast.
On Wed, Apr 2, 2014 at 11:27 AM, David
Hey Phillip,
Right now there is no mechanism for this. You have to go in through the low
level listener interface.
We could consider exposing the JobProgressListener directly - I think it's
been factored nicely so it's fairly decoupled from the UI. The concern is
this is a semi-internal piece of
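Roughly, going through the listener interface looks like the sketch below (the event classes have shifted a bit across versions, so treat this as an outline rather than exact code):

import org.apache.spark.scheduler._

// Count successfully finished tasks instead of scraping the stages page.
class StageStatsListener extends SparkListener {
  @volatile var succeededTasks = 0
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd) {
    if (taskEnd.taskInfo.successful) succeededTasks += 1
  }
}

val listener = new StageStatsListener
sc.addSparkListener(listener)
// ... run jobs ...
println("tasks succeeded so far: " + listener.succeededTasks)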
Hi Philip,
In the upcoming release of Spark 1.0 there will be a feature that provides
for exactly what you describe: capturing the information displayed on the
UI in JSON. More details will be provided in the documentation, but for
now, anything before 0.9.1 can only go through JobLogger.scala,
Watch out with loading data from gzipped files. Spark cannot parallelize
the load of gzipped files, and if you do not explicitly repartition your
RDD created from such a file, everything you do on that RDD will run on a
single core.
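Something along these lines (the file name and partition count are just examples):

val raw  = sc.textFile("events.dat.gz")            // a .gz file loads as a single partition
val data = raw.repartition(sc.defaultParallelism)  // spread the work across the cluster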
On Wed, Apr 2, 2014 at 8:22 PM, K Koh den...@gmail.com wrote:
Targeting 0.9.0 should work out of the box (just a change to the build.sbt)
- I'll push some changes I've been sitting on to the public repo in the
next couple of days.
On Wed, Apr 2, 2014 at 4:05 AM, Krakna H shankark+...@gmail.com wrote:
Thanks for the update Evan! In terms of using MLI, I
I would suggest starting with cloud hosting if you can; depending on your
use case, memory requirements may vary a lot.
Regards
Mayur
On Apr 2, 2014 3:59 PM, Matei Zaharia matei.zaha...@gmail.com wrote:
Hey Steve,
This configuration sounds pretty good. The one thing I would consider is
having
Hi Matei,
How can I run multiple Spark workers per node? I am running an 8-core, 10-node
cluster, but I do have 8 more cores on each node, so having 2 workers per
node will definitely help my use case.
Thanks.
Deb
On Wed, Apr 2, 2014 at 3:58 PM, Matei Zaharia matei.zaha...@gmail.comwrote:
Hey
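If this is the standalone deploy mode, the usual knob is SPARK_WORKER_INSTANCES in spark-env.sh, roughly like the sketch below (the core and memory numbers are placeholders sized to fit 2 workers on a 16-core box):

# spark-env.sh on each worker host (standalone mode)
export SPARK_WORKER_INSTANCES=2
export SPARK_WORKER_CORES=8
export SPARK_WORKER_MEMORY=24g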
Hi,
Shall I send my questions to this Email address?
Sorry for bothering, and thanks a lot!
Yes, please do. :)
On Wed, Apr 2, 2014 at 7:36 PM, weida xu xwd0...@gmail.com wrote:
Hi,
Shall I send my questions to this Email address?
Sorry for bothering, and thanks a lot!
Hi,
We are applying business logic to the incoming data stream using Spark
Streaming. Here I want to point a Shark table to use the data coming from Spark
Streaming. Instead of storing the Spark Streaming output to HDFS or another area, is
there a way I can directly point a Shark in-memory table to take data from
Spark
Hi,
I'm trying to run this script in Shark (0.8.1): insert into emp (id, name)
values (212, Abhi), but it doesn't work.
I urgently need direct insert as it is a show stopper.
I know that we can do insert into emp select * from xyz.
Here the requirement is a direct insert.
Has anyone tried it? Or is there
Hi,
I have a small program, but I cannot seem to make it connect using the right
properties of the cluster.
I have SPARK_YARN_APP_JAR, SPARK_JAR and SPARK_HOME set properly.
If I run this Scala file, I see that it never uses the
yarn.resourcemanager.address property that I set
I deployed Mesos and tested it using the example/test-framework script; Mesos
seems OK. But when running Spark on the Mesos cluster, the Mesos slave nodes
report the following exception. Can anyone help me fix this? Thanks in
advance:
14/04/03 11:24:39 INFO Slf4jLogger: Slf4jLogger started
14/04/03
any advice ?
2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com:
I deployed Mesos and tested it using the example/test-framework script;
Mesos seems OK. But when running Spark on the Mesos cluster, the Mesos slave
nodes report the following exception. Can anyone help me fix this?
thanks
I think this is related to a known issue (regression) in 0.9.0. Try using an
explicit IP other than the loopback address.
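For example, something like this in spark-env.sh on the affected host (the address is a placeholder for a routable IP on your network):

export SPARK_LOCAL_IP=192.168.1.10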
Sent from a mobile device
On Apr 2, 2014, at 8:53 PM, panfei cnwe...@gmail.com wrote:
any advice ?
2014-04-03 11:35 GMT+08:00 felix cnwe...@gmail.com:
I deployed mesos and test
For various SchemaRDD functions like select, where, orderBy, groupBy, etc., I
would like to create expression objects and pass these to the methods for
execution.
Can someone show some examples of how to create expressions for a case class
and execute them? E.g., how to create expressions for select,
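Something like the following sketch, based on the Spark SQL (1.0) language-integrated query DSL; the case class and data are made up, and importing the SQLContext members brings in the implicits that turn Scala symbols into expressions:

case class Person(name: String, age: Int)

val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext._   // implicit conversions, incl. symbols -> attribute expressions

val people = sc.parallelize(Seq(Person("alice", 30), Person("bob", 17)))
val adults = people.where('age >= 18).select('name)   // expressions built from symbols
adults.collect()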