I figured it out. I had to use pyspark.files.SparkFiles to get the
locations of files loaded into Spark.
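For reference, a minimal sketch of the same pattern on the Scala side (file name and path are placeholders; PySpark's SparkFiles behaves the same way):
sc.addFile("hdfs:///data/lookup.csv")  // ship the file to every node
val localPath = org.apache.spark.SparkFiles.get("lookup.csv")  // absolute local path on each worker
val lines = scala.io.Source.fromFile(localPath).getLines().toList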
On Mon, Nov 17, 2014 at 1:26 PM, Sean Owen so...@cloudera.com wrote:
You are changing these paths and filenames to match your own actual
scripts and file locations right?
On Nov 17, 2014
My sbt file for the project includes this:
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.1.0",
  "org.apache.spark" %% "spark-mllib" % "1.1.0",
  "org.apache.commons" % "commons-math3" % "3.3"
)
Still I am
Add this jar
http://mvnrepository.com/artifact/org.apache.commons/commons-math3/3.3
while creating the SparkContext:
sc.addJar("/path/to/commons-math3-3.3.jar")
And make sure it is shipped and available in the Environment tab of the web UI (port 4040).
Thanks
Best Regards
On Mon, Nov 17, 2014 at 1:54 PM, Ritesh
Include commons-math3-3.3 on the classpath while submitting the jar to the Spark
cluster, like:
spark-submit --driver-class-path commons-math3-3.3.jar --class MainClass --master
<spark cluster url> <app jar>
On Mon, Nov 17, 2014 at 1:55 PM, Ritesh Kumar Singh [via Apache Spark User
List]
One rule of thumb is to use rdd.toDebugString and check the lineage for
ShuffledRDD. As long as there's no need to restructure the RDD,
operations can be pipelined on each partition.
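For example, a quick sketch of inspecting a lineage (the RDD here is made up for illustration):
val counts = sc.textFile("hdfs:///logs/input.txt")
  .flatMap(_.split(" "))
  .map(word => (word, 1))
  .reduceByKey(_ + _)  // introduces a ShuffledRDD, i.e. a stage boundary
println(counts.toDebugString)  // the flatMap/map steps are pipelined; the shuffle shows up as ShuffledRDD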
rdd.toDebugString is your friend :-)
-kr, Gerard.
On Mon, Nov 17, 2014 at 7:37 AM, Mukesh Jha
Hi,
I was going through the graphx section in the Spark API in
https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.graphx.lib.ShortestPaths$
Here, I find the word landmark. Can anyone explain to me what landmark
means here? Is it a simple English word or does it mean
The downloads just happen once so this is not a problem.
If you are just building one module in a project, it needs a compiled
copy of the other modules. It will either use your locally-built and
locally-installed artifact, or download one from the repo if
possible.
This isn't needed if you are
Hi all,
In this presentation
(https://prezi.com/1jzqym68hwjp/spark-gotchas-and-anti-patterns/) it
mentions that Spark Streaming's behaviour is undefined if a batch overruns
the polling interval. Is this something that might be addressed in future
or is it fundamental to the design?
Hi,
JavaRDD<Instrument> studentsData = sc.parallelize(list); // list is the student info, a List<Student>
studentsData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");
The above statements save the student information to HDFS as a text file.
Each object is a line in the text file, as below.
I am using Apache Hadoop 1.2.1. I wanted to use Spark SQL with Hive, so I
tried to build Spark like so:
mvn -Phive,hadoop-1.2 -Dhadoop.version=1.2.1 clean -DskipTests package
But I get the following error.
The requested profile hadoop-1.2 could not be activated because it does
not
Hello,
I use a Spark standalone cluster and I want to somehow measure internode
communication.
As I understood, GraphX transfers only vertex values. Am I right?
But I do not want just the number of bytes which were transferred between any two
nodes.
So is there a way to measure how many values
You can use sc.objectFile
(https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.SparkContext)
to read it. It will be of type RDD[Student].
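For example, a minimal sketch, assuming the data was actually written with saveAsObjectFile (objectFile reads serialized records, not plain text):
val students = sc.objectFile[Student]("hdfs://master/data/spark/instruments.txt")
students.take(5).foreach(println)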
Thanks
Best Regards
On Mon, Nov 17, 2014 at 4:03 PM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Hi,
JavaRDD<Instrument>
You can use Ganglia to see the overall data transfer across the
cluster/nodes. I don't think there's a direct way to get the vertices being
transferred.
Thanks
Best Regards
On Mon, Nov 17, 2014 at 4:29 PM, Hlib Mykhailenko hlib.mykhaile...@inria.fr
wrote:
Hello,
I use Spark Standalone
Hello Naveen,
I think you should first override the toString method of your
sample.spark.test.Student class.
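For instance, a hypothetical sketch of the idea in Scala (the real class is Java and its fields are unknown, so the names here are made up):
class Student(val id: Int, val name: String) extends Serializable {
  // saveAsTextFile writes each element's toString, so make it human-readable
  override def toString: String = s"$id,$name"
}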
--
Cordialement,
Hlib Mykhailenko
Doctorant à INRIA Sophia-Antipolis Méditerranée
2004 Route des Lucioles BP93
06902 SOPHIA ANTIPOLIS cedex
- Original Message -
From:
Oops, I guess this is the right way to do it:
mvn -Phive -Dhadoop.version=1.2.1 clean -DskipTests package
This should fix it --
def func(str: String): DenseMatrix[Double] = {
...
...
}
So, why is this required?
Think of it like this -- If you hadn't explicitly mentioned Double, it
might have been that the calling function expected a
DenseMatrix[SomeOtherType], and performed a
Hi,
I am building spark on the most recent master branch.
I checked this page:
https://github.com/apache/spark/blob/master/docs/sql-programming-guide.md
The command ./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly works
fine. A fat jar is created.
However, when I started the SQL-CLI,
Yeah, it works.
Although when I try to define a var of type DenseMatrix, like this:
var mat1: DenseMatrix[Double]
It gives an error saying we need to initialise the matrix mat1 at the time
of declaration. I had to initialise it as:
var mat1: DenseMatrix[Double] = DenseMatrix.zeros[Double](1,1)
I've never used Mesos, sorry.
On Fri, Nov 14, 2014 at 5:30 PM, Steve Lewis lordjoe2...@gmail.com wrote:
The cluster runs Mesos and I can see the tasks in the Mesos UI but most
are not doing much - any hints about that UI
On Fri, Nov 14, 2014 at 11:39 AM, Daniel Siegmann
I am not sure there is a direct way (an API in GraphX, etc.) to measure the
number of vertex values transferred among nodes during computation.
It might depend on:
- the operations in your application, e.g. whether each vertex communicates only
with its immediate neighbours.
- the partition strategy
Thanks. It works for me.
--
Aaron Lin
On Saturday, November 15, 2014 at 1:19 AM, Xiangrui Meng wrote:
If you use the Kryo serializer, you need to register mutable.BitSet and Rating:
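A sketch of how that registration might look (the registrator class name and the conf wiring are assumptions):
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.apache.spark.mllib.recommendation.Rating
import scala.collection.mutable

class ALSRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo) {
    kryo.register(classOf[mutable.BitSet])
    kryo.register(classOf[Rating])
  }
}

val conf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", classOf[ALSRegistrator].getName)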
You should *never* use accumulators for this purpose because you may get
incorrect answers. Accumulators can count the same thing multiple times -
you cannot rely upon the correctness of the values they compute. See
SPARK-732 https://issues.apache.org/jira/browse/SPARK-732 for more info.
On Sun,
Hey Hao,
Which commit are you using? Just tried 64c6b9b with exactly the same
command line flags, couldn't reproduce this issue.
Cheng
On 11/17/14 10:02 PM, Hao Ren wrote:
Hi,
I am building spark on the most recent master branch.
I checked this page:
We use Algebird for calculating things like min/max, stddev, variance, etc.
https://github.com/twitter/algebird/wiki
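For example, a small sketch of that style using Algebird's Max/Min wrappers (the RDD is made up for illustration):
import com.twitter.algebird.{Max, Min}
import com.twitter.algebird.Operators._  // provides + for anything with a Semigroup

val nums = sc.parallelize(Seq(3.0, 1.0, 7.0))
val maxVal = nums.map(Max(_)).reduce(_ + _).get  // 7.0
val minVal = nums.map(Min(_)).reduce(_ + _).get  // 1.0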
-Suren
On Mon, Nov 17, 2014 at 11:32 AM, Daniel Siegmann daniel.siegm...@velos.io
wrote:
You should *never* use accumulators for this purpose because you may get
incorrect
I have a 1 million row file that I'd like to read from my edge node, and then
send a copy of it to each Hadoop machine's memory in order to run JOINs in
my spark streaming code.
I see examples in the docs of how to use broadcast() for a simple array,
but how about when the data is in a textFile?
As you are using sbt, you need not put it in ~/.m2/repository for Maven.
Include the jar explicitly using the --driver-class-path option
while submitting the jar to the Spark cluster.
On Mon, Nov 17, 2014 at 7:41 PM, Ritesh Kumar Singh [via Apache Spark User
List] ml-node+s1001560n1907...@n3.nabble.com
Sorry for spamming.
Just after my previous post, I noticed that the command used is:
./sbt/sbt -Phive -Phive-thirftserver clean assembly/assembly
It should be -Phive-thriftserver; the typo was the culprit. Stupid me.
I believe I just copy-pasted it from somewhere else but never even checked it,
meanwhile no error
Looks like this was where you got that commandline:
http://search-hadoop.com/m/JW1q5RlPrl
Cheers
On Mon, Nov 17, 2014 at 9:44 AM, Hao Ren inv...@gmail.com wrote:
Sry for spamming,
Just after my previous post, I noticed that the command used is:
./sbt/sbt -Phive -Phive-thirftserver clean
So let us say I have RDDs A and B with the following values.
A = [ (1, 2), (2, 4), (3, 6) ]
B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]
I want to apply an inner join, such that I get the following as a result.
C = [ (1, (2, 3)), (2, (4, 5)), (3, (6,6)) ]
That is, those keys which are not
While testing sparkSQL, we were running this group by with expression query
and got an exception. The same query worked fine on hive.
SELECT from_unixtime(floor(xyz.whenrequestreceived/1000.0 - 25200),
'/MM/dd') as pst_date,
count(*) as num_xyzs
FROM
all_matched_abc
GROUP BY
The Spark UI lists a number of executor IDs on the cluster. I would like
to access both the executor ID and task/attempt IDs from code inside a
function running on a slave machine.
Currently my motivation is to examine parallelism and locality, but in
Hadoop this aids in allowing code to write
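One hedged sketch of reading that information inside a task (TaskContext.get only exists from Spark 1.2 on, so treat its availability as an assumption):
rdd.mapPartitions { iter =>
  val executorId = org.apache.spark.SparkEnv.get.executorId  // ID of the executor running this task
  val partitionId = org.apache.spark.TaskContext.get.partitionId
  iter.map(x => s"executor=$executorId partition=$partitionId value=$x")
}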
Hi,
I have been using Spark Streaming in standalone mode and now I want to
migrate to Spark running on YARN, but I am not sure how you would go about
designating a specific node in the cluster to act as an Avro
listener, since I am using the Flume-based push approach with Spark.
OK then I'd still need to write the code (within my spark streaming code I'm
guessing) to convert my text file into an object like a HashMap before
broadcasting.
How can I make sure only the HashMap is being broadcast while all the
pre-processing to create the HashMap is only performed once?
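A rough sketch of that pattern (file path, delimiter and the DStream name are assumptions): build the map once on the driver, then broadcast only the finished map; executors never re-run the parsing:
val lookup: Map[String, String] =
  scala.io.Source.fromFile("/path/on/edge/node/lookup.txt").getLines()
    .map { line => val Array(k, v) = line.split("\t"); (k, v) }
    .toMap  // built exactly once, on the driver
val lookupBc = sc.broadcast(lookup)  // only the map itself is shipped to executors

stream.map(rec => (rec, lookupBc.value.getOrElse(rec, "unknown")))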
Thanks for the link to the bug.
Unfortunately, using accumulators like this is getting spread around as a
recommended practice despite the bug.
From: Daniel Siegmann [mailto:daniel.siegm...@velos.io]
Sent: Monday, November 17, 2014 8:32 AM
To: Segerlind, Nathan L
Cc: user
Subject: Re:
Just RDD.join() should be an inner join.
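A minimal sketch with the values from the question:
val A = sc.parallelize(Seq((1, 2), (2, 4), (3, 6)))
val B = sc.parallelize(Seq((1, 3), (2, 5), (3, 6), (4, 5), (5, 6)))
val C = A.join(B)  // keeps only keys present in both RDDs
C.collect().foreach(println)  // (1,(2,3)), (2,(4,5)), (3,(6,6))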
On Mon, Nov 17, 2014 at 5:51 PM, Blind Faith person.of.b...@gmail.com wrote:
So let us say I have RDDs A and B with the following values.
A = [ (1, 2), (2, 4), (3, 6) ]
B = [ (1, 3), (2, 5), (3, 6), (4, 5), (5, 6) ]
I want to apply an inner join,
"only option is to split your problem further by increasing parallelism"
My understanding is that this means increasing the number of partitions, is that right?
That didn't seem to help, because it seems the partitions are not uniformly
sized. My observation is when I increase the number of partitions, it
Minor correction: there was a typo in the command line:
hive-thirftserver should be hive-thriftserver
Cheers
On Thu, Aug 7, 2014 at 6:49 PM, Cheng Lian lian.cs@gmail.com wrote:
Things have changed a bit in the master branch, and the SQL programming
guide in master branch actually doesn’t apply
Spark 1.1.0, running on an AWS EMR cluster using yarn-client as master.
I'm getting the following error when I attempt to save an RDD to S3. I've
narrowed it down to a single partition that is ~150 MB in size (versus the
other partitions that are closer to 1 MB). I am able to work around this by
Hi all,
I found the reason for this issue. It seems in the new version, if I do not
specify spark.default.parallelism in KafkaUtils.createStream, there will be
an exception starting from the Kafka stream creation stage. In the previous
versions, it seems Spark will use the default value.
Thanks!
Bill
On
Hello,
We're running a spark sql thriftserver that several users connect to with
beeline. One limitation we've run into is that the current working database
(set with use db) is shared across all connections. So changing the
database on one connection changes the database for all connections.
I have been playing with using accumulators (despite the possible error with
multiple attempts). These provide a convenient way to get some numbers while
still performing business logic.
I posted some sample code at http://lordjoesoftware.blogspot.com/.
Even if accumulators are not perfect today -
Hi Daniel,
Yes, that should work also. However, is it possible to set things up so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)? Is there a mechanism similar to MR where we can ensure each
partition is assigned some amount of data by size, by setting some
I'm not aware of any such mechanism.
On Mon, Nov 17, 2014 at 2:55 PM, Pala M Muthaia mchett...@rocketfuelinc.com
wrote:
Hi Daniel,
Yes that should work also. However, is it possible to setup so that each
RDD has exactly one partition, without repartitioning (and thus incurring
extra cost)?
This is an unfortunate/known issue that we are hoping to address in the
next release: https://issues.apache.org/jira/browse/SPARK-2087
I'm not sure how straightforward a fix would be, but it would involve
keeping / setting the SessionState for each connection to the server. It
would be great if
You are perhaps hitting an issue that was fixed by #3248
https://github.com/apache/spark/pull/3248?
On Mon, Nov 17, 2014 at 9:58 AM, Sadhan Sood sadhan.s...@gmail.com wrote:
While testing sparkSQL, we were running this group by with expression
query and got an exception. The same query worked
What version of Spark SQL?
On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote:
Hi all,
We ran SparkSQL on the TPCDS benchmark Q19 with spark.sql.codegen=true; we
got exceptions as below. Has anyone else seen these before?
java.lang.ExceptionInInitializerError
at
Hi, I'm running a standalone cluster with 8 worker servers.
I'm developing a streaming app that is adding new lines of text to several
different RDDs each batch interval. Each line has a well randomized unique
identifier that I'm trying to use for partitioning, since the data stream
does contain
Hi Bill,
Would you mind describing what you found a little more specifically? I'm not
sure there's a parameter in KafkaUtils.createStream where you can specify the
Spark parallelism; also, what is the exception stack?
Thanks
Jerry
From: Bill Jay [mailto:bill.jaypeter...@gmail.com]
Sent:
Yes, thank you for the suggestion. The error I found below was in the worker
logs.
AssociationError [akka.tcp://sparkwor...@cloudera01.local.company.com:7078]
- [akka.tcp://sparkexecu...@cloudera01.local.company.com:33329]: Error
[Association failed with
Hi,
so I didn't manage to get the Broadcast variable with a new value
distributed to my executors in YARN mode. In local mode it worked fine, but
when running on YARN either nothing happened (when unpersist() was called
on the driver) or I got a TimeoutException (when called on the executor).
I
Hi,
On Fri, Nov 14, 2014 at 2:49 PM, Jianshi Huang jianshi.hu...@gmail.com
wrote:
Ok, then we need another trick.
let's have an *implicit lazy var connection/context* around our code. And
setup() will trigger the eval and initialization.
Due to lazy evaluation, I think having
Hi Michael,
We use Spark v1.1.1-rc1 with jdk 1.7.0_51 and scala 2.10.4.
On Tue, Nov 18, 2014 at 7:09 AM, Michael Armbrust mich...@databricks.com
wrote:
What version of Spark SQL?
On Sat, Nov 15, 2014 at 10:25 PM, Eric Zhen zhpeng...@gmail.com wrote:
Hi all,
We run SparkSQL on TPCDS
Hi Charles,
I am not aware of other storage formats. Perhaps Sean or Sandy can
elaborate more given their experience with Oryx.
There is work by Smola et al. at Google that talks about large-scale model
update and deployment.
Yes, it always appears on a part of the whole set of tasks in a stage (i.e. 100/100
(65 failed)), and sometimes causes the stage to fail.
And there is another error that I'm not sure is related.
java.lang.NoClassDefFoundError: Could not initialize class
I'm just using PMML. I haven't hit any limitation of its
expressiveness for the model types it supports. I don't think there
is a point in defining a new format for models, except that PMML
can get very big. Still, just compressing the XML gets it down to a
manageable size for just about any
I see. Agreed that lazy eval is not suitable for proper setup and teardown.
We also abandoned it due to the inherent incompatibility between implicit and
lazy. It was fun to come up with this trick though.
Jianshi
On Tue, Nov 18, 2014 at 10:28 AM, Tobias Pfeiffer t...@preferred.jp wrote:
Hi,
On Fri,
Hi,
I just ran the PageRank code in GraphX with some sample data. What I am
seeing is that the total rank changes drastically if I change the number of
iterations from 10 to 100. Why is that so?
Thank You
Hi,
I have a list of Students whose size is one lakh (100,000) and I am trying to
save it to a file. It is throwing a null pointer exception.
JavaRDD<Student> distData = sc.parallelize(list);
distData.saveAsTextFile("hdfs://master/data/spark/instruments.txt");
14/11/18 01:33:21 WARN scheduler.TaskSetManager: Lost
I am trying to use Naive Bayes for a project of mine in Python and I want
to obtain the probability value after having built the model.
Suppose I have two classes - A and B. Currently there is an API to find
which class a sample belongs to (predict). Now, I want to find the
probability of it
Any notable issues with using Scala 2.11? Is it stable now?
Or can I use Scala 2.11 in my Spark application and use a Spark dist built
with 2.10?
I'm looking forward to migrating to 2.11 for some quasiquote features.
Couldn't make it run in 2.10...
Cheers,
--
Jianshi Huang
LinkedIn: jianshi
This was recently discussed on this mailing list. You can't get the
probabilities out directly now, but you can hack a bit to get the internal
data structures of NaiveBayesModel and compute it from there.
If you really mean the probability of either A or B, then if your classes
are exclusive it
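A hedged sketch of that hack in Scala, assuming MLlib's NaiveBayesModel still exposes labels, pi (log class priors) and theta (log conditional probabilities); these are internals and may change:
import org.apache.spark.mllib.classification.NaiveBayesModel
import org.apache.spark.mllib.linalg.Vector

def classPosteriors(model: NaiveBayesModel, features: Vector): Array[Double] = {
  val x = features.toArray
  val logP = model.labels.indices.map { i =>
    model.pi(i) + model.theta(i).zip(x).map { case (t, xi) => t * xi }.sum
  }
  val m = logP.max
  val unnorm = logP.map(lp => math.exp(lp - m))  // shift for numerical stability
  val total = unnorm.sum
  unnorm.map(_ / total).toArray
}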
It is safe in the sense that we would help you with the fix if you run into
issues. I have used it, but since I worked on the patch my opinion may be
biased. I am using Scala 2.11 for day-to-day development. You should
check out the build instructions here:
I ran into this issue with Spark on YARN version 1.0.2. Are there any
hints?
Simple join would do it.
val a: List[(Int, Int)] = List((1,2),(2,4),(3,6))
val b: List[(Int, Int)] = List((1,3),(2,5),(3,6), (4,5),(5,6))
val A = sparkContext.parallelize(a)
val B = sparkContext.parallelize(b)
val ac = new PairRDDFunctions[Int, Int](A)
val C =
Looks like sbt/sbt -Pscala-2.11 is broken by a recent patch for improving
the Maven build.
Prashant Sharma
On Tue, Nov 18, 2014 at 12:57 PM, Prashant Sharma scrapco...@gmail.com
wrote:
It is safe in the sense we would help you with the fix if you run into
issues. I have used it, but since I
Make sure your list is not null; if it is null then it's more like doing:
JavaRDD<Student> distData = sc.parallelize(null)
distData.foreach(println)
Thanks
Best Regards
On Tue, Nov 18, 2014 at 12:07 PM, Naveen Kumar Pokala
npok...@spcapitaliq.com wrote:
Hi,
I am having list Students
Hi Prashant Sharma,
It's not even ok to build with scala-2.11 profile on my machine.
Just check out master (c6e0c2ab1c29c184a9302d23ad75e4ccd8060242)
and run sbt/sbt -Pscala-2.11 clean assembly:
.. skip the normal part
[info] Resolving org.scalamacros#quasiquotes_2.11;2.0.1 ...
[warn] module
Hi All: I was submitting a spark_program.jar to a `spark on yarn cluster` from a
driver machine in yarn-client mode. Here is the spark-submit command I used:
./spark-submit --master yarn-client --class
com.charlie.spark.grax.OldFollowersExample --queue dt_spark