Thanks for the advice. But since I am not the administrator of our Spark
cluster, I can't do this. Is there any better solution based on the current
Spark setup?
At 2014-08-01 02:38:15, shijiaxin shijiaxin...@gmail.com wrote:
Have you tried to write another similar function like edgeListFile in the
At 2014-08-01 11:23:49 +0800, Bin wubin_phi...@126.com wrote:
I am wondering what the best way to construct a graph is.
Say I have some attributes for each user, and a specific weight for each user
pair. The way I am currently doing it is to first read the user information and
edge triples into two
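For reference, a minimal sketch of one common way to assemble such a graph with GraphX, assuming one user file and one edge file; the field layout, paths, and the UserAttr class below are illustrative assumptions, not from the original message:
import org.apache.spark.SparkContext
import org.apache.spark.graphx._
import org.apache.spark.rdd.RDD

// Hypothetical per-user attributes.
case class UserAttr(name: String, age: Int)

def buildGraph(sc: SparkContext): Graph[UserAttr, Double] = {
  // Vertices: (userId, attributes), parsed from a comma-separated user file.
  val users: RDD[(VertexId, UserAttr)] = sc.textFile("hdfs:///data/users.txt").map { line =>
    val f = line.split(",")
    (f(0).toLong, UserAttr(f(1), f(2).toInt))
  }
  // Edges: (srcUserId, dstUserId, weight), parsed from a comma-separated edge file.
  val edges: RDD[Edge[Double]] = sc.textFile("hdfs:///data/edges.txt").map { line =>
    val f = line.split(",")
    Edge(f(0).toLong, f(1).toLong, f(2).toDouble)
  }
  Graph(users, edges)
}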
Sometimes it is useful to convert an RDD into a DStream for testing purposes
(generating DStreams from historical data, etc.). Is there an easy way to do
this?
I could come up with the following inefficient way, but I'm not sure if there is
a better way to achieve this. Thoughts?
class
Is there a way to get an iterator from an RDD? Something like rdd.collect(), but
returning a lazy sequence and not a single array.
Context: I need to GZip processed data to upload it to Amazon S3. Since the
archive should be a single file, I want to iterate over the RDD, writing each
line to a local .gz file. File
I used the web UI of Spark and could see that the conf directory is in the CLASSPATH.
An abnormal thing is that when starting spark-shell I always get the following
info:
WARN NativeCodeLoader: Unable to load native-hadoop library for your
platform... using builtin-java classes where applicable
At first, I
When I use fewer partitions (like 6),
it seems that all the tasks will be assigned to the same machine, because that
machine has more than 6 cores. But this runs out of memory.
How can I set a smaller number of partitions and still use all the machines at the same time?
Attempting to build Spark from source on EC2 using sbt gives the error
sbt.ResolveException: unresolved dependency:
org.scala-lang#scala-library;2.10.2: not found. This only seems to happen on
EC2, not on my local machine.
To reproduce, launch a cluster using spark-ec2, clone the Spark
See https://issues.apache.org/jira/browse/SPARK-2579
It was also mentioned on the mailing list a while ago, and I have heard
tell of this from customers. I am trying to get to the bottom of it
too.
What version are you using, to start? I am wondering if it was fixed
in 1.0.x since I was not able
Hi
While trying to build Spark 0.9.2 using sbt, the build is failing because most
of the libraries cannot be resolved; sbt cannot fetch the libraries from
the specified location.
Please tell me what changes are required to build Spark using sbt.
Regards
Arun
Hi Simon,
I'm trying to do the same but I'm quite lost.
How did you do that? (Too direct? :)
Thanks and ciao,
r-
It looked like you were running in standalone mode (master set to
local[4]). That's how I ran it.
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
Hi,
We would like to use Spark SQL to store data in Parquet format and then
query that data using Impala.
We've tried to come up with a solution and it is working but it doesn't
seem good. So I was wondering if you guys could tell us what is the
correct way to do this. We are using Spark 1.0
Sorry, sent early, wasn't finished typing.
CREATE EXTERNAL TABLE
Then we can select the data using Impala. But this is registered as an
external table and must be refreshed if new data is inserted.
Obviously this doesn't seem good and doesn't seem like the correct solution.
How should we
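For what it's worth, a minimal sketch of the write side with Spark 1.0's SchemaRDD API; the Event class and all paths are placeholders, and Impala would still need an external table pointing at the output directory:
import org.apache.spark.sql.SQLContext

case class Event(id: Int, value: String)   // placeholder schema

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD          // implicit RDD[case class] -> SchemaRDD

val events = sc.textFile("hdfs:///raw/events").map { line =>
  val f = line.split(",")
  Event(f(0).toInt, f(1))
}
// Writes Parquet files that Impala can map with something like:
//   CREATE EXTERNAL TABLE events (id INT, value STRING)
//   STORED AS PARQUET LOCATION '/warehouse/events';
events.saveAsParquetFile("hdfs:///warehouse/events")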
TD,
We are seeing the same issue. We struggled through this until we found this
post and the workaround.
A quick fix in the Spark Streaming software will help a lot for others who
are encountering this and pulling their hair out over why RDDs on some
partitions are not computed (we ended up
Here's a piece of code. In your case, you are missing the call() method
inside the map function.
import java.util.Iterator;
import java.util.List;
import org.apache.commons.configuration.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import
Hi All,
My application works when I use spark-submit with master=local[*].
But if I deploy the application to a standalone cluster with
master=spark://master:7077, the application doesn't work and I get the
following exception:
14/08/01 05:18:51 ERROR TaskSchedulerImpl: Lost executor 0 on
Hi TD,
I've also been fighting this issue only to find the exact same solution you
are suggesting.
Too bad I didn't find either the post or the issue sooner.
I'm using a 1-second batch with N Kafka events (1-to-1 with the
state objects) per batch and only calling updateStateByKey
[Forking this thread.]
According to the Spark Programming Guide
http://spark.apache.org/docs/latest/programming-guide.html#rdd-persistence,
persisting RDDs with MEMORY_ONLY should not choke if the RDD cannot be held
entirely in memory:
If the RDD does not fit in memory, some partitions will not
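For reference, a minimal sketch of the storage levels being discussed; the RDD here is just placeholder data:
import org.apache.spark.storage.StorageLevel

val rdd = sc.parallelize(1 to 1000000)          // placeholder data
rdd.persist(StorageLevel.MEMORY_ONLY)           // partitions that don't fit are recomputed when needed
// rdd.persist(StorageLevel.MEMORY_AND_DISK)    // alternative: spill partitions that don't fit to disk
rdd.count()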
Hi, thanks all. I have a few more questions on this.
Suppose I don't want to pass a WHERE clause in my SQL; is there a way that
I can do this?
Right now I am trying to modify the JdbcRDD class by removing all the parameters
for lower bound and upper bound, but I am getting runtime exceptions.
Is
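For context, a sketch of the stock JdbcRDD usage: the constructor expects a query with exactly two '?' placeholders that it fills with the lower and upper bounds, which is why removing those parameters breaks it. The connection string, table, and columns below are hypothetical:
import java.sql.DriverManager
import org.apache.spark.rdd.JdbcRDD

// Assumes the JDBC driver jar is on the classpath.
val rows = new JdbcRDD(
  sc,
  () => DriverManager.getConnection("jdbc:mysql://dbhost/mydb", "user", "pass"),
  "SELECT id, name FROM users WHERE id >= ? AND id <= ?",  // the two ?s are required
  1L, 1000000L,                                            // lowerBound, upperBound
  10,                                                      // numPartitions
  rs => (rs.getInt("id"), rs.getString("name")))
println(rows.count())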
Isn't this your worker running out of its memory for computations,
rather than for caching RDDs? So it has enough memory when you don't
actually use a lot of the heap for caching, but when the cache uses
its share, you actually run out of memory. If I'm right, and even I am
not sure I have this
Hi,
I am trying to understand the query plan and the number of tasks / execution time
for a join query.
Consider the following example, creating two tables, emp and sal, with 100 appropriate
records in each table and a key for joining them.
EmpRDDRelation.scala
case class EmpRecord(key:
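Since the original EmpRDDRelation.scala is truncated here, the following is a self-contained sketch of the kind of setup being described, using Spark SQL 1.0 case-class tables; the record fields are assumptions:
import org.apache.spark.sql.SQLContext

case class EmpRecord(key: Int, name: String)   // assumed fields
case class SalRecord(key: Int, salary: Int)    // assumed fields

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD

val emp = sc.parallelize(1 to 100).map(i => EmpRecord(i, "emp" + i))
val sal = sc.parallelize(1 to 100).map(i => SalRecord(i, i * 1000))
emp.registerAsTable("emp")
sal.registerAsTable("sal")

// The join whose plan and task count are being investigated.
val joined = sqlContext.sql(
  "SELECT emp.key, emp.name, sal.salary FROM emp JOIN sal ON emp.key = sal.key")
joined.collect().foreach(println)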
On Fri, Aug 1, 2014 at 12:39 PM, Sean Owen so...@cloudera.com wrote:
Isn't this your worker running out of its memory for computations,
rather than for caching RDDs?
I’m not sure how to interpret the stack trace, but let’s say that’s true.
I’m even seeing this with a simple a =
Hi everyone
I haven't been receiving replies to my queries on the distribution list.
Not pissed, but I am actually curious to know if my messages are actually
going through or not. Can someone please confirm that my messages are getting
delivered via this distribution list?
Thanks,
Aniket
On 1
I am using 1.0.1. It does not matter to me whether it is the first or second
element. I would like to know how to extract the i-th element in the feature
vector (not the label).
data.features(i) gives the following error:
method apply in trait Vector cannot be accessed in
So is the only issue that Impala does not see changes until you refresh the
table? This sounds like a configuration that needs to be changed on the
Impala side.
On Fri, Aug 1, 2014 at 7:20 AM, Patrick McGloin mcgloin.patr...@gmail.com
wrote:
Sorry, sent early, wasn't finished typing.
CREATE
Hi Roberto,
Ultimately, the info you need is set here:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/HadoopRDD.scala#L69
Being a Spark newbie, I extended the org.apache.spark.rdd.HadoopRDD class as
HadoopRDDWithEnv, which takes in an additional parameter
Oh I'm sorry, I somehow misread your email as looking for the label. I
read too fast. That was pretty silly. This works for me though:
scala> val point = LabeledPoint(1, Vectors.dense(2, 3, 4))
point: org.apache.spark.mllib.regression.LabeledPoint = (1.0,[2.0,3.0,4.0])
scala> point.features(1)
res10: Double = 3.0
rdd.toLocalIterator will do almost what you want, but requires that each
individual partition fits in memory (rather than each individual line).
Hopefully that's sufficient, though.
On Fri, Aug 1, 2014 at 1:38 AM, Andrei faithlessfri...@gmail.com wrote:
Is there a way to get iterator from RDD?
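A small sketch of the gzip use case with toLocalIterator, under the per-partition memory caveat above; the data and output path are placeholders:
import java.io.{FileOutputStream, OutputStreamWriter}
import java.util.zip.GZIPOutputStream

val lines = sc.parallelize(Seq("line1", "line2", "line3"))   // placeholder data
val out = new OutputStreamWriter(
  new GZIPOutputStream(new FileOutputStream("/tmp/output.gz")))
try {
  // toLocalIterator pulls one partition at a time to the driver.
  lines.toLocalIterator.foreach { line =>
    out.write(line)
    out.write("\n")
  }
} finally {
  out.close()
}
// The single /tmp/output.gz file can then be uploaded to S3.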
We are using Sparks 1.0 for Spark Streaming on Spark Standalone cluster and
seeing the following error.
Job aborted due to stage failure: Task 3475.0:15 failed 4 times, most
recent failure: Exception failure in TID 216394 on host
hslave33102.sjc9.service-now.com: java.lang.Exception:
Hi all,
I have a scenario of a web application submitting multiple jobs to Spark.
These jobs may be operating on the same RDD.
Is it possible to cache() the RDD during one call
so that all subsequent calls can use the cached RDD?
basically, during one invocation
val rdd1 =
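A sketch of the pattern being asked about, assuming every web request goes through one long-lived SparkContext; the holder object and path are hypothetical:
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object SharedRdds {
  @volatile private var logs: RDD[String] = _

  // The first caller materializes and caches the RDD; later callers on the same
  // SparkContext reuse the cached blocks.
  def getLogs(sc: SparkContext): RDD[String] = synchronized {
    if (logs == null) {
      logs = sc.textFile("hdfs:///data/logs").cache()
    }
    logs
  }
}

// Request 1: triggers the first computation and caches the result.
// SharedRdds.getLogs(sc).count()
// Request 2: reuses the cached RDD.
// SharedRdds.getLogs(sc).filter(_.contains("ERROR")).count()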
I think this is the problem. I was working in a project that inherited some
other Akka dependencies (of a different version). I'm switching to a fresh new
project which should solve the problem.
Thanks,
Alex
From: Tathagata Das
I also ran into same issue. What is the solution?
Any pointers to this issue?
Thanks
I have what seems like a relatively straightforward task to accomplish, but I
cannot seem to figure it out from the Spark documentation or searching the
mailing list.
I have an RDD[(String, MyClass)] that I would like to group by the key, and
calculate the mean and standard deviation of the foo
Are you accessing the RDDs on raw data blocks and running independent
Spark jobs on them (that is, outside the DStream)? In that case this may
happen, as Spark Streaming will clean up the raw data based on the
DStream operations (if there is a window op of 15 mins, it will keep
the data around for 15
Hi,
I upgraded to 1.0.1 from 1.0 a couple of weeks ago and have been able to use
some of the features advertised in 1.0.1. However, I get some compilation
errors in some cases and based on user response, these errors have been
addressed in the 1.0.1 version and so I should not be getting these
This should be okay, but make sure that your cluster also has the right code
deployed. Maybe you have the wrong one.
If you built Spark from source multiple times, you may also want to try sbt
clean before sbt assembly.
Matei
On August 1, 2014 at 12:00:07 PM, SK (skrishna...@gmail.com) wrote:
The reason I want an RDD is because I'm assuming that iterating the
individual elements of an RDD on the driver of the cluster is much slower
than coming up with the mean and standard deviation using a
map-reduce-based algorithm.
I don't know the intimate details of Spark's implementation, but it
You're certainly not iterating on the driver. The Iterable you process
in your function is on the cluster and done in parallel.
On Fri, Aug 1, 2014 at 8:36 PM, Kristopher Kalish k...@kalish.net wrote:
The reason I want an RDD is because I'm assuming that iterating the
individual elements of an
Hi Akhil,
Thank you very much for your help and support.
Regards,
Rajesh
On Fri, Aug 1, 2014 at 7:57 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Here's a piece of code. In your case, you are missing the call() method
inside the map function.
import java.util.Iterator;
import
It is definitely possible to run multiple workers on a single node and have
each worker with the maximum number of cores (e.g. if you have 8 cores and
2 workers you'd have 16 cores per node). I don't know if it's possible with
the out of the box scripts though.
It's actually not really that
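For reference, a sketch of the standalone-mode settings usually used for this, set in conf/spark-env.sh on each worker node; the exact numbers are only illustrative:
# conf/spark-env.sh (standalone mode)
export SPARK_WORKER_INSTANCES=2   # run two worker daemons on this node
export SPARK_WORKER_CORES=8       # cores each worker advertises
export SPARK_WORKER_MEMORY=16g    # memory each worker can hand out to executors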
Computing the variance is similar to this example; you just need to keep
the sum of squares around as well.
The formula for variance is (sumsq/n) - (sum/n)^2.
But with big datasets or large values, you can quickly run into overflow
issues - MLlib handles this by maintaining the average sum of
Here's the more functional programming-friendly take on the
computation (but yeah, this is the naive formula):
rdd.groupByKey.mapValues { mcs =>
  val values = mcs.map(_.foo.toDouble)
  val n = values.size
  val sum = values.sum
  val sumSquares = values.map(x => x * x).sum
  math.sqrt(n * sumSquares - sum * sum) / n
}
We are using Spark 1.0.
I'm using DStream operations such as map, filter and reduceByKeyAndWindow
and doing a foreach operation on DStream.
Ignoring my warning about overflow - even more functional - just use a
reduceByKey.
Since your main operation is just a bunch of summing, you've got a
commutative-associative reduce operation and Spark will do everything
cluster-parallel, and then shuffle the (small) result set and merge
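A sketch of that reduceByKey formulation, still using the naive formula; MyClass and its foo field stand in for the types from the original question:
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

case class MyClass(foo: Double)          // stand-in for the poster's class

val rdd = sc.parallelize(Seq("a" -> MyClass(1.0), "a" -> MyClass(3.0), "b" -> MyClass(2.0)))

// One pass to (count, sum, sumOfSquares) per key, then derive mean and stddev.
val stats = rdd
  .mapValues(mc => (1L, mc.foo, mc.foo * mc.foo))
  .reduceByKey { case ((n1, s1, q1), (n2, s2, q2)) => (n1 + n2, s1 + s2, q1 + q2) }
  .mapValues { case (n, sum, sumSq) =>
    val mean = sum / n
    val stddev = math.sqrt(sumSq / n - mean * mean)
    (mean, stddev)
  }
stats.collect().foreach(println)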
Hi,
I've seen many threads about reading from HBase into Spark, but none about
how to read from OpenTSDB into Spark. Does anyone know anything about this?
I tried looking into it, but I think OpenTSDB saves its information into
HBase using hex and I'm not sure how to interpret the data. If you
Hi,
So I again ran sbt clean followed by all of the steps listed above to
rebuild the jars after cleaning. My compilation error still persists.
Specifically, I am trying to extract an element from the feature vector that
is part of a LabeledPoint as follows:
data.features(i)
This gives the
Thanks, Aaron, it should be fine with partitions (I can repartition it
anyway, right?).
But rdd.toLocalIterator is purely a Java/Scala method. Is there a Python
interface to it?
I can get a Java iterator through rdd._jrdd, but it isn't converted to a Python
iterator automatically. E.g.:
rdd =
Hey,
There is some work that started on IndexedRDD (on master I think).
Meanwhile, checking what has been done in GraphX regarding the vertex index in
partitions could be worthwhile, I guess.
HTH
Andy
On 1 August 2014 at 22:50, Philip Ogren philip.og...@oracle.com wrote:
Suppose I want to take my large
Me 3
On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:
I also ran into same issue. What is the solution?
Currently Scala 2.10.2 can't be pulled in from Maven Central, it seems;
however, if you have it in your Ivy cache it should work.
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:
Me 3
On Fri, Aug 1, 2014 at 11:15 AM, nit nitinp...@gmail.com wrote:
I also ran into
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
avro files into one avro file. I read it in with:
sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
AvroKeyInputFormat[GenericRecord]](path)
but I can't find saveAsHadoopFile or saveAsNewAPIHadoopFile. Can you
At 2014-08-01 14:50:22 -0600, Philip Ogren philip.og...@oracle.com wrote:
It seems that I could do this with mapPartition so that each element in a
partition gets added to an index for that partition.
[...]
Would it then be possible to take a string and query each partition's index
with it?
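A rough sketch of the per-partition index idea, using a plain in-memory Set of tokens as a stand-in for a real index; a Lucene-style index per partition would follow the same shape:
val docs = sc.parallelize(Seq("the quick brown fox", "lazy dog"), 2)   // placeholder data

// Build one "index" per partition (here just a token set) and keep it cached.
val partitionIndexes = docs.mapPartitions { iter =>
  val index: Set[String] = iter.flatMap(_.split("\\s+")).toSet
  Iterator(index)
}.cache()

// Query every partition's index with a string; one answer per partition.
def query(term: String): Array[Boolean] =
  partitionIndexes.map(_.contains(term)).collect()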
Have you tried
https://aws.amazon.com/articles/Elastic-MapReduce/4926593393724923
There is also a 0.9.1 version they talked about in one of the meetups.
Check out the S3 bucket in the guide; it should have a 0.9.1 version as
well.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
This is a Scala bug - I filed something upstream; hopefully they can fix it
soon and/or we can provide a workaround:
https://issues.scala-lang.org/browse/SI-8772
- Patrick
On Fri, Aug 1, 2014 at 3:15 PM, Holden Karau hol...@pigscanfly.ca wrote:
Currently scala 2.10.2 can't be pulled in from
This fails for me too. I have no idea why it happens as I can wget the pom
from maven central. To work around this I just copied the ivy xmls and jars
from this github repo
https://github.com/peterklipfel/scala_koans/tree/master/ivyrepo/cache/org.scala-lang/scala-library
and put it in
Thanks Patrick -- It does look like some maven misconfiguration as
wget
http://repo1.maven.org/maven2/org/scala-lang/scala-library/2.10.2/scala-library-2.10.2.pom
works for me.
Shivaram
On Fri, Aug 1, 2014 at 3:27 PM, Patrick Wendell pwend...@gmail.com wrote:
This is a Scala bug - I filed
I've had intermittent access to the artifacts themselves, but for me the
directory listing always 404's.
I think if sbt hits a 404 on the directory, it sends a somewhat confusing
error message that it can't download the artifact.
- Patrick
On Fri, Aug 1, 2014 at 3:28 PM, Shivaram Venkataraman
What is the use case you are looking at?
TSDB is not designed for you to query data directly from HBase; ideally you
should use the REST API if you are looking to do thin analysis. Are you looking
to do a whole reprocessing of TSDB?
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
Nice question :)
Ideally you should use the queueStream interface to push RDDs into a queue;
then Spark Streaming can handle the rest.
Though, why are you looking to convert an RDD to a DStream? Another workaround
folks use is to source the DStream from folders and move files that they need
reprocessed back into
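A minimal sketch of the queueStream approach mentioned above, assuming the historical data has already been split into one RDD per batch interval:
import scala.collection.mutable
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.{Seconds, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))

// One RDD per batch interval of historical data (placeholder contents).
val rddQueue = mutable.Queue[RDD[String]]()
rddQueue ++= Seq(
  sc.parallelize(Seq("event1", "event2")),
  sc.parallelize(Seq("event3")))

// Each interval, Spark Streaming dequeues one RDD and treats it as that batch's data.
val historicalStream = ssc.queueStream(rddQueue)
historicalStream.map(_.toUpperCase).print()

ssc.start()
ssc.awaitTermination()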
The only blocker is that an accumulator can only be added to from the slaves and
only read on the master. If that constraint fits you well, you can fire away.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 7:38 AM, Julien
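For reference, a small sketch of the accumulator constraint being described, with a placeholder input path:
// Tasks on the workers may only add to the accumulator...
val errorCount = sc.accumulator(0)
sc.textFile("hdfs:///data/logs").foreach { line =>
  if (line.contains("ERROR")) errorCount += 1
}
// ...while only the driver may read its value.
println(errorCount.value)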
All the operations being done use the DStream. I do read an RDD into
memory which is collected and converted into a map and used for lookups as
part of DStream operations. This RDD is loaded only once and converted into a
map that is then used on the streamed data.
Do you mean non-streaming jobs on
I meant: are you using RDDs generated by DStreams in Spark jobs outside
the DStream computation?
Something like this:
var globalRDD: RDD[String] = null   // element type is just a placeholder
dstream.foreachRDD { rdd =>
  // keep a global pointer to the RDDs generated by the DStream
  if (runningFirstTime) globalRDD = rdd
}
ssc.start()
...
Not at all. Don't have any such code.
I'm trying to get metrics out of TSDB so I can use Spark to do anomaly
detection on graphs.
The HTTP API would be the best bet. I assume by graph you mean the charts
created by TSDB frontends.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Fri, Aug 1, 2014 at 4:48 PM, bumble123 tc1...@att.com wrote:
I'm trying to
So is there no way to do this through Spark Streaming? Won't I have to do
batch processing if I use the HTTP API rather than getting it directly into
Spark?
Ah, that's unfortunate, that definitely should be added. Using a
pyspark-internal method, you could try something like
javaIterator = rdd._jrdd.toLocalIterator()
it = rdd._collect_iterator_through_file(javaIterator)
On Fri, Aug 1, 2014 at 3:04 PM, Andrei faithlessfri...@gmail.com wrote:
Then could you try giving me a log?
And as a workaround, disable automatic unpersisting by setting spark.streaming.unpersist = false
On Fri, Aug 1, 2014 at 4:10 PM, Kanwaldeep kanwal...@gmail.com wrote:
Not at all. Don't have any such code.
You have to import org.apache.spark.rdd._, which will automatically make
this method available.
Thanks,
Ron
Sent from my iPhone
On Aug 1, 2014, at 3:26 PM, touchdown yut...@gmail.com wrote:
Hi, I am facing a similar dilemma. I am trying to aggregate a bunch of small
avro files into one
Can you share the mapValues approach you did?
Thanks,
Ron
Sent from my iPhone
On Aug 1, 2014, at 3:00 PM, kriskalish k...@kalish.net wrote:
Thanks for the help everyone. I got the mapValues approach working. I will
experiment with the reduceByKey approach later.
3
-Kris
Here is the log file.
streaming.gz
http://apache-spark-user-list.1001560.n3.nabble.com/file/n11240/streaming.gz
There are quite a few AskTimeouts that happen for about 2 minutes,
followed by block-not-found errors.
Thanks
Kanwal
Yes, I saw that after I looked at it closer. Thanks! But I am running into a
schema not set error:
Writer schema for output key was not set. Use AvroJob.setOutputKeySchema()
I am in the process of figuring out how to set the schema for an AvroJob from an
HDFS file, but any pointer is much appreciated!
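A rough sketch of wiring the schema in and writing with the new Hadoop API, assuming avro-mapred is on the classpath and the schema file is readable from the driver; all paths are placeholders:
import org.apache.avro.Schema
import org.apache.avro.generic.GenericRecord
import org.apache.avro.mapred.AvroKey
import org.apache.avro.mapreduce.{AvroJob, AvroKeyInputFormat, AvroKeyOutputFormat}
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapreduce.Job
import org.apache.spark.SparkContext._   // pair-RDD functions in Spark 1.x

// Read the small Avro files, as in the earlier message.
val records = sc.newAPIHadoopFile[AvroKey[GenericRecord], NullWritable,
  AvroKeyInputFormat[GenericRecord]]("hdfs:///input/small-avro")

// Parse the writer schema (Schema.Parser can also read an InputStream, e.g. from HDFS).
val schema = new Schema.Parser().parse(new java.io.File("/local/path/record.avsc"))

val job = Job.getInstance(sc.hadoopConfiguration)   // Hadoop 2 API; on Hadoop 1 use new Job(conf)
AvroJob.setOutputKeySchema(job, schema)

records.saveAsNewAPIHadoopFile(
  "hdfs:///output/merged-avro",
  classOf[AvroKey[GenericRecord]],
  classOf[NullWritable],
  classOf[AvroKeyOutputFormat[GenericRecord]],
  job.getConfiguration)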