Hi,
I was checking the different storage levels of an RDD and found OFF_HEAP.
Has anybody used this level?
If I use this level, where will the data be stored? If it is not in the heap, does it mean
that we can avoid GC?
How can I use this level? I did not find anything in the archives regarding this.
Can someone
Hi,
I'm bumping this thread as I'm also interested in the answer. Can anyone help or
point to the information on how to do Spark Streaming from/to Cassandra?
Thanks!
Zarzyk
Hi Andrew,
Thanks for your explanation; I confirm that the entries show up in the history
server UI when I create empty APPLICATION_COMPLETE files for each of them.
Christophe.
On 03/07/2014 18:27, Andrew Or wrote:
Hi Christophe, another Andrew speaking.
Your configuration looks fine to me.
Hi Zarzyk:
If I were you, just to start, I would look at the following:
https://groups.google.com/forum/#!topic/spark-users/htQQA3KidEQ
http://www.slideshare.net/planetcassandra/south-bay-cassandrealtime-analytics-using-cassandra-spark-and-shark-at-ooyala
Hi,
Any update on the solution? We are still facing this issue...
We were able to connect to HBase with standalone code, but we are getting the issue with
the Spark integration.
Thx,
Ravi
From: nvn_r...@hotmail.com
To: u...@spark.incubator.apache.org; user@spark.apache.org
Subject: RE: Spark with HBase
Hi,
To cope with the META-INF issue that Sean is pointing out, my solution
is to replace the maven-assembly-plugin with the maven-shade-plugin, using the
ServicesResourceTransformer (
http://maven.apache.org/plugins/maven-shade-plugin/examples/resource-transformers.html#ServicesResourceTransformer)
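For reference, a minimal shade-plugin configuration along those lines might look like this (the plugin version and exact placement in your pom are assumptions; adjust to your build):

<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <version>2.3</version>
  <executions>
    <execution>
      <phase>package</phase>
      <goals><goal>shade</goal></goals>
      <configuration>
        <transformers>
          <!-- merge META-INF/services entries instead of letting one jar overwrite another's -->
          <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
        </transformers>
      </configuration>
    </execution>
  </executions>
</plugin>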
Hi Ajay,
StorageLevel OFF_HEAP means you can cache your RDD in Tachyon; the
prerequisite is that you deploy Tachyon alongside Spark.
Yes, it can alleviate GC, since you offload the data from JVM memory into system-managed
memory.
You can use rdd.persist(...) to use this level; details can be checked
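A minimal sketch of what that looks like (the RDD and path here are just placeholders):

import org.apache.spark.storage.StorageLevel

val lines = sc.textFile("hdfs:///data/input.txt")
lines.persist(StorageLevel.OFF_HEAP)  // blocks are stored in Tachyon instead of on the JVM heap
lines.count()                         // first action materializes the cached blocks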
Hi!
I have a Spark cluster running on top of a Cassandra cluster, using
Datastax' new driver, and one of the fields of my RDDs is an
XML string. In a normal Scala Spark job, parsing that data is no
problem, but I would like to also make that information available
through Spark SQL. So, is there
Hello Gurus,
Pardon me, I am a noob at Spark GraphX (Scala) and I seek your wisdom here.
I want to know how to do a graph traversal and do a selective merge on edges.
Thanks to the documentation :-) I could create a simple graph of employees and
their colleagues. The structure of the Graph is
We are getting this issue when we are running jobs with close to 1000
workers. Spark is built from the GitHub version and Mesos is 0.19.0.
ERROR storage.BlockManagerMasterActor: Got two different block manager
registrations on 201407031041-1227224054-5050-24004-0
From Googling, it seems that Mesos is
Hi all!
I have a folder with 150 GB of txt files (around 700 files, on average 200 MB
each).
I'm using scala to process the files and calculate some aggregate
statistics in the end. I see two possible approaches to do that: - manually
loop through all the files, do the calculations per file and
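For what it's worth, a rough sketch of the two approaches (paths and the aggregate are hypothetical; with the manual loop the per-file results still have to be combined on the driver):

// Approach 1: let Spark read the whole folder as one RDD
val all = sc.textFile("hdfs:///logs/*.txt")
val totalChars = all.map(_.length.toLong).reduce(_ + _)

// Approach 2: loop over the files manually and aggregate per file
val files = Seq("hdfs:///logs/f1.txt", "hdfs:///logs/f2.txt")  // ... 700 of them
val perFile = files.map(f => sc.textFile(f).map(_.length.toLong).reduce(_ + _))
val totalChars2 = perFile.sum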
Hi,
You can convert a standard RDD of a Product class (e.g. a case class) to a
SchemaRDD via SQLContext.
Load the data from Cassandra into an RDD of the case class, convert it to a SchemaRDD
and register it;
then you can use it in your SQL queries.
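A minimal sketch of that flow (Spark 1.0-era API; the Tweet case class and the data are placeholders for your Cassandra load):

import org.apache.spark.sql.SQLContext

case class Tweet(id: Long, body: String)

val sqlContext = new SQLContext(sc)
import sqlContext.createSchemaRDD  // implicit conversion RDD[Product] => SchemaRDD

// however you load from Cassandra, end up with an RDD[Tweet]
val tweets = sc.parallelize(Seq(Tweet(1L, "<doc><title>hello</title></doc>")))
tweets.registerAsTable("tweets")   // renamed registerTempTable in later releases

sqlContext.sql("SELECT id FROM tweets WHERE body LIKE '%title%'").collect()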
Takuya, thanks for your reply :)
I am already doing that, and it is working well. My question is, can I
define arbitrary functions to be used in these queries?
On Fri, Jul 4, 2014 at 11:12 AM, Takuya UESHIN ues...@happy-camper.st wrote:
Hi,
You can convert standard RDD of Product class (e.g.
In the end it turns out that the issue was caused by a config setting
in spark-defaults.conf. After removing this setting:
spark.files.userClassPathFirst true
things are back to normal. Just reporting in case someone else has
the same issue.
- Gurvinder
On 07/03/2014 06:49 PM, Gurvinder
Original Message
Subject: matchError:null in ALS.train
From: Honey Joshi honeyjo...@ideata-analytics.com
Date: Thu, July 3, 2014 8:12 am
To: user@spark.apache.org
Ah, sorry for misreading.
I don't think there is a way to use a UDF in your SQL with Spark SQL alone.
You might be able to do it with Spark's Hive support, but I'm sorry, I don't know it well.
I think you should apply the function before converting to a SchemaRDD, if you can.
Thanks.
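A minimal sketch of that workaround (the Record/Parsed case classes and parseXml are hypothetical, and the createSchemaRDD implicit is assumed to be in scope):

case class Record(id: Long, xml: String)
case class Parsed(id: Long, title: String)

// apply the parsing on the plain RDD, before the SchemaRDD conversion
def parseXml(xml: String): String = (scala.xml.XML.loadString(xml) \\ "title").text

val records = sc.parallelize(Seq(Record(1L, "<doc><title>hello</title></doc>")))
val parsed = records.map(r => Parsed(r.id, parseXml(r.xml)))
parsed.registerAsTable("parsed")   // now queryable without any UDF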
2014-07-04 18:16 GMT+09:00 Martin
On Fri, Jul 4, 2014 at 1:59 AM, Martin Gammelsæter
martingammelsae...@gmail.com wrote:
is there any way to write user defined functions for Spark SQL?
This is coming in Spark 1.1. There is a work-in-progress PR here:
https://github.com/apache/spark/pull/1063
If you have a hive context, you
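For the curious, a rough sketch of the kind of registration that work proposes, assuming the shape the API eventually took in Spark 1.1 (the function name and query are illustrative):

sqlContext.registerFunction("strLen", (s: String) => s.length)
sqlContext.sql("SELECT strLen(body) FROM tweets").collect()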
I am running spark-1.0.0 by connecting to a Spark standalone cluster which
has one master and two slaves. I ran wordcount.py with spark-submit; it actually
reads data from HDFS and also writes the results to HDFS. So far
everything is fine and the results will correctly be
Hi all,
I am stuck on an issue with running the Spark Pi example on HDP 2.0.
I downloaded the Spark 1.0 pre-built package from http://spark.apache.org/downloads.html
(for HDP2).
I run the example from the Spark web-site:
./bin/spark-submit --class org.apache.spark.examples.SparkPi --master
yarn-cluster --num-executors 3
Thanks for the help folks.
Adding the config files was necessary but not sufficient.
I also had Hadoop 1.0.4 classes on the classpath because of a bad jar:
spark-0.9.1/jars/spark-assembly-0.9.1-hadoop1.0.4.jar
was in my spark executor tar.gz (stored in HDFS).
I believe this was due to a bit
I would go with Spark only if you are certain that you are going to scale
out in the near future.
You can change the default storage of the RDD to DISK_ONLY; that might remove
issues around any RDD leveraging memory. There are some functions,
particularly sortByKey, that require data to fit in memory to
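A minimal sketch of switching an RDD to disk-only storage (the RDD itself is just a placeholder):

import org.apache.spark.storage.StorageLevel

val records = sc.textFile("hdfs:///data/records")
records.persist(StorageLevel.DISK_ONLY)  // blocks go to local disk instead of staying on the heap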
When using DISK_ONLY, keep in mind that disk I/O is pretty high. Make sure
you are writing to multiple disks for best operation. And even with
DISK_ONLY, we've found that there is a minimum threshold for executor ram
(spark.executor.memory), which for us seemed to be around 8 GB.
If you find
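By way of illustration, that kind of setup can be expressed in spark-defaults.conf (the values and disk paths below are examples, not recommendations):

spark.executor.memory   8g
spark.local.dir         /disk1/spark,/disk2/spark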
Hi,
I have a large dataset of text log files on which I need to implement
window analysis.
Say, extract per-minute data and do aggregated stats on the last X minutes.
I have to implement the windowing analysis with Spark.
This is the workflow I'm currently using:
- read a file and I create a new RDD
What I typically do is use a row_number subquery to filter based on that.
It works out pretty well and reduces the iteration. I think an offset solution
based on windowing directly would be useful.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi
Hi everybody
Can someone tell me if it is possible to read and filter a 60 GB file of
tweets (JSON docs) in a standalone Spark deployment that runs on a single
machine with 40 GB RAM and 8 cores?
I mean, is it possible to configure Spark to work with some amount of
memory (20 GB) and the rest
OK, I found these slides by Yin Huai (
http://spark-summit.org/wp-content/uploads/2014/07/Easy-json-Data-Manipulation-Yin-Huai.pdf
)
To read a JSON file the code seems pretty simple:
sqlContext.jsonFile(data.json) Is this already available in the
master branch???
But the question about the
Hi,
unfortunately, when I go with the above approach, I run into this problem:
http://mail-archives.apache.org/mod_mbox/kafka-users/201401.mbox/%3ccabtfevyxvtaqvnmvwmh7yscfgxpw5kmrnw_gnq72cy4oa1b...@mail.gmail.com%3E
That is, a NoNode error in Zookeeper when rebalancing. The Kafka receiver
will retry
You'll get most of that information from the Mesos interface. You may not get
the data-transfer information in particular.
Mayur Rustagi
Ph: +1 (760) 203 3257
http://www.sigmoidanalytics.com
@mayur_rustagi https://twitter.com/mayur_rustagi
On Thu, Jul 3, 2014 at 6:28 AM, Tobias Pfeiffer
The application server doesn't provide a JSON API, unlike the cluster
interface (8080).
If you are okay with patching Spark, you can use our patch to create a JSON API,
or you can use the SparkListener interface in your application to get that info
out.
Mayur Rustagi
Ph: +1 (760) 203 3257
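A minimal sketch of the SparkListener route (what you log or export from the callback is up to you; the fields used here are just examples):

import org.apache.spark.scheduler.{SparkListener, SparkListenerTaskEnd}

class TaskTimingListener extends SparkListener {
  override def onTaskEnd(taskEnd: SparkListenerTaskEnd): Unit = {
    val info = taskEnd.taskInfo
    println(s"stage ${taskEnd.stageId}, task ${info.taskId} finished in ${info.duration} ms")
  }
}

sc.addSparkListener(new TaskTimingListener)  // register before running jobs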
You should be able to use DynamoDBInputFormat (I think this should be part
of AWS libraries for Java) and create a HadoopRDD from that.
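To make that concrete, a very rough sketch of the HadoopRDD route (the input-format class, writable types, and configuration keys below are taken to come from the EMR DynamoDB connector jar and are assumptions here, since that jar is not distributed with Spark):

import org.apache.hadoop.mapred.JobConf
import org.apache.hadoop.io.Text

val jobConf = new JobConf(sc.hadoopConfiguration)
jobConf.set("dynamodb.input.tableName", "my-table")                   // assumed property name
jobConf.set("dynamodb.endpoint", "dynamodb.us-east-1.amazonaws.com")  // assumed property name

val rows = sc.hadoopRDD(
  jobConf,
  classOf[org.apache.hadoop.dynamodb.read.DynamoDBInputFormat],  // assumed class name
  classOf[Text],
  classOf[org.apache.hadoop.dynamodb.DynamoDBItemWritable])      // assumed class name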
On Fri, Jul 4, 2014 at 8:28 AM, Ian Wilkinson ia...@me.com wrote:
Hi,
I noticed mention of DynamoDB as input source in
Hi Nick,
I’m going to be working with python primarily. Are you aware of
comparable boto support?
ian
On 4 Jul 2014, at 16:32, Nick Pentreath nick.pentre...@gmail.com wrote:
You should be able to use DynamoDBInputFormat (I think this should be part of
AWS libraries for Java) and create a
No boto support for that.
In master there is Python support for loading Hadoop InputFormats. Not sure if
it will be in 1.0.1 or 1.1.
In the master docs under the programming guide there are instructions, and under the
examples project there are PySpark examples of using Cassandra and HBase. These
should
I should qualify by saying there is boto support for DynamoDB - but not for the
inputFormat. You could roll your own Python-based connection, but this involves
figuring out how to split the data in Dynamo - the inputFormat takes care of this,
so it should be the easier approach —
On
Trying to discover the source for the DynamoDBInputFormat.
Not appearing in:
- https://github.com/aws/aws-sdk-java
- https://github.com/apache/hive
Then came across
http://stackoverflow.com/questions/1704/jar-containing-org-apache-hadoop-hive-dynamodb.
Unsure whether this represents the latest
Interesting - I would have thought they would make that available publicly.
Unfortunately, unless you can use Spark on EMR, I guess your options are to
hack it by spinning up an EMR cluster and getting the JAR, or maybe fall
back to using boto and rolling your own :(
On Fri, Jul 4, 2014 at 9:28
Another alternative could be to use Spark Streaming's textFileStream with windowing
capabilities.
On Friday, July 4, 2014 9:52 AM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
You should think about a custom receiver, in order to solve the problem of the
“already collected”
Thanks for the replies
What is not completely clear to me is how time is managed.
I can create a DStream from files.
But if I set the window property, that will be bound to the application
time, right?
If I got it right, with a receiver I can control the way DStreams are
created.
But, how can
Hi, I want to use PySpark with YARN. But the documentation doesn't give me a full
understanding of what's going on, and I simply don't understand the code. So:
1) How is Python shipped to the cluster? Should the machines in the cluster already
have Python?
2) What happens when I write some Python code in a map function -
The windowing capabilities of Spark Streaming determine the events in the RDD
created for that time window. If the duration is 1s, then all the events
received in a particular 1s window will be part of the RDD created for that
window for that stream.
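A minimal sketch of that, using textFileStream with a window (the directory, durations, and the count are illustrative):

import org.apache.spark.streaming.{Seconds, Minutes, StreamingContext}

val ssc = new StreamingContext(sc, Seconds(1))            // 1s batch duration
val lines = ssc.textFileStream("hdfs:///incoming/logs")   // picks up new files in the directory

// aggregate over the last 10 minutes, recomputed every minute
val counts = lines.window(Minutes(10), Minutes(1)).count()
counts.print()

ssc.start()
ssc.awaitTermination()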
On Friday, July 4, 2014 1:28 PM,
Sweet. Any idea about when this will be merged into master?
It is probably going to be a couple of weeks. There is a fair amount of
cleanup that needs to be done. It works though and we used it in most of
the demos at the spark summit. Mostly I just need to add tests and move it
out of
sqlContext.jsonFile(data.json) Is this already available in the
master branch???
Yes, and it will be available in the soon to come 1.0.1 release.
But the question about the use a combination of resources (Memory
processing Disk processing) still remains.
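For completeness, a minimal sketch of the jsonFile path (the file name and query are placeholders):

val people = sqlContext.jsonFile("data.json")   // infers the schema from the JSON records
people.registerAsTable("people")                // registerTempTable in later releases
sqlContext.sql("SELECT * FROM people LIMIT 10").collect()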
This code should work
Though I'll note that window functions are not yet supported in Spark SQL.
https://issues.apache.org/jira/browse/SPARK-1442
On Fri, Jul 4, 2014 at 6:59 AM, Mayur Rustagi mayur.rust...@gmail.com
wrote:
What I typically do is use row_number subquery to filter based on that.
It works out pretty
Thank you, Databricks rules!
On Fri, Jul 4, 2014 at 1:58 PM, Michael Armbrust mich...@databricks.com
wrote:
sqlContext.jsonFile(data.json) Is this already available in the
master branch???
Yes, and it will be available in the soon to come 1.0.1 release.
But the question about
As far as I know, there is not much difference, except that the outer
parentheses are redundant. The problem with your original code was that there
was a mismatch between the opening and closing parentheses. Sometimes the error
messages are misleading :-)
Do you see any performance difference with the
An update on this issue: now Spark is able to read the LZO file,
recognize that it has an index, and start multiple map tasks. You need to
use the following function instead of textFile:
csv =
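For reference, the usual pattern for indexed LZO files looks roughly like this (it assumes the hadoop-lzo classes are on the classpath; the class name and path are assumptions):

import com.hadoop.mapreduce.LzoTextInputFormat
import org.apache.hadoop.io.{LongWritable, Text}

val csv = sc.newAPIHadoopFile(
  "hdfs:///data/file.csv.lzo",
  classOf[LzoTextInputFormat],
  classOf[LongWritable],
  classOf[Text]).map(_._2.toString)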
Hi all,
I too am having some issues with the *RegressionWithSGD algorithms.
Concerning your issue, Eustache, this could be due to the fact that these
regression algorithms use a fixed step (that is divided by
sqrt(iteration)). During my tests, quite often, the algorithm diverged to an
infinite cost, I
Hi:
Is there a Java sample fragment for using cassandra-driver-spark ?
Thanks
Hi,
When I run the following piece of code, it throws a ClassNotFound
error. Any suggestion would be appreciated.
Wanted to group an RDD by key:
val t = rdd.groupByKey()
Error message:
java.lang.ClassNotFoundException:
org.apache.spark.rdd.PairRDDFunctions$$anonfun$combineByKey$
I am using:
kafka 0.8.1
spark-streaming-kafka_2.10-0.9.0-cdh5.0.2
My analysis is simple, so I am confused about why it takes so long at "take at
DStream.scala:586"; it takes 2 to 8 minutes or longer. I don't know how to
find the reason. Hoping for your help.
Sorry for my poor English.