On 08/04/2014 10:57 PM, Michael Armbrust wrote:
If mesos is allocating a container that is exactly the same as the max
heap size then that is leaving no buffer space for non-heap JVM memory,
which seems wrong to me.
This could be the cause. I am now wondering how Mesos picks the size and sets it
up
You need to persist or cache those RDDs for them to appear in the Storage tab.
Unless you do so, those RDDs will be recomputed.
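A minimal sketch of what that looks like (the input path and transformation are placeholders):
val words = sc.textFile("hdfs:///input").flatMap(_.split(" "))
words.cache()   // equivalent to words.persist(StorageLevel.MEMORY_ONLY)
words.count()   // first action computes the RDD and stores the cached blocks
words.count()   // second action reads from the cache instead of recomputing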
Thanks
Best Regards
On Tue, Aug 5, 2014 at 8:03 AM, binbinbin915 binbinbin...@live.cn wrote:
Actually, if you don't use a method like persist or cache, it doesn't even
store
Did you ever find a solution to this problem? I'm having similar issues.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/java-lang-IllegalStateException-unread-block-data-while-running-the-sampe-WordCount-program-from-Ecle-tp8388p11412.html
Sent from the Apache
Thanks AL!
That's what I thought. I've set up Nexus to maintain the Spark libs and download
them when needed.
For development purposes: suppose we have a dev cluster. Is it possible to
run the driver program locally (on a developer's machine)?
I.e. just run the driver from the IDE and have it
Hi,
I am new to Apache Spark and trying to develop a Spark Streaming program
to *stream data from Kafka topics and output it as Parquet files on HDFS*.
Please share a *sample reference* program that streams data from Kafka
topics and writes it as Parquet files on HDFS.
Thanks in advance.
Regards,
I'm trying to run a local driver (on a development machine) and have this
driver communicate with the Spark master and workers. However, I'm having a
few problems getting the driver to connect and run a simple job from within
an IDE.
It all looks like it works, but when I try to do something simple
In the following presentation you can find a simple example of a clustering
model used to classify new incoming tweets:
https://www.youtube.com/watch?v=sPhyePwo7FA
Regards,
Julien
2014-08-05 7:08 GMT+02:00 Xiangrui Meng men...@gmail.com:
Some extra work is needed to close the loop. One related
The code for this example is very simple:
object SparkMain extends App with Serializable {
  val conf = new SparkConf(false)
    //.setAppName("cc-test")
    //.setMaster("spark://hadoop-001:7077")
    //.setSparkHome("/tmp")
    .set("spark.driver.host", "192.168.23.108")
    .set("spark.cores.max", "10")
Hi Sanjeet,
I have been using spark streaming for processing of files present in S3 and
HDFS.
I am also using SQS messages for the same purpose as yours i.e. pointer to
S3 file.
As of now, I have a separate SQS job which receive message from SQS queue
and gets the corresponding file from S3.
Now,
Hi,
I am doing some basic preprocessing in pyspark (local mode as follows):
files = [ input files]

def read(filename, sc):
    # process file
    return rdd

if __name__ == "__main__":
    conf = SparkConf()
    conf.setMaster('local')
    sc = SparkContext(conf=conf)
    sc.setCheckpointDir(root + "temp/")
Is it possible that the Content-MD5 changes during a multipart upload to S3?
But even then, it succeeds if I increase the cluster configuration.
For example:
it throws a Bad Digest error after writing 48/100 files when the cluster has
3 m3.2xlarge slaves;
it throws a Bad Digest error after writing 64/100
You can try this Kafka Spark Consumer which I recently wrote. This uses the
Low Level Kafka Consumer
https://github.com/dibbhatt/kafka-spark-consumer
Dibyendu
On Tue, Aug 5, 2014 at 12:52 PM, rafeeq s rafeeq.ec...@gmail.com wrote:
Hi,
I am new to Apache Spark and Trying to Develop spark
Thanks Jonathan,
Yes, until non-ZK based offset management is available in Kafka, I need to
maintain the offset in ZK. And yes, in both cases an explicit commit is
necessary. I modified the Low Level Kafka Spark Consumer a little bit so that the
Receiver spawns threads for every partition of the topic and
Thanks Dibyendu.
1. Spark itself has an API jar for Kafka; do we still require manual offset
management (using the simple consumer concept) and a manual consumer?
2. Kafka Spark Consumer is implemented against Kafka 0.8.0; can we use it
for Kafka 0.8.1?
3. How do we use Kafka Spark Consumer to produce output
Hi,
I'm running Hive 0.13.1 and the latest master branch of Spark (built with
SPARK_HIVE=true). I'm trying to compute Jaccard similarity using the Hive
UDF from Brickhouse
(https://github.com/klout/brickhouse/blob/master/src/main/java/brickhouse/udf/sketch/SetSimilarityUDF.java).
*Hive table
I'm doing a simple groupBy on a fairly small dataset (80 files in HDFS, few
gigs in total, line based, 500-2000 chars per line). I'm running Spark on 8
low-memory machines in a yarn cluster, i.e. something along the lines of:
spark-submit ... --master yarn-client --num-executors 8
Hi,
I wanted to make a simple Spark app running in local mode with 2g
spark.executor.memory and 1g for caching. But the following code:
val conf = new SparkConf()
  .setMaster("local")
  .setAppName("app")
  .set("spark.executor.memory", "2g")
  .set("spark.storage.memoryFraction", "0.5")
val sc = new
Hi Rafeeq,
I think the current Spark Streaming API can offer you the ability to fetch data
from Kafka and store it to another external store. If you do not care about
managing the consumer offsets manually, there's no need to use a low-level API
such as SimpleConsumer.
For Kafka 0.8.1 compatibility, you
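For reference, a minimal sketch of that approach, assuming the spark-streaming-kafka artifact is on the classpath (topic, ZK quorum, group id and output path are placeholders):
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

val ssc = new StreamingContext(sc, Seconds(10))
// high-level consumer: offsets are tracked in ZooKeeper for you
val stream = KafkaUtils.createStream(ssc, "zk-host:2181", "my-group", Map("myTopic" -> 1))
stream.map(_._2).foreachRDD(rdd => rdd.saveAsTextFile("hdfs:///kafka-out"))
ssc.start()
ssc.awaitTermination()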
I am also experiencing this Kryo buffer problem. My join is a left outer join with
under 40 MB on the right side. I would expect the broadcast join to succeed
in this case (Hive did).
Another problem is that the optimizer
chose a nested loop join for some reason;
I would expect a broadcast (map-side) hash
Yes
Sent from my iPhone
On Aug 5, 2014, at 7:38 AM, Dima Zhiyanov [via Apache Spark User List]
ml-node+s1001560n11432...@n3.nabble.com wrote:
I am also experiencing this kryo buffer problem. My join is left outer with
under 40mb on the right side. I would expect the broadcast join to
Hi,
I have a Spark application which computes a join of two RDDs. One contains
around 150 MB of data (7 million entries), the second around 1.5 MB (80 thousand
entries), and
the result of this join contains 50 MB of data (2 million entries).
When I run it on one core (with master=local) it works correctly (whole
I have things working on my cluster with the Spark SQL thrift server. (Thank
you Yin Huai at Databricks!)
That said, I was curious how I can cache a table via my instance here. I
tried the Shark-style CREATE TABLE table_cached AS SELECT * FROM table, and
that did not create a cached table.
We are working on an overhaul of the docs before the 1.1 release. In the
mean time try: CACHE TABLE tableName.
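A minimal sketch of both ways of doing this, assuming a table registered as tableName:
sqlContext.sql("CACHE TABLE tableName")   // what the thrift server / beeline would run
sqlContext.cacheTable("tableName")        // the equivalent programmatic call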
On Tue, Aug 5, 2014 at 9:02 AM, John Omernik j...@omernik.com wrote:
I have things working on my cluster with the Spark SQL thrift server.
(Thank you Yin Huai at Databricks!)
That
For outer joins I'd recommend upgrading to master or waiting for a 1.1
release candidate (which should be out this week).
On Tue, Aug 5, 2014 at 7:38 AM, Dima Zhiyanov dimazhiya...@hotmail.com
wrote:
I am also experiencing this kryo buffer problem. My join is left outer with
under 40mb on the
The more cores you have, the less memory each of them will get.
512M is already quite small, and if you have 4 cores it will mean
roughly 128M per task.
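A rough illustration of that arithmetic (just the division involved, not a Spark API):
val executorMemoryMb = 512   // spark.executor.memory
val concurrentTasks  = 4     // one task per core
val perTaskMb = executorMemoryMb / concurrentTasks   // roughly 128 MB per task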
Sometimes it is worthwhile to have fewer cores and more memory.
How many cores do you have?
André
On 2014-08-05 16:43, Grzegorz Białek wrote:
Hi,
Ah yes, Spark doesn't cache all of your RDDs by default. It turns out that
caching things too aggressively can lead to suboptimal performance because
there might be a lot of churn. If you don't call persist or cache then your
RDDs won't actually be cached. Note that even once they're cached they
Hi Grzegorz,
For local mode you only have one executor, and this executor is your
driver, so you need to set the driver's memory instead. *That said, in
local mode, by the time you run spark-submit, a JVM has already been
launched with the default memory settings, so setting spark.driver.memory
(Clarification: you'll need to pass in --driver-memory not just for local
mode, but for any application you're launching with client deploy mode)
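For example, something along these lines (class and jar names are placeholders):
./bin/spark-submit --master local[*] --driver-memory 2g --class com.example.MyApp myapp.jar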
2014-08-05 9:24 GMT-07:00 Andrew Or and...@databricks.com:
Hi Grzegorz,
For local mode you only have one executor, and this executor is your
Hi Daniel,
Thanks a lot for your interest. Gradient boosting and AdaBoost algorithms
are under active development and should be a part of release 1.2.
-Manish
On Mon, Jul 14, 2014 at 11:24 AM, Daniel Bendavid
daniel.benda...@creditkarma.com wrote:
Hi,
My company is strongly considering
You are correct in that I am trying to publish inside of a foreachRDD loop.
I am currently refactoring and will try publishing inside the
foreachPartition loop. Below is the code showing the way it is currently
written, thanks!
object myData {
def main(args: Array[String]) {
val ssc =
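For what it's worth, a hedged sketch of the foreachPartition version (dstream, createConnection and publish stand in for whatever the real code uses):
dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    val conn = createConnection()           // hypothetical: opened once per partition, on the worker
    records.foreach(r => conn.publish(r))   // publish each record of this partition
    conn.close()
  }
}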
Hi All,
I am trying to move away from spark-shell to spark-submit and have been making
some code changes. However, I am now having a problem with serialization. It
used to work fine before the code update. Not sure what I did wrong. Anyway,
here is the code:
JaccardScore.scala
package
Hey all!
I'm a total beginner with spark / hadoop / graph computation so please
excuse my beginner question.
I've created a graph, using graphx. Now, for every vertex, I want to get
all its second degree neighbors.
so if my graph is:
v1 -- v2
v1 -- v4
v1 -- v6
I want to get something like:
v2
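One possible sketch using GraphX's collectNeighborIds (untested, and it ignores edge direction):
import org.apache.spark.graphx._

val firstDegree = graph.collectNeighborIds(EdgeDirection.Either)   // (vertexId, Array[neighborId])
val secondDegree = firstDegree
  .flatMap { case (v, nbrs) => nbrs.map(n => (n, v)) }    // key each pair by the neighbor
  .join(firstDegree)                                      // attach the neighbor's own neighbors
  .flatMap { case (_, (v, nbrs2)) => nbrs2.map(n2 => (v, n2)) }
  .filter { case (v, n2) => v != n2 }                     // drop the trivial path back to v
  .distinct()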
I'm having similar problem to:
http://mail-archives.apache.org/mod_mbox/spark-user/201407.mbox/browser
I'm trying to follow the tutorial at:
When I run: val file = sc.textFile("s3://bigdatademo/sample/wiki/")
I get:
WARN storage.BlockManager: Putting block broadcast_1 failed
Hi All,
I have a data set where each record is serialized using JSON, and I'm
interested to use SchemaRDDs to work with the data. Unfortunately I've hit
a snag since some fields in the data are maps and lists, and are not
guaranteed to be populated for each record. This seems to cause
I was just about to ask about this.
Currently, there are two methods, sqlContext.jsonFile() and
sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers
the whole data set.
For example:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
a =
Hi,
I'm trying to run a Spark application with executor-memory 3G, but I'm
running into the following error:
14/08/05 18:02:58 INFO DAGScheduler: Submitting Stage 0 (MappedRDD[5]
at map at KMeans.scala:123), which has no missing parents
14/08/05 18:02:58 INFO DAGScheduler: Submitting 1
Thanks Michael.
Is there a way to specify off_heap? I.e. Tachyon via the thrift server?
Thanks!
On Tue, Aug 5, 2014 at 11:06 AM, Michael Armbrust mich...@databricks.com
wrote:
We are working on an overhaul of the docs before the 1.1 release. In the
mean time try: CACHE TABLE tableName.
Thank you!! Could you give me any sample code for the receiver? I'm still new
to Spark and not quite sure how I would do that.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-read-from-OpenTSDB-using-PySpark-or-Scala-Spark-tp11211p11454.html
Sent
Are you able to see the job on the WebUI (8080)? If yes, how much memory
are you seeing there specifically for this job?
[image: Inline image 1]
Here you can see I have 11.8 GB RAM on both workers and my app is using
11 GB.
1. What are all the memory values that you are seeing in your case?
2. Make sure
You can always start your spark-shell by specifying the master as
MASTER=spark://*whatever*:7077 $SPARK_HOME/bin/spark-shell
Then it will connect to that *whatever* master.
Thanks
Best Regards
On Tue, Aug 5, 2014 at 8:51 PM, Aniket Bhatnagar aniket.bhatna...@gmail.com
wrote:
Hi
The only UI I have currently is the Application Master (Cluster mode), with
the following executor nodes status:
Executors (3)
- *Memory:* 0.0 B Used (3.7 GB Total)
- *Disk:* 0.0 B Used
Executor ID | Address | RDD Blocks | Memory Used | Disk Used | Active Tasks |
Failed Tasks | Complete Tasks | Total Tasks | Task
Hi Nick,
Thanks for the great response.
I actually already investigated jsonRDD and jsonFile, although I did not
realize they provide more complete schema inference. I did however have
other problems with jsonRDD and jsonFile, but I will now describe in a
separate thread with an appropriate
For that UI to have some values, your process should do some operation,
which is not happening here (14/08/05 18:03:13 WARN YarnClusterScheduler:
Initial job has not accepted any resources; check your cluster UI to ensure
that workers are registered and have sufficient memory).
Can you open up a
Notice the difference in the schema. Are you running the 1.0.1 release, or
a more bleeding-edge version from the repository?
Yep, my bad. I’m running off master at commit
184048f80b6fa160c89d5bb47b937a0a89534a95.
Nick
Hi All,
I am interested to use jsonRDD and jsonFile to create a SchemaRDD out of
some JSON data I have, but I've run into some instability involving the
following java exception:
An error occurred while calling o1326.collect.
: org.apache.spark.SparkException: Job aborted due to stage failure:
On Tue, Aug 5, 2014 at 11:01 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
I was just about to ask about this.
Currently, there are two methods, sqlContext.jsonFile() and
sqlContext.jsonRDD(), that work on JSON text and infer a schema that covers
the whole data set.
For example:
I believe this is a known issue in 1.0.1 that's fixed in 1.0.2.
See: SPARK-2376: Selecting list values inside nested JSON objects raises
java.lang.IllegalArgumentException
https://issues.apache.org/jira/browse/SPARK-2376
On Tue, Aug 5, 2014 at 2:55 PM, Brad Miller bmill...@eecs.berkeley.edu
Got it. Thanks!
On Tue, Aug 5, 2014 at 11:53 AM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Notice the difference in the schema. Are you running the 1.0.1 release,
or a more bleeding-edge version from the repository?
Yep, my bad. I’m running off master at commit
Is this on 1.0.1? I'd suggest running this on master or the 1.1-RC which
should be coming out this week. Pyspark did not have good support for
nested data previously. If you still encounter issues using a more recent
version, please file a JIRA. Thanks!
On Tue, Aug 5, 2014 at 11:55 AM, Brad
Bump
On Tuesday, August 5, 2014, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I am doing some basic preprocessing in pyspark (local mode as follows):
files = [ input files]
def read(filename,sc):
#process file
return rdd
if __name__ == "__main__":
conf = SparkConf()
Nick: Thanks for both the original JIRA bug report and the link.
Michael: This is on the 1.0.1 release. I'll update to master and follow-up
if I have any problems.
best,
-Brad
On Tue, Aug 5, 2014 at 12:04 PM, Michael Armbrust mich...@databricks.com
wrote:
Is this on 1.0.1? I'd suggest
Hi Davies,
Thanks for the response and tips. Is the sample argument to inferSchema
available in the 1.0.1 release of pyspark? I'm not sure (based on the
documentation linked below) that it is.
http://spark.apache.org/docs/latest/api/python/pyspark.sql.SQLContext-class.html#inferSchema
It
Could you create a reproducible script (and data) to allow us to
investigate this?
Davies
On Tue, Aug 5, 2014 at 1:10 AM, Chengi Liu chengi.liu...@gmail.com wrote:
Hi,
I am doing some basic preprocessing in pyspark (local mode as follows):
files = [ input files]
def read(filename,sc):
Are you sure that you were not running SparkPi in local mode?
Thanks
Best Regards
On Wed, Aug 6, 2014 at 12:43 AM, Sunny Khatri sunny.k...@gmail.com wrote:
Well I was able to run the SparkPi, that also does the similar stuff,
successfully.
On Tue, Aug 5, 2014 at 11:52 AM, Akhil Das
Yeah, ran it on yarn-cluster mode.
On Tue, Aug 5, 2014 at 12:17 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
Are you sure that you were not running SparkPi in local mode?
Thanks
Best Regards
On Wed, Aug 6, 2014 at 12:43 AM, Sunny Khatri sunny.k...@gmail.com
wrote:
Well I was able
This sample argument of inferSchema is still not in master; I will
try to add it if it makes
sense.
On Tue, Aug 5, 2014 at 12:14 PM, Brad Miller bmill...@eecs.berkeley.edu wrote:
Hi Davies,
Thanks for the response and tips. Is the sample argument to inferSchema
available in the 1.0.1 release
Assuming updating to master fixes the bug I was experiencing with jsonRDD
and jsonFile, then pushing sample to master will probably not be
necessary.
We believe that the link below was the bug I experienced, and I've been
told it is fixed in master.
Hi,
Anyone? Any input would be much appreciated
Thanks,
Amin
On 5 Aug 2014 00:31, Al Amin alamin.is...@gmail.com wrote:
Hi all,
Any help would be much appreciated.
Thanks,
Al
On Mon, Aug 4, 2014 at 7:09 PM, Al Amin alamin.is...@gmail.com wrote:
Hi all,
I have setup 2 nodes (master
Hi, all
I use “SELECT DISTINCT” to query the data saved in Hive.
It seems that this statement cannot understand the table structure and just
outputs the data in other fields.
Has anyone met a similar problem before?
Best,
--
Nan Zhu
Yes, 2376 has been fixed in master. Can you give it a try?
Also, for inferSchema, because Python is dynamically typed, I agree with
Davies on providing a way to scan a subset (or the entirety) of the dataset to
figure out the proper schema. We will take a look at it.
Thanks,
Yin
On Tue, Aug 5, 2014 at
Never mind,
it was a problem caused by the ill-formatted raw data.
--
Nan Zhu
On Tuesday, August 5, 2014 at 3:42 PM, Nan Zhu wrote:
Hi, all
I use “SELECT DISTINCT” to query the data saved in hive
it seems that this statement cannot understand the table structure and just
output the
spark-submit doesn't handle being a symlink currently:
$ spark-submit
/usr/local/bin/spark-submit: line 44: /usr/local/bin/spark-class: No such
file or directory
/usr/local/bin/spark-submit: line 44: exec: /usr/local/bin/spark-class:
cannot execute: No such file or directory
To fix it, I changed the
I'm trying to use the spark-ec2 script to launch a spark cluster within a
virtual private cloud (VPC) but I don't see an option for that. Is there a
way to specify the VPC while using the spark-ec2 script?
I found an old spark-incubator mailing list comment which claims to have
added that
Forking from this thread
http://apache-spark-user-list.1001560.n3.nabble.com/pyspark-inferSchema-tc11449.html.
On Tue, Aug 5, 2014 at 3:01 PM, Davies Liu dav...@databricks.com
http://mailto:dav...@databricks.com wrote:
Before the upcoming 1.1 release, we did not support nested structures
via
Hello, I have an issue when trying to use a BSON file as Spark input. I use
mongo-hadoop-connector 1.3.0 and Spark 1.0.0:
val sparkConf = new SparkConf()
val sc = new SparkContext(sparkConf)
val config = new Configuration()
config.set("mongo.job.input.format",
Looks like this feature has been turned off. Are these changes intentional?
Or perhaps I'm not understanding how it's supposed to work.
Nick
On Fri, Jul 18, 2014 at 12:20 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
Looks like this has now been turned on for new threads?
On
Maybe; I’m not sure just yet. Basically, I’m looking for something
functionally equivalent to this:
sqlContext.jsonRDD(RDD[dict].map(lambda x: json.dumps(x)))
In other words, given an RDD of JSON-serializable Python dictionaries, I
want to be able to infer a schema that is guaranteed to
Patrick Wendell wrote
In the latest version of Spark we've added documentation to make this
distinction more clear to users:
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/PairRDDFunctions.scala#L390
That is a very good addition to the documentation. Nice
Emails sent from Nabble have it, while others don't. Unfortunately I haven't
received a reply from ASF infra on this yet.
Matei
On August 5, 2014 at 2:04:10 PM, Nicholas Chammas (nicholas.cham...@gmail.com)
wrote:
Looks like this feature has been turned off. Are these changes intentional? Or
Spark is not able to communicate with your Hadoop HDFS. Is your HDFS
running? If so, can you try to explicitly connect to it with the Hadoop command
line tools, giving the full hostname and port?
Or test the port using:
telnet localhost 9000
In all likelihood either your HDFS is down, or bound to the wrong port/IP that
Oh actually sorry, it looks like infra has looked at it but they can't add
permalinks. They can only add "here's how to unsubscribe" footers. My bad, I
just didn't catch the email update from them.
Matei
On August 5, 2014 at 2:39:45 PM, Matei Zaharia (matei.zaha...@gmail.com) wrote:
Emails sent
Ah, the user-specific to: address? I see. OK, thanks for looking into it!
On Tue, Aug 5, 2014 at 5:40 PM, Matei Zaharia matei.zaha...@gmail.com
wrote:
Oh actually sorry, it looks like infra has looked at it but they can't add
permalinks. They can only add here's how to unsubscribe footers. My
Then don't specify hdfs:// when you read the file.
Also, the community is generally quite active in responding; just be a little
patient.
Also, if possible, look at the Spark training videos from Spark Summit 2014
and/or the AMPLab training on the Spark website.
Mayur Rustagi
Ph: +1 (760) 203 3257
Hi Amin,
This happens usually because your application can't talk to HDFS, and
thinks that the name node is waiting on port 9000 when it's not. Are you
using the EC2 scripts for standalone Spark? You can verify whether or not
the port is correct by checking the configurations with
Hi Jens,
Within a partition things will spill - so the current documentation is
correct. This spilling can only occur *across keys* at the moment. Spilling
cannot occur within a key at present.
This is discussed in the video here:
Greetings,
I modified the ActorWordCount example a little: it uses a simple case class as the
message for Streaming instead of the primitive string.
I also modified the launch code to not use the run-example script, but to set the
Spark master in the code and attach the jar (setJars(...)) with all the classes
SPARK-2870: Thorough schema inference directly on RDDs of Python
dictionaries https://issues.apache.org/jira/browse/SPARK-2870
On Tue, Aug 5, 2014 at 5:07 PM, Michael Armbrust mich...@databricks.com
wrote:
Maybe; I’m not sure just yet. Basically, I’m looking for something
functionally
Hi All,
I've built and deployed the current head of branch-1.0, but it seems to
have only partly fixed the bug.
This code now runs as expected with the indicated output:
srdd = sqlCtx.jsonRDD(sc.parallelize(['{"foo": [1,2,3]}',
                                      '{"foo": [4,5,6]}']))
srdd.printSchema()
root
|-- foo:
Hey,
I just tried to submit a task to my spark cluster using the following command
./spark/bin/spark-submit --py-files file:///root/abc.zip --master
spark://xxx.xxx.xxx.xxx:7077 test.py
It seems like the dependency I’ve added gets loaded:
14/08/05 23:07:00 INFO spark.SparkContext: Added file
This looks to be fixed in master:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
sc.parallelize(['{"foo": [[1,2,3], [4,5,6]]}', '{"foo": [[1,2,3], [4,5,6]]}'])
ParallelCollectionRDD[5] at parallelize at PythonRDD.scala:315
sqlContext.jsonRDD(sc.parallelize(['{"foo": [[1,2,3],
Are there any workarounds for this? Seems to be a dead end so far.
--
View this message in context:
http://apache-spark-user-list.1001560.n3.nabble.com/Using-sbt-pack-with-Spark-1-0-0-tp6649p11502.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
Can you show us the modified version? The reason could very well be
what you suggest, but I want to understand what conditions lead to
this.
TD
On Tue, Aug 5, 2014 at 3:55 PM, Anton Brazhnyk
anton.brazh...@genesys.com wrote:
Greetings,
I modified ActorWordCount example a little and it uses
That function simply deletes a directory recursively. You can use
alternative libraries; see this discussion:
http://stackoverflow.com/questions/779519/delete-files-recursively-in-java
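For example, a minimal alternative using Apache Commons IO (assuming commons-io is on the classpath):
import java.io.File
import org.apache.commons.io.FileUtils

def deleteRecursively(path: String): Unit = {
  val dir = new File(path)
  if (dir.exists()) FileUtils.deleteDirectory(dir)   // removes the directory and everything under it
}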
On Tue, Aug 5, 2014 at 5:02 PM, JiajiaJing jj.jing0...@gmail.com wrote:
Hi TD,
I encountered a problem
@ Simon Any progress?
On Tue, Aug 5, 2014 at 12:17 AM, Akhil Das ak...@sigmoidanalytics.com wrote:
You need to add twitter4j-*-3.0.3.jars to your class path
Thanks
Best Regards
On Tue, Aug 5, 2014 at 7:18 AM, Tathagata Das tathagata.das1...@gmail.com
wrote:
Are you able to run it
I can try answering the question even if I am not Sanjeet ;)
There isn't a simple way to do this. In fact, the ideal way to do it would be
to create a new InputDStream (just like FileInputDStream
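As a very rough, untested sketch of that idea using the Receiver API (the SQS call is a placeholder, not a real integration):
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver

class SqsPointerReceiver(queueUrl: String)
    extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {

  def onStart(): Unit = {
    new Thread("SQS poller") {
      override def run(): Unit = {
        while (!isStopped()) {
          // fetchS3Paths is a hypothetical helper wrapping your SQS client
          fetchS3Paths(queueUrl).foreach(store)   // each stored string is an S3 path
        }
      }
    }.start()
  }

  def onStop(): Unit = {}   // the polling thread exits once isStopped() becomes true

  private def fetchS3Paths(url: String): Seq[String] = Seq.empty   // placeholder
}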
1. updateStateByKey should be called on all keys even if there is no data
corresponding to that key (a small sketch follows below). There is a unit test for that:
https://github.com/apache/spark/blob/master/streaming/src/test/scala/org/apache/spark/streaming/BasicOperationsSuite.scala#L337
2. I am increasing the priority for
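A minimal sketch of point 1 (pairStream is a hypothetical DStream[(String, Int)]):
// the update function runs for every key each batch, including keys with no
// new values in that batch (newValues is then empty)
val updateFunc = (newValues: Seq[Int], state: Option[Int]) =>
  Some(state.getOrElse(0) + newValues.sum)
val runningCounts = pairStream.updateStateByKey[Int](updateFunc)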
Hi All,
I met the titled error. This exception occurred on line 223, as shown below:
212 // read files
213 val lines =
sc.textFile(path_edges).map(line => line.split(",")).map(line => ((line(0),
line(1)), line(2).toDouble)).reduceByKey(_ +
_).cache
214
215 val
Hi,
I would like to save an RDD to a SQL database. It seems like this would be
a common enough use case. Are there any built in libraries to do it?
Otherwise, I'm just planning on mapping my RDD, and having that call a
method to write to the database. Given that a lot of records are going to
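A hedged sketch of that approach, opening one JDBC connection per partition rather than per record (the JDBC URL, table, and record type are assumptions):
import java.sql.DriverManager

rdd.foreachPartition { records =>
  val conn = DriverManager.getConnection("jdbc:postgresql://dbhost/mydb", "user", "pass")
  val stmt = conn.prepareStatement("INSERT INTO events (id, value) VALUES (?, ?)")
  try {
    records.foreach { case (id: Long, value: String) =>
      stmt.setLong(1, id)
      stmt.setString(2, value)
      stmt.executeUpdate()   // one insert per record; batching would be a further refinement
    }
  } finally {
    stmt.close()
    conn.close()
  }
}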
Hi All,
I checked out and built master. Note that Maven had a problem building
Kafka (in my case, at least); I was unable to fix this easily so I moved on
since it seemed unlikely to have any influence on the problem at hand.
Master improves functionality (including the example Nicholas just
I've followed up in a thread more directly related to jsonRDD and jsonFile,
but it seems like after building from the current master I'm still having
some problems with nested dictionaries.
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is that when
we take the data back to the Python side, SchemaRDD#javaToPython fails on
your cases. I have created https://issues.apache.org/jira/browse/SPARK-2875
to track it.
Thanks,
Yin
On Tue, Aug 5, 2014 at 9:20 PM, Brad
I concur that printSchema works; it just seems to be operations that use
the data where trouble happens.
Thanks for posting the bug.
-Brad
On Tue, Aug 5, 2014 at 10:05 PM, Yin Huai yh...@databricks.com wrote:
I tried jsonRDD(...).printSchema() and it worked. Seems the problem is
when we
Hi All,
I am having some trouble trying to write generic code that uses sqlContext
and RDDs. Can you suggest what might be wrong?
class SparkTable[T : ClassTag](val sqlContext: SQLContext, val extractor:
(String) => T) {
  private[this] var location: Option[String] = None
  private[this] var