Hello,
I'm actually asking myself about the performance of using Spark SQL with Hive
to do real-time analytics.
I know that Hive was created for batch processing, and Spark is used to
do fast queries.
But will using Spark SQL with Hive allow me to do real-time queries? Or will
it just make
Hi!
I am trying to load data from my MySQL database using the following code
val query = "select * from " + table
val url = "jdbc:mysql://" + dataBaseHost + ":" + dataBasePort + "/" +
dataBaseName + "?user=" + db_user + "&password=" + db_pass
val sc = new SparkContext(new
This comes up so often. I wonder if the documentation or the API could be
changed to answer this question.
The solution I found is from
http://stackoverflow.com/questions/23995040/write-to-multiple-outputs-by-key-spark-one-spark-job.
You basically write the items into two directories in a single
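A minimal sketch of that answer's approach (the subclass name, the pair fields, and the output path are placeholders):

import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat

// Route each record into a subdirectory named after its key,
// all within a single saveAsHadoopFile job.
class RDDMultipleTextOutputFormat extends MultipleTextOutputFormat[Any, Any] {
  override def generateActualKey(key: Any, value: Any): Any = NullWritable.get()
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String] + "/" + name
}

// pairs is assumed to be an RDD[(String, String)] keyed by output directory.
pairs.partitionBy(new org.apache.spark.HashPartitioner(2))
  .saveAsHadoopFile("/output", classOf[Any], classOf[Any],
    classOf[RDDMultipleTextOutputFormat])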
Thanks Shixiong for the reply.
Yes, I confirm that the file exists there; I simply checked with ls -l
/data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar
bit1...@163.com
From: Shixiong Zhu
Date: 2015-07-06 18:41
To: bit1...@163.com
CC: user
Subject: Re:
Is there any equivalent of Oracle's analytical functions in Spark SQL?
For example, if I have following data set (say table T):
EID|DEPT
101|COMP
102|COMP
103|COMP
104|MARK
In Oracle, I can do something like
select EID, DEPT, count(1) over (partition by DEPT) CNT from T;
to get:
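EID|DEPT|CNT
101|COMP|3
102|COMP|3
103|COMP|3
104|MARK|1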
Hi,
I want to re-use a column alias in the select clause to avoid a subquery.
For example:
select check(key) as b, abs(b) as abs, value1, value2, ..., value30
from test
The query above does not work, because b is not defined in the test table's
schema. Instead, I should change the query to the
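A sketch of the usual subquery rewrite, wrapped in a sqlContext call (check is the user's function; the value columns after value2 are elided here):

val df = sqlContext.sql("""
  SELECT b, abs(b) AS abs_b, value1, value2
  FROM (SELECT check(key) AS b, value1, value2 FROM test) t
""")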
Hi,
I have the following shell script that will submit the application to the cluster.
But whenever I start the application, I encounter a FileNotFoundException; after
retrying several times, I can successfully submit it!
SPARK=/data/software/spark-1.3.1-bin-2.4.0
Hello,
I've got a problem with float type coercion on SparkR with hiveContext.
result <- sql(hiveContext, "SELECT offset, percentage from data limit 100")
show(result)
DataFrame[offset:float, percentage:float]
head(result)
Error in as.data.frame.default(x[[i]], optional = TRUE) :
cannot
Before running your script, could you confirm that
/data/software/spark-1.3.1-bin-2.4.0/applications/pss.am.core-1.0-SNAPSHOT-shaded.jar
exists? You might have forgotten to build this jar.
Best Regards,
Shixiong Zhu
2015-07-06 18:14 GMT+08:00 bit1...@163.com bit1...@163.com:
Hi,
I have the following
Hi there,
I would like to check with you whether there is any equivalent functions of
Oracle's analytical functions in Spark SQL.
For example, if I have following data set (table T):
EID|DEPT
101|COMP
102|COMP
103|COMP
104|MARK
In Oracle, I can do something like
select EID, DEPT,
I went ahead and tested your file and the results from the tests can be
seen in the gist: https://gist.github.com/dennyglee/c933b5ae01c57bd01d94.
Basically, when running {Java 7, MaxPermSize = 256} or {Java 8, default}
the query ran without any issues. I was able to recreate the issue with
{Java
I have a requirement to write to a Kafka queue from a Spark Streaming
application.
I am using Spark 1.2 streaming. Since different executors in Spark are
allocated at each run, instantiating a new Kafka producer at each run
seems a costly operation. Is there a way to reuse objects in processing
You can set spark.ui.enabled to false to disable the Web UI.
Best Regards,
Shixiong Zhu
2015-07-06 17:05 GMT+08:00 luohui20...@sina.com:
Hello there,
I heard that there is some way to shutdown Spark WEB UI, is there a
configuration to support this?
Thank you.
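For reference, a sketch of both ways to set spark.ui.enabled (the app name is a placeholder):

import org.apache.spark.{SparkConf, SparkContext}

// Programmatically:
val conf = new SparkConf().setAppName("no-ui-app").set("spark.ui.enabled", "false")
val sc = new SparkContext(conf)
// Or at submit time: spark-submit --conf spark.ui.enabled=false ...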
Thanks a lot Akhil
On Mon, Jul 6, 2015 at 12:57 PM, sandeep vura sandeepv...@gmail.com wrote:
It Works !!!
On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura sandeepv...@gmail.com
wrote:
OK, let me try
On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It's
It Works !!!
On Mon, Jul 6, 2015 at 12:40 PM, sandeep vura sandeepv...@gmail.com wrote:
OK, let me try
On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It's complaining about a missing JDBC driver. Add it to your driver classpath like:
./bin/spark-sql --driver-class-path
I have already opened a JIRA about this.
https://issues.apache.org/jira/browse/SPARK-8743
On Mon, Jul 6, 2015 at 1:02 AM, Juan Rodríguez Hortalá
juan.rodriguez.hort...@gmail.com wrote:
Hi,
I haven't been able to reproduce the error reliably, I will open a JIRA as
soon as I can
Greetings,
Hi,
I haven't been able to reproduce the error reliably, I will open a JIRA as
soon as I can
Greetings,
Juan
2015-06-23 21:57 GMT+02:00 Tathagata Das t...@databricks.com:
Aaah, this could potentially be a major issue, as it may prevent metrics from
a restarted streaming context from being published.
Hi all,
Apparently, we can only specify a single-character delimiter for tokenizing data
using Spark-CSV. But what if we have a log file with multiple delimiters or
even a multi-character delimiter? e.g. (field1,field2:field3) with
delimiters [,:] and (field1::field2::field3) with a single multi-character
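Not something spark-csv covers, but a sketch of a plain-RDD workaround (file paths are placeholders): String.split takes a regex, so a character class handles multiple single-character delimiters, and a literal string handles a multi-character one.

// field1,field2:field3 with delimiters [,:]
val multiDelim = sc.textFile("/logs/a.txt").map(_.split("[,:]"))
// field1::field2::field3 with a multi-character delimiter
val multiChar = sc.textFile("/logs/b.txt").map(_.split("::"))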
Hi,
I've an RDD which I want to split into two disjoint RDDs with a boolean
function. I can do this with the following
val rdd1 = rdd.filter(f)
val rdd2 = rdd.filter(fnot)
I'm assuming that each of the above statements will traverse the RDD once,
thus resulting in 2 passes.
Is there a way of
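As far as I know there is no single-pass split built in, but a common workaround, sketched under the assumption that the predicate f is expensive: tag each element once and cache, so f is evaluated only once per element even though two filter passes over the cached data remain.

val tagged = rdd.map(x => (f(x), x)).persist()
val rdd1 = tagged.filter(t => t._1).map(t => t._2)
val rdd2 = tagged.filter(t => !t._1).map(t => t._2)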
Hello there,
I heard that there is some way to shutdown Spark WEB UI, is there a
configuration to support this?
Thank you.
Thanks & Best regards!
San.Luo
It's available in Spark 1.4 under DataFrame window operations. Apparently the
programming doc doesn't mention it; you need to look at the APIs.
On Mon, Jul 6, 2015 at 8:50 PM, Gireesh Puthumana
gireesh.puthum...@augmentiq.in wrote:
Hi there,
I would like to check with you whether there is any
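For reference, a sketch of the Spark 1.4 window API for this case (df is a hypothetical DataFrame with columns EID and DEPT, mirroring table T):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// Equivalent of: count(1) over (partition by DEPT)
val byDept = Window.partitionBy("DEPT")
df.select(col("EID"), col("DEPT"), count(lit(1)).over(byDept).as("CNT")).show()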
In Spark Streaming 1.2, are the offsets of consumed Kafka messages updated in
ZooKeeper only after writing to the WAL (if WAL and checkpointing are enabled),
or does it depend on the kafkaParams used while initializing the KafkaDStream?
Map<String, String> kafkaParams = new HashMap<String, String>();
If you’re using WAL with Kafka, Spark Streaming will ignore this
configuration (autocommit.enable) and explicitly call commitOffset to update
the offset to Kafka AFTER the WAL is done.
No matter what you’re setting with autocommit.enable, internally Spark
Streaming will set it to false to turn off
Within the context of your question, Spark SQL utilizing the Hive context
is primarily about very fast queries. If you want to use real-time
queries, I would utilize Spark Streaming. A couple of great resources on
this topic include Guest Lecture on Spark Streaming in Stanford CME 323:
I used Spark 1.4.0 binaries from the official site:
http://spark.apache.org/downloads.html
And running it on:
* Hortonworks HDP 2.2.0.0-2041
* with Hive 0.14
* with disabled hooks for Application Timeline Servers (ATSHook) in
hive-site.xml (commented hive.exec.failure.hooks,
hive.exec.post.hooks,
Hi All,
If someone can help me understand which portion of the code gets
executed on the driver and which portion will be executed on the executors
from the below code, it would be a great help.
I have to load data from 10 tables and then use that data in various
manipulations, and I am using Spark SQL.
Please see the inline comments.
From: Shushant Arora [mailto:shushantaror...@gmail.com]
Sent: Monday, July 6, 2015 8:51 PM
To: Shao, Saisai
Cc: user
Subject: Re: kafka offset commit in spark streaming 1.2
So if the WAL is disabled, how can a developer commit offsets explicitly in a
Spark Streaming app?
If you disable the WAL, Spark Streaming itself will not manage any offset-related
things. If auto commit is set to true, Kafka itself will update offsets in
a time-based way; if auto commit is disabled, nothing will call
commitOffset, so you need to call this API yourself.
Also Kafka’s offset
So if the WAL is disabled, how can a developer commit offsets explicitly in a
Spark Streaming app, since we don't write code which will be executed in the
receiver?
Plus, since offset committing is asynchronous, is it possible that the last
offset is not yet committed when the next stream batch starts on
Use foreachPartition, and allocate whatever the costly resource is once per
partition.
On Mon, Jul 6, 2015 at 6:11 AM, Shushant Arora shushantaror...@gmail.com
wrote:
I have a requirement to write to a Kafka queue from a Spark Streaming
application.
I am using spark 1.2 streaming. Since
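A rough sketch of that pattern with the Kafka 0.8 producer API, assuming a DStream[String] (broker list and topic are placeholders):

import java.util.Properties
import kafka.producer.{KeyedMessage, Producer, ProducerConfig}

dstream.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One producer per partition, reused for every record in it.
    val props = new Properties()
    props.put("metadata.broker.list", "broker1:9092")
    props.put("serializer.class", "kafka.serializer.StringEncoder")
    val producer = new Producer[String, String](new ProducerConfig(props))
    records.foreach(r => producer.send(new KeyedMessage[String, String]("out-topic", r)))
    producer.close()
  }
}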
Hi all!
what is the most efficient way to convert jdbcRDD to DataFrame.
any example?
Thanks
You shouldn't rely on being able to restart from a checkpoint after
changing code, regardless of whether the change was explicitly related to
serialization.
If you are relying on checkpoints to hold state, specifically which offsets
have been processed, that state will be lost if you can't
Join happens on executor. Else spark would not be much of a distributed
computing engine :)
Reads happen on executor too. Your options are passed to executors and conn
objects are created in executors.
On 6 Jul 2015 22:58, Ashish Soni asoni.le...@gmail.com wrote:
Hi All ,
If some one can help
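A sketch of that rule of thumb (the JDBC option maps and the key column are placeholders):

// Runs on the driver: these lines only build the logical plan.
val t1 = sqlContext.read.format("jdbc").options(jdbcOpts1).load()
val t2 = sqlContext.read.format("jdbc").options(jdbcOpts2).load()
// The reads and the join execute on the executors once an action runs.
val joined = t1.join(t2, "key")
joined.foreach(row => println(row)) // this closure runs on the executors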
What version of Hive and Spark are you using ?
Cheers
On Sun, Jul 5, 2015 at 10:53 PM, Rex Xiong bycha...@gmail.com wrote:
Hi,
I try to use it for one table created in Spark, but it seems the results are
all empty. I want to get metadata for the table; what are the other options?
Thanks
Hi there,
I’m trying to get a feel for how User Defined Functions from SparkSQL (as
written in Python and registered using the udf function from
pyspark.sql.functions) are run behind the scenes. Trying to grok the source it
seems that the native Python function is serialized for distribution
1. onBatchError is not a bad idea.
2. It works for the Kafka Direct API and files as well. They all have
batches. However, you will not get the number of records for the file
stream.
3. Mind giving an example of the exception you would like to see caught?
TD
On Wed, Jul 1, 2015 at 10:35 AM,
I have a Spark standalone cluster with 2 workers -
Master and one slave thread run on a single machine -- Machine 1
Another slave running on a separate machine -- Machine 2
I am running a spark shell in the 2nd machine that reads a file from hdfs
and does some calculations on them and stores the
This should work
val output: RDD[(DetailInputRecord, VISummary)] =
sc.parallelize(Seq.empty[(DetailInputRecord, VISummary)])
On Mon, Jul 6, 2015 at 5:11 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com wrote:
I need to return an empty RDD of type
val output: RDD[(DetailInputRecord, VISummary)]
This
Hi,
I've been researching Spark for a couple of months now, and I strongly
believe it can solve our problem.
We are developing an application that allows the user to analyze various
sources of information. We are dealing with non-technical users, so simply
giving them an interactive shell won't
I need to return an empty RDD of type
val output: RDD[(DetailInputRecord, VISummary)]
This does not work
val output: RDD[(DetailInputRecord, VISummary)] = new RDD()
as RDD is abstract class.
How do I create an empty RDD?
--
Deepak
Hi,
I have a DataFrame which I want to use for creating a RandomForest model
using MLlib.
The RandomForest model needs an RDD of LabeledPoints.
I'm wondering how to convert the DataFrame to an RDD of LabeledPoint.
Regards,
Sourav
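A sketch of one way to do that, assuming the first column of the DataFrame is a numeric label and the remaining columns are numeric features:

import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPoints = df.map { row =>
  val features = (1 until row.length).map(i => row.getDouble(i)).toArray
  LabeledPoint(row.getDouble(0), Vectors.dense(features))
}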
You meant SPARK_REPL_OPTS? I did a quick search. Looks like it has been
removed since 1.0. I think it did not affect the behavior of the shell.
On Mon, Jul 6, 2015 at 9:04 AM, Simeon Simeonov s...@swoop.com wrote:
Yin, that did the trick.
I'm curious what was the effect of the environment
Hi, I have a couple of Spark jobs which process thousands of files every
day. File sizes may vary from MBs to GBs. After finishing a job I usually save
using the following code
finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file
Hi, I have to fire a few insert-into queries which use Hive partitions. I have
two Hive partitions named server and date. Now when I execute insert-into queries
using hiveContext as shown below, the query works fine
hiveContext.sql("insert into summary1
partition(server='a1',date='2015-05-22') select * from
Use the built in JDBC data source:
https://spark.apache.org/docs/latest/sql-programming-guide.html#jdbc-to-other-databases
On Mon, Jul 6, 2015 at 6:42 AM, Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi all!
what is the most efficient way to convert jdbcRDD to DataFrame.
any example?
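A sketch with the Spark 1.4 reader syntax (the URL, table name, and credentials are placeholders):

val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:mysql://host:3306/db?user=u&password=p",
  "dbtable" -> "mytable"
)).load()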
Yeah, creating a new producer at the granularity of partitions may not be
that costly.
On Mon, Jul 6, 2015 at 6:40 AM, Cody Koeninger c...@koeninger.org wrote:
Use foreachPartition, and allocate whatever the costly resource is once
per partition.
On Mon, Jul 6, 2015 at 6:11 AM, Shushant
Hi,
I'm having trouble building a recommender and would appreciate a few
pointers.
I have 350,000,000 events which are stored in roughly 500,000 S3 files and
are formatted as semi-structured JSON. These events are not all relevant to
making recommendations.
My code is (roughly):
case class
Try the coalesce function to limit the number of part files
On Mon, Jul 6, 2015 at 1:23 PM kachau umesh.ka...@gmail.com wrote:
Hi, I have a couple of Spark jobs which process thousands of files every
day. File sizes may vary from MBs to GBs. After finishing a job I usually save
using the following code
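A sketch of the suggestion (10 is an arbitrary target; tune it to your data volume):

dataFrame.coalesce(10).write.format("orc").save("/path/in/hdfs")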
What's the difference between foreachPartition vs mapPartitions for a
DStream? Both work at partition granularity.
One is a transformation and the other is an action, but if I also call an
action after mapPartitions, which one is more efficient and recommended?
On Tue, Jul 7, 2015 at 12:21 AM,
Both have the same efficiency. The primary difference is that one is a
transformation (hence it is lazy, and requires an action to actually
execute), and the other is an action.
But it may be a slightly better design in general to have transformations
be purely functional (that is, no external side
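A sketch of the difference, assuming a DStream[String]:

// mapPartitions is a transformation: lazy, returns a new DStream and
// needs an output operation such as print() before anything runs.
val lengths = dstream.mapPartitions(iter => iter.map(_.length))
lengths.print()

// foreachPartition (per RDD, via foreachRDD) is an action: it runs for
// its side effects and returns nothing.
dstream.foreachRDD(rdd => rdd.foreachPartition(iter => iter.foreach(println)))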
Hi,
I've been compiling Spark 1.4.0 with SBT, from the source tarball available
on the official website. I cannot run Spark's master, even though I have built
and run several other instances of Spark on the same machine (Spark 1.3,
master branch, pre-built 1.4, ...)
starting
Yes, RDD of batch t+1 will be processed only after RDD of batch t has been
processed. Unless there are errors where the batch completely fails to get
processed, in which case the point is moot. Just reinforcing the concept
further.
Additional information: This is true in the default configuration.
Afternoon all,
Really loving this project and the community behind it. Thank you all for
your hard work.
This past week, though, I've been having a hard time getting my first
deployed job to run without failing at the same point every time: Right
after a leftOuterJoin, most partitions (600
Not yet, though work on this feature has begun (SPARK-5133
https://issues.apache.org/jira/browse/SPARK-5133)
On Mon, Jul 6, 2015 at 4:46 PM, Sourav Mazumder sourav.mazumde...@gmail.com
wrote:
Hi,
Is there a way to get variable importance for RandomForest model created
using MLLib ? This way
Currently, Python UDFs run in Python instances and are MUCH slower than
Scala ones (from 10 to 100x). There is a JIRA to improve the
performance: https://issues.apache.org/jira/browse/SPARK-8632. After
that, they will still be much slower than Scala ones (because Python
is slower and the overhead for
I have a simple job that reads data => union => filter => map, and then
counts.
The job started 2402 tasks and read 149G of input.
I started the job with different numbers of executors
1) 1 -- 8.3 mins
2) 2 -- 5.6 mins
3) 3 -- 3.1 mins
1) Why is increasing the cores speeding up this app?
2) I started
Hi,
Is there a way to get variable importance for a RandomForest model created
using MLlib? This way one can know, among multiple features, which are the
ones contributing the most to the dependent variable.
Regards,
Sourav
Hi,
I am trying to connect a worker to the master. The Spark master is on
Cloudera Manager and I know the master IP address and port number.
I downloaded the Spark binary for CDH4 on the worker machine, and then when I
try to invoke the command
sc = sparkR.init(master="ip address:port number") I
You can bump up the number of partitions via a parameter in the join operator.
However, you have a data skew problem which you need to resolve using a
reasonable partitioning function
On 7 Jul 2015 08:57, Mohammed Omer beancinemat...@gmail.com wrote:
Afternoon all,
Really loving this project and the
Have you looked at the new Spark ML library? You can use a DataFrame directly
with the Spark ML API.
https://spark.apache.org/docs/latest/ml-guide.html
Mohammed
From: Sourav Mazumder [mailto:sourav.mazumde...@gmail.com]
Sent: Monday, July 6, 2015 10:29 AM
To: user
Subject: How to create a
You could repartition the dataframe before saving it. However, that would
impact the parallelism of the next jobs that read these files from HDFS.
Mohammed
-Original Message-
From: kachau [mailto:umesh.ka...@gmail.com]
Sent: Monday, July 6, 2015 10:23 AM
To: user@spark.apache.org
got it ,thanks.
Thanks & Best regards!
San.Luo
----- Original Message -----
From: Shixiong Zhu zsxw...@gmail.com
To: 罗辉 luohui20...@sina.com
Cc: user user@spark.apache.org
Subject: Re: How to shut down spark web UI?
Date: 2015-07-06 17:31
You can set spark.ui.enabled to false
Can you share your Hadoop configuration files please?
- etc/hadoop/core-site.xml
- etc/hadoop/hdfs-site.xml
- etc/hadoop/hadoop-env.sh
AFAIK, the following properties should be configured:
hadoop.tmp.dir, dfs.namenode.name.dir, dfs.datanode.data.dir and
dfs.namenode.checkpoint.dir
Otherwise, an
Hi,
I am trying to connect a worker to the master. The Spark master is on
Cloudera Manager and I know the master IP address and port number.
I downloaded the Spark binary for CDH4 on the worker machine, and then when
I try to invoke the command
sc = sparkR.init(master="ip address:port number") I
Hello Shivaram,
Thank you for your response. Being a novice at this stage, can you also tell
me how to configure or set the execute permission for the spark-submit file?
Thank you for your time.
Sincerely,
Ashish Dutt
On Tue, Jul 7, 2015 at 9:21 AM, Shivaram Venkataraman
Great! That's what I gathered from the thread titled "Serial batching with
Spark Streaming", but thanks for confirming this again.
On 6 July 2015 at 15:31, Tathagata Das t...@databricks.com wrote:
Yes, RDD of batch t+1 will be processed only after RDD of batch t has been
processed. Unless there
Hi.
Just a few quick comments on your question.
If you drill in (click the link of the subtasks) you can get a more detailed
view of the tasks.
One of the things reported is the time for serialization.
If that is your dominant factor, it should be reflected there, right?
Are you sure the
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/
Thanks,
Dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
Hi Florian,
It depends on a number of factors. How much data are you querying? Where is the
data stored (HDD, SSD or DRAM)? What is the file format (Parquet or CSV)?
In theory, it is possible to use Spark SQL for real-time queries, but cost
increases as the data size grows. If you can store all
Hi folks, suffering from a pretty strange issue:
Is there a way to tell what object is being successfully
serialized/deserialized? I have a maven-installed jar that works well when
fat jarred within another, but shows the following stack when marked as
provided and copied to the runtime
I am running unit tests on Spark 1.3.1 with sbt test and besides the unit
tests being incredibly slow I keep running into
java.lang.ClassNotFoundException: org.apache.spark.storage.RDDBlockId
issues. Usually this means a dependency issue, but I wouldn't know from
where...
Any help is greatly
When I've seen this error before it has been due to the spark-submit file
(i.e. `C:\spark-1.4.0\bin/bin/spark-submit.cmd`) not having execute
permissions. You can try to set execute permission and see if it fixes
things.
Also we have a PR open to fix a related problem at
It is not a bad idea. Many people use this approach.
Mohammed
-Original Message-
From: Sagi r [mailto:stsa...@gmail.com]
Sent: Monday, July 6, 2015 1:58 PM
To: user@spark.apache.org
Subject: Spark application with a RESTful API
Hi,
I've been researching spark for a couple of months
Hi,
These are the settings in my spark-conf file on the worker machine from
which I am trying to access the Spark server. I think I need to first
configure the spark-submit file too, but I do not know how. Can somebody
advise me?
# Default system properties included when running
Hi.
Have you tried to repartition the finalRDD before saving?
This link might help.
http://databricks.gitbooks.io/databricks-spark-reference-applications/content/logs_analyzer/chapter3/save_the_rdd_to_files.html
Regards,
Gylfi.
Hi.
Have you tried to enable speculative execution?
This will allow Spark to run the same sub-task of the job on other available
slots when slow tasks are encountered.
This can be passed at execution time with these params:
spark.speculation
spark.speculation.interval
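A sketch of those settings as they would appear in code (the values are illustrative, not recommendations):

import org.apache.spark.SparkConf

val conf = new SparkConf()
  .set("spark.speculation", "true")
  .set("spark.speculation.interval", "100")   // ms between checks for slow tasks
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks finished before checking
  .set("spark.speculation.multiplier", "1.5") // how much slower than the median counts as slow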
I am getting the following error for a simple Spark job.
I am running the following command
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode
cluster --master yarn
/opt/cloudera/parcels/CDH/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar
but the job doesn't show any
Try including an alias in the query:
val query = "(select * from " + table + ") a"
On Mon, Jul 6, 2015 at 3:38 AM Hafiz Mujadid hafizmujadi...@gmail.com
wrote:
Hi!
I am trying to load data from my MySQL database using the following code
val query = "select * from " + table
val url = "jdbc:mysql://" +
Hey Dean,
Sure, will take care of this.
HTH,
Denny
On Tue, Jul 7, 2015 at 10:07 Dean Wampler deanwamp...@gmail.com wrote:
Here's our home page: http://www.meetup.com/Chicago-Spark-Users/
Thanks,
Dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
Hi Cody and TD,
Just trying to understand this under the hood, but I cannot find the place
for this specific logic: once max failures are reached, the whole stream
will stop.
If possible, could you point me in the right direction?
To my understanding, the exception thrown from the job would
When using foreachPartition, the jobs that get created are not displayed on the
driver console but are visible on the web UI.
On the driver it creates some stage statistics of the form [Stage 2:
(0 + 2) / 5] which then disappear.
I am using foreachPartition as:
I used val output: RDD[(DetailInputRecord, VISummary)] =
sc.emptyRDD[(DetailInputRecord,
VISummary)] to create an empty RDD before. Give it a try; it might work for
you too.
2015-07-06 14:11 GMT-07:00 ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com:
I need to return an empty RDD of type
val output:
Hi Sparkers,
I am unable to start the spark-sql service; please check the error mentioned
below.
Exception in thread "main" java.lang.RuntimeException:
java.lang.RuntimeException: Unable to instantiate
org.apache.hadoop.hive.metastore.HiveMetaStoreClient
at
Jorn,
Thanks for your response.
I am pasting below a snippet of code which shows the Drools integration: when
facts/events are picked up after reading through a file
(FileReader.readLine()), it works as expected, and I have tested it for a
wide range of record data in a file.
The same code doesn't work
If you don't want those logs to flood your screen, you can disable them simply
with:
import org.apache.log4j.{Level, Logger}
Logger.getLogger("org").setLevel(Level.OFF)
Logger.getLogger("akka").setLevel(Level.OFF)
Thanks
Best Regards
On Sun, Jul 5, 2015 at 7:27 PM, Hellen
Try with spark.cores.max; executor cores is used when you run it
on YARN mode.
Thanks
Best Regards
On Mon, Jul 6, 2015 at 1:22 AM, nizang ni...@windward.eu wrote:
hi,
We're running spark 1.4.0 on ec2, with 6 machines, 4 cores each. We're
trying to run an application on a number of
Hi Sudarshan,
As far as i understand your problem you should take a look at broadcast
variables in spark. here you have the docs
https://spark.apache.org/docs/latest/programming-guide.html#broadcast-variables
.
Thanks
Himanshu
While the job is running, just look in the directory and see what's the root
cause of it (is it the logs? is it the shuffle? etc.). Here are a few
configuration options which you can try:
- Disable shuffle : spark.shuffle.spill=false (It might end up in OOM)
- Enable log rotation:
You can also set these in the spark-env.sh file :
export SPARK_WORKER_DIR=/mnt/spark/
export SPARK_LOCAL_DIR=/mnt/spark/
Thanks
Best Regards
On Mon, Jul 6, 2015 at 12:29 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
While the job is running, just look in the directory and see what's the
I know that Spark uses data parallelism over, say, HDFS, optimally running
computations on local data (aka data locality).
I was wondering how Spark Streaming moves data (messages) around, since the
data is streamed in as DStreams and is not on a distributed FS like HDFS.
Thanks!
It's complaining about a missing JDBC driver. Add it to your driver classpath like:
./bin/spark-sql --driver-class-path
/home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar
Thanks
Best Regards
On Mon, Jul 6, 2015 at 11:42 AM, sandeep vura sandeepv...@gmail.com wrote:
Hi Sparkers,
I am
OK, let me try
On Mon, Jul 6, 2015 at 12:38 PM, Akhil Das ak...@sigmoidanalytics.com
wrote:
It's complaining about a missing JDBC driver. Add it to your driver classpath like:
./bin/spark-sql --driver-class-path
/home/akhld/sigmoid/spark/lib/mysql-connector-java-5.1.32-bin.jar
Thanks
Best Regards
I would guess the opposite is true for highly iterative benchmarks (common in
graph processing and data science).
Spark has a pretty large overhead per iteration; more optimisations and
planning only make this worse.
Sure, people have implemented things like Dijkstra's algorithm in Spark
(a problem
Sorry, that should be shortest path, and diameter of the graph.
I shouldn't write emails before I get my morning coffee...
On 06 Jul 2015, at 09:09, Jan-Paul Bultmann janpaulbultm...@me.com wrote:
I would guess the opposite is true for highly iterative benchmarks (common in
graph processing
Hi Sim,
I think the right way to set the PermGen Size is through driver extra JVM
options, i.e.
--conf spark.driver.extraJavaOptions=-XX:MaxPermSize=256m
Can you try it? Without this conf, your driver's PermGen size is still 128m.
Thanks,
Yin
On Mon, Jul 6, 2015 at 4:07 AM, Denny Lee
Hi James,
The code below shows one way you can update the broadcast variable on
the executors:
// ... events stream setup
var startTime = new Date().getTime()
var hashMap = HashMap(1 -> (1, 1), 2 -> (2, 2))
var hashMapBroadcast =
Yin, that did the trick.
I'm curious what was the effect of the environment variable, however, as the
behavior of the shell changed from hanging to quitting when the env var value
got to 1g.
/Sim
Simeon Simeonov, Founder & CTO, Swoop (http://swoop.com/)
@simeons (http://twitter.com/simeons) |