Hi, I replied to you on SO. If Option A had an action call then it should
suffice too.
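A minimal sketch of what an action call buys you here (rdd is a placeholder):
val cached = rdd.cache()   // cache() only marks the RDD; nothing is computed yet
cached.count()             // an action forces execution, so the cache is actually populated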
On 28 Apr 2015 05:30, Eran Medan eran.me...@gmail.com wrote:
Hi Everyone!
I'm trying to understand how Spark's cache work.
Here is my naive understanding, please let me know if I'm missing
something:
val
Ah, just noticed that you are using an external package. You can add it
like this:
conf = (SparkConf().set("spark.jars", jar_path))
or if it is a python package:
sc.addPyFile()
Worked now.
On Mon, Apr 27, 2015 at 10:20 PM, Sean Owen so...@cloudera.com wrote:
Works fine for me. Make sure you're not downloading the HTML
redirector page and thinking it's the archive.
On Mon, Apr 27, 2015 at 11:43 AM, ÐΞ€ρ@Ҝ (๏̯͡๏) deepuj...@gmail.com
wrote:
I downloaded 1.3.1
Hi Mark,
That does not look like a Python path issue; the spark-assembly jar should have
those packaged, and should make them available to the workers. Have you built
the jar yourself?
arguments are values of it. The names of the arguments are important, and all
you need to do is specify them when you're creating the SparkConf object.
Glad it worked.
On Tue, Apr 28, 2015 at 5:20 PM, madhvi madhvi.gu...@orkash.com wrote:
Thank you Deepak. It worked.
Madhvi
On Tuesday 28 April 2015
Can you simply apply the
https://spark.apache.org/docs/1.3.1/api/scala/index.html#org.apache.spark.util.StatCounter
to this? You should be able to do something like this:
val stats = rdd1.map(x => x._2).stats()
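If it helps, the returned StatCounter exposes the summary fields directly; a quick sketch:
println(s"n=${stats.count} mean=${stats.mean} stdev=${stats.stdev} min=${stats.min} max=${stats.max}")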
-Todd
On Tue, Apr 28, 2015 at 10:00 AM, subscripti...@prismalytics.io
Hi,
I'm following the pattern of filtering data by certain criteria, and then
saving the results to a different table. The code below illustrates the
idea.
The simple integration test I wrote suggests it works, simply asserting
filtered data should be in their respective tables after being
Thank you Silvio,
I am aware of groupBy's limitations and it is subject to replacement.
I did try repartitionAndSortWithinPartitions but then I end up with maybe
too much shuffling, one shuffle from groupByKey and the other from repartition.
My expectation was that since N records are partitioned to
Hi,
I can offer a few ideas to investigate in regards to your issue here. I've
run into resource issues doing shuffle operations with a much smaller
dataset than 2B. The data is going to be saved to disk by the BlockManager
as part of the shuffle and then redistributed across the cluster as
Hello Friends:
I generated a Pair RDD with K/V pairs, like so:
rdd1.take(10) # Show a small sample.
[(u'2013-10-09', 7.60117302052786),
(u'2013-10-10', 9.322709163346612),
(u'2013-10-10', 28.264462809917358),
(u'2013-10-07', 9.664429530201343),
(u'2013-10-07', 12.461538461538463),
Are the tasks on the slaves also running as root? If not, that might
explain the problem.
dean
Dean Wampler, Ph.D.
Author: Programming Scala, 2nd Edition
http://shop.oreilly.com/product/0636920033073.do (O'Reilly)
Typesafe http://typesafe.com
@deanwampler http://twitter.com/deanwampler
Hi guys,
I have the following computation with 3 workers:
spark-sql --master yarn --executor-memory 3g --executor-cores 2 --driver-memory
1g -e 'select count(*) from table'
The resources used are shown as below on the UI:
I don't understand why the memory used is 15GB and vcores used is 5. I
I am having exactly the same issue. I am running HBase and Spark in Docker
containers.
Hi Sarath,
It might be questionable to set num-executors to 64 if you only have 8
nodes. Do you use any action like collect which will overwhelm the
driver since you have a large dataset?
Thanks
On Tue, Apr 28, 2015 at 10:50 AM, sarath sarathkrishn...@gmail.com wrote:
I am trying to train a
One reason Spark on disk is faster than MapReduce is Spark's advanced Directed
Acyclic Graph (DAG) engine. MapReduce requires a complex job to be split
into multiple Map-Reduce jobs, with disk I/O at the end of each job and the
beginning of the next. With Spark, you may be able to express the
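As an illustrative sketch only (paths are placeholders), a multi-step pipeline that MapReduce would split into chained jobs can run as a single Spark job with no intermediate job boundaries:
val counts = sc.textFile("hdfs:///logs")
  .flatMap(_.split("\\s+"))
  .map(word => (word, 1))
  .reduceByKey(_ + _)   // a shuffle boundary, but still the same job
  .filter(_._2 > 10)
counts.saveAsTextFile("hdfs:///out")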
Our experience is that, unless you can benefit from Spark features such as
co-partitioning that allow for more efficient execution, Spark is
slightly slower for disk-to-disk workloads.
On Apr 27, 2015 10:34 PM, bit1...@163.com bit1...@163.com wrote:
Hi,
I am frequently asked why spark is also much
1. The full command line is written in a shell script:
LIB=/home/spark/.m2/repository
/opt/spark/bin/spark-submit \
--class spark.pcap.run.TestPcapSpark \
--jars
I think currently there's no API in Spark Streaming you can use to get the
file names for file input streams. Actually it is not trivial to support
this; maybe you could file a JIRA with the wishes you want the community to
support, so anyone who is interested can take a crack at it.
Thanks
Jerry
Can you give us more information?
Such as the HBase release and the Spark release.
If you can pastebin jstack of the hanging HTable process, that would help.
BTW I used http://search-hadoop.com/?q=spark+HBase+HTable+constructor+hangs
and saw a very old thread with this subject.
Cheers
On Tue, Apr 28,
BTW, from the Spark web UI, the ACL is marked as root.
Best regards,
Lin Hao XU
IBM Research China
Email: xulin...@cn.ibm.com
My Flickr: http://www.flickr.com/photos/xulinhao/sets
From: Dean Wampler deanwamp...@gmail.com
To: Lin Hao Xu/China/IBM@IBMCN
Cc: Hai Shan Wu/China/IBM@IBMCN,
Thank you Deepak. It worked.
Madhvi
On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb",
    arguments.get("buffersize").get)
Can you show your code please?
On 28 Apr 2015 13:20, sranga sra...@gmail.com wrote:
Hi
I am getting the following error when persisting an RDD in parquet format
to
an S3 location. This is code that was working in version 1.2. The
version in which it fails is 1.3.1.
Any help is
Can you specify what 'running via PyCharm' means? How are you executing the
script, with spark-submit?
In PySpark I guess you used --jars databricks-csv.jar. With spark-submit
you might need the additional --driver-class-path databricks-csv.jar.
Neither parameter can be set via the SparkConf object.
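A hedged example of a full invocation (the script name is a placeholder):
spark-submit --jars databricks-csv.jar --driver-class-path databricks-csv.jar my_script.py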
It's probably not your code.
What's the full command line you use to submit the job?
Are you sure the job on the cluster has access to the network interface?
Can you test the receiver by itself without Spark? For example, does this
line work as expected:
List<PcapNetworkInterface> nifs =
According to the docs it should go like this:
spark://host1:port1,host2:port2
https://spark.apache.org/docs/latest/spark-standalone.html#standby-masters-with-zookeeper
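So for your two masters, presumably something like:
--master spark://master01:7077,master02:7077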
Thanks
M
On Apr 28, 2015, at 8:13 AM, James King jakwebin...@gmail.com wrote:
I have multiple masters running and I'm
Hello all,
I have the following Spark (pseudo)code:
rdd = mapPartitionsWithIndex(...)
.mapPartitionsToPair(...)
.groupByKey()
.sortByKey(comparator)
.partitionBy(myPartitioner)
.mapPartitionsWithIndex(...)
.mapPartitionsToPair( *f* )
The input
It's a Windows thing. Please escape the backslashes in the path string (or
use forward slashes). Basically it is not able to find the file.
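An illustrative sketch, with a made-up path:
sc.textFile("C:\\data\\capture.pcap")   // backslashes escaped
sc.textFile("C:/data/capture.pcap")     // forward slashes also work on Windows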
On 28 Apr 2015 22:09, Fabian Böhnlein fabian.boehnl...@gmail.com wrote:
Can you specify what 'running via PyCharm' means? How are you executing the script,
with spark-submit?
In PySpark I guess you used
That's exactly what I'm saying -- I specify the memory options using Spark
options, but this is not reflected in how the JVM is created. No matter
which memory settings I specify, the JVM for the driver is always created with
512 MB of memory. So I'm not sure if this is a feature or a bug?
rok
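One hedged explanation: in client mode the driver JVM has already started before your SparkConf is read, so spark.driver.memory must be supplied outside the program, e.g.:
spark-submit --driver-memory 2g ...
or set in spark-defaults.conf before launch.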
On
So no takers regarding why spark-defaults.conf is not being picked up.
Here is another one:
If Zookeeper is configured in Spark why do we need to start a slave like
this:
spark-1.3.0-bin-hadoop2.4/sbin/start-slave.sh 1 spark://somemaster:7077
i.e. why do we need to specify the master url
I have multiple masters running and I'm trying to submit an application
using
spark-1.3.0-bin-hadoop2.4/bin/spark-submit
with this config (i.e. a comma separated list of master urls)
--master spark://master01:7077,spark://master02:7077
But getting this exception
Hi,
I'm just calling the standard SVMWithSGD implementation of Spark's MLLib.
I'm not using any method like collect.
Thanks,
Sarath
On Tue, Apr 28, 2015 at 4:35 PM, ai he heai0...@gmail.com wrote:
Hi Sarath,
It might be questionable to set num-executors to 64 if you only have 8
nodes. Do
Actually, to simplify this problem, we run our program on a single machine
with 4 slave workers. Since it is on a single machine, I think all slave workers
run with root privilege.
BTW, if we have a cluster, how do we make sure slaves on remote machines run
the program as root?
Best regards,
Lin Hao XU
Looks to me like the same thing also applies to SparkContext.textFile or
SparkContext.wholeTextFiles: there is no way in the RDD to figure out the file
information for where the data in the RDD came from.
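A hedged workaround sketch at the Hadoop API level (paths are placeholders): HadoopRDD exposes the input split, which carries the file path:
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapred.{FileSplit, TextInputFormat}
import org.apache.spark.rdd.HadoopRDD

val raw = sc.hadoopFile("hdfs:///input", classOf[TextInputFormat],
  classOf[LongWritable], classOf[Text])
val withFileNames = raw.asInstanceOf[HadoopRDD[LongWritable, Text]]
  .mapPartitionsWithInputSplit { (split, iter) =>
    val path = split.asInstanceOf[FileSplit].getPath.toString
    iter.map { case (_, line) => (path, line.toString) }   // pair each line with its file
  }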
bit1...@163.com
From: Saisai Shao
Date: 2015-04-29 10:10
To: lokeshkumar
CC: spark users
Hi,
I have two questions regarding testing Spark jobs:
1. Is it possible to use Mockito for that purpose? I tried to use it, but
it looks like there are no interactions with mocks. I didn't dive into the
details of how Mockito works, but I guess it might be because of the
serialization and how
Credit goes to Misha Chernetsov (see SPARK-4925)
FYI
On Tue, Apr 28, 2015 at 8:25 AM, Marco marco@gmail.com wrote:
Thx Ted for the info !
2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com:
This is available for 1.3.1:
Hi,
Can you give some tutorials/examples of how to write test cases based on the
mentioned framework?
Thanks,
Sourav
On Tue, Apr 28, 2015 at 9:22 PM, Silvio Fiorito
silvio.fior...@granturing.com wrote:
Sorry that’s correct, I was thinking you were maybe trying to mock
certain aspects of Spark
Sorry that’s correct, I was thinking you were maybe trying to mock certain
aspects of Spark core to write your tests. This is a library to help write unit
tests by managing the SparkContext and StreamingContext. So you can test your
transformations as necessary. More importantly on the
So the other issue could be due to the fact that, by using mapPartitions after the
partitionBy, you essentially lose the partitioning of the keys, since Spark
assumes the keys were altered in the map phase. So really the partitionBy gets
lost after the mapPartitions; that's why you need to do it again.
If you need to keep the keys, you can use aggregateByKey to calculate an avg of
the values:
val step1 = data.aggregateByKey((0.0, 0))((a, b) => (a._1 + b, a._2 + 1), (a,
b) => (a._1 + b._1, a._2 + b._2))
val avgByKey = step1.mapValues(i => i._1 / i._2)
Essentially, what this is doing is passing an
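A hedged footnote on the first point: if a mapPartitions genuinely leaves the keys alone, you can tell Spark so and keep the partitioner (pairRdd and transform are placeholders):
val kept = pairRdd.mapPartitions(
  iter => iter.map { case (k, v) => (k, transform(v)) },
  preservesPartitioning = true)   // opts out of the partitioner being discarded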
Thx Ted for the info !
2015-04-27 23:51 GMT+02:00 Ted Yu yuzhih...@gmail.com:
This is available for 1.3.1:
http://mvnrepository.com/artifact/org.apache.spark/spark-hive-thriftserver_2.10
FYI
On Mon, Feb 16, 2015 at 7:24 AM, Marco marco@gmail.com wrote:
Ok, so will it be only
Hi Michal,
Please try spark-testing-base by Holden. I’ve used it and it works well for
unit testing batch and streaming jobs
https://github.com/holdenk/spark-testing-base
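A minimal sketch of a batch test, assuming ScalaTest on the classpath (names are illustrative):
import com.holdenkarau.spark.testing.SharedSparkContext
import org.scalatest.FunSuite

class WordCountSpec extends FunSuite with SharedSparkContext {
  test("counts words") {
    // sc is provided and managed by SharedSparkContext
    val counts = sc.parallelize(Seq("a", "b", "a"))
      .map((_, 1)).reduceByKey(_ + _).collectAsMap()
    assert(counts("a") === 2)
  }
}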
Thanks,
Silvio
From: Michal Michalski
Date: Tuesday, April 28, 2015 at 11:32 AM
To: user
Subject: Best practices on
I was wondering about the same thing.
Vadim
On Tue, Apr 28, 2015 at 10:19 PM, bit1...@163.com bit1...@163.com wrote:
Looks to me like the same thing also applies to SparkContext.textFile
or SparkContext.wholeTextFiles: there is no way in the RDD to figure out the
file information for where the
I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.
Here is the jstack trace. Complete stack trace attached.
"Executor task launch worker-1" #58 daemon prio=5 os_prio=0
tid=0x7fd3d0445000 nid=0x488 waiting on condition [0x7fd4507d9000]
java.lang.Thread.State: TIMED_WAITING (sleeping)
Hi,
I'm wondering about the use-case where you're not doing continuous,
incremental streaming of data out of Kafka but rather want to publish data
once with your Producer(s) and consume it once, in your Consumer, then
terminate the consumer Spark job.
JavaStreamingContext jssc = new
I think it might be useful in Spark Streaming's file input stream, but I'm not
sure it is useful in SparkContext#textFile, since we specify the file
ourselves, so why would we still need to know the file name?
I will open up a JIRA to mention this feature.
Thanks
Jerry
2015-04-29 10:49 GMT+08:00
I was having this issue when my batch interval was very big -- like 5
minutes. When my batch interval is
smaller, I don't get this exception. Can someone explain to me why this
might be happening?
Vadim
On Tue, Apr 28, 2015 at 4:26 PM, Vadim Bichutskiy
vadim.bichuts...@gmail.com wrote:
I am
Hi
In a multi-node setup, I am invoking a number of external apps through
Runtime.getRuntime.exec from an rdd.map function, and would like to track their
completion status. Evidently, such calls spawn a separate thread, which is not
tracked by the standalone scheduler, i.e., reduce or collect are
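A hedged sketch of one way to block on completion inside the task itself (the command path is made up):
val exitCodes = rdd.map { item =>
  val p = Runtime.getRuntime.exec(Array("/usr/local/bin/mytool", item.toString))
  (item, p.waitFor())   // waitFor blocks the task until the external process exits
}
exitCodes.collect()     // the driver then sees an exit status per element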
For SparkContext#textFile, if a directory is given as the path parameter,
then it will pick up the files in the directory, so the same thing will occur.
bit1...@163.com
From: Saisai Shao
Date: 2015-04-29 10:54
To: Vadim Bichutskiy
CC: bit1...@163.com; lokeshkumar; user
Subject: Re: Re:
How did you distribute hbase-site.xml to the nodes ?
Looks like HConnectionManager couldn't find the hbase:meta server.
Cheers
On Tue, Apr 28, 2015 at 9:19 PM, Tridib Samanta tridib.sama...@live.com
wrote:
I am using Spark 1.2.0 and HBase 0.98.1-cdh5.1.0.
Here is the jstack trace. Complete
I am not 100% sure how it's picking up the configuration. I copied the
hbase-site.xml to the hdfs/spark cluster (single machine). I also included
hbase-site.xml in the spark-job jar files. The spark-job jar file also has yarn-site
and mapred-site and core-site.xml in it.
One interesting thing is, when I run
Hi all,
I have an issue. I added some timestamps in Spark source code and built it
using:
mvn package -DskipTests
I checked the new version on my own computer and it works. However, when I ran
Spark on EC2, the Spark code the EC2 machines ran was the original version.
Does anyone know how to deploy
Has anyone tried using Solr inside Spark?
below is the project describing it.
https://github.com/LucidWorks/spark-solr.
I have a requirement in which I want to index 20 million company names
and then search as and when new data comes in. The output should be a list of
companies matching the
Thanks for the reply.
Will the Elasticsearch index be within my cluster, or do I need to host
Elasticsearch separately?
On 28 April 2015 at 22:03, Nick Pentreath nick.pentre...@gmail.com wrote:
I haven't used Solr for a long time, and haven't used Solr in Spark.
However, why do you say
I haven't used Solr for a long time, and haven't used Solr in Spark.
However, why do you say Elasticsearch is not a good option ...? ES
absolutely supports full-text search and not just filtering and grouping
(in fact its original purpose was, and still is, text search, though
filtering, grouping
AFAIK Datastax is looking at it heavily; they have a good integration of
Cassandra with it. The next step was clearly to have a strong combination of the
three in one of the coming releases.
On Tue, Apr 28, 2015 at 18:28, Jeetendra Gangele gangele...@gmail.com
wrote:
Does anyone tried using solr
Hi experts,
Trying to use the string-slicing functionality as part of a Spark
program (PySpark), I get this error:
Code
import pandas as pd
from pyspark.sql import SQLContext
hc = SQLContext(sc)
A = pd.DataFrame({'Firstname': ['James', 'Ali', 'Daniel'], 'Lastname':
['Jones',
Thanks again Ayan! To close the loop on this issue, I have filed the below
JIRA to track the issue:
https://issues.apache.org/jira/browse/SPARK-7197
On Fri, Apr 24, 2015 at 8:21 PM, ayan guha guha.a...@gmail.com wrote:
I just tested, your observation in DataFrame API is correct. It behaves
[-dev] [+user]
This is a question for the user list, not the dev list.
Use the --spark-version and --spark-git-repo options to specify your own
repo and hash to deploy.
Source code link.
https://github.com/apache/spark/blob/268c419f1586110b90e68f98cd000a782d18828c/ec2/spark_ec2.py#L189-L195
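A hedged example invocation (key names, repo URL, and commit hash are placeholders):
./spark-ec2 -k mykey -i mykey.pem \
  --spark-git-repo=https://github.com/yourname/spark \
  --spark-version=<your-commit-hash> \
  launch my-cluster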
I have a Spark app that completes in 45 mins for 5 files (5*750MB size),
and it takes 16 executors to do so.
I wanted to run it against 10 files of each input type (10*3 files, as there
are three inputs that are transformed). [Input1 = 10*750 MB,
Input2 = 10*2.5 GB, Input3 = 10*1.5 GB], hence I used
Hi Forum,
Using Spark Streaming and listening to the files in HDFS using the
textFileStream/fileStream methods, how do we get the file names that are
read by these methods?
I used textFileStream, which has the file contents in a JavaDStream, and I had no
success with fileStream, as it is throwing me a
I am using Spark Streaming to monitor an S3 bucket. Everything appears to
be fine. But every batch interval I get the following:
*15/04/28 16:12:36 WARN HttpMethodReleaseInputStream: Attempting to release
HttpMethod in finalize() as its response data stream has gone out of scope.
This attempt
Richard,
The same problem occurs with sort.
I have enough disk space in the tmp folder. The errors in the logs say out of memory.
I wonder what it holds in memory?
Alexander
From: Richard Marscher [mailto:rmarsc...@localytics.com]
Sent: Tuesday, April 28, 2015 7:34 AM
To: Ulanov, Alexander
Cc:
Hi,
You can apply this patch https://github.com/apache/spark/pull/5354 and
recompile.
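One hedged way to apply the PR from a Spark source checkout (assuming origin points at the GitHub repo):
git fetch origin pull/5354/head:pr-5354   # GitHub exposes each PR under pull/<n>/head
git merge pr-5354
mvn -DskipTests package                   # then rebuild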
Hope this helps,
Calvin
On Tue, Apr 28, 2015 at 1:19 PM, sara mustafa eng.sara.must...@gmail.com
wrote:
Hi Zhang,
How did you compile Spark 1.3.1 with Tachyon? When I changed the Tachyon
version to 0.6.3 in
yes
On 29 Apr 2015 03:31, ayan guha guha.a...@gmail.com wrote:
Is your driver running on the same machine as the master?
On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote:
Hi,
I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to
Hi,
I'm running the following code in my cluster (standalone mode) via spark
shell -
val rdd = sc.parallelize(1 to 100)
rdd.count
This takes around 1.2s to run.
Is this expected or am I configuring something wrong?
I'm using about 30 cores with 512MB executor memory
As expected, GC time is
Is your driver running on the same machine as the master?
On 29 Apr 2015 03:59, Anshul Singhle ans...@betaglide.com wrote:
Hi,
I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to be able to complete my jobs (on the
cached rdd) in under 1 sec.
Hi,
I would like to collect some metrics from Spark and plot them with
Graphite. I managed to do that with the metrics provided by
org.apache.spark.metrics.source.JvmSource, but I would like to know if there
are other sources available besides this one.
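For reference, a minimal sketch of the Graphite wiring in conf/metrics.properties (host/port are placeholders):
*.sink.graphite.class=org.apache.spark.metrics.sink.GraphiteSink
*.sink.graphite.host=graphite.example.com
*.sink.graphite.port=2003
*.sink.graphite.period=10
master.source.jvm.class=org.apache.spark.metrics.source.JvmSource
worker.source.jvm.class=org.apache.spark.metrics.source.JvmSource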
Best,
Giovanni
Hi all.
I was launching a Spark SQL job on my own machine, not on the Spark cluster
machines, and it failed. The exception info is:
15/04/28 16:28:04 INFO yarn.ApplicationMaster: Final app status: FAILED,
exitCode: 15, (reason: User class threw exception: java.lang.RuntimeException:
Unable to
Hey all,
I'm trying to create tables from existing Parquet data in different
schemata. The following isn't working for me:
CREATE DATABASE foo;
CREATE TABLE foo.bar
USING com.databricks.spark.avro
OPTIONS (path '...');
-- Error: org.apache.spark.sql.AnalysisException: cannot recognize input
The alias function is not in Python yet. I suggest writing SQL if your data
suits it.
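A hedged sketch of the SQL route (table and column names are made up):
df.registerTempTable("people")   // Spark 1.3-era DataFrame API
val renamed = sqlContext.sql("SELECT name AS full_name FROM people")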
On 28 Apr 2015 14:42, Don Drake dondr...@gmail.com wrote:
https://issues.apache.org/jira/browse/SPARK-7182
Can anyone suggest a workaround for the above issue?
Thanks.
-Don
--
Donald Drake
Drake Consulting
val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb", arguments.get("buffersize").get)
  .set("spark.kryoserializer.buffer.max.mb", arguments.get("maxbuffersize").get)
Depends on your use case and search volume. Typically you'd have a dedicated ES
cluster if your app is doing a lot of real-time indexing and search.
If it's only for Spark integration then you could colocate ES and Spark.
—
Sent from Mailbox
On Tue, Apr 28, 2015 at 6:41 PM, Jeetendra
Are you using a Spark build that matches your YARN cluster version?
That seems like it could happen if you're using a Spark built against
a newer version of YARN than you're running.
On Thu, Apr 2, 2015 at 12:53 AM, 董帅阳 917361...@qq.com wrote:
spark 1.3.0
spark@pc-zjqdyyn1:~ tail
Hi,
I'm running short spark jobs on rdds cached in memory. I'm also using a
long running job context. I want to be able to complete my jobs (on the
cached rdd) in under 1 sec.
I'm getting the following job times with about 15 GB of data distributed
across 6 nodes. Each executor has about 20GB of
In a normal MR job, can I configure (cluster-wide) the default number of reducers,
if I don't specify any reducers in my job?
The initializer is a tuple, (0, 0); it seems you just have 0.
From: subscripti...@prismalytics.io
Organization: PRISMALYTICS, LLC.
Reply-To: subscripti...@prismalytics.io
Date: Tuesday, April 28, 2015 at 1:28 PM
To: Silvio
Hi Everyone,
Does anyone have example code for generating a graph from a file of edge
name-edge name tuples? I've seen the example where a Graph is generated from
an RDD of triplets composed of edge longs, but I'd like to see an example
where a graph is built from an edge-name/edge-name file, such
Thank you Todd, Silvio...
I had to stare at Silvio's answer for a while.
If I'm interpreting the aggregateByKey() statement correctly ...
(Within-Partition Reduction Step)
a: is a TUPLE that holds: (runningSum, runningCount).
b: is a SCALAR that holds the next Value
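Restating Silvio's snippet with comments, to check that reading (a sketch):
val step1 = data.aggregateByKey((0.0, 0))(
  (a, b) => (a._1 + b, a._2 + 1),        // within a partition: a = (runningSum, runningCount), b = the next value
  (a, b) => (a._1 + b._1, a._2 + b._2))  // across partitions: merge two (sum, count) tuples
val avgByKey = step1.mapValues(i => i._1 / i._2)   // mean per key = sum / count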
Hi experts,
I have an issue. I added some timestamps in Spark source code and built it
using:
mvn package -DskipTests
I checked the new version on my own computer and it works. However, when I ran
Spark on EC2, the Spark code the EC2 machines ran was the original version.
Does anyone know how to
Hi,
Just new to Spark and in need of some help framing the problem I
have. A problem well stated is half solved, as the saying goes :)
Let's say that I have a DStream[String] basically containing Json of
some measurements from IoT devices. In order to keep it simple say that
after
I am trying to train a large dataset consisting of 8 million data points and
20 million features using SVMWithSGD. But it is failing after running for
some time. I tried increasing num-partitions, driver-memory,
executor-memory, driver-max-resultSize. Also I tried reducing the size of
the dataset
Hi Forum
I am facing the below compile error when using the fileStream method of the
JavaStreamingContext class.
I have copied the code from the JavaAPISuite.java test class of the Spark test code.
The error message is
Hi Puneith,
Please provide the code if you may. It will be helpful.
Thank you,
I was able to solve this problem by hard-coding JAVA_HOME inside the
org.apache.spark.deploy.yarn.Client.scala class:
val commands = prefixEnv ++ Seq(
  // was: YarnSparkHadoopUtil.expandEnvironment(Environment.JAVA_HOME) + "/bin/java", "-server"
  "/usr/java/jdk1.7.0_51/bin/java", "-server")
Somehow
How about:
JavaPairDStream<LongWritable, Text> input =
jssc.fileStream(inputDirectory, LongWritable.class, Text.class,
TextInputFormat.class);
See the complete example over here
On 27 Apr 2015, at 07:51, ÐΞ€ρ@Ҝ (๏̯͡๏)
deepuj...@gmail.com wrote:
Spark 1.3
1. View stderr/stdout from executor from the Web UI: when the job is running I
figured out the executor that I am supposed to see, and those two links show 4
special characters in the browser.
2.
I've tried running your code through spark-shell on both 1.3.0 (pre-built for
Hadoop 2.4 and above) and a recently built snapshot of master. Both work
fine. Running on OS X yosemite. What's your configuration?
On Tuesday 28 April 2015 01:39 PM, ÐΞ€ρ@Ҝ (๏̯͡๏) wrote:
val conf = new SparkConf()
  .setAppName("detail")
  .set("spark.serializer",
    "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryoserializer.buffer.mb",
    arguments.get("buffersize").get)
Option B would be fine. As the answer in the SO thread itself says, "Since RDD
transformations merely build DAG descriptions without execution, in Option
A by the time you call unpersist, you still only have job descriptions and
not a running execution."
Also note, in Option A, you are not specifying any
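A hedged illustration of why that ordering matters (input, expensive, and keep are placeholders):
val cached = input.map(expensive).cache()
val out = cached.filter(keep)   // still only a DAG description, nothing has run
cached.unpersist()              // Option A: unpersist before any action has run
out.count()                     // the action now executes without the cache benefit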
Hi,
I'm trying to figure out how to use a third party jar inside a python
program which I'm running via PyCharm in order to debug it. I am normally
able to run spark code in python such as this:
spark_conf = SparkConf().setMaster('local').setAppName('test')
sc =