I'm not a Spark expert but:
What Spark does is run receivers in the executors.
These receivers are long-running tasks: each receiver occupies 1 core in
your executor, and if an executor has more cores than receivers it can also
process (at least some of) the data that it is receiving.
So, make sure you allocate enough cores for both receiving and processing.
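As a minimal sketch of what that means for sizing (the app name, socket
source, host, and port are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: the receiver pins one core, the second core processes batches.
// With local[1] the receiver would starve the processing tasks.
val conf = new SparkConf().setAppName("ReceiverCoresDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// One socket receiver = one long-running task occupying one core.
val lines = ssc.socketTextStream("localhost", 9999) // placeholder host/port
lines.count().print()

ssc.start()
ssc.awaitTermination()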
1. If you join on some specific field (e.g. user-id or account-id or
whatever), you may try to partition the parquet file by this field; the
join will then be more efficient (see the sketch after this list).
2. You need to check in the Spark metrics how the particular join performs:
how many partitions there are and what the shuffle size is.
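A minimal sketch of point 1 (the `events` and `users` DataFrames, the
user_id column, and the paths are all assumptions for illustration):

// Write the data partitioned on the join key.
events.write
  .partitionBy("user_id")
  .parquet("hdfs:///data/events_by_user")

val byUser = sqlContext.read.parquet("hdfs:///data/events_by_user")

// The partitioned layout lets Spark prune files when user_id is
// constrained, which can cut the work done for the join.
val joined = byUser.join(users, "user_id")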
I believe you’ll have to use the other way of creating a StreamingContext:
passing a create function to getOrCreate.
private def setupSparkContext(): StreamingContext = {
  val streamingSparkContext = {
    val sparkConf = new SparkConf().setAppName(config.appName).setMaster(config.master)
    // assumes the batch interval is also carried in config
    new StreamingContext(sparkConf, Seconds(config.batchInterval))
  }
  streamingSparkContext
}
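A minimal sketch of the getOrCreate pattern (the app name, checkpoint path,
and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myApp" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("myApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Define all DStreams here, inside the create function, and set the
  // checkpoint directory before returning; otherwise recovery cannot
  // rebuild the streams.
  ssc.checkpoint(checkpointDir)
  ssc
}

// Fresh start: calls createContext(). After a failure: rebuilds the
// context (and its streams) from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()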
Within a pyspark shell, both of these work for me:
print hc.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
print sqlCtx.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
But when I submit both of those in batch mode (hc and sqlCtx both exist), I
get the following error. Why is this?
Hi,
I just started a new Spark Streaming project. In this phase of the system
all we want to do is save the data we receive to HDFS. After running for
a couple of days it looks like I am missing a lot of data. I wonder if
saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I captured.
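If every batch calls RDD.saveAsTextFile on the same fixed path, each batch
can clobber or collide with the previous one. A minimal sketch of the
per-batch alternative, DStream.saveAsTextFiles, which suffixes each output
directory with the batch time (the source and paths are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RawCapture")
val ssc = new StreamingContext(conf, Seconds(60))

val raw = ssc.socketTextStream("localhost", 9999) // placeholder source

// Writes each batch to hdfs:///rawStreamingData-<batchTimeMs>.txt,
// so batches never overwrite each other.
raw.saveAsTextFiles("hdfs:///rawStreamingData", "txt")

ssc.start()
ssc.awaitTermination()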
What a BEAR! The following recipe worked for me (it took a couple of days
of hacking).
I hope this improves the out-of-the-box experience for others.
Andy
My test program is now
In [1]:
from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
In [2]:
print textFile.count()  # assumed smoke test
I am trying to run a simple word count job in Spark but I am getting an
exception while running the job.

For more detailed output, check application tracking page:
http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
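For reference, a minimal word count of the kind described, runnable with
spark-submit (the input and output paths are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/words.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/wordcounts") // placeholder path
    sc.stop()
  }
}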
I am pasting the code here. Please let me know if there is anything wrong
with the sequence.
def createContext(checkpointDirectory: String, config: Config): StreamingContext = {
  println("Creating new context")
  val conf = new SparkConf(true).setAppName(appName).set("spark.streaming
Hmm,
it seems that’s just a trick.
Using this method, it’s very hard to recover from failure, since we don’t
know which batches have been done.
I really want to maintain the whole running stats in memory to achieve full
failure tolerance.
I just wonder if the performance of data checkpointing is really that bad.
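For running stats, the checkpoint-backed route would look roughly like this
sketch with updateStateByKey (the source, key type, and paths are
assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RunningStats")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/stats") // required for stateful DStreams

val events = ssc.socketTextStream("localhost", 9999) // placeholder source

// Running count per key; after a restart the state comes back from the
// checkpoint instead of living only in memory.
val updateFn: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

val counts = events.map(word => (word, 1L)).updateStateByKey(updateFn)
counts.print()

ssc.start()
ssc.awaitTermination()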
This is probably because your config option does not actually take effect.
Please refer to the email thread titled “How to set memory for SparkR with
master="local[*]"”, which may answer your question.
I recommend trying SparkR built from the master branch, which contains two
fixes that may help.