I'm not a Spark expert but:
What Spark does is run receivers in the executors.
These receivers are long-running tasks: each receiver occupies 1 core in
your executor, and if an executor has more cores than receivers it can also
process (at least some of) the data that it is receiving.
So, make sure you allocate enough cores for both receiving and processing.
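As a minimal sketch of what that means for sizing (the app name, socket
source, host, and port are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// local[2]: the receiver pins one core, the second core processes batches.
// With local[1] the receiver would starve the processing tasks.
val conf = new SparkConf().setAppName("ReceiverCoresDemo").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(10))

// One socket receiver = one long-running task occupying one core.
val lines = ssc.socketTextStream("localhost", 9999) // placeholder host/port
lines.count().print()

ssc.start()
ssc.awaitTermination()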
1. If you join on some specific field (e.g. user-id or account-id or
whatever), you may try to partition the parquet file by this field; the
join will then be more efficient (see the sketch after this list).
2. You need to check in the Spark metrics how the particular join performs:
how many partitions there are and what the shuffle size is.
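A minimal sketch of point 1 (the `events` and `users` DataFrames, the
user_id column, and the paths are all assumptions for illustration):

// Write the data partitioned on the join key.
events.write
  .partitionBy("user_id")
  .parquet("hdfs:///data/events_by_user")

val byUser = sqlContext.read.parquet("hdfs:///data/events_by_user")

// The partitioned layout lets Spark prune files when user_id is
// constrained, which can cut the work done for the join.
val joined = byUser.join(users, "user_id")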
I believe you’ll have to use the other way of creating a StreamingContext:
passing a create function to getOrCreate.
private def setupSparkContext(): StreamingContext = {
  val streamingSparkContext = {
    val sparkConf = new SparkConf().setAppName(config.appName).setMaster(config.master)
    // assumes the batch interval is also carried in config
    new StreamingContext(sparkConf, Seconds(config.batchInterval))
  }
  streamingSparkContext
}
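A minimal sketch of the getOrCreate pattern (the app name, checkpoint path,
and batch interval are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val checkpointDir = "hdfs:///checkpoints/myApp" // placeholder path

def createContext(): StreamingContext = {
  val conf = new SparkConf().setAppName("myApp")
  val ssc = new StreamingContext(conf, Seconds(10))
  // Define all DStreams here, inside the create function, and set the
  // checkpoint directory before returning; otherwise recovery cannot
  // rebuild the streams.
  ssc.checkpoint(checkpointDir)
  ssc
}

// Fresh start: calls createContext(). After a failure: rebuilds the
// context (and its streams) from the checkpoint instead.
val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
ssc.start()
ssc.awaitTermination()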
Within a pyspark shell, both of these work for me:
print hc.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
print sqlCtx.sql("SELECT * from raw.location_tbl LIMIT 10").collect()
But when I submit both of those in batch mode (hc and sqlCtx both exist), I
get the following error. Why is this?
Hi,
I just started a new Spark Streaming project. In this phase of the system
all we want to do is save the data we receive to HDFS. After running for
a couple of days it looks like I am missing a lot of data. I wonder if
saveAsTextFile("hdfs:///rawSteamingData"); is overwriting the data I captured.
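If every batch calls RDD.saveAsTextFile on the same fixed path, each batch
can clobber or collide with the previous one. A minimal sketch of the
per-batch alternative, DStream.saveAsTextFiles, which suffixes each output
directory with the batch time (the source and paths are assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RawCapture")
val ssc = new StreamingContext(conf, Seconds(60))

val raw = ssc.socketTextStream("localhost", 9999) // placeholder source

// Writes each batch to hdfs:///rawStreamingData-<batchTimeMs>.txt,
// so batches never overwrite each other.
raw.saveAsTextFiles("hdfs:///rawStreamingData", "txt")

ssc.start()
ssc.awaitTermination()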
What a BEAR! The following recipe worked for me (it took a couple of days
of hacking).
I hope this improves the out-of-the-box experience for others.
Andy
My test program is now
In [1]:
from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
In [2]:
print textFile.count()  # assumed smoke test
I am trying to run a simple word count job in Spark but I am getting an
exception while running the job.

For more detailed output, check application tracking page:
http://quickstart.cloudera:8088/proxy/application_1446699275562_0006/
Then, click on links to logs of each attempt.
Diagnostics: Exception from container-launch.
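For reference, a minimal word count of the kind described, runnable with
spark-submit (the input and output paths are assumptions):

import org.apache.spark.{SparkConf, SparkContext}

object WordCount {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("WordCount"))
    val counts = sc.textFile("hdfs:///input/words.txt") // placeholder path
      .flatMap(_.split("\\s+"))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    counts.saveAsTextFile("hdfs:///output/wordcounts") // placeholder path
    sc.stop()
  }
}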
I am pasting the code here. Please let me know if there is anything wrong
with the sequence.
def createContext(checkpointDirectory: String, config: Config): StreamingContext = {
  println("Creating new context")
  val conf = new SparkConf(true).setAppName(appName).set("spark.streaming
Hmm,
it seems that’s just a trick.
Using this method, it’s very hard to recover from failure, since we don’t
know which batches have been done.
I really want to maintain the whole running stats in memory to achieve full
failure tolerance.
I just wonder if the performance of data checkpointing is really that bad.
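For running stats, the checkpoint-backed route would look roughly like this
sketch with updateStateByKey (the source, key type, and paths are
assumptions):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("RunningStats")
val ssc = new StreamingContext(conf, Seconds(10))
ssc.checkpoint("hdfs:///checkpoints/stats") // required for stateful DStreams

val events = ssc.socketTextStream("localhost", 9999) // placeholder source

// Running count per key; after a restart the state comes back from the
// checkpoint instead of living only in memory.
val updateFn: (Seq[Long], Option[Long]) => Option[Long] =
  (values, state) => Some(values.sum + state.getOrElse(0L))

val counts = events.map(word => (word, 1L)).updateStateByKey(updateFn)
counts.print()

ssc.start()
ssc.awaitTermination()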
This is probably because your config option does not actually take effect.
Please refer to the email thread titled “How to set memory for SparkR with
master="local[*]"”, which may answer your question.
I recommend trying SparkR built from the master branch, which contains two
fixes that may help.