running dockerized spark applications in DC/OS

2017-08-31 Thread sujeet jog
Folks, does anybody have production experience running dockerized Spark applications on DC/OS, and can the Spark cluster run in a mode other than Spark standalone? What are the major differences between running Spark with the Mesos cluster manager vs. running Spark as a dockerized container under

Re: Checkpointing With NOT Serializable

2017-08-31 Thread ayan guha
Any help on this? On Thu, Aug 31, 2017 at 10:30 AM, ayan guha wrote: > Hi > > Want to understand a basic issue. Here is my code: > > def createStreamingContext(sparkCheckpointDir: String,batchDuration: Int > ) = { > > val ssc = new StreamingContext(spark.sparkContext,

[SPARK-SQL] Spark Persist slower than non-persist calls

2017-08-31 Thread saurabh raval
Spark 2.1 My settings are: Running Spark 2.1 on a 3-node YARN cluster with 160 GB. Dynamic allocation turned on. spark.executor.memory=6G, spark.executor.cores=6 First, I am reading Hive tables: orders (329MB) and lineitems (1.43GB) and doing a left outer join. Next, I apply 7 different filter

Cloudwatch metrics sink problem

2017-08-31 Thread Mikhailau, Alex
I am getting the following in the logs: Sink class org.apache.spark.metrics.sink.CloudwatchSink cannot be instantiated due to CloudwatchSink ClassNotFoundException. I am running this on EMR 5.7.0. Does anyone have experience adding this sink to an EMR cluster? Thanks, Alex

Re: [Structured Streaming]Data processing and output trigger should be decoupled

2017-08-31 Thread 张万新
I think something like a state store can be used to keep the intermediate data. For aggregations, the engine keeps processing batches of data and updates the results in the state store (or something like it), and when a trigger begins the engine just fetches the current result from the state store and outputs it to

Inconsistent results with combineByKey API

2017-08-31 Thread Swapnil Shinde
Hello all, I am observing some strange results with the aggregateByKey API, which is implemented with combineByKey. Not sure if this is by design or a bug. I created this toy example, but the same problem can be observed on large datasets as well: case class ABC(key: Int, c1: Int, c2: Int) case class

Delay in Initial Execution Time in Spark for Benchmarking Parquet

2017-08-31 Thread austinldv
Hello, We are running a few Star Schema Benchmarks on Parquet performance in Spark. We have set up the functions below (shown at the bottom) for getting runtimes. This is a simplified version of Spark's benchmark API. The benchmarks are called with 1 warmup and 10 runs. SSB link:

Re: Issue with Spark Twitter Streaming

2017-08-31 Thread grumi
Hey, did you manage to solve the problem? I have exactly the same problem and I am not able to solve it. Thank you!

Re: Failing jobs with Spark 2.2 running on Yarn with HDFS

2017-08-31 Thread Jan-Hendrik Zab
This was indeed caused by the network backend, which dropped several outgoing packets. I'm not sure why this wasn't "caught" by TCP. We ended up setting send_queue_size=256 recv_queue_size=512 for ib_ipoib and krcvqs=4 for hfi1. We also updated our OmniPath switch firmware to the current

[Structured Streaming]Usage of watermark

2017-08-31 Thread KevinZwx
Hi, I'm a little confused about the usage of watermarks in SS. According to the guideline, when we use a window-based grouping, SS will automatically handle late events, and we should use a watermark to limit the state, like this (specify a watermark before groupBy): val words = ... // streaming

Re: Why do checkpoints work the way they do?

2017-08-31 Thread 张万新
So are there any documents describing under what conditions my application can recover from the same checkpoint and under what conditions it cannot? Tathagata Das wrote on Wed, Aug 30, 2017 at 1:20 PM: > Hello, > > This is an unfortunate design on my part when I was building DStreams :) > >