New to Spark 2.2.1 - Problem with finding tables between different metastore DBs

2018-02-06 Thread Subhajit Purkayastha
All, I am new to Spark 2.2.1. I have a single-node cluster and have also enabled the Thrift server so that my Tableau application can connect to my persisted table. I suspect that the Spark cluster metastore is different from the Thrift server metastore. If this assumption is valid, what do I need to
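
As a point of reference, a common cause of this symptom is that the Spark application and the Thrift server each create their own embedded Derby metastore (a local metastore_db directory), so tables saved by one are invisible to the other. Below is a minimal Scala sketch of the usual fix, assuming both processes share the same hive-site.xml (one metastore) and the same warehouse directory; the path and table name are illustrative only:

    // Illustrative only: this application and the Thrift server must resolve
    // to the same metastore and warehouse for the table to be visible to both.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("persist-table-example")
      .config("spark.sql.warehouse.dir", "/shared/spark-warehouse") // hypothetical shared path
      .enableHiveSupport() // use the Hive metastore configured in hive-site.xml
      .getOrCreate()

    spark.range(10).write.saveAsTable("my_persisted_table") // hypothetical table name

The Thrift server started with sbin/start-thriftserver.sh then needs to pick up that same hive-site.xml (e.g. from SPARK_HOME/conf) for Tableau to see the table.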

Re: Spark Streaming withWatermark

2018-02-06 Thread Tathagata Das
That may very well be possible. The watermark delay guarantees that any record newer than or equal to the watermark (that is, max event time seen - 20 seconds) will be considered and never ignored. It does not guarantee the other way around, that is, it does NOT guarantee that records older than the
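
To make the direction of that guarantee concrete (my own worked example, not from the original mail): if the maximum event time seen so far is 2018-02-06 12:00:03 and the delay is 20 seconds, the watermark is 2018-02-06 11:59:43. A record stamped 12:00:00 is newer than the watermark and is guaranteed to be counted; a record stamped 11:59:00 is older than the watermark and may be dropped, but the engine is free to still include it, for example when it arrives in the same micro-batch.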

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Could it be that these messages were processed in the same micro-batch? In that case, the watermark is updated only after the batch finishes, so it had no effect on the late data in the current batch. On Tue, Feb 6, 2018 at 4:18 PM Jiewen Shao wrote: > Ok,
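
In other words (my reading of the point above): the watermark applied while processing a trigger is the one computed at the end of the previous trigger. If 2018-02-06 11:59:00 arrives in the very first batch together with the 12:00:0x records, the watermark is still at its initial value during that batch and the "late" record is counted; only the next batch would run with the advanced watermark of 11:59:43.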

Re: Spark Streaming withWatermark

2018-02-06 Thread Jiewen Shao
Ok, thanks for the confirmation. So based on my code, I have messages with the following timestamps (converted to a more readable format), in the following order: 2018-02-06 12:00:00, 2018-02-06 12:00:01, 2018-02-06 12:00:02, 2018-02-06 12:00:03, 2018-02-06 11:59:00 <-- this message should not be counted,

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Yes, that is correct. On Tue, Feb 6, 2018 at 4:56 PM, Jiewen Shao wrote: > Vishnu, thanks for the reply > so "event time" and "window end time" have nothing to do with current > system timestamp, watermark moves with the higher value of "timestamp" > field of the input

Sharing spark executor pool across multiple long running spark applications

2018-02-06 Thread Nirav Patel
Currently a SparkContext and its executor pool are not shareable. Each SparkContext gets its own executor pool for the entire life of an application. So what are the best ways to share cluster resources across multiple long-running Spark applications? The only one I see is Spark dynamic allocation, but it has
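
For reference, a minimal Scala sketch of the dynamic-allocation settings being referred to; the values are illustrative and the external shuffle service must also be running on the cluster nodes:

    // Illustrative values: executors are released when idle and re-acquired on
    // demand, so several long-running applications can share the same cluster.
    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("shared-cluster-app")
      .config("spark.dynamicAllocation.enabled", "true")
      .config("spark.shuffle.service.enabled", "true") // required for dynamic allocation
      .config("spark.dynamicAllocation.minExecutors", "1")
      .config("spark.dynamicAllocation.maxExecutors", "20")
      .config("spark.dynamicAllocation.executorIdleTimeout", "60s")
      .getOrCreate()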

Re: Spark Streaming withWatermark

2018-02-06 Thread Vishnu Viswanath
Hi, the 20 seconds corresponds to when the window state should be cleared. For the late message to be dropped, it should come in after you receive a message with event time >= window end time + 20 seconds. I wrote a post on this recently:
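
A concrete reading of that rule (my own example, not from the linked post): with a one-minute window 12:00:00-12:01:00 and withWatermark("timestamp", "20 seconds"), the state for that window is kept until an event with event time >= 12:01:20 has been seen (i.e. until the watermark passes the window end); a record for that window arriving only after that point is the one that gets dropped.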

Spark Streaming withWatermark

2018-02-06 Thread Jiewen Shao
Sample code: let's say Xyz is a POJO with a field called timestamp. Regarding the code withWatermark("timestamp", "20 seconds"), I expected messages with a timestamp 20 seconds or older to be dropped, but what does the 20 seconds compare to? Based on my test, nothing was dropped no matter how old the timestamp
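
For context, a minimal Scala sketch of the kind of query being described (the source, field names, and window size are illustrative, not taken from the original code). Note that withWatermark only takes effect once the watermarked column feeds a stateful operation such as the windowed groupBy below:

    // Illustrative sketch: withWatermark combined with a windowed aggregation.
    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.window

    val spark = SparkSession.builder().appName("watermark-example").getOrCreate()
    import spark.implicits._

    case class Xyz(timestamp: java.sql.Timestamp, value: String) // hypothetical POJO-like class

    val events = spark.readStream
      .format("socket") // hypothetical source, for illustration only
      .option("host", "localhost")
      .option("port", 9999)
      .load()
      .as[String]
      .map { line =>
        val parts = line.split(",")
        Xyz(new java.sql.Timestamp(parts(0).toLong), parts(1))
      }

    val counts = events
      .withWatermark("timestamp", "20 seconds") // compared against the max event time seen so far
      .groupBy(window($"timestamp", "1 minute"))
      .count()

    counts.writeStream
      .outputMode("append") // rows older than the watermark may be dropped
      .format("console")
      .start()
      .awaitTermination()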

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Hi Jacek: Thanks for your response. I am just trying to understand the fundamentals of watermarking and how it behaves in aggregation vs non-aggregation scenarios. On Tuesday, February 6, 2018 9:04 AM, Jacek Laskowski wrote: Hi, What would you expect? The data is
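
As far as I understand (my own summary, not from the thread): in a non-aggregating query such as df.withWatermark("timestamp", "10 seconds").select(...), the watermark has no visible effect and late rows still flow through; it only starts dropping late data and purging state once the watermarked column is used by a stateful operator, e.g. df.withWatermark("timestamp", "10 seconds").groupBy(window($"timestamp", "1 minute")).count() or a dropDuplicates over that column.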

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread Jacek Laskowski
Hi, What would you expect? The data is simply dropped, as that's the purpose of watermarking it. That's my understanding, at least. Regards, Jacek Laskowski | https://about.me/JacekLaskowski | Mastering Spark SQL: https://bit.ly/mastering-spark-sql | Spark Structured Streaming