Apache Spark - Using withWatermark for DataSets

2017-12-30 Thread M Singh
Hi: I am working with DataSets so that I can use mapGroupsWithState for business logic and then use dropDuplicates over a set of fields. I would like to use withWatermark so that I can restrict how much state is stored. From the API it looks like withWatermark takes a string -
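A minimal sketch of the combination being asked about, assuming the built-in "rate" test source and its column names; withWatermark takes two strings, the event-time column name and a delay threshold:

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("watermark-sketch").getOrCreate()

// The built-in "rate" test source provides `timestamp` and `value` columns.
val events = spark.readStream
  .format("rate")
  .option("rowsPerSecond", "5")
  .load()

// The watermark bounds the state kept by dropDuplicates: rows whose
// `timestamp` falls more than 10 minutes behind the max event time seen
// so far are expired from state instead of being kept forever.
val deduped = events
  .withWatermark("timestamp", "10 minutes")
  .dropDuplicates("value", "timestamp")

val query = deduped.writeStream
  .format("console")
  .outputMode("append")
  .start()
query.awaitTermination()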

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
Thanks Eyal - it appears that these are the same patterns used for Spark DStreams. On Wednesday, December 27, 2017 1:15 AM, Eyal Zituny wrote: Hi, if you're interested in stopping your Spark application externally, you will probably need a way to communicate
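One common way to implement that external signal (a sketch, not the only option): poll for a marker file from the driver and stop the query gracefully when the file appears. The marker path below is hypothetical.

import java.nio.file.{Files, Paths}
import org.apache.spark.sql.streaming.StreamingQuery

// Poll a marker file; when it appears, stop the query so the current
// micro-batch finishes cleanly instead of killing the JVM.
def awaitShutdown(query: StreamingQuery, markerPath: String): Unit = {
  while (query.isActive) {
    if (Files.exists(Paths.get(markerPath)))
      query.stop()
    else
      query.awaitTermination(10000L)  // wait up to 10s, then re-check the marker
  }
}

// Usage: awaitShutdown(query, "/tmp/stop-streaming")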

Converting binary files

2017-12-30 Thread Christopher Piggott
I have been searching for examples, but not finding exactly what I need. I am looking for the paradigm for using Spark 2.2 to convert a bunch of binary files into a bunch of different binary files. I'm starting with: val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input") then
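One possible continuation of that snippet (a sketch under assumptions: a per-file transformation here named `convert`, and hypothetical input/output paths), writing one output file per input file through the Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

val files = spark.sparkContext.binaryFiles("hdfs://1.2.3.4/input")

// Placeholder transformation; replace with the real binary conversion.
def convert(bytes: Array[Byte]): Array[Byte] = bytes.reverse

files.foreach { case (pathStr, stream) =>
  val outBytes = convert(stream.toArray())  // materialize and transform the file
  val outPath  = new Path("hdfs://1.2.3.4/output/" + new Path(pathStr).getName)
  val fs  = FileSystem.get(outPath.toUri, new Configuration())
  val out = fs.create(outPath)
  try out.write(outBytes) finally out.close()
}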

Re: Spark on EMR suddenly stalling

2017-12-30 Thread Gourav Sengupta
Hi, Please try to use the Spark UI in the way that AWS EMR recommends; it should be available from the resource manager. I have never had any problem working with it, and it has always been my primary and sole source of debugging. Sadly, I cannot be of much help unless we go for a screen share

Re: Subqueries

2017-12-30 Thread Lalwani, Jayesh
Thanks. You are right on both counts: 1. Doing max(instnc_id) over () works. I thought that Spark would automatically treat max(instnc_id) as max(instnc_id) over (). 2. Spark tries to do the max function in one task, and it runs out of memory. I'll revert back to join. Thanks again From:
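The two formulations being compared, as a sketch on a hypothetical DataFrame `df` with a column `instnc_id`:

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

// 1. Window form: an empty OVER () clause puts every row in one partition,
//    so Spark evaluates the window in a single task and can run out of memory.
val windowed = df.withColumn("max_id", max(col("instnc_id")).over(Window.partitionBy()))

// 2. Join form: compute the global max as a one-row aggregate (which uses a
//    distributed partial/final reduction), then cross-join it back to the rows.
val maxDf  = df.agg(max(col("instnc_id")).as("max_id"))
val joined = df.crossJoin(maxDf)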