Apache Spark - How to convert DataFrame json string to structured element using schema_of_json

2022-09-05 Thread M Singh
Hi: In apache spark we can read json using the following: spark.read.json("path"). There is support to convert a json string in a dataframe into a structured element using
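A minimal sketch of the conversion being asked about, assuming a hypothetical string column "payload" and a representative sample document (schema_of_json, and the from_json overload that accepts a schema Column, need Spark 2.4+):

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.functions.{col, from_json, lit, schema_of_json}

    def parseJsonColumn(df: DataFrame): DataFrame = {
      // Infer the schema from one representative document (an assumption here).
      val sample = lit("""{"id": 1, "name": "a"}""")
      // from_json turns the json string column into a struct column.
      df.withColumn("parsed", from_json(col("payload"), schema_of_json(sample)))
    }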

Re: Apache Kafka / Spark Integration - Exception - The server disconnected before a response was received.

2018-04-10 Thread M Singh
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa <dhinoj...@evolutionnext.com> wrote: This looks more like a spark issue than a Kafka one, judging by the stack trace. Are you using Spark structured streaming with Kafka integration by chance? On Mon, Apr 9, 2018 at 8:47 AM, M Singh &l

Apache Spark - Structured Streaming State Management With Watermark

2018-03-28 Thread M Singh
Hi: I am using Apache Spark Structured Streaming (2.2.1) to implement custom sessionization for events.  The processing is in two steps: 1. flatMapGroupsWithState (based on user id) - which stores the state of the user and emits events every minute until an expire event is received 2. The next step
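A hedged sketch of what step 1 could look like; the event, state, and output types below are hypothetical since the real fields are not in the message:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout, OutputMode}

    case class Event(userId: String, action: String)        // hypothetical
    case class UserState(count: Long)                       // hypothetical
    case class SessionUpdate(userId: String, count: Long)   // hypothetical

    def track(userId: String, events: Iterator[Event],
              state: GroupState[UserState]): Iterator[SessionUpdate] = {
      val batch = events.toSeq
      if (batch.exists(_.action == "expire")) {
        state.remove()                        // expire event closes the session
        Iterator(SessionUpdate(userId, 0L))
      } else {
        val next = UserState(state.getOption.map(_.count).getOrElse(0L) + batch.size)
        state.update(next)                    // state is kept across triggers
        Iterator(SessionUpdate(userId, next.count))
      }
    }

    // usage, assuming events: Dataset[Event] and import spark.implicits._
    // events.groupByKey(_.userId).flatMapGroupsWithState(
    //   OutputMode.Append, GroupStateTimeout.NoTimeout)(track _)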

Apache Spark - Structured Streaming StreamExecution Stats Description

2018-03-28 Thread M Singh
Hi: I am using spark structured streaming 2.2.1 and am using flatMapGroupsWithState and a groupBy count operator. In the StreamExecution logs I see two entries for stateOperators "stateOperators" : [ {     "numRowsTotal" : 1617339,     "numRowsUpdated" : 9647   }, {     "numRowsTotal" :
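The two entries line up with the two stateful operators in the query: flatMapGroupsWithState and the groupBy count each keep their own state store. A small sketch of reading them programmatically, assuming query is the running StreamingQuery:

    query.lastProgress.stateOperators.foreach { op =>
      // One StateOperatorProgress entry per stateful operator in the plan.
      println(s"numRowsTotal=${op.numRowsTotal} numRowsUpdated=${op.numRowsUpdated}")
    }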

Apache Spark Structured Streaming - How to keep executor alive.

2018-03-23 Thread M Singh
Hi: I am working on spark structured streaming (2.2.1) with kafka and want 100 executors to be alive. I set spark.executor.instances to 100.  The process starts running with 100 executors but after some time only a few remain, which causes a backlog of events from kafka.  I thought I saw a
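One commonly reported cause of this symptom is dynamic allocation (enabled by default on some distributions) reclaiming executors it considers idle. A hedged submit-time sketch that pins the executor count; the class and jar names are placeholders:

    spark-submit \
      --conf spark.executor.instances=100 \
      --conf spark.dynamicAllocation.enabled=false \
      --class com.example.StreamingApp streaming-app.jar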

Apache Spark Structured Streaming - Kafka Streaming - Option to ignore checkpoint

2018-03-22 Thread M Singh
Hi: I am working on a real-time application using spark structured streaming (v 2.2.1). The application reads data from kafka and if there is a failure, I would like to ignore the checkpoint.  Is there any configuration to just read from the last kafka offset after a failure and ignore any offset
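A hedged sketch of the usual workaround: point the restarted query at a fresh checkpoint directory and let the Kafka source's startingOffsets apply, since that option is only honored when no checkpoint exists. Hosts, topic, and paths are placeholders:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "topic")
      .option("startingOffsets", "latest")   // used only on a brand-new query
      .load()

    df.writeStream
      .format("console")
      .option("checkpointLocation", "/checkpoints/run-2")  // new dir, old offsets ignored
      .start()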

Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread M Singh
Hi: I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR and for the last few days, after running the application for 30-60 minutes, I get the exception from the Kafka Consumer included below. The structured streaming application is processing 1 minute worth of data from the kafka topic. So I've tried
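A frequent cause of this exception is Kafka retention deleting offsets the query still references when processing lags behind. If silently skipping the lost range is an acceptable trade-off, the Kafka source exposes an option for that (connection details are placeholders):

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "topic")
      .option("failOnDataLoss", "false")  // log instead of fail when offsets are gone
      .load()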

Re: Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-24 Thread M Singh
Hi Vijay: I am using spark-shell because I am still prototyping the steps involved. Regarding executors - I have 280 executors and the UI only shows a few straggler tasks on each trigger.  The UI does not show too much time spent on GC.  I suspect the delay is because of getting data from kafka. The

Apache Spark - Structured Streaming reading from Kafka some tasks take much longer

2018-02-23 Thread M Singh
Hi: I am working with spark structured streaming (2.2.1) reading data from Kafka (0.11).  I need to aggregate data ingested every minute and I am using spark-shell at the moment.  The message ingestion rate is approx 500k/second.  During some trigger intervals (1 minute), especially when
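If the slow triggers come from fetch spikes, one knob worth trying is capping the volume pulled per trigger. A sketch with placeholder connection details; the cap shown is illustrative, roughly 500k/s over a 1 minute trigger:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "host:9092")
      .option("subscribe", "topic")
      .option("maxOffsetsPerTrigger", "30000000")  // cap records fetched per trigger
      .load()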

Re: Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-11 Thread M Singh
e code “org.apache.spark.sql.execution.streaming.ProgressReporter” is helpful to answer some of them. For example: inputRowsPerSecond = numRecords / inputTimeSec, processedRowsPerSecond = numRecords / processingTimeSec. This explains why the two rowsPerSecond values differ. On Feb 10, 2018, at 8:42 PM, M Singh <mans2si...@yahoo.co

Re: Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-10 Thread M Singh
Just checking if anyone has any pointers for dynamically updating query state in structured streaming. Thanks On Thursday, February 8, 2018 2:58 PM, M Singh <mans2si...@yahoo.com.INVALID> wrote: Hi Spark Experts: I am trying to use a stateful udf with spark structured str

Apache Spark - Structured Streaming Query Status - field descriptions

2018-02-10 Thread M Singh
Hi: I am working with spark 2.2.0 and am looking at the query status console output.  My application reads from kafka, performs flatMapGroupsWithState, and then aggregates the elements for two group counts.  The output is sent to the console sink.  I see the following output  (with my questions

Apache Spark - Structured Streaming - Updating UDF state dynamically at run time

2018-02-08 Thread M Singh
Hi Spark Experts: I am trying to use a stateful udf with spark structured streaming that needs to update its state periodically. Here is the scenario: 1. I have a udf with a variable with a default value (eg: 1).  This value is applied to a column (eg: subtract the variable from the column value
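Since a UDF is shipped to executors as a serialized closure, pushing updates into it after the query starts does not work directly. A hedged sketch of one workaround, where each executor refreshes the value itself from a hypothetical external service instead:

    import org.apache.spark.sql.functions.udf

    object DynamicValue {
      @volatile private var value = 1      // the default mentioned above
      @volatile private var lastFetch = 0L
      def get(): Int = {
        val now = System.currentTimeMillis()
        if (now - lastFetch > 60000) {     // refresh at most once per minute
          value = fetchLatest()            // hypothetical lookup, e.g. a config store
          lastFetch = now
        }
        value
      }
      private def fetchLatest(): Int = 1   // placeholder implementation
    }

    // Subtract the (periodically refreshed) variable from a column value.
    val subtractDynamic = udf { (x: Int) => x - DynamicValue.get() }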

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-06 Thread M Singh
Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Mon, Feb 5, 2018 at 8:11 PM, M Singh <mans2si...@yahoo.com> wrote: Just checking if anyone has more details on how watermark works in cases where event time is earlier than processin

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-05 Thread M Singh
Just checking if anyone has more details on how watermark works in cases where event time is earlier than processing time stamp. On Friday, February 2, 2018 8:47 AM, M Singh <mans2si...@yahoo.com> wrote: Hi Vishu/Jacek: Thanks for your responses. Jacek - At the moment, the curren

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-05 Thread M Singh
Hi TD: Just wondering if you have any insight for me or need more info. Thanks On Thursday, February 1, 2018 7:43 AM, M Singh <mans2si...@yahoo.com.INVALID> wrote: Hi TD: Here is the updated code with explain and full stack trace. Please let me know what could be the issue an

Re: Apache Spark - Spark Structured Streaming - Watermark usage

2018-02-02 Thread M Singh
If you have only a map function and don't want to process lagging records, you could do a filter based on the EventTime field, but I guess you will have to compare it with the processing time since there is no API for the user to access the watermark.  -Vishnu On Fri, Jan 26, 2018 at 1:14 PM, M Singh &l

Re: Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-02-01 Thread M Singh
he$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279) ... 1 more On Wednesday, January 31, 2018 3:46 PM, Tathagata Das <tathagata.das1...@gmail.com> wrote: Could you give the full stack trace of the exception? Also, can you do `dataframe2.explain(true

Apache Spark - Exception on adding column to Structured Streaming DataFrame

2018-01-31 Thread M Singh
Hi Folks: I have to add a column to a structured streaming dataframe but when I do that (using select or withColumn) I get an exception.  I can add a column to a non-streaming dataframe. I could not find any documentation on how to do this in the following doc

Apache Spark - Spark Structured Streaming - Watermark usage

2018-01-26 Thread M Singh
Hi: I am trying to filter out records which are lagging behind (based on event time) by a certain amount of time.   Is the watermark api applicable to this scenario (ie, filtering lagging records) or is it only applicable with aggregation?  I could not get a clear understanding from the
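The watermark only affects stateful operators (aggregations, dropDuplicates, and the like); for a plain filter the comparison has to be written out, as a later reply in this thread also suggests. A sketch with a hypothetical "eventTime" column and a 10 minute lag bound:

    import org.apache.spark.sql.functions.{col, current_timestamp, expr}

    // Keep only rows whose event time is within 10 minutes of processing time.
    val fresh = df.filter(col("eventTime") > current_timestamp() - expr("INTERVAL 10 MINUTES"))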

Re: Apache Spark - Custom structured streaming data source

2018-01-26 Thread M Singh
s.  TD On Jan 25, 2018 8:36 PM, "M Singh" <mans2si...@yahoo.com.invalid> wrote: Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at some of the methods availab

Apache Spark - Custom structured streaming data source

2018-01-25 Thread M Singh
Hi: I am trying to create a custom structured streaming source and would like to know if there is any example or documentation on the steps involved. I've looked at some of the methods available in the SparkSession but these are internal to the sql package:   private[sql] def
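For reference, a hedged skeleton of the internal (pre-DataSource-V2) Source trait a custom source would implement in this era of Spark; it is internal API and subject to change, and registration additionally needs a StreamSourceProvider:

    import org.apache.spark.sql.{DataFrame, SQLContext}
    import org.apache.spark.sql.execution.streaming.{Offset, Source}
    import org.apache.spark.sql.types.StructType

    class MySource(sqlContext: SQLContext) extends Source {
      override def schema: StructType = ???          // fixed schema of the stream
      override def getOffset: Option[Offset] = ???   // latest offset available now
      override def getBatch(start: Option[Offset], end: Offset): DataFrame =
        ???                                          // data in (start, end] as a DataFrame
      override def stop(): Unit = ()                 // release external resources
    }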

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-05 Thread M Singh
ng-spark-sql Spark Structured Streaming https://bit.ly/spark-structured-streaming Mastering Kafka Streams https://bit.ly/mastering-kafka-streams Follow me at https://twitter.com/jaceklaskowski On Thu, Jan 4, 2018 at 10:49 PM, M Singh <mans2si...@yahoo.com.invalid> wrote: Thanks Tathaga

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-04 Thread M Singh
OffsetsPerTrigger", see the guide). Related note, these APIs are subject to change. In fact in the upcoming release 2.3, we are adding a DataSource V2 API for batch/microbatch-streaming/continuous-streaming sources and sinks. On Wed, Jan 3, 2018 at 11:23 PM, M Singh <mans2si...@yahoo.com.invalid

Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread M Singh
Hi: The documentation for Sink.addBatch is as follows:   /**   * Adds a batch of data to this sink. The data for a given `batchId` is deterministic and if   * this method is called more than once with the same batchId (which will happen in the case of   * failures), then `data` should only be
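The contract quoted above means a sink must tolerate replays of the same batchId after a failure. A hedged sketch of the idempotence check, against the same internal Sink API; the side effect shown is a placeholder:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.execution.streaming.Sink

    class IdempotentSink extends Sink {
      @volatile private var latestBatchId = -1L
      override def addBatch(batchId: Long, data: DataFrame): Unit = {
        if (batchId <= latestBatchId) {
          // Replay of deterministic data after a failure: safe to skip.
        } else {
          data.collect().foreach(println)  // placeholder for the real write
          latestBatchId = batchId
        }
      }
    }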

Re: Spark on EMR suddenly stalling

2018-01-01 Thread M Singh
Hi Jeroen: I am not sure if I missed it - but can you let us know what your input source and output sink are?   In some cases, I found that saving to S3 was a problem. In this case I started saving the output to EMR HDFS and later copied it to S3 using s3-dist-cp, which solved our issue. Mans

Apache Spark - Using withWatermark for DataSets

2017-12-30 Thread M Singh
Hi: I am working with DataSets so that I can use mapGroupsWithState for business logic and then use dropDuplicates over a set of fields.  I would like to use withWatermark so that I can restrict how much state is stored.   From the API it looks like withWatermark takes a string -
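withWatermark does take a string (the name of the event-time column, plus a delay string), and it returns the same Dataset type, so it chains with dropDuplicates. A sketch with a hypothetical event type:

    import java.sql.Timestamp

    case class Click(userId: String, eventId: String, eventTime: Timestamp)  // hypothetical

    // clicks: Dataset[Click] is assumed; the state kept by dropDuplicates is
    // bounded by the 10 minute watermark on eventTime.
    val deduped = clicks
      .withWatermark("eventTime", "10 minutes")
      .dropDuplicates("userId", "eventId")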

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-30 Thread M Singh
n also listen to an external system (like kafka) Eyal On Tue, Dec 26, 2017 at 10:37 PM, M Singh <mans2si...@yahoo.com.invalid> wrote: Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25,
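A hedged sketch of the pattern discussed here: the driver polls an external location (an HDFS marker path in this example, purely hypothetical) and calls stop() when the marker appears, so the query can be stopped even when the application runs in a cluster:

    import org.apache.hadoop.fs.{FileSystem, Path}
    import org.apache.spark.sql.streaming.StreamingQuery

    def stopOnMarker(query: StreamingQuery, fs: FileSystem, marker: Path): Unit = {
      while (query.isActive) {
        if (fs.exists(marker)) query.stop()  // stop the query from the driver
        else Thread.sleep(10000)             // poll every 10s
      }
    }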

Re: Apache Spark - Structured Streaming graceful shutdown

2017-12-26 Thread M Singh
Thanks Diogo.  My question is how to gracefully call the stop method while the streaming application is running in a cluster. On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira <diogo.mun...@corp.globo.com> wrote: Hi M Singh! Here I'm using query.stop() Em 25 de dez de 2

Apache Spark - (2.2.0) - window function for DataSet

2017-12-25 Thread M Singh
Hi: I would like to use a window function on a DataSet stream (Spark 2.2.0). The window function requires a Column as an argument and can be used with DataFrames by passing the column. Is there an analogous window function, or pointers to how a window function can be used with DataSets? Thanks
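A Dataset can still be grouped by untyped columns, so the same window() function applies by naming the event-time field. A sketch with a hypothetical record type:

    import org.apache.spark.sql.functions.{col, window}

    case class Reading(sensor: String, ts: java.sql.Timestamp, value: Double)  // hypothetical

    // readings: Dataset[Reading] is assumed; groupBy with Columns works on Datasets too.
    val counts = readings
      .groupBy(window(col("ts"), "1 minute"), col("sensor"))
      .count()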

Apache Spark - Structured Streaming from file - checkpointing

2017-12-25 Thread M Singh
Hi: I am using spark structured streaming (v 2.2.0) to read data from files. I have configured a checkpoint location. On stopping and restarting the application, it looks like it is reading the previously ingested files.  Is that expected behavior?   Is there any way to prevent reading files that

Apache Spark - Structured Streaming graceful shutdown

2017-12-25 Thread M Singh
Hi: Are there any patterns/recommendations for gracefully stopping a structured streaming application? Thanks

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-12 Thread M Singh
to be written into that directory.  TD On Fri, Jul 11, 2014 at 11:46 AM, M Singh mans6si...@yahoo.com wrote: So, is it expected for the process to generate stages/tasks even after processing a file ? Also, is there a way to figure out the file that is getting processed and when

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
, Tathagata Das tathagata.das1...@gmail.com wrote: How are you supplying the text file?  On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote: Hi Folks: I am working on an application which uses spark streaming (version 1.1.0 snapshot on a standalone cluster) to process text

Re: Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-11 Thread M Singh
, 2014 at 3:16 AM, M Singh mans6si...@yahoo.com wrote: Hi TD: The input file is on hdfs.   The file is approx 2.7 GB and when the process starts, there are 11 tasks (since hdfs block size is 256M) for processing and 2 tasks for reduce by key.   After the file has been processed, I see new

Spark streaming - tasks and stages continue to be generated when using reduce by key

2014-07-09 Thread M Singh
Hi Folks: I am working on an application which uses spark streaming (version 1.1.0 snapshot on a standalone cluster) to process a text file and save counters in cassandra based on fields in each row.  I am testing the application in two modes:  * Process each row and save the counter

Re: Reading text file vs streaming text files

2014-07-08 Thread M Singh
be after processing of each batch, you can simply move those processed files to another directory or so. Thanks Best Regards On Thu, Jul 3, 2014 at 6:34 PM, M Singh mans6si...@yahoo.com wrote: Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped

Re: Java sample for using cassandra-driver-spark

2014-07-08 Thread M Singh
for it here:  https://github.com/datastax/cassandra-driver-spark/issues/11 We're open to any ideas. Just let us know what you need the API to have in the comments. Regards, Piotr Kołaczkowski 2014-07-05 0:48 GMT+02:00 M Singh mans6si...@yahoo.com: Hi: Is there a Java sample fragment for using

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
Another alternative could be to use Spark Streaming's textFileStream with windowing capabilities. On Friday, July 4, 2014 9:52 AM, Gianluca Privitera gianluca.privite...@studio.unibo.it wrote: You should think about a custom receiver, in order to solve the problem of the “already collected”
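A sketch of that alternative, assuming sc is an existing SparkContext and the directory path is a placeholder:

    import org.apache.spark.streaming.{Seconds, StreamingContext}

    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.textFileStream("hdfs:///incoming")     // watches for new files
    val windowed = lines.window(Seconds(60), Seconds(10))  // 60s window, sliding every 10s
    windowed.count().print()                               // e.g. lines per window
    ssc.start()
    // ssc.awaitTermination()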

Re: window analysis with Spark and Spark streaming

2014-07-04 Thread M Singh
The windowing capabilities of spark streaming determine the events in the RDD created for that time window.  If the duration is 1s then all the events received in a particular 1s window will be a part of the RDD created for that window for that stream. On Friday, July 4, 2014 1:28 PM,

Java sample for using cassandra-driver-spark

2014-07-04 Thread M Singh
Hi: Is there a Java sample fragment for using cassandra-driver-spark ? Thanks

Reading text file vs streaming text files

2014-07-03 Thread M Singh
Hi: I am working on a project where a few thousand text files (~20M in size) will be dropped in an hdfs directory every 15 minutes.  Data from the files will be used to update counters in cassandra (a non-idempotent operation).  I was wondering what is the best way to deal with this: * Use text

spark text processing

2014-07-03 Thread M Singh
Hi: Is there a way to find out when spark has finished processing a text file (both for streaming and non-streaming cases) ? Also, after processing, can spark copy the file to another directory ? Thanks

Configuration properties for Spark

2014-06-30 Thread M Singh
Hi: Is there a comprehensive properties list (with permissible/default values) for spark ? Thanks Mans

Re: jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-28 Thread M Singh
(the Jackson, Woodstox, etc., folks) to see if we can get people to upgrade to more recent versions of Jackson. -- Paul — p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/ On Fri, Jun 27, 2014 at 12:58 PM, M Singh mans6si...@yahoo.com wrote: Hi: I am using spark to stream

jackson-core-asl jar (1.8.8 vs 1.9.x) conflict with the spark-sql (version 1.x)

2014-06-27 Thread M Singh
Hi: I am using spark to stream data to cassandra and it works fine in local mode. But when I execute the application in a standalone clustered env I get the exception included below (java.lang.NoClassDefFoundError: org/codehaus/jackson/annotate/JsonClass). I think this is due to the
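One hedged way to attack this kind of conflict in an sbt build is to force a single jackson-core-asl version across all transitive dependencies; the version shown is only an example:

    // build.sbt: pin one version so mixed 1.8.x/1.9.x jars don't end up on the classpath
    dependencyOverrides += "org.codehaus.jackson" % "jackson-core-asl" % "1.9.13"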