Hi:
In Apache Spark we can read JSON using the following:
spark.read.json("path")
There is support for converting a JSON string column in a DataFrame into a
structured element using
10, 2018, 7:49:42 AM PDT, Daniel Hinojosa
<dhinoj...@evolutionnext.com> wrote:
This looks more like a Spark issue than a Kafka one, judging by the stack
trace. Are you using Spark Structured Streaming with Kafka integration, by
chance?
On Mon, Apr 9, 2018 at 8:47 AM, M Singh &l
Hi:
I am using Apache Spark Structured Streaming (2.2.1) to implement custom
sessionization for events. The processing is in two steps:
1. flatMapGroupsWithState (based on user id) - which stores the state of the
user and emits events every minute until an expire event is received
2. The next step
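The per-user state kept by flatMapGroupsWithState can be sketched with a plain state class and a pure update function. This is a hypothetical illustration (the Event/SessionState names and the Spark GroupState plumbing are my assumptions, not from the thread):

```scala
// Hypothetical sketch of per-user session state for flatMapGroupsWithState;
// the Spark GroupState wiring is omitted.
case class Event(userId: String, isExpire: Boolean)
case class SessionState(eventCount: Int, expired: Boolean)

// Pure update step: accumulate events until an expire event is received.
def updateSession(state: SessionState, events: Seq[Event]): SessionState = {
  val expired = state.expired || events.exists(_.isExpire)
  SessionState(state.eventCount + events.size, expired)
}

val s0 = SessionState(0, expired = false)
val s1 = updateSession(s0, Seq(Event("u1", false), Event("u1", false)))
val s2 = updateSession(s1, Seq(Event("u1", true)))
```

In the real operator this update function would run per group key, with the state read from and written back to GroupState on each trigger.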
Hi:
I am using Spark Structured Streaming 2.2.1 with the flatMapGroupsWithState
and groupBy count operators.
In the StreamExecution logs I see two entries for stateOperators:
"stateOperators" : [ {
"numRowsTotal" : 1617339,
"numRowsUpdated" : 9647
}, {
"numRowsTotal" :
Hi:
I am working with Spark Structured Streaming (2.2.1) and Kafka and want 100
executors to stay alive. I set spark.executor.instances to 100. The process
starts running with 100 executors, but after some time only a few remain,
which causes a backlog of events from Kafka.
I thought I saw a
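One common cause of disappearing executors (an assumption here, since the thread does not show the final diagnosis) is dynamic allocation reclaiming idle ones; some EMR setups enable it by default, in which case spark.executor.instances alone does not pin the count. A hedged config sketch:

```properties
# Sketch: if dynamic allocation is enabled, idle executors are released
# even when spark.executor.instances is set. Disabling it keeps the
# requested number of executors alive.
spark.dynamicAllocation.enabled=false
spark.executor.instances=100
```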
Hi:
I am working on a real-time application using Spark Structured Streaming
(v2.2.1). The application reads data from Kafka, and if there is a failure I
would like to ignore the checkpoint. Is there any configuration to just read
from the last Kafka offset after a failure and ignore any offset
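One approach (an assumption; the thread does not confirm it) is to restart the query with a fresh checkpoint directory and set the Kafka source's startingOffsets option, since with an existing checkpoint the stored offsets take precedence. A sketch for a spark-shell session, with hypothetical broker and topic names (not runnable outside a Spark session):

```scala
// Broker, topic, and paths below are hypothetical.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "broker:9092")
  .option("subscribe", "events")
  .option("startingOffsets", "latest") // honored only when no checkpoint exists
  .load()

df.writeStream
  .option("checkpointLocation", "/tmp/checkpoint-new") // fresh dir, old offsets ignored
  .format("console")
  .start()
```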
Hi:
I am working with Spark (2.2.1) and Kafka (0.10) on AWS EMR, and for the last
few days, after running the application for 30-60 minutes, I get the exception
from the Kafka consumer included below.
The structured streaming application is processing 1 minute's worth of data
from a Kafka topic. So I've tried
Hi Vijay:
I am using spark-shell because I am still prototyping the steps involved.
Regarding executors - I have 280 executors, and the UI only shows a few
straggler tasks on each trigger. The UI does not show too much time spent on
GC. I suspect the delay is because of getting data from Kafka. The
Hi:
I am working with spark structured streaming (2.2.1) reading data from Kafka
(0.11).
I need to aggregate data ingested every minute, and I am using spark-shell at
the moment. The message ingestion rate is approx 500k/second. During some
trigger intervals (1 minute), especially when
The code in
"org.apache.spark.sql.execution.streaming.ProgressReporter" is helpful for
answering some of them.
For example: inputRowsPerSecond = numRecords / inputTimeSec,
processedRowsPerSecond = numRecords / processingTimeSec. This explains the
difference between the two rows-per-second figures.
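The two formulas above use the same record count over different time bases, which is why the reported rates differ. A minimal worked sketch of that arithmetic:

```scala
// The two rates reported in query progress are computed over different
// durations: time spent waiting for input vs. time spent processing it.
def inputRowsPerSecond(numRecords: Long, inputTimeSec: Double): Double =
  numRecords / inputTimeSec

def processedRowsPerSecond(numRecords: Long, processingTimeSec: Double): Double =
  numRecords / processingTimeSec

// Example: 60000 records arrived over a 60s trigger but took 30s to process.
val inRate   = inputRowsPerSecond(60000L, 60.0)     // 1000.0
val procRate = processedRowsPerSecond(60000L, 30.0) // 2000.0
```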
On Feb 10, 2018, at 8:42 PM, M Singh <mans2si...@yahoo.co
Just checking if anyone has any pointers for dynamically updating query state
in structured streaming.
Thanks
On Thursday, February 8, 2018 2:58 PM, M Singh
<mans2si...@yahoo.com.INVALID> wrote:
Hi Spark Experts:
I am trying to use a stateful udf with spark structured str
Hi:
I am working with spark 2.2.0 and am looking at the query status console
output.
My application reads from Kafka - performs flatMapGroupsWithState and then
aggregates the elements for two group counts. The output is sent to the
console sink. I see the following output (with my questions
Hi Spark Experts:
I am trying to use a stateful UDF with Spark Structured Streaming that needs
to update its state periodically.
Here is the scenario:
1. I have a UDF with a variable with a default value (e.g., 1). This value is
applied to a column (e.g., subtract the variable from the column value
Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Mon, Feb 5, 2018 at 8:11 PM, M Singh <mans2si...@yahoo.com> wrote:
Just checking if anyone has more details on how watermark works in cases where
event time is earlier than processin
Just checking if anyone has more details on how the watermark works in cases
where the event time is earlier than the processing timestamp.
On Friday, February 2, 2018 8:47 AM, M Singh <mans2si...@yahoo.com> wrote:
Hi Vishu/Jacek:
Thanks for your responses.
Jacek - At the moment, the curren
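A simplified model of the watermark question above: the watermark is derived from the maximum event time seen so far minus the configured delay, and processing time does not enter the calculation. This sketch is my own illustration of that behavior, not code from the thread:

```scala
// Simplified model: the watermark trails the maximum event time seen so far
// by the configured delay; the processing-time clock is not involved.
def watermark(maxEventTimeMs: Long, delayMs: Long): Long =
  maxEventTimeMs - delayMs

def isLate(eventTimeMs: Long, wm: Long): Boolean = eventTimeMs < wm

val wm = watermark(maxEventTimeMs = 100000L, delayMs = 10000L) // 90000
val dropped = isLate(85000L, wm) // true: older than the watermark
val kept    = isLate(95000L, wm) // false: within the allowed lateness
```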
Hi TD:
Just wondering if you have any insight for me or need more info.
Thanks
On Thursday, February 1, 2018 7:43 AM, M Singh
<mans2si...@yahoo.com.INVALID> wrote:
Hi TD:
Here is the updated code with explain and the full stack trace.
Please let me know what could be the issue an
If you have only a
map function and don't want to process it, you could do a filter based on its
EventTime field, but I guess you will have to compare it with the processing
time, since there is no API for user code to access the watermark.
-Vishnu
On Fri, Jan 26, 2018 at 1:14 PM, M Singh &l
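The workaround described above, dropping records whose event time lags the current processing time by more than a threshold, can be sketched as a pure predicate (names and thresholds are illustrative):

```scala
// Drop records whose event time lags processing time by more than maxLagMs,
// as a stand-in for the watermark, which user code cannot read directly.
def isLagging(eventTimeMs: Long, processingTimeMs: Long, maxLagMs: Long): Boolean =
  processingTimeMs - eventTimeMs > maxLagMs

// A record 69s old with a 60s threshold is dropped; a 5s-old one is kept.
val lagging = isLagging(eventTimeMs = 1000L,  processingTimeMs = 70000L, maxLagMs = 60000L)
val fresh   = isLagging(eventTimeMs = 65000L, processingTimeMs = 70000L, maxLagMs = 60000L)
```

In a real query this predicate would sit inside a filter() just after the source, applied before any stateful operator.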
he$spark$sql$execution$streaming$StreamExecution$$runStream(StreamExecution.scala:279)
... 1 more
On Wednesday, January 31, 2018 3:46 PM, Tathagata Das
<tathagata.das1...@gmail.com> wrote:
Could you give the full stack trace of the exception?
Also, can you do `dataframe2.explain(true
Hi Folks:
I have to add a column to a structured streaming DataFrame, but when I do that
(using select or withColumn) I get an exception. I can add a column to a
non-streaming DataFrame. I could not find any documentation on how to do this
in the following doc
Hi:
I am trying to filter out records which are lagging behind (based on event
time) by a certain amount of time.
Is the watermark API applicable to this scenario (i.e., filtering lagging
records), or is it only applicable with aggregation? I could not get a clear
understanding from the
s.
TD
On Jan 25, 2018 8:36 PM, "M Singh" <mans2si...@yahoo.com.invalid> wrote:
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods availab
Hi:
I am trying to create a custom structured streaming source and would like to
know if there is any example or documentation on the steps involved.
I've looked at some of the methods available in the SparkSession, but these
are internal to the sql package:
private[sql] def
ng-spark-sql
Spark Structured Streaming https://bit.ly/spark-structured-streaming
Mastering Kafka Streams https://bit.ly/mastering-kafka-streams
Follow me at https://twitter.com/jaceklaskowski
On Thu, Jan 4, 2018 at 10:49 PM, M Singh <mans2si...@yahoo.com.invalid> wrote:
Thanks Tathaga
OffsetsPerTrigger", see the guide).
Related note, these APIs are subject to change. In fact in the upcoming release
2.3, we are adding a DataSource V2 API for
batch/microbatch-streaming/continuous-streaming sources and sinks.
On Wed, Jan 3, 2018 at 11:23 PM, M Singh <mans2si...@yahoo.com.invalid
Hi:
The documentation for Sink.addBatch is as follows:
/** Adds a batch of data to this sink. The data for a given `batchId` is
deterministic and if this method is called more than once with the same
batchId (which will happen in the case of failures), then `data` should
only be
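The idempotence contract quoted above (the same batchId may be delivered again after a failure) can be sketched with a toy sink that skips already-committed batch ids. The class and field names are illustrative, not Spark's actual Sink implementation:

```scala
// Toy sink honoring the addBatch contract: a replayed batchId is a no-op.
class DedupingSink {
  private var lastCommitted: Long = -1L
  private val written = scala.collection.mutable.ArrayBuffer.empty[String]

  def addBatch(batchId: Long, data: Seq[String]): Unit = {
    if (batchId <= lastCommitted) return // replay of an already-committed batch
    written ++= data
    lastCommitted = batchId
  }

  def output: List[String] = written.toList
}

val sink = new DedupingSink
sink.addBatch(0, Seq("a", "b"))
sink.addBatch(0, Seq("a", "b")) // retried after a failure: ignored
sink.addBatch(1, Seq("c"))
```

A real sink would persist lastCommitted transactionally alongside the data so the check survives a restart.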
Hi Jeroen:
I am not sure if I missed it - but can you let us know what is your input
source and output sink ?
In some cases, I found that saving to S3 was a problem. In this case I started
saving the output to the EMR HDFS and later copied to S3 using s3-dist-cp which
solved our issue.
Mans
Hi:
I am working with Datasets so that I can use mapGroupsWithState for business
logic and then use dropDuplicates over a set of fields. I would like to use
withWatermark so that I can restrict how much state is stored.
From the API it looks like withWatermark takes a string -
n also listen to an external system (like kafka)
Eyal
On Tue, Dec 26, 2017 at 10:37 PM, M Singh <mans2si...@yahoo.com.invalid> wrote:
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25,
Thanks Diogo. My question is how to gracefully call the stop method while the
streaming application is running in a cluster.
On Monday, December 25, 2017 5:39 PM, Diogo Munaro Vieira
<diogo.mun...@corp.globo.com> wrote:
Hi M Singh! Here I'm using query.stop()
Em 25 de dez de 2
Hi:
I would like to use a window function on a Dataset stream (Spark 2.2.0). The
window function requires a Column as an argument and can be used with
DataFrames by passing the column. Is there any analogous window function, or
pointers to how a window function can be used with Datasets?
Thanks
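The bucket assignment that the built-in time window performs can be modeled as plain integer arithmetic on the timestamp. This is my own illustration of tumbling-window assignment, not the Spark implementation:

```scala
// A tumbling window assigns each timestamp to the start of its fixed bucket.
def windowStart(timestampMs: Long, windowMs: Long): Long =
  (timestampMs / windowMs) * windowMs

// With 1-minute (60000 ms) windows:
val w1 = windowStart(61500L, 60000L) // falls in the [60000, 120000) window
val w2 = windowStart(59999L, 60000L) // falls in the [0, 60000) window
```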
Hi:
I am using Spark Structured Streaming (v2.2.0) to read data from files. I have
configured the checkpoint location. On stopping and restarting the application,
it looks like it is re-reading the previously ingested files. Is that expected
behavior?
Is there any way to prevent reading files that
Hi:
Are there any patterns/recommendations for gracefully stopping a structured
streaming application?
Thanks
to be written into that
directory.
TD
On Fri, Jul 11, 2014 at 11:46 AM, M Singh mans6si...@yahoo.com wrote:
So, is it expected for the process to generate stages/tasks even after
processing a file?
Also, is there a way to figure out which file is being processed and when
, Tathagata Das tathagata.das1...@gmail.com
wrote:
How are you supplying the text file?
On Wed, Jul 9, 2014 at 11:51 AM, M Singh mans6si...@yahoo.com wrote:
Hi Folks:
I am working on an application which uses spark streaming (version 1.1.0
snapshot on a standalone cluster) to process text
, 2014 at 3:16 AM, M Singh mans6si...@yahoo.com wrote:
Hi TD:
The input file is on hdfs.
The file is approx 2.7 GB and when the process starts, there are 11 tasks
(since hdfs block size is 256M) for processing and 2 tasks for reduce by key.
After the file has been processed, I see new
Hi Folks:
I am working on an application which uses Spark Streaming (version 1.1.0
snapshot on a standalone cluster) to process text files and save counters in
Cassandra based on fields in each row. I am testing the application in two
modes:
modes:
* Process each row and save the counter
be after
processing of each batch, you can simply move those processed files to another
directory or so.
Thanks
Best Regards
On Thu, Jul 3, 2014 at 6:34 PM, M Singh mans6si...@yahoo.com wrote:
Hi:
I am working on a project where a few thousand text files (~20M in size) will
be dropped
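The suggestion above, moving already-processed files to another directory after each batch, can be sketched with the standard file APIs. The paths and helper name are hypothetical; on a real HDFS deployment the Hadoop FileSystem API would replace java.nio here:

```scala
// Sketch: after a batch completes, move the files it processed into an
// archive directory so the streaming source does not pick them up again.
import java.nio.file.{Files, Paths, StandardCopyOption}

def archive(fileNames: Seq[String], srcDir: String, dstDir: String): Unit = {
  Files.createDirectories(Paths.get(dstDir))
  fileNames.foreach { name =>
    Files.move(
      Paths.get(srcDir, name),
      Paths.get(dstDir, name),
      StandardCopyOption.REPLACE_EXISTING)
  }
}

// Local demonstration with temp directories:
val src = Files.createTempDirectory("incoming")
val dst = Files.createTempDirectory("processed").resolve("archive")
Files.write(src.resolve("part-0001.txt"), "data".getBytes)
archive(Seq("part-0001.txt"), src.toString, dst.toString)
val moved       = Files.exists(dst.resolve("part-0001.txt"))
val goneFromSrc = !Files.exists(src.resolve("part-0001.txt"))
```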
for it here:
https://github.com/datastax/cassandra-driver-spark/issues/11
We're open to any ideas. Just let us know what you need the API to have in the
comments.
Regards,
Piotr Kołaczkowski
2014-07-05 0:48 GMT+02:00 M Singh mans6si...@yahoo.com:
Hi:
Is there a Java sample fragment for using
Another alternative could be to use Spark Streaming's textFileStream with
windowing capabilities.
On Friday, July 4, 2014 9:52 AM, Gianluca Privitera
gianluca.privite...@studio.unibo.it wrote:
You should think about a custom receiver, in order to solve the problem of the
“already collected”
The windowing capabilities of Spark Streaming determine the events in the RDD
created for that time window. If the duration is 1s then all the events
received in a particular 1s window will be a part of the RDD created for that
window for that stream.
On Friday, July 4, 2014 1:28 PM,
Hi:
Is there a Java sample fragment for using cassandra-driver-spark ?
Thanks
Hi:
I am working on a project where a few thousand text files (~20M in size) will
be dropped into an HDFS directory every 15 minutes. Data from the files will
be used to update counters in Cassandra (a non-idempotent operation). I was
wondering what is the best way to deal with this:
* Use text
Hi:
Is there a way to find out when Spark has finished processing a text file
(both for streaming and non-streaming cases)?
Also, after processing, can Spark copy the file to another directory?
Thanks
Hi:
Is there a comprehensive properties list (with permissible/default values) for
spark ?
Thanks
Mans
(the Jackson,
Woodstox, etc., folks) to see if we can get people to upgrade to more recent
versions of Jackson.
-- Paul
—
p...@mult.ifario.us | Multifarious, Inc. | http://mult.ifario.us/
On Fri, Jun 27, 2014 at 12:58 PM, M Singh mans6si...@yahoo.com wrote:
Hi:
I am using spark to stream
Hi:
I am using Spark to stream data to Cassandra, and it works fine in local mode,
but when I execute the application in a standalone clustered environment I get
the exception included below (java.lang.NoClassDefFoundError:
org/codehaus/jackson/annotate/JsonClass).
I think this is due to the