Re: Max number of streams supported ?

2018-01-31 Thread Tathagata Das
Just to clarify a subtle difference between DStreams and Structured Streaming: multiple input streams in a DStreamGraph are likely to mean they are all being processed/computed in the same way, as there can be only one streaming query/context active in the StreamingContext. However, in the case of

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
Hello Divya, To add further clarification: Apache Bahir does not have any Structured Streaming support for Twitter; it only has support for Twitter + DStreams. TD On Wed, Jan 31, 2018 at 2:44 AM, vermanurag wrote: > Twitter functionality is not part of Core

Re: Spark Structured Streaming for Twitter Streaming data

2018-01-31 Thread Tathagata Das
> // Read text from socket > val socketDF = spark.readStream.format("socket").option("host", "localhost").option("port", ).load()

Re: Spark Streaming withWatermark

2018-02-06 Thread Tathagata Das
That may very well be possible. The watermark delay guarantees that any record newer than or equal to the watermark (that is, max event time seen - 20 seconds) will be considered and never ignored. It does not guarantee the other way around; that is, it does NOT guarantee that records older than the
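A minimal sketch of the watermarked aggregation being discussed, assuming a SparkSession `spark` and a streaming DataFrame `events` with an `eventTime` column (both names are assumptions):

    import org.apache.spark.sql.functions.window
    import spark.implicits._

    // Watermark = max event time seen - 20 seconds. Records at or above the
    // watermark are guaranteed to be considered; older records may still be
    // processed, but may also be dropped.
    val counts = events
      .withWatermark("eventTime", "20 seconds")
      .groupBy(window($"eventTime", "1 minute"))
      .count()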

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
Let me fix my mistake :) What I suggested in that earlier thread does not work. A streaming query that joins a streaming dataset with a batch view does not correctly pick up when the view is updated; it works only when you restart the query. That is: - stop the query - recreate the dataframes,
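A sketch of that stop/recreate/restart loop; `streamDF`, the paths, the join key, and the hourly refresh interval are all assumptions:

    import org.apache.spark.sql.DataFrame
    import org.apache.spark.sql.streaming.StreamingQuery

    def startQuery(staticView: DataFrame): StreamingQuery =
      streamDF.join(staticView, "key")                    // streamDF: the streaming dataset
        .writeStream
        .format("parquet")
        .option("path", "/data/out")
        .option("checkpointLocation", "/data/ckpt/join")  // same dir across restarts
        .start()

    var query = startQuery(spark.read.parquet("/data/static-view"))
    while (true) {
      Thread.sleep(60 * 60 * 1000)                        // refresh hourly
      query.stop()                                        // stop the query
      query = startQuery(spark.read.parquet("/data/static-view")) // re-read view, restart
    }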

Re: [Structured Streaming] Avoiding multiple streaming queries

2018-02-14 Thread Tathagata Das
Of course, you can write to multiple Kafka topics from a single query. If the dataframe you want to write has a column named "topic" (along with "key" and "value" columns), the contents of each row will be written to the topic named in that row. This works automatically. So the only thing you need to
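A sketch of such a query; the broker address, checkpoint path, and the CASE-based routing rule are assumptions:

    // No "topic" option is set on the writer, so each row goes to the topic
    // named in its "topic" column.
    val query = inputDF
      .selectExpr(
        "CAST(key AS STRING) AS key",
        "CAST(value AS STRING) AS value",
        "CASE WHEN priority = 'high' THEN 'alerts' ELSE 'events' END AS topic")
      .writeStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("checkpointLocation", "/tmp/ckpt/multi-topic")
      .start()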

Re: Spark structured streaming: periodically refresh static data frame

2018-02-14 Thread Tathagata Das
I’ve to > restart queries? > > Should i just wait for 2.3 where i'll be able to join two structured > streams ( if the release is just a few weeks away ) > > Appreciate all the help! > > thanks > App > > > > On 14 February 2018 at 4:41:52 PM, Tathagata Das

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-02-22 Thread Tathagata Das
Hey, Thanks for testing out stream-stream joins and reporting this issue. I am going to take a look at this. TD On Tue, Feb 20, 2018 at 8:20 PM, kant kodali wrote: > if I change it to the below code it works. However, I don't believe it is > the solution I am looking

Re: Trigger.ProcessingTime("10 seconds") & Trigger.Continuous(10.seconds)

2018-02-25 Thread Tathagata Das
The continuous one is our new low latency continuous processing engine in Structured Streaming (to be released in 2.3). Here is the pre-release doc - https://dist.apache.org/repos/dist/dev/spark/v2.3.0-rc5-docs/_site/structured-streaming-programming-guide.html#continuous-processing On Sun, Feb
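For comparison, a sketch of the two trigger modes (console sink chosen just for illustration):

    import org.apache.spark.sql.streaming.Trigger

    // Micro-batch engine: kick off a batch every 10 seconds.
    df.writeStream.format("console")
      .trigger(Trigger.ProcessingTime("10 seconds"))
      .start()

    // Continuous engine (experimental in 2.3): low-latency processing with
    // asynchronous checkpoints every 10 seconds.
    df.writeStream.format("console")
      .trigger(Trigger.Continuous("10 seconds"))
      .start()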

Re: Apache Spark - Question about Structured Streaming Sink addBatch dataframe size

2018-01-03 Thread Tathagata Das
1. It is all the result data in that trigger. Note that it takes a DataFrame, which is a purely logical representation of data and has no association with partitions, etc., which are physical representations. 2. If you want to limit the amount of data that is processed in a trigger, then you should
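For point 2 with the Kafka source, one such rate limit is the maxOffsetsPerTrigger option; the broker and topic names below are assumptions:

    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker:9092")
      .option("subscribe", "events")
      .option("maxOffsetsPerTrigger", "10000")  // cap the records read per trigger
      .load()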

Re: How to read json data from kafka and store to hdfs with spark structued streaming?

2018-07-26 Thread Tathagata Das
Are you writing multiple streaming query output to the same location? If so, I can see this error occurring. Multiple streaming queries writing to the same directory is not supported. On Tue, Jul 24, 2018 at 3:38 PM, dddaaa wrote: > I'm trying to read json messages from kafka and store them in

Re: Exceptions with simplest Structured Streaming example

2018-07-26 Thread Tathagata Das
Unfortunately, your output is not visible in the email that we see. Was it an image that somehow got removed? It may be best to copy the output text (i.e. the error message) into the email. On Thu, Jul 26, 2018 at 5:41 AM, Jonathan Apple wrote: > Hello, > > There is a streaming Word Count example at

Re: [Structured Streaming] Two watermarks and StreamingQueryListener

2018-08-10 Thread Tathagata Das
Structured Streaming internally maintains one global watermark by taking the min of the two watermarks; that's why only one gets reported. In Spark 2.4, there will be the option of choosing max instead of min. Just curious: why do you have two watermarks? What's the query like? TD On Thu, Aug 9, 2018
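A sketch of a two-watermark query and the Spark 2.4+ policy switch mentioned above (column names are assumptions):

    // Each input stream carries its own watermark; the engine reports the
    // global one, which by default is min(watermark1, watermark2).
    val left  = stream1.withWatermark("leftTime", "10 minutes")
    val right = stream2.withWatermark("rightTime", "20 minutes")

    // Spark 2.4+: choose max instead of min for the global watermark.
    spark.conf.set("spark.sql.streaming.multipleWatermarkPolicy", "max")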

Re: Dataset - withColumn and withColumnRenamed that accept Column type

2018-07-17 Thread Tathagata Das
Yes. Yes you can. On Tue, Jul 17, 2018 at 11:42 AM, Sathi Chowdhury wrote: > Hi, > My question is about the ability to integrate Spark Streaming with multiple > clusters. Is it a supported use case? An example of that is two topics > owned by different groups, each with their own Kafka infra

Re: How to branch a Stream / have multiple Sinks / do multiple Queries on one Stream

2018-07-05 Thread Tathagata Das
Hey all, In Spark 2.4.0, there will be a new feature called *foreachBatch* which will expose the output rows of every micro-batch as a dataframe, on which you can apply a user-defined function. With that, you can reuse existing batch data sources for writing results as well as write results to multiple
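A sketch of foreachBatch fanning one micro-batch out to two sinks (formats and paths are assumptions):

    import org.apache.spark.sql.DataFrame

    streamingDF.writeStream
      .foreachBatch { (batchDF: DataFrame, batchId: Long) =>
        batchDF.persist()   // avoid recomputing the batch once per sink
        batchDF.write.format("parquet").mode("append").save("/out/parquet")
        batchDF.write.format("json").mode("append").save("/out/json")
        batchDF.unpersist()
      }
      .option("checkpointLocation", "/tmp/ckpt/fanout")
      .start()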

Re: [Structured Streaming] Reading Checkpoint data

2018-07-09 Thread Tathagata Das
Only the stream metadata (e.g., streamid, offsets) are stored as json. The stream state data is stored in an internal binary format. On Mon, Jul 9, 2018 at 4:07 PM, subramgr wrote: > Hi, > > I read somewhere that with Structured Streaming all the checkpoint data is > more readable (Json) like.

Re: [Structured Streaming] Custom StateStoreProvider

2018-07-10 Thread Tathagata Das
Note that this is not a public API yet, hence it is not well documented. So use it at your own risk :) On Tue, Jul 10, 2018 at 11:04 AM, subramgr wrote: > Hi, > > This *trait* looks very daunting; is there some blog post or article > which explains how to implement it? > >

Re: [Spark structured streaming] Use of (flat)mapgroupswithstate takes long time

2018-01-22 Thread Tathagata Das
For computing mapGroupsWithState, can you check the following: - How many tasks are being launched in the reduce stage (that is, the stage after the shuffle, the one computing mapGroupsWithState)? - How long is each task taking? - How many cores does the cluster have? On Thu, Jan 18, 2018 at

Re: CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread warning

2018-03-06 Thread Tathagata Das
Which version of Spark are you using? And can you give us the full stack trace of the exception? On Tue, Mar 6, 2018 at 1:53 AM, Junfeng Chen wrote: > I am trying to read kafka and save the data as parquet file on hdfs > according to this

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-06 Thread Tathagata Das
ests compile > ./dev/make-distribution.sh --name custom-spark --pip --r --tgz -Psparkr > -Phadoop-2.7 -Phive -Phive-thriftserver -Pmesos -Pyarn > > > On Thu, Feb 22, 2018 at 11:25 AM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> Hey, >> >&g

Re: Upgrades of streaming jobs

2018-03-09 Thread Tathagata Das
Yes, all checkpoints are forward compatible. However, you do need to restart the query if you want to update its code. This downtime can be less than a second (if you just restart the query without stopping the application/Spark driver) or 10s of seconds (if you have to stop the

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-02-28 Thread Tathagata Das
There is no good way to save to parquet without causing downstream consistency issues. You could use foreachRDD to get each RDD, convert it to DataFrame/Dataset, and write out as parquet files. But you will later run into issues with partial files caused by failures, etc. On Wed, Feb 28, 2018 at
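A sketch of the foreachRDD approach described above, with the caveat about partial files still applying; the case class and output path are assumptions:

    import org.apache.spark.sql.SparkSession

    case class Event(id: String, value: Double)

    dstream.foreachRDD { rdd =>                  // dstream: DStream[Event]
      val spark = SparkSession.builder.getOrCreate()
      import spark.implicits._
      rdd.toDF().write.mode("append").parquet("hdfs:///data/events")
    }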

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-28 Thread Tathagata Das
I made a JIRA for it - https://issues.apache.org/jira/browse/SPARK-23539 Unfortunately it is blocked by Kafka version upgrade, which has a few nasty issues related to Kafka bugs - https://issues.apache.org/jira/browse/SPARK-18057 On Wed, Feb 28, 2018 at 3:17 PM, karthikus

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-12 Thread Tathagata Das
perspective, I believe just spitting out nulls for every > trigger until there is a match, and when there is a match, spitting out the > joined rows should suffice, isn't it? > > Sorry if my thoughts are too naive! > > On Thu, Mar 8, 2018

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-15 Thread Tathagata Das
-with-tathagata-das On Thu, Mar 15, 2018 at 8:37 AM, Bowden, Chris <chris.bow...@microfocus.com> wrote: > You need to tell Spark about the structure of the data, it doesn't know > ahead of time if you put avro, json, protobuf, etc. in kafka for the > message format. If the messa

Re: what is the right syntax for self joins in Spark 2.3.0 ?

2018-03-08 Thread Tathagata Das
JIRA tickets related to full >>>> outer joins and update mode for maybe, say, Spark 2.3. I wonder how hard is >>>> it to implement both of these? It turns out the update mode and full outer >>>> join are very useful and required in my case, therefore I'm just asking.

Re: CachedKafkaConsumer: CachedKafkaConsumer is not running in UninterruptibleThread warning

2018-03-07 Thread Tathagata Das
Just many of this line: > >> CachedKafkaConsumer: CachedKafkaConsumer is not running in >> UninterruptibleThread. It may hang when CachedKafkaConsumer's methods are >> interrupted because of KAFKA-1894. > > Regards, > Junfeng Chen > > On Wed, Mar 7, 2018 at 3

Re: Multiple Kafka Spark Streaming Dataframe Join query

2018-03-14 Thread Tathagata Das
Relevant: https://databricks.com/blog/2018/03/13/introducing-stream-stream-joins-in-apache-spark-2-3.html This is a true stream-stream join which will automatically buffer delayed data and appropriately join it with SQL join semantics. Please check it out :) TD On Wed, Mar 14, 2018 at 12:07
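A sketch in the spirit of that blog post's ad-monitoring example; stream and column names are assumptions:

    import org.apache.spark.sql.functions.expr

    val joined = impressions
      .withWatermark("impressionTime", "2 hours")
      .join(
        clicks.withWatermark("clickTime", "3 hours"),
        expr("""
          clickAdId = impressionAdId AND
          clickTime >= impressionTime AND
          clickTime <= impressionTime + interval 1 hour"""))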

Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
It's not very surprising that doing this sort of RDD to DF conversion inside DStream.foreachRDD has weird corner cases like this. In fact, you are going to have additional problems with partial parquet files (when there are failures) in this approach. I strongly suggest that you use Structured

Re: Nullpointerexception error when in repartition

2018-04-12 Thread Tathagata Das
Queries with streaming sources must be executed with writeStream.start() > > But what I need to do in this step is only transforming json string data > to a Dataset. How to fix it? > > Thanks! > > Regards, > Junfeng Chen > > On Thu, Apr 12, 2018 at 3:08 PM, Tathagata Das < > tathagata.das1.

Re: Does partition by and order by works only in stateful case?

2018-04-12 Thread Tathagata Das
Traditional SQL window functions with `over` are not supported in streaming. Only time-based windows, that is, `window("timestamp", "10 minutes")`, are supported in streaming. On Thu, Apr 12, 2018 at 7:34 PM, kant kodali wrote: > Hi All, > > Does partition by and order by work only
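Side by side, assuming a streaming DataFrame `events` with `timestamp` and `userId` columns:

    import org.apache.spark.sql.functions.window
    import spark.implicits._

    // Supported in streaming: time-based windows.
    val counts = events
      .groupBy(window($"timestamp", "10 minutes"), $"userId")
      .count()

    // NOT supported in streaming: SQL window functions, e.g.
    //   SELECT rank() OVER (PARTITION BY userId ORDER BY timestamp) FROM events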

Re: Structured Streaming on Kubernetes

2018-04-13 Thread Tathagata Das
Structured Streaming is stable in production! At Databricks, we and our customers collectively process hundreds of billions of records per day using it. However, we are not using Kubernetes :) Though I don't think it will matter too much as long as the kube clusters are correctly provisioned+configured

Re: Does structured streaming support Spark Kafka Direct?

2018-04-12 Thread Tathagata Das
The parallelism is the same for Structured Streaming. In fact, the Kafka Structured Streaming source is based on the same principle as DStream's Kafka Direct, hence it has very similar behavior. On Tue, Apr 10, 2018 at 11:03 PM, SRK wrote: > hi, > > We have code based on

Re: can we use mapGroupsWithState in raw sql?

2018-04-16 Thread Tathagata Das
Unfortunately no. Honestly, it does not make sense: for type-aware operations like map, mapGroups, etc., you have to provide an actual JVM function, and that does not fit into the SQL language structure. On Mon, Apr 16, 2018 at 7:34 PM, kant kodali wrote: > Hi All, > > can

Re: [Structured Streaming] Application Updates in Production

2018-03-21 Thread Tathagata Das
Why do you want to start the new code in parallel to the old one? Why not stop the old one, and then start the new one? Structured Streaming ensures that all checkpoint information (offsets and state) are future-compatible (as long as state schema is unchanged), hence new code should be able to

Re: Apache Spark Structured Streaming - Kafka Consumer cannot fetch records for offset exception

2018-03-22 Thread Tathagata Das
Structured Streaming AUTOMATICALLY saves the offsets in a checkpoint directory that you provide. And when you start the query again with the same directory it will just pick up where it left off.
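A sketch showing where the checkpoint directory is supplied (paths are assumptions); restarting with the same directory resumes from the saved offsets:

    val query = df.writeStream
      .format("parquet")
      .option("path", "/data/out")
      .option("checkpointLocation", "/data/ckpt/query1")  // offsets + state live here
      .start()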

Re: [Structured Streaming] Application Updates in Production

2018-03-22 Thread Tathagata Das
new app will not pick up from the old > checkpoints, one would need to keep the old app and the new app running > until new app catches up on data processing with the old app. > > > ----- Original message - > From: Tathagata Das <tathagata.das1...@gmail.com> > To: Priy

Re: [Beginner] How to save Kafka Dstream data to parquet ?

2018-03-02 Thread Tathagata Das
successful. This way we get at-least-once > semantics and the partial file write issue. > > Thoughts? > > Sunil Parmar > > On Wed, Feb 28, 2018 at 1:59 PM, Tathagata Das < > tathagata.das1...@gmail.com> wrote: > >> There is no good way to save to parquet witho

Re: [Beginner] Kafka 0.11 header support in Spark Structured Streaming

2018-02-27 Thread Tathagata Das
Unfortunately, exposing Kafka headers is not yet supported in Structured Streaming. The community is more than welcome to add support for it :) On Tue, Feb 27, 2018 at 2:51 PM, Karthik Jayaraman wrote: > Hi all, > > I am using Spark 2.2.1 Structured Streaming to read

Re: How does Spark Structured Streaming determine an event has arrived late?

2018-02-27 Thread Tathagata Das
Let me answer the original question directly, that is, how do we determine that an event is late. We simply track the maximum event time the engine has seen in the data it has processed till now. And any data that has event time less than the max is basically "late" (as it is out-of-order). Now,

Re: How to reduceByKeyAndWindow in Structured Streaming?

2018-06-28 Thread Tathagata Das
The fundamental conceptual difference between windowing in DStreams vs Structured Streaming is that DStreams used the arrival time of the record in Spark (aka processing time) while Structured Streaming uses event time. If you want to exactly replicate DStream's processing-time windows in
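One way to approximate DStream-style processing-time windows, along the lines hinted at above, is to stamp each record with its arrival time; the column and interval choices are assumptions:

    import org.apache.spark.sql.functions.{current_timestamp, window}
    import spark.implicits._

    val counts = input
      .withColumn("procTime", current_timestamp())   // arrival-time stamp
      .groupBy(window($"procTime", "30 seconds", "10 seconds"), $"key")
      .count()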

Re: Lag and queued up batches info in Structured Streaming UI

2018-06-28 Thread Tathagata Das
>> It's all documented - https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#monitoring-streaming-queries >> On Wed, Jun 20, 2018 at 6:06 PM, Tathagata Das < >> tathagata.das1...@gmail.com> wrote: >>> Str

Re: Iterator of KeyValueGroupedDataset.flatMapGroupsWithState function

2018-10-31 Thread Tathagata Das
It is okay to collect the iterator. That will not break Spark. However, collecting it requires memory in the executor, so you may cause OOMs if a group has a LOT of new data. On Wed, Oct 31, 2018 at 3:44 AM Antonio Murgia - antonio.murg...@studio.unibo.it wrote: > Hi all, > > I'm currently

Re: Announcing Delta Lake 0.2.0

2019-06-21 Thread Tathagata Das
@ayan guha @Gourav Sengupta Delta Lake OSS currently does not support defining tables in the Hive metastore using DDL commands. We are hoping to add the necessary compatibility fixes in Apache Spark to make Delta Lake work with tables and DDL commands, so we will support them in a future release.

Re: How to execute non-timestamp-based aggregations in spark structured streaming?

2019-04-22 Thread Tathagata Das
row_number() > over (partition by Origin order by OnTimeDepPct desc) OnTimeDepRank,* > from flights > > This will *not* work in *structured streaming*: the culprit is: > > partition by Origin > > The requirement is to use a timestamp-typed field, such as > > partitio

Re: Structured Streaming Dataframe Size

2019-08-27 Thread Tathagata Das
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#basic-concepts *Note that Structured Streaming does not materialize the entire table*. It > reads the latest available data from the streaming data source, processes > it incrementally to update the result, and then

Re: Structured Streaming: How to add a listener for when a batch is complete

2019-09-03 Thread Tathagata Das
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reporting-metrics-programmatically-using-asynchronous-apis On Tue, Sep 3, 2019, 3:26 PM Natalie Ruiz wrote: > Hello all, > > > > I’m a beginner, new to Spark and wanted to know if there was an equivalent > to Spark
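A minimal sketch of the listener described on that page:

    import org.apache.spark.sql.streaming.StreamingQueryListener
    import org.apache.spark.sql.streaming.StreamingQueryListener._

    spark.streams.addListener(new StreamingQueryListener {
      override def onQueryStarted(e: QueryStartedEvent): Unit = ()
      override def onQueryProgress(e: QueryProgressEvent): Unit = {
        // fires once per completed micro-batch
        println(s"batch ${e.progress.batchId}: ${e.progress.numInputRows} rows")
      }
      override def onQueryTerminated(e: QueryTerminatedEvent): Unit = ()
    })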

Re: Structured Streaming Dataframe Size

2019-08-29 Thread Tathagata Das
select * from myAggTable This will give awesome ACID transactional guarantees between reads and writes. Read more on the linked website (full disclosure, I work on that project as well). > Thank you very much for your help! > > On Tue, Aug 27, 2019, 6:42 PM Tathagata Das

Re: how can I dynamic parse json in kafka when using Structured Streaming

2019-09-17 Thread Tathagata Das
You can use *from_json* built-in SQL function to parse json. https://spark.apache.org/docs/latest/api/java/org/apache/spark/sql/functions.html#from_json-org.apache.spark.sql.Column-org.apache.spark.sql.Column- On Mon, Sep 16, 2019 at 7:39 PM lk_spark wrote: > hi,all : > I'm using Structured
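A sketch of that, assuming the JSON carries `id` and `ts` fields and `kafkaDF` is the raw Kafka stream:

    import org.apache.spark.sql.functions.from_json
    import org.apache.spark.sql.types._
    import spark.implicits._

    // from_json needs the schema up front; it cannot be inferred per record.
    val schema = new StructType()
      .add("id", StringType)
      .add("ts", TimestampType)

    val parsed = kafkaDF
      .selectExpr("CAST(value AS STRING) AS json")
      .select(from_json($"json", schema).as("data"))
      .select("data.*")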

Announcing Delta Lake 0.3.0

2019-08-01 Thread Tathagata Das
Hello everyone, We are excited to announce the availability of Delta Lake 0.3.0 which introduces new programmatic APIs for manipulating and managing data in Delta Lake tables. Here are the main features: - Scala/Java APIs for DML commands - You can now modify data in Delta Lake
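A sketch of the new Scala DML APIs (the table path and predicates are assumptions):

    import io.delta.tables._

    val table = DeltaTable.forPath(spark, "/data/events")
    table.delete("date < '2018-01-01'")                        // DELETE
    table.updateExpr("eventType = 'clk'",
      Map("eventType" -> "'click'"))                           // UPDATE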

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Tathagata Das
You are deserializing by explicitly specifying the UTC timezone, but when serializing you are not specifying it. Maybe that is the reason? Also, if you can encode it using just a long, then I recommend saving the value as a long and eliminating some of the serialization overhead. Spark will probably

Re: Structured Streaming: mapGroupsWithState UDT serialization does not work

2020-02-28 Thread Tathagata Das
ng a simple group by. > > Regards, > > Bryan Jeffrey > > Get Outlook for Android <https://aka.ms/ghei36> > > -- > *From:* Tathagata Das > *Sent:* Friday, February 28, 2020 4:56:07 PM > *To:* Bryan Jeffrey > *Cc:* user > *Subject:*

Re: Stateful Structured Spark Streaming: Timeout is not getting triggered

2020-03-04 Thread Tathagata Das
Make sure that you are continuously feeding data into the query to trigger the batches; only then are timeouts processed. See the timeout behavior details here - https://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.streaming.GroupState On Wed, Mar 4, 2020 at 2:51 PM
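A sketch of the timeout pattern (the Event type, key field, and 10-minute duration are assumptions); note the timed-out branch only runs when a later batch triggers:

    import org.apache.spark.sql.streaming.{GroupState, GroupStateTimeout}
    import spark.implicits._

    case class Event(key: String)

    def update(key: String, events: Iterator[Event], state: GroupState[Int]): Int =
      if (state.hasTimedOut) {            // invoked with an empty iterator
        val c = state.get
        state.remove()
        c
      } else {
        val c = state.getOption.getOrElse(0) + events.size
        state.update(c)
        state.setTimeoutDuration("10 minutes")
        c
      }

    val counts = eventsDS                 // eventsDS: Dataset[Event]
      .groupByKey(_.key)
      .mapGroupsWithState(GroupStateTimeout.ProcessingTimeTimeout)(update)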

Re: Spark Streaming: Aggregating values across batches

2020-02-27 Thread Tathagata Das
Use Structured Streaming. Its aggregation, by definition, is across batches. On Thu, Feb 27, 2020 at 3:17 PM Something Something < mailinglist...@gmail.com> wrote: > We've a Spark Streaming job that calculates some values in each batch. > What we need to do now is aggregate values across ALL
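A sketch of a running aggregate across all batches (column name assumed):

    import spark.implicits._

    val totals = events.groupBy($"userId").count()  // state maintained across batches

    totals.writeStream
      .outputMode("complete")   // emit the full updated result each trigger
      .format("console")
      .start()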

Re: dropDuplicates and watermark in structured streaming

2020-02-27 Thread Tathagata Das
1. Yes. All times are in event time, not processing time. So you may get 10AM event-time data at 11AM processing time, but it will still be compared against all data within the 9-10AM event-time range. 2. Show us your code. On Thu, Feb 27, 2020 at 2:30 AM lec ssmi wrote: > Hi: > I'm new to structured
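For point 1, a sketch of watermarked deduplication (column names assumed); including the event-time column in the keys lets old state be dropped:

    val deduped = events
      .withWatermark("eventTime", "1 hour")   // event time, not processing time
      .dropDuplicates("id", "eventTime")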

Re: dropDuplicates and watermark in structured streaming

2020-02-28 Thread Tathagata Das
lec ssmi wrote: > Such as: > df.withWatermark("time", "window size").dropDuplicates("id").withWatermark("time", "real watermark").groupBy(window("time", "window size", "window size")).agg(count(

Re: Spark, read from Kafka stream failing AnalysisException

2020-04-05 Thread Tathagata Das
Have you looked at the suggestion in the error message, i.e., searching for "Structured Streaming + Kafka Integration Guide" in Google? It should be the first result. The last section in the "Structured Streaming

Re: Is Spark Structured Streaming TOTALLY BROKEN (Spark Metadata Issues)

2020-06-17 Thread Tathagata Das
Hello Rachana, Getting exactly-once semantics on files and making it scale to a very large number of files are very hard problems to solve. While Structured Streaming + built-in file sink solves the exactly-once guarantee that DStreams could not, it is definitely limited in other ways (scaling in

Re: how can i write spark addListener metric to kafka

2020-06-09 Thread Tathagata Das
https://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#reporting-metrics-programmatically-using-asynchronous-apis On Tue, Jun 9, 2020 at 4:42 PM a s wrote: > hi Guys, > > I am building a structured streaming app for google analytics data > > i want to capture the

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
If you want to find what offset ranges are present in a microbatch in Structured Streaming, you have to look at StreamingQuery.lastProgress or use the QueryProgressListener. Both of these
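A sketch of pulling the offset ranges out of lastProgress (`query` is an already-started StreamingQuery):

    val progress = query.lastProgress           // null until the first batch completes
    progress.sources.foreach { s =>
      println(s"source=${s.description} start=${s.startOffset} end=${s.endOffset}")
    }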

Re: Dstream HasOffsetRanges equivalent in Structured streaming

2024-05-22 Thread Tathagata Das
Unfortunately, the listener process is async and can't guarantee a happens- > before association with the microbatch to commit offsets to external storage. > But still they will work. Is there a way to access lastProgress in > foreachBatch? > > On Wed, May 22, 2024 at 7:35 AM Tathagata Das
