Re: Is the trigger interval the same as batch interval in structured streaming?

Tathagata Das Mon, 10 Apr 2017 13:40:25 -0700

The trigger interval is optional. You set it with .trigger(...) on the
writeStream builder, before calling start().


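// Spark 2.1: import org.apache.spark.sql.streaming.ProcessingTime
val windowedCounts = words.groupBy(
  window($"timestamp", "24 hours", "24 hours"),
  $"word"
).count()

// Keep the StreamingQuery handle separate from the aggregated DataFrame.
val query = windowedCounts.writeStream
  .outputMode("complete")  // append (the default) is rejected for aggregations without a watermark
  .trigger(ProcessingTime("10 seconds"))  // optional
  .format("memory")
  .queryName("tableName")
  .start()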

See the full example here -
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/sql/streaming/StructuredNetworkWordCountWindowed.scala
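
To make the distinction between window and trigger concrete, here is a minimal pure-Scala sketch (no Spark involved) of the idea discussed in this thread: rows are assigned to a 24-hour tumbling window by their event timestamp, while the trigger only decides how many records are folded into the running counts at a time. Everything in it (the Event type, windowStart, processTrigger, updatedRows, run, and windows aligned to the epoch) is a hypothetical illustration and not how Spark implements it internally.

```scala
// Pure-Scala sketch (no Spark): tumbling-window counts depend only on event
// time, not on how records are grouped into triggers (micro-batches).
object TumblingWindowSketch {
  // Hypothetical record type: a word and its event time in epoch milliseconds.
  final case class Event(word: String, timestampMs: Long)

  // Key is (window start, word); value is the running count.
  type Counts = Map[(Long, String), Long]

  val WindowMs: Long = 24L * 60 * 60 * 1000 // 24-hour tumbling window

  // Start of the tumbling window that contains the timestamp.
  // Assumption of this sketch: windows are aligned to the epoch.
  def windowStart(timestampMs: Long): Long =
    java.lang.Math.floorDiv(timestampMs, WindowMs) * WindowMs

  // Fold one trigger's worth of events into the running state.
  def processTrigger(state: Counts, batch: Seq[Event]): Counts =
    batch.foldLeft(state) { (acc, e) =>
      val key = (windowStart(e.timestampMs), e.word)
      acc.updated(key, acc.getOrElse(key, 0L) + 1L)
    }

  // "Update"-style output: only the rows touched by the given batch.
  def updatedRows(state: Counts, batch: Seq[Event]): Counts = {
    val touched = batch.map(e => (windowStart(e.timestampMs), e.word)).toSet
    state.filter { case (key, _) => touched(key) }
  }

  // Process the whole stream, taking a fixed number of events per trigger.
  def run(events: Seq[Event], eventsPerTrigger: Int): Counts =
    events.grouped(eventsPerTrigger)
      .foldLeft(Map.empty[(Long, String), Long]) { (state, batch) =>
        processTrigger(state, batch)
      }

  def main(args: Array[String]): Unit = {
    val hour = 60L * 60 * 1000
    val events = Seq(
      Event("spark", 1 * hour),
      Event("spark", 5 * hour),
      Event("kafka", 23 * hour),
      Event("spark", 25 * hour) // falls into the next 24-hour window
    )
    val fine = run(events, eventsPerTrigger = 1)   // many small triggers
    val coarse = run(events, eventsPerTrigger = 3) // fewer, larger triggers
    println(s"same result regardless of trigger size: ${fine == coarse}")
    fine.toSeq.sortBy(_._1).foreach(println)
  }
}
```

Changing eventsPerTrigger changes how often intermediate results become visible, but not the final counts per window. The count "resets" every 24 hours only in the sense that later events land in a new window key.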


On Mon, Apr 10, 2017 at 12:55 PM, kant kodali <kanth...@gmail.com> wrote:

> Thanks again! It looks like update mode is not available in 2.1 (which
> seems to be the latest version as of today). I am assuming the next
> release will add a way to specify the trigger interval, because I don't
> see a way to specify it with the following code.
>
> val windowedCounts = words.groupBy(
>   window($"timestamp", "24 hours", "24 hours"),
>   $"word").count()
>
>
> On Mon, Apr 10, 2017 at 12:32 PM, Michael Armbrust <mich...@databricks.com
> > wrote:
>
>> It sounds like you want a tumbling window (where the slide and duration
>> are the same).  This is the default if you give only one interval.  You
>> should set the output mode to "update" (i.e. output only the rows that have
>> been updated since the last trigger) and the trigger to "1 second".
>>
>> Try thinking about the batch query that would produce the answer you
>> want.  Structured streaming will figure out an efficient way to compute
>> that answer incrementally as new data arrives.
>>
>> On Mon, Apr 10, 2017 at 12:20 PM, kant kodali <kanth...@gmail.com> wrote:
>>
>>> Hi Michael,
>>>
>>> Thanks for the response. I guess I was thinking more in terms of the
>>> regular streaming model. In this case I am a little confused about what
>>> my window interval and slide interval should be for the following case.
>>>
>>> I need to hold a state (say a count) for 24 hours while capturing all
>>> its updates and produce results every second. I also need to reset the
>>> state (the count) back to zero every 24 hours.
>>>
>>> On Mon, Apr 10, 2017 at 11:49 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> Nope, structured streaming removes the limitation that micro-batching
>>>> affects the results of your streaming query.  A trigger is just an
>>>> indication of how often you want to produce results (and if you leave it
>>>> blank we just run as quickly as possible).
>>>>
>>>> To control how tuples are grouped into a window, take a look at the
>>>> window
>>>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html#window-operations-on-event-time>
>>>> function.
>>>>
>>>> On Thu, Apr 6, 2017 at 10:26 AM, kant kodali <kanth...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hi All,
>>>>>
>>>>> Is the trigger interval mentioned in this doc
>>>>> <http://spark.apache.org/docs/latest/structured-streaming-programming-guide.html>
>>>>> the same as batch interval in structured streaming? For example, I have a
>>>>> long-running receiver (not Kafka) which sends me a real-time stream. I want
>>>>> to use a window interval and slide interval of 24 hours to create the
>>>>> tumbling window effect, but I want to process updates every second.
>>>>>
>>>>> Thanks!
>>>>>
>>>>
>>>>
>>>
>>
>
