Trigger does not mean "report the current result every trigger-interval
seconds". It means the query will attempt to fetch new data and process it
no more often than once per trigger interval.
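
For concreteness, here is a minimal sketch of the kind of pipeline under
discussion, using a console sink for illustration (the original thread
writes back to Kafka); the broker address "localhost:9092" and topic
"events" are placeholders:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.streaming.Trigger

    val spark = SparkSession.builder.appName("kafka-count").getOrCreate()

    // Read the topic from the earliest offset.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .load()

    df.createOrReplaceTempView("kafka_view")

    // Update mode emits only the aggregate rows that changed since the
    // last batch; the trigger means "start a new batch no more often
    // than every 1000 ms", not "emit a result every 1000 ms".
    val query = spark.sql("select count(*) from kafka_view")
      .writeStream
      .format("console")
      .outputMode("update")
      .trigger(Trigger.ProcessingTime(1000))
      .start()

    query.awaitTermination()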

If you're reading from the beginning and you've got 10M entries in Kafka,
it is likely pulling everything down, processing it completely, and giving
you an initial output. From then on, it will check Kafka every 1 second for
new data and process it, showing you only the updated rows. So the initial
read gives you the entire output, since there is nothing to be 'updating'
from. If you add data to Kafka now that the streaming job has completed its
first batch (and leave it running), it will show you the new/updated rows
since the last batch every 1 second (assuming it can fetch + process in
that time span).
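
If the goal is to avoid one giant first batch, the Kafka source's
maxOffsetsPerTrigger option caps how many offsets each micro-batch
consumes, so the 10M backlog gets chewed through incrementally. A sketch
(the cap of 100000 is an arbitrary example, same placeholder broker/topic
as above):

    // maxOffsetsPerTrigger caps each micro-batch at ~100k offsets, so
    // the initial backlog is processed as many small batches instead of
    // one huge one.
    val df = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092")
      .option("subscribe", "events")
      .option("startingOffsets", "earliest")
      .option("maxOffsetsPerTrigger", "100000")
      .load()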

If the combined fetch + processing time exceeds the trigger interval, you
will see warnings that it is 'falling behind' (I forget the exact verbiage,
but something to the effect of "the calculation took XX time and is falling
behind"). In that case, it will immediately check Kafka for new messages
and begin processing the next batch (if new messages exist).
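
If you'd rather not wait for the warning, you can check per-batch timing
yourself through the StreamingQuery handle returned by start(); if the
triggerExecution duration regularly exceeds the trigger interval, batches
are falling behind. A quick sketch, assuming "query" is that handle:

    // lastProgress is null until the first batch completes.
    val progress = query.lastProgress
    if (progress != null) {
      val tookMs = progress.durationMs.get("triggerExecution")
      println(s"batch ${progress.batchId} took ${tookMs} ms")
    }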

Hope that makes sense -


On Mon, Mar 19, 2018 at 13:36 kant kodali <kanth...@gmail.com> wrote:

> Hi All,
>
> I have 10 million records in my Kafka and I am just trying to
> spark.sql(select count(*) from kafka_view). I am reading from kafka and
> writing to kafka.
>
> My writeStream is set to "update" mode with a trigger interval of one second
> (Trigger.ProcessingTime(1000)). I expect the counts to be printed every
> second, but it looks like it only prints after going through all 10M. Why?
>
> Also, it seems to take forever, whereas a Linux wc of 10M rows would take 30
> seconds.
>
> Thanks!
>
