[
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304108#comment-14304108
]
Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------
Hi, [~jkreps], just refreshed my memory on the tuple vs time window comparison
in http://cs.brown.edu/~ugur/streamsql.pdf
{quote}
Key difference is the "tuple-driven" vs "time-driven" distinction. Personally I
thought tuple driven is a much closer fit to the underlying Kafka concepts (an
ordered stream of tuples).
{quote}
I agree from the ordering point of view, tuple-driven is a natural fit w/ Kafka
concept. Revisiting the problems presented by the paper, there are mainly the
following two issues:
# in time-based window, when events happened with the same timestamp, there is
no defined ordering.
# in tuple-based window, events in the same window (by number of tuples) may
actually expand over long physical time and break the "simultaneity" assumption.
In Samza/Kafka world, we don't need to worry about ordering in a stream if:
# ordering in the same stream is always defined by Kafka partitioned ordering
# ordering between different input streams is always defined by a consistent
MessageSelector among the input streams
Now, the only issue to be resolved is how to maintain the "simultaneity"
semantics among different streams. Here are a few things to consider:
# what's the definition of "simultaneity"? Based on the time both events
arrives at Samza consumer or based on the actual application timestamp? Ideally
it should be later, but it requires the producer to explicitly tag the events,
and also requires producers to follow the same wall-clock. Short of that, we
can only inject the system timestamp at the ingress point of Samza job, at best.
# how to maintain the ordering between the streams? The paper proposed a total
order of events in the input streams in a single job: a) order by timestamp of
the event to maintain "simultaneity"; b) define a total order for events w/ the
same timestamp from all input streams. I believe that we have a way to define
the total oder for all input streams in Samza. The tricky part is to maintain
the "simultaneity" semantics which requires to sort the events based on
timestamp first, then the ordering provided by the Kafka system for "same time
events".
Here is what I am thinking as the first step implementation:
# define "simultaneous event" by the timestamp of the ingress Samza consumer.
Hence, w/ only one Samza job per query, we don't need to re-sort the received
events from Kafka streams since they are already sorted based on "arrival
timestamp"
In the future, the extension to application timestamp could be the following:
# Each event will be tagged by ( _app-ts_, _strm-order-no_ , _offset_ ) and all
events from all input streams can be sorted in a single ordered sequence of
events.
# There will be a termination condition for the last event with a timestamp
that is before a certain _app_ts_ that can still be accepted by the Job (i.e.
it may or may not be the heart-beat method). Upon closing this window, the
ordered sequence of events from all input streams are determined and should be
processed.
## I have not come up with a good idea of "closing window condition" that can
reliably work in a distributed environment yet. Some LinkedIn jobs are using
timeout method to close a window, like the call-graph jobs.
> High-Level Language for Samza
> -----------------------------
>
> Key: SAMZA-390
> URL: https://issues.apache.org/jira/browse/SAMZA-390
> Project: Samza
> Issue Type: New Feature
> Components: sql
> Reporter: Raul Castro Fernandez
> Priority: Minor
> Labels: project
> Attachments: StreamSQLforSAMZA-v0.1.docx.docx
>
>
> Discussion about high-level languages to define Samza queries. Queries are
> defined in this language and transformed to a dataflow graph where the nodes
> are Samza jobs.
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)