[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Yi Pan (Data Infrastructure) (JIRA) Tue, 03 Feb 2015 14:05:51 -0800

    [ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14304108#comment-14304108
 ]


Yi Pan (Data Infrastructure) commented on SAMZA-390:
----------------------------------------------------

Hi, [~jkreps], just refreshed my memory on the tuple vs time window comparison 
in http://cs.brown.edu/~ugur/streamsql.pdf

{quote}
Key difference is the "tuple-driven" vs "time-driven" distinction. Personally I 
thought tuple driven is a much closer fit to the underlying Kafka concepts (an 
ordered stream of tuples).
{quote}

I agree from the ordering point of view, tuple-driven is a natural fit w/ Kafka 
concept. Revisiting the problems presented by the paper, there are mainly the 
following two issues:
# in time-based window, when events happened with the same timestamp, there is 
no defined ordering.
# in tuple-based window, events in the same window (by number of tuples) may 
actually expand over long physical time and break the "simultaneity" assumption.

In Samza/Kafka world, we don't need to worry about ordering in a stream if:
# ordering in the same stream is always defined by Kafka partitioned ordering
# ordering between different input streams is always defined by a consistent 
MessageSelector among the input streams

Now, the only issue to be resolved is how to maintain the "simultaneity" 
semantics among different streams. Here are a few things to consider:
# what's the definition of "simultaneity"? Based on the time both events 
arrives at Samza consumer or based on the actual application timestamp? Ideally 
it should be later, but it requires the producer to explicitly tag the events, 
and also requires producers to follow the same wall-clock. Short of that, we 
can only inject the system timestamp at the ingress point of Samza job, at best.
# how to maintain the ordering between the streams? The paper proposed a total 
order of events in the input streams in a single job: a) order by timestamp of 
the event to maintain "simultaneity"; b) define a total order for events w/ the 
same timestamp from all input streams. I believe that we have a way to define 
the total oder for all input streams in Samza. The tricky part is to maintain 
the "simultaneity" semantics which requires to sort the events based on 
timestamp first, then the ordering provided by the Kafka system for "same time 
events".

Here is what I am thinking as the first step implementation:
# define "simultaneous event" by the timestamp of the ingress Samza consumer. 
Hence, w/ only one Samza job per query, we don't need to re-sort the received 
events from Kafka streams since they are already sorted based on "arrival 
timestamp"

In the future, the extension to application timestamp could be the following:
# Each event will be tagged by ( _app-ts_, _strm-order-no_ , _offset_ ) and all 
events from all input streams can be sorted in a single ordered sequence of 
events.
# There will be a termination condition for the last event with a timestamp 
that is before a certain _app_ts_ that can still be accepted by the Job (i.e. 
it may or may not be the heart-beat method). Upon closing this window, the 
ordered sequence of events from all input streams are determined and should be 
processed. 
## I have not come up with a good idea of "closing window condition" that can 
reliably work in a distributed environment yet. Some LinkedIn jobs are using 
timeout method to close a window, like the call-graph jobs.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>          Components: sql
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>         Attachments: StreamSQLforSAMZA-v0.1.docx.docx
>
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

[jira] [Commented] (SAMZA-390) High-Level Language for Samza

Reply via email to