[ 
https://issues.apache.org/jira/browse/SAMZA-390?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14229292#comment-14229292
 ] 

Ben Kirwin commented on SAMZA-390:
----------------------------------

I'm a bit late to the conversation, so excuse me if I jump right in.

I've been developing {{coast}}, another high-level streaming project that 
compiles down to Samza jobs. (https://github.com/bkirwi/coast) The 
implementation and 'frontend DSL' for this project is in Scala, and there's 
also some support code for unit testing and job-graph visualization. Unlike 
most of what's discussed above, this project is *not* directly SQL-inspired; by 
analogy to the Hadoop ecosystem, it's more of a Cascading than a Hive. (I 
suspect you could throw a SQLish frontend on top of it, in the same way Spark 
SQL does for Spark, but nobody's actually tried this yet.) {{coast}}'s 
exactly-once semantics are largely inspired by Kafka's log model, and many of 
the design choices are intended to make it a 'good citizen' in a larger 
Kafka-based infrastructure.

Since we're taking fairly different approaches, I'm not sure how well my 
conclusions translate over. I'm still trying to plow through some of the papers 
in this thread (thanks for these!) so for the moment, I'll leave some notes on 
managing time.

[~yipan] mentioned some difficulties with reconciling the framework- and 
application-level views of time: machines may disagree about the current time, 
messages may be processed arbitrarily long after they're sent, timestamps in a 
stream might not be monotonically increasing, and messages in different 
partitions might be ordered at the application level but not within Kafka / 
another backing store. I regard all these issues as pretty fundamental in a 
distributed context. I'm also pessimistic about any framework's ability to 
'paper over' them; there are a bunch of viable solutions, none of which seem to 
work seamlessly for all problems. 

So far, like Freshet, {{coast}} has basically punted on this: the only notion 
of time / ordering comes from Kafka's ordering within a partition. It turns out 
you can get pretty far with this, but not all the way -- for example, you can 
window by message count but not by time. To bridge that gap in expressiveness, 
I've been playing with the idea of a 'clock stream', which just produces a 
'tick' message every *n* seconds. It turns out that this type of stream has 
essentially the same semantics as a Kafka-backed stream does: every task sees 
the same messages in the same order, but possibly skewed slightly in time 
depending on the local clock; and it's possible to 'checkpoint' your position 
in the stream by remembering the time of the last successfully-processed 
'tick'. For {{coast}}, this turns out to be very useful -- I already have a 
rich language for manipulating streams, so by making clock streams 'first 
class', it's possible to implement most (all?) of the time-based windowing 
strategies discussed above in library code.

In a SQL-type API, it would probably be awkward to manipulate streams of ticks 
directly; but it might still be useful as a implementation mechanism or the 
basis for a semantic model.

As noted, this is all somewhat speculative for me at the moment; I'll leave a 
note here once I get around to implementing this.

> High-Level Language for Samza
> -----------------------------
>
>                 Key: SAMZA-390
>                 URL: https://issues.apache.org/jira/browse/SAMZA-390
>             Project: Samza
>          Issue Type: New Feature
>            Reporter: Raul Castro Fernandez
>            Priority: Minor
>              Labels: project
>
> Discussion about high-level languages to define Samza queries. Queries are 
> defined in this language and transformed to a dataflow graph where the nodes 
> are Samza jobs.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Reply via email to