Hey Yeonsoo Han,

This would be very cool. As Jay said, we have talked about this for a while.

Your diagram looks pretty reasonable, and reflects how I was thinking about the 
problem as well. Some initial things that strike me about your diagram:

  1.  Does it make sense to make the output of the final job a stream as well, 
rather than write it to the DB (as shown in your diagram)?
  2.  I think the DDL queries would require interacting with Kafka directly 
(create topic). Samza's system interfaces are pluggable to support more than 
just Kafka. One way to model the DDL stuff would be to update Samza's 
SystemAdmin interface to allow streams to be created and deleted. We'd probably 
need to think about this a bit.
  3.  I'm reading your "Query API" boxes as chunks of code that implement 
StreamTask, and do things like SELECT, WHERE, GROUP BY, etc.
  4.  Samza supports pluggable MetricsReporters for things like JMX, Ganglia, 
Graphite, etc. I imagine a fair chunk of the SELECT, COUNT, GROUP BY stuff will 
fit naturally into charts. Perhaps it makes sense to support output from a SQL 
query to a MetricsReporter? This would essentially allow you to write a SQL 
query whose output is a chart in your metrics/monitoring software.
  5.  Can you expand a bit on what the Data Collector is? It might be worth 
considering a way to support SQL queries without a custom collector, perhaps by 
enforcing a specific serialization (e.g. Avro or JSON), or by having a 
translation layer between the serialization of the message and the data model 
used by the SQL layer. The reason I raise this is that most organizations have 
a ton of data in existing formats (Protobuf, JSON, Avro, etc.), and it'd be nice 
to be able to query that set of data without having to re-encode everything in 
some new format.
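
To make points 3 and 5 a bit more concrete, here is a rough Python sketch of 
the idea. It is not Samza code: a small translation layer turns serialized 
messages into plain rows, and a task-like class applies a WHERE predicate and 
keeps a GROUP BY count. Every name in it (DECODERS, FilterGroupCountTask, the 
page/status fields) is made up for illustration, and only JSON is wired in.

```python
import json
from collections import Counter

# Hypothetical translation layer: one decoder per wire format, each turning
# raw message bytes into a plain dict ("row") that the SQL layer can query.
DECODERS = {
    "json": lambda raw: json.loads(raw.decode("utf-8")),
    # "avro": ..., "protobuf": ...  (would plug in here; not shown)
}


class FilterGroupCountTask:
    """Illustrative stand-in for a StreamTask-like operator (not Samza's API).

    Behaves roughly like:
        SELECT <group_key>, COUNT(*) FROM input
        WHERE <predicate> GROUP BY <group_key>
    """

    def __init__(self, wire_format, predicate, group_key):
        self.decode = DECODERS[wire_format]
        self.predicate = predicate
        self.group_key = group_key
        self.counts = Counter()

    def process(self, raw_message):
        """Handle one message; return the updated (key, count) or None."""
        row = self.decode(raw_message)
        if not self.predicate(row):  # WHERE
            return None
        key = row[self.group_key]  # GROUP BY
        self.counts[key] += 1  # COUNT(*)
        return key, self.counts[key]


# Made-up sample input: three JSON-encoded page-view events.
messages = [
    b'{"page": "/home", "status": 200}',
    b'{"page": "/cart", "status": 500}',
    b'{"page": "/cart", "status": 503}',
]
task = FilterGroupCountTask("json", lambda row: row["status"] >= 500, "page")
updates = [task.process(m) for m in messages]
```

Each non-None update is something the job could send to an output stream 
rather than a DB (point 1), and supporting another format would only mean 
adding a decoder, not re-encoding the existing data.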

It also might be really useful to have a bit of a write-up, as Jay suggests, 
about the architecture and SQL language.

This is really exciting! Thanks for your interest. :)

Cheers,
Chris

From: Jay Kreps <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, December 10, 2013 10:25 AM
To: "[email protected]" <[email protected]>
Subject: Re: [Proposal]SQL on Samza :D

Yeah this would be great. We have talked about something like this internally 
for a while but never gotten to it.

The general idea was to do something like Hive--have a client that takes a 
query (or set of queries) and turns it into a plan that is then executed in the 
appropriate number of jobs.
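
As a very rough illustration of "query in, chain of jobs out", here is a toy 
Python planner. It assumes the query has already been parsed into a dict, and 
it is not how Hive or Samza actually plans anything; the Job class, the plan 
function, and the stream names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    input_stream: str
    output_stream: str
    operator: str


def plan(query):
    """Turn a pre-parsed (hypothetical) query dict into an ordered job list."""
    jobs = []
    source = query["from"]
    # One job per clause present in the query, in execution order.
    stages = [(op, query[op]) for op in ("where", "group_by") if op in query]
    for i, (op, arg) in enumerate(stages):
        last = i == len(stages) - 1
        # Intermediate results flow through a stream between jobs;
        # only the last job writes to the query's final output stream.
        target = query["into"] if last else f"{query['into']}-stage{i}"
        jobs.append(Job(f"{op}-job", source, target, f"{op}({arg})"))
        source = target
    return jobs


# Made-up example query, already parsed into a dict.
query = {
    "from": "page-views",
    "where": "status >= 500",
    "group_by": "page",
    "into": "error-counts",
}
jobs = plan(query)
```

Here the query becomes two jobs chained through an intermediate stream; a 
query with only a WHERE clause would come out as a single job.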

I wonder if you would be up for posting some of the ideas you have around the 
design of the language as you progress to get feedback and discussion?

-Jay


On Tue, Dec 10, 2013 at 6:45 AM, HANYOUNSU <[email protected]> wrote:
 Greetings Samza dev,
 I work for the Big Data Platform Development Team at Gruter in South Korea.
 Currently, I’m developing a real-time data processing system which integrates 
an SQL engine with a Samza-based platform as part of a thesis for a master’s 
degree. So in a sense we might call it “SQL-on-Samza” or something like that. I 
was wondering if it might be possible for me to contribute to the Apache Samza 
project through my thesis work.
 To explain my ideas more clearly, I have attached a rough diagram; please let 
me know if that type of contribution would fit in with the Samza team’s roadmap.

From Yeonsoo Han.
[cid:5EEB0ADA-DA6A-4332-A231-C852E469C384]
