Hey Yeonsoo Han,

This would be very cool. As Jay said, we have talked about this for a while.
Your diagram looks pretty reasonable, and reflects how I was thinking about the problem as well. Some initial things that strike me about your diagram:

1. Does it make sense to make the output of the final job a stream as well, rather than writing it to the DB (as shown in your diagram)?

2. I think the DDL queries would require interacting with Kafka directly (create topic). Samza's system interfaces are pluggable to support more than just Kafka. One way to model the DDL stuff would be to update Samza's SystemAdmin interface to allow streams to be created and deleted (see the sketch at the end of this thread). We probably need to think about this a bit.

3. I'm reading your "Query API" boxes as chunks of code that implement StreamTask and do things like SELECT, WHERE, GROUP BY, etc. (a sketch of such a task also appears at the end of this thread).

4. Samza supports pluggable MetricsReporters for things like JMX, Ganglia, Graphite, etc. I imagine a fair chunk of the SELECT, COUNT, GROUP BY stuff will fit naturally into charts. Perhaps it makes sense to support output from a SQL query to a MetricsReporter? This would essentially allow you to write a SQL query whose output is a chart in your metrics/monitoring software.

5. Can you expand a bit on what the Data Collector is? It might be worth considering a way to support SQL queries without a custom collector, perhaps by enforcing a specific serialization (e.g. Avro or JSON), or by having a translation layer between the serialization of the message and the data model used by the SQL layer (a rough sketch of that idea is at the end of the thread as well). The reason I raise this is that most organizations have a ton of data in existing formats (Protobuf, JSON, Avro, etc.), and it would be nice to be able to query that data without having to re-encode everything in some new format.

It also might be really useful to have a bit of a write-up, as Jay suggests, about the architecture and the SQL language.

This is really exciting! Thanks for your interest. :)

Cheers,
Chris

From: Jay Kreps <[email protected]>
Reply-To: "[email protected]" <[email protected]>
Date: Tuesday, December 10, 2013 10:25 AM
To: "[email protected]" <[email protected]>
Subject: Re: [Proposal] SQL on Samza :D

Yeah, this would be great. We have talked about something like this internally for a while but never gotten to it. The general idea was to do something like Hive: have a client that takes a query (or set of queries) and turns it into a plan that is then executed in the appropriate number of jobs.

I wonder if you would be up for posting some of the ideas you have around the design of the language as you progress, to get feedback and discussion?

-Jay

On Tue, Dec 10, 2013 at 6:45 AM, HANYOUNSU <[email protected]> wrote:

Greetings Samza dev,

I work for the Big Data Platform Development Team at Gruter in South Korea. Currently, I'm developing a realtime data processing system which integrates an SQL engine with a Samza-based platform as part of my master's thesis, so in a sense we might call it "SQLonSamza" or something like that.

I was wondering if it might be possible for me to contribute to the Apache Samza project through my thesis work. To explain my ideas more clearly, I have attached a rough diagram; please let me know if that type of contribution would fit in with the Samza team's roadmap.

From Yeonsoo Han.

[Attachment: architecture diagram]
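A minimal sketch of the DDL idea from point 2 above. The createStream/deleteStream methods here are hypothetical, not part of Samza's current SystemAdmin interface; the thought is only that a CREATE STREAM / DROP STREAM statement could be delegated to the pluggable system layer, where a Kafka-backed implementation would map them to topic creation and deletion.

import org.apache.samza.system.SystemAdmin;

/**
 * Hypothetical extension of the pluggable SystemAdmin interface so that
 * DDL statements from the SQL layer can be handed to the underlying
 * system (e.g. topic create/delete for Kafka). Method names are made up.
 */
public interface StreamLifecycleAdmin extends SystemAdmin {
  /** Create the named stream with the requested partition count. */
  void createStream(String streamName, int partitionCount);

  /** Delete the named stream, if the underlying system supports it. */
  void deleteStream(String streamName);
}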
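And a minimal sketch of what one of the "Query API" boxes from point 3 might boil down to: a StreamTask that applies a WHERE-style filter. It assumes the job is configured with a JSON serde so messages arrive as Map<String, Object>; the stream names and the country field are invented for illustration.

import java.util.Map;

import org.apache.samza.system.IncomingMessageEnvelope;
import org.apache.samza.system.OutgoingMessageEnvelope;
import org.apache.samza.system.SystemStream;
import org.apache.samza.task.MessageCollector;
import org.apache.samza.task.StreamTask;
import org.apache.samza.task.TaskCoordinator;

/**
 * Illustrative only: roughly SELECT * FROM page-views WHERE country = 'KR',
 * compiled down to a single StreamTask.
 */
public class WhereFilterTask implements StreamTask {
  private static final SystemStream OUTPUT = new SystemStream("kafka", "page-views-kr");

  @Override
  @SuppressWarnings("unchecked")
  public void process(IncomingMessageEnvelope envelope, MessageCollector collector, TaskCoordinator coordinator) {
    Map<String, Object> row = (Map<String, Object>) envelope.getMessage();
    // The WHERE predicate: forward only rows whose country field is "KR".
    if ("KR".equals(row.get("country"))) {
      collector.send(new OutgoingMessageEnvelope(OUTPUT, row));
    }
  }
}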
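Finally, a rough sketch of the translation-layer idea from point 5: a small adapter (the SqlRowTranslator name is invented) that reuses whatever Serde is already configured for a stream and reshapes the decoded message into a generic field-to-value row for the SQL layer, so existing Avro/JSON/Protobuf data would not need to be re-encoded.

import java.util.Map;

import org.apache.samza.serializers.Serde;

/**
 * Hypothetical translation layer between an existing message format and
 * the SQL layer's data model. The configured Serde does the decoding;
 * this adapter only reshapes the result into a generic row.
 */
public class SqlRowTranslator {
  private final Serde<Object> serde;

  public SqlRowTranslator(Serde<Object> serde) {
    this.serde = serde;
  }

  /**
   * Decode raw bytes with the existing Serde and expose them as a row.
   * A JSON serde already yields a Map; an Avro adapter would copy a
   * record's fields into the same Map shape.
   */
  @SuppressWarnings("unchecked")
  public Map<String, Object> toRow(byte[] rawMessage) {
    return (Map<String, Object>) serde.fromBytes(rawMessage);
  }
}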
