One more: Flink seems to have made a lot of progress on SQL (still experimental), but they support a more limited feature set for streaming SQL. https://ci.apache.org/projects/flink/flink-docs-master/dev/table_api.html#limitations-1 SELECT, FROM, WHERE, and UNION clauses are all that Flink supports for streaming SQL. I guess they're waiting for the "Streaming SQL" specification to stabilize, since they don't have micro-batch, so windows in "Streaming SQL" are mandatory to implement aggregation, join, etc.
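For illustration, a streaming query that stays within those limits (projection and filtering only, no windowed aggregation or join) might look like the sketch below; the Orders stream and its columns are hypothetical names, not taken from the Flink docs:

```sql
-- Within Flink's supported streaming subset: SELECT / FROM / WHERE.
-- (Orders, productId, and units are hypothetical names.)
SELECT productId, units
FROM Orders
WHERE units > 10;

-- GROUP BY, HAVING, and JOIN are not in the supported clause list
-- for streaming tables at this point.
```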
- Jungtaek Lim (HeartSaVioR)

On Tue, Oct 11, 2016 at 6:36 AM, Jungtaek Lim <[email protected]> wrote:

> Hi Morrigan,
>
> Since we're building a SQL feature, we need to respect the SQL standard, its behavior, and common sense, instead of changing its behavior to fit Storm's streaming fashion. Some limitations are actually unavoidable.
> And I haven't thought deeply about "Streaming SQL" yet, since there's still lots of work (small and large) left for the micro-batch query. So I couldn't answer some questions clearly about "streaming query" and I might be wrong. "Streaming SQL" is still in progress in the Calcite project, and may be subject to change.
>
> Answered inline:
>
> > *1) From reading the Calcite documentation, a monotonic expression is required in GROUP BYs of streaming queries. The most common use case is probably some sort of date/time/timestamp field. This seems extremely limiting to me, especially when streaming off of Kafka where data may not be strictly ordered. How is this handled? Is late data not allowed at all? (quasi-monotonicity is allowed, but still doesn't solve the late data issue)*
>
> First of all, I haven't thought deeply about "streaming query" yet, since there's some work (small and large) left for the micro-batch query.
> In a streaming query, aggregation should be done within a window. As you said, there are late data and out-of-order issues, and watermarks are a best-effort attempt to handle this. It may not solve the late data issue completely, and users need to know about that. Beam and Flink publicize this strongly, since it's their strongest point.
>
> Please read these links if you haven't already. I admit that I couldn't understand them completely, so I'm willing to revisit them if I need to.
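As a sketch of how a streaming query ties aggregation to a window, Calcite's streaming dialect satisfies the monotonic GROUP BY requirement by grouping on a window over the rowtime column. This follows the pattern in the Calcite streaming docs; the Orders stream and its columns are hypothetical:

```sql
-- Tumbling one-hour window: grouping on TUMBLE over the monotonic
-- rowtime column makes the GROUP BY valid for a stream. Rows arriving
-- after the window is closed (beyond the watermark) may be dropped.
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
```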
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> > *2) Monotonicity becomes even more complicated when your dataset is partitioned, again like Kafka. How is this handled?*
>
> I think the same rules apply to a partitioned dataset. It just tries its best to wait for late tuples, and drops late tuples if they don't arrive before the window is discarded.
>
> > *3) Does Calcite have any understanding of partitioned/sharded streams at all?*
>
> I'm not sure about this, but I guess we can include metadata in Storm's own logical relation if needed, and utilize it while constructing the physical plan. We haven't gotten as far as having Storm's own logical relation, and it will take time since it requires a deep understanding of Calcite concepts. If someone can help in this area, that would be great.
>
> > *4) You say here that "aggregation is limited to micro-batch so result is non deterministic." What do you mean here? Trident supports stateful aggregations, not just within the micro-batch.*
>
> It requires every aggregate function to store its column to State independently (please correct me if I'm wrong), whereas we're handling a 'row' and expect to insert all columns in the row together. Yes, it's more limited, but to me it's closer to original SQL behavior. I would like to see more opinions on this.
> Btw, when we change Storm SQL to use windowing rather than micro-batch, aggregation should be done within a window, which is clearer to users and more powerful (for example, 5 min average, 1 min sum, etc.) instead of aggregating over an all-time window.
>
> > *5) In your technical analysis, you say that "flexible data sources" are a pro for Storm SQL, as opposed to SQE. What do you mean?*
>
> You can see that everything for a datasource is abstracted in storm-sql-core, and datasource-specific configurations are all handled on their own side.
> storm-sql-core doesn't even know of the existence of storm-sql-kafka, or of upcoming datasources, and in fact that's how it should be.
>
> > *The work on another high level API is also interesting. Is this meant to completely replace Trident?*
>
> I can't say clearly, but IMO, when a higher-level API can ensure exactly-once, eventually we won't have to rely on Trident. Micro-batch semantics can be replaced with process-time or ingest-time windowing.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Tue, Oct 11, 2016 at 5:08 AM, Morrigan Jones <[email protected]> wrote:
>
> Thank you for this update Jungtaek. I've been meaning to do my own followup to your technical analysis, but unfortunately had some Kafka work eat up a lot of my time. I do have some immediate questions/thoughts, though:
>
> 1) From reading the Calcite documentation, a monotonic expression is required in GROUP BYs of streaming queries. The most common use case is probably some sort of date/time/timestamp field. This seems extremely limiting to me, especially when streaming off of Kafka where data may not be strictly ordered. How is this handled? Is late data not allowed at all? (quasi-monotonicity is allowed, but still doesn't solve the late data issue)
> 2) Monotonicity becomes even more complicated when your dataset is partitioned, again like Kafka. How is this handled?
> 3) Does Calcite have any understanding of partitioned/sharded streams at all?
> 4) You say here that "aggregation is limited to micro-batch so result is non deterministic." What do you mean here? Trident supports stateful aggregations, not just within the micro-batch.
> 5) In your technical analysis, you say that "flexible data sources" are a pro for Storm SQL, as opposed to SQE. What do you mean?
>
> The work on another high level API is also interesting. Is this meant to completely replace Trident?
> On Fri, Oct 7, 2016 at 1:25 AM, Jungtaek Lim <[email protected]> wrote:
>
> > To prevent this discussion from going stale, I'd like to put down a note on how Storm SQL is going now (including the changes after my tech. analysis).
> >
> > 1. There are 6 pull requests open regarding Storm SQL.
> > https://github.com/apache/storm/pulls/HeartSaVioR
> >
> > 2. When STORM-2125 <https://github.com/apache/storm/pull/1714> is merged to master, Storm SQL can handle most of the things that Calcite publishes on its SQL reference page. I wrote Storm SQL's own reference page <https://github.com/HeartSaVioR/storm/blob/43478edbd7369047ccc417d6b81ab6a314910437/docs/storm-sql-reference.md> and submitted it in another pull request (one of the six pull requests).
> >
> > STORM-2125 has a +1 but is waiting for Calcite 1.10.0 to be released in order to apply a bugfix. (Yes, someone could argue that's a bit tightly coupled.) I'm expecting they can cut the 1.10.0 RC1 next week. If that doesn't happen, we can just stick with Calcite 1.9.0 and upgrade to 1.10.0 immediately when it's available.
> >
> > 3. While adding tests I found some bugs on the Calcite side (I left notes in each test and also in the reference doc), which we can contribute back to the Calcite community. I think this could be a positive feedback loop between the Storm and Calcite projects.
> >
> > There's still plenty of room to work on Storm SQL.
> >
> > 1. I haven't had much time for it now, but I would like to learn Calcite and try to optimize Storm SQL via STORM-1446 <https://issues.apache.org/jira/browse/STORM-1446>. If someone who understands Calcite well takes over and contributes to this area, I'd much appreciate it.
> >
> > 2. I think Trident doesn't seem like a good backend API for Storm SQL in the long term. Trident doesn't support typed operations, and join and aggregation are limited to micro-batch, so the result is non-deterministic.
> > I'm waiting for the higher-level API (STORM-1961 <http://issues.apache.org/jira/browse/STORM-1961> is the start) and plan to move the backend API.
> >
> > 3. Storm SQL needs more connectors as data sources. This isn't hard to work on, but it is a bit time-consuming.
> >
> > 4. Storm SQL needs time to stabilize via the experiment loop: experimenting by early adopters, reporting, and fixing bugs.
> >
> > 5. Long-term work: we can expand the supported features of Storm SQL. Joins between a streaming data source and a normal table should help to enrich or filter data with other external data sources. There's also a continuous effort from the Calcite community: 'Streaming SQL' led by Julian, etc.
> >
> > Looking forward to continuing the discussion with the JW Player folks.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > ps. I'd really like to make the revamped Storm SQL available to the community. Could the merge process pause this effort? I hope not, but I can follow the community's decision.
> >
> > On Fri, Sep 30, 2016 at 3:58 AM, P. Taylor Goetz <[email protected]> wrote:
> >
> > FYI, I’ve merged the SQE code and documentation into the sqe_import branch:
> >
> > https://github.com/apache/storm/tree/sqe_import
> >
> > Note the build will fail on the last component (storm-sqe) due to the compilation issues mentioned earlier.
> >
> > -Taylor
> >
> > On Sep 29, 2016, at 11:13 AM, Bobby Evans <[email protected]> wrote:
> >
> > Agreed, or if we can find a way to not break compatibility (like perhaps having a common base class for most of the logic, with one subclass that uses a string and another that uses the byte array). - Bobby
> >
> > On Thursday, September 29, 2016 10:05 AM, Jungtaek Lim <[email protected]> wrote:
> >
> > The change to storm-redis seems to require breaking backward compatibility, so I would love to see another pull request to integrate it via the general review process.
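Item 5 above mentions joining a streaming data source with a normal table to enrich or filter data. In Calcite's streaming dialect, such a join might be sketched as follows (Orders and Products are hypothetical relations, following the stream-to-table join pattern in the Calcite streaming docs):

```sql
-- Enrich a stream by joining it with a regular (non-stream) table.
-- Each order row is matched against the current Products table.
SELECT STREAM o.rowtime, o.productId, p.name
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;
```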
> > If it can be integrated into SQE without changing storm-redis, that would be nice.
> >
> > Does that make sense?
> >
> > - Jungtaek Lim (HeartSaVioR)
> >
> > On Thu, Sep 29, 2016 at 6:08 AM, P. Taylor Goetz <[email protected]> wrote:
> >
> > The changes are available in GitHub, I just overlooked it. And they’re actually neatly contained in two commits:
> >
> > The changes to storm-kafka can be found here:
> > https://github.com/jwplayer/storm/commit/2069c76695a225e4bb8f402c89e572836104755a
> >
> > The changes to storm-redis are here:
> > https://github.com/jwplayer/storm/commit/30d000d3ff673efa8b927d23e554a705fb2928b8
> >
> > The changes to storm-kafka are straightforward and implemented in such a way that they would be useful for use cases outside of SQE. As the commit message states, it adds a new kafka deserialization scheme (FullScheme) that includes the key, value, topic, partition and offset when reading from kafka, which is a feature I can see being valuable for some use cases. I would be +1 for merging that code.
> >
> > The changes to storm-redis are a little different, as Morrigan pointed out, because they only address the Trident API, but IMHO it looks like a good direction.
> >
> > @HeartSaVioR — Would you have some time to take a look at the storm-redis changes and provide your opinion, since you’re one of the original authors of that code?
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 6:28 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > Great!
> >
> > For storm-redis, we might need to modify the key/value mapper to use byte[] rather than String. When I co-authored storm-redis, I forgot to consider the binary format of serde. If we want to address that part, we can address it as well.
> > On Tue, Sep 27, 2016 at 7:19 AM, Morrigan Jones <[email protected]> wrote:
> >
> > Sure, when I can. Storm-kafka should be pretty easy. The storm-redis one will require more work to make it more complete.
> >
> > On Mon, Sep 26, 2016 at 6:09 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > Thanks for the explanation Morrigan!
> >
> > Would you be willing to provide a pull request or patch so the community can review?
> >
> > It sounds like at least some of the changes you mention could be useful to the broader community (beyond the SQL effort).
> >
> > Thanks again,
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 4:40 PM, Morrigan Jones <[email protected]> wrote:
> >
> > storm-kafka - This is needed because storm-kafka does not provide a scheme class that gives you the key, value (payload), partition, and offset. MessageMetadataScheme.java comes closest, but is missing the key. This was a pretty simple change on my part.
> >
> > storm-redis - This is needed for proper support of Redis hashes. The existing storm-redis uses a static string (additionalKey in the RedisDataTypeDescription class) for the field name in hash types. I updated it to use a configurable KeyFactory for both the hash name and the field name. We also added some limited support for set types. This is admittedly the messier of the two jars, since we only cared about the trident states, and it would require a lot more changes to get storm-redis more "feature complete" overall.
> >
> > On Mon, Sep 26, 2016 at 4:03 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > Sounds good. I’ll find out if it builds against 2.x. If so I’ll go that direction. Otherwise I’ll come back with my findings and we can discuss it further.
> > I notice there are jars in the git repo that we obviously can’t import. They look like they might be custom JWPlayer builds of storm-kafka and storm-redis.
> >
> > Morrigan — Do you know if there are any differences there that required custom builds of those components?
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 3:31 PM, Bobby Evans <[email protected]> wrote:
> >
> > Does it compile against 2.X? If so I would prefer to have it go there, and then possibly 1.x if people want it there too. - Bobby
> >
> > On Monday, September 26, 2016 12:47 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > The IP Clearance vote has passed and we are now able to import the SQE code.
> >
> > The question now is where do we want to import the code?
> >
> > My inclination is to import it to “external” in the 1.x branch. It can be ported to other branches as necessary/if desired. An alternative would be to treat it as a feature branch, but I’d rather take the former approach.
> >
> > Thoughts/opinions?
> >
> > -Taylor
> >
> > On Sep 21, 2016, at 8:39 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > My apologies. I meant to cc dev@, but didn't. Will forward in a bit...
> >
> > The vote (lazy consensus) is underway on general@incubator, and will close in less than 72 hours. After that the code can be merged.
> >
> > -Taylor
> >
> > On Sep 21, 2016, at 7:02 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > Hi dev,
> >
> > While the code contribution of SQE is in progress, I would like to continue the discussion of how to merge SQE and Storm SQL.
> >
> > I did an analysis of merging SQE and Storm SQL in both directions: integrating SQE into Storm SQL and vice versa.
> > https://cwiki.apache.org/confluence/display/STORM/Technical+analysis+of+merging+SQE+and+Storm+SQL
> >
> > As I commented on that page, since I've been working on Storm SQL for some weeks I can be (heavily) biased. So I'd really appreciate it if someone could do another analysis.
> >
> > Please feel free to share your thoughts about this analysis, another proposal if you have one, or anything else about the merging.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > --
> > Morrigan Jones
> > Principal Engineer
> > *JW*PLAYER | Your Way to Play
> > [email protected] | jwplayer.com
