One more: Flink seems to have made a lot of progress on SQL (still experimental), but they support a more limited feature set for streaming SQL. https://ci.apache.org/projects/flink/flink-docs-master/dev/table_api.html#limitations-1 SELECT, FROM, WHERE, and UNION clauses are all that Flink supports for streaming SQL. I guess they're waiting for the "Streaming SQL" specification to stabilize, since they don't have micro-batch, so windows in "Streaming SQL" are mandatory to implement aggregation, join, etc.
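For illustration, a streaming query that stays within those limits (projection and filtering only, no windowed aggregation or join) might look like the sketch below; the Orders stream and its columns are hypothetical names, not taken from the Flink docs:

```sql
-- Within Flink's supported streaming subset: SELECT / FROM / WHERE.
-- (Orders, productId, and units are hypothetical names.)
SELECT productId, units
FROM Orders
WHERE units > 10;

-- GROUP BY, HAVING, and JOIN are not in the supported clause list
-- for streaming tables at this point.
```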
- Jungtaek Lim (HeartSaVioR)

On Tue, Oct 11, 2016 at 6:36 AM, Jungtaek Lim <[email protected]> wrote:

> Hi Morrigan,
>
> Since we're building a SQL feature, we need to respect the SQL standard, its behavior, and common sense, instead of changing its behavior to fit Storm's streaming fashion. Some limitations are actually unavoidable.
> And I haven't thought deeply about "Streaming SQL" yet, since there's still lots of work (small and large) left for the micro-batch query. So I couldn't answer some questions clearly about "streaming query" and I might be wrong. "Streaming SQL" is still in progress in the Calcite project, and may be subject to change.
>
> Answered inline:
>
> > *1) From reading the Calcite documentation, a monotonic expression is required in GROUP BYs of streaming queries. The most common use case is probably some sort of date/time/timestamp field. This seems extremely limiting to me, especially when streaming off of Kafka where data may not be strictly ordered. How is this handled? Is late data not allowed at all? (quasi-monotonicity is allowed, but still doesn't solve the late data issue)*
>
> First of all, I haven't thought deeply about "streaming query" yet, since there's some work (small and large) left for the micro-batch query.
> In a streaming query, aggregation should be done within a window. As you said, there are late data and out-of-order issues, and watermarks are a best-effort attempt to handle this. It may not solve the late data issue completely, and users need to know about that. Beam and Flink publicize this strongly, since it's their strongest point.
>
> Please read these links if you haven't already. I admit that I couldn't understand them completely, so I'm willing to revisit them if I need to.
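As a sketch of how a streaming query ties aggregation to a window, Calcite's streaming dialect satisfies the monotonic GROUP BY requirement by grouping on a window over the rowtime column. This follows the pattern in the Calcite streaming docs; the Orders stream and its columns are hypothetical:

```sql
-- Tumbling one-hour window: grouping on TUMBLE over the monotonic
-- rowtime column makes the GROUP BY valid for a stream. Rows arriving
-- after the window is closed (beyond the watermark) may be dropped.
SELECT STREAM
  TUMBLE_END(rowtime, INTERVAL '1' HOUR) AS rowtime,
  productId,
  COUNT(*) AS c
FROM Orders
GROUP BY TUMBLE(rowtime, INTERVAL '1' HOUR), productId;
```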
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-101
> https://www.oreilly.com/ideas/the-world-beyond-batch-streaming-102
>
> > *2) Monotonicity becomes even more complicated when your dataset is partitioned, again like Kafka. How is this handled?*
>
> I think the same rules apply to a partitioned dataset. It just tries its best to wait for late tuples, and drops late tuples if they don't arrive before the window is discarded.
>
> > *3) Does Calcite have any understanding of partitioned/sharded streams at all?*
>
> I'm not sure about this, but I guess we can include metadata in Storm's own logical relation if needed, and utilize it while constructing the physical plan. We haven't gotten as far as having Storm's own logical relation, and it will take time since it requires a deep understanding of Calcite concepts. If someone can help in this area, that would be great.
>
> > *4) You say here that "aggregation is limited to micro-batch so result is non deterministic." What do you mean here? Trident supports stateful aggregations, not just within the micro-batch.*
>
> It requires every aggregate function to store its column to State independently (please correct me if I'm wrong), whereas we're handling a 'row' and expect to insert all columns in the row together. Yes, it's more limited, but to me it's closer to original SQL behavior. I would like to see more opinions on this.
> Btw, when we change Storm SQL to use windowing rather than micro-batch, aggregation should be done within a window, which is clearer to users and more powerful (for example, 5 min average, 1 min sum, etc.) instead of aggregating over an all-time window.
>
> > *5) In your technical analysis, you say that "flexible data sources" are a pro for Storm SQL, as opposed to SQE. What do you mean?*
>
> You can see that everything for a datasource is abstracted in storm-sql-core, and datasource-specific configurations are all handled on their own side.
> storm-sql-core doesn't even know of the existence of storm-sql-kafka, or of upcoming datasources, and in fact that's how it should be.
>
> > *The work on another high level API is also interesting. Is this meant to completely replace Trident?*
>
> I can't say clearly, but IMO, when a higher-level API can ensure exactly-once, eventually we won't have to rely on Trident. Micro-batch semantics can be replaced with process-time or ingest-time windowing.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> On Tue, Oct 11, 2016 at 5:08 AM, Morrigan Jones <[email protected]> wrote:
>
> Thank you for this update Jungtaek. I've been meaning to do my own followup to your technical analysis, but unfortunately had some Kafka work eat up a lot of my time. I do have some immediate questions/thoughts, though:
>
> 1) From reading the Calcite documentation, a monotonic expression is required in GROUP BYs of streaming queries. The most common use case is probably some sort of date/time/timestamp field. This seems extremely limiting to me, especially when streaming off of Kafka where data may not be strictly ordered. How is this handled? Is late data not allowed at all? (quasi-monotonicity is allowed, but still doesn't solve the late data issue)
> 2) Monotonicity becomes even more complicated when your dataset is partitioned, again like Kafka. How is this handled?
> 3) Does Calcite have any understanding of partitioned/sharded streams at all?
> 4) You say here that "aggregation is limited to micro-batch so result is non deterministic." What do you mean here? Trident supports stateful aggregations, not just within the micro-batch.
> 5) In your technical analysis, you say that "flexible data sources" are a pro for Storm SQL, as opposed to SQE. What do you mean?
>
> The work on another high level API is also interesting. Is this meant to completely replace Trident?
> On Fri, Oct 7, 2016 at 1:25 AM, Jungtaek Lim <[email protected]> wrote:
>
> > To prevent this discussion from going stale, I'd like to put down a note on how Storm SQL is going now (including the changes after my tech. analysis).
> >
> > 1. There are 6 pull requests open regarding Storm SQL.
> > https://github.com/apache/storm/pulls/HeartSaVioR
> >
> > 2. When STORM-2125 <https://github.com/apache/storm/pull/1714> is merged to master, Storm SQL can handle most of the things that Calcite publishes on its SQL reference page. I wrote Storm SQL's own reference page <https://github.com/HeartSaVioR/storm/blob/43478edbd7369047ccc417d6b81ab6a314910437/docs/storm-sql-reference.md> and submitted it in another pull request (one of the six pull requests).
> >
> > STORM-2125 has a +1 but is waiting for Calcite 1.10.0 to be released in order to apply a bugfix. (Yes, someone could argue that's a bit tightly coupled.) I'm expecting they can cut the 1.10.0 RC1 next week. If that doesn't happen, we can just stick with Calcite 1.9.0 and upgrade to 1.10.0 immediately when it's available.
> >
> > 3. While adding tests I found some bugs on the Calcite side (I left notes in each test and also in the reference doc), which we can contribute back to the Calcite community. I think this could be a positive feedback loop between the Storm and Calcite projects.
> >
> > There's still plenty of room to work on Storm SQL.
> >
> > 1. I haven't had much time for it now, but I would like to learn Calcite and try to optimize Storm SQL via STORM-1446 <https://issues.apache.org/jira/browse/STORM-1446>. If someone who understands Calcite well takes over and contributes to this area, I'd much appreciate it.
> >
> > 2. I think Trident doesn't seem like a good backend API for Storm SQL in the long term. Trident doesn't support typed operations, and join and aggregation are limited to micro-batch, so the result is non-deterministic.
> > I'm waiting for the higher-level API (STORM-1961 <http://issues.apache.org/jira/browse/STORM-1961> is the start) and plan to move the backend API.
> >
> > 3. Storm SQL needs more connectors as data sources. This isn't hard to work on, but it is a bit time-consuming.
> >
> > 4. Storm SQL needs time to stabilize via the experiment loop: experimenting by early adopters, reporting, and fixing bugs.
> >
> > 5. Long-term work: we can expand the supported features of Storm SQL. Joins between a streaming data source and a normal table should help to enrich or filter data with other external data sources. There's also a continuous effort from the Calcite community: 'Streaming SQL' led by Julian, etc.
> >
> > Looking forward to continuing the discussion with the JW Player folks.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > ps. I'd really like to make the revamped Storm SQL available to the community. Could the merge process pause this effort? I hope not, but I can follow the community's decision.
> >
> > On Fri, Sep 30, 2016 at 3:58 AM, P. Taylor Goetz <[email protected]> wrote:
> >
> > FYI, I’ve merged the SQE code and documentation into the sqe_import branch:
> >
> > https://github.com/apache/storm/tree/sqe_import
> >
> > Note the build will fail on the last component (storm-sqe) due to the compilation issues mentioned earlier.
> >
> > -Taylor
> >
> > On Sep 29, 2016, at 11:13 AM, Bobby Evans <[email protected]> wrote:
> >
> > Agreed, or if we can find a way to not break compatibility (like perhaps having a common base class for most of the logic, with one subclass that uses a string and another that uses the byte array). - Bobby
> >
> > On Thursday, September 29, 2016 10:05 AM, Jungtaek Lim <[email protected]> wrote:
> >
> > The change to storm-redis seems to require breaking backward compatibility, so I would love to see another pull request to integrate it via the general review process.
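Item 5 above mentions joining a streaming data source with a normal table to enrich or filter data. In Calcite's streaming dialect, such a join might be sketched as follows (Orders and Products are hypothetical relations, following the stream-to-table join pattern in the Calcite streaming docs):

```sql
-- Enrich a stream by joining it with a regular (non-stream) table.
-- Each order row is matched against the current Products table.
SELECT STREAM o.rowtime, o.productId, p.name
FROM Orders AS o
JOIN Products AS p
  ON o.productId = p.productId;
```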
> > If it can be integrated into SQE without changing storm-redis, that would be nice.
> >
> > Does that make sense?
> >
> > - Jungtaek Lim (HeartSaVioR)
> >
> > On Thu, Sep 29, 2016 at 6:08 AM, P. Taylor Goetz <[email protected]> wrote:
> >
> > The changes are available in GitHub, I just overlooked it. And they’re actually neatly contained in two commits:
> >
> > The changes to storm-kafka can be found here:
> > https://github.com/jwplayer/storm/commit/2069c76695a225e4bb8f402c89e572836104755a
> >
> > The changes to storm-redis are here:
> > https://github.com/jwplayer/storm/commit/30d000d3ff673efa8b927d23e554a705fb2928b8
> >
> > The changes to storm-kafka are straightforward and implemented in such a way that they would be useful for use cases outside of SQE. As the commit message states, it adds a new kafka deserialization scheme (FullScheme) that includes the key, value, topic, partition and offset when reading from kafka, which is a feature I can see being valuable for some use cases. I would be +1 for merging that code.
> >
> > The changes to storm-redis are a little different, as Morrigan pointed out, because they only address the Trident API, but IMHO it looks like a good direction.
> >
> > @HeartSaVioR — Would you have some time to take a look at the storm-redis changes and provide your opinion, since you’re one of the original authors of that code?
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 6:28 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > Great!
> >
> > For storm-redis, we might need to modify the key/value mapper to use byte[] rather than String. When I co-authored storm-redis, I forgot to consider the binary format of serde. If we want to address that part, we can address it as well.
> > On Tue, Sep 27, 2016 at 7:19 AM, Morrigan Jones <[email protected]> wrote:
> >
> > Sure, when I can. Storm-kafka should be pretty easy. The storm-redis one will require more work to make it more complete.
> >
> > On Mon, Sep 26, 2016 at 6:09 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > Thanks for the explanation Morrigan!
> >
> > Would you be willing to provide a pull request or patch so the community can review?
> >
> > It sounds like at least some of the changes you mention could be useful to the broader community (beyond the SQL effort).
> >
> > Thanks again,
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 4:40 PM, Morrigan Jones <[email protected]> wrote:
> >
> > storm-kafka - This is needed because storm-kafka does not provide a scheme class that gives you the key, value (payload), partition, and offset. MessageMetadataScheme.java comes closest, but is missing the key. This was a pretty simple change on my part.
> >
> > storm-redis - This is needed for proper support of Redis hashes. The existing storm-redis uses a static string (additionalKey in the RedisDataTypeDescription class) for the field name in hash types. I updated it to use a configurable KeyFactory for both the hash name and the field name. We also added some limited support for set types. This is admittedly the messier of the two jars, since we only cared about the trident states, and it would require a lot more changes to get storm-redis more "feature complete" overall.
> >
> > On Mon, Sep 26, 2016 at 4:03 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > Sounds good. I’ll find out if it builds against 2.x. If so I’ll go that direction. Otherwise I’ll come back with my findings and we can discuss it further.
> > I notice there are jars in the git repo that we obviously can’t import. They look like they might be custom JWPlayer builds of storm-kafka and storm-redis.
> >
> > Morrigan — Do you know if there are any differences there that required custom builds of those components?
> >
> > -Taylor
> >
> > On Sep 26, 2016, at 3:31 PM, Bobby Evans <[email protected]> wrote:
> >
> > Does it compile against 2.X? If so I would prefer to have it go there, and then possibly 1.x if people want it there too. - Bobby
> >
> > On Monday, September 26, 2016 12:47 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > The IP Clearance vote has passed and we are now able to import the SQE code.
> >
> > The question now is where do we want to import the code?
> >
> > My inclination is to import it to “external” in the 1.x branch. It can be ported to other branches as necessary/if desired. An alternative would be to treat it as a feature branch, but I’d rather take the former approach.
> >
> > Thoughts/opinions?
> >
> > -Taylor
> >
> > On Sep 21, 2016, at 8:39 PM, P. Taylor Goetz <[email protected]> wrote:
> >
> > My apologies. I meant to cc dev@, but didn't. Will forward in a bit...
> >
> > The vote (lazy consensus) is underway on general@incubator, and will close in less than 72 hours. After that the code can be merged.
> >
> > -Taylor
> >
> > On Sep 21, 2016, at 7:02 PM, Jungtaek Lim <[email protected]> wrote:
> >
> > Hi dev,
> >
> > While the code contribution of SQE is in progress, I would like to continue the discussion of how to merge SQE and Storm SQL.
> >
> > I did an analysis of merging SQE and Storm SQL in both directions: integrating SQE into Storm SQL and vice versa.
> > https://cwiki.apache.org/confluence/display/STORM/Technical+analysis+of+merging+SQE+and+Storm+SQL
> >
> > As I commented on that page, since I've been working on Storm SQL for some weeks I can be (heavily) biased. So I'd really appreciate it if someone could do another analysis.
> >
> > Please feel free to share your thoughts about this analysis, another proposal if you have one, or anything else about the merging.
> >
> > Thanks,
> > Jungtaek Lim (HeartSaVioR)
> >
> > --
> > Morrigan Jones
> > Principal Engineer
> > *JW*PLAYER | Your Way to Play
> > [email protected] | jwplayer.com
