Re: [DISCUSS] Plan for merge SQE and Storm SQL

Morrigan Jones Mon, 10 Oct 2016 13:09:17 -0700

Thank you for this update Jungtaek. I've been meaning to do my own followup
to your technical analysis, but unfortunately had some Kafka work eat up a
lot of my time. I do have some immediate questions/thoughts, though:


1) From reading the Calcite documentation, a monotonic expression is
required in GROUP BYs of streaming queries. The most common use case is
probably some sort of date/time/timestamp field. This seems extremely
limiting to me, especially when streaming off of Kafka where data may not
be strictly ordered. How is this handled? Is late data not allowed at all?
(quasi-monotonicity is allowed, but still doesn't solve late data issue)
2) Monotonicty becomes even more complicated when your dataset is
partitioned, again like Kafka. How is this handled?
3) Does Calcite have any understanding of partitioned/sharded streams at
all?
4) You say here that "aggregation is limited to micro-batch so result is
non deterministic." What do you mean here? Trident supports stateful
aggregations, not just within the micro-batch.
5) In your technical analysis, you say that "flexible data sources" are a
pro for Storm SQL, as opposed to SQE. What do you mean?

The work on another high level API is also interesting. Is this meant to
completely replace Trident?


On Fri, Oct 7, 2016 at 1:25 AM, Jungtaek Lim <kabh...@gmail.com> wrote:

> For preventing this discussion to be staled, I'd like to put a note how
> Storm SQL is going now.
> (including the change after my tech. analysis)
>
> 1. There're 6 pull requests opened regarding Storm SQL.
> https://github.com/apache/storm/pulls/HeartSaVioR
>
> 2. When STORM-2125 <https://github.com/apache/storm/pull/1714> is merged
> to
> master, Storm SQL can handle most of things which Calcite publishes to SQL
> reference page. I wrote Storm SQL's own reference page
> <https://github.com/HeartSaVioR/storm/blob/43478edbd7369047ccc417d6b81ab6
> a314910437/docs/storm-sql-reference.md>
> and submitted it to another pull request (one of six pull requests)
>
> STORM-2125 is having +1 but waiting for Calcite 1.10.0 to be released in
> order to apply bugfix. (Yes, someone can state that it's a bit tightly
> coupled.) I'm expecting they can cut 1.10.0 RC1 in next week. If it doesn't
> happen in next week, we can just stick with Calcite 1.9.0 and immediately
> upgrade 1.10.0 when it's available.
>
> 3. During adding tests I found some of bugs on Calcite side (left notes for
> each test and also reference doc), which we can contribute back to Calcite
> community. I think this would be a positive feedback loop between Storm and
> Calcite project.
>
> There're still many rooms available to work on Storm SQL.
>
> 1. I haven't had much time to do now, but would like to learn Calcite and
> try to optimize Storm SQL via STORM-1446
> <https://issues.apache.org/jira/browse/STORM-1446>. If someone who
> understands Calcite well takes over and contributes this area I'd be much
> appreciated.
>
> 2. I think Trident seems not good backend API for Storm SQL for a long
> term.
> Trident doesn't support typed operation, join and aggregation is limited to
> micro-batch so result is non deterministic. I'm waiting for higher-level
> API (STORM-1961 <http://issues.apache.org/jira/browse/STORM-1961> is the
> start) and plan to move the backend API.
>
> 3. Storm SQL needs to have more connectors as data sources. It isn't hard
> to work on, but a bit time-consuming.
>
> 4. Storm SQL needs to have time to stabilize by experiment loop:
> experimenting from early adopters, reporting, and fixing bugs.
>
> 5. Long term work: we can expand supporting features on Storm SQL. Join
> between streaming data source and normal table should help to enrich or
> filter data with other external data source. There's also continuous effort
> from Calcite community: 'Streaming SQL' led by Julian. etc.
>
> Looking forward to continue discussion with JW Player folks.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
> ps.
> I'd really like to make revamped Storm SQL available to the community.
> Do we think merge process can pause this effort? I hope not, but I can
> follow the community's decision.
>
> 2016년 9월 30일 (금) 오전 3:58, P. Taylor Goetz <ptgo...@gmail.com>님이 작성:
>
> FYI, I’ve merged the SQE code and documentation into the sqe_import branch:
>
> https://github.com/apache/storm/tree/sqe_import
>
> Note the build will fail on the last component (storm-sqe) due to the
> compilation issues mentioned earlier.
>
> -Taylor
>
>
> On Sep 29, 2016, at 11:13 AM, Bobby Evans <ev...@yahoo-inc.com.INVALID>
> wrote:
>
> Agreed, or if we can find a way to not break compatibility (like perhaps
> having a common base class for most of the logic and one subclass that uses
> a string while another that uses the byte array).   - Bobby
>
>    On Thursday, September 29, 2016 10:05 AM, Jungtaek Lim <
> kabh...@gmail.com>
> wrote:
>
>
> The change on storm-redis seems to require breaking backward compatibility,
> so I would love to see another pull request to integrate it via general
> review process.
> If it can be integrated to SQE without changing storm-redis, that would be
> nice.
>
> Does it make sense?
>
> - Jungtaek Lim (HeartSaVioR)
>
> 2016년 9월 29일 (목) 오전 6:08, P. Taylor Goetz <ptgo...@gmail.com>님이 작성:
>
> The changes are available in GitHub, I just overlooked it. And they’re
> actually neatly contained in two commits:
>
> The changes to storm-kafka can be found here:
>
>
> https://github.com/jwplayer/storm/commit/2069c76695a225e4bb8f402c89e572
> 836104755a
>
>
> The changes to storm-redis are here:
>
>
> https://github.com/jwplayer/storm/commit/30d000d3ff673efa8b927d23e554a7
> 05fb2928b8
> <
> https://github.com/jwplayer/storm/commit/2069c76695a225e4bb8f402c89e572
> 836104755a
> >
>
>
> The changes to storm-kafka are straightforward and implemented in such a
> way that they would be useful for use cases outside of SQE. As the commit
> message states, it adds a new kafka deserialization scheme (FullScheme)
> that includes the key, value, topic, partition and offset when reading from
> kafka, which is a feature I can see as being valuable for some use cases. I
> would be +1 for merging that code.
>
> The changes to storm-redis are a little different, as Morrigan pointed
> out, because it only addresses the Trident API, but IMHO it looks like a
> good direction.
>
> @HeartSavior — Would you have some time to take a look at the storm-redis
> changes and provide your opinion, since you’re one of the original authors
> of that code?
>
> -Taylor
>
>
> On Sep 26, 2016, at 6:28 PM, Jungtaek Lim <kabh...@gmail.com> wrote:
>
> Great!
>
> For storm-redis, we might need to modify key/value mapper to use byte[]
> rather than String.
> When I co-authored storm-redis, I forgot considering binary format of
> serde. If we want to address that part, we can also address it.
>
> 2016년 9월 27일 (화) 오전 7:19, Morrigan Jones <morri...@jwplayer.com>님이 작성:
>
> Sure, when I can. Storm-kafka should be pretty easy. The storm-redis one
> will require more work to make it more complete.
>
> On Mon, Sep 26, 2016 at 6:09 PM, P. Taylor Goetz <ptgo...@gmail.com>
> wrote:
>
> Thanks for the explanation Morrigan!
>
> Would you be willing to provide a pull request or patch so the community
> can review?
>
> It sounds like at least some of the changes you mention could be useful
>
> to
>
> the broader community (beyond the SQL effort).
>
> Thanks again,
>
> -Taylor
>
> On Sep 26, 2016, at 4:40 PM, Morrigan Jones <morri...@jwplayer.com>
>
> wrote:
>
>
> storm-kafka - This is needed because storm-kafka does not provide a
>
> scheme
>
> class that gives you the key, value (payload), partition, and offset.
> MessageMetadataScheme.java comes comes closest, but is missing the key.
> This was a pretty simple change on my part.
>
> storm-redis - This is needed for proper support of Redis hashes. The
> existing storm-redis uses a static string (additionalKey in
> the RedisDataTypeDescription class) for the field name in hash types. I
> updated it to use a configurable KeyFactory for both the hash name and
>
> the
>
> field name. We also added some limited support for set types. This is
> admittedly the messiest between the two jars since we only cared about
>
> the
>
> trident states and would require a lot more changes to get storm-redis
>
> more
>
> "feature complete" overall.
>
>
>
>
> On Mon, Sep 26, 2016 at 4:03 PM, P. Taylor Goetz <ptgo...@gmail.com>
>
> wrote:
>
>
> Sounds good. I’ll find out if it builds against 2.x. If so I’ll go
>
> that
>
> direction. Otherwise I’ll come back with my findings and we can
>
> discuss
>
> it
>
> further.
>
> I notice there are jars in the git repo that we obviously can’t
>
> import.
>
> They look like they might be custom JWPlayer builds of storm-kafka and
> storm-redis.
>
> Morrigan — Do you know if there is any differences there that required
> custom builds of those components?
>
> -Taylor
>
> On Sep 26, 2016, at 3:31 PM, Bobby Evans
>
> <ev...@yahoo-inc.com.INVALID
>
>
> wrote:
>
> Does it compile against 2.X?  If so I would prefer to have it go
>
> there,
>
> and then possibly 1.x if people what it there too. - Bobby
>
>
>   On Monday, September 26, 2016 12:47 PM, P. Taylor Goetz <
>
> ptgo...@gmail.com> wrote:
>
>
> The IP Clearance vote has passed and we are now able to import the
>
> SQE
>
> code.
>
>
> The question now is to where do we want to import the code?
>
> My inclination is to import it to “external” in the 1.x branch. It
>
> can
>
> be ported to other branches as necessary/if desired. An alternative
>
> would
>
> be to treat it as a feature branch, but I’d rather take the former
>
> approach.
>
>
> Thought/opinions?
>
> -Taylor
>
> On Sep 21, 2016, at 8:39 PM, P. Taylor Goetz <ptgo...@gmail.com>
>
> wrote:
>
>
> My apologies. I meant to cc dev@, but didn't. Will forward in a
>
> bit...
>
>
> The vote (lazy consensus) is underway on general@incubator, and
>
> will
>
> close in less than 72 hours. After that the code can be merged.
>
>
> -Taylor
>
> On Sep 21, 2016, at 7:02 PM, Jungtaek Lim <kabh...@gmail.com>
>
> wrote:
>
>
> Hi dev,
>
> While code contribution of SQE is in progress, I would like to
>
> continue
>
> discussion how to merge SQE and Storm SQL.
>
> I did an analysis of merging SQE and Storm SQL in both side,
>
> integrating
>
> SQE to Storm SQL and vice versa.
> https://cwiki.apache.org/confluence/display/STORM/
>
> Technical+analysis+of+merging+SQE+and+Storm+SQL
>
>
> As I commented to that page, since I'm working on Storm SQL for
>
> some
>
> weeks
>
> I can be (heavily) biased. So I'd really appreciated if someone can
>
> do
>
> another analysis.
>
> Please feel free to share your thought about this analysis, or
>
> another
>
> proposal if you have any, or other things about the merging.
>
> Thanks,
> Jungtaek Lim (HeartSaVioR)
>
>
>
> --
> Morrigan Jones
> Principal Engineer
> *JW*PLAYER  |  Your Way to Play
> morri...@jwplayer.com  |  jwplayer.com
>
>
>
>
>
> --
> Morrigan Jones
> Principal Engineer
> *JW*PLAYER  |  Your Way to Play
> morri...@jwplayer.com  |  jwplayer.com
>



-- 
Morrigan Jones
Principal Engineer
*JW*PLAYER  |  Your Way to Play
morri...@jwplayer.com  |  jwplayer.com

Re: [DISCUSS] Plan for merge SQE and Storm SQL

Reply via email to