I'll send an update on schemas soon. But the tl;dr is that by the end of this month, I expect it to be generally usable across a variety of input formats.
Reuven

On Wed, Dec 12, 2018 at 9:38 AM Xinyu Liu <xinyuliu...@gmail.com> wrote:

> Agree with Kenn on this. From our SamzaRunner point of view, we would like Beam SQL to be self-contained and flexible enough for our users to use it in different scenarios, e.g. pure SQL and embedded in different SDKs. We are also extremely interested in the DataFrame-like API mentioned above. To digress a little bit from this topic, this is actually the current hurdle to letting our users try it out in Hadoop, since they expect this kind of API with columnar data set IO support, e.g. ORC. If there are any more details about the status of the DF API and columnar support, I will be very happy to learn more about it.
>
> Thanks,
> Xinyu
>
> On Wed, Dec 12, 2018 at 8:55 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi all,
>>
>> after letting this sink in for a while, I'd like to summarize the feedback and emphasize some questions that appeared:
>>
>> a) there were several 'it makes sense' opinions
>>
>> b) there was one 'not right now' - which makes sense, but the purpose of this discussion was to try to first answer the what and then the when :-)
>>
>> c) there were several 'maybe, but':
>>
>> i) it would be more complicated to code SQL against user-facing API, because that way, each change needed by SQL would have to be first implemented in this user-friendly API layer
>>
>> I can absolutely agree with this, it would definitely be more complicated and more work. I see basically two ways out. The first one would be to move all the code from Euphoria into something similar to the Join library, and let Euphoria be just the user-friendly layer on top of this library (basically just the builders). That way, we could reuse the code and be pretty much sure that the implementations of SQL transforms are identical to what Euphoria would offer, which is one of the goals of this discussion. The drawback would be that there would be no guarantees that what this underlying library would offer would also be accessible from Euphoria - that is because the complexity would not disappear, it would just be moved onto a different component - each newly added feature of the shared library would have to be made accessible in Euphoria. The other way around would be to accept this added complexity in favor of making sure that every feature that is needed by SQL is also available in Euphoria, because the user-facing API would be used by SQL itself. I'd really like to hear further community opinions on the pros and cons of these two (or maybe I'm overlooking something and there is a third way).
>>
>> ii) in some cases, we might want to support relational operators in SDK harness for performance, and we don't want to close doors for this
>>
>> Again, the motivation of this seems to be clear and valid, but the question that arises is - under certain conditions (something like having a schema-aware PCollection), would we want to enable code reuse between logic written in SQL and Euphoria to ensure consistent behavior? That would probably mean that Euphoria would have to make use of the provided schema of the PCollection and switch to a different behavior on API level (more DataFrame-like) and/or different operators created and passed to the SDK harness. This feature is currently absolutely missing, but seems to be plausible and maybe there could be benefits for both sides.
>>
>> Many thanks for any more opinions on this.
>>
>> Jan
>>
>> On 12/4/18 11:32 PM, Rui Wang wrote:
>>
>> For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and JDBC driver should be the way to interact with Beam via SQL.
>>
>> For the embedded SQL use case in all SDKs (Python, Go, etc.), even assuming there are relational algebra operators defined in SDKs, each SDK would still have to implement its own way to parse SQL into operators (SQL is just a string). To avoid that overhead, I would imagine that SDKs should keep SQL queries and wait for a later but shared processing (I don't know if Portability should handle SQL or if it could).
>>
>> -Rui
>>
>> On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Kenn,
>>>
>>> my intent really was not to propose any changes right now. I'm trying to create a clear understanding about what the relation between Euphoria and SQL should be in the long run. In my point of view, Euphoria should always be a superset of SQL, because it should support complete relational algebra (and I'm not saying it does so right now, it should just be our goal) plus more flexible UDFs (not limited to SQL standard) and stateful processing (which will probably not be part of SQL any time soon). There should be some sort of guarantees that the semantics of SQL and Euphoria are the same, because that is what users would expect. This can for sure be ensured by introducing another layer between Euphoria and core SDK (e.g. the join library), but the question is - what makes this solution different from creating this shared library from Euphoria itself (when looking at the big picture)? And it is not only about implementations of joins or any other operators, but there are other techniques that could be beneficial for SQL - e.g. pipeline sampling, automatic pipeline optimizations based on statistics from previous runs of batch queries, etc.
>>>
>>> The other way - that relational algebra nodes will become an essential part of (some) SDK - is equivalent to actually creating a SQL SDK, am I right? I understand that this approach can bring performance benefits, but besides that - is the language which implements SQL really important for users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? What would the resulting SQL query look like? If it is about allowing using SQL from all other SDKs (I want to do some basic preprocessing using SQL and then optimize some hard part in my favorite SDK) - can this be solved by enabling SQL in all SDKs by mixing various SDK harnesses in a single pipeline instead (e.g. I want to use SQL in Go SDK, I just tell the portable layer to run these operators using Java SDK and these using Go)? That seems plausible, solving interoperability issues, while leaving the whole implementation of SQL as an internal detail. Generally this solves more issues, like the ability to reuse IOs in all SDKs (I'm aware that there are caveats, but that is beyond the scope of the intended discussion topic of this thread).
>>>
>>> Jan
>>>
>>> On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>>>
>>> To be honest, I don't think there's much worth doing right now. I think more self-contained is better for Beam SQL, generally. Two things I have on my mind are (1) SQL as an inline transform in every SDK and (2) supporting pure SQL like the CLI and JDBC driver, where the underlying language is an implementation detail.
>>>
>>> Big picture / long term, I would envision pure SQL, embedded SQL transform, and a DataFrame-like API in ~each SDK all desugaring to relational algebra nodes, sharing an optimizer, sharing some amount of mapping the physical plan to Beam transforms. The necessarily SDK-specific parts are the embedded transform API and UDFs in the host language. The rest should remain an implementation detail that we can change.
>>>
>>> - For example, it is easy to imagine a customized columnar element/bundle encoding and SDK harness that only works for SQL to remove the overhead of being general purpose. It could be written in C/C++/Go if we wanted to squeeze it for perf. Such things are made harder by having an elaborate end-user API between SQL and the core Beam model.
>>> - Conversely, for whatever is chosen to underlie SQL's execution, stability is paramount. Ideally the simplest and least likely to change transforms would be the foundation. And I wouldn't want to have to design a user-friendly API for Euphoria or the join library just to enable a different join algorithm in SQL.
>>>
>>> So my take is keep SQL flexible, implement SQL on low-level and stable APIs, use join library, Euphoria, etc, if it looks like a big win, but don't build any policy here or do big refactors right now.
>>>
>>> Kenn
>>>
>>> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> currently there is no actual proposal, I was just trying to gather feedback from the community. But my original thoughts would be [1]. I actually don't see much need for restructuring the code by nesting directories. If the community sees that it would make sense to structure the dependencies, the second step would probably be to figure out how to accomplish this. I don't have any exact solution in mind so far, it would probably be necessary to first identify features that are needed by SQL and not supported by Euphoria currently. Then we can actually identify costs and see if this still makes sense.
>>>>
>>>> Jan
>>>>
>>>> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>>>> > Taking a step back, what exactly is the proposal?
>>>> > Looking at the original message, I see
>>>> >
>>>> > (1) Letting SQL take a dependency on Euphoria, sharing more code and taking advantage of the logical nesting of levels of abstraction. This makes sense to me.
>>>> > (2) Nesting the directories (but not the gradle targets or module names?). Here I'm not so sure about the benefit, especially vs. the cost.
>>>> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >> I think that the fact that SQL uses some other internal dependency should remain a hidden implementation detail. I absolutely agree that the dependency should of course remain sdks-java-sql in all cases.
>>>> >>
>>>> >> Jan
>>>> >>
>>>> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>>> >>> I suppose what I'm trying to say is that I see this module structure as a tool for discoverability and enumerating end-user endpoints. In other words, if one wants to use SQL, it would seem odd to have to depend on sdks-java-euphoria-sql rather than just sdks-java-sql if sdks-java-euphoria is also a DSL one might use. A sibling relationship does not prohibit the layered approach to implementation that sounds like it makes sense.
>>>> >>>
>>>> >>> (As for merging Euphoria into core, my initial impression is that's probably a good idea, and something we should consider for 3.0 at the very least.)
>>>> >>>
>>>> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>> Hi Rui,
>>>> >>>>
>>>> >>>> yes, there are optimizations that could be added by each layer. The purpose of the Euphoria layer actually is not to reorder or modify any user operators that are present in the pipeline (because it might not have enough information to do this), but it can for instance choose between various join implementations (shuffle join, broadcast join, ...) - so the optimizations it can do are more low level. But this plays nicely with the DSL hierarchy - each layer adds a few more restrictions, but can therefore do more optimizations. And I think that the layer between SDK and SQL wouldn't have to support SQL optimizations, it would only have to support a way for SQL to express these optimizations.
>>>> >>>>
>>>> >>>> Jan
>>>> >>>>
>>>> >>>> ---------- Original e-mail ----------
>>>> >>>> From: Rui Wang <ruw...@google.com>
>>>> >>>> To: dev@beam.apache.org
>>>> >>>> Date: 30. 11. 2018 22:43:04
>>>> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>> >>>>
>>>> >>>> SQL's optimization is another area to consider for integration. SQL optimization includes pushing down filters/projections, merging or removing or swapping plan nodes and comparing plan costs to choose the best plan. Adding another layer between SQL and Java core might require that layer to support SQL optimizations if there is a need.
>>>> >>>>
>>>> >>>> I don't have a clear picture of what SQL needs from Euphoria for optimization (best case is nothing). As those optimizations are happening or will happen, we might start to have a sense of it.
>>>> >>>>
>>>> >>>> -Rui
>>>> >>>>
>>>> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <rober...@google.com> wrote:
>>>> >>>>
>>>> >>>> I don't really see Euphoria as a subset of SQL or the other way around, and I think it makes sense to use either without the other, so by this criterion, keeping them as siblings rather than nesting makes sense.
>>>> >>>>
>>>> >>>> That said, I think it's really good to have a bunch of shared code, e.g. a join library that could be used by both. One could even depend on the other without having to abandon the sibling relationship. Something like retractions belongs in the core SDK itself. Deeper than that, actually, it should be part of the model.
>>>> >>>>
>>>> >>>> - Robert
>>>> >>>>
>>>> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <d...@apache.org> wrote:
>>>> >>>>> Jan, we made Kryo optional recently (it is a separate module and is used only in tests). From a quick look it seems that we forgot to remove the compile time dependency from euphoria's build.gradle. The only "strong" dependencies I'm aware of are core SDK and guava. We'll probably be adding a sketching extension dependency soon.
>>>> >>>>>
>>>> >>>>> D.
>>>> >>>>>
>>>> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>>>> Hi Anton,
>>>> >>>>>> reactions inline.
>>>> >>>>>>
>>>> >>>>>> ---------- Original e-mail ----------
>>>> >>>>>> From: Anton Kedin <ke...@google.com>
>>>> >>>>>> To: dev@beam.apache.org
>>>> >>>>>> Date: 30. 11. 2018 18:17:06
>>>> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>> >>>>>>
>>>> >>>>>> I think this approach makes sense in general, Euphoria can be the implementation detail of SQL, similar to Join Library or core SDK Schemas.
>>>> >>>>>>
>>>> >>>>>> I wonder though whether it would be better to bring Euphoria closer to core SDK first, maybe even merge them together. If you look at Reuven's recent work around schemas it seems like there are already similarities between that and Euphoria's approach, unless I'm missing the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're already switching parts of SQL to those transforms (e.g. SQL Aggregation is now implemented by core SDK's Group[3]).
>>>> >>>>>>
>>>> >>>>>> Yes, these transforms seem to be very similar to those Euphoria has. Whether or not to merge Euphoria with core is essentially just a decision of the community (in my point of view).
>>>> >>>>>>
>>>> >>>>>> Adding explicit Schema support to Euphoria will bring it both closer to core SDK and make it natural to use for SQL. Can this be a first step towards this integration?
>>>> >>>>>>
>>>> >>>>>> Euphoria currently operates on pure PCollections, so when a PCollection has a schema, it will be accessible by Euphoria. It makes sense to make use of the schema in Euphoria - it seems natural on inputs to Euphoria operators, but it might be tricky (not saying impossible) to actually produce schema-aware PCollections as outputs from Euphoria operators (generally speaking, in special cases that might be possible). Regarding inputs, there is actually an intention to act on the type of the PCollection - e.g. when a PCollection is already of type KV, then it is possible to make the key extractor and value extractor optional in Euphoria builders, so it feels natural to enable changing the builders when given a schema-aware PCollection, and to make use of the provided schema. The rest of the Euphoria team might correct me if I'm wrong.
>>>> >>>>>>
>>>> >>>>>> One question I have is, does Euphoria bring dependencies that are not needed by SQL, or does it more or less rely only on the core SDK?
>>>> >>>>>>
>>>> >>>>>> I think the only relevant dependency that Euphoria has besides core SDK is Kryo. It is the default coder when no coder is provided, but that could be made optional - e.g. the default coder would be supported only if an appropriate module were available. That way I think that Euphoria has no special dependencies.
>>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> [1] >>>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73 >>>> >>>>>> [2] >>>> https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms >>>> >>>>>> [3] >>>> https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179 >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> >>>> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> >>>> wrote: >>>> >>>>>> >>>> >>>>>> Hi community, >>>> >>>>>> >>>> >>>>>> I'm part of Euphoria DSL team, and on behalf of this team, I'd >>>> like to >>>> >>>>>> discuss possible development of Java based DSLs currently >>>> present in >>>> >>>>>> Beam. In my knowledge, there are currently two DSLs based on >>>> Java SDK - >>>> >>>>>> Euphoria and SQL. These DSLs currently share only the SDK itself, >>>> >>>>>> although there might be room to share some more effort. We >>>> already know >>>> >>>>>> that both Euphoria and SQL have need for retractions, but there >>>> are >>>> >>>>>> probably many more features that these two could share. >>>> >>>>>> >>>> >>>>>> So, I'd like to open a discussion on what it would cost and what >>>> it >>>> >>>>>> would possibly bring, if instead of the current structure >>>> >>>>>> >>>> >>>>>> Java SDK >>>> >>>>>> >>>> >>>>>> | ---- SQL >>>> >>>>>> >>>> >>>>>> | ---- Euphoria >>>> >>>>>> >>>> >>>>>> these DSLs would be structured as >>>> >>>>>> >>>> >>>>>> Java SDK ---> Euphoria ---> SQL >>>> >>>>>> >>>> >>>>>> I'm absolutely sure that this would be a great investment and a >>>> huge >>>> >>>>>> change, but I'd like to gather some opinions and general >>>> feelings of the >>>> >>>>>> community about this. 
Some points to start the discussion from >>>> my side >>>> >>>>>> would be, that structuring DSLs like this has internal logical >>>> >>>>>> consistency, because each API layer further narrows >>>> completeness, but >>>> >>>>>> brings simpler API for simpler tasks, while adding additional >>>> high-level >>>> >>>>>> view of the data processing pipeline and thus enabling more >>>> >>>>>> optimizations. On Euphoria side, these are various >>>> implementations joins >>>> >>>>>> (most effective implementation depends on data), pipeline >>>> sampling and >>>> >>>>>> more. Some (or maybe most) of these optimizations would have to >>>> be >>>> >>>>>> implemented in both DSLs, so implementing them once is >>>> beneficial. >>>> >>>>>> Another benefit is that this would bring Euphoria "closer" to >>>> Beam core >>>> >>>>>> development (which would be good, it is part of the project >>>> anyway, >>>> >>>>>> right? :)) and help better drive features, that although >>>> currently >>>> >>>>>> needed mostly by SQL, might be needed by other Java users anyway. >>>> >>>>>> >>>> >>>>>> Thanks for discussion and looking forward to any opinions. >>>> >>>>>> >>>> >>>>>> Jan >>>> >>>>>> >>>> >>>