Re: [DISCUSS] Structuring Java based DSLs

Reuven Lax Wed, 12 Dec 2018 15:06:57 -0800

I'll send an update on schemas soon. But the tl;dr is that by the end of
this month, I expect it to be generally usable across a variety of input
formats.


Reuven

On Wed, Dec 12, 2018 at 9:38 AM Xinyu Liu <xinyuliu...@gmail.com> wrote:

> Agree with Kenn on this. From our SamzaRunner point of view, we would like
> Beam SQL to be self-contained and flexible enough for our users to use it
> in different scenarios, e.g. pure SQL and embedded in different SDKs. We are
> also extremely interested in the DataFrame-like API mentioned above. To
> digress a little from this topic, this is actually the current hurdle
> to letting our users try it out on Hadoop, since they expect this kind of
> API with columnar data set IO support, e.g. ORC. If there are any more
> details about the status of the DF API and columnar support, I will be very
> happy to learn more about it.
>
> Thanks,
> Xinyu
>
> On Wed, Dec 12, 2018 at 8:55 AM Jan Lukavský <je...@seznam.cz> wrote:
>
>> Hi all,
>>
>> after letting this sink in for a while, I'd like to summarize the feedback
>> and emphasize some questions that appeared:
>>
>>  a) there were several 'it makes sense' opinions
>>
>>  b) there was one 'not right now' - which makes sense, but the purpose of
>> this discussion was to try to first answer the what and then the when :-)
>>
>>  c) there were several 'maybe, but':
>>
>>   i) it would be more complicated to code SQL against a user-facing API,
>> because that way, each change needed by SQL would first have to be
>> implemented in this user-friendly API layer
>>
>>      I absolutely agree with this; it would definitely be more
>> complicated and more work. I see basically two ways out. The first
>> would be to move all the code from Euphoria into something similar to the
>> Join library, and let Euphoria be just the user-friendly layer on top of
>> this library (basically just the builders). That way, we could reuse the
>> code and be pretty much sure that the implementations of SQL transforms are
>> identical to what Euphoria would offer, which is one of the goals of this
>> discussion. The drawback would be that there would be no guarantee that
>> what this underlying library offers would also be accessible from
>> Euphoria. The complexity would not disappear, it would just be moved onto
>> a different component, because each feature newly added to the shared
>> library would have to be made accessible in Euphoria. The other way
>> would be to accept this added complexity in favor of making sure that
>> every feature needed by SQL is also available in Euphoria, because
>> the user-facing API would be used by SQL itself. I'd really like to hear
>> further community opinions on the pros and cons of these two (or maybe I'm
>> overlooking something and there is a third way).
>>
>>  ii) in some cases, we might want to support relational operators in the SDK
>> harness for performance, and we don't want to close the door on this
>>
>>      Again, the motivation for this seems clear and valid, but the
>> question that arises is - under some conditions (say, when we have a
>> schema-aware PCollection), would we want to enable code reuse between logic
>> written in SQL and Euphoria to ensure consistent behavior? That would
>> probably mean that Euphoria would have to make use of the schema provided
>> by the PCollection and switch to a different behavior on the API level (more
>> DataFrame-like) and/or create different operators and pass them to the SDK
>> harness. This feature is currently completely missing, but it seems
>> plausible and maybe there could be benefits for both sides.
>>
>> Many thanks for any more opinions on this.
>>
>>  Jan
>>
>>
>> On 12/4/18 11:32 PM, Rui Wang wrote:
>>
>> For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and
>> JDBC driver should be the way to interact with Beam through SQL.
>>
>>
>> For the embedded SQL use case in all SDKs (Python, Go, etc.), even assuming
>> there are relational algebra operators defined in the SDKs, each SDK still
>> has to implement its own way to parse SQL into operators (SQL is just a
>> string). To avoid that overhead, I would imagine that SDKs should keep SQL
>> queries as they are and wait for later, shared processing (I don't know if
>> Portability should handle SQL or if it could).
>>
>>
>> -Rui
>>
>> On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz> wrote:
>>
>>> Hi Kenn,
>>>
>>> my intent really was not to propose any changes right now. I'm trying to
>>> create a clear understanding of what the relation between Euphoria and
>>> SQL should be in the long run. From my point of view, Euphoria should
>>> always be a superset of SQL, because it should support complete relational
>>> algebra (and I'm not saying it does so right now, it should just be our
>>> goal) plus more flexible UDFs (not limited to the SQL standard) and stateful
>>> processing (which will probably not be part of SQL any time soon). There
>>> should be some sort of guarantee that the semantics of SQL and Euphoria are
>>> the same, because that is what users would expect. This can surely be
>>> ensured by introducing another layer between Euphoria and the core SDK
>>> (e.g. the join library), but the question is - what makes this solution
>>> different from creating this shared library from Euphoria itself (when
>>> looking at the big picture)? And it is not only about implementations of
>>> joins or any other operators; there are other techniques that could be
>>> beneficial for SQL, e.g. pipeline sampling, automatic pipeline
>>> optimizations based on statistics from previous runs of batch queries, etc.
>>>
>>> The other way - relational algebra nodes becoming an essential part
>>> of (some) SDK - is equivalent to actually creating an SQL SDK, am I right?
>>> I understand that this approach can bring performance benefits, but
>>> besides that - is the language which implements SQL really important for
>>> users? Do we need SQL implementing Go UDFs, Java UDFs, Python UDFs? What
>>> would the resulting SQL query look like? If it is about allowing the use
>>> of SQL from all other SDKs (I want to do some basic preprocessing using SQL
>>> and then optimize some hard part in my favorite SDK) - can this be solved
>>> by enabling SQL in all SDKs by mixing various SDK harnesses in a single
>>> pipeline instead (e.g. I want to use SQL in the Go SDK, so I just tell the
>>> portable layer to run these operators using the Java SDK and these using
>>> Go)? That seems plausible, solving interoperability issues while leaving
>>> the whole implementation of SQL as an internal detail. Generally this
>>> solves more issues, like the ability to reuse IOs in all SDKs (I'm aware
>>> that there are caveats, but that is beyond the scope of the intended topic
>>> of this thread).
>>>
>>>  Jan
>>> On 12/3/18 7:27 PM, Kenneth Knowles wrote:
>>>
>>> To be honest, I don't think there's much worth doing right now. I think
>>> more self-contained is better for Beam SQL, generally. Two things I have on
>>> my mind are (1) SQL as an inline transform in every SDK and (2) supporting
>>> pure SQL like the CLI and JDBC driver, where the underlying language is an
>>> implementation detail.
>>>
>>> Big picture / long term, I would envision pure SQL, embedded SQL
>>> transform, and a DataFrame-like API in ~each SDK all desugaring to
>>> relational algebra nodes, sharing an optimizer, sharing some amount of
>>> mapping the physical plan to Beam transforms. The necessarily SDK-specific
>>> parts are the embedded transform API and UDFs in the host language. The
>>> rest should remain an implementation detail that we can change.
>>>
>>>  - For example, it is easy to imagine a customized columnar
>>> element/bundle encoding and SDK harness that only works for SQL to remove
>>> overhead of being general purpose. It could be written in C/C++/Go if we
>>> wanted to squeeze it for perf. Such things are made harder by having an
>>> elaborate end-user API between SQL and the core Beam model.
>>>  - Conversely, for whatever is chosen to underlie SQL's execution,
>>> stability is paramount. Ideally the simplest and least likely to change
>>> transforms would be the foundation. And I wouldn't want to have to design a
>>> user-friendly API for Euphoria or the join library just to enable a
>>> different join algorithm in SQL.
>>>
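A minimal sketch of the idea above, that several front ends can desugar to the same relational algebra nodes. Everything in it is an assumption made for illustration: `DesugarSketch`, `RelNode`, `Scan`, `Filter`, `Frame`, and the toy parser that accepts a single query shape are hypothetical names, not Beam, Calcite, or Euphoria classes.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch: these types are illustrative only and are not
// Beam, Calcite, or Euphoria classes.
final class DesugarSketch {

  /** A relational algebra node: the shared plan that every front end produces. */
  interface RelNode {}

  /** Reads all rows of a named table. */
  static final class Scan implements RelNode {
    final String table;
    Scan(String table) { this.table = table; }
    @Override public String toString() { return "Scan(" + table + ")"; }
  }

  /** Keeps only the rows whose column equals the given value. */
  static final class Filter implements RelNode {
    final RelNode input;
    final String column;
    final String value;
    Filter(RelNode input, String column, String value) {
      this.input = input;
      this.column = column;
      this.value = value;
    }
    @Override public String toString() {
      return "Filter(" + column + "=" + value + ", " + input + ")";
    }
  }

  // The only query shape the toy SQL front end understands.
  private static final Pattern QUERY =
      Pattern.compile("SELECT \\* FROM (\\w+) WHERE (\\w+) = '([^']*)'");

  /** Front end 1: SQL text, as a pure-SQL shell or an embedded transform would receive it. */
  static RelNode fromSql(String sql) {
    Matcher m = QUERY.matcher(sql.trim());
    if (!m.matches()) {
      throw new IllegalArgumentException("unsupported query: " + sql);
    }
    return new Filter(new Scan(m.group(1)), m.group(2), m.group(3));
  }

  /** Front end 2: a DataFrame-like fluent API in the host language. */
  static final class Frame {
    private final RelNode plan;
    private Frame(RelNode plan) { this.plan = plan; }
    static Frame table(String name) { return new Frame(new Scan(name)); }
    Frame where(String column, String value) {
      return new Frame(new Filter(plan, column, value));
    }
    RelNode plan() { return plan; }
  }
}
```

Under these assumptions, `DesugarSketch.fromSql("SELECT * FROM orders WHERE status = 'open'")` and `DesugarSketch.Frame.table("orders").where("status", "open").plan()` build the same tree, so a shared optimizer and physical planner would only need to understand `RelNode`.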
>>> So my take is keep SQL flexible, implement SQL on low-level and stable
>>> APIs, use join library, Euphoria, etc, if it looks like a big win, but
>>> don't build any policy here or do big refactors right now.
>>>
>>> Kenn
>>>
>>> On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> currently there is no actual proposal, I was just trying to gather
>>>> feedback from the community. But my original thoughts would be [1]. I
>>>> actually don't see much need for restructuring the code by nesting
>>>> directories. If the community sees that it would make sense to structure
>>>> the dependencies, the second step would probably be to figure out how to
>>>> accomplish this. I don't have an exact solution in mind so far; we
>>>> would probably first need to identify features that are needed by
>>>> SQL and not currently supported by Euphoria. Then we can actually
>>>> identify the costs and see if this still makes sense.
>>>>
>>>>   Jan
>>>>
>>>> On 12/3/18 6:17 PM, Robert Bradshaw wrote:
>>>> > Taking a step back, what exactly is the proposal? Looking at the
>>>> > original message, I see
>>>> >
>>>> > (1) Letting SQL take a dependency on Euphoria, sharing more code and
>>>> > taking advantage of the logical nesting of levels of abstraction. This
>>>> > makes sense to me.
>>>> > (2) Nesting the directories (but not the gradle targets or module
>>>> > names?). Here I'm not so sure about the benefit, especially vs. the
>>>> > cost.
>>>> > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >> I think that the fact that SQL uses some other internal dependency
>>>> >> should remain a hidden implementation detail. I absolutely agree that the
>>>> >> dependency should of course remain sdks-java-sql in all cases.
>>>> >>
>>>> >>     Jan
>>>> >>
>>>> >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>>> >>> I suppose what I'm trying to say is that I see this module structure
>>>> >>> as a tool for discoverability and enumerating end-user endpoints. In
>>>> >>> other words, if one wants to use SQL, it would seem odd to have to
>>>> >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>>>> >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
>>>> >>> does not prohibit the layered approach to implementation that sounds
>>>> >>> like it makes sense.
>>>> >>>
>>>> >>> (As for merging Euphoria into core, my initial impression is that's
>>>> >>> probably a good idea, and something we should consider for 3.0 at the
>>>> >>> very least.)
>>>> >>>
>>>> >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>> Hi Rui,
>>>> >>>>
>>>> >>>> yes, there are optimizations that could be added by each layer.
>>>> >>>> The purpose of the Euphoria layer is actually not to reorder or modify
>>>> >>>> any user operators that are present in the pipeline (because it might
>>>> >>>> not have enough information to do this), but it can for instance choose
>>>> >>>> between various join implementations (shuffle join, broadcast join,
>>>> >>>> ...), so the optimizations it can do are more low-level. But this plays
>>>> >>>> nicely with the DSL hierarchy: each layer adds a few more restrictions,
>>>> >>>> but can therefore do more optimizations. And I think that the layer
>>>> >>>> between the SDK and SQL wouldn't have to support SQL optimizations; it
>>>> >>>> would only have to support a way for SQL to express these optimizations.
>>>> >>>>
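The kind of low-level choice described above, picking a physical join implementation without touching the user's logical operator, could be sketched roughly as follows. This is an illustration under stated assumptions, not Euphoria's actual logic: `JoinStrategySketch`, `JoinStrategy`, `choose`, the byte-size estimates, and the 64 MiB cutoff are all hypothetical.

```java
// Hypothetical sketch: none of these names or thresholds come from Euphoria or Beam.
final class JoinStrategySketch {

  /** The two physical join implementations named in the message above. */
  enum JoinStrategy { SHUFFLE, BROADCAST }

  /** Assumed cutoff: a side no larger than this is cheap enough to copy to every worker. */
  static final long BROADCAST_LIMIT_BYTES = 64L * 1024 * 1024;

  /**
   * Picks a physical join for one logical join.
   * Sizes are estimates in bytes; a negative value means "unknown".
   */
  static JoinStrategy choose(long leftBytes, long rightBytes) {
    if (leftBytes < 0 || rightBytes < 0) {
      // Without size information, fall back to the general-purpose implementation.
      return JoinStrategy.SHUFFLE;
    }
    long smaller = Math.min(leftBytes, rightBytes);
    return smaller <= BROADCAST_LIMIT_BYTES ? JoinStrategy.BROADCAST : JoinStrategy.SHUFFLE;
  }
}
```

With these assumed numbers, joining a 1 KiB side against a 10 GiB side selects `BROADCAST`, while two 10 GiB sides, or any side of unknown size, select `SHUFFLE`. The logical join the user wrote stays the same in every case.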
>>>> >>>>     Jan
>>>> >>>>
>>>> >>>> ---------- Original e-mail ----------
>>>> >>>> From: Rui Wang <ruw...@google.com>
>>>> >>>> To: dev@beam.apache.org
>>>> >>>> Date: 30. 11. 2018 22:43:04
>>>> >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>> >>>>
>>>> >>>> SQL's optimization is another area to consider for integration.
>>>> >>>> SQL optimization includes pushing down filters/projections, merging,
>>>> >>>> removing, or swapping plan nodes, and comparing plan costs to choose the
>>>> >>>> best plan. Adding another layer between SQL and the Java core might
>>>> >>>> require that layer to support SQL optimizations, if there is a need.
>>>> >>>>
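To make "pushing down filters" and "swapping plan nodes" concrete, here is a small sketch of one such rewrite rule on a toy plan tree. It is an assumption-laden illustration, not Beam SQL's or Calcite's optimizer: `PushdownSketch`, `Node`, `Scan`, `Project`, `Filter`, and `pushDownFilter` are hypothetical names, and the projection is assumed to select columns without renaming them.

```java
import java.util.List;

// Hypothetical sketch: a toy plan tree, not Beam SQL's or Calcite's planner classes.
final class PushdownSketch {

  interface Node {}

  /** Reads all rows of a named table. */
  static final class Scan implements Node {
    final String table;
    Scan(String table) { this.table = table; }
    @Override public String toString() { return "Scan(" + table + ")"; }
  }

  /** Keeps only the listed columns (assumed: no renaming, no expressions). */
  static final class Project implements Node {
    final Node input;
    final List<String> columns;
    Project(Node input, List<String> columns) {
      this.input = input;
      this.columns = columns;
    }
    @Override public String toString() { return "Project(" + columns + ", " + input + ")"; }
  }

  /** Keeps only rows that satisfy some predicate on the named column. */
  static final class Filter implements Node {
    final Node input;
    final String column;
    Filter(Node input, String column) {
      this.input = input;
      this.column = column;
    }
    @Override public String toString() { return "Filter(" + column + ", " + input + ")"; }
  }

  /**
   * Rewrites Filter(Project(x)) into Project(Filter(x)) so that rows are
   * dropped before they are projected. Any other shape is returned unchanged.
   */
  static Node pushDownFilter(Node node) {
    if (node instanceof Filter) {
      Filter filter = (Filter) node;
      if (filter.input instanceof Project) {
        Project project = (Project) filter.input;
        // Only swap when the projection keeps the filtered column.
        if (project.columns.contains(filter.column)) {
          // Recurse so the filter keeps sinking through stacked projections.
          Node pushed = pushDownFilter(new Filter(project.input, filter.column));
          return new Project(pushed, project.columns);
        }
      }
    }
    return node;
  }
}
```

For example, `Filter(status, Project([id, status], Scan(orders)))` is rewritten to `Project([id, status], Filter(status, Scan(orders)))`. A layer sitting between SQL and the core SDK would either have to allow rewrites like this or let SQL express their result.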
>>>> >>>> I don't have a clear picture of what SQL needs from Euphoria for
>>>> >>>> optimization (the best case is nothing). As those optimizations are
>>>> >>>> happening or will happen, we might start to get a sense of it.
>>>> >>>>
>>>> >>>> -Rui
>>>> >>>>
>>>> >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <rober...@google.com> wrote:
>>>> >>>>
>>>> >>>> I don't really see Euphoria as a subset of SQL or the other way
>>>> >>>> around, and I think it makes sense to use either without the other, so
>>>> >>>> by this criterion it makes more sense to keep them as siblings than to
>>>> >>>> nest them.
>>>> >>>>
>>>> >>>> That said, I think it's really good to have a bunch of shared code,
>>>> >>>> e.g. a join library that could be used by both. One could even depend
>>>> >>>> on the other without having to abandon the sibling relationship.
>>>> >>>> Something like retractions belongs in the core SDK itself. Deeper than
>>>> >>>> that, actually, it should be part of the model.
>>>> >>>>
>>>> >>>> - Robert
>>>> >>>>
>>>> >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <d...@apache.org> wrote:
>>>> >>>>> Jan, we made Kryo optional recently (it is a separate module and
>>>> >>>>> is used only in tests). From a quick look it seems that we forgot to
>>>> >>>>> remove the compile-time dependency from Euphoria's build.gradle. The
>>>> >>>>> only "strong" dependencies I'm aware of are the core SDK and Guava.
>>>> >>>>> We'll probably be adding a sketching extension dependency soon.
>>>> >>>>>
>>>> >>>>> D.
>>>> >>>>>
>>>> >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>>>> Hi Anton,
>>>> >>>>>> reactions inline.
>>>> >>>>>>
>>>> >>>>>> ---------- Original e-mail ----------
>>>> >>>>>> From: Anton Kedin <ke...@google.com>
>>>> >>>>>> To: dev@beam.apache.org
>>>> >>>>>> Date: 30. 11. 2018 18:17:06
>>>> >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>> >>>>>>
>>>> >>>>>> I think this approach makes sense in general, Euphoria can be
>>>> >>>>>> the implementation detail of SQL, similar to Join Library or core SDK
>>>> >>>>>> Schemas.
>>>> >>>>>>
>>>> >>>>>> I wonder though whether it would be better to bring Euphoria
>>>> >>>>>> closer to core SDK first, maybe even merge them together. If you look at
>>>> >>>>>> Reuven's recent work around schemas it seems like there are already
>>>> >>>>>> similarities between that and Euphoria's approach, unless I'm missing the
>>>> >>>>>> point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]). And we're
>>>> >>>>>> already switching parts of SQL to those transforms (e.g. SQL Aggregation is
>>>> >>>>>> now implemented by core SDK's Group [3]).
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Yes, these transforms seem to be very similar to those Euphoria
>>>> >>>>>> has. Whether or not to merge Euphoria with core is essentially just a
>>>> >>>>>> decision for the community (from my point of view).
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Adding explicit Schema support to Euphoria will both bring it
>>>> >>>>>> closer to core SDK and make it natural to use for SQL. Can this be a first
>>>> >>>>>> step towards this integration?
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> Euphoria currently operates on pure PCollections, so when a
>>>> >>>>>> PCollection has a schema, it will be accessible by Euphoria. It makes sense
>>>> >>>>>> to make use of the schema in Euphoria - it seems natural on inputs to
>>>> >>>>>> Euphoria operators, but it might be tricky (not saying impossible) to
>>>> >>>>>> actually produce schema-aware PCollections as outputs from Euphoria
>>>> >>>>>> operators (generally speaking; in special cases that might be possible).
>>>> >>>>>> Regarding inputs, there is actually an intention to act on the type of the
>>>> >>>>>> PCollection - e.g. when a PCollection is already of type KV, it is possible
>>>> >>>>>> to make the key extractor and value extractor optional in Euphoria builders,
>>>> >>>>>> so it feels natural to also adapt the builders when given a schema-aware
>>>> >>>>>> PCollection, and make use of the provided schema. The rest of the Euphoria
>>>> >>>>>> team might correct me if I'm wrong.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> One question I have is, does Euphoria bring dependencies that
>>>> >>>>>> are not needed by SQL, or does it more or less rely only on the core SDK?
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> I think the only relevant dependency that Euphoria has besides
>>>> >>>>>> core SDK is Kryo. It is the default coder when no coder is provided, but
>>>> >>>>>> that could be made optional - e.g. the default coder would be supported
>>>> >>>>>> only if an appropriate module were available. That way I think that
>>>> >>>>>> Euphoria has no special dependencies.
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>>>> >>>>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>> >>>>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>>
>>>> >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi community,
>>>> >>>>>>
>>>> >>>>>> I'm part of the Euphoria DSL team, and on behalf of this team, I'd like
>>>> >>>>>> to discuss possible development of the Java based DSLs currently present
>>>> >>>>>> in Beam. To my knowledge, there are currently two DSLs based on the Java
>>>> >>>>>> SDK - Euphoria and SQL. These DSLs currently share only the SDK itself,
>>>> >>>>>> although there might be room to share some more effort. We already know
>>>> >>>>>> that both Euphoria and SQL have a need for retractions, but there are
>>>> >>>>>> probably many more features that these two could share.
>>>> >>>>>>
>>>> >>>>>> So, I'd like to open a discussion on what it would cost and what it
>>>> >>>>>> would possibly bring, if instead of the current structure
>>>> >>>>>>
>>>> >>>>>>      Java SDK
>>>> >>>>>>
>>>> >>>>>>        | ---- SQL
>>>> >>>>>>
>>>> >>>>>>        | ---- Euphoria
>>>> >>>>>>
>>>> >>>>>> these DSLs would be structured as
>>>> >>>>>>
>>>> >>>>>>      Java SDK ---> Euphoria ---> SQL
>>>> >>>>>>
>>>> >>>>>> I'm absolutely sure that this would be a great investment and a huge
>>>> >>>>>> change, but I'd like to gather some opinions and general feelings of the
>>>> >>>>>> community about this. Some points to start the discussion from my side
>>>> >>>>>> would be that structuring DSLs like this has internal logical
>>>> >>>>>> consistency, because each API layer further narrows completeness, but
>>>> >>>>>> brings a simpler API for simpler tasks, while adding an additional
>>>> >>>>>> high-level view of the data processing pipeline and thus enabling more
>>>> >>>>>> optimizations. On the Euphoria side, these are various implementations
>>>> >>>>>> of joins (the most effective implementation depends on the data),
>>>> >>>>>> pipeline sampling and more. Some (or maybe most) of these optimizations
>>>> >>>>>> would have to be implemented in both DSLs, so implementing them once is
>>>> >>>>>> beneficial. Another benefit is that this would bring Euphoria "closer"
>>>> >>>>>> to Beam core development (which would be good, it is part of the project
>>>> >>>>>> anyway, right? :)) and help better drive features that, although
>>>> >>>>>> currently needed mostly by SQL, might be needed by other Java users
>>>> >>>>>> anyway.
>>>> >>>>>>
>>>> >>>>>> Thanks for discussion and looking forward to any opinions.
>>>> >>>>>>
>>>> >>>>>>      Jan
>>>> >>>>>>
>>>>
>>>
