Re: [DISCUSS] Structuring Java based DSLs

Jan Lukavský Wed, 12 Dec 2018 08:56:01 -0800

Hi all,

after letting this sink in for a while, I'd like to summarize the feedback and emphasize some questions that appeared:


 a) there were several 'it makes sense' opinions

 b) there was one 'not right now' - which makes sense, but the purpose of this discussion was to try to first answer the what and then the when :-)


 c) there were several 'maybe, but':

i) it would be more complicated to code SQL against a user-facing API, because that way, each change needed by SQL would have to be first implemented in this user-friendly API layer

I absolutely agree with this; it would definitely be more complicated and more work. I see basically two ways out. The first would be to move all the code from Euphoria into something similar to the Join library, and let Euphoria be just the user-friendly layer on top of this library (basically just the builders). That way, we could reuse the code and be pretty much sure that the implementations of SQL transforms are identical to what Euphoria would offer, which is one of the goals of this discussion. The drawback would be that there would be no guarantee that what this underlying library offers would also be accessible from Euphoria. The complexity would not disappear, it would just be moved to a different component: each feature newly added to the shared library would still have to be made accessible in Euphoria. The other way would be to accept this added complexity in favor of making sure that every feature needed by SQL is also available in Euphoria, because the user-facing API would be used by SQL itself. I'd really like to hear further community opinions on the pros and cons of these two (or maybe I'm overlooking something and there is a third way).
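
To make the first option a bit more concrete, here is a minimal sketch of the layering. It is only an illustration under stated assumptions: plain Java lists stand in for PCollections, and all names (SharedJoinLibrary, FriendlyJoin, SqlJoinRel, LayeringDemo) are hypothetical, not existing Beam or Euphoria classes.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

/** Hypothetical shared library: the only place the join logic lives. */
final class SharedJoinLibrary {
  private SharedJoinLibrary() {}

  /** Hash-based inner join; each match is rendered as "left|right". */
  static <L, R, K> List<String> innerJoin(
      List<L> left, List<R> right,
      Function<L, K> leftKey, Function<R, K> rightKey) {
    // Index the right side by key, preserving insertion order per key.
    Map<K, List<R>> index = new HashMap<>();
    for (R r : right) {
      index.computeIfAbsent(rightKey.apply(r), k -> new ArrayList<>()).add(r);
    }
    // Probe the index with every left element.
    List<String> out = new ArrayList<>();
    for (L l : left) {
      for (R r : index.getOrDefault(leftKey.apply(l), Collections.emptyList())) {
        out.add(l + "|" + r);
      }
    }
    return out;
  }
}

/** Hypothetical user-friendly layer: a builder that only delegates. */
final class FriendlyJoin<L, R> {
  private final List<L> left;
  private final List<R> right;

  private FriendlyJoin(List<L> left, List<R> right) {
    this.left = left;
    this.right = right;
  }

  static <L, R> FriendlyJoin<L, R> of(List<L> left, List<R> right) {
    return new FriendlyJoin<>(left, right);
  }

  <K> List<String> by(Function<L, K> leftKey, Function<R, K> rightKey) {
    return SharedJoinLibrary.innerJoin(left, right, leftKey, rightKey);
  }
}

/** Hypothetical SQL-side caller: uses the library directly, bypassing the builder. */
final class SqlJoinRel {
  static List<String> execute(List<String> left, List<String> right) {
    Function<String, String> key = row -> row.split(":")[0];
    return SharedJoinLibrary.innerJoin(left, right, key, key);
  }
}

final class LayeringDemo {
  public static void main(String[] args) {
    List<String> users = Arrays.asList("1:alice", "2:bob");
    List<String> orders = Arrays.asList("1:book", "1:pen", "3:cup");
    Function<String, String> id = row -> row.split(":")[0];
    // Both paths end up in the same library code, so the results are identical.
    System.out.println(FriendlyJoin.of(users, orders).by(id, id));
    System.out.println(SqlJoinRel.execute(users, orders));
  }
}
```

Both calls go through the same innerJoin, so the semantics cannot diverge. The drawback is visible as well: if the library gained, say, an outer join, the SQL side could call it immediately, while the builder would not expose it until someone added a corresponding method.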

ii) in some cases, we might want to support relational operators in the SDK harness for performance, and we don't want to close the door on this

Again, the motivation for this seems clear and valid, but the question that arises is: under those conditions (something like having schema-aware PCollections), would we want to enable code reuse between logic written in SQL and in Euphoria to ensure consistent behavior? That would probably mean that Euphoria would have to make use of the provided schema of the PCollection and switch to a different behavior at the API level (more DataFrame-like) and/or create different operators and pass them to the SDK harness. This feature is currently entirely missing, but it seems plausible, and maybe there could be benefits for both sides.


Many thanks for any more opinions on this.

 Jan


On 12/4/18 11:32 PM, Rui Wang wrote:

For pure SQL users, there shouldn't be any SDK concepts. The SQL shell and the JDBC driver should be the way to interact with Beam via SQL.

For the embedded SQL use case in all SDKs (Python, Go, etc.), even assuming there are relational algebra operators defined in the SDKs, each SDK would still have to implement its own way to parse SQL into operators (SQL is just a string). To avoid that overhead, I would imagine that SDKs should keep SQL queries as they are and wait for a later but shared processing step (I don't know if Portability should handle SQL or if it could).



-Rui

On Tue, Dec 4, 2018 at 2:04 AM Jan Lukavský <je...@seznam.cz<mailto:je...@seznam.cz>> wrote:


    Hi Kenn,

    my intent really was not to propose any changes right now. I'm
    trying to create a clear understanding about what the relation
    between Euphoria and SQL should be in the long run. In my point of
    view, Euphoria should always be a superset of SQL, because it should
    support complete relational algebra (and I'm not saying it does so
    right now, it should just be our goal) plus more flexible UDFs
    (not limited to SQL standard) and stateful processing (which will
    probably not be part of SQL any time soon). There should be some
    sort of guarantee that the semantics of SQL and Euphoria are the
    same, because that is what users would expect. This can
    certainly be ensured by introducing another layer between Euphoria
    and core SDK (e.g. the join library), but the question is - what
    makes this solution different from creating this shared library
    from Euphoria itself (when looking at the big picture)? And it is
    not only about implementations of joins or any other operators,
    but there are other techniques that could be beneficial for SQL -
    e.g. pipeline sampling, automatic pipeline optimizations based on
    statistics from previous runs of batch queries, etc.

    The other way - that relational algebra nodes will become an
    essential part of (some) SDK - is equivalent to actually
    creating an SQL SDK, am I right? I understand that this approach can
    bring performance benefits, but besides that - is the language
    which implements SQL really important for users? Do we need SQL
    implementing Go UDFs, Java UDFs, Python UDFs? What would the
    resulting SQL query look like? If it is about allowing the use of SQL
    from all other SDKs (I want to do some basic preprocessing using
    SQL and then optimize some hard part in my favorite SDK) - can
    this be solved by enabling SQL in all SDKs by mixing various SDK
    harnesses in a single pipeline instead (e.g. I want to use SQL in Go
    SDK, I just tell the portable layer to run these operators using
    Java SDK and these using Go)? That seems plausible, solving
    interoperability issues, while leaving the whole implementation of
    SQL as an internal detail. Generally this solves more issues, like
    the ability to reuse IOs in all SDKs (I'm aware that there are
    caveats, but that is beyond the scope of the intended discussion
    topic of this thread).

     Jan

    On 12/3/18 7:27 PM, Kenneth Knowles wrote:

    To be honest, I don't think there's much worth doing right now. I
    think more self-contained is better for Beam SQL, generally. Two
    things I have on my mind are (1) SQL as an inline transform in
    every SDK and (2) supporting pure SQL like the CLI and JDBC
    driver, where the underlying language is an implementation detail.

    Big picture / long term, I would envision pure SQL, embedded SQL
    transform, and a DataFrame-like API in ~each SDK all desugaring
    to relational algebra nodes, sharing an optimizer, sharing some
    amount of mapping the physical plan to Beam transforms. The
    necessarily SDK-specific parts are the embedded transform API and
    UDFs in the host language. The rest should remain an
    implementation detail that we can change.

     - For example, it is easy to imagine a customized columnar
    element/bundle encoding and SDK harness that only works for SQL
    to remove overhead of being general purpose. It could be written
    in C/C++/Go if we wanted to squeeze it for perf. Such things are
    made harder by having an elaborate end-user API between SQL and
    the core Beam model.
     - Conversely, for whatever is chosen to underlie SQL's
    execution, stability is paramount. Ideally the simplest and least
    likely to change transforms would be the foundation. And I
    wouldn't want to have to design a user-friendly API for Euphoria
    or the join library just to enable a different join algorithm in SQL.

    So my take is keep SQL flexible, implement SQL on low-level and
    stable APIs, use join library, Euphoria, etc, if it looks like a
    big win, but don't build any policy here or do big refactors
    right now.

    Kenn

    On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz
    <mailto:je...@seznam.cz>> wrote:

        Hi Robert,

        currently there is no actual proposal, I was just trying to gather
        feedback from the community. But my original thoughts would be [1]. I
        actually don't see much need for restructuring the code by nesting
        directories. If the community sees that it would make sense to
        structure the dependencies, the second step would probably be to
        figure out how to accomplish this. I don't have any exact solution in
        mind so far; it would probably be necessary to first identify features
        that are needed by SQL and not supported by Euphoria currently. Then
        we can actually identify costs and see if this still makes sense.

          Jan

        On 12/3/18 6:17 PM, Robert Bradshaw wrote:
        > Taking a step back, what exactly is the proposal? Looking at the
        > original message, I see
        >
        > (1) Letting SQL take a dependency on Euphoria, sharing more code and
        > taking advantage of the logical nesting of levels of abstraction. This
        > makes sense to me.
        > (2) Nesting the directories (but not the gradle targets or module
        > names?). Here I'm not so sure about the benefit, especially vs. the
        > cost.
        > On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
        >> I think that the fact that SQL uses some other internal dependency
        >> should remain a hidden implementation detail. I absolutely agree that
        >> the dependency should of course remain sdks-java-sql in all cases.
        >>
        >>     Jan
        >>
        >> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
        >>> I suppose what I'm trying to say is that I see this module structure
        >>> as a tool for discoverability and enumerating end-user endpoints. In
        >>> other words, if one wants to use SQL, it would seem odd to have to
        >>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
        >>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
        >>> does not prohibit the layered approach to implementation that sounds
        >>> like it makes sense.
        >>>
        >>> (As for merging Euphoria into core, my initial impression is that's
        >>> probably a good idea, and something we should consider for 3.0 at the
        >>> very least.)
        >>>
        >>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
        >>>> Hi Rui,
        >>>>
        >>>> yes, there are optimizations that could be added by each layer. The
        >>>> purpose of the Euphoria layer actually is not to reorder or modify any
        >>>> user operators that are present in the pipeline (because it might not
        >>>> have enough information to do this), but it can for instance choose
        >>>> between various join implementations (shuffle join, broadcast join,
        >>>> ...) - so the optimizations it can do are more low level. But this
        >>>> plays nicely with the DSL hierarchy - each layer adds a few more
        >>>> restrictions, but can therefore do more optimizations. And I think
        >>>> that the layer between SDK and SQL wouldn't have to support SQL
        >>>> optimizations, it would only have to support a way for SQL to express
        >>>> these optimizations.
        >>>>
        >>>>     Jan
        >>>>
        >>>> ---------- Original e-mail ----------
        >>>> From: Rui Wang <ruw...@google.com <mailto:ruw...@google.com>>
        >>>> To: dev@beam.apache.org <mailto:dev@beam.apache.org>
        >>>> Date: 30. 11. 2018 22:43:04
        >>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
        >>>>
        >>>> SQL's optimization is another area to consider for integration. SQL
        >>>> optimization includes pushing down filters/projections, merging,
        >>>> removing or swapping plan nodes, and comparing plan costs to choose the
        >>>> best plan. Adding another layer between SQL and the Java core might
        >>>> require that layer to support SQL optimizations if there is a need.
        >>>>
        >>>> I don't have a clear picture of what SQL needs from Euphoria for
        >>>> optimization (best case is nothing). As those optimizations are
        >>>> happening or will happen, we might start to have a sense of it.
        >>>>
        >>>> -Rui
        >>>>
        >>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <rober...@google.com <mailto:rober...@google.com>> wrote:
        >>>>
        >>>> I don't really see Euphoria as a subset of SQL or the other way
        >>>> around, and I think it makes sense to use either without the other, so
        >>>> by this criterion I'd keep them as siblings rather than nesting them.
        >>>>
        >>>> That said, I think it's really good to have a bunch of shared code,
        >>>> e.g. a join library that could be used by both. One could even depend
        >>>> on the other without having to abandon the sibling relationship.
        >>>> Something like retractions belongs in the core SDK itself. Deeper than
        >>>> that, actually, it should be part of the model.
        >>>>
        >>>> - Robert
        >>>>
        >>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <d...@apache.org <mailto:d...@apache.org>> wrote:
        >>>>> Jan, we made Kryo optional recently (it is a separate module and is
        >>>>> used only in tests). From a quick look it seems that we forgot to
        >>>>> remove the compile-time dependency from euphoria's build.gradle. The
        >>>>> only "strong" dependencies I'm aware of are core SDK and guava. We'll
        >>>>> probably be adding a sketching extension dependency soon.
        >>>>>
        >>>>> D.
        >>>>>
        >>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
        >>>>>> Hi Anton,
        >>>>>> reactions inline.
        >>>>>>
        >>>>>> ---------- Original e-mail ----------
        >>>>>> From: Anton Kedin <ke...@google.com <mailto:ke...@google.com>>
        >>>>>> To: dev@beam.apache.org <mailto:dev@beam.apache.org>
        >>>>>> Date: 30. 11. 2018 18:17:06
        >>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
        >>>>>>
        >>>>>> I think this approach makes sense in general, Euphoria can be the
        >>>>>> implementation detail of SQL, similar to Join Library or core SDK
        >>>>>> Schemas.
        >>>>>>
        >>>>>> I wonder though whether it would be better to bring Euphoria closer
        >>>>>> to core SDK first, maybe even merge them together. If you look at
        >>>>>> Reuven's recent work around schemas it seems like there are already
        >>>>>> similarities between that and Euphoria's approach, unless I'm
        >>>>>> missing the point (e.g. Filter transforms, FullJoin vs CoGroup...
        >>>>>> see [2]). And we're already switching parts of SQL to those
        >>>>>> transforms (e.g. SQL Aggregation is now implemented by core SDK's
        >>>>>> Group[3]).
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> Yes, these transforms seem to be very similar to those Euphoria
        >>>>>> has. Whether or not to merge Euphoria with core is essentially just
        >>>>>> a decision of the community (in my point of view).
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> Adding explicit Schema support to Euphoria will bring it both
        >>>>>> closer to core SDK and make it natural to use for SQL. Can this be
        >>>>>> a first step towards this integration?
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> Euphoria currently operates on pure PCollections, so when a
        >>>>>> PCollection has a schema, it will be accessible by Euphoria. It
        >>>>>> makes sense to make use of the schema in Euphoria - it seems
        >>>>>> natural on inputs to Euphoria operators, but it might be tricky
        >>>>>> (not saying impossible) to actually produce schema-aware
        >>>>>> PCollections as outputs from Euphoria operators (generally
        >>>>>> speaking; in special cases that might be possible). Regarding
        >>>>>> inputs, there is actually an intention to act on the type of the
        >>>>>> PCollection - e.g. when the PCollection is already of type KV, then
        >>>>>> it is possible to make the key extractor and value extractor
        >>>>>> optional in Euphoria builders, so it feels natural to enable
        >>>>>> changing the builders when a schema-aware PCollection is given, and
        >>>>>> to make use of the provided schema. The rest of the Euphoria team
        >>>>>> might correct me if I'm wrong.
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> One question I have is, does Euphoria bring dependencies that are
        >>>>>> not needed by SQL, or does it more or less only rely on the core
        >>>>>> SDK?
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> I think the only relevant dependency that Euphoria has besides core
        >>>>>> SDK is Kryo. It is the default coder when no coder is provided, but
        >>>>>> that could be made optional - e.g. the default coder would be
        >>>>>> supported only if an appropriate module would be available. That
        >>>>>> way I think that Euphoria has no special dependencies.
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
        >>>>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
        >>>>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
        >>>>>>
        >>>>>>
        >>>>>>
        >>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz <mailto:je...@seznam.cz>> wrote:
        >>>>>>
        >>>>>> Hi community,
        >>>>>>
        >>>>>> I'm part of the Euphoria DSL team, and on behalf of this team, I'd
        >>>>>> like to discuss possible development of Java based DSLs currently
        >>>>>> present in Beam. To my knowledge, there are currently two DSLs
        >>>>>> based on Java SDK - Euphoria and SQL. These DSLs currently share
        >>>>>> only the SDK itself, although there might be room to share some
        >>>>>> more effort. We already know that both Euphoria and SQL have need
        >>>>>> for retractions, but there are probably many more features that
        >>>>>> these two could share.
        >>>>>>
        >>>>>> So, I'd like to open a discussion on what it would cost and what it
        >>>>>> would possibly bring, if instead of the current structure
        >>>>>>
        >>>>>>      Java SDK
        >>>>>>
        >>>>>>        | ---- SQL
        >>>>>>
        >>>>>>        | ---- Euphoria
        >>>>>>
        >>>>>> these DSLs would be structured as
        >>>>>>
        >>>>>>      Java SDK ---> Euphoria ---> SQL
        >>>>>>
        >>>>>> I'm absolutely sure that this would be a great investment and a
        >>>>>> huge change, but I'd like to gather some opinions and general
        >>>>>> feelings of the community about this. Some points to start the
        >>>>>> discussion from my side would be that structuring DSLs like this
        >>>>>> has internal logical consistency, because each API layer further
        >>>>>> narrows completeness, but brings a simpler API for simpler tasks,
        >>>>>> while adding an additional high-level view of the data processing
        >>>>>> pipeline and thus enabling more optimizations. On the Euphoria
        >>>>>> side, these are various implementations of joins (the most
        >>>>>> effective implementation depends on the data), pipeline sampling
        >>>>>> and more. Some (or maybe most) of these optimizations would have to
        >>>>>> be implemented in both DSLs, so implementing them once is
        >>>>>> beneficial. Another benefit is that this would bring Euphoria
        >>>>>> "closer" to Beam core development (which would be good, it is part
        >>>>>> of the project anyway, right? :)) and help better drive features
        >>>>>> that, although currently needed mostly by SQL, might be needed by
        >>>>>> other Java users anyway.
        >>>>>>
        >>>>>> Thanks for discussion and looking forward to any opinions.
        >>>>>>
        >>>>>>      Jan
        >>>>>>
