To be honest, I don't think there's much worth doing right now. I
think more self-contained is better for Beam SQL, generally. Two
things I have on my mind are (1) SQL as an inline transform in
every SDK and (2) supporting pure SQL like the CLI and JDBC
driver, where the underlying language is an implementation detail.
Big picture / long term, I would envision pure SQL, embedded SQL
transform, and a DataFrame-like API in ~each SDK all desugaring
to relational algebra nodes, sharing an optimizer, sharing some
amount of mapping the physical plan to Beam transforms. The
necessarily SDK-specific parts are the embedded transform API and
UDFs in the host language. The rest should remain an
implementation detail that we can change.
- For example, it is easy to imagine a customized columnar
element/bundle encoding and SDK harness that only works for SQL
to remove overhead of being general purpose. It could be written
in C/C++/Go if we wanted to squeeze it for perf. Such things are
made harder by having an elaborate end-user API between SQL and
the core Beam model.
- Conversely, for whatever is chosen to underlie SQL's
execution, stability is paramount. Ideally the simplest and least
likely to change transforms would be the foundation. And I
wouldn't want to have to design a user-friendly API for Euphoria
or the join library just to enable a different join algorithm in SQL.
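The idea of several front ends desugaring to shared relational algebra nodes can be sketched roughly as follows. This is a minimal illustration, not Beam code: the node classes (Scan, Filter, Project), the Frame class, and the from_parsed_sql and execute helpers are all hypothetical names, and the "SQL" front end takes already-parsed pieces instead of parsing a query string.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical relational algebra nodes shared by every front end.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    value: object  # keeps rows where row[column] == value


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


# Front end 1: stand-in for an already-parsed
# "SELECT columns FROM table WHERE column = value".
def from_parsed_sql(columns, table, where=None):
    node = Scan(table)
    if where is not None:
        node = Filter(node, where[0], where[1])
    return Project(node, tuple(columns))


# Front end 2: a DataFrame-like fluent API that builds the same nodes.
class Frame:
    def __init__(self, node):
        self.node = node

    @staticmethod
    def table(name):
        return Frame(Scan(name))

    def where(self, column, value):
        return Frame(Filter(self.node, column, value))

    def select(self, *columns):
        return Frame(Project(self.node, columns))


# Shared back end: one interpreter over in-memory lists of dict rows.
def execute(node, tables):
    if isinstance(node, Scan):
        return list(tables[node.table])
    if isinstance(node, Filter):
        rows = execute(node.child, tables)
        return [r for r in rows if r[node.column] == node.value]
    if isinstance(node, Project):
        rows = execute(node.child, tables)
        return [{c: r[c] for c in node.columns} for r in rows]
    raise TypeError(f"unknown plan node: {node!r}")


tables = {
    "users": [
        {"name": "jan", "country": "CZ"},
        {"name": "kenn", "country": "US"},
    ]
}
sql_plan = from_parsed_sql(["name"], "users", where=("country", "CZ"))
frame_plan = Frame.table("users").where("country", "CZ").select("name").node
```

Both front ends produce the same plan object here, so anything built on the nodes (an optimizer, a mapping to transforms) would only have to be written once.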
So my take is keep SQL flexible, implement SQL on low-level and
stable APIs, use join library, Euphoria, etc, if it looks like a
big win, but don't build any policy here or do big refactors
right now.
Kenn
On Mon, Dec 3, 2018 at 9:31 AM Jan Lukavský <je...@seznam.cz> wrote:
Hi Robert,
currently there is no actual proposal; I was just trying to gather
feedback from the community. But my original thoughts would be [1]. I
actually don't see much need for restructuring the code by nesting
directories. If the community sees that it would make sense to structure
the dependencies, the second step would probably be to figure out how to
accomplish this. I don't have any exact solution in mind so far; we
would probably first need to identify features that are needed by
SQL and not supported by Euphoria currently. Then we can actually
identify costs and see if this still makes sense.
Jan
On 12/3/18 6:17 PM, Robert Bradshaw wrote:
> Taking a step back, what exactly is the proposal? Looking at the
> original message, I see
>
> (1) Letting SQL take a dependency on Euphoria, sharing more code and
> taking advantage of the logical nesting of levels of abstraction. This
> makes sense to me.
> (2) Nesting the directories (but not the gradle targets or module
> names?). Here I'm not so sure about the benefit, especially vs. the
> cost.
> On Sat, Dec 1, 2018 at 8:38 AM Jan Lukavský <je...@seznam.cz> wrote:
>> I think that the fact that SQL uses some other internal dependency
>> should remain a hidden implementation detail. I absolutely agree that the
>> dependency should of course remain sdks-java-sql in all cases.
>>
>> Jan
>>
>> On 12/1/18 12:54 AM, Robert Bradshaw wrote:
>>> I suppose what I'm trying to say is that I see this module structure
>>> as a tool for discoverability and enumerating end-user endpoints. In
>>> other words, if one wants to use SQL, it would seem odd to have to
>>> depend on sdks-java-euphoria-sql rather than just sdks-java-sql if
>>> sdks-java-euphoria is also a DSL one might use. A sibling relationship
>>> does not prohibit the layered approach to implementation that sounds
>>> like it makes sense.
>>>
>>> (As for merging Euphoria into core, my initial impression is that's
>>> probably a good idea, and something we should consider for 3.0 at the
>>> very least.)
>>>
>>> On Fri, Nov 30, 2018 at 11:06 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>> Hi Rui,
>>>>
>>>> yes, there are optimizations that could be added by each layer. The
>>>> purpose of the Euphoria layer is actually not to reorder or modify any
>>>> user operators that are present in the pipeline (because it might not
>>>> have enough information to do this), but it can, for instance, choose
>>>> between various join implementations (shuffle join, broadcast join, ...),
>>>> so the optimizations it can do are more low-level. But this plays nicely
>>>> with the DSL hierarchy: each layer adds a few more restrictions, but can
>>>> therefore do more optimizations. And I think that the layer between the
>>>> SDK and SQL wouldn't have to support SQL optimizations; it would only
>>>> have to support a way for SQL to express these optimizations.
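A rough sketch of what "choosing between join implementations" could look like, operating on plain in-memory lists of (key, value) pairs. Everything here is an assumption made for illustration: the function names, the broadcast_limit threshold, and the use of list length as a size estimate are hypothetical and are not Euphoria's or Beam's API.

```python
from collections import defaultdict


def broadcast_join(large, small):
    """Inner join that builds a lookup table from the small side."""
    lookup = defaultdict(list)
    for key, value in small:
        lookup[key].append(value)
    # .get() avoids creating empty entries for keys missing from `small`.
    return [(k, (lv, sv)) for k, lv in large for sv in lookup.get(k, [])]


def shuffle_join(left, right, partitions=4):
    """Inner join that first partitions both sides by key hash."""
    left_parts = defaultdict(list)
    right_parts = defaultdict(list)
    for key, value in left:
        left_parts[hash(key) % partitions].append((key, value))
    for key, value in right:
        right_parts[hash(key) % partitions].append((key, value))
    joined = []
    for p in range(partitions):
        # Matching keys always land in the same partition, so each
        # partition can be joined locally and independently.
        joined.extend(broadcast_join(left_parts[p], right_parts[p]))
    return joined


def join(left, right, broadcast_limit=1000):
    """Pick a strategy from a size hint; return (strategy_name, rows)."""
    if len(right) <= broadcast_limit:
        return "broadcast", broadcast_join(left, right)
    return "shuffle", shuffle_join(left, right)


left = [("a", 1), ("b", 2), ("a", 3)]
right = [("a", "x"), ("c", "y")]
strategy_small, rows_small = join(left, right)
strategy_large, rows_large = join(left, right, broadcast_limit=1)
```

Both strategies return the same rows (possibly in a different order), which is what lets a lower layer pick one without the caller caring which was used.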
>>>>
>>>> Jan
>>>>
>>>> ---------- Original e-mail ----------
>>>> From: Rui Wang <ruw...@google.com>
>>>> To: dev@beam.apache.org
>>>> Date: 30. 11. 2018 22:43:04
>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>>
>>>> SQL's optimization is another area to consider for integration. SQL
>>>> optimization includes pushing down filters/projections, merging,
>>>> removing, or swapping plan nodes, and comparing plan costs to choose
>>>> the best plan. Adding another layer between SQL and the Java core might
>>>> require that layer to support SQL optimizations, if there is a need.
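As a small illustration of one of these optimizations, here is a sketch of pushing a filter below a projection in a toy plan tree. The node classes and the push_down_filters function are hypothetical and much simpler than what a real SQL planner does; in particular, the projection here only selects columns and never renames or computes them, which is what makes the rewrite safe.

```python
from dataclasses import dataclass
from typing import Tuple


# Hypothetical plan nodes for illustration only.
@dataclass(frozen=True)
class Scan:
    table: str


@dataclass(frozen=True)
class Project:
    child: object
    columns: Tuple[str, ...]


@dataclass(frozen=True)
class Filter:
    child: object
    column: str
    value: object  # keeps rows where row[column] == value


def push_down_filters(node):
    """Rewrite Filter(Project(x)) into Project(Filter(x)) where possible."""
    if isinstance(node, Scan):
        return node
    if isinstance(node, Project):
        return Project(push_down_filters(node.child), node.columns)
    if isinstance(node, Filter):
        child = push_down_filters(node.child)
        if isinstance(child, Project) and node.column in child.columns:
            # The filtered column passes through the projection unchanged,
            # so the filter can run before it, closer to the scan.
            pushed = Filter(child.child, node.column, node.value)
            return Project(push_down_filters(pushed), child.columns)
        return Filter(child, node.column, node.value)
    raise TypeError(f"unknown plan node: {node!r}")


plan = Filter(Project(Scan("users"), ("name", "country")), "country", "CZ")
optimized = push_down_filters(plan)
```

After the rewrite the filter sits directly on the scan, so fewer rows reach the projection.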
>>>>
>>>> I don't have a clear picture of what SQL needs from Euphoria for
>>>> optimization (best case is nothing). As those optimizations are
>>>> happening or will happen, we might start to get a sense of it.
>>>>
>>>> -Rui
>>>>
>>>> On Fri, Nov 30, 2018 at 12:38 PM Robert Bradshaw <rober...@google.com> wrote:
>>>>
>>>> I don't really see Euphoria as a subset of SQL or the other way
>>>> around, and I think it makes sense to use either without the other, so
>>>> by this criterion I'd keep them as siblings rather than nesting them.
>>>>
>>>> That said, I think it's really good to have a bunch of shared code,
>>>> e.g. a join library that could be used by both. One could even depend
>>>> on the other without having to abandon the sibling relationship.
>>>> Something like retractions belongs in the core SDK itself. Deeper than
>>>> that, actually, it should be part of the model.
>>>>
>>>> - Robert
>>>>
>>>> On Fri, Nov 30, 2018 at 7:20 PM David Morávek <d...@apache.org> wrote:
>>>>> Jan, we made Kryo optional recently (it is a separate module and is
>>>>> used only in tests). From a quick look it seems that we forgot to
>>>>> remove the compile-time dependency from Euphoria's build.gradle. The
>>>>> only "strong" dependencies I'm aware of are the core SDK and Guava.
>>>>> We'll probably be adding a sketching extension dependency soon.
>>>>>
>>>>> D.
>>>>>
>>>>> On Fri, Nov 30, 2018 at 7:08 PM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>> Hi Anton,
>>>>>> reactions inline.
>>>>>>
>>>>>> ---------- Original e-mail ----------
>>>>>> From: Anton Kedin <ke...@google.com>
>>>>>> To: dev@beam.apache.org
>>>>>> Date: 30. 11. 2018 18:17:06
>>>>>> Subject: Re: [DISCUSS] Structuring Java based DSLs
>>>>>>
>>>>>> I think this approach makes sense in general; Euphoria can be an
>>>>>> implementation detail of SQL, similar to the Join Library or core SDK
>>>>>> Schemas.
>>>>>>
>>>>>> I wonder though whether it would be better to bring Euphoria closer
>>>>>> to core SDK first, maybe even merge them together. If you look at
>>>>>> Reuven's recent work around schemas it seems like there are already
>>>>>> similarities between that and Euphoria's approach, unless I'm missing
>>>>>> the point (e.g. Filter transforms, FullJoin vs CoGroup... see [2]).
>>>>>> And we're already switching parts of SQL to those transforms (e.g. SQL
>>>>>> Aggregation is now implemented by core SDK's Group[3]).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Yes, these transforms seem to be very similar to those Euphoria has.
>>>>>> Whether or not to merge Euphoria with core is essentially just a
>>>>>> decision of the community (from my point of view).
>>>>>>
>>>>>>
>>>>>>
>>>>>> Adding explicit Schema support to Euphoria will bring it both closer
>>>>>> to core SDK and make it natural to use for SQL. Can this be a first
>>>>>> step towards this integration?
>>>>>>
>>>>>>
>>>>>>
>>>>>> Euphoria currently operates on pure PCollections, so when a
>>>>>> PCollection has a schema, it will be accessible by Euphoria. It makes
>>>>>> sense to make use of the schema in Euphoria. It seems natural on inputs
>>>>>> to Euphoria operators, but it might be tricky (not saying impossible) to
>>>>>> actually produce schema-aware PCollections as outputs from Euphoria
>>>>>> operators (generally speaking; in special cases that might be possible).
>>>>>> Regarding inputs, there is actually an intention to act on the type of
>>>>>> the PCollection: e.g. when a PCollection is already of type KV, it is
>>>>>> possible to make the key extractor and value extractor optional in
>>>>>> Euphoria builders. So it feels natural to enable changing the builders
>>>>>> when a schema-aware PCollection is given, and to make use of the
>>>>>> provided schema. The rest of the Euphoria team might correct me if I'm
>>>>>> wrong.
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>> One question I have is, does Euphoria bring dependencies that are not
>>>>>> needed by SQL, or does it more or less only rely on the core SDK?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I think the only relevant dependency that Euphoria has besides core
>>>>>> SDK is Kryo. It is the default coder when no coder is provided, but
>>>>>> that could be made optional, e.g. the default coder would be supported
>>>>>> only if an appropriate module were available. That way I think that
>>>>>> Euphoria has no special dependencies.
>>>>>>
>>>>>>
>>>>>>
>>>>>> [1] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms/Group.java#L73
>>>>>> [2] https://github.com/apache/beam/tree/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/core/src/main/java/org/apache/beam/sdk/schemas/transforms
>>>>>> [3] https://github.com/apache/beam/blob/f66eb5fe23b2500b396e6f711cdf4aeef6b31ab8/sdks/java/extensions/sql/src/main/java/org/apache/beam/sdk/extensions/sql/impl/rel/BeamAggregationRel.java#L179
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Nov 30, 2018 at 6:29 AM Jan Lukavský <je...@seznam.cz> wrote:
>>>>>>
>>>>>> Hi community,
>>>>>>
>>>>>> I'm part of the Euphoria DSL team, and on behalf of this team, I'd like
>>>>>> to discuss possible development of the Java based DSLs currently present
>>>>>> in Beam. To my knowledge, there are currently two DSLs based on the Java
>>>>>> SDK: Euphoria and SQL. These DSLs currently share only the SDK itself,
>>>>>> although there might be room to share some more effort. We already know
>>>>>> that both Euphoria and SQL have a need for retractions, but there are
>>>>>> probably many more features that these two could share.
>>>>>>
>>>>>> So, I'd like to open a discussion on what it would cost and what it
>>>>>> would possibly bring, if instead of the current structure
>>>>>>
>>>>>> Java SDK
>>>>>>
>>>>>> | ---- SQL
>>>>>>
>>>>>> | ---- Euphoria
>>>>>>
>>>>>> these DSLs would be structured as
>>>>>>
>>>>>> Java SDK ---> Euphoria ---> SQL
>>>>>>
>>>>>> I'm absolutely sure that this would be a great investment and a huge
>>>>>> change, but I'd like to gather some opinions and general feelings of the
>>>>>> community about this. Some points to start the discussion from my side
>>>>>> would be that structuring DSLs like this has internal logical
>>>>>> consistency, because each API layer further narrows completeness, but
>>>>>> brings a simpler API for simpler tasks, while adding an additional
>>>>>> high-level view of the data processing pipeline and thus enabling more
>>>>>> optimizations. On the Euphoria side, these are various join
>>>>>> implementations (the most effective implementation depends on the data),
>>>>>> pipeline sampling and more. Some (or maybe most) of these optimizations
>>>>>> would have to be implemented in both DSLs, so implementing them once is
>>>>>> beneficial. Another benefit is that this would bring Euphoria "closer"
>>>>>> to Beam core development (which would be good, it is part of the
>>>>>> project anyway, right? :)) and help better drive features that, although
>>>>>> currently needed mostly by SQL, might be needed by other Java users
>>>>>> anyway.
>>>>>>
>>>>>> Thanks for discussion and looking forward to any opinions.
>>>>>>
>>>>>> Jan
>>>>>>