Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

Timo Walther Fri, 11 Oct 2019 00:44:30 -0700

I'm fine with `type` for consistency reasons for now. I hope we will fixthat when we rework the properties design.

@Bowen: could you update the wiki page? I think we could start a vote,right?


Regards,
Timo

On 11.10.19 04:31, Jark Wu wrote:

Hi Timo,

I agree that we are going to rework properties soon.
But we may come up with a better name or a better way than "kind" when the
proposal is started and more people involved.
On the other hand, reworking properties should be a compatible way.
So I think it's fine to use "type" for now (keep consistent with others)
and change them all together when refactoring in the near future.

Regards,
Jark

On Thu, 10 Oct 2019 at 21:12, Timo Walther <[email protected]> wrote:

Hi Jark,

restricting one module instance per kind sounds good to me. Modules can
implement hashCode/equals and we can perform the check you metioned. The
equals() method can determine what "same kind" means.

Correct me if I'm wrong, but we wanted to perform a big properties
rework soonish anyway, no? Then we can also change `connector.type` to
`connector.kind`?

In Calcite they distinguish between `type` and `kind`:

SqlOperator(
        String name,
        SqlKind kind,
        int leftPrecedence,
        int rightPrecedence,
        SqlReturnTypeInference returnTypeInference,
        SqlOperandTypeInference operandTypeInference,
        SqlOperandTypeChecker operandTypeChecker

In Flink 1.9, we also started with
org.apache.flink.table.functions.FunctionDefinition#getKind because type
is just heavily overloaded with different meanings. Logical type, data
type, connector type, etc.

Regards,
Timo


On 10.10.19 14:40, Jark Wu wrote:

Hi Timo,

Thanks for the explanation, it makes sense to me.
So at least, we can have a validation to restrict one module instance per
type in the first version.

Regarding to "type" vs "kind", could we use "datatype" keyword to refer
data types exclusively in the future?
This can avoid the conflict/confusion when we use "type" here.
The concern of using "kind" is that it is inconsistent with other
descriptor properties.
We have heavily used "type" in properties, including `connector.type`,
`format.type`, `catalog.type`, etc...
Are we planning to change them all ?

Best,
Jark

On Thu, 10 Oct 2019 at 19:56, Timo Walther <[email protected]> wrote:

Hi Jark,

we had a long offline discussion yesterday where we considered all
options again. The reasons why we decided for the updated design that
Bowen suggested:

- Both Dawid and Xuefu argued that in the old design "kind" has binary
meanings. I agree here.
- Compared to other SQL concepts such as tables/functions/catalogs, a
"name" is never part of the object itself but always specified when
registering the object. We should have the same behavior for modules
because of consistency reasons. It also makes the "name" explicit when
it comes to unloading the module compared to the previous design.
- Regarding, "How to solve the class conflict problem?" this is an
orthogonal issue that will be fixed in future versions once we use
Flink's new plugin architecture. If a module is parameterizable, it is
the responsibility of the module to prevent class conflicts. If the Hive
classes are hidden in a module classloader, it should be possible to
load both hive121 and hive234. But in general this problem is unsolved
for now, also Kafka tables could clash if you read from two Kafka
clusters with different versions.

Regards,
Timo


On 10.10.19 08:01, Jark Wu wrote:

Hi Xuefu,

If there is only one instance per type, then what's the "name" used

for?

Could we remove it and only keep "type" or "kind" to identify modules?

Best,
Jark

On Thu, 10 Oct 2019 at 11:21, Xuefu Z <[email protected]> wrote:

Jark has a good point. However, I think validation logic can put in

place

to restrict one instance per type. Maybe the doc needs to be specific

on

this.

Thanks,
Xuefu

On Wed, Oct 9, 2019 at 7:41 PM Jark Wu <[email protected]> wrote:

Thanks Bowen for the updating.

I have some different opinions on the change.
IIUC, in the previous design, the "name" is also the "id" or "type"

to

identify which module to load. Which means we can only load one

instance

of

a module.
In the new design, the "name" is just an alias to the module

instance,

the

"kind" is used to identify modules. Which means we can load different
instances of a module.
However, what's the "name" or alias used for? Do we need to support

loading

different instances of a module? From my point of view, it brings

more

complexity and confusion.
For example, if we load a "hive121" which uses HiveModule with

version

1.2.1 and load a "hive234" which uses HiveModule with version 2.3.4,

then

how to solve the class conflict problem?

IMO, a module can only be load once in a session, so "name" maybe

useless.

So my proposal is similar to the previous one, but only change "name"

to

"kind".

      SQL:
            LOAD MODULE "kind" [WITH (properties)];
            UNLOAD MODULE "kind";
       Table:
            tEnv.loadModule("kind" [, properties]);
            tEnv.unloadModule("kind");

What do you think?


Best,
Jark





On Wed, 9 Oct 2019 at 20:38, Bowen Li <[email protected]> wrote:

Thanks everyone for your review.

After discussing with Timo and Dawid offline, as well as

incorporating

feedback from Xuefu and Jark on mailing list, I decided to make a

few

critical changes to the proposal.

- renamed the keyword "type" to "kind". The community has plan to

have

"type" keyword in yaml/descriptor refer to data types exclusively in

the

near future. We should cater to that change in our design
- allowed specifying names for modules to simplify and unify module
loading/unloading syntax between programming and SQL. Here're the

proposed

changes:
       SQL:
            LOAD MODULE "name" WITH ("kind"="xxx" [, (properties)])
            UNLOAD MODULE "name";
       Table:
            tEnv.loadModule("name", new Xxx(properties));
            tEnv.unloadModule("name");

I have completely updated the google doc [1]. Please take another

look,

and

let me know if you have any other questions. Thanks!

[1]

https://docs.google.com/document/d/17CPMpMbPDjvM4selUVEfh_tqUK_oV0TODAUA9dfHakc/edit#

On Tue, Oct 8, 2019 at 6:26 AM Jark Wu <[email protected]> wrote:

Hi Bowen,

Thanks for the proposal. I have two thoughts:

1) Regarding to "loadModule", how about
tableEnv.loadModule("xxx" [, propertiesMap]);
tableEnv.unloadModule(“xxx”);

This makes the API similar to SQL. IMO, instance of Module is not

needed

and verbose as parameter.
And this makes it easier to load a simple module without any

additional

properties, e.g. tEnv.loadModule("GEO"), tEnv.unloadModule("GEO")

2) In current design, the module interface only defines function

metadata,

but no implementations.
I'm wondering how to call/map the implementations in runtime? Am I

missing

something?

Besides, I left some minor comments in the doc.

Best,
Jark


On Sat, 5 Oct 2019 at 08:42, Xuefu Z <[email protected]> wrote:

I agree with Timo that the new table APIs need to be consistent.

I'd

go

further that an name (or id) is needed for module definition in

YAML

file.

In the current design, name is skipped and type has binary

meanings.

Thanks,
Xuefu

On Fri, Oct 4, 2019 at 5:24 AM Timo Walther <[email protected]>

wrote:

Hi everyone,

first, I was also questioning my proposal. But Bowen's proposal

of

`tEnv.offloadToYaml(<file_path>)` would not work with the current

design

because we don't know how to serialize a catalog or module into
properties. Currently, there is no converter from instance to
properties. It is a one way conversion. We can add a

`toProperties`

method to both Catalog and Module class in the future to solve

this.

Solving the table environment serializability can be future work.

However, I find the current proposal for the TableEnvironment

methods

is

contradicting:

tableEnv.loadModule(new Yyy());
tableEnv.unloadModule(“Xxx”);

The loading is specified programmatically whereas the unloading

requires

a string that is not specified in the module itself. But is

defined

in

the factory according to the current design.

SQL does it more consistently. There, the name `xxx` is used when
loading and unloading the module:

LOAD MODULE 'xxx' [WITH ('prop'='myProp', ...)]
UNLOAD MODULE 'xxx’

How about:

tableEnv.loadModule("xxx", new Yyy());
tableEnv.unloadModule(“xxx”);

This would be similar to the catalog interfaces. The name is not

part

of

the instance itself.

What do you think?

Thanks,
Timo




On 01.10.19 21:17, Bowen Li wrote:

If something like the yaml file is the way to go and achieve

such

motivation, we would cover that with current design.

On Tue, Oct 1, 2019 at 12:05 Bowen Li <[email protected]>

wrote:

Hi Timo, Dawid,

I've added the suggested SQL and related changes to

TableEnvironment

API

and other classes to the google doc. Also removed "USE MODULE"

and

its

APIs. Will update FLIP wiki once we have a consensus.

W.r.t. descriptor approach, my gut feeling is similar to

Dawid's.

Besides,

I feel yaml file would be a better solution to persist

serializable

state

of an environment as the file itself is in serializable format

already.

Though yaml file only serves SQL CLI at this moment, we may be

able

to

extend its reach to Table API and allow users to load/offload

TableEnvironment from/to yaml files, as something like

"TableEnvironment

tEnv = TableEnvironment.loadFromYaml(<file_path>)" and
"tEnv.offloadToYaml(<file_path>)" to restore and persist

state,

and

try

to

make yaml file more expressive.


On Tue, Oct 1, 2019 at 6:47 AM Dawid Wysakowicz <

[email protected]

wrote:

Hi Timo, Bowen,

Unfortunately I did not have enough time to go through all

the

suggestions in details so I can not comment on the whole

FLIP.

I just wanted to give my opinion on the "descriptor approach

in

loadModule" part. I am not sure if we need it here. We might

be

overthinking this a bit. It definitely makes sense for

objects

like

TableSource/TableSink etc. as they are logical definitions

that

nearly

always have to be persisted in a Catalog. I'm not sure if we

really

need

the same for a whole session. If we need a resume session

feature,

the

way to go would probably be to keep the session in memory on

the

server

side. I fear we will never be able to serialize the whole

session

entirely (temporary objects, objects derived from DataStream

etc.)

I think it is ok to use instances for objects like Catalogs

or

Modules

and have an overlay on top of that that can create instances

from

properties.

Best,

Dawid

On 01/10/2019 11:28, Timo Walther wrote:

Hi Bowen,

thanks for your response.

Re 2) I also don't have a better approach for this issue. It

is

similar to changing the general TableConfig between two

statements.

It

would be good to add your explanation to the design

document.

Re 3) It would be interesting to know about which "core"

functions

we

are actually talking about. Also for the overriding built-in

functions

that we discussed in the other FLIP. But I'm fine with

leaving

it

to

the user for now. How about we just introduce loadModule(),
unloadModule() methods instead of useModules()? This would

ensure

that

users don't forget to add the core module when adding an

additional

module and they need to explicitly call

"unloadModule('core')".

Re 4) Every table environment feature should also be

designed

with

SQL

statements in mind to verify the concept. SQL is also more

popular

that Java/Scala API or YAML file. I would like to add it to

1.10

for

marking the feature as complete.

SHOW MODULES -> sounds good to me, we should add a

listModules():

List<String> method to table environment

LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] --> we

should

add a

loadModule() method to table environment

UNLOAD MODULE 'hive' --> we should add a unloadModule()

method

to

table environment

I would not introduce `USE MODULES 'x' 'y' 'z'` for

simplicity

and

concise API. Users need to load the module anyway with

properties.

They can also load them "in order" immediately. CREATE TABLE

can

also

not create multiple tables but only one at a time in that

order.

One thing that came to my mind, shall we use a descriptor

approach

for

loadModule()? The past has shown that passing instances

causes

problems when persisting objects. That's why we also want to

get

rid

of registerTableSource. I could image that users might want

to

persist

a table environment's state for later use in the future.

Even

though

this is future work, we should already keep such use cases

in

mind

when adding new API methods. What do you think?

Regards,
Timo


On 30.09.19 23:17, Bowen Li wrote:

Hi Timo,

Re 1) I agree. I renamed the title to "Extend Core Table

System

with

Pluggable Modules" and all internal references

Re 2) First, I'll rename the API to useModules(). The

design

doesn't

forbid
users to call useModules() multi times. Objects in modules

are

loaded

on

demand instead of eagerly, so there won't be inconsistency.

Users

have to
be fully aware of the consequences of resetting modules as

that

might

cause
that some objects can not be referenced anymore or

resolution

order

of some
objects changes.

Re 3) Yes, we'd leave that to users.

Another approach can be to have a non-optional "Core"

module

for

all

objects that cannot be overrode like "CAST" and "AS"

functions,

and

have an
optional "ExtendedCore" module for other replaceable

built-in

objects.

"Core" should be positioned at the 1st in module list all

the

time.

I'm fine with either solution.

Re 4) It may sound like a nice-to-have advanced feature for

1.10,

but

we

can surely fully discuss it for the sake of feature

completeness.

Unlike other configs, the order of modules would matter in

Flink,

and

it

implies the LOAD/UNLOAD commands would not be equal in

operation

positions.
IIUYC, LOAD MODULE 'x' would be interpreted as appending x

to

the

end

of

module list, and UNLOAD MODULE 'x' would be interpreted as

removing x

from
any position in the list?

I'm thinking of the following list of commands:

SHOW MODULES - list modules in order
LOAD MODULE 'hive' [WITH ('prop'='myProp', ...)] - load and

append

the

module to end of the module list
UNLOAD MODULE 'hive' - remove the module from module list,

and

other

modules remain the same relative positions
USE MODULES 'x' 'y' 'z' (wondering can parser take "'x' 'y'

'z'"?),

or USE
MODULES 'x,y,z' - to reorder module list completely

--
Xuefu Zhang

"In Honey We Trust!"

--
Xuefu Zhang

"In Honey We Trust!"

Re: [DISCUSS] FLIP-68: Extend Core Table System with Modular Plugins

Reply via email to