Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Xuefu Z Fri, 06 Sep 2019 01:56:14 -0700

Hi Dawid,

Thank you for your summary. While the only difference in the two proposals
is one- or three-part in naming, the consequence would be substantial.


To me, there are two major use cases of temporary functions compared to
persistent ones:
1. Temporary in nature and auto managed by the session. More often than
not, admin doesn't even allow user to create persistent functions.
2. Provide an opportunity to overwriting system built-in functions.

Since built-in functions has one-part name, requiring three-part name for
temporary functions eliminates the overwriting opportunity.

One-part naming essentially puts all temp functions under a single
namespace and simplifies function resolution, such as we don't need to
consider the case of a temp function and a persistent function with the
same name under the same database.

I agree having three-parts does have its merits, such as consistency with
other temporary objects (table) and minor difference between temp vs
catalog functions. However, there is a slight difference between tables and
function in that there is no built-in table in SQL so there is no need to
overwrite it.

I'm not sure if I fully agree the benefits you listed as the advantages of
the three-part naming of temp functions.
  -- Allowing overwriting built-in functions is a benefit and the solution
for disallowing certain overwriting shouldn't be totally banning it.
  -- Catalog functions are defined by users, and we suppose they can
drop/alter it in any way they want. Thus, overwriting a catalog function
doesn't seem to be a strong use case that we should be concerned about.
Rather, there are known use case for overwriting built-in functions.

Thus, personally I would prefer one-part name for temporary functions. In
lack of SQL standard on this, I certainly like to get opinions from others
to see if a consensus can be eventually reached.

(To your point on modular approach to support external built-in functions,
we saw the value and are actively looking into it. Thanks for sharing your
opinion on that.)

Thanks,
Xuefu

On Fri, Sep 6, 2019 at 3:48 PM Dawid Wysakowicz <[email protected]>
wrote:

> Hi Xuefu,
>
> Thank you for your answers.
>
> Let me summarize my understanding. In principle we differ only in regards
> to the fact if a temporary function can be only 1-part or only 3-part
> identified. I can reconfirm that if the community decides it prefers the
> 1-part approach I will commit to that, with the assumption that we will
> force ONLY 1-part function names. (We will parse identifier and throw
> exception if a user tries to register e.g. db.temp_func).
>
> My preference is though the 3-part approach:
>
>    - there are some functions that it makes no sense to override, e.g.
>    CAST, moreover I'm afraid that allowing overriding such will lead to high
>    inconsistency, similar to those that I mentioned spark has
>    - you cannot shadow a fully-qualified function. (If a user fully
>    qualifies his/her objects in a SQL query, which is often considered a good
>    practice)
>    - it does not differentiate between functions & temporary functions.
>    Temporary functions just differ with regards to their life-cycle. The
>    registration & usage is exactly the same.
>
> As it can be seen, the proposed concept regarding temp function and
> function resolution is quite simple.
>
> Both approaches are equally simple. I would even say the 3-part approach
> is slightly simpler as it does not have to care about some special built-in
> functions such as CAST.
>
> I don't want to express my opinion on the differentiation between built-in
> functions and "external" built-in functions in this thread as it is rather
> orthogonal, but I also like the modular approach and I definitely don't
> like the special syntax "cat::function". I think it's better to stick to a
> standard or at least other proved solutions from other systems.
>
> Best,
>
> Dawid
> On 05/09/2019 10:12, Xuefu Z wrote:
>
> Hi David,
>
> Thanks for sharing your thoughts and  request for clarifications. I believe
> that I fully understood your proposal, which does has its merit. However,
> it's different from ours. Here are the answers to your questions:
>
> Re #1: yes, the temp functions in the proposal are global and have just
> one-part names, similar to built-in functions. Two- or three-part names are
> not allowed.
>
> Re #2: not applicable as two- or three-part names are disallowed.
>
> Re #3: same as above. Referencing external built-in functions is achieved
> either implicitly (only the built-in functions in the current catalogs are
> considered) or via special syntax such as cat::function. However, we are
> looking into the modular approach that Time suggested with other feedback
> received from the community.
>
> Re #4: the resolution order goes like the following in our proposal:
>
> 1. temporary functions
> 2. bulit-in functions (including those augmented by add-on modules)
> 3. built-in functions in current catalog (this will not be needed if the
> special syntax "cat::function" is required)
> 4. functions in current catalog and db.
>
> If we go with the modular approach and make external built-in functions as
> an add-on module, the 2 and 3 above will be combined. In essence, the
> resolution order is equivalent in the two approaches.
>
> By the way, resolution order matters only for simple name reference. For
> names such as db.function (interpreted as current_cat/db/function) or
> cat.db.function, the reference is unambiguous, so on resolution is needed.
>
> As it can be seen, the proposed concept regarding temp function and
> function resolution is quite simple. Additionally, the proposed resolution
> order allows temp function to shadow a built-in function, which is
> important (though not decisive) in our opinion.
>
> I started liking the modular approach as the resolution order will only
> include 1, 2, and 4, which is simpler and more generic. That's why I
> suggested we look more into this direction.
>
> Please let me know if there are further questions.
>
> Thanks,
> Xuefu
>
>
>
>
> On Thu, Sep 5, 2019 at 2:42 PM Dawid Wysakowicz <[email protected]> 
> <[email protected]>
> wrote:
>
>
> Hi Xuefu,
>
> Just wanted to summarize my opinion on the one topic (temporary functions).
>
> My preference would be to make temporary functions always 3-part qualified
> (as a result that would prohibit overriding built-in functions). Having
> said that if the community decides that it's better to allow overriding
> built-in functions I am fine with it and can commit to that decision.
>
> I wanted to ask if you could clarify a few points for me around that
> option.
>
>    1. Would you enforce temporary functions to be always just a single
>    name (without db & cat) as hive does, or would you allow also 3 or even 2
>    part identifiers?
>    2. Assuming 2/3-part paths. How would you register a function from a
>    following statement: CREATE TEMPORARY FUNCTION db.func? Would that shadow
>    all functions named 'func' in all databases named 'db' in all catalogs? Or
>    would you shadow only function 'func' in database 'db' in current catalog?
>    3. This point is still under discussion, but was mentioned a few
>    times, that maybe we want to enable syntax cat.func for "external built-in
>    functions". How would that affect statement from previous point? Would
>    'db.func' shadow "external built-in function" in 'db' catalog or user
>    functions as in point 2? Or maybe both?
>    4. Lastly in fact to summarize the previous points. Assuming 2/3-part
>    paths. Would the function resolution be actually as follows?:
>       1. temporary functions (1-part path)
>       2. built-in functions
>       3. temporary functions (2-part path)
>       4. 2-part catalog functions a.k.a. "external built-in functions"
>       (cat + func) - this is still under discussion, if we want that in the 
> other
>       focal point
>       5. temporary functions (3-part path)
>       6. 3-part catalog functions a.k.a. user functions
>
> I would be really grateful if you could explain me those questions, thanks.
>
> BTW, Thank you all for a healthy discussion.
>
> Best,
>
> Dawid
> On 04/09/2019 23:25, Xuefu Z wrote:
>
> Thank all for the sharing thoughts. I think we have gathered some useful
> initial feedback from this long discussion with a couple of focal points
> sticking out.
>
>  We will go back to do more research and adapt our proposal. Once it's
> ready, we will ask for a new round of review. If there is any disagreement,
> we will start a new discussion thread on each rather than having a mega
> discussion like this.
>
> Thanks to everyone for participating.
>
> Regards,
> Xuefu
>
>
> On Thu, Sep 5, 2019 at 2:52 AM Bowen Li <[email protected]> 
> <[email protected]> <[email protected]> <[email protected]> wrote:
>
>
> Let me try to summarize and conclude the long thread so far:
>
> 1. For order of temp function v.s. built-in function:
>
> I think Dawid's point that temp function should be of fully qualified path
> is a better reasoning to back the newly proposed order, and i agree we
> don't need to follow Hive/Spark.
>
> However, I'd rather not change fundamentals of temporary functions in this
> FLIP. It belongs to a bigger story of how temporary objects should be
> redefined and be handled uniformly - currently temporary tables and views
> (those registered from TableEnv#registerTable()) behave different than what
> Dawid propose for temp functions, and we need a FLIP to just unify their
> APIs and behaviors.
>
> I agree that backward compatibility is not an issue w.r.t Jark's points.
>
> ***Seems we do have consensus that it's acceptable to prevent users
> registering a temp function in the same name as a built-in function. To
> help us move forward, I'd like to propose setting such a restraint on temp
> functions in this FLIP to simplify the design and avoid disputes.*** It
> will also leave rooms for improvements in the future.
>
>
> 2. For Hive built-in function:
>
> Thanks Timo for providing the Presto and Postgres examples. I feel modular
> built-in functions can be a good fit for the geo and ml example as a native
> Flink extension, but not sure if it fits well with external integrations.
> Anyway, I think modular built-in functions is a bigger story and can be on
> its own thread too, and our proposal doesn't prevent Flink from doing that
> in the future.
>
> ***Seems we have consensus that users should be able to use built-in
> functions of Hive or other external systems in SQL explicitly and
> deterministically regardless of Flink built-in functions and the potential
> modular built-in functions, via some new syntax like "mycat::func"? If so,
> I'd like to propose removing Hive built-in functions from ambiguous
> function resolution order, and empower users with such a syntax. This way
> we sacrifice a little convenience for certainty***
>
>
> What do you think?
>
> On Wed, Sep 4, 2019 at 7:02 AM Dawid Wysakowicz <[email protected]> 
> <[email protected]> <[email protected]> <[email protected]>
> wrote:
>
>
> Hi,
>
> Regarding the Hive & Spark support of TEMPORARY FUNCTIONS. I've just
> performed some experiments (hive-2.3.2 & spark 2.4.4) and I think they
>
> are
>
> very inconsistent in that manner (spark being way worse on that).
>
> Hive:
>
> You cannot overwrite all the built-in functions. I could overwrite most
>
> of
>
> the functions I tried e.g. length, e, pi, round, rtrim, but there are
> functions I cannot overwrite e.g. CAST, ARRAY I get:
>
>
> *    ParseException line 1:29 cannot recognize input near 'array' 'AS' *
>
> What is interesting is that I cannot ovewrite *array*, but I can ovewrite
> *map* or *struct*. Though hive behaves reasonable well if I manage to
> overwrite a function. When I drop the temporary function the native
> function is still available.
>
> Spark:
>
> Spark's behavior imho is super bad.
>
> Theoretically I could overwrite all functions. I was able e.g. to
> overwrite CAST function. I had to use though CREATE OR REPLACE TEMPORARY
> FUNCTION syntax. Otherwise I get an exception that a function already
> exists. However when I used the CAST function in a query it used the
> native, built-in one.
>
> When I overwrote current_date() function, it was used in a query, but it
> completely replaces the built-in function and I can no longer use the
> native function in any way. I cannot also drop the temporary function. I
> get:
>
> *    Error in query: Cannot drop native function 'current_date';*
>
> Additional note, both systems do not allow creating TEMPORARY FUNCTIONS
> with a database. Temporary functions are always represented as a single
> name.
>
> In my opinion neither of the systems have consistent behavior. Generally
> speaking I think overwriting any system provided functions is just
> dangerous.
>
> Regarding Jark's concerns. Such functions would be registered in a
>
> current
>
> catalog/database schema, so a user could still use its own function, but
> would have to fully qualify the function (because built-in functions take
> precedence). Moreover users would have the same problem with permanent
> functions. Imagine a user have a permanent function 'cat.db.explode'. In
> 1.9 the user could use just the 'explode' function as long as the 'cat' &
> 'db' were the default catalog & database. If we introduce 'explode'
> built-in function in 1.10, the user has to fully qualify the function.
>
> Best,
>
> Dawid
> On 04/09/2019 15:19, Timo Walther wrote:
>
> Hi all,
>
> thanks for the healthy discussion. It is already a very long discussion
> with a lot of text. So I will just post my opinion to a couple of
> statements:
>
>
> Hive built-in functions are not part of Flink built-in functions, they
>
> are catalog functions
>
> That is not entirely true. Correct me if I'm wrong but I think Hive
> built-in functions are also not catalog functions. They are not stored in
> every Hive metastore catalog that is freshly created but are a set of
> functions that are listed somewhere and made available.
>
>
> ambiguous functions reference just shouldn't be resolved to a different
>
> catalog
>
> I agree. They should not be resolved to a different catalog. That's why I
> am suggesting to split the concept of built-in functions and catalog
>
> lookup
>
> semantics.
>
>
> I don't know if any other databases handle built-in functions like that
>
> What I called "module" is:
> - Extension in Postgres [1]
> - Plugin in Presto [2]
>
> Btw. Presto even mentions example modules that are similar to the ones
> that we will introduce in the near future both for ML and System XYZ
> compatibility:
> "See either the presto-ml module for machine learning functions or the
> presto-teradata-functions module for Teradata-compatible functions, both
>
> in
>
> the root of the Presto source."
>
>
> functions should be either built-in already or just libraries
>
> functions,
>
> and library functions can be adapted to catalog APIs or of some other
> syntax to use
>
> Regarding "built-in already", of course we can add a lot of functions as
> built-ins but we will end-up in a dependency hell in the near future if
>
> we
>
> don't introduce a pluggable approach. Library functions is what you also
> suggest but storing them in a catalog means to always fully qualify them
>
> or
>
> modifying the existing catalog design that was inspired by the standard.
>
> I don't think "it brings in even more complicated scenarios to the
> design", it just does clear separation of concerns. Integrating the
> functionality into the current design makes the catalog API more
> complicated.
>
>
> why would users name a temporary function the same as a built-in
>
> function then?
>
> Because you never know what users do. If they don't, my suggested
> resolution order should not be a problem, right?
>
>
> I don't think hive functions deserves be a function module
>
> Our goal is not to create a Hive clone. We need to think forward and Hive
> is just one of many systems that we can support. Not every built-in
> function behaves and will behave exactly like Hive.
>
>
> regarding temporary functions, there are few systems that support it
>
> IMHO Spark and Hive are not always the best examples for consistent
> design. Systems like Postgres, Presto, or SQL Server should be used as a
> reference. I don't think that a user can overwrite a built-in function
> there.
>
> Regards,
> Timo
>
> [1] https://www.postgresql.org/docs/10/extend-extensions.html
> [2] https://prestodb.github.io/docs/current/develop/functions.html
>
>
> On 04.09.19 13:44, Jark Wu wrote:
>
> Hi all,
>
> Regarding #1 temp function <> built-in function and naming.
> I'm fine with temp functions should precede built-in function and can
> override built-in functions (we already support to override built-in
> function in 1.9).
> If we don't allow the same name as a built-in function, I'm afraid we
>
> will
>
> have compatibility issues in the future.
> Say users register a user defined function named "explode" in 1.9, and we
> support a built-in "explode" function in 1.10.
> Then the user's jobs which call the registered "explode" function in 1.9
> will all fail in 1.10 because of naming conflict.
>
> Regarding #2 "External" built-in functions.
> I think if we store external built-in functions in catalog, then
> "hive1::sqrt" is a good way to go.
> However, I would prefer to support a discovery mechanism (e.g. SPI) for
> built-in functions as Timo suggested above.
> This gives us the flexibility to add Hive or MySQL or Geo or whatever
> function set as built-in functions in an easy way.
>
> Best,
> Jark
>
> On Wed, 4 Sep 2019 at 17:47, Xuefu Z <[email protected]> <[email protected]> 
> <[email protected]> <[email protected]><[email protected]> 
> <[email protected]> <[email protected]> <[email protected]> wrote:
>
> Hi David,
>
> Thank you for sharing your findings. It seems to me that there is no SQL
> standard regarding temporary functions. There are few systems that
>
> support
>
> it. Here are what I have found:
>
> 1. Hive: no DB qualifier allowed. Can overwrite built-in.
> 2. Spark: basically follows Hive (
>
>
>
> https://docs.databricks.com/spark/latest/spark-sql/language-manual/create-function.html
>
> )
> 3. SAP SQL Anywhere Server: can have owner (db?). Not sure of overwriting
> behavior. 
> (http://dcx.sap.com/sqla170/en/html/816bdf316ce210148d3acbebf6d39b18.html
>
> )
>
> Because of lack of standard, it's perfectly fine for Flink to define
> whatever it sees appropriate. Thus, your proposal (no overwriting and
>
> must
>
> have DB as holder) is one option. The advantage is simplicity, The
> downside
> is the deviation from Hive, which is popular and de facto standard in big
> data world.
>
> However, I don't think we have to follow Hive. More importantly, we need
>
> a
>
> consensus. I have no objection if your proposal is generally agreed upon.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 11:58 PM Dawid Wysakowicz 
> <[email protected]<[email protected]> <[email protected]> 
> <[email protected]> <[email protected]>
> wrote:
>
> Hi all,
>
> Just an opinion on the built-in <> temporary functions resolution and
> NAMING issue. I think we should not allow overriding the built-in
> functions, as this may pose serious issues and to be honest is rather
> not feasible and would require major rework. What happens if a user
> wants to override CAST? Calls to that function are generated at
> different layers of the stack that unfortunately does not always go
> through the Catalog API (at least yet). Moreover from what I've checked
> no other systems allow overriding the built-in functions. All the
> systems I've checked so far register temporary functions in a
> database/schema (either special database for temporary functions, or
> just current database). What I would suggest is to always register
> temporary functions with a 3 part identifier. The same way as tables,
> views etc. This effectively means you cannot override built-in
> functions. With such approach it is natural that the temporary functions
> end up a step lower in the resolution order:
>
> 1. built-in functions (1 part, maybe 2? - this is still under discussion)
>
> 2. temporary functions (always 3 part path)
>
> 3. catalog functions (always 3 part path)
>
> Let me know what do you think.
>
> Best,
>
> Dawid
>
> On 04/09/2019 06:13, Bowen Li wrote:
>
> Hi,
>
> I agree with Xuefu that the main controversial points are mainly the
>
> two
>
> places. My thoughts on them:
>
> 1) Determinism of referencing Hive built-in functions. We can either
>
> remove
>
> Hive built-in functions from ambiguous function resolution and require
> users to use special syntax for their qualified names, or add a config
>
> flag
>
> to catalog constructor/yaml for turning on and off Hive built-in
>
> functions
>
> with the flag set to 'false' by default and proper doc added to help
>
> users
>
> make their decisions.
>
> 2) Flink temp functions v.s. Flink built-in functions in ambiguous
>
> function
>
> resolution order. We believe Flink temp functions should precede Flink
> built-in functions, and I have presented my reasons. Just in case if we
> cannot reach an agreement, I propose forbid users registering temp
> functions in the same name as a built-in function, like MySQL's
>
> approach,
>
> for the moment. It won't have any performance concern, since built-in
> functions are all in memory and thus cost of a name check will be
>
> really
>
> trivial.
>
>
> On Tue, Sep 3, 2019 at 8:01 PM Xuefu Z <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]> wrote:
>
>  From what I have seen, there are a couple of focal disagreements:
>
> 1. Resolution order: temp function --> flink built-in function -->
>
> catalog
>
> function vs flink built-in function --> temp function -> catalog
>
> function.
>
> 2. "External" built-in functions: how to treat built-in functions in
> external system and how users reference them
>
> For #1, I agree with Bowen that temp function needs to be at the
>
> highest
>
> priority because that's how a user might overwrite a built-in function
> without referencing a persistent, overwriting catalog function with a
>
> fully
>
> qualified name. Putting built-in functions at the highest priority
> eliminates that usage.
>
> For #2, I saw a general agreement on referencing "external" built-in
> functions such as those in Hive needs to be explicit and deterministic
>
> even
>
> though different approaches are proposed. To limit the scope and
>
> simply
>
> the
>
> usage, it seems making sense to me to introduce special syntax for
>
> user  to
>
> explicitly reference an external built-in function such as hive1::sqrt
>
> or
>
> hive1._built_in.sqrt. This is a DML syntax matching nicely Catalog API
>
> call
>
> hive1.getFunction(ObjectPath functionName) where the database name is
> absent for bulit-in functions available in that catalog hive1. I
>
> understand
>
> that Bowen's original proposal was trying to avoid this, but this
>
> could
>
> turn out to be a clean and simple solution.
>
> (Timo's modular approach is great way to "expand" Flink's built-in
>
> function
>
> set, which seems orthogonal and complementary to this, which could be
> tackled in further future work.)
>
> I'd be happy to hear further thoughts on the two points.
>
> Thanks,
> Xuefu
>
> On Tue, Sep 3, 2019 at 7:11 PM Kurt Young <[email protected]> 
> <[email protected]> <[email protected]> <[email protected]><[email protected]> 
> <[email protected]> <[email protected]> <[email protected]> wrote:
>
> Thanks Timo & Bowen for the feedback. Bowen was right, my proposal is
>
> the
>
> same
> as Bowen's. But after thinking about it, I'm currently lean to Timo's
> suggestion.
>
> The reason is backward compatibility. If we follow Bowen's approach,
>
> let's
>
> say we
> first find function in Flink's built-in functions, and then hive's
> built-in. For example, `foo`
> is not supported by Flink, but hive has such built-in function. So
>
> user
>
> will have hive's
> behavior for function `foo`. And in next release, Flink realize this
>
> is a
>
> very popular function
> and add it into Flink's built-in functions, but with different
>
> behavior
>
> as
>
> hive's. So in next
> release, the behavior changes.
>
> With Timo's approach, IIUC user have to tell the framework explicitly
>
> what
>
> kind of
> built-in functions he would like to use. He can just tell framework
>
> to
>
> abandon Flink's built-in
> functions, and use hive's instead. User can only choose between them,
>
> but
>
> not use
> them at the same time. I think this approach is more predictable.
>
> Best,
> Kurt
>
>
> On Wed, Sep 4, 2019 at 8:00 AM Bowen Li <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]> wrote:
>
> Hi all,
>
> Thanks for the feedback. Just a kindly reminder that the [Proposal]
>
> section
>
> in the google doc was updated, please take a look first and let me
>
> know
>
> if
>
> you have more questions.
>
> On Tue, Sep 3, 2019 at 4:57 PM Bowen Li <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]>
>
> wrote:
>
> Hi Timo,
>
> Re> 1) We should not have the restriction "hive built-in functions
>
> can
>
> only
>
> be used when current catalog is hive catalog". Switching a catalog
> should only have implications on the cat.db.object resolution but
>
> not
>
> functions. It would be quite convinient for users to use Hive
>
> built-ins
>
> even if they use a Confluent schema registry or just the in-memory
>
> catalog.
>
> There might be a misunderstanding here.
>
> First of all, Hive built-in functions are not part of Flink
>
> built-in
>
> functions, they are catalog functions, thus if the current catalog
>
> is
>
> not a
>
> HiveCatalog but, say, a schema registry catalog, ambiguous
>
> functions
>
> reference just shouldn't be resolved to a different catalog.
>
> Second, Hive built-in functions can potentially be referenced
>
> across
>
> catalog, but it doesn't have db namespace and we currently just
>
> don't
>
> have
>
> a SQL syntax for it. It can be enabled when such a SQL syntax is
>
> defined,
>
> e.g. "catalog::function", but it's out of scope of this FLIP.
>
> 2) I would propose to have separate concepts for catalog and
>
> built-in
>
> functions. In particular it would be nice to modularize built-in
> functions. Some built-in functions are very crucial (like AS, CAST,
> MINUS), others are more optional but stable (MD5, CONCAT_WS), and
>
> maybe
>
> we add more experimental functions in the future or function for
>
> some
>
> special application area (Geo functions, ML functions). A data
>
> platform
>
> team might not want to make every built-in function available. Or a
> function module like ML functions is in a different Maven module.
>
> I think this is orthogonal to this FLIP, especially we don't have
>
> the
>
> "external built-in functions" anymore and currently the built-in
>
> function
>
> category remains untouched.
>
> But just to share some thoughts on the proposal, I'm not sure about
>
> it:
>
> - I don't know if any other databases handle built-in functions
>
> like
>
> that.
>
> Maybe you can give some examples? IMHO, built-in functions are
>
> system
>
> info
>
> and should be deterministic, not depending on loaded libraries. Geo
> functions should be either built-in already or just libraries
>
> functions,
>
> and library functions can be adapted to catalog APIs or of some
>
> other
>
> syntax to use
> - I don't know if all use cases stand, and many can be achieved by
>
> other
>
> approaches too. E.g. experimental functions can be taken good care
>
> of
>
> by
>
> documentations, annotations, etc
> - the proposal basically introduces some concept like a pluggable
>
> built-in
>
> function catalog, despite the already existing catalog APIs
> - it brings in even more complicated scenarios to the design. E.g.
>
> how
>
> do
>
> you handle built-in functions in different modules but different
>
> names?
>
> In short, I'm not sure if it really stands and it looks like an
>
> overkill
>
> to me. I'd rather not go to that route. Related discussion can be
>
> on
>
> its
>
> own thread.
>
> 3) Following the suggestion above, we can have a separate discovery
> mechanism for built-in functions. Instead of just going through a
>
> static
>
> list like in BuiltInFunctionDefinitions, a platform team should be
>
> able
>
> to select function modules like
> catalogManager.setFunctionModules(CoreFunctions, GeoFunctions,
> HiveFunctions) or via service discovery;
>
> Same as above. I'll leave it to its own thread.
>
> re > 3) Dawid and I discussed the resulution order again. I agree
>
> with
>
> Kurt
>
> that we should unify built-in function (external or internal)
>
> under a
>
> common layer. However, the resolution order should be:
>    1. built-in functions
>    2. temporary functions
>    3. regular catalog resolution logic
> Otherwise a temporary function could cause clashes with Flink's
>
> built-in
>
> functions. If you take a look at other vendors, like SQL Server
>
> they
>
> also do not allow to overwrite built-in functions.
>
> ”I agree with Kurt that we should unify built-in function (external
>
> or
>
> internal) under a common layer.“ <- I don't think this is what Kurt
>
> means.
>
> Kurt and I are in favor of unifying built-in functions of external
>
> systems
>
> and catalog functions. Did you type a mistake?
>
> Besides, I'm not sure about the resolution order you proposed.
>
> Temporary
>
> functions have a lifespan over a session and are only visible to
>
> the
>
> session owner, they are unique to each user, and users create them
>
> on
>
> purpose to be the highest priority in order to overwrite system
>
> info
>
> (built-in functions in this case).
>
> In your case, why would users name a temporary function the same
>
> as a
>
> built-in function then? Since using that name in ambiguous function
> reference will always be resolved to built-in functions, creating a
> same-named temp function would be meaningless in the end.
>
>
> On Tue, Sep 3, 2019 at 1:44 PM Bowen Li <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]>
>
> wrote:
>
> Hi Jingsong,
>
> Re> 1.Hive built-in functions is an intermediate solution. So we
>
> should
>
> not introduce interfaces to influence the framework. To make
> Flink itself more powerful, we should implement the functions
> we need to add.
>
> Yes, please see the doc.
>
> Re> 2.Non-flink built-in functions are easy for users to change
>
> their
>
> behavior. If we support some flink built-in functions in the
> future but act differently from non-flink built-in, this will
>
> lead
>
> to
>
> changes in user behavior.
>
> There's no such concept as "external built-in functions" any more.
> Built-in functions of external systems will be treated as special
>
> catalog
>
> functions.
>
> Re> Another question is, does this fallback include all
>
> hive built-in functions? As far as I know, some hive functions
> have some hacky. If possible, can we start with a white list?
> Once we implement some functions to flink built-in, we can
> also update the whitelist.
>
> Yes, that's something we thought of too. I don't think it's super
> critical to the scope of this FLIP, thus I'd like to leave it to
>
> future
>
> efforts as a nice-to-have feature.
>
>
> On Tue, Sep 3, 2019 at 1:37 PM Bowen Li <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]>
>
> wrote:
>
> Hi Kurt,
>
> Re: > What I want to propose is we can merge #3 and #4, make them
>
> both
>
> under
>
> "catalog" concept, by extending catalog function to make it have
>
> ability to
>
> have built-in catalog functions. Some benefits I can see from
>
> this
>
> approach:
>
> 1. We don't have to introduce new concept like external built-in
>
> functions.
>
> Actually I don't see a full story about how to treat a built-in
>
> functions, and it
>
> seems a little bit disrupt with catalog. As a result, you have
>
> to
>
> make
>
> some restriction
>
> like "hive built-in functions can only be used when current
>
> catalog
>
> is
>
> hive catalog".
>
> Yes, I've unified #3 and #4 but it seems I didn't update some
>
> part
>
> of
>
> the doc. I've modified those sections, and they are up to date
>
> now.
>
> In short, now built-in function of external systems are defined
>
> as
>
> a
>
> special kind of catalog function in Flink, and handled by Flink
>
> as
>
> following:
> - An external built-in function must be associated with a catalog
>
> for
>
> the purpose of decoupling flink-table and external systems.
> - It always resides in front of catalog functions in ambiguous
>
> function
>
> reference order, just like in its own external system
> - It is a special catalog function that doesn’t have a
>
> schema/database
>
> namespace
> - It goes thru the same instantiation logic as other user defined
> catalog functions in the external system
>
> Please take another look at the doc, and let me know if you have
>
> more
>
> questions.
>
>
> On Tue, Sep 3, 2019 at 7:28 AM Timo Walther <[email protected]> 
> <[email protected]> <[email protected]> 
> <[email protected]><[email protected]> <[email protected]> 
> <[email protected]> <[email protected]>
>
> wrote:
>
> Hi Kurt,
>
> it should not affect the functions and operations we currently
>
> have
>
> in
>
> SQL. It just categorizes the available built-in functions. It is
>
> kind
>
> of
> an orthogonal concept to the catalog API but built-in functions
>
> deserve
>
> this special kind of treatment. CatalogFunction still fits
>
> perfectly
>
> in
>
> there because the regular catalog object resolution logic is not
> affected. So tables and functions are resolved in the same way
>
> but
>
> with
>
> built-in functions that have priority as in the original design.
>
> Regards,
> Timo
>
>
> On 03.09.19 15:26, Kurt Young wrote:
>
> Does this only affect the functions and operations we currently
>
> have
>
> in SQL
>
> and
> have no effect on tables, right? Looks like this is an
>
> orthogonal
>
> concept
>
> with Catalog?
> If the answer are both yes, then the catalog function will be a
>
> weird
>
> concept?
>
> Best,
> Kurt
>
>
> On Tue, Sep 3, 2019 at 8:10 PM Danny Chan <[email protected]
>
> wrote:
>
> The way you proposed are basically the same as what Calcite
>
> does, I
>
> think
>
> we are in the same line.
>
> Best,
> Danny Chan
> 在 2019年9月3日 +0800 PM7:57，Timo Walther <[email protected]
>
> ，写道：
>
> This sounds exactly as the module approach I mentioned, no?
>
> Regards,
> Timo
>
> On 03.09.19 13:42, Danny Chan wrote:
>
> Thanks Bowen for bring up this topic, I think it’s a useful
>
> refactoring to make our function usage more user friendly.
>
> For the topic of how to organize the builtin operators and
>
> operators
>
> of Hive, here is a solution from Apache Calcite, the Calcite
>
> way
>
> is
>
> to make
>
> every dialect operators a “Library”, user can specify which
>
> libraries they
>
> want to use for a sql query. The builtin operators always
>
> comes
>
> as
>
> the
>
> first class objects and the others are used from the order
>
> they
>
> appears.
>
> Maybe you can take a reference.
>
> [1]
>
>
>
> https://github.com/apache/calcite/commit/9a4eab5240d96379431d14a1ac33bfebaf6fbb28
>
> Best,
> Danny Chan
> 在 2019年8月28日 +0800 AM2:50，Bowen Li <[email protected]
>
> ，写道：
>
> Hi folks,
>
> I'd like to kick off a discussion on reworking Flink's
>
> FunctionCatalog.
>
> It's critically helpful to improve function usability in
>
> SQL.
>
>
>
> https://docs.google.com/document/d/1w3HZGj9kry4RsKVCduWp82HkW6hhgi2unnvOAUS72t8/edit?usp=sharing
>
> In short, it:
> - adds support for precise function reference with
>
> fully/partially
>
> qualified name
> - redefines function resolution order for ambiguous
>
> function
>
> reference
>
> - adds support for Hive's rich built-in functions (support
>
> for
>
> Hive
>
> user
>
> defined functions was already added in 1.9.0)
> - clarifies the concept of temporary functions
>
> Would love to hear your thoughts.
>
> Bowen
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>
>
> --
> Xuefu Zhang
>
> "In Honey We Trust!"
>
>
>
>
>
>
>

-- 
Xuefu Zhang

"In Honey We Trust!"

Re: [DISCUSS] FLIP-57 - Rework FunctionCatalog

Reply via email to