Re: [DISCUSS] Features for Apache Flink 1.9.0

2019-06-22 Thread Zhang, Xuefu
To add, Hive integration depends on a few features that are actively developed. 
If the completion of those features don't leave enough time for us to 
integrate, then our work can potentially go beyond the proposed date.

Just wanted to point out such a dependency adds uncertainty.

Thanks,
Xuefu


--
From:Tzu-Li (Gordon) Tai 
Sent At:2019 Jun. 20 (Thu.) 01:01
To:dev 
Cc:Xuefu ; Timo Walther ; Dawid 
Wysakowicz 
Subject:Re: [DISCUSS] Features for Apache Flink 1.9.0

Hi all,

Thanks for all the updates and work!
From the looks so far, overall it seems like we are still in a good spot to 
officially announce the feature freeze date to be on the originally proposed 
date, June 28.

I’ll announce this in a separate thread.

Cheers,
Gordon

On Fri, Jun 7, 2019 at 2:31 AM Bowen Li  wrote:
For features I'm involved in:

 - FLIP-30 unified catalogs APIs [1]: close to be done. On track

 - hive integration
  - HiveCatalog for persisting Flink/Hive metadata [2]: close to be
 done. On track

  - hive data connector [3]: input/output format are close to be done.
 Was blocked on source/sink interfaces. We had several discussions yesterday
 and concluded that we may have a quick working solution out for 1.9. Thus
 I'd say on track

  - hive functions [4]: Just started. It has major dependencies on
 function definitions and type system rework part II. In the last few weeks,
 community is mainly focusing on Blink planner and related tasks on
 SQL/Table API side as Timo mentioned above, and the work of function
 definition just got started this week. I'm working closely with Timo to
 push this efforts forward. It's a bit risky but I'm glad we start to make
 progress now

 - SQL DDL: also had discussions yesterday. Working together with Kurt, we
 hope to have at least some basic DDL to offer users an end-to-end working
 solution for both Flink and Hive use cases in 1.9

 [1]:
- https://issues.apache.org/jira/browse/FLINK-11275
- https://issues.apache.org/jira/browse/FLINK-12625
-
https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs
 [2]: https://issues.apache.org/jira/browse/FLINK-12755
 [3]: https://issues.apache.org/jira/browse/FLINK-10729
 [4]: https://issues.apache.org/jira/browse/FLINK-12656

 On Tue, Jun 4, 2019 at 12:12 AM Till Rohrmann  wrote:

 > Thanks for starting this discussion Gordon and Kurt. For the development
 > threads I'm involved with here are the updates:
 >
 > * Pluggable scheduler: Good part of the work is completed. Gary now works
 > on the glue code to use the new high level scheduler components. The
 > estimate to finish this work is end of June (estimate: 4 weeks starting
 > from this week). The changes to the scheduler would benefit from very
 > thorough testing because they are core to Flink.
 >
 > * External shuffle service: As Zhijiang said, we hope to finish the work by
 > the end of this week or early next week (estimate: 1 week from now).
 >
 > * Result partition life cycle management / fine grained recovery: The
 > current estimate to complete this feature would be end of next week or
 > beginning of the week afterwards (estimate: 2 weeks from now). This feature
 > should enable fine grained recovery for batch.
 >
 > * Java 9 support: Flink builds with Java 9. Not all e2e tests are running
 > with Java 9 though.
 >
 > * Active K8s integration: PRs are open but reviews are still pending.
 >
 > Cheers,
 > Till
 >
 > On Wed, May 29, 2019 at 4:45 AM Biao Liu  wrote:
 >
 > > Thanks for being the release manager, Gordon & Kurt.
 > >
 > > For FLIP-27, there are still some more details need to discuss. I don't
 > > think it could catch up the release of 1.9. @Aljoscha, @Stephan, do you
 > > agree that?
 > >
 > > zhijiang  于2019年5月28日周二 下午11:28写道:
 > >
 > > > Hi Gordon,
 > > >
 > > > Thanks for the kind reminder of feature freeze date for 1.9.0. I think
 > > the
 > > > date makes sense on my side.
 > > >
 > > > For FLIP-31, I and Andrey could be done within two weeks or so.
 > > > And I already finished my side work for FLIP-1.
 > > >
 > > > Best,
 > > > Zhijiang
 > > >
 > > >
 > > > --
 > > > From:Timo Walther 
 > > > Send Time:2019年5月28日(星期二) 19:26
 > > > To:dev 
 > > > Subject:Re: [DISCUSS] Features for Apache Flink 1.9.0
 > > >
 > > > Thanks for being the release managers, Kurt and Gordon!
 > > >
 > > >  From the Table & SQL API side, there are still a lot of open issues
 > > > that need to be solved to decouple the API from a planner and enable
 > the
 > > > Blink planner. Also we need to make sure that the Blink planner
 > supports
 > > > at least everything of Flink 1.8 to not introduce a regression. We
 > might
 > > > need to focus more on the main features which is a runnable Blink
 > > > planner and might need to postpone other discussions such as DDL, new
 > > > source/sink interfaces, or proper type 

Re: [DISCUSS] Start a user...@flink.apache.org mailing list for the Chinese-speaking community?

2019-01-24 Thread Zhang, Xuefu
+1 on the idea. This will certainly help promote Flink in China industries. On 
a side note, it would be great if anyone in the list can help source ideas, bug 
reports, and feature requests to dev@ list and/or JIRAs so as to gain broader 
attention.

Thanks,
Xuefu


--
From:Fabian Hueske 
Sent At:2019 Jan. 24 (Thu.) 05:32
To:dev 
Subject:Re: [DISCUSS] Start a user...@flink.apache.org mailing list for the 
Chinese-speaking community?

Thanks Robert!
I think this is a very good idea.
+1

Fabian

Am Do., 24. Jan. 2019 um 14:09 Uhr schrieb Jeff Zhang :

> +1
>
> Piotr Nowojski  于2019年1月24日周四 下午8:38写道:
>
> > +1, good idea, especially with that many Chinese speaking contributors,
> > committers & users :)
> >
> > Piotrek
> >
> > > On 24 Jan 2019, at 13:20, Kurt Young  wrote:
> > >
> > > Big +1 on this, it will indeed help Chinese speaking users a lot.
> > >
> > > fudian.fd 于2019年1月24日 周四20:18写道:
> > >
> > >> +1. I noticed that many folks from China are requesting the JIRA
> > >> permission in the past year. It reflects that more and more developers
> > from
> > >> China are using Flink. A Chinese oriented mailing list will definitely
> > be
> > >> helpful for the growth of Flink in China.
> > >>
> > >>
> > >>> 在 2019年1月24日,下午7:42,Stephan Ewen  写道:
> > >>>
> > >>> +1, a very nice idea
> > >>>
> > >>> On Thu, Jan 24, 2019 at 12:41 PM Robert Metzger  >
> > >> wrote:
> > >>>
> >  Thanks for your response.
> > 
> >  You are right, I'm proposing "user...@flink.apache.org" as the
> > mailing
> >  list's name!
> > 
> >  On Thu, Jan 24, 2019 at 12:37 PM Tzu-Li (Gordon) Tai <
> > >> tzuli...@apache.org>
> >  wrote:
> > 
> > > Hi Robert,
> > >
> > > Thanks a lot for starting this discussion!
> > >
> > > +1 to a user-zh@flink.a.o mailing list (you mentioned -zh in the
> > >> title,
> > > but
> > > -cn in the opening email content.
> > > I think -zh would be better as we are establishing the tool for
> > general
> > > Chinese-speaking users).
> > > All dev@ discussions / JIRAs should still be in a single English
> > >> mailing
> > > list.
> > >
> > > From what I've seen in the DingTalk Flink user group, there's
> quite a
> > >> bit
> > > of activity in forms of user questions and replies.
> > > It would really be great if the Chinese-speaking user community can
> > > actually have these discussions happen in the Apache mailing lists,
> > > so that questions / discussions / replies from developers can be
> > >> indexed
> > > and searchable.
> > > Moreover, it'll give the community more insight in how active a
> > > Chinese-speaking contributor is helping with user requests,
> > > which in general is a form of contribution that the community
> always
> >  merits
> > > a lot.
> > >
> > > Cheers,
> > > Gordon
> > >
> > > On Thu, Jan 24, 2019 at 12:15 PM Robert Metzger <
> rmetz...@apache.org
> > >
> > > wrote:
> > >
> > >> Hey all,
> > >>
> > >> I would like to create a new user support mailing list called "
> > >> user...@flink.apache.org" to cater the Chinese-speaking Flink
> >  community.
> > >>
> > >> Why?
> > >> In the last year 24% of the traffic on flink.apache.org came from
> > the
> > > US,
> > >> 22% from China. In the last three months, China is at 30%, the US
> at
> >  20%.
> > >> An additional data point is that there's a Flink DingTalk group
> with
> >  more
> > >> than 5000 members, asking Flink questions.
> > >> I believe that knowledge about Flink should be available in public
> >  forums
> > >> (our mailing list), indexable by search engines. If there's a huge
> >  demand
> > >> in a Chinese language support, we as a community should provide
> > these
> > > users
> > >> the tools they need, to grow our community and to allow them to
> > follow
> > > the
> > >> Apache way.
> > >>
> > >> Is it possible?
> > >> I believe it is, because a number of other Apache projects are
> > running
> > >> non-English user@ mailing lists.
> > >> Apache OpenOffice, Cocoon, OpenMeetings, CloudStack all have
> >  non-English
> > >> lists: http://mail-archives.apache.org/mod_mbox/
> > >> One thing I want to make very clear in this discussion is that all
> > > project
> > >> decisions, developer discussions, JIRA tickets etc. need to happen
> > in
> > >> English, as this is the primary language of the Apache Foundation
> > and
> >  our
> > >> community.
> > >> We should also clarify this on the page listing the mailing lists.
> > >>
> > >> How?
> > >> If there is consensus in this discussion thread, I would request
> the
> >  new
> > >> mailing list next Monday.
> > >> In case of discussions, I will start a vote on Monday or when the
> > >> 

Re: [DISCUSS] A strategy for merging the Blink enhancements

2019-01-22 Thread Zhang, Xuefu
Hi Stephan,

Thanks for bringing up the discussions. I'm +1 on the merging plan. One 
question though: since the merge will not be completed for some time and there 
are might be uses trying blink branch, what's the plan for the development in 
the branch? Personally I think we may discourage big contributions to the 
branch, which would further complicate the merge, while we shouldn't stop 
critical fixes as well.

What's your take on this?

Thanks,
Xuefu


--
From:Stephan Ewen 
Sent At:2019 Jan. 22 (Tue.) 06:16
To:dev 
Subject:[DISCUSS] A strategy for merging the Blink enhancements

Dear Flink community!

As a follow-up to the thread announcing Alibaba's offer to contribute the
Blink code [1]

,
here are some thoughts on how this contribution could be merged.

As described in the announcement thread, it is a big contribution, and we
need to
carefully plan how to handle the contribution. We would like to get the
improvements to Flink,
while making it as non-disruptive as possible for the community.
I hope that this plan gives the community get a better understanding of
what the
proposed contribution would mean.

Here is an initial rough proposal, with thoughts from
Timo, Piotr, Dawid, Kurt, Shaoxuan, Jincheng, Jark, Aljoscha, Fabian,
Xiaowei:

  - It is obviously very hard to merge all changes in a quick move, because
we
are talking about multiple 100k lines of code.

  - As much as possible, we want to maintain compatibility with the current
Table API,
so that this becomes a transparent change for most users.

  - The two areas with the most changes we identified were
 (1) The SQL/Table query processor
 (2) The batch scheduling/failover/shuffle

  - For the query processor part, this is what we found and propose:

-> The Blink and Flink code have the same semantics (ANSI SQL) except
for minor
   aspects (under discussion). Blink also covers more SQL operations.

-> The Blink code is quite different from the current Flink SQL runtime.
   Merging as changes seems hardly feasible. From the current
evaluation, the
   Blink query processor uses the more advanced architecture, so it
would make
   sense to converge to that design.

-> We propose to gradually build up the Blink-based query processor as
a second
   query processor under the SQL/Table API. Think of it as two
different runners
   for the Table API.
   As the new query processor becomes fully merged and stable, we can
deprecate and
   eventually remove the existing query processor. That should give the
least
   disruption to Flink users and allow for gradual merge/development.

-> Some refactoring of the Table API is necessary to support the above
strategy.
   Most of the prerequisite refactoring is around splitting the project
into
   different modules, following a similar idea as FLIP-28 [2]

.

-> A more detailed proposal is being worked on.

-> Same as FLIP-28, this approach would probably need to suspend Table
API
   contributions for a short while. We hope that this can be a very
short period,
   to not impact the very active development in Flink on Table API/SQL
too much.

  - For the batch scheduling and failover enhancements, we should be able
to build
on the currently ongoing refactoring of the scheduling logic [3]
. That should
make it easy to plug in a new scheduler and failover logic. We can port
the Blink
enhancements as a new scheduler / failover handler. We can later make
it the
default for bounded stream programs once the merge is completed and it
is tested.

  - For the catalog and source/sink design and interfaces, we would like to
continue with the already started design discussion threads. Once these
are
converged, we might use some of the Blink code for the implementation,
if it
is close to the outcome of the design discussions.

Best,
Stephan

[1]
https://lists.apache.org/thread.html/2f7330e85d702a53b4a2b361149930b50f2e89d8e8a572f8ee2a0e6d@%3Cdev.flink.apache.org%3E

[2]
https://cwiki.apache.org/confluence/display/FLINK/FLIP-28%3A+Long-term+goal+of+making+flink-table+Scala-free

[3] https://issues.apache.org/jira/browse/FLINK-10429


Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2019-01-07 Thread Zhang, Xuefu
Thanks, Timo!

I have started put the content from the google doc to FLIP-30 [1]. However, 
please still keep the discussion along this thread.

Thanks,
Xuefu

[1] 
https://cwiki.apache.org/confluence/display/FLINK/FLIP-30%3A+Unified+Catalog+APIs


--
From:Timo Walther 
Sent At:2019 Jan. 7 (Mon.) 05:59
To:dev 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi everyone,

Xuefu and I had multiple iterations over the catalog design document 
[1]. I believe that it is in a good shape now to be converted into FLIP. 
Maybe we need a bit more explanation at some places but the general 
design would be ready now.

The design document covers the following changes:
- Unify external catalog interface and Flink's internal catalog in 
TableEnvironment
- Clearly define a hierarchy of reference objects namely: 
"catalog.database.table"
- Enable a tight integration with Hive + Hive data connectors as well as 
a broad integration with existing TableFactories and discovery mechanism
- Make the catalog interfaces more feature complete by adding views and 
functions

If you have any further feedback, it would be great to give it now 
before we convert it into a FLIP.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit#



Am 07.01.19 um 13:51 schrieb Timo Walther:
> Hi Eron,
>
> thank you very much for the contributions. I merged the first little 
> bug fixes. For the remaining PRs I think we can review and merge them 
> soon. As you said, the code is agnostic to the details of the 
> ExternalCatalog interface and I don't expect bigger merge conflicts in 
> the near future.
>
> However, exposing the current external catalog interfaces to SQL 
> Client users would make it even more difficult to change the 
> interfaces in the future. So maybe I would first wait until the 
> general catalog discussion is over and the FLIP has been created. This 
> should happen shortly.
>
> We should definitely coordinate the efforts better in the future to 
> avoid duplicate work.
>
> Thanks,
> Timo
>
>
> Am 07.01.19 um 00:24 schrieb Eron Wright:
>> Thanks Timo for merging a couple of the PRs.   Are you also able to 
>> review the others that I mentioned? Xuefu I would like to incorporate 
>> your feedback too.
>>
>> Check out this short demonstration of using a catalog in SQL Client:
>> https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo
>>
>> Thanks again!
>>
>> On Thu, Jan 3, 2019 at 9:37 AM Eron Wright > <mailto:eronwri...@gmail.com>> wrote:
>>
>> Would a couple folks raise their hand to make a review pass thru
>> the 6 PRs listed above?  It is a lovely stack of PRs that is 'all
>> green' at the moment.   I would be happy to open follow-on PRs to
>> rapidly align with other efforts.
>>
>> Note that the code is agnostic to the details of the
>> ExternalCatalog interface; the code would not be obsolete if/when
>> the catalog interface is enhanced as per the design doc.
>>
>>
>>
>> On Wed, Jan 2, 2019 at 1:35 PM Eron Wright > <mailto:eronwri...@gmail.com>> wrote:
>>
>> I propose that the community review and merge the PRs that I
>>     posted, and then evolve the design thru 1.8 and beyond.  I
>> think having a basic infrastructure in place now will
>> accelerate the effort, do you agree?
>>
>> Thanks again!
>>
>> On Wed, Jan 2, 2019 at 11:20 AM Zhang, Xuefu
>> mailto:xuef...@alibaba-inc.com>> 
>> wrote:
>>
>> Hi Eron,
>>
>> Happy New Year!
>>
>> Thank you very much for your contribution, especially
>> during the holidays. Wile I'm encouraged by your work, I'd
>> also like to share my thoughts on how to move forward.
>>
>> First, please note that the design discussion is still
>> finalizing, and we expect some moderate changes,
>> especially around TableFactories. Another pending change
>> is our decision to shy away from scala, which our work
>> will be impacted by.
>>
>> Secondly, while your work seemed about plugging in
>> catalogs definitions to the execution environment, which
>> is less impacted by TableFactory change, I did notice some
>> duplication of your work and ours. This is no big deal,
>> but going forward, we should probable have a better
>> communication on the work assignment so as to 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2019-01-02 Thread Zhang, Xuefu
Hi Eron,

Happy New Year!

Thank you very much for your contribution, especially during the holidays. Wile 
I'm encouraged by your work, I'd also like to share my thoughts on how to move 
forward.

First, please note that the design discussion is still finalizing, and we 
expect some moderate changes, especially around TableFactories. Another pending 
change is our decision to shy away from scala, which our work will be impacted 
by.

Secondly, while your work seemed about plugging in catalogs definitions to the 
execution environment, which is less impacted by TableFactory change, I did 
notice some duplication of your work and ours. This is no big deal, but going 
forward, we should probable have a better communication on the work assignment 
so as to avoid any possible duplication of work. On the other hand, I think 
some of your work is interesting and valuable for inclusion once we finalize 
the overall design.

Thus, please continue your research and experiment and let us know when you 
start working on anything so we can better coordinate.

Thanks again for your interest and contributions.

Thanks,
Xuefu




--
From:Eron Wright 
Sent At:2019 Jan. 1 (Tue.) 18:39
To:dev ; Xuefu 
Cc:Xiaowei Jiang ; twalthr ; piotr 
; Fabian Hueske ; suez1224 
; Bowen Li 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi folks, there's clearly some incremental steps to be taken to introduce 
catalog support to SQL Client, complementary to what is proposed in the 
Flink-Hive Metastore design doc.  I was quietly working on this over the 
holidays.   I posted some new sub-tasks, PRs, and sample code to FLINK-10744. 

What inspired me to get involved is that the catalog interface seems like a 
great way to encapsulate a 'library' of Flink tables and functions.  For 
example, the NYC Taxi dataset (TaxiRides, TaxiFares, various UDFs) may be 
nicely encapsulated as a catalog (TaxiData).   Such a library should be fully 
consumable in SQL Client.

I implemented the above.  Some highlights:
1. A fully-worked example of using the Taxi dataset in SQL Client via an 
environment file.
- an ASCII video showing the SQL Client in action:
https://asciinema.org/a/C8xuAjmZSxCuApgFgZQyeIHuo

- the corresponding environment file (will be even more concise once 
'FLINK-10696 Catalog UDFs' is merged):
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/dist/conf/sql-client-defaults.yaml

- the typed API for standalone table applications:
https://github.com/EronWright/flink-training-exercises/blob/3be008d64be975ced0f1a7e3901a8c5353f72a7e/src/main/java/com/dataartisans/flinktraining/examples/table_java/examples/ViaCatalog.java#L50

2. Implementation of the core catalog descriptor and factory.  I realize that 
some renames may later occur as per the design doc, and would be happy to do 
that as a follow-up.
https://github.com/apache/flink/pull/7390

3. Implementation of a connect-style API on TableEnvironment to use catalog 
descriptor.
https://github.com/apache/flink/pull/7392

4. Integration into SQL-Client's environment file:
https://github.com/apache/flink/pull/7393

I realize that the overall Hive integration is still evolving, but I believe 
that these PRs are a good stepping stone. Here's the list (in bottom-up order):
- https://github.com/apache/flink/pull/7386
- https://github.com/apache/flink/pull/7388
- https://github.com/apache/flink/pull/7389
- https://github.com/apache/flink/pull/7390
- https://github.com/apache/flink/pull/7392
- https://github.com/apache/flink/pull/7393

Thanks and enjoy 2019!
Eron W


On Sun, Nov 18, 2018 at 3:04 PM Zhang, Xuefu  wrote:
Hi Xiaowei,

 Thanks for bringing up the question. In the current design, the properties for 
meta objects are meant to cover anything that's specific to a particular 
catalog and agnostic to Flink. Anything that is common (such as schema for 
tables, query text for views, and udf classname) are abstracted as members of 
the respective classes. However, this is still in discussion, and Timo and I 
will go over this and provide an update.

 Please note that UDF is a little more involved than what the current design 
doc shows. I'm still refining this part.

 Thanks,
 Xuefu


 --
 Sender:Xiaowei Jiang 
 Sent at:2018 Nov 18 (Sun) 15:17
 Recipient:dev 
 Cc:Xuefu ; twalthr ; piotr 
; Fabian Hueske ; suez1224 

 Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

 Thanks Xuefu for the detailed design doc! One question on the properties 
associated with the catalog objects. Are we going to leave them completely free 
form or we are going to set some standard for that? I think that the answer may 
depend on if we want to explore catalog specific optimization opportunities. In 
any case, I think that it might be helpful for standardize as much as possible 
into strongly

Re: [DISCUSS] Enhance convenience of TableEnvironment in TableAPI/SQL

2018-12-10 Thread Zhang, Xuefu
Hi Jincheng,

Thanks for bringing this up. It seems making good sense to me. However, one 
concern I have is about backward compatibility. Could you clarify whether 
existing user program will break with the proposed changes?

The answer to the question would largely determine when this can be introduced.

Thanks,
Xuefu


--
Sender:jincheng sun 
Sent at:2018 Dec 10 (Mon) 18:14
Recipient:dev 
Subject:[DISCUSS] Enhance convenience of TableEnvironment in TableAPI/SQL

Hi All,

According to the feedback from users, the design of TableEnvironment is very 
inconvenient for users, and often mistakenly imported by IDE, especially for 
Java users, such as:

ExecutionEnvironment env = ...BatchTableEnvironment tEnv = 
TableEnvironment.getTableEnvironment(env);

The user does not know which BatchTableEnvironment should be imported, because 
there are three implementations of BatchTableEnvironment, shown as below:

1. org.apache.flink.table.api.BatchTableEnvironment 2. 
org.apache.flink.table.api.java.BatchTableEnvironment 3. 
org.apache.flink.table.api.scala.BatchTableEnvironment
[image.png]


This brings unnecessary inconveniences to the flink user. To solve this 
problem, Wei Zhong, Hequn Cheng, Dian Fu, Shaoxuan Wang and myself discussed 
offline a bit and propose to change the inheritance diagram of TableEnvironment 
is shown as follows:
 1. AbstractTaleEnvironment - rename current TableEnvironment to 
AbstractTableEnvironment, The functionality implemented by Abstract 
TableEnvironment is stream and batch shared.2. TableEnvironment - Create a new 
TableEnvironment(abstract), and defined all methods in 
(java/scala)StreamTableEnvironment and (java/scala)BatchTableEnvironment. In 
the implementation of BatchTableEnviroment and StreamTableEnviroment, the 
unsupported operations will be reported as an error.
[image.png]
Then the usage as follows:

ExecutionEnvironment env = …TableEnvironment tEnv = 
TableEnvironment.getTableEnvironment(env)
For detailed proposals please refer to the Google doc: 
https://docs.google.com/document/d/1t-AUGuaChADddyJi6e0WLsTDEnf9ZkupvvBiQ4yTTEI/edit?usp=sharing

Any mail feedback and Google doc comment are welcome.

Thanks,
Jincheng



Re: [DISCUSS] Flink SQL DDL Design

2018-12-05 Thread Zhang, Xuefu
>> I think summarizing it into a google doc is a good idea. We
> >> will
> >>>>>> prepare
> >>>>>>> it
> >>>>>>>> in the next few days.
> >>>>>>>>
> >>>>>>>> Thanks,
> >>>>>>>> Jark
> >>>>>>>>
> >>>>>>>> Shaoxuan Wang  于2018年11月28日周三 下午9:17写道:
> >>>>>>>>
> >>>>>>>>> Hi Lin and Jark,
> >>>>>>>>> Thanks for sharing those details. Can you please consider
> >>>>> summarizing
> >>>>>>>> your
> >>>>>>>>> DDL design into a google doc.
> >>>>>>>>> We can still continue the discussions on Shuyi's proposal.
> >> But
> >>>>>> having a
> >>>>>>>>> separate google doc will be easy for the DEV to
> >>>>>>>> understand/comment/discuss
> >>>>>>>>> on your proposed DDL implementation.
> >>>>>>>>>
> >>>>>>>>> Regards,
> >>>>>>>>> Shaoxuan
> >>>>>>>>>
> >>>>>>>>>
> >>>>>>>>> On Wed, Nov 28, 2018 at 7:39 PM Jark Wu 
> >>>> wrote:
> >>>>>>>>>> Hi Shuyi,
> >>>>>>>>>>
> >>>>>>>>>> Thanks for bringing up this discussion and the awesome
> >> work!
> >>> I
> >>>>> have
> >>>>>>>> left
> >>>>>>>>>> some comments in the doc.
> >>>>>>>>>>
> >>>>>>>>>> I want to share something more about the watermark
> >> definition
> >>>>>> learned
> >>>>>>>>> from
> >>>>>>>>>> Alibaba.
> >>>>>>>>>>
> >>>>>>>>>> 1.
> >>>>>>>>>>
> >>>>>>>>>> Table should be able to accept multiple watermark
> >>>> definition.
> >>>>>>>>>> Because a table may have more than one rowtime field.
> >> For
> >>>>>> example,
> >>>>>>>> one
> >>>>>>>>>> rowtime field is from existing field but missing in some
> >>>>>> records,
> >>>>>>>>>> another
> >>>>>>>>>> is the ingestion timestamp in Kafka but not very
> >> accurate.
> >>>> In
> >>>>>> this
> >>>>>>>>> case,
> >>>>>>>>>> user may define two rowtime fields with watermarks in
> >> the
> >>>>> Table
> >>>>>>> and
> >>>>>>>>>> choose
> >>>>>>>>>> one in different situation.
> >>>>>>>>>> 2.
> >>>>>>>>>>
> >>>>>>>>>> Watermark stragety always work with rowtime field
> >>> together.
> >>>>>>>>>> Based on the two points metioned above, I think we should
> >>>> combine
> >>>>>> the
> >>>>>>>>>> watermark strategy and rowtime field selection (i.e. which
> >>>>> existing
> >>>>>>>> field
> >>>>>>>>>> used to generate watermark) in one clause, so that we can
> >>>> define
> >>>>>>>> multiple
> >>>>>>>>>> watermarks in one Table.
> >>>>>>>>>>
> >>>>>>>>>> Here I will share the watermark syntax used in Alibaba
> >>> (simply
> >>>>>>>> modified):
> >>>>>>>>>> watermarkDefinition:
> >>>>>>>>>> WATERMARK [watermarkName] FOR  AS
> >> wm_strategy
> >>>>>>>>>> wm_strategy:
> >>>>>>>>>>BOUNDED WITH OFFSET 'string' timeUnit
> >>>>>>>>>> |
> >>>>>>>>>>ASCENDING
> >>>>>>>&

Re: [DISCUSS] Support Higher-order functions in Flink sql

2018-12-03 Thread Zhang, Xuefu
Hi Wenhui,

Thanks for bringing the topics up. Both make sense to me. For higher-order 
functions, I'd suggest you come up with a list of things you'd like to add. 
Overall, Flink SQL is weak in handling complex types. Ideally we should have a 
doc covering the gaps and provide a roadmap for enhancement. It would be great 
if you can broaden the topic a bit.

Thanks,
Xuefu 


--
Sender:winifred.wenhui.t...@gmail.com 
Sent at:2018 Dec 3 (Mon) 16:13
Recipient:dev 
Subject:[DISCUSS] Support Higher-order functions in Flink sql

Hello all,

Spark 2.4.0 was released last month. I noticed that Spark 2.4 
“Add a lot of new built-in functions, including higher-order functions, to deal 
with complex data types easier.”[1]
I wonder if it's necessary for Flink to add higher-order functions to enhance 
it's ability.

By the way, I found that if we wants to enhance the functionality of Flink sql, 
we often need to modify Calcite. It may be a little inconvenient,so may be we 
can extend Calcite core parser in Flink to deal with some non-standard SQL 
syntax, as mentioned in Flink SQL DDL Design[2].

Look forward to your feedback.

Best,
Wen-hui Tang

[1] https://issues.apache.org/jira/browse/SPARK-23899
[2] 
https://docs.google.com/document/d/1TTP-GCC8wSsibJaSUyFZ_5NBAHYEB1FVmPpP7RgDGBA/edit#



Winifred-wenhui Tang



Re: [DISCUSS] Flink SQL DDL Design

2018-11-28 Thread Zhang, Xuefu
Here's the approximate grammar, FYI
> > > > CREATE TABLE
> > > >
> > > > CREATE TABLE tableName(
> > > > columnDefinition [, columnDefinition]*
> > > > [ computedColumnDefinition [, computedColumnDefinition]* ]
> > > > [ tableConstraint [, tableConstraint]* ]
> > > > [ tableIndex [, tableIndex]* ]
> > > > [ PERIOD FOR SYSTEM_TIME ]
> > > > [ WATERMARK watermarkName FOR rowTimeColumn AS
> > > > withOffset(rowTimeColumn, offset) ] ) [ WITH ( tableOption [ ,
> > > > tableOption]* ) ] [ ; ]
> > > >
> > > > columnDefinition ::=
> > > > columnName dataType [ NOT NULL ]
> > > >
> > > > dataType  ::=
> > > > {
> > > >   [ VARCHAR ]
> > > >   | [ BOOLEAN ]
> > > >   | [ TINYINT ]
> > > >   | [ SMALLINT ]
> > > >   | [ INT ]
> > > >   | [ BIGINT ]
> > > >   | [ FLOAT ]
> > > >   | [ DECIMAL ]
> > > >   | [ DOUBLE ]
> > > >   | [ DATE ]
> > > >   | [ TIME ]
> > > >   | [ TIMESTAMP ]
> > > >   | [ VARBINARY ]
> > > > }
> > > >
> > > > computedColumnDefinition ::=
> > > > columnName AS computedColumnExpression
> > > >
> > > > tableConstraint ::=
> > > > { PRIMARY KEY | UNIQUE }
> > > > (columnName [, columnName]* )
> > > >
> > > > tableIndex ::=
> > > > [ UNIQUE ] INDEX indexName
> > > >  (columnName [, columnName]* )
> > > >
> > > > rowTimeColumn ::=
> > > > columnName
> > > >
> > > > tableOption ::=
> > > > property=value
> > > > offset ::=
> > > > positive integer (unit: ms)
> > > >
> > > > CREATE VIEW
> > > >
> > > > CREATE VIEW viewName
> > > >   [
> > > > ( columnName [, columnName]* )
> > > >   ]
> > > > AS queryStatement;
> > > >
> > > > CREATE FUNCTION
> > > >
> > > >  CREATE FUNCTION functionName
> > > >   AS 'className';
> > > >
> > > >  className ::=
> > > > fully qualified name
> > > >
> > > >
> > > > Shuyi Chen  于2018年11月28日周三 上午3:28写道:
> > > >
> > > > > Thanks a lot, Timo and Xuefu. Yes, I think we can finalize the
> design
> > > doc
> > > > > first and start implementation w/o the unified connector API ready
> by
> > > > > skipping some featue.
> > > > >
> > > > > Xuefu, I like the idea of making Flink specific properties into
> > generic
> > > > > key-value pairs, so that it will make integration with Hive DDL (or
> > > > others,
> > > > > e.g. Beam DDL) easier.
> > > > >
> > > > > I'll run a final pass over the design doc and finalize the design
> in
> > > the
> > > > > next few days. And we can start creating tasks and collaborate on
> the
> > > > > implementation. Thanks a lot for all the comments and inputs.
> > > > >
> > > > > Cheers!
> > > > > Shuyi
> > > > >
> > > > > On Tue, Nov 27, 2018 at 7:02 AM Zhang, Xuefu <
> > xuef...@alibaba-inc.com>
> > > > > wrote:
> > > > >
> > > > > > Yeah! I agree with Timo that DDL can actually proceed w/o being
> > > blocked
> > > > > by
> > > > > > connector API. We can leave the unknown out while defining the
> > basic
> > > > > syntax.
> > > > > >
> > > > > > @Shuyi
> > > > > >
> > > > > > As commented in the doc, I think we can probably stick with
> simple
> > > > syntax
> > > > > > with general properties, without extending the syntax too much
> that
> > > it
> > > > > > mimics the descriptor API.
> > > > > >
> > > > > > Part of our effort on Flink-Hive integration is also to make DDL
> > > syntax
> > > > > > compatible with Hive's. The one in the current proposal seems
> > making
>

Re: [DISCUSS] Flink SQL DDL Design

2018-11-27 Thread Zhang, Xuefu
Yeah! I agree with Timo that DDL can actually proceed w/o being blocked by 
connector API. We can leave the unknown out while defining the basic syntax.

@Shuyi 

As commented in the doc, I think we can probably stick with simple syntax with 
general properties, without extending the syntax too much that it mimics the 
descriptor API. 

Part of our effort on Flink-Hive integration is also to make DDL syntax 
compatible with Hive's. The one in the current proposal seems making our effort 
more challenging.

We can help and collaborate. At this moment, I think we can finalize on the 
proposal and then we can divide the tasks for better collaboration.

Please let me know if there are  any questions or suggestions.

Thanks,
Xuefu




--
Sender:Timo Walther 
Sent at:2018 Nov 27 (Tue) 16:21
Recipient:dev 
Subject:Re: [DISCUSS] Flink SQL DDL Design

Thanks for offering your help here, Xuefu. It would be great to move 
these efforts forward. I agree that the DDL is somehow releated to the 
unified connector API design but we can also start with the basic 
functionality now and evolve the DDL during this release and next releases.

For example, we could identify the MVP DDL syntax that skips defining 
key constraints and maybe even time attributes. This DDL could be used 
for batch usecases, ETL, and materializing SQL queries (no time 
operations like windows).

The unified connector API is high on our priority list for the 1.8 
release. I will try to update the document until mid of next week.


Regards,

Timo


Am 27.11.18 um 08:08 schrieb Shuyi Chen:
> Thanks a lot, Xuefu. I was busy for some other stuff for the last 2 weeks,
> but we are definitely interested in moving this forward. I think once the
> unified connector API design [1] is done, we can finalize the DDL design as
> well and start creating concrete subtasks to collaborate on the
> implementation with the community.
>
> Shuyi
>
> [1]
> https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?usp=sharing
>
> On Mon, Nov 26, 2018 at 7:01 PM Zhang, Xuefu 
> wrote:
>
>> Hi Shuyi,
>>
>> I'm wondering if you folks still have the bandwidth working on this.
>>
>> We have some dedicated resource and like to move this forward. We can
>> collaborate.
>>
>> Thanks,
>>
>> Xuefu
>>
>>
>> --
>> 发件人:wenlong.lwl
>> 日 期:2018年11月05日 11:15:35
>> 收件人:
>> 主 题:Re: [DISCUSS] Flink SQL DDL Design
>>
>> Hi, Shuyi, thanks for the proposal.
>>
>> I have two concerns about the table ddl:
>>
>> 1. how about remove the source/sink mark from the ddl, because it is not
>> necessary, the framework determine the table referred is a source or a sink
>> according to the context of the query using the table. it will be more
>> convenient for use defining a table which can be both a source and sink,
>> and more convenient for catalog to persistent and manage the meta infos.
>>
>> 2. how about just keeping one pure string map as parameters for table, like
>> create tabe Kafka10SourceTable (
>> intField INTEGER,
>> stringField VARCHAR(128),
>> longField BIGINT,
>> rowTimeField TIMESTAMP
>> ) with (
>> connector.type = ’kafka’,
>> connector.property-version = ’1’,
>> connector.version = ’0.10’,
>> connector.properties.topic = ‘test-kafka-topic’,
>> connector.properties.startup-mode = ‘latest-offset’,
>> connector.properties.specific-offset = ‘offset’,
>> format.type = 'json'
>> format.prperties.version=’1’,
>> format.derive-schema = 'true'
>> );
>> Because:
>> 1. in TableFactory, what user use is a string map properties, defining
>> parameters by string-map can be the closest way to mapping how user use the
>> parameters.
>> 2. The table descriptor can be extended by user, like what is done in Kafka
>> and Json, it means that the parameter keys in connector or format can be
>> different in different implementation, we can not restrict the key in a
>> specified set, so we need a map in connector scope and a map in
>> connector.properties scope. why not just give user a single map, let them
>> put parameters in a format they like, which is also the simplest way to
>> implement DDL parser.
>> 3. whether we can define a format clause or not, depends on the
>> implementation of the connector, using different clause in DDL may make a
>> misunderstanding that we can combine the connectors with arbitrary formats,
>> which may not work actually.
>>
>> On Sun, 4 Nov 2018 at 18:25, Dominik Wosiński  wrote:
>>
>>> +1, Thanks fo

Re: [DISCUSS] Flink SQL DDL Design

2018-11-26 Thread Zhang, Xuefu
Hi Shuyi, 

I'm wondering if you folks still have the bandwidth working on this. 

We have some dedicated resource and like to move this forward. We can 
collaborate. 

Thanks, 

Xuefu 


--
发件人:wenlong.lwl
日 期:2018年11月05日 11:15:35
收件人:
主 题:Re: [DISCUSS] Flink SQL DDL Design

Hi, Shuyi, thanks for the proposal.

I have two concerns about the table ddl:

1. how about remove the source/sink mark from the ddl, because it is not
necessary, the framework determine the table referred is a source or a sink
according to the context of the query using the table. it will be more
convenient for use defining a table which can be both a source and sink,
and more convenient for catalog to persistent and manage the meta infos.

2. how about just keeping one pure string map as parameters for table, like
create tabe Kafka10SourceTable (
intField INTEGER,
stringField VARCHAR(128),
longField BIGINT,
rowTimeField TIMESTAMP
) with (
connector.type = ’kafka’,
connector.property-version = ’1’,
connector.version = ’0.10’,
connector.properties.topic = ‘test-kafka-topic’,
connector.properties.startup-mode = ‘latest-offset’,
connector.properties.specific-offset = ‘offset’,
format.type = 'json'
format.prperties.version=’1’,
format.derive-schema = 'true'
);
Because:
1. in TableFactory, what user use is a string map properties, defining
parameters by string-map can be the closest way to mapping how user use the
parameters.
2. The table descriptor can be extended by user, like what is done in Kafka
and Json, it means that the parameter keys in connector or format can be
different in different implementation, we can not restrict the key in a
specified set, so we need a map in connector scope and a map in
connector.properties scope. why not just give user a single map, let them
put parameters in a format they like, which is also the simplest way to
implement DDL parser.
3. whether we can define a format clause or not, depends on the
implementation of the connector, using different clause in DDL may make a
misunderstanding that we can combine the connectors with arbitrary formats,
which may not work actually.

On Sun, 4 Nov 2018 at 18:25, Dominik Wosiński  wrote:

> +1, Thanks for the proposal.
>
> I guess this is a long-awaited change. This can vastly increase the
> functionalities of the SQL Client as it will be possible to use complex
> extensions like for example those provided by Apache Bahir[1].
>
> Best Regards,
> Dom.
>
> [1]
> https://github.com/apache/bahir-flink
>
> sob., 3 lis 2018 o 17:17 Rong Rong  napisał(a):
>
> > +1. Thanks for putting the proposal together Shuyi.
> >
> > DDL has been brought up in a couple of times previously [1,2]. Utilizing
> > DDL will definitely be a great extension to the current Flink SQL to
> > systematically support some of the previously brought up features such as
> > [3]. And it will also be beneficial to see the document closely aligned
> > with the previous discussion for unified SQL connector API [4].
> >
> > I also left a few comments on the doc. Looking forward to the alignment
> > with the other couple of efforts and contributing to them!
> >
> > Best,
> > Rong
> >
> > [1]
> >
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201805.mbox/%3CCAMZk55ZTJA7MkCK1Qu4gLPu1P9neqCfHZtTcgLfrFjfO4Xv5YQ%40mail.gmail.com%3E
> > [2]
> >
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201810.mbox/%3CDC070534-0782-4AFD-8A85-8A82B384B8F7%40gmail.com%3E
> >
> > [3] https://issues.apache.org/jira/browse/FLINK-8003
> > [4]
> >
> >
> http://mail-archives.apache.org/mod_mbox/flink-dev/201810.mbox/%3c6676cb66-6f31-23e1-eff5-2e9c19f88...@apache.org%3E
> >
> >
> > On Fri, Nov 2, 2018 at 10:22 AM Bowen Li  wrote:
> >
> > > Thanks Shuyi!
> > >
> > > I left some comments there. I think the design of SQL DDL and
> Flink-Hive
> > > integration/External catalog enhancements will work closely with each
> > > other. Hope we are well aligned on the directions of the two designs,
> > and I
> > > look forward to working with you guys on both!
> > >
> > > Bowen
> > >
> > >
> > > On Thu, Nov 1, 2018 at 10:57 PM Shuyi Chen  wrote:
> > >
> > > > Hi everyone,
> > > >
> > > > SQL DDL support has been a long-time ask from the community. Current
> > > Flink
> > > > SQL support only DML (e.g. SELECT and INSERT statements). In its
> > current
> > > > form, Flink SQL users still need to define/create table sources and
> > sinks
> > > > programmatically in Java/Scala. Also, in SQL Client, without DDL
> > support,
> > > > the current implementation does not allow dynamical creation of
> table,
> > > type
> > > > or functions with SQL, this adds friction for its adoption.
> > > >
> > > > I drafted a design doc [1] with a few other community members that
> > > proposes
> > > > the design and implementation for adding DDL support in Flink. The
> > > initial
> > > > design considers DDL for table, view, type, library and function. It
> > will
> > > > be great 

Re: [DISCUSS] Long-term goal of making flink-table Scala-free

2018-11-22 Thread Zhang, Xuefu
Hi Timo,

Thanks for the effort and the Google writeup. During our external catalog 
rework, we found much confusion between Java and Scala, and this Scala-free 
roadmap should greatly mitigate that.

I'm wondering that whether we can have rule in the interim when Java and Scala 
coexist that dependency can only be one-way. I found that in the current code 
base there are cases where a Scala class extends Java and vise versa. This is 
quite painful. I'm thinking if we could say that extension can only be from 
Java to Scala, which will help the situation. However, I'm not sure if this is 
practical.

Thanks,
Xuefu


--
Sender:jincheng sun 
Sent at:2018 Nov 23 (Fri) 09:49
Recipient:dev 
Subject:Re: [DISCUSS] Long-term goal of making flink-table Scala-free

Hi Timo,
Thanks for initiating this great discussion.

Currently when using SQL/TableAPI should include many dependence. In
particular, it is not necessary to introduce the specific implementation
dependencies which users do not care about. So I am glad to see your
proposal, and hope when we consider splitting the API interface into a
separate module, so that the user can introduce minimum of dependencies.

So, +1 to [separation of interface and implementation; e.g. `Table` &
`TableImpl`] which you mentioned in the google doc.
Best,
Jincheng

Xiaowei Jiang  于2018年11月22日周四 下午10:50写道:

> Hi Timo, thanks for driving this! I think that this is a nice thing to do.
> While we are doing this, can we also keep in mind that we want to
> eventually have a TableAPI interface only module which users can take
> dependency on, but without including any implementation details?
>
> Xiaowei
>
> On Thu, Nov 22, 2018 at 6:37 PM Fabian Hueske  wrote:
>
> > Hi Timo,
> >
> > Thanks for writing up this document.
> > I like the new structure and agree to prioritize the porting of the
> > flink-table-common classes.
> > Since flink-table-runtime is (or should be) independent of the API and
> > planner modules, we could start porting these classes once the code is
> > split into the new module structure.
> > The benefits of a Scala-free flink-table-runtime would be a Scala-free
> > execution Jar.
> >
> > Best, Fabian
> >
> >
> > Am Do., 22. Nov. 2018 um 10:54 Uhr schrieb Timo Walther <
> > twal...@apache.org
> > >:
> >
> > > Hi everyone,
> > >
> > > I would like to continue this discussion thread and convert the outcome
> > > into a FLIP such that users and contributors know what to expect in the
> > > upcoming releases.
> > >
> > > I created a design document [1] that clarifies our motivation why we
> > > want to do this, how a Maven module structure could look like, and a
> > > suggestion for a migration plan.
> > >
> > > It would be great to start with the efforts for the 1.8 release such
> > > that new features can be developed in Java and major refactorings such
> > > as improvements to the connectors and external catalog support are not
> > > blocked.
> > >
> > > Please let me know what you think.
> > >
> > > Regards,
> > > Timo
> > >
> > > [1]
> > >
> > >
> >
> https://docs.google.com/document/d/1PPo6goW7tOwxmpFuvLSjFnx7BF8IVz0w3dcmPPyqvoY/edit?usp=sharing
> > >
> > >
> > > Am 02.07.18 um 17:08 schrieb Fabian Hueske:
> > > > Hi Piotr,
> > > >
> > > > thanks for bumping this thread and thanks for Xingcan for the
> comments.
> > > >
> > > > I think the first step would be to separate the flink-table module
> into
> > > > multiple sub modules. These could be:
> > > >
> > > > - flink-table-api: All API facing classes. Can be later divided
> further
> > > > into Java/Scala Table API/SQL
> > > > - flink-table-planning: involves all planning (basically everything
> we
> > do
> > > > with Calcite)
> > > > - flink-table-runtime: the runtime code
> > > >
> > > > IMO, a realistic mid-term goal is to have the runtime module and
> > certain
> > > > parts of the planning module ported to Java.
> > > > The api module will be much harder to port because of several
> > > dependencies
> > > > to Scala core classes (the parser framework, tree iterations, etc.).
> > I'm
> > > > not saying we should not port this to Java, but it is not clear to me
> > > (yet)
> > > > how to do it.
> > > >
> > > > I think flink-table-runtime should not be too hard to port. The code
> > does
> > > > not make use of many Scala features, i.e., it's writing very
> Java-like.
> > > > Also, there are not many dependencies and operators can be
> individually
> > > > ported step-by-step.
> > > > For flink-table-planning, we can have certain packages that we port
> to
> > > Java
> > > > like planning rules or plan nodes. The related classes mostly extend
> > > > Calcite's Java interfaces/classes and would be natural choices for
> > being
> > > > ported. The code generation classes will require more effort to port.
> > > There
> > > > are also some dependencies in planning on the api module that we
> would
> > > need
> > > > to resolve somehow.
> > > >
> > > 

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-18 Thread Zhang, Xuefu
Hi Xiaowei,

Thanks for bringing up the question. In the current design, the properties for 
meta objects are meant to cover anything that's specific to a particular 
catalog and agnostic to Flink. Anything that is common (such as schema for 
tables, query text for views, and udf classname) are abstracted as members of 
the respective classes. However, this is still in discussion, and Timo and I 
will go over this and provide an update.

Please note that UDF is a little more involved than what the current design doc 
shows. I'm still refining this part.

Thanks,
Xuefu


--
Sender:Xiaowei Jiang 
Sent at:2018 Nov 18 (Sun) 15:17
Recipient:dev 
Cc:Xuefu ; twalthr ; piotr 
; Fabian Hueske ; suez1224 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Thanks Xuefu for the detailed design doc! One question on the properties 
associated with the catalog objects. Are we going to leave them completely free 
form or we are going to set some standard for that? I think that the answer may 
depend on if we want to explore catalog specific optimization opportunities. In 
any case, I think that it might be helpful for standardize as much as possible 
into strongly typed classes and use leave these properties for catalog specific 
things. But I think that we can do it in steps.

Xiaowei
On Fri, Nov 16, 2018 at 4:00 AM Bowen Li  wrote:
Thanks for keeping on improving the overall design, Xuefu! It looks quite
 good to me now.

 Would be nice that cc-ed Flink committers can help to review and confirm!



 One minor suggestion: Since the last section of design doc already touches
 some new sql statements, shall we add another section in our doc and
 formalize the new sql statements in SQL Client and TableEnvironment that
 are gonna come along naturally with our design? Here are some that the
 design doc mentioned and some that I came up with:

 To be added:

- USE  - set default catalog
- USE  - set default schema
- SHOW CATALOGS - show all registered catalogs
- SHOW SCHEMAS [FROM catalog] - list schemas in the current default
catalog or the specified catalog
- DESCRIBE VIEW view - show the view's definition in CatalogView
- SHOW VIEWS [FROM schema/catalog.schema] - show views from current or a
specified schema.

(DDLs that can be addressed by either our design or Shuyi's DDL design)

- CREATE/DROP/ALTER SCHEMA schema
- CREATE/DROP/ALTER CATALOG catalog

 To be modified:

- SHOW TABLES [FROM schema/catalog.schema] - show tables from current or
a specified schema. Add 'from schema' to existing 'SHOW TABLES' statement
- SHOW FUNCTIONS [FROM schema/catalog.schema] - show functions from
current or a specified schema. Add 'from schema' to existing 'SHOW TABLES'
statement'


 Thanks, Bowen



 On Wed, Nov 14, 2018 at 10:39 PM Zhang, Xuefu 
 wrote:

 > Thanks, Bowen, for catching the error. I have granted comment permission
 > with the link.
 >
 > I also updated the doc with the latest class definitions. Everyone is
 > encouraged to review and comment.
 >
 > Thanks,
 > Xuefu
 >
 > --
 > Sender:Bowen Li 
 > Sent at:2018 Nov 14 (Wed) 06:44
 > Recipient:Xuefu 
 > Cc:piotr ; dev ; Shuyi
 > Chen 
 > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >
 > Hi Xuefu,
 >
 > Currently the new design doc
 > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit>
 > is on “view only" mode, and people cannot leave comments. Can you please
 > change it to "can comment" or "can edit" mode?
 >
 > Thanks, Bowen
 >
 >
 > On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu 
 > wrote:
 > Hi Piotr
 >
 > I have extracted the API portion of  the design and the google doc is here
 > <https://docs.google.com/document/d/1Y9it78yaUvbv4g572ZK_lZnZaAGjqwM_EhjdOv4yJtw/edit?usp=sharing>.
 > Please review and provide your feedback.
 >
 > Thanks,
 > Xuefu
 >
 > --
 > Sender:Xuefu 
 > Sent at:2018 Nov 12 (Mon) 12:43
 > Recipient:Piotr Nowojski ; dev <
 > dev@flink.apache.org>
 > Cc:Bowen Li ; Shuyi Chen 
 > Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
 >
 > Hi Piotr,
 >
 > That sounds good to me. Let's close all the open questions ((there are a
 > couple of them)) in the Google doc and I should be able to quickly split
 > it into the three proposals as you suggested.
 >
 > Thanks,
 > Xuefu
 >
 > --
 > Sender:Piotr Nowojski 
 > Sent at:2018 Nov 9 (Fri) 22:46
 > Recipient:dev ; Xuefu 
 > Cc:Bowen Li ; Shuyi Chen 
 > Subject:Re: [DISCUSS] Integrate Flin

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-14 Thread Zhang, Xuefu
Thanks, Bowen, for catching the error. I have granted comment permission with 
the link.

I also updated the doc with the latest class definitions. Everyone is 
encouraged to review and comment.

Thanks,
Xuefu


--
Sender:Bowen Li 
Sent at:2018 Nov 14 (Wed) 06:44
Recipient:Xuefu 
Cc:piotr ; dev ; Shuyi Chen 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Currently the new design doc is on “view only" mode, and people cannot leave 
comments. Can you please change it to "can comment" or "can edit" mode?

Thanks, Bowen


On Mon, Nov 12, 2018 at 9:51 PM Zhang, Xuefu  wrote:

Hi Piotr

I have extracted the API portion of  the design and the google doc is here. 
Please review and provide your feedback.

Thanks,
Xuefu

--
Sender:Xuefu 
Sent at:2018 Nov 12 (Mon) 12:43
Recipient:Piotr Nowojski ; dev 
Cc:Bowen Li ; Shuyi Chen 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Piotr,

That sounds good to me. Let's close all the open questions ((there are a couple 
of them)) in the Google doc and I should be able to quickly split it into the 
three proposals as you suggested.

Thanks,
Xuefu

--
Sender:Piotr Nowojski 
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev ; Xuefu 
Cc:Bowen Li ; Shuyi Chen 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggests if 
we can split it further? Maybe changes in the interface in one doc, reading 
from hive meta store another and final storing our meta informations in hive 
meta store?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu  wrote:
> 
> Hi Piotr,
> 
> That seems to be good idea!
> 
> Since the google doc for the design is currently under extensive review, I 
> will leave it as it is for now. However, I'll convert it to two different 
> FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> --
> Sender:Piotr Nowojski 
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev 
> Cc:Bowen Li ; Xuefu ; Shuyi 
> Chen 
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into couple of smaller 
> ones, hopefully independent. The questions that you have asked Fabian have 
> for example very little to do with reading metadata from Hive Meta Store?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into to a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
&

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-12 Thread Zhang, Xuefu
Hi Piotr

I have extracted the API portion of  the design and the google doc is here. 
Please review and provide your feedback.

Thanks,
Xuefu


--
Sender:Xuefu 
Sent at:2018 Nov 12 (Mon) 12:43
Recipient:Piotr Nowojski ; dev 
Cc:Bowen Li ; Shuyi Chen 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Piotr,

That sounds good to me. Let's close all the open questions ((there are a couple 
of them)) in the Google doc and I should be able to quickly split it into the 
three proposals as you suggested.

Thanks,
Xuefu


--
Sender:Piotr Nowojski 
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev ; Xuefu 
Cc:Bowen Li ; Shuyi Chen 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggests if 
we can split it further? Maybe changes in the interface in one doc, reading 
from hive meta store another and final storing our meta informations in hive 
meta store?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu  wrote:
> 
> Hi Piotr,
> 
> That seems to be good idea!
> 
> Since the google doc for the design is currently under extensive review, I 
> will leave it as it is for now. However, I'll convert it to two different 
> FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> --
> Sender:Piotr Nowojski 
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev 
> Cc:Bowen Li ; Xuefu ; Shuyi 
> Chen 
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into couple of smaller 
> ones, hopefully independent. The questions that you have asked Fabian have 
> for example very little to do with reading metadata from Hive Meta Store?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into to a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu 
>>>> wrote:
>>>> 
>>>>> Hi Shuiyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> -

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-11 Thread Zhang, Xuefu
Hi Piotr,

That sounds good to me. Let's close all the open questions ((there are a couple 
of them)) in the Google doc and I should be able to quickly split it into the 
three proposals as you suggested.

Thanks,
Xuefu


--
Sender:Piotr Nowojski 
Sent at:2018 Nov 9 (Fri) 22:46
Recipient:dev ; Xuefu 
Cc:Bowen Li ; Shuyi Chen 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Yes, it seems like the best solution. Maybe someone else can also suggests if 
we can split it further? Maybe changes in the interface in one doc, reading 
from hive meta store another and final storing our meta informations in hive 
meta store?

Piotrek

> On 9 Nov 2018, at 01:44, Zhang, Xuefu  wrote:
> 
> Hi Piotr,
> 
> That seems to be good idea!
> 
> Since the google doc for the design is currently under extensive review, I 
> will leave it as it is for now. However, I'll convert it to two different 
> FLIPs when the time comes.
> 
> How does it sound to you?
> 
> Thanks,
> Xuefu
> 
> 
> --
> Sender:Piotr Nowojski 
> Sent at:2018 Nov 9 (Fri) 02:31
> Recipient:dev 
> Cc:Bowen Li ; Xuefu ; Shuyi 
> Chen 
> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
> 
> Hi,
> 
> Maybe we should split this topic (and the design doc) into couple of smaller 
> ones, hopefully independent. The questions that you have asked Fabian have 
> for example very little to do with reading metadata from Hive Meta Store?
> 
> Piotrek 
> 
>> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
>> 
>> Hi Xuefu and all,
>> 
>> Thanks for sharing this design document!
>> I'm very much in favor of restructuring / reworking the catalog handling in
>> Flink SQL as outlined in the document.
>> Most changes described in the design document seem to be rather general and
>> not specifically related to the Hive integration.
>> 
>> IMO, there are some aspects, especially those at the boundary of Hive and
>> Flink, that need a bit more discussion. For example
>> 
>> * What does it take to make Flink schema compatible with Hive schema?
>> * How will Flink tables (descriptors) be stored in HMS?
>> * How do both Hive catalogs differ? Could they be integrated into to a
>> single one? When to use which one?
>> * What meta information is provided by HMS? What of this can be leveraged
>> by Flink?
>> 
>> Thank you,
>> Fabian
>> 
>> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
>> 
>>> After taking a look at how other discussion threads work, I think it's
>>> actually fine just keep our discussion here. It's up to you, Xuefu.
>>> 
>>> The google doc LGTM. I left some minor comments.
>>> 
>>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>>> 
>>>> Hi all,
>>>> 
>>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>>> ... Hive integration design ..." on only dev mailing list for community
>>>> devs to review. The current thread sends to both dev and user list.
>>>> 
>>>> This email thread is more like validating the general idea and direction
>>>> with the community, and it's been pretty long and crowded so far. Since
>>>> everyone is pro for the idea, we can move forward with another thread to
>>>> discuss and finalize the design.
>>>> 
>>>> Thanks,
>>>> Bowen
>>>> 
>>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu 
>>>> wrote:
>>>> 
>>>>> Hi Shuiyi,
>>>>> 
>>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>>> link:
>>>>> 
>>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>>> 
>>>>> Thanks,
>>>>> Xuefu
>>>>> 
>>>>> 
>>>>> 
>>>>> --
>>>>> Sender:Shuyi Chen 
>>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>>> Recipient:Xuefu 
>>>>> Cc:vino yang ; Fabian Hueske ;
>>>>> dev ; user 
>>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>>> 
>>&g

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-11-08 Thread Zhang, Xuefu
Hi Piotr,

That seems to be good idea!

Since the google doc for the design is currently under extensive review, I will 
leave it as it is for now. However, I'll convert it to two different FLIPs when 
the time comes.

How does it sound to you?

Thanks,
Xuefu


--
Sender:Piotr Nowojski 
Sent at:2018 Nov 9 (Fri) 02:31
Recipient:dev 
Cc:Bowen Li ; Xuefu ; Shuyi Chen 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi,

Maybe we should split this topic (and the design doc) into couple of smaller 
ones, hopefully independent. The questions that you have asked Fabian have for 
example very little to do with reading metadata from Hive Meta Store?

Piotrek 

> On 7 Nov 2018, at 14:27, Fabian Hueske  wrote:
> 
> Hi Xuefu and all,
> 
> Thanks for sharing this design document!
> I'm very much in favor of restructuring / reworking the catalog handling in
> Flink SQL as outlined in the document.
> Most changes described in the design document seem to be rather general and
> not specifically related to the Hive integration.
> 
> IMO, there are some aspects, especially those at the boundary of Hive and
> Flink, that need a bit more discussion. For example
> 
> * What does it take to make Flink schema compatible with Hive schema?
> * How will Flink tables (descriptors) be stored in HMS?
> * How do both Hive catalogs differ? Could they be integrated into to a
> single one? When to use which one?
> * What meta information is provided by HMS? What of this can be leveraged
> by Flink?
> 
> Thank you,
> Fabian
> 
> Am Fr., 2. Nov. 2018 um 00:31 Uhr schrieb Bowen Li :
> 
>> After taking a look at how other discussion threads work, I think it's
>> actually fine just keep our discussion here. It's up to you, Xuefu.
>> 
>> The google doc LGTM. I left some minor comments.
>> 
>> On Thu, Nov 1, 2018 at 10:17 AM Bowen Li  wrote:
>> 
>>> Hi all,
>>> 
>>> As Xuefu has published the design doc on google, I agree with Shuyi's
>>> suggestion that we probably should start a new email thread like "[DISCUSS]
>>> ... Hive integration design ..." on only dev mailing list for community
>>> devs to review. The current thread sends to both dev and user list.
>>> 
>>> This email thread is more like validating the general idea and direction
>>> with the community, and it's been pretty long and crowded so far. Since
>>> everyone is pro for the idea, we can move forward with another thread to
>>> discuss and finalize the design.
>>> 
>>> Thanks,
>>> Bowen
>>> 
>>> On Wed, Oct 31, 2018 at 12:16 PM Zhang, Xuefu 
>>> wrote:
>>> 
>>>> Hi Shuiyi,
>>>> 
>>>> Good idea. Actually the PDF was converted from a google doc. Here is its
>>>> link:
>>>> 
>>>> https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
>>>> Once we reach an agreement, I can convert it to a FLIP.
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> 
>>>> 
>>>> --
>>>> Sender:Shuyi Chen 
>>>> Sent at:2018 Nov 1 (Thu) 02:47
>>>> Recipient:Xuefu 
>>>> Cc:vino yang ; Fabian Hueske ;
>>>> dev ; user 
>>>> Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem
>>>> 
>>>> Hi Xuefu,
>>>> 
>>>> Thanks a lot for driving this big effort. I would suggest convert your
>>>> proposal and design doc into a google doc, and share it on the dev mailing
>>>> list for the community to review and comment with title like "[DISCUSS] ...
>>>> Hive integration design ..." . Once approved,  we can document it as a FLIP
>>>> (Flink Improvement Proposals), and use JIRAs to track the implementations.
>>>> What do you think?
>>>> 
>>>> Shuyi
>>>> 
>>>> On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu 
>>>> wrote:
>>>> Hi all,
>>>> 
>>>> I have also shared a design doc on Hive metastore integration that is
>>>> attached here and also to FLINK-10556[1]. Please kindly review and share
>>>> your feedback.
>>>> 
>>>> 
>>>> Thanks,
>>>> Xuefu
>>>> 
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10556
>>>> --
>>>> Sende

Confluence permission for FLIP creation

2018-11-05 Thread Zhang, Xuefu
Hi there, 

As communicated in an email thread, I'm proposing Flink-Hive metastore 
integration. I have a draft design doc that I'd like to convert it to a FLIP. 
Thus, it would be great if anyone who can grant me the write access to 
Confluence. My Confluence ID is xuefu.

@Timo Waltherand @Fabian Hueske, it would be nice if any of you can help on 
this.

Thanks,
Xuefu



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-31 Thread Zhang, Xuefu
Hi Shuiyi,

Good idea. Actually the PDF was converted from a google doc. Here is its link:
https://docs.google.com/document/d/1SkppRD_rE3uOKSN-LuZCqn4f7dz0zW5aa6T_hBZq5_o/edit?usp=sharing
Once we reach an agreement, I can convert it to a FLIP.

Thanks,
Xuefu




--
Sender:Shuyi Chen 
Sent at:2018 Nov 1 (Thu) 02:47
Recipient:Xuefu 
Cc:vino yang ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks a lot for driving this big effort. I would suggest convert your proposal 
and design doc into a google doc, and share it on the dev mailing list for the 
community to review and comment with title like "[DISCUSS] ... Hive integration 
design ..." . Once approved,  we can document it as a FLIP (Flink Improvement 
Proposals), and use JIRAs to track the implementations. What do you think?

Shuyi
On Tue, Oct 30, 2018 at 11:32 AM Zhang, Xuefu  wrote:

Hi all,

I have also shared a design doc on Hive metastore integration that is attached 
here and also to FLINK-10556[1]. Please kindly review and share your feedback.


Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556
--
Sender:Xuefu 
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu ; Shuyi Chen 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which 
is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to 
track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556


--
Sender:Xuefu 
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agreed with a phased approach and like to move 
forward fast. :) We did some work internally on DDL utilizing babel parser in 
Calcite. While babel makes Calcite's grammar extensible, at first impression it 
still seems too cumbersome for a project when too much extensions are made. 
It's even challenging to find where the extension is needed! It would be 
certainly better if Calcite can magically support Hive QL by just turning on a 
flag, such as that for MYSQL_5. I can also see that this could mean a lot of 
work on Calcite. Nevertheless, I will bring up the discussion over there and to 
see what their community thinks.

Would mind to share more info about the proposal on DDL that you mentioned? We 
can certainly collaborate on this.

Thanks,
Xuefu

--
Sender:Shuyi Chen 
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the 
proposal can be divided into 2 stages: making Flink to support Hive features, 
and make Hive to work with Flink. I agreed with Timo that on starting with a 
smaller scope, so we can make progress faster. As for [6], a proposal for DDL 
is already in progress, and will come after the unified SQL connector API is 
done. For supporting Hive syntax, we might need to work with the Calcite 
community, and a recent effort called babel 
(https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu  wrote:
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) 
created by Hive can be understood by Flink and the reverse direction is true 
also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its 
own implementation or make Hive's implementation work in Flink. Further, for 
user created UDFs in Hive, Flink SQL should provide a mechanism allowing user 
to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in 
Hive.
6. SQL Language - Flink SQL should s

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-30 Thread Zhang, Xuefu
Hi all,

I have also shared a design doc on Hive metastore integration that is attached 
here and also to FLINK-10556[1]. Please kindly review and share your feedback.


Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556
--
Sender:Xuefu 
Sent at:2018 Oct 25 (Thu) 01:08
Recipient:Xuefu ; Shuyi Chen 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which 
is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to 
track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556



--
Sender:Xuefu 
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agreed with a phased approach and like to move 
forward fast. :) We did some work internally on DDL utilizing babel parser in 
Calcite. While babel makes Calcite's grammar extensible, at first impression it 
still seems too cumbersome for a project when too much extensions are made. 
It's even challenging to find where the extension is needed! It would be 
certainly better if Calcite can magically support Hive QL by just turning on a 
flag, such as that for MYSQL_5. I can also see that this could mean a lot of 
work on Calcite. Nevertheless, I will bring up the discussion over there and to 
see what their community thinks.

Would mind to share more info about the proposal on DDL that you mentioned? We 
can certainly collaborate on this.

Thanks,
Xuefu

--
Sender:Shuyi Chen 
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the 
proposal can be divided into 2 stages: making Flink to support Hive features, 
and make Hive to work with Flink. I agreed with Timo that on starting with a 
smaller scope, so we can make progress faster. As for [6], a proposal for DDL 
is already in progress, and will come after the unified SQL connector API is 
done. For supporting Hive syntax, we might need to work with the Calcite 
community, and a recent effort called babel 
(https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu  wrote:
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) 
created by Hive can be understood by Flink and the reverse direction is true 
also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its 
own implementation or make Hive's implementation work in Flink. Further, for 
user created UDFs in Hive, Flink SQL should provide a mechanism allowing user 
to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in 
Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with 
extension to support Hive's syntax and language features, around DDL, DML, and 
SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in 
thrift APIs, such that HiveServer2 users can reuse their existing client (such 
as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage 
handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all 
layers in Flink. However, a short-term goal could  include only core areas 
(such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-24 Thread Zhang, Xuefu
Hi all,

To wrap up the discussion, I have attached a PDF describing the proposal, which 
is also attached to FLINK-10556 [1]. Please feel free to watch that JIRA to 
track the progress.

Please also let me know if you have additional comments or questions.

Thanks,
Xuefu

[1] https://issues.apache.org/jira/browse/FLINK-10556



--
Sender:Xuefu 
Sent at:2018 Oct 16 (Tue) 03:40
Recipient:Shuyi Chen 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Shuyi,

Thank you for your input. Yes, I agreed with a phased approach and like to move 
forward fast. :) We did some work internally on DDL utilizing babel parser in 
Calcite. While babel makes Calcite's grammar extensible, at first impression it 
still seems too cumbersome for a project when too much extensions are made. 
It's even challenging to find where the extension is needed! It would be 
certainly better if Calcite can magically support Hive QL by just turning on a 
flag, such as that for MYSQL_5. I can also see that this could mean a lot of 
work on Calcite. Nevertheless, I will bring up the discussion over there and to 
see what their community thinks.

Would mind to share more info about the proposal on DDL that you mentioned? We 
can certainly collaborate on this.

Thanks,
Xuefu


--
Sender:Shuyi Chen 
Sent at:2018 Oct 14 (Sun) 08:30
Recipient:Xuefu 
Cc:yanghua1127 ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Welcome to the community and thanks for the great proposal, Xuefu! I think the 
proposal can be divided into 2 stages: making Flink to support Hive features, 
and make Hive to work with Flink. I agreed with Timo that on starting with a 
smaller scope, so we can make progress faster. As for [6], a proposal for DDL 
is already in progress, and will come after the unified SQL connector API is 
done. For supporting Hive syntax, we might need to work with the Calcite 
community, and a recent effort called babel 
(https://issues.apache.org/jira/browse/CALCITE-2280) in Calcite might help here.

Thanks
Shuyi
On Wed, Oct 10, 2018 at 8:02 PM Zhang, Xuefu  wrote:
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) 
created by Hive can be understood by Flink and the reverse direction is true 
also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its 
own implementation or make Hive's implementation work in Flink. Further, for 
user created UDFs in Hive, Flink SQL should provide a mechanism allowing user 
to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in 
Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with 
extension to support Hive's syntax and language features, around DDL, DML, and 
SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in 
thrift APIs, such that HiveServer2 users can reuse their existing client (such 
as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage 
handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all 
layers in Flink. However, a short-term goal could  include only core areas 
(such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the 
right direction, I could come up with a formal proposal quickly and then we can 
follow up with broader discussions.

Thanks,
Xuefu



--
Sender:vino yang 
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske 
Cc:dev ; Xuefu ; user 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, it would look better

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-15 Thread Zhang, Xuefu
Hi Bowen,

Thank you for your feedback and interest in the project. Your contribution is 
certainly welcome. Per your suggestion, I have created an Uber JIRA 
(https://issues.apache.org/jira/browse/FLINK-10556) to track our overall effort 
on this. For each subtask, we'd like to see a short description on the status 
quo and what is planned to add or change. Design doc should be provided when 
it's deemed necessary.

I'm looking forward to seeing your contributions!

Thanks,
Xuefu



Thanks,
Xuefu 


--
Sender:Bowen 
Sent at:2018 Oct 13 (Sat) 21:55
Recipient:Xuefu ; Fabian Hueske 
Cc:dev ; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem


Thank you Xuefu, for bringing up this awesome, detailed proposal! It will 
resolve lots of existing pain for users like me.

In general, I totally agree that improving FlinkSQL's completeness would be a 
much better start point than building 'Hive on Flink', as the Hive community is 
concerned about Flink's SQL incompleteness and lack of proven batch performance 
shown in https://issues.apache.org/jira/browse/HIVE-10712. Improving FlinkSQL 
seems a more natural direction to start with in order to achieve the 
integration.

Xuefu and Timo has laid a quite clear path of what to tackle next. Given that 
there're already some efforts going on, for item 1,2,5,3,4,6 in Xuefu's list, 
shall we:

identify gaps between a) Xuefu's proposal/discussion result in this thread and 
b) all the ongoing work/discussions?
then, create some new top-level JIRA tickets to keep track of and start more 
detailed discussions with?
It's gonna be a great and influential project , and I'd love to participate 
into it to move FlinkSQL's adoption and ecosystem even further.

Thanks,
Bowen


在 2018年10月12日,下午3:37,Jörn Franke  写道:


Thank you very nice , I fully agree with that. 

Am 11.10.2018 um 19:31 schrieb Zhang, Xuefu :

Hi Jörn,

Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact it 
is one of the two approaches that I named in the beginning of the thread. As 
also pointed out there, this isn't mutually exclusive from work we proposed 
inside Flink and they target at different user groups and user cases. Further, 
what we proposed to do in Flink should be a good showcase that demonstrate 
Flink's capabilities in batch processing and convince Hive community of the 
worth of a new engine. As you might know, the idea encountered some doubt and 
resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we 
will execute once Flink SQL is in a good shape.

I also agree with you that Flink SQL shouldn't be closely coupled with Hive. 
While we mentioned Hive in many of the proposed items, most of them are coupled 
only in concepts and functionality rather than code or libraries. We are taking 
the advantage of the connector framework in Flink. The only thing that might be 
exceptional is to support Hive built-in UDFs, which we may not make it work out 
of the box to avoid the coupling. We could, for example, require users bring in 
Hive library and register themselves. This is subject to further discussion.

#11 is about Flink runtime enhancement that is meant to make task failures more 
tolerable (so that the job don't have to start from the beginning in case of 
task failures) and to make task scheduling more resource-efficient. Flink's 
current design in those two aspects leans more to stream processing, which may 
not be good enough for batch processing. We will provide more detailed design 
when we get to them.

Please let me know if you have further thoughts or feedback.

Thanks,
Xuefu


--
Sender:Jörn Franke 
Sent at:2018 Oct 11 (Thu) 13:54
Recipient:Xuefu 
Cc:vino yang ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Would it maybe make sense to provide Flink as an engine on Hive 
(„flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely 
coupled than integrating hive in all possible flink core modules and thus 
introducing a very tight dependency to Hive in the core.
1,2,3 could be achieved via a connector based on the Flink Table API.
Just as a proposal to start this Endeavour as independent projects (hive 
engine, connector) to avoid too tight coupling with Flink. Maybe in a more 
distant future if the Hive integration is heavily demanded one could then 
integrate it more tightly if needed. 

What is meant by 11?
Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu :

Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both

Re: Become a contributor

2018-10-12 Thread Zhang, Xuefu
Thank you very much, Fabian! It seems working for me now.

Regards,
Xuefu


--
Sender:Fabian Hueske 
Sent at:2018 Oct 12 (Fri) 15:45
Recipient:dev ; Xuefu 
Subject:Re: Become a contributor

Hi Xuefu,

I gave (hopefully) your Jira user (xuefuz) Contributor permissions for Flink's 
Jira.
You can now assign issues to yourself.

Best, Fabian

Am Fr., 12. Okt. 2018 um 01:18 Uhr schrieb Zhang, Xuefu 
:
Hi there,

 Could anyone kindly add me as a contributor to Flink project?

 Thanks,
 Xuefu




Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-11 Thread Zhang, Xuefu
Hi Taher,

Thank you for your input. I think you emphasized two important points:

1. Hive metastore could be used for storing Flink metadata
2. There are some usability issues around Flink SQL configuration

I think we all agree on #1. #2 may be well true and the usability should be 
improved. However, I'm afraid that this is orthogonal to Hive integration and 
the proposed solution might be just one of the possible solutions. On the 
surface, the extensions you proposed seem going beyond the syntax and semantics 
of SQL language in general.

I don't disagree on the value of your proposal. I guess it's better to solve #1 
first and leave #2 for follow-up discussions. How does this sound to you?

Thanks,
Xuefu


--
Sender:Taher Koitawala 
Sent at:2018 Oct 12 (Fri) 10:06
Recipient:Xuefu 
Cc:Rong Rong ; Timo Walther ; dev 
; jornfranke ; vino yang 
; Fabian Hueske ; user 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

One other thought on the same lines was to use hive tables to store kafka 
information to process streaming tables. Something like 

"create table streaming_table (
bootstrapServers string,
topic string, keySerialiser string, ValueSerialiser string)"

Insert into streaming_table 
values(,"10.17.1.1:9092,10.17.2.2:9092,10.17.3.3:9092", "KafkaTopicName", 
"SimpleStringSchema", "SimpleSchemaString");

Create table processingtable(
//Enter fields here which match the kafka records schema);

Now we make a custom clause called something like "using"

The way we use this is:

Using streaming_table as configuration select count(*) from processingtable as 
streaming;


This way users can now pass Flink SQL info easily and get rid of the Flink SQL 
configuration file all together. This is simple and easy to understand and I 
think most users would follow this.

Thanks, 
Taher Koitawala 
On Fri 12 Oct, 2018, 7:24 AM Taher Koitawala,  wrote:

I think integrating Flink with Hive would be an amazing option and also to get 
Flink's SQL up to pace would be amazing. 

Current Flink Sql syntax to prepare and process a table is too verbose, users 
manually need to retype table definitions and that's a pain. Hive metastore 
integration should be done through, many users are okay defining their table 
schemas in Hive as it is easy to main, change or even migrate. 

Also we could simply choosing batch and stream there with simply something like 
a "process as" clause. 

select count(*) from flink_mailing_list process as stream;

select count(*) from flink_mailing_list process as batch;

This way we could completely get rid of Flink SQL configuration files. 

Thanks,
Taher Koitawala 

Integrating 
On Fri 12 Oct, 2018, 2:35 AM Zhang, Xuefu,  wrote:
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some 
of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink plan 
optimization, but it's not part of compatibility concern (yet).
2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work in 
Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make remote server 
compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC driver in question is for the remote server that Flink 
provides. It's usually the servicer owner who provides drivers to their 
services. We weren't talking about JDBC/ODBC driver to external DB systems.

Let me know if you have further questions.

Thanks,
Xuefu

--
Sender:Rong Rong 
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther 
Cc:dev ; jornfranke ; Xuefu 
; vino yang ; Fabian Hueske 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top 
of Timo's comments.
1,2. I agree with Timo that a proper catalog support should also address the 
metadata compatibility issues. I was actually wondering if you are referring to 
something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDF without code changes to Flink 
UDF, it shouldn't be a problem as Timo mentioned. Is your concern mostly on the 
support of Hive UDFs that should be implemented in Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related components 
might have already been discussed in the longer term road map of FLIP-24 [1]?
9. per Jorn's comment to stay clear from a tight dependency on Hive and treat 
it as one "connector" system. Should we also consider treating JDBC/ODBC driver 
as part of the component from the connector system instead of having Flink to 
provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLI

Become a contributor

2018-10-11 Thread Zhang, Xuefu
Hi there,

Could anyone kindly add me as a contributor to Flink project?

Thanks,
Xuefu



Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-11 Thread Zhang, Xuefu
Hi Rong,

Thanks for your feedback. Some of my earlier comments might have addressed some 
of your points, so here I'd like to cover some specifics.

1. Yes, I expect that table stats stored in Hive will be used in Flink plan 
optimization, but it's not part of compatibility concern (yet).
2. Both implementing Hive UDFs in Flink natively and making Hive UDFs work in 
Flink are considered.
3. I am aware of FLIP-24, but here the proposal is to make remote server 
compatible with HiveServer2. They are not mutually exclusive either.
4. The JDBC/ODBC driver in question is for the remote server that Flink 
provides. It's usually the servicer owner who provides drivers to their 
services. We weren't talking about JDBC/ODBC driver to external DB systems.

Let me know if you have further questions.

Thanks,
Xuefu


--
Sender:Rong Rong 
Sent at:2018 Oct 12 (Fri) 01:52
Recipient:Timo Walther 
Cc:dev ; jornfranke ; Xuefu 
; vino yang ; Fabian Hueske 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu, 

Thanks for putting together the overview. I would like to add some more on top 
of Timo's comments.
1,2. I agree with Timo that a proper catalog support should also address the 
metadata compatibility issues. I was actually wondering if you are referring to 
something like utilizing table stats for plan optimization?
4. If the key is to have users integrate Hive UDF without code changes to Flink 
UDF, it shouldn't be a problem as Timo mentioned. Is your concern mostly on the 
support of Hive UDFs that should be implemented in Flink-table natively?
7,8. Correct me if I am wrong, but I feel like some of the related components 
might have already been discussed in the longer term road map of FLIP-24 [1]?
9. per Jorn's comment to stay clear from a tight dependency on Hive and treat 
it as one "connector" system. Should we also consider treating JDBC/ODBC driver 
as part of the component from the connector system instead of having Flink to 
provide them?

Thanks,
Rong

[1]. https://cwiki.apache.org/confluence/display/FLINK/FLIP-24+-+SQL+Client
On Thu, Oct 11, 2018 at 12:46 AM Timo Walther  wrote:
Hi Xuefu,

 thanks for your proposal, it is a nice summary. Here are my thoughts to 
 your list:

 1. I think this is also on our current mid-term roadmap. Flink lacks a 
 poper catalog support for a very long time. Before we can connect 
 catalogs we need to define how to map all the information from a catalog 
 to Flink's representation. This is why the work on the unified connector 
 API [1] is going on for quite some time as it is the first approach to 
 discuss and represent the pure characteristics of connectors.
 2. It would be helpful to figure out what is missing in [1] to to ensure 
 this point. I guess we will need a new design document just for a proper 
 Hive catalog integration.
 3. This is already work in progress. ORC has been merged, Parquet is on 
 its way [1].
 4. This should be easy. There was a PR in past that I reviewed but was 
 not maintained anymore.
 5. The type system of Flink SQL is very flexible. Only UNION type is 
 missing.
 6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
 Support for Hive syntax also needs cooperation with Apache Calcite.
 7-11. Long-term goals.

 I would also propose to start with a smaller scope where also current 
 Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the 
 Flink SQL ecosystem. After that we can aim to be fully compatible 
 including syntax and UDFs (4, 6 etc.). Once the core is ready, we can 
 work on the tooling (7, 8, 9) and performance (10, 11).

 @Jörn: Yes, we should not have a tight dependency on Hive. It should be 
 treated as one "connector" system out of many.

 Thanks,
 Timo

 [1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
 [2] https://github.com/apache/flink/pull/6483

 Am 11.10.18 um 07:54 schrieb Jörn Franke:
 > Would it maybe make sense to provide Flink as an engine on Hive 
 > („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely 
 > coupled than integrating hive in all possible flink core modules and thus 
 > introducing a very tight dependency to Hive in the core.
 > 1,2,3 could be achieved via a connector based on the Flink Table API.
 > Just as a proposal to start this Endeavour as independent projects (hive 
 > engine, connector) to avoid too tight coupling with Flink. Maybe in a more 
 > distant future if the Hive integration is heavily demanded one could then 
 > integrate it more tightly if needed.
 >
 > What is meant by 11?
 >> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu :
 >>
 >> Hi Fabian/Vno,
 >>
 >> Thank you very much for your encouragement inquiry. Sorry that I didn't see 
 >> Fabian's email until I read Vino's response just now. (Somehow

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-11 Thread Zhang, Xuefu
Hi Timo,

Thank you for your input. It's exciting to see that the community has already 
initiated some of the topics. We'd certainly like to leverage the current and 
previous work and make progress in phases. Here I'd like to comment on a few 
things on top of your feedback.

1. I think there are two aspects #1 and #2 with regard to Hive metastore: a) as 
an backend storage for Flink's metadata (currently in memory), and b) an 
external catalog (just like a JDBC catalog) that Flink can interact with. While 
it may be possible and would be nice if we can achieve both in a single design, 
our focus has been on the latter. We will consider both cases in our design.

2. Re #5, I agree that Flink seems having the majority of data types. However, 
supporting some of them (such as struct) at SQL layer needs work on the parser 
(Calcite).

3. Similarly for #6, work needs to be done on parsing side. We can certainly 
ask Calcite community to provide Hive dialect parsing. This can be challenging 
and time-consuming. At the same time, we can also explore the possibilities of 
solving the problem in Flink, such as using Calcite's official extension 
mechanism. We will open the discussion when we get there.

Yes, I agree with you that we should start with a small scope while keeping a 
forward thinking. Specifically, we will first look at the metadata and data 
compatibilities, data types, DDL/DML, Query, UDFs, and so on. I think we align 
well on this.

Please let me know if you have further thoughts or commends.

Thanks,
Xuefu


--
Sender:Timo Walther 
Sent at:2018 Oct 11 (Thu) 15:46
Recipient:dev ; "Jörn Franke" ; 
Xuefu 
Cc:vino yang ; Fabian Hueske ; user 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

thanks for your proposal, it is a nice summary. Here are my thoughts to 
your list:

1. I think this is also on our current mid-term roadmap. Flink lacks a 
poper catalog support for a very long time. Before we can connect 
catalogs we need to define how to map all the information from a catalog 
to Flink's representation. This is why the work on the unified connector 
API [1] is going on for quite some time as it is the first approach to 
discuss and represent the pure characteristics of connectors.
2. It would be helpful to figure out what is missing in [1] to to ensure 
this point. I guess we will need a new design document just for a proper 
Hive catalog integration.
3. This is already work in progress. ORC has been merged, Parquet is on 
its way [1].
4. This should be easy. There was a PR in past that I reviewed but was 
not maintained anymore.
5. The type system of Flink SQL is very flexible. Only UNION type is 
missing.
6. A Flink SQL DDL is on the roadmap soon once we are done with [1]. 
Support for Hive syntax also needs cooperation with Apache Calcite.
7-11. Long-term goals.

I would also propose to start with a smaller scope where also current 
Flink SQL users can profit: 1, 2, 5, 3. This would allow to grow the 
Flink SQL ecosystem. After that we can aim to be fully compatible 
including syntax and UDFs (4, 6 etc.). Once the core is ready, we can 
work on the tooling (7, 8, 9) and performance (10, 11).

@Jörn: Yes, we should not have a tight dependency on Hive. It should be 
treated as one "connector" system out of many.

Thanks,
Timo

[1] 
https://docs.google.com/document/d/1Yaxp1UJUFW-peGLt8EIidwKIZEWrrA-pznWLuvaH39Y/edit?ts=5bb62df4#
[2] https://github.com/apache/flink/pull/6483

Am 11.10.18 um 07:54 schrieb Jörn Franke:
> Would it maybe make sense to provide Flink as an engine on Hive 
> („flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely 
> coupled than integrating hive in all possible flink core modules and thus 
> introducing a very tight dependency to Hive in the core.
> 1,2,3 could be achieved via a connector based on the Flink Table API.
> Just as a proposal to start this Endeavour as independent projects (hive 
> engine, connector) to avoid too tight coupling with Flink. Maybe in a more 
> distant future if the Hive integration is heavily demanded one could then 
> integrate it more tightly if needed.
>
> What is meant by 11?
>> Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu :
>>
>> Hi Fabian/Vno,
>>
>> Thank you very much for your encouragement inquiry. Sorry that I didn't see 
>> Fabian's email until I read Vino's response just now. (Somehow Fabian's went 
>> to the spam folder.)
>>
>> My proposal contains long-term and short-terms goals. Nevertheless, the 
>> effort will focus on the following areas, including Fabian's list:
>>
>> 1. Hive metastore connectivity - This covers both read/write access, which 
>> means Flink can make full use of Hive's metastore as its catalog (at least 
>> for the batch but can extend for streaming as well).
&

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-11 Thread Zhang, Xuefu
Hi Jörn,

Thanks for your feedback. Yes, I think Hive on Flink makes sense and in fact it 
is one of the two approaches that I named in the beginning of the thread. As 
also pointed out there, this isn't mutually exclusive from work we proposed 
inside Flink and they target at different user groups and user cases. Further, 
what we proposed to do in Flink should be a good showcase that demonstrate 
Flink's capabilities in batch processing and convince Hive community of the 
worth of a new engine. As you might know, the idea encountered some doubt and 
resistance. Nevertheless, we do have a solid plan for Hive on Flink, which we 
will execute once Flink SQL is in a good shape.

I also agree with you that Flink SQL shouldn't be closely coupled with Hive. 
While we mentioned Hive in many of the proposed items, most of them are coupled 
only in concepts and functionality rather than code or libraries. We are taking 
the advantage of the connector framework in Flink. The only thing that might be 
exceptional is to support Hive built-in UDFs, which we may not make it work out 
of the box to avoid the coupling. We could, for example, require users bring in 
Hive library and register themselves. This is subject to further discussion.

#11 is about Flink runtime enhancement that is meant to make task failures more 
tolerable (so that the job don't have to start from the beginning in case of 
task failures) and to make task scheduling more resource-efficient. Flink's 
current design in those two aspects leans more to stream processing, which may 
not be good enough for batch processing. We will provide more detailed design 
when we get to them.

Please let me know if you have further thoughts or feedback.

Thanks,
Xuefu



--
Sender:Jörn Franke 
Sent at:2018 Oct 11 (Thu) 13:54
Recipient:Xuefu 
Cc:vino yang ; Fabian Hueske ; dev 
; user 
Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem


Would it maybe make sense to provide Flink as an engine on Hive 
(„flink-on-Hive“)? Eg to address 4,5,6,8,9,10. this could be more loosely 
coupled than integrating hive in all possible flink core modules and thus 
introducing a very tight dependency to Hive in the core.
1,2,3 could be achieved via a connector based on the Flink Table API.
Just as a proposal to start this Endeavour as independent projects (hive 
engine, connector) to avoid too tight coupling with Flink. Maybe in a more 
distant future if the Hive integration is heavily demanded one could then 
integrate it more tightly if needed. 

What is meant by 11?
Am 11.10.2018 um 05:01 schrieb Zhang, Xuefu :


Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) 
created by Hive can be understood by Flink and the reverse direction is true 
also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its 
own implementation or make Hive's implementation work in Flink. Further, for 
user created UDFs in Hive, Flink SQL should provide a mechanism allowing user 
to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in 
Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with 
extension to support Hive's syntax and language features, around DDL, DML, and 
SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in 
thrift APIs, such that HiveServer2 users can reuse their existing client (such 
as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage 
handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all 
layers in Flink. However, a short-term goal could  include only core areas 
(such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the 
right direction, I could come up with a formal proposal quickly and then we

Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-10 Thread Zhang, Xuefu
Hi Fabian/Vno,

Thank you very much for your encouragement inquiry. Sorry that I didn't see 
Fabian's email until I read Vino's response just now. (Somehow Fabian's went to 
the spam folder.)

My proposal contains long-term and short-terms goals. Nevertheless, the effort 
will focus on the following areas, including Fabian's list:

1. Hive metastore connectivity - This covers both read/write access, which 
means Flink can make full use of Hive's metastore as its catalog (at least for 
the batch but can extend for streaming as well).
2. Metadata compatibility - Objects (databases, tables, partitions, etc) 
created by Hive can be understood by Flink and the reverse direction is true 
also.
3. Data compatibility - Similar to #2, data produced by Hive can be consumed by 
Flink and vise versa.
4. Support Hive UDFs - For all Hive's native udfs, Flink either provides its 
own implementation or make Hive's implementation work in Flink. Further, for 
user created UDFs in Hive, Flink SQL should provide a mechanism allowing user 
to import them into Flink without any code change required.
5. Data types -  Flink SQL should support all data types that are available in 
Hive.
6. SQL Language - Flink SQL should support SQL standard (such as SQL2003) with 
extension to support Hive's syntax and language features, around DDL, DML, and 
SELECT queries.
7.  SQL CLI - this is currently developing in Flink but more effort is needed.
8. Server - provide a server that's compatible with Hive's HiverServer2 in 
thrift APIs, such that HiveServer2 users can reuse their existing client (such 
as beeline) but connect to Flink's thrift server instead.
9. JDBC/ODBC drivers - Flink may provide its own JDBC/ODBC drivers for other 
application to use to connect to its thrift server
10. Support other user's customizations in Hive, such as Hive Serdes, storage 
handlers, etc.
11. Better task failure tolerance and task scheduling at Flink runtime.

As you can see, achieving all those requires significant effort and across all 
layers in Flink. However, a short-term goal could  include only core areas 
(such as 1, 2, 4, 5, 6, 7) or start  at a smaller scope (such as #3, #6).

Please share your further thoughts. If we generally agree that this is the 
right direction, I could come up with a formal proposal quickly and then we can 
follow up with broader discussions.

Thanks,
Xuefu




--
Sender:vino yang 
Sent at:2018 Oct 11 (Thu) 09:45
Recipient:Fabian Hueske 
Cc:dev ; Xuefu ; user 

Subject:Re: [DISCUSS] Integrate Flink SQL well with Hive ecosystem

Hi Xuefu,

Appreciate this proposal, and like Fabian, it would look better if you can give 
more details of the plan.

Thanks, vino.
Fabian Hueske  于2018年10月10日周三 下午5:27写道:

Hi Xuefu,

Welcome to the Flink community and thanks for starting this discussion! Better 
Hive integration would be really great!
Can you go into details of what you are proposing? I can think of a couple ways 
to improve Flink in that regard:

* Support for Hive UDFs
* Support for Hive metadata catalog
* Support for HiveQL syntax
* ???

Best, Fabian

Am Di., 9. Okt. 2018 um 19:22 Uhr schrieb Zhang, Xuefu 
:
Hi all,

 Along with the community's effort, inside Alibaba we have explored Flink's 
potential as an execution engine not just for stream processing but also for 
batch processing. We are encouraged by our findings and have initiated our 
effort to make Flink's SQL capabilities full-fledged. When comparing what's 
available in Flink to the offerings from competitive data processing engines, 
we identified a major gap in Flink: a well integration with Hive ecosystem. 
This is crucial to the success of Flink SQL and batch due to the 
well-established data ecosystem around Hive. Therefore, we have done some 
initial work along this direction but there are still a lot of effort needed.

 We have two strategies in mind. The first one is to make Flink SQL 
full-fledged and well-integrated with Hive ecosystem. This is a similar 
approach to what Spark SQL adopted. The second strategy is to make Hive itself 
work with Flink, similar to the proposal in [1]. Each approach bears its pros 
and cons, but they don’t need to be mutually exclusive with each targeting at 
different users and use cases. We believe that both will promote a much greater 
adoption of Flink beyond stream processing.

 We have been focused on the first approach and would like to showcase Flink's 
batch and SQL capabilities with Flink SQL. However, we have also planned to 
start strategy #2 as the follow-up effort.

 I'm completely new to Flink(, with a short bio [2] below), though many of my 
colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like 
to share our thoughts and invite your early feedback. At the same time, I am 
working on a detailed proposal on Flink SQL's integration with Hive ecosystem, 
which will be also shared when ready.

 While the ideas are simple, each

[DISCUSS] Integrate Flink SQL well with Hive ecosystem

2018-10-09 Thread Zhang, Xuefu
Hi all,

Along with the community's effort, inside Alibaba we have explored Flink's 
potential as an execution engine not just for stream processing but also for 
batch processing. We are encouraged by our findings and have initiated our 
effort to make Flink's SQL capabilities full-fledged. When comparing what's 
available in Flink to the offerings from competitive data processing engines, 
we identified a major gap in Flink: a well integration with Hive ecosystem. 
This is crucial to the success of Flink SQL and batch due to the 
well-established data ecosystem around Hive. Therefore, we have done some 
initial work along this direction but there are still a lot of effort needed.

We have two strategies in mind. The first one is to make Flink SQL full-fledged 
and well-integrated with Hive ecosystem. This is a similar approach to what 
Spark SQL adopted. The second strategy is to make Hive itself work with Flink, 
similar to the proposal in [1]. Each approach bears its pros and cons, but they 
don’t need to be mutually exclusive with each targeting at different users and 
use cases. We believe that both will promote a much greater adoption of Flink 
beyond stream processing.

We have been focused on the first approach and would like to showcase Flink's 
batch and SQL capabilities with Flink SQL. However, we have also planned to 
start strategy #2 as the follow-up effort.

I'm completely new to Flink(, with a short bio [2] below), though many of my 
colleagues here at Alibaba are long-time contributors. Nevertheless, I'd like 
to share our thoughts and invite your early feedback. At the same time, I am 
working on a detailed proposal on Flink SQL's integration with Hive ecosystem, 
which will be also shared when ready.

While the ideas are simple, each approach will demand significant effort, more 
than what we can afford. Thus, the input and contributions from the communities 
are greatly welcome and appreciated.

Regards,


Xuefu

References:

[1] https://issues.apache.org/jira/browse/HIVE-10712
[2] Xuefu Zhang is a long-time open source veteran, worked or working on many 
projects under Apache Foundation, of which he is also an honored member. About 
10 years ago he worked in the Hadoop team at Yahoo where the projects just got 
started. Later he worked at Cloudera, initiating and leading the development of 
Hive on Spark project in the communities and across many organizations. Prior 
to joining Alibaba, he worked at Uber where he promoted Hive on Spark to all 
Uber's SQL on Hadoop workload and significantly improved Uber's cluster 
efficiency.