Hi Jark,

My gut feeling is 1), because of its consistency with other connectors
(it does not add two "secret" keywords), although it is more verbose.

Best,

Konstantin



On Thu, Apr 30, 2020 at 5:01 PM Jark Wu <imj...@gmail.com> wrote:

> Hi Konstantin,
>
> Thanks for the link to Java Faker. It's an interesting project and
> could benefit a comprehensive datagen source.
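> A purely hypothetical sketch of how Faker-backed fields could look in such a
> DDL (none of these option keys exist today; they are only meant to illustrate
> the idea):
>
> CREATE TABLE customers (
>   name STRING,
>   address STRING
> ) WITH (
>   'connector' = 'datagen',
>   -- hypothetical options that map columns to Java Faker generators
>   'fields.name.faker-expression' = 'name.fullName',
>   'fields.address.faker-expression' = 'address.fullAddress'
> );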
>
> What would the discarding and printing sinks look like in your view?
> 1) manually create a table with a `blackhole` or `print` connector, e.g.
>
> CREATE TABLE my_sink (
>   a INT,
>   b STRING,
>   c DOUBLE
> ) WITH (
>   'connector' = 'print'
> );
> INSERT INTO my_sink SELECT a, b, c FROM my_source;
>
> 2) system built-in tables named `blackhole` and `print` that require no manual
> schema work, e.g.
> INSERT INTO print SELECT a, b, c, d FROM my_source;
>
> Best,
> Jark
>
>
>
> On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <kna...@apache.org> wrote:
>
> > Hi everyone,
> >
> > sorry for reviving this thread at this point in time. Generally, I think
> > this is a very valuable effort. Have we considered only providing a very
> > basic data generator (+ discarding and printing sink tables) in Apache
> > Flink and moving a more comprehensive data-generating table source to an
> > ecosystem project promoted on flink-packages.org? I think this has a lot
> > of potential (e.g. in combination with Java Faker [1]), but it would
> > probably be better served in a small, separately maintained repository.
> >
> > Cheers,
> >
> > Konstantin
> >
> > [1] https://github.com/DiUS/java-faker
> >
> >
> > On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <jingsongl...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I created https://issues.apache.org/jira/browse/FLINK-16743 for
> > follow-up
> > > discussion. FYI.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:
> > >
> > > > I agree with Jingsong that sink schema inference and system tables can be
> > > > considered later. I wouldn't recommend tackling them just for the sake of
> > > > simplifying the user experience to the extreme. Providing the above handy
> > > > source and sink implementations already offers users a ton of immediate
> > > > value.
> > > >
> > > >
> > > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Benchao,
> > > > >
> > > > > > do you think we need to add more columns with various types?
> > > > >
> > > > > I didn't list all types, but we should support primitive types, varchar,
> > > > > decimal, timestamp, etc.
> > > > > This can be done incrementally.
> > > > >
> > > > > Hi Benchao, Jark,
> > > > > About console and blackhole: yes, they can have no schema; the schema can
> > > > > be inferred from the upstream node.
> > > > > - But we don't currently have a mechanism for such configurable, schema-less
> > > > > sinks.
> > > > > - If we want to support this, we need a single way to support these two
> > > > > sinks.
> > > > > - And users can use "CREATE TABLE ... LIKE" and other ways to simplify the
> > > > > DDL (a rough sketch below).
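> > > > > A rough sketch of how "CREATE TABLE ... LIKE" could help here (the LIKE
> > > > > clause is still under discussion, so the exact syntax and the print options
> > > > > are only illustrative):
> > > > >
> > > > > -- derive the print sink's schema from an existing source table
> > > > > CREATE TABLE my_print_sink
> > > > > WITH (
> > > > >   'connector.type' = 'print'
> > > > > )
> > > > > LIKE my_source (EXCLUDING OPTIONS);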
> > > > >
> > > > > And for providing system/registered tables (`console` and `blackhole`):
> > > > > - I have no strong opinion on these system tables. In SQL, it would be
> > > > > "insert into blackhole select a /*int*/, b /*string*/ from tableA", "insert
> > > > > into blackhole select a /*double*/, b /*Map*/, c /*string*/ from tableB".
> > > > > It seems that blackhole would be a universal table accepting any schema,
> > > > > which intuitively feels wrong to me.
> > > > > - Can users override these tables? If they can, we need to ensure they can
> > > > > be overwritten by catalog tables.
> > > > >
> > > > > So I think we can leave these system tables to the future too.
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Jingsong Lee
> > > > >
> > > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Regarding (2) and (3), I was thinking of skipping the manual DDL work, so
> > > > > > users can use them directly:
> > > > > >
> > > > > > # this will log results to `.out` files
> > > > > > INSERT INTO console
> > > > > > SELECT ...
> > > > > >
> > > > > > # this will drop all received records
> > > > > > INSERT INTO blackhole
> > > > > > SELECT ...
> > > > > >
> > > > > > Here `console` and `blackhole` are system sinks, similar to system
> > > > > > functions.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi Jingsong,
> > > > > > >
> > > > > > > Thanks for bringing this up. Generally, it's a very good proposal.
> > > > > > >
> > > > > > > About the datagen source, do you think we need to add more columns with
> > > > > > > various types?
> > > > > > >
> > > > > > > About the print sink, do we need to specify the schema?
> > > > > > >
> > > > > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > > > > >
> > > > > > > > I have reorganized the proposal based on your suggestions, and will try to
> > > > > > > > show the DDLs:
> > > > > > > >
> > > > > > > > 1. datagen source:
> > > > > > > > - easy startup/testing for streaming jobs
> > > > > > > > - performance testing
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE user (
> > > > > > > >     id BIGINT,
> > > > > > > >     age INT,
> > > > > > > >     description STRING
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'datagen',
> > > > > > > >     'connector.rows-per-second'='100',
> > > > > > > >     'connector.total-records'='1000000',
> > > > > > > >
> > > > > > > >     'schema.id.generator' = 'sequence',
> > > > > > > >     'schema.id.generator.start' = '1',
> > > > > > > >
> > > > > > > >     'schema.age.generator' = 'random',
> > > > > > > >     'schema.age.generator.min' = '0',
> > > > > > > >     'schema.age.generator.max' = '100',
> > > > > > > >
> > > > > > > >     'schema.description.generator' = 'random',
> > > > > > > >     'schema.description.generator.length' = '100'
> > > > > > > > )
> > > > > > > >
> > > > > > > > The default is the random generator.
> > > > > > > > Hi Jark, I don't want to bring in complicated value patterns, because they
> > > > > > > > can be expressed through computed columns (a rough sketch below). And it is
> > > > > > > > hard to define standard patterns, so I think we can leave that to the future.
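> > > > > > > > For example, a rough sketch of deriving a domain-specific value with a
> > > > > > > > computed column on top of a generated field (only an illustration, reusing
> > > > > > > > the option keys proposed above):
> > > > > > > >
> > > > > > > > CREATE TABLE users (
> > > > > > > >     id BIGINT,
> > > > > > > >     -- computed column: derive a user name from the generated id
> > > > > > > >     user_name AS CONCAT('User_', CAST(MOD(id, 10) AS VARCHAR))
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'datagen',
> > > > > > > >     'connector.rows-per-second' = '100',
> > > > > > > >     'schema.id.generator' = 'sequence',
> > > > > > > >     'schema.id.generator.start' = '1'
> > > > > > > > )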
> > > > > > > >
> > > > > > > > 2. print sink:
> > > > > > > > - easy testing for streaming jobs
> > > > > > > > - very useful in production debugging
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE print_table (
> > > > > > > >     ...
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'print'
> > > > > > > > )
> > > > > > > >
> > > > > > > > 3. blackhole sink:
> > > > > > > > - very useful for high-performance testing of Flink
> > > > > > > > - I've also run into users using a UDF for output instead of a sink, so they
> > > > > > > > need this sink as well.
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE blackhole_table (
> > > > > > > >     ...
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'blackhole'
> > > > > > > > )
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong Lee
> > > > > > > >
> > > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this proposal. I
> > > > > > > > > think Bowen's proposal makes a lot of sense to me.
> > > > > > > > >
> > > > > > > > > This is also a painful problem for PyFlink users. Currently there is no
> > > > > > > > > built-in, easy-to-use table source/sink, and it requires users to write a
> > > > > > > > > lot of code just to try out PyFlink. This is especially painful for new
> > > > > > > > > users who are not familiar with PyFlink/Flink. I have also gone through
> > > > > > > > > the tedious process Bowen described, e.g. writing a random source
> > > > > > > > > connector, a print sink and a blackhole sink, as there are no built-in
> > > > > > > > > ones to use.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Dian
> > > > > > > > >
> > > > > > > > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > > > > > > > > > connectors.
> > > > > > > > > >
> > > > > > > > > > I will leave some of my thoughts here:
> > > > > > > > > >
> > > > > > > > > >> 1. datagen source (random source)
> > > > > > > > > > I think we can merge the functionality of the sequence source into the
> > > > > > > > > > random source to allow users to customize their data values.
> > > > > > > > > > Flink can generate random data according to the field types, and users
> > > > > > > > > > can customize their values to be more domain specific, e.g.
> > > > > > > > > > 'field.user'='User_[1-9]{0,1}'.
> > > > > > > > > > This will be similar to kafka-connect-datagen [1].
> > > > > > > > > >
> > > > > > > > > >> 2. console sink (print sink)
> > > > > > > > > > This will be very useful in production debugging, to easily output an
> > > > > > > > > > intermediate view or result view to a `.out` file,
> > > > > > > > > > so that we can look into the data representation or check for dirty data.
> > > > > > > > > > This should work out of the box without manual DDL registration.
> > > > > > > > > >
> > > > > > > > > >> 3. blackhole sink (no output sink)
> > > > > > > > > > This is very useful for high-performance testing of Flink, to measure the
> > > > > > > > > > throughput of the whole pipeline without a real sink.
> > > > > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > > [1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> +1.
> > > > > > > > > >>
> > > > > > > > > >> I would suggest taking a step even further and seeing what users really
> > > > > > > > > >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> > > > > > > > > >> here are some more sources and sinks that I have developed or used
> > > > > > > > > >> previously to facilitate building Flink table/SQL pipelines.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>   1. random input data source
> > > > > > > > > >>      - should generate random data at a specified rate according to schema
> > > > > > > > > >>      - purposes
> > > > > > > > > >>         - test that a Flink pipeline works and data ends up in external
> > > > > > > > > >>         storage correctly
> > > > > > > > > >>         - stress test the Flink sink as well as tune external storage
> > > > > > > > > >>   2. print data sink
> > > > > > > > > >>      - should print data in row format to the console
> > > > > > > > > >>      - purposes
> > > > > > > > > >>         - make it easier to test a Flink SQL job e2e in the IDE
> > > > > > > > > >>         - test the Flink pipeline and ensure the output data format/values
> > > > > > > > > >>         are correct
> > > > > > > > > >>   3. no output data sink
> > > > > > > > > >>      - just swallow output data without doing anything
> > > > > > > > > >>      - purpose
> > > > > > > > > >>         - evaluate and tune the performance of the Flink source and the
> > > > > > > > > >>         whole pipeline; users don't need to worry about sink back pressure
> > > > > > > > > >>
> > > > > > > > > >> These may all be taken into consideration together as an effort to lower
> > > > > > > > > >> the barrier to running Flink SQL/Table API and to facilitate users'
> > > > > > > > > >> daily work.
> > > > > > > > > >>
> > > > > > > > > >> Cheers,
> > > > > > > > > >> Bowen
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> I heard some users complain that the Table API is difficult to test. Now
> > > > > > > > > >>> with the SQL client, users are more and more inclined to use it for testing
> > > > > > > > > >>> rather than writing programs.
> > > > > > > > > >>> The most common example is the Kafka source. If users need to test their
> > > > > > > > > >>> SQL output and checkpointing, they need to:
> > > > > > > > > >>>
> > > > > > > > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > > > > > > > >>> - 2. Write a program, mock input records, and produce the records to the
> > > > > > > > > >>> Kafka topic.
> > > > > > > > > >>> - 3. Then test in Flink.
> > > > > > > > > >>>
> > > > > > > > > >>> Steps 1 and 2 are annoying, although this test is E2E.
> > > > > > > > > >>>
> > > > > > > > > >>> Then I found StatefulSequenceSource. It is very good because it already
> > > > > > > > > >>> deals with checkpointing, so it works well with the checkpoint mechanism.
> > > > > > > > > >>> Usually, users have checkpointing turned on in production.
> > > > > > > > > >>>
> > > > > > > > > >>> With computed columns, users can easily create a sequence source DDL with
> > > > > > > > > >>> the same schema as their Kafka DDL (a rough sketch below). Then they can
> > > > > > > > > >>> test inside Flink without launching anything else.
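> > > > > > > > > >>> A rough sketch of the idea (there is no such SQL connector yet, so the
> > > > > > > > > >>> connector name and options below are purely illustrative):
> > > > > > > > > >>>
> > > > > > > > > >>> CREATE TABLE orders (
> > > > > > > > > >>>     id BIGINT,
> > > > > > > > > >>>     -- computed columns mirror the fields of the real Kafka table
> > > > > > > > > >>>     user_name AS CONCAT('user_', CAST(MOD(id, 100) AS VARCHAR)),
> > > > > > > > > >>>     amount AS MOD(id, 1000) * 0.01,
> > > > > > > > > >>>     order_time AS PROCTIME()
> > > > > > > > > >>> ) WITH (
> > > > > > > > > >>>     'connector.type' = 'sequence',
> > > > > > > > > >>>     'connector.start' = '1',
> > > > > > > > > >>>     'connector.end' = '1000000'
> > > > > > > > > >>> )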
> > > > > > > > > >>>
> > > > > > > > > >>> Have you considered this? What do you think?
> > > > > > > > > >>>
> > > > > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > > > > > > > > >>> of StatefulSequenceSource.
> > > > > > > > > >>>
> > > > > > > > > >>> Best,
> > > > > > > > > >>> Jingsong Lee
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best, Jingsong Lee
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Benchao Li
> > > > > > > School of Electronics Engineering and Computer Science, Peking
> > > > > University
> > > > > > > Tel:+86-15650713730
> > > > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best, Jingsong Lee
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
