Hi Jark,

My gut feeling is 1), because of its consistency with other connectors
(it does not add two "secret" keywords), although it is more verbose.

Best,

Konstantin



On Thu, Apr 30, 2020 at 5:01 PM Jark Wu <imj...@gmail.com> wrote:

> Hi Konstantin,
>
> Thanks for the link to Java Faker. It's an interesting project and
> could benefit a comprehensive datagen source.
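> A purely hypothetical sketch of how Faker-backed fields could look in such a
> DDL (none of these option keys exist today; they are only meant to illustrate
> the idea):
>
> CREATE TABLE customers (
>   name STRING,
>   address STRING
> ) WITH (
>   'connector' = 'datagen',
>   -- hypothetical options that map columns to Java Faker generators
>   'fields.name.faker-expression' = 'name.fullName',
>   'fields.address.faker-expression' = 'address.fullAddress'
> );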
>
> What would the discarding and printing sinks look like in your view?
> 1) manually create a table with a `blackhole` or `print` connector, e.g.
>
> CREATE TABLE my_sink (
>   a INT,
>   b STRING,
>   c DOUBLE
> ) WITH (
>   'connector' = 'print'
> );
> INSERT INTO my_sink SELECT a, b, c FROM my_source;
>
> 2) system built-in tables named `blackhole` and `print` that require no manual
> schema work, e.g.
> INSERT INTO print SELECT a, b, c, d FROM my_source;
>
> Best,
> Jark
>
>
>
> On Thu, 30 Apr 2020 at 21:19, Konstantin Knauf <kna...@apache.org> wrote:
>
> > Hi everyone,
> >
> > sorry for reviving this thread at this point in time. Generally, I think
> > this is a very valuable effort. Have we considered only providing a very
> > basic data generator (+ discarding and printing sink tables) in Apache
> > Flink and moving a more comprehensive data-generating table source to an
> > ecosystem project promoted on flink-packages.org? I think this has a lot
> > of potential (e.g. in combination with Java Faker [1]), but it would
> > probably be better served in a small, separately maintained repository.
> >
> > Cheers,
> >
> > Konstantin
> >
> > [1] https://github.com/DiUS/java-faker
> >
> >
> > On Tue, Mar 24, 2020 at 9:10 AM Jingsong Li <jingsongl...@gmail.com>
> > wrote:
> >
> > > Hi all,
> > >
> > > I created https://issues.apache.org/jira/browse/FLINK-16743 for
> > follow-up
> > > discussion. FYI.
> > >
> > > Best,
> > > Jingsong Lee
> > >
> > > On Tue, Mar 24, 2020 at 2:20 PM Bowen Li <bowenl...@gmail.com> wrote:
> > >
> > > > I agree with Jingsong that sink schema inference and system tables can be
> > > > considered later. I wouldn't recommend tackling them just for the sake of
> > > > simplifying the user experience to the extreme. Providing the above handy
> > > > source and sink implementations already offers users a ton of immediate
> > > > value.
> > > >
> > > >
> > > > On Mon, Mar 23, 2020 at 20:20 Jingsong Li <jingsongl...@gmail.com>
> > > wrote:
> > > >
> > > > > Hi Benchao,
> > > > >
> > > > > > do you think we need to add more columns with various types?
> > > > >
> > > > > I didn't list all types, but we should support primitive types, varchar,
> > > > > decimal, timestamp, etc.
> > > > > This can be done incrementally.
> > > > >
> > > > > Hi Benchao, Jark,
> > > > > About console and blackhole: yes, they can have no schema; the schema can
> > > > > be inferred from the upstream node.
> > > > > - But we don't currently have a mechanism for such configurable, schema-less
> > > > > sinks.
> > > > > - If we want to support this, we need a single way to support these two
> > > > > sinks.
> > > > > - And users can use "CREATE TABLE ... LIKE" and other ways to simplify the
> > > > > DDL (a rough sketch below).
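> > > > > A rough sketch of how "CREATE TABLE ... LIKE" could help here (the LIKE
> > > > > clause is still under discussion, so the exact syntax and the print options
> > > > > are only illustrative):
> > > > >
> > > > > -- derive the print sink's schema from an existing source table
> > > > > CREATE TABLE my_print_sink
> > > > > WITH (
> > > > >   'connector.type' = 'print'
> > > > > )
> > > > > LIKE my_source (EXCLUDING OPTIONS);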
> > > > >
> > > > > And for providing system/registered tables (`console` and `blackhole`):
> > > > > - I have no strong opinion on these system tables. In SQL, it would be
> > > > > "insert into blackhole select a /*int*/, b /*string*/ from tableA", "insert
> > > > > into blackhole select a /*double*/, b /*Map*/, c /*string*/ from tableB".
> > > > > It seems that blackhole would be a universal table accepting any schema,
> > > > > which intuitively feels wrong to me.
> > > > > - Can users override these tables? If they can, we need to ensure they can
> > > > > be overwritten by catalog tables.
> > > > >
> > > > > So I think we can leave these system tables to the future too.
> > > > > What do you think?
> > > > >
> > > > > Best,
> > > > > Jingsong Lee
> > > > >
> > > > > On Mon, Mar 23, 2020 at 4:44 PM Jark Wu <imj...@gmail.com> wrote:
> > > > >
> > > > > > Hi Jingsong,
> > > > > >
> > > > > > Regarding (2) and (3), I was thinking of skipping the manual DDL work, so
> > > > > > users can use them directly:
> > > > > >
> > > > > > # this will log results to `.out` files
> > > > > > INSERT INTO console
> > > > > > SELECT ...
> > > > > >
> > > > > > # this will drop all received records
> > > > > > INSERT INTO blackhole
> > > > > > SELECT ...
> > > > > >
> > > > > > Here `console` and `blackhole` are system sinks, similar to system
> > > > > > functions.
> > > > > >
> > > > > > Best,
> > > > > > Jark
> > > > > >
> > > > > > On Mon, 23 Mar 2020 at 16:33, Benchao Li <libenc...@gmail.com>
> > > wrote:
> > > > > >
> > > > > > > Hi Jingsong,
> > > > > > >
> > > > > > > Thanks for bringing this up. Generally, it's a very good proposal.
> > > > > > >
> > > > > > > About the datagen source, do you think we need to add more columns with
> > > > > > > various types?
> > > > > > >
> > > > > > > About the print sink, do we need to specify the schema?
> > > > > > >
> > > > > > > On Mon, Mar 23, 2020 at 1:51 PM, Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Thanks Bowen, Jark and Dian for your feedback and suggestions.
> > > > > > > >
> > > > > > > > I have reorganized the proposal based on your suggestions, and will try to
> > > > > > > > show the DDLs:
> > > > > > > >
> > > > > > > > 1. datagen source:
> > > > > > > > - easy startup/testing for streaming jobs
> > > > > > > > - performance testing
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE user (
> > > > > > > >     id BIGINT,
> > > > > > > >     age INT,
> > > > > > > >     description STRING
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'datagen',
> > > > > > > >     'connector.rows-per-second'='100',
> > > > > > > >     'connector.total-records'='1000000',
> > > > > > > >
> > > > > > > >     'schema.id.generator' = 'sequence',
> > > > > > > >     'schema.id.generator.start' = '1',
> > > > > > > >
> > > > > > > >     'schema.age.generator' = 'random',
> > > > > > > >     'schema.age.generator.min' = '0',
> > > > > > > >     'schema.age.generator.max' = '100',
> > > > > > > >
> > > > > > > >     'schema.description.generator' = 'random',
> > > > > > > >     'schema.description.generator.length' = '100'
> > > > > > > > )
> > > > > > > >
> > > > > > > > The default is the random generator.
> > > > > > > > Hi Jark, I don't want to bring in complicated value patterns, because they
> > > > > > > > can be expressed through computed columns (a rough sketch below). And it is
> > > > > > > > hard to define standard patterns, so I think we can leave that to the future.
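> > > > > > > > For example, a rough sketch of deriving a domain-specific value with a
> > > > > > > > computed column on top of a generated field (only an illustration, reusing
> > > > > > > > the option keys proposed above):
> > > > > > > >
> > > > > > > > CREATE TABLE users (
> > > > > > > >     id BIGINT,
> > > > > > > >     -- computed column: derive a user name from the generated id
> > > > > > > >     user_name AS CONCAT('User_', CAST(MOD(id, 10) AS VARCHAR))
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'datagen',
> > > > > > > >     'connector.rows-per-second' = '100',
> > > > > > > >     'schema.id.generator' = 'sequence',
> > > > > > > >     'schema.id.generator.start' = '1'
> > > > > > > > )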
> > > > > > > >
> > > > > > > > 2. print sink:
> > > > > > > > - easy testing for streaming jobs
> > > > > > > > - very useful in production debugging
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE print_table (
> > > > > > > >     ...
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'print'
> > > > > > > > )
> > > > > > > >
> > > > > > > > 3. blackhole sink:
> > > > > > > > - very useful for high-performance testing of Flink
> > > > > > > > - I've also run into users using a UDF for output instead of a sink, so they
> > > > > > > > need this sink as well.
> > > > > > > >
> > > > > > > > DDL:
> > > > > > > > CREATE TABLE blackhole_table (
> > > > > > > >     ...
> > > > > > > > ) WITH (
> > > > > > > >     'connector.type' = 'blackhole'
> > > > > > > > )
> > > > > > > >
> > > > > > > > What do you think?
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Jingsong Lee
> > > > > > > >
> > > > > > > > On Mon, Mar 23, 2020 at 12:04 PM Dian Fu <dian0511...@gmail.com> wrote:
> > > > > > > >
> > > > > > > > > Thanks Jingsong for bringing up this discussion. +1 to this proposal. I
> > > > > > > > > think Bowen's proposal makes a lot of sense to me.
> > > > > > > > >
> > > > > > > > > This is also a painful problem for PyFlink users. Currently there is no
> > > > > > > > > built-in, easy-to-use table source/sink, and it requires users to write a
> > > > > > > > > lot of code just to try out PyFlink. This is especially painful for new
> > > > > > > > > users who are not familiar with PyFlink/Flink. I have also gone through
> > > > > > > > > the tedious process Bowen described, e.g. writing a random source
> > > > > > > > > connector, a print sink and a blackhole sink, as there are no built-in
> > > > > > > > > ones to use.
> > > > > > > > >
> > > > > > > > > Regards,
> > > > > > > > > Dian
> > > > > > > > >
> > > > > > > > > > On Mar 22, 2020, at 11:24 AM, Jark Wu <imj...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > +1 to Bowen's proposal. I have also seen many requests for such built-in
> > > > > > > > > > connectors.
> > > > > > > > > >
> > > > > > > > > > I will leave some of my thoughts here:
> > > > > > > > > >
> > > > > > > > > >> 1. datagen source (random source)
> > > > > > > > > > I think we can merge the functionality of the sequence source into the
> > > > > > > > > > random source to allow users to customize their data values.
> > > > > > > > > > Flink can generate random data according to the field types, and users
> > > > > > > > > > can customize their values to be more domain specific, e.g.
> > > > > > > > > > 'field.user'='User_[1-9]{0,1}'.
> > > > > > > > > > This will be similar to kafka-connect-datagen [1].
> > > > > > > > > >
> > > > > > > > > >> 2. console sink (print sink)
> > > > > > > > > > This will be very useful in production debugging, to easily output an
> > > > > > > > > > intermediate view or result view to a `.out` file,
> > > > > > > > > > so that we can look into the data representation or check for dirty data.
> > > > > > > > > > This should work out of the box without manual DDL registration.
> > > > > > > > > >
> > > > > > > > > >> 3. blackhole sink (no output sink)
> > > > > > > > > > This is very useful for high-performance testing of Flink, to measure the
> > > > > > > > > > throughput of the whole pipeline without a real sink.
> > > > > > > > > > Presto also provides this as a built-in connector [2].
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Jark
> > > > > > > > > >
> > > > > > > > > > [1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
> > > > > > > > > > [2]: https://prestodb.io/docs/current/connector/blackhole.html
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > >> +1.
> > > > > > > > > >>
> > > > > > > > > >> I would suggest taking a step even further and seeing what users really
> > > > > > > > > >> need to test/try/play with the Table API and Flink SQL. Besides this one,
> > > > > > > > > >> here are some more sources and sinks that I have developed or used
> > > > > > > > > >> previously to facilitate building Flink table/SQL pipelines.
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >>   1. random input data source
> > > > > > > > > >>      - should generate random data at a specified rate according to schema
> > > > > > > > > >>      - purposes
> > > > > > > > > >>         - test that a Flink pipeline works and data ends up in external
> > > > > > > > > >>         storage correctly
> > > > > > > > > >>         - stress test the Flink sink as well as tune external storage
> > > > > > > > > >>   2. print data sink
> > > > > > > > > >>      - should print data in row format to the console
> > > > > > > > > >>      - purposes
> > > > > > > > > >>         - make it easier to test a Flink SQL job e2e in the IDE
> > > > > > > > > >>         - test the Flink pipeline and ensure the output data format/values
> > > > > > > > > >>         are correct
> > > > > > > > > >>   3. no output data sink
> > > > > > > > > >>      - just swallow output data without doing anything
> > > > > > > > > >>      - purpose
> > > > > > > > > >>         - evaluate and tune the performance of the Flink source and the
> > > > > > > > > >>         whole pipeline; users don't need to worry about sink back pressure
> > > > > > > > > >>
> > > > > > > > > >> These may all be taken into consideration together as an effort to lower
> > > > > > > > > >> the barrier to running Flink SQL/Table API and to facilitate users'
> > > > > > > > > >> daily work.
> > > > > > > > > >>
> > > > > > > > > >> Cheers,
> > > > > > > > > >> Bowen
> > > > > > > > > >>
> > > > > > > > > >>
> > > > > > > > > >> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com> wrote:
> > > > > > > > > >>
> > > > > > > > > >>> Hi all,
> > > > > > > > > >>>
> > > > > > > > > >>> I heard some users complain that the Table API is difficult to test. Now
> > > > > > > > > >>> with the SQL client, users are more and more inclined to use it for testing
> > > > > > > > > >>> rather than writing programs.
> > > > > > > > > >>> The most common example is the Kafka source. If users need to test their
> > > > > > > > > >>> SQL output and checkpointing, they need to:
> > > > > > > > > >>>
> > > > > > > > > >>> - 1. Launch a standalone Kafka and create a Kafka topic.
> > > > > > > > > >>> - 2. Write a program, mock input records, and produce the records to the
> > > > > > > > > >>> Kafka topic.
> > > > > > > > > >>> - 3. Then test in Flink.
> > > > > > > > > >>>
> > > > > > > > > >>> Steps 1 and 2 are annoying, although this test is E2E.
> > > > > > > > > >>>
> > > > > > > > > >>> Then I found StatefulSequenceSource. It is very good because it already
> > > > > > > > > >>> deals with checkpointing, so it works well with the checkpoint mechanism.
> > > > > > > > > >>> Usually, users have checkpointing turned on in production.
> > > > > > > > > >>>
> > > > > > > > > >>> With computed columns, users can easily create a sequence source DDL with
> > > > > > > > > >>> the same schema as their Kafka DDL (a rough sketch below). Then they can
> > > > > > > > > >>> test inside Flink without launching anything else.
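> > > > > > > > > >>> A rough sketch of the idea (there is no such SQL connector yet, so the
> > > > > > > > > >>> connector name and options below are purely illustrative):
> > > > > > > > > >>>
> > > > > > > > > >>> CREATE TABLE orders (
> > > > > > > > > >>>     id BIGINT,
> > > > > > > > > >>>     -- computed columns mirror the fields of the real Kafka table
> > > > > > > > > >>>     user_name AS CONCAT('user_', CAST(MOD(id, 100) AS VARCHAR)),
> > > > > > > > > >>>     amount AS MOD(id, 1000) * 0.01,
> > > > > > > > > >>>     order_time AS PROCTIME()
> > > > > > > > > >>> ) WITH (
> > > > > > > > > >>>     'connector.type' = 'sequence',
> > > > > > > > > >>>     'connector.start' = '1',
> > > > > > > > > >>>     'connector.end' = '1000000'
> > > > > > > > > >>> )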
> > > > > > > > > >>>
> > > > > > > > > >>> Have you considered this? What do you think?
> > > > > > > > > >>>
> > > > > > > > > >>> CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > > > > > > > > >>> of StatefulSequenceSource.
> > > > > > > > > >>>
> > > > > > > > > >>> Best,
> > > > > > > > > >>> Jingsong Lee
> > > > > > > > > >>>
> > > > > > > > > >>
> > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best, Jingsong Lee
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Benchao Li
> > > > > > > School of Electronics Engineering and Computer Science, Peking
> > > > > University
> > > > > > > Tel:+86-15650713730
> > > > > > > Email: libenc...@gmail.com; libenc...@pku.edu.cn
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best, Jingsong Lee
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best, Jingsong Lee
> > >
> >
> >
> > --
> >
> > Konstantin Knauf
> >
> > https://twitter.com/snntrable
> >
> > https://github.com/knaufk
> >
>


-- 

Konstantin Knauf

https://twitter.com/snntrable

https://github.com/knaufk
