+1 to Bowen's proposal. I have also seen many requests for such built-in connectors.
I will leave some of my thoughts here:

> 1. datagen source (random source)
I think we can merge the functionality of the sequence source into the
random source to allow users to customize their data values.
Flink can generate random data according to the field types, and users
can customize the values to be more domain specific, e.g.
'field.user'='User_[1-9]{0,1}'
This would be similar to kafka-connect-datagen [1].

> 2. console sink (print sink)
This will be very useful for production debugging: we can easily output an
intermediate view or result view to a `.out` file, so that we can look into
the data representation or check for dirty data.
This should work out of the box, without manual DDL registration.

> 3. blackhole sink (no output sink)
This is very useful for performance testing of Flink, to measure the
throughput of the whole pipeline without the sink cost.
Presto also provides this as a built-in connector [2].

To make the discussion more concrete, I attached a rough DDL sketch for
these three connectors at the end of this mail.

Best,
Jark

[1]: https://github.com/confluentinc/kafka-connect-datagen#define-a-new-schema-specification
[2]: https://prestodb.io/docs/current/connector/blackhole.html

On Sat, 21 Mar 2020 at 12:31, Bowen Li <bowenl...@gmail.com> wrote:

> +1.
>
> I would suggest taking a step even further and looking at what users really
> need to test/try/play with the Table API and Flink SQL. Besides this one,
> here are some more sources and sinks that I have developed or used
> previously to facilitate building Flink table/SQL pipelines.
>
> 1. random input data source
>    - should generate random data at a specified rate according to the schema
>    - purposes
>       - test that the Flink pipeline works and data ends up in external
>         storage correctly
>       - stress test the Flink sink as well as tune the external storage
> 2. print data sink
>    - should print data in row format to the console
>    - purposes
>       - make it easier to test a Flink SQL job end-to-end in the IDE
>       - test the Flink pipeline and ensure the output data format/values
>         are correct
> 3. no output data sink
>    - just swallows output data without doing anything
>    - purpose
>       - evaluate and tune the performance of the Flink source and the whole
>         pipeline; users don't need to worry about sink back pressure
>
> These may be taken into consideration all together as an effort to lower
> the threshold of running Flink SQL/Table API and to facilitate users'
> daily work.
>
> Cheers,
> Bowen
>
>
> On Thu, Mar 19, 2020 at 10:32 PM Jingsong Li <jingsongl...@gmail.com>
> wrote:
>
> > Hi all,
> >
> > I have heard some users complain that the Table API is difficult to test.
> > Now with the SQL client, users are more and more inclined to use it to
> > test rather than write a program.
> > The most common example is the Kafka source. If users need to test their
> > SQL output and checkpointing, they need to:
> >
> >    - 1. Launch a standalone Kafka and create a Kafka topic.
> >    - 2. Write a program to mock input records and produce them to the
> >      Kafka topic.
> >    - 3. Then test in Flink.
> >
> > Steps 1 and 2 are annoying, even though this test is end-to-end.
> >
> > Then I found StatefulSequenceSource. It is very good because it already
> > deals with checkpointing, so it works well with the checkpoint mechanism.
> > Usually, users have checkpointing turned on in production.
> >
> > With computed columns, users can easily create a sequence source DDL that
> > looks the same as a Kafka DDL. Then they can test inside Flink, without
> > needing to launch anything else.
> >
> > Have you considered this? What do you think?
> >
> > CC: @Aljoscha Krettek <aljos...@apache.org>, the author
> > of StatefulSequenceSource.
> >
> > Best,
> > Jingsong Lee
> >
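P.S. Here is the rough sketch mentioned above of what the DDLs for the
three connectors could look like. The connector names ('datagen', 'print',
'blackhole') and the option keys ('rows-per-second', 'fields.<column>.kind',
etc.) are only illustrative and still open for discussion; the computed
PROCTIME() column shows how such a source could mirror a Kafka DDL, as
Jingsong suggested.

    -- Hypothetical datagen source: rate-limited, with per-field generation
    -- rules (sequence and random merged into one connector), plus a computed
    -- processing-time column so the schema can mirror a Kafka table.
    CREATE TABLE orders_gen (
      user_id    BIGINT,
      amount     DOUBLE,
      order_time AS PROCTIME()
    ) WITH (
      'connector' = 'datagen',
      'rows-per-second' = '10',
      'fields.user_id.kind' = 'sequence',
      'fields.user_id.start' = '1',
      'fields.user_id.end' = '1000',
      'fields.amount.kind' = 'random',
      'fields.amount.min' = '1',
      'fields.amount.max' = '100'
    );

    -- Hypothetical print sink: writes every row to the TaskManager's .out file.
    CREATE TABLE print_sink (
      user_id BIGINT,
      amount  DOUBLE
    ) WITH (
      'connector' = 'print'
    );

    -- Hypothetical blackhole sink: swallows all rows, for throughput tests.
    CREATE TABLE blackhole_sink (
      user_id BIGINT,
      amount  DOUBLE
    ) WITH (
      'connector' = 'blackhole'
    );

    -- Debugging: inspect the generated data and check for dirty records.
    INSERT INTO print_sink SELECT user_id, amount FROM orders_gen;

    -- Benchmarking: measure pipeline throughput without sink back pressure.
    INSERT INTO blackhole_sink SELECT user_id, amount FROM orders_gen;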