Hi all,

Thanks a lot for your feedback, guys! Special thanks to Fabian, Till and Arvid (in a private discussion)!

The consensus seems to lean toward the blog post on migrating a batch pipeline from the DataSet API to the DataStream API. For the record, it is linked to work I did recently (unfortunately not public; let's see if I can make it public in the future) on running the TPC-DS performance framework on Flink. I know there is already an implementation in the repo using Flink SQL, but I wanted to implement it at a lower level, first with the DataSet API and later with the DataStream API. It reads Parquet (so an old format). The query I implemented is TPC-DS Query 3. That would be the use case for this future blog post. Indeed, as Fabian and Till said, it could easily become a series.
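
To give a rough idea of the direction, here is a minimal, self-contained sketch (not the benchmark code itself; the brand/price tuples are made up) of the core DataSet-to-DataStream pattern the posts would walk through, namely a bounded aggregation run on the DataStream API in BATCH execution mode:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchOnDataStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH mode on bounded input replaces the old DataSet ExecutionEnvironment.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(
                Tuple2.of("brand-1", 10.0),
                Tuple2.of("brand-2", 5.0),
                Tuple2.of("brand-1", 7.5))
           .keyBy(t -> t.f0)  // group by brand, like the GROUP BY in Query 3
           .sum(1)            // sum the sales amount per brand
           .print();

        env.execute("DataSet to DataStream migration sketch");
    }
}

The actual posts would of course read the Parquet inputs and implement the full Query 3 joins and aggregations rather than hard-coded tuples.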


The second blog post, which received less consensus: a manual join with KeyedCoProcessFunction in the DataStream API (thanks Till!). I will of course add a pointer to the recommended target for users, the Table/SQL API, as Fabian reminded.
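
To make the recipe concrete, here is a rough sketch (illustrative key and value types only, not the final blog code) of what such a hand-rolled equi-join could look like: each side buffers its records in keyed state and joins against what the other side has already buffered.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class ManualJoin
        extends KeyedCoProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, String> {

    private transient ListState<Integer> leftBuffer;
    private transient ListState<String> rightBuffer;

    @Override
    public void open(Configuration parameters) {
        leftBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("left", Integer.class));
        rightBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(Tuple2<String, Integer> left, Context ctx, Collector<String> out)
            throws Exception {
        // Join against everything already buffered on the other side, then buffer this record.
        for (String right : rightBuffer.get()) {
            out.collect(ctx.getCurrentKey() + ": " + left.f1 + " / " + right);
        }
        leftBuffer.add(left.f1);
    }

    @Override
    public void processElement2(Tuple2<String, String> right, Context ctx, Collector<String> out)
            throws Exception {
        for (Integer leftValue : leftBuffer.get()) {
            out.collect(ctx.getCurrentKey() + ": " + leftValue + " / " + right.f1);
        }
        rightBuffer.add(right.f1);
    }
}

It would be wired up with something like leftStream.connect(rightStream).keyBy(l -> l.f0, r -> r.f0).process(new ManualJoin()), and the post would point to the Table/SQL join as the recommended alternative.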


Another blog post could be about performance: during this benchmark, I observed the cost of SQL translation compared to the lower-level APIs, the performance improvements in DataStream, and the improvements brought by the Blink planner. That could also make a good blog post. However, I tend not to compare performance with other Apache big data projects such as Spark, because they all have their strengths and their tricky parameters, and in the end we often end up comparing things that are not 100% comparable.


Regarding the other topics, as I wrote, I doubted they would be of much interest, mainly because the formats are deprecated, because users are being steered toward the Table/SQL API, or because the topics are too low level. Thanks for confirming my doubts!


Best

Etienne

On 30/09/2021 15:51, Till Rohrmann wrote:
Hi Etienne,

Great to see that you want to write about one of the topics you have worked
on! Spreading the word about changes/improvements/new features is always
super important.

As a general recommendation, I think you should write about the topic you
are most excited about. This usually results in an interesting blog post
for others. If you have multiple favourites, then I would think about what
topic could be most interesting for users. In my experience, blog posts
that deal with a user problem (e.g. how to solve xyz) get more attention
than technical blog posts. Having said this, I think the following topics
would be good candidates:

- migration of pipelines from DataSet API to DataStream API
As Fabian said, this could easily become a series of blog posts. Maybe this
could also become part of the documentation.

- doing a manual join in DataStream API in batch mode with
KeyedCoProcessFunction
I could see this becoming a nice blog post about a concrete recipe on how to
solve a certain set of problems with the DataStream API. Fabian is right
that in the future we will try to steer people towards the Table API, but
maybe the join condition cannot be easily expressed in SQL, in which case
people would naturally switch to the DataStream API for it.

- back pressure in checkpointing
Improving the understanding of Flink operations is always a good and
worthwhile idea imo.

Cheers,
Till

On Thu, Sep 30, 2021 at 10:19 AM Fabian Paul <fabianp...@ververica.com>
wrote:

Hi Etienne,

Thanks for reaching out. I think your list already looks very appealing.

- metrics (https://github.com/apache/flink/pull/14510): it was dealing
  with delimiters. I think it is a bit low level for a blog post?

I am also unsure whether this is a good fit to present. I can only imagine
showing what kind of use-case it supports.


- migration of pipelines from DataSet API to DataStream API: it is
  already discussed on the Flink website

This is definitely something I'd like to see; in my opinion it can also
become a series because the topic has a lot of aspects. If you want to
write a post about it, it would be great to show the migration of a more
complex pipeline (i.e. old formats, incompatible types, ...). Many users
will eventually face this, so it has a big impact. FYI, Flink 1.13 is
probably the latest version with full DataSet support.

- accumulators (https://github.com/apache/flink/pull/14558): it was
  about an asynchronous get, once again a bit too low level for a blog
  post?

To me accumulators are kind of an internal concept, but maybe you can
provide the use-case which drove this change? Probably explaining their
semantics is already complicated.


- FileInputFormat, mainly Parquet improvements and fixes
  (https://github.com/apache/flink/pull/15725,
  https://github.com/apache/flink/pull/15172,
  https://github.com/apache/flink/pull/15156): interesting, but as this
  API is being decommissioned, it might not be a good subject?

You have already summarized it: it is being deprecated, and a much more
interesting topic is the migration from DataSet to the DataStream API in
case these old formats are used.


- doing a manual join in DataStream API in batch mode with
  KeyedCoProcessFunction (https://issues.apache.org/jira/browse/FLINK-22587).
  As the target is more Flink Table/SQL for these kinds of things, the
  same deprecation comment as above applies.

I tend not to show this topic because my recommendation would be to use
the Table API directly and not build your own join in the DataStream API ;)

=> maybe a blog post on back pressure in checkpointing
(https://github.com/apache/flink/pull/13040). WDYT?

This is also an interesting topic, but we constantly work on improving the
situation, and I am unsure whether the blog post would already be out of
date by the time it is released.


Please let me know what you think. I am also happy to give more detailed
feedback on one of the topics if you need it.

Best,
Fabian
