Hi all,

Thanks a lot for your feedback, guys! Special thanks to Fabian, Till and Arvid (in a private discussion)!

The consensus seems to lean toward the blog post on migrating a batch pipeline from the DataSet API to the DataStream API. For the record, it is linked to work I did recently (unfortunately not public; let's see if I can make it public in the future) on running the TPC-DS performance framework on Flink. I know there is already an implementation in the repo using Flink SQL, but I wanted to implement it at a lower level, first with the DataSet API and later with the DataStream API. It reads Parquet (so an old format). The query I implemented is TPC-DS Query 3. That would be the use case for this future blog post. Indeed, as Fabian and Till said, it could easily become a series.
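
To give a rough idea of the direction, here is a minimal, self-contained sketch (not the benchmark code itself; the brand/price tuples are made up) of the core DataSet-to-DataStream pattern the posts would walk through, namely a bounded aggregation run on the DataStream API in BATCH execution mode:

import org.apache.flink.api.common.RuntimeExecutionMode;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class BatchOnDataStreamSketch {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
        // BATCH mode on bounded input replaces the old DataSet ExecutionEnvironment.
        env.setRuntimeMode(RuntimeExecutionMode.BATCH);

        env.fromElements(
                Tuple2.of("brand-1", 10.0),
                Tuple2.of("brand-2", 5.0),
                Tuple2.of("brand-1", 7.5))
           .keyBy(t -> t.f0)  // group by brand, like the GROUP BY in Query 3
           .sum(1)            // sum the sales amount per brand
           .print();

        env.execute("DataSet to DataStream migration sketch");
    }
}

The actual posts would of course read the Parquet inputs and implement the full Query 3 joins and aggregations rather than hard-coded tuples.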


The second blog post, which received less consensus: a manual join with KeyedCoProcessFunction in the DataStream API (thanks Till!). I will of course add a pointer to the recommended target for users, the Table/SQL API, as Fabian reminded.
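
To make the recipe concrete, here is a rough sketch (illustrative key and value types only, not the final blog code) of what such a hand-rolled equi-join could look like: each side buffers its records in keyed state and joins against what the other side has already buffered.

import org.apache.flink.api.common.state.ListState;
import org.apache.flink.api.common.state.ListStateDescriptor;
import org.apache.flink.api.java.tuple.Tuple2;
import org.apache.flink.configuration.Configuration;
import org.apache.flink.streaming.api.functions.co.KeyedCoProcessFunction;
import org.apache.flink.util.Collector;

public class ManualJoin
        extends KeyedCoProcessFunction<String, Tuple2<String, Integer>, Tuple2<String, String>, String> {

    private transient ListState<Integer> leftBuffer;
    private transient ListState<String> rightBuffer;

    @Override
    public void open(Configuration parameters) {
        leftBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("left", Integer.class));
        rightBuffer = getRuntimeContext().getListState(
                new ListStateDescriptor<>("right", String.class));
    }

    @Override
    public void processElement1(Tuple2<String, Integer> left, Context ctx, Collector<String> out)
            throws Exception {
        // Join against everything already buffered on the other side, then buffer this record.
        for (String right : rightBuffer.get()) {
            out.collect(ctx.getCurrentKey() + ": " + left.f1 + " / " + right);
        }
        leftBuffer.add(left.f1);
    }

    @Override
    public void processElement2(Tuple2<String, String> right, Context ctx, Collector<String> out)
            throws Exception {
        for (Integer leftValue : leftBuffer.get()) {
            out.collect(ctx.getCurrentKey() + ": " + leftValue + " / " + right.f1);
        }
        rightBuffer.add(right.f1);
    }
}

It would be wired up with something like leftStream.connect(rightStream).keyBy(l -> l.f0, r -> r.f0).process(new ManualJoin()), and the post would point to the Table/SQL join as the recommended alternative.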


Another blog post could be about performance: during this benchmark, I observed the cost of SQL translation compared to the lower-level APIs, the performance improvements in DataStream, and the improvements brought by the Blink planner. That could also make a good blog post. However, I tend not to compare performance with other Apache big data projects such as Spark, because they all have their strengths and their tricky parameters, and in the end we often end up comparing things that are not 100% comparable.


Regarding the other topics, as I wrote, I doubted they would be of much interest, mainly because the formats are deprecated, because users are being steered toward the Table/SQL API, or because the topics are too low level. Thanks for confirming my doubts!


Best

Etienne

On 30/09/2021 15:51, Till Rohrmann wrote:
Hi Etienne,

Great to see that you want to write about one of the topics you have worked
on! Spreading the word about changes/improvements/new features is always
super important.

As a general recommendation, I think you should write about the topic you
are most excited about. This usually results in an interesting blog post
for others. If you have multiple favourites, then I would think about what
topic could be most interesting for users. In my experience, blog posts
that deal with a user problem (e.g. how to solve xyz) get more attention
than technical blog posts. Having said this, I think the following topics
would be good candidates:

- migration of pipelines from DataSet API to DataStream API
As Fabian said, this could easily become a series of blog posts. Maybe this
could also become part of the documentation.

- doing a manual join in DataStream API in batch mode with
KeyedCoProcessFunction
I could see this becoming a nice blog post about a concrete recipe on how to
solve a certain set of problems with the DataStream API. Fabian is right
that in the future we will try to steer people towards the Table API, but
maybe the join condition cannot be easily expressed in SQL, in which case
people would naturally switch to the DataStream API for it.

- back pressure in checkpointing
Improving the understanding of Flink operations is always a good and
worthwhile idea imo.

Cheers,
Till

On Thu, Sep 30, 2021 at 10:19 AM Fabian Paul <fabianp...@ververica.com>
wrote:

Hi Etienne,

Thanks for reaching out. I think your list already looks very appealing.

- metrics (https://github.com/apache/flink/pull/14510): it was dealing
  with delimiters. I think it is a bit low level for a blog post?

I am also unsure whether this is a good fit to present. I can only imagine
showing what kind of use-case it supports.


- migration of pipelines from DataSet API to DataStream API: it is
  already discussed on the Flink website

This is definitely something I'd like to see; in my opinion it can also
become a series because the topic has a lot of aspects. If you want to
write a post about it, it would be great to show the migration of a more
complex pipeline (i.e. old formats, incompatible types, ...). Many users
will eventually face this, so it has a big impact. FYI, Flink 1.13 is
probably the latest version with full DataSet support.

- accumulators (https://github.com/apache/flink/pull/14558): it was
  about an asynchronous get, once again a bit too low level for a blog
  post?

To me accumulators are kind of an internal concept, but maybe you can
provide the use-case which drove this change? Probably explaining their
semantics is already complicated.


- FileInputFormat, mainly Parquet improvements and fixes
  (https://github.com/apache/flink/pull/15725,
  https://github.com/apache/flink/pull/15172,
  https://github.com/apache/flink/pull/15156): interesting, but as this
  API is being decommissioned, it might not be a good subject?

You have already summarized it: it is being deprecated, and a much more
interesting topic is the migration from DataSet to the DataStream API in
case these old formats are used.


- doing a manual join in DataStream API in batch mode with
  KeyedCoProcessFunction (https://issues.apache.org/jira/browse/FLINK-22587).
  As the target is more Flink Table/SQL for these kinds of things, the
  same deprecation comment as above applies.

I tend not to show this topic because my recommendation would be to use
the Table API directly and not build your own join in the DataStream API ;)

=> maybe a blog post on back pressure in checkpointing
(https://github.com/apache/flink/pull/13040). WDYT?

This is also an interesting topic, but we constantly work on improving the
situation, and I am unsure whether the blog post would already be out of
date by the time it is released.


Please let me know what you think. I am also happy to give more detailed
feedback on one of the topics if you need it.

Best,
Fabian
