Overall +1 on the idea.

Danny, could we move this to the Apache cwiki, if you don't mind?
That's what we have been using for other RFC discussions.

On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <danny0...@apache.org> wrote:

> The RFC-13 Flink writer has some bottlenecks that make it hard to adapt
> to production:
>
> - The InstantGeneratorOperator has parallelism 1, which limits
> high-throughput consumption; because all the split inputs drain to a single
> thread, the network IO comes under pressure too
> - The WriteProcessOperator handles inputs by partition; that is, within
> each partition's write process, the BUCKETs are written one by one, so the
> file IO cannot keep up with high-throughput inputs
> - It buffers the data by checkpoint, which is hard to make robust for
> production: the checkpoint function is blocking and should not perform IO
> operations
> - The FlinkHoodieIndex is only valid within a per-job scope; it does not
> work for existing bootstrap data or across different Flink jobs
>
> Thus, I propose a new design for the Flink writer to solve these
> problems [1]. Overall, the new design tries to remove the
> single-parallelism operators and make the index more powerful and scalable.
>
> I plan to solve these bottlenecks incrementally (in 4 steps); there are
> already some local POCs for these proposals.
>
> I'm looking forward to your feedback. Any suggestions are appreciated ~
>
> [1]
>
> https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
>
