+1 on the idea as well; these changes could be super useful. Let’s collaborate
more on the cwiki.

-Nishith

> On Jan 4, 2021, at 10:58 AM, Vinoth Chandar <vin...@apache.org> wrote:
> 
> Overall +1 on the idea.
> 
> Danny, could we move this to the apache cwiki if you don't mind?
> That's what we have been using for other RFC discussions.
> 
>> On Mon, Jan 4, 2021 at 1:22 AM Danny Chan <danny0...@apache.org> wrote:
>> 
>> The RFC-13 Flink writer has some bottlenecks that make it hard to adapt
>> to production:
>> 
>> - The InstantGeneratorOperator has parallelism 1, which limits
>> high-throughput consumption; because all the split inputs drain to a single
>> thread, network IO comes under pressure too
>> - The WriteProcessOperator handles inputs by partition; that is, within
>> each partition's write process the BUCKETs are written one by one, so the
>> FILE IO is too limited to keep up with high-throughput inputs
>> - It buffers the data by checkpoints, which is hard to make robust for
>> production; the checkpoint function is blocking and should not perform IO
>> operations
>> - The FlinkHoodieIndex is only valid within a per-job scope; it does not work
>> for existing bootstrap data or across different Flink jobs
>> 
>> Thus, I propose a new design for the Flink writer to solve these
>> problems[1]. Overall, the new design tries to remove the single-parallelism
>> operators and make the index more powerful and scalable.
>> 
>> I plan to solve these bottlenecks incrementally (in 4 steps); there are
>> already some local POCs for these proposals.
>> 
>> I'm looking forward to your feedback. Any suggestions are appreciated ~
>> 
>> [1] https://docs.google.com/document/d/1oOcU0VNwtEtZfTRt3v9z4xNQWY-Hy5beu7a1t5B-75I/edit?usp=sharing
>> 
