Hey Colin & George,

Thinking about George's points, I was wondering whether it's feasible to
submit a big reassignment to the controller, and thus to ZooKeeper, since
frequent writes are slow as the quorum has to synchronize. Perhaps this
should be the responsibility of KIP-435
<https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>, but I'd like to
note it here as we're changing the current znode layout in this KIP.
I think ideally we should write these to ZooKeeper in batches and otherwise
store them in a replicated internal topic (__partition_reassignments). That
would solve the scalability problem, as a failed-over controller would be
able to read the state back very quickly, and it would also spread the
ZooKeeper writes over time. Only the partitions that are currently, actively
being reassigned should be present under
/brokers/topics/[topic]/partitions/[partitionId]/state, so those partitions
will know whether they have to do a reassignment (even in the case of a
broker bounce). The controller, on the other hand, could regain its state by
reading the __partition_reassignments topic up to the last produced message
and reading the ZooKeeper state to figure out which batch it's currently
working on (assuming it processes the given reassignment sequentially).
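To make this a bit more concrete, here is a very rough sketch of what a
record in __partition_reassignments and the failover-time replay could look
like. The class name, fields and keying scheme below are placeholders I made
up for illustration, not a concrete format proposal:

import java.util.*;

// Hypothetical record format: one record per reassigned partition, carrying
// the original and the target replica set so a rollback stays possible.
public class PartitionReassignmentRecord {
    final String topic;
    final int partition;
    final List<Integer> originalReplicas;
    final List<Integer> targetReplicas;

    PartitionReassignmentRecord(String topic, int partition,
                                List<Integer> originalReplicas,
                                List<Integer> targetReplicas) {
        this.topic = topic;
        this.partition = partition;
        this.originalReplicas = originalReplicas;
        this.targetReplicas = targetReplicas;
    }

    public static void main(String[] args) {
        // On failover the controller replays the topic and keeps only the
        // latest record per partition, then compares that with what it read
        // from ZooKeeper in initializeControllerContext() to find the batch
        // in flight.
        List<PartitionReassignmentRecord> replayed = Arrays.asList(
            new PartitionReassignmentRecord("t1", 0, Arrays.asList(1, 2, 3), Arrays.asList(4, 5, 6)),
            new PartitionReassignmentRecord("t1", 1, Arrays.asList(2, 3, 4), Arrays.asList(5, 6, 1)));
        Map<String, PartitionReassignmentRecord> latest = new LinkedHashMap<>();
        for (PartitionReassignmentRecord r : replayed)
            latest.put(r.topic + "-" + r.partition, r);
        System.out.println("pending reassignments: " + latest.keySet());
    }
}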
I'll think a little more about this to fill in any gaps and perhaps add it
to my KIP. That being said, we should probably do some benchmarking first to
see whether this bulk read/write causes a problem at all, to avoid premature
optimisation. I'm generally not worried about reading this new information,
as the controller reads the assignment anyway in
initializeControllerContext().

A question on SubmitPartitionReassignmentsRequest and its connection with
KIP-435 <https://cwiki.apache.org/confluence/display/KAFKA/KIP-435>: would
the list of topic-partitions keep the same ordering on the broker side as on
the client side? I think that would be an advantage, as the user would know
in which order the reassignment will be performed. It would also be useful
for incrementalization, as users could figure out which replicas will end up
in the same batch (given they know the batch size).
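Just to illustrate that last point (the batch size and the chunking helper
below are made up, they're not part of either KIP): if the broker preserves
the ordering submitted by the client, the user can predict the batches simply
by chunking the list:

import java.util.*;

public class ReassignmentBatches {
    // Hypothetical helper: chunk an ordered list of "topic-partition" strings
    // into fixed-size batches, in the order an incremental reassignment would
    // presumably execute them.
    static List<List<String>> batches(List<String> ordered, int batchSize) {
        List<List<String>> result = new ArrayList<>();
        for (int i = 0; i < ordered.size(); i += batchSize)
            result.add(ordered.subList(i, Math.min(i + batchSize, ordered.size())));
        return result;
    }

    public static void main(String[] args) {
        List<String> ordered = Arrays.asList("t1-0", "t1-1", "t2-0", "t2-1", "t3-0");
        // With a batch size of 2 the user knows t1-0 and t1-1 move first,
        // then t2-0 and t2-1, and finally t3-0.
        System.out.println(batches(ordered, 2));
    }
}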

Viktor

On Wed, May 1, 2019 at 8:33 AM George Li <sql_consult...@yahoo.com.invalid>
wrote:

>  Hi Colin,
>
> Thanks for KIP-455!  Yes, KIP-236, etc. will depend on it.  It is a good
> direction to go for the RPC.
>
> Regarding storing the new reassignments & original replicas at the
> topic/partition level: I have some concerns about controller failover and
> the scalability of scanning the active reassignments from the ZK
> topic/partition level nodes. Please see my reply to Jason in the KIP-236
> thread.
>
> Once the decision is made on where the new reassignment and original
> replicas are stored, I will modify KIP-236 accordingly for how to
> cancel/rollback the reassignments.
>
> Thanks,
> George
>
>
>     On Monday, April 15, 2019, 6:07:44 PM PDT, Colin McCabe <
> cmcc...@apache.org> wrote:
>
>  Hi all,
>
> We've been having discussions on a few different KIPs (KIP-236, KIP-435,
> etc.) about what the Admin Client replica reassignment API should look
> like.  The current API is really hard to extend and maintain, which is a
> big source of problems.  I think it makes sense to have a KIP that
> establishes a clean API that we can use and extend going forward, so I
> posted KIP-455.  Take a look.  :)
>
> best,
> Colin
>
