[jira] [Commented] (KAFKA-343) revisit the become leader and become follower state change operations using V3 design

Neha Narkhede (JIRA) Mon, 16 Jul 2012 18:36:38 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-343?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13415835#comment-13415835
 ]


Neha Narkhede commented on KAFKA-343:
-------------------------------------

Here are some initial review comments. 

1. ZkUtils
1. Remove the debug statement in getEpochForPartition that says “get data, see 
m”. Debug statements should be carefully added and should be added to be useful 
during debugging.
2. Another debug statement that can be improved - 
3. “Check the leaderEpocAndISR raw string:”. Can you please change this to 
something like trace(“The leader and isr info %s read from zk path %s”)
4. Change debug(“the leader....”) to debug(“Leader %d, Epoch %d, isr %s, zk 
path version %d for topic %s and partition %d”). Without knowing the topic and 
partition, this debug statement will not be useful 
5. Overall, change any log statement that says “get this and check that”. For 
example, getLeaderForPartition() 
6. Change epoc to epoch everywhere in your code
7. In getPartitionLeaderAndISRForTopics, the input is a list of topics for 
which leader and isr is to be fetched, but in the output, some partitions for 
which leader and isr is not available are missing from the return value. This 
is hard to understand from the caller's POV. Either it should return None or 
throw an exception so that the caller can handle Option value or an exception 
to know that some partitions don't have leader and isr info.
8. What is the value of change readData to readDataMaybeNull in 
getBrokerInfoFromIds ? Basically you are returning invalid Broker objects with 
null broker info from that API, which doesn't seem useful to the caller. The 
fix is same as above. Either throw exception (like before) or change it to 
return Option.

2. KafkaController.scala
1. Rename recoverLeaderAndISRFromZookeeper to getLeaderAndISRFromZookeeper
2. In recoverLeaderAndISRFromZookeeper () API, get is blindly invoked on an 
Option variable returned from allPartitionReplicaAssignment. Scala options 
force us to handle invalid/missing values cleanly, please use Option correctly 
everywhere in your code.
3. The following info statement is unclear - “On leaderAndISR recover, ....”. 
Please change it to “After reading leader and ISR from zookeeper, ...”
4. What is the purpose of leaderAndISRRecovery since it just calls 
recoverLeaderAndISRFromZookeeper anyways ? Let's remove this API
5. All info statements in controllerRegisterOrFailover() API are unclear. 
Please follow the info statements in leaderElection() in KafkaZookeeper(the 
code you deleted) and fix these. 

3. Partition.scala
1. Change updateISR back to take in an optional zkclient instead of null and a 
separate boolean value. That way it makes it easier to handle a missing zk 
client instead of running into NPE
2. Parentheses should be on the same line in updateISR. Try to follow 
consistent coding convention.
3. Why is the try-catch-finally clause removed while acquiring/releasing the 
leaderAndISRLock ? This is very dangerous and can lead to the lock never being 
released in some failure cases. 
4. Keep the info/error statements like those in updateISRInZk (code that you 
deleted)

4. ReplicaManager.scala
1. The deleteLog change seems pretty hacky. What happens if the log was being 
written to while processing the stop replica request ? It doesn't seem like 
that this scenario is handled and tested in this patch. Please revert all log 
deletion related changes, leave a TODO comment in stopReplica. We have another 
JIRA filed for delete topic, maybe we can handle it cleanly as part of that ? 
This will also help reduce the scope of this patch which is supposed to only 
handle become leader and become follower. 
2. Change all log4j statements that say “On broker, blah” to “Blah on broker”
3. Does makeLeader and makeFollower ever return an error code other than 
NoError ?

5. Log.scala
1. Revert 

6. LogManager.scala
1. Revert 

7. FileMessageSet.scala
1. Revert 

8. SimpleConsumer
1. Revert 

9. AbstractFetcherManager
1. Revert

10. KafkaServer.scala
1. Remove commented out code in addReplica

11. KafkaController.scala
1. Change all log4j statements of this format “Controller %d see blah” to 
something meaningful like “Blah is %s on controller %d”

12. Testing

12.1. In TestUtils, 
1. change variables ISR1 and ISR2 to lower case. Also make the same change in 
ZkUtils and anywhere else where you've used all-uppercase to denote a variable. 
Also, rename createSampleLeaderAndISRResponse to createTestLeaderAndISRResponse
2. waitUntilLeaderIsElected should be improved with this patch to handle new 
leader election as well as existing leader change. 

12.2. In ControllerBasicTest, change assertEquals to say “Controller should be 
on broker 1”. You might want to remove the debug statement that says “command 
send test finishes”
12.3. Why are all tests deleted from ControllerToBrokerRequestTest.scala ?
12.4. LeaderElectionTest and ControllerBasicTest throws couple of new errors 
with this patch -
12.5. This patch is supposed to make very significant changes to the 
replication state machine. The existing tests don't pass cleanly. I would 
prefer to see many more unit tests and a running system test before accepting 
this patch. You can probably wait until KAFKA-350 is checked in since that 
enables replication with failures. This will allow us to run the system test on 
this patch.

                
> revisit the become leader and become follower state change operations using 
> V3 design
> -------------------------------------------------------------------------------------
>
>                 Key: KAFKA-343
>                 URL: https://issues.apache.org/jira/browse/KAFKA-343
>             Project: Kafka
>          Issue Type: Sub-task
>          Components: core
>    Affects Versions: 0.8
>            Reporter: Jun Rao
>            Assignee: Yang Ye
>             Fix For: 0.8
>
>         Attachments: kafka_343.diff.2, kafka_343.diff.3, kafka_343.patch
>
>
> We need to reimplement become leader/follower using the controller model 
> described in 
> https://cwiki.apache.org/confluence/display/KAFKA/kafka+Detailed+Replication+Design+V3

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (KAFKA-343) revisit the become leader and become follower state change operations using V3 design

Reply via email to