[jira] [Commented] (KAFKA-350) Enable message replication in the presence of controlled failures

Jay Kreps (JIRA) Wed, 18 Jul 2012 22:05:41 -0700

    [ 
https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418065#comment-13418065
 ]


Jay Kreps commented on KAFKA-350:
---------------------------------

This is a pretty hard to review due to the large number of changes and also 
because I don't know some of this code well.

A lot of things like bad logging/naming that I think you could probably catch 
just perusing the diff.

Log:
- Log should not know about hw right? We seem to be adding more hw stuff there?
- This adds a getHW() that just returns a private val, why not make the val 
public? Please fix these. Regardless of cleanup being done get/set methods have 
been against the style guide for a long time, lets not add more. Ditto 
getEndOffset() which in addition to being a getter is inconsistent with 
Log.logEndOffset
- There is debug statement in a for loop in Log.scala that needs to be removed
- I don't understand the difference between nextAppendOffset and logEndOffset. 
Can you make it clear in the javadoc and explain on why we need both of these. 
Our public interface to Log is getting incredibly complex, which is really sad 
so I think we should really think through deeply what is added here and why.
- The javadoc on line 138 of Log.scala doesn't match the style of javadoc for 
the preceeding 5 variables.

- Does making isr.keep.in.sync.time.ms more stringent actually make sense? 10 
seconds is pretty tight. I think what you are saying is that every server 
bounce will introduce 30 seconds of latency. But I think that is kind of a 
weakness of our design. If we lower that timeout we may just get spurious 
dropped replicas, no?
- Can we change the name of isr.keep.in.sync.time.ms to replica.max.lag.time.ms?
- Good point about the socket timeouts. We can't set socket timeout equal to 
request timeout, though, as there may be a large network latency. I recommend 
we just default the socket timeout to something large (like 10x the request 
timeout), and throw an exception if it is less than the request timeout (since 
that is certainly wrong). I don't think we should be using the socket timeout 
except as an emergency escape for a totally hung broker now that we have the 
nice per-request timeout.
- Can we change producer.request.ack.timeout.ms to producer.request.timeout.ms 
so it is more intuitive? I don't think the word "ack" is going to be 
self-explanatory to users.

SocketServer.scala
- Please remove: info("Shut down acceptor completed")
- Is there a reason to add the port into the thread name? That seems extremely 
odd...is it to simplify testing where there are multiple servers on a machine?
- Why is it a bug for processNewResponses() to happen while a shutdown is 
occuring. I don't think that is a bug. That is called in the event loop. It is 
the loop that should stop, no? Is there any ill effect of this?
- Good catch on the infinite loop

System testing
- I think we should fix the performance test hacks. The performance tool is 
critical. I have watched this play out before. No one ever budgets time for 
making the performance test stuff usable and then it just gets re-written 
umpteen times and never does what is needed. Most of these are just a matter of 
some basic cleanup and adding options. Let's work clean.

AdminUtils
- I don't understand the change in getTopicMetaDataFromZK

Replica.scala
- Can you remove trace("Returning leader replica for leader " + 
leaderReplicaId) unless you think it is of value going forward

ErrorMapping.scala
- getMaxErrorCodeValue - seems to be recomputed for each TopicMetadata. Let's 
get rid of this, I don't think we need it. We already have an unknown entry in 
the mapping, we should use that and get rid of the Utils.getShortInRange
- If we do want to keep it, fix the name
- We should really give a base class KafkaException to all exceptions so the 
client can handle them more easily
- Instead of having the client get an IllegalArgumentException we should just 
throw new KafkaException("Unknown server error (error code " + code + ")")
- The file NoLeaderForPartitionException seems to be empty now, I think you 
meant to delete it

ConsoleConsumer
- What does moving those things into a finally block do? We just caught all 
exceptions...

FileMessageSet
- You added more log spam. Please format it so it is intelligible to someone 
not working on the code or remove: debug("flush size:" + sizeInBytes())
- Ditto info("recover upto size:" + sizeInBytes())

BrokerPartitionInfos is a really weird class

DefaultEventHandler
- This seems to have grown into a big glump of proceedural logic.
- Inconsistent spacing of parens should be cleaned up
- partitionAndCollate is extemely complex
- Option[Map[Int, Map[(String, Int), Seq[ProducerData[K,Message]]]]]

ZkUtils
- LeaderExists class needs a name that is a noun
- Also we have a class with the same name in TopicMetadata

I really like TestUtils.waitUntilLeaderIsElected. God bless you for not adding 
sleeps. We should consider repeating this pattern for other cases like this.

ZooKeeperTestHarness.scala
- Can we replace the thread.sleep with a waitUntilZkIsUp call?


                
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
>                 Key: KAFKA-350
>                 URL: https://issues.apache.org/jira/browse/KAFKA-350
>             Project: Kafka
>          Issue Type: Bug
>    Affects Versions: 0.8
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>         Attachments: kafka-350-v1.patch
>
>
> KAFKA-46 introduced message replication feature in the absence of server 
> failures. This JIRA will improve the log recovery logic and fix other bugs to 
> enable message replication to happen in the presence of controlled server 
> failures

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

[jira] [Commented] (KAFKA-350) Enable message replication in the presence of controlled failures

Reply via email to