[
https://issues.apache.org/jira/browse/KAFKA-350?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13418065#comment-13418065
]
Jay Kreps commented on KAFKA-350:
---------------------------------
This is a pretty hard to review due to the large number of changes and also
because I don't know some of this code well.
A lot of things like bad logging/naming that I think you could probably catch
just perusing the diff.
Log:
- Log should not know about hw right? We seem to be adding more hw stuff there?
- This adds a getHW() that just returns a private val, why not make the val
public? Please fix these. Regardless of cleanup being done get/set methods have
been against the style guide for a long time, lets not add more. Ditto
getEndOffset() which in addition to being a getter is inconsistent with
Log.logEndOffset
- There is debug statement in a for loop in Log.scala that needs to be removed
- I don't understand the difference between nextAppendOffset and logEndOffset.
Can you make it clear in the javadoc and explain on why we need both of these.
Our public interface to Log is getting incredibly complex, which is really sad
so I think we should really think through deeply what is added here and why.
- The javadoc on line 138 of Log.scala doesn't match the style of javadoc for
the preceeding 5 variables.
- Does making isr.keep.in.sync.time.ms more stringent actually make sense? 10
seconds is pretty tight. I think what you are saying is that every server
bounce will introduce 30 seconds of latency. But I think that is kind of a
weakness of our design. If we lower that timeout we may just get spurious
dropped replicas, no?
- Can we change the name of isr.keep.in.sync.time.ms to replica.max.lag.time.ms?
- Good point about the socket timeouts. We can't set socket timeout equal to
request timeout, though, as there may be a large network latency. I recommend
we just default the socket timeout to something large (like 10x the request
timeout), and throw an exception if it is less than the request timeout (since
that is certainly wrong). I don't think we should be using the socket timeout
except as an emergency escape for a totally hung broker now that we have the
nice per-request timeout.
- Can we change producer.request.ack.timeout.ms to producer.request.timeout.ms
so it is more intuitive? I don't think the word "ack" is going to be
self-explanatory to users.
SocketServer.scala
- Please remove: info("Shut down acceptor completed")
- Is there a reason to add the port into the thread name? That seems extremely
odd...is it to simplify testing where there are multiple servers on a machine?
- Why is it a bug for processNewResponses() to happen while a shutdown is
occuring. I don't think that is a bug. That is called in the event loop. It is
the loop that should stop, no? Is there any ill effect of this?
- Good catch on the infinite loop
System testing
- I think we should fix the performance test hacks. The performance tool is
critical. I have watched this play out before. No one ever budgets time for
making the performance test stuff usable and then it just gets re-written
umpteen times and never does what is needed. Most of these are just a matter of
some basic cleanup and adding options. Let's work clean.
AdminUtils
- I don't understand the change in getTopicMetaDataFromZK
Replica.scala
- Can you remove trace("Returning leader replica for leader " +
leaderReplicaId) unless you think it is of value going forward
ErrorMapping.scala
- getMaxErrorCodeValue - seems to be recomputed for each TopicMetadata. Let's
get rid of this, I don't think we need it. We already have an unknown entry in
the mapping, we should use that and get rid of the Utils.getShortInRange
- If we do want to keep it, fix the name
- We should really give a base class KafkaException to all exceptions so the
client can handle them more easily
- Instead of having the client get an IllegalArgumentException we should just
throw new KafkaException("Unknown server error (error code " + code + ")")
- The file NoLeaderForPartitionException seems to be empty now, I think you
meant to delete it
ConsoleConsumer
- What does moving those things into a finally block do? We just caught all
exceptions...
FileMessageSet
- You added more log spam. Please format it so it is intelligible to someone
not working on the code or remove: debug("flush size:" + sizeInBytes())
- Ditto info("recover upto size:" + sizeInBytes())
BrokerPartitionInfos is a really weird class
DefaultEventHandler
- This seems to have grown into a big glump of proceedural logic.
- Inconsistent spacing of parens should be cleaned up
- partitionAndCollate is extemely complex
- Option[Map[Int, Map[(String, Int), Seq[ProducerData[K,Message]]]]]
ZkUtils
- LeaderExists class needs a name that is a noun
- Also we have a class with the same name in TopicMetadata
I really like TestUtils.waitUntilLeaderIsElected. God bless you for not adding
sleeps. We should consider repeating this pattern for other cases like this.
ZooKeeperTestHarness.scala
- Can we replace the thread.sleep with a waitUntilZkIsUp call?
> Enable message replication in the presence of controlled failures
> -----------------------------------------------------------------
>
> Key: KAFKA-350
> URL: https://issues.apache.org/jira/browse/KAFKA-350
> Project: Kafka
> Issue Type: Bug
> Affects Versions: 0.8
> Reporter: Neha Narkhede
> Assignee: Neha Narkhede
> Attachments: kafka-350-v1.patch
>
>
> KAFKA-46 introduced message replication feature in the absence of server
> failures. This JIRA will improve the log recovery logic and fix other bugs to
> enable message replication to happen in the presence of controlled server
> failures
--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators:
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira