Added the rolling upgrade instructions to the KIP, similar to those in the 0.9.0 release notes.

On Wed, Dec 16, 2015 at 11:32 AM, Allen Wang <allenxw...@gmail.com> wrote:

Hi Jun,

The reason that TopicMetadataResponse is not included in the KIP is that it currently is not version aware. So we need to introduce a version to it in order to ensure backward compatibility. It seems to me a big change. Do we want to couple it with this KIP? Do we need to further discuss what information to include in the new version besides rack? For example, should we include broker security protocol in TopicMetadataResponse?

The other option is to make it a separate KIP to make TopicMetadataResponse version aware and decide what to include, and make this KIP focus on the rack aware algorithm, admin tools and related changes to inter-broker protocol.

Thanks,
Allen

On Mon, Dec 14, 2015 at 8:30 AM, Jun Rao <j...@confluent.io> wrote:

Allen,

Thanks for the proposal. A few comments.

1. Since this KIP changes the inter broker communication protocol (UpdateMetadataRequest), we will need to document the upgrade path (similar to what's described in http://kafka.apache.org/090/documentation.html#upgrade).

2. It might be useful to include the rack info of the broker in TopicMetadataResponse. This can be useful for administrative tasks, as well as read affinity in the future.

Jun

On Thu, Dec 10, 2015 at 9:38 AM, Allen Wang <allenxw...@gmail.com> wrote:

If there are no more comments I would like to call for a vote.

On Sun, Nov 15, 2015 at 10:08 PM, Allen Wang <allenxw...@gmail.com> wrote:

KIP is updated with more details and how to handle the situation where rack information is incomplete.

In the situation where rack information is incomplete, but we want to continue with the assignment, I have suggested to ignore all rack information and fall back to the original algorithm.
The reason is explained below:

The other options are to assume that each broker without a rack belongs to its own unique rack, or that they all belong to one "default" rack. Either way we choose, it is highly likely to result in an uneven number of brokers in racks, and it is quite possible that the "made up" racks will have far fewer brokers. As I explained in the KIP, an uneven number of brokers in racks will lead to uneven distribution of replicas among brokers (even though the leader distribution is still even). The brokers in the rack that has fewer brokers will get more replicas per broker than brokers in other racks.

Given this fact, and that the replica assignment produced will be incorrect anyway from a rack aware point of view, ignoring all rack information and falling back to the original algorithm is not a bad choice since it will at least have a better guarantee of replica distribution.

Also for command line tools it gives the user a choice if for any reason they want to ignore rack information and fall back to the original algorithm.

On Tue, Nov 10, 2015 at 9:04 AM, Allen Wang <allenxw...@gmail.com> wrote:

I have been busy with some time-pressing issues for the last few days. I will think about how the incomplete rack information will affect the balance and update the KIP by early next week.

Thanks,
Allen

On Tue, Nov 3, 2015 at 9:03 AM, Neha Narkhede <n...@confluent.io> wrote:

A few suggestions on improving the KIP

*If some brokers have rack, and some do not, the algorithm will throw an exception.
This is to prevent incorrect assignment caused by user error.*

In the KIP, can you clearly state the user-facing behavior when some brokers have rack information and some don't. Which actions and requests will error out and how?

*Even distribution of partition leadership among brokers*

There is some information about arranging the sorted broker list interlaced with rack ids. Can you describe the changes to the current algorithm in a little more detail? How does this interlacing work if only a subset of brokers have the rack id configured? Does this still work if uneven # of brokers are assigned to each rack? It might work, I'm looking for more details on the changes, since it will affect the behavior seen by the user - imbalance on either the leaders or data or both.

On Mon, Nov 2, 2015 at 6:39 PM, Aditya Auradkar <aaurad...@linkedin.com> wrote:

I think this sounds reasonable. Anyone else have comments?

Aditya

On Tue, Oct 27, 2015 at 5:23 PM, Allen Wang <allenxw...@gmail.com> wrote:

During the discussion in the hangout, it was mentioned that it would be desirable that consumers know the rack information of the brokers so that they can consume from the broker in the same rack to reduce latency. As I understand this will only be beneficial if consumer can consume from any broker in ISR, which is not possible now.

I suggest we skip the change to TMR.
Once the change is made to consumer to be able to consume from any broker in ISR, the rack information can be added to TMR.

Another thing I want to confirm is command line behavior. I think the desirable default behavior is to fail fast on command line for incomplete rack mapping. The error message can include further instruction that tells the user to add an extra argument (like "--allow-partial-rackinfo") to suppress the error and do an imperfect rack aware assignment. If the default behavior is to allow incomplete mapping, the error can still be easily missed.

The affected command line tools are TopicCommand and ReassignPartitionsCommand.

Thanks,
Allen

On Mon, Oct 26, 2015 at 12:55 PM, Aditya Auradkar <aaurad...@linkedin.com> wrote:

Hi Allen,

For TopicMetadataResponse to understand version, you can bump up the request version itself. Based on the version of the request, the response can be appropriately serialized. It shouldn't be a huge change. For example: We went through something similar for ProduceRequest recently (https://reviews.apache.org/r/33378/)
I guess the reason protocol information is not included in the TMR is because the topic itself is independent of any particular protocol (SSL vs Plaintext).
Having said that, I'm not sure we even need rack information in TMR. What usecase were you thinking of initially?

For 1 - I'd be fine with adding an option to the command line tools that check rack assignment. For e.g. "--strict-assignment" or something similar.

Aditya

On Thu, Oct 22, 2015 at 6:44 PM, Allen Wang <allenxw...@gmail.com> wrote:

For 2 and 3, I have updated the KIP. Please take a look. One thing I have changed is removing the proposal to add rack to TopicMetadataResponse. The reason is that unlike UpdateMetadataRequest, TopicMetadataResponse does not understand version. I don't see a way to include rack without breaking old version of clients. That's probably why secure protocol is not included in the TopicMetadataResponse either. I think it will be a much bigger change to include rack in TopicMetadataResponse.

For 1, my concern is that doing rack aware assignment without complete broker to rack mapping will result in assignment that is not rack aware and fail to provide fault tolerance in the event of rack outage. This kind of problem will be difficult to surface.
And the cost of this problem is high: you have to do partition reassignment if you are lucky enough to spot the problem early on, or face the consequence of data loss during a real rack outage.

I do see the concern with fail-fast, as it might also cause data loss if the producer is not able to produce the message due to topic creation failure. Is it feasible to treat dynamic topic creation and command tools differently? We allow dynamic topic creation with incomplete broker-rack mapping and fail fast in command line. Another option is to let the user determine the behavior for command line. For example, by default fail fast in command line but allow incomplete broker-rack mapping if another switch is provided.

On Tue, Oct 20, 2015 at 10:05 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:

Hey Allen,

1. If we choose fail fast topic creation, we will have topic creation failures while upgrading the cluster. I really doubt we want this behavior. Ideally, this should be invisible to clients of a cluster. Currently, each broker is effectively its own rack. So we probably can use the rack information whenever possible but not make it a hard requirement.
To extend Gwen's example, one badly configured broker should not degrade topic creation for the entire cluster.

2. Upgrade scenario - Can you add a section on the upgrade piece to confirm that old clients will not see errors? I believe ZookeeperConsumerConnector reads the Broker objects from ZK. I wanted to confirm that this will not cause any problems.

3. Could you elaborate your proposed changes to the UpdateMetadataRequest in the "Public Interfaces" section? Personally, I find this format easy to read in terms of wire protocol changes:

https://cwiki.apache.org/confluence/display/KAFKA/KIP-4+-+Command+line+and+centralized+administrative+operations#KIP-4-Commandlineandcentralizedadministrativeoperations-CreateTopicRequest

Aditya

On Fri, Oct 16, 2015 at 3:45 PM, Allen Wang <allenxw...@gmail.com> wrote:

KIP is updated to include rack as an optional property for broker. Please take a look and let me know if more details are needed.

For the case where some brokers have rack and some do not, the current KIP uses the fail-fast behavior. If there are concerns, we can further discuss this in the email thread or next hangout.
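
The fail-fast behavior described in this message, together with the opt-out discussed elsewhere in this thread, could be sketched roughly as follows. This is only an illustration in Python, not code from the KIP or from Kafka: the function name `effective_rack_mapping`, the exception `IncompleteRackInfoError`, and the `allow_partial` parameter are all hypothetical, and treating a cluster with no racks configured as a silent fallback is an assumption.

```python
class IncompleteRackInfoError(ValueError):
    """Hypothetical error: some brokers have a rack configured and some do not."""


def effective_rack_mapping(broker_racks, allow_partial=False):
    """Decide which rack mapping (if any) an assignment should use.

    broker_racks maps broker id -> rack name, or None when no rack is set.
    Returns the mapping when it is complete, or None to signal that the
    caller should use the original, rack-unaware algorithm.
    """
    missing = sorted(b for b, rack in broker_racks.items() if rack is None)

    if not missing:
        # Every broker has a rack: rack-aware assignment is possible.
        return dict(broker_racks)

    if len(missing) == len(broker_racks):
        # Assumption: no rack configured anywhere means rack awareness is
        # simply not in use, so fall back without raising.
        return None

    if not allow_partial:
        # Fail fast so a forgotten rack setting is noticed immediately.
        raise IncompleteRackInfoError(
            "brokers without rack information: %s" % missing
        )

    # Opt-out: ignore all rack information and fall back.
    return None


if __name__ == "__main__":
    print(effective_rack_mapping({0: "a", 1: "b"}))
    print(effective_rack_mapping({0: "a", 1: None}, allow_partial=True))
```

A command line tool could map a switch such as the "--allow-partial-rackinfo" argument suggested in this thread onto `allow_partial`, while dynamic topic creation could always pass `allow_partial=True`.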

On Thu, Oct 15, 2015 at 10:42 AM, Allen Wang <allenxw...@gmail.com> wrote:

That's a good question. I can think of three actions if the rack information is incomplete:

1. Treat the node without rack as if it is on its unique rack
2. Disregard all rack information and fall back to current algorithm
3. Fail-fast

Now that I think about it, one and three make more sense. The reason for fail-fast is that the user's mistake of not providing the rack may never be found if we tolerate it, and the assignment may not be rack aware as the user expected, which creates debugging problems when things fail.

What do you think? If not fail-fast, is there any way we can make the user error stand out?

On Thu, Oct 15, 2015 at 10:17 AM, Gwen Shapira <g...@confluent.io> wrote:

Thanks! Just to clarify, when some brokers have rack assignment and some don't, do we act like none of them have it? or like those without assignment are in their own rack?

The first scenario is good when first setting up rack-awareness, but the second makes more sense for on-going maintenance (I can totally see someone adding a node and forgetting to set the rack property, we don't want this to change behavior for anything except the new node).

What do you think?

Gwen

On Thu, Oct 15, 2015 at 10:13 AM, Allen Wang <allenxw...@gmail.com> wrote:

For scenario 1:

- Add the rack information to broker property file or dynamically set it in the wrapper code to bootstrap Kafka server. You would do that for all brokers and restart the brokers one by one.

In this scenario, the complete broker to rack mapping may not be available until every broker is restarted. During that time we fall back to default replica assignment algorithm.

For scenario 2:

- Add the rack information to broker property file or dynamically set it in the wrapper code and start the broker.

On Wed, Oct 14, 2015 at 2:36 PM, Gwen Shapira <g...@confluent.io> wrote:

Can you clarify the workflow for the following scenarios:

1. I currently have 6 brokers and want to add rack information for each
2. I'm adding a new broker and I want to specify which rack it belongs on while adding it.

Thanks!

On Tue, Oct 13, 2015 at 2:21 PM, Allen Wang <allenxw...@gmail.com> wrote:

We discussed the KIP in the hangout today. The recommendation is to make rack as a broker property in ZooKeeper.
For users with existing rack information stored somewhere, they would need to retrieve the information at broker start up and dynamically set the rack property, which can be implemented as a wrapper to bootstrap broker. There will be no interface or pluggable implementation to retrieve the rack information.

The assumption is that you always need to restart the broker to make a change to the rack.

Once the rack becomes a broker property, it will be possible to make rack part of the meta data to help the consumer choose which in sync replica to consume from as part of the future consumer enhancement.

I will update the KIP.

Thanks,
Allen

On Thu, Oct 8, 2015 at 9:23 AM, Allen Wang <allenxw...@gmail.com> wrote:

I attended Tuesday's KIP hangout but this KIP was not discussed due to time constraint.

However, after hearing discussion of KIP-35, I have the feeling that incompatibility (caused by new broker property) between brokers with different versions will be solved there. In addition, having rack in broker property as meta data may also help consumers in the future. So I am open to adding rack property to broker.

Hopefully we can discuss this in the next KIP hangout.

On Wed, Sep 30, 2015 at 2:46 PM, Allen Wang <allenxw...@gmail.com> wrote:

Can you send me the information on the next KIP hangout?

Currently the broker-rack mapping is not cached. In KafkaApis, RackLocator.getRackInfo() is called each time the mapping is needed for auto topic creation. This will ensure latest mapping is used at any time.

The ability to get the complete mapping makes it simple to reuse the same interface in command line tools.

On Wed, Sep 30, 2015 at 11:01 AM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:

Perhaps we discuss this during the next KIP hangout?

I do see that a pluggable rack locator can be useful but I do see a few concerns:

- The RackLocator (as described in the document), implies that it can discover rack information for any node in the cluster.
How does it deal with rack location changes? For example, if I moved broker id (1) from rack X to Y, I only have to start that broker with a newer rack config. If RackLocator discovers broker -> rack information at start up time, any change to a broker will require bouncing the entire cluster since createTopic requests can be sent to any node in the cluster.
For this reason it may be simpler to have each node be aware of its own rack and persist it in ZK during start up time.

- A pluggable RackLocator relies on an external service being available to serve rack information.

Out of curiosity, I looked up how a couple of other systems deal with zone/rack awareness.
For Cassandra some interesting modes are:
(Property File configuration)
http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchPFSnitch_t.html
(Dynamic inference)
http://docs.datastax.com/en/cassandra/2.0/cassandra/architecture/architectureSnitchRackInf_c.html

Voldemort does a static node -> zone assignment based on configuration.

Aditya

On Wed, Sep 30, 2015 at 10:05 AM, Allen Wang <allenxw...@gmail.com> wrote:

I would like to see if we can do both:

- Make RackLocator pluggable to facilitate migration with existing broker-rack mapping

- Make rack an optional property for broker.
If rack is available from broker, treat it as source of truth. For users with existing broker-rack mapping somewhere else, they can use the pluggable way or they can transfer the mapping to the broker rack property.

One thing I am not sure is what happens at rolling upgrade when we have rack as a broker property. For brokers with older version of Kafka, will it cause problem for them? If so, is there any workaround? I also think it would be better not to have rack in the controller wire protocol but not sure if it is achievable.

Thanks,
Allen

On Mon, Sep 28, 2015 at 4:55 PM, Todd Palino <tpal...@gmail.com> wrote:

I tend to like the idea of a pluggable locator. For example, we already have an interface for discovering information about the physical location of servers. I don't relish the idea of having to maintain data in multiple places.

-Todd

On Mon, Sep 28, 2015 at 4:48 PM, Aditya Auradkar <aaurad...@linkedin.com.invalid> wrote:

Thanks for starting this KIP Allen.

I agree with Gwen that having a RackLocator class that is pluggable seems to be too complex.
The KIP refers to potentially non-ZK storage for the rack info which I don't think is necessary.

Perhaps we can persist this info in zk under /brokers/ids/<broker_id> similar to other broker properties and add a config in KafkaConfig called "rack".
{"jmx_port":-1,"endpoints":[...],"host":"xxx","port":yyy, "rack": "abc"}

Aditya

On Mon, Sep 28, 2015 at 2:30 PM, Gwen Shapira <g...@confluent.io> wrote:

Hi,

First, thanks for putting out a KIP for this. This is super important for production deployments of Kafka.

Few questions:

1) Are we sure we want "as many racks as possible"?
>> > >>> > > > > > > I'd >> > >>> > > > > > > >> > want >> > >>> > > > > > > >> > > to >> > >>> > > > > > > >> > > > >>> > balance >> > >>> > > > > > > >> > > > >>> > > > > between safety (more racks) and >> > network >> > >>> > > > > utilization >> > >>> > > > > > > >> > (traffic >> > >>> > > > > > > >> > > > >>> within a >> > >>> > > > > > > >> > > > >>> > > > rack >> > >>> > > > > > > >> > > > >>> > > > > uses the high-bandwidth TOR >> switch). >> > One >> > >>> > > replica >> > >>> > > > > on >> > >>> > > > > > a >> > >>> > > > > > > >> > > different >> > >>> > > > > > > >> > > > >>> rack >> > >>> > > > > > > >> > > > >>> > > and >> > >>> > > > > > > >> > > > >>> > > > > the rest on same rack (if possible) >> > >>> sounds >> > >>> > > > better >> > >>> > > > > to >> > >>> > > > > > > me. >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > 2) Rack-locator class seems overly >> > >>> complex >> > >>> > > > > compared >> > >>> > > > > > to >> > >>> > > > > > > >> > > adding a >> > >>> > > > > > > >> > > > >>> > > > rack.number >> > >>> > > > > > > >> > > > >>> > > > > property to the broker properties >> > file. >> > >>> Why >> > >>> > do >> > >>> > > > we >> > >>> > > > > > want >> > >>> > > > > > > >> > that? >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > Gwen >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > On Mon, Sep 28, 2015 at 12:15 PM, >> > Allen >> > >>> > Wang < >> > >>> > > > > > > >> > > > >>> allenxw...@gmail.com> >> > >>> > > > > > > >> > > > >>> > > > wrote: >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > > > Hello Kafka Developers, >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > I just created KIP-36 for rack >> aware >> > >>> > replica >> > >>> > > > > > > >> assignment. 
>> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > >> > >>> > > > > > > >> > > > >>> > > >> > >>> > > > > > > >> > > > >>> > >> > >>> > > > > > > >> > > > >>> >> > >>> > > > > > > >> > > > >> > >>> > > > > > > >> > > >> > >>> > > > > > > >> > >> > >>> > > > > > > >> >> > >>> > > > > > > >> > >>> > > > > > >> > >>> > > > > >> > >>> > > > >> > >>> > > >> > >>> > >> > >>> >> > >> https://cwiki.apache.org/confluence/display/KAFKA/KIP-36+Rack+aware+replica+assignment >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > The goal is to utilize the >> isolation >> > >>> > > provided >> > >>> > > > by >> > >>> > > > > > the >> > >>> > > > > > > >> > racks >> > >>> > > > > > > >> > > in >> > >>> > > > > > > >> > > > >>> data >> > >>> > > > > > > >> > > > >>> > > > center >> > >>> > > > > > > >> > > > >>> > > > > > and distribute replicas to racks >> to >> > >>> > provide >> > >>> > > > > fault >> > >>> > > > > > > >> > > tolerance. >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > Comments are welcome. 
>> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > > Thanks, >> > >>> > > > > > > >> > > > >>> > > > > > Allen >> > >>> > > > > > > >> > > > >>> > > > > > >> > >>> > > > > > > >> > > > >>> > > > > >> > >>> > > > > > > >> > > > >>> > > > >> > >>> > > > > > > >> > > > >>> > > >> > >>> > > > > > > >> > > > >>> > >> > >>> > > > > > > >> > > > >>> >> > >>> > > > > > > >> > > > >> >> > >>> > > > > > > >> > > > >> >> > >>> > > > > > > >> > > > > >> > >>> > > > > > > >> > > > >> > >>> > > > > > > >> > > >> > >>> > > > > > > >> > >> > >>> > > > > > > >> >> > >>> > > > > > > > >> > >>> > > > > > > > >> > >>> > > > > > > >> > >>> > > > > > >> > >>> > > > > >> > >>> > > > >> > >>> > > >> > >>> > >> > >>> >> > >>> >> > >>> >> > >>> -- >> > >>> Thanks, >> > >>> Neha >> > >>> >> > >> >> > >> >> > > >> > >> > >
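
The two placement policies weighed in Gwen's first question can be made concrete with a small sketch. It is only an illustration under stated assumptions, not the algorithm from KIP-36 or from Kafka itself: the broker registrations are made-up values shaped like the JSON Aditya proposes (with a "rack" field), every broker is assumed to have a rack, and the function names `racks_by_broker`, `spread_across_racks`, and `one_remote_rest_local` are hypothetical.

```python
import json
from collections import defaultdict

# Hypothetical broker registrations, keyed by broker id. The shape follows
# the JSON suggested in the thread; hosts, ports, and racks are invented.
REGISTRATIONS = {
    0: '{"jmx_port": -1, "host": "h0", "port": 9092, "rack": "r1"}',
    1: '{"jmx_port": -1, "host": "h1", "port": 9092, "rack": "r1"}',
    2: '{"jmx_port": -1, "host": "h2", "port": 9092, "rack": "r2"}',
    3: '{"jmx_port": -1, "host": "h3", "port": 9092, "rack": "r2"}',
    4: '{"jmx_port": -1, "host": "h4", "port": 9092, "rack": "r3"}',
}


def racks_by_broker(registrations):
    """Map broker id -> rack, read from each registration's "rack" field."""
    return {bid: json.loads(raw)["rack"] for bid, raw in registrations.items()}


def spread_across_racks(broker_racks, replication_factor):
    """Policy A: place replicas on as many distinct racks as possible."""
    by_rack = defaultdict(list)
    for broker, rack in sorted(broker_racks.items()):
        by_rack[rack].append(broker)
    queues = [brokers for _, brokers in sorted(by_rack.items())]
    chosen = []
    # Take one broker per rack in turn until enough replicas are chosen
    # or every rack has run out of brokers.
    while len(chosen) < replication_factor and any(queues):
        for queue in queues:
            if queue and len(chosen) < replication_factor:
                chosen.append(queue.pop(0))
    return chosen


def one_remote_rest_local(broker_racks, replication_factor, leader):
    """Policy B: one replica on a different rack, the rest on the leader's
    rack when possible (falling back to other racks if it is too small)."""
    home = broker_racks[leader]
    local = [b for b in sorted(broker_racks)
             if broker_racks[b] == home and b != leader]
    remote = [b for b in sorted(broker_racks) if broker_racks[b] != home]
    chosen = [leader]
    if replication_factor > 1 and remote:
        chosen.append(remote.pop(0))
    for broker in local + remote:
        if len(chosen) >= replication_factor:
            break
        chosen.append(broker)
    return chosen


if __name__ == "__main__":
    racks = racks_by_broker(REGISTRATIONS)
    print(spread_across_racks(racks, 3))               # [0, 2, 4]
    print(one_remote_rest_local(racks, 3, leader=0))   # [0, 2, 1]
```

With three replicas, policy A lands on racks r1, r2, and r3, so it survives the loss of any single rack but sends all replication traffic across racks. Policy B lands on r1, r2, and r1, which keeps one follower behind the same TOR switch as the leader at the cost of tolerating fewer rack failures.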