> > > I am late to the game so take my comments w/ a grain of salt -- I'll
> > > take a look at HBASE-10070 -- but high-level do we have to go the read
> > > replicas route? IMO, having our current already-strained
> > > AssignmentManager code base manage three replicas instead of one will
> > > ensure that Jimmy Xiang and Jeffrey Zhong do nothing else for the next
> > > year or two but work on the new interesting use cases introduced by
> > > this new level of complexity put upon a system that has just achieved
> > > a hard-won stability.
> >
> > Stack, the model is that the replicas (HRegionInfo with an added field
> > 'replicaId') are treated just as any other region in the AM. You can
> > see the code - it's not adding much at all in terms of new code to
> > handle replicas.
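For illustration of the model Devaraj describes, a simplified sketch (the class and method names below are stand-ins for illustration only, not the actual patch; the real change adds a 'replicaId' field to HRegionInfo itself):

  // Simplified sketch: replica descriptors share the primary's boundaries
  // and differ only in replicaId, so the AM/LB treats each one as just
  // another region to assign.
  import java.util.ArrayList;
  import java.util.List;

  final class RegionInfoSketch {
    final String table;
    final byte[] startKey;
    final byte[] endKey;
    final int replicaId;   // 0 = primary, 1..N-1 = read replicas

    RegionInfoSketch(String table, byte[] startKey, byte[] endKey, int replicaId) {
      this.table = table;
      this.startKey = startKey;
      this.endKey = endKey;
      this.replicaId = replicaId;
    }

    // Derive all descriptors (primary plus replicas) for one region, giving
    // "replica count x num regions" regions overall for the table.
    static List<RegionInfoSketch> replicasFor(RegionInfoSketch primary, int replicaCount) {
      List<RegionInfoSketch> all = new ArrayList<>();
      for (int id = 0; id < replicaCount; id++) {
        all.add(new RegionInfoSketch(primary.table, primary.startKey, primary.endKey, id));
      }
      return all;
    }
  }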
Adding to what Devaraj said, we opted for actually creating one more
HRegionInfo object per region per replica so that the assignment state
machine is not affected. The high-level change is that we create
(num replicas x num regions) regions and assign them. The LB ensures that
replicas are placed for high availability across hosts and racks.

> > > A few of us chatting offline -- Jimmy, Jon, Elliott, and I -- were
> > > wondering if you couldn't solve this read replicas in a more hbase
> > > 'native' way* by just bringing up three tables -- a main table and
> > > then two snapshot clones with the clones refreshed on a period (via
> > > snapshot or via in-cluster replication) -- and then a shim on top of
> > > an HBase client would read from the main table until failure and then
> > > from a snapshot until the main came back. Reads from snapshot tables
> > > could be marked 'stale'. You'd have to modify the balancer so the
> > > tables -- or at least their regions -- were physically distinct...
> > > you might be able to just have the three tables each in a different
> > > namespace.

Doing region replicas via tables vs. multiplying the number of regions
would involve a very similar amount of code change. The LB would still have
to be aware that regions from different tables should not be co-hosted. As
per above, in neither case is the assignment state machine altered.
However, with different tables it would be unintuitive, since the meta and
the client side would have to tie together regions of different tables to
make sense of them. Those tables would not have any data of their own, but
would refer to the other tables, etc.

> > At a high level, considering all the work that would be needed in the
> > client (for it to be able to be aware of the primary and the snapshot
> > regions)
>
> Minor. Right? Snapshot tables would have a _snapshot suffix?
>
> > and in the master (to do with managing the placements of the regions),
>
> Balancer already factors myriad attributes. Adding one more rule seems
> like it would be near-in scope.
>
> And this would be work not in the client but in a layer above the client.
>
> > I am not convinced. Also, consider that you will be taking a lot of
> > snapshots and adding to the filesystem's load for the file creations.
>
> Snapshotting is a well-worn and tested code path. Making them is a pretty
> lightweight op. Frequency would depend on what the app needs.
>
> Could go the replication route too, another well-worn and tested code
> path.
>
> Trying to minimize the new code getting to the objective.

I think these are addressed by the region changes section in the design
doc. In the region-snapshots section, we detail how this will work like
single-region snapshots. We do not need table snapshots per se, since we
are opening the region replica from the files of the primary; there is
already a working patch for this in the branch. In the async-WAL
replication section, we mention how this can be built using the existing
replication mechanism. We cannot directly replicate to a different table,
since we do not want to multiply the actual data in HDFS, but we will tap
into the replica sink to do the in-cluster replication.

> > > Or how much more work would it take to follow the route our Facebook
> > > brothers and sisters have taken doing quorum reads and writes
> > > in-cluster?
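Coming back to the three-tables idea for a second, the client-side shim it implies would look roughly like the sketch below. This is purely illustrative of what is being debated -- the table names, fallback order, and error handling are made up, it is written loosely against the HBase client Connection/Table interfaces, and it is not code from either proposal:

  // Hypothetical shim: read the main table; on failure, fall back to one
  // of the periodically refreshed snapshot clones and mark the read stale.
  // (Closing the tables is left to the caller; omitted for brevity.)
  import java.io.IOException;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;

  final class FallbackReader {
    private final Table main;       // the primary table
    private final Table[] clones;   // snapshot clones refreshed on a period

    FallbackReader(Connection conn, String table, String... cloneNames) throws IOException {
      this.main = conn.getTable(TableName.valueOf(table));
      this.clones = new Table[cloneNames.length];
      for (int i = 0; i < cloneNames.length; i++) {
        clones[i] = conn.getTable(TableName.valueOf(cloneNames[i]));
      }
    }

    // Try the main table first; if it is unreachable, serve a possibly
    // stale read from one of the clones.
    ReadResult get(Get get) throws IOException {
      try {
        return new ReadResult(main.get(get), true);
      } catch (IOException primaryDown) {
        for (Table clone : clones) {
          try {
            return new ReadResult(clone.get(get), false);   // stale read
          } catch (IOException ignore) {
            // try the next clone
          }
        }
        throw primaryDown;  // nothing reachable
      }
    }

    static final class ReadResult {
      final Result result;
      final boolean fromPrimary;   // false => potentially stale
      ReadResult(Result result, boolean fromPrimary) {
        this.result = result;
        this.fromPrimary = fromPrimary;
      }
    }
  }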
> > If you are talking about Facebook's work that is talked about in
> > HBASE-7509, the quorum reads are something that we will benefit from,
> > and that will help the filesystem side of the story, but we still need
> > multiple (redundant) regions for the hbase side. If a region is not
> > reachable, the client could go to another replica for the region...
>
> No.
>
> Quorum read/writes as in paxos, raft (Liyin talked about the Facebook
> Hydrabase project at his keynote at hbasecon last year).

That won't happen without major architectural surgery in HBase.
HBASE-10070 is some major work, but it is in no way a major arch change, I
would say. Hydrabase / Megastore are also across-DC, while we are mostly
interested in intra-DC availability right now.

> > > * When I say 'native' way in the above, what I mean by this is that
> > > HBase has always been about giving clients a 'consistent' view -- at
> > > least when the query is to the source cluster. Introducing talk and
> > > APIs that talk of 'eventual consistency' muddies our story.
> >
> > As we have discussed in the jira, there are use cases. And it's
> > optional - all the APIs provide 'consistency' by default (status quo).
>
> Sorry I'm behind. Let me review. My concern is that our shell and API now
> will have notions of consistency other than "what you write is what you
> read" all over them because we took on a use case that is 'interesting'
> but, up to this at least, a rare request.

I think the Consistency API from the client and the shell is intuitive and
can be configured per request, which is the expected behavior. (
https://github.com/enis/hbase/commit/cf2c94022200a6fa7f3153b7e0655134fb73ec8c
and
https://github.com/enis/hbase/commit/75a4d9d7734ffa4f8a7b5aeb382f7a08e444984e
)

> Thanks,
> St.Ack
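For reference, based on the two commits linked above, per-request use of the Consistency API looks roughly like the following sketch (the table and row names are placeholders, and the exact client classes may still shift as the branch evolves):

  // Sketch of per-request timeline-consistent reads, based on the linked
  // branch commits; "usertable" and "row1" are placeholders.
  import java.io.IOException;
  import org.apache.hadoop.conf.Configuration;
  import org.apache.hadoop.hbase.HBaseConfiguration;
  import org.apache.hadoop.hbase.TableName;
  import org.apache.hadoop.hbase.client.Connection;
  import org.apache.hadoop.hbase.client.ConnectionFactory;
  import org.apache.hadoop.hbase.client.Consistency;
  import org.apache.hadoop.hbase.client.Get;
  import org.apache.hadoop.hbase.client.Result;
  import org.apache.hadoop.hbase.client.Table;
  import org.apache.hadoop.hbase.util.Bytes;

  public class TimelineReadExample {
    public static void main(String[] args) throws IOException {
      Configuration conf = HBaseConfiguration.create();
      try (Connection conn = ConnectionFactory.createConnection(conf);
           Table table = conn.getTable(TableName.valueOf("usertable"))) {

        // Default is unchanged: STRONG consistency, served by the primary only.
        Result strong = table.get(new Get(Bytes.toBytes("row1")));
        System.out.println("strong read empty? " + strong.isEmpty());

        // Opt in per request: the read may be served by a secondary replica.
        Get get = new Get(Bytes.toBytes("row1"));
        get.setConsistency(Consistency.TIMELINE);
        Result maybeStale = table.get(get);

        // The client can tell whether the result is possibly stale.
        System.out.println("stale? " + maybeStale.isStale());
      }
    }
  }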