[ https://issues.apache.org/jira/browse/HBASE-16095?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15384679#comment-15384679 ]
Enis Soztutar commented on HBASE-16095:
---------------------------------------

Thanks Stack for taking a look.

bq. can't we keep phoenix stuff up in phoenix? Secondary indices via transaction are almost here. Isn't that the proper fix rather than adding new pools to hbase (we don't need more pools), etc.

Unfortunately no. This happens during region open, so we need a mechanism to inject into / configure region opening; it is unrelated to RPC scheduling.

bq. Why we need this change if configuring below could address deadlock?

That is the deadlock between RPCs and regular index writes. This particular issue is about the writes that go to the index region while we are opening the data region. The secondary index recovery mechanism depends on the index region(s) being online. The writes happen in a blocking manner, so we block the actual region opener thread. Since the same region opener threads are used to open both data and index regions, a deadlock results.

bq. This sort of dependence amongst regions -- i.e. the index has to be online before data region can come on line -- is not supported in hbase; what happens if server carrying index region crashes... and other scenarios, etc. Has it been worked through? If so, where can I read about it?

I am not sure where you can read more. There were presentations online, but the implementation in Phoenix is some years old, with some changes since.

bq. We have a mechanism for onlining important regions already that has loads of holes in it (meta, namespace, etc.). The new AMv2 will go a long ways toward plugging a bunch of them. In this issue we are proposing a new means of doing a similar thing but on an even shakier foundation.

Not quite the same thing. The AM / Master can prioritize the opening of regions, but we cannot control all of the timing from the master's perspective. We cannot time new tables being created while servers go down and WAL recovery happens, etc. So there will never be a perfect, strict ordering enforceable from the master side if, for example, we want index table regions to be assigned before the data table regions. The AM can do a best-effort job. On the other hand, region servers do not need to order the incoming region open requests: if there is no dependency between regions, a fixed thread pool to open regions works; if there is a dependency, it does not.

bq. Seems dodgy Enis Soztutar, brittle as Gary Helmling says.

See my comment at https://issues.apache.org/jira/browse/HBASE-16095?focusedCommentId=15347538&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15347538. Transactions are an optional concept in Phoenix, and they are still not GA. Even if they were, not all use cases need transactions, so we should keep supporting secondary indexes without transactions in Phoenix for some time. I agree that the mutable index architecture as it stands today should be redesigned to remove the inter-region dependency and the blocking of handlers. I am working on a proposal to do this using replication, but getting that fully working will take some time. Until then, we have real users and customers running the current implementation who need this fix.

bq. Phoenix users will have to ensure they configure all index tables as PRIORITY (making index tables 'high priority' is a little unexpected)? For preexisting tables they'll have to go through and enable this everywhere?

I should have linked the Phoenix issue, my bad. PHOENIX-3072 is the fix in Phoenix that would automatically configure the priorities. BTW, I think the priority definition in the table descriptor also serves another purpose: we can use it in RPC scheduling itself, so it should be useful in its own right, regardless of Phoenix.
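To make the failure mode above concrete, here is a minimal, self-contained Java sketch (not HBase code; the pool sizes, timeouts, and method names are all illustrative) of why a single fixed-size region opener pool deadlocks when a queued "data region" open must block waiting on an "index region" open, and why routing the index open to a separate priority pool breaks the cycle:

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RegionOpenDeadlockDemo {

    // One shared opener pool: the data region open runs first and blocks
    // waiting for the index region to come online, but the index region
    // open task is queued behind it in the same pool. Deadlock.
    static String openWithSharedPool() throws Exception {
        ExecutorService opener = Executors.newFixedThreadPool(1);
        CountDownLatch indexOnline = new CountDownLatch(1);
        // "Data region" open: blocks until the index region is online.
        opener.submit(() -> {
            try { indexOnline.await(); } catch (InterruptedException ignored) { }
        });
        // "Index region" open: stuck in the queue behind the blocked task.
        opener.submit(indexOnline::countDown);
        boolean done = indexOnline.await(500, TimeUnit.MILLISECONDS);
        opener.shutdownNow();
        return done ? "completed" : "deadlocked";
    }

    // The fix sketched in this issue: a separate high-priority pool opens
    // the index region, so the data region's wait can complete.
    static String openWithPriorityPool() throws Exception {
        ExecutorService opener = Executors.newFixedThreadPool(1);
        ExecutorService priorityOpener = Executors.newFixedThreadPool(1);
        CountDownLatch indexOnline = new CountDownLatch(1);
        opener.submit(() -> {
            try { indexOnline.await(); } catch (InterruptedException ignored) { }
        });
        priorityOpener.submit(indexOnline::countDown);
        boolean done = indexOnline.await(500, TimeUnit.MILLISECONDS);
        opener.shutdownNow();
        priorityOpener.shutdownNow();
        return done ? "completed" : "deadlocked";
    }

    public static void main(String[] args) throws Exception {
        System.out.println("shared pool:   " + openWithSharedPool());   // deadlocked
        System.out.println("priority pool: " + openWithPriorityPool()); // completed
    }
}
```

The same structure holds regardless of pool size: with 3 opener workers, three data regions whose index regions are queued behind them exhaust the pool just as thoroughly.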
Moreover, I was thinking that although HBase "does not support" region interdependencies, we still have important tables with dependencies in most of the frameworks: the commit table in Omid, the catalog/stats tables in Phoenix, as well as the HBase-level system tables that use this.

> Add priority to TableDescriptor and priority region open thread pool
> --------------------------------------------------------------------
>
>                 Key: HBASE-16095
>                 URL: https://issues.apache.org/jira/browse/HBASE-16095
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Enis Soztutar
>            Assignee: Enis Soztutar
>             Fix For: 2.0.0, 1.3.0, 1.4.0, 0.98.21
>
>         Attachments: HBASE-16095-0.98.patch, HBASE-16095-0.98.patch,
> hbase-16095_v0.patch, hbase-16095_v1.patch, hbase-16095_v2.patch,
> hbase-16095_v3.patch
>
>
> This is in a similar area to HBASE-15816, and is also required by the
> current secondary indexing for Phoenix.
> The problem with Phoenix secondary indexes is that data table regions depend
> on index regions to be able to make progress. Possible distributed deadlocks
> can be prevented via custom RpcScheduler + RpcController configuration via
> HBASE-11048 and PHOENIX-938. However, region opening has the same deadlock
> situation, because a data region open has to replay its WAL edits to the
> index regions. There is only one thread pool to open regions, with 3 workers
> by default. So if the cluster is recovering / restarting from scratch, the
> deadlock happens because some index regions cannot be opened: they sit in
> the same queue, waiting behind data regions whose opens in turn wait on
> RPCs to index regions that are not yet open. This is reproduced in almost
> all Phoenix secondary index clusters (mutable tables w/o transactions) that
> we see.
> The proposal is to have a "high priority" region opening thread pool, and
> have the HTD carry the relative priority of a table.
> This may be useful for other "framework" level tables from Phoenix, Tephra,
> Trafodion, etc. if they want some specific tables to come online faster.
> As a follow-up patch, we can also take a look at how this priority
> information can be used by the RPC scheduler on the server side or the RPC
> controller on the client side, so that we do not have to set priorities
> manually per operation.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
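The dispatch side of the proposal (carry a relative priority in the table descriptor, and route region opens above some threshold to the dedicated pool) can be sketched as follows. This is a hypothetical model, not the actual patch: the class, the threshold value, and the pool sizes are invented for illustration.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PriorityRegionOpenDispatch {

    // Illustrative threshold; in the real patch the cutoff between "normal"
    // and "high priority" tables is an HBase implementation detail.
    static final int HIGH_PRIORITY_THRESHOLD = 200;

    // The existing fixed opener pool, plus the proposed priority pool.
    final ExecutorService regionOpener = Executors.newFixedThreadPool(3);
    final ExecutorService priorityRegionOpener = Executors.newFixedThreadPool(3);

    // Pick the pool from the priority the table descriptor carries, so that
    // index / system / framework tables never queue behind ordinary opens.
    ExecutorService chooseOpener(int tableDescriptorPriority) {
        return tableDescriptorPriority >= HIGH_PRIORITY_THRESHOLD
                ? priorityRegionOpener
                : regionOpener;
    }
}
```

With PHOENIX-3072 setting the descriptor priority on index tables automatically, users would not need to configure anything per table for the routing above to take effect.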