[jira] [Created] (HBASE-7460) Cleanup client connection layers
Gary Helmling created HBASE-7460: Summary: Cleanup client connection layers Key: HBASE-7460 URL: https://issues.apache.org/jira/browse/HBASE-7460 Project: HBase Issue Type: Improvement Components: Client, IPC/RPC Reporter: Gary Helmling This issue originated from a discussion over in HBASE-7442. We currently have a broken abstraction with {{HBaseClient}}, where it is bound to a single {{Configuration}} instance at time of construction, but then reused for all connections to all clusters. This is combined with multiple, overlapping layers of connection caching. Going through this code, it seems like we have a lot of mismatch between the higher layers and the lower layers, with too much abstraction in between. At the lower layers, most of the {{ClientCache}} stuff seems completely unused. We currently effectively have an {{HBaseClient}} singleton (for {{SecureClient}} as well in 0.92/0.94) in the client code, as I don't see anything that calls the constructor or {{RpcEngine.getProxy()}} versions with a non-default socket factory. So a lot of the code around this seems like built-up waste. The fact that a single Configuration is fixed in the {{HBaseClient}} seems like a broken abstraction as it currently stands. In addition to cluster ID, other configuration parameters (max retries, retry sleep) are fixed at time of construction. The more I look at the code, the more it looks like the {{ClientCache}} and sharing the {{HBaseClient}} instance is an unnecessary complication. Why cache the {{HBaseClient}} instances at all? In {{HConnectionManager}}, we already have a mapping from {{Configuration}} to {{HConnection}}. It seems to me like each {{HConnection(Implementation)}} instance should have its own {{HBaseClient}} instance, doing away with the {{ClientCache}} mapping. This would keep each {{HBaseClient}} associated with a single cluster/configuration and fix the current breakage from reusing the same {{HBaseClient}} against different clusters.
We need a refactoring of some of the interactions of {{HConnection(Implementation)}}, {{HBaseRPC/RpcEngine}}, and {{HBaseClient}}. Off hand, we might want to expose a separate {{RpcEngine.getClient()}} method that returns a new {{RpcClient}} interface (implemented by {{HBaseClient}}) and move the {{RpcEngine.getProxy()}}/{{stopProxy()}} implementations into the client. So all proxy invocations can go through the same client, without requiring the static client cache. I haven't fully thought this through, so I could be missing other important aspects. But that approach at least seems like a step in the right direction for fixing the client abstractions. -- This message is automatically generated by JIRA. If you think it was sent incorrectly, please contact your JIRA administrators For more information on JIRA, see: http://www.atlassian.com/software/jira
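As a rough illustration of the direction above, here is a toy sketch in Java. All names below ({{RpcClient}}, {{SimpleRpcClient}}, {{ConnectionSketch}}) are hypothetical stand-ins, not the actual HBase classes; the sketch only shows the proposed ownership change: each connection constructs and owns its own client, bound to a single cluster configuration, with no static client cache.

```java
// Hypothetical sketch: per-connection client ownership instead of a shared,
// statically cached HBaseClient. None of these are real HBase classes.
public class ConnectionSketch {
    // Stand-in for the HBase Configuration that fixes cluster ID, retries, etc.
    static class Configuration {
        final String clusterId;
        Configuration(String clusterId) { this.clusterId = clusterId; }
    }

    // Proposed RpcClient interface, as might be returned by RpcEngine.getClient()
    interface RpcClient {
        String getClusterId();
        void stop();
    }

    static class SimpleRpcClient implements RpcClient {
        private final Configuration conf;
        SimpleRpcClient(Configuration conf) { this.conf = conf; }
        public String getClusterId() { return conf.clusterId; }
        public void stop() { /* close sockets, etc. */ }
    }

    // Each HConnection(Implementation) holds its own client: one client per
    // cluster/configuration, no cross-cluster sharing and no ClientCache.
    static class HConnectionSketch {
        final RpcClient client;
        HConnectionSketch(Configuration conf) {
            this.client = new SimpleRpcClient(conf);
        }
    }

    public static void main(String[] args) {
        HConnectionSketch a = new HConnectionSketch(new Configuration("cluster-a"));
        HConnectionSketch b = new HConnectionSketch(new Configuration("cluster-b"));
        // Distinct clusters get distinct clients with distinct configurations.
        assert a.client != b.client;
        assert !a.client.getClusterId().equals(b.client.getClusterId());
    }
}
```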
[jira] [Created] (HBASE-9774) Provide a way for coprocessors to register and report custom metrics
Gary Helmling created HBASE-9774: Summary: Provide a way for coprocessors to register and report custom metrics Key: HBASE-9774 URL: https://issues.apache.org/jira/browse/HBASE-9774 Project: HBase Issue Type: New Feature Components: Coprocessors, metrics Reporter: Gary Helmling It would help provide better visibility into what coprocessors are doing if we provided a way for coprocessors to export their own metrics. The general idea is to:
* extend access to the HBase metrics bus down into the coprocessor environments
* coprocessors can then register and increment custom metrics
* coprocessor metrics are then reported along with all others through normal mechanisms
-- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-9897) Clean up some security configuration checks in LoadIncrementalHFiles
Gary Helmling created HBASE-9897: Summary: Clean up some security configuration checks in LoadIncrementalHFiles Key: HBASE-9897 URL: https://issues.apache.org/jira/browse/HBASE-9897 Project: HBase Issue Type: Task Components: security Reporter: Gary Helmling In LoadIncrementalHFiles, use of SecureBulkLoadClient is conditioned on UserProvider.isHBaseSecurityEnabled() in a couple of places. However, use of secure bulk loading seems to be required by use of HDFS secure authentication rather than HBase secure authentication. It should be possible to use secure bulk loading, as long as SecureBulkLoadEndpoint is loaded and HDFS secure authentication is enabled, regardless of the HBase authentication configuration. In addition, SecureBulkLoadEndpoint does a direct check on permissions by referencing the AccessController loaded on the same region, i.e.:
{code}
getAccessController().prePrepareBulkLoad(env);
{code}
It seems like this will throw an NPE if AccessController is not configured. We need an additional null check to handle this case gracefully. -- This message was sent by Atlassian JIRA (v6.1#6144)
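A minimal sketch of the null check being suggested, using simplified stand-in types (the real coprocessor environment and AccessController lookup are more involved than this):

```java
// Hypothetical sketch of the guarded permission check; these are stand-in
// types, not the actual HBase coprocessor classes.
public class SecureBulkLoadSketch {
    interface AccessController {
        void prePrepareBulkLoad(Object env);
    }

    private final AccessController accessController; // may be null if not loaded
    SecureBulkLoadSketch(AccessController ac) { this.accessController = ac; }

    // Instead of calling getAccessController().prePrepareBulkLoad(env)
    // unconditionally, guard against AccessController not being configured.
    boolean prepareBulkLoad(Object env) {
        AccessController ac = accessController;
        if (ac != null) {
            ac.prePrepareBulkLoad(env);
            return true;   // permission check ran
        }
        return false;      // no AccessController on this region; proceed gracefully
    }

    public static void main(String[] args) {
        // Without an AccessController, no NPE and no permission check.
        assert !new SecureBulkLoadSketch(null).prepareBulkLoad(new Object());
        // With one configured, the check runs.
        assert new SecureBulkLoadSketch(env -> {}).prepareBulkLoad(new Object());
    }
}
```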
[jira] [Resolved] (HBASE-9912) Need to delete a row based on partial rowkey in hbase ... Pls provide query for that
[ https://issues.apache.org/jira/browse/HBASE-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-9912. -- Resolution: Invalid This is a question, not a bug. Please email u...@hbase.apache.org with questions. JIRA is for actual bug reports, improvements, etc. See http://hbase.apache.org/mail-lists.html Need to delete a row based on partial rowkey in hbase ... Pls provide query for that - Key: HBASE-9912 URL: https://issues.apache.org/jira/browse/HBASE-9912 Project: HBase Issue Type: Bug Reporter: ranjini Priority: Critical -- This message was sent by Atlassian JIRA (v6.1#6144)
[jira] [Created] (HBASE-10162) Add RegionObserver lifecycle hook to be called when region is available
Gary Helmling created HBASE-10162: - Summary: Add RegionObserver lifecycle hook to be called when region is available Key: HBASE-10162 URL: https://issues.apache.org/jira/browse/HBASE-10162 Project: HBase Issue Type: Improvement Components: Coprocessors Reporter: Gary Helmling Over in HBASE-10161 and HBASE-10148, there is discussion of the need to modify existing coprocessors, which previously performed initialization only in postOpen(), in order to account for the new log replay mechanism happening post open. This points out that we have a hole in coprocessor lifecycle management which caused the use of region lifecycle hooks (postOpen()) in the first place. Instead of requiring coprocessor authors to hook into region lifecycle methods for initialization, we should provide an explicit lifecycle hook for coprocessor authors to use when region open, log replay (and any future requirements) are complete, say initializeWhenAvailable() (open to better names). -- This message was sent by Atlassian JIRA (v6.1.4#6159)
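A toy sketch of the ordering such a hook would guarantee. The hook name postRegionAvailable() below is a placeholder (the issue suggests initializeWhenAvailable(), with the name open for discussion), and the driver is a simplified stand-in for the real region open path:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the proposed availability hook; not the real
// RegionObserver interface or region open code.
public class LifecycleSketch {
    interface RegionObserverSketch {
        default void postOpen(List<String> log) { log.add("postOpen"); }
        // Proposed hook: fires only once open AND log replay are complete.
        default void postRegionAvailable(List<String> log) { log.add("available"); }
    }

    static List<String> openRegion(RegionObserverSketch obs) {
        List<String> log = new ArrayList<>();
        obs.postOpen(log);            // region opened, but replay may still run
        log.add("logReplay");         // distributed log replay happens post-open
        obs.postRegionAvailable(log); // safe point for coprocessor initialization
        return log;
    }

    public static void main(String[] args) {
        List<String> log = openRegion(new RegionObserverSketch() {});
        // The availability hook always fires last, after replay.
        assert log.get(log.size() - 1).equals("available");
    }
}
```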
[jira] [Created] (HBASE-10721) Parallelize execution of multi operations on RegionServer
Gary Helmling created HBASE-10721: - Summary: Parallelize execution of multi operations on RegionServer Key: HBASE-10721 URL: https://issues.apache.org/jira/browse/HBASE-10721 Project: HBase Issue Type: Improvement Components: regionserver Reporter: Gary Helmling In the context of HBASE-10169, we're adding the ability to batch Coprocessor endpoint calls per regionserver, using the same batching that happens in the RegionServer.multi() calls. However, execution of each of the calls will still happen serially on each RegionServer. For Coprocessor endpoint calls, it might help to parallelize these, since each execution could be of indeterminate length. This raises the question of whether other operations handled in multi() calls should also be parallelized, or whether we should just rely on macro-scale parallelization through the RPC handler threads. -- This message was sent by Atlassian JIRA (v6.2#6252)
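For illustration, a generic sketch of parallelizing per-region calls with a thread pool; this is not the actual RegionServer code, just the shape of the change being discussed:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: submit one "endpoint call" per region to a pool
// instead of executing them serially, then gather results in request order.
public class ParallelMultiSketch {
    static List<String> execute(List<Callable<String>> perRegionCalls) {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        try {
            List<Future<String>> futures = new ArrayList<>();
            for (Callable<String> call : perRegionCalls) {
                futures.add(pool.submit(call)); // calls now run concurrently
            }
            List<String> results = new ArrayList<>();
            for (Future<String> f : futures) {
                try {
                    results.add(f.get());       // results stay in request order
                } catch (Exception e) {
                    throw new RuntimeException(e);
                }
            }
            return results;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) {
        List<Callable<String>> calls = new ArrayList<>();
        for (int i = 0; i < 3; i++) {
            final int region = i;
            calls.add(() -> "region-" + region);
        }
        List<String> results = execute(calls);
        assert results.size() == 3;
        assert results.get(1).equals("region-1");
    }
}
```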
[jira] [Resolved] (HBASE-3726) Allow coprocessor callback RPC calls to be batched at region server level
[ https://issues.apache.org/jira/browse/HBASE-3726?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-3726. -- Resolution: Duplicate Closing as a duplicate of HBASE-10169 Allow coprocessor callback RPC calls to be batched at region server level - Key: HBASE-3726 URL: https://issues.apache.org/jira/browse/HBASE-3726 Project: HBase Issue Type: Improvement Components: Coprocessors Reporter: Ted Yu Currently the Callback.update() method is called for each Call.call() return value obtained from each region. Each Call.call() invocation is a separate RPC, so there is currently one RPC per region. So there's no place at the moment for the region server to be involved in any aggregation across regions. There is some preliminary support in HConnectionManager.HConnectionImplementation.processBatch() that would allow doing 1 RPC per region server, same as we do for multi-get and multi-put. We should provide the ability to batch callback RPC calls. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11292) Add an undelete operation
Gary Helmling created HBASE-11292: - Summary: Add an undelete operation Key: HBASE-11292 URL: https://issues.apache.org/jira/browse/HBASE-11292 Project: HBase Issue Type: New Feature Components: Deletes Reporter: Gary Helmling While column families can be configured to keep deleted cells (allowing time range queries to still retrieve those cells), deletes are still somewhat unique in that they are irreversible operations. Once a delete has been issued on a cell, the only way to undelete it is to rewrite the data with a timestamp newer than the delete. The idea here is to add an undelete operation that would make it possible to cancel a previous delete. An undelete operation will be similar to a delete, in that it will be written as a marker (tombstone doesn't seem like the right word). The undelete marker, however, will sort prior to a delete marker, canceling the effect of any following delete. In the absence of a column family configured to KEEP_DELETED_CELLS, we can't be sure if a prior delete marker and the affected cells have already been garbage collected. In this case (column family not configured with KEEP_DELETED_CELLS) it may be necessary for the server to reject undelete operations, to avoid creating the appearance of a client contract for undeletes that can't reliably be honored. I think there are additional subtleties of the implementation to be worked out, but I'm also interested in a broader discussion of interest in this capability. -- This message was sent by Atlassian JIRA (v6.2#6252)
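A toy model of the proposed marker ordering, under the assumption that marker types carry sort codes and that a lower code sorts first; the real KeyValue sort order is considerably more involved, so this only illustrates how an undelete marker sorting before a delete marker would cancel it:

```java
import java.util.Arrays;

// Hypothetical model only: type codes and visibility logic here are invented
// for illustration and do not match HBase's actual KeyValue type codes.
public class UndeleteSketch {
    // Lower code sorts first at the same cell coordinates; giving UNDELETE a
    // lower code than DELETE makes it sort prior to the delete it cancels.
    static final int UNDELETE = 1;
    static final int DELETE = 2;
    static final int PUT = 4;

    // A cell is visible if the first marker encountered in sort order is not
    // a delete (an earlier-sorting undelete cancels the delete).
    static boolean isVisible(int[] markersForCell) {
        int[] sorted = markersForCell.clone();
        Arrays.sort(sorted);
        return sorted[0] != DELETE;
    }

    public static void main(String[] args) {
        assert !isVisible(new int[]{DELETE, PUT});          // delete wins: hidden
        assert isVisible(new int[]{UNDELETE, DELETE, PUT}); // undelete cancels it
    }
}
```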
[jira] [Created] (HBASE-11653) RegionObserver coprocessor cannot override KeyValue values in prePut()
Gary Helmling created HBASE-11653: - Summary: RegionObserver coprocessor cannot override KeyValue values in prePut() Key: HBASE-11653 URL: https://issues.apache.org/jira/browse/HBASE-11653 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 0.94.21 Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical Due to a bug in {{HRegion.internalPut()}}, any modifications that a {{RegionObserver}} makes to a Put's family map in the {{prePut()}} hook are lost. This prevents coprocessors from modifying the values written by a {{Put}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-11800) Coprocessor service methods in HTableInterface should be annotated public
Gary Helmling created HBASE-11800: - Summary: Coprocessor service methods in HTableInterface should be annotated public Key: HBASE-11800 URL: https://issues.apache.org/jira/browse/HBASE-11800 Project: HBase Issue Type: Task Components: Client Affects Versions: 0.96.0, 0.98.0 Reporter: Gary Helmling The {{HTableInterface.coprocessorService(...)}} and {{HTableInterface.batchCoprocessorService(...)}} methods were made private in HBASE-9529, when the coprocessor APIs were seen as unstable and evolving. However, these methods represent a standard way for clients to use custom APIs exposed via coprocessors. In that sense, they are targeted at general HBase users (who may run but not develop coprocessors), as opposed to coprocessor developers who want to extend HBase. The coprocessor endpoint API has also remained much more stable than the coprocessor Observer interfaces, which tend to change along with HBase internals. So there should not be much difficulty in supporting these methods as part of the public API. I think we should drop the {{@InterfaceAudience.Private}} annotation on these methods and support them as part of the public {{HTableInterface}}. -- This message was sent by Atlassian JIRA (v6.2#6252)
[jira] [Created] (HBASE-12578) Change TokenProvider to a SingletonCoprocessorService
Gary Helmling created HBASE-12578: - Summary: Change TokenProvider to a SingletonCoprocessorService Key: HBASE-12578 URL: https://issues.apache.org/jira/browse/HBASE-12578 Project: HBase Issue Type: Improvement Components: security Reporter: Gary Helmling The {{TokenProvider}} coprocessor service, which is responsible for issuing HBase delegation tokens, currently runs as a region endpoint. In the security documentation, we recommend configuring this coprocessor for all table regions; however, we only ever address delegation token requests to the META region. When {{TokenProvider}} was first added, region coprocessors were the only way of adding endpoints. But, since then, we've added support for endpoints for regionserver and master coprocessors. This makes loading {{TokenProvider}} on all table regions unnecessarily wasteful. We can reduce the overhead for {{TokenProvider}} and greatly improve its scalability by doing the following:
# Convert {{TokenProvider}} to a {{SingletonCoprocessorService}} that is configured to run on all regionservers. This will ensure a single instance per regionserver instead of one per region.
# Direct delegation token requests to a random running regionserver so that we don't hotspot any single instance with requests.
-- This message was sent by Atlassian JIRA (v6.3.4#6332)
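A sketch of point 2, picking a random live regionserver for token requests; the server names and the selection helper are hypothetical:

```java
import java.util.Arrays;
import java.util.List;
import java.util.Random;

// Hypothetical sketch: instead of always addressing token requests to the
// META region's host, pick a random live regionserver so no single
// TokenProvider instance is hotspotted.
public class TokenProviderRouting {
    static String pickTokenServer(List<String> liveServers, Random rng) {
        if (liveServers.isEmpty()) {
            throw new IllegalStateException("no live regionservers");
        }
        return liveServers.get(rng.nextInt(liveServers.size()));
    }

    public static void main(String[] args) {
        List<String> servers = Arrays.asList("rs1:16020", "rs2:16020", "rs3:16020");
        String chosen = pickTokenServer(servers, new Random());
        // Whatever the random pick, it is always one of the live servers.
        assert servers.contains(chosen);
    }
}
```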
[jira] [Created] (HBASE-12579) Move obtainAuthTokenForJob() methods out of User
Gary Helmling created HBASE-12579: - Summary: Move obtainAuthTokenForJob() methods out of User Key: HBASE-12579 URL: https://issues.apache.org/jira/browse/HBASE-12579 Project: HBase Issue Type: Improvement Components: security Reporter: Gary Helmling The {{User}} class currently contains some utility methods to obtain HBase authentication tokens for the given user. However, these methods initiate an RPC to the {{TokenProvider}} coprocessor endpoint, an action which should not be part of the User class' responsibilities. This leads to a couple of problems: # The way the methods are currently structured, it is impossible to integrate them with normal connection management for the cluster (the TokenUtil class constructs its own HTable instance internally). # The User class is logically part of the hbase-common module, but uses the TokenUtil class (part of hbase-server, though it should probably be moved to hbase-client) through reflection, leading to a hidden dependency. The {{obtainAuthTokenForJob()}} methods should be deprecated and the process of obtaining authentication tokens should be moved to use the normal connection lifecycle. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14821) CopyTable should allow overriding more config properties for peer cluster
Gary Helmling created HBASE-14821: - Summary: CopyTable should allow overriding more config properties for peer cluster Key: HBASE-14821 URL: https://issues.apache.org/jira/browse/HBASE-14821 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Gary Helmling Assignee: Gary Helmling When using CopyTable across two separate clusters, you can specify the ZK quorum for the destination cluster, but not much else in configuration overrides. This can be a problem when the cluster configurations differ, such as when using security with different configurations for server principals. We should provide a general way to override configuration properties for the peer / destination cluster. One option would be to allow use of a prefix for command line properties ("peer.property."). Properties matching this prefix will be stripped and merged to the peer configuration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
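A sketch of the prefix-stripping idea, modeling the configurations as plain maps (the real code would work against Configuration objects); the prefix name "peer.property." is the one proposed above:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: strip the "peer.property." prefix from matching
// command-line properties and merge the remainder into the peer cluster's
// configuration. Maps stand in for HBase Configuration objects.
public class PeerConfigOverrides {
    static final String PEER_PREFIX = "peer.property.";

    static Map<String, String> applyPeerOverrides(Map<String, String> cmdLine,
                                                  Map<String, String> peerConf) {
        Map<String, String> merged = new HashMap<>(peerConf);
        for (Map.Entry<String, String> e : cmdLine.entrySet()) {
            if (e.getKey().startsWith(PEER_PREFIX)) {
                // Strip the prefix; the rest is a normal config key for the peer.
                merged.put(e.getKey().substring(PEER_PREFIX.length()), e.getValue());
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        Map<String, String> cmd = new HashMap<>();
        cmd.put("peer.property.hbase.master.kerberos.principal", "hbase/peer@REALM2");
        cmd.put("hbase.zookeeper.quorum", "local-zk"); // no prefix: source only
        Map<String, String> merged = applyPeerOverrides(cmd, new HashMap<>());
        assert merged.get("hbase.master.kerberos.principal").equals("hbase/peer@REALM2");
        assert !merged.containsKey("hbase.zookeeper.quorum");
    }
}
```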
[jira] [Created] (HBASE-14775) Replication can't authenticate with peer Zookeeper with different server principal
Gary Helmling created HBASE-14775: - Summary: Replication can't authenticate with peer Zookeeper with different server principal Key: HBASE-14775 URL: https://issues.apache.org/jira/browse/HBASE-14775 Project: HBase Issue Type: Bug Reporter: Gary Helmling Assignee: Gary Helmling When replication is setup with security, where the local ZK cluster and peer ZK cluster use different server principals, the source HBase cluster is unable to authenticate with the peer ZK cluster. When ZK is configured for SASL authentication and a server principal other than the default ("zookeeper") is used, the correct server principal must be specified on the client as a system property -- the confusingly named {{zookeeper.sasl.client.username}}. However, since this is given as a system property, authentication with the peer cluster breaks when it uses a different ZK server principal than the local cluster. We need a way of tying this setting to the replication peer config and then setting the property when the peer's ZooKeeperWatcher is created. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15025) Allow clients configured with insecure fallback to attempt SIMPLE auth when KRB fails
Gary Helmling created HBASE-15025: - Summary: Allow clients configured with insecure fallback to attempt SIMPLE auth when KRB fails Key: HBASE-15025 URL: https://issues.apache.org/jira/browse/HBASE-15025 Project: HBase Issue Type: Improvement Components: security Reporter: Gary Helmling Assignee: Gary Helmling We have separate configurations for both client and server allowing a "permissive" mode where connections to insecure servers and clients (respectively) are allowed. However, if both client and server are configured for Kerberos authentication for a given cluster, and Kerberos authentication fails, the connection will still fail, even if the fallback configurations are set to true. If the client is configured to allow insecure fallback, and Kerberos authentication fails, we could instead have the client retry with SIMPLE auth. If the server is also configured to allow insecure fallback, this would allow the connection to succeed in the case of transient problems with Kerberos infrastructure, for example. There is of course a danger that this would allow misconfigurations of security to be silently ignored, but we can add some loud logging on the client side when fallback to SIMPLE auth occurs, plus we have metrics and logging on the server side for fallbacks. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15038) ExportSnapshot should support separate configurations for source and destination clusters
Gary Helmling created HBASE-15038: - Summary: ExportSnapshot should support separate configurations for source and destination clusters Key: HBASE-15038 URL: https://issues.apache.org/jira/browse/HBASE-15038 Project: HBase Issue Type: Improvement Components: mapreduce, snapshots Reporter: Gary Helmling Assignee: Gary Helmling Currently ExportSnapshot uses a single Configuration instance for both the source and destination FileSystem instances to use. It should allow overriding properties for each filesystem connection separately. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14886) ReplicationAdmin does not use full peer configuration
Gary Helmling created HBASE-14886: - Summary: ReplicationAdmin does not use full peer configuration Key: HBASE-14886 URL: https://issues.apache.org/jira/browse/HBASE-14886 Project: HBase Issue Type: Bug Components: Replication Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical Fix For: 2.0.0, 1.2.0, 1.3.0 In {{listValidReplicationPeers()}}, we're creating the peer {{Configuration}} based on the source connection configuration and simply applying the peer ZK cluster key. This causes any additional properties present in the {{ReplicationPeerConfig}} configuration to not be applied. We should instead be using the configuration returned by {{ReplicationPeers.getPeerConf()}}, which we already call in that method. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-14866) VerifyReplication should use peer configuration in peer connection
Gary Helmling created HBASE-14866: - Summary: VerifyReplication should use peer configuration in peer connection Key: HBASE-14866 URL: https://issues.apache.org/jira/browse/HBASE-14866 Project: HBase Issue Type: Improvement Components: Replication Reporter: Gary Helmling Fix For: 2.0.0, 1.2.0, 1.3.0 VerifyReplication uses the replication peer's configuration to construct the ZooKeeper quorum address for the peer connection. However, other configuration properties in the peer's configuration are dropped. It should merge all configuration properties from the {{ReplicationPeerConfig}} when creating the peer connection and obtaining credentials for the peer cluster. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16025) Cache table state to reduce load on META
Gary Helmling created HBASE-16025: - Summary: Cache table state to reduce load on META Key: HBASE-16025 URL: https://issues.apache.org/jira/browse/HBASE-16025 Project: HBase Issue Type: Improvement Components: Client Reporter: Gary Helmling Priority: Critical Fix For: 2.0.0 HBASE-12035 moved keeping table enabled/disabled state from ZooKeeper into hbase:meta. When we retry operations on the client, we check table state in order to return a specific message if the table is disabled. This means that in master we will be going back to meta for every retry, even if a region's location has not changed. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16097) Flushes and compactions fail on getting split point
Gary Helmling created HBASE-16097: - Summary: Flushes and compactions fail on getting split point Key: HBASE-16097 URL: https://issues.apache.org/jira/browse/HBASE-16097 Project: HBase Issue Type: Bug Components: Compaction Affects Versions: 1.2.1 Reporter: Gary Helmling Assignee: Gary Helmling We've seen a number of cases where flushes and compactions run completely through, then throw an IndexOutOfBoundsException while getting the split point to check whether a split is needed. For flushes, the stack trace looks something like:
{noformat}
ERROR regionserver.MemStoreFlusher: Cache flusher failed for entry [flush region ]
java.lang.IndexOutOfBoundsException: 131148
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
at org.apache.hadoop.hbase.util.ByteBufferUtils.toBytes(ByteBufferUtils.java:491)
at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.midkey(HFileBlockIndex.java:351)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.midkey(HFileReaderV2.java:520)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.midkey(StoreFile.java:1510)
at org.apache.hadoop.hbase.regionserver.StoreFile.getFileSplitPoint(StoreFile.java:726)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getSplitPoint(DefaultStoreFileManager.java:127)
at org.apache.hadoop.hbase.regionserver.HStore.getSplitPoint(HStore.java:2036)
at org.apache.hadoop.hbase.regionserver.RegionSplitPolicy.getSplitPoint(RegionSplitPolicy.java:82)
at org.apache.hadoop.hbase.regionserver.HRegion.checkSplit(HRegion.java:7885)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:513)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.flushRegion(MemStoreFlusher.java:471)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher.access$900(MemStoreFlusher.java:75)
at org.apache.hadoop.hbase.regionserver.MemStoreFlusher$FlushHandler.run(MemStoreFlusher.java:259)
at java.lang.Thread.run(Thread.java:745)
{noformat}
For compactions, the exception
occurs in the same spot:
{noformat}
ERROR regionserver.CompactSplitThread: Compaction failed Request = regionName=X, storeName=X, fileCount=XX, fileSize=XXX M, priority=1, time=
java.lang.IndexOutOfBoundsException
at java.nio.Buffer.checkIndex(Buffer.java:540)
at java.nio.HeapByteBuffer.get(HeapByteBuffer.java:139)
at org.apache.hadoop.hbase.util.ByteBufferUtils.toBytes(ByteBufferUtils.java:491)
at org.apache.hadoop.hbase.io.hfile.HFileBlockIndex$BlockIndexReader.midkey(HFileBlockIndex.java:351)
at org.apache.hadoop.hbase.io.hfile.HFileReaderV2.midkey(HFileReaderV2.java:520)
at org.apache.hadoop.hbase.regionserver.StoreFile$Reader.midkey(StoreFile.java:1510)
at org.apache.hadoop.hbase.regionserver.StoreFile.getFileSplitPoint(StoreFile.java:726)
at org.apache.hadoop.hbase.regionserver.DefaultStoreFileManager.getSplitPoint(DefaultStoreFileManager.java:127)
at org.apache.hadoop.hbase.regionserver.HStore.getSplitPoint(HStore.java:2036)
at org.apache.hadoop.hbase.regionserver.RegionSplitPolicy.getSplitPoint(RegionSplitPolicy.java:82)
at org.apache.hadoop.hbase.regionserver.HRegion.checkSplit(HRegion.java:7885)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread.requestSplit(CompactSplitThread.java:241)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:540)
at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:566)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
{noformat}
This continues until a compaction runs through and rewrites whatever file is causing the problem, at which point a split can proceed successfully.
While compactions and flushes are successfully completing up until this point (it occurs after new store files have been moved into place), the exception thrown on flush causes us to exit prior to checking if a compaction is needed. So normal compactions wind up not being triggered and the affected regions accumulate a large number of store files. No root cause yet, so I'm parking this info here for investigation. Seems like we're either mis-writing part of the index or making some bad assumptions about the index blocks that we've read. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15111) "hbase version" should write to stdout
Gary Helmling created HBASE-15111: - Summary: "hbase version" should write to stdout Key: HBASE-15111 URL: https://issues.apache.org/jira/browse/HBASE-15111 Project: HBase Issue Type: Improvement Components: util Reporter: Gary Helmling Assignee: Gary Helmling Priority: Trivial Calling {{hbase version}} currently outputs the version info by writing to {{LOG.info}}. This means, if you change the default log level settings, you may get no output at all on the command line. Since {{VersionInfo.main()}} is being called, it should really just output straight to stdout. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15234) ReplicationLogCleaner can abort due to transient ZK issues
Gary Helmling created HBASE-15234: - Summary: ReplicationLogCleaner can abort due to transient ZK issues Key: HBASE-15234 URL: https://issues.apache.org/jira/browse/HBASE-15234 Project: HBase Issue Type: Bug Components: master, Replication Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical The ReplicationLogCleaner delegate for the LogCleaner chore can abort due to transient errors reading the replication znodes, leaving the log cleaner chore stopped, but the master still running. This causes logs to build up in the oldWALs directory, which can even hit storage or file count limits in HDFS, causing problems. We've seen this happen in a couple of clusters when a rolling restart was performed on the ZK peers (only one restarted at a time). The full stack trace when the log cleaner aborts is:
{noformat}
16/02/02 15:22:39 WARN zookeeper.ZKUtil: replicationLogCleaner-0x1522c8b93c2fbae, quorum=, baseZNode=/hbase Unable to get data of znode /hbase/replication/rs
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
at org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:110)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
16/02/02 15:22:39 ERROR zookeeper.ZooKeeperWatcher: replicationLogCleaner-0x1522c8b93c2fbae, quorum=, baseZNode=/hbase Received unexpected KeeperException, re-throwing exception
org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss for /hbase/replication/rs
at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
at org.apache.zookeeper.KeeperException.create(KeeperException.java:51)
at org.apache.zookeeper.ZooKeeper.getData(ZooKeeper.java:1155)
at org.apache.hadoop.hbase.zookeeper.RecoverableZooKeeper.getData(RecoverableZooKeeper.java:359)
at org.apache.hadoop.hbase.zookeeper.ZKUtil.getDataNoWatch(ZKUtil.java:713)
at org.apache.hadoop.hbase.replication.ReplicationQueuesClientZKImpl.getQueuesZNodeCversion(ReplicationQueuesClientZKImpl.java:80)
at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.loadWALsFromQueues(ReplicationLogCleaner.java:99)
at org.apache.hadoop.hbase.replication.master.ReplicationLogCleaner.getDeletableFiles(ReplicationLogCleaner.java:70)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteFiles(CleanerChore.java:233)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.checkAndDeleteEntries(CleanerChore.java:157)
at org.apache.hadoop.hbase.master.cleaner.CleanerChore.chore(CleanerChore.java:124)
at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:185)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
at java.util.concurrent.FutureTask.runAndReset
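One possible shape for making the cleaner resilient to the failure mode above: retry the ZK read on transient errors and, if it still fails, skip the current chore run (deleting nothing) instead of aborting. This is a generic sketch, not the actual fix:

```java
import java.util.concurrent.Callable;

// Hypothetical sketch: retry transient ZK reads instead of aborting the
// cleaner chore. The Callable stands in for the znode read.
public class TransientZkRetry {
    static <T> T readWithRetries(Callable<T> zkRead, int attempts, T fallback) {
        for (int i = 0; i < attempts; i++) {
            try {
                return zkRead.call();
            } catch (Exception transientError) {
                // e.g. ConnectionLoss during a rolling ZK restart; retry
            }
        }
        // Conservative fallback: treat the read as unanswered, so the caller
        // skips this chore run (deletes nothing) rather than stopping forever.
        return fallback;
    }

    public static void main(String[] args) {
        final int[] calls = {0};
        String result = readWithRetries(() -> {
            if (++calls[0] < 3) throw new RuntimeException("ConnectionLoss");
            return "queues-cversion";
        }, 5, null);
        assert "queues-cversion".equals(result); // succeeded on the 3rd attempt
        assert calls[0] == 3;
    }
}
```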
[jira] [Created] (HBASE-15363) Add client side metrics for SASL connection failures
Gary Helmling created HBASE-15363: - Summary: Add client side metrics for SASL connection failures Key: HBASE-15363 URL: https://issues.apache.org/jira/browse/HBASE-15363 Project: HBase Issue Type: Improvement Components: Client, metrics, security Reporter: Gary Helmling Assignee: Gary Helmling There are a number of cases where we can get SASL connection failures before getting to the server, like errors talking to the KDC/TGS and misconfiguration of kerberos principals. Hence these will not show up in the server-side authentication_failures metric. We should add client side metrics on SASL connection failures to capture these. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15294) Document advanced replication configurations with security
Gary Helmling created HBASE-15294: - Summary: Document advanced replication configurations with security Key: HBASE-15294 URL: https://issues.apache.org/jira/browse/HBASE-15294 Project: HBase Issue Type: Task Components: documentation Reporter: Gary Helmling Assignee: Gary Helmling HBASE-14866 fixed handling of source and cluster replication configs for some replication tools, which is needed, for example, for correct handling of some cross-realm trust security configurations. We need to document some examples in the reference guide. One example, to configure a replication peer with different server principals: {noformat} add_peer '1', CLUSTER_KEY => "server1.cie.com:2181:/hbase", CONFIG => { 'hbase.master.kerberos.principal' => 'hbase/instan...@realm2.com', 'hbase.regionserver.kerberos.principal' => 'hbase/instan...@realm2.com', } {noformat} Additional arguments to VerifyReplication should also be documented in the usage output. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15641) Shell "alter" should do a single modifyTable operation
Gary Helmling created HBASE-15641: - Summary: Shell "alter" should do a single modifyTable operation Key: HBASE-15641 URL: https://issues.apache.org/jira/browse/HBASE-15641 Project: HBase Issue Type: Improvement Components: shell Reporter: Gary Helmling When performing an "alter" on multiple column families in a table, the shell will perform a separate {{Admin.modifyColumn()}} call for each column family being modified, with all of the table regions being bulk-reopened each time. It would be much better to apply all of the changes to the table descriptor, then do a single call to {{Admin.modifyTable()}}. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-15573) Indefinite pause while trying to cleanup data
[ https://issues.apache.org/jira/browse/HBASE-15573?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-15573. --- Resolution: Invalid This JIRA instance is used for tracking development issues and bugs. Please send an email to the u...@hbase.apache.org mailing list to ask any questions. Sounds like a configuration issue with your client application. > Indefinite pause while trying to cleanup data > - > > Key: HBASE-15573 > URL: https://issues.apache.org/jira/browse/HBASE-15573 > Project: HBase > Issue Type: Bug >Affects Versions: 1.1.2, 1.1.4 >Reporter: Jorge Figueira >Priority: Blocker > Attachments: hbase-hadoop-master-HBASE.log, > hbase-hadoop-regionserver-HBASE.log, hbase-hadoop-zookeeper-HBASE.log > > > Can't retrieve any information with hbase rpc java client. > With hbase shell its possible to scan data and retrieve all the information > normally. > But with any rpc client region server don't retrieve data, all data come with > null values. > Region Server log: > DEBUG [RpcServer.reader=2,bindAddress=HBASE,port=16020] ipc.RpcServer: > RpcServer.listener,port=16020: DISCONNECTING client SERVER:37088 because read > count=-1 > DEBUG [RpcServer.reader=2,bindAddress=HBASE,port=16020] ipc.RpcServer: > RpcServer.listener,port=16020: DISCONNECTING client SERVER2:36997 because > read count=-1 > Master log: > 2016-03-31 18:16:27,998 DEBUG [ProcedureExecutorTimeout] > procedure2.ProcedureExecutor$CompletedProcedureCleaner: No completed > procedures to cleanup. > 2016-03-31 18:16:57,998 DEBUG [ProcedureExecutorTimeout] > procedure2.ProcedureExecutor$CompletedProcedureCleaner: No completed > procedures to cleanup. > 2016-03-31 18:17:27,998 DEBUG [ProcedureExecutorTimeout] > procedure2.ProcedureExecutor$CompletedProcedureCleaner: No completed > procedures to cleanup -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15741) TokenProvider coprocessor RPC incompatible between 1.2 and 1.3
Gary Helmling created HBASE-15741: - Summary: TokenProvider coprocessor RPC incompatible between 1.2 and 1.3 Key: HBASE-15741 URL: https://issues.apache.org/jira/browse/HBASE-15741 Project: HBase Issue Type: Bug Components: Coprocessors Affects Versions: 1.3.0 Reporter: Gary Helmling Priority: Blocker Attempting to run a map reduce job with a 1.3 client on a secure cluster running 1.2 fails when making the coprocessor rpc to obtain a delegation token: {noformat} Exception in thread "main" org.apache.hadoop.hbase.exceptions.UnknownProtocolException: org.apache.hadoop.hbase.exceptions.UnknownProtocolException: No registered coprocessor service found for name hbase.pb.AuthenticationService in region hbase:meta,,1 at org.apache.hadoop.hbase.regionserver.HRegion.execService(HRegion.java:7741) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execServiceOnRegion(RSRpcServices.java:1988) at org.apache.hadoop.hbase.regionserver.RSRpcServices.execService(RSRpcServices.java:1970) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33652) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:137) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:112) at java.lang.Thread.run(Thread.java:745) at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method) at sun.reflect.NativeConstructorAccessorImpl.newInstance(NativeConstructorAccessorImpl.java:62) at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(DelegatingConstructorAccessorImpl.java:45) at java.lang.reflect.Constructor.newInstance(Constructor.java:422) at org.apache.hadoop.ipc.RemoteException.instantiateException(RemoteException.java:106) at org.apache.hadoop.ipc.RemoteException.unwrapRemoteException(RemoteException.java:95) at 
org.apache.hadoop.hbase.protobuf.ProtobufUtil.getRemoteException(ProtobufUtil.java:332) at org.apache.hadoop.hbase.protobuf.ProtobufUtil.execService(ProtobufUtil.java:1631) at org.apache.hadoop.hbase.ipc.RegionCoprocessorRpcChannel$1.call(RegionCoprocessorRpcChannel.java:104) at org.apache.hadoop.hbase.ipc.RegionCoprocessorRpcChannel$1.call(RegionCoprocessorRpcChannel.java:94) at org.apache.hadoop.hbase.client.RpcRetryingCaller.callWithRetries(RpcRetryingCaller.java:137) at org.apache.hadoop.hbase.ipc.RegionCoprocessorRpcChannel.callExecService(RegionCoprocessorRpcChannel.java:108) at org.apache.hadoop.hbase.ipc.CoprocessorRpcChannel.callBlockingMethod(CoprocessorRpcChannel.java:73) at org.apache.hadoop.hbase.protobuf.generated.AuthenticationProtos$AuthenticationService$BlockingStub.getAuthenticationToken(AuthenticationProtos.java:4512) at org.apache.hadoop.hbase.security.token.TokenUtil.obtainToken(TokenUtil.java:86) at org.apache.hadoop.hbase.security.token.TokenUtil$1.run(TokenUtil.java:111) at org.apache.hadoop.hbase.security.token.TokenUtil$1.run(TokenUtil.java:108) at java.security.AccessController.doPrivileged(Native Method) at javax.security.auth.Subject.doAs(Subject.java:422) at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1657) at org.apache.hadoop.hbase.security.User$SecureHadoopUser.runAs(User.java:340) at org.apache.hadoop.hbase.security.token.TokenUtil.obtainToken(TokenUtil.java:108) at org.apache.hadoop.hbase.security.token.TokenUtil.addTokenForJob(TokenUtil.java:329) at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initCredentials(TableMapReduceUtil.java:490) at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:209) at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:162) at org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:285) at 
org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil.initTableMapperJob(TableMapReduceUtil.java:86) at org.apache.hadoop.hbase.mapreduce.CellCounter.createSubmittableJob(CellCounter.java:193) at org.apache.hadoop.hbase.mapreduce.CellCounter.main(CellCounter.java:290) Caused by: org.apache.hadoop.hbase.ipc.RemoteWithExtrasException(org.apache.hadoop.hbase.exceptions.UnknownProtocolException): org.apache.hadoop.hbase.exceptions.UnknownProtocolException: No registered coprocessor service found for name hbase.pb.AuthenticationService in region hba
[jira] [Created] (HBASE-15856) Cached Connection instances can wind up with addresses never resolved
Gary Helmling created HBASE-15856: - Summary: Cached Connection instances can wind up with addresses never resolved Key: HBASE-15856 URL: https://issues.apache.org/jira/browse/HBASE-15856 Project: HBase Issue Type: Bug Components: Client Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical During periods where DNS is not working properly, we can wind up caching connections to the master or regionservers where the initial hostname resolution failed, and the resolution is never re-attempted. This means that clients will forever get UnknownHostException for any calls. When constructing a BlockingRpcChannelImplementation, we instantiate the InetSocketAddress to use for the connection. This instance is then used in the rpc client connection, where we check isUnresolved() and throw an UnknownHostException if that returns true. However, at this point the rpc channel is already cached in the HConnectionImplementation map of stubs, so the address will never be resolved. Setting the config for hbase.resolve.hostnames.on.failure masks this issue, since the stub key used is modified to contain the address. However, even in that case, if DNS fails, an rpc channel instance with an unresolved InetSocketAddress will still be cached in the stubs under the hostname-only key. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
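The failure mode above follows from plain java.net behavior: an InetSocketAddress is immutable, so once an instance with a failed resolution is cached, every later check sees the same unresolved state. A minimal sketch, assuming nothing about HBase's actual classes (the names below are illustrative only):

```java
import java.net.InetSocketAddress;

public class UnresolvedAddressDemo {
    public static void main(String[] args) {
        // createUnresolved() skips DNS entirely, standing in for a lookup
        // that failed while DNS was down.
        InetSocketAddress cached =
            InetSocketAddress.createUnresolved("rs1.example.com", 16020);

        // A channel holding on to this instance fails the isUnresolved()
        // check forever: InetSocketAddress is immutable, so caching it
        // freezes the failed resolution in place.
        System.out.println(cached.isUnresolved()); // true
        System.out.println(cached.isUnresolved()); // still true; never re-attempted
    }
}
```

The fix direction implied by the issue is to avoid caching the channel (or the address) until resolution has actually succeeded, or to re-create the address on retry.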
[jira] [Created] (HBASE-15773) CellCounter improvements
Gary Helmling created HBASE-15773: - Summary: CellCounter improvements Key: HBASE-15773 URL: https://issues.apache.org/jira/browse/HBASE-15773 Project: HBase Issue Type: Improvement Components: mapreduce Reporter: Gary Helmling Looking at the CellCounter map reduce, it seems like it can be improved in a few areas: * it does not currently support setting scan batching. This is important when we're fetching all versions for columns. Actually, it would be nice to support all of the scan configuration currently provided in TableInputFormat. * generating job counters containing row keys and column qualifiers is guaranteed to blow up on anything but the smallest table. This is not usable and doesn't make any sense when the same counts are in the job output. The row and qualifier specific counters should be dropped. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-13707) CellCounter uses too many counters
[ https://issues.apache.org/jira/browse/HBASE-13707?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-13707. --- Resolution: Duplicate Assignee: Gary Helmling (was: NIDHI GAMBHIR) Fixed in HBASE-15773 > CellCounter uses too many counters > - > > Key: HBASE-13707 > URL: https://issues.apache.org/jira/browse/HBASE-13707 > Project: HBase > Issue Type: Bug > Components: mapreduce >Affects Versions: 1.0.1 >Reporter: Jean-Marc Spaggiari > Assignee: Gary Helmling >Priority: Minor > Labels: beginner > > CellCounter creates a counter per row... So it quickly becomes too many. > We should provide an option to drop the statistics per row and count only > cells overall for the table. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-15678) Normalize RetryingCallable cache clearing and implementations
Gary Helmling created HBASE-15678: - Summary: Normalize RetryingCallable cache clearing and implementations Key: HBASE-15678 URL: https://issues.apache.org/jira/browse/HBASE-15678 Project: HBase Issue Type: Sub-task Components: Client Reporter: Gary Helmling Assignee: Gary Helmling There is a fair amount of duplication and inconsistency in the meta cache handling of RetryingCallable implementations: * meta cache is often cleared in prepare() when reload=true, in addition to being cleared in throwable() * each RetryingCallable implementation does this slightly differently, leading to inconsistencies and potential bugs * RegionServerCallable and RegionAdminServiceCallable duplicate a lot of code, but with small, seemingly unnecessary inconsistencies. We should clean these up into a common base with subclasses doing only the necessary differentiation. The main goal here is to establish some common handling, to the extent possible, for the meta cache interactions by the different implementations. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16218) Eliminate use of UGI.doAs() in AccessController testing
Gary Helmling created HBASE-16218: - Summary: Eliminate use of UGI.doAs() in AccessController testing Key: HBASE-16218 URL: https://issues.apache.org/jira/browse/HBASE-16218 Project: HBase Issue Type: Sub-task Components: security Reporter: Gary Helmling Assignee: Gary Helmling Many tests for AccessController observer coprocessor hooks make use of UGI.doAs() when the test user could simply be passed through. Eliminate the unnecessary use of doAs(). -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16217) Identify calling user in ObserverContext
Gary Helmling created HBASE-16217: - Summary: Identify calling user in ObserverContext Key: HBASE-16217 URL: https://issues.apache.org/jira/browse/HBASE-16217 Project: HBase Issue Type: Sub-task Components: Coprocessors, security Reporter: Gary Helmling Assignee: Gary Helmling We already either explicitly pass down the relevant User instance initiating an action through the call path, or it is available through RpcServer.getRequestUser(). We should carry this through in the ObserverContext for coprocessor upcalls and make use of it for permissions checking. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16277) Improve CPU efficiency in VisibilityLabelsCache
Gary Helmling created HBASE-16277: - Summary: Improve CPU efficiency in VisibilityLabelsCache Key: HBASE-16277 URL: https://issues.apache.org/jira/browse/HBASE-16277 Project: HBase Issue Type: Improvement Components: security Reporter: Gary Helmling For secure clusters where the VisibilityController coprocessor is loaded, regionservers sometimes degrade into very high CPU utilization, with many of the RPC handler threads stuck in: {noformat} "B.defaultRpcServer.handler=0,queue=0,port=16020" #114 daemon prio=5 os_prio=0 tid=0x7f8a95bb7800 nid=0x382 runnable [0x7f8a3051f000] java.lang.Thread.State: RUNNABLE at java.lang.ThreadLocal$ThreadLocalMap.expungeStaleEntry(ThreadLocal.java:617) at java.lang.ThreadLocal$ThreadLocalMap.remove(ThreadLocal.java:499) at java.lang.ThreadLocal$ThreadLocalMap.access$200(ThreadLocal.java:298) at java.lang.ThreadLocal.remove(ThreadLocal.java:222) at java.util.concurrent.locks.ReentrantReadWriteLock$Sync.tryReleaseShared(ReentrantReadWriteLock.java:426) at java.util.concurrent.locks.AbstractQueuedSynchronizer.releaseShared(AbstractQueuedSynchronizer.java:1341) at java.util.concurrent.locks.ReentrantReadWriteLock$ReadLock.unlock(ReentrantReadWriteLock.java:881) at org.apache.hadoop.hbase.security.visibility.VisibilityLabelsCache.getGroupAuths(VisibilityLabelsCache.java:237) at org.apache.hadoop.hbase.security.visibility.FeedUserAuthScanLabelGenerator.getLabels(FeedUserAuthScanLabelGenerator.java:70) at org.apache.hadoop.hbase.security.visibility.DefaultVisibilityLabelServiceImpl.getVisibilityExpEvaluator(DefaultVisibilityLabelServiceImpl.java:469) at org.apache.hadoop.hbase.security.visibility.VisibilityUtils.createVisibilityLabelFilter(VisibilityUtils.java:284) at org.apache.hadoop.hbase.security.visibility.VisibilityController.preGetOp(VisibilityController.java:684) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$26.call(RegionCoprocessorHost.java:849) at 
org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost$RegionOperation.call(RegionCoprocessorHost.java:1673) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1749) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.execOperation(RegionCoprocessorHost.java:1705) at org.apache.hadoop.hbase.regionserver.RegionCoprocessorHost.preGet(RegionCoprocessorHost.java:845) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6748) at org.apache.hadoop.hbase.regionserver.HRegion.get(HRegion.java:6736) at org.apache.hadoop.hbase.regionserver.RSRpcServices.get(RSRpcServices.java:2029) at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:33644) at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2170) at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:109) at org.apache.hadoop.hbase.ipc.RpcExecutor.consumerLoop(RpcExecutor.java:137) at org.apache.hadoop.hbase.ipc.RpcExecutor$1.run(RpcExecutor.java:112) at java.lang.Thread.run(Thread.java:745) {noformat} In this case there are no visibility labels actually in use, so it appears that the locking overhead for the VisibilityLabelsCache can reach a tipping point where it does not degrade gracefully. We should look at alternate approaches to the label caching in place of the current ReentrantReadWriteLock. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
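One alternate approach to the ReentrantReadWriteLock, sketched here purely as an illustration (this is an assumption about a possible design, not the fix the project actually shipped): publish an immutable snapshot of the label mappings through a volatile field, so that readers pay only a volatile load and never touch a lock or its ThreadLocal bookkeeping. Writers copy-and-swap the whole map on the rare label update.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a lock-free-read label cache. All names here are
// illustrative; this is not VisibilityLabelsCache's real structure.
public class SnapshotLabelCache {
    // Readers see a fully-built, immutable map via one volatile read.
    private volatile Map<String, Integer> labels = Collections.emptyMap();

    public Integer getLabelOrdinal(String label) {
        return labels.get(label); // no lock, no ThreadLocal churn on the hot path
    }

    // Updates are rare, so a full copy is cheap relative to per-read locking.
    public synchronized void refresh(Map<String, Integer> fresh) {
        labels = Collections.unmodifiableMap(new HashMap<>(fresh)); // atomic publish
    }
}
```

The trade-off is slightly stale reads between a label change and the next refresh, which matches the cache's existing semantics of periodic reload from ZooKeeper.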
[jira] [Created] (HBASE-16231) Integration tests should support client keytab login for secure clusters
Gary Helmling created HBASE-16231: - Summary: Integration tests should support client keytab login for secure clusters Key: HBASE-16231 URL: https://issues.apache.org/jira/browse/HBASE-16231 Project: HBase Issue Type: Improvement Components: integration tests Reporter: Gary Helmling Assignee: Gary Helmling Integration tests currently rely on an external kerberos login for secure clusters. Elsewhere we use AuthUtil to login and refresh the credentials in a background thread. We should do the same here. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16141) Unwind use of UserGroupInformation.doAs() to convey requester identity in coprocessor upcalls
Gary Helmling created HBASE-16141: - Summary: Unwind use of UserGroupInformation.doAs() to convey requester identity in coprocessor upcalls Key: HBASE-16141 URL: https://issues.apache.org/jira/browse/HBASE-16141 Project: HBase Issue Type: Improvement Components: Coprocessors, security Reporter: Gary Helmling Assignee: Gary Helmling Fix For: 2.0.0, 1.4.0 In the discussion on HBASE-16115, there is some question of whether UserGroupInformation.doAs() is the right mechanism for propagating the original requester's identity in certain system contexts (splits, compactions, some procedure calls). It has the unfortunate effect of overriding the current user, which makes for very confusing semantics for coprocessor implementors. We should instead find an alternate mechanism for conveying the caller identity, one which does not override the current user context. I think we should look at passing this through as part of the ObserverContext passed to every coprocessor hook. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Resolved] (HBASE-16202) Backport metric for CallQueueTooBigException to 1.3
[ https://issues.apache.org/jira/browse/HBASE-16202?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-16202. --- Resolution: Invalid No backport needed, I just did a simple cherry-pick to branch-1.3. > Backport metric for CallQueueTooBigException to 1.3 > --- > > Key: HBASE-16202 > URL: https://issues.apache.org/jira/browse/HBASE-16202 > Project: HBase > Issue Type: Improvement > Components: IPC/RPC, metrics >Reporter: Gary Helmling > Assignee: Gary Helmling > > HBASE-15353 added a separate metric for tracking the number of > CallQueueTooBigExceptions, but only went in to 1.4+. Since CQTBE is already > in 1.2+, it would be nice to at least get this in the upcoming 1.3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16202) Backport metric for CallQueueTooBigException to 1.3
Gary Helmling created HBASE-16202: - Summary: Backport metric for CallQueueTooBigException to 1.3 Key: HBASE-16202 URL: https://issues.apache.org/jira/browse/HBASE-16202 Project: HBase Issue Type: Improvement Components: IPC/RPC, metrics Reporter: Gary Helmling Assignee: Gary Helmling HBASE-15353 added a separate metric for tracking the number of CallQueueTooBigExceptions, but only went in to 1.4+. Since CQTBE is already in 1.2+, it would be nice to at least get this in the upcoming 1.3.0 release. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-17604) Backport HBASE-15437 (fix request and response size metrics) to branch-1
Gary Helmling created HBASE-17604: - Summary: Backport HBASE-15437 (fix request and response size metrics) to branch-1 Key: HBASE-17604 URL: https://issues.apache.org/jira/browse/HBASE-17604 Project: HBase Issue Type: Bug Components: IPC/RPC, metrics Reporter: Gary Helmling HBASE-15437 fixed request and response size metrics in master. We should apply the same to branch-1 and related release branches. Prior to HBASE-15437, request and response size metrics were only calculated based on the protobuf message serialized size. This isn't correct when the cell scanner payload is in use. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HBASE-17611) Thrift 2 per-call latency metrics are capped at ~ 2 seconds
Gary Helmling created HBASE-17611: - Summary: Thrift 2 per-call latency metrics are capped at ~ 2 seconds Key: HBASE-17611 URL: https://issues.apache.org/jira/browse/HBASE-17611 Project: HBase Issue Type: Bug Components: metrics, Thrift Reporter: Gary Helmling Assignee: Gary Helmling Fix For: 1.3.1 Thrift 2 latency metrics are measured in nanoseconds. However, the duration used for per-method latencies is cast to an int, meaning the values are capped at 2.147 seconds. Let's use a long instead. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
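The cap falls straight out of integer arithmetic: Integer.MAX_VALUE nanoseconds is about 2.147 seconds, so any longer duration wraps when cast to int. A self-contained sketch of the bug class (the method names below are illustrative, not the actual Thrift server code):

```java
import java.util.concurrent.TimeUnit;

public class LatencyCapDemo {
    // Buggy form described in the issue: the nanosecond duration is cast to int,
    // so anything over Integer.MAX_VALUE ns (~2.147 s) overflows.
    public static int buggyLatencyNanos(long durationNanos) {
        return (int) durationNanos; // wraps for durations > 2.147 s
    }

    // Fix: keep the duration as a long end to end.
    public static long fixedLatencyNanos(long durationNanos) {
        return durationNanos;
    }

    public static void main(String[] args) {
        long threeSeconds = TimeUnit.SECONDS.toNanos(3); // 3_000_000_000 ns
        System.out.println(buggyLatencyNanos(threeSeconds)); // negative garbage after wrap
        System.out.println(fixedLatencyNanos(threeSeconds)); // 3000000000
        System.out.println(Integer.MAX_VALUE / 1e9);         // ~2.147 s, the observed cap
    }
}
```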
[jira] [Created] (HBASE-17578) Thrift per-method metrics should still update in the case of exceptions
Gary Helmling created HBASE-17578: - Summary: Thrift per-method metrics should still update in the case of exceptions Key: HBASE-17578 URL: https://issues.apache.org/jira/browse/HBASE-17578 Project: HBase Issue Type: Bug Components: Thrift Reporter: Gary Helmling Assignee: Gary Helmling Fix For: 1.3.1 Currently, the InvocationHandler used to update per-method metrics in the Thrift server fails to update metrics if an exception occurs. This causes us to miss outliers. We should include exceptional cases in per-method latencies, and also look at adding specific exception rate metrics. -- This message was sent by Atlassian JIRA (v6.3.15#6346)
[jira] [Created] (HBASE-16540) Scan should do additional validation on start and stop row
Gary Helmling created HBASE-16540: - Summary: Scan should do additional validation on start and stop row Key: HBASE-16540 URL: https://issues.apache.org/jira/browse/HBASE-16540 Project: HBase Issue Type: Bug Components: Client Reporter: Gary Helmling Scan.setStartRow() and setStopRow() should validate the byte[] passed to ensure it meets the criteria for a row key. If the byte[] length is greater than Short.MAX_VALUE, we should throw an IllegalArgumentException in order to fast fail and prevent server-side errors being thrown and retried. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
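The proposed check can be sketched in plain Java (the class and method names here are illustrative, not the actual patch): reject an oversized row key on the client before any RPC is attempted.

```java
public class RowKeyValidation {
    // Fast-fail validation for a row key, per the issue: anything longer
    // than Short.MAX_VALUE (32767) bytes cannot be a valid row key.
    public static void checkRow(byte[] row) {
        if (row == null) {
            throw new IllegalArgumentException("Row key cannot be null");
        }
        if (row.length > Short.MAX_VALUE) {
            throw new IllegalArgumentException(
                "Row length " + row.length + " exceeds max " + Short.MAX_VALUE);
        }
    }

    public static void main(String[] args) {
        checkRow(new byte[] {'r', '1'}); // ok
        try {
            checkRow(new byte[Short.MAX_VALUE + 1]); // 32768 bytes: rejected locally
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

Failing in the setter keeps the error close to the caller's code, instead of surfacing as a retried server-side exception.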
[jira] [Created] (HBASE-16518) Remove old .arcconfig file
Gary Helmling created HBASE-16518: - Summary: Remove old .arcconfig file Key: HBASE-16518 URL: https://issues.apache.org/jira/browse/HBASE-16518 Project: HBase Issue Type: Task Components: tooling Reporter: Gary Helmling Assignee: Gary Helmling Priority: Trivial The project .arcconfig file points to a project that no longer exists on a no longer supported phabricator instance. Since it is no longer used for reviews, let's drop it. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16788) Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles()
Gary Helmling created HBASE-16788: - Summary: Race in compacted file deletion between HStore close() and closeAndArchiveCompactedFiles() Key: HBASE-16788 URL: https://issues.apache.org/jira/browse/HBASE-16788 Project: HBase Issue Type: Bug Components: regionserver Affects Versions: 1.3.0 Reporter: Gary Helmling Assignee: Gary Helmling Priority: Blocker HBASE-13082 changed the way that compacted files are archived from being done inline on compaction completion to an async cleanup by the CompactedHFilesDischarger chore. It looks like the changes to HStore to support this introduced a race condition in the compacted HFile archiving. In the following sequence, we can wind up with two separate threads trying to archive the same HFiles, causing a regionserver abort: # compaction completes normally and the compacted files are added to {{compactedfiles}} in HStore's DefaultStoreFileManager # *threadA*: CompactedHFilesDischargeHandler runs in a RS executor service, calling closeAndArchiveCompactedFiles() ## obtains HStore readlock ## gets a copy of compactedfiles ## releases readlock # *threadB*: calls HStore.close() as part of region close ## obtains HStore writelock ## calls DefaultStoreFileManager.clearCompactedfiles(), getting a copy of same compactedfiles # *threadA*: calls HStore.removeCompactedfiles(compactedfiles) ## archives files in {{compactedfiles}} in HRegionFileSystem.removeStoreFiles() ## calls HStore.clearCompactedFiles() ## waits on write lock # *threadB*: continues with close() ## calls removeCompactedfiles(compactedfiles) ## calls HRegionFileSystem.removeStoreFiles() -> HFileArchiver.archiveStoreFiles() ## receives FileNotFoundException because the files have already been archived by threadA ## throws IOException # RS aborts I think the combination of fetching the compactedfiles list and removing the files needs to be covered by locking. 
Options I see are: * Modify HStore.closeAndArchiveCompactedFiles(): use writelock instead of readlock and move the call to removeCompactedfiles() inside the lock. This means the read operations will be blocked while the files are being archived, which is bad. * Synchronize closeAndArchiveCompactedFiles() and modify close() to call it instead of calling removeCompactedfiles() directly * Add a separate lock for compacted files removal and use in closeAndArchiveCompactedFiles() and close() -- This message was sent by Atlassian JIRA (v6.3.4#6332)
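The third option above can be sketched with a dedicated lock that makes "copy the compacted-file list and clear it" a single atomic step, so whichever of close() and closeAndArchiveCompactedFiles() gets there first drains the list and the other sees it empty. This is a hypothetical illustration of the locking pattern, not HStore's actual code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch: one lock guards both mutation and drain of the
// compacted-file list, so two archivers can never hold the same copy.
public class CompactedFilesHolder {
    private final ReentrantLock archiveLock = new ReentrantLock();
    private final List<String> compactedFiles = new ArrayList<>();

    public void addCompactedFile(String file) {
        archiveLock.lock();
        try {
            compactedFiles.add(file);
        } finally {
            archiveLock.unlock();
        }
    }

    // Atomically take ownership of the current list. A second caller
    // (e.g. close() racing the discharger chore) gets an empty list
    // instead of a stale copy naming already-archived files.
    public List<String> drainForArchiving() {
        archiveLock.lock();
        try {
            List<String> copy = new ArrayList<>(compactedFiles);
            compactedFiles.clear();
            return copy;
        } finally {
            archiveLock.unlock();
        }
    }
}
```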
[jira] [Created] (HBASE-16657) Expose per-region last major compaction timestamp in RegionServer UI
Gary Helmling created HBASE-16657: - Summary: Expose per-region last major compaction timestamp in RegionServer UI Key: HBASE-16657 URL: https://issues.apache.org/jira/browse/HBASE-16657 Project: HBase Issue Type: Improvement Components: regionserver, UI Reporter: Gary Helmling HBASE-12859 added some tracking for the last major compaction completed for each region. However, this is currently only exposed through the cluster status reporting and the Admin API. Since the regionserver is already reporting this information, it would be nice to fold it in somewhere to the region listing in the regionserver UI. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16661) Add last major compaction age to per-region metrics
Gary Helmling created HBASE-16661: - Summary: Add last major compaction age to per-region metrics Key: HBASE-16661 URL: https://issues.apache.org/jira/browse/HBASE-16661 Project: HBase Issue Type: Improvement Reporter: Gary Helmling Priority: Minor After HBASE-12859, we can now track the last major compaction timestamp for each region. However, this is only exposed through cluster status reporting and the admin API. We have similar per-region metrics around storefile age, but none that filters on major compaction specifically. Let's add a metric for last major compaction age. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
[jira] [Created] (HBASE-16754) Regions failing compaction due to referencing non-existent store file
Gary Helmling created HBASE-16754: - Summary: Regions failing compaction due to referencing non-existent store file Key: HBASE-16754 URL: https://issues.apache.org/jira/browse/HBASE-16754 Project: HBase Issue Type: Bug Reporter: Gary Helmling Priority: Blocker Fix For: 1.3.0

Running a mixed read write workload on a recent build off branch-1.3, we are seeing compactions occasionally fail with errors like the following (actual filenames replaced with placeholders):

{noformat}
16/09/27 16:57:28 ERROR regionserver.CompactSplitThread: Compaction selection failed Store = XXX, pri = 116
java.io.FileNotFoundException: File does not exist: hdfs://.../hbase/data/ns/table/region/cf/XXfilenameXX
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
        at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:321)
        at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
        at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:63)
        at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
        at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1644)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.selectCompaction(CompactSplitThread.java:373)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.access$100(CompactSplitThread.java:59)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.doCompaction(CompactSplitThread.java:498)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread$CompactionRunner.run(CompactSplitThread.java:568)
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
        at java.lang.Thread.run(Thread.java:745)
16/09/27 17:01:31 ERROR regionserver.CompactSplitThread: Compaction selection failed Store = XXX, pri = 115
java.io.FileNotFoundException: File does not exist: hdfs://.../hbase/data/ns/table/region/cf/XXfilenameXX
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1309)
        at org.apache.hadoop.hdfs.DistributedFileSystem$22.doCall(DistributedFileSystem.java:1301)
        at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
        at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1317)
        at org.apache.hadoop.fs.FilterFileSystem.getFileStatus(FilterFileSystem.java:421)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getReferencedFileStatus(StoreFileInfo.java:342)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getFileStatus(StoreFileInfo.java:355)
        at org.apache.hadoop.hbase.regionserver.StoreFileInfo.getModificationTime(StoreFileInfo.java:360)
        at org.apache.hadoop.hbase.regionserver.StoreFile.getModificationTimeStamp(StoreFile.java:321)
        at org.apache.hadoop.hbase.regionserver.StoreUtils.getLowestTimestamp(StoreUtils.java:63)
        at org.apache.hadoop.hbase.regionserver.compactions.RatioBasedCompactionPolicy.shouldPerformMajorCompaction(RatioBasedCompactionPolicy.java:63)
        at org.apache.hadoop.hbase.regionserver.compactions.SortedCompactionPolicy.selectCompaction(SortedCompactionPolicy.java:82)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreEngine$DefaultCompactionContext.select(DefaultStoreEngine.java:107)
        at org.apache.hadoop.hbase.regionserver.HStore.requestCompaction(HStore.java:1644)
        at org.apache.hadoop.hbase.regionserver.CompactSplitThread.selectCompaction(CompactSplitThread.java:373
{noformat}
[jira] [Created] (HBASE-16958) Balancer recomputes block distributions every time balanceCluster() runs
Gary Helmling created HBASE-16958: - Summary: Balancer recomputes block distributions every time balanceCluster() runs Key: HBASE-16958 URL: https://issues.apache.org/jira/browse/HBASE-16958 Project: HBase Issue Type: Bug Components: Balancer Reporter: Gary Helmling Assignee: Gary Helmling Fix For: 1.3.0

The change in HBASE-16570 modified the balancer to compute block distributions in parallel with a pool of 5 threads. However, because it does this every time Cluster is instantiated, it effectively bypasses the cache of block locations added in HBASE-14473:

In the LoadBalancer.balanceCluster() implementations (in StochasticLoadBalancer, SimpleLoadBalancer), we create a new Cluster instance. In the Cluster constructor, we call registerRegion() on every HRegionInfo. In registerRegion(), we do the following:

{code}
regionLocationFutures.set(regionIndex,
    regionFinder.asyncGetBlockDistribution(region));
{code}

Then, back in the Cluster constructor, we do a get() on each ListenableFuture in a loop. So while we are doing the calls to get block locations in parallel with 5 threads, we're recomputing them every time balanceCluster() is called and not taking advantage of the cache at all. -- This message was sent by Atlassian JIRA (v6.3.4#6332)
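The caching behavior being bypassed here can be sketched as a memoized lookup: consult a per-region cache before kicking off the expensive block-location scan, so repeated balanceCluster() passes reuse earlier results. This is a hypothetical illustration, not the actual RegionLocationFinder API; the class and method names below are invented for the example.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch: cache block distributions per region so that repeated
// balancer runs do not recompute them from HDFS every time.
class RegionLocalityCache {
    private final Map<String, double[]> cache = new ConcurrentHashMap<>();
    private int computations = 0;  // exposed for illustration: how often we recomputed

    // Stand-in for the expensive HDFS block-location scan.
    private double[] computeDistribution(String regionName) {
        computations++;
        return new double[] {1.0};  // placeholder per-host locality weights
    }

    // Returns the cached distribution, computing it only on first request.
    double[] getBlockDistribution(String regionName) {
        return cache.computeIfAbsent(regionName, this::computeDistribution);
    }

    int getComputationCount() {
        return computations;
    }
}
```

The point of the design is that instantiating a new Cluster (or calling the balancer again) should hit the cache path, not the compute path; cache invalidation on region movement would be layered on top.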
[jira] [Resolved] (HBASE-16958) Balancer recomputes block distributions every time balanceCluster() runs
[ https://issues.apache.org/jira/browse/HBASE-16958?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-16958. --- Resolution: Duplicate Assignee: (was: Gary Helmling) Fix Version/s: (was: 1.3.0) I re-opened HBASE-16570 to fix the issue that is described here.

> Balancer recomputes block distributions every time balanceCluster() runs
[jira] [Reopened] (HBASE-16570) Compute region locality in parallel at startup
[ https://issues.apache.org/jira/browse/HBASE-16570?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling reopened HBASE-16570: --- I've reverted this from branch-1.3 for the moment, until the issue that I described can be addressed. I don't see where this would impact master startup time. If we need to pre-initialize this on startup, let's do it in a background thread only on startup. We need to make sure that locality is not recomputed on every run and that we use the cache instead.

> Compute region locality in parallel at startup
> Key: HBASE-16570
> URL: https://issues.apache.org/jira/browse/HBASE-16570
> Project: HBase
> Issue Type: Sub-task
> Reporter: binlijin
> Assignee: binlijin
> Fix For: 2.0.0, 1.4.0, 1.3.1
> Attachments: HBASE-16570-master_V1.patch, HBASE-16570-master_V2.patch, HBASE-16570-master_V3.patch, HBASE-16570-master_V4.patch
[jira] [Created] (HBASE-16964) Successfully archived files are not cleared from compacted store file list if archiving of any file fails
Gary Helmling created HBASE-16964: - Summary: Successfully archived files are not cleared from compacted store file list if archiving of any file fails Key: HBASE-16964 URL: https://issues.apache.org/jira/browse/HBASE-16964 Project: HBase Issue Type: Bug Components: regionserver Reporter: Gary Helmling Assignee: Gary Helmling Priority: Blocker Fix For: 1.3.0

In HStore.removeCompactedFiles(), we only clear archived files from StoreFileManager's list of compactedfiles if _all_ files were archived successfully. If we encounter an error archiving any of the files, then any files which were already archived will remain in the list of compactedfiles.

Even worse, this means that all subsequent attempts to archive the list of compacted files will fail (as the previously archived files still in the list will now throw FileNotFoundException), and the list of compactedfiles will never be cleared from that point on.

Finally, when the region closes, we will again throw an exception out of HStore.removeCompactedFiles(), in this case causing a regionserver abort.
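The fix described above amounts to tracking which files were archived successfully and clearing those from the compacted-file list even when a later file fails, so one bad file does not poison every subsequent attempt. This is a hypothetical sketch, not the actual HStore/StoreFileManager API; the names below are invented for the example.

```java
import java.io.IOException;
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;

// Hypothetical sketch of partial-failure handling when archiving compacted files.
class CompactedFileCleaner {
    interface Archiver {
        void archive(String file) throws IOException;
    }

    // Returns the files that were archived successfully (and therefore should be
    // removed from the store's compacted-file list), even if a later file failed.
    static List<String> archiveCompactedFiles(Collection<String> compactedFiles,
                                              Archiver archiver) {
        List<String> archived = new ArrayList<>();
        for (String file : compactedFiles) {
            try {
                archiver.archive(file);
                archived.add(file);
            } catch (IOException e) {
                // Stop here, but still report the files archived so far so the
                // caller can drop them from its list and retry only the rest.
                break;
            }
        }
        return archived;
    }
}
```

With this shape, a retry after a transient archiving error only re-attempts the files that actually remain, instead of hitting FileNotFoundException on the ones already moved.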
[jira] [Resolved] (HBASE-16146) Counters are expensive...
[ https://issues.apache.org/jira/browse/HBASE-16146?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-16146. --- Resolution: Fixed Assignee: Gary Helmling Hadoop Flags: Reviewed Fix Version/s: 1.4.0, 1.3.0, 2.0.0

Committed to branch-1.3, branch-1, and master. Counter is no longer used in master, but still present as a deprecated class, so included for consistency. Thanks, [~stack], [~mantonov], and [~enis] for reviews.

> Counters are expensive...
> Key: HBASE-16146
> URL: https://issues.apache.org/jira/browse/HBASE-16146
> Project: HBase
> Issue Type: Sub-task
> Reporter: stack
> Assignee: Gary Helmling
> Fix For: 2.0.0, 1.3.0, 1.4.0
> Attachments: HBASE-16146.001.patch, HBASE-16146.branch-1.001.patch, HBASE-16146.branch-1.3.001.patch, counters.patch, less_and_less_counters.png
>
> Doing workloadc, perf shows 10%+ of CPU being spent on counter#add. If I disable some of the hot ones -- see patch -- I can get 10% more throughput (390k to 440k). Figure something better.
[jira] [Resolved] (HBASE-16337) Removing peers seem to be leaving spare queues
[ https://issues.apache.org/jira/browse/HBASE-16337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-16337. --- Resolution: Duplicate Closing as a dupe, thanks for pointing it out.

> Removing peers seem to be leaving spare queues
> Key: HBASE-16337
> URL: https://issues.apache.org/jira/browse/HBASE-16337
> Project: HBase
> Issue Type: Sub-task
> Components: Replication
> Reporter: Joseph
>
> I have been running IntegrationTestReplication repeatedly with the backported Replication Table changes. Every other iteration of the test fails with the error below, but these queues should have been deleted when we removed the peers. I believe this may be related to HBASE-16096, HBASE-16208, or HBASE-16081.
>
> 16/08/02 08:36:07 ERROR util.AbstractHBaseTool: Error running command-line tool
> org.apache.hadoop.hbase.replication.ReplicationException: undeleted queue for peerId: TestPeer, replicator: hbase4124.ash2.facebook.com,16020,1470150251042, queueId: TestPeer
> at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.checkQueuesDeleted(ReplicationPeersZKImpl.java:544)
> at org.apache.hadoop.hbase.replication.ReplicationPeersZKImpl.addPeer(ReplicationPeersZKImpl.java:127)
> at org.apache.hadoop.hbase.client.replication.ReplicationAdmin.addPeer(ReplicationAdmin.java:200)
> at org.apache.hadoop.hbase.test.IntegrationTestReplication$VerifyReplicationLoop.setupTablesAndReplication(IntegrationTestReplication.java:239)
> at org.apache.hadoop.hbase.test.IntegrationTestReplication$VerifyReplicationLoop.run(IntegrationTestReplication.java:325)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.test.IntegrationTestReplication.runTestFromCommandLine(IntegrationTestReplication.java:418)
> at org.apache.hadoop.hbase.IntegrationTestBase.doWork(IntegrationTestBase.java:134)
> at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:112)
> at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
> at org.apache.hadoop.hbase.test.IntegrationTestReplication.main(IntegrationTestReplication.java:424)
[jira] [Created] (HBASE-17381) ReplicationSourceWorkerThread can die due to unhandled exceptions
Gary Helmling created HBASE-17381: - Summary: ReplicationSourceWorkerThread can die due to unhandled exceptions Key: HBASE-17381 URL: https://issues.apache.org/jira/browse/HBASE-17381 Project: HBase Issue Type: Bug Reporter: Gary Helmling

If a ReplicationSourceWorkerThread encounters an unexpected exception in the run() method (for example, failure to allocate direct memory for the DFS client), the exception will be logged by the UncaughtExceptionHandler, but the thread will also die and the replication queue will back up indefinitely until the regionserver is restarted. We should make sure the worker thread is resilient to all exceptions that it can actually handle. For those that it really can't, it seems better to abort the regionserver rather than just allow replication to stop with minimal signal. Here is a sample exception:

{noformat}
ERROR regionserver.ReplicationSource: Unexpected exception in ReplicationSourceWorkerThread, currentPath=hdfs://.../hbase/WALs/XXXwalfilenameXXX
java.lang.OutOfMemoryError: Direct buffer memory
        at java.nio.Bits.reserveMemory(Bits.java:693)
        at java.nio.DirectByteBuffer.<init>(DirectByteBuffer.java:123)
        at java.nio.ByteBuffer.allocateDirect(ByteBuffer.java:311)
        at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:96)
        at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:113)
        at org.apache.hadoop.crypto.CryptoOutputStream.<init>(CryptoOutputStream.java:108)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.DataTransferSaslUtil.createStreamPair(DataTransferSaslUtil.java:344)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.doSaslHandshake(SaslDataTransferClient.java:490)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.getSaslStreams(SaslDataTransferClient.java:391)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.send(SaslDataTransferClient.java:263)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.checkTrustAndSend(SaslDataTransferClient.java:211)
        at org.apache.hadoop.hdfs.protocol.datatransfer.sasl.SaslDataTransferClient.peerSend(SaslDataTransferClient.java:160)
        at org.apache.hadoop.hdfs.net.TcpPeerServer.peerFromSocketAndKey(TcpPeerServer.java:92)
        at org.apache.hadoop.hdfs.DFSClient.newConnectedPeer(DFSClient.java:3444)
        at org.apache.hadoop.hdfs.BlockReaderFactory.nextTcpPeer(BlockReaderFactory.java:778)
        at org.apache.hadoop.hdfs.BlockReaderFactory.getRemoteBlockReaderFromTcp(BlockReaderFactory.java:695)
        at org.apache.hadoop.hdfs.BlockReaderFactory.build(BlockReaderFactory.java:356)
        at org.apache.hadoop.hdfs.DFSInputStream.blockSeekTo(DFSInputStream.java:673)
        at org.apache.hadoop.hdfs.DFSInputStream.readWithStrategy(DFSInputStream.java:882)
        at org.apache.hadoop.hdfs.DFSInputStream.read(DFSInputStream.java:934)
        at java.io.DataInputStream.read(DataInputStream.java:100)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:308)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:276)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:264)
        at org.apache.hadoop.hbase.wal.WALFactory.createReader(WALFactory.java:423)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationWALReaderManager.openReader(ReplicationWALReaderManager.java:70)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.openReader(ReplicationSource.java:830)
        at org.apache.hadoop.hbase.replication.regionserver.ReplicationSource$ReplicationSourceWorkerThread.run(ReplicationSource.java:572)
{noformat}
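The hardening suggested above can be sketched as a worker loop that catches everything: recoverable exceptions keep the queue draining, while fatal errors (like the OutOfMemoryError shown) escalate to an abort instead of silently killing the thread. This is an illustrative sketch, not the actual ReplicationSource API; the interface and class names are invented for the example.

```java
// Hypothetical sketch: a replication worker that never dies silently.
class ResilientWorker implements Runnable {
    interface Abortable {
        void abort(String reason, Throwable cause);
    }

    private final Runnable task;       // one iteration of replication work
    private final Abortable server;
    private volatile boolean running = true;

    ResilientWorker(Runnable task, Abortable server) {
        this.task = task;
        this.server = server;
    }

    void stop() { running = false; }

    @Override
    public void run() {
        while (running) {
            try {
                task.run();
            } catch (Exception e) {
                // Plausibly recoverable: log and retry rather than letting the
                // thread die and the queue back up indefinitely.
                continue;
            } catch (Error e) {
                // Unrecoverable (e.g. OutOfMemoryError): abort the regionserver
                // loudly rather than stopping replication with minimal signal.
                server.abort("Replication worker failed", e);
                return;
            }
        }
    }
}
```

The design choice mirrors the ticket: an uncaught-exception handler only observes the death after the fact, whereas handling Throwable inside the loop lets the worker decide between retry and abort.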
[jira] [Created] (HBASE-17827) Client tools relying on AuthUtil.getAuthChore() break credential cache login
Gary Helmling created HBASE-17827: - Summary: Client tools relying on AuthUtil.getAuthChore() break credential cache login Key: HBASE-17827 URL: https://issues.apache.org/jira/browse/HBASE-17827 Project: HBase Issue Type: Bug Components: canary, security Reporter: Gary Helmling Assignee: Gary Helmling

Client tools, such as Canary, which make use of keytab based logins with AuthUtil.getAuthChore() do not allow any way to continue without a keytab-based login when security is enabled. Currently, when security is enabled and the configuration lacks {{hbase.client.keytab.file}}, these tools would fail with:

{noformat}
ERROR hbase.AuthUtil: Error while trying to perform the initial login: Running in secure mode, but config doesn't have a keytab
java.io.IOException: Running in secure mode, but config doesn't have a keytab
        at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
        at org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
        at org.apache.hadoop.hbase.security.User.login(User.java:258)
        at org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
        at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
        at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
Exception in thread "main" java.io.IOException: Running in secure mode, but config doesn't have a keytab
        at org.apache.hadoop.security.SecurityUtil.login(SecurityUtil.java:239)
        at org.apache.hadoop.hbase.security.User$SecureHadoopUser.login(User.java:420)
        at org.apache.hadoop.hbase.security.User.login(User.java:258)
        at org.apache.hadoop.hbase.security.UserProvider.login(UserProvider.java:197)
        at org.apache.hadoop.hbase.AuthUtil.getAuthChore(AuthUtil.java:98)
        at org.apache.hadoop.hbase.tool.Canary.run(Canary.java:589)
        at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
        at org.apache.hadoop.hbase.tool.Canary.main(Canary.java:1327)
{noformat}

These tools should still work with the default credential-cache login, at least when a client keytab is not configured.
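The fallback described above boils down to a configuration check: only attempt a keytab login when {{hbase.client.keytab.file}} is actually set, and otherwise proceed with whatever login (e.g. a kinit ticket cache) the process already has. The configuration key is the real HBase key named in the ticket, but the helper class below is an invented sketch, not AuthUtil's actual API.

```java
import java.util.Map;

// Hypothetical sketch: decide which login path a client tool should take.
class ClientAuthSetup {
    static final String CLIENT_KEYTAB_KEY = "hbase.client.keytab.file";

    // Returns "keytab" or "credential-cache" to indicate the chosen login path.
    static String chooseLogin(Map<String, String> conf) {
        String keytab = conf.get(CLIENT_KEYTAB_KEY);
        if (keytab == null || keytab.isEmpty()) {
            // No client keytab configured: fall back to the existing
            // credential-cache login instead of failing outright.
            return "credential-cache";
        }
        return "keytab";
    }
}
```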
[jira] [Resolved] (HBASE-12579) Move obtainAuthTokenForJob() methods out of User
[ https://issues.apache.org/jira/browse/HBASE-12579?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Gary Helmling resolved HBASE-12579. --- Resolution: Duplicate

The methods were deprecated and the existing usage was removed as part of HBASE-12493. I guess I left this open for the final removal of the deprecated methods in the next major release. The removal was done as part of HBASE-14208.

> Move obtainAuthTokenForJob() methods out of User
> Key: HBASE-12579
> URL: https://issues.apache.org/jira/browse/HBASE-12579
> Project: HBase
> Issue Type: Improvement
> Components: security
> Reporter: Gary Helmling
>
> The {{User}} class currently contains some utility methods to obtain HBase authentication tokens for the given user. However, these methods initiate an RPC to the {{TokenProvider}} coprocessor endpoint, an action which should not be part of the User class' responsibilities.
> This leads to a couple of problems:
> # The way the methods are currently structured, it is impossible to integrate them with normal connection management for the cluster (the TokenUtil class constructs its own HTable instance internally).
> # The User class is logically part of the hbase-common module, but uses the TokenUtil class (part of hbase-server, though it should probably be moved to hbase-client) through reflection, leading to a hidden dependency.
> The {{obtainAuthTokenForJob()}} methods should be deprecated and the process of obtaining authentication tokens should be moved to use the normal connection lifecycle.
[jira] [Created] (HBASE-17884) Backport HBASE-16217 to branch-1
Gary Helmling created HBASE-17884: - Summary: Backport HBASE-16217 to branch-1 Key: HBASE-17884 URL: https://issues.apache.org/jira/browse/HBASE-17884 Project: HBase Issue Type: Sub-task Reporter: Gary Helmling

The change to add the calling user to ObserverContext in HBASE-16217 should also be applied to branch-1 to avoid use of UserGroupInformation.doAs() for access control checks.
[jira] [Created] (HBASE-18072) Malformed Cell from client causes Regionserver abort on flush
Gary Helmling created HBASE-18072: - Summary: Malformed Cell from client causes Regionserver abort on flush Key: HBASE-18072 URL: https://issues.apache.org/jira/browse/HBASE-18072 Project: HBase Issue Type: Bug Components: regionserver, rpc Affects Versions: 1.3.0 Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical

When a client writes a mutation with a Cell with a corrupted value length field, it is possible for the corrupt cell to trigger an exception on memstore flush, which will trigger regionserver aborts until the region is manually recovered. This boils down to a lack of validation on the client submitted byte[] backing the cell. Consider the following sequence:

1. Client creates a new Put with a cell with value of byte[16]
2. When the backing KeyValue for the Put is created, we serialize 16 for the value length field in the backing array
3. Client calls Table.put()
4. RpcClientImpl calls KeyValueEncoder.encode() to serialize the Cell to the OutputStream
5. Memory corruption in the backing array changes the serialized contents of the value length field from 16 to 48
6. Regionserver handling the put uses KeyValueDecoder.decode() to create a KeyValue with the byte[] read directly off the InputStream. The overall length of the array is correct, but the integer value serialized at the value length offset has been corrupted from the original value of 16 to 48.
7. The corrupt KeyValue is appended to the WAL and added to the memstore
8. After some time, the memstore flushes. As HFileWriter is writing out the corrupted cell, it reads the serialized int from the value length position in the cell's byte[] to determine the number of bytes to write for the value. Because value offset + 48 is greater than the length of the cell's byte[], we hit an IndexOutOfBoundsException:
{noformat}
java.lang.IndexOutOfBoundsException
        at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:151)
        at java.io.DataOutputStream.write(DataOutputStream.java:107)
        at org.apache.hadoop.hbase.io.hfile.NoOpDataBlockEncoder.encode(NoOpDataBlockEncoder.java:56)
        at org.apache.hadoop.hbase.io.hfile.HFileBlock$Writer.write(HFileBlock.java:954)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV2.append(HFileWriterV2.java:284)
        at org.apache.hadoop.hbase.io.hfile.HFileWriterV3.append(HFileWriterV3.java:87)
        at org.apache.hadoop.hbase.regionserver.StoreFile$Writer.append(StoreFile.java:1041)
        at org.apache.hadoop.hbase.regionserver.StoreFlusher.performFlush(StoreFlusher.java:138)
        at org.apache.hadoop.hbase.regionserver.DefaultStoreFlusher.flushSnapshot(DefaultStoreFlusher.java:75)
        at org.apache.hadoop.hbase.regionserver.HStore.flushCache(HStore.java:937)
        at org.apache.hadoop.hbase.regionserver.HStore$StoreFlusherImpl.flushCache(HStore.java:2413)
        at org.apache.hadoop.hbase.regionserver.HRegion.internalFlushCacheAndCommit(HRegion.java:2456)
{noformat}
9. Regionserver aborts due to the failed flush
10. The regionserver WAL is split into recovered.edits files, one of these containing the same corrupted cell
11. A new regionserver is assigned the region with the corrupted write
12. The new regionserver replays the recovered.edits entries into memstore and then tries to flush the memstore to an HFile
13. The flush triggers the same IndexOutOfBoundsException, causing us to go back to step #8 and loop on repeat until manual intervention is taken

The corrupted cell basically becomes a poison pill that aborts regionservers one at a time as the region with the problem edit is passed around.
This also means that a malicious client could easily construct requests allowing a denial of service attack against regionservers hosting any tables that the client has write access to. At bare minimum, I think we need to do a sanity check on all the lengths for Cells read off the CellScanner for incoming requests. This would allow us to reject corrupt cells before we append them to the WAL and acknowledge the request; once a corrupt cell has been appended and acknowledged, we are in a position where we cannot recover. This would only detect corruption of the length fields, which is what puts us in a bad state. Whether or not Cells should carry some checksum generated at the time the Cell is created, which could then be validated on the server side, is a separate question. That would allow detection of corruption in other parts of the backing cell byte[], such as within the key fields or the value field. But the compute overhead of this may be too heavyweight to be practical.
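The minimum sanity check proposed above is a pure bounds check on the decoded length fields against the backing array. This is an illustrative sketch of that check, not the actual CellScanner/KeyValue validation code; the class and method names are invented for the example.

```java
// Hypothetical sketch: verify that a cell's decoded value length stays within
// the bounds of the client-supplied backing byte[] before accepting the cell.
class CellSanityChecker {
    // In the KeyValue layout sketched here, the value occupies
    // [valueOffset, valueOffset + valueLength) within the backing array.
    static boolean isValueInBounds(byte[] backing, int valueOffset, int valueLength) {
        if (valueOffset < 0 || valueLength < 0) {
            return false;
        }
        // Use long arithmetic so a corrupted length cannot overflow the check itself.
        return (long) valueOffset + (long) valueLength <= backing.length;
    }
}
```

Run against the scenario in the ticket (a 32-byte region of the backing array whose value length was corrupted from 16 to 48), the check rejects the cell before it reaches the WAL, which is the whole point: fail the RPC, not the flush.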
[jira] [Created] (HBASE-18141) Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call
Gary Helmling created HBASE-18141: - Summary: Regionserver fails to shutdown when abort triggered in RegionScannerImpl during RPC call Key: HBASE-18141 URL: https://issues.apache.org/jira/browse/HBASE-18141 Project: HBase Issue Type: Bug Components: regionserver, security Affects Versions: 1.3.1 Reporter: Gary Helmling Assignee: Gary Helmling Priority: Critical Fix For: 1.3.2

When an abort is triggered within the RPC call path by HRegion.RegionScannerImpl, the AccessController incorrectly applies the RPC caller's identity in the RegionServerObserver.preStopRegionServer() hook, denying the abort. This leaves the regionserver in a non-responsive state, where its regions are not reassigned and it returns exceptions for all requests. When an abort is triggered on the server side, we should not allow a coprocessor to reject the abort at all. Here is a sample stack trace:

{noformat}
17/05/25 06:10:29 FATAL regionserver.HRegionServer: RegionServer abort: loaded coprocessors are: [org.apache.hadoop.hbase.security.access.AccessController, org.apache.hadoop.hbase.security.token.TokenProvider]
17/05/25 06:10:29 WARN regionserver.HRegionServer: The region server did not stop
org.apache.hadoop.hbase.security.AccessDeniedException: Insufficient permissions for user 'rpcuser' (global, action=ADMIN)
        at org.apache.hadoop.hbase.security.access.AccessController.requireGlobalPermission(AccessController.java:548)
        at org.apache.hadoop.hbase.security.access.AccessController.requirePermission(AccessController.java:522)
        at org.apache.hadoop.hbase.security.access.AccessController.preStopRegionServer(AccessController.java:2501)
        at org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost$1.call(RegionServerCoprocessorHost.java:86)
        at org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.execShutdown(RegionServerCoprocessorHost.java:300)
        at org.apache.hadoop.hbase.regionserver.RegionServerCoprocessorHost.preStop(RegionServerCoprocessorHost.java:82)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.stop(HRegionServer.java:1905)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2118)
        at org.apache.hadoop.hbase.regionserver.HRegionServer.abort(HRegionServer.java:2125)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.abortRegionServer(HRegion.java:6326)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.handleFileNotFound(HRegion.java:6319)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.populateResult(HRegion.java:5941)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextInternal(HRegion.java:6084)
        at org.apache.hadoop.hbase.regionserver.HRegion$RegionScannerImpl.nextRaw(HRegion.java:5858)
        at org.apache.hadoop.hbase.regionserver.RSRpcServices.scan(RSRpcServices.java:2649)
        at org.apache.hadoop.hbase.protobuf.generated.ClientProtos$ClientService$2.callBlockingMethod(ClientProtos.java:34950)
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:2320)
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:123)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:188)
        at org.apache.hadoop.hbase.ipc.RpcExecutor$Handler.run(RpcExecutor.java:168)
{noformat}

I haven't yet evaluated which other release branches this might apply to. I have a patch currently in progress, which I will post as soon as I complete a test case.
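The fix direction stated in the ticket, that a server-initiated abort must not be vetoable by a coprocessor, can be sketched as a branch on who requested the stop. This is an invented illustration, not the RegionServerCoprocessorHost API: the names and the boolean flag are assumptions for the example.

```java
// Hypothetical sketch: only consult pre-stop hooks for externally requested
// stops, never for an abort the server decided on itself.
class StopPolicy {
    interface PreStopHook {
        // Throws SecurityException to veto the stop.
        void preStop(String user);
    }

    // Returns true if the server should proceed with shutdown.
    static boolean shouldStop(boolean serverInitiatedAbort, String rpcUser,
                              PreStopHook hook) {
        if (serverInitiatedAbort) {
            // An abort decided by the server itself must not be rejected based on
            // the identity of whichever RPC caller happened to be on the stack.
            return true;
        }
        try {
            hook.preStop(rpcUser);
            return true;
        } catch (SecurityException denied) {
            return false;
        }
    }
}
```

In the failure mode above, the abort ran on an RPC handler thread, so the ACL check saw 'rpcuser' instead of the server's own identity; gating the hook on the abort's origin sidesteps that entirely.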
[jira] [Created] (HBASE-19332) DumpReplicationQueues misreports total WAL size
Gary Helmling created HBASE-19332: - Summary: DumpReplicationQueues misreports total WAL size Key: HBASE-19332 URL: https://issues.apache.org/jira/browse/HBASE-19332 Project: HBase Issue Type: Bug Components: Replication Reporter: Gary Helmling Assignee: Gary Helmling Priority: Trivial

DumpReplicationQueues uses an int to collect the total WAL size for a queue. Predictably, this overflows much of the time. Let's use a long instead.
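The overflow is easy to demonstrate in isolation: summing per-WAL sizes into an int wraps once the total passes Integer.MAX_VALUE (~2 GB), while a long accumulator reports the true total. The class below is a minimal standalone illustration, not the DumpReplicationQueues code itself.

```java
// Minimal illustration of the int-overflow bug and its fix.
class WalSizeTotals {
    static int totalAsInt(long[] walSizes) {
        int total = 0;
        for (long size : walSizes) {
            total += size;  // compound assignment silently narrows; wraps past 2^31 - 1
        }
        return total;
    }

    static long totalAsLong(long[] walSizes) {
        long total = 0;
        for (long size : walSizes) {
            total += size;  // 64-bit accumulator: correct for realistic WAL totals
        }
        return total;
    }
}
```

Two 1.5 GB WALs are enough to trigger it: the int total wraps negative while the long total is the expected 3 GB.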