[jira] [Created] (HBASE-28624) Docs around configuring backups can lead to unexpectedly disabling other features
Bryan Beaudreault created HBASE-28624:
-----------------------------------------

             Summary: Docs around configuring backups can lead to unexpectedly disabling other features
                 Key: HBASE-28624
                 URL: https://issues.apache.org/jira/browse/HBASE-28624
             Project: HBase
          Issue Type: Bug
            Reporter: Bryan Beaudreault

In our documentation for enabling backups, we suggest that the user set the following:

{code:java}
hbase.master.logcleaner.plugins
  org.apache.hadoop.hbase.backup.master.BackupLogCleaner,...
hbase.master.hfilecleaner.plugins
  org.apache.hadoop.hbase.backup.BackupHFileCleaner,...
{code}

A naive user will set these and not know what to do about the ",..." part. In doing so, they will unexpectedly disable all of the default cleaners we have. For example, here are the defaults:

{code:java}
hbase.master.logcleaner.plugins
  org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveProcedureWALCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveMasterLocalStoreWALCleaner
hbase.master.hfilecleaner.plugins
  org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveMasterLocalStoreHFileCleaner
{code}

Dropping those defaults effectively disables support for hbase.master.logcleaner.ttl and hbase.master.hfilecleaner.ttl.

There exist methods BackupManager.decorateMasterConfiguration and BackupManager.decorateRegionServerConfiguration. They are currently javadoc'd as being for tests only, but I think we should call these in HMaster and HRegionServer. Then we need only require the user to set "hbase.backup.enable", which would very much simplify our docs here.

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
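The proposed decorate approach boils down to appending the backup cleaner class to whatever plugin list is already configured, rather than replacing the list. A minimal standalone sketch of that merge logic (class and method names here are hypothetical, not the actual BackupManager code):

```java
// Hypothetical sketch of decorate-style config merging: append the backup
// cleaner to the existing plugin list instead of replacing the defaults.
public class CleanerPluginMerge {
    static String appendPlugin(String currentPlugins, String pluginClass) {
        if (currentPlugins == null || currentPlugins.isEmpty()) {
            return pluginClass;
        }
        // Skip the append if the plugin is already configured.
        for (String p : currentPlugins.split(",")) {
            if (p.trim().equals(pluginClass)) {
                return currentPlugins;
            }
        }
        return currentPlugins + "," + pluginClass;
    }

    public static void main(String[] args) {
        String defaults = "org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner";
        // Prints the defaults with BackupLogCleaner appended, not replaced.
        System.out.println(appendPlugin(defaults,
            "org.apache.hadoop.hbase.backup.master.BackupLogCleaner"));
    }
}
```

Calling something like this at startup when "hbase.backup.enable" is true would keep the TTL cleaners intact while still wiring in the backup cleaners.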
[jira] [Resolved] (HBASE-28228) Release 2.6.0
[ https://issues.apache.org/jira/browse/HBASE-28228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28228.
---------------------------------------
    Resolution: Done

2.6.0 has been released

> Release 2.6.0
> -------------
>
>                 Key: HBASE-28228
>                 URL: https://issues.apache.org/jira/browse/HBASE-28228
>             Project: HBase
>          Issue Type: Umbrella
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
[jira] [Resolved] (HBASE-28603) Finish 2.6.0 release
[ https://issues.apache.org/jira/browse/HBASE-28603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28603.
---------------------------------------
    Resolution: Done

> Finish 2.6.0 release
> --------------------
>
>                 Key: HBASE-28603
>                 URL: https://issues.apache.org/jira/browse/HBASE-28603
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Priority: Major
>
> # Release the artifacts on repository.apache.org
> # Move the binaries from dist-dev to dist-release
> # Add xml to download page (via HBASE-28236)
> # Push tag 2.6.0RC4 as tag rel/2.6.0
> # Release 2.6.0 on JIRA [https://issues.apache.org/jira/projects/HBASE/versions/12353291]
> # Add release data on [https://reporter.apache.org/addrelease.html?hbase]
> # Send announcement email
[jira] [Resolved] (HBASE-28236) Add 2.6.0 to downloads page
[ https://issues.apache.org/jira/browse/HBASE-28236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28236.
---------------------------------------
    Resolution: Fixed

> Add 2.6.0 to downloads page
> ---------------------------
>
>                 Key: HBASE-28236
>                 URL: https://issues.apache.org/jira/browse/HBASE-28236
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Resolved] (HBASE-28232) Add release manager for 2.6 in ref guide
[ https://issues.apache.org/jira/browse/HBASE-28232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28232.
---------------------------------------
    Resolution: Fixed

> Add release manager for 2.6 in ref guide
> ----------------------------------------
>
>                 Key: HBASE-28232
>                 URL: https://issues.apache.org/jira/browse/HBASE-28232
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>              Labels: pull-request-available
>
[jira] [Created] (HBASE-28603) Finish 2.6.0 release
Bryan Beaudreault created HBASE-28603:
-----------------------------------------

             Summary: Finish 2.6.0 release
                 Key: HBASE-28603
                 URL: https://issues.apache.org/jira/browse/HBASE-28603
             Project: HBase
          Issue Type: Sub-task
            Reporter: Bryan Beaudreault

# Release the artifacts on repository.apache.org
# Move the binaries from dist-dev to dist-release
# Add xml to download page
# Push tag 2.6.0RC4 as tag rel/2.6.0
# Release 2.6.0 on JIRA [https://issues.apache.org/jira/projects/HBASE/versions/12353291]
# Add release data on [https://reporter.apache.org/addrelease.html?hbase]
# Send announcement email
[jira] [Resolved] (HBASE-28237) Set version to 2.6.1-SNAPSHOT for branch-2.6
[ https://issues.apache.org/jira/browse/HBASE-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28237.
---------------------------------------
    Resolution: Done

This is handled by automation so probably didn't need to be a jira

> Set version to 2.6.1-SNAPSHOT for branch-2.6
> --------------------------------------------
>
>                 Key: HBASE-28237
>                 URL: https://issues.apache.org/jira/browse/HBASE-28237
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
[jira] [Resolved] (HBASE-28233) Run ITBLL for branch-2.6
[ https://issues.apache.org/jira/browse/HBASE-28233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28233.
---------------------------------------
    Resolution: Done

> Run ITBLL for branch-2.6
> ------------------------
>
>                 Key: HBASE-28233
>                 URL: https://issues.apache.org/jira/browse/HBASE-28233
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
[jira] [Resolved] (HBASE-28235) Put up 2.6.0RC0
[ https://issues.apache.org/jira/browse/HBASE-28235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28235.
---------------------------------------
    Resolution: Done

Ended up going to RC4, which has now passed

> Put up 2.6.0RC0
> ---------------
>
>                 Key: HBASE-28235
>                 URL: https://issues.apache.org/jira/browse/HBASE-28235
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
[jira] [Resolved] (HBASE-28234) Set version as 2.6.0 in branch-2.6 in prep for first RC
[ https://issues.apache.org/jira/browse/HBASE-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28234.
---------------------------------------
    Resolution: Done

> Set version as 2.6.0 in branch-2.6 in prep for first RC
> -------------------------------------------------------
>
>                 Key: HBASE-28234
>                 URL: https://issues.apache.org/jira/browse/HBASE-28234
>             Project: HBase
>          Issue Type: Sub-task
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>
[jira] [Resolved] (HBASE-26625) ExportSnapshot tool failed to copy data files for tables with merge region
[ https://issues.apache.org/jira/browse/HBASE-26625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-26625.
---------------------------------------
    Resolution: Fixed

I've merged the backport to branch-2.5 and added the next unreleased 2.5.x version to fixVersions

> ExportSnapshot tool failed to copy data files for tables with merge region
> --------------------------------------------------------------------------
>
>                 Key: HBASE-26625
>                 URL: https://issues.apache.org/jira/browse/HBASE-26625
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Yi Mei
>            Assignee: Yi Mei
>            Priority: Minor
>              Labels: pull-request-available
>             Fix For: 2.6.0, 2.5.9, 2.4.10, 3.0.0-alpha-3
>
> When exporting a snapshot for a table with merge regions, we found the following exceptions:
> {code:java}
> 2021-12-24 17:14:41,563 INFO  [main] snapshot.ExportSnapshot: Finalize the Snapshot Export
> 2021-12-24 17:14:41,589 INFO  [main] snapshot.ExportSnapshot: Verify snapshot integrity
> 2021-12-24 17:14:41,683 ERROR [main] snapshot.ExportSnapshot: Snapshot export failed
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Missing parent hfile for: 043a9fe8aa7c469d8324956a57849db5.8e935527eb39a2cf9bf0f596754b5853 path=A/a=t42=8e935527eb39a2cf9bf0f596754b5853-043a9fe8aa7c469d8324956a57849db5
>   at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:232)
>   at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:195)
>   at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:172)
>   at org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:156)
>   at org.apache.hadoop.hbase.snapshot.ExportSnapshot.verifySnapshot(ExportSnapshot.java:851)
>   at org.apache.hadoop.hbase.snapshot.ExportSnapshot.doWork(ExportSnapshot.java:1096)
>   at org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:154)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>   at org.apache.hadoop.hbase.util.AbstractHBaseTool.doStaticMain(AbstractHBaseTool.java:280)
>   at org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1144)
> {code}
[jira] [Resolved] (HBASE-28482) Reverse scan with tags throws ArrayIndexOutOfBoundsException with DBE
[ https://issues.apache.org/jira/browse/HBASE-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28482.
---------------------------------------
    Fix Version/s: 2.6.0
                   2.4.18
                   3.0.0-beta-2
                   2.5.9
       Resolution: Fixed

Pushed to all active branches. Thanks for the follow-up fix here [~vineet.4008]!

> Reverse scan with tags throws ArrayIndexOutOfBoundsException with DBE
> ---------------------------------------------------------------------
>
>                 Key: HBASE-28482
>                 URL: https://issues.apache.org/jira/browse/HBASE-28482
>             Project: HBase
>          Issue Type: Bug
>          Components: HFile
>            Reporter: Vineet Kumar Maheshwari
>            Assignee: Vineet Kumar Maheshwari
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.0, 2.4.18, 3.0.0-beta-2, 2.5.9
>
> Facing ArrayIndexOutOfBoundsException when performing reverse scan on a table with 30K+ records in single hfile.
> Exception is happening when block changes during seekBefore call.
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>   at org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray(ByteBufferUtils.java:1326)
>   at org.apache.hadoop.hbase.nio.SingleByteBuff.get(SingleByteBuff.java:213)
>   at org.apache.hadoop.hbase.io.encoding.DiffKeyDeltaEncoder$DiffSeekerStateBufferedEncodedSeeker.decode(DiffKeyDeltaEncoder.java:431)
>   at org.apache.hadoop.hbase.io.encoding.DiffKeyDeltaEncoder$DiffSeekerStateBufferedEncodedSeeker.decodeNext(DiffKeyDeltaEncoder.java:502)
>   at org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.seekToKeyInBlock(BufferedDataBlockEncoder.java:1012)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.loadBlockAndSeekToKey(HFileReaderImpl.java:1605)
>   at org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekBefore(HFileReaderImpl.java:719)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekBeforeAndSaveKeyToPreviousRow(StoreFileScanner.java:645)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRowWithoutHint(StoreFileScanner.java:570)
>   at org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRow(StoreFileScanner.java:506)
>   at org.apache.hadoop.hbase.regionserver.ReversedKeyValueHeap.next(ReversedKeyValueHeap.java:126)
>   at org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:693)
>   at org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:151){code}
>
> Steps to reproduce:
> Create a table with DataBlockEncoding.DIFF and block size as 1024, write some 30K+ puts with setTTL, then do a reverse scan.
> {code:java}
> @Test
> public void testReverseScanWithDBEWhenCurrentBlockUpdates() throws IOException {
>   byte[] family = Bytes.toBytes("0");
>   Configuration conf = new Configuration(TEST_UTIL.getConfiguration());
>   conf.setInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER, 1);
>   try (Connection connection = ConnectionFactory.createConnection(conf)) {
>     testReverseScanWithDBE(connection, DataBlockEncoding.DIFF, family, 1024, 3);
>     for (DataBlockEncoding encoding : DataBlockEncoding.values()) {
>       testReverseScanWithDBE(connection, encoding, family, 1024, 3);
>     }
>   }
> }
>
> private void testReverseScanWithDBE(Connection conn, DataBlockEncoding encoding, byte[] family, int blockSize, int maxRows)
>     throws IOException {
>   LOG.info("Running test with DBE={}", encoding);
>   TableName tableName = TableName.valueOf(TEST_NAME.getMethodName() + "-" + encoding);
>   TEST_UTIL.createTable(TableDescriptorBuilder.newBuilder(tableName)
>     .setColumnFamily(
>       ColumnFamilyDescriptorBuilder.newBuilder(family).setDataBlockEncoding(encoding).setBlocksize(blockSize).build())
>     .build(), null);
>   Table table = conn.getTable(tableName);
>   byte[] val1 = new byte[10];
>   byte[] val2 = new byte[10];
>   Bytes.random(val1);
>   Bytes.random(val2);
>   for (int i = 0; i < maxRows; i++) {
>     table.put(new Put(Bytes.toBytes(i)).addColumn(family, Bytes.toBytes(1), val1)
>       .addColumn(family, Bytes.toBytes(2), val2).setTTL(600_000));
>   }
>   TEST_UTIL.flush(table.getName());
>   Scan scan = new Scan();
>   scan.setReversed(true);
>   try (ResultScanner scanner = table.getScanner(scan)) {
>     for (int i = maxRows - 1; i >= 0; i--) {
>       Result row = scanner.next();
>       assertEquals(2, row.size());
>       Cell cell1 = row.getColumnLatestCell(family, Bytes.toBytes(1));
>       assertTrue(CellUtil.matchingRows(cell1, Bytes.toBytes(i)));
>       assertTrue(CellUtil.matchingValue(cell1, val1));
>       Cell cell2 = row.getColumnLatestCell(family, Bytes.toBytes(2));
>       assertTrue(CellUtil.matchingRows(cell2, Bytes.toBytes(i)));
>       assertTrue(CellUtil.matchingValue(cell2, val2));
>     }
>   }
> }
> {code}
>
> HBASE-27580
[jira] [Resolved] (HBASE-28255) Correcting spelling errors or annotations with non-standard spelling
[ https://issues.apache.org/jira/browse/HBASE-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28255.
---------------------------------------
    Fix Version/s: 2.6.0
                   3.0.0-beta-2
                   2.5.8
       Resolution: Fixed

Looks like this was never resolved. I added what I think are the correct fixVersions

> Correcting spelling errors or annotations with non-standard spelling
> --------------------------------------------------------------------
>
>                 Key: HBASE-28255
>                 URL: https://issues.apache.org/jira/browse/HBASE-28255
>             Project: HBase
>          Issue Type: Improvement
>            Reporter: mazhengxuan
>            Priority: Minor
>              Labels: documentation
>             Fix For: 2.6.0, 3.0.0-beta-2, 2.5.8
>
> Modify some spelling errors or non-standard spelling comments pointed out by Typo
[jira] [Created] (HBASE-28538) BackupHFileCleaner.loadHFileRefs is very expensive
Bryan Beaudreault created HBASE-28538:
-----------------------------------------

             Summary: BackupHFileCleaner.loadHFileRefs is very expensive
                 Key: HBASE-28538
                 URL: https://issues.apache.org/jira/browse/HBASE-28538
             Project: HBase
          Issue Type: Bug
          Components: backuprestore
            Reporter: Bryan Beaudreault

I noticed some odd CPU spikes on the hmasters of one of our clusters. Turns out it had been getting lots of bulk loads (30k) and processing them was expensive. The method scans hbase and then parses the paths. Surprisingly, the parsing is more expensive than reading hbase, with the vast majority of time spent in org/apache/hadoop/fs/Path.. We should see if this is possible to optimize. Attaching profile.
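If the hotspot really is Path object construction per bulk-loaded file, one possible direction (purely illustrative, not the actual BackupHFileCleaner code) is to extract the hfile name with plain string handling instead of allocating a Path per entry:

```java
// Hypothetical sketch: pull the hfile name out of a path string without
// allocating an org.apache.hadoop.fs.Path per entry.
public class HFileRefParse {
    static String hfileName(String fullPath) {
        int slash = fullPath.lastIndexOf('/');
        return slash < 0 ? fullPath : fullPath.substring(slash + 1);
    }

    public static void main(String[] args) {
        System.out.println(hfileName("/hbase/archive/data/ns/table/region/cf/abc123"));
    }
}
```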
[jira] [Resolved] (HBASE-28183) It's impossible to re-enable the quota table if it gets disabled
[ https://issues.apache.org/jira/browse/HBASE-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28183.
---------------------------------------
    Fix Version/s: 2.6.0
                   3.0.0-beta-2
                   2.5.9
       Resolution: Fixed

Pushed to branch-2.5+. Thanks for the contribution [~chandrasekhar.k]!

> It's impossible to re-enable the quota table if it gets disabled
> ----------------------------------------------------------------
>
>                 Key: HBASE-28183
>                 URL: https://issues.apache.org/jira/browse/HBASE-28183
>             Project: HBase
>          Issue Type: Bug
>            Reporter: Bryan Beaudreault
>            Assignee: Chandra Sekhar K
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.0, 3.0.0-beta-2, 2.5.9
>
> HMaster.enableTable tries to read the quota table. If you disable the quota table, this fails. So then it's impossible to re-enable it. The only solution I can find is to delete the table at this point, so that it gets recreated at startup, but this results in losing any quotas you had defined. We should fix enableTable to not check quotas if the table in question is hbase:quota.
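The fix described in the issue is essentially a guard in enableTable. A trivial sketch of the intended check (hypothetical helper name, not the actual HMaster code):

```java
// Hypothetical sketch: skip the quota lookup when the table being enabled
// is the quota table itself, so a disabled hbase:quota can be re-enabled.
public class EnableTableGuard {
    static boolean shouldCheckQuota(String tableName) {
        return !"hbase:quota".equals(tableName);
    }

    public static void main(String[] args) {
        System.out.println(shouldCheckQuota("hbase:quota"));
    }
}
```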
[jira] [Resolved] (HBASE-28483) Merge of incremental backups fails on bulkloaded Hfiles
[ https://issues.apache.org/jira/browse/HBASE-28483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28483.
---------------------------------------
    Fix Version/s: 2.6.0
                   3.0.0-beta-2
       Resolution: Fixed

Pushed to branch-2.6+. Thanks for the report and fix [~thomas.sarens]!

> Merge of incremental backups fails on bulkloaded Hfiles
> -------------------------------------------------------
>
>                 Key: HBASE-28483
>                 URL: https://issues.apache.org/jira/browse/HBASE-28483
>             Project: HBase
>          Issue Type: Bug
>          Components: backuprestore
>    Affects Versions: 2.6.0, 4.0.0-alpha-1
>            Reporter: thomassarens
>            Assignee: thomassarens
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.0, 3.0.0-beta-2
>
>         Attachments: TestIncrementalBackupMergeWithBulkLoad.java
>
> The merge of incremental backups fails in case one of the backups contains a bulk loaded HFile and the other backups don't. See test in attachments based on
> {code:java}
> org/apache/hadoop/hbase/backup/TestBackupRestoreWithModifications.java{code}
> that reproduces the exception when useBulkLoad is set to true [^TestIncrementalBackupMergeWithBulkLoad.java].
> This exception occurs in the call to `HFileRecordReader#initialize` as it tries to read a directory path as an HFile. I'll see if I can create a patch on master to fix this.
>
> {code:java}
> 2024-04-04T14:55:15,462 INFO LocalJobRunner Map Task Executor #0 {} mapreduce.HFileInputFormat$HFileRecordReader(95): Initialize HFileRecordReader for hdfs://localhost:34093/user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
> 2024-04-04T14:55:15,482 WARN [Thread-1429 {}] mapred.LocalJobRunner$Job(590): job_local1854345815_0018
> java.lang.Exception: java.io.FileNotFoundException: Path is not a file: /user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:460)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
>   at org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
>   at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
>   at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
>   at java.base/java.security.AccessController.doPrivileged(Native Method)
>   at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
>   at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
>   at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
>
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) ~[hadoop-mapreduce-client-common-3.3.5.jar:?]
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552) ~[hadoop-mapreduce-client-common-3.3.5.jar:?]
> Caused by: java.io.FileNotFoundException: Path is not a file: /user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
>   at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
>   at org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
>   at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
>   at org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769)
>   at org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:460)
>   at org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
>   at
[jira] [Resolved] (HBASE-28460) Full backup restore fails for empty HFiles
[ https://issues.apache.org/jira/browse/HBASE-28460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-28460.
---------------------------------------
    Fix Version/s: 2.6.0
                   3.0.0-beta-2
         Assignee: Dieter De Paepe
       Resolution: Fixed

Thanks for the contribution [~dieterdp_ng]! Pushed to branch-2.6+

> Full backup restore fails for empty HFiles
> ------------------------------------------
>
>                 Key: HBASE-28460
>                 URL: https://issues.apache.org/jira/browse/HBASE-28460
>             Project: HBase
>          Issue Type: Bug
>          Components: backuprestore
>    Affects Versions: 2.6.0, 4.0.0-alpha-1
>            Reporter: Dieter De Paepe
>            Assignee: Dieter De Paepe
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 2.6.0, 3.0.0-beta-2
>
> A full backup restore fails if the backup contains an empty HFile, for example when all data has been deleted from a table and full compaction has run. There are several issues:
> * HFiles are read in `RestoreTool` to read the first/last key, but this fails for empty HFiles
> * In `RestoreTool`, table creation also incorrectly assumes the region contains keys
> * In `MapReduceRestoreJob`, the tool incorrectly assumes that a bulkload with no loaded entries is an error.
>
> Example stacktrace:
> {code:java}
> 24/03/21 18:38:09 ERROR org.apache.hadoop.hbase.backup.util.BackupUtils: java.util.NoSuchElementException: No value present
> java.util.NoSuchElementException: No value present
>   at java.base/java.util.Optional.get(Optional.java:143)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.generateBoundaryKeys(RestoreTool.java:440)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.checkAndCreateTable(RestoreTool.java:493)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:351)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.fullRestoreTable(RestoreTool.java:211)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restoreImages(RestoreTablesClient.java:151)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restore(RestoreTablesClient.java:229)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.execute(RestoreTablesClient.java:265)
>   at org.apache.hadoop.hbase.backup.impl.BackupAdminImpl.restore(BackupAdminImpl.java:518)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.parseAndRun(RestoreDriver.java:176)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.doWork(RestoreDriver.java:216)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.run(RestoreDriver.java:252)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.main(RestoreDriver.java:224)
> 24/03/21 18:38:09 ERROR org.apache.hadoop.hbase.backup.RestoreDriver: Error while running restore backup
> java.lang.IllegalStateException: Cannot restore hbase table
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:360)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.fullRestoreTable(RestoreTool.java:211)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restoreImages(RestoreTablesClient.java:151)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restore(RestoreTablesClient.java:229)
>   at org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.execute(RestoreTablesClient.java:265)
>   at org.apache.hadoop.hbase.backup.impl.BackupAdminImpl.restore(BackupAdminImpl.java:518)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.parseAndRun(RestoreDriver.java:176)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.doWork(RestoreDriver.java:216)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.run(RestoreDriver.java:252)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>   at org.apache.hadoop.hbase.backup.RestoreDriver.main(RestoreDriver.java:224)
> Caused by: java.util.NoSuchElementException: No value present
>   at java.base/java.util.Optional.get(Optional.java:143)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.generateBoundaryKeys(RestoreTool.java:440)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.checkAndCreateTable(RestoreTool.java:493)
>   at org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:351)
>   ... 10 more {code}
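The stacktrace shows Optional.get() being called on an empty value for an empty HFile. A hedged sketch of the fix direction for the boundary-key computation (illustrative names, not the actual RestoreTool code): skip files with no first key instead of unconditionally unwrapping.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch: empty HFiles contribute no boundary key rather than
// triggering NoSuchElementException via Optional.get().
public class BoundaryKeysSketch {
    static List<String> boundaryKeys(List<Optional<String>> firstKeys) {
        List<String> keys = new ArrayList<>();
        for (Optional<String> k : firstKeys) {
            k.ifPresent(keys::add); // empty HFile -> skipped
        }
        return keys;
    }

    public static void main(String[] args) {
        System.out.println(boundaryKeys(
            List.of(Optional.of("row1"), Optional.empty())));
    }
}
```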
[jira] [Resolved] (HBASE-27657) Connection and Request Attributes
[ https://issues.apache.org/jira/browse/HBASE-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault resolved HBASE-27657.
---------------------------------------
    Resolution: Fixed

Addendum committed to branch-2 and branch-2.6. The problem did not exist on master/branch-3.

> Connection and Request Attributes
> ---------------------------------
>
>                 Key: HBASE-27657
>                 URL: https://issues.apache.org/jira/browse/HBASE-27657
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>             Fix For: 2.6.0, 3.0.0-beta-1
>
> Currently we have the ability to set Operation attributes, via Get.setAttribute, etc. It would be useful to be able to set attributes at the request and connection level.
> These levels can result in less duplication. For example, send some attributes once per connection instead of for every one of the millions of requests a connection might send. Or send once for the request, instead of duplicating on every operation in a multi request.
> Additionally, the Connection and RequestHeader are more globally available on the server side. Both can be accessed via RpcServer.getCurrentCall(), which is useful in various integration points – coprocessors, custom queues, quotas, slow log, etc. Operation attributes are harder to access because you need to parse the raw Message into the appropriate type to get access to the getter.
> I was thinking adding two new methods to Connection interface:
> - setAttribute (and getAttribute/getAttributes)
> - setRequestAttributeProvider
> Any Connection attributes would be set onto the ConnectionHeader during initialization. The RequestAttributeProvider would be called when creating each RequestHeader.
> An alternative to setRequestAttributeProvider would be to add this into HBaseRpcController, which can already be customized via site configuration.
[jira] [Reopened] (HBASE-27657) Connection and Request Attributes
[ https://issues.apache.org/jira/browse/HBASE-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Bryan Beaudreault reopened HBASE-27657:
---------------------------------------
      Assignee: Bryan Beaudreault  (was: Ray Mattingly)

Reopening for addendum. We accidentally dropped the following method from ConnectionFactory:

{code:java}
ConnectionFactory.createConnection ( Configuration conf, ExecutorService pool, User user ) [static]  :  Connection
{code}

> Connection and Request Attributes
> ---------------------------------
>
>                 Key: HBASE-27657
>                 URL: https://issues.apache.org/jira/browse/HBASE-27657
>             Project: HBase
>          Issue Type: New Feature
>            Reporter: Bryan Beaudreault
>            Assignee: Bryan Beaudreault
>            Priority: Major
>             Fix For: 2.6.0, 3.0.0-beta-1
>
> Currently we have the ability to set Operation attributes, via Get.setAttribute, etc. It would be useful to be able to set attributes at the request and connection level.
> These levels can result in less duplication. For example, send some attributes once per connection instead of for every one of the millions of requests a connection might send. Or send once for the request, instead of duplicating on every operation in a multi request.
> Additionally, the Connection and RequestHeader are more globally available on the server side. Both can be accessed via RpcServer.getCurrentCall(), which is useful in various integration points – coprocessors, custom queues, quotas, slow log, etc. Operation attributes are harder to access because you need to parse the raw Message into the appropriate type to get access to the getter.
> I was thinking adding two new methods to Connection interface:
> - setAttribute (and getAttribute/getAttributes)
> - setRequestAttributeProvider
> Any Connection attributes would be set onto the ConnectionHeader during initialization. The RequestAttributeProvider would be called when creating each RequestHeader.
> An alternative to setRequestAttributeProvider would be to add this into HBaseRpcController, which can already be customized via site configuration.
[jira] [Created] (HBASE-28462) Incremental backup can fail if log gets archived while WALPlayer is starting up
Bryan Beaudreault created HBASE-28462: - Summary: Incremental backup can fail if log gets archived while WALPlayer is starting up Key: HBASE-28462 URL: https://issues.apache.org/jira/browse/HBASE-28462 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault We had incremental backup fail with FileNotFoundException for a file in the WALs directory. Upon investigation, the log had been archived a few mins earlier. WALInputFormat's record reader has support for falling back on an archived path: {code:java} } catch (IOException e) { Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf); // archivedLog can be null if unable to locate in archiveDir. if (archivedLog != null) { openReader(archivedLog); // Try call again in recursion return nextKeyValue(); } else { throw e; } } {code} But the getSplits method has different handling: {code:java} try { List files = getFiles(fs, inputPath, startTime, endTime); allFiles.addAll(files); } catch (FileNotFoundException e) { if (ignoreMissing) { LOG.warn("File " + inputPath + " is missing. Skipping it."); continue; } throw e; } {code} This ignoreMissing variable was added in HBASE-14141 and is enabled via wal.input.ignore.missing.files which is defaulted to false and never set. Looking at the comment and reviewboard history of HBASE-14141 I think there might have been some confusion about where to handle these missing files, and this got lost in the shuffle. I would prefer not to ignore missing hfiles. I think that could result in some weird behavior: * RegionServer has 10 archived and 30 not-yet-archived WALs needing to be backed up * The process starts, and while it's running 1 of those 30 WALs gets archived. That would get skipped due to FileNotFoundException * But the remaining 29 would be backed up This scenario could cause some data consistency issues if this incremental backup is restored. We missed some edits in the middle of applied edits from other WALs. 
So I do think failing as we do today is necessary for consistency, but it is unrealistic in a live cluster. The solution is to try finding the missing file in the archive directory. The backup feature has a coprocessor which will not allow an archived file to be cleaned up until it has been backed up, so I think it's safe to say that a WAL is definitely in either WALs or oldWALs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
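The proposed getSplits fix can be sketched in a self-contained way. This is an illustrative analog only: local java.nio paths stand in for the HDFS types, and the hypothetical resolveOrArchived helper plays the role that AbstractFSWALProvider.findArchivedLog plays in the record reader's fallback.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: mirror the record reader's archived-log fallback inside
// split calculation, instead of either skipping the file or failing outright.
class ArchivedWalFallback {
  // If the WAL is gone from the active dir, look for it under the archive dir
  // before giving up. Returns null only if it is in neither location.
  static Path resolveOrArchived(Path walDir, Path archiveDir, String logName) {
    Path active = walDir.resolve(logName);
    if (Files.exists(active)) {
      return active;
    }
    Path archived = archiveDir.resolve(logName);
    return Files.exists(archived) ? archived : null;
  }

  static List<Path> collectFiles(Path walDir, Path archiveDir, List<String> logNames)
      throws IOException {
    List<Path> all = new ArrayList<>();
    for (String name : logNames) {
      Path resolved = resolveOrArchived(walDir, archiveDir, name);
      if (resolved == null) {
        // In neither WALs nor oldWALs: fail for consistency, do not silently skip.
        throw new IOException("WAL " + name + " missing from both WALs and oldWALs");
      }
      all.add(resolved);
    }
    return all;
  }

  public static void main(String[] args) throws IOException {
    Path wals = Files.createTempDirectory("WALs");
    Path oldWals = Files.createTempDirectory("oldWALs");
    Files.createFile(wals.resolve("wal-1"));
    // wal-2 was "archived" while the job was starting up:
    Files.createFile(oldWals.resolve("wal-2"));
    List<Path> files = collectFiles(wals, oldWals, Arrays.asList("wal-1", "wal-2"));
    System.out.println(files.size()); // both WALs found, one via the archive dir
  }
}
```

A missing file still throws, but only after checking both locations, which matches the invariant above that an un-backed-up WAL cannot be cleaned from oldWALs.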
[jira] [Created] (HBASE-28459) HFileOutputFormat2 ClassCastException with s3 magic committer
Bryan Beaudreault created HBASE-28459: - Summary: HFileOutputFormat2 ClassCastException with s3 magic committer Key: HBASE-28459 URL: https://issues.apache.org/jira/browse/HBASE-28459 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault In hadoop3 there's the s3 magic committer, which can speed up s3 writes dramatically. In HFileOutputFormat2.createRecordWriter we cast the passed-in committer to a FileOutputCommitter. This causes a ClassCastException when the s3 magic committer is enabled: Error: java.lang.ClassCastException: class org.apache.hadoop.fs.s3a.commit.magic.MagicS3GuardCommitter cannot be cast to class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter We can cast to PathOutputCommitter instead, but it's only available in hadoop3+, so we will need to use reflection to work around this in branch-2. -- This message was sent by Atlassian Jira (v8.20.10#820010)
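A rough sketch of what the branch-2 reflection workaround could look like. The two Hadoop class names are real, but the wrapper class and helper method here are hypothetical; this is a pattern sketch, not the actual HBase change.

```java
import java.lang.reflect.Method;

// Illustrative sketch: resolve the committer's work path without a
// compile-time dependency on hadoop3's PathOutputCommitter, so the same
// code compiles against hadoop2 and hadoop3.
class CommitterWorkPath {
  static Object getWorkPath(Object committer) throws Exception {
    Class<?> committerClass;
    try {
      // hadoop3+: common superclass of FileOutputCommitter and the S3A committers
      committerClass =
          Class.forName("org.apache.hadoop.mapreduce.lib.output.PathOutputCommitter");
    } catch (ClassNotFoundException e) {
      // hadoop2: fall back to FileOutputCommitter (the old hard cast)
      committerClass =
          Class.forName("org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter");
    }
    if (!committerClass.isInstance(committer)) {
      throw new IllegalStateException("Unsupported committer: " + committer.getClass());
    }
    // Both classes expose getWorkPath(); invoking it reflectively avoids
    // referencing the hadoop3-only type at compile time.
    Method getWorkPath = committerClass.getMethod("getWorkPath");
    return getWorkPath.invoke(committer);
  }
}
```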
[jira] [Resolved] (HBASE-28412) Restoring incremental backups to mapped table requires existence of original table
[ https://issues.apache.org/jira/browse/HBASE-28412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28412. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed Pushed to branch-2.6+. Thanks [~rubenvw] for the contribution! I also added you and [~dieterdp_ng] as contributors to the project so that you can be assigned jiras. > Restoring incremental backups to mapped table requires existence of original > table > -- > > Key: HBASE-28412 > URL: https://issues.apache.org/jira/browse/HBASE-28412 > Project: HBase > Issue Type: Bug > Components: backuprestore >Reporter: Dieter De Paepe >Assignee: Ruben Van Wanzeele >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > It appears that restoring a non-existing table from an incremental backup > with the "-m" parameter results in an error in the restore client. > Reproduction steps: > Build & start hbase: > {code:java} > mvn clean install -Phadoop-3.0 -DskipTests > bin/start-hbase.sh{code} > In HBase shell: create table and some values: > {code:java} > create 'test', 'cf' > put 'test', 'row1', 'cf:a', 'value1' > put 'test', 'row2', 'cf:b', 'value2' > put 'test', 'row3', 'cf:c', 'value3' > scan 'test' {code} > Create a full backup: > {code:java} > bin/hbase backup create full file:/tmp/hbase-backup{code} > Adjust some data through HBase shell: > {code:java} > put 'test', 'row1', 'cf:a', 'value1-new' > scan 'test' {code} > Create an incremental backup: > {code:java} > bin/hbase backup create incremental file:/tmp/hbase-backup {code} > Delete the original table in HBase shell: > {code:java} > disable 'test' > drop 'test' {code} > Restore the incremental backup under a new table name: > {code:java} > bin/hbase backup history > bin/hbase restore file:/tmp/hbase-backup -t "test" -m > "test-restored" {code} > This results in the following output / error: > {code:java} > ... 
> 2024-03-25T13:38:53,062 WARN [main {}] util.NativeCodeLoader: Unable to load > native-hadoop library for your platform... using builtin-java classes where > applicable > 2024-03-25T13:38:53,174 INFO [main {}] Configuration.deprecation: > hbase.client.pause.cqtbe is deprecated. Instead, use > hbase.client.pause.server.overloaded > 2024-03-25T13:38:53,554 INFO [main {}] impl.RestoreTablesClient: HBase table > test-restored does not exist. It will be created during restore process > 2024-03-25T13:38:53,593 INFO [main {}] impl.RestoreTablesClient: Restoring > 'test' to 'test-restored' from full backup image > file:/tmp/hbase-backup/backup_1711370230143/default/test > 2024-03-25T13:38:53,707 INFO [main {}] util.BackupUtils: Creating target > table 'test-restored' > 2024-03-25T13:38:54,546 INFO [main {}] mapreduce.MapReduceRestoreJob: > Restore test into test-restored > 2024-03-25T13:38:54,646 INFO [main {}] mapreduce.HFileOutputFormat2: > bulkload locality sensitive enabled > 2024-03-25T13:38:54,647 INFO [main {}] mapreduce.HFileOutputFormat2: Looking > up current regions for table test-restored > 2024-03-25T13:38:54,669 INFO [main {}] mapreduce.HFileOutputFormat2: > Configuring 1 reduce partitions to match current region count for all tables > 2024-03-25T13:38:54,669 INFO [main {}] mapreduce.HFileOutputFormat2: Writing > partition information to > file:/tmp/hbase-tmp/partitions_0667b6e2-79ef-4cfe-97e1-abb204ee420d > 2024-03-25T13:38:54,687 INFO [main {}] compress.CodecPool: Got brand-new > compressor [.deflate] > 2024-03-25T13:38:54,713 INFO [main {}] mapreduce.HFileOutputFormat2: > Incremental output configured for tables: test-restored > 2024-03-25T13:38:54,715 WARN [main {}] mapreduce.TableMapReduceUtil: The > addDependencyJars(Configuration, Class...) method has been deprecated > since it is easy to use incorrectly. Most users should rely on > addDependencyJars(Job) instead. See HBASE-8386 for more details. 
> 2024-03-25T13:38:54,742 WARN [main {}] impl.MetricsConfig: Cannot locate > configuration: tried > hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties > 2024-03-25T13:38:54,834 INFO [main {}] input.FileInputFormat: Total input > files to process : 1 > 2024-03-25T13:38:54,853 INFO [main {}] mapreduce.JobSubmitter: number of > splits:1 > 2024-03-25T13:38:54,964 INFO [main {}] mapreduce.JobSubmitter: Submitting > tokens for job: job_local748155768_0001 > 2024-03-25T13:38:54,967 INFO [main {}] mapreduce.JobSubmitter: Executing > with tokens: [] > 2024-03-25T13:38:55,076 INFO [main {}] mapred.LocalDistributedCacheManager: > Creating symlink: > /tmp/hadoop-dieter/mapred/local/job_local748155768_0001_0768a243-06e8-4524-8a6d-016ddd75df52/libjars > <-
[jira] [Resolved] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles
[ https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28456. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed > HBase Restore restores old data if data for the same timestamp is in > different hfiles > - > > Key: HBASE-28456 > URL: https://issues.apache.org/jira/browse/HBASE-28456 > Project: HBase > Issue Type: Bug > Components: backuprestore >Affects Versions: 2.6.0, 3.0.0 >Reporter: Ruben Van Wanzeele >Assignee: Bryan Beaudreault >Priority: Blocker > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > Attachments: > Add_incremental_test_for_HBASE-28456_Fix_HBASE-28412_for_incremental_test.patch, > ChangesOnHFilesOnSameTimestampAreNotCorrectlyRestored.java > > > The restore brings back 'old' data when executing restore. > It feels like the hfile sequence id is not respected during the restore. > See testing code attached. The workaround solution is to trigger major > compaction before doing the backup (not really feasible for daily backups) > We didn't investigate this yet, but this might also impact the merge of > multiple incremental backups (since that follows a similar code path merging > hfiles). > This currently blocks our support for HBase backup and restore. > Willing to participate in a solution if necessary. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28449) Fix BackupSystemTable Scans
[ https://issues.apache.org/jira/browse/HBASE-28449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28449. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed Pushed to branch-2.6+. Thanks [~baugenreich]! > Fix BackupSystemTable Scans > > > Key: HBASE-28449 > URL: https://issues.apache.org/jira/browse/HBASE-28449 > Project: HBase > Issue Type: Bug >Reporter: Briana Augenreich >Assignee: Briana Augenreich >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > When calculating which WALs should be included in an incremental backup, the > backup system does a prefix scan for the last roll log timestamp. This uses > the backup root in the prefix (.) If you happen to have > multiple backup roots where one is a root of the other, you'll get inaccurate > results. > > Since the rowkey is let's modify > the prefix scan to be . -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28453) Support a middle ground between the Average and Fixed interval rate limiters
[ https://issues.apache.org/jira/browse/HBASE-28453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28453. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: FixedIntervalRateLimiter now supports a custom refill interval via hbase.quota.rate.limiter.refill.interval.ms. Users of quotas may wish to change hbase.quota.rate.limiter to FixedIntervalRateLimiter and customize this new setting. It will likely lead to healthier backoffs for clients and more full quota utilization. Resolution: Fixed Pushed to branch-2.6+. Thanks [~rmdmattingly] ! > Support a middle ground between the Average and Fixed interval rate limiters > > > Key: HBASE-28453 > URL: https://issues.apache.org/jira/browse/HBASE-28453 > Project: HBase > Issue Type: Improvement >Affects Versions: 2.6.0 >Reporter: Ray Mattingly >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > Attachments: Screenshot 2024-03-21 at 2.08.51 PM.png, Screenshot > 2024-03-21 at 2.30.01 PM.png > > > h3. Background > HBase quotas support two rate limiters: a "fixed" and an "average" interval > rate limiter. > h4. FixedIntervalRateLimiter > The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second, > and it refills a resource allotment on the recurring interval. So you may get > 10 resources every second, and if you exhaust all 10 resources in the first > millisecond of an interval then you will need to wait 999ms to acquire even 1 > more resource. > h4. AverageIntervalRateLimiter > The average interval rate limiter, HBase's default, allows for more flexibly > timed refilling of the resource allotment. Extending our previous example, > say you have a 10 reads/sec quota and you have exhausted all 10 resources > within 1ms of the last full refill. 
If you request 1 more read then, rather > than returning a 999ms wait interval indicating the next full refill time, > the rate limiter will recognize that you only need to wait 99ms before 1 read > can be available. After 100ms has passed in aggregate since the last full > refill, it will support the refilling of 1/10th the limit to facilitate the > request for 1/10th the resources. > h3. The Problems with Current RateLimiters > The problem with the fixed interval rate limiter is that it is too strict > from a latency perspective. It results in quota limits to which we cannot > fully subscribe with any consistency. > The problem with the average interval rate limiter is that, in practice, it > is far too optimistic. For example, a real rate limiter might limit to > 100MB/sec of read IO per machine. Any multigets that come in will require > only a tiny fraction of this limit; for example, a 64kb block is only 0.06% > of the total. As a result, the vast majority of wait intervals end up being > tiny — like <5ms. This can actually cause an inverse of your intention, where > setting up a throttle causes a DDOS of your RPC layer via continuous > throttling and ~immediate retrying. I've discussed this problem in > https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait > interval as the solution there; after some more thinking, I believe this new > rate limiter would be a less hacky solution to this deficit so I'd like to > close that Jira in favor of this one. > See the attached chart where I put in place a 10k req/sec/machine throttle > for this user at 10:43 to try to curb this high traffic, and it resulted in a > huge spike of req/sec due to the throttle/retry loop created by the > AverageIntervalRateLimiter. > h3. Original Proposal: PartialIntervalRateLimiter as a Solution > I've implemented a RateLimiter which allows for partial chunks of the overall > interval to be refilled, by default these chunks are 10% (or 100ms of a 1s > interval). 
I've deployed this to a test cluster at my day job and have seen > this really help our ability to fully subscribe to a quota limit without > executing superfluous retries. See the other attached chart which shows a > cluster undergoing a rolling restart from using FixedIntervalRateLimiter to > my new PartialIntervalRateLimiter and how it is then able to fully subscribe > to its allotted 25MB/sec/machine read IO quota. > h3. Updated Proposal: Improving FixedIntervalRateLimiter > Rather than implement a new rate limiter, we can make a lower-touch change > which just adds support for a refill interval that is less than the time unit > on a FixedIntervalRateLimiter. This can be a no-op change for those who have > not opted into the feature by having the refill interval default to the time > unit. For clarity, see [my branch
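The updated proposal can be modeled in a few lines. This is a standalone illustrative limiter, not HBase's FixedIntervalRateLimiter; a time parameter replaces the system clock so the behavior is easy to check. With a 10/sec limit and a 100ms refill interval, a request one millisecond after exhaustion waits 99ms rather than the full 999ms.

```java
// Illustrative model: a fixed-interval limiter whose refill interval can be
// shorter than the time unit, refilling a proportional chunk of the limit
// on each tick. All names are ours, not HBase's.
class ChunkedFixedRateLimiter {
  private final long limit;            // resources granted per timeUnitMs
  private final long timeUnitMs;       // e.g. 1000ms
  private final long refillIntervalMs; // e.g. 100ms => refill limit/10 per tick
  private long available;
  private long lastRefillMs;

  ChunkedFixedRateLimiter(long limit, long timeUnitMs, long refillIntervalMs) {
    this.limit = limit;
    this.timeUnitMs = timeUnitMs;
    this.refillIntervalMs = refillIntervalMs;
    this.available = limit;
    this.lastRefillMs = 0;
  }

  private void refill(long nowMs) {
    long ticks = (nowMs - lastRefillMs) / refillIntervalMs;
    if (ticks > 0) {
      long perTick = Math.max(1, limit * refillIntervalMs / timeUnitMs);
      available = Math.min(limit, available + ticks * perTick);
      lastRefillMs += ticks * refillIntervalMs;
    }
  }

  /** Returns 0 and consumes on success, else the ms until enough refills land. */
  synchronized long tryAcquire(long amount, long nowMs) {
    refill(nowMs);
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    long perTick = Math.max(1, limit * refillIntervalMs / timeUnitMs);
    long ticksNeeded = (amount - available + perTick - 1) / perTick;
    return ticksNeeded * refillIntervalMs - (nowMs - lastRefillMs);
  }
}
```

Setting refillIntervalMs equal to timeUnitMs reproduces the strict fixed-interval behavior, which is what makes this a no-op for users who don't opt in.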
[jira] [Created] (HBASE-28455) do-release-docker fails to setup gpg agent proxy if proxy container is slow to start
Bryan Beaudreault created HBASE-28455: - Summary: do-release-docker fails to setup gpg agent proxy if proxy container is slow to start Key: HBASE-28455 URL: https://issues.apache.org/jira/browse/HBASE-28455 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault In do-release-docker.sh we spin up the gpg-agent-proxy container and then immediately run ssh-keyscan and then immediately run ssh. Despite having {{set -e}}, both of these can fail without failing the script. This manifests as a really hard-to-debug failure in the hbase-rm container with "gpg: no gpg-agent running in this session". With some debugging I realized that the ssh tunnel had not been created. Looking at the logs, the gpg-agent-proxy.ssh-keyscan file is empty and the gpg-proxy.ssh.log shows a Connection refused error. You'd think these would fail the script, but they don't for different reasons: # ssh-keyscan output is piped through sort. Running ssh-keyscan directly returns an error code, but piping it through sort turns it into a success code. # ssh is executed in background with {{&}}, which similarly loses the error code I think we should add a step prior to ssh-keyscan which waits until port 6 is available. I'm not sure how to retain the error codes in the above 2 commands, but can try to look into that as well. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28338) Bounded leak of FSDataInputStream buffers from checksum switching
[ https://issues.apache.org/jira/browse/HBASE-28338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28338. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed > Bounded leak of FSDataInputStream buffers from checksum switching > - > > Key: HBASE-28338 > URL: https://issues.apache.org/jira/browse/HBASE-28338 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > In FSDataInputStreamWrapper, the unbuffer() method caches an unbuffer > instance the first time it is called. When an FSDataInputStreamWrapper is > initialized, it has hbase checksum disabled. > In HFileInfo.initTrailerAndContext we get the stream, read the trailer, then > call unbuffer. At this point, checksums have not been enabled yet via > prepareForBlockReader. So the call to unbuffer() caches the current > non-checksum stream as the unbuffer instance. > Later, in initMetaAndIndex we do a similar thing. This time, > prepareForBlockReader has been called, so we are now using hbase checksums. > When initMetaAndIndex calls unbuffer(), it uses the old unbuffer instance > which actually has been closed when we switched to hbase checksums. So that > call does nothing, and the new no-checksum input stream is never unbuffered. > I haven't seen this cause an issue with normal hdfs replication (though > haven't gone looking). It's very problematic for Erasure Coding because > DFSStripedInputStream holds a large buffer (numDataBlocks * cellSize, so 6mb > for RS-6-3-1024k) that is only used for stream reads NOT pread. The > FSDataInputStreamWrapper we are talking about here is only used for pread in > hbase, so those 6mb buffers just hang around totally unused but > unreclaimable. Since there is an input stream per StoreFile, this can add up > very quickly on big servers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
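The caching bug described above reduces to a small toy model, sketched here with our own names rather than HBase's: the wrapper captures whichever stream is active the first time unbuffer() is called, so after the checksum switch the live stream is never the one being unbuffered.

```java
// Minimal model of the FSDataInputStreamWrapper issue: unbuffer() caches the
// first stream it sees; prepareForBlockReader() then swaps in the
// hbase-checksum stream, but the cached reference still points at the old
// (closed) stream, so the new stream's buffer is never released.
class StreamWrapper {
  static class Stream {
    boolean buffered = true;        // stands in for the held read buffer
    void unbuffer() { buffered = false; }
  }

  Stream current = new Stream();    // starts as the no-checksum stream
  private Stream cachedUnbuffer;    // bug: captured once, never refreshed

  void unbuffer() {
    if (cachedUnbuffer == null) {
      cachedUnbuffer = current;     // caches whichever stream is active *now*
    }
    cachedUnbuffer.unbuffer();      // may be a no-op on an already-closed stream
  }

  void prepareForBlockReader() {
    current = new Stream();         // switch to the hbase-checksum stream
  }
}
```

The fix direction implied by the report is to unbuffer the currently active stream instead of a cached one.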
[jira] [Resolved] (HBASE-28385) Quota estimates are too optimistic for large scans
[ https://issues.apache.org/jira/browse/HBASE-28385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28385. --- Fix Version/s: 3.0.0-beta-2 Release Note: When hbase.quota.use.result.size.bytes is false, we will now estimate the amount of quota to grab for a scan based on the block bytes scanned of previous next() requests. This will increase throughput for large scans which might prefer to wait a little longer for a larger portion of the quota. Resolution: Fixed > Quota estimates are too optimistic for large scans > -- > > Key: HBASE-28385 > URL: https://issues.apache.org/jira/browse/HBASE-28385 > Project: HBase > Issue Type: Improvement >Reporter: Ray Mattingly >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > Let's say you're running a table scan with a throttle of 100MB/sec per > RegionServer. Ideally your scans are going to pull down large results, often > containing hundreds or thousands of blocks. > You will estimate each scan as costing a single block of read capacity, and > if your quota is already exhausted then the server will evaluate the backoff > required for your estimated consumption (1 block) to be available. This will > often be ~1ms, causing your retries to basically be immediate. > Obviously it will routinely take much longer than 1ms for 100MB of IO to > become available in the given configuration, so your retries will be destined > to fail. At worst this can cause a saturation of your server's RPC layer, and > at best this causes erroneous exhaustion of the client's retries. > We should find a way to make these estimates a bit smarter for large scans. -- This message was sent by Atlassian Jira (v8.20.10#820010)
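The estimate described in the release note can be sketched as a running mean of block bytes scanned per next() call. This is a simplified illustration with made-up names, not HBase's actual implementation.

```java
// Illustrative sketch: size the quota grab for the next scan next() call from
// the block bytes scanned by previous next() calls, instead of assuming a
// single block every time.
class ScanQuotaEstimator {
  private long totalBlockBytesScanned;
  private long nextCalls;

  void recordNextCall(long blockBytesScanned) {
    totalBlockBytesScanned += blockBytesScanned;
    nextCalls++;
  }

  /** First call falls back to one block; later calls use the observed mean. */
  long estimateNextCost(long blockSizeBytes) {
    if (nextCalls == 0) {
      return blockSizeBytes;
    }
    return totalBlockBytesScanned / nextCalls;
  }
}
```

A large scan that has been pulling megabytes per next() will now request megabytes of quota, producing a realistic (longer) backoff instead of a ~1ms retry destined to fail.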
[jira] [Created] (HBASE-28440) Add support for using mapreduce sort in HFileOutputFormat2
Bryan Beaudreault created HBASE-28440: - Summary: Add support for using mapreduce sort in HFileOutputFormat2 Key: HBASE-28440 URL: https://issues.apache.org/jira/browse/HBASE-28440 Project: HBase Issue Type: Improvement Components: backuprestore Reporter: Bryan Beaudreault Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all of the cells of a row in memory using a TreeSet. There is a warning in the javadoc "If lots of columns per row, it will use lots of memory sorting." This can be problematic for WALPlayer, which uses HFileOutputFormat2. You could have a reasonably sized row which just gets lots of edits in the time period of WALs being replayed, and that would cause an OOM. We are seeing this in some cases with incremental backups. MapReduce has built-in sorting capabilities which are not limited to sorting in memory. It can spill to disk as necessary to sort very large datasets. We can get this capability in HFileOutputFormat2 with a couple of changes: # Add support for a KeyOnlyCellComparable type as the map output key # When configured, use job.setSortComparatorClass(CellWritableComparator.class) and job.setReducerClass(PreSortedCellsReducer.class) # Update WALPlayer to have a mode which can output this new comparable instead of ImmutableBytesWritable CellWritableComparator exists already for the Import job, so there is some prior art. -- This message was sent by Atlassian Jira (v8.20.10#820010)
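The first two steps above can be sketched with a standalone comparable. For brevity this collapses the full cell-key comparison (row, family, qualifier, timestamp, etc.) down to row plus sequence id, and the class name is an illustrative stand-in for the proposed HBase type; the point is that once ordering lives in the key's comparator, the framework's spill-to-disk sort replaces the in-memory TreeSet.

```java
import java.util.Arrays;

// Illustrative stand-in for the proposed map output key: ordering is defined
// on the key itself, so MapReduce's shuffle sort (which can spill to disk)
// delivers cells to the reducer already in order.
class KeyOnlyCellComparable implements Comparable<KeyOnlyCellComparable> {
  final byte[] row;
  final long seqId; // later edits sort after earlier ones, as in WAL replay

  KeyOnlyCellComparable(byte[] row, long seqId) {
    this.row = row;
    this.seqId = seqId;
  }

  @Override
  public int compareTo(KeyOnlyCellComparable other) {
    int cmp = Arrays.compare(row, other.row);
    return cmp != 0 ? cmp : Long.compare(seqId, other.seqId);
  }
}
```

The corresponding reducer then only needs to stream the pre-sorted cells out, holding one cell at a time instead of a whole row's worth.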
[jira] [Resolved] (HBASE-28260) Possible data loss in WAL after RegionServer crash
[ https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28260. --- Fix Version/s: 2.5.9 Resolution: Fixed Pushed to branch-2.5 > Possible data loss in WAL after RegionServer crash > -- > > Key: HBASE-28260 > URL: https://issues.apache.org/jira/browse/HBASE-28260 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Assignee: Charles Connell >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2, 2.5.9 > > > We recently had a production incident: > # RegionServer crashes, but local DataNode lives on > # WAL lease recovery kicks in > # Namenode reconstructs the block during lease recovery (which results in a > new genstamp). It chooses the replica on the local DataNode as the primary. > # Local DataNode reconstructs the block, so NameNode registers the new > genstamp. > # Local DataNode and the underlying host die, before the new block could be > replicated to other replicas. > This leaves us with a missing block, because the new genstamp block has no > replicas. The old replicas still remain, but are considered corrupt due to > GENSTAMP_MISMATCH. > Thankfully we were able to confirm that the lengths of the corrupt blocks were > identical to the newly constructed and lost block. Further, the file in > question was only 1 block. So we downloaded one of those corrupt block files > and used {{hdfs dfs -put -f}} to force that block to replace the file in > hdfs. So in this case we had no actual data loss, but it could have happened > easily if the file was more than 1 block or the replicas weren't fully in > sync prior to reconstruction. > In order to avoid this issue, we should avoid writing WAL blocks to the > local datanode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to > [~weichiu] for pointing this out. 
> During reading of WALs we already reorder blocks so as to avoid reading from > the local datanode, but avoiding writing there altogether would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (HBASE-28260) Possible data loss in WAL after RegionServer crash
[ https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault reopened HBASE-28260: --- Assignee: Charles Connell Actually, since this is a bug and it applies cleanly to branch-2.5, I'm reopening for cherry-pick there. > Possible data loss in WAL after RegionServer crash > -- > > Key: HBASE-28260 > URL: https://issues.apache.org/jira/browse/HBASE-28260 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Assignee: Charles Connell >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > We recently had a production incident: > # RegionServer crashes, but local DataNode lives on > # WAL lease recovery kicks in > # Namenode reconstructs the block during lease recovery (which results in a > new genstamp). It chooses the replica on the local DataNode as the primary. > # Local DataNode reconstructs the block, so NameNode registers the new > genstamp. > # Local DataNode and the underlying host die, before the new block could be > replicated to other replicas. > This leaves us with a missing block, because the new genstamp block has no > replicas. The old replicas still remain, but are considered corrupt due to > GENSTAMP_MISMATCH. > Thankfully we were able to confirm that the lengths of the corrupt blocks were > identical to the newly constructed and lost block. Further, the file in > question was only 1 block. So we downloaded one of those corrupt block files > and used {{hdfs dfs -put -f}} to force that block to replace the file in > hdfs. So in this case we had no actual data loss, but it could have happened > easily if the file was more than 1 block or the replicas weren't fully in > sync prior to reconstruction. > In order to avoid this issue, we should avoid writing WAL blocks to the > local datanode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to > [~weichiu] for pointing this out. 
> During reading of WALs we already reorder blocks so as to avoid reading from > the local datanode, but avoiding writing there altogether would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28260) Possible data loss in WAL after RegionServer crash
[ https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28260. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed Pushed to branch-2.6+. Note that NO_LOCAL_WRITE was added back in 2016 for hbase's specific use, but apparently never used. So this Jira finally closes the loop on HDFS-3702. Thanks [~charlesconnell] for the contribution! > Possible data loss in WAL after RegionServer crash > -- > > Key: HBASE-28260 > URL: https://issues.apache.org/jira/browse/HBASE-28260 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > We recently had a production incident: > # RegionServer crashes, but local DataNode lives on > # WAL lease recovery kicks in > # Namenode reconstructs the block during lease recovery (which results in a > new genstamp). It chooses the replica on the local DataNode as the primary. > # Local DataNode reconstructs the block, so NameNode registers the new > genstamp. > # Local DataNode and the underlying host dies, before the new block could be > replicated to other replicas. > This leaves us with a missing block, because the new genstamp block has no > replicas. The old replicas still remain, but are considered corrupt due to > GENSTAMP_MISMATCH. > Thankfully we were able to confirm that the lengths of the corrupt blocks were > identical to the newly constructed and lost block. Further, the file in > question was only 1 block. So we downloaded one of those corrupt block files > and used {{hdfs dfs -put -f}} to force that block to replace the file in > hdfs. So in this case we had no actual data loss, but it could have happened > easily if the file was more than 1 block or the replicas weren't fully in > sync prior to reconstruction. > In order to avoid this issue, we should avoid writing WAL blocks to the > local datanode. 
We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to > [~weichiu] for pointing this out. > During reading of WALs we already reorder blocks so as to avoid reading from > the local datanode, but avoiding writing there altogether would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28359) Improve quota RateLimiter synchronization
[ https://issues.apache.org/jira/browse/HBASE-28359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28359. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed Pushed to branch-2.6+. Thanks for the contribution [~rmdmattingly]! > Improve quota RateLimiter synchronization > - > > Key: HBASE-28359 > URL: https://issues.apache.org/jira/browse/HBASE-28359 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > We've been experiencing RpcThrottlingException with 0ms waitInterval. This > seems odd and wasteful, since the client side will immediately retry without > backoff. I think the problem is related to the synchronization of RateLimiter. > The TimeBasedLimiter checkQuota method does the following: > {code:java} > if (!reqSizeLimiter.canExecute(estimateWriteSize + estimateReadSize)) { > RpcThrottlingException.throwRequestSizeExceeded( > reqSizeLimiter.waitInterval(estimateWriteSize + estimateReadSize)); > } {code} > Both canExecute and waitInterval are synchronized, but we're calling them > independently. So it's possible under high concurrency for canExecute to > return false, but then waitInterval returns 0 (would have been true) > I think we should simplify the API to have a single synchronized call: > {code:java} > long waitInterval = reqSizeLimiter.tryAcquire(estimateWriteSize + > estimateReadSize); > if (waitInterval > 0) { > RpcThrottlingException.throwRequestSizeExceeded(waitInterval); > }{code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
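A minimal model of the single-call API proposed above, with illustrative names rather than HBase's actual classes: because the availability check and the wait computation happen under one lock, a failed acquire can never be paired with a 0ms wait interval.

```java
// Illustrative sketch: fold canExecute() + waitInterval() into one
// synchronized tryAcquire(), so no other thread can refill the allotment
// between the check and the wait computation.
class SimpleLimiter {
  private long available;
  private final long refillAmount;

  SimpleLimiter(long available, long refillAmount) {
    this.available = available;
    this.refillAmount = refillAmount;
  }

  /** Returns 0 on success (consuming amount), else a positive wait in ms. */
  synchronized long tryAcquire(long amount, long waitIntervalMs) {
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    return waitIntervalMs; // failure always comes with a real backoff
  }

  synchronized void refill() {
    available += refillAmount;
  }
}
```

With the old two-call pattern, a refill() landing between canExecute() returning false and waitInterval() being computed could yield an RpcThrottlingException carrying a 0ms wait; here the two are atomic.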
[jira] [Created] (HBASE-28423) Improvements to backup of bulkloaded files
Bryan Beaudreault created HBASE-28423: - Summary: Improvements to backup of bulkloaded files Key: HBASE-28423 URL: https://issues.apache.org/jira/browse/HBASE-28423 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Backup/Restore has support for including bulkloaded files in incremental backups. There is a coprocessor hook which registers all bulkloads into a backup:system_bulk table. A cleaner plugin ensures that these files are not cleaned up from the archive until they are backed up. When the incremental backup occurs, the files are deleted from the system_bulk table and then cleaned up. We have encountered two problems to be solved with this: # The deletion process only happens during incremental backups, not full backups. A full backup already includes all data in the table via a snapshot export. So we should clear any pending bulkloads upon full backup. # There is currently no linking of bulkload state to backupRoot. It's possible to have multiple backupRoots for tables. For example, you might back up to 2 destinations with different schedules. Currently whichever backupRoot does an incremental backup first will be the one to include the bulkloads, and it will then clear them from the system_bulk table. We need some sort of mapping of bulkload to backupRoot, and we should only delete the rows from system_bulk once the files have been included in all active backupRoots. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28400) WAL readers treat any exception as EOFException, which can lead to data loss
Bryan Beaudreault created HBASE-28400: - Summary: WAL readers treat any exception as EOFException, which can lead to data loss Key: HBASE-28400 URL: https://issues.apache.org/jira/browse/HBASE-28400 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault In HBASE-28390, I found a bug in our WAL compression which manifests as an IllegalArgumentException or ArrayIndexOutOfBoundsException. Even worse is that ProtobufLogReader.readNext catches any Exception and rethrows it as an EOFException. EOFException gets handled in a variety of ways by the readers of WALs, and not all of them make sense for an exception that isn't really EOF. For example, WALInputFormat catches EOFException and returns false for nextKeyValue(), effectively skipping the rest of the WAL file but not failing the job. ReplicationSourceWALReader has some much more complicated handling of EOFException. -- This message was sent by Atlassian Jira (v8.20.10#820010)
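One way to avoid the blanket wrapping, sketched with illustrative names (the decoder interface and reader class here are ours, not ProtobufLogReader's actual structure): pass genuine EOFExceptions through, but surface decode-time runtime exceptions as hard failures so callers like WALInputFormat can't mistake corruption for end-of-file.

```java
import java.io.EOFException;
import java.io.IOException;

// Illustrative sketch: distinguish a genuine end-of-file from other decode
// failures, instead of rethrowing every exception as EOFException.
class WalEntryReader {
  interface EntryDecoder {
    byte[] next() throws IOException;
  }

  static byte[] readNext(EntryDecoder decoder) throws IOException {
    try {
      return decoder.next();
    } catch (EOFException e) {
      // Genuine EOF: callers may legitimately treat this as "no more entries".
      throw e;
    } catch (RuntimeException e) {
      // Corruption (e.g. IllegalArgumentException or
      // ArrayIndexOutOfBoundsException from WAL compression): fail loudly
      // rather than masking it as end-of-file.
      throw new IOException("Corrupt WAL entry", e);
    }
  }
}
```

Under this scheme, WALInputFormat's "EOFException means skip the rest of the file" behavior would only trigger on a real EOF, and corruption would fail the job instead of silently dropping edits.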
[jira] [Resolved] (HBASE-28390) WAL value compression fails for cells with large values
[ https://issues.apache.org/jira/browse/HBASE-28390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28390. --- Fix Version/s: 2.6.0 2.5.8 3.0.0-beta-2 Assignee: Bryan Beaudreault Resolution: Fixed Pushed to branch-2.5+. Thanks [~apurtell] for the review > WAL value compression fails for cells with large values > --- > > Key: HBASE-28390 > URL: https://issues.apache.org/jira/browse/HBASE-28390 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2 > > > We are testing out WAL compression and noticed that it fails for large values > when both features (wal compression and wal value compression) are enabled. > It works fine with either feature independently, but not when combined. It > seems to fail for all of the value compressor types, and the failure is in > the LRUDictionary of wal key compression: > > {code:java} > java.io.IOException: Error while reading 2 WAL KVs; started reading at 230 > and read up to 396 > at > org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:94) > ~[classes/:?] > at > org.apache.hadoop.hbase.wal.CompressedWALTestBase.doTest(CompressedWALTestBase.java:181) > ~[test-classes/:?] > at > org.apache.hadoop.hbase.wal.CompressedWALTestBase.testForSize(CompressedWALTestBase.java:129) > ~[test-classes/:?] > at > org.apache.hadoop.hbase.wal.CompressedWALTestBase.testLarge(CompressedWALTestBase.java:94) > ~[test-classes/:?] > at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:?] > at > jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:?] > at > jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:?] > at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] 
> at > org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) > ~[junit-4.13.2.jar:4.13.2] > at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) > ~[junit-4.13.2.jar:4.13.2] > at > org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) > 
~[junit-4.13.2.jar:4.13.2] > at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] > at java.lang.Thread.run(Thread.java:829) ~[?:?] > Caused by: java.lang.IndexOutOfBoundsException: index (21) must be less than > size (1) > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1371) > ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5] > at > org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1353) > ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5] > at >
[jira] [Created] (HBASE-28396) Quota throttling can cause a leak of scanners
Bryan Beaudreault created HBASE-28396: - Summary: Quota throttling can cause a leak of scanners Key: HBASE-28396 URL: https://issues.apache.org/jira/browse/HBASE-28396 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault In RSRpcServices.scan, we check the quota after having created a new RegionScannerHolder. If the quota is exceeded, an exception will be thrown. In this case, we can't send the scannerName back to the client because it's just an exception. So the client will be forced to retry the openScanner call, but the RegionScannerHolder is not closed. Eventually the scanners will be cleaned up by the lease expiration, but this could cause many scanners to leak during periods of high throttling. We could close the newly opened scanner before throwing the throttle exception, but I think it's better to not open the scanner at all until we've grabbed some quota. -- This message was sent by Atlassian Jira (v8.20.10#820010)
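The ordering fix described above can be sketched in miniature. All class and method names here (ScanQuotaSketch, checkQuota, and the map standing in for the RegionScannerHolder registry) are illustrative stand-ins, not the actual RSRpcServices APIs:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch: acquire quota *before* allocating the scanner, so a throttling
// exception cannot leak a scanner holder. Names are illustrative.
class ScanQuotaSketch {
  static class ThrottlingException extends RuntimeException {}

  final Map<String, Object> openScanners = new HashMap<>();
  private final int quotaLimit;
  private int quotaUsed = 0;

  ScanQuotaSketch(int quotaLimit) { this.quotaLimit = quotaLimit; }

  // Buggy ordering: holder is registered, then the quota check throws.
  void scanBuggy(String scannerName) {
    openScanners.put(scannerName, new Object()); // holder created first
    checkQuota();                                // may throw -> holder leaks
  }

  // Fixed ordering: grab quota first; nothing to clean up on failure.
  void scanFixed(String scannerName) {
    checkQuota();                                // throws before any allocation
    openScanners.put(scannerName, new Object());
  }

  private void checkQuota() {
    if (quotaUsed >= quotaLimit) throw new ThrottlingException();
    quotaUsed++;
  }
}
```

Under the buggy ordering a throttled call still leaves an entry in the scanner registry; under the fixed ordering the registry is untouched when the exception is thrown.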
[jira] [Created] (HBASE-28390) WAL compression fails for cells with large values when combined with WAL value compression
Bryan Beaudreault created HBASE-28390: - Summary: WAL compression fails for cells with large values when combined with WAL value compression Key: HBASE-28390 URL: https://issues.apache.org/jira/browse/HBASE-28390 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault We are testing out WAL compression and noticed that it fails for large values when both features (wal compression and wal value compression) are enabled. It works fine with either feature independently, but not when combined. It seems to fail for all of the value compressor types, and the failure is in the LRUDictionary of wal key compression: {code:java} java.io.IOException: Error while reading 2 WAL KVs; started reading at 230 and read up to 396 at org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:94) ~[classes/:?] at org.apache.hadoop.hbase.wal.CompressedWALTestBase.doTest(CompressedWALTestBase.java:181) ~[test-classes/:?] at org.apache.hadoop.hbase.wal.CompressedWALTestBase.testForSize(CompressedWALTestBase.java:129) ~[test-classes/:?] at org.apache.hadoop.hbase.wal.CompressedWALTestBase.testLarge(CompressedWALTestBase.java:94) ~[test-classes/:?] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?] at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?] at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?] at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?] 
at org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) ~[junit-4.13.2.jar:4.13.2] at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299) ~[junit-4.13.2.jar:4.13.2] at org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293) ~[junit-4.13.2.jar:4.13.2] at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?] 
at java.lang.Thread.run(Thread.java:829) ~[?:?] Caused by: java.lang.IndexOutOfBoundsException: index (21) must be less than size (1) at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1371) ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5] at org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1353) ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5] at org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:153) ~[classes/:?] at org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:79) ~[classes/:?] at org.apache.hadoop.hbase.io.util.LRUDictionary.getEntry(LRUDictionary.java:43) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:366) ~[classes/:?] at org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:307) ~[classes/:?] at org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:66) ~[classes/:?] at
[jira] [Resolved] (HBASE-28370) Default user quotas are refreshing too frequently
[ https://issues.apache.org/jira/browse/HBASE-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28370. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed > Default user quotas are refreshing too frequently > - > > Key: HBASE-28370 > URL: https://issues.apache.org/jira/browse/HBASE-28370 > Project: HBase > Issue Type: Improvement >Reporter: Ray Mattingly >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > In [https://github.com/apache/hbase/pull/5666] we introduced default user > quotas, but I accidentally called UserQuotaState's default constructor rather > than passing in the current timestamp. The consequence is that we're > constantly refreshing these default user quotas, and this can be a bottleneck > for horizontal cluster scalability. > This should be a 1 line fix in QuotaUtil's buildDefaultUserQuotaState method. -- This message was sent by Atlassian Jira (v8.20.10#820010)
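Why the default constructor causes constant refreshes can be sketched as follows. QuotaStateSketch is an illustrative stand-in for UserQuotaState, not the real class:

```java
// Sketch: a quota state constructed without a timestamp looks permanently
// stale, so every request triggers a refresh. Names are illustrative.
class QuotaStateSketch {
  final long lastUpdateMillis;

  QuotaStateSketch() { this(0L); }          // default ctor: epoch timestamp
  QuotaStateSketch(long now) { this.lastUpdateMillis = now; }

  boolean needsRefresh(long now, long refreshPeriodMillis) {
    return now - lastUpdateMillis >= refreshPeriodMillis;
  }
}
```

Passing the current timestamp at construction time, as the fix does, makes the state fresh until the refresh period actually elapses.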
[jira] [Created] (HBASE-28376) Column family ns does not exist in region during upgrade to 3.0.0-beta-2
Bryan Beaudreault created HBASE-28376: - Summary: Column family ns does not exist in region during upgrade to 3.0.0-beta-2 Key: HBASE-28376 URL: https://issues.apache.org/jira/browse/HBASE-28376 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault Upgrading from 2.5.x to 3.0.0-alpha-2, migrateNamespaceTable kicks in to copy data from the namespace table to an "ns" family of the meta table. If you don't have an "ns" family, the migration fails and the hmaster will crash loop. You then can't roll back, because the briefly alive upgraded hmaster created a procedure that can't be deserialized by 2.x (I don't have this log handy unfortunately). I tried pushing code to create the ns family on startup, but it doesn't work because the migration happens while the hmaster is still initializing. So it seems imperative that you create the ns family before upgrading. We should handle this more gracefully. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28365) ChaosMonkey batch suspend/resume actions assume a shell implementation
Bryan Beaudreault created HBASE-28365: - Summary: ChaosMonkey batch suspend/resume actions assume a shell implementation Key: HBASE-28365 URL: https://issues.apache.org/jira/browse/HBASE-28365 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault These two actions have code like this:
{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
    suspendRs(server);
  } catch (Shell.ExitCodeException e) {
    LOG.warn("Problem suspending but presume successful; code={}", e.getExitCode(), e);
  }
  suspendedServers.add(server);
  break;
{code}
This only catches that one Shell.ExitCodeException, but operators may have an implementation of ClusterManager which does not use shell. We should expand this to catch all exceptions. The implication here is that the uncaught exception propagates, and we don't add the server to suspendedServers. If the suspension actually succeeded, this leaves some processes in a permanently suspended state until manual intervention occurs. -- This message was sent by Atlassian Jira (v8.20.10#820010)
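The proposed broader catch can be sketched like this. ClusterManagerSketch and the suspend method are illustrative stand-ins for the pluggable ClusterManager interface, whose failure type is implementation-specific:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: catch any exception from the cluster manager, not just the
// shell-specific one, and always record the server so resume can find it.
class SuspendActionSketch {
  interface ClusterManagerSketch { void suspend(String server) throws Exception; }

  final List<String> suspendedServers = new ArrayList<>();

  void suspend(ClusterManagerSketch cm, String server) {
    try {
      cm.suspend(server);
    } catch (Exception e) {
      // log and presume success, as the original shell-only handler did
    }
    // always track the server; if the suspension actually succeeded,
    // a later resume action can still undo it
    suspendedServers.add(server);
  }
}
```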
[jira] [Created] (HBASE-28364) Warn: Cache key had block type null, but was found in L1 cache
Bryan Beaudreault created HBASE-28364: - Summary: Warn: Cache key had block type null, but was found in L1 cache Key: HBASE-28364 URL: https://issues.apache.org/jira/browse/HBASE-28364 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault I'm ITBLL testing branch-2.6 and am seeing lots of these warns. This is new to me. I would expect a warn to be rare or indicative of a problem, but it's unclear from the code whether that's the case here. cc [~wchevreuil] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28363) Noisy exception from FlushRegionProcedure when result is CANNOT_FLUSH
Bryan Beaudreault created HBASE-28363: - Summary: Noisy exception from FlushRegionProcedure when result is CANNOT_FLUSH Key: HBASE-28363 URL: https://issues.apache.org/jira/browse/HBASE-28363 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault Running ITBLL with chaos monkey in HBASE-28233. I noticed lots of exceptions: {code:java} [RS_FLUSH_OPERATIONS-regionserver/test-host:60020-1 {event_type=RS_FLUSH_REGIONS, pid=741536}] ERROR org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler: pid=741536 java.io.IOException: Unable to complete flush {ENCODED => 371d2ba6875913542893642c94634226, NAME => 'IntegrationTestBigLinkedList,-\x82\xD8-\x82\xD8-\x80,1707761077516.371d2ba6875913542893642c94634226.', STARTKEY = > '-\x82\xD8-\x82\xD8-\x80', ENDKEY => '3330'} at org.apache.hadoop.hbase.regionserver.FlushRegionCallable.doCall(FlushRegionCallable.java:61) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:35) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:23) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:51) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?] at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?] at java.lang.Thread.run(Thread.java:840) ~[?:?] {code} I took a look at the HRegion.flushcache code, and there are 3 reasons for CANNOT_FLUSH. All only print at debug log level and none look like actual errors. 
I think we shouldn't throw an exception here, or at least should downgrade it to debug level. It looks like a problem, but isn't (I don't think). cc [~frostruan] -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28362) NPE calling bootstrapNodeManager during RegionServer initialization
Bryan Beaudreault created HBASE-28362: - Summary: NPE calling bootstrapNodeManager during RegionServer initialization Key: HBASE-28362 URL: https://issues.apache.org/jira/browse/HBASE-28362 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault Shortly after starting up, if a RegionServer is getting requests from clients before it's ready (i.e. it restarts and they haven't cleared meta cache yet), it will throw an NPE. This is because netty may bind and start accepting requests before HRegionServer.preRegistrationInitialization finishes. I think this is similar to https://issues.apache.org/jira/browse/HBASE-28088. It's not critical because the RS self-resolves within a few seconds, but it causes noise in the logs and probably errors for clients. {code:java} 2024-02-13T18:24:02,537 [RpcServer.default.FPBQ.handler=6,queue=6,port=60020 {}] ERROR org.apache.hadoop.hbase.ipc.RpcServer: Unexpected throwable object java.lang.NullPointerException: Cannot invoke "org.apache.hadoop.hbase.regionserver.BootstrapNodeManager.getBootstrapNodes()" because "this.bootstrapNodeManager" is null at org.apache.hadoop.hbase.regionserver.HRegionServer.getBootstrapNodes(HRegionServer.java:4179) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.regionserver.RSRpcServices.getAllBootstrapNodes(RSRpcServices.java:4140) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.shaded.protobuf.generated.BootstrapNodeProtos$BootstrapNodeService$2.callBlockingMethod(BootstrapNodeProtos.java:1259) ~[hbase-protocol-shaded-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:438) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28360) [hbase-thirdparty] Upgrade Netty to 4.1.107.Final
[ https://issues.apache.org/jira/browse/HBASE-28360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28360. --- Fix Version/s: thirdparty-4.1.6 Assignee: Bryan Beaudreault Resolution: Fixed Thanks [~nihaljain.cs] and [~rajeshbabu] for the review > [hbase-thirdparty] Upgrade Netty to 4.1.107.Final > - > > Key: HBASE-28360 > URL: https://issues.apache.org/jira/browse/HBASE-28360 > Project: HBase > Issue Type: Task >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: thirdparty-4.1.6 > > > https://netty.io/news/2024/02/13/4-1-107-Final.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28360) [hbase-thirdparty] Upgrade Netty to 4.1.107.Final
Bryan Beaudreault created HBASE-28360: - Summary: [hbase-thirdparty] Upgrade Netty to 4.1.107.Final Key: HBASE-28360 URL: https://issues.apache.org/jira/browse/HBASE-28360 Project: HBase Issue Type: Task Reporter: Bryan Beaudreault https://netty.io/news/2024/02/13/4-1-107-Final.html -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28359) Improve quota RateLimiter synchronization
Bryan Beaudreault created HBASE-28359: - Summary: Improve quota RateLimiter synchronization Key: HBASE-28359 URL: https://issues.apache.org/jira/browse/HBASE-28359 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault We've been experiencing RpcThrottlingException with 0ms waitInterval. This seems odd and wasteful, since the client side will immediately retry without backoff. I think the problem is related to the synchronization of RateLimiter. The TimeBasedLimiter checkQuota method does the following:
{code:java}
if (!reqSizeLimiter.canExecute(estimateWriteSize + estimateReadSize)) {
  RpcThrottlingException.throwRequestSizeExceeded(
    reqSizeLimiter.waitInterval(estimateWriteSize + estimateReadSize));
}
{code}
Both canExecute and waitInterval are synchronized, but we're calling them independently. So it's possible under high concurrency for canExecute to return false, but for waitInterval to then return 0 (i.e. it would have returned true). I think we should simplify the API to have a single synchronized call:
{code:java}
long waitInterval = reqSizeLimiter.tryAcquire(estimateWriteSize + estimateReadSize);
if (waitInterval > 0) {
  RpcThrottlingException.throwRequestSizeExceeded(waitInterval);
}
{code}
-- This message was sent by Atlassian Jira (v8.20.10#820010)
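The race and its fix can be sketched with a toy limiter. RateLimiterSketch and tryAcquire are illustrative, not the actual HBase RateLimiter API:

```java
// Sketch: a single synchronized tryAcquire collapses the separate
// canExecute/waitInterval calls, so a concurrent thread cannot change the
// limiter state between the check and the wait computation. Names are
// illustrative, not the real HBase RateLimiter.
class RateLimiterSketch {
  private long available;
  private final long refillPerMs;

  RateLimiterSketch(long available, long refillPerMs) {
    this.available = available;
    this.refillPerMs = refillPerMs;
  }

  // Returns 0 if the permits were acquired, else a positive wait interval (ms).
  synchronized long tryAcquire(long amount) {
    if (amount <= available) {
      available -= amount;
      return 0;
    }
    long deficit = amount - available;
    return (deficit + refillPerMs - 1) / refillPerMs; // ceiling division
  }
}
```

Because the check and the wait computation happen under one lock, a caller that is refused always gets a non-zero wait interval, which is exactly the invariant the split canExecute/waitInterval pair fails to provide.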
[jira] [Resolved] (HBASE-28352) HTable batch does not honor RpcThrottlingException waitInterval
[ https://issues.apache.org/jira/browse/HBASE-28352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28352. --- Fix Version/s: 2.6.0 Assignee: Bryan Beaudreault Resolution: Fixed Pushed to branch-2 and branch-2.6. I did not include in branch-2.5, because it seems we did not backport the original waitInterval support there. If we want it there, we should also backport HBASE-27798. Thanks [~zhangduo] for the review! > HTable batch does not honor RpcThrottlingException waitInterval > --- > > Key: HBASE-28352 > URL: https://issues.apache.org/jira/browse/HBASE-28352 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0 > > > I noticed that we only honor the waitInterval in > RpcRetryingCaller.callWithRetries. But HTable.batch (AsyncProcess) uses > custom retry logic. We need to update it to honor the waitInterval -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28358) AsyncProcess inconsistent exception thrown for operation timeout
Bryan Beaudreault created HBASE-28358: - Summary: AsyncProcess inconsistent exception thrown for operation timeout Key: HBASE-28358 URL: https://issues.apache.org/jira/browse/HBASE-28358 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault I'm not sure if I'll get to this, but wanted to log it as a known issue. AsyncProcess has a design where it breaks the batch into sub-batches based on regionserver, then submits a callable per regionserver in a threadpool. In the main thread, it calls waitUntilDone() with an operation timeout. If the callables don't finish within the operation timeout, a SocketTimeoutException is thrown. This exception is not very useful because it doesn't give you any sense of how many calls were in progress, on which servers, or why it's delayed. Recently we've been improving the adherence to operation timeout within the callables themselves. The main driver here has been to ensure we don't erroneously clear the meta cache for operation timeout related errors. So we've added a new OperationTimeoutExceededException, which is thrown from within the callables and does not cause a meta cache clear. The added benefit is that if these bubble up to the caller, they are wrapped in RetriesExhaustedWithDetailsException which includes a lot more info about which server and which action is affected. Now we've covered most but not all cases where operation timeout is exceeded. So when exceeding operation timeout it's possible sometimes to see a SocketTimeoutException from waitUntilDone, and sometimes see OperationTimeoutExceededException from the callables. It will depend on which one fails first. It may be nice to finish the swing here, ensuring that we always throw OperationTimeoutExceededException from the callables. The main remaining case is in the call to locateRegion, which hits meta and does not honor the call's operation timeout (it honors the meta operation timeout instead). 
Resolving this would require some refactoring of ConnectionImplementation.locateRegion to allow passing an operation timeout and having that affect the userRegionLock and meta scan. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28349) Atomic requests should increment read usage in quotas
[ https://issues.apache.org/jira/browse/HBASE-28349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28349. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: Conditional atomic mutations which involve a read-modify-write (increment/append) or check-and-mutate, will now count as both a read and write when evaluating quotas. Previously they would just count as a write, despite involving a read as well. Resolution: Fixed > Atomic requests should increment read usage in quotas > - > > Key: HBASE-28349 > URL: https://issues.apache.org/jira/browse/HBASE-28349 > Project: HBase > Issue Type: Improvement >Reporter: Ray Mattingly >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > Right now atomic operations are just treated as a single write from the quota > perspective. Since an atomic operation also encompasses a read, it would make > sense to increment readNum and readSize counts appropriately. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28354) RegionSizeCalculator throws NPE when regions are in transition
Bryan Beaudreault created HBASE-28354: - Summary: RegionSizeCalculator throws NPE when regions are in transition Key: HBASE-28354 URL: https://issues.apache.org/jira/browse/HBASE-28354 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault When a region is in transition, it may briefly have a null ServerName in meta. The RegionSizeCalculator calls RegionLocator.getAllRegionLocations() and does not handle the possibility that a RegionLocation.getServerName() could be null. The ServerName is eventually passed into an Admin call, which results in an NPE. This has come up in other contexts. For example, taking a look at getAllRegionLocations() impl, we have checks to ensure that we don't call null server names. We need to similarly handle the possibility of nulls in RegionSizeCalculator. -- This message was sent by Atlassian Jira (v8.20.10#820010)
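The null guard described above can be sketched like this. Location and serversToQuery are illustrative stand-ins for RegionLocation and the Admin region-size lookup, not the real APIs:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: skip locations with a null server name (region in transition)
// instead of passing null into an Admin call and hitting an NPE.
class RegionSizeSketch {
  static class Location {
    final String region;
    final String serverName;
    Location(String region, String serverName) {
      this.region = region;
      this.serverName = serverName;
    }
  }

  static List<String> serversToQuery(List<Location> locations) {
    List<String> servers = new ArrayList<>();
    for (Location loc : locations) {
      if (loc.serverName == null) {
        continue; // region in transition: no server assigned yet, skip for now
      }
      if (!servers.contains(loc.serverName)) {
        servers.add(loc.serverName);
      }
    }
    return servers;
  }
}
```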
[jira] [Created] (HBASE-28352) HTable batch does not honor RpcThrottlingException waitInterval
Bryan Beaudreault created HBASE-28352: - Summary: HTable batch does not honor RpcThrottlingException waitInterval Key: HBASE-28352 URL: https://issues.apache.org/jira/browse/HBASE-28352 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault I noticed that we only honor the waitInterval in RpcRetryingCaller.callWithRetries. But HTable.batch (AsyncProcess) uses custom retry logic. We need to update it to honor the waitInterval -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-27800) Add support for default user quotas using USER => 'all'
[ https://issues.apache.org/jira/browse/HBASE-27800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-27800. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: Adds a bunch of new configs for default user machine quotas: hbase.quota.default.user.machine.read.num, hbase.quota.default.user.machine.read.size, hbase.quota.default.user.machine.write.num, hbase.quota.default.user.machine.write.size, hbase.quota.default.user.machine.request.num, hbase.quota.default.user.machine.request.size. Setting any these will apply the given limit as a default for users which are not explicitly covered by existing quotas defined through set_quota, etc. Resolution: Fixed > Add support for default user quotas using USER => 'all' > > > Key: HBASE-27800 > URL: https://issues.apache.org/jira/browse/HBASE-27800 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Ray Mattingly >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > If someone sets a quota with USER => 'all' (or maybe '*'), treat that as a > default quota for each individual user. When a request comes from a user, it > will lookup current QuotaState based on username. If one doesn't exist, it > will be pre-filled with whatever the 'all' quota was set to. Otherwise, if > you then define a quota for a specific user that will override whatever > default you have set for that user only. -- This message was sent by Atlassian Jira (v8.20.10#820010)
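The configs named in the release note above would be set in hbase-site.xml along these lines. The property names come from the release note; the limit values here are only examples:

```xml
<!-- Example only: apply a default per-machine request-count limit to any
     user not explicitly covered by a quota defined through set_quota. -->
<property>
  <name>hbase.quota.default.user.machine.read.num</name>
  <value>1000</value>
</property>
<property>
  <name>hbase.quota.default.user.machine.write.num</name>
  <value>500</value>
</property>
```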
[jira] [Reopened] (HBASE-28345) Close HBase connection on exit from HBase Shell
[ https://issues.apache.org/jira/browse/HBASE-28345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault reopened HBASE-28345: --- I don't see this backported to branch-3. Are you sure you cherry-picked everywhere? > Close HBase connection on exit from HBase Shell > --- > > Key: HBASE-28345 > URL: https://issues.apache.org/jira/browse/HBASE-28345 > Project: HBase > Issue Type: Bug > Components: shell >Affects Versions: 2.4.17 >Reporter: Istvan Toth >Assignee: Istvan Toth >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2 > > > When using Netty for the ZK client, hbase shell hangs on exit. > This is caused by the non-daemon Netty threads that ZK creates. > Whether ZK should create daemon threads for Netty or not is debatable, but > explicitly closing the connection in hbase shell on exit fixes the issue. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28348) Multi should return what results it can before rpc timeout
Bryan Beaudreault created HBASE-28348: - Summary: Multi should return what results it can before rpc timeout Key: HBASE-28348 URL: https://issues.apache.org/jira/browse/HBASE-28348 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Scans have a nice feature where they try to return a heartbeat with whatever results they have accumulated before the rpc timeout expires. It targets returning in 1/2 the rpc timeout or max scanner time. The reason for scans is to avoid painful scanner timeouts which cause the scan to have to be restarted due to out of sync sequence id. Multis have a similar problem. A big batch can come in which can't be served in the configured timeout. In this case the client side will abandon the request when the timeout is exceeded, and resubmit if there are retries/operation timeout left. This wastes work since it's likely that some of the results had been fetched by the time a timeout occurred. Multis already can retry immediately when the batch exceeds the max result size limit. We can use the same functionality to also return when we've taken more than half the rpc timeout. -- This message was sent by Atlassian Jira (v8.20.10#820010)
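The early-return idea can be sketched with a fake clock. MultiSketch and processBatch are illustrative, not the actual RSRpcServices multi path:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.function.LongSupplier;

// Sketch: stop processing a multi batch once half the rpc timeout has
// elapsed and return whatever has been accumulated, letting the client
// resubmit the remainder rather than discarding all work on timeout.
class MultiSketch {
  static List<String> processBatch(List<String> actions, long rpcTimeoutMs,
                                   LongSupplier clock) {
    long deadline = clock.getAsLong() + rpcTimeoutMs / 2; // target half the timeout
    List<String> results = new ArrayList<>();
    for (String action : actions) {
      if (clock.getAsLong() >= deadline && !results.isEmpty()) {
        break; // return partial results instead of timing out with none
      }
      results.add("result-of-" + action); // stand-in for serving the action
    }
    return results;
  }
}
```

This mirrors how scans already return partial results before the scanner timeout, and how multis already return early when the max result size limit is exceeded.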
[jira] [Created] (HBASE-28347) Update ref guide about isolation guarantees for scans
Bryan Beaudreault created HBASE-28347: - Summary: Update ref guide about isolation guarantees for scans Key: HBASE-28347 URL: https://issues.apache.org/jira/browse/HBASE-28347 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault In the "Consistency of Scans" section of [https://hbase.apache.org/acid-semantics.html], there is some confusing and outdated information. First, it's hard to realize that it's specifically talking about consistency across rows. Secondly, it's outdated because in modern hbase we acquire and maintain a memstore readPt for the lifetime of a scan in a region. So we should retain read committed behavior across rows, at least within the scope of a region. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-27687) Enhance quotas to consume blockBytesScanned rather than response size
[ https://issues.apache.org/jira/browse/HBASE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-27687. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: Read size quotas are now evaluated against block bytes scanned for a request, rather than result size. Block bytes scanned is a measure of the total size in bytes of all hfile blocks opened to serve a request. This results in a much more accurate picture of actual work done by a query and is the recommended mode. One can revert to the old behavior by setting hbase.quota.use.result.size.bytes to true. Resolution: Fixed > Enhance quotas to consume blockBytesScanned rather than response size > - > > Key: HBASE-27687 > URL: https://issues.apache.org/jira/browse/HBASE-27687 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Labels: pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > As of HBASE-27558 we now apply quota.getReadAvailable() to max block bytes > scanned by scans/multis. This issue enhances further so that we can track > read size consumed in Quotas based on block bytes scanned rather than > response size. In this mode, quotas would end-to-end be based on > blockBytesScanned. > Right now we call quota.addGetResult or addScanResult. This would just be a > matter of no-oping those calls, and calling RpcCall.getBlockBytesScanned() in > Quota.close() instead. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28346) Expose checkQuota to Coprocessor Endpoints
Bryan Beaudreault created HBASE-28346: - Summary: Expose checkQuota to Coprocessor Endpoints Key: HBASE-28346 URL: https://issues.apache.org/jira/browse/HBASE-28346 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Coprocessor endpoints may do non-trivial amounts of work, yet quotas do not throttle them. We can't generically apply quotas to coprocessors because we have no information on what a particular endpoint might do. One thing we could do is expose checkQuota to the RegionCoprocessorEnvironment. This way, coprocessor authors have the tools to ensure that quotas cover their implementations. While adding this, we can update AggregationImplementation to call checkQuota since those endpoints can be quite expensive. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28343) Write codec class into hfile header/trailer
Bryan Beaudreault created HBASE-28343: - Summary: Write codec class into hfile header/trailer Key: HBASE-28343 URL: https://issues.apache.org/jira/browse/HBASE-28343 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault We recently started playing around with the new bundled compression libraries as of 2.5.0. Specifically, we are experimenting with the different zstd codecs. The book says that aircompressor's zstd is not data compatible with Hadoop's, but doesn't say the same about zstd-jni. In our experiments we ended up in a state where some hfiles were encoded with zstd-jni (zstd.ZstdCodec) while others were encoded with hadoop (ZStandardCodec). At this point the cluster became extremely unstable, with some files unable to be read because they were encoded with a codec that didn't match the current runtime configuration. Changing the runtime configuration caused the other files to not be readable. I think this problem could be solved by writing the classname of the codec used into the hfile. This could be used as a hint so that a regionserver can read hfiles compressed with any compression codec that it supports. [~apurtell] do you have any thoughts here since you brought us all of these great compression options? -- This message was sent by Atlassian Jira (v8.20.10#820010)
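The proposed fix is a codec classname hint recorded with the file. A minimal sketch of that idea, assuming a key/value file-info map like hfiles carry (the metadata key name here is made up, not the real HFile format):

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative only: record the compression codec's class name in the
// file's metadata at write time, and prefer it over the configured default
// at read time, so a reader can open files written under either codec.
class CodecHint {
    static final String CODEC_CLASS_KEY = "compression.codec.class"; // hypothetical key

    static void writeHint(Map<String, String> fileInfo, String codecClassName) {
        fileInfo.put(CODEC_CLASS_KEY, codecClassName);
    }

    /** Prefer the recorded codec; fall back to the runtime-configured default. */
    static String resolveCodec(Map<String, String> fileInfo, String configuredDefault) {
        return fileInfo.getOrDefault(CODEC_CLASS_KEY, configuredDefault);
    }
}
```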
[jira] [Resolved] (HBASE-28216) HDFS erasure coding support for table data dirs
[ https://issues.apache.org/jira/browse/HBASE-28216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28216. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: If you use hadoop3, managing the erasure coding policy of a table's data directory is now possible with a new table descriptor setting ERASURE_CODING_POLICY. The policy you set must be available and enabled in hdfs, and hbase will validate that your cluster topology is sufficient to support that policy. After setting the policy, you must major compact the table for the change to take effect. Attempting to use this feature with hadoop2 will fail a validation check prior to making any changes. Resolution: Fixed Thanks [~weichiu], [~nihaljain.cs], and [~zhangduo] for the advice and reviews! Merged to 2.6+. We've been running this in production and it's helping to cut costs on some of our clusters. > HDFS erasure coding support for table data dirs > --- > > Key: HBASE-28216 > URL: https://issues.apache.org/jira/browse/HBASE-28216 > Project: HBase > Issue Type: New Feature >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Labels: patch-available, pull-request-available > Fix For: 2.6.0, 3.0.0-beta-2 > > > [Erasure > coding|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html] > (EC) is a hadoop-3 feature which can drastically reduce storage > requirements, at the expense of locality. At my company we have a few hbase > clusters which are extremely data dense and take mostly write traffic, fewer > reads (cold data). We'd like to reduce the cost of these clusters, and EC is > a great way to do that since it can reduce replication related storage costs > by 50%. > It's possible to enable EC policies on sub directories of HDFS. One can > manually set this with {{{}hdfs ec -setPolicy -path > /hbase/data/default/usertable -policy {}}}. This can work without any > hbase support. 
> One problem with that is a lack of visibility by operators into which tables > might have EC enabled. I think this is where HBase can help. Here's my > proposal: > * Add a new TableDescriptor and ColumnDescriptor field ERASURE_CODING_POLICY > * In ModifyTableProcedure preflightChecks, if ERASURE_CODING_POLICY is set, > verify that the requested policy is available and enabled via > DistributedFileSystem. > getErasureCodingPolicies(). > * During ModifyTableProcedure, add a new state for > MODIFY_TABLE_SYNC_ERASURE_CODING_POLICY. > ** When adding or changing a policy, use DistributedFileSystem. > setErasureCodingPolicy to sync it for the data and archive dir of that table > (or column in table) > ** When removing the property or setting it to empty, use > DistributedFileSystem. > unsetErasureCodingPolicy to remove it from the data and archive dir. > Since this new API is in hadoop-3 only, we'll need to add a reflection > wrapper class for managing the calls and verifying that the API is available. > We'll similarly do that API check in preflightChecks. -- This message was sent by Atlassian Jira (v8.20.10#820010)
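The reflection wrapper mentioned in the last step can boil down to a runtime availability probe. A hedged sketch of that check (the method and class names come from the issue text; the wrapper itself is illustrative, not the committed code):

```java
// Probe whether a class on the classpath exposes a given method, so
// preflightChecks can fail fast on hadoop-2 where
// DistributedFileSystem.setErasureCodingPolicy does not exist.
class EcApiCheck {
    /** Returns true iff className is loadable and declares methodName. */
    static boolean hasMethod(String className, String methodName) {
        try {
            Class<?> clazz = Class.forName(className);
            for (java.lang.reflect.Method m : clazz.getMethods()) {
                if (m.getName().equals(methodName)) {
                    return true;
                }
            }
            return false;
        } catch (ClassNotFoundException e) {
            return false; // class absent entirely: API unavailable
        }
    }
}
```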
[jira] [Created] (HBASE-28338) Bounded leak of FSDataInputStream buffers from checksum switching
Bryan Beaudreault created HBASE-28338: - Summary: Bounded leak of FSDataInputStream buffers from checksum switching Key: HBASE-28338 URL: https://issues.apache.org/jira/browse/HBASE-28338 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault In FSDataInputStreamWrapper, the unbuffer() method caches an unbuffer instance the first time it is called. When an FSDataInputStreamWrapper is initialized, it has hbase checksum disabled. In HFileInfo.initTrailerAndContext we get the stream, read the trailer, then call unbuffer. At this point, checksums have not been enabled yet via prepareForBlockReader. So the call to unbuffer() caches the current non-checksum stream as the unbuffer instance. Later, in initMetaAndIndex we do a similar thing. This time, prepareForBlockReader has been called, so we are now using hbase checksums. When initMetaAndIndex calls unbuffer(), it uses the old unbuffer instance which actually has been closed when we switched to hbase checksums. So that call does nothing, and the new no-checksum input stream is never unbuffered. I haven't seen this cause an issue with normal hdfs replication (though haven't gone looking). It's very problematic for Erasure Coding because DFSStripedInputStream holds a large buffer (numDataBlocks * cellSize, so 6mb for RS-6-3-1024k) that is only used for stream reads NOT pread. The FSDataInputStreamWrapper we are talking about here is only used for pread in hbase, so those 6mb buffers just hang around totally unused but unreclaimable. Since there is an input stream per StoreFile, this can add up very quickly on big servers. -- This message was sent by Atlassian Jira (v8.20.10#820010)
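The bug pattern here, stripped of HBase specifics, is caching a delegate the first time unbuffer() is called and then keeping that stale reference after the wrapper swaps its underlying stream. A self-contained illustration (this is a model of the failure mode, not the actual FSDataInputStreamWrapper code):

```java
// Model of the stale-cache bug: the wrapper caches the stream to unbuffer
// on first use, so after switching streams (e.g. enabling hbase checksums)
// it keeps unbuffering the old, already-closed stream and the live one's
// buffers are never released.
class StreamWrapper {
    static class Stream {
        boolean buffered = true;
        void unbuffer() { buffered = false; }
    }

    Stream current = new Stream();
    private Stream cachedUnbuffer;            // cached on first call: the bug

    void switchStream() { current = new Stream(); }

    void buggyUnbuffer() {
        if (cachedUnbuffer == null) {
            cachedUnbuffer = current;         // pins the pre-switch stream
        }
        cachedUnbuffer.unbuffer();
    }

    void fixedUnbuffer() {
        current.unbuffer();                   // always target the live stream
    }
}
```

With erasure coding, "buffered" stands in for the large DFSStripedInputStream buffer, so the leaked state is several megabytes per store file rather than a boolean.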
[jira] [Resolved] (HBASE-28331) Client integration test fails after upgrading hadoop3 version to 3.3.x
[ https://issues.apache.org/jira/browse/HBASE-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28331. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed [~zhangduo] feel free to re-open if something is pending here. I'm auditing fixVersions for 2.6.0 and see the commit has landed, so setting them and resolving now > Client integration test fails after upgrading hadoop3 version to 3.3.x > -- > > Key: HBASE-28331 > URL: https://issues.apache.org/jira/browse/HBASE-28331 > Project: HBase > Issue Type: Bug > Components: hadoop3, jenkins >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-2 > > > Saw this error when starting HBase cluster > {noformat} > 2024-01-25T11:25:01,838 ERROR > [master/jenkins-hbase21:16000:becomeActiveMaster] master.HMaster: Failed to > become active master > java.lang.ClassCastException: > org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetSafeModeRequestProto > cannot be cast to com.google.protobuf.Message > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:247) > ~[hadoop-common-3.3.5.jar:?] > at > org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132) > ~[hadoop-common-3.3.5.jar:?] > at com.sun.proxy.$Proxy32.setSafeMode(Unknown Source) ~[?:?] > at > org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setSafeMode(ClientNamenodeProtocolTranslatorPB.java:847) > ~[hadoop-hdfs-client-3.3.5.jar:?] 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:1.8.0_362] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_362] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_362] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433) > ~[hadoop-common-3.3.5.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166) > ~[hadoop-common-3.3.5.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158) > ~[hadoop-common-3.3.5.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96) > ~[hadoop-common-3.3.5.jar:?] > at > org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362) > ~[hadoop-common-3.3.5.jar:?] > at com.sun.proxy.$Proxy33.setSafeMode(Unknown Source) ~[?:?] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:1.8.0_362] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_362] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_362] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362] > at > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) > ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT] > at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?] 
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:1.8.0_362] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_362] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_362] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362] > at > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) > ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT] > at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:1.8.0_362] > at > sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) > ~[?:1.8.0_362] > at > sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) > ~[?:1.8.0_362] > at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362] > at > org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) > ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT] > at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?] > at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) > ~[?:1.8.0_362] > at >
[jira] [Resolved] (HBASE-26816) Fix CME in ReplicationSourceManager
[ https://issues.apache.org/jira/browse/HBASE-26816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-26816. --- Fix Version/s: 2.5.8 Resolution: Fixed This one was easy for me to cherry-pick, so I've done that and added 2.5.8 fixVersion > Fix CME in ReplicationSourceManager > --- > > Key: HBASE-26816 > URL: https://issues.apache.org/jira/browse/HBASE-26816 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.4.10 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Minor > Fix For: 2.6.0, 2.5.8, 2.4.11, 3.0.0-alpha-3 > > > Exception in thread "regionserver/hostname/ip:port" > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:832) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:162) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:155) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2623) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1175) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian Jira (v8.20.10#820010)
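The stack trace above is the classic fail-fast iterator failure: one thread structurally modifies an ArrayList while another iterates it. A minimal single-threaded reproduction of the same exception, plus the snapshot-iterator alternative (illustrating the failure class, not the actual ReplicationSourceManager patch):

```java
import java.util.ArrayList;
import java.util.ConcurrentModificationException;
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

// Removing from an ArrayList during for-each iteration throws
// ConcurrentModificationException; CopyOnWriteArrayList iterators work on
// a snapshot, so the same code completes without error.
class CmeDemo {
    static boolean throwsCme(List<String> sources) {
        try {
            for (String s : sources) {
                sources.remove(s);            // structural modification mid-iteration
            }
            return false;
        } catch (ConcurrentModificationException e) {
            return true;
        }
    }
}
```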
[jira] [Resolved] (HBASE-28190) Add slow sync log rolling test in TestAsyncLogRolling
[ https://issues.apache.org/jira/browse/HBASE-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28190. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Resolution: Fixed Looks like this issue is resolved. The commit landed in branch-2.6, and branch-3 after the beta-1 release. So setting 2.6.0 and 3.0.0-beta-2 fixVersion > Add slow sync log rolling test in TestAsyncLogRolling > - > > Key: HBASE-28190 > URL: https://issues.apache.org/jira/browse/HBASE-28190 > Project: HBase > Issue Type: Improvement > Components: test >Reporter: zhuyaogai >Assignee: zhuyaogai >Priority: Minor > Fix For: 2.6.0, 3.0.0-beta-2 > > > There is a test for slow sync log rolling in `TestLogRolling`, but not in > `TestAsyncLogRolling`, so add it in `TestAsyncLogRolling`. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-27784) support quota user overrides
[ https://issues.apache.org/jira/browse/HBASE-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-27784. --- Fix Version/s: 2.6.0 3.0.0-beta-1 Release Note: Adds a RegionServer config hbase.quota.user.override.key which can be set to the name of a request attribute whose value should be used as the username when evaluating quotas. Resolution: Fixed > support quota user overrides > > > Key: HBASE-27784 > URL: https://issues.apache.org/jira/browse/HBASE-27784 > Project: HBase > Issue Type: New Feature >Reporter: Bryan Beaudreault >Assignee: Ray Mattingly >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-1 > > > The below is the original idea that started this work, but not what we > actually landed on. See the first comment from [~rmdmattingly] and the > release note for that. > > Old description: > {quote}Currently we provide the ability to define quotas for namespaces, > tables, or users. On multi-tenant clusters, users may be broken down into > groups based on their use-case. For us this comes down to 2 main cases: > # Hadoop jobs – it would be good to be able to limit all hadoop jobs in > aggregate > # Proxy APIs - this is common where upstream callers don't hit hbase > directly, instead they go through one of many proxy api's. For us we have a > custom auth plugin which sets the username to the upstream caller name. But > it would still be useful to be able to limit all usage from some particular > proxy API in aggregate. > I think this could build upon the idea for Connection attributes in > HBASE-27657. Basically when a Connection is established we can set an > attribute (i.e. quotaGrouping=hadoop or quotaGrouping=MyProxyAPI). 
In > QuotaCache, we can add a {{getQuotaGroupLimiter(String groupName)}} and also > allow someone to define quotas using {{set_quota TYPE => THROTTLE, GROUP => > 'hadoop', LIMIT => '100M/sec'}} > I need to do more investigation into whether we'd want to return a simple > group limiter (more similar to table/namespace handling) or treat it more > like the USER limiters which returns a QuotaState (so you can limit > by-group-by-table). > We need to consider how GROUP quotas interact with USER quotas. If a user has > a quota defined, and that user is also part of a group with a quota defined, > does the request need to honor both quotas? Maybe we provide a GROUP_BYPASS > setting, similar to GLOBAL_BYPASS? > {quote} -- This message was sent by Atlassian Jira (v8.20.10#820010)
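Per the release note, what actually landed is a request-attribute username override rather than the GROUP syntax sketched above. The resolution logic amounts to: if the configured override attribute is present, quota accounting uses its value as the username; otherwise the authenticated user is used. A hedged stand-in (illustrative names, not the HBase implementation):

```java
import java.util.Map;

// Sketch of the quota-user override: overrideKey holds the value of the
// hbase.quota.user.override.key config, i.e. the name of the request
// attribute whose value replaces the authenticated user for quota purposes.
class QuotaUserResolver {
    private final String overrideKey;

    QuotaUserResolver(String overrideKey) { this.overrideKey = overrideKey; }

    String resolve(Map<String, String> requestAttributes, String authenticatedUser) {
        if (overrideKey != null && requestAttributes.containsKey(overrideKey)) {
            return requestAttributes.get(overrideKey);
        }
        return authenticatedUser;
    }
}
```

This lets many callers behind one proxy identity (or all Hadoop jobs) be throttled in aggregate under a shared quota username.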
[jira] [Reopened] (HBASE-26625) ExportSnapshot tool failed to copy data files for tables with merge region
[ https://issues.apache.org/jira/browse/HBASE-26625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault reopened HBASE-26625: --- [~meiyi] this Jira is not in branch-2.5, so shouldn't have 2.5.0 fixVersion. I'm adding 2.6.0 now. If you think it should exist in 2.5.x (probably?) then please cherry-pick there and re-add the latest 2.5.x fixVersion (2.5.8 right now) > ExportSnapshot tool failed to copy data files for tables with merge region > -- > > Key: HBASE-26625 > URL: https://issues.apache.org/jira/browse/HBASE-26625 > Project: HBase > Issue Type: Bug >Reporter: Yi Mei >Assignee: Yi Mei >Priority: Minor > Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10 > > > When export snapshot for a table with merge regions, we found following > exceptions: > {code:java} > 2021-12-24 17:14:41,563 INFO [main] snapshot.ExportSnapshot: Finalize the > Snapshot Export > 2021-12-24 17:14:41,589 INFO [main] snapshot.ExportSnapshot: Verify snapshot > integrity > 2021-12-24 17:14:41,683 ERROR [main] snapshot.ExportSnapshot: Snapshot export > failed > org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Missing parent > hfile for: 043a9fe8aa7c469d8324956a57849db5.8e935527eb39a2cf9bf0f596754b5853 > path=A/a=t42=8e935527eb39a2cf9bf0f596754b5853-043a9fe8aa7c469d8324956a57849db5 > at > org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:232) > at > org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:195) > at > org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:172) > at > org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:156) > at > org.apache.hadoop.hbase.snapshot.ExportSnapshot.verifySnapshot(ExportSnapshot.java:851) > at > org.apache.hadoop.hbase.snapshot.ExportSnapshot.doWork(ExportSnapshot.java:1096) > at > 
org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:154) > at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70) > at > org.apache.hadoop.hbase.util.AbstractHBaseTool.doStaticMain(AbstractHBaseTool.java:280) > at > org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1144) > {code} -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (HBASE-26816) Fix CME in ReplicationSourceManager
[ https://issues.apache.org/jira/browse/HBASE-26816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault reopened HBASE-26816: --- [~Xiaolin Ha] this Jira is not present in branch-2.5, but has 2.5.0 fixVersion. Do you want to cherry-pick it there, or remove 2.5.0? > Fix CME in ReplicationSourceManager > --- > > Key: HBASE-26816 > URL: https://issues.apache.org/jira/browse/HBASE-26816 > Project: HBase > Issue Type: Bug > Components: Replication >Affects Versions: 2.4.10 >Reporter: Xiaolin Ha >Assignee: Xiaolin Ha >Priority: Minor > Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.11 > > > Exception in thread "regionserver/hostname/ip:port" > java.util.ConcurrentModificationException > at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901) > at java.util.ArrayList$Itr.next(ArrayList.java:851) > at > org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:832) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:162) > at > org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:155) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2623) > at > org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1175) > at java.lang.Thread.run(Thread.java:745) -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-26642) Increase the timeout for TestStochasticLoadBalancerRegionReplicaLargeCluster
[ https://issues.apache.org/jira/browse/HBASE-26642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-26642. --- Fix Version/s: 2.6.0 Resolution: Fixed > Increase the timeout for TestStochasticLoadBalancerRegionReplicaLargeCluster > > > Key: HBASE-26642 > URL: https://issues.apache.org/jira/browse/HBASE-26642 > Project: HBase > Issue Type: Improvement > Components: Balancer, test >Affects Versions: 2.5.0, 2.6.0 >Reporter: Duo Zhang >Assignee: Duo Zhang >Priority: Major > Fix For: 2.6.0 > > > TestStochasticLoadBalancerRegionReplicaLargeCluster is on the flaky list for > branch-2, it fails 50%+. > Looking at the output, sometimes it can not finish all the calculation in > time, so let's see if increasing the timeout can help here. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28335) Expose CacheStats ageAtEviction histogram in jmx
Bryan Beaudreault created HBASE-28335: - Summary: Expose CacheStats ageAtEviction histogram in jmx Key: HBASE-28335 URL: https://issues.apache.org/jira/browse/HBASE-28335 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault In CacheStats we keep track of the ageAtEviction in a histogram. This is exposed in the UI, but not via jmx. Expose via jmx as well for easier tracking over time. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28334) Remove unnecessary null DEFAULT_VALUE in TableDescriptorBuilder
Bryan Beaudreault created HBASE-28334: - Summary: Remove unnecessary null DEFAULT_VALUE in TableDescriptorBuilder Key: HBASE-28334 URL: https://issues.apache.org/jira/browse/HBASE-28334 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault With ERASURE_CODING_POLICY, the default value is null (no policy). I added a record of that in DEFAULT_VALUES, because other settings seemed to do that. A null value is never stored on an HTD because our code handles removing the key from the map when setting null. So we'd never have an opportunity to match against the DEFAULT_VALUE. If someone tried setting a string value "null", that would fail validation because it's not a valid policy. So there's no reason to record this default value. It doesn't cause a problem, but is confusing to anyone reading the code. Remove it. -- This message was sent by Atlassian Jira (v8.20.10#820010)
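The behavior the issue relies on, that setting a value to null removes the key so a null entry can never exist in the stored map, can be sketched in isolation (this is a model of the builder semantics, not TableDescriptorBuilder itself):

```java
import java.util.HashMap;
import java.util.Map;

// Builder-style map where null means "unset": the key is removed rather
// than stored with a null value, so a null DEFAULT_VALUE entry would
// never have anything to match against.
class DescriptorSketch {
    private final Map<String, String> values = new HashMap<>();

    DescriptorSketch setValue(String key, String value) {
        if (value == null) {
            values.remove(key);               // null clears the setting
        } else {
            values.put(key, value);
        }
        return this;
    }

    boolean hasKey(String key) { return values.containsKey(key); }
}
```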
[jira] [Resolved] (HBASE-28327) Add remove(String key, Metric metric) method to MetricRegistry interface
[ https://issues.apache.org/jira/browse/HBASE-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28327. --- Fix Version/s: 2.6.0 2.5.8 3.0.0-beta-2 Resolution: Fixed Pushed to all active branches. Thanks for the contribution [~eboland148]! > Add remove(String key, Metric metric) method to MetricRegistry interface > > > Key: HBASE-28327 > URL: https://issues.apache.org/jira/browse/HBASE-28327 > Project: HBase > Issue Type: Improvement >Reporter: Evelyn Boland >Assignee: Evelyn Boland >Priority: Major > Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2 > > > Add a `remove(String name, Metric metric)` method to the `MetricRegistry` > interface. Right now the interface only contains a `remove(String name)` > method. > This additional remove method will give users the power to remove a `Metric` > with the specified `name` from the metric registry if and only if the > provided `metric` matches the object in the registry. > Implementing the new `remove(String name, Metric metric)` should be > straightforward because the `MetricRegistryImpl` class stores metrics in a > `ConcurrentMap`, which already contains a `remove(Object key, Object value)` > method. > This change will not be a breaking one because the interface is marked with > `@InterfaceStability.Evolving` > [~rmdmattingly] -- This message was sent by Atlassian Jira (v8.20.10#820010)
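As the description notes, the two-argument remove maps directly onto `ConcurrentMap.remove(Object key, Object value)`, which only removes the entry when the stored value matches the one the caller holds. A minimal stand-in using Object in place of the Metric type (the registry class here is a sketch, not MetricRegistryImpl):

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Conditional removal: the entry is dropped only if the registered metric
// is the one the caller expects, so one component cannot accidentally
// remove a same-named metric registered by another.
class RegistrySketch {
    private final ConcurrentMap<String, Object> metrics = new ConcurrentHashMap<>();

    void register(String name, Object metric) { metrics.put(name, metric); }

    /** Removes only if the registered metric equals the given one. */
    boolean remove(String name, Object metric) {
        return metrics.remove(name, metric);
    }

    boolean contains(String name) { return metrics.containsKey(name); }
}
```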
[jira] [Resolved] (HBASE-28302) Add tracking of fs read times in ScanMetrics and slow logs
[ https://issues.apache.org/jira/browse/HBASE-28302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28302. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: Adds a new getFsReadTime() to the slow log records, and fsReadTime counter to ScanMetrics. In both cases, this is the cumulative time spent reading blocks from hdfs for the given request. Additionally, a new fsSlowReadsCount jmx metric is added to the sub=IO bean. This is the count of HDFS reads which took longer than hbase.fs.reader.warn.time.ms. Assignee: Bryan Beaudreault Resolution: Fixed Thanks [~ndimiduk] for the review! Pushed to master, branch-3, branch-2, branch-2.6. > Add tracking of fs read times in ScanMetrics and slow logs > -- > > Key: HBASE-28302 > URL: https://issues.apache.org/jira/browse/HBASE-28302 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-2 > > > We've had this in our production for a while, and it's useful info to have. > We already track FS read times in > [HFileBlock|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1828-L1831C10]. > We can project that into the ScanMetrics instance and slow log pretty > easily. It is also helpful to add a slow.fs.read.threshold, over which we log > a warn -- This message was sent by Atlassian Jira (v8.20.10#820010)
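The release note above describes two counters: cumulative filesystem read time per request, and a count of reads slower than a warn threshold. A hedged sketch of that accounting (hbase.fs.reader.warn.time.ms is the real config name from the note; the class itself is illustrative):

```java
import java.util.concurrent.atomic.AtomicLong;

// Accumulate per-request fs read time and count reads exceeding the warn
// threshold, mirroring getFsReadTime() and the fsSlowReadsCount metric
// described in the release note.
class FsReadMetrics {
    private final long warnThresholdMs;       // hbase.fs.reader.warn.time.ms
    private final AtomicLong fsReadTimeMs = new AtomicLong();
    private final AtomicLong fsSlowReadsCount = new AtomicLong();

    FsReadMetrics(long warnThresholdMs) { this.warnThresholdMs = warnThresholdMs; }

    void recordRead(long elapsedMs) {
        fsReadTimeMs.addAndGet(elapsedMs);
        if (elapsedMs > warnThresholdMs) {
            fsSlowReadsCount.incrementAndGet();   // would also log a warn
        }
    }

    long getFsReadTime() { return fsReadTimeMs.get(); }
    long slowReads() { return fsSlowReadsCount.get(); }
}
```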
[jira] [Resolved] (HBASE-27966) HBase Master/RS JVM metrics populated incorrectly
[ https://issues.apache.org/jira/browse/HBASE-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-27966. --- Resolution: Fixed Pushed to branch-3. Test looks good there. I quickly checked the other branches and looks like they were properly backported. > HBase Master/RS JVM metrics populated incorrectly > - > > Key: HBASE-27966 > URL: https://issues.apache.org/jira/browse/HBASE-27966 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 2.0.0-alpha-4 >Reporter: Nihal Jain >Assignee: Nihal Jain >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-1, 2.5.6 > > Attachments: test_patch.txt > > > HBase Master/RS JVM metrics populated incorrectly due to regression causing > ambari metrics system to not able to capture them. > Based on my analysis the issue is relevant for all release post 2.0.0-alpha-4 > and seems to be caused due to HBASE-18846. > Have been able to compare the JVM metrics across 3 versions of HBase and > attaching results of same below: > HBase: 1.1.2 > {code:java} > { > "name" : "Hadoop:service=HBase,name=JvmMetrics", > "modelerType" : "JvmMetrics", > "tag.Context" : "jvm", > "tag.ProcessName" : "RegionServer", > "tag.SessionId" : "", > "tag.Hostname" : "HOSTNAME", > "MemNonHeapUsedM" : 196.05664, > "MemNonHeapCommittedM" : 347.60547, > "MemNonHeapMaxM" : 4336.0, > "MemHeapUsedM" : 7207.315, > "MemHeapCommittedM" : 66080.0, > "MemHeapMaxM" : 66080.0, > "MemMaxM" : 66080.0, > "GcCount" : 3953, > "GcTimeMillis" : 662520, > "ThreadsNew" : 0, > "ThreadsRunnable" : 214, > "ThreadsBlocked" : 0, > "ThreadsWaiting" : 626, > "ThreadsTimedWaiting" : 78, > "ThreadsTerminated" : 0, > "LogFatal" : 0, > "LogError" : 0, > "LogWarn" : 0, > "LogInfo" : 0 > }, > {code} > HBase 2.0.2 > {code:java} > { > "name" : "Hadoop:service=HBase,name=JvmMetrics", > "modelerType" : "JvmMetrics", > "tag.Context" : "jvm", > "tag.ProcessName" : "IO", > "tag.SessionId" : "", > "tag.Hostname" : "HOSTNAME", > 
"MemNonHeapUsedM" : 203.86688, > "MemNonHeapCommittedM" : 740.6953, > "MemNonHeapMaxM" : -1.0, > "MemHeapUsedM" : 14879.477, > "MemHeapCommittedM" : 31744.0, > "MemHeapMaxM" : 31744.0, > "MemMaxM" : 31744.0, > "GcCount" : 75922, > "GcTimeMillis" : 5134691, > "ThreadsNew" : 0, > "ThreadsRunnable" : 90, > "ThreadsBlocked" : 3, > "ThreadsWaiting" : 158, > "ThreadsTimedWaiting" : 36, > "ThreadsTerminated" : 0, > "LogFatal" : 0, > "LogError" : 0, > "LogWarn" : 0, > "LogInfo" : 0 > }, > {code} > HBase: 2.5.2 > {code:java} > { > "name": "Hadoop:service=HBase,name=JvmMetrics", > "modelerType": "JvmMetrics", > "tag.Context": "jvm", > "tag.ProcessName": "IO", > "tag.SessionId": "", > "tag.Hostname": "HOSTNAME", > "MemNonHeapUsedM": 192.9798, > "MemNonHeapCommittedM": 198.4375, > "MemNonHeapMaxM": -1.0, > "MemHeapUsedM": 773.23584, > "MemHeapCommittedM": 1004.0, > "MemHeapMaxM": 1024.0, > "MemMaxM": 1024.0, > "GcCount": 2048, > "GcTimeMillis": 25440, > "ThreadsNew": 0, > "ThreadsRunnable": 22, > "ThreadsBlocked": 0, > "ThreadsWaiting": 121, > "ThreadsTimedWaiting": 49, > "ThreadsTerminated": 0, > "LogFatal": 0, > "LogError": 0, > "LogWarn": 0, > "LogInfo": 0 > }, > {code} > It can be observed that 2.0.x onwards the field "tag.ProcessName" is > populating as "IO" instead of expected "RegionServer" or "Master". > Ambari relies on this field process name to create a metric > 'jvm.RegionServer.JvmMetrics.GcTimeMillis' etc. See > [code.|https://github.com/apache/ambari/blob/2ec4b055d99ec84c902da16dd57df91d571b48d6/ambari-server/src/main/java/org/apache/ambari/server/controller/metrics/timeline/AMSPropertyProvider.java#L722] > But post 2.0.x the field is getting populated as 'IO' and hence a metric with > name 'jvm.JvmMetrics.GcTimeMillis' is created instead of expected > 'jvm.RegionServer.JvmMetrics.GcTimeMillis', thus mixing up the metric with > various other metrics coming from rs, master, spark executor etc. running on > same host. 
> *Expected* > Field "tag.ProcessName" should be populated as "RegionServer" or "Master" > instead of "IO". > *Actual* > Field "tag.ProcessName" is populating as "IO" instead of expected > "RegionServer" or "Master" causing incorrect metric being published by ambari > and thus mixing up all metrics and raising various alerts around JVM metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Reopened] (HBASE-27966) HBase Master/RS JVM metrics populated incorrectly
[ https://issues.apache.org/jira/browse/HBASE-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault reopened HBASE-27966: --- Re-opening because I just realized that this was not included in branch-3. Perhaps it was committed around the time of our branching that. We need to cherry-pick to branch-3, which I will do shortly. > HBase Master/RS JVM metrics populated incorrectly > - > > Key: HBASE-27966 > URL: https://issues.apache.org/jira/browse/HBASE-27966 > Project: HBase > Issue Type: Bug > Components: metrics >Affects Versions: 2.0.0-alpha-4 >Reporter: Nihal Jain >Assignee: Nihal Jain >Priority: Major > Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1 > > Attachments: test_patch.txt > > > HBase Master/RS JVM metrics populated incorrectly due to regression causing > ambari metrics system to not able to capture them. > Based on my analysis the issue is relevant for all release post 2.0.0-alpha-4 > and seems to be caused due to HBASE-18846. > Have been able to compare the JVM metrics across 3 versions of HBase and > attaching results of same below: > HBase: 1.1.2 > {code:java} > { > "name" : "Hadoop:service=HBase,name=JvmMetrics", > "modelerType" : "JvmMetrics", > "tag.Context" : "jvm", > "tag.ProcessName" : "RegionServer", > "tag.SessionId" : "", > "tag.Hostname" : "HOSTNAME", > "MemNonHeapUsedM" : 196.05664, > "MemNonHeapCommittedM" : 347.60547, > "MemNonHeapMaxM" : 4336.0, > "MemHeapUsedM" : 7207.315, > "MemHeapCommittedM" : 66080.0, > "MemHeapMaxM" : 66080.0, > "MemMaxM" : 66080.0, > "GcCount" : 3953, > "GcTimeMillis" : 662520, > "ThreadsNew" : 0, > "ThreadsRunnable" : 214, > "ThreadsBlocked" : 0, > "ThreadsWaiting" : 626, > "ThreadsTimedWaiting" : 78, > "ThreadsTerminated" : 0, > "LogFatal" : 0, > "LogError" : 0, > "LogWarn" : 0, > "LogInfo" : 0 > }, > {code} > HBase 2.0.2 > {code:java} > { > "name" : "Hadoop:service=HBase,name=JvmMetrics", > "modelerType" : "JvmMetrics", > "tag.Context" : "jvm", > "tag.ProcessName" : "IO", > 
"tag.SessionId" : "", > "tag.Hostname" : "HOSTNAME", > "MemNonHeapUsedM" : 203.86688, > "MemNonHeapCommittedM" : 740.6953, > "MemNonHeapMaxM" : -1.0, > "MemHeapUsedM" : 14879.477, > "MemHeapCommittedM" : 31744.0, > "MemHeapMaxM" : 31744.0, > "MemMaxM" : 31744.0, > "GcCount" : 75922, > "GcTimeMillis" : 5134691, > "ThreadsNew" : 0, > "ThreadsRunnable" : 90, > "ThreadsBlocked" : 3, > "ThreadsWaiting" : 158, > "ThreadsTimedWaiting" : 36, > "ThreadsTerminated" : 0, > "LogFatal" : 0, > "LogError" : 0, > "LogWarn" : 0, > "LogInfo" : 0 > }, > {code} > HBase: 2.5.2 > {code:java} > { > "name": "Hadoop:service=HBase,name=JvmMetrics", > "modelerType": "JvmMetrics", > "tag.Context": "jvm", > "tag.ProcessName": "IO", > "tag.SessionId": "", > "tag.Hostname": "HOSTNAME", > "MemNonHeapUsedM": 192.9798, > "MemNonHeapCommittedM": 198.4375, > "MemNonHeapMaxM": -1.0, > "MemHeapUsedM": 773.23584, > "MemHeapCommittedM": 1004.0, > "MemHeapMaxM": 1024.0, > "MemMaxM": 1024.0, > "GcCount": 2048, > "GcTimeMillis": 25440, > "ThreadsNew": 0, > "ThreadsRunnable": 22, > "ThreadsBlocked": 0, > "ThreadsWaiting": 121, > "ThreadsTimedWaiting": 49, > "ThreadsTerminated": 0, > "LogFatal": 0, > "LogError": 0, > "LogWarn": 0, > "LogInfo": 0 > }, > {code} > It can be observed that 2.0.x onwards the field "tag.ProcessName" is > populating as "IO" instead of expected "RegionServer" or "Master". > Ambari relies on this field process name to create a metric > 'jvm.RegionServer.JvmMetrics.GcTimeMillis' etc. See > [code.|https://github.com/apache/ambari/blob/2ec4b055d99ec84c902da16dd57df91d571b48d6/ambari-server/src/main/java/org/apache/ambari/server/controller/metrics/timeline/AMSPropertyProvider.java#L722] > But post 2.0.x the field is getting populated as 'IO' and hence a metric with > name 'jvm.JvmMetrics.GcTimeMillis' is created instead of expected > 'jvm.RegionServer.JvmMetrics.GcTimeMillis', thus mixing up the metric with > various other metrics coming from rs, master, spark executor etc. 
running on > the same host. > *Expected* > Field "tag.ProcessName" should be populated as "RegionServer" or "Master" > instead of "IO". > *Actual* > Field "tag.ProcessName" is populated as "IO" instead of the expected > "RegionServer" or "Master", causing an incorrect metric to be published by Ambari > and thus mixing up all metrics and raising various alerts around JVM metrics. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28320) Expose DelegatingRpcScheduler as IA.LimitedPrivate
[ https://issues.apache.org/jira/browse/HBASE-28320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28320. --- Resolution: Duplicate > Expose DelegatingRpcScheduler as IA.LimitedPrivate > -- > > Key: HBASE-28320 > URL: https://issues.apache.org/jira/browse/HBASE-28320 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > > We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler > itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible > change from HBASE-27144. > We can limit the impact of breaking changes like this by exposing > DelegatingRpcScheduler to users. Users can extend this class and only > override the pieces that they care about, thus reducing the surface area of > compatibility issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28318) Expose DelegatingRpcScheduler as IA.LimitedPrivate
[ https://issues.apache.org/jira/browse/HBASE-28318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28318. --- Resolution: Duplicate > Expose DelegatingRpcScheduler as IA.LimitedPrivate > -- > > Key: HBASE-28318 > URL: https://issues.apache.org/jira/browse/HBASE-28318 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > > We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler > itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible > change from HBASE-27144. > We can limit the impact of breaking changes like this by exposing > DelegatingRpcScheduler to users. Users can extend this class and only > override the pieces that they care about, thus reducing the surface area of > compatibility issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28320) Expose DelegatingRpcScheduler as IA.LimitedPrivate
Bryan Beaudreault created HBASE-28320: - Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate Key: HBASE-28320 URL: https://issues.apache.org/jira/browse/HBASE-28320 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault Fix For: 2.5.8, 3.0.0-beta-2 We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from HBASE-27144. We can limit the impact of breaking changes like this by exposing DelegatingRpcScheduler to users. Users can extend this class and only override the pieces that they care about, thus reducing the surface area of compatibility issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28319) Expose DelegatingRpcScheduler as IA.LimitedPrivate
Bryan Beaudreault created HBASE-28319: - Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate Key: HBASE-28319 URL: https://issues.apache.org/jira/browse/HBASE-28319 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault Fix For: 2.5.8, 3.0.0-beta-2 We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from HBASE-27144. We can limit the impact of breaking changes like this by exposing DelegatingRpcScheduler to users. Users can extend this class and only override the pieces that they care about, thus reducing the surface area of compatibility issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28318) Expose DelegatingRpcScheduler as IA.LimitedPrivate
Bryan Beaudreault created HBASE-28318: - Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate Key: HBASE-28318 URL: https://issues.apache.org/jira/browse/HBASE-28318 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault Fix For: 2.5.8, 3.0.0-beta-2 We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from HBASE-27144. We can limit the impact of breaking changes like this by exposing DelegatingRpcScheduler to users. Users can extend this class and only override the pieces that they care about, thus reducing the surface area of compatibility issues. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28306) Add property to customize Version information
[ https://issues.apache.org/jira/browse/HBASE-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28306. --- Fix Version/s: 2.6.0 2.5.8 3.0.0-beta-2 Release Note: Added a new build property -Dversioninfo.version which can be used to influence the generated Version.java class in custom build scenarios. The version specified will show up in the HMaster UI and also have implications on various version-related checks. This is an advanced usage property and it's recommended not to stray too far from the default format of major.minor.patch-suffix. Resolution: Fixed Pushed to all active release lines. Thanks [~zhangduo] for review! > Add property to customize Version information > - > > Key: HBASE-28306 > URL: https://issues.apache.org/jira/browse/HBASE-28306 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2 > > > In hbase-common we generate Version.java using the ${project.version} > property. In some custom builds, it may be necessary to override the project > version. The custom version may not be compatible with how Version works, or > the user may want to add extra metadata (like a build number). We can add a > property which defaults to ${project.version} but allows the user to specify > separately if desired. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28316) Add BootstrapNodeService handlers
Bryan Beaudreault created HBASE-28316: - Summary: Add BootstrapNodeService handlers Key: HBASE-28316 URL: https://issues.apache.org/jira/browse/HBASE-28316 Project: HBase Issue Type: Sub-task Affects Versions: 3.0.0-beta-1, 2.6.0 Reporter: Bryan Beaudreault We added calls to a BootstrapNodeService, but the servers are not set up to serve it. We need to add it in two places: * RSRpcServices list of services: [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L1447] * HBasePolicyProvider mapping of acl to service: [https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/security/HBasePolicyProvider.java#L40] Without adding it to these two places, you first see UnknownServiceExceptions and then AccessDeniedExceptions. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28315) Remove noisy WARN from trying to construct MetricsServlet
Bryan Beaudreault created HBASE-28315: - Summary: Remove noisy WARN from trying to construct MetricsServlet Key: HBASE-28315 URL: https://issues.apache.org/jira/browse/HBASE-28315 Project: HBase Issue Type: Improvement Affects Versions: 3.0.0-beta-1, 2.6.0 Reporter: Bryan Beaudreault MetricsServlet has been deprecated since Hadoop 2.8 and was removed in Hadoop 3. In HBASE-20904 the servlet initialization was refactored, and we now log a noisy WARN (with stacktrace) when MetricsServlet does not exist. This case should be common, since Hadoop 3 is the modern version to run on (Hadoop 2 is almost EOL). We shouldn't warn. Fix the code so it does not produce a WARN when MetricsServlet is not available. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28256) Enhance ByteBufferUtils.readVLong to read more bytes at a time
[ https://issues.apache.org/jira/browse/HBASE-28256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28256. --- Fix Version/s: 2.6.0 2.5.8 3.0.0-beta-2 Resolution: Fixed Pushed to all active release branches. Thanks for the great work here [~bewing], and for the review [~zhangduo]. > Enhance ByteBufferUtils.readVLong to read more bytes at a time > -- > > Key: HBASE-28256 > URL: https://issues.apache.org/jira/browse/HBASE-28256 > Project: HBase > Issue Type: Improvement > Components: Performance >Reporter: Becker Ewing >Assignee: Becker Ewing >Priority: Major > Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2 > > Attachments: ReadVLongBenchmark.zip, async-prof-rs-cpu.html > > > Currently, ByteBufferUtils.readVLong is used to decode rows in all data block > encodings in order to read the memstoreTs field. For a data block encoding > like prefix, ByteBufferUtils.readVLong can surprisingly occupy over 50% of > the CPU time in BufferedEncodedSeeker.decodeNext (which can be quite a hot > method in seek operations). > > Since memstoreTs will typically require at least 6 bytes to store, we could > look to vectorize the read path for readVLong to read 8 bytes at a time > instead of a single byte at a time (like in > https://issues.apache.org/jira/browse/HBASE-28025) in order to increase > performance. > > Attached is a CPU flamegraph of a region server process which shows that we > spend a surprising amount of time in decoding rows from the DBE in > ByteBufferUtils.readVLong. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28307) Add hbase-openssl module and include in release binaries
[ https://issues.apache.org/jira/browse/HBASE-28307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28307. --- Fix Version/s: 2.6.0 3.0.0-beta-2 Release Note: Adds a new org.apache.hbase:hbase-openssl module which users can add as a dependency in their project if they'd like to use tcnative with netty TLS. The bundled tcnative is statically linked to boringssl and properly shaded to just work with hbase netty. Additionally, the tcnative jar has been added to the release binaries published by hbase (through hbase-assembly) Resolution: Fixed Thanks [~nihaljain.cs] and [~zhangduo] for the review! > Add hbase-openssl module and include in release binaries > > > Key: HBASE-28307 > URL: https://issues.apache.org/jira/browse/HBASE-28307 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-2 > > > This will make it easier for someone to use, since a common deployment > strategy would involve untar'ing our bin assembly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28307) Include hbase-shaded-netty-tcnative in hbase-assembly
Bryan Beaudreault created HBASE-28307: - Summary: Include hbase-shaded-netty-tcnative in hbase-assembly Key: HBASE-28307 URL: https://issues.apache.org/jira/browse/HBASE-28307 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault This will make it easier for someone to use, since a common deployment strategy would involve untar'ing our bin assembly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28306) Add property to customize Version information
Bryan Beaudreault created HBASE-28306: - Summary: Add property to customize Version information Key: HBASE-28306 URL: https://issues.apache.org/jira/browse/HBASE-28306 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault In hbase-common we generate Version.java using the ${project.version} property. In some custom builds, it may be necessary to override the project version. The custom version may not be compatible with how Version works, or the user may want to add extra metadata (like a build number). We can add a property which defaults to ${project.version} but allows the user to specify separately if desired. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28304) Add hbase-shaded-testing-util version to dependencyManagement
[ https://issues.apache.org/jira/browse/HBASE-28304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28304. --- Fix Version/s: 2.6.0 2.5.8 3.0.0-beta-2 Resolution: Fixed Pushed to all active branches. Thanks [~zhangduo] for the review! > Add hbase-shaded-testing-util version to dependencyManagement > - > > Key: HBASE-28304 > URL: https://issues.apache.org/jira/browse/HBASE-28304 > Project: HBase > Issue Type: Task >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2 > > > hbase-shaded-testing-util is the only sub-module referenced as a dependency > in hbase poms which is not present in our parent pom dependencyManagement. > This causes issues in my employer's build, but is also good for consistency. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28304) Add hbase-shaded-testing-util version to dependencyManagement
Bryan Beaudreault created HBASE-28304: - Summary: Add hbase-shaded-testing-util version to dependencyManagement Key: HBASE-28304 URL: https://issues.apache.org/jira/browse/HBASE-28304 Project: HBase Issue Type: Task Reporter: Bryan Beaudreault hbase-shaded-testing-util is the only sub-module referenced as a dependency in hbase poms which is not present in our parent pom dependencyManagement. This causes issues in my employer's build, but is also good for consistency. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28302) Add tracking of fs read times in ScanMetrics, slow logs, and warn threshold
Bryan Beaudreault created HBASE-28302: - Summary: Add tracking of fs read times in ScanMetrics, slow logs, and warn threshold Key: HBASE-28302 URL: https://issues.apache.org/jira/browse/HBASE-28302 Project: HBase Issue Type: Improvement Reporter: Bryan Beaudreault We've had this in our production for a while, and it's useful info to have. We already track FS read times in [HFileBlock|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1828-L1831C10]. We can project that into the ScanMetrics instance and the slow log pretty easily. It would also be helpful to add a slow.fs.read.threshold, over which we log a WARN. -- This message was sent by Atlassian Jira (v8.20.10#820010)
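If the threshold proposed above were wired into hbase-site.xml, it might look like the fragment below. This is purely an illustrative sketch: the property name is the Jira's proposal (not a shipped configuration), and the value is arbitrary.

{code:xml}
<!-- Hypothetical: log a WARN whenever a single filesystem read
     takes longer than this many milliseconds. -->
<property>
  <name>slow.fs.read.threshold</name>
  <value>1000</value>
</property>
{code}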
[jira] [Created] (HBASE-28291) [hbase-thirdparty] Update netty version
Bryan Beaudreault created HBASE-28291: - Summary: [hbase-thirdparty] Update netty version Key: HBASE-28291 URL: https://issues.apache.org/jira/browse/HBASE-28291 Project: HBase Issue Type: Task Reporter: Bryan Beaudreault Assignee: Bryan Beaudreault There is a CVE: [https://github.com/netty/netty/security/advisories/GHSA-xpw8-rcwv-8f8p]. It does not affect us, but we can clear it anyway. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28260) Possible data loss in WAL after RegionServer crash
Bryan Beaudreault created HBASE-28260: - Summary: Possible data loss in WAL after RegionServer crash Key: HBASE-28260 URL: https://issues.apache.org/jira/browse/HBASE-28260 Project: HBase Issue Type: Bug Reporter: Bryan Beaudreault We recently had a production incident: # RegionServer crashes, but the local DataNode lives on # WAL lease recovery kicks in # NameNode reconstructs the block during lease recovery (which results in a new genstamp). It chooses the replica on the local DataNode as the primary. # Local DataNode reconstructs the block, so NameNode registers the new genstamp. # Local DataNode and the underlying host die, before the new block could be replicated to other replicas. This leaves us with a missing block, because the new genstamp block has no replicas. The old replicas still remain, but are considered corrupt due to GENSTAMP_MISMATCH. Thankfully we were able to confirm that the length of the corrupt blocks was identical to that of the newly constructed and lost block. Further, the file in question was only 1 block. So we downloaded one of those corrupt block files and used {{hdfs dfs -put -f}} to force that block to replace the file in HDFS. So in this case we had no actual data loss, but it could have happened easily if the file was more than 1 block or the replicas weren't fully in sync prior to reconstruction. In order to avoid this issue, we should avoid writing WAL blocks to the local DataNode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to [~weichiu] for pointing this out. During reading of WALs we already reorder blocks so as to avoid reading from the local DataNode, but avoiding writing there altogether would be better. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28029) Netty SSL throughput improvement
[ https://issues.apache.org/jira/browse/HBASE-28029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28029. --- Fix Version/s: 2.6.0 3.0.0-beta-1 Resolution: Fixed Pushed to branch-2.6+. Thanks for the review [~nihaljain.cs] and [~zhangduo] > Netty SSL throughput improvement > > > Key: HBASE-28029 > URL: https://issues.apache.org/jira/browse/HBASE-28029 > Project: HBase > Issue Type: Improvement >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-1 > > Attachments: 10mb-wrap.html, default-wrap.html > > > Digging into HBASE-27947, I discovered an area for optimization in netty's > SslHandler. I submitted that upstream to > [https://github.com/netty/netty/issues/13549], and submitted a PR for their > review: [https://github.com/netty/netty/pull/13551]. > It's likely we will need changes in HBase to integrate this, including > updating hbase-thirdparty once the change is released, and adding support for > calling SslHandler.setWrapDataSize. This issue encapsulates that work. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (HBASE-28239) Auto create configured namespaces
Bryan Beaudreault created HBASE-28239: - Summary: Auto create configured namespaces Key: HBASE-28239 URL: https://issues.apache.org/jira/browse/HBASE-28239 Project: HBase Issue Type: New Feature Reporter: Bryan Beaudreault During startup, the HMaster will create the default and system namespaces automatically. To simplify the management of common namespaces, it would be beneficial to offer a configuration option that operators can use to ensure that additional namespaces are created during startup. This would eliminate the need to wrap createTable calls in checkAndCreateNamespace or provide separate cluster bootstrap functionality to guarantee that the namespace is created. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28215) Region reopen procedure should support some sort of throttling
[ https://issues.apache.org/jira/browse/HBASE-28215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28215. --- Fix Version/s: 2.6.0 3.0.0-beta-1 Release Note: Adds new configurations to control the speed and batching of region reopens after modifying a table: - hbase.reopen.table.regions.progressive.batch.size.max - When set, the HMaster will progressively reopen regions, starting with one region and then doubling until it reaches the specified max. After reaching the max, it will continue reopening at that batch size until all regions are reopened. - hbase.reopen.table.regions.progressive.batch.backoff.ms - When set, the HMaster will back off for this amount of time between each batch. Resolution: Fixed Pushed to master, branch-3, branch-2, branch-2.6 Thanks for the contribution [~rmdmattingly]! > Region reopen procedure should support some sort of throttling > -- > > Key: HBASE-28215 > URL: https://issues.apache.org/jira/browse/HBASE-28215 > Project: HBase > Issue Type: Improvement > Components: master, proc-v2 >Reporter: Ray Mattingly >Assignee: Ray Mattingly >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-1 > > > The mass reopening of regions caused by a table descriptor modification can > be quite disruptive. For latency/error sensitive workloads, like our user > facing traffic, we need to be very careful about when we modify table > descriptors, and it can be virtually impossible to do it painlessly for busy > tables. > It would be nice if we supported configurable batching/throttling of > reopenings so that the amplitude of any disruption can be kept relatively > small. -- This message was sent by Atlassian Jira (v8.20.10#820010)
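As a concrete illustration of the release note above, the two new properties could be set together in hbase-site.xml; the values here are only an example, not a recommendation:

{code:xml}
<!-- Progressively reopen regions after a table alter: start with one
     region and double each batch until reaching this max batch size. -->
<property>
  <name>hbase.reopen.table.regions.progressive.batch.size.max</name>
  <value>32</value>
</property>
<!-- Back off this long between batches to limit disruption. -->
<property>
  <name>hbase.reopen.table.regions.progressive.batch.backoff.ms</name>
  <value>5000</value>
</property>
{code}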
[jira] [Resolved] (HBASE-28120) Provide the switch to avoid reopening regions in the alter sync command
[ https://issues.apache.org/jira/browse/HBASE-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28120. --- Fix Version/s: (was: 2.6.0) Resolution: Invalid > Provide the switch to avoid reopening regions in the alter sync command > --- > > Key: HBASE-28120 > URL: https://issues.apache.org/jira/browse/HBASE-28120 > Project: HBase > Issue Type: Sub-task > Components: master, shell >Affects Versions: 2.0.0-alpha-1 >Reporter: Gourab Taparia >Assignee: Gourab Taparia >Priority: Major > > Since HBase 2 supports both async and sync APIs, this > sub-task is to add this feature to HBase 2's sync API. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28121) Port the switch to avoid reopening regions in the alter async in HBase 2
[ https://issues.apache.org/jira/browse/HBASE-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28121. --- Fix Version/s: (was: 2.6.0) Resolution: Invalid > Port the switch to avoid reopening regions in the alter async in HBase 2 > > > Key: HBASE-28121 > URL: https://issues.apache.org/jira/browse/HBASE-28121 > Project: HBase > Issue Type: Sub-task > Components: master, shell >Affects Versions: 2.0.0-alpha-1 >Reporter: Gourab Taparia >Assignee: Gourab Taparia >Priority: Major > > Since HBase 2 supports both async and sync APIs, this > sub-task is to port the feature added in HBase 3's alter (async by default) layer to > HBase 2's async side. > There is a separate sub-task for adding it to HBase 2's sync side. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-20433) HBase Export Snapshot utility does not close FileSystem instances
[ https://issues.apache.org/jira/browse/HBASE-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-20433. --- Resolution: Duplicate Resolving this as a duplicate of HBASE-28222 where I fixed this as best I could, by re-enabling the cache (by reverting HBASE-12819). ExportSnapshot is designed to be run as a standalone job. If someone plans to run ExportSnapshot many times in a single process, they should run FileSystem.closeAll() between each run. This is not safe for ExportSnapshot itself to do, since it could inadvertently close FileSystem objects referenced elsewhere in the user code. See HBASE-28222 for more details. > HBase Export Snapshot utility does not close FileSystem instances > - > > Key: HBASE-20433 > URL: https://issues.apache.org/jira/browse/HBASE-20433 > Project: HBase > Issue Type: Bug > Components: Client, Filesystem Integration, snapshots >Affects Versions: 1.2.6, 1.4.3 >Reporter: Voyta >Priority: Major > > It seems org.apache.hadoop.hbase.snapshot.ExportSnapshot disallows FileSystem > instance caching. > When verifySnapshot method is being run it calls often methods like > org.apache.hadoop.hbase.util.FSUtils#getRootDir that instantiate FileSystem > but never calls org.apache.hadoop.fs.FileSystem#close method. This behaviour > allows allocation of unwanted objects potentially causing memory leaks. > Related issue: https://issues.apache.org/jira/browse/HADOOP-15392 > > Expectation: > * HBase should properly release/close all objects, especially FileSystem > instances. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28222) Leak in ExportSnapshot during verifySnapshot on S3A
[ https://issues.apache.org/jira/browse/HBASE-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28222. --- Fix Version/s: 2.6.0 3.0.0-beta-1 Release Note: ExportSnapshot now uses FileSystems from the global FileSystem cache, and as such does not close those FileSystems when it finishes. If users plan to run ExportSnapshot over and over in a single process for different FileSystem urls, they should run FileSystem.closeAll() between runs. See JIRA for details. Assignee: Bryan Beaudreault Resolution: Fixed Pushed to master, branch-3, branch-2, branch-2.6. Thanks for the review [~wchevreuil]! I did not push to older branches, even though this is a bug. It might be an unexpected change, but we can if there is a desire. > Leak in ExportSnapshot during verifySnapshot on S3A > --- > > Key: HBASE-28222 > URL: https://issues.apache.org/jira/browse/HBASE-28222 > Project: HBase > Issue Type: Bug >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > Fix For: 2.6.0, 3.0.0-beta-1 > > > Each S3AFileSystem creates an S3AInstrumentation and various metrics sources, > with no real way to disable that. In HADOOP-18526, a bug was fixed so that > these are not leaked. But in order to use that, you must call > S3AFileSystem.close() when done. > In ExportSnapshot, ever since HBASE-12819 we set fs.impl.disable.cache to > true. It looks like that was added in order to prevent conflicting calls to > close() between mapper and main thread when running in a single JVM. > When verifySnapshot is enabled, SnapshotReferenceUtil.verifySnapshot iterates > all storefiles (could be many thousands) and calls > SnapshotReferenceUtil.verifyStoreFile on them. verifyStoreFile makes a number > of static calls which end up in CommonFSUtils.getRootDir, which does > Path.getFileSystem(). > Since the FS cache is disabled, every single call to Path.getFileSystem() > creates a new FileSystem instance. 
That FS is short lived, and gets GC'd. But > in the case of S3AFileSystem, this leaks all of the metrics stuff. > We have two easy possible fixes: > # Only set fs.impl.disable.cache when running hadoop in local mode, since > that was the original problem. > # When calling verifySnapshot, create a new Configuration which does not > include the fs.impl.disable.cache setting. > I tested out #2 in my environment and it fixed the leak. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28231) Setup jenkins job for branch-2.6
[ https://issues.apache.org/jira/browse/HBASE-28231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28231. --- Resolution: Done > Setup jenkins job for branch-2.6 > > > Key: HBASE-28231 > URL: https://issues.apache.org/jira/browse/HBASE-28231 > Project: HBase > Issue Type: Sub-task >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (HBASE-28229) Create branch-2.6
[ https://issues.apache.org/jira/browse/HBASE-28229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Bryan Beaudreault resolved HBASE-28229. --- Resolution: Done > Create branch-2.6 > - > > Key: HBASE-28229 > URL: https://issues.apache.org/jira/browse/HBASE-28229 > Project: HBase > Issue Type: Sub-task >Reporter: Bryan Beaudreault >Assignee: Bryan Beaudreault >Priority: Major > -- This message was sent by Atlassian Jira (v8.20.10#820010)