[jira] [Created] (HBASE-28624) Docs around configuring backups can lead to unexpectedly disabling other features

2024-05-28 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28624:
-

 Summary: Docs around configuring backups can lead to unexpectedly 
disabling other features
 Key: HBASE-28624
 URL: https://issues.apache.org/jira/browse/HBASE-28624
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


In our documentation for enabling backups, we suggest that the user set the 
following:
{code:java}
<property>
  <name>hbase.master.logcleaner.plugins</name>
  <value>org.apache.hadoop.hbase.backup.master.BackupLogCleaner,...</value>
</property>
<property>
  <name>hbase.master.hfilecleaner.plugins</name>
  <value>org.apache.hadoop.hbase.backup.BackupHFileCleaner,...</value>
</property>
{code}
A naive user will set these and not know what to do about the ",..." part. In 
doing so, they will unexpectedly disable all of the default cleaners we 
have. For example, here are the defaults:
{code:java}
<property>
  <name>hbase.master.logcleaner.plugins</name>
  <value>org.apache.hadoop.hbase.master.cleaner.TimeToLiveLogCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveProcedureWALCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveMasterLocalStoreWALCleaner</value>
</property>
<property>
  <name>hbase.master.hfilecleaner.plugins</name>
  <value>org.apache.hadoop.hbase.master.cleaner.TimeToLiveHFileCleaner,org.apache.hadoop.hbase.master.cleaner.TimeToLiveMasterLocalStoreHFileCleaner</value>
</property>
{code}
Dropping those defaults effectively disables support for hbase.master.logcleaner.ttl and 
hbase.master.hfilecleaner.ttl.

There are existing methods, BackupManager.decorateMasterConfiguration and 
BackupManager.decorateRegionServerConfiguration. They are currently javadoc'd 
as being for tests only, but I think we should call them in HMaster and 
HRegionServer. Then we would only require the user to set "hbase.backup.enable", 
which would greatly simplify our docs here.
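
For illustration, a rough sketch of the relevant part of that decoration (appending 
the backup cleaners to whatever is already configured, rather than replacing it); 
the real logic lives in BackupManager and does more than this:
{code:java}
import org.apache.hadoop.conf.Configuration;

// Illustrative sketch only, not the actual BackupManager code.
public final class BackupConfigDecorationSketch {
  public static void decorateMasterConfiguration(Configuration conf) {
    if (!conf.getBoolean("hbase.backup.enable", false)) {
      return; // backups not enabled; leave the cleaner plugins untouched
    }
    appendPlugin(conf, "hbase.master.logcleaner.plugins",
      "org.apache.hadoop.hbase.backup.master.BackupLogCleaner");
    appendPlugin(conf, "hbase.master.hfilecleaner.plugins",
      "org.apache.hadoop.hbase.backup.BackupHFileCleaner");
  }

  private static void appendPlugin(Configuration conf, String key, String plugin) {
    String current = conf.get(key, "");
    if (!current.contains(plugin)) {
      conf.set(key, current.isEmpty() ? plugin : current + "," + plugin);
    }
  }
}
{code}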





[jira] [Resolved] (HBASE-28228) Release 2.6.0

2024-05-20 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28228?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28228.
---
Resolution: Done

2.6.0 has been released

> Release 2.6.0
> -
>
> Key: HBASE-28228
> URL: https://issues.apache.org/jira/browse/HBASE-28228
> Project: HBase
>  Issue Type: Umbrella
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>






[jira] [Resolved] (HBASE-28603) Finish 2.6.0 release

2024-05-20 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28603?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28603.
---
Resolution: Done

> Finish 2.6.0 release
> 
>
> Key: HBASE-28603
> URL: https://issues.apache.org/jira/browse/HBASE-28603
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Priority: Major
>
> # Release the artifacts on repository.apache.org
>  # Move the binaries from dist-dev to dist-release
>  # Add xml to download page (via HBASE-28236)
>  # Push tag 2.6.0RC4 as tag rel/2.6.0
>  # Release 2.6.0 on JIRA 
> [https://issues.apache.org/jira/projects/HBASE/versions/12353291]
>  # Add release data on [https://reporter.apache.org/addrelease.html?hbase]
>  # Send announcement email





[jira] [Resolved] (HBASE-28236) Add 2.6.0 to downloads page

2024-05-20 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28236.
---
Resolution: Fixed

> Add 2.6.0 to downloads page
> ---
>
> Key: HBASE-28236
> URL: https://issues.apache.org/jira/browse/HBASE-28236
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Resolved] (HBASE-28232) Add release manager for 2.6 in ref guide

2024-05-20 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28232?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28232.
---
Resolution: Fixed

> Add release manager for 2.6 in ref guide
> 
>
> Key: HBASE-28232
> URL: https://issues.apache.org/jira/browse/HBASE-28232
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
>






[jira] [Created] (HBASE-28603) Finish 2.6.0 release

2024-05-17 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28603:
-

 Summary: Finish 2.6.0 release
 Key: HBASE-28603
 URL: https://issues.apache.org/jira/browse/HBASE-28603
 Project: HBase
  Issue Type: Sub-task
Reporter: Bryan Beaudreault


# Release the artifacts on repository.apache.org
 # Move the binaries from dist-dev to dist-release
 # Add xml to download page
 # Push tag 2.6.0RC4 as tag rel/2.6.0
 # Release 2.6.0 on JIRA 
[https://issues.apache.org/jira/projects/HBASE/versions/12353291]
 # Add release data on [https://reporter.apache.org/addrelease.html?hbase]
 # Send announcement email





[jira] [Resolved] (HBASE-28237) Set version to 2.6.1-SNAPSHOT for branch-2.6

2024-05-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28237?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28237.
---
Resolution: Done

This is handled by automation, so it probably didn't need to be a Jira.

> Set version to 2.6.1-SNAPSHOT for branch-2.6
> 
>
> Key: HBASE-28237
> URL: https://issues.apache.org/jira/browse/HBASE-28237
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>






[jira] [Resolved] (HBASE-28233) Run ITBLL for branch-2.6

2024-05-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28233.
---
Resolution: Done

> Run ITBLL for branch-2.6
> 
>
> Key: HBASE-28233
> URL: https://issues.apache.org/jira/browse/HBASE-28233
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>






[jira] [Resolved] (HBASE-28235) Put up 2.6.0RC0

2024-05-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28235?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28235.
---
Resolution: Done

Ended up going to RC4, which has now passed

> Put up 2.6.0RC0
> ---
>
> Key: HBASE-28235
> URL: https://issues.apache.org/jira/browse/HBASE-28235
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>






[jira] [Resolved] (HBASE-28234) Set version as 2.6.0 in branch-2.6 in prep for first RC

2024-05-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28234.
---
Resolution: Done

> Set version as 2.6.0 in branch-2.6 in prep for first RC
> ---
>
> Key: HBASE-28234
> URL: https://issues.apache.org/jira/browse/HBASE-28234
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>






[jira] [Resolved] (HBASE-26625) ExportSnapshot tool failed to copy data files for tables with merge region

2024-05-16 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-26625.
---
Resolution: Fixed

I've merged the backport to branch-2.5 and added the next unreleased 2.5.x 
version to fixVersions

> ExportSnapshot tool failed to copy data files for tables with merge region
> --
>
> Key: HBASE-26625
> URL: https://issues.apache.org/jira/browse/HBASE-26625
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Mei
>Assignee: Yi Mei
>Priority: Minor
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.5.9, 2.4.10, 3.0.0-alpha-3
>
>
> When exporting a snapshot for a table with merged regions, we found the following 
> exception:
> {code:java}
> 2021-12-24 17:14:41,563 INFO  [main] snapshot.ExportSnapshot: Finalize the 
> Snapshot Export
> 2021-12-24 17:14:41,589 INFO  [main] snapshot.ExportSnapshot: Verify snapshot 
> integrity
> 2021-12-24 17:14:41,683 ERROR [main] snapshot.ExportSnapshot: Snapshot export 
> failed
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Missing parent 
> hfile for: 043a9fe8aa7c469d8324956a57849db5.8e935527eb39a2cf9bf0f596754b5853 
> path=A/a=t42=8e935527eb39a2cf9bf0f596754b5853-043a9fe8aa7c469d8324956a57849db5
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:232)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:195)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:172)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:156)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.verifySnapshot(ExportSnapshot.java:851)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.doWork(ExportSnapshot.java:1096)
>     at 
> org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:154)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at 
> org.apache.hadoop.hbase.util.AbstractHBaseTool.doStaticMain(AbstractHBaseTool.java:280)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1144)
>  {code}





[jira] [Resolved] (HBASE-28482) Reverse scan with tags throws ArrayIndexOutOfBoundsException with DBE

2024-04-28 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28482?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28482.
---
Fix Version/s: 2.6.0
   2.4.18
   3.0.0-beta-2
   2.5.9
   Resolution: Fixed

Pushed to all active branches. Thanks for the follow-up fix here [~vineet.4008]!

> Reverse scan with tags throws ArrayIndexOutOfBoundsException with DBE
> -
>
> Key: HBASE-28482
> URL: https://issues.apache.org/jira/browse/HBASE-28482
> Project: HBase
>  Issue Type: Bug
>  Components: HFile
>Reporter: Vineet Kumar Maheshwari
>Assignee: Vineet Kumar Maheshwari
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 3.0.0-beta-2, 2.5.9
>
>
> Facing an ArrayIndexOutOfBoundsException when performing a reverse scan on a table 
> with 30K+ records in a single hfile.
> The exception happens when the block changes during the seekBefore call.
> {code:java}
> Caused by: java.lang.ArrayIndexOutOfBoundsException
>     at 
> org.apache.hadoop.hbase.util.ByteBufferUtils.copyFromBufferToArray(ByteBufferUtils.java:1326)
>     at org.apache.hadoop.hbase.nio.SingleByteBuff.get(SingleByteBuff.java:213)
>     at 
> org.apache.hadoop.hbase.io.encoding.DiffKeyDeltaEncoder$DiffSeekerStateBufferedEncodedSeeker.decode(DiffKeyDeltaEncoder.java:431)
>     at 
> org.apache.hadoop.hbase.io.encoding.DiffKeyDeltaEncoder$DiffSeekerStateBufferedEncodedSeeker.decodeNext(DiffKeyDeltaEncoder.java:502)
>     at 
> org.apache.hadoop.hbase.io.encoding.BufferedDataBlockEncoder$BufferedEncodedSeeker.seekToKeyInBlock(BufferedDataBlockEncoder.java:1012)
>     at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$EncodedScanner.loadBlockAndSeekToKey(HFileReaderImpl.java:1605)
>     at 
> org.apache.hadoop.hbase.io.hfile.HFileReaderImpl$HFileScannerImpl.seekBefore(HFileReaderImpl.java:719)
>     at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekBeforeAndSaveKeyToPreviousRow(StoreFileScanner.java:645)
>     at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRowWithoutHint(StoreFileScanner.java:570)
>     at 
> org.apache.hadoop.hbase.regionserver.StoreFileScanner.seekToPreviousRow(StoreFileScanner.java:506)
>     at 
> org.apache.hadoop.hbase.regionserver.ReversedKeyValueHeap.next(ReversedKeyValueHeap.java:126)
>     at 
> org.apache.hadoop.hbase.regionserver.StoreScanner.next(StoreScanner.java:693)
>     at 
> org.apache.hadoop.hbase.regionserver.KeyValueHeap.next(KeyValueHeap.java:151){code}
>  
> Steps to reproduce:
> Create a table with DataBlockEncoding.DIFF and a block size of 1024, write some 
> 30K+ puts with setTTL, then do a reverse scan.
> {code:java}
> @Test
> public void testReverseScanWithDBEWhenCurrentBlockUpdates() throws IOException {
>   byte[] family = Bytes.toBytes("0");
>   Configuration conf = new Configuration(TEST_UTIL.getConfiguration());
>   conf.setInt(HConstants.HBASE_CLIENT_RETRIES_NUMBER, 1);
>   try (Connection connection = ConnectionFactory.createConnection(conf)) {
>     testReverseScanWithDBE(connection, DataBlockEncoding.DIFF, family, 1024, 3);
>     for (DataBlockEncoding encoding : DataBlockEncoding.values()) {
>       testReverseScanWithDBE(connection, encoding, family, 1024, 3);
>     }
>   }
> }
>
> private void testReverseScanWithDBE(Connection conn, DataBlockEncoding encoding,
>     byte[] family, int blockSize, int maxRows) throws IOException {
>   LOG.info("Running test with DBE={}", encoding);
>   TableName tableName = TableName.valueOf(TEST_NAME.getMethodName() + "-" + encoding);
>   TEST_UTIL.createTable(TableDescriptorBuilder.newBuilder(tableName)
>     .setColumnFamily(ColumnFamilyDescriptorBuilder.newBuilder(family)
>       .setDataBlockEncoding(encoding).setBlocksize(blockSize).build())
>     .build(), null);
>   Table table = conn.getTable(tableName);
>   byte[] val1 = new byte[10];
>   byte[] val2 = new byte[10];
>   Bytes.random(val1);
>   Bytes.random(val2);
>   for (int i = 0; i < maxRows; i++) {
>     table.put(new Put(Bytes.toBytes(i)).addColumn(family, Bytes.toBytes(1), val1)
>       .addColumn(family, Bytes.toBytes(2), val2).setTTL(600_000));
>   }
>   TEST_UTIL.flush(table.getName());
>   Scan scan = new Scan();
>   scan.setReversed(true);
>   try (ResultScanner scanner = table.getScanner(scan)) {
>     for (int i = maxRows - 1; i >= 0; i--) {
>       Result row = scanner.next();
>       assertEquals(2, row.size());
>       Cell cell1 = row.getColumnLatestCell(family, Bytes.toBytes(1));
>       assertTrue(CellUtil.matchingRows(cell1, Bytes.toBytes(i)));
>       assertTrue(CellUtil.matchingValue(cell1, val1));
>       Cell cell2 = row.getColumnLatestCell(family, Bytes.toBytes(2));
>       assertTrue(CellUtil.matchingRows(cell2, Bytes.toBytes(i)));
>       assertTrue(CellUtil.matchingValue(cell2, val2));
>     }
>   }
> }
> {code}
>  
> HBASE-27580 

[jira] [Resolved] (HBASE-28255) Correcting spelling errors or annotations with non-standard spelling

2024-04-23 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28255?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28255.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   2.5.8
   Resolution: Fixed

Looks like we forgot to resolve this. I added what I think are the correct 
fixVersions.

> Correcting spelling errors or annotations with non-standard spelling
> 
>
> Key: HBASE-28255
> URL: https://issues.apache.org/jira/browse/HBASE-28255
> Project: HBase
>  Issue Type: Improvement
>Reporter: mazhengxuan
>Priority: Minor
>  Labels: documentation
> Fix For: 2.6.0, 3.0.0-beta-2, 2.5.8
>
>
> Correct some spelling errors and non-standard spellings in comments, as pointed 
> out by Typo.





[jira] [Created] (HBASE-28538) BackupHFileCleaner.loadHFileRefs is very expensive

2024-04-19 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28538:
-

 Summary: BackupHFileCleaner.loadHFileRefs is very expensive
 Key: HBASE-28538
 URL: https://issues.apache.org/jira/browse/HBASE-28538
 Project: HBase
  Issue Type: Bug
  Components: backuprestore
Reporter: Bryan Beaudreault


I noticed some odd CPU spikes on the hmasters of one of our clusters. It turns out 
the cluster had been getting lots of bulkloads (30k) and processing them was expensive. 
The method scans hbase and then parses the paths. Surprisingly, the parsing is 
more expensive than reading from hbase, with the vast majority of time spent in 
org/apache/hadoop/fs/Path.

We should see whether this can be optimized. Attaching a profile.
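
Purely as an illustration of one possible direction (not a verified fix): if the 
hot spot is constructing a Path per bulk-load reference, the cleaner could index 
the references by bare file name and compare strings instead, assuming the file 
name alone is a sufficient match key here:
{code:java}
import java.util.HashSet;
import java.util.Set;

// Illustrative sketch only; class and method names are made up.
final class HFileRefNameIndex {
  static Set<String> indexByName(Iterable<String> hfileRefPaths) {
    Set<String> names = new HashSet<>();
    for (String ref : hfileRefPaths) {
      int slash = ref.lastIndexOf('/');
      // Keep just the file name so candidate files can be checked with a
      // string lookup, avoiding a new org.apache.hadoop.fs.Path per reference.
      names.add(slash >= 0 ? ref.substring(slash + 1) : ref);
    }
    return names;
  }
}
{code}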





[jira] [Resolved] (HBASE-28183) It's impossible to re-enable the quota table if it gets disabled

2024-04-07 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28183?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28183.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   2.5.9
   Resolution: Fixed

Pushed to branch-2.5+. Thanks for the contribution [~chandrasekhar.k]!

> It's impossible to re-enable the quota table if it gets disabled
> 
>
> Key: HBASE-28183
> URL: https://issues.apache.org/jira/browse/HBASE-28183
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Chandra Sekhar K
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2, 2.5.9
>
>
> HMaster.enableTable tries to read the quota table. If you disable the quota 
> table, this fails. So then it's impossible to re-enable it. The only solution 
> I can find is to delete the table at this point, so that it gets recreated at 
> startup, but this results in losing any quotas you had defined.  We should 
> fix enableTable to not check quotas if the table in question is hbase:quota.
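
A minimal sketch of the guard described in the last sentence above; the exact call 
site inside HMaster.enableTable may differ, and checkNamespaceTableAndRegionQuota 
is only a stand-in for whatever quota lookup is performed there:
{code:java}
import java.io.IOException;
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.quotas.MasterQuotaManager;
import org.apache.hadoop.hbase.quotas.QuotaTableUtil;

final class EnableTableQuotaGuardSketch {
  // Skip the quota lookup when the table being enabled is hbase:quota itself,
  // so enableTable never needs to read the quota table in that case.
  static void checkQuotaUnlessQuotaTable(MasterQuotaManager quotaManager, TableName tableName,
    int regionCount) throws IOException {
    if (QuotaTableUtil.QUOTA_TABLE_NAME.equals(tableName)) {
      return;
    }
    quotaManager.checkNamespaceTableAndRegionQuota(tableName, regionCount);
  }
}
{code}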





[jira] [Resolved] (HBASE-28483) Merge of incremental backups fails on bulkloaded Hfiles

2024-04-06 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28483?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28483.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Pushed to branch-2.6+. Thanks for the report and fix [~thomas.sarens]!

> Merge of incremental backups fails on bulkloaded Hfiles
> ---
>
> Key: HBASE-28483
> URL: https://issues.apache.org/jira/browse/HBASE-28483
> Project: HBase
>  Issue Type: Bug
>  Components: backuprestore
>Affects Versions: 2.6.0, 4.0.0-alpha-1
>Reporter: thomassarens
>Assignee: thomassarens
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
> Attachments: TestIncrementalBackupMergeWithBulkLoad.java
>
>
> The merge of incremental backups fails when one of the backups contains a 
> bulk-loaded HFile and the other backups don't. See the test in the attachments, 
> based on
> {code:java}
> org/apache/hadoop/hbase/backup/TestBackupRestoreWithModifications.java{code}
> which reproduces the exception when useBulkLoad is set to true 
> [^TestIncrementalBackupMergeWithBulkLoad.java].
> The exception occurs in the call to `HFileRecordReader#initialize` as it 
> tries to read a directory path as an HFile. I'll see if I can create a patch 
> on master to fix this.
> {code:java}
> 2024-04-04T14:55:15,462 INFO  LocalJobRunner Map Task Executor #0 {} 
> mapreduce.HFileInputFormat$HFileRecordReader(95): Initialize 
> HFileRecordReader for 
> hdfs://localhost:34093/user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
> 2024-04-04T14:55:15,482 WARN  [Thread-1429 {}] 
> mapred.LocalJobRunner$Job(590): job_local1854345815_0018
> java.lang.Exception: java.io.FileNotFoundException: Path is not a file: 
> /user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:460)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:621)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:589)
> at 
> org.apache.hadoop.ipc.ProtobufRpcEngine2$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine2.java:573)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1213)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1089)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:1012)
> at java.base/java.security.AccessController.doPrivileged(Native Method)
> at java.base/javax.security.auth.Subject.doAs(Subject.java:423)
> at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1899)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:3026)
>  
> at 
> org.apache.hadoop.mapred.LocalJobRunner$Job.runTasks(LocalJobRunner.java:492) 
> ~[hadoop-mapreduce-client-common-3.3.5.jar:?]
> at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:552) 
> ~[hadoop-mapreduce-client-common-3.3.5.jar:?]
> Caused by: java.io.FileNotFoundException: Path is not a file: 
> /user/thomass/backupIT/backup_1712235269368/default/table-true/eaeb223066c24d3e77a2ee6987e30cb3/0
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:90)
> at org.apache.hadoop.hdfs.server.namenode.INodeFile.valueOf(INodeFile.java:76)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSDirStatAndListingOp.getBlockLocations(FSDirStatAndListingOp.java:156)
> at 
> org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getBlockLocations(FSNamesystem.java:2124)
> at 
> org.apache.hadoop.hdfs.server.namenode.NameNodeRpcServer.getBlockLocations(NameNodeRpcServer.java:769)
> at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolServerSideTranslatorPB.getBlockLocations(ClientNamenodeProtocolServerSideTranslatorPB.java:460)
> at 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$ClientNamenodeProtocol$2.callBlockingMethod(ClientNamenodeProtocolProtos.java)
> at 
> 

[jira] [Resolved] (HBASE-28460) Full backup restore fails for empty HFiles

2024-04-02 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28460?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28460.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Assignee: Dieter De Paepe
   Resolution: Fixed

Thanks for the contribution [~dieterdp_ng]! Pushed to branch-2.6+

> Full backup restore fails for empty HFiles
> --
>
> Key: HBASE-28460
> URL: https://issues.apache.org/jira/browse/HBASE-28460
> Project: HBase
>  Issue Type: Bug
>  Components: backuprestore
>Affects Versions: 2.6.0, 4.0.0-alpha-1
>Reporter: Dieter De Paepe
>Assignee: Dieter De Paepe
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> A full backup restore fails if the backup contains an empty HFile, for 
> example when all data has been deleted from a table and full compaction has 
> run. There are several issues:
>  * HFiles are read in `RestoreTool` to read the first/last key, but this 
> fails for empty HFiles
>  * In `RestoreTool`, table creation also incorrectly assumes the region 
> contains keys
>  * In `MapReduceRestoreJob`, the tool incorrectly assumes that a bulkload 
> with no loaded entries is an error.
> Example stacktrace:
> {code:java}
> 24/03/21 18:38:09 ERROR org.apache.hadoop.hbase.backup.util.BackupUtils: 
> java.util.NoSuchElementException: No value present
> java.util.NoSuchElementException: No value present
>   at java.base/java.util.Optional.get(Optional.java:143)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.generateBoundaryKeys(RestoreTool.java:440)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.checkAndCreateTable(RestoreTool.java:493)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:351)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.fullRestoreTable(RestoreTool.java:211)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restoreImages(RestoreTablesClient.java:151)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restore(RestoreTablesClient.java:229)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.execute(RestoreTablesClient.java:265)
>   at 
> org.apache.hadoop.hbase.backup.impl.BackupAdminImpl.restore(BackupAdminImpl.java:518)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.parseAndRun(RestoreDriver.java:176)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.doWork(RestoreDriver.java:216)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.run(RestoreDriver.java:252)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.main(RestoreDriver.java:224)
> 24/03/21 18:38:09 ERROR org.apache.hadoop.hbase.backup.RestoreDriver: Error 
> while running restore backup
> java.lang.IllegalStateException: Cannot restore hbase table
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:360)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.fullRestoreTable(RestoreTool.java:211)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restoreImages(RestoreTablesClient.java:151)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.restore(RestoreTablesClient.java:229)
>   at 
> org.apache.hadoop.hbase.backup.impl.RestoreTablesClient.execute(RestoreTablesClient.java:265)
>   at 
> org.apache.hadoop.hbase.backup.impl.BackupAdminImpl.restore(BackupAdminImpl.java:518)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.parseAndRun(RestoreDriver.java:176)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.doWork(RestoreDriver.java:216)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.run(RestoreDriver.java:252)
>   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:82)
>   at 
> org.apache.hadoop.hbase.backup.RestoreDriver.main(RestoreDriver.java:224)
> Caused by: java.util.NoSuchElementException: No value present
>   at java.base/java.util.Optional.get(Optional.java:143)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.generateBoundaryKeys(RestoreTool.java:440)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.checkAndCreateTable(RestoreTool.java:493)
>   at 
> org.apache.hadoop.hbase.backup.util.RestoreTool.createAndRestoreTable(RestoreTool.java:351)
>   ... 10 more {code}
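
To make the first two bullets above concrete, a hedged sketch of the kind of guard 
RestoreTool needs when computing boundary keys (the helper name is illustrative, 
and it assumes HFile.Reader#getEntries is the right emptiness check):
{code:java}
import java.io.IOException;
import java.util.Optional;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.io.hfile.CacheConfig;
import org.apache.hadoop.hbase.io.hfile.HFile;
import org.apache.hadoop.hbase.util.Pair;

final class BoundaryKeySketch {
  /** Returns the first/last row key of an HFile, or empty if the file has no entries. */
  static Optional<Pair<byte[], byte[]>> firstAndLastRow(FileSystem fs, Path hfile,
    Configuration conf) throws IOException {
    try (HFile.Reader reader = HFile.createReader(fs, hfile, CacheConfig.DISABLED, true, conf)) {
      if (reader.getEntries() == 0) {
        return Optional.empty(); // empty HFile: contributes no boundary keys
      }
      return Optional.of(new Pair<>(reader.getFirstRowKey().get(), reader.getLastRowKey().get()));
    }
  }
}
{code}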





[jira] [Resolved] (HBASE-27657) Connection and Request Attributes

2024-03-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27657.
---
Resolution: Fixed

Addendum committed to branch-2 and branch-2.6. The problem did not exist on 
master/branch-3.

> Connection and Request Attributes
> -
>
> Key: HBASE-27657
> URL: https://issues.apache.org/jira/browse/HBASE-27657
> Project: HBase
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> Currently we have the ability to set Operation attributes, via 
> Get.setAttribute, etc. It would be useful to be able to set attributes at the 
> request and connection level.
> These levels can result in less duplication. For example, send some 
> attributes once per connection instead of for every one of the millions of 
> requests a connection might send. Or send once for the request, instead of 
> duplicating on every operation in a multi request.
> Additionally, the Connection and RequestHeader are more globally available on 
> the server side. Both can be accessed via RpcServer.getCurrentCall(), which 
> is useful in various integration points – coprocessors, custom queues, 
> quotas, slow log, etc. Operation attributes are harder to access because you 
> need to parse the raw Message into the appropriate type to get access to the 
> getter.
> I was thinking adding two new methods to Connection interface:
> - setAttribute (and getAttribute/getAttributes)
> - setRequestAttributeProvider
> Any Connection attributes would be set onto the ConnectionHeader during 
> initialization. The RequestAttributeProvider would be called when creating 
> each RequestHeader.
> An alternative to setRequestAttributeProvider would be to add this into 
> HBaseRpcController, which can already be customized via site configuration. 





[jira] [Reopened] (HBASE-27657) Connection and Request Attributes

2024-03-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27657?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-27657:
---
  Assignee: Bryan Beaudreault  (was: Ray Mattingly)

Reopening for addendum. We accidentally dropped the following method from 
ConnectionFactory:
{code:java}
ConnectionFactory.createConnection ( Configuration conf, ExecutorService pool, 
User user ) [static]  :  Connection {code}

> Connection and Request Attributes
> -
>
> Key: HBASE-27657
> URL: https://issues.apache.org/jira/browse/HBASE-27657
> Project: HBase
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> Currently we have the ability to set Operation attributes, via 
> Get.setAttribute, etc. It would be useful to be able to set attributes at the 
> request and connection level.
> These levels can result in less duplication. For example, send some 
> attributes once per connection instead of for every one of the millions of 
> requests a connection might send. Or send once for the request, instead of 
> duplicating on every operation in a multi request.
> Additionally, the Connection and RequestHeader are more globally available on 
> the server side. Both can be accessed via RpcServer.getCurrentCall(), which 
> is useful in various integration points – coprocessors, custom queues, 
> quotas, slow log, etc. Operation attributes are harder to access because you 
> need to parse the raw Message into the appropriate type to get access to the 
> getter.
> I was thinking adding two new methods to Connection interface:
> - setAttribute (and getAttribute/getAttributes)
> - setRequestAttributeProvider
> Any Connection attributes would be set onto the ConnectionHeader during 
> initialization. The RequestAttributeProvider would be called when creating 
> each RequestHeader.
> An alternative to setRequestAttributeProvider would be to add this into 
> HBaseRpcController, which can already be customized via site configuration. 





[jira] [Created] (HBASE-28462) Incremental backup can fail if log gets archived while WALPlayer is starting up

2024-03-27 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28462:
-

 Summary: Incremental backup can fail if log gets archived while 
WALPlayer is starting up
 Key: HBASE-28462
 URL: https://issues.apache.org/jira/browse/HBASE-28462
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


We had an incremental backup fail with FileNotFoundException for a file in the 
WALs directory. Upon investigation, the log had been archived a few minutes 
earlier. WALInputFormat's record reader has support for falling back on an 
archived path:
{code:java}
} catch (IOException e) {
  Path archivedLog = AbstractFSWALProvider.findArchivedLog(logFile, conf);
  // archivedLog can be null if unable to locate in archiveDir.
  if (archivedLog != null) {
openReader(archivedLog);
// Try call again in recursion
return nextKeyValue();
  } else {
throw e;
  }
} {code}
But the getSplits method has different handling:
{code:java}
try {
  List files = getFiles(fs, inputPath, startTime, endTime);
  allFiles.addAll(files);
} catch (FileNotFoundException e) {
  if (ignoreMissing) {
LOG.warn("File " + inputPath + " is missing. Skipping it.");
continue;
  }
  throw e;
} {code}
This ignoreMissing variable was added in HBASE-14141 and is enabled via 
wal.input.ignore.missing.files, which defaults to false and is never set. 
Looking at the comments and reviewboard history of HBASE-14141, I think there 
might have been some confusion about where to handle these missing files, and 
this got lost in the shuffle.
 
I would prefer not to ignore missing WAL files. I think that could result in some 
weird behavior:
 * RegionServer has 10 archived and 30 not-yet-archived WALs needing to be backed up
 * The process starts, and while it's running 1 of those 30 WALs gets archived. That 
one would get skipped due to FileNotFoundException
 * But the remaining 29 would be backed up

This scenario could cause data consistency issues if the incremental backup is 
restored: we would have missed some edits in the middle of the edits applied from 
other WALs.

So I do think failing as we do today is necessary for consistency, but it is 
unrealistic in a live cluster. The solution is to try finding the missing file 
in the archived directory. The backup feature has a coprocessor which will not 
allow the archived file to be cleaned up until it has been backed up, so I think 
it's safe to say that a WAL is definitely in either WALs or oldWALs.
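
A rough sketch of that fallback in getSplits, assuming (as in the incremental 
backup case) that the inputs here are individual WAL file paths so 
AbstractFSWALProvider.findArchivedLog applies; directory inputs would need 
separate handling:
{code:java}
try {
  List<FileStatus> files = getFiles(fs, inputPath, startTime, endTime);
  allFiles.addAll(files);
} catch (FileNotFoundException e) {
  // The WAL may have been archived between listing and now; look for it in
  // oldWALs before giving up, mirroring what the record reader already does.
  // conf here is the job Configuration (context.getConfiguration()).
  Path archivedLog = AbstractFSWALProvider.findArchivedLog(inputPath, conf);
  if (archivedLog != null) {
    allFiles.addAll(getFiles(fs, archivedLog, startTime, endTime));
  } else {
    throw e;
  }
} {code}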





[jira] [Created] (HBASE-28459) HFileOutputFormat2 ClassCastException with s3 magic committer

2024-03-27 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28459:
-

 Summary: HFileOutputFormat2 ClassCastException with s3 magic 
committer
 Key: HBASE-28459
 URL: https://issues.apache.org/jira/browse/HBASE-28459
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


In hadoop3 there's the s3 magic committer, which can speed up s3 writes 
dramatically. In HFileOutputFormat2.createRecordWriter we cast the passed-in 
committer to a FileOutputCommitter. This causes a ClassCastException when the 
s3 magic committer is enabled:
Error: java.lang.ClassCastException: class 
org.apache.hadoop.fs.s3a.commit.magic.MagicS3GuardCommitter cannot be cast to 
class org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter
We can cast to PathOutputCommitter instead, but it's only available in hadoop3+, 
so we will need to use reflection to work around this in branch-2.
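
A hedged sketch of the reflection approach (the helper name is made up; it relies 
on both FileOutputCommitter and hadoop3's PathOutputCommitter exposing 
getWorkPath(), which is what the current cast is used for):
{code:java}
import java.lang.reflect.Method;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.OutputCommitter;
import org.apache.hadoop.mapreduce.lib.output.FileOutputCommitter;

final class CommitterWorkPathSketch {
  /** Resolve the task work path without hard-casting to FileOutputCommitter. */
  static Path getWorkPath(OutputCommitter committer) throws Exception {
    if (committer instanceof FileOutputCommitter) {
      return ((FileOutputCommitter) committer).getWorkPath();
    }
    // On hadoop3, PathOutputCommitter (e.g. the s3a magic committer) also has
    // getWorkPath(); reflection avoids a compile-time dependency on hadoop3.
    Method getWorkPath = committer.getClass().getMethod("getWorkPath");
    return (Path) getWorkPath.invoke(committer);
  }
}
{code}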





[jira] [Resolved] (HBASE-28412) Restoring incremental backups to mapped table requires existence of original table

2024-03-26 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28412?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28412.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Pushed to branch-2.6+. Thanks [~rubenvw] for the contribution!

I also added you and [~dieterdp_ng] as contributors to the project so that you 
can be assigned jiras.

> Restoring incremental backups to mapped table requires existence of original 
> table
> --
>
> Key: HBASE-28412
> URL: https://issues.apache.org/jira/browse/HBASE-28412
> Project: HBase
>  Issue Type: Bug
>  Components: backuprestore
>Reporter: Dieter De Paepe
>Assignee: Ruben Van Wanzeele
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> It appears that restoring a non-existing table from an incremental backup 
> with the "-m" parameter results in an error in the restore client.
> Reproduction steps:
> Build & start hbase:
> {code:java}
> mvn clean install -Phadoop-3.0 -DskipTests
> bin/start-hbase.sh{code}
> In HBase shell: create table and some values:
> {code:java}
> create 'test', 'cf'
> put 'test', 'row1', 'cf:a', 'value1'
> put 'test', 'row2', 'cf:b', 'value2'
> put 'test', 'row3', 'cf:c', 'value3'
> scan 'test' {code}
> Create a full backup:
> {code:java}
> bin/hbase backup create full file:/tmp/hbase-backup{code}
> Adjust some data through HBase shell:
> {code:java}
> put 'test', 'row1', 'cf:a', 'value1-new'
> scan 'test' {code}
> Create an incremental backup:
> {code:java}
> bin/hbase backup create incremental file:/tmp/hbase-backup {code}
> Delete the original table in HBase shell:
> {code:java}
> disable 'test'
> drop 'test' {code}
> Restore the incremental backup under a new table name:
> {code:java}
> bin/hbase backup history
> bin/hbase restore file:/tmp/hbase-backup  -t "test" -m 
> "test-restored" {code}
> This results in the following output / error:
> {code:java}
> ...
> 2024-03-25T13:38:53,062 WARN  [main {}] util.NativeCodeLoader: Unable to load 
> native-hadoop library for your platform... using builtin-java classes where 
> applicable
> 2024-03-25T13:38:53,174 INFO  [main {}] Configuration.deprecation: 
> hbase.client.pause.cqtbe is deprecated. Instead, use 
> hbase.client.pause.server.overloaded
> 2024-03-25T13:38:53,554 INFO  [main {}] impl.RestoreTablesClient: HBase table 
> test-restored does not exist. It will be created during restore process
> 2024-03-25T13:38:53,593 INFO  [main {}] impl.RestoreTablesClient: Restoring 
> 'test' to 'test-restored' from full backup image 
> file:/tmp/hbase-backup/backup_1711370230143/default/test
> 2024-03-25T13:38:53,707 INFO  [main {}] util.BackupUtils: Creating target 
> table 'test-restored'
> 2024-03-25T13:38:54,546 INFO  [main {}] mapreduce.MapReduceRestoreJob: 
> Restore test into test-restored
> 2024-03-25T13:38:54,646 INFO  [main {}] mapreduce.HFileOutputFormat2: 
> bulkload locality sensitive enabled
> 2024-03-25T13:38:54,647 INFO  [main {}] mapreduce.HFileOutputFormat2: Looking 
> up current regions for table test-restored
> 2024-03-25T13:38:54,669 INFO  [main {}] mapreduce.HFileOutputFormat2: 
> Configuring 1 reduce partitions to match current region count for all tables
> 2024-03-25T13:38:54,669 INFO  [main {}] mapreduce.HFileOutputFormat2: Writing 
> partition information to 
> file:/tmp/hbase-tmp/partitions_0667b6e2-79ef-4cfe-97e1-abb204ee420d
> 2024-03-25T13:38:54,687 INFO  [main {}] compress.CodecPool: Got brand-new 
> compressor [.deflate]
> 2024-03-25T13:38:54,713 INFO  [main {}] mapreduce.HFileOutputFormat2: 
> Incremental output configured for tables: test-restored
> 2024-03-25T13:38:54,715 WARN  [main {}] mapreduce.TableMapReduceUtil: The 
> addDependencyJars(Configuration, Class...) method has been deprecated 
> since it is easy to use incorrectly. Most users should rely on 
> addDependencyJars(Job) instead. See HBASE-8386 for more details.
> 2024-03-25T13:38:54,742 WARN  [main {}] impl.MetricsConfig: Cannot locate 
> configuration: tried 
> hadoop-metrics2-jobtracker.properties,hadoop-metrics2.properties
> 2024-03-25T13:38:54,834 INFO  [main {}] input.FileInputFormat: Total input 
> files to process : 1
> 2024-03-25T13:38:54,853 INFO  [main {}] mapreduce.JobSubmitter: number of 
> splits:1
> 2024-03-25T13:38:54,964 INFO  [main {}] mapreduce.JobSubmitter: Submitting 
> tokens for job: job_local748155768_0001
> 2024-03-25T13:38:54,967 INFO  [main {}] mapreduce.JobSubmitter: Executing 
> with tokens: []
> 2024-03-25T13:38:55,076 INFO  [main {}] mapred.LocalDistributedCacheManager: 
> Creating symlink: 
> /tmp/hadoop-dieter/mapred/local/job_local748155768_0001_0768a243-06e8-4524-8a6d-016ddd75df52/libjars
>  <- 

[jira] [Resolved] (HBASE-28456) HBase Restore restores old data if data for the same timestamp is in different hfiles

2024-03-26 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28456?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28456.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

> HBase Restore restores old data if data for the same timestamp is in 
> different hfiles
> -
>
> Key: HBASE-28456
> URL: https://issues.apache.org/jira/browse/HBASE-28456
> Project: HBase
>  Issue Type: Bug
>  Components: backuprestore
>Affects Versions: 2.6.0, 3.0.0
>Reporter: Ruben Van Wanzeele
>Assignee: Bryan Beaudreault
>Priority: Blocker
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
> Attachments: 
> Add_incremental_test_for_HBASE-28456_Fix_HBASE-28412_for_incremental_test.patch,
>  ChangesOnHFilesOnSameTimestampAreNotCorrectlyRestored.java
>
>
> The restore brings back 'old' data when executing restore.
> It feels like the hfile sequence id is not respected during the restore.
> See testing code attached. The workaround solution is to trigger major 
> compaction before doing the backup (not really feasible for daily backups)
> We didn't investigate this yet, but this might also impact the merge of 
> multiple incremental backups (since that follows a similar code path merging 
> hfiles).
> This currently blocks our support for HBase backup and restore.
> Willing to participate in a solution if necessary.





[jira] [Resolved] (HBASE-28449) Fix BackupSystemTable Scans

2024-03-25 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28449?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28449.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Pushed to branch-2.6+. Thanks [~baugenreich]!

> Fix BackupSystemTable Scans 
> 
>
> Key: HBASE-28449
> URL: https://issues.apache.org/jira/browse/HBASE-28449
> Project: HBase
>  Issue Type: Bug
>Reporter: Briana Augenreich
>Assignee: Briana Augenreich
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> When calculating which WALs should be included in an incremental backup the 
> backup system does a prefix scan for the last roll log timestamp. This uses 
> the backup root in the prefix (.) If you happen to have 
> multiple backup roots where one is a root of the other, you'll get inaccurate 
> results. 
>  
> Since the rowkey is  let's modify 
> the prefix scan to be .





[jira] [Resolved] (HBASE-28453) Support a middle ground between the Average and Fixed interval rate limiters

2024-03-25 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28453?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28453.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: FixedIntervalRateLimiter now supports a custom refill 
interval via hbase.quota.rate.limiter.refill.interval.ms. Users of quotas may 
wish to change hbase.quota.rate.limiter to FixedIntervalRateLimiter and 
customize this new setting. It will likely lead to healthier backoffs for 
clients and more full quota utilization.
   Resolution: Fixed

Pushed to branch-2.6+. Thanks [~rmdmattingly] !
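
For reference, a minimal example of opting in via site configuration, using the 
property names from the release note above (the refill interval value shown is 
illustrative):
{code:xml}
<property>
  <name>hbase.quota.rate.limiter</name>
  <value>org.apache.hadoop.hbase.quotas.FixedIntervalRateLimiter</value>
</property>
<property>
  <name>hbase.quota.rate.limiter.refill.interval.ms</name>
  <value>100</value>
</property>
{code}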

> Support a middle ground between the Average and Fixed interval rate limiters
> 
>
> Key: HBASE-28453
> URL: https://issues.apache.org/jira/browse/HBASE-28453
> Project: HBase
>  Issue Type: Improvement
>Affects Versions: 2.6.0
>Reporter: Ray Mattingly
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
> Attachments: Screenshot 2024-03-21 at 2.08.51 PM.png, Screenshot 
> 2024-03-21 at 2.30.01 PM.png
>
>
> h3. Background
> HBase quotas support two rate limiters: a "fixed" and an "average" interval 
> rate limiter.
> h4. FixedIntervalRateLimiter
> The fixed interval rate limiter is simpler: it has a TimeUnit, say 1 second, 
> and it refills a resource allotment on the recurring interval. So you may get 
> 10 resources every second, and if you exhaust all 10 resources in the first 
> millisecond of an interval then you will need to wait 999ms to acquire even 1 
> more resource.
> h4. AverageIntervalRateLimiter
> The average interval rate limiter, HBase's default, allows for more flexibly 
> timed refilling of the resource allotment. Extending our previous example, 
> say you have a 10 reads/sec quota and you have exhausted all 10 resources 
> within 1ms of the last full refill. If you request 1 more read then, rather 
> than returning a 999ms wait interval indicating the next full refill time, 
> the rate limiter will recognize that you only need to wait 99ms before 1 read 
> can be available. After 100ms has passed in aggregate since the last full 
> refill, it will support the refilling of 1/10th the limit to facilitate the 
> request for 1/10th the resources.
> h3. The Problems with Current RateLimiters
> The problem with the fixed interval rate limiter is that it is too strict 
> from a latency perspective. It results in quota limits to which we cannot 
> fully subscribe with any consistency.
> The problem with the average interval rate limiter is that, in practice, it 
> is far too optimistic. For example, a real rate limiter might limit to 
> 100MB/sec of read IO per machine. Any multigets that come in will require 
> only a tiny fraction of this limit; for example, a 64kb block is only 0.06% 
> of the total. As a result, the vast majority of wait intervals end up being 
> tiny — like <5ms. This can actually cause an inverse of your intention, where 
> setting up a throttle causes a DDOS of your RPC layer via continuous 
> throttling and ~immediate retrying. I've discussed this problem in 
> https://issues.apache.org/jira/browse/HBASE-28429 and proposed a minimum wait 
> interval as the solution there; after some more thinking, I believe this new 
> rate limiter would be a less hacky solution to this deficit so I'd like to 
> close that Jira in favor of this one.
> See the attached chart where I put in place a 10k req/sec/machine throttle 
> for this user at 10:43 to try to curb this high traffic, and it resulted in a 
> huge spike of req/sec due to the throttle/retry loop created by the 
> AverageIntervalRateLimiter.
> h3. Original Proposal: PartialIntervalRateLimiter as a Solution
> I've implemented a RateLimiter which allows for partial chunks of the overall 
> interval to be refilled, by default these chunks are 10% (or 100ms of a 1s 
> interval). I've deployed this to a test cluster at my day job and have seen 
> this really help our ability to full subscribe to a quota limit without 
> executing superfluous retries. See the other attached chart which shows a 
> cluster undergoing a rolling restart from using FixedIntervalRateLimiter to 
> my new PartialIntervalRateLimiter and how it is then able to fully subscribe 
> to its allotted 25MB/sec/machine read IO quota.
> h3. Updated Proposal: Improving FixedIntervalRateLimiter
> Rather than implement a new rate limiter, we can make a lower touch change 
> which just adds support for a refill interval that is less than the time unit 
> on a FixedIntervalRateLimiter. This can be a no-op change for those who have 
> not opted into the feature by having the refill interval default to the time 
> unit. For clarity, see [my branch 

[jira] [Created] (HBASE-28455) do-release-docker fails to setup gpg agent proxy if proxy container is slow to start

2024-03-22 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28455:
-

 Summary: do-release-docker fails to setup gpg agent proxy if proxy 
container is slow to start
 Key: HBASE-28455
 URL: https://issues.apache.org/jira/browse/HBASE-28455
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


In do-release-docker.sh we spin up the gpg-agent-proxy container and then 
immediately run ssh-keyscan and then immediately run ssh. Despite having 
{{{}set -e{}}}, both of these can fail without failing the script. This 
manifests as a really hard-to-debug failure in the hbase-rm container with 
"gpg: no gpg-agent running in this session"

With some debugging I realized that the ssh tunnel had not been created. 
Looking at the logs, the gpg-agent-proxy.ssh-keyscan file is empty and the 
gpg-proxy.ssh.log shows a Connection refused error.

You'd think these would fail the script, but they don't for different reasons:
 # ssh-keyscan output is piped through sort. Running ssh-keyscan directly 
returns an error code, but piping it through sort turns it into a success code.
 # ssh is executed in background with {{{}&{}}}, which similarly loses the 
error code

I think we should add a step prior to ssh-keyscan which waits until port 6 
is available. I'm not sure how to retain the error codes in the above 2 
commands, but can try to look into that as well.





[jira] [Resolved] (HBASE-28338) Bounded leak of FSDataInputStream buffers from checksum switching

2024-03-19 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28338?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28338.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

> Bounded leak of FSDataInputStream buffers from checksum switching
> -
>
> Key: HBASE-28338
> URL: https://issues.apache.org/jira/browse/HBASE-28338
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> In FSDataInputStreamWrapper, the unbuffer() method caches an unbuffer 
> instance the first time it is called. When an FSDataInputStreamWrapper is 
> initialized, it has hbase checksum disabled.
> In HFileInfo.initTrailerAndContext we get the stream, read the trailer, then 
> call unbuffer. At this point, checksums have not been enabled yet via 
> prepareForBlockReader. So the call to unbuffer() caches the current 
> non-checksum stream as the unbuffer instance.
> Later, in initMetaAndIndex we do a similar thing. This time, 
> prepareForBlockReader has been called, so we are now using hbase checksums. 
> When initMetaAndIndex calls unbuffer(), it uses the old unbuffer instance 
> which actually has been closed when we switched to hbase checksums. So that 
> call does nothing, and the new no-checksum input stream is never unbuffered.
> I haven't seen this cause an issue with normal hdfs replication (though 
> haven't gone looking). It's very problematic for Erasure Coding because 
> DFSStripedInputStream holds a large buffer (numDataBlocks * cellSize, so 6mb 
> for RS-6-3-1024k) that is only used for stream reads NOT pread. The 
> FSDataInputStreamWrapper we are talking about here is only used for pread in 
> hbase, so those 6mb buffers just hang around totally unused but 
> unreclaimable. Since there is an input stream per StoreFile, this can add up 
> very quickly on big servers.





[jira] [Resolved] (HBASE-28385) Quota estimates are too optimistic for large scans

2024-03-13 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28385?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28385.
---
Fix Version/s: 3.0.0-beta-2
 Release Note: When hbase.quota.use.result.size.bytes is false, we will now 
estimate the amount of quota to grab for a scan based on the block bytes 
scanned of previous next() requests. This will increase throughput for large 
scans which might prefer to wait a little longer for a larger portion of the 
quota.
   Resolution: Fixed

> Quota estimates are too optimistic for large scans
> --
>
> Key: HBASE-28385
> URL: https://issues.apache.org/jira/browse/HBASE-28385
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ray Mattingly
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> Let's say you're running a table scan with a throttle of 100MB/sec per 
> RegionServer. Ideally your scans are going to pull down large results, often 
> containing hundreds or thousands of blocks.
> You will estimate each scan as costing a single block of read capacity, and 
> if your quota is already exhausted then the server will evaluate the backoff 
> required for your estimated consumption (1 block) to be available. This will 
> often be ~1ms, causing your retries to basically be immediate.
> Obviously it will routinely take much longer than 1ms for 100MB of IO to 
> become available in the given configuration, so your retries will be destined 
> to fail. At worst this can cause a saturation of your server's RPC layer, and 
> at best this causes erroneous exhaustion of the client's retries.
> We should find a way to make these estimates a bit smarter for large scans.
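
A rough sketch of the smarter estimate described above (the names and the simple 
averaging are illustrative, not the committed implementation):
{code:java}
// Estimate the next scan call's quota cost from the block bytes scanned by the
// scanner's previous next() calls, instead of always charging a single block.
final class ScanQuotaEstimateSketch {
  static long estimateNextCallCost(long blockBytesScannedSoFar, long priorNextCalls,
    long oneBlockBytes) {
    if (priorNextCalls <= 0) {
      return oneBlockBytes; // no history yet: fall back to a single block
    }
    long avgPerCall = blockBytesScannedSoFar / priorNextCalls;
    return Math.max(oneBlockBytes, avgPerCall);
  }
}
{code}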





[jira] [Created] (HBASE-28440) Add support for using mapreduce sort in HFileOutputFormat2

2024-03-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28440:
-

 Summary: Add support for using mapreduce sort in HFileOutputFormat2
 Key: HBASE-28440
 URL: https://issues.apache.org/jira/browse/HBASE-28440
 Project: HBase
  Issue Type: Improvement
  Components: backuprestore
Reporter: Bryan Beaudreault


Currently HFileOutputFormat2 uses CellSortReducer, which attempts to sort all 
of the cells of a row in memory using a TreeSet. There is a warning in the 
javadoc "If lots of columns per row, it will use lots of memory sorting." This 
can be problematic for WALPlayer, which uses HFileOutputFormat2. You could have a 
reasonably sized row which just gets lots of edits in the time period of WALs 
being replayed, and that would cause an OOM. We are seeing this in some cases 
with incremental backups.

MapReduce has built-in sorting capabilities which are not limited to sorting in 
memory. It can spill to disk as necessary to sort very large datasets. We can 
get this capability in HFileOutputFormat2 with a couple changes:
 # Add support for a KeyOnlyCellComparable type as the map output key
 # When configured, use 
job.setSortComparatorClass(CellWritableComparator.class) and 
job.setReducerClass(PreSortedCellsReducer.class)
 # Update WALPlayer to have a mode which can output this new comparable instead 
of ImmutableBytesWritable

CellWritableComparator exists already for the Import job, so there is some 
prior art. 
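
A hedged sketch of the proposed wiring (KeyOnlyCellComparable and 
PreSortedCellsReducer are the proposed, not-yet-existing classes from the list 
above; the sort comparator is the one Import already uses):
{code:java}
import org.apache.hadoop.mapreduce.Job;

final class MapReduceSortWiringSketch {
  static void configureMapReduceSort(Job job) {
    job.setMapOutputKeyClass(KeyOnlyCellComparable.class);      // proposed comparable key
    job.setSortComparatorClass(CellWritableComparator.class);   // existing, from the Import job
    job.setReducerClass(PreSortedCellsReducer.class);           // proposed streaming reducer
  }
}
{code}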





[jira] [Resolved] (HBASE-28260) Possible data loss in WAL after RegionServer crash

2024-03-12 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28260.
---
Fix Version/s: 2.5.9
   Resolution: Fixed

Pushed to branch-2.5

> Possible data loss in WAL after RegionServer crash
> --
>
> Key: HBASE-28260
> URL: https://issues.apache.org/jira/browse/HBASE-28260
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Charles Connell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2, 2.5.9
>
>
> We recently had a production incident:
>  # RegionServer crashes, but local DataNode lives on
>  # WAL lease recovery kicks in
>  # Namenode reconstructs the block during lease recovery (which results in a 
> new genstamp). It chooses the replica on the local DataNode as the primary.
>  # Local DataNode reconstructs the block, so NameNode registers the new 
> genstamp.
>  # Local DataNode and the underlying host dies, before the new block could be 
> replicated to other replicas.
> This leaves us with a missing block, because the new genstamp block has no 
> replicas. The old replicas still remain, but are considered corrupt due to 
> GENSTAMP_MISMATCH.
> Thankfully we were able to confirm that the length of the corrupt blocks was 
> identical to the newly constructed and lost block. Further, the file in 
> question was only 1 block. So we downloaded one of those corrupt block files 
> and used {{hdfs dfs -put -f}} to force that block to replace the file in 
> hdfs. So in this case we had no actual data loss, but it could have happened 
> easily if the file was more than 1 block or the replicas weren't fully in 
> sync prior to reconstruction.
> In order to avoid this issue, we should avoid writing WAL blocks to the 
> local datanode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to 
> [~weichiu] for pointing this out.
> During reading of WALs we already reorder blocks so as to avoid reading from 
> the local datanode, but avoiding writing there altogether would be better.
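
For reference, a minimal sketch of passing the flag when creating a file on HDFS (hbase's actual WAL writer plumbing is more involved; this only shows the Hadoop API, where the flag is spelled CreateFlag.NO_LOCAL_WRITE):
{code:java}
import java.util.EnumSet;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

// Sketch: ask the NameNode not to place a replica on the local DataNode.
// fs, bufferSize, replication and blockSize are assumed to come from the caller.
FSDataOutputStream out = fs.create(
  new Path("/hbase/WALs/example-wal"),   // example path only
  FsPermission.getFileDefault(),
  EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.NO_LOCAL_WRITE),
  bufferSize, replication, blockSize, null);
{code}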



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-28260) Possible data loss in WAL after RegionServer crash

2024-03-12 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-28260:
---
  Assignee: Charles Connell

Actually, since this is a bug and it applies cleanly to branch-2.5, I'm 
reopening for cherry-pick there.

> Possible data loss in WAL after RegionServer crash
> --
>
> Key: HBASE-28260
> URL: https://issues.apache.org/jira/browse/HBASE-28260
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Charles Connell
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> We recently had a production incident:
>  # RegionServer crashes, but local DataNode lives on
>  # WAL lease recovery kicks in
>  # Namenode reconstructs the block during lease recovery (which results in a 
> new genstamp). It chooses the replica on the local DataNode as the primary.
>  # Local DataNode reconstructs the block, so NameNode registers the new 
> genstamp.
>  # Local DataNode and the underlying host die, before the new block could be 
> replicated to other replicas.
> This leaves us with a missing block, because the new genstamp block has no 
> replicas. The old replicas still remain, but are considered corrupt due to 
> GENSTAMP_MISMATCH.
> Thankfully we were able to confirm that the length of the corrupt blocks was 
> identical to the newly constructed and lost block. Further, the file in 
> question was only 1 block. So we downloaded one of those corrupt block files 
> and used {{hdfs dfs -put -f}} to force that block to replace the file in 
> hdfs. So in this case we had no actual data loss, but it could have happened 
> easily if the file was more than 1 block or the replicas weren't fully in 
> sync prior to reconstruction.
> In order to avoid this issue, we should avoid writing WAL blocks to the 
> local datanode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to 
> [~weichiu] for pointing this out.
> During reading of WALs we already reorder blocks so as to avoid reading from 
> the local datanode, but avoiding writing there altogether would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28260) Possible data loss in WAL after RegionServer crash

2024-03-12 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28260?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28260.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Pushed to branch-2.6+. Note that NO_LOCAL_WRITE was added back in 2016 for 
hbase's specific use, but apparently never used. So this Jira finally closes 
the loop on HDFS-3702. Thanks [~charlesconnell] for the contribution!

> Possible data loss in WAL after RegionServer crash
> --
>
> Key: HBASE-28260
> URL: https://issues.apache.org/jira/browse/HBASE-28260
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> We recently had a production incident:
>  # RegionServer crashes, but local DataNode lives on
>  # WAL lease recovery kicks in
>  # Namenode reconstructs the block during lease recovery (which results in a 
> new genstamp). It chooses the replica on the local DataNode as the primary.
>  # Local DataNode reconstructs the block, so NameNode registers the new 
> genstamp.
>  # Local DataNode and the underlying host die, before the new block could be 
> replicated to other replicas.
> This leaves us with a missing block, because the new genstamp block has no 
> replicas. The old replicas still remain, but are considered corrupt due to 
> GENSTAMP_MISMATCH.
> Thankfully we were able to confirm that the length of the corrupt blocks was 
> identical to the newly constructed and lost block. Further, the file in 
> question was only 1 block. So we downloaded one of those corrupt block files 
> and used {{hdfs dfs -put -f}} to force that block to replace the file in 
> hdfs. So in this case we had no actual data loss, but it could have happened 
> easily if the file was more than 1 block or the replicas weren't fully in 
> sync prior to reconstruction.
> In order to avoid this issue, we should avoid writing WAL blocks to the 
> local datanode. We can use CreateFlag.NO_LOCAL_WRITE for this. Hat tip to 
> [~weichiu] for pointing this out.
> During reading of WALs we already reorder blocks so as to avoid reading from 
> the local datanode, but avoiding writing there altogether would be better.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28359) Improve quota RateLimiter synchronization

2024-03-06 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28359?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28359.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Pushed to branch-2.6+. Thanks for the contribution [~rmdmattingly]!

> Improve quota RateLimiter synchronization
> -
>
> Key: HBASE-28359
> URL: https://issues.apache.org/jira/browse/HBASE-28359
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> We've been experiencing RpcThrottlingException with 0ms waitInterval. This 
> seems odd and wasteful, since the client side will immediately retry without 
> backoff. I think the problem is related to the synchronization of RateLimiter.
> The TimeBasedLimiter checkQuota method does the following:
> {code:java}
> if (!reqSizeLimiter.canExecute(estimateWriteSize + estimateReadSize)) {
>   RpcThrottlingException.throwRequestSizeExceeded(
> reqSizeLimiter.waitInterval(estimateWriteSize + estimateReadSize));
> } {code}
> Both canExecute and waitInterval are synchronized, but we're calling them 
> independently. So it's possible under high concurrency for canExecute to 
> return false, but then for waitInterval to return 0 (i.e. by the time it is called, the request would have been allowed).
> I think we should simplify the API to have a single synchronized call:
> {code:java}
> long waitInterval = reqSizeLimiter.tryAcquire(estimateWriteSize + 
> estimateReadSize);
> if (waitInterval > 0) {
>   RpcThrottlingException.throwRequestSizeExceeded(waitInterval);
> }{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28423) Improvements to backup of bulkloaded files

2024-03-06 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28423:
-

 Summary: Improvements to backup of bulkloaded files
 Key: HBASE-28423
 URL: https://issues.apache.org/jira/browse/HBASE-28423
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


Backup/Restore has support for including bulkloaded files in incremental 
backups. There is a coprocessor hook which registers all bulkloads into a 
backup:system_bulk table. A cleaner plugin ensures that these files are not 
cleaned up from the archive until they are backed up. When the incremental 
backup occurs, the files are deleted from the system_bulk table and then 
cleaned up.

We have encountered two problems to be solved with this:
 # The deletion process only happens during incremental backups, not full 
backups. A full backup already includes all data in the table via a snapshot 
export. So we should clear any pending bulkloads upon full backup.
 # There is currently no linking of bulkload state to backupRoot. It's possible 
to have multiple backupRoots for tables. For example, you might backup to 2 
destinations with different schedules. Currently whichever backupRoot does an 
incremental backup first will be the one to include bulkloads, then the 
system_bulk table. We need some sort of mapping of bulkload to backupRoot, and 
we should only delete the rows from system_bulk once the files have been 
included in all active backupRoots.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28400) WAL readers treat any exception as EOFException, which can lead to data loss

2024-02-23 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28400:
-

 Summary: WAL readers treat any exception as EOFException, which 
can lead to data loss
 Key: HBASE-28400
 URL: https://issues.apache.org/jira/browse/HBASE-28400
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


In HBASE-28390, I found a bug in our WAL compression which manifests as an 
IllegalArgumentException or ArrayIndexOutOfBoundsException. Even worse is that 
ProtobufLogReader.readNext catches any Exception and rethrows it as an 
EOFException. EOFException gets handled in a variety of ways by the readers of 
WALs, and not all of them make sense for an exception that isn't really EOF.

For example, WALInputFormat catches EOFException and returns false for 
nextKeyValue(), effectively skipping the rest of the WAL file but not failing 
the job.

ReplicationSourceWALReader has some much more complicated handling of 
EOFException.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28390) WAL value compression fails for cells with large values

2024-02-23 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28390?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28390.
---
Fix Version/s: 2.6.0
   2.5.8
   3.0.0-beta-2
 Assignee: Bryan Beaudreault
   Resolution: Fixed

Pushed to branch-2.5+. Thanks [~apurtell] for the review

> WAL value compression fails for cells with large values
> ---
>
> Key: HBASE-28390
> URL: https://issues.apache.org/jira/browse/HBASE-28390
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2
>
>
> We are testing out WAL compression and noticed that it fails for large values 
> when both features (wal compression and wal value compression) are enabled. 
> It works fine with either feature independently, but not when combined. It 
> seems to fail for all of the value compressor types, and the failure is in 
> the LRUDictionary of wal key compression:
>  
> {code:java}
> java.io.IOException: Error  while reading 2 WAL KVs; started reading at 230 
> and read up to 396
>     at 
> org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:94)
>  ~[classes/:?]
>     at 
> org.apache.hadoop.hbase.wal.CompressedWALTestBase.doTest(CompressedWALTestBase.java:181)
>  ~[test-classes/:?]
>     at 
> org.apache.hadoop.hbase.wal.CompressedWALTestBase.testForSize(CompressedWALTestBase.java:129)
>  ~[test-classes/:?]
>     at 
> org.apache.hadoop.hbase.wal.CompressedWALTestBase.testLarge(CompressedWALTestBase.java:94)
>  ~[test-classes/:?]
>     at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:?]
>     at 
> jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>  ~[?:?]
>     at 
> jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:?]
>     at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
>     at 
> org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
>  ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
>  ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
>  ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
>  ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
> ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
>  ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
> ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
>  ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
>  ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
> ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
> ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
> ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
> ~[junit-4.13.2.jar:4.13.2]
>     at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
> ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
> ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
> ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
>  ~[junit-4.13.2.jar:4.13.2]
>     at 
> org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
>  ~[junit-4.13.2.jar:4.13.2]
>     at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
>     at java.lang.Thread.run(Thread.java:829) ~[?:?]
> Caused by: java.lang.IndexOutOfBoundsException: index (21) must be less than 
> size (1)
>     at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1371)
>  ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5]
>     at 
> org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1353)
>  ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5]
>     at 
> 

[jira] [Created] (HBASE-28396) Quota throttling can cause a leak of scanners

2024-02-22 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28396:
-

 Summary: Quota throttling can cause a leak of scanners
 Key: HBASE-28396
 URL: https://issues.apache.org/jira/browse/HBASE-28396
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


In RSRpcServices.scan, we check the quota after having created a new 
RegionScannerHolder. If the quota is exceeded, an exception will be thrown. In 
this case, we can't send the scannerName back to the client because it's just 
an exception. So the client will be forced to retry the openScanner call, but 
the RegionScannerHolder is not closed. Eventually the scanners will be cleaned 
up by the lease expiration, but this could cause many scanners to leak during 
periods of high throttling.

We could close the newly opened scanner before throwing the throttle exception, 
but I think it's better to not open the scanner at all until we've grabbed some 
quota.
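
In rough pseudocode, the proposed ordering would be something like this (the method names are simplified placeholders, not the exact RSRpcServices internals):
{code:java}
// Sketch of the proposed ordering in the scan RPC handler:
// acquire quota first; this may throw RpcThrottlingException before any
// scanner state exists, so a throttled request leaves nothing to leak.
OperationQuota quota = checkScanQuota(region, request);          // simplified placeholder
RegionScannerHolder holder = getOrCreateRegionScanner(request);  // only reached once quota is granted
{code}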



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28390) WAL compression fails for cells with large values when combined with WAL value compression

2024-02-21 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28390:
-

 Summary: WAL compression fails for cells with large values when 
combined with WAL value compression
 Key: HBASE-28390
 URL: https://issues.apache.org/jira/browse/HBASE-28390
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


We are testing out WAL compression and noticed that it fails for large values 
when both features (wal compression and wal value compression) are enabled. It 
works fine with either feature independently, but not when combined. It seems 
to fail for all of the value compressor types, and the failure is in the 
LRUDictionary of wal key compression:

 
{code:java}
java.io.IOException: Error  while reading 2 WAL KVs; started reading at 230 and 
read up to 396
    at 
org.apache.hadoop.hbase.regionserver.wal.ProtobufWALStreamReader.next(ProtobufWALStreamReader.java:94)
 ~[classes/:?]
    at 
org.apache.hadoop.hbase.wal.CompressedWALTestBase.doTest(CompressedWALTestBase.java:181)
 ~[test-classes/:?]
    at 
org.apache.hadoop.hbase.wal.CompressedWALTestBase.testForSize(CompressedWALTestBase.java:129)
 ~[test-classes/:?]
    at 
org.apache.hadoop.hbase.wal.CompressedWALTestBase.testLarge(CompressedWALTestBase.java:94)
 ~[test-classes/:?]
    at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
~[?:?]
    at 
jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 ~[?:?]
    at 
jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 ~[?:?]
    at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
    at 
org.junit.runners.model.FrameworkMethod$1.runReflectiveCall(FrameworkMethod.java:59)
 ~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.model.ReflectiveCallable.run(ReflectiveCallable.java:12)
 ~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.runners.model.FrameworkMethod.invokeExplosively(FrameworkMethod.java:56)
 ~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.statements.InvokeMethod.evaluate(InvokeMethod.java:17)
 ~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner$3.evaluate(ParentRunner.java:306) 
~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.runners.BlockJUnit4ClassRunner$1.evaluate(BlockJUnit4ClassRunner.java:100)
 ~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner.runLeaf(ParentRunner.java:366) 
~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:103)
 ~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.runners.BlockJUnit4ClassRunner.runChild(BlockJUnit4ClassRunner.java:63)
 ~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner$4.run(ParentRunner.java:331) 
~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner$1.schedule(ParentRunner.java:79) 
~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner.runChildren(ParentRunner.java:329) 
~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner.access$100(ParentRunner.java:66) 
~[junit-4.13.2.jar:4.13.2]
    at org.junit.runners.ParentRunner$2.evaluate(ParentRunner.java:293) 
~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.statements.RunBefores.evaluate(RunBefores.java:26) 
~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.statements.RunAfters.evaluate(RunAfters.java:27) 
~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:299)
 ~[junit-4.13.2.jar:4.13.2]
    at 
org.junit.internal.runners.statements.FailOnTimeout$CallableStatement.call(FailOnTimeout.java:293)
 ~[junit-4.13.2.jar:4.13.2]
    at java.util.concurrent.FutureTask.run(FutureTask.java:264) ~[?:?]
    at java.lang.Thread.run(Thread.java:829) ~[?:?]
Caused by: java.lang.IndexOutOfBoundsException: index (21) must be less than 
size (1)
    at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1371)
 ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5]
    at 
org.apache.hbase.thirdparty.com.google.common.base.Preconditions.checkElementIndex(Preconditions.java:1353)
 ~[hbase-shaded-miscellaneous-4.1.5.jar:4.1.5]
    at 
org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.get(LRUDictionary.java:153)
 ~[classes/:?]
    at 
org.apache.hadoop.hbase.io.util.LRUDictionary$BidirectionalLRUMap.access$000(LRUDictionary.java:79)
 ~[classes/:?]
    at 
org.apache.hadoop.hbase.io.util.LRUDictionary.getEntry(LRUDictionary.java:43) 
~[classes/:?]
    at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.readIntoArray(WALCellCodec.java:366)
 ~[classes/:?]
    at 
org.apache.hadoop.hbase.regionserver.wal.WALCellCodec$CompressedKvDecoder.parseCell(WALCellCodec.java:307)
 ~[classes/:?]
    at org.apache.hadoop.hbase.codec.BaseDecoder.advance(BaseDecoder.java:66) 
~[classes/:?]
    at 

[jira] [Resolved] (HBASE-28370) Default user quotas are refreshing too frequently

2024-02-19 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28370?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28370.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

> Default user quotas are refreshing too frequently
> -
>
> Key: HBASE-28370
> URL: https://issues.apache.org/jira/browse/HBASE-28370
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ray Mattingly
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> In [https://github.com/apache/hbase/pull/5666] we introduced default user 
> quotas, but I accidentally called UserQuotaState's default constructor rather 
> than passing in the current timestamp. The consequence is that we're 
> constantly refreshing these default user quotas, and this can be a bottleneck 
> for horizontal cluster scalability.
> This should be a 1 line fix in QuotaUtil's buildDefaultUserQuotaState method.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28376) Column family ns does not exist in region during upgrade to 3.0.0-beta-2

2024-02-17 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28376:
-

 Summary: Column family ns does not exist in region during upgrade 
to 3.0.0-beta-2
 Key: HBASE-28376
 URL: https://issues.apache.org/jira/browse/HBASE-28376
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


Upgrading from 2.5.x to 3.0.0-alpha-2, migrateNamespaceTable kicks in to copy 
data from the namespace table to an "ns" family of the meta table. If you don't 
have an "ns" family, the migration fails and the hmaster will crash loop. You 
then can't rollback, because the briefly alive upgraded hmaster created a 
procedure that can't be deserialized by 2.x (I don't have this log handy 
unfortunately). I tried pushing code to create the ns family on startup, but it 
doesnt work becuase the migration happens while the hmaster is still 
initializing.

So it seems imperative that you create the ns family before upgrading. We 
should handle this more gracefully.
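
Until that graceful handling exists, pre-creating the family from Java before the upgrade might look roughly like this (sketch only; it assumes the 2.x version in use permits altering hbase:meta):
{code:java}
import org.apache.hadoop.hbase.TableName;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.ColumnFamilyDescriptorBuilder;
import org.apache.hadoop.hbase.util.Bytes;

// Sketch: add the "ns" family to hbase:meta before starting the 3.x HMaster.
// `admin` is an open Admin handle obtained from the cluster connection.
TableName meta = TableName.valueOf("hbase:meta");
if (!admin.getDescriptor(meta).hasColumnFamily(Bytes.toBytes("ns"))) {
  admin.addColumnFamily(meta, ColumnFamilyDescriptorBuilder.of("ns"));
}
{code}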



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28365) ChaosMonkey batch suspend/resume action assume shell implementation

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28365:
-

 Summary: ChaosMonkey batch suspend/resume action assume shell 
implementation
 Key: HBASE-28365
 URL: https://issues.apache.org/jira/browse/HBASE-28365
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


These two actions have code like this:
{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
suspendRs(server);
  } catch (Shell.ExitCodeException e) {
LOG.warn("Problem suspending but presume successful; code={}", 
e.getExitCode(), e);
  }
  suspendedServers.add(server);
  break; {code}
This only catches that one Shell.ExitCodeException, but operators may have an 
implementation of ClusterManager which does not use shell. We should expand 
this to catch all exceptions.

The implication here is that the uncaught exception propagates, and we don't 
add the server to suspendedServers. If the suspension actually succeeded, this 
leaves some processes in a permanently suspended state until manual 
intervention occurs.
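
A minimal sketch of the wider catch (same structure as the snippet above):
{code:java}
case SUSPEND:
  server = serversToBeSuspended.remove();
  try {
    suspendRs(server);
  } catch (Exception e) {
    // Catch everything, not just Shell.ExitCodeException: non-shell ClusterManager
    // implementations may throw other exception types, and we still want to track
    // the server as (possibly) suspended so it gets resumed later.
    LOG.warn("Problem suspending but presume successful; server={}", server, e);
  }
  suspendedServers.add(server);
  break;
{code}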



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28364) Warn: Cache key had block type null, but was found in L1 cache

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28364:
-

 Summary: Warn: Cache key had block type null, but was found in L1 
cache
 Key: HBASE-28364
 URL: https://issues.apache.org/jira/browse/HBASE-28364
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


I'm running ITBLL against branch-2.6 and am seeing lots of these warnings. This is new to 
me. I would expect a warning to be rare or indicative of a real problem, 
but that's unclear from the code.

cc [~wchevreuil] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28363) Noisy exception from FlushRegionProcedure when result is CANNOT_FLUSH

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28363:
-

 Summary: Noisy exception from FlushRegionProcedure when result is 
CANNOT_FLUSH
 Key: HBASE-28363
 URL: https://issues.apache.org/jira/browse/HBASE-28363
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


Running ITBLL with chaos monkey in HBASE-28233. I noticed lots of exceptions:
{code:java}
[RS_FLUSH_OPERATIONS-regionserver/test-host:60020-1 
{event_type=RS_FLUSH_REGIONS, pid=741536}] ERROR 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler: pid=741536
java.io.IOException: Unable to complete flush {ENCODED => 
371d2ba6875913542893642c94634226, NAME => 
'IntegrationTestBigLinkedList,-\x82\xD8-\x82\xD8-\x80,1707761077516.371d2ba6875913542893642c94634226.',
 STARTKEY =
> '-\x82\xD8-\x82\xD8-\x80', ENDKEY => '3330'}
        at 
org.apache.hadoop.hbase.regionserver.FlushRegionCallable.doCall(FlushRegionCallable.java:61)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:35)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.procedure2.BaseRSProcedureCallable.call(BaseRSProcedureCallable.java:23)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.regionserver.handler.RSProcedureHandler.process(RSProcedureHandler.java:51)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.executor.EventHandler.run(EventHandler.java:104) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) 
~[?:?]
        at 
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) 
~[?:?]
        at java.lang.Thread.run(Thread.java:840) ~[?:?] {code}
I took a look at the HRegion.flushcache code, and there are 3 reasons for 
CANNOT_FLUSH. All only print at debug log level and none look like actual 
errors.

I think we shouldn't throw an exception here, or at least should downgrade to 
debug. It looks like a problem, but I don't think it actually is.

cc [~frostruan] 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28362) NPE calling bootstrapNodeManager during RegionServer initialization

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28362:
-

 Summary: NPE calling bootstrapNodeManager during RegionServer 
initialization
 Key: HBASE-28362
 URL: https://issues.apache.org/jira/browse/HBASE-28362
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


Shortly after starting up, if a RegionServer is getting requests from clients 
before it's ready (i.e. it restarts and clients haven't cleared their meta cache yet), 
it will throw an NPE. This is because netty may bind and start accepting 
requests before HRegionServer.preRegistrationInitialization finishes.

I think this is similar to https://issues.apache.org/jira/browse/HBASE-28088. 
It's not critical because the RS self-resolves within a few seconds, but it 
causes noise in the logs and probably errors for clients.
{code:java}
2024-02-13T18:24:02,537 [RpcServer.default.FPBQ.handler=6,queue=6,port=60020 
{}] ERROR org.apache.hadoop.hbase.ipc.RpcServer: Unexpected throwable object
java.lang.NullPointerException: Cannot invoke 
"org.apache.hadoop.hbase.regionserver.BootstrapNodeManager.getBootstrapNodes()" 
because "this.bootstrapNodeManager" is null
        at 
org.apache.hadoop.hbase.regionserver.HRegionServer.getBootstrapNodes(HRegionServer.java:4179)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.regionserver.RSRpcServices.getAllBootstrapNodes(RSRpcServices.java:4140)
 ~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at 
org.apache.hadoop.hbase.shaded.protobuf.generated.BootstrapNodeProtos$BootstrapNodeService$2.callBlockingMethod(BootstrapNodeProtos.java:1259)
 ~[hbase-protocol-shaded-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at org.apache.hadoop.hbase.ipc.RpcServer.call(RpcServer.java:438) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT]
        at org.apache.hadoop.hbase.ipc.CallRunner.run(CallRunner.java:124) 
~[hbase-server-2.6-hubspot-SNAPSHOT.jar:2.6-hubspot-SNAPSHOT] {code}
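
One possible guard, sketched under the assumption that a retryable "not running yet" style exception is preferable to an NPE while initialization finishes (the actual fix may differ):
{code:java}
// Hypothetical guard in HRegionServer.getBootstrapNodes()
if (this.bootstrapNodeManager == null) {
  // preRegistrationInitialization has not completed; tell the client to retry
  throw new ServerNotRunningYetException("Server is not running yet");
}
return this.bootstrapNodeManager.getBootstrapNodes();
{code}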



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28360) [hbase-thirdparty] Upgrade Netty to 4.1.107.Final

2024-02-13 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28360?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28360.
---
Fix Version/s: thirdparty-4.1.6
 Assignee: Bryan Beaudreault
   Resolution: Fixed

Thanks [~nihaljain.cs] and [~rajeshbabu] for the review

> [hbase-thirdparty] Upgrade Netty to 4.1.107.Final
> -
>
> Key: HBASE-28360
> URL: https://issues.apache.org/jira/browse/HBASE-28360
> Project: HBase
>  Issue Type: Task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: thirdparty-4.1.6
>
>
> https://netty.io/news/2024/02/13/4-1-107-Final.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28360) [hbase-thirdparty] Upgrade Netty to 4.1.107.Final

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28360:
-

 Summary: [hbase-thirdparty] Upgrade Netty to 4.1.107.Final
 Key: HBASE-28360
 URL: https://issues.apache.org/jira/browse/HBASE-28360
 Project: HBase
  Issue Type: Task
Reporter: Bryan Beaudreault


https://netty.io/news/2024/02/13/4-1-107-Final.html



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28359) Improve quota RateLimiter synchronization

2024-02-13 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28359:
-

 Summary: Improve quota RateLimiter synchronization
 Key: HBASE-28359
 URL: https://issues.apache.org/jira/browse/HBASE-28359
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


We've been experiencing RpcThrottlingException with 0ms waitInterval. This 
seems odd and wasteful, since the client side will immediately retry without 
backoff. I think the problem is related to the synchronization of RateLimiter.

The TimeBasedLimiter checkQuota method does the following:
{code:java}
if (!reqSizeLimiter.canExecute(estimateWriteSize + estimateReadSize)) {
  RpcThrottlingException.throwRequestSizeExceeded(
reqSizeLimiter.waitInterval(estimateWriteSize + estimateReadSize));
} {code}
Both canExecute and waitInterval are synchronized, but we're calling them 
independently. So it's possible under high concurrency for canExecute to return 
false, but then for waitInterval to return 0 (i.e. by the time it is called, the request would have been allowed).

I think we should simplify the API to have a single synchronized call:
{code:java}
long waitInterval = reqSizeLimiter.tryAcquire(estimateWriteSize + 
estimateReadSize);
if (waitInterval > 0) {
  RpcThrottlingException.throwRequestSizeExceeded(waitInterval);
}{code}
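
A minimal sketch of what tryAcquire could look like on RateLimiter (tryAcquire is the proposed addition; canExecute, consume and waitInterval are the existing methods referenced above):
{code:java}
// Proposed: check, consume, and compute the wait interval under a single lock,
// so concurrent callers can no longer observe an inconsistent
// canExecute/waitInterval pair.
public synchronized long tryAcquire(long amount) {
  if (canExecute(amount)) {
    consume(amount);
    return 0L;                  // caller may proceed immediately
  }
  return waitInterval(amount);  // > 0: caller throws RpcThrottlingException with this value
}
{code}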



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28352) HTable batch does not honor RpcThrottlingException waitInterval

2024-02-11 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28352?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28352.
---
Fix Version/s: 2.6.0
 Assignee: Bryan Beaudreault
   Resolution: Fixed

Pushed to branch-2 and branch-2.6. I did not include it in branch-2.5, because it 
seems we did not backport the original waitInterval support there. If we want 
it there, we should also backport HBASE-27798.

Thanks [~zhangduo] for the review!

> HTable batch does not honor RpcThrottlingException waitInterval
> ---
>
> Key: HBASE-28352
> URL: https://issues.apache.org/jira/browse/HBASE-28352
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0
>
>
> I noticed that we only honor the waitInterval in 
> RpcRetryingCaller.callWithRetries. But HTable.batch (AsyncProcess) uses 
> custom retry logic. We need to update it to honor the waitInterval



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28358) AsyncProcess inconsistent exception thrown for operation timeout

2024-02-11 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28358:
-

 Summary: AsyncProcess inconsistent exception thrown for operation 
timeout
 Key: HBASE-28358
 URL: https://issues.apache.org/jira/browse/HBASE-28358
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


I'm not sure if I'll get to this, but wanted to log it as a known issue.

AsyncProcess has a design where it breaks the batch into sub-batches based on 
regionserver, then submits a callable per regionserver in a threadpool. In the 
main thread, it calls waitUntilDone() with an operation timeout. If the 
callables don't finish within the operation timeout, a SocketTimeoutException 
is thrown. This exception is not very useful because it doesn't give you any 
sense of how many calls were in progress, on which servers, or why it's delayed.

Recently we've been improving the adherence to operation timeout within the 
callables themselves. The main driver here has been to ensure we don't 
erroneously clear the meta cache for operation timeout related errors. So we've 
added a new OperationTimeoutExceededException, which is thrown from within the 
callables and does not cause a meta cache clear. The added benefit is that if 
these bubble up to the caller, they are wrapped in 
RetriesExhaustedWithDetailsException which includes a lot more info about which 
server and which action is affected. 

Now we've covered most but not all cases where operation timeout is exceeded. 
So when exceeding operation timeout it's possible sometimes to see a 
SocketTimeoutException from waitUntilDone, and sometimes see 
OperationTimeoutExceededException from the callables. It will depend on which 
one fails first. It may be nice to finish the swing here, ensuring that we 
always throw OperationTimeoutExceededException from the callables.

The main remaining case is in the call to locateRegion, which hits meta and 
does not honor the call's operation timeout (instead meta operation timeout). 
Resolving this would require some refactoring of 
ConnectionImplementation.locateRegion to allow passing an operation timeout and 
having that affect the userRegionLock and meta scan.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28349) Atomic requests should increment read usage in quotas

2024-02-09 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28349?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28349.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: Conditional atomic mutations which involve a 
read-modify-write (increment/append) or check-and-mutate, will now count as 
both a read and write when evaluating quotas. Previously they would just count 
as a write, despite involving a read as well.
   Resolution: Fixed

> Atomic requests should increment read usage in quotas
> -
>
> Key: HBASE-28349
> URL: https://issues.apache.org/jira/browse/HBASE-28349
> Project: HBase
>  Issue Type: Improvement
>Reporter: Ray Mattingly
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> Right now atomic operations are just treated as a single write from the quota 
> perspective. Since an atomic operation also encompasses a read, it would make 
> sense to increment readNum and readSize counts appropriately.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28354) RegionSizeCalculator throws NPE when regions are in transition

2024-02-09 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28354:
-

 Summary: RegionSizeCalculator throws NPE when regions are in 
transition
 Key: HBASE-28354
 URL: https://issues.apache.org/jira/browse/HBASE-28354
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


When a region is in transition, it may briefly have a null ServerName in meta. 
The RegionSizeCalculator calls RegionLocator.getAllRegionLocations() and does 
not handle the possibility that a RegionLocation.getServerName() could be null. 
The ServerName is eventually passed into an Admin call, which results in an NPE.

This has come up in other contexts. For example, taking a look at 
getAllRegionLocations() impl, we have checks to ensure that we don't call null 
server names. We need to similarly handle the possibility of nulls in 
RegionSizeCalculator.
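
A rough sketch of the kind of guard needed (the loop shape is illustrative, not the exact RegionSizeCalculator code):
{code:java}
// Skip locations whose server is unknown because the region is in transition,
// instead of passing a null ServerName into an Admin call.
for (HRegionLocation location : regionLocator.getAllRegionLocations()) {
  if (location == null || location.getServerName() == null) {
    continue;  // treat the region's size as unknown (0) for this pass
  }
  // ... existing per-server size lookup using location.getServerName() ...
}
{code}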



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28352) HTable batch does not honor RpcThrottlingException waitInterval

2024-02-08 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28352:
-

 Summary: HTable batch does not honor RpcThrottlingException 
waitInterval
 Key: HBASE-28352
 URL: https://issues.apache.org/jira/browse/HBASE-28352
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


I noticed that we only honor the waitInterval in 
RpcRetryingCaller.callWithRetries. But HTable.batch (AsyncProcess) uses custom 
retry logic. We need to update it to honor the waitInterval



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27800) Add support for default user quotas using USER => 'all'

2024-02-07 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27800?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27800.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: Adds a bunch of new configs for default user machine quotas: 
hbase.quota.default.user.machine.read.num, 
hbase.quota.default.user.machine.read.size, 
hbase.quota.default.user.machine.write.num, 
hbase.quota.default.user.machine.write.size, 
hbase.quota.default.user.machine.request.num, 
hbase.quota.default.user.machine.request.size. Setting any these will apply the 
given limit as a default for users which are not explicitly covered by existing 
quotas defined through set_quota, etc.
   Resolution: Fixed

> Add support for default user quotas using USER => 'all' 
> 
>
> Key: HBASE-27800
> URL: https://issues.apache.org/jira/browse/HBASE-27800
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Ray Mattingly
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> If someone sets a quota with USER => 'all' (or maybe '*'), treat that as a 
> default quota for each individual user. When a request comes from a user, it 
> will lookup current QuotaState based on username. If one doesn't exist, it 
> will be pre-filled with whatever the 'all' quota was set to. Otherwise, if 
> you then define a quota for a specific user that will override whatever 
> default you have set for that user only.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-28345) Close HBase connection on exit from HBase Shell

2024-02-07 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28345?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-28345:
---

I don't see this backported to branch-3. Are you sure you cherry-picked 
everywhere?

> Close HBase connection on exit from HBase Shell
> ---
>
> Key: HBASE-28345
> URL: https://issues.apache.org/jira/browse/HBASE-28345
> Project: HBase
>  Issue Type: Bug
>  Components: shell
>Affects Versions: 2.4.17
>Reporter: Istvan Toth
>Assignee: Istvan Toth
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 2.4.18, 2.5.8, 3.0.0-beta-2
>
>
> When using Netty for the ZK client, hbase shell hangs on exit.
> This is caused by the non-daemon Netty threads that ZK creates.
> Whether ZK should create daemon threads for Netty or not is debatable, but 
> explicitly closing the connection in hbase shell on exit fixes the issue.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28348) Multi should return what results it can before rpc timeout

2024-02-07 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28348:
-

 Summary: Multi should return what results it can before rpc timeout
 Key: HBASE-28348
 URL: https://issues.apache.org/jira/browse/HBASE-28348
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


Scans have a nice feature where they try to return a heartbeat with whatever 
results they have accumulated before the rpc timeout expires. It targets 
returning in 1/2 the rpc timeout or max scanner time. The reason for scans is 
to avoid painful scanner timeouts which cause the scan to have to be restarted 
due to out of sync sequence id.

Multis have a similar problem. A big batch can come in which can't be served in 
the configured timeout. In this case the client side will abandon the request 
when the timeout is exceeded, and resubmit if there are retries/operation 
timeout left. This wastes work since it's likely that some of the results had 
been fetched by the time a timeout occurred.

Multis already can retry immediately when the batch exceeds the max result size 
limit. We can use the same functionality to also return when we've taken more 
than half the rpc timeout.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28347) Update ref guide about isolation guarantees for scans

2024-02-06 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28347:
-

 Summary: Update ref guide about isolation guarantees for scans
 Key: HBASE-28347
 URL: https://issues.apache.org/jira/browse/HBASE-28347
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


In the "Consistency of Scans" section of 
[https://hbase.apache.org/acid-semantics.html], there is some confusing and 
outdated information. First, it's hard to realize that it's specifically talking 
about consistency across rows. Secondly, it's outdated because in modern hbase 
we acquire and maintain a memstore readPt for the lifetime of a scan in a 
region. So we should retain read committed behavior across rows, at least 
within the scope of a region.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27687) Enhance quotas to consume blockBytesScanned rather than response size

2024-02-06 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27687.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: Read size quotas are now evaluated against block bytes 
scanned for a request, rather than result size. Block bytes scanned is a 
measure of the total size in bytes of all hfile blocks opened to serve a 
request. This results in a much more accurate picture of actual work done by a 
query and is the recommended mode. One can revert to the old behavior by 
setting hbase.quota.use.result.size.bytes to true.
   Resolution: Fixed

> Enhance quotas to consume blockBytesScanned rather than response size
> -
>
> Key: HBASE-27687
> URL: https://issues.apache.org/jira/browse/HBASE-27687
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> As of HBASE-27558 we now apply quota.getReadAvailable() to max block bytes 
> scanned by scans/multis. This issue enhances further so that we can track 
> read size consumed in Quotas based on block bytes scanned rather than 
> response size. In this mode, quotas would end-to-end be based on 
> blockBytesScanned.
> Right now we call quota.addGetResult or addScanResult. This would just be a 
> matter of no-oping those calls, and calling RpcCall.getBlockBytesScanned() in 
> Quota.close() instead.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28346) Expose checkQuota to Coprocessor Endpoints

2024-02-06 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28346:
-

 Summary: Expose checkQuota to Coprocessor Endpoints
 Key: HBASE-28346
 URL: https://issues.apache.org/jira/browse/HBASE-28346
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


Coprocessor endpoints may do non-trivial amounts of work, yet quotas do not 
throttle them. We can't generically apply quotas to coprocessors because we 
have no information on what a particular endpoint might do. One thing we could 
do is expose checkQuota to the RegionCoprocessorEnvironment. This way, 
coprocessor authors have the tools to ensure that quotas cover their 
implementations.

While adding this, we can update AggregationImplementation to call checkQuota 
since those endpoints can be quite expensive.
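
Usage from an endpoint might then look roughly like this (checkScanQuota is a placeholder name, since the environment-level API is only being proposed here):
{code:java}
// Hypothetical endpoint code once RegionCoprocessorEnvironment exposes a quota check
OperationQuota quota = env.checkScanQuota(scan);  // placeholder API; may throw RpcThrottlingException
try {
  // ... run the expensive aggregation over the region ...
} finally {
  quota.close();  // report actual usage back to the quota machinery
}
{code}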



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28343) Write codec class into hfile header/trailer

2024-02-05 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28343:
-

 Summary: Write codec class into hfile header/trailer
 Key: HBASE-28343
 URL: https://issues.apache.org/jira/browse/HBASE-28343
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


We recently started playing around with the new bundled compression libraries 
as of 2.5.0. Specifically, we are experimenting with the different zstd codecs. 
The book says that aircompressor's zstd is not data compatible with hadoops, 
but doesn't say the same about zstd-jni.

In our experiments we ended up in a state where some hfiles were encoded with 
zstd-jni (zstd.ZstdCodec) while others were encoded with hadoop 
(ZStandardCodec). At this point the cluster became extremely unstable, with 
some files unable to be read because they were encoded with a codec that didn't 
match the current runtime configuration. Changing the runtime configuration 
caused the other files to not be readable.

I think this problem could be solved by writing the classname of the codec used 
into the hfile. This could be used as a hint so that a regionserver can read 
hfiles compressed with any compression codec that it supports.

[~apurtell] do you have any thoughts here since you brought us all of these 
great compression options?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28216) HDFS erasure coding support for table data dirs

2024-02-05 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28216.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: If you use hadoop3, managing the erasure coding policy of a 
table's data directory is now possible with a new table descriptor setting 
ERASURE_CODING_POLICY. The policy you set must be available and enabled in 
hdfs, and hbase will validate that your cluster topology is sufficient to 
support that policy. After setting the policy, you must major compact the table 
for the change to take effect. Attempting to use this feature with hadoop2 will 
fail a validation check prior to making any changes.
   Resolution: Fixed

Thanks [~weichiu], [~nihaljain.cs], and [~zhangduo] for the advice and reviews! 
Merged to 2.6+.

We've been running this in production and it's helping to cut costs on some of 
our clusters.
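
As a usage sketch, setting the attribute through the generic descriptor API could look like the following (the named policy must already be enabled in HDFS, and a major compaction is still required for existing data):
{code:java}
// Sketch: set the new ERASURE_CODING_POLICY attribute on an existing table
TableName tn = TableName.valueOf("usertable");
TableDescriptor updated = TableDescriptorBuilder.newBuilder(admin.getDescriptor(tn))
  .setValue("ERASURE_CODING_POLICY", "RS-6-3-1024k")  // any policy enabled in HDFS
  .build();
admin.modifyTable(updated);
// then major compact the table so existing hfiles are rewritten under the policy
{code}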

> HDFS erasure coding support for table data dirs
> ---
>
> Key: HBASE-28216
> URL: https://issues.apache.org/jira/browse/HBASE-28216
> Project: HBase
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>  Labels: patch-available, pull-request-available
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> [Erasure 
> coding|https://hadoop.apache.org/docs/stable/hadoop-project-dist/hadoop-hdfs/HDFSErasureCoding.html]
>  (EC) is a hadoop-3 feature which can drastically reduce storage 
> requirements, at the expense of locality. At my company we have a few hbase 
> clusters which are extremely data dense and take mostly write traffic, fewer 
> reads (cold data). We'd like to reduce the cost of these clusters, and EC is 
> a great way to do that since it can reduce replication related storage costs 
> by 50%.
> It's possible to enable EC policies on sub directories of HDFS. One can 
> manually set this with {{{}hdfs ec -setPolicy -path 
> /hbase/data/default/usertable -policy {}}}. This can work without any 
> hbase support.
> One problem with that is a lack of visibility by operators into which tables 
> might have EC enabled. I think this is where HBase can help. Here's my 
> proposal:
>  * Add a new TableDescriptor and ColumnDescriptor field ERASURE_CODING_POLICY
>  * In ModifyTableProcedure preflightChecks, if ERASURE_CODING_POLICY is set, 
> verify that the requested policy is available and enabled via 
> DistributedFileSystem.
> getErasureCodingPolicies().
>  * During ModifyTableProcedure, add a new state for 
> MODIFY_TABLE_SYNC_ERASURE_CODING_POLICY.
>  ** When adding or changing a policy, use DistributedFileSystem.
> setErasureCodingPolicy to sync it for the data and archive dir of that table 
> (or column in table)
>  ** When removing the property or setting it to empty, use 
> DistributedFileSystem.
> unsetErasureCodingPolicy to remove it from the data and archive dir.
> Since this new API is in hadoop-3 only, we'll need to add a reflection 
> wrapper class for managing the calls and verifying that the API is available. 
> We'll similarly do that API check in preflightChecks.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28338) Bounded leak of FSDataInputStream buffers from checksum switching

2024-01-30 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28338:
-

 Summary: Bounded leak of FSDataInputStream buffers from checksum 
switching
 Key: HBASE-28338
 URL: https://issues.apache.org/jira/browse/HBASE-28338
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


In FSDataInputStreamWrapper, the unbuffer() method caches an unbuffer instance 
the first time it is called. When an FSDataInputStreamWrapper is initialized, 
it has hbase checksum disabled.

In HFileInfo.initTrailerAndContext we get the stream, read the trailer, then 
call unbuffer. At this point, checksums have not been enabled yet via 
prepareForBlockReader. So the call to unbuffer() caches the current 
non-checksum stream as the unbuffer instance.

Later, in initMetaAndIndex we do a similar thing. This time, 
prepareForBlockReader has been called, so we are now using hbase checksums. 
When initMetaAndIndex calls unbuffer(), it uses the old unbuffer instance which 
actually has been closed when we switched to hbase checksums. So that call does 
nothing, and the new no-checksum input stream is never unbuffered.

I haven't seen this cause an issue with normal hdfs replication (though I haven't 
gone looking). It's very problematic for Erasure Coding because 
DFSStripedInputStream holds a large buffer (numDataBlocks * cellSize, so 6mb 
for RS-6-3-1024k) that is only used for stream reads NOT pread. The 
FSDataInputStreamWrapper we are talking about here is only used for pread in 
hbase, so those 6mb buffers just hang around totally unused but unreclaimable. 
Since there is an input stream per StoreFile, this can add up very quickly on 
big servers.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28331) Client integration test fails after upgrading hadoop3 version to 3.3.x

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28331?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28331.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

[~zhangduo] feel free to re-open if something is pending here. I'm auditing 
fixVersions for 2.6.0 and see the commit has landed, so setting them and 
resolving now

> Client integration test fails after upgrading hadoop3 version to 3.3.x
> --
>
> Key: HBASE-28331
> URL: https://issues.apache.org/jira/browse/HBASE-28331
> Project: HBase
>  Issue Type: Bug
>  Components: hadoop3, jenkins
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> Saw this error when starting HBase cluster
> {noformat}
> 2024-01-25T11:25:01,838 ERROR 
> [master/jenkins-hbase21:16000:becomeActiveMaster] master.HMaster: Failed to 
> become active master
> java.lang.ClassCastException: 
> org.apache.hadoop.hdfs.protocol.proto.ClientNamenodeProtocolProtos$SetSafeModeRequestProto
>  cannot be cast to com.google.protobuf.Message
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:247)
>  ~[hadoop-common-3.3.5.jar:?]
>   at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:132)
>  ~[hadoop-common-3.3.5.jar:?]
>   at com.sun.proxy.$Proxy32.setSafeMode(Unknown Source) ~[?:?]
>   at 
> org.apache.hadoop.hdfs.protocolPB.ClientNamenodeProtocolTranslatorPB.setSafeMode(ClientNamenodeProtocolTranslatorPB.java:847)
>  ~[hadoop-hdfs-client-3.3.5.jar:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_362]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:433)
>  ~[hadoop-common-3.3.5.jar:?]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeMethod(RetryInvocationHandler.java:166)
>  ~[hadoop-common-3.3.5.jar:?]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invoke(RetryInvocationHandler.java:158)
>  ~[hadoop-common-3.3.5.jar:?]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler$Call.invokeOnce(RetryInvocationHandler.java:96)
>  ~[hadoop-common-3.3.5.jar:?]
>   at 
> org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:362)
>  ~[hadoop-common-3.3.5.jar:?]
>   at com.sun.proxy.$Proxy33.setSafeMode(Unknown Source) ~[?:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_362]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
>   at 
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) 
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>   at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_362]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
>   at 
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) 
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>   at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) 
> ~[?:1.8.0_362]
>   at 
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>  ~[?:1.8.0_362]
>   at java.lang.reflect.Method.invoke(Method.java:498) ~[?:1.8.0_362]
>   at 
> org.apache.hadoop.hbase.fs.HFileSystem$1.invoke(HFileSystem.java:363) 
> ~[hbase-server-4.0.0-alpha-1-SNAPSHOT.jar:4.0.0-alpha-1-SNAPSHOT]
>   at com.sun.proxy.$Proxy34.setSafeMode(Unknown Source) ~[?:?]
>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method) 
> ~[?:1.8.0_362]
>   at 
> 

[jira] [Resolved] (HBASE-26816) Fix CME in ReplicationSourceManager

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-26816.
---
Fix Version/s: 2.5.8
   Resolution: Fixed

This one was easy for me to cherry-pick, so I've done that and added 2.5.8 
fixVersion

> Fix CME in ReplicationSourceManager
> ---
>
> Key: HBASE-26816
> URL: https://issues.apache.org/jira/browse/HBASE-26816
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.6.0, 2.5.8, 2.4.11, 3.0.0-alpha-3
>
>
> Exception in thread "regionserver/hostname/ip:port" 
> java.util.ConcurrentModificationException
>         at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>         at java.util.ArrayList$Itr.next(ArrayList.java:851)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:832)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:162)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:155)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2623)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1175)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28190) Add slow sync log rolling test in TestAsyncLogRolling

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28190.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
   Resolution: Fixed

Looks like this issue is resolved. The commit landed in branch-2.6, and in 
branch-3 after the beta-1 release, so I'm setting the 2.6.0 and 3.0.0-beta-2 
fixVersions.

> Add slow sync log rolling test in TestAsyncLogRolling
> -
>
> Key: HBASE-28190
> URL: https://issues.apache.org/jira/browse/HBASE-28190
> Project: HBase
>  Issue Type: Improvement
>  Components: test
>Reporter: zhuyaogai
>Assignee: zhuyaogai
>Priority: Minor
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> There is a test for slow sync log rolling in `TestLogRolling`, but not in 
> `TestAsyncLogRolling`, so add it in `TestAsyncLogRolling`.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27784) support quota user overrides

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27784?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27784.
---
Fix Version/s: 2.6.0
   3.0.0-beta-1
 Release Note: Adds a RegionServer config hbase.quota.user.override.key 
which can be set to the name of a request attribute whose value should be used 
as the username when evaluating quotas.
   Resolution: Fixed
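
As an illustration of the release note above, a minimal hbase-site.xml sketch; 
the attribute name "callerContext" is purely a made-up example:
{code:java}
<property>
  <!-- Name of the request attribute whose value is used as the username
       when evaluating quotas (illustrative value) -->
  <name>hbase.quota.user.override.key</name>
  <value>callerContext</value>
</property>
{code}
Clients that send that request attribute would then be evaluated against a user 
quota defined for the attribute's value (for example {{set_quota TYPE => 
THROTTLE, USER => 'my-proxy-api', LIMIT => '100req/sec'}}) rather than against 
their authenticated username.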

> support quota user overrides
> 
>
> Key: HBASE-27784
> URL: https://issues.apache.org/jira/browse/HBASE-27784
> Project: HBase
>  Issue Type: New Feature
>Reporter: Bryan Beaudreault
>Assignee: Ray Mattingly
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> The below is the original idea that started this work, but not what we 
> actually landed on. See the first comment from [~rmdmattingly] and the 
> release note for that.
>  
> Old description:
> {quote}Currently we provide the ability to define quotas for namespaces, 
> tables, or users. On multi-tenant clusters, users may be broken down into 
> groups based on their use-case. For us this comes down to 2 main cases:
>  # Hadoop jobs – it would be good to be able to limit all hadoop jobs in 
> aggregate
>  # Proxy APIs - this is common where upstream callers don't hit hbase 
> directly, instead they go through one of many proxy api's.  For us we have a 
> custom auth plugin which sets the username to the upstream caller name. But 
> it would still be useful to be able to limit all usage from some particular 
> proxy API in aggregate.
> I think this could build upon the idea for Connection attributes in 
> HBASE-27657. Basically when a Connection is established we can set an 
> attribute (i.e. quotaGrouping=hadoop or quotaGrouping=MyProxyAPI).  In 
> QuotaCache, we can add a {{getQuotaGroupLimiter(String groupName)}} and also 
> allow someone to define quotas using {{set_quota TYPE => THROTTLE, GROUP => 
> 'hadoop', LIMIT => '100M/sec'}}
> I need to do more investigation into whether we'd want to return a simple 
> group limiter (more similar to table/namespace handling) or treat it more 
> like the USER limiters which returns a QuotaState (so you can limit 
> by-group-by-table).
> We need to consider how GROUP quotas interact with USER quotas. If a user has 
> a quota defined, and that user is also part of a group with a quota defined, 
> does the request need to honor both quotas? Maybe we provide a GROUP_BYPASS 
> setting, similar to GLOBAL_BYPASS?
> {quote}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-26625) ExportSnapshot tool failed to copy data files for tables with merge region

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26625?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-26625:
---

[~meiyi] this Jira is not in branch-2.5, so shouldn't have 2.5.0 fixVersion. 
I'm adding 2.6.0 now. If you think it should exist in 2.5.x (probably?) then 
please cherry-pick there and re-add the latest 2.5.x fixVersion (2.5.8 right 
now)

> ExportSnapshot tool failed to copy data files for tables with merge region
> --
>
> Key: HBASE-26625
> URL: https://issues.apache.org/jira/browse/HBASE-26625
> Project: HBase
>  Issue Type: Bug
>Reporter: Yi Mei
>Assignee: Yi Mei
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.10
>
>
> When export snapshot for a table with merge regions, we found following 
> exceptions:
> {code:java}
> 2021-12-24 17:14:41,563 INFO  [main] snapshot.ExportSnapshot: Finalize the 
> Snapshot Export
> 2021-12-24 17:14:41,589 INFO  [main] snapshot.ExportSnapshot: Verify snapshot 
> integrity
> 2021-12-24 17:14:41,683 ERROR [main] snapshot.ExportSnapshot: Snapshot export 
> failed
> org.apache.hadoop.hbase.snapshot.CorruptedSnapshotException: Missing parent 
> hfile for: 043a9fe8aa7c469d8324956a57849db5.8e935527eb39a2cf9bf0f596754b5853 
> path=A/a=t42=8e935527eb39a2cf9bf0f596754b5853-043a9fe8aa7c469d8324956a57849db5
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:232)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.concurrentVisitReferencedFiles(SnapshotReferenceUtil.java:195)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:172)
>     at 
> org.apache.hadoop.hbase.snapshot.SnapshotReferenceUtil.verifySnapshot(SnapshotReferenceUtil.java:156)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.verifySnapshot(ExportSnapshot.java:851)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.doWork(ExportSnapshot.java:1096)
>     at 
> org.apache.hadoop.hbase.util.AbstractHBaseTool.run(AbstractHBaseTool.java:154)
>     at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
>     at 
> org.apache.hadoop.hbase.util.AbstractHBaseTool.doStaticMain(AbstractHBaseTool.java:280)
>     at 
> org.apache.hadoop.hbase.snapshot.ExportSnapshot.main(ExportSnapshot.java:1144)
>  {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-26816) Fix CME in ReplicationSourceManager

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26816?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-26816:
---

[~Xiaolin Ha] this Jira is not present in branch-2.5, but has 2.5.0 fixVersion. 
Do you want to cherry-pick it there, or remove 2.5.0?

> Fix CME in ReplicationSourceManager
> ---
>
> Key: HBASE-26816
> URL: https://issues.apache.org/jira/browse/HBASE-26816
> Project: HBase
>  Issue Type: Bug
>  Components: Replication
>Affects Versions: 2.4.10
>Reporter: Xiaolin Ha
>Assignee: Xiaolin Ha
>Priority: Minor
> Fix For: 2.5.0, 3.0.0-alpha-3, 2.4.11
>
>
> Exception in thread "regionserver/hostname/ip:port" 
> java.util.ConcurrentModificationException
>         at java.util.ArrayList$Itr.checkForComodification(ArrayList.java:901)
>         at java.util.ArrayList$Itr.next(ArrayList.java:851)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.ReplicationSourceManager.join(ReplicationSourceManager.java:832)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.join(Replication.java:162)
>         at 
> org.apache.hadoop.hbase.replication.regionserver.Replication.stopReplicationService(Replication.java:155)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.stopServiceThreads(HRegionServer.java:2623)
>         at 
> org.apache.hadoop.hbase.regionserver.HRegionServer.run(HRegionServer.java:1175)
>         at java.lang.Thread.run(Thread.java:745)



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-26642) Increase the timeout for TestStochasticLoadBalancerRegionReplicaLargeCluster

2024-01-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-26642?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-26642.
---
Fix Version/s: 2.6.0
   Resolution: Fixed

> Increase the timeout for TestStochasticLoadBalancerRegionReplicaLargeCluster
> 
>
> Key: HBASE-26642
> URL: https://issues.apache.org/jira/browse/HBASE-26642
> Project: HBase
>  Issue Type: Improvement
>  Components: Balancer, test
>Affects Versions: 2.5.0, 2.6.0
>Reporter: Duo Zhang
>Assignee: Duo Zhang
>Priority: Major
> Fix For: 2.6.0
>
>
> TestStochasticLoadBalancerRegionReplicaLargeCluster is on the flaky list for 
> branch-2, it fails 50%+.
> Looking at the output, sometimes it can not finish all the calculation in 
> time, so let's see if increasing the timeout can help here.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28335) Expose CacheStats ageAtEviction histogram in jmx

2024-01-29 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28335:
-

 Summary: Expose CacheStats ageAtEviction histogram in jmx
 Key: HBASE-28335
 URL: https://issues.apache.org/jira/browse/HBASE-28335
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


In CacheStats we keep track of the ageAtEviction in a histogram. This is 
exposed in the UI, but not via jmx. Expose via jmx as well for easier tracking 
over time.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28334) Remove unnecessary null DEFAULT_VALUE in TableDescriptorBuilder

2024-01-29 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28334:
-

 Summary: Remove unnecessary null DEFAULT_VALUE in 
TableDescriptorBuilder
 Key: HBASE-28334
 URL: https://issues.apache.org/jira/browse/HBASE-28334
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


With ERASURE_CODING_POLICY, the default value is null (no policy). I added a 
record of that in DEFAULT_VALUES, because other settings seemed to do that.

A null value is never stored on an HTD because our code removes the entry from 
the map when setting null. So we'd never have an opportunity to match against the 
DEFAULT_VALUE. If someone tried setting a string value "null", that would fail 
validation because it's not a valid policy. So there's no reason to record this 
default value. It doesn't cause a problem, but is confusing to anyone reading 
the code. Remove it.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28327) Add remove(String key, Metric metric) method to MetricRegistry interface

2024-01-25 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28327?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28327.
---
Fix Version/s: 2.6.0
   2.5.8
   3.0.0-beta-2
   Resolution: Fixed

Pushed to all active branches. Thanks for the contribution [~eboland148]!

> Add remove(String key, Metric metric) method to MetricRegistry interface
> 
>
> Key: HBASE-28327
> URL: https://issues.apache.org/jira/browse/HBASE-28327
> Project: HBase
>  Issue Type: Improvement
>Reporter: Evelyn Boland
>Assignee: Evelyn Boland
>Priority: Major
> Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2
>
>
> Add a `remove(String name, Metric metric)` method to the `MetricRegistry` 
> interface. Right now the interface only contains a `remove(String name)` 
> method.
> This additional remove method will give users the power to remove a `Metric` 
> with the specified `name` from the metric registry if and only if the 
> provided `metric` matches the object in the registry.
> Implementing the new `remove(String name, Metric metric)` should be 
> straightforward because the `MetricRegistryImpl` class stores metrics in a 
> `ConcurrentMap`, which already contains a `remove(Object key, Object value)` 
> method.
> This change will not be a breaking one because the interface is marked with 
> `@InterfaceStability.Evolving`.
> [~rmdmattingly]
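
A rough sketch of the idea described above, assuming a ConcurrentMap-backed 
registry as noted; the types and names below are illustrative, not the actual 
patch:
{code:java}
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Illustrative stand-ins for the real MetricRegistry/Metric types.
interface Metric {}

class SketchMetricRegistry {
  private final ConcurrentMap<String, Metric> metrics = new ConcurrentHashMap<>();

  /** Existing behaviour: remove whatever is registered under this name. */
  boolean remove(String name) {
    return metrics.remove(name) != null;
  }

  /** New behaviour: remove only if the registered object is the supplied one. */
  boolean remove(String name, Metric metric) {
    // ConcurrentMap.remove(key, value) is atomic, so a Metric re-registered
    // concurrently under the same name is never removed by mistake.
    return metrics.remove(name, metric);
  }
}
{code}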



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28302) Add tracking of fs read times in ScanMetrics and slow logs

2024-01-23 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28302?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28302.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: Adds a new getFsReadTime() to the slow log records, and 
fsReadTime counter to ScanMetrics. In both cases, this is the cumulative time 
spent reading blocks from hdfs for the given request. Additionally, a new 
fsSlowReadsCount jmx metric is added to the sub=IO bean. This is the count of 
HDFS reads which took longer than hbase.fs.reader.warn.time.ms.
 Assignee: Bryan Beaudreault
   Resolution: Fixed

Thanks [~ndimiduk] for the review! Pushed to master, branch-3, branch-2, 
branch-2.6.
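
As a usage note on the release note above, the slow-read threshold is a plain 
hbase-site.xml setting; the 100ms value below is only an example, not the 
shipped default:
{code:java}
<property>
  <!-- HDFS reads slower than this (ms) are counted in the fsSlowReadsCount
       metric under the sub=IO bean (example value) -->
  <name>hbase.fs.reader.warn.time.ms</name>
  <value>100</value>
</property>
{code}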

> Add tracking of fs read times in ScanMetrics and slow logs
> --
>
> Key: HBASE-28302
> URL: https://issues.apache.org/jira/browse/HBASE-28302
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> We've had this in our production for a while, and it's useful info to have. 
> We already track FS read times in 
> [HFileBlock|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1828-L1831C10].
>  We can project that into the ScanMetrics instance and slow log pretty 
> easily. It is also helpful to add a slow.fs.read.threshold, over which we log 
> a warn



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-27966) HBase Master/RS JVM metrics populated incorrectly

2024-01-23 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-27966.
---
Resolution: Fixed

Pushed to branch-3. Test looks good there. I quickly checked the other branches 
and looks like they were properly backported.

> HBase Master/RS JVM metrics populated incorrectly
> -
>
> Key: HBASE-27966
> URL: https://issues.apache.org/jira/browse/HBASE-27966
> Project: HBase
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.0.0-alpha-4
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1, 2.5.6
>
> Attachments: test_patch.txt
>
>
> HBase Master/RS JVM metrics are populated incorrectly due to a regression, 
> causing the Ambari metrics system to be unable to capture them.
> Based on my analysis the issue affects all releases after 2.0.0-alpha-4 and 
> appears to be caused by HBASE-18846.
> I have compared the JVM metrics across 3 versions of HBase; the results are 
> attached below:
> HBase: 1.1.2
> {code:java}
> {
> "name" : "Hadoop:service=HBase,name=JvmMetrics",
> "modelerType" : "JvmMetrics",
> "tag.Context" : "jvm",
> "tag.ProcessName" : "RegionServer",
> "tag.SessionId" : "",
> "tag.Hostname" : "HOSTNAME",
> "MemNonHeapUsedM" : 196.05664,
> "MemNonHeapCommittedM" : 347.60547,
> "MemNonHeapMaxM" : 4336.0,
> "MemHeapUsedM" : 7207.315,
> "MemHeapCommittedM" : 66080.0,
> "MemHeapMaxM" : 66080.0,
> "MemMaxM" : 66080.0,
> "GcCount" : 3953,
> "GcTimeMillis" : 662520,
> "ThreadsNew" : 0,
> "ThreadsRunnable" : 214,
> "ThreadsBlocked" : 0,
> "ThreadsWaiting" : 626,
> "ThreadsTimedWaiting" : 78,
> "ThreadsTerminated" : 0,
> "LogFatal" : 0,
> "LogError" : 0,
> "LogWarn" : 0,
> "LogInfo" : 0
>   },
> {code}
> HBase 2.0.2
> {code:java}
> {
> "name" : "Hadoop:service=HBase,name=JvmMetrics",
> "modelerType" : "JvmMetrics",
> "tag.Context" : "jvm",
> "tag.ProcessName" : "IO",
> "tag.SessionId" : "",
> "tag.Hostname" : "HOSTNAME",
> "MemNonHeapUsedM" : 203.86688,
> "MemNonHeapCommittedM" : 740.6953,
> "MemNonHeapMaxM" : -1.0,
> "MemHeapUsedM" : 14879.477,
> "MemHeapCommittedM" : 31744.0,
> "MemHeapMaxM" : 31744.0,
> "MemMaxM" : 31744.0,
> "GcCount" : 75922,
> "GcTimeMillis" : 5134691,
> "ThreadsNew" : 0,
> "ThreadsRunnable" : 90,
> "ThreadsBlocked" : 3,
> "ThreadsWaiting" : 158,
> "ThreadsTimedWaiting" : 36,
> "ThreadsTerminated" : 0,
> "LogFatal" : 0,
> "LogError" : 0,
> "LogWarn" : 0,
> "LogInfo" : 0
>   },
> {code}
> HBase: 2.5.2
> {code:java}
> {
>   "name": "Hadoop:service=HBase,name=JvmMetrics",
>   "modelerType": "JvmMetrics",
>   "tag.Context": "jvm",
>   "tag.ProcessName": "IO",
>   "tag.SessionId": "",
>   "tag.Hostname": "HOSTNAME",
>   "MemNonHeapUsedM": 192.9798,
>   "MemNonHeapCommittedM": 198.4375,
>   "MemNonHeapMaxM": -1.0,
>   "MemHeapUsedM": 773.23584,
>   "MemHeapCommittedM": 1004.0,
>   "MemHeapMaxM": 1024.0,
>   "MemMaxM": 1024.0,
>   "GcCount": 2048,
>   "GcTimeMillis": 25440,
>   "ThreadsNew": 0,
>   "ThreadsRunnable": 22,
>   "ThreadsBlocked": 0,
>   "ThreadsWaiting": 121,
>   "ThreadsTimedWaiting": 49,
>   "ThreadsTerminated": 0,
>   "LogFatal": 0,
>   "LogError": 0,
>   "LogWarn": 0,
>   "LogInfo": 0
>  },
> {code}
> It can be observed that 2.0.x onwards the field "tag.ProcessName" is 
> populating as "IO" instead of expected "RegionServer" or "Master".
> Ambari relies on this field process name to create a metric 
> 'jvm.RegionServer.JvmMetrics.GcTimeMillis' etc. See 
> [code.|https://github.com/apache/ambari/blob/2ec4b055d99ec84c902da16dd57df91d571b48d6/ambari-server/src/main/java/org/apache/ambari/server/controller/metrics/timeline/AMSPropertyProvider.java#L722]
> But post 2.0.x the field is getting populated as 'IO' and hence a metric with 
> name 'jvm.JvmMetrics.GcTimeMillis' is created instead of expected 
> 'jvm.RegionServer.JvmMetrics.GcTimeMillis', thus mixing up the metric with 
> various other metrics coming from rs, master, spark executor etc. running on 
> same host.
> *Expected*
> Field "tag.ProcessName" should be populated as "RegionServer" or "Master" 
> instead of "IO".
> *Actual*
> Field "tag.ProcessName" is populating as "IO" instead of expected 
> "RegionServer" or "Master" causing incorrect metric being published by ambari 
> and thus mixing up all metrics and raising various alerts around JVM metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Reopened] (HBASE-27966) HBase Master/RS JVM metrics populated incorrectly

2024-01-23 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-27966?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault reopened HBASE-27966:
---

Re-opening because I just realized that this was not included in branch-3. 
Perhaps it was committed around the time we cut that branch. We need to 
cherry-pick to branch-3, which I will do shortly.

> HBase Master/RS JVM metrics populated incorrectly
> -
>
> Key: HBASE-27966
> URL: https://issues.apache.org/jira/browse/HBASE-27966
> Project: HBase
>  Issue Type: Bug
>  Components: metrics
>Affects Versions: 2.0.0-alpha-4
>Reporter: Nihal Jain
>Assignee: Nihal Jain
>Priority: Major
> Fix For: 2.6.0, 2.5.6, 3.0.0-beta-1
>
> Attachments: test_patch.txt
>
>
> HBase Master/RS JVM metrics are populated incorrectly due to a regression, 
> causing the Ambari metrics system to be unable to capture them.
> Based on my analysis the issue affects all releases after 2.0.0-alpha-4 and 
> appears to be caused by HBASE-18846.
> I have compared the JVM metrics across 3 versions of HBase; the results are 
> attached below:
> HBase: 1.1.2
> {code:java}
> {
> "name" : "Hadoop:service=HBase,name=JvmMetrics",
> "modelerType" : "JvmMetrics",
> "tag.Context" : "jvm",
> "tag.ProcessName" : "RegionServer",
> "tag.SessionId" : "",
> "tag.Hostname" : "HOSTNAME",
> "MemNonHeapUsedM" : 196.05664,
> "MemNonHeapCommittedM" : 347.60547,
> "MemNonHeapMaxM" : 4336.0,
> "MemHeapUsedM" : 7207.315,
> "MemHeapCommittedM" : 66080.0,
> "MemHeapMaxM" : 66080.0,
> "MemMaxM" : 66080.0,
> "GcCount" : 3953,
> "GcTimeMillis" : 662520,
> "ThreadsNew" : 0,
> "ThreadsRunnable" : 214,
> "ThreadsBlocked" : 0,
> "ThreadsWaiting" : 626,
> "ThreadsTimedWaiting" : 78,
> "ThreadsTerminated" : 0,
> "LogFatal" : 0,
> "LogError" : 0,
> "LogWarn" : 0,
> "LogInfo" : 0
>   },
> {code}
> HBase 2.0.2
> {code:java}
> {
> "name" : "Hadoop:service=HBase,name=JvmMetrics",
> "modelerType" : "JvmMetrics",
> "tag.Context" : "jvm",
> "tag.ProcessName" : "IO",
> "tag.SessionId" : "",
> "tag.Hostname" : "HOSTNAME",
> "MemNonHeapUsedM" : 203.86688,
> "MemNonHeapCommittedM" : 740.6953,
> "MemNonHeapMaxM" : -1.0,
> "MemHeapUsedM" : 14879.477,
> "MemHeapCommittedM" : 31744.0,
> "MemHeapMaxM" : 31744.0,
> "MemMaxM" : 31744.0,
> "GcCount" : 75922,
> "GcTimeMillis" : 5134691,
> "ThreadsNew" : 0,
> "ThreadsRunnable" : 90,
> "ThreadsBlocked" : 3,
> "ThreadsWaiting" : 158,
> "ThreadsTimedWaiting" : 36,
> "ThreadsTerminated" : 0,
> "LogFatal" : 0,
> "LogError" : 0,
> "LogWarn" : 0,
> "LogInfo" : 0
>   },
> {code}
> HBase: 2.5.2
> {code:java}
> {
>   "name": "Hadoop:service=HBase,name=JvmMetrics",
>   "modelerType": "JvmMetrics",
>   "tag.Context": "jvm",
>   "tag.ProcessName": "IO",
>   "tag.SessionId": "",
>   "tag.Hostname": "HOSTNAME",
>   "MemNonHeapUsedM": 192.9798,
>   "MemNonHeapCommittedM": 198.4375,
>   "MemNonHeapMaxM": -1.0,
>   "MemHeapUsedM": 773.23584,
>   "MemHeapCommittedM": 1004.0,
>   "MemHeapMaxM": 1024.0,
>   "MemMaxM": 1024.0,
>   "GcCount": 2048,
>   "GcTimeMillis": 25440,
>   "ThreadsNew": 0,
>   "ThreadsRunnable": 22,
>   "ThreadsBlocked": 0,
>   "ThreadsWaiting": 121,
>   "ThreadsTimedWaiting": 49,
>   "ThreadsTerminated": 0,
>   "LogFatal": 0,
>   "LogError": 0,
>   "LogWarn": 0,
>   "LogInfo": 0
>  },
> {code}
> It can be observed that 2.0.x onwards the field "tag.ProcessName" is 
> populating as "IO" instead of expected "RegionServer" or "Master".
> Ambari relies on this field process name to create a metric 
> 'jvm.RegionServer.JvmMetrics.GcTimeMillis' etc. See 
> [code.|https://github.com/apache/ambari/blob/2ec4b055d99ec84c902da16dd57df91d571b48d6/ambari-server/src/main/java/org/apache/ambari/server/controller/metrics/timeline/AMSPropertyProvider.java#L722]
> But post 2.0.x the field is getting populated as 'IO' and hence a metric with 
> name 'jvm.JvmMetrics.GcTimeMillis' is created instead of expected 
> 'jvm.RegionServer.JvmMetrics.GcTimeMillis', thus mixing up the metric with 
> various other metrics coming from rs, master, spark executor etc. running on 
> same host.
> *Expected*
> Field "tag.ProcessName" should be populated as "RegionServer" or "Master" 
> instead of "IO".
> *Actual*
> Field "tag.ProcessName" is populating as "IO" instead of expected 
> "RegionServer" or "Master" causing incorrect metric being published by ambari 
> and thus mixing up all metrics and raising various alerts around JVM metrics.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28320) Expose DelegatingRpcScheduler as IA.LimitedPrivate

2024-01-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28320?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28320.
---
Resolution: Duplicate

> Expose DelegatingRpcScheduler as IA.LimitedPrivate
> --
>
> Key: HBASE-28320
> URL: https://issues.apache.org/jira/browse/HBASE-28320
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler 
> itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible 
> change from HBASE-27144.
> We can limit the impact of breaking changes like this by exposing 
> DelegatingRpcScheduler to users. Users can extend this class and only 
> override the pieces that they care about, thus reducing the surface area of 
> compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28318) Expose DelegatingRpcScheduler as IA.LimitedPrivate

2024-01-17 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28318?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28318.
---
Resolution: Duplicate

> Expose DelegatingRpcScheduler as IA.LimitedPrivate
> --
>
> Key: HBASE-28318
> URL: https://issues.apache.org/jira/browse/HBASE-28318
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>
> We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler 
> itself is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible 
> change from HBASE-27144.
> We can limit the impact of breaking changes like this by exposing 
> DelegatingRpcScheduler to users. Users can extend this class and only 
> override the pieces that they care about, thus reducing the surface area of 
> compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28320) Expose DelegatingRpcScheduler as IA.LimitedPrivate

2024-01-17 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28320:
-

 Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate
 Key: HBASE-28320
 URL: https://issues.apache.org/jira/browse/HBASE-28320
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault
 Fix For: 2.5.8, 3.0.0-beta-2


We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself 
is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from 
HBASE-27144.

We can limit the impact of breaking changes like this by exposing 
DelegatingRpcScheduler to users. Users can extend this class and only override 
the pieces that they care about, thus reducing the surface area of 
compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28319) Expose DelegatingRpcScheduler as IA.LimitedPrivate

2024-01-17 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28319:
-

 Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate
 Key: HBASE-28319
 URL: https://issues.apache.org/jira/browse/HBASE-28319
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault
 Fix For: 2.5.8, 3.0.0-beta-2


We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself 
is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from 
HBASE-27144.

We can limit the impact of breaking changes like this by exposing 
DelegatingRpcScheduler to users. Users can extend this class and only override 
the pieces that they care about, thus reducing the surface area of 
compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28318) Expose DelegatingRpcScheduler as IA.LimitedPrivate

2024-01-17 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28318:
-

 Summary: Expose DelegatingRpcScheduler as IA.LimitedPrivate
 Key: HBASE-28318
 URL: https://issues.apache.org/jira/browse/HBASE-28318
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault
 Fix For: 2.5.8, 3.0.0-beta-2


We have DelegatingRpcScheduler in src/test of hbase-server. RpcScheduler itself 
is IA.LimitedPrivate, and in 2.6.0 we are pushing an incompatible change from 
HBASE-27144.

We can limit the impact of breaking changes like this by exposing 
DelegatingRpcScheduler to users. Users can extend this class and only override 
the pieces that they care about, thus reducing the surface area of 
compatibility issues.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28306) Add property to customize Version information

2024-01-16 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28306?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28306.
---
Fix Version/s: 2.6.0
   2.5.8
   3.0.0-beta-2
 Release Note: Added a new build property -Dversioninfo.version which can 
be used to influence the generated Version.java class in custom build 
scenarios. The version specified will show up in the HMaster UI and also have 
implications on various version-related checks. This is an advanced usage 
property and it's recommended not to stray too far from the default format of 
major.minor.patch-suffix.
   Resolution: Fixed

Pushed to all active release lines. Thanks [~zhangduo] for review!
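
For illustration, a hedged example of the override described in the release 
note above; the version string itself is made up:
{noformat}
# Bake a custom version into the generated Version.java, keeping the
# recommended major.minor.patch-suffix shape.
mvn clean install -DskipTests -Dversioninfo.version=2.6.0-acme.1
{noformat}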

> Add property to customize Version information
> -
>
> Key: HBASE-28306
> URL: https://issues.apache.org/jira/browse/HBASE-28306
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2
>
>
> In hbase-common we generate Version.java using the ${project.version} 
> property. In some custom builds, it may be necessary to override the project 
> version. The custom version may not be compatible with how Version works, or 
> the user may want to add extra metadata (like a build number). We can add a 
> property which defaults to ${project.version} but allows the user to specify 
> separately if desired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28316) Add BootstrapNodeService handlers

2024-01-16 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28316:
-

 Summary: Add BootstrapNodeService handlers
 Key: HBASE-28316
 URL: https://issues.apache.org/jira/browse/HBASE-28316
 Project: HBase
  Issue Type: Sub-task
Affects Versions: 3.0.0-beta-1, 2.6.0
Reporter: Bryan Beaudreault


We added calls to a BootstrapNodeService, but the servers are not setup to 
serve it. We need to add in two places:
 * RSRPCServices list of services: 
[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/regionserver/RSRpcServices.java#L1447]
 * HBasePolicyProvider mapping of acl to service: 
[https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/security/HBasePolicyProvider.java#L40]

Without adding to these two places, you first see UnknownServiceExceptions and 
then you see AccessDeniedExceptions

 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28315) Remove noisy WARN from trying to construct MetricsServlet

2024-01-16 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28315:
-

 Summary: Remove noisy WARN from trying to construct MetricsServlet
 Key: HBASE-28315
 URL: https://issues.apache.org/jira/browse/HBASE-28315
 Project: HBase
  Issue Type: Improvement
Affects Versions: 3.0.0-beta-1, 2.6.0
Reporter: Bryan Beaudreault


MetricsServlet has been deprecated since hadoop 2.8 and removed in hadoop3. In 
HBASE-20904 the servlet initialization was refactored, and we now log a noisy 
WARN (with stacktrace) when MetricsServlet does not exist. This case should be 
common, since hadoop3 is the modern version to run on (hadoop2 is almost EOL), 
so we shouldn't warn.

Fix the code to not produce a warn when MetricsServlet is not available.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28256) Enhance ByteBufferUtils.readVLong to read more bytes at a time

2024-01-16 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28256?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28256.
---
Fix Version/s: 2.6.0
   2.5.8
   3.0.0-beta-2
   Resolution: Fixed

Pushed to all active release branches. Thanks for the great work here 
[~bewing], and for the review [~zhangduo].

> Enhance ByteBufferUtils.readVLong to read more bytes at a time
> --
>
> Key: HBASE-28256
> URL: https://issues.apache.org/jira/browse/HBASE-28256
> Project: HBase
>  Issue Type: Improvement
>  Components: Performance
>Reporter: Becker Ewing
>Assignee: Becker Ewing
>Priority: Major
> Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2
>
> Attachments: ReadVLongBenchmark.zip, async-prof-rs-cpu.html
>
>
> Currently, ByteBufferUtils.readVLong is used to decode rows in all data block 
> encodings in order to read the memstoreTs field. For a data block encoding 
> like prefix, ByteBufferUtils.readVLong can surprisingly occupy over 50% of 
> the CPU time in BufferedEncodedSeeker.decodeNext (which can be quite a hot 
> method in seek operations).
>  
> Since memstoreTs will typically require at least 6 bytes to store, we could 
> look to vectorize the read path for readVLong to read 8 bytes at a time 
> instead of a single byte at a time (like in 
> https://issues.apache.org/jira/browse/HBASE-28025) in order to increase 
> performance.
>  
> Attached is a CPU flamegraph of a region server process which shows that we 
> spend a surprising amount of time in decoding rows from the DBE in 
> ByteBufferUtils.readVLong.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28307) Add hbase-openssl module and include in release binaries

2024-01-14 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28307?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28307.
---
Fix Version/s: 2.6.0
   3.0.0-beta-2
 Release Note: Adds a new org.apache.hbase:hbase-openssl module which users 
can add as a dependency in their project if they'd like to use tcnative with 
netty TLS. The bundled tcnative is statically linked to boringssl and properly 
shaded to just work with hbase netty. Additionally, the tcnative jar has been 
added to the release binaries published by hbase (through hbase-assembly)
   Resolution: Fixed

Thanks [~nihaljain.cs] and [~zhangduo] for the review!
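
For users of the new module, a minimal dependency snippet; the version should 
match whatever HBase release line you run against (2.6.0 here is just an 
example):
{code:java}
<dependency>
  <groupId>org.apache.hbase</groupId>
  <artifactId>hbase-openssl</artifactId>
  <version>2.6.0</version>
</dependency>
{code}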

> Add hbase-openssl module and include in release binaries
> 
>
> Key: HBASE-28307
> URL: https://issues.apache.org/jira/browse/HBASE-28307
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-2
>
>
> This will make it easier for someone to use, since a common deployment 
> strategy would involve untar'ing our bin assembly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28307) Include hbase-shaded-netty-tcnative in hbase-assembly

2024-01-12 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28307:
-

 Summary: Include hbase-shaded-netty-tcnative in hbase-assembly
 Key: HBASE-28307
 URL: https://issues.apache.org/jira/browse/HBASE-28307
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


This will make it easier for someone to use, since a common deployment strategy 
would involve untar'ing our bin assembly.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28306) Add property to customize Version information

2024-01-12 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28306:
-

 Summary: Add property to customize Version information
 Key: HBASE-28306
 URL: https://issues.apache.org/jira/browse/HBASE-28306
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


In hbase-common we generate Version.java using the ${project.version} property. 
In some custom builds, it may be necessary to override the project version. The 
custom version may not be compatible with how Version works, or the user may 
want to add extra metadata (like a build number). We can add a property which 
defaults to ${project.version} but allows the user to specify separately if 
desired.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28304) Add hbase-shaded-testing-util version to dependencyManagement

2024-01-11 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28304?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28304.
---
Fix Version/s: 2.6.0
   2.5.8
   3.0.0-beta-2
   Resolution: Fixed

Pushed to all active branches. Thanks [~zhangduo] for the review!

> Add hbase-shaded-testing-util version to dependencyManagement
> -
>
> Key: HBASE-28304
> URL: https://issues.apache.org/jira/browse/HBASE-28304
> Project: HBase
>  Issue Type: Task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 2.5.8, 3.0.0-beta-2
>
>
> hbase-shaded-testing-util is the only sub-module referenced as a dependency 
> in hbase poms which is not present in our parent pom dependencyManagement. 
> This causes issues in my employer's build, but is also good for consistency.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28304) Add hbase-shaded-testing-util version to dependencyManagement

2024-01-10 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28304:
-

 Summary: Add hbase-shaded-testing-util version to 
dependencyManagement
 Key: HBASE-28304
 URL: https://issues.apache.org/jira/browse/HBASE-28304
 Project: HBase
  Issue Type: Task
Reporter: Bryan Beaudreault


hbase-shaded-testing-util is the only sub-module referenced as a dependency in 
hbase poms which is not present in our parent pom dependencyManagement. This 
causes issues in my employer's build, but is also good for consistency.
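
For context, a sketch of the kind of entry this adds to the parent pom's 
dependencyManagement section (illustrative, not the exact patch):
{code:java}
<dependencyManagement>
  <dependencies>
    <dependency>
      <groupId>org.apache.hbase</groupId>
      <artifactId>hbase-shaded-testing-util</artifactId>
      <version>${project.version}</version>
    </dependency>
  </dependencies>
</dependencyManagement>
{code}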



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28302) Add tracking of fs read times in ScanMetrics, slow logs, and warn threshold

2024-01-10 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28302:
-

 Summary: Add tracking of fs read times in ScanMetrics, slow logs, 
and warn threshold
 Key: HBASE-28302
 URL: https://issues.apache.org/jira/browse/HBASE-28302
 Project: HBase
  Issue Type: Improvement
Reporter: Bryan Beaudreault


We've had this in our production for a while, and it's useful info to have. We 
already track FS read times in 
[HFileBlock|https://github.com/apache/hbase/blob/master/hbase-server/src/main/java/org/apache/hadoop/hbase/io/hfile/HFileBlock.java#L1828-L1831C10].
 We can project that into the ScanMetrics instance and slow log pretty easily. 
It is also helpful to add a slow.fs.read.threshold, over which we log a warn



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28291) [hbase-thirdparty] Update netty version

2024-01-04 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28291:
-

 Summary: [hbase-thirdparty] Update netty version
 Key: HBASE-28291
 URL: https://issues.apache.org/jira/browse/HBASE-28291
 Project: HBase
  Issue Type: Task
Reporter: Bryan Beaudreault
Assignee: Bryan Beaudreault


There is a CVE: 
[https://github.com/netty/netty/security/advisories/GHSA-xpw8-rcwv-8f8p.] It 
does not affect us, but we can clear it anyway.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28260) Possible data loss in WAL after RegionServer crash

2023-12-14 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28260:
-

 Summary: Possible data loss in WAL after RegionServer crash
 Key: HBASE-28260
 URL: https://issues.apache.org/jira/browse/HBASE-28260
 Project: HBase
  Issue Type: Bug
Reporter: Bryan Beaudreault


We recently had a production incident:
 # RegionServer crashes, but local DataNode lives on
 # WAL lease recovery kicks in
 # Namenode reconstructs the block during lease recovery (which results in a 
new genstamp). It chooses the replica on the local DataNode as the primary.
 # Local DataNode reconstructs the block, so NameNode registers the new 
genstamp.
 # The local DataNode and the underlying host die before the new block can be 
replicated to the other replicas.

This leaves us with a missing block, because the new genstamp block has no 
replicas. The old replicas still remain, but are considered corrupt due to 
GENSTAMP_MISMATCH.

Thankfully we were able to confirm that the length of the corrupt blocks was 
identical to that of the newly constructed and lost block. Further, the file in 
question was only 1 block. So we downloaded one of those corrupt block files 
and used {{hdfs dfs -put -f}} to force that block to replace the file in hdfs. 
So in this case we had no actual data loss, but it could have happened easily 
if the file was more than 1 block or the replicas weren't fully in sync prior 
to reconstruction.

In order to avoid this issue, we should avoid writing WAL blocks to the local 
datanode. We can use CreateFlag.NO_WRITE_LOCAL for this. Hat tip to [~weichiu] 
for pointing this out.

During reading of WALs we already reorder blocks so as to avoid reading from 
the local datanode, but avoiding writing there altogether would be better.
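
A rough sketch of the idea using the public FileSystem API; this assumes the 
flag exposed by Hadoop's CreateFlag enum (spelled NO_LOCAL_WRITE in recent 
Hadoop releases) and is an illustration, not the actual WAL writer change:
{code:java}
import java.util.EnumSet;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.CreateFlag;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.fs.permission.FsPermission;

public class NoLocalWalWriteSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path walPath = new Path(args[0]); // hypothetical WAL file path
    FileSystem fs = walPath.getFileSystem(conf);
    // Ask the NameNode not to place a replica on the local DataNode, so lease
    // recovery can never leave the only up-to-date replica on the dying host.
    EnumSet<CreateFlag> flags =
        EnumSet.of(CreateFlag.CREATE, CreateFlag.OVERWRITE, CreateFlag.NO_LOCAL_WRITE);
    try (FSDataOutputStream out = fs.create(walPath, FsPermission.getFileDefault(),
        flags, 4096, (short) 3, fs.getDefaultBlockSize(walPath), null)) {
      // WAL entries would be appended to 'out' here.
    }
  }
}
{code}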



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28029) Netty SSL throughput improvement

2023-12-14 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28029?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28029.
---
Fix Version/s: 2.6.0
   3.0.0-beta-1
   Resolution: Fixed

Pushed to branch-2.6+. Thanks for the review [~nihaljain.cs] and [~zhangduo] 

> Netty SSL throughput improvement
> 
>
> Key: HBASE-28029
> URL: https://issues.apache.org/jira/browse/HBASE-28029
> Project: HBase
>  Issue Type: Improvement
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
> Attachments: 10mb-wrap.html, default-wrap.html
>
>
> Digging into HBASE-27947, I discovered an area for optimization in netty's 
> SslHandler. I submitted that upstream to 
> [https://github.com/netty/netty/issues/13549,] and submitted a PR for their 
> review [https://github.com/netty/netty/pull/13551.] 
> It's likely we will need changes in HBase to integrate this, including 
> updating hbase-thirdparty once the change is released, and adding support for 
> calling SslHandler.setWrapDataSize. This issue encapsulates that work.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Created] (HBASE-28239) Auto create configured namespaces

2023-12-04 Thread Bryan Beaudreault (Jira)
Bryan Beaudreault created HBASE-28239:
-

 Summary: Auto create configured namespaces
 Key: HBASE-28239
 URL: https://issues.apache.org/jira/browse/HBASE-28239
 Project: HBase
  Issue Type: New Feature
Reporter: Bryan Beaudreault


During startup, the HMaster will create the default and system namespaces 
automatically. To simplify the management of common namespaces, it would be 
beneficial to offer a configuration option that operators can use to ensure 
that additional namespaces are created during startup. This would eliminate the 
need to wrap createTable calls in checkAndCreateNamespace or provide separate 
cluster bootstrap functionality to guarantee that the namespace is created.
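
To make the current workaround concrete, a minimal sketch of the 
check-and-create step that this feature would make unnecessary; the namespace 
name and connection handling are placeholders:
{code:java}
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.NamespaceDescriptor;
import org.apache.hadoop.hbase.NamespaceExistException;
import org.apache.hadoop.hbase.client.Admin;
import org.apache.hadoop.hbase.client.Connection;
import org.apache.hadoop.hbase.client.ConnectionFactory;

public class EnsureNamespaceSketch {
  public static void main(String[] args) throws Exception {
    try (Connection conn = ConnectionFactory.createConnection(HBaseConfiguration.create());
        Admin admin = conn.getAdmin()) {
      try {
        // Create the namespace up front; ignore the case where it already exists.
        admin.createNamespace(NamespaceDescriptor.create("my_app").build());
      } catch (NamespaceExistException e) {
        // Already there, nothing to do.
      }
      // ... proceed with createTable calls that target the "my_app" namespace
    }
  }
}
{code}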



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28215) Region reopen procedure should support some sort of throttling

2023-12-04 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28215?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28215.
---
Fix Version/s: 2.6.0
   3.0.0-beta-1
 Release Note: 
Adds new configurations to control the speed and batching of region reopens 
after modifying a table:
- hbase.reopen.table.regions.progressive.batch.size.max - When set, the HMaster 
will progressively reopen regions, starting with one region and then doubling 
until it reaches the specified max. After reaching the max, it will continue 
reopening at that batch size until all regions are reopened.
- hbase.reopen.table.regions.progressive.batch.backoff.ms - When set, the 
HMaster will back off for this amount of time between each batch.
   Resolution: Fixed

Pushed to master, branch-3, branch-2, branch-2.6

Thanks for the contribution [~rmdmattingly]!
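
A hedged example of the two settings from the release note above; the values 
are illustrative, not defaults:
{code:java}
<!-- Reopen regions in progressively larger batches, capped at 16 at a time -->
<property>
  <name>hbase.reopen.table.regions.progressive.batch.size.max</name>
  <value>16</value>
</property>
<!-- Pause 10 seconds between batches to soften the impact on live traffic -->
<property>
  <name>hbase.reopen.table.regions.progressive.batch.backoff.ms</name>
  <value>10000</value>
</property>
{code}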

> Region reopen procedure should support some sort of throttling
> --
>
> Key: HBASE-28215
> URL: https://issues.apache.org/jira/browse/HBASE-28215
> Project: HBase
>  Issue Type: Improvement
>  Components: master, proc-v2
>Reporter: Ray Mattingly
>Assignee: Ray Mattingly
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> The mass reopening of regions caused by a table descriptor modification can 
> be quite disruptive. For latency/error sensitive workloads, like our user 
> facing traffic, we need to be very careful about when we modify table 
> descriptors, and it can be virtually impossible to do it painlessly for busy 
> tables.
> It would be nice if we supported configurable batching/throttling of 
> reopenings so that the amplitude of any disruption can be kept relatively 
> small.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28120) Provide the switch to avoid reopening regions in the alter sync command

2023-12-01 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28120?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28120.
---
Fix Version/s: (was: 2.6.0)
   Resolution: Invalid

> Provide the switch to avoid reopening regions in the alter sync command
> ---
>
> Key: HBASE-28120
> URL: https://issues.apache.org/jira/browse/HBASE-28120
> Project: HBase
>  Issue Type: Sub-task
>  Components: master, shell
>Affects Versions: 2.0.0-alpha-1
>Reporter: Gourab Taparia
>Assignee: Gourab Taparia
>Priority: Major
>
> Since HBase 2 supports both the async and sync APIs, this sub-task adds the 
> support/feature to HBase 2's sync API.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28121) Port the switch to avoid reopening regions in the alter async in HBase 2

2023-12-01 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28121?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28121.
---
Fix Version/s: (was: 2.6.0)
   Resolution: Invalid

> Port the switch to avoid reopening regions in the alter async in HBase 2
> 
>
> Key: HBASE-28121
> URL: https://issues.apache.org/jira/browse/HBASE-28121
> Project: HBase
>  Issue Type: Sub-task
>  Components: master, shell
>Affects Versions: 2.0.0-alpha-1
>Reporter: Gourab Taparia
>Assignee: Gourab Taparia
>Priority: Major
>
> Since HBase 2 supports both the async and sync APIs, this sub-task ports the 
> feature added in HBase 3's alter (async by default) layer to HBase 2's async 
> side.
> There is a separate sub-task for adding it to HBase 2's sync side.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-20433) HBase Export Snapshot utility does not close FileSystem instances

2023-12-01 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-20433?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-20433.
---
Resolution: Duplicate

Resolving this as a duplicate of HBASE-28222 where I fixed this as best I 
could, by re-enabling the cache (by reverting HBASE-12819). ExportSnapshot is 
designed to be run as a standalone job. If someone plans to run ExportSnapshot 
many times in a single process, they should run FileSystem.closeAll() between 
each run. This is not safe for ExportSnapshot itself to do, since it could 
inadvertently close FileSystem objects referenced elsewhere in the user code.

See HBASE-28222 for more details.

> HBase Export Snapshot utility does not close FileSystem instances
> -
>
> Key: HBASE-20433
> URL: https://issues.apache.org/jira/browse/HBASE-20433
> Project: HBase
>  Issue Type: Bug
>  Components: Client, Filesystem Integration, snapshots
>Affects Versions: 1.2.6, 1.4.3
>Reporter: Voyta
>Priority: Major
>
> It seems org.apache.hadoop.hbase.snapshot.ExportSnapshot disallows FileSystem 
> instance caching.
> When the verifySnapshot method is run, it frequently calls methods like 
> org.apache.hadoop.hbase.util.FSUtils#getRootDir that instantiate a FileSystem 
> but never call the org.apache.hadoop.fs.FileSystem#close method. This 
> behaviour allocates unwanted objects, potentially causing memory leaks.
> Related issue: https://issues.apache.org/jira/browse/HADOOP-15392
>  
> Expectation:
>  * HBase should properly release/close all objects, especially FileSystem 
> instances.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28222) Leak in ExportSnapshot during verifySnapshot on S3A

2023-12-01 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28222?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28222.
---
Fix Version/s: 2.6.0
   3.0.0-beta-1
 Release Note: ExportSnapshot now uses FileSystems from the global 
FileSystem cache, and as such does not close those FileSystems when it 
finishes. If users plan to run ExportSnapshot over and over in a single process 
for different FileSystem urls, they should run FileSystem.closeAll() between 
runs. See JIRA for details.
 Assignee: Bryan Beaudreault
   Resolution: Fixed

Pushed to master, branch-3, branch-2, branch-2.6. Thanks for the review 
[~wchevreuil]!

I did not push to older branches, even though this is a bug. It might be an 
unexpected change, but we can if there is a desire.
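
For anyone driving ExportSnapshot repeatedly in one JVM, a minimal sketch of 
the pattern suggested in the release note; the snapshot names and destination 
URL are placeholders, and it assumes ExportSnapshot is run in-process via 
ToolRunner:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.snapshot.ExportSnapshot;
import org.apache.hadoop.util.ToolRunner;

public class ExportLoopSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    String[] snapshots = { "snap_table_a", "snap_table_b" }; // placeholders
    for (String snapshot : snapshots) {
      int rc = ToolRunner.run(conf, new ExportSnapshot(),
          new String[] { "--snapshot", snapshot, "--copy-to", "s3a://backups/hbase" });
      if (rc != 0) {
        throw new IllegalStateException("Export of " + snapshot + " failed: " + rc);
      }
      // Drop cached FileSystem instances (including S3A and its metrics)
      // between runs, as the release note recommends.
      FileSystem.closeAll();
    }
  }
}
{code}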

> Leak in ExportSnapshot during verifySnapshot on S3A
> ---
>
> Key: HBASE-28222
> URL: https://issues.apache.org/jira/browse/HBASE-28222
> Project: HBase
>  Issue Type: Bug
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
> Fix For: 2.6.0, 3.0.0-beta-1
>
>
> Each S3AFileSystem creates an S3AInstrumentation and various metrics sources, 
> with no real way to disable that. In HADOOP-18526, a bug was fixed so that 
> these are not leaked. But in order to use that, you must call 
> S3AFileSystem.close() when done.
> In ExportSnapshot, ever since HBASE-12819 we set fs.impl.disable.cache to 
> true. It looks like that was added in order to prevent conflicting calls to 
> close() between mapper and main thread when running in a single JVM.
> When verifySnapshot is enabled, SnapshotReferenceUtil.verifySnapshot iterates 
> all storefiles (could be many thousands) and calls 
> SnapshotReferenceUtil.verifyStoreFile on them. verifyStoreFile makes a number 
> of static calls which end up in CommonFSUtils.getRootDir, which does 
> Path.getFileSystem().
> Since the FS cache is disabled, every single call to Path.getFileSystem() 
> creates a new FileSystem instance. That FS is short lived, and gets GC'd. But 
> in the case of S3AFileSystem, this leaks all of the metrics stuff.
> We have two easy possible fixes:
>  # Only set fs.impl.disable.cache when running hadoop in local mode, since 
> that was the original problem.
>  # When calling verifySnapshot, create a new Configuration which does not 
> include the fs.impl.disable.cache setting.
> I tested out #2 in my environment and it fixed the leak.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28231) Setup jenkins job for branch-2.6

2023-11-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28231?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28231.
---
Resolution: Done

> Setup jenkins job for branch-2.6
> 
>
> Key: HBASE-28231
> URL: https://issues.apache.org/jira/browse/HBASE-28231
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Resolved] (HBASE-28229) Create branch-2.6

2023-11-29 Thread Bryan Beaudreault (Jira)


 [ 
https://issues.apache.org/jira/browse/HBASE-28229?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Bryan Beaudreault resolved HBASE-28229.
---
Resolution: Done

> Create branch-2.6
> -
>
> Key: HBASE-28229
> URL: https://issues.apache.org/jira/browse/HBASE-28229
> Project: HBase
>  Issue Type: Sub-task
>Reporter: Bryan Beaudreault
>Assignee: Bryan Beaudreault
>Priority: Major
>




--
This message was sent by Atlassian Jira
(v8.20.10#820010)

