[
https://issues.apache.org/jira/browse/HBASE-28105?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Ke Han resolved HBASE-28105.
----------------------------
Resolution: Fixed
PR merged
> NPE in QuotaCache if Table is dropped from cluster
> --------------------------------------------------
>
> Key: HBASE-28105
> URL: https://issues.apache.org/jira/browse/HBASE-28105
> Project: HBase
> Issue Type: Bug
> Components: Quotas
> Affects Versions: 2.4.17, 2.5.5
> Reporter: Ke Han
> Priority: Major
> Attachments: 0001-avoid-NPE.patch,
> hbase--regionserver-a0320910ca45.log
>
>
> When running HBase-2.4.17, I encountered an NPE in the regionserver log.
> h1. Reproduce
> Configure an HBase cluster: 1 HMaster, 2 RegionServers, Hadoop 2.10.2.
> Execute the following commands on the HMaster node using the hbase shell:
> {code:java}
> create 'uuidd9efa97f93a442b686adae6d9f7bb2e9', {NAME =>
> 'uuid099cbece77834a83a52bb0611c3ea080', VERSIONS => 3, COMPRESSION => 'NONE',
> BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}, {NAME =>
> 'uuidbc1bea73952749329d7f025aab382c4e', VERSIONS => 2, COMPRESSION => 'GZ',
> BLOOMFILTER => 'ROW', IN_MEMORY => 'false'}, {NAME =>
> 'uuidff292310d9dc450697af2bb25d9f3e98', VERSIONS => 2, COMPRESSION => 'GZ',
> BLOOMFILTER => 'NONE', IN_MEMORY => 'false'}, {NAME =>
> 'uuid449de028da6b4d35be0f187ebec6c3be', VERSIONS => 2, COMPRESSION => 'GZ',
> BLOOMFILTER => 'ROW', IN_MEMORY => 'false'}, {NAME =>
> 'uuidc0840c98f9d348a18f2d454c7a503b65', VERSIONS => 2, COMPRESSION => 'GZ',
> BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'false'}
> create_namespace 'uuidec797633f5dd4ab9b96276135aeda9e2'
> create 'uuiddeb610fded9744889840ecd03dd18739', {NAME =>
> 'uuid30a0f625ad454605908b60c932957ff0', VERSIONS => 1, COMPRESSION => 'GZ',
> BLOOMFILTER => 'ROW', IN_MEMORY => 'true'}
> incr 'uuidd9efa97f93a442b686adae6d9f7bb2e9',
> 'uuid46ddc3d3557e413e915e2393ae72c082',
> 'uuidbc1bea73952749329d7f025aab382c4e:JZycbUSpbDQmwgXinp', 1
> flush 'uuidd9efa97f93a442b686adae6d9f7bb2e9',
> 'uuid449de028da6b4d35be0f187ebec6c3be'
> drop 'uuiddeb610fded9744889840ecd03dd18739'
> put 'uuidd9efa97f93a442b686adae6d9f7bb2e9',
> 'uuidf4704cae4d1e4661bd7664d26eb6b31b',
> 'uuidbc1bea73952749329d7f025aab382c4e:JZycbUSpbDQmwgXinp',
> 'XlPpFGvSYfcEXWXgwARytlSeiaSuHJFqpirMmLduqGnpdXLlHJWBumraXiifQSvHqNHmTcyzLQIvuQrkujPghfdtRkhOkgKEJHsAuAiMMeWZjdTHNZqhkOdJBOzsRYUXKOCNKeSxEDWgnKgsFDHMtxdnKKudBuceOgYmCrdaPXMclKkZKCIEiFDcdoAEJGKXYVfOjb'
> disable 'uuidd9efa97f93a442b686adae6d9f7bb2e9'
> drop 'uuidd9efa97f93a442b686adae6d9f7bb2e9'
> create 'uuid9d05a5cb34e64910ac90675186e7d0d4', {NAME =>
> 'uuid1ce512a5997b4efea3bdead2e7f723c3', VERSIONS => 2, COMPRESSION => 'NONE',
> BLOOMFILTER => 'ROWCOL', IN_MEMORY => 'true'}, {NAME =>
> 'uuid0b1baaa4275e46b2a3a1d11d6540fc30', VERSIONS => 2, COMPRESSION => 'NONE',
> BLOOMFILTER => 'NONE', IN_MEMORY => 'true'}
> put 'uuid9d05a5cb34e64910ac90675186e7d0d4',
> 'uuid552e42ade4c14099a1d8643bea1616d4',
> 'uuid1ce512a5997b4efea3bdead2e7f723c3:l', 1
> drop 'uuid9d05a5cb34e64910ac90675186e7d0d4'{code}
> The exception will be thrown on either RS1 or RS2:
> {code:java}
> 2023-09-19 20:29:28,268 INFO [RS_OPEN_REGION-regionserver/hregion2:16020-2]
> handler.AssignRegionHandler: Opened
> uuid9d05a5cb34e64910ac90675186e7d0d4,,1695155367072.f59a0693a9469f9e1f131bf2aac1486d.
> 2023-09-19 20:29:29,205 ERROR [regionserver/hregion2:16020.Chore.1]
> hbase.ScheduledChore: Caught error
> java.lang.NullPointerException
> at
> org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.updateQuotaFactors(QuotaCache.java:378)
> at
> org.apache.hadoop.hbase.quotas.QuotaCache$QuotaRefresherChore.chore(QuotaCache.java:224)
> at org.apache.hadoop.hbase.ScheduledChore.run(ScheduledChore.java:158)
> at
> java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
> at java.util.concurrent.FutureTask.runAndReset(FutureTask.java:308)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(ScheduledThreadPoolExecutor.java:180)
> at
> java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:294)
> at
> org.apache.hadoop.hbase.JitterScheduledThreadPoolExecutorImpl$JitteredRunnableScheduledFuture.run(JitterScheduledThreadPoolExecutorImpl.java:107)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
> at java.lang.Thread.run(Thread.java:750){code}
> h1. Root Cause
> The NPE is thrown in updateQuotaFactors():
> {code:java}
> private void updateQuotaFactors() {
>   // Update machine quota factor
>   ClusterMetrics clusterMetrics;
>   try {
>     clusterMetrics = rsServices.getConnection().getAdmin()
>       .getClusterMetrics(EnumSet.of(Option.SERVERS_NAME, Option.TABLE_TO_REGIONS_COUNT));
>   } catch (IOException e) {
>     LOG.warn("Failed to get cluster metrics needed for updating quotas", e);
>     return;
>   }
>
>   int rsSize = clusterMetrics.getServersName().size();
>   if (rsSize != 0) {
>     // TODO if use rs group, the cluster limit should be shared by the rs group
>     machineQuotaFactor = 1.0 / rsSize;
>   }
>
>   Map<TableName, RegionStatesCount> tableRegionStatesCount =
>     clusterMetrics.getTableRegionStatesCount();
>
>   // Update table machine quota factors
>   for (TableName tableName : tableQuotaCache.keySet()) {
>     double factor = 1;
>     try {
>       // BUGGY LINE
>       long regionSize = tableRegionStatesCount.get(tableName).getOpenRegions();
>       if (regionSize == 0) {
>         factor = 0;
>       } else {
>         int localRegionSize = rsServices.getRegions(tableName).size();
>         factor = 1.0 * localRegionSize / regionSize;
>       }
>     } catch (IOException e) {
>       LOG.warn("Get table regions failed: {}", tableName, e);
>     }
>     tableMachineQuotaFactors.put(tableName, factor);
>   }
> }
> {code}
> At QuotaCache.java:378, *tableRegionStatesCount.get(tableName)* returns null,
> so the chained call throws the NPE.
> The tableName triggering the NPE is '{*}uuidd9efa97f93a442b686adae6d9f7bb2e9{*}',
> which was disabled and dropped by the user.
> {code:java}
> long regionSize = tableRegionStatesCount.get(tableName).getOpenRegions();
> {code}
> The root cause is that when updating the quota factors, the chore iterates over
> {*}tableQuotaCache.keySet(){*}, which may still contain a table that has already
> been dropped (an outdated cache entry).
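> The failure mode is easy to reproduce in isolation: *Map.get()* returns null for a
> key that is no longer present, and the chained *getOpenRegions()* call dereferences
> it. A minimal standalone sketch (hypothetical class names; plain java.util types
> standing in for the HBase ClusterMetrics structures):
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class StaleCacheNpeDemo {
>   // Stand-in for HBase's RegionStatesCount (hypothetical, for illustration only)
>   static class RegionStatesCount {
>     final int openRegions;
>     RegionStatesCount(int openRegions) { this.openRegions = openRegions; }
>     int getOpenRegions() { return openRegions; }
>   }
>
>   public static void main(String[] args) {
>     Map<String, RegionStatesCount> tableRegionStatesCount = new HashMap<>();
>     tableRegionStatesCount.put("existingTable", new RegionStatesCount(3));
>
>     // A stale cache key for a table that was dropped from the cluster
>     String droppedTable = "droppedTable";
>
>     // The buggy pattern: get() returns null, the chained call throws NPE
>     try {
>       long regionSize = tableRegionStatesCount.get(droppedTable).getOpenRegions();
>       System.out.println("regions: " + regionSize);
>     } catch (NullPointerException e) {
>       System.out.println("NPE for " + droppedTable);  // prints "NPE for droppedTable"
>     }
>   }
> }
> {code}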
> h1. Fix
> A temporary fix is to add a check that the table still exists in the cluster
> before dereferencing the lookup result. I have attached the full regionserver log
> and a simple fix. This bug should also occur in *2.5.5*, since the related code in
> QuotaCache is unchanged (but I only tested 2.4.17).
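> For illustration, the check could look like the following standalone sketch (not
> necessarily the attached patch; stand-in types are hypothetical): if the cluster
> metrics no longer report the table, keep the default factor instead of
> dereferencing null:
> {code:java}
> import java.util.HashMap;
> import java.util.Map;
>
> public class QuotaFactorGuardSketch {
>   // Stand-in for HBase's RegionStatesCount (hypothetical, for illustration only)
>   static class RegionStatesCount {
>     final int openRegions;
>     RegionStatesCount(int openRegions) { this.openRegions = openRegions; }
>     int getOpenRegions() { return openRegions; }
>   }
>
>   // Null-safe variant of the buggy lookup: if the table was dropped between cache
>   // refreshes and is missing from the metrics map, return the default factor.
>   static double computeFactor(Map<String, RegionStatesCount> tableRegionStatesCount,
>       String tableName, int localRegionSize) {
>     double factor = 1;
>     RegionStatesCount counts = tableRegionStatesCount.get(tableName);
>     if (counts == null) {
>       return factor;  // table no longer exists in the cluster; skip the division
>     }
>     long regionSize = counts.getOpenRegions();
>     if (regionSize == 0) {
>       factor = 0;
>     } else {
>       factor = 1.0 * localRegionSize / regionSize;
>     }
>     return factor;
>   }
>
>   public static void main(String[] args) {
>     Map<String, RegionStatesCount> counts = new HashMap<>();
>     counts.put("live", new RegionStatesCount(4));
>     System.out.println(computeFactor(counts, "live", 2));     // 0.5
>     System.out.println(computeFactor(counts, "dropped", 2));  // 1.0, no NPE
>   }
> }
> {code}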
>
--
This message was sent by Atlassian Jira
(v8.20.10#820010)