Dieter De Paepe created HBASE-28445:
---------------------------------------
Summary: Shared job jars for full backups
Key: HBASE-28445
URL: https://issues.apache.org/jira/browse/HBASE-28445
Project: HBase
Issue Type: Improvement
Components: backup&restore
Affects Versions: 2.6.0
Reporter: Dieter De Paepe
Our YARN clusters are configured with 10GB of temporary local storage.
When investigating an unhealthy YARN NodeManager, we found it had been marked
unhealthy because its "local-dirs usable space" had dropped below 90%.
Investigation showed that this was mainly due to over 100 different entries in
the usercache, all containing the exact same libjars (roughly 40MB each, so
over 4GB in total):
{code:java}
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/*
41272 ./usercache/lily/filecache/10
41272 ./usercache/lily/filecache/100
41272 ./usercache/lily/filecache/101
41272 ./usercache/lily/filecache/102
41272 ./usercache/lily/filecache/103
41272 ./usercache/lily/filecache/104
...{code}
{code:java}
yarn@yarn-nodemanager-0:/tmp/yarn/nm-local-dir$ du -s ./usercache/lily/filecache/99/libjars/*
576     ./usercache/lily/filecache/99/libjars/commons-lang3-3.12.0.jar
4496    ./usercache/lily/filecache/99/libjars/hadoop-common-3.3.6-2-lily.jar
1800    ./usercache/lily/filecache/99/libjars/hadoop-mapreduce-client-core-3.3.6-2-lily.jar
100     ./usercache/lily/filecache/99/libjars/hbase-asyncfs-2.6.0-prc-1-lily.jar
2076    ./usercache/lily/filecache/99/libjars/hbase-client-2.6.0-prc-1-lily.jar
876     ./usercache/lily/filecache/99/libjars/hbase-common-2.6.0-prc-1-lily.jar
76      ./usercache/lily/filecache/99/libjars/hbase-hadoop-compat-2.6.0-prc-1-lily.jar
164     ./usercache/lily/filecache/99/libjars/hbase-hadoop2-compat-2.6.0-prc-1-lily.jar
124     ./usercache/lily/filecache/99/libjars/hbase-http-2.6.0-prc-1-lily.jar
436     ./usercache/lily/filecache/99/libjars/hbase-mapreduce-2.6.0-prc-1-lily.jar
32      ./usercache/lily/filecache/99/libjars/hbase-metrics-2.6.0-prc-1-lily.jar
24      ./usercache/lily/filecache/99/libjars/hbase-metrics-api-2.6.0-prc-1-lily.jar
208     ./usercache/lily/filecache/99/libjars/hbase-procedure-2.6.0-prc-1-lily.jar
3208    ./usercache/lily/filecache/99/libjars/hbase-protocol-2.6.0-prc-1-lily.jar
7356    ./usercache/lily/filecache/99/libjars/hbase-protocol-shaded-2.6.0-prc-1-lily.jar
52      ./usercache/lily/filecache/99/libjars/hbase-replication-2.6.0-prc-1-lily.jar
5932    ./usercache/lily/filecache/99/libjars/hbase-server-2.6.0-prc-1-lily.jar
304     ./usercache/lily/filecache/99/libjars/hbase-shaded-gson-4.1.5.jar
4060    ./usercache/lily/filecache/99/libjars/hbase-shaded-miscellaneous-4.1.5.jar
4864    ./usercache/lily/filecache/99/libjars/hbase-shaded-netty-4.1.5.jar
1832    ./usercache/lily/filecache/99/libjars/hbase-shaded-protobuf-4.1.5.jar
20      ./usercache/lily/filecache/99/libjars/hbase-unsafe-4.1.5.jar
108     ./usercache/lily/filecache/99/libjars/hbase-zookeeper-2.6.0-prc-1-lily.jar
120     ./usercache/lily/filecache/99/libjars/metrics-core-3.1.5.jar
128     ./usercache/lily/filecache/99/libjars/opentelemetry-api-1.15.0.jar
48      ./usercache/lily/filecache/99/libjars/opentelemetry-context-1.15.0.jar
32      ./usercache/lily/filecache/99/libjars/opentelemetry-semconv-1.15.0-alpha.jar
524     ./usercache/lily/filecache/99/libjars/protobuf-java-2.5.0.jar
1292    ./usercache/lily/filecache/99/libjars/zookeeper-3.8.3.jar
{code}
Investigating the YARN logs showed that for every HBase table included in a
full backup, a separate YARN application is started, and each of these
applications uploads its own copy of these job jars.
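For context, a MapReduce-based HBase job normally stages its dependency jars
per job through the distributed cache. Below is a minimal sketch of that
generic pattern, using the public TableMapReduceUtil helper for illustration;
it is not necessarily the exact call site used by the backup code:
{code:java}
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;
import org.apache.hadoop.hbase.mapreduce.TableMapReduceUtil;
import org.apache.hadoop.mapreduce.Job;

public class LibjarsStagingSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = HBaseConfiguration.create();
    Job job = Job.getInstance(conf, "full-backup-copy-sketch");

    // Adds the HBase/Hadoop dependency jars to the "tmpjars" property.
    // At submit time these jars are copied into the job's staging directory
    // and then localized into the NodeManager usercache - once per submitted
    // application, which is what fills the filecache when one application is
    // started per table.
    TableMapReduceUtil.addDependencyJars(job);

    System.out.println(job.getConfiguration().get("tmpjars"));
  }
}
{code}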
We encountered this on an HBase installation with a limited number of tables,
where we were running backup & restore related tests (so this was regular
use). But I can imagine this becoming a real nuisance for HBase installations
with hundreds to thousands of tables.
Would it be possible to use shared job jars instead of the current approach?
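One possible direction (a minimal sketch, not a worked-out design) would be to
pre-stage the dependency jars once in a shared, world-readable HDFS directory
and point "tmpjars" at those HDFS paths instead of re-uploading local jars for
every job; world-readable HDFS resources are localized with PUBLIC visibility,
so the NodeManager can keep a single shared copy in its public filecache
rather than one copy per application. The directory name and helper method
below are hypothetical:
{code:java}
import java.util.StringJoiner;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class SharedLibjarsSketch {
  /**
   * Points "tmpjars" at jars pre-staged under a shared, world-readable HDFS
   * directory (e.g. /hbase/backup-libjars - a hypothetical location), so that
   * every per-table backup job references the same HDFS copies instead of
   * uploading its own set.
   */
  public static void useSharedLibjars(Configuration conf, Path sharedJarDir) throws Exception {
    FileSystem fs = sharedJarDir.getFileSystem(conf);
    StringJoiner jars = new StringJoiner(",");
    for (FileStatus status : fs.listStatus(sharedJarDir)) {
      if (status.isFile() && status.getPath().getName().endsWith(".jar")) {
        jars.add(status.getPath().toUri().toString());
      }
    }
    conf.set("tmpjars", jars.toString());
  }
}
{code}
Alternatively, the YARN shared cache (yarn.sharedcache.enabled on the cluster
plus mapreduce.job.sharedcache.mode=libjars on the job) is meant for exactly
this kind of de-duplication, though I have not verified how it interacts with
the backup jobs.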
(Strangely enough, the mechanisms to clean up this cache weren't triggering as
expected, but that's probably something that requires its own investigation.)