István Fajth created HDDS-4478:
----------------------------------
Summary: Large deletedKeyset slows down OM via listStatus
Key: HDDS-4478
URL: https://issues.apache.org/jira/browse/HDDS-4478
Project: Hadoop Distributed Data Store
Issue Type: Bug
Reporter: István Fajth
During TPC-DS test runs with Hive backed by Ozone storage, I noticed the
following:
As part of data generation we create tables, and with auto partitioning turned
on, some of the tables end up with a couple thousand partitions.
This process involves a couple of renames of the different data files while the
MR containers work on creating the data files and moving them into their final
position.
After this, when I start to run queries, certain parts of the query run split
up the work for the mappers, and to do this they issue listStatus calls; as far
as I tracked down and understood the code, a recursive listing happens that
filters out files on the client side. This means that for a table with 3K
partitions, Hive issues 3K listStatus calls on the partition directories.
During this phase the query is stuck INITIALIZING containers and query phases
for a long time, which gave me the chance to take a couple of jstacks, and all
of them have 5-10 threads with this particular stack trace:
{code}
"IPC Server handler 90 on 9862" #147 daemon prio=5 os_prio=0
tid=0x00007fec32dd7800 nid=0x19ac2 runnable [0x00007febe13d9000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.put(TreeMap.java:575)
at java.util.TreeSet.add(TreeSet.java:255)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.listStatusFindKeyInTableCache(KeyManagerImpl.java:2041)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:2117)
at
org.apache.hadoop.ozone.om.OzoneManager.listStatus(OzoneManager.java:2938)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.listStatus(OzoneManagerRequestHandler.java:573)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:197)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:218)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$88/1748961453.apply(Unknown
Source)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
at
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
{code}
This is happening with a recent master build (a week old at most).
After I took heap dumps from the OM during query runs, at two different times,
I saw the following concerning numbers:
Of CacheKey instances we have 78K in the earlier dump and 80K in the later one.
Of CacheValue instances we have 78K in both.
Almost all of them have the OMMetadataManager as GC root and belong to the
key table cache.
During listStatusFindKeyInTableCache, we iterate through the cache entries via
the cache iterator, and if the cacheKey is not equal to the keyArgs and the
cacheValue is null, we add the key to the deletedKeySet.
This method is called from KeyManagerImpl's listStatus method, which creates a
new deletedKeySet as a TreeSet for every call. It then goes over all cached
keys of the bucket and puts together the deletedKeySet. We seem to need this
deletedKeySet in order to leave out of the listing keys that are already
deleted in the cache but still present in the db, which is a valid reason, but
it causes a significant slowdown (1-2 orders of magnitude) compared to runtimes
right after an OM restart.
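To make explicit what that per-call work amounts to, here is a rough sketch of
the pattern (the buildDeletedKeySet method name is made up for illustration,
and the CacheKey/CacheValue/Table types are the hdds db-utils ones; this is a
simplification, not a copy of KeyManagerImpl):
{code}
// Rough sketch of the per-call deletedKeySet construction described above;
// not the actual KeyManagerImpl code, names are simplified for illustration.
private TreeSet<String> buildDeletedKeySet(Table<String, OmKeyInfo> keyTable) {
  // a brand new TreeSet is created for every single listStatus call
  TreeSet<String> deletedKeySet = new TreeSet<>();
  Iterator<Map.Entry<CacheKey<String>, CacheValue<OmKeyInfo>>> cacheIter =
      keyTable.cacheIterator();
  while (cacheIter.hasNext()) {
    Map.Entry<CacheKey<String>, CacheValue<OmKeyInfo>> entry = cacheIter.next();
    if (entry.getValue().getCacheValue() == null) {
      // deleted in the cache but possibly still present in RocksDB, so the
      // DB scan has to skip it; this is the TreeSet.add / TreeMap.put
      // visible in the jstack above
      deletedKeySet.add(entry.getKey().getCacheKey());
    }
  }
  return deletedKeySet;
}
{code}
With ~80K cache entries that never get evicted, this O(cache size) walk is
repeated for each of the 3K listStatus calls Hive issues.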
In the heap dump the cache values are either an instance of Present or Absent,
and the majority seems to be Absent. Based on the epoch, both the present and
the absent values come from various epochs, from 1 to 300K+.
In TableCacheImpl, we have an evictCache method that runs in an executor
service and should run on DoubleBuffer flush.
It gets a list of epochs to evict and iterates through the epochEntries the
cache has; since we are talking about the KeyTable, which has a MANUAL eviction
policy, it removes an entry from the cache only if the entry's epoch is the
current epoch in the iteration and the supplied epochs to evict contain that
epoch.
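Roughly, the eviction logic I am describing is this (a simplified sketch with
illustrative bookkeeping structures, not the actual TableCacheImpl fields):
{code}
// Simplified sketch of the MANUAL-policy eviction described above.
private final Map<CacheKey<String>, CacheValue<OmKeyInfo>> cache =
    new ConcurrentHashMap<>();
// which cache keys were added at which epoch (illustrative structure)
private final NavigableMap<Long, Set<CacheKey<String>>> epochEntries =
    new TreeMap<>();

public void evictCache(List<Long> epochsToEvict) {
  Iterator<Map.Entry<Long, Set<CacheKey<String>>>> iter =
      epochEntries.entrySet().iterator();
  while (iter.hasNext()) {
    Map.Entry<Long, Set<CacheKey<String>>> epochEntry = iter.next();
    long currentEpoch = epochEntry.getKey();
    // MANUAL policy: an entry is dropped only when its epoch is explicitly
    // listed among the epochs the DoubleBuffer flushed
    if (epochsToEvict.contains(currentEpoch)) {
      for (CacheKey<String> key : epochEntry.getValue()) {
        // remove only if the cached value still belongs to this epoch
        cache.computeIfPresent(key,
            (k, v) -> v.getEpoch() == currentEpoch ? null : v);
      }
      iter.remove();
    }
  }
}
{code}
So if an epoch never shows up in epochsToEvict for the key table, the entries
added at that epoch stay in the cache forever.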
Epochs come from DoubleBuffer#flushTransactions.
It starts as a HashMap of Strings to Lists of Longs, and the DoubleBuffer puts
together the list of epochs for a table during the flush, based on the table
names registered on the Responses with the CleanupTableInfo annotation.
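As far as I can tell, the flush-side bookkeeping looks roughly like this
(illustrative sketch; the entry and method names are from memory and not
verified against current master):
{code}
// Sketch of how the per-table cleanup epochs are collected during a flush.
Map<String, List<Long>> cleanupEpochs = new HashMap<>();
for (DoubleBufferEntry<OMClientResponse> entry : readyBuffer) {
  // the tables to clean are registered on the response class via @CleanupTableInfo
  CleanupTableInfo cleanupTableInfo =
      entry.getResponse().getClass().getAnnotation(CleanupTableInfo.class);
  if (cleanupTableInfo != null) {
    for (String table : cleanupTableInfo.cleanupTables()) {
      cleanupEpochs.computeIfAbsent(table, t -> new ArrayList<>())
          .add(entry.getTrxLogIndex());
    }
  }
}
// later, per table: omMetadataManager.getTable(table).cleanupCache(epochs),
// which is what ends up calling TableCacheImpl#evictCache
{code}
If a response never names the key table here, no epoch list is ever built for
it, and the cache entries it created are never evicted.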
The only suspicious thing I found is that OMFileCreateResponse does not define
the key table as something to clean up, and neither do its parents, while
OMFileCreateRequest adds the parents of the created file to the key table cache
[here|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileCreateRequest.java#L299-L301].
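If that suspicion is right, the gap could be illustrated like this (hypothetical
sketch, not a verified patch; the table-name constants and the parent class are
written from memory):
{code}
// Hypothetical: OMFileCreateRequest puts the missing parent directories of the
// new file into the key table cache, so for those entries to ever be evicted,
// OMFileCreateResponse (or one of its parents) would presumably also have to
// declare the key table for cleanup, something like:
@CleanupTableInfo(cleanupTables = {OPEN_KEY_TABLE, KEY_TABLE})
public class OMFileCreateResponse extends OMKeyCreateResponse {
  // body unchanged; per the observation above, the current annotation does not
  // cover the key table
}
{code}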