István Fajth created HDDS-4478:
----------------------------------
Summary: Large deletedKeyset slows down OM via listStatus
Key: HDDS-4478
URL: https://issues.apache.org/jira/browse/HDDS-4478
Project: Hadoop Distributed Data Store
Issue Type: Bug
Reporter: István Fajth
During TPC-DS test runs with Hive backed by Ozone storage, I noticed the
following:
As part of data generation we create tables, and with auto partitioning turned
on, some of the tables end up with a couple thousand partitions.
This process involves a couple of renames of the different data files while the
MR containers work on creating the data files and moving them into their final
position.
After this, when I start to run queries, certain parts of the query run split
up the work for the mappers, and to do this they issue listStatus calls; as far
as I tracked down and understood the code, a recursive listing happens that
filters out files on the client side. This means that for a table with 3K
partitions, Hive issues 3K listStatus calls on the partition directories.
During this phase the query is stuck INITIALIZING containers and query phases
for a long time, which gave me the chance to take a couple of jstacks, and all
of them have 5-10 threads with this particular stack trace:
{code}
"IPC Server handler 90 on 9862" #147 daemon prio=5 os_prio=0
tid=0x00007fec32dd7800 nid=0x19ac2 runnable [0x00007febe13d9000]
java.lang.Thread.State: RUNNABLE
at java.util.TreeMap.put(TreeMap.java:575)
at java.util.TreeSet.add(TreeSet.java:255)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.listStatusFindKeyInTableCache(KeyManagerImpl.java:2041)
at
org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:2117)
at
org.apache.hadoop.ozone.om.OzoneManager.listStatus(OzoneManager.java:2938)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.listStatus(OzoneManagerRequestHandler.java:573)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:197)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:218)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$88/1748961453.apply(Unknown
Source)
at
org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
at
org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
at
org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
at
org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at
org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
{code}
This is happening with a recent master build (a week old at most).
After I took heap dumps from the OM during query runs, at two different times,
I saw the following concerning numbers:
Of CacheKey instances we have 78K in the earlier dump and 80K in the later one.
Of CacheValue instances we have 78K in both.
Almost all of them have the OMMetadataManager as GC root and belong to the
key table cache.
During listStatusFindKeyInTableCache, we iterate through the cache entries via
the cache iterator, and if the cacheKey is not equal to the keyArgs and the
cacheValue is null, we add the key to the deletedKeySet.
This method is called from KeyManagerImpl's listStatus method, which creates a
new deletedKeySet as a TreeSet for every call. It then goes over all cached
keys of the bucket and puts together the deletedKeySet. We seem to need this
deletedKeySet in order to leave out of the listing keys that are already
deleted in the cache but still present in the db, which is a valid reason, but
it causes a significant slowdown (1-2 orders of magnitude) compared to runtimes
right after an OM restart.
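To make explicit what that per-call work amounts to, here is a rough sketch of
the pattern (the buildDeletedKeySet method name is made up for illustration,
and the CacheKey/CacheValue/Table types are the hdds db-utils ones; this is a
simplification, not a copy of KeyManagerImpl):
{code}
// Rough sketch of the per-call deletedKeySet construction described above;
// not the actual KeyManagerImpl code, names are simplified for illustration.
private TreeSet<String> buildDeletedKeySet(Table<String, OmKeyInfo> keyTable) {
  // a brand new TreeSet is created for every single listStatus call
  TreeSet<String> deletedKeySet = new TreeSet<>();
  Iterator<Map.Entry<CacheKey<String>, CacheValue<OmKeyInfo>>> cacheIter =
      keyTable.cacheIterator();
  while (cacheIter.hasNext()) {
    Map.Entry<CacheKey<String>, CacheValue<OmKeyInfo>> entry = cacheIter.next();
    if (entry.getValue().getCacheValue() == null) {
      // deleted in the cache but possibly still present in RocksDB, so the
      // DB scan has to skip it; this is the TreeSet.add / TreeMap.put
      // visible in the jstack above
      deletedKeySet.add(entry.getKey().getCacheKey());
    }
  }
  return deletedKeySet;
}
{code}
With ~80K cache entries that never get evicted, this O(cache size) walk is
repeated for each of the 3K listStatus calls Hive issues.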
In the heap dump the cache values are either an instance of Present or Absent,
and the majority seems to be Absent. Based on the epoch, both the present and
the absent values come from various epochs, from 1 to 300K+.
In TableCacheImpl, we have an evictCache method that runs in an executor
service and should run on DoubleBuffer flush.
It gets a list of epochs to evict and iterates through the epochEntries the
cache has; since we are talking about the KeyTable, which has a MANUAL eviction
policy, it removes an entry from the cache only if the entry's epoch is the
current epoch in the iteration and the supplied epochs to evict contain that
epoch.
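Roughly, the eviction logic I am describing is this (a simplified sketch with
illustrative bookkeeping structures, not the actual TableCacheImpl fields):
{code}
// Simplified sketch of the MANUAL-policy eviction described above.
private final Map<CacheKey<String>, CacheValue<OmKeyInfo>> cache =
    new ConcurrentHashMap<>();
// which cache keys were added at which epoch (illustrative structure)
private final NavigableMap<Long, Set<CacheKey<String>>> epochEntries =
    new TreeMap<>();

public void evictCache(List<Long> epochsToEvict) {
  Iterator<Map.Entry<Long, Set<CacheKey<String>>>> iter =
      epochEntries.entrySet().iterator();
  while (iter.hasNext()) {
    Map.Entry<Long, Set<CacheKey<String>>> epochEntry = iter.next();
    long currentEpoch = epochEntry.getKey();
    // MANUAL policy: an entry is dropped only when its epoch is explicitly
    // listed among the epochs the DoubleBuffer flushed
    if (epochsToEvict.contains(currentEpoch)) {
      for (CacheKey<String> key : epochEntry.getValue()) {
        // remove only if the cached value still belongs to this epoch
        cache.computeIfPresent(key,
            (k, v) -> v.getEpoch() == currentEpoch ? null : v);
      }
      iter.remove();
    }
  }
}
{code}
So if an epoch never shows up in epochsToEvict for the key table, the entries
added at that epoch stay in the cache forever.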
Epochs come from DoubleBuffer#flushTransactions.
It starts as a HashMap of Strings to Lists of Longs, and the DoubleBuffer puts
together the list of epochs for a table during the flush, based on the table
names registered on the Responses with the CleanupTableInfo annotation.
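As far as I can tell, the flush-side bookkeeping looks roughly like this
(illustrative sketch; the entry and method names are from memory and not
verified against current master):
{code}
// Sketch of how the per-table cleanup epochs are collected during a flush.
Map<String, List<Long>> cleanupEpochs = new HashMap<>();
for (DoubleBufferEntry<OMClientResponse> entry : readyBuffer) {
  // the tables to clean are registered on the response class via @CleanupTableInfo
  CleanupTableInfo cleanupTableInfo =
      entry.getResponse().getClass().getAnnotation(CleanupTableInfo.class);
  if (cleanupTableInfo != null) {
    for (String table : cleanupTableInfo.cleanupTables()) {
      cleanupEpochs.computeIfAbsent(table, t -> new ArrayList<>())
          .add(entry.getTrxLogIndex());
    }
  }
}
// later, per table: omMetadataManager.getTable(table).cleanupCache(epochs),
// which is what ends up calling TableCacheImpl#evictCache
{code}
If a response never names the key table here, no epoch list is ever built for
it, and the cache entries it created are never evicted.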
The only suspicious thing I found is that OMFileCreateResponse does not define
the key table as something to clean up, and neither do its parents, while
OMFileCreateRequest adds the parents of the created file to the key table cache
[here|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileCreateRequest.java#L299-L301].
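If that suspicion is right, the gap could be illustrated like this (hypothetical
sketch, not a verified patch; the table-name constants and the parent class are
written from memory):
{code}
// Hypothetical: OMFileCreateRequest puts the missing parent directories of the
// new file into the key table cache, so for those entries to ever be evicted,
// OMFileCreateResponse (or one of its parents) would presumably also have to
// declare the key table for cleanup, something like:
@CleanupTableInfo(cleanupTables = {OPEN_KEY_TABLE, KEY_TABLE})
public class OMFileCreateResponse extends OMKeyCreateResponse {
  // body unchanged; per the observation above, the current annotation does not
  // cover the key table
}
{code}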