[
https://issues.apache.org/jira/browse/HDDS-4478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Bharat Viswanadham resolved HDDS-4478.
--------------------------------------
Fix Version/s: 1.1.0
Resolution: Fixed
> Large deletedKeyset slows down OM via listStatus
> ------------------------------------------------
>
> Key: HDDS-4478
> URL: https://issues.apache.org/jira/browse/HDDS-4478
> Project: Hadoop Distributed Data Store
> Issue Type: Bug
> Reporter: István Fajth
> Assignee: István Fajth
> Priority: Critical
> Labels: pull-request-available
> Fix For: 1.1.0
>
>
> During TPC-DS test runs with Hive backed by Ozone storage I noticed the
> following:
> As part of data generation we create tables, and with auto partitioning
> turned on, some of the tables have a couple thousand partitions created.
> This process involves a couple of renames of the different data files while
> MR containers are working on to create the data files, and move them into
> their final position.
> After this, when I start to run queries, certain parts of the query run work
> on to split the work for mappers, and in order to do this, they do listStatus
> calls, as I tracked down and understood the code, there is a recursive
> listing is happening, that filters out files on the client side. This means
> that for a table where we have 3K partitions, Hive issues 3K listsSatus calls
> on the partition directories.
> During this phase the query is stuck INITIALIZING containers and query phases
> for a long time, which gave me the possibility to take a couple of jstacks,
> and all of them has 5-10 threads with this particular stack trace:
> {code}
> "IPC Server handler 90 on 9862" #147 daemon prio=5 os_prio=0
> tid=0x00007fec32dd7800 nid=0x19ac2 runnable [0x00007febe13d9000]
> java.lang.Thread.State: RUNNABLE
> at java.util.TreeMap.put(TreeMap.java:575)
> at java.util.TreeSet.add(TreeSet.java:255)
> at
> org.apache.hadoop.ozone.om.KeyManagerImpl.listStatusFindKeyInTableCache(KeyManagerImpl.java:2041)
> at
> org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:2117)
> at
> org.apache.hadoop.ozone.om.OzoneManager.listStatus(OzoneManager.java:2938)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.listStatus(OzoneManagerRequestHandler.java:573)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:197)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:218)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$88/1748961453.apply(Unknown
> Source)
> at
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
> at
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
> at
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
> at
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
> at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
> at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
> at java.security.AccessController.doPrivileged(Native Method)
> at javax.security.auth.Subject.doAs(Subject.java:422)
> at
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
> at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> {code}
> This is happening with a recent (maybe 1 week old tops) master build.
> After I took heapdumps from the OM during query runs, at two different times,
> I saw the following concerning numbers:
> From CacheKey instances we have 78k in the earlier one, and 80K in the older
> one
> From CacheValue instance we have 78K in both.
> Almost all of them has the OMMetadataManager as GC root, and belongs to the
> keytable cache.
> During listStatusFindKeyInTableCache, we iterate through the cache entries
> via the cache iterator, and if the cacheKey is not equal to the keyArgs, and
> cacheValue is null, we add the key to the deletedKeySet.
> Where this method is called, is KeyManager's listStatus method, which for
> every call creates a new deletedKeySet as a TreeSet. Then it goes over all
> keys in the bucket that are cached, and puts together the deletedKeySet. We
> seem to need this deletedKeySet in order to leave out keys that are already
> deleted in the cache but still present in the db from the listing which is a
> valid reason, but causes a significant slowdown (1-2 order of magnitude)
> compared to runtimes after OM restart.
> In the heap dump the cache values are either an instance of Present, or
> Absent, and the majority seems to be Absent. Based on the epoch even present
> values and absent values are from various epochs, from 1 to 300K+.
> Int the TableCacheImpl, we have an evictCache method that happens in an
> executor service, and which should run on DoubleBuffer flush.
> It gets a list of epochs to evict, and it iterates through the epochEntries
> the cache has, and, as we talk about the KeyTable that has a MANUAL eviction
> policy, it will remove the entry from the cache if the entry's epoch is the
> current epoch in the iteration, and if the supplied epochs to evict contain
> the current epoch.
> Epochs come from DoubleBuffer#flushTransactions.
> It starts as a HashMap of Strings and List of Longs, and DoubleBuffer puts
> together the list of epochs for a table during flush, based on table names
> registered for Responses with the CleanupTableInfo annotation.
> The only suspicious thing I found, is that OMFileCreateResponse does not
> define the key table as something to clean up, and it parents neither has it
> defined for cleanup, while the OMFileCreateRequest adds the parents of the
> created file to the key table cache
> [here|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileCreateRequest.java#L299-L301].
--
This message was sent by Atlassian Jira
(v8.3.4#803005)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]