[jira] [Resolved] (HDDS-4478) Large deletedKeyset slows down OM via listStatus

Bharat Viswanadham (Jira) Thu, 19 Nov 2020 14:44:34 -0800


     [ 
https://issues.apache.org/jira/browse/HDDS-4478?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Bharat Viswanadham resolved HDDS-4478.
--------------------------------------
    Fix Version/s: 1.1.0
       Resolution: Fixed

> Large deletedKeyset slows down OM via listStatus
> ------------------------------------------------
>
>                 Key: HDDS-4478
>                 URL: https://issues.apache.org/jira/browse/HDDS-4478
>             Project: Hadoop Distributed Data Store
>          Issue Type: Bug
>            Reporter: István Fajth
>            Assignee: István Fajth
>            Priority: Critical
>              Labels: pull-request-available
>             Fix For: 1.1.0
>
>
> During TPC-DS test runs with Hive backed by Ozone storage I noticed the 
> following:
> As part of data generation we create tables, and with auto partitioning 
> turned on, some of the tables have a couple thousand partitions created.
> This process involves a couple of renames of the different data files while 
> MR containers are working on to create the data files, and move them into 
> their final position.
> After this, when I start to run queries, certain parts of the query run work 
> on to split the work for mappers, and in order to do this, they do listStatus 
> calls, as I tracked down and understood the code, there is a recursive 
> listing is happening, that filters out files on the client side. This means 
> that for a table where we have 3K partitions, Hive issues 3K listsSatus calls 
> on the partition directories.
> During this phase the query is stuck INITIALIZING containers and query phases 
> for a long time, which gave me the possibility to take a couple of jstacks, 
> and all of them has 5-10 threads with this particular stack trace:
> {code}
> "IPC Server handler 90 on 9862" #147 daemon prio=5 os_prio=0 
> tid=0x00007fec32dd7800 nid=0x19ac2 runnable [0x00007febe13d9000]
>    java.lang.Thread.State: RUNNABLE
>         at java.util.TreeMap.put(TreeMap.java:575)
>         at java.util.TreeSet.add(TreeSet.java:255)
>         at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.listStatusFindKeyInTableCache(KeyManagerImpl.java:2041)
>         at 
> org.apache.hadoop.ozone.om.KeyManagerImpl.listStatus(KeyManagerImpl.java:2117)
>         at 
> org.apache.hadoop.ozone.om.OzoneManager.listStatus(OzoneManager.java:2938)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.listStatus(OzoneManagerRequestHandler.java:573)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerRequestHandler.handleReadRequest(OzoneManagerRequestHandler.java:197)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequestDirectlyToOM(OzoneManagerProtocolServerSideTranslatorPB.java:218)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.processRequest(OzoneManagerProtocolServerSideTranslatorPB.java:145)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB$$Lambda$88/1748961453.apply(Unknown
>  Source)
>         at 
> org.apache.hadoop.hdds.server.OzoneProtocolMessageDispatcher.processRequest(OzoneProtocolMessageDispatcher.java:87)
>         at 
> org.apache.hadoop.ozone.protocolPB.OzoneManagerProtocolServerSideTranslatorPB.submitRequest(OzoneManagerProtocolServerSideTranslatorPB.java:113)
>         at 
> org.apache.hadoop.ozone.protocol.proto.OzoneManagerProtocolProtos$OzoneManagerService$2.callBlockingMethod(OzoneManagerProtocolProtos.java)
>         at 
> org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:528)
>         at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1070)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:984)
>         at org.apache.hadoop.ipc.Server$RpcCall.run(Server.java:912)
>         at java.security.AccessController.doPrivileged(Native Method)
>         at javax.security.auth.Subject.doAs(Subject.java:422)
>         at 
> org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1876)
>         at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2882)
> {code}
> This is happening with a recent (maybe 1 week old tops) master build.
> After I took heapdumps from the OM during query runs, at two different times, 
> I saw the following concerning numbers:
> From CacheKey instances we have 78k in the earlier one, and 80K in the older 
> one
> From CacheValue instance we have 78K in both.
> Almost all of them has the OMMetadataManager as GC root, and belongs to the 
> keytable cache.
> During listStatusFindKeyInTableCache, we iterate through the cache entries 
> via the cache iterator, and if the cacheKey is not equal to the keyArgs, and 
> cacheValue is null, we add the key to the deletedKeySet.
> Where this method is called, is KeyManager's listStatus method, which for 
> every call creates a new deletedKeySet as a TreeSet. Then it goes over all 
> keys in the bucket that are cached, and puts together the deletedKeySet. We 
> seem to need this deletedKeySet in order to leave out keys that are already 
> deleted in the cache but still present in the db from the listing which is a 
> valid reason, but causes a significant slowdown (1-2 order of magnitude) 
> compared to runtimes after OM restart.
> In the heap dump the cache values are either an instance of Present, or 
> Absent, and the majority seems to be Absent. Based on the epoch even present 
> values and absent values are from various epochs, from 1 to 300K+.
> Int the TableCacheImpl, we have an evictCache method that happens in an 
> executor service, and which should run on DoubleBuffer flush.
> It gets a list of epochs to evict, and it iterates through the epochEntries 
> the cache has, and, as we talk about the KeyTable that has a MANUAL eviction 
> policy, it will remove the entry from the cache if the entry's epoch is the 
> current epoch in the iteration, and if the supplied epochs to evict contain 
> the current epoch.
> Epochs come from DoubleBuffer#flushTransactions.
> It starts as a HashMap of Strings and List of Longs, and DoubleBuffer puts 
> together the list of epochs for a table during flush, based on table names 
> registered for Responses with the CleanupTableInfo annotation.
> The only suspicious thing I found, is that OMFileCreateResponse does not 
> define the key table as something to clean up, and it parents neither has it 
> defined for cleanup, while the OMFileCreateRequest adds the parents of the 
> created file to the key table cache 
> [here|https://github.com/apache/ozone/blob/master/hadoop-ozone/ozone-manager/src/main/java/org/apache/hadoop/ozone/om/request/file/OMFileCreateRequest.java#L299-L301].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] [Resolved] (HDDS-4478) Large deletedKeyset slows down OM via listStatus

Reply via email to