[ https://issues.apache.org/jira/browse/HDFS-14292?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16773208#comment-16773208 ]
BELUGA BEHR commented on HDFS-14292: ------------------------------------ Ya, finally found it. I'm not sure that this is the issue with all of my failing unit tests, but it's a start: {code:java|title=LocalReplicaInPipeline} Thread thread = writer.get(); if ((thread == null) || (thread == Thread.currentThread()) || (!thread.isAlive())) { if (writer.compareAndSet(thread, null)) { return; // Done } // The writer changed. Go back to the start of the loop and attempt to // stop the new writer. continue; } thread.interrupt(); try { thread.join(xceiverStopTimeout); if (thread.isAlive()) { // Our thread join timed out. final String msg = "Join on writer thread " + thread + " timed out"; DataNode.LOG.warn(msg + "\n" + StringUtils.getStackTrace(thread)); throw new IOException(msg); } {code} [https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/LocalReplicaInPipeline.java#L265-L268] Interrupts the thread and waits for it to die ({{join}}), only now I've got a thread pool in my branch so the thread is re-used, it does not die. It waits here until timeout then fails. I'm not sure what the fix is, but I'm thinking about it. > Introduce Java ExecutorService to DataXceiverServer > --------------------------------------------------- > > Key: HDFS-14292 > URL: https://issues.apache.org/jira/browse/HDFS-14292 > Project: Hadoop HDFS > Issue Type: Improvement > Components: datanode > Affects Versions: 3.2.0 > Reporter: BELUGA BEHR > Assignee: BELUGA BEHR > Priority: Major > Attachments: HDFS-14292.1.patch, HDFS-14292.2.patch, > HDFS-14292.3.patch > > > I wanted to investigate {{dfs.datanode.max.transfer.threads}} from > {{hdfs-site.xml}}. It is described as "Specifies the maximum number of > threads to use for transferring data in and out of the DN." The default > value is 4096. I found it interesting because 4096 threads sounds like a lot > to me. I'm not sure how a system with 8-16 cores would react to this large a > thread count. Intuitively, I would say that the overhead of context > switching would be immense. > During mt investigation, I discovered the > [following|https://github.com/apache/hadoop/blob/trunk/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataXceiverServer.java#L203-L216] > setup in the {{DataXceiverServer}} class: > # A peer connects to a DataNode > # A new thread is spun up to service this connection > # The thread runs to completion > # The tread dies > It would perhaps be better if we used a thread pool to better manage the > lifecycle of the service threads and to allow the DataNode to re-use existing > threads, saving on the need to create and spin-up threads on demand. > In this JIRA, I have added a couple of things: > # Added a thread pool to {{DataXceiverServer}} class that, on demand, will > create up to {{dfs.datanode.max.transfer.threads}}. A thread that has > completed its prior duties will stay idle for up to 60 seconds > (configurable), it will be retired if no new work has arrived. > # Added new methods to the {{Peer}} Interface to allow for better logging and > less code within each Thread ({{DataXceiver}}). > # Updated the Thread code ({{DataXceiver}}) regarding its interactions with > {{blockReceiver}} instance variable -- This message was sent by Atlassian JIRA (v7.6.3#76005) --------------------------------------------------------------------- To unsubscribe, e-mail: hdfs-issues-unsubscr...@hadoop.apache.org For additional commands, e-mail: hdfs-issues-h...@hadoop.apache.org