[
https://issues.apache.org/jira/browse/SOLR-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496243#comment-17496243
]
Mike Drob commented on SOLR-16043:
----------------------------------
The output from the two different data node shutdowns might be telling us
something:
{noformat}
2> 233391 INFO
(org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@622864f7) []
o.a.h.h.s.d.DataNode Closing all peers.
2> 233395 WARN (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.DataNode Ending block pool service
for: Block pool BP-2039008531-127.0.0.1-1645493950129 (Datanode Uuid
20c3c955-3e3b-43c4-a6c8-26417ae3ea96) service to localhost/127.0.0.1:55316
2> 233395 INFO (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.DataNode Removed Block pool
BP-2039008531-127.0.0.1-1645493950129 (Datanode Uuid
20c3c955-3e3b-43c4-a6c8-26417ae3ea96)
2> 233395 INFO (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.f.i.FsDatasetImpl Removing block pool
BP-2039008531-127.0.0.1-1645493950129
2> 233404 INFO (Listener at localhost/55365) [] o.a.h.h.s.d.DataNode Waiting
up to 30 seconds for transfer threads to complete
2> 233404 INFO (Listener at localhost/55365) [] o.a.h.i.Server Stopping
server on 55351
2> 233407 INFO (IPC Server listener on 0) [] o.a.h.i.Server Stopping IPC
Server listener on 0
2> 233407 INFO (IPC Server Responder) [] o.a.h.i.Server Stopping IPC Server
Responder
2> 233408 INFO (Listener at localhost/55365) [] o.a.h.h.s.d.DataNode
Shutdown complete.
2> 233408 INFO
(org.apache.hadoop.hdfs.server.datanode.DataXceiverServer@61d48b09) []
o.a.h.h.s.d.DataNode Closing all peers.
2> 233422 INFO (Listener at localhost/55365) [] o.a.h.h.s.d.DataNode Waiting
up to 30 seconds for transfer threads to complete
2> 233422 INFO (Listener at localhost/55365) [] o.a.h.i.Server Stopping
server on 55365
2> 233426 INFO (IPC Server listener on 0) [] o.a.h.i.Server Stopping IPC
Server listener on 0
2> 233426 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor
encountered interrupt and exit.
2> 233426 INFO (IPC Server Responder) [] o.a.h.i.Server Stopping IPC Server
Responder
2> 233426 WARN (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.IncrementalBlockReportManager
IncrementalBlockReportManager interrupted
2> 233426 WARN (Command processor) [] o.a.h.h.s.d.DataNode Ending command
processor service for: Thread[Command processor,5,TGRP-HdfsDirectoryTest]
2> 233426 WARN (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.DataNode Ending block pool service
for: Block pool BP-2039008531-127.0.0.1-1645493950129 (Datanode Uuid
a7783809-4beb-4df0-b349-0e44cdaa3719) service to localhost/127.0.0.1:55316
2> 233426 INFO (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.DataNode Removed Block pool
BP-2039008531-127.0.0.1-1645493950129 (Datanode Uuid
a7783809-4beb-4df0-b349-0e44cdaa3719)
2> 233427 INFO (BP-2039008531-127.0.0.1-1645493950129 heartbeating to
localhost/127.0.0.1:55316) [] o.a.h.h.s.d.f.i.FsDatasetImpl Removing block pool
BP-2039008531-127.0.0.1-1645493950129
2> 233430 INFO (Listener at localhost/55365) [] o.a.h.h.s.d.DataNode
Shutdown complete.
{noformat}
For one of them, the block pool service looks like it is naturally terminated,
for the other the command processor thread is interrupted and _then_ the block
pool service can finally stop. Unfortunately, without a little bit better
logging I don't have an easy way to tell which Command Processor is the one
that is actually leaking.
> HDFS tests - "Command processor" thread leak
> --------------------------------------------
>
> Key: SOLR-16043
> URL: https://issues.apache.org/jira/browse/SOLR-16043
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Components: hdfs, Tests
> Reporter: Kevin Risden
> Priority: Major
> Attachments: Solr-main-MACOSX-678_consoleText.txt.gz
>
>
> This is caused by SOLR-15942. Some background:
> * originally discussed in 3.3.1 PR -
> https://github.com/apache/solr/pull/553#discussion_r792068327
> * Failures so far from jenkins
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText
> It looks like it might be trying to shut down:
> {code:java}
> 2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor
> encountered interrupt and exit.
> 2> 31835 WARN (BP-1491942626-127.0.0.1-1645493532707 heartbeating to
> localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager
> IncrementalBlockReportManager interrupted
> 2> 31835 WARN (Command processor) [] o.a.h.h.s.d.DataNode Ending command
> processor service for: Thread[Command
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> {code}
> the error message looks like (this is from
> https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText which is
> attached [^Solr-main-MACOSX-678_consoleText.txt.gz] ):
> {code:java}
> 2> 31948 WARN (Listener at localhost/51675) []
> o.a.h.h.s.d.f.i.FsDatasetAsyncDiskService AsyncDiskService has already shut
> down.
> 2> 31948 WARN (Listener at localhost/51675) []
> o.a.h.h.s.d.f.i.RamDiskAsyncLazyPersistService AsyncLazyPersistService has
> already shut down.
> 2> 31948 INFO (Listener at localhost/51675) [] o.a.h.h.s.d.DataNode
> Shutdown complete.
> 2> 32290 INFO (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter
> Closing ErrorLogMuter-regex-1 after mutting 0 log messages
> 2> 32290 INFO (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter
> Creating ErrorLogMuter-regex-2 for ERROR logs matching regex: ignore_exception
> 2> Feb 22, 2022 1:32:34 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
> 2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
> 2> SEVERE: 1 thread leaked from SUITE scope at
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest:
> 2> 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
> 2> at
> [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
> 2> at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
> 2> at
> [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
> 2> at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
> 2> at
> [email protected]/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
> 2> at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
> 2> at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
> 2> INFO: Starting to interrupt leaked threads:
> 2> 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> 33504 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command
> processor encountered interrupt and exit.
> 2> 33504 WARN (Command processor) [] o.a.h.h.s.d.DataNode Ending command
> processor service for: Thread[Command
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
> 2> INFO: All leaked threads terminated.
> > com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked
> from SUITE scope at
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest:
> > 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> > at [email protected]/jdk.internal.misc.Unsafe.park(Native
> Method)
> > at
> [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
> > at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
> > at
> [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
> > at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
> > at
> [email protected]/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
> > at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
> > at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
> > at __randomizedtesting.SeedInfo.seed([1A5FAFA6CE7F7FF0]:0)
> 2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{},
> maxPointsInLeafNode=757, maxMBSortInHeap=5.99773708873607,
> sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=sah-RU,
> timezone=Asia/Macau
> 2> NOTE: Mac OS X 10.14.6 x86_64/Eclipse Foundation 16.0.2
> (64-bit)/cpus=6,threads=40,free=159527368,total=271581184
> 2> NOTE: All tests run in this JVM: [HdfsBackupRepositoryIntegrationTest]
> 2> NOTE: reproduce with: gradlew test --tests
> HdfsBackupRepositoryIntegrationTest -Dtests.seed=1A5FAFA6CE7F7FF0
> -Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=Asia/Macau
> -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Some related code pointers:
> *
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
> *
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017
> PS: there is an error about datanode mbean - but it is unrelated as far as I
> can tell - https://issues.apache.org/jira/browse/HDFS-11041
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]