[
https://issues.apache.org/jira/browse/SOLR-16043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17496383#comment-17496383
]
Mike Drob commented on SOLR-16043:
----------------------------------
Some observations - I have a patched/hacked version of hdfs libs that I'm using
locally to test this, not much progress so far. I think that decreasing the
number of data nodes makes this less likely to happen because the problem is a
datanode shutdown problem and not really a solr problem. It still occurs, but
it's less flaky for us on the surface.
If adding the command processor to the thread leak filter solves this problem
for now, then I'm ok with that approach. We can revisit it later after HDFS
fixes it eventually?
> HDFS tests - "Command processor" thread leak
> --------------------------------------------
>
> Key: SOLR-16043
> URL: https://issues.apache.org/jira/browse/SOLR-16043
> Project: Solr
> Issue Type: Task
> Security Level: Public(Default Security Level. Issues are Public)
> Components: hdfs, Tests
> Reporter: Kevin Risden
> Assignee: Kevin Risden
> Priority: Major
> Attachments:
> OUTPUT-org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest.txt.gz,
> Solr-main-MACOSX-678_consoleText.txt.gz
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> This is caused by SOLR-15942. Some background:
> * originally discussed in 3.3.1 PR -
> https://github.com/apache/solr/pull/553#discussion_r792068327
> * Failures so far from jenkins
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3116/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-Linux/3109/consoleText
> ** https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText
> It looks like it might be trying to shut down:
> {code:java}
> 2> 31835 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command processor
> encountered interrupt and exit.
> 2> 31835 WARN (BP-1491942626-127.0.0.1-1645493532707 heartbeating to
> localhost/127.0.0.1:51572) [] o.a.h.h.s.d.IncrementalBlockReportManager
> IncrementalBlockReportManager interrupted
> 2> 31835 WARN (Command processor) [] o.a.h.h.s.d.DataNode Ending command
> processor service for: Thread[Command
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> {code}
> the error message looks like (this is from
> https://jenkins.thetaphi.de/job/Solr-main-MacOSX/678/consoleText which is
> attached [^Solr-main-MACOSX-678_consoleText.txt.gz] ):
> {code:java}
> 2> 31948 WARN (Listener at localhost/51675) []
> o.a.h.h.s.d.f.i.FsDatasetAsyncDiskService AsyncDiskService has already shut
> down.
> 2> 31948 WARN (Listener at localhost/51675) []
> o.a.h.h.s.d.f.i.RamDiskAsyncLazyPersistService AsyncLazyPersistService has
> already shut down.
> 2> 31948 INFO (Listener at localhost/51675) [] o.a.h.h.s.d.DataNode
> Shutdown complete.
> 2> 32290 INFO (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter
> Closing ErrorLogMuter-regex-1 after mutting 0 log messages
> 2> 32290 INFO (Listener at localhost/51675) [] o.a.s.u.ErrorLogMuter
> Creating ErrorLogMuter-regex-2 for ERROR logs matching regex: ignore_exception
> 2> Feb 22, 2022 1:32:34 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
> 2> WARNING: Will linger awaiting termination of 1 leaked thread(s).
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl checkThreadLeaks
> 2> SEVERE: 1 thread leaked from SUITE scope at
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest:
> 2> 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> at [email protected]/jdk.internal.misc.Unsafe.park(Native Method)
> 2> at
> [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
> 2> at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
> 2> at
> [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
> 2> at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
> 2> at
> [email protected]/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
> 2> at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
> 2> at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
> 2> INFO: Starting to interrupt leaked threads:
> 2> 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> 33504 ERROR (Command processor) [] o.a.h.h.s.d.DataNode Command
> processor encountered interrupt and exit.
> 2> 33504 WARN (Command processor) [] o.a.h.h.s.d.DataNode Ending command
> processor service for: Thread[Command
> processor,5,TGRP-HdfsBackupRepositoryIntegrationTest]
> 2> Feb 22, 2022 1:32:35 AM
> com.carrotsearch.randomizedtesting.ThreadLeakControl tryToInterruptAll
> 2> INFO: All leaked threads terminated.
> > com.carrotsearch.randomizedtesting.ThreadLeakError: 1 thread leaked
> from SUITE scope at
> org.apache.solr.core.backup.repository.HdfsBackupRepositoryIntegrationTest:
> > 1) Thread[id=90, name=Command processor, state=WAITING,
> group=TGRP-HdfsBackupRepositoryIntegrationTest]
> > at [email protected]/jdk.internal.misc.Unsafe.park(Native
> Method)
> > at
> [email protected]/java.util.concurrent.locks.LockSupport.park(LockSupport.java:341)
> > at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionNode.block(AbstractQueuedSynchronizer.java:505)
> > at
> [email protected]/java.util.concurrent.ForkJoinPool.managedBlock(ForkJoinPool.java:3137)
> > at
> [email protected]/java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.await(AbstractQueuedSynchronizer.java:1614)
> > at
> [email protected]/java.util.concurrent.LinkedBlockingQueue.take(LinkedBlockingQueue.java:435)
> > at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.processQueue(BPServiceActor.java:1331)
> > at
> app//org.apache.hadoop.hdfs.server.datanode.BPServiceActor$CommandProcessingThread.run(BPServiceActor.java:1315)
> > at __randomizedtesting.SeedInfo.seed([1A5FAFA6CE7F7FF0]:0)
> 2> NOTE: test params are: codec=Asserting(Lucene90): {}, docValues:{},
> maxPointsInLeafNode=757, maxMBSortInHeap=5.99773708873607,
> sim=Asserting(RandomSimilarity(queryNorm=false): {}), locale=sah-RU,
> timezone=Asia/Macau
> 2> NOTE: Mac OS X 10.14.6 x86_64/Eclipse Foundation 16.0.2
> (64-bit)/cpus=6,threads=40,free=159527368,total=271581184
> 2> NOTE: All tests run in this JVM: [HdfsBackupRepositoryIntegrationTest]
> 2> NOTE: reproduce with: gradlew test --tests
> HdfsBackupRepositoryIntegrationTest -Dtests.seed=1A5FAFA6CE7F7FF0
> -Dtests.slow=true -Dtests.locale=sah-RU -Dtests.timezone=Asia/Macau
> -Dtests.asserts=true -Dtests.file.encoding=UTF-8
> {code}
> Some related code pointers:
> *
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/BPServiceActor.java#L1336
> *
> https://github.com/apache/hadoop/blame/rel/release-3.3.1/hadoop-hdfs-project/hadoop-hdfs/src/main/java/org/apache/hadoop/hdfs/server/datanode/DataNode.java#L2017
> PS: there is an error about datanode mbean - but it is unrelated as far as I
> can tell - https://issues.apache.org/jira/browse/HDFS-11041
--
This message was sent by Atlassian Jira
(v8.20.1#820001)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]