Hello Sergey,

Your analysis looks valid to me; we definitely need to investigate this
deadlock and find out how to fix it.

Could you create a ticket and write a test that reproduces the issue with
sufficient probability?
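
For reference, here is a minimal sketch of what such a reproducer could
look like. The class and type names (SystemPoolDeadlockReproducer,
MyUserType) are hypothetical; the pool size and job shape follow your
description, and it assumes 3 server nodes are already up with peer class
loading enabled (or the classes deployed on the servers):

import java.util.Collection;
import java.util.Collections;
import java.util.List;

import org.apache.ignite.Ignite;
import org.apache.ignite.Ignition;
import org.apache.ignite.configuration.IgniteConfiguration;

public class SystemPoolDeadlockReproducer {
    /** Hypothetical user type; any type the client has not yet deserialized works. */
    static class MyUserType implements java.io.Serializable {
        int payload = 42;
    }

    public static void main(String[] args) {
        IgniteConfiguration clientCfg = new IgniteConfiguration()
            .setIgniteInstanceName("client")
            .setClientMode(true)
            .setSystemThreadPoolSize(3); // the small pool from the report

        try (Ignite client = Ignition.start(clientCfg)) {
            // Each job returns a collection of a user type the client has
            // never deserialized, so unmarshalling every response forces a
            // binary metadata request from a system-pool worker.
            Collection<List<MyUserType>> results = client.compute()
                .broadcast(() -> Collections.singletonList(new MyUserType()));

            // On deadlock this line is never reached.
            System.out.println("Results: " + results.size());
        }
    }
}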

Thanks!

On Mon, Mar 16, 2020 at 8:22 PM Sergey-A Kosarev <sergey-a.kosa...@db.com>
wrote:

> Hi,
> I've recently tried to apply Ilya's idea (
> https://issues.apache.org/jira/browse/IGNITE-12663) of minimizing thread
> pools, and tried setting the system pool size to 3 in my own tests.
> It caused a deadlock on a client node, and I think it can happen not only
> with such small pool values.
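> (For reference, shrinking the pool is just the standard setter - a
> sketch:)
>
> IgniteConfiguration cfg = new IgniteConfiguration()
>     .setClientMode(true)
>     .setSystemThreadPoolSize(3); // default is max(8, CPU count),
>                                  // if I recall correctly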
>
> Details are as follows:
> I'm not using persistence currently (if it matters).
> On the client node I use Ignite compute to call a job on every server
> node (there are 3 server nodes in the tests).
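>
> Roughly this shape (simplified; MyJob and MyResult are placeholders for
> my real classes):
>
> // One job per server node; responses are unmarshalled on the client.
> IgniteCompute compute = client.compute(client.cluster().forServers());
> Collection<MyResult> results = compute.broadcast(new MyJob());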
>
> Then I found the following in the logs:
>
> [10:55:21] :     [Step 1/1] [2020-03-13 10:55:21,773] {
> grid-timeout-worker-#8} [WARN] [o.a.i.i.IgniteKernal] - Possible thread
> pool starvation detected (no task completed in last 30000ms, is system
> thread pool size large enough?)
> [10:55:21] :     [Step 1/1]     ^-- System thread pool [active=3, idle=0,
> qSize=9]
>
>
> I see in the thread dumps that all 3 system pool workers are doing the
> same thing - processing job responses:
>
> "sys-#26" #605 daemon prio=5 os_prio=0 tid=0x0000000064a0a800 nid=0x1f34
> waiting on condition [0x000000007b91d000]
>    java.lang.Thread.State: WAITING (parking)
>                 at sun.misc.Unsafe.park(Native Method)
>                 at
> java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
>                 at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
>                 at
> org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
>                 at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.metadata(CacheObjectBinaryProcessorImpl.java:749)
>                 at
> org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$1.metadata(CacheObjectBinaryProcessorImpl.java:250)
>                 at
> org.apache.ignite.internal.binary.BinaryContext.metadata(BinaryContext.java:1169)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.getOrCreateSchema(BinaryReaderExImpl.java:2005)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:285)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:184)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.readField(BinaryReaderExImpl.java:1982)
>                 at
> org.apache.ignite.internal.binary.BinaryFieldAccessor$DefaultFinalClassAccessor.read0(BinaryFieldAccessor.java:702)
>                 at
> org.apache.ignite.internal.binary.BinaryFieldAccessor.read(BinaryFieldAccessor.java:187)
>                 at
> org.apache.ignite.internal.binary.BinaryClassDescriptor.read(BinaryClassDescriptor.java:887)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1762)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
>                 at
> org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
>                 at
> org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
>                 at
> org.apache.ignite.internal.binary.GridBinaryMarshaller.deserialize(GridBinaryMarshaller.java:306)
>                 at
> org.apache.ignite.internal.binary.BinaryMarshaller.unmarshal0(BinaryMarshaller.java:100)
>                 at
> org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:80)
>                 at
> org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10493)
>                 at
> org.apache.ignite.internal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:828)
>                 at
> org.apache.ignite.internal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:1134)
>
>
> As I found while analyzing this stack trace, unmarshalling a user object
> for the first time (per type) triggers a binary metadata request (even
> though I've registered this type via
> BinaryConfiguration.setTypeConfigurations).
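>
> (The registration I mean is of this shape - MyUserType stands in for my
> real class:)
>
> BinaryConfiguration binCfg = new BinaryConfiguration();
> binCfg.setTypeConfigurations(Collections.singletonList(
>     new BinaryTypeConfiguration(MyUserType.class.getName())));
> cfg.setBinaryConfiguration(binCfg);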
>
> All these futures will be completed only after the corresponding
> MetadataResponseMessage is received and processed on the client node.
>
> But MetadataResponseMessage (GridTopic.TOPIC_METADATA_REQ) is also
> processed in the system pool (I can see that
> GridIoManager#processRegularMessage routes it there).
> So this causes a deadlock, as the system pool is already full.
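>
> The shape of the problem, reduced to plain java.util.concurrent (a
> sketch, not Ignite code):
>
> import java.util.concurrent.*;
>
> ExecutorService sysPool = Executors.newFixedThreadPool(3);
> CompletableFuture<String> metadata = new CompletableFuture<>();
>
> // Three "job response" tasks occupy every worker, each parked on the
> // metadata future...
> for (int i = 0; i < 3; i++)
>     sysPool.submit(() -> metadata.get());
>
> // ...while the "metadata response" task that would complete the future
> // is queued into the same pool, behind the blocked workers. It never
> // runs: a pool self-deadlock.
> sysPool.submit(() -> metadata.complete("binary metadata"));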
>
> I will appreciate any feedback on the topic.
> Am I just shooting myself in the foot, or should I create a ticket for
> this?
> Is it correct that both job responses and metadata responses are
> processed in the system pool, and if so, how should this case be handled?
> I suppose that with bigger pool values there is still a chance of getting
> into such a situation if many concurrent jobs are invoked from the same
> client node. Or is there some golden rule to follow - something like: no
> more than (capacity of the system pool - 1) concurrent jobs should be
> invoked from the same node?
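>
> If that rule is real, a client-side guard would be something like the
> sketch below (assuming the rule holds; note that one broadcast to N
> servers already counts as N jobs, so with 3 servers and a pool of 3 even
> a single broadcast violates it):
>
> int serverCnt = client.cluster().forServers().nodes().size();
> Semaphore jobSlots = new Semaphore(cfg.getSystemThreadPoolSize() - 1);
>
> // Each broadcast yields one response per server node, and each response
> // can occupy a system-pool worker while it unmarshals.
> jobSlots.acquire(serverCnt);
> try {
>     compute.broadcast(new MyJob());
> } finally {
>     jobSlots.release(serverCnt);
> }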
>
> Kind regards,
> Sergey Kosarev