Hi,

Here is a reproducer for master: https://github.com/macrergate/ignite/commit/9a7d2d27af30018a5f6faccb39176a35243ccfa2

Best regards,
Sergey Kosarev.
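For readers who want the shape of the reproducer without opening the commit, here is a minimal sketch of the setup described further down in this thread: 3 server nodes, a client with the system pool shrunk to 3 (the IGNITE-12663 idea of minimizing thread pools), and a compute broadcast whose results carry a user-defined type. All class and type names here are illustrative assumptions, not names taken from the linked commit.

    import org.apache.ignite.Ignite;
    import org.apache.ignite.Ignition;
    import org.apache.ignite.configuration.IgniteConfiguration;
    import org.apache.ignite.lang.IgniteCallable;

    public class SystemPoolDeadlockSetup {
        public static void main(String[] args) {
            // 3 server nodes, as in the tests described below.
            for (int i = 0; i < 3; i++)
                Ignition.start(new IgniteConfiguration()
                    .setIgniteInstanceName("server-" + i));

            // Client node with the system pool deliberately shrunk to 3
            // (the IGNITE-12663 idea of minimizing thread pools).
            Ignite client = Ignition.start(new IgniteConfiguration()
                .setIgniteInstanceName("client")
                .setClientMode(true)
                .setSystemThreadPoolSize(3));

            // Call a job on every server node. Each job returns a
            // user-defined type, so unmarshalling the responses on the
            // client triggers a binary metadata request the first time
            // the type is seen.
            client.compute().broadcast((IgniteCallable<Payload>)Payload::new);
        }

        /** Hypothetical user type carried in the job results. */
        public static class Payload {
        }
    }

On a pool this small, the three response handlers can occupy every system pool worker at once, which is the precondition for the deadlock analyzed below.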
On Tue, 17 Mar 2020 at 12:27, Sergey-A Kosarev <sergey-a.kosa...@db.com> wrote:

> Classification: Public
>
> Hi Sergey,
> The ticket is here: https://issues.apache.org/jira/browse/IGNITE-12793
>
> I will try to make a reproducer in the coming days.
>
> Kind regards,
> Sergey Kosarev
>
> -----Original Message-----
> From: Sergey Chugunov [mailto:sergey.chugu...@gmail.com]
> Sent: 17 March 2020 09:45
> To: dev <dev@ignite.apache.org>
> Subject: Re: deadlock in system pool on meta update
>
> Hello Sergey,
>
> Your analysis looks valid to me; we definitely need to investigate this
> deadlock and find out how to fix it.
>
> Could you create a ticket and write a test that reproduces the issue
> with sufficient probability?
>
> Thanks!
>
> On Mon, Mar 16, 2020 at 8:22 PM Sergey-A Kosarev <sergey-a.kosa...@db.com> wrote:
>
> > Classification: Public
> >
> > Hi,
> > I've recently tried to apply Ilya's idea
> > (https://issues.apache.org/jira/browse/IGNITE-12663) of minimizing
> > thread pools and set the system pool size to 3 in my own tests.
> > It caused a deadlock on a client node, and I think it can happen not
> > only with such small pool values.
> >
> > The details are as follows:
> > I'm not using persistence currently (if it matters).
> > On the client node I use Ignite compute to call a job on every server
> > node (there are 3 server nodes in the tests).
> >
> > Then I found the following in the logs:
> >
> > [10:55:21] : [Step 1/1] [2020-03-13 10:55:21,773] {grid-timeout-worker-#8} [WARN] [o.a.i.i.IgniteKernal] - Possible thread pool starvation detected (no task completed in last 30000ms, is system thread pool size large enough?)
> > [10:55:21] : [Step 1/1] ^-- System thread pool [active=3, idle=0, qSize=9]
> >
> > I see in the thread dumps that all 3 system pool workers are doing the
> > same thing - processing job responses:
> >
> > "sys-#26" #605 daemon prio=5 os_prio=0 tid=0x0000000064a0a800 nid=0x1f34 waiting on condition [0x000000007b91d000]
> >    java.lang.Thread.State: WAITING (parking)
> >     at sun.misc.Unsafe.park(Native Method)
> >     at java.util.concurrent.locks.LockSupport.park(LockSupport.java:304)
> >     at org.apache.ignite.internal.util.future.GridFutureAdapter.get0(GridFutureAdapter.java:177)
> >     at org.apache.ignite.internal.util.future.GridFutureAdapter.get(GridFutureAdapter.java:140)
> >     at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl.metadata(CacheObjectBinaryProcessorImpl.java:749)
> >     at org.apache.ignite.internal.processors.cache.binary.CacheObjectBinaryProcessorImpl$1.metadata(CacheObjectBinaryProcessorImpl.java:250)
> >     at org.apache.ignite.internal.binary.BinaryContext.metadata(BinaryContext.java:1169)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.getOrCreateSchema(BinaryReaderExImpl.java:2005)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:285)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.<init>(BinaryReaderExImpl.java:184)
> >     at org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> >     at org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> >     at org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.readField(BinaryReaderExImpl.java:1982)
> >     at org.apache.ignite.internal.binary.BinaryFieldAccessor$DefaultFinalClassAccessor.read0(BinaryFieldAccessor.java:702)
> >     at org.apache.ignite.internal.binary.BinaryFieldAccessor.read(BinaryFieldAccessor.java:187)
> >     at org.apache.ignite.internal.binary.BinaryClassDescriptor.read(BinaryClassDescriptor.java:887)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1762)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> >     at org.apache.ignite.internal.binary.BinaryUtils.doReadObject(BinaryUtils.java:1797)
> >     at org.apache.ignite.internal.binary.BinaryUtils.deserializeOrUnmarshal(BinaryUtils.java:2160)
> >     at org.apache.ignite.internal.binary.BinaryUtils.doReadCollection(BinaryUtils.java:2091)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize0(BinaryReaderExImpl.java:1914)
> >     at org.apache.ignite.internal.binary.BinaryReaderExImpl.deserialize(BinaryReaderExImpl.java:1714)
> >     at org.apache.ignite.internal.binary.GridBinaryMarshaller.deserialize(GridBinaryMarshaller.java:306)
> >     at org.apache.ignite.internal.binary.BinaryMarshaller.unmarshal0(BinaryMarshaller.java:100)
> >     at org.apache.ignite.marshaller.AbstractNodeNameAwareMarshaller.unmarshal(AbstractNodeNameAwareMarshaller.java:80)
> >     at org.apache.ignite.internal.util.IgniteUtils.unmarshal(IgniteUtils.java:10493)
> >     at org.apache.ignite.internal.processors.task.GridTaskWorker.onResponse(GridTaskWorker.java:828)
> >     at org.apache.ignite.internal.processors.task.GridTaskProcessor.processJobExecuteResponse(GridTaskProcessor.java:1134)
> >
> > As I found while analyzing this stack trace, unmarshalling a user
> > object for the first time (per type) triggers a binary metadata
> > request, even though I've registered the type via
> > BinaryConfiguration.setTypeConfigurations.
> >
> > All these futures will only complete once the corresponding
> > MetadataResponseMessage is received and processed on the client node.
> >
> > But MetadataResponseMessage (GridTopic.TOPIC_METADATA_REQ) is also
> > processed in the system pool (I see that
> > GridIoManager#processRegularMessage routes it there). So it causes a
> > deadlock, as the system pool is already full.
> >
> > I will appreciate any feedback on the topic.
> > Am I just shooting myself in the foot, or should I create a ticket
> > for this?
> > Is it correct that both job responses and metadata responses are
> > processed in the system pool, and if so, how should this case be
> > handled?
> > I suppose that with bigger pool sizes there is still a chance of
> > getting into such a situation if many concurrent jobs are invoked
> > from the same client node. Or is there some golden rule to follow -
> > something like: no more than (system pool capacity - 1) concurrent
> > jobs should be invoked from the same node?
> >
> > Kind regards,
> > Sergey Kosarev
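To make the failure mode concrete: the pattern in the analysis above - tasks in a bounded pool blocking on a result that only the same pool can produce - can be demonstrated with plain java.util.concurrent and no Ignite at all. In this self-contained sketch the names mirror the roles in the report (the pool stands in for the system pool, the future for the pending binary metadata request); it is an illustration of the pattern, not Ignite's actual code path.

    import java.util.concurrent.*;

    public class PoolSelfDeadlockDemo {
        public static void main(String[] args) throws Exception {
            // Stands in for the system pool: 3 workers, as in the report.
            ExecutorService sysPool = Executors.newFixedThreadPool(3);

            // Stands in for the pending binary metadata future that each
            // job-response handler blocks on (GridFutureAdapter.get()
            // in the stack trace above).
            CompletableFuture<Void> metadataResponse = new CompletableFuture<>();

            // Three "job response" tasks occupy all three workers and park.
            for (int i = 0; i < 3; i++) {
                sysPool.submit(() -> {
                    metadataResponse.get(); // parks, like GridFutureAdapter.get0()
                    return null;
                });
            }

            // The "metadata response" task that would complete the future
            // goes into the same pool - and sits in the queue forever,
            // because every worker is parked waiting for it.
            Future<?> response = sysPool.submit(() -> metadataResponse.complete(null));

            // Demonstrate the hang without blocking the demo forever.
            try {
                response.get(2, TimeUnit.SECONDS);
                System.out.println("no deadlock");
            }
            catch (TimeoutException e) {
                System.out.println("deadlock: response task never ran, pool is starved");
            }
            sysPool.shutdownNow(); // interrupt the parked workers so the JVM exits
        }
    }

This also illustrates the "golden rule" question the report ends with: once every worker of a bounded pool blocks on work that is queued behind it in the same pool, no progress is possible at any pool size - a larger pool only lowers the probability of reaching that state.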
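For reference, the registration the report mentions looks something like the sketch below ("com.example.Payload" is a placeholder type name, not one from the reproducer). Per the observation in the thread, listing a type this way registers it with the binary context but does not spare the node the first remote metadata request for it - which is exactly the request that parks the system pool workers in the stack trace above.

    import java.util.Collections;

    import org.apache.ignite.binary.BinaryTypeConfiguration;
    import org.apache.ignite.configuration.BinaryConfiguration;
    import org.apache.ignite.configuration.IgniteConfiguration;

    public class BinaryTypeRegistration {
        // Registers a placeholder type with the binary marshaller; per the
        // report, this alone does not avoid the first metadata request.
        static IgniteConfiguration configure() {
            return new IgniteConfiguration()
                .setBinaryConfiguration(new BinaryConfiguration()
                    .setTypeConfigurations(Collections.singletonList(
                        new BinaryTypeConfiguration("com.example.Payload"))));
        }
    }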