[ 
https://issues.apache.org/jira/browse/IGNITE-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Denis Chudov updated IGNITE-27601:
----------------------------------
    Description: 
OOM in chaos testing scenario:

*Chaos configuration:*
 * Kill a random pod in the cluster.

 * Wait until the pod is ready and the node has joined the physical and logical topologies.

 * Pause for 3 minutes.

 * Repeat from step #1.
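
The chaos loop above could be sketched roughly as follows. This is only an illustration: {{ChaosTarget}}, {{ChaosLoop}} and all of their methods are hypothetical names made up for this sketch, not part of Ignite or of any Kubernetes client, and the actual chaos script is not shown in this ticket.

```java
import java.time.Duration;

/** Hypothetical cluster hooks; these names are assumptions, not a real API. */
interface ChaosTarget {
    /** Kills a random pod and returns its name. */
    String killRandomPod();

    /** Blocks until the given pod reports ready. */
    void awaitPodReady(String pod);

    /** Blocks until the node on that pod has joined the physical and logical topology. */
    void awaitNodeInTopology(String pod);
}

final class ChaosLoop {
    /** Runs the kill / wait / pause cycle the given number of times and returns the completed rounds. */
    static int run(ChaosTarget target, Duration pause, int iterations) throws InterruptedException {
        int completed = 0;
        for (int i = 0; i < iterations; i++) {
            String pod = target.killRandomPod();   // kill a random pod
            target.awaitPodReady(pod);             // wait until the pod is ready
            target.awaitNodeInTopology(pod);       // wait until the node is back in the topology
            Thread.sleep(pause.toMillis());        // pause (3 minutes in the scenario above)
            completed++;
        }
        return completed;
    }
}
```

With a real implementation of the hooks, the scenario from this ticket would correspond to something like {{ChaosLoop.run(target, Duration.ofMinutes(3), n)}} running for as long as the replication lasts.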

*Steps to reproduce:*
 * Start replication of a large amount of data.

 * Start the chaos script and keep it running for the whole duration of the replication.

 * Monitor the number of records in the tables.

The heap dump shows that the problem is in the 
{{PartitionReplicaListener#txCleanupReadyFutures}} map, which holds ~600 MB in 
several {{PartitionReplicaListener}} instances.

!image-2026-01-19-14-56-17-087.png!

*Proposals:*
 * proactive tx recovery: IGNITE-27610
 * TxCleanupReadyFutureList may be modified to consume less memory.

*Definition of done*
TxCleanupReadyFutureList is modified to consume less memory.
Proactive tx recovery can be done under IGNITE-27610.
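
One possible direction for the memory reduction, as a minimal sketch only: keep futures per transaction just while they are in flight and drop the whole map entry as soon as its last future completes, so that retained memory follows the number of pending operations rather than the number of transactions ever seen. The internals of {{TxCleanupReadyFutureList}} are not shown in this ticket, so the class {{InFlightTxFutures}} and its methods below are hypothetical and are not Ignite code.

```java
import java.util.Set;
import java.util.UUID;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical tracker of in-flight operation futures per transaction (not Ignite's implementation). */
final class InFlightTxFutures {
    private final ConcurrentHashMap<UUID, Set<CompletableFuture<?>>> byTx = new ConcurrentHashMap<>();

    /** Tracks an operation future; it is untracked automatically once it completes. */
    void register(UUID txId, CompletableFuture<?> op) {
        byTx.compute(txId, (id, set) -> {
            if (set == null) {
                set = ConcurrentHashMap.newKeySet();
            }
            set.add(op);
            return set;
        });

        // On completion, remove the future; returning null drops the map entry
        // when the transaction has no pending operations left.
        op.whenComplete((res, err) -> byTx.computeIfPresent(txId, (id, set) -> {
            set.remove(op);
            return set.isEmpty() ? null : set;
        }));
    }

    /**
     * Returns a future that completes when every operation pending at call time is done.
     * It completes exceptionally if any of those operations failed.
     */
    CompletableFuture<Void> cleanupReady(UUID txId) {
        Set<CompletableFuture<?>> set = byTx.get(txId);
        if (set == null) {
            return CompletableFuture.completedFuture(null);
        }
        return CompletableFuture.allOf(set.toArray(new CompletableFuture<?>[0]));
    }

    /** Number of transactions that still have pending operations. */
    int trackedTxCount() {
        return byTx.size();
    }
}
```

In this sketch a transaction with no pending operations occupies no space in the map at all. Whether that matches the invariants required by the real cleanup path is an open question for the actual fix.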

  was:
OOM in chaos testing scenario:

*Chaos configuration:*
 * Kill a random pod in the cluster.

 * Wait until the pod is ready, node joins physical and logical topology.

 * Take a pause for 3 minutes.

 * Repeat from step #1.

*Steps to reproduce:*
 * Start the replication from a huge amount of data

 * Start the chaos script that will work for the whole duration of replication.

 * Monitor the number of records in the tables.

Heapdump shows that the problem is in the 
{{PartitionReplicaListener#txCleanupReadyFutures}} map that holds ~600 mb in 
several {{{}PartitionReplicaListener{}}}.

!image-2026-01-19-14-56-17-087.png!

*Proposals:*
 * proactive tx recovery: IGNITE-27610
 * 
TxCleanupReadyFutureList may be modified to consume less memory.

*Definition of done*
TxCleanupReadyFutureList is modified to consume less memory.
Proactive tx recovery can be done under IGNITE-27610


> OOM in txs clean up process
> ---------------------------
>
>                 Key: IGNITE-27601
>                 URL: https://issues.apache.org/jira/browse/IGNITE-27601
>             Project: Ignite
>          Issue Type: Bug
>            Reporter: Mirza Aliev
>            Assignee: Denis Chudov
>            Priority: Major
>              Labels: ignite-3
>         Attachments: image-2026-01-19-14-56-17-087.png
>
>
> OOM in chaos testing scenario:
> *Chaos configuration:*
>  * Kill a random pod in the cluster.
>  * Wait until the pod is ready and the node has joined the physical and logical topologies.
>  * Pause for 3 minutes.
>  * Repeat from step #1.
> *Steps to reproduce:*
>  * Start replication of a large amount of data.
>  * Start the chaos script and keep it running for the whole duration of the 
> replication.
>  * Monitor the number of records in the tables.
> The heap dump shows that the problem is in the 
> {{PartitionReplicaListener#txCleanupReadyFutures}} map, which holds ~600 MB in 
> several {{PartitionReplicaListener}} instances.
> !image-2026-01-19-14-56-17-087.png!
> *Proposals:*
>  * proactive tx recovery: IGNITE-27610
>  * TxCleanupReadyFutureList may be modified to consume less memory.
> *Definition of done*
> TxCleanupReadyFutureList is modified to consume less memory.
> Proactive tx recovery can be done under IGNITE-27610



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
