[
https://issues.apache.org/jira/browse/IGNITE-27601?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Denis Chudov updated IGNITE-27601:
----------------------------------
Description:
OOM in chaos testing scenario:
*Chaos configuration:*
* Kill a random pod in the cluster.
* Wait until the pod is ready, node joins physical and logical topology.
* Take a pause for 3 minutes.
* Repeat from step #1.
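For illustration only, the chaos loop above could be sketched roughly as follows. This is not the script used in the test run. The function name `run_chaos` and the callables `list_pods`, `kill_pod`, and `wait_until_joined` are hypothetical placeholders for whatever the real environment uses to list pods, delete a pod, and check that the node is back in the physical and logical topology.

```python
import random
import time
from typing import Callable, List, Optional, Sequence


def run_chaos(
    list_pods: Callable[[], Sequence[str]],      # hypothetical: returns current pod names
    kill_pod: Callable[[str], None],             # hypothetical: kills the given pod
    wait_until_joined: Callable[[str], None],    # hypothetical: blocks until the pod is ready
                                                 # and the node is in both topologies
    rounds: int,
    pause_seconds: float = 180.0,                # the 3-minute pause from the description
    sleep: Callable[[float], None] = time.sleep,
    rng: Optional[random.Random] = None,
) -> List[str]:
    """Run `rounds` kill/wait/pause cycles and return the killed pod names in order."""
    rng = rng or random.Random()
    killed: List[str] = []
    for _ in range(rounds):
        victim = rng.choice(list(list_pods()))   # step 1: pick a random pod and kill it
        kill_pod(victim)
        wait_until_joined(victim)                # step 2: wait for the node to rejoin
        sleep(pause_seconds)                     # step 3: pause before the next round
        killed.append(victim)                    # step 4: repeat
    return killed
```

In a real run the callables would presumably wrap the cluster tooling (for example `kubectl delete pod <name>` for the kill step), and `rounds` would be chosen so that the loop covers the whole replication.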
*Steps to reproduce:*
* Start replication of a large amount of data.
* Start the chaos script and keep it running for the whole duration of the replication.
* Monitor the number of records in the tables.
The heap dump shows that the problem is in the
{{PartitionReplicaListener#txCleanupReadyFutures}} map, which holds ~600 MB in
several {{PartitionReplicaListener}} instances.
!image-2026-01-19-14-56-17-087.png!
*Proposals:*
* proactive tx recovery: IGNITE-27610
* TxCleanupReadyFutureList may be modified to consume less memory.
*Definition of done*
TxCleanupReadyFutureList is modified to consume less memory.
Proactive tx recovery can be done under IGNITE-27610
was:
OOM in chaos testing scenario:
*Chaos configuration:*
* Kill a random pod in the cluster.
* Wait until the pod is ready, node joins physical and logical topology.
* Take a pause for 3 minutes.
* Repeat from step #1.
*Steps to reproduce:*
* Start replication of a large amount of data.
* Start the chaos script and keep it running for the whole duration of the replication.
* Monitor the number of records in the tables.
The heap dump shows that the problem is in the
{{PartitionReplicaListener#txCleanupReadyFutures}} map, which holds ~600 MB in
several {{PartitionReplicaListener}} instances.
!image-2026-01-19-14-56-17-087.png!
> OOM in txs clean up process
> ---------------------------
>
> Key: IGNITE-27601
> URL: https://issues.apache.org/jira/browse/IGNITE-27601
> Project: Ignite
> Issue Type: Bug
> Reporter: Mirza Aliev
> Assignee: Denis Chudov
> Priority: Major
> Labels: ignite-3
> Attachments: image-2026-01-19-14-56-17-087.png
>
>
> OOM in chaos testing scenario:
> *Chaos configuration:*
> * Kill a random pod in the cluster.
> * Wait until the pod is ready, node joins physical and logical topology.
> * Take a pause for 3 minutes.
> * Repeat from step #1.
> *Steps to reproduce:*
> * Start replication of a large amount of data.
> * Start the chaos script and keep it running for the whole duration of the
> replication.
> * Monitor the number of records in the tables.
> The heap dump shows that the problem is in the
> {{PartitionReplicaListener#txCleanupReadyFutures}} map, which holds ~600 MB in
> several {{PartitionReplicaListener}} instances.
> !image-2026-01-19-14-56-17-087.png!
> *Proposals:*
> * proactive tx recovery: IGNITE-27610
> * TxCleanupReadyFutureList may be modified to consume less memory.
> *Definition of done*
> TxCleanupReadyFutureList is modified to consume less memory.
> Proactive tx recovery can be done under IGNITE-27610
--
This message was sent by Atlassian Jira
(v8.20.10#820010)