Hello,

A couple of days ago I encountered a strange phenomenon in our application,
which is based on Apache Ignite.NET 2.14 with persistence (3 nodes, 1 backup
per cache).
Data in a cache started disappearing for seemingly no reason, and the number
of records could halve (220K to 108K) overnight. I spent a couple of
days trying to find a problem in the application and crunched hundreds of
megabytes of application logs, but didn't manage to find a reason to
blame the application. Retention/TTL is not set for the cache. Apache
Ignite logs with the option -DIGNITE_QUIET=false also don't reveal any
anomalies (or I don't know what to look for). The data shares are expected
to be durable (based on Azure Disk) and we never had any issues with them.
RAM utilisation is normal and there's plenty of available RAM.
The Ignite cluster is hosted in a 3 node Kubernetes cluster on Azure.

The question is: how would you recommend investigating issues like this?
What metrics and logs can I check? Is it possible to log and track
individual Remove() operations as well as SQL queries at the Ignite engine
level?

The application has been running on Ignite for years already and we haven't
encountered data loss on this scale before. It's possible that the app
wasn't used as extensively before as it is now and the problem went
unnoticed.

My best,
Alex Avrutin
