Hi Jeremy,

My initial query was to see if this had been observed by others.

To answer some of your questions:

Do we specifically check that an element is added to all nodes
participating in a replicated cache? No, we do not (we take it on trust
that Ignite sorts that out ;) ).

Do we think it is a race condition? No, for three reasons: (1) the grid was
restarted in the interval between the initial addition of the element and
the time the three nodes were failing to perform the Get(); (2) this
particular element failed on the same three nodes across many invocations
of the request over a substantial period; and (3) a subsequent grid restart
fixed the problem.

From our logs we don't see any delays, timeouts or Ignite-logged errors
relating to the Get().

In terms of troubleshooting, this has been a bit tricky. In this instance
only this one element failed, out of many thousands of similar elements
with similar cluster compute requests being made across them, and only
within the window between a pair of grid restarts.

The replicated cache update is just a simple ICache<K, V>.PutAsync() with a
key struct and a byte[] payload. The distributed compute code simply
performs an ICache<K, V>.GetAsync() with the same key struct.
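
Purely as a sketch (reusing the placeholder MyKey struct and cache name
from above; our real code differs in the details), the two paths look
roughly like this:

    using System.Threading.Tasks;
    using Apache.Ignite.Core;

    public static class CacheOps
    {
        // Update path: write the serialized element into the replicated cache.
        public static async Task PutElementAsync(IIgnite ignite, MyKey key, byte[] payload)
        {
            var cache = ignite.GetOrCreateCache<MyKey, byte[]>("replicated-payloads");
            await cache.PutAsync(key, payload);
        }

        // Compute path, executed on each node participating in the query.
        // In Ignite.NET a missing key surfaces as a KeyNotFoundException
        // from Get()/GetAsync().
        public static async Task<byte[]> GetElementAsync(IIgnite ignite, MyKey key)
        {
            var cache = ignite.GetCache<MyKey, byte[]>("replicated-payloads");
            return await cache.GetAsync(key);
        }
    }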

So far it seems like the three failing nodes just temporarily 'forgot' they
had this element, and remembered it again after the restart.

For context, this is the first time we have seen this specific issue on a
system that has been running in production for 2+ years now. We have seen
numerous instances with replicated caches where Ignite has permanently
failed to write some, but not all, copies of an element, and where grid
restarts do not correct the issue. This does not feel like the same
problem, though.

Raymond.

On Thu, Nov 23, 2023 at 6:50 AM Jeremy McMillan <
jeremy.mcmil...@gridgain.com> wrote:

> I suspect a race condition with async mode caches. This is a naive guess
> though, as we don't have enough details. I'll assume this is a plea for
> help in troubleshooting methodology and the question is really "what should
> we look at next?"
>
> The real answer comes from tracing the insert of element E and subsequent
> cache get() failures. Do we know if E was completely inserted into each
> replicated cache backup partition prior to the get()? Do we know if the
> reported cache get() failure was actually a fully functioning cache lookup
> and retrieval that failed during lookup, or were there timeouts or
> exceptions indicating something abnormal was happening?
>
> What steps did you take to troubleshoot this issue, and what is the
> cluster and cache configuration in play? What does the code look like for
> the updates to the replicated cache, and what does the code look like for
> the distributed compute operation?
>
> On Tue, Nov 21, 2023 at 5:21 PM Raymond Wilson <raymond_wil...@trimble.com>
> wrote:
>
>> Hi,
>>
>> We have been triaging an odd issue we encountered in a system using
>> Ignite v2.15 and the C# client.
>>
>> We have a replicated cache across four nodes; let's call them P0, P1, P2 &
>> P3. Because the cache is replicated, every item added to the cache is
>> present in each of P0, P1, P2 and P3.
>>
>> Some time ago an element (E) was added to this cache (among many others).
>> A number of system restarts have occurred since that time.
>>
>> We started observing an issue where a query running across P0/P1/P2/P3 as
>> a cluster compute operation needed to load element E on each of the nodes
>> to perform that query. Node P0 succeeded, while nodes P1, P2 & P3 all
>> reported that element E did not exist.
>>
>> This situation persisted until the cluster was restarted, after which the
>> same query that had been failing now succeeded as all four 'P' nodes were
>> able to read element E.
>>
>> There were no Ignite errors reported in the context of these
>> failing queries to indicate unhappiness in the Ignite nodes.
>>
>> This seems like very strange behaviour. Are there any suggestions as to
>> what could be causing this failure to read the replicated value on the
>> three failing nodes, especially as the element 'came back' after a cluster
>> restart?
>>
>> Thanks,
>> Raymond.
>>
>

-- 
Raymond Wilson
Trimble Distinguished Engineer, Civil Construction Software (CCS)
11 Birmingham Drive | Christchurch, New Zealand
raymond_wil...@trimble.com
