Thanks Martynas. I don't think it quite fits the symptoms, and it's something
we've tried to be careful about, but it's worth another look.
Dave
On 16/02/2021 12:55, Martynas Jusevičius wrote:
Not Fuseki-related per se, but I've experienced something similar when
the HTTP client is running out of connections.
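A minimal sketch of that failure mode, modelling a bounded client-side connection pool with a plain semaphore. The class and method names are hypothetical and this is not the Jena or Apache HttpClient API; it only illustrates how leaked connections make later requests stall on the client side.

```java
import java.util.concurrent.Semaphore;
import java.util.concurrent.TimeUnit;

/** Toy model of a bounded HTTP connection pool (hypothetical, for illustration only). */
public class PoolExhaustionSketch {
    private final Semaphore slots;

    public PoolExhaustionSketch(int maxConnections) {
        this.slots = new Semaphore(maxConnections);
    }

    /** Try to lease a connection; returns false if none frees up within the timeout. */
    public boolean lease(long timeoutMillis) {
        try {
            return slots.tryAcquire(timeoutMillis, TimeUnit.MILLISECONDS);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
    }

    /** Return a connection to the pool, e.g. once the response has been fully consumed or closed. */
    public void release() {
        slots.release();
    }

    public static void main(String[] args) {
        PoolExhaustionSketch pool = new PoolExhaustionSketch(2);
        // Two requests lease connections and never release them (a leak).
        System.out.println("request 1 leased: " + pool.lease(100));
        System.out.println("request 2 leased: " + pool.lease(100));
        // The third request cannot get a connection; with no timeout it would block indefinitely.
        System.out.println("request 3 leased: " + pool.lease(100));
        // Releasing one connection lets the next request through.
        pool.release();
        System.out.println("request 4 leased: " + pool.lease(100));
    }
}
```

In this model the stall is on the client: a starved request never reaches the server at all.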
On Tue, Feb 16, 2021 at 11:50 AM Dave Reynolds
<[email protected]> wrote:
We have a mysterious problem with fuseki in production that we've not
seen before. Posting in case anyone has seen something similar and has
any advice but I realise there's not really much here to go on.
Environment:
Fuseki 3.17 (was 3.16, tried upgrade just in case) using TDB1
OpenJDK java 8
Docker container (running in k8s pod)
AWS EBS file system
O(2k) small updates per day (uses RDFConnection to send update)
Variable read request rate but issue hits at low request levels
Symptoms are that fuseki receives an update request but never completes it:
INFO 550175 POST http://localhost:3030/ds
INFO 550175 Update
INFO 550175 204 No Content (20 ms)
INFO 550176 POST http://localhost:3030/ds
INFO 550176 Update
-->
INFO 550178 Query = ASK { ?s ?p ?o }
INFO 550178 GET http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 GET http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 Query = ASK { ?s ?p ?o }
So no 204 return from request 550176.
From that point on fuseki continues to log incoming read queries but
does not answer any of them and the update request never terminates.
Acts as if there's some form of deadlock.
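One way to test the deadlock hypothesis would be a thread dump of the stuck JVM (for example with jstack). The same check can be sketched in-process with the standard java.lang.management API. The class name and the two-lock demo below are illustrative assumptions, not Fuseki or TDB code, and findDeadlockedThreads only reports cycles on monitors and ownable synchronizers, so a thread parked on a semaphore or blocked on I/O would not show up here.

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;
import java.lang.management.ThreadMXBean;
import java.util.concurrent.CountDownLatch;

public class DeadlockCheckSketch {
    /** Returns the number of threads the JVM reports as deadlocked (0 if none). */
    public static int countDeadlockedThreads() {
        ThreadMXBean mx = ManagementFactory.getThreadMXBean();
        long[] ids = mx.findDeadlockedThreads(); // null when no deadlock is detected
        if (ids == null) {
            return 0;
        }
        for (ThreadInfo info : mx.getThreadInfo(ids)) {
            if (info != null) {
                System.out.println(info.getThreadName() + " waiting on " + info.getLockName()
                        + " held by " + info.getLockOwnerName());
            }
        }
        return ids.length;
    }

    /** Polls until a deadlock is reported or the timeout expires; returns the thread count found. */
    public static int waitForDeadlock(long timeoutMillis) {
        long deadline = System.currentTimeMillis() + timeoutMillis;
        int found = countDeadlockedThreads();
        while (found == 0 && System.currentTimeMillis() < deadline) {
            try {
                Thread.sleep(50);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                break;
            }
            found = countDeadlockedThreads();
        }
        return found;
    }

    /** Demo only: starts two daemon threads that take two locks in opposite order. */
    public static void startDeadlockedPair() {
        final Object a = new Object();
        final Object b = new Object();
        final CountDownLatch bothHoldFirst = new CountDownLatch(2);
        Thread t1 = new Thread(() -> lockInOrder(a, b, bothHoldFirst), "demo-1");
        Thread t2 = new Thread(() -> lockInOrder(b, a, bothHoldFirst), "demo-2");
        t1.setDaemon(true);
        t2.setDaemon(true);
        t1.start();
        t2.start();
    }

    private static void lockInOrder(Object first, Object second, CountDownLatch latch) {
        synchronized (first) {
            latch.countDown();
            try {
                latch.await(); // wait until the other thread holds its first lock
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            synchronized (second) { // blocks forever: the other thread holds this lock
                System.out.println("not reached");
            }
        }
    }

    public static void main(String[] args) {
        startDeadlockedPair();
        System.out.println("deadlocked threads: " + waitForDeadlock(5000));
    }
}
```

If a dump of the hung process shows no such cycle, the writer and readers are more likely waiting on something other than a classic lock-ordering deadlock.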
Update requests are serialised; there's never more than one in flight at
a time.
It's not the update itself that's the issue. It's small and if the
container is restarted with the same data and the same update sequence
is reapplied it all works fine.
The jvm stats all look completely fine in the prometheus records.
The various parts of this set up have been in various production
settings without problems in the past. In particular, we've run the
exact same pattern of mixed updates and queries in fuseki in a k8s
environment for two years without ever having a lockup. But on a new
deployment it's happening every few days.
There are differences between the new and old deployments but the ones
we've identified seem very unlikely to be the cause. We've not used
RDFConnection in the client before but can't see how that could affect
this. We don't often run with TDB on EBS but we do have a dozen
instances of that around which haven't had problems. We have generally
shifted to AWS Corretto as the jvm but we have plenty of OpenJDK
instances around without problems. The docker image is slightly unusual
in using the s6 overlay init system rather than running fuseki as the
root process, but again we can't see how this might cause these symptoms,
and other uses of that setup with fuseki have been fine.
We'll find a workaround eventually, possibly involving shifting to TDB2,
but posting in case anyone has had an experience similar enough to this
to give us some hints.
Dave