On 18/02/2021 11:04, Andy Seaborne wrote:
One clarification question - if you restart the container, did the update happen (and no response) or did the update not happen? That's a clue as to where the problem is occurring.
Good point, and we don't know. The container is using an ephemeral-backed emptyDir volume. When the pod restarts it has a clean slate, loads the latest backup image and replays the missed updates, which at that point all go through. We didn't take a copy of the storage area before doing the restart to allow investigations like that (in the panic of restoring a production service). Will try that next time.
One of our priorities is to find a way to trigger the problem more reliably in a test set-up so we can get more information, and that's a good suggestion of something to check if we can catch it in the act.
[We did run the same configuration in the same cluster for months before it was released, including simulating the traffic patterns as closely as we could, with no hint of a problem. So replicating it reliably may be tricky :( ]
Dave
Andy

On 17/02/2021 17:55, Dave Reynolds wrote:

Hi Andy,

Thanks for the comments. We'll investigate further and will update the thread if we get a resolution. Some responses inline.

Dave

On 17/02/2021 10:29, Andy Seaborne wrote:

Not clear to me exactly what's happening.

Is there any characteristic like increased load around or soon before the problem arises?

Not that we can see.

Are the read calls being made from the same machine as the updates?

No. Updates will come from within the same node, but the read calls come from one of a number of replicas of an API service, which are typically running on different nodes.

Martynas's experience suggests a possibility, because client code can eventually interfere with the server.

1/ A possibility is that an HTTP connection (client-side) is not being handled correctly. There is an (HttpClient) connection pool. If a connection is mishandled, then some time later the server may not be able to see that it can write the update 204.

Agreed. Though it's not clear to me how that leads to an apparent blocking of read operations (from other locations) completing.

It is not necessarily the operation immediately preceding the update that is mishandling things. Because each end has connection pools, one tainted connection in the pool can take a while to show up, or the pool can become exhausted.

Good point.

The problem might be that the server can't send the reply because the server connection is still "in use". Due to pooling there are possibly several connections between a single client and the server. The RDFConnection calls "querySelect", "queryResultSet" give the app patterns where this does not happen. If you use "query()" to get a QueryExecution, it still should be consumed or closed in a try-with-resources block. [A sketch of both patterns follows at the end of the thread.] This is a possibility - it does not directly explain what you've seen, but it is hard to characterise absolutely how its effects will appear.

We're generally careful with things like try-with-resources blocks, but it's definitely worth checking again.

2/ It's made more complicated in TDB1 because read requests may be stopping the finalization (i.e. the clearing out of the commit log) of updates. That again is something where the root cause is not at the point of failure but at some point earlier. This happens if there is a period of high load and may be made worse by case-1-like effects.

Yes, one of the things we're testing is shifting to TDB2 to see if that changes the behaviour. Since we're not seeing any clear pattern of increased load it feels like a long shot, but it would at least eliminate that factor and may shed further light.

Thanks again.

Dave

Andy

On 17/02/2021 08:04, Dave Reynolds wrote:

Thanks Martynas. Don't think it quite fits the symptoms, and it's something we've tried to be careful of, but worth another look.

Dave

On 16/02/2021 12:55, Martynas Jusevičius wrote:

Not Fuseki-related per se, but I've experienced something similar when the HTTP client is running out of connections.

On Tue, Feb 16, 2021 at 11:50 AM Dave Reynolds <[email protected]> wrote:

We have a mysterious problem with fuseki in production that we've not seen before. Posting in case anyone has seen something similar and has any advice, but I realise there's not really much here to go on.
Environment:
Fuseki 3.17 (was 3.16, tried upgrade just in case) using TDB1
OpenJDK Java 8
Docker container (running in a k8s pod)
AWS EBS file system
O(2k) small updates per day (uses RDFConnection to send updates) [update sketch at the end of the thread]
Variable read request rate, but the issue hits at low request levels

Symptoms are that fuseki receives an update request but never completes it:

INFO 550175 POST http://localhost:3030/ds
INFO 550175 Update
INFO 550175 204 No Content (20 ms)
INFO 550176 POST http://localhost:3030/ds
INFO 550176 Update
-->
INFO 550178 Query = ASK { ?s ?p ?o }
INFO 550178 GET http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 GET http://localhost:3030/ds?query=ASK+%7B+%3Fs+%3Fp+%3Fo+%7D
INFO 550179 Query = ASK { ?s ?p ?o }

So no 204 return from request 550176. From that point on fuseki continues to log incoming read queries but does not answer any of them, and the update request never terminates. It acts as if there's some form of deadlock.

Update requests are serialised; there's never more than one in flight at a time.

It's not the update itself that's the issue. It's small, and if the container is restarted with the same data and the same update sequence is reapplied, it all works fine.

The jvm stats all look completely fine in the prometheus records.

The various parts of this set-up have been in various production settings without problems in the past. In particular, we've run the exact same pattern of mixed updates and queries in fuseki in a k8s environment for two years without ever having a lockup. But on a new deployment it's happening every few days.

There are differences between the new and old deployments, but the ones we've identified seem very unlikely to be the cause. We've not used RDFConnection in the client before, but can't see how that could affect this. We don't often run with TDB on EBS, but we do have a dozen instances of that around which haven't had problems. We have generally shifted to AWS Corretto as the jvm, but we have plenty of OpenJDK instances around without problems. The docker image is slightly unusual in using the s6 overlay init system rather than running fuseki as the root process, but again we can't see how this might cause these symptoms, and other uses of that, with fuseki, have been fine.

We'll find a workaround eventually, possibly involving shifting to TDB2, but posting in case anyone has had an experience similar enough to this to give us some hints.

Dave
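
A minimal sketch (not from the thread) of the two client-side query patterns Andy describes, assuming Jena 3.x, a Fuseki endpoint at http://localhost:3030/ds and the RDFConnection API: querySelect() consumes the results and releases the pooled HTTP connection itself, while query() returns a QueryExecution that must be consumed or closed (e.g. with try-with-resources), otherwise the connection can be left "in use".

import org.apache.jena.query.QueryExecution;
import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class QueryPatterns {
    public static void main(String[] args) {
        // The connection itself is AutoCloseable; closing it releases the
        // underlying HTTP client resources.
        try (RDFConnection conn = RDFConnectionFactory.connect("http://localhost:3030/ds")) {

            // Pattern 1: querySelect() iterates the whole result set and
            // releases the pooled HTTP connection internally.
            conn.querySelect("SELECT * { ?s ?p ?o } LIMIT 10",
                    row -> System.out.println(row));

            // Pattern 2: query() hands back a QueryExecution; consume it and
            // close it (try-with-resources here), otherwise the HTTP
            // connection can stay "in use" and eventually exhaust the pool.
            try (QueryExecution qExec = conn.query("ASK { ?s ?p ?o }")) {
                System.out.println("ASK: " + qExec.execAsk());
            }
        }
    }
}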
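
For illustration only (the thread does not include the actual client code): a hypothetical sketch of sending one of the small, serialised updates through RDFConnection, with try-with-resources so the HTTP resources are released whether or not the 204 arrives. The endpoint, method name and update string are made up for the example.

import org.apache.jena.rdfconnection.RDFConnection;
import org.apache.jena.rdfconnection.RDFConnectionFactory;

public class UpdateSender {
    // Hypothetical endpoint; the real deployment details are not in the thread.
    private static final String SERVICE = "http://localhost:3030/ds";

    // Callers are serialised (here crudely with `synchronized`) so at most
    // one update is in flight at a time, matching the behaviour described.
    public static synchronized void sendUpdate(String updateString) {
        try (RDFConnection conn = RDFConnectionFactory.connect(SERVICE)) {
            // update() blocks until the server answers (normally 204 No Content)
            // or throws; try-with-resources releases the HTTP resources either way.
            conn.update(updateString);
        }
    }

    public static void main(String[] args) {
        sendUpdate("INSERT DATA { <urn:example:s> <urn:example:p> \"o\" }");
    }
}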
