plvsadi opened a new issue, #3837:
URL: https://github.com/apache/jena/issues/3837
### Version
6.0.0 Also reproduced on 5.5.0.
### What happened?
We can reproduce a failure mode where repeated canceled federated `SERVICE` queries leave the target dataset effectively wedged until Fuseki is restarted.
The pattern is:
1. A direct query to dataset `target` succeeds.
2. A federated query from dataset `source` to dataset `target` using `SERVICE <http://127.0.0.1:3030/target/sparql>` succeeds.
3. We then issue a burst of heavy federated queries from `source` to `target`, with the client canceling/timing out the outer HTTP request almost immediately.
4. After that, even a simple direct query to dataset `target` times out.
5. Restarting Fuseki clears the problem.
This is reproducible for us on Jena/Fuseki 6.0.0 and also on 5.5.0.
### Why this looks distinct from ordinary timeout behavior
This does not appear to be just the outer client request timing out.
After the cancellation storm:
- a direct query to the target dataset also times out
- the store recovers only after restart
So the target dataset/server appears to be left in a bad runtime state.
### Reproducing it
We reproduced this against an isolated standalone Fuseki 6.0.0 container built from the official Apache release tarball.
Our production datasets are private, but the failure can be described with this structure:
- source dataset: `source`
- target dataset: `target`
Baseline direct probe against `target`:
```sparql
SELECT * WHERE {
  <urn:probe-subject> ?p ?o
}
LIMIT 5
```
Baseline federated probe from `source` to `target`:
```sparql
SELECT * WHERE {
  SERVICE <http://127.0.0.1:3030/target/sparql> {
    <urn:probe-subject> ?p ?o
  }
}
LIMIT 5
```
Cancellation-storm query:
```sparql
SELECT * WHERE {
  SERVICE <http://127.0.0.1:3030/target/sparql> {
    ?s ?p ?o
  }
}
```
We then repeatedly send that last query to the `source` dataset and cancel the outer HTTP request almost immediately, for example:
```shell
for i in $(seq 1 40); do
  curl -sS --max-time 0.05 -G \
    --data-urlencode 'query=SELECT * WHERE { SERVICE <http://127.0.0.1:3030/target/sparql> { ?s ?p ?o } }' \
    http://127.0.0.1:3030/source/sparql >/dev/null || true
done
```
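To make the before/after comparison repeatable, the direct probe could be wrapped in a small helper like the sketch below. `FUSEKI`, `PROBE_TIMEOUT`, `classify_exit`, and `probe` are illustrative names, not part of our original setup, and the 10-second timeout is an arbitrary choice. The only curl behavior relied on is that it exits with status 28 when `--max-time` is exceeded.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and defaults are assumptions, not Fuseki tooling.

FUSEKI="${FUSEKI:-http://127.0.0.1:3030}"   # base URL of the Fuseki server
PROBE_TIMEOUT="${PROBE_TIMEOUT:-10}"        # seconds; arbitrary choice

# Map a curl exit status to a short label.
# curl exits with 0 on success and 28 when the operation timed out.
classify_exit() {
  case "$1" in
    0)  echo "ok" ;;
    28) echo "timeout" ;;
    *)  echo "error($1)" ;;
  esac
}

# Send the direct probe query to the given dataset and print the label.
# --fail turns HTTP error responses into a non-zero curl exit status.
probe() {
  local dataset="$1" rc=0
  curl -sS --fail --max-time "$PROBE_TIMEOUT" -G \
    --data-urlencode 'query=SELECT * WHERE { <urn:probe-subject> ?p ?o } LIMIT 5' \
    "$FUSEKI/$dataset/sparql" >/dev/null 2>&1 || rc=$?
  classify_exit "$rc"
}
```

Running `probe target` once before the storm loop and once afterward would be expected to print `ok` and then `timeout` if the wedge reproduces.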
### Actual result
Before stress:
- direct query succeeds
- federated query succeeds
After the canceled federated-query burst:
- direct query to target times out
- federated query also fails/times out
- Fuseki restart is required to recover
### Expected result
Canceled outer federated queries should not leave the target dataset/server wedged.
After the canceled requests, normal direct queries to the target dataset should still work.
### Relevant logs
From the Jena 6.0.0 Fuseki log, after the stress starts we see many inner requests like:
`GET http://127.0.0.1:3030/target/sparql?query=SELECT++%2A%0AWHERE%0A++%7B+?s++?p++?o+%7D%0A`
The outer requests are being canceled by the client, but the inner `SERVICE` subqueries continue to run. After enough of these, the target dataset stops responding to even direct queries.
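One rough way to quantify how many inner requests reached `target` is to count matching lines in the Fuseki log. This is only a sketch: `count_inner_requests` is a hypothetical helper, the log file path is whatever your deployment uses, and it assumes each inner request is logged on a single line containing the `GET` URL prefix shown above.

```shell
#!/usr/bin/env bash
# Sketch only: count log lines that record an inner SERVICE request
# to the target dataset. The helper name is illustrative.
# Usage: count_inner_requests <logfile>
count_inner_requests() {
  # -c prints the number of matching lines; -F treats the pattern as a
  # fixed string so '?' and '.' are not interpreted as regex characters.
  grep -cF 'GET http://127.0.0.1:3030/target/sparql?query=' "$1"
}
```

Comparing this count with the number of outer requests sent (40 in the loop above) gives a rough sense of how many inner subqueries were started despite the client-side cancellation.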
### Notes
- This was reproduced using loopback 127.0.0.1, so it does not appear to require Docker DNS/container-name routing.
- We specifically tested 6.0.0 because the changelog mentions query-cancellation improvements, but we still reproduce this failure.
### Relevant output and stacktrace
```shell
```
### Are you interested in making a pull request?
None