crdv7 opened a new issue, #2432:
URL: https://github.com/apache/age/issues/2432
**Describe the bug**
The `pthread_mutex` in `manage_GRAPH_global_contexts()` can cause a permanent
self-deadlock on VLE (variable-length edge) queries. When `ereport(ERROR)` is
raised while the mutex is held (e.g., statement timeout, query cancellation,
OOM), PostgreSQL's `siglongjmp` jumps to the error handler, skipping
`pthread_mutex_unlock()`. The mutex then stays locked for the lifetime of the
backend process. Any subsequent VLE query on the same backend connection
deadlocks on itself: the process hangs forever in `pthread_mutex_lock()` with
`__owner` equal to its own PID.
**How are you accessing AGE (Command line, driver, etc.)?**
- psql (command line), but the bug affects any client/driver.
**What data setup do we need to do?**
```pgsql
LOAD 'age';
SET search_path = ag_catalog, "$user", public;
SELECT create_graph('test_deadlock');
SELECT * FROM cypher('test_deadlock', $$
UNWIND range(1, 50000) AS i
CREATE (:Node {id: i})
$$) AS (v agtype);
SELECT * FROM cypher('test_deadlock', $$
MATCH (a:Node), (b:Node)
WHERE b.id = a.id + 1
CREATE (a)-[:LINK {weight: a.id}]->(b)
$$) AS (e agtype);
-- Load graph context into cache first
SELECT * FROM cypher('test_deadlock', $$
MATCH path = (a)-[r*1..2]->(b)
RETURN path LIMIT 1
$$) AS (path agtype);
-- Invalidate cached context by modifying the graph
SELECT * FROM cypher('test_deadlock', $$
CREATE (:Dummy {x: 1})
$$) AS (v agtype);
```
**What is the necessary configuration info needed?**
- Any AGE version that includes PR #1881
- PostgreSQL 16, 17, or 18
- No special configuration needed
**What is the command that caused the error?**
Alternate cache invalidation with a timed-out VLE query. The timeout must fire
while the graph context is being reloaded (that is, while the mutex is held),
so it may take several rounds depending on machine speed.
```pgsql
-- Repeat: invalidate cache, then cancel VLE query via statement_timeout.
-- Each round has a chance of hitting the mutex-held window.
-- Once it hits, every subsequent VLE query on this connection hangs forever.
-- Round 1
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T1 {x: 1}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;
-- Round 2
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T2 {x: 2}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;
-- Round 3
SELECT * FROM cypher('test_deadlock', $$ CREATE (:T3 {x: 3}) $$) AS (v agtype);
SET statement_timeout = '1ms';
SELECT * FROM cypher('test_deadlock', $$
MATCH path = (a)-[r*1..3]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
RESET statement_timeout;
-- (add more rounds if needed)
-- Final test: if any round above hit the mutex window,
-- this query hangs forever (self-deadlock).
SELECT * FROM cypher('test_deadlock', $$
MATCH path = (a)-[r*1..2]->(b) RETURN path LIMIT 1
$$) AS (path agtype);
-- If it returns results, add more rounds above and retry.
```
To confirm with GDB:
```bash
gdb -batch -p <hung_pid> \
-ex "print global_graph_contexts_container.mutex_lock.__data.__owner"
# Output: $1 = <hung_pid> (owner == self → self-deadlock)
```
**Expected behavior**
VLE queries should continue to work normally after a query error or
cancellation. A statement timeout on one query should not permanently break the
backend connection.
**Environment (please complete the following information):**
- AGE Version: master (also affects PG16, PG17, PG18 branches — any version
with PR #1881)
- PostgreSQL Version: 16, 17, 18
**Additional context**
The mutex was introduced in PR #1881 (fix for issue #1878). However, it is
both unnecessary and harmful:
1. **Unnecessary:** The protected variable is a process-local `static`, and
each PostgreSQL backend is a single-threaded process, so no concurrent access
exists. The test failure in #1878 was a catalog-level race,
already fixed by the Assert→runtime check and `strndup` in the same PR. For
cross-backend cache invalidation, PostgreSQL syscache uses `sinval` callbacks,
and AGE PR #2376 already uses lock-free `pg_atomic_uint64` version counters in
shared memory for this.
2. **Harmful:** `pthread_mutex` is incompatible with PostgreSQL's error
handling. `ereport(ERROR)` uses `siglongjmp` to jump directly to the error
handler, bypassing all code between the error site and the handler — including
`pthread_mutex_unlock()`. Once skipped, the mutex is permanently locked for
that backend process, and any subsequent VLE query self-deadlocks.
We will submit a PR with a fix.