antoniopetrole opened a new issue, #927:
URL: https://github.com/apache/cloudberry/issues/927
### Apache Cloudberry version
Cloudberry 1.6.0 (pre-Apache release)
### What happened
A few days ago we set wal_compression = on in an attempt to reduce I/O in our
production cluster. Shortly after enabling it, users reported that queries that
were part of a large workload were failing. After some investigation, we found
core dumps being generated on the segments that were throwing errors, and these
core dumps are directly related to the WAL compression functionality. It seems
the error was raised from XLogCompressBackupBlock.cold.4 (via errfinish), which
then aborted and created a core dump. Thankfully it didn't crash any segments,
so I imagine the WAL stuff happens in its own thread. We quickly disabled this
GUC and haven't seen the issue since (it happened on multiple segments multiple
times because the users were retrying their jobs).
### Client Side Error
DEBUG ERROR: Error on receive from seg25 slice1 10.<omitted>:4001 pid=3498813: server closed the connection unexpectedly
DEBUG ERROR: current transaction is aborted, commands ignored until end of transaction block, command: SELECT <omitted>
ERROR PSQLException: ERROR: Error on receive from seg25 slice1 10.<omitted>:4001 pid=3498813: server closed the connection unexpectedly
PL/pgSQL function <omitted> line 298 at EXECUTE
org.postgresql.util.PSQLException: ERROR: Error on receive from seg25 slice1 10.<omitted>:4001 pid=3498813: server closed the connection unexpectedly
    at org.postgresql.core.v3.QueryExecutorImpl.receiveErrorResponse(QueryExecutorImpl.java:2440)
INFO Exited due to an error 1 min 1.7 secs after starting
### Coredump Trace
```
(gdb) where
#0 0x00007f665f04e52f in raise () from /lib64/libc.so.6
#1 0x00007f665f021e65 in abort () from /lib64/libc.so.6
#2 0x00007f66600de060 in errfinish () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#3 0x00007f665fa84888 in XLogCompressBackupBlock.cold.4 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#4 0x00007f665fbfc814 in XLogRecordAssemble () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#5 0x00007f665fbfcbc4 in XLogInsert_Internal () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#6 0x00007f665fb9364f in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#7 0x00007f665fb93836 in simple_heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#8 0x00007f665fb5c81c in toast_delete_datum () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#9 0x00007f665fbd2a5f in toast_delete_external () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#10 0x00007f665fba1070 in heap_toast_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#11 0x00007f665fb9343d in heap_delete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#12 0x00007f665fda4298 in ExecDelete () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#13 0x00007f665fda62f1 in ExecModifyTable () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#14 0x00007f665fd7877b in ExecProcNodeFirst () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#15 0x00007f665fd6f47a in ExecutePlan.part.1 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#16 0x00007f665fd6ff28 in standard_ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#17 0x00007f665fd70135 in ExecutorRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#18 0x00007f665ff8af2d in ProcessQuery.isra.3 () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#19 0x00007f665ff8beb2 in PortalRunMulti () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#20 0x00007f665ff8c33d in PortalRun () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#21 0x00007f665ff865df in exec_mpp_query () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#22 0x00007f665ff89ebd in PostgresMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#23 0x00007f665fee5ddf in ServerLoop () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#24 0x00007f665fee6f1f in PostmasterMain () from /usr/local/cloudberry-db-1.6.0/lib/libpostgres.so
#25 0x00000000004017ae in main ()
```
### What you think should happen instead
Shouldn't core dump :)
### How to reproduce
I haven't tried creating a test case for this yet, but it should be
relatively easy. All we did was enable the GUC and run `gpstop -u`; our
users then started having issues.
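
For anyone who wants to attempt a reproduction, here is a minimal, untested sketch. It is a guess based on the backtrace (a DELETE that removes out-of-line TOAST data and then writes WAL), not a confirmed test case. The use of `gpconfig` to set the GUC, the table name `wal_compression_repro`, the `DB` and `DRY_RUN` variables, and the workload itself are all assumptions that do not come from the report. Because a successful reproduction aborts backends, the script only prints the commands unless `DRY_RUN=0` is set, and it should only be run against a disposable test cluster.

```shell
#!/usr/bin/env bash
# Hypothetical reproduction sketch; NOT verified to trigger the crash.
# By default this is a dry run that only prints the commands.
# Set DRY_RUN=0 on a throwaway test cluster to execute them.
set -euo pipefail

DRY_RUN="${DRY_RUN:-1}"
DB="${DB:-postgres}"   # assumed database name

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '+ %s\n' "$*"
  else
    "$@"
  fi
}

# Enable the GUC and reload the configuration, as described in the report.
# (How the GUC was set in production is not stated; gpconfig is an assumption.)
run gpconfig -c wal_compression -v on
run gpstop -u

# Assumed workload: delete rows whose values are stored out of line in a
# TOAST table, to follow the heap_toast_delete -> heap_delete -> XLogInsert
# path seen in the backtrace.
SQL="
CREATE TABLE wal_compression_repro (id int, payload text) DISTRIBUTED BY (id);
ALTER TABLE wal_compression_repro ALTER COLUMN payload SET STORAGE EXTERNAL;
INSERT INTO wal_compression_repro
  SELECT i, repeat(md5(i::text), 1000) FROM generate_series(1, 1000) i;
CHECKPOINT;  -- intended to make the following page changes log full-page images
DELETE FROM wal_compression_repro;
"
run psql -d "$DB" -v ON_ERROR_STOP=1 -c "$SQL"
```

If the DELETE does hit the same code path, the expectation would be a "server closed the connection unexpectedly" error on the client and a core file on the affected segment host. If it does not, the trigger likely depends on something specific to the production data or workload.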
### Operating System
Rocky Linux 8.10 (Green Obsidian)
### Anything else
_No response_
### Are you willing to submit PR?
- [ ] Yes, I am willing to submit a PR!
### Code of Conduct
- [x] I agree to follow this project's [Code of
Conduct](https://github.com/apache/cloudberry/blob/main/CODE_OF_CONDUCT.md).