Dear community,
Based on the analysis of logs collected from several incidents under OEL
8.10 / 9.3, the most likely cause is local exhaustion of free space in
an allocation group in the XFS filesystem.
Further investigation revealed that a similar issue is documented in the
Red Hat knowledge base (https://access.redhat.com/solutions/7129010),
describing ENOSPC errors from the fallocate() function in XFS
filesystems during PostgreSQL backup operations.
Red Hat references the commit
https://github.com/torvalds/linux/commit/6773da870ab89123d1b513da63ed59e32a29cb77
and believes that this kernel fix may address the PostgreSQL issue.
After analyzing the change set from this commit, we identified the
following combination of conditions that can trigger the ENOSPC error:
1. Presence of delayed allocations (committed but not yet written to disk).
2. Insufficient free space in the allocation group to cover all pending
delayed allocations.
A subsequent search of the PostgreSQL community knowledge base led to the
message
https://www.postgresql.org/message-id/[email protected].
Important points to highlight from this message:
1. Since kernel versions 2.6.x, XFS has implemented dynamic speculative
preallocation.
2. The term "dynamic" means the preallocation size is regulated by
internal heuristics.
3. These heuristics are based on file access patterns and history.
4. Additional space allocated during preallocation is intended to
prevent file fragmentation.
5. When a file extends, its data is written into extents that may be
distributed across one or more allocation groups.
6. Delayed allocation writes allow merging multiple allocations into
preallocated space before writing to disk, reducing the number of
extents and thus file fragmentation.
7. The logic for tracking additional space retains it as long as there
are in-memory references to the file — for example, in an actively
running PostgreSQL database.
8. The XFS filesystem itself considers this space as used.
9. The actual (allocated) size of a file may exceed the 1GB segment limit
(not to be confused with the apparent size).
This is confirmed by the output of `du -h`, which reports the allocated
size of each file. At the time the command was run, some files occupied
more than 1GB on disk (a few up to 2GB), even though the maximum
apparent size of a segment file is 1GB.
There may have been more such files, but after the replica crash the file
descriptors were released and the allocated size returned to normal.
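The gap between apparent and allocated size can also be checked per file. The following is a minimal sketch, assuming GNU coreutils `stat`; the function name `alloc_report` is illustrative and is not part of any tool mentioned in this message.

```bash
# Print apparent and allocated size (in bytes) for each file argument.
# Assumes GNU coreutils stat: %s = apparent size, %b = number of allocated
# blocks, %B = size in bytes of each block reported by %b.
alloc_report() {
    for f in "$@"; do
        apparent=$(stat -c %s -- "$f") || continue
        blocks=$(stat -c %b -- "$f") || continue
        unit=$(stat -c %B -- "$f") || continue
        printf '%s apparent=%s allocated=%s\n' "$f" "$apparent" "$((blocks * unit))"
    done
}
```

Calling it as `alloc_report "$PGDATA"/base/*/*` lists every relation file; a file whose `allocated` value is well above `apparent` is holding preallocated space.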
The dynamic allocator can be disabled by specifying the `allocsize`
mount option when mounting the XFS filesystem.
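To see whether a given mount point currently uses a fixed `allocsize`, the mount table can be inspected. This is only a sketch: the function name `mount_allocsize` is hypothetical, and it assumes that a fixed `allocsize` appears among the mount options in `/proc/mounts` (the exact value format, for example a size in KiB, is also an assumption).

```bash
# Print the allocsize option of the filesystem mounted at $1, or "dynamic"
# when the option is absent (preallocation size left to the heuristics).
# $2 is the mount table to read; it defaults to /proc/mounts.
# Assumption: a fixed allocsize is listed in the mount options (field 4).
mount_allocsize() {
    awk -v mp="$1" '
        $2 == mp {
            result = "dynamic"
            n = split($4, opts, ",")
            for (i = 1; i <= n; i++)
                if (opts[i] ~ /^allocsize=/) result = substr(opts[i], 11)
            found = 1
        }
        END { if (found) print result; else exit 1 }
    ' "${2:-/proc/mounts}"
}
```

For example, `mount_allocsize /pgdata_repl` prints the configured value or `dynamic`, and returns a non-zero status if nothing is mounted at that path.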
We would like to share additional observations to help resolve the issue.
We were able to reproduce the original problem in two ways: directly on
a PostgreSQL replica, and using a C program.
The first method is a test script (please see the attached
README_test_pg.md) that uses the mount option `allocsize=$((1*1024*1024))`
when mounting the disk where PGDATA is located.
The pgbench_accounts table is generated using the pgbench tool, and
multiple copies of this table are created and populated in parallel.
During the process of filling these small tables (each table is no
larger than 25 MB upon script completion), numerous delayed
preallocation events occur, consuming free disk space.
The subsequent parallel INSERT statements then cause replica crashes
because there is no contiguous free space left on the disk to extend the
file of the large table.
Here is an example of the available free space on the mount points after
the replica crashed with the ENOSPC error (pgdata_main belongs to the
primary server and pgdata_repl to the replica):
Filesystem     Type  Size  Used Avail Use% Mounted on
/dev/loop0     xfs   4.0G  4.0G   74M  99% /pgdata_main
/dev/loop3     xfs   4.0G  3.8G  280M  94% /pgdata_repl
You may observe that when the issue is reproduced and the replica
crashes, the available disk space on the replica side appears larger
than on the primary side.
However, the ENOSPC error in the logs indicates that disk space was
exhausted, and this is indeed accurate: after the crash, all file
descriptors were released, and the space previously preallocated to the
files was reclaimed by the filesystem. Monitoring file sizes with `du -h`
shortly before the crash and some time after it shows the files
shrinking from 26 MB to 25 MB.
The issue does not occur with the minimum possible value of the
allocsize parameter, `allocsize=$((4*1024))`.
Testing various values of allocsize under a specific workload on
PostgreSQL with synchronous physical replication shows:
+----------------------+----------------------+---------------------------------------------------------------------+
| allocsize setting    | Thread model         | Result                                                              |
+----------------------+----------------------+---------------------------------------------------------------------+
| 1M                   | single thread        | No issues observed                                                  |
+----------------------+----------------------+---------------------------------------------------------------------+
| 1M                   | multiple threads     | Replica failed: "could not extend file ... No space left on device" |
+----------------------+----------------------+---------------------------------------------------------------------+
| 1GB                  | multiple threads     | Primary failed: "could not extend file ... No space left on device" |
+----------------------+----------------------+---------------------------------------------------------------------+
| 4KB                  | multiple threads     | No failure occurred                                                 |
+----------------------+----------------------+---------------------------------------------------------------------+
The second method is a C program (please see the attached
README_test_c.md) that reproduces the ENOSPC error on kernel version
5.15.0-101.103.2.1.el9uek.x86_64.
The program first attempts to write 748 KB to a file and then allocate
an additional 16 KB using posix_fallocate().
If posix_fallocate() fails, it displays a corresponding message and
retries the operation.
The second attempt succeeds, indicating that space was available.
However, the program does not fully reproduce the potential PostgreSQL
scenario. The key differences are:
1. The program uses a single process with a single thread, whereas real
systems involve one process with multiple threads or multiple processes
operating on files.
2. The program uses a fixed buffer size for the mounted filesystem's
journal, whereas in production environments the buffer size is dynamic
(allocated based on historical space usage, i.e., workload-dependent).
3. The issue does not occur when there are multiple allocation groups
that are completely empty.
In practice, we identified two viable approaches:
1. As a permanent solution: Upgrade the UEK kernel.
Note that the fix has not been backported to all UEK versions:
- It is not present in UEK7 (5.15.x).
- It is present in UEK8 (6.12.x, available starting with OL 9.5)
from kernel version 6.12.0-0.20.20 onwards.
2. As a temporary solution: Use the allocsize parameter to disable
dynamic speculative preallocation.
However, since this does not fix the root cause, failures may still
occur.
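To check whether a host already runs a kernel at or above the first UEK8 build listed above, the release string can be compared against 6.12.0-0.20.20. This is a rough sketch under stated assumptions: the helper names `version_ge` and `kernel_base` are illustrative, GNU `sort -V` is assumed to be available, and the suffix stripping assumes Oracle Linux style release strings containing `.el`. The comparison is only meaningful on UEK kernels.

```bash
# Return success if version $1 is >= version $2, using GNU sort -V ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Strip the distribution suffix, e.g. "5.15.0-101.103.2.1.el9uek.x86_64"
# becomes "5.15.0-101.103.2.1" (assumes the suffix starts at ".el").
kernel_base() {
    printf '%s\n' "${1%%.el*}"
}

running=$(kernel_base "$(uname -r)")
if version_ge "$running" "6.12.0-0.20.20"; then
    echo "kernel $running: at or above 6.12.0-0.20.20"
else
    echo "kernel $running: below 6.12.0-0.20.20"
fi
```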
On 9/10/24 17:11, Pecsök Ján wrote:
Dear community,
After upgrade of Posgres from version 13.5 to 16.2 we experience
following error:
could not extend file
"pg_tblspc/16401/PG_16_202307071/17820/3968302971" with
FileFallocate(): No space left on device
We cannot easily replicate problem. It happens at randomly every 1-2
weeks of intensive query computation.
Was there some changes in space allocation from Posgres 13.5 to
Posgres 16.2?
Database has size 91TB and has 27TB more space available.
# Reproducing ENOSPC Error in PostgreSQL
This script reproduces an `ENOSPC` (Error: No Space Left on Device) condition in PostgreSQL by exploiting filesystem-level extent allocation behavior under high-concurrency workloads. The issue is triggered by a combination of:
* Mount option `allocsize=1M` on the `$PGDATA` mount point
* Creation of many small tables (preallocating filesystem extents)
* Parallel bulk inserts into a single large table (fragmenting free space)
This mimics real-world scenarios such as data migration or bulk ETL operations, where filesystem fragmentation leads to allocation failures even when total free space appears sufficient.
---
## Prerequisites
* PostgreSQL 16.1 with `pgbench` installed
* **XFS** filesystem (recommended)
* `$PGDATA` and WAL logs located on different mount points
* Mount option: `allocsize=1M` (or higher) on the `$PGDATA` mount point
*(This forces larger preallocation units, increasing fragmentation risk)*
* Sufficient disk space (≥ 50 GB recommended)
* Linux environment with `psql`, `pgbench`, `xargs`, and `seq`
---
## Key Factors for Reproduction
| Factor | Recommended Value | Purpose |
| :------------------------ | :------------------ | :-------------------------------------------------------------- |
| `allocsize` mount option | 1M | Forces large preallocations, increasing fragmentation risk |
| Number of small tables | 100–200 | Consumes allocation groups/clusters |
| Parallel threads | 50–150 | Increases concurrency and allocation contention |
| Total rows inserted | 5M–10M | Pushes insert size beyond available contiguous extents |
| Filesystem | XFS | Exhibits this behavior under high fragmentation |
---
## Environment Setup
Set up XFS filesystems on separate disks for PGDATA and PGWAL with appropriate mount options:
```bash
# Format PGDATA disk with separate journal device and 128 allocation groups
mkfs.xfs -f -d agcount=128 -l logdev=/dev/journal_disk,size=64m /dev/pgdata_disk
# Format PGWAL disk
mkfs.xfs -f -d agcount=16 /dev/pgwal_disk
# Create mount points
mkdir /pgdata
mkdir /pgwal
# Mount PGDATA with allocsize=1M
mount -t xfs -o logdev=/dev/journal_disk,allocsize=1048576 /dev/pgdata_disk /pgdata
# Mount PGWAL
mount -t xfs /dev/pgwal_disk /pgwal
```
Important configuration details:
* PGDATA filesystem: XFS with separate journal device, mounted with allocsize=1M option
* Allocation groups: 128 AGs for PGDATA to increase fragmentation potential
* Separate mount: PGWAL on different disk/filesystem to isolate WAL impact
* Disk sizing: PGDATA disk should have sufficient space (≥ 50GB recommended)
* For PostgreSQL configuration, ensure data_directory points to /pgdata and consider setting WAL directory to /pgwal.
---
## Reproduction Script
The following bash script reproduces the ENOSPC error.
```bash
# Step 1: create initial table which will be used for copying rows
echo "preparing data.."
pgbench -U postgres -h localhost -p 5432 -i -I t postgres
# Step 2: Insert baseline data
psql -U postgres -h localhost -p 5432 -c "INSERT INTO pgbench_accounts(aid,bid,abalance,filler) SELECT gs.i AS aid,NULL,0,substring(md5(random()::text),0,84) from generate_series(1, 200000) gs(i)"
# Step 3: create 128 small tables in parallel (preallocates extents across AGs)
seq 1 128 | xargs -r -P 12 -I {} psql -U postgres -h localhost -p 5432 -c "create table pgbench_accounts{} as select * from pgbench_accounts" > /dev/null
# Step 4: clean up initial schema
pgbench -U postgres -h localhost -p 5432 -i -I d postgres
# Step 5: set the degree of parallelism and the batch size for the bulk insert
echo "reproducing.."
export THREADS=100
export PARTS=100
export TOTAL=6000000
export RANGE=$((TOTAL/PARTS))
# Step 6: insert 6M rows in 100 parallel batches into pgbench_accounts1
seq 1 "$PARTS" | xargs -r -P "$THREADS" -I {} psql -U postgres -h localhost -p 5432 -c "INSERT INTO pgbench_accounts1(aid,bid,abalance,filler) SELECT ({}*$RANGE)::integer+gs.i AS aid,NULL,0,substring(md5(random()::text),0,84) from generate_series(1, $RANGE) gs(i)" > /dev/null
# Step 7: final insert to push past threshold
psql -U postgres -h localhost -p 5432 -c "INSERT INTO pgbench_accounts1(aid,bid,abalance,filler) SELECT gs.i AS aid,NULL,0,substring(md5(random()::text),0,84) from generate_series(1, 200000) gs(i)" > /dev/null
```
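Step 6 splits `TOTAL` rows into `PARTS` batches of `RANGE` rows each, and batch `i` computes its `aid` values as `i * RANGE + gs.i`. The sketch below makes that arithmetic explicit; the function name `part_range` is hypothetical and only mirrors the expression used in the script.

```bash
# Print the first and last aid inserted by batch $1 when $2 rows are split
# into $3 batches, mirroring the expression (i * RANGE) + gs.i from Step 6.
part_range() {
    range=$(( $2 / $3 ))
    echo "$(( $1 * range + 1 )) $(( ($1 + 1) * range ))"
}

part_range 1 6000000 100     # first batch: prints "60001 120000"
part_range 100 6000000 100   # last batch:  prints "6000001 6060000"
```

With the values from the script, each of the 100 batches inserts 60,000 rows.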
---
## Important Notes
1. Step 3 creates 128 tables, which consumes many allocation groups (AGs) on XFS and produces a large number of delayed preallocation events.
2. Step 6 triggers the actual failure with the message `FATAL: could not extend file "base/xxxxx/xxxxxxxxx.xxxxx" with FileFallocate(): No space left on device`. Because of the prior fragmentation from the small tables, the filesystem cannot find a large enough contiguous free region, even though total free space is high (that space is unavailable because it is held by open file descriptors).
3. Step 7 should complete successfully if the ENOSPC error did NOT occur, which proves that there is enough space for the last step.
4. After a crash or restart, space is reclaimed as file descriptors are released. This makes the issue appear intermittent, but the root cause is filesystem fragmentation due to speculative preallocation, not actual disk exhaustion.
# Reproducing ENOSPC Error in C Program
This C program reproduces an `ENOSPC` (Error: No Space Left on Device) condition by demonstrating how filesystem fragmentation can cause allocation failures even when sufficient free space exists. The issue occurs when:
* Filesystem is mounted with `allocsize=1M`
* Many small files preallocate space across allocation groups
* A large file is extended, followed by a small extension attempt
The program shows that while the initial large write succeeds, a subsequent small `posix_fallocate()` call fails with ENOSPC on the first attempt but succeeds on retry, proving that free space exists but isn't contiguous.
---
## Prerequisites
* Filesystem: **XFS** with separate disk for journal
* Mount option: `allocsize=1M` on the target mount point
* Sufficient disk space (≥ 5 GB recommended)
* Single-threaded execution
* Linux environment with XFS development tools
* C compiler (gcc/clang)
---
## Key Factors for Reproduction
| Factor | Recommended value | Purpose |
| :------------------------ | :---------------- | :---------------------------------------------------------------------- |
| `allocsize` mount option | 1M | Forces 1MB preallocation units, increasing fragmentation risk |
| Small files (31MB each)   | 128               | Consumes allocation groups, fragmenting free space                      |
| Large initial write | 748KB | Creates substantial file growth within fragmented space |
| Small fallocate attempt | 16KB | Tests allocation of small contiguous space in fragmented environment |
| Immediate retry | on failure | Demonstrates space exists but wasn't contiguous on first attempt |
---
## Environment Setup
Create an XFS filesystem with separate journal device and mount with `allocsize=1M`:
```bash
# Format disk (replace /dev/sdX with your data disk, /dev/sdY with journal disk)
mkfs.xfs -f -d agcount=128 -l logdev=/dev/sdY,size=64m /dev/sdX
# Mount with allocsize=1M
mkdir /mnt/test
mount -t xfs -o logdev=/dev/sdY,allocsize=1048576 /dev/sdX /mnt/test
```
---
## Preparing C program
```bash
cat > test.c << 'EOF'
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <string.h>

int main(int argc, char *argv[]) {
    const char *basedir = "/mnt/test";  /* mount point from Environment Setup */
    char filename[256];
    char writebuf[1024];
    int inc1 = 748;  /* KB written at the end of the file */
    int inc2 = 16;   /* KB requested via posix_fallocate() */

    snprintf(filename, sizeof(filename), "%s/%03i.dat", basedir, 1);
    memset(writebuf, 1, sizeof(writebuf));

    printf("Opening file %s\n", filename);
    int fd = open(filename, O_RDWR);
    if (fd == -1) {
        printf("Error on open file %s (code=%i)\n", filename, errno);
        return 1;
    }

    off_t fs = lseek(fd, 0, SEEK_END);
    printf("Current file size: %li\n", (long) fs);

    printf("Writing %i bytes at the file end\n", inc1 * 1024);
    for (int i = 0; i < inc1; i++) {
        ssize_t write_result = write(fd, writebuf, sizeof(writebuf));
        if (write_result == -1) {
            printf("Error on write (code=%i)\n", errno);
            close(fd);
            return 1;
        }
    }

    /* Test: extend the file; retry once if the first attempt returns ENOSPC */
    int iteration = 1;
    int test_result = 0;
    do {
        printf("Allocate additional %i bytes at the file end\n", inc2 * 1024);
        test_result = posix_fallocate(fd, fs + inc1 * 1024, inc2 * 1024);
        if (test_result != 0) {
            if (test_result == ENOSPC) {
                printf("Error ENOSPC on posix_fallocate!\n");
                if (iteration++ < 2) {
                    printf("Retrying operation...\n");
                    continue;
                }
            } else {
                printf("Error on posix_fallocate (code=%i)\n", test_result);
            }
            close(fd);
            return 1;
        }
    } while (test_result != 0);

    printf("Done\n");
    close(fd);
    return 0;
}
EOF
echo "Compile test tool"
gcc -o test test.c
```
---
## Preparing data
```bash
# Create 128 files, each preallocating 31MB
for i in {000..127}; do
fallocate -x -l 31M "/mnt/test/${i}.dat"
done
```
---
## Reproducing by C Program
```bash
./test
# output:
#Writing 765952 bytes at the file end
#Allocate additional 16384 bytes at the file end
#Error ENOSPC on posix_fallocate!
#Retrying operation...
#Allocate additional 16384 bytes at the file end
#Done
```
---
## Important Notes
1. ENOSPC is intermittent: The first posix_fallocate() call fails with ENOSPC, but the identical retry succeeds immediately. This proves free space exists but wasn't contiguous on the first attempt.
2. Root cause: Filesystem fragmentation caused by:
* allocsize=1M forcing large preallocation units
* 128 small files consuming allocation groups
* Large file extension (748KB) fragmenting remaining space
3. Not a disk space issue: The retry success demonstrates sufficient free space exists. The failure is due to inability to find contiguous space for the small (16KB) allocation.