> The interesting thing is, a few searches about the performance return mostly
> negative impressions about their object storage in comparison to the original
> S3.

I think they had a rough start, but it's quite good now from what I've experienced. It's also dirt-cheap, and they don't bill for operations. So if you run ZeroFS on that, you only pay for raw storage: €4.99 a month.
Combine that with their dirt-cheap dedicated servers (https://www.hetzner.com/dedicated-rootserver/matrix-ax/), and you can have the <€50-a-month, multi-terabyte Postgres database I'm dreaming of. I'd like to run https://www.merklemap.com/ on such a setup, but it's too early yet :)

> Finding out what kind of performance your benchmarks would yield on a pure
> AWS setting would be interesting. I am not asking you to do that, but you may
> get even better performance in that case :)

Yes, I need to try that!

Best,
Pierre

On Fri, Jul 18, 2025, at 14:55, Seref Arikan wrote:
> Thanks, I learned something else: I didn't know Hetzner offered S3-compatible
> storage.
>
> The interesting thing is, a few searches about the performance return mostly
> negative impressions about their object storage in comparison to the original
> S3.
>
> Finding out what kind of performance your benchmarks would yield on a pure
> AWS setting would be interesting. I am not asking you to do that, but you may
> get even better performance in that case :)
>
> Cheers,
> Seref
>
> On Fri, Jul 18, 2025 at 11:58 AM Pierre Barre <pie...@barre.sh> wrote:
>> Hi Seref,
>>
>> For the benchmarks, I used Hetzner's cloud service with the following setup:
>>
>> - A Hetzner S3 bucket in the FSN1 region
>> - A virtual machine of type ccx63 (48 vCPUs, 192 GB memory)
>> - 3 ZeroFS NBD devices (backed by the same S3 bucket)
>> - A ZFS striped pool across the 3 devices
>> - A 200 GB ZFS L2ARC
>> - Postgres configured accordingly memory-wise, as well as with
>>   synchronous_commit = off, wal_init_zero = off and wal_recycle = off
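>>
>> For illustration, a minimal sketch of how such a pool can be assembled. The ports, device names, pool name, and cache partition below are placeholders, not the exact values from my setup:
>>
>> ```
>> # Attach the three NBD devices exported by the ZeroFS instances
>> # (ports and export names depend on how ZeroFS is configured)
>> nbd-client 127.0.0.1 10809 /dev/nbd0
>> nbd-client 127.0.0.1 10810 /dev/nbd1
>> nbd-client 127.0.0.1 10811 /dev/nbd2
>>
>> # Striped pool across the three devices (no redundancy at this layer),
>> # plus a local partition as an L2ARC cache device
>> zpool create pgpool /dev/nbd0 /dev/nbd1 /dev/nbd2
>> zpool add pgpool cache /dev/nvme0n1p1   # ~200 GB local cache
>>
>> # postgresql.conf, in addition to the usual memory tuning:
>> #   synchronous_commit = off
>> #   wal_init_zero = off
>> #   wal_recycle = off
>> ```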
>>
>> Best,
>> Pierre
>>
>> On Fri, Jul 18, 2025, at 12:42, Seref Arikan wrote:
>>> Sorry, this was meant to go to the whole group:
>>>
>>> Very interesting! Great work. Can you clarify how exactly you're running
>>> postgres in your tests? A specific AWS service? What's the test
>>> infrastructure that sits above the file system?
>>>
>>> On Thu, Jul 17, 2025 at 11:59 PM Pierre Barre <pie...@barre.sh> wrote:
>>>> Hi everyone,
>>>>
>>>> I wanted to share a project I've been working on that enables PostgreSQL
>>>> to run on S3 storage while maintaining performance comparable to local
>>>> NVMe. The approach uses block-level access rather than trying to map
>>>> filesystem operations to S3 objects.
>>>>
>>>> ZeroFS: https://github.com/Barre/ZeroFS
>>>>
>>>> # The Architecture
>>>>
>>>> ZeroFS provides NBD (Network Block Device) servers that expose S3 storage
>>>> as raw block devices. PostgreSQL runs unmodified on ZFS pools built on
>>>> these block devices:
>>>>
>>>> PostgreSQL -> ZFS -> NBD -> ZeroFS -> S3
>>>>
>>>> By providing block-level access and leveraging ZFS's caching capabilities
>>>> (L2ARC), we can achieve microsecond latencies despite the underlying
>>>> storage being in S3.
>>>>
>>>> ## Performance Results
>>>>
>>>> Here are pgbench results from PostgreSQL running on this setup:
>>>>
>>>> ### Read/Write Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: <builtin: TPC-B (sort of)>
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 100000
>>>> number of transactions actually processed: 5000000/5000000
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.943 ms
>>>> initial connection time = 48.043 ms
>>>> tps = 53041.006947 (without initial connection time)
>>>> ```
>>>>
>>>> ### Read-Only Workload
>>>>
>>>> ```
>>>> postgres@ubuntu-16gb-fsn1-1:/root$ pgbench -c 50 -j 15 -t 100000 -S example
>>>> pgbench (16.9 (Ubuntu 16.9-0ubuntu0.24.04.1))
>>>> starting vacuum...end.
>>>> transaction type: <builtin: select only>
>>>> scaling factor: 50
>>>> query mode: simple
>>>> number of clients: 50
>>>> number of threads: 15
>>>> maximum number of tries: 1
>>>> number of transactions per client: 100000
>>>> number of transactions actually processed: 5000000/5000000
>>>> number of failed transactions: 0 (0.000%)
>>>> latency average = 0.121 ms
>>>> initial connection time = 53.358 ms
>>>> tps = 413436.248089 (without initial connection time)
>>>> ```
>>>>
>>>> These numbers are with 50 concurrent clients and the actual data stored in
>>>> S3. Hot data is served from ZFS L2ARC and ZeroFS's memory caches, while
>>>> cold data comes from S3.
>>>>
>>>> ## How It Works
>>>>
>>>> 1. ZeroFS exposes NBD devices (e.g., /dev/nbd0) that PostgreSQL/ZFS can
>>>>    use like any other block device
>>>> 2. Multiple cache layers hide S3 latency:
>>>>    a. ZFS ARC/L2ARC for frequently accessed blocks
>>>>    b. ZeroFS memory cache for metadata and hot data
>>>>    c. Optional local disk cache
>>>> 3. All data is encrypted (ChaCha20-Poly1305) before hitting S3
>>>> 4. Files are split into 128KB chunks for insertion into ZeroFS' LSM-tree
>>>>
>>>> ## Geo-Distributed PostgreSQL
>>>>
>>>> Since each region can run its own ZeroFS instance, you can create
>>>> geographically distributed PostgreSQL setups.
>>>>
>>>> Example architectures:
>>>>
>>>> Architecture 1:
>>>>
>>>>                  PostgreSQL Client
>>>>                         |
>>>>                         | SQL queries
>>>>                         |
>>>>                  +--------------+
>>>>                  |   PG Proxy   |
>>>>                  |  (HAProxy/   |
>>>>                  |  PgBouncer)  |
>>>>                  +--------------+
>>>>                    /          \
>>>>                   /            \
>>>>          Synchronous        Synchronous
>>>>          Replication        Replication
>>>>                 /                \
>>>>                /                  \
>>>>   +---------------+        +---------------+
>>>>   | PostgreSQL 1  |        | PostgreSQL 2  |
>>>>   |   (Primary)   |◄------►|   (Standby)   |
>>>>   +---------------+        +---------------+
>>>>           |                        |
>>>>           | POSIX filesystem ops   |
>>>>           |                        |
>>>>   +---------------+        +---------------+
>>>>   |  ZFS Pool 1   |        |  ZFS Pool 2   |
>>>>   | (3-way mirror)|        | (3-way mirror)|
>>>>   +---------------+        +---------------+
>>>>     /     |     \            /     |     \
>>>>    /      |      \          /      |      \
>>>> NBD:10809 NBD:10810 NBD:10811 NBD:10812 NBD:10813 NBD:10814
>>>>     |         |         |         |         |         |
>>>> +--------++--------++--------++--------++--------++--------+
>>>> |ZeroFS 1||ZeroFS 2||ZeroFS 3||ZeroFS 4||ZeroFS 5||ZeroFS 6|
>>>> +--------++--------++--------++--------++--------++--------+
>>>>     |         |         |         |         |         |
>>>> S3-Region1 S3-Region2 S3-Region3 S3-Region4 S3-Region5 S3-Region6
>>>>  (us-east)  (eu-west) (ap-south)  (us-west) (eu-north)  (ap-east)
>>>>
>>>> Architecture 2:
>>>>
>>>> PostgreSQL Primary (Region 1) ←→ PostgreSQL Standby (Region 2)
>>>>              \                        /
>>>>               \                      /
>>>>                Same ZFS Pool (NBD)
>>>>                        |
>>>>                 6 Global ZeroFS
>>>>                        |
>>>>                   S3 Regions
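>>>>
>>>> As a minimal sketch, the synchronous replication in Architecture 1 could
>>>> be configured like this on the primary (the standby name "pg2", the
>>>> replication user, and the network are illustrative, not part of the
>>>> project):
>>>>
>>>> ```
>>>> # postgresql.conf on PostgreSQL 1 (primary)
>>>> synchronous_commit = on
>>>> synchronous_standby_names = 'FIRST 1 (pg2)'   # wait for the standby
>>>>
>>>> # pg_hba.conf: allow the standby's replication connections
>>>> host  replication  replicator  10.0.2.0/24  scram-sha-256
>>>> ```
>>>>
>>>> The standby would then set application_name=pg2 in its primary_conninfo
>>>> so the primary can wait on it by name.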
>>>>
>>>> The main advantages I see are:
>>>>
>>>> 1. Dramatic cost reduction for large datasets
>>>> 2. Simplified geo-distribution
>>>> 3. Infinite storage capacity
>>>> 4. Built-in encryption and compression
>>>>
>>>> Looking forward to your feedback and questions!
>>>>
>>>> Best,
>>>> Pierre
>>>>
>>>> P.S. The full project includes a custom NFS filesystem too.