Hi,

The peak was CPU 9%, IOWait 4.7%, User 2%, System 2.65%. RAM is 6% used, 90% cached.
Zsolt

On 10. Aug 2025 at 08:32:10, Xiufeng Guo <show...@gmail.com> wrote:

> Hi,
>
> What’s the average server load?
>
> Best Regards,
> Xiufeng Guo
>
>
> On Sun, Aug 10, 2025 at 02:04 Zsolt Ero <zsolt....@gmail.com> wrote:
>
>> Hi,
>>
>> I'm seeking advice on the most robust way to configure Nginx for a
>> specific scenario that led to a caching issue.
>>
>> I run a free vector tile map service (https://openfreemap.org/). The
>> server's primary job is to serve a massive number of small (~70 kB),
>> pre-gzipped PBF files.
>>
>> To optimize for ocean areas, tiles that don't exist on disk should be
>> served as a 200 OK with an empty body. These are then rendered as empty
>> space on the map.
>>
>> Recently, the server experienced an extremely high load: 100k req/sec on
>> Cloudflare, and 1k req/sec on my two Hetzner servers. During this peak,
>> Nginx started serving some *existing* tiles as empty bodies. Because
>> these responses included cache-friendly headers (expires 10y), the CDN
>> cached the incorrect empty responses, effectively making parts of the map
>> disappear until a manual cache purge was performed.
>>
>> My goal is to prevent this from happening again. A temporary server
>> overload should result in a server error (e.g., 5xx), not incorrect
>> content that gets permanently cached.
>>
>> The Nginx error logs clearly showed the root cause of the system error:
>>
>> 2025/08/08 23:08:16 [crit] 1084275#1084275: *161914910 open()
>> "/mnt/ofm/planet-20250730_001001_pt/tiles/8/138/83.pbf" failed (24: Too many
>> open files), client: 172.69.122.170, server: ...
>>
>> It appears my try_files directive interpreted this "Too many open files"
>> error as a "file not found" condition and fell back to serving the empty
>> tile.
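As far as I understand the nginx source, try_files moves on to the next parameter on any open() failure and only logs the non-ENOENT ones at [crit] level, which matches the log line above. One possible way to get a 5xx instead is to drop try_files and let the static module open the file itself, mapping only 404 to the empty tile with error_page. The static module answers a missing file with 404 but other open() failures, such as EMFILE, with 500. Also, expires and add_header without "always" are only applied to 200, 201, 204, 206, 301, 302, 303, 304, 307 and 308 responses, so a 500 would not carry the 10-year lifetime. This is a hypothetical, untested variant of the location shown further down. The path and @empty_tile are taken from that block, and log_not_found off is an added assumption.

```nginx
# Hypothetical variant of the tile location: no try_files, so the static
# module opens the file itself and picks the status code.
location ^~ /monaco/20250806_231001_pt/ {
    alias /mnt/ofm/monaco-20250806_231001_pt/tiles/;

    # Only a 404 (tile missing on disk) is redirected to the empty tile.
    # Other open() errors such as EMFILE stay a 500.
    error_page 404 = @empty_tile;

    # Missing ocean tiles are expected; don't log each one as an error.
    log_not_found off;

    # add_header (without "always") and expires only apply to 2xx/3xx
    # responses listed in the nginx docs, so a 500 gets no 10y lifetime.
    add_header Content-Encoding gzip;
    expires 10y;

    types {
        application/vnd.mapbox-vector-tile pbf;
    }

    add_header 'Access-Control-Allow-Origin' '*' always;
}
```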
>>
>> System and Nginx Diagnostic Information
>>
>> Here is the relevant information about the system and Nginx process state
>> (captured at normal load, after I solved the high traffic incident, still
>> showing high FD usage on one worker).
>>
>> - *OS:* Ubuntu 22.04 LTS, 64 GB RAM, local NVME SSD, physical server
>>   (not VPS)
>>
>> - *nginx version*: nginx/1.27.4
>>
>> - *Systemd ulimit for nofile:*
>>
>>     # cat /etc/security/limits.d/limits1m.conf
>>     * soft nofile 1048576
>>     * hard nofile 1048576
>>
>> - *Nginx Worker Process Limits (worker_rlimit_nofile is set to 300000):*
>>
>>     # for pid in $(pgrep -f "nginx: worker"); do sudo cat /proc/$pid/limits | grep "Max open files"; done
>>     Max open files 300000 300000 files
>>     Max open files 300000 300000 files
>>     ... (all 8 workers show the same limit)
>>
>> - *Open File Descriptor Count per Worker:*
>>
>>     # for pid in $(pgrep -f "nginx: worker"); do count=$(sudo lsof -p $pid 2>/dev/null | wc -l); echo "nginx PID $pid: $count open files"; done
>>     nginx PID 1090: 57 open files
>>     nginx PID 1091: 117 open files
>>     nginx PID 1092: 931 open files
>>     nginx PID 1093: 65027 open files
>>     nginx PID 1094: 7449 open files
>>     ...
>>
>>   (Note the one worker with a very high count, ~98% of which are
>>   regular files).
>>
>> - sysctl fs.file-max:
>>
>>     fs.file-max = 9223372036854775807
>>
>> - systemctl show nginx | grep LimitNOFILE:
>>
>>     LimitNOFILE=524288
>>     LimitNOFILESoft=1024
>>
>>
>> Relevant Nginx Configuration
>>
>> Here are the key parts of my configuration that led to the issue.
>>
>>     worker_processes auto;
>>     worker_rlimit_nofile 300000;
>>
>>     events {
>>         worker_connections 40000;
>>         multi_accept on;
>>     }
>>
>>     http {
>>         open_file_cache max=1000000 inactive=60m;
>>         open_file_cache_valid 60m;
>>         open_file_cache_min_uses 1;
>>         open_file_cache_errors on;
>>         # ...
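A quick check of the numbers above may help. The open file cache is kept separately in each worker, and every cached regular file can hold an open descriptor until it is evicted or goes inactive. So max=1000000 lets a single worker aim for more cached descriptors than its worker_rlimit_nofile of 300000, and that same budget must also cover up to 40000 client connections. The sketch below keeps the cache under 300000 - 40000 = 260000 entries. The value 200000 is an arbitrary assumption chosen for illustration, not a measured setting. The remaining directives are copied from the excerpt above.

```nginx
worker_processes auto;
worker_rlimit_nofile 300000;   # per-worker descriptor limit

events {
    worker_connections 40000;  # each client connection uses one descriptor
    multi_accept on;
}

http {
    # Hypothetical sizing: cached file descriptors (at most 200000) plus
    # client sockets (at most 40000) stay below the 300000 limit, leaving
    # some room for log files and listening sockets.
    open_file_cache max=200000 inactive=60m;
    open_file_cache_valid 60m;
    open_file_cache_min_uses 1;
    open_file_cache_errors on;
    # ...
}
```

The right value depends on how many distinct tiles one worker actually keeps hot. A shorter inactive time would also let idle descriptors close sooner. Both settings would need to be verified under real load.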
>>
>> *server block tile serving logic:*
>>
>>     location ^~ /monaco/20250806_231001_pt/ {
>>         alias /mnt/ofm/monaco-20250806_231001_pt/tiles/;
>>         try_files $uri @empty_tile;
>>         add_header Content-Encoding gzip;
>>
>>         expires 10y;
>>
>>         types {
>>             application/vnd.mapbox-vector-tile pbf;
>>         }
>>
>>         add_header 'Access-Control-Allow-Origin' '*' always;