(airflow-steward) branch main updated: docs(setup): catalog sandbox-shaped failure modes (SSH/Yubikey, port bind, docker socket) (#291)

potiuk Mon, 25 May 2026 13:22:03 -0700

This is an automated email from the ASF dual-hosted git repository.

potiuk pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/airflow-steward.git



The following commit(s) were added to refs/heads/main by this push:
     new 3cbb1ba  docs(setup): catalog sandbox-shaped failure modes 
(SSH/Yubikey, port bind, docker socket) (#291)
3cbb1ba is described below

commit 3cbb1ba33609573e17a0f671bea842822f6aa6c5
Author: Jarek Potiuk <[email protected]>
AuthorDate: Mon May 25 22:21:49 2026 +0200

    docs(setup): catalog sandbox-shaped failure modes (SSH/Yubikey, port bind, 
docker socket) (#291)
    
    The secure-agent setup's sandbox sometimes blocks legitimate
    workflows in ways that look like unrelated bugs -- "ssh-agent
    unreachable", "address already in use", "Cannot connect to Docker
    daemon". The root cause is a sandbox restriction; the surfaced
    error is the userland framework's generic string; the connection
    between the two is not obvious. People (and agents) waste time
    chasing the wrong layer.
    
    New `docs/setup/sandbox-troubleshooting.md` is the catalog of
    those cases. Three seed entries cover the failure modes we have
    actually hit:
    
    1. **SSH agent / Yubikey unreachable from inside the sandbox** --
       `SSH_AUTH_SOCK` is passed through `claude-iso`'s env whitelist
       but the socket *path* (`/private/tmp/com.apple.launchd.*/Listeners`
       on macOS) is not in `allowRead`. Fix: add the launchd socket
       glob to `sandbox.filesystem.allowRead`.
    
    2. **Test cannot bind to a localhost port** -- the bind itself
       usually works; the failure is the test's own HTTP client
       talking to `127.0.0.1:NNNN` and the sandbox proxy treating
       loopback as a disallowed egress target. Fix: add `localhost`
       and `127.0.0.1` to `sandbox.network.allowedDomains`.
    
    3. **Docker / Podman command fails with a socket error** -- the
       reference `~/.claude/settings.json` denies `~/.docker/**` to
       keep Docker credentials away from the agent's Read tool, which
       also blocks the runtime socket under
       `~/.docker/run/docker.sock`. Fix: allow the socket file
       specifically; narrow the deny to `config.json` and
       `contexts/**` so credentials stay protected.
    
    Each entry follows the same four-part shape -- Symptom (verbatim
    error text), Root cause (which sandbox layer + why the restriction
    exists), Fix (concrete settings.json snippet with per-entry
    rationale), Notes (platform variants + when not to apply the
    widening) -- so a future reader can grep into the page and find
    the matching entry. A closing section documents the shape so new
    failure modes can be appended in the same form.
    
    Cross-references added:
    - `docs/setup/README.md` *Deep documentation* -- new bullet.
    - `docs/setup/secure-agent-setup.md` *See also* -- new bullet.
    
    `.typos.toml` gains an `ERRO` entry under `extend-identifiers` so
    the doc can quote docker's actual truncated log prefix
    (`ERRO[0000] ...`) verbatim -- preserves the greppability of the
    catalog against the literal error strings users see.
    
    This PR is **docs-only**. The actual settings.json widenings each
    case prescribes are deliberately not applied to the framework's
    reference `.claude/settings.json` in this PR -- framework-internal
    work does not need them, and dogfooding a kitchen-sink set would
    narrow the framework's own security posture without benefit. The
    catalog is the discoverable resource for adopters to apply the
    widenings as they hit each case. Follow-up PRs will add the
    in-session diagnostic skill (`setup-isolated-setup-doctor`) and
    the PostToolUse hint hook that lean on this catalog.
    
    Generated-by: Claude Code (Opus 4.7)
---
 .typos.toml                           |   5 +
 docs/setup/README.md                  |   6 +
 docs/setup/sandbox-troubleshooting.md | 360 ++++++++++++++++++++++++++++++++++
 docs/setup/secure-agent-setup.md      |   6 +
 4 files changed, 377 insertions(+)

diff --git a/.typos.toml b/.typos.toml
index 39bf317..9d31559 100644
--- a/.typos.toml
+++ b/.typos.toml
@@ -65,6 +65,11 @@ ponymail_mcp = "ponymail_mcp"
 gh = "gh"
 mcp = "mcp"
 MCP = "MCP"
+# `ERRO` is docker / containerd / nerdctl's literal truncated log
+# level — appears in CLI output as `ERRO[0000] …`. Quoted verbatim
+# in `docs/setup/sandbox-troubleshooting.md` so grep on the real
+# error string matches the doc entry.
+ERRO = "ERRO"
 
 [files]
 # Skip auto-generated lockfiles + the cargo-style pinned-versions
diff --git a/docs/setup/README.md b/docs/setup/README.md
index 2b52fc2..20a3916 100644
--- a/docs/setup/README.md
+++ b/docs/setup/README.md
@@ -57,6 +57,12 @@ framework safe to use.
 - [**`unadopt.md`**](unadopt.md) — counterpart to `install-recipes.md`:
   remove the framework artefacts the adopt flow installed. One
   path, full plan surfaced before any write.
+- [**`sandbox-troubleshooting.md`**](sandbox-troubleshooting.md) —
+  catalog of known sandbox-shaped failure modes (SSH agent /
+  Yubikey unreachable, test port-bind blocked, docker/podman
+  socket denied) with symptom → root cause → settings.json fix
+  for each. The page to grep when a normal-looking operation
+  fails in the sandbox in an unexpected way.
 
 ## Typical lifecycle
 
diff --git a/docs/setup/sandbox-troubleshooting.md 
b/docs/setup/sandbox-troubleshooting.md
new file mode 100644
index 0000000..59f9869
--- /dev/null
+++ b/docs/setup/sandbox-troubleshooting.md
@@ -0,0 +1,360 @@
+<!-- START doctoc generated TOC please keep comment here to allow auto update 
-->
+<!-- DON'T EDIT THIS SECTION, INSTEAD RE-RUN doctoc TO UPDATE -->
+**Table of Contents**  *generated with 
[DocToc](https://github.com/thlorenz/doctoc)*
+
+- [Sandbox troubleshooting](#sandbox-troubleshooting)
+  - [Shape of each entry](#shape-of-each-entry)
+  - [SSH agent / Yubikey appears unreachable from inside the 
sandbox](#ssh-agent--yubikey-appears-unreachable-from-inside-the-sandbox)
+    - [Symptom](#symptom)
+    - [Root cause](#root-cause)
+    - [Fix](#fix)
+    - [Notes](#notes)
+  - [Test cannot bind to a localhost 
port](#test-cannot-bind-to-a-localhost-port)
+    - [Symptom](#symptom-1)
+    - [Root cause](#root-cause-1)
+    - [Fix](#fix-1)
+    - [Notes](#notes-1)
+  - [Docker / Podman command fails with a socket 
error](#docker--podman-command-fails-with-a-socket-error)
+    - [Symptom](#symptom-2)
+    - [Root cause](#root-cause-2)
+    - [Fix](#fix-2)
+    - [Notes](#notes-2)
+  - [Adding a new entry](#adding-a-new-entry)
+
+<!-- END doctoc generated TOC please keep comment here to allow auto update -->
+
+<!-- SPDX-License-Identifier: Apache-2.0
+     https://www.apache.org/licenses/LICENSE-2.0 -->
+
+# Sandbox troubleshooting
+
+The secure agent setup ([`secure-agent-setup.md`](secure-agent-setup.md))
+runs every Bash subprocess inside a sandbox: Seatbelt on macOS,
+bubblewrap on Linux, plus Claude Code's filesystem / network
+allowlists. A correct sandbox restricts what the agent can read and
+where it can talk; an *over-restrictive* one breaks legitimate
+workflows in ways that look like unrelated bugs ("ssh-agent
+unreachable", "address already in use", "Cannot connect to Docker
+daemon"). This page is the catalog of those cases — the
+**symptom** you see, the **root cause** in the sandbox config, and
+the **fix** (a settings.json widening with a one-line rationale).
+
+If you hit a sandbox-shaped failure not listed below, add it here
+in the same shape — the catalog grows by experience, not by
+prediction.
+
+Related:
+
+- [`secure-agent-setup.md`](secure-agent-setup.md) — full install
+  walkthrough including the authoritative `~/.claude/settings.json`
+  reference.
+- [`secure-agent-internals.md`](secure-agent-internals.md) — how
+  each layer of the sandbox works and why.
+
+---
+
+## Shape of each entry
+
+Every entry follows the same four sections so a future reader can
+pattern-match quickly:
+
+1. **Symptom** — the exact error message text the agent (or the
+   user, in a terminal) sees. Verbatim where possible so a grep
+   into this page surfaces the matching entry.
+2. **Root cause** — which sandbox layer (Seatbelt / bubblewrap /
+   Claude Code filesystem allowlist / network allowlist /
+   `permissions.deny`) is blocking the call, and why the
+   restriction exists.
+3. **Fix** — a concrete edit to `~/.claude/settings.json` (or the
+   adopter's project-local `.claude/settings.local.json`, where
+   that scope makes more sense) shown as a JSON snippet. Per-entry
+   rationale so the widening is auditable.
+4. **Notes** — platform-specific path variants, alternative paths
+   the same agent / runtime might use, when *not* to apply the
+   widening.
+
+---
+
+## SSH agent / Yubikey appears unreachable from inside the sandbox
+
+### Symptom
+
+Any of:
+
+```text
+sign_and_send_pubkey: signing failed for ED25519 "user@host": agent refused 
operation
+Could not open a connection to your authentication agent.
+ssh-add: error fetching identities for protocol 1: communication with agent 
failed
+Permission denied (publickey).
+```
+
+…on `git push`, `ssh user@host`, `ssh-add -l`, or any operation
+that consults `ssh-agent`. The variant the user reports as
+"Yubikey badly detected" — the Yubikey is plugged in and works
+outside the sandbox, but the agent inside the sandbox can't reach
+its socket.
+
+### Root cause
+
+`SSH_AUTH_SOCK` is passed through the `claude-iso` clean-env
+wrapper's whitelist (see [`secure-agent-setup.md` → The clean-env
+wrapper](secure-agent-setup.md#the-clean-env-wrapper)), so the
+environment variable is set inside the sandbox. The socket *path*
+it points at is the missing piece: on macOS the path is typically
+`/private/tmp/com.apple.launchd.*/Listeners`, which is not in any
+`allowRead` entry; on Linux it is typically
+`/run/user/<uid>/keyring/ssh` or a gpg-agent variant, only the
+gpg-agent path of which is currently allowed
+(`/run/user/*/gnupg/`).
+
+Without read access to the socket file, the agent's `ssh` /
+`git push` subprocesses get `Operation not permitted` when they
+try to `connect(2)` the unix-domain socket — but the userland
+error surfaces as the "agent unreachable" / "Permission denied"
+strings above, which is what makes the cause non-obvious.
+
+### Fix
+
+Add the SSH agent socket directories to `sandbox.filesystem.allowRead`:
+
+```jsonc
+// ~/.claude/settings.json
+{
+  "sandbox": {
+    "filesystem": {
+      "allowRead": [
+        // ...existing entries...
+        "/private/tmp/com.apple.launchd.*/Listeners",   // macOS: system 
launchd-managed ssh-agent socket
+        "/private/tmp/ssh-*/agent.*"                    // macOS: 
openssh-portable variant (rare)
+        // Linux: `~/.gnupg/` and `/run/user/*/gnupg/` are already in the 
framework reference;
+        // add `/run/user/*/keyring/` here if you use gnome-keyring or 
seahorse for SSH.
+      ]
+    }
+  }
+}
+```
+
+Per-entry rationale:
+
+- `/private/tmp/com.apple.launchd.*/Listeners` — Apple's launchd
+  manages per-session daemon sockets including the system
+  `ssh-agent`. The wildcard `*` matches the launchd UUID; the
+  `Listeners` directory holds the actual socket files. This is the
+  default path on macOS.
+- `/private/tmp/ssh-*/agent.*` — fallback for openssh-portable
+  running outside launchd (uncommon on stock macOS, sometimes seen
+  with Homebrew-installed openssh).
+
+### Notes
+
+- If you use **gpg-agent for SSH** (`enable-ssh-support` in
+  `~/.gnupg/gpg-agent.conf`), no extra entry is needed — the
+  framework reference already includes `~/.gnupg/` and
+  `/run/user/*/gnupg/`, which cover the gpg-agent SSH socket
+  (`S.gpg-agent.ssh`) on both platforms.
+- If you use **Secretive** (an alternative macOS Yubikey
+  agent), the socket lives under
+  `~/Library/Group Containers/<bundle>/socket.ssh`; add that
+  specific path to `allowRead` instead of the launchd glob.
+- Do **not** widen `allowRead` to `/private/tmp/**` — that opens
+  the entire system temp directory, which other processes use for
+  arbitrary files including credentials. Stay specific.
+
+---
+
+## Test cannot bind to a localhost port
+
+### Symptom
+
+```text
+[Errno 13] Permission denied
+[Errno 49] Can't assign requested address
+OSError: [Errno 98] Address already in use   # red herring when sandbox-related
+```
+
+…from a test that starts a fixture server (`pytest` with
+`live_server`, `requests-mock`, an integration test spinning up a
+local HTTP listener, a webhook fixture). The same test passes
+outside the sandbox.
+
+### Root cause
+
+Claude Code's `sandbox.network` block is allowlist-based on
+**outbound hosts** (egress to named domains), not on inbound
+binds. For most listener types this is fine — `bind(2)` on
+`127.0.0.1` doesn't go through the network namespace at all on
+macOS, and on Linux loopback is allowed by default.
+
+The case that bites is **a test that needs to talk to its own
+server over the loopback interface**: the test binds (works),
+the test's HTTP client then tries to `GET http://127.0.0.1:NNNN/`
+(may fail), because the sandbox's network allowlist does not
+include `127.0.0.1` or `localhost` and the egress proxy treats it
+as a disallowed destination.
+
+The "Permission denied" / "Address already in use" texts the test
+runner surfaces are *its own framework's* generic error strings,
+not the sandbox's — which makes the root cause hard to spot.
+
+### Fix
+
+Add `localhost` and `127.0.0.1` to the network allowlist:
+
+```jsonc
+// ~/.claude/settings.json
+{
+  "sandbox": {
+    "network": {
+      "allowedDomains": [
+        // ...existing entries...
+        "localhost",                                    // local fixture 
servers, test webhooks
+        "127.0.0.1"                                     // same; IP form for 
tests that use it directly
+      ]
+    }
+  }
+}
+```
+
+Per-entry rationale:
+
+- `localhost` / `127.0.0.1` — loopback only. Adding these does
+  not widen the egress surface (no traffic leaves the host); it
+  just lets the sandbox proxy stop treating loopback as a
+  disallowed destination.
+
+### Notes
+
+- For tests that need an *outbound* port (e.g. an integration test
+  that listens on a port and then a separate process connects from
+  outside the test's own runtime), `localhost` is not enough — you
+  need to allow the actual remote IP in `allowedDomains`. Those
+  are project-scope concerns; add to `.claude/settings.json` in
+  the adopter repo rather than the user-scope file.
+- If a test is genuinely incompatible with the sandbox (e.g. it
+  expects raw socket access to a privileged port), the per-call
+  escape hatch is `dangerouslyDisableSandbox: true` in the Bash
+  tool call — but that surface should be visually loud (the
+  `sandbox-bypass-warn.sh` hook ensures it is). Prefer the
+  allowlist fix above when applicable.
+
+---
+
+## Docker / Podman command fails with a socket error
+
+### Symptom
+
+```text
+Cannot connect to the Docker daemon at 
unix:///Users/<user>/.docker/run/docker.sock. Is the docker daemon running?
+ERRO[0000] error connecting to /var/run/docker.sock: open 
/var/run/docker.sock: operation not permitted
+Cannot connect to Podman. Please verify your connection to the Linux system 
using `podman system connection list`
+```
+
+…on any `docker` / `podman` / `nerdctl` invocation. The CLI is
+installed and the runtime is running on the host — the sandbox is
+just blocking access to its socket.
+
+### Root cause
+
+The runtime CLI talks to its daemon via a unix-domain socket. The
+framework's reference `~/.claude/settings.json` has
+`Read(~/.docker/**)` in `permissions.deny` (to keep the agent
+from reading Docker credentials stored under `~/.docker/config.json`)
+and lists `~/.docker` in the broader filesystem `denyRead` set.
+Both block the socket file under `~/.docker/run/docker.sock`,
+which is where Docker.app for Mac drops its socket.
+
+For Colima the socket lives under `~/.colima/...` (not currently
+covered by any allow / deny in the framework reference, so it
+works by default), and for rootless Podman it lives under
+`$XDG_RUNTIME_DIR/podman/...` (also not covered → works). The
+case that fails is specifically Docker.app on macOS plus the
+generic `~/.docker` denial.
+
+### Fix
+
+Allow Bash subprocesses to read the *socket file* without opening
+the `~/.docker/` directory generally:
+
+```jsonc
+// ~/.claude/settings.json
+{
+  "sandbox": {
+    "filesystem": {
+      "allowRead": [
+        // ...existing entries...
+        "~/.docker/run/docker.sock",                    // Docker.app for Mac 
socket
+        "~/.colima/default/docker.sock",                // Colima default 
socket (defensive; usually not blocked)
+        "/var/run/docker.sock"                          // Linux daemon socket 
(root-managed install)
+      ]
+    }
+  },
+  "permissions": {
+    "deny": [
+      // ...existing entries...
+      "Read(~/.docker/config.json)",                    // keep this denial — 
credentials live here
+      "Read(~/.docker/contexts/**)"                     // keep this denial — 
saved contexts
+      // (Replace the broad `Read(~/.docker/**)` with these two specific 
paths.)
+    ]
+  }
+}
+```
+
+Per-entry rationale:
+
+- `~/.docker/run/docker.sock` — Docker.app for Mac's socket
+  location. Read access on the socket file is what the docker CLI
+  needs to `connect(2)` to the daemon.
+- `~/.colima/default/docker.sock` — Colima's default; explicit
+  even though it works today, to anticipate a future widening of
+  the generic `~/.` denial.
+- `/var/run/docker.sock` — Linux systems with daemon Docker;
+  socket is root-managed but world-readable by convention.
+- The narrowed `permissions.deny` keeps the agent's `Read` tool
+  from seeing Docker auth tokens (`config.json`) and saved
+  contexts (which include host IPs and credentials), while
+  allowing the Bash subprocess to use the socket.
+
+### Notes
+
+- For **rootless Podman**, the socket is at
+  `$XDG_RUNTIME_DIR/podman/podman.sock` (typically
+  `/run/user/<uid>/podman/podman.sock`). Currently allowed by
+  default because the framework reference does not deny
+  `/run/user/<uid>/`; if a future widening adds such a denial,
+  add `/run/user/*/podman/` to `allowRead`.
+- For **CI / image-build workflows** that run inside an adopter
+  repo, prefer adding the socket allow at project scope
+  (`.claude/settings.local.json` in the adopter) rather than user
+  scope — that keeps the framework's user-scope reference minimal
+  and makes the widening visible to whoever audits the adopter's
+  repo.
+- Do **not** widen `allowRead` to `~/.docker/**` — the directory
+  holds auth tokens and saved contexts; the whole point of the
+  framework's `Read(~/.docker/**)` denial is to keep those out of
+  the agent's reach.
+
+---
+
+## Adding a new entry
+
+When you hit a sandbox-shaped failure not in this list:
+
+1. Capture the exact symptom (error text, command, what you were
+   trying to do). The error text is what makes the entry
+   greppable for the next person.
+2. Identify the layer: filesystem (`Operation not permitted` on a
+   path), network (refused / timed-out connection to an allowed
+   host's friend), or `permissions.deny` (the agent's tool got an
+   "I refuse" without the sandbox even being consulted).
+3. Find the minimal widening — the most specific `allowRead` /
+   `allowedDomains` entry that resolves the symptom without
+   opening adjacent paths. Stay as specific as the runtime
+   reasonably allows; never widen `~/`, `/var/`, or `/private/`
+   as a whole.
+4. Add an entry to this page in the *Shape of each entry* form
+   above. Cross-reference adjacent entries when relevant.
+
+If the fix involves `dangerouslyDisableSandbox: true` rather than
+a settings.json widening, document it here too — the bypass is a
+legitimate per-call escape hatch, but it should be visible in the
+catalog so future readers can see when it's the right call.
diff --git a/docs/setup/secure-agent-setup.md b/docs/setup/secure-agent-setup.md
index c202d66..70eb146 100644
--- a/docs/setup/secure-agent-setup.md
+++ b/docs/setup/secure-agent-setup.md
@@ -1547,6 +1547,12 @@ gaps; together they are the actual sandbox.
   and Seatbelt (macOS) enforce the policy at the OS layer, the
   SNI / DoH blind spot, the feedback-mechanism layering, and the
   residual risks the setup does not eliminate.
+- [`sandbox-troubleshooting.md`](sandbox-troubleshooting.md) —
+  catalog of known sandbox-shaped failure modes (SSH agent /
+  Yubikey unreachable, test port-bind blocked, docker / podman
+  socket denied) with symptom → root cause → settings.json fix
+  for each. Grep here first when a normal-looking operation fails
+  inside the sandbox.
 - [`AGENTS.md`](../../AGENTS.md) — placeholder convention used in skill
   files (`<tracker>`, `<upstream>`, `<security-list>`, …).
 - [`README.md`](../../README.md) — framework overview and how the

(airflow-steward) branch main updated: docs(setup): catalog sandbox-shaped failure modes (SSH/Yubikey, port bind, docker socket) (#291)

Reply via email to