dave2wave commented on issue #341:
URL:
https://github.com/apache/tooling-trusted-releases/issues/341#issuecomment-3855553163
From ChatGPT:
Below is a **focused discussion of the `exarch` option**, framed for a
**Quart / ASGI application** and compared to the pure-Python approach you
already have.
I’ll cover **what exarch gives you**, **how it works conceptually**, **how
to use it safely in Quart**, and **when it is the right (or wrong) choice**.
---
## What `exarch` is (in practical terms)
`exarch` is a **secure archive extraction library** with Python bindings,
implemented primarily in **Rust**.
Its goals are:
* Safe extraction of **ZIP, TAR, TAR.GZ, TAR.XZ, TAR.BZ2**, and related formats
* Built-in defenses against:
  * Excessive expansion (archive bombs)
  * Path traversal (`../`, absolute paths)
  * Symlink / hardlink attacks
  * File count explosions
  * Compression-ratio abuse
* A **single API** that handles inspection + extraction together
This means you **do not manually re-implement safety checks** per format.
---
## Why `exarch` is materially different from stdlib inspection
### Python stdlib approach
You:
* Open archive headers
* Accumulate sizes
* Enforce ratios
* Enforce limits
* Then extract (separately)
This is **correct but fragile**:
* Easy to miss edge cases (symlinks in TAR, sparse files, overlaps)
* Easy to drift policy between formats
* Requires ongoing maintenance
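For comparison, the stdlib steps listed above can be sketched for ZIP only. This is a minimal illustration, not the code under discussion in this issue: `inspect_zip` and the three limit constants are hypothetical names, and the check trusts whatever sizes the archive headers declare.

```python
import zipfile

# Illustrative limits (assumed values; tune for your deployment).
MAX_FILES = 10_000
MAX_TOTAL_SIZE = 1 * 1024 * 1024 * 1024  # 1 GiB
MAX_RATIO = 100


def inspect_zip(fileobj) -> dict:
    """Check header-declared sizes of a ZIP against the limits above."""
    with zipfile.ZipFile(fileobj) as zf:
        infos = zf.infolist()
    if len(infos) > MAX_FILES:
        raise ValueError("too many entries")
    total = sum(i.file_size for i in infos)       # declared uncompressed bytes
    packed = sum(i.compress_size for i in infos)  # compressed bytes on disk
    if total > MAX_TOTAL_SIZE:
        raise ValueError("declared size exceeds limit")
    if packed and total / packed > MAX_RATIO:
        raise ValueError("compression ratio exceeds limit")
    return {"files": len(infos), "bytes": total}
```

A TAR version needs its own loop over `tarfile` members, which is where policy tends to drift between formats.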
### `exarch` approach
You:
* Declare limits once
* Call extract
* Library enforces limits **during streaming extraction**
This is **safer by default**, especially under adversarial input.
---
## Security properties `exarch` handles for you
### 1. Streaming enforcement (critical)
Limits are checked **while bytes are being decompressed**, not after:
* Extraction stops immediately once a limit is hit
* Disk exhaustion and CPU bombs are mitigated early
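Stdlib inspection cannot fully do this because it trusts the sizes declared in archive headers, which the attacker controls and which need not match what decompression actually produces.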
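The idea of enforcing a limit while bytes are being produced can be illustrated with the stdlib. This is a sketch of the general technique, not of how `exarch` implements it; `stream_member`, `CHUNK`, and `budget` are hypothetical names.

```python
import zipfile

CHUNK = 64 * 1024  # read size per iteration


def stream_member(zf: zipfile.ZipFile, name: str, out, budget: int) -> int:
    """Copy one member to `out`, aborting once more than `budget` bytes appear."""
    written = 0
    with zf.open(name) as src:
        while chunk := src.read(CHUNK):
            written += len(chunk)
            if written > budget:
                # Stop before writing the chunk that crosses the limit.
                raise ValueError(f"{name}: exceeded {budget} bytes")
            out.write(chunk)
    return written
```

Because the count comes from bytes actually decompressed, a header that understates the size does not help the attacker.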
---
### 2. Symlink & hardlink safety
`exarch` prevents:
* Symlinks escaping the extraction directory
* TAR hardlinks pointing outside the root
* Link chains used to overwrite sensitive files
This is extremely easy to get wrong manually.
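To show why, here is a minimal lexical check for TAR link targets using only `tarfile`. It is a sketch under stated assumptions: `unsafe_links` is a hypothetical helper, it only looks at each link in isolation, and it does not catch chains where one extracted link redirects a later entry.

```python
import posixpath
import tarfile


def unsafe_links(tf: tarfile.TarFile) -> list[str]:
    """Return names of link members whose target would land outside the root."""
    bad = []
    for m in tf.getmembers():
        if m.issym():
            # Symlink targets are relative to the directory holding the link.
            target = posixpath.join(posixpath.dirname(m.name), m.linkname)
        elif m.islnk():
            # Hardlink targets are relative to the archive root.
            target = m.linkname
        else:
            continue
        resolved = posixpath.normpath(target)
        escapes = resolved == ".." or resolved.startswith("../")
        if posixpath.isabs(m.linkname) or escapes:
            bad.append(m.name)
    return bad
```

Even this short version has to treat symlinks and hardlinks differently, which is the kind of detail that hand-written checks tend to miss.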
---
### 3. Path normalization correctness
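Doing path handling inside the library, rather than in ad hoc application code, is meant to ensure:
* Unicode edge cases in entry names are handled in one place (Rust's standard path types do not normalize Unicode on their own, so confirm in the `exarch` documentation what it does here)
* `..`, `.` and platform edge cases are handled consistently
* No reliance on naive string checks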
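As an illustration of what "not a naive string check" means, the sketch below validates an entry name component by component. `safe_member_path` is a hypothetical helper, it is deliberately conservative (it rejects any `..` and any colon in the first component), and it does not attempt Unicode normalization.

```python
from pathlib import PurePosixPath


def safe_member_path(name: str) -> PurePosixPath:
    """Return a normalized relative path for an archive entry, or raise."""
    # Treat backslashes as separators so Windows-style names are checked too.
    path = PurePosixPath(name.replace("\\", "/"))
    if path.is_absolute():
        raise ValueError(f"absolute path: {name!r}")
    parts = path.parts  # "." components and repeated slashes are already dropped
    if not parts:
        raise ValueError(f"empty path: {name!r}")
    if ":" in parts[0]:
        raise ValueError(f"drive-qualified path: {name!r}")
    if ".." in parts:
        raise ValueError(f"parent reference in path: {name!r}")
    return PurePosixPath(*parts)
```

For example, `safe_member_path("a/./b//c.txt")` yields `a/b/c.txt`, while `"../x"`, `"/etc/passwd"`, and `"C:\\x"` are all rejected.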
---
### 4. Unified policy across formats
One set of limits covers:
* ZIP
* TAR
* Compressed TAR variants
You don’t need format-specific logic.
---
## Using `exarch` in a Quart application
### Basic safe extraction
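```python
import exarch


# NOTE: the keyword names below are as given in this comment; check them
# against the exarch documentation for the version you install.
def extract_archive(src: str, dst: str):
    return exarch.extract_archive(
        src,
        dst,
        max_files=10_000,
        max_total_size=1 * 1024 * 1024 * 1024,  # 1 GiB
        max_compression_ratio=100,
    )
```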
This call:
* Validates archive structure
* Enforces all limits
* Extracts only if safe
* Raises an exception on violation
---
### Quart endpoint (recommended pattern)
```python
import asyncio
import os
import tempfile

from quart import Quart, jsonify, request
from werkzeug.utils import secure_filename

app = Quart(__name__)


@app.route("/upload", methods=["POST"])
async def upload():
    files = await request.files
    uploaded = files.get("file")
    if not uploaded:
        return jsonify(error="file required"), 400
    with tempfile.TemporaryDirectory() as tmpdir:
        # The client controls `filename`; sanitize it so it cannot
        # steer the upload outside tmpdir.
        name = secure_filename(uploaded.filename or "") or "upload"
        src = os.path.join(tmpdir, name)
        dst = os.path.join(tmpdir, "out")
        await uploaded.save(src)
        os.mkdir(dst)
        try:
            # extract_archive is the wrapper defined in the previous block.
            result = await asyncio.to_thread(extract_archive, src, dst)
        except Exception as e:
            return jsonify(status="rejected", reason=str(e)), 400
        # Attribute names are as given in this comment; verify them
        # against the exarch result type.
        return jsonify(
            status="accepted",
            files=result.files_extracted,
            bytes=result.bytes_written,
        )
```
### Why `asyncio.to_thread` matters
* Decompression is CPU-heavy
* Quart’s event loop must remain responsive
* Running extraction in a worker thread keeps one slow archive from stalling every other request
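The effect can be demonstrated without Quart or `exarch`. In this sketch `blocking_extract` is a hypothetical stand-in that simply sleeps; a real native extension only gives the same benefit if it releases the GIL while working, which is an assumption to verify for `exarch`.

```python
import asyncio
import contextlib
import time


def blocking_extract() -> str:
    """Stand-in for a blocking extraction call (hypothetical workload)."""
    time.sleep(0.2)
    return "done"


async def main() -> tuple[str, int]:
    ticks = 0

    async def ticker() -> None:
        # Represents other requests being served by the event loop.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1

    task = asyncio.create_task(ticker())
    result = await asyncio.to_thread(blocking_extract)  # loop keeps running
    task.cancel()
    with contextlib.suppress(asyncio.CancelledError):
        await task
    return result, ticks
```

Running `asyncio.run(main())` returns the result together with a non-zero tick count, showing the loop kept scheduling other work during the blocking call. Calling `blocking_extract()` directly inside the coroutine would leave the count at zero.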
---
## What you **give up** by using `exarch`
### 1. Fine-grained pre-inspection
You cannot easily:
* “Look but not extract”
* Accept archive metadata without touching payload bytes
`exarch` is **extract-or-fail**, not analyze-only.
---
### 2. Policy transparency
Security teams sometimes want:
* Explicit, auditable checks
* Hand-written logic they can reason about line-by-line
`exarch` trades transparency for correctness.
---
### 3. Dependency surface
* Rust binary wheel
* Platform compatibility considerations
* Needs CI coverage across OS/architectures
Usually acceptable, but relevant for constrained environments.
---
## Recommended usage patterns
### Pattern 1: Inspect → extract (hybrid)
* Do a **cheap header inspection** (file count, extension allow-list)
* Then use `exarch` for actual extraction
This reduces unnecessary decompression work.
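A cheap pre-check for ZIP uploads might look like the sketch below. It reads only the central directory and decompresses nothing. `precheck_zip`, `ALLOWED_SUFFIXES`, and `MAX_ENTRIES` are hypothetical names, and the allow-list shown is only an example.

```python
import zipfile
from pathlib import PurePosixPath

ALLOWED_SUFFIXES = {".txt", ".json", ".md"}  # hypothetical allow-list
MAX_ENTRIES = 10_000


def precheck_zip(fileobj) -> None:
    """Reject obviously unwanted ZIPs from the central directory alone."""
    with zipfile.ZipFile(fileobj) as zf:
        infos = zf.infolist()
    if len(infos) > MAX_ENTRIES:
        raise ValueError("too many entries")
    for info in infos:
        if info.is_dir():
            continue
        suffix = PurePosixPath(info.filename).suffix.lower()
        if suffix not in ALLOWED_SUFFIXES:
            raise ValueError(f"disallowed file type: {info.filename!r}")
```

Archives that pass would then go to the limit-enforcing extraction step. The pre-check is a filter for wasted work, not a security boundary on its own.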
---
### Pattern 2: Extract to quarantine
* Always extract to a **temporary sandbox**
* Only move extracted files to a trusted location after success
This limits blast radius even if policy changes.
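One way to sketch the quarantine step is to extract into a scratch directory next to the destination and rename it into place only on success. `extract_via_quarantine` is a hypothetical helper, and `extractor` stands for any callable taking `(archive, destination)`, such as the `extract_archive` wrapper shown earlier.

```python
import os
import shutil
import tempfile
from typing import Callable


def extract_via_quarantine(
    archive: str,
    final_dir: str,
    extractor: Callable[[str, str], object],
) -> object:
    """Extract into a scratch directory and publish it only on success."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    # Same parent directory, so the rename stays on one filesystem.
    scratch = tempfile.mkdtemp(prefix=".quarantine-", dir=parent)
    try:
        result = extractor(archive, scratch)
        # Raises if final_dir already exists with content.
        os.rename(scratch, final_dir)
    except BaseException:
        # Any failure discards everything extracted so far.
        shutil.rmtree(scratch, ignore_errors=True)
        raise
    return result
```

On failure nothing partial is left behind, which also matches the hard-fail pattern: the caller sees the exception and the destination is never created.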
---
### Pattern 3: Hard fail on any anomaly
Do **not** try to “partially extract”:
* Fail fast
* Log
* Discard archive
---
## When `exarch` is the right choice
Use it if:
* Users upload arbitrary archives
* Input is hostile or untrusted
* You support multiple archive formats
* You want to minimize custom security code
* You care about symlink / TAR edge cases
---
## When to avoid it
Avoid or supplement it if:
* You only need metadata inspection
* You must approve archives without extracting
* You operate in an environment where Rust wheels are unacceptable
* You require fully deterministic, custom policy enforcement
---
## Bottom line
**Stdlib inspection**
✔ Transparent
✔ Dependency-free
✖ Easy to miss edge cases
**`exarch`**
✔ Safer under adversarial input
✔ Unified ZIP/TAR handling
✔ Streaming enforcement
✖ Less transparent
✖ Extract-first model
If you want next, I can:
* Show a **hybrid inspect → exarch extract pipeline**
* Map `exarch` failure modes to HTTP error codes
* Provide **policy tuning guidelines** (realistic limits)
* Show how to sandbox extraction with Linux namespaces
Just tell me where you want to go next.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]