dave2wave commented on issue #341:
URL: 
https://github.com/apache/tooling-trusted-releases/issues/341#issuecomment-3855553163

   From ChatGPT:
   
   Below is a **focused discussion of the `exarch` option**, framed for a **Quart / ASGI application** and compared to the pure-Python approach you already have.
   
   I’ll cover **what exarch gives you**, **how it works conceptually**, **how to use it safely in Quart**, and **when it is the right (or wrong) choice**.
   
   ---
   
   ## What `exarch` is (in practical terms)
   
   `exarch` is a **secure archive extraction library** with Python bindings, implemented primarily in **Rust**.
   
   Its goals are:
   
   * Safe extraction of **ZIP, TAR, TAR.GZ, TAR.XZ, TAR.BZ2**, and related formats
   * Built-in defenses against:
   
     * Excessive expansion (archive bombs)
     * Path traversal (`../`, absolute paths)
     * Symlink / hardlink attacks
     * File count explosions
     * Compression-ratio abuse
   * A **single API** that handles inspection + extraction together
   
   This means you **do not manually re-implement safety checks** per format.
   
   ---
   
   ## Why `exarch` is materially different from stdlib inspection
   
   ### Python stdlib approach
   
   You:
   
   * Open archive headers
   * Accumulate sizes
   * Enforce ratios
   * Enforce limits
   * Then extract (separately)
   
   This is **correct but fragile**:
   
   * Easy to miss edge cases (symlinks in TAR, sparse files, overlapping ZIP entries)
   * Easy to drift policy between formats
   * Requires ongoing maintenance
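
For reference, the stdlib steps listed above can be sketched roughly as follows for ZIP files. This is a minimal illustration, not the project's actual code: the function name `inspect_zip` and the default limits are assumptions, and it trusts the sizes the archive declares about itself.

```python
import zipfile


def inspect_zip(path, *, max_files=10_000, max_total_size=1 << 30, max_ratio=100):
    """Reject a ZIP based on the sizes declared in its central directory.

    `path` may be a filename or a seekable binary file object.
    Returns (entry_count, declared_uncompressed_bytes) when all checks pass.
    """
    with zipfile.ZipFile(path) as zf:
        infos = zf.infolist()  # reads headers only, no payload bytes

    if len(infos) > max_files:
        raise ValueError(f"too many entries: {len(infos)}")

    total = sum(info.file_size for info in infos)          # declared uncompressed
    compressed = sum(info.compress_size for info in infos)  # declared compressed

    if total > max_total_size:
        raise ValueError(f"declared size too large: {total} bytes")
    if compressed and total / compressed > max_ratio:
        raise ValueError(f"compression ratio too high: {total / compressed:.0f}")

    return len(infos), total
```

A TAR variant would need its own loop over `tarfile` members, which is where policy drift between formats tends to creep in.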
   
   ### `exarch` approach
   
   You:
   
   * Declare limits once
   * Call extract
   * Library enforces limits **during streaming extraction**
   
   This is **safer by default**, especially under adversarial input.
   
   ---
   
   ## Security properties `exarch` handles for you
   
   ### 1. Streaming enforcement (critical)
   
   Limits are checked **while bytes are being decompressed**, not after:
   
   * Extraction stops immediately once a limit is hit
   * Disk exhaustion and CPU bombs are mitigated early
   
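   Stdlib inspection cannot fully guarantee this: sizes declared in archive headers are attacker-controlled, and a compressed TAR stream carries no trustworthy total for its decompressed size.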
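
To make the idea of streaming enforcement concrete, here is a small sketch using only the standard library's `zlib`. It is not how `exarch` is implemented; `bounded_decompress` and `LimitExceeded` are hypothetical names. The point is that the byte budget is checked while output is produced, so an oversized stream is abandoned early instead of being fully inflated.

```python
import zlib


class LimitExceeded(Exception):
    """Raised when decompressed output grows past the configured limit."""


def bounded_decompress(compressed: bytes, max_output: int,
                       chunk_size: int = 64 * 1024) -> bytes:
    """Inflate a zlib stream, aborting as soon as output exceeds max_output."""
    decomp = zlib.decompressobj()
    out = bytearray()
    data = compressed
    while not decomp.eof:
        # Ask for at most one byte more than the remaining budget, so an
        # oversized stream is detected without inflating the rest of it.
        want = min(chunk_size, max_output - len(out) + 1)
        piece = decomp.decompress(data, want)
        out += piece
        if len(out) > max_output:
            raise LimitExceeded(f"output exceeds {max_output} bytes")
        data = decomp.unconsumed_tail  # input not yet processed
        if not decomp.eof and not piece and not data:
            raise ValueError("truncated or corrupt stream")
    return bytes(out)
```

With a 1 MB limit, a stream that would expand to 10 MB is rejected after roughly 1 MB of output rather than after all 10 MB.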
   
   ---
   
   ### 2. Symlink & hardlink safety
   
   `exarch` prevents:
   
   * Symlinks escaping the extraction directory
   * TAR hardlinks pointing outside the root
   * Link chains used to overwrite sensitive files
   
   This is extremely easy to get wrong manually.
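
As an illustration of what a manual check involves, the sketch below flags TAR link members whose targets would land outside the extraction root. The function name `escaping_links` is hypothetical, and the check is purely lexical: it does not follow chains through links created earlier in the same archive, which is one of the ways hand-written checks fall short.

```python
import os
import tarfile


def escaping_links(tar: tarfile.TarFile, dest: str) -> list[str]:
    """Return names of symlink/hardlink members that point outside dest."""
    root = os.path.realpath(dest)
    flagged = []
    for member in tar.getmembers():
        if member.issym():
            # Symlink targets are relative to the directory holding the link.
            base = os.path.dirname(os.path.join(root, member.name))
            target = os.path.join(base, member.linkname)
        elif member.islnk():
            # Hardlink targets are relative to the archive root.
            target = os.path.join(root, member.linkname)
        else:
            continue
        resolved = os.path.normpath(target)  # collapses ".." lexically
        if os.path.commonpath([root, resolved]) != root:
            flagged.append(member.name)
    return flagged
```

Recent Python versions also ship extraction filters for `tarfile` (for example `extractall(filter="data")`), which cover part of this ground when staying with the standard library.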
   
   ---
   
   ### 3. Path normalization correctness
   
   Rust path handling ensures:
   
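   * Unicode in entry names is handled consistently (whether names are actually normalized should be confirmed against the `exarch` documentation)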
   * `..`, `.` and platform edge cases are handled consistently
   * No reliance on naive string checks
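
The difference between component-based validation and naive string checks can be shown with a short stdlib sketch. `safe_relative_path` is a hypothetical helper, deliberately conservative (it also rejects names that merely look like Windows drive paths), and it says nothing about how `exarch` does this internally.

```python
from pathlib import PurePosixPath, PureWindowsPath


def safe_relative_path(name: str) -> PurePosixPath:
    """Validate an archive entry name and return it as a clean relative path."""
    if not name or "\x00" in name:
        raise ValueError(f"invalid entry name: {name!r}")

    # Treat backslashes as separators so Windows-style names cannot sneak
    # ".." components past a POSIX-only check.
    posix = PurePosixPath(name.replace("\\", "/"))

    if posix.is_absolute() or PureWindowsPath(name).drive:
        raise ValueError(f"absolute path not allowed: {name!r}")

    parts = posix.parts  # pathlib already drops "." components
    if not parts or ".." in parts:
        raise ValueError(f"path traversal not allowed: {name!r}")

    return PurePosixPath(*parts)
```

A check such as `"../" in name` would miss `a\..\..\x` and `C:\x`, both of which the component-based version rejects.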
   
   ---
   
   ### 4. Unified policy across formats
   
   One set of limits covers:
   
   * ZIP
   * TAR
   * Compressed TAR variants
   
   You don’t need format-specific logic.
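
For comparison, getting one policy to cover several formats with the standard library means writing the per-format glue yourself. A rough sketch, with the hypothetical names `Limits`, `declared_sizes` and `check_archive`, and relying on declared sizes only:

```python
import tarfile
import zipfile
from dataclasses import dataclass


@dataclass(frozen=True)
class Limits:
    """One policy object shared by every supported format."""
    max_files: int = 10_000
    max_total_size: int = 1 << 30  # 1 GiB


def declared_sizes(path: str) -> list[int]:
    """Return the declared size of every entry in a ZIP or (compressed) TAR."""
    if zipfile.is_zipfile(path):
        with zipfile.ZipFile(path) as zf:
            return [info.file_size for info in zf.infolist()]
    # Default mode detects gzip/bz2/xz compression transparently. Listing a
    # compressed TAR still decompresses the whole stream to find the headers.
    with tarfile.open(path) as tf:
        return [member.size for member in tf]


def check_archive(path: str, limits: Limits) -> tuple[int, int]:
    """Apply the same limits regardless of format; return (count, total)."""
    sizes = declared_sizes(path)
    if len(sizes) > limits.max_files:
        raise ValueError(f"too many entries: {len(sizes)}")
    total = sum(sizes)
    if total > limits.max_total_size:
        raise ValueError(f"declared size too large: {total} bytes")
    return len(sizes), total
```

The format dispatch in `declared_sizes` is exactly the kind of code a unified library is meant to make unnecessary.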
   
   ---
   
   ## Using `exarch` in a Quart application
   
   ### Basic safe extraction
   
   ```python
   import exarch
   
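   def extract_archive(src: str, dst: str):
       """Extract `src` into `dst`, raising if any limit is exceeded."""
       # Keyword names are as given in this comment; verify them against the
       # installed exarch release before relying on them.
       return exarch.extract_archive(
           src,
           dst,
           max_files=10_000,                       # cap on number of entries
           max_total_size=1 * 1024 * 1024 * 1024,  # 1 GiB
           max_compression_ratio=100,              # uncompressed : compressed
       )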
   ```
   
   This call:
   
   * Validates archive structure
   * Enforces all limits
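   * Aborts extraction as soon as a violation is detected (files written before that point may remain in `dst`, so extract into a disposable directory)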
   * Raises an exception on violation
   
   ---
   
   ### Quart endpoint (recommended pattern)
   
   ```python
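   from quart import Quart, request, jsonify
   from werkzeug.utils import secure_filename
   import tempfile
   import os
   import asyncio
   
   app = Quart(__name__)
   
   @app.route("/upload", methods=["POST"])
   async def upload():
       files = await request.files
       uploaded = files.get("file")
   
       if not uploaded:
           return jsonify(error="file required"), 400
   
       # Never trust the client-supplied filename: strip directory components
       # so the upload cannot be written outside tmpdir. The prefix also keeps
       # the name from colliding with the "out" directory.
       name = "upload-" + secure_filename(uploaded.filename or "")
   
       with tempfile.TemporaryDirectory() as tmpdir:
           src = os.path.join(tmpdir, name)
           dst = os.path.join(tmpdir, "out")
   
           await uploaded.save(src)
           os.mkdir(dst)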
   
           try:
               result = await asyncio.to_thread(
                   extract_archive, src, dst
               )
           except Exception as e:
               return jsonify(status="rejected", reason=str(e)), 400
   
           return jsonify(
               status="accepted",
               files=result.files_extracted,
               bytes=result.bytes_written,
           )
   ```
   
   ### Why `asyncio.to_thread` matters
   
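   * Decompression is CPU-heavy and the extraction call is blocking
   * Quart’s event loop must remain responsive
   * Running the call in a worker thread keeps a slow extraction from stalling other requests (this assumes the native extension releases the GIL while it works)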
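
The effect can be demonstrated without `exarch` at all. In the sketch below, `blocking_work` is a stand-in for the extraction call (it uses `time.sleep`, which releases the GIL, as a well-behaved native extension would), and a heartbeat task counts how often the event loop gets to run in the meantime.

```python
import asyncio
import time


def blocking_work(seconds: float) -> str:
    """Stand-in for a blocking extraction call."""
    time.sleep(seconds)
    return "done"


async def main() -> tuple[str, int]:
    ticks = 0

    async def heartbeat() -> None:
        nonlocal ticks
        while True:
            await asyncio.sleep(0.01)
            ticks += 1  # only advances while the event loop is free

    hb = asyncio.create_task(heartbeat())
    # The blocking call runs in a worker thread; the loop keeps serving.
    result = await asyncio.to_thread(blocking_work, 0.2)
    hb.cancel()
    await asyncio.gather(hb, return_exceptions=True)
    return result, ticks


if __name__ == "__main__":
    print(asyncio.run(main()))
```

Calling `blocking_work(0.2)` directly inside `main` instead would leave `ticks` at zero for the duration of the call, which is what a blocked server looks like from the inside.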
   
   ---
   
   ## What you **give up** by using `exarch`
   
   ### 1. Fine-grained pre-inspection
   
   You cannot easily:
   
   * “Look but not extract”
   * Accept archive metadata without touching payload bytes
   
   `exarch` is **extract-or-fail**, not analyze-only.
   
   ---
   
   ### 2. Policy transparency
   
   Security teams sometimes want:
   
   * Explicit, auditable checks
   * Hand-written logic they can reason about line-by-line
   
   `exarch` trades some of that transparency for enforcement that is built in and shared across formats.
   
   ---
   
   ### 3. Dependency surface
   
   * Rust binary wheel
   * Platform compatibility considerations
   * Needs CI coverage across OS/architectures
   
   Usually acceptable, but relevant for constrained environments.
   
   ---
   
   ## Recommended usage patterns
   
   ### Pattern 1: Inspect → extract (hybrid)
   
   * Do a **cheap header inspection** (file count, extension allow-list)
   * Then use `exarch` for actual extraction
   
   This reduces unnecessary decompression work.
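
A possible shape for this hybrid, sketched for ZIP input only. The names `precheck_zip`, `inspect_then_extract` and the allow-list contents are hypothetical, and the extractor is passed in as a plain callable so the sketch does not depend on any particular `exarch` signature.

```python
import zipfile
from pathlib import PurePosixPath
from typing import Callable

ALLOWED_SUFFIXES = frozenset({".txt", ".md", ".json"})  # hypothetical allow-list


def precheck_zip(path, *, max_files: int = 10_000,
                 allowed=ALLOWED_SUFFIXES) -> list[str]:
    """Cheap header-only checks: entry count and file extensions."""
    with zipfile.ZipFile(path) as zf:
        names = zf.namelist()
    if len(names) > max_files:
        raise ValueError(f"too many entries: {len(names)}")
    for name in names:
        if name.endswith("/"):
            continue  # directory entry
        if PurePosixPath(name).suffix.lower() not in allowed:
            raise ValueError(f"disallowed file type: {name!r}")
    return names


def inspect_then_extract(src, dst, extract: Callable[[object, object], object]):
    """Run the cheap checks first; only then hand over to the real extractor."""
    precheck_zip(src)
    return extract(src, dst)
```

In the Quart handler, `extract` would be the `extract_archive` wrapper defined earlier, still run through `asyncio.to_thread`.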
   
   ---
   
   ### Pattern 2: Extract to quarantine
   
   * Always extract to a **temporary sandbox**
   * Only move extracted files to a trusted location after success
   
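   This limits the blast radius: a rejected or partially extracted archive never touches the trusted location, even if the limits are later found to be too permissive.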
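
One way to sketch the quarantine step, assuming the final directory does not exist yet and that the quarantine can live next to it on the same filesystem (so the final move is a single rename). `extract_via_quarantine` is a hypothetical helper and the extractor is again an injected callable.

```python
import os
import shutil
import tempfile
from typing import Callable


def extract_via_quarantine(src: str, final_dir: str,
                           extract: Callable[[str, str], object]) -> str:
    """Extract into a throwaway sibling directory; publish only on success."""
    parent = os.path.dirname(os.path.abspath(final_dir))
    quarantine = tempfile.mkdtemp(prefix=".quarantine-", dir=parent)
    try:
        extract(src, quarantine)
        # Same filesystem, so this is one rename rather than a copy.
        # final_dir is expected not to exist yet.
        os.rename(quarantine, final_dir)
    except BaseException:
        # Any failure: discard whatever was written and re-raise.
        shutil.rmtree(quarantine, ignore_errors=True)
        raise
    return final_dir
```

Note that `mkdtemp` creates the directory readable only by the current user, so permissions may need adjusting after the rename.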
   
   ---
   
   ### Pattern 3: Hard fail on any anomaly
   
   Do **not** try to “partially extract”:
   
   * Fail fast
   * Log
   * Discard archive
   
   ---
   
   ## When `exarch` is the right choice
   
   Use it if:
   
   * Users upload arbitrary archives
   * Input is hostile or untrusted
   * You support multiple archive formats
   * You want to minimize custom security code
   * You care about symlink / TAR edge cases
   
   ---
   
   ## When to avoid it
   
   Avoid or supplement it if:
   
   * You only need metadata inspection
   * You must approve archives without extracting
   * You operate in an environment where Rust wheels are unacceptable
   * You require fully deterministic, custom policy enforcement
   
   ---
   
   ## Bottom line
   
   **Stdlib inspection**
   ✔ Transparent
   ✔ Dependency-free
   ✖ Easy to miss edge cases
   
   **`exarch`**
   ✔ Safer under adversarial input
   ✔ Unified ZIP/TAR handling
   ✔ Streaming enforcement
   ✖ Less transparent
   ✖ Extract-or-fail model (no analyze-only mode)
   
   If you want, next I can:
   
   * Show a **hybrid inspect → exarch extract pipeline**
   * Map `exarch` failure modes to HTTP error codes
   * Provide **policy tuning guidelines** (realistic limits)
   * Show how to sandbox extraction with Linux namespaces
   
   Just tell me where you want to go next.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

