[ 
https://issues.apache.org/jira/browse/COMPRESS-727?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18087089#comment-18087089
 ] 

Tomas Illuminati commented on COMPRESS-727:
-------------------------------------------

Thanks for opening this, Piotr. I'd like to contribute here. I've already built 
a safe-by-default extractor as a local prototype and put it through testing, 
and I wanted to lay out the design for feedback before I send a PR.

*Proposed design*

A robust Extractor that unpacks entire archives, including links, and is secure 
by default.

The factory method Extractor.newExtractor(Path target) detects capabilities and 
returns a race-safe SecureExtractor when the filesystem provider supports 
SecureDirectoryStream. Otherwise, it provides a best-effort base Extractor. On 
OpenJDK, Files.newDirectoryStream returns a SecureDirectoryStream only on 
Linux, while macOS and Windows return a plain stream. As a result, the 
race-safe implementation is used on Linux, which is where shared-tenant, CI, 
and container race conditions are most common. Other platforms use the 
best-effort approach. It is worth stating this explicitly rather than assuming 
all Unix-like systems are covered.

_*No-follow walk by descriptor:*_ For each entry, the parent is resolved one 
component at a time, starting from a directory handle on the validated root. 
Each component is opened with NOFOLLOW, and any component that is a symlink is 
rejected. The final file is written relative to the innermost handle using 
CREATE_NEW and NOFOLLOW, so a symlink introduced during extraction cannot 
redirect the write, and existing files are not overwritten.

*_Canonical root:_* The target is canonicalized once using toRealPath(), and 
the resolved entry path is then checked against that canonical root with 
Path.startsWith, which compares whole path components rather than string 
prefixes. This prevents the /root versus /root-evil prefix trap and avoids the 
false positives resolveIn produces when the parent path is not normalized.

*_Hard links:_* Before calling createLink, the link target is re-resolved on 
disk using toRealPath(), and its containment within the root is re-verified. 
Without this step, a hard link routed through a planted symlink could escape 
the intended directory even if the target appears contained. This addresses the 
same vulnerability recently fixed in node-tar (CVE-2026-26960, version 7.5.8) 
and follows the resolve-then-re-check-containment method used by Go's 
filepath-securejoin.

*_No lexical Unicode normalization:_* The no-follow check operates at the inode 
level and does not compare file names. Because filesystems can map different 
byte strings to the same object, such as on case-insensitive volumes or with 
NFC and NFD, normalizing names ourselves would reintroduce the mismatch that 
defeats lexical safeguards in the first place (see node-tar CVE-2021-37712 and 
CVE-2026-23950). A colliding name resolves to the same symlink, which the 
no-follow check then rejects.

*Safe-by-default policies*
 - SymlinkPolicy: REJECT (default), SKIP, ALLOW_WITHIN_ROOT (creates the link 
only if the target resolves inside the root, and never follows a created link 
for later writes), ALLOW_ALL (trusted archives only).
 - SpecialFilePolicy for device/FIFO/char files: SKIP (default) or REJECT, 
never materialised on the secure path.
 - overwrite is false by default like CREATE_NEW.

*An a Important Platform note*

Java doesn't expose descriptor-relative mkdir/symlink/link (there's no 
mkdirat/symlinkat/linkat in java.nio.file), so directory, symlink and hard-link 
creation keep a small TOCTOU residual even on the secure path: the parent 
components are still verified no-follow, but the node-creation step itself is 
path-based. Regular file writes, which are the common and highest-volume case, 
are fully race-safe through CREATE_NEW + NOFOLLOW relative to the pinned 
directory handle. On macOS and Windows there's no SecureDirectoryStream at all, 
so the whole thing is best-effort with that residual documented. I think a 
plainly stated platform matrix beats implying one uniform guarantee.

*Entry points*

ArchiveInputStream, ZipFile and TarFile are covered. One detail worth flagging: 
a zip unix symlink is recognised as a link only through ZipFile, because the 
symlink bit lives in the central directory's unix mode. Fed through 
ArchiveInputStream there's no central directory to read, so the same entry 
materialises as a regular file holding the link text (no-follow, create-new) 
instead of a symlink. That's lossy but never an escape, and it's documented on 
the stream entry point. The ticket already hints at this with 
ZipFile.getUnixSymlink.

7z could go in a follow-up PR. One note on scope there, SevenZArchiveEntry 
exposes no isSymbolicLink() or getLinkName(), so a 7z symlink is only 
detectable by decoding the unix mode that p7zip packs into the high bits of 
WindowsAttributes, which is a producer convention rather than part of the 7z 
format. Today the extractor classifies 7z entries as file or directory only, so 
an undecoded 7z symlink is written as a regular file holding the target text 
under CREATE_NEW + NOFOLLOW: lossy, not an escape. That decode is 
self-contained, which is part of why I'd lean toward isolating 7z in its own PR.

*Validation*

I wrote the tests first, against a matrix that covers the zip-slip regression, 
symlink-slip write-through, an escaping symlink target, a legitimate 
within-root symlink (to confirm it isn't over-blocked), a concurrent 
dir-to-symlink swap for the TOCTOU case, hard-link escape, special files, and 
overwrite fail-closed. On top of that, mutation testing with PIT, 
coverage-guided fuzzing with Jazzer, SpotBugs with FindSecBugs, and the CodeQL 
java/zipslip query. The node-tar hard-link case above actually came out of a 
literature sweep. I reproduced it as a failing test, then fixed it at the root 
cause.

I'm glad to upload the PoC, the full test suite and Javadoc. Two things I'd 
want your read on first: whether the package location and factory shape look 
right to you (org.apache.commons.compress.archivers.extractor?), and whether 
you'd rather I split 7z into its own PR. Happy to adjust any of this.

> Provide a symlink-resistant (and concurrency-safe) archive Extractor
> --------------------------------------------------------------------
>
>                 Key: COMPRESS-727
>                 URL: https://issues.apache.org/jira/browse/COMPRESS-727
>             Project: Commons Compress
>          Issue Type: Improvement
>            Reporter: Piotr Karwasz
>            Priority: Major
>
> h2. Summary
> {{ArchiveEntry.resolveIn(Path)}} protects against classic Zip-slip by 
> resolving the entry name against the target directory, normalizing it, and 
> rejecting the result if it no longer {{startsWith}} the target. This check is 
> purely *lexical*: it manipulates path strings only. It does not look at the 
> file system, so it does not defend against *symlink* attacks, and the example 
> {{Expander}} that drives extraction does not create symlinks at all (so it is 
> both lossy for legitimate archives and silent about the risk).
> This issue proposes a first-class, security-first {{Extractor}} that safely 
> materializes a whole archive into a directory, including symlinks, and 
> resists both symlink-slip from the archive itself and concurrent symlink 
> races from third parties. The API is centered on 
> {{Extractor.newExtractor(Path)}}, which returns a {{SecureExtractor}} when 
> the platform supports {{java.nio.file.SecureDirectoryStream}}, or a 
> best-effort {{Extractor}} otherwise.
> h2. Background: what resolveIn does and does not cover
> {code:java}
> default Path resolveIn(final Path parentPath) throws IOException {
>     final Path outputFile = parentPath.resolve(getName()).normalize();
>     if (!outputFile.startsWith(parentPath)) {
>         throw new ArchiveException("Zip slip ...");
>     }
>     return outputFile;
> }
> {code}
> This stops {{../../etc/passwd}} style names. It does *not* stop:
> * A symlink *entry inside the archive* (for example {{link -> /etc}} or 
> {{link -> ../..}}) followed by a later entry written *through* that link 
> ({{link/evil}}). The lexical check on {{target/link/evil}} passes because the 
> string starts with {{target}}, but {{link}} is a symlink and the write 
> escapes. This is "symlink-slip", and it defeats the lexical guard even when 
> the target directory is fully trusted.
> * A symlink (or directory-to-symlink swap) planted *concurrently* by another 
> process between the {{resolveIn}} check and the actual 
> {{Files.newOutputStream}} call. The check and the write are not atomic with 
> respect to the intermediate path components, so this is a classic TOCTOU race.
> The metadata needed to handle links is already exposed:
> * {{TarArchiveEntry}}: {{isSymbolicLink()}}, {{isLink()}} (hard link), 
> {{getLinkName()}}, plus {{isBlockDevice()}} / {{isCharacterDevice()}} / 
> {{isFIFO()}} for special types.
> * {{ZipArchiveEntry}}: {{isUnixSymlink()}}, with 
> {{ZipFile.getUnixSymlink(entry)}} returning the link target.
> There is currently no use of {{SecureDirectoryStream}}, 
> {{LinkOption.NOFOLLOW_LINKS}}, or {{Files.createSymbolicLink}} anywhere in 
> main code, so this is greenfield.
> h2. Threat model
> # *Zip-slip* (path traversal via {{..}} or absolute names). Already handled 
> by {{resolveIn}}.
> # *Symlink-slip from the archive* (trusted target directory). The archive 
> plants a symlink and then writes through it, or supplies a symlink whose 
> target points outside the extraction root. Must be handled by an extractor 
> that controls symlink creation and never follows a link when materializing a 
> child path.
> # *Concurrent symlink race* (untrusted / shared target directory). A third 
> party with write access to the target tree swaps a path component for a 
> symlink during extraction. Defeating this requires file-descriptor-relative, 
> one-component-at-a-time resolution with no-follow semantics, that is, 
> {{SecureDirectoryStream}}.
> h2. Real-life examples of untrusted / shared target directories
> These are settings where threat 3 is realistic because another user or 
> process can write into the extraction tree while extraction runs:
> * *Shared CI / build agents*: multiple jobs extract into a common workspace 
> or temp directory on the same host; one job races another.
> * *Multi-user package-manager caches*: extracting into group- or 
> world-writable shared caches (Maven {{~/.m2}}, Gradle, pip, npm) on a shared 
> build box.
> * *Web upload processing*: an upload is extracted into a directory that other 
> worker processes or requests (possibly other tenants) can also write.
> * *Container image / layer extraction*: a layer's symlink redirects a later 
> layer's write. Container runtimes specifically defend this (for example 
> Docker/containerd use chroot-style joins, and Go's {{filepath-securejoin}} 
> exists for exactly this).
> * *Background daemons*: antivirus, indexers, or backup agents racing the 
> extractor over the same tree.
> Prior art that hardened against the same classes of bug:
> * The original Zip-slip disclosure (2018) prompted the lexical fix that 
> {{resolveIn}} implements.
> * Python's {{tarfile}} extraction filters (PEP 706, shipped in 3.12, 
> hardening the long-standing CVE-2007-4559) added {{data}} / {{tar}} / 
> {{fully_trusted}} filters that, among other things, reject symlinks and links 
> escaping the destination.
> * GNU tar performs "secure" extraction by removing an existing symlink before 
> writing through its name and by deferring symlink creation.
> * Container runtimes rely on {{filepath-securejoin}} (Go) to resolve each 
> component safely against symlink races.
> h2. Proposed API
> A supported {{Extractor}} type (promoted out of {{archivers.examples}} into, 
> for example, {{org.apache.commons.compress.archivers}}), with a 
> capability-detecting factory:
> {code:java}
> // Picks the strongest implementation the platform/filesystem allows.
> public static Extractor newExtractor(Path targetDirectory) throws IOException;
> {code}
> * On platforms where {{Files.newDirectoryStream(targetDirectory) instanceof 
> SecureDirectoryStream}} (Unix-like systems, see 
> {{sun.nio.fs.UnixSecureDirectoryStream}}), {{newExtractor}} returns a 
> *{{SecureExtractor}}* that is resistant to both symlink-slip and concurrent 
> symlink races.
> * Otherwise (notably Windows, or filesystems whose provider does not return a 
> secure stream) it returns a base {{Extractor}} that defends symlink-slip from 
> the archive on a best-effort basis but documents that it cannot fully close 
> the TOCTOU race.
> The {{Extractor}} mirrors the entry points of today's {{Expander}} 
> ({{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, {{SevenZFile}}) and adds 
> configuration:
> {code:java}
> Extractor extractor = Extractor.newExtractor(targetDir)
>         .setSymlinkPolicy(SymlinkPolicy.ALLOW_WITHIN_ROOT) // default REJECT
>         .setSpecialFilePolicy(SpecialFilePolicy.SKIP)      // devices, FIFOs
>         .setOverwrite(false);
> extractor.extract(archiveInputStream);
> {code}
> h3. SecureExtractor design
> Walk each entry's parent path *component by component* starting from an open 
> directory handle on the (validated) extraction root:
> * For each intermediate component, open the subdirectory via 
> {{SecureDirectoryStream.newDirectoryStream(name, 
> LinkOption.NOFOLLOW_LINKS)}}. If a component is a symlink, refuse (it cannot 
> be a legitimate parent under no-follow).
> * Create files relative to the innermost directory handle with no-follow / 
> create-new semantics ({{newByteChannel}} with {{NOFOLLOW_LINKS}} + 
> {{CREATE_NEW}}), so a concurrently planted symlink cannot redirect the write.
> * Create directories and symlinks relative to the directory handle, never via 
> an absolute resolved path.
> Because every step is relative to a file descriptor rather than a re-resolved 
> string path, a third party cannot win the race by swapping a parent component.
> h3. Symlink policy
> ||Policy||Behavior||
> |{{REJECT}} (default)|Fail if the archive contains any symlink entry.|
> |{{SKIP}}|Silently ignore symlink entries.|
> |{{ALLOW_WITHIN_ROOT}}|Create a symlink only if its target resolves to a 
> location inside the extraction root; reject links escaping the root. Never 
> follow the new link when writing later entries.|
> |{{ALLOW_ALL}}|Create symlinks verbatim (only for fully trusted archives).|
> h3. Entry-type handling
> ||Entry type||Default action||
> |Regular file|Extract (no-follow, create-new unless overwrite enabled).|
> |Directory|Create (no-follow).|
> |Symbolic link|Per {{SymlinkPolicy}} (default {{REJECT}}).|
> |Hard link (tar {{isLink()}})|Resolve link target within root, no-follow; 
> reject if it escapes.|
> |Device / FIFO / other special|Per {{SpecialFilePolicy}} (default {{SKIP}}; 
> never created by the secure path).|
> h2. Platform support and degradation
> * {{SecureDirectoryStream}} is available on Unix-like platforms via the 
> default file system provider; {{newExtractor}} detects it with an 
> {{instanceof}} check on a directory stream opened over the target.
> * On platforms without it, the base {{Extractor}} still validates each parent 
> component with {{LinkOption.NOFOLLOW_LINKS}} (via {{Files.readAttributes}}) 
> and refuses to traverse existing symlinks, which closes symlink-slip from the 
> archive but remains non-atomic against a concurrent attacker. This residual 
> limitation must be documented clearly.
> h2. Acceptance criteria
> * New {{Extractor}} (supported, not under {{examples}}) with 
> {{newExtractor(Path)}} returning {{SecureExtractor}} when 
> {{SecureDirectoryStream}} is available, else a best-effort {{Extractor}}.
> * Configurable {{SymlinkPolicy}} (default {{REJECT}}) and special-file 
> handling; safe hard-link resolution.
> * {{SecureExtractor}} performs descriptor-relative, component-by-component, 
> no-follow resolution and create-new writes.
> * Entry points covering {{ArchiveInputStream}}, {{ZipFile}}, {{TarFile}}, and 
> {{SevenZFile}}, matching the existing {{Expander}} surface.
> * Tests: symlink-slip archive (link then write-through) is blocked under a 
> trusted target; an escaping symlink target is rejected; a concurrent 
> directory-to-symlink swap is defeated by {{SecureExtractor}} (and documented 
> as not fully defeated by the base extractor); legitimate within-root symlinks 
> extract correctly when allowed.
> * Javadoc plus a "safe extraction" section in the user guide, 
> cross-referencing {{ArchiveEntry.resolveIn}} and the platform caveats.
> * Consider deprecating / redirecting {{Expander}} toward {{Extractor}}.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

Reply via email to