alitheg commented on PR #624:
URL: https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3842283394
Here's a survey of `puremagic`'s results for each of the example files.
# puremagic detection results - issue #553
Run against real files from `downloads.apache.org` using the paths from dave2wave's
suffix survey. Only the first 64 KB of each file is fetched (HTTP Range request);
detection uses `puremagic.magic_file()`, the same call site as `atr/detection.py`.
`.tmp` was dropped from the survey: only 2 instances ever existed on the mirror
(both under `zzz/`) and both have since been removed.
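
The fetch-and-detect step described above can be sketched roughly as follows. This is not the script that produced the survey: the helper names (`build_range_request`, `fetch_head`, `detect_mime_types`) and the temp-file handling are assumptions for illustration. Only `puremagic.magic_file()` and the 64 KB limit come from the description.

```python
import tempfile
import urllib.request
from pathlib import Path

HEAD_BYTES = 64 * 1024  # the survey fetched only the first 64 KB of each file


def build_range_request(url: str, limit: int = HEAD_BYTES) -> urllib.request.Request:
    """Build a GET request asking for only the first `limit` bytes."""
    # HTTP byte ranges are inclusive, so 64 KB is bytes=0-65535.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{limit - 1}"})


def fetch_head(url: str, limit: int = HEAD_BYTES) -> bytes:
    """Download at most `limit` bytes from the start of `url`."""
    with urllib.request.urlopen(build_range_request(url, limit), timeout=30) as resp:
        # A server may ignore Range and send the whole file, so cap the read.
        return resp.read(limit)


def detect_mime_types(data: bytes, filename: str) -> set[str]:
    """Write `data` to a temp file named like the original and run puremagic."""
    import puremagic  # third-party dependency: pip install puremagic

    if not data:
        return set()  # nothing to inspect
    with tempfile.TemporaryDirectory() as tmp:
        path = Path(tmp) / Path(filename).name
        path.write_bytes(data)
        # magic_file() returns candidate matches; keep the non-empty MIME types.
        return {m.mime_type for m in puremagic.magic_file(path) if m.mime_type}
```

Capping the read matters because a server that does not honour `Range` replies with the full body, which would silently turn a 64 KB probe into a full download.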
**84 file types tested.**
| OK | MISMATCH | NOT DETECTED |
|----|----------|--------------|
| 35 | 11 | 38 |
---
| Extension | Status | Detected by puremagic | Expected | Notes |
|-----------|--------|-----------------------|----------|-------|
| `.2` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.4` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.6` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.66` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.7` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.7z` | OK | application/x-7z-compressed | application/x-7z-compressed | |
| `.adoc` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.apk` | MISMATCH | application/x-gzip | application/zip | This particular APK is a gzip stream, not a standard ZIP-based APK |
| `.asc` | OK | application/pgp-signature | application/pgp-signature | |
| `.bin` | MISMATCH | application/zip (+20 subtypes) | application/octet-stream | The langdetect `.bin` is actually a ZIP; expected set assumed opaque binary |
| `.changes` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.crate` | OK | application/x-gzip | application/x-gzip | Rust crates are `.tar.gz` |
| `.css` | NOT DETECTED | - | text/css | Plain text, no magic bytes |
| `.deb` | OK | application/vnd.debian.binary-package, application/x-archive | application/vnd.debian.binary-package, application/x-archive | |
| `.dmg` | NOT DETECTED | - | application/x-apple-diskimage | puremagic does not recognise this format |
| `.exe` | OK | application/octet-stream, application/vnd.microsoft.portable-executable | application/vnd.microsoft.portable-executable | |
| `.far` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.gem` | MISMATCH | application/x-tar | application/x-gzip | Ruby gems are plain tar, not gzipped tar - expected set was wrong |
| `.gif` | OK | image/gif | image/gif | |
| `.gpg` | NOT DETECTED | - | application/pgp-encrypted | Binary PGP; puremagic does not recognise the signature |
| `.html` | OK | text/html | text/html | |
| `.ico` | MISMATCH | image/png | image/x-icon | The `favicon.ico` on downloads.apache.org is actually a PNG file |
| `.img` | NOT DETECTED | - | application/octet-stream | Raw binary, no magic bytes |
| `.index` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.jar` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.json` | OK | application/json | application/json | |
| `.KEYS` | MISMATCH | audio/x-ms-asx | text/plain | False positive - file is plain-text PGP public keys; puremagic is tripped up by content |
| `.list` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.mar` | MISMATCH | application/zip (+20 subtypes) | application/octet-stream | The sling `.mar` is a ZIP; expected set assumed opaque binary |
| `.md` | MISMATCH | application/xml, text/html | text/plain | This particular README.md opens with XML/HTML-like content |
| `.MD5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.md5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.mds` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.msi` | OK | application/x-ole-storage (+ other OLE types) | application/x-ole-storage | |
| `.nar` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.nupkg` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.old` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.pack.gz` | OK | application/x-gzip | application/x-gzip | |
| `.patch` | MISMATCH | text/x-patch | text/x-diff | Correct detection; just needs `text/x-patch` added to the expected set |
| `.pdf` | OK | application/pdf | application/pdf | |
| `.pem` | MISMATCH | application/x-pem-file | text/plain | puremagic correctly identifies PEM; the expected set should be updated, not the detection |
| `.pl` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.PL` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.pm` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.png` | OK | image/png | image/png | |
| `.pom` | OK | application/xml | application/xml | |
| `.prov` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.ps1` | OK | text/plain | text/plain | |
| `.py` | OK | text/x-python | text/x-python | |
| `.rar` | MISMATCH | application/zip (+20 subtypes) | application/x-rar-compressed | The jackrabbit `.rar` is actually a ZIP, not a real RAR archive |
| `.readme` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.repo` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.repositories` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.rpm` | OK | application/x-rpm | application/x-rpm | |
| `.sh` | NOT DETECTED | - | application/x-sh | Plain text, no magic bytes |
| `.sh1` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sha` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sha1` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sha256` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.SHA256` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.SHA512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sha512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sha512sum` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.sig` | NOT DETECTED | - | application/pgp-signature | Binary PGP; puremagic does not recognise the signature |
| `.slingosgifeature` | MISMATCH | application/json | application/zip | Only 64 KB fetched; file likely starts with a JSON manifest before the ZIP payload |
| `.snupkg` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.sqlite.bz2` | OK | application/x-bzip2 | application/x-bzip2 | |
| `.taco` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.tar.bz2` | OK | application/x-bzip2 | application/x-bzip2 | |
| `.tar.gz` | OK | application/x-gzip | application/x-gzip | |
| `.tar.xz` | OK | application/x-xz | application/x-xz | |
| `.temp` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.tgz` | OK | application/x-gzip | application/x-gzip | |
| `.txt` | OK | text/plain | text/plain | |
| `.vsix` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.war` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.whl` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
| `.x` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
| `.xml` | OK | application/xml | application/xml | |
| `.xsd` | OK | application/xml | application/xml | |
| `.yaml` | OK | application/x-yaml | application/x-yaml | |
| `.zip` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
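
The Status column above appears consistent with a simple overlap rule: no detection at all is NOT DETECTED, any shared MIME type between the detected and expected sets is OK, and anything else is a MISMATCH. The survey script itself is not shown, so the `classify` function below is an assumed reconstruction, not the actual logic.

```python
def classify(detected: set[str], expected: set[str]) -> str:
    """Assign a survey status from detected and expected MIME type sets."""
    if not detected:
        return "NOT DETECTED"  # puremagic returned nothing
    if detected & expected:
        return "OK"  # at least one detected type was expected
    return "MISMATCH"  # something was detected, but nothing expected


# Rows taken from the table above.
print(classify(set(), {"text/plain"}))                      # .sha512 row
print(classify({"application/x-tar"}, {"application/x-gzip"}))  # .gem row
print(classify(
    {"application/octet-stream", "application/vnd.microsoft.portable-executable"},
    {"application/vnd.microsoft.portable-executable"},
))                                                          # .exe row
```

Under this rule an extra, less specific detection such as `application/octet-stream` on the `.exe` row does not count against the result, which matches how that row is reported.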
---
## Summary of issues
### NOT DETECTED (38) - all expected
Plain-text files have no magic bytes, so most types that are really just text
come back empty. This covers all checksum variants (`.sha1`, `.sha256`, `.sha512`,
`.md5`, `.mds`, `.sha`, `.SHA256`, `.SHA512`, `.sha512sum`, `.sh1`, `.512`, `.MD5`),
the plain-text code / config / doc extensions (`.css`, `.sh`, `.pl`, `.PL`, `.pm`,
`.adoc`, `.readme`, `.repo`, `.list`, `.changes`, `.index`, `.prov`, `.old`,
`.repositories`, `.x`, `.temp`), and the numeric changelog suffixes (`.2`, `.4`,
`.5`, `.6`, `.66`, `.7`). `.KEYS` and `.pem` are also text, but they do produce a
detection, so they are addressed under MISMATCH.
`.dmg`, `.gpg`, `.sig`, and `.img` are also not detected, but for a different
reason: puremagic does not recognise `.dmg`, `.gpg`, or `.sig` (it has no matching
signature in its database), and `.img` is raw binary with no magic bytes.
For all of these the only viable strategy is extension-based classification (or,
for checksums, a content-pattern check on the text).
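
As an illustration of the content-pattern check mentioned above for checksum files, here is a minimal sketch. It only covers the `md5sum` / `sha256sum`-style layout (a hex digest, optionally followed by a file name); other layouts that exist in the wild would need extra patterns. The function name `looks_like_checksum` is hypothetical.

```python
import re

# Hex digest lengths: MD5 = 32, SHA-1 = 40, SHA-256 = 64, SHA-512 = 128.
_DIGEST = r"[0-9a-fA-F]{32}|[0-9a-fA-F]{40}|[0-9a-fA-F]{64}|[0-9a-fA-F]{128}"

# A digest alone, or a digest followed by whitespace, an optional '*'
# (binary-mode marker) and a file name.
_CHECKSUM_LINE = re.compile(rf"^(?:{_DIGEST})(?:\s+\*?\S.*)?$")


def looks_like_checksum(text: str) -> bool:
    """Return True if every non-blank line looks like a checksum entry."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return bool(lines) and all(_CHECKSUM_LINE.match(line) for line in lines)


print(looks_like_checksum("d41d8cd98f00b204e9800998ecf8427e  empty.txt\n"))
print(looks_like_checksum("This is a README, not a checksum.\n"))
```

Requiring every non-blank line to match keeps ordinary text that merely contains a hex string from being classified as a checksum file.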
### MISMATCH (11) - broken down by action needed
**Expected value (from ChatGPT) is wrong, detection is fine:**
* `.gem` - Ruby gems are plain `.tar`, not `.tar.gz`. Expected set should be
`application/x-tar`.
* `.patch` - puremagic returns `text/x-patch`; expected was `text/x-diff`.
* `.pem` - puremagic correctly returns `application/x-pem-file`; expected set had
`text/plain`.
**The example file on the mirror is not what the extension implies:**
* `.apk` - this specific file is a gzip stream, not a ZIP-based APK. A different
APK from a different project might detect correctly.
* `.bin` - the langdetect `.bin` is actually a ZIP. `.bin` is intentionally
opaque; puremagic is not wrong, the file just happens to be a ZIP.
* `.ico` - `favicon.ico` on downloads.apache.org is a PNG. A real ICO file would
detect as `image/x-icon`.
* `.mar` - the sling `.mar` is a ZIP. Expected assumed opaque binary.
* `.rar` - the jackrabbit `.rar` is actually a ZIP, not a RAR archive.
**Likely a 64 KB truncation artefact:**
* `.slingosgifeature` - detected as `application/json`. The file probably opens
with a JSON manifest; the ZIP payload starts later. A full download would likely
detect the ZIP signature.
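
One way to test the truncation hypothesis on a full download, without relying on a signature at offset 0, is the standard library's `zipfile.is_zipfile()`, which looks for the ZIP end-of-central-directory record near the end of the data. The sketch below uses a synthetic file (JSON text followed by a small ZIP) as a stand-in; the real `.slingosgifeature` file has not been inspected here, and the helper name `has_zip_end_record` is hypothetical.

```python
import io
import json
import zipfile


def has_zip_end_record(data: bytes) -> bool:
    """True if `data` contains a readable ZIP end-of-central-directory record."""
    return zipfile.is_zipfile(io.BytesIO(data))


# Synthetic stand-in for the hypothesised layout: JSON text, then a ZIP payload.
prefix = json.dumps({"id": "hypothetical:feature:1.0", "bundles": []}).encode().ljust(256)
buf = io.BytesIO()
with zipfile.ZipFile(buf, "w") as zf:
    zf.writestr("payload.txt", "example")
full = prefix + buf.getvalue()

print(has_zip_end_record(full))        # True: the end record is present
print(has_zip_end_record(full[:128]))  # False: a truncated head has no end record
```

If the full file fails this check as well, the simpler explanation would be that the file really is JSON and the expected set needs updating instead.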
**puremagic false positives:**
* `.KEYS` - detected as `audio/x-ms-asx`. The file is plain-text PGP public keys.
puremagic is matching something in the key material against the ASX signature
bytes. Needs extension-based fallback.
* `.md` - detected as `application/xml, text/html`. This particular README.md
opens with XML / HTML-like markup. A different `.md` file would likely return
empty (same as other plain-text types).
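
The extension-based fallback suggested for `.KEYS` could look something like the sketch below. Both lookup tables and the function name `resolve_mime_types` are hypothetical; the entries are examples drawn from this survey, not a proposed final mapping for `atr/detection.py`.

```python
from pathlib import PurePosixPath

# Extensions where content sniffing is known to mislead: the extension wins.
EXTENSION_OVERRIDES = {".keys": "text/plain"}

# Extensions used only when content sniffing returns nothing.
EXTENSION_FALLBACKS = {
    ".sha512": "text/plain",
    ".md": "text/plain",
    ".dmg": "application/x-apple-diskimage",
}


def resolve_mime_types(filename: str, detected: set[str]) -> set[str]:
    """Combine content-based detection with extension-based rules."""
    suffix = PurePosixPath(filename).suffix.lower()  # folds .SHA512 and .sha512
    if suffix in EXTENSION_OVERRIDES:
        return {EXTENSION_OVERRIDES[suffix]}
    if detected:
        return set(detected)
    if suffix in EXTENSION_FALLBACKS:
        return {EXTENSION_FALLBACKS[suffix]}
    return set()


print(resolve_mime_types("project.KEYS", {"audio/x-ms-asx"}))
print(resolve_mime_types("release.tar.gz.SHA512", set()))
```

Keeping overrides separate from fallbacks means a trusted content detection (for example a `.bin` that is really a ZIP) is never hidden by the extension, while known false positives are still corrected.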