alitheg commented on PR #624:
URL: https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3842283394

   Here's a survey of `puremagic`'s results for each of the example files. 
   
   # puremagic detection results - issue #553
   
   Run against real files from `downloads.apache.org` using the paths from
   dave2wave's suffix survey.  Only the first 64 KB of each file is fetched
   (HTTP Range request); detection uses `puremagic.magic_file()`, the same
   call used in `atr/detection.py`.
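
The fetch-and-detect step described above can be sketched roughly as follows. This is not the script used for the survey: the helper names (`build_range_request`, `fetch_prefix`, `detect_prefix`) and the temp-file approach are assumptions made for illustration. The only puremagic call used is `puremagic.magic_file()`, which is imported lazily so the HTTP helpers work without it installed.

```python
import tempfile
import urllib.request

PREFIX_BYTES = 64 * 1024  # 64 KB, the prefix size used in the survey


def build_range_request(url: str, size: int = PREFIX_BYTES) -> urllib.request.Request:
    """Build a GET request that asks only for the first `size` bytes."""
    # HTTP byte ranges are inclusive at both ends, so 64 KB is bytes 0..65535.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{size - 1}"})


def fetch_prefix(url: str, size: int = PREFIX_BYTES) -> bytes:
    """Fetch the prefix; read(size) caps the body if the server ignores Range."""
    with urllib.request.urlopen(build_range_request(url, size), timeout=30) as resp:
        return resp.read(size)


def detect_prefix(data: bytes, suffix: str = "") -> list[str]:
    """Write the prefix to a temp file and return puremagic's MIME candidates."""
    import puremagic  # third-party dependency, imported only when needed

    if not data:
        return []  # avoid handing an empty file to magic_file()
    # Hypothetical approach: magic_file() takes a path, so spill to a temp file.
    with tempfile.NamedTemporaryFile(suffix=suffix) as tmp:
        tmp.write(data)
        tmp.flush()
        return [m.mime_type for m in puremagic.magic_file(tmp.name) if m.mime_type]
```

Passing the original suffix to the temp file matters if the detector also weighs the file extension, so an empty `suffix` is the stricter, content-only test.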
   
   `.tmp` was dropped from the survey: only 2 instances ever existed on the
   mirror (both under `zzz/`) and both have since been removed.
   
   **84 file types tested.**
   
   | OK | MISMATCH | NOT DETECTED |
   |----|----------|--------------|
   | 35 | 11       | 38           |
   
   ---
   
   | Extension | Status | Detected by puremagic | Expected | Notes |
   |-----------|--------|-----------------------|----------|-------|
   | `.2` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.4` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.6` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.66` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.7` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.7z` | OK | application/x-7z-compressed | application/x-7z-compressed | |
   | `.adoc` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.apk` | MISMATCH | application/x-gzip | application/zip | This particular APK is a gzip stream, not a standard ZIP-based APK |
   | `.asc` | OK | application/pgp-signature | application/pgp-signature | |
   | `.bin` | MISMATCH | application/zip (+20 subtypes) | application/octet-stream | The langdetect `.bin` is actually a ZIP; expected set assumed opaque binary |
   | `.changes` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.crate` | OK | application/x-gzip | application/x-gzip | Rust crates are `.tar.gz` |
   | `.css` | NOT DETECTED | - | text/css | Plain text, no magic bytes |
   | `.deb` | OK | application/vnd.debian.binary-package, application/x-archive | application/vnd.debian.binary-package, application/x-archive | |
   | `.dmg` | NOT DETECTED | - | application/x-apple-diskimage | puremagic does not recognise this format |
   | `.exe` | OK | application/octet-stream, application/vnd.microsoft.portable-executable | application/vnd.microsoft.portable-executable | |
   | `.far` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.gem` | MISMATCH | application/x-tar | application/x-gzip | Ruby gems are plain tar, not gzipped tar - expected set was wrong |
   | `.gif` | OK | image/gif | image/gif | |
   | `.gpg` | NOT DETECTED | - | application/pgp-encrypted | Binary PGP; puremagic does not recognise the signature |
   | `.html` | OK | text/html | text/html | |
   | `.ico` | MISMATCH | image/png | image/x-icon | The `favicon.ico` on downloads.apache.org is actually a PNG file |
   | `.img` | NOT DETECTED | - | application/octet-stream | Raw binary, no magic bytes |
   | `.index` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.jar` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.json` | OK | application/json | application/json | |
   | `.KEYS` | MISMATCH | audio/x-ms-asx | text/plain | False positive - file is plain-text PGP public keys; puremagic is tripped up by content |
   | `.list` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.mar` | MISMATCH | application/zip (+20 subtypes) | application/octet-stream | The sling `.mar` is a ZIP; expected set assumed opaque binary |
   | `.md` | MISMATCH | application/xml, text/html | text/plain | This particular README.md opens with XML/HTML-like content |
   | `.MD5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.md5` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.mds` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.msi` | OK | application/x-ole-storage (+ other OLE types) | application/x-ole-storage | |
   | `.nar` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.nupkg` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.old` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.pack.gz` | OK | application/x-gzip | application/x-gzip | |
   | `.patch` | MISMATCH | text/x-patch | text/x-diff | Correct detection; just needs `text/x-patch` added to the expected set |
   | `.pdf` | OK | application/pdf | application/pdf | |
   | `.pem` | MISMATCH | application/x-pem-file | text/plain | puremagic correctly identifies PEM; the expected set should be updated, not the detection |
   | `.pl` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.PL` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.pm` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.png` | OK | image/png | image/png | |
   | `.pom` | OK | application/xml | application/xml | |
   | `.prov` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.ps1` | OK | text/plain | text/plain | |
   | `.py` | OK | text/x-python | text/x-python | |
   | `.rar` | MISMATCH | application/zip (+20 subtypes) | application/x-rar-compressed | The jackrabbit `.rar` is actually a ZIP, not a real RAR archive |
   | `.readme` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.repo` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.repositories` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.rpm` | OK | application/x-rpm | application/x-rpm | |
   | `.sh` | NOT DETECTED | - | application/x-sh | Plain text, no magic bytes |
   | `.sh1` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sha` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sha1` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sha256` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.SHA256` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.SHA512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sha512` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sha512sum` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.sig` | NOT DETECTED | - | application/pgp-signature | Binary PGP; puremagic does not recognise the signature |
   | `.slingosgifeature` | MISMATCH | application/json | application/zip | Only 64 KB fetched; file likely starts with a JSON manifest before the ZIP payload |
   | `.snupkg` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.sqlite.bz2` | OK | application/x-bzip2 | application/x-bzip2 | |
   | `.taco` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.tar.bz2` | OK | application/x-bzip2 | application/x-bzip2 | |
   | `.tar.gz` | OK | application/x-gzip | application/x-gzip | |
   | `.tar.xz` | OK | application/x-xz | application/x-xz | |
   | `.temp` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.tgz` | OK | application/x-gzip | application/x-gzip | |
   | `.txt` | OK | text/plain | text/plain | |
   | `.vsix` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.war` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.whl` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
   | `.x` | NOT DETECTED | - | text/plain | Plain text, no magic bytes |
   | `.xml` | OK | application/xml | application/xml | |
   | `.xsd` | OK | application/xml | application/xml | |
   | `.yaml` | OK | application/x-yaml | application/x-yaml | |
   | `.zip` | OK | application/java-archive, application/zip (+20 subtypes) | application/java-archive, application/zip | |
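
The three statuses in the table are consistent with a simple set comparison between the detected and expected MIME types. The sketch below is an assumption about how the statuses were derived, not the survey's actual code; `classify` is a hypothetical name.

```python
def classify(detected: set[str], expected: set[str]) -> str:
    """Compare detected MIME types with the expected set for one file."""
    if not detected:
        return "NOT DETECTED"  # the detector returned nothing at all
    if detected & expected:
        return "OK"  # at least one detected type is in the expected set
    return "MISMATCH"  # something was detected, but nothing expected
```

Under this rule `.exe` counts as OK even though `application/octet-stream` is also returned, because one of the detected types is in the expected set.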
   
   ---
   
   ## Summary of issues
   
   ### NOT DETECTED (38) - all expected
   
   Plain-text files have no magic bytes, so every type that is really just
   text comes back empty.  This covers all checksum variants (`.sha1`,
   `.sha256`, `.sha512`, `.md5`, `.mds`, `.sha`, `.SHA256`, `.SHA512`,
   `.sha512sum`, `.sh1`, `.512`, `.MD5`), the plain-text code / config / doc
   extensions (`.css`, `.sh`, `.pl`, `.PL`, `.pm`, `.adoc`, `.readme`, `.repo`,
   `.list`, `.changes`, `.index`, `.prov`, `.old`, `.repositories`, `.x`,
   `.temp`), and the numeric changelog suffixes (`.2`, `.4`, `.5`, `.6`, `.66`,
   `.7`).  `.KEYS` and `.pem` are also plain text but did return a result, so
   they are addressed under MISMATCH.
   
   `.dmg`, `.gpg`, `.sig`, and `.img` are also not detected, but for a
   different reason: puremagic simply does not have signatures for those
   formats in its database.

   For all of these the only viable strategy is extension-based classification
   (or, for checksums, a content-pattern check on the text).
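
A content-pattern check for checksum files could look roughly like the sketch below. It assumes the digest appears as one unbroken hex run (the `sha256sum`-style layout); files that split the digest into space-separated groups would need extra normalisation. `guess_checksum_algorithm` and the length table are hypothetical names, not part of `atr`.

```python
import re
from typing import Optional

# Hex digest lengths for the common algorithms.
_DIGEST_LENGTHS = {32: "md5", 40: "sha1", 64: "sha256", 128: "sha512"}

# A run of 32..128 hex characters that is not part of a longer hex run.
_HEX_RUN = re.compile(r"(?<![0-9A-Fa-f])[0-9A-Fa-f]{32,128}(?![0-9A-Fa-f])")


def guess_checksum_algorithm(text: str) -> Optional[str]:
    """Return the algorithm implied by the first digest-length hex run, if any."""
    for match in _HEX_RUN.finditer(text):
        algorithm = _DIGEST_LENGTHS.get(len(match.group()))
        if algorithm is not None:
            return algorithm
    return None  # no hex run with a recognised digest length
```

A match only shows that the text contains something shaped like a digest; it does not verify the digest against any file.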
   
   ### MISMATCH (11) - broken down by action needed
   
   **Expected set (from ChatGPT) is wrong; detection is fine:**

   * `.gem` - Ruby gems are plain `.tar`, not `.tar.gz`.  Expected set should be
     `application/x-tar`.
   * `.patch` - puremagic returns `text/x-patch`; expected was `text/x-diff`.
   * `.pem` - puremagic correctly returns `application/x-pem-file`; expected set
     had `text/plain`.
   
   **The example file on the mirror is not what the extension implies:**
   
   * `.apk` - this specific file is a gzip stream, not a ZIP-based APK.  A
     different APK from a different project might detect correctly.
   * `.bin` - the langdetect `.bin` is actually a ZIP.  `.bin` is intentionally
     opaque; puremagic is not wrong, the file just happens to be a ZIP.
   * `.ico` - `favicon.ico` on downloads.apache.org is a PNG.  A real ICO file
     would detect as `image/x-icon`.
   * `.mar` - the sling `.mar` is a ZIP.  Expected assumed opaque binary.
   * `.rar` - the jackrabbit `.rar` is actually a ZIP, not a RAR archive.
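
Cases like these can be confirmed by hand by looking at the first few bytes of the file. The sketch below is independent of puremagic and covers only the formats mentioned in this group; `sniff` and the `SIGNATURES` table are hypothetical helpers, and the MIME names follow the ones used in the table.

```python
from typing import Optional

# Leading byte signatures for the formats discussed above.
SIGNATURES = [
    (b"\x89PNG\r\n\x1a\n", "image/png"),
    (b"PK\x03\x04", "application/zip"),  # ZIP local file header
    (b"\x1f\x8b", "application/x-gzip"),
    (b"Rar!\x1a\x07", "application/x-rar-compressed"),
    (b"\x00\x00\x01\x00", "image/x-icon"),  # ICO: reserved=0, type=1
]


def sniff(prefix: bytes) -> Optional[str]:
    """Return the MIME type whose signature the data starts with, if any."""
    for signature, mime in SIGNATURES:
        if prefix.startswith(signature):
            return mime
    return None
```

A `favicon.ico` whose first eight bytes match the PNG signature is a PNG regardless of its name, which is the situation the `.ico` row describes.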
   
   **Likely a 64 KB truncation artefact:**
   
   * `.slingosgifeature` - detected as `application/json`.  The file probably
     opens with a JSON manifest; the ZIP payload starts later.  A full download
     would likely detect the ZIP signature.
   
   **puremagic false positives:**

   * `.KEYS` - detected as `audio/x-ms-asx`.  The file is plain-text PGP public
     keys.  puremagic is matching something in the key material against the ASX
     signature bytes.  Needs extension-based fallback.
   * `.md` - detected as `application/xml, text/html`.  This particular
     README.md opens with XML / HTML-like markup.  A different `.md` file would
     likely return empty (same as other plain-text types).
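
One way to express the extension-based fallback is an override table consulted before the content-based result. This is a minimal sketch under stated assumptions: `EXTENSION_OVERRIDES` and `resolve_mime` are hypothetical names, and the two entries simply mirror the expected values from the table above.

```python
import os
from typing import Optional

# Hypothetical overrides for extensions where content detection misfires.
EXTENSION_OVERRIDES = {
    ".keys": "text/plain",
    ".md": "text/plain",
}


def resolve_mime(filename: str, detected: list[str]) -> Optional[str]:
    """Prefer an extension override; otherwise use the first detected type."""
    name = os.path.basename(filename).lower()
    extension = os.path.splitext(name)[1]
    override = EXTENSION_OVERRIDES.get(extension)
    if override is not None:
        return override
    return detected[0] if detected else None
```

The trade-off is that an override trusts the file name completely, so it should probably be limited to types where content detection is known to be unreliable.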
   



