alitheg commented on PR #624:
URL:
https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3842509600
Here's the same for magika:
# magika detection results - issue #553
Same files and 64 KB Range-request downloads as the puremagic run.
Detection uses `Magika.identify_path()`. Magika is an ML-based classifier
(Google), so it returns a single best-guess result rather than a set of
candidates.
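
For reference, a minimal sketch of the 64 KB Range-request download, using only the standard library. The helper names (`build_head_request`, `fetch_head`) and the 30-second timeout are illustrative assumptions, not the script that produced the results here.

```python
import urllib.request

HEAD_BYTES = 64 * 1024  # 64 KB, the amount downloaded per file in this run


def build_head_request(url: str, size: int = HEAD_BYTES) -> urllib.request.Request:
    """Build a GET request that asks only for the first `size` bytes."""
    # Range offsets are inclusive on both ends, so the last index is size - 1.
    return urllib.request.Request(url, headers={"Range": f"bytes=0-{size - 1}"})


def fetch_head(url: str, size: int = HEAD_BYTES) -> bytes:
    """Download at most `size` bytes from `url`."""
    with urllib.request.urlopen(build_head_request(url, size), timeout=30) as resp:
        # A server may ignore Range and send the whole file; read(size) caps it.
        return resp.read(size)
```

The returned bytes can be written to a temporary file and passed to `Magika.identify_path()`.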
**84 file types tested.**
| OK | MISMATCH |
|----|----------|
| 54 | 30 |
---
| Extension | Status | Magika MIME | Magika label / description | Expected | Notes |
|-----------|--------|-------------|----------------------------|----------|-------|
| `.2` | OK | text/plain | txt - Generic text document | text/plain | |
| `.4` | OK | text/plain | txt - Generic text document | text/plain | |
| `.5` | MISMATCH | text/x-c | c - C source | text/plain | File is `svn_version.h.dist`; detection is correct, expected set too narrow |
| `.512` | OK | text/plain | txt - Generic text document | text/plain | |
| `.6` | OK | text/plain | txt - Generic text document | text/plain | |
| `.66` | OK | text/plain | txt - Generic text document | text/plain | |
| `.7` | OK | text/plain | txt - Generic text document | text/plain | |
| `.7z` | OK | application/x-7z-compressed | sevenzip - 7-zip archive data | application/x-7z-compressed | |
| `.adoc` | OK | text/plain | txt - Generic text document | text/plain | |
| `.apk` | MISMATCH | application/gzip | gzip - gzip compressed data | application/zip | This APK is a gzip stream; same file, same result as puremagic |
| `.asc` | MISMATCH | application/x-pem-file | pem - PEM certificate | application/pgp-signature | PGP ASCII-armored and PEM share the `-----BEGIN` wrapper; magika confuses them |
| `.bin` | MISMATCH | application/zip | zip - Zip archive data | application/octet-stream | The langdetect .bin is actually a ZIP; same as puremagic |
| `.changes` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | False positive; a Debian changelog classified as PEM |
| `.crate` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | `application/gzip` is the IANA-registered form of `application/x-gzip`; equivalent |
| `.css` | OK | text/css | css - CSS source | text/css | |
| `.deb` | OK | application/vnd.debian.binary-package | deb - Debian binary package | application/vnd.debian.binary-package | |
| `.dmg` | MISMATCH | application/zlib | zlibstream - zlib compressed data | application/x-apple-diskimage | Detected the compression layer, not the container format |
| `.exe` | MISMATCH | application/x-dosexec | pebin - PE Windows executable | application/vnd.microsoft.portable-executable | `application/x-dosexec` is the common MIME for PE; detection is correct |
| `.far` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
| `.gem` | MISMATCH | application/x-tar | tar - POSIX tar archive | application/x-gzip | Gems are plain tar; expected set is wrong (same as puremagic) |
| `.gif` | OK | image/gif | gif - GIF image data | image/gif | |
| `.gpg` | MISMATCH | application/octet-stream | unknown - Unknown binary data | application/pgp-encrypted | magika does not recognise binary PGP |
| `.html` | MISMATCH | text/x-ruby | erb - Embedded Ruby source | text/html | False positive; `jena/HEADER.html` misclassified as ERB |
| `.ico` | MISMATCH | image/png | png - PNG image | image/x-icon | The favicon.ico is actually a PNG; same file, same result as puremagic |
| `.img` | OK | application/octet-stream | unknown - Unknown binary data | application/octet-stream | |
| `.index` | OK | text/plain | txt - Generic text document | text/plain | |
| `.jar` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
| `.json` | OK | application/json | json - JSON document | application/json | |
| `.KEYS` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | PGP public key blocks share `-----BEGIN` with PEM; same confusion as .asc |
| `.list` | OK | text/plain | txt - Generic text document | text/plain | |
| `.mar` | MISMATCH | application/java-archive | jar - Java archive data (JAR) | application/octet-stream | The sling .mar is actually a ZIP/JAR; same as puremagic |
| `.md` | OK | text/plain | txt - Generic text document | text/plain | |
| `.md5` | OK | text/plain | txt - Generic text document | text/plain | |
| `.MD5` | OK | text/plain | txt - Generic text document | text/plain | |
| `.mds` | OK | text/plain | sum - Checksum file | text/plain | |
| `.msi` | OK | application/x-msi | msi - Microsoft Installer file | application/x-msi | |
| `.nar` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
| `.nupkg` | MISMATCH | application/octet-stream | nupkg - NuGet Package | application/zip | Label is correct but no registered MIME in the model; falls back to octet-stream |
| `.old` | MISMATCH | message/rfc822 | eml - RFC 822 mail | text/plain | `cassandra/KEYS.old` is PGP public keys; ML model reads bulk key blocks as email |
| `.pack.gz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .tar.gz / .tgz |
| `.patch` | OK | text/plain | diff - Diff file | text/plain | |
| `.pdf` | OK | application/pdf | pdf - PDF document | application/pdf | |
| `.pem` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | Correct detection; expected set should be updated (same as puremagic) |
| `.pl` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
| `.PL` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
| `.pm` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
| `.png` | OK | image/png | png - PNG image | image/png | |
| `.pom` | OK | text/xml | xml - XML document | application/xml, text/xml | |
| `.prov` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | Helm .prov files contain PGP signed-message blocks; same `-----BEGIN` confusion |
| `.ps1` | MISMATCH | application/x-powershell | powershell - Powershell source | text/plain | Correct detection; expected set too narrow |
| `.py` | OK | text/x-python | python - Python source | text/x-python | |
| `.rar` | MISMATCH | application/java-archive | jar - Java archive data (JAR) | application/x-rar-compressed | The jackrabbit .rar is actually a ZIP; same as puremagic |
| `.readme` | OK | text/plain | txt - Generic text document | text/plain | |
| `.repo` | OK | text/plain | ini - INI configuration file | text/plain | |
| `.repositories` | OK | text/plain | ini - INI configuration file | text/plain | |
| `.rpm` | OK | application/x-rpm | rpm - RedHat Package Manager archive (RPM) | application/x-rpm | |
| `.sh` | OK | text/x-shellscript | shell - Shell script | text/x-shellscript | |
| `.sh1` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sha` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sha1` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sha256` | OK | text/plain | txt - Generic text document | text/plain | |
| `.SHA256` | OK | text/plain | txt - Generic text document | text/plain | |
| `.SHA512` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sha512` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sha512sum` | OK | text/plain | txt - Generic text document | text/plain | |
| `.sig` | MISMATCH | application/octet-stream | unknown - Unknown binary data | application/pgp-signature | magika does not recognise binary PGP signatures |
| `.slingosgifeature` | MISMATCH | application/json | json - JSON document | application/zip | 64 KB truncation artefact; same as puremagic |
| `.snupkg` | MISMATCH | application/octet-stream | nupkg - NuGet Package | application/zip | Same as .nupkg: label correct, MIME falls back to octet-stream |
| `.sqlite.bz2` | OK | application/x-bzip2 | bzip - bzip2 compressed data | application/x-bzip2 | |
| `.taco` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
| `.tar.bz2` | OK | application/x-bzip2 | bzip - bzip2 compressed data | application/x-bzip2 | |
| `.tar.gz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .pack.gz / .tgz |
| `.tar.xz` | OK | application/x-xz | xz - XZ compressed data | application/x-xz | |
| `.temp` | OK | text/plain | txt - Generic text document | text/plain | |
| `.tgz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .pack.gz / .tar.gz |
| `.txt` | MISMATCH | text/x-rst | rst - ReStructuredText document | text/plain | The pig README.txt is written in RST; detection is correct |
| `.vsix` | OK | application/zip | zip - Zip archive data | application/zip | |
| `.war` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
| `.whl` | OK | application/zip | zip - Zip archive data | application/zip | |
| `.x` | OK | text/plain | txt - Generic text document | text/plain | |
| `.xml` | OK | text/xml | xml - XML document | application/xml, text/xml | |
| `.xsd` | OK | text/xml | xml - XML document | application/xml, text/xml | |
| `.yaml` | OK | application/x-yaml | yaml - YAML source | application/x-yaml | |
| `.zip` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
---
## Summary of mismatches
### Expected set too narrow or uses legacy MIME - detection is correct (13)
These are not magika failures; the expected results (from the ChatGPT table
in the issue) need updating.
* `.5` - file is a C header template (`svn_version.h.dist`); `text/x-c` is
right.
* `.crate`, `.pack.gz`, `.tar.gz`, `.tgz` - magika returns
`application/gzip`, which is the IANA-registered form. The expected set has
the legacy `application/x-gzip`. They are the same format.
* `.exe` - `application/x-dosexec` is the widely-used MIME for PE
executables.
Add it to the expected set alongside
`application/vnd.microsoft.portable-executable`.
* `.gem` - gems are plain `.tar`, not `.tar.gz`. Expected set should be
`application/x-tar` (same conclusion as puremagic).
* `.pem` - `application/x-pem-file` is the correct type. Expected set had
`text/plain` (same conclusion as puremagic).
* `.pl`, `.PL`, `.pm` - correctly identified as `text/x-perl`.
* `.ps1` - correctly identified as `application/x-powershell`.
* `.txt` - the specific pig README.txt is written in reStructuredText;
`text/x-rst`
is accurate.
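
One way to stop the legacy-MIME cases from showing up as mismatches is to normalise both sides before comparing. This is only a sketch: the alias table and the `canonical` / `status` helpers are hypothetical, and treating `application/x-dosexec` as an alias (rather than adding it to the expected set) is an assumption.

```python
# Hypothetical alias table: legacy or alternative MIME -> form used for comparison.
MIME_ALIASES = {
    "application/x-gzip": "application/gzip",
    "application/x-dosexec": "application/vnd.microsoft.portable-executable",
}


def canonical(mime: str) -> str:
    """Lower-case a MIME type and map known aliases to a single form."""
    mime = mime.strip().lower()
    return MIME_ALIASES.get(mime, mime)


def status(detected: str, expected: set) -> str:
    """Return "OK" if the detected MIME matches any expected MIME after normalisation."""
    expected_canonical = {canonical(m) for m in expected}
    return "OK" if canonical(detected) in expected_canonical else "MISMATCH"
```

With this, `status("application/gzip", {"application/x-gzip"})` is `"OK"`, while the cases where the expected set is genuinely too narrow (`.5`, `.pl`, `.ps1`, ...) still report `"MISMATCH"` until the table is updated.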
### The example file on the mirror is not what the extension implies (5)
Same five files that tripped puremagic for the same reason.
* `.apk` - this file is a gzip stream, not a ZIP-based APK.
* `.bin` - the langdetect `.bin` is actually a ZIP.
* `.ico` - `favicon.ico` is a PNG.
* `.mar` - the sling `.mar` is a ZIP.
* `.rar` - the jackrabbit `.rar` is a ZIP.
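
These five can be cross-checked without either library by looking at the leading magic bytes. A minimal sketch, assuming only the three signatures relevant here (gzip, ZIP local file header, PNG); the `sniff` helper is illustrative and not part of puremagic or magika.

```python
from typing import Optional

# (leading bytes, MIME type) for the formats seen in the five files above.
SIGNATURES = [
    (b"\x1f\x8b", "application/gzip"),
    (b"PK\x03\x04", "application/zip"),
    (b"\x89PNG\r\n\x1a\n", "image/png"),
]


def sniff(head: bytes) -> Optional[str]:
    """Return the MIME type whose signature starts `head`, or None if none match."""
    for magic, mime in SIGNATURES:
        if head.startswith(magic):
            return mime
    return None
```

Note that a JAR is a ZIP at the byte level, so `sniff` reports `application/zip` where magika reports `application/java-archive`.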
### 64 KB truncation artefact (1)
* `.slingosgifeature` - opens with a JSON manifest; the ZIP payload starts
later in
the file. Same result as puremagic.
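
If the file is a ZIP with leading data, its end-of-central-directory (EOCD) record sits at the end of the file, so a check on the tail could confirm the container type even when the head looks like JSON. The sketch below assumes the tail bytes have been obtained separately (for example with a suffix Range request such as `bytes=-65557`); that step was not part of this run, and the helper names are illustrative.

```python
import io
import zipfile

EOCD_SIGNATURE = b"PK\x05\x06"
# The EOCD record is 22 bytes plus an optional comment of up to 65535 bytes.
EOCD_MAX_SPAN = 22 + 0xFFFF


def tail_has_zip_eocd(tail: bytes) -> bool:
    """Return True if the end of the file contains a ZIP EOCD signature."""
    return EOCD_SIGNATURE in tail[-EOCD_MAX_SPAN:]


def make_prefixed_zip(prefix: bytes) -> bytes:
    """Build demo data: arbitrary leading bytes followed by a small ZIP."""
    buf = io.BytesIO()
    with zipfile.ZipFile(buf, "w") as zf:
        zf.writestr("payload.txt", "hello")
    return prefix + buf.getvalue()
```

This is a heuristic only: the four signature bytes can occur by chance in other binary data, so a positive result is a hint to fetch more of the file, not proof.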
### magika does not recognise the format (5)
* `.gpg`, `.sig` - binary PGP. magika does not recognise it and returns
`application/octet-stream`.
* `.dmg` - magika detects the zlib compression layer but not the Apple Disk
Image
container format on top of it.
* `.nupkg`, `.snupkg` - the label (`nupkg`) is correct, but there is no
registered
MIME type in the model so it falls back to `application/octet-stream`.
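
For the binary PGP cases, a fallback could inspect the first OpenPGP packet header byte (RFC 4880, section 4.2) when magika returns `application/octet-stream`. This is a weak heuristic and only a sketch: many unrelated binary files also start with a byte that has bit 7 set, so it is only reasonable alongside the extension. The helper names and the tag-to-MIME mapping are assumptions.

```python
from typing import Optional


def first_packet_tag(head: bytes) -> Optional[int]:
    """Return the OpenPGP packet tag encoded in the first byte, or None."""
    if not head or not head[0] & 0x80:
        return None  # bit 7 is set in every OpenPGP packet header
    first = head[0]
    if first & 0x40:
        return first & 0x3F  # new-format header: tag in bits 5-0
    return (first >> 2) & 0x0F  # old-format header: tag in bits 5-2


def guess_pgp_mime(head: bytes) -> Optional[str]:
    """Map a few packet tags to the MIME types used in the expected set."""
    tag = first_packet_tag(head)
    if tag == 2:  # signature packet
        return "application/pgp-signature"
    if tag in (1, 3):  # public-key or symmetric-key encrypted session key
        return "application/pgp-encrypted"
    return None
```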
### magika false positives / PGP-PEM confusion (6)
PGP ASCII-armored blocks (`-----BEGIN PGP …-----`) and PEM blocks
(`-----BEGIN CERTIFICATE-----` etc.) share the same `-----BEGIN …-----`
wrapper, and the ML model conflates them.
* `.asc` - PGP signature classified as PEM.
* `.KEYS` - PGP public-key block classified as PEM.
* `.prov` - Helm provenance files contain PGP signed-message blocks;
classified as PEM.
* `.changes` - Debian changelog classified as PEM. Likely a spurious match
on some
content pattern in this particular file.
* `.old` - `cassandra/KEYS.old` is PGP public keys; classified as
`message/rfc822`.
Bulk PGP key material can resemble email headers to the model.
* `.html` - `jena/HEADER.html` classified as ERB (Embedded Ruby). No
obvious reason;
likely a low-confidence ML misfire on the file content.
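
The PGP/PEM confusion could be corrected after the fact, since the armor header line names the block type explicitly. A minimal sketch of such a post-check follows; the `refine` helper and its mapping are hypothetical, chosen to match the Expected column above (`application/pgp-signature` for detached signatures, `text/plain` for key blocks and signed messages).

```python
import re
from typing import Optional

# Matches an armor header line such as "-----BEGIN PGP SIGNATURE-----".
ARMOR_RE = re.compile(r"^-----BEGIN PGP ([A-Z ]+)-----\s*$", re.MULTILINE)


def pgp_armor_type(head: bytes) -> Optional[str]:
    """Return the armor block type (e.g. "SIGNATURE"), or None if there is none."""
    match = ARMOR_RE.search(head.decode("ascii", errors="replace"))
    return match.group(1) if match else None


def refine(detected: str, head: bytes) -> str:
    """Override a PEM result when the content is really PGP ASCII armor."""
    if detected != "application/x-pem-file":
        return detected
    armor = pgp_armor_type(head)
    if armor is None:
        return detected  # genuine PEM, e.g. a certificate
    if armor == "SIGNATURE":
        return "application/pgp-signature"
    return "text/plain"  # key blocks and signed messages, per the expected set
```

This would cover `.asc`, `.KEYS` and `.prov`; it does nothing for `.changes`, `.old` or `.html`, where the misclassification has a different cause.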