alitheg commented on PR #624:
URL: 
https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3842509600

   Here's the same for magika:
   
   # magika detection results - issue #553
   
   Same files and 64 KB Range-request downloads as the puremagic run.  Detection uses
   `Magika.identify_path()`.  Magika is an ML-based classifier (Google), so it returns a
   single best-guess result rather than a set of candidates.
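
For reference, the 64 KB Range-request download can be sketched roughly as below. This is a minimal illustration using only the standard library, not the script used for this run; the function names and the `PREFIX_BYTES` constant are hypothetical.

```python
from urllib.request import Request, urlopen

PREFIX_BYTES = 64 * 1024  # 64 KB, the prefix size used in this run


def build_prefix_request(url: str, limit: int = PREFIX_BYTES) -> Request:
    """Build a GET request asking only for the first `limit` bytes."""
    # HTTP byte ranges are inclusive, so 0..limit-1 covers `limit` bytes.
    return Request(url, headers={"Range": f"bytes=0-{limit - 1}"})


def fetch_prefix(url: str, limit: int = PREFIX_BYTES) -> bytes:
    """Download at most `limit` bytes from the start of `url`."""
    with urlopen(build_prefix_request(url, limit)) as response:
        # A server may ignore Range and reply with the whole body,
        # so cap the read instead of trusting the response size.
        return response.read(limit)
```

The bytes returned by `fetch_prefix()` would be written to a temporary file, and that file is what gets passed to `Magika.identify_path()`.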
   
   **84 file types tested.**
   
   | OK | MISMATCH |
   |----|----------|
   | 54 | 30       |
   
   ---
   
   | Extension | Status | Magika MIME | Magika label / description | Expected | Notes |
   |-----------|--------|-------------|----------------------------|----------|-------|
   | `.2` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.4` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.5` | MISMATCH | text/x-c | c - C source | text/plain | File is `svn_version.h.dist`; detection is correct, expected set too narrow |
   | `.512` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.6` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.66` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.7` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.7z` | OK | application/x-7z-compressed | sevenzip - 7-zip archive data | application/x-7z-compressed | |
   | `.adoc` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.apk` | MISMATCH | application/gzip | gzip - gzip compressed data | application/zip | This APK is a gzip stream; same file, same result as puremagic |
   | `.asc` | MISMATCH | application/x-pem-file | pem - PEM certificate | application/pgp-signature | PGP ASCII-armored and PEM share the `-----BEGIN` wrapper; magika confuses them |
   | `.bin` | MISMATCH | application/zip | zip - Zip archive data | application/octet-stream | The langdetect .bin is actually a ZIP; same as puremagic |
   | `.changes` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | False positive; a Debian changelog classified as PEM |
   | `.crate` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | `application/gzip` is the IANA-registered form of `application/x-gzip`; equivalent |
   | `.css` | OK | text/css | css - CSS source | text/css | |
   | `.deb` | OK | application/vnd.debian.binary-package | deb - Debian binary package | application/vnd.debian.binary-package | |
   | `.dmg` | MISMATCH | application/zlib | zlibstream - zlib compressed data | application/x-apple-diskimage | Detected the compression layer, not the container format |
   | `.exe` | MISMATCH | application/x-dosexec | pebin - PE Windows executable | application/vnd.microsoft.portable-executable | `application/x-dosexec` is the common MIME for PE; detection is correct |
   | `.far` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   | `.gem` | MISMATCH | application/x-tar | tar - POSIX tar archive | application/x-gzip | Gems are plain tar; expected set is wrong (same as puremagic) |
   | `.gif` | OK | image/gif | gif - GIF image data | image/gif | |
   | `.gpg` | MISMATCH | application/octet-stream | unknown - Unknown binary data | application/pgp-encrypted | magika does not recognise binary PGP |
   | `.html` | MISMATCH | text/x-ruby | erb - Embedded Ruby source | text/html | False positive; `jena/HEADER.html` misclassified as ERB |
   | `.ico` | MISMATCH | image/png | png - PNG image | image/x-icon | The favicon.ico is actually a PNG; same file, same result as puremagic |
   | `.img` | OK | application/octet-stream | unknown - Unknown binary data | application/octet-stream | |
   | `.index` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.jar` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   | `.json` | OK | application/json | json - JSON document | application/json | |
   | `.KEYS` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | PGP public key blocks share `-----BEGIN` with PEM; same confusion as .asc |
   | `.list` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.mar` | MISMATCH | application/java-archive | jar - Java archive data (JAR) | application/octet-stream | The sling .mar is actually a ZIP/JAR; same as puremagic |
   | `.md` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.md5` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.MD5` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.mds` | OK | text/plain | sum - Checksum file | text/plain | |
   | `.msi` | OK | application/x-msi | msi - Microsoft Installer file | application/x-msi | |
   | `.nar` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   | `.nupkg` | MISMATCH | application/octet-stream | nupkg - NuGet Package | application/zip | Label is correct but no registered MIME in the model; falls back to octet-stream |
   | `.old` | MISMATCH | message/rfc822 | eml - RFC 822 mail | text/plain | `cassandra/KEYS.old` is PGP public keys; ML model reads bulk key blocks as email |
   | `.pack.gz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .tar.gz / .tgz |
   | `.patch` | OK | text/plain | diff - Diff file | text/plain | |
   | `.pdf` | OK | application/pdf | pdf - PDF document | application/pdf | |
   | `.pem` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | Correct detection; expected set should be updated (same as puremagic) |
   | `.pl` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
   | `.PL` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
   | `.pm` | MISMATCH | text/x-perl | perl - Perl source | text/plain | Correct detection; expected set too narrow |
   | `.png` | OK | image/png | png - PNG image | image/png | |
   | `.pom` | OK | text/xml | xml - XML document | application/xml, text/xml | |
   | `.prov` | MISMATCH | application/x-pem-file | pem - PEM certificate | text/plain | Helm .prov files contain PGP signed-message blocks; same `-----BEGIN` confusion |
   | `.ps1` | MISMATCH | application/x-powershell | powershell - Powershell source | text/plain | Correct detection; expected set too narrow |
   | `.py` | OK | text/x-python | python - Python source | text/x-python | |
   | `.rar` | MISMATCH | application/java-archive | jar - Java archive data (JAR) | application/x-rar-compressed | The jackrabbit .rar is actually a ZIP; same as puremagic |
   | `.readme` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.repo` | OK | text/plain | ini - INI configuration file | text/plain | |
   | `.repositories` | OK | text/plain | ini - INI configuration file | text/plain | |
   | `.rpm` | OK | application/x-rpm | rpm - RedHat Package Manager archive (RPM) | application/x-rpm | |
   | `.sh` | OK | text/x-shellscript | shell - Shell script | text/x-shellscript | |
   | `.sh1` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sha` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sha1` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sha256` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.SHA256` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.SHA512` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sha512` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sha512sum` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.sig` | MISMATCH | application/octet-stream | unknown - Unknown binary data | application/pgp-signature | magika does not recognise binary PGP signatures |
   | `.slingosgifeature` | MISMATCH | application/json | json - JSON document | application/zip | 64 KB truncation artefact; same as puremagic |
   | `.snupkg` | MISMATCH | application/octet-stream | nupkg - NuGet Package | application/zip | Same as .nupkg: label correct, MIME falls back to octet-stream |
   | `.sqlite.bz2` | OK | application/x-bzip2 | bzip - bzip2 compressed data | application/x-bzip2 | |
   | `.taco` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   | `.tar.bz2` | OK | application/x-bzip2 | bzip - bzip2 compressed data | application/x-bzip2 | |
   | `.tar.gz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .pack.gz / .tgz |
   | `.tar.xz` | OK | application/x-xz | xz - XZ compressed data | application/x-xz | |
   | `.temp` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.tgz` | MISMATCH | application/gzip | gzip - gzip compressed data | application/x-gzip | Same gzip MIME equivalence as .crate / .pack.gz / .tar.gz |
   | `.txt` | MISMATCH | text/x-rst | rst - ReStructuredText document | text/plain | The pig README.txt is written in RST; detection is correct |
   | `.vsix` | OK | application/zip | zip - Zip archive data | application/zip | |
   | `.war` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   | `.whl` | OK | application/zip | zip - Zip archive data | application/zip | |
   | `.x` | OK | text/plain | txt - Generic text document | text/plain | |
   | `.xml` | OK | text/xml | xml - XML document | application/xml, text/xml | |
   | `.xsd` | OK | text/xml | xml - XML document | application/xml, text/xml | |
   | `.yaml` | OK | application/x-yaml | yaml - YAML source | application/x-yaml | |
   | `.zip` | OK | application/java-archive | jar - Java archive data (JAR) | application/java-archive | |
   
   ---
   
   ## Summary of mismatches
   
   ### Expected set too narrow or uses legacy MIME - detection is correct (13)
   
   These are not magika failures; the expected results (from the ChatGPT table in the
   issue) need updating.
   
   * `.5` - file is a C header template (`svn_version.h.dist`); `text/x-c` is right.
   * `.crate`, `.pack.gz`, `.tar.gz`, `.tgz` - magika returns `application/gzip`,
     which is the IANA-registered form.  The expected set has the legacy
     `application/x-gzip`.  They are the same format.
   * `.exe` - `application/x-dosexec` is the widely used MIME for PE executables.
     Add it to the expected set alongside
     `application/vnd.microsoft.portable-executable`.
   * `.gem` - gems are plain `.tar`, not `.tar.gz`.  Expected set should be
     `application/x-tar` (same conclusion as puremagic).
   * `.pem` - `application/x-pem-file` is the correct type.  Expected set had
     `text/plain` (same conclusion as puremagic).
   * `.pl`, `.PL`, `.pm` - correctly identified as `text/x-perl`.
   * `.ps1` - correctly identified as `application/x-powershell`.
   * `.txt` - the specific pig README.txt is written in reStructuredText;
     `text/x-rst` is accurate.
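
One way the comparison could tolerate legacy or alternative MIME names is to canonicalise both sides before matching. The sketch below is only an illustration of that idea, not the check used in the PR; `MIME_ALIASES`, `canonical()` and `status()` are hypothetical names, and the alias table contains only pairs mentioned in the list above.

```python
# Hypothetical alias table: legacy or alternative names mapped to the form
# magika reported in this run. Treating the two PE types as aliases is an
# assumption; listing both in the expected set would work equally well.
MIME_ALIASES = {
    "application/x-gzip": "application/gzip",
    "application/vnd.microsoft.portable-executable": "application/x-dosexec",
}


def canonical(mime: str) -> str:
    """Normalise case and whitespace, then fold known aliases together."""
    mime = mime.strip().lower()
    return MIME_ALIASES.get(mime, mime)


def status(detected: str, expected: set[str]) -> str:
    """Return "OK" if the detected MIME matches any expected MIME after aliasing."""
    wanted = {canonical(m) for m in expected}
    return "OK" if canonical(detected) in wanted else "MISMATCH"
```

With such a table the four gzip rows and `.exe` would report OK, while the rows where the expected set is genuinely too narrow (for example `.pl`) would still need the expected set itself updated.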
   
   ### The example file on the mirror is not what the extension implies (5)
   
   Same five files that tripped puremagic for the same reason.
   
   * `.apk` - this file is a gzip stream, not a ZIP-based APK.
   * `.bin` - the langdetect `.bin` is actually a ZIP.
   * `.ico` - `favicon.ico` is a PNG.
   * `.mar` - the sling `.mar` is a ZIP.
   * `.rar` - the jackrabbit `.rar` is a ZIP.
   
   ### 64 KB truncation artefact (1)
   
   * `.slingosgifeature` - opens with a JSON manifest; the ZIP payload starts later
     in the file.  Same result as puremagic.
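
The effect can be reproduced with a synthetic file. The sketch below builds a stand-in (JSON text followed by a ZIP archive larger than 64 KB) and checks it with the standard-library `zipfile` module, which looks for the end-of-central-directory record near the end of the data. The sample contents and helper names are made up for illustration and say nothing about the real layout of the Sling file.

```python
import io
import json
import zipfile

LIMIT = 64 * 1024  # same 64 KB prefix size as the test run


def build_sample() -> bytes:
    """Synthetic stand-in: JSON text first, then a ZIP archive larger than LIMIT."""
    payload = io.BytesIO()
    with zipfile.ZipFile(payload, "w", zipfile.ZIP_STORED) as archive:
        archive.writestr("bundle.bin", b"x" * 100_000)
    manifest = json.dumps({"id": "example:feature:1.0"}).encode() + b"\n"
    return manifest + payload.getvalue()


def looks_like_zip(data: bytes) -> bool:
    # zipfile searches for the end-of-central-directory record at the END
    # of the data, so a truncated prefix cannot contain it.
    return zipfile.is_zipfile(io.BytesIO(data))


sample = build_sample()
print("whole file is a ZIP:  ", looks_like_zip(sample))
print("64 KB prefix is a ZIP:", looks_like_zip(sample[:LIMIT]))
```

A prefix-only detector sees just the leading JSON, which is why both puremagic and magika report `application/json` here.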
   
   ### magika does not recognise the format (5)
   
   * `.gpg`, `.sig` - binary PGP.  magika has no model for this; returns
     `application/octet-stream`.
   * `.dmg` - magika detects the zlib compression layer but not the Apple Disk Image
     container format on top of it.
   * `.nupkg`, `.snupkg` - the label (`nupkg`) is correct, but there is no registered
     MIME type in the model so it falls back to `application/octet-stream`.
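
For the NuGet case, where the label is right but the MIME is generic, a small label-to-MIME override could close the gap. This is a hypothetical sketch: `LABEL_MIME_OVERRIDES` and `effective_mime()` are not part of magika, and the single entry reflects only the `nupkg` rows above.

```python
GENERIC = "application/octet-stream"

# Hypothetical overrides for labels whose MIME came back generic in this run.
# NuGet packages are ZIP containers, which is what the expected set lists.
LABEL_MIME_OVERRIDES = {"nupkg": "application/zip"}


def effective_mime(label: str, mime: str) -> str:
    """Prefer an override only when the reported MIME is the generic fallback."""
    if mime == GENERIC:
        return LABEL_MIME_OVERRIDES.get(label, mime)
    return mime
```

Labels without an override (such as `unknown` for `.gpg` and `.sig`) pass through unchanged, so this does not hide the formats magika genuinely fails to recognise.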
   
   ### magika false positives / PGP-PEM confusion (6)
   
   PGP ASCII-armored blocks (`-----BEGIN PGP …-----`) and PEM blocks
   (`-----BEGIN CERTIFICATE-----` etc.) share the same `-----BEGIN … -----` wrapper.
   The ML model conflates them.
   
   * `.asc` - PGP signature classified as PEM.
   * `.KEYS` - PGP public-key block classified as PEM.
   * `.prov` - Helm provenance files contain PGP signed-message blocks; classified
     as PEM.
   * `.changes` - Debian changelog classified as PEM.  Likely a spurious match on
     some content pattern in this particular file.
   * `.old` - `cassandra/KEYS.old` is PGP public keys; classified as
     `message/rfc822`.  Bulk PGP key material can resemble email headers to the
     model.
   * `.html` - `jena/HEADER.html` classified as ERB (Embedded Ruby).  No obvious
     reason; likely a low-confidence ML misfire on the file content.
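
Since the PGP and PEM wrappers differ only in the label after `BEGIN`, a cheap text check could disambiguate them when magika reports `application/x-pem-file`. The following is a minimal sketch of such a post-check, assuming the file content is available as text; `armor_kind()` is a hypothetical helper, not something in the PR.

```python
import re
from typing import Optional

# Matches an armor header line such as "-----BEGIN PGP SIGNATURE-----"
# or "-----BEGIN CERTIFICATE-----" anywhere in the text.
_BEGIN_LINE = re.compile(r"^-----BEGIN ([A-Z0-9 ]+)-----\s*$", re.MULTILINE)


def armor_kind(text: str) -> Optional[str]:
    """Return "pgp", "pem", or None when no BEGIN line is present."""
    match = _BEGIN_LINE.search(text)
    if match is None:
        return None
    # PGP armor labels all start with "PGP " (signature, public key block,
    # signed message, ...); anything else is treated as PEM here.
    return "pgp" if match.group(1).startswith("PGP ") else "pem"
```

This only covers the `-----BEGIN` confusion (`.asc`, `.KEYS`, `.prov`); it would not explain the `.html` result, which has no such wrapper.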


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
