sbp commented on PR #624:
URL: 
https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3848082896

   I was thinking more about looking at the implementation to figure out the 
confidence we can have in the result for each type. For example, you ran 
`puremagic` on one real-world example XML file and it passed. Was that an 
unusual result, or does `puremagic` always detect XML files correctly? We can't 
know that from checking one file. In fact, we can't know it from checking lots 
of files. For example, we could feed it a million files that don't have an XML 
declaration, `<?xml version="1.0"?>`, at the top. They all seem to be detected 
correctly. Then we use it in production to test XML files and a user uploads 
one with an XML declaration, and it breaks. The only way to know how much 
confidence we can have in results is to do detailed analysis of the algorithm 
that it uses.
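
To make the point concrete, here is a minimal sketch of how a large corpus can still hide a bug. `toy_detector` is a hypothetical, deliberately flawed detector written only for this illustration; it is not `puremagic`'s algorithm, and the names `accuracy`, `corpus`, and `with_declaration` are also made up here.

```python
from typing import Callable, Iterable

Detector = Callable[[bytes], str]


def toy_detector(data: bytes) -> str:
    """Hypothetical flawed detector (an assumption, not puremagic's logic)."""
    stripped = data.lstrip()
    # Flaw: accepts a leading root element, but its author never
    # considered input that starts with the `<?xml ...?>` declaration.
    if stripped.startswith(b"<") and not stripped.startswith(b"<?"):
        return "xml"
    return "unknown"


def accuracy(detector: Detector, samples: Iterable[bytes], expected: str) -> float:
    """Fraction of samples for which the detector returns the expected type."""
    samples = list(samples)
    hits = sum(1 for sample in samples if detector(sample) == expected)
    return hits / len(samples)


# A large corpus in which no file has an XML declaration at the top.
corpus = [f"<doc id='{i}'><item/></doc>".encode() for i in range(10_000)]
corpus_accuracy = accuracy(toy_detector, corpus, "xml")

# One file that does have the declaration.
with_declaration = b'<?xml version="1.0"?>\n<doc><item/></doc>'
declared_result = toy_detector(with_declaration)

print(corpus_accuracy)   # 1.0
print(declared_result)   # unknown
```

The corpus result looks perfect only because every sample shares the property that the flaw depends on. Reading the detector's code reveals the gap immediately, which no amount of sampling from that corpus would.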
   
   Obviously that would be pretty complicated, and there are a lot of file 
types, so I was suggesting that we only do this for the most common types that 
we find on the server. Does it always detect ZIP correctly? Always? 100%? Then 
we can use it to detect ZIP formats. What about TGZ? 100%? Then we can use it. 
The suggestion about magika was that we use it when `puremagic` fails but, 
again, only when we need it (on the common types) and only when it turns out to 
be very, very close to 100% accurate. It's impossible to audit magika though, 
so that's why I was dismissive of it in my comment.
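
As a rough sketch of that policy, assuming detectors are plain callables that return a type name or `None`: the sets `AUDITED_PRIMARY` and `TRUSTED_FALLBACK`, the function `detect`, and `stub_primary` are all hypothetical names for illustration, not anything in the PR or in either library.

```python
from typing import Callable, Optional

Detector = Callable[[bytes], Optional[str]]

# Hypothetical allowlists. A type is added only after its detection
# logic has been analysed, not merely because test files passed.
AUDITED_PRIMARY = frozenset({"zip", "tgz"})
TRUSTED_FALLBACK: frozenset = frozenset()  # empty until a fallback earns trust


def detect(data: bytes, primary: Detector,
           fallback: Optional[Detector] = None) -> Optional[str]:
    """Return a type only when it comes from a source trusted for that type."""
    result = primary(data)
    if result in AUDITED_PRIMARY:
        return result
    if fallback is not None:
        result = fallback(data)
        if result in TRUSTED_FALLBACK:
            return result
    return None  # no trustworthy answer


def stub_primary(data: bytes) -> Optional[str]:
    """Stand-in for a real detector; checks only the ZIP local file header."""
    return "zip" if data.startswith(b"PK\x03\x04") else "xml"


zip_result = detect(b"PK\x03\x04rest-of-archive", stub_primary)
xml_result = detect(b"<doc/>", stub_primary)
print(zip_result)  # zip
print(xml_result)  # None
```

The second call returns `None` even though the stub reported `xml`, because `xml` is not in the audited set. An unaudited answer is treated the same as no answer.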


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


