sbp commented on PR #624: URL: https://github.com/apache/tooling-trusted-releases/pull/624#issuecomment-3848082896
I was thinking more about looking at the implementation to work out how much confidence we can have in the result for each type. For example, you ran `puremagic` on one real-world XML file and it passed. Was that an unusual result, or does `puremagic` always detect XML files correctly? We can't know that from checking one file. In fact, we can't know it from checking lots of files. For example, we could feed it a million files that don't have an XML declaration, `<?xml version="1.0"?>`, at the top, and they might all appear to be detected correctly. Then we use it in production to test XML files, a user uploads one that does have an XML declaration, and it breaks.

The only way to know how much confidence we can have in the results is to do a detailed analysis of the algorithm that it uses. Obviously that would be pretty complicated, and there are a lot of file types, so I was suggesting that we only do this for the most common types that we find on the server. Does it always detect ZIP correctly? Always? 100%? Then we can use it to detect ZIP formats. What about TGZ? 100%? Then we can use it.

The suggestion about magika was that we use it when `puremagic` fails but, again, only when we need it (on the common types) and only when it turns out to be very, very close to 100% accurate. It's impossible to audit magika though, so that's why I was dismissive of it in my comment.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at: [email protected]
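The argument in the comment, that a large passing corpus does not establish correctness, can be illustrated with a minimal sketch. The detector here is hypothetical and is not meant to reflect how `puremagic` works; `naive_is_zip` and `make_zip` are names invented for this example. It assumes only the Python standard library and relies on the fact that a ZIP archive with no members contains only the end-of-central-directory record, so it starts with `PK\x05\x06` rather than `PK\x03\x04`.

```python
import io
import zipfile

LOCAL_FILE_HEADER = b"PK\x03\x04"  # signature at the start of a non-empty ZIP


def naive_is_zip(data: bytes) -> bool:
    """Hypothetical detector: checks only the local file header signature."""
    return data.startswith(LOCAL_FILE_HEADER)


def make_zip(members: dict[str, bytes]) -> bytes:
    """Build a ZIP archive in memory from a name -> content mapping."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        for name, content in members.items():
            archive.writestr(name, content)
    return buffer.getvalue()


# A corpus of ordinary archives: every sample has at least one member.
corpus = [make_zip({f"file{i}.txt": b"data"}) for i in range(1000)]
corpus_accuracy = sum(naive_is_zip(sample) for sample in corpus) / len(corpus)

# A valid archive the corpus never contained: one with no members.
empty_archive = make_zip({})
empty_is_valid = zipfile.is_zipfile(io.BytesIO(empty_archive))
empty_detected = naive_is_zip(empty_archive)

print(f"corpus accuracy: {corpus_accuracy:.0%}")
print(f"empty archive valid: {empty_is_valid}, detected: {empty_detected}")
```

Every sample in the corpus is detected, yet the empty archive is a valid ZIP that the check misses. No amount of sampling from the first population would have revealed that; reading the detection rule would.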
