Bug#979554: please clarify case sensitiveness of mime.types
Charles Plessy writes:

> The problem here is that I have no comprehensive information on how
> software uses the mime.types files. I cannot rule out that some rely on
> case sensitivity for their own good reasons, so if no other bugs arise, I
> would like to continue to stick to the information provided by the IANA.
>
>> Those entries will behave differently for "see" and Python's
>> "mimetypes.guess_type()". For instance "see" will consider "foo.sar"
>> as application/vnd.sar, but "mimetypes.guess_type()" will not.
>
> I cannot tell which approach is wiser... The mime.types file is not
> comprehensive and is usually lagging. What if there is another file
> format around that uses the lowercase `sar` extension?

I've looked at four other implementations, to find new behaviors.

Emacs assumes that /etc/mime.types contains only lowercase extensions.
When (mailcap-extension-to-mime ext) is called, ext is first downcased
before being compared to the extensions in /etc/mime.types. So it is
unable to work with extensions like SAR, which are listed only in
uppercase in /etc/mime.types.

  (mailcap-extension-to-mime "JPG") => "image/jpeg"
  (mailcap-extension-to-mime "jpg") => "image/jpeg"
  (mailcap-extension-to-mime "SAR") => nil
  (mailcap-extension-to-mime "sar") => nil

Apache 2.4's mod_mime converts extensions to lowercase when reading
them from /etc/mime.types and before looking them up. So it has the
same case-insensitive behavior as the "see" command mentioned in my
previous mail.

Go (https://golang.org/src/mime/type.go) keeps two maps, one where all
extensions are stored as they are in /etc/mime.types, one where they
are lowercased. Lookup is done case-sensitive first, then
case-insensitive. That's a new behavior, compared to other tools.

Mutt does a case-insensitive comparison of each extension in
/etc/mime.types against the end of the filename being checked. So
regarding case-sensitiveness it behaves like Python and see. However,
contrary to Python and see, Mutt is able to deal with extensions in
mime.types that contain a dot. (The only one listed in /etc/mime.types
is "pcf.Z", and it seems to work in Python and see only because these
implementations have some hard-coded handling of .Z, .gz, and other
similar extensions, so they actually do a lookup for "pcf", which has
the same MIME type in /etc/mime.types.) In case of multiple matches,
Mutt keeps the longest one.

>> It would be nice to clarify the semantics in the comments at the top of
>> mime.types.
>
> Definitely! I hope to do so or write a proper man page after I dig into
> the history of that file.

On that topic, the comment at the top of the file:

  # Users can add their own types if they wish by creating a ".mime.types"
  # file in their home directory. Definitions included there will take
  # precedence over those listed here.

should probably be rephrased to suggest that this is how applications
are expected to work, but that not all of them do. For instance
Python and Go won't look at this file.

-- 
Alexandre Duret-Lutz
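
The lookup behaviors described above for Emacs, Go and Mutt could be
sketched in Python roughly as follows. This is only an illustration
under stated assumptions: the table is a tiny hypothetical sample and
not the real /etc/mime.types, the function names are made up, the "Z"
entry is invented purely to exercise the longest-match rule, and the
requirement that a dot precede the matched suffix in the Mutt-style
function is an assumption rather than something confirmed in this
thread.

```python
# Hypothetical sample table, keyed by extension exactly as written in the file.
TABLE = {
    "jpg": "image/jpeg",
    "SAR": "application/vnd.sar",
    "pcf": "application/x-font-pcf",
    "pcf.Z": "application/x-font-pcf",
    # Invented entry, present only to exercise the longest-match rule below.
    "Z": "application/x-compress",
}

# Second map with lowercased keys, as kept by the Go-style lookup.
LOWERED = {ext.lower(): mime_type for ext, mime_type in TABLE.items()}


def emacs_style(ext):
    """Downcase only the query; the table is used as written."""
    return TABLE.get(ext.lower())


def go_style(ext):
    """Case-sensitive lookup first, then fall back to the lowercased map."""
    if ext in TABLE:
        return TABLE[ext]
    return LOWERED.get(ext.lower())


def mutt_style(filename):
    """Case-insensitive suffix match on the filename; the longest match wins.

    Assumption: the matched suffix must be preceded by a dot.
    """
    name = filename.lower()
    best = None  # (extension, mime_type) of the longest match so far
    for ext, mime_type in TABLE.items():
        if name.endswith("." + ext.lower()):
            if best is None or len(ext) > len(best[0]):
                best = (ext, mime_type)
    return best[1] if best else None


if __name__ == "__main__":
    for query in ("JPG", "jpg", "SAR", "sar"):
        print(query, emacs_style(query), go_style(query))
    print(mutt_style("font.pcf.Z"), mutt_style("FOO.sar"))
```

With this sample, the Emacs-style function fails on both "SAR" and
"sar", the Go-style function resolves both, and the Mutt-style function
prefers "pcf.Z" over the shorter "Z" for a name such as "font.pcf.Z".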
Bug#979554: please clarify case sensitiveness of mime.types
On Fri, Jan 08, 2021 at 09:38:50AM +0100, Alexandre Duret-Lutz wrote:
>
> Recent versions of media-types (1.1.0 and 2.0.0) have introduced some
> lines with extensions appearing in both lowercase and uppercase form:

Hi Alexandre,

thank you for your report. I was also wondering about case sensitivity
when I worked on the 2.0.0 update. One of my current problems is that
there does not seem to be a written specification for such details in
/etc/mime.types.

> audio/AMR                       amr AMR
> audio/AMR-WB                    awb AWB
> audio/EVRC-QCP                  qcp QCP

The IANA assignment pages for these three types list both the lower- and
the uppercase suffixes, so I decided to stick to that.

> image/t38                       t38 T38

When I merged the mime.types from Fedora into Debian's, Fedora had the
lowercase one and Debian the uppercase one. The IANA assignment declares
no extension. I decided to keep both. Perhaps it is not the best
decision.

> image/vnd.globalgraphics.pgb    PGB pgb

This type also declares both cases in IANA's assignment.

> Some tools will complain if some entries in mime.types have duplicate
> extensions (in some case-insensitive sense). For instance the above
> lines are causing Bug#979232 for lighttpd.

I am glad that you could solve the bug easily on lighttpd's side. By the
way, I am preparing an update that reverts all case-sensitive duplicates
for this release cycle, and I will surely do the same for
case-insensitive ones if they cause serious bugs elsewhere.

> So the question is what is the intended semantics for the above lines?
> Is "audio/AMR amr AMR" really meant to achieve more than "audio/AMR amr"?

The problem here is that I have no comprehensive information on how
software uses the mime.types files. I cannot rule out that some rely on
case sensitivity for their own good reasons, so if no other bugs arise, I
would like to continue to stick to the information provided by the IANA.

> Also note that mime.types lists some extensions with only uppercase
> versions, or a mix of lower and upper case letters:
>
> application/vnd.sar             SAR
> application/vnd.ves.encrypted   VES

These two I imported from Fedora and I checked that they are consistent
with IANA.

> application/x-font-pcf          pcf pcf.Z

This one has been in Debian for a long time.

> Those entries will behave differently for "see" and Python's
> "mimetypes.guess_type()". For instance "see" will consider "foo.sar"
> as application/vnd.sar, but "mimetypes.guess_type()" will not.

I cannot tell which approach is wiser... The mime.types file is not
comprehensive and is usually lagging. What if there is another file
format around that uses the lowercase `sar` extension?

> It would be nice to clarify the semantics in the comments at the top of
> mime.types.

Definitely! I hope to do so or write a proper man page after I dig into
the history of that file.

Have a nice day,

Charles

-- 
Charles Plessy                         Nagahama, Yomitan, Okinawa, Japan
Tooting from work,               https://mastodon.technology/@charles_plessy
Tooting from home,                 https://framapiaf.org/@charles_plessy
Bug#979554: please clarify case sensitiveness of mime.types
Package: media-types
Version: 3.0.0
Severity: normal

Recent versions of media-types (1.1.0 and 2.0.0) have introduced some
lines with extensions appearing in both lowercase and uppercase form:

audio/AMR                       amr AMR
audio/AMR-WB                    awb AWB
audio/EVRC-QCP                  qcp QCP
image/t38                       t38 T38
image/vnd.globalgraphics.pgb    PGB pgb

However, most tools that look up the mime.types database will already
lowercase the extensions before doing so. After all, we are used to
having PICTURE.JPG recognized as image/jpeg even if JPG is not listed
in mime.types.

For instance the Perl script "see", distributed by the mailcap package,
will lowercase all extensions when mime.types is read, and will
lowercase any extension before looking it up.

Python's mimetypes.guess_type() function will look up the extension
unmodified first, and then the lowercase version if the first lookup
failed. For this implementation it means that mime.types could force
some extensions to have some uppercase letters.

Some tools will complain if some entries in mime.types have duplicate
extensions (in some case-insensitive sense). For instance the above
lines are causing Bug#979232 for lighttpd.

So the question is what is the intended semantics for the above lines?
Is "audio/AMR amr AMR" really meant to achieve more than "audio/AMR amr"?

Also note that mime.types lists some extensions with only uppercase
versions, or a mix of lower and upper case letters:

application/vnd.sar             SAR
application/vnd.ves.encrypted   VES
application/x-font-pcf          pcf pcf.Z

Those entries will behave differently for "see" and Python's
"mimetypes.guess_type()". For instance "see" will consider "foo.sar"
as application/vnd.sar, but "mimetypes.guess_type()" will not.

It would be nice to clarify the semantics in the comments at the top of
mime.types.

-- 
Alexandre Duret-Lutz
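
As a rough illustration of the difference reported above, the two
lookup strategies and the case-insensitive duplicate check could be
sketched in Python as follows. This is not the code of "see", of
Python's mimetypes module, or of lighttpd: the sample text, the helper
names and the parsing rules (whitespace-separated fields, "#" starting
a comment) are assumptions made for the example, and the image/jpeg
line is assumed to be a typical entry rather than copied from the
file.

```python
# Illustrative sample only; not a copy of the real /etc/mime.types.
SAMPLE = """\
# type                  extensions
audio/AMR               amr AMR
image/jpeg              jpeg jpg jpe
application/vnd.sar     SAR
"""


def parse(text):
    """Return (mime_type, extension) pairs in file order."""
    pairs = []
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and blank lines
        if not line:
            continue
        mime_type, *extensions = line.split()
        pairs.extend((mime_type, ext) for ext in extensions)
    return pairs


def lookup_lowercase_both(pairs, ext):
    """'see'-like strategy: lowercase the table and the query."""
    table = {e.lower(): t for t, e in pairs}
    return table.get(ext.lower())


def lookup_exact_then_lower(pairs, ext):
    """Python-like strategy: try the extension as given, then lowercased."""
    table = {e: t for t, e in pairs}
    if ext in table:
        return table[ext]
    return table.get(ext.lower())


def case_insensitive_duplicates(pairs):
    """Group extensions that differ only by case (what some tools reject)."""
    seen = {}
    for _, ext in pairs:
        seen.setdefault(ext.lower(), []).append(ext)
    return {key: exts for key, exts in seen.items() if len(exts) > 1}


if __name__ == "__main__":
    pairs = parse(SAMPLE)
    print(lookup_lowercase_both(pairs, "sar"))
    print(lookup_exact_then_lower(pairs, "sar"))
    print(case_insensitive_duplicates(pairs))
```

On this sample, the first strategy resolves "sar" while the second
does not, and the only case-insensitive duplicate is the "amr"/"AMR"
pair.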