Bug#979554: please clarify case sensitiveness of mime.types

2021-01-13 Thread Alexandre Duret-Lutz
Charles Plessy  writes:

> The problem here is that I have no comprehensive information on how
> software uses the mime.types files.  I cannot rule out that some programs
> use case sensitivity for their own good reasons, so if no other bugs arise,
> I would like to continue to stick to the information provided by the IANA.
>
>> Those entries will behave differently for "see" and Python's
>> "mimetypes.guess_type()".   For instance "see" will consider "foo.sar"
>> as application/vnd.sar, but "mimetypes.guess_type()" will not.
>
> I cannot tell which approach is wiser...  The mime.types file is not
> comprehensive and is usually lagging.  What if there is another file
> format around that uses the lowercase `sar` extension?


I've looked at four other implementations to see whether they show any
new behaviors.


Emacs assumes that /etc/mime.types contains only lowercase extensions.
When (mailcap-extension-to-mime ext) is called, ext is first downcased
before being compared to the extensions in /etc/mime.types.  So it
cannot match extensions like SAR, which are listed only in uppercase
in /etc/mime.types.

(mailcap-extension-to-mime "JPG") => "image/jpeg"
(mailcap-extension-to-mime "jpg") => "image/jpeg"
(mailcap-extension-to-mime "SAR") => nil
(mailcap-extension-to-mime "sar") => nil
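
The Emacs results above can be modeled with a short Python sketch.  It
only illustrates the lookup logic as described in this mail and is not
Emacs's actual code; the table contents and the name emacs_like_lookup
are made up for the example.

```python
# Hypothetical in-memory table, keyed exactly as the extensions appear
# in mime.types (SAR is listed only in uppercase there).
TABLE = {"jpg": "image/jpeg", "SAR": "application/vnd.sar"}


def emacs_like_lookup(ext):
    """Downcase the query only; the table keys are left untouched."""
    return TABLE.get(ext.lower())


for ext in ("JPG", "jpg", "SAR", "sar"):
    print(ext, "=>", emacs_like_lookup(ext))
```

Because only the query is downcased, an uppercase-only key such as SAR
can never be reached, whatever the case of the filename.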


Apache 2.4's mod_mime converts extensions to lowercase when reading them
from /etc/mime.types and again before looking them up.  So it has the
same case-insensitive behavior as the "see" command mentioned in my
previous mail.
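
A minimal Python sketch of this "lowercase on both sides" approach
follows.  It is a model of the behavior described here, not code taken
from mod_mime or see; the function names and the simplified parsing
(text after "#" is treated as a comment) are assumptions.

```python
def load_lowercased(lines):
    """Parse mime.types-style lines, lowercasing every extension."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()  # assumed comment syntax
        if len(fields) < 2:
            continue  # blank line, comment, or type without extension
        mime_type, extensions = fields[0], fields[1:]
        for ext in extensions:
            table[ext.lower()] = mime_type
    return table


def lookup(table, ext):
    """Lowercase the query too, so the match is fully case-insensitive."""
    return table.get(ext.lower())


table = load_lowercased(["application/vnd.sar SAR", "audio/AMR amr AMR"])
print(lookup(table, "sar"), lookup(table, "AMR"))
```

With this scheme "amr AMR" collapses into a single key, so listing both
cases in mime.types adds nothing.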


Go (https://golang.org/src/mime/type.go) keeps two maps, one where all
extensions are stored as they are in /etc/mime.types, one where they
are lowercased.  Lookup is done case-sensitive first, then
case-insensitive.  That's a new behavior, compared to other tools.
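
The two-map scheme can be sketched in Python as follows.  This is a
model of the behavior described above, not a translation of Go's
source; the class name and the "application/x-hypothetical" type are
invented for illustration.

```python
class TwoMapTable:
    """Keep extensions both as written and lowercased."""

    def __init__(self):
        self.exact = {}    # keys as they appear in mime.types
        self.lowered = {}  # keys lowercased

    def add(self, mime_type, ext):
        self.exact[ext] = mime_type
        self.lowered[ext.lower()] = mime_type

    def lookup(self, ext):
        # Case-sensitive match first, case-insensitive fallback second.
        if ext in self.exact:
            return self.exact[ext]
        return self.lowered.get(ext.lower())


t = TwoMapTable()
t.add("application/vnd.sar", "SAR")
print(t.lookup("sar"))  # found through the lowercased map

# A hypothetical second format that uses the lowercase extension.
t.add("application/x-hypothetical", "sar")
print(t.lookup("SAR"), t.lookup("sar"))
```

In this model an uppercase-only entry still matches lowercase
filenames, yet two entries that differ only by case can coexist and are
told apart by the exact-match step.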


Mutt does a case-insensitive comparison of each extension in
/etc/mime.types against the end of the filename being checked.  So
regarding case sensitivity it behaves like Python and see.

However, contrary to Python and see, Mutt is able to deal with
extensions that contain a dot in mime.types.  (The only one
listed in /etc/mime.types is "pcf.Z", and it seems to work in Python
and see only because these implementations have some hard-coded
handling of .Z, .gz, and other similar extensions, so they actually
do a lookup for "pcf", which has the same MIME type in /etc/mime.types.)
In case of multiple matches, Mutt keeps the longest one.
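
The suffix matching described for Mutt can be modeled like this in
Python.  It is a simplified sketch, not Mutt's code: the function name
is made up, and the "Z" entry with its MIME type is a hypothetical
addition whose only purpose is to exercise the longest-match rule.

```python
def mutt_like_lookup(entries, filename):
    """Return the (extension, type) pair whose extension is the longest
    case-insensitive match at the end of filename, or None."""
    name = filename.lower()
    best = None
    for ext, mime_type in entries:
        # The extension must follow a dot at the end of the name.
        if name.endswith("." + ext.lower()):
            if best is None or len(ext) > len(best[0]):
                best = (ext, mime_type)
    return best


ENTRIES = [
    ("pcf", "application/x-font-pcf"),
    ("pcf.Z", "application/x-font-pcf"),
    ("Z", "application/x-hypothetical-compress"),  # hypothetical entry
    ("SAR", "application/vnd.sar"),
]

print(mutt_like_lookup(ENTRIES, "font.pcf.Z"))
print(mutt_like_lookup(ENTRIES, "foo.sar"))
```

Matching whole suffixes rather than the text after the last dot is what
lets a multi-part extension such as "pcf.Z" be found directly.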

>> It would be nice to clarify the semantics in the comments at the top of
>> mime.types.
>
> Definitely! I hope to do so or write a proper man page after I dig into
> the history of that file.

On that topic, the comment at the top of the file:

#  Users can add their own types if they wish by creating a ".mime.types"
#  file in their home directory.  Definitions included there will take
#  precedence over those listed here.

should probably be rephrased to suggest that this is how applications
are expected to work, but that not all of them do so.  For instance,
Python and Go won't look at this file.
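
For reference, the precedence rule stated in that comment amounts to
something like the following Python sketch.  It works on lists of lines
rather than on the real files, and the function names, the sample
entries, and the "text/x-notes" type are assumptions made for the
example.

```python
def parse(lines):
    """Turn mime.types-style lines into an {extension: type} dict."""
    table = {}
    for line in lines:
        fields = line.split("#", 1)[0].split()  # assumed comment syntax
        for ext in fields[1:]:
            table[ext] = fields[0]
    return table


def merge(system_lines, user_lines):
    """Entries from the user's file override the system-wide ones."""
    table = parse(system_lines)
    table.update(parse(user_lines))  # later definitions take precedence
    return table


system = ["text/plain txt", "image/jpeg jpg"]  # stands in for /etc/mime.types
user = ["text/x-notes txt"]                    # stands in for ~/.mime.types
print(merge(system, user))
```

An application that never reads the second file simply skips the
update step, which is the situation described for Python and Go.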

-- 
Alexandre Duret-Lutz




2021-01-12 Thread Charles Plessy
On Fri, Jan 08, 2021 at 09:38:50AM +0100, Alexandre Duret-Lutz wrote:
> 
> Recent versions of media-types (1.1.0 and 2.0.0) have introduced some
> lines with extensions appearing in both lowercase and uppercase form:

Hi Alexandre, thank you for your report,

I was also wondering about case sensitivity when I worked on the 2.0.0
update.  One of my current problems is that there does not seem to be
a written specification for such details in /etc/mime.types.

> audio/AMR   amr AMR
> audio/AMR-WB    awb AWB
> audio/EVRC-QCP  qcp QCP

The IANA assignment pages for these three types list both the lower- and
the uppercase suffixes, so I decided to stick to that.

> image/t38   t38 T38

When I merged the mime.types from Fedora in Debian's, Fedora had the
lowercase one and Debian the uppercase one.  The IANA assignment
declares no extension.  I decided to keep both.  Perhaps it is not
the best decision.

> image/vnd.globalgraphics.pgb    PGB pgb

This type also declares both cases in IANA's assignment.

> Some tools will complain if some entries in mime.types have duplicate
> extensions (in some case-insensitive sense).  For instance the above
> lines are causing Bug#979232 for lighttpd.

I am glad that you could solve the bug easily on lighttpd's side.  By
the way, I am preparing an update that reverts all case-sensitive
duplicates for this release cycle, and will surely do the same for
case-insensitive ones if they cause serious bugs elsewhere.

> So the question is what is the intended semantics for the above lines?
> Is "audio/AMR amr AMR" really meant to achieve more than "audio/AMR amr"?

The problem here is that I have no comprehensive information on how
software uses the mime.types files.  I cannot rule out that some programs
use case sensitivity for their own good reasons, so if no other bugs arise,
I would like to continue to stick to the information provided by the IANA.

> Also note that mime.types lists some extensions with only uppercase
> versions, or a mix of lower and upper case letters:
> 
> application/vnd.sar SAR
> application/vnd.ves.encrypted   VES

These two I imported from Fedora and I checked that they are consistent
with IANA.

> application/x-font-pcf  pcf pcf.Z

This one has been in Debian for a long time.

> Those entries will behave differently for "see" and Python's
> "mimetypes.guess_type()".   For instance "see" will consider "foo.sar"
> as application/vnd.sar, but "mimetypes.guess_type()" will not.

I cannot tell which approach is wiser...  The mime.types file is not
comprehensive and is usually lagging.  What if there is another file
format around that uses the lowercase `sar` extension?

> It would be nice to clarify the semantics in the comments at the top of
> mime.types.

Definitely! I hope to do so or write a proper man page after I dig into
the history of that file.

Have a nice day,

Charles

-- 
Charles Plessy Nagahama, Yomitan, Okinawa, Japan
Tooting from work,   https://mastodon.technology/@charles_plessy
Tooting from home, https://framapiaf.org/@charles_plessy




2021-01-08 Thread Alexandre Duret-Lutz
Package: media-types
Version: 3.0.0
Severity: normal

Recent versions of media-types (1.1.0 and 2.0.0) have introduced some
lines with extensions appearing in both lowercase and uppercase form:

audio/AMR                       amr AMR
audio/AMR-WB                    awb AWB
audio/EVRC-QCP                  qcp QCP
image/t38                       t38 T38
image/vnd.globalgraphics.pgb    PGB pgb

However, most tools that look up the mime.types database already
lowercase the extensions before doing so.  After all, we are used to having
PICTURE.JPG recognized as image/jpeg even if JPG is not listed in
mime.types.

For instance, the Perl script "see", distributed by the mailcap package,
lowercases all extensions when mime.types is read, and lowercases any
extension before looking it up.

Python's mimetypes.guess_type() function looks up the extension
unmodified first, and then the lowercase version if the first lookup
fails.  For this implementation, it means that mime.types could force
some extensions to have some uppercase letters.
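
The lookup order described for Python can be modeled as below.  This is
a sketch of the behavior as described in this report, written without
calling the mimetypes module itself, so it should not be read as the
standard library's implementation; the table contents and the function
name are illustrative.

```python
# Keys are kept exactly as written in mime.types.
TABLE = {"SAR": "application/vnd.sar", "jpg": "image/jpeg"}


def python_like_lookup(table, ext):
    """Try the extension unmodified, then its lowercase form."""
    if ext in table:
        return table[ext]
    return table.get(ext.lower())


for ext in ("JPG", "jpg", "SAR", "sar"):
    print(ext, "=>", python_like_lookup(TABLE, ext))
```

Since the table keys are never lowercased, an entry listed only as SAR
is reachable from "foo.SAR" but not from "foo.sar".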

Some tools will complain if some entries in mime.types have duplicate
extensions (in some case-insensitive sense).  For instance the above
lines are causing Bug#979232 for lighttpd.
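
A check for such duplicates could look like the following Python
sketch.  It is not lighttpd's code; the function name and the
simplified parsing are assumptions, and the sample lines are taken from
the entries quoted above plus one ordinary entry for contrast.

```python
from collections import defaultdict


def case_insensitive_duplicates(lines):
    """Map each lowercased extension that occurs in more than one
    spelling (or under more than one type) to its occurrences."""
    seen = defaultdict(set)
    for line in lines:
        fields = line.split("#", 1)[0].split()  # assumed comment syntax
        for ext in fields[1:]:
            seen[ext.lower()].add((fields[0], ext))
    return {key: sorted(val) for key, val in seen.items() if len(val) > 1}


SAMPLE = ["audio/AMR amr AMR", "image/t38 t38 T38", "image/jpeg jpg"]
for ext, hits in case_insensitive_duplicates(SAMPLE).items():
    print(ext, hits)
```

Running a check like this over mime.types would list every entry that a
case-insensitive consumer sees as a duplicate.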

So the question is what is the intended semantics for the above lines?
Is "audio/AMR amr AMR" really meant to achieve more than "audio/AMR amr"?


Also note that mime.types lists some extensions with only uppercase
versions, or a mix of lower and upper case letters:

application/vnd.sar SAR
application/vnd.ves.encrypted   VES
application/x-font-pcf  pcf pcf.Z

Those entries will behave differently for "see" and Python's
"mimetypes.guess_type()".   For instance "see" will consider "foo.sar"
as application/vnd.sar, but "mimetypes.guess_type()" will not.

It would be nice to clarify the semantics in the comments at the top of
mime.types.
-- 
Alexandre Duret-Lutz