RE: [widgets] Potential bug in Rule for Identifying the Media Type of a File

Marcin Hanclik Fri, 16 Oct 2009 03:08:48 -0700

Hi Marcos,

These are my remarks as discussed yesterday on the call.


Comment a)

6.A.If all characters in the extension are outside the two ranges, then go to 
step 5 in this algorithm.

Should be

6.A.If any of the characters in the extension is outside the two ranges, then 
go to step 5 in this algorithm.

But this is also problematic, since in this case the algorithm would then 
loop infinitely.
So it should be:

6.A.If any of the characters in the extension is outside the two ranges, then 
go to step 7 in this algorithm.

Another comment to 6.A:
It seems that the whole algorithm assumes that the File Identification Table is 
constant.
E.g. if any vendor would like to add some extension with a character outside of 
the given ranges (or we in W3C would like to do this in the future), then we 
would need to rewrite the algorithm.

So what about this (we do not need the ranges IMHO):
6.  Attempt to case-insensitively match the value of extension to one of the 
values in the file extension column in the file identification table. If there 
is a match, then return the corresponding value from the media type column and 
terminate this algorithm.
And remove 6.A and 6.B as they were.
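
To illustrate the simplified step 6, here is a minimal Python sketch of a 
range-free, case-insensitive lookup. The table contents and the names 
FILE_IDENTIFICATION_TABLE and match_extension are my assumptions for 
illustration (only the ".jpg" row comes from this thread), and lower() is 
assumed to be an adequate notion of "case-insensitive" here.

```python
# Illustrative excerpt only: the real file identification table is defined
# in the P&C spec. Only the ".jpg" -> image/jpeg row is taken from this thread.
FILE_IDENTIFICATION_TABLE = {
    ".jpg": "image/jpeg",
    ".html": "text/html",
    ".png": "image/png",
}


def match_extension(extension, table=FILE_IDENTIFICATION_TABLE):
    """Proposed step 6: return the media type for extension, or None."""
    wanted = extension.lower()
    for known_extension, media_type in table.items():
        # Compare both sides lowercased, so no character ranges are needed.
        if known_extension.lower() == wanted:
            return media_type
    return None  # no match: the caller falls through to step 7 ([SNIFF])


print(match_extension(".JPG"))       # image/jpeg
print(match_extension(".htaccess"))  # None
```

Because the function only consults the table, a hypothetical vendor entry 
such as ".mp3" could be added to the table without touching the algorithm.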

*****************
Comment b)

4. If the first character of the name is a U+002E 'FULL STOP' character, and 
the file name contains no other U+002E 'FULL STOP' character then go to step 7 
of this algorithm.

What about ".jpg"?
Do you assume that this is a file name and not a file extension?

What about this:
4. If the first character of the name is a U+002E 'FULL STOP' character, and 
the file name contains no other U+002E 'FULL STOP' character then let extension 
be name and go to step 6 of this algorithm.

*****************
Comment c)

Given that the processing model is developed in prose, I think we MUST fix the 
ambiguity of the grammar anyway.
Thus I suggest the following change from:

file-name      = base-name [ file-extension ]
base-name      = 1*allowed-char
file-extension = "." 1*allowed-char

to:

file-name      = 1*allowed-char

(i.e. remove base-name and file-extension).
The removal of ambiguity is motivated by the dependency of the WURI/WUS spec on 
P&C in this particular detail, so it is better to keep it right, I think.
File extension does not play any role in WURI/WUS anyway.
I think either the above change or the one in my mail below has to be 
implemented in the spec.
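
To make the ambiguity concrete, the following Python sketch enumerates every 
way a name can be read as base-name [ file-extension ] under the current 
grammar, where "." is itself an allowed-char. The function name readings is 
hypothetical, and the sketch assumes that every character in the name is an 
allowed-char.

```python
def readings(name):
    """List every (base-name, file-extension) reading of name under
    file-name = base-name [ file-extension ], with "." an allowed-char."""
    # Reading with the optional file-extension omitted entirely.
    result = [(name, "")] if name else []
    for i in range(1, len(name)):
        # base-name = 1*allowed-char needs at least one character before
        # the dot; file-extension = "." 1*allowed-char needs one after it.
        if name[i] == "." and i + 1 < len(name):
            result.append((name[:i], name[i:]))
    return result


print(readings("a.b.c"))  # [('a.b.c', ''), ('a', '.b.c'), ('a.b', '.c')]
print(readings(".jpg"))   # [('.jpg', '')]
```

So "a.b.c" has three valid readings, while ".jpg" can only be read as a 
base-name without any file-extension.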

*****************
Comment d)

We need to somehow derive the extension if the grammar is modified as in 
comment c) [i.e. removal of two rules].
Therefore I suggest the change from:

3. If the first character of the name is not a U+002E 'FULL STOP' character and 
the name has a file-extension  component, let extension be value of the 
file-extension component.

To:

3. Let "extension" be an empty string. If the first character of the name is 
not a U+002E 'FULL STOP' character and the file name contains U+002E 'FULL 
STOP' character, then let extension be the sequence of characters from the last 
U+002E 'FULL STOP' (inclusive) to the end of name and go to step 6 of this 
algorithm (as proposed in comment a) [no ranges etc.]).

SUMMARY
   1. Let file be the file to be processed.

   2. Let name be the file-name string component of the zip relative path 
      that identifies the file.

   3. Let extension be an empty string. If the first character of the name 
      is not a U+002E 'FULL STOP' character and the file name contains U+002E 
      'FULL STOP' character, then let extension be the sequence of characters 
      from the last U+002E 'FULL STOP' (inclusive) to the end of name and go 
      to step 6 of this algorithm.

      For example, the extension of the file name "cat.html" would be ".html".

   4. If the first character of the name is a U+002E 'FULL STOP' character, 
      and the file name contains no other U+002E 'FULL STOP' character then 
      let extension be name and go to step 6 of this algorithm.

      REMOVE: For example, if the name is ".htaccess", jump to step 7 and 
      derive the mime type using the [SNIFF] specification.

      ADD: For example, if the name is ".jpg", jump to step 6 and match 
      image/jpeg.

   5. If the first character of the name is a U+002E 'FULL STOP' character, 
      and the file name contains another U+002E 'FULL STOP' character, then 
      let extension be the sequence of characters from the last U+002E 'FULL 
      STOP' (inclusive) to the end of name.

      For example, if the name is ".myhidden.html", then the extension would 
      be ".html".

   6. Attempt to case-insensitively match the value of extension to one of 
      the values in the file extension column in the file identification 
      table. If there is a match, then return the corresponding value from 
      the media type column and terminate this algorithm.

   7. Return the result of processing file through the [SNIFF] specification.
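
The steps above can be sketched in Python as follows. This is only a rough 
illustration under stated assumptions: the table is a made-up excerpt (only 
".jpg" -> image/jpeg comes from this mail), sniff() is a placeholder that 
stands in for the [SNIFF] specification, and the names identify_media_type, 
sniff and FILE_IDENTIFICATION_TABLE are hypothetical.

```python
# Illustrative excerpt; the real table is defined in the P&C spec.
FILE_IDENTIFICATION_TABLE = {
    ".jpg": "image/jpeg",
    ".html": "text/html",
}

SNIFFED = "sniffed"  # placeholder marker, not a real media type


def sniff(file_bytes):
    """Placeholder for step 7; a real version would implement [SNIFF]."""
    return SNIFFED


def identify_media_type(name, file_bytes=b""):
    extension = ""  # step 3: start with an empty extension
    if not name.startswith(".") and "." in name:
        # Step 3: from the last FULL STOP (inclusive) to the end of name.
        extension = name[name.rfind("."):]
    elif name.startswith(".") and name.count(".") == 1:
        # Step 4: a name like ".jpg" is treated as an extension.
        extension = name
    elif name.startswith(".") and name.count(".") > 1:
        # Step 5: a name like ".myhidden.html" yields ".html".
        extension = name[name.rfind("."):]
    # Step 6: case-insensitive match (table keys are lowercase here).
    media_type = FILE_IDENTIFICATION_TABLE.get(extension.lower())
    if media_type is not None:
        return media_type
    # Step 7: no match, so defer to content sniffing.
    return sniff(file_bytes)


print(identify_media_type("cat.html"))        # text/html
print(identify_media_type(".jpg"))            # image/jpeg
print(identify_media_type(".myhidden.html"))  # text/html
print(identify_media_type("noextension"))     # sniffed
```

Note that with this proposal ".htaccess" still ends up in step 7, because 
step 6 finds no matching row for it in the table.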

Thanks,
Marcin

Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452  |  Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: [email protected]

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of 
Marcos Caceres
Sent: Monday, October 12, 2009 10:36 PM
To: Marcin Hanclik
Cc: public-webapps
Subject: Re: [widgets] Potential bug in Rule for Identifying the Media Type of 
a File

>
>>>2. If file has a file-extension, attempt to match the file-extension
>>>to one in the file extensions column in the file identification table.
>>>If there is a match, then return the media type value. (returns
>>>"image/jpeg")
> I think file-extension would not be matched, but only base-name.
>
> I think the grammar is not ambiguous with regard to which rules would be 
> matched.
> The problem is that at present in case of .jpg, there would be no file 
> extension.
> A greedy parser would only match base-name and leave file-extension empty, 
> since it is optional.
> So we need to modify the grammar to clearly specify what the extension is.
> With the current grammar, there is also a problem that "." is also allowed in 
> the file-extension as part of the allowed-char.
> Therefore any parser may be confused which dot is the "." from the 
> file-extension rule (I am not sure whether a parser can be developed at all).
> And thus, file-extension has problems. I assume that file extensions do not 
> have dots, dot is to be the delimiter.
>
> What about modifying the ABNF to:
>
> file-name                 = file-name-with-extension / file-name-no-extension
>
> file-name-with-extension  = base-name file-extension
>
> base-name                 = *allowed-char
>
> file-extension            = "." 1*allowed-char-no-dot
>
> allowed-char-no-dot       = safe-char-no-dot / utf8-char
>
> safe-char-no-dot          = ALPHA / DIGIT / SP / "$" / "%"
>                           / "'" / "-" / "_" / "@"
>                           / "~" / "(" / ")" / "&" / "+"
>                           / "," / "=" / "[" / "]"
>
> file-name-no-extension    = base-name-no-ext
>
> base-name-no-ext          = 1*allowed-char-no-dot
>
> This would make the base-name optional.
> .jpg is a valid file name, specifically on Linux platforms.
> Then, .jpg would have (only) a file extension and probably the prose of P&C 
> would not need to be changed.
>

As part of this discussion I spent some time fine-tuning the ABNF. I
merged in all the external refs and pumped out a few thousand test
cases for analysis using abnfgen [1]. It works great on Mac OS X. I also
updated the spec to cover the following use cases [3]:

1. "noextension" > send to [SNIFF] spec.
2. "some.ext" > try to recognize extension. If fail, send to [SNIFF] spec.
3. ".something" > send to [SNIFF] spec.
4. ".something.ext" > try to recognize extension. If fail, send to [SNIFF] spec.

New ABNF:

Zip-rel-path   = [locale-folder] [*folder-name] file-name/
                        [locale-folder] 1*folder-name
locale-folder  = %x6C %x6F %x63 %x61 %x6C %x65 %x73
                        "/" language-range "/"
folder-name    = file-name "/"
file-name      = base-name [ file-extension ]
base-name      = 1*allowed-char
file-extension = "." 1*allowed-char
allowed-char   = safe-char / zip-UTF8-char
zip-UTF8-char  = UTF8-2 / UTF8-3 / UTF8-4
safe-char      = ALPHA / DIGIT / SP / "$" / "%"
                                        / "'" / "-" / "_" / "@"
                                        / "~" / "(" / ")" / "&" / "+"
                                        / "," / "=" / "[" / "]" / "."
UTF8-2         = %xC2-DF UTF8-tail
UTF8-3         = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
                 %xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4         = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
                 %xF4 %x80-8F 2( UTF8-tail )
UTF8-tail      = %x80-BF
language-range = (1*8low-alpha / "*") *("-" (1*8alphanum / "*"))
alphanum       = low-alpha  / DIGIT
low-alpha      = %x61-7A

[1] http://www.quut.com/abnfgen/
(using abnfgen path.abnf | xargs  mkdir -p )

[SNIFF]
http://tools.ietf.org/html/draft-abarth-mime-sniff-03

[3]
http://dev.w3.org/2006/waf/widgets/Overview_TSE.html#default-icons-table
--
Marcos Caceres
http://datadriven.com.au

________________________________________

Access Systems Germany GmbH
Essener Strasse 5  |  D-46047 Oberhausen
HRB 13548 Amtsgericht Duisburg
Geschaeftsfuehrer: Michel Piquemal, Tomonori Watanabe, Yusuke Kanda

www.access-company.com

