Hi Marcos,
These are my remarks as discussed yesterday on the call.
Comment a)
6.A. If all characters in the extension are outside the two ranges, then go to
step 5 in this algorithm.
Should be
6.A. If any of the characters in the extension is outside the two ranges, then
go to step 5 in this algorithm.
But this is also problematic, since jumping from step 6 back to step 5 makes the
algorithm loop forever in this case.
So it should be:
6.A. If any of the characters in the extension is outside the two ranges, then
go to step 7 in this algorithm.
Another comment on 6.A:
It seems that the whole algorithm assumes that the File Identification Table is
constant.
E.g. if any vendor wanted to add an extension with a character outside of
the given ranges (or we in W3C wanted to do this in the future), then we
would need to rewrite the algorithm.
So what about this (we do not need the ranges IMHO):
6. Attempt to case-insensitively match the value of extension to one of the
values in the file extension column in the file identification table. If there
is a match, then return the corresponding value from the media type column and
terminate this algorithm.
And remove 6.A and 6.B as they were.
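To illustrate why the ranges are not needed, here is a minimal Python sketch of the table-driven step 6. The table contents are a hypothetical excerpt, not the actual file identification table from the spec, and the function name is made up for illustration.

```python
# Hypothetical excerpt of the file identification table
# (file extension -> media type). Keys are stored in lowercase.
FILE_IDENTIFICATION_TABLE = {
    ".html": "text/html",
    ".jpg": "image/jpeg",
    ".png": "image/png",
}


def match_extension(extension):
    """Proposed step 6: case-insensitive lookup in the table.

    Returns the media type on a match, or None to signal that the
    algorithm should continue with step 7 (content sniffing).
    """
    return FILE_IDENTIFICATION_TABLE.get(extension.lower())
```

With this shape, adding a new extension (whatever characters it contains) only means adding a row to the table; the lookup itself stays the same.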
*****************
Comment b)
4. If the first character of the name is a U+002E 'FULL STOP' character, and
the file name contains no other U+002E 'FULL STOP' character then go to step 7
of this algorithm.
What about ".jpg"?
Do you assume that this is a file name and not a file extension?
What about this:
4. If the first character of the name is a U+002E 'FULL STOP' character, and
the file name contains no other U+002E 'FULL STOP' character then let extension
be name and go to step 6 of this algorithm.
*****************
Comment c)
Given that the processing model is developed in prose, I think we MUST fix the
ambiguity of the grammar anyway.
Thus I suggest the following change from:
file-name = base-name [ file-extension ]
base-name = 1*allowed-char
file-extension = "." 1*allowed-char
to:
file-name = 1*allowed-char
(i.e. remove base-name and file-extension).
The removal of ambiguity is motivated by the dependency of the WURI/WUS spec on
P&C in this particular detail, so it is better to get it right, I think.
File extension does not play any role in WURI/WUS anyway.
I think either the above change or the one in my mail below has to be
implemented in the spec.
*****************
Comment d)
We need to somehow derive the extension if the grammar is modified as in
comment c) [i.e. removal of two rules].
Therefore I suggest the change from:
3. If the first character of the name is not a U+002E 'FULL STOP' character and
the name has a file-extension component, let extension be value of the
file-extension component.
To:
3. Let "extension" be an empty string. If the first character of the name is
not a U+002E 'FULL STOP' character and the file name contains a U+002E 'FULL
STOP' character, then let extension be the sequence of characters from the last
U+002E 'FULL STOP' (inclusive) to the end of name and go to step 6 of this
algorithm (as proposed in comment a) [no ranges etc.]).
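A short Python sketch of the proposed step 3, assuming name is already a plain string; the function name is hypothetical and only meant to show the "last dot, inclusive" rule once the grammar no longer provides a file-extension component.

```python
def derive_extension(name):
    """Proposed step 3: derive the extension in prose, without the grammar.

    The extension runs from the last '.' (inclusive) to the end of name,
    but only when name does not itself start with a '.'.
    """
    extension = ""
    if not name.startswith(".") and "." in name:
        extension = name[name.rfind("."):]
    return extension
```

Names that start with a dot are deliberately left with an empty extension here, since they are handled by steps 4 and 5.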
SUMMARY
1. Let file be the file to be processed.
2. Let name be the file-name string component of the zip relative path that
identifies the file.
3. Let extension be an empty string. If the first character of the name is
not a U+002E 'FULL STOP' character and the file name contains a U+002E 'FULL
STOP' character, then let extension be the sequence of characters from the last
U+002E 'FULL STOP' (inclusive) to the end of name and go to step 6 of this
algorithm.
For example, the extension of the file name "cat.html" would be ".html".
4. If the first character of the name is a U+002E 'FULL STOP' character, and
the file name contains no other U+002E 'FULL STOP' character then let extension
be name and go to step 6 of this algorithm.
REMOVE For example, if the name is ".htaccess", jump to step 7 and derive
the mime type using the [SNIFF] specification.
ADD For example, if the name is ".jpg", jump to step 6 and match
image/jpeg.
5. If the first character of the name is a U+002E 'FULL STOP' character, and
the file name contains another U+002E 'FULL STOP' character, then let extension
be the sequence of characters from the last U+002E 'FULL STOP' (inclusive) to
the end of name.
For example, if the name is ".myhidden.html", then the extension would be
".html".
6. Attempt to case-insensitively match the value of extension to one of the
values in the file extension column in the file identification table. If there
is a match, then return the corresponding value from the media type column and
terminate this algorithm.
7. Return the result of processing file through the [SNIFF] specification.
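The summary above can be sketched in Python as follows. This is only an illustration under stated assumptions: the table is a hypothetical excerpt, sniff() is a placeholder standing in for the [SNIFF] processing, and all names are made up. The sketch merges steps 3 and 5, since both take the text from the last dot to the end of the name.

```python
# Hypothetical excerpt of the file identification table; keys are lowercase.
FILE_IDENTIFICATION_TABLE = {
    ".html": "text/html",
    ".jpg": "image/jpeg",
}


def sniff(file_bytes):
    """Placeholder for step 7; a real implementation would follow [SNIFF]."""
    return "application/octet-stream"


def identify_media_type(name, file_bytes=b""):
    extension = ""                 # step 3: start with an empty string
    last_dot = name.rfind(".")
    if last_dot > 0:
        # Steps 3 and 5: there is a dot after the first character, so the
        # extension is the text from the last dot (inclusive) to the end.
        extension = name[last_dot:]
    elif last_dot == 0:
        # Step 4: the only dot is the leading one; the whole name is the
        # extension (e.g. ".jpg").
        extension = name

    # Step 6: case-insensitive match against the table.
    media_type = FILE_IDENTIFICATION_TABLE.get(extension.lower())
    if media_type is not None:
        return media_type

    # Step 7: no extension or no match, fall back to content sniffing.
    return sniff(file_bytes)
```

Note that with this proposal ".htaccess" still ends up in step 7, simply because ".htaccess" has no row in the table.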
Thanks,
Marcin
Marcin Hanclik
ACCESS Systems Germany GmbH
Tel: +49-208-8290-6452 | Fax: +49-208-8290-6465
Mobile: +49-163-8290-646
E-Mail: [email protected]
-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of
Marcos Caceres
Sent: Monday, October 12, 2009 10:36 PM
To: Marcin Hanclik
Cc: public-webapps
Subject: Re: [widgets] Potential bug in Rule for Identifying the Media Type of
a File
>
>>>2. If file has a file-extension, attempt to match the file-extension
>>>to one in the file extensions column in the file identification table.
>>>If there is a match, then return the media type value. (returns
>>>"image/jpeg")
> I think file-extension would not be matched, but only base-name.
>
> I think the grammar is not ambiguous with regard to which rules would be
> matched.
> The problem is that at present in case of .jpg, there would be no file
> extension.
> A greedy parser would only match base-name and leave file-extension empty,
> since it is optional.
> So we need to modify the grammar to clearly specify what the extension is.
> With the current grammar, there is also a problem that "." is also allowed in
> the file-extension as part of the allowed-char.
> Therefore any parser may be confused which dot is the "." from the
> file-extension rule (I am not sure whether a parser can be developed at all).
> And thus, file-extension has problems. I assume that file extensions do not
> have dots, dot is to be the delimiter.
>
> What about modifying the ABNF to:
>
> file-name = file-name-with-extension / file-name-no-extension
>
> file-name-with-extension = base-name file-extension
>
> base-name = *allowed-char
>
> file-extension = "." 1*allowed-char-no-dot
>
> allowed-char-no-dot = safe-char-no-dot / utf8-char
>
> safe-char-no-dot = ALPHA / DIGIT / SP / "$" / "%"
> / "'" / "-" / "_" / "@"
> / "~" / "(" / ")" / "&" / "+"
> / "," / "=" / "[" / "]"
>
> file-name-no-extension = base-name-no-ext
>
> base-name-no-ext = 1*allowed-char-no-dot
>
> This would make the base-name optional.
> .jpg is a valid file name, specifically on Linux platforms.
> Then, .jpg would have (only) a file extension and probably the prose of P&C
> would not need to be changed.
>
As part of this discussion I spent some time fine-tuning the ABNF. I
merged in all the external refs and pumped out a few thousand test
cases for analysis using abnfgen [1]. Works great on Mac OS X. I also
updated the spec to cover the following use cases [3]:
1. "noextension" > send to [SNIFF] spec.
2. "some.ext" > try to recognize extension. If fail, send to [SNIFF] spec.
3. ".something" > send to [SNIFF] spec.
4. ".something.ext" > try to recognize extension. If fail, send to [SNIFF] spec.
New ABNF:
Zip-rel-path = [locale-folder] [*folder-name] file-name /
[locale-folder] 1*folder-name
locale-folder = %x6C %x6F %x63 %x61 %x6C %x65 %x73
"/" language-range "/"
folder-name = file-name "/"
file-name = base-name [ file-extension ]
base-name = 1*allowed-char
file-extension = "." 1*allowed-char
allowed-char = safe-char / zip-UTF8-char
zip-UTF8-char = UTF8-2 / UTF8-3 / UTF8-4
safe-char = ALPHA / DIGIT / SP / "$" / "%"
/ "'" / "-" / "_" / "@"
/ "~" / "(" / ")" / "&" / "+"
/ "," / "=" / "[" / "]" / "."
UTF8-2 = %xC2-DF UTF8-tail
UTF8-3 = %xE0 %xA0-BF UTF8-tail / %xE1-EC 2( UTF8-tail ) /
%xED %x80-9F UTF8-tail / %xEE-EF 2( UTF8-tail )
UTF8-4 = %xF0 %x90-BF 2( UTF8-tail ) / %xF1-F3 3( UTF8-tail ) /
%xF4 %x80-8F 2( UTF8-tail )
UTF8-tail = %x80-BF
language-range = (1*8low-alpha / "*") *("-" (1*8alphanum / "*"))
alphanum = low-alpha / DIGIT
low-alpha = %x61-7A
[1] http://www.quut.com/abnfgen/
(using abnfgen path.abnf | xargs mkdir -p )
[SNIFF]
http://tools.ietf.org/html/draft-abarth-mime-sniff-03
[3]
http://dev.w3.org/2006/waf/widgets/Overview_TSE.html#default-icons-table
--
Marcos Caceres
http://datadriven.com.au