Re: [xwiki-devs] Simple patch to enable/preserve underscore chars in attachment file names

Sergiu Dumitriu Fri, 23 Apr 2010 19:30:53 -0700

On 04/23/2010 09:44 PM, Caleb James DeLisle wrote:
>
>
> Sergiu Dumitriu wrote:
>> On 04/23/2010 08:50 AM, Denis Gervalle wrote:
>>> On Fri, Apr 23, 2010 at 03:32, Sergiu Dumitriu<[email protected]>   wrote:
>>>
>>>> On 04/06/2010 05:03 PM, Vincent Massol wrote:
>>>>> Hi Milind,
>>>>>
>>>>> On Apr 6, 2010, at 5:00 PM, Milind Kamble wrote:
>>>>>
>>>>>> Denis,
>>>>>>      I understand your point that XE being used globally, needs to support
>>>>>> more than Ascii char set.
>>>>>> While the new reference model matures, could you clarify if underscore
>>>>>> in a file name would break the functionality under the current model where
>>>>>> attachment name is used as a reference for attachments? If not, would it be
>>>>>> possible to eliminate the stripping of just the underscore chars and push
>>>>>> that fix in the next XE release -- I am OK with space chars getting stripped
>>>>>> off.
>>>>> I don't think that underscores are a problem even with the old "reference
>>>>> as string" code. Actually I don't even know why we're stripping them. Sergiu
>>>>> might know more. Any idea Sergiu?
>>>>
>>>> This is the issue that started it: XWIKI-2087
>>>>
>>>> So, there were three main problems:
>>>>
>>>> 1. Impossible to actually restore the attachment from the database since
>>>> the ID was generated using the hash of the original, correct name, yet
>>>> it was stored using the broken name, with ? instead of non-latin1
>>>> characters
>>>> 2. Impossible to link to such an attachment, since a non-UTF wiki would
>>>> encode non-ASCII chars to their &#xyz; escapes, and the filename wasn't
>>>> decoded when trying to get the attachment from the database
>>>> 3. Encoding bug in the old WYSIWYG which composed the URL using a wrong
>>>> encoding
>>>>
>>>> 3 should be fixed since we're forcing UTF-8 in URLs.
>>>> 2 and 1 should work if the wiki+database are using UTF8, but they might
>>>> still fail in latin1.
>>>
>>> Should we really support non-UTF-8 configuration ? We have already lost so
>>> much time with these encoding issues, and I really do not understand the
>>> advantage of supporting non-UTF8 environment ?
>>>
>>
>> Legacy. Maybe if we can provide a nice and quick guide for transforming
>> a latinX installation into a UTF-8 one, we'd be allowed to require UTF-8.
>> We could announce that from 2.5 onwards UTF-8 will be mandatory, if we
>> decide to go this way. Maybe the most important latin1 installation is
>> xwiki.org itself.
>>
>> The most problematic thing is that by default mysql databases come as
>> latin1 (in most distributions, although my Gentoo makes it utf8), and
>> this is one of the most frequent sources of encoding problem reports.
>
> Am I correct in saying that mysql with utf8 is unable to handle some
> characters and so pages can't be saved? My understanding is using latin1
> is a common workaround so that mysql doesn't know that it is handling the
> characters. Forcing utf8 might lead to some unhappy users who suddenly find
> not only their database must be changed but some of the characters used in
> their language are no longer allowed.

No, utf8 should allow all characters, since UTF-8 allows complete 
representation of all Unicode characters, which cover all existing 
non-Unicode charsets.

The workaround that you talk about is actually a different problem, and 
a common bad practice: storing UTF-8 (or some other fixed encoding) bytes 
in a latin1 column, by decomposing strings into bytes and recomposing them. 
This is bad because it breaks sorting and upper/lowercase 
transformations. It's also bad because it involves another byte 
split/parse operation, since the data has already been transformed once 
from fake "ISO-8859-1" bytes into a String. Its one advantage is that it 
doesn't care about the database encoding: it's independent, and works 
with (almost) all database encodings transparently.
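
To make the trick described above concrete, here is a minimal sketch in 
Python (not XWiki code; the helper names are made up for illustration). It 
assumes the application hides UTF-8 bytes inside what the database believes 
is latin1 text, and shows that the round trip works while character counts 
and case conversion done on the stored form go wrong.

```python
# Sketch only: mimics an application that hides UTF-8 bytes in a latin1 column.
# Function names are hypothetical, chosen for this illustration.

def to_fake_latin1(text: str) -> str:
    """What the database believes it holds: UTF-8 bytes read as latin1."""
    return text.encode("utf-8").decode("latin-1")

def from_fake_latin1(stored: str) -> str:
    """The application undoes the trick when it reads the value back."""
    return stored.encode("latin-1").decode("utf-8")

word = "\u00e9"                 # "é", a single character
stored = to_fake_latin1(word)   # two latin1 characters from the database's view

# The database counts two characters where the application means one.
print(len(word), len(stored))

# The round trip itself works, which is why the trick is tempting.
print(from_fake_latin1(stored) == word)

# Uppercasing the stored form does not produce the uppercase of the real text.
print(from_fake_latin1(stored.upper()) == word.upper())
```

The last comparison is false: the stored pair is left untouched by 
upper(), so the text read back is still the lowercase letter.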

What actually breaks when switching from latin1 to utf8, in this 
scenario, is that UTF-8 has some intrinsic data validation, meaning that 
certain bytes and certain byte sequences are not valid UTF-8, so 
trying to push arbitrary bytes can sometimes fail. 
Normally, storing UTF-8 bytes in a utf8 column should work, so it fails 
only when storing bytes of another encoding in a utf8 column.
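
The validation mentioned above can be seen with a small Python sketch (the 
helper name is made up; Python's strict UTF-8 decoder stands in for whatever 
check the database performs, which is an assumption):

```python
# Sketch: UTF-8 rejects byte sequences that latin1 would happily accept.

def is_valid_utf8(raw: bytes) -> bool:
    """Return True if the bytes form a well-formed UTF-8 string."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

text = "caf\u00e9"                    # "café"
as_utf8 = text.encode("utf-8")        # ends with the two bytes C3 A9
as_latin1 = text.encode("latin-1")    # ends with the single byte E9

print(is_valid_utf8(as_utf8))     # the real UTF-8 bytes pass
print(is_valid_utf8(as_latin1))   # the latin1 bytes are rejected
```

The latin1 bytes fail because E9 announces a three-byte sequence that never 
arrives.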

Fortunately, we don't use this technique, so we're not affected by this 
problem.

Now, contrary to what you say, UTF-8 has been the default 
encoding of the wiki for some time, and it fails if the database (mysql) 
is NOT in utf8. We actually require the database to be in utf8, since 
otherwise data will be lost after it gets out of the cache.
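
A rough illustration of that data loss, as a Python sketch. The "?" 
substitution is an assumption that mimics a lossy conversion to latin1, the 
helper name is made up, and the file name is an invented example:

```python
# Sketch: a value that is fine in memory comes back damaged from latin1 storage.

def store_as_latin1(text: str) -> str:
    """Simulate writing to a latin1 column and reading the value back."""
    raw = text.encode("latin-1", errors="replace")  # unmappable chars become "?"
    return raw.decode("latin-1")

in_cache = "\u0218erban_r\u00e9sum\u00e9.pdf"   # starts with S-comma-below (U+0218)
from_db = store_as_latin1(in_cache)

print(from_db)               # the first letter has been replaced by "?"
print(from_db == in_cache)   # the stored copy no longer matches the cached one
```

Note that the accented "é" survives, since it exists in latin1; only 
characters outside latin1 are lost.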

Why do some people still prefer latin1 in a world that is moving more and 
more towards UTF-8? Well, there are a few disadvantages to utf8, when 
used inside mysql:
- Data is bigger, since latin1 uses exactly 1 byte for each character, 
while UTF-8 uses 2 for the accented characters of most European 
languages, and even 3 for Asian and other exotic scripts. I'm speaking 
here only about the storage space needed.
- Not only is the storage bigger, but the algorithms are a bit more 
complex/time consuming: counting how many characters are in a latin1 
string is simple, just see how many bytes are in there. In UTF-8 it's more 
complex, since 1, 2 or 3 bytes can form one character, and the rules 
require full examination of each byte. Thus, length(latin1) is O(1), 
length(utf8) is O(n). Most other string functions are also affected by 
this complexity problem.
- Moreover, indexes are limited to 1024 (or was it 2048?) BYTES in 
length, and MySQL assumes the worst case scenario when computing how 
many bytes a column takes up. So, while it's possible to use 4 small 
columns (255 chars) combined in an index, if utf8 is used instead, 
4*(255*3 bytes per char in the worst case scenario)>1024, thus using 
utf8 in tables limits the size of indexes.
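
The three points above can be checked with a short Python sketch. The 
helper names are made up, and the 1024-byte limit is simply the figure 
quoted above, not a verified MySQL constant:

```python
# Sketch: byte sizes, character counting and worst-case index size.

def utf8_char_count(raw: bytes) -> int:
    """Count characters by scanning every byte and skipping continuation bytes."""
    return sum(1 for byte in raw if (byte & 0xC0) != 0x80)

def worst_case_index_bytes(columns: int, chars: int, bytes_per_char: int) -> int:
    """Bytes reserved for an index when every character takes the maximum size."""
    return columns * chars * bytes_per_char

# 1. Storage: bytes needed per character in UTF-8.
for ch in ("a", "\u00e9", "\u4e2d"):      # ASCII letter, accented letter, CJK
    print(repr(ch), len(ch.encode("utf-8")))

# 2. Length: latin1 length is the byte count, UTF-8 needs a full scan.
raw = "h\u00e9llo".encode("utf-8")
print(len(raw), utf8_char_count(raw))

# 3. Indexes: four 255-char columns against the limit quoted above.
INDEX_LIMIT = 1024   # assumed value, taken from the text
print(worst_case_index_bytes(4, 255, 1) <= INDEX_LIMIT)   # latin1 fits
print(worst_case_index_bytes(4, 255, 3) <= INDEX_LIMIT)   # utf8 does not
```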

There might be other major disadvantages, but these are the most 
important that I know of.

The big advantage of utf8, when compared to all latinX charsets, is that 
it can store many more characters. Each latinX charset can store only 
256 possible characters (including all the rarely used control chars). 
And frankly, the entire web is moving towards UTF-8.

These disadvantages are not inherent to UTF-8 applications in general; 
they come from a very lousy design/implementation in mysql, which is one 
of the main reasons why I don't like mysql. I hope that they will realize 
that they did it all wrong and fix it at some point.

> A thought.
>
> Caleb
>
>>
>>>>> Thanks
>>>>> -Vincent
>>>>>
>>>>>> ________________________________
>>>>>> From: Denis Gervalle<[email protected]>
>>>>>> To: XWiki Developers<[email protected]>
>>>>>> Sent: Tue, April 6, 2010 8:30:34 AM
>>>>>> Subject: Re: [xwiki-devs] Simple patch to enable/preserve underscore chars in attachment file names
>>>>>> On Tue, Apr 6, 2010 at 14:02, Guillaume Lerouge<[email protected]> wrote:
>>>>>>> Hi Milind,
>>>>>>>
>>>>>>> On Tue, Apr 6, 2010 at 1:23 AM, Milind Kamble<[email protected]> wrote:
>>>>>>>> Hi. I would like the dev community to evaluate this simple fix that will
>>>>>>>> enable uploading of files with underscore chars in the file name when users
>>>>>>>> perform the attach action. Our user community is quite impressed about the
>>>>>>>> refreshing ease of use and the power, flexibility in their collaboration
>>>>>>>> work flow made possible by XE. They would like to escape the tyranny of
>>>>>>>> Microsoft-MOSS as early as possible and the main roadblock to do so is the
>>>>>>>> stripping of space and underscores from file names which were created in a
>>>>>>>> MS-Office centric environment.
>>>>>>>>
>>>>>>> I can't do much about your underscore problem (though I promise I'll poke
>>>>>>> the developer sitting right next to me so that he looks at it).
>>>>>>>
>>>>>> I was already aware of this issue, and I have had similar problems with
>>>>>> attachments, not only with "_", but also with accented chars etc...
>>>>>> Restrictions on attachment names will be easier to change when the new
>>>>>> model using references is fully in place, since attachment names
>>>>>> are currently used as references for attachments. Be sure I will take care to
>>>>>> have it improved.

-- 
Sergiu Dumitriu
http://purl.org/net/sergiu/
_______________________________________________
devs mailing list
[email protected]
http://lists.xwiki.org/mailman/listinfo/devs
