Re: Suggestion: Better slugifying of scandinavic characters

Gábor Farkas Tue, 16 May 2006 00:36:54 -0700

Malcolm Tredinnick wrote:
> On Tue, 2006-05-16 at 09:00 +0200, Gábor Farkas wrote:
>> Jeroen Ruigrok van der Werven wrote:
>>> On 5/16/06, Ville Säävuori <[EMAIL PROTECTED]> wrote:
>>>> I think that this problem applies in most european languages, too.
>>>> Like, say, Swedish, German and French.
>>> The same applies for Dutch where we use trema's (sort of umlauts) to
>>> denote any possible ambiguity in reading. So having the accent
>>> stripped would be way better than having the entire letter stripped.
>>> The same applies of course to say Spanish with the tilde-n, or even
>>> some slavic languages or Romanian.
>>>
>> also in Hungarian and Slovak the preferred way is to just strip the accents.
>>
>> maybe the best way would be to make this locale-dependent...
> 
> At the risk of offending everybody who uses a language requiring
> accents, but this is one of those "it's harder than it looks" problems in
> Unicode. You need to have a mapping from every accented character (or a
> reasonable set of them) to their unadorned equivalents. Many characters
> are a single unicode character, not a unicode composition of two
> characters, so it's not just a matter of "stripping the accent". So
> either we're going to end up carrying around a fairly large mapping
> table in the Javascript or we need a better solution.


i agree that this problem is quite hard to solve "for everyone".

but this actual problem (stripping accents) is not that hard.

in unicode, there are several "normal forms" defined.

for example, take the character [a] with an accent ['] : [á].
in unicode it can be represented either as one character, or as two 
characters: the [a] symbol and the accent['] symbol.

but there is a normal form (NFD, canonical decomposition), where every 
such character is in its decomposed (two-character) form.
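
As an illustration of the two representations, here is a minimal Python sketch using the standard unicodedata module. The variable names are only for illustration; the point is that the one-character and two-character forms of [á] are different strings until one of them is normalized.

```python
import unicodedata

composed = "\u00e1"      # one code point: LATIN SMALL LETTER A WITH ACUTE
decomposed = "a\u0301"   # two code points: "a" + COMBINING ACUTE ACCENT

# Same visible character, but different code point sequences.
same_before = (composed == decomposed)

# NFD decomposes the one-character form into base letter + accent;
# NFC composes the two-character form back into a single character.
nfd = unicodedata.normalize("NFD", composed)
nfc = unicodedata.normalize("NFC", decomposed)

print(same_before)                         # False
print(len(composed), len(nfd))             # 1 2
print(nfd == decomposed, nfc == composed)  # True True
```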

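so to strip accents from a text, you simply do (python, using the 
standard unicodedata module):

import unicodedata

def strip_accents(text):
    # NFD splits each accented character into base letter + combining marks
    text = unicodedata.normalize('NFD', text)
    # combining() is non-zero for combining marks such as accents
    chars = [c for c in text if not unicodedata.combining(c)]
    return ''.join(chars)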

of course, i have no idea what javascript's unicode capabilities are :-)

just wanted to demonstrate that this actual problem is not that hard.
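
One caveat that may be worth checking before relying on decomposition for slugs: some letters that look accented have no canonical decomposition in unicode, so the approach leaves them untouched. The sketch below is only a quick probe; the helper name has_canonical_decomposition and the sample characters are chosen here for illustration.

```python
import unicodedata

def has_canonical_decomposition(ch):
    # A character decomposes if NFD turns it into something different.
    return unicodedata.normalize("NFD", ch) != ch

# Sample letters: a-acute, a-diaeresis, a-ring, n-tilde, o-slash, ae, sharp s
samples = "\u00e1\u00e4\u00e5\u00f1\u00f8\u00e6\u00df"

decomposable = [ch for ch in samples if has_canonical_decomposition(ch)]
left_alone = [ch for ch in samples if not has_canonical_decomposition(ch)]

print(decomposable)  # the first four: base letter + combining mark exists
print(left_alone)    # o-slash, ae and sharp s stay as they are
```

So letters such as the o-slash and ae would probably still need a small explicit mapping, but a much smaller one than a table covering every accented character.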


> 
> To put the problem into context: it's only a small generalisation to
> attempt to do the same thing for mapping Japanese characters to
> ASCII-based URLs.

:) yes, that problem is nearly impossible to solve :-)

gabor

