On Tue, 2006-05-16 at 09:36 +0200, Gábor Farkas wrote:
> Malcolm Tredinnick wrote:
> > On Tue, 2006-05-16 at 09:00 +0200, Gábor Farkas wrote:
> >> Jeroen Ruigrok van der Werven wrote:
> >>> On 5/16/06, Ville Säävuori <[EMAIL PROTECTED]> wrote:
> >>>> I think that this problem applies in most european languages, too.
> >>>> Like, say, Swedish, German and French.
> >>> The same applies for Dutch, where we use tremas (a sort of umlaut) to
> >>> denote any possible ambiguity in reading. So having the accent
> >>> stripped would be way better than having the entire letter stripped.
> >>> The same applies, of course, to Spanish with the tilde-n, or even
> >>> some Slavic languages or Romanian.
> >>>
> >> also in Hungarian and Slovak the preferred way is to just strip the
> >> accents.
> >>
> >> maybe the best way would be to make this locale-dependent...
> >
> > At the risk of offending everybody who uses a language requiring
> > accents, this is one of those "it's harder than it looks" problems in
> > Unicode. You need a mapping from every accented character (or a
> > reasonable set of them) to its unadorned equivalent. Many characters
> > are a single Unicode character, not a Unicode composition of two
> > characters, so it's not just a matter of "stripping the accent". So
> > either we end up carrying around a fairly large mapping table in the
> > JavaScript or we need a better solution.
>
> i agree that this problem is quite hard to solve "for everyone".
>
> but this actual problem (stripping accents) is not that hard.
>
> in unicode, there are several "normal forms" defined.
>
> for example, take the character [a] with an accent ['] : [á].
> in unicode it can be represented either as one character, or as two
> characters: the [a] symbol and the accent ['] symbol.
>
> but there is a normal form where every character is in its decomposed
> (two-character) form.
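[For reference, the decomposed-form approach Gábor describes can be sketched in Python using the standard-library unicodedata module; Django's actual slug code runs in JavaScript, so this is only an illustration of the technique, not a proposed patch:]

```python
import unicodedata

def strip_accents(text):
    # NFD splits each precomposed character into a base character
    # plus combining marks, e.g. "á" (U+00E1) -> "a" + U+0301.
    decomposed = unicodedata.normalize("NFD", text)
    # Combining marks have Unicode category "Mn" (Mark, nonspacing);
    # dropping them leaves only the unadorned base characters.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

print(strip_accents("árvíztűrő"))  # -> "arvizturo" (Hungarian accents stripped)
```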
Yes, but think about how you move to the normal forms. It requires lookup
tables, and they are pretty long. In Unicode 3, the normalisation table was
about 2500 lines long (those are the only figures I have sitting around at
the moment). Sure, you can compress it a bit, but it's still a lot of data
and computation.

Now, we could try working with a small subset, which may be the most
practical solution. But we will continually be receiving mail from people
with ø or Ô or Å or some other not-so-common-outside-of-country-X character
in their title saying that we missed their favourite character. It's
possible, just fiddly.

I'm not arguing for punting the problem: I have some Norwegian and German
friends I'd like to still have talking to me. I'm just pointing out that a
bunch of "that sounds good" mail isn't solving it.

Malcolm

--~--~---------~--~----~------------~-------~--~----~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to [email protected]
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/django-developers
-~----------~----~----~----~------~----~------~--~---
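[The characters Malcolm lists illustrate the snag: Å and Ô have canonical decompositions, but ø does not, so a purely normalization-based stripper still needs a hand-maintained extra table. A Python sketch, where NO_DECOMPOSITION is a deliberately tiny hypothetical subset of the mapping a real implementation would need:]

```python
import unicodedata

# Hypothetical extra table for characters NFD cannot decompose;
# a real one would need many more entries (đ, ł, ß, æ, ...).
NO_DECOMPOSITION = {"ø": "o", "Ø": "O"}

def strip_accents(text):
    # First drop whatever NFD can decompose into base + combining mark...
    decomposed = unicodedata.normalize("NFD", text)
    base = "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")
    # ...then fall back to the explicit table for the rest.
    return "".join(NO_DECOMPOSITION.get(ch, ch) for ch in base)

print(strip_accents("Åø"))  # -> "Ao": Å handled by NFD, ø only by the table
```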
