Re: Re: urlify.js blocks out non-English chars - 2nd try?
2006/7/20, Gábor Farkas <[EMAIL PROTECTED]>:
> Jeroen Ruigrok van der Werven wrote:
> > On 7/16/06, gabor <[EMAIL PROTECTED]> wrote:
> >> i think we do not need to discuss japanese at all. after all, there's no
> >> transliteration for kanji. so it's imho pointless to argue about
> >> kana-transliteration, when you cannot transliterate kanji.
> >
> > If you mean that you cannot easily deduce whether the kanji for moon 月
> > should be transliterated according to the reading 'tsuki' or 'getsu',
> > then yes, you are correct. But you *can* transliterate them according
> > to their on or kun reading.
>
> yes, you are correct on that.
> but on the other hand, what's the meaning in doing a plain on/kun
> reading-based transliteration? :-)
>
> and also, some kanjis have a lot of on/kun readings... which one will
> you use?
>
> at least for me it seems that a transliteration scheme should at least
> keep the words readable. now take a japanese word with 2 kanjis. how
> would you propose to transliterate it to still keep the meaning?

We cannot apply on or kun readings to kanji correctly in an automatic
way; there is no exact rule. And I don't think a slug is just for
humans. It's for computers too. Search engines or other technologies
may understand IDNA/Punycode (thanks, Antonio!). Google can understand
IDNA already. Japanese kanji should be translated into Punycode.

If a slug must keep its meaning for humans, you don't need to care
about Japanese: that's impossible for Japanese.

--~--~-~--~~~---~--~~
You received this message because you are subscribed to the Google Groups "Django developers" group.
To post to this group, send email to django-developers@googlegroups.com
To unsubscribe from this group, send email to [EMAIL PROTECTED]
For more options, visit this group at http://groups.google.com/group/django-developers
-~--~~~~--~~--~--~---
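The Punycode suggestion above can be sketched with the Python standard library's "punycode" codec (a hedged illustration in modern Python 3; the sample word and variable names are mine, not from the original post):

```python
# Illustrative only: round-tripping a Japanese word through the stdlib
# "punycode" codec. The sample word is a made-up example.
word = "月曜日"  # "Monday"

# encode to a pure-ASCII, reversible form
puny = word.encode("punycode").decode("ascii")

# decoding restores the original text exactly; no character is lost
restored = puny.encode("ascii").decode("punycode")
assert restored == word
```

The resulting slug is unreadable to humans, which is exactly the trade-off debated in this thread, but it is lossless and machine-reversible.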
Re: urlify.js blocks out non-English chars - 2nd try?
Jeroen Ruigrok van der Werven wrote:
> On 7/16/06, gabor <[EMAIL PROTECTED]> wrote:
>> i think we do not need to discuss japanese at all. after all, there's no
>> transliteration for kanji. so it's imho pointless to argue about
>> kana-transliteration, when you cannot transliterate kanji.
>
> If you mean that you cannot easily deduce whether the kanji for moon 月
> should be transliterated according to the reading 'tsuki' or 'getsu',
> then yes, you are correct. But you *can* transliterate them according
> to their on or kun reading.

yes, you are correct on that.
but on the other hand, what's the meaning in doing a plain on/kun
reading-based transliteration? :-)

and also, some kanjis have a lot of on/kun readings... which one will
you use?

at least for me it seems that a transliteration scheme should at least
keep the words readable. now take a japanese word with 2 kanjis. how
would you propose to transliterate it to still keep the meaning?

gabor
Re: urlify.js blocks out non-English chars - 2nd try?
On 7/16/06, gabor <[EMAIL PROTECTED]> wrote:
> i think we do not need to discuss japanese at all. after all, there's no
> transliteration for kanji. so it's imho pointless to argue about
> kana-transliteration, when you cannot transliterate kanji.

If you mean that you cannot easily deduce whether the kanji for moon 月
should be transliterated according to the reading 'tsuki' or 'getsu',
then yes, you are correct. But you *can* transliterate them according
to their on or kun reading.

--
Jeroen Ruigrok van der Werven
Re: urlify.js blocks out non-English chars - 2nd try?
Gábor Farkas wrote:
> i somehow have the feeling that we lost the original idea here a little.
>
> (as far as i understand, by urlify.js we are talking about slug
> auto-generation, please correct me if i'm wrong).
>
> we are auto-generating slugs when it "makes sense". for example, for
> english it makes sense to remove all the non-word stuff, because what
> remains can still be read, be understood, and generally looks fine when
> being a part of the URL.
>
> also, for many languages (hungarian or slavic ones), it also "makes
> sense" to simply drop all the diacritical marks, because the rest can
> still be read, be understood, and looks fine as part of an URL.
>
> but with punycode or whatever-code encoding japanese, what's the point?
> what you get will be completely unreadable. if you only need to
> preserve the submitted data, you don't need to do anything. simply take
> your unicode text, encode it to utf8, url-escape it and use it as a part
> of the url. it will be ok. and on the other side you can url-unescape
> and utf8-decode it and you're back. you will even be able to have ascii
> stuff readably-preserved.

I agree; this has gone *way* past the original idea. Transcription of
characters onto ascii (aka "slugging") is not the same problem as
passing around encoded unicode IRI segments between clients and
servers. There's a standards-track IETF document for the latter
purpose, RFC 3987*. If you want to do this, do it to spec.

I think the js mapping approach is good enough for the admin interface.
Once I can get the greek table to be picked up (argh), a patch will
land...

cheers
Bill

* http://www.ietf.org/rfc/rfc3987.txt
Re: urlify.js blocks out non-English chars - 2nd try?
Antonio Cavedoni wrote:
> On 17 Jul 2006, at 8:25, tsuyuki makoto wrote:
>> We Japanese know that we can't transliterate Japanese to ASCII.
>> So I want to do it as follows at least.
>> No letter disappears, and the original can be restored.
>> #FileField and ImageField have the same disappearing-letters problem.
>>
>> def slug_ja(word):
>>     try:
>>         unicode(word, 'ASCII')
>>         import re
>>         slug = re.sub('[^\w\s-]', '', word).strip().lower()
>>         slug = re.sub('[-\s]+', '-', slug)
>>         return slug
>>     except UnicodeDecodeError:
>>         from encodings import idna
>>         painful_slug = word.strip().lower().decode('utf-8').encode('IDNA')
>>         return painful_slug
>
> I'm not convinced by this approach, but I would suggest using the
> "punycode" instead of the "idna" encoder anyway. The results don't
> include the initial "xn--" marks, which are only useful in a domain
> name, not in a URI path. Also, the "from encodings [...]" line appears
> to be unnecessary on my Python 2.3.5 and 2.4.1 on OSX.
>
> [[[
> >>> p = u"perché"
> >>> from encodings import idna
> >>> p.encode('idna')
> 'xn--perch-fsa'
> >>> p.encode('punycode')
> 'perch-fsa'
> >>> puny = 'perch-fsa'
> >>> puny.decode('punycode')
> u'perch\xe9'
> >>> print puny.decode('punycode')
> perché
> >>> pu = puny.decode('punycode')  # it's reversible
> >>> print pu
> perché
> ]]]
>
> More on Punycode: http://en.wikipedia.org/wiki/Punycode

i somehow have the feeling that we lost the original idea here a little.

(as far as i understand, by urlify.js we are talking about slug
auto-generation, please correct me if i'm wrong).

we are auto-generating slugs when it "makes sense". for example, for
english it makes sense to remove all the non-word stuff, because what
remains can still be read, be understood, and generally looks fine when
being a part of the URL.

also, for many languages (hungarian or slavic ones), it also "makes
sense" to simply drop all the diacritical marks, because the rest can
still be read, be understood, and looks fine as part of an URL.

but with punycode or whatever-code encoding japanese, what's the point?
what you get will be completely unreadable. if you only need to
preserve the submitted data, you don't need to do anything. simply take
your unicode text, encode it to utf8, url-escape it and use it as a part
of the url. it will be ok. and on the other side you can url-unescape
and utf8-decode it and you're back. you will even be able to have ascii
stuff readably-preserved.

from my point of view, with the current slug-approach, you either can
convert your text into ascii that "makes sense" or not. if the former,
then enhancing urlify.js makes sense. if the latter, then it makes no
sense. imho.

gabor
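The utf8-plus-url-escape round trip gabor describes can be sketched with the standard library (a minimal Python 3 illustration; the sample title is a made-up example, not from the thread):

```python
from urllib.parse import quote, unquote

# Hypothetical sample title, not taken from the thread.
title = "東京の天気"

# encode to UTF-8 and percent-escape for use in a URL path segment
escaped = quote(title)

# on the other side: unescape and UTF-8-decode to get the text back
assert unquote(escaped) == title
```

Nothing is lost in either direction, and ASCII characters in the input pass through readably, just as gabor notes.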
Re: urlify.js blocks out non-English chars - 2nd try?
On 17 Jul 2006, at 8:25, tsuyuki makoto wrote:
> We Japanese know that we can't transliterate Japanese to ASCII.
> So I want to do it as follows at least.
> No letter disappears, and the original can be restored.
> #FileField and ImageField have the same disappearing-letters problem.
>
> def slug_ja(word):
>     try:
>         unicode(word, 'ASCII')
>         import re
>         slug = re.sub('[^\w\s-]', '', word).strip().lower()
>         slug = re.sub('[-\s]+', '-', slug)
>         return slug
>     except UnicodeDecodeError:
>         from encodings import idna
>         painful_slug = word.strip().lower().decode('utf-8').encode('IDNA')
>         return painful_slug

I'm not convinced by this approach, but I would suggest using the
"punycode" instead of the "idna" encoder anyway. The results don't
include the initial "xn--" marks, which are only useful in a domain
name, not in a URI path. Also, the "from encodings [...]" line appears
to be unnecessary on my Python 2.3.5 and 2.4.1 on OSX.

[[[
>>> p = u"perché"
>>> from encodings import idna
>>> p.encode('idna')
'xn--perch-fsa'
>>> p.encode('punycode')
'perch-fsa'
>>> puny = 'perch-fsa'
>>> puny.decode('punycode')
u'perch\xe9'
>>> print puny.decode('punycode')
perché
>>> pu = puny.decode('punycode')  # it's reversible
>>> print pu
perché
]]]

More on Punycode: http://en.wikipedia.org/wiki/Punycode

Cheers.
--
Antonio
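For what it's worth, the same experiment still holds in Python 3, where both codecs map between str and bytes (a hedged sketch; the expected values are taken from the session above):

```python
p = "perché"

# the idna codec keeps the "xn--" domain-name prefix; punycode does not
assert p.encode("idna") == b"xn--perch-fsa"
assert p.encode("punycode") == b"perch-fsa"

# and the punycode form is reversible
assert b"perch-fsa".decode("punycode") == "perché"
```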
Re: Re: urlify.js blocks out non-English chars - 2nd try?
2006/7/17, gabor <[EMAIL PROTECTED]>:
> Jeroen Ruigrok van der Werven wrote:
> > On 7/12/06, Julian 'Julik' Tarkhanov <[EMAIL PROTECTED]> wrote:
> >> This is handled by Unicode standard and is called transliteration.
> >
> > Also, for Japanese, are you going to follow kunrei-shiki or rather the
> > more widely used hepburn transliteration? Or perhaps even nippon-shiki
> > if you feel like sticking to strictness.
>
> i think we do not need to discuss japanese at all. after all, there's no
> transliteration for kanji. so it's imho pointless to argue about
> kana-transliteration, when you cannot transliterate kanji.

We Japanese know that we can't transliterate Japanese to ASCII, so I
want to do at least the following: no letter disappears, and the
original can be restored. (FileField and ImageField have the same
disappearing-letters problem.)

def slug_ja(word):
    try:
        unicode(word, 'ASCII')
        import re
        slug = re.sub('[^\w\s-]', '', word).strip().lower()
        slug = re.sub('[-\s]+', '-', slug)
        return slug
    except UnicodeDecodeError:
        from encodings import idna
        painful_slug = word.strip().lower().decode('utf-8').encode('IDNA')
        return painful_slug
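A rough Python 3 adaptation of the same idea might look as follows (a sketch under assumptions, not the original code: the ASCII check uses encode rather than the Python 2 unicode() constructor, and the fallback uses the punycode codec as Antonio suggests later in the thread):

```python
import re

def slug_ja(word):
    """ASCII input gets the usual slug treatment; anything else falls
    back to a reversible punycode form so no character is dropped."""
    try:
        word.encode("ascii")
    except UnicodeEncodeError:
        # non-ASCII: keep every character, reversibly
        return word.strip().lower().encode("punycode").decode("ascii")
    slug = re.sub(r"[^\w\s-]", "", word).strip().lower()
    return re.sub(r"[-\s]+", "-", slug)
```

For example, `slug_ja("Hello World!")` gives `"hello-world"`, while `slug_ja("perché")` gives the reversible but unreadable `"perch-fsa"`.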
Re: urlify.js blocks out non-English chars - 2nd try?
On 7-jul-2006, at 17:50, Bill de hÓra wrote:
> This is my point. Cut what exactly? "No good" for what exactly? We
> could file patches to see what sticks, but it might be better to figure
> out what's wanted first, instead of playing fetch me a rock.

This is handled by the Unicode standard and is called transliteration.
The problem is that it's locale-dependent. AFAIK Python's codecs don't
implement it (but ICU4R does). If you go for tables, it's going to be
_many_.

URLs can be Unicode-aware, just encoded - so why not replace whitespace
with dashes, do a Unicode downcase, and be done with it? Some browsers
(Safari) even show you the request string verbatim, so it's very
readable.
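The suggestion above (keep the text Unicode, just downcase it and replace whitespace with dashes) might look like this sketch; the function name and the final percent-encoding step are my additions, not part of the original proposal:

```python
import re
from urllib.parse import quote

def unicode_slug(title):
    """Downcase, collapse whitespace runs into dashes, then
    percent-encode the (still Unicode) result for use in a URL."""
    slug = re.sub(r"\s+", "-", title.strip().lower())
    return quote(slug)
```

A browser that decodes percent-escapes in its address bar, as Safari did, would display the Unicode text verbatim; for example `unicode_slug("Hello Wörld")` yields `"hello-w%C3%B6rld"` on the wire.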
Re: urlify.js blocks out non-English chars - 2nd try?
Malcolm Tredinnick wrote:
> Hi Bill,
>
> On Fri, 2006-07-07 at 10:06 +0100, Bill de hÓra wrote:
>> Malcolm Tredinnick wrote:
>>
>>> There was reasonable consensus in one of the threads about doing
>>> something similar to (but a bit smaller than) what Wordpress does.
>>> Now it's a case of "patches gratefully accepted". A lot of people say
>>> this is a big issue for them, so it's something that will be fixed
>>> one day, but nobody has put in a reasonable patch yet. When that
>>> happens, we can progress.
>>
>> What's the expected scope of the downcoding? Would it be throwing a few
>> dicts together in the admin js, or a callback to unicodedata.normalize?
>
> I thought there was some sort of consensus; I didn't claim all the
> details had been settled. Personally, I was kind of hoping whoever wrote
> the patch might think this sort of thing through and give us a concrete
> target to throw ideas at. :-)
>
> My own misguided thoughts (I *really* don't want to have to write this
> patch): I thought the original design wish was "something that reads
> sensibly" here, since slugifying is already a lossy process. If I had to
> write it today, I would do the "dictionary mapping on the client side"
> version. But you're more of an expert here: what does normalization gain
> us without having to move to fully internationalised URLs, which still
> seem to be a phishing vector? If we allow fully international URLs, then
> doing everything properly would make sense. However, is it universally
> supported as "not a security risk" in all common browsers yet?

Normalisation/decomposition gains you greater assurance that you'll
throw away what you think you're throwing away, before you try a
mapping. Unicode provides mappings down to ascii, but it's not
complete; mapping decisions tend to be localized/controversial.

The phishing problem with internationalised URLs (IRIs) is in the
internationalised domain name (IDN), where you can get redirected, and
not so much the path segment where the slug lives. I work on Atom
protocol, and IRIs are official IETF/W3C goodness these days (funny, we
just went through slugging on the protocol list yesterday). IRIs are
designed to be treated as encoded Unicode (utf8 most likely) so they
pass through systems without losing information.

Slugging as I tend to understand it is really about dropping down to
ascii and throwing character information away. I'm thinking that for
slugs people want to have a character replaced with an ascii equivalent
and not /preserve/ character data via encoding. It really does depend
on what people want from this feature.

A full downcoding solution needs to go back to the server, I think, do
the whole unicode bit, and use whatever custom mappings onto ascii.
Whereas a good-enough approach would be a set of js dicts sent to the
client; that keeps the nice js autofill feature in the admin, and will
probably cover 95% of use cases.

cheers
Bill
Re: urlify.js blocks out non-English chars - 2nd try?
Antonio Cavedoni wrote:
> So this would be no good.
>
> Perhaps I'm missing something but unicodedata won't cut it.

This is my point. Cut what exactly? "No good" for what exactly? We
could file patches to see what sticks, but it might be better to figure
out what's wanted first, instead of playing fetch me a rock.

A slug function can range from a regex replace to a complete text
normalization/decomposition/lookup service that will never be enough,
because even unicode+mappings aren't a complete solution.

If it's the full unicode+mappings case, I'm doubtful that processing
should be done on the client, not only because the unicode database is
large, but also because the server will have a well-tested setup via
unicodedata. If there's a need to keep the slug's current behaviour
(fill out as you write, as opposed to fill out on the server), that
suggests an ajax callback to the server to get at unicodedata.

If a latin1 hack is enough, that can be sent down to the client in the
admin js. RT editors like fck do this all the time with entity
replacements. No need to use Python if we're dealing with a small
subset.

Mappings: yes, ord/text mappings are grand (Greek, Russian, Turkish off
the top of my head would be good inbuilts, as would latin if a unicode
db isn't used). If there's a need for mapping extension, there needs to
be a place for people to put dictionaries. If the code falls into an
else because it has no lookup, does it insert a stringified hexcode or
blank the character out? Etc.

cheers
Bill
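The mapping-table design Bill describes, per-language dicts plus an explicit policy for unmapped characters, could be sketched like this (everything here is illustrative: the table, the names, and the fallback policy are mine, not Django's actual code):

```python
# A tiny illustrative Greek table; a real inbuilt would be much larger.
GREEK = {"α": "a", "β": "b", "γ": "g", "δ": "d"}

def map_slug(text, table, drop_unmapped=True):
    """Lowercase, map known characters to ascii, turn whitespace into
    dashes, and either drop unmapped characters or emit a hex escape
    (the two fallback options raised in the message above)."""
    out = []
    for ch in text.lower():
        if ch.isascii() and (ch.isalnum() or ch in "-_"):
            out.append(ch)
        elif ch in table:
            out.append(table[ch])
        elif ch.isspace():
            out.append("-")
        elif not drop_unmapped:
            # the alternative policy: a stringified hexcode
            out.append("u%04x" % ord(ch))
    return "".join(out)
```

For example, `map_slug("αβ test", GREEK)` gives `"ab-test"`, while an unmapped kanji is either blanked out or rendered as a codepoint escape depending on `drop_unmapped`.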
Re: urlify.js blocks out non-English chars - 2nd try?
On 7 Jul 2006, at 11:06, Bill de hÓra wrote:
> What's the expected scope of the downcoding? Would it be throwing a few
> dicts together in the admin js, or a callback to unicodedata.normalize?

I'm not sure unicodedata.normalize is enough. It kind of works, if you
do something like:

def slugify_utf8_slug(slug):
    normalized = []
    for c in slug.decode('utf-8'):
        normalized.append(unicodedata.normalize('NFD', c)[0])
    return ''.join(normalized)

Then it works for simple slugs:

>>> slugify_utf8_slug("müller")
u'muller'
>>> slugify_utf8_slug('perché')
u'perche'

But this is because "ü" and "é" can be decomposed as "u" and "e" plus
an accent or diacritic. But then you couldn't have language-specific
decompositions like the "Ä = Ae" mentioned here:

http://dev.textpattern.com/browser/releases/4.0.3/source/textpattern/lib/i18n-ascii.txt

Also:

>>> print slugify_utf8_slug("Δ")
Δ

So this would be no good. Perhaps I'm missing something, but
unicodedata won't cut it. If we're going the asciify-route, we need a
lookup table.

Cheers.
--
Antonio
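Antonio's observation reproduces cleanly with a combining-mark filter as well (a sketch in modern Python; `nfd_ascii` is my name for it): decomposition handles accented Latin letters, but Greek letters have no Latin decomposition, so they pass through untouched, which is exactly why a lookup table is needed.

```python
import unicodedata

def nfd_ascii(text):
    """Decompose to NFD, then drop the combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(c for c in decomposed if not unicodedata.combining(c))

assert nfd_ascii("müller") == "muller"
assert nfd_ascii("perché") == "perche"
# Greek capital delta has no Latin decomposition, so it survives as-is:
assert nfd_ascii("Δ") == "Δ"
# and language-specific rules like "Ä = Ae" are out of reach: NFD gives "A"
assert nfd_ascii("Ä") == "A"
```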
Re: urlify.js blocks out non-English chars - 2nd try?
Malcolm Tredinnick wrote:
> There was reasonable consensus in one of the threads about doing
> something similar (but a bit smaller) than what Wordpress does. Now it's
> a case of "patches gratefully accepted". A lot of people say this is a
> big issue for them, so it's something that will be fixed one day, but
> nobody has put in a reasonable patch yet. When that happens, we can
> progress.

What's the expected scope of the downcoding? Would it be throwing a few
dicts together in the admin js, or a callback to unicodedata.normalize?

cheers
Bill
Re: urlify.js blocks out non-English chars - 2nd try?
On Thu, 2006-07-06 at 10:57 +0200, David Larlet wrote:
> Hi all,
>
> I've recently added an enhancement (ticket #2282) about urlify without
> checking for duplicates, and there is already a proposal (my mistake)
> and a discussion on this mailing list which are unfortunately closed
> now:
> http://groups.google.com/group/django-developers/browse_thread/thread/cecdf42cb3430601/1a53ee84c1742b1e
>
> I'd like to know if it's possible to do something about it. What are
> the previous conclusions and facts since the last discussion? I'm new
> to Django and I may help in Python but not in js, so I need your help ;).

There was reasonable consensus in one of the threads about doing
something similar to (but a bit smaller than) what Wordpress does. Now
it's a case of "patches gratefully accepted". A lot of people say this
is a big issue for them, so it's something that will be fixed one day,
but nobody has put in a reasonable patch yet. When that happens, we can
progress.

Malcolm
urlify.js blocks out non-English chars - 2nd try?
Hi all,

I've recently added an enhancement (ticket #2282) about urlify without
checking for duplicates, and there is already a proposal (my mistake)
and a discussion on this mailing list which are unfortunately closed
now:
http://groups.google.com/group/django-developers/browse_thread/thread/cecdf42cb3430601/1a53ee84c1742b1e

I'd like to know if it's possible to do something about it. What are
the previous conclusions and facts since the last discussion? I'm new
to Django and I may help in Python but not in js, so I need your
help ;).

My current problem is with french accents, so it's not really difficult
(I've pasted a js from a french blog app on my ticket), but I'm
conscious there are more problems with other languages. Concerning
utf-8 URLs, I don't know if it's really a good idea, because this is
currently associated with phishing...

Cheers,
David Larlet