Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Okay, back to link escaping.

What this is about: The current implementation of percent escaping for URIs uses a whitelist approach, i.e. it only percent escapes characters that are in `org-link-escape-chars' or in a user-supplied list. This is a problem because using this function requires knowledge of all characters that could possibly occur in a URI -- and since URIs are limited to plain ASCII, a call to the function must list literally every such character and its escaping to get a properly percent-escaped string.

To solve this problem, the behavior of the function is changed to percent escape every character that is an ASCII control character or not an ASCII character. The unescaping function is changed accordingly to handle percent-encoded multibyte Unicode characters.

1/ I did some testing with the newly proposed `org-link-escape' and the modified `org-protocol-unhex-string': Create a random string of ASCII and multibyte Unicode characters, randomly taken from (ucs-names); perform escape-unescape; compare the result with the original string. Works perfectly. Testing randomly created strings with the old escaping of non-ASCII characters is still on my list.

2/ Of course there could still be the problem that a user has created a sequence of old escapes that the new unescaping function will interpret wrongly. Not sure how likely this is, but in theory it could happen. Personally I think we should risk breaking people's links in this way.

3/ I highly suggest changing the syntax of `org-link-escape-chars'. Currently it is a list of cons cells with the character in the car and the replacement string in the cdr. Using such a table for escaping is easy (assq char table), but in the unescaping process it might get tricky. Moreover, if the function is supposed to do percent escaping, the escape sequence is already determined by the character to replace. The new syntax would simply be a list of characters to escape in addition to the rule mentioned above (below 32 and above 126).

This would break compatibility with functions that have used org-link-escape/unescape for something other than percent escaping (e.g. replacing ] by %FF and not %5D and such). But this again is bearable: Although the docstring talks about escaping things that are problematic, the only way to do such escaping in a standardized way is percent escaping.

4/ If all agree that breaking backward compatibility in the case mentioned above (or did I forget one?) is bearable, I would go ahead and perform the necessary changes:

 1. Use the new algorithm in `org-link-escape'.
 2. Modify the syntax of `org-link-escape-chars'.
 3. Issue a warning if someone calls `org-link-escape' with a table of the old syntax.
 4. Move the unescaping functions from org-protocol.el to org.el and rename them.
 5. Declare `org-protocol-unhex-string' and `org-protocol-unhex-compound' obsolete (make-obsolete).
 6. Drop a message to the list informing about these changes.
 7. Wait some months and purge the obsolete functions.

Best,
  -- David
--
OpenPGP... 0x99ADB83B5A4478E6
Jabber dmj...@jabber.org
Email. dm...@ictsoc.de

___
Emacs-orgmode mailing list
Please use `Reply All' to send replies to the list.
Emacs-orgmode@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-orgmode
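The table syntax change from 3/ and the warning from step 3 of 4/ could be combined in one small helper. What follows is only a sketch under assumptions: `my/link-escape-normalize-table' is a hypothetical name, not an Org function, and using `message' for the warning is just one possible choice.

```elisp
;; Sketch only: `my/link-escape-normalize-table' is a hypothetical
;; helper, not part of Org.
(defun my/link-escape-normalize-table (table)
  "Return TABLE as a plain list of characters.
TABLE is either in the new syntax (a list of characters) or in
the old syntax (a list of cons cells (CHAR . REPLACEMENT)).  For
the old syntax a warning is shown and the replacement strings
are dropped."
  (let ((old-syntax nil))
    (prog1
        ;; Keep plain characters, take the car of old-style entries.
        (mapcar (lambda (entry)
                  (if (consp entry)
                      (progn (setq old-syntax t) (car entry))
                    entry))
                table)
      (when old-syntax
        (message "Warning: old-style escape table, replacement strings are ignored")))))
```

Called with '((?\[ . "%5B") (?\] . "%5D")) it returns the two bracket characters and emits the warning; a table that is already a list of characters passes through unchanged.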
Re: [PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Hi David,

I have not had time to follow this in detail, but if you feel confident that this is doing the right thing, please go ahead and apply the necessary patches. I am an encoding moron, so I am easily convinced that you and Sebastian together cook up something useful. :-)

- Carsten

On Sep 27, 2010, at 7:36 AM, David Maus wrote:

>> Also I guess the decoding is secure. Means we could change the comment of this function:
>>
>> (defun org-protocol-unhex-compound (hex)
>>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
>> Note: this function falls back on single byte decoding if a character
>> sequence is not valid utf-8.
>> See `org-protocol-unhex-single-byte-sequence'."
>>
>> Should I send another patch against master? (Too late here... for me...)
>
> Not necessary, following patch removed this sentence and added a proper commit message (please see: Commit messages and ChangeLog entries on http://orgmode.org/worg/org-contribute.php). I took the new patch under review in patchtracker -- If someone else wants to jump on it, just go ahead.
>
> Best,
>   -- David
>
> Sebastian Rose (1):
>   Decode single byte sequence if decoding unicode failed.
>
>  lisp/org-protocol.el | 26 +++--
>  1 files changed, 23 insertions(+), 3 deletions(-)
Re: [PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
David Maus dm...@ictsoc.de writes:
>> Also I guess the decoding is secure. Means we could change the comment of this function:
>>
>> (defun org-protocol-unhex-compound (hex)
>>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
>> Note: this function falls back on single byte decoding if a character
>> sequence is not valid utf-8.
>> See `org-protocol-unhex-single-byte-sequence'."
>>
>> Should I send another patch against master? (Too late here... for me...)
>
> Not necessary, following patch removed this sentence and added a proper commit message (please see: Commit messages and ChangeLog entries on http://orgmode.org/worg/org-contribute.php). I took the new patch under review in patchtracker -- If someone else wants to jump on it, just go ahead.
>
> Best,
>   -- David

Thanks David!

  Sebastian
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Sebastian Rose wrote:
> David Maus dm...@ictsoc.de writes:
>>> sh$ man utf-8
>>
>> Thanks! I finally get a grip on one of my personal nightmares.
>
> It's not that bad, is it? :D

Even better: It makes sense ;)

>> The attached patch is the first step in this direction: It modifies the algorithm of `org-link-escape', now iterating over the input string with `mapconcat' and escaping all characters that are in the escape table or are between 127 and 255.
>
> Between 128 (1000 0000) and 255 ??
>
> The binary representation of 127 is 0111 1111 and valid ascii char. DEL actually (sh$ man ascii)

Right, and that's why it is encoded: No control characters in a URI.

The final algorithm for the shiny new unicode aware percent encoding function would be:

 - percent encode all characters in TABLE
 - percent encode all characters below 32 and above 126
   - encode the char in utf-8
   - percent escape all bytes of the encoded char

The remaining problem is keeping backward compatibility. There are Org files out there where á is encoded as %E1 and not %C3%A1. The percent decoding function should be able to recognize these old escapes and return the right value.

It looks like this could be done by changing the behavior of `org-protocol-unhex-string'. Currently it returns the empty string for %E1 because it does not represent a valid utf-8 encoded unicode char. Maybe we could say: If the percent encoded sequence does not form a valid char, use the old method (extended ASCII?) to decode the sequences.

Sadly (or luckily?) chances are good that I will be somewhat offline for the next two weeks -- I think implementing this unicode aware escaping function should be the way to go, but it requires some careful checking of its consequences for old Org files.

Best,
  -- David
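The four steps of the final algorithm can be sketched in a few lines. This is only an illustration under assumptions, not the function that went into Org: `my/percent-escape' and its EXTRA-CHARS argument (a plain list of characters standing in for TABLE) are hypothetical names.

```elisp
;; Sketch only: `my/percent-escape' is a hypothetical name.
;; EXTRA-CHARS plays the role of TABLE as a plain list of characters.
(defun my/percent-escape (text &optional extra-chars)
  "Percent escape TEXT.
Escape every character listed in EXTRA-CHARS and every character
below 32 or above 126.  Each escaped character is first encoded
as UTF-8, then every resulting byte is written as %XX."
  (mapconcat
   (lambda (char)
     (if (or (memq char extra-chars) (< char 32) (> char 126))
         ;; Encode the char as UTF-8 and percent escape each byte.
         (mapconcat (lambda (byte) (format "%%%02X" byte))
                    (encode-coding-string (char-to-string char) 'utf-8)
                    "")
       (char-to-string char)))
   text ""))
```

For example, (my/percent-escape "a b" '(?\s)) escapes the space, and a multibyte character such as `ö' comes out as two escaped bytes. Note that in this sketch `%' itself is only escaped if it is listed in EXTRA-CHARS, and that "%02X" (rather than "%X") keeps every escape two hex digits wide.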
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
>> The binary representation of 127 is 0111 1111 and valid ascii char. DEL actually (sh$ man ascii)
>
> Right, and that's why it is encoded: No control characters in a URI.

Great ! :)

> The final algorithm for the shiny new unicode aware percent encoding function would be:
>
>  - percent encode all characters in TABLE
>  - percent encode all characters below 32 and above 126
>    - encode the char in utf-8
>    - percent escape all bytes of the encoded char
>
> The remaining problem is keeping backward compatibility. There are Org files out there where á is encoded as %E1 and not %C3%A1. The percent decoding function should be able to recognize these old escapes and return the right value.
>
> It looks like this could be done by changing the behavior of `org-protocol-unhex-string'. Currently it returns the empty string for %E1 because it does not represent a valid utf-8 encoded unicode char. Maybe we could say: If the percent encoded sequence does not form a valid char, use the old method (extended ASCII?) to decode the sequences.

Well, yes. The function _should_ return something if the end of the string is reached or something other than a `%' is found.

I'll have to find out where the function has to look up the correct char. 167 will be a different character for different encodings.

This will not handle cases like `Größe' though.

Are there cases where strings are encoded the way you showed above, and decoded using `org-unhex-string'?

  Sebastian
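The fallback idea discussed here (try UTF-8 first, use the old single byte decoding if that fails) can be sketched as follows. This is not the code from org-protocol.el: both function names are hypothetical, and the sketch assumes that old single byte escapes are ISO-8859-1, which is exactly the open question about where to look up the correct char.

```elisp
;; Sketch only: hypothetical names.  Assumes old single byte escapes
;; were written in ISO-8859-1.
(defun my/decode-escape-run (run)
  "Decode RUN, a string of adjacent %XX groups, into characters.
Try UTF-8 first and fall back to ISO-8859-1 if the bytes are not
valid UTF-8."
  (let* ((bytes (apply #'unibyte-string
                       (mapcar (lambda (hex) (string-to-number hex 16))
                               (split-string run "%" t))))
         (decoded (decode-coding-string bytes 'utf-8)))
    ;; Bytes that are not valid UTF-8 survive decoding as raw bytes,
    ;; which belong to the charset `eight-bit'.
    (if (memq 'eight-bit (find-charset-string decoded))
        (decode-coding-string bytes 'iso-8859-1)
      decoded)))

(defun my/percent-unescape (text)
  "Replace every run of %XX groups in TEXT by the decoded characters."
  (let ((case-fold-search t)
        (start 0)
        (parts nil))
    (while (string-match "\\(?:%[0-9a-f][0-9a-f]\\)+" text start)
      ;; Save the match bounds before calling anything else.
      (let ((beg (match-beginning 0))
            (end (match-end 0)))
        (push (substring text start beg) parts)
        (push (my/decode-escape-run (substring text beg end)) parts)
        (setq start end)))
    (push (substring text start) parts)
    (apply #'concat (nreverse parts))))
```

The ambiguity remains: an old escape of two single byte characters that happens to form valid UTF-8 (for instance %C3%B6) is decoded as one multibyte character, so the fallback can only be a best guess.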
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
David Maus dm...@ictsoc.de writes:
> Sebastian Rose wrote:
>> David Maus dm...@ictsoc.de writes:
>>>> sh$ man utf-8
>>>
>>> Thanks! I finally get a grip on one of my personal nightmares.
>>
>> It's not that bad, is it? :D
>
> Even better: It makes sense ;)
>
>>> The attached patch is the first step in this direction: It modifies the algorithm of `org-link-escape', now iterating over the input string with `mapconcat' and escaping all characters that are in the escape table or are between 127 and 255.
>>
>> Between 128 (1000 0000) and 255 ??
>>
>> The binary representation of 127 is 0111 1111 and valid ascii char. DEL actually (sh$ man ascii)
>
> Right, and that's why it is encoded: No control characters in a URI.
>
> The final algorithm for the shiny new unicode aware percent encoding function would be:
>
>  - percent encode all characters in TABLE
>  - percent encode all characters below 32 and above 126
>    - encode the char in utf-8
>    - percent escape all bytes of the encoded char
>
> The remaining problem is keeping backward compatibility. There are Org files out there where á is encoded as %E1 and not %C3%A1. The percent decoding function should be able to recognize these old escapes and return the right value.

There is no chance to do it in a secure way. But here's what's possible.

These all work as expected:

  (org-protocol-unhex-string "%E1")     ; á
  (org-protocol-unhex-string "%A1")     ; ¡
  (org-protocol-unhex-string "%E1%A1")  ; á¡
  (org-protocol-unhex-string "%C3%B6")  ; still german ö

Also, capturing text from this page still works: http://www.jnto.go.jp/jpn/

diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 21f28e7..f37ce1c 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -305,7 +305,7 @@ part."
 (defun org-protocol-unhex-string(str)
   "Unhex hexified unicode strings as returned from the JavaScript function
-encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
+encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ö'."
   (setq str (or str ""))
   (let ((tmp "")
         (case-fold-search t))
@@ -321,7 +321,11 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
 (defun org-protocol-unhex-compound (hex)
-  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ü'."
+  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
+Note: this function also decodes single byte encodings like
+`%E1' (\"á\") if not followed by another `%[A-F0-9]{2}' group.
+Singlebyte decoding is not secure though, since we could have
+two single byte characters above 128 in a row."
   (let* ((bytes (remove "" (split-string hex "%")))
          (ret "")
          (eat 0)
@@ -353,9 +357,22 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
         (setq val (logxor val xor))
         (setq sum (+ (lsh sum shift) val))
         (if (> eat 0) (setq eat (- eat 1)))
-        (when (= 0 eat)
+        (cond
+         ((= 0 eat)                         ;multi byte
           (setq ret (concat ret (org-protocol-char-to-string sum)))
           (setq sum 0))
+         ((not bytes)                       ; single byte(s)
+          (let ((bytes (remove "" (split-string hex "%")))
+                (ret ""))
+            (message "bytes: %s" bytes)
+
+            (while bytes
+              (let* ((b (pop bytes))
+                     (a (elt b 0))
+                     (b (elt b 1)))
+                (setq ret
+                      (concat ret (char-to-string
+                                   (+ (lsh a 4) b)))))))))
         )) ;; end (while bytes
     ret ))

Best wishes

  Sebastian
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
rrrggrgrggrgr premature and wrong patch, sorry.

Again against master:

diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 21f28e7..d69d584 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -305,7 +305,7 @@ part."
 (defun org-protocol-unhex-string(str)
   "Unhex hexified unicode strings as returned from the JavaScript function
-encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
+encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ö'."
   (setq str (or str ""))
   (let ((tmp "")
         (case-fold-search t))
@@ -321,7 +321,11 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
 (defun org-protocol-unhex-compound (hex)
-  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ü'."
+  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
+Note: this function also decodes single byte encodings like
+`%E1' (\"á\") if not followed by another `%[A-F0-9]{2}' group.
+Singlebyte decoding is not secure though, since we could have
+two single byte characters above 128 in a row."
   (let* ((bytes (remove "" (split-string hex "%")))
          (ret "")
          (eat 0)
@@ -353,12 +357,30 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
         (setq val (logxor val xor))
         (setq sum (+ (lsh sum shift) val))
         (if (> eat 0) (setq eat (- eat 1)))
-        (when (= 0 eat)
+        (cond
+         ((= 0 eat)                         ;multi byte
           (setq ret (concat ret (org-protocol-char-to-string sum)))
           (setq sum 0))
+         ((not bytes)                       ; single byte(s)
+          (setq ret (org-protocol-unhex-single-byte-sequence hex))))
         )) ;; end (while bytes
     ret ))

+(defun org-protocol-unhex-single-byte-sequence(hex)
+  "Unhexify hex-encoded single byte character sequences."
+  (let ((bytes (remove "" (split-string hex "%")))
+        (ret ""))
+    (while bytes
+      (let* ((b (pop bytes))
+             (a (elt b 0))
+             (b (elt b 1))
+             (c1 (if (> a ?9) (+ 10 (- a ?A)) (- a ?0)))
+             (c2 (if (> b ?9) (+ 10 (- b ?A)) (- b ?0))))
+        (setq ret
+              (concat ret (char-to-string
+                           (+ (lsh c1 4) c2))))))
+    ret))
+
 (defun org-protocol-flatten-greedy (param-list &optional strip-path replacement)
   "Greedy handlers might receive a list like this from emacsclient:
  '( (\"/dir/org-protocol:/greedy:/~/path1\" (23 . 12)) (\"/dir/param\")
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Also I guess the decoding is secure. Means we could change the comment of this function:

(defun org-protocol-unhex-compound (hex)
  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
Note: this function falls back on single byte decoding if a character
sequence is not valid utf-8.
See `org-protocol-unhex-single-byte-sequence'."

Should I send another patch against master? (Too late here... for me...)

  Sebastian
[PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
> Also I guess the decoding is secure. Means we could change the comment of this function:
>
> (defun org-protocol-unhex-compound (hex)
>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
> Note: this function falls back on single byte decoding if a character
> sequence is not valid utf-8.
> See `org-protocol-unhex-single-byte-sequence'."
>
> Should I send another patch against master? (Too late here... for me...)

Not necessary, the following patch removed this sentence and added a proper commit message (please see: Commit messages and ChangeLog entries on http://orgmode.org/worg/org-contribute.php).

I took the new patch under review in patchtracker -- If someone else wants to jump on it, just go ahead.

Best,
  -- David

Sebastian Rose (1):
  Decode single byte sequence if decoding unicode failed.

 lisp/org-protocol.el | 26 +++--
 1 files changed, 23 insertions(+), 3 deletions(-)
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Sebastian Rose wrote:
> David Maus dm...@ictsoc.de writes:
>> Sebastian Rose wrote:
>>> Is there a reason for this distinction between multibyte and unibyte? I favour the shotgun-approach if not. It's bullet-proof.
>>>
>>> The JavaScript function `encodeURIComponent()' encodes the German Umlaut `ü' as `%C3%BC' regardless of the source's actual encoding. That's why I wrote the two functions `org-protocol-unhex-string' and `org-protocol-unhex-compound' (see org-protocol.el).
>>
>> Ah, yes. From my understanding of the RFC %C3%BC is a valid representation of the ü character. I do not yet fully understand how to unescape such a representation. E.g. Is %C3%BC a hexencoded multibyte char or a succession of two singlebyte chars?
>
> It's a hexencoded multibyte char. JavaScript implementations seem to turn non-ascii singlebyte chars into multibyte chars first, then encode the result. This means if a page is iso-8859-1 encoded (singlebyte `ü'), JavaScript will recode the `ü'. It's funny, but that's what I found when writing org-protocol.el.
>
> `org-protocol-unhex-string' and `org-protocol-unhex-compound' decode such a representation.
>
> The trick is in the utf-8 encoding itself. If the first byte starts with a 1, more bytes will follow. The number of leading `1's in that first byte denotes the number of bytes used for the character.
>
> On a GNU/Linux system try
>
>   sh$ man utf-8

Thanks! I finally get a grip on one of my personal nightmares.

The attached patch is the first step in this direction: It modifies the algorithm of `org-link-escape', now iterating over the input string with `mapconcat' and escaping all characters that are in the escape table or are between 127 and 255.

I'll try to figure out the escaping/unescaping of multibyte characters next.

Sent as a patch because of its possible side-effects: The new algorithm ignores the cdr of the escape table cons -- thus things will break if they use this function for anything other than percent escaping.

Best,
  -- David

From 8209cb831d0d387d03b10416235d2910a74f80f7 Mon Sep 17 00:00:00 2001
From: David Maus dm...@ictsoc.de
Date: Thu, 23 Sep 2010 20:30:13 +0200
Subject: [PATCH] New algorithm for percent escaping

* org.el (org-link-escape): New algorithm for percent escaping.

Iterate over TEXT and replace chars that are in TABLE or are non-ASCII
single byte characters. Multibyte characters are left untouched.
---
 lisp/org.el | 16 +---
 1 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/lisp/org.el b/lisp/org.el
index d7aa3d2..2c3f1b7 100644
--- a/lisp/org.el
+++ b/lisp/org.el
@@ -8491,17 +8491,11 @@ This is the list that is used before handing over to the browser.")
   (if (and org-url-encoding-use-url-hexify (not table))
       (url-hexify-string text)
     (setq table (or table org-link-escape-chars))
-    (when text
-      (let ((re (mapconcat (lambda (x) (regexp-quote
-                                        (char-to-string (car x))))
-                           table "\\|")))
-        (while (string-match re text)
-          (setq text
-                (replace-match
-                 (cdr (assoc (string-to-char (match-string 0 text))
-                             table))
-                 t t text)))
-        text))))
+    (mapconcat (lambda (c)
+                 (if (or (assoc c table)
+                         (and (> c 126) (< c 255)))
+                     (format "%%%X" c)
+                   (char-to-string c))) text "")))

 (defun org-link-unescape (text &optional table)
   "Reverse the action of `org-link-escape'."
--
1.7.1
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
David Maus dm...@ictsoc.de writes:
>> sh$ man utf-8
>
> Thanks! I finally get a grip on one of my personal nightmares.

It's not that bad, is it? :D

> The attached patch is the first step in this direction: It modifies the algorithm of `org-link-escape', now iterating over the input string with `mapconcat' and escaping all characters that are in the escape table or are between 127 and 255.

Between 128 (1000 0000) and 255 ??

The binary representation of 127 is 0111 1111 and valid ascii char. DEL actually (sh$ man ascii)

;)

  Sebastian
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Sebastian Rose wrote:
> Is there a reason for this distinction between multibyte and unibyte? I favour the shotgun-approach if not. It's bullet-proof.
>
> The JavaScript function `encodeURIComponent()' encodes the German Umlaut `ü' as `%C3%BC' regardless of the source's actual encoding. That's why I wrote the two functions `org-protocol-unhex-string' and `org-protocol-unhex-compound' (see org-protocol.el).

Ah, yes. From my understanding of the RFC %C3%BC is a valid representation of the ü character. I do not yet fully understand how to unescape such a representation. E.g. Is %C3%BC a hexencoded multibyte char or a succession of two singlebyte chars?

> I'll have to take a look at that RFC you mentioned :)

Me too :D

Best,
  -- David
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
David Maus dm...@ictsoc.de writes:
> Sebastian Rose wrote:
>> Is there a reason for this distinction between multibyte and unibyte? I favour the shotgun-approach if not. It's bullet-proof.
>>
>> The JavaScript function `encodeURIComponent()' encodes the German Umlaut `ü' as `%C3%BC' regardless of the source's actual encoding. That's why I wrote the two functions `org-protocol-unhex-string' and `org-protocol-unhex-compound' (see org-protocol.el).
>
> Ah, yes. From my understanding of the RFC %C3%BC is a valid representation of the ü character. I do not yet fully understand how to unescape such a representation. E.g. Is %C3%BC a hexencoded multibyte char or a succession of two singlebyte chars?

It's a hexencoded multibyte char. JavaScript implementations seem to turn non-ascii singlebyte chars into multibyte chars first, then encode the result. This means if a page is iso-8859-1 encoded (singlebyte `ü'), JavaScript will recode the `ü'. It's funny, but that's what I found when writing org-protocol.el.

`org-protocol-unhex-string' and `org-protocol-unhex-compound' decode such a representation.

The trick is in the utf-8 encoding itself. If the first byte starts with a 1, more bytes will follow. The number of leading `1's in that first byte denotes the number of bytes used for the character.

On a GNU/Linux system try

  sh$ man utf-8

  Sebastian
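The leading-`1's rule can be written down directly. A minimal sketch, with `my/utf-8-sequence-length' being a hypothetical helper name (it is not one of the org-protocol.el functions):

```elisp
;; Sketch only: `my/utf-8-sequence-length' is a hypothetical helper.
(defun my/utf-8-sequence-length (byte)
  "Return the length of the UTF-8 sequence that starts with BYTE.
Return nil if BYTE is a continuation byte (10xxxxxx) or cannot
start a sequence."
  (cond ((= (logand byte #x80) #x00) 1)   ; 0xxxxxxx: plain ASCII
        ((= (logand byte #xE0) #xC0) 2)   ; 110xxxxx
        ((= (logand byte #xF0) #xE0) 3)   ; 1110xxxx
        ((= (logand byte #xF8) #xF0) 4)   ; 11110xxx
        (t nil)))
```

For the `ü' example, the first byte #xC3 is 1100 0011, so two bytes belong to the character; the second byte #xBC is 1011 1100, a continuation byte, for which the helper returns nil.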
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
Sébastien Vauban wrote:
> Hello,
>
> With current git pull, and such an Org file (in UTF-8 encoding): ...
>
> I get the following error when trying to export it via PDFLaTeX:

The problem is that the 'É' character is not in Org's default list of link escapes, but `string-match' matches it against the lower case character that is in the list. The table lookup for 'É' then finds nothing.

Adding more chars to `org-link-escape-chars' would solve the problem, but this seems to be a broader issue: Regular links (URIs) are restricted to a special set of ASCII characters and non-ascii chars are hex-encoded.

Currently Org escapes links to Org mode headlines using the table mentioned above. But Org files and hence Org headlines might be Unicode, containing multibyte characters that cannot be hex-escaped in the normal fashion.

Maybe something like this would be a solution:

 - Org only escapes square brackets when escaping a link to an Org mode headline

 - `org-link-escape' uses a shotgun-approach: Every char that is not allowed according to the specs (Cf. RFC3986) is percent encoded if the link sequence does not contain multibyte chars; if the sequence does contain multibyte chars, `org-link-escape' produces an IRI (Cf. RFC3987).

HTH,
  -- David
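The mismatch between the case-insensitive regexp match and the case-sensitive table lookup can be reproduced in isolation. A sketch under assumptions: `my/old-escape-table' is a one-entry stand-in for `org-link-escape-chars' in its old syntax, and `my/old-style-lookup' is a hypothetical function that only mimics the lookup step of the old `org-link-escape'.

```elisp
;; Sketch only: hypothetical names.  The table is a one-entry stand-in
;; for `org-link-escape-chars' in its old (CHAR . REPLACEMENT) syntax.
(defvar my/old-escape-table '((?\u00E9 . "%E9")))  ; lower case e acute

(defun my/old-style-lookup (text)
  "Return (CHAR . REPLACEMENT) for the first table match in TEXT, or nil.
The regexp match honours `case-fold-search', the `assoc' lookup
does not."
  (let ((re (mapconcat (lambda (entry)
                         (regexp-quote (char-to-string (car entry))))
                       my/old-escape-table "\\|")))
    (when (string-match re text)
      (let ((char (string-to-char (match-string 0 text))))
        (cons char (cdr (assoc char my/old-escape-table)))))))

;; "\u00C9cole" starts with upper case E acute.
(let ((case-fold-search t))
  (my/old-style-lookup "\u00C9cole"))    ; matches, but REPLACEMENT is nil
(let ((case-fold-search nil))
  (my/old-style-lookup "\u00C9cole"))    ; no match at all
```

In the old code the nil replacement was handed to `replace-match', which explains the error in the subject line: (wrong-type-argument stringp nil). Binding `case-fold-search' to nil around the match would avoid the error, but not the broader issue described above.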
Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)
David Maus dm...@ictsoc.de writes:
> Sébastien Vauban wrote:
>> Hello,
>>
>> With current git pull, and such an Org file (in UTF-8 encoding): ...
>>
>> I get the following error when trying to export it via PDFLaTeX:
>
> The problem is that the 'É' character is not in Org's default list of link escapes, but `string-match' matches it against the lower case character that is in the list.
>
> Adding more chars to `org-link-escape-chars' would solve the problem, but this seems to be a broader issue: Regular links (URIs) are restricted to a special set of ASCII characters and non-ascii chars are hex-encoded.
>
> Currently Org escapes links to Org mode headlines using the table mentioned above. But Org files and hence Org headlines might be Unicode, containing multibyte characters that cannot be hex-escaped in the normal fashion.
>
> Maybe something like this would be a solution:
>
>  - Org only escapes square brackets when escaping a link to an Org mode headline
>
>  - `org-link-escape' uses a shotgun-approach: Every char that is not allowed according to the specs (Cf. RFC3986) is percent encoded if the link sequence does not contain multibyte chars; if the sequence does contain multibyte chars, `org-link-escape' produces an IRI (Cf. RFC3987).

Is there a reason for this distinction between multibyte and unibyte? I favour the shotgun-approach if not. It's bullet-proof.

The JavaScript function `encodeURIComponent()' encodes the German Umlaut `ü' as `%C3%BC' regardless of the source's actual encoding. That's why I wrote the two functions `org-protocol-unhex-string' and `org-protocol-unhex-compound' (see org-protocol.el).

I'll have to take a look at that RFC you mentioned :)

Best wishes

  Sebastian