Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-11-04 Thread David Maus
Okay, back to link escaping.

What this is about:

The current implementation of percent escaping for URIs uses a
whitelist approach, i.e. it only percent escapes characters that are
in `org-link-escape-chars' or in a user supplied list.  This is a
problem because using this function requires knowledge of all
characters that could occur in a URI -- and since URIs are limited to
plain ASCII, a call to the function must list literally all possible
characters and their escapes to get a properly percent escaped
string.

To solve this problem the behavior of the function is changed to
percent escape every character that is an ASCII control character or
not an ASCII character.  The unescaping function is changed
accordingly to handle percent encoded multibyte Unicode characters.

1/ I did some testing with the newly proposed `org-link-escape' and the
modified `org-protocol-unhex-string': create a random string of ASCII
and multibyte Unicode characters, randomly taken from (ucs-names);
perform escape-unescape; compare the result with the original string.
Works perfectly.  Testing randomly created strings with the old
escaping of non-ASCII characters is still on the list.
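
A round-trip check of this kind could look roughly like the sketch
below.  It is not the test that was actually run: `url-hexify-string'
and `url-unhex-string' from url-util stand in for the proposed Org
functions, and the `my/' helpers and the code point range are made up
for illustration.

```elisp
(require 'url-util)

(defun my/random-string (length)
  "Return LENGTH random characters between U+0020 and U+2FFF.
Hypothetical helper; the range is an arbitrary choice that mixes
ASCII with multibyte characters and stays clear of surrogates."
  (let (chars)
    (dotimes (_ length)
      (push (+ #x20 (random (- #x3000 #x20))) chars))
    (apply #'string chars)))

(defun my/roundtrip-p (string)
  "Return non-nil if escaping and unescaping STRING gives STRING back.
Uses the url-util functions as stand-ins for the Org functions."
  (equal string
         (decode-coding-string
          (url-unhex-string (url-hexify-string string))
          'utf-8)))

;; Example: try a handful of random strings.
(dotimes (_ 5)
  (unless (my/roundtrip-p (my/random-string 40))
    (error "Round trip failed")))
```

The same loop could be pointed at the real escape/unescape pair once
both exist in their final form.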

2/ Of course there could still be the problem that a user has created
a sequence of old escapes that the new unescaping function will
interpret wrongly.  Not sure how likely this is, but in theory it
could happen.  Personally I think we should risk breaking people's
links in this way.

3/ I highly suggest changing the syntax of `org-link-escape-chars'.
Currently it is a list of cons cells with the character in the car and
the replacement string in the cdr.  Using such a table for escaping is
easy (assq char table), but in the unescaping process it might get
tricky.

Moreover, if the function is supposed to do percent escaping, the
escape sequence is already determined by the character to replace.
The new syntax would simply be a list of characters to escape in
addition to the rule mentioned above (< 32 and > 126).

This would break compatibility with functions that have used
org-link-escape/unescape for something other than percent escaping
(e.g. replacing ] by %FF and not %5D and such).  But this again is
bearable: although the docstring talks about escaping things that
are problematic, the only way to do such escaping in a standardized
way is percent escaping.

4/ If all agree that breaking backward compatibility in the case
mentioned above (or did I forget one?) is bearable, I would go ahead
and perform the necessary changes:

  1. Use the new algorithm in `org-link-escape'
  2. Modify the syntax of `org-link-escape-chars'
  3. Issue a warning if someone calls `org-link-escape' with a table
     of the old syntax.
  4. Move the unescaping functions from org-protocol.el to org.el and
     rename them.
  5. Declare `org-protocol-unhex-string' and
     `org-protocol-unhex-compound' obsolete (make-obsolete).
  6. Drop a message to the list informing about these changes.
  7. Wait some months and purge the obsolete functions.
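
For steps 4 and 5, a sketch of the usual pattern.  The new function
name and the version string are placeholders, since the thread does
not say what the renamed functions will be called; the `my/' prefix
marks everything here as hypothetical.

```elisp
;; Hypothetical new home and name for the decoder (placeholder body).
(defun my/org-link-unescape-compound (hex)
  "Placeholder standing in for the relocated decoder."
  hex)

;; Keep the old name working for existing callers ...
(defalias 'my/org-protocol-unhex-compound
  #'my/org-link-unescape-compound)

;; ... but make the byte compiler warn about it.  "x.y" stands for
;; the release in which the function became obsolete.
(make-obsolete 'my/org-protocol-unhex-compound
               'my/org-link-unescape-compound
               "x.y")
```

Callers of the old name keep working until step 7; they only see a
warning when their code is byte-compiled.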

Best,
  -- David
--
OpenPGP... 0x99ADB83B5A4478E6
Jabber dmj...@jabber.org
Email. dm...@ictsoc.de


___
Emacs-orgmode mailing list
Please use `Reply All' to send replies to the list.
Emacs-orgmode@gnu.org
http://lists.gnu.org/mailman/listinfo/emacs-orgmode


Re: [PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-29 Thread Carsten Dominik

Hi David,

I have not had time to follow this in detail, but if you feel
confident that this is doing the right thing, please go ahead and
apply the necessary patches.  I am an encoding moron, so I am easily
convinced that you and Sebastian together cook up something useful. :-)


- Carsten

On Sep 27, 2010, at 7:36 AM, David Maus wrote:

>> Also I guess the decoding is secure.  Means we could change the
>> comment of this function:
>>
>> (defun org-protocol-unhex-compound (hex)
>>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
>> Note: this function falls back on single byte decoding if a
>> character sequence is not valid utf-8.
>> See `org-protocol-unhex-single-byte-sequence'."
>>
>> Should I send another patch against master?  (Too late here... for
>> me...)
>
> Not necessary, following patch removed this sentence and added a
> proper commit message (please see: Commit messages and ChangeLog
> entries on http://orgmode.org/worg/org-contribute.php).
>
> I took the new patch under review in patchtracker -- If someone else
> wants to jump on it, just go ahead.
>
> Best,
>  -- David
>
> Sebastian Rose (1):
>  Decode single byte sequence if decoding unicode failed.
>
> lisp/org-protocol.el |   26 +++---
> 1 files changed, 23 insertions(+), 3 deletions(-)




Re: [PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-27 Thread Sebastian Rose
David Maus <dm...@ictsoc.de> writes:
>> Also I guess the decoding is secure.  Means we could change the
>> comment of this function:
>>
>> (defun org-protocol-unhex-compound (hex)
>>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
>> Note: this function falls back on single byte decoding if a
>> character sequence is not valid utf-8.
>> See `org-protocol-unhex-single-byte-sequence'."
>>
>> Should I send another patch against master?  (Too late here... for
>> me...)
>
> Not necessary, following patch removed this sentence and added a
> proper commit message (please see: Commit messages and ChangeLog
> entries on http://orgmode.org/worg/org-contribute.php).
>
> I took the new patch under review in patchtracker -- If someone else
> wants to jump on it, just go ahead.
>
> Best,
>   -- David


Thanks David!


  Sebastian



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread David Maus
Sebastian Rose wrote:
> David Maus <dm...@ictsoc.de> writes:
>>>   sh$  man utf-8
>>
>> Thanks!  I finally get a grip on one of my personal nightmares.
>
> It's not that bad, is it? :D

Even better: It makes sense ;)

>> The attached patch is the first step in this direction: It modifies
>> the algorithm of `org-link-escape', now iterating over the input
>> string with `mapconcat' and escaping all characters in the escape
>> table or are between 127 and 255.
>
> Between 128 (1000 0000) and 255 ??
>
> The binary representation of 127 is 0111 1111 and valid ascii char. DEL
> actually (sh$ man ascii)

Right, and that's why it is encoded: No control characters in a URI.

The final algorithm for the shiny new unicode aware percent encoding
function would be:

 - percent encode all characters in TABLE
 - percent encode all characters below 32 and above 126
   - encode the char in utf-8
   - percent escape all bytes of the encoded char
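
The steps above can be sketched as follows.  This is only an
illustration of the listed algorithm, not the code that ended up in
Org; the function name is made up, and TABLE is assumed to be a plain
list of characters.

```elisp
(defun my/percent-escape (text &optional table)
  "Percent escape TEXT; a sketch, not the Org implementation.
Escape every character in TABLE (assumed: a list of characters)
plus every character below 32 or above 126.  Each such character
is encoded in UTF-8 and every resulting byte is written as %XX."
  (mapconcat
   (lambda (char)
     (if (or (memq char table) (< char 32) (> char 126))
         ;; Iterating over the unibyte result yields one byte at a time.
         (mapconcat (lambda (byte) (format "%%%02X" byte))
                    (encode-coding-string (char-to-string char) 'utf-8)
                    "")
       (char-to-string char)))
   text ""))

;; Examples:
;; (my/percent-escape "[x]" '(?\[ ?\]))   ; => "%5Bx%5D"
;; (my/percent-escape "\t")               ; => "%09"
```

Note that `%' itself is only escaped if it is listed in TABLE, so a
caller who wants a reversible encoding has to include it.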

The remaining problem is keeping backward compatibility. There are Org
files out there where á is encoded as %E1 and not %C3%A1.  The
percent decoding function should be able to recognize these old
escapes and return the right value.  

It looks like this could be done by changing the behavior of
`org-protocol-unhex-string'.  Currently it returns the empty string
for %E1 because it does not represent a valid utf-8 encoded unicode
char.  Maybe we could say: If the percent encoded sequence does not
form a valid char, use the old method (extended ASCII?) to decode the
sequences.
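
A minimal sketch of that fallback idea.  Two things are assumptions
here: that the "old method" means ISO-8859-1, and that undecodable
input can be detected by looking for the raw-byte characters Emacs
leaves behind after a failed UTF-8 decode.  All names are made up.

```elisp
(defun my/unhex-bytes (hex)
  "Return the bytes named by HEX, e.g. \"%C3%B6\", as a unibyte string."
  (apply #'unibyte-string
         (mapcar (lambda (pair) (string-to-number pair 16))
                 (split-string hex "%" t))))

(defun my/raw-bytes-p (string)
  "Return non-nil if STRING contains undecoded raw bytes.
Emacs represents bytes it could not decode as characters
above #x3FFF7F."
  (let (found)
    (dolist (char (append string nil) found)
      (when (> char #x3FFF7F)
        (setq found t)))))

(defun my/decode-with-fallback (hex)
  "Decode percent encoded HEX as UTF-8, else as ISO-8859-1."
  (let* ((bytes (my/unhex-bytes hex))
         (decoded (decode-coding-string bytes 'utf-8)))
    (if (my/raw-bytes-p decoded)
        ;; Not valid UTF-8: treat every byte as one Latin-1 character.
        (decode-coding-string bytes 'iso-8859-1)
      decoded)))
```

The fallback cannot be made fully reliable: two old single byte
escapes in a row that happen to form a valid UTF-8 sequence (%C3%B6,
for instance) will always be read as one multibyte character.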

Sadly (or luckily?) chances are good that I will be somewhat offline
for the next two weeks -- I think implementing this unicode aware
escaping function should be the way to go, but it requires some
careful checking of its consequences for old Org files.

Best,
  -- David



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread Sebastian Rose

>> The binary representation of 127 is 0111 1111 and valid ascii char. DEL
>> actually (sh$ man ascii)
>
> Right, and that's why it is encoded: No control characters in a URI.

Great ! :)

> The final algorithm for the shiny new unicode aware percent encoding
> function would be:
>
>  - percent encode all characters in TABLE
>  - percent encode all characters below 32 and above 126
>    - encode the char in utf-8
>    - percent escape all bytes of the encoded char
>
> The remaining problem is keeping backward compatibility. There are Org
> files out there where á is encoded as %E1 and not %C3%A1.  The
> percent decoding function should be able to recognize these old
> escapes and return the right value.
>
> It looks like this could be done by changing the behavior of
> `org-protocol-unhex-string'.  Currently it returns the empty string
> for %E1 because it does not represent a valid utf-8 encoded unicode
> char.  Maybe we could say: If the percent encoded sequence does not
> form a valid char, use the old method (extended ASCII?) to decode the
> sequences.

Well, yes.  The function _should_ return something if the end of the
string is reached or something other than a `%' is found.

I'll have to find out where the function has to look up the correct
char.  167 will be a different character for different encodings.


This will not handle cases like `Größe' though.


Are there cases where strings are encoded the way you showed above, and
decoded using `org-unhex-string'?


  Sebastian



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread Sebastian Rose
David Maus <dm...@ictsoc.de> writes:
> Sebastian Rose wrote:
>> David Maus <dm...@ictsoc.de> writes:
>>>>   sh$  man utf-8
>>>
>>> Thanks!  I finally get a grip on one of my personal nightmares.
>>
>> It's not that bad, is it? :D
>
> Even better: It makes sense ;)
>
>>> The attached patch is the first step in this direction: It modifies
>>> the algorithm of `org-link-escape', now iterating over the input
>>> string with `mapconcat' and escaping all characters in the escape
>>> table or are between 127 and 255.
>>
>> Between 128 (1000 0000) and 255 ??
>>
>> The binary representation of 127 is 0111 1111 and valid ascii char. DEL
>> actually (sh$ man ascii)
>
> Right, and that's why it is encoded: No control characters in a URI.
>
> The final algorithm for the shiny new unicode aware percent encoding
> function would be:
>
>  - percent encode all characters in TABLE
>  - percent encode all characters below 32 and above 126
>    - encode the char in utf-8
>    - percent escape all bytes of the encoded char
>
> The remaining problem is keeping backward compatibility. There are Org
> files out there where á is encoded as %E1 and not %C3%A1.  The
> percent decoding function should be able to recognize these old
> escapes and return the right value.


There is no chance to do it in a secure way.  But here's what's
possible.


These all work as expected:

(org-protocol-unhex-string "%E1") ; á
(org-protocol-unhex-string "%A1") ; ¡
(org-protocol-unhex-string "%E1%A1")  ; á¡
(org-protocol-unhex-string "%C3%B6")  ; still german ö


Also, capturing text from this page still works:
http://www.jnto.go.jp/jpn/


diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 21f28e7..f37ce1c 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -305,7 +305,7 @@ part.
 
 (defun org-protocol-unhex-string(str)
   "Unhex hexified unicode strings as returned from the JavaScript function
-encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
+encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ö'."
   (setq str (or str ""))
   (let ((tmp "")
 	(case-fold-search t))
@@ -321,7 +321,11 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'.
 
 
 (defun org-protocol-unhex-compound (hex)
-  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ü'."
+  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
+Note: this function also decodes single byte encodings like
+`%E1' (\"á\") if not followed by another `%[A-F0-9]{2}' group.
+Singlebyte decoding is not secure though, since we could have
+two single byte characters above 128 in a row."
   (let* ((bytes (remove "" (split-string hex "%")))
 	 (ret "")
 	 (eat 0)
@@ -353,9 +357,22 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'.
 	(setq val (logxor val xor))
 	(setq sum (+ (lsh sum shift) val))
 	(if (> eat 0) (setq eat (- eat 1)))
-	(when (= 0 eat)
+	(cond
+	 ((= 0 eat) ;multi byte
 	  (setq ret (concat ret (org-protocol-char-to-string sum)))
 	  (setq sum 0))
+	 ((not bytes)   ; single byte(s)
+	  (let ((bytes (remove "" (split-string hex "%")))
+		(ret ""))
+	    (message "bytes: %s" bytes)
+
+	    (while bytes
+	      (let* ((b (pop bytes))
+		     (a (elt b 0))
+		     (b (elt b 1)))
+		 (setq ret
+			   (concat ret (char-to-string
+					(+ (lsh a 4) b)))))))))
 	)) ;; end (while bytes
     ret ))
 


Best wishes

  Sebastian


Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread Sebastian Rose

rrrggrgrggrgr

premature and wrong patch, sorry.  Again against master:

diff --git a/lisp/org-protocol.el b/lisp/org-protocol.el
index 21f28e7..d69d584 100644
--- a/lisp/org-protocol.el
+++ b/lisp/org-protocol.el
@@ -305,7 +305,7 @@ part.
 
 (defun org-protocol-unhex-string(str)
   "Unhex hexified unicode strings as returned from the JavaScript function
-encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'."
+encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ö'."
   (setq str (or str ""))
   (let ((tmp "")
 	(case-fold-search t))
@@ -321,7 +321,11 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'.
 
 
 (defun org-protocol-unhex-compound (hex)
-  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ü'."
+  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
+Note: this function also decodes single byte encodings like
+`%E1' (\"á\") if not followed by another `%[A-F0-9]{2}' group.
+Singlebyte decoding is not secure though, since we could have
+two single byte characters above 128 in a row."
   (let* ((bytes (remove "" (split-string hex "%")))
 	 (ret "")
 	 (eat 0)
@@ -353,12 +357,30 @@ encodeURIComponent. E.g. `%C3%B6' is the german Umlaut `ü'.
 	(setq val (logxor val xor))
 	(setq sum (+ (lsh sum shift) val))
 	(if (> eat 0) (setq eat (- eat 1)))
-	(when (= 0 eat)
+	(cond
+	 ((= 0 eat) ;multi byte
 	  (setq ret (concat ret (org-protocol-char-to-string sum)))
 	  (setq sum 0))
+	 ((not bytes)   ; single byte(s)
+	  (setq ret (org-protocol-unhex-single-byte-sequence hex))))
 	)) ;; end (while bytes
     ret ))
 
+(defun org-protocol-unhex-single-byte-sequence(hex)
+  "Unhexify hex-encoded single byte character sequences."
+  (let ((bytes (remove "" (split-string hex "%")))
+	(ret ""))
+    (while bytes
+      (let* ((b (pop bytes))
+	     (a (elt b 0))
+	     (b (elt b 1))
+	     (c1 (if (> a ?9) (+ 10 (- a ?A)) (- a ?0)))
+	     (c2 (if (> b ?9) (+ 10 (- b ?A)) (- b ?0))))
+	(setq ret
+	      (concat ret (char-to-string
+			   (+ (lsh c1 4) c2))))))
+    ret))
+
 (defun org-protocol-flatten-greedy (param-list &optional strip-path replacement)
   "Greedy handlers might receive a list like this from emacsclient:
  '( (\"/dir/org-protocol:/greedy:/~/path1\" (23 . 12)) (\"/dir/param\")


Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread Sebastian Rose

Also I guess the decoding is secure.  That means we could change the
comment of this function:

(defun org-protocol-unhex-compound (hex)
  "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
Note: this function falls back on single byte decoding if a
character sequence is not valid utf-8.
See `org-protocol-unhex-single-byte-sequence'."


Should I send another patch against master?  (Too late here... for me...)


Sebastian



[PATCH] Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-26 Thread David Maus
> Also I guess the decoding is secure.  Means we could change the
> comment of this function:
>
> (defun org-protocol-unhex-compound (hex)
>   "Unhexify unicode hex-chars. E.g. `%C3%B6' is the German Umlaut `ö'.
> Note: this function falls back on single byte decoding if a
> character sequence is not valid utf-8.
> See `org-protocol-unhex-single-byte-sequence'."
>
> Should I send another patch against master?  (Too late here... for
> me...)

Not necessary, the following patch removes this sentence and adds a
proper commit message (please see: Commit messages and ChangeLog
entries on http://orgmode.org/worg/org-contribute.php).

I took the new patch under review in patchtracker -- If someone else
wants to jump on it, just go ahead.

Best,
  -- David

Sebastian Rose (1):
  Decode single byte sequence if decoding unicode failed.

 lisp/org-protocol.el |   26 +++---
 1 files changed, 23 insertions(+), 3 deletions(-)




Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-23 Thread David Maus
Sebastian Rose wrote:
> David Maus <dm...@ictsoc.de> writes:
>> Sebastian Rose wrote:
>>> Is there a reason for this distinction between multibyte and unibyte?
>>> I favour the shotgun-approach if not.  It's bullet-proof.
>>>
>>> The JavaScript function `encodeURIComponent()' encodes the German Umlaut
>>> `ü' as `%C3%BC' regardless of the sources encoding actually.  That's why
>>> I wrote the two functions `org-protocol-unhex-string' and
>>> `org-protocol-unhex-compound' (s. org-protocol.el).
>>
>> Ah, yes.  From my understanding of the RFC %C3%BC is a valid
>> representation of the ü character.
>>
>> I do not yet fully understand
>> how to unescape such a representation.  E.g. Is %C3%BC a hexencoded
>> multibyte char or a succession of two singlebyte chars?
>
> It's a hexencoded multibyte char.
>
> JavaScript implementations seem to turn non-ascii singlebyte chars
> into multibyte chars first, then encode the result.
>
> This means if a page is iso-8859-1 encoded (singlebyte `ü'),
> JavaScript will recode the `ü'.  It's funny, but that's what I found
> when writing org-protocol.el
>
> `org-protocol-unhex-string' and `org-protocol-unhex-compound' decode
> such a representation.
>
> The trick is in the utf-8 encoding itself.  If a byte starts with a 1,
> another byte will follow.  The number of leading `1's denotes the amount
> of bytes used for one character.   On a GNU/Linux system try
>
>   sh$  man utf-8

Thanks!  I finally get a grip on one of my personal nightmares.  The
attached patch is the first step in this direction: It modifies the
algorithm of `org-link-escape', now iterating over the input string
with `mapconcat' and escaping all characters that are in the escape
table or are between 127 and 255.

I'll try to figure out the escaping/unescaping of multibyte characters
next.

Sent as a patch because of its possible side-effects: The new
algorithm ignores the cdr of the escape table cons -- thus things will
break if they use this function for anything other than percent
escaping.

Best,
  -- David
From 8209cb831d0d387d03b10416235d2910a74f80f7 Mon Sep 17 00:00:00 2001
From: David Maus <dm...@ictsoc.de>
Date: Thu, 23 Sep 2010 20:30:13 +0200
Subject: [PATCH] New algorithm for percent escaping

* org.el (org-link-escape): New algorithm for percent escaping.

Iterate over TEXT and replace chars that are in TABLE or are
non-ASCII single byte characters.  Multibyte characters are left
untouched.
---
 lisp/org.el |   16 +---
 1 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/lisp/org.el b/lisp/org.el
index d7aa3d2..2c3f1b7 100644
--- a/lisp/org.el
+++ b/lisp/org.el
@@ -8491,17 +8491,11 @@ This is the list that is used before handing over to the browser.")
   (if (and org-url-encoding-use-url-hexify (not table))
       (url-hexify-string text)
     (setq table (or table org-link-escape-chars))
-    (when text
-      (let ((re (mapconcat (lambda (x) (regexp-quote
-                                        (char-to-string (car x))))
-                           table "\\|")))
-        (while (string-match re text)
-          (setq text
-                (replace-match
-                 (cdr (assoc (string-to-char (match-string 0 text))
-                             table))
-                 t t text)))
-        text))))
+    (mapconcat (lambda (c)
+                 (if (or (assoc c table)
+                         (and (> c 126) (< c 255)))
+                     (format "%%%X" c)
+                   (char-to-string c))) text "")))
 
 (defun org-link-unescape (text &optional table)
   "Reverse the action of `org-link-escape'.
-- 
1.7.1





Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-23 Thread Sebastian Rose
David Maus <dm...@ictsoc.de> writes:
>>   sh$  man utf-8
>
> Thanks!  I finally get a grip on one of my personal nightmares.


It's not that bad, is it? :D


> The
> attached patch is the first step in this direction: It modifies the
> algorithm of `org-link-escape', now iterating over the input string
> with `mapconcat' and escaping all characters in the escape table or
> are between 127 and 255.


Between 128 (1000 0000) and 255 ??

The binary representation of 127 is 0111 1111 and valid ascii char. DEL
actually (sh$ man ascii)

;)



Sebastian



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-22 Thread David Maus
Sebastian Rose wrote:
> Is there a reason for this distinction between multibyte and unibyte?
> I favour the shotgun-approach if not.  It's bullet-proof.
>
> The JavaScript function `encodeURIComponent()' encodes the German Umlaut
> `ü' as `%C3%BC' regardless of the sources encoding actually.  That's why
> I wrote the two functions `org-protocol-unhex-string' and
> `org-protocol-unhex-compound' (s. org-protocol.el).

Ah, yes.  From my understanding of the RFC %C3%BC is a valid
representation of the ü character.

I do not yet fully understand
how to unescape such a representation.  E.g. Is %C3%BC a hexencoded
multibyte char or a succession of two singlebyte chars?

> I'll have to take a look at that RFC you mentioned :)

Me too :D

Best,
  -- David


Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-22 Thread Sebastian Rose
David Maus <dm...@ictsoc.de> writes:
> Sebastian Rose wrote:
>> Is there a reason for this distinction between multibyte and unibyte?
>> I favour the shotgun-approach if not.  It's bullet-proof.
>>
>> The JavaScript function `encodeURIComponent()' encodes the German Umlaut
>> `ü' as `%C3%BC' regardless of the sources encoding actually.  That's why
>> I wrote the two functions `org-protocol-unhex-string' and
>> `org-protocol-unhex-compound' (s. org-protocol.el).
>
> Ah, yes.  From my understanding of the RFC %C3%BC is a valid
> representation of the ü character.
>
> I do not yet fully understand
> how to unescape such a representation.  E.g. Is %C3%BC a hexencoded
> multibyte char or a succession of two singlebyte chars?


It's a hexencoded multibyte char.

JavaScript implementations seem to turn non-ascii singlebyte chars into
multibyte chars first, then encode the result.

This means if a page is iso-8859-1 encoded (singlebyte `ü'), JavaScript
will recode the `ü'.  It's funny, but that's what I found when writing
org-protocol.el 


`org-protocol-unhex-string' and `org-protocol-unhex-compound' decode
such a representation.

The trick is in the utf-8 encoding itself.  If the first byte of a
character starts with a 1, more bytes follow, and the number of leading
`1's in that first byte is the number of bytes used for the character
(the following bytes all start with `10').  On a GNU/Linux system try

  sh$  man utf-8
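
As a small illustration of the leading-bits rule (a sketch only; the
function name is made up and the check looks at the bit pattern
alone, it does not validate the whole sequence):

```elisp
(defun my/utf-8-sequence-length (byte)
  "Return the length of the UTF-8 sequence that starts with BYTE.
Return nil if BYTE cannot start a sequence, e.g. because it is a
continuation byte.  Only the leading bits are inspected."
  (cond ((< byte #x80) 1)                  ; 0xxxxxxx: plain ASCII
        ((= (logand byte #xE0) #xC0) 2)    ; 110xxxxx
        ((= (logand byte #xF0) #xE0) 3)    ; 1110xxxx
        ((= (logand byte #xF8) #xF0) 4)    ; 11110xxx
        (t nil)))                          ; 10xxxxxx and others

;; In %C3%B6 the first byte #xC3 is 1100 0011, i.e. a two byte
;; sequence, and #xB6 (1011 0110) is its continuation byte:
;; (my/utf-8-sequence-length #xC3)   ; => 2
;; (my/utf-8-sequence-length #xB6)   ; => nil
```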


Sebastian



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-20 Thread David Maus
Sébastien Vauban wrote:
> Hello,
>
> With current git pull, and such an Org file (in UTF-8 encoding):
>
>  ...
>
> I get the following error when trying to export it via PDFLaTeX:

The problem is that the 'É' character is not in Org's default list
for link escapes, but `string-match' matches it against the entry for
the lower case character.  The table lookup for 'É' then returns nil,
which is where the (wrong-type-argument stringp nil) comes from.
Adding more chars to `org-link-escape-chars' would solve the problem,
but this seems to be a broader issue:

Regular links (URIs) are restricted to a special set of ASCII
characters and non-ASCII chars are hex-encoded.  Currently Org escapes
links to Org mode headlines using the table mentioned above.  But Org
files and hence Org headlines might be Unicode, containing multibyte
characters that cannot be hex-escaped in the normal fashion.
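
The case-folding problem described at the top of this message can be
reproduced in isolation with a sketch like the following.  The
function name and the one-entry table are made up; the lookup mirrors
the regexp-plus-`assoc' approach of the escaping code.

```elisp
(defun my/first-escape-match (text table fold)
  "Return (MATCH REPLACEMENT) for the first TABLE char found in TEXT.
TABLE is an alist of (CHAR . STRING); FOLD is the value to bind
`case-fold-search' to.  Return nil if nothing matches."
  (let ((case-fold-search fold)
        (re (mapconcat (lambda (x)
                         (regexp-quote (char-to-string (car x))))
                       table "\\|")))
    (when (string-match re text)
      (let ((match (match-string 0 text)))
        ;; If the match is only a case variant of a table entry,
        ;; `assoc' fails and the replacement is nil.
        (list match
              (cdr (assoc (string-to-char match) table)))))))

;; #xE9 is é, #xC9 is É.  With case folding the regexp for é matches
;; É, but the table has no entry for É, so the replacement is nil:
;; (my/first-escape-match (concat (string #xC9) "cole")
;;                        '((#xE9 . "%E9")) t)
```

Handing that nil to `replace-match' is what signals the stringp
error; binding `case-fold-search' to nil avoids the false match.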

Maybe something like this would be a solution:

 - Org only escapes square brackets when escaping a link to an Org
   mode headline
 - `org-link-escape' uses a shotgun-approach: Every char that is not
   allowed according to the specs (Cf. RFC3986) is percent encoded if
   the link sequence does not contain multibyte chars; If the sequence
   does contain multibyte chars, `org-link-escape' produces an IRI
   (Cf. RFC3987).

HTH,
  -- David



Re: [Orgmode] [bug] org-link-escape and (wrong-type-argument stringp nil)

2010-09-20 Thread Sebastian Rose
David Maus <dm...@ictsoc.de> writes:
> Sébastien Vauban wrote:
>> Hello,
>>
>> With current git pull, and such an Org file (in UTF-8 encoding):
>>
>>  ...
>>
>> I get the following error when trying to export it via PDFLaTeX:
>
> The problem is, that the 'É' character is not in Org's default list
> for link escapes but `string-match' matches for the lower case
> character.  Adding more chars to `org-link-escape-chars' would solve
> the problem, but this seems to be a broader issue:
>
> Regular links (URIs) are restricted to a special set of ASCII
> characters and non-ascii chars are hex-encoded.  Currently Org escapes
> links to Org mode headlines using the table mentioned above.  But Org
> files and hence Org headlines might be Unicode, containing multibyte
> characters that cannot be hex-escaped in the normal fashion.
>
> Maybe something like this would be a solution:
>
>  - Org only escapes square brackets when escaping a link to an Org
>    mode headline
>  - `org-link-escape' uses a shotgun-approach: Every char that is not
>    allowed according to the specs (Cf. RFC3986) is percent encoded if
>    the link sequence does not contain multibyte chars; If the sequence
>    does contain multibyte chars, `org-link-escape' produces an IRI
>    (Cf. RFC3987).



Is there a reason for this distinction between multibyte and unibyte?
I favour the shotgun-approach if not.  It's bullet-proof.



The JavaScript function `encodeURIComponent()' encodes the German Umlaut
`ü' as `%C3%BC' regardless of the sources encoding actually.  That's why
I wrote the two functions `org-protocol-unhex-string' and
`org-protocol-unhex-compound' (s. org-protocol.el).


I'll have to take a look at that RFC you mentioned :)



Best wishes

  Sebastian
