Hello,

Here is a patch from a Debian user. :)
Reply-To is set to the submitter, the Debian BTS, and this list.

-------- Forwarded Message --------
> From: Thomas de Grivel <billitch (at) gmail.com>
> Subject: Bug#436566: lftp: Does not correctly encode UTF-8 symbols in
> URL.
> Date: Wed, 08 Aug 2007 10:57:06 +0200

> UTF-8 multi-byte characters are not correctly encoded into URLs. These
> characters include, for example, accented vowels, which appear very
> frequently in European languages such as French (my own).
> 
> Although UTF-8 encoded web pages are not widespread yet, I believe it is
> good practice to encourage Unicode. Here is an example website that
> fails with lftp:
> 
> > $ lftp http://files.iai.heig-vd.ch/Enseignement/ <<EOF
> > cd Supports%20de%20cours/Acquisition\ de\ données\ \&\ CEM/
> > EOF
> 
> Here is the output I get:
> > $ lftp http://files.iai.heig-vd.ch/Enseignement/Supports%20de%20cours/
> > cd ok, cwd=/Enseignement/Supports de cours
> > lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> cd Acquisition\ de\ données\ \&\ CEM/
> > cd: Access failed: 404 Not Found (/Enseignement/Supports de cours/Acquisition de données & CEM)
> > lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> exit
> 
> I wrote a naïve patch to URL-encode some of these characters, and it
> seems to work for the example page, but it still misses most UTF-8
> characters. While I figure out how to do it correctly, maybe you can
> point me to some relevant information or to upstream developers who
> would be interested?
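
The attached patch only handles sequences whose lead byte is 0xC3. One
possible way to cover every multi-byte UTF-8 sequence is to percent-encode
each byte with the high bit set, in addition to the characters already
listed as unsafe. The following is only a rough sketch under that
assumption: `encode_utf8_url` is a hypothetical helper, not an lftp
function, and it uses std::string instead of the malloc-ed buffer that
url::encode_string works with. The unsafe list is copied from the
unpatched URL_UNSAFE in url.h.

```cpp
#include <cctype>
#include <cstring>
#include <string>

// Hypothetical helper (not part of lftp): percent-encode every byte that is
// non-ASCII (>= 0x80), a control character, or listed in `unsafe`.
// Encoding all high-bit bytes covers every UTF-8 lead and continuation byte,
// not just sequences that start with 0xC3.
std::string encode_utf8_url(const std::string &s, const char *unsafe)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size() * 3);  // worst case: every byte becomes %XX

    for (unsigned char c : s) {
        const bool quote = c >= 0x80
                        || std::iscntrl(c)
                        || std::strchr(unsafe, c) != nullptr;
        if (quote) {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With the unsafe list above, the directory name from the report,
"Acquisition de données & CEM", would come out as
"Acquisition%20de%20donn%C3%A9es%20&%20CEM". Whether the server expects
UTF-8 in the path at all is a separate question that this sketch does not
address.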



-- 
Noèl Köthe <noel debian.org>
Debian GNU/Linux, www.debian.org
diff -ur lftp-3.5.6.orig/src/url.cc lftp-3.5.6/src/url.cc
--- lftp-3.5.6.orig/src/url.cc	2006-02-06 11:59:59.000000000 +0100
+++ lftp-3.5.6/src/url.cc	2007-08-08 08:08:37.000000000 +0200
@@ -441,6 +441,7 @@
 
 /* Encodes the unsafe characters (listed in URL_UNSAFE) in a given
    string, returning a malloc-ed %XX encoded string.  */
+inline char *cat_quoted (char *p, const unsigned char c);
 #define need_quote(c) (!unsafe || iscntrl((unsigned char)(c)) || strchr(unsafe,(c)))
 char *url::encode_string (const char *s,char *res,const char *unsafe)
 {
@@ -462,10 +463,12 @@
   {
     if (need_quote(*s))
       {
-	const unsigned char c = *s;
-	*p++ = '%';
-	sprintf(p,"%02X",c);
-	p+=2;
+	p = cat_quoted (p, *s);
+	if ((unsigned char) *s == 0xC3 && s[1])
+	  {
+	    s++;
+	    p = cat_quoted (p, *s);
+	  }
       }
     else
       *p++ = *s;
@@ -474,6 +477,14 @@
   return res;
 }
 
+inline char *cat_quoted (char *p, const unsigned char c)
+{
+  *p++ = '%';
+  sprintf(p,"%02X",c);
+  p+=2;
+  return p;
+}
+
 bool url::dir_needs_trailing_slash(const char *proto)
 {
    if(!proto)
diff -ur lftp-3.5.6.orig/src/url.h lftp-3.5.6/src/url.h
--- lftp-3.5.6.orig/src/url.h	2006-02-06 12:00:06.000000000 +0100
+++ lftp-3.5.6/src/url.h	2007-08-08 08:04:48.000000000 +0200
@@ -47,7 +47,7 @@
    char *Combine(const char *home=0,bool use_rfc1738=true);
 };
 
-# define URL_UNSAFE " <>\"%{}|\\^[]`"
+# define URL_UNSAFE " <>\"%{}|\\^[]`\xC3"
 # define URL_PATH_UNSAFE URL_UNSAFE"#;?"
 # define URL_HOST_UNSAFE URL_UNSAFE":/"
 # define URL_PORT_UNSAFE URL_UNSAFE"/"

Attachment: signature.asc
Description: This is a digitally signed message part