Hello,

here is a patch from a Debian user. :) Reply-To is set to the submitter, the Debian BTS, and this list.

-------- Forwarded Message --------
> From: Thomas de Grivel <billitch (at) gmail.com>
> Subject: Bug#436566: lftp: Does not correctly encode UTF-8 symbols in URL.
> Date: Wed, 08 Aug 2007 10:57:06 +0200

> UTF-8 multi-byte characters are not correctly encoded into URLs. These
> characters include, for example, accented vowels, and thus appear very
> frequently in European languages such as French (my own).
>
> Although UTF-8 encoded web pages are not widespread yet, I believe it is
> good practice to encourage Unicode. Here is an example website that
> fails with lftp:
>
> > $ lftp http://files.iai.heig-vd.ch/Enseignement/ <<EOF
> > cd Supports%20de%20cours/Acquisition\ de\ données\ \&\ CEM/
> > EOF
>
> Here is the output I get:
>
> > $ lftp http://files.iai.heig-vd.ch/Enseignement/Supports%20de%20cours/
> > cd ok, cwd=/Enseignement/Supports de cours
> > lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours>
> > cd Acquisition\ de\ données\ \&\ CEM/
> > cd: Access failed: 404 Not Found (/Enseignement/Supports de cours/Acquisition de données & CEM)
> > lftp files.iai.heig-vd.ch:/Enseignement/Supports de cours> exit
>
> I wrote a naïve patch to URL-encode some of these characters, and it
> seems to work for the example page, but it still misses most UTF-8
> characters. While I figure out how to do it correctly, maybe you can
> point me to some relevant information or to upstream coders who
> would be interested?

-- 
Noèl Köthe <noel debian.org>
Debian GNU/Linux, www.debian.org

diff -ur lftp-3.5.6.orig/src/url.cc lftp-3.5.6/src/url.cc
--- lftp-3.5.6.orig/src/url.cc 2006-02-06 11:59:59.000000000 +0100
+++ lftp-3.5.6/src/url.cc 2007-08-08 08:08:37.000000000 +0200
@@ -441,6 +441,7 @@
/* Encodes the unsafe characters (listed in URL_UNSAFE) in a given
string, returning a malloc-ed %XX encoded string. */
+inline char *cat_quoted (char *p, const unsigned char c);
#define need_quote(c) (!unsafe || iscntrl((unsigned char)(c)) || strchr(unsafe,(c)))
char *url::encode_string (const char *s,char *res,const char *unsafe)
{
@@ -462,10 +463,12 @@
{
if (need_quote(*s))
{
- const unsigned char c = *s;
- *p++ = '%';
- sprintf(p,"%02X",c);
- p+=2;
+ p = cat_quoted (p, *s);
+ if ((unsigned char) *s == 0xC3 && s[1])
+ {
+ s++;
+ p = cat_quoted (p, *s);
+ }
}
else
*p++ = *s;
@@ -474,6 +477,14 @@
return res;
}
+inline char *cat_quoted (char *p, const unsigned char c)
+{
+ *p++ = '%';
+ sprintf(p,"%02X",c);
+ p+=2;
+ return p;
+}
+
bool url::dir_needs_trailing_slash(const char *proto)
{
if(!proto)
diff -ur lftp-3.5.6.orig/src/url.h lftp-3.5.6/src/url.h
--- lftp-3.5.6.orig/src/url.h 2006-02-06 12:00:06.000000000 +0100
+++ lftp-3.5.6/src/url.h 2007-08-08 08:04:48.000000000 +0200
@@ -47,7 +47,7 @@
char *Combine(const char *home=0,bool use_rfc1738=true);
};
-# define URL_UNSAFE " <>\"%{}|\\^[]`"
+# define URL_UNSAFE " <>\"%{}|\\^[]`\xC3"
# define URL_PATH_UNSAFE URL_UNSAFE"#;?"
# define URL_HOST_UNSAFE URL_UNSAFE":/"
# define URL_PORT_UNSAFE URL_UNSAFE"/"
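
For discussion, here is a minimal sketch of a more general approach than the 0xC3 special case above. It assumes the server expects the raw UTF-8 bytes of the path, each percent-encoded. It is not lftp's code: `encode_bytes` and `kUnsafe` are hypothetical names, and `kUnsafe` only mirrors the unpatched URL_UNSAFE list. The idea is to quote every byte with the high bit set, since every byte of a multi-byte UTF-8 sequence is >= 0x80.

```cpp
#include <cstring>
#include <string>

// Hypothetical stand-in for the unpatched URL_UNSAFE list from url.h.
static const char kUnsafe[] = " <>\"%{}|\\^[]`";

// Hypothetical helper (not part of lftp): percent-encode each byte of `s`
// that is a control character, appears in `unsafe`, or is >= 0x80.
// Lead and continuation bytes of multi-byte UTF-8 sequences are all
// >= 0x80, so no per-lead-byte special case is needed.
std::string encode_bytes(const std::string &s, const char *unsafe = kUnsafe)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    out.reserve(s.size() * 3);  // worst case: every byte becomes %XX
    for (unsigned char c : s) {
        // The range checks run first, so strchr only sees printable ASCII.
        const bool quote = c >= 0x80 || c < 0x20 || c == 0x7F
                           || std::strchr(unsafe, c) != nullptr;
        if (quote) {
            out += '%';
            out += hex[c >> 4];
            out += hex[c & 0x0F];
        } else {
            out += static_cast<char>(c);
        }
    }
    return out;
}
```

With this sketch, "données" in the example path would be sent as "donn%C3%A9es". Characters outside U+00C0..U+00FF, which the 0xC3 check cannot reach because their lead byte differs, are covered as well. Whether the local charset must first be converted to UTF-8 is a separate question that this sketch does not address.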
