Greetings, Andries E. Brouwer

>> - the patch is inside #ifdef WINDOWS ... #endif while the problem
>>   occurs on all systems, also on Unix.
Yes, it is.

>> - Presently, 0-31 and 127-159 are considered "control".
Sorry, I prefer converting, at least for uppercase/lowercase conversion
(with towlower() and towupper()). It is sometimes useful, for example when
a site mirrored with Wget is moved between case-sensitive and
case-insensitive filesystems (ext3 and NTFS).

I have remastered the patch so that it has some chance of working on
non-Windows systems. Tested with Cyrillic names on FAT32 and NTFS on a
win32 system.
mswindows.diff contains only the Windows-related stuff.

Best regards, Bykov Aleksey
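For illustration, the towlower() conversion mentioned above could look roughly like the sketch below. It is not the attached patch: lowercase_mb is a hypothetical helper name, and the code assumes the current locale can decode the filename (on Windows, as noted in the quoted message, mbstowcs() does not handle UTF-8, so MultiByteToWideChar() would be needed instead).

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper, not part of Wget: return a malloc'ed lowercase copy
   of a multibyte string, decoded according to the current locale.
   Returns NULL on allocation failure or on an invalid byte sequence. */
char *lowercase_mb(const char *src)
{
    assert(src != NULL);

    /* A string never decodes to more wide characters than it has bytes. */
    size_t cap = strlen(src) + 1;
    wchar_t *wbuf = malloc(cap * sizeof *wbuf);
    if (wbuf == NULL)
        return NULL;

    size_t wlen = mbstowcs(wbuf, src, cap);
    if (wlen == (size_t)-1) {          /* not valid in this locale */
        free(wbuf);
        return NULL;
    }

    for (size_t i = 0; i < wlen; i++)
        wbuf[i] = (wchar_t)towlower((wint_t)wbuf[i]);

    /* Each wide character needs at most MB_CUR_MAX bytes. */
    size_t out_cap = wlen * MB_CUR_MAX + 1;
    char *out = malloc(out_cap);
    if (out != NULL && wcstombs(out, wbuf, out_cap) == (size_t)-1) {
        free(out);
        out = NULL;
    }
    free(wbuf);
    return out;
}
```

To handle non-ASCII names the caller would first select a suitable locale, for example with setlocale(LC_ALL, ""); in the default "C" locale only ASCII letters are guaranteed to be converted.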
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
From: [email protected]
To: [email protected]
Date: 17:28:10, 04.23.2014
Subject: Re: [Bug-wget] bad filename
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
>>On Wed, Apr 23, 2014 at 04:57:11PM +0300, Bykov Aleksey wrote:
>> > Greetings, Darshit Shah
>> > This was discussed some (or long) time ago.
>> > Possible logic:
>> > If locale isn't UTF-8 then process as before else
>> > 1. Convert string to WideCharString with mbstowcs().
>> > 2. For each WideChar check its size with wctomb(). If size is 1 then
>> > compare it with mask. If char restricted, then "quoted++;"
>> > 3. If needed, convert to lower/upper case with towlower()/towupper()
>> > 4. Recreate string char by char with wctomb: Convert char to temporary
>> > buffer. If filechar size is 1 compare with mask and replace. Else
>> > "memcpy(q, char_buffer, char_size); q+=char_size;"
>> > On Windows I can't check it (mbstowcs didn't work with UTF-8, so
>> > MultiByteToWideChar() must be used...)
>> > Patch for Windows (unstructured, unclear, unfinished, but working) is
>> > attached.
>> > Best Regards, Bykov Aleksey.
>>
>> Good!
>>
>> However:
>> - the patch is inside #ifdef WINDOWS ... #endif while the problem
>> occurs on all systems, also on Unix.
>> - I think all of this is needlessly complicated. Repeatedly
>> converting filenames is not a good plan if the goal is to
>> keep them unchanged.
>> - UTF-8 has the nice property that the only 7-bit bytes that occur
>> inside a character code are those in the ASCII set. So, no
>> conversion is needed to test the length: every byte in 0-127
>> always represents a full character.
>> - Presently, 0-31 and 127-159 are considered "control". That is
>> wrong on UTF-8 systems, where 128-159 are part of a multibyte character.
>> If one wants to preserve the filename mangling in the 0-31,127 range,
>> but wants to do the mangling to 128-159 only when some option asks
>> for it, then 0-31,127 and 128-159 should have different flags in
>> url.c:static const unsigned char urlchr_table[256], e.g.
>> ..
>> #define D filechr_highcontrol
>> ..
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 128-143 */
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 144-159 */
>> ..
>> #undef D
>>
>> Andries
>>
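
A minimal, self-contained sketch of the separate-flag idea from the quoted message, for illustration only. It is not Wget's url.c: the flag values, byte_class() and mangle_name() are hypothetical, a small function stands in for the 256-entry table, and the %XX escaping is an assumption about how a restricted byte is replaced. It shows why 128-159 need their own flag: in UTF-8, "Б" (U+0411) is the byte pair 0xD0 0x91, and 0x91 (145) falls in the 128-159 range.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag values; only the name filechr_highcontrol comes from
   the suggestion quoted above. */
enum {
    filechr_control     = 1,  /* bytes 0-31 and 127 */
    filechr_highcontrol = 2   /* bytes 128-159 */
};

/* Stand-in for a 256-entry lookup table: classify a single byte. */
static unsigned byte_class(unsigned char c)
{
    if (c < 32 || c == 127)
        return filechr_control;
    if (c >= 128 && c <= 159)
        return filechr_highcontrol;
    return 0;
}

/* Copy src to dst, replacing every byte whose class is selected by mask
   with %XX. dst must have room for 3 * strlen(src) + 1 bytes.
   Returns the number of bytes that were escaped. */
size_t mangle_name(const char *src, char *dst, unsigned mask)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t quoted = 0;

    assert(src != NULL && dst != NULL);
    for (const unsigned char *p = (const unsigned char *)src; *p; p++) {
        if (byte_class(*p) & mask) {
            *dst++ = '%';
            *dst++ = hex[*p >> 4];
            *dst++ = hex[*p & 0x0F];
            quoted++;
        } else {
            *dst++ = (char)*p;
        }
    }
    *dst = '\0';
    return quoted;
}
```

With mask set to filechr_control only, the bytes of "Б" pass through untouched and a stray 0x01 is still escaped; adding filechr_highcontrol to the mask reproduces the present behaviour, which breaks the multibyte character.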
mswindows.diff
Description: Binary data
url.diff
Description: Binary data
