Re: [Bug-wget] bad filename

Bykov Aleksey Fri, 25 Apr 2014 11:40:40 -0700

Greetings, Andries E. Brouwer
>>- the patch is inside #ifdef WINDOWS ... #endif while the problem
>> occurs on all systems, also on Unix.
Yes, it is. 
>> - Presently, 0-31 and 127-159 are considered "control". 
Sorry, I prefer converting, at least for uppercase/lowercase conversion (with 
towlower() and towupper()). It is sometimes useful, e.g. when a site mirrored 
with Wget is moved between case-sensitive and case-insensitive filesystems 
(ext3 and NTFS).
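
A minimal sketch of the case conversion described above, not the code from the attached patch. The function name lowercase_name() and its buffer-size rule are hypothetical. It relies only on the standard mbstowcs(), towlower() and wcstombs(), so the result depends on the current locale: for UTF-8 names the caller would first need setlocale(LC_ALL, "") under a UTF-8 locale, and on Windows a MultiByteToWideChar()-based conversion would be needed instead.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: lowercase the multibyte string src into dst,
   using the encoding of the current locale.
   Returns 0 on success, -1 on invalid input, allocation failure,
   or a destination buffer that is too small. */
static int lowercase_name(char *dst, size_t dstsize, const char *src)
{
    size_t i, len, cap;
    wchar_t *wide;

    assert(dst != NULL && src != NULL);

    /* A multibyte string never has more characters than bytes. */
    cap = strlen(src) + 1;
    wide = malloc(cap * sizeof *wide);
    if (wide == NULL)
        return -1;

    len = mbstowcs(wide, src, cap);
    if (len == (size_t)-1) {            /* invalid sequence for this locale */
        free(wide);
        return -1;
    }

    for (i = 0; i < len; i++)
        wide[i] = (wchar_t)towlower((wint_t)wide[i]);

    /* Conservative size check: each wide character needs at most
       MB_CUR_MAX bytes, plus one byte for the terminating NUL. */
    if (dstsize < len * MB_CUR_MAX + 1) {
        free(wide);
        return -1;
    }

    wcstombs(dst, wide, dstsize);
    free(wide);
    return 0;
}
```

With the default "C" locale this only handles ASCII names; towupper() can be substituted in the loop for the opposite direction.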
I have reworked the patch so that it has some chance of working on non-Windows 
systems. Tested with Cyrillic names on FAT32 and NTFS on a win32 system. 
mswindows.diff contains only the Windows-related stuff. 
Best regards, Bykov Aleksey


~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~
From: [email protected]
To: [email protected]
Date: 17:28:10, 04.23.2014
Subject: Re: [Bug-wget] bad filename
~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~ ~



>>On Wed, Apr 23, 2014 at 04:57:11PM +0300, Bykov Aleksey wrote:
>> > Greetings, Darshit Shah
>> > This was discussed some (or a long) time ago. 
>> > Possible logic:
>> > If locale isn't UTF-8 then process as before else
>> > 1. Convert string to WideCharString with mbstowcs(). 
>> > 2. For each WideChar, check its size with wctomb(). If the size is 1, 
>> > compare it with the mask. If the char is restricted, then "quoted++;"
>> > 3. If needed, convert to lower/upper case with towlower()/towupper()
>> > 4. Recreate the string char by char with wctomb(): convert each char to a 
>> > temporary buffer. If the char size is 1, compare with the mask and 
>> > replace. Else "memcpy(q, char_buffer, char_size); q+=char_size;"
>> > On Windows I can't check this (mbstowcs() doesn't work with UTF-8, so 
>> > MultiByteToWideChar() must be used...)
>> > A patch for Windows (unstructured, unclear, unfinished, but working) is 
>> > attached.
>> > Best Regards, Bykov Aleksey.
>> 
>> Good!
>> 
>> However:
>> - the patch is inside #ifdef WINDOWS ... #endif while the problem
>> occurs on all systems, also on Unix.
>> - I think all of this is needlessly complicated. Repeatedly
>> converting filenames is not a good plan if the goal is to
>> keep them unchanged.
>> - UTF-8 has the nice property that the only 7-bit bytes that occur
>> inside a character code are those in the ASCII set. So, no
>> conversion is needed to test the length: every byte in 0-127
>> always represents a full character.
>> - Presently, 0-31 and 127-159 are considered "control". That is
>> wrong on UTF-8 systems, where 128-159 are part of a multibyte character.
>> If one wants to preserve the filename mangling in the 0-31,127 range,
>> but wants to do the mangling to 128-159 only when some option asks
>> for it, then 0-31,127 and 128-159 should have different flags in
>> url.c:static const unsigned char urlchr_table[256], e.g.
>> ..
>> #define D filechr_highcontrol
>> ..
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 128-143 */
>> D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, D, /* 144-159 */
>> ..
>> #undef D
>> 
>> Andries
>>
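
The byte-level approach suggested in the quoted message can be sketched as follows. This is an illustration under stated assumptions, not code from url.c or from the attached patches: the flag names FILECHR_CONTROL and FILECHR_HIGHCONTROL, the helpers byte_class() and escape_name(), and the %XX output form are all hypothetical stand-ins. It shows that a UTF-8 name needs no conversion when only bytes 0-31 and 127 are mangled, because those byte values never occur inside a multibyte sequence.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical flags, mirroring the suggested split of the ranges. */
enum {
    FILECHR_CONTROL     = 1,   /* bytes 0-31 and 127 */
    FILECHR_HIGHCONTROL = 2    /* bytes 128-159 */
};

/* Classify one byte; everything else is left alone in this sketch. */
static int byte_class(unsigned char c)
{
    if (c < 32 || c == 127)
        return FILECHR_CONTROL;
    if (c >= 128 && c <= 159)
        return FILECHR_HIGHCONTROL;
    return 0;
}

/* Copy src into dst, writing every byte whose class is in mask as %XX.
   Returns the length of the result, or (size_t)-1 if dst is too small. */
static size_t escape_name(char *dst, size_t dstsize, const char *src, int mask)
{
    size_t n = 0;

    assert(dst != NULL && src != NULL);

    for (; *src != '\0'; src++) {
        unsigned char c = (unsigned char)*src;
        if (byte_class(c) & mask) {
            if (n + 3 >= dstsize)       /* need 3 bytes plus the NUL */
                return (size_t)-1;
            snprintf(dst + n, 4, "%%%02X", c);
            n += 3;
        } else {
            if (n + 1 >= dstsize)       /* need 1 byte plus the NUL */
                return (size_t)-1;
            dst[n++] = (char)c;
        }
    }
    if (n >= dstsize)
        return (size_t)-1;
    dst[n] = '\0';
    return n;
}
```

For example, the Cyrillic letter "я" is the UTF-8 pair 0xD1 0x8F, and 0x8F (143) falls in the 128-159 range. With mask FILECHR_CONTROL the pair passes through intact; adding FILECHR_HIGHCONTROL to the mask splits the character, which is the breakage reported in this thread.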

mswindows.diff
Description: Binary data

url.diff
Description: Binary data
