> Well, this is still working at the wrong level.  The code
> that's in pg_verifymbstr is mainly intended to enforce the
> *system wide* assumption that multibyte characters must have
> the high bit set in every byte.  (We do not support encodings
> without this property in the backend, because it breaks code
> that looks for ASCII characters ... such as the main
> parser/lexer ...)  It's not really intended to check that the
> multibyte character is actually legal in its encoding.
Ok, point taken.

> The "special UTF-8 check" was never more than a very
> quick-n-dirty hack that was in the wrong place to start with.
> We ought to be getting rid of it not institutionalizing it.
> If you want an exact encoding-specific check on the
> legitimacy of a multibyte sequence, I think the right way to
> do it is to add another function pointer to pg_wchar_table
> entries to let each encoding have its own check routine.
> Perhaps this could be defined so as to avoid a separate call
> to pg_mblen inside the loop, and thereby not add any new
> overhead.  I'm thinking about an API something like
>
>	int validate_mbchar(const unsigned char *str, int len)
>
> with result +N if a valid character N bytes long is present
> at *str, and -N if an invalid character is present at *str
> and it would be appropriate to display N bytes in the complaint.
> (N must be <= len in either case.)  This would reduce the
> main loop of pg_verifymbstr to a call of this function and an
> error-case-handling block.

Sounds like a plan...

> 			regards, tom lane

Regards,

John Hansen