> If so, tightening up the validation may break such uses.

I agree. What about introducing an extra GUC that lets users choose the verification logic? In fact, I have implemented this as a patch.
```
SHOW encoding_validation;
-- default behaviour
SET encoding_validation = 'native';
-- enforce that writes are fully compatible with reads
SET encoding_validation = 'read_compatible';
```

On Wed, May 6, 2026 at 8:19 PM Tatsuo Ishii <[email protected]> wrote:

> > It is in general not necessarily required that all text in all
> > non-UTF8 encodings must be convertible to UTF8.
> >
> > (This is also a result of history: These encodings were implemented in
> > PostgreSQL before Unicode.)
> >
> > That said, I can see how different behaviors might be desirable.
> >
> > My first question would be, are these non-convertible byte sequences
> > just characters that don't map to Unicode, or are they invalid within
> > the definition of the EUC-* encodings themselves?
>
> A strict answer is, the former. 0xA2A3 is the lowercase Roman numeral
> three (iii), which is not defined in the original GB2312
> (the character set of EUC_CN).
>
> > If the latter, then
> > we should just reject them (modulo some backward compatibility),
> > similar to how we reject certain Unicode code points that exist
> > "structurally" but are not valid for one reason or another.
>
> After GB2312, GB18030 was defined. (It is claimed that GB18030 is a
> superset of GB2312). In GB18030, lowercase forms of the Roman
> numerals and other characters (e.g. Euro sign) were added.
>
> So I suspect that a) those characters are sometimes used with EUC_CN
> despite the fact that they are not valid GB2312 characters. b) some
> EUC_CN users might have already written those characters into EUC_CN
> databases. If so, tightening up the validation may break such
> uses. This is just my guess. Please correct me if I am wrong.
>
> > Alternatively, if these byte sequences are valid characters but they
> > just didn't end up in Unicode for some reason, then rejecting them
> > might break valid uses.
>
> That's not the case, at least for 0xA2A3. It seems UCS_to_EUC_CN.pl
> explicitly rejects characters that are not part of GB2312, including
> 0xA2A3, as the script is using GB18030 as its source data.
>
> Regards,
> --
> Tatsuo Ishii
> SRA OSS K.K.
> English: http://www.sraoss.co.jp/index_en/
> Japanese:http://www.sraoss.co.jp

--
Zhongpu Chen
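
For reference, the 0xA2A3 case discussed above can be checked outside PostgreSQL. The following is only a sketch. It assumes Python's built-in `gb2312` and `gb18030` codecs are reasonable stand-ins for the two character sets, and the helper name `decodes` is hypothetical. It does not exercise PostgreSQL's own verification or conversion code.

```python
# Byte pair from the thread: structurally a valid EUC_CN two-byte sequence.
SEQ = b"\xa2\xa3"


def decodes(data: bytes, codec: str) -> bool:
    """Return True if Python's `codec` can decode `data` (hypothetical helper)."""
    try:
        data.decode(codec)
        return True
    except UnicodeDecodeError:
        return False


# Compare the original character set with its successor.
in_gb2312 = decodes(SEQ, "gb2312")
in_gb18030 = decodes(SEQ, "gb18030")

print("decodable as gb2312: ", in_gb2312)
print("decodable as gb18030:", in_gb18030)
```

A byte pair that only the second codec accepts is the kind of sequence that a stricter EUC_CN check would start rejecting.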
