Re: Remove invalid SS2/SS3 handling from EUC-KR routines

SungJun Jang Wed, 13 May 2026 21:24:37 -0700

Hi Henson,

Thanks for the thorough review notes and the detailed coverage list.


> I plan to work through them myself gradually, as time
> allows, and share what I find along the way; other eyes would
> be welcome.

Agreed.  I will follow the same approach -- work through the list
gradually and share findings as I go.


> On the release-19.sgml hunk: release
> notes are curated by the release manager from commit messages
> after the fact, not written into the patch.  It would probably
> be cleaner for v3 to omit that hunk and let the commit message
> carry the user-visible change instead.

Thanks for the clarification..  v3 attached: release-19.sgml hunk
removed.  The commit message now carries the user-visible change:

  "The only user-visible effect is that pg_encoding_max_length('EUC_KR')
  now returns 2 instead of 3."

The charset.sgml Table 23.3 fix is retained.

Regards,
SungJun Jang

2026년 5월 14일 (목) 오전 10:56, Henson Choi <[email protected]>님이 작성:

> Hi Michael, SungJun,
>
> One framing point first.
>
> Correctness of the rewritten routines is the easy part -- for valid
> EUC-KR data there is no behavior change.  What deserves scrutiny is
> the side-effect surface: callers that have been getting mblen=2 or 3
> for 0x8E/0x8F, and the fact that such bytes can already be on disk
> via routes that skip pg_verify_mbstr.  After the patch those rows
> are measured differently by length/substring/LIKE.
>
> Suggested review principles:
>
>   (P1) Indirect-entry completeness -- pg_wchar_table[] should be
>        confirmed as the only callback registry into the EUC-KR slots.
>   (P2) Direct-call review -- callers of pg_mblen / pg_dsplen /
>        pg_encoding_mblen / pg_verify_mbstr / pg_mb*cliplen /
>        pg_encoding_max_length and their cstring variants should be
>        enumerated and each call site checked for any reliance on
>        the old mblen=2/3 return for 0x8E/0x8F.
>   (P3) Verify-bypass:
>        (a) call-time -- code that runs mblen/dsplen on bytes not
>            verified in the same call chain (copyto.c file_encoding
>            scan, cstring pg_mblen in formatting.c / varlena.c) should
>            be identified.
>        (b) load-time -- byte-preserving carry-over that skips
>            pg_verify_mbstr (pg_upgrade relfilenode preservation,
>            physical replication / WAL replay, pre-patch on-disk data
>            from earlier versions) should be reasoned about.
>   (P4) Cross-encoding contamination -- it should be confirmed that no
>        Japanese-shaped assumption (SS2/SS3, JIS X 0201/0212,
>        pg_eucjp_increment, shared pg_euc_* helpers) can reach EUC-KR
>        data after the split.
>   (P5) User-visible width -- pg_encoding_max_length('EUC_KR') changes
>        from 3 to 2; buffer-sizing / width-bounded consumers should be
>        checked.
>
> Review-coverage list (not a test matrix):
>
>   Encoding-layer surface (P5)
>   - pg_wchar_table[PG_EUC_KR].maxmblen and readers of
>     pg_encoding_max_length() that switch on EUC-KR width
>   - pg_database_encoding_max_length() callers in mbutils.c
>
>   Direct callers of pg_mblen / pg_dsplen (P2)
>   - varlena.c:    text_substring, text_position, text_left/right,
>                   text_reverse, length-family
>   - oracle_compat.c: lpad/rpad/ltrim/rtrim/btrim/translate
>   - formatting.c: to_char width handling (pg_mblen_cstr)
>   - varchar.c:    bpchar / varchar length enforcement
>                   (pg_mbcharcliplen)
>   - xml.c, jsonfuncs.c: pg_mblen on textual content
>
>   Pattern matching (P2)
>   - like.c: multibyte LIKE
>
>   Call-time verify-bypass (P3a)
>   - copyto.c: file_encoding delimiter scan
>   - formatting.c / varlena.c: cstring pg_mblen callers
>
>   Load-time verify-bypass (P3b) -- highest follow-up priority
>   - pg_upgrade: relfilenode preservation (bytes as-is)
>   - Physical replication / WAL replay
>   - Pre-patch on-disk data from earlier versions
>
>   Pre-existing-data behavior (P3b consequence)
>   - length/char_length/octet_length, substring/left/right/position,
>     LIKE / regex on rows containing 0x8E/0x8F
>   - Sort order / btree text indexes built before the patch
>     (bytewise comparator, likely safe but worth naming)
>
>   tsearch (P2)
>   - spell.c, ts_locale.c, wparser_def.c, dict_thesaurus.c
>
>   Conversion proc sanity
>   - utf8_and_euc_kr (table-driven, expected unchanged)
>
> These are the items I think are worth a line or two of reasoning
> from someone who has read the code, though I don't think they
> need to block this patch -- treating the review-coverage list as
> background and following up separately is a perfectly fine
> outcome.  I plan to work through them myself gradually, as time
> allows, and share what I find along the way; other eyes would
> be welcome.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, Michael Paquier wrote:
>
> > I unfortunately cannot read the PNG file [...]
>
> I read KS X 2901 clause 4.2 directly in Korean; SungJun's later
> transcription matches the normative text verbatim.
>
> (On src/test/locale/: agreed with SungJun -- a separate concern
> from this patch.)
>
> > - the proposed patch has zero regression tests.
>
> Agreed on the coverage gap.  SungJun's proposed 0001 baseline
> tests (pg_encoding_max_length, G0/ASCII handling, G1/KS X 1001
> handling, SS2 rejection) are a sensible starting point and worth
> adopting as-is.  Beyond that, the review-coverage list above is
> what I plan to walk as follow-up: some items will naturally
> suggest further regression tests, while others (notably the
> load-time bypass paths in P3b) can only be reasoned about in
> review.  Any additional tests can come out of that pass.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, SungJun Jang wrote (v2):
>
> > v2 attached. [commit message and release note updated]
>
> Commit-message update LGTM.  On the release-19.sgml hunk: release
> notes are curated by the release manager from commit messages
> after the fact, not written into the patch.  It would probably
> be cleaner for v3 to omit that hunk and let the commit message
> carry the user-visible change instead (pg_encoding_max_length
> 3 -> 2, plus the pre-existing 0x8E/0x8F behavior shift via the
> P3b routes).  The charset.sgml Table 23.3 fix should stay --
> that is documented behavior, not a release note.
>
> ----------------------------------------------------------------------
>
> On 2026-05-13, SungJun Jang wrote (reply to Michael):
>
> > [KS X 2901 clause text quoted in Korean and English]
>
> Confirmed -- both match standard.go.kr.
>
> > I am planning to split into two patches:
> >   0001 -- Baseline capture [...]
> >   0002 -- Actual fix [...]
>
> The split is a nice shape for review -- characterization-first
> makes the behavior delta easy to see, and "exactly one line moves"
> is a useful audit signal.  Whether to keep the split or squash at
> commit time is a committer call.
>
> Regards,
> Henson Choi
>

v3-0001-Make-EUC-KR-encoding-routines-self-contained.patch
Description: Binary data

Re: Remove invalid SS2/SS3 handling from EUC-KR routines

Reply via email to