On 29/05/2026 14:44, Hannu Krosing wrote:
Hi hackers

I was loading our mailing list into a database and noticed that some
text results int substring not working.

Specifically calling substring with some specific values fails

badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION:  report_invalid_encoding_int, mbutils.c:1847

Asking one byte longer substring works ok

badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
  ctid  │ id │ substring
───────┼────┼───────────
  (0,1) │  1 │ Hi ev
(1 row)

as do other ways of getting the same 4 bytes

badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
  ctid  │ id │ substring
───────┼────┼───────────
  (0,1) │  1 │ Hi e
(1 row)

badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
  ctid  │ id │ substring
───────┼────┼───────────
  (0,1) │  1 │ Hi e
(1 row)

is this expected behaviour and I just have to always noirmalize when
loading exotic UTF8 strings ?

Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be able to load it into the database in the first place. If it is valid, then the substring() should work.

If you want to replicate this use attached python script to load data

I could not reproduce this. That substring() query after running your script works fine for me.

badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
 ctid  | id | substring
-------+----+-----------
 (0,1) |  1 | Hi e
(1 row)

Which version did you use? What is the database's encoding and what is the client encoding? I used 'master', with UTF-8 as server and client encoding.

- Heikki



Reply via email to