But looks like I can not reproduce it on any other computer (I though I had verified it on vanill Ubuntu 22.04 as well), so no need to do any more investigation until I can reproduce it elsewhere
On Fri, May 29, 2026 at 5:16 PM Hannu Krosing <[email protected]> wrote: > > Body is valid, can be selected in full and the the body can be also > fed to other functions, including version of substring using regexes. > > badutf8=# select ctid, id, length(body), length(normalize(body)), > body=normalize(body), octet_length(body), > octet_length(normalize(body)) from bademail; > ctid │ id │ length │ length │ ?column? │ octet_length │ octet_length > ───────┼────┼────────┼────────┼──────────┼──────────────┼────────────── > (0,1) │ 1 │ 2314 │ 2314 │ t │ 2323 │ 2323 > (1 row) > > The most confusing thing is that the byte it complains about in case > of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be > present in the original string at all and definitely not within the > first few characters I ams asking to extract > > > badutf8=# select ctid, id, substring(body, 1, 2) from bademail; > ctid │ id │ substring > ───────┼────┼─────────── > (0,1) │ 1 │ Hi > (1 row) > > Time: 0.527 ms > badutf8=# select ctid, id, substring(body, 1, 3) from bademail; > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc3 > LOCATION: report_invalid_encoding_int, mbutils.c:1847 > Time: 0.638 ms > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 > LOCATION: report_invalid_encoding_int, mbutils.c:1847 > Time: 0.555 ms > badutf8=# select ctid, id, substring(body, 1, 5) from bademail; > ctid │ id │ substring > ───────┼────┼─────────── > (0,1) │ 1 │ Hi ev > (1 row) > > On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <[email protected]> wrote: > > > > On 29/05/2026 14:44, Hannu Krosing wrote: > > > Hi hackers > > > > > > I was loading our mailing list into a database and noticed that some > > > text results int substring not working. > > > > > > Specifically calling substring with some specific values fails > > > > > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > > > ERROR: 22021: invalid byte sequence for encoding "UTF8": 0xc2 > > > LOCATION: report_invalid_encoding_int, mbutils.c:1847 > > > > > > Asking one byte longer substring works ok > > > > > > badutf8=# select ctid, id, substring(body, 1, 5) from bademail; > > > ctid │ id │ substring > > > ───────┼────┼─────────── > > > (0,1) │ 1 │ Hi ev > > > (1 row) > > > > > > as do other ways of getting the same 4 bytes > > > > > > badutf8=# select ctid, id, substring(body from '^.{4}') from bademail; > > > ctid │ id │ substring > > > ───────┼────┼─────────── > > > (0,1) │ 1 │ Hi e > > > (1 row) > > > > > > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from > > > bademail; > > > ctid │ id │ substring > > > ───────┼────┼─────────── > > > (0,1) │ 1 │ Hi e > > > (1 row) > > > > > > is this expected behaviour and I just have to always noirmalize when > > > loading exotic UTF8 strings ? > > > > Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be > > able to load it into the database in the first place. If it is valid, > > then the substring() should work. > > > > > If you want to replicate this use attached python script to load data > > > > I could not reproduce this. That substring() query after running your > > script works fine for me. > > > > badutf8=# select ctid, id, substring(body, 1, 4) from bademail; > > ctid | id | substring > > -------+----+----------- > > (0,1) | 1 | Hi e > > (1 row) > > > > Which version did you use? What is the database's encoding and what is > > the client encoding? I used 'master', with UTF-8 as server and client > > encoding. > > > > - Heikki > >
