Body is valid, can be selected in full and the the body can be also
fed to other functions, including version of substring using regexes.

badutf8=# select ctid, id, length(body), length(normalize(body)),
body=normalize(body), octet_length(body),
octet_length(normalize(body)) from  bademail;
 ctid  │ id │ length │ length │ ?column? │ octet_length │ octet_length
───────┼────┼────────┼────────┼──────────┼──────────────┼──────────────
 (0,1) │  1 │   2314 │   2314 │ t        │         2323 │         2323
(1 row)

The most confusing thing is that the byte it complains about in case
of substring(body, 1, 3) or substring(body, 1, 4) does not seem to be
present in the original string at all and definitely not within the
first few characters I ams asking to extract


badutf8=# select ctid, id, substring(body, 1, 2) from  bademail;
 ctid  │ id │ substring
───────┼────┼───────────
 (0,1) │  1 │ Hi
(1 row)

Time: 0.527 ms
badutf8=# select ctid, id, substring(body, 1, 3) from  bademail;
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc3
LOCATION:  report_invalid_encoding_int, mbutils.c:1847
Time: 0.638 ms
badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
LOCATION:  report_invalid_encoding_int, mbutils.c:1847
Time: 0.555 ms
badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
 ctid  │ id │ substring
───────┼────┼───────────
 (0,1) │  1 │ Hi ev
(1 row)

On Fri, May 29, 2026 at 3:29 PM Heikki Linnakangas <[email protected]> wrote:
>
> On 29/05/2026 14:44, Hannu Krosing wrote:
> > Hi hackers
> >
> > I was loading our mailing list into a database and noticed that some
> > text results int substring not working.
> >
> > Specifically calling substring with some specific values fails
> >
> > badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
> > ERROR:  22021: invalid byte sequence for encoding "UTF8": 0xc2
> > LOCATION:  report_invalid_encoding_int, mbutils.c:1847
> >
> > Asking one byte longer substring works ok
> >
> > badutf8=# select ctid, id, substring(body, 1, 5) from  bademail;
> >   ctid  │ id │ substring
> > ───────┼────┼───────────
> >   (0,1) │  1 │ Hi ev
> > (1 row)
> >
> > as do other ways of getting the same 4 bytes
> >
> > badutf8=# select ctid, id, substring(body from '^.{4}') from  bademail;
> >   ctid  │ id │ substring
> > ───────┼────┼───────────
> >   (0,1) │  1 │ Hi e
> > (1 row)
> >
> > badutf8=# select ctid, id, substring(normalize(body), 1, 4) from  bademail;
> >   ctid  │ id │ substring
> > ───────┼────┼───────────
> >   (0,1) │  1 │ Hi e
> > (1 row)
> >
> > is this expected behaviour and I just have to always noirmalize when
> > loading exotic UTF8 strings ?
>
> Is the body valid UTF-8 or not? If it's not valid, then you shouldn't be
> able to load it into the database in the first place. If it is valid,
> then the substring() should work.
>
> > If you want to replicate this use attached python script to load data
>
> I could not reproduce this. That substring() query after running your
> script works fine for me.
>
> badutf8=# select ctid, id, substring(body, 1, 4) from  bademail;
>   ctid  | id | substring
> -------+----+-----------
>   (0,1) |  1 | Hi e
> (1 row)
>
> Which version did you use? What is the database's encoding and what is
> the client encoding? I used 'master', with UTF-8 as server and client
> encoding.
>
> - Heikki
>


Reply via email to