Hi,

While playing with COPY FROM refactorings in another thread, I noticed corner case where I think backslash escaping doesn't work correctly. Consider the following input:

\么.foo

I hope that came through in this email correctly as UTF-8. The string contains a sequence of: backslash, multibyte-character and a dot.

The documentation says:

Backslash characters (\) can be used in the COPY data to quote data
characters that might otherwise be taken as row or column delimiters

So I believe escaping multi-byte characters is supposed to work, and it usually does.

However, let's consider the same string in Big5 encoding (in hex escaped format):

\x5ca45c2e666f6f

The first byte 0x5c, is the backslash. The multi-byte character consists of two bytes: 0xa4 0x5c. Note that the second byte is equal to a backslash.

That confuses the parser in CopyReadLineText, so that you get an error:

postgres=# create table copytest (t text);
CREATE TABLE
postgres=# \copy copytest from 'big5-skip-test.data' with (encoding 'big5');
ERROR:  end-of-copy marker corrupt
CONTEXT:  COPY copytest, line 1

What happens is that when the parser sees the backslash, it looks ahead at the next byte, and when it's not a dot, it skips over it:

                        else if (!cstate->opts.csv_mode)

                                /*
                                 * If we are here, it means we found a 
backslash followed by
                                 * something other than a period.  In non-CSV 
mode, anything
                                 * after a backslash is special, so we skip 
over that second
                                 * character too.  If we didn't do that \\. 
would be
                                 * considered an eof-of copy, while in non-CSV 
mode it is a
                                 * literal backslash followed by a period.  In 
CSV mode,
                                 * backslashes are not special, so we want to 
process the
                                 * character after the backslash just like a 
normal character,
                                 * so we don't increment in those cases.
                                 */
                                raw_buf_ptr++;

However, in a multi-byte encoding that might "embed" ascii characters, it should skip over the next *character*, not byte.

Attached is a pretty straightforward patch to fix that. Anyone see a problem with this?

- Heikki
>From 632549e79dc98cb338d7fc22888d8b6237a47d10 Mon Sep 17 00:00:00 2001
From: Heikki Linnakangas <heikki.linnakan...@iki.fi>
Date: Wed, 3 Feb 2021 14:00:32 +0200
Subject: [PATCH 1/1] Fix a corner-case in COPY FROM backslash processing.

If a multi-byte character is escaped with a backslash in TEXT mode input,
we didn't always treat the escape correctly. If:

- a multi-byte character is escaped with a backslash, and
- the second byte of the character is 0x5C, i.e. the ASCII code of a
  backslash (\), and
- the next character is a dot (.), then

copyreadline we would incorrectly interpret the sequence as an end-of-copy
marker (\.). this can only happen in encodings that can "embed" ascii
characters as the second byte. one example of such sequence is
'\x5ca45c2e666f6f' in Big5 encoding. If you put that in a file, and
load it with COPY FROM, you'd incorrectly get an "end-of-copy marker
corrupt" error.
---
 src/backend/commands/copyfromparse.c | 13 +++++++++++--
 1 file changed, 11 insertions(+), 2 deletions(-)

diff --git a/src/backend/commands/copyfromparse.c b/src/backend/commands/copyfromparse.c
index 798e18e013..c20ec482db 100644
--- a/src/backend/commands/copyfromparse.c
+++ b/src/backend/commands/copyfromparse.c
@@ -1084,7 +1084,7 @@ CopyReadLineText(CopyFromState cstate)
 				break;
 			}
 			else if (!cstate->opts.csv_mode)
-
+			{
 				/*
 				 * If we are here, it means we found a backslash followed by
 				 * something other than a period.  In non-CSV mode, anything
@@ -1095,8 +1095,17 @@ CopyReadLineText(CopyFromState cstate)
 				 * backslashes are not special, so we want to process the
 				 * character after the backslash just like a normal character,
 				 * so we don't increment in those cases.
+				 *
+				 * In principle, we would need to use pg_encoding_mblen() to
+				 * skip over the character in some encodings, like at the end
+				 * of the loop.  But if it's a multi-byte character, it cannot
+				 * have any special meaning and skipping isn't necessary for
+				 * correctness.  We can fall through to process it normally
+				 * instead.
 				 */
-				raw_buf_ptr++;
+				if (!IS_HIGHBIT_SET(c2))
+					raw_buf_ptr++;
+			}
 		}
 
 		/*
-- 
2.30.0

Reply via email to