Re: New "raw" COPY format

Joel Jacobson Tue, 29 Oct 2024 09:49:19 -0700

On Mon, Oct 28, 2024, at 18:50, Masahiko Sawada wrote:
> Thank you for updating the patch. Here are review comments on the v15
> 0002 patch:


Thanks for review.

> When testing the patch with an empty delimiter, I got the following failure:
>
> postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw',
> delimiter '');
> TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line:
> 1173, PID: 903898

Fixed.

> ---
> -           else
> +           else if (cstate->opts.format == COPY_FORMAT_TEXT)
>                 fldct = CopyReadAttributesText(cstate);
> +           else
> +           {
> +               elog(ERROR, "unexpected COPY format: %d", 
> cstate->opts.format);
> +               pg_unreachable();
> +           }
>
> Since we already check the incompatible options with COPY_FORMAT_RAW
> and default_print, I think it's better to add an assertion to make
> sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead
> of using elog(ERROR).

I agree, fixed.

> ---
> +/*
> + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode
> + */
> +static bool
> +CopyReadLineRawText(CopyFromState cstate)
>
> This function has a lot of duplication with CopyReadLineText(). I
> think it's better to modify CopyReadLineText() to support 'raw'
> format, rather than adding a separate function.

Hmm, there is a bit of duplication, yes, but is also a hot-path,
so I think we want to minimize branches and code size in the
hot loop.

Combining them into one function, would mean the total function
size and branching increases for both cases.

I haven't made any benchmarks on this though.

> ---
> +   bool        read_entire_file = (cstate->opts.delim == NULL);
> +   int         delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : 
> 0;
>
> I think we can use 'delim_len == 0' instead of read_entire_file.

Fixed.

> ---
> +       if (read_entire_file)
> +       {
> +           /* Continue until EOF if reading entire file */
> +           input_buf_ptr++;
> +           continue;
> +       }
>
> In the case where we're reading the entire file as a single tuple, we
> don't need to advance the input_buf_ptr one by one. Instead,
> input_buf_ptr can jump to copy_buf_len, which is faster.

Fixed.

> ---
> +           /* Check for delimiter, possibly multi-byte */
> +           IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1);
> +           if (strncmp(&copy_input_buf[input_buf_ptr], cstate->opts.delim,
> +                       delim_len) == 0)
> +           {
> +               cstate->eol_type = EOL_CUSTOM;
> +               input_buf_ptr += delim_len;
> +               break;
> +           }
> +           input_buf_ptr++;
>
> Similar to the above comment, I think we don't need to check the char
> one by one. I guess that it would be faster if we locate the delimiter
> string in the intput_buf (e.g. using strstr()), and then move
> input_buf_ptr to the detected position.

Fixed.

>
> ---
> +   /* Copy the entire line into attribute_buf */
> +   memcpy(cstate->attribute_buf.data, cstate->line_buf.data,
> +          cstate->line_buf.len);
> +   cstate->attribute_buf.data[cstate->line_buf.len] = '\0';
> +   cstate->attribute_buf.len = cstate->line_buf.len;
>
> The CopyReadAttributesRaw() just copies line_buf data to
> attirbute_buf, which seems to be a waste. I think we can have
> attribute_buf point to the line_buf. That way, we can skip the whole
> step 4 that is described in the comment on top o f copyfromparse.c:
>
> * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf
> *                1.          2.            3.           4.
>

Fixed. I've removed CopyReadAttributesRaw() entirely.

> ---
> +static int
> +CopyReadAttributesRaw(CopyFromState cstate)
> +{
> +   /* Enforce single column requirement */
> +   if (cstate->max_fields != 1)
> +   {
> +       ereport(ERROR,
> +               (errcode(ERRCODE_INVALID_PARAMETER_VALUE),
> +                errmsg("COPY with format 'raw' must specify exactly
> one column")));
> +   }
>
> This check should have already been done in BeginCopyFrom(). Is there
> any case where max_fields gets to != 1 during reading the input?

Good point. Removed.

>
> ---
> It's a bit odd to me to use the delimiter as a EOL marker in raw
> format, but probably it's okay.
>
> ---
> -           if (cstate->opts.format != COPY_FORMAT_BINARY)
> +           if (cstate->opts.format == COPY_FORMAT_RAW &&
> +               cstate->opts.delim != NULL)
> +           {
> +               /* Output the user-specified delimiter between rows */
> +               CopySendString(cstate, cstate->opts.delim);
> +           }
> +           else if (cstate->opts.format == COPY_FORMAT_TEXT ||
> +                    cstate->opts.format == COPY_FORMAT_CSV)
>
> Since it sends the delimiter as a string, even if we specify the
> delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10).
> For example,
>
> postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with
> (format 'raw', delimiter '\n');
> {"j": 1}\npostgres(1:904427)=#
>
> I think there is a similar problem in COPY FROM; if we set a delimiter
> to '\n' when doing COPY FROM in raw format, it expects the string '\n'
> as a line termination but not ASCII LF(10). I think that input data
> normally doesn't use the string '\n' as a line termination.

You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter
consisting of backslash followed by "n".

Is this a problem? Since any string can be used as delimiter,
I think it would be strange if we parsed it and replaced the string
with a different string.

Another thought:

Maybe we shouldn't default to no delimiter after all,
maybe it would be better to default to the OS default EOL,
and maybe a final delimiter should always be written at the end,
so that when exporting a single json field, it would get exported
to the text file with \n at the end, which is what most text editor
does when saving a .json file.

/Joel

v16-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
Description: Binary data

v16-0002-Add-raw-format-to-COPY-command.patch
Description: Binary data

v16-0003-Reorganize-option-validations.patch
Description: Binary data

Re: New "raw" COPY format

Reply via email to