On Mon, Oct 28, 2024, at 18:50, Masahiko Sawada wrote: > Thank you for updating the patch. Here are review comments on the v15 > 0002 patch:
Thanks for review. > When testing the patch with an empty delimiter, I got the following failure: > > postgres(1:903898)=# copy hoge from '/tmp/tmp.raw' with (format 'raw', > delimiter ''); > TRAP: failed Assert("delim_len > 0"), File: "copyfromparse.c", Line: > 1173, PID: 903898 Fixed. > --- > - else > + else if (cstate->opts.format == COPY_FORMAT_TEXT) > fldct = CopyReadAttributesText(cstate); > + else > + { > + elog(ERROR, "unexpected COPY format: %d", > cstate->opts.format); > + pg_unreachable(); > + } > > Since we already check the incompatible options with COPY_FORMAT_RAW > and default_print, I think it's better to add an assertion to make > sure the format is either COPY_FORMAT_CSV or COPY_FORMAT_TEXT, instead > of using elog(ERROR). I agree, fixed. > --- > +/* > + * CopyReadLineRawText - inner loop of CopyReadLine for raw text mode > + */ > +static bool > +CopyReadLineRawText(CopyFromState cstate) > > This function has a lot of duplication with CopyReadLineText(). I > think it's better to modify CopyReadLineText() to support 'raw' > format, rather than adding a separate function. Hmm, there is a bit of duplication, yes, but is also a hot-path, so I think we want to minimize branches and code size in the hot loop. Combining them into one function, would mean the total function size and branching increases for both cases. I haven't made any benchmarks on this though. > --- > + bool read_entire_file = (cstate->opts.delim == NULL); > + int delim_len = cstate->opts.delim ? strlen(cstate->opts.delim) : > 0; > > I think we can use 'delim_len == 0' instead of read_entire_file. Fixed. > --- > + if (read_entire_file) > + { > + /* Continue until EOF if reading entire file */ > + input_buf_ptr++; > + continue; > + } > > In the case where we're reading the entire file as a single tuple, we > don't need to advance the input_buf_ptr one by one. Instead, > input_buf_ptr can jump to copy_buf_len, which is faster. Fixed. > --- > + /* Check for delimiter, possibly multi-byte */ > + IF_NEED_REFILL_AND_NOT_EOF_CONTINUE(delim_len - 1); > + if (strncmp(©_input_buf[input_buf_ptr], cstate->opts.delim, > + delim_len) == 0) > + { > + cstate->eol_type = EOL_CUSTOM; > + input_buf_ptr += delim_len; > + break; > + } > + input_buf_ptr++; > > Similar to the above comment, I think we don't need to check the char > one by one. I guess that it would be faster if we locate the delimiter > string in the intput_buf (e.g. using strstr()), and then move > input_buf_ptr to the detected position. Fixed. > > --- > + /* Copy the entire line into attribute_buf */ > + memcpy(cstate->attribute_buf.data, cstate->line_buf.data, > + cstate->line_buf.len); > + cstate->attribute_buf.data[cstate->line_buf.len] = '\0'; > + cstate->attribute_buf.len = cstate->line_buf.len; > > The CopyReadAttributesRaw() just copies line_buf data to > attirbute_buf, which seems to be a waste. I think we can have > attribute_buf point to the line_buf. That way, we can skip the whole > step 4 that is described in the comment on top o f copyfromparse.c: > > * [data source] --> raw_buf --> input_buf --> line_buf --> attribute_buf > * 1. 2. 3. 4. > Fixed. I've removed CopyReadAttributesRaw() entirely. > --- > +static int > +CopyReadAttributesRaw(CopyFromState cstate) > +{ > + /* Enforce single column requirement */ > + if (cstate->max_fields != 1) > + { > + ereport(ERROR, > + (errcode(ERRCODE_INVALID_PARAMETER_VALUE), > + errmsg("COPY with format 'raw' must specify exactly > one column"))); > + } > > This check should have already been done in BeginCopyFrom(). Is there > any case where max_fields gets to != 1 during reading the input? Good point. Removed. > > --- > It's a bit odd to me to use the delimiter as a EOL marker in raw > format, but probably it's okay. > > --- > - if (cstate->opts.format != COPY_FORMAT_BINARY) > + if (cstate->opts.format == COPY_FORMAT_RAW && > + cstate->opts.delim != NULL) > + { > + /* Output the user-specified delimiter between rows */ > + CopySendString(cstate, cstate->opts.delim); > + } > + else if (cstate->opts.format == COPY_FORMAT_TEXT || > + cstate->opts.format == COPY_FORMAT_CSV) > > Since it sends the delimiter as a string, even if we specify the > delimiter to '\n', it doesn't send the new line (i.e. ASCII LF, 10). > For example, > > postgres(1:904427)=# copy (select '{"j" : 1}'::jsonb) to stdout with > (format 'raw', delimiter '\n'); > {"j": 1}\npostgres(1:904427)=# > > I think there is a similar problem in COPY FROM; if we set a delimiter > to '\n' when doing COPY FROM in raw format, it expects the string '\n' > as a line termination but not ASCII LF(10). I think that input data > normally doesn't use the string '\n' as a line termination. You need to use E'\n' to get ASCII LF(10), since '\n' is just a delimiter consisting of backslash followed by "n". Is this a problem? Since any string can be used as delimiter, I think it would be strange if we parsed it and replaced the string with a different string. Another thought: Maybe we shouldn't default to no delimiter after all, maybe it would be better to default to the OS default EOL, and maybe a final delimiter should always be written at the end, so that when exporting a single json field, it would get exported to the text file with \n at the end, which is what most text editor does when saving a .json file. /Joel
v16-0001-Introduce-CopyFormat-and-replace-csv_mode-and-binary.patch
Description: Binary data
v16-0002-Add-raw-format-to-COPY-command.patch
Description: Binary data
v16-0003-Reorganize-option-validations.patch
Description: Binary data