On Thu, Jan 25, 2024 at 05:45:43PM +0900, Sutou Kouhei wrote: > In <[email protected]> > "Re: Make COPY format extendable: Extract COPY TO format implementations" > on Thu, 25 Jan 2024 12:17:55 +0900, > Michael Paquier <[email protected]> wrote: >> +extern CopyToRoutine CopyToRoutineText; >> +extern CopyToRoutine CopyToRoutineCSV; >> +extern CopyToRoutine CopyToRoutineBinary; >> >> All that should IMO remain in copyto.c and copyfrom.c in the initial >> patch doing the refactoring. Why not using a fetch function instead >> that uses a string in input? Then you can call that once after >> parsing the List of options in ProcessCopyOptions(). > > OK. How about the following for the fetch function > signature? > > extern CopyToRoutine *GetBuiltinCopyToRoutine(const char *format);
Or CopyToRoutineGet()? I am not wedded to my suggestion, got a bad
history with naming things around here.
> We may introduce an enum and use it:
>
> typedef enum CopyBuiltinFormat
> {
> COPY_BUILTIN_FORMAT_TEXT = 0,
> COPY_BUILTIN_FORMAT_CSV,
> COPY_BUILTIN_FORMAT_BINARY,
> } CopyBuiltinFormat;
>
> extern CopyToRoutine *GetBuiltinCopyToRoutine(CopyBuiltinFormat format);
I am not sure that this is necessary as the option value is a string.
> Oh, sorry. I assumed that the comment style was adjusted by
> pgindent.
No worries, that's just something we get used to. I tend to fix a lot
of these things by myself when editing patches.
>> + getTypeBinaryOutputInfo(attr->atttypid, &out_func_oid, &isvarlena);
>> + fmgr_info(out_func_oid, &cstate->out_functions[attnum - 1]);
>>
>> Actually, this split is interesting. It is possible for a custom
>> format to plug in a custom set of out functions. Did you make use of
>> something custom for your own stuff?
>
> I didn't. My PoC custom COPY format handler for Apache Arrow
> just handles integer and text for now. It doesn't use
> cstate->out_functions because cstate->out_functions may not
> return a valid binary format value for Apache Arrow. So it
> formats each value by itself.
I mean, if you use a custom output function, you could tweak things
even more with byteas or such.. If a callback is expected to do
something, like setting the output function OIDs in the start
callback, we'd better document it rather than letting that be implied.
>> Actually, could it make sense to
>> split the assignment of cstate->out_functions into its own callback?
>
> Yes. Because we need to use getTypeBinaryOutputInfo() for
> "binary" and use getTypeOutputInfo() for "text" and "csv".
Okay. After sleeping on it, a split makes sense here, because it also
reduces the presence of TupleDesc in the start callback.
>> Sure, that's part of the start phase, but at least it would make clear
>> that a custom method *has* to assign these OIDs to work. The patch
>> implies that as a rule, without a comment that CopyToStart *must* set
>> up these OIDs.
>
> CopyToStart doesn't need to set up them if the handler
> doesn't use cstate->out_functions.
Noted.
>> I think that 0001 and 0005 should be handled first, as pieces
>> independent of the rest. Then we could move on with 0002~0004 and
>> 0006~0008.
>
> OK. I'll focus on 0001 and 0005 for now. I'll restart
> 0002-0004/0006-0008 after 0001 and 0005 are accepted.
Once you get these, I'd be interested in re-doing an evaluation of
COPY TO and more tests with COPY FROM while running Postgres on
scissors. One thing I was thinking to use here is my blackhole_am for
COPY FROM:
https://github.com/michaelpq/pg_plugins/tree/main/blackhole_am
As per its name, it does nothing on INSERT, so you could create a
table using it as access method, and stress the COPY FROM execution
paths without having to mount Postgres on a tmpfs because the data is
sent to the void. Perhaps it does not matter, but that moves the
tests to the bottlenecks we want to stress (aka the per-row callback
for large data sets).
I've switched the patch as waiting on author for now. Thanks for your
perseverance here. I understand that's not easy to follow up with
patches and reviews (^_^;)
--
Michael
signature.asc
Description: PGP signature
