Re: Make COPY format extendable: Extract COPY TO format implementations

Andres Freund Fri, 09 Feb 2024 11:27:29 -0800

Hi,

On 2024-02-09 13:21:34 +0900, Michael Paquier wrote:
> +static void
> +CopyFromTextInFunc(CopyFromState cstate, Oid atttypid,
> +                                FmgrInfo *finfo, Oid *typioparam)
> +{
> +     Oid                     func_oid;
> +
> +     getTypeInputInfo(atttypid, &func_oid, typioparam);
> +     fmgr_info(func_oid, finfo);
> +}


FWIW, we should really change the copy code to initialize FunctionCallInfoData
instead of re-initializing that on every call, realy makes a difference
performance wise.


> +/*
> + * CopyFromTextStart
> + *
> + * Start of COPY FROM for text/CSV format.
> + */
> +static void
> +CopyFromTextStart(CopyFromState cstate, TupleDesc tupDesc)
> +{
> +     AttrNumber      attr_count;
> +
> +     /*
> +      * If encoding conversion is needed, we need another buffer to hold the
> +      * converted input data.  Otherwise, we can just point input_buf to the
> +      * same buffer as raw_buf.
> +      */
> +     if (cstate->need_transcoding)
> +     {
> +             cstate->input_buf = (char *) palloc(INPUT_BUF_SIZE + 1);
> +             cstate->input_buf_index = cstate->input_buf_len = 0;
> +     }
> +     else
> +             cstate->input_buf = cstate->raw_buf;
> +     cstate->input_reached_eof = false;
> +
> +     initStringInfo(&cstate->line_buf);

Seems kinda odd that we have a supposedly extensible API that then stores all
this stuff in the non-extensible CopyFromState.


> +     /* create workspace for CopyReadAttributes results */
> +     attr_count = list_length(cstate->attnumlist);
> +     cstate->max_fields = attr_count;

Why is this here? This seems like generic code, not text format specific.


> +     cstate->raw_fields = (char **) palloc(attr_count * sizeof(char *));
> +     /* Set read attribute callback */
> +     if (cstate->opts.csv_mode)
> +             cstate->copy_read_attributes = CopyReadAttributesCSV;
> +     else
> +             cstate->copy_read_attributes = CopyReadAttributesText;
> +}

Isn't this precisely repeating the mistake of 2889fd23be56?

And, why is this done here? Shouldn't this decision have been made prior to
even calling CopyFromTextStart()?

> +/*
> + * CopyFromTextOneRow
> + *
> + * Copy one row to a set of `values` and `nulls` for the text and CSV
> + * formats.
> + */

I'm very doubtful it's a good idea to combine text and CSV here. They have
basically no shared parsing code, so what's the point in sending them through
one input routine?


> +bool
> +CopyFromTextOneRow(CopyFromState cstate,
> +                                ExprContext *econtext,
> +                                Datum *values,
> +                                bool *nulls)
> +{
> +     TupleDesc       tupDesc;
> +     AttrNumber      attr_count;
> +     FmgrInfo   *in_functions = cstate->in_functions;
> +     Oid                *typioparams = cstate->typioparams;
> +     ExprState **defexprs = cstate->defexprs;
> +     char      **field_strings;
> +     ListCell   *cur;
> +     int                     fldct;
> +     int                     fieldno;
> +     char       *string;
> +
> +     tupDesc = RelationGetDescr(cstate->rel);
> +     attr_count = list_length(cstate->attnumlist);
> +
> +     /* read raw fields in the next line */
> +     if (!NextCopyFromRawFields(cstate, &field_strings, &fldct))
> +             return false;
> +
> +     /* check for overflowing fields */
> +     if (attr_count > 0 && fldct > attr_count)
> +             ereport(ERROR,
> +                             (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                              errmsg("extra data after last expected 
> column")));

It bothers me that we look to be ending up with different error handling
across the various output formats, particularly if they're ending up in
extensions. That'll make it harder to evolve this code in the future.


> +     fieldno = 0;
> +
> +     /* Loop to read the user attributes on the line. */
> +     foreach(cur, cstate->attnumlist)
> +     {
> +             int                     attnum = lfirst_int(cur);
> +             int                     m = attnum - 1;
> +             Form_pg_attribute att = TupleDescAttr(tupDesc, m);
> +
> +             if (fieldno >= fldct)
> +                     ereport(ERROR,
> +                                     (errcode(ERRCODE_BAD_COPY_FILE_FORMAT),
> +                                      errmsg("missing data for column 
> \"%s\"",
> +                                                     
> NameStr(att->attname))));
> +             string = field_strings[fieldno++];
> +
> +             if (cstate->convert_select_flags &&
> +                     !cstate->convert_select_flags[m])
> +             {
> +                     /* ignore input field, leaving column as NULL */
> +                     continue;
> +             }
> +
> +             cstate->cur_attname = NameStr(att->attname);
> +             cstate->cur_attval = string;
> +
> +             if (cstate->opts.csv_mode)
> +             {

More unfortunate intermingling of multiple formats in a single routine.


> +
> +             if (cstate->defaults[m])
> +             {
> +                     /*
> +                      * The caller must supply econtext and have switched 
> into the
> +                      * per-tuple memory context in it.
> +                      */
> +                     Assert(econtext != NULL);
> +                     Assert(CurrentMemoryContext == 
> econtext->ecxt_per_tuple_memory);
> +
> +                     values[m] = ExecEvalExpr(defexprs[m], econtext, 
> &nulls[m]);
> +             }

I don't think it's good that we end up with this code in different copy
implementations.

Greetings,

Andres Freund

Re: Make COPY format extendable: Extract COPY TO format implementations

Reply via email to