Tom, Thanks for finding the bugs and reworking things.
> I had some difficulty in generating test cases that weren't largely > I/O-bound, but AFAICT the patch as applied is about the same speed > as what you submitted. You achieve the important objective of knocking the parsing stage down a lot, but your parsing code is actually about 20% slower than Alon's. Before your patch: Time: 14205.606 ms With your patch: Time: 10565.374 ms With Alon's patch: Time: 10289.845 ms The parsing part of the code in your version is slower, but as a percentage of the total it's hidden. The loss of 0.3 seconds on 143MB means: - If parsing takes a total of 0.9 seconds, the parsing rate is 160MB/s (143/0.9) - If we add another 0.3 seconds to parsing to bring it to 1.2, then the parsing rate becomes 120MB/s When we improve the next stages of the processing (attribute conversion, write-to disk), this will stand out a lot more. Our objective is to get the COPY rate *much* faster than the current poky rate of 14MB/s (after this patch). - Luke On 8/6/05 2:04 PM, "Tom Lane" <[EMAIL PROTECTED]> wrote: > "Alon Goldshuv" <[EMAIL PROTECTED]> writes: >> New patch attached. It includes very minor changes. These are changes that >> were committed to CVS 3 weeks ago (copy.c 1.247) which I missed in the >> previous patch. > > I've applied this with (rather extensive) revisions. I didn't like what > you had done with the control structure --- loading the input buffer > only at the outermost loop level was a bad design choice IMHO. You had > sprinkled the code with an unreasonable number of special cases in order > to try to cope with the effects of that mistake, but there were lots > of problems still left. Some of the bugs I noticed: > > * Broke old-protocol COPY, since that has no provision for stopping at > the EOF marker except by parsing the data carefully to start with. The > backend would just hang up unless the total data size chanced to be a > multiple of 64K. > > * Subtle change in interpretation of \. EOF marker (the existing code > will recognize it even when not at start of line). > > * Seems to have thrown away detection of newline format discrepancies. > > * Fails for zero-column tables. > > * Broke display of column values during error context callback (would > always show the first column contents no matter which one is being > complained of). > > * DetectLineEnd mistakenly assumes CR mode if very last character of first > bufferload is CR; need to reserve judgment until next char is available. > > * DetectLineEnd fails to account for backslashed control characters, > so it will e.g. accept \ followed by \n as determining the newline > style. > > * Fails to apply encoding conversion if first line exceeds copy buf > size, because when DetectLineEnd fails the quick-exit path doesn't do > it. > > * There seem to be several bugs associated with the fact that input_buf[] > always has 64K of data in it even when EOF has been reached on the > input. One example: > echo -n 123 >zzz1 > psql> create temp table t1(f1 text); > psql> copy t1 from '/home/tgl/zzz1'; > psql> select * from t1; > hmm ... where'd that 64K of whitespace come from? > > I rewrote the patch in a way that retained most of the speedups without > changing the basic control structure (except for replacing multiple > CopyReadAttribute calls with one CopyReadAttributes call per line). > > I had some difficulty in generating test cases that weren't largely > I/O-bound, but AFAICT the patch as applied is about the same speed > as what you submitted. > > regards, tom lane > > ---------------------------(end of broadcast)--------------------------- > TIP 2: Don't 'kill -9' the postmaster > ---------------------------(end of broadcast)--------------------------- TIP 6: explain analyze is your friend