Just remembered another thing: a struct "bytebuffer" is used instead of a StringInfoData for storing the line and the attributes. StringInfoData is actually really cool and useful, but we don't really need its formatting capabilities in COPY FROM (as far as I know), so the bytebuffer is more straightforward and faster.
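For reference, the idea is roughly this. It is only a sketch; the exact field names and growth policy in the patch may differ (and it assumes the buffer is created with a non-zero initial maxlen):

/*
 * Rough sketch of the bytebuffer idea: a plain byte buffer with no
 * printf-style formatting support, just append.
 */
typedef struct bytebuffer
{
    char   *buf;        /* buffer memory */
    int     len;        /* bytes currently in use */
    int     maxlen;     /* allocated size of buf */
} bytebuffer;

/*
 * Append a chunk of bytes: a bounds check, possibly a repalloc (this is
 * backend code), and one memcpy. No formatting, no per-character calls.
 */
static void
bytebuffer_append(bytebuffer *bb, const char *data, int datalen)
{
    if (bb->len + datalen > bb->maxlen)
    {
        while (bb->len + datalen > bb->maxlen)
            bb->maxlen *= 2;
        bb->buf = repalloc(bb->buf, bb->maxlen);
    }
    memcpy(bb->buf + bb->len, data, datalen);
    bb->len += datalen;
}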
Alon.


On 6/25/05 3:52 PM, "Alon Goldshuv" <[EMAIL PROTECTED]> wrote:

>
>> I haven't seen any demand for users to be allowed to specify an escape char
>> in text mode (BTW, that's what the docs call what you have called delimited
>> mode). Maybe there's a case for it, but I somewhat doubt it.
>
> I ran into users that could not load their data using COPY because their
> data included a lot of embedded '\\'... These users would like to choose
> another char to be the escape, or maybe even choose to disable escaping
> altogether. However, this is probably a separate discussion that we should
> think about in the future and see if it's worth introducing.
>
>> As for variable names, choose some variant that seems reasonable. If we need
>> to distinguish we should use csv_escapec and text_escapec. But I'm not sure
>> we can't use the same variable name for both cases, unless they would have
>> conflicting scopes.
>
> Cool. I don't think the scopes are conflicting at the moment.
>
>> What I would like to have is a high level description of
>> . how the new text mode code differs from the old text mode code, and
>> . which part of the change is responsible for how much performance gain.
>
> Fair enough!
>
> Where the code differs and why it's faster:
> -------------------------------------------
>
> The changes in the code only touch the *parsing* parts for *text* format.
> All the rest of the COPY process remains the same. More specifically:
>
> CopyFrom() still handles BINARY and CSV, and I extracted TEXT mode into a
> separate function called CopyFromDelimited() (referring to the delimited
> text format). This function calls the rest of the modified functions below.
>
> CopyGetData() was modified to return the number of bytes actually read
> successfully. Rather than asking for a char at a time, the text parsing
> code asks for big blocks of data at a time, and needs to know how much data
> was accepted into the input buffer (input_buf) for parsing. CopyFrom() can
> still call CopyGetData() and ignore the return value, as it did before.
>
> Why it is faster: instead of calling CopyGetData() for every single byte
> (which involves a huge number of function calls, conditionals, and getc()),
> we call it once per buffer (64K) and then parse the buffer.
>
> CopyReadLineBuffered() is a modified version of CopyReadLine() and is used
> by CopyFromDelimited() to load a line at a time into the line buffer
> line_bytebuf. It is very similar to CopyReadLine(), but:
>
> Why it is faster: besides the reason explained above (no char-at-a-time
> reads), this function copies the line of data in its entirety into the line
> buffer, instead of a char at a time with appendStringInfoCharMacro. The
> gain here is very large. Also, instead of checking for '\n', '\r', '\\',
> '.' and so on for every character, we only look for the first character of
> the line-end sequence, and when we find it we look backwards or forwards to
> see whether it is escaped, etc.
>
> CopyReadAllAttrs() is the equivalent of CopyReadAttribute(). However, it
> reads ALL the attributes of the line at once. It stores all of them in a
> byte buffer called attr_bytebuf, and the offsets to the beginnings of the
> attributes in the attr_offsets array. There is a pretty detailed
> description in the function header.
>
> Why it is faster: here lies probably the biggest improvement. For one
> thing, all the work is focused (all attributes parsed at once), so there
> are fewer function calls and conditionals.
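(To illustrate the layout described just above: roughly, and only as a sketch with approximate names, the parsed attributes end up like this, using the bytebuffer struct sketched at the top of this mail.)

/*
 * Sketch only (approximate names, not the patch code): after
 * CopyReadAllAttrs() has parsed a line, all attribute values sit back to
 * back in one byte buffer, and attr_offsets[i] is the offset of the start
 * of value i. Locating a value is then plain pointer arithmetic, with no
 * per-attribute copying or re-parsing.
 */
static char *
attr_value(bytebuffer *attr_buf, int *attr_offsets, int attno)
{
    /* each value is assumed, in this sketch, to be terminated by the parser */
    return attr_buf->buf + attr_offsets[attno];
}

The per-row loop then just hands each such pointer to the type's input function, as before.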
> Further, the length of the null marker is determined ahead of time, to
> save multiple calls to strlen() (you would be surprised how slow that can
> make things on some occasions). Also, again, the attribute data is loaded
> into the attribute buffer in big chunks (one chunk if there are no escapes
> in the data) instead of a character at a time. This is another major
> speedup. So, after all attributes are parsed in one function, FunctionCall3
> finds each of them by pointing into the attribute buffer using the
> attribute location offsets in attr_offsets.
>
> There is a function (at the bottom of the file) I wrote that is like the C
> library strchr(), except that it does not stop scanning when the string
> has embedded nulls in it; it stops only at end-of-line characters or at
> the end of the input buffer. It's pretty simple and fast, but there may be
> a way to optimize it further.
>
> WRT encodings: conversion to the server encoding still happens for every
> line if necessary. However, checking mblen only needs to happen if the
> client encoding is one of the 5 encodings that are not supported as server
> encodings, not in every case where the client and server encodings differ
> (although I don't think that previously hurt performance; it probably
> helped a bit by avoiding extra per-character conditionals). In any case,
> the strchrlen function skips multibyte characters appropriately as well.
>
> If I remember more relevant things I'll post them here.
>
> Hope that helps, please ask more if there are unclear things!
>
> Alon.
>
>> Maybe I have missed that in previous discussion, but this change is
>> sufficiently invasive that I think you owe that to the reviewers.
>>
>> cheers
>>
>> andrew
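P.S. The strchr()-like scanning helper I mention above has roughly this shape. This is only a sketch of the idea; the actual name, signature and the multibyte handling in the patch may differ:

/*
 * Sketch only: scan 'len' bytes of 'buf' for the character 'stop'
 * (e.g. the first character of the line-end sequence). Unlike strchr(),
 * the scan is bounded by 'len' rather than by a terminating '\0', so
 * embedded null bytes in the data do not end it. Returns a pointer to
 * the match, or NULL if the end of the input buffer is reached first;
 * *scanlen is set to the number of bytes scanned. (The real function
 * also skips multibyte characters according to the client encoding,
 * which is omitted here.)
 */
static char *
strchrlen(char *buf, int len, char stop, int *scanlen)
{
    char   *p = buf;
    char   *end = buf + len;

    while (p < end && *p != stop)
        p++;

    *scanlen = (int) (p - buf);
    return (p < end) ? p : NULL;
}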