> I haven't seen any demand for users to be allowed to specify an escape char
> in text mode (BTW, that's what the docs call what you have called delimited
> mode). Maybe there's a case for it, but I somewhat doubt it.

I ran into users who could not load their data using COPY because the data
included a lot of embedded '\\' characters... These users would like to
choose another char to be the escape, or maybe even to disable escaping
altogether. However, this is probably a separate discussion that we should
think about in the future and see if it's worth introducing.
> As for variable names, choose some variant that seems reasonable. If we need
> to distinguish we should use csv_escapec and text_escapec. But I'm not sure
> we can't use the same variable name for both cases, unless they would have
> conflicting scopes.
Cool. I don't think the scopes are conflicting at the moment.
> What I would like to have is a high level description of
> . how the new text mode code differs from the old text mode code, and
> . which part of the change is responsible for how much performance gain.

Fair enough!

Where the code differs and why it's faster:

The changes in the code only touch the *parsing* parts for *text* format.
All the rest of the COPY process remains the same. More specifically:

CopyFrom() still handles BINARY and CSV, and I extracted TEXT mode into
another function called CopyFromDelimited() (referring to the delimited text
format). This function calls the rest of the modified functions below:

CopyGetData() was modified to return the number of bytes actually read
successfully. Rather than asking for a char at a time, the text parsing code
asks for big blocks of data at a time, and needs to know how much data was
accepted into the input buffer (input_buf) for parsing. CopyFrom() can
still call CopyGetData() and ignore the return value, as it did before.

Why is it faster: instead of calling CopyGetData() for every single byte
(which involves a huge number of function calls, conditionals, and getc()
calls), we call this function once per buffer (64K) and then start parsing
the buffer.

CopyReadLineBuffered() is a modified version of CopyReadLine() and is used
by CopyFromDelimited() to load a line at a time into the line buffer
line_bytebuf. It is very similar to CopyReadLine(), but:

Why is it faster: besides the reason I explained above (no char-at-a-time
reads), this function copies the line of data in its entirety to the line
buffer, instead of a char at a time with AppendStringInfoMacro. The gain
here is very large. Also, instead of checking "if '\n', if '\r', if '\\',
if '.'..." for every character, we only look for the first character of the
line-end sequence, and when we find it we look backwards or forwards to see
whether it's escaped, etc.

CopyReadAllAttrs() is the equivalent of CopyReadAttribute(); however, it
reads ALL the attributes of the line at once. It stores all of them in a
byte buffer called attr_bytebuf, with the offsets to the beginnings of the
attributes in the attr_offsets array. There is a pretty detailed description
in the function header.

Why is it faster: here lies probably the biggest improvement. For one thing,
all the work is focused (all attributes are parsed at once), so there are
fewer function calls and conditionals. Further, the null mark's length is
determined ahead of time, saving repeated calls to strlen() (you'd be
surprised how slow that can make things on some occasions). Also, again, the
attribute data is loaded into the attribute buffer in big chunks (one chunk
if there are no escapes in the data) instead of a character at a time; this
is another major speedup. So, after all attributes are parsed in one
function, FunctionCall3 finds them by pointing into the attribute buffer
using the attribute offsets in attr_offsets.

There is a function (at the bottom of the file) I wrote that is like the C
library strchr(), except that it does not stop scanning when the string has
embedded NULs; it stops only at end-of-line characters or at the end of the
input buffer. It's pretty simple and fast, but there may be a way to
optimize it further.

WRT encodings: conversion to the server encoding still happens for every
line if necessary. However, checking mblen only needs to happen if our
client encoding is one of the 5 client-only encodings (those not supported
as server encodings), not in every case where the encodings differ
(although I don't think the old check hurt performance; it probably helped
a bit by avoiding extra per-character conditionals). In any case, the
strchrlen function skips multibyte characters appropriately as well.

If I remember more relevant things I'll post them here.

Hope that helps, please ask more if there are unclear things!


> Maybe I have missed that in previous discussion, but this change is
> sufficiently invasive that I think you owe that to the reviewers.
> cheers
> andrew
