Hi hackers,

This thread is about implementing a new "raw" COPY format.
This idea came up in a different thread [1] and has been moved here.

[1] https://postgr.es/m/47b5c6a7-5c0e-40aa-8ea2-c7b95ccf296f%40app.fastmail.com

The main use case for the raw format is importing arbitrary unstructured
text files, such as log files, into a single text column of a table.

The name "raw" is just a working title. Andrew had some other good name ideas:

> WFM, so something like FORMAT {SIMPLE, RAW, FAST, SINGLE}?

Below is the draft of its description, sent previously [1], adjusted thanks to
feedback from Daniel Verite, who made me realize the HEADER option should
also be made available for this format.

--- START OF DESCRIPTION ---

Raw Format

The "raw" format is used for importing and exporting files containing
unstructured text, where each line is treated as a single field. This format
is ideal when dealing with data that doesn't conform to a structured, tabular
format and lacks delimiters.

Key Characteristics:

- No Field Delimiters:
  Each line is considered a complete value without any field separation.

- Single Column Requirement:
  The COPY command must specify exactly one column when using the raw format.
  Specifying multiple columns will result in an error.

- Literal Data Interpretation:
  All characters are taken literally. There is no special handling for quotes,
  backslashes, or escape sequences.

- No NULL Distinction:
  Empty lines are imported as empty strings, not as NULL values.

Notes:

- Error Handling:
  An error will occur if you use the raw format without specifying exactly
  one column, or if the table has multiple columns and no column list is
  provided.

- Data Preservation:
  All characters, including whitespace and special characters, are preserved
  exactly as they appear in the file.

--- END OF DESCRIPTION ---

After having studied the code that will be affected, I feel that before making
any changes, I would first like to try to improve ProcessCopyOptions in terms
of readability and maintainability.
This seems possible by just reorganizing it a bit. It is actually already
organized quite nicely: the code is mostly grouped per option, but not always,
as the handling of some options is spread across different sections.

It seems possible to organize even more of it per option, which would make it
easier to reason about each option separately. The idea is to place the checks
for each option under a single if-branch for that option, and to move the
setting of its default (when applicable) to the corresponding else-branch.

This would also avoid setting defaults for options that are not applicable to
a given format; their initial NULL value would simply remain untouched.

Some of the checks depend on multiple options in an interdependent way and do
not belong to one specific option more than another. I think such checks would
be nice to place at the end, under a separate section.

I also think it would be more readable to use the existing bool variables
named [option]_specified to determine whether an option has been set, rather
than relying on the option's default enum value to evaluate to false.

The attached patch implements the above ideas.

I think with these changes, it would be easier to hack on new and existing
copy options and formats.

/Joel
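To illustrate the per-option layout described above, here is a minimal
standalone C sketch. It is not the attached patch and not the real
ProcessCopyOptions: the SketchFormat enum, the SketchCopyOpts struct, and
sketch_process_options are hypothetical stand-ins, only two options are
modelled, and the error strings are illustrative. Errors are returned as
strings rather than raised, purely to keep the example self-contained.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical, simplified stand-ins; not the real PostgreSQL types. */
typedef enum { FMT_TEXT, FMT_CSV, FMT_BINARY } SketchFormat;

typedef struct
{
    SketchFormat format;
    const char *delim;          /* stays NULL unless set or defaulted */
    const char *null_print;     /* stays NULL unless set or defaulted */
    bool        delim_specified;
    bool        null_specified;
} SketchCopyOpts;

/* Returns NULL on success, or an (illustrative) error message. */
static const char *
sketch_process_options(SketchCopyOpts *o)
{
    assert(o != NULL);

    /* --- DELIMITER: checks in the if-branch, default in the else-branch --- */
    if (o->delim_specified)
    {
        if (o->format == FMT_BINARY)
            return "cannot specify DELIMITER in BINARY mode";
        if (strlen(o->delim) != 1)
            return "COPY delimiter must be a single one-byte character";
    }
    else if (o->format != FMT_BINARY)
        o->delim = (o->format == FMT_CSV) ? "," : "\t";
    /* For BINARY the option is not applicable, so it is left as NULL. */

    /* --- NULL: same shape as above --- */
    if (o->null_specified)
    {
        if (o->format == FMT_BINARY)
            return "cannot specify NULL in BINARY mode";
    }
    else if (o->format != FMT_BINARY)
        o->null_print = (o->format == FMT_CSV) ? "" : "\\N";

    /* --- Checks that depend on several options go last --- */
    if (o->delim != NULL && o->null_print != NULL &&
        strchr(o->null_print, o->delim[0]) != NULL)
        return "delimiter must not appear in the NULL specification";

    return NULL;
}
```

Each option's validation and default live side by side, the [option]_specified
flags decide which branch runs, and the one check that involves both options
sits in its own section at the end.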
v1-0001-Replace-binary-flags-binary-and-csv_mode-with-format.patch
v1-0002-Reorganize-ProcessCopyOptions-for-clarity-and-consis.patch