[jira] [Comment Edited] (CASSANDRA-21381) CSV COPY TO corrupts control characters (newline, null byte, etc.) in text values

Brad Schoening (Jira) Wed, 27 May 2026 19:54:26 -0700


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18083997#comment-18083997
 ]


Brad Schoening edited comment on CASSANDRA-21381 at 5/28/26 2:53 AM:
---------------------------------------------------------------------

[~jensg] [~arvindk12] 
CSV as standardized in [RFC 
4180|https://datatracker.ietf.org/doc/html/rfc4180] doesn't support arbitrary 
control characters. The ABNF excludes them by omission: TEXTDATA covers only 
*printable* ASCII characters, and CR and LF are the only control characters 
the grammar admits, and then only inside a quoted field:
{code:sh}
CR = %x0D
LF = %x0A
TEXTDATA =  %x20-21 / %x23-2B / %x2D-7E
{code}
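
To make the TEXTDATA rule concrete, here is a minimal sketch of a check against that character set. The names {{TEXTDATA_RE}} and {{is_rfc4180_textdata}} are made up for illustration and are not part of cqlsh or the RFC:

```python
import re

# RFC 4180 TEXTDATA: %x20-21 / %x23-2B / %x2D-7E
# i.e. printable ASCII minus the double quote (0x22) and the comma (0x2C).
TEXTDATA_RE = re.compile(r"[\x20-\x21\x23-\x2B\x2D-\x7E]*")


def is_rfc4180_textdata(value):
    """Return True if every character of value falls inside TEXTDATA."""
    return TEXTDATA_RE.fullmatch(value) is not None


for sample in ("hello world", "a,b", "line1\nline2", "nul\x00byte"):
    print(repr(sample), is_rfc4180_textdata(sample))
```

Commas, double quotes, CR and LF fail this check but are still legal inside a quoted ("escaped") field; other control characters such as the null byte have no place in the grammar at all.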

Python's {{csv}} module supports multiple "dialects" — on my macOS install 
there are three:

{code:python}
>>> import csv
>>> csv.list_dialects()
['excel', 'excel-tab', 'unix']
{code}
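
As a rough illustration of how those dialects treat an embedded newline, the sketch below writes a one-field row with each built-in dialect and parses it back. The helper {{round_trip}} is hypothetical; only {{csv.writer}}, {{csv.reader}} and {{io.StringIO}} are standard library calls:

```python
import csv
import io


def round_trip(value, dialect):
    """Write a single-field row with the given dialect, then parse it back."""
    buf = io.StringIO(newline="")  # newline="" keeps line endings untranslated
    csv.writer(buf, dialect=dialect).writerow([value])
    encoded = buf.getvalue()
    reader = csv.reader(io.StringIO(encoded, newline=""), dialect=dialect)
    return encoded, next(reader)[0]


for name in ("excel", "excel-tab", "unix"):
    encoded, decoded = round_trip("line1\nline2", name)
    # The field is quoted on output and the newline survives the round trip.
    print(name, repr(encoded), decoded == "line1\nline2")
```

The dialects differ in delimiter, quoting policy and line terminator, which is why the choice of dialect affects what downstream tools see.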

CQLSH doesn't support selecting the dialect, and as a database interpreter, it 
probably shouldn't expose Python-specific settings such as {{csv}} dialects.

If not strictly RFC 4180, what definition of TEXTDATA should we support for 
CSV? The CSV exports from CQLSH should stay close to the standard so they can 
be consumed as widely as possible — Java, C, Python, Excel, and so on. Allowing 
arbitrary text (control characters, embedded nulls, etc.) would make our output 
less portable and push the burden of handling it onto every downstream tool.


> CSV COPY TO corrupts control characters (newline, null byte, etc.) in text 
> values
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21381
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21381
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: Jens Geyer
>            Assignee: Arvind Kandpal
>            Priority: Normal
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Problem
> During COPY TO, control characters in text column values are replaced with 
> their Python repr() notation by 
> UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...) in 
> {{format_value_text}} ({{pylib/cqlshlib/formatting.py}}).
> Examples:
> * A stored newline (0x0A) becomes the two-character sequence {{\n}} in the 
> CSV; after COPY FROM it is stored as {{\n}} (backslash + n) -- a different 
> value.
> * A null byte (0x00) becomes {{\x00}} (four characters: backslash, x, 0, 0).
> The regex {{UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")}} 
> matches all C0 control characters (0x00-0x1F: newline, tab, carriage 
> return, BEL, etc.), DEL (0x7F), the C1 control characters (0x80-0x9F), and 
> the no-break space (0xA0).
> This substitution is correct for terminal display of SELECT results (where 
> invisible characters need a human-readable representation). It is incorrect 
> in the *CSV export path*, where {{csv.writer}} handles control characters 
> correctly via field quoting -- no pre-processing is needed.
> h2. Affected Versions
> All Cassandra versions with {{format_value_text}} containing the 
> {{UNICODE_CONTROLCHARS_RE}} substitution (at minimum 3.x through trunk).
> h2. Root Cause
> {{format_value_text}} is shared between the terminal display path (SELECT 
> output) and the CSV export path (COPY TO). The {{UNICODE_CONTROLCHARS_RE}} 
> substitution converts control characters to their Python repr-string for 
> display, but this transformation is *not reversible* via the CSV import path.
> This bug is *independent of, but in the same function as*, the 
> backslash-doubling bug fixed in CASSANDRA-21131. Applying the CASSANDRA-21131 
> patch does NOT fix this issue.
> h2. Expected Fix
> In the CSV export path, skip the {{UNICODE_CONTROLCHARS_RE.sub(...)}} call. 
> An {{escape_control_chars}} parameter (analogous to the {{escape_backslash}} 
> parameter introduced by CASSANDRA-21131) can conditionally suppress the 
> substitution when calling {{format_value_text}} from the CSV export path.
> h2. Related
> CASSANDRA-21131 -- backslash-doubling bug in the same code path, already 
> patched.
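
The behavior described in the quoted issue can be sketched roughly as follows. This is not the cqlshlib code: {{format_text}} and {{show_control_char}} are simplified, hypothetical stand-ins for {{format_value_text}} and {{_show_control_chars}}, the {{escape_control_chars}} parameter is the one proposed in the issue, and a plain {{csv.reader}} stands in for the COPY FROM path. Only the regex is taken verbatim from the issue text.

```python
import csv
import io
import re

# Regex as quoted in the issue description.
UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")


def show_control_char(match):
    """Stand-in for _show_control_chars: repr() of the char, quotes stripped."""
    return repr(match.group(0))[1:-1]


def format_text(value, escape_control_chars=True):
    """Hypothetical, simplified stand-in for format_value_text."""
    if escape_control_chars:
        return UNICODE_CONTROLCHARS_RE.sub(show_control_char, value)
    return value


def csv_round_trip(field):
    """Write one field with csv.writer and read it back with csv.reader."""
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow([field])
    return next(csv.reader(io.StringIO(buf.getvalue(), newline="")))[0]


original = "line1\nline2"
escaped = csv_round_trip(format_text(original))
raw = csv_round_trip(format_text(original, escape_control_chars=False))
print(escaped == original)  # display-style escaping does not round-trip
print(raw == original)      # leaving the value to csv.writer does
```

Under these assumptions the escaped value comes back as a literal backslash followed by {{n}}, while the unescaped value is restored because {{csv.writer}} quotes the field.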



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
