[
https://issues.apache.org/jira/browse/CASSANDRA-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084231#comment-18084231
]
Arvind Kandpal commented on CASSANDRA-21381:
--------------------------------------------
COPY TO/FROM already deviates from RFC 4180 in a few ways — it uses a custom
escapechar (backslash instead of quote-doubling), and supports collection types
(list, set, map) and UDTs which RFC 4180 has no concept of. So it was never a
pure RFC 4180 implementation to begin with.
On \x00 and other binary control chars — completely agreed, those are dangerous
and should stay escaped.
The case worth flagging is \r and \n specifically. These silently corrupt real
data today: multi-line addresses, JSON payloads, and formatted logs stored in
text columns all come back from a round-trip with literal backslash sequences
in place of their line breaks. The user has no way to know this happened,
because COPY TO succeeds and COPY FROM succeeds, yet the restored data differs
from the original.
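The round-trip loss can be reproduced outside cqlsh with a minimal sketch. The regex is copied from the issue description below; {{show_control_char}} and {{export_then_import}} are hypothetical stand-ins for the real cqlshlib helpers, and the plain default csv dialect is used here rather than the options COPY actually passes, so this only illustrates the effect:

```python
import csv
import io
import re

# Regex as quoted in the issue description.
UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")


def show_control_char(match):
    # Hypothetical stand-in for _show_control_chars: replace the character
    # with its printable escape notation, e.g. a newline becomes backslash + n.
    return match.group(0).encode("unicode_escape").decode("ascii")


def export_then_import(value):
    # Export: substitute control characters first, then write one CSV row.
    escaped = UNICODE_CONTROLCHARS_RE.sub(show_control_char, value)
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow([escaped])
    # Import: read the row back with the same (default) dialect.
    buf.seek(0)
    return next(csv.reader(buf))[0]


original = "12 Main St\nSpringfield"
restored = export_then_import(original)

print(repr(original))
print(repr(restored))
```

Both steps complete without error, but the restored value holds a backslash followed by {{n}} where the original held a line break.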
RFC 4180's escaped production explicitly allows CR and LF inside quoted fields:
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
And csv.writer under QUOTE_MINIMAL (current default in copyutil.py) already
auto-quotes fields containing \r or \n correctly — so stopping the substitution
for just these two characters costs nothing and fixes real data loss.
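As a rough check of the quoting behavior, the sketch below writes a row with Python's csv module and reads it back. It assumes the default dialect (line terminator \r\n) with QUOTE_MINIMAL and does not reproduce the other options set in copyutil.py; {{write_row}} and {{read_row}} are hypothetical helper names used only for this illustration:

```python
import csv
import io


def write_row(values):
    # newline="" prevents the text layer from translating line endings.
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(values)
    return buf.getvalue()


def read_row(text):
    # The reader reassembles quoted fields that span several physical lines.
    return next(csv.reader(io.StringIO(text, newline="")))


row = ["plain", "line1\nline2", "cr\r\nlf"]
encoded = write_row(row)
decoded = read_row(encoded)

print(repr(encoded))
print(decoded == row)
```

Only the fields that contain \r or \n are wrapped in double quotes, and the decoded row compares equal to the original.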
If the decision is to keep CSV strictly printable-only, I am happy to update
this PR to keep the current escaping behavior, add a warning when {{\r}} or
{{\n}} is detected, and update the documentation to clearly call out this
limitation — so users know upfront that text columns containing newlines will
not round-trip cleanly via CSV.
Happy to go either direction.
cc [~jensg] [~bschoeni] [~smiklosovic]
> CSV COPY TO corrupts control characters (newline, null byte, etc.) in text
> values
> ---------------------------------------------------------------------------------
>
> Key: CASSANDRA-21381
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21381
> Project: Apache Cassandra
> Issue Type: Bug
> Components: CQL/Interpreter
> Reporter: Jens Geyer
> Assignee: Arvind Kandpal
> Priority: Normal
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> h2. Problem
> During COPY TO, control characters in text column values are replaced with
> their Python repr() notation by
> UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...) in
> {{format_value_text}} ({{pylib/cqlshlib/formatting.py}}).
> Examples:
> * A stored newline (0x0A) becomes the two-character sequence {{\n}} in the
> CSV; after COPY FROM it is stored as {{\n}} (backslash + n) -- a different
> value.
> * A null byte (0x00) becomes {{\x00}} (four characters).
> The regex {{UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")}}
> matches all ASCII control characters (0x00-0x1F: newline, tab, carriage
> return, BEL, etc.), DEL (0x7F), the C1 control characters (0x80-0x9F), and
> the no-break space (0xA0).
> This substitution is correct for terminal display of SELECT results (where
> invisible characters need a human-readable representation). It is incorrect
> in the *CSV export path*, where {{csv.writer}} handles control characters
> correctly via field quoting -- no pre-processing is needed.
> h2. Affected Versions
> All Cassandra versions with {{format_value_text}} containing the
> {{UNICODE_CONTROLCHARS_RE}} substitution (at minimum 3.x through trunk).
> h2. Root Cause
> {{format_value_text}} is shared between the terminal display path (SELECT
> output) and the CSV export path (COPY TO). The {{UNICODE_CONTROLCHARS_RE}}
> substitution converts control characters to their Python repr-string for
> display, but this transformation is *not reversible* via the CSV import path.
> This bug is *independent of, but in the same function as*, the
> backslash-doubling bug fixed in CASSANDRA-21131. Applying the CASSANDRA-21131
> patch does NOT fix this issue.
> h2. Expected Fix
> In the CSV export path, skip the {{UNICODE_CONTROLCHARS_RE.sub(...)}} call.
> An {{escape_control_chars}} parameter (analogous to the {{escape_backslash}}
> parameter introduced by CASSANDRA-21131) can conditionally suppress the
> substitution when calling {{format_value_text}} from the CSV export path.
> h2. Related
> CASSANDRA-21131 -- backslash-doubling bug in the same code path, already
> patched.