[ 
https://issues.apache.org/jira/browse/CASSANDRA-21381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18084027#comment-18084027
 ] 

Arvind Kandpal commented on CASSANDRA-21381:
--------------------------------------------

Hi Brad,

I ran tests against trunk directly — both {{format_value_text}} 
and {{csv.writer}} with the exact dialect from {{copyutil.py}} (lines 329–336).

*What format_value_text does today (trunk, verified):*
{noformat}
Input  : 'Hello\nWorld'    (actual newline)
Output : 'Hello\\nWorld'   (literal backslash-n — bug)

Input  : 'Hello\x00World'  (null byte)
Output : 'Hello\\x00World' (literal 4-char sequence, bug)
{noformat}
Both land in the CSV as plain unquoted text. COPY FROM reads them back 
as literal strings, so the original control character is lost.
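A minimal sketch of the failure, for anyone who wants to reproduce it outside cqlsh. The regex is the one quoted in the issue description; {{show_control_chars}} and {{export_then_import}} are hypothetical stand-ins (not the cqlshlib functions), and Python's default CSV dialect is assumed in place of the {{copyutil.py}} dialect.

```python
import csv
import io
import re

# Regex as quoted in the issue description.
UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")


def show_control_chars(match):
    # Hypothetical stand-in for the display helper: repr() without the quotes.
    return repr(match.group(0))[1:-1]


def export_then_import(value):
    # Step 1: substitute control characters, as the shared formatter does today.
    substituted = UNICODE_CONTROLCHARS_RE.sub(show_control_chars, value)
    # Step 2: write one CSV row and read it straight back.
    buf = io.StringIO(newline="")
    csv.writer(buf).writerow([substituted])
    buf.seek(0)
    return next(csv.reader(buf))[0]


original = "Hello\nWorld"
restored = export_then_import(original)
print(repr(original), "->", repr(restored))
```

The restored value contains a backslash followed by {{n}} instead of the newline, so it no longer equals the stored value.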

*What csv.writer does with the raw value (test-confirmed):*
{noformat}
Input      : 'Hello\nWorld'
CSV output : 1,"Hello\nWorld"   <- auto-quoted, newline preserved
Round-trip : exact match
{noformat}
Trunk's dialect has no explicit {{quoting=}} override so it runs at 
Python's default {{QUOTE_MINIMAL}}. Under that, any field with {{\r}} or 
{{\n}} is automatically double-quoted, which is exactly what RFC 4180's 
{{escaped}} production defines:
{noformat}
escaped = DQUOTE *(TEXTDATA / COMMA / CR / LF / 2DQUOTE) DQUOTE
{noformat}
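The round trip can be sketched in a few lines. This assumes Python's default dialect with {{QUOTE_MINIMAL}} rather than the exact {{copyutil.py}} dialect, and {{round_trip}} is a hypothetical helper used only for illustration.

```python
import csv
import io


def round_trip(row):
    # Write one row with QUOTE_MINIMAL, then parse the produced text again.
    buf = io.StringIO(newline="")
    csv.writer(buf, quoting=csv.QUOTE_MINIMAL).writerow(row)
    text = buf.getvalue()
    parsed = next(csv.reader(io.StringIO(text, newline="")))
    return text, parsed


text, parsed = round_trip(["1", "Hello\nWorld"])
print(repr(text))
print(parsed)

# A carriage return is handled the same way.
cr_text, cr_parsed = round_trip(["1", "Hello\rWorld"])
```

Opening the buffers with {{newline=""}} matters here: it keeps Python from translating line endings before the CSV module sees them.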
*On \x00 — you are right:*
{noformat}
Input      : 'Hello\x00World'
CSV output : 1,Hello[invisible]World  <- raw embed, no quoting
{noformat}
{{csv.writer}} does not quote this at all. The null byte is invisible in a 
terminal and can truncate the field in C-based parsers, so it should stay 
escaped. Agreed.

*Proposed scope:*
 - {{\r}}, {{\n}}: stop substitution, let {{csv.writer}} auto-quote 
(RFC-backed, round-trip verified)
 - {{\x00}}–{{\x09}}, {{\x0B}}–{{\x0C}}, {{\x0E}}–{{\x1F}}, 
{{\x7F}}–{{\xA0}} (everything else the regex matches): keep escaping
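One way this scope could look in code, as a rough sketch only. It assumes the CSV path escapes everything {{UNICODE_CONTROLCHARS_RE}} matches except {{\r}} and {{\n}}; {{CSV_CONTROLCHARS_RE}}, {{show_control_char}} and {{escape_for_csv}} are hypothetical names, not existing cqlshlib identifiers.

```python
import re

# Assumption: the issue's regex [\x00-\x1f\x7f-\xa0] minus \n (0x0A) and \r (0x0D).
CSV_CONTROLCHARS_RE = re.compile(r"[\x00-\x09\x0b\x0c\x0e-\x1f\x7f-\xa0]")


def show_control_char(match):
    # Hypothetical helper: repr() of the character without surrounding quotes.
    return repr(match.group(0))[1:-1]


def escape_for_csv(value):
    # Leaves \r and \n untouched so csv.writer can quote the field itself.
    return CSV_CONTROLCHARS_RE.sub(show_control_char, value)


print(repr(escape_for_csv("line1\r\nline2")))
print(repr(escape_for_csv("Hello\x00World")))
```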

*On the architectural concern:*

Valid point. Both this PR and #4813 are adding kwargs to the same 
formatter chain. A {{FormattingContext}} dataclass covering 
{{escape_backslash}} and {{escape_control_chars}} together would be the 
right fix for both tickets.
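To make the idea concrete, here is a rough sketch of what such a dataclass might look like. Only the names {{FormattingContext}}, {{escape_backslash}} and {{escape_control_chars}} come from this thread; the defaults, the {{format_text}} function and the two preset instances are assumptions for illustration, not the actual cqlshlib formatter chain.

```python
import re
from dataclasses import dataclass

# Regex as quoted in the issue description.
UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")


@dataclass(frozen=True)
class FormattingContext:
    # Assumed defaults: behave like the terminal display path.
    escape_backslash: bool = True
    escape_control_chars: bool = True


def format_text(value, ctx=FormattingContext()):
    # Hypothetical simplified formatter driven by one context object
    # instead of a growing list of keyword arguments.
    if ctx.escape_backslash:
        value = value.replace("\\", "\\\\")
    if ctx.escape_control_chars:
        value = UNICODE_CONTROLCHARS_RE.sub(
            lambda m: repr(m.group(0))[1:-1], value
        )
    return value


TERMINAL = FormattingContext()
CSV_EXPORT = FormattingContext(escape_backslash=False, escape_control_chars=False)

print(repr(format_text("a\\b\n", TERMINAL)))
print(repr(format_text("a\\b\n", CSV_EXPORT)))
```

A single boolean is the simplest form; if the CSV path keeps escaping some characters, the field could instead carry the set or regex of characters to escape.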

> CSV COPY TO corrupts control characters (newline, null byte, etc.) in text 
> values
> ---------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-21381
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21381
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: CQL/Interpreter
>            Reporter: Jens Geyer
>            Assignee: Arvind Kandpal
>            Priority: Normal
>          Time Spent: 1h 40m
>  Remaining Estimate: 0h
>
> h2. Problem
> During COPY TO, control characters in text column values are replaced with 
> their Python repr() notation by 
> UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...) in 
> {{format_value_text}} ({{pylib/cqlshlib/formatting.py}}).
> Examples:
> * A stored newline (0x0A) becomes the two-character sequence {{\n}} in the 
> CSV; after COPY FROM it is stored as {{\n}} (backslash + n) -- a different 
> value.
> * A null byte (0x00) becomes {{\x00}} (four characters).
> The regex {{UNICODE_CONTROLCHARS_RE = re.compile(r"[\x00-\x1f\x7f-\xa0]")}} 
> matches all ASCII control characters (0x00-0x1F: newline, tab, carriage 
> return, BEL, etc.) plus DEL, the C1 control characters, and the no-break 
> space (0x7F-0xA0).
> This substitution is correct for terminal display of SELECT results (where 
> invisible characters need a human-readable representation). It is incorrect 
> in the *CSV export path*, where {{csv.writer}} handles control characters 
> correctly via field quoting -- no pre-processing is needed.
> h2. Affected Versions
> All Cassandra versions with {{format_value_text}} containing the 
> {{UNICODE_CONTROLCHARS_RE}} substitution (at minimum 3.x through trunk).
> h2. Root Cause
> {{format_value_text}} is shared between the terminal display path (SELECT 
> output) and the CSV export path (COPY TO). The {{UNICODE_CONTROLCHARS_RE}} 
> substitution converts control characters to their Python repr-string for 
> display, but this transformation is *not reversible* via the CSV import path.
> This bug is *independent of, but in the same function as*, the 
> backslash-doubling bug fixed in CASSANDRA-21131. Applying the CASSANDRA-21131 
> patch does NOT fix this issue.
> h2. Expected Fix
> In the CSV export path, skip the {{UNICODE_CONTROLCHARS_RE.sub(...)}} call. 
> An {{escape_control_chars}} parameter (analogous to the {{escape_backslash}} 
> parameter introduced by CASSANDRA-21131) can conditionally suppress the 
> substitution when calling {{format_value_text}} from the CSV export path.
> h2. Related
> CASSANDRA-21131 -- backslash-doubling bug in the same code path, already 
> patched.


