[ 
https://issues.apache.org/jira/browse/CASSANDRA-21131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081328#comment-18081328
 ] 

Jens Geyer commented on CASSANDRA-21131:
----------------------------------------

h2. Root Cause Analysis

The bug is a *double backslash escaping* in the CSV export path.

h3. Trace for {{firstname = \\"Marianne\"\\}} (stored bytes: {{\\}}, 
{{"Marianne"}}, {{\\}})

*Step 1 — {{format_value_text}} in {{pylib/cqlshlib/formatting.py}}* (called 
for every text cell during COPY TO):
{code:python}
def format_value_text(val, encoding, colormap, quote=False, **_):
    escapedval = val.replace(\\, \\\\)   # BUG: always, even for CSV output
    ...
{code}
Input: {{\\"Marianne"\\}} → after replace: {{\\\\"Marianne"\\\\}}

*Step 2 — {{csv.writer}}* (escapechar={{\\}}, doublequote=False) escapes {{\\}} 
→ {{\\\\}} and {{"}} → {{\\"}}:
CSV file contains: {{\\\\\\\\\\"Marianne\\"\\\\\\\\}} — exactly matching the 
attached {{cyclist.csv}}

*Step 3 — {{csv.reader}}* on COPY FROM unescapes once:
Result: {{\\\\"Marianne"\\\\}} (2× backslash per side instead of 1×)

*Stored after round-trip:* {{\\\\"Marianne"\\\\}} — data corrupted. Every 
further cycle doubles again.

{{format_value_text}} was designed for *terminal display* of SELECT results 
(where {{\\}} makes a backslash visible). It is incorrectly reused in the CSV 
export path, where {{csv.writer}} handles all escaping.

The same bug affects *collection types*: {{list<text>}}, {{set<text>}}, 
{{map<text,text>}}, tuples and UDTs, because {{format_simple_collection}} and 
{{format_value_map}} call {{format_value}} with {{quote=True}} for each 
element, which also routes through {{format_value_text}}.

h2. Fix

Add an {{escape_backslash}} parameter (default {{True}}, preserving terminal 
display behaviour) to {{format_value_text}} and all collection formatters. Pass 
{{escape_backslash=False}} from {{ExportProcess.format_value}} in 
{{copyutil.py}} so {{csv.writer}} handles backslash escaping exclusively.

Changed functions in {{pylib/cqlshlib/formatting.py}}:
* {{format_value_text}} — new parameter
* {{format_simple_collection}} — new parameter, propagated to element 
{{format_value}} calls
* {{format_value_list}}, {{format_value_set}}, {{format_value_tuple}} — new 
parameter, forwarded
* {{format_value_map}} — new parameter, propagated through inner {{subformat}}
* {{format_value_utype}} — new parameter, propagated through inner 
{{format_field_value}}

Changed in {{pylib/cqlshlib/copyutil.py}}:
* {{ExportProcess.format_value}} — passes {{escape_backslash=False}}

h2. Test Results

Two standalone Python test scripts (no Cassandra cluster needed) verify bug and 
fix:

|| File || Scope || Before fix || After fix ||
| {{test_cassandra_21131.py}} | plain {{text}} columns | 5/10 pass | 10/10 ✓ |
| {{test_cassandra_21131_collections.py}} | {{list/set/map<text>}} | 3/12 pass 
| 12/12 ✓ |

Failing cases before fix include all values with backslashes: {{V\\S}}, 
{{\\"Marianne"\\}}, {{a\\\\b}}, etc.

h2. Pull Request

[PR #4813 against trunk|https://github.com/apache/cassandra/pull/4813] 
(Jens-G/cassandra)

The PR was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under 
human review and direction, per [ASF generative tooling 
policy|https://www.apache.org/legal/generative-tooling.html]. The commit 
includes the required {{Generated-by:}} token.

h2. Related Finding (separate ticket)

During analysis a second independent bug was found in the same code path: 
{{UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...)}} in 
{{format_value_text}} converts control characters (e.g. newline {{\\n}}, null 
byte {{\\x00}}) to their Python repr notation for display purposes. During CSV 
export these substitutions corrupt the stored values (e.g. a literal newline 
becomes the two-character sequence {{\\n}}). This affects both the current and 
the patched version and will be tracked separately.

> CSV Export and Import broken: unexpected data changes
> -----------------------------------------------------
>
>                 Key: CASSANDRA-21131
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-21131
>             Project: Apache Cassandra
>          Issue Type: Bug
>            Reporter: Jens Geyer
>            Priority: Normal
>         Attachments: console.txt, cyclist.csv
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> I'm following the example from the docs, just slightly changed: 
> https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshCopy.html
> {code}
> CREATE KEYSPACE cycling  WITH REPLICATION = { 'class' : 
> 'NetworkTopologyStrategy', 'datacenter1' : 1 } ;
> CREATE TABLE cycling.cyclist_name ( id UUID PRIMARY KEY, lastname text, 
> firstname text ) ;
> INSERT INTO cycling.cyclist_name (id, lastname, firstname) 
>    VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 'VOS','Marianne');
> INSERT INTO cycling.cyclist_name (id, lastname, firstname) 
>    VALUES (88888888-8888-8888-8888-888888888888, 'V\S','\"Marianne"\');
> SELECT * FROM cycling.cyclist_name ;
> {code}
> This gives:
> {code}
>  id                                   | firstname      | lastname
> --------------------------------------+----------------+----------
>  5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |       Marianne |      VOS
>  88888888-8888-8888-8888-888888888888 | \\"Marianne"\\ |     V\\S
> (2 rows)
> {code}
> Now export to a file, then import from the very same file:
> {code}
> COPY cycling.cyclist_name TO 'cyclist.csv' WITH HEADER = TRUE ;
> TRUNCATE cycling.cyclist_name ;
> COPY cycling.cyclist_name FROM 'cyclist.csv' WITH HEADER = TRUE ;
> SELECT * FROM cycling.cyclist_name ;
> {code}
> Now the result is this:
> {code}
>  id                                   | firstname          | lastname
> --------------------------------------+--------------------+----------
>  5b6962dd-3f90-4c93-8f61-eabfa4a803e2 |           Marianne |      VOS
>  88888888-8888-8888-8888-888888888888 | \\\\"Marianne"\\\\ |   V\\\\S
> (2 rows)
> {code}
> I did not change any data in between. Just export and reimport.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to