[
https://issues.apache.org/jira/browse/CASSANDRA-21131?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18081328#comment-18081328
]
Jens Geyer commented on CASSANDRA-21131:
----------------------------------------
h2. Root Cause Analysis
The bug is a *double backslash escaping* in the CSV export path.
h3. Trace for {{firstname = \\"Marianne\"\\}} (stored bytes: {{\\}},
{{"Marianne"}}, {{\\}})
*Step 1 — {{format_value_text}} in {{pylib/cqlshlib/formatting.py}}* (called
for every text cell during COPY TO):
{code:python}
def format_value_text(val, encoding, colormap, quote=False, **_):
escapedval = val.replace(\\, \\\\) # BUG: always, even for CSV output
...
{code}
Input: {{\\"Marianne"\\}} → after replace: {{\\\\"Marianne"\\\\}}
*Step 2 — {{csv.writer}}* (escapechar={{\\}}, doublequote=False) escapes {{\\}}
→ {{\\\\}} and {{"}} → {{\\"}}:
CSV file contains: {{\\\\\\\\\\"Marianne\\"\\\\\\\\}} — exactly matching the
attached {{cyclist.csv}}
*Step 3 — {{csv.reader}}* on COPY FROM unescapes once:
Result: {{\\\\"Marianne"\\\\}} (2× backslash per side instead of 1×)
*Stored after round-trip:* {{\\\\"Marianne"\\\\}} — data corrupted. Every
further cycle doubles again.
{{format_value_text}} was designed for *terminal display* of SELECT results
(where {{\\}} makes a backslash visible). It is incorrectly reused in the CSV
export path, where {{csv.writer}} handles all escaping.
The same bug affects *collection types*: {{list<text>}}, {{set<text>}},
{{map<text,text>}}, tuples and UDTs, because {{format_simple_collection}} and
{{format_value_map}} call {{format_value}} with {{quote=True}} for each
element, which also routes through {{format_value_text}}.
h2. Fix
Add an {{escape_backslash}} parameter (default {{True}}, preserving terminal
display behaviour) to {{format_value_text}} and all collection formatters. Pass
{{escape_backslash=False}} from {{ExportProcess.format_value}} in
{{copyutil.py}} so {{csv.writer}} handles backslash escaping exclusively.
Changed functions in {{pylib/cqlshlib/formatting.py}}:
* {{format_value_text}} — new parameter
* {{format_simple_collection}} — new parameter, propagated to element
{{format_value}} calls
* {{format_value_list}}, {{format_value_set}}, {{format_value_tuple}} — new
parameter, forwarded
* {{format_value_map}} — new parameter, propagated through inner {{subformat}}
* {{format_value_utype}} — new parameter, propagated through inner
{{format_field_value}}
Changed in {{pylib/cqlshlib/copyutil.py}}:
* {{ExportProcess.format_value}} — passes {{escape_backslash=False}}
h2. Test Results
Two standalone Python test scripts (no Cassandra cluster needed) verify bug and
fix:
|| File || Scope || Before fix || After fix ||
| {{test_cassandra_21131.py}} | plain {{text}} columns | 5/10 pass | 10/10 ✓ |
| {{test_cassandra_21131_collections.py}} | {{list/set/map<text>}} | 3/12 pass
| 12/12 ✓ |
Failing cases before fix include all values with backslashes: {{V\\S}},
{{\\"Marianne"\\}}, {{a\\\\b}}, etc.
h2. Pull Request
[PR #4813 against trunk|https://github.com/apache/cassandra/pull/4813]
(Jens-G/cassandra)
The PR was developed with AI assistance (Claude Sonnet 4.6 / Anthropic) under
human review and direction, per [ASF generative tooling
policy|https://www.apache.org/legal/generative-tooling.html]. The commit
includes the required {{Generated-by:}} token.
h2. Related Finding (separate ticket)
During analysis a second independent bug was found in the same code path:
{{UNICODE_CONTROLCHARS_RE.sub(_show_control_chars, ...)}} in
{{format_value_text}} converts control characters (e.g. newline {{\\n}}, null
byte {{\\x00}}) to their Python repr notation for display purposes. During CSV
export these substitutions corrupt the stored values (e.g. a literal newline
becomes the two-character sequence {{\\n}}). This affects both the current and
the patched version and will be tracked separately.
> CSV Export and Import broken: unexpected data changes
> -----------------------------------------------------
>
> Key: CASSANDRA-21131
> URL: https://issues.apache.org/jira/browse/CASSANDRA-21131
> Project: Apache Cassandra
> Issue Type: Bug
> Reporter: Jens Geyer
> Priority: Normal
> Attachments: console.txt, cyclist.csv
>
> Time Spent: 10m
> Remaining Estimate: 0h
>
> I'm following the example from the docs, just slightly changed:
> https://docs.datastax.com/en/cql-oss/3.x/cql/cql_reference/cqlshCopy.html
> {code}
> CREATE KEYSPACE cycling WITH REPLICATION = { 'class' :
> 'NetworkTopologyStrategy', 'datacenter1' : 1 } ;
> CREATE TABLE cycling.cyclist_name ( id UUID PRIMARY KEY, lastname text,
> firstname text ) ;
> INSERT INTO cycling.cyclist_name (id, lastname, firstname)
> VALUES (5b6962dd-3f90-4c93-8f61-eabfa4a803e2, 'VOS','Marianne');
> INSERT INTO cycling.cyclist_name (id, lastname, firstname)
> VALUES (88888888-8888-8888-8888-888888888888, 'V\S','\"Marianne"\');
> SELECT * FROM cycling.cyclist_name ;
> {code}
> This gives:
> {code}
> id | firstname | lastname
> --------------------------------------+----------------+----------
> 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS
> 88888888-8888-8888-8888-888888888888 | \\"Marianne"\\ | V\\S
> (2 rows)
> {code}
> Now export to a file, then import from the very same file:
> {code}
> COPY cycling.cyclist_name TO 'cyclist.csv' WITH HEADER = TRUE ;
> TRUNCATE cycling.cyclist_name ;
> COPY cycling.cyclist_name FROM 'cyclist.csv' WITH HEADER = TRUE ;
> SELECT * FROM cycling.cyclist_name ;
> {code}
> Now the result is this:
> {code}
> id | firstname | lastname
> --------------------------------------+--------------------+----------
> 5b6962dd-3f90-4c93-8f61-eabfa4a803e2 | Marianne | VOS
> 88888888-8888-8888-8888-888888888888 | \\\\"Marianne"\\\\ | V\\\\S
> (2 rows)
> {code}
> I did not change any data in between. Just export and reimport.
--
This message was sent by Atlassian Jira
(v8.20.10#820010)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]