Yifan Cai created CASSANALYTICS-19:
--------------------------------------

             Summary: Bulk writer in S3_COMPAT mode calculates the file digest twice
                 Key: CASSANALYTICS-19
                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-19
             Project: Apache Cassandra Analytics
          Issue Type: Bug
          Components: Writer
            Reporter: Yifan Cai
            Assignee: Yifan Cai
In the S3_COMPAT mode of the bulk writer, the file digest is calculated twice, specifically in `CloudStorageStreamSession`: once in `sstableWriter.prepareSStablesToSend`, and again when building the bundle, i.e. in `org.apache.cassandra.spark.bulkwriter.cloudstorage.Bundle.Builder#prepareBuild`.

Going over the same data twice wastes CPU. The other issue is that it allows corrupted data to make its way into an import attempt against Cassandra, since the digest is generated _after_ validating the SSTables. Although the import attempt will not succeed, because of the Digest component of the SSTable, it would be better to fail sooner.
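One way to avoid the double computation is to memoize the per-file digest so that the value computed when preparing the SSTables to send is reused when the bundle is built. The sketch below is a minimal, hypothetical illustration of that idea (the class and method names are invented for this example and are not the Cassandra Analytics API; MD5 stands in for whatever digest algorithm the writer actually uses):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compute each file's digest once and cache it,
// so a second lookup (e.g. while building the bundle) does not re-read
// and re-hash the same data.
public class DigestCache
{
    private final Map<Path, String> cache = new HashMap<>();
    int computations = 0; // exposed only so the example can show the cache hit

    public String digestOf(Path file) throws Exception
    {
        String cached = cache.get(file);
        if (cached != null)
            return cached; // second caller reuses the stored digest

        computations++;
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(Files.readAllBytes(file));

        StringBuilder hex = new StringBuilder();
        for (byte b : digest)
            hex.append(String.format("%02x", b));

        String result = hex.toString();
        cache.put(file, result);
        return result;
    }
}
```

Computing the digest once, at (or before) SSTable validation, also addresses the second point: any corruption introduced after validation would surface as a digest mismatch on the sender side rather than only during the import attempt into Cassandra.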