Yifan Cai created CASSANALYTICS-19:
--------------------------------------

             Summary: Bulk writer in S3_COMPAT mode calculates the file digest twice
                 Key: CASSANALYTICS-19
                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-19
             Project: Apache Cassandra Analytics
          Issue Type: Bug
          Components: Writer
            Reporter: Yifan Cai
            Assignee: Yifan Cai
In the S3_COMPAT mode of the bulk writer, the file digest is calculated twice, specifically in `CloudStorageStreamSession`: once in `sstableWriter.prepareSStablesToSend`, and again when building the bundle, i.e. in `org.apache.cassandra.spark.bulkwriter.cloudstorage.Bundle.Builder#prepareBuild`.

Going over the same data twice wastes CPU. The other issue is that it allows corrupted data to make its way into an import attempt against Cassandra, since the digest is generated _after_ validating the SSTables. Although the import attempt will not succeed, because of the Digest component of the SSTable, it would be better to fail sooner.
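One way to avoid the double computation is to memoize the per-file digest so that the value computed when preparing the SSTables to send is reused when the bundle is built. The sketch below is a minimal, hypothetical illustration of that idea (the class and method names are invented for this example and are not the Cassandra Analytics API; MD5 stands in for whatever digest algorithm the writer actually uses):

```java
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: compute each file's digest once and cache it,
// so a second lookup (e.g. while building the bundle) does not re-read
// and re-hash the same data.
public class DigestCache
{
    private final Map<Path, String> cache = new HashMap<>();
    int computations = 0; // exposed only so the example can show the cache hit

    public String digestOf(Path file) throws Exception
    {
        String cached = cache.get(file);
        if (cached != null)
            return cached; // second caller reuses the stored digest

        computations++;
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest(Files.readAllBytes(file));

        StringBuilder hex = new StringBuilder();
        for (byte b : digest)
            hex.append(String.format("%02x", b));

        String result = hex.toString();
        cache.put(file, result);
        return result;
    }
}
```

Computing the digest once, at (or before) SSTable validation, also addresses the second point: any corruption introduced after validation would surface as a digest mismatch on the sender side rather than only during the import attempt into Cassandra.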