Yifan Cai updated CASSANALYTICS-19:
-----------------------------------
    Status: Ready to Commit  (was: Review In Progress)

> Bulk writer in S3_COMPAT mode calculates file digest twice
> ----------------------------------------------------------
>
>                 Key: CASSANALYTICS-19
>                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-19
>             Project: Apache Cassandra Analytics
>          Issue Type: Bug
>          Components: Writer
>            Reporter: Yifan Cai
>            Assignee: Yifan Cai
>            Priority: Normal
>             Fix For: 1.0
>
>          Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> In the S3_COMPAT mode of the bulk writer, the file digest is calculated twice in `CloudStorageStreamSession`: once at `sstableWriter.prepareSStablesToSend`, and again while building the bundle, i.e. in `org.apache.cassandra.spark.bulkwriter.cloudstorage.Bundle.Builder#prepareBuild`.
>
> Going over the same data twice wastes CPU. It also allows corrupted data to make its way into an import attempt against Cassandra, because the digest is generated _after_ the sstables are validated. The import attempt would still fail thanks to the Digest component of the sstable, but it would be better to fail sooner.
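A minimal sketch of the idea (hypothetical names, not the committed patch): hash each file in a single streaming pass the first time it is needed and memoize the result per file, so a later consumer such as bundle building looks the digest up instead of re-reading the sstable. MD5 with Base64 encoding here is just a stand-in for whatever digest the bulk writer actually uses.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Base64;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Illustrative digest cache: each file is read and hashed at most once,
 * so the bundle-building step can reuse the value computed when the
 * sstables were prepared, instead of scanning the same bytes again.
 */
public final class FileDigestCache
{
    private final Map<Path, String> digests = new ConcurrentHashMap<>();

    /** Returns the cached digest, computing it only on first access. */
    public String digest(Path file) throws IOException
    {
        // computeIfAbsent cannot throw checked exceptions, so wrap and unwrap.
        try
        {
            return digests.computeIfAbsent(file, f -> {
                try
                {
                    return computeMd5(f);
                }
                catch (IOException e)
                {
                    throw new UncheckedIOException(e);
                }
            });
        }
        catch (UncheckedIOException e)
        {
            throw e.getCause();
        }
    }

    private static String computeMd5(Path file) throws IOException
    {
        try (InputStream in = Files.newInputStream(file))
        {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            byte[] buffer = new byte[8192];
            int read;
            while ((read = in.read(buffer)) != -1)
            {
                md5.update(buffer, 0, read); // single streaming pass over the data
            }
            return Base64.getEncoder().encodeToString(md5.digest());
        }
        catch (NoSuchAlgorithmException e)
        {
            throw new IllegalStateException(e); // MD5 is guaranteed by the JDK
        }
    }
}
```

In the terms of the ticket, `sstableWriter.prepareSStablesToSend` would populate the cache and `Bundle.Builder#prepareBuild` would read from it; computing the digest before or alongside validation would also surface corruption earlier, addressing the second concern. The actual fix may wire this differently.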