Yifan Cai created CASSANALYTICS-104:
---------------------------------------
Summary: Eliminate redundant filesystem lookups in SSTable direct
streaming
Key: CASSANALYTICS-104
URL: https://issues.apache.org/jira/browse/CASSANALYTICS-104
Project: Apache Cassandra Analytics
Issue Type: Improvement
Components: Writer
Reporter: Yifan Cai
When bulk writing in DIRECT mode, each SSTable currently triggers 2-3
redundant filesystem directory scans:
# SortedSSTableWriter.prepareSStablesToSend() - scans the directory,
calculates digests, and returns a flat Map<Path, Digest>
# DirectStreamSession.sendSSTableToOneReplica() - scans the directory again
with a glob pattern to find components
# SortedSSTableWriter.close() → calculateFileDigestMap() - scans the directory
again for each data file
Additionally, prepareSStablesToSend() returns a flat map that discards the
SSTable component grouping, which must then be reconstructed via additional
filesystem lookups.
The proposed solution is to group the SSTable components by their base name
to avoid the extra filesystem directory scans. In other words, revise the
data structure from Map<Path, Digest> to Map<String, Map<Path, Digest>>,
where the string key is the SSTable base name.
This should reduce the number of filesystem scans and eliminate the glob
pattern matching overhead.
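
A minimal sketch of the grouped structure, for illustration only and not the
actual patch. It assumes the base name is the file name up to the last '-'
(e.g. "nb-1-big-Data.db" belongs to "nb-1-big"), and it models Digest as a
plain String. The class SSTableComponentGrouping and the helpers baseNameOf
and groupByBaseName are hypothetical names, not existing project APIs.

```java
import java.nio.file.Path;
import java.util.LinkedHashMap;
import java.util.Map;

public class SSTableComponentGrouping
{
    // Hypothetical helper: drops the trailing "-<Component>.db" part of the
    // file name. The real SSTable naming rules may differ.
    static String baseNameOf(Path component)
    {
        String fileName = component.getFileName().toString();
        int lastDash = fileName.lastIndexOf('-');
        return lastDash < 0 ? fileName : fileName.substring(0, lastDash);
    }

    // Groups component files (with their digests) by SSTable base name.
    // The digest is a String placeholder standing in for the project's Digest type.
    static Map<String, Map<Path, String>> groupByBaseName(Map<Path, String> flat)
    {
        Map<String, Map<Path, String>> grouped = new LinkedHashMap<>();
        for (Map.Entry<Path, String> entry : flat.entrySet())
        {
            grouped.computeIfAbsent(baseNameOf(entry.getKey()), k -> new LinkedHashMap<>())
                   .put(entry.getKey(), entry.getValue());
        }
        return grouped;
    }
}
```

With a map shaped like this, a caller that needs every component of one
SSTable can read the inner map for that base name instead of scanning the
directory again.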