Yifan Cai created CASSANALYTICS-104:
---------------------------------------

             Summary: Eliminate redundant filesystem lookups in SSTable direct streaming
                 Key: CASSANALYTICS-104
                 URL: https://issues.apache.org/jira/browse/CASSANALYTICS-104
             Project: Apache Cassandra Analytics
          Issue Type: Improvement
          Components: Writer
            Reporter: Yifan Cai


Each SSTable currently triggers 2-3 redundant filesystem directory scans during 
a bulk write in DIRECT mode:
# SortedSSTableWriter.prepareSStablesToSend() - scans the directory, calculates 
digests, and returns a flat Map<Path, Digest>
# DirectStreamSession.sendSSTableToOneReplica() - scans the directory again with 
a glob pattern to find components
# SortedSSTableWriter.close() → calculateFileDigestMap() - scans the directory 
again for each data file
Additionally, prepareSStablesToSend() returns a flat map that discards the 
SSTable component grouping, which must then be reconstructed via additional 
filesystem lookups.

The proposed solution is to group the SSTable components by their base name to 
avoid the extra filesystem directory scans. Concretely, the data structure 
changes from Map<Path, Digest> to Map<String, Map<Path, Digest>>, where the 
string key is the SSTable base name. 
This should reduce the number of filesystem scans and eliminate the glob 
pattern matching overhead. 
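
A minimal sketch of the grouped structure, for illustration only and not the 
actual patch. Assumptions: the class SSTableGrouping and the helpers baseName() 
and groupByBaseName() are hypothetical names; a plain String stands in for the 
project's Digest type; the digest computation is passed in as a function; and 
component files are assumed to be named <base name>-<component>, for example 
nb-1-big-Data.db.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Stream;

// Hypothetical illustration; none of these names come from the project.
final class SSTableGrouping
{
    // Assumes "<base name>-<component>" file names, e.g.
    // "nb-1-big-Data.db" -> "nb-1-big". Returns null for other files.
    static String baseName(Path file)
    {
        String name = file.getFileName().toString();
        int idx = name.lastIndexOf('-');
        return idx > 0 ? name.substring(0, idx) : null;
    }

    // Lists the directory once and groups each component (with its digest)
    // under the SSTable base name. String stands in for the Digest type.
    static Map<String, Map<Path, String>> groupByBaseName(Path dir,
                                                          Function<Path, String> digester)
    throws IOException
    {
        Map<String, Map<Path, String>> grouped = new HashMap<>();
        try (Stream<Path> files = Files.list(dir))
        {
            files.filter(Files::isRegularFile).forEach(file -> {
                String base = baseName(file);
                if (base != null)
                {
                    grouped.computeIfAbsent(base, k -> new HashMap<>())
                           .put(file, digester.apply(file));
                }
            });
        }
        return grouped;
    }
}
```

With such a map, a caller that needs all components of one SSTable can read 
grouped.get(baseName) instead of scanning the directory again with a glob 
pattern.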



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
