[ 
https://issues.apache.org/jira/browse/CASSANDRA-19661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=18055439#comment-18055439
 ] 

Dmitry Konstantinov commented on CASSANDRA-19661:
-------------------------------------------------

Thank you for the logs.
I think I have a theory about the root cause of this issue.
Initially I thought it could be some kind of concurrency issue, with two 
different flusher threads trying to write the same index, but I have not found 
a code path where that could happen, and based on the latest comments the issue 
is reproduced with memtable_flush_writers=1 as well.

The second idea was about the loop here, where we invoke writer.commit for every 
Flushing.FlushRunnable:
{code:java}
for (SSTableMultiWriter writer : flushResults)
{
    accumulate = writer.commit(accumulate);
    metric.flushSizeOnDisk.update(writer.getOnDiskBytesWritten());
}
{code}
This could happen (and maybe it is a real issue too) if we have multiple data 
directories and a memtable is flushed to several SSTables across them.
But based on the provided log that is not the case here - we have a single data 
directory:
{code:java}
DEBUG [main] 2026-01-29 09:59:01,308 DiskBoundaryManager.java:57 - Updating 
boundaries from null to 
DiskBoundaries{directories=[DataDirectory{location=/app/cassandra/data}],
{code}
After looking at the stack trace more closely, I noticed that we use 
org.apache.cassandra.db.compaction.unified.ShardedMultiWriter.commit,
which has multiple inner writers:
{code:java}
@Override
public Throwable commit(Throwable accumulate)
{
    Throwable t = accumulate;
    for (SSTableWriter writer : writers)
        if (writer != null)
            t = writer.commit(t);
    return t;
}
{code}
So, if we have more than one writer here, we will invoke the commit 
method several times for the same observers:
{code:java}
observers.forEach(SSTableFlushObserver::complete);
{code}
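For illustration, here is a minimal standalone sketch (with hypothetical stand-in classes, not the real Cassandra types) of how several per-shard writers whose observers all point at the same memtable index end up invoking complete() more than once:
{code:java}
import java.util.ArrayList;
import java.util.List;

public class DoubleCompleteSketch
{
    // Stand-in for SSTableFlushObserver: complete() must be called exactly once.
    static class Observer
    {
        int completions = 0;
        void complete() { completions++; }
    }

    // Stand-in for one per-shard SSTableWriter holding the shared observer list.
    static class ShardWriter
    {
        final List<Observer> observers;
        ShardWriter(List<Observer> observers) { this.observers = observers; }
        void commit() { observers.forEach(Observer::complete); }
    }

    // Mimics ShardedMultiWriter.commit: iterate over all inner writers
    // and commit each one, completing the same observer every time.
    public static int completionsForShards(int shards)
    {
        Observer observer = new Observer();
        List<ShardWriter> writers = new ArrayList<>();
        for (int i = 0; i < shards; i++)
            writers.add(new ShardWriter(List.of(observer)));
        writers.forEach(ShardWriter::commit);
        return observer.completions;
    }

    public static void main(String[] args)
    {
        // With 4 shards the single observer is completed 4 times instead of once.
        System.out.println(completionsForShards(4));
    }
}
{code}
With 4 shards the single observer is completed 4 times instead of once, which would match the Preconditions.checkState failure in the stack trace.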
The observers are retrieved in the same way during construction of 
org.apache.cassandra.io.sstable.format.SSTableWriter#SSTableWriter:
{code:java}
SSTableFlushObserver observer = group.getFlushObserver(descriptor, 
lifecycleNewTracker, metadata.getLocal());
{code}
The number of shards is calculated by UCS here: 
org.apache.cassandra.db.compaction.UnifiedCompactionStrategy#createSSTableMultiWriter

and this logic applies to flushing writers too:
{code:java}
double flushDensity = cfs.metric.flushSizeOnDisk.get() * 
shardManager.shardSetCoverage() / shardManager.localSpaceCoverage();
ShardTracker boundaries = 
shardManager.boundaries(controller.getNumShards(flushDensity));
{code}
It is dynamic and depends on the data distribution/density.
controller.getNumShards prints a debug log, so we can find proof of this idea 
in debug.log:
{code:java}
INFO  [NativePoolCleaner] 2026-01-29 11:34:52,296 ColumnFamilyStore.java:1052 - 
Enqueuing flush of vector_bench.vectors_cbp, Reason: MEMTABLE_LIMIT, Usage: 
2.000GiB (20%) on-heap, 1.256GiB (13%) off-heap
DEBUG [MemtableFlushWriter:5] 2026-01-29 11:34:53,071 Controller.java:348 - 
Shard count 4 for density 1.241 GiB, 1.8334467342069816 times target 693.273 MiB
{code}
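As a sanity check of the numbers in that log line (standalone arithmetic only, not the actual Controller code), the reported "1.8334... times target" figure matches the ratio of the flush density to the target size:
{code:java}
public class DensityRatioCheck
{
    // Ratio of flush density (in GiB) to the target size (in MiB).
    public static double ratio(double densityGiB, double targetMiB)
    {
        return densityGiB * 1024.0 / targetMiB;
    }

    public static void main(String[] args)
    {
        // 1.241 GiB / 693.273 MiB ~= 1.833, matching the debug log above
        // (the small difference comes from the density being rounded in the log).
        System.out.println(ratio(1.241, 693.273));
    }
}
{code}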
=====

As a result, it looks like the issue is caused by a combination of two features: 
UCS, which can flush a memtable to multiple sharded SSTables, and the SAI vector 
index, which does not support sharding 
(https://issues.apache.org/jira/browse/CASSANDRA-20752 is a related change, but 
it does not appear to cover the vector index scenario).

So, if the theory is correct, a possible workaround could be to switch compaction 
from UCS to another strategy for this table, or to force UCS to use a single shard 
(based on the org.apache.cassandra.db.compaction.unified.Controller#getNumShards 
code, we can achieve this by setting min_sstable_size to 0 and base_shard_count to 1).
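For the single-shard option, the statement might look something like this (a sketch only - the compaction option names are taken from the Controller code mentioned above and the other values from the reported table definition; please double-check the exact option names/formats before applying):
{noformat}
ALTER TABLE gpt.docs
    WITH compaction = {
        'class': 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy',
        'scaling_parameters': 'T4',
        'target_sstable_size': '1GiB',
        'base_shard_count': '1',
        'min_sstable_size': '0MiB'
    };
{noformat}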

[~maedhroz] [~blambov]  - what do you think, am I missing something?

> Cannot restart Cassandra 5 after creating a vector table and index
> ------------------------------------------------------------------
>
>                 Key: CASSANDRA-19661
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19661
>             Project: Apache Cassandra
>          Issue Type: Bug
>          Components: Feature/SAI, Feature/Vector Search, Local/Startup and 
> Shutdown
>            Reporter: Sergio Rua
>            Priority: Normal
>             Fix For: 5.0.x, 6.x
>
>         Attachments: 5.0.2_fail_memtableflush_vector_full.txt, logs.tar.gz, 
> upload_content.py
>
>
> I'm using llama-index and llama3 to train a model. I'm using a very simple 
> script that reads some *.txt files from the local disk, uploads them to 
> Cassandra, and then creates the index:
>  
> {code:java}
> # Create the index from documents
> index = VectorStoreIndex.from_documents(
>     documents,
>     service_context=vector_store.service_context,
>     storage_context=storage_context,
>     show_progress=True,
>     ) {code}
> This works well and I'm able to use a chat app to get responses from the 
> Cassandra data. However, right afterwards, I cannot restart Cassandra. It 
> breaks with the following error:
>  
> {code:java}
> INFO  [PerDiskMemtableFlushWriter_0:7] 2024-05-23 08:23:20,102 
> Flushing.java:179 - Completed flushing 
> /data/cassandra/data/gpt/docs_20240523-10c8eaa018d811ef8dadf75182f3e2b4/da-6-bti-Data.db
>  (124.236MiB) for commitlog position 
> CommitLogPosition(segmentId=1716452305636, position=15336)
> [...]
> WARN  [MemtableFlushWriter:1] 2024-05-23 08:28:29,575 
> MemtableIndexWriter.java:92 - [gpt.docs.idx_vector_docs] Aborting index 
> memtable flush for 
> /data/cassandra/data/gpt/docs-aea77a80184b11ef8dadf75182f3e2b4/da-3-bti...{code}
> {code:java}
> java.lang.IllegalStateException: null
>         at 
> com.google.common.base.Preconditions.checkState(Preconditions.java:496)
>         at 
> org.apache.cassandra.index.sai.disk.v1.vector.VectorPostings.computeRowIds(VectorPostings.java:76)
>         at 
> org.apache.cassandra.index.sai.disk.v1.vector.OnHeapGraph.writeData(OnHeapGraph.java:313)
>         at 
> org.apache.cassandra.index.sai.memory.VectorMemoryIndex.writeDirect(VectorMemoryIndex.java:272)
>         at 
> org.apache.cassandra.index.sai.memory.MemtableIndex.writeDirect(MemtableIndex.java:110)
>         at 
> org.apache.cassandra.index.sai.disk.v1.MemtableIndexWriter.flushVectorIndex(MemtableIndexWriter.java:192)
>         at 
> org.apache.cassandra.index.sai.disk.v1.MemtableIndexWriter.complete(MemtableIndexWriter.java:117)
>         at 
> org.apache.cassandra.index.sai.disk.StorageAttachedIndexWriter.complete(StorageAttachedIndexWriter.java:185)
>         at java.base/java.util.ArrayList.forEach(ArrayList.java:1541)
>         at 
> java.base/java.util.Collections$UnmodifiableCollection.forEach(Collections.java:1085)
>         at 
> org.apache.cassandra.io.sstable.format.SSTableWriter.commit(SSTableWriter.java:289)
>         at 
> org.apache.cassandra.db.compaction.unified.ShardedMultiWriter.commit(ShardedMultiWriter.java:219)
>         at 
> org.apache.cassandra.db.ColumnFamilyStore$Flush.flushMemtable(ColumnFamilyStore.java:1323)
>         at 
> org.apache.cassandra.db.ColumnFamilyStore$Flush.run(ColumnFamilyStore.java:1222)
>         at 
> org.apache.cassandra.concurrent.ExecutionFailure$1.run(ExecutionFailure.java:133)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
>         at 
> java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
>         at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
>         at java.base/java.lang.Thread.run(Thread.java:829) {code}
> The table created by the script is as follows:
>  
> {noformat}
> CREATE TABLE gpt.docs (
>     partition_id text,
>     row_id text,
>     attributes_blob text,
>     body_blob text,
>     vector vector<float, 1024>,
>     metadata_s map<text, text>,
>     PRIMARY KEY (partition_id, row_id)
> ) WITH CLUSTERING ORDER BY (row_id ASC)
>     AND additional_write_policy = '99p'
>     AND allow_auto_snapshot = true
>     AND bloom_filter_fp_chance = 0.01
>     AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
>     AND cdc = false
>     AND comment = ''
>     AND compaction = {'class': 
> 'org.apache.cassandra.db.compaction.UnifiedCompactionStrategy', 
> 'scaling_parameters': 'T4', 'target_sstable_size': '1GiB'}
>     AND compression = {'chunk_length_in_kb': '16', 'class': 
> 'org.apache.cassandra.io.compress.LZ4Compressor'}
>     AND memtable = 'default'
>     AND crc_check_chance = 1.0
>     AND default_time_to_live = 0
>     AND extensions = {}
>     AND gc_grace_seconds = 864000
>     AND incremental_backups = true
>     AND max_index_interval = 2048
>     AND memtable_flush_period_in_ms = 0
>     AND min_index_interval = 128
>     AND read_repair = 'BLOCKING'
>     AND speculative_retry = '99p';
> CREATE CUSTOM INDEX eidx_metadata_s_docs ON gpt.docs (entries(metadata_s)) 
> USING 'org.apache.cassandra.index.sai.StorageAttachedIndex';
> CREATE CUSTOM INDEX idx_vector_docs ON gpt.docs (vector) USING 
> 'org.apache.cassandra.index.sai.StorageAttachedIndex';{noformat}
> Thank you
>  



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
