Dmitry Konstantinov created CASSANDRA-19557:
-----------------------------------------------

             Summary: ShallowIndexedEntry scenario: the same IndexInfo is read 
multiple times, per every read row
                 Key: CASSANDRA-19557
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19557
             Project: Cassandra
          Issue Type: Bug
          Components: Local/SSTable
            Reporter: Dmitry Konstantinov


When rows are read from a large partition stored in an SSTable and 
ShallowIndexedEntry is used, the same IndexInfo entity is read from disk 
again for every row that is read.
The following stack trace shows the execution path:
{code:java}
at org.apache.cassandra.db.RowIndexEntry$ShallowInfoRetriever.fetchIndex(RowIndexEntry.java:742)
at org.apache.cassandra.db.RowIndexEntry$FileIndexInfoRetriever.columnsIndex(RowIndexEntry.java:792)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.index(AbstractSSTableIterator.java:528)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.currentIndex(AbstractSSTableIterator.java:523)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.isPastCurrentBlock(AbstractSSTableIterator.java:513)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.updateBlock(AbstractSSTableIterator.java:487) <=== here we retrieve the current index entry
at org.apache.cassandra.db.columniterator.SSTableIterator$ForwardIndexedReader.computeNext(SSTableIterator.java:290) <========== here we iterate over rows
at org.apache.cassandra.db.columniterator.SSTableIterator$ForwardReader.hasNextInternal(SSTableIterator.java:182)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator$Reader.hasNext(AbstractSSTableIterator.java:342)
at org.apache.cassandra.db.columniterator.AbstractSSTableIterator.hasNext(AbstractSSTableIterator.java:224)
at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
at org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
at org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.computeNext(UnfilteredRowIteratorWithLowerBound.java:110)
at org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.computeNext(UnfilteredRowIteratorWithLowerBound.java:48)
at org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
at org.apache.cassandra.db.transform.UnfilteredRows.isEmpty(UnfilteredRows.java:74)
at org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:76)
at org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:27)
at org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:97)
at org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:303)
at org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:191)
at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:181)
at org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:177)
at org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:308)
at org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1991)
at org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2277)
at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
at org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
at java.lang.Thread.run(Thread.java:829)
{code}
This logic was originally written for the case where there is a small number 
of index entries and all of them are held in memory, so retrieving the 
IndexInfo again and again was cheap. Support for ShallowIndexedEntry was added 
later (CASSANDRA-11206). It shares the same IndexState logic, but fetching an 
IndexInfo is much more expensive in this case because it is read from disk 
every time.

I have "disk_access_mode: standard" enabled, and each IndexInfo retrieval 
issues two read syscalls:
1) one to find the offset of the corresponding IndexInfo
2) one to read the IndexInfo itself

Tracing the pread syscalls shows the following disk read pattern:
{code:java}
2722116<ReadStage-5> 14:20:00.769513 pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 338611) = 65536 <0.000021>
2722116<ReadStage-5> 14:20:00.769562 pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 164405) = 65536 <0.000020>
2722116<ReadStage-5> 14:20:00.769616 pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 338611) = 65536 <0.000021>
2722116<ReadStage-5> 14:20:00.769664 pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 164405) = 65536 <0.000020>
... same is repeated many times ...
{code}
That is, the same 2 positions are read again and again, once for every data 
row read from the Data file.
Also, because these 2 reads are not near each other, the index reader buffer 
is reset each time, so it does not prevent the repeated disk reads.

This can be improved with a simple local change: cache the last read IndexInfo 
within IndexState. A single IndexInfo is small, so caching it does not create 
much GC pressure.
The caching is effective because, as we iterate through rows, we move through 
the IndexInfo items sequentially in step with them, so we fetch an IndexInfo 
once and reuse it until we need the next one.

I have an example (pathological but still representative) where 350'000 rows 
are read from a partition (the majority are expired tombstones). Before the 
change the read takes 1196 ms, versus 150 ms after it (361'262 read syscalls 
on the Index file before, versus 622 after).

I will attach a patch with the proposed change soon.

Other optimisations are possible for the ShallowIndexedEntry logic, but they 
are more complicated. Once I have specific proposals ready I will create 
separate tickets.


