[ 
https://issues.apache.org/jira/browse/CASSANDRA-19557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dmitry Konstantinov updated CASSANDRA-19557:
--------------------------------------------
    Attachment: 19557-4.1.patch

> ShallowIndexedEntry scenario: the same IndexInfo is read multiple times, per 
> every read row
> -------------------------------------------------------------------------------------------
>
>                 Key: CASSANDRA-19557
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-19557
>             Project: Cassandra
>          Issue Type: Bug
>          Components: Local/SSTable
>            Reporter: Dmitry Konstantinov
>            Priority: Normal
>         Attachments: 19557-4.1.patch
>
>
> When we read rows from a large partition stored in an SSTable and 
> ShallowIndexedEntry is used - the same IndexInfo entity is read from disk 
> multiple times, it happens per every read row.
> The following stacktrace shows the execution path:
> {code:java}
> at 
> org.apache.cassandra.db.RowIndexEntry$ShallowInfoRetriever.fetchIndex(RowIndexEntry.java:742)
> at 
> org.apache.cassandra.db.RowIndexEntry$FileIndexInfoRetriever.columnsIndex(RowIndexEntry.java:792)
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.index(AbstractSSTableIterator.java:528)
>  
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.currentIndex(AbstractSSTableIterator.java:523)
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.isPastCurrentBlock(AbstractSSTableIterator.java:513)
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator$IndexState.updateBlock(AbstractSSTableIterator.java:487)
>  <=== here we retrieve the current index entry
> at 
> org.apache.cassandra.db.columniterator.SSTableIterator$ForwardIndexedReader.computeNext(SSTableIterator.java:290)
>  <========== here we iterate over rows
> at 
> org.apache.cassandra.db.columniterator.SSTableIterator$ForwardReader.hasNextInternal(SSTableIterator.java:182)
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator$Reader.hasNext(AbstractSSTableIterator.java:342)
> at 
> org.apache.cassandra.db.columniterator.AbstractSSTableIterator.hasNext(AbstractSSTableIterator.java:224)
> at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
> at 
> org.apache.cassandra.db.rows.LazilyInitializedUnfilteredRowIterator.computeNext(LazilyInitializedUnfilteredRowIterator.java:100)
> at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.computeNext(UnfilteredRowIteratorWithLowerBound.java:110)
> at 
> org.apache.cassandra.db.rows.UnfilteredRowIteratorWithLowerBound.computeNext(UnfilteredRowIteratorWithLowerBound.java:48)
> at 
> org.apache.cassandra.utils.AbstractIterator.hasNext(AbstractIterator.java:47)
> at org.apache.cassandra.db.transform.BaseRows.hasNext(BaseRows.java:133)
> at 
> org.apache.cassandra.db.transform.UnfilteredRows.isEmpty(UnfilteredRows.java:74)
> at 
> org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:76)
> at 
> org.apache.cassandra.db.partitions.PurgeFunction.applyToPartition(PurgeFunction.java:27)
> at 
> org.apache.cassandra.db.transform.BasePartitions.hasNext(BasePartitions.java:97)
> at 
> org.apache.cassandra.db.partitions.UnfilteredPartitionIterators$Serializer.serialize(UnfilteredPartitionIterators.java:303)
> at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.build(ReadResponse.java:191)
> at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:181)
> at 
> org.apache.cassandra.db.ReadResponse$LocalDataResponse.<init>(ReadResponse.java:177)
> at 
> org.apache.cassandra.db.ReadResponse.createDataResponse(ReadResponse.java:48)
> at org.apache.cassandra.db.ReadCommand.createResponse(ReadCommand.java:308)
> at 
> org.apache.cassandra.service.StorageProxy$LocalReadRunnable.runMayThrow(StorageProxy.java:1991)
> at 
> org.apache.cassandra.service.StorageProxy$DroppableRunnable.run(StorageProxy.java:2277)
> at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:515)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$FutureTask.run(AbstractLocalAwareExecutorService.java:165)
> at 
> org.apache.cassandra.concurrent.AbstractLocalAwareExecutorService$LocalSessionFutureTask.run(AbstractLocalAwareExecutorService.java:137)
> at org.apache.cassandra.concurrent.SEPWorker.run(SEPWorker.java:119)
> at 
> io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
> at java.lang.Thread.run(Thread.java:829)
> {code}
> This Cassandra logic was originally written for the case when there is a 
> small number of index entries and all of them are fetched in memory, so it 
> was cheap to retrieve the IndexInfo again and again for this case. Later a 
> support of ShallowIndexedEntry case was added (CASSANDRA-11206) which shares 
> the same IndexState logic but in this case the cost to fetch IndexInfo is 
> much higher - we read it from disk every time.
> I have "disk_access_mode: standard" enabled and for each IndexInfo retrieval 
> there are two read syscalls:
> 1) to find an offset for the correspondent IndexInfo
> 2) to read the IndexInfo itself
> If we trace pread syscall we will see the following disk read syscall pattern:
> {code:java}
> 2722116<ReadStage-5> 14:20:00.769513 
> pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 338611) = 
> 65536 <0.000021>
> 2722116<ReadStage-5> 14:20:00.769562 
> pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 164405) = 
> 65536 <0.000020>
> 2722116<ReadStage-5> 14:20:00.769616 
> pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 338611) = 
> 65536 <0.000021>
> 2722116<ReadStage-5> 14:20:00.769664 
> pread64(102<my_keyspace/mytable/nb-1-big-Index.db>, ""..., 65536, 164405) = 
> 65536 <0.000020>
> ... same is repeated many times ...
> {code}
> when we do 2 reads for the same positions again and again, for every data row 
> we read from Data file. 
> Also because these 2 reads are not near each other - we reset index reader 
> buffer each time, so it does not help to prevent the repeated disk reads.
> This can be improved with a local simple change: we can cache the last read 
> IndexInfo within IndexState, one IndexInfo is small and it doesn't create a 
> lot of GC pressure.
> The caching is efficient because when we iterate through rows we iterate in 
> parallel through the IndexInfo items in a sequential way, so we fetch once an 
> IndexInfo and then re-use it till the moment when we need the next one.
> I have an example (pathological but still representative) when 350'000 rows 
> are read from a partition (majority are expired tombstones). Before the 
> change the read takes 1196 ms vs 150 ms after (361'262 read syscalls from 
> Index file before vs 622 after).
> I will attach a patch with the proposed change soon.
> There are other optimisations which can be done for ShallowIndexedEntry logic 
> additionally but they are more complicated, once I will have some specific 
> proposals ready I will create separate tickets.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

---------------------------------------------------------------------
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org

Reply via email to