If the data is stored in ASC order and the query asks for DESC, wouldn’t it read the whole partition first and then pick the data in reverse order?
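To make the question concrete, here is a toy Python sketch. It is not Cassandra's actual read path. It assumes rows can be iterated from either end of a clustering-ordered partition, and that a range tombstone only filters rows as they are read instead of letting the reader jump past them, which is what Hannu guesses further down the thread. The names `scan`, `tombstone_before` and `limit` are made up for illustration.

```python
def scan(rows, tombstone_before, limit, reverse=False):
    """Toy model of a partition read.

    rows: clustering keys stored in ascending order.
    tombstone_before: hypothetical range tombstone covering keys < this value.
    Returns (live_rows, rows_touched).
    """
    ordered = reversed(rows) if reverse else iter(rows)
    live, touched = [], 0
    for timeid in ordered:
        touched += 1
        if timeid < tombstone_before:
            continue  # row is shadowed by the tombstone, but was still read
        live.append(timeid)
        if len(live) == limit:
            break  # query limit reached, stop reading
    return live, touched


rows = list(range(1000))  # stored ASC; the oldest half is deleted
asc_live, asc_touched = scan(rows, tombstone_before=500, limit=10)
desc_live, desc_touched = scan(rows, tombstone_before=500, limit=10, reverse=True)
print(asc_touched, desc_touched)
```

In this model the ASC read has to step over every deleted row before it finds live data, while the DESC read hits the limit on live rows and never reaches the deleted half. Whether the real reverse iterator avoids loading the whole partition is exactly the open question here.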
> On May 16, 2017, at 10:03 AM, Stefano Ortolani <ostef...@gmail.com> wrote:
>
> Hi Hannu,
>
> the piece of data in question is older. In my example the tombstone is the
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the
> data is clustering key sorted, I would expect a linear scan not to be
> necessary.
>
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com
> <mailto:hkro...@gmail.com>> wrote:
> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip
> bigger regions of deleted data based on range tombstone. If some piece of
> data in a partition is newer than the tombstone, then it cannot be skipped.
> Therefore some partition level statistics of cell ages would need to be kept
> in the column index for the skipping and that is probably not there.
>
> Hannu
>
>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com
>> <mailto:ostef...@gmail.com>> wrote:
>>
>> That is another way to see the question: are reverse iterators range
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior.
>> I would expect them to handle this case more gracefully.
>>
>> Cheers,
>> Stefano
>>
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com
>> <mailto:ni...@bamlabs.com>> wrote:
>> Hannu,
>>
>> How can you read a partition in reverse?
>>
>> Sent from my iPhone
>>
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com
>> > <mailto:hkro...@gmail.com>> wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range
>> > tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is within
>> > the range of the tombstone but is newer than the tombstone and therefore
>> > it might be still be returned. Scanning through deleted data can be
>> > avoided by reading the partition in reverse (if all the deleted data is in
>> > the beginning of the partition). Eventually you will still end up reading
>> > a lot of tombstones but you will get a lot of live data first and the
>> > implicit query limit of 10000 probably is reached before you get to the
>> > tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com
>> >> <mailto:ostef...@gmail.com>> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide
>> >> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence the
>> >> message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as
>> >> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>     hash blob,
>> >>     timeid timeuuid,
>> >>     PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>     AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a
>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest
>> >> _half_ of that partition by executing the query below, and restart the
>> >> node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled the following query timeouts (takes more
>> >> than 10 seconds to succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted
>> >> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just because
>> >> the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range
>> >> tombstone until C* reads the last sstables (which actually contains the
>> >> range tombstone and is flushed at node restart), and it wastes time
>> >> reading all rows that are actually not live anymore.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> > ---------------------------------------------------------------------
>> > To unsubscribe, e-mail: user-unsubscr...@cassandra.apache.org
>> > <mailto:user-unsubscr...@cassandra.apache.org>
>> > For additional commands, e-mail: user-h...@cassandra.apache.org
>> > <mailto:user-h...@cassandra.apache.org>
>> >
>> >
>