If the data is stored in ASC order and the query asks for DESC, wouldn’t it 
read the whole partition first and then pick the data in reverse order?
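
To make the scenario in the quoted thread concrete, here is a toy Python model of it. It is only a sketch and not Cassandra's actual read path. It assumes, as conjectured in the thread, that the reader checks every row against the range tombstone instead of skipping the deleted range up front, and that a reverse read can start from the end of the partition without touching the beginning. The names `RangeTombstone` and `scan` are hypothetical, and the small integer keys and the limit of 100 stand in for the timeuuids and the much larger limit mentioned in the thread.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class RangeTombstone:
    """Hypothetical stand-in for `DELETE ... WHERE timeid < upper_bound`."""
    upper_bound: int

    def covers(self, clustering_key: int) -> bool:
        return clustering_key < self.upper_bound


def scan(keys: List[int], tombstone: RangeTombstone, limit: int,
         reverse: bool = False) -> Tuple[List[int], int]:
    """Return (live keys, rows touched) for a naive row-by-row scan.

    Assumption: the reader cannot skip the deleted range in one step and
    has to test each row against the tombstone.
    """
    ordered = reversed(keys) if reverse else iter(keys)
    live: List[int] = []
    touched = 0
    for key in ordered:
        if len(live) >= limit:
            break  # enough live rows collected, stop reading
        touched += 1
        if not tombstone.covers(key):
            live.append(key)
    return live, touched


# Toy partition: 1,000 rows in ascending clustering order, oldest half deleted.
partition = list(range(1000))
tombstone = RangeTombstone(upper_bound=500)

asc_live, asc_touched = scan(partition, tombstone, limit=100)
desc_live, desc_touched = scan(partition, tombstone, limit=100, reverse=True)

# Prints: ASC touched 600 rows, DESC touched 100 rows
print(f"ASC touched {asc_touched} rows, DESC touched {desc_touched} rows")
```

In this model the ASC scan pays for all 500 deleted rows before it returns anything, while the DESC scan fills its limit from live rows and never reaches the deleted half. Whether the real reverse iterator behaves this way is exactly what I am asking.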


> On May 16, 2017, at 10:03 AM, Stefano Ortolani <ostef...@gmail.com> wrote:
> 
> Hi Hannu,
> 
> the piece of data in question is older. In my example the tombstone is the 
> newest piece of data.
> Since a range tombstone has information re the clustering key ranges, and the 
> data is clustering key sorted, I would expect a linear scan not to be 
> necessary.
> 
> On Tue, May 16, 2017 at 3:46 PM, Hannu Kröger <hkro...@gmail.com> wrote:
> Well, as mentioned, probably Cassandra doesn’t have logic and data to skip 
> bigger regions of deleted data based on range tombstone. If some piece of 
> data in a partition is newer than the tombstone, then it cannot be skipped. 
> Therefore some partition level statistics of cell ages would need to be kept 
> in the column index for the skipping and that is probably not there.
> 
> Hannu 
> 
>> On 16 May 2017, at 17:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>> 
>> That is another way to see the question: are reverse iterators range 
>> tombstone aware? Yes.
>> That is why I am puzzled by this afore-mentioned behavior. 
>> I would expect them to handle this case more gracefully.
>> 
>> Cheers,
>> Stefano
>> 
>> On Tue, May 16, 2017 at 3:29 PM, Nitan Kainth <ni...@bamlabs.com> wrote:
>> Hannu,
>> 
>> How can you read a partition in reverse?
>> 
>> > On May 16, 2017, at 9:20 AM, Hannu Kröger <hkro...@gmail.com> wrote:
>> >
>> > Well, I’m guessing that Cassandra doesn't really know if the range 
>> > tombstone is useful for this or not.
>> >
>> > In many cases it might be that the partition contains data that is within 
>> > the range of the tombstone but is newer than the tombstone and therefore 
>> > it might still be returned. Scanning through deleted data can be 
>> > avoided by reading the partition in reverse (if all the deleted data is in 
>> > the beginning of the partition). Eventually you will still end up reading 
>> > a lot of tombstones but you will get a lot of live data first and the 
>> > implicit query limit of 10000 probably is reached before you get to the 
>> > tombstones. Therefore you will get an immediate answer.
>> >
>> > Does it make sense?
>> >
>> > Hannu
>> >
>> >> On 16 May 2017, at 16:33, Stefano Ortolani <ostef...@gmail.com> wrote:
>> >>
>> >> Hi all,
>> >>
>> >> I am seeing inconsistencies when mixing range tombstones, wide 
>> >> partitions, and reverse iterators.
>> >> I still have to understand if the behaviour is to be expected hence the 
>> >> message on the mailing list.
>> >>
>> >> The situation is conceptually simple. I am using a table defined as 
>> >> follows:
>> >>
>> >> CREATE TABLE test_cql.test_cf (
>> >>  hash blob,
>> >>  timeid timeuuid,
>> >>  PRIMARY KEY (hash, timeid)
>> >> ) WITH CLUSTERING ORDER BY (timeid ASC)
>> >>  AND compaction = {'class' : 'LeveledCompactionStrategy'};
>> >>
>> >> I then proceed by loading 2/3GB from 3 sstables which I know contain a 
>> >> really wide partition (> 512 MB) for `hash = x`. I then delete the oldest 
>> >> _half_ of that partition by executing the query below, and restart the 
>> >> node:
>> >>
>> >> DELETE
>> >> FROM test_cql.test_cf
>> >> WHERE hash = x AND timeid < y;
>> >>
>> >> If I keep compactions disabled, the following query times out (it takes 
>> >> more than 10 seconds to succeed):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid ASC;
>> >>
>> >> While the following returns immediately (obviously because no deleted 
>> >> data is ever read):
>> >>
>> >> SELECT *
>> >> FROM test_cql.test_cf
>> >> WHERE hash = 0x963204d451de3e611daf5e340c3594acead0eaaf
>> >> ORDER BY timeid DESC;
>> >>
>> >> If I force a compaction the problem is gone, but I presume just because 
>> >> the data is rearranged.
>> >>
>> >> It seems to me that reading by ASC does not make use of the range 
>> >> tombstone until C* reads the last sstable (which actually contains the 
>> >> range tombstone and is flushed at node restart), so it wastes time 
>> >> reading all the rows that are no longer live.
>> >>
>> >> Is this expected? Should the range tombstone actually help in these cases?
>> >>
>> >> Thanks a lot!
>> >> Stefano
>> >
>> >
>> 
> 
> 
