Hi folks,

thanks for the feedback so far.

@Jeff, there are two distinct cases here:

   1. Range tombstones created on a partial primary key (one that doesn't
   include the last clustering column, for example): a single tombstone can
   shadow many rows.
   2. Range tombstones created on the full primary key: a single tombstone
   can shadow only a single row.

In the first case, the range tombstones are (almost) correctly counted,
but only after merge, since merging happens very early in the read path.
We cannot know how many shadowed rows/cells were actually read, because
they are merged away with the tombstones before we can count them.

In the second case (full primary key delete), no tombstone is counted at
all and we only get a count of the live rows after merge.

I'll illustrate this with the example below:

CREATE TABLE users.test (id int, clust1 text, clust2 text, val1 text,
val2 text, PRIMARY KEY (id, clust1, clust2));

insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c1', 'cc1', 'v1', 'v2');
insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c1', 'cc2', 'v1', 'v2');
insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c1', 'cc3', 'v1', 'v2');
insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c2', 'cc1', 'v1', 'v2');
insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c2', 'cc2', 'v1', 'v2');
insert into users.test(id, clust1, clust2, val1, val2) values (1, 'c2', 'cc3', 'v1', 'v2');

cqlsh> select * from users.test;

 id | clust1 | clust2 | val1 | val2
----+--------+--------+------+------
  1 |     c1 |    cc1 |   v1 |   v2
  1 |     c1 |    cc2 |   v1 |   v2
  1 |     c1 |    cc3 |   v1 |   v2
  1 |     c2 |    cc1 |   v1 |   v2
  1 |     c2 |    cc2 |   v1 |   v2
  1 |     c2 |    cc3 |   v1 |   v2


Tracing session: 4c9804c0-e73a-11e7-931f-517010c60cf9

 activity | timestamp | source | source_elapsed | client
---------------------------------------------------------
...
 Read 6 live rows, 0 deleted rows and 0 tombstone cells [ReadStage-1] | 2017-12-22 18:12:55.954000 | 127.0.0.1 | 5567 | 127.0.0.2

Then I issue a range tombstone on id = 1 and clust1 = 'c1':

cqlsh> delete from users.test where id = 1 and clust1 = 'c1';
cqlsh> select * from users.test;

 id | clust1 | clust2 | val1 | val2
----+--------+--------+------+------
  1 |     c2 |    cc1 |   v1 |   v2
  1 |     c2 |    cc2 |   v1 |   v2
  1 |     c2 |    cc3 |   v1 |   v2

Tracing session: 8597e320-e73b-11e7-931f-517010c60cf9

 activity | timestamp | source | source_elapsed | client
---------------------------------------------------------
...
 Read 3 live rows, 0 deleted rows and 2 tombstone cells [ReadStage-1] | 2017-12-22 18:14:08.855000 | 127.0.0.1 | 2878 | 127.0.0.2

Each range tombstone apparently counts as 2 tombstone cells (probably due
to the start bound marker and the end bound marker?).
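To make that observation concrete, here is a toy sketch in plain Java. It is deliberately NOT Cassandra's actual Unfiltered/RangeTombstoneMarker API, just a model of a counter that treats each bound marker of a range tombstone as one tombstone cell, which would explain the 2-cells-per-range-tombstone figure in the trace:

```java
import java.util.Arrays;
import java.util.List;

// Toy model of a partition read: live rows interleaved with range
// tombstone bound markers. Hypothetical types, not Cassandra's real API.
class TombstoneCellCounter {
    enum Unfiltered { LIVE_ROW, OPEN_BOUND, CLOSE_BOUND }

    static int countTombstoneCells(List<Unfiltered> partition) {
        int tombstoneCells = 0;
        for (Unfiltered u : partition)
            if (u != Unfiltered.LIVE_ROW)
                ++tombstoneCells; // each bound marker counted once
        return tombstoneCells;
    }

    public static void main(String[] args) {
        // One range tombstone (open + close bound) and three live rows,
        // mirroring the trace: 3 live rows, 2 tombstone cells.
        List<Unfiltered> partition = Arrays.asList(
                Unfiltered.OPEN_BOUND,
                Unfiltered.CLOSE_BOUND,
                Unfiltered.LIVE_ROW,
                Unfiltered.LIVE_ROW,
                Unfiltered.LIVE_ROW);
        System.out.println(countTombstoneCells(partition)); // prints 2
    }
}
```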

Then if I just delete a single row:

cqlsh> delete from users.test where id = 1 and clust1 = 'c2' and clust2 = 'cc1';
cqlsh> select * from users.test;

 id | clust1 | clust2 | val1 | val2
----+--------+--------+------+------
  1 |     c2 |    cc2 |   v1 |   v2
  1 |     c2 |    cc3 |   v1 |   v2

Tracing session: b43d9170-e73b-11e7-931f-517010c60cf9

 activity | timestamp | source | source_elapsed | client
---------------------------------------------------------
...
 Read 2 live rows, 1 deleted rows and 2 tombstone cells [ReadStage-1] | 2017-12-22 18:15:27.117000 | 127.0.0.1 | 4487 | 127.0.0.2
...

My patch is applied here, so the deleted row shows up in the new counter,
but the tombstone cell count is unchanged (I haven't touched the way
tombstone cells are counted).
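For readers skimming the thread, the classification the patch performs can be sketched like this (simplified stand-ins, not Cassandra's actual Row/LivenessInfo API; the two booleans model Row#hasLiveData and Row#primaryKeyLivenessInfo().isLive(nowInSec)):

```java
// Simplified stand-in for the row classification added in ReadCommand.
// Hypothetical API: real Cassandra passes Row objects, not booleans.
class RowCounter {
    int liveRows;
    int deletedRows;

    void count(boolean hasLiveData, boolean pkLivenessIsLive) {
        if (hasLiveData)
            ++liveRows;
        else if (!pkLivenessIsLive)
            ++deletedRows; // primary key deletion detected
        // else: the row's cells merely expired; those already show up
        // in the tombstone cell count, so we don't double-count here.
    }

    public static void main(String[] args) {
        RowCounter c = new RowCounter();
        c.count(true, true);   // live row
        c.count(false, false); // row shadowed by a primary key delete
        c.count(false, true);  // row whose cells all expired (TTL)
        System.out.println(c.liveRows + " live, " + c.deletedRows + " deleted");
        // prints: 1 live, 1 deleted
    }
}
```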

It seems that tombstones on the full primary key are the only ones that
are currently impossible to notice. Are they even considered range
tombstones? I've failed to identify them as such while debugging the code.

We still cannot know how many tombstones or non-live cells we're really
reading from disk, due to early merging. What we get in the ReadCommand
class is, expectedly, quite different from what Cassandra reads from
disk/memory.

@Kurt: I'd be in favor of not adding a new setting in the yaml file. I
think it's better for folks to realize early that they're reading a lot of
deleted rows. It makes sense to do this only in 4.x.
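For reference, the existing knobs Kurt mentions are already in cassandra.yaml (shown below with what I believe are the long-standing defaults), so operators can tune around behavior changes without a new setting:

```yaml
# cassandra.yaml -- existing settings (values are the usual defaults).
# Log a warning when a single read scans more than this many tombstones.
tombstone_warn_threshold: 1000
# Abort the read with a TombstoneOverwhelmingException beyond this count.
tombstone_failure_threshold: 100000
```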

@DuyHai: what would you consider an unwise use of Mockito?

On Fri, Dec 22, 2017 at 4:53 AM kurt greaves <k...@instaclustr.com> wrote:

> I think that's a good idea for 4.x, but not so for current branches. I
> think as long as we document it well for 4.0 upgrades it's not so much of a
> problem. Obviously there will be cases of queries failing that were
> previously succeeding but we can already use
> tombstone_failure|warn_threshold to tune around that already. I don't think
> we need another yaml setting to enable/disable counting deleted rows for
> these thresholds, especially because it's going into 4.0. It *might* be a
> good idea to bump the tombstone failure threshold default as a safety net
> though (as well as put something in the NEWS.txt).
>
> On 21 December 2017 at 20:11, Jon Haddad <j...@jonhaddad.com> wrote:
>
> > I had suggested to Alex we kick this discussion over to the ML
> > because the change will have a significant impact on the behavior of
> > Cassandra when doing reads with range tombstones that cover a lot of
> > rows.  The behavior now is a little weird, a single tombstone could
> > shadow hundreds of thousands or even millions of rows, and the query
> > would probably just time out.  Personally, I’m in favor of the change
> > in behavior of this patch but I wanted to get some more feedback
> > before committing to it.  Are there any objections to what Alex
> > described?
> >
> > Regarding Mockito, I’ve been meaning to bring this up for a while,
> > and I’m a solid +1 on introducing it to help with testing.  In an
> > ideal world we’d have no singletons and could test everything in
> > isolation, but realistically that’s a multi-year process and we just
> > aren’t there.
> >
> >
> > > On Dec 19, 2017, at 11:07 PM, Alexander Dejanovski
> > > <a...@thelastpickle.com> wrote:
> > >
> > > Hi folks,
> > >
> > > I'm working on CASSANDRA-8527
> > > <https://issues.apache.org/jira/browse/CASSANDRA-8527> and would
> > > need to discuss a few things.
> > >
> > > The ticket makes it visible in tracing and metrics that rows
> > > shadowed by range tombstones were scanned during reads.
> > > Currently, scanned range tombstones aren't reported anywhere, which
> > > hides the cause of performance issues during reads when users
> > > perform primary key deletes.
> > > As such, they do not count in the warn and failure thresholds.
> > >
> > > While the number of live rows and tombstone cells is counted in the
> > > ReadCommand class, it is currently not possible to count the number
> > > of range tombstones there, as they are merged with the rows they
> > > shadow before reaching the class.
> > > Instead, we can count the number of deleted rows that were read,
> > > which already improves diagnosis and shows that range tombstones
> > > were scanned:
> > >
> > > if (row.hasLiveData(ReadCommand.this.nowInSec(), enforceStrictLiveness))
> > >     ++liveRows;
> > > else if (!row.primaryKeyLivenessInfo().isLive(ReadCommand.this.nowInSec()))
> > > {
> > >     // We want to detect primary key deletions only.
> > >     // If all cells have expired they will count as tombstones.
> > >     ++deletedRows;
> > > }
> > >
> > > Deleted rows would be part of the warning threshold so that we can
> > > spot the range tombstone scans in the logs, and tracing would look
> > > like this:
> > >
> > > WARN  [ReadStage-2] 2017-12-18 18:22:31,352 ReadCommand.java:491 -
> > > Read 2086 live rows, 2440 deleted rows and 0 tombstone cells for
> > > query..
> > >
> > >
> > > Are there any caveats to that approach?
> > > Should we include the number of deleted rows in the failure
> > > threshold or make it optional, knowing that it could make some
> > > queries fail while they were passing before?
> > >
> > > On a side note, is it OK to bring in Mockito in order to make it
> > > easier to write tests? I would like to use a Spy in order to write
> > > some tests without starting the whole stack.
> > >
> > > Thanks,
> > >
> > >
> > > --
> > > -----------------
> > > Alexander Dejanovski
> > > France
> > > @alexanderdeja
> > >
> > > Consultant
> > > Apache Cassandra Consulting
> > > http://www.thelastpickle.com
> >
> >
> > ---------------------------------------------------------------------
> > To unsubscribe, e-mail: dev-unsubscr...@cassandra.apache.org
> > For additional commands, e-mail: dev-h...@cassandra.apache.org
> >
> >
>
-- 
-----------------
Alexander Dejanovski
France
@alexanderdeja

Consultant
Apache Cassandra Consulting
http://www.thelastpickle.com
