[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-27 Thread Sergey Kandyla (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17292178#comment-17292178
 ] 

Sergey Kandyla commented on CASSANDRA-15977:


[~brandon.williams]  I'm not sure, since we did not have any issues with either 
Cassandra 3.11.7 or 3.11.8.

> 4.0 Quality: Read Repair Test Audit
> ---
>
> Key: CASSANDRA-15977
> URL: https://issues.apache.org/jira/browse/CASSANDRA-15977
> Project: Cassandra
>  Issue Type: Task
>  Components: Test/dtest/java, Test/unit
>Reporter: Andres de la Peña
>Assignee: Andres de la Peña
>Priority: Normal
> Fix For: 3.11.9, 4.0, 4.0-beta3
>
> Attachments: Screenshot 2021-02-05 at 18.01.10.png, Screenshot 
> 2021-02-22 at 16.07.29.png, Screenshot 2021-02-22 at 16.07.45.png, Screenshot 
> 2021-02-22 at 16.08.01.png, Screenshot 2021-02-22 at 16.08.17.png, Screenshot 
> 2021-02-22 at 16.10.53.png, Screenshot 2021-02-22 at 16.14.51.png, Screenshot 
> 2021-02-22 at 16.15.12.png, Screenshot 2021-02-22 at 16.18.38.png, Screenshot 
> 2021-02-23 at 08.40.53.png, Screenshot 2021-02-26 at 20.59.08.png
>
>  Time Spent: 13h 50m
>  Remaining Estimate: 0h
>
> This is a subtask of CASSANDRA-15579 focusing on read repair.
> [This 
> document|https://docs.google.com/document/d/1-gldHcdLSMRbDhhI8ahs_tPeAZsjurjXr38xABVjWHE/edit?usp=sharing]
>  lists and describes the existing functional tests for read repair, so we can 
> have a broad view of what is currently covered. We can comment on this 
> document and add ideas for new cases/tests, so it can gradually evolve to a 
> more or less detailed test plan.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: commits-unsubscr...@cassandra.apache.org
For additional commands, e-mail: commits-h...@cassandra.apache.org



[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-26 Thread Brandon Williams (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291890#comment-17291890
 ] 

Brandon Williams commented on CASSANDRA-15977:
--

bq. we have started to experience performance issues after upgrading to 3.11.9, and 
3.11.10 is actually the same.

That sounds more like it would be CASSANDRA-16465 to me.




[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-26 Thread Sergey Kandyla (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17291874#comment-17291874
 ] 

Sergey Kandyla commented on CASSANDRA-15977:


[~adelapena]  as for
>> What's the average number of items of {{ids}}? Is it possible that the 
>> replicas are very out-of-sync?
in my tests I query one id. In the live app there can be multiple, but most 
often the query is still for one id or a few.

I've collected some more info about this problem by upgrading one of our live 
regional clusters (from 3.11.8 to 3.11.10).
All benchmarks were made with the Vegeta load testing tool, which generates a 
constant RPS rate (similar to ab, but a bit more accurate).
The idea was not to overload the cluster, but to generate a constant request 
rate and observe the latency under optimal conditions.


!Screenshot 2021-02-26 at 20.59.08.png|width=831,height=150!
There are 4 tests in this table: one via the live app, and the other 3 via a 
small Golang app that queries the database directly, to isolate any distortions 
from the environment (i.e. k8s and so on).

Cassandra cluster metrics during the benchmark: 
!Screenshot 2021-02-22 at 16.10.53.png|width=723,height=169!
Where 12:03-12:24 is the load test for Cassandra 3.11.8,
12:35-12:40 is the upgrade of Cassandra to 3.11.10, and
12:43-13:05 is the same load test for Cassandra 3.11.10.

CPU load increased by 5-20%.

*Latency:*
!Screenshot 2021-02-22 at 16.07.45.png|width=737,height=181!
Avg latency P50, cassandra 3.11.8  (metrics taken from jolokia2 plugin)

!Screenshot 2021-02-22 at 16.07.29.png|width=741,height=133!
Avg latency P50, cassandra 3.11.10 - actually doubled.

!Screenshot 2021-02-22 at 16.08.01.png|width=742,height=140!
Avg latency P99, cassandra 3.11.8

!Screenshot 2021-02-22 at 16.08.17.png|width=744,height=125!
Avg latency P99, cassandra 3.11.10 - latency doubled again.

Histogram Buckets:
{code:java}
Bucket          #       %       Histogram
[0s,     500µs] 0       0.00%
[500µs,  1ms]   0       0.00%
[1ms,    1.5ms] 0       0.00%
[1.5ms,  2ms]   19208   12.73%  #
[2ms,    3ms]   100785  66.79%  ##
[3ms,    4ms]   12385   8.21%   ##
[4ms,    5ms]   720     0.48%
[5ms,    6ms]   634     0.42%
[6ms,    7ms]   632     0.42%
[7ms,    8ms]   546     0.36%
[8ms,    9ms]   550     0.36%
[9ms,    10ms]  674     0.45%
{code}
 

Latency distribution (for one of the tests) by histogram buckets, Cassandra 3.11.8


{code:java}
Bucket          #       %       Histogram
[0s,     500µs] 0       0.00%
[500µs,  1ms]   0       0.00%
[1ms,    1.5ms] 0       0.00%
[1.5ms,  2ms]   0       0.00%
[2ms,    3ms]   30852   20.45%  ###
[3ms,    4ms]   20332   13.47%  ##
[4ms,    5ms]   69602   46.12%  ##
[5ms,    6ms]   2347    1.56%   #
[6ms,    7ms]   613     0.41%
[7ms,    8ms]   354     0.23%
[8ms,    9ms]   328     0.22%
[9ms,    10ms]  341     0.23%
[10ms,   12ms]  702     0.47%
{code}
Latency distribution (for one of the tests) by histogram buckets, Cassandra 3.11.10
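
To make the shift between the two histograms concrete, here is a minimal sketch that finds the bucket containing the median request for each version. The class and method names are hypothetical and not part of Vegeta or Cassandra. The total request count is not listed in the output, so the sketch assumes it can be estimated from the largest bucket's count and percentage.

```java
/** Hypothetical helper for reading the two histograms above; not part of Vegeta or Cassandra. */
final class HistogramMedian {
    // Bucket labels as listed above ("us" stands in for microseconds to keep the source ASCII).
    static final String[] LABELS = {
        "[0s, 500us]", "[500us, 1ms]", "[1ms, 1.5ms]", "[1.5ms, 2ms]", "[2ms, 3ms]",
        "[3ms, 4ms]", "[4ms, 5ms]", "[5ms, 6ms]", "[6ms, 7ms]", "[7ms, 8ms]",
        "[8ms, 9ms]", "[9ms, 10ms]", "[10ms, 12ms]"
    };

    // Request counts per bucket, copied from the two listings above.
    static final long[] COUNTS_3_11_8 =
        {0, 0, 0, 19208, 100785, 12385, 720, 634, 632, 546, 550, 674};
    static final long[] COUNTS_3_11_10 =
        {0, 0, 0, 0, 30852, 20332, 69602, 2347, 613, 354, 328, 341, 702};

    /** Assumption: the total is not listed, so estimate it from one bucket's count and share. */
    static long estimateTotal(long count, double percent) {
        return Math.round(count * 100.0 / percent);
    }

    /** Returns the label of the first bucket at which the cumulative count reaches the fraction. */
    static String bucketAt(long[] counts, long total, double fraction) {
        double threshold = total * fraction;
        long cumulative = 0;
        for (int i = 0; i < counts.length; i++) {
            cumulative += counts[i];
            if (cumulative >= threshold) {
                return LABELS[i];
            }
        }
        return "beyond the listed buckets";
    }

    public static void main(String[] args) {
        long total8 = estimateTotal(100785, 66.79);
        long total10 = estimateTotal(69602, 46.12);
        System.out.println("3.11.8 median bucket:  " + bucketAt(COUNTS_3_11_8, total8, 0.5));
        System.out.println("3.11.10 median bucket: " + bucketAt(COUNTS_3_11_10, total10, 0.5));
    }
}
```

Under these assumptions the median request falls in the [2ms, 3ms] bucket for 3.11.8 and in the [4ms, 5ms] bucket for 3.11.10, which is consistent with the roughly doubled P50 reported above.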


The following metrics are not clear to me, but maybe they make sense to you.

!Screenshot 2021-02-22 at 16.14.51.png|width=844,height=159!
loadtest cassandra 3.11.8 (left), 3.11.10(right)

!Screenshot 2021-02-22 at 16.18.38.png|width=831,height=522!
loadtest cassandra 3.11.8 (left), 3.11.10(right). 

!Screenshot 2021-02-22 at 16.15.12.png|width=829,height=373!
Load test, Cassandra 3.11.8 (left), 3.11.10 (right). ReadStage shows an increase 
in both Active and Pending tasks for Cassandra 3.11.10. Native Transport 
Requests (Active tasks) also increase.

Finally, the latency difference in a 2-day view for real traffic:
!Screenshot 2021-02-23 at 08.40.53.png|width=835,height=167!
P95 read latency before and after the upgrade to 3.11.10

Note that I've benchmarked Cassandra 3.11.8 against 3.11.10 because we started 
to experience performance issues after upgrading to 3.11.9, and 3.11.10 is 
actually the same.


[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-10 Thread Andres de la Peña (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282684#comment-17282684
 ] 

Andres de la Peña commented on CASSANDRA-15977:
---

[This tiny 
microbenchmark|https://github.com/adelapena/cassandra/commit/9ee9119389d3b77262e10453fdca7c4db46b6f46]
 seems to show that {{ColumnFilter#fetchedCellIsQueried}} is around 2.5 times 
slower with the implementation introduced by this ticket than before, and only 
when the call to {{PartitionColumns#contains}} is involved:
{code}
Benchmark                               Mode  Cnt       Score       Error   Units
FetchedCellIsQueriedBench.testNewImpl  thrpt    5  134722.178 ±  9107.505  ops/ms
FetchedCellIsQueriedBench.testOldImpl  thrpt    5  329104.675 ± 18626.088  ops/ms
{code}
That doesn't necessarily explain the reported 20-30% additional CPU load, since 
this is only one method of many and it doesn't seem particularly slow in any 
case. But it might make it worth trying to reduce the number of calls to 
{{fetchedColumnIsQueried}} from {{fetchedCellIsQueried}}.
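
As a quick sanity check on the scores above, the following sketch converts the two throughput figures into a slowdown ratio and an approximate per-call cost. The class and method names are hypothetical; only the two scores are taken from the benchmark output, and the error margins are ignored.

```java
/** Hypothetical arithmetic helper; only the two scores come from the benchmark output. */
final class ThroughputRatio {
    static final double OLD_OPS_PER_MS = 329104.675; // testOldImpl score
    static final double NEW_OPS_PER_MS = 134722.178; // testNewImpl score

    /** How many times slower the new implementation is, based on throughput alone. */
    static double slowdown(double oldOpsPerMs, double newOpsPerMs) {
        return oldOpsPerMs / newOpsPerMs;
    }

    /** Average nanoseconds per call: one millisecond is 1,000,000 ns. */
    static double nanosPerOp(double opsPerMs) {
        return 1_000_000.0 / opsPerMs;
    }

    public static void main(String[] args) {
        System.out.printf("slowdown: %.2fx%n", slowdown(OLD_OPS_PER_MS, NEW_OPS_PER_MS));
        System.out.printf("old: %.2f ns/op, new: %.2f ns/op%n",
                nanosPerOp(OLD_OPS_PER_MS), nanosPerOp(NEW_OPS_PER_MS));
    }
}
```

The ratio comes out a little under 2.5, and both implementations cost only a few nanoseconds per call, which is why a single slower method may not account for the whole CPU increase on its own.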




[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-10 Thread Andres de la Peña (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17282552#comment-17282552
 ] 

Andres de la Peña commented on CASSANDRA-15977:
---

[~skandyla] thanks for the feedback. I see that the provided typical request 
explicitly selects the {{ids}} column, which is a non-frozen collection. 

Non-frozen collection columns are columns with multiple cells. The 
{{ColumnFilter}} uses the method {{fetchedColumnIsQueried}} for single-cell 
columns and {{fetchedCellIsQueried}} for each cell in a multi-cell column. The 
bug fix introduced by this ticket makes {{fetchedCellIsQueried}} first verify 
that the containing column of the cell is also fetched, doing an extra call to 
{{fetchedColumnIsQueried}} for every collection item. This might be what is 
causing the performance regression. Note that, if I'm right, those extra calls 
only happen if there's an explicit selection of a multi-cell column (not a 
{{*}} selection) and there are conflicts among the queried replicas. What's 
the average number of items of {{ids}}? Is it possible that the replicas are 
very out-of-sync?

For the sake of correctness in (at least) read repair, we need that additional 
call to {{fetchedColumnIsQueried}} to verify that the column of a cell is also 
fetched. Otherwise, we'd be sending repairs for columns that are not selected. 
What we could do to make this more efficient is to make the call to 
{{fetchedColumnIsQueried}} just once, when we start reading the first cell of a 
collection, and reuse that result for every following cell of the same 
collection. I'll open a separate ticket for that after a bit more investigation.
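
The idea of checking the column once per collection can be illustrated with a toy model. This is only a sketch under stated assumptions: the class, its fields, and its methods are hypothetical and do not mirror the real {{ColumnFilter}} code; a set lookup stands in for the column-level check, and every cell of a queried column is assumed to be queried.

```java
import java.util.Set;

/** Toy model; all names are hypothetical and do not mirror Cassandra's ColumnFilter. */
final class ColumnCheckSketch {
    private final Set<String> queriedColumns;
    int columnChecks = 0; // number of column-level checks performed so far

    ColumnCheckSketch(Set<String> queriedColumns) {
        this.queriedColumns = queriedColumns;
    }

    /** Stand-in for the column-level check (assumption: a simple set lookup). */
    boolean columnIsQueried(String column) {
        columnChecks++;
        return queriedColumns.contains(column);
    }

    /** Per-cell approach: the containing column is re-checked for every cell. */
    int countQueriedCellsPerCell(String column, int cellCount) {
        int queried = 0;
        for (int i = 0; i < cellCount; i++) {
            if (columnIsQueried(column)) {
                queried++;
            }
        }
        return queried;
    }

    /** Cached approach: check the column at the first cell and reuse the result. */
    int countQueriedCellsCached(String column, int cellCount) {
        if (cellCount == 0) {
            return 0;
        }
        boolean columnQueried = columnIsQueried(column); // single check per collection
        return columnQueried ? cellCount : 0;
    }

    public static void main(String[] args) {
        ColumnCheckSketch perCell = new ColumnCheckSketch(Set.of("ids"));
        perCell.countQueriedCellsPerCell("ids", 100);
        ColumnCheckSketch cached = new ColumnCheckSketch(Set.of("ids"));
        cached.countQueriedCellsCached("ids", 100);
        System.out.println("per-cell checks: " + perCell.columnChecks);
        System.out.println("cached checks:   " + cached.columnChecks);
    }
}
```

In this model both approaches report the same cells as queried, but the number of column-level checks drops from one per cell to one per collection.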

Also, forgive me for asking, but are you sure that this is the specific commit 
causing the problem? Any help with reproduction steps is more than welcome.

cc [~maedhroz]




[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-08 Thread Sergey Kandyla (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17281267#comment-17281267
 ] 

Sergey Kandyla commented on CASSANDRA-15977:


[~adelapena]
we don't use UDTs or frozen collections.
Data model:
{code:java}
CREATE TABLE book_ticket (
 application_id text,
 user_id text,
 ticket_id text,
 book_type text,
 created_at_timestamp int,
 deleted_at int,
 ids set,
 PRIMARY KEY ((application_id, user_id), ticket_id)
) WITH CLUSTERING ORDER BY (ticket_id ASC)
 AND bloom_filter_fp_chance = 0.01
 AND caching = {'keys': 'ALL', 'rows_per_partition': 'NONE'}
 AND comment = ''
 AND compaction = {'class': 'org.apache.cassandra.db.compaction.SizeTieredCompactionStrategy', 'max_threshold': '32', 'min_threshold': '4'}
 AND compression = {'chunk_length_in_kb': '64', 'class': 'org.apache.cassandra.io.compress.LZ4Compressor'}
 AND crc_check_chance = 1.0
 AND dclocal_read_repair_chance = 0.0
 AND default_time_to_live = 0
 AND gc_grace_seconds = 864000
 AND max_index_interval = 2048
 AND memtable_flush_period_in_ms = 0
 AND min_index_interval = 128
 AND read_repair_chance = 0.0
 AND speculative_retry = '99PERCENTILE';{code}

A typical request from the Golang application looks something like this:
{code:java}
SELECT deleted_at, ids FROM book_ticket
WHERE application_id = ? AND user_id IN ('5fff6d0e-2658-4e93-8f8a-05120a0f021d');{code}

user_id can take multiple values.
Setup: Cassandra 3.11.10, 3 nodes, RF=3, gocql with consistency LOCAL_QUORUM.




[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-05 Thread Andres de la Peña (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279772#comment-17279772
 ] 

Andres de la Peña commented on CASSANDRA-15977:
---

[~skandyla] thanks for the info. Could you please elaborate on the data model 
used by your test? In particular, are you using non-frozen collections or UDTs? 
How many columns do you have in the clustering key?




[jira] [Commented] (CASSANDRA-15977) 4.0 Quality: Read Repair Test Audit

2021-02-05 Thread Sergey Kandyla (Jira)


[ 
https://issues.apache.org/jira/browse/CASSANDRA-15977?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17279741#comment-17279741
 ] 

Sergey Kandyla commented on CASSANDRA-15977:


Hello!
This fix in ColumnFilter leads to 20-30% additional CPU load. 
!Screenshot 2021-02-05 at 18.01.10.png!
