[jira] [Comment Edited] (CASSANDRA-18464) Enable Direct I/O For CommitLog Files

Jacek Lewandowski (Jira) Mon, 27 Nov 2023 04:45:07 -0800


    [ 
https://issues.apache.org/jira/browse/CASSANDRA-18464?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17790064#comment-17790064
 ]


Jacek Lewandowski edited comment on CASSANDRA-18464 at 11/27/23 12:44 PM:
--------------------------------------------------------------------------

I think we are ready to merge. Testing was difficult because of flaky tests. 
Detailed information is below (copied from the pull request):

trunk: https://github.com/apache/cassandra/pull/2931
-------------------------------------------------------------------------------------
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1122/workflows/a6889c2a-3b62-4f04-91e5-4a4e270a6368
 (j11)
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1122/workflows/0c4f1d43-810b-47bb-a83a-0d645f20efc0
 (j17)

Both builds have many failures; most of them are mentioned in 
CASSANDRA-19055 (CEP-21 follow-up). I ran the few others locally on trunk and 
they failed as well, so I've created tickets for them (CASSANDRA-19102, 
CASSANDRA-19101).

The only thing left is 
org.apache.cassandra.distributed.test.ReadRepairEmptyRangeTombstonesTest. It 
failed once on CircleCI with a timeout in the class-level teardown method, while 
the cluster was being shut down. It does not seem related. I could not reproduce 
it locally on either the feature branch or trunk, and I could not reproduce it 
on CircleCI either: 
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1123/workflows/a62417d7-5ecf-4d4e-9303-ff05df3282c0/jobs/51273
 It might have been an infra problem.

cassandra-5.0: https://github.com/apache/cassandra/pull/2894
---------------------------------------------------------------------------------
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1115/workflows/fd53b9c8-ee4d-4e25-8a06-2079f1732f8e
 (j11)
{{test_stop_failure_policy}} passes locally, and a repeated run (100x) on this 
branch passed: 
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1120/workflows/4a9716c9-24c3-485f-9f11-bb596cbbca5f/jobs/50783/steps
 

{{test_shutdown_wiped_node_cannot_join}} is flaky on both this branch 
(https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1118/workflows/0eb8dc23-980d-45bc-9fc6-890ea05941a9/jobs/50573/tests
 
13% flakiness) and on {{cassandra-5.0}} 
(https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1119/workflows/932a44e3-c25e-4c5e-b677-66579380faeb/jobs/50782/tests
 
17% flakiness). There is a ticket for this: 
https://issues.apache.org/jira/browse/CASSANDRA-19097

The repeated unit test job seems to pass all tests but has problems collecting 
logs; the same problem is present on {{cassandra-5.0}} - 
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1114/workflows/a1d02a59-31ee-48e2-9426-a4eca52ed3ff/jobs/50107
 and I've created a ticket for that: 
https://issues.apache.org/jira/browse/CASSANDRA-19086

https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1120/workflows/2dc9ef3b-0a64-47bf-a05b-de7a36b1ca2c
 (j17)
{{test_stop_failure_policy}} failed 1/100 in a repeated run; therefore, I 
reran the repeated run on {{cassandra-5.0}} here: 
https://app.circleci.com/pipelines/github/jacek-lewandowski/cassandra/1121/workflows/fce907b1-0526-4d4d-beb5-b6620737a5f3
 - it failed 2/500, so the test is flaky; I am creating a ticket for it.





> Enable Direct I/O For CommitLog Files
> -------------------------------------
>
>                 Key: CASSANDRA-18464
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-18464
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Local/Commit Log
>            Reporter: Josh McKenzie
>            Assignee: Amit Pawar
>            Priority: Normal
>             Fix For: 5.0.x, 5.x
>
>         Attachments: CommitLogStressTest.patch, 
> EnableDirectIOForCommitLogUsingNativeAPI.patch, 
> PeriodicCommitLogStressTest.tar.bz2, SetCommitLogFileSize.patch, 
> UseDirectIOFeatureForCommitLogFiles.patch, image-2023-06-29-01-12-49-382.png
>
>
> Relocating from [dev@ email 
> thread.|https://lists.apache.org/thread/j6ny17q2rhkp7jxvwxm69dd6v1dozjrg]
>  
> I shared my investigation of the CommitLog I/O issue on large-core-count 
> systems in my previous email dated July-22; the link to the thread is given 
> below.
> [https://lists.apache.org/thread/xc5ocog2qz2v2gnj4xlw5hbthfqytx2n]
> Basically, two solutions looked possible to improve the CommitLog I/O.
>  # Multi-threaded syncing
>  # Using Direct-IO through JNA
> I worked on the 2nd option, considering the following benefits compared to 
> the first one:
>  # Direct I/O read/write throughput is very high compared to non-Direct I/O, 
> as learned through FIO benchmarking.
>  # Reduces kernel file cache usage, which in turn reduces kernel I/O activity 
> for CommitLog files only.
>  # Overall CPU usage is reduced for flush activity. JVisualVM shows CPU usage 
> < 30% for the CommitLog syncer thread with the Direct I/O feature.
>  # Direct I/O implementation is easier compared to multi-threaded syncing.
> As the community suggested, lower code complexity is preferable. Direct 
> I/O enablement looked promising, but there was one issue. 
> Java 8 does not have native support for enabling Direct I/O, so using the 
> JNA library is a must. The same implementation should also work across other 
> versions of Java (like 11 and beyond).
> I have completed the Direct I/O implementation; a summary of the changes in 
> the attached patch is given below.
>  # This implementation does not use Java file channels; the file is opened 
> through JNA to use the Direct I/O feature.
>  # New segments are defined: “DirectIOSegment” for Direct I/O and 
> “NonDirectIOSegment” for non-direct I/O (NonDirectIOSegment is for test 
> purposes only).
>  # The JNA write call is used to flush the changes.
>  # New helper functions are defined in NativeLibrary.java and the 
> platform-specific file. Currently tested on Linux only.
>  # The patch allows the user to configure the optimum block size and 
> alignment if the default values are not suitable for the CommitLog disk.
>  # The following configuration options are provided in the cassandra.yaml 
> file:
> a. use_jna_for_commitlog_io: to use the JNA feature
> b. use_direct_io_for_commitlog: to use the Direct I/O feature
> c. direct_io_minimum_block_alignment: 512 (default)
> d. nvme_disk_block_size: 32MiB (default; can be changed as required)
>  The test matrix is complex, so the CommitLog-related test cases and the 
> TPCx-IOT benchmark were tested. It works with both Java 8 and 11. Compressed 
> and encrypted segments are not supported yet; they can be enabled later 
> based on community feedback.
>  The following improvements are seen with Direct I/O enabled:
>  # 32 cores >= ~15%
>  # 64 cores >= ~80%
>  One more observation I would like to share: reading CommitLog files 
> with Direct I/O might help reduce node bring-up time after a node 
> crash.
>  Tested with commit ID: 91f6a9aca8d3c22a03e68aa901a0b154d960ab07
>  The attached patch enables the Direct I/O feature for CommitLog files. 
> Please check and share your feedback.
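
The quoted description mentions a configurable minimum block alignment (512 by 
default) and block size for Direct I/O. Direct I/O generally requires the 
buffer address and the transfer length to be multiples of the device block 
alignment, so a write buffer has to be rounded up and aligned first. The 
following is only a minimal sketch of that arithmetic, assuming Java 9+ for 
ByteBuffer.alignedSlice; the class and method names are hypothetical and are 
not taken from the attached patches.

```java
import java.nio.ByteBuffer;

/**
 * Illustrative only: alignment helpers for Direct I/O style writes.
 * Class and method names are hypothetical, not part of the patch.
 */
final class DirectIoAlignment {
    /** Default named in the description: direct_io_minimum_block_alignment = 512. */
    static final int DEFAULT_ALIGNMENT = 512;

    /** Rounds size up to the next multiple of alignment (must be a power of two). */
    static long alignUp(long size, int alignment) {
        if (alignment <= 0 || Integer.bitCount(alignment) != 1) {
            throw new IllegalArgumentException("alignment must be a power of two: " + alignment);
        }
        return (size + alignment - 1) & -(long) alignment;
    }

    /** Returns a direct buffer whose start address and length are multiples of alignment. */
    static ByteBuffer allocateAligned(int capacity, int alignment) {
        int aligned = Math.toIntExact(alignUp(capacity, alignment));
        // Over-allocate by one alignment unit so an aligned window of the full size fits.
        ByteBuffer raw = ByteBuffer.allocateDirect(aligned + alignment);
        ByteBuffer slice = raw.alignedSlice(alignment);
        slice.limit(aligned);
        return slice;
    }

    public static void main(String[] args) {
        ByteBuffer buffer = allocateAligned(1000, DEFAULT_ALIGNMENT);
        System.out.println("aligned length: " + buffer.remaining());
        System.out.println("address offset: " + buffer.alignmentOffset(0, DEFAULT_ALIGNMENT));
    }
}
```

A 32MiB block is already a multiple of 512, so with the defaults listed in the 
description only the final, partially filled write would need rounding up.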



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
