Re: [ceph-users] speed decrease with size
I did find the journal configuration entries and they indeed did help for this test, thanks.

Configuration was:
journal_max_write_entries=100
journal_queue_max_ops=300
journal_queue_max_bytes=33554432
journal_max_write_bytes=10485760

Configuration after update:
journal_max_write_entries=1
journal_queue_max_ops=5
journal_queue_max_bytes=1048576
journal_max_write_bytes=1073714824

Before changes:
dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f /mnt/ext4/output;
5120+0 records in
5120+0 records out
5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s

After changes:
dd if=/dev/zero of=/mnt/ceph-block-device/output bs=1000k count=5k; rm -f /mnt/ceph-block-device/output;
5120+0 records in
5120+0 records out
5242880000 bytes (5.2 GB) copied, 3.20913 s, 1.6 GB/s

I still need to validate that this is better for our workload. Thanks for your help.

On Mon, Mar 13, 2017 at 7:24 PM, Christian Balzer wrote: > > Hello, > > On Mon, 13 Mar 2017 11:25:15 -0400 Ben Erridge wrote: > > > On Sun, Mar 12, 2017 at 8:24 PM, Christian Balzer wrote: > > > > > > > > Hello, > > > > > > On Sun, 12 Mar 2017 19:37:16 -0400 Ben Erridge wrote: > > > > > > > I am testing attached volume storage on our openstack cluster which > uses > > > > ceph for block storage. > > > > our Ceph nodes have large SSD's for their journals 50+GB for each > OSD. > > > I'm > > > > thinking some parameter is a little off because with relatively small > > > > writes I am seeing drastically reduced write speeds. > > > > > > > Large journals are a waste for most people, especially when your > backing > > > storage are HDDs. > > > > > > > > > > > we have 2 nodes withs 12 total OSD's each with 50GB SSD Journal. > > > > > > > I hope that's not your plan for production, with a replica of 2 you're > > > looking at pretty much guaranteed data loss over time, unless your OSDs > > > are actually RAIDs. > > > > > > I am aware that replica of 3 is suggested thanks. > > > > > > > 5GB journals tend to be overkill already. 
> > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008606.html > > > > > > If you were to actually look at your OSD nodes during those tests with > > > something like atop or "iostat -x", you'd likely see that with > prolonged > > > writes you wind up with the speed of what your HDDs can do, i.e. see > them > > > (all or individually) being quite busy. > > > > > > > That is what I was thinking as well which is not what I want. I want to > > better utilize these large SSD journals. If I have 50GB journal > > and I only want to write 5GB of data I should be able to get near SSD > speed > > for this operation. Why am I not? > See the thread above and > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-June/010754.html > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038669.html > > > Maybe I should increase > > *filestore_max_sync_interval.* > > > That is your least worry, even though it seems to be the first parameter > to change. > Use your google foo to find some really old threads about this. > > The journal* parameters are what you want to look at, see the threads > above. And AFAIK Ceph will flush the journal at 50% full, no matter what. > > And at the end you will likely find that using your 50GB journals in full > will be difficult and doing so w/o getting a very uneven performance > nearly impossible. > > Christian > > > > > > > > Lastly, for nearly everybody in real life situations the > > > bandwidth/throughput becomes a distant second to latency > considerations. > > > > > > > Thanks for the advice however. 
> > > > > > > Christian > > > > > > > > > > > here is our Ceph config > > > > > > > > [global] > > > > fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d > > > > mon_initial_members = node-5 node-4 node-3 > > > > mon_host = 192.168.0.8 192.168.0.7 192.168.0.13 > > > > auth_cluster_required = cephx > > > > auth_service_required = cephx > > > > auth_client_required = cephx > > > > filestore_xattr_use_omap = true > > > > log_to_syslog_level = info > > > > log_to_syslog = True > > > > osd_pool_default_size = 1 > > > > osd_pool_default_min_size = 1 > > > > osd_pool_default_pg_num = 64 > > > > public_network = 192.168.0.0/24 > > > > log_to_syslog_facility = LOG_LOCAL0 > > > > osd_journal_size = 5 > > > > auth_supported = cephx > > > > osd_pool_default_pgp_num = 64 > > > > osd_mkfs_type = xfs > > > > cluster_network = 192.168.1.0/24 > > > > osd_recovery_max_active = 1 > > > > osd_max_backfills = 1 > > > > > > > > [client] > > > > rbd_cache = True > > > > rbd_cache_writethrough_until_flush = True > > > > > > > > [client.radosgw.gateway] > > > > rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator > > > > keyring = /etc/ceph/keyring.radosgw.gateway > > > > rgw_socket_path = /tmp/radosgw.sock > > > > rgw_keystone_revocation_interval = 100 > > > > rgw_keystone_url = 192.168.0.2:35357 > > > > rgw_keystone_admin_token = ZBz37Vlv > > > > host = node-3 > > > > rgw_d
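[Editor's note] The raw byte values in the journal settings above are easier to sanity-check when mapped back to MiB. This is a quick arithmetic sketch, not a Ceph command; note that the posted journal_max_write_bytes of 1073714824 is close to, but not exactly, 1 GiB, so it may be a transcription slip in the original mail.

```shell
# Map MiB to the byte values used in the ceph.conf journal settings above.
# Pure POSIX shell arithmetic; runs anywhere, no Ceph needed.
mib_to_bytes() { echo $(( $1 * 1024 * 1024 )); }

mib_to_bytes 32     # 33554432   -> old journal_queue_max_bytes (32 MiB)
mib_to_bytes 10     # 10485760   -> old journal_max_write_bytes (10 MiB)
mib_to_bytes 1      # 1048576    -> new journal_queue_max_bytes (1 MiB)
mib_to_bytes 1024   # 1073741824 -> exactly 1 GiB; the posted 1073714824
                    #    differs slightly from this value
```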
Re: [ceph-users] speed decrease with size
Hello, On Mon, 13 Mar 2017 11:25:15 -0400 Ben Erridge wrote: > On Sun, Mar 12, 2017 at 8:24 PM, Christian Balzer wrote: > > > > > Hello, > > > > On Sun, 12 Mar 2017 19:37:16 -0400 Ben Erridge wrote: > > > > > I am testing attached volume storage on our openstack cluster which uses > > > ceph for block storage. > > > our Ceph nodes have large SSD's for their journals 50+GB for each OSD. > > I'm > > > thinking some parameter is a little off because with relatively small > > > writes I am seeing drastically reduced write speeds. > > > > > Large journals are a waste for most people, especially when your backing > > storage are HDDs. > > > > > > > > we have 2 nodes withs 12 total OSD's each with 50GB SSD Journal. > > > > > I hope that's not your plan for production, with a replica of 2 you're > > looking at pretty much guaranteed data loss over time, unless your OSDs > > are actually RAIDs. > > > > I am aware that replica of 3 is suggested thanks. > > > > 5GB journals tend to be overkill already. > > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008606.html > > > > If you were to actually look at your OSD nodes during those tests with > > something like atop or "iostat -x", you'd likely see that with prolonged > > writes you wind up with the speed of what your HDDs can do, i.e. see them > > (all or individually) being quite busy. > > > > That is what I was thinking as well which is not what I want. I want to > better utilize these large SSD journals. If I have 50GB journal > and I only want to write 5GB of data I should be able to get near SSD speed > for this operation. Why am I not? See the thread above and http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-June/010754.html http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-April/038669.html > Maybe I should increase > *filestore_max_sync_interval.* > That is your least worry, even though it seems to be the first parameter to change. 
Use your google foo to find some really old threads about this. The journal* parameters are what you want to look at, see the threads above. And AFAIK Ceph will flush the journal at 50% full, no matter what. And at the end you will likely find that using your 50GB journals in full will be difficult and doing so w/o getting a very uneven performance nearly impossible. Christian > > > > > Lastly, for nearly everybody in real life situations the > > bandwidth/throughput becomes a distant second to latency considerations. > > > > Thanks for the advice however. > > > > Christian > > > > > > > > here is our Ceph config > > > > > > [global] > > > fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d > > > mon_initial_members = node-5 node-4 node-3 > > > mon_host = 192.168.0.8 192.168.0.7 192.168.0.13 > > > auth_cluster_required = cephx > > > auth_service_required = cephx > > > auth_client_required = cephx > > > filestore_xattr_use_omap = true > > > log_to_syslog_level = info > > > log_to_syslog = True > > > osd_pool_default_size = 1 > > > osd_pool_default_min_size = 1 > > > osd_pool_default_pg_num = 64 > > > public_network = 192.168.0.0/24 > > > log_to_syslog_facility = LOG_LOCAL0 > > > osd_journal_size = 5 > > > auth_supported = cephx > > > osd_pool_default_pgp_num = 64 > > > osd_mkfs_type = xfs > > > cluster_network = 192.168.1.0/24 > > > osd_recovery_max_active = 1 > > > osd_max_backfills = 1 > > > > > > [client] > > > rbd_cache = True > > > rbd_cache_writethrough_until_flush = True > > > > > > [client.radosgw.gateway] > > > rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator > > > keyring = /etc/ceph/keyring.radosgw.gateway > > > rgw_socket_path = /tmp/radosgw.sock > > > rgw_keystone_revocation_interval = 100 > > > rgw_keystone_url = 192.168.0.2:35357 > > > rgw_keystone_admin_token = ZBz37Vlv > > > host = node-3 > > > rgw_dns_name = *.ciminc.com > > > rgw_print_continue = True > > > rgw_keystone_token_cache_size = 10 > > > rgw_data = /var/lib/ceph/radosgw > > > 
user = www-data > > > > > > This is the degradation I am speaking of.. > > > > > > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k; rm -f > > > /mnt/ext4/output; > > > 1024+0 records in > > > 1024+0 records out > > > 1048576000 bytes (1.0 GB) copied, 0.887431 s, 1.2 GB/s > > > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=2k; rm -f > > > /mnt/ext4/output; > > > 2048+0 records in > > > 2048+0 records out > > > 2097152000 bytes (2.1 GB) copied, 3.75782 s, 558 MB/s > > > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=3k; rm -f > > > /mnt/ext4/output; > > > 3072+0 records in > > > 3072+0 records out > > > 3145728000 bytes (3.1 GB) copied, 10.0054 s, 314 MB/s > > > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f > > > /mnt/ext4/output; > > > 5120+0 records in > > > 5120+0 records out > > > 5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s > > > > > > Any suggestions for improving the large write degradation? > > > > > > -- > > Christian Bal
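[Editor's note] The point above about the journal flushing at roughly 50% full can be put into rough numbers. The sketch below uses illustrative throughput figures (500 MB/s SSD ingest, 150 MB/s HDD drain), not measurements from this cluster; it only shows why even a 50GB journal buys a bounded burst window before writes fall to HDD speed.

```shell
# Back-of-envelope: burst headroom of a 50GB SSD journal that starts
# flushing around 50% full. The SSD/HDD rates are assumed, not measured.
journal_mb=$(( 50 * 1024 ))       # 50 GB journal, expressed in MiB
usable_mb=$(( journal_mb / 2 ))   # flush kicks in around 50% full
ssd_mbs=500                       # assumed journal (SSD) ingest rate, MB/s
hdd_mbs=150                       # assumed backing (HDD) drain rate, MB/s
fill_rate=$(( ssd_mbs - hdd_mbs ))          # net fill rate during a burst
secs=$(( usable_mb / fill_rate ))
echo "~${usable_mb} MB of burst headroom, ~${secs}s until HDD-bound"
```

With these assumed rates the journal soaks up about 73 seconds of burst; after that, sustained throughput is whatever the HDDs can do.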
Re: [ceph-users] speed decrease with size
On Sun, Mar 12, 2017 at 8:24 PM, Christian Balzer wrote: > > Hello, > > On Sun, 12 Mar 2017 19:37:16 -0400 Ben Erridge wrote: > > > I am testing attached volume storage on our openstack cluster which uses > > ceph for block storage. > > our Ceph nodes have large SSD's for their journals 50+GB for each OSD. > I'm > > thinking some parameter is a little off because with relatively small > > writes I am seeing drastically reduced write speeds. > > > Large journals are a waste for most people, especially when your backing > storage are HDDs. > > > > > we have 2 nodes withs 12 total OSD's each with 50GB SSD Journal. > > > I hope that's not your plan for production, with a replica of 2 you're > looking at pretty much guaranteed data loss over time, unless your OSDs > are actually RAIDs. > > I am aware that replica of 3 is suggested thanks. > 5GB journals tend to be overkill already. > http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008606.html > > If you were to actually look at your OSD nodes during those tests with > something like atop or "iostat -x", you'd likely see that with prolonged > writes you wind up with the speed of what your HDDs can do, i.e. see them > (all or individually) being quite busy. > That is what I was thinking as well which is not what I want. I want to better utilize these large SSD journals. If I have 50GB journal and I only want to write 5GB of data I should be able to get near SSD speed for this operation. Why am I not? Maybe I should increase *filestore_max_sync_interval.* > > Lastly, for nearly everybody in real life situations the > bandwidth/throughput becomes a distant second to latency considerations. > Thanks for the advice however. 
> Christian > > > > > here is our Ceph config > > > > [global] > > fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d > > mon_initial_members = node-5 node-4 node-3 > > mon_host = 192.168.0.8 192.168.0.7 192.168.0.13 > > auth_cluster_required = cephx > > auth_service_required = cephx > > auth_client_required = cephx > > filestore_xattr_use_omap = true > > log_to_syslog_level = info > > log_to_syslog = True > > osd_pool_default_size = 1 > > osd_pool_default_min_size = 1 > > osd_pool_default_pg_num = 64 > > public_network = 192.168.0.0/24 > > log_to_syslog_facility = LOG_LOCAL0 > > osd_journal_size = 5 > > auth_supported = cephx > > osd_pool_default_pgp_num = 64 > > osd_mkfs_type = xfs > > cluster_network = 192.168.1.0/24 > > osd_recovery_max_active = 1 > > osd_max_backfills = 1 > > > > [client] > > rbd_cache = True > > rbd_cache_writethrough_until_flush = True > > > > [client.radosgw.gateway] > > rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator > > keyring = /etc/ceph/keyring.radosgw.gateway > > rgw_socket_path = /tmp/radosgw.sock > > rgw_keystone_revocation_interval = 100 > > rgw_keystone_url = 192.168.0.2:35357 > > rgw_keystone_admin_token = ZBz37Vlv > > host = node-3 > > rgw_dns_name = *.ciminc.com > > rgw_print_continue = True > > rgw_keystone_token_cache_size = 10 > > rgw_data = /var/lib/ceph/radosgw > > user = www-data > > > > This is the degradation I am speaking of.. 
> > > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k; rm -f > > /mnt/ext4/output; > > 1024+0 records in > > 1024+0 records out > > 1048576000 bytes (1.0 GB) copied, 0.887431 s, 1.2 GB/s > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=2k; rm -f > > /mnt/ext4/output; > > 2048+0 records in > > 2048+0 records out > > 2097152000 bytes (2.1 GB) copied, 3.75782 s, 558 MB/s > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=3k; rm -f > > /mnt/ext4/output; > > 3072+0 records in > > 3072+0 records out > > 3145728000 bytes (3.1 GB) copied, 10.0054 s, 314 MB/s > > > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f > > /mnt/ext4/output; > > 5120+0 records in > > 5120+0 records out > > 5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s > > > > Any suggestions for improving the large write degradation? > > > -- > Christian Balzer, Network/Systems Engineer > ch...@gol.com Global OnLine Japan/Rakuten Communications > http://www.gol.com/ > -- Ben Erridge Center For Information Management, Inc. (734) 930-0855 3550 West Liberty Road Ste 1 Ann Arbor, MI 48103 ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] speed decrease with size
Hello, On Sun, 12 Mar 2017 19:37:16 -0400 Ben Erridge wrote: > I am testing attached volume storage on our openstack cluster which uses > ceph for block storage. > our Ceph nodes have large SSD's for their journals 50+GB for each OSD. I'm > thinking some parameter is a little off because with relatively small > writes I am seeing drastically reduced write speeds. > Large journals are a waste for most people, especially when your backing storage are HDDs. > > we have 2 nodes withs 12 total OSD's each with 50GB SSD Journal. > I hope that's not your plan for production, with a replica of 2 you're looking at pretty much guaranteed data loss over time, unless your OSDs are actually RAIDs. 5GB journals tend to be overkill already. http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-March/008606.html If you were to actually look at your OSD nodes during those tests with something like atop or "iostat -x", you'd likely see that with prolonged writes you wind up with the speed of what your HDDs can do, i.e. see them (all or individually) being quite busy. Lastly, for nearly everybody in real life situations the bandwidth/throughput becomes a distant second to latency considerations. 
Christian > > here is our Ceph config > > [global] > fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d > mon_initial_members = node-5 node-4 node-3 > mon_host = 192.168.0.8 192.168.0.7 192.168.0.13 > auth_cluster_required = cephx > auth_service_required = cephx > auth_client_required = cephx > filestore_xattr_use_omap = true > log_to_syslog_level = info > log_to_syslog = True > osd_pool_default_size = 1 > osd_pool_default_min_size = 1 > osd_pool_default_pg_num = 64 > public_network = 192.168.0.0/24 > log_to_syslog_facility = LOG_LOCAL0 > osd_journal_size = 5 > auth_supported = cephx > osd_pool_default_pgp_num = 64 > osd_mkfs_type = xfs > cluster_network = 192.168.1.0/24 > osd_recovery_max_active = 1 > osd_max_backfills = 1 > > [client] > rbd_cache = True > rbd_cache_writethrough_until_flush = True > > [client.radosgw.gateway] > rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator > keyring = /etc/ceph/keyring.radosgw.gateway > rgw_socket_path = /tmp/radosgw.sock > rgw_keystone_revocation_interval = 100 > rgw_keystone_url = 192.168.0.2:35357 > rgw_keystone_admin_token = ZBz37Vlv > host = node-3 > rgw_dns_name = *.ciminc.com > rgw_print_continue = True > rgw_keystone_token_cache_size = 10 > rgw_data = /var/lib/ceph/radosgw > user = www-data > > This is the degradation I am speaking of.. 
> > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k; rm -f > /mnt/ext4/output; > 1024+0 records in > 1024+0 records out > 1048576000 bytes (1.0 GB) copied, 0.887431 s, 1.2 GB/s > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=2k; rm -f > /mnt/ext4/output; > 2048+0 records in > 2048+0 records out > 2097152000 bytes (2.1 GB) copied, 3.75782 s, 558 MB/s > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=3k; rm -f > /mnt/ext4/output; > 3072+0 records in > 3072+0 records out > 3145728000 bytes (3.1 GB) copied, 10.0054 s, 314 MB/s > > dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f > /mnt/ext4/output; > 5120+0 records in > 5120+0 records out > 5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s > > Any suggestions for improving the large write degradation? -- Christian Balzer, Network/Systems Engineer ch...@gol.com Global OnLine Japan/Rakuten Communications http://www.gol.com/
[ceph-users] speed decrease with size
I am testing attached volume storage on our openstack cluster which uses ceph for block storage. Our Ceph nodes have large SSDs for their journals, 50+GB for each OSD. I'm thinking some parameter is a little off because with relatively small writes I am seeing drastically reduced write speeds.

We have 2 nodes with 12 total OSDs, each with a 50GB SSD journal.

Here is our Ceph config:

[global]
fsid = 19bc15fd-c0cc-4f35-acd2-292a86fbcf7d
mon_initial_members = node-5 node-4 node-3
mon_host = 192.168.0.8 192.168.0.7 192.168.0.13
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
log_to_syslog_level = info
log_to_syslog = True
osd_pool_default_size = 1
osd_pool_default_min_size = 1
osd_pool_default_pg_num = 64
public_network = 192.168.0.0/24
log_to_syslog_facility = LOG_LOCAL0
osd_journal_size = 5
auth_supported = cephx
osd_pool_default_pgp_num = 64
osd_mkfs_type = xfs
cluster_network = 192.168.1.0/24
osd_recovery_max_active = 1
osd_max_backfills = 1

[client]
rbd_cache = True
rbd_cache_writethrough_until_flush = True

[client.radosgw.gateway]
rgw_keystone_accepted_roles = _member_, Member, admin, swiftoperator
keyring = /etc/ceph/keyring.radosgw.gateway
rgw_socket_path = /tmp/radosgw.sock
rgw_keystone_revocation_interval = 100
rgw_keystone_url = 192.168.0.2:35357
rgw_keystone_admin_token = ZBz37Vlv
host = node-3
rgw_dns_name = *.ciminc.com
rgw_print_continue = True
rgw_keystone_token_cache_size = 10
rgw_data = /var/lib/ceph/radosgw
user = www-data

This is the degradation I am speaking of:
dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=1k; rm -f /mnt/ext4/output;
1024+0 records in
1024+0 records out
1048576000 bytes (1.0 GB) copied, 0.887431 s, 1.2 GB/s

dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=2k; rm -f /mnt/ext4/output;
2048+0 records in
2048+0 records out
2097152000 bytes (2.1 GB) copied, 3.75782 s, 558 MB/s

dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=3k; rm -f /mnt/ext4/output;
3072+0 records in
3072+0 records out
3145728000 bytes (3.1 GB) copied, 10.0054 s, 314 MB/s

dd if=/dev/zero of=/mnt/ext4/output bs=1000k count=5k; rm -f /mnt/ext4/output;
5120+0 records in
5120+0 records out
5242880000 bytes (5.2 GB) copied, 24.1971 s, 217 MB/s

Any suggestions for improving the large write degradation?
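[Editor's note] One caveat on dd benchmarks like the ones above: without a sync or direct-I/O flag, short runs largely measure the client page cache, which is part of why the smallest write looks so fast and the speed "decreases with size" as the cache fills. A variant that forces data to stable storage gives more comparable numbers; the mount point below is the same placeholder path used in the original test.

```shell
# dd as run above buffers through the page cache; these variants measure
# what the storage actually sustains. Paths match the original test mount.
# Flush all data to disk before dd reports its timing:
dd if=/dev/zero of=/mnt/ext4/output bs=1M count=1024 conv=fdatasync
# Or bypass the page cache entirely with direct I/O:
dd if=/dev/zero of=/mnt/ext4/output bs=1M count=1024 oflag=direct
rm -f /mnt/ext4/output
```

With conv=fdatasync, all four sizes in the test above should report much closer figures, making it easier to see whether journal tuning actually changed cluster throughput.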