Kernel 3.8 supports automatic NUMA balancing; maybe this helps.

On 22.04.2013 at 01:55, Mark Nelson <[email protected]> wrote:
> On 04/21/2013 06:18 PM, Malcolm Haak wrote:
>> Hi all,
>>
>> We switched to a, now free, Sandy Bridge based server.
>>
>> This has resolved our read issues. So something about the Quad AMD box
>> was very bad for reads...
>>
>> I've got numbers if people are interested.. but I would say that AMD is
>> not a great idea for OSD's.
>
> This is very good to know! It makes me nervous that the slower and
> not-fully-connected nature of the hypertransport interconnect on quad socket
> AMD setups is causing issues. With so many threads flying around potentially
> accessing remote memory and having to communicate with PCIE slots on remote
> IO hubs, it could be a recipe for disaster. Your findings may indicate this
> could be the case.
>
> With proper thread pinning and local disk and network controllers on each
> node, there is a chance that this could be dramatically improved. It'd be a
> lot of work to test it though.
>
>>
>> Thanks for all the pointers!
>>
>> Regards
>>
>> Malcolm Haak
>
> <snip>
>
>>> So.. we just started reading from the block device. And the numbers were
>>> well.. Faster than the QDR IB can do TCP/IP. So we figured local
>>> caching. So we dropped caches and ramped up to bigger than ram. (ram is
>>> 24GB) and it got faster. So we went to 3x ram.. and it was a bit slower..
>>>
>>> Oh also the whole time we were doing these tests, the back-end disk was
>>> seeing no I/O at all.. We were dropping caches on the OSD's as well, but
>>> even if it was caching at the OSD end, the IB link is only QDR and we
>>> aren't doing RDMA so. Yeah..No idea what is going on here...
>
> I've seen similar things with fio on a kernel rbd block device. We suspect
> that because the blocks are a non-standard size it's screwing up the numbers
> being reported. The issue wasn't apparent when tests were done against a
> file on a file system instead of directly against the block device.
>
>>>
>>>
>>> On 19/04/13 10:40, Mark Nelson wrote:
>>>> On 04/18/2013 07:27 PM, Malcolm Haak wrote:
>>>>> Morning all,
>>>>>
>>>>> Did the echos on all boxes involved... and the results are in..
>>>>>
>>>>> [root@dogbreath ~]#
>>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
>>>>> 10000+0 records in
>>>>> 10000+0 records out
>>>>> 41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s
>>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
>>>>> 10000+0 records in
>>>>> 10000+0 records out
>>>>> 41943040000 bytes (42 GB) copied, 316.025 s, 133 MB/s
>>>>> [root@dogbreath ~]#
>>>>
>>>> Boo!
>>>>
>>>>>
>>>>> No change which is a shame. What other information or testing should I
>>>>> start?
>>>>
>>>> Any chance you can try out a quick rados bench test from the client
>>>> against the pool for writes and reads and see how that works?
>>>>
>>>> rados -p <pool> bench 300 write --no-cleanup
>>>> rados -p <pool> bench 300 seq
>>>>
>>>>>
>>>>> Regards
>>>>>
>>>>> Malcolm Haak
>>>>>
>>>>> On 18/04/13 17:22, Malcolm Haak wrote:
>>>>>> Hi Mark!
>>>>>>
>>>>>> Thanks for the quick reply!
>>>>>>
>>>>>> I'll reply inline below.
>>>>>>
>>>>>> On 18/04/13 17:04, Mark Nelson wrote:
>>>>>>> On 04/17/2013 11:35 PM, Malcolm Haak wrote:
>>>>>>>> Hi all,
>>>>>>>
>>>>>>> Hi Malcolm!
>>>>>>>
>>>>>>>>
>>>>>>>> I jumped into the IRC channel yesterday and they said to email
>>>>>>>> ceph-devel. I have been having some read performance issues. With
>>>>>>>> Reads being slower than writes by a factor of ~5-8.
>>>>>>>
>>>>>>> I recently saw this kind of behaviour (writes were fine, but reads
>>>>>>> were terrible) on an IPoIB based cluster and it was caused by the
>>>>>>> same TCP auto tune issues that Jim Schutt saw last year. It's worth
>>>>>>> a try at least to see if it helps.
>>>>>>>
>>>>>>> echo "0" > /proc/sys/net/ipv4/tcp_moderate_rcvbuf
>>>>>>>
>>>>>>> on all of the clients and server nodes should be enough to test it
>>>>>>> out. Sage added an option in more recent Ceph builds that lets you
>>>>>>> work around it too.
>>>>>> Awesome I will test this first up tomorrow.
>>>>>>>>
>>>>>>>> First info:
>>>>>>>> Server
>>>>>>>> SLES 11 SP2
>>>>>>>> Ceph 0.56.4.
>>>>>>>> 12 OSD's that are Hardware RAID 5; each of the twelve is made from 5
>>>>>>>> NL-SAS disks for a total of 60 disks (Each lun can do around 320MB/s
>>>>>>>> stream write and the same if not better read) Connected via 2xQDR IB
>>>>>>>> OSD's/MDS and such all on same box (for testing)
>>>>>>>> Box is a Quad AMD Opteron 6234
>>>>>>>> RAM is 256GB
>>>>>>>> 10GB Journals
>>>>>>>> osd_op_threads: 8
>>>>>>>> osd_disk_threads: 2
>>>>>>>> filestore_op_threads: 4
>>>>>>>> OSD's are all XFS
>>>>>>>
>>>>>>> Interesting setup! QUAD socket Opteron boxes have somewhat slow and
>>>>>>> slightly oversubscribed hypertransport links don't they? I wonder
>>>>>>> if on a system with so many disks and QDR-IB if that could become a
>>>>>>> problem...
>>>>>>>
>>>>>>> We typically like smaller nodes where we can reasonably do 1 OSD per
>>>>>>> drive, but we've tested on a couple of 60 drive chassis in RAID
>>>>>>> configs too. Should be interesting to hear what kind of aggregate
>>>>>>> performance you can eventually get.
>>>>>>
>>>>>> We are also going to try this out with 6 luns on a dual xeon box. The
>>>>>> Opteron box was the biggest scariest thing we had that was doing
>>>>>> nothing.
>>>>>>
>>>>>>>
>>>>>>>>
>>>>>>>> All nodes are connected via QDR IB using IPoIB. We get 1.7GB/s on
>>>>>>>> TCP performance tests between the nodes.
>>>>>>>>
>>>>>>>> Clients: One is FC17, the other is Ubuntu 12.10; they only have
>>>>>>>> around 32GB-70GB ram.
>>>>>>>>
>>>>>>>> We ran into an odd issue where the OSD's would all start in the same
>>>>>>>> NUMA node and pretty much on the same processor core. We fixed that
>>>>>>>> up with some cpuset magic.
>>>>>>>
>>>>>>> Strange! Was that more due to cpuset or Ceph? I can't imagine
>>>>>>> that we are doing anything that would cause that.
>>>>>>
>>>>>> More than likely it is an odd quirk in the SLES kernel.. but when I
>>>>>> have time I'll do some more poking. We were seeing insane CPU usage
>>>>>> on some cores because all the OSD's were piled up in one place.
>>>>>>
>>>>>>>>
>>>>>>>> Performance testing we have done: (Note oflag=direct was yielding
>>>>>>>> results within 5% of cached results)
>>>>>>>>
>>>>>>>>
>>>>>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=3200
>>>>>>>> 3200+0 records in
>>>>>>>> 3200+0 records out
>>>>>>>> 33554432000 bytes (34 GB) copied, 47.6685 s, 704 MB/s
>>>>>>>> root@ty3:~#
>>>>>>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>>>>>>> root@ty3:~#
>>>>>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=4800
>>>>>>>> 4800+0 records in
>>>>>>>> 4800+0 records out
>>>>>>>> 50331648000 bytes (50 GB) copied, 69.5527 s, 724 MB/s
>>>>>>>>
>>>>>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=2400
>>>>>>>> 2400+0 records in
>>>>>>>> 2400+0 records out
>>>>>>>> 25165824000 bytes (25 GB) copied, 26.3593 s, 955 MB/s
>>>>>>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>>>>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=9600
>>>>>>>> 9600+0 records in
>>>>>>>> 9600+0 records out
>>>>>>>> 100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s
>>>>>>>>
>>>>>>>> Both clients each doing a 140GB write (2x dogbreath's RAM) at the
>>>>>>>> same time to two different rbds in the same pool.
>>>>>>>>
>>>>>>>> root@ty3:~# rm /test-rbd-fs/DELETEME
>>>>>>>> root@ty3:~# dd if=/dev/zero of=/test-rbd-fs/DELETEME bs=10M count=14000
>>>>>>>> 14000+0 records in
>>>>>>>> 14000+0 records out
>>>>>>>> 146800640000 bytes (147 GB) copied, 412.404 s, 356 MB/s
>>>>>>>> root@ty3:~#
>>>>>>>>
>>>>>>>> [root@dogbreath ~]# rm -f /test-rbd-fs/DELETEME
>>>>>>>> [root@dogbreath ~]# dd of=/test-rbd-fs/DELETEME if=/dev/zero bs=10M count=14000
>>>>>>>> 14000+0 records in
>>>>>>>> 14000+0 records out
>>>>>>>> 146800640000 bytes (147 GB) copied, 433.351 s, 339 MB/s
>>>>>>>> [root@dogbreath ~]#
>>>>>>>>
>>>>>>>> Onto reads...
>>>>>>>> Also we found that doing iflag=direct increased read performance.
>>>>>>>>
>>>>>>>> [root@dogbreath ~]# dd of=/dev/null if=/test-rbd-fs/DELETEME bs=10M count=160
>>>>>>>> 160+0 records in
>>>>>>>> 160+0 records out
>>>>>>>> 1677721600 bytes (1.7 GB) copied, 29.4242 s, 57.0 MB/s
>>>>>>>> [root@dogbreath ~]#
>>>>>>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>>>>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000
>>>>>>>> 10000+0 records in
>>>>>>>> 10000+0 records out
>>>>>>>> 41943040000 bytes (42 GB) copied, 382.334 s, 110 MB/s
>>>>>>>> [root@dogbreath ~]#
>>>>>>>> [root@dogbreath ~]# echo 1 > /proc/sys/vm/drop_caches
>>>>>>>> [root@dogbreath ~]# dd if=/test-rbd-fs/DELETEME of=/dev/null bs=4M count=10000 iflag=direct
>>>>>>>> 10000+0 records in
>>>>>>>> 10000+0 records out
>>>>>>>> 41943040000 bytes (42 GB) copied, 150.774 s, 278 MB/s
>>>>>>>> [root@dogbreath ~]#
>>>>>>>>
>>>>>>>>
>>>>>>>> So what info do you want/where do I start hunting for my wumpus?
>>>>>>>
>>>>>>> might also be worth looking at the size of the reads to see if
>>>>>>> there's a lot of fragmentation. Also, is this kernel rbd or qemu-kvm?
>>>>>>
>>>>>> Thing that got us was the back-end storage was showing very low read
>>>>>> rates.
>>>>>> Whereas when writing we could see almost a 2x write rate back to
>>>>>> physical disk (we assume that is Journal+data as the 2x is not from
>>>>>> the word go but ramps up around the 3-5 second mark)
>>>>>>
>>>>>> It is kernel rbd at the moment, we will be testing qemu-kvm after
>>>>>> things make sense.
>>>>>>
>>>>>>>>
>>>>>>>> Regards
>>>>>>>>
>>>>>>>> Malcolm Haak
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>>>>>>>> the body of a message to [email protected]
>>>>>>>> More majordomo info at http://vger.kernel.org/majordomo-info.html
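
P.S. For anyone who wants to re-check the dd figures quoted in this thread, here is a minimal Python sketch that recomputes throughput from the byte count and elapsed time in a dd summary line. The helper name dd_throughput_mb_s and the regular expression are my own illustration and not part of dd or Ceph. It assumes the GNU dd summary format shown in the quoted output, where 1 MB is 10**6 bytes. The write and read lines come from different test runs, so the ratio is only a rough indication.

```python
import re

# Matches the summary line GNU dd prints, for example:
# "41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s"
DD_SUMMARY = re.compile(r"(\d+) bytes .*copied, ([\d.]+) s")


def dd_throughput_mb_s(summary: str) -> float:
    """Return throughput in MB/s (1 MB = 10**6 bytes, as dd reports it)."""
    match = DD_SUMMARY.search(summary)
    if match is None:
        raise ValueError(f"not a dd summary line: {summary!r}")
    nbytes = int(match.group(1))
    seconds = float(match.group(2))
    return nbytes / seconds / 1e6


# Summary lines copied from the dogbreath runs quoted in this thread.
direct_read = dd_throughput_mb_s(
    "41943040000 bytes (42 GB) copied, 144.083 s, 291 MB/s")
buffered_read = dd_throughput_mb_s(
    "41943040000 bytes (42 GB) copied, 316.025 s, 133 MB/s")
write = dd_throughput_mb_s(
    "100663296000 bytes (101 GB) copied, 145.212 s, 693 MB/s")

print(f"direct read:   {direct_read:.0f} MB/s")
print(f"buffered read: {buffered_read:.0f} MB/s")
print(f"write / buffered read: {write / buffered_read:.1f}x")
```

Feeding it other summary lines from the thread gives a quick way to compare runs without trusting the rounded MB/s column.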
