Michael,

I do not know the process for setting this up in a multipathing
configuration, but the scheduler to test is the noop scheduler.
Please let us know what it yields.

Regards,
Chris

-----Original Message-----
From: [email protected] [mailto:[email protected]] On Behalf Of Michael Lackner
Sent: Wednesday, 16 June 2010 17:50
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Chris,

Can do. Which one shall I try? I got these four to choose from:

* noop
* anticipatory
* deadline
* cfq

One more thing: because of the Fibrechannel storage I am using multipathing,
and I cannot set the scheduler for the multipath device (/dev/dm-0), because
"/sys/block/dm-0/queue/scheduler" doesn't exist. I actually have four paths
to the storage that I can see as "/dev/sda", "/dev/sdb", "/dev/sdc" and
"/dev/sdd". I guess it's ok if I change the scheduler for those four?

Is it ok to just run a command similar to the one below, and will this
change the scheduler on the fly?

"echo noop > /sys/block/sd*/queue/scheduler"

Because at the moment, the scheduler files for each block device contain
this line:

"noop anticipatory deadline [cfq]"

Maybe I would have to do something like
"echo [noop] anticipatory deadline cfq > /sys/block/sd*/queue/scheduler"
instead?

Thanks for the help.

Jankowski, Chris wrote:
> Michael,
>
> Would you be willing to repeat the tests with a large block size and a
> different IO scheduler? Specifically, there is a scheduler that actually
> is a null scheduler.
>
> I think that I saw cases where the cfq IO scheduler was not working all
> that great on single streams.
>
> Thanks and regards,
>
> Chris
>
> -----Original Message-----
> From: [email protected]
> [mailto:[email protected]] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 22:04
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>
> Hello!
>
> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and
> the difference in performance was negligible.
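[Editor's note on the scheduler question above: with several matching devices,
a single shell redirection like "> /sys/block/sd*/queue/scheduler" will not
fan out to all of them, so a loop over the path devices is needed. Only the
bare scheduler name is written; the kernel adds the [brackets] itself when
the file is read back, and the change takes effect on the fly. A minimal
sketch, with the sysfs root made a parameter purely so the helper can be
tried against a scratch directory; on a real node it would be /sys:]

```shell
# set_sched SYSFS_ROOT SCHED DEV...
# Switches the I/O scheduler on each named block device by writing
# the scheduler name into its sysfs queue/scheduler file.
set_sched() {
    root=$1; sched=$2; shift 2
    for dev in "$@"; do
        f="$root/block/$dev/queue/scheduler"
        # Writing just the name switches the elevator immediately.
        [ -w "$f" ] && echo "$sched" > "$f"
    done
}
# On a real node (as root):
# set_sched /sys noop sda sdb sdc sdd
# cat /sys/block/sd*/queue/scheduler   # new scheduler shown in [brackets]
```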
> Also, GFS2 was almost on the same speed level as GFS1 for reads (see below
> why). The I/O scheduler is "cfq", by the way. I never really cared about
> the I/O scheduler, since I do not yet understand the differences between
> the available ones anyway.
>
> But I found out something else. As suggested by Steven in his reply, I ran
> tests both on the GFS1/2 filesystems and on the raw block device, and
> surprisingly the results were almost the same!
>
> So: GFS1 as well as GFS2 3-node concurrent, sequential reads showed a
> total of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For
> single-node sequential reads the performance went up to a nice 180-190MB/s
> for both FS versions.
>
> Now, the surprising part: doing a dd read on the raw block device with 3
> nodes showed a total of only ~60MB/s! Almost as low as reading from GFS1/2
> with multiple nodes at the same time! When reading the raw block device on
> a single node, I got slightly over 190MB/s again.
>
> So this concurrent read issue seems not to be a GFS1 or GFS2 problem, but
> rather a problem of the underlying storage. This is extremely surprising
> and a bit shocking, I must say.
>
> I guess for the reads I will need to check the SAN itself and see if I can
> do any optimization on it. That thing can't possibly be that bad when it
> comes to reading.
>
> Thanks a lot for your ideas so far!
>
> Jankowski, Chris wrote:
>> Michael,
>>
>> For comparison, could you do your dd(1) tests with a very large block
>> size (1 MB) and tell us the results, please?
>>
>> I have a vague hunch that the problem may have something to do with
>> coalescing (or not) of IO operations.
>>
>> Also, which IO scheduler are you using?
>> Thanks and regards,
>>
>> Chris Jankowski
>>
>> -----Original Message-----
>> From: [email protected]
>> [mailto:[email protected]] On Behalf Of Michael Lackner
>> Sent: Tuesday, 15 June 2010 00:22
>> To: linux clustering
>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems
>>
>> Hello!
>>
>> Thanks for your reply. I unfortunately forgot to mention HOW I was
>> actually testing, stupid.
>>
>> I tested with dd, doing 4kB blocksize reads and writes, 160GB total
>> testfile size per node. I read from /dev/zero for the writing tests and
>> wrote to /dev/null for the reading tests. So, totally sequential, with a
>> somewhat small blocksize (equal to the filesystem BS).
>>
>> The performance was measured directly on the Fibrechannel switch, which
>> offers nice per-port monitoring for that purpose.
>>
>> I have yet to do some serious read testing on GFS2. I aborted my GFS2
>> tests as write performance was not up to GFS1 to begin with. My older
>> GFS2 benchmarks (I did this with a 2-node configuration before) are
>> lost; I will need to re-do them to give you some numbers.
>>
>> After each write test I did a "sync" to flush everything to disk. I did
>> not do this before or after the read tests, though.
>>
>> As you mentioned journal size: "gfs_tool counters <mountpoint>" said
>> that only 2-3% of the logspace was in use after the tests (I guess this
>> is the per-node fs journal?).
>>
>> As for the direct I/O tests, by that you mean testing without ANY
>> caching going on, a synchronous write? What I did before was test EXT3
>> (~190MB/s) and XFS (~320MB/s) on the storage array. I think what I'm
>> getting here is raw throughput, since I am not monitoring in the OS, but
>> at the Fibrechannel switch itself.
>>
>> I will do GFS2 read tests similar to those conducted for GFS1. I'll be
>> able to do that tomorrow morning, then I can post the numbers here.
>>
>> Thanks!
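[Editor's note: the dd methodology described above, scaled down to a sketch
that can be tried anywhere. The real runs used a 160GB file per node on the
GFS mount; the path and size here are placeholders.]

```shell
# One node's sequential pass: write zeroes, flush, read back.
TESTFILE=/tmp/gfs-dd-testfile
# Write test: sequential 4kB writes from /dev/zero (4 MiB here,
# 160 GB in the real runs).
dd if=/dev/zero of="$TESTFILE" bs=4k count=1024 2>/dev/null
sync                              # flush to disk, as after each write test
# Read test: sequential 4kB reads into /dev/null.
dd if="$TESTFILE" of=/dev/null bs=4k 2>/dev/null
```

On a single node, dropping the page cache between the write and the read
(`echo 3 > /proc/sys/vm/drop_caches`, as root, available on these 2.6.18
kernels) avoids accidentally measuring cached reads.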
>> Steven Whitehouse wrote:
>>> Hi,
>>>
>>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>>> Hello!
>>>>
>>>> I am currently building a cluster sitting on CentOS 5 for GFS usage.
>>>>
>>>> At the moment, the storage subsystem consists of an HP MSA2312
>>>> Fibrechannel SAN linked to an FC 8gbit switch. Three client machines
>>>> are connected to that switch over 8gbit FC. The disks themselves are
>>>> 12 * 15,000rpm SAS drives configured in RAID-5 with two hotspares.
>>>>
>>>> Now, the whole storage shall be shared (single filesystem); here GFS
>>>> comes in.
>>>>
>>>> The cluster is only 3 nodes large at the moment; more nodes will be
>>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>>> Lock management is done over single 1Gbit Ethernet links (1 per
>>>> machine).
>>>>
>>>> Thing is, with GFS1 I get far better performance than with the newer
>>>> GFS2 across the board. With a few tunable parameters set, GFS1 is
>>>> roughly twice as fast for writes.
>>>
>>> What tests are you running? GFS2 is generally faster than GFS1 except
>>> for streaming writes, which is an area that we are currently putting
>>> some effort into solving. Small writes (one fs block (4k default) or
>>> less) on GFS2 are much faster than on GFS1.
>>>
>>>> But concurrent reads are totally abysmal. The total write performance
>>>> (all nodes combined) sits around 280-330Mbyte/sec, whereas the READ
>>>> performance is as low as 30-40Mbyte/sec when doing concurrent reads.
>>>> Surprisingly, single-node read is somewhat ok at 180Mbyte/sec, but as
>>>> soon as several nodes are reading from GFS (version 1 at the moment)
>>>> at the same time, things turn ugly.
>>>
>>> Reads on GFS2 should be much faster than GFS1, so it sounds as if
>>> something isn't working correctly for some reason.
>>> For cached data, reads on GFS2 should be as fast as ext2/3, since the
>>> code path is identical (to the page cache) and only changes if pages
>>> are not cached. GFS1 does its locking at a higher level, so there will
>>> be more overhead for cached reads in general.
>>>
>>> Do make sure that if you are preparing the test files for reading all
>>> from one node (or even just a different node to that on which you are
>>> running the read tests), you sync them to disk on that node before
>>> starting the tests, to avoid issues with caching.
>>>
>>>> This is strange, because for writes, global performance across the
>>>> cluster increases slightly when adding more nodes. But for reads, the
>>>> opposite seems to be true.
>>>>
>>>> For the read and write tests, separate testfiles were created and
>>>> read for each node, with each testfile sitting in its own
>>>> subdirectory, so no node would access another node's file.
>>>
>>> That sounds like a good test set up to me.
>>>
>>>> GFS1 created with the following mkfs.gfs parameters:
>>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>>> (4kB blocksize, 16 * 128MB journals, 2GB resource groups, Distributed
>>>> Lock Manager)
>>>>
>>>> Mount options set: "noatime,nodiratime,noquota"
>>>>
>>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1,
>>>> demote_secs 20"
>>>
>>> You shouldn't normally need to set glock_purge and demote_secs to
>>> anything other than the default. These settings no longer exist in
>>> GFS2, since it makes use of the shrinker subsystem provided by the VM
>>> and is auto-tuning. If your workload is metadata heavy, you could try
>>> boosting the journal size and/or the incore_log_blocks tunable.
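[Editor's note: the mkfs parameters, mount options, and tunables quoted
above, assembled into runnable commands. The cluster:fsname table argument
and the /gfs mount point are illustrative, since the thread does not give
them; /dev/dm-0 is the multipath device mentioned earlier. gfs_tool settune
values apply per mount and do not survive a remount, so they would belong in
an init script.]

```shell
# Create the GFS1 filesystem with the thread's parameters
# (-t mycluster:gfs01 is a placeholder lock table name).
mkfs.gfs -p lock_dlm -t mycluster:gfs01 -b 4096 -J 128 -j 16 -r 2048 /dev/dm-0

# Mount with the stated options.
mount -t gfs -o noatime,nodiratime,noquota /dev/dm-0 /gfs

# Apply the tunables to the live mount.
gfs_tool settune /gfs glock_purge 50
gfs_tool settune /gfs statfs_slots 128
gfs_tool settune /gfs statfs_fast 1
gfs_tool settune /gfs demote_secs 20
```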
>>>> Also, in /etc/cluster/cluster.conf, I added this:
>>>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>>>> <gfs_controld plock_rate_limit="0"/>
>>>>
>>>> Any ideas on how to figure out what's going wrong, and how to tune
>>>> GFS1 for better concurrent read performance, or tune GFS2 in general
>>>> to be competitive with, or better than, GFS1?
>>>>
>>>> I'm dreaming about 300MB/sec read and 300MB/sec write sequentially,
>>>> and somewhat good reaction times while under heavy sequential and/or
>>>> random load. But for now, I just wanna get the seq reading to work
>>>> acceptably fast.
>>>>
>>>> Thanks a lot for your help!
>>>
>>> Can you try doing some I/O direct to the block device, so that we can
>>> get an idea of what the raw device can manage? Use dd for both read
>>> and write, across the nodes (different disk locations on each node to
>>> simulate different files).
>>>
>>> I'm wondering if the problem might be due to the seek pattern
>>> generated by the multiple read locations.
>>>
>>> Steve.

--
Michael Lackner
Chair of Information Technology, University of Leoben
IT Administration
[email protected] | +43 (0)3842/402-1505

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
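[Editor's note: the raw-device test Steve suggests, where each node reads a
different region of the shared LUN, could be sketched like this. The device
path, region sizes, and node numbering are illustrative, not from the
thread.]

```shell
# raw_read DEVICE NODE_ID BLOCKS [OUT]
# Reads BLOCKS MiB starting at an offset of NODE_ID * BLOCKS MiB,
# so concurrent nodes stream from disjoint regions of the device.
raw_read() {
    dev=$1; node=$2; blocks=$3; out=${4:-/dev/null}
    dd if="$dev" of="$out" bs=1M count="$blocks" \
       skip=$((node * blocks)) 2>/dev/null
}
# On node 0:  raw_read /dev/dm-0 0 10240   # first 10 GiB
# On node 1:  raw_read /dev/dm-0 1 10240   # next 10 GiB, and so on
```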
