Michael,

I do not know the process for setting this up in a multipathing configuration, 
but the scheduler to test is the noop scheduler.
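
For the individual path devices, something like the following might work. It 
is only a minimal sketch and untested on a multipath setup. The function names 
and the SYSFS_ROOT variable are made up for illustration; SYSFS_ROOT defaults 
to /sys and exists only so the loop can be tried on a scratch directory first. 
Only the bare scheduler name is written; the brackets in the file just mark 
the active scheduler. A single redirect to "/sys/block/sd*/queue/scheduler" 
would not work in bash, because the glob expands to several files and the 
redirect is rejected as ambiguous, hence the loop.

```shell
#!/bin/sh
# Sketch: switch a list of path devices to one I/O scheduler.
# SYSFS_ROOT is a hypothetical override (defaults to /sys) so the loop
# can be tried against a scratch directory before touching real devices.
SYSFS_ROOT=${SYSFS_ROOT:-/sys}

# set_scheduler <name> <dev>...   (illustrative helper, not an existing tool)
set_scheduler() {
    sched=$1
    shift
    for dev in "$@"; do
        f="$SYSFS_ROOT/block/$dev/queue/scheduler"
        if [ -w "$f" ]; then
            # Write only the bare name; the kernel marks the active one itself.
            echo "$sched" > "$f"
        else
            echo "skipping $dev: $f not writable" >&2
        fi
    done
}

# active_scheduler <dev>   (illustrative helper)
# Prints the bracketed entry, e.g. "cfq" for "noop anticipatory deadline [cfq]".
active_scheduler() {
    sed -n 's/.*\[\(.*\)\].*/\1/p' "$SYSFS_ROOT/block/$1/queue/scheduler"
}

# Example (as root): set_scheduler noop sda sdb sdc sdd
```

Run as root, "set_scheduler noop sda sdb sdc sdd" followed by 
"active_scheduler sda" should then report the new scheduler. The change 
applies on the fly, but it is not kept across a reboot.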

Please let us know what it yields.

Regards,

Chris 

-----Original Message-----
From: [email protected] 
[mailto:[email protected]] On Behalf Of Michael Lackner
Sent: Wednesday, 16 June 2010 17:50
To: linux clustering
Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance problems

Chris,

Can do. Which one shall I try? I got these four to choose from:

* noop
* anticipatory
* deadline
* cfq

One more thing: because of the Fibrechannel Storage, I am using multipathing, 
and I cannot set the scheduler for the multipath device (/dev/dm-0) because 
"/sys/block/dm-0/queue/scheduler" doesn't exist. I actually have four paths to 
the storage, which I can see as "/dev/sda", "/dev/sdb", "/dev/sdc" and 
"/dev/sdd".

I guess it's ok if I change the scheduler for those four? Is it ok to just run 
a command similar to the one below, and will this change the scheduler on the 
fly?

"echo noop > /sys/block/sd*/queue/scheduler"

At the moment, the scheduler file for each block device contains this line:

"noop anticipatory deadline [cfq]"

Maybe I would have to do something like
"echo [noop] anticipatory deadline cfq > /sys/block/sd*/queue/scheduler"
instead?

Thanks for the help.

Jankowski, Chris wrote:
> Michael,
>
> Would you be willing to repeat the tests with a large block size and a 
> different IO scheduler? Specifically, there is a scheduler that is 
> effectively a null scheduler.
>
> I think that I saw cases when the cfq IO scheduler was not working all that 
> great on single streams.
>
> Thanks and regards,
>
> Chris
>
> -----Original Message-----
> From: [email protected] 
> [mailto:[email protected]] On Behalf Of Michael Lackner
> Sent: Tuesday, 15 June 2010 22:04
> To: linux clustering
> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
> problems
>
> Hello!
>
> I tried to do R/W tests comparing 4kB blocksize to 1MB blocksize now, and the 
> difference in performance was negligible. Also, GFS2 was almost on the same 
> speed level when compared to GFS1 for Reads (see below why..). I/O scheduler 
> is "cfq" by the way. I never really cared about the I/O scheduler since I do 
> not yet understand the differences between the available ones anyway.
>
> But, I found out something else. As suggested by Steven in his reply, I ran 
> tests both on the GFS1/2 filesystems, and also on the raw blockdevice, and 
> surprisingly the results were almost the same!
>
> So: GFS1 as well as GFS2 3-Node concurrent, sequential Reads showed a total 
> of 40MB/s (GFS1) and 45MB/s (GFS2) using a blocksize of 1MB. For single-node 
> sequential read the performance went up to a nice 180-190MB/s for both FS 
> versions.
>
> Now, the surprising part: Doing a dd read on the raw blockdevice with 3 nodes 
> showed a total of only ~60MB/s!! Almost as low as reading from GFS1/2 with 
> multiple nodes at the same time!! When reading the raw blockdevice on a 
> single node, I got slightly over 190MB/s again.
>
> So, this concurrent read issue seems not to be a GFS1 or GFS2 problem, but 
> more a problem of the underlying storage. This is extremely surprising and a 
> bit shocking I must say.
>
> I guess for the Reads I will need to check the SAN itself, see if I can do 
> any optimization on it..  That thing can't possibly be that bad when it comes 
> to reading..
>
> Thanks a lot for your ideas so far!
>
> Jankowski, Chris wrote:
>> Michael,
>>
>> For comparison, could you do your dd(1) tests with a very large block size 
>> (1 MB) and tell us the results, please?
>>
>> I have a vague hunch that the problem may have something to do with 
>> coalescing or not of IO operations.
>>
>> Also, which IO scheduler are you using?
>>
>> Thanks and regards,
>>
>> Chris Jankowski
>>
>>
>> -----Original Message-----
>> From: [email protected] 
>> [mailto:[email protected]] On Behalf Of Michael 
>> Lackner
>> Sent: Tuesday, 15 June 2010 00:22
>> To: linux clustering
>> Subject: Re: [Linux-cluster] GFS (1 & partially 2) performance 
>> problems
>>
>> Hello!
>>
>> Thanks for your reply. I unfortunately forgot to mention, HOW I was actually 
>> testing, stupid.
>>
>> I tested with dd, doing 4kB blocksize reads and writes, 160GB total testfile 
>> size per node.
>> I read from /dev/zero for writing tests and wrote to /dev/null for reading 
>> tests. So, totally sequential, somewhat small blocksize (equal to filesystem 
>> BS).
>>
>> The performance was measured directly on the Fibrechannel Switch, which 
>> offers nice per-port monitoring for that purpose.
>>
>> I have yet to do some serious read testing on GFS2. I aborted my GFS2 tests 
>> because write performance was not up to GFS1 to begin with. My older GFS2 
>> benchmarks (I did these with a 2-node configuration before) are lost, so I 
>> will need to redo them to give you some numbers.
>>
>> After each write test I did a "sync" to flush everything to disks.  I did 
>> not do this before or after read tests though..
>>
>> As you mentioned journal size: "gfs_tool counters <mountpoint>" reported that 
>> only 2-3% of the log space was in use after the tests (I guess this is the 
>> per-node fs journal?).
>>
>> As for the direct I/O tests, by that you mean testing without ANY caching 
>> going on, a synchronous write? What I did before was test EXT3 (~190MB/s) 
>> and XFS (~320MB/s) on the Storage Array. I think what I'm getting here is 
>> raw throughput, since I am not monitoring in the OS, but at the 
>> Fibrechannel Switch itself..
>>
>> I will do GFS2 read tests similar to those conducted for GFS1. I'll be able 
>> to do that tomorrow morning, then I can post the numbers here.
>>
>> Thanks!
>>
>> Steven Whitehouse wrote:
>>> Hi,
>>>
>>> On Mon, 2010-06-14 at 14:00 +0200, Michael Lackner wrote:
>>>> Hello!
>>>>
>>>> I am currently building a Cluster sitting on CentOS 5 for GFS usage.
>>>>
>>>> At the moment, the storage subsystem consists of an HP MSA2312 
>>>> Fibrechannel SAN linked to an 8 Gbit FC switch. Three client 
>>>> machines are connected to that switch over 8 Gbit FC. The disks 
>>>> themselves are 12 x 15,000 rpm SAS drives configured in RAID-5 with 
>>>> two hotspares.
>>>>
>>>> Now, the whole storage shall be shared (single filesystem); this is 
>>>> where GFS comes in.
>>>>
>>>> The Cluster is only 3 nodes large at the moment, more nodes will be 
>>>> added later on. I am currently testing GFS1 and GFS2 for performance.
>>>> Lock Management is done over single 1Gbit Ethernet Links (1 per 
>>>> machine).
>>>>
>>>> Thing is, with GFS1 I get far better performance than with the newer 
>>>> GFS2 across the board, with a few tunable parameters set. For writes, 
>>>> GFS1 is roughly twice as fast.
>>>>
>>> What tests are you running? GFS2 is generally faster than GFS1 
>>> except for streaming writes, which is an area that we are putting 
>>> some effort into solving currently. Small writes (one fs block (4k 
>>> default) or less) on GFS2 are much faster than on GFS1.
>>>
>>>> But, concurrent reads are totally abysmal. The total write 
>>>> performance (all nodes combined) sits around 280-330Mbyte/sec, 
>>>> whereas the READ performance is as low as 30-40Mbyte/sec when doing 
>>>> concurrent reads. Surprisingly, single-node read is somewhat ok at 
>>>> 180Mbyte/sec, but as soon as several nodes are reading from GFS 
>>>> (version 1 at the moment) at the same time, things turn ugly.
>>>>
>>> Reads on GFS2 should be much faster than GFS1, so it sounds as if 
>>> something isn't working correctly for some reason. For cached data, 
>>> reads on GFS2 should be as fast as ext2/3 since the code path is 
>>> identical (to the page cache) and only changes if pages are not cached.
>>> GFS1 does its locking at a higher level, so there will be more 
>>> overhead for cached reads in general.
>>>
>>> Do make sure that, if you are preparing the test files for reading 
>>> all from one node (or even just a different node to that on which 
>>> you are running the read tests), you sync them to disk on that node 
>>> before starting the tests, to avoid issues with caching.
>>>
>>>> This is strange, because for writes, global performance across the 
>>>> cluster increases slightly when adding more nodes. But for reads, 
>>>> the opposite seems to be true.
>>>>
>>>> For read and write tests, separate testfiles were created and read 
>>>> for each node, with each testfile sitting in its own subdirectory, 
>>>> so no node would access another node's file.
>>>>
>>> That sounds like a good test set up to me.
>>>
>>>> GFS1 created with the following mkfs.gfs parameters:
>>>> "-b 4096 -J 128 -j 16 -r 2048 -p lock_dlm"
>>>> (4kB block size, 16 * 128MB journals, 2GB resource groups, 
>>>> Distributed Lock Manager)
>>>>
>>>> Mount Options set: "noatime,nodiratime,noquota"
>>>>
>>>> Tunables set: "glock_purge 50, statfs_slots 128, statfs_fast 1, 
>>>> demote_secs 20"
>>>>
>>> You shouldn't normally need to set the glock_purge and demote_secs 
>>> to anything other than the default. These settings no longer exist 
>>> in GFS2 since it makes use of the shrinker subsystem provided by the VM 
>>> and is auto-tuning. If your workload is metadata heavy, you could 
>>> try boosting the journal size and/or the incore_log_blocks tunable.
>>>
>>>> Also, in /etc/cluster/cluster.conf, I added this:
>>>> <dlm plock_ownership="1" plock_rate_limit="0"/>
>>>> <gfs_controld plock_rate_limit="0"/>
>>>>
>>>> Any ideas on how to figure out what's going wrong, and how to tune
>>>> GFS1 for better concurrent read performance, or tune GFS2 in 
>>>> general to be competitive/better than GFS1?
>>>>
>>>> I'm dreaming about 300MB/sec read, 300MB/sec write sequentially and 
>>>> somewhat good reaction times while under heavy sequential and/or 
>>>> random load. But for now, I just wanna get the seq reading to work 
>>>> acceptably fast.
>>>>
>>>> Thanks a lot for your help!
>>>>
>>> Can you try doing some I/O direct to the block device so that we can 
>>> get an idea of what the raw device can manage? Using dd both read 
>>> and write, across the nodes (different disk locations on each node 
>>> to simulate different files).
>>>
>>> I'm wondering if the problem might be due to the seek pattern 
>>> generated by the multiple read locations.
>>>
>>> Steve.
--
Michael Lackner
Chair of Information Technology, University of Leoben IT Administration 
[email protected] | +43 (0)3842/402-1505

--
Linux-cluster mailing list
[email protected]
https://www.redhat.com/mailman/listinfo/linux-cluster
