Hi Tariq,
DATA=WRITEBACK:
From your words I understand that IF WE CAN ASSUME/TOLERATE POSSIBLE
FILE-CONTENT "CORRUPTION" IN CASE OF FAILURE, this option CLEARLY
IMPROVES PERFORMANCE... right?
* I ask this because, according to the mount.ocfs2 man page, this option
"is rumored to be the highest-throughput option", which is a very
vague/unclear description of its benefits.
COMMIT=XX (higher than the 5s default):
I am a bit confused: after searching on the internet, I see many people
recommend using a value higher than the 5s default (typically 30s or
60s) in case of performance issues, but you suggest that higher values
can increase the number of unnecessary writes, which sounds like the
opposite. So... can you clarify if/when commit="higher value than
default" can be useful?
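For reference, a sketch of how we would put both options in place (the
device and mount point below are placeholders, not our real ones, and my
understanding is that data= cannot be changed on a plain remount, so a
clean umount/mount cycle is needed; please correct me if that is wrong):

    # /etc/fstab entry with the two extra options under discussion
    /dev/mapper/ocfs2vol  /data  ocfs2  _netdev,rw,noatime,data=writeback,commit=20  0 0

    # apply the new data= mode with a full umount/mount on each node
    umount /data
    mount /data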
THREADS BLOCKED ON MUTEX ON OCFS2 FILESYSTEM:
From your words I understand that perhaps the journal (log) area is too
small, which can cause too many "extra" I/Os to free/empty/clear the log
during high write loads...
- Is there some method to monitor the log usage?
- Is it possible to get the current log size and/or increase it
(without "losing" data)? How?
* I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm
the filesystem, but I prefer to be sure about this. In any case, if I
don't know the current size of the log, I can't set an "acceptable"
higher value.
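In case it is useful to others, a sketch of the commands I believe would
show and grow the journal (/dev/sdb1 is a placeholder device, and my
understanding is that resizing the journal requires the volume to be
unmounted on all nodes; please confirm):

    # show the journal inode of node slot 0 (system files live under //);
    # the Size field is that slot's on-disk journal size
    debugfs.ocfs2 -R "stat //journal:0000" /dev/sdb1

    # grow each slot's journal to 128M, with the volume unmounted cluster-wide
    tunefs.ocfs2 -J size=128M /dev/sdb1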
LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more than 120s):
Apparently the load from the app's usage continues to be high, BUT ALL
THESE errors have disappeared...
* The usage pattern of the app these days is:
- a high number of users generating "backups" of partial contents of the
OCFS2 disk (so: high read+write) <-- this is specific to the last weeks
- "normal/low" read access to contents
CHANGES MADE AND ERROR EVOLUTION:
1) First change:
* two nodes disabled (httpd stopped but the OCFS2 volume still
mounted), so ONLY ONE NODE IS SERVICING THE APP
* After that, the errors disappeared... although %util was high
2) Three days later:
* added the commit=20 and data=writeback mount options to the OCFS2
volume (keeping only one node servicing the app)
* Situation persists: NO errors, although %util is high
So... is it possible that concurrent use of OCFS2 (2-3 nodes servicing
the app simultaneously) generates too much overhead (caused by OCFS2
operation)?
* Obviously, OCFS2 operation generates some "extra" load, but... perhaps
under some circumstances (like these days' usage of the app) the extra
load becomes REALLY HIGH?
Regards.
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
On 16/09/15 at 4:04, Tariq Saeed wrote:
Hi Area,
data=writeback improves things greatly. In ordered mode (the default),
data is written before the transaction (which only logs metadata
changes) is written. This is very conservative: it ensures that before
the journal log buffer is written to the on-disk journal area, the data
has already hit the disk, so the transaction can be safely replayed in
case of a crash -- only complete transactions are replayed. By complete
I mean: begin-trans, changebuf1, changebuf2, ..., changebufN, end-trans.
Replay means the buffers are dispatched from the journal area on disk to
their ultimate home locations on disk. You can see now why ordered mode
generates so much I/O. In writeback mode, the transaction can hit the
disk but the data will be written whenever the kernel wants,
asynchronously and without any knowledge of its relationship to the
metadata it belongs with. The danger is that in case of a crash, we can
replay a transaction whose associated data never reached the disk. For
example, if you truncate a file up to a new, bigger size and then write
something to a page beyond the old size, the page could hang around in
core for a long time after the transaction is written to the journal
area on disk. If there is a crash while the data page is still in core,
then after replay the file will have the new size, but the page will
read back as all zeros instead of what you wrote. At any rate, this is a
digression, just for your info.
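A rough way to picture that scenario with ordinary commands (file names
are hypothetical, and whether the page is actually lost depends on crash
timing):

    # writeback-mode hazard: the size update is journaled, the data page is not
    echo small > f                # old size: a few bytes
    truncate -s 10M f             # truncate up: new i_size goes into a transaction
    dd if=/dev/urandom of=f bs=4k seek=1000 count=1 conv=notrunc  # page far past the old size
    # if the node crashes after the transaction commits but before the page
    # is flushed, replay restores the 10M size, yet that page reads back as zeros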
The commit value is the interval at which data is synced to disk. I
think it may also be the interval after which the journal log buffer is
written to disk. So decreasing it reduces the number of unnecessary
writes.
Now for the threads blocked for more than 120 sec in /var/log/messages.
There are two types. The first type is blocked on a mutex on an ocfs2
system file, mostly the global bitmap file shared by all nodes. All
writes to system files are done under transactions, and that may require
flushing the journal buffer to disk, depending on your journal file
size. The smaller the size, the fewer transactions it can hold, so the
more frequently the journal log on disk needs to be reclaimed by
dispatching the metadata blocks from the journal space to their home
locations, thus freeing up on-disk journal space. This requires reading
metadata blocks from the journal area on disk and writing them to their
home locations. So again, a lot of I/O. I think the threads are waiting
on the mutex because the journal code must do this reclaiming to free up
space. The other kind of blocked threads are NOT in ocfs2 code, but they
are all blocked on a mutex. I don't know why. Finding out would require
getting a vmcore, chasing the mutex owner, and discovering why it is
taking so long. I don't think that is warranted at this time.
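If you want a feel for how busy the journal is, one option -- assuming
your kernel exposes the per-journal jbd2 statistics, which not every
build does -- is to look under /proc (ocfs2 journals use jbd2):

    # list the jbd2 journals the kernel knows about; entries are named
    # after the backing device
    ls /proc/fs/jbd2/
    # per-journal stats: transactions committed, average blocks per
    # transaction, time spent in each commit phase
    cat /proc/fs/jbd2/*/info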
Let me know if you have any further questions.
Thanks
-Tariq
On 09/15/2015 01:55 AM, Area de Sistemas wrote:
Hi Tariq,
Yesterday one node was under load, but not as high as last week, and
iostat showed:
- 10% of samples with %util > 90% (some peaks of 100%) and an average
value of 18%
- %iowait peaks of 37%, with an average value of 4%
BUT:
- none of the indicated error messages appeared in /var/log/messages
- we have mounted the OCFS2 filesystem with TWO extra options:
data=writeback
commit=20
* Question about these extra options:
Perhaps they help to mitigate the problem in some way?
I've read about using them (usually commit=60), but I don't know
whether they really help and/or whether there are other useful options
to use.
Before, the volume was mounted using only the options
"_netdev,rw,noatime".
NOTE:
- we have left only one node active (not all three nodes of the
cluster) to "force" overloads
- although only one node is serving the app, all three nodes have
the OCFS2 volume mounted
About the EACCES/ENOENT errors... we don't know if they are
caused by:
- abnormal behavior of the application
- the OCFS2 problem (a user tries to unlink/rename something; if the
system is slow due to OCFS2, the user retries the operation again and
again, so the first operation completes successfully but the
following ones fail)
- a possible concurrency problem: now, with only one node
servicing the application, the errors don't appear, but with the three
nodes in service the errors appeared (several nodes trying to do the
same operation)
And about the messages about blocked processes in /var/log/messages:
I'll send the file directly to you (instead of to the list).
Regards.
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
On 14/09/15 at 20:29, Tariq Saeed wrote:
On 09/14/2015 01:20 AM, Area de Sistemas wrote:
Hello everyone,
We have a problem in a 3-member OCFS2 cluster used to serve a
web/php application that accesses (reads and/or writes) files located
on the OCFS2 volume.
The problem appears only sometimes (apparently during high-load
periods).
SYMPTOMS:
- access to OCFS2 content becomes slower and slower until it stalls
* an "ls" command that normally takes <=1s takes 30s, 40s, 1m, ...
- load average of the system grows to 150, 200 or even more
- high iowait values: 70-90%
This is a hint that the disk is under pressure. Run iostat (see the
man page) when this happens, producing a report every 3 seconds, and
look at the %util column:
    %util
        Percentage of CPU time during which I/O
        requests were issued to the device (bandwidth
        utilization for the device). Device saturation
        occurs when this value is close to 100%.
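For example, extended device statistics every 3 seconds:

    iostat -dx 3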
* but CPU usage is low
- in the syslog appear a lot of messages like:
(httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
EACCES, Permission denied. Find the filename and check its perms with ls -l.
or
(httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
ENOENT. All we can say is that something attempted to delete a file from
a directory that has already been deleted. This requires some knowledge
of the environment. Is there an application log?
and the more "worrying":
kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd
c6fe5d88 c0439b18
kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8
c0b976c0 f75ac6c0
kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8
f8fd9a86 00000001
kernel: Call Trace:
kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
kernel: [<c0873105>] schedule+0x35/0x50
kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
....
The important part of the backtrace is cut off. Where is the rest of
it? The entries starting with "?" are junk. You can attach
/var/log/messages to give us a complete picture. My guess about why
threads block on the mutex for so long is that the thread holding the
mutex is itself blocked on I/O.
Run "ps -e -o pid,stat,comm,wchan=WIDE_WCHAN-COLUMN" and look at the
'D' state (uninterruptible sleep) processes. These are processes
usually blocked on I/O.
(UNACCEPTABLE) WORKAROUND:
stop httpd (really slow)
stop the ocfs2 service (really slow)
start ocfs2 and httpd
MORE INFO:
- OS information:
Oracle Linux 6.4, 32-bit
4GB RAM
uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28
09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
* anyway: we have another 5-node cluster with Oracle Linux 7.1
(so a 64-bit OS) serving a newer version of the same application, and
the problems are similar, so it appears not to be an OS problem but
a more specific OCFS2 problem (a bug? some tuning? other?)
- standard configuration
* if you want I can show the cluster.conf configuration, but it is
the "expected" configuration
- standard configuration in o2cb:
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "MoodleOCFS2": Online
Heartbeat dead threshold: 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Heartbeat mode: Local
Checking O2CB heartbeat: Active
- mount options: _netdev,rw,noatime
* so other options (commit, data, ...) have their default values
Any ideas/suggestions?
Regards.
--
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users