Hi Tariq,

Yesterday one node was under load, though not as high as last week, and iostat showed:
- 10% of samples with %util >90% (some peaks of 100%) and an average value of 18%
- %iowait peaks of 37% with an average value of 4%
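
For anyone who wants to reproduce figures like these, here is a minimal sketch of how they could be computed from saved `iostat -x 3` output. The function names and the device name "sdb" are only illustrative assumptions; the column position is taken from the header line rather than hard-coded, and a decimal comma (as printed under some locales) is accepted.

```python
def util_samples(iostat_lines, device):
    """Collect the %util values reported for one device in `iostat -x` text."""
    col = None
    samples = []
    for line in iostat_lines:
        fields = line.split()
        if not fields:
            continue
        if fields[0].startswith("Device") and "%util" in fields:
            # Header line: remember where the %util column sits.
            col = fields.index("%util")
        elif fields[0] == device and col is not None and len(fields) > col:
            # Data line for our device; tolerate "95,00" as well as "95.00".
            samples.append(float(fields[col].replace(",", ".")))
    return samples


def summarize(samples, threshold=90.0):
    """Return (percent of samples above threshold, peak, average)."""
    if not samples:
        raise ValueError("no samples found")
    busy = sum(1 for value in samples if value > threshold)
    return (100.0 * busy / len(samples), max(samples), sum(samples) / len(samples))
```

Usage would be something like `summarize(util_samples(open("iostat.log"), "sdb"))`, where "iostat.log" is a hypothetical file holding the captured output.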

BUT:
- none of the indicated error messages appeared in /var/log/messages
- we have mounted the OCFS2 filesystem with TWO extra options:
     data=writeback
     commit=20
* Question about these extra options:
    Could they help to mitigate the problem in some way?
    I've read about using them (usually commit=60), but I don't know whether they really help or whether there are other useful options to use.
    Before, the volume was mounted using only the options "_netdev,rw,noatime".
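
To double-check which options are really in effect on each node, a small sketch like the following could read /proc/mounts. It assumes the kernel reports data= and commit= there for OCFS2 mounts; the function names are made up for illustration.

```python
def ocfs2_mount_options(mounts_text):
    """Map each OCFS2 mount point to a dict of its mount options.

    `mounts_text` is expected in /proc/mounts format:
    device, mount point, fs type, comma-separated options, dump, pass.
    """
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        if len(fields) < 4 or fields[2] != "ocfs2":
            continue
        options = {}
        for opt in fields[3].split(","):
            # "commit=20" -> ("commit", "20"); "noatime" -> ("noatime", "")
            key, _, value = opt.partition("=")
            options[key] = value
        result[fields[1]] = options
    return result


def read_mounts(path="/proc/mounts"):
    """Return the current mount table as text (Linux only)."""
    with open(path) as handle:
        return handle.read()
```

Calling `ocfs2_mount_options(read_mounts())` on each node should then show whether `data` is `writeback` and `commit` is `20` for the volume.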

NOTE:
- we have left only one node active (not the three nodes of the cluster) to "force" overloads
- although only one node is serving the app, all three nodes have the OCFS2 volume mounted


About the EACCES/ENOENT errors... we don't know whether they are caused by:
- abnormal behavior of the application
- the OCFS2 problem (a user tries to unlink/rename something and, if the system is slow due to OCFS2, the user retries the operation again and again, so the first operation completes successfully but the following ones fail)
- a possible concurrency problem: now, with only one node serving the application, the errors don't appear, but with the three nodes in service they did (several nodes trying to do the same operation)
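
The retry hypothesis can be illustrated on any local filesystem with a minimal sketch (this only shows the generic semantics of repeating an unlink, not OCFS2 behavior; the function name is made up):

```python
import errno
import os
import tempfile


def unlink_twice():
    """Unlink the same temporary file twice and report each outcome."""
    fd, path = tempfile.mkstemp()
    os.close(fd)
    outcomes = []
    for _ in range(2):
        try:
            os.unlink(path)
            outcomes.append("ok")
        except FileNotFoundError as exc:
            # The second attempt finds nothing left to delete.
            outcomes.append(errno.errorcode[exc.errno])
    return outcomes
```

The first call succeeds and the repeated one fails with ENOENT, which is the same status (-2) that ocfs2_unlink logs.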

As for the messages about blocked processes in /var/log/messages, I'll send the file directly to you (instead of to the list).

Regards.

------------------------------------------------------------------------

Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Phone: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es

------------------------------------------------------------------------
On 14/09/15 at 20:29, Tariq Saeed wrote:

On 09/14/2015 01:20 AM, Area de Sistemas wrote:
Hello everyone,

We have a problem in a 3-member OCFS2 cluster used to serve a web/PHP application that accesses (reads and/or writes) files located in the OCFS2 volume. The problem appears only sometimes (apparently during high-load periods).

SYMPTOMS:
- access to OCFS2 content becomes slower and slower until it stalls
    * a "ls" command that normally takes <=1s takes 30s, 40s, 1m,...
- load average of the system grows to 150, 200 or even more

- high iowait values: 70-90%

This is a hint that the disk is under pressure. Run iostat (see the man page) when this happens, producing a report every 3 seconds or so, and look at the %util column:
         %util
Percentage of CPU time during which I/O requests were issued to the device (bandwidth utilization for the device). Device saturation occurs when this value is close to 100%.

* but CPU usage is low

- a lot of messages like these appear in the syslog:
    (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
EACCES: Permission denied. Find the file name and check its permissions with ls -l.
or
    (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
ENOENT: All we can say is that this is an attempt to delete a file from a directory that has already been deleted. This requires some knowledge of the environment. Is there an application log?
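
To see how often each of these errors occurs, the syslog lines could be tallied with a sketch like the one below. The regular expression is modeled only on the two messages quoted above, and reading the three fields in parentheses as command, PID and CPU is an assumption. The status value is the negated errno, so -13 decodes to EACCES and -2 to ENOENT.

```python
import errno
import re
from collections import Counter

# Modeled on: (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
# Field meanings (comm, pid, cpu) are assumed, not confirmed.
PATTERN = re.compile(
    r"\((?P<comm>[^,]+),(?P<pid>\d+),(?P<cpu>\d+)\):"
    r"(?P<func>\w+):(?P<line>\d+) ERROR: status = (?P<status>-?\d+)"
)


def count_errors(log_lines):
    """Count matching error lines per (function name, errno name)."""
    counts = Counter()
    for text in log_lines:
        match = PATTERN.search(text)
        if match is None:
            continue
        code = abs(int(match.group("status")))
        # Fall back to the bare number if the errno is unknown.
        name = errno.errorcode.get(code, str(code))
        counts[(match.group("func"), name)] += 1
    return counts
```

Feeding it the lines of /var/log/messages and comparing the counts per hour against the load figures might show whether the errors track the slow periods.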

  and the more "worrying":
     kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
     kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
     kernel: httpd           D c6fe5d74     0  3488   1616 0x00000080
     kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd c6fe5d88 c0439b18
     kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8 c0b976c0 f75ac6c0
     kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8 f8fd9a86 00000001
     kernel: Call Trace:
     kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
     kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
     kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
     kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
     kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
     kernel: [<c0873105>] schedule+0x35/0x50
     kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
     ....

The important part of the backtrace is cut off. Where is the rest of it? The entries starting with "?" are junk. You can attach /var/log/messages to give us the complete picture. My guess is that the thread is blocked on the mutex for so long because the thread holding the mutex is itself blocked on I/O.
Run "ps -e -o pid,stat,comm,wchan=WIDE_WCHAN-COLUMN" and look at the processes in 'D' state (uninterruptible sleep).
These are usually processes blocked on I/O.
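
If it helps to capture this repeatedly during a stall, here is a minimal sketch that filters ps output down to the 'D' state processes. It is only an illustration: the column order is changed so that the command name comes last (command names may contain spaces), and the function names are made up.

```python
import subprocess

# Same information as the ps command above, with comm moved to the end
# and a wider wchan column so kernel function names are not truncated.
PS_COMMAND = ["ps", "-e", "-o", "pid,stat,wchan:32,comm"]


def blocked_processes(ps_output):
    """Return (pid, stat, wchan, comm) for every process in 'D' state."""
    rows = []
    for line in ps_output.splitlines()[1:]:  # first line is the header
        parts = line.split(None, 3)
        if len(parts) == 4 and parts[1].startswith("D"):
            pid, stat, wchan, comm = parts
            rows.append((int(pid), stat, wchan, comm.strip()))
    return rows


def run_ps():
    """Run ps once and return its text output."""
    result = subprocess.run(PS_COMMAND, capture_output=True, text=True, check=True)
    return result.stdout
```

Calling `blocked_processes(run_ps())` every few seconds while the load average climbs would show which kernel functions the httpd processes are waiting in.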

(UNACCEPTABLE) WORKAROUND:
   stop httpd (really slow)
   stop ocfs2 service (really slow)
   start ocfs2 and httpd

MORE INFO:
- OS information:
    Oracle Linux 6.4 32bit
    4GB RAM
    uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28 09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
    * anyway: we have another 5-node cluster with Oracle Linux 7.1 (so a 64-bit OS) serving a newer version of the same application and the problems are similar, so it appears to be not an OS problem but a more specific OCFS2 problem (bug? some tuning? other?)

- standard configuration
    * if you want I can show the cluster.conf configuration, but it is the "expected configuration"

- standard configuration in o2cb:
    Driver for "configfs": Loaded
    Filesystem "configfs": Mounted
    Stack glue driver: Loaded
    Stack plugin "o2cb": Loaded
    Driver for "ocfs2_dlmfs": Loaded
    Filesystem "ocfs2_dlmfs": Mounted
    Checking O2CB cluster "MoodleOCFS2": Online
      Heartbeat dead threshold: 31
      Network idle timeout: 30000
      Network keepalive delay: 2000
      Network reconnect delay: 2000
      Heartbeat mode: Local
    Checking O2CB heartbeat: Active

- mount options: _netdev,rw,noatime
    * so other options (commit, data, ...) have their default values


Any ideas/suggestions?

Regards.



_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users

