Hi Tariq,
DATA=WRITEBACK:
From your words I understand that IF WE CAN ASSUME/TOLERATE POSSIBLE
FILE-CONTENT "CORRUPTION" IN CASE OF FAILURE, this option CLEARLY
IMPROVES PERFORMANCE... right?
* I ask this because, according to the mount.ocfs2 man page, this option
"is rumored to be the highest-throughput option", which is a very
vague/unclear description of its benefits.
COMMIT=XX (higher than the 5s default):
I am a bit confused: after searching on the internet, I see many people
recommend using a value higher than the 5s default (typically 30s or
60s) in case of performance issues, but you suggest that higher values
can increase the number of unnecessary writes, which sounds like the
opposite. So... can you clarify if/when commit="higher value than
default" can be useful?
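For reference, a sketch of how we would put both options in place (the
device and mount point below are placeholders, not our real ones, and my
understanding is that data= cannot be changed on a plain remount, so a
clean umount/mount cycle is needed; please correct me if that is wrong):

    # /etc/fstab entry with the two extra options under discussion
    /dev/mapper/ocfs2vol  /data  ocfs2  _netdev,rw,noatime,data=writeback,commit=20  0 0

    # apply the new data= mode with a full umount/mount on each node
    umount /data
    mount /data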
THREADS BLOCKED ON MUTEX ON OCFS2 FILESYSTEM:
From your words I understand that perhaps the journal (log) area is too
small, which can cause too many "extra" I/Os to free/empty/clear the log
during high write loads...
- Is there some method to monitor the log usage?
- Is it possible to get the current log size and/or increase it
(without "losing" data)? How?
* I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm
the filesystem, but I prefer to be sure about this. In any case, if I
don't know the current size of the log, I can't set an "acceptable"
higher value.
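In case it is useful to others, a sketch of the commands I believe would
show and grow the journal (/dev/sdb1 is a placeholder device, and my
understanding is that resizing the journal requires the volume to be
unmounted on all nodes; please confirm):

    # show the journal inode of node slot 0 (system files live under //);
    # the Size field is that slot's on-disk journal size
    debugfs.ocfs2 -R "stat //journal:0000" /dev/sdb1

    # grow each slot's journal to 128M, with the volume unmounted cluster-wide
    tunefs.ocfs2 -J size=128M /dev/sdb1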
LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more than 120s):
Apparently the load from the app's usage continues to be high, BUT ALL
THESE errors have disappeared...
* The usage pattern of the app these days is:
- a high number of users generating "backups" of partial contents of the
OCFS2 disk (so: high read+write) <-- this is specific to the last weeks
- "normal/low" read access to contents
CHANGES MADE AND ERROR EVOLUTION:
1) First change:
* two nodes disabled (httpd stopped but the OCFS2 volume still
mounted), so ONLY ONE NODE IS SERVICING THE APP
* After that, the errors disappeared... although %util was high
2) Three days later:
* added the commit=20 and data=writeback mount options to the OCFS2
volume (keeping only one node servicing the app)
* Situation persists: NO errors, although %util is high
So... is it possible that concurrent use of OCFS2 (2-3 nodes servicing
the app simultaneously) generates too much overhead (caused by OCFS2
operation)?
* Obviously, OCFS2 operation generates some "extra" load, but... perhaps
under some circumstances (like these days' usage of the app) the extra
load becomes REALLY HIGH?
Regards.
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
On 16/09/15 at 4:04, Tariq Saeed wrote:
Hi Area,
data=writeback improves things greatly. In ordered mode (the default),
data is written before the transaction (which only logs metadata
changes) is written. This is very conservative: it ensures that before
the journal log buffer is written to the on-disk journal area, the data
has already hit the disk, so the transaction can be safely replayed in
case of a crash -- only complete transactions are replayed. By complete
I mean: begin-trans, changebuf1, changebuf2, ..., changebufN, end-trans.
Replay means the buffers are dispatched from the journal area on disk to
their ultimate home locations on disk. You can see now why ordered mode
generates so much I/O. In writeback mode, the transaction can hit the
disk but the data will be written whenever the kernel wants,
asynchronously and without any knowledge of its relationship to the
metadata it belongs with. The danger is that in case of a crash, we can
replay a transaction whose associated data never reached the disk. For
example, if you truncate a file up to a new, bigger size and then write
something to a page beyond the old size, the page could hang around in
core for a long time after the transaction is written to the journal
area on disk. If there is a crash while the data page is still in core,
then after replay the file will have the new size, but the page will
read back as all zeros instead of what you wrote. At any rate, this is a
digression, just for your info.
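A rough way to picture that scenario with ordinary commands (file names
are hypothetical, and whether the page is actually lost depends on crash
timing):

    # writeback-mode hazard: the size update is journaled, the data page is not
    echo small > f                # old size: a few bytes
    truncate -s 10M f             # truncate up: new i_size goes into a transaction
    dd if=/dev/urandom of=f bs=4k seek=1000 count=1 conv=notrunc  # page far past the old size
    # if the node crashes after the transaction commits but before the page
    # is flushed, replay restores the 10M size, yet that page reads back as zeros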
The commit value is the interval at which data is synced to disk. I
think it may also be the interval after which the journal log buffer is
written to disk. So decreasing it reduces the number of unnecessary
writes.
Now for the threads blocked for more than 120 sec in /var/log/messages.
There are two types. The first type is blocked on a mutex on an ocfs2
system file, mostly the global bitmap file shared by all nodes. All
writes to system files are done under transactions, and that may require
flushing the journal buffer to disk, depending on your journal file
size. The smaller the size, the fewer transactions it can hold, so the
more frequently the journal log on disk needs to be reclaimed by
dispatching the metadata blocks from the journal space to their home
locations, thus freeing up on-disk journal space. This requires reading
metadata blocks from the journal area on disk and writing them to their
home locations. So again, a lot of I/O. I think the threads are waiting
on the mutex because the journal code must do this reclaiming to free up
space. The other kind of blocked threads are NOT in ocfs2 code, but they
are all blocked on a mutex. I don't know why. Finding out would require
getting a vmcore, chasing the mutex owner, and discovering why it is
taking so long. I don't think that is warranted at this time.
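If you want a feel for how busy the journal is, one option -- assuming
your kernel exposes the per-journal jbd2 statistics, which not every
build does -- is to look under /proc (ocfs2 journals use jbd2):

    # list the jbd2 journals the kernel knows about; entries are named
    # after the backing device
    ls /proc/fs/jbd2/
    # per-journal stats: transactions committed, average blocks per
    # transaction, time spent in each commit phase
    cat /proc/fs/jbd2/*/info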
Let me know if you have any further questions.
Thanks
-Tariq
On 09/15/2015 01:55 AM, Area de Sistemas wrote:
Hi Tariq,
Yesterday one node was under load, but not as high as last week, and
iostat showed:
- 10% of samples with %util > 90% (some peaks of 100%) and an average
value of 18%
- %iowait peaks of 37%, with an average value of 4%
BUT:
- none of the indicated error messages appeared in /var/log/messages
- we have mounted the OCFS2 filesystem with TWO extra options:
data=writeback
commit=20
* Question about these extra options:
Perhaps they help to mitigate the problem in some way?
I've read about using them (usually commit=60), but I don't know
whether they really help and/or whether there are other useful options
to use.
Before, the volume was mounted using only the options
"_netdev,rw,noatime".
NOTE:
- we have left only one node active (not all three nodes of the
cluster) to "force" overloads
- although only one node is serving the app, all three nodes have
the OCFS2 volume mounted
About the EACCES/ENOENT errors... we don't know if they are
caused by:
- abnormal behavior of the application
- the OCFS2 problem (a user tries to unlink/rename something; if the
system is slow due to OCFS2, the user retries the operation again and
again, so the first operation completes successfully but the
following ones fail)
- a possible concurrency problem: now, with only one node
servicing the application, the errors don't appear, but with the three
nodes in service the errors appeared (several nodes trying to do the
same operation)
And about the messages about blocked processes in /var/log/messages:
I'll send the file directly to you (instead of to the list).
Regards.
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
On 14/09/15 at 20:29, Tariq Saeed wrote:
On 09/14/2015 01:20 AM, Area de Sistemas wrote:
Hello everyone,
We have a problem in a 3-member OCFS2 cluster used to serve a
web/php application that accesses (reads and/or writes) files located
on the OCFS2 volume.
The problem appears only sometimes (apparently during high-load
periods).
SYMPTOMS:
- access to OCFS2 content becomes slower and slower until it stalls
* an "ls" command that normally takes <=1s takes 30s, 40s, 1m, ...
- load average of the system grows to 150, 200 or even more
- high iowait values: 70-90%
This is a hint that the disk is under pressure. Run iostat (see the
man page) when this happens, producing a report every 3 seconds, and
look at the %util column:
    %util
        Percentage of CPU time during which I/O
        requests were issued to the device (bandwidth
        utilization for the device). Device saturation
        occurs when this value is close to 100%.
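For example, extended device statistics every 3 seconds:

    iostat -dx 3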
* but CPU usage is low
- in the syslog appear a lot of messages like:
(httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
EACCES, Permission denied. Find the filename and check its perms with ls -l.
or
(httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
ENOENT. All we can say is that something attempted to delete a file from
a directory that has already been deleted. This requires some knowledge
of the environment. Is there an application log?
and the more "worrying":
kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
disables this message.
kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74 000041fd
c6fe5d88 c0439b18
kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0 c6fe5de8
c0b976c0 f75ac6c0
kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d c6fe5de8
f8fd9a86 00000001
kernel: Call Trace:
kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
kernel: [<c0873105>] schedule+0x35/0x50
kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
....
The important part of the backtrace is cut off. Where is the rest of
it? The entries starting with "?" are junk. You can attach
/var/log/messages to give us a complete picture. My guess about why
threads block on the mutex for so long is that the thread holding the
mutex is itself blocked on I/O.
Run "ps -e -o pid,stat,comm,wchan=WIDE_WCHAN-COLUMN" and look at the
'D' state (uninterruptible sleep) processes. These are processes
usually blocked on I/O.
(UNACCEPTABLE) WORKAROUND:
stop httpd (really slow)
stop the ocfs2 service (really slow)
start ocfs2 and httpd
MORE INFO:
- OS information:
Oracle Linux 6.4, 32-bit
4GB RAM
uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28
09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
* anyway: we have another 5-node cluster with Oracle Linux 7.1
(so a 64-bit OS) serving a newer version of the same application, and
the problems are similar, so it appears not to be an OS problem but
a more specific OCFS2 problem (a bug? some tuning? other?)
- standard configuration
* if you want I can show the cluster.conf configuration, but it is
the "expected" configuration
- standard configuration in o2cb:
Driver for "configfs": Loaded
Filesystem "configfs": Mounted
Stack glue driver: Loaded
Stack plugin "o2cb": Loaded
Driver for "ocfs2_dlmfs": Loaded
Filesystem "ocfs2_dlmfs": Mounted
Checking O2CB cluster "MoodleOCFS2": Online
Heartbeat dead threshold: 31
Network idle timeout: 30000
Network keepalive delay: 2000
Network reconnect delay: 2000
Heartbeat mode: Local
Checking O2CB heartbeat: Active
- mount options: _netdev,rw,noatime
* so other options (commit, data, ...) have their default values
Any ideas/suggestions?
Regards.
--
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------
_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users