Hi Tariq,
------------------------------------------------------------------------
Area de Sistemas
Servicio de las Tecnologias de la Informacion y Comunicaciones (STIC)
Universidad de Valladolid
Edificio Alfonso VIII, C/Real de Burgos s/n. 47011, Valladolid - ESPAÑA
Telefono: 983 18-6410, Fax: 983 423271
E-mail: siste...@uva.es
------------------------------------------------------------------------

On 16/09/15 at 20:51, Tariq Saeed wrote:
>
> On 09/16/2015 01:19 AM, Area de Sistemas wrote:
>> Hi Tariq,
>>
>> DATA=WRITEBACK:
>> From your words I understand that IF WE CAN ACCEPT/TOLERATE POSSIBLE
>> "CORRUPTION" OF FILE CONTENTS IN CASE OF A FAILURE, this option
>> CLEARLY IMPROVES PERFORMANCE... right?
>> * I ask you this because, according to the mount.ocfs2 man page, this
>> option "is rumored to be the highest-throughput option"... which gives
>> a very vague/unclear idea of its benefits.
> I don't understand what exactly is unclear? Can you be more specific?
> Is this a question or a comment?

It's a question: does data=writeback REALLY IMPROVE PERFORMANCE? That is,
is the improvement noticeable?

>>
>> COMMIT=XX (higher than the 5s default):
>> I am a bit confused: after searching on the internet, many people
>> recommend using a value higher than the 5s default (typically 30s or
>> 60s) in case of performance issues, but you suggested that higher
>> values can increase the number of unnecessary writes, which sounds
>> like the opposite... so can you clarify if/when commit="higher value
>> than default" can be useful?
> I stand corrected. I said the opposite of what I meant to say. Yes,
> the higher the commit interval, the longer the time lapse between
> syncing to disc, hence the fewer the comparative number of i/os due
> to commit.

OK, so increasing the commit interval also helps to improve the
performance/response of the filesystem... right?

>>
>> THREADS BLOCKED ON MUTEX ON THE OCFS2 FILESYSTEM:
>> From your words I understand that perhaps the log area size is too
>> small, which can cause too many "extra" I/Os in order to
>> free/empty/clear the log during high write loads...
>> - Is there some method to monitor the log usage?
> I never had a need to do that and therefore don't know. If the tunefs
> man page does not show anything, then there is none. You will have to
> google. ocfs2 uses jbd2, which is used by other filesystems (ext3,
> ext4, ...) in Linux. The only control ocfs2 has over jbd2 is to call
> 'checkpoint', which means all blocks up to the latest transaction need
> to be written to their home locations on disc from the log area. This
> is i/o intensive and happens, for example, when a metadata cluster-wide
> lock held in exclusive mode is given up (the on-disk metadata protected
> by the lock must be up to date before releasing it).
>> * I've seen the tunefs.ocfs2 -J option and I suppose it doesn't harm
>> the filesystem, but I prefer to be sure about this. Anyway, if I don't
>> know the actual size of the log, I can't set an "acceptable" higher
>> value.
> Again, I have not done it myself, so we are in the same boat.
> tunefs.ocfs2 should do it. You should not lose data (unless there is a
> bug in tunefs.ocfs2). google is your best source for customer
> experiences.
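For the record, this is what I plan to try in order to check the current
journal size and then grow it. The commands are only my reading of the
debugfs.ocfs2 and tunefs.ocfs2 man pages, and the device path and 256M
size are just examples, so please correct me if this is wrong:

    # inspect the journal system file of slot 0 (read-only) to see its size
    debugfs.ocfs2 -R "stat //journal:0000" /dev/mapper/ocfs2vol

    # grow the journal on all slots; we would do this with the volume
    # unmounted on every node, to be safe
    tunefs.ocfs2 -J size=256M /dev/mapper/ocfs2vol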
>>
>> LOGS/ERRORS (ocfs2_unlink, ocfs2_rename, task blocked for more than 120s):
>> Apparently the usage load of the app continues to be high, BUT ALL
>> THESE errors have disappeared...
>> * The usage pattern of the app these days is:
>>     - a high number of users generating "backups" of partial contents
>>       of the OCFS disk (so: high read+write) <-- this is specific to
>>       the last weeks
>>     - "normal/low" read access to contents
>>
>> CHANGES MADE AND ERROR EVOLUTION:
>> 1) First change:
>>     * two nodes disabled (httpd stopped but the ocfs2 volume still
>>       mounted), so ONLY ONE NODE IS SERVICING THE APP
>>     * After that, the errors disappeared... although %util was high
>> 2) Three days later:
>>     * added the commit=20 and data=writeback mount options to the
>>       OCFS2 volume (keeping only one node servicing the app)
>>     * Situation persists: NO errors, although %util is high
>>
>> So... is it possible that the concurrent use of OCFS2 (2-3 nodes
>> servicing the app simultaneously) generates too much overload (caused
>> by OCFS2 operation)?
> You are spot on. If you get high %util with one node, with more nodes
> there will be even more apps served simultaneously, burning more disc
> bandwidth, which seems to be the bottleneck here. And then there is the
> overhead of internode communication through disk, even if you don't
> service more apps. Adding nodes gives you HA at the cost of consuming
> some i/o bandwidth and won't help you, since you are already using
> plenty of bandwidth. You should consider, if you have not already,
> striping: making a volume out of many discs with the logical volume
> manager, etc.
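Just to check that I understand the striping suggestion: something along
the lines of the sketch below, and then recreate the filesystem on the
striped volume and copy the data over? The volume group / LV names and
sizes are hypothetical and the options are only my reading of the
lvcreate man page, not something we have tested.

    # logical volume striped across 4 physical disks, 64 KiB stripe size,
    # so i/o is spread over all the spindles
    lvcreate -i 4 -I 64 -L 500G -n ocfs2data vg_ocfs2

    # new OCFS2 filesystem on the striped volume, then copy the contents
    mkfs.ocfs2 -L MoodleData /dev/vg_ocfs2/ocfs2data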
>
> Question: Are you using a separate disc for global heart beat?

No: we use "local" heart beat.
* We have another, different OCFS2 cluster (different disks, nodes, etc.)
servicing a newer version of the same app. In that cluster we use a
"global" heart beat, BUT on the same OCFS2 disk (not on a separate disk).

>> * Obviously, the OCFS2 operation generates an "extra" load but...
>> perhaps under some circumstances (like the usage of the app these
>> days) the extra load becomes REALLY HIGH?
>>
>> Regards.
>>
>> On 16/09/15 at 4:04, Tariq Saeed wrote:
>>> Hi Area,
>>> data=writeback improves things greatly. In ordered mode, the default,
>>> data is written before writing a transaction (which only logs
>>> metadata changes). This is very conservative, to ensure that before
>>> the journal log buffer is written to the journal area on disk, the
>>> data has hit the disk and the transaction can be safely replayed in
>>> case of a crash -- only complete transactions are replayed; by
>>> complete I mean: begin-trans changebuf1, changebuf2, ..., changebufn
>>> end-trans. Replay means buffers are dispatched from the journal area
>>> on disk to their ultimate home location on disk. You can see now why
>>> ordered mode generates so much i/o. In writeback mode, the
>>> transaction can hit the disk but the data will be written whenever
>>> the kernel wants, asynchronously and without knowing any relationship
>>> to its related metadata. The danger is that in case of a crash we can
>>> replay a transaction but its associated data is not on disk. For
>>> example, if you truncate a file up to a new, bigger size and then
>>> write something to a page beyond the old size, the page could hang
>>> around in core for a long time after the transaction is written to
>>> the journal area on disk. If there is a crash while the data page is
>>> still in core, after replay the file will have the new size but the
>>> page with data will show all zeros instead of what you wrote. At any
>>> rate, this is a digression, just for your info.
>>>
>>> The commit is the interval at which data is synced to disc. I think
>>> it may also be the interval after which the journal log buffer is
>>> written to disk. So decreasing it reduces the number of unnecessary
>>> writes.
>>>
>>> Now for the threads blocked for more than 120 sec in
>>> /var/log/messages. There are two types. The first type is blocked on
>>> a mutex on an ocfs2 system file, mostly the global bitmap file shared
>>> by all nodes. All writes to system files are done under transactions,
>>> and that may require flushing the journal buffer to disk, depending
>>> upon your journal file size. The smaller the size, the fewer
>>> transactions it can hold, so the more frequently the journal log on
>>> disk needs to be reclaimed by dispatching the metadata blocks from
>>> the journal space to their home locations, thus freeing up on-disk
>>> journal space. This requires reading metadata blocks from the journal
>>> area on disk and writing them to their home locations. So again, a
>>> lot of i/o. I think the threads are waiting on the mutex because the
>>> journal code must do this reclaiming to free up space. The other kind
>>> of blocked threads are NOT in ocfs2 code, but they are all blocked on
>>> a mutex. I don't know why. That would require getting a vmcore,
>>> chasing the mutex owner and finding out why it is taking a long time.
>>> I don't think that is warranted at this time.
>>> Let me know if you have any further questions.
>>> Thanks
>>> -Tariq
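About the writeback danger you describe, just to be sure I understood it,
here is my own illustration of the scenario (plain shell, nothing
ocfs2-specific; "testfile" is just an example file on the OCFS2 volume):

    # extend a file: the new size is metadata, so it goes into a transaction
    truncate -s 1M testfile
    # write a data page well beyond the old end of the file
    dd if=/dev/urandom of=testfile bs=4k count=1 seek=200 conv=notrunc
    # with data=writeback, if the node crashes after the transaction has
    # reached the journal but before the data page is flushed, replay
    # restores the 1M size, but the block at offset 800k can read back
    # as zeros instead of what dd wrote

So with data=writeback a crash can leave zeros in recently written file
contents, but the filesystem metadata itself stays consistent... right?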
>>> On 09/15/2015 01:55 AM, Area de Sistemas wrote:
>>>> Hi Tariq,
>>>>
>>>> Yesterday one node was under load, but not as high as last week,
>>>> and iostat showed:
>>>>     - 10% of the samples with %util > 90% (some peaks of 100%) and
>>>>       an average value of 18%
>>>>     - %iowait peaks of 37% with an average value of 4%
>>>>
>>>> BUT:
>>>>     - none of the indicated error messages appeared in /var/log/messages
>>>>     - we have mounted the OCFS2 filesystem with TWO extra options:
>>>>           data=writeback
>>>>           commit=20
>>>> * Question about these extra options:
>>>>     Perhaps they help to mitigate the problem in some way?
>>>>     I've read about using them (usually commit=60), but I don't know
>>>>     if they really help and/or whether there are other useful options
>>>>     to use.
>>>>     Before, the volume was mounted using only the options
>>>>     "_netdev,rw,noatime".
>>>>
>>>> NOTE:
>>>>     - we have left only one node active (not the three nodes of the
>>>>       cluster) to "force" overloads
>>>>     - although only one node is serving the app, all three nodes
>>>>       have the OCFS volume mounted
>>>>
>>>> About the EACCESS/ENOENT errors... we don't know if they are caused by:
>>>>     - abnormal behavior of the application
>>>>     - the OCFS2 problem (a user tries to unlink/rename something and,
>>>>       if the system is slow due to OCFS, the user retries the
>>>>       operation again and again, causing the first operation to
>>>>       complete successfully but the following ones to fail)
>>>>     - a possible concurrency problem: now, with only one node
>>>>       servicing the application, the errors don't appear, but with
>>>>       the three nodes in service the errors appeared (several nodes
>>>>       trying to do the same operation)
>>>>
>>>> As for the messages about blocked processes in /var/log/messages,
>>>> I'll send the file directly to you (instead of to the list).
>>>>
>>>> Regards.
>>>>
>>>> On 14/09/15 at 20:29, Tariq Saeed wrote:
>>>>>
>>>>> On 09/14/2015 01:20 AM, Area de Sistemas wrote:
>>>>>> Hello everyone,
>>>>>>
>>>>>> We have a problem in a 3-member OCFS2 cluster used to serve a
>>>>>> web/php application that accesses (reads and/or writes) files
>>>>>> located in the OCFS2 volume.
>>>>>> The problem appears only sometimes (apparently during high load
>>>>>> periods).
>>>>>>
>>>>>> SYMPTOMS:
>>>>>> - access to OCFS2 content becomes more and more slow until it stalls
>>>>>>     * an "ls" command that normally takes <=1s takes 30s, 40s, 1m, ...
>>>>>> - the load average of the system grows to 150, 200 or even more
>>>>>> - high iowait values: 70-90%
>>>>> This is a hint that the disk is under pressure. Run iostat (see the
>>>>> man page) when this happens, producing a report every 3 seconds or
>>>>> so, and look at the %util column:
>>>>>     %util
>>>>>         Percentage of CPU time during which I/O requests were
>>>>>         issued to the device (bandwidth utilization for the
>>>>>         device). Device saturation occurs when this value is
>>>>>         close to 100%.
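Noted: the next time the problem appears we will capture something like
the following on the affected node (the interval and options are just
what we plan to use, nothing more):

    # extended per-device statistics every 3 seconds; watch the %util column
    iostat -dx 3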
>>>>>> * but CPU usage is low
>>>>>>
>>>>>> - in the syslog a lot of messages appear, like:
>>>>>>     (httpd,XXXXX,Y):ocfs2_rename:1474 ERROR: status = -13
>>>>> EACCES Permission denied. Find the filename and check its
>>>>> permissions with ls -l.
>>>>>> or
>>>>>>     (httpd,XXXXX,Y):ocfs2_unlink:951 ERROR: status = -2
>>>>> ENOENT. All we can say is that there was an attempt to delete a
>>>>> file from a directory that has already been deleted. This requires
>>>>> some knowledge of the environment. Is there an application log?
>>>>>>
>>>>>> and the more "worrying":
>>>>>>     kernel: INFO: task httpd:3488 blocked for more than 120 seconds.
>>>>>>     kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs"
>>>>>>     disables this message.
>>>>>>     kernel: httpd D c6fe5d74 0 3488 1616 0x00000080
>>>>>>     kernel: c6fe5e04 00000082 00000000 c6fe5d74 c6fe5d74
>>>>>>             000041fd c6fe5d88 c0439b18
>>>>>>     kernel: c0b976c0 c0b976c0 c0b976c0 c0b976c0 ed0f0ac0
>>>>>>             c6fe5de8 c0b976c0 f75ac6c0
>>>>>>     kernel: f2f0cd60 c0a95060 00000001 c6fe5dbc c0874b8d
>>>>>>             c6fe5de8 f8fd9a86 00000001
>>>>>>     kernel: Call Trace:
>>>>>>     kernel: [<c0439b18>] ? default_spin_lock_flags+0x8/0x10
>>>>>>     kernel: [<c0874b8d>] ? _raw_spin_lock+0xd/0x10
>>>>>>     kernel: [<f8fd9a86>] ? ocfs2_dentry_revalidate+0xc6/0x2d0 [ocfs2]
>>>>>>     kernel: [<f8ff17be>] ? ocfs2_permission+0xfe/0x110 [ocfs2]
>>>>>>     kernel: [<f905b6f0>] ? ocfs2_acl_chmod+0xd0/0xd0 [ocfs2]
>>>>>>     kernel: [<c0873105>] schedule+0x35/0x50
>>>>>>     kernel: [<c0873b2e>] __mutex_lock_slowpath+0xbe/0x120
>>>>>>     ....
>>>>> The important part of the bt is cut off. Where is the rest of it?
>>>>> The entries starting with "?" are junk. You can attach
>>>>> /var/log/messages to give us a complete picture. My guess is that
>>>>> the reason for blocking on the mutex for so long is that the thread
>>>>> holding the mutex is blocked on i/o.
>>>>> Run "ps -e -o pid,stat,comm,wchan=WIDE-WCHAN-COLUMN" and look at
>>>>> the 'D' state (uninterruptible sleep) processes. These are
>>>>> processes usually blocked on i/o.
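OK. A small loop like the one below is what we plan to leave running on
the node the next time this happens (just my own sketch of how to log the
'D' state processes you mention; the log path is only an example):

    # record D-state (uninterruptible sleep) processes every 3 seconds
    while true; do
        date
        ps -eo pid,stat,wchan:32,comm | awk '$2 ~ /^D/'
        sleep 3
    done >> /tmp/dstate.log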
>>>>>>
>>>>>> (UNACCEPTABLE) WORKAROUND:
>>>>>>     stop httpd (really slow)
>>>>>>     stop the ocfs2 service (really slow)
>>>>>>     start ocfs2 and httpd
>>>>>>
>>>>>> MORE INFO:
>>>>>> - OS information:
>>>>>>     Oracle Linux 6.4 32bit
>>>>>>     4GB RAM
>>>>>>     uname -a: 2.6.39-400.109.6.el6uek.i686 #1 SMP Wed Aug 28
>>>>>>     09:55:10 PDT 2013 i686 i686 i386 GNU/Linux
>>>>>>     * anyway: we have another 5-node cluster with Oracle Linux 7.1
>>>>>>       (so a 64bit OS) serving a newer version of the same
>>>>>>       application and the problems are similar, so it appears not
>>>>>>       to be an OS problem but a more specific OCFS2 problem (bug?
>>>>>>       some tuning? other?)
>>>>>>
>>>>>> - standard configuration
>>>>>>     * if you want I can show the cluster.conf configuration, but
>>>>>>       it is the "expected" configuration
>>>>>>
>>>>>> - standard configuration in o2cb:
>>>>>>     Driver for "configfs": Loaded
>>>>>>     Filesystem "configfs": Mounted
>>>>>>     Stack glue driver: Loaded
>>>>>>     Stack plugin "o2cb": Loaded
>>>>>>     Driver for "ocfs2_dlmfs": Loaded
>>>>>>     Filesystem "ocfs2_dlmfs": Mounted
>>>>>>     Checking O2CB cluster "MoodleOCFS2": Online
>>>>>>     Heartbeat dead threshold: 31
>>>>>>     Network idle timeout: 30000
>>>>>>     Network keepalive delay: 2000
>>>>>>     Network reconnect delay: 2000
>>>>>>     Heartbeat mode: Local
>>>>>>     Checking O2CB heartbeat: Active
>>>>>>
>>>>>> - mount options: _netdev,rw,noatime
>>>>>>     * so the other options (commit, data, ...) have their default
>>>>>>       values
>>>>>>
>>>>>> Any ideas/suggestions?
>>>>>>
>>>>>> Regards.

_______________________________________________
Ocfs2-users mailing list
Ocfs2-users@oss.oracle.com
https://oss.oracle.com/mailman/listinfo/ocfs2-users