Re: [Ocfs2-users] input / out error on some nodes

Eric Ren Sun, 25 Oct 2015 19:44:29 -0700

Hi,

On 10/22/15 21:00, gjprabu wrote:

Hi Eric,
Thanks for your reply. We are still facing the same issue. We found these dmesg logs, but they are expected: we took node1 down and brought it back up ourselves, and that is what the logs show. Other than that we did not find any error message. We also have a problem while unmounting: the umount process goes into the "D" state, and fsck fails with "fsck.ocfs2: I/O error". If you need me to run any other command, please let me know.

1. System logs across boots
# journalctl --list-boots

If there is just one boot record, please see "man journald.conf" to configure journald to keep system logs across boots.

Then you can use "journalctl -b xxx" to see the system log of any specific boot.

I can't see exactly which steps lead to that error message. It would be better to reproduce your problem starting from a clean state.
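
A minimal sketch of how logs can be kept across boots, assuming a systemd-journald setup that reads drop-in files from /etc/systemd/journald.conf.d/. The function name and the file name persistent.conf are made up for illustration; Storage=persistent is the journald.conf option that makes the journal survive a reboot.

```shell
# Write a journald drop-in that enables persistent log storage.
# $1: configuration root ("/" on a real system, a scratch directory for a dry run).
# The drop-in file name "persistent.conf" is an arbitrary choice.
write_persistent_journal_conf() {
    dir="$1/etc/systemd/journald.conf.d"
    mkdir -p "$dir" &&
        printf '[Journal]\nStorage=persistent\n' > "$dir/persistent.conf"
}

# Typical use, as root (not executed here):
#   write_persistent_journal_conf /
#   systemctl restart systemd-journald
#   journalctl --list-boots     # should list more than one boot after the next reboot
#   journalctl -b -1            # log of the previous boot
```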

2. The umount issue may be caused by the cluster being in a bad state, with communication between nodes hung.


3. Please run fsck.ocfs2 on the device instead of the mount point.

4. Did you build the CEPH RBD setup on an OCFS2 cluster that was in good condition? It's better to test thoroughly that the cluster is healthy before working on it.
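
For point 3, a small sketch of how the device behind a mount point could be looked up before calling fsck.ocfs2. The function name is hypothetical, and it assumes the volume is listed in /proc/mounts (or in a file of the same format passed as the second argument).

```shell
# Print the block device backing a mount point, read from a mounts table
# in /proc/mounts format (second argument, defaults to /proc/mounts).
# Returns non-zero if the mount point is not listed.
device_for_mount() {
    mp=${1%/}               # tolerate a trailing slash, as used in the thread
    [ -n "$mp" ] || mp=/    # but keep "/" itself intact
    awk -v mp="$mp" '
        $2 == mp { dev = $1 }          # the last matching entry wins
        END { if (dev != "") print dev; else exit 1 }
    ' "${2:-/proc/mounts}"
}

# Typical use (not executed here); -n makes fsck.ocfs2 check read-only:
#   dev=$(device_for_mount /home/build/downloads) && fsck.ocfs2 -fn "$dev"
```

A repairing run (-y) is normally done with the volume unmounted on every node, so the read-only check is the safer first step.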


Thanks,
Eric

*ocfs2 version*
debugfs.ocfs2 1.8.0

*# cat /etc/sysconfig/o2cb*
#
# This is a configuration file for automatic startup of the O2CB
# driver.  It is generated by running /etc/init.d/o2cb configure.
# On Debian based systems the preferred method is running
# 'dpkg-reconfigure ocfs2-tools'.
#

# O2CB_STACK: The name of the cluster stack backing O2CB.
O2CB_STACK=o2cb

# O2CB_BOOTCLUSTER: If not empty, the name of a cluster to start.
O2CB_BOOTCLUSTER=ocfs2

# O2CB_HEARTBEAT_THRESHOLD: Iterations before a node is considered dead.
O2CB_HEARTBEAT_THRESHOLD=31
# O2CB_IDLE_TIMEOUT_MS: Time in ms before a network connection is considered dead.
O2CB_IDLE_TIMEOUT_MS=30000
# O2CB_KEEPALIVE_DELAY_MS: Max time in ms before a keepalive packet is sent
O2CB_KEEPALIVE_DELAY_MS=2000

# O2CB_RECONNECT_DELAY_MS: Min time in ms between connection attempts
O2CB_RECONNECT_DELAY_MS=2000
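
For reference, the values above can be converted to seconds as sketched below. This assumes the relation given in the OCFS2 documentation, where the disk heartbeat timeout is (O2CB_HEARTBEAT_THRESHOLD - 1) * 2 seconds; the function name is made up for illustration.

```shell
# Disk heartbeat timeout implied by O2CB_HEARTBEAT_THRESHOLD, assuming
# one heartbeat iteration every 2 seconds: (threshold - 1) * 2.
hb_timeout_secs() {
    echo $(( ($1 - 1) * 2 ))
}

hb_timeout_secs 31          # prints 60: seconds before a node is declared dead
echo $(( 30000 / 1000 ))    # prints 30: O2CB_IDLE_TIMEOUT_MS in seconds
```

With these settings the network idle timeout (30 s) expires well before the disk heartbeat timeout (60 s).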

*# fsck.ocfs2 -fy /home/build/downloads/*
fsck.ocfs2 1.8.0
fsck.ocfs2: I/O error on channel while opening "/zoho/build/downloads/"

_*dmesg logs*_
[ 4229.886284] o2dlm: Joining domain A895BC216BE641A8A7E20AA89D57E051 ( 5 ) 1 nodes
[ 4251.437451] o2dlm: Node 3 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 3 5 ) 2 nodes
[ 4267.836392] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 3 5 ) 3 nodes
[ 4292.755589] o2dlm: Node 2 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 5 ) 4 nodes
[ 4306.262165] o2dlm: Node 4 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
[316476.505401] (kworker/u192:0,95923,0):dlm_do_assert_master:1717 ERROR: Error -112 when sending message 502 (key 0xc3460ae7) to node 1
[316476.505470] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
[316480.437231] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316480.442389] o2cb: o2dlm has evicted node 1 from domain A895BC216BE641A8A7E20AA89D57E051
[316480.442412] (kworker/u192:0,95923,20):dlm_begin_reco_handler:2765 A895BC216BE641A8A7E20AA89D57E051: dead_node previously set to 1, node 3 changing it to 1
[316480.541237] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316480.541241] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316485.542733] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316485.542740] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316485.542742] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316490.544535] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316490.544538] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316490.544539] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316495.546356] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316495.546362] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316495.546364] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316500.548135] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316500.548139] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316500.548140] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316505.549947] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316505.549951] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316505.549952] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316510.551734] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316510.551739] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316510.551740] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316515.553543] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316515.553547] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316515.553548] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316520.555337] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316520.555341] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316520.555343] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316525.557131] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316525.557136] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316525.557153] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316530.558952] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316530.558955] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316530.558957] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[316535.560781] o2dlm: Begin recovery on domain A895BC216BE641A8A7E20AA89D57E051 for node 1
[316535.560789] o2dlm: Node 3 (he) is the Recovery Master for the dead node 1 in domain A895BC216BE641A8A7E20AA89D57E051
[316535.560792] o2dlm: End recovery on domain A895BC216BE641A8A7E20AA89D57E051
[319419.525609] o2dlm: Node 1 joins domain A895BC216BE641A8A7E20AA89D57E051 ( 1 2 3 4 5 ) 5 nodes
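
The log above shows recovery for node 1 restarting roughly every 5 seconds after the -112 (EHOSTDOWN, "Host is down") send error. A rough sketch for summarising such a loop from kernel log text on stdin follows; the function name is hypothetical, and the pattern only assumes the "Begin recovery on domain ... for node N" wording seen in this thread.

```shell
# Count o2dlm recovery cycles per dead node from kernel log text on stdin.
# grep -o extracts each message even when several entries share one line.
count_recoveries() {
    grep -oE 'Begin recovery on domain ?[0-9A-F]+ for node [0-9]+' |
        awk '{ n[$NF]++ }
             END { for (k in n) printf "node %s: %d recovery cycles\n", k, n[k] }'
}

# Typical use (not executed here):
#   dmesg | count_recoveries
```

Many cycles for the same node in a short time would fit Eric's point 2 about hung communication between nodes, though the count alone does not prove it.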
*ps -auxxxxx | grep umount*
root 32083 21.8 0.0 125620 2828 pts/14 D+ 19:37 0:18 umount /home/build/repository
root 32196 0.0 0.0 112652 2264 pts/8 S+ 19:38 0:00 grep --color=auto umount
*cat /proc/32083/stack*
[<ffffffff8132ad7d>] o2net_send_message_vec+0x71d/0xb00
[<ffffffff81352148>] dlm_send_remote_unlock_request.isra.2+0x128/0x410
[<ffffffff813527db>] dlmunlock_common+0x3ab/0x9e0
[<ffffffff81353088>] dlmunlock+0x278/0x800
[<ffffffff8131f765>] o2cb_dlm_unlock+0x35/0x50
[<ffffffff8131ecfe>] ocfs2_dlm_unlock+0x1e/0x30
[<ffffffff812a8776>] ocfs2_drop_lock.isra.29.part.30+0x1f6/0x700
[<ffffffff812ae40d>] ocfs2_simple_drop_lockres+0x2d/0x40
[<ffffffff8129b43c>] ocfs2_dentry_lock_put+0x5c/0x80
[<ffffffff8129b4a2>] ocfs2_dentry_iput+0x42/0x1d0
[<ffffffff81204dc2>] __dentry_kill+0x102/0x1f0
[<ffffffff81205294>] shrink_dentry_list+0xe4/0x2a0
[<ffffffff81205aa8>] shrink_dcache_parent+0x38/0x90
[<ffffffff81205b16>] do_one_tree+0x16/0x50
[<ffffffff81206e9f>] shrink_dcache_for_umount+0x2f/0x90
[<ffffffff811efb15>] generic_shutdown_super+0x25/0x100
[<ffffffff811eff57>] kill_block_super+0x27/0x70
[<ffffffff811f02a9>] deactivate_locked_super+0x49/0x60
[<ffffffff811f089e>] deactivate_super+0x4e/0x70
[<ffffffff8120da83>] cleanup_mnt+0x43/0x90
[<ffffffff8120db22>] __cleanup_mnt+0x12/0x20
[<ffffffff81093ba4>] task_work_run+0xc4/0xe0
[<ffffffff81013c67>] do_notify_resume+0x97/0xb0
[<ffffffff817d2ee7>] int_signal+0x12/0x17
[<ffffffffffffffff>] 0xffffffffffffffff

Regards
Prabu
---- On Wed, 21 Oct 2015 08:32:15 +0530 *Eric Ren <z...@suse.com>* wrote ----
    Hi Prabu,

    I guess others, like me, are not familiar with this case, which
    combines CEPH RBD and OCFS2.

    We'd really like to help you, but I think the ocfs2 developers cannot
    get any information about what happened to ocfs2 from your description.

    So I'm wondering if you can reproduce it and tell us the steps. Once
    developers can reproduce it, it's likely to be resolved ;-) BTW, any
    dmesg log about ocfs2, especially the initial error message and stack
    backtrace, will be helpful!

    Thanks,
    Eric

    On 10/20/15 17:29, gjprabu wrote:

        Hi

                We are looking forward to your input on this.

        Regards
        Prabu

        ---- On Fri, 09 Oct 2015 12:08:19 +0530 *gjprabu
        <gjpr...@zohocorp.com>* wrote ----





                Hi All,

                         Could anybody please help me with this issue?

                Regards
                Prabu




                ---- On Thu, 08 Oct 2015 12:33:57 +0530 *gjprabu
                <gjpr...@zohocorp.com>* wrote ----



                    Hi All,

                    We have servers with OCFS2 mounted on CEPH RBD. We
                    are seeing I/O errors on several nodes
                    simultaneously when moving data within the same
                    disk (copying does not cause any problem). As a
                    temporary workaround we remount the partition and
                    the issue is resolved, but after some time the
                    problem reappears. If anybody has faced the same
                    issue, please help us.

                    Note: We have 5 nodes in total. Two nodes are
                    working fine; the other nodes show input/output
                    errors like the ones below.

                    ls -althr
                    ls: cannot access LITE_3_0_M4_1_TEST: Input/output error
                    ls: cannot access LITE_3_0_M4_1_OLD: Input/output error
                    total 0
                    d????????? ? ? ? ? ? LITE_3_0_M4_1_TEST
                    d????????? ? ? ? ? ? LITE_3_0_M4_1_OLD

                    cluster:
                           node_count=5
                           heartbeat_mode = local
                           name=ocfs2

                    node:
                            ip_port = 7777
                            ip_address = 192.168.113.42
                            number = 1
                            name = integ-hm9
                            cluster = ocfs2

                    node:
                            ip_port = 7777
                            ip_address = 192.168.112.115
                            number = 2
                            name = integ-hm2
                            cluster = ocfs2

                    node:
                            ip_port = 7777
                            ip_address = 192.168.113.43
                            number = 3
                            name = integ-ci-1
                            cluster = ocfs2
                    node:
                            ip_port = 7777
                            ip_address = 192.168.112.217
                            number = 4
                            name = integ-hm8
                            cluster = ocfs2
                    node:
                            ip_port = 7777
                            ip_address = 192.168.112.192
                            number = 5
                            name = integ-hm5
                            cluster = ocfs2


                    Regards
                    Prabu



                    _______________________________________________
                    Ocfs2-users mailing list
                    Ocfs2-users@oss.oracle.com
                    <mailto:Ocfs2-users@oss.oracle.com>
                    https://oss.oracle.com/mailman/listinfo/ocfs2-users




