Hi Sage,
I am trying to reproduce the hang with the latest client and servers.
I am able to start the servers; however, the mount fails with an input/output error (errno 5).
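For reference, the mount command I use is along these lines (the mount point is just an example):
$ mount -t ceph 10.55.147.70:6789:/ /mnt/ceph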
dmesg shows the following:
[17008.244739] ceph: loaded 0.18.0 (mon/mds/osd proto 15/30/22)
[17015.888143] ceph: mon0 10.55.147.70:6789 connection failed
[17025.880170] ceph: mon0 10.55.147.70:6789 connection failed
[17035.880121] ceph: mon0 10.55.147.70:6789 connection failed
[17045.880189] ceph: mon0 10.55.147.70:6789 connection failed
[17055.880130] ceph: mon0 10.55.147.70:6789 connection failed
[17065.880113] ceph: mon0 10.55.147.70:6789 connection failed
[17075.880170] ceph: mon0 10.55.147.70:6789 connection failed
The server is reachable, as the following command output shows:
$ nc 10.55.147.83 6789
ceph v027
I started running the experiments with ceph 0.18, using the configuration where
clients and servers run on separate nodes (sketched below). It turns out that the
performance is extremely bad, and in the dmesg trace I see ceph-related faults
(a partial trace is included at the end of this message).
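The daemon layout in my ceph.conf looks roughly like this (host names are
placeholders, and I am quoting the section names from memory, so they may
not match my file exactly):

[mon0]
        host = node1
        mon addr = 10.55.147.70:6789
[mds0]
        host = node1
[osd0]
        host = node2
[osd1]
        host = node3
[osd2]
        host = node4
; the clients mount from separate nodes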
Any suggestions on how to proceed are more than welcome.
Thanks,
Roman
-----Original Message-----
From: Sage Weil [mailto:s...@newdream.net]
Sent: Wednesday, February 10, 2010 11:39 PM
To: Talyansky, Roman
Cc: ceph-devel@lists.sourceforge.net
Subject: Re: [ceph-devel] Write operation is stuck
Hi Roman,
On Wed, 10 Feb 2010, Talyansky, Roman wrote:
> Hello,
>
> Recently I ran three application instances simultaneously over a mounted
> CEPH file system and one of them got stuck calling a write operation.
> I had the following CEPH configuration:
> - The nodes run Debian (lenny, unstable)
> - Three nodes with osd servers
> - Three client nodes
> - One of the three client nodes was collocated with an osd server.
>
> Could the problem be caused by the client being collocated with an osd server?
The collocated client+osd can theoretically cause problems when you run
out of memory (writing back the client's dirty pages may require the local
osd to allocate memory, so the two can deadlock), but it doesn't sound like
that's the case here.
> Can you help me to resolve this issue?
I assume the OSDs and MDS are all still running?
We fixed a number of bugs recently with multiple clients interacting with
the same files. Is the hang reproducible? Can you try it with the latest
unstable client and servers? Or enable mds debug logging and post that
somewhere (debug mds = 20, debug ms = 1)?
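For example, in the [mds] section of your ceph.conf:

[mds]
        debug mds = 20
        debug ms = 1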
Thanks-
sage
[112691.516538] general protection fault: 0000 [#73] SMP
[112691.520517] last sysfs file: /sys/devices/virtual/net/lo/operstate
[112691.520517] CPU 1
[112691.520517] Modules linked in: ceph crc32c libcrc32c nfs lockd fscache nfs_acl auth_rpcgss sunrpc autofs4 ext4 jbd2 crc16 loop parport_pc fschmd i2c_i801 parport i2c_core snd_hda_codec_realtek evdev tpm_infineon snd_hda_intel psmouse serio_raw tpm snd_hda_codec snd_pcsp snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc container tpm_bios processor ext3 jbd mbcache sg sd_mod crc_t10dif sr_mod cdrom ide_pci_generic ide_core ata_generic uhci_hcd floppy ata_piix button e1000e intel_agp agpgart libata ehci_hcd scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[112691.520517] Pid: 3780, comm: ioplayer2 Tainted: G D 2.6.32-trunk-amd64 #1 ESPRIMO P5925
[112691.520517] RIP: 0010:[<ffffffffa03a6f16>] [<ffffffffa03a6f16>] zero_user_segment+0x62/0x75 [ceph]
[112691.520517] RSP: 0018:ffff880037861c88 EFLAGS: 00010246
[112691.520517] RAX: 0000000000000000 RBX: 00000000fffa8d87 RCX: 0000000000001000
[112691.520517] RDX: 6db6db6db6db6db7 RSI: 0000000000000000 RDI: 76f19732eb7bc000
[112691.520517] RBP: 0000000000001000 R08: 3120393532383120 R09: ffffffff814390b0
[112691.520517] R10: ffff88010b157800 R11: ffff8800d6802000 R12: 00000000fffa8d86
[112691.520517] R13: 0000000000002000 R14: ffff88010a918b10 R15: ffff88010a918b10
[112691.520517] FS: 00007f5be27fc910(0000) GS:ffff880005100000(0000) knlGS:0000000000000000
[112691.520517] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112691.520517] CR2: 00007f4bd3a9ca90 CR3: 0000000109461000 CR4: 00000000000006e0
[112691.520517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[112691.520517] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[112691.520517] Process ioplayer2 (pid: 3780, threadinfo ffff880037860000, task ffff880037945bd0)
[112691.520517] Stack:
[112691.520517] 0000010000000230 ffffffffa03a6f93 0000000000000000 00000001a8d86000
[112691.520517] <0> 0000000000000001 0000000000000000 00000001a8d86000 ffffffffa03a7bf7
[112691.520517] <0> ffff880000000000 ffffffffffffffff ffff88010a918b10 ffff880100000002
[112691.520517] Call Trace:
[112691.520517] [<ffffffffa03a6f93>] ? zero_page_vector_range+0x6a/0xa5 [ceph]
[112691.520517] [<ffffffffa03a7bf7>] ? ceph_aio_read+0x33d/0x4aa [ceph]
[112691.520517] [<ffffffff810ebf01>] ? do_sync_read+0xce/0x113
[112691.520517] [<ffffffff810676d4>] ? hrtimer_try_to_cancel+0x3a/0x43
[112691.520517] [<ffffffff81064aae>] ? autoremove_wake_function+0x0/0x2e
[112691.520517] [<ffffffff810676e9>] ? hrtimer_cancel+0xc/0x16
[112691.520517] [<ffffffff812e62aa>] ? do_nanosleep+0x6d/0xa3
[112691.520517] [<ffffffff8103aa9a>] ? pick_next_task+0x24/0x3f
[112691.520517] [<ffffffff810ec94a>] ? vfs_read+0xa6/0xff
[112691.520517] [<ffffffff810eca5f>] ? sys_read+0x45/0x6e
[112691.520517] [<ffffffff81010b02>] ? system_call_fastpath+0x16/0x1b
[112691.520517] Code: b6 6d db b6 6d 49 8d 04 00 89 f7 29 f1 48 c1 f8 03 48 0f af c2 48 c1 e0 0c 48 01 c7 48 b8 00 00 00 00 00 88 ff ff 48 01 c7 31 c0 <f3> aa 65 48 8b 04 25 c8 cb 00 00 ff 88 44 e0 ff ff 59 c3 41 56
[112691.520517] RIP [<ffffffffa03a6f16>] zero_user_segment+0x62/0x75 [ceph]
[112691.520517] RSP <ffff880037861c88>
[112691.871238] ---[ end trace 09486983a8cdbe04 ]---
[112691.876331] note: ioplayer2[3780] exited with preempt_count 1
[112691.881409] BUG: scheduling while atomic: ioplayer2/3780/0x10000001
[112691.886452] Modules linked in: ceph crc32c libcrc32c nfs lockd fscache nfs_acl auth_rpcgss sunrpc autofs4 ext4 jbd2 crc16 loop parport_pc fschmd i2c_i801 parport i2c_core snd_hda_codec_realtek evdev tpm_infineon snd_hda_intel psmouse serio_raw tpm snd_hda_codec snd_pcsp snd_hwdep snd_pcm snd_timer snd soundcore snd_page_alloc container tpm_bios processor ext3 jbd mbcache sg sd_mod crc_t10dif sr_mod cdrom ide_pci_generic ide_core ata_generic uhci_hcd floppy ata_piix button e1000e intel_agp agpgart libata ehci_hcd scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[112691.928434] Pid: 3780, comm: ioplayer2 Tainted: G D 2.6.32-trunk-amd64 #1
[112691.938802] Call Trace: