Hi Sage,

I am trying to reproduce the hang with the latest client and servers.
I am able to start the servers, but the mount fails with input/output error 5 (EIO).
dmesg shows the following:

[17008.244739] ceph: loaded 0.18.0 (mon/mds/osd proto 15/30/22)
[17015.888143] ceph: mon0 10.55.147.70:6789 connection failed
[17025.880170] ceph: mon0 10.55.147.70:6789 connection failed
[17035.880121] ceph: mon0 10.55.147.70:6789 connection failed
[17045.880189] ceph: mon0 10.55.147.70:6789 connection failed
[17055.880130] ceph: mon0 10.55.147.70:6789 connection failed
[17065.880113] ceph: mon0 10.55.147.70:6789 connection failed
[17075.880170] ceph: mon0 10.55.147.70:6789 connection failed

The server is reachable, as the following command output shows:

$ nc 10.55.147.83 6789
ceph v027
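
For completeness, a sketch of the commands involved (the mount point
/mnt/ceph is only a placeholder for my actual mount point; the monitor
address is the one dmesg reports for mon0):

$ nc 10.55.147.70 6789                          # probe the monitor the client is dialing
$ mount -t ceph 10.55.147.70:6789:/ /mnt/ceph   # kernel client mount
$ dmesg | tail                                  # look for "connection failed" from the ceph module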

I started running the experiments with ceph 0.18 using a configuration where
clients and servers run on separate nodes. It turns out that performance is
extremely bad. Looking at the dmesg trace I see ceph-related faults (a partial
trace is attached below).

Any suggestions on how to proceed are more than welcome.

Thanks,
Roman

-----Original Message-----
From: Sage Weil [mailto:s...@newdream.net] 
Sent: Wednesday, February 10, 2010 11:39 PM
To: Talyansky, Roman
Cc: ceph-devel@lists.sourceforge.net
Subject: Re: [ceph-devel] Write operation is stuck

Hi Roman,

On Wed, 10 Feb 2010, Talyansky, Roman wrote:

> Hello,
> 
> Recently I ran three application instances simultaneously over a mounted 
> CEPH file system and one of them got stuck calling a write operation.
> I had the following CEPH configuration:
> -       The nodes run Debian (lenny, unstable)
> -       Three nodes with osd servers
> -       Three client nodes
> -       One of the three client nodes ran on the same node as an osd server.
> 
> Can the origin of the problem be the client collocated with an osd server?

The collocated client+osd can theoretically cause problems when you run 
out of memory, but it doesn't sound like that's the case here.

> Can you help me to resolve this issue?

I assume the OSDs and MDS are all still running?
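(One quick check, assuming the standard daemon names for 0.18, is
"ps ax | egrep 'cmon|cmds|cosd'" on each server node.)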

We fixed a number of bugs recently with multiple clients interacting with 
the same files.  Is the hang reproducible?  Can you try it with the latest 
unstable client and servers?  Or, enable mds debug logging and post that 
somewhere (debug mds = 20, debug ms = 1)?
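
For example, a minimal ceph.conf excerpt along these lines (section placement
is just a guess at your layout):

[mds]
        debug mds = 20   ; verbose mds logging
        debug ms = 1     ; messenger-level logging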

Thanks-
sage
[112691.516538] general protection fault: 0000 [#73] SMP
[112691.520517] last sysfs file: /sys/devices/virtual/net/lo/operstate
[112691.520517] CPU 1
[112691.520517] Modules linked in: ceph crc32c libcrc32c nfs lockd fscache
nfs_acl auth_rpcgss sunrpc autofs4 ext4 jbd2 crc16 loop parport_pc fschmd
i2c_i801 parport i2c_core snd_hda_codec_realtek evdev tpm_infineon
snd_hda_intel psmouse serio_raw tpm snd_hda_codec snd_pcsp snd_hwdep snd_pcm
snd_timer snd soundcore snd_page_alloc container tpm_bios processor ext3 jbd
mbcache sg sd_mod crc_t10dif sr_mod cdrom ide_pci_generic ide_core ata_generic
uhci_hcd floppy ata_piix button e1000e intel_agp agpgart libata ehci_hcd
scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[112691.520517] Pid: 3780, comm: ioplayer2 Tainted: G      D    2.6.32-trunk-amd64 #1 ESPRIMO P5925
[112691.520517] RIP: 0010:[<ffffffffa03a6f16>]  [<ffffffffa03a6f16>] zero_user_segment+0x62/0x75 [ceph]
[112691.520517] RSP: 0018:ffff880037861c88  EFLAGS: 00010246
[112691.520517] RAX: 0000000000000000 RBX: 00000000fffa8d87 RCX: 0000000000001000
[112691.520517] RDX: 6db6db6db6db6db7 RSI: 0000000000000000 RDI: 76f19732eb7bc000
[112691.520517] RBP: 0000000000001000 R08: 3120393532383120 R09: ffffffff814390b0
[112691.520517] R10: ffff88010b157800 R11: ffff8800d6802000 R12: 00000000fffa8d86
[112691.520517] R13: 0000000000002000 R14: ffff88010a918b10 R15: ffff88010a918b10
[112691.520517] FS:  00007f5be27fc910(0000) GS:ffff880005100000(0000) knlGS:0000000000000000
[112691.520517] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[112691.520517] CR2: 00007f4bd3a9ca90 CR3: 0000000109461000 CR4: 00000000000006e0
[112691.520517] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[112691.520517] DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
[112691.520517] Process ioplayer2 (pid: 3780, threadinfo ffff880037860000, task ffff880037945bd0)
[112691.520517] Stack:
[112691.520517]  0000010000000230 ffffffffa03a6f93 0000000000000000 00000001a8d86000
[112691.520517] <0> 0000000000000001 0000000000000000 00000001a8d86000 ffffffffa03a7bf7
[112691.520517] <0> ffff880000000000 ffffffffffffffff ffff88010a918b10 ffff880100000002
[112691.520517] Call Trace:
[112691.520517]  [<ffffffffa03a6f93>] ? zero_page_vector_range+0x6a/0xa5 [ceph]
[112691.520517]  [<ffffffffa03a7bf7>] ? ceph_aio_read+0x33d/0x4aa [ceph]
[112691.520517]  [<ffffffff810ebf01>] ? do_sync_read+0xce/0x113
[112691.520517]  [<ffffffff810676d4>] ? hrtimer_try_to_cancel+0x3a/0x43
[112691.520517]  [<ffffffff81064aae>] ? autoremove_wake_function+0x0/0x2e
[112691.520517]  [<ffffffff810676e9>] ? hrtimer_cancel+0xc/0x16
[112691.520517]  [<ffffffff812e62aa>] ? do_nanosleep+0x6d/0xa3
[112691.520517]  [<ffffffff8103aa9a>] ? pick_next_task+0x24/0x3f
[112691.520517]  [<ffffffff810ec94a>] ? vfs_read+0xa6/0xff
[112691.520517]  [<ffffffff810eca5f>] ? sys_read+0x45/0x6e
[112691.520517]  [<ffffffff81010b02>] ? system_call_fastpath+0x16/0x1b
[112691.520517] Code: b6 6d db b6 6d 49 8d 04 00 89 f7 29 f1 48 c1 f8 03 48 0f af c2 48 c1 e0 0c 48 01 c7 48 b8 00 00 00 00 00 88 ff ff 48 01 c7 31 c0 <f3> aa 65 48 8b 04 25 c8 cb 00 00 ff 88 44 e0 ff ff 59 c3 41 56
[112691.520517] RIP  [<ffffffffa03a6f16>] zero_user_segment+0x62/0x75 [ceph]
[112691.520517]  RSP <ffff880037861c88>
[112691.871238] ---[ end trace 09486983a8cdbe04 ]---
[112691.876331] note: ioplayer2[3780] exited with preempt_count 1
[112691.881409] BUG: scheduling while atomic: ioplayer2/3780/0x10000001
[112691.886452] Modules linked in: ceph crc32c libcrc32c nfs lockd fscache
nfs_acl auth_rpcgss sunrpc autofs4 ext4 jbd2 crc16 loop parport_pc fschmd
i2c_i801 parport i2c_core snd_hda_codec_realtek evdev tpm_infineon
snd_hda_intel psmouse serio_raw tpm snd_hda_codec snd_pcsp snd_hwdep snd_pcm
snd_timer snd soundcore snd_page_alloc container tpm_bios processor ext3 jbd
mbcache sg sd_mod crc_t10dif sr_mod cdrom ide_pci_generic ide_core ata_generic
uhci_hcd floppy ata_piix button e1000e intel_agp agpgart libata ehci_hcd
scsi_mod usbcore nls_base thermal fan thermal_sys [last unloaded: scsi_wait_scan]
[112691.928434] Pid: 3780, comm: ioplayer2 Tainted: G      D    2.6.32-trunk-amd64 #1
[112691.938802] Call Trace: