Re: Issue with Ceph File System and LIO
On Thu, Dec 17, 2015 at 4:56 PM, Eric Eastman <eric.east...@keepertech.com> wrote: > I patched the 4.4rc4 kernel source and restarted the test. Shortly > after starting it, this showed up in dmesg: > > [Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at > fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]() > [Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod > vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop > target_core_file target_core_iblock target_core_pscsi target_core_user > target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core > ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm > ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit > i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt > ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter > irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph > lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel > ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes > scsi_transport_sas [last unloaded: target_core_mod] > [Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G >W I 4.4.0-rc4-ede1 #1 > [Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS > P64 01/22/2015 > [Thu Dec 17 03:29:55 2015] c020cd47 8805f1e97958 > 813ad644 > [Thu Dec 17 03:29:55 2015] 8805f1e97990 81079702 > 8805f1e97a50 015dd000 > [Thu Dec 17 03:29:55 2015] 880c034df800 0200 > eab26a80 8805f1e979a0 > [Thu Dec 17 03:29:55 2015] Call Trace: > [Thu Dec 17 03:29:55 2015] [] dump_stack+0x44/0x60 > [Thu Dec 17 03:29:55 2015] [] > warn_slowpath_common+0x82/0xc0 > [Thu Dec 17 03:29:55 2015] [] warn_slowpath_null+0x1a/0x20 > [Thu Dec 17 03:29:55 2015] [] > ceph_write_begin+0xfb/0x120 [ceph] > [Thu Dec 17 03:29:55 2015] [] > generic_perform_write+0xbf/0x1a0 > [Thu Dec 17 03:29:55 2015] [] > ceph_write_iter+0xf5c/0x1010 [ceph] > [Thu Dec 17 03:29:55 2015] [] ? 
__enqueue_entity+0x6c/0x70 > [Thu Dec 17 03:29:55 2015] [] ? > iov_iter_get_pages+0x113/0x210 > [Thu Dec 17 03:29:55 2015] [] ? > skb_copy_datagram_iter+0x122/0x250 > [Thu Dec 17 03:29:55 2015] [] vfs_iter_write+0x63/0xa0 > [Thu Dec 17 03:29:55 2015] [] > fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file] > [Thu Dec 17 03:29:55 2015] [] > fd_execute_rw+0xc5/0x2a0 [target_core_file] > [Thu Dec 17 03:29:55 2015] [] > sbc_execute_rw+0x22/0x30 [target_core_mod] > [Thu Dec 17 03:29:55 2015] [] > __target_execute_cmd+0x1f/0x70 [target_core_mod] > [Thu Dec 17 03:29:55 2015] [] > target_execute_cmd+0x195/0x2a0 [target_core_mod] > [Thu Dec 17 03:29:55 2015] [] > iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod] > [Thu Dec 17 03:29:55 2015] [] > iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod] > [Thu Dec 17 03:29:55 2015] [] > iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod] > [Thu Dec 17 03:29:55 2015] [] ? __switch_to+0x1cd/0x570 > [Thu Dec 17 03:29:55 2015] [] ? > iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod] > [Thu Dec 17 03:29:55 2015] [] kthread+0xc9/0xe0 > [Thu Dec 17 03:29:55 2015] [] ? > kthread_create_on_node+0x180/0x180 > [Thu Dec 17 03:29:55 2015] [] ret_from_fork+0x3f/0x70 > [Thu Dec 17 03:29:55 2015] [] ? > kthread_create_on_node+0x180/0x180 > [Thu Dec 17 03:29:55 2015] ---[ end trace 382a45986961da4e ]--- Could you please apply the new incremental patch and try again. Regards Yan, Zheng > > There are WARNINGs on both line 125 and 1162. I will attach the > whole set of dmesg output to the tracker ticket 14086 > > I wanted to note that file system snapshots are enabled and being used > on this file system. > > Thanks > Eric > > On Wed, Dec 16, 2015 at 8:15 AM, Eric Eastman > <eric.east...@keepertech.com> wrote: >>>> >>> This warning is really strange. Could you try the attached debug patch. >>> >>> Regards >>> Yan, Zheng >> >> I will try the patch and get back to the list. >> >> Eric cephfs1.patch Description: Binary data
Re: Issue with Ceph File System and LIO
On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman <eric.east...@keepertech.com> wrote: >> Hi Yan Zheng, Eric Eastman >> >> Similar bug was reported in f2fs, btrfs, it does affect 4.4-rc4, the fixing >> patch was merged into 4.4-rc5, dfd01f026058 ("sched/wait: Fix the signal >> handling fix"). >> >> Related report & discussion was here: >> https://lkml.org/lkml/2015/12/12/149 >> >> I'm not sure the current reported issue of ceph was related to that though, >> but at least try testing with an upgraded or patched kernel could verify it. >> :) >> >> Thanks, >> >>> -Original Message- >>> From: ceph-devel-ow...@vger.kernel.org >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of >>> Yan, Zheng >>> Sent: Friday, December 18, 2015 12:05 PM >>> To: Eric Eastman >>> Cc: Ceph Development >>> Subject: Re: Issue with Ceph File System and LIO >>> >>> On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman >>> <eric.east...@keepertech.com> wrote: >>> > With cephfs.patch and cephfs1.patch applied and I am now seeing: >>> > >>> > [Thu Dec 17 14:27:59 2015] [ cut here ] >>> > [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at >>> > fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]() >>> > [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod > ... >>> > >>> >>> The page gets unlocked mystically. I still don't find any clue. Could >>> you please try the new patch (not incremental patch). Besides, please >>> enable CONFIG_DEBUG_VM when compiling the kernel. >>> >>> Thanks you very much >>> Yan, Zheng >> > I have just installed the cephfs_new.patch and have set > CONFIG_DEBUG_VM=y on a new 4.4rc4 kernel and restarted the ESXi iSCSI > test to my Ceph File System gateway. I plan to let it run overnight > and report the status tomorrow. 
> > Let me know if I should move on to 4.4rc5 with or without patches and > with or without CONFIG_DEBUG_VM=y > please try the rc5 kernel without patches and with DEBUG_VM=y Regards Yan, Zheng > Looking at the network traffic stats on my iSCSI gateway, with > CONFIG_DEBUG_VM=y, throughput seems to be down by a factor of at least > 10 compared to my last test without setting CONFIG_DEBUG_VM=y > > Regards, > Eric -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Issue with Ceph File System and LIO
On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman <eric.east...@keepertech.com> wrote: > With cephfs.patch and cephfs1.patch applied and I am now seeing: > > [Thu Dec 17 14:27:59 2015] [ cut here ] > [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at > fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]() > [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod > vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop > target_core_file target_core_iblock target_core_pscsi target_core_user > target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core > ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm > drm_kms_helper drm ipmi_ssif coretemp gpio_ich i2c_algo_bit kvm > fb_sys_fops syscopyarea sysfillrect sysimgblt shpchp input_leds ceph > irqbypass i7core_edac serio_raw hpilo edac_core ipmi_si > ipmi_msghandler 8250_fintek lpc_ich acpi_power_meter libceph mac_hid > libcrc32c fscache bonding lp parport mlx4_en vxlan ip6_udp_tunnel > udp_tunnel ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse > bnx2 fjes scsi_transport_sas [last unloaded: target_core_mod] > [Thu Dec 17 14:27:59 2015] CPU: 0 PID: 3036 Comm: iscsi_trx Tainted: G >W I 4.4.0-rc4-ede2 #1 > [Thu Dec 17 14:27:59 2015] Hardware name: HP ProLiant DL360 G6, BIOS > P64 01/22/2015 > [Thu Dec 17 14:27:59 2015] c02b2e37 880c0289b958 > 813ad644 > [Thu Dec 17 14:27:59 2015] 880c0289b990 81079702 > 880c0289ba50 000846c21000 > [Thu Dec 17 14:27:59 2015] 880c009ea200 1000 > ea00122ed700 880c0289b9a0 > [Thu Dec 17 14:27:59 2015] Call Trace: > [Thu Dec 17 14:27:59 2015] [] dump_stack+0x44/0x60 > [Thu Dec 17 14:27:59 2015] [] > warn_slowpath_common+0x82/0xc0 > [Thu Dec 17 14:27:59 2015] [] warn_slowpath_null+0x1a/0x20 > [Thu Dec 17 14:27:59 2015] [] > ceph_write_begin+0xfb/0x120 [ceph] > [Thu Dec 17 14:27:59 2015] [] > generic_perform_write+0xbf/0x1a0 > [Thu Dec 17 14:27:59 2015] [] > ceph_write_iter+0xf5c/0x1010 [ceph] > [Thu Dec 17 14:27:59 2015] [] ? 
__schedule+0x386/0x9c0 > [Thu Dec 17 14:27:59 2015] [] ? schedule+0x35/0x80 > [Thu Dec 17 14:27:59 2015] [] ? __slab_free+0xb5/0x290 > [Thu Dec 17 14:27:59 2015] [] ? > iov_iter_get_pages+0x113/0x210 > [Thu Dec 17 14:27:59 2015] [] vfs_iter_write+0x63/0xa0 > [Thu Dec 17 14:27:59 2015] [] > fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file] > [Thu Dec 17 14:27:59 2015] [] > fd_execute_rw+0xc5/0x2a0 [target_core_file] > [Thu Dec 17 14:27:59 2015] [] > sbc_execute_rw+0x22/0x30 [target_core_mod] > [Thu Dec 17 14:27:59 2015] [] > __target_execute_cmd+0x1f/0x70 [target_core_mod] > [Thu Dec 17 14:27:59 2015] [] > target_execute_cmd+0x195/0x2a0 [target_core_mod] > [Thu Dec 17 14:27:59 2015] [] > iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod] > [Thu Dec 17 14:27:59 2015] [] > iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod] > [Thu Dec 17 14:27:59 2015] [] > iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod] > [Thu Dec 17 14:27:59 2015] [] ? __switch_to+0x1cd/0x570 > [Thu Dec 17 14:27:59 2015] [] ? > iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod] > [Thu Dec 17 14:27:59 2015] [] kthread+0xc9/0xe0 > [Thu Dec 17 14:27:59 2015] [] ? > kthread_create_on_node+0x180/0x180 > [Thu Dec 17 14:27:59 2015] [] ret_from_fork+0x3f/0x70 > [Thu Dec 17 14:27:59 2015] [] ? > kthread_create_on_node+0x180/0x180 > [Thu Dec 17 14:27:59 2015] ---[ end trace 8346192e3f29ed5d ]--- > The page gets unlocked mysteriously. I still can't find any clue. Could you please try the new patch (not an incremental patch). Also, please enable CONFIG_DEBUG_VM when compiling the kernel. Thank you very much Yan, Zheng cephfs_new.patch Description: Binary data
Re: Issue with Ceph File System and LIO
On Wed, Dec 16, 2015 at 12:51 AM, Eric Eastman <eric.east...@keepertech.com> wrote: > I have opened ticket: 14086 > > On Tue, Dec 15, 2015 at 5:05 AM, Yan, Zheng <uker...@gmail.com> wrote: >> On Tue, Dec 15, 2015 at 2:08 PM, Eric Eastman >>> [Tue Dec 15 00:46:55 2015] [ cut here ] >>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at >>> /home/kernel/COD/linux/fs/ceph/addr.c:125 >> >> could you confirm that addr.c:125 is WARN_ON(!PageLocked(page)); > > I am using the generic kernel from: > http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc4-wily > and assuming they did not change anything, from the 4.4rc4 source tree > I pulled down shows: > > 124 ret = __set_page_dirty_nobuffers(page); > 125 WARN_ON(!PageLocked(page)); > 126 WARN_ON(!page->mapping); > > > modinfo ceph > filename: /lib/modules/4.4.0-040400rc4-generic/kernel/fs/ceph/ceph.ko > license:GPL > description:Ceph filesystem for Linux > author: Patience Warnick <patie...@newdream.net> > author: Yehuda Sadeh <yeh...@hq.newdream.net> > author: Sage Weil <s...@newdream.net> > alias: fs-ceph > srcversion: E94BA78C2D998705FE2C600 > depends:libceph,fscache > intree: Y > vermagic: 4.4.0-040400rc4-generic SMP mod_unload modversions > > This error has shown up about 20 times in 12 hours, since I started > the ESXi test. > This warning is really strange. Could you try the attached debug patch. Regards Yan, Zheng cephfs.patch Description: Binary data
Re: Issue with Ceph File System and LIO
On Tue, Dec 15, 2015 at 2:08 PM, Eric Eastman <eric.east...@keepertech.com> wrote: > I am testing Linux Target SCSI, LIO, with a Ceph File System backstore > and I am seeing this error on my LIO gateway. I am using Ceph v9.2.0 > on a 4.4rc4 Kernel, on Trusty, using a kernel mounted Ceph File > System. A file on the Ceph File System is exported via iSCSI to a > VMware ESXi 5.0 server, and I am seeing this error when doing a lot of > I/O on the ESXi server. Is this a LIO or a Ceph issue? > > [Tue Dec 15 00:46:55 2015] [ cut here ] > [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at > /home/kernel/COD/linux/fs/ceph/addr.c:125 could you confirm that addr.c:125 is WARN_ON(!PageLocked(page)); Regards Yan, Zheng > ceph_set_page_dirty+0x230/0x240 [ceph]() > [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables > x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt > tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock > target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost > qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc > libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper > gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si > sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek > irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac > edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport > mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic > usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last > unloaded: target_core_mod] > [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx > Tainted: GW I 4.4.0-040400rc4-generic #201512061930 > [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS > P64 01/22/2015 > [Tue Dec 15 00:46:55 2015] fdc0ce43 > 880bf38c38c0 813c8ab4 > [Tue Dec 15 00:46:55 2015] 880bf38c38f8 > 8107d772 ea00127a8680 > [Tue Dec 15 00:46:55 2015] 
8804e52c1448 8804e52c15b0 > 8804e52c10f0 0200 > [Tue Dec 15 00:46:55 2015] Call Trace: > [Tue Dec 15 00:46:55 2015] [] dump_stack+0x44/0x60 > [Tue Dec 15 00:46:55 2015] [] > warn_slowpath_common+0x82/0xc0 > [Tue Dec 15 00:46:55 2015] [] warn_slowpath_null+0x1a/0x20 > [Tue Dec 15 00:46:55 2015] [] > ceph_set_page_dirty+0x230/0x240 [ceph] > [Tue Dec 15 00:46:55 2015] [] ? > pagecache_get_page+0x150/0x1c0 > [Tue Dec 15 00:46:55 2015] [] ? > ceph_pool_perm_check+0x48/0x700 [ceph] > [Tue Dec 15 00:46:55 2015] [] set_page_dirty+0x3d/0x70 > [Tue Dec 15 00:46:55 2015] [] > ceph_write_end+0x5e/0x180 [ceph] > [Tue Dec 15 00:46:55 2015] [] ? > iov_iter_copy_from_user_atomic+0x156/0x220 > [Tue Dec 15 00:46:55 2015] [] > generic_perform_write+0x114/0x1c0 > [Tue Dec 15 00:46:55 2015] [] > ceph_write_iter+0xf8a/0x1050 [ceph] > [Tue Dec 15 00:46:55 2015] [] ? > ceph_put_cap_refs+0x143/0x320 [ceph] > [Tue Dec 15 00:46:55 2015] [] ? > check_preempt_wakeup+0xfa/0x220 > [Tue Dec 15 00:46:55 2015] [] ? zone_statistics+0x7c/0xa0 > [Tue Dec 15 00:46:55 2015] [] ? copy_page_to_iter+0x5e/0xa0 > [Tue Dec 15 00:46:55 2015] [] ? > skb_copy_datagram_iter+0x122/0x250 > [Tue Dec 15 00:46:55 2015] [] vfs_iter_write+0x76/0xc0 > [Tue Dec 15 00:46:55 2015] [] > fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file] > [Tue Dec 15 00:46:55 2015] [] > fd_execute_rw+0xc5/0x2a0 [target_core_file] > [Tue Dec 15 00:46:55 2015] [] > sbc_execute_rw+0x22/0x30 [target_core_mod] > [Tue Dec 15 00:46:55 2015] [] > __target_execute_cmd+0x1f/0x70 [target_core_mod] > [Tue Dec 15 00:46:55 2015] [] > target_execute_cmd+0x195/0x2a0 [target_core_mod] > [Tue Dec 15 00:46:55 2015] [] > iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod] > [Tue Dec 15 00:46:55 2015] [] > iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod] > [Tue Dec 15 00:46:55 2015] [] > iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod] > [Tue Dec 15 00:46:55 2015] [] ? __switch_to+0x1dc/0x5a0 > [Tue Dec 15 00:46:55 2015] [] ? 
> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod] > [Tue Dec 15 00:46:55 2015] [] kthread+0xd8/0xf0 > [Tue Dec 15 00:46:55 2015] [] ? > kthread_create_on_node+0x1a0/0x1a0 > [Tue Dec 15 00:46:55 2015] [] ret_from_fork+0x3f/0x70 > [Tue Dec 15 00:46:55 2015] [] ? > kthread_create_on_node+0x1a0/0x1a0 > [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]--- > [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: > 95784927 > [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already > complete, skipping > > If it is a Ceph File System issue
Re: 答复: how to see file object-mappings for cephfuse client
On Mon, Dec 7, 2015 at 1:52 PM, Wuxiangwei <wuxiang...@h3c.com> wrote: > Thanks Yan, what if we wanna see some more specific or detailed information? > E.g. with cephfs we may run 'cephfs /mnt/a.txt show_location --offset' to > find the location of a given offset. > When using the default layout, the object size is 4M and (offset / 4194304) is the suffix of the object name. For example, offset 0 is located in object <ino>.00000000, and offset 4194304 is located in object <ino>.00000001. For fancy layouts, please read http://docs.ceph.com/docs/master/dev/file-striping/ > --- > Wu Xiangwei > Tel : 0571-86760875 > 2014 UIS 2, TEAM BORE > > > -----Original Message----- > From: Yan, Zheng [mailto:uker...@gmail.com] > Sent: December 7, 2015 11:22 > To: wuxiangwei 09660 (RD) > Cc: ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com > Subject: Re: how to see file object-mappings for cephfuse client > > On Mon, Dec 7, 2015 at 10:51 AM, Wuxiangwei <wuxiang...@h3c.com> wrote: >> Hi, Everyone >> >> Recently I'm trying to figure out how to use ceph-fuse. If we mount cephfs >> as the kernel client, there is a 'cephfs' command tool (though it seems to >> be 'deprecated') with 'map' and 'show_location' commands to show the RADOS >> objects belonging to a given file. However, it doesn't work well for >> ceph-fuse, nor can I find a similar tool to see the mappings. Any >> suggestions? >> Thank you! > > Rados objects for a given inode are named in the form <ino>.<index>.
> > To get a file's layout, you can use: getfattr -n ceph.file.layout <path> > > To find which OSDs a given object is stored on, you can use the command: ceph osd map <pool> <object-name> > >> >> >> --- >> Wu Xiangwei >> Tel : 0571-86760875 >> 2014 UIS 2, TEAM BORE
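Yan's arithmetic above can be sketched in a few lines of Python. This is a hypothetical helper, not a tool from the Ceph tree: the inode number is illustrative, and the `<hex-inode>.<8-hex-digit-index>` naming assumes the default layout (4 MB objects, no fancy striping) described in the thread.

```python
# Sketch: map a file offset to the backing RADOS object name under the
# default CephFS layout (object_size = 4 MiB, no striping).
# The inode number used below is an illustrative assumption.

DEFAULT_OBJECT_SIZE = 4 * 1024 * 1024  # 4194304 bytes

def object_name(inode_number, offset, object_size=DEFAULT_OBJECT_SIZE):
    """Return the name of the RADOS object holding `offset` of the file."""
    index = offset // object_size  # which object-sized chunk the offset falls in
    return "%x.%08x" % (inode_number, index)

# Offset 0 lives in the first object, offset 4194304 in the second.
print(object_name(0x10000000abc, 0))        # 10000000abc.00000000
print(object_name(0x10000000abc, 4194304))  # 10000000abc.00000001
```

For a file with a non-default layout, the striping document linked above describes how stripe_unit and stripe_count change this mapping.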
Re: how to see file object-mappings for cephfuse client
On Mon, Dec 7, 2015 at 10:51 AM, Wuxiangwei wrote: > Hi, Everyone > > Recently I'm trying to figure out how to use ceph-fuse. If we mount cephfs as > the kernel client, there is a 'cephfs' command tool (though it seems to be > 'deprecated') with 'map' and 'show_location' commands to show the RADOS > objects belonging to a given file. However, it doesn't work well for > ceph-fuse, nor can I find a similar tool to see the mappings. Any > suggestions? > Thank you! Rados objects for a given inode are named in the form <ino>.<index>. To get a file's layout, you can use: getfattr -n ceph.file.layout <path> To find which OSDs a given object is stored on, you can use the command: ceph osd map <pool> <object-name> > > > --- > Wu Xiangwei > Tel : 0571-86760875 > 2014 UIS 2, TEAM BORE
Re: Compiling for FreeBSD
On Thu, Dec 3, 2015 at 4:52 AM, Willem Jan Withagen <w...@digiware.nl> wrote: > On 2-12-2015 15:13, Yan, Zheng wrote: >> see https://github.com/ceph/ceph/pull/6770. The code can be compiled >> on FreeBSD/OSX, most client programs can connect to ceph servers on >> Linux. > > Hi, > > I do like some of the inline compiler tests. > > I guess that the way the errno's are done like the other OS's have done > as well? > I'd normally solve this with a static array, and just index it. > But perhaps the compiler is smart enough to do the same. > > > I see that you have disabled uuid? > Might I ask why? not disable. Currently ceph uses boost uuid implementation. so no need to link to libuuid. > > I Suggest you have a look at the issue Alan brought up. > Which is a possible fix for doing it the other way around: > Linux clients on a FreeBSD "cluster" > But as Sage suggest: Could be very well solved by fixed brougt in for AIX. > > --WjW > >> Regards >> Yan. Zheng >> >> On Wed, Dec 2, 2015 at 2:43 AM, Willem Jan Withagen <w...@digiware.nl> wrote: >>> On 1-12-2015 19:36, Sage Weil wrote: >>>> >>>> On Tue, 1 Dec 2015, Alan Somers wrote: >>>>> >>>>> On Tue, Dec 1, 2015 at 11:08 AM, Willem Jan Withagen <w...@digiware.nl> >>>>> wrote: >>>>>> >>>>>> On 1-12-2015 18:22, Alan Somers wrote: >>>>>>> >>>>>>> >>>>>>> I did some work porting Ceph to FreeBSD, but got distracted and >>>>>>> stopped about two years ago. You may find this port useful, though it >>>>>>> will probably need to be updated: >>>>>>> >>>>>>> https://people.freebsd.org/~asomers/ports/net/ceph/ >>>>>> >>>>>> >>>>>> >>>>>> I'll chcek that one as well... >>>>>> >>>>>>> Also, there's one major outstanding issue that I know of. It breaks >>>>>>> interoperability between FreeBSD and Linux Ceph nodes. I posted a >>>>>>> patch to fix it, but it doesn't look like it's been merged yet. 
>>>>>>> http://tracker.ceph.com/issues/6636 >>>>>> >>>>>> >>>>>> >>>>>> In the issues I find: >>>>>> >>>>>> Updated by Sage Weil almost 2 years ago >>>>>> >>>>>> Status changed from New to Verified >>>>>> Updated by Sage Weil almost 2 years ago >>>>>> >>>>>> Assignee set to Noah Watkins >>>>>> >>>>>> >>>>>> Probably left at that point because there was no presure to actually >>>>>> commit? >>>>>> >>>>>> --WjW >>>>> >>>>> >>>>> It looks like Sage reviewed the change, but had some comments that >>>>> were mostly style-related. Neither Noah nor I actually got around to >>>>> implementing Sage's suggestions. >>>>> >>>>> https://github.com/ceph/ceph/pull/828 >>>> >>>> >>>> The uuid transition to boost::uuid has happened since then (a few months >>>> back) and I believe Rohan's AIX and Solaris ports for librados (that just >>>> merged) included a fix for the sockaddr_storage issue: >>>> >>>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L180 >>>> >>>> and also >>>> >>>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L160 >>> >>> >>> >>> Would be nice to actually find that this works for FreeBSD as well. >>> But I'm putting this on the watch-list, once I get there. >>> >>> --WjW >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html >> > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Compiling for FreeBSD
see https://github.com/ceph/ceph/pull/6770. The code can be compiled on FreeBSD/OSX, most client programs can connect to ceph servers on Linux. Regards Yan. Zheng On Wed, Dec 2, 2015 at 2:43 AM, Willem Jan Withagen <w...@digiware.nl> wrote: > On 1-12-2015 19:36, Sage Weil wrote: >> >> On Tue, 1 Dec 2015, Alan Somers wrote: >>> >>> On Tue, Dec 1, 2015 at 11:08 AM, Willem Jan Withagen <w...@digiware.nl> >>> wrote: >>>> >>>> On 1-12-2015 18:22, Alan Somers wrote: >>>>> >>>>> >>>>> I did some work porting Ceph to FreeBSD, but got distracted and >>>>> stopped about two years ago. You may find this port useful, though it >>>>> will probably need to be updated: >>>>> >>>>> https://people.freebsd.org/~asomers/ports/net/ceph/ >>>> >>>> >>>> >>>> I'll chcek that one as well... >>>> >>>>> Also, there's one major outstanding issue that I know of. It breaks >>>>> interoperability between FreeBSD and Linux Ceph nodes. I posted a >>>>> patch to fix it, but it doesn't look like it's been merged yet. >>>>> http://tracker.ceph.com/issues/6636 >>>> >>>> >>>> >>>> In the issues I find: >>>> >>>> Updated by Sage Weil almost 2 years ago >>>> >>>> Status changed from New to Verified >>>> Updated by Sage Weil almost 2 years ago >>>> >>>> Assignee set to Noah Watkins >>>> >>>> >>>> Probably left at that point because there was no presure to actually >>>> commit? >>>> >>>> --WjW >>> >>> >>> It looks like Sage reviewed the change, but had some comments that >>> were mostly style-related. Neither Noah nor I actually got around to >>> implementing Sage's suggestions. 
>>> >>> https://github.com/ceph/ceph/pull/828 >> >> >> The uuid transition to boost::uuid has happened since then (a few months >> back) and I believe Rohan's AIX and Solaris ports for librados (that just >> merged) included a fix for the sockaddr_storage issue: >> >> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L180 >> >> and also >> >> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L160 > > > > Would be nice to actually find that this works for FreeBSD as well. > But I'm putting this on the watch-list, once I get there. > > --WjW -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Compiling for FreeBSD
On Mon, Nov 30, 2015 at 2:57 AM, Willem Jan Withagen <w...@digiware.nl> wrote: > > On 29-11-2015 19:08, Haomai Wang wrote: > > > I guess we still expect FreeBSD support, which version did you test > > compiling? I'd like to help to make bsd work :-) > > I consider it best to develop against HEAD aka: > 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r291381: Sat Nov 28 14:22:54 CET 2015 > I'm also trying to configure it to use as much of CLANG as possible. > > I would guess that by the time we get anything worth mentioning 11.0 > will be released or close to release. > Note that the release date for 11.0 is July 2016 > > --WjW > I can compile infernalis on FreeBSD 10.1 with the following commands: ./autogen.sh ./configure CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib" CXXFLAGS="-DGTEST_HAS_TR1_TUPLE=0" --without-tcmalloc --without-libaio --without-libxfs gmake I don't know the exact list of package dependencies. But ./configure should tell you what is missing. Yan, Zheng
Re: Ceph-fuse single read limitation?
> On Nov 23, 2015, at 16:40, Z Zhang <zhangz.da...@outlook.com> wrote: > > Hi Yan, Thanks for the reply. I already tried the following settings, but no > luck. client_readahead_min = 1048576 client_readahead_max_bytes = 4194304 > Thanks. Zhi Zhang (David) > Subject: Re: Ceph-fuse single read limitation? > > From: z...@redhat.com > Date: Mon, 23 Nov 2015 10:22:23 +0800 > CC: > ceph-devel@vger.kernel.org > To: zhangz.da...@outlook.com > > >> On Nov 21, > 2015, at 10:12, Z Zhang wrote: >> >> Hi Guys, >> >> Now we have a very small > cluster with 3 OSDs but using 40Gb NIC. We use ceph-fuse as cephfs client and > enable readahead, but testing single reading a large file from cephfs via > fio, dd or cp can only achieve ~70+MB/s, even if fio or dd's block size is > set to 1MB or 4MB. >> >> From the ceph client log, we found each read > request's size (ll_read) is limited to 128KB, which should be the limitation > of kernel fuse's FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this > value to see the performance difference, but are there other options we can > try to increase single read performance? >> > > try setting > client_readahead_max_bytes to 4M > > Yan, Zheng can you check the ceph-fuse log to see if readahead actually happens. Regards Yan, Zheng
Re: Ceph-fuse single read limitation?
> On Nov 21, 2015, at 10:12, Z Zhang <zhangz.da...@outlook.com> wrote: > > Hi Guys, > > Now we have a very small cluster with 3 OSDs but using 40Gb NIC. We use > ceph-fuse as cephfs client and enable readahead, but testing single reading a > large file from cephfs via fio, dd or cp can only achieve ~70+MB/s, even if > fio or dd's block size is set to 1MB or 4MB. > > From the ceph client log, we found each read request's size (ll_read) is > limited to 128KB, which should be the limitation of kernel fuse's > FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this value to see the > performance difference, but are there other options we can try to increase > single read performance? > try setting client_readahead_max_bytes to 4M Yan, Zheng -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
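Yan's suggestion above would look like this in ceph.conf. This is a sketch, not verbatim from the thread: the [client] section placement is an assumption, and the 4194304 value simply matches the default 4 MB object size discussed here — adjust it for your own layout.

```ini
[client]
    ; let ceph-fuse read ahead a full default-sized (4 MB) object at a time
    client_readahead_max_bytes = 4194304
```

The kernel fuse transport still splits each request at FUSE_MAX_PAGES_PER_REQ pages (128 KB with 4 KB pages), so the readahead setting controls how far ahead the client issues OSD reads, not the size of a single fuse request.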
Re: [PATCH 2/2] fs/ceph: ceph_frag_contains_value can be boolean
> On Nov 17, 2015, at 14:52, Yaowei Bai <baiyao...@cmss.chinamobile.com> wrote: > > This patch makes ceph_frag_contains_value return bool to improve > readability due to this particular function only using either one or > zero as its return value. > > No functional change. > > Signed-off-by: Yaowei Bai <baiyao...@cmss.chinamobile.com> > --- > include/linux/ceph/ceph_frag.h | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/include/linux/ceph/ceph_frag.h b/include/linux/ceph/ceph_frag.h > index 970ba5c..b827e06 100644 > --- a/include/linux/ceph/ceph_frag.h > +++ b/include/linux/ceph/ceph_frag.h > @@ -40,7 +40,7 @@ static inline __u32 ceph_frag_mask_shift(__u32 f) > return 24 - ceph_frag_bits(f); > } > > -static inline int ceph_frag_contains_value(__u32 f, __u32 v) > +static inline bool ceph_frag_contains_value(__u32 f, __u32 v) > { > return (v & ceph_frag_mask(f)) == ceph_frag_value(f); > } > -- > 1.9.1 both applied Thanks Yan, Zheng > > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: Fix error handling in the function down_reply
> On Nov 9, 2015, at 11:11, Nicholas Krause <xerofo...@gmail.com> wrote: > > This fixes error handling in the function down_reply in order to > properly check and jump to the goto label, out_err, for this > particular function if an error code is returned by any function > called in down_reply, and therefore adds checking of > the call to ceph_update_snap_trace in order to comply with > these error handling checks/paths. > > Signed-off-by: Nicholas Krause <xerofo...@gmail.com> > --- > fs/ceph/mds_client.c | 11 +++++++---- > 1 file changed, 7 insertions(+), 4 deletions(-) > > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > index 51cb02d..0b01f94 100644 > --- a/fs/ceph/mds_client.c > +++ b/fs/ceph/mds_client.c > @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session > *session, struct ceph_msg *msg) > realm = NULL; > if (rinfo->snapblob_len) { > down_write(&mdsc->snap_rwsem); > - ceph_update_snap_trace(mdsc, rinfo->snapblob, > - rinfo->snapblob + rinfo->snapblob_len, > - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP, > - &realm); > + err = ceph_update_snap_trace(mdsc, rinfo->snapblob, > + rinfo->snapblob + > rinfo->snapblob_len, > + le32_to_cpu(head->op) == > CEPH_MDS_OP_RMSNAP, > + &realm); > downgrade_write(&mdsc->snap_rwsem); > } else { > down_read(&mdsc->snap_rwsem); > } > + > + if (err) > + goto out_err; > > /* insert trace into our cache */ > mutex_lock(&req->r_fill_mutex); Applied, thanks Yan, Zheng > -- > 2.5.0
Re: [PATCH] ceph:Fix error handling in the function ceph_readddir_prepopulate
> On Nov 9, 2015, at 05:13, Nicholas Krause <xerofo...@gmail.com> wrote: > > This fixes error handling in the function ceph_readddir_prepopulate > to properly check if the call to the function ceph_fill_dirfrag has > failed by returning a error code. Further more if this does arise > jump to the goto label, out of the function ceph_readdir_prepopulate > in order to clean up previously allocated resources by this function > before returning to the caller this errror code in order for all callers > to be now aware and able to handle this failure in their own intended > error paths. > > Signed-off-by: Nicholas Krause <xerofo...@gmail.com> > --- > fs/ceph/inode.c | 7 +-- > 1 file changed, 5 insertions(+), 2 deletions(-) > > diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c > index 96d2bd8..7738be6 100644 > --- a/fs/ceph/inode.c > +++ b/fs/ceph/inode.c > @@ -1417,8 +1417,11 @@ int ceph_readdir_prepopulate(struct ceph_mds_request > *req, > } else { > dout("readdir_prepopulate %d items under dn %p\n", >rinfo->dir_nr, parent); > - if (rinfo->dir_dir) > - ceph_fill_dirfrag(d_inode(parent), rinfo->dir_dir); > + if (rinfo->dir_dir) { > + err = ceph_fill_dirfrag(d_inode(parent), > rinfo->dir_dir); > + if (err) > + goto out; > + } > } > ceph_fill_dirfrag() failure is not fatal. I think it’s better to not skip later code when it happens. Regards Yan, Zheng > if (ceph_frag_is_leftmost(frag) && req->r_readdir_offset == 2) { > -- > 2.5.0 > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
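The reviewer's point is that a ceph_fill_dirfrag() error should downgrade to a warning rather than abort prepopulation of the remaining entries. A hedged sketch of that warn-and-continue pattern (function names and the error value are illustrative):

```c
#include <assert.h>
#include <stdio.h>

static int dirfrag_result;      /* stand-in for ceph_fill_dirfrag() */

static int fill_dirfrag(void)
{
    return dirfrag_result;
}

/* Returns the number of entries processed: a dirfrag failure is logged
 * but does not stop the rest of the readdir results from being filled. */
static int readdir_prepopulate(int nr_items)
{
    int filled = 0;

    if (fill_dirfrag() < 0)
        fprintf(stderr, "dirfrag update failed, continuing\n");

    for (int i = 0; i < nr_items; i++)
        filled++;
    return filled;
}
```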
Re: Only client.admin can mount cephfs by ceph-fuse
On Mon, Nov 2, 2015 at 12:28 PM, Jaze Leewrote: > Hello, >I find only ceph.client.admin can mount cephfs. > > [root@ceph-base-0 ceph]# ceph auth get client.cephfs_user > exported keyring for client.cephfs_user > [client.cephfs_user] > key = AQDZ3DZWR7nqBxAAzSoU/yRz1oJsOYdYrTAzcw== > caps mds = "allow *" > caps mon = "allow *" > caps osd = "allow *" > > [root@cephfs-client ~]# cat ceph.client.cephfs_user > [client.cephfs_user] > key = AQDZ3DZWR7nqBxAAzSoU/yRz1oJsOYdYrTAzcw== > > But when i mount fuse, it throw an exception. > [root@cephfs-client ~]# ceph-fuse -k ceph.client.cephfs_user -m > 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs > 2015-11-02 12:23:17.938470 7f9d63f9b760 -1 did not load config file, > using default settings. > ceph-fuse[15325]: starting ceph client > 2015-11-02 12:23:17.973134 7f9d63f9b760 -1 init, newargv = 0x2c33f30 > newargc=11 > ceph-fuse[15325]: ceph mount failed with (1) Operation not permitted > ceph-fuse[15323]: mount failed: (1) Operation not permitted you don't specify user. ceph-fuse uses the default user admin. try following command ceph-fuse --id cephfs_user -k ceph.client.cephfs_user -m 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs > > But when i use client.admin > [root@ceph-base-0 ceph]# ceph auth get client.admin > exported keyring for client.admin > [client.admin] > key = AQCMrhxWhRFBBhAA+f1yZXyjzF2eBrUt1UMdFA== > auid = 0 > caps mds = "allow *" > caps mon = "allow *" > caps osd = "allow *" > > [root@cephfs-client ~]# ceph-fuse -k ceph.client.admin.keyring -m > 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs > 2015-11-02 12:25:03.038526 7f66e0957760 -1 did not load config file, > using default settings. > ceph-fuse[15360]: starting ceph client > 2015-11-02 12:25:03.073797 7f66e0957760 -1 init, newargv = 0x3035f30 > newargc=11 > ceph-fuse[15360]: starting fuse > > It is ok. > > So i want to know. > Dose ceph's fuse only allow admin to mount fs? 
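The failure in this thread comes down to entity selection, not the keyring contents: without --id, ceph-fuse authenticates as client.admin, and the supplied keyring holds no admin key. For reference, the working invocation from the reply (keyring path and monitor addresses are the ones from this thread):

```shell
# -k points at the keyring file; --id selects which entity's key inside
# it to authenticate as (defaults to "admin" when omitted).
ceph-fuse --id cephfs_user -k ceph.client.cephfs_user \
    -m 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 \
    ./test_ceph_fs
```

This cannot run without a live cluster, so it is shown only as a command-line fragment.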
Re: [PATCH 0/4] libceph: nocephx_sign_messages option + misc
> On Nov 3, 2015, at 06:44, Ilya Dryomov <idryo...@gmail.com> wrote: > > Hello, > > This adds nocephx_sign_messages libceph option (a lack of which is > something people are running into, see [1]), plus a couple of related > cleanups. > > [1] https://forum.proxmox.com/threads/24116-new-krbd-option-on-pve4-don-t-work > > Thanks, > >Ilya > > > Ilya Dryomov (4): > libceph: msg signing callouts don't need con argument > libceph: drop authorizer check from cephx msg signing routines > libceph: stop duplicating client fields in messenger > libceph: add nocephx_sign_messages option > > fs/ceph/mds_client.c | 14 -- > include/linux/ceph/libceph.h | 4 +++- > include/linux/ceph/messenger.h | 16 +++- > net/ceph/auth_x.c | 8 ++-- > net/ceph/ceph_common.c | 18 +- > net/ceph/messenger.c | 32 > net/ceph/osd_client.c | 14 ++++-- > 7 files changed, 53 insertions(+), 53 deletions(-) > Reviewed-by: Yan, Zheng <z...@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: introduce ceph_x_authorizer_cleanup()
> On Oct 26, 2015, at 20:03, Ilya Dryomov <idryo...@gmail.com> wrote: > > Commit ae385eaf24dc ("libceph: store session key in cephx authorizer") > introduced ceph_x_authorizer::session_key, but didn't update all the > exit/error paths. Introduce ceph_x_authorizer_cleanup() to encapsulate > ceph_x_authorizer cleanup and switch to it. This fixes ceph_x_destroy(), > which currently always leaks key and ceph_x_build_authorizer() error > paths. > > Cc: Yan, Zheng <z...@redhat.com> > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > net/ceph/auth_x.c | 28 +--- > net/ceph/crypto.h | 4 +++- > 2 files changed, 20 insertions(+), 12 deletions(-) > > diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c > index ba6eb17226da..65054fd31b97 100644 > --- a/net/ceph/auth_x.c > +++ b/net/ceph/auth_x.c > @@ -279,6 +279,15 @@ bad: > return -EINVAL; > } > > +static void ceph_x_authorizer_cleanup(struct ceph_x_authorizer *au) > +{ > + ceph_crypto_key_destroy(>session_key); > + if (au->buf) { > + ceph_buffer_put(au->buf); > + au->buf = NULL; > + } > +} > + > static int ceph_x_build_authorizer(struct ceph_auth_client *ac, > struct ceph_x_ticket_handler *th, > struct ceph_x_authorizer *au) > @@ -297,7 +306,7 @@ static int ceph_x_build_authorizer(struct > ceph_auth_client *ac, > ceph_crypto_key_destroy(>session_key); > ret = ceph_crypto_key_clone(>session_key, >session_key); > if (ret) > - return ret; > + goto out_au; > > maxlen = sizeof(*msg_a) + sizeof(msg_b) + > ceph_x_encrypt_buflen(ticket_blob_len); > @@ -309,8 +318,8 @@ static int ceph_x_build_authorizer(struct > ceph_auth_client *ac, > if (!au->buf) { > au->buf = ceph_buffer_new(maxlen, GFP_NOFS); > if (!au->buf) { > - ceph_crypto_key_destroy(>session_key); > - return -ENOMEM; > + ret = -ENOMEM; > + goto out_au; > } > } > au->service = th->service; > @@ -340,7 +349,7 @@ static int ceph_x_build_authorizer(struct > ceph_auth_client *ac, > ret = ceph_x_encrypt(>session_key, _b, sizeof(msg_b), >p, end - p); > if (ret < 0) > - goto 
out_buf; > + goto out_au; > p += ret; > au->buf->vec.iov_len = p - au->buf->vec.iov_base; > dout(" built authorizer nonce %llx len %d\n", au->nonce, > @@ -348,9 +357,8 @@ static int ceph_x_build_authorizer(struct > ceph_auth_client *ac, > BUG_ON(au->buf->vec.iov_len > maxlen); > return 0; > > -out_buf: > - ceph_buffer_put(au->buf); > - au->buf = NULL; > +out_au: > + ceph_x_authorizer_cleanup(au); > return ret; > } > > @@ -624,8 +632,7 @@ static void ceph_x_destroy_authorizer(struct > ceph_auth_client *ac, > { > struct ceph_x_authorizer *au = (void *)a; > > - ceph_crypto_key_destroy(>session_key); > - ceph_buffer_put(au->buf); > + ceph_x_authorizer_cleanup(au); > kfree(au); > } > > @@ -653,8 +660,7 @@ static void ceph_x_destroy(struct ceph_auth_client *ac) > remove_ticket_handler(ac, th); > } > > - if (xi->auth_authorizer.buf) > - ceph_buffer_put(xi->auth_authorizer.buf); > + ceph_x_authorizer_cleanup(>auth_authorizer); > > kfree(ac->private); > ac->private = NULL; > diff --git a/net/ceph/crypto.h b/net/ceph/crypto.h > index d1498224c49d..2e9cab09f37b 100644 > --- a/net/ceph/crypto.h > +++ b/net/ceph/crypto.h > @@ -16,8 +16,10 @@ struct ceph_crypto_key { > > static inline void ceph_crypto_key_destroy(struct ceph_crypto_key *key) > { > - if (key) > + if (key) { > kfree(key->key); > + key->key = NULL; > + } > } > > int ceph_crypto_key_clone(struct ceph_crypto_key *dst, > — > 2.4.3 > Reviewed-by: Yan, Zheng <z...@redhat.com> -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: why package ceph-fuse needs packages ceph?
On Mon, Oct 26, 2015 at 6:31 PM, Jevon Qiaowrote: > Hi Sage, > > Here comes another question, does ceph-fuse support Unix OS(like HP-UNIX or > AIX)? It supports OSX and FreeBSD. (I tested it a few weeks ago. current development branch may require some minor modification) > > Thanks, > Jevon > > On 26/10/15 15:19, Sage Weil wrote: >> >> On Mon, 26 Oct 2015, Jaze Lee wrote: >>> >>> Hello, >>> I think the ceph-fuse is just a client, why it needs packages ceph? >>> I found when i install ceph-fuse, it will install package ceph. >>> But when i install ceph-common, it will not install package ceph. >>> >>> May be ceph-fuse is not just a ceph client? >>> >> It is, and the Debian packaging works as expected. This is a simple >> error in the spec file. I'll submit a patch. >> >> sage >> -- >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in >> the body of a message to majord...@vger.kernel.org >> More majordomo info at http://vger.kernel.org/majordomo-info.html > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: a patch to improve cephfs direct io performance
ret = striped_read(inode, off, n, > pages, num_pages, checkeof, > 1, start); > > -ceph_put_page_vector(pages, num_pages, true); > +dio_free_pagev(pages, num_pages, true); > > if (ret <= 0) > break; > @@ -596,7 +705,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct > iov_iter *from, loff_t pos, > CEPH_OSD_FLAG_WRITE; > > while (iov_iter_count(from) > 0) { > -u64 len = iov_iter_single_seg_count(from); > +u64 len = dio_get_pagevlen(from); > size_t start; > ssize_t n; > > @@ -615,14 +724,15 @@ ceph_sync_direct_write(struct kiocb *iocb, struct > iov_iter *from, loff_t pos, > > osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC, 0); > > -n = iov_iter_get_pages_alloc(from, , len, ); > -if (unlikely(n < 0)) { > -ret = n; > +n = len; > +pages = dio_alloc_pagev(from, len, false, , > +_pages); > +if (IS_ERR(pages)) { > ceph_osdc_put_request(req); > +ret = PTR_ERR(pages); > break; > } > > -num_pages = (n + start + PAGE_SIZE - 1) / PAGE_SIZE; > /* > * throw out any page cache pages in this range. this > * may block. > @@ -639,8 +749,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct > iov_iter *from, loff_t pos, > if (!ret) > ret = ceph_osdc_wait_request(>client->osdc, req); > > -ceph_put_page_vector(pages, num_pages, false); > - > +dio_free_pagev(pages, num_pages, false); > ceph_osdc_put_request(req); > if (ret) > break; > > applied (with a few modification), thanks Yan, Zheng -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: fix message length computation
On Wed, Sep 30, 2015 at 9:04 PM, Arnd Bergmann <a...@arndb.de> wrote: > create_request_message() computes the maximum length of a message, > but uses the wrong type for the time stamp: sizeof(struct timespec) > may be 8 or 16 depending on the architecture, while sizeof(struct > ceph_timespec) is always 8, and that is what gets put into the > message. > > Found while auditing the uses of timespec for y2038 problems. > > Signed-off-by: Arnd Bergmann <a...@arndb.de> > Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request") > --- > diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c > index 51cb02da75d9..fe2c982764e7 100644 > --- a/fs/ceph/mds_client.c > +++ b/fs/ceph/mds_client.c > @@ -1935,7 +1935,7 @@ static struct ceph_msg *create_request_message(struct > ceph_mds_client *mdsc, > > len = sizeof(*head) + > pathlen1 + pathlen2 + 2*(1 + sizeof(u32) + sizeof(u64)) + > - sizeof(struct timespec); > + sizeof(struct ceph_timespec); > > /* calculate (max) length for cap releases */ > len += sizeof(struct ceph_mds_request_release) * > Applied. thanks Yan, Zheng > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
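The bug is easy to state in sizeof terms: the wire format is always two 32-bit fields (8 bytes), while the in-memory struct timespec is 8 or 16 bytes depending on the architecture, so summing the wrong type miscomputes the reserved message length. A compile-and-run check of the size difference (ceph_timespec_wire is a local stand-in for the kernel's struct ceph_timespec):

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Fixed-size wire encoding: always 8 bytes, regardless of architecture. */
struct ceph_timespec_wire {
    uint32_t tv_sec;
    uint32_t tv_nsec;
} __attribute__((packed));
```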
Re: a patch to improve cephfs direct io performance
On Wed, Sep 30, 2015 at 5:40 PM, zhucaifeng <zhucaif...@unissoft-nj.com> wrote: > Hi, Yan > > iov_iter APIs seems unsuitable for the direct io manipulation below. > iov_iter APIs > hide how to iterate over elements, whileas dio_xxx below explicitly control > over > the iteration. They conflict with each other in the principle. why? I think it's easy to replace dio_alloc_pagev() by iov_iter_get_pages(). We just need to use dio_get_pagevlen() to calculate how many page to get each time. Regards Yan, Zheng > > The patch for the newest kernel branch is below. > > Best Regards > > diff --git a/fs/ceph/file.c b/fs/ceph/file.c > index 8b79d87..3938ac9 100644 > --- a/fs/ceph/file.c > +++ b/fs/ceph/file.c > > @@ -34,6 +34,115 @@ > * need to wait for MDS acknowledgement. > */ > > +/* > + * Calculate the length sum of direct io vectors that can > + * be combined into one page vector. > + */ > +static int > +dio_get_pagevlen(const struct iov_iter *it) > +{ > +const struct iovec *iov = it->iov; > +const struct iovec *iovend = iov + it->nr_segs; > +int pagevlen; > + > +pagevlen = iov->iov_len - it->iov_offset; > +/* > + * An iov can be page vectored when both the current tail > + * and the next base are page aligned. > + */ > +while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) && > + (++iov < iovend && PAGE_ALIGNED((iov->iov_base { > +pagevlen += iov->iov_len; > +} > +dout("dio_get_pagevlen len = %d\n", pagevlen); > +return pagevlen; > +} > + > +/* > + * Grab @num_pages from the process vm space. These pages are > + * continuous and start from @data. 
> + */ > +static int > +dio_grab_pages(const void *data, int num_pages, bool write_page, > +struct page **pages) > +{ > +int got = 0; > +int rc = 0; > + > +down_read(>mm->mmap_sem); > +while (got < num_pages) { > +rc = get_user_pages(current, current->mm, > +(unsigned long)data + ((unsigned long)got * PAGE_SIZE), > +num_pages - got, write_page, 0, pages + got, NULL); > +if (rc < 0) > +break; > +BUG_ON(rc == 0); > +got += rc; > +} > +up_read(>mm->mmap_sem); > +return got; > +} > + > +static void > +dio_free_pagev(struct page **pages, int num_pages, bool dirty) > +{ > +int i; > + > +for (i = 0; i < num_pages; i++) { > +if (dirty) > +set_page_dirty_lock(pages[i]); > +put_page(pages[i]); > +} > +kfree(pages); > +} > + > +/* > + * Allocate a page vector based on (@it, @pagevlen). > + * The return value is the tuple describing a page vector, > + * that is (@pages, @pagevlen, @page_align, @num_pages). > + */ > +static struct page ** > +dio_alloc_pagev(const struct iov_iter *it, int pagevlen, bool write_page, > +size_t *page_align, int *num_pages) > +{ > +const struct iovec *iov = it->iov; > +struct page **pages; > +int n, m, k, npages; > +int align; > +int len; > +void *data; > + > +data = iov->iov_base + it->iov_offset; > +len = iov->iov_len - it->iov_offset; > +align = ((ulong)data) & ~PAGE_MASK; > +npages = calc_pages_for((ulong)data, pagevlen); > +pages = kmalloc(sizeof(*pages) * npages, GFP_NOFS); > +if (!pages) > +return ERR_PTR(-ENOMEM); > +for (n = 0; n < npages; n += m) { > +m = calc_pages_for((ulong)data, len); > +if (n + m > npages) > +m = npages - n; > +k = dio_grab_pages(data, m, write_page, pages + n); > +if (k < m) { > +n += k; > +goto failed; > +} > + > +iov++; > +data = iov->iov_base; > +len = iov->iov_len; > +} > +*num_pages = npages; > +*page_align = align; > +dout("dio_alloc_pagev: alloc pages pages[0:%d], page align %d\n", > +npages, align); > +return pages; > + > +failed: > +dio_free_pagev(pages, n, false); > +return ERR_PTR(-ENOMEM); > +} > > /* > * 
Prepare an open request. Preallocate ceph_cap to avoid an > @@ -462,17 +571,17 @@ static ssize_t ceph_sync_read(struct kiocb *iocb, > struct iov_iter *i, > size_t start; > ssize_t n; > > -n = iov_iter_get_pages_alloc(i, , INT_MAX, ); > -if (n < 0) > -return n; > - > -num_pages = (n + start + PAGE_SIZE - 1) / PAGE_SIZE; > +n = dio_get_pagevlen(i); > +
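The core of the patch is dio_get_pagevlen(): consecutive iovecs can share one page vector only while each segment ends on a page boundary and the next one begins on a page boundary. A self-contained sketch of that merge rule (addresses are synthetic and never dereferenced):

```c
#include <assert.h>
#include <stddef.h>

#define PAGE_SIZE 4096UL
#define PAGE_ALIGNED(a) (((unsigned long)(a) & (PAGE_SIZE - 1)) == 0)

struct seg {
    unsigned long base;   /* user address, used only for alignment math */
    size_t len;
};

/* Length of the longest prefix of segments that can be combined into a
 * single page vector, mirroring dio_get_pagevlen(): keep merging while
 * the current tail and the next base are both page aligned. */
static size_t pagev_len(const struct seg *s, int nr_segs)
{
    const struct seg *end = s + nr_segs;
    size_t len = s->len;

    while (PAGE_ALIGNED(s->base + s->len) &&
           ++s < end && PAGE_ALIGNED(s->base))
        len += s->len;
    return len;
}
```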
[BUG] commit "namei: d_is_negative() should be checked before ->d_seq validation" breaks ceph-fuse
Hi, Al

I found that commit 766c4cbfac ("namei: d_is_negative() should be checked before ->d_seq validation") breaks ceph-fuse. After that commit, lookup_fast() can return -ENOENT before calling d_revalidate(). This breaks remote filesystems that allow creating/deleting files from multiple machines.

Regards
Yan, Zheng
Re: [PATCH] ceph: fix a comment typo
> On Sep 30, 2015, at 11:36, Geliang Tang <geliangt...@163.com> wrote: > > Just fix a typo in the code comment. > > Signed-off-by: Geliang Tang <geliangt...@163.com> > --- > fs/ceph/cache.c | 2 +- > 1 file changed, 1 insertion(+), 1 deletion(-) > > diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c > index 834f9f3..a4766de 100644 > --- a/fs/ceph/cache.c > +++ b/fs/ceph/cache.c > @@ -88,7 +88,7 @@ static uint16_t ceph_fscache_inode_get_key(const void > *cookie_netfs_data, > const struct ceph_inode_info* ci = cookie_netfs_data; > uint16_t klen; > > - /* use ceph virtual inode (id + snaphot) */ > + /* use ceph virtual inode (id + snapshot) */ > klen = sizeof(ci->i_vino); > if (klen > maxbuf) > return 0; Applied, thank you. Yan, Zheng > -- > 1.9.1 > > -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: a patch to improve cephfs direct io performance
gt;iov_len - it->iov_offset; > +align = ((ulong)data) & ~PAGE_MASK; > +npages = calc_pages_for((ulong)data, pagevlen); > +pages = kmalloc(sizeof(*pages) * npages, GFP_NOFS); > +if (!pages) > +return ERR_PTR(-ENOMEM); > + for (n = 0; n < npages; n += m) { > +m = calc_pages_for((ulong)data, len); > +if (n + m > npages) > +m = npages - n; > +k = dio_grab_pages(data, m, write_page, pages + n); > +if (k < m) { > +n += k; > +goto failed; > +} if (align + len) is not page aligned, the loop should stop. > + > +iov++; > +data = iov->iov_base; if (unsigned long)data is not page aligned, the loop should stop Regards Yan, Zheng > +len = iov->iov_len; > +} > +*num_pages = npages; > +*page_align = align; > +dout("dio_alloc_pagev: alloc pages pages[0:%d], page align %d\n", > +npages, align); > +return pages; > + > +failed: > +dio_free_pagev(pages, n, false); > +return ERR_PTR(-ENOMEM); > +} > > /* > * Prepare an open request. Preallocate ceph_cap to avoid an > @@ -354,13 +463,14 @@ more: > if (ret >= 0) { > int didpages; > if (was_short && (pos + ret < inode->i_size)) { > -u64 tmp = min(this_len - ret, > +u64 zlen = min(this_len - ret, > inode->i_size - pos - ret); > +int zoff = (o_direct ? 
buf_align : io_align) + > + read + ret; > dout(" zero gap %llu to %llu\n", > -pos + ret, pos + ret + tmp); > -ceph_zero_page_vector_range(page_align + read + ret, > -tmp, pages); > -ret += tmp; > +pos + ret, pos + ret + zlen); > +ceph_zero_page_vector_range(zoff, zlen, pages); > +ret += zlen; > } > > didpages = (page_align + ret) >> PAGE_CACHE_SHIFT; > @@ -421,19 +531,19 @@ static ssize_t ceph_sync_read(struct kio > > if (file->f_flags & O_DIRECT) { > while (iov_iter_count(i)) { > -void __user *data = i->iov[0].iov_base + i->iov_offset; > -size_t len = i->iov[0].iov_len - i->iov_offset; > - > -num_pages = calc_pages_for((unsigned long)data, len); > -pages = ceph_get_direct_page_vector(data, > -num_pages, true); > +int page_align; > +size_t len; > + > +len = dio_get_pagevlen(i); > +pages = dio_alloc_pagev(i, len, true, _align, > +_pages); > if (IS_ERR(pages)) > return PTR_ERR(pages); > > ret = striped_read(inode, off, len, > pages, num_pages, checkeof, > - 1, (unsigned long)data & ~PAGE_MASK); > -ceph_put_page_vector(pages, num_pages, true); > + 1, page_align); > +dio_free_pagev(pages, num_pages, true); > > if (ret <= 0) > break; > @@ -570,10 +680,7 @@ ceph_sync_direct_write(struct kiocb *ioc > iov_iter_init(, iov, nr_segs, count, 0); > > while (iov_iter_count() > 0) { > -void __user *data = i.iov->iov_base + i.iov_offset; > -u64 len = i.iov->iov_len - i.iov_offset; > - > -page_align = (unsigned long)data & ~PAGE_MASK; > +u64 len = dio_get_pagevlen(); > > snapc = ci->i_snap_realm->cached_context; > vino = ceph_vino(inode); > @@ -589,8 +696,7 @@ ceph_sync_direct_write(struct kiocb *ioc > break; > } > > -num_pages = calc_pages_for(page_align, len); > -pages = ceph_get_direct_page_vector(data, num_pages, false); > +pages = dio_alloc_pagev(, len, false, _align, _pages); > if (IS_ERR(pages)) { > ret = PTR_ERR(pages); > goto out; > @@ -612,7 +718,7 @@ ceph_sync_direct_write(struct kiocb *ioc > if (!ret) > ret = ceph_osdc_wait_request(>client->osdc, req); > > 
-ceph_put_page_vector(pages, num_pages, false); > +dio_free_pagev(pages, num_pages, false); > > out: > ceph_osdc_put_request(req); > > > > > > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] libceph: don't access invalid memory in keepalive2 path
On Mon, Sep 14, 2015 at 9:50 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > This > > struct ceph_timespec ceph_ts; > ... > con_out_kvec_add(con, sizeof(ceph_ts), _ts); > > wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet > ceph_ts becomes invalid on return from prepare_write_keepalive(). As > a result, we send out bogus keepalive2 stamps. Fix this by encoding > into a ceph_timespec member, similar to how acks are read and written. > > Signed-off-by: Ilya Dryomov <idryo...@gmail.com> > --- > include/linux/ceph/messenger.h | 4 +++- > net/ceph/messenger.c | 9 + > 2 files changed, 8 insertions(+), 5 deletions(-) > > diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h > index 7e1252e97a30..b2371d9b51fa 100644 > --- a/include/linux/ceph/messenger.h > +++ b/include/linux/ceph/messenger.h > @@ -238,6 +238,8 @@ struct ceph_connection { > bool out_kvec_is_msg; /* kvec refers to out_msg */ > int out_more;/* there is more data after the kvecs */ > __le64 out_temp_ack; /* for writing an ack */ > + struct ceph_timespec out_temp_keepalive2; /* for writing keepalive2 > +stamp */ > > /* message in temps */ > struct ceph_msg_header in_hdr; > @@ -248,7 +250,7 @@ struct ceph_connection { > int in_base_pos; /* bytes read */ > __le64 in_temp_ack; /* for reading an ack */ > > - struct timespec last_keepalive_ack; > + struct timespec last_keepalive_ack; /* keepalive2 ack stamp */ > > struct delayed_work work; /* send|recv work */ > unsigned long delay; /* current delay interval */ > diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c > index 525f454f7531..b9b0e3b5da49 100644 > --- a/net/ceph/messenger.c > +++ b/net/ceph/messenger.c > @@ -1353,11 +1353,12 @@ static void prepare_write_keepalive(struct > ceph_connection *con) > dout("prepare_write_keepalive %p\n", con); > con_out_kvec_reset(con); > if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) { > - struct timespec ts = CURRENT_TIME; > - struct ceph_timespec ceph_ts; > - 
ceph_encode_timespec(_ts, ); > + struct timespec now = CURRENT_TIME; > + > con_out_kvec_add(con, sizeof(tag_keepalive2), > _keepalive2); > - con_out_kvec_add(con, sizeof(ceph_ts), _ts); > + ceph_encode_timespec(>out_temp_keepalive2, ); > + con_out_kvec_add(con, sizeof(con->out_temp_keepalive2), > + >out_temp_keepalive2); > } else { > con_out_kvec_add(con, sizeof(tag_keepalive), _keepalive); > } > -- Sorry for introducing this bug Reviewed-by: Yan, Zheng <z...@redhat.com> > 1.9.3 > > -- > To unsubscribe from this list: send the line "unsubscribe ceph-devel" in > the body of a message to majord...@vger.kernel.org > More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
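The bug pattern here is general C, not messenger-specific: queueing a pointer to a stack variable that dies before the deferred write happens. A minimal sketch of the fix's shape — encode into storage owned by the connection, then queue that (struct layout simplified):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct kvec_s {
    const void *base;
    size_t len;
};

struct conn {
    struct kvec_s out_kvec[4];
    int out_kvec_left;
    char out_temp_keepalive2[8];   /* connection-owned scratch buffer */
};

static void kvec_add(struct conn *c, const void *p, size_t len)
{
    c->out_kvec[c->out_kvec_left].base = p;
    c->out_kvec[c->out_kvec_left].len = len;
    c->out_kvec_left++;
}

/* A local `char stamp[8]` here would dangle once this returns; encoding
 * into c->out_temp_keepalive2 keeps the queued pointer valid until the
 * worker thread actually writes the kvec out. */
static void prepare_keepalive2(struct conn *c, const char stamp[8])
{
    memcpy(c->out_temp_keepalive2, stamp, 8);
    kvec_add(c, c->out_temp_keepalive2, sizeof(c->out_temp_keepalive2));
}
```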
Re: [PATCH] libceph: advertise support for keepalive2
On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> We are the client, but advertise keepalive2 anyway - for consistency,
> if nothing else. In the future the server might want to know whether
> its clients support keepalive2.

The kernel code still does not fully support KEEPALIVE2 (it does not recognize the CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better to not advertise support for keepalive2.

Regards
Yan, Zheng

> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
> include/linux/ceph/ceph_features.h | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h
> index 4763ad64e832..f89b31d45cc8 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -107,6 +107,7 @@ static inline u64 ceph_sanitize_features(u64 features)
>  CEPH_FEATURE_OSDMAP_ENC | \
>  CEPH_FEATURE_CRUSH_TUNABLES3 | \
>  CEPH_FEATURE_OSD_PRIMARY_AFFINITY | \
> +CEPH_FEATURE_MSGR_KEEPALIVE2 | \
>  CEPH_FEATURE_CRUSH_V4)
>
> #define CEPH_FEATURES_REQUIRED_DEFAULT \
> --
> 1.9.3
Re: [PATCH] libceph: advertise support for keepalive2
On Wed, Sep 16, 2015 at 3:39 PM, Ilya Dryomov <idryo...@gmail.com> wrote: > On Wed, Sep 16, 2015 at 9:28 AM, Yan, Zheng <uker...@gmail.com> wrote: >> On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote: >>> We are the client, but advertise keepalive2 anyway - for consistency, >>> if nothing else. In the future the server might want to know whether >>> its clients support keepalive2. >> >> the kernel code still does fully support KEEPALIVE2 (it does not >> recognize CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better to not >> advertise support for keepalive2. > > I guess by "does not recognize" you mean the kernel client knows how to > write TAG_KEEPALIVE2, but not how to read it? The same is true about > TAG_KEEPALIVE tag and the reverse (we can read, but can't write) is > true about TAG_KEEPALIVE2_ACK. > > What I'm getting at is the kernel client is the client, and so it > doesn't need to know how to read keepalive bytes or write keepalive > acks. The server however might want to know if its clients can send > keepalive2 bytes and handle keepalive2 acks. Does this make sense? > Ok, it makes sense. thank you for explaination Reviewed-by: Yan, Zheng <z...@redhat.com> > Thanks, > > Ilya -- To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
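What gets "advertised" is just a bit in the supported-features mask exchanged at connection time; the server-side check Ilya describes is a simple mask test against what the client sent. A sketch with made-up bit positions (the real values live in ceph_features.h):

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative bit positions only, not the real feature values. */
#define FEATURE_MSGR_KEEPALIVE2  (1ULL << 42)
#define FEATURE_CRUSH_V4         (1ULL << 48)

#define SUPPORTED_FEATURES (FEATURE_CRUSH_V4 | FEATURE_MSGR_KEEPALIVE2)

/* The check a server could make: does this client's advertised mask
 * include keepalive2? */
static int peer_has_feature(uint64_t peer_features, uint64_t feature)
{
    return (peer_features & feature) != 0;
}
```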
Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive
> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote:
>
> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote:
>> timespec_to_jiffies() does not work this way. It converts a time delta
>> in the form of a timespec to a time delta in the form of jiffies.
>
> Ah sorry, con->last_keepalive_ack is a realtime timespec from
> userspace.
>
>> I will update the patch according to the rest of the comments.
>
> I still want ceph_con_keepalive_expired() to handle interval == 0
> internally, so that opt->monc_ping_timeout can be passed without any
> checks in the caller.
>
> I also noticed you added delay = max_t(unsigned long, delay, HZ); to
> __schedule_delayed(). Is it really necessary?

Updated, please review.

Regards
Yan, Zheng
[PATCH] libceph: use keepalive2 to verify the mon session is alive
Signed-off-by: Yan, Zheng <z...@redhat.com> --- include/linux/ceph/libceph.h | 2 ++ include/linux/ceph/messenger.h | 4 +++ include/linux/ceph/msgr.h | 4 ++- net/ceph/ceph_common.c | 18 - net/ceph/messenger.c | 60 ++ net/ceph/mon_client.c | 38 -- 6 files changed, 111 insertions(+), 15 deletions(-) diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index 9ebee53..9a0b471 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -46,6 +46,7 @@ struct ceph_options { unsigned long mount_timeout;/* jiffies */ unsigned long osd_idle_ttl; /* jiffies */ unsigned long osd_keepalive_timeout;/* jiffies */ + unsigned long mon_keepalive_timeout;/* jiffies */ /* * any type that can't be simply compared or doesn't need need @@ -66,6 +67,7 @@ struct ceph_options { #define CEPH_MOUNT_TIMEOUT_DEFAULT msecs_to_jiffies(60 * 1000) #define CEPH_OSD_KEEPALIVE_DEFAULT msecs_to_jiffies(5 * 1000) #define CEPH_OSD_IDLE_TTL_DEFAULT msecs_to_jiffies(60 * 1000) +#define CEPH_MON_KEEPALIVE_DEFAULT msecs_to_jiffies(30 * 1000) #define CEPH_MSG_MAX_FRONT_LEN (16*1024*1024) #define CEPH_MSG_MAX_MIDDLE_LEN(16*1024*1024) diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 3775327..83063b6 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -248,6 +248,8 @@ struct ceph_connection { int in_base_pos; /* bytes read */ __le64 in_temp_ack; /* for reading an ack */ + struct timespec last_keepalive_ack; + struct delayed_work work; /* send|recv work */ unsigned long delay; /* current delay interval */ }; @@ -285,6 +287,8 @@ extern void ceph_msg_revoke(struct ceph_msg *msg); extern void ceph_msg_revoke_incoming(struct ceph_msg *msg); extern void ceph_con_keepalive(struct ceph_connection *con); +extern int ceph_con_keepalive_expired(struct ceph_connection *con, + unsigned long interval); extern void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages, size_t length, size_t alignment); diff 
--git a/include/linux/ceph/msgr.h b/include/linux/ceph/msgr.h index 1c18872..0fe2656 100644 --- a/include/linux/ceph/msgr.h +++ b/include/linux/ceph/msgr.h @@ -84,10 +84,12 @@ struct ceph_entity_inst { #define CEPH_MSGR_TAG_MSG 7 /* message */ #define CEPH_MSGR_TAG_ACK 8 /* message ack */ #define CEPH_MSGR_TAG_KEEPALIVE 9 /* just a keepalive byte! */ -#define CEPH_MSGR_TAG_BADPROTOVER 10 /* bad protocol version */ +#define CEPH_MSGR_TAG_BADPROTOVER 10 /* bad protocol version */ #define CEPH_MSGR_TAG_BADAUTHORIZER 11 /* bad authorizer */ #define CEPH_MSGR_TAG_FEATURES 12 /* insufficient features */ #define CEPH_MSGR_TAG_SEQ 13 /* 64-bit int follows with seen seq number */ +#define CEPH_MSGR_TAG_KEEPALIVE214 /* keepalive2 byte + ceph_timespec */ +#define CEPH_MSGR_TAG_KEEPALIVE2_ACK 15 /* keepalive2 reply */ /* diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index f30329f..5143f6e 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -226,6 +226,7 @@ static int parse_fsid(const char *str, struct ceph_fsid *fsid) * ceph options */ enum { + Opt_monkeepalivetimeout, Opt_osdtimeout, Opt_osdkeepalivetimeout, Opt_mount_timeout, @@ -250,6 +251,7 @@ enum { }; static match_table_t opt_tokens = { + {Opt_monkeepalivetimeout, "monkeepalive=%d"}, {Opt_osdtimeout, "osdtimeout=%d"}, {Opt_osdkeepalivetimeout, "osdkeepalive=%d"}, {Opt_mount_timeout, "mount_timeout=%d"}, @@ -354,9 +356,10 @@ ceph_parse_options(char *options, const char *dev_name, /* start with defaults */ opt->flags = CEPH_OPT_DEFAULT; - opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT; opt->mount_timeout = CEPH_MOUNT_TIMEOUT_DEFAULT; opt->osd_idle_ttl = CEPH_OSD_IDLE_TTL_DEFAULT; + opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT; + opt->mon_keepalive_timeout = CEPH_MON_KEEPALIVE_DEFAULT; /* get mon ip(s) */ /* ip1[:port1][,ip2[:port2]...] 
*/ @@ -460,6 +463,16 @@ ceph_parse_options(char *options, const char *dev_name, } opt->osd_idle_ttl = msecs_to_jiffies(intval * 1000); break; + case Opt_monkeepalivetimeout: + /* 0 isn't well defined right now, reject it */ + if (intval < 1 || intval > INT_MAX / 1000) { + pr_err("monkeepalive out of range\n"); +
Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.
It seems all of your patches are malformed. Please use 'git send-email' to send patches.

Regards
Yan, Zheng

> On Aug 17, 2015, at 16:43, Ma, Jianpeng <jianpeng...@intel.com> wrote:
>
> For ceph_open, it already gets the parent inode, so there is no need to get it again.
>
> Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
> ---
>  fs/ceph/file.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8b79d87..5ecd3dc 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
>  	ihold(inode);
>  	req->r_num_caps = 1;
> -	if (flags & O_CREAT)
> -		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
>  	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
> -	iput(parent_inode);
>  	if (!err)
>  		err = ceph_init_file(inode, file, req->r_fmode);
>  	ceph_mdsc_put_request(req);
> --
> 2.4.3
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.
> On Aug 17, 2015, at 16:43, Ma, Jianpeng <jianpeng...@intel.com> wrote:
>
> For ceph_open, it already gets the parent inode, so there is no need to get it again.
>
> Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
> ---
>  fs/ceph/file.c | 3 ---
>  1 file changed, 3 deletions(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8b79d87..5ecd3dc 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
>  	ihold(inode);
>  	req->r_num_caps = 1;
> -	if (flags & O_CREAT)
> -		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
>  	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
> -	iput(parent_inode);
>  	if (!err)
>  		err = ceph_init_file(inode, file, req->r_fmode);
>  	ceph_mdsc_put_request(req);

I fixed your patches (all tabs were replaced by spaces in your patches) and added them to our testing branch.

Thanks
Yan, Zheng

> --
> 2.4.3
Re: [PATCH] ceph: Using file-f_flags rather than flags check O_CREAT.
On Thu, Aug 13, 2015 at 5:01 PM, Ma, Jianpeng <jianpeng...@intel.com> wrote:
> Because O_CREAT has been removed from flags, we should use file->f_flags.
>
> Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
> ---
>  fs/ceph/file.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8b79d87..e1347cf 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
> @@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
>  	ihold(inode);
>  	req->r_num_caps = 1;
> -	if (flags & O_CREAT)
> +	if (file->f_flags & O_CREAT)
>  		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
>  	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
>  	iput(parent_inode);

In this case we do not need parent_inode, because the file already exists. I'd like to remove the code that finds the parent inode.

Regards
Yan, Zheng

> --
> 2.4.3
Re: [PATCH] ceph: Using file-f_flags rather than flags check O_CREAT.
On Thu, Aug 13, 2015 at 8:34 PM, Sage Weil <s...@newdream.net> wrote:
> On Thu, 13 Aug 2015, Yan, Zheng wrote:
>> On Thu, Aug 13, 2015 at 5:01 PM, Ma, Jianpeng <jianpeng...@intel.com> wrote:
>>> Because O_CREAT has been removed from flags, we should use file->f_flags.
>>>
>>> Signed-off-by: Jianpeng Ma <jianpeng...@intel.com>
>>> ---
>>>  fs/ceph/file.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
>>> index 8b79d87..e1347cf 100644
>>> --- a/fs/ceph/file.c
>>> +++ b/fs/ceph/file.c
>>> @@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
>>>  	ihold(inode);
>>>  	req->r_num_caps = 1;
>>> -	if (flags & O_CREAT)
>>> +	if (file->f_flags & O_CREAT)
>>>  		parent_inode = ceph_get_dentry_parent_inode(file->f_path.dentry);
>>>  	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
>>>  	iput(parent_inode);
>>
>> In this case we do not need parent_inode, because the file already exists. I'd like to remove the code that finds the parent inode.
>
> Do you mean that O_CREAT is only removed in the case where the inode already exists? Because there is at least one case where we do a single op to the MDS that does the create + open, and in that case we do need to make sure a dir fsync waits for it to commit. As long as we don't break that one!

The create+open case is handled by ceph_atomic_open or ceph_create. ceph_open is used only when the inode already exists.

Regards
Yan, Zheng

> sage
Re: FileStore should not use syncfs(2)
On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil <sw...@redhat.com> wrote:
> Today I learned that syncfs(2) does an O(n) search of the superblock's inode list searching for dirty items. I've always assumed that it was only traversing dirty inodes (e.g., a list of dirty inodes), but that appears not to be the case, even on the latest kernels.

I checked the syncfs code in the 3.10 and 4.1 kernels. I think both kernels only traverse dirty inodes (inodes on the bdi_writeback::{b_dirty,b_io,b_more_io} lists). What am I missing?

> That means that the more RAM in the box, the larger (generally) the inode cache, the longer syncfs(2) will take, and the more CPU you'll waste doing it. The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 servicing a very light workload, and each syncfs(2) call was taking ~7 seconds (usually to write out a single inode).
>
> A possible workaround for such boxes is to turn /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching pages instead of inodes/dentries)...
>
> I think the take-away though is that we do need to bite the bullet and make FileStore f[data]sync all the right things so that the syncfs call can be avoided. This is the path you were originally headed down, Somnath, and I think it's the right one. The main thing to watch out for is that according to POSIX you really need to fsync directories. With XFS that isn't the case since all metadata operations are going into the journal and that's fully ordered, but we don't want to allow data loss on e.g. ext4 (we need to check what the metadata ordering behavior is there) or other file systems. :(
>
> sage
Re: CephFS and the next hammer release v0.94.3
Hi Loic,

Yes, https://github.com/ceph/ceph/pull/5222 is problematic. Do you mean we should include these PRs in v0.94.3? These PRs fix a bug in a rare configuration, so I think it's not a big deal to leave them out of v0.94.3.

Regards
Yan, Zheng

> On Aug 4, 2015, at 00:32, Loic Dachary <l...@dachary.org> wrote:
>
> Hi Greg,
>
> I assume the file handle reference counting is about http://tracker.ceph.com/issues/12088 which is backported as described at http://tracker.ceph.com/issues/12319. It was indeed somewhat problematic and required two pull requests: https://github.com/ceph/ceph/pull/5222 (authored by Yan Zheng) and https://github.com/ceph/ceph/pull/5427 (merged by Yan Zheng).
>
> Cheers
>
> On 03/08/2015 18:01, Gregory Farnum wrote:
>> On Mon, Aug 3, 2015 at 6:43 PM Loic Dachary <l...@dachary.org> wrote:
>>> Hi Greg,
>>>
>>> The next hammer release as found at https://github.com/ceph/ceph/tree/hammer passed the fs suite (http://tracker.ceph.com/issues/11990#fs). Do you think it is ready for QE to start their own round of testing?
>>
>> I'm on vacation right now, but the only thing I see there that might be iffy is the backport of the file handle reference counting. As long as that is all good (Zheng?) things look fine to me.
>> -Greg
>>
>>> Cheers
>>> P.S. http://tracker.ceph.com/issues/11990#Release-information has direct links to the pull requests merged into hammer since v0.94.2 in case you need more context about one of them.
>>> --
>>> Loïc Dachary, Artisan Logiciel Libre
> --
> Loïc Dachary, Artisan Logiciel Libre
Re: Client: what is supposed to protect racing readdirs and unlinks?
On Wed, Jul 15, 2015 at 1:11 AM, Gregory Farnum <g...@gregs42.com> wrote:
> I spent a bunch of today looking at http://tracker.ceph.com/issues/12297. Long story short: the workload is doing a readdir at the same time as it's unlinking files. The readdir functions (in this case, _readdir_cache_cb) drop the client_lock each time they invoke the callback (for obvious reasons). There is some effort in _readdir_cache_cb to try and keep the iterator valid (we check on each loop that we aren't at end; we increment the iterator before dropping the lock), but it's not sufficient.
>
> Is there supposed to be something preventing this kind of race? If not I can work something out in the code, but I've not done much work in that bit and there are enough pieces that I wonder if I'm missing some other issue.

I think calling (*pd)->get() before releasing the client_lock should work.

Regards
Yan, Zheng

> -Greg
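Zheng's suggestion amounts to pinning the element the iterator points at before the lock is dropped, so a concurrent unlink cannot free it. A stripped-down C sketch of that pattern (the struct and helper names are hypothetical — the actual Client code is C++ and refcounts dentries):

```c
#include <pthread.h>
#include <stddef.h>

/* Hypothetical refcounted list entry standing in for a cached dentry. */
struct entry {
    int refcount;
    struct entry *next;
};

static pthread_mutex_t client_lock = PTHREAD_MUTEX_INITIALIZER;

static void entry_get(struct entry *e) { e->refcount++; }  /* call under lock */
static void entry_put(struct entry *e) { e->refcount--; }  /* call under lock */

static void nop_cb(struct entry *e) { (void)e; }

/* Walk the list, invoking cb() with the lock dropped; returns the number
 * of entries visited.  The extra reference taken before unlocking keeps
 * a concurrent unlink from freeing the current element, so reading
 * e->next after re-locking is safe. */
int iterate(struct entry *head, void (*cb)(struct entry *))
{
    int visited = 0;
    pthread_mutex_lock(&client_lock);
    for (struct entry *e = head; e != NULL; ) {
        entry_get(e);                        /* pin before unlocking */
        pthread_mutex_unlock(&client_lock);
        cb(e);                               /* callback runs unlocked */
        pthread_mutex_lock(&client_lock);
        struct entry *next = e->next;        /* still valid: e was pinned */
        entry_put(e);
        e = next;
        visited++;
    }
    pthread_mutex_unlock(&client_lock);
    return visited;
}
```

The key ordering is get() under the lock, callback without it, then re-lock before reading the next pointer and dropping the reference.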
Re: [ceph:for-linus 11/40] fs/ceph/mds_client.c:2865:1-10: second lock on line 2904
On Fri, Jul 3, 2015 at 4:52 AM, Julia Lawall julia.law...@lip6.fr wrote: I haven't looked at all the called functions, to see if any of them drops the lock, but it could be worth a check. the lock is dropped by cleanup_cap_releases() julia On Fri, 3 Jul 2015, kbuild test robot wrote: TO: Yan, Zheng z...@redhat.com CC: Ilya Dryomov idryo...@gmail.com CC: ceph-devel@vger.kernel.org tree: git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus head: 5a60e87603c4c533492c515b7f62578189b03c9c commit: 745a8e3bccbc6adae69a98ddc525e529aa44636e [11/40] ceph: don't pre-allocate space for cap release messages :: branch date: 2 days ago :: commit date: 7 days ago fs/ceph/mds_client.c:2865:1-10: second lock on line 2904 -- fs/ceph/mds_client.c:2961:1-7: preceding lock on line 2865 git remote add ceph git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git git remote update ceph git checkout 745a8e3bccbc6adae69a98ddc525e529aa44636e vim +2865 fs/ceph/mds_client.c a687ecaf John Spray 2014-09-19 2859 ceph_session_state_name(session-s_state)); 2f2dc053 Sage Weil 2009-10-06 2860 99a9c273 Yan, Zheng 2013-09-22 2861 spin_lock(session-s_gen_ttl_lock); 99a9c273 Yan, Zheng 2013-09-22 2862 session-s_cap_gen++; 99a9c273 Yan, Zheng 2013-09-22 2863 spin_unlock(session-s_gen_ttl_lock); 99a9c273 Yan, Zheng 2013-09-22 2864 99a9c273 Yan, Zheng 2013-09-22 @2865 spin_lock(session-s_cap_lock); 03f4fcb0 Yan, Zheng 2015-01-05 2866 /* don't know if session is readonly */ 03f4fcb0 Yan, Zheng 2015-01-05 2867 session-s_readonly = 0; 99a9c273 Yan, Zheng 2013-09-22 2868 /* 99a9c273 Yan, Zheng 2013-09-22 2869 * notify __ceph_remove_cap() that we are composing cap reconnect. 99a9c273 Yan, Zheng 2013-09-22 2870 * If a cap get released before being added to the cap reconnect, 99a9c273 Yan, Zheng 2013-09-22 2871 * __ceph_remove_cap() should skip queuing cap release. 
99a9c273 Yan, Zheng 2013-09-22 2872 */ 99a9c273 Yan, Zheng 2013-09-22 2873 session-s_cap_reconnect = 1; e01a5946 Sage Weil 2010-05-10 2874 /* drop old cap expires; we're about to reestablish that state */ 745a8e3b Yan, Zheng 2015-05-14 2875 cleanup_cap_releases(mdsc, session); e01a5946 Sage Weil 2010-05-10 2876 5d23371f Yan, Zheng 2014-09-10 2877 /* trim unused caps to reduce MDS's cache rejoin time */ c0bd50e2 Yan, Zheng 2015-04-07 2878 if (mdsc-fsc-sb-s_root) 5d23371f Yan, Zheng 2014-09-10 2879 shrink_dcache_parent(mdsc-fsc-sb-s_root); 5d23371f Yan, Zheng 2014-09-10 2880 5d23371f Yan, Zheng 2014-09-10 2881 ceph_con_close(session-s_con); 5d23371f Yan, Zheng 2014-09-10 2882 ceph_con_open(session-s_con, 5d23371f Yan, Zheng 2014-09-10 2883 CEPH_ENTITY_TYPE_MDS, mds, 5d23371f Yan, Zheng 2014-09-10 2884 ceph_mdsmap_get_addr(mdsc-mdsmap, mds)); 5d23371f Yan, Zheng 2014-09-10 2885 5d23371f Yan, Zheng 2014-09-10 2886 /* replay unsafe requests */ 5d23371f Yan, Zheng 2014-09-10 2887 replay_unsafe_requests(mdsc, session); 5d23371f Yan, Zheng 2014-09-10 2888 5d23371f Yan, Zheng 2014-09-10 2889 down_read(mdsc-snap_rwsem); 5d23371f Yan, Zheng 2014-09-10 2890 2f2dc053 Sage Weil 2009-10-06 2891 /* traverse this session's caps */ 44c99757 Yan, Zheng 2013-09-22 2892 s_nr_caps = session-s_nr_caps; 44c99757 Yan, Zheng 2013-09-22 2893 err = ceph_pagelist_encode_32(pagelist, s_nr_caps); 93cea5be Sage Weil 2009-12-23 2894 if (err) 93cea5be Sage Weil 2009-12-23 2895 goto fail; 20cb34ae Sage Weil 2010-05-12 2896 44c99757 Yan, Zheng 2013-09-22 2897 recon_state.nr_caps = 0; 20cb34ae Sage Weil 2010-05-12 2898 recon_state.pagelist = pagelist; 20cb34ae Sage Weil 2010-05-12 2899 recon_state.flock = session-s_con.peer_features CEPH_FEATURE_FLOCK; 20cb34ae Sage Weil 2010-05-12 2900 err = iterate_session_caps(session, encode_caps_cb, recon_state); 2f2dc053 Sage Weil 2009-10-06 2901 if (err 0) 9abf82b8 Sage Weil 2010-05-10 2902 goto fail; 2f2dc053 Sage Weil 2009-10-06 2903 99a9c273 Yan, Zheng 
2013-09-22 @2904 spin_lock(session-s_cap_lock); 99a9c273 Yan, Zheng 2013-09-22 2905 session-s_cap_reconnect = 0; 99a9c273 Yan, Zheng 2013-09-22 2906 spin_unlock(session-s_cap_lock); 99a9c273 Yan, Zheng 2013-09-22 2907 2f2dc053 Sage Weil 2009-10-06 2908 /* 2f2dc053 Sage Weil 2009-10-06 2909 * snaprealms. we provide mds with the ino, seq (version), and 2f2dc053 Sage Weil 2009-10-06 2910 * parent for all of our realms. If the mds has any newer info, 2f2dc053
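For context, the pattern the checker flags reduces to this userspace sketch (a pthread mutex standing in for the spinlock; names are illustrative): the helper releases the lock its caller took, so the caller's later lock is correct at runtime but looks like a "second lock" to an analysis that does not follow the call.

```c
#include <pthread.h>

static pthread_mutex_t s_cap_lock = PTHREAD_MUTEX_INITIALIZER;

/* Like cleanup_cap_releases(): called with the lock held,
 * returns with the lock released. */
static void helper_drops_lock(void)
{
    /* ... work done under the lock ... */
    pthread_mutex_unlock(&s_cap_lock);
    /* ... work done without the lock ... */
}

/* The caller locks, the helper unlocks, and the caller locks again
 * later.  Balanced at runtime, but an intraprocedural checker sees
 * two lock calls with no intervening unlock in this function. */
int caller_pattern(void)
{
    pthread_mutex_lock(&s_cap_lock);
    helper_drops_lock();                 /* lock released in here */

    /* ... unlocked section ... */

    pthread_mutex_lock(&s_cap_lock);     /* reported as a double lock */
    pthread_mutex_unlock(&s_cap_lock);
    return 0;
}
```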
Re: Ceph hard lock Hammer 9.2
Could you please run

  echo 1 > /proc/sys/kernel/sysrq; echo t > /proc/sysrq-trigger

when this warning happens again, then send the kernel messages to us.

Regards
Yan, Zheng

On Tue, Jun 23, 2015 at 10:25 PM, Barclay Jameson <almightybe...@gmail.com> wrote:

Sure, I guess it's actually a soft kernel lock since it's only the filesystem that is hung with high IO wait. The kernel is 4.0.4-1.el6.elrepo.x86. The Ceph version is 0.94.2 (sorry about the confusion, I missed a 4 when I typed the subject line). I was testing copying 100,000 files from directory (dir1) to (dir1-`hostname`) on three separate hosts. Two of the hosts completed the job and the third one hung with the stack trace in /var/log/messages.

On Tue, Jun 23, 2015 at 6:54 AM, Gregory Farnum <g...@gregs42.com> wrote:
On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson <almightybe...@gmail.com> wrote:

Has anyone seen this?

Can you describe the kernel you're using, the workload you were running, the Ceph cluster you're running against, etc?

Jun 22 15:09:27 node kernel: Call Trace:
Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
Jun 22 15:09:27 node kernel: [8168062e] schedule_preempt_disabled+0xe/0x10
Jun 22 15:09:27 node kernel: [81681ce3] __mutex_lock_slowpath+0x93/0x100
Jun 22 15:09:27 node kernel: [a060def8] ? __cap_is_valid+0x58/0x70 [ceph]
Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
Jun 22 15:09:27 node kernel: [a0610f2d] ceph_check_caps+0x38d/0x780 [ceph]
Jun 22 15:09:27 node kernel: [812f5a9b] ?
__radix_tree_delete_node+0x7b/0x130 Jun 22 15:09:27 node kernel: [a0612637] ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph] Jun 22 15:09:27 node kernel: [a060b170] writepages_finish+0x200/0x290 [ceph] Jun 22 15:09:27 node kernel: [a05e2731] handle_reply+0x4f1/0x640 [libceph] Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 [libceph] Jun 22 15:09:27 node kernel: [a05d7ceb] process_message+0xab/0xd0 [libceph] Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 [libceph] Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 [libceph] Jun 22 15:09:27 node kernel: [8108c475] process_one_work+0x145/0x460 Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420 Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840 Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460 Jun 22 15:09:27 node kernel: [8108c790] ? process_one_work+0x460/0x460 Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0 Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70 Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90 Jun 22 15:09:27 node kernel: [81091640] ? kthread_freezable_should_stop+0x70/0x70 Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for more than 120 seconds. Jun 22 15:11:27 node kernel: Tainted: G I 4.0.4-1.el6.elrepo.x86_64 #1 Jun 22 15:11:27 node kernel: echo 0 /proc/sys/kernel/hung_task_timeout_secs disables this message. 
Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0 40 2 0x
Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010 881ff2b67050 88207fd95270
Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200 7fff 0002
Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818 816803ee 810ae63b
Re: Slow file creating and deleting using bonnie ++ on Hammer
On Sat, May 23, 2015 at 1:34 AM, Barclay Jameson almightybe...@gmail.com wrote: Here are some more info : rados bench -p cephfs_data 100 write --no-cleanup Total time run: 100.096473 Total writes made: 21900 Write size: 4194304 Bandwidth (MB/sec): 875.156 Stddev Bandwidth: 96.1234 Max bandwidth (MB/sec): 932 Min bandwidth (MB/sec): 0 Average Latency:0.0731273 Stddev Latency: 0.0439909 Max latency:1.23972 Min latency:0.0306901 (Again the numbers from bench don't match what is listed in client io. Ceph -s shows anywhere from 200 MB/s to 1700 MB/s even when the max bandwidth lists 932 as the highest) rados bench -p cephfs_data 100 seq Total time run:29.460172 Total reads made: 21900 Read size:4194304 Bandwidth (MB/sec):2973.506 Average Latency: 0.0215173 Max latency: 0.693831 Min latency: 0.00519763 On client: [root@blarg cephfs]# time for i in {1..10}; do mkdir blarg$i ; done real10m36.794s user1m45.329s sys6m29.982s [root@blarg cephfs]# time for i in {1..10}; do touch yadda$i ; done real13m29.155s user3m55.256s sys7m50.301s What variables are most important in the perf dump? I would like to grep out the vars (ceph daemon /var/run/ceph-mds.cephnautilus01.asok perf dump | jq '.') that are of meaning while running the bonnie++ test again with -s 0. Thanks, BJ On Fri, May 22, 2015 at 10:34 AM, John Spray john.sp...@redhat.com wrote: On 22/05/2015 16:25, Barclay Jameson wrote: The Bonnie++ job _FINALLY_ finished. If I am reading this correctly it took days to create, stat, and delete 16 files?? [root@blarg cephfs]# ~/bonnie++-1.03e/bonnie++ -u root:root -s 256g -r 131072 -d /cephfs/ -m CephBench -f -b Using uid:0, gid:0. Writing intelligently...done Rewriting...done Reading intelligently...done start 'em...done...done...done... Create files in sequential order...done. Stat files in sequential order...done. Delete files in sequential order...done. Create files in random order...done. Stat files in random order...done. Delete files in random order...done. 
> Version 1.03e       --Sequential Output-- --Sequential Input- --Random-
>                     -Per Chr- --Block-- -Rewrite- -Per Chr- --Block-- --Seeks--
> Machine        Size K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP  /sec %CP
> CephBench      256G           1006417  76 90114  13           137110   8 329.8   7
>                     --Sequential Create-- --------Random Create--------
>                     -Create-- --Read--- -Delete-- -Create-- --Read--- -Delete--
>               files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
>                  16     0   0 +++++ +++     0   0     0   0  5267  19     0   0
> CephBench,256G,,,1006417,76,90114,13,,,137110,8,329.8,7,16,0,0,+++++,+++,0,0,0,0,5267,19,0,0
>
> Any thoughts?

It's 16000 files by default (not 16), but this usually takes only a few minutes. FWIW I tried running a quick bonnie++ (with -s 0 to skip the IO phase) on a development (vstart.sh) cluster with a fuse client, and it readily handles several hundred client requests per second (checked with ceph daemonperf mds.<id>).

Nothing immediately leapt out at me from a quick look at the log you posted, but with issues like these it is always worth trying to narrow it down by trying the fuse client instead of the kernel client, and/or different kernel versions. You may also want to check that your underlying RADOS cluster is performing reasonably by doing a rados bench too.

Cheers,
John

--

The reason for the slow file creations is that bonnie++ calls fsync(2) after each creat(2). fsync() waits for the safe replies of the create requests, and the MDS sends the safe reply only once the log event for the request has been journaled safely. The MDS flushes its journal every 5 seconds (mds_tick_interval), so bonnie++ creates at most one file every five seconds.
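A quick sanity check of that arithmetic (worst case, assuming every creat()+fsync() pair waits out a full MDS tick):

```c
/* Worst-case duration of bonnie++'s create phase if every fsync(2)
 * waits for the next MDS journal flush (mds_tick_interval). */
double worst_case_create_secs(int nfiles, double tick_interval_secs)
{
    return nfiles * tick_interval_secs;
}
/* 16000 files x 5 s/file = 80000 s, roughly 22 hours per create phase --
 * consistent with the full bonnie++ run taking days. */
```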
Re: [PATCH 4/5] ceph: simplify two mount_timeout sites
On Thu, May 21, 2015 at 8:35 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> No need to bifurcate wait now that we've got ceph_timeout_jiffies().
>
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  fs/ceph/dir.c        | 14 --
>  fs/ceph/mds_client.c | 18 ++
>  2 files changed, 14 insertions(+), 18 deletions(-)
>
> diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
> index 173dd4b58c71..3dec27e36417 100644
> --- a/fs/ceph/dir.c
> +++ b/fs/ceph/dir.c
> @@ -1257,17 +1257,11 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
>  	dout("dir_fsync %p wait on tid %llu (until %llu)\n",
>  	     inode, req->r_tid, last_tid);
> -	if (req->r_timeout) {
> -		unsigned long time_left = wait_for_completion_timeout(
> -						&req->r_safe_completion,
> +	ret = !wait_for_completion_timeout(&req->r_safe_completion,
>  				ceph_timeout_jiffies(req->r_timeout));
> -		if (time_left > 0)
> -			ret = 0;
> -		else
> -			ret = -EIO;  /* timed out */
> -	} else {
> -		wait_for_completion(&req->r_safe_completion);
> -	}
> +	if (ret)
> +		ret = -EIO;  /* timed out */
> +
>  	ceph_mdsc_put_request(req);
>  	spin_lock(&ci->i_unsafe_lock);
>
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 0b0e0a9a81c0..5be2d287a26c 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2266,16 +2266,18 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
>  	/* wait */
>  	mutex_unlock(&mdsc->mutex);
>  	dout("do_request waiting\n");
> -	if (req->r_timeout) {
> -		err = (long)wait_for_completion_killable_timeout(
> -				&req->r_completion,
> -				ceph_timeout_jiffies(req->r_timeout));
> -		if (err == 0)
> -			err = -EIO;
> -	} else if (req->r_wait_for_completion) {
> +	if (!req->r_timeout && req->r_wait_for_completion) {
>  		err = req->r_wait_for_completion(mdsc, req);
>  	} else {
> -		err = wait_for_completion_killable(&req->r_completion);
> +		long timeleft = wait_for_completion_killable_timeout(
> +					&req->r_completion,
> +					ceph_timeout_jiffies(req->r_timeout));
> +		if (timeleft > 0)
> +			err = 0;
> +		else if (!timeleft)
> +			err = -EIO;  /* timed out */
> +		else
> +			err = timeleft;  /* killed */
>  	}
>  	dout("do_request waited, got %d\n", err);
>  	mutex_lock(&mdsc->mutex);
> --
> 1.9.3
Reviewed-by: Yan, Zheng <z...@redhat.com>
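The heart of the change is decoding wait_for_completion_killable_timeout()'s three-way return value in one place. A userspace sketch of that mapping (the helper name is made up; -EIO is -5 on Linux):

```c
/* Sketch (assumption: mirrors the kernel convention) of how
 * ceph_mdsc_do_request() interprets the wait result:
 *   > 0  completed with time to spare  -> success
 *   == 0 wait timed out                -> -EIO
 *   < 0  interrupted by a fatal signal -> propagate the error code
 */
#define EIO_ERR (-5)  /* -EIO on Linux */

int map_wait_result(long timeleft)
{
    if (timeleft > 0)
        return 0;                 /* completed */
    else if (!timeleft)
        return EIO_ERR;           /* timed out */
    else
        return (int)timeleft;     /* killed: e.g. -ERESTARTSYS */
}
```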
[PATCH 3/3] ceph: check OSD caps before read/write
Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/addr.c | 203 +++ fs/ceph/caps.c | 4 + fs/ceph/inode.c | 3 + fs/ceph/mds_client.c | 4 + fs/ceph/mds_client.h | 9 +++ fs/ceph/super.c | 15 +++- fs/ceph/super.h | 17 +++-- 7 files changed, 249 insertions(+), 6 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index bdeea57..c65f9e0 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1599,3 +1599,206 @@ int ceph_mmap(struct file *file, struct vm_area_struct *vma) vma-vm_ops = ceph_vmops; return 0; } + +enum { + POOL_READ = 1, + POOL_WRITE = 2, +}; + +static int __ceph_pool_perm_get(struct ceph_inode_info *ci, u32 pool) +{ + struct ceph_fs_client *fsc = ceph_inode_to_client(ci-vfs_inode); + struct ceph_mds_client *mdsc = fsc-mdsc; + struct ceph_osd_request *rd_req = NULL, *wr_req = NULL; + struct rb_node **p, *parent; + struct ceph_pool_perm *perm; + struct page **pages; + int err = 0, err2 = 0, have = 0; + + down_read(mdsc-pool_perm_rwsem); + p = mdsc-pool_perm_tree.rb_node; + while (*p) { + perm = rb_entry(*p, struct ceph_pool_perm, node); + if (pool perm-pool) + p = (*p)-rb_left; + else if (pool perm-pool) + p = (*p)-rb_right; + else { + have = perm-perm; + break; + } + } + up_read(mdsc-pool_perm_rwsem); + if (*p) + goto out; + + dout(__ceph_pool_perm_get pool %u no perm cached\n, pool); + + down_write(mdsc-pool_perm_rwsem); + parent = NULL; + while (*p) { + parent = *p; + perm = rb_entry(parent, struct ceph_pool_perm, node); + if (pool perm-pool) + p = (*p)-rb_left; + else if (pool perm-pool) + p = (*p)-rb_right; + else { + have = perm-perm; + break; + } + } + if (*p) { + up_write(mdsc-pool_perm_rwsem); + goto out; + } + + rd_req = ceph_osdc_alloc_request(fsc-client-osdc, +ci-i_snap_realm-cached_context, +1, false, GFP_NOFS); + if (!rd_req) { + err = -ENOMEM; + goto out_unlock; + } + + rd_req-r_flags = CEPH_OSD_FLAG_READ; + osd_req_op_init(rd_req, 0, CEPH_OSD_OP_STAT, 0); + rd_req-r_base_oloc.pool = pool; + snprintf(rd_req-r_base_oid.name, 
sizeof(rd_req-r_base_oid.name), +%llx., ci-i_vino.ino); + rd_req-r_base_oid.name_len = strlen(rd_req-r_base_oid.name); + + wr_req = ceph_osdc_alloc_request(fsc-client-osdc, +ci-i_snap_realm-cached_context, +1, false, GFP_NOFS); + if (!wr_req) { + err = -ENOMEM; + goto out_unlock; + } + + wr_req-r_flags = CEPH_OSD_FLAG_WRITE | + CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK; + osd_req_op_init(wr_req, 0, CEPH_OSD_OP_CREATE, CEPH_OSD_OP_FLAG_EXCL); + wr_req-r_base_oloc.pool = pool; + wr_req-r_base_oid = rd_req-r_base_oid; + + /* one page should be large enough for STAT data */ + pages = ceph_alloc_page_vector(1, GFP_KERNEL); + if (IS_ERR(pages)) { + err = PTR_ERR(pages); + goto out_unlock; + } + + osd_req_op_raw_data_in_pages(rd_req, 0, pages, PAGE_SIZE, +0, false, true); + ceph_osdc_build_request(rd_req, 0, NULL, CEPH_NOSNAP, + ci-vfs_inode.i_mtime); + err = ceph_osdc_start_request(fsc-client-osdc, rd_req, false); + + ceph_osdc_build_request(wr_req, 0, NULL, CEPH_NOSNAP, + ci-vfs_inode.i_mtime); + err2 = ceph_osdc_start_request(fsc-client-osdc, wr_req, false); + + if (!err) + err = ceph_osdc_wait_request(fsc-client-osdc, rd_req); + if (!err2) + err2 = ceph_osdc_wait_request(fsc-client-osdc, wr_req); + + if (err = 0 || err == -ENOENT) + have |= POOL_READ; + else if (err != -EPERM) + goto out_unlock; + + if (err2 == 0 || err2 == -EEXIST) + have |= POOL_WRITE; + else if (err2 != -EPERM) { + err = err2; + goto out_unlock; + } + + perm = kmalloc(sizeof(*perm), GFP_NOFS); + if (!perm) { + err = -ENOMEM; + goto out_unlock; + } + + perm-pool = pool; + perm-perm = have; + rb_link_node(perm-node, parent, p); + rb_insert_color
[PATCH 2/3] libceph: allow setting osd_req_op's flags
Signed-off-by: Yan, Zheng z...@redhat.com --- drivers/block/rbd.c | 4 ++-- fs/ceph/addr.c | 3 ++- fs/ceph/file.c | 2 +- include/linux/ceph/osd_client.h | 2 +- net/ceph/osd_client.c | 24 +++- 5 files changed, 21 insertions(+), 14 deletions(-) diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c index 8125233..15ddc5e 100644 --- a/drivers/block/rbd.c +++ b/drivers/block/rbd.c @@ -2371,7 +2371,7 @@ static void rbd_img_obj_request_fill(struct rbd_obj_request *obj_request, } if (opcode == CEPH_OSD_OP_DELETE) - osd_req_op_init(osd_request, num_ops, opcode); + osd_req_op_init(osd_request, num_ops, opcode, 0); else osd_req_op_extent_init(osd_request, num_ops, opcode, offset, length, 0, 0); @@ -2843,7 +2843,7 @@ static int rbd_img_obj_exists_submit(struct rbd_obj_request *obj_request) goto out; stat_request-callback = rbd_img_obj_exists_callback; - osd_req_op_init(stat_request-osd_req, 0, CEPH_OSD_OP_STAT); + osd_req_op_init(stat_request-osd_req, 0, CEPH_OSD_OP_STAT, 0); osd_req_op_raw_data_in_pages(stat_request-osd_req, 0, pages, size, 0, false, false); rbd_osd_req_format_read(stat_request); diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index cab1cf5..bdeea57 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -884,7 +884,8 @@ get_more_pages: } if (do_sync) - osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC); + osd_req_op_init(req, 1, + CEPH_OSD_OP_STARTSYNC, 0); req-r_callback = writepages_finish; req-r_inode = inode; diff --git a/fs/ceph/file.c b/fs/ceph/file.c index d533075..a972019 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -615,7 +615,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct iov_iter *from, loff_t pos) break; } - osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC); + osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC, 0); n = iov_iter_get_pages_alloc(from, pages, len, start); if (unlikely(n 0)) { diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h index 61b19c4..7506b48 100644 --- a/include/linux/ceph/osd_client.h +++ 
b/include/linux/ceph/osd_client.h @@ -249,7 +249,7 @@ extern void ceph_osdc_handle_map(struct ceph_osd_client *osdc, struct ceph_msg *msg); extern void osd_req_op_init(struct ceph_osd_request *osd_req, - unsigned int which, u16 opcode); + unsigned int which, u16 opcode, u32 flags); extern void osd_req_op_raw_data_in_pages(struct ceph_osd_request *, unsigned int which, diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c index b93531f..e3c2e8b 100644 --- a/net/ceph/osd_client.c +++ b/net/ceph/osd_client.c @@ -453,7 +453,7 @@ __CEPH_FORALL_OSD_OPS(GENERATE_CASE) */ static struct ceph_osd_req_op * _osd_req_op_init(struct ceph_osd_request *osd_req, unsigned int which, - u16 opcode) +u16 opcode, u32 flags) { struct ceph_osd_req_op *op; @@ -463,14 +463,15 @@ _osd_req_op_init(struct ceph_osd_request *osd_req, unsigned int which, op = osd_req-r_ops[which]; memset(op, 0, sizeof (*op)); op-op = opcode; + op-flags = flags; return op; } void osd_req_op_init(struct ceph_osd_request *osd_req, - unsigned int which, u16 opcode) +unsigned int which, u16 opcode, u32 flags) { - (void)_osd_req_op_init(osd_req, which, opcode); + (void)_osd_req_op_init(osd_req, which, opcode, flags); } EXPORT_SYMBOL(osd_req_op_init); @@ -479,7 +480,8 @@ void osd_req_op_extent_init(struct ceph_osd_request *osd_req, u64 offset, u64 length, u64 truncate_size, u32 truncate_seq) { - struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode); + struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, + opcode, 0); size_t payload_len = 0; BUG_ON(opcode != CEPH_OSD_OP_READ opcode != CEPH_OSD_OP_WRITE @@ -518,7 +520,8 @@ EXPORT_SYMBOL(osd_req_op_extent_update); void osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which, u16 opcode, const char *class, const char *method) { - struct ceph_osd_req_op *op = _osd_req_op_init
Re: [ceph-users] CephFS Slow writes with 1MB files
On Thu, Apr 2, 2015 at 11:18 PM, Barclay Jameson almightybe...@gmail.com wrote: I am using the Giant release. The OSDs and MON/MDS are using the default RHEL 7 kernel. The client is using the elrepo 3.19 kernel. I am also using cephaux. I reproduced this issue using the Giant release. It's a bug in the MDS code. Could you try the newest development version of Ceph (it includes the fix), or apply the attached patch to the source of the Giant release. Regards Yan, Zheng I may have found something. I did the build manually, and as such I did _NOT_ set up these config settings: filestore xattr use omap = false filestore max inline xattr size = 65536 filestore_max_inline_xattr_size_xfs = 65536 filestore_max_inline_xattr_size_other = 512 filestore_max_inline_xattrs_xfs = 10 I just changed these settings to see if it will make a difference. I copied data from one directory that had files I created before I set these values ( time cp small1/* small2/. ) and it takes 2 min 30 secs to copy 1600 files. If I take the files I just copied into small2 and copy them to a different directory ( time cp small2/* small3/. ) it only takes 5 mins to copy the files! Could this be part of the problem? On Thu, Apr 2, 2015 at 6:03 AM, Yan, Zheng uker...@gmail.com wrote: On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson almightybe...@gmail.com wrote: Here is the mds output from the command you requested. I did this during the small data run ( time cp small1/* small2/ ). It is 20MB in size so I couldn't find a place online that would accept that much data. Please find the attached file. Thanks, In the log file, each 'create' request is followed by several 'getattr' requests. I guess these 'getattr' requests resulted from some kind of permission check, but I can't reproduce this situation locally. Which version of ceph/kernel are you using? Do you use ceph-fuse or the kernel client, and what are the mount options?
Regards Yan, Zheng Beeij On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng uker...@gmail.com wrote: On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson almightybe...@gmail.com wrote: I redid my entire Ceph build, going back to CentOS 7, hoping to get the same performance I did last time. The rados bench test was the best I have ever had, with 740 MB/s write and 1300 MB/s read. This was even better than the first rados bench test that had performance equal to PanFS. I find that this does not translate to my CephFS. Even with the following tweaking it is still at least twice as slow as PanFS and my first *Magical* build (that had absolutely no tweaking): OSD osd_op_threads 8 /sys/block/sd*/queue/nr_requests 4096 /sys/block/sd*/queue/read_ahead_kb 4096 Client rsize=16777216 readdir_max_bytes=16777216 readdir_max_entries=16777216 ~160 mins to copy 100,000 (1MB) files for CephFS vs ~50 mins for PanFS. Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s. The strange thing is none of the resources are taxed. CPU, RAM, network, and disks are not even close to being taxed on either the client, mon/mds, or the osd nodes. The PanFS client node was on a 10Gb network, the same as the CephFS client, but you can see the huge difference in speed. As per Greg's questions before: There is only one client reading and writing ( time cp Small1/* Small2/. ), but three clients have cephfs mounted, although they aren't doing anything on the filesystem. I have done another test where I stream data into a file as fast as the processor can put it there ( for (i=0; i < 11; i++){ fprintf(out_file, "I is : %d\n", i); } ) and it is faster than PanFS: CephFS writes 16GB in 105 seconds with the above tuning vs 130 seconds for PanFS. Without the tuning it takes 230 seconds for CephFS, although the first build did it in 130 seconds without any tuning. This leads me to believe the bottleneck is the mds. Does anybody have any thoughts on this? Are there any tuning parameters that I would need to speed up the mds?
could you enable mds debugging for a few seconds (ceph daemon mds.x config set debug_mds 10; sleep 10; ceph daemon mds.x config set debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere. Regards Yan, Zheng On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson almightybe...@gmail.com wrote: Yes, it's the exact same hardware except for the MDS server (although I tried using the MDS on the old node). I have not tried moving the MON back to the old node. My default cache size is mds cache size = 1000 The OSDs (3 of them) have 16 disks with 4 SSD journal disks. I created 2048 PGs for data and metadata: ceph osd pool create cephfs_data 2048 2048 ceph osd pool create cephfs_metadata 2048 2048 To your point on clients competing against each other... how would I check that? Do you have multiple clients mounted? Are they both accessing files in the directory(ies) you're testing? Were they accessing the same pattern
Re: [ceph-users] CephFS Slow writes with 1MB files
On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson almightybe...@gmail.com wrote: Here is the mds output from the command you requested. I did this during the small data run ( time cp small1/* small2/ ). It is 20MB in size so I couldn't find a place online that would accept that much data. Please find the attached file. Thanks, In the log file, each 'create' request is followed by several 'getattr' requests. I guess these 'getattr' requests resulted from some kind of permission check, but I can't reproduce this situation locally. Which version of ceph/kernel are you using? Do you use ceph-fuse or the kernel client, and what are the mount options? Regards Yan, Zheng Beeij On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng uker...@gmail.com wrote: On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson almightybe...@gmail.com wrote: I redid my entire Ceph build, going back to CentOS 7, hoping to get the same performance I did last time. The rados bench test was the best I have ever had, with 740 MB/s write and 1300 MB/s read. This was even better than the first rados bench test that had performance equal to PanFS. I find that this does not translate to my CephFS. Even with the following tweaking it is still at least twice as slow as PanFS and my first *Magical* build (that had absolutely no tweaking): OSD osd_op_threads 8 /sys/block/sd*/queue/nr_requests 4096 /sys/block/sd*/queue/read_ahead_kb 4096 Client rsize=16777216 readdir_max_bytes=16777216 readdir_max_entries=16777216 ~160 mins to copy 100,000 (1MB) files for CephFS vs ~50 mins for PanFS. Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s. The strange thing is none of the resources are taxed. CPU, RAM, network, and disks are not even close to being taxed on either the client, mon/mds, or the osd nodes. The PanFS client node was on a 10Gb network, the same as the CephFS client, but you can see the huge difference in speed. As per Greg's questions before: There is only one client reading and writing ( time cp Small1/* Small2/. ),
but three clients have cephfs mounted, although they aren't doing anything on the filesystem. I have done another test where I stream data into a file as fast as the processor can put it there ( for (i=0; i < 11; i++){ fprintf(out_file, "I is : %d\n", i); } ) and it is faster than PanFS: CephFS writes 16GB in 105 seconds with the above tuning vs 130 seconds for PanFS. Without the tuning it takes 230 seconds for CephFS, although the first build did it in 130 seconds without any tuning. This leads me to believe the bottleneck is the mds. Does anybody have any thoughts on this? Are there any tuning parameters that I would need to speed up the mds? could you enable mds debugging for a few seconds (ceph daemon mds.x config set debug_mds 10; sleep 10; ceph daemon mds.x config set debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere. Regards Yan, Zheng On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson almightybe...@gmail.com wrote: Yes, it's the exact same hardware except for the MDS server (although I tried using the MDS on the old node). I have not tried moving the MON back to the old node. My default cache size is mds cache size = 1000 The OSDs (3 of them) have 16 disks with 4 SSD journal disks. I created 2048 PGs for data and metadata: ceph osd pool create cephfs_data 2048 2048 ceph osd pool create cephfs_metadata 2048 2048 To your point on clients competing against each other... how would I check that? Do you have multiple clients mounted? Are they both accessing files in the directory(ies) you're testing? Were they accessing the same pattern of files for the old cluster? If you happen to be running a hammer rc or something pretty new, you can use the MDS admin socket to explore a bit what client sessions there are and what they have permissions on and check; otherwise you'll have to figure it out from the client side. -Greg Thanks for the input!
On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote: So this is exactly the same test you ran previously, but now it's on faster hardware and the test is slower? Do you have more data in the test cluster? One obvious possibility is that previously you were working entirely in the MDS's cache, but now you've got more dentries and so it's kicking data out to RADOS and then reading it back in. If you've got the memory (you appear to) you can pump up the mds cache size config option quite dramatically from its default of 100,000. Other things to check are that you've got an appropriately-sized metadata pool, that you've not got clients competing against each other inappropriately, etc. -Greg On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson almightybe...@gmail.com wrote: Oops, I should have said that I am not just writing the data but copying it: time cp Small1/* Small2/* Thanks, BJ On Fri, Mar 27
Re: [ceph-users] CephFS Slow writes with 1MB files
On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson almightybe...@gmail.com wrote: I redid my entire Ceph build, going back to CentOS 7, hoping to get the same performance I did last time. The rados bench test was the best I have ever had, with 740 MB/s write and 1300 MB/s read. This was even better than the first rados bench test that had performance equal to PanFS. I find that this does not translate to my CephFS. Even with the following tweaking it is still at least twice as slow as PanFS and my first *Magical* build (that had absolutely no tweaking): OSD osd_op_threads 8 /sys/block/sd*/queue/nr_requests 4096 /sys/block/sd*/queue/read_ahead_kb 4096 Client rsize=16777216 readdir_max_bytes=16777216 readdir_max_entries=16777216 ~160 mins to copy 100,000 (1MB) files for CephFS vs ~50 mins for PanFS. Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s. The strange thing is none of the resources are taxed. CPU, RAM, network, and disks are not even close to being taxed on either the client, mon/mds, or the osd nodes. The PanFS client node was on a 10Gb network, the same as the CephFS client, but you can see the huge difference in speed. As per Greg's questions before: There is only one client reading and writing ( time cp Small1/* Small2/. ), but three clients have cephfs mounted, although they aren't doing anything on the filesystem. I have done another test where I stream data into a file as fast as the processor can put it there ( for (i=0; i < 11; i++){ fprintf(out_file, "I is : %d\n", i); } ) and it is faster than PanFS: CephFS writes 16GB in 105 seconds with the above tuning vs 130 seconds for PanFS. Without the tuning it takes 230 seconds for CephFS, although the first build did it in 130 seconds without any tuning. This leads me to believe the bottleneck is the mds. Does anybody have any thoughts on this? Are there any tuning parameters that I would need to speed up the mds?
could you enable mds debugging for a few seconds (ceph daemon mds.x config set debug_mds 10; sleep 10; ceph daemon mds.x config set debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere. Regards Yan, Zheng On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote: On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson almightybe...@gmail.com wrote: Yes, it's the exact same hardware except for the MDS server (although I tried using the MDS on the old node). I have not tried moving the MON back to the old node. My default cache size is mds cache size = 1000 The OSDs (3 of them) have 16 disks with 4 SSD journal disks. I created 2048 PGs for data and metadata: ceph osd pool create cephfs_data 2048 2048 ceph osd pool create cephfs_metadata 2048 2048 To your point on clients competing against each other... how would I check that? Do you have multiple clients mounted? Are they both accessing files in the directory(ies) you're testing? Were they accessing the same pattern of files for the old cluster? If you happen to be running a hammer rc or something pretty new, you can use the MDS admin socket to explore a bit what client sessions there are and what they have permissions on and check; otherwise you'll have to figure it out from the client side. -Greg Thanks for the input! On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote: So this is exactly the same test you ran previously, but now it's on faster hardware and the test is slower? Do you have more data in the test cluster? One obvious possibility is that previously you were working entirely in the MDS's cache, but now you've got more dentries and so it's kicking data out to RADOS and then reading it back in. If you've got the memory (you appear to) you can pump up the mds cache size config option quite dramatically from its default of 100,000. Other things to check are that you've got an appropriately-sized metadata pool, that you've not got clients competing against each other inappropriately, etc.
-Greg On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson almightybe...@gmail.com wrote: Oops, I should have said that I am not just writing the data but copying it: time cp Small1/* Small2/* Thanks, BJ On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson almightybe...@gmail.com wrote: I did a Ceph cluster install 2 weeks ago where I was getting great performance (~= PanFS), where I could write 100,000 1MB files in 61 mins (it took PanFS 59 mins). I thought I could increase the performance by adding a better MDS server, so I redid the entire build. Now it takes 4 times as long to write the same data as it did before. The only thing that changed was the MDS server. (I even tried moving the MDS back to the old slower node and the performance was the same.) The first install was on CentOS 7. I tried going down to CentOS 6.6 and it's the same result. I use the same scripts to install the OSDs (which I created because I can never get ceph-deploy to behave correctly
Re: ceph-fuse remount issues
On Mon, Mar 16, 2015 at 1:28 PM, Sage Weil sw...@redhat.com wrote: Hi Zheng, On Thu, 26 Feb 2015, Yan, Zheng wrote: On 20 Feb 2015, at 06:23, John Spray john.sp...@redhat.com wrote: Background: a while ago, we found (#10277) that the existing cache expiration mechanism wasn't working with the latest kernels. We used to invalidate the top-level dentries, which caused fuse to invalidate everything, but an implementation detail in fuse caused it to start ignoring our repeated invalidate calls, so this doesn't work any more. To persuade fuse to dirty its entire metadata cache, Zheng added in a system() call to mount -o remount after we expire things from our client-side cache. The change of the d_invalidate() implementation breaks our old cache expiration mechanism. When invalidating a dentry, d_invalidate() also walks the dentry subtree and tries pruning any unused descendant dentries. Our old cache expiration mechanism relies on this to prune unused dentries: we invalidate the top-level dentries, and d_invalidate() tries pruning the unused dentries underneath them. Prior to the 3.18 kernel, d_invalidate() could fail if the dentry was in use by someone. The implementation of d_invalidate() changed in the 3.18 kernel: d_invalidate() always succeeds and unhashes the dentry even if it's still in use. This behavior change means we are no longer able to use d_invalidate() at will. One known bad consequence is that the getcwd() system call returns -EINVAL after the process's working directory gets invalidated. I took another look at this and it seems to me like we might need something more than a new call that does the pruning. What we were doing before was also a bit of a hack, it seems. What is really going on is that the MDS is telling us to reduce the number of inodes we have pinned. We should ideally turn that into pressure on the dcache... but it's not per-superblock, so there's not a shrinker we can poke that does what we want. We *are* doing the invalidation, so the dentries are unhashed. But they aren't getting destroyed...
Zheng, this is what the previous hack was doing, right? Forcing unhashed dentries to get trimmed from the LRU? Yes, that's what the previous hack does. When invalidating a dentry, the VFS also tries destroying the dentry. It seems like the most elegant solution would be to patch fs/fuse to make that happen in the general case when we do the invalidate upcall. Does that sound right? What do you mean by "make that happen in the general case"? In my opinion, there isn't much the fuse kernel module can do about dentries. Maybe we can try adding a new callback which calls d_prune_aliases(). Regards Yan, Zheng sage -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] ceph: use msecs_to_jiffies for time conversion
On Fri, Feb 6, 2015 at 7:52 PM, Nicholas Mc Guire hof...@osadl.org wrote: This is only an API consolidation and should make things more readable; it replaces var * HZ / 1000 by msecs_to_jiffies(var). Signed-off-by: Nicholas Mc Guire hof...@osadl.org --- Patch was only compile tested with x86_64_defconfig + CONFIG_CEPH_FS=m Patch is against 3.19.0-rc7 (localversion-next is -next-20150204)

 fs/ceph/mds_client.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 5f62fb7..ced7503 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3070,7 +3070,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
 		    di->lease_renew_from && di->lease_renew_after == 0) {
 			unsigned long duration =
-				le32_to_cpu(h->duration_ms) * HZ / 1000;
+				msecs_to_jiffies(le32_to_cpu(h->duration_ms));

 			di->lease_seq = seq;
 			dentry->d_time = di->lease_renew_from + duration;
--
1.7.10.4

added to our testing branch Thanks Yan, Zheng -- To unsubscribe from this list: send the line unsubscribe linux-kernel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html Please read the FAQ at http://www.tux.org/lkml/
Re: [PATCH] ceph: match wait_for_completion_timeout return type
On Tue, Mar 10, 2015 at 11:18 PM, Nicholas Mc Guire hof...@osadl.org wrote: The return type of wait_for_completion_timeout is unsigned long, not int. An appropriately named unsigned long is added and the assignment fixed up. Signed-off-by: Nicholas Mc Guire hof...@osadl.org --- This was only compile tested for x86_64_defconfig + CONFIG_CEPH_FS=m Patch is against 4.0-rc2 linux-next (localversion-next is -next-20150306)

 fs/ceph/dir.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 83e9976..4bee6b7 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1218,6 +1218,7 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
 	struct ceph_mds_request *req;
 	u64 last_tid;
 	int ret = 0;
+	unsigned long time_left;

 	dout("dir_fsync %p\n", inode);
 	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
@@ -1240,11 +1241,11 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
 		dout("dir_fsync %p wait on tid %llu (until %llu)\n",
 		     inode, req->r_tid, last_tid);
 		if (req->r_timeout) {
-			ret = wait_for_completion_timeout(
+			time_left = wait_for_completion_timeout(
 				&req->r_safe_completion, req->r_timeout);
-			if (ret > 0)
+			if (time_left > 0)
 				ret = 0;
-			else if (ret == 0)
+			else if (!time_left)
 				ret = -EIO;  /* timed out */
 		} else {
 			wait_for_completion(&req->r_safe_completion);

There is a lot of similar code in the kernel. I don't think this code causes problems in reality. -- 1.7.10.4
Re: [PATCH] ceph: match wait_for_completion_timeout return type
On Wed, Mar 11, 2015 at 7:04 PM, Nicholas Mc Guire der.h...@hofr.at wrote: On Wed, 11 Mar 2015, Yan, Zheng wrote: On Tue, Mar 10, 2015 at 11:18 PM, Nicholas Mc Guire hof...@osadl.org wrote: The return type of wait_for_completion_timeout is unsigned long, not int. An appropriately named unsigned long is added and the assignment fixed up. Signed-off-by: Nicholas Mc Guire hof...@osadl.org --- This was only compile tested for x86_64_defconfig + CONFIG_CEPH_FS=m Patch is against 4.0-rc2 linux-next (localversion-next is -next-20150306)

 fs/ceph/dir.c | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 83e9976..4bee6b7 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -1218,6 +1218,7 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
 	struct ceph_mds_request *req;
 	u64 last_tid;
 	int ret = 0;
+	unsigned long time_left;

 	dout("dir_fsync %p\n", inode);
 	ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
@@ -1240,11 +1241,11 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
 		dout("dir_fsync %p wait on tid %llu (until %llu)\n",
 		     inode, req->r_tid, last_tid);
 		if (req->r_timeout) {
-			ret = wait_for_completion_timeout(
+			time_left = wait_for_completion_timeout(
 				&req->r_safe_completion, req->r_timeout);
-			if (ret > 0)
+			if (time_left > 0)
 				ret = 0;
-			else if (ret == 0)
+			else if (!time_left)
 				ret = -EIO;  /* timed out */
 		} else {
 			wait_for_completion(&req->r_safe_completion);

There is a lot of similar code in the kernel. I don't think this code causes problems in reality. true - there are 38 (of the initial 81 files) left for which no patch has been submitted yet - it's cleanup in progress. Type correctness I do believe is an issue, and code readability as well, so both fixing the type and the name is relevant. As Wolfram Sang w...@the-dreams.de put it: snip 'ret' being an int is kind of an idiom, so I'd rather see the variable renamed, too, like the other patches do.
snip [http://lkml.iu.edu/hypermail/linux/kernel/1502.1/00084.html] Regarding causing problems - it is hard to say - a type mismatch may go without problems for a long time and then pop up in strange corner cases. But you are right that it is not fixing any currently known incorrect behavior. The motivation for cleaning this up is also to make static code checkers happy, which eases scanning for incorrect API usage and general bug-hunting. Ok, added to our testing branch Thanks Yan, Zheng thx! hofrat
Re: workloads/rbd_fsx_cache_writethrough.yaml hangs on dumpling
On Mon, Feb 2, 2015 at 9:18 PM, Loic Dachary l...@dachary.org wrote: Hi, http://pulpito.ceph.com/loic-2015-01-29_15:39:38-rbd-dumpling-backports---basic-multi/730029/ hangs on dumpling and got killed after two days. Although workloads/rbd_fsx_cache_writethrough.yaml was running with the thrasher, it does not seem to be related to http://tracker.ceph.com/issues/10513 : it was running on mira070 (16GB RAM) and plana78 (8GB RAM). I don't see anything in https://github.com/ceph/ceph-qa-suite/blob/master/tasks/rbd_fsx.py that would suggest it needs backporting a fix to the ceph-qa-suite branch of dumpling. I browsed the commits related to RBD that are in the dumpling-backports branch (see here for the list I used: http://tracker.ceph.com/issues/10560 ). I don't see anything in the issues that needs backporting to dumpling either. Another run of this specific test has been scheduled to check if it always happens, at http://pulpito.ceph.com/loic-2015-02-02_13:58:32-rbd:thrash:workloads:rbd_fsx_cache_writethrough.yaml-dumpling-backports---basic-multi/ Does that ring a bell by any chance? Maybe it's the same as http://tracker.ceph.com/issues/10498 -- Loïc Dachary, Artisan Logiciel Libre
Re: [RFC] several ceph dentry leaks
On Mon, Feb 2, 2015 at 8:31 AM, Al Viro v...@zeniv.linux.org.uk wrote: What do you expect to happen when the if () is taken in int ceph_handle_notrace_create(struct inode *dir, struct dentry *dentry) { struct dentry *result = ceph_lookup(dir, dentry, 0); if (result && !IS_ERR(result)) { If result is non-NULL, it means that we have just acquired a new reference to a preexisting dentry (in ceph_finish_lookup()); where do you expect that reference to be dropped? Another thing: in ceph_readdir_prepopulate() if (!dn->d_inode) { dn = splice_dentry(dn, in, NULL); if (IS_ERR(dn)) { err = PTR_ERR(dn); dn = NULL; goto next_item; } } you leak dn if that IS_ERR() ever gets hit - d_splice_alias(d, i) does *not* drop the reference to d in any case, so splice_dentry() leaves the sum total of all dentry refcounts unchanged. And in the case when the return value is ERR_PTR(...), this assignment results in a leak. That one is trivial to fix, but ceph_handle_notrace_create() looks very confusing - if nothing else, we should _never_ create multiple dentries pointing to a directory inode, so the d_instantiate() in there isn't mitigating anything - it's actively breaking things as far as the rest of the kernel is concerned... What are you trying to do there? ceph_handle_notrace_create() is for this case: the Ceph metadata server restarted, and the client re-sent a request. The Ceph MDS found the request had already completed, so it sent a traceless reply to the client. (A traceless reply contains no information about the dentry and the newly created inode.) It's hard to handle this case elegantly, because the MDS may have done other operations which moved the newly created file/directory to another place. For multiple dentries pointing to a directory inode, maybe we can return an error (-ENOENT or -ESTALE).
Regards Yan, Zheng -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC] several ceph dentry leaks
On Mon, Feb 2, 2015 at 12:41 PM, Al Viro v...@zeniv.linux.org.uk wrote: On Mon, Feb 02, 2015 at 11:23:58AM +0800, Yan, Zheng wrote: we should _never_ create multiple dentries pointing to a directory inode, so the d_instantiate() in there isn't mitigating anything - it's actively breaking things as far as the rest of the kernel is concerned... What are you trying to do there? ceph_handle_notrace_create() is for this case: the Ceph metadata server restarted, and the client re-sent a request. The Ceph MDS found the request had already completed, so it sent a traceless reply to the client. (A traceless reply contains no information about the dentry and the newly created inode.) It's hard to handle this case elegantly, because the MDS may have done other operations which moved the newly created file/directory to another place. For multiple dentries pointing to a directory inode, maybe we can return an error (-ENOENT or -ESTALE). IDGI... Suppose we got non-NULL from ceph_lookup() there. So we'd sent either CEPH_MDS_OP_LOOKUP or CEPH_MDS_OP_LOOKUPSNAP and got ->r_dentry changed (presumably in ceph_fill_trace(), after it got the sucker from splice_dentry(), i.e. ultimately d_splice_alias()). Right? In theory, yes. But I think it never happens in reality. Because the parent directory of the newly created file/directory is locked, the client has no other way to look up the newly created file/directory. Now, we have acquired a reference to it (in ceph_finish_lookup()). So far, so good; for one thing, we definitely do *NOT* want to forget about that reference. For another, we've got a good and valid dentry; so why are we playing with the originally passed one? Just unhash the original and be done with that... I think the reason is that the VFS still uses the original dentry.
(for example, after calling the filesystem's mkdir callback, vfs_mkdir() calls fsnotify_mkdir(), passing the original dentry to it) Looking at the code downstream from the calls of ceph_handle_notrace_create(), ceph_mkdir() looks very dubious - we do ceph_init_inode_acls(dentry->d_inode, &acls); there, after having set ->d_inode to that of a preexisting alias. Is that really correct in the case when such an alias used to exist? You are right, it's incorrect. How about making ceph_handle_notrace_create() return an error when it gets a non-NULL from ceph_lookup()? This method is not perfect but should work in most cases. Regards Yan, Zheng
Re: MDS has inconsistent performance
On Fri, Jan 16, 2015 at 2:37 PM, Gregory Farnum g...@gregs42.com wrote: On Thu, Jan 15, 2015 at 2:44 PM, Michael Sevilla mikesevil...@gmail.com wrote: Let me know if this works and/or you need anything else: https://www.dropbox.com/s/fq47w6jebnyluu0/lookup-logs.tar.gz?dl=0 Beware - the clients were on debug=10. Also, I tried this with the kernel client and it is more consistent; it does the 2 lookups per create on 1 client every single time. Mmmm, there are no mds logs of note here. :( I did look enough to see that: 1) The MDS is for some reason revoking caps on the file create, prompting the switch to double lookups, which it was not doing before. The client doesn't really have any visibility into why that would be the case; the best guess I can come up with is that maybe the MDS split the directory into multiple frags at this point -- do you have that enabled? 2) The only way we set the I_COMPLETE flag is when we create an empty directory, or when we do a complete listdir on one. That makes it pretty difficult to get the flag back (and so take the optimal create path) once you lose it. :( I'd love a better way to do so, but we'll have to look at what's involved in a bit of depth. I'm not sure why the kernel client is so much more cautious, but I think there were a number of troubles with the directory listing orders and things which were harder to solve there - I don't remember if we introduced the I_DIR_ORDERED flag in it or not. Zheng can talk more about that. What kernel client version are you using? And for a vanity data point, what kind of hardware is your MDS running on? :) For kernels before 3.18, the I_COMPLETE flag gets cleared once the directory is modified. The I_DIR_ORDERED flag was introduced in the 3.18 kernel.
I just tried the 3.18 kernel; unfortunately there is still a bug that prevents new directories from having the I_COMPLETE flag. Regards Yan, Zheng -Greg
Re: [PATCH] ceph: acl: Remove unused function
On Sun, Jan 4, 2015 at 7:44 AM, Rickard Strandqvist
<rickard_strandqv...@spectrumdigital.se> wrote:
> Remove the function ceph_get_cached_acl() that is not used anywhere.
>
> This was partially found by using a static code analysis program called cppcheck.
>
> Signed-off-by: Rickard Strandqvist <rickard_strandqv...@spectrumdigital.se>
> ---
>  fs/ceph/acl.c | 14 --------------
>  1 file changed, 14 deletions(-)
>
> diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
> index 5bd853b..64fa248 100644
> --- a/fs/ceph/acl.c
> +++ b/fs/ceph/acl.c
> @@ -40,20 +40,6 @@ static inline void ceph_set_cached_acl(struct inode *inode,
>  	spin_unlock(&ci->i_ceph_lock);
>  }
>
> -static inline struct posix_acl *ceph_get_cached_acl(struct inode *inode,
> -							int type)
> -{
> -	struct ceph_inode_info *ci = ceph_inode(inode);
> -	struct posix_acl *acl = ACL_NOT_CACHED;
> -
> -	spin_lock(&ci->i_ceph_lock);
> -	if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 0))
> -		acl = get_cached_acl(inode, type);
> -	spin_unlock(&ci->i_ceph_lock);
> -
> -	return acl;
> -}
> -
>  struct posix_acl *ceph_get_acl(struct inode *inode, int type)
>  {
>  	int size;
> --
> 1.7.10.4

added to testing branch. Thanks
[PATCH] ceph: fix setting empty extended attribute
Make sure 'value' is not NULL; otherwise __ceph_setxattr() will remove
the extended attribute.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/xattr.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 678b0d2..5a492ca 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -854,7 +854,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 	struct ceph_pagelist *pagelist = NULL;
 	int err;
 
-	if (value) {
+	if (size > 0) {
 		/* copy value into pagelist */
 		pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
 		if (!pagelist)
@@ -864,7 +864,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name,
 		err = ceph_pagelist_append(pagelist, value, size);
 		if (err)
 			goto out;
-	} else {
+	} else if (!value) {
 		flags |= CEPH_XATTR_REMOVE;
 	}
 
@@ -1001,6 +1001,9 @@ int ceph_setxattr(struct dentry *dentry, const char *name,
 	if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
 		return generic_setxattr(dentry, name, value, size, flags);
 
+	if (size == 0)
+		value = "";	/* empty EA, do not remove */
+
 	return __ceph_setxattr(dentry, name, value, size, flags);
 }
-- 
1.9.3
Re: [GIT PULL] Ceph updates for 3.19-rc1
My commit "ceph: add inline data to pagecache" incidentally adds
fs/ceph/super.h.rej. Please remove it from your branch. Sorry for the
inconvenience.

Yan, Zheng

On Thu, Dec 18, 2014 at 7:27 AM, Sage Weil <sw...@redhat.com> wrote:
> Hi Linus,
>
> Please pull the following Ceph updates from
>
>   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus
>
> The big item here is support for inline data for CephFS and for message
> signatures from Zheng. There are also several bug fixes, including
> interrupted flock request handling, 0-length xattrs, mksnap, cached
> readdir results, and a message version compat field. Finally there are
> several cleanups from Ilya, Dan, and Markus.
>
> Note that there is another series coming soon that fixes some bugs in
> the RBD 'lingering' requests, but it isn't quite ready yet.
>
> Thanks!
> sage
>
> Dan Carpenter (1):
>       ceph: do_sync is never initialized
>
> Ilya Dryomov (4):
>       libceph: nuke ceph_kvfree()
>       ceph: remove unused stringification macros
>       rbd: don't treat CEPH_OSD_OP_DELETE as extent op
>       libceph: fixup includes in pagelist.h
>
> John Spray (2):
>       libceph: update ceph_msg_header structure
>       ceph: message versioning fixes
>
> SF Markus Elfring (1):
>       ceph, rbd: delete unnecessary checks before two function calls
>
> Yan, Zheng (19):
>       ceph: fix file lock interruption
>       ceph: introduce a new inode flag indicating if cached dentries are ordered
>       libceph: store session key in cephx authorizer
>       libceph: message signature support
>       ceph: introduce global empty snap context
>       libceph: require cephx message signature by default
>       libceph: add SETXATTR/CMPXATTR osd operations support
>       libceph: add CREATE osd operation support
>       libceph: specify position of extent operation
>       ceph: parse inline data in MClientReply and MClientCaps
>       ceph: add inline data to pagecache
>       ceph: use getattr request to fetch inline data
>       ceph: fetch inline data when getting Fcr cap refs
>       ceph: sync read inline data
>       ceph: convert inline data to normal data before data write
>       ceph: flush inline version
>       ceph: support inline data feature
>       ceph: fix mksnap crash
>       ceph: fix setting empty extended attribute
>
>  drivers/block/rbd.c                | 11 +-
>  fs/ceph/addr.c                     | 273 +++--
>  fs/ceph/caps.c                     | 132 ++
>  fs/ceph/dir.c                      | 27 ++--
>  fs/ceph/file.c                     | 97 +++--
>  fs/ceph/inode.c                    | 59 ++--
>  fs/ceph/locks.c                    | 64 +++--
>  fs/ceph/mds_client.c               | 41 +-
>  fs/ceph/mds_client.h               | 10 ++
>  fs/ceph/snap.c                     | 37 -
>  fs/ceph/super.c                    | 16 ++-
>  fs/ceph/super.h                    | 55 ++--
>  fs/ceph/super.h.rej                | 10 ++
>  fs/ceph/xattr.c                    | 7 +-
>  include/linux/ceph/auth.h          | 26
>  include/linux/ceph/buffer.h        | 3 +-
>  include/linux/ceph/ceph_features.h | 1 +
>  include/linux/ceph/ceph_fs.h       | 10 +-
>  include/linux/ceph/libceph.h       | 2 +-
>  include/linux/ceph/messenger.h     | 9 +-
>  include/linux/ceph/msgr.h          | 11 +-
>  include/linux/ceph/osd_client.h    | 13 +-
>  include/linux/ceph/pagelist.h      | 4 +-
>  net/ceph/auth_x.c                  | 76 ++-
>  net/ceph/auth_x.h                  | 1 +
>  net/ceph/buffer.c                  | 4 +-
>  net/ceph/ceph_common.c             | 21 +--
>  net/ceph/messenger.c               | 34 -
>  net/ceph/osd_client.c              | 118
>  29 files changed, 992 insertions(+), 180 deletions(-)
>  create mode 100644 fs/ceph/super.h.rej
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/
[PATCH] ceph: fix mksnap crash
The mksnap reply only contains 'target'; it does not contain 'dentry'.
So it's wrong to use req->r_reply_info.head->is_dentry to detect a
traceless reply.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/dir.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 6526199..fcfd0ab 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -812,7 +812,9 @@ static int ceph_mkdir(struct inode *dir, struct dentry *dentry, umode_t mode)
 		acls.pagelist = NULL;
 	}
 	err = ceph_mdsc_do_request(mdsc, dir, req);
-	if (!err && !req->r_reply_info.head->is_dentry)
+	if (!err &&
+	    !req->r_reply_info.head->is_target &&
+	    !req->r_reply_info.head->is_dentry)
 		err = ceph_handle_notrace_create(dir, dentry);
 	ceph_mdsc_put_request(req);
 out:
-- 
1.9.3
[PATCH 01/12] libceph: add SETXATTR/CMPXATTR osd operations support
Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 include/linux/ceph/osd_client.h | 10 ++++++
 net/ceph/osd_client.c           | 38 ++++++++++++++++++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 03aeb27..1ad217b 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -87,6 +87,13 @@ struct ceph_osd_req_op {
 			struct ceph_osd_data osd_data;
 		} extent;
 		struct {
+			__le32 name_len;
+			__le32 value_len;
+			__u8 cmp_op;       /* CEPH_OSD_CMPXATTR_OP_* */
+			__u8 cmp_mode;     /* CEPH_OSD_CMPXATTR_MODE_* */
+			struct ceph_osd_data osd_data;
+		} xattr;
+		struct {
 			const char *class_name;
 			const char *method_name;
 			struct ceph_osd_data request_info;
@@ -295,6 +302,9 @@ extern void osd_req_op_cls_response_data_pages(struct ceph_osd_request *,
 extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
 				unsigned int which, u16 opcode,
 				const char *class, const char *method);
+extern void osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
+				  u16 opcode, const char *name, const void *value,
+				  size_t size, u8 cmp_op, u8 cmp_mode);
 extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
 				  unsigned int which, u16 opcode,
 				  u64 cookie, u64 version, int flag);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1f6c405..f8fb376 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -545,6 +545,34 @@ void osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
 }
 EXPORT_SYMBOL(osd_req_op_cls_init);
 
+void osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int which,
+			   u16 opcode, const char *name, const void *value,
+			   size_t size, u8 cmp_op, u8 cmp_mode)
+{
+	struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
+	struct ceph_pagelist *pagelist;
+	size_t payload_len;
+
+	pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
+	BUG_ON(!pagelist);
+	ceph_pagelist_init(pagelist);
+
+	payload_len = strlen(name);
+	op->xattr.name_len = payload_len;
+	ceph_pagelist_append(pagelist, name, payload_len);
+
+	op->xattr.value_len = size;
+	ceph_pagelist_append(pagelist, value, size);
+	payload_len += size;
+
+	op->xattr.cmp_op = cmp_op;
+	op->xattr.cmp_mode = cmp_mode;
+
+	ceph_osd_data_pagelist_init(&op->xattr.osd_data, pagelist);
+	op->payload_len = payload_len;
+}
+EXPORT_SYMBOL(osd_req_op_xattr_init);
+
 void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
 			   unsigned int which, u16 opcode,
 			   u64 cookie, u64 version, int flag)
@@ -676,6 +704,16 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
 		dst->alloc_hint.expected_write_size =
 			cpu_to_le64(src->alloc_hint.expected_write_size);
 		break;
+	case CEPH_OSD_OP_SETXATTR:
+	case CEPH_OSD_OP_CMPXATTR:
+		dst->xattr.name_len = cpu_to_le32(src->xattr.name_len);
+		dst->xattr.value_len = cpu_to_le32(src->xattr.value_len);
+		dst->xattr.cmp_op = src->xattr.cmp_op;
+		dst->xattr.cmp_mode = src->xattr.cmp_mode;
+		osd_data = &src->xattr.osd_data;
+		ceph_osdc_msg_data_add(req->r_request, osd_data);
+		request_data_len = osd_data->pagelist->length;
+		break;
 	default:
 		pr_err("unsupported osd opcode %s\n",
 		       ceph_osd_op_name(src->op));
-- 
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 05/12] ceph: add inline data to pagecache
Request reply and cap message can contain inline data. add inline data to the page cache if there is Fc cap. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/addr.c | 42 ++ fs/ceph/caps.c | 11 +++ fs/ceph/inode.c | 16 fs/ceph/inode.c.rej | 21 + fs/ceph/mds_client.c.rej | 10 ++ fs/ceph/super.h | 5 - fs/ceph/super.h.rej | 10 ++ include/linux/ceph/ceph_fs.h | 1 + 8 files changed, 115 insertions(+), 1 deletion(-) create mode 100644 fs/ceph/inode.c.rej create mode 100644 fs/ceph/mds_client.c.rej create mode 100644 fs/ceph/super.h.rej diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index f2c7aa8..7320b11 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1318,6 +1318,48 @@ out: return ret; } +void ceph_fill_inline_data(struct inode *inode, struct page *locked_page, + char *data, size_t len) +{ + struct address_space *mapping = inode-i_mapping; + struct page *page; + + if (locked_page) { + page = locked_page; + } else { + if (i_size_read(inode) == 0) + return; + page = find_or_create_page(mapping, 0, GFP_NOFS); + if (!page) + return; + if (PageUptodate(page)) { + unlock_page(page); + page_cache_release(page); + return; + } + } + + dout(fill_inline_data %p %llx.%llx len %lu locked_page %p\n, +inode, ceph_vinop(inode), len, locked_page); + + if (len 0) { + void *kaddr = kmap_atomic(page); + memcpy(kaddr, data, len); + kunmap_atomic(kaddr); + } + + if (page != locked_page) { + if (len PAGE_CACHE_SIZE) + zero_user_segment(page, len, PAGE_CACHE_SIZE); + else + flush_dcache_page(page); + + SetPageUptodate(page); + unlock_page(page); + page_cache_release(page); + } +} + static struct vm_operations_struct ceph_vmops = { .fault = ceph_filemap_fault, .page_mkwrite = ceph_page_mkwrite, diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index b88ae60..6372eb9 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2405,6 +2405,7 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc, bool queue_invalidate = false; bool queue_revalidate = false; bool deleted_inode = false; + bool 
fill_inline = false; dout(handle_cap_grant inode %p cap %p mds%d seq %d %s\n, inode, cap, mds, seq, ceph_cap_string(newcaps)); @@ -2578,6 +2579,13 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc, } BUG_ON(cap-issued ~cap-implemented); + if (inline_version 0 inline_version = ci-i_inline_version) { + ci-i_inline_version = inline_version; + if (ci-i_inline_version != CEPH_INLINE_NONE + (newcaps (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO))) + fill_inline = true; + } + spin_unlock(ci-i_ceph_lock); if (le32_to_cpu(grant-op) == CEPH_CAP_OP_IMPORT) { @@ -2591,6 +2599,9 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc, wake = true; } + if (fill_inline) + ceph_fill_inline_data(inode, NULL, inline_data, inline_len); + if (queue_trunc) { ceph_queue_vmtruncate(inode); ceph_queue_revalidate(inode); diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 72607c1..feea6a8 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -387,6 +387,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb) spin_lock_init(ci-i_ceph_lock); ci-i_version = 0; + ci-i_inline_version = 0; ci-i_time_warp_seq = 0; ci-i_ceph_flags = 0; ci-i_ordered_count = 0; @@ -676,6 +677,7 @@ static int fill_inode(struct inode *inode, bool wake = false; bool queue_trunc = false; bool new_version = false; + bool fill_inline = false; dout(fill_inode %p ino %llx.%llx v %llu had %llu\n, inode, ceph_vinop(inode), le64_to_cpu(info-version), @@ -875,8 +877,22 @@ static int fill_inode(struct inode *inode, ceph_vinop(inode)); __ceph_get_fmode(ci, cap_fmode); } + + if (iinfo-inline_version 0 + iinfo-inline_version = ci-i_inline_version) { + int cache_caps = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO; + ci-i_inline_version = iinfo-inline_version; + if (ci-i_inline_version != CEPH_INLINE_NONE + (le32_to_cpu(info-cap.caps
[PATCH 06/12] ceph: use getattr request to fetch inline data
Add a new parameter 'locked_page' to ceph_do_getattr(). If inline data in getattr reply will be copied to the page. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/inode.c | 34 +- fs/ceph/mds_client.h | 1 + fs/ceph/super.h | 7 ++- include/linux/ceph/ceph_fs.h | 2 ++ 4 files changed, 34 insertions(+), 10 deletions(-) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index feea6a8..0bdabd1 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -659,7 +659,7 @@ void ceph_fill_file_time(struct inode *inode, int issued, * Populate an inode based on info from mds. May be called on new or * existing inodes. */ -static int fill_inode(struct inode *inode, +static int fill_inode(struct inode *inode, struct page *locked_page, struct ceph_mds_reply_info_in *iinfo, struct ceph_mds_reply_dirfrag *dirinfo, struct ceph_mds_session *session, @@ -883,14 +883,15 @@ static int fill_inode(struct inode *inode, int cache_caps = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO; ci-i_inline_version = iinfo-inline_version; if (ci-i_inline_version != CEPH_INLINE_NONE - (le32_to_cpu(info-cap.caps) cache_caps)) + (locked_page || +(le32_to_cpu(info-cap.caps) cache_caps))) fill_inline = true; } spin_unlock(ci-i_ceph_lock); if (fill_inline) - ceph_fill_inline_data(inode, NULL, + ceph_fill_inline_data(inode, locked_page, iinfo-inline_data, iinfo-inline_len); if (wake) @@ -1080,7 +1081,8 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req, struct inode *dir = req-r_locked_dir; if (dir) { - err = fill_inode(dir, rinfo-diri, rinfo-dirfrag, + err = fill_inode(dir, NULL, +rinfo-diri, rinfo-dirfrag, session, req-r_request_started, -1, req-r_caps_reservation); if (err 0) @@ -1150,7 +1152,7 @@ retry_lookup: } req-r_target_inode = in; - err = fill_inode(in, rinfo-targeti, NULL, + err = fill_inode(in, req-r_locked_page, rinfo-targeti, NULL, session, req-r_request_started, (!req-r_aborted rinfo-head-result == 0) ? 
req-r_fmode : -1, @@ -1321,7 +1323,7 @@ static int readdir_prepopulate_inodes_only(struct ceph_mds_request *req, dout(new_inode badness got %d\n, err); continue; } - rc = fill_inode(in, rinfo-dir_in[i], NULL, session, + rc = fill_inode(in, NULL, rinfo-dir_in[i], NULL, session, req-r_request_started, -1, req-r_caps_reservation); if (rc 0) { @@ -1437,7 +1439,7 @@ retry_lookup: } } - if (fill_inode(in, rinfo-dir_in[i], NULL, session, + if (fill_inode(in, NULL, rinfo-dir_in[i], NULL, session, req-r_request_started, -1, req-r_caps_reservation) 0) { pr_err(fill_inode badness on %p\n, in); @@ -1920,7 +1922,8 @@ out_put: * Verify that we have a lease on the given mask. If not, * do a getattr against an mds. */ -int ceph_do_getattr(struct inode *inode, int mask, bool force) +int __ceph_do_getattr(struct inode *inode, struct page *locked_page, + int mask, bool force) { struct ceph_fs_client *fsc = ceph_sb_to_client(inode-i_sb); struct ceph_mds_client *mdsc = fsc-mdsc; @@ -1932,7 +1935,8 @@ int ceph_do_getattr(struct inode *inode, int mask, bool force) return 0; } - dout(do_getattr inode %p mask %s mode 0%o\n, inode, ceph_cap_string(mask), inode-i_mode); + dout(do_getattr inode %p mask %s mode 0%o\n, +inode, ceph_cap_string(mask), inode-i_mode); if (!force ceph_caps_issued_mask(ceph_inode(inode), mask, 1)) return 0; @@ -1943,7 +1947,19 @@ int ceph_do_getattr(struct inode *inode, int mask, bool force) ihold(inode); req-r_num_caps = 1; req-r_args.getattr.mask = cpu_to_le32(mask); + req-r_locked_page = locked_page; err = ceph_mdsc_do_request(mdsc, NULL, req); + if (locked_page err == 0) { + u64 inline_version = req-r_reply_info.targeti.inline_version; + if (inline_version == 0) { + /* the reply is supposed to contain inline data
[PATCH 0/12] ceph: read/modify inline data support
Hi,

This patch series allows the CephFS kernel client to read/modify inline
data. When modifying an inode with inline data, the inline data is
always converted to a normal object first. This patch series does not
support storing new data as inline.

Regards
Yan, Zheng
[PATCH 04/12] ceph: parse inline data in MClientReply and MClientCaps
Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/caps.c | 34 +++--- fs/ceph/mds_client.c | 10 ++ fs/ceph/mds_client.h | 3 +++ 3 files changed, 36 insertions(+), 11 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index eb1bf1f..b88ae60 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2383,6 +2383,8 @@ static void invalidate_aliases(struct inode *inode) static void handle_cap_grant(struct ceph_mds_client *mdsc, struct inode *inode, struct ceph_mds_caps *grant, void *snaptrace, int snaptrace_len, +u64 inline_version, +void *inline_data, int inline_len, struct ceph_buffer *xattr_buf, struct ceph_mds_session *session, struct ceph_cap *cap, int issued) @@ -2996,11 +2998,12 @@ void ceph_handle_caps(struct ceph_mds_session *session, u64 cap_id; u64 size, max_size; u64 tid; + u64 inline_version = 0; + void *inline_data = NULL; + u32 inline_len = 0; void *snaptrace; size_t snaptrace_len; - void *flock; - void *end; - u32 flock_len; + void *p, *end; dout(handle_caps from mds%d\n, mds); @@ -3021,30 +3024,37 @@ void ceph_handle_caps(struct ceph_mds_session *session, snaptrace = h + 1; snaptrace_len = le32_to_cpu(h-snap_trace_len); + p = snaptrace + snaptrace_len; if (le16_to_cpu(msg-hdr.version) = 2) { - void *p = snaptrace + snaptrace_len; + u32 flock_len; ceph_decode_32_safe(p, end, flock_len, bad); if (p + flock_len end) goto bad; - flock = p; - } else { - flock = NULL; - flock_len = 0; + p += flock_len; } if (le16_to_cpu(msg-hdr.version) = 3) { if (op == CEPH_CAP_OP_IMPORT) { - void *p = flock + flock_len; if (p + sizeof(*peer) end) goto bad; peer = p; + p += sizeof(*peer); } else if (op == CEPH_CAP_OP_EXPORT) { /* recorded in unused fields */ peer = (void *)h-size; } } + if (le16_to_cpu(msg-hdr.version) = 4) { + ceph_decode_64_safe(p, end, inline_version, bad); + ceph_decode_32_safe(p, end, inline_len, bad); + if (p + inline_len end) + goto bad; + inline_data = p; + p += inline_len; + } + /* lookup ino */ inode = ceph_find_inode(sb, vino); ci = 
ceph_inode(inode); @@ -3085,6 +3095,7 @@ void ceph_handle_caps(struct ceph_mds_session *session, handle_cap_import(mdsc, inode, h, peer, session, cap, issued); handle_cap_grant(mdsc, inode, h, snaptrace, snaptrace_len, +inline_version, inline_data, inline_len, msg-middle, session, cap, issued); goto done_unlocked; } @@ -3105,8 +3116,9 @@ void ceph_handle_caps(struct ceph_mds_session *session, case CEPH_CAP_OP_GRANT: __ceph_caps_issued(ci, issued); issued |= __ceph_caps_dirty(ci); - handle_cap_grant(mdsc, inode, h, NULL, 0, msg-middle, -session, cap, issued); + handle_cap_grant(mdsc, inode, h, NULL, 0, +inline_version, inline_data, inline_len, +msg-middle, session, cap, issued); goto done_unlocked; case CEPH_CAP_OP_FLUSH_ACK: diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 387a489..d2171f4 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -89,6 +89,16 @@ static int parse_reply_info_in(void **p, void *end, ceph_decode_need(p, end, info-xattr_len, bad); info-xattr_data = *p; *p += info-xattr_len; + + if (features CEPH_FEATURE_MDS_INLINE_DATA) { + ceph_decode_64_safe(p, end, info-inline_version, bad); + ceph_decode_32_safe(p, end, info-inline_len, bad); + ceph_decode_need(p, end, info-inline_len, bad); + info-inline_data = *p; + *p += info-inline_len; + } else + info-inline_version = CEPH_INLINE_NONE; + return 0; bad: return err; diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index 230bda7..01f5e4c 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -41,6 +41,9 @@ struct ceph_mds_reply_info_in { char *symlink; u32 xattr_len
[PATCH 07/12] ceph: fetch inline data while getting Fcr cap refs
we can't use getattr to fetch inline data after getting Fcr caps, because it can cause deadlock. The solution is try bringing inline data to page cache when not holding any cap, and hope the inline data page is still there after getting the Fcr caps. If the page is still there, pin it in page cache for later IO. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/addr.c | 9 +++-- fs/ceph/caps.c | 55 ++- fs/ceph/file.c | 12 +--- 3 files changed, 62 insertions(+), 14 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 7320b11..fdbdbbe 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1207,6 +1207,7 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct inode *inode = file_inode(vma-vm_file); struct ceph_inode_info *ci = ceph_inode(inode); struct ceph_file_info *fi = vma-vm_file-private_data; + struct page *pinned_page = NULL; loff_t off = vmf-pgoff PAGE_CACHE_SHIFT; int want, got, ret; @@ -1218,7 +1219,8 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) want = CEPH_CAP_FILE_CACHE; while (1) { got = 0; - ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, want, got, -1); + ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, want, -1, + got, pinned_page); if (ret == 0) break; if (ret != -ERESTARTSYS) { @@ -1233,6 +1235,8 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) dout(filemap_fault %p %llu~%zd dropping cap refs on %s ret %d\n, inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got), ret); + if (pinned_page) + page_cache_release(pinned_page); ceph_put_cap_refs(ci, got); return ret; @@ -1266,7 +1270,8 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf) want = CEPH_CAP_FILE_BUFFER; while (1) { got = 0; - ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, got, off + len); + ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len, + got, NULL); if (ret == 0) break; if (ret != -ERESTARTSYS) { diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c 
index 6372eb9..3bc0908 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2057,15 +2057,18 @@ static void __take_cap_refs(struct ceph_inode_info *ci, int got) * requested from the MDS. */ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want, - int *got, loff_t endoff, int *check_max, int *err) + loff_t endoff, int *got, struct page **pinned_page, + int *check_max, int *err) { struct inode *inode = ci-vfs_inode; + struct page *page = NULL; int ret = 0; - int have, implemented; + int have, implemented, _got = 0; int file_wanted; dout(get_cap_refs %p need %s want %s\n, inode, ceph_cap_string(need), ceph_cap_string(want)); +again: spin_lock(ci-i_ceph_lock); /* make sure file is actually open */ @@ -2120,8 +2123,8 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want, inode, ceph_cap_string(have), ceph_cap_string(not), ceph_cap_string(revoking)); if ((revoking not) == 0) { - *got = need | (have want); - __take_cap_refs(ci, *got); + _got = need | (have want); + __take_cap_refs(ci, _got); ret = 1; } } else { @@ -2130,8 +2133,42 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want, } out: spin_unlock(ci-i_ceph_lock); + + if ((_got (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) + ci-i_inline_version != CEPH_INLINE_NONE + i_size_read(inode) 0) { + page = find_get_page(inode-i_mapping, 0); + if (page) { + if (!PageUptodate(page)) { + page_cache_release(page); + page = NULL; + } + } + if (page) { + *pinned_page = page; + } else { + int ret1; + /* +* drop cap refs first because getattr while holding +* caps refs can cause deadlock. +*/ + ceph_put_cap_refs(ci, _got); + _got = 0; + /* getattt will bring inline data to page
[PATCH 12/12] ceph: support inline data feature
Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2b5481f..c37f1b9 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -515,7 +515,8 @@ static struct ceph_fs_client *create_fs_client(struct ceph_mount_options *fsopt,
 	struct ceph_fs_client *fsc;
 	const u64 supported_features =
 		CEPH_FEATURE_FLOCK |
-		CEPH_FEATURE_DIRLAYOUTHASH;
+		CEPH_FEATURE_DIRLAYOUTHASH |
+		CEPH_FEATURE_MDS_INLINE_DATA;
 	const u64 required_features = 0;
 	int page_count;
 	size_t size;
-- 
1.9.3
[PATCH 11/12] ceph: flush inline version
After converting inline data to normal data, client need to flush the new i_inline_version (CEPH_INLINE_NONE) to MDS. This commit makes cap messages (sent to MDS) contain inline_version and inline_data. Client always converts inline data to normal data before data write, so the inline data length part is always zero. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/caps.c | 24 fs/ceph/snap.c | 2 ++ fs/ceph/super.h | 1 + 3 files changed, 23 insertions(+), 4 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 3bc0908..77c4068 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -975,10 +975,12 @@ static int send_cap_msg(struct ceph_mds_session *session, kuid_t uid, kgid_t gid, umode_t mode, u64 xattr_version, struct ceph_buffer *xattrs_buf, - u64 follows) + u64 follows, bool inline_data) { struct ceph_mds_caps *fc; struct ceph_msg *msg; + void *p; + size_t extra_len; dout(send_cap_msg %s %llx %llx caps %s wanted %s dirty %s seq %u/%u mseq %u follows %lld size %llu/%llu @@ -988,7 +990,10 @@ static int send_cap_msg(struct ceph_mds_session *session, seq, issue_seq, mseq, follows, size, max_size, xattr_version, xattrs_buf ? (int)xattrs_buf-vec.iov_len : 0); - msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc), GFP_NOFS, false); + /* flock buffer size + inline version + inline data size */ + extra_len = 4 + 8 + 4; + msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len, + GFP_NOFS, false); if (!msg) return -ENOMEM; @@ -1020,6 +1025,14 @@ static int send_cap_msg(struct ceph_mds_session *session, fc-gid = cpu_to_le32(from_kgid(init_user_ns, gid)); fc-mode = cpu_to_le32(mode); + p = fc + 1; + /* flock buffer size */ + ceph_encode_32(p, 0); + /* inline version */ + ceph_encode_64(p, inline_data ? 
0 : CEPH_INLINE_NONE); + /* inline data size */ + ceph_encode_32(p, 0); + fc-xattr_version = cpu_to_le64(xattr_version); if (xattrs_buf) { msg-middle = ceph_buffer_get(xattrs_buf); @@ -1126,6 +1139,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap, u64 flush_tid = 0; int i; int ret; + bool inline_data; held = cap-issued | cap-implemented; revoking = cap-implemented ~cap-issued; @@ -1209,13 +1223,15 @@ static int __send_cap(struct ceph_mds_client *mdsc, struct ceph_cap *cap, xattr_version = ci-i_xattrs.version; } + inline_data = ci-i_inline_version != CEPH_INLINE_NONE; + spin_unlock(ci-i_ceph_lock); ret = send_cap_msg(session, ceph_vino(inode).ino, cap_id, op, keep, want, flushing, seq, flush_tid, issue_seq, mseq, size, max_size, mtime, atime, time_warp_seq, uid, gid, mode, xattr_version, xattr_blob, - follows); + follows, inline_data); if (ret 0) { dout(error sending cap msg, must requeue %p\n, inode); delayed = 1; @@ -1336,7 +1352,7 @@ retry: capsnap-time_warp_seq, capsnap-uid, capsnap-gid, capsnap-mode, capsnap-xattr_version, capsnap-xattr_blob, -capsnap-follows); +capsnap-follows, capsnap-inline_data); next_follows = capsnap-follows + 1; ceph_put_cap_snap(capsnap); diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c index 24b454a..28571cf 100644 --- a/fs/ceph/snap.c +++ b/fs/ceph/snap.c @@ -515,6 +515,8 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci) capsnap-xattr_version = 0; } + capsnap-inline_data = ci-i_inline_version != CEPH_INLINE_NONE; + /* dirty page count moved from _head to this cap_snap; all subsequent writes page dirties occur _after_ this snapshot. 
*/ diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 02de104..33885ad 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -161,6 +161,7 @@ struct ceph_cap_snap { u64 time_warp_seq; int writing; /* a sync write is still in progress */ int dirty_pages; /* dirty pages awaiting writeback */ + bool inline_data; }; static inline void ceph_put_cap_snap(struct ceph_cap_snap *capsnap) -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 08/12] ceph: sync read inline data
we can't use getattr to fetch inline data while holding Fr cap, because it can cause deadlock. If we need to sync read inline data, drop cap refs first, then use getattr to fetch inline data. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/addr.c | 7 ++- fs/ceph/file.c | 63 ++ 2 files changed, 61 insertions(+), 9 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index fdbdbbe..9dcb4a9 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -194,8 +194,10 @@ static int readpage_nounlock(struct file *filp, struct page *page) int err = 0; u64 len = PAGE_CACHE_SIZE; - err = ceph_readpage_from_fscache(inode, page); + if (ci-i_inline_version != CEPH_INLINE_NONE) + return -EINVAL; + err = ceph_readpage_from_fscache(inode, page); if (err == 0) goto out; @@ -384,6 +386,9 @@ static int ceph_readpages(struct file *file, struct address_space *mapping, int rc = 0; int max = 0; + if (ceph_inode(inode)-i_inline_version != CEPH_INLINE_NONE) + return -EINVAL; + rc = ceph_readpages_from_fscache(mapping-host, mapping, page_list, nr_pages); diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 861b995..5b092bd 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -333,6 +333,11 @@ int ceph_release(struct inode *inode, struct file *file) return 0; } +enum { + CHECK_EOF = 1, + READ_INLINE = 2, +}; + /* * Read a range of bytes striped over one or more objects. Iterate over * objects we stripe over. (That's not atomic, but good enough for now.) @@ -412,7 +417,7 @@ more: ret = read; /* did we bounce off eof? 
*/ if (pos + left inode-i_size) - *checkeof = 1; + *checkeof = CHECK_EOF; } dout(striped_read returns %d\n, ret); @@ -808,7 +813,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct iov_iter *to) struct page *pinned_page = NULL; ssize_t ret; int want, got = 0; - int checkeof = 0, read = 0; + int retry_op = 0, read = 0; again: dout(aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n, @@ -830,8 +835,12 @@ again: inode, ceph_vinop(inode), iocb-ki_pos, (unsigned)len, ceph_cap_string(got)); - /* hmm, this isn't really async... */ - ret = ceph_sync_read(iocb, to, checkeof); + if (ci-i_inline_version == CEPH_INLINE_NONE) { + /* hmm, this isn't really async... */ + ret = ceph_sync_read(iocb, to, retry_op); + } else { + retry_op = READ_INLINE; + } } else { dout(aio_read %p %llx.%llx %llu~%u got cap refs on %s\n, inode, ceph_vinop(inode), iocb-ki_pos, (unsigned)len, @@ -846,12 +855,50 @@ again: pinned_page = NULL; } ceph_put_cap_refs(ci, got); + if (retry_op ret = 0) { + int statret; + struct page *page = NULL; + loff_t i_size; + if (retry_op == READ_INLINE) { + page = __page_cache_alloc(GFP_NOFS); + if (!page) + return -ENOMEM; + } + + statret = __ceph_do_getattr(inode, page, + CEPH_STAT_CAP_INLINE_DATA, !!page); + if (statret 0) { +__free_page(page); + if (statret == -ENODATA) { + BUG_ON(retry_op != READ_INLINE); + goto again; + } + return statret; + } - if (checkeof ret = 0) { - int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false); + i_size = i_size_read(inode); + if (retry_op == READ_INLINE) { + /* does not support inline data PAGE_SIZE */ + if (i_size PAGE_CACHE_SIZE) { + ret = -EIO; + } else if (iocb-ki_pos i_size) { + loff_t end = min_t(loff_t, i_size, + iocb-ki_pos + len); + if (statret end) + zero_user_segment(page, statret, end); + ret = copy_page_to_iter(page, + iocb-ki_pos ~PAGE_MASK, + end - iocb-ki_pos, to); + iocb-ki_pos += ret; + } else { + ret = 0
[PATCH 10/12] ceph: convert inline data for memory mapped read
We can't use getattr to fetch inline data while holding Fr caps, If there is no Fc cap, ceph_get_caps() does not bring inline data to page cache neither. To simplify this case, just convert inline data to normal data. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/addr.c | 17 +++-- 1 file changed, 15 insertions(+), 2 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 72336bd..1768919 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1215,7 +1215,7 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) struct page *pinned_page = NULL; loff_t off = vmf-pgoff PAGE_CACHE_SHIFT; int want, got, ret; - +again: dout(filemap_fault %p %llx.%llx %llu~%zd trying to get caps\n, inode, ceph_vinop(inode), off, (size_t)PAGE_CACHE_SIZE); if (fi-fmode CEPH_FILE_MODE_LAZY) @@ -1236,7 +1236,10 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) dout(filemap_fault %p %llu~%zd got cap refs on %s\n, inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got)); - ret = filemap_fault(vma, vmf); + if ((got want) || ci-i_inline_version == CEPH_INLINE_NONE) + ret = filemap_fault(vma, vmf); + else + ret = -EAGAIN; /* readpage() can't fetch inline data */ dout(filemap_fault %p %llu~%zd dropping cap refs on %s ret %d\n, inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got), ret); @@ -1244,6 +1247,16 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, struct vm_fault *vmf) page_cache_release(pinned_page); ceph_put_cap_refs(ci, got); + if (ret == -EAGAIN) { + ret = ceph_uninline_data(vma-vm_file, NULL); + if (ret 0) + return VM_FAULT_SIGBUS; + spin_lock(ci-i_ceph_lock); + ci-i_inline_version = CEPH_INLINE_NONE; + spin_unlock(ci-i_ceph_lock); + goto again; + } + return ret; } -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
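The fault-path logic this patch adds (fail the fault with -EAGAIN when the file still holds inline data, convert it with ceph_uninline_data(), then retry) can be sketched as a toy state machine. This is an illustration only, not the kernel code: the names `fault_once`, `fault_with_retry`, and `TOY_EAGAIN` are invented for the example, and the real conversion involves caps and OSD writes.

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in errno value for the sketch; the kernel uses -EAGAIN. */
#define TOY_EAGAIN 11

/* Model of one fault attempt: readpage() cannot fetch inline data,
 * so the fault fails when the file is still inline. */
static int fault_once(bool *inline_data)
{
    if (*inline_data)
        return -TOY_EAGAIN;   /* readpage() can't fetch inline data */
    return 0;                 /* normal filemap_fault() path */
}

/* Model of the patch's retry loop: on -EAGAIN, convert inline data
 * to normal object data (ceph_uninline_data() in the real code) and
 * retry the fault from the top. */
static int fault_with_retry(bool *inline_data, int *conversions)
{
    int ret;
again:
    ret = fault_once(inline_data);
    if (ret == -TOY_EAGAIN) {
        *inline_data = false; /* convert inline data, then retry */
        (*conversions)++;
        goto again;
    }
    return ret;
}
```

After one conversion the file stays non-inline, so later faults take the normal path without retrying.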
[PATCH] libceph: require cephx message signature by default
Signed-off-by: Yan, Zheng z...@redhat.com --- include/linux/ceph/libceph.h | 1 + net/ceph/ceph_common.c | 13 + 2 files changed, 14 insertions(+) diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index d293f7e..8b11a79 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -29,6 +29,7 @@ #define CEPH_OPT_NOSHARE (11) /* don't share client with other sbs */ #define CEPH_OPT_MYIP (12) /* specified my ip */ #define CEPH_OPT_NOCRC(13) /* no data crc on writes */ +#define CEPH_OPT_NOMSGAUTH (14) /* not require cephx message signature */ #define CEPH_OPT_DEFAULT (0) diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index d361a274..b22d82c 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -237,6 +237,8 @@ enum { Opt_noshare, Opt_crc, Opt_nocrc, + Opt_cephx_require_signature, + Opt_cephx_require_no_signature, }; static match_table_t opt_tokens = { @@ -255,6 +257,8 @@ static match_table_t opt_tokens = { {Opt_noshare, noshare}, {Opt_crc, crc}, {Opt_nocrc, nocrc}, + {Opt_cephx_require_signature, cephx_require_signature}, + {Opt_cephx_require_no_signature, cephx_require_no_signature}, {-1, NULL} }; @@ -453,6 +457,12 @@ ceph_parse_options(char *options, const char *dev_name, case Opt_nocrc: opt-flags |= CEPH_OPT_NOCRC; break; + case Opt_cephx_require_signature: + opt-flags = ~CEPH_OPT_NOMSGAUTH; + break; + case Opt_cephx_require_no_signature: + opt-flags |= CEPH_OPT_NOMSGAUTH; + break; default: BUG_ON(token); @@ -496,6 +506,9 @@ struct ceph_client *ceph_create_client(struct ceph_options *opt, void *private, init_waitqueue_head(client-auth_wq); client-auth_err = 0; + if (!ceph_test_opt(client, NOMSGAUTH)) + required_features |= CEPH_FEATURE_MSG_AUTH; + client-extra_mon_dispatch = NULL; client-supported_features = CEPH_FEATURES_SUPPORTED_DEFAULT | supported_features; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More 
[PATCH] libceph: require cephx message signature by default
Signed-off-by: Yan, Zheng z...@redhat.com --- include/linux/ceph/libceph.h | 1 + net/ceph/ceph_common.c | 13 + 2 files changed, 14 insertions(+) diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h index d293f7e..8b11a79 100644 --- a/include/linux/ceph/libceph.h +++ b/include/linux/ceph/libceph.h @@ -29,6 +29,7 @@ #define CEPH_OPT_NOSHARE (11) /* don't share client with other sbs */ #define CEPH_OPT_MYIP (12) /* specified my ip */ #define CEPH_OPT_NOCRC(13) /* no data crc on writes */ +#define CEPH_OPT_NOMSGAUTH (14) /* not require cephx message signature */ #define CEPH_OPT_DEFAULT (0) diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c index d361a274..5d5ab67 100644 --- a/net/ceph/ceph_common.c +++ b/net/ceph/ceph_common.c @@ -237,6 +237,8 @@ enum { Opt_noshare, Opt_crc, Opt_nocrc, + Opt_cephx_require_signatures, + Opt_nocephx_require_signatures, }; static match_table_t opt_tokens = { @@ -255,6 +257,8 @@ static match_table_t opt_tokens = { {Opt_noshare, noshare}, {Opt_crc, crc}, {Opt_nocrc, nocrc}, + {Opt_cephx_require_signatures, cephx_require_signatures}, + {Opt_nocephx_require_signatures, nocephx_require_signatures}, {-1, NULL} }; @@ -453,6 +457,12 @@ ceph_parse_options(char *options, const char *dev_name, case Opt_nocrc: opt-flags |= CEPH_OPT_NOCRC; break; + case Opt_cephx_require_signatures: + opt-flags = ~CEPH_OPT_NOMSGAUTH; + break; + case Opt_nocephx_require_signatures: + opt-flags |= CEPH_OPT_NOMSGAUTH; + break; default: BUG_ON(token); @@ -496,6 +506,9 @@ struct ceph_client *ceph_create_client(struct ceph_options *opt, void *private, init_waitqueue_head(client-auth_wq); client-auth_err = 0; + if (!ceph_test_opt(client, NOMSGAUTH)) + required_features |= CEPH_FEATURE_MSG_AUTH; + client-extra_mon_dispatch = NULL; client-supported_features = CEPH_FEATURES_SUPPORTED_DEFAULT | supported_features; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org 
[PATCH 2/2] libceph: message signature support
Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/mds_client.c | 16 +++ include/linux/ceph/auth.h | 26 + include/linux/ceph/ceph_features.h | 1 + include/linux/ceph/messenger.h | 9 +- include/linux/ceph/msgr.h | 8 ++ net/ceph/auth_x.c | 58 ++ net/ceph/messenger.c | 32 +++-- net/ceph/osd_client.c | 16 +++ 8 files changed, 162 insertions(+), 4 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 2eab332..14ca763 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -3668,6 +3668,20 @@ static struct ceph_msg *mds_alloc_msg(struct ceph_connection *con, return msg; } +static int sign_message(struct ceph_connection *con, struct ceph_msg *msg) +{ + struct ceph_mds_session *s = con-private; + struct ceph_auth_handshake *auth = s-s_auth; + return ceph_auth_sign_message(auth, msg); +} + +static int check_message_signature(struct ceph_connection *con, struct ceph_msg *msg) +{ + struct ceph_mds_session *s = con-private; + struct ceph_auth_handshake *auth = s-s_auth; + return ceph_auth_check_message_signature(auth, msg); +} + static const struct ceph_connection_operations mds_con_ops = { .get = con_get, .put = con_put, @@ -3677,6 +3691,8 @@ static const struct ceph_connection_operations mds_con_ops = { .invalidate_authorizer = invalidate_authorizer, .peer_reset = peer_reset, .alloc_msg = mds_alloc_msg, + .sign_message = sign_message, + .check_message_signature = check_message_signature, }; /* eof */ diff --git a/include/linux/ceph/auth.h b/include/linux/ceph/auth.h index 5f33868..260d78b 100644 --- a/include/linux/ceph/auth.h +++ b/include/linux/ceph/auth.h @@ -13,6 +13,7 @@ struct ceph_auth_client; struct ceph_authorizer; +struct ceph_msg; struct ceph_auth_handshake { struct ceph_authorizer *authorizer; @@ -20,6 +21,10 @@ struct ceph_auth_handshake { size_t authorizer_buf_len; void *authorizer_reply_buf; size_t authorizer_reply_buf_len; + int (*sign_message)(struct ceph_auth_handshake *auth, + struct ceph_msg *msg); + int 
(*check_message_signature)(struct ceph_auth_handshake *auth, + struct ceph_msg *msg); }; struct ceph_auth_client_ops { @@ -66,6 +71,11 @@ struct ceph_auth_client_ops { void (*reset)(struct ceph_auth_client *ac); void (*destroy)(struct ceph_auth_client *ac); + + int (*sign_message)(struct ceph_auth_handshake *auth, + struct ceph_msg *msg); + int (*check_message_signature)(struct ceph_auth_handshake *auth, + struct ceph_msg *msg); }; struct ceph_auth_client { @@ -113,4 +123,20 @@ extern int ceph_auth_verify_authorizer_reply(struct ceph_auth_client *ac, extern void ceph_auth_invalidate_authorizer(struct ceph_auth_client *ac, int peer_type); +static inline int ceph_auth_sign_message(struct ceph_auth_handshake *auth, +struct ceph_msg *msg) +{ + if (auth-sign_message) + return auth-sign_message(auth, msg); + return 0; +} + +static inline +int ceph_auth_check_message_signature(struct ceph_auth_handshake *auth, + struct ceph_msg *msg) +{ + if (auth-check_message_signature) + return auth-check_message_signature(auth, msg); + return 0; +} #endif diff --git a/include/linux/ceph/ceph_features.h b/include/linux/ceph/ceph_features.h index d12659c..71e05bb 100644 --- a/include/linux/ceph/ceph_features.h +++ b/include/linux/ceph/ceph_features.h @@ -84,6 +84,7 @@ static inline u64 ceph_sanitize_features(u64 features) CEPH_FEATURE_PGPOOL3 | \ CEPH_FEATURE_OSDENC | \ CEPH_FEATURE_CRUSH_TUNABLES | \ +CEPH_FEATURE_MSG_AUTH |\ CEPH_FEATURE_CRUSH_TUNABLES2 | \ CEPH_FEATURE_REPLY_CREATE_INODE | \ CEPH_FEATURE_OSDHASHPSPOOL | \ diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h index 40ae58e..d9d396c 100644 --- a/include/linux/ceph/messenger.h +++ b/include/linux/ceph/messenger.h @@ -42,6 +42,10 @@ struct ceph_connection_operations { struct ceph_msg * (*alloc_msg) (struct ceph_connection *con, struct ceph_msg_header *hdr, int *skip); + int (*sign_message) (struct ceph_connection *con, struct ceph_msg *msg); + + int (*check_message_signature) (struct 
ceph_connection *con, + struct ceph_msg *msg); }; /* use
[PATCH 1/2] libceph: store session key in cephx authorizer
Session key is required when calculating message signature. Save the session key in authorizer, this avoid lookup ticket handler for each message Signed-off-by: Yan, Zheng z...@redhat.com --- net/ceph/auth_x.c | 18 +++--- net/ceph/auth_x.h | 1 + 2 files changed, 12 insertions(+), 7 deletions(-) diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c index de6662b..8da8568 100644 --- a/net/ceph/auth_x.c +++ b/net/ceph/auth_x.c @@ -298,6 +298,11 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac, dout(build_authorizer for %s %p\n, ceph_entity_type_name(th-service), au); + ceph_crypto_key_destroy(au-session_key); + ret = ceph_crypto_key_clone(au-session_key, th-session_key); + if (ret) + return ret; + maxlen = sizeof(*msg_a) + sizeof(msg_b) + ceph_x_encrypt_buflen(ticket_blob_len); dout( need len %d\n, maxlen); @@ -307,8 +312,10 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac, } if (!au-buf) { au-buf = ceph_buffer_new(maxlen, GFP_NOFS); - if (!au-buf) + if (!au-buf) { + ceph_crypto_key_destroy(au-session_key); return -ENOMEM; + } } au-service = th-service; au-secret_id = th-secret_id; @@ -334,7 +341,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client *ac, get_random_bytes(au-nonce, sizeof(au-nonce)); msg_b.struct_v = 1; msg_b.nonce = cpu_to_le64(au-nonce); - ret = ceph_x_encrypt(th-session_key, msg_b, sizeof(msg_b), + ret = ceph_x_encrypt(au-session_key, msg_b, sizeof(msg_b), p, end - p); if (ret 0) goto out_buf; @@ -593,17 +600,13 @@ static int ceph_x_verify_authorizer_reply(struct ceph_auth_client *ac, struct ceph_authorizer *a, size_t len) { struct ceph_x_authorizer *au = (void *)a; - struct ceph_x_ticket_handler *th; int ret = 0; struct ceph_x_authorize_reply reply; void *preply = reply; void *p = au-reply_buf; void *end = p + sizeof(au-reply_buf); - th = get_ticket_handler(ac, au-service); - if (IS_ERR(th)) - return PTR_ERR(th); - ret = ceph_x_decrypt(th-session_key, p, end, preply, sizeof(reply)); + ret = 
ceph_x_decrypt(au-session_key, p, end, preply, sizeof(reply)); if (ret 0) return ret; if (ret != sizeof(reply)) @@ -623,6 +626,7 @@ static void ceph_x_destroy_authorizer(struct ceph_auth_client *ac, { struct ceph_x_authorizer *au = (void *)a; + ceph_crypto_key_destroy(au-session_key); ceph_buffer_put(au-buf); kfree(au); } diff --git a/net/ceph/auth_x.h b/net/ceph/auth_x.h index 65ee720..e8b7c69 100644 --- a/net/ceph/auth_x.h +++ b/net/ceph/auth_x.h @@ -26,6 +26,7 @@ struct ceph_x_ticket_handler { struct ceph_x_authorizer { + struct ceph_crypto_key session_key; struct ceph_buffer *buf; unsigned int service; u64 nonce; -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: design a rados-based distributed kv store support scan op
On Tue, Sep 30, 2014 at 5:23 PM, Plato Zhang <sango...@gmail.com> wrote:
> Hi all,
>
> We plan to develop a new distributed key-value storage engine, based on
> RADOS and supporting range-scan operations. It will be open source.
>
> We know RADOS is already a kv storage engine, and it can do pretty well
> if we use KeyValueStore. However, it lacks the ability to scan a key
> range, since keys are hashed into PGs. In our scenarios, scanning is a
> common need.
>
> We considered splitting our key space into a fixed number of ranges,
> mapping each range to a RADOS object, and then storing key/value pairs
> in those objects. However, there are two main disadvantages:
>
> 1. We cannot adjust the number of ranges as the cluster scales.
> 2. For a good split, we must accurately predict the distribution of
>    keys, or we will encounter serious load-balance problems.
>
> So we decided to develop a new service on top of RADOS (we want to use
> RADOS to unify our storage infrastructure). Before we start, we have
> some questions:
>
> 1. Is there already any similar (scan-supporting, RADOS-based)
>    system/module?

FYI: directory fragments in the MDS satisfy all of the above requirements,
but the code is not general enough to be used outside the MDS.

> 2. Besides the basic key-value functions, what features are you most
>    interested in?
> 3. Any other suggestions?
>
> Thanks!
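The fixed-range scheme the mail describes (split the key space at chosen boundary keys, map each range to one RADOS object) amounts to a binary search over sorted boundaries. The sketch below is purely illustrative: `range_for_key` and the split points are invented names, not part of any rados API, and it also shows where the load-balance concern comes from — a bad choice of boundaries cannot be changed later without resplitting objects.

```c
#include <string.h>

/*
 * Locate the object index holding `key` under a fixed-range split.
 * `splits` is a sorted array of boundary keys; range i covers
 * [splits[i-1], splits[i]), with range 0 open below and range
 * nsplits open above.  Illustrative sketch only.
 */
static int range_for_key(const char *key, const char *const *splits,
                         int nsplits)
{
    int lo = 0, hi = nsplits;

    while (lo < hi) {
        int mid = lo + (hi - lo) / 2;
        if (strcmp(key, splits[mid]) < 0)
            hi = mid;         /* key sorts before this boundary */
        else
            lo = mid + 1;     /* key is at or after this boundary */
    }
    return lo;                /* index of the object for this key */
}
```

With boundaries `{"g", "p"}`, keys before "g" land in object 0, keys in ["g","p") in object 1, and the rest in object 2 — and if most keys fall into one range, that object becomes the hot spot the mail worries about.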
[PATCH] ceph: remove redundant iov_iter_advance()
ceph_sync_read() and generic_file_read_iter() have already advanced the
IO iterator.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/file.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1c1df08..d7e0da8 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -850,7 +850,6 @@ again:
 		     ", reading more\n", iocb->ki_pos, inode->i_size);

-		iov_iter_advance(to, ret);
 		read += ret;
 		len -= ret;
 		checkeof = 0;
--
1.9.3
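The bug being fixed is a double advance: the read primitives already move the iterator, so advancing again in the caller skips data. A toy iterator (not the kernel's `iov_iter` — the struct and names below are invented for illustration) makes the contract visible.

```c
#include <stddef.h>
#include <string.h>

/* Toy stand-in for an iov_iter: just enough state to show why a
 * caller must not advance again after a read that already did. */
struct toy_iter {
    char  *buf;
    size_t pos;   /* current position; the read primitive owns this */
    size_t len;
};

/* Like ceph_sync_read()/generic_file_read_iter() in this respect:
 * copies data AND advances the iterator, returning the byte count. */
static size_t toy_read(struct toy_iter *it, const char *src, size_t n)
{
    size_t avail = it->len - it->pos;

    if (n > avail)
        n = avail;
    memcpy(it->buf + it->pos, src, n);
    it->pos += n;   /* the read itself advances the iterator */
    return n;
}
```

A caller that did `it.pos += ret` after `toy_read()` would leave a hole of `ret` uninitialized bytes — the equivalent of the removed `iov_iter_advance(to, ret)`.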
Re: [PATCH] ceph: remove redundant code for max file size verification
On Wed, Sep 17, 2014 at 5:26 PM, Chao Yu chao2...@samsung.com wrote: Both ceph_update_writeable_page and ceph_setattr will verify file size with max size ceph supported. There are two caller for ceph_update_writeable_page, ceph_write_begin and ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no chance to change file size when mmap. Likewise we have already verified the size in inode_change_ok when we call ceph_setattr. So let's remove the redundant code for max file size verification. Signed-off-by: Chao Yu chao2...@samsung.com --- fs/ceph/addr.c | 9 - fs/ceph/inode.c | 6 -- 2 files changed, 15 deletions(-) diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c index 90b3954..18c06bb 100644 --- a/fs/ceph/addr.c +++ b/fs/ceph/addr.c @@ -1076,12 +1076,6 @@ retry_locked: /* past end of file? */ i_size = inode-i_size; /* caller holds i_mutex */ - if (i_size + len inode-i_sb-s_maxbytes) { - /* file is too big */ - r = -EINVAL; - goto fail; - } - if (page_off = i_size || (pos_in_page == 0 (pos+len) = i_size end_in_page - pos_in_page != PAGE_CACHE_SIZE)) { @@ -1099,9 +1093,6 @@ retry_locked: if (r 0) goto fail_nosnap; goto retry_locked; - -fail: - up_read(mdsc-snap_rwsem); fail_nosnap: unlock_page(page); return r; diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 04c89c2..25c242e 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1813,10 +1813,6 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr) if (ia_valid ATTR_SIZE) { dout(setattr %p size %lld - %lld\n, inode, inode-i_size, attr-ia_size); - if (attr-ia_size inode-i_sb-s_maxbytes) { - err = -EINVAL; - goto out; - } if ((issued CEPH_CAP_FILE_EXCL) attr-ia_size inode-i_size) { inode-i_size = attr-ia_size; @@ -1896,8 +1892,6 @@ int ceph_setattr(struct dentry *dentry, struct iattr *attr) if (mask CEPH_SETATTR_SIZE) __ceph_do_pending_vmtruncate(inode); return err; -out: - spin_unlock(ci-i_ceph_lock); out_put: 
 	ceph_mdsc_put_request(req);
 	return err;
--
2.0.1.474.g72c7794

Reviewed-by: Yan, Zheng <z...@redhat.com>

I added this to the testing branch of ceph-client. Thanks.

Yan, Zheng
[PATCH] ceph: request xattrs if xattr_version is zero
Following sequence of events can happen. - Client releases an inode, queues cap release message. - A 'lookup' reply brings the same inode back, but the reply doesn't contain xattrs because MDS didn't receive the cap release message and thought client already has up-to-data xattrs. The fix is force sending a getattr request to MDS if xattrs_version is 0. The getattr mask is set to CEPH_STAT_CAP_XATTR, so MDS knows client does not have xattr. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/file.c | 5 ++--- fs/ceph/inode.c | 8 fs/ceph/ioctl.c | 4 ++-- fs/ceph/super.h | 2 +- fs/ceph/xattr.c | 29 ++--- 5 files changed, 19 insertions(+), 29 deletions(-) diff --git a/fs/ceph/file.c b/fs/ceph/file.c index 46a0525f..bf926fb 100644 --- a/fs/ceph/file.c +++ b/fs/ceph/file.c @@ -841,8 +841,7 @@ again: ceph_put_cap_refs(ci, got); if (checkeof ret = 0) { - int statret = ceph_do_getattr(inode, - CEPH_STAT_CAP_SIZE); + int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false); /* hit EOF or hole? */ if (statret == 0 iocb-ki_pos inode-i_size @@ -1010,7 +1009,7 @@ static loff_t ceph_llseek(struct file *file, loff_t offset, int whence) mutex_lock(inode-i_mutex); if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) { - ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE); + ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false); if (ret 0) { offset = ret; goto out; diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 04c89c2..40e6289 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -1907,7 +1907,7 @@ out_put: * Verify that we have a lease on the given mask. If not, * do a getattr against an mds. 
*/ -int ceph_do_getattr(struct inode *inode, int mask) +int ceph_do_getattr(struct inode *inode, int mask, bool force) { struct ceph_fs_client *fsc = ceph_sb_to_client(inode-i_sb); struct ceph_mds_client *mdsc = fsc-mdsc; @@ -1920,7 +1920,7 @@ int ceph_do_getattr(struct inode *inode, int mask) } dout(do_getattr inode %p mask %s mode 0%o\n, inode, ceph_cap_string(mask), inode-i_mode); - if (ceph_caps_issued_mask(ceph_inode(inode), mask, 1)) + if (!force ceph_caps_issued_mask(ceph_inode(inode), mask, 1)) return 0; req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, USE_ANY_MDS); @@ -1948,7 +1948,7 @@ int ceph_permission(struct inode *inode, int mask) if (mask MAY_NOT_BLOCK) return -ECHILD; - err = ceph_do_getattr(inode, CEPH_CAP_AUTH_SHARED); + err = ceph_do_getattr(inode, CEPH_CAP_AUTH_SHARED, false); if (!err) err = generic_permission(inode, mask); @@ -1966,7 +1966,7 @@ int ceph_getattr(struct vfsmount *mnt, struct dentry *dentry, struct ceph_inode_info *ci = ceph_inode(inode); int err; - err = ceph_do_getattr(inode, CEPH_STAT_CAP_INODE_ALL); + err = ceph_do_getattr(inode, CEPH_STAT_CAP_INODE_ALL, false); if (!err) { generic_fillattr(inode, stat); stat-ino = ceph_translate_ino(inode-i_sb, inode-i_ino); diff --git a/fs/ceph/ioctl.c b/fs/ceph/ioctl.c index a822a6e..d7dc812 100644 --- a/fs/ceph/ioctl.c +++ b/fs/ceph/ioctl.c @@ -19,7 +19,7 @@ static long ceph_ioctl_get_layout(struct file *file, void __user *arg) struct ceph_ioctl_layout l; int err; - err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT); + err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT, false); if (!err) { l.stripe_unit = ceph_file_layout_su(ci-i_layout); l.stripe_count = ceph_file_layout_stripe_count(ci-i_layout); @@ -74,7 +74,7 @@ static long ceph_ioctl_set_layout(struct file *file, void __user *arg) return -EFAULT; /* validate changed params against current layout */ - err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT); + err = ceph_do_getattr(file_inode(file), 
CEPH_STAT_CAP_LAYOUT, false); if (err) return err; diff --git a/fs/ceph/super.h b/fs/ceph/super.h index 0cfb1ec..8405a79 100644 --- a/fs/ceph/super.h +++ b/fs/ceph/super.h @@ -714,7 +714,7 @@ extern void ceph_queue_vmtruncate(struct inode *inode); extern void ceph_queue_invalidate(struct inode *inode); extern void ceph_queue_writeback(struct inode *inode); -extern int ceph_do_getattr(struct inode *inode, int mask); +extern int ceph_do_getattr(struct inode *inode, int mask, bool force); extern int ceph_permission(struct inode *inode, int mask); extern int ceph_setattr(struct dentry *dentry, struct iattr *attr); extern int ceph_getattr(struct vfsmount *mnt, struct dentry *dentry, diff --git a/fs/ceph
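The behavioural change of the new `force` parameter to `ceph_do_getattr()` can be modeled simply: normally the MDS request is skipped when the locally issued caps already cover the requested mask, and `force` bypasses that check (used when `xattr_version == 0`, i.e. the client knows its xattrs are not up to date). The sketch below is a model only — `do_getattr`, `caps_cover_mask`, and `rpc_count` are invented names, and real cap masks are richer than an int.

```c
#include <stdbool.h>

/* Counts how many simulated CEPH_MDS_OP_GETATTR requests were sent. */
static int rpc_count;

/* True if the issued caps cover every bit in `mask`. */
static bool caps_cover_mask(int issued, int mask)
{
    return (issued & mask) == mask;
}

/* Model of ceph_do_getattr(inode, mask, force): skip the round trip
 * when caps cover the mask, unless `force` is set. */
static int do_getattr(int issued, int mask, bool force)
{
    if (!force && caps_cover_mask(issued, mask))
        return 0;   /* satisfied locally, no MDS round trip */
    rpc_count++;    /* would send a getattr request to the MDS */
    return 0;
}
```

The fix in the patch is exactly the `force=true` case: even though the caps look sufficient, the client must ask the MDS for xattrs it knows it dropped.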
[PATCH 1/3] libceph: reference counting pagelist
this allow pagelist to present data that may be sent multiple times. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/mds_client.c | 1 - include/linux/ceph/pagelist.h | 5 - net/ceph/messenger.c | 4 +--- net/ceph/pagelist.c | 8 ++-- 4 files changed, 11 insertions(+), 7 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index a17fc49..30d7338 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -2796,7 +2796,6 @@ fail: mutex_unlock(session-s_mutex); fail_nomsg: ceph_pagelist_release(pagelist); - kfree(pagelist); fail_nopagelist: pr_err(error %d preparing reconnect for mds%d\n, err, mds); return; diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h index 9660d6b..5f871d8 100644 --- a/include/linux/ceph/pagelist.h +++ b/include/linux/ceph/pagelist.h @@ -2,6 +2,7 @@ #define __FS_CEPH_PAGELIST_H #include linux/list.h +#include linux/atomic.h struct ceph_pagelist { struct list_head head; @@ -10,6 +11,7 @@ struct ceph_pagelist { size_t room; struct list_head free_list; size_t num_pages_free; + atomic_t refcnt; }; struct ceph_pagelist_cursor { @@ -26,9 +28,10 @@ static inline void ceph_pagelist_init(struct ceph_pagelist *pl) pl-room = 0; INIT_LIST_HEAD(pl-free_list); pl-num_pages_free = 0; + atomic_set(pl-refcnt, 1); } -extern int ceph_pagelist_release(struct ceph_pagelist *pl); +extern void ceph_pagelist_release(struct ceph_pagelist *pl); extern int ceph_pagelist_append(struct ceph_pagelist *pl, const void *d, size_t l); diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c index e7d9411..9764c77 100644 --- a/net/ceph/messenger.c +++ b/net/ceph/messenger.c @@ -3071,10 +3071,8 @@ static void ceph_msg_data_destroy(struct ceph_msg_data *data) return; WARN_ON(!list_empty(data-links)); - if (data-type == CEPH_MSG_DATA_PAGELIST) { + if (data-type == CEPH_MSG_DATA_PAGELIST) ceph_pagelist_release(data-pagelist); - kfree(data-pagelist); - } kmem_cache_free(ceph_msg_data_cache, data); } diff --git a/net/ceph/pagelist.c 
b/net/ceph/pagelist.c index 92866be..f70b651 100644 --- a/net/ceph/pagelist.c +++ b/net/ceph/pagelist.c @@ -1,5 +1,6 @@ #include linux/module.h #include linux/gfp.h +#include linux/slab.h #include linux/pagemap.h #include linux/highmem.h #include linux/ceph/pagelist.h @@ -13,8 +14,10 @@ static void ceph_pagelist_unmap_tail(struct ceph_pagelist *pl) } } -int ceph_pagelist_release(struct ceph_pagelist *pl) +void ceph_pagelist_release(struct ceph_pagelist *pl) { + if (!atomic_dec_and_test(pl-refcnt)) + return; ceph_pagelist_unmap_tail(pl); while (!list_empty(pl-head)) { struct page *page = list_first_entry(pl-head, struct page, @@ -23,7 +26,8 @@ int ceph_pagelist_release(struct ceph_pagelist *pl) __free_page(page); } ceph_pagelist_free_reserve(pl); - return 0; + kfree(pl); + return; } EXPORT_SYMBOL(ceph_pagelist_release); -- 1.9.3 -- To unsubscribe from this list: send the line unsubscribe ceph-devel in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
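The refcounting scheme this patch gives `ceph_pagelist` — creators start at 1, each extra user (e.g. a queued message) takes a reference, and release only frees on the last put — is a standard pattern. The sketch below illustrates it with C11 atomics rather than the kernel's `atomic_t`; the `reflist_*` names and the `freed` flag are invented for the demonstration (a real pagelist frees its pages and itself).

```c
#include <stdatomic.h>
#include <stdlib.h>

struct reflist {
    atomic_int refcnt;
    int freed;   /* demo flag standing in for the actual teardown */
};

static struct reflist *reflist_new(void)
{
    struct reflist *pl = calloc(1, sizeof(*pl));
    if (pl)
        atomic_store(&pl->refcnt, 1);   /* creator holds one reference */
    return pl;
}

static void reflist_get(struct reflist *pl)
{
    atomic_fetch_add(&pl->refcnt, 1);   /* e.g. before attaching to a msg */
}

/* Returns 1 if this put dropped the last reference (teardown ran). */
static int reflist_release(struct reflist *pl)
{
    if (atomic_fetch_sub(&pl->refcnt, 1) != 1)
        return 0;   /* another holder remains; nothing freed */
    pl->freed = 1;  /* last reference: tear down for real */
    return 1;
}
```

This is why the patch can delete the separate `kfree(pagelist)` calls in mds_client.c and messenger.c: whichever of the two holders releases last does the freeing, so data that "may be sent multiple times" stays alive as long as any message still references it.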
[PATCH 2/3] ceph: use pagelist to present MDS request data
Current code uses page array to present MDS request data. Pages in the array are allocated/freed by caller of ceph_mdsc_do_request(). If request is interrupted, the pages can be freed while they are still being used by the request message. The fix is use pagelist to present MDS request data. Pagelist is reference counted. Signed-off-by: Yan, Zheng z...@redhat.com --- fs/ceph/mds_client.c | 14 +- fs/ceph/mds_client.h | 4 +--- fs/ceph/xattr.c | 46 -- 3 files changed, 26 insertions(+), 38 deletions(-) diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c index 30d7338..80d9f07 100644 --- a/fs/ceph/mds_client.c +++ b/fs/ceph/mds_client.c @@ -542,6 +542,8 @@ void ceph_mdsc_release_request(struct kref *kref) } kfree(req-r_path1); kfree(req-r_path2); + if (req-r_pagelist) + ceph_pagelist_release(req-r_pagelist); put_request_session(req); ceph_unreserve_caps(req-r_mdsc, req-r_caps_reservation); kfree(req); @@ -1847,13 +1849,15 @@ static struct ceph_msg *create_request_message(struct ceph_mds_client *mdsc, msg-front.iov_len = p - msg-front.iov_base; msg-hdr.front_len = cpu_to_le32(msg-front.iov_len); - if (req-r_data_len) { - /* outbound data set only by ceph_sync_setxattr() */ - BUG_ON(!req-r_pages); - ceph_msg_data_add_pages(msg, req-r_pages, req-r_data_len, 0); + if (req-r_pagelist) { + struct ceph_pagelist *pagelist = req-r_pagelist; + atomic_inc(pagelist-refcnt); + ceph_msg_data_add_pagelist(msg, pagelist); + msg-hdr.data_len = cpu_to_le32(pagelist-length); + } else { + msg-hdr.data_len = 0; } - msg-hdr.data_len = cpu_to_le32(req-r_data_len); msg-hdr.data_off = cpu_to_le16(0); out_free2: diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h index e00737c..23015f7 100644 --- a/fs/ceph/mds_client.h +++ b/fs/ceph/mds_client.h @@ -202,9 +202,7 @@ struct ceph_mds_request { bool r_direct_is_hash; /* true if r_direct_hash is valid */ /* data payload is used for xattr ops */ - struct page **r_pages; - int r_num_pages; - int r_data_len; + struct ceph_pagelist *r_pagelist; 
/* what caps shall we drop? */ int r_inode_drop, r_inode_unless; diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c index eab3e2f..c7b18b2 100644 --- a/fs/ceph/xattr.c +++ b/fs/ceph/xattr.c @@ -1,4 +1,5 @@ #include linux/ceph/ceph_debug.h +#include linux/ceph/pagelist.h #include super.h #include mds_client.h @@ -852,28 +853,17 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name, struct ceph_mds_request *req; struct ceph_mds_client *mdsc = fsc-mdsc; int err; - int i, nr_pages; - struct page **pages = NULL; - void *kaddr; - - /* copy value into some pages */ - nr_pages = calc_pages_for(0, size); - if (nr_pages) { - pages = kmalloc(sizeof(pages[0])*nr_pages, GFP_NOFS); - if (!pages) - return -ENOMEM; - err = -ENOMEM; - for (i = 0; i nr_pages; i++) { - pages[i] = __page_cache_alloc(GFP_NOFS); - if (!pages[i]) { - nr_pages = i; - goto out; - } - kaddr = kmap(pages[i]); - memcpy(kaddr, value + i*PAGE_CACHE_SIZE, - min(PAGE_CACHE_SIZE, size-i*PAGE_CACHE_SIZE)); - } - } + struct ceph_pagelist *pagelist; + + /* copy value into pagelist */ + pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS); + if (!pagelist) + return -ENOMEM; + + ceph_pagelist_init(pagelist); + err = ceph_pagelist_append(pagelist, value, size); + if (err) + goto out; dout(setxattr value=%.*s\n, (int)size, value); @@ -894,9 +884,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name, req-r_args.setxattr.flags = cpu_to_le32(flags); req-r_path2 = kstrdup(name, GFP_NOFS); - req-r_pages = pages; - req-r_num_pages = nr_pages; - req-r_data_len = size; + req-r_pagelist = pagelist; + pagelist = NULL; dout(xattr.ver (before): %lld\n, ci-i_xattrs.version); err = ceph_mdsc_do_request(mdsc, NULL, req); @@ -904,11 +893,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const char *name, dout(xattr.ver (after): %lld\n, ci-i_xattrs.version); out: - if (pages) { - for (i = 0; i nr_pages; i++) - __free_page(pages[i]); - kfree(pages); - } + if (pagelist) + ceph_pagelist_release(pagelist
[PATCH 3/3] ceph: include the initial ACL in create/mkdir/mknod MDS requests
The current code sets a new file/directory's initial ACL in a non-atomic manner: the client first sends a request to the MDS to create the new file/directory, then sets the initial ACL after the file/directory has been successfully created. The fix is to include the initial ACL in the create/mkdir/mknod MDS request, so the MDS can create the file/directory and set the initial ACL in one request.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/acl.c   | 125
 fs/ceph/dir.c   |  41 ++-
 fs/ceph/file.c  |  27 +---
 fs/ceph/super.h |  24 ---
 4 files changed, 170 insertions(+), 47 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..5bd853b 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,109 @@ out:
     return ret;
 }

-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+               struct ceph_acls_info *info)
 {
-    struct posix_acl *default_acl, *acl;
-    umode_t new_mode = inode->i_mode;
-    int error;
-
-    error = posix_acl_create(dir, &new_mode, &default_acl, &acl);
-    if (error)
-        return error;
-
-    if (!default_acl && !acl) {
-        cache_no_acl(inode);
-        if (new_mode != inode->i_mode) {
-            struct iattr newattrs = {
-                .ia_mode = new_mode,
-                .ia_valid = ATTR_MODE,
-            };
-            error = ceph_setattr(dentry, &newattrs);
+    struct posix_acl *acl, *default_acl;
+    size_t val_size1 = 0, val_size2 = 0;
+    struct ceph_pagelist *pagelist = NULL;
+    void *tmp_buf = NULL;
+    int err;
+
+    err = posix_acl_create(dir, mode, &default_acl, &acl);
+    if (err)
+        return err;
+
+    if (acl) {
+        int ret = posix_acl_equiv_mode(acl, mode);
+        if (ret < 0)
+            goto out_err;
+        if (ret == 0) {
+            posix_acl_release(acl);
+            acl = NULL;
         }
-        return error;
     }

-    if (default_acl) {
-        error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-        posix_acl_release(default_acl);
-    }
+    if (!default_acl && !acl)
+        return 0;
+
+    if (acl)
+        val_size1 = posix_acl_xattr_size(acl->a_count);
+    if (default_acl)
+        val_size2 = posix_acl_xattr_size(default_acl->a_count);
+
+    err = -ENOMEM;
+    tmp_buf = kmalloc(max(val_size1, val_size2), GFP_NOFS);
+    if (!tmp_buf)
+        goto out_err;
+    pagelist = kmalloc(sizeof(struct ceph_pagelist), GFP_NOFS);
+    if (!pagelist)
+        goto out_err;
+    ceph_pagelist_init(pagelist);
+
+    err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
+    if (err)
+        goto out_err;
+
+    ceph_pagelist_encode_32(pagelist, acl && default_acl ? 2 : 1);
+
     if (acl) {
-        if (!error)
-            error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-        posix_acl_release(acl);
+        size_t len = strlen(POSIX_ACL_XATTR_ACCESS);
+        err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+        if (err)
+            goto out_err;
+        ceph_pagelist_encode_string(pagelist, POSIX_ACL_XATTR_ACCESS,
+                        len);
+        err = posix_acl_to_xattr(&init_user_ns, acl,
+                     tmp_buf, val_size1);
+        if (err < 0)
+            goto out_err;
+        ceph_pagelist_encode_32(pagelist, val_size1);
+        ceph_pagelist_append(pagelist, tmp_buf, val_size1);
     }
-    return error;
+    if (default_acl) {
+        size_t len = strlen(POSIX_ACL_XATTR_DEFAULT);
+        err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+        if (err)
+            goto out_err;
+        err = ceph_pagelist_encode_string(pagelist,
+                          POSIX_ACL_XATTR_DEFAULT, len);
+        err = posix_acl_to_xattr(&init_user_ns, default_acl,
+                     tmp_buf, val_size2);
+        if (err < 0)
+            goto out_err;
+        ceph_pagelist_encode_32(pagelist, val_size2);
+        ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+    }
+
+    kfree(tmp_buf);
+
+    info->acl = acl;
+    info->default_acl = default_acl;
+    info->pagelist = pagelist;
+    return 0;
+
+out_err:
+    posix_acl_release(acl);
+    posix_acl_release(default_acl);
+    kfree(tmp_buf);
+    if (pagelist)
+        ceph_pagelist_release(pagelist);
+    return
[PATCH] ceph: move ceph_find_inode() outside the s_mutex
ceph_find_inode() may wait on a freeing inode, so using it while holding the s_mutex may cause a deadlock: the freeing inode is waiting for an OSD read reply, but the dispatch thread is blocked by the s_mutex.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/caps.c       | 11 ++-
 fs/ceph/mds_client.c |  7 ---
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6d1cd45..b3b0a91 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3045,6 +3045,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
         }
     }

+    /* lookup ino */
+    inode = ceph_find_inode(sb, vino);
+    ci = ceph_inode(inode);
+    dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
+         vino.snap, inode);
+
     mutex_lock(&session->s_mutex);
     session->s_seq++;
     dout(" mds%d seq %lld cap seq %u\n", session->s_mds, session->s_seq,
@@ -3053,11 +3059,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
     if (op == CEPH_CAP_OP_IMPORT)
         ceph_add_cap_releases(mdsc, session);

-    /* lookup ino */
-    inode = ceph_find_inode(sb, vino);
-    ci = ceph_inode(inode);
-    dout(" op %s ino %llx.%llx inode %p\n", ceph_cap_op_name(op), vino.ino,
-         vino.snap, inode);

     if (!inode) {
         dout(" i don't have ino %llx\n", vino.ino);
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 80d9f07..c27e204 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2947,14 +2947,15 @@ static void handle_lease(struct ceph_mds_client *mdsc,
     if (dname.len != get_unaligned_le32(h+1))
         goto bad;

-    mutex_lock(&session->s_mutex);
-    session->s_seq++;
-
     /* lookup inode */
     inode = ceph_find_inode(sb, vino);
     dout("handle_lease %s, ino %llx %p %.*s\n",
         ceph_lease_op_name(h->action), vino.ino, inode,
         dname.len, dname.name);
+
+    mutex_lock(&session->s_mutex);
+    session->s_seq++;
+
     if (inode == NULL) {
         dout("handle_lease no inode %llx\n", vino.ino);
         goto release;
--
1.9.3
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at
http://vger.kernel.org/majordomo-info.html
[PATCH] ceph: include the initial ACL in create/mkdir/mknod MDS requests
The current code sets a new file/directory's initial ACL in a non-atomic manner: the client first sends a request to the MDS to create the new file/directory, then sets the initial ACL after the file/directory has been successfully created. The fix is to include the initial ACL in the create/mkdir/mknod MDS request, so the MDS can create the file/directory and set the initial ACL in one request.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/acl.c        | 124 +--
 fs/ceph/dir.c        |  41 -
 fs/ceph/file.c       |  27 ---
 fs/ceph/mds_client.c |   1 -
 fs/ceph/super.h      |  27 ---
 5 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..2b7f497 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,112 @@ out:
     return ret;
 }

-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode *dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+               struct ceph_acls_info *info)
 {
-    struct posix_acl *default_acl, *acl;
-    umode_t new_mode = inode->i_mode;
-    int error;
-
-    error = posix_acl_create(dir, &new_mode, &default_acl, &acl);
-    if (error)
-        return error;
-
-    if (!default_acl && !acl) {
-        cache_no_acl(inode);
-        if (new_mode != inode->i_mode) {
-            struct iattr newattrs = {
-                .ia_mode = new_mode,
-                .ia_valid = ATTR_MODE,
-            };
-            error = ceph_setattr(dentry, &newattrs);
+    struct posix_acl *acl, *default_acl;
+    size_t buf_size, name_len1, name_len2, val_size1, val_size2;
+    struct page **pages = NULL;
+    void *p, *xattr_buf = NULL;
+    int err, i, nr_pages;
+
+    err = posix_acl_create(dir, mode, &default_acl, &acl);
+    if (err)
+        return err;
+
+    if (acl) {
+        int ret = posix_acl_equiv_mode(acl, mode);
+        if (ret < 0)
+            goto out_err;
+        if (ret == 0) {
+            posix_acl_release(acl);
+            acl = NULL;
         }
-        return error;
     }

+    if (!default_acl && !acl)
+        return 0;
+
+    buf_size = 4;
+    if (acl) {
+        name_len1 = strlen(POSIX_ACL_XATTR_ACCESS);
+        val_size1 = posix_acl_xattr_size(acl->a_count);
+        buf_size += 8 + name_len1 + val_size1;
+    } else {
+        name_len1 = val_size1 = 0;
+    }
     if (default_acl) {
-        error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-        posix_acl_release(default_acl);
+        name_len2 = strlen(POSIX_ACL_XATTR_DEFAULT);
+        val_size2 = posix_acl_xattr_size(default_acl->a_count);
+        buf_size += 8 + name_len2 + val_size2;
+    } else {
+        name_len2 = val_size2 = 0;
     }
+
+    err = -ENOMEM;
+    nr_pages = calc_pages_for(0, buf_size);
+    pages = kmalloc(sizeof(struct page*) * nr_pages, GFP_NOFS);
+    if (!pages)
+        goto out_err;
+    xattr_buf = kmalloc(PAGE_SIZE * nr_pages, GFP_NOFS);
+    if (!xattr_buf)
+        goto out_err;
+
+    pages[0] = virt_to_page(xattr_buf);
+    for (i = 1; i < nr_pages; i++)
+        pages[i] = pages[0] + i;
+
+    p = xattr_buf;
+    ceph_encode_32(&p, acl && default_acl ? 2 : 1);
+
     if (acl) {
-        if (!error)
-            error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-        posix_acl_release(acl);
+        ceph_encode_32(&p, name_len1);
+        memcpy(p, POSIX_ACL_XATTR_ACCESS, name_len1);
+        p += name_len1;
+        ceph_encode_32(&p, val_size1);
+        err = posix_acl_to_xattr(&init_user_ns, acl, p, val_size1);
+        if (err < 0)
+            goto out_err;
+        p += val_size1;
     }
-    return error;
+    if (default_acl) {
+        ceph_encode_32(&p, name_len2);
+        memcpy(p, POSIX_ACL_XATTR_DEFAULT, name_len2);
+        p += name_len2;
+        ceph_encode_32(&p, val_size2);
+        err = posix_acl_to_xattr(&init_user_ns, default_acl, p, val_size2);
+        if (err < 0)
+            goto out_err;
+        p += val_size2;
+    }
+
+    info->acl = acl;
+    info->default_acl = default_acl;
+    info->xattr_buf = xattr_buf;
+    info->buf_pages = pages;
+    info->buf_size = buf_size;
+    return 0;
+
+out_err:
+    posix_acl_release(acl);
+    posix_acl_release(default_acl);
+    kfree(pages);
+    kfree(xattr_buf);
+    return err;
+}
+
+void ceph_init_inode_acls(struct inode* inode, struct ceph_acls_info *info)
+{
+    if (!inode
[PATCH 1/2] ceph: protect kick_requests() with mdsc-mutex
From: Yan, Zheng <uker...@gmail.com>

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/mds_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f751fea..267ba44 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2471,9 +2471,8 @@ static void handle_session(struct ceph_mds_session *session,
         if (session->s_state == CEPH_MDS_SESSION_RECONNECTING)
             pr_info("mds%d reconnect denied\n", session->s_mds);
         remove_session_caps(session);
-        wake = 1; /* for good measure */
+        wake = 2; /* for good measure */
         wake_up_all(&mdsc->session_close_wq);
-        kick_requests(mdsc, mds);
         break;

     case CEPH_SESSION_STALE:
@@ -2503,6 +2502,8 @@ static void handle_session(struct ceph_mds_session *session,
     if (wake) {
         mutex_lock(&mdsc->mutex);
         __wake_requests(mdsc, session->s_waiting);
+        if (wake == 2)
+            kick_requests(mdsc, mds);
         mutex_unlock(&mdsc->mutex);
     }
     return;
--
1.9.3
[PATCH 2/2] ceph: make sure request isn't in any waiting list when kicking request.
From: Yan, Zheng <uker...@gmail.com>

We may corrupt a waiting list if a request that is still on a waiting list is kicked.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 267ba44..a17fc49 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2078,6 +2078,7 @@ static void kick_requests(struct ceph_mds_client *mdsc, int mds)
         if (req->r_session &&
             req->r_session->s_mds == mds) {
             dout(" kicking tid %llu\n", req->r_tid);
+            list_del_init(&req->r_wait);
             __do_request(mdsc, req);
         }
     }
--
1.9.3
[PATCH] ceph: trim unused inodes before reconnecting to recovering MDS
So the recovering MDS does not need to fetch these unused inodes during cache rejoin. This may reduce MDS recovery time.

Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 fs/ceph/mds_client.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index bad07c0..f751fea 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2695,16 +2695,6 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
     session->s_state = CEPH_MDS_SESSION_RECONNECTING;
     session->s_seq = 0;

-    ceph_con_close(&session->s_con);
-    ceph_con_open(&session->s_con,
-              CEPH_ENTITY_TYPE_MDS, mds,
-              ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
-
-    /* replay unsafe requests */
-    replay_unsafe_requests(mdsc, session);
-
-    down_read(&mdsc->snap_rwsem);
-
     dout("session %p state %s\n", session,
          session_state_name(session->s_state));
@@ -2723,6 +2713,19 @@ static void send_mds_reconnect(struct ceph_mds_client *mdsc,
     discard_cap_releases(mdsc, session);
     spin_unlock(&session->s_cap_lock);

+    /* trim unused caps to reduce MDS's cache rejoin time */
+    shrink_dcache_parent(mdsc->fsc->sb->s_root);
+
+    ceph_con_close(&session->s_con);
+    ceph_con_open(&session->s_con,
+              CEPH_ENTITY_TYPE_MDS, mds,
+              ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
+
+    /* replay unsafe requests */
+    replay_unsafe_requests(mdsc, session);
+
+    down_read(&mdsc->snap_rwsem);
+
     /* traverse this session's caps */
     s_nr_caps = session->s_nr_caps;
     err = ceph_pagelist_encode_32(pagelist, s_nr_caps);
--
1.9.3
Re: [PATCH] ceph: refactor readpage_nounlock() to make the logic clearer
On Wed, May 28, 2014 at 3:09 PM, Zhang Zhen <zhenzhang.zh...@huawei.com> wrote:
> If the return value of ceph_osdc_readpages() is not negative, it is
> certainly greater than or equal to zero. Remove the useless condition
> judgment and redundant braces.
>
> Signed-off-by: Zhang Zhen <zhenzhang.zh...@huawei.com>
> ---
>  fs/ceph/addr.c | 17 +++--
>  1 file changed, 7 insertions(+), 10 deletions(-)
>
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> index b53278c..6aa2e3f 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -211,18 +211,15 @@ static int readpage_nounlock(struct file *filp, struct page *page)
>          SetPageError(page);
>          ceph_fscache_readpage_cancel(inode, page);
>          goto out;
> -    } else {
> -        if (err < PAGE_CACHE_SIZE) {
> -            /* zero fill remainder of page */
> -            zero_user_segment(page, err, PAGE_CACHE_SIZE);
> -        } else {
> -            flush_dcache_page(page);
> -        }
>      }
> -    SetPageUptodate(page);
>
> +    if (err < PAGE_CACHE_SIZE)
> +        /* zero fill remainder of page */
> +        zero_user_segment(page, err, PAGE_CACHE_SIZE);
> +    else
> +        flush_dcache_page(page);
>
> -    if (err >= 0)
> -        ceph_readpage_to_fscache(inode, page);
> +    SetPageUptodate(page);
> +    ceph_readpage_to_fscache(inode, page);
>
>  out:
>      return err < 0 ? err : 0;
> --
> 1.8.1.2

Added to testing branch of ceph-client

Thanks
Yan, Zheng
Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc
On Wed, Apr 30, 2014 at 3:07 PM, Mohd Bazli Ab Karim <bazli.abka...@mimos.my> wrote:
> Hi Zheng,
> Sorry for the late reply. For sure, I will try this again after we have
> completely verified all content in the file system. Hopefully all will be
> good. And, please confirm this: I will set debug_mds=10 for the ceph-mds;
> do you want me to send the ceph-mon log too?

yes please.

> BTW, how do I confirm whether the mds has passed the beacon to the mon or not?

read the monitor's log

Regards
Yan, Zheng

> Thank you so much Zheng!
> Bazli
>
> -----Original Message-----
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: Tuesday, April 29, 2014 10:13 PM
> To: Mohd Bazli Ab Karim
> Cc: Luke Jing Yuan; Wong Ming Tat
> Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc
>
> On Tue, Apr 29, 2014 at 5:30 PM, Mohd Bazli Ab Karim <bazli.abka...@mimos.my> wrote:
>> Hi Zheng,
>> The other issue that Luke mentioned just now was like this. At first, we
>> ran one mds (mon01) with the newly compiled ceph-mds. It worked fine with
>> only one MDS running at that time. However, when we ran two more MDSes,
>> mon02 and mon03, with the newly compiled ceph-mds, it started acting weird.
>> Mon01, which became active at first, would hit the error and start
>> respawning. Once respawning happened, mon03 would take over from mon01 as
>> the master mds, and replay happened again. Again, when mon03 became
>> active, it would hit the same error as below and respawn again. So, it
>> seems to me that replay will continue to happen from one mds to another
>> as they get respawned.
2014-04-29 15:36:24.917798 7f5c36476700  1 mds.0.server reconnect_clients -- 1 sessions
2014-04-29 15:36:24.919620 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401 >> 10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=0 pgs=0 cs=0 l=0 c=0x37056e0).accept peer addr is really 10.1.64.181:0/1558263174 (socket is 10.1.64.181:57649/0)
2014-04-29 15:36:24.921661 7f5c36476700  0 log [DBG] : reconnect by client.884169 10.1.64.181:0/1558263174 after 0.003774
2014-04-29 15:36:24.921786 7f5c36476700  1 mds.0.12858 reconnect_done
2014-04-29 15:36:25.109391 7f5c36476700  1 mds.0.12858 handle_mds_map i am now mds.0.12858
2014-04-29 15:36:25.109413 7f5c36476700  1 mds.0.12858 handle_mds_map state change up:reconnect --> up:rejoin
2014-04-29 15:36:25.109417 7f5c36476700  1 mds.0.12858 rejoin_start
2014-04-29 15:36:26.918067 7f5c36476700  1 mds.0.12858 rejoin_joint_start
2014-04-29 15:36:33.520985 7f5c36476700  1 mds.0.12858 rejoin_done
2014-04-29 15:36:36.252925 7f5c36476700  1 mds.0.12858 handle_mds_map i am now mds.0.12858
2014-04-29 15:36:36.252927 7f5c36476700  1 mds.0.12858 handle_mds_map state change up:rejoin --> up:active
2014-04-29 15:36:36.252932 7f5c36476700  1 mds.0.12858 recovery_done -- successful recovery!
2014-04-29 15:36:36.745833 7f5c36476700  1 mds.0.12858 active_start
2014-04-29 15:36:36.987854 7f5c36476700  1 mds.0.12858 cluster recovered.
2014-04-29 15:36:40.182604 7f5c36476700  0 mds.0.12858 handle_mds_beacon no longer laggy
2014-04-29 15:36:57.947441 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401 >> 10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=2 pgs=156 cs=1 l=0 c=0x37056e0).fault with nothing to send, going to standby
2014-04-29 15:37:10.534593 7f5c36476700  1 mds.-1.-1 handle_mds_map i (10.4.118.23:6800/26401) dne in the mdsmap, respawning myself
2014-04-29 15:37:10.534604 7f5c36476700  1 mds.-1.-1 respawn
2014-04-29 15:37:10.534609 7f5c36476700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
2014-04-29 15:37:10.534612 7f5c36476700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
2014-04-29 15:37:10.534616 7f5c36476700  1 mds.-1.-1  1: '--cluster=ceph'
2014-04-29 15:37:10.534619 7f5c36476700  1 mds.-1.-1  2: '-i'
2014-04-29 15:37:10.534621 7f5c36476700  1 mds.-1.-1  3: 'mon03'
2014-04-29 15:37:10.534623 7f5c36476700  1 mds.-1.-1  4: '-f'
2014-04-29 15:37:10.534641 7f5c36476700  1 mds.-1.-1  cwd /
2014-04-29 15:37:12.155458 7f8907c8b780  0 ceph version (), process ceph-mds, pid 26401
2014-04-29 15:37:12.249780 7f8902d10700  1 mds.-1.0 handle_mds_map standby

p/s: we ran ceph-mon and ceph-mds on the same servers (mon01, mon02, mon03). I sent you two log files, mon01 and mon03, where mon03 went through the states standby, replay, active, then respawned. And also mon01, which is now running active as a single MDS at this moment.

After the MDS became active, it did not send a beacon to the monitor. It seems like the MDS was busy doing something else. If this issue still happens, set debug_mds=10 and send the log to me.

Regards
Yan, Zheng

Regards,
Bazli

-----Original Message-----
From: Luke Jing Yuan
Sent: Tuesday, April 29, 2014 4:46 PM
To: Yan, Zheng
Cc: Mohd Bazli Ab Karim; Wong Ming Tat
Subject: RE: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

Hi Zheng,

Thanks for the information. Actually we encounter another issue, in our original setup, we have 3 MDS running
Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc
On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke jyl...@gmail.com wrote: Hi, We had applied the patch and recompile ceph as well as updated the ceph.conf as per suggested, when we re-run ceph-mds we noticed the following: 2014-04-29 10:45:22.260798 7f90b971d700 0 log [WRN] : replayed op client.324186:51366457,12681393 no session for client.324186 2014-04-29 10:45:22.262419 7f90b971d700 0 log [WRN] : replayed op client.324186:51366475,12681393 no session for client.324186 2014-04-29 10:45:22.267699 7f90b971d700 0 log [WRN] : replayed op client.324186:5135,12681393 no session for client.324186 2014-04-29 10:45:22.271664 7f90b971d700 0 log [WRN] : replayed op client.324186:51366724,12681393 no session for client.324186 2014-04-29 10:45:22.281050 7f90b971d700 0 log [WRN] : replayed op client.324186:51366945,12681393 no session for client.324186 2014-04-29 10:45:22.283196 7f90b971d700 0 log [WRN] : replayed op client.324186:51366996,12681393 no session for client.324186 2014-04-29 10:45:22.287801 7f90b971d700 0 log [WRN] : replayed op client.324186:51367043,12681393 no session for client.324186 2014-04-29 10:45:22.289967 7f90b971d700 0 log [WRN] : replayed op client.324186:51367082,12681393 no session for client.324186 2014-04-29 10:45:22.291026 7f90b971d700 0 log [WRN] : replayed op client.324186:51367110,12681393 no session for client.324186 2014-04-29 10:45:22.294459 7f90b971d700 0 log [WRN] : replayed op client.324186:51367192,12681393 no session for client.324186 2014-04-29 10:45:22.297228 7f90b971d700 0 log [WRN] : replayed op client.324186:51367257,12681393 no session for client.324186 2014-04-29 10:45:22.297477 7f90b971d700 0 log [WRN] : replayed op client.324186:51367264,12681393 no session for client.324186 tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @ 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a 0x7f90c0cca3fd tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @ 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 
0x58cbed 0x7f90c231de9a 0x7f90c0cca3fd ceph -s shows that MDS up:replay, Also the messages above seemed to be repeating again after a while but with a different session number. Is there a way for us to determine that we are on the right track? Thanks. It's on the right track as long as the MDS doesn't crash. Regards, Luke On Sun, Apr 27, 2014 at 12:04 PM, Yan, Zheng uker...@gmail.com wrote: On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote: Hi Greg, Actually our cluster is pretty empty, but we suspect we had a temporary network disconnection to one of our OSD, not sure if this caused the problem. Anyway we don't mind try the method you mentioned, how can we do that? compile ceph-mds with the attached patch. add a line mds wipe_sessions = 1 to the ceph.conf, Yan, Zheng Regards, Luke On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote: Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did your cluster get full at some point? In any case, we're working on tools to repair this now but they aren't ready for use yet. Probably the only thing you could do is create an empty sessionmap with a higher version than the ones the journal refers to, but that might have other fallout effects... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: More logs. I ran ceph-mds with debug-mds=20. -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay inotable tablev 4316124 = table 4317932 -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc [141df86~1] used 141db9e 0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED assert(session) Please look at the attachment for more details. 
Regards,
Bazli

From: Mohd Bazli Ab Karim
Sent: Friday, April 25, 2014 12:26 PM
To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com
Subject: Ceph mds laggy and failed assert in function replay mds/journal.cc

Dear Ceph-devel, ceph-users,

I am currently facing an issue with my ceph mds server. The ceph-mds daemon does not want to come back up. Tried running it manually with ceph-mds -i mon01 -d, but it shows that it gets stuck at failed assert(session), line 1303 in mds/journal.cc, and aborts. Can someone shed some light on this issue?

ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Let me know if I need to send the log with debug enabled.

Regards,
Bazli
Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc
On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke jyl...@gmail.com wrote: Hi, Assuming we got MDS working back on track, should we still leave the mds_wipe_sessions in the ceph.conf or remove it and restart MDS. Thanks. No. It has been several hours. the MDS still does not finish replaying the journal? Regards Yan, Zheng Regards, Luke On Tue, Apr 29, 2014 at 2:12 PM, Yan, Zheng uker...@gmail.com wrote: On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke jyl...@gmail.com wrote: Hi, We had applied the patch and recompile ceph as well as updated the ceph.conf as per suggested, when we re-run ceph-mds we noticed the following: 2014-04-29 10:45:22.260798 7f90b971d700 0 log [WRN] : replayed op client.324186:51366457,12681393 no session for client.324186 2014-04-29 10:45:22.262419 7f90b971d700 0 log [WRN] : replayed op client.324186:51366475,12681393 no session for client.324186 2014-04-29 10:45:22.267699 7f90b971d700 0 log [WRN] : replayed op client.324186:5135,12681393 no session for client.324186 2014-04-29 10:45:22.271664 7f90b971d700 0 log [WRN] : replayed op client.324186:51366724,12681393 no session for client.324186 2014-04-29 10:45:22.281050 7f90b971d700 0 log [WRN] : replayed op client.324186:51366945,12681393 no session for client.324186 2014-04-29 10:45:22.283196 7f90b971d700 0 log [WRN] : replayed op client.324186:51366996,12681393 no session for client.324186 2014-04-29 10:45:22.287801 7f90b971d700 0 log [WRN] : replayed op client.324186:51367043,12681393 no session for client.324186 2014-04-29 10:45:22.289967 7f90b971d700 0 log [WRN] : replayed op client.324186:51367082,12681393 no session for client.324186 2014-04-29 10:45:22.291026 7f90b971d700 0 log [WRN] : replayed op client.324186:51367110,12681393 no session for client.324186 2014-04-29 10:45:22.294459 7f90b971d700 0 log [WRN] : replayed op client.324186:51367192,12681393 no session for client.324186 2014-04-29 10:45:22.297228 7f90b971d700 0 log [WRN] : replayed op client.324186:51367257,12681393 no session 
for client.324186 2014-04-29 10:45:22.297477 7f90b971d700 0 log [WRN] : replayed op client.324186:51367264,12681393 no session for client.324186 tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @ 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a 0x7f90c0cca3fd tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @ 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a 0x7f90c0cca3fd ceph -s shows that MDS up:replay, Also the messages above seemed to be repeating again after a while but with a different session number. Is there a way for us to determine that we are on the right track? Thanks. It's on the right track as long as the MDS doesn't crash. Regards, Luke On Sun, Apr 27, 2014 at 12:04 PM, Yan, Zheng uker...@gmail.com wrote: On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote: Hi Greg, Actually our cluster is pretty empty, but we suspect we had a temporary network disconnection to one of our OSD, not sure if this caused the problem. Anyway we don't mind try the method you mentioned, how can we do that? compile ceph-mds with the attached patch. add a line mds wipe_sessions = 1 to the ceph.conf, Yan, Zheng Regards, Luke On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote: Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did your cluster get full at some point? In any case, we're working on tools to repair this now but they aren't ready for use yet. Probably the only thing you could do is create an empty sessionmap with a higher version than the ones the journal refers to, but that might have other fallout effects... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: More logs. I ran ceph-mds with debug-mds=20. 
-2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay inotable tablev 4316124 = table 4317932 -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc [141df86~1] used 141db9e 0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED assert(session) Please look at the attachment for more details. Regards, Bazli From: Mohd Bazli Ab Karim Sent: Friday, April 25, 2014 12:26 PM To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com Subject: Ceph mds laggy and failed assert in function replay mds/journal.cc Dear Ceph-devel, ceph-users, I am currently facing issue with my ceph mds server. Ceph-mds daemon does not want to bring up back. Tried running that manually with ceph-mds -i mon01 -d
Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc
On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote: Hi Greg, Actually our cluster is pretty empty, but we suspect we had a temporary network disconnection to one of our OSD, not sure if this caused the problem. Anyway we don't mind try the method you mentioned, how can we do that? compile ceph-mds with the attached patch. add a line mds wipe_sessions = 1 to the ceph.conf, Yan, Zheng Regards, Luke On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote: Hmm, it looks like your on-disk SessionMap is horrendously out of date. Did your cluster get full at some point? In any case, we're working on tools to repair this now but they aren't ready for use yet. Probably the only thing you could do is create an empty sessionmap with a higher version than the ones the journal refers to, but that might have other fallout effects... -Greg Software Engineer #42 @ http://inktank.com | http://ceph.com On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim bazli.abka...@mimos.my wrote: More logs. I ran ceph-mds with debug-mds=20. -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay inotable tablev 4316124 = table 4317932 -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc [141df86~1] used 141db9e 0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread 7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED assert(session) Please look at the attachment for more details. Regards, Bazli From: Mohd Bazli Ab Karim Sent: Friday, April 25, 2014 12:26 PM To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com Subject: Ceph mds laggy and failed assert in function replay mds/journal.cc Dear Ceph-devel, ceph-users, I am currently facing issue with my ceph mds server. Ceph-mds daemon does not want to bring up back. 
Tried running it manually with ceph-mds -i mon01 -d, but it shows that it gets stuck at failed assert(session), line 1303 in mds/journal.cc, and aborts. Can someone shed some light on this issue?

ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

Let me know if I need to send the log with debug enabled.

Regards,
Bazli

DISCLAIMER: This e-mail (including any attachments) is for the addressee(s) only and may be confidential, especially as regards personal data. If you are not the intended recipient, please note that any dealing, review, distribution, printing, copying or use of this e-mail is strictly prohibited. If you have received this email in error, please notify the sender immediately and delete the original message (including any attachments). MIMOS Berhad is a research and development institution under the purview of the Malaysian Ministry of Science, Technology and Innovation. Opinions, conclusions and other information in this e-mail that do not relate to the official business of MIMOS Berhad and/or its subsidiaries shall be understood as neither given nor endorsed by MIMOS Berhad and/or its subsidiaries, and neither MIMOS Berhad nor its subsidiaries accepts responsibility for the same. All liability arising from or in connection with computer viruses and/or corrupted e-mails is excluded to the fullest extent permitted by law.

___
ceph-users mailing list
ceph-us...@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[PATCH 3/6] ceph: pre-allocate ceph_cap struct for ceph_add_cap()
So that ceph_add_cap() can be used while i_ceph_lock is locked. This simplifies the code that handle cap import/export. Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/caps.c | 81 +++-- fs/ceph/inode.c | 70 +++-- fs/ceph/super.h | 13 - 3 files changed, 85 insertions(+), 79 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 5f6d24e..73a42f5 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -221,8 +221,8 @@ int ceph_unreserve_caps(struct ceph_mds_client *mdsc, return 0; } -static struct ceph_cap *get_cap(struct ceph_mds_client *mdsc, - struct ceph_cap_reservation *ctx) +struct ceph_cap *ceph_get_cap(struct ceph_mds_client *mdsc, + struct ceph_cap_reservation *ctx) { struct ceph_cap *cap = NULL; @@ -508,15 +508,14 @@ static void __check_cap_issue(struct ceph_inode_info *ci, struct ceph_cap *cap, * it is 0. (This is so we can atomically add the cap and add an * open file reference to it.) */ -int ceph_add_cap(struct inode *inode, -struct ceph_mds_session *session, u64 cap_id, -int fmode, unsigned issued, unsigned wanted, -unsigned seq, unsigned mseq, u64 realmino, int flags, -struct ceph_cap_reservation *caps_reservation) +void ceph_add_cap(struct inode *inode, + struct ceph_mds_session *session, u64 cap_id, + int fmode, unsigned issued, unsigned wanted, + unsigned seq, unsigned mseq, u64 realmino, int flags, + struct ceph_cap **new_cap) { struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)-mdsc; struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_cap *new_cap = NULL; struct ceph_cap *cap; int mds = session-s_mds; int actual_wanted; @@ -531,20 +530,10 @@ int ceph_add_cap(struct inode *inode, if (fmode = 0) wanted |= ceph_caps_for_mode(fmode); -retry: - spin_lock(ci-i_ceph_lock); cap = __get_cap_for_mds(ci, mds); if (!cap) { - if (new_cap) { - cap = new_cap; - new_cap = NULL; - } else { - spin_unlock(ci-i_ceph_lock); - new_cap = get_cap(mdsc, caps_reservation); - if (new_cap == NULL) - return -ENOMEM; - goto retry; - } + cap = 
*new_cap; + *new_cap = NULL; cap-issued = 0; cap-implemented = 0; @@ -562,9 +551,6 @@ retry: session-s_nr_caps++; spin_unlock(session-s_cap_lock); } else { - if (new_cap) - ceph_put_cap(mdsc, new_cap); - /* * auth mds of the inode changed. we received the cap export * message, but still haven't received the cap import message. @@ -626,7 +612,6 @@ retry: ci-i_auth_cap = cap; cap-mds_wanted = wanted; } - ci-i_cap_exporting_issued = 0; } else { WARN_ON(ci-i_auth_cap == cap); } @@ -648,9 +633,6 @@ retry: if (fmode = 0) __ceph_get_fmode(ci, fmode); - spin_unlock(ci-i_ceph_lock); - wake_up_all(ci-i_cap_wq); - return 0; } /* @@ -685,7 +667,7 @@ static int __cap_is_valid(struct ceph_cap *cap) */ int __ceph_caps_issued(struct ceph_inode_info *ci, int *implemented) { - int have = ci-i_snap_caps | ci-i_cap_exporting_issued; + int have = ci-i_snap_caps; struct ceph_cap *cap; struct rb_node *p; @@ -900,7 +882,7 @@ int __ceph_caps_mds_wanted(struct ceph_inode_info *ci) */ static int __ceph_is_any_caps(struct ceph_inode_info *ci) { - return !RB_EMPTY_ROOT(ci-i_caps) || ci-i_cap_exporting_issued; + return !RB_EMPTY_ROOT(ci-i_caps); } int ceph_is_any_caps(struct inode *inode) @@ -2796,7 +2778,7 @@ static void handle_cap_export(struct inode *inode, struct ceph_mds_caps *ex, { struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)-mdsc; struct ceph_mds_session *tsession = NULL; - struct ceph_cap *cap, *tcap; + struct ceph_cap *cap, *tcap, *new_cap = NULL; struct ceph_inode_info *ci = ceph_inode(inode); u64 t_cap_id; unsigned mseq = le32_to_cpu(ex-migrate_seq); @@ -2858,15 +2840,14 @@ retry: } __ceph_remove_cap(cap, false); goto out_unlock; - } - - if (tsession) { - int flag = (cap == ci-i_auth_cap) ? CEPH_CAP_FLAG_AUTH : 0; - spin_unlock(ci-i_ceph_lock); + } else if (tsession) { /* add
[PATCH 5/6] ceph: introduce ceph_fill_fragtree()
Move the code that update the i_fragtree into a separate function. Also add simple probabilistic test to decide whether the i_fragtree should be updated Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/inode.c | 129 1 file changed, 84 insertions(+), 45 deletions(-) diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 65462ad..80528ff 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -10,6 +10,7 @@ #include linux/writeback.h #include linux/vmalloc.h #include linux/posix_acl.h +#include linux/random.h #include super.h #include mds_client.h @@ -179,9 +180,8 @@ struct ceph_inode_frag *__ceph_find_frag(struct ceph_inode_info *ci, u32 f) * specified, copy the frag delegation info to the caller if * it is present. */ -u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v, -struct ceph_inode_frag *pfrag, -int *found) +static u32 __ceph_choose_frag(struct ceph_inode_info *ci, u32 v, + struct ceph_inode_frag *pfrag, int *found) { u32 t = ceph_frag_make(0, 0); struct ceph_inode_frag *frag; @@ -191,7 +191,6 @@ u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v, if (found) *found = 0; - mutex_lock(ci-i_fragtree_mutex); while (1) { WARN_ON(!ceph_frag_contains_value(t, v)); frag = __ceph_find_frag(ci, t); @@ -220,10 +219,19 @@ u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v, } dout(choose_frag(%x) = %x\n, v, t); - mutex_unlock(ci-i_fragtree_mutex); return t; } +u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v, +struct ceph_inode_frag *pfrag, int *found) +{ + u32 ret; + mutex_lock(ci-i_fragtree_mutex); + ret = __ceph_choose_frag(ci, v, pfrag, found); + mutex_unlock(ci-i_fragtree_mutex); + return ret; +} + /* * Process dirfrag (delegation) info from the mds. Include leaf * fragment in tree ONLY if ndist 0. 
Otherwise, only @@ -286,6 +294,75 @@ out: return err; } +static int ceph_fill_fragtree(struct inode *inode, + struct ceph_frag_tree_head *fragtree, + struct ceph_mds_reply_dirfrag *dirinfo) +{ + struct ceph_inode_info *ci = ceph_inode(inode); + struct ceph_inode_frag *frag; + struct rb_node *rb_node; + int i; + u32 id, nsplits; + bool update = false; + + mutex_lock(ci-i_fragtree_mutex); + nsplits = le32_to_cpu(fragtree-nsplits); + if (nsplits) { + i = prandom_u32() % nsplits; + id = le32_to_cpu(fragtree-splits[i].frag); + if (!__ceph_find_frag(ci, id)) + update = true; + } else if (!RB_EMPTY_ROOT(ci-i_fragtree)) { + rb_node = rb_first(ci-i_fragtree); + frag = rb_entry(rb_node, struct ceph_inode_frag, node); + if (frag-frag != ceph_frag_make(0, 0) || rb_next(rb_node)) + update = true; + } + if (!update dirinfo) { + id = le32_to_cpu(dirinfo-frag); + if (id != __ceph_choose_frag(ci, id, NULL, NULL)) + update = true; + } + if (!update) + goto out_unlock; + + dout(fill_fragtree %llx.%llx\n, ceph_vinop(inode)); + rb_node = rb_first(ci-i_fragtree); + for (i = 0; i nsplits; i++) { + id = le32_to_cpu(fragtree-splits[i].frag); + frag = NULL; + while (rb_node) { + frag = rb_entry(rb_node, struct ceph_inode_frag, node); + if (ceph_frag_compare(frag-frag, id) = 0) { + if (frag-frag != id) + frag = NULL; + else + rb_node = rb_next(rb_node); + break; + } + rb_node = rb_next(rb_node); + rb_erase(frag-node, ci-i_fragtree); + kfree(frag); + frag = NULL; + } + if (!frag) { + frag = __get_or_create_frag(ci, id); + if (IS_ERR(frag)) + continue; + } + frag-split_by = le32_to_cpu(fragtree-splits[i].by); + dout( frag %x split by %d\n, frag-frag, frag-split_by); + } + while (rb_node) { + frag = rb_entry(rb_node, struct ceph_inode_frag, node); + rb_node = rb_next(rb_node); + rb_erase(frag-node, ci-i_fragtree); + kfree(frag); + } +out_unlock: + mutex_unlock(ci-i_fragtree_mutex); + return 0; +} /* * initialize a newly allocated inode. @@ -584,12 +661,8
[PATCH 4/6] ceph: handle cap import atomically
cap import messages are processed by both handle_cap_import() and handle_cap_grant(). These two functions are not executed in the same atomic context, so they can races with cap release. The fix is make handle_cap_import() not release the i_ceph_lock when it returns. Let handle_cap_grant() release the lock after it finishes its job. Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/caps.c | 98 +++--- 1 file changed, 53 insertions(+), 45 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index 73a42f5..318b50f 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2379,23 +2379,20 @@ static void invalidate_aliases(struct inode *inode) * actually be a revocation if it specifies a smaller cap set.) * * caller holds s_mutex and i_ceph_lock, we drop both. - * - * return value: - * 0 - ok - * 1 - check_caps on auth cap only (writeback) - * 2 - check_caps (ack revoke) */ -static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, +static void handle_cap_grant(struct ceph_mds_client *mdsc, +struct inode *inode, struct ceph_mds_caps *grant, +void *snaptrace, int snaptrace_len, +struct ceph_buffer *xattr_buf, struct ceph_mds_session *session, -struct ceph_cap *cap, -struct ceph_buffer *xattr_buf) - __releases(ci-i_ceph_lock) +struct ceph_cap *cap, int issued) + __releases(ci-i_ceph_lock) { struct ceph_inode_info *ci = ceph_inode(inode); int mds = session-s_mds; int seq = le32_to_cpu(grant-seq); int newcaps = le32_to_cpu(grant-caps); - int issued, implemented, used, wanted, dirty; + int used, wanted, dirty; u64 size = le64_to_cpu(grant-size); u64 max_size = le64_to_cpu(grant-max_size); struct timespec mtime, atime, ctime; @@ -2449,10 +2446,6 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, } /* side effects now are allowed */ - - issued = __ceph_caps_issued(ci, implemented); - issued |= implemented | __ceph_caps_dirty(ci); - cap-cap_gen = session-s_cap_gen; cap-seq = seq; @@ -2585,6 +2578,16 @@ static void 
handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, spin_unlock(ci-i_ceph_lock); + if (le32_to_cpu(grant-op) == CEPH_CAP_OP_IMPORT) { + down_write(mdsc-snap_rwsem); + ceph_update_snap_trace(mdsc, snaptrace, + snaptrace + snaptrace_len, + false); + downgrade_write(mdsc-snap_rwsem); + kick_flushing_inode_caps(mdsc, session, inode); + up_read(mdsc-snap_rwsem); + } + if (queue_trunc) { ceph_queue_vmtruncate(inode); ceph_queue_revalidate(inode); @@ -2886,21 +2889,22 @@ out_unlock: } /* - * Handle cap IMPORT. If there are temp bits from an older EXPORT, - * clean them up. + * Handle cap IMPORT. * - * caller holds s_mutex. + * caller holds s_mutex. acquires i_ceph_lock */ static void handle_cap_import(struct ceph_mds_client *mdsc, struct inode *inode, struct ceph_mds_caps *im, struct ceph_mds_cap_peer *ph, struct ceph_mds_session *session, - void *snaptrace, int snaptrace_len) + struct ceph_cap **target_cap, int *old_issued) + __acquires(ci-i_ceph_lock) { struct ceph_inode_info *ci = ceph_inode(inode); - struct ceph_cap *cap, *new_cap = NULL; + struct ceph_cap *cap, *ocap, *new_cap = NULL; int mds = session-s_mds; - unsigned issued = le32_to_cpu(im-caps); + int issued; + unsigned caps = le32_to_cpu(im-caps); unsigned wanted = le32_to_cpu(im-wanted); unsigned seq = le32_to_cpu(im-seq); unsigned mseq = le32_to_cpu(im-migrate_seq); @@ -2929,44 +2933,43 @@ retry: new_cap = ceph_get_cap(mdsc, NULL); goto retry; } + cap = new_cap; + } else { + if (new_cap) { + ceph_put_cap(mdsc, new_cap); + new_cap = NULL; + } } - ceph_add_cap(inode, session, cap_id, -1, issued, wanted, seq, mseq, + __ceph_caps_issued(ci, issued); + issued |= __ceph_caps_dirty(ci); + + ceph_add_cap(inode, session, cap_id, -1, caps, wanted, seq, mseq, realmino, CEPH_CAP_FLAG_AUTH, new_cap); - cap = peer = 0 ? __get_cap_for_mds(ci, peer) : NULL; - if (cap cap-cap_id == p_cap_id) { + ocap = peer = 0 ? __get_cap_for_mds(ci
[PATCH 1/6] ceph: avoid releasing caps that are being used
To avoid releasing caps that are being used, encode_inode_release() should send implemented caps to MDS.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9946ce3..de39a03 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3267,7 +3267,7 @@ int ceph_encode_inode_release(void **p, struct inode *inode,
 	rel->seq = cpu_to_le32(cap->seq);
 	rel->issue_seq = cpu_to_le32(cap->issue_seq),
 	rel->mseq = cpu_to_le32(cap->mseq);
-	rel->caps = cpu_to_le32(cap->issued);
+	rel->caps = cpu_to_le32(cap->implemented);
 	rel->wanted = cpu_to_le32(cap->mds_wanted);
 	rel->dname_len = 0;
 	rel->dname_seq = 0;
--
1.9.0
[PATCH 6/6] ceph: remember subtree root dirfrag's auth MDS
remember dirfrag's auth MDS when it's different from its parent inode's auth MDS.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/inode.c | 8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 80528ff..7736097 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -245,11 +245,17 @@ static int ceph_fill_dirfrag(struct inode *inode,
 	u32 id = le32_to_cpu(dirinfo->frag);
 	int mds = le32_to_cpu(dirinfo->auth);
 	int ndist = le32_to_cpu(dirinfo->ndist);
+	int diri_auth = -1;
 	int i;
 	int err = 0;
 
+	spin_lock(&ci->i_ceph_lock);
+	if (ci->i_auth_cap)
+		diri_auth = ci->i_auth_cap->mds;
+	spin_unlock(&ci->i_ceph_lock);
+
 	mutex_lock(&ci->i_fragtree_mutex);
-	if (ndist == 0) {
+	if (ndist == 0 && mds == diri_auth) {
 		/* no delegation info needed. */
 		frag = __ceph_find_frag(ci, id);
 		if (!frag)
--
1.9.0
[PATCH 2/6] ceph: update inode fields according to issued caps
Cap message and request reply from non-auth MDS may carry stale information (corresponding locks are in LOCK states) even they have the newest inode version. So client should update inode fields according to issued caps. Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/caps.c | 58 fs/ceph/inode.c | 70 include/linux/ceph/ceph_fs.h | 2 ++ 3 files changed, 73 insertions(+), 57 deletions(-) diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c index de39a03..5f6d24e 100644 --- a/fs/ceph/caps.c +++ b/fs/ceph/caps.c @@ -2476,7 +2476,8 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, __check_cap_issue(ci, cap, newcaps); - if ((issued CEPH_CAP_AUTH_EXCL) == 0) { + if ((newcaps CEPH_CAP_AUTH_SHARED) + (issued CEPH_CAP_AUTH_EXCL) == 0) { inode-i_mode = le32_to_cpu(grant-mode); inode-i_uid = make_kuid(init_user_ns, le32_to_cpu(grant-uid)); inode-i_gid = make_kgid(init_user_ns, le32_to_cpu(grant-gid)); @@ -2485,7 +2486,8 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, from_kgid(init_user_ns, inode-i_gid)); } - if ((issued CEPH_CAP_LINK_EXCL) == 0) { + if ((newcaps CEPH_CAP_AUTH_SHARED) + (issued CEPH_CAP_LINK_EXCL) == 0) { set_nlink(inode, le32_to_cpu(grant-nlink)); if (inode-i_nlink == 0 (newcaps (CEPH_CAP_LINK_SHARED | CEPH_CAP_LINK_EXCL))) @@ -2512,31 +2514,35 @@ static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant, if ((issued CEPH_CAP_FILE_CACHE) ci-i_rdcache_gen 1) queue_revalidate = 1; - /* size/ctime/mtime/atime? */ - queue_trunc = ceph_fill_file_size(inode, issued, - le32_to_cpu(grant-truncate_seq), - le64_to_cpu(grant-truncate_size), - size); - ceph_decode_timespec(mtime, grant-mtime); - ceph_decode_timespec(atime, grant-atime); - ceph_decode_timespec(ctime, grant-ctime); - ceph_fill_file_time(inode, issued, - le32_to_cpu(grant-time_warp_seq), ctime, mtime, - atime); - - - /* file layout may have changed */ - ci-i_layout = grant-layout; - - /* max size increase? 
*/ - if (ci-i_auth_cap == cap max_size != ci-i_max_size) { - dout(max_size %lld - %llu\n, ci-i_max_size, max_size); - ci-i_max_size = max_size; - if (max_size = ci-i_wanted_max_size) { - ci-i_wanted_max_size = 0; /* reset */ - ci-i_requested_max_size = 0; + if (newcaps CEPH_CAP_ANY_RD) { + /* ctime/mtime/atime? */ + ceph_decode_timespec(mtime, grant-mtime); + ceph_decode_timespec(atime, grant-atime); + ceph_decode_timespec(ctime, grant-ctime); + ceph_fill_file_time(inode, issued, + le32_to_cpu(grant-time_warp_seq), + ctime, mtime, atime); + } + + if (newcaps (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) { + /* file layout may have changed */ + ci-i_layout = grant-layout; + /* size/truncate_seq? */ + queue_trunc = ceph_fill_file_size(inode, issued, + le32_to_cpu(grant-truncate_seq), + le64_to_cpu(grant-truncate_size), + size); + /* max size increase? */ + if (ci-i_auth_cap == cap max_size != ci-i_max_size) { + dout(max_size %lld - %llu\n, +ci-i_max_size, max_size); + ci-i_max_size = max_size; + if (max_size = ci-i_wanted_max_size) { + ci-i_wanted_max_size = 0; /* reset */ + ci-i_requested_max_size = 0; + } + wake = 1; } - wake = 1; } /* check cap bits */ diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 233c6f9..f9e7399 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -585,14 +585,15 @@ static int fill_inode(struct inode *inode, struct ceph_mds_reply_inode *info = iinfo-in; struct ceph_inode_info *ci = ceph_inode(inode); int i; - int issued = 0, implemented; + int issued = 0, implemented, new_issued; struct timespec mtime, atime, ctime; u32 nsplits; struct ceph_inode_frag *frag; struct rb_node *rb_node; struct ceph_buffer
[PATCH] ceph: clear directory's completeness when creating file
When creating a file, ceph_set_dentry_offset() puts the new dentry at the end of directory's d_subdirs, then set the dentry's offset based on directory's max offset. The offset does not reflect the real postion of the dentry in directory. Later readdir reply from MDS may change the dentry's position/offset. This inconsistency can cause missing/duplicate entries in readdir result if readdir is partly satisfied by dcache_readdir(). The fix is clear directory's completeness after creating/renaming file. It prevents later readdir from using dcache_readdir(). Fixes: http://tracker.ceph.com/issues/8025 Signed-off-by: Yan, Zheng zheng.z@intel.com --- fs/ceph/dir.c | 9 fs/ceph/inode.c | 71 + fs/ceph/super.h | 1 - 3 files changed, 21 insertions(+), 60 deletions(-) diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c index fb4f7a2..c29d6ae 100644 --- a/fs/ceph/dir.c +++ b/fs/ceph/dir.c @@ -448,7 +448,6 @@ more: if (atomic_read(ci-i_release_count) == fi-dir_release_count) { dout( marking %p complete\n, inode); __ceph_dir_set_complete(ci, fi-dir_release_count); - ci-i_max_offset = ctx-pos; } spin_unlock(ci-i_ceph_lock); @@ -937,14 +936,16 @@ static int ceph_rename(struct inode *old_dir, struct dentry *old_dentry, * to do it here. 
*/ - /* d_move screws up d_subdirs order */ - ceph_dir_clear_complete(new_dir); - d_move(old_dentry, new_dentry); /* ensure target dentry is invalidated, despite rehashing bug in vfs_rename_dir */ ceph_invalidate_dentry_lease(new_dentry); + + /* d_move screws up sibling dentries' offsets */ + ceph_dir_clear_complete(old_dir); + ceph_dir_clear_complete(new_dir); + } ceph_mdsc_put_request(req); return err; diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c index 0b0728e..233c6f9 100644 --- a/fs/ceph/inode.c +++ b/fs/ceph/inode.c @@ -744,7 +744,6 @@ static int fill_inode(struct inode *inode, !__ceph_dir_is_complete(ci)) { dout( marking %p complete (empty)\n, inode); __ceph_dir_set_complete(ci, atomic_read(ci-i_release_count)); - ci-i_max_offset = 2; } no_change: /* only update max_size on auth cap */ @@ -890,41 +889,6 @@ out_unlock: } /* - * Set dentry's directory position based on the current dir's max, and - * order it in d_subdirs, so that dcache_readdir behaves. - * - * Always called under directory's i_mutex. - */ -static void ceph_set_dentry_offset(struct dentry *dn) -{ - struct dentry *dir = dn-d_parent; - struct inode *inode = dir-d_inode; - struct ceph_inode_info *ci; - struct ceph_dentry_info *di; - - BUG_ON(!inode); - - ci = ceph_inode(inode); - di = ceph_dentry(dn); - - spin_lock(ci-i_ceph_lock); - if (!__ceph_dir_is_complete(ci)) { - spin_unlock(ci-i_ceph_lock); - return; - } - di-offset = ceph_inode(inode)-i_max_offset++; - spin_unlock(ci-i_ceph_lock); - - spin_lock(dir-d_lock); - spin_lock_nested(dn-d_lock, DENTRY_D_LOCK_NESTED); - list_move(dn-d_u.d_child, dir-d_subdirs); - dout(set_dentry_offset %p %lld (%p %p)\n, dn, di-offset, -dn-d_u.d_child.prev, dn-d_u.d_child.next); - spin_unlock(dn-d_lock); - spin_unlock(dir-d_lock); -} - -/* * splice a dentry to an inode. * caller must hold directory i_mutex for this to be safe. * @@ -933,7 +897,7 @@ static void ceph_set_dentry_offset(struct dentry *dn) * the caller) if we fail. 
*/ static struct dentry *splice_dentry(struct dentry *dn, struct inode *in, - bool *prehash, bool set_offset) + bool *prehash) { struct dentry *realdn; @@ -965,8 +929,6 @@ static struct dentry *splice_dentry(struct dentry *dn, struct inode *in, } if ((!prehash || *prehash) d_unhashed(dn)) d_rehash(dn); - if (set_offset) - ceph_set_dentry_offset(dn); out: return dn; } @@ -987,7 +949,6 @@ int ceph_fill_trace(struct super_block *sb, struct ceph_mds_request *req, { struct ceph_mds_reply_info_parsed *rinfo = req-r_reply_info; struct inode *in = NULL; - struct ceph_mds_reply_inode *ininfo; struct ceph_vino vino; struct ceph_fs_client *fsc = ceph_sb_to_client(sb); int err = 0; @@ -1161,6 +1122,9 @@ retry_lookup: /* rename? */ if (req-r_old_dentry req-r_op == CEPH_MDS_OP_RENAME) { + struct inode *olddir = req-r_old_dentry_dir; + BUG_ON(!olddir); + dout( src %p '%.*s' dst %p '%.*s'\n, req
Re: [PATCH] ceph: clear directory's completeness when creating file
On Mon, Apr 14, 2014 at 9:59 PM, Sage Weil s...@inktank.com wrote:
> On Mon, 14 Apr 2014, Yan, Zheng wrote:
>> When creating a file, ceph_set_dentry_offset() puts the new dentry at
>> the end of the directory's d_subdirs, then sets the dentry's offset
>> based on the directory's max offset. The offset does not reflect the
>> real position of the dentry in the directory. A later readdir reply
>> from the MDS may change the dentry's position/offset. This
>> inconsistency can cause missing/duplicate entries in the readdir
>> result if readdir is partly satisfied by dcache_readdir(). The fix is
>> to clear the directory's completeness after creating/renaming a file.
>> That prevents later readdir from using dcache_readdir().
>
> Two thoughts:
>
> First, we could preserve this behavior when the directory is small
> (e.g., < 1000 entries, or whatever readdir_max is set to) since any
> readdir would always be satisfied in a single request and we don't
> need to worry about the mds vs dcache_readdir() case. I think that
> will still capture the

No, we couldn't, because the caller of readdir can provide a small
buffer for the readdir result.

> benefit for many/most directories. This is probably primarily a matter
> of adding a directory size counter into the ceph_dentry_info?
>
> Second, it seems like in order to keep this behavior in the general
> case, we would basically need to build an rbtree that mirrors the
> d_subdirs list and sorts the same way the mds does (by frag, name).
> Which would mean resorting that list when we discover a split/merge
> event. That sounds like a pretty big pain to me. And similarly, I
> think that the larger the directory, the less important readdir
> usually is in the workload...
> sage
Re: [PATCH] ceph: clear directory's completeness when creating file
On Mon, Apr 14, 2014 at 11:12 PM, Sage Weil s...@inktank.com wrote:
> On Mon, 14 Apr 2014, Yan, Zheng wrote:
>> On Mon, Apr 14, 2014 at 9:59 PM, Sage Weil s...@inktank.com wrote:
>>> On Mon, 14 Apr 2014, Yan, Zheng wrote:
>>>> When creating a file, ceph_set_dentry_offset() puts the new dentry
>>>> at the end of the directory's d_subdirs, then sets the dentry's
>>>> offset based on the directory's max offset. The offset does not
>>>> reflect the real position of the dentry in the directory. A later
>>>> readdir reply from the MDS may change the dentry's position/offset.
>>>> This inconsistency can cause missing/duplicate entries in the
>>>> readdir result if readdir is partly satisfied by dcache_readdir().
>>>> The fix is to clear the directory's completeness after
>>>> creating/renaming a file. That prevents later readdir from using
>>>> dcache_readdir().
>>>
>>> Two thoughts:
>>>
>>> First, we could preserve this behavior when the directory is small
>>> (e.g., < 1000 entries, or whatever readdir_max is set to) since any
>>> readdir would always be satisfied in a single request and we don't
>>> need to worry about the mds vs dcache_readdir() case. I think that
>>> will still capture the
>>
>> No, we couldn't, because the caller of readdir can provide a small
>> buffer for the readdir result.
>
> Hmm, we could set a different flag, UNORDERED, and then on readdir
> clear COMPLETE if UNORDERED && size <= X...

It is too tricky. The readdir buffer is hidden by the VFS; the FS
driver has no way to check its size.

>>> benefit for many/most directories. This is probably primarily a
>>> matter of adding a directory size counter into the ceph_dentry_info?
>>>
>>> Second, it seems like in order to keep this behavior in the general
>>> case, we would basically need to build an rbtree that mirrors the
>>> d_subdirs list and sorts the same way the mds does (by frag, name).
>>> Which would mean resorting that list when we discover a split/merge
>>> event. That sounds like a pretty big pain to me. And similarly, I
>>> think that the larger the directory, the less important readdir
>>> usually is in the workload...
>>> sage
Re: [PATCH] rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO
On Fri, Apr 11, 2014 at 4:38 PM, Duan Jiong duanj.f...@cn.fujitsu.com wrote:
> This patch fixes a coccinelle warning about using IS_ERR and PTR_ERR
> where PTR_ERR_OR_ZERO would do.
>
> Signed-off-by: Duan Jiong duanj.f...@cn.fujitsu.com
> ---
>  drivers/block/rbd.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
> index 4c95b50..552a2ed 100644
> --- a/drivers/block/rbd.c
> +++ b/drivers/block/rbd.c
> @@ -4752,7 +4752,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)
>  		image_id = ceph_extract_encoded_string(p, p + ret,
>  						       NULL, GFP_NOIO);
> -		ret = IS_ERR(image_id) ? PTR_ERR(image_id) : 0;
> +		ret = PTR_ERR_OR_ZERO(image_id);
>  		if (!ret)
>  			rbd_dev->image_format = 2;
>  	} else {

Reviewed-by: Yan, Zheng zheng.z@intel.com

> --
> 1.8.3.1
Re: [PATCH] ceph: remove useless ACL check
On Thu, Apr 10, 2014 at 1:29 PM, Zhang Zhen zhenzhang.zh...@huawei.com wrote:
> posix_acl_xattr_set() already does the check, and it's the only way to
> feed in an ACL from userspace. So the check here is useless; remove it.
>
> Signed-off-by: zhang zhen zhenzhang.zh...@huawei.com
> ---
>  fs/ceph/acl.c | 6 ------
>  1 file changed, 6 deletions(-)
>
> diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
> index 21887d6..469f2e8 100644
> --- a/fs/ceph/acl.c
> +++ b/fs/ceph/acl.c
> @@ -104,12 +104,6 @@ int ceph_set_acl(struct inode *inode, struct posix_acl *acl, int type)
>  	umode_t new_mode = inode->i_mode, old_mode = inode->i_mode;
>  	struct dentry *dentry;
>
> -	if (acl) {
> -		ret = posix_acl_valid(acl);
> -		if (ret < 0)
> -			goto out;
> -	}
> -
>  	switch (type) {
>  	case ACL_TYPE_ACCESS:
>  		name = POSIX_ACL_XATTR_ACCESS;
> --
> 1.8.5.5

Added to our testing branch. Thanks.

Yan, Zheng