Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Thu, Dec 17, 2015 at 4:56 PM, Eric Eastman
<eric.east...@keepertech.com> wrote:
> I patched the 4.4rc4 kernel source and restarted the test.  Shortly
> after starting it, this showed up in dmesg:
>
> [Thu Dec 17 03:29:55 2015] WARNING: CPU: 0 PID: 2547 at
> fs/ceph/addr.c:1162 ceph_write_begin+0xfb/0x120 [ceph]()
> [Thu Dec 17 03:29:55 2015] Modules linked in: iscsi_target_mod
> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
> target_core_file target_core_iblock target_core_pscsi target_core_user
> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
> ipmi_ssif drm_kms_helper drm coretemp kvm gpio_ich i2c_algo_bit
> i7core_edac fb_sys_fops syscopyarea edac_core sysfillrect sysimgblt
> ipmi_si input_leds hpilo ipmi_msghandler shpchp acpi_power_meter
> irqbypass serio_raw 8250_fintek lpc_ich mac_hid ceph bonding libceph
> lp parport libcrc32c fscache mlx4_en vxlan ip6_udp_tunnel udp_tunnel
> ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse bnx2 fjes
> scsi_transport_sas [last unloaded: target_core_mod]
> [Thu Dec 17 03:29:55 2015] CPU: 0 PID: 2547 Comm: iscsi_trx Tainted: G
>W I 4.4.0-rc4-ede1 #1
> [Thu Dec 17 03:29:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Thu Dec 17 03:29:55 2015]  c020cd47 8805f1e97958
> 813ad644 
> [Thu Dec 17 03:29:55 2015]  8805f1e97990 81079702
> 8805f1e97a50 015dd000
> [Thu Dec 17 03:29:55 2015]  880c034df800 0200
> eab26a80 8805f1e979a0
> [Thu Dec 17 03:29:55 2015] Call Trace:
> [Thu Dec 17 03:29:55 2015]  [] dump_stack+0x44/0x60
> [Thu Dec 17 03:29:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Thu Dec 17 03:29:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_begin+0xfb/0x120 [ceph]
> [Thu Dec 17 03:29:55 2015]  []
> generic_perform_write+0xbf/0x1a0
> [Thu Dec 17 03:29:55 2015]  []
> ceph_write_iter+0xf5c/0x1010 [ceph]
> [Thu Dec 17 03:29:55 2015]  [] ? __enqueue_entity+0x6c/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> iov_iter_get_pages+0x113/0x210
> [Thu Dec 17 03:29:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Thu Dec 17 03:29:55 2015]  [] vfs_iter_write+0x63/0xa0
> [Thu Dec 17 03:29:55 2015]  []
> fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Thu Dec 17 03:29:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] ? __switch_to+0x1cd/0x570
> [Thu Dec 17 03:29:55 2015]  [] ?
> iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
> [Thu Dec 17 03:29:55 2015]  [] kthread+0xc9/0xe0
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015]  [] ret_from_fork+0x3f/0x70
> [Thu Dec 17 03:29:55 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 03:29:55 2015] ---[ end trace 382a45986961da4e ]---


Could you please apply the new incremental patch and try again?


Regards
Yan, Zheng


>
> There are WARNINGs on both lines 125 and 1162. I will attach the
> whole set of dmesg output to tracker ticket 14086.
>
> I wanted to note that file system snapshots are enabled and being used
> on this file system.
>
> Thanks
> Eric
>
> On Wed, Dec 16, 2015 at 8:15 AM, Eric Eastman
> <eric.east...@keepertech.com> wrote:
>>>>
>>> This warning is really strange. Could you try the attached debug patch.
>>>
>>> Regards
>>> Yan, Zheng
>>
>> I will try the patch and get back to the list.
>>
>> Eric


cephfs1.patch
Description: Binary data


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Fri, Dec 18, 2015 at 2:23 PM, Eric Eastman
<eric.east...@keepertech.com> wrote:
>> Hi Yan Zheng, Eric Eastman
>>
>> A similar bug was reported in f2fs and btrfs; it does affect 4.4-rc4, and the
>> fixing patch was merged into 4.4-rc5: dfd01f026058 ("sched/wait: Fix the signal
>> handling fix").
>>
>> Related report & discussion was here:
>> https://lkml.org/lkml/2015/12/12/149
>>
>> I'm not sure whether the currently reported ceph issue is related to that,
>> but testing with an upgraded or patched kernel could at least verify it. :)
>>
>> Thanks,
>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of
>>> Yan, Zheng
>>> Sent: Friday, December 18, 2015 12:05 PM
>>> To: Eric Eastman
>>> Cc: Ceph Development
>>> Subject: Re: Issue with Ceph File System and LIO
>>>
>>> On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman
>>> <eric.east...@keepertech.com> wrote:
>>> > With cephfs.patch and cephfs1.patch applied, I am now seeing:
>>> >
>>> > [Thu Dec 17 14:27:59 2015] [ cut here ]
>>> > [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
>>> > fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
>>> > [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
> ...
>>> >
>>>
>>> The page gets unlocked mysteriously; I still can't find any clue. Could
>>> you please try the new patch (not an incremental patch)? Also, please
>>> enable CONFIG_DEBUG_VM when compiling the kernel.
>>>
>>> Thank you very much
>>> Yan, Zheng
>>
> I have just installed the cephfs_new.patch and have set
> CONFIG_DEBUG_VM=y on a new 4.4rc4 kernel and restarted the ESXi iSCSI
> test to my Ceph File System gateway.  I plan to let it run overnight
> and report the status tomorrow.
>
> Let me know if I should move on to 4.4rc5 with or without patches and
> with or without  CONFIG_DEBUG_VM=y
>

please try rc5 kernel without patches and DEBUG_VM=y

Regards
Yan, Zheng


> Looking at the network traffic stats on my iSCSI gateway, with
> CONFIG_DEBUG_VM=y, throughput seems to be down by a factor of at least
> 10 compared to my last test without setting CONFIG_DEBUG_VM=y
>
> Regards,
> Eric


Re: Issue with Ceph File System and LIO

2015-12-17 Thread Yan, Zheng
On Fri, Dec 18, 2015 at 3:49 AM, Eric Eastman
<eric.east...@keepertech.com> wrote:
> With cephfs.patch and cephfs1.patch applied, I am now seeing:
>
> [Thu Dec 17 14:27:59 2015] [ cut here ]
> [Thu Dec 17 14:27:59 2015] WARNING: CPU: 0 PID: 3036 at
> fs/ceph/addr.c:1171 ceph_write_begin+0xfb/0x120 [ceph]()
> [Thu Dec 17 14:27:59 2015] Modules linked in: iscsi_target_mod
> vhost_scsi tcm_qla2xxx ib_srpt tcm_fc tcm_usb_gadget tcm_loop
> target_core_file target_core_iblock target_core_pscsi target_core_user
> target_core_mod ipmi_devintf vhost qla2xxx ib_cm ib_sa ib_mad ib_core
> ib_addr libfc scsi_transport_fc libcomposite udc_core uio configfs ttm
> drm_kms_helper drm ipmi_ssif coretemp gpio_ich i2c_algo_bit kvm
> fb_sys_fops syscopyarea sysfillrect sysimgblt shpchp input_leds ceph
> irqbypass i7core_edac serio_raw hpilo edac_core ipmi_si
> ipmi_msghandler 8250_fintek lpc_ich acpi_power_meter libceph mac_hid
> libcrc32c fscache bonding lp parport mlx4_en vxlan ip6_udp_tunnel
> udp_tunnel ptp pps_core hid_generic usbhid hid mlx4_core hpsa psmouse
> bnx2 fjes scsi_transport_sas [last unloaded: target_core_mod]
> [Thu Dec 17 14:27:59 2015] CPU: 0 PID: 3036 Comm: iscsi_trx Tainted: G
>W I 4.4.0-rc4-ede2 #1
> [Thu Dec 17 14:27:59 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Thu Dec 17 14:27:59 2015]  c02b2e37 880c0289b958
> 813ad644 
> [Thu Dec 17 14:27:59 2015]  880c0289b990 81079702
> 880c0289ba50 000846c21000
> [Thu Dec 17 14:27:59 2015]  880c009ea200 1000
> ea00122ed700 880c0289b9a0
> [Thu Dec 17 14:27:59 2015] Call Trace:
> [Thu Dec 17 14:27:59 2015]  [] dump_stack+0x44/0x60
> [Thu Dec 17 14:27:59 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Thu Dec 17 14:27:59 2015]  [] warn_slowpath_null+0x1a/0x20
> [Thu Dec 17 14:27:59 2015]  []
> ceph_write_begin+0xfb/0x120 [ceph]
> [Thu Dec 17 14:27:59 2015]  []
> generic_perform_write+0xbf/0x1a0
> [Thu Dec 17 14:27:59 2015]  []
> ceph_write_iter+0xf5c/0x1010 [ceph]
> [Thu Dec 17 14:27:59 2015]  [] ? __schedule+0x386/0x9c0
> [Thu Dec 17 14:27:59 2015]  [] ? schedule+0x35/0x80
> [Thu Dec 17 14:27:59 2015]  [] ? __slab_free+0xb5/0x290
> [Thu Dec 17 14:27:59 2015]  [] ?
> iov_iter_get_pages+0x113/0x210
> [Thu Dec 17 14:27:59 2015]  [] vfs_iter_write+0x63/0xa0
> [Thu Dec 17 14:27:59 2015]  []
> fd_do_rw.isra.5+0xc9/0x1b0 [target_core_file]
> [Thu Dec 17 14:27:59 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Thu Dec 17 14:27:59 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  [] ? __switch_to+0x1cd/0x570
> [Thu Dec 17 14:27:59 2015]  [] ?
> iscsi_target_tx_thread+0x1c0/0x1c0 [iscsi_target_mod]
> [Thu Dec 17 14:27:59 2015]  [] kthread+0xc9/0xe0
> [Thu Dec 17 14:27:59 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 14:27:59 2015]  [] ret_from_fork+0x3f/0x70
> [Thu Dec 17 14:27:59 2015]  [] ?
> kthread_create_on_node+0x180/0x180
> [Thu Dec 17 14:27:59 2015] ---[ end trace 8346192e3f29ed5d ]---
>

The page gets unlocked mysteriously; I still can't find any clue. Could
you please try the new patch (not an incremental patch)? Also, please
enable CONFIG_DEBUG_VM when compiling the kernel.

Thank you very much
Yan, Zheng


cephfs_new.patch
Description: Binary data


Re: Issue with Ceph File System and LIO

2015-12-16 Thread Yan, Zheng
On Wed, Dec 16, 2015 at 12:51 AM, Eric Eastman
<eric.east...@keepertech.com> wrote:
> I have opened ticket: 14086
>
> On Tue, Dec 15, 2015 at 5:05 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Tue, Dec 15, 2015 at 2:08 PM, Eric Eastman
>>> [Tue Dec 15 00:46:55 2015] [ cut here ]
>>> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
>>> /home/kernel/COD/linux/fs/ceph/addr.c:125
>>
>> could you confirm that addr.c:125 is WARN_ON(!PageLocked(page));
>
> I am using the generic kernel from:
> http://kernel.ubuntu.com/~kernel-ppa/mainline/v4.4-rc4-wily
> and assuming they did not change anything, from the 4.4rc4 source tree
> I pulled down shows:
>
> 124 ret = __set_page_dirty_nobuffers(page);
> 125 WARN_ON(!PageLocked(page));
> 126 WARN_ON(!page->mapping);
>
>
> modinfo ceph
> filename:   /lib/modules/4.4.0-040400rc4-generic/kernel/fs/ceph/ceph.ko
> license:GPL
> description:Ceph filesystem for Linux
> author: Patience Warnick <patie...@newdream.net>
> author: Yehuda Sadeh <yeh...@hq.newdream.net>
> author: Sage Weil <s...@newdream.net>
> alias:  fs-ceph
> srcversion: E94BA78C2D998705FE2C600
> depends:libceph,fscache
> intree: Y
> vermagic:   4.4.0-040400rc4-generic SMP mod_unload modversions
>
> This error has shown up about 20 times in 12 hours, since I started
> the ESXi test.
>

This warning is really strange. Could you try the attached debug patch.

Regards
Yan, Zheng


cephfs.patch
Description: Binary data


Re: Issue with Ceph File System and LIO

2015-12-15 Thread Yan, Zheng
On Tue, Dec 15, 2015 at 2:08 PM, Eric Eastman
<eric.east...@keepertech.com> wrote:
> I am testing Linux Target SCSI, LIO, with a Ceph File System backstore
> and I am seeing this error on my LIO gateway.  I am using Ceph v9.2.0
> on a 4.4rc4 Kernel, on Trusty, using a kernel mounted Ceph File
> System.  A file on the Ceph File System is exported via iSCSI to a
> VMware ESXi 5.0 server, and I am seeing this error when doing a lot of
> I/O on the ESXi server.   Is this a LIO or a Ceph issue?
>
> [Tue Dec 15 00:46:55 2015] [ cut here ]
> [Tue Dec 15 00:46:55 2015] WARNING: CPU: 0 PID: 1123421 at
> /home/kernel/COD/linux/fs/ceph/addr.c:125

could you confirm that addr.c:125 is WARN_ON(!PageLocked(page));

Regards
Yan, Zheng

> ceph_set_page_dirty+0x230/0x240 [ceph]()
> [Tue Dec 15 00:46:55 2015] Modules linked in: iptable_filter ip_tables
> x_tables xfs rbd iscsi_target_mod vhost_scsi tcm_qla2xxx ib_srpt
> tcm_fc tcm_usb_gadget tcm_loop target_core_file target_core_iblock
> target_core_pscsi target_core_user target_core_mod ipmi_devintf vhost
> qla2xxx ib_cm ib_sa ib_mad ib_core ib_addr libfc scsi_transport_fc
> libcomposite udc_core uio configfs ipmi_ssif ttm drm_kms_helper
> gpio_ich drm i2c_algo_bit fb_sys_fops coretemp syscopyarea ipmi_si
> sysfillrect ipmi_msghandler sysimgblt kvm acpi_power_meter 8250_fintek
> irqbypass hpilo shpchp input_leds serio_raw lpc_ich i7core_edac
> edac_core mac_hid ceph libceph libcrc32c fscache bonding lp parport
> mlx4_en vxlan ip6_udp_tunnel udp_tunnel ptp pps_core hid_generic
> usbhid hid hpsa mlx4_core psmouse bnx2 scsi_transport_sas fjes [last
> unloaded: target_core_mod]
> [Tue Dec 15 00:46:55 2015] CPU: 0 PID: 1123421 Comm: iscsi_trx
> Tainted: GW I 4.4.0-040400rc4-generic #201512061930
> [Tue Dec 15 00:46:55 2015] Hardware name: HP ProLiant DL360 G6, BIOS
> P64 01/22/2015
> [Tue Dec 15 00:46:55 2015]   fdc0ce43
> 880bf38c38c0 813c8ab4
> [Tue Dec 15 00:46:55 2015]   880bf38c38f8
> 8107d772 ea00127a8680
> [Tue Dec 15 00:46:55 2015]  8804e52c1448 8804e52c15b0
> 8804e52c10f0 0200
> [Tue Dec 15 00:46:55 2015] Call Trace:
> [Tue Dec 15 00:46:55 2015]  [] dump_stack+0x44/0x60
> [Tue Dec 15 00:46:55 2015]  [] 
> warn_slowpath_common+0x82/0xc0
> [Tue Dec 15 00:46:55 2015]  [] warn_slowpath_null+0x1a/0x20
> [Tue Dec 15 00:46:55 2015]  []
> ceph_set_page_dirty+0x230/0x240 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> pagecache_get_page+0x150/0x1c0
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_pool_perm_check+0x48/0x700 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] set_page_dirty+0x3d/0x70
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_end+0x5e/0x180 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> iov_iter_copy_from_user_atomic+0x156/0x220
> [Tue Dec 15 00:46:55 2015]  []
> generic_perform_write+0x114/0x1c0
> [Tue Dec 15 00:46:55 2015]  []
> ceph_write_iter+0xf8a/0x1050 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> ceph_put_cap_refs+0x143/0x320 [ceph]
> [Tue Dec 15 00:46:55 2015]  [] ?
> check_preempt_wakeup+0xfa/0x220
> [Tue Dec 15 00:46:55 2015]  [] ? zone_statistics+0x7c/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ? copy_page_to_iter+0x5e/0xa0
> [Tue Dec 15 00:46:55 2015]  [] ?
> skb_copy_datagram_iter+0x122/0x250
> [Tue Dec 15 00:46:55 2015]  [] vfs_iter_write+0x76/0xc0
> [Tue Dec 15 00:46:55 2015]  []
> fd_do_rw.isra.5+0xd8/0x1e0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> fd_execute_rw+0xc5/0x2a0 [target_core_file]
> [Tue Dec 15 00:46:55 2015]  []
> sbc_execute_rw+0x22/0x30 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> __target_execute_cmd+0x1f/0x70 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> target_execute_cmd+0x195/0x2a0 [target_core_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_execute_cmd+0x20a/0x270 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsit_sequence_cmd+0xda/0x190 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  []
> iscsi_target_rx_thread+0x51d/0xe30 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] ? __switch_to+0x1dc/0x5a0
> [Tue Dec 15 00:46:55 2015]  [] ?
> iscsi_target_tx_thread+0x1e0/0x1e0 [iscsi_target_mod]
> [Tue Dec 15 00:46:55 2015]  [] kthread+0xd8/0xf0
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015]  [] ret_from_fork+0x3f/0x70
> [Tue Dec 15 00:46:55 2015]  [] ?
> kthread_create_on_node+0x1a0/0x1a0
> [Tue Dec 15 00:46:55 2015] ---[ end trace 4079437668c77cbb ]---
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: Found referenced iSCSI task_tag: 
> 95784927
> [Tue Dec 15 00:47:45 2015] ABORT_TASK: ref_tag: 95784927 already
> complete, skipping
>
> If it is a Ceph File System issue

Re: Reply: how to see file object-mappings for cephfuse client

2015-12-07 Thread Yan, Zheng
On Mon, Dec 7, 2015 at 1:52 PM, Wuxiangwei <wuxiang...@h3c.com> wrote:
> Thanks Yan, what if we wanna see some more specific or detailed information? 
> E.g. with cephfs we may run 'cephfs /mnt/a.txt show_location --offset' to 
> find the location of a given offset.
>

When using the default layout, the object size is 4 MB and (offset / 4194304)
is the suffix of the object name (object names have the form
<inode number in hex>.<object index in hex>). For example, offset 0 is located
in object <ino>.00000000 and offset 4194304 is located in object <ino>.00000001.

For fancy layout, please read
http://docs.ceph.com/docs/master/dev/file-striping/
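
For illustration, a rough shell sketch of that mapping for the default layout
(the mount point and file path are hypothetical, and a custom layout changes
the object size, so check ceph.file.layout first):

# compute the RADOS object that holds a given byte offset of a CephFS file,
# assuming the default 4 MB object size; path below is only an example
FILE=/mnt/cephfs/a.txt
OFFSET=4194304
INO_HEX=$(printf '%x' "$(stat -c %i "$FILE")")   # CephFS inode number in hex
IDX=$((OFFSET / 4194304))                        # object index for 4 MB objects
printf 'offset %s of %s is in object %s.%08x\n' "$OFFSET" "$FILE" "$INO_HEX" "$IDX"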

> ---
> Wu Xiangwei
> Tel : 0571-86760875
> 2014 UIS 2, TEAM BORE
>
>
> -----Original Message-----
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: 7 December 2015 11:22
> To: wuxiangwei 09660 (RD)
> Cc: ceph-devel@vger.kernel.org; ceph-us...@lists.ceph.com
> Subject: Re: how to see file object-mappings for cephfuse client
>
> On Mon, Dec 7, 2015 at 10:51 AM, Wuxiangwei <wuxiang...@h3c.com> wrote:
>> Hi, Everyone
>>
>> Recently I'm trying to figure out how to use ceph-fuse. If we mount cephfs 
>> as the kernel client, there is a 'cephfs' command tool (though it seems to 
>> be 'deprecated') with 'map' and 'show_location' commands to show the RADOS 
>> objects belonging to a given file. However, it doesn't work well for 
>> ceph-fuse, neither can I find a similar tool to see the mappings. Any 
>> suggestions?
>> Thank you!
>
> RADOS objects for a given inode are named in the form
> <inode number in hex>.<object index in hex>.
>
> To get a file's layout, you can use: getfattr -n ceph.file.layout <file>
>
> To find which OSDs a given object is stored on, you can use the command:
> ceph osd map <pool> <object>
>
>>
>>
>> ---
>> Wu Xiangwei
>> Tel : 0571-86760875
>> 2014 UIS 2, TEAM BORE
>>


Re: how to see file object-mappings for cephfuse client

2015-12-06 Thread Yan, Zheng
On Mon, Dec 7, 2015 at 10:51 AM, Wuxiangwei  wrote:
> Hi, Everyone
>
> Recently I'm trying to figure out how to use ceph-fuse. If we mount cephfs as 
> the kernel client, there is a 'cephfs' command tool (though it seems to be 
> 'deprecated') with 'map' and 'show_location' commands to show the RADOS 
> objects belonging to a given file. However, it doesn't work well for 
> ceph-fuse, neither can I find a similar tool to see the mappings. Any 
> suggestions?
> Thank you!

RADOS objects for a given inode are named in the form
<inode number in hex>.<object index in hex>.

To get a file's layout, you can use: getfattr -n ceph.file.layout <file>

To find which OSDs a given object is stored on, you can use the command:
ceph osd map <pool> <object>
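
A quick end-to-end sketch (the file path, pool name, and object name below are
hypothetical examples):

# show the file's layout (stripe unit/count, object size, pool)
getfattr -n ceph.file.layout /mnt/cephfs/a.txt

# map one of the file's objects to its PG and OSDs; here we assume the data
# pool is called "data" and the file's first object is 10000000123.00000000
ceph osd map data 10000000123.00000000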

>
>
> ---
> Wu Xiangwei
> Tel : 0571-86760875
> 2014 UIS 2, TEAM BORE
>


Re: Compiling for FreeBSD

2015-12-02 Thread Yan, Zheng
On Thu, Dec 3, 2015 at 4:52 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
> On 2-12-2015 15:13, Yan, Zheng wrote:
>> see https://github.com/ceph/ceph/pull/6770. The code can be compiled
>> on FreeBSD/OSX, most client programs can connect to ceph servers on
>> Linux.
>
> Hi,
>
> I do like some of the inline compiler tests.
>
> I guess the errnos are handled the same way the other OSes have done it?
> I'd normally solve this with a static array, and just index it.
> But perhaps the compiler is smart enough to do the same.
>
>
> I see that you have disabled uuid?
> Might I ask why?

It's not disabled. Ceph currently uses the Boost UUID implementation, so there
is no need to link against libuuid.

>
> I suggest you have a look at the issue Alan brought up, which is a possible
> fix for doing it the other way around: Linux clients on a FreeBSD "cluster".
> But as Sage suggests, it could very well be solved by the fixes brought in for AIX.
>
> --WjW
>
>> Regards
>> Yan. Zheng
>>
>> On Wed, Dec 2, 2015 at 2:43 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
>>> On 1-12-2015 19:36, Sage Weil wrote:
>>>>
>>>> On Tue, 1 Dec 2015, Alan Somers wrote:
>>>>>
>>>>> On Tue, Dec 1, 2015 at 11:08 AM, Willem Jan Withagen <w...@digiware.nl>
>>>>> wrote:
>>>>>>
>>>>>> On 1-12-2015 18:22, Alan Somers wrote:
>>>>>>>
>>>>>>>
>>>>>>> I did some work porting Ceph to FreeBSD, but got distracted and
>>>>>>> stopped about two years ago.  You may find this port useful, though it
>>>>>>> will probably need to be updated:
>>>>>>>
>>>>>>> https://people.freebsd.org/~asomers/ports/net/ceph/
>>>>>>
>>>>>>
>>>>>>
>>>>>> I'll check that one as well...
>>>>>>
>>>>>>> Also, there's one major outstanding issue that I know of.  It breaks
>>>>>>> interoperability between FreeBSD and Linux Ceph nodes.  I posted a
>>>>>>> patch to fix it, but it doesn't look like it's been merged yet.
>>>>>>> http://tracker.ceph.com/issues/6636
>>>>>>
>>>>>>
>>>>>>
>>>>>> In the issues I find:
>>>>>> 
>>>>>> Updated by Sage Weil almost 2 years ago
>>>>>>
>>>>>>  Status changed from New to Verified
>>>>>> Updated by Sage Weil almost 2 years ago
>>>>>>
>>>>>>  Assignee set to Noah Watkins
>>>>>> 
>>>>>>
>>>>>> Probably left at that point because there was no pressure to actually
>>>>>> commit?
>>>>>>
>>>>>> --WjW
>>>>>
>>>>>
>>>>> It looks like Sage reviewed the change, but had some comments that
>>>>> were mostly style-related.  Neither Noah nor I actually got around to
>>>>> implementing Sage's suggestions.
>>>>>
>>>>> https://github.com/ceph/ceph/pull/828
>>>>
>>>>
>>>> The uuid transition to boost::uuid has happened since then (a few months
>>>> back) and I believe Rohan's AIX and Solaris ports for librados (that just
>>>> merged) included a fix for the sockaddr_storage issue:
>>>>
>>>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L180
>>>>
>>>> and also
>>>>
>>>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L160
>>>
>>>
>>>
>>> Would be nice to actually find that this works for FreeBSD as well.
>>> But I'm putting this on the watch-list, once I get there.
>>>
>>> --WjW


Re: Compiling for FreeBSD

2015-12-02 Thread Yan, Zheng
see https://github.com/ceph/ceph/pull/6770. The code can be compiled
on FreeBSD/OSX, most client programs can connect to ceph servers on
Linux.

Regards
Yan. Zheng

On Wed, Dec 2, 2015 at 2:43 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
> On 1-12-2015 19:36, Sage Weil wrote:
>>
>> On Tue, 1 Dec 2015, Alan Somers wrote:
>>>
>>> On Tue, Dec 1, 2015 at 11:08 AM, Willem Jan Withagen <w...@digiware.nl>
>>> wrote:
>>>>
>>>> On 1-12-2015 18:22, Alan Somers wrote:
>>>>>
>>>>>
>>>>> I did some work porting Ceph to FreeBSD, but got distracted and
>>>>> stopped about two years ago.  You may find this port useful, though it
>>>>> will probably need to be updated:
>>>>>
>>>>> https://people.freebsd.org/~asomers/ports/net/ceph/
>>>>
>>>>
>>>>
>>>> I'll check that one as well...
>>>>
>>>>> Also, there's one major outstanding issue that I know of.  It breaks
>>>>> interoperability between FreeBSD and Linux Ceph nodes.  I posted a
>>>>> patch to fix it, but it doesn't look like it's been merged yet.
>>>>> http://tracker.ceph.com/issues/6636
>>>>
>>>>
>>>>
>>>> In the issues I find:
>>>> 
>>>> Updated by Sage Weil almost 2 years ago
>>>>
>>>>  Status changed from New to Verified
>>>> Updated by Sage Weil almost 2 years ago
>>>>
>>>>  Assignee set to Noah Watkins
>>>> 
>>>>
>>>> Probably left at that point because there was no pressure to actually
>>>> commit?
>>>>
>>>> --WjW
>>>
>>>
>>> It looks like Sage reviewed the change, but had some comments that
>>> were mostly style-related.  Neither Noah nor I actually got around to
>>> implementing Sage's suggestions.
>>>
>>> https://github.com/ceph/ceph/pull/828
>>
>>
>> The uuid transition to boost::uuid has happened since then (a few months
>> back) and I believe Rohan's AIX and Solaris ports for librados (that just
>> merged) included a fix for the sockaddr_storage issue:
>>
>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L180
>>
>> and also
>>
>> https://github.com/ceph/ceph/blob/master/src/msg/msg_types.h#L160
>
>
>
> Would be nice to actually find that this works for FreeBSD as well.
> But I'm putting this on the watch-list, once I get there.
>
> --WjW


Re: Compiling for FreeBSD

2015-11-29 Thread Yan, Zheng
On Mon, Nov 30, 2015 at 2:57 AM, Willem Jan Withagen <w...@digiware.nl> wrote:
>
> On 29-11-2015 19:08, Haomai Wang wrote:
>
> > I guess we still expect FreeBSD support; which version did you test the
> > compile on? I'd like to help make BSD work :-)
>
> I considered it best to develop against HEAD, aka:
> 11.0-CURRENT FreeBSD 11.0-CURRENT #0 r291381: Sat Nov 28 14:22:54 CET 2015
> I'm also trying to configure the build to use as much of Clang as possible.
>
> I would guess that by the time we get anything worth mentioning, 11.0
> will be released or close to release.
> Note that the release date for 11.0 is July 2016.
>
> --WjW
>

I can compile infernalis on FreeBSD 10.1 with the following commands:

./autogen.sh
./configure CPPFLAGS="-I/usr/local/include" LDFLAGS="-L/usr/local/lib"
CXXFLAGS="-DGTEST_HAS_TR1_TUPLE=0" --without-tcmalloc --without-libaio
--without-libxfs
gmake

I don't know the exact list of package dependencies, but ./configure
should tell you what is missing.

Yan, Zheng
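
In case it helps, a rough guess at a starting set of FreeBSD packages to
install before running ./configure (the names below are a best guess, not a
verified dependency list; ./configure will report anything still missing):

# package names are guesses; adjust to whatever ./configure complains about
pkg install autoconf automake libtool pkgconf gmake boost-libs leveldb snappy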


Re: Ceph-fuse single read limitation?‏

2015-11-23 Thread Yan, Zheng

> On Nov 23, 2015, at 16:40, Z Zhang <zhangz.da...@outlook.com> wrote:
> 
> Hi Yan, Thanks for the reply. I already tried the following settings, but no
> luck.
>
>     client_readahead_min = 1048576
>     client_readahead_max_bytes = 4194304
>
> Thanks.
> Zhi Zhang (David)
>
> > Subject: Re: Ceph-fuse single read limitation?‏
> > From: z...@redhat.com
> > Date: Mon, 23 Nov 2015 10:22:23 +0800
> > CC: ceph-devel@vger.kernel.org
> > To: zhangz.da...@outlook.com
> >
> >> On Nov 21, 2015, at 10:12, Z Zhang wrote:
> >>
> >> Hi Guys,
> >>
> >> Now we have a very small cluster with 3 OSDs but using 40Gb NIC. We use
> >> ceph-fuse as the cephfs client and enable readahead, but testing a single
> >> read of a large file from cephfs via fio, dd or cp can only achieve
> >> ~70+MB/s, even if fio or dd's block size is set to 1MB or 4MB.
> >>
> >> From the ceph client log, we found each read request's size (ll_read) is
> >> limited to 128KB, which should be the limitation of kernel fuse's
> >> FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this value to see the
> >> performance difference, but are there other options we can try to
> >> increase single read performance?
> >
> > try setting client_readahead_max_bytes to 4M
> >
> > Yan, Zheng

Can you check the ceph-fuse log to see whether readahead actually happens?

Regards
Yan, Zheng
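
If it helps, a rough sketch of one way to check (debug client and log file are
standard Ceph client settings, but the exact readahead log text varies between
versions, so treat the grep pattern as a guess):

# in the client's ceph.conf, under [client]:
#     debug client = 20
#     log file = /var/log/ceph/ceph-fuse.log
# remount with ceph-fuse, rerun the read test, then look for readahead activity
grep -i readahead /var/log/ceph/ceph-fuse.log | tail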


Re: Ceph-fuse single read limitation?‏

2015-11-22 Thread Yan, Zheng

> On Nov 21, 2015, at 10:12, Z Zhang <zhangz.da...@outlook.com> wrote:
> 
> Hi Guys,
> 
> Now we have a very small cluster with 3 OSDs but using 40Gb NIC. We use 
> ceph-fuse as cephfs client and enable readahead, but testing single reading a 
> large file from cephfs via fio, dd or cp can only achieve ~70+MB/s, even if 
> fio or dd's block size is set to 1MB or 4MB.
> 
> From the ceph client log, we found each read request's size (ll_read) is 
> limited to 128KB, which should be the limitation of kernel fuse's 
> FUSE_MAX_PAGES_PER_REQ = 32. We may try to increase this value to see the 
> performance difference, but are there other options we can try to increase 
> single read performance?
> 

try setting client_readahead_max_bytes to 4M

Yan, Zheng
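
For reference, a minimal sketch of one way to set that (paths, mount point, and
monitor address are examples; if ceph.conf already has a [client] section, add
the option there instead of appending a new section):

cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    client_readahead_max_bytes = 4194304
EOF
# remount so ceph-fuse picks up the new setting
fusermount -u /mnt/cephfs
ceph-fuse -m mon1.example.com:6789 /mnt/cephfs   # hypothetical monitor address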



Re: [PATCH 2/2] fs/ceph: ceph_frag_contains_value can be boolean

2015-11-17 Thread Yan, Zheng

> On Nov 17, 2015, at 14:52, Yaowei Bai <baiyao...@cmss.chinamobile.com> wrote:
> 
> This patch makes ceph_frag_contains_value return bool to improve
> readability due to this particular function only using either one or
> zero as its return value.
> 
> No functional change.
> 
> Signed-off-by: Yaowei Bai <baiyao...@cmss.chinamobile.com>
> ---
> include/linux/ceph/ceph_frag.h | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/include/linux/ceph/ceph_frag.h b/include/linux/ceph/ceph_frag.h
> index 970ba5c..b827e06 100644
> --- a/include/linux/ceph/ceph_frag.h
> +++ b/include/linux/ceph/ceph_frag.h
> @@ -40,7 +40,7 @@ static inline __u32 ceph_frag_mask_shift(__u32 f)
>   return 24 - ceph_frag_bits(f);
> }
> 
> -static inline int ceph_frag_contains_value(__u32 f, __u32 v)
> +static inline bool ceph_frag_contains_value(__u32 f, __u32 v)
> {
>   return (v & ceph_frag_mask(f)) == ceph_frag_value(f);
> }
> -- 
> 1.9.1

both applied

Thanks
Yan, Zheng

> 
> 
> 



Re: [PATCH] ceph:Fix error handling in the function down_reply

2015-11-09 Thread Yan, Zheng

> On Nov 9, 2015, at 11:11, Nicholas Krause <xerofo...@gmail.com> wrote:
> 
> This fixes error handling in the function down_reply in order to
> properly check and jump to the goto label, out_err for this
> particular function if a error code is returned by any function
> called in down_reply and therefore make checking be included
> for the call to ceph_update_snap_trace in order to comply with
> these error handling checks/paths.
> 
> Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
> ---
> fs/ceph/mds_client.c | 11 +++
> 1 file changed, 7 insertions(+), 4 deletions(-)
> 
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 51cb02d..0b01f94 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -2495,14 +2495,17 @@ static void handle_reply(struct ceph_mds_session 
> *session, struct ceph_msg *msg)
>   realm = NULL;
>   if (rinfo->snapblob_len) {
>   down_write(&mdsc->snap_rwsem);
> - ceph_update_snap_trace(mdsc, rinfo->snapblob,
> - rinfo->snapblob + rinfo->snapblob_len,
> - le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
> - &realm);
> + err = ceph_update_snap_trace(mdsc, rinfo->snapblob,
> +  rinfo->snapblob + rinfo->snapblob_len,
> +  le32_to_cpu(head->op) == CEPH_MDS_OP_RMSNAP,
> +  &realm);
>   downgrade_write(&mdsc->snap_rwsem);
>   } else {
>   down_read(&mdsc->snap_rwsem);
>   }
> +
> + if (err)
> + goto out_err;
>
>   /* insert trace into our cache */
>   mutex_lock(&req->r_fill_mutex);

Applied, thanks

Yan, Zheng

> -- 
> 2.5.0
> 



Re: [PATCH] ceph:Fix error handling in the function ceph_readddir_prepopulate

2015-11-09 Thread Yan, Zheng

> On Nov 9, 2015, at 05:13, Nicholas Krause <xerofo...@gmail.com> wrote:
> 
> This fixes error handling in the function ceph_readddir_prepopulate
> to properly check if the call to the function ceph_fill_dirfrag has
> failed by returning a error code. Further more if this does arise
> jump to the goto label, out of the function ceph_readdir_prepopulate
> in order to clean up previously allocated resources by this function
> before returning to the caller this errror code in order for all callers
> to be now aware and able to handle this failure in their own intended
> error paths.
> 
> Signed-off-by: Nicholas Krause <xerofo...@gmail.com>
> ---
> fs/ceph/inode.c | 7 +--
> 1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
> index 96d2bd8..7738be6 100644
> --- a/fs/ceph/inode.c
> +++ b/fs/ceph/inode.c
> @@ -1417,8 +1417,11 @@ int ceph_readdir_prepopulate(struct ceph_mds_request 
> *req,
>   } else {
>   dout("readdir_prepopulate %d items under dn %p\n",
>rinfo->dir_nr, parent);
> - if (rinfo->dir_dir)
> - ceph_fill_dirfrag(d_inode(parent), rinfo->dir_dir);
> + if (rinfo->dir_dir) {
> + err = ceph_fill_dirfrag(d_inode(parent), 
> rinfo->dir_dir);
> + if (err)
> + goto out;
> +         }
>   }
> 

A ceph_fill_dirfrag() failure is not fatal. I think it's better not to skip the
later code when it happens.

Regards
Yan, Zheng 


>   if (ceph_frag_is_leftmost(frag) && req->r_readdir_offset == 2) {
> -- 
> 2.5.0
> 



Re: Only client.admin can mount cephfs by ceph-fuse

2015-11-02 Thread Yan, Zheng
On Mon, Nov 2, 2015 at 12:28 PM, Jaze Lee  wrote:
> Hello,
>I find only ceph.client.admin can mount cephfs.
>
> [root@ceph-base-0 ceph]# ceph auth get client.cephfs_user
> exported keyring for client.cephfs_user
> [client.cephfs_user]
> key = AQDZ3DZWR7nqBxAAzSoU/yRz1oJsOYdYrTAzcw==
> caps mds = "allow *"
> caps mon = "allow *"
> caps osd = "allow *"
>
> [root@cephfs-client ~]# cat ceph.client.cephfs_user
> [client.cephfs_user]
> key = AQDZ3DZWR7nqBxAAzSoU/yRz1oJsOYdYrTAzcw==
>
> But when i mount fuse, it throw an exception.
> [root@cephfs-client ~]# ceph-fuse -k ceph.client.cephfs_user -m
> 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs
> 2015-11-02 12:23:17.938470 7f9d63f9b760 -1 did not load config file,
> using default settings.
> ceph-fuse[15325]: starting ceph client
> 2015-11-02 12:23:17.973134 7f9d63f9b760 -1 init, newargv = 0x2c33f30 
> newargc=11
> ceph-fuse[15325]: ceph mount failed with (1) Operation not permitted
> ceph-fuse[15323]: mount failed: (1) Operation not permitted

You didn't specify a user, so ceph-fuse uses the default user, admin. Try the
following command:

ceph-fuse --id cephfs_user -k ceph.client.cephfs_user -m
10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs

>
> But when i use client.admin
> [root@ceph-base-0 ceph]# ceph auth get client.admin
> exported keyring for client.admin
> [client.admin]
> key = AQCMrhxWhRFBBhAA+f1yZXyjzF2eBrUt1UMdFA==
> auid = 0
> caps mds = "allow *"
> caps mon = "allow *"
> caps osd = "allow *"
>
> [root@cephfs-client ~]# ceph-fuse -k ceph.client.admin.keyring -m
> 10.12.12.10:6789 -m 10.12.12.6:6789 -m 10.12.12.5:6789 ./test_ceph_fs
> 2015-11-02 12:25:03.038526 7f66e0957760 -1 did not load config file,
> using default settings.
> ceph-fuse[15360]: starting ceph client
> 2015-11-02 12:25:03.073797 7f66e0957760 -1 init, newargv = 0x3035f30 
> newargc=11
> ceph-fuse[15360]: starting fuse
>
> It is ok.
>
> So i want to know.
> Does ceph's fuse only allow admin to mount the fs?
>
>
>
>
>
> --
> 谦谦君子


Re: [PATCH 0/4] libceph: nocephx_sign_messages option + misc

2015-11-02 Thread Yan, Zheng

> On Nov 3, 2015, at 06:44, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> Hello,
> 
> This adds nocephx_sign_messages libceph option (a lack of which is
> something people are running into, see [1]), plus a couple of related
> cleanups.
> 
> [1] https://forum.proxmox.com/threads/24116-new-krbd-option-on-pve4-don-t-work
> 
> Thanks,
> 
>Ilya
> 
> 
> Ilya Dryomov (4):
>  libceph: msg signing callouts don't need con argument
>  libceph: drop authorizer check from cephx msg signing routines
>  libceph: stop duplicating client fields in messenger
>  libceph: add nocephx_sign_messages option
> 
> fs/ceph/mds_client.c   | 14 --
> include/linux/ceph/libceph.h   |  4 +++-
> include/linux/ceph/messenger.h | 16 +++-
> net/ceph/auth_x.c  |  8 ++--
> net/ceph/ceph_common.c | 18 +-
> net/ceph/messenger.c   | 32 
> net/ceph/osd_client.c  | 14 ++++--
> 7 files changed, 53 insertions(+), 53 deletions(-)
> 

Reviewed-by: Yan, Zheng <z...@redhat.com>




Re: [PATCH] libceph: introduce ceph_x_authorizer_cleanup()

2015-10-26 Thread Yan, Zheng

> On Oct 26, 2015, at 20:03, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> Commit ae385eaf24dc ("libceph: store session key in cephx authorizer")
> introduced ceph_x_authorizer::session_key, but didn't update all the
> exit/error paths.  Introduce ceph_x_authorizer_cleanup() to encapsulate
> ceph_x_authorizer cleanup and switch to it.  This fixes ceph_x_destroy(),
> which currently always leaks key and ceph_x_build_authorizer() error
> paths.
> 
> Cc: Yan, Zheng <z...@redhat.com>
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
> net/ceph/auth_x.c | 28 +---
> net/ceph/crypto.h |  4 +++-
> 2 files changed, 20 insertions(+), 12 deletions(-)
> 
> diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
> index ba6eb17226da..65054fd31b97 100644
> --- a/net/ceph/auth_x.c
> +++ b/net/ceph/auth_x.c
> @@ -279,6 +279,15 @@ bad:
>   return -EINVAL;
> }
> 
> +static void ceph_x_authorizer_cleanup(struct ceph_x_authorizer *au)
> +{
> + ceph_crypto_key_destroy(&au->session_key);
> + if (au->buf) {
> + ceph_buffer_put(au->buf);
> + au->buf = NULL;
> + }
> +}
> +
> static int ceph_x_build_authorizer(struct ceph_auth_client *ac,
>  struct ceph_x_ticket_handler *th,
>  struct ceph_x_authorizer *au)
> @@ -297,7 +306,7 @@ static int ceph_x_build_authorizer(struct 
> ceph_auth_client *ac,
>   ceph_crypto_key_destroy(&au->session_key);
>   ret = ceph_crypto_key_clone(&au->session_key, &th->session_key);
>   if (ret)
> - return ret;
> + goto out_au;
> 
>   maxlen = sizeof(*msg_a) + sizeof(msg_b) +
>   ceph_x_encrypt_buflen(ticket_blob_len);
> @@ -309,8 +318,8 @@ static int ceph_x_build_authorizer(struct 
> ceph_auth_client *ac,
>   if (!au->buf) {
>   au->buf = ceph_buffer_new(maxlen, GFP_NOFS);
>   if (!au->buf) {
> - ceph_crypto_key_destroy(&au->session_key);
> - return -ENOMEM;
> + ret = -ENOMEM;
> + goto out_au;
>   }
>   }
>   au->service = th->service;
> @@ -340,7 +349,7 @@ static int ceph_x_build_authorizer(struct 
> ceph_auth_client *ac,
>   ret = ceph_x_encrypt(&au->session_key, &msg_b, sizeof(msg_b),
>p, end - p);
>   if (ret < 0)
> - goto out_buf;
> + goto out_au;
>   p += ret;
>   au->buf->vec.iov_len = p - au->buf->vec.iov_base;
>   dout(" built authorizer nonce %llx len %d\n", au->nonce,
> @@ -348,9 +357,8 @@ static int ceph_x_build_authorizer(struct 
> ceph_auth_client *ac,
>   BUG_ON(au->buf->vec.iov_len > maxlen);
>   return 0;
> 
> -out_buf:
> - ceph_buffer_put(au->buf);
> - au->buf = NULL;
> +out_au:
> + ceph_x_authorizer_cleanup(au);
>   return ret;
> }
> 
> @@ -624,8 +632,7 @@ static void ceph_x_destroy_authorizer(struct 
> ceph_auth_client *ac,
> {
>   struct ceph_x_authorizer *au = (void *)a;
> 
> - ceph_crypto_key_destroy(&au->session_key);
> - ceph_buffer_put(au->buf);
> + ceph_x_authorizer_cleanup(au);
>   kfree(au);
> }
> 
> @@ -653,8 +660,7 @@ static void ceph_x_destroy(struct ceph_auth_client *ac)
>   remove_ticket_handler(ac, th);
>   }
> 
> - if (xi->auth_authorizer.buf)
> - ceph_buffer_put(xi->auth_authorizer.buf);
> + ceph_x_authorizer_cleanup(&xi->auth_authorizer);
> 
>   kfree(ac->private);
>   ac->private = NULL;
> diff --git a/net/ceph/crypto.h b/net/ceph/crypto.h
> index d1498224c49d..2e9cab09f37b 100644
> --- a/net/ceph/crypto.h
> +++ b/net/ceph/crypto.h
> @@ -16,8 +16,10 @@ struct ceph_crypto_key {
> 
> static inline void ceph_crypto_key_destroy(struct ceph_crypto_key *key)
> {
> - if (key)
> + if (key) {
>   kfree(key->key);
> + key->key = NULL;
> + }
> }
> 
> int ceph_crypto_key_clone(struct ceph_crypto_key *dst,
> — 
> 2.4.3
> 

Reviewed-by: Yan, Zheng <z...@redhat.com>




Re: why package ceph-fuse needs packages ceph?

2015-10-26 Thread Yan, Zheng
On Mon, Oct 26, 2015 at 6:31 PM, Jevon Qiao  wrote:
> Hi Sage,
>
> Here comes another question, does ceph-fuse support Unix OS(like HP-UNIX or
> AIX)?

It supports OS X and FreeBSD. (I tested it a few weeks ago; the current
development branch may require some minor modifications.)

>
> Thanks,
> Jevon
>
> On 26/10/15 15:19, Sage Weil wrote:
>>
>> On Mon, 26 Oct 2015, Jaze Lee wrote:
>>>
>>> Hello,
>>>  I think ceph-fuse is just a client, so why does it need the ceph package?
>>>  I found that when I install ceph-fuse, it installs the ceph package.
>>>  But when I install ceph-common, it does not install the ceph package.
>>>
>>>  Maybe ceph-fuse is not just a ceph client?
>>>
>> It is, and the Debian packaging works as expected.  This is a simple
>> error in the spec file.  I'll submit a patch.
>>
>> sage


Re: a patch to improve cephfs direct io performance

2015-10-08 Thread Yan, Zheng
  ret = striped_read(inode, off, n,
> pages, num_pages, checkeof,
> 1, start);
>
> -ceph_put_page_vector(pages, num_pages, true);
> +dio_free_pagev(pages, num_pages, true);
>
>  if (ret <= 0)
>  break;
> @@ -596,7 +705,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct
> iov_iter *from, loff_t pos,
>  CEPH_OSD_FLAG_WRITE;
>
>  while (iov_iter_count(from) > 0) {
> -u64 len = iov_iter_single_seg_count(from);
> +u64 len = dio_get_pagevlen(from);
>  size_t start;
>  ssize_t n;
>
> @@ -615,14 +724,15 @@ ceph_sync_direct_write(struct kiocb *iocb, struct
> iov_iter *from, loff_t pos,
>
>  osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC, 0);
>
> -n = iov_iter_get_pages_alloc(from, , len, );
> -if (unlikely(n < 0)) {
> -ret = n;
> +n = len;
> +pages = dio_alloc_pagev(from, len, false, ,
> +_pages);
> +if (IS_ERR(pages)) {
>  ceph_osdc_put_request(req);
> +ret = PTR_ERR(pages);
>  break;
>  }
>
> -num_pages = (n + start + PAGE_SIZE - 1) / PAGE_SIZE;
>  /*
>   * throw out any page cache pages in this range. this
>   * may block.
> @@ -639,8 +749,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct
> iov_iter *from, loff_t pos,
>  if (!ret)
>  ret = ceph_osdc_wait_request(>client->osdc, req);
>
> -ceph_put_page_vector(pages, num_pages, false);
> -
> +dio_free_pagev(pages, num_pages, false);
>  ceph_osdc_put_request(req);
>  if (ret)
>  break;
>
>

Applied (with a few modifications), thanks.

Yan, Zheng


Re: [PATCH] ceph: fix message length computation

2015-10-01 Thread Yan, Zheng
On Wed, Sep 30, 2015 at 9:04 PM, Arnd Bergmann <a...@arndb.de> wrote:
> create_request_message() computes the maximum length of a message,
> but uses the wrong type for the time stamp: sizeof(struct timespec)
> may be 8 or 16 depending on the architecture, while sizeof(struct
> ceph_timespec) is always 8, and that is what gets put into the
> message.
>
> Found while auditing the uses of timespec for y2038 problems.
>
> Signed-off-by: Arnd Bergmann <a...@arndb.de>
> Fixes: b8e69066d8af ("ceph: include time stamp in every MDS request")
> ---
> diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
> index 51cb02da75d9..fe2c982764e7 100644
> --- a/fs/ceph/mds_client.c
> +++ b/fs/ceph/mds_client.c
> @@ -1935,7 +1935,7 @@ static struct ceph_msg *create_request_message(struct 
> ceph_mds_client *mdsc,
>
> len = sizeof(*head) +
> pathlen1 + pathlen2 + 2*(1 + sizeof(u32) + sizeof(u64)) +
> -   sizeof(struct timespec);
> +   sizeof(struct ceph_timespec);
>
> /* calculate (max) length for cap releases */
> len += sizeof(struct ceph_mds_request_release) *
>

Applied. thanks

Yan, Zheng



Re: a patch to improve cephfs direct io performance

2015-09-30 Thread Yan, Zheng
On Wed, Sep 30, 2015 at 5:40 PM, zhucaifeng <zhucaif...@unissoft-nj.com> wrote:
> Hi, Yan
>
> The iov_iter APIs seem unsuitable for the direct I/O manipulation below. The
> iov_iter APIs hide how to iterate over the elements, whereas the dio_xxx
> helpers below explicitly control the iteration. The two conflict in principle.

Why? I think it's easy to replace dio_alloc_pagev() with
iov_iter_get_pages(). We just need to use dio_get_pagevlen() to
calculate how many pages to get each time.

Regards
Yan, Zheng

>
> The patch for the newest kernel branch is below.
>
> Best Regards
>
> diff --git a/fs/ceph/file.c b/fs/ceph/file.c
> index 8b79d87..3938ac9 100644
> --- a/fs/ceph/file.c
> +++ b/fs/ceph/file.c
>
> @@ -34,6 +34,115 @@
>   * need to wait for MDS acknowledgement.
>   */
>
> +/*
> + * Calculate the length sum of direct io vectors that can
> + * be combined into one page vector.
> + */
> +static int
> +dio_get_pagevlen(const struct iov_iter *it)
> +{
> +const struct iovec *iov = it->iov;
> +const struct iovec *iovend = iov + it->nr_segs;
> +int pagevlen;
> +
> +pagevlen = iov->iov_len - it->iov_offset;
> +/*
> + * An iov can be page vectored when both the current tail
> + * and the next base are page aligned.
> + */
> +while (PAGE_ALIGNED((iov->iov_base + iov->iov_len)) &&
> +   (++iov < iovend && PAGE_ALIGNED((iov->iov_base {
> +pagevlen += iov->iov_len;
> +}
> +dout("dio_get_pagevlen len = %d\n", pagevlen);
> +return pagevlen;
> +}
> +
> +/*
> + * Grab @num_pages from the process vm space. These pages are
> + * continuous and start from @data.
> + */
> +static int
> +dio_grab_pages(const void *data, int num_pages, bool write_page,
> +struct page **pages)
> +{
> +int got = 0;
> +int rc = 0;
> +
> +down_read(>mm->mmap_sem);
> +while (got < num_pages) {
> +rc = get_user_pages(current, current->mm,
> +(unsigned long)data + ((unsigned long)got * PAGE_SIZE),
> +num_pages - got, write_page, 0, pages + got, NULL);
> +if (rc < 0)
> +break;
> +BUG_ON(rc == 0);
> +got += rc;
> +}
> +up_read(>mm->mmap_sem);
> +return got;
> +}
> +
> +static void
> +dio_free_pagev(struct page **pages, int num_pages, bool dirty)
> +{
> +int i;
> +
> +for (i = 0; i < num_pages; i++) {
> +if (dirty)
> +set_page_dirty_lock(pages[i]);
> +put_page(pages[i]);
> +}
> +kfree(pages);
> +}
> +
> +/*
> + * Allocate a page vector based on (@it, @pagevlen).
> + * The return value is the tuple describing a page vector,
> + * that is (@pages, @pagevlen, @page_align, @num_pages).
> + */
> +static struct page **
> +dio_alloc_pagev(const struct iov_iter *it, int pagevlen, bool write_page,
> +size_t *page_align, int *num_pages)
> +{
> +const struct iovec *iov = it->iov;
> +struct page **pages;
> +int n, m, k, npages;
> +int align;
> +int len;
> +void *data;
> +
> +data = iov->iov_base + it->iov_offset;
> +len = iov->iov_len - it->iov_offset;
> +align = ((ulong)data) & ~PAGE_MASK;
> +npages = calc_pages_for((ulong)data, pagevlen);
> +pages = kmalloc(sizeof(*pages) * npages, GFP_NOFS);
> +if (!pages)
> +return ERR_PTR(-ENOMEM);
> +for (n = 0; n < npages; n += m) {
> +m = calc_pages_for((ulong)data, len);
> +if (n + m > npages)
> +m = npages - n;
> +k = dio_grab_pages(data, m, write_page, pages + n);
> +if (k < m) {
> +n += k;
> +goto failed;
> +}
> +
> +iov++;
> +data = iov->iov_base;
> +len = iov->iov_len;
> +}
> +*num_pages = npages;
> +*page_align = align;
> +dout("dio_alloc_pagev: alloc pages pages[0:%d], page align %d\n",
> +npages, align);
> +return pages;
> +
> +failed:
> +dio_free_pagev(pages, n, false);
> +return ERR_PTR(-ENOMEM);
> +}
>
>  /*
>   * Prepare an open request.  Preallocate ceph_cap to avoid an
> @@ -462,17 +571,17 @@ static ssize_t ceph_sync_read(struct kiocb *iocb,
> struct iov_iter *i,
>  size_t start;
>  ssize_t n;
>
> -n = iov_iter_get_pages_alloc(i, , INT_MAX, );
> -if (n < 0)
> -return n;
> -
> -num_pages = (n + start + PAGE_SIZE - 1) / PAGE_SIZE;
> +n = dio_get_pagevlen(i);
> +  

[BUG] commit "namei: d_is_negative() should be checked before ->d_seq validation" breaks ceph-fuse

2015-09-30 Thread Yan, Zheng
Hi, Al

I found that commit 766c4cbfac ("namei: d_is_negative() should be checked before
->d_seq validation") breaks ceph-fuse. After that commit, lookup_fast() can
return -ENOENT before calling d_revalidate(). This breaks remote filesystems
that allow creating/deleting files from multiple machines.

Regards
Yan, Zheng


Re: [PATCH] ceph: fix a comment typo

2015-09-29 Thread Yan, Zheng

> On Sep 30, 2015, at 11:36, Geliang Tang <geliangt...@163.com> wrote:
> 
> Just fix a typo in the code comment.
> 
> Signed-off-by: Geliang Tang <geliangt...@163.com>
> ---
> fs/ceph/cache.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/fs/ceph/cache.c b/fs/ceph/cache.c
> index 834f9f3..a4766de 100644
> --- a/fs/ceph/cache.c
> +++ b/fs/ceph/cache.c
> @@ -88,7 +88,7 @@ static uint16_t ceph_fscache_inode_get_key(const void 
> *cookie_netfs_data,
>   const struct ceph_inode_info* ci = cookie_netfs_data;
>   uint16_t klen;
> 
> - /* use ceph virtual inode (id + snaphot) */
> + /* use ceph virtual inode (id + snapshot) */
>   klen = sizeof(ci->i_vino);
>   if (klen > maxbuf)
>   return 0;

Applied, thank you.

Yan, Zheng


> -- 
> 1.9.1
> 
> 



Re: a patch to improve cephfs direct io performance

2015-09-27 Thread Yan, Zheng
gt;iov_len - it->iov_offset;
> +align = ((ulong)data) & ~PAGE_MASK;
> +npages = calc_pages_for((ulong)data, pagevlen);
> +pages = kmalloc(sizeof(*pages) * npages, GFP_NOFS);
> +if (!pages)
> +return ERR_PTR(-ENOMEM);
> +    for (n = 0; n < npages; n += m) {
> +m = calc_pages_for((ulong)data, len);
> +if (n + m > npages)
> +m = npages - n;
> +k = dio_grab_pages(data, m, write_page, pages + n);
> +if (k < m) {
> +n += k;
> +goto failed;
> +}

if (align + len) is not page aligned, the loop should stop.

> +
> +iov++;
> +data = iov->iov_base;

if (unsigned long)data is not page aligned, the loop should stop


Regards
Yan, Zheng

> +len = iov->iov_len;
> +}
> +*num_pages = npages;
> +*page_align = align;
> +dout("dio_alloc_pagev: alloc pages pages[0:%d], page align %d\n",
> +npages, align);
> +return pages;
> +
> +failed:
> +dio_free_pagev(pages, n, false);
> +return ERR_PTR(-ENOMEM);
> +}
>
>  /*
>   * Prepare an open request.  Preallocate ceph_cap to avoid an
> @@ -354,13 +463,14 @@ more:
>  if (ret >= 0) {
>  int didpages;
>  if (was_short && (pos + ret < inode->i_size)) {
> -u64 tmp = min(this_len - ret,
> +u64 zlen = min(this_len - ret,
>  inode->i_size - pos - ret);
> +int zoff = (o_direct ? buf_align : io_align) +
> +   read + ret;
>  dout(" zero gap %llu to %llu\n",
> -pos + ret, pos + ret + tmp);
> -ceph_zero_page_vector_range(page_align + read + ret,
> -tmp, pages);
> -ret += tmp;
> +pos + ret, pos + ret + zlen);
> +ceph_zero_page_vector_range(zoff, zlen, pages);
> +ret += zlen;
>  }
>
>  didpages = (page_align + ret) >> PAGE_CACHE_SHIFT;
> @@ -421,19 +531,19 @@ static ssize_t ceph_sync_read(struct kio
>
>  if (file->f_flags & O_DIRECT) {
>  while (iov_iter_count(i)) {
> -void __user *data = i->iov[0].iov_base + i->iov_offset;
> -size_t len = i->iov[0].iov_len - i->iov_offset;
> -
> -num_pages = calc_pages_for((unsigned long)data, len);
> -pages = ceph_get_direct_page_vector(data,
> -num_pages, true);
> +int page_align;
> +size_t len;
> +
> +len = dio_get_pagevlen(i);
> +pages = dio_alloc_pagev(i, len, true, _align,
> +_pages);
>  if (IS_ERR(pages))
>  return PTR_ERR(pages);
>
>  ret = striped_read(inode, off, len,
> pages, num_pages, checkeof,
> -   1, (unsigned long)data & ~PAGE_MASK);
> -ceph_put_page_vector(pages, num_pages, true);
> +   1, page_align);
> +dio_free_pagev(pages, num_pages, true);
>
>  if (ret <= 0)
>  break;
> @@ -570,10 +680,7 @@ ceph_sync_direct_write(struct kiocb *ioc
>  iov_iter_init(, iov, nr_segs, count, 0);
>
>  while (iov_iter_count() > 0) {
> -void __user *data = i.iov->iov_base + i.iov_offset;
> -u64 len = i.iov->iov_len - i.iov_offset;
> -
> -page_align = (unsigned long)data & ~PAGE_MASK;
> +u64 len = dio_get_pagevlen();
>
>  snapc = ci->i_snap_realm->cached_context;
>  vino = ceph_vino(inode);
> @@ -589,8 +696,7 @@ ceph_sync_direct_write(struct kiocb *ioc
>  break;
>  }
>
> -num_pages = calc_pages_for(page_align, len);
> -pages = ceph_get_direct_page_vector(data, num_pages, false);
> +pages = dio_alloc_pagev(, len, false, _align, _pages);
>  if (IS_ERR(pages)) {
>  ret = PTR_ERR(pages);
>  goto out;
> @@ -612,7 +718,7 @@ ceph_sync_direct_write(struct kiocb *ioc
>  if (!ret)
>  ret = ceph_osdc_wait_request(>client->osdc, req);
>
> -ceph_put_page_vector(pages, num_pages, false);
> +dio_free_pagev(pages, num_pages, false);
>
>  out:
>  ceph_osdc_put_request(req);
>
>
>
>
>
>


Re: [PATCH] libceph: don't access invalid memory in keepalive2 path

2015-09-16 Thread Yan, Zheng
On Mon, Sep 14, 2015 at 9:50 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> This
>
> struct ceph_timespec ceph_ts;
> ...
> con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);
>
> wraps ceph_ts into a kvec and adds it to con->out_kvec array, yet
> ceph_ts becomes invalid on return from prepare_write_keepalive().  As
> a result, we send out bogus keepalive2 stamps.  Fix this by encoding
> into a ceph_timespec member, similar to how acks are read and written.
>
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  include/linux/ceph/messenger.h | 4 +++-
>  net/ceph/messenger.c   | 9 +
>  2 files changed, 8 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
> index 7e1252e97a30..b2371d9b51fa 100644
> --- a/include/linux/ceph/messenger.h
> +++ b/include/linux/ceph/messenger.h
> @@ -238,6 +238,8 @@ struct ceph_connection {
> bool out_kvec_is_msg; /* kvec refers to out_msg */
> int out_more;/* there is more data after the kvecs */
> __le64 out_temp_ack; /* for writing an ack */
> +   struct ceph_timespec out_temp_keepalive2; /* for writing keepalive2
> +stamp */
>
> /* message in temps */
> struct ceph_msg_header in_hdr;
> @@ -248,7 +250,7 @@ struct ceph_connection {
> int in_base_pos; /* bytes read */
> __le64 in_temp_ack;  /* for reading an ack */
>
> -   struct timespec last_keepalive_ack;
> +   struct timespec last_keepalive_ack; /* keepalive2 ack stamp */
>
> struct delayed_work work;   /* send|recv work */
> unsigned long   delay;  /* current delay interval */
> diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
> index 525f454f7531..b9b0e3b5da49 100644
> --- a/net/ceph/messenger.c
> +++ b/net/ceph/messenger.c
> @@ -1353,11 +1353,12 @@ static void prepare_write_keepalive(struct 
> ceph_connection *con)
> dout("prepare_write_keepalive %p\n", con);
> con_out_kvec_reset(con);
> if (con->peer_features & CEPH_FEATURE_MSGR_KEEPALIVE2) {
> -   struct timespec ts = CURRENT_TIME;
> -   struct ceph_timespec ceph_ts;
> -   ceph_encode_timespec(&ceph_ts, &ts);
> +   struct timespec now = CURRENT_TIME;
> +
> con_out_kvec_add(con, sizeof(tag_keepalive2), 
> &tag_keepalive2);
> -   con_out_kvec_add(con, sizeof(ceph_ts), &ceph_ts);
> +   ceph_encode_timespec(&con->out_temp_keepalive2, &now);
> +   con_out_kvec_add(con, sizeof(con->out_temp_keepalive2),
> +    &con->out_temp_keepalive2);
> } else {
> con_out_kvec_add(con, sizeof(tag_keepalive), &tag_keepalive);
> }
> --
Sorry for introducing this bug
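For the record, a minimal illustration of the failure mode (illustrative only,
not the messenger code): the kvec stores a pointer, not a copy, so an on-stack
source is already gone when the workqueue later writes the kvec to the socket.

static void broken_prepare_sketch(struct kvec *vec)
{
	struct timespec ts = CURRENT_TIME;
	struct ceph_timespec stamp;	/* lives on the stack */

	ceph_encode_timespec(&stamp, &ts);
	vec->iov_base = &stamp;		/* pointer escapes ...     */
	vec->iov_len = sizeof(stamp);
}					/* ... but stamp dies here */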


Reviewed-by: Yan, Zheng <z...@redhat.com>

> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: advertise support for keepalive2

2015-09-16 Thread Yan, Zheng
On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> We are the client, but advertise keepalive2 anyway - for consistency,
> if nothing else.  In the future the server might want to know whether
> its clients support keepalive2.

the kernel code still does not fully support KEEPALIVE2 (it does not
recognize the CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better not to
advertise support for keepalive2.

Regards
Yan, Zheng

>
> Signed-off-by: Ilya Dryomov <idryo...@gmail.com>
> ---
>  include/linux/ceph/ceph_features.h | 1 +
>  1 file changed, 1 insertion(+)
>
> diff --git a/include/linux/ceph/ceph_features.h 
> b/include/linux/ceph/ceph_features.h
> index 4763ad64e832..f89b31d45cc8 100644
> --- a/include/linux/ceph/ceph_features.h
> +++ b/include/linux/ceph/ceph_features.h
> @@ -107,6 +107,7 @@ static inline u64 ceph_sanitize_features(u64 features)
>  CEPH_FEATURE_OSDMAP_ENC |  \
>  CEPH_FEATURE_CRUSH_TUNABLES3 | \
>  CEPH_FEATURE_OSD_PRIMARY_AFFINITY |\
> +CEPH_FEATURE_MSGR_KEEPALIVE2 | \
>  CEPH_FEATURE_CRUSH_V4)
>
>  #define CEPH_FEATURES_REQUIRED_DEFAULT   \
> --
> 1.9.3
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: advertise support for keepalive2

2015-09-16 Thread Yan, Zheng
On Wed, Sep 16, 2015 at 3:39 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
> On Wed, Sep 16, 2015 at 9:28 AM, Yan, Zheng <uker...@gmail.com> wrote:
>> On Mon, Sep 14, 2015 at 9:51 PM, Ilya Dryomov <idryo...@gmail.com> wrote:
>>> We are the client, but advertise keepalive2 anyway - for consistency,
>>> if nothing else.  In the future the server might want to know whether
>>> its clients support keepalive2.
>>
>> the kernel code still does not fully support KEEPALIVE2 (it does not
>> recognize the CEPH_MSGR_TAG_KEEPALIVE2 tag). I think it's better not to
>> advertise support for keepalive2.
>
> I guess by "does not recognize" you mean the kernel client knows how to
> write TAG_KEEPALIVE2, but not how to read it?  The same is true about
> TAG_KEEPALIVE tag and the reverse (we can read, but can't write) is
> true about TAG_KEEPALIVE2_ACK.
>
> What I'm getting at is the kernel client is the client, and so it
> doesn't need to know how to read keepalive bytes or write keepalive
> acks.  The server however might want to know if its clients can send
> keepalive2 bytes and handle keepalive2 acks.  Does this make sense?
>

OK, it makes sense. Thank you for the explanation.

Reviewed-by: Yan, Zheng <z...@redhat.com>

> Thanks,
>
> Ilya
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-02 Thread Yan, Zheng

> On Sep 2, 2015, at 17:12, Ilya Dryomov <idryo...@gmail.com> wrote:
> 
> On Wed, Sep 2, 2015 at 5:22 AM, Yan, Zheng <z...@redhat.com> wrote:
>> timespec_to_jiffies() does not work this way. It converts a time delta in the
>> form of a timespec to a time delta in the form of jiffies.
> 
> Ah sorry, con->last_keepalive_ack is a realtime timespec from
> userspace.
> 
>> 
>> I will update the patch according to the rest of the comments.
> 
> I still want ceph_con_keepalive_expired() to handle interval == 0
> internally, so that opt->monc_ping_timeout can be passed without any
> checks in the caller.
> 
> I also noticed you added delay = max_t(unsigned long, delay, HZ); to
> __schedule_delayed().  Is it really necessary?
> 

Updated. Please review.
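For reference, a rough sketch of how ceph_con_keepalive_expired() can treat
interval == 0 as "never expires", so opt->monc_ping_timeout can be passed
unconditionally by the caller (sketch only; the updated patch may differ in
detail):

static bool keepalive_expired_sketch(struct ceph_connection *con,
				     unsigned long interval)
{
	if (interval) {
		struct timespec now = CURRENT_TIME;
		struct timespec ts;

		jiffies_to_timespec(interval, &ts);
		ts = timespec_add(con->last_keepalive_ack, ts);
		return timespec_compare(&now, &ts) >= 0;
	}
	return false;	/* interval == 0: never expires */
}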

Regards
Yan, Zheng
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] libceph: use keepalive2 to verify the mon session is alive

2015-09-01 Thread Yan, Zheng
Signed-off-by: Yan, Zheng <z...@redhat.com>
---
 include/linux/ceph/libceph.h   |  2 ++
 include/linux/ceph/messenger.h |  4 +++
 include/linux/ceph/msgr.h  |  4 ++-
 net/ceph/ceph_common.c | 18 -
 net/ceph/messenger.c   | 60 ++
 net/ceph/mon_client.c  | 38 --
 6 files changed, 111 insertions(+), 15 deletions(-)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index 9ebee53..9a0b471 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -46,6 +46,7 @@ struct ceph_options {
unsigned long mount_timeout;/* jiffies */
unsigned long osd_idle_ttl; /* jiffies */
unsigned long osd_keepalive_timeout;/* jiffies */
+   unsigned long mon_keepalive_timeout;/* jiffies */
 
/*
 * any type that can't be simply compared or doesn't need need
@@ -66,6 +67,7 @@ struct ceph_options {
 #define CEPH_MOUNT_TIMEOUT_DEFAULT msecs_to_jiffies(60 * 1000)
 #define CEPH_OSD_KEEPALIVE_DEFAULT msecs_to_jiffies(5 * 1000)
 #define CEPH_OSD_IDLE_TTL_DEFAULT  msecs_to_jiffies(60 * 1000)
+#define CEPH_MON_KEEPALIVE_DEFAULT msecs_to_jiffies(30 * 1000)
 
 #define CEPH_MSG_MAX_FRONT_LEN (16*1024*1024)
 #define CEPH_MSG_MAX_MIDDLE_LEN(16*1024*1024)
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 3775327..83063b6 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -248,6 +248,8 @@ struct ceph_connection {
int in_base_pos; /* bytes read */
__le64 in_temp_ack;  /* for reading an ack */
 
+   struct timespec last_keepalive_ack;
+
struct delayed_work work;   /* send|recv work */
unsigned long   delay;  /* current delay interval */
 };
@@ -285,6 +287,8 @@ extern void ceph_msg_revoke(struct ceph_msg *msg);
 extern void ceph_msg_revoke_incoming(struct ceph_msg *msg);
 
 extern void ceph_con_keepalive(struct ceph_connection *con);
+extern int ceph_con_keepalive_expired(struct ceph_connection *con,
+ unsigned long interval);
 
 extern void ceph_msg_data_add_pages(struct ceph_msg *msg, struct page **pages,
size_t length, size_t alignment);
diff --git a/include/linux/ceph/msgr.h b/include/linux/ceph/msgr.h
index 1c18872..0fe2656 100644
--- a/include/linux/ceph/msgr.h
+++ b/include/linux/ceph/msgr.h
@@ -84,10 +84,12 @@ struct ceph_entity_inst {
 #define CEPH_MSGR_TAG_MSG   7  /* message */
 #define CEPH_MSGR_TAG_ACK   8  /* message ack */
 #define CEPH_MSGR_TAG_KEEPALIVE 9  /* just a keepalive byte! */
-#define CEPH_MSGR_TAG_BADPROTOVER  10  /* bad protocol version */
+#define CEPH_MSGR_TAG_BADPROTOVER   10 /* bad protocol version */
 #define CEPH_MSGR_TAG_BADAUTHORIZER 11 /* bad authorizer */
 #define CEPH_MSGR_TAG_FEATURES  12 /* insufficient features */
 #define CEPH_MSGR_TAG_SEQ   13 /* 64-bit int follows with seen seq 
number */
+#define CEPH_MSGR_TAG_KEEPALIVE2  14 /* keepalive2 byte + ceph_timespec */
+#define CEPH_MSGR_TAG_KEEPALIVE2_ACK 15 /* keepalive2 reply */
 
 
 /*
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index f30329f..5143f6e 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -226,6 +226,7 @@ static int parse_fsid(const char *str, struct ceph_fsid 
*fsid)
  * ceph options
  */
 enum {
+   Opt_monkeepalivetimeout,
Opt_osdtimeout,
Opt_osdkeepalivetimeout,
Opt_mount_timeout,
@@ -250,6 +251,7 @@ enum {
 };
 
 static match_table_t opt_tokens = {
+   {Opt_monkeepalivetimeout, "monkeepalive=%d"},
{Opt_osdtimeout, "osdtimeout=%d"},
{Opt_osdkeepalivetimeout, "osdkeepalive=%d"},
{Opt_mount_timeout, "mount_timeout=%d"},
@@ -354,9 +356,10 @@ ceph_parse_options(char *options, const char *dev_name,
 
/* start with defaults */
opt->flags = CEPH_OPT_DEFAULT;
-   opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT;
opt->mount_timeout = CEPH_MOUNT_TIMEOUT_DEFAULT;
opt->osd_idle_ttl = CEPH_OSD_IDLE_TTL_DEFAULT;
+   opt->osd_keepalive_timeout = CEPH_OSD_KEEPALIVE_DEFAULT;
+   opt->mon_keepalive_timeout = CEPH_MON_KEEPALIVE_DEFAULT;
 
/* get mon ip(s) */
/* ip1[:port1][,ip2[:port2]...] */
@@ -460,6 +463,16 @@ ceph_parse_options(char *options, const char *dev_name,
}
opt->osd_idle_ttl = msecs_to_jiffies(intval * 1000);
break;
+   case Opt_monkeepalivetimeout:
+   /* 0 isn't well defined right now, reject it */
+   if (intval < 1 || intval > INT_MAX / 1000) {
+   pr_err("monkeepalive out of range\n");
+   

Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.

2015-08-17 Thread Yan, Zheng
It seems all of your patches are malformed. Please use 'git send-email' to send 
patches.

Regards
Yan, Zheng


 On Aug 17, 2015, at 16:43, Ma, Jianpeng jianpeng...@intel.com wrote:
 
 For ceph_open, it already gets the parent inode, so there is no need to get it again.
 
 Signed-off-by: Jianpeng Ma jianpeng...@intel.com
 ---
 fs/ceph/file.c | 3 ---
 1 file changed, 3 deletions(-)
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index 8b79d87..5ecd3dc 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
ihold(inode);
 
	req->r_num_caps = 1;
 -   if (flags & O_CREAT)
 -   parent_inode = 
 ceph_get_dentry_parent_inode(file->f_path.dentry);
	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
 -   iput(parent_inode);
	if (!err)
	err = ceph_init_file(inode, file, req->r_fmode);
ceph_mdsc_put_request(req);
 --
 2.4.3
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] fs/ceph: No need get inode of parent in ceph_open.

2015-08-17 Thread Yan, Zheng

 On Aug 17, 2015, at 16:43, Ma, Jianpeng jianpeng...@intel.com wrote:
 
 For ceph_open, it already gets the parent inode, so there is no need to get it again.
 
 Signed-off-by: Jianpeng Ma jianpeng...@intel.com
 ---
 fs/ceph/file.c | 3 ---
 1 file changed, 3 deletions(-)
 
 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index 8b79d87..5ecd3dc 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -210,10 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
ihold(inode);
 
	req->r_num_caps = 1;
 -   if (flags & O_CREAT)
 -   parent_inode = 
 ceph_get_dentry_parent_inode(file->f_path.dentry);
	err = ceph_mdsc_do_request(mdsc, parent_inode, req);
 -   iput(parent_inode);
	if (!err)
	err = ceph_init_file(inode, file, req->r_fmode);
ceph_mdsc_put_request(req);
 --

I fixed your patches (all tabs were replaced by spaces in your patches) and 
added them to our testing branch.

Thanks
Yan, Zheng

 2.4.3
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: Using file->f_flags rather than flags check O_CREAT.

2015-08-13 Thread Yan, Zheng
On Thu, Aug 13, 2015 at 5:01 PM, Ma, Jianpeng jianpeng...@intel.com wrote:
 Because O_CREAT has been removed from flags, we should use file->f_flags.

 Signed-off-by: Jianpeng Ma jianpeng...@intel.com
 ---
  fs/ceph/file.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ceph/file.c b/fs/ceph/file.c
 index 8b79d87..e1347cf 100644
 --- a/fs/ceph/file.c
 +++ b/fs/ceph/file.c
 @@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
 ihold(inode);

 req->r_num_caps = 1;
 -   if (flags & O_CREAT)
 +   if (file->f_flags & O_CREAT)
 parent_inode = 
 ceph_get_dentry_parent_inode(file->f_path.dentry);
 err = ceph_mdsc_do_request(mdsc, parent_inode, req);
 iput(parent_inode);

In this case, we do not need parent_inode because the file already exists.
I'd like to remove the code that finds the parent inode.


Regards
Yan, Zheng

 --
 2.4.3
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: Using file->f_flags rather than flags check O_CREAT.

2015-08-13 Thread Yan, Zheng
On Thu, Aug 13, 2015 at 8:34 PM, Sage Weil s...@newdream.net wrote:
 On Thu, 13 Aug 2015, Yan, Zheng wrote:
 On Thu, Aug 13, 2015 at 5:01 PM, Ma, Jianpeng jianpeng...@intel.com wrote:
  Because O_CREAT has been removed from flags, we should use file->f_flags.
 
  Signed-off-by: Jianpeng Ma jianpeng...@intel.com
  ---
   fs/ceph/file.c | 2 +-
   1 file changed, 1 insertion(+), 1 deletion(-)
 
  diff --git a/fs/ceph/file.c b/fs/ceph/file.c
  index 8b79d87..e1347cf 100644
  --- a/fs/ceph/file.c
  +++ b/fs/ceph/file.c
  @@ -210,7 +210,7 @@ int ceph_open(struct inode *inode, struct file *file)
  ihold(inode);
 
  req->r_num_caps = 1;
  -   if (flags & O_CREAT)
  +   if (file->f_flags & O_CREAT)
  parent_inode = 
  ceph_get_dentry_parent_inode(file->f_path.dentry);
  err = ceph_mdsc_do_request(mdsc, parent_inode, req);
  iput(parent_inode);

 In this case, we do not need parent_inode because file already exists.
 I'd like to remove the code that finds parent inode.

 Do you mean that O_CREAT is only removed in the case where the inode
 already exists?  Because there is at least one case where we do a single
 op to the MDS that does the create + open, and in that case we do need to
 make sure a dir fsync waits for it to commit.  As long as we don't break
 that one!


The create+open case is handled by ceph_atomic_open or ceph_create.
ceph_open is used only when the inode already exists.

Regards
Yan, Zheng

 sage
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FileStore should not use syncfs(2)

2015-08-06 Thread Yan, Zheng
On Thu, Aug 6, 2015 at 5:26 AM, Sage Weil sw...@redhat.com wrote:
 Today I learned that syncfs(2) does an O(n) search of the superblock's
 inode list searching for dirty items.  I've always assumed that it was
 only traversing dirty inodes (e.g., a list of dirty inodes), but that
 appears not to be the case, even on the latest kernels.


I checked the syncfs code in the 3.10/4.1 kernels. I think both kernels only
traverse dirty inodes (inodes in the bdi_writeback::{b_dirty,b_io,b_more_io}
lists). What am I missing?


 That means that the more RAM in the box, the larger (generally) the inode
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40
 servicing a very light workload, and each syncfs(2) call was taking ~7
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and
 make FileStore f[data]sync all the right things so that the syncfs call
 can be avoided.  This is the path you were originally headed down,
 Somnath, and I think it's the right one.

 The main thing to watch out for is that according to POSIX you really need
 to fsync directories.  With XFS that isn't the case since all metadata
 operations are going into the journal and that's fully ordered, but we
 don't want to allow data loss on e.g. ext4 (we need to check what the
 metadata ordering behavior is there) or other file systems.

 :(

 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: CephFS and the next hammer release v0.94.3

2015-08-03 Thread Yan, Zheng
Hi Loic.

Yes, https://github.com/ceph/ceph/pull/5222 is problematic. Do you mean we should 
include these PRs in v0.94.3? These PRs fix a bug in a rare configuration; I 
think it's not a big deal to leave them out of v0.94.3.

Regards
Yan, Zheng


 On Aug 4, 2015, at 00:32, Loic Dachary l...@dachary.org wrote:
 
 Hi Greg,
 
 I assume the file handle reference counting is about 
 http://tracker.ceph.com/issues/12088 which is backported as described at 
 http://tracker.ceph.com/issues/12319. It was indeed somewhat problematic and 
 required two pull requests: https://github.com/ceph/ceph/pull/5222 (authored 
 by Yan Zheng) and https://github.com/ceph/ceph/pull/5427 (merged by Yan 
 Zheng).
 
 Cheers
 
 On 03/08/2015 18:01, Gregory Farnum wrote:
 On Mon, Aug 3, 2015 at 6:43 PM Loic Dachary l...@dachary.org wrote:
 
 Hi Greg,
 
 The next hammer release as found at 
 https://github.com/ceph/ceph/tree/hammer passed the fs suite 
 (http://tracker.ceph.com/issues/11990#fs). Do you think it is ready for QE 
 to start their own round of testing ?
 
 I'm on vacation right now, but the only thing I see there that might
 be iffy is the backport of the file handle reference counting. As long
 as that is all good (Zheng?) things look fine to me.
 -Greg
 
 
 Cheers
 
 P.S. http://tracker.ceph.com/issues/11990#Release-information has direct 
 links to the pull requests merged into hammer since v0.94.2 in case you 
 need more context about one of them.
 
 --
 Loïc Dachary, Artisan Logiciel Libre
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 -- 
 Loïc Dachary, Artisan Logiciel Libre
 

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Client: what is supposed to protect racing readdirs and unlinks?

2015-07-15 Thread Yan, Zheng
On Wed, Jul 15, 2015 at 1:11 AM, Gregory Farnum g...@gregs42.com wrote:
 I spent a bunch of today looking at http://tracker.ceph.com/issues/12297.

 Long story short: the workload is doing a readdir at the same time as
 it's unlinking files. The readdir functions (in this case,
 _readdir_cache_cb) drop the client_lock each time they invoke the
 callback (for obvious reasons). There is some effort in
 _readdir_cache_cb to try and keep the iterator valid (we check on each
 loop that we aren't at end; we increment the iterator before dropping
 the lock), but it's not sufficient.

 Is there supposed to be something preventing this kind of race? If not
 I can work something out in the code but I've not done much work in
 that bit and there are enough pieces that I wonder if I'm missing some
 other issue.

I think calling (*pd)->get() before releasing the client_lock should work.
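A rough sketch of that pattern (hypothetical names, plain C rather than the
actual Client.cc code): pin the current entry with a reference before dropping
the lock, so a racing unlink can only drop its own reference and the entry the
iterator points at stays valid.

#include <pthread.h>
#include <stdlib.h>

struct dentry_stub {
	int nref;
	/* name, inode, ... */
};

static void dentry_put(struct dentry_stub *d)
{
	if (--d->nref == 0)
		free(d);
}

static void readdir_visit(struct dentry_stub *d, pthread_mutex_t *client_lock,
			  void (*cb)(struct dentry_stub *))
{
	d->nref++;			/* the (*pd)->get() step, under client_lock */
	pthread_mutex_unlock(client_lock);
	cb(d);				/* callback runs without the lock */
	pthread_mutex_lock(client_lock);
	dentry_put(d);			/* still safe if an unlink ran meanwhile */
}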

Regards
Yan, Zheng

 -Greg
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph:for-linus 11/40] fs/ceph/mds_client.c:2865:1-10: second lock on line 2904

2015-07-02 Thread Yan, Zheng
On Fri, Jul 3, 2015 at 4:52 AM, Julia Lawall julia.law...@lip6.fr wrote:
 I haven't looked at all the called functions, to see if any of them drops
 the lock, but it could be worth a check.


the lock is dropped by cleanup_cap_releases()

 julia

 On Fri, 3 Jul 2015, kbuild test robot wrote:

 TO: Yan, Zheng z...@redhat.com
 CC: Ilya Dryomov idryo...@gmail.com
 CC: ceph-devel@vger.kernel.org

 tree:   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git 
 for-linus
 head:   5a60e87603c4c533492c515b7f62578189b03c9c
 commit: 745a8e3bccbc6adae69a98ddc525e529aa44636e [11/40] ceph: don't 
 pre-allocate space for cap release messages
 :: branch date: 2 days ago
 :: commit date: 7 days ago

  fs/ceph/mds_client.c:2865:1-10: second lock on line 2904
 --
  fs/ceph/mds_client.c:2961:1-7: preceding lock on line 2865

 git remote add ceph 
 git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git
 git remote update ceph
 git checkout 745a8e3bccbc6adae69a98ddc525e529aa44636e
 vim +2865 fs/ceph/mds_client.c

 a687ecaf John Spray 2014-09-19  2859   
 ceph_session_state_name(session-s_state));
 2f2dc053 Sage Weil  2009-10-06  2860
 99a9c273 Yan, Zheng 2013-09-22  2861  
 spin_lock(session-s_gen_ttl_lock);
 99a9c273 Yan, Zheng 2013-09-22  2862  session-s_cap_gen++;
 99a9c273 Yan, Zheng 2013-09-22  2863  
 spin_unlock(session-s_gen_ttl_lock);
 99a9c273 Yan, Zheng 2013-09-22  2864
 99a9c273 Yan, Zheng 2013-09-22 @2865  
 spin_lock(session-s_cap_lock);
 03f4fcb0 Yan, Zheng 2015-01-05  2866  /* don't know if session is 
 readonly */
 03f4fcb0 Yan, Zheng 2015-01-05  2867  session-s_readonly = 0;
 99a9c273 Yan, Zheng 2013-09-22  2868  /*
 99a9c273 Yan, Zheng 2013-09-22  2869   * notify __ceph_remove_cap() 
 that we are composing cap reconnect.
 99a9c273 Yan, Zheng 2013-09-22  2870   * If a cap get released 
 before being added to the cap reconnect,
 99a9c273 Yan, Zheng 2013-09-22  2871   * __ceph_remove_cap() should 
 skip queuing cap release.
 99a9c273 Yan, Zheng 2013-09-22  2872   */
 99a9c273 Yan, Zheng 2013-09-22  2873  session-s_cap_reconnect = 1;
 e01a5946 Sage Weil  2010-05-10  2874  /* drop old cap expires; we're 
 about to reestablish that state */
 745a8e3b Yan, Zheng 2015-05-14  2875  cleanup_cap_releases(mdsc, 
 session);
 e01a5946 Sage Weil  2010-05-10  2876
 5d23371f Yan, Zheng 2014-09-10  2877  /* trim unused caps to reduce 
 MDS's cache rejoin time */
 c0bd50e2 Yan, Zheng 2015-04-07  2878  if (mdsc-fsc-sb-s_root)
 5d23371f Yan, Zheng 2014-09-10  2879  
 shrink_dcache_parent(mdsc-fsc-sb-s_root);
 5d23371f Yan, Zheng 2014-09-10  2880
 5d23371f Yan, Zheng 2014-09-10  2881  
 ceph_con_close(session-s_con);
 5d23371f Yan, Zheng 2014-09-10  2882  ceph_con_open(session-s_con,
 5d23371f Yan, Zheng 2014-09-10  2883
 CEPH_ENTITY_TYPE_MDS, mds,
 5d23371f Yan, Zheng 2014-09-10  2884
 ceph_mdsmap_get_addr(mdsc-mdsmap, mds));
 5d23371f Yan, Zheng 2014-09-10  2885
 5d23371f Yan, Zheng 2014-09-10  2886  /* replay unsafe requests */
 5d23371f Yan, Zheng 2014-09-10  2887  replay_unsafe_requests(mdsc, 
 session);
 5d23371f Yan, Zheng 2014-09-10  2888
 5d23371f Yan, Zheng 2014-09-10  2889  down_read(mdsc-snap_rwsem);
 5d23371f Yan, Zheng 2014-09-10  2890
 2f2dc053 Sage Weil  2009-10-06  2891  /* traverse this session's 
 caps */
 44c99757 Yan, Zheng 2013-09-22  2892  s_nr_caps = session-s_nr_caps;
 44c99757 Yan, Zheng 2013-09-22  2893  err = 
 ceph_pagelist_encode_32(pagelist, s_nr_caps);
 93cea5be Sage Weil  2009-12-23  2894  if (err)
 93cea5be Sage Weil  2009-12-23  2895  goto fail;
 20cb34ae Sage Weil  2010-05-12  2896
 44c99757 Yan, Zheng 2013-09-22  2897  recon_state.nr_caps = 0;
 20cb34ae Sage Weil  2010-05-12  2898  recon_state.pagelist = 
 pagelist;
 20cb34ae Sage Weil  2010-05-12  2899  recon_state.flock = 
 session-s_con.peer_features  CEPH_FEATURE_FLOCK;
 20cb34ae Sage Weil  2010-05-12  2900  err = 
 iterate_session_caps(session, encode_caps_cb, recon_state);
 2f2dc053 Sage Weil  2009-10-06  2901  if (err  0)
 9abf82b8 Sage Weil  2010-05-10  2902  goto fail;
 2f2dc053 Sage Weil  2009-10-06  2903
 99a9c273 Yan, Zheng 2013-09-22 @2904  
 spin_lock(session-s_cap_lock);
 99a9c273 Yan, Zheng 2013-09-22  2905  session-s_cap_reconnect = 0;
 99a9c273 Yan, Zheng 2013-09-22  2906  
 spin_unlock(session-s_cap_lock);
 99a9c273 Yan, Zheng 2013-09-22  2907
 2f2dc053 Sage Weil  2009-10-06  2908  /*
 2f2dc053 Sage Weil  2009-10-06  2909   * snaprealms.  we provide mds 
 with the ino, seq (version), and
 2f2dc053 Sage Weil  2009-10-06  2910   * parent for all of our 
 realms.  If the mds has any newer info,
 2f2dc053

Re: Ceph hard lock Hammer 9.2

2015-06-24 Thread Yan, Zheng
Could you please run 'echo 1 > /proc/sys/kernel/sysrq; echo t >
/proc/sysrq-trigger' when this warning happens again, then send the
kernel messages to us.

Regards
Yan, Zheng

On Tue, Jun 23, 2015 at 10:25 PM, Barclay Jameson
almightybe...@gmail.com wrote:
 Sure,
 I guess it's actually a soft kernel lock since it's only the
 filesystem that is hung with high IO wait.
 The kernel is 4.0.4-1.el6.elrepo.x86.
 The Ceph version is 0.94.2 (Sorry about the confusion I missed a 4
 when I typed in the subject line).
 I was testing copying 100,000 files from directory (dir1) to
 (dir1-`hostname`) on three separate hosts.
 2 of the hosts completed the job and the third one hung with the stack
 trace in /var/log/messages.

 On Tue, Jun 23, 2015 at 6:54 AM, Gregory Farnum g...@gregs42.com wrote:
 On Mon, Jun 22, 2015 at 9:45 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Has anyone seen this?

 Can you describe the kernel you're using, the workload you were
 running, the Ceph cluster you're running against, etc?


 Jun 22 15:09:27 node kernel: Call Trace:
 Jun 22 15:09:27 node kernel: [816803ee] schedule+0x3e/0x90
 Jun 22 15:09:27 node kernel: [8168062e]
 schedule_preempt_disabled+0xe/0x10
 Jun 22 15:09:27 node kernel: [81681ce3]
 __mutex_lock_slowpath+0x93/0x100
 Jun 22 15:09:27 node kernel: [a060def8] ?
 __cap_is_valid+0x58/0x70 [ceph]
 Jun 22 15:09:27 node kernel: [81681d73] mutex_lock+0x23/0x40
 Jun 22 15:09:27 node kernel: [a0610f2d]
 ceph_check_caps+0x38d/0x780 [ceph]
 Jun 22 15:09:27 node kernel: [812f5a9b] ?
 __radix_tree_delete_node+0x7b/0x130
 Jun 22 15:09:27 node kernel: [a0612637]
 ceph_put_wrbuffer_cap_refs+0xf7/0x240 [ceph]
 Jun 22 15:09:27 node kernel: [a060b170]
 writepages_finish+0x200/0x290 [ceph]
 Jun 22 15:09:27 node kernel: [a05e2731]
 handle_reply+0x4f1/0x640 [libceph]
 Jun 22 15:09:27 node kernel: [a05e3065] dispatch+0x85/0xa0 
 [libceph]
 Jun 22 15:09:27 node kernel: [a05d7ceb]
 process_message+0xab/0xd0 [libceph]
 Jun 22 15:09:27 node kernel: [a05db052] try_read+0x2d2/0x430 
 [libceph]
 Jun 22 15:09:27 node kernel: [a05db7e8] con_work+0x78/0x220 
 [libceph]
 Jun 22 15:09:27 node kernel: [8108c475] 
 process_one_work+0x145/0x460
 Jun 22 15:09:27 node kernel: [8108c8b2] worker_thread+0x122/0x420
 Jun 22 15:09:27 node kernel: [8167fdb8] ? __schedule+0x398/0x840
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8108c790] ? 
 process_one_work+0x460/0x460
 Jun 22 15:09:27 node kernel: [8109170e] kthread+0xce/0xf0
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:09:27 node kernel: [81683dd8] ret_from_fork+0x58/0x90
 Jun 22 15:09:27 node kernel: [81091640] ?
 kthread_freezable_should_stop+0x70/0x70
 Jun 22 15:11:27 node kernel: INFO: task kworker/2:1:40 blocked for
 more than 120 seconds.
 Jun 22 15:11:27 node kernel:  Tainted: G  I
 4.0.4-1.el6.elrepo.x86_64 #1
 Jun 22 15:11:27 node kernel: echo 0 
 /proc/sys/kernel/hung_task_timeout_secs disables this message.
 Jun 22 15:11:27 node kernel: kworker/2:1 D 881ff279f7f8 0
   40  2 0x
 Jun 22 15:11:27 node kernel: Workqueue: ceph-msgr con_work [libceph]
 Jun 22 15:11:27 node kernel: 881ff279f7f8 881ff261c010
 881ff2b67050 88207fd95270
 Jun 22 15:11:27 node kernel: 881ff279c010 88207fd15200
 7fff 0002
 Jun 22 15:11:27 node kernel: 81680ae0 881ff279f818
 816803ee 810ae63b
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Slow file creating and deleting using bonnie ++ on Hammer

2015-05-26 Thread Yan, Zheng
On Sat, May 23, 2015 at 1:34 AM, Barclay Jameson
almightybe...@gmail.com wrote:
 Here are some more info :
 rados bench -p cephfs_data 100 write --no-cleanup

 Total time run: 100.096473
 Total writes made:  21900
 Write size: 4194304
 Bandwidth (MB/sec): 875.156

 Stddev Bandwidth:   96.1234
 Max bandwidth (MB/sec): 932
 Min bandwidth (MB/sec): 0
 Average Latency:0.0731273
 Stddev Latency: 0.0439909
 Max latency:1.23972
 Min latency:0.0306901

 (Again the numbers from bench don't match what is listed in client io.
 Ceph -s shows anywhere from 200 MB/s to 1700 MB/s even when the max
 bandwidth lists 932 as the highest)

 rados bench -p cephfs_data 100 seq

 Total time run:29.460172
 Total reads made: 21900
 Read size:4194304
 Bandwidth (MB/sec):2973.506

 Average Latency:   0.0215173
 Max latency:   0.693831
 Min latency:   0.00519763



 On client:

 [root@blarg cephfs]# time for i in {1..10}; do mkdir blarg$i ; done

 real10m36.794s
 user1m45.329s
 sys6m29.982s
 [root@blarg cephfs]# time for i in {1..10}; do touch yadda$i ; done

 real13m29.155s
 user3m55.256s
 sys7m50.301s

 What variables are most important in the perf dump?
 I would like to grep out the vars (ceph daemon
 /var/run/ceph-mds.cephnautilus01.asok perf dump | jq '.') that are of
 meaning while running the bonnie++ test again with -s 0.

 Thanks,
 BJ

 On Fri, May 22, 2015 at 10:34 AM, John Spray john.sp...@redhat.com wrote:


 On 22/05/2015 16:25, Barclay Jameson wrote:

 The Bonnie++ job _FINALLY_ finished. If I am reading this correctly it
 took days to create, stat, and delete 16 files??
 [root@blarg cephfs]# ~/bonnie++-1.03e/bonnie++ -u root:root -s 256g -r
 131072 -d /cephfs/ -m CephBench -f -b
 Using uid:0, gid:0.
 Writing intelligently...done
 Rewriting...done
 Reading intelligently...done
 start 'em...done...done...done...
 Create files in sequential order...done.
 Stat files in sequential order...done.
 Delete files in sequential order...done.
 Create files in random order...done.
 Stat files in random order...done.
 Delete files in random order...done.
 Version 1.03e   --Sequential Output-- --Sequential Input-
 --Random-
  -Per Chr- --Block-- -Rewrite- -Per Chr- --Block--
 --Seeks--
 MachineSize K/sec %CP K/sec %CP K/sec %CP K/sec %CP K/sec %CP
 /sec %CP
 CephBench  256G   1006417  76 90114  13   137110
 8 329.8   7
  --Sequential Create-- Random
 Create
  -Create-- --Read--- -Delete-- -Create-- --Read---
 -Delete--
files  /sec %CP  /sec %CP  /sec %CP  /sec %CP  /sec %CP
 /sec %CP
   16 0   0 + +++ 0   0 0   0  5267  19
 0   0

 CephBench,256G,,,1006417,76,90114,13,,,137110,8,329.8,7,16,0,0,+,+++,0,0,0,0,5267,19,0,0

 Any thoughts?

 It's 16000 files by default (not 16), but this usually takes only a few
 minutes.

 FWIW I tried running a quick bonnie++ (with -s 0 to skip the IO phase) on a
 development (vstart.sh) cluster with a fuse client, and it readily handles
 several hundred client requests per second (checked with ceph daemonperf
 mds.id)

 Nothing immediately leapt out at me from a quick look at the log you posted,
 but with issues like these it is always worth trying to narrow it down by
 trying the fuse client instead of the kernel client, and/or different kernel
 versions.

 You may also want to check that your underlying RADOS cluster is performing
 reasonably by doing a rados bench too.

 Cheers,
 John
 --

The reason for the slow file creations is that bonnie++ calls fsync(2)
after each creat(2). fsync() waits for the safe replies of the create
requests. The MDS sends the safe reply when the log event for the request
gets journaled safely, and the MDS flushes the journal every 5 seconds
(mds_tick_interval). So the file creation speed for bonnie++ is roughly
one file every five seconds.
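As a rough illustration (my own sketch, not bonnie++ source), the pattern that
serializes on the journal flush looks like this; the arithmetic in the last
comment is why the run takes days:

#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
	char name[32];
	int i, fd;

	for (i = 0; i < 16384; i++) {		/* bonnie++ default file count */
		snprintf(name, sizeof(name), "file%06d", i);
		fd = creat(name, 0644);		/* MDS create request */
		if (fd < 0)
			return 1;
		fsync(fd);			/* blocks until the MDS sends the "safe"
						 * reply, i.e. until the next journal
						 * flush (mds_tick_interval, ~5 s) */
		close(fd);
	}
	/* 16384 files x ~5 s per safe reply is roughly 22 hours */
	return 0;
}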



 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 4/5] ceph: simplify two mount_timeout sites

2015-05-21 Thread Yan, Zheng
On Thu, May 21, 2015 at 8:35 PM, Ilya Dryomov idryo...@gmail.com wrote:
 No need to bifurcate wait now that we've got ceph_timeout_jiffies().

 Signed-off-by: Ilya Dryomov idryo...@gmail.com
 ---
  fs/ceph/dir.c| 14 --
  fs/ceph/mds_client.c | 18 ++
  2 files changed, 14 insertions(+), 18 deletions(-)

 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 173dd4b58c71..3dec27e36417 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -1257,17 +1257,11 @@ static int ceph_dir_fsync(struct file *file, loff_t 
 start, loff_t end,

 dout(dir_fsync %p wait on tid %llu (until %llu)\n,
  inode, req-r_tid, last_tid);
 -   if (req-r_timeout) {
 -   unsigned long time_left = wait_for_completion_timeout(
 -   req-r_safe_completion,
 +   ret = !wait_for_completion_timeout(req-r_safe_completion,
 ceph_timeout_jiffies(req-r_timeout));
 -   if (time_left  0)
 -   ret = 0;
 -   else
 -   ret = -EIO;  /* timed out */
 -   } else {
 -   wait_for_completion(req-r_safe_completion);
 -   }
 +   if (ret)
 +   ret = -EIO;  /* timed out */
 +
 ceph_mdsc_put_request(req);

 spin_lock(ci-i_unsafe_lock);
 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 0b0e0a9a81c0..5be2d287a26c 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -2266,16 +2266,18 @@ int ceph_mdsc_do_request(struct ceph_mds_client *mdsc,
 /* wait */
 mutex_unlock(mdsc-mutex);
 dout(do_request waiting\n);
 -   if (req-r_timeout) {
 -   err = (long)wait_for_completion_killable_timeout(
 -   req-r_completion,
 -   ceph_timeout_jiffies(req-r_timeout));
 -   if (err == 0)
 -   err = -EIO;
 -   } else if (req-r_wait_for_completion) {
 +   if (!req-r_timeout  req-r_wait_for_completion) {
 err = req-r_wait_for_completion(mdsc, req);
 } else {
 -   err = wait_for_completion_killable(req-r_completion);
 +   long timeleft = wait_for_completion_killable_timeout(
 +   req-r_completion,
 +   ceph_timeout_jiffies(req-r_timeout));
 +   if (timeleft  0)
 +   err = 0;
 +   else if (!timeleft)
 +   err = -EIO;  /* timed out */
 +   else
 +   err = timeleft;  /* killed */
 }
 dout(do_request waited, got %d\n, err);
 mutex_lock(mdsc-mutex);
 --
 1.9.3


Reviewed-by: Yan, Zheng z...@redhat.com


 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 3/3] ceph: check OSD caps before read/write

2015-04-27 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/addr.c   | 203 +++
 fs/ceph/caps.c   |   4 +
 fs/ceph/inode.c  |   3 +
 fs/ceph/mds_client.c |   4 +
 fs/ceph/mds_client.h |   9 +++
 fs/ceph/super.c  |  15 +++-
 fs/ceph/super.h  |  17 +++--
 7 files changed, 249 insertions(+), 6 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index bdeea57..c65f9e0 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1599,3 +1599,206 @@ int ceph_mmap(struct file *file, struct vm_area_struct 
*vma)
vma-vm_ops = ceph_vmops;
return 0;
 }
+
+enum {
+   POOL_READ   = 1,
+   POOL_WRITE  = 2,
+};
+
+static int __ceph_pool_perm_get(struct ceph_inode_info *ci, u32 pool)
+{
+   struct ceph_fs_client *fsc = ceph_inode_to_client(ci-vfs_inode);
+   struct ceph_mds_client *mdsc = fsc-mdsc;
+   struct ceph_osd_request *rd_req = NULL, *wr_req = NULL;
+   struct rb_node **p, *parent;
+   struct ceph_pool_perm *perm;
+   struct page **pages;
+   int err = 0, err2 = 0, have = 0;
+
+   down_read(mdsc-pool_perm_rwsem);
+   p = mdsc-pool_perm_tree.rb_node;
+   while (*p) {
+   perm = rb_entry(*p, struct ceph_pool_perm, node);
+   if (pool  perm-pool)
+   p = (*p)-rb_left;
+   else if (pool  perm-pool)
+   p = (*p)-rb_right;
+   else {
+   have = perm-perm;
+   break;
+   }
+   }
+   up_read(mdsc-pool_perm_rwsem);
+   if (*p)
+   goto out;
+
+   dout(__ceph_pool_perm_get pool %u no perm cached\n, pool);
+
+   down_write(mdsc-pool_perm_rwsem);
+   parent = NULL;
+   while (*p) {
+   parent = *p;
+   perm = rb_entry(parent, struct ceph_pool_perm, node);
+   if (pool  perm-pool)
+   p = (*p)-rb_left;
+   else if (pool  perm-pool)
+   p = (*p)-rb_right;
+   else {
+   have = perm-perm;
+   break;
+   }
+   }
+   if (*p) {
+   up_write(mdsc-pool_perm_rwsem);
+   goto out;
+   }
+
+   rd_req = ceph_osdc_alloc_request(fsc-client-osdc,
+ci-i_snap_realm-cached_context,
+1, false, GFP_NOFS);
+   if (!rd_req) {
+   err = -ENOMEM;
+   goto out_unlock;
+   }
+
+   rd_req-r_flags = CEPH_OSD_FLAG_READ;
+   osd_req_op_init(rd_req, 0, CEPH_OSD_OP_STAT, 0);
+   rd_req-r_base_oloc.pool = pool;
+   snprintf(rd_req-r_base_oid.name, sizeof(rd_req-r_base_oid.name),
+%llx., ci-i_vino.ino);
+   rd_req-r_base_oid.name_len = strlen(rd_req-r_base_oid.name);
+
+   wr_req = ceph_osdc_alloc_request(fsc-client-osdc,
+ci-i_snap_realm-cached_context,
+1, false, GFP_NOFS);
+   if (!wr_req) {
+   err = -ENOMEM;
+   goto out_unlock;
+   }
+
+   wr_req-r_flags = CEPH_OSD_FLAG_WRITE |
+ CEPH_OSD_FLAG_ACK | CEPH_OSD_FLAG_ONDISK;
+   osd_req_op_init(wr_req, 0, CEPH_OSD_OP_CREATE, CEPH_OSD_OP_FLAG_EXCL);
+   wr_req-r_base_oloc.pool = pool;
+   wr_req-r_base_oid = rd_req-r_base_oid;
+
+   /* one page should be large enough for STAT data */
+   pages = ceph_alloc_page_vector(1, GFP_KERNEL);
+   if (IS_ERR(pages)) {
+   err = PTR_ERR(pages);
+   goto out_unlock;
+   }
+
+   osd_req_op_raw_data_in_pages(rd_req, 0, pages, PAGE_SIZE,
+0, false, true);
+   ceph_osdc_build_request(rd_req, 0, NULL, CEPH_NOSNAP,
+   ci-vfs_inode.i_mtime);
+   err = ceph_osdc_start_request(fsc-client-osdc, rd_req, false);
+
+   ceph_osdc_build_request(wr_req, 0, NULL, CEPH_NOSNAP,
+   ci-vfs_inode.i_mtime);
+   err2 = ceph_osdc_start_request(fsc-client-osdc, wr_req, false);
+
+   if (!err)
+   err = ceph_osdc_wait_request(fsc-client-osdc, rd_req);
+   if (!err2)
+   err2 = ceph_osdc_wait_request(fsc-client-osdc, wr_req);
+
+   if (err = 0 || err == -ENOENT)
+   have |= POOL_READ;
+   else if (err != -EPERM)
+   goto out_unlock;
+
+   if (err2 == 0 || err2 == -EEXIST)
+   have |= POOL_WRITE;
+   else if (err2 != -EPERM) {
+   err = err2;
+   goto out_unlock;
+   }
+
+   perm = kmalloc(sizeof(*perm), GFP_NOFS);
+   if (!perm) {
+   err = -ENOMEM;
+   goto out_unlock;
+   }
+
+   perm-pool = pool;
+   perm-perm = have;
+   rb_link_node(perm-node, parent, p);
+   rb_insert_color

[PATCH 2/3] libceph: allow setting osd_req_op's flags

2015-04-27 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 drivers/block/rbd.c |  4 ++--
 fs/ceph/addr.c  |  3 ++-
 fs/ceph/file.c  |  2 +-
 include/linux/ceph/osd_client.h |  2 +-
 net/ceph/osd_client.c   | 24 +++-
 5 files changed, 21 insertions(+), 14 deletions(-)

diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
index 8125233..15ddc5e 100644
--- a/drivers/block/rbd.c
+++ b/drivers/block/rbd.c
@@ -2371,7 +2371,7 @@ static void rbd_img_obj_request_fill(struct 
rbd_obj_request *obj_request,
}
 
if (opcode == CEPH_OSD_OP_DELETE)
-   osd_req_op_init(osd_request, num_ops, opcode);
+   osd_req_op_init(osd_request, num_ops, opcode, 0);
else
osd_req_op_extent_init(osd_request, num_ops, opcode,
   offset, length, 0, 0);
@@ -2843,7 +2843,7 @@ static int rbd_img_obj_exists_submit(struct 
rbd_obj_request *obj_request)
goto out;
stat_request-callback = rbd_img_obj_exists_callback;
 
-   osd_req_op_init(stat_request-osd_req, 0, CEPH_OSD_OP_STAT);
+   osd_req_op_init(stat_request-osd_req, 0, CEPH_OSD_OP_STAT, 0);
osd_req_op_raw_data_in_pages(stat_request-osd_req, 0, pages, size, 0,
false, false);
rbd_osd_req_format_read(stat_request);
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index cab1cf5..bdeea57 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -884,7 +884,8 @@ get_more_pages:
}
 
if (do_sync)
-   osd_req_op_init(req, 1, 
CEPH_OSD_OP_STARTSYNC);
+   osd_req_op_init(req, 1,
+   CEPH_OSD_OP_STARTSYNC, 
0);
 
req-r_callback = writepages_finish;
req-r_inode = inode;
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index d533075..a972019 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -615,7 +615,7 @@ ceph_sync_direct_write(struct kiocb *iocb, struct iov_iter 
*from, loff_t pos)
break;
}
 
-   osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC);
+   osd_req_op_init(req, 1, CEPH_OSD_OP_STARTSYNC, 0);
 
n = iov_iter_get_pages_alloc(from, pages, len, start);
if (unlikely(n  0)) {
diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 61b19c4..7506b48 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -249,7 +249,7 @@ extern void ceph_osdc_handle_map(struct ceph_osd_client 
*osdc,
 struct ceph_msg *msg);
 
 extern void osd_req_op_init(struct ceph_osd_request *osd_req,
-   unsigned int which, u16 opcode);
+   unsigned int which, u16 opcode, u32 flags);
 
 extern void osd_req_op_raw_data_in_pages(struct ceph_osd_request *,
unsigned int which,
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index b93531f..e3c2e8b 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -453,7 +453,7 @@ __CEPH_FORALL_OSD_OPS(GENERATE_CASE)
  */
 static struct ceph_osd_req_op *
 _osd_req_op_init(struct ceph_osd_request *osd_req, unsigned int which,
-   u16 opcode)
+u16 opcode, u32 flags)
 {
struct ceph_osd_req_op *op;
 
@@ -463,14 +463,15 @@ _osd_req_op_init(struct ceph_osd_request *osd_req, 
unsigned int which,
op = osd_req-r_ops[which];
memset(op, 0, sizeof (*op));
op-op = opcode;
+   op-flags = flags;
 
return op;
 }
 
 void osd_req_op_init(struct ceph_osd_request *osd_req,
-   unsigned int which, u16 opcode)
+unsigned int which, u16 opcode, u32 flags)
 {
-   (void)_osd_req_op_init(osd_req, which, opcode);
+   (void)_osd_req_op_init(osd_req, which, opcode, flags);
 }
 EXPORT_SYMBOL(osd_req_op_init);
 
@@ -479,7 +480,8 @@ void osd_req_op_extent_init(struct ceph_osd_request 
*osd_req,
u64 offset, u64 length,
u64 truncate_size, u32 truncate_seq)
 {
-   struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
+   struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which,
+ opcode, 0);
size_t payload_len = 0;
 
BUG_ON(opcode != CEPH_OSD_OP_READ  opcode != CEPH_OSD_OP_WRITE 
@@ -518,7 +520,8 @@ EXPORT_SYMBOL(osd_req_op_extent_update);
 void osd_req_op_cls_init(struct ceph_osd_request *osd_req, unsigned int which,
u16 opcode, const char *class, const char *method)
 {
-   struct ceph_osd_req_op *op = _osd_req_op_init

Re: [ceph-users] CephFS Slow writes with 1MB files

2015-04-02 Thread Yan, Zheng
On Thu, Apr 2, 2015 at 11:18 PM, Barclay Jameson
almightybe...@gmail.com wrote:
 I am using the Giant release. The OSDs and MON/MDS are using default
 RHEL 7 kernel. Client is using elrepo 3.19 kernel. I am also using
 cephaux.

I reproduced this issue using the Giant release. It's a bug in the MDS
code. Could you try the newest development version of Ceph (it includes
the fix), or apply the attached patch to the source of the Giant release?

Regards
Yan, Zheng


 I may have found something.
 I did the build manually as such I did _NOT_ set up these config settings:
 filestore xattr use omap = false
 filestore max inline xattr size = 65536,
 filestore_max_inline_xattr_size_xfs = 65536
 filestore_max_inline_xattr_size_other = 512
 filestore_max_inline_xattrs_xfs = 10

 I just changed these settings to see if it will make a difference.
 I copied data from one directory that had files I created before I set
 these values ( time cp small1/* small2/.) and it takes 2 min 30 secs
 to copy 1600 files.
 If I took the files I just copied from small2 and copy them to a
 different directory ( time cp small2/* small3/.) it only takes 5 mins
 to copy 1 files!

 Could this be part of the problem?


 On Thu, Apr 2, 2015 at 6:03 AM, Yan, Zheng uker...@gmail.com wrote:
 On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Here is the mds output from the command you requested. I did this
 during the small data run . ( time cp small1/* small2/ )
 It is 20MB in size so I couldn't find a place online that would accept
 that much data.

 Please find attached file.

 Thanks,

 In the log file, each 'create' request is followed by several
 'getattr' requests. I guess these 'getattr' requests resulted from
 some kind of permission check, but I can't reproduce this situation
 locally.

 Which version of ceph/kernel are you using? Do you use ceph-fuse or the
 kernel client, and what are the mount options?

 Regards
 Yan, Zheng



 Beeij


 On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng uker...@gmail.com wrote:
 On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I redid my entire Ceph build going back to to CentOS 7 hoping to the
 get the same performance I did last time.
 The rados bench test was the best I have ever had with a time of 740
 MB wr and 1300 MB rd. This was even better than the first rados bench
 test that had performance equal to PanFS. I find that this does not
 translate to my CephFS. Even with the following tweaking it still at
 least twice as slow as PanFS and my first *Magical* build (that had
 absolutely no tweaking):

 OSD
  osd_op_treads 8
  /sys/block/sd*/queue/nr_requests 4096
  /sys/block/sd*/queue/read_ahead_kb 4096

 Client
  rsize=16777216
  readdir_max_bytes=16777216
  readdir_max_entries=16777216

 ~160 mins to copy 10 (1MB) files for CephFS vs ~50 mins for PanFS.
 Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.

 Strange thing is none of the resources are taxed.
 CPU, ram, network, disks, are not even close to being taxed on either
 the client,mon/mds, or the osd nodes.
 The PanFS client node was a 10Gb network the same as the CephFS client
 but you can see the huge difference in speed.

 As per Gregs questions before:
 There is only one client reading and writing (time cp Small1/*
 Small2/.) but three clients have cephfs mounted, although they aren't
 doing anything on the filesystem.

 I have done another test where I stream data info a file as fast as
 the processor can put it there.
 (for (i=0; i  11; i++){ fprintf (out_file, I is : %d\n,i);}
 ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
 above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
 seconds for CephFS although the first build did it in 130 seconds
 without any tuning.

 This leads me to believe the bottleneck is the mds. Does anybody have
 any thoughts on this?
 Are there any tuning parameters that I would need to speed up the mds?

 Could you enable mds debugging for a few seconds (ceph daemon mds.x
 config set debug_mds 10; sleep 10; ceph daemon mds.x config set
 debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere?

 Regards
 Yan, Zheng


 On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Yes it's the exact same hardware except for the MDS server (although I
 tried using the MDS on the old node).
 I have not tried moving the MON back to the old node.

 My default cache size is mds cache size = 1000
 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
 I created 2048 for data and metadata:
 ceph osd pool create cephfs_data 2048 2048
 ceph osd pool create cephfs_metadata 2048 2048


 To your point on clients competing against each other... how would I 
 check that?

 Do you have multiple clients mounted? Are they both accessing files in
 the directory(ies) you're testing? Were they accessing the same
 pattern

Re: [ceph-users] CephFS Slow writes with 1MB files

2015-04-02 Thread Yan, Zheng
On Wed, Apr 1, 2015 at 12:31 AM, Barclay Jameson
almightybe...@gmail.com wrote:
 Here is the mds output from the command you requested. I did this
 during the small data run . ( time cp small1/* small2/ )
 It is 20MB in size so I couldn't find a place online that would accept
 that much data.

 Please find attached file.

 Thanks,

In the log file, each 'create' request is followed by several
'getattr' requests. I guess these 'getattr' requests resulted from
some kind of permission check, but I can't reproduce this situation
locally.

Which version of ceph/kernel are you using? Do you use ceph-fuse or the
kernel client, and what are the mount options?

Regards
Yan, Zheng



 Beeij


 On Mon, Mar 30, 2015 at 10:59 PM, Yan, Zheng uker...@gmail.com wrote:
 On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I redid my entire Ceph build going back to to CentOS 7 hoping to the
 get the same performance I did last time.
 The rados bench test was the best I have ever had with a time of 740
 MB wr and 1300 MB rd. This was even better than the first rados bench
 test that had performance equal to PanFS. I find that this does not
 translate to my CephFS. Even with the following tweaking it still at
 least twice as slow as PanFS and my first *Magical* build (that had
 absolutely no tweaking):

 OSD
  osd_op_treads 8
  /sys/block/sd*/queue/nr_requests 4096
  /sys/block/sd*/queue/read_ahead_kb 4096

 Client
  rsize=16777216
  readdir_max_bytes=16777216
  readdir_max_entries=16777216

 ~160 mins to copy 10 (1MB) files for CephFS vs ~50 mins for PanFS.
 Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.

 Strange thing is none of the resources are taxed.
 CPU, ram, network, disks, are not even close to being taxed on either
 the client,mon/mds, or the osd nodes.
 The PanFS client node was a 10Gb network the same as the CephFS client
 but you can see the huge difference in speed.

 As per Gregs questions before:
 There is only one client reading and writing (time cp Small1/*
 Small2/.) but three clients have cephfs mounted, although they aren't
 doing anything on the filesystem.

 I have done another test where I stream data info a file as fast as
 the processor can put it there.
 (for (i=0; i  11; i++){ fprintf (out_file, I is : %d\n,i);}
 ) and it is faster than the PanFS. CephFS 16GB in 105 seconds with the
 above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
 seconds for CephFS although the first build did it in 130 seconds
 without any tuning.

 This leads me to believe the bottleneck is the mds. Does anybody have
 any thoughts on this?
 Are there any tuning parameters that I would need to speed up the mds?

 Could you enable mds debugging for a few seconds (ceph daemon mds.x
 config set debug_mds 10; sleep 10; ceph daemon mds.x config set
 debug_mds 0) and upload /var/log/ceph/mds.x.log somewhere?

 Regards
 Yan, Zheng


 On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Yes it's the exact same hardware except for the MDS server (although I
 tried using the MDS on the old node).
 I have not tried moving the MON back to the old node.

 My default cache size is mds cache size = 1000
 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
 I created 2048 for data and metadata:
 ceph osd pool create cephfs_data 2048 2048
 ceph osd pool create cephfs_metadata 2048 2048


 To your point on clients competing against each other... how would I 
 check that?

 Do you have multiple clients mounted? Are they both accessing files in
 the directory(ies) you're testing? Were they accessing the same
 pattern of files for the old cluster?

 If you happen to be running a hammer rc or something pretty new you
 can use the MDS admin socket to explore a bit what client sessions
 there are and what they have permissions on and check; otherwise
 you'll have to figure it out from the client side.
 -Greg


 Thanks for the input!


 On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote:
 So this is exactly the same test you ran previously, but now it's on
 faster hardware and the test is slower?

 Do you have more data in the test cluster? One obvious possibility is
 that previously you were working entirely in the MDS' cache, but now
 you've got more dentries and so it's kicking data out to RADOS and
 then reading it back in.

 If you've got the memory (you appear to) you can pump up the mds
 cache size config option quite dramatically from it's default 10.

 Other things to check are that you've got an appropriately-sized
 metadata pool, that you've not got clients competing against each
 other inappropriately, etc.
 -Greg

 On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Opps I should have said that I am not just writing the data but copying 
 it :

 time cp Small1/* Small2/*

 Thanks,

 BJ

 On Fri, Mar 27

Re: [ceph-users] CephFS Slow writes with 1MB files

2015-03-30 Thread Yan, Zheng
On Sun, Mar 29, 2015 at 1:12 AM, Barclay Jameson
almightybe...@gmail.com wrote:
 I redid my entire Ceph build going back to to CentOS 7 hoping to the
 get the same performance I did last time.
 The rados bench test was the best I have ever had with a time of 740
 MB wr and 1300 MB rd. This was even better than the first rados bench
 test that had performance equal to PanFS. I find that this does not
 translate to my CephFS. Even with the following tweaking it still at
 least twice as slow as PanFS and my first *Magical* build (that had
 absolutely no tweaking):

 OSD
  osd_op_treads 8
  /sys/block/sd*/queue/nr_requests 4096
  /sys/block/sd*/queue/read_ahead_kb 4096

 Client
  rsize=16777216
  readdir_max_bytes=16777216
  readdir_max_entries=16777216

 ~160 mins to copy 10 (1MB) files for CephFS vs ~50 mins for PanFS.
 Throughput on CephFS is about 10MB/s vs PanFS 30 MB/s.

 Strange thing is none of the resources are taxed.
 CPU, ram, network, disks, are not even close to being taxed on either
 the client,mon/mds, or the osd nodes.
 The PanFS client node was a 10Gb network the same as the CephFS client
 but you can see the huge difference in speed.

 As per Gregs questions before:
 There is only one client reading and writing (time cp Small1/*
 Small2/.) but three clients have cephfs mounted, although they aren't
 doing anything on the filesystem.

 I have done another test where I stream data into a file as fast as
 the processor can put it there
 (for (i=0; i < 11; i++){ fprintf(out_file, "I is : %d\n", i); })
 and it is faster than the PanFS. CephFS wrote 16GB in 105 seconds with the
 above tuning vs 130 seconds for PanFS. Without the tuning it takes 230
 seconds for CephFS, although the first build did it in 130 seconds
 without any tuning.
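 A self-contained sketch of that streaming-write test (the mount path and
 the iteration count here are placeholders; pick a count large enough to
 produce roughly 16GB of output):

 #include <stdio.h>

 int main(void)
 {
     const char *path = "/mnt/cephfs/stream_test.txt";  /* placeholder path */
     const long count = 1000000000L;  /* placeholder: ~1e9 short lines is roughly 16GB */
     FILE *out_file = fopen(path, "w");

     if (!out_file) {
         perror("fopen");
         return 1;
     }
     for (long i = 0; i < count; i++)
         fprintf(out_file, "I is : %ld\n", i);
     fclose(out_file);
     return 0;
 }

 Timing it with time(1) gives the same kind of comparison as above.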

 This leads me to believe the bottleneck is the mds. Does anybody have
 any thoughts on this?
 Are there any tuning parameters that I would need to speed up the mds?

could you enable mds debugging for a few seconds (ceph daemon mds.x
config set debug_mds 10; sleep 10; ceph daemon mds.x config set
debug_mds 0). and upload /var/log/ceph/mds.x.log to somewhere.

Regards
Yan, Zheng


 On Fri, Mar 27, 2015 at 4:50 PM, Gregory Farnum g...@gregs42.com wrote:
 On Fri, Mar 27, 2015 at 2:46 PM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Yes it's the exact same hardware except for the MDS server (although I
 tried using the MDS on the old node).
 I have not tried moving the MON back to the old node.

 My default cache size is mds cache size = 1000
 The OSDs (3 of them) have 16 Disks with 4 SSD Journal Disks.
 I created 2048 for data and metadata:
 ceph osd pool create cephfs_data 2048 2048
 ceph osd pool create cephfs_metadata 2048 2048


 To your point on clients competing against each other... how would I check 
 that?

 Do you have multiple clients mounted? Are they both accessing files in
 the directory(ies) you're testing? Were they accessing the same
 pattern of files for the old cluster?

 If you happen to be running a hammer rc or something pretty new you
 can use the MDS admin socket to explore a bit what client sessions
 there are and what they have permissions on and check; otherwise
 you'll have to figure it out from the client side.
 -Greg
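 For example, with a hammer-era MDS named mds.0 (the daemon name is a
 placeholder, and availability of the command depends on the release):

 ceph daemon mds.0 session ls

 This lists the connected client sessions so you can see which clients are
 mounted and talking to the MDS.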


 Thanks for the input!


 On Fri, Mar 27, 2015 at 3:04 PM, Gregory Farnum g...@gregs42.com wrote:
 So this is exactly the same test you ran previously, but now it's on
 faster hardware and the test is slower?

 Do you have more data in the test cluster? One obvious possibility is
 that previously you were working entirely in the MDS' cache, but now
 you've got more dentries and so it's kicking data out to RADOS and
 then reading it back in.

 If you've got the memory (you appear to) you can pump up the mds
 cache size config option quite dramatically from its default of 100k.
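 For example (the values and MDS name here are illustrative, not a
 recommendation), either set it in ceph.conf on the MDS host:

 [mds]
     mds cache size = 4000000

 or bump it on a running MDS through the admin socket:

 ceph daemon mds.0 config set mds_cache_size 4000000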

 Other things to check are that you've got an appropriately-sized
 metadata pool, that you've not got clients competing against each
 other inappropriately, etc.
 -Greg

 On Fri, Mar 27, 2015 at 9:47 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 Oops, I should have said that I am not just writing the data but copying
 it:

 time cp Small1/* Small2/*

 Thanks,

 BJ

 On Fri, Mar 27, 2015 at 11:40 AM, Barclay Jameson
 almightybe...@gmail.com wrote:
 I did a Ceph cluster install 2 weeks ago where I was getting great
 performance (~= PanFS) where I could write 100,000 1MB files in 61
 Mins (Took PanFS 59 Mins). I thought I could increase the performance
 by adding a better MDS server so I redid the entire build.

 Now it takes 4 times as long to write the same data as it did before.
 The only thing that changed was the MDS server. (I even tried moving
 the MDS back on the old slower node and the performance was the same.)

 The first install was on CentOS 7. I tried going down to CentOS 6.6
 and the results are the same.
 I use the same scripts to install the OSDs (which I created because I
 can never get ceph-deploy to behave correctly

Re: ceph-fuse remount issues

2015-03-16 Thread Yan, Zheng
On Mon, Mar 16, 2015 at 1:28 PM, Sage Weil sw...@redhat.com wrote:
 Hi Zheng,

  On Thu, 26 Feb 2015, Yan, Zheng wrote:
   On 20 Feb 2015, at 06:23, John Spray john.sp...@redhat.com wrote:
 
 
  Background: a while ago, we found (#10277) that existing cache expiration 
  mechanism wasn't working with latest kernels.  We used to invalidate the 
  top level dentries, which caused fuse to invalidate everything, but an 
  implementation detail in fuse caused it to start ignoring our repeated 
  invalidate calls, so this doesn't work any more.  To persuade fuse to 
  dirty its entire metadata cache, Zheng added in a system() call to mount 
  -o remount after we expire things from our client side cache.

  The change of the d_invalidate() implementation breaks our old cache expiration
  mechanism. When invalidating a dentry, d_invalidate() also walks the
  dentry subtree and tries to prune any unused descendant dentries. Our old
  cache expiration mechanism relies on this to prune unused dentries: we
  invalidate the top level dentries, and d_invalidate() tries to prune unused
  dentries underneath these top level dentries. Prior to the 3.18 kernel,
  d_invalidate() could fail if the dentry was in use by someone.
  The implementation of d_invalidate() changed in the 3.18 kernel; d_invalidate()
  now always succeeds and unhashes the dentry even if it's still in use. This
  behavior change means we can no longer use d_invalidate() at will. One
  known bad consequence is that the getcwd() system call returns -EINVAL after
  a process's working directory gets invalidated.

 I took another look at this and it seems to me like we might need
 something more than a new call that does the pruning.  What we were doing
 before was also a bit of a hack, it seems.

 What is really going on is that the MDS is telling us to reduce the number
 of inodes we have pinned.  We should ideally turn that into pressure on
 dcache.. but it's not per-superblock, so there's not a shrinker we can
  poke that does what we want.

  We *are* doing the dentry invalidation, so the dentries are unhashed.
 But they aren't getting destroyed... Zheng, this is what the previous hack
 was doing, right?  Forcing unhashed dentries to get trimmed from the LRU?

Yes, that's what the previous hack did. When invalidating a dentry, the VFS also
tried destroying the dentry.

 It seems like the most elegant solution would be to patch fs/fuse to make
 that happen in the general case when we do the invalidate upcall.  Does
 that sound right?

What do you mean by "make that happen in the general case"? In my opinion,
there isn't much the fuse kernel module can do about dentries. Maybe we can try
adding a new callback which calls d_prune_aliases().
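
A rough sketch of the idea (this is not an existing fuse interface; the
callback name and wiring are hypothetical, only d_prune_aliases() is real):

#include <linux/fs.h>
#include <linux/dcache.h>

/* hypothetical upcall handler, invoked when userspace asks the kernel
 * to shrink its cached metadata for one inode */
static void example_prune_inode_dentries(struct inode *inode)
{
	/* drops unused dentry aliases of this inode without unhashing
	 * dentries that are still in use, unlike 3.18+ d_invalidate() */
	d_prune_aliases(inode);
}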

Regards
Yan, Zheng


 sage
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: use msecs_to_jiffies for time conversion

2015-03-11 Thread Yan, Zheng
On Fri, Feb 6, 2015 at 7:52 PM, Nicholas Mc Guire hof...@osadl.org wrote:
  This is only an API consolidation and should make things more readable;
  it replaces var * HZ / 1000 with msecs_to_jiffies(var).

 Signed-off-by: Nicholas Mc Guire hof...@osadl.org
 ---

 Patch was only compile tested with x86_64_defconfig + CONFIG_CEPH_FS=m

 Patch is against 3.19.0-rc7 (localversion-next is -next-20150204)

  fs/ceph/mds_client.c |2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
 index 5f62fb7..ced7503 100644
 --- a/fs/ceph/mds_client.c
 +++ b/fs/ceph/mds_client.c
 @@ -3070,7 +3070,7 @@ static void handle_lease(struct ceph_mds_client *mdsc,
  di->lease_renew_from &&
  di->lease_renew_after == 0) {
  unsigned long duration =
  -   le32_to_cpu(h->duration_ms) * HZ / 1000;
  +   msecs_to_jiffies(le32_to_cpu(h->duration_ms));

  di->lease_seq = seq;
  dentry->d_time = di->lease_renew_from + duration;
 --
 1.7.10.4
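
For illustration only (not part of the patch), the two forms compute the
same conversion:

#include <linux/jiffies.h>

static unsigned long lease_duration_jiffies(u32 duration_ms)
{
	/* open-coded form:      duration_ms * HZ / 1000       */
	/* consolidated helper:  msecs_to_jiffies(duration_ms) */
	/* the helper additionally rounds up and caps absurdly large values */
	return msecs_to_jiffies(duration_ms);
}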

added to our testing branch

Thanks
Yan, Zheng

 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: match wait_for_completion_timeout return type

2015-03-11 Thread Yan, Zheng
On Tue, Mar 10, 2015 at 11:18 PM, Nicholas Mc Guire hof...@osadl.org wrote:
 return type of wait_for_completion_timeout is unsigned long not int. An
 appropriately named unsigned long is added and the assignment fixed up.

 Signed-off-by: Nicholas Mc Guire hof...@osadl.org
 ---

 This was only compile tested for x86_64_defconfig + CONFIG_CEPH_FS=m

 Patch is against 4.0-rc2 linux-next (localversion-next is -next-20150306)

  fs/ceph/dir.c |7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index 83e9976..4bee6b7 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
  @@ -1218,6 +1218,7 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
  struct ceph_mds_request *req;
  u64 last_tid;
  int ret = 0;
  +   unsigned long time_left;

  dout("dir_fsync %p\n", inode);
  ret = filemap_write_and_wait_range(inode->i_mapping, start, end);
  @@ -1240,11 +1241,11 @@ static int ceph_dir_fsync(struct file *file, loff_t start, loff_t end,
  dout("dir_fsync %p wait on tid %llu (until %llu)\n",
   inode, req->r_tid, last_tid);
  if (req->r_timeout) {
  -   ret = wait_for_completion_timeout(
  +   time_left = wait_for_completion_timeout(
  &req->r_safe_completion, req->r_timeout);
  -   if (ret > 0)
  +   if (time_left > 0)
  ret = 0;
  -   else if (ret == 0)
  +   else if (!time_left)
  ret = -EIO;  /* timed out */
  } else {
  wait_for_completion(&req->r_safe_completion);
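
The general idiom this cleanup series is pushing toward, as a minimal
sketch (not taken from the patch itself):

#include <linux/completion.h>
#include <linux/errno.h>

/* wait_for_completion_timeout() returns 0 on timeout, otherwise the
 * remaining jiffies, so the result belongs in an unsigned long */
static int example_wait(struct completion *done, unsigned long timeout)
{
	unsigned long time_left = wait_for_completion_timeout(done, timeout);

	if (!time_left)
		return -ETIMEDOUT;	/* timed out */
	return 0;			/* completed with time to spare */
}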

There is a lot of similar code in the kernel. I don't think this code
causes problems in reality.

 --
 1.7.10.4

 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: match wait_for_completion_timeout return type

2015-03-11 Thread Yan, Zheng
On Wed, Mar 11, 2015 at 7:04 PM, Nicholas Mc Guire der.h...@hofr.at wrote:
 On Wed, 11 Mar 2015, Yan, Zheng wrote:

 On Tue, Mar 10, 2015 at 11:18 PM, Nicholas Mc Guire hof...@osadl.org wrote:
  return type of wait_for_completion_timeout is unsigned long not int. An
  appropriately named unsigned long is added and the assignment fixed up.
 
  Signed-off-by: Nicholas Mc Guire hof...@osadl.org
  ---
 
  This was only compile tested for x86_64_defconfig + CONFIG_CEPH_FS=m
 
  Patch is against 4.0-rc2 linux-next (localversion-next is -next-20150306)
 
   fs/ceph/dir.c |7 ---
   1 file changed, 4 insertions(+), 3 deletions(-)
 
  diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
  index 83e9976..4bee6b7 100644
  --- a/fs/ceph/dir.c
  +++ b/fs/ceph/dir.c
  @@ -1218,6 +1218,7 @@ static int ceph_dir_fsync(struct file *file, loff_t 
  start, loff_t end,
  struct ceph_mds_request *req;
  u64 last_tid;
  int ret = 0;
  +   unsigned long time_left;
 
  dout(dir_fsync %p\n, inode);
  ret = filemap_write_and_wait_range(inode-i_mapping, start, end);
  @@ -1240,11 +1241,11 @@ static int ceph_dir_fsync(struct file *file, 
  loff_t start, loff_t end,
  dout(dir_fsync %p wait on tid %llu (until %llu)\n,
   inode, req-r_tid, last_tid);
  if (req-r_timeout) {
  -   ret = wait_for_completion_timeout(
  +   time_left = wait_for_completion_timeout(
  req-r_safe_completion, req-r_timeout);
  -   if (ret  0)
  +   if (time_left  0)
  ret = 0;
  -   else if (ret == 0)
  +   else if (!time_left)
  ret = -EIO;  /* timed out */
  } else {
  wait_for_completion(req-r_safe_completion);

  There is a lot of similar code in the kernel. I don't think this code
  causes problems in reality.

 true - there are 38 (of the initial 81 files) left for which no patch has been
  submitted yet - it's a cleanup in progress.

  Type correctness, I do believe, is an issue, and code readability as well,
  so both fixing the type and the name is relevant.

 As Wolfram Sang w...@the-dreams.de put it:
 snip
  'ret' being an int is kind of an idiom, so I'd rather see the variable
  renamed, too, like the other patches do.
 snip
 [http://lkml.iu.edu/hypermail/linux/kernel/1502.1/00084.html]

  Regarding causing problems - it is hard to say - a type mismatch
  may go without problems for a long time and then pop up in strange
  corner cases. But you are right that it is not fixing any currently
  known incorrect behavior.

 The motivation for cleaning this up is also to make static code checkers
  happy, which eases scanning for incorrect API usage and general bug-hunting.

Ok, added to our testing branch

Thanks
Yan, Zheng


 thx!
 hofrat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: workloads/rbd_fsx_cache_writethrough.yaml hangs on dumpling

2015-02-02 Thread Yan, Zheng
On Mon, Feb 2, 2015 at 9:18 PM, Loic Dachary l...@dachary.org wrote:
 Hi,

 http://pulpito.ceph.com/loic-2015-01-29_15:39:38-rbd-dumpling-backports---basic-multi/730029/
  hangs on dumpling and got killed after two days.

 Although workloads/rbd_fsx_cache_writethrough.yaml  was running with the 
 thrasher, it does not seem to be related to 
 http://tracker.ceph.com/issues/10513 : it was running on a mira070 (16GB RAM) 
 and plana78 (8GB RAM).

 I don't see anything in 
 https://github.com/ceph/ceph-qa-suite/blob/master/tasks/rbd_fsx.py that would 
 suggest it needs backporting a fix to the ceph-qa-suite branch of dumpling.

 I browsed the commits related to RBD that are in the dumpling-backports 
 branch ( see here for the list I used http://tracker.ceph.com/issues/10560 ). 
 I don't see anything in the issues that need backporting to dumpling either.

 Another run of this specific test has been scheduled to check if it always 
 happens at 
 http://pulpito.ceph.com/loic-2015-02-02_13:58:32-rbd:thrash:workloads:rbd_fsx_cache_writethrough.yaml-dumpling-backports---basic-multi/

 Does that ring a bell by any chance ?


maybe it's the same as http://tracker.ceph.com/issues/10498


 --
 Loïc Dachary, Artisan Logiciel Libre

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] several ceph dentry leaks

2015-02-01 Thread Yan, Zheng
On Mon, Feb 2, 2015 at 8:31 AM, Al Viro v...@zeniv.linux.org.uk wrote:
 What do you expect to happen when if () is taken in
 int ceph_handle_notrace_create(struct inode *dir, struct dentry *dentry)
 {
 struct dentry *result = ceph_lookup(dir, dentry, 0);

  if (result && !IS_ERR(result)) {

 If result is non-NULL, it means that we have just acquired a new reference
 to preexisting dentry (in ceph_finish_lookup()); where do you expect that
 reference to be dropped?

 Another thing: in ceph_readdir_prepopulate()
  if (!dn->d_inode) {
 dn = splice_dentry(dn, in, NULL);
 if (IS_ERR(dn)) {
 err = PTR_ERR(dn);
 dn = NULL;
 goto next_item;
 }
 }
 you leak dn if that IS_ERR() ever gets hit - d_splice_alias(d, i) does *not*
 drop reference to d in any cases, so splice_dentry() leaves the sum total
 of all dentry refcounts unchanged.  And in case when return value is
 ERR_PTR(...), this assignment results in a leak.  That one is trival to
 fix, but ceph_handle_notrace_create() looks very confusing - if nothing else,
 we should _never_ create multiple dentries pointing to directory inode, so
 d_instantiate() in there isn't mitigating anything - it's actively breaking
 things as far as the rest of the kernel is concerned...  What are you
 trying to do there?

ceph_handle_notrace_create() is for the case where the Ceph metadata server
restarted and the client re-sent a request. The Ceph MDS found that the request
had already completed, so it sent a traceless reply to the client (a traceless
reply contains no information about the dentry or the newly created inode). It's
hard to handle this case elegantly, because the MDS may have done other
operations which moved the newly created file/directory somewhere else.

For multiple dentries pointing to a directory inode, maybe we can return an error
(-ENOENT or -ESTALE).

Regards
Yan, Zheng


 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC] several ceph dentry leaks

2015-02-01 Thread Yan, Zheng
On Mon, Feb 2, 2015 at 12:41 PM, Al Viro v...@zeniv.linux.org.uk wrote:

 On Mon, Feb 02, 2015 at 11:23:58AM +0800, Yan, Zheng wrote:
   we should _never_ create multiple dentries pointing to directory inode, so
   d_instantiate() in there isn't mitigating anything - it's actively 
   breaking
   things as far as the rest of the kernel is concerned...  What are you
   trying to do there?
 
  ceph_handle_notrace_create() is for case: Ceph Metadata server restarted,
  client re-sent a request. Ceph MDS found the request has already completed,
  so it sent a traceless reply to client. (traceless reply contains no 
  information
  for the dentry and the newly created inode). It's hard to handle this case
  elegantly, because MDS may have done other operation, which moved the
  newly created file/directory to other place.
 
  For multiple dentries pointing to directory inode, maybe we can return an 
  error
  (-ENOENT or -ESTALE).

 IDGI...  Suppose we got non-NULL from ceph_lookup() there.  So we'd sent 
 either
  CEPH_MDS_OP_LOOKUP or CEPH_MDS_OP_LOOKUPSNAP and got ->r_dentry changed
 (presumably in ceph_fill_trace(), after it got the sucker from 
 splice_dentry(),
 i.e. ultimately d_splice_alias()).  Right?

In theory, yes. But I think it never happens in reality. Because the
parent directory of the newly created
file/directory is locked, the client has no other way to look up the newly
created file/directory.


Now, we have acquired a reference
 to it (in ceph_finish_lookup()).  So far, so good; for one thing, we 
 definitely
 do *NOT* want to forget about that reference.  For another, we've got
 a good and valid dentry; so why are we playing with the originally passed
 one?  Just unhash the original and be done with that...


I think the reason is that the VFS still uses the original dentry (for
example, after calling the
filesystem's mkdir callback, vfs_mkdir() calls fsnotify_mkdir() and
passes the original
dentry to it).


 Looking at the code downstream from the calls of ceph_handle_notrace_create(),
  ceph_mkdir() looks very dubious - we do ceph_init_inode_acls(dentry->d_inode,
  acls); there, after having set ->d_inode to that of a preexisting alias.
 Is that really correct in the case when such an alias used to exist?

you are right, it's incorrect.

How about making ceph_handle_notrace_create() return an error when it
gets a non-NULL dentry
from ceph_lookup()? This method is not perfect but should work in most cases.
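
A rough sketch of that suggestion (purely illustrative, not a tested patch;
the error code and handling details would need review):

int ceph_handle_notrace_create(struct inode *dir, struct dentry *dentry)
{
	struct dentry *result = ceph_lookup(dir, dentry, 0);

	if (IS_ERR(result))
		return PTR_ERR(result);
	if (result) {
		/* drop the reference ceph_lookup() returned and give up
		 * instead of instantiating a second alias */
		dput(result);
		return -ESTALE;
	}
	return 0;
}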

Regards
Yan, Zheng
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: MDS has inconsistent performance

2015-01-16 Thread Yan, Zheng
On Fri, Jan 16, 2015 at 2:37 PM, Gregory Farnum g...@gregs42.com wrote:
 On Thu, Jan 15, 2015 at 2:44 PM, Michael Sevilla mikesevil...@gmail.com 
 wrote:
 Let me know if this works and/or you need anything else:

 https://www.dropbox.com/s/fq47w6jebnyluu0/lookup-logs.tar.gz?dl=0

 Beware - the clients were on debug=10. Also, I tried this with the
 kernel client and it is more consistent; it does the 2 lookups per
 create on 1 client every single time.

 Mmmm, there are no mds logs of note here. :(

 I did look enough to see that:
 1) The MDS is for some reason revoking caps on the file create
 prompting the switch to double-lookups, which it was not before. The
 client doesn't really have any visibility into why that would be the
 case; the best guess I can come up with is that maybe the MDS split up
 the directory into multiple frags at this point -- do you have that
 enabled?
 2) The only way we set the I_COMPLETE flag is when we create an empty
 directory, or when we do a complete listdir on one. That makes it
 pretty difficult to get the flag back (and so do the optimal create
 path) once you lose it. :( I'd love a better way to do so, but we'll
 have to look at what's involved in a bit of depth.

 I'm not sure why the kernel client is so much more cautious, but I
 think there were a number of troubles with the directory listing
 orders and things which were harder to solve there - I don't remember
 if we introduced the I_DIR_ORDERED flag in it or not. Zheng can talk
 more about that. What kernel client version are you using?

 And for a vanity data point, what kind of hardware is your MDS running on? :)

For kernels before 3.18, the I_COMPLETE flag gets cleared once the directory is
modified. I_DIR_ORDERED was introduced in the 3.18 kernel. I just tried the 3.18
kernel; unfortunately
there is still a bug that prevents new directories from having the I_COMPLETE flag.

Regards
Yan, Zheng


 -Greg
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH] ceph: acl: Remove unused function

2015-01-05 Thread Yan, Zheng
On Sun, Jan 4, 2015 at 7:44 AM, Rickard Strandqvist
rickard_strandqv...@spectrumdigital.se wrote:
 Remove the function ceph_get_cached_acl() that is not used anywhere.

 This was partially found by using a static code analysis program called 
 cppcheck.

 Signed-off-by: Rickard Strandqvist rickard_strandqv...@spectrumdigital.se
 ---
  fs/ceph/acl.c |   14 --
  1 file changed, 14 deletions(-)

 diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
 index 5bd853b..64fa248 100644
 --- a/fs/ceph/acl.c
 +++ b/fs/ceph/acl.c
 @@ -40,20 +40,6 @@ static inline void ceph_set_cached_acl(struct inode *inode,
  spin_unlock(&ci->i_ceph_lock);
  }

 -static inline struct posix_acl *ceph_get_cached_acl(struct inode *inode,
 -   int type)
 -{
 -   struct ceph_inode_info *ci = ceph_inode(inode);
 -   struct posix_acl *acl = ACL_NOT_CACHED;
 -
  -   spin_lock(&ci->i_ceph_lock);
  -   if (__ceph_caps_issued_mask(ci, CEPH_CAP_XATTR_SHARED, 0))
  -   acl = get_cached_acl(inode, type);
  -   spin_unlock(&ci->i_ceph_lock);
 -
 -   return acl;
 -}
 -
  struct posix_acl *ceph_get_acl(struct inode *inode, int type)
  {
 int size;
 --
 1.7.10.4


added to testing branch.

Thanks


 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ceph: fix setting empty extended attribute

2014-12-17 Thread Yan, Zheng
Make sure 'value' is not NULL; otherwise __ceph_setxattr() will remove
the extended attribute.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/xattr.c | 7 +--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index 678b0d2..5a492ca 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -854,7 +854,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const 
char *name,
struct ceph_pagelist *pagelist = NULL;
int err;
 
-   if (value) {
+   if (size > 0) {
/* copy value into pagelist */
pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
if (!pagelist)
@@ -864,7 +864,7 @@ static int ceph_sync_setxattr(struct dentry *dentry, const 
char *name,
err = ceph_pagelist_append(pagelist, value, size);
if (err)
goto out;
-   } else {
+   } else if (!value) {
flags |= CEPH_XATTR_REMOVE;
}
 
@@ -1001,6 +1001,9 @@ int ceph_setxattr(struct dentry *dentry, const char *name,
if (!strncmp(name, XATTR_SYSTEM_PREFIX, XATTR_SYSTEM_PREFIX_LEN))
return generic_setxattr(dentry, name, value, size, flags);
 
+   if (size == 0)
+   value = "";  /* empty EA, do not remove */
+
return __ceph_setxattr(dentry, name, value, size, flags);
 }
 
-- 
1.9.3
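
To see from userspace what this fixes (a hypothetical test against a CephFS
mount; the path is a placeholder): setting a zero-length value should leave
an empty attribute in place rather than removing it.

#include <stdio.h>
#include <sys/xattr.h>

int main(void)
{
	/* size == 0: should create/keep an empty EA, not remove it */
	if (setxattr("/mnt/cephfs/testfile", "user.empty", "", 0, 0) != 0)
		perror("setxattr");
	return 0;
}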

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [GIT PULL] Ceph updates for 3.19-rc1

2014-12-17 Thread Yan, Zheng
My commit "ceph: add inline data to pagecache" incidentally adds
fs/ceph/super.h.rej.  Please remove it from your branch.  Sorry for
the inconvenience.
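
One way to drop it (assuming a simple follow-up commit on the branch is
acceptable):

git rm fs/ceph/super.h.rej
git commit -m "ceph: remove stray super.h.rej"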

Yan, Zheng

On Thu, Dec 18, 2014 at 7:27 AM, Sage Weil sw...@redhat.com wrote:
 Hi Linus,

 Please pull the following Ceph updates from

   git://git.kernel.org/pub/scm/linux/kernel/git/sage/ceph-client.git for-linus

 The big item here is support for inline data for CephFS and for message
 signatures from Zheng.  There are also several bug fixes, including
 interrupted flock request handling, 0-length xattrs, mksnap, cached
 readdir results, and a message version compat field.  Finally there are
 several cleanups from Ilya, Dan, and Markus.

 Note that there is another series coming soon that fixes some bugs in the
 RBD 'lingering' requests, but it isn't quite ready yet.

 Thanks!
 sage


 
 Dan Carpenter (1):
   ceph: do_sync is never initialized

 Ilya Dryomov (4):
   libceph: nuke ceph_kvfree()
   ceph: remove unused stringification macros
   rbd: don't treat CEPH_OSD_OP_DELETE as extent op
   libceph: fixup includes in pagelist.h

 John Spray (2):
   libceph: update ceph_msg_header structure
   ceph: message versioning fixes

 SF Markus Elfring (1):
   ceph, rbd: delete unnecessary checks before two function calls

 Yan, Zheng (19):
   ceph: fix file lock interruption
   ceph: introduce a new inode flag indicating if cached dentries are 
 ordered
   libceph: store session key in cephx authorizer
   libceph: message signature support
   ceph: introduce global empty snap context
   libceph: require cephx message signature by default
   libceph: add SETXATTR/CMPXATTR osd operations support
   libceph: add CREATE osd operation support
   libceph: specify position of extent operation
   ceph: parse inline data in MClientReply and MClientCaps
   ceph: add inline data to pagecache
   ceph: use getattr request to fetch inline data
   ceph: fetch inline data when getting Fcr cap refs
   ceph: sync read inline data
   ceph: convert inline data to normal data before data write
   ceph: flush inline version
   ceph: support inline data feature
   ceph: fix mksnap crash
   ceph: fix setting empty extended attribute

  drivers/block/rbd.c|  11 +-
  fs/ceph/addr.c | 273 
 +++--
  fs/ceph/caps.c | 132 ++
  fs/ceph/dir.c  |  27 ++--
  fs/ceph/file.c |  97 +++--
  fs/ceph/inode.c|  59 ++--
  fs/ceph/locks.c|  64 +++--
  fs/ceph/mds_client.c   |  41 +-
  fs/ceph/mds_client.h   |  10 ++
  fs/ceph/snap.c |  37 -
  fs/ceph/super.c|  16 ++-
  fs/ceph/super.h|  55 ++--
  fs/ceph/super.h.rej|  10 ++
  fs/ceph/xattr.c|   7 +-
  include/linux/ceph/auth.h  |  26 
  include/linux/ceph/buffer.h|   3 +-
  include/linux/ceph/ceph_features.h |   1 +
  include/linux/ceph/ceph_fs.h   |  10 +-
  include/linux/ceph/libceph.h   |   2 +-
  include/linux/ceph/messenger.h |   9 +-
  include/linux/ceph/msgr.h  |  11 +-
  include/linux/ceph/osd_client.h|  13 +-
  include/linux/ceph/pagelist.h  |   4 +-
  net/ceph/auth_x.c  |  76 ++-
  net/ceph/auth_x.h  |   1 +
  net/ceph/buffer.c  |   4 +-
  net/ceph/ceph_common.c |  21 +--
  net/ceph/messenger.c   |  34 -
  net/ceph/osd_client.c  | 118 
  29 files changed, 992 insertions(+), 180 deletions(-)
  create mode 100644 fs/ceph/super.h.rej
 --
 To unsubscribe from this list: send the line unsubscribe linux-kernel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html
 Please read the FAQ at  http://www.tux.org/lkml/
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH] ceph: fix mksnap crash

2014-12-10 Thread Yan, Zheng
The mksnap reply only contains 'target'; it does not contain 'dentry'. So
it's wrong to use req->r_reply_info.head->is_dentry to detect a traceless
reply.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/dir.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index 6526199..fcfd0ab 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -812,7 +812,9 @@ static int ceph_mkdir(struct inode *dir, struct dentry 
*dentry, umode_t mode)
acls.pagelist = NULL;
}
err = ceph_mdsc_do_request(mdsc, dir, req);
-   if (!err && !req->r_reply_info.head->is_dentry)
+   if (!err &&
+   !req->r_reply_info.head->is_target &&
+   !req->r_reply_info.head->is_dentry)
err = ceph_handle_notrace_create(dir, dentry);
ceph_mdsc_put_request(req);
 out:
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 01/12] libceph: add SETXATTR/CMPXATTR osd operations support

2014-11-16 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 include/linux/ceph/osd_client.h | 10 ++
 net/ceph/osd_client.c   | 38 ++
 2 files changed, 48 insertions(+)

diff --git a/include/linux/ceph/osd_client.h b/include/linux/ceph/osd_client.h
index 03aeb27..1ad217b 100644
--- a/include/linux/ceph/osd_client.h
+++ b/include/linux/ceph/osd_client.h
@@ -87,6 +87,13 @@ struct ceph_osd_req_op {
struct ceph_osd_data osd_data;
} extent;
struct {
+   __le32 name_len;
+   __le32 value_len;
+   __u8 cmp_op;   /* CEPH_OSD_CMPXATTR_OP_* */
+   __u8 cmp_mode; /* CEPH_OSD_CMPXATTR_MODE_* */
+   struct ceph_osd_data osd_data;
+   } xattr;
+   struct {
const char *class_name;
const char *method_name;
struct ceph_osd_data request_info;
@@ -295,6 +302,9 @@ extern void osd_req_op_cls_response_data_pages(struct 
ceph_osd_request *,
 extern void osd_req_op_cls_init(struct ceph_osd_request *osd_req,
unsigned int which, u16 opcode,
const char *class, const char *method);
+extern void osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned 
int which,
+ u16 opcode, const char *name, const void 
*value,
+ size_t size, u8 cmp_op, u8 cmp_mode);
 extern void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
unsigned int which, u16 opcode,
u64 cookie, u64 version, int flag);
diff --git a/net/ceph/osd_client.c b/net/ceph/osd_client.c
index 1f6c405..f8fb376 100644
--- a/net/ceph/osd_client.c
+++ b/net/ceph/osd_client.c
@@ -545,6 +545,34 @@ void osd_req_op_cls_init(struct ceph_osd_request *osd_req, 
unsigned int which,
 }
 EXPORT_SYMBOL(osd_req_op_cls_init);
 
+void osd_req_op_xattr_init(struct ceph_osd_request *osd_req, unsigned int 
which,
+ u16 opcode, const char *name, const void *value,
+ size_t size, u8 cmp_op, u8 cmp_mode)
+{
+   struct ceph_osd_req_op *op = _osd_req_op_init(osd_req, which, opcode);
+   struct ceph_pagelist *pagelist;
+   size_t payload_len;
+
+   pagelist = kmalloc(sizeof (*pagelist), GFP_NOFS);
+   BUG_ON(!pagelist);
+   ceph_pagelist_init(pagelist);
+
+   payload_len = strlen(name);
+   op-xattr.name_len = payload_len;
+   ceph_pagelist_append(pagelist, name, payload_len);
+
+   op-xattr.value_len = size;
+   ceph_pagelist_append(pagelist, value, size);
+   payload_len += size;
+
+   op-xattr.cmp_op = cmp_op;
+   op-xattr.cmp_mode = cmp_mode;
+
+   ceph_osd_data_pagelist_init(op-xattr.osd_data, pagelist);
+   op-payload_len = payload_len;
+}
+EXPORT_SYMBOL(osd_req_op_xattr_init);
+
 void osd_req_op_watch_init(struct ceph_osd_request *osd_req,
unsigned int which, u16 opcode,
u64 cookie, u64 version, int flag)
@@ -676,6 +704,16 @@ static u64 osd_req_encode_op(struct ceph_osd_request *req,
dst-alloc_hint.expected_write_size =
cpu_to_le64(src-alloc_hint.expected_write_size);
break;
+   case CEPH_OSD_OP_SETXATTR:
+   case CEPH_OSD_OP_CMPXATTR:
+   dst-xattr.name_len = cpu_to_le32(src-xattr.name_len);
+   dst-xattr.value_len = cpu_to_le32(src-xattr.value_len);
+   dst-xattr.cmp_op = src-xattr.cmp_op;
+   dst-xattr.cmp_mode = src-xattr.cmp_mode;
+   osd_data = src-xattr.osd_data;
+   ceph_osdc_msg_data_add(req-r_request, osd_data);
+   request_data_len = osd_data-pagelist-length;
+   break;
default:
pr_err(unsupported osd opcode %s\n,
ceph_osd_op_name(src-op));
-- 
1.9.3
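
A usage sketch based on the signature above (illustrative only; allocating
and submitting the OSD request is omitted):

/* attach a SETXATTR op to slot 0 of an already-prepared OSD request */
static void example_setxattr_op(struct ceph_osd_request *req,
				const char *name,
				const void *value, size_t len)
{
	osd_req_op_xattr_init(req, 0, CEPH_OSD_OP_SETXATTR,
			      name, value, len, 0, 0);
}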

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/12] ceph: add inline data to pagecache

2014-11-16 Thread Yan, Zheng
Request replies and cap messages can contain inline data. Add inline data
to the page cache if there is an Fc cap.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/addr.c   | 42 ++
 fs/ceph/caps.c   | 11 +++
 fs/ceph/inode.c  | 16 
 fs/ceph/inode.c.rej  | 21 +
 fs/ceph/mds_client.c.rej | 10 ++
 fs/ceph/super.h  |  5 -
 fs/ceph/super.h.rej  | 10 ++
 include/linux/ceph/ceph_fs.h |  1 +
 8 files changed, 115 insertions(+), 1 deletion(-)
 create mode 100644 fs/ceph/inode.c.rej
 create mode 100644 fs/ceph/mds_client.c.rej
 create mode 100644 fs/ceph/super.h.rej

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index f2c7aa8..7320b11 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1318,6 +1318,48 @@ out:
return ret;
 }
 
+void ceph_fill_inline_data(struct inode *inode, struct page *locked_page,
+  char *data, size_t len)
+{
+   struct address_space *mapping = inode-i_mapping;
+   struct page *page;
+
+   if (locked_page) {
+   page = locked_page;
+   } else {
+   if (i_size_read(inode) == 0)
+   return;
+   page = find_or_create_page(mapping, 0, GFP_NOFS);
+   if (!page)
+   return;
+   if (PageUptodate(page)) {
+   unlock_page(page);
+   page_cache_release(page);
+   return;
+   }
+   }
+
+   dout(fill_inline_data %p %llx.%llx len %lu locked_page %p\n,
+inode, ceph_vinop(inode), len, locked_page);
+
+   if (len  0) {
+   void *kaddr = kmap_atomic(page);
+   memcpy(kaddr, data, len);
+   kunmap_atomic(kaddr);
+   }
+
+   if (page != locked_page) {
+   if (len  PAGE_CACHE_SIZE)
+   zero_user_segment(page, len, PAGE_CACHE_SIZE);
+   else
+   flush_dcache_page(page);
+
+   SetPageUptodate(page);
+   unlock_page(page);
+   page_cache_release(page);
+   }
+}
+
 static struct vm_operations_struct ceph_vmops = {
.fault  = ceph_filemap_fault,
.page_mkwrite   = ceph_page_mkwrite,
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index b88ae60..6372eb9 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2405,6 +2405,7 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc,
bool queue_invalidate = false;
bool queue_revalidate = false;
bool deleted_inode = false;
+   bool fill_inline = false;
 
dout(handle_cap_grant inode %p cap %p mds%d seq %d %s\n,
 inode, cap, mds, seq, ceph_cap_string(newcaps));
@@ -2578,6 +2579,13 @@ static void handle_cap_grant(struct ceph_mds_client 
*mdsc,
}
BUG_ON(cap-issued  ~cap-implemented);
 
+   if (inline_version  0  inline_version = ci-i_inline_version) {
+   ci-i_inline_version = inline_version;
+   if (ci-i_inline_version != CEPH_INLINE_NONE 
+   (newcaps  (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)))
+   fill_inline = true;
+   }
+
spin_unlock(ci-i_ceph_lock);
 
if (le32_to_cpu(grant-op) == CEPH_CAP_OP_IMPORT) {
@@ -2591,6 +2599,9 @@ static void handle_cap_grant(struct ceph_mds_client *mdsc,
wake = true;
}
 
+   if (fill_inline)
+   ceph_fill_inline_data(inode, NULL, inline_data, inline_len);
+
if (queue_trunc) {
ceph_queue_vmtruncate(inode);
ceph_queue_revalidate(inode);
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 72607c1..feea6a8 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -387,6 +387,7 @@ struct inode *ceph_alloc_inode(struct super_block *sb)
spin_lock_init(ci-i_ceph_lock);
 
ci-i_version = 0;
+   ci-i_inline_version = 0;
ci-i_time_warp_seq = 0;
ci-i_ceph_flags = 0;
ci-i_ordered_count = 0;
@@ -676,6 +677,7 @@ static int fill_inode(struct inode *inode,
bool wake = false;
bool queue_trunc = false;
bool new_version = false;
+   bool fill_inline = false;
 
dout(fill_inode %p ino %llx.%llx v %llu had %llu\n,
 inode, ceph_vinop(inode), le64_to_cpu(info-version),
@@ -875,8 +877,22 @@ static int fill_inode(struct inode *inode,
   ceph_vinop(inode));
__ceph_get_fmode(ci, cap_fmode);
}
+
+   if (iinfo-inline_version  0 
+   iinfo-inline_version = ci-i_inline_version) {
+   int cache_caps = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
+   ci-i_inline_version = iinfo-inline_version;
+   if (ci-i_inline_version != CEPH_INLINE_NONE 
+   (le32_to_cpu(info-cap.caps

[PATCH 06/12] ceph: use getattr request to fetch inline data

2014-11-16 Thread Yan, Zheng
Add a new parameter 'locked_page' to ceph_do_getattr(). If it is provided,
inline data in the getattr reply will be copied to that page.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/inode.c  | 34 +-
 fs/ceph/mds_client.h |  1 +
 fs/ceph/super.h  |  7 ++-
 include/linux/ceph/ceph_fs.h |  2 ++
 4 files changed, 34 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index feea6a8..0bdabd1 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -659,7 +659,7 @@ void ceph_fill_file_time(struct inode *inode, int issued,
  * Populate an inode based on info from mds.  May be called on new or
  * existing inodes.
  */
-static int fill_inode(struct inode *inode,
+static int fill_inode(struct inode *inode, struct page *locked_page,
  struct ceph_mds_reply_info_in *iinfo,
  struct ceph_mds_reply_dirfrag *dirinfo,
  struct ceph_mds_session *session,
@@ -883,14 +883,15 @@ static int fill_inode(struct inode *inode,
int cache_caps = CEPH_CAP_FILE_CACHE | CEPH_CAP_FILE_LAZYIO;
ci-i_inline_version = iinfo-inline_version;
if (ci-i_inline_version != CEPH_INLINE_NONE 
-   (le32_to_cpu(info-cap.caps)  cache_caps))
+   (locked_page ||
+(le32_to_cpu(info-cap.caps)  cache_caps)))
fill_inline = true;
}
 
spin_unlock(ci-i_ceph_lock);
 
if (fill_inline)
-   ceph_fill_inline_data(inode, NULL,
+   ceph_fill_inline_data(inode, locked_page,
  iinfo-inline_data, iinfo-inline_len);
 
if (wake)
@@ -1080,7 +1081,8 @@ int ceph_fill_trace(struct super_block *sb, struct 
ceph_mds_request *req,
struct inode *dir = req-r_locked_dir;
 
if (dir) {
-   err = fill_inode(dir, rinfo-diri, rinfo-dirfrag,
+   err = fill_inode(dir, NULL,
+rinfo-diri, rinfo-dirfrag,
 session, req-r_request_started, -1,
 req-r_caps_reservation);
if (err  0)
@@ -1150,7 +1152,7 @@ retry_lookup:
}
req-r_target_inode = in;
 
-   err = fill_inode(in, rinfo-targeti, NULL,
+   err = fill_inode(in, req-r_locked_page, rinfo-targeti, NULL,
session, req-r_request_started,
(!req-r_aborted  rinfo-head-result == 0) ?
req-r_fmode : -1,
@@ -1321,7 +1323,7 @@ static int readdir_prepopulate_inodes_only(struct 
ceph_mds_request *req,
dout(new_inode badness got %d\n, err);
continue;
}
-   rc = fill_inode(in, rinfo-dir_in[i], NULL, session,
+   rc = fill_inode(in, NULL, rinfo-dir_in[i], NULL, session,
req-r_request_started, -1,
req-r_caps_reservation);
if (rc  0) {
@@ -1437,7 +1439,7 @@ retry_lookup:
}
}
 
-   if (fill_inode(in, rinfo-dir_in[i], NULL, session,
+   if (fill_inode(in, NULL, rinfo-dir_in[i], NULL, session,
   req-r_request_started, -1,
   req-r_caps_reservation)  0) {
pr_err(fill_inode badness on %p\n, in);
@@ -1920,7 +1922,8 @@ out_put:
  * Verify that we have a lease on the given mask.  If not,
  * do a getattr against an mds.
  */
-int ceph_do_getattr(struct inode *inode, int mask, bool force)
+int __ceph_do_getattr(struct inode *inode, struct page *locked_page,
+ int mask, bool force)
 {
struct ceph_fs_client *fsc = ceph_sb_to_client(inode-i_sb);
struct ceph_mds_client *mdsc = fsc-mdsc;
@@ -1932,7 +1935,8 @@ int ceph_do_getattr(struct inode *inode, int mask, bool 
force)
return 0;
}
 
-   dout(do_getattr inode %p mask %s mode 0%o\n, inode, 
ceph_cap_string(mask), inode-i_mode);
+   dout(do_getattr inode %p mask %s mode 0%o\n,
+inode, ceph_cap_string(mask), inode-i_mode);
if (!force  ceph_caps_issued_mask(ceph_inode(inode), mask, 1))
return 0;
 
@@ -1943,7 +1947,19 @@ int ceph_do_getattr(struct inode *inode, int mask, bool 
force)
ihold(inode);
req-r_num_caps = 1;
req-r_args.getattr.mask = cpu_to_le32(mask);
+   req-r_locked_page = locked_page;
err = ceph_mdsc_do_request(mdsc, NULL, req);
+   if (locked_page  err == 0) {
+   u64 inline_version = req-r_reply_info.targeti.inline_version;
+   if (inline_version == 0) {
+   /* the reply is supposed to contain inline data

[PATCH 0/12] ceph: read/modify inline data support

2014-11-16 Thread Yan, Zheng
Hi,

This patch series allows the CephFS kernel client to read/modify inline data.
When modifying an inode with inline data, the inline data is always converted
to a normal object first. This patch series does not support storing new data
as inline.

Regards
Yan, Zheng
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 04/12] ceph: parse inline data in MClientReply and MClientCaps

2014-11-16 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/caps.c   | 34 +++---
 fs/ceph/mds_client.c | 10 ++
 fs/ceph/mds_client.h |  3 +++
 3 files changed, 36 insertions(+), 11 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index eb1bf1f..b88ae60 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2383,6 +2383,8 @@ static void invalidate_aliases(struct inode *inode)
 static void handle_cap_grant(struct ceph_mds_client *mdsc,
 struct inode *inode, struct ceph_mds_caps *grant,
 void *snaptrace, int snaptrace_len,
+u64 inline_version,
+void *inline_data, int inline_len,
 struct ceph_buffer *xattr_buf,
 struct ceph_mds_session *session,
 struct ceph_cap *cap, int issued)
@@ -2996,11 +2998,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
u64 cap_id;
u64 size, max_size;
u64 tid;
+   u64 inline_version = 0;
+   void *inline_data = NULL;
+   u32  inline_len = 0;
void *snaptrace;
size_t snaptrace_len;
-   void *flock;
-   void *end;
-   u32 flock_len;
+   void *p, *end;
 
dout(handle_caps from mds%d\n, mds);
 
@@ -3021,30 +3024,37 @@ void ceph_handle_caps(struct ceph_mds_session *session,
 
snaptrace = h + 1;
snaptrace_len = le32_to_cpu(h-snap_trace_len);
+   p = snaptrace + snaptrace_len;
 
if (le16_to_cpu(msg-hdr.version) = 2) {
-   void *p = snaptrace + snaptrace_len;
+   u32 flock_len;
ceph_decode_32_safe(p, end, flock_len, bad);
if (p + flock_len  end)
goto bad;
-   flock = p;
-   } else {
-   flock = NULL;
-   flock_len = 0;
+   p += flock_len;
}
 
if (le16_to_cpu(msg-hdr.version) = 3) {
if (op == CEPH_CAP_OP_IMPORT) {
-   void *p = flock + flock_len;
if (p + sizeof(*peer)  end)
goto bad;
peer = p;
+   p += sizeof(*peer);
} else if (op == CEPH_CAP_OP_EXPORT) {
/* recorded in unused fields */
peer = (void *)h-size;
}
}
 
+   if (le16_to_cpu(msg-hdr.version) = 4) {
+   ceph_decode_64_safe(p, end, inline_version, bad);
+   ceph_decode_32_safe(p, end, inline_len, bad);
+   if (p + inline_len  end)
+   goto bad;
+   inline_data = p;
+   p += inline_len;
+   }
+
/* lookup ino */
inode = ceph_find_inode(sb, vino);
ci = ceph_inode(inode);
@@ -3085,6 +3095,7 @@ void ceph_handle_caps(struct ceph_mds_session *session,
handle_cap_import(mdsc, inode, h, peer, session,
  cap, issued);
handle_cap_grant(mdsc, inode, h,  snaptrace, snaptrace_len,
+inline_version, inline_data, inline_len,
 msg-middle, session, cap, issued);
goto done_unlocked;
}
@@ -3105,8 +3116,9 @@ void ceph_handle_caps(struct ceph_mds_session *session,
case CEPH_CAP_OP_GRANT:
__ceph_caps_issued(ci, issued);
issued |= __ceph_caps_dirty(ci);
-   handle_cap_grant(mdsc, inode, h, NULL, 0, msg-middle,
-session, cap, issued);
+   handle_cap_grant(mdsc, inode, h, NULL, 0,
+inline_version, inline_data, inline_len,
+msg-middle, session, cap, issued);
goto done_unlocked;
 
case CEPH_CAP_OP_FLUSH_ACK:
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 387a489..d2171f4 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -89,6 +89,16 @@ static int parse_reply_info_in(void **p, void *end,
ceph_decode_need(p, end, info-xattr_len, bad);
info-xattr_data = *p;
*p += info-xattr_len;
+
+   if (features  CEPH_FEATURE_MDS_INLINE_DATA) {
+   ceph_decode_64_safe(p, end, info-inline_version, bad);
+   ceph_decode_32_safe(p, end, info-inline_len, bad);
+   ceph_decode_need(p, end, info-inline_len, bad);
+   info-inline_data = *p;
+   *p += info-inline_len;
+   } else
+   info-inline_version = CEPH_INLINE_NONE;
+
return 0;
 bad:
return err;
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index 230bda7..01f5e4c 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -41,6 +41,9 @@ struct ceph_mds_reply_info_in {
char *symlink;
u32 xattr_len

[PATCH 07/12] ceph: fetch inline data while getting Fcr cap refs

2014-11-16 Thread Yan, Zheng
We can't use getattr to fetch inline data after getting Fcr caps,
because it can cause deadlock. The solution is to try bringing the inline
data into the page cache while not holding any cap, and hope the inline
data page is still there after getting the Fcr caps. If the page
is still there, pin it in the page cache for later IO.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/addr.c |  9 +++--
 fs/ceph/caps.c | 55 ++-
 fs/ceph/file.c | 12 +---
 3 files changed, 62 insertions(+), 14 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 7320b11..fdbdbbe 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1207,6 +1207,7 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
struct inode *inode = file_inode(vma-vm_file);
struct ceph_inode_info *ci = ceph_inode(inode);
struct ceph_file_info *fi = vma-vm_file-private_data;
+   struct page *pinned_page = NULL;
loff_t off = vmf-pgoff  PAGE_CACHE_SHIFT;
int want, got, ret;
 
@@ -1218,7 +1219,8 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
want = CEPH_CAP_FILE_CACHE;
while (1) {
got = 0;
-   ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, want, got, -1);
+   ret = ceph_get_caps(ci, CEPH_CAP_FILE_RD, want, -1,
+   got, pinned_page);
if (ret == 0)
break;
if (ret != -ERESTARTSYS) {
@@ -1233,6 +1235,8 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
 
dout(filemap_fault %p %llu~%zd dropping cap refs on %s ret %d\n,
 inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got), ret);
+   if (pinned_page)
+   page_cache_release(pinned_page);
ceph_put_cap_refs(ci, got);
 
return ret;
@@ -1266,7 +1270,8 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, 
struct vm_fault *vmf)
want = CEPH_CAP_FILE_BUFFER;
while (1) {
got = 0;
-   ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, got, off + 
len);
+   ret = ceph_get_caps(ci, CEPH_CAP_FILE_WR, want, off + len,
+   got, NULL);
if (ret == 0)
break;
if (ret != -ERESTARTSYS) {
diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6372eb9..3bc0908 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2057,15 +2057,18 @@ static void __take_cap_refs(struct ceph_inode_info *ci, 
int got)
  * requested from the MDS.
  */
 static int try_get_cap_refs(struct ceph_inode_info *ci, int need, int want,
-   int *got, loff_t endoff, int *check_max, int *err)
+   loff_t endoff, int *got, struct page **pinned_page,
+   int *check_max, int *err)
 {
struct inode *inode = ci-vfs_inode;
+   struct page *page = NULL;
int ret = 0;
-   int have, implemented;
+   int have, implemented, _got = 0;
int file_wanted;
 
dout(get_cap_refs %p need %s want %s\n, inode,
 ceph_cap_string(need), ceph_cap_string(want));
+again:
spin_lock(ci-i_ceph_lock);
 
/* make sure file is actually open */
@@ -2120,8 +2123,8 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, 
int need, int want,
 inode, ceph_cap_string(have), ceph_cap_string(not),
 ceph_cap_string(revoking));
if ((revoking  not) == 0) {
-   *got = need | (have  want);
-   __take_cap_refs(ci, *got);
+   _got = need | (have  want);
+   __take_cap_refs(ci, _got);
ret = 1;
}
} else {
@@ -2130,8 +2133,42 @@ static int try_get_cap_refs(struct ceph_inode_info *ci, 
int need, int want,
}
 out:
spin_unlock(ci-i_ceph_lock);
+
+   if ((_got  (CEPH_CAP_FILE_CACHE|CEPH_CAP_FILE_LAZYIO)) 
+   ci-i_inline_version != CEPH_INLINE_NONE 
+   i_size_read(inode)  0) {
+   page = find_get_page(inode-i_mapping, 0);
+   if (page) {
+   if (!PageUptodate(page)) {
+   page_cache_release(page);
+   page = NULL;
+   }
+   }
+   if (page) {
+   *pinned_page = page;
+   } else {
+   int ret1;
+   /*
+* drop cap refs first because getattr while holding
+* caps refs can cause deadlock.
+*/
+   ceph_put_cap_refs(ci, _got);
+   _got = 0;
+   /* getattt will bring inline data to page

[PATCH 12/12] ceph: support inline data feature

2014-11-16 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/super.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/super.c b/fs/ceph/super.c
index 2b5481f..c37f1b9 100644
--- a/fs/ceph/super.c
+++ b/fs/ceph/super.c
@@ -515,7 +515,8 @@ static struct ceph_fs_client *create_fs_client(struct 
ceph_mount_options *fsopt,
struct ceph_fs_client *fsc;
const u64 supported_features =
CEPH_FEATURE_FLOCK |
-   CEPH_FEATURE_DIRLAYOUTHASH;
+   CEPH_FEATURE_DIRLAYOUTHASH |
+   CEPH_FEATURE_MDS_INLINE_DATA;
const u64 required_features = 0;
int page_count;
size_t size;
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 11/12] ceph: flush inline version

2014-11-16 Thread Yan, Zheng
After converting inline data to normal data, the client needs to flush
the new i_inline_version (CEPH_INLINE_NONE) to the MDS. This commit makes
cap messages (sent to the MDS) contain inline_version and inline_data.
The client always converts inline data to normal data before writing data,
so the inline data length part is always zero.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/caps.c  | 24 
 fs/ceph/snap.c  |  2 ++
 fs/ceph/super.h |  1 +
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 3bc0908..77c4068 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -975,10 +975,12 @@ static int send_cap_msg(struct ceph_mds_session *session,
kuid_t uid, kgid_t gid, umode_t mode,
u64 xattr_version,
struct ceph_buffer *xattrs_buf,
-   u64 follows)
+   u64 follows, bool inline_data)
 {
struct ceph_mds_caps *fc;
struct ceph_msg *msg;
+   void *p;
+   size_t extra_len;
 
dout(send_cap_msg %s %llx %llx caps %s wanted %s dirty %s
  seq %u/%u mseq %u follows %lld size %llu/%llu
@@ -988,7 +990,10 @@ static int send_cap_msg(struct ceph_mds_session *session,
 seq, issue_seq, mseq, follows, size, max_size,
 xattr_version, xattrs_buf ? (int)xattrs_buf-vec.iov_len : 0);
 
-   msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc), GFP_NOFS, false);
+   /* flock buffer size + inline version + inline data size */
+   extra_len = 4 + 8 + 4;
+   msg = ceph_msg_new(CEPH_MSG_CLIENT_CAPS, sizeof(*fc) + extra_len,
+  GFP_NOFS, false);
if (!msg)
return -ENOMEM;
 
@@ -1020,6 +1025,14 @@ static int send_cap_msg(struct ceph_mds_session *session,
fc-gid = cpu_to_le32(from_kgid(init_user_ns, gid));
fc-mode = cpu_to_le32(mode);
 
+   p = fc + 1;
+   /* flock buffer size */
+   ceph_encode_32(p, 0);
+   /* inline version */
+   ceph_encode_64(p, inline_data ? 0 : CEPH_INLINE_NONE);
+   /* inline data size */
+   ceph_encode_32(p, 0);
+
fc-xattr_version = cpu_to_le64(xattr_version);
if (xattrs_buf) {
msg-middle = ceph_buffer_get(xattrs_buf);
@@ -1126,6 +1139,7 @@ static int __send_cap(struct ceph_mds_client *mdsc, 
struct ceph_cap *cap,
u64 flush_tid = 0;
int i;
int ret;
+   bool inline_data;
 
held = cap-issued | cap-implemented;
revoking = cap-implemented  ~cap-issued;
@@ -1209,13 +1223,15 @@ static int __send_cap(struct ceph_mds_client *mdsc, 
struct ceph_cap *cap,
xattr_version = ci-i_xattrs.version;
}
 
+   inline_data = ci-i_inline_version != CEPH_INLINE_NONE;
+
spin_unlock(ci-i_ceph_lock);
 
ret = send_cap_msg(session, ceph_vino(inode).ino, cap_id,
op, keep, want, flushing, seq, flush_tid, issue_seq, mseq,
size, max_size, mtime, atime, time_warp_seq,
uid, gid, mode, xattr_version, xattr_blob,
-   follows);
+   follows, inline_data);
if (ret  0) {
dout(error sending cap msg, must requeue %p\n, inode);
delayed = 1;
@@ -1336,7 +1352,7 @@ retry:
 capsnap-time_warp_seq,
 capsnap-uid, capsnap-gid, capsnap-mode,
 capsnap-xattr_version, capsnap-xattr_blob,
-capsnap-follows);
+capsnap-follows, capsnap-inline_data);
 
next_follows = capsnap-follows + 1;
ceph_put_cap_snap(capsnap);
diff --git a/fs/ceph/snap.c b/fs/ceph/snap.c
index 24b454a..28571cf 100644
--- a/fs/ceph/snap.c
+++ b/fs/ceph/snap.c
@@ -515,6 +515,8 @@ void ceph_queue_cap_snap(struct ceph_inode_info *ci)
capsnap-xattr_version = 0;
}
 
+   capsnap-inline_data = ci-i_inline_version != CEPH_INLINE_NONE;
+
/* dirty page count moved from _head to this cap_snap;
   all subsequent writes page dirties occur _after_ this
   snapshot. */
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 02de104..33885ad 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -161,6 +161,7 @@ struct ceph_cap_snap {
u64 time_warp_seq;
int writing;   /* a sync write is still in progress */
int dirty_pages; /* dirty pages awaiting writeback */
+   bool inline_data;
 };
 
 static inline void ceph_put_cap_snap(struct ceph_cap_snap *capsnap)
-- 
1.9.3

--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 08/12] ceph: sync read inline data

2014-11-16 Thread Yan, Zheng
We can't use getattr to fetch inline data while holding the Fr cap,
because it can cause deadlock. If we need to sync read inline data,
drop the cap refs first, then use getattr to fetch the inline data.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/addr.c |  7 ++-
 fs/ceph/file.c | 63 ++
 2 files changed, 61 insertions(+), 9 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index fdbdbbe..9dcb4a9 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -194,8 +194,10 @@ static int readpage_nounlock(struct file *filp, struct 
page *page)
int err = 0;
u64 len = PAGE_CACHE_SIZE;
 
-   err = ceph_readpage_from_fscache(inode, page);
+   if (ci-i_inline_version != CEPH_INLINE_NONE)
+   return -EINVAL;
 
+   err = ceph_readpage_from_fscache(inode, page);
if (err == 0)
goto out;
 
@@ -384,6 +386,9 @@ static int ceph_readpages(struct file *file, struct 
address_space *mapping,
int rc = 0;
int max = 0;
 
+   if (ceph_inode(inode)-i_inline_version != CEPH_INLINE_NONE)
+   return -EINVAL;
+
rc = ceph_readpages_from_fscache(mapping-host, mapping, page_list,
 nr_pages);
 
diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 861b995..5b092bd 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -333,6 +333,11 @@ int ceph_release(struct inode *inode, struct file *file)
return 0;
 }
 
+enum {
+   CHECK_EOF = 1,
+   READ_INLINE = 2,
+};
+
 /*
  * Read a range of bytes striped over one or more objects.  Iterate over
  * objects we stripe over.  (That's not atomic, but good enough for now.)
@@ -412,7 +417,7 @@ more:
ret = read;
/* did we bounce off eof? */
if (pos + left  inode-i_size)
-   *checkeof = 1;
+   *checkeof = CHECK_EOF;
}
 
dout(striped_read returns %d\n, ret);
@@ -808,7 +813,7 @@ static ssize_t ceph_read_iter(struct kiocb *iocb, struct 
iov_iter *to)
struct page *pinned_page = NULL;
ssize_t ret;
int want, got = 0;
-   int checkeof = 0, read = 0;
+   int retry_op = 0, read = 0;
 
 again:
dout(aio_read %p %llx.%llx %llu~%u trying to get caps on %p\n,
@@ -830,8 +835,12 @@ again:
 inode, ceph_vinop(inode), iocb-ki_pos, (unsigned)len,
 ceph_cap_string(got));
 
-   /* hmm, this isn't really async... */
-   ret = ceph_sync_read(iocb, to, checkeof);
+   if (ci->i_inline_version == CEPH_INLINE_NONE) {
+   /* hmm, this isn't really async... */
+   ret = ceph_sync_read(iocb, to, retry_op);
+   } else {
+   retry_op = READ_INLINE;
+   }
} else {
dout(aio_read %p %llx.%llx %llu~%u got cap refs on %s\n,
 inode, ceph_vinop(inode), iocb-ki_pos, (unsigned)len,
@@ -846,12 +855,50 @@ again:
pinned_page = NULL;
}
ceph_put_cap_refs(ci, got);
+   if (retry_op && ret >= 0) {
+   int statret;
+   struct page *page = NULL;
+   loff_t i_size;
+   if (retry_op == READ_INLINE) {
+   page = __page_cache_alloc(GFP_NOFS);
+   if (!page)
+   return -ENOMEM;
+   }
+
+   statret = __ceph_do_getattr(inode, page,
+   CEPH_STAT_CAP_INLINE_DATA, !!page);
+   if (statret < 0) {
+__free_page(page);
+   if (statret == -ENODATA) {
+   BUG_ON(retry_op != READ_INLINE);
+   goto again;
+   }
+   return statret;
+   }
 
-   if (checkeof && ret >= 0) {
-   int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
+   i_size = i_size_read(inode);
+   if (retry_op == READ_INLINE) {
+   /* does not support inline data > PAGE_SIZE */
+   if (i_size > PAGE_CACHE_SIZE) {
+   ret = -EIO;
+   } else if (iocb->ki_pos < i_size) {
+   loff_t end = min_t(loff_t, i_size,
+  iocb->ki_pos + len);
+   if (statret < end)
+   zero_user_segment(page, statret, end);
+   ret = copy_page_to_iter(page,
+   iocb->ki_pos & ~PAGE_MASK,
+   end - iocb->ki_pos, to);
+   iocb->ki_pos += ret;
+   } else {
+   ret = 0

[PATCH 10/12] ceph: convert inline data for memory mapped read

2014-11-16 Thread Yan, Zheng
We can't use getattr to fetch inline data while holding Fr caps.
If there is no Fc cap, ceph_get_caps() does not bring inline data
into the page cache either. To simplify this case, just convert the
inline data to normal data.
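
The shape of the change in ceph_filemap_fault() is roughly the following, a
simplified sketch of the retry loop this patch adds (cap handling is only
indicated by comments):

again:
    /* ... take cap refs (want Fc) ... */
    if ((got & want) || ci->i_inline_version == CEPH_INLINE_NONE)
        ret = filemap_fault(vma, vmf);
    else
        ret = -EAGAIN;  /* readpage() can't fetch inline data */
    /* ... drop cap refs ... */

    if (ret == -EAGAIN) {
        /* write the inline data out to a regular object, then retry */
        if (ceph_uninline_data(vma->vm_file, NULL) < 0)
            return VM_FAULT_SIGBUS;
        spin_lock(&ci->i_ceph_lock);
        ci->i_inline_version = CEPH_INLINE_NONE;
        spin_unlock(&ci->i_ceph_lock);
        goto again;
    }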

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/addr.c | 17 +++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 72336bd..1768919 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1215,7 +1215,7 @@ static int ceph_filemap_fault(struct vm_area_struct *vma, 
struct vm_fault *vmf)
struct page *pinned_page = NULL;
loff_t off = vmf->pgoff << PAGE_CACHE_SHIFT;
int want, got, ret;
-
+again:
dout(filemap_fault %p %llx.%llx %llu~%zd trying to get caps\n,
 inode, ceph_vinop(inode), off, (size_t)PAGE_CACHE_SIZE);
if (fi->fmode & CEPH_FILE_MODE_LAZY)
@@ -1236,7 +1236,10 @@ static int ceph_filemap_fault(struct vm_area_struct 
*vma, struct vm_fault *vmf)
dout(filemap_fault %p %llu~%zd got cap refs on %s\n,
 inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got));
 
-   ret = filemap_fault(vma, vmf);
+   if ((got & want) || ci->i_inline_version == CEPH_INLINE_NONE)
+   ret = filemap_fault(vma, vmf);
+   else
+   ret = -EAGAIN; /* readpage() can't fetch inline data */
 
dout(filemap_fault %p %llu~%zd dropping cap refs on %s ret %d\n,
 inode, off, (size_t)PAGE_CACHE_SIZE, ceph_cap_string(got), ret);
@@ -1244,6 +1247,16 @@ static int ceph_filemap_fault(struct vm_area_struct 
*vma, struct vm_fault *vmf)
page_cache_release(pinned_page);
ceph_put_cap_refs(ci, got);
 
+   if (ret == -EAGAIN) {
+   ret = ceph_uninline_data(vma->vm_file, NULL);
+   if (ret < 0)
+   return VM_FAULT_SIGBUS;
+   spin_lock(&ci->i_ceph_lock);
+   ci->i_inline_version = CEPH_INLINE_NONE;
+   spin_unlock(&ci->i_ceph_lock);
+   goto again;
+   }
+
return ret;
 }
 
-- 
1.9.3



[PATCH] libceph: require cephx message signature by default

2014-11-11 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 include/linux/ceph/libceph.h |  1 +
 net/ceph/ceph_common.c   | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index d293f7e..8b11a79 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -29,6 +29,7 @@
 #define CEPH_OPT_NOSHARE  (1<<1) /* don't share client with other sbs */
 #define CEPH_OPT_MYIP (1<<2) /* specified my ip */
 #define CEPH_OPT_NOCRC(1<<3) /* no data crc on writes */
+#define CEPH_OPT_NOMSGAUTH   (1<<4) /* not require cephx message signature */
 
 #define CEPH_OPT_DEFAULT   (0)
 
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index d361a274..b22d82c 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -237,6 +237,8 @@ enum {
Opt_noshare,
Opt_crc,
Opt_nocrc,
+   Opt_cephx_require_signature,
+   Opt_cephx_require_no_signature,
 };
 
 static match_table_t opt_tokens = {
@@ -255,6 +257,8 @@ static match_table_t opt_tokens = {
{Opt_noshare, "noshare"},
{Opt_crc, "crc"},
{Opt_nocrc, "nocrc"},
+   {Opt_cephx_require_signature, "cephx_require_signature"},
+   {Opt_cephx_require_no_signature, "cephx_require_no_signature"},
{-1, NULL}
 };
 
@@ -453,6 +457,12 @@ ceph_parse_options(char *options, const char *dev_name,
case Opt_nocrc:
opt->flags |= CEPH_OPT_NOCRC;
break;
+   case Opt_cephx_require_signature:
+   opt->flags &= ~CEPH_OPT_NOMSGAUTH;
+   break;
+   case Opt_cephx_require_no_signature:
+   opt->flags |= CEPH_OPT_NOMSGAUTH;
+   break;
 
default:
BUG_ON(token);
@@ -496,6 +506,9 @@ struct ceph_client *ceph_create_client(struct ceph_options 
*opt, void *private,
init_waitqueue_head(&client->auth_wq);
client->auth_err = 0;
 
+   if (!ceph_test_opt(client, NOMSGAUTH))
+   required_features |= CEPH_FEATURE_MSG_AUTH;
+
client->extra_mon_dispatch = NULL;
client->supported_features = CEPH_FEATURES_SUPPORTED_DEFAULT |
supported_features;
-- 
1.9.3



[PATCH] libceph: require cephx message signature by default

2014-11-11 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 include/linux/ceph/libceph.h |  1 +
 net/ceph/ceph_common.c   | 13 +
 2 files changed, 14 insertions(+)

diff --git a/include/linux/ceph/libceph.h b/include/linux/ceph/libceph.h
index d293f7e..8b11a79 100644
--- a/include/linux/ceph/libceph.h
+++ b/include/linux/ceph/libceph.h
@@ -29,6 +29,7 @@
 #define CEPH_OPT_NOSHARE  (1<<1) /* don't share client with other sbs */
 #define CEPH_OPT_MYIP (1<<2) /* specified my ip */
 #define CEPH_OPT_NOCRC(1<<3) /* no data crc on writes */
+#define CEPH_OPT_NOMSGAUTH   (1<<4) /* not require cephx message signature */
 
 #define CEPH_OPT_DEFAULT   (0)
 
diff --git a/net/ceph/ceph_common.c b/net/ceph/ceph_common.c
index d361a274..5d5ab67 100644
--- a/net/ceph/ceph_common.c
+++ b/net/ceph/ceph_common.c
@@ -237,6 +237,8 @@ enum {
Opt_noshare,
Opt_crc,
Opt_nocrc,
+   Opt_cephx_require_signatures,
+   Opt_nocephx_require_signatures,
 };
 
 static match_table_t opt_tokens = {
@@ -255,6 +257,8 @@ static match_table_t opt_tokens = {
{Opt_noshare, "noshare"},
{Opt_crc, "crc"},
{Opt_nocrc, "nocrc"},
+   {Opt_cephx_require_signatures, "cephx_require_signatures"},
+   {Opt_nocephx_require_signatures, "nocephx_require_signatures"},
{-1, NULL}
 };
 
@@ -453,6 +457,12 @@ ceph_parse_options(char *options, const char *dev_name,
case Opt_nocrc:
opt->flags |= CEPH_OPT_NOCRC;
break;
+   case Opt_cephx_require_signatures:
+   opt->flags &= ~CEPH_OPT_NOMSGAUTH;
+   break;
+   case Opt_nocephx_require_signatures:
+   opt->flags |= CEPH_OPT_NOMSGAUTH;
+   break;
 
default:
BUG_ON(token);
@@ -496,6 +506,9 @@ struct ceph_client *ceph_create_client(struct ceph_options 
*opt, void *private,
init_waitqueue_head(&client->auth_wq);
client->auth_err = 0;
 
+   if (!ceph_test_opt(client, NOMSGAUTH))
+   required_features |= CEPH_FEATURE_MSG_AUTH;
+
client->extra_mon_dispatch = NULL;
client->supported_features = CEPH_FEATURES_SUPPORTED_DEFAULT |
supported_features;
-- 
1.9.3



[PATCH 2/2] libceph: message signature support

2014-11-04 Thread Yan, Zheng
Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c   | 16 +++
 include/linux/ceph/auth.h  | 26 +
 include/linux/ceph/ceph_features.h |  1 +
 include/linux/ceph/messenger.h |  9 +-
 include/linux/ceph/msgr.h  |  8 ++
 net/ceph/auth_x.c  | 58 ++
 net/ceph/messenger.c   | 32 +++--
 net/ceph/osd_client.c  | 16 +++
 8 files changed, 162 insertions(+), 4 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 2eab332..14ca763 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -3668,6 +3668,20 @@ static struct ceph_msg *mds_alloc_msg(struct 
ceph_connection *con,
return msg;
 }
 
+static int sign_message(struct ceph_connection *con, struct ceph_msg *msg)
+{
+   struct ceph_mds_session *s = con->private;
+   struct ceph_auth_handshake *auth = &s->s_auth;
+   return ceph_auth_sign_message(auth, msg);
+}
+
+static int check_message_signature(struct ceph_connection *con, struct 
ceph_msg *msg)
+{
+   struct ceph_mds_session *s = con->private;
+   struct ceph_auth_handshake *auth = &s->s_auth;
+   return ceph_auth_check_message_signature(auth, msg);
+}
+
 static const struct ceph_connection_operations mds_con_ops = {
.get = con_get,
.put = con_put,
@@ -3677,6 +3691,8 @@ static const struct ceph_connection_operations 
mds_con_ops = {
.invalidate_authorizer = invalidate_authorizer,
.peer_reset = peer_reset,
.alloc_msg = mds_alloc_msg,
+   .sign_message = sign_message,
+   .check_message_signature = check_message_signature,
 };
 
 /* eof */
diff --git a/include/linux/ceph/auth.h b/include/linux/ceph/auth.h
index 5f33868..260d78b 100644
--- a/include/linux/ceph/auth.h
+++ b/include/linux/ceph/auth.h
@@ -13,6 +13,7 @@
 
 struct ceph_auth_client;
 struct ceph_authorizer;
+struct ceph_msg;
 
 struct ceph_auth_handshake {
struct ceph_authorizer *authorizer;
@@ -20,6 +21,10 @@ struct ceph_auth_handshake {
size_t authorizer_buf_len;
void *authorizer_reply_buf;
size_t authorizer_reply_buf_len;
+   int (*sign_message)(struct ceph_auth_handshake *auth,
+   struct ceph_msg *msg);
+   int (*check_message_signature)(struct ceph_auth_handshake *auth,
+  struct ceph_msg *msg);
 };
 
 struct ceph_auth_client_ops {
@@ -66,6 +71,11 @@ struct ceph_auth_client_ops {
void (*reset)(struct ceph_auth_client *ac);
 
void (*destroy)(struct ceph_auth_client *ac);
+
+   int (*sign_message)(struct ceph_auth_handshake *auth,
+   struct ceph_msg *msg);
+   int (*check_message_signature)(struct ceph_auth_handshake *auth,
+  struct ceph_msg *msg);
 };
 
 struct ceph_auth_client {
@@ -113,4 +123,20 @@ extern int ceph_auth_verify_authorizer_reply(struct 
ceph_auth_client *ac,
 extern void ceph_auth_invalidate_authorizer(struct ceph_auth_client *ac,
int peer_type);
 
+static inline int ceph_auth_sign_message(struct ceph_auth_handshake *auth,
+struct ceph_msg *msg)
+{
+   if (auth->sign_message)
+   return auth->sign_message(auth, msg);
+   return 0;
+}
+
+static inline
+int ceph_auth_check_message_signature(struct ceph_auth_handshake *auth,
+ struct ceph_msg *msg)
+{
+   if (auth->check_message_signature)
+   return auth->check_message_signature(auth, msg);
+   return 0;
+}
 #endif
diff --git a/include/linux/ceph/ceph_features.h 
b/include/linux/ceph/ceph_features.h
index d12659c..71e05bb 100644
--- a/include/linux/ceph/ceph_features.h
+++ b/include/linux/ceph/ceph_features.h
@@ -84,6 +84,7 @@ static inline u64 ceph_sanitize_features(u64 features)
 CEPH_FEATURE_PGPOOL3 | \
 CEPH_FEATURE_OSDENC |  \
 CEPH_FEATURE_CRUSH_TUNABLES |  \
+CEPH_FEATURE_MSG_AUTH |\
 CEPH_FEATURE_CRUSH_TUNABLES2 | \
 CEPH_FEATURE_REPLY_CREATE_INODE |  \
 CEPH_FEATURE_OSDHASHPSPOOL |   \
diff --git a/include/linux/ceph/messenger.h b/include/linux/ceph/messenger.h
index 40ae58e..d9d396c 100644
--- a/include/linux/ceph/messenger.h
+++ b/include/linux/ceph/messenger.h
@@ -42,6 +42,10 @@ struct ceph_connection_operations {
struct ceph_msg * (*alloc_msg) (struct ceph_connection *con,
struct ceph_msg_header *hdr,
int *skip);
+   int (*sign_message) (struct ceph_connection *con, struct ceph_msg *msg);
+
+   int (*check_message_signature) (struct ceph_connection *con,
+   struct ceph_msg *msg);
 };
 
 /* use

[PATCH 1/2] libceph: store session key in cephx authorizer

2014-11-04 Thread Yan, Zheng
The session key is required when calculating a message signature. Save the
session key in the authorizer; this avoids looking up the ticket handler for
each message.
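
In effect the authorizer now carries its own copy of the session key, so the
verify and destroy paths no longer need to look the ticket handler up again.
A condensed sketch of the idea (not the full patch):

    /* build: remember the session key this authorizer was created with */
    ret = ceph_crypto_key_clone(&au->session_key, &th->session_key);
    if (ret)
        return ret;

    /* verify: decrypt the reply with the saved key, no ticket lookup */
    ret = ceph_x_decrypt(&au->session_key, &p, end, &preply, sizeof(reply));

    /* destroy: release the cloned key together with the authorizer */
    ceph_crypto_key_destroy(&au->session_key);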

Signed-off-by: Yan, Zheng z...@redhat.com
---
 net/ceph/auth_x.c | 18 +++---
 net/ceph/auth_x.h |  1 +
 2 files changed, 12 insertions(+), 7 deletions(-)

diff --git a/net/ceph/auth_x.c b/net/ceph/auth_x.c
index de6662b..8da8568 100644
--- a/net/ceph/auth_x.c
+++ b/net/ceph/auth_x.c
@@ -298,6 +298,11 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
dout(build_authorizer for %s %p\n,
 ceph_entity_type_name(th-service), au);
 
+   ceph_crypto_key_destroy(&au->session_key);
+   ret = ceph_crypto_key_clone(&au->session_key, &th->session_key);
+   if (ret)
+   return ret;
+
maxlen = sizeof(*msg_a) + sizeof(msg_b) +
ceph_x_encrypt_buflen(ticket_blob_len);
dout(  need len %d\n, maxlen);
@@ -307,8 +312,10 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
}
if (!au->buf) {
au->buf = ceph_buffer_new(maxlen, GFP_NOFS);
-   if (!au->buf)
+   if (!au->buf) {
+   ceph_crypto_key_destroy(&au->session_key);
return -ENOMEM;
+   }
}
au->service = th->service;
au->secret_id = th->secret_id;
@@ -334,7 +341,7 @@ static int ceph_x_build_authorizer(struct ceph_auth_client 
*ac,
get_random_bytes(&au->nonce, sizeof(au->nonce));
msg_b.struct_v = 1;
msg_b.nonce = cpu_to_le64(au->nonce);
-   ret = ceph_x_encrypt(&th->session_key, &msg_b, sizeof(msg_b),
+   ret = ceph_x_encrypt(&au->session_key, &msg_b, sizeof(msg_b),
 p, end - p);
if (ret  0)
goto out_buf;
@@ -593,17 +600,13 @@ static int ceph_x_verify_authorizer_reply(struct 
ceph_auth_client *ac,
  struct ceph_authorizer *a, size_t len)
 {
struct ceph_x_authorizer *au = (void *)a;
-   struct ceph_x_ticket_handler *th;
int ret = 0;
struct ceph_x_authorize_reply reply;
void *preply = &reply;
void *p = au->reply_buf;
void *end = p + sizeof(au->reply_buf);
 
-   th = get_ticket_handler(ac, au->service);
-   if (IS_ERR(th))
-   return PTR_ERR(th);
-   ret = ceph_x_decrypt(&th->session_key, &p, end, &preply, sizeof(reply));
+   ret = ceph_x_decrypt(&au->session_key, &p, end, &preply, sizeof(reply));
if (ret  0)
return ret;
if (ret != sizeof(reply))
@@ -623,6 +626,7 @@ static void ceph_x_destroy_authorizer(struct 
ceph_auth_client *ac,
 {
struct ceph_x_authorizer *au = (void *)a;
 
+   ceph_crypto_key_destroy(&au->session_key);
ceph_buffer_put(au-buf);
kfree(au);
 }
diff --git a/net/ceph/auth_x.h b/net/ceph/auth_x.h
index 65ee720..e8b7c69 100644
--- a/net/ceph/auth_x.h
+++ b/net/ceph/auth_x.h
@@ -26,6 +26,7 @@ struct ceph_x_ticket_handler {
 
 
 struct ceph_x_authorizer {
+   struct ceph_crypto_key session_key;
struct ceph_buffer *buf;
unsigned int service;
u64 nonce;
-- 
1.9.3



Re: design a rados-based distributed kv store support scan op

2014-09-30 Thread Yan, Zheng
On Tue, Sep 30, 2014 at 5:23 PM, Plato Zhang sango...@gmail.com wrote:
 Hi all,

 We plan to develop a new distributed key-value storage engine, which
 would based on rados and supoort range-scan operation. Besides, it'll
 be open source.

 We know rados is already a kv storage engine, and can do pretty well
 if we use KeyValueStore.
 However, it lacks the ability of scanning a key range, since keys are
 hashed into pgs.

 In our scenarios, scanning is a common need. We considered the way of
 splitting our key space into fixed number of ranges, mapping each
 range to a rados object, and then storing key/value pairs in rados
 objects. However, there are two main disadvantages:
 1. We can not adjust the number of ranges as the cluster scales.
 2. For a good splitting, we must properly predict the distribution of
 keys, or we'll encounter serious load-balance problem.

 So we decide to develop a new service based on rados (we want to use
 rados for unifying our storage infrastructure).

 Before we start, we want to do some inquiries:
 1. Is there already any similar system/module(scan-supported 
 rados-based) existed?

FYI:
directory fragments in the MDS satisfy all of the above requirements, but the
code is not general enough to be used outside the MDS.



 2. Except for the basic key-value functions, what features are you
 most interested in?
 3. Other suggestions?

 Thanks!


[PATCH] ceph: remove redundant io_iter_advance()

2014-09-17 Thread Yan, Zheng
ceph_sync_read and generic_file_read_iter() have already advanced the
IO iterator.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/file.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 1c1df08..d7e0da8 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -850,7 +850,6 @@ again:
 , reading more\n, iocb-ki_pos,
 inode-i_size);
 
-   iov_iter_advance(to, ret);
read += ret;
len -= ret;
checkeof = 0;
-- 
1.9.3



Re: [PATCH] ceph: remove redundant code for max file size verification

2014-09-17 Thread Yan, Zheng
On Wed, Sep 17, 2014 at 5:26 PM, Chao Yu chao2...@samsung.com wrote:
 Both ceph_update_writeable_page and ceph_setattr will verify the file size
 against the max size ceph supports.
 There are two callers of ceph_update_writeable_page: ceph_write_begin and
 ceph_page_mkwrite. For ceph_write_begin, we have already verified the size in
 generic_write_checks of ceph_write_iter; for ceph_page_mkwrite, we have no
 chance to change the file size when mmapped. Likewise we have already verified
 the size in inode_change_ok when we call ceph_setattr.
 So let's remove the redundant code for max file size verification.

 Signed-off-by: Chao Yu chao2...@samsung.com
 ---
  fs/ceph/addr.c  | 9 -
  fs/ceph/inode.c | 6 --
  2 files changed, 15 deletions(-)

 diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
 index 90b3954..18c06bb 100644
 --- a/fs/ceph/addr.c
 +++ b/fs/ceph/addr.c
 @@ -1076,12 +1076,6 @@ retry_locked:
 /* past end of file? */
 i_size = inode-i_size;   /* caller holds i_mutex */

 -   if (i_size + len  inode-i_sb-s_maxbytes) {
 -   /* file is too big */
 -   r = -EINVAL;
 -   goto fail;
 -   }
 -
 if (page_off = i_size ||
 (pos_in_page == 0  (pos+len) = i_size 
  end_in_page - pos_in_page != PAGE_CACHE_SIZE)) {
 @@ -1099,9 +1093,6 @@ retry_locked:
 if (r  0)
 goto fail_nosnap;
 goto retry_locked;
 -
 -fail:
 -   up_read(mdsc-snap_rwsem);
  fail_nosnap:
 unlock_page(page);
 return r;
 diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
 index 04c89c2..25c242e 100644
 --- a/fs/ceph/inode.c
 +++ b/fs/ceph/inode.c
 @@ -1813,10 +1813,6 @@ int ceph_setattr(struct dentry *dentry, struct iattr 
 *attr)
 if (ia_valid  ATTR_SIZE) {
 dout(setattr %p size %lld - %lld\n, inode,
  inode-i_size, attr-ia_size);
 -   if (attr-ia_size  inode-i_sb-s_maxbytes) {
 -   err = -EINVAL;
 -   goto out;
 -   }
 if ((issued  CEPH_CAP_FILE_EXCL) 
 attr-ia_size  inode-i_size) {
 inode-i_size = attr-ia_size;
 @@ -1896,8 +1892,6 @@ int ceph_setattr(struct dentry *dentry, struct iattr 
 *attr)
 if (mask  CEPH_SETATTR_SIZE)
 __ceph_do_pending_vmtruncate(inode);
 return err;
 -out:
 -   spin_unlock(ci-i_ceph_lock);
  out_put:
 ceph_mdsc_put_request(req);
 return err;
 --
 2.0.1.474.g72c7794



Reviewed-by: Yan, Zheng z...@redhat.com

I added this to the testing branch of ceph-client.

Thanks
Yan, Zheng




[PATCH] ceph: request xattrs if xattr_version is zero

2014-09-16 Thread Yan, Zheng
The following sequence of events can happen:
 - Client releases an inode and queues a cap release message.
 - A 'lookup' reply brings the same inode back, but the reply
   doesn't contain xattrs because the MDS didn't receive the cap release
   message and thought the client already had up-to-date xattrs.

The fix is to force sending a getattr request to the MDS if xattr_version is 0.
The getattr mask is set to CEPH_STAT_CAP_XATTR, so the MDS knows the client
does not have xattrs.
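
A minimal sketch of the intended call site in the xattr code (the full
xattr.c hunk is not shown here, so the exact surrounding code may differ):

    /* xattr_version == 0 means we never received the xattrs from the MDS */
    if (ci->i_xattrs.version == 0) {
        err = ceph_do_getattr(inode, CEPH_STAT_CAP_XATTR, true);
        if (err)
            return err;
    }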

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/file.c  |  5 ++---
 fs/ceph/inode.c |  8 
 fs/ceph/ioctl.c |  4 ++--
 fs/ceph/super.h |  2 +-
 fs/ceph/xattr.c | 29 ++---
 5 files changed, 19 insertions(+), 29 deletions(-)

diff --git a/fs/ceph/file.c b/fs/ceph/file.c
index 46a0525f..bf926fb 100644
--- a/fs/ceph/file.c
+++ b/fs/ceph/file.c
@@ -841,8 +841,7 @@ again:
ceph_put_cap_refs(ci, got);
 
if (checkeof  ret = 0) {
-   int statret = ceph_do_getattr(inode,
- CEPH_STAT_CAP_SIZE);
+   int statret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
 
/* hit EOF or hole? */
if (statret == 0  iocb-ki_pos  inode-i_size 
@@ -1010,7 +1009,7 @@ static loff_t ceph_llseek(struct file *file, loff_t 
offset, int whence)
mutex_lock(inode-i_mutex);
 
if (whence == SEEK_END || whence == SEEK_DATA || whence == SEEK_HOLE) {
-   ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE);
+   ret = ceph_do_getattr(inode, CEPH_STAT_CAP_SIZE, false);
if (ret  0) {
offset = ret;
goto out;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 04c89c2..40e6289 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -1907,7 +1907,7 @@ out_put:
  * Verify that we have a lease on the given mask.  If not,
  * do a getattr against an mds.
  */
-int ceph_do_getattr(struct inode *inode, int mask)
+int ceph_do_getattr(struct inode *inode, int mask, bool force)
 {
struct ceph_fs_client *fsc = ceph_sb_to_client(inode-i_sb);
struct ceph_mds_client *mdsc = fsc-mdsc;
@@ -1920,7 +1920,7 @@ int ceph_do_getattr(struct inode *inode, int mask)
}
 
dout(do_getattr inode %p mask %s mode 0%o\n, inode, 
ceph_cap_string(mask), inode-i_mode);
-   if (ceph_caps_issued_mask(ceph_inode(inode), mask, 1))
+   if (!force && ceph_caps_issued_mask(ceph_inode(inode), mask, 1))
return 0;
 
req = ceph_mdsc_create_request(mdsc, CEPH_MDS_OP_GETATTR, USE_ANY_MDS);
@@ -1948,7 +1948,7 @@ int ceph_permission(struct inode *inode, int mask)
if (mask  MAY_NOT_BLOCK)
return -ECHILD;
 
-   err = ceph_do_getattr(inode, CEPH_CAP_AUTH_SHARED);
+   err = ceph_do_getattr(inode, CEPH_CAP_AUTH_SHARED, false);
 
if (!err)
err = generic_permission(inode, mask);
@@ -1966,7 +1966,7 @@ int ceph_getattr(struct vfsmount *mnt, struct dentry 
*dentry,
struct ceph_inode_info *ci = ceph_inode(inode);
int err;
 
-   err = ceph_do_getattr(inode, CEPH_STAT_CAP_INODE_ALL);
+   err = ceph_do_getattr(inode, CEPH_STAT_CAP_INODE_ALL, false);
if (!err) {
generic_fillattr(inode, stat);
stat-ino = ceph_translate_ino(inode-i_sb, inode-i_ino);
diff --git a/fs/ceph/ioctl.c b/fs/ceph/ioctl.c
index a822a6e..d7dc812 100644
--- a/fs/ceph/ioctl.c
+++ b/fs/ceph/ioctl.c
@@ -19,7 +19,7 @@ static long ceph_ioctl_get_layout(struct file *file, void 
__user *arg)
struct ceph_ioctl_layout l;
int err;
 
-   err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT);
+   err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT, false);
if (!err) {
l.stripe_unit = ceph_file_layout_su(ci-i_layout);
l.stripe_count = ceph_file_layout_stripe_count(ci-i_layout);
@@ -74,7 +74,7 @@ static long ceph_ioctl_set_layout(struct file *file, void 
__user *arg)
return -EFAULT;
 
/* validate changed params against current layout */
-   err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT);
+   err = ceph_do_getattr(file_inode(file), CEPH_STAT_CAP_LAYOUT, false);
if (err)
return err;
 
diff --git a/fs/ceph/super.h b/fs/ceph/super.h
index 0cfb1ec..8405a79 100644
--- a/fs/ceph/super.h
+++ b/fs/ceph/super.h
@@ -714,7 +714,7 @@ extern void ceph_queue_vmtruncate(struct inode *inode);
 extern void ceph_queue_invalidate(struct inode *inode);
 extern void ceph_queue_writeback(struct inode *inode);
 
-extern int ceph_do_getattr(struct inode *inode, int mask);
+extern int ceph_do_getattr(struct inode *inode, int mask, bool force);
 extern int ceph_permission(struct inode *inode, int mask);
 extern int ceph_setattr(struct dentry *dentry, struct iattr *attr);
 extern int ceph_getattr(struct vfsmount *mnt, struct dentry *dentry,
diff --git a/fs/ceph

[PATCH 1/3] libceph: reference counting pagelist

2014-09-16 Thread Yan, Zheng
This allows a pagelist to present data that may be sent multiple times.
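
With the refcount, the intended lifecycle is: the creator holds one reference
from ceph_pagelist_init(), every additional user takes its own reference, and
the pages (and the pagelist itself) are freed only when the last reference is
dropped. A rough usage sketch:

    struct ceph_pagelist *pl;

    pl = kmalloc(sizeof(*pl), GFP_NOFS);
    if (!pl)
        return -ENOMEM;
    ceph_pagelist_init(pl);      /* refcnt = 1, owned by the caller */

    atomic_inc(&pl->refcnt);     /* extra ref for the message that embeds it */
    ceph_msg_data_add_pagelist(msg, pl);

    ceph_pagelist_release(pl);   /* drop the caller's ref */
    /* the messenger drops the last ref, freeing the pages and 'pl' */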

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c  | 1 -
 include/linux/ceph/pagelist.h | 5 -
 net/ceph/messenger.c  | 4 +---
 net/ceph/pagelist.c   | 8 ++--
 4 files changed, 11 insertions(+), 7 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index a17fc49..30d7338 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2796,7 +2796,6 @@ fail:
mutex_unlock(session-s_mutex);
 fail_nomsg:
ceph_pagelist_release(pagelist);
-   kfree(pagelist);
 fail_nopagelist:
pr_err(error %d preparing reconnect for mds%d\n, err, mds);
return;
diff --git a/include/linux/ceph/pagelist.h b/include/linux/ceph/pagelist.h
index 9660d6b..5f871d8 100644
--- a/include/linux/ceph/pagelist.h
+++ b/include/linux/ceph/pagelist.h
@@ -2,6 +2,7 @@
 #define __FS_CEPH_PAGELIST_H
 
 #include linux/list.h
+#include linux/atomic.h
 
 struct ceph_pagelist {
struct list_head head;
@@ -10,6 +11,7 @@ struct ceph_pagelist {
size_t room;
struct list_head free_list;
size_t num_pages_free;
+   atomic_t refcnt;
 };
 
 struct ceph_pagelist_cursor {
@@ -26,9 +28,10 @@ static inline void ceph_pagelist_init(struct ceph_pagelist 
*pl)
pl->room = 0;
INIT_LIST_HEAD(&pl->free_list);
pl->num_pages_free = 0;
+   atomic_set(&pl->refcnt, 1);
 }
 
-extern int ceph_pagelist_release(struct ceph_pagelist *pl);
+extern void ceph_pagelist_release(struct ceph_pagelist *pl);
 
 extern int ceph_pagelist_append(struct ceph_pagelist *pl, const void *d, 
size_t l);
 
diff --git a/net/ceph/messenger.c b/net/ceph/messenger.c
index e7d9411..9764c77 100644
--- a/net/ceph/messenger.c
+++ b/net/ceph/messenger.c
@@ -3071,10 +3071,8 @@ static void ceph_msg_data_destroy(struct ceph_msg_data 
*data)
return;
 
WARN_ON(!list_empty(data-links));
-   if (data-type == CEPH_MSG_DATA_PAGELIST) {
+   if (data-type == CEPH_MSG_DATA_PAGELIST)
ceph_pagelist_release(data-pagelist);
-   kfree(data-pagelist);
-   }
kmem_cache_free(ceph_msg_data_cache, data);
 }
 
diff --git a/net/ceph/pagelist.c b/net/ceph/pagelist.c
index 92866be..f70b651 100644
--- a/net/ceph/pagelist.c
+++ b/net/ceph/pagelist.c
@@ -1,5 +1,6 @@
 #include linux/module.h
 #include linux/gfp.h
+#include linux/slab.h
 #include linux/pagemap.h
 #include linux/highmem.h
 #include linux/ceph/pagelist.h
@@ -13,8 +14,10 @@ static void ceph_pagelist_unmap_tail(struct ceph_pagelist 
*pl)
}
 }
 
-int ceph_pagelist_release(struct ceph_pagelist *pl)
+void ceph_pagelist_release(struct ceph_pagelist *pl)
 {
+   if (!atomic_dec_and_test(&pl->refcnt))
+   return;
ceph_pagelist_unmap_tail(pl);
while (!list_empty(pl-head)) {
struct page *page = list_first_entry(pl-head, struct page,
@@ -23,7 +26,8 @@ int ceph_pagelist_release(struct ceph_pagelist *pl)
__free_page(page);
}
ceph_pagelist_free_reserve(pl);
-   return 0;
+   kfree(pl);
+   return;
 }
 EXPORT_SYMBOL(ceph_pagelist_release);
 
-- 
1.9.3



[PATCH 2/3] ceph: use pagelist to present MDS request data

2014-09-16 Thread Yan, Zheng
The current code uses a page array to present MDS request data. Pages in the
array are allocated/freed by the caller of ceph_mdsc_do_request(). If the
request is interrupted, the pages can be freed while they are still being
used by the request message.

The fix is to use a pagelist to present MDS request data. The pagelist is
reference counted.
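
For the setxattr path this means the value is copied once into a refcounted
pagelist that stays attached to the request; a condensed sketch of the new
flow (error handling trimmed):

    struct ceph_pagelist *pagelist;

    pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
    if (!pagelist)
        return -ENOMEM;
    ceph_pagelist_init(pagelist);
    err = ceph_pagelist_append(pagelist, value, size);
    if (err)
        goto out;

    req->r_pagelist = pagelist;  /* released in ceph_mdsc_release_request() */
    pagelist = NULL;
    err = ceph_mdsc_do_request(mdsc, NULL, req);
out:
    if (pagelist)
        ceph_pagelist_release(pagelist);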

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c | 14 +-
 fs/ceph/mds_client.h |  4 +---
 fs/ceph/xattr.c  | 46 --
 3 files changed, 26 insertions(+), 38 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 30d7338..80d9f07 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -542,6 +542,8 @@ void ceph_mdsc_release_request(struct kref *kref)
}
kfree(req-r_path1);
kfree(req-r_path2);
+   if (req->r_pagelist)
+   ceph_pagelist_release(req->r_pagelist);
put_request_session(req);
ceph_unreserve_caps(req-r_mdsc, req-r_caps_reservation);
kfree(req);
@@ -1847,13 +1849,15 @@ static struct ceph_msg *create_request_message(struct 
ceph_mds_client *mdsc,
msg-front.iov_len = p - msg-front.iov_base;
msg-hdr.front_len = cpu_to_le32(msg-front.iov_len);
 
-   if (req-r_data_len) {
-   /* outbound data set only by ceph_sync_setxattr() */
-   BUG_ON(!req-r_pages);
-   ceph_msg_data_add_pages(msg, req-r_pages, req-r_data_len, 0);
+   if (req->r_pagelist) {
+   struct ceph_pagelist *pagelist = req->r_pagelist;
+   atomic_inc(&pagelist->refcnt);
+   ceph_msg_data_add_pagelist(msg, pagelist);
+   msg->hdr.data_len = cpu_to_le32(pagelist->length);
+   } else {
+   msg->hdr.data_len = 0;
}
 
-   msg-hdr.data_len = cpu_to_le32(req-r_data_len);
msg-hdr.data_off = cpu_to_le16(0);
 
 out_free2:
diff --git a/fs/ceph/mds_client.h b/fs/ceph/mds_client.h
index e00737c..23015f7 100644
--- a/fs/ceph/mds_client.h
+++ b/fs/ceph/mds_client.h
@@ -202,9 +202,7 @@ struct ceph_mds_request {
bool r_direct_is_hash;  /* true if r_direct_hash is valid */
 
/* data payload is used for xattr ops */
-   struct page **r_pages;
-   int r_num_pages;
-   int r_data_len;
+   struct ceph_pagelist *r_pagelist;
 
/* what caps shall we drop? */
int r_inode_drop, r_inode_unless;
diff --git a/fs/ceph/xattr.c b/fs/ceph/xattr.c
index eab3e2f..c7b18b2 100644
--- a/fs/ceph/xattr.c
+++ b/fs/ceph/xattr.c
@@ -1,4 +1,5 @@
 #include linux/ceph/ceph_debug.h
+#include linux/ceph/pagelist.h
 
 #include super.h
 #include mds_client.h
@@ -852,28 +853,17 @@ static int ceph_sync_setxattr(struct dentry *dentry, 
const char *name,
struct ceph_mds_request *req;
struct ceph_mds_client *mdsc = fsc-mdsc;
int err;
-   int i, nr_pages;
-   struct page **pages = NULL;
-   void *kaddr;
-
-   /* copy value into some pages */
-   nr_pages = calc_pages_for(0, size);
-   if (nr_pages) {
-   pages = kmalloc(sizeof(pages[0])*nr_pages, GFP_NOFS);
-   if (!pages)
-   return -ENOMEM;
-   err = -ENOMEM;
-   for (i = 0; i  nr_pages; i++) {
-   pages[i] = __page_cache_alloc(GFP_NOFS);
-   if (!pages[i]) {
-   nr_pages = i;
-   goto out;
-   }
-   kaddr = kmap(pages[i]);
-   memcpy(kaddr, value + i*PAGE_CACHE_SIZE,
-  min(PAGE_CACHE_SIZE, size-i*PAGE_CACHE_SIZE));
-   }
-   }
+   struct ceph_pagelist *pagelist;
+
+   /* copy value into pagelist */
+   pagelist = kmalloc(sizeof(*pagelist), GFP_NOFS);
+   if (!pagelist)
+   return -ENOMEM;
+
+   ceph_pagelist_init(pagelist);
+   err = ceph_pagelist_append(pagelist, value, size);
+   if (err)
+   goto out;
 
dout(setxattr value=%.*s\n, (int)size, value);
 
@@ -894,9 +884,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const 
char *name,
req-r_args.setxattr.flags = cpu_to_le32(flags);
req-r_path2 = kstrdup(name, GFP_NOFS);
 
-   req-r_pages = pages;
-   req-r_num_pages = nr_pages;
-   req-r_data_len = size;
+   req-r_pagelist = pagelist;
+   pagelist = NULL;
 
dout(xattr.ver (before): %lld\n, ci-i_xattrs.version);
err = ceph_mdsc_do_request(mdsc, NULL, req);
@@ -904,11 +893,8 @@ static int ceph_sync_setxattr(struct dentry *dentry, const 
char *name,
dout(xattr.ver (after): %lld\n, ci-i_xattrs.version);
 
 out:
-   if (pages) {
-   for (i = 0; i  nr_pages; i++)
-   __free_page(pages[i]);
-   kfree(pages);
-   }
+   if (pagelist)
+   ceph_pagelist_release(pagelist

[PATCH 3/3] ceph: include the initial ACL in create/mkdir/mknod MDS requests

2014-09-16 Thread Yan, Zheng
The current code sets a new file/directory's initial ACL in a non-atomic
manner. The client first sends a request to the MDS to create the new
file/directory, then sets the initial ACL after the new file/directory is
successfully created.

The fix is to include the initial ACL in create/mkdir/mknod MDS requests,
so the MDS can handle creating the file/directory and setting the initial
ACL in one request.
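
The payload that ceph_pre_init_acls() builds for the request is, roughly, a
32-bit xattr count followed by (name, length, value) tuples. A condensed
sketch of the encoding, assuming only the access ACL is present (reservation
and error handling omitted):

    val_size = posix_acl_xattr_size(acl->a_count);

    ceph_pagelist_encode_32(pagelist, 1);   /* one xattr follows */
    ceph_pagelist_encode_string(pagelist, POSIX_ACL_XATTR_ACCESS,
                                strlen(POSIX_ACL_XATTR_ACCESS));
    posix_acl_to_xattr(&init_user_ns, acl, tmp_buf, val_size);
    ceph_pagelist_encode_32(pagelist, val_size);
    ceph_pagelist_append(pagelist, tmp_buf, val_size);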

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/acl.c   | 125 
 fs/ceph/dir.c   |  41 ++-
 fs/ceph/file.c  |  27 +---
 fs/ceph/super.h |  24 ---
 4 files changed, 170 insertions(+), 47 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..5bd853b 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,109 @@ out:
return ret;
 }
 
-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode 
*dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+  struct ceph_acls_info *info)
 {
-   struct posix_acl *default_acl, *acl;
-   umode_t new_mode = inode-i_mode;
-   int error;
-
-   error = posix_acl_create(dir, new_mode, default_acl, acl);
-   if (error)
-   return error;
-
-   if (!default_acl  !acl) {
-   cache_no_acl(inode);
-   if (new_mode != inode-i_mode) {
-   struct iattr newattrs = {
-   .ia_mode = new_mode,
-   .ia_valid = ATTR_MODE,
-   };
-   error = ceph_setattr(dentry, newattrs);
+   struct posix_acl *acl, *default_acl;
+   size_t val_size1 = 0, val_size2 = 0;
+   struct ceph_pagelist *pagelist = NULL;
+   void *tmp_buf = NULL;
+   int err;
+
+   err = posix_acl_create(dir, mode, default_acl, acl);
+   if (err)
+   return err;
+
+   if (acl) {
+   int ret = posix_acl_equiv_mode(acl, mode);
+   if (ret  0)
+   goto out_err;
+   if (ret == 0) {
+   posix_acl_release(acl);
+   acl = NULL;
}
-   return error;
}
 
-   if (default_acl) {
-   error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-   posix_acl_release(default_acl);
-   }
+   if (!default_acl  !acl)
+   return 0;
+
+   if (acl)
+   val_size1 = posix_acl_xattr_size(acl-a_count);
+   if (default_acl)
+   val_size2 = posix_acl_xattr_size(default_acl-a_count);
+
+   err = -ENOMEM;
+   tmp_buf = kmalloc(max(val_size1, val_size2), GFP_NOFS);
+   if (!tmp_buf)
+   goto out_err;
+   pagelist = kmalloc(sizeof(struct ceph_pagelist), GFP_NOFS);
+   if (!pagelist)
+   goto out_err;
+   ceph_pagelist_init(pagelist);
+
+   err = ceph_pagelist_reserve(pagelist, PAGE_SIZE);
+   if (err)
+   goto out_err;
+
+   ceph_pagelist_encode_32(pagelist, acl  default_acl ? 2 : 1);
+
if (acl) {
-   if (!error)
-   error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-   posix_acl_release(acl);
+   size_t len = strlen(POSIX_ACL_XATTR_ACCESS);
+   err = ceph_pagelist_reserve(pagelist, len + val_size1 + 8);
+   if (err)
+   goto out_err;
+   ceph_pagelist_encode_string(pagelist, POSIX_ACL_XATTR_ACCESS,
+   len);
+   err = posix_acl_to_xattr(init_user_ns, acl,
+tmp_buf, val_size1);
+   if (err  0)
+   goto out_err;
+   ceph_pagelist_encode_32(pagelist, val_size1);
+   ceph_pagelist_append(pagelist, tmp_buf, val_size1);
}
-   return error;
+   if (default_acl) {
+   size_t len = strlen(POSIX_ACL_XATTR_DEFAULT);
+   err = ceph_pagelist_reserve(pagelist, len + val_size2 + 8);
+   if (err)
+   goto out_err;
+   err = ceph_pagelist_encode_string(pagelist,
+ POSIX_ACL_XATTR_DEFAULT, len);
+   err = posix_acl_to_xattr(init_user_ns, default_acl,
+tmp_buf, val_size2);
+   if (err  0)
+   goto out_err;
+   ceph_pagelist_encode_32(pagelist, val_size2);
+   ceph_pagelist_append(pagelist, tmp_buf, val_size2);
+   }
+
+   kfree(tmp_buf);
+
+   info-acl = acl;
+   info-default_acl = default_acl;
+   info-pagelist = pagelist;
+   return 0;
+
+out_err:
+   posix_acl_release(acl);
+   posix_acl_release(default_acl);
+   kfree(tmp_buf);
+   if (pagelist)
+   ceph_pagelist_release(pagelist);
+   return

[PATCH] ceph: move ceph_find_inode() outside the s_mutex

2014-09-16 Thread Yan, Zheng
ceph_find_inode() may wait on a freeing inode, so using it inside the s_mutex
may cause deadlock. (The freeing inode is waiting for an OSD read reply, but
the dispatch thread is blocked by the s_mutex.)
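
The fix is purely an ordering change: do the inode lookup (which may sleep
waiting on a freeing inode) before taking the session mutex, so the dispatch
thread never blocks inside s_mutex. Schematically, for both the cap and lease
handlers:

    inode = ceph_find_inode(sb, vino);   /* may wait on a freeing inode */

    mutex_lock(&session->s_mutex);
    session->s_seq++;
    /* ... handle the cap / lease message ... */
    mutex_unlock(&session->s_mutex);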

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/caps.c   | 11 ++-
 fs/ceph/mds_client.c |  7 ---
 2 files changed, 10 insertions(+), 8 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 6d1cd45..b3b0a91 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3045,6 +3045,12 @@ void ceph_handle_caps(struct ceph_mds_session *session,
}
}
 
+   /* lookup ino */
+   inode = ceph_find_inode(sb, vino);
+   ci = ceph_inode(inode);
+   dout( op %s ino %llx.%llx inode %p\n, ceph_cap_op_name(op), vino.ino,
+vino.snap, inode);
+
mutex_lock(session-s_mutex);
session-s_seq++;
dout( mds%d seq %lld cap seq %u\n, session-s_mds, session-s_seq,
@@ -3053,11 +3059,6 @@ void ceph_handle_caps(struct ceph_mds_session *session,
if (op == CEPH_CAP_OP_IMPORT)
ceph_add_cap_releases(mdsc, session);
 
-   /* lookup ino */
-   inode = ceph_find_inode(sb, vino);
-   ci = ceph_inode(inode);
-   dout( op %s ino %llx.%llx inode %p\n, ceph_cap_op_name(op), vino.ino,
-vino.snap, inode);
if (!inode) {
dout( i don't have ino %llx\n, vino.ino);
 
diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 80d9f07..c27e204 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2947,14 +2947,15 @@ static void handle_lease(struct ceph_mds_client *mdsc,
if (dname.len != get_unaligned_le32(h+1))
goto bad;
 
-   mutex_lock(session-s_mutex);
-   session-s_seq++;
-
/* lookup inode */
inode = ceph_find_inode(sb, vino);
dout(handle_lease %s, ino %llx %p %.*s\n,
 ceph_lease_op_name(h-action), vino.ino, inode,
 dname.len, dname.name);
+
+   mutex_lock(session-s_mutex);
+   session-s_seq++;
+
if (inode == NULL) {
dout(handle_lease no inode %llx\n, vino.ino);
goto release;
-- 
1.9.3



[PATCH] ceph: include the initial ACL in create/mkdir/mknod MDS requests

2014-09-15 Thread Yan, Zheng
The current code sets a new file/directory's initial ACL in a non-atomic manner.
The client first sends a request to the MDS to create the new file/directory,
then sets the initial ACL after the new file/directory is successfully created.

The fix is to include the initial ACL in create/mkdir/mknod MDS requests,
so the MDS can handle creating the file/directory and setting the initial
ACL in one request.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/acl.c| 124 +--
 fs/ceph/dir.c|  41 -
 fs/ceph/file.c   |  27 ---
 fs/ceph/mds_client.c |   1 -
 fs/ceph/super.h  |  27 ---
 5 files changed, 174 insertions(+), 46 deletions(-)

diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
index cebf2eb..2b7f497 100644
--- a/fs/ceph/acl.c
+++ b/fs/ceph/acl.c
@@ -169,36 +169,112 @@ out:
return ret;
 }
 
-int ceph_init_acl(struct dentry *dentry, struct inode *inode, struct inode 
*dir)
+int ceph_pre_init_acls(struct inode *dir, umode_t *mode,
+  struct ceph_acls_info *info)
 {
-   struct posix_acl *default_acl, *acl;
-   umode_t new_mode = inode-i_mode;
-   int error;
-
-   error = posix_acl_create(dir, new_mode, default_acl, acl);
-   if (error)
-   return error;
-
-   if (!default_acl  !acl) {
-   cache_no_acl(inode);
-   if (new_mode != inode-i_mode) {
-   struct iattr newattrs = {
-   .ia_mode = new_mode,
-   .ia_valid = ATTR_MODE,
-   };
-   error = ceph_setattr(dentry, newattrs);
+   struct posix_acl *acl, *default_acl;
+   size_t buf_size, name_len1, name_len2, val_size1, val_size2;
+   struct page **pages = NULL;
+   void  *p, *xattr_buf = NULL;
+   int err, i, nr_pages;
+
+   err = posix_acl_create(dir, mode, default_acl, acl);
+   if (err)
+   return err;
+
+   if (acl) {
+   int ret = posix_acl_equiv_mode(acl, mode);
+   if (ret  0)
+   goto out_err;
+   if (ret == 0) {
+   posix_acl_release(acl);
+   acl = NULL;
}
-   return error;
}
 
+   if (!default_acl  !acl)
+   return 0;
+
+   buf_size = 4;
+   if (acl) {
+   name_len1 = strlen(POSIX_ACL_XATTR_ACCESS);
+   val_size1 = posix_acl_xattr_size(acl-a_count);
+   buf_size += 8 + name_len1 + val_size1;
+   } else {
+   name_len1 = val_size1 = 0;
+   }
if (default_acl) {
-   error = ceph_set_acl(inode, default_acl, ACL_TYPE_DEFAULT);
-   posix_acl_release(default_acl);
+   name_len2 = strlen(POSIX_ACL_XATTR_DEFAULT);
+   val_size2 = posix_acl_xattr_size(default_acl-a_count);
+   buf_size += 8 + name_len2 + val_size2;
+   } else {
+   name_len2 = val_size2 = 0;
}
+
+   err = -ENOMEM;
+   nr_pages = calc_pages_for(0, buf_size);
+   pages = kmalloc(sizeof(struct page*) * nr_pages, GFP_NOFS);
+   if (!pages)
+   goto out_err;
+   xattr_buf = kmalloc(PAGE_SIZE * nr_pages, GFP_NOFS);
+   if (!xattr_buf)
+   goto out_err;
+
+   pages[0] = virt_to_page(xattr_buf);
+   for (i = 1; i  nr_pages; i++)
+   pages[i] = pages[0] + i;
+
+   p = xattr_buf;
+   ceph_encode_32(p, acl  default_acl ? 2 : 1);
+
if (acl) {
-   if (!error)
-   error = ceph_set_acl(inode, acl, ACL_TYPE_ACCESS);
-   posix_acl_release(acl);
+   ceph_encode_32(p, name_len1);
+   memcpy(p, POSIX_ACL_XATTR_ACCESS, name_len1);
+   p += name_len1;
+   ceph_encode_32(p, val_size1);
+   err = posix_acl_to_xattr(init_user_ns, acl, p, val_size1);
+   if (err  0)
+   goto out_err;
+   p += val_size1;
}
-   return error;
+   if (default_acl) {
+   ceph_encode_32(p, name_len2);
+   memcpy(p, POSIX_ACL_XATTR_DEFAULT, name_len2);
+   p += name_len2;
+   ceph_encode_32(p, val_size2);
+   err = posix_acl_to_xattr(init_user_ns, default_acl, p, 
val_size2);
+   if (err  0)
+   goto out_err;
+   p += val_size2;
+   }
+
+   info-acl = acl;
+   info-default_acl = default_acl;
+   info-xattr_buf = xattr_buf;
+   info-buf_pages = pages;
+   info-buf_size = buf_size;
+   return 0;
+
+out_err:
+   posix_acl_release(acl);
+   posix_acl_release(default_acl);
+   kfree(pages);
+   kfree(xattr_buf);
+   return err;
+}
+
+void ceph_init_inode_acls(struct inode* inode, struct ceph_acls_info *info)
+{
+   if (!inode

[PATCH 1/2] ceph: protect kick_requests() with mdsc-mutex

2014-09-11 Thread Yan, Zheng
From: Yan, Zheng uker...@gmail.com

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c | 5 +++--
 1 file changed, 3 insertions(+), 2 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index f751fea..267ba44 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2471,9 +2471,8 @@ static void handle_session(struct ceph_mds_session 
*session,
if (session-s_state == CEPH_MDS_SESSION_RECONNECTING)
pr_info(mds%d reconnect denied\n, session-s_mds);
remove_session_caps(session);
-   wake = 1; /* for good measure */
+   wake = 2; /* for good measure */
wake_up_all(mdsc-session_close_wq);
-   kick_requests(mdsc, mds);
break;
 
case CEPH_SESSION_STALE:
@@ -2503,6 +2502,8 @@ static void handle_session(struct ceph_mds_session 
*session,
if (wake) {
mutex_lock(mdsc-mutex);
__wake_requests(mdsc, session-s_waiting);
+   if (wake == 2)
+   kick_requests(mdsc, mds);
mutex_unlock(mdsc-mutex);
}
return;
-- 
1.9.3



[PATCH 2/2] ceph: make sure request isn't in any waiting list when kicking request.

2014-09-11 Thread Yan, Zheng
From: Yan, Zheng uker...@gmail.com

We may corrupt the waiting list if a request that is on the waiting list is kicked.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index 267ba44..a17fc49 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2078,6 +2078,7 @@ static void kick_requests(struct ceph_mds_client *mdsc, 
int mds)
if (req->r_session &&
req->r_session->s_mds == mds) {
dout(" kicking tid %llu\n", req->r_tid);
+   list_del_init(&req->r_wait);
__do_request(mdsc, req);
}
}
-- 
1.9.3



[PATCH] ceph: trim unused inodes before reconnecting to recovering MDS

2014-09-10 Thread Yan, Zheng
This way the recovering MDS does not need to fetch these unused inodes during
cache rejoin, which may reduce MDS recovery time.

Signed-off-by: Yan, Zheng z...@redhat.com
---
 fs/ceph/mds_client.c | 23 +--
 1 file changed, 13 insertions(+), 10 deletions(-)

diff --git a/fs/ceph/mds_client.c b/fs/ceph/mds_client.c
index bad07c0..f751fea 100644
--- a/fs/ceph/mds_client.c
+++ b/fs/ceph/mds_client.c
@@ -2695,16 +2695,6 @@ static void send_mds_reconnect(struct ceph_mds_client 
*mdsc,
session-s_state = CEPH_MDS_SESSION_RECONNECTING;
session-s_seq = 0;
 
-   ceph_con_close(&session->s_con);
-   ceph_con_open(&session->s_con,
- CEPH_ENTITY_TYPE_MDS, mds,
- ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
-
-   /* replay unsafe requests */
-   replay_unsafe_requests(mdsc, session);
-
-   down_read(&mdsc->snap_rwsem);
-
dout(session %p state %s\n, session,
 session_state_name(session-s_state));
 
@@ -2723,6 +2713,19 @@ static void send_mds_reconnect(struct ceph_mds_client 
*mdsc,
discard_cap_releases(mdsc, session);
spin_unlock(session-s_cap_lock);
 
+   /* trim unused caps to reduce MDS's cache rejoin time */
+   shrink_dcache_parent(mdsc->fsc->sb->s_root);
+
+   ceph_con_close(&session->s_con);
+   ceph_con_open(&session->s_con,
+ CEPH_ENTITY_TYPE_MDS, mds,
+ ceph_mdsmap_get_addr(mdsc->mdsmap, mds));
+
+   /* replay unsafe requests */
+   replay_unsafe_requests(mdsc, session);
+
+   down_read(&mdsc->snap_rwsem);
+
/* traverse this session's caps */
s_nr_caps = session-s_nr_caps;
err = ceph_pagelist_encode_32(pagelist, s_nr_caps);
-- 
1.9.3



Re: [PATCH] ceph: refactor readpage_nounlock() to make the logic clearer

2014-05-28 Thread Yan, Zheng
On Wed, May 28, 2014 at 3:09 PM, Zhang Zhen zhenzhang.zh...@huawei.com wrote:
 If the return value of ceph_osdc_readpages() is not negative,
 it is certainly greater than or equal to zero.

 Remove the useless condition judgment and redundant braces.

 Signed-off-by: Zhang Zhen zhenzhang.zh...@huawei.com
 ---
  fs/ceph/addr.c | 17 +++--
  1 file changed, 7 insertions(+), 10 deletions(-)

 diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
 index b53278c..6aa2e3f 100644
 --- a/fs/ceph/addr.c
 +++ b/fs/ceph/addr.c
 @@ -211,18 +211,15 @@ static int readpage_nounlock(struct file *filp, struct 
 page *page)
 SetPageError(page);
 ceph_fscache_readpage_cancel(inode, page);
 goto out;
 -   } else {
 -   if (err  PAGE_CACHE_SIZE) {
 -   /* zero fill remainder of page */
 -   zero_user_segment(page, err, PAGE_CACHE_SIZE);
 -   } else {
 -   flush_dcache_page(page);
 -   }
 }
 -   SetPageUptodate(page);
 +   if (err  PAGE_CACHE_SIZE)
 +   /* zero fill remainder of page */
 +   zero_user_segment(page, err, PAGE_CACHE_SIZE);
 +   else
 +   flush_dcache_page(page);

 -   if (err = 0)
 -   ceph_readpage_to_fscache(inode, page);
 +   SetPageUptodate(page);
 +   ceph_readpage_to_fscache(inode, page);

  out:
 return err  0 ? err : 0;
 --

Added to testing branch of ceph-client

Thanks
Yan, Zheng


 1.8.1.2


 .






Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-30 Thread Yan, Zheng
On Wed, Apr 30, 2014 at 3:07 PM, Mohd Bazli Ab Karim
bazli.abka...@mimos.my wrote:
 Hi Zheng,

 Sorry for the late reply. For sure, I will try this again after we completely 
 verifying all content in the file system. Hopefully all will be good.
 And, please confirm this, I will set debug_mds=10 for the ceph-mds, and do 
 you want me to send the ceph-mon log too?

yes please.


 BTW, how to confirm that the mds has passed the beacon to mon or not?

read monitor's log

Regards
Yan, Zheng


 Thank you so much Zheng!

 Bazli

 -Original Message-
 From: Yan, Zheng [mailto:uker...@gmail.com]
 Sent: Tuesday, April 29, 2014 10:13 PM
 To: Mohd Bazli Ab Karim
 Cc: Luke Jing Yuan; Wong Ming Tat
 Subject: Re: [ceph-users] Ceph mds laggy and failed assert in function replay 
 mds/journal.cc

 On Tue, Apr 29, 2014 at 5:30 PM, Mohd Bazli Ab Karim bazli.abka...@mimos.my 
 wrote:
 Hi Zheng,

 The another issue that Luke mentioned just now was like this.
 At first, we ran one mds (mon01) with the new compiled ceph-mds. It works 
 fine with only one MDS running at that time. However, when we ran two more 
 MDSes mon02 mon03 with the new compiled ceph-mds, it started acting weird.
 Mon01 which was became active at first, will have the error and started to 
 respawning. Once respawning happened, mon03 will take over from mon01 as 
 master mds, and replay happened again.
 Again, when mon03 became active, it will have the same error like below, and 
 respawning again. So, it seems to me that replay will continue to happen 
 from one mds to another when they got respawned.

 2014-04-29 15:36:24.917798 7f5c36476700  1 mds.0.server
 reconnect_clients -- 1 sessions
 2014-04-29 15:36:24.919620 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401
  10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=0 pgs=0
 cs=0 l=0 c=0x37056e0).accept peer addr is really
 10.1.64.181:0/1558263174 (socket is 10.1.64.181:57649/0)
 2014-04-29 15:36:24.921661 7f5c36476700  0 log [DBG] : reconnect by
 client.884169 10.1.64.181:0/1558263174 after 0.003774
 2014-04-29 15:36:24.921786 7f5c36476700  1 mds.0.12858 reconnect_done
 2014-04-29 15:36:25.109391 7f5c36476700  1 mds.0.12858 handle_mds_map
 i am now mds.0.12858
 2014-04-29 15:36:25.109413 7f5c36476700  1 mds.0.12858 handle_mds_map
 state change up:reconnect -- up:rejoin
 2014-04-29 15:36:25.109417 7f5c36476700  1 mds.0.12858 rejoin_start
 2014-04-29 15:36:26.918067 7f5c36476700  1 mds.0.12858
 rejoin_joint_start
 2014-04-29 15:36:33.520985 7f5c36476700  1 mds.0.12858 rejoin_done
 2014-04-29 15:36:36.252925 7f5c36476700  1 mds.0.12858 handle_mds_map
 i am now mds.0.12858
 2014-04-29 15:36:36.252927 7f5c36476700  1 mds.0.12858 handle_mds_map
 state change up:rejoin -- up:active
 2014-04-29 15:36:36.252932 7f5c36476700  1 mds.0.12858 recovery_done -- 
 successful recovery!
 2014-04-29 15:36:36.745833 7f5c36476700  1 mds.0.12858 active_start
 2014-04-29 15:36:36.987854 7f5c36476700  1 mds.0.12858 cluster recovered.
 2014-04-29 15:36:40.182604 7f5c36476700  0 mds.0.12858
 handle_mds_beacon no longer laggy
 2014-04-29 15:36:57.947441 7f5c2fb3e700  0 -- 10.4.118.23:6800/26401
  10.1.64.181:0/1558263174 pipe(0x2924f5780 sd=41 :6800 s=2 pgs=156
 cs=1 l=0 c=0x37056e0).fault with nothing to send, going to standby
 2014-04-29 15:37:10.534593 7f5c36476700  1 mds.-1.-1 handle_mds_map i
 (10.4.118.23:6800/26401) dne in the mdsmap, respawning myself
 2014-04-29 15:37:10.534604 7f5c36476700  1 mds.-1.-1 respawn
 2014-04-29 15:37:10.534609 7f5c36476700  1 mds.-1.-1  e: '/usr/bin/ceph-mds'
 2014-04-29 15:37:10.534612 7f5c36476700  1 mds.-1.-1  0: '/usr/bin/ceph-mds'
 2014-04-29 15:37:10.534616 7f5c36476700  1 mds.-1.-1  1: '--cluster=ceph'
 2014-04-29 15:37:10.534619 7f5c36476700  1 mds.-1.-1  2: '-i'
 2014-04-29 15:37:10.534621 7f5c36476700  1 mds.-1.-1  3: 'mon03'
 2014-04-29 15:37:10.534623 7f5c36476700  1 mds.-1.-1  4: '-f'
 2014-04-29 15:37:10.534641 7f5c36476700  1 mds.-1.-1  cwd /
 2014-04-29 15:37:12.155458 7f8907c8b780  0 ceph version  (), process
 ceph-mds, pid 26401
 2014-04-29 15:37:12.249780 7f8902d10700  1 mds.-1.0 handle_mds_map
 standby

 p/s. we ran ceph-mon and ceph-mds on same servers, (mon01,mon02,mon03)

 I sent to you two log files, mon01 and mon03 where the scenario of mon03 
 have state-standby-replay-active-respawned. And also, mon01 which is now 
 running as active as a single MDS at this moment.


 After the MDS became active, it did not send a beacon to the monitor. It seems
 like the MDS was busy doing something else. If this issue still happens, set
 debug_mds=10 and send the log to me.

 Regards
 Yan, Zheng

 Regards,
 Bazli
 -Original Message-
 From: Luke Jing Yuan
 Sent: Tuesday, April 29, 2014 4:46 PM
 To: Yan, Zheng
 Cc: Mohd Bazli Ab Karim; Wong Ming Tat
 Subject: RE: [ceph-users] Ceph mds laggy and failed assert in function
 replay mds/journal.cc

 Hi Zheng,

 Thanks for the information. Actually we encounter another issue, in our 
 original setup, we have 3 MDS running

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

 We had applied the patch and recompile ceph as well as updated the
 ceph.conf as per suggested, when we re-run ceph-mds we noticed the
 following:


 2014-04-29 10:45:22.260798 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366457,12681393 no session for client.324186
 2014-04-29 10:45:22.262419 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366475,12681393 no session for client.324186
 2014-04-29 10:45:22.267699 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:5135,12681393 no session for client.324186
 2014-04-29 10:45:22.271664 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366724,12681393 no session for client.324186
 2014-04-29 10:45:22.281050 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366945,12681393 no session for client.324186
 2014-04-29 10:45:22.283196 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366996,12681393 no session for client.324186
 2014-04-29 10:45:22.287801 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367043,12681393 no session for client.324186
 2014-04-29 10:45:22.289967 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367082,12681393 no session for client.324186
 2014-04-29 10:45:22.291026 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367110,12681393 no session for client.324186
 2014-04-29 10:45:22.294459 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367192,12681393 no session for client.324186
 2014-04-29 10:45:22.297228 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367257,12681393 no session for client.324186
 2014-04-29 10:45:22.297477 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367264,12681393 no session for client.324186

 tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @  0x7f90c2564da7
 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a
 0x7f90c0cca3fd
 tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @
 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed
 0x7f90c231de9a 0x7f90c0cca3fd

 ceph -s shows that MDS up:replay,

 Also the messages above seemed to be repeating again after a while but
 with a different session number. Is there a way for us to determine
 that we are on the right track? Thanks.


It's on the right track as long as the MDS doesn't crash.

 Regards,
 Luke

 On Sun, Apr 27, 2014 at 12:04 PM, Yan, Zheng uker...@gmail.com wrote:
 On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi Greg,

 Actually our cluster is pretty empty, but we suspect we had a temporary
 network disconnection to one of our OSD, not sure if this caused the
 problem.

 Anyway we don't mind try the method you mentioned, how can we do that?


 compile ceph-mds with the attached patch. add a line mds
 wipe_sessions = 1 to the ceph.conf,

 Yan, Zheng

 Regards,
 Luke


 On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote:

 Hmm, it looks like your on-disk SessionMap is horrendously out of
 date. Did your cluster get full at some point?

 In any case, we're working on tools to repair this now but they aren't
 ready for use yet. Probably the only thing you could do is create an
 empty sessionmap with a higher version than the ones the journal
 refers to, but that might have other fallout effects...
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim
 bazli.abka...@mimos.my wrote:
  More logs. I ran ceph-mds with debug-mds=20.
 
  -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay inotable tablev 4316124 = table 4317932
  -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc
  [141df86~1] used 141db9e
0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In
  function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' 
  thread
  7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED
  assert(session)
 
  Please look at the attachment for more details.
 
  Regards,
  Bazli
 
  From: Mohd Bazli Ab Karim
  Sent: Friday, April 25, 2014 12:26 PM
  To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com
  Subject: Ceph mds laggy and failed assert in function replay
  mds/journal.cc
 
  Dear Ceph-devel, ceph-users,
 
  I am currently facing an issue with my Ceph MDS server. The ceph-mds
  daemon does not want to come back up.
  I tried running it manually with ceph-mds -i mon01 -d, but it shows
  that it is stuck at the failed assert(session) at line 1303 in
  mds/journal.cc and then aborts.
 
  Can someone shed some light on this issue?
  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 
  Let me know if I need to send log with debug enabled.
 
  Regards,
  Bazli
 

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-29 Thread Yan, Zheng
On Tue, Apr 29, 2014 at 3:13 PM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

 Assuming we get the MDS working and back on track, should we still
 leave mds_wipe_sessions in ceph.conf, or remove it and restart the MDS?
 Thanks.

No.

It has been several hours. Has the MDS still not finished replaying the journal?

Regards
Yan, Zheng


 Regards,
 Luke


 On Tue, Apr 29, 2014 at 2:12 PM, Yan, Zheng uker...@gmail.com wrote:
 On Tue, Apr 29, 2014 at 11:24 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi,

  We applied the patch, recompiled Ceph, and updated ceph.conf as
  suggested. When we re-ran ceph-mds we noticed the following:


 2014-04-29 10:45:22.260798 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366457,12681393 no session for client.324186
 2014-04-29 10:45:22.262419 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366475,12681393 no session for client.324186
 2014-04-29 10:45:22.267699 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:5135,12681393 no session for client.324186
 2014-04-29 10:45:22.271664 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366724,12681393 no session for client.324186
 2014-04-29 10:45:22.281050 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366945,12681393 no session for client.324186
 2014-04-29 10:45:22.283196 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51366996,12681393 no session for client.324186
 2014-04-29 10:45:22.287801 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367043,12681393 no session for client.324186
 2014-04-29 10:45:22.289967 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367082,12681393 no session for client.324186
 2014-04-29 10:45:22.291026 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367110,12681393 no session for client.324186
 2014-04-29 10:45:22.294459 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367192,12681393 no session for client.324186
 2014-04-29 10:45:22.297228 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367257,12681393 no session for client.324186
 2014-04-29 10:45:22.297477 7f90b971d700  0 log [WRN] :  replayed op
 client.324186:51367264,12681393 no session for client.324186

 tcmalloc: large alloc 1136660480 bytes == 0xb2019000 @  0x7f90c2564da7
 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed 0x7f90c231de9a
 0x7f90c0cca3fd
 tcmalloc: large alloc 2273316864 bytes == 0x15d73d000 @
 0x7f90c2564da7 0x5bb9cb 0x5ac8eb 0x5b32f7 0x79ecd8 0x58cbed
 0x7f90c231de9a 0x7f90c0cca3fd

 ceph -s shows that MDS up:replay,

  Also, the messages above seem to repeat after a while, but with a
  different session number. Is there a way for us to determine that we
  are on the right track? Thanks.


 It's on the right track as long as the MDS doesn't crash.

 Regards,
 Luke

 On Sun, Apr 27, 2014 at 12:04 PM, Yan, Zheng uker...@gmail.com wrote:
 On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi Greg,

  Actually our cluster is pretty empty, but we suspect we had a temporary
  network disconnection to one of our OSDs; we are not sure if this caused
  the problem.

  Anyway, we don't mind trying the method you mentioned; how can we do that?


  Compile ceph-mds with the attached patch, and add the line
  "mds wipe_sessions = 1" to ceph.conf.

 Yan, Zheng

 Regards,
 Luke


 On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote:

 Hmm, it looks like your on-disk SessionMap is horrendously out of
 date. Did your cluster get full at some point?

 In any case, we're working on tools to repair this now but they aren't
 ready for use yet. Probably the only thing you could do is create an
 empty sessionmap with a higher version than the ones the journal
 refers to, but that might have other fallout effects...
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim
 bazli.abka...@mimos.my wrote:
   More logs. I ran ceph-mds with debug-mds=20.
 
  -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay inotable tablev 4316124 = table 4317932
  -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc
  [141df86~1] used 141db9e
0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In
  function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' 
  thread
  7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: 
  FAILED
  assert(session)
 
  Please look at the attachment for more details.
 
  Regards,
  Bazli
 
  From: Mohd Bazli Ab Karim
  Sent: Friday, April 25, 2014 12:26 PM
  To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com
  Subject: Ceph mds laggy and failed assert in function replay
  mds/journal.cc
 
  Dear Ceph-devel, ceph-users,
 
   I am currently facing an issue with my Ceph MDS server. The ceph-mds
   daemon does not want to come back up.
   I tried running it manually with ceph-mds -i mon01 -d

Re: [ceph-users] Ceph mds laggy and failed assert in function replay mds/journal.cc

2014-04-26 Thread Yan, Zheng
On Sat, Apr 26, 2014 at 9:56 AM, Jingyuan Luke jyl...@gmail.com wrote:
 Hi Greg,

 Actually our cluster is pretty empty, but we suspect we had a temporary
 network disconnection to one of our OSDs; we are not sure if this caused
 the problem.

 Anyway, we don't mind trying the method you mentioned; how can we do that?


Compile ceph-mds with the attached patch, and add the line
"mds wipe_sessions = 1" to ceph.conf.
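
For illustration, the resulting ceph.conf fragment would look something
like this (the [mds] section header is an assumption; only the
wipe_sessions line itself comes from the patch):

    [mds]
        # temporary workaround: let the patched ceph-mds skip the stale
        # SessionMap entries while replaying the journal
        mds wipe_sessions = 1

Remove the line again once the MDS is healthy.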

Yan, Zheng

 Regards,
 Luke


 On Saturday, April 26, 2014, Gregory Farnum g...@inktank.com wrote:

 Hmm, it looks like your on-disk SessionMap is horrendously out of
 date. Did your cluster get full at some point?

 In any case, we're working on tools to repair this now but they aren't
 ready for use yet. Probably the only thing you could do is create an
 empty sessionmap with a higher version than the ones the journal
 refers to, but that might have other fallout effects...
 -Greg
 Software Engineer #42 @ http://inktank.com | http://ceph.com


 On Fri, Apr 25, 2014 at 2:57 AM, Mohd Bazli Ab Karim
 bazli.abka...@mimos.my wrote:
  More logs. I ran ceph-mds with debug-mds=20.
 
  -2 2014-04-25 17:47:54.839672 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay inotable tablev 4316124 = table 4317932
  -1 2014-04-25 17:47:54.839674 7f0d6f3f0700 10 mds.0.journal
  EMetaBlob.replay sessionmap v8632368 -(1|2) == table 7239603 prealloc
  [141df86~1] used 141db9e
0 2014-04-25 17:47:54.840733 7f0d6f3f0700 -1 mds/journal.cc: In
  function 'void EMetaBlob::replay(MDS*, LogSegment*, MDSlaveUpdate*)' thread
  7f0d6f3f0700 time 2014-04-25 17:47:54.839688 mds/journal.cc: 1303: FAILED
  assert(session)
 
  Please look at the attachment for more details.
 
  Regards,
  Bazli
 
  From: Mohd Bazli Ab Karim
  Sent: Friday, April 25, 2014 12:26 PM
  To: 'ceph-devel@vger.kernel.org'; ceph-us...@lists.ceph.com
  Subject: Ceph mds laggy and failed assert in function replay
  mds/journal.cc
 
  Dear Ceph-devel, ceph-users,
 
  I am currently facing an issue with my Ceph MDS server. The ceph-mds
  daemon does not want to come back up.
  I tried running it manually with ceph-mds -i mon01 -d, but it shows
  that it is stuck at the failed assert(session) at line 1303 in
  mds/journal.cc and then aborts.
 
  Can someone shed some light on this issue?
  ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
 
  Let me know if I need to send log with debug enabled.
 
  Regards,
  Bazli
 
  

[PATCH 3/6] ceph: pre-allocate ceph_cap struct for ceph_add_cap()

2014-04-18 Thread Yan, Zheng
So that ceph_add_cap() can be used while i_ceph_lock is locked.
This simplifies the code that handles cap import/export.
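
For context, the resulting usage pattern looks roughly like the sketch
below (simplified, not a literal quote of the patch; the variable names
follow the call sites changed later in this series):

	struct ceph_cap *new_cap = ceph_get_cap(mdsc, NULL);	/* may sleep */

	spin_lock(&ci->i_ceph_lock);
	/* ceph_add_cap() consumes *new_cap only if no cap exists for this MDS */
	ceph_add_cap(inode, session, cap_id, -1, caps, wanted,
		     seq, mseq, realmino, CEPH_CAP_FLAG_AUTH, &new_cap);
	spin_unlock(&ci->i_ceph_lock);

	if (new_cap)	/* the pre-allocated cap was not needed */
		ceph_put_cap(mdsc, new_cap);

The allocation (which may sleep) happens before i_ceph_lock is taken, so
the add itself can run entirely under the spinlock.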

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c  | 81 +++--
 fs/ceph/inode.c | 70 +++--
 fs/ceph/super.h | 13 -
 3 files changed, 85 insertions(+), 79 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 5f6d24e..73a42f5 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -221,8 +221,8 @@ int ceph_unreserve_caps(struct ceph_mds_client *mdsc,
return 0;
 }
 
-static struct ceph_cap *get_cap(struct ceph_mds_client *mdsc,
-   struct ceph_cap_reservation *ctx)
+struct ceph_cap *ceph_get_cap(struct ceph_mds_client *mdsc,
+ struct ceph_cap_reservation *ctx)
 {
struct ceph_cap *cap = NULL;
 
@@ -508,15 +508,14 @@ static void __check_cap_issue(struct ceph_inode_info *ci, 
struct ceph_cap *cap,
  * it is  0.  (This is so we can atomically add the cap and add an
  * open file reference to it.)
  */
-int ceph_add_cap(struct inode *inode,
-struct ceph_mds_session *session, u64 cap_id,
-int fmode, unsigned issued, unsigned wanted,
-unsigned seq, unsigned mseq, u64 realmino, int flags,
-struct ceph_cap_reservation *caps_reservation)
+void ceph_add_cap(struct inode *inode,
+ struct ceph_mds_session *session, u64 cap_id,
+ int fmode, unsigned issued, unsigned wanted,
+ unsigned seq, unsigned mseq, u64 realmino, int flags,
+ struct ceph_cap **new_cap)
 {
struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)-mdsc;
struct ceph_inode_info *ci = ceph_inode(inode);
-   struct ceph_cap *new_cap = NULL;
struct ceph_cap *cap;
int mds = session-s_mds;
int actual_wanted;
@@ -531,20 +530,10 @@ int ceph_add_cap(struct inode *inode,
if (fmode = 0)
wanted |= ceph_caps_for_mode(fmode);
 
-retry:
-   spin_lock(ci-i_ceph_lock);
cap = __get_cap_for_mds(ci, mds);
if (!cap) {
-   if (new_cap) {
-   cap = new_cap;
-   new_cap = NULL;
-   } else {
-   spin_unlock(ci-i_ceph_lock);
-   new_cap = get_cap(mdsc, caps_reservation);
-   if (new_cap == NULL)
-   return -ENOMEM;
-   goto retry;
-   }
+   cap = *new_cap;
+   *new_cap = NULL;
 
cap-issued = 0;
cap-implemented = 0;
@@ -562,9 +551,6 @@ retry:
session-s_nr_caps++;
spin_unlock(session-s_cap_lock);
} else {
-   if (new_cap)
-   ceph_put_cap(mdsc, new_cap);
-
/*
 * auth mds of the inode changed. we received the cap export
 * message, but still haven't received the cap import message.
@@ -626,7 +612,6 @@ retry:
ci-i_auth_cap = cap;
cap-mds_wanted = wanted;
}
-   ci-i_cap_exporting_issued = 0;
} else {
WARN_ON(ci-i_auth_cap == cap);
}
@@ -648,9 +633,6 @@ retry:
 
if (fmode = 0)
__ceph_get_fmode(ci, fmode);
-   spin_unlock(ci-i_ceph_lock);
-   wake_up_all(ci-i_cap_wq);
-   return 0;
 }
 
 /*
@@ -685,7 +667,7 @@ static int __cap_is_valid(struct ceph_cap *cap)
  */
 int __ceph_caps_issued(struct ceph_inode_info *ci, int *implemented)
 {
-   int have = ci-i_snap_caps | ci-i_cap_exporting_issued;
+   int have = ci-i_snap_caps;
struct ceph_cap *cap;
struct rb_node *p;
 
@@ -900,7 +882,7 @@ int __ceph_caps_mds_wanted(struct ceph_inode_info *ci)
  */
 static int __ceph_is_any_caps(struct ceph_inode_info *ci)
 {
-   return !RB_EMPTY_ROOT(ci-i_caps) || ci-i_cap_exporting_issued;
+   return !RB_EMPTY_ROOT(ci-i_caps);
 }
 
 int ceph_is_any_caps(struct inode *inode)
@@ -2796,7 +2778,7 @@ static void handle_cap_export(struct inode *inode, struct 
ceph_mds_caps *ex,
 {
struct ceph_mds_client *mdsc = ceph_inode_to_client(inode)-mdsc;
struct ceph_mds_session *tsession = NULL;
-   struct ceph_cap *cap, *tcap;
+   struct ceph_cap *cap, *tcap, *new_cap = NULL;
struct ceph_inode_info *ci = ceph_inode(inode);
u64 t_cap_id;
unsigned mseq = le32_to_cpu(ex-migrate_seq);
@@ -2858,15 +2840,14 @@ retry:
}
__ceph_remove_cap(cap, false);
goto out_unlock;
-   }
-
-   if (tsession) {
-   int flag = (cap == ci-i_auth_cap) ? CEPH_CAP_FLAG_AUTH : 0;
-   spin_unlock(ci-i_ceph_lock);
+   } else if (tsession) {
/* add

[PATCH 5/6] ceph: introduce ceph_fill_fragtree()

2014-04-18 Thread Yan, Zheng
Move the code that updates the i_fragtree into a separate function.
Also add a simple probabilistic test to decide whether the i_fragtree
should be updated.
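
The probabilistic test condenses to the following (extracted from the
diff below, shown here only to make the idea explicit):

	nsplits = le32_to_cpu(fragtree->nsplits);
	if (nsplits) {
		/* check one random split; if we don't know it, rebuild */
		i = prandom_u32() % nsplits;
		id = le32_to_cpu(fragtree->splits[i].frag);
		if (!__ceph_find_frag(ci, id))
			update = true;
	}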

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/inode.c | 129 
 1 file changed, 84 insertions(+), 45 deletions(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 65462ad..80528ff 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -10,6 +10,7 @@
 #include <linux/writeback.h>
 #include <linux/vmalloc.h>
 #include <linux/posix_acl.h>
+#include <linux/random.h>
 
 #include "super.h"
 #include "mds_client.h"
@@ -179,9 +180,8 @@ struct ceph_inode_frag *__ceph_find_frag(struct 
ceph_inode_info *ci, u32 f)
  * specified, copy the frag delegation info to the caller if
  * it is present.
  */
-u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
-struct ceph_inode_frag *pfrag,
-int *found)
+static u32 __ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
+ struct ceph_inode_frag *pfrag, int *found)
 {
u32 t = ceph_frag_make(0, 0);
struct ceph_inode_frag *frag;
@@ -191,7 +191,6 @@ u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
if (found)
*found = 0;
 
-   mutex_lock(ci-i_fragtree_mutex);
while (1) {
WARN_ON(!ceph_frag_contains_value(t, v));
frag = __ceph_find_frag(ci, t);
@@ -220,10 +219,19 @@ u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
}
dout(choose_frag(%x) = %x\n, v, t);
 
-   mutex_unlock(ci-i_fragtree_mutex);
return t;
 }
 
+u32 ceph_choose_frag(struct ceph_inode_info *ci, u32 v,
+struct ceph_inode_frag *pfrag, int *found)
+{
+   u32 ret;
+   mutex_lock(ci-i_fragtree_mutex);
+   ret = __ceph_choose_frag(ci, v, pfrag, found);
+   mutex_unlock(ci-i_fragtree_mutex);
+   return ret;
+}
+
 /*
  * Process dirfrag (delegation) info from the mds.  Include leaf
  * fragment in tree ONLY if ndist  0.  Otherwise, only
@@ -286,6 +294,75 @@ out:
return err;
 }
 
+static int ceph_fill_fragtree(struct inode *inode,
+ struct ceph_frag_tree_head *fragtree,
+ struct ceph_mds_reply_dirfrag *dirinfo)
+{
+   struct ceph_inode_info *ci = ceph_inode(inode);
+   struct ceph_inode_frag *frag;
+   struct rb_node *rb_node;
+   int i;
+   u32 id, nsplits;
+   bool update = false;
+
+   mutex_lock(ci-i_fragtree_mutex);
+   nsplits = le32_to_cpu(fragtree-nsplits);
+   if (nsplits) {
+   i = prandom_u32() % nsplits;
+   id = le32_to_cpu(fragtree-splits[i].frag);
+   if (!__ceph_find_frag(ci, id))
+   update = true;
+   } else if (!RB_EMPTY_ROOT(ci-i_fragtree)) {
+   rb_node = rb_first(ci-i_fragtree);
+   frag = rb_entry(rb_node, struct ceph_inode_frag, node);
+   if (frag-frag != ceph_frag_make(0, 0) || rb_next(rb_node))
+   update = true;
+   }
+   if (!update  dirinfo) {
+   id = le32_to_cpu(dirinfo-frag);
+   if (id != __ceph_choose_frag(ci, id, NULL, NULL))
+   update = true;
+   }
+   if (!update)
+   goto out_unlock;
+
+   dout(fill_fragtree %llx.%llx\n, ceph_vinop(inode));
+   rb_node = rb_first(ci-i_fragtree);
+   for (i = 0; i  nsplits; i++) {
+   id = le32_to_cpu(fragtree-splits[i].frag);
+   frag = NULL;
+   while (rb_node) {
+   frag = rb_entry(rb_node, struct ceph_inode_frag, node);
+   if (ceph_frag_compare(frag-frag, id) = 0) {
+   if (frag-frag != id)
+   frag = NULL;
+   else
+   rb_node = rb_next(rb_node);
+   break;
+   }
+   rb_node = rb_next(rb_node);
+   rb_erase(frag-node, ci-i_fragtree);
+   kfree(frag);
+   frag = NULL;
+   }
+   if (!frag) {
+   frag = __get_or_create_frag(ci, id);
+   if (IS_ERR(frag))
+   continue;
+   }
+   frag-split_by = le32_to_cpu(fragtree-splits[i].by);
+   dout( frag %x split by %d\n, frag-frag, frag-split_by);
+   }
+   while (rb_node) {
+   frag = rb_entry(rb_node, struct ceph_inode_frag, node);
+   rb_node = rb_next(rb_node);
+   rb_erase(frag-node, ci-i_fragtree);
+   kfree(frag);
+   }
+out_unlock:
+   mutex_unlock(ci-i_fragtree_mutex);
+   return 0;
+}
 
 /*
  * initialize a newly allocated inode.
@@ -584,12 +661,8

[PATCH 4/6] ceph: handle cap import atomically

2014-04-18 Thread Yan, Zheng
Cap import messages are processed by both handle_cap_import() and
handle_cap_grant(). These two functions are not executed in the same
atomic context, so they can race with cap release.

The fix is to make handle_cap_import() not release i_ceph_lock when it
returns, and to let handle_cap_grant() release the lock after it
finishes its job.
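
After the change, the dispatch in ceph_handle_caps() ends up looking
roughly like this (simplified sketch; the local variable names are
illustrative, not a literal quote of the diff):

	case CEPH_CAP_OP_IMPORT:
		/* returns with i_ceph_lock held */
		handle_cap_import(mdsc, inode, h, peer, session,
				  &cap, &issued);
		/* releases i_ceph_lock when done, so the import and the
		 * grant are handled without a cap release sneaking in */
		handle_cap_grant(mdsc, inode, h, snaptrace, snaptrace_len,
				 xattr_buf, session, cap, issued);
		break;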

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 98 +++---
 1 file changed, 53 insertions(+), 45 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 73a42f5..318b50f 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2379,23 +2379,20 @@ static void invalidate_aliases(struct inode *inode)
  * actually be a revocation if it specifies a smaller cap set.)
  *
  * caller holds s_mutex and i_ceph_lock, we drop both.
- *
- * return value:
- *  0 - ok
- *  1 - check_caps on auth cap only (writeback)
- *  2 - check_caps (ack revoke)
  */
-static void handle_cap_grant(struct inode *inode, struct ceph_mds_caps *grant,
+static void handle_cap_grant(struct ceph_mds_client *mdsc,
+struct inode *inode, struct ceph_mds_caps *grant,
+void *snaptrace, int snaptrace_len,
+struct ceph_buffer *xattr_buf,
 struct ceph_mds_session *session,
-struct ceph_cap *cap,
-struct ceph_buffer *xattr_buf)
-   __releases(ci-i_ceph_lock)
+struct ceph_cap *cap, int issued)
+   __releases(ci-i_ceph_lock)
 {
struct ceph_inode_info *ci = ceph_inode(inode);
int mds = session-s_mds;
int seq = le32_to_cpu(grant-seq);
int newcaps = le32_to_cpu(grant-caps);
-   int issued, implemented, used, wanted, dirty;
+   int used, wanted, dirty;
u64 size = le64_to_cpu(grant-size);
u64 max_size = le64_to_cpu(grant-max_size);
struct timespec mtime, atime, ctime;
@@ -2449,10 +2446,6 @@ static void handle_cap_grant(struct inode *inode, struct 
ceph_mds_caps *grant,
}
 
/* side effects now are allowed */
-
-   issued = __ceph_caps_issued(ci, implemented);
-   issued |= implemented | __ceph_caps_dirty(ci);
-
cap-cap_gen = session-s_cap_gen;
cap-seq = seq;
 
@@ -2585,6 +2578,16 @@ static void handle_cap_grant(struct inode *inode, struct 
ceph_mds_caps *grant,
 
spin_unlock(ci-i_ceph_lock);
 
+   if (le32_to_cpu(grant-op) == CEPH_CAP_OP_IMPORT) {
+   down_write(mdsc-snap_rwsem);
+   ceph_update_snap_trace(mdsc, snaptrace,
+  snaptrace + snaptrace_len,
+  false);
+   downgrade_write(mdsc-snap_rwsem);
+   kick_flushing_inode_caps(mdsc, session, inode);
+   up_read(mdsc-snap_rwsem);
+   }
+
if (queue_trunc) {
ceph_queue_vmtruncate(inode);
ceph_queue_revalidate(inode);
@@ -2886,21 +2889,22 @@ out_unlock:
 }
 
 /*
- * Handle cap IMPORT.  If there are temp bits from an older EXPORT,
- * clean them up.
+ * Handle cap IMPORT.
  *
- * caller holds s_mutex.
+ * caller holds s_mutex. acquires i_ceph_lock
  */
 static void handle_cap_import(struct ceph_mds_client *mdsc,
  struct inode *inode, struct ceph_mds_caps *im,
  struct ceph_mds_cap_peer *ph,
  struct ceph_mds_session *session,
- void *snaptrace, int snaptrace_len)
+ struct ceph_cap **target_cap, int *old_issued)
+   __acquires(ci-i_ceph_lock)
 {
struct ceph_inode_info *ci = ceph_inode(inode);
-   struct ceph_cap *cap, *new_cap = NULL;
+   struct ceph_cap *cap, *ocap, *new_cap = NULL;
int mds = session-s_mds;
-   unsigned issued = le32_to_cpu(im-caps);
+   int issued;
+   unsigned caps = le32_to_cpu(im-caps);
unsigned wanted = le32_to_cpu(im-wanted);
unsigned seq = le32_to_cpu(im-seq);
unsigned mseq = le32_to_cpu(im-migrate_seq);
@@ -2929,44 +2933,43 @@ retry:
new_cap = ceph_get_cap(mdsc, NULL);
goto retry;
}
+   cap = new_cap;
+   } else {
+   if (new_cap) {
+   ceph_put_cap(mdsc, new_cap);
+   new_cap = NULL;
+   }
}
 
-   ceph_add_cap(inode, session, cap_id, -1, issued, wanted, seq, mseq,
+   __ceph_caps_issued(ci, issued);
+   issued |= __ceph_caps_dirty(ci);
+
+   ceph_add_cap(inode, session, cap_id, -1, caps, wanted, seq, mseq,
 realmino, CEPH_CAP_FLAG_AUTH, new_cap);
 
-   cap = peer = 0 ? __get_cap_for_mds(ci, peer) : NULL;
-   if (cap  cap-cap_id == p_cap_id) {
+   ocap = peer = 0 ? __get_cap_for_mds(ci

[PATCH 1/6] ceph: avoid releasing caps that are being used

2014-04-18 Thread Yan, Zheng
To avoid releasing caps that are being used, encode_inode_release()
should send implemented caps to MDS.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index 9946ce3..de39a03 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -3267,7 +3267,7 @@ int ceph_encode_inode_release(void **p, struct inode *inode,
 		rel->seq = cpu_to_le32(cap->seq);
 		rel->issue_seq = cpu_to_le32(cap->issue_seq),
 		rel->mseq = cpu_to_le32(cap->mseq);
-		rel->caps = cpu_to_le32(cap->issued);
+		rel->caps = cpu_to_le32(cap->implemented);
 		rel->wanted = cpu_to_le32(cap->mds_wanted);
 		rel->dname_len = 0;
 		rel->dname_seq = 0;
-- 
1.9.0



[PATCH 6/6] ceph: remember subtree root dirfrag's auth MDS

2014-04-18 Thread Yan, Zheng
Remember a dirfrag's auth MDS when it's different from its parent
inode's auth MDS.

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/inode.c | 8 +++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 80528ff..7736097 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -245,11 +245,17 @@ static int ceph_fill_dirfrag(struct inode *inode,
 	u32 id = le32_to_cpu(dirinfo->frag);
 	int mds = le32_to_cpu(dirinfo->auth);
 	int ndist = le32_to_cpu(dirinfo->ndist);
+	int diri_auth = -1;
 	int i;
 	int err = 0;
 
+	spin_lock(&ci->i_ceph_lock);
+	if (ci->i_auth_cap)
+		diri_auth = ci->i_auth_cap->mds;
+	spin_unlock(&ci->i_ceph_lock);
+
 	mutex_lock(&ci->i_fragtree_mutex);
-	if (ndist == 0) {
+	if (ndist == 0 && mds == diri_auth) {
 		/* no delegation info needed. */
 		frag = __ceph_find_frag(ci, id);
 		if (!frag)
-- 
1.9.0



[PATCH 2/6] ceph: update inode fields according to issued caps

2014-04-18 Thread Yan, Zheng
Cap messages and request replies from a non-auth MDS may carry stale
information (the corresponding locks are in LOCK states) even if they
have the newest inode version. So the client should update inode fields
according to the issued caps.
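
Concretely, each group of fields becomes guarded by the relevant cap
bits, along the lines of (condensed sketch; the real hunks are below):

	if ((newcaps & CEPH_CAP_AUTH_SHARED) &&
	    (issued & CEPH_CAP_AUTH_EXCL) == 0) {
		/* only trust mode/uid/gid when the MDS actually granted
		 * the AUTH shared cap and we don't hold AUTH excl */
		inode->i_mode = le32_to_cpu(grant->mode);
	}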

Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/caps.c   | 58 
 fs/ceph/inode.c  | 70 
 include/linux/ceph/ceph_fs.h |  2 ++
 3 files changed, 73 insertions(+), 57 deletions(-)

diff --git a/fs/ceph/caps.c b/fs/ceph/caps.c
index de39a03..5f6d24e 100644
--- a/fs/ceph/caps.c
+++ b/fs/ceph/caps.c
@@ -2476,7 +2476,8 @@ static void handle_cap_grant(struct inode *inode, struct 
ceph_mds_caps *grant,
 
__check_cap_issue(ci, cap, newcaps);
 
-   if ((issued  CEPH_CAP_AUTH_EXCL) == 0) {
+   if ((newcaps  CEPH_CAP_AUTH_SHARED) 
+   (issued  CEPH_CAP_AUTH_EXCL) == 0) {
inode-i_mode = le32_to_cpu(grant-mode);
inode-i_uid = make_kuid(init_user_ns, 
le32_to_cpu(grant-uid));
inode-i_gid = make_kgid(init_user_ns, 
le32_to_cpu(grant-gid));
@@ -2485,7 +2486,8 @@ static void handle_cap_grant(struct inode *inode, struct 
ceph_mds_caps *grant,
 from_kgid(init_user_ns, inode-i_gid));
}
 
-   if ((issued  CEPH_CAP_LINK_EXCL) == 0) {
+   if ((newcaps  CEPH_CAP_AUTH_SHARED) 
+   (issued  CEPH_CAP_LINK_EXCL) == 0) {
set_nlink(inode, le32_to_cpu(grant-nlink));
if (inode-i_nlink == 0 
(newcaps  (CEPH_CAP_LINK_SHARED | CEPH_CAP_LINK_EXCL)))
@@ -2512,31 +2514,35 @@ static void handle_cap_grant(struct inode *inode, 
struct ceph_mds_caps *grant,
if ((issued  CEPH_CAP_FILE_CACHE)  ci-i_rdcache_gen  1)
queue_revalidate = 1;
 
-   /* size/ctime/mtime/atime? */
-   queue_trunc = ceph_fill_file_size(inode, issued,
- le32_to_cpu(grant-truncate_seq),
- le64_to_cpu(grant-truncate_size),
- size);
-   ceph_decode_timespec(mtime, grant-mtime);
-   ceph_decode_timespec(atime, grant-atime);
-   ceph_decode_timespec(ctime, grant-ctime);
-   ceph_fill_file_time(inode, issued,
-   le32_to_cpu(grant-time_warp_seq), ctime, mtime,
-   atime);
-
-
-   /* file layout may have changed */
-   ci-i_layout = grant-layout;
-
-   /* max size increase? */
-   if (ci-i_auth_cap == cap  max_size != ci-i_max_size) {
-   dout(max_size %lld - %llu\n, ci-i_max_size, max_size);
-   ci-i_max_size = max_size;
-   if (max_size = ci-i_wanted_max_size) {
-   ci-i_wanted_max_size = 0;  /* reset */
-   ci-i_requested_max_size = 0;
+   if (newcaps  CEPH_CAP_ANY_RD) {
+   /* ctime/mtime/atime? */
+   ceph_decode_timespec(mtime, grant-mtime);
+   ceph_decode_timespec(atime, grant-atime);
+   ceph_decode_timespec(ctime, grant-ctime);
+   ceph_fill_file_time(inode, issued,
+   le32_to_cpu(grant-time_warp_seq),
+   ctime, mtime, atime);
+   }
+
+   if (newcaps  (CEPH_CAP_ANY_FILE_RD | CEPH_CAP_ANY_FILE_WR)) {
+   /* file layout may have changed */
+   ci-i_layout = grant-layout;
+   /* size/truncate_seq? */
+   queue_trunc = ceph_fill_file_size(inode, issued,
+   le32_to_cpu(grant-truncate_seq),
+   le64_to_cpu(grant-truncate_size),
+   size);
+   /* max size increase? */
+   if (ci-i_auth_cap == cap  max_size != ci-i_max_size) {
+   dout(max_size %lld - %llu\n,
+ci-i_max_size, max_size);
+   ci-i_max_size = max_size;
+   if (max_size = ci-i_wanted_max_size) {
+   ci-i_wanted_max_size = 0;  /* reset */
+   ci-i_requested_max_size = 0;
+   }
+   wake = 1;
}
-   wake = 1;
}
 
/* check cap bits */
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 233c6f9..f9e7399 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -585,14 +585,15 @@ static int fill_inode(struct inode *inode,
struct ceph_mds_reply_inode *info = iinfo-in;
struct ceph_inode_info *ci = ceph_inode(inode);
int i;
-   int issued = 0, implemented;
+   int issued = 0, implemented, new_issued;
struct timespec mtime, atime, ctime;
u32 nsplits;
struct ceph_inode_frag *frag;
struct rb_node *rb_node;
struct ceph_buffer

[PATCH] ceph: clear directory's completeness when creating file

2014-04-14 Thread Yan, Zheng
When creating a file, ceph_set_dentry_offset() puts the new dentry
at the end of the directory's d_subdirs, then sets the dentry's offset
based on the directory's max offset. The offset does not reflect the
real position of the dentry in the directory. A later readdir reply from
the MDS may change the dentry's position/offset. This inconsistency
can cause missing/duplicate entries in the readdir result if the readdir
is partly satisfied by dcache_readdir().

The fix is to clear the directory's completeness after creating/renaming
a file. This prevents later readdirs from using dcache_readdir().
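
In practice the create and rename paths now just invalidate the cached
view, along the lines of (condensed from the hunks below):

	/* d_move / dentry creation screws up sibling offsets, so stop
	 * trusting the dcache for readdir on these directories */
	ceph_dir_clear_complete(old_dir);
	ceph_dir_clear_complete(new_dir);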

Fixes: http://tracker.ceph.com/issues/8025
Signed-off-by: Yan, Zheng zheng.z@intel.com
---
 fs/ceph/dir.c   |  9 
 fs/ceph/inode.c | 71 +
 fs/ceph/super.h |  1 -
 3 files changed, 21 insertions(+), 60 deletions(-)

diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
index fb4f7a2..c29d6ae 100644
--- a/fs/ceph/dir.c
+++ b/fs/ceph/dir.c
@@ -448,7 +448,6 @@ more:
if (atomic_read(ci-i_release_count) == fi-dir_release_count) {
dout( marking %p complete\n, inode);
__ceph_dir_set_complete(ci, fi-dir_release_count);
-   ci-i_max_offset = ctx-pos;
}
spin_unlock(ci-i_ceph_lock);
 
@@ -937,14 +936,16 @@ static int ceph_rename(struct inode *old_dir, struct 
dentry *old_dentry,
 * to do it here.
 */
 
-   /* d_move screws up d_subdirs order */
-   ceph_dir_clear_complete(new_dir);
-
d_move(old_dentry, new_dentry);
 
/* ensure target dentry is invalidated, despite
   rehashing bug in vfs_rename_dir */
ceph_invalidate_dentry_lease(new_dentry);
+
+   /* d_move screws up sibling dentries' offsets */
+   ceph_dir_clear_complete(old_dir);
+   ceph_dir_clear_complete(new_dir);
+
}
ceph_mdsc_put_request(req);
return err;
diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
index 0b0728e..233c6f9 100644
--- a/fs/ceph/inode.c
+++ b/fs/ceph/inode.c
@@ -744,7 +744,6 @@ static int fill_inode(struct inode *inode,
!__ceph_dir_is_complete(ci)) {
dout( marking %p complete (empty)\n, inode);
__ceph_dir_set_complete(ci, atomic_read(ci-i_release_count));
-   ci-i_max_offset = 2;
}
 no_change:
/* only update max_size on auth cap */
@@ -890,41 +889,6 @@ out_unlock:
 }
 
 /*
- * Set dentry's directory position based on the current dir's max, and
- * order it in d_subdirs, so that dcache_readdir behaves.
- *
- * Always called under directory's i_mutex.
- */
-static void ceph_set_dentry_offset(struct dentry *dn)
-{
-   struct dentry *dir = dn-d_parent;
-   struct inode *inode = dir-d_inode;
-   struct ceph_inode_info *ci;
-   struct ceph_dentry_info *di;
-
-   BUG_ON(!inode);
-
-   ci = ceph_inode(inode);
-   di = ceph_dentry(dn);
-
-   spin_lock(ci-i_ceph_lock);
-   if (!__ceph_dir_is_complete(ci)) {
-   spin_unlock(ci-i_ceph_lock);
-   return;
-   }
-   di-offset = ceph_inode(inode)-i_max_offset++;
-   spin_unlock(ci-i_ceph_lock);
-
-   spin_lock(dir-d_lock);
-   spin_lock_nested(dn-d_lock, DENTRY_D_LOCK_NESTED);
-   list_move(dn-d_u.d_child, dir-d_subdirs);
-   dout(set_dentry_offset %p %lld (%p %p)\n, dn, di-offset,
-dn-d_u.d_child.prev, dn-d_u.d_child.next);
-   spin_unlock(dn-d_lock);
-   spin_unlock(dir-d_lock);
-}
-
-/*
  * splice a dentry to an inode.
  * caller must hold directory i_mutex for this to be safe.
  *
@@ -933,7 +897,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
  * the caller) if we fail.
  */
 static struct dentry *splice_dentry(struct dentry *dn, struct inode *in,
-   bool *prehash, bool set_offset)
+   bool *prehash)
 {
struct dentry *realdn;
 
@@ -965,8 +929,6 @@ static struct dentry *splice_dentry(struct dentry *dn, 
struct inode *in,
}
if ((!prehash || *prehash)  d_unhashed(dn))
d_rehash(dn);
-   if (set_offset)
-   ceph_set_dentry_offset(dn);
 out:
return dn;
 }
@@ -987,7 +949,6 @@ int ceph_fill_trace(struct super_block *sb, struct 
ceph_mds_request *req,
 {
struct ceph_mds_reply_info_parsed *rinfo = req-r_reply_info;
struct inode *in = NULL;
-   struct ceph_mds_reply_inode *ininfo;
struct ceph_vino vino;
struct ceph_fs_client *fsc = ceph_sb_to_client(sb);
int err = 0;
@@ -1161,6 +1122,9 @@ retry_lookup:
 
/* rename? */
if (req-r_old_dentry  req-r_op == CEPH_MDS_OP_RENAME) {
+   struct inode *olddir = req-r_old_dentry_dir;
+   BUG_ON(!olddir);
+
dout( src %p '%.*s' dst %p '%.*s'\n,
 req

Re: [PATCH] ceph: clear directory's completeness when creating file

2014-04-14 Thread Yan, Zheng
On Mon, Apr 14, 2014 at 9:59 PM, Sage Weil s...@inktank.com wrote:
 On Mon, 14 Apr 2014, Yan, Zheng wrote:
 When creating a file, ceph_set_dentry_offset() puts the new dentry
 at the end of the directory's d_subdirs, then sets the dentry's offset
 based on the directory's max offset. The offset does not reflect the
 real position of the dentry in the directory. A later readdir reply from
 the MDS may change the dentry's position/offset. This inconsistency
 can cause missing/duplicate entries in the readdir result if the readdir
 is partly satisfied by dcache_readdir().

 The fix is to clear the directory's completeness after creating/renaming
 a file. This prevents later readdirs from using dcache_readdir().

 Two thoughts:

 First, we could preserve this behavior when the directory is small (e.g.,
 < 1000 entries, or whatever the readdir_max is set to) since any readdir
 would always be satisfied in a single request and we don't need to worry
 about the mds vs dcache_readdir() case.  I think that will still capture the

No, we couldn't, because the caller of readdir can provide a small
buffer for the readdir result.

 benefit for many/most directories.  This is probably primarily a matter of
 adding a directory size counter into the ceph_dentry_info?

 Second, it seems like in order to keep this behavior in the general case,
 we would basically need to build an rbtree that mirrors the d_subdirs list
 and sorts the same way the mds does (by frag, name).  Which would mean
 resorting that list when we discover a split/merge event.  That sounds
 like a pretty big pain to me.  And similarly, I think that the larger the
 directory, the less important readdir usually is in the workload...

 sage



 Fixes: http://tracker.ceph.com/issues/8025
 Signed-off-by: Yan, Zheng zheng.z@intel.com
 ---
  fs/ceph/dir.c   |  9 
  fs/ceph/inode.c | 71 
 +
  fs/ceph/super.h |  1 -
  3 files changed, 21 insertions(+), 60 deletions(-)

 diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
 index fb4f7a2..c29d6ae 100644
 --- a/fs/ceph/dir.c
 +++ b/fs/ceph/dir.c
 @@ -448,7 +448,6 @@ more:
   if (atomic_read(ci-i_release_count) == fi-dir_release_count) {
   dout( marking %p complete\n, inode);
   __ceph_dir_set_complete(ci, fi-dir_release_count);
 - ci-i_max_offset = ctx-pos;
   }
   spin_unlock(ci-i_ceph_lock);

 @@ -937,14 +936,16 @@ static int ceph_rename(struct inode *old_dir, struct 
 dentry *old_dentry,
* to do it here.
*/

 - /* d_move screws up d_subdirs order */
 - ceph_dir_clear_complete(new_dir);
 -
   d_move(old_dentry, new_dentry);

   /* ensure target dentry is invalidated, despite
  rehashing bug in vfs_rename_dir */
   ceph_invalidate_dentry_lease(new_dentry);
 +
 + /* d_move screws up sibling dentries' offsets */
 + ceph_dir_clear_complete(old_dir);
 + ceph_dir_clear_complete(new_dir);
 +
   }
   ceph_mdsc_put_request(req);
   return err;
 diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
 index 0b0728e..233c6f9 100644
 --- a/fs/ceph/inode.c
 +++ b/fs/ceph/inode.c
 @@ -744,7 +744,6 @@ static int fill_inode(struct inode *inode,
   !__ceph_dir_is_complete(ci)) {
   dout( marking %p complete (empty)\n, inode);
   __ceph_dir_set_complete(ci, atomic_read(ci-i_release_count));
 - ci-i_max_offset = 2;
   }
  no_change:
   /* only update max_size on auth cap */
 @@ -890,41 +889,6 @@ out_unlock:
  }

  /*
 - * Set dentry's directory position based on the current dir's max, and
 - * order it in d_subdirs, so that dcache_readdir behaves.
 - *
 - * Always called under directory's i_mutex.
 - */
 -static void ceph_set_dentry_offset(struct dentry *dn)
 -{
 - struct dentry *dir = dn-d_parent;
 - struct inode *inode = dir-d_inode;
 - struct ceph_inode_info *ci;
 - struct ceph_dentry_info *di;
 -
 - BUG_ON(!inode);
 -
 - ci = ceph_inode(inode);
 - di = ceph_dentry(dn);
 -
 - spin_lock(ci-i_ceph_lock);
 - if (!__ceph_dir_is_complete(ci)) {
 - spin_unlock(ci-i_ceph_lock);
 - return;
 - }
 - di-offset = ceph_inode(inode)-i_max_offset++;
 - spin_unlock(ci-i_ceph_lock);
 -
 - spin_lock(dir-d_lock);
 - spin_lock_nested(dn-d_lock, DENTRY_D_LOCK_NESTED);
 - list_move(dn-d_u.d_child, dir-d_subdirs);
 - dout(set_dentry_offset %p %lld (%p %p)\n, dn, di-offset,
 -  dn-d_u.d_child.prev, dn-d_u.d_child.next);
 - spin_unlock(dn-d_lock);
 - spin_unlock(dir-d_lock);
 -}
 -
 -/*
   * splice a dentry to an inode.
   * caller must hold directory i_mutex for this to be safe.
   *
 @@ -933,7 +897,7 @@ static void ceph_set_dentry_offset(struct dentry *dn)
   * the caller) if we fail.
   */
  static struct dentry *splice_dentry(struct dentry *dn, struct inode *in,
 - bool

Re: [PATCH] ceph: clear directory's completeness when creating file

2014-04-14 Thread Yan, Zheng
On Mon, Apr 14, 2014 at 11:12 PM, Sage Weil s...@inktank.com wrote:
 On Mon, 14 Apr 2014, Yan, Zheng wrote:
 On Mon, Apr 14, 2014 at 9:59 PM, Sage Weil s...@inktank.com wrote:
  On Mon, 14 Apr 2014, Yan, Zheng wrote:
   When creating a file, ceph_set_dentry_offset() puts the new dentry
   at the end of the directory's d_subdirs, then sets the dentry's offset
   based on the directory's max offset. The offset does not reflect the
   real position of the dentry in the directory. A later readdir reply from
   the MDS may change the dentry's position/offset. This inconsistency
   can cause missing/duplicate entries in the readdir result if the readdir
   is partly satisfied by dcache_readdir().

   The fix is to clear the directory's completeness after creating/renaming
   a file. This prevents later readdirs from using dcache_readdir().
 
  Two thoughts:
 
   First, we could preserve this behavior when the directory is small (e.g.,
   < 1000 entries, or whatever the readdir_max is set to) since any readdir
   would always be satisfied in a single request and we don't need to worry
   about the mds vs dcache_readdir() case.  I think that will still capture the

  No, we couldn't, because the caller of readdir can provide a small
  buffer for the readdir result.

 Hmm, we could set a different flag, UNORDERED, and then on readdir clear
 COMPLETE if UNORDERED and size = X...


It is too tricky. The readdir buffer is hidden by the VFS; the FS
driver has no way to check its size.





  benefit for many/most directories.  This is probably primarily a matter of
  adding a directory size counter into the ceph_dentry_info?
 
  Second, it seems like in order to keep this behavior in the general case,
  we would basically need to build an rbtree that mirrors in d_subdirs list
  and sorts the same way the mds does (by frag, name).  Which would mean
  resorting that list when we discover a split/merge event.  That sounds
  like a pretty big pain to me.  And similarly, I think that the larger the
  directory, the less important readdir usually is in the workload...
 
  sage
 
 
 
  Fixes: http://tracker.ceph.com/issues/8025
  Signed-off-by: Yan, Zheng zheng.z@intel.com
  ---
   fs/ceph/dir.c   |  9 
   fs/ceph/inode.c | 71 
  +
   fs/ceph/super.h |  1 -
   3 files changed, 21 insertions(+), 60 deletions(-)
 
  diff --git a/fs/ceph/dir.c b/fs/ceph/dir.c
  index fb4f7a2..c29d6ae 100644
  --- a/fs/ceph/dir.c
  +++ b/fs/ceph/dir.c
  @@ -448,7 +448,6 @@ more:
if (atomic_read(ci-i_release_count) == fi-dir_release_count) {
dout( marking %p complete\n, inode);
__ceph_dir_set_complete(ci, fi-dir_release_count);
  - ci-i_max_offset = ctx-pos;
}
spin_unlock(ci-i_ceph_lock);
 
  @@ -937,14 +936,16 @@ static int ceph_rename(struct inode *old_dir, 
  struct dentry *old_dentry,
 * to do it here.
 */
 
  - /* d_move screws up d_subdirs order */
  - ceph_dir_clear_complete(new_dir);
  -
d_move(old_dentry, new_dentry);
 
/* ensure target dentry is invalidated, despite
   rehashing bug in vfs_rename_dir */
ceph_invalidate_dentry_lease(new_dentry);
  +
  + /* d_move screws up sibling dentries' offsets */
  + ceph_dir_clear_complete(old_dir);
  + ceph_dir_clear_complete(new_dir);
  +
}
ceph_mdsc_put_request(req);
return err;
  diff --git a/fs/ceph/inode.c b/fs/ceph/inode.c
  index 0b0728e..233c6f9 100644
  --- a/fs/ceph/inode.c
  +++ b/fs/ceph/inode.c
  @@ -744,7 +744,6 @@ static int fill_inode(struct inode *inode,
!__ceph_dir_is_complete(ci)) {
dout( marking %p complete (empty)\n, inode);
__ceph_dir_set_complete(ci, 
  atomic_read(ci-i_release_count));
  - ci-i_max_offset = 2;
}
   no_change:
/* only update max_size on auth cap */
  @@ -890,41 +889,6 @@ out_unlock:
   }
 
   /*
  - * Set dentry's directory position based on the current dir's max, and
  - * order it in d_subdirs, so that dcache_readdir behaves.
  - *
  - * Always called under directory's i_mutex.
  - */
  -static void ceph_set_dentry_offset(struct dentry *dn)
  -{
  - struct dentry *dir = dn-d_parent;
  - struct inode *inode = dir-d_inode;
  - struct ceph_inode_info *ci;
  - struct ceph_dentry_info *di;
  -
  - BUG_ON(!inode);
  -
  - ci = ceph_inode(inode);
  - di = ceph_dentry(dn);
  -
  - spin_lock(ci-i_ceph_lock);
  - if (!__ceph_dir_is_complete(ci)) {
  - spin_unlock(ci-i_ceph_lock);
  - return;
  - }
  - di-offset = ceph_inode(inode)-i_max_offset++;
  - spin_unlock(ci-i_ceph_lock);
  -
  - spin_lock(dir-d_lock);
  - spin_lock_nested(dn-d_lock, DENTRY_D_LOCK_NESTED);
  - list_move(dn-d_u.d_child, dir-d_subdirs);
  - dout(set_dentry_offset %p %lld (%p %p)\n, dn, di-offset

Re: [PATCH] rbd: replace IS_ERR and PTR_ERR with PTR_ERR_OR_ZERO

2014-04-11 Thread Yan, Zheng
On Fri, Apr 11, 2014 at 4:38 PM, Duan Jiong duanj.f...@cn.fujitsu.com wrote:
 This patch fixes a coccinelle error regarding the usage of IS_ERR and
 PTR_ERR instead of PTR_ERR_OR_ZERO.

 Signed-off-by: Duan Jiong duanj.f...@cn.fujitsu.com
 ---
  drivers/block/rbd.c | 2 +-
  1 file changed, 1 insertion(+), 1 deletion(-)

 diff --git a/drivers/block/rbd.c b/drivers/block/rbd.c
 index 4c95b50..552a2ed 100644
 --- a/drivers/block/rbd.c
 +++ b/drivers/block/rbd.c
 @@ -4752,7 +4752,7 @@ static int rbd_dev_image_id(struct rbd_device *rbd_dev)

 image_id = ceph_extract_encoded_string(p, p + ret,
 NULL, GFP_NOIO);
 -   ret = IS_ERR(image_id) ? PTR_ERR(image_id) : 0;
 +   ret = PTR_ERR_OR_ZERO(image_id);
 if (!ret)
 rbd_dev-image_format = 2;
 } else {

Reviewed-by: Yan, Zheng zheng.z@intel.com
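
For anyone who hasn't seen the helper: PTR_ERR_OR_ZERO() lives in
include/linux/err.h and is essentially the open-coded pattern being
replaced here:

	static inline int PTR_ERR_OR_ZERO(__force const void *ptr)
	{
		if (IS_ERR(ptr))
			return PTR_ERR(ptr);
		else
			return 0;
	}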

 --
 1.8.3.1



Re: [PATCH] ceph: remove useless ACL check

2014-04-10 Thread Yan, Zheng
On Thu, Apr 10, 2014 at 1:29 PM, Zhang Zhen zhenzhang.zh...@huawei.com wrote:
 posix_acl_xattr_set() already does the check, and it's the only
 way to feed in an ACL from userspace.
 So the check here is useless, remove it.

 Signed-off-by: zhang zhen zhenzhang.zh...@huawei.com
 ---
  fs/ceph/acl.c | 6 --
  1 file changed, 6 deletions(-)

 diff --git a/fs/ceph/acl.c b/fs/ceph/acl.c
 index 21887d6..469f2e8 100644
 --- a/fs/ceph/acl.c
 +++ b/fs/ceph/acl.c
 @@ -104,12 +104,6 @@ int ceph_set_acl(struct inode *inode, struct posix_acl 
 *acl, int type)
 umode_t new_mode = inode-i_mode, old_mode = inode-i_mode;
 struct dentry *dentry;

 -   if (acl) {
 -   ret = posix_acl_valid(acl);
 -   if (ret  0)
 -   goto out;
 -   }
 -
 switch (type) {
 case ACL_TYPE_ACCESS:
 name = POSIX_ACL_XATTR_ACCESS;
 --
 1.8.5.5


Added to our testing branch. Thanks.

Yan, Zheng
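
For reference, the validation that posix_acl_xattr_set() already performs
on a userspace-supplied ACL looks roughly like this (paraphrased from
fs/posix_acl.c of that era, so treat the exact shape as approximate):

	acl = posix_acl_from_xattr(&init_user_ns, value, size);
	if (IS_ERR(acl))
		return PTR_ERR(acl);
	if (acl) {
		ret = posix_acl_valid(acl);
		if (ret)
			goto out;
	}

so repeating posix_acl_valid() in ceph_set_acl() really is redundant.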

