Dear All,

We recently read on this mailing list that the Lustre MDT currently has poor performance with a ZFS backend. Unfortunately, we did not notice this before deploying that configuration to many of our cluster systems, and by now there is too much data to migrate them back to an ldiskfs-backed MDT.
So the only way forward for us is to make the best of the current setup. I read this article about improvements to the MDT with ZFS: https://www.nextplatform.com/2017/01/11/bolstering-lustre-zfs-highlights-continuing-work/ It seems that the ZFS-backed MDT has already improved considerably over time, but for our applications it is still not enough. Is there a roadmap for further improvements in this area? Furthermore, I saw that ZFS has already reached version 2.0.1. Are there plans on the Lustre side to integrate with and take advantage of the new ZFS releases?

In the meantime, we currently have the following configuration and tunings in our Lustre systems to try to overcome various performance bottlenecks (both MDT and OST use the ZFS backend):

- Linux kernel 4.19.126 + MLNX_OFED-4.6 + Lustre-2.12.6 + ZFS-0.7.3
  (ps. We build all of the above software ourselves on Debian-9.12)

- Loading the zfs module with the following options (many thanks to Riccardo Veraldi for the suggestions):

    options zfs zfs_prefetch_disable=1
    options zfs zfs_txg_history=120
    options zfs metaslab_debug_unload=1
    options zfs zfs_vdev_async_write_active_min_dirty_percent=20
    options zfs zfs_vdev_scrub_min_active=48
    options zfs zfs_vdev_scrub_max_active=128
    options zfs zfs_vdev_sync_write_min_active=8
    options zfs zfs_vdev_sync_write_max_active=32
    options zfs zfs_vdev_sync_read_min_active=8
    options zfs zfs_vdev_sync_read_max_active=32
    options zfs zfs_vdev_async_read_min_active=8
    options zfs zfs_vdev_async_read_max_active=32
    options zfs zfs_top_maxinflight=320
    options zfs zfs_txg_timeout=30
    options zfs zfs_dirty_data_max_percent=40
    options zfs zfs_vdev_async_write_min_active=8
    options zfs zfs_vdev_async_write_max_active=32

- The ZFS pool is configured with the following options:

    zfs set atime=off <pool>
    zfs set redundant_metadata=most <pool>
    zfs set xattr=sa <pool>
    zfs set recordsize=1M <pool>

- The grant_shrink option is set to 0 on all Lustre clients:

    lctl set_param osc.*.grant_shrink=0

This is everything we have learned so far.
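In case it helps anyone replicating these tunings, below is a minimal sketch of how one might verify they actually took effect after a reboot (a mistyped line in /etc/modprobe.d can be silently ignored). The pool name "mdtpool" is a placeholder for illustration; substitute your own, and run the lctl line on a client.

```shell
#!/bin/sh
# Placeholder pool name -- replace with your actual MDT/OST pool.
POOL=mdtpool

# Check that the zfs module options were applied; each option set in
# /etc/modprobe.d should be reflected under /sys/module/zfs/parameters.
for p in zfs_prefetch_disable zfs_txg_timeout zfs_dirty_data_max_percent \
         zfs_vdev_async_write_min_active zfs_vdev_async_write_max_active; do
    printf '%-40s %s\n' "$p" "$(cat /sys/module/zfs/parameters/$p)"
done

# Check the dataset properties set with "zfs set".
zfs get atime,redundant_metadata,xattr,recordsize "$POOL"

# On a client, check that grant_shrink is really 0 for all OSCs.
lctl get_param osc.*.grant_shrink
```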
We are wondering whether there is still something we have overlooked (e.g., would the iommu and intel_iommu kernel boot parameters help)? We would greatly appreciate any further suggestions.

Best Regards,

T.H.Hsieh
_______________________________________________
lustre-discuss mailing list
[email protected]
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
