Sorry for replying late; I answered inline.

On 10/21/18 6:00 AM, Andreas Dilger wrote:
It would be useful to post information like this on wiki.lustre.org so it can
be found more easily by others.  There are already some ZFS tunings there (I
don't have the URL handy, just on a plane), so it might be useful to include
some information about the hardware and workload to give context to what this
is tuned for.

Even more interesting would be to see if there is a general set of tunings that 
people agree should be made the default?  It is even better when new users 
don't have to seek out the various tuning parameters, and instead get good 
performance out of the box.

A few comments inline...

On Oct 19, 2018, at 17:52, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> 
wrote:
On 10/19/18 12:37 PM, Mohr Jr, Richard Frank (Rick Mohr) wrote:
On Oct 17, 2018, at 7:30 PM, Riccardo Veraldi <riccardo.vera...@cnaf.infn.it> 
wrote:

Anyway, especially regarding the OSSes, you may eventually need to tune some ZFS
module parameters, increasing the vdev_write and vdev_read max values above their
defaults. You may also disable the ZIL, change redundant_metadata to "most", and
set atime off.

I could send you a list of parameters that in my case work well.
Riccardo,

Would you mind sharing your ZFS parameters with the mailing list?  I would be 
interested to see which options you have changed.

This is what worked for me on my high-performance cluster:

options zfs zfs_prefetch_disable=1
This matches what I've seen in the past - at high bandwidth under concurrent 
client load the prefetched data on the server is lost, and just causes needless 
disk IO that is discarded.

options zfs zfs_txg_history=120
options zfs metaslab_debug_unload=1
#
options zfs zfs_vdev_scheduler=deadline
options zfs zfs_vdev_async_write_active_min_dirty_percent=20
#
options zfs zfs_vdev_scrub_min_active=48
options zfs zfs_vdev_scrub_max_active=128
#
options zfs zfs_vdev_sync_write_min_active=8
options zfs zfs_vdev_sync_write_max_active=32
options zfs zfs_vdev_sync_read_min_active=8
options zfs zfs_vdev_sync_read_max_active=32
options zfs zfs_vdev_async_read_min_active=8
options zfs zfs_vdev_async_read_max_active=32
options zfs zfs_top_maxinflight=320
options zfs zfs_txg_timeout=30
This is interesting.  Is this actually setting the maximum TXG age up to 30s?

Yes, I think the default is 5 seconds (the current value can be checked at runtime, as in the snippet after the option list below).



options zfs zfs_dirty_data_max_percent=40
options zfs zfs_vdev_async_write_min_active=8
options zfs zfs_vdev_async_write_max_active=32
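
For completeness, these ZFS module parameters can also be read or changed at
runtime under /sys/module/zfs/parameters (paths as on ZFS on Linux), which is
handy for checking what is actually in effect, or for experimenting before
writing the values to /etc/modprobe.d/zfs.conf:

# current values (defaults: prefetch enabled, 5 second txg timeout)
cat /sys/module/zfs/parameters/zfs_prefetch_disable
cat /sys/module/zfs/parameters/zfs_txg_timeout

# temporary change for testing; lasts only until the module is reloaded
echo 1  > /sys/module/zfs/parameters/zfs_prefetch_disable
echo 30 > /sys/module/zfs/parameters/zfs_txg_timeout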

##############

These are the ZFS attributes that I changed on the OSSes:

zfs set mountpoint=none $ostpool
zfs set sync=disabled $ostpool
zfs set atime=off $ostpool
zfs set redundant_metadata=most $ostpool
zfs set xattr=sa $ostpool
zfs set recordsize=1M $ostpool
The recordsize=1M is already the default for Lustre OSTs.

Did you disable multimount, or just not include it here?  That is fairly
important for any multi-homed ZFS storage, to prevent multiple imports.
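
For reference, the attributes above can be double-checked with zfs get, and the
multiple-import protection is the multihost pool property on ZFS 0.7 and newer
(it requires a unique hostid on each server), for example:

zfs get sync,atime,redundant_metadata,xattr,recordsize $ostpool
zpool get multihost $ostpool
# enable multihost (MMP) protection if the pool is reachable from more than one node
zpool set multihost=on $ostpool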

#################


These are the ko2iblnd parameters for FDR Mellanox IB interfaces:

options ko2iblnd timeout=100 peer_credits=63 credits=2560 concurrent_sends=63 
ntx=2048 fmr_pool_size=1280 fmr_flush_trigger=1024 ntx=5120
You have ntx= in there twice...

Yes, that is a mistake, I typed it twice.




If this provides a significant improvement for FDR, it might make sense to add in 
machinery to lustre/conf/{ko2iblnd-probe,ko2iblnd.conf} to have a new alias 
"ko2iblnd-fdr" set these values on Mellanox FDB IB cards by default?

I found it works better with FDR.
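
The values the module actually loaded can be checked under
/sys/module/ko2iblnd/parameters, which is also a quick way to spot duplicates
like the ntx= above, for example:

cat /sys/module/ko2iblnd/parameters/ntx
cat /sys/module/ko2iblnd/parameters/credits
cat /sys/module/ko2iblnd/parameters/peer_credits
cat /sys/module/ko2iblnd/parameters/concurrent_sends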

Anyway, most of the tunings I did were picked up from what other people have done, mostly from these sources:

 * https://lustre.ornl.gov/lustre101-courses/content/C1/L5/LustreTuning.pdf
 * https://www.eofs.eu/_media/events/lad15/15_chris_horn_lad_2015_lnet.pdf
 * https://lustre.ornl.gov/ecosystem-2015/documents/LustreEco2015-Tutorial2.pdf

And by the way, the most effective tweaks came from Rick Mohr's advice in LustreTuning.pdf. Thanks Rick!


############

These are the ksocklnd parameters:

options ksocklnd sock_timeout=100 credits=2560 peer_credits=63

##############

These are other parameters that I tweaked:

echo 32 > /sys/module/ptlrpc/parameters/max_ptlrpcds
echo 3 > /sys/module/ptlrpc/parameters/ptlrpcd_bind_policy
This parameter is marked as obsolete in the code.

Yes, I should fix my configuration and use the new parameters.
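
For what it's worth, the CPT-aware replacements appear to be ptlrpcd_cpts and
ptlrpcd_per_cpt_max (parameter names worth verifying against the Lustre release
in use), so the equivalent would be something like this in
/etc/modprobe.d/lustre.conf:

# bind ptlrpcd threads to CPTs 0-3, at most 2 threads per CPT
# (names and values to be checked against the running Lustre version)
options ptlrpc ptlrpcd_cpts="[0-3]" ptlrpcd_per_cpt_max=2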


lctl set_param timeout=600
lctl set_param ldlm_timeout=200
lctl set_param at_min=250
lctl set_param at_max=600
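
Note that these lctl set_param values do not survive a reboot; they can be
re-applied from a boot script or, on Lustre 2.5 and later, made persistent from
the MGS with set_param -P, for example:

lctl set_param -P timeout=600
lctl set_param -P ldlm_timeout=200
lctl set_param -P at_min=250
lctl set_param -P at_max=600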

###########

Also, I run this script at boot time to spread the IRQ assignments of the hard
drives across a range of CPUs; it is not needed for kernels newer than 4.4:

#!/bin/sh
# numa_smp.sh - spread the IRQs of a device round-robin over a CPU range
# usage: numa_smp.sh <device-pattern> <first-cpu> <last-cpu>
device=$1
cpu1=$2
cpu2=$3
cpu=$cpu1
# list the IRQ numbers of every /proc/interrupts line matching the device
grep "$device" /proc/interrupts | awk '{print $1}' | sed 's/://' | while read int
do
  echo $cpu > /proc/irq/$int/smp_affinity_list
  echo "echo CPU $cpu > /proc/irq/$int/smp_affinity_list"
  # wrap around to the first CPU once the end of the range is reached
  if [ "$cpu" -eq "$cpu2" ]
  then
     cpu=$cpu1
  else
     cpu=$((cpu+1))
  fi
done
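
It takes a pattern matched against /proc/interrupts plus the first and last CPU
of the range to use, so an invocation looks something like this (device pattern
and CPU range here are only an example):

sh numa_smp.sh mpt3sas 0 11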
Cheers, Andreas
---
Andreas Dilger
Principal Lustre Architect
Whamcloud