Greg,
Thanks so much for the reply!
We are not sure why, but ZFS appears to behave poorly on getxattr system calls
under some circumstances.
Since the last update we have discovered that booting the OSD back to back
yields a very fast boot time and very fast getxattr system calls.
A longer period between boots (or perhaps a larger influx of new data in the
meantime) correlates with a longer boot duration. The extra time is spent in
slow getxattr calls of certain types.
We suspect this may be a caching or fragmentation issue with ZFS for xattrs.
Use of longer filenames appears to make this worse.
We experimented on some OSDs with swapping over to XFS as the filesystem, and
the problem does not appear to be present on those OSDs.
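For anyone who wants to compare the two filesystems, or a cold versus a
back-to-back boot, outside of the OSD itself, here is a minimal sketch of a
probe that times individual getxattr calls. It is not something we ran; the
function names are made up for illustration, and the default attribute name is
simply the one that shows up slow in our traces. It assumes Linux, where
Python exposes os.getxattr.

```python
import errno
import os
import time


def time_getxattr(path, name):
    """Time a single getxattr call and return (seconds, result).

    result is the value length on success, or the errno name
    (for example 'ENODATA') when the call fails.
    """
    start = time.perf_counter()
    try:
        value = os.getxattr(path, name)
        result = len(value)
    except OSError as exc:
        result = errno.errorcode.get(exc.errno, str(exc.errno))
    elapsed = time.perf_counter() - start
    return elapsed, result


def probe_directory(directory, name="user.cephos.lfn3-alt", limit=1000):
    """Time getxattr for up to `limit` entries in one directory."""
    timings = []
    with os.scandir(directory) as entries:
        for index, entry in enumerate(entries):
            if index >= limit:
                break
            timings.append(time_getxattr(entry.path, name))
    return timings


def summarize(timings):
    """Return (count, total_seconds, slowest_seconds) for a probe run."""
    durations = [seconds for seconds, _ in timings]
    if not durations:
        return 0, 0.0, 0.0
    return len(durations), sum(durations), max(durations)
```

Running probe_directory against one PG directory right after a long idle
period, and then again immediately afterwards, should show whether the second
pass is served from cache.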
The two examples below are representative of a Long Boot (the OSD had been
running longer and had taken in more data between reboots) and a Short Boot,
where we booted the same OSD back to back.
Notice the drastic difference in time on the getxattr call that returns
ENODATA: 0.008875 secs for the long boot versus 0.000018 secs when the same
OSD is booted back to back, which makes the long-boot call roughly 500x
slower. Multiplied by thousands of getxattr calls, this was our source of the
longer boot time.
We are considering a full switch to XFS, but would love to hear any ZFS tuning
tips that might serve as a short-term workaround.
We are using ZFS 0.6.5.11, which predates support for large dnodes, so
dnodesize=auto is not available to us.
#Long Boot
<0.000044>[pid 3413902] 13:08:00.884238
getxattr("/osd/9/current/20.86bs3_head/default.34597.7\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebana_1d9e1e82d623f49c994f_0_long",
"user.cephos.lfn3",
"default.34597.7\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-92d9df789f9aaf007c50c50bb66e70af__head_0177C86B__14_ffffffffffffffff_3",
1024) = 616 <0.000044>
<0.008875>[pid 3413902] 13:08:00.884476
getxattr("/osd/9/current/20.86bs3_head/default.34597.57\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_79a7acf2d32f4302a1a4_0_long",
"user.cephos.lfn3-alt", 0x7f849bf95180, 1024) = -1 ENODATA (No data available)
<0.008875>
#Short Boot
<0.000015> [pid 3452111] 13:37:18.604442
getxattr("/osd/9/current/20.15c2s3_head/default.34597.22\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_efb8ca13c57689d76797_0_long",
"user.cephos.lfn3",
"default.34597.22\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememamboptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickinthebananahoorayforhavanababybabymakemelocobabybabymakememambo-b519f8607a3d9de0f815d18b6905b27d__head_9726F5C2__14_ffffffffffffffff_3",
1024) = 617 <0.000015>
<0.000018> [pid 3452111] 13:37:18.604546
getxattr("/osd/9/current/20.15c2s3_head/default.34597.66\\uptboatonthewaytohavanaiusedtomakealivingmanpickinthebanananowimaguidefortheciahoorayfortheusababybabymakemelocobabybabymakememambosenttospyonacubantalentshowfirststophavanagogoiusedtomakealivingmanpickintheban_0e6d86f58e03d0f6de04_0_long",
"user.cephos.lfn3-alt", 0x7fd4e8017680, 1024) = -1 ENODATA (No data available)
<0.000018>
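To compare whole trace files rather than single calls, the elapsed times can
be pulled out and grouped by outcome. This is a rough sketch, assuming the
traces were captured with strace -T so that each call ends in "= <return>"
followed by "<seconds>", possibly wrapped onto the next line as above. The
function names are illustrative only.

```python
import re
from collections import defaultdict

# Matches the tail of an strace -T line, e.g.
#   "= 616 <0.000044>"  or  "= -1 ENODATA (No data available) <0.008875>"
# \s+ also spans newlines, so wrapped output is handled.
RESULT = re.compile(
    r"=\s+(-?\d+)"                       # return value
    r"(?:\s+(E[A-Z0-9]+)\s+\([^)]*\))?"  # optional errno name and message
    r"\s+<(\d+\.\d+)>"                   # elapsed seconds
)


def durations_by_outcome(trace_text):
    """Group elapsed times by outcome: 'ok' or the errno name."""
    grouped = defaultdict(list)
    for match in RESULT.finditer(trace_text):
        outcome = match.group(2) or "ok"
        grouped[outcome].append(float(match.group(3)))
    return dict(grouped)


def slowdown(slow_durations, fast_durations):
    """Ratio of mean elapsed time between two sets of calls."""
    slow_mean = sum(slow_durations) / len(slow_durations)
    fast_mean = sum(fast_durations) / len(fast_durations)
    return slow_mean / fast_mean
```

Feeding the Long Boot and Short Boot traces through durations_by_outcome and
passing the two ENODATA lists to slowdown gives the per-call ratio between the
two boots.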
--------------------------------------------------
Christopher J. Jones
________________________________
From: Gregory Farnum <[email protected]>
Sent: Monday, October 30, 2017 6:20:15 PM
To: Chris Jones
Cc: [email protected]
Subject: Re: [ceph-users] Hammer to Jewel Upgrade - Extreme OSD Boot Time
On Thu, Oct 26, 2017 at 11:33 AM Chris Jones wrote:
The long running functionality appears to be related to clear_temp_objects();
from OSD.cc called from init().
What is this functionality intended to do? Is it required to be run on every
OSD startup? Any configuration settings that would help speed this up?
This function looks like it's cleaning up temporary objects that might have
been left behind. Basically, we are scanning through the objects looking for
temporaries, but we stop doing so once we hit a non-temp object (implying they
are ordered). So in the common case I think we're doing one listing in each PG,
finding there are no temp objects (or noting the few that remain), and then
advancing to the next PG. This will take a little time as we're basically doing
one metadata listing per PG, but that should end quickly.
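As a rough illustration of the scan described above (this is not the actual
OSD.cc code; the function names, the dict of PGs, and the is_temp predicate
are all hypothetical), the per-PG logic amounts to taking the leading run of
temp objects from a sorted listing and stopping at the first non-temp one:

```python
from itertools import takewhile


def leading_temp_objects(sorted_objects, is_temp):
    """Return the temp objects at the front of a sorted listing.

    Stops at the first non-temp object, relying on the assumption
    that temp objects sort ahead of everything else.
    """
    return list(takewhile(is_temp, sorted_objects))


def clear_temp_objects_sketch(pgs, is_temp):
    """Collect leftover temp objects per PG.

    `pgs` maps a PG id to its sorted object listing. PGs with no
    leading temp objects are skipped after looking at one entry.
    """
    found = {}
    for pg_id, sorted_objects in pgs.items():
        temps = leading_temp_objects(sorted_objects, is_temp)
        if temps:
            found[pg_id] = temps
    return found
```

The cost per PG is then dominated by producing the sorted listing, not by the
scan itself, which is why a backend that is slow to list would stand out here.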
I'm curious why this is so slow for you as I'm not aware of anybody else
reporting such issues. I suspect the ZFS backend is behaving rather differently
than the others, or that you've changed the default config options
dramatically, so that your OSDs have to do a much larger listing in order to
return the sorted list the OSD interface requires.
-Greg
_______________________________________________
ceph-users mailing list
[email protected]
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com