On Fri, Apr 24, 2015 at 4:05 PM, David Pacheco <[email protected]> wrote:
> On Fri, Apr 24, 2015 at 7:49 AM, Matthew Ahrens <[email protected]> wrote:
>
>> On Wed, Apr 22, 2015 at 12:18 PM, David Pacheco <[email protected]> wrote:
>>>
>>> Given that this seems to indicate that we've freed more data than we
>>> expected to, could this be indicative of corruption?
>>>
>>
>> I'd say these values being wrong is "corruption" by itself (albeit mostly
>> harmless), but it could also be indicative of a larger problem.
>>
>
> Fair enough. Yeah, I'm mostly wondering: given that I have a bunch of
> very large snapshots on these boxes that I'd like to delete, should I be
> worried that doing so might result in either data loss or a pool that can't
> be imported (e.g., due to metadata corruption)?
>
>>> In case it's relevant: most of the affected systems were previously
>>> affected by spacemap corruption related to async destroy[1].
>>>
>>
>> That could definitely cause this -- if we free the same block twice,
>> you'd get the spacemap corruption, and also the "freeing" would be
>> decremented twice for the same block, causing it to eventually go negative.
>>
>
> Okay. That's good to know. So maybe there are two failure modes here:
> one is caused by that earlier spacemap issue, and the other is unknown and
> affects the newer box. The newer one seemed a little different anyway,
> since that system's accounting is off by several orders of magnitude more
> than the older boxes.
>
>>> But the system with the largest negative value was installed recently and
>>> has only ever run bits as new as February.[2]
>>>
>>
>> I don't know what would cause that. The first step to tracking this down
>> would probably be to determine whether it's due to background free of
>> filesystems/zvols vs. snapshots.
>>
>
> Any suggestions on how to do that? What do you think about making the
> assertions that attempt to check this into VERIFYs?
> If there's a D script
> that could watch for this, we could run it on these boxes, but for all we
> know it could be months or years before we see it again.

Not sure how to without reproducing it. :-(

VERIFYs would be good, assuming you can tolerate the reboot.

You could also record the "freeing" before and after each phase of
dsl_scan_sync() (i.e., between when we do bpobj_iterate() and when we do
bptree_iterate()). If you see it go negative, grab the ::zfs_dbgmsg output
so you can see exactly what was recently deleted.

--matt

> Thanks for taking a look, and thanks Turbo for the additional data points.
>
> -- Dave
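[The "record the freeing counter around each phase of dsl_scan_sync()" idea above could be sketched as a D script along the following lines. This is an untested sketch, not something from the thread: the fbt probe points, the use of arg0 as the dsl_pool_t, and the dp_free_dir->dd_phys->dd_used_bytes structure walk are assumptions based on illumos source of this era and may need adjusting for a particular build.]

```d
#!/usr/sbin/dtrace -qs
/*
 * Hedged sketch, untested: log the pool's "freeing" counter (the
 * dd_used_bytes of dp_free_dir) as each phase of dsl_scan_sync() starts,
 * to see whether it goes negative during the bpobj (filesystem/zvol)
 * phase or the bptree (async destroy) phase.
 */

fbt::dsl_scan_sync:entry
{
	/* Assumption: arg0 is the dsl_pool_t * passed to dsl_scan_sync(). */
	self->dp = (dsl_pool_t *)arg0;
}

fbt::bpobj_iterate:entry,
fbt::bptree_iterate:entry
/self->dp != NULL/
{
	/* Print "freeing" as a signed value so a negative shows up as such. */
	printf("%s: freeing = %d\n", probefunc,
	    (int64_t)self->dp->dp_free_dir->dd_phys->dd_used_bytes);
}

fbt::dsl_scan_sync:return
/self->dp != NULL/
{
	printf("dsl_scan_sync done: freeing = %d\n",
	    (int64_t)self->dp->dp_free_dir->dd_phys->dd_used_bytes);
	self->dp = NULL;
}
```

If the counter does go negative, `mdb -ke ::zfs_dbgmsg` could then be run promptly to capture what was recently deleted, per Matt's suggestion above.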
_______________________________________________ developer mailing list [email protected] http://lists.open-zfs.org/mailman/listinfo/developer
