On Fri, Nov 13, 2015 at 02:40:44PM -0500, Austin S Hemmelgarn wrote:
> On 2015-11-13 13:42, Hugo Mills wrote:
> >On Fri, Nov 13, 2015 at 01:10:12PM -0500, Austin S Hemmelgarn wrote:
> >>On 2015-11-13 12:30, Vedran Vucic wrote:
> >>>Hello,
> >>>
> >>>Here are outputs of commands as you requested:
> >>>  btrfs fi df /
> >>>Data, single: total=8.00GiB, used=7.71GiB
> >>>System, DUP: total=32.00MiB, used=16.00KiB
> >>>Metadata, DUP: total=1.12GiB, used=377.25MiB
> >>>GlobalReserve, single: total=128.00MiB, used=0.00B
> >>>
> >>>btrfs fi show
> >>>Label: none  uuid: d6934db3-3ac9-49d0-83db-287be7b995a5
> >>>         Total devices 1 FS bytes used 8.08GiB
> >>>         devid    1 size 18.71GiB used 10.31GiB path /dev/sda6
> >>>
> >>>btrfs-progs v4.0+20150429
> >>>
> >>Hmm, that's odd: based on these numbers, you should be having no
> >>issue at all trying to run a balance. You might be hitting some
> >>other bug in the kernel instead, but I don't remember whether there
> >>were any known bugs related to ENOSPC or balance in the version
> >>you're running.
> >
> >    There's one specific bug that shows up with ENOSPC exactly like
> >this. It's in all versions of the kernel, there's no known solution,
> >and no guaranteed mitigation strategy, I'm afraid. Various things
> >have been tried: balancing, or adding a device, balancing, and then
> >removing the device again. Sometimes they seem to help; sometimes
> >they just make the problem worse.
> >
> >    We average maybe one report a week or so with this particular
> >set of symptoms.
> We should get this listed on the Wiki on the Gotchas page ASAP,
> especially considering that it's a pretty significant bug (not quite
> as bad as data corruption, but pretty darn close).

   It's certainly mentioned in the FAQ, in the main entry on
unexpected ENOSPC. The text takes you through identifying the "usual"
problem, then goes on to say that if you've hit ENOSPC while the
device still has unallocated space, you've got this issue.
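
   In this case you can see it straight from the numbers quoted above:
the device is 18.71GiB with only 10.31GiB allocated, so there's roughly
8.4GiB unallocated -- far more than the ~1GiB a new chunk normally
needs -- and yet you're seeing ENOSPC. If your btrfs-progs is recent
enough, something like

   btrfs fi usage /

will show the "Device unallocated" figure directly.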

> Vedran, could you try running the balance with just '-dusage=40' and
> then again with just '-musage=40'?  If just one of those fails, it
> could help narrow things down significantly.
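
   (For reference, that would be something like this, assuming / is
the mountpoint as in the df output above:

   btrfs balance start -dusage=40 /
   btrfs balance start -musage=40 /

The usage=40 filter restricts each run to chunks that are less than 40%
full, so they should be reasonably quick.)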
> 
> Hugo, is there anything else known about this issue (I don't recall
> seeing it mentioned before, and a quick web search didn't turn up
> much)?

   I grumble about it regularly on IRC, where we get many more reports
of it than on the mailing list. There have been a couple on here that
I can recall, but not many.

>  In particular:
> 1. Is there any known way to reliably reproduce it (I would assume
> not, as that would likely lead to a mitigation strategy.  If someone
> does find a reliable reproducer, please let me know; I've got some
> significant spare processor time and storage space I could dedicate
> to getting traces and filesystem images for debugging, and already
> have most of the required infrastructure set up for something like
> this)?

   None that I know of. I can start asking people for btrfs-image
dumps again, if you want to investigate. I did that for a while,
passing them to josef, but eventually he said he didn't need any more
of them. (He was always planning to investigate it, but kept getting
diverted by data corruption bugs, which have higher priority.)
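
   (For reference, if anyone affected wants to send one in: a
metadata-only dump can be made with something like

   btrfs-image -c9 -s /dev/sda6 /tmp/dump.img

where -c9 compresses the image and -s sanitises the file names, so no
file contents or real names end up in it. The device path is just the
one from Vedran's output above -- substitute your own, and ideally run
it with the filesystem unmounted.)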

> 2. Is it contagious (that is, if I send a snapshot from a filesystem
> that is affected by it, does the filesystem that receives the
> snapshot become affected; if we could find a way to reproduce it, I
> could easily answer this question within a couple of minutes of
> reproducing it)?

   No, as far as I know it doesn't transfer via send/receive, which is
largely equivalent to copying the data by other means -- receive is
implemented almost exclusively in userspace, with only a couple of
ioctls for mucking around with the UUIDs at the end.
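
   As an illustration (paths made up), something like

   btrfs send /mnt/ro-snapshot | btrfs receive /backup/

is, from the receiving filesystem's point of view, not much different
from copying the files across by any other means: the receive side
writes everything out through the normal write paths, so it inherits
whatever state that filesystem is in, not the sender's.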

> 3. Do we have any kind of statistics beyond the rate of reports (for
> example, does it happen more often on bigger filesystems, or
> possibly more frequently with certain chunk profiles)?

   Not that I've noticed, no. We've had it on small and large
filesystems, single-device and multi-device, HDD and SSD, converted
and not converted. At one point, a couple of years ago, I did think it
was down to converted filesystems, because we had a run of them, but
that seems not to be the case.

   Hugo.

-- 
Hugo Mills             | The glass is neither half-full nor half-empty; it is
hugo@... carfax.org.uk | twice as large as it needs to be.
http://carfax.org.uk/  |
PGP: E2AB1DE4          |                                      Dr Jon Whitehead
