Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 09:55:59PM -0400, Zygo Blaxell wrote:
> In this current case, I'm getting things like this:
> 
> [12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 
> 26805825306624 csum 4105596028 wanted 787343232 mirror 0
[...]
> The other other weird thing here is that I can't find an example in the
> logs of an extent with an EIO that isn't compressed.  I've been looking
> up a random sample of the extent numbers, matching them up to filefrag
> output, and finding e.g. the one compressed extent in the middle of an
> otherwise uncompressed git pack file.  That's...odd.  Maybe there's a
> problem with compressed extents in particular?  I'll see if I can
> script something to check all the logs at once...

No need for a script: this message wording appears only in
fs/btrfs/compression.c, so it can only ever be emitted while reading a
compressed extent.

Maybe there's a problem specific to raid5, degraded mode, and compressed
extents?





[PATCH] btrfs: add missing bytes_readonly attribute file in sysfs

2016-06-20 Thread Wang Xiaoguang
Signed-off-by: Wang Xiaoguang 
---
 fs/btrfs/sysfs.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/fs/btrfs/sysfs.c b/fs/btrfs/sysfs.c
index 4da84ca..7b0fcca 100644
--- a/fs/btrfs/sysfs.c
+++ b/fs/btrfs/sysfs.c
@@ -328,6 +328,7 @@ SPACE_INFO_ATTR(bytes_used);
 SPACE_INFO_ATTR(bytes_pinned);
 SPACE_INFO_ATTR(bytes_reserved);
 SPACE_INFO_ATTR(bytes_may_use);
+SPACE_INFO_ATTR(bytes_readonly);
 SPACE_INFO_ATTR(disk_used);
 SPACE_INFO_ATTR(disk_total);
 BTRFS_ATTR(total_bytes_pinned, btrfs_space_info_show_total_bytes_pinned);
@@ -339,6 +340,7 @@ static struct attribute *space_info_attrs[] = {
BTRFS_ATTR_PTR(bytes_pinned),
BTRFS_ATTR_PTR(bytes_reserved),
BTRFS_ATTR_PTR(bytes_may_use),
+   BTRFS_ATTR_PTR(bytes_readonly),
BTRFS_ATTR_PTR(disk_used),
BTRFS_ATTR_PTR(disk_total),
BTRFS_ATTR_PTR(total_bytes_pinned),
-- 
1.8.3.1
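
[Context for readers: each SPACE_INFO_ATTR() line generates a read-only sysfs
file for the corresponding btrfs_space_info counter. A rough sketch of what the
macro expands to in fs/btrfs/sysfs.c (approximate shape only; treat the helper
signatures as assumptions rather than the exact current code):

#define SPACE_INFO_ATTR(field)                                          \
static ssize_t btrfs_space_info_show_##field(struct kobject *kobj,      \
                                             struct kobj_attribute *a,  \
                                             char *buf)                 \
{                                                                       \
        struct btrfs_space_info *sinfo = to_space_info(kobj);           \
        return btrfs_show_u64(&sinfo->field, &sinfo->lock, buf);        \
}                                                                       \
BTRFS_ATTR(field, btrfs_space_info_show_##field)

With this patch applied, the new counter would show up as
/sys/fs/btrfs/<fsid>/allocation/<data|metadata|system>/bytes_readonly next to
the existing bytes_* files, assuming the usual sysfs layout.]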





Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 03:27:03PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> 
> >> For me the critical question is what does "some corrupted sectors" mean?
> >
> > On other raid5 arrays, I would observe a small amount of corruption every
> > time there was a system crash (some of which were triggered by disk
> > failures, some not).
> 
> What test are you using to determine there is corruption, and how much
> data is corrupted? Is this on every disk? Non-deterministically fewer
> than all disks? Have you identified this as a torn write or
> misdirected write or is it just garbage at some sectors? And what's
> the size? Partial sector? Partial md chunk (or fs block?)

In earlier cases, scrub, read(), and btrfs dev stat all reported the
incidents differently.  Scrub would attribute errors randomly to disks
(error counts spread randomly across all the disks in the 'btrfs scrub
status -d' output).  'dev stat' would correctly increment counts on only
those disks which had individually had an event (e.g. media error or
SATA bus reset).

Before deploying raid5, I tested these by intentionally corrupting
one disk in an otherwise healthy raid5 array and watching the result.
When scrub identified an inode and offset in the kernel log, the csum
failure log message matched the offsets producing EIO on read(), but
the statistics reported by scrub about which disk had been corrupted
were mostly wrong.  In such cases a scrub could repair the data.

A different thing happens if there is a crash.  In that case, scrub cannot
repair the errors.  Every btrfs raid5 filesystem I've deployed so far
behaves this way when disks turn bad.  I had assumed it was a software bug
in the comparatively new raid5 support that would get fixed eventually.

In this current case, I'm getting things like this:

[12008.243867] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4105596028 wanted 787343232 mirror 0
[12008.243876] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1689373462 wanted 787343232 mirror 0
[12008.243885] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3621611229 wanted 787343232 mirror 0
[12008.243893] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 113993114 wanted 787343232 mirror 0
[12008.243902] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1464956834 wanted 787343232 mirror 0
[12008.243911] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2545274038 wanted 787343232 mirror 0
[12008.243942] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4090153227 wanted 787343232 mirror 0
[12008.243952] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243961] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 0
[12008.243976] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 172651968 wanted 787343232 mirror 0
[12008.246158] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4129844199 wanted 787343232 mirror 1
[12008.247557] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1374425809 wanted 787343232 mirror 1
[12008.403493] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1567917468 wanted 787343232 mirror 1
[12008.409809] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2881359629 wanted 787343232 mirror 0
[12008.411165] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3021442070 wanted 787343232 mirror 0
[12008.411180] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3984314874 wanted 787343232 mirror 0
[12008.411189] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 599192427 wanted 787343232 mirror 0
[12008.411199] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 2887010053 wanted 787343232 mirror 0
[12008.411208] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 1314141634 wanted 787343232 mirror 0
[12008.411217] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 3156167613 wanted 787343232 mirror 0
[12008.411227] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 565550942 wanted 787343232 mirror 0
[12008.411236] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 4068631390 wanted 787343232 mirror 0
[12008.411245] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 csum 531263990 wanted 787343232 mirror 0
[12008.411255] BTRFS info (device vdc): csum failed ino 4420604 extent 
26805825306624 

Re: [RFC PATCH] btrfs: fix free space calculation in dump_space_info()

2016-06-20 Thread Wang Xiaoguang

hello,

On 06/21/2016 02:09 AM, Omar Sandoval wrote:

On Mon, Jun 20, 2016 at 06:47:05PM +0800, Wang Xiaoguang wrote:

Signed-off-by: Wang Xiaoguang 
---
  fs/btrfs/extent-tree.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
index f3e..13a87fe 100644
--- a/fs/btrfs/extent-tree.c
+++ b/fs/btrfs/extent-tree.c
@@ -7736,8 +7736,8 @@ static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
printk(KERN_INFO "BTRFS: space_info %llu has %llu free, is %sfull\n",
   info->flags,
   info->total_bytes - info->bytes_used - info->bytes_pinned -
-  info->bytes_reserved - info->bytes_readonly,
-  (info->full) ? "" : "not ");
+  info->bytes_reserved - info->bytes_readonly -
+  info->bytes_may_use, (info->full) ? "" : "not ");
printk(KERN_INFO "BTRFS: space_info total=%llu, used=%llu, pinned=%llu, 
"
   "reserved=%llu, may_use=%llu, readonly=%llu\n",
   info->total_bytes, info->bytes_used, info->bytes_pinned,
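
(For reference: after this change, the "free" figure printed by
dump_space_info() is computed as
total_bytes - bytes_used - bytes_pinned - bytes_reserved - bytes_readonly - bytes_may_use,
i.e. bytes_may_use is now subtracted as well, matching the counters printed on
the following line.)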

I think you meant to send this to linux-btrfs@vger.kernel.org

Yes, should I resend it to linux-btrfs@vger.kernel.org again? thanks.

Regards,
Xiaoguang Wang








Re: Confusing output from fi us/df

2016-06-20 Thread Satoru Takeuchi
On 2016/06/21 8:30, Marc Grondin wrote:
> Hi everyone,
> 
> 
> I have a btrfs filesystem on top of a 4x1TB mdraid raid5 array and I've
> been getting confusing output on metadata usage. It seems that even though
> both data and metadata are in the single profile, metadata is reporting
> double the space (as if it were in the DUP profile).
> 
> 
> 
> root@thebeach /h/marcg> uname -a
> Linux thebeach 4.6.2-gentoo-GMAN #1 SMP Sat Jun 11 22:32:27 ADT 2016
> x86_64 Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz GenuineIntel GNU/Linux
> root@thebeach /h/marcg> btrfs --version
> btrfs-progs v4.5.3
> root@thebeach /h/marcg> btrfs fi us /media/Storage2
> Overall:
> Device size: 2.73TiB
> Device allocated: 1.71TiB
> Device unallocated: 1.02TiB
> Device missing: 0.00B
> Used: 1.38TiB
> Free (estimated): 1.34TiB (min: 1.34TiB)
> Data ratio: 1.00
> Metadata ratio: 1.00
> Global reserve: 512.00MiB (used: 0.00B)
> 
> 
> Data,single: Size:1.71TiB, Used:1.38TiB
> /dev/mapper/storage2 1.71TiB
> 
> 
> Metadata,single: Size:3.00GiB, Used:1.53GiB
> /dev/mapper/storage2 3.00GiB
> 
> 
> System,single: Size:32.00MiB, Used:208.00KiB
> /dev/mapper/storage2 32.00MiB
> 
> 
> Unallocated:
> /dev/mapper/storage2 1.02TiB
> root@thebeach /h/marcg> btrfs fi df /media/Storage2
> Data, single: total=1.71TiB, used=1.38TiB
> System, single: total=32.00MiB, used=208.00KiB
> Metadata, single: total=3.00GiB, used=1.53GiB
> GlobalReserve, single: total=512.00MiB, used=0.00B
> root@thebeach /h/marcg>
> 
> 
> I'm not sure if this is known and if it's btrfs-progs related or if it
> is actually allocating that space.

Could you point me to where you think
metadata is reporting double the space?

from fi us:
> Metadata,single: Size:3.00GiB, Used:1.53GiB
> /dev/mapper/storage2 3.00GiB

from fi df:
> Metadata, single: total=3.00GiB, used=1.53GiB

As far as I can see, Btrfs just allocates 3.00 GiB
from /dev/mapper/storage2, the Metadata,single size is
the same as that (not double), and 1.53 GiB of it is used.


The following is from my own case, where data is single
and metadata is DUP.

from fi us:
Metadata,DUP: Size:384.00MiB, Used:221.36MiB
   /dev/vda3 768.00MiB

from fi df:
Metadata, DUP: total=384.00MiB, used=221.36MiB

Here Btrfs allocates 768.00MiB from /dev/vda3, which is
twice the size of Metadata,DUP (384.00MiB).
I guess that is what "metadata is reporting double the space"
would look like, and your case is not like that.

CMIIW.

Thanks,
Satoru

> 
> 
> Thank you for reading.
> 
> 
> Marc
> 


Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-20 Thread Qu Wenruo



At 06/21/2016 12:03 AM, David Sterba wrote:

Hi,

I'm looking at how well this patchset merges with the rest; so far
there are expected conflicts with Chandan's subpage-blocksize
patchset. For the easy parts, we can add stub patches to extend
functions like cow_file_range with parameters that are added by the
other patch.

Honestly I don't know which patchset to take first. As
subpage-blocksize is in the core, there are no user-visibility or
interface questions, but it must not cause any regressions.

Dedupe is optional, not default, and we mainly have to make sure it does
not have any impact when turned off.

So I see three possible ways:

* merge subpage first, as it defines the API, adapt dedupe


Personally, I'd like to merge subpage first.

AFAIK, it's more important than dedupe: it affects whether a fs created
in a 64K page size environment can be mounted on a 4K page size system.


Furthermore, dedupe is still not ready to be merged.

The main undetermined part is the ioctl interface.
I'm still working on the stateful ioctl interface (with a -f option for
stateless use), along with some minor changes to allow easy extension
(to let a user-space caller know exactly which optional parameter is
not supported, for later dedupe rate accounting and other things).

Wang and I are also waiting for feedback on the V11 patchset.
The latest ENOSPC fix may need another version to address that feedback.

Furthermore, for dedupe it's quite easy to avoid any possible
problem related to the sectorsize change: just increase the minimal
dedupe blocksize to the maximum sectorsize (64K), and the possible
conflicts would be solved.

Thanks,
Qu


* merge dedupe first, as it only enhances the existing API, adapt subpage
* create a common branch for both, merge relevant parts of each
  patchset, add more patches to prepare common ground for either patch

You can deduce yourself which variant puts work on whom. My preference
is to do the 3rd variant, as it does not force any particular merge
order on us.







Re: Confusing output from fi us/df

2016-06-20 Thread Hans van Kranenburg

Hi,

On 06/21/2016 01:30 AM, Marc Grondin wrote:


I have a btrfs filesystem on top of a 4x1TB mdraid raid5 array and I've
been getting confusing output on metadata usage. It seems that even though
both data and metadata are in the single profile, metadata is reporting
double the space (as if it were in the DUP profile).


I guess that's a coincidence.

From the total amount of space you have (on top of the mdraid), there's 
3 GiB allocated/reserved for use as metadata. Inside that 3 GiB, 1.53GiB 
of actual metadata is present.



[...]
Metadata,single: Size:3.00GiB, Used:1.53GiB
/dev/mapper/storage2 3.00GiB



Metadata, single: total=3.00GiB, used=1.53GiB


If you'd change to DUP, it would look like:

for fi usage:

Metadata,DUP: Size:3.00GiB, Used:1.53GiB
/dev/mapper/storage2 6.00GiB

for fi df:

Metadata, DUP: total=3.00GiB, used=1.53GiB

Except for the 6.00GiB which would show the actual usage on disk, the 
other metadata numbers all hide the profile ratio.


Hans


Confusing output from fi us/df

2016-06-20 Thread Marc Grondin
Hi everyone,


I have a btrfs filesystem on top of a 4x1TB mdraid raid5 array and I've
been getting confusing output on metadata usage. It seems that even though
both data and metadata are in the single profile, metadata is reporting
double the space (as if it were in the DUP profile).



root@thebeach /h/marcg> uname -a
Linux thebeach 4.6.2-gentoo-GMAN #1 SMP Sat Jun 11 22:32:27 ADT 2016
x86_64 Intel(R) Core(TM) i5-2320 CPU @ 3.00GHz GenuineIntel GNU/Linux
root@thebeach /h/marcg> btrfs --version
btrfs-progs v4.5.3
root@thebeach /h/marcg> btrfs fi us /media/Storage2
Overall:
Device size: 2.73TiB
Device allocated: 1.71TiB
Device unallocated: 1.02TiB
Device missing: 0.00B
Used: 1.38TiB
Free (estimated): 1.34TiB (min: 1.34TiB)
Data ratio: 1.00
Metadata ratio: 1.00
Global reserve: 512.00MiB (used: 0.00B)


Data,single: Size:1.71TiB, Used:1.38TiB
/dev/mapper/storage2 1.71TiB


Metadata,single: Size:3.00GiB, Used:1.53GiB
/dev/mapper/storage2 3.00GiB


System,single: Size:32.00MiB, Used:208.00KiB
/dev/mapper/storage2 32.00MiB


Unallocated:
/dev/mapper/storage2 1.02TiB
root@thebeach /h/marcg> btrfs fi df /media/Storage2
Data, single: total=1.71TiB, used=1.38TiB
System, single: total=32.00MiB, used=208.00KiB
Metadata, single: total=3.00GiB, used=1.53GiB
GlobalReserve, single: total=512.00MiB, used=0.00B
root@thebeach /h/marcg>


I'm not sure if this is known and if it's btrfs-progs related or if it
is actually allocating that space.


Thank you for reading.


Marc



Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 2:40 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:

>> For me the critical question is what does "some corrupted sectors" mean?
>
> On other raid5 arrays, I would observe a small amount of corruption every
> time there was a system crash (some of which were triggered by disk
> failures, some not).

What test are you using to determine there is corruption, and how much
data is corrupted? Is this on every disk? Non-deterministically fewer
than all disks? Have you identified this as a torn write or
misdirected write or is it just garbage at some sectors? And what's
the size? Partial sector? Partial md chunk (or fs block?)

>  It looked like any writes in progress at the time
> of the failure would be damaged.  In the past I would just mop up the
> corrupt files (they were always the last extents written, easy to find
> with find-new or scrub) and have no further problems.

This is on Btrfs? This isn't supposed to be possible. Even a literal
overwrite of a file is not an overwrite on Btrfs unless the file is
nodatacow. Data extents get written, then the metadata is updated to
point to those new blocks. There should be flush or fua requests to
make sure the order is such that the fs points to either the old or
new file, in either case uncorrupted. That's why I'm curious about the
nature of this corruption. It sounds like your hardware is not exactly
honoring flush requests.

With md raid and any other file system, it's pure luck that such
corrupted writes would only affect data extents and not the fs
metadata. Corrupted fs metadata is not well tolerated by any file
system, not least because most of them have no idea the metadata
is corrupt. At least Btrfs can determine this and if there's another
copy use that or just stop and face plant before more damage happens.
Maybe an exception now is XFS v5 metadata which employs checksumming.
But it still doesn't know if data extents are wrong (i.e. a torn or
misdirected write).

I've had perhaps a hundred power off during write with Btrfs and SSD
and I don't ever see corrupt files. It's definitely not normal to see
this with Btrfs.


> In the earlier
> cases there were no new instances of corruption after the initial failure
> event and manual cleanup.
>
> Now that I dug a little deeper into this, I do see one fairly significant
> piece of data:
>
> root@host:~# btrfs dev stat /data | grep -v ' 0$'
> [/dev/vdc].corruption_errs 16774
> [/dev/vde].write_io_errs   121
> [/dev/vde].read_io_errs   4
> [devid:8].read_io_errs   16
>
> Prior to the failure of devid:8, vde had 121 write errors and 4 read
> errors (these counter values are months old and the errors were long
> since repaired by scrub).  The 16774 corruption errors on vdc are all
> new since the devid:8 failure, though.

On md RAID 5 and 6, if the array gets parity mismatch counts above 0
when doing a scrub (check > md/sync_action), there's a hardware problem.
It's entirely possible you've found a bug, but it must be extremely
obscure to basically not have hit anyone else trying Btrfs raid56. I
think you need to track down the source of this corruption and stop it
however possible, whether that's changing hardware or making sure
the system isn't crashing.

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 01:30:11PM -0600, Chris Murphy wrote:
> On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
>  wrote:
> > On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> >> On Sun, 19 Jun 2016 23:44:27 -0400
> Seems difficult at best due to this:
> >> The normal 'device delete' operation got about 25% of the way in,
> >> then got stuck on some corrupted sectors and aborting with EIO.
> 
> In effect it's like a 2 disk failure for a raid5 (or it's
> intermittently a 2 disk failure but always at least a 1 disk failure).
> That's not something md raid recovers from. Even manual recovery in
> such a case is far from certain.
> 
> Perhaps Roman's advice is also a question about the cause of this
> corruption? I'm wondering this myself. That's the real problem here as
> I see it. Losing a drive is ordinary. Additional corruptions happening
> afterward is not. And are those corrupt sectors hardware corruptions,
> or Btrfs corruptions at the time the data was written to disk, or
> Btrfs being confused as it's reading the data from disk?

> For me the critical question is what does "some corrupted sectors" mean?

On other raid5 arrays, I would observe a small amount of corruption every
time there was a system crash (some of which were triggered by disk
failures, some not).  It looked like any writes in progress at the time
of the failure would be damaged.  In the past I would just mop up the
corrupt files (they were always the last extents written, easy to find
with find-new or scrub) and have no further problems.  In the earlier
cases there were no new instances of corruption after the initial failure
event and manual cleanup.

Now that I dug a little deeper into this, I do see one fairly significant
piece of data:

root@host:~# btrfs dev stat /data | grep -v ' 0$'
[/dev/vdc].corruption_errs 16774
[/dev/vde].write_io_errs   121
[/dev/vde].read_io_errs   4
[devid:8].read_io_errs   16

Prior to the failure of devid:8, vde had 121 write errors and 4 read
errors (these counter values are months old and the errors were long
since repaired by scrub).  The 16774 corruption errors on vdc are all
new since the devid:8 failure, though.

> 
> 
> -- 
> Chris Murphy
> 




Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 1:11 PM, Zygo Blaxell
 wrote:
> On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
>> On Sun, 19 Jun 2016 23:44:27 -0400
>> Zygo Blaxell  wrote:
>> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
>> better off shutting down the system, booting a rescue OS, copying the content
>> of the failing disk to the replacement one using 'ddrescue', then removing the
>> bad disk, and after boot up your main system wouldn't notice anything has ever
>> happened, aside from a few recoverable CRC errors in the "holes" on the areas
>> which ddrescue failed to copy.
>
> I'm aware of ddrescue and myrescue, but in this case the disk has failed,
> past tense.  At this point the remaining choices are to make btrfs native
> raid5 recovery work, or to restore from backups.

Seems difficult at best due to this:
>> The normal 'device delete' operation got about 25% of the way in,
>> then got stuck on some corrupted sectors and aborting with EIO.

In effect it's like a 2 disk failure for a raid5 (or it's
intermittently a 2 disk failure but always at least a 1 disk failure).
That's not something md raid recovers from. Even manual recovery in
such a case is far from certain.

Perhaps Roman's advice is also a question about the cause of this
corruption? I'm wondering this myself. That's the real problem here as
I see it. Losing a drive is ordinary. Additional corruptions happening
afterward is not. And are those corrupt sectors hardware corruptions,
or Btrfs corruptions at the time the data was written to disk, or
Btrfs being confused as it's reading the data from disk?

For me the critical question is what does "some corrupted sectors" mean?


-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Zygo Blaxell
On Mon, Jun 20, 2016 at 11:13:51PM +0500, Roman Mamedov wrote:
> On Sun, 19 Jun 2016 23:44:27 -0400
> Zygo Blaxell  wrote:
> From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
> better off shutting down the system, booting a rescue OS, copying the content
> of the failing disk to the replacement one using 'ddrescue', then removing the
> bad disk, and after boot up your main system wouldn't notice anything has ever
> happened, aside from a few recoverable CRC errors in the "holes" on the areas
> which ddrescue failed to copy.

I'm aware of ddrescue and myrescue, but in this case the disk has failed,
past tense.  At this point the remaining choices are to make btrfs native
raid5 recovery work, or to restore from backups.

> But in general it's commendable that you're experimenting with doing things
> "the native way", as this provides feedback to the developers and could help
> make the RAID implementation better. I guess that's the whole point of the
> exercise and the report, and I hope this ends up being useful for everyone.

The intent was both to provide a cautionary tale for anyone considering
deploying a btrfs raid5 system today, and to possibly engage some
developers to help solve the problems.

The underlying causes seem to be somewhat removed from where the symptoms
are appearing, and at the moment I don't understand this code well enough
to know where to look for them.  Any assistance would be greatly appreciated.


> -- 
> With respect,
> Roman






Re: [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-20 Thread Deepa Dinamani
> This version now looks ok to me.
>
> I do have a comment (or maybe just a RFD) for future work.
>
> It does strike me that once we actually change over the inode times to
> use timespec64, the calling conventions are going to be fairly
> horrendous on most 32-bit architectures.
>
> Gcc handles 8-byte structure returns (on most architectures) by
> returning them as two 32-bit registers (%edx:%eax on x86). But once it
> is timespec64, that will no longer be the case, and the calling
> convention will end up using a pointer to the local stack instead.
>
> So for 32-bit code generation, we *may* want to introduce a new model of doing
>
> set_inode_time(inode, ATTR_ATIME | ATTR_MTIME);
>
> which basically just does
>
> inode->i_atime = inode->i_mtime = current_time(inode);
>
> but with a much easier calling convention on 32-bit architectures.

Arnd and I had discussed something like this before,
but for entirely different reasons:

Having set_inode_time() like you suggest will also help with switching
the vfs inode times to timespec64.
We were suggesting that all accesses to inode times be abstracted
through something like inode_set_time().
Arnd had also suggested a split representation of the fields in struct
inode, which led to space savings as well. And having the split
representation also meant no more direct assignments:

https://lkml.org/lkml/2016/1/7/20

This in general would be similar to setattr_copy(), but would only set
times rather than other attributes as well.

If this is what is preferred, then the patches to change the vfs to use
timespec64 could make use of this and will need to be refactored.
So maybe it would be good to discuss this before I post those patches.

-Deepa


Re: Scrub aborts on newer kernels

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 3:22 AM, Tyson Whitehead  wrote:
> On 17/06/16 06:18 PM, Chris Murphy wrote:
>>
>> On Fri, Jun 17, 2016 at 8:45 AM, Tyson Whitehead 
>> wrote:
>>>
>>> On May 27, 2016 12:12:54 PM Chris Murphy wrote:

 On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead 
 wrote:
>
> Under the last several kernels versions (4.6 and I believe 4.4 and,
> 4.5) btrfs scrub aborts before completing.


 I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.

 I think the bigger issue is the lack of information why a scrub is
 aborted.
>>>
>>>
>>> Thanks for checking into this Chris.
>>>
>>> Any advice on how to get some more information out of the scrub process?
>>
>>
>> Next you need to find out why this one device has so many errors. Put the
>> output from smartctl -x /dev/sdb somewhere, maybe even attach it to the bug
>> report since it's somewhat related. Either that device is simply unreliable,
>> or you've got a bad cable connection.
>
>
> The device is okay.  The errors were caused by me running it for a period
> with only one of the devices present.
>
> In more detail.  My desktop has a BTRFS RAID 1 setup.  I needed to access
> it on the road, so I just shut the desktop down, grabbed one of the drives,
> and used it in my laptop in degraded mode.
>
> When I got back I recombined them (the one I left in my office was never
> booted by itself) and started a scrub.  That is when I discovered scrub
> aborted on newer kernels but completed okay on older ones.

That's troubling indeed.

OK so what about a balance with one of the kernels that aborts scrub?
Does the balance abort or does it work and then does a subsequent
scrub work?

-- 
Chris Murphy


Re: Adventures in btrfs raid5 disk recovery

2016-06-20 Thread Roman Mamedov
On Sun, 19 Jun 2016 23:44:27 -0400
Zygo Blaxell  wrote:

> It's not going well so far.  Pay attention, there are at least four
> separate problems in here and we're not even half done yet.
> 
> I'm currently using kernel 4.6.2 with btrfs fixes forward-ported from
> 4.5.7, because 4.5.7 has a number of fixes that 4.6.2 doesn't.  I have
> also pulled in some patches from the 4.7-rc series.
> 
> This fixed a few problems I encountered early on, and I'm still making
> forward progress, but I've only replaced 50% of the failed disk so far,
> and this is week four of this particular project.

From a practical standpoint, [aside from not using Btrfs RAID5], you'd be
better off shutting down the system, booting a rescue OS, copying the content
of the failing disk to the replacement one using 'ddrescue', then removing the
bad disk, and after boot up your main system wouldn't notice anything has ever
happened, aside from a few recoverable CRC errors in the "holes" on the areas
which ddrescue failed to copy.

But in general it's commendable that you're experimenting with doing things
"the native way", as this provides feedback to the developers and could help
make the RAID implementation better. I guess that's the whole point of the
exercise and the report, and I hope this ends up being useful for everyone.

-- 
With respect,
Roman




Re: Scrub aborts on newer kernels

2016-06-20 Thread Chris Murphy
On Mon, Jun 20, 2016 at 3:22 AM, Tyson Whitehead  wrote:
> On 17/06/16 06:18 PM, Chris Murphy wrote:
>>
>> On Fri, Jun 17, 2016 at 8:45 AM, Tyson Whitehead 
>> wrote:
>>>
>>> On May 27, 2016 12:12:54 PM Chris Murphy wrote:

 On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead 
 wrote:
>
> Under the last several kernels versions (4.6 and I believe 4.4 and,
> 4.5) btrfs scrub aborts before completing.


 I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.

 I think the bigger issue is the lack of information why a scrub is
 aborted.
>>>
>>>
>>> Thanks for checking into this Chris.
>>>
>>> Any advice on how to get some more information out of the scrub process?
>>
>>
>> Next you need to find out why this one device has so many errors. Put the
>> output from smartctl -x /dev/sdb somewhere, maybe even attach it to the bug
>> report since it's somewhat related. Either that device is simply unreliable,
>> or you've got a bad cable connection.
>
>
> The device is okay.  The errors were caused by me running it for a period
> with only one of the devices present.
>
> In more detail.  My desktop has a BTRFS RAID 1 setup.  I needed to access
> it on the road, so I just shut the desktop down, grabbed one of the drives,
> and used it in my laptop in degraded mode.
>
> When I got back I recombined them (the one I left in my office was never
> booted by itself) and started a scrub.  That is when I discovered scrub
> aborted on newer kernels but completed okay on older ones.
>
> Cheers!  -Tyson

Are you absolutely positively certain that only one of the two devices
was ever written to while mounted degraded? As in you're 100% certain,
not 99% or less certain?


-- 
Chris Murphy


Re: [RFC PATCH] btrfs: fix free space calculation in dump_space_info()

2016-06-20 Thread Omar Sandoval
On Mon, Jun 20, 2016 at 06:47:05PM +0800, Wang Xiaoguang wrote:
> Signed-off-by: Wang Xiaoguang 
> ---
>  fs/btrfs/extent-tree.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/fs/btrfs/extent-tree.c b/fs/btrfs/extent-tree.c
> index f3e..13a87fe 100644
> --- a/fs/btrfs/extent-tree.c
> +++ b/fs/btrfs/extent-tree.c
> @@ -7736,8 +7736,8 @@ static void dump_space_info(struct btrfs_space_info *info, u64 bytes,
>   printk(KERN_INFO "BTRFS: space_info %llu has %llu free, is %sfull\n",
>  info->flags,
>  info->total_bytes - info->bytes_used - info->bytes_pinned -
> -info->bytes_reserved - info->bytes_readonly,
> -(info->full) ? "" : "not ");
> +info->bytes_reserved - info->bytes_readonly -
> +info->bytes_may_use, (info->full) ? "" : "not ");
>   printk(KERN_INFO "BTRFS: space_info total=%llu, used=%llu, pinned=%llu, "
>  "reserved=%llu, may_use=%llu, readonly=%llu\n",
>  info->total_bytes, info->bytes_used, info->bytes_pinned,

I think you meant to send this to linux-btrfs@vger.kernel.org

-- 
Omar


Re: [PATCH v2 00/24] Delete CURRENT_TIME and CURRENT_TIME_SEC macros

2016-06-20 Thread Linus Torvalds
On Sun, Jun 19, 2016 at 5:26 PM, Deepa Dinamani  wrote:
> The series is aimed at getting rid of CURRENT_TIME and CURRENT_TIME_SEC 
> macros.

This version now looks ok to me.

I do have a comment (or maybe just a RFD) for future work.

It does strike me that once we actually change over the inode times to
use timespec64, the calling conventions are going to be fairly
horrendous on most 32-bit architectures.

Gcc handles 8-byte structure returns (on most architectures) by
returning them as two 32-bit registers (%edx:%eax on x86). But once it
is timespec64, that will no longer be the case, and the calling
convention will end up using a pointer to the local stack instead.

So for 32-bit code generation, we *may* want to introduce a new model of doing

set_inode_time(inode, ATTR_ATIME | ATTR_MTIME);

which basically just does

inode->i_atime = inode->i_mtime = current_time(inode);

but with a much easier calling convention on 32-bit architectures.
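
[For illustration, a minimal sketch of the helper being proposed here; the
name, the reuse of the ATTR_* flags and the exact return type of current_time()
are assumptions made for the example, not an existing API:

#include <linux/fs.h>

static inline void set_inode_time(struct inode *inode, unsigned int flags)
{
        /* one current_time() call, one struct on the local stack */
        struct timespec now = current_time(inode);

        if (flags & ATTR_ATIME)
                inode->i_atime = now;
        if (flags & ATTR_MTIME)
                inode->i_mtime = now;
        if (flags & ATTR_CTIME)
                inode->i_ctime = now;
}

Once the inode times become timespec64, only this helper would have to deal
with the wider calling convention, instead of every direct assignment site.]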

But that is entirely orthogonal to this patch-set, and should be seen
as a separate issue.

And maybe it doesn't end up helping anyway, but I do think those
"simple" direct assignments will really generate pretty disgusting
code on 32-bit architectures.

That whole

inode->i_atime = inode->i_mtime = CURRENT_TIME;

model really made a lot more sense back in the ancient days when inode
times were just simply 32-bit seconds (not even timeval structures).

  Linus


Is "btrfs balance start" truly asynchronous?

2016-06-20 Thread Dmitry Katsubo

Dear btfs community,

I have added a drive to an existing raid1 btrfs volume and decided to
perform a balance so that data is distributed "fairly" among the drives.
I started "btrfs balance start", but it stalled for about 5-10 minutes,
intensively doing the work. After that time it printed something
like "had to relocate 50 chunks" and exited. According to drive I/O,
"btrfs balance" did most (if not all) of the work, so by the time it
exited the job was done.


Shouldn't "btrfs balance start" do the operation in the background?

Thanks for any information.

--
With best regards,
Dmitry


Re: [PATCH v11 00/13] Btrfs dedupe framework

2016-06-20 Thread David Sterba
Hi,

I'm looking at how well this patchset merges with the rest; so far
there are expected conflicts with Chandan's subpage-blocksize
patchset. For the easy parts, we can add stub patches to extend
functions like cow_file_range with parameters that are added by the
other patch.

Honestly I don't know which patchset to take first. As
subpage-blocksize is in the core, there are no user-visibility or
interface questions, but it must not cause any regressions.

Dedupe is optional, not default, and we mainly have to make sure it does
not have any impact when turned off.

So I see three possible ways:

* merge subpage first, as it defines the API, adapt dedupe
* merge dedupe first, as it only enhances the existing API, adapt subpage
* create a common branch for both, merge relevant parts of each
  patchset, add more patches to prepare common ground for either patch

You can deduce yourself which variant puts work on whom. My preference
is to do the 3rd variant, as it does not force any particular merge
order on us.


Re: [PATCH] Btrfs: fix ->iterate_shared() by upgrading i_rwsem for delayed nodes

2016-06-20 Thread David Sterba
On Fri, Jun 17, 2016 at 10:55:54AM -0700, Omar Sandoval wrote:
> On Wed, May 25, 2016 at 04:22:26PM -0400, Chris Mason wrote:
> > On Wed, May 25, 2016 at 10:11:29PM +0200, David Sterba wrote:
> > > On Fri, May 20, 2016 at 01:50:33PM -0700, Omar Sandoval wrote:
> > > > Commit fe742fd4f90f ("Revert "btrfs: switch to ->iterate_shared()"")
> > > > backed out the conversion to ->iterate_shared() for Btrfs because the
> > > > delayed inode handling in btrfs_real_readdir() is racy. However, we can
> > > > still do readdir in parallel if there are no delayed nodes.
> > > 
> > > So this is for current master (pre 4.7-rc1), I'll add an appropriate
> > > merge point for to my for-next.
> > 
> > I'll get this bashed on in a big stress.sh run, but it looks good to me.
> > 
> > -chris
> > 
> 
> Chris/Dave, what's the plan for this? Since it's getting late for 4.7,
> should we just think harder for a proper solution for 4.8?

There are a few more fixes queued for rc5 so I can add this one as well.
It's been in for-next since the first posting but I haven't run any
specific tests.


Re: dd on wrong device, 1.9 GiB from the beginning has been overwritten, how to restore partition?

2016-06-20 Thread Maximilian Böhm
Hey, I want you to know that it was impossible to recover the filesystem and
that I have recreated the partition. I lost ~1.5 TiB of non-redundant data, but
it's just an annoyance, no catastrophe – I can recreate my collection and it
wasn't any critical data.

For others with related problems: My latest attempt was to recover the 
superblock and metadata:

$ btrfs rescue super-recover -v /dev/sdc1
All Devices:
Device: id = 1, name = /dev/sdc1

Before Recovering:
[All good supers]:
device name = /dev/sdc1
superblock bytenr = 274877906944

[All bad supers]:
device name = /dev/sdc1
superblock bytenr = 65536

device name = /dev/sdc1
superblock bytenr = 67108864


Make sure this is a btrfs disk otherwise the tool will destroy other fs, Are 
you sure? [y/N]: y
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
checksum verify failed on 21004288 found E4E3BDB6 wanted 
bytenr mismatch, want=21004288, have=0
ERROR: cannot read chunk root
Failed to recover bad superblocks
*** Error in `btrfs': double free or corruption (fasttop): 0x00946010 ***
======= Backtrace: =========
/usr/lib/libc.so.6(+0x6ed4b)[0x7fa620c8cd4b]
/usr/lib/libc.so.6(+0x74546)[0x7fa620c92546]
/usr/lib/libc.so.6(+0x74d1e)[0x7fa620c92d1e]
btrfs(btrfs_close_devices+0xf7)[0x456227]
btrfs(btrfs_recover_superblocks+0x459)[0x437729]
btrfs(main+0x7b)[0x40ac3b]
/usr/lib/libc.so.6(__libc_start_main+0xf1)[0x7fa620c3e741]
btrfs(_start+0x29)[0x40ad39]
======= Memory map: ========
0040-00495000 r-xp  00:13 4097939
/usr/bin/btrfs
00694000-00698000 r--p 00094000 00:13 4097939
/usr/bin/btrfs
00698000-0069a000 rw-p 00098000 00:13 4097939
/usr/bin/btrfs
0069a000-0069f000 rw-p  00:00 0
00946000-00967000 rw-p  00:00 0  [heap]
7fa61c00-7fa61c021000 rw-p  00:00 0
7fa61c021000-7fa62000 ---p  00:00 0
7fa620a06000-7fa620a1c000 r-xp  00:13 3704352
/usr/lib/libgcc_s.so.1
7fa620a1c000-7fa620c1b000 ---p 00016000 00:13 3704352
/usr/lib/libgcc_s.so.1
7fa620c1b000-7fa620c1c000 rw-p 00015000 00:13 3704352
/usr/lib/libgcc_s.so.1
7fa620c1e000-7fa620db5000 r-xp  00:13 3867364
/usr/lib/libc-2.23.so
7fa620db5000-7fa620fb5000 ---p 00197000 00:13 3867364
/usr/lib/libc-2.23.so
7fa620fb5000-7fa620fb9000 r--p 00197000 00:13 3867364
/usr/lib/libc-2.23.so
7fa620fb9000-7fa620fbb000 rw-p 0019b000 00:13 3867364
/usr/lib/libc-2.23.so
7fa620fbb000-7fa620fbf000 rw-p  00:00 0
7fa620fc6000-7fa620fde000 r-xp  00:13 3867345
/usr/lib/libpthread-2.23.so
7fa620fde000-7fa6211dd000 ---p 00018000 00:13 3867345
/usr/lib/libpthread-2.23.so
7fa6211dd000-7fa6211de000 r--p 00017000 00:13 3867345
/usr/lib/libpthread-2.23.so
7fa6211de000-7fa6211df000 rw-p 00018000 00:13 3867345
/usr/lib/libpthread-2.23.so
7fa6211df000-7fa6211e3000 rw-p  00:00 0
7fa6211e6000-7fa621207000 r-xp  00:13 6889   
/usr/lib/liblzo2.so.2.0.0
7fa621207000-7fa621406000 ---p 00021000 00:13 6889   
/usr/lib/liblzo2.so.2.0.0
7fa621406000-7fa621407000 r--p 0002 00:13 6889   
/usr/lib/liblzo2.so.2.0.0
7fa621407000-7fa621408000 rw-p 00021000 00:13 6889   
/usr/lib/liblzo2.so.2.0.0
7fa62140e000-7fa621423000 r-xp  00:13 7051   
/usr/lib/libz.so.1.2.8
7fa621423000-7fa621622000 ---p 00015000 00:13 7051   
/usr/lib/libz.so.1.2.8
7fa621622000-7fa621623000 r--p 00014000 00:13 7051   
/usr/lib/libz.so.1.2.8
7fa621623000-7fa621624000 rw-p 00015000 00:13 7051   
/usr/lib/libz.so.1.2.8
7fa621626000-7fa621664000 r-xp  00:13 3436728
/usr/lib/libblkid.so.1.1.0
7fa621664000-7fa621863000 ---p 0003e000 00:13 3436728
/usr/lib/libblkid.so.1.1.0
7fa621863000-7fa621867000 r--p 0003d000 00:13 3436728
/usr/lib/libblkid.so.1.1.0
7fa621867000-7fa621868000 rw-p 00041000 00:13 3436728
/usr/lib/libblkid.so.1.1.0
7fa621868000-7fa621869000 rw-p  00:00 0
7fa62186e000-7fa621872000 r-xp  00:13 3436727
/usr/lib/libuuid.so.1.3.0
7fa621872000-7fa621a71000 ---p 4000 00:13 3436727
/usr/lib/libuuid.so.1.3.0
7fa621a71000-7fa621a72000 r--p 3000 00:13 3436727
/usr/lib/libuuid.so.1.3.0
7fa621a72000-7fa621a73000 rw-p 4000 00:13 3436727   

Re: [PATCH V19 05/19] Btrfs: subpage-blocksize: Read tree blocks whose size is < PAGE_SIZE

2016-06-20 Thread Chandan Rajendra
On Monday, June 20, 2016 01:54:05 PM David Sterba wrote:
> On Tue, Jun 14, 2016 at 12:41:02PM +0530, Chandan Rajendra wrote:
> > In the case of subpage-blocksize, this patch makes it possible to read
> > only a single metadata block from the disk instead of all the metadata
> > blocks that map into a page.
> 
> This patch has a conflict with a next pending patch
> 
> "Btrfs: fix eb memory leak due to readpage failure"
> https://patchwork.kernel.org/patch/9153927/
>

I will fix this and also the merge conflict in the patch " [PATCH V19 11/19]
Btrfs: subpage-blocksize: Prevent writes to an extent buffer when PG_writeback
flag is set" and resend the patchset.

-- 
chandan



Re: btrfs: page allocation failure: order:1, mode:0x2204020

2016-06-20 Thread David Sterba
On Sat, Jun 18, 2016 at 08:47:55PM +0200, Hans van Kranenburg wrote:
> Last night, one of my btrfs filesystems went read-only after a memory 
> allocation failure (logging attached).

According to the logs, the allocation itself happens outside of btrfs so
we can't do much here.

More specifically, when creating a new subvolume and requesting an
anonymous block device (via get_anon_bdev), there's a call to request a
free id for it. This could ask for "order-1", i.e. 8kb of contiguous
memory. And it failed.

This depends on memory fragmentation, ie. how the pages are allocated
and freed over time. Tweaking vm.min_free_kbytes could help but it's not
designed to prevent memory allocations in such scenario.

The id range structures themselves do not need more than a 4k page, but
the slab cache tries to provide more objects per slab. I see this on my
box right now:

Excerpt from /proc/slabinfo:

idr_layer_cache      474    486   2096    3    2

where 3 == objperslab and 2 == pages per slab, which corresponds to the
8kb. This seems to depend on internal slab cache logic, and it is nothing
I'd like to go changing right now.

Looking at the IDR structure sizes and possible tweaks, the idr_layer
object size is 2096 on a 64-bit machine and we cannot squeeze it down to
2048 so that it fits the page better.

http://lxr.free-electrons.com/source/include/linux/idr.h#L21

IDR_BITS is 8, which means 256 pointers, 8 bytes each, summing up to
2048 by itself, and we need a few more members.
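
[To make the size arithmetic concrete, a sketch of the structure in question,
paraphrased from include/linux/idr.h (member order and names are approximate):

#define IDR_BITS 8
#define IDR_SIZE (1 << IDR_BITS)                  /* 256 slots per layer   */

struct idr_layer {
        int                     prefix;           /*    4 bytes            */
        int                     layer;            /*    4 bytes            */
        struct idr_layer __rcu  *ary[IDR_SIZE];   /* 2048 bytes: 256 * 8   */
        int                     count;            /*    4 bytes + padding  */
        union {
                DECLARE_BITMAP(bitmap, IDR_SIZE); /*   32 bytes            */
                struct rcu_head rcu_head;
        };
};

8 + 2048 + 8 + 32 = 2096 bytes on 64-bit, i.e. just over half of an order-1
(8kb) slab, which is why the slab cache above packs 3 objects into 2 pages.]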

A permanent change of IDR_BITS to a smaller value would have to be evaluated;
otherwise the 'ary' member of 'idr_layer' could be stored separately for
better alignment.

> I've seen this happen once before somewhere else, also during snapshot 
> creation, also with a 4.5.x kernel.
> 
> There's a bug report at Debian, in which is suggested to increase the 
> value of vm.min_free_kbytes:
> https://bugs.debian.org/cgi-bin/bugreport.cgi?bug=666021

[...]

> [2363000.815554] Node 0 Normal: 2424*4kB (U) 0*8kB 0*16kB 0*32kB 0*64kB 
> 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 9696kB

Just to confirm the fragmentation: there are no free pages of order higher
than 0 (i.e. nothing at 8k and up).

So, technically not a btrfs bug but we could get affected by it badly.


Re: [PATCH V19 05/19] Btrfs: subpage-blocksize: Read tree blocks whose size is < PAGE_SIZE

2016-06-20 Thread David Sterba
On Tue, Jun 14, 2016 at 12:41:02PM +0530, Chandan Rajendra wrote:
> In the case of subpage-blocksize, this patch makes it possible to read
> only a single metadata block from the disk instead of all the metadata
> blocks that map into a page.

This patch has a conflict with a next pending patch

"Btrfs: fix eb memory leak due to readpage failure"
https://patchwork.kernel.org/patch/9153927/

in this hunk:

> @@ -5557,10 +5645,32 @@ int read_extent_buffer_pages(struct extent_io_tree *tree,
>   page = eb_head(eb)->pages[i];
>   if (!PageUptodate(page)) {
>   ClearPageError(page);
> - err = __extent_read_full_page(tree, page,
> -   get_extent, &bio,
> -   mirror_num, &bio_flags,
> -   READ | REQ_META);
> + if (eb->len < PAGE_SIZE) {
> + lock_extent_bits(tree, eb->start, eb->start + eb->len - 1,
> + &cached_state);
> + err = submit_extent_page(READ | REQ_META, tree,
> + NULL, page,
> + eb->start >> 9, eb->len,
> + eb->start - page_offset(page),
> + fs_info->fs_devices->latest_bdev,
> + &bio, -1,
> + end_bio_extent_buffer_readpage,
> + mirror_num, bio_flags,
> + bio_flags, false);
> + } else {
> + lock_extent_bits(tree, page_offset(page),
> + page_offset(page) + PAGE_SIZE - 1,
> + &cached_state);
> + err = submit_extent_page(READ | REQ_META, tree,
> + NULL, page,
> + page_offset(page) >> 9,
> + PAGE_SIZE, 0,
> + fs_info->fs_devices->latest_bdev,
> + &bio, -1,
> + end_bio_extent_buffer_readpage,
> + mirror_num, bio_flags,
> + bio_flags, false);
> + }


Re: [PATCH V19 11/19] Btrfs: subpage-blocksize: Prevent writes to an extent buffer when PG_writeback flag is set

2016-06-20 Thread David Sterba
On Tue, Jun 14, 2016 at 12:41:08PM +0530, Chandan Rajendra wrote:
> In non-subpage-blocksize scenario, BTRFS_HEADER_FLAG_WRITTEN flag
> prevents Btrfs code from writing into an extent buffer whose pages are
> under writeback. This facility isn't sufficient for achieving the same
> in subpage-blocksize scenario, since we have more than one extent buffer
> mapped to a page.
> 
> Hence this patch adds a new flag (i.e. EXTENT_BUFFER_HEAD_WRITEBACK) and
> corresponding code to track the writeback status of the page and to
> prevent writes to any of the extent buffers mapped to the page while
> writeback is going on.
> 
> Signed-off-by: Chandan Rajendra 

This patch has a minor conflict with a patch merged into 4.7-rc4 and I'm
not sure if the resolution is correct.

btrfs: account for non-CoW'd blocks in btrfs_abort_transaction
64c12921e11b3a0c10d088606e328c58e29274d8

introduces trans->dirty, the conflicts are in btrfs_cow_block and
btrfs_search_slot.


On shrinkable caches

2016-06-20 Thread Nikolay Borisov
Hello,

I have a question regarding the SLAB_RECLAIM_ACCOUNT flag with which
the BTRFS caches are created. Currently there isn't a single use of
register_shrinker under fs/btrfs. Apart from the inode cache, which is
being shrunk from the generic super_cache_scan, I don't think the memory
used for those caches should be accounted as reclaimable, should it?
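
[For reference, the pattern in question looks roughly like this; the cache name
and object type below are made up for illustration, only the flags mirror what
the btrfs caches pass:

#include <linux/init.h>
#include <linux/slab.h>
#include <linux/types.h>

struct btrfs_example_object {
        u64 bytenr;
};

static struct kmem_cache *btrfs_example_cachep;

static int __init btrfs_example_init(void)
{
        /* SLAB_RECLAIM_ACCOUNT makes the cache count as reclaimable slab
         * memory, which normally implies a shrinker exists that can free
         * the objects under memory pressure. */
        btrfs_example_cachep = kmem_cache_create("btrfs_example_cache",
                        sizeof(struct btrfs_example_object), 0,
                        SLAB_RECLAIM_ACCOUNT | SLAB_MEM_SPREAD, NULL);
        return btrfs_example_cachep ? 0 : -ENOMEM;
}]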

Regards,
Nikolay


Re: [PATCH v2 08/24] fs: btrfs: Use ktime_get_real_ts for root ctime

2016-06-20 Thread David Sterba
On Sun, Jun 19, 2016 at 05:27:07PM -0700, Deepa Dinamani wrote:
> btrfs_root_item maintains the ctime for root updates.
> This is not part of vfs_inode.
> 
> Since current_time() uses struct inode* as an argument
> as Linus suggested, this cannot be used to update root
> times unless, we modify the signature to use inode.
> 
> Since btrfs uses nanosecond time granularity, it can also
> use ktime_get_real_ts directly to obtain timestamp for
> the root. It is necessary to use the timespec time api
> here because the same btrfs_set_stack_timespec_*() apis
> are used for vfs inode times as well. These can be
> transitioned to using timespec64 when btrfs internally
> changes to use timespec64 as well.
> 
> Signed-off-by: Deepa Dinamani 
> Cc: Chris Mason 
> Cc: Josef Bacik 

Acked-by: David Sterba 


Re: Scrub aborts on newer kernels

2016-06-20 Thread Tyson Whitehead

On 17/06/16 06:18 PM, Chris Murphy wrote:

On Fri, Jun 17, 2016 at 8:45 AM, Tyson Whitehead  wrote:

On May 27, 2016 12:12:54 PM Chris Murphy wrote:

On Thu, May 26, 2016 at 11:55 AM, Tyson Whitehead  wrote:

Under the last several kernels versions (4.6 and I believe 4.4 and, 4.5) btrfs 
scrub aborts before completing.


I can't reproduce this with btrfs-progs 4.5.2 and kernel 4.6.0.

I think the bigger issue is the lack of information why a scrub is aborted.


Thanks for checking into this Chris.

Any advice on how to get some more information out of the scrub process?


Next you need to find out why this one device has so many errors. Put the 
output from smartctl -x /dev/sdb somewhere, maybe even attach it to the bug 
report since it's somewhat related. Either that device is simply unreliable, or 
you've got a bad cable connection.


The device is okay.  The errors were caused by me running it for a period with 
only one of the devices present.

In more detail.  My desktop has a BTRFS RAID 1 setup.  I needed to access it
on the road, so I just shut the desktop down, grabbed one of the drives, and 
used it in my laptop in degraded mode.

When I got back I recombined them (the one I left in my office was never booted 
by itself) and started a scrub.  That is when I discovered scrub aborted on 
newer kernels but completed okay on older ones.

Cheers!  -Tyson


Re: [PATCH 2/2] btrfs: wait for bdev put

2016-06-20 Thread Anand Jain



On 06/19/2016 12:34 AM, Holger Hoffstätte wrote:

On Tue, 14 Jun 2016 18:55:26 +0800, Anand Jain wrote:


Further to the previous commit
 bc178622d40d87e75abc131007342429c9b03351
 btrfs: use rcu_barrier() to wait for bdev puts at unmount

Since free_device() spins off __free_device(), the rcu_barrier() for
  call_rcu(&device->rcu, free_device);
didn't help.

This patch reverts changes by
 bc178622d40d87e75abc131007342429c9b03351
and implements a method to wait on __free_device() by using
a new bdev_closing member in struct btrfs_device.

Signed-off-by: Anand Jain 
[rework: bc178622d40d87e75abc131007342429c9b03351]
---
 fs/btrfs/volumes.c | 44 ++--
 fs/btrfs/volumes.h |  1 +
 2 files changed, 39 insertions(+), 6 deletions(-)

diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
index a4e8d48acd4b..404ce1daebb1 100644
--- a/fs/btrfs/volumes.c
+++ b/fs/btrfs/volumes.c
@@ -27,6 +27,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include "ctree.h"
 #include "extent_map.h"
@@ -254,6 +255,17 @@ static struct btrfs_device *__alloc_device(void)
return dev;
 }

+static int is_device_closing(struct list_head *head)
+{
+   struct btrfs_device *dev;
+
+   list_for_each_entry(dev, head, dev_list) {
+   if (dev->bdev_closing)
+   return 1;
+   }
+   return 0;
+}
+
 static noinline struct btrfs_device *__find_device(struct list_head *head,
   u64 devid, u8 *uuid)
 {
@@ -832,12 +844,22 @@ again:
 static void __free_device(struct work_struct *work)
 {
struct btrfs_device *device;
+   struct btrfs_device *new_device_addr;

device = container_of(work, struct btrfs_device, rcu_work);

if (device->bdev)
blkdev_put(device->bdev, device->mode);

+   /*
+* If we are coming here from btrfs_close_one_device()
+* then it allocates a new device structure for the same
+* devid, so find device again with the devid
+*/
+   new_device_addr = __find_device(&device->fs_devices->devices,
+   device->devid, NULL);
+
+   new_device_addr->bdev_closing = 0;
rcu_string_free(device->name);
kfree(device);
 }
@@ -884,6 +906,12 @@ static void btrfs_close_one_device(struct btrfs_device *device)
list_replace_rcu(&device->dev_list, &new_device->dev_list);
new_device->fs_devices = device->fs_devices;

+   /*
+* So to wait for kworkers to finish all blkdev_puts,
+* so device is really free when umount is done.
+*/
+   new_device->bdev_closing = 1;
+
call_rcu(&device->rcu, free_device);
 }

@@ -912,6 +940,7 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
 {
struct btrfs_fs_devices *seed_devices = NULL;
int ret;
+   int retry_cnt = 5;

mutex_lock(&uuid_mutex);
ret = __btrfs_close_devices(fs_devices);
@@ -927,12 +956,15 @@ int btrfs_close_devices(struct btrfs_fs_devices *fs_devices)
__btrfs_close_devices(fs_devices);
free_fs_devices(fs_devices);
}
-   /*
-* Wait for rcu kworkers under __btrfs_close_devices
-* to finish all blkdev_puts so device is really
-* free when umount is done.
-*/
-   rcu_barrier();
+
+   while (is_device_closing(&fs_devices->devices) &&
+   --retry_cnt) {
+   mdelay(1000); //1 sec
+   }
+
+   if (!(retry_cnt > 0))
+   printk(KERN_WARNING "BTRFS: %pU bdev_put didn't complete, giving up\n",
+   fs_devices->fsid);
return ret;
 }

diff --git a/fs/btrfs/volumes.h b/fs/btrfs/volumes.h
index 0ac90f8d85bd..945e49f5e17d 100644
--- a/fs/btrfs/volumes.h
+++ b/fs/btrfs/volumes.h
@@ -150,6 +150,7 @@ struct btrfs_device {
/* Counter to record the change of device stats */
atomic_t dev_stats_ccnt;
atomic_t dev_stat_values[BTRFS_DEV_STAT_VALUES_MAX];
+   int bdev_closing;
 };

 /*
--
2.7.0


I gave this a try and somehow it seems to make unmounting worse:
it now always takes ~5s (max retry time) and I see the warning every
time. Without the patch unmounting a single volume (disk) is much
faster (1-2s), without problems.


 Thanks Holger, for testing.
 It depends on how long the blkdev_put() takes. Originally the unmount
 thread didn't wait for it to complete, so it was faster, but it had the
 other problem as explained.

Thanks, Anand


Any ideas?




cheers,
Holger


