Re: I need to P. are we almost there yet?

2015-01-03 Thread Bob Marley

On 03/01/2015 14:11, Duncan wrote:

Bob Marley posted on Sat, 03 Jan 2015 12:34:41 +0100 as excerpted:


On 29/12/2014 19:56, sys.syphus wrote:

specifically (P)arity. very specifically n+2. when will raid5 & raid6
be at least as safe to run as raid1 currently is? I don't like the idea
of being 2 bad drives away from total catastrophe.

(and yes i backup, it just wouldn't be fun to go down that route.)

What about using btrfs on top of MD raid?

The problem with that is data integrity.  mdraid doesn't have it.  btrfs
does.

If you present a single mdraid device to btrfs and run single mode on it,
and one copy on the mdraid is corrupt, mdraid may well serve up the corrupt
copy, since it does no integrity checking.  btrfs will catch and reject
that, but because it sees only a single device, it has no second copy to
repair from and can only report the data as corrupt.


Which is really not bad, considering how small the chance of corruption is:
it is already an exceedingly rare event. Detection without correction
can be more than enough. Computing has managed since forever
without even the detection feature.
Most likely your bank account and mine are held in databases which
live on filesystems or block devices that have no corruption
detection feature at all.
And, last but not least, as of now a btrfs bug is more likely than a hard
disk's silent data corruption.
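
To make the trade-off concrete, the two layouts under discussion look
roughly like this (a sketch; device names and the mountpoint are examples):

  # btrfs on top of mdraid: btrfs sees one logical device, so it can
  # detect corruption but has no second copy to heal from
  mdadm --create /dev/md0 --level=1 --raid-devices=2 /dev/sda /dev/sdb
  mkfs.btrfs /dev/md0

  # btrfs-native raid1: detection plus self-healing from the intact mirror
  mkfs.btrfs -d raid1 -m raid1 /dev/sda /dev/sdb
  mount /dev/sda /mnt
  btrfs scrub start /mnt   # rewrites blocks that fail checksum from the good copy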




Re: I need to P. are we almost there yet?

2015-01-03 Thread Bob Marley

On 29/12/2014 19:56, sys.syphus wrote:

specifically (P)arity. very specifically n+2. when will raid5 & raid6
be at least as safe to run as raid1 currently is? I don't like the
idea of being 2 bad drives away from total catastrophe.

(and yes i backup, it just wouldn't be fun to go down that route.)


What about using btrfs on top of MD raid?



Re: device balance times

2014-10-22 Thread Bob Marley

On 22/10/2014 14:40, Piotr Pawłow wrote:

On 22.10.2014 03:43, Chris Murphy wrote:

On Oct 21, 2014, at 4:14 PM, Piotr Pawłow p...@siedziba.pl wrote:
Looks normal to me. Last time I started a balance after adding a 6th
device to my FS, it took 4 days to move 25GB of data.
It's untenable long term. At some point it must be fixed. It's way,
way slower than md raid.
At a certain point it needs to fall back to block-level copying, with
a ~32KB block size. It can't treat things as if they're 1K files,
doing file-level copying that takes forever. It's just too risky that
another device fails in the meantime.


There's device replace for restoring redundancy, which is fast, but 
not implemented yet for RAID5/6.


Device replace on raid 0, 1 and 10 works if the device to be replaced is
still alive; otherwise the operation takes as long as a rebalance and works
similarly (AFAIR).

Which is way too long, given the likelihood of another disk failing.
Additionally, it seeks like crazy during the operation, which also
greatly increases the likelihood of another disk failing.


Until this is fixed I am not confident in using btrfs on a production 
system which requires RAID redundancy.


The operation needs to be streamlined: it should be as sequential as
possible (sort files according to their LBA before reading/writing),
with the fewest possible seeks on every disk, and with large buffers,
so that reads from the source disk(s) and writes to the replacement disk
go at platter speed or close to it.
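
Sketched in userspace terms, the idea might look like this (illustrative
only; it assumes filefrag from e2fsprogs, and a real fix would of course
live inside the kernel's replace/balance code rather than in a script):

  # copy files in on-disk order, using big sequential reads
  find /mnt -type f | while IFS= read -r f; do
      # physical offset of the first extent, as a proxy for on-disk order
      off=$(filefrag -v "$f" | awk '/^ *0:/ { print $4 + 0; exit }')
      printf '%s\t%s\n' "$off" "$f"
  done | sort -n | cut -f2- | while IFS= read -r f; do
      dd if="$f" of="/replacement/${f##*/}" bs=32M status=none
  done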





Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 03:58, Chris Murphy wrote:



* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a 
pace that suggests something could sneak in that makes things worse? It is 
almost an oxymoron in that I'm manually enabling an autorecovery.

If true, maybe the closest indication we'd get of btrfs stability is the default 
enabling of autorecovery.


No way!
I wouldn't want a default like that.

If you think about distributed transactions: suppose a sync was issued on 
both sides of a distributed transaction, then power was lost on one 
side, and then btrfs had corruption. When I remount it, the worst 
thing that could possibly happen is for it to auto-roll-back to a previous 
known-good state, silently undoing a transaction the other side 
believes committed.


Now if I can express wishes:

I would like an option that lists all the usable tree roots (or 
is superblocks the right name?) and not just the newest one, which is 
corrupt. Then another option that lets me mount *readonly* starting 
from a tree root I specify, so I can check how much of the data is 
still there. After I decide that such a tree root is good, I need another 
option that lets me mount with that tree root in readwrite mode, 
obviously discarding all tree roots newer than that.
Some time ago I read that mounting the filesystem with an earlier tree 
root was possible, but only by manually erasing the disk regions holding 
the newer superblocks. That is crazy: it's too risky on too many levels, 
and, as I wrote, I want to check what data is available from a certain 
tree root before mounting readwrite from it.
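
The inspection half of this can at least be attempted with the tools of
this era (a sketch; the device, mountpoint and tree root bytenr are example
values, and exact flags vary between versions):

  # list candidate tree roots that can still be found on the device
  btrfs-find-root /dev/sdx

  # extract files read-only, starting from a chosen tree root bytenr
  btrfs restore -t 29360128 /dev/sdx /mnt/rescued

  # or let the kernel try the backup roots, read-only
  mount -o ro,recovery /dev/sdx /mnt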





Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 12:59, Roman Mamedov wrote:

On Fri, 10 Oct 2014 12:53:38 +0200
Bob Marley bobmar...@shiftmail.org wrote:


On 10/10/2014 03:58, Chris Murphy wrote:

* mount -o recovery
Enable autorecovery attempts if a bad tree root is found at mount 
time.

I'm confused why it's not the default yet. Maybe it's continuing to evolve at a 
pace that suggests something could sneak in that makes things worse? It is 
almost an oxymoron in that I'm manually enabling an autorecovery.

If true, maybe the closest indication we'd get of btrfs stability is the default 
enabling of autorecovery.

No way!
I wouldn't want a default like that.

If you think about distributed transactions: suppose a sync was issued on
both sides of a distributed transaction, then power was lost on one
side

What distributed transactions? Btrfs is not a clustered filesystem[1], it does
not support and likely will never support being mounted from multiple hosts at
the same time.

[1]http://en.wikipedia.org/wiki/Clustered_file_system



Being clustered is not the only way to take part in a distributed
transaction. Databases can be hosted on the filesystem, and those can do
distributed transactions.
Think of two bank accounts: one in a database on btrfs fs1 here, the other 
in a database on whatever filesystem in another country. You want to debit 
one account and credit the other: once both sides have acknowledged the 
commit, the filesystems at the two ends *must not roll back their state*!! 
(especially not transparently, without human intervention)




Re: What is the vision for btrfs fs repair?

2014-10-10 Thread Bob Marley

On 10/10/2014 16:37, Chris Murphy wrote:

The fail safe behavior is to treat the known good tree root as the default tree 
root, and bypass the bad tree root if it cannot be repaired, so that the volume 
can be mounted with default mount options (i.e. the ones in fstab). Otherwise 
it's a filesystem that isn't well suited for general purpose use as rootfs let 
alone for boot.



A filesystem which is suited for general purpose use is a filesystem 
which honors fsync, and doesn't *ever* auto-roll-back without user 
intervention.


Anything different is not suited for database transactions at all. Any 
paid service keeping its user database on btrfs would be at risk of 
losing payments, probably without the company even knowing. 
If btrfs goes this way, I hope a big warning is put on the wiki and 
in the manpages stating that this filesystem is totally unsuitable for 
hosting databases that perform transactions.


At most I can suggest adding a flag to the metadata to allow/disallow 
auto-rollback-on-error on a given filesystem, so people can choose between 
the tolerant and the transaction-safe mode at filesystem creation time.
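
To make the proposal concrete, the knob might look something like this
(entirely hypothetical option names, sketched for illustration; no such
flag exists today):

  # hypothetical: pick the policy at mkfs time, record it in the superblock
  mkfs.btrfs --rollback-policy=strict /dev/sdx     # never auto-roll-back; fail the mount instead
  mkfs.btrfs --rollback-policy=tolerant /dev/sdx   # permit mount-time autorecovery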




Performance reduces with nodatasum

2014-10-04 Thread Bob Marley

Hello,
apparently I have found an issue with btrfs: performance reduces with 
nodatasum and multi-device raid0 or single.


I was testing with a series of 8 LIO ramdisks, with btrfs on those in 
multi-device single mode, and writing zeroes on the filesystem with 16 
dd in parallel.
Performance decreases significantly if the filesystem is mounted with 
nodatasum, or with nodatacow which implies nodatasum.

CPU utilization also drops along with throughput, as seen with htop.
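
For reference, the test can be condensed to roughly this (a sketch; device
names and sizes are examples, and the LIO ramdisk setup is omitted):

  mkfs.btrfs -d single -m single /dev/sdb /dev/sdc /dev/sdd /dev/sde
  mount -o nodatasum /dev/sdb /mnt
  for i in $(seq 16); do
      dd if=/dev/zero of=/mnt/zero.$i bs=1M count=4096 &   # 16 parallel writers
  done
  wait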

At first I thought it was my problem, but then I saw this web page
http://www.linux-mag.com/id/7308/3/
which also reports reduced performance with nodatasum and multi-device 
raid0 or single, e.g. see these two lines:


Btrfs, two disks, single, standard:              50.144  50.264  126.984  131.130

Btrfs, two disks, single, nodatacow+nodatasum:   43.834  47.603  131.612  131.470

similarly with raid0

even more with compression:

Btrfs, two disks, raid0, -o compress:                        70.234  69.048  130.852  129.928

Btrfs, two disks, raid0, -o compress, nodatacow+nodatasum:   48.762  48.831  130.812  130.202


If you go higher in performance, such as with ramdisks in the 
GB/sec range, the reduction is even bigger: I have seen up to a 50% drop.


It would be important to fix this problem for high-performance usages of 
btrfs.


Best regards
BM



Re: Performance reduces with nodatasum

2014-10-04 Thread Bob Marley

On 04/10/2014 12:26, Bob Marley wrote:

Hello,
apparently I have found an issue with btrfs


Sorry, I forgot to mention the kernel version: 3.14.19;
not tested with later versions.



Re: Performance reduces with nodatasum

2014-10-04 Thread Bob Marley

On 04/10/2014 12:36, Bob Marley wrote:

On 04/10/2014 12:26, Bob Marley wrote:

Hello,
apparently I have found an issue with btrfs


Sorry, I forgot to mention the kernel version: 3.14.19;
not tested with later versions.


I just noticed that the page I linked, which also reports the problem,
http://www.linux-mag.com/id/7308/3/
is dated April 21st, 2009, with kernel version 2.6.30-rc1,
so this problem is not a recent regression but has probably been there 
from the beginning.
It is therefore likely present in the latest 3.17-rc7 as well, even if 
I can't check right now.

Best regards
BM


Re: 1 week to rebuid 4x 3TB raid10 is a long time!

2014-07-20 Thread Bob Marley

On 20/07/2014 10:45, TM wrote:

Hi,

I have a raid10 with 4x 3TB disks on a microserver
http://n40l.wikia.com/wiki/Base_Hardware_N54L , 8GB RAM

Recently one disk started to fail (smart errors), so I replaced it
Mounted as degraded, added new disk, removed old
Started yesterday
I am monitoring /var/log/messages and it seems it will take a long time
Started at about 8010631739392
And 20 hours later I am at 6910631739392
btrfs: relocating block group 6910631739392 flags 65

At this rate it will take a week to complete the raid rebuild!!!

Furthermore it seems that the operation is getting slower and slower
When the rebuild started I had a new message every half a minute; now it's
up to one and a half minutes.
Most files are small files like flac/jpeg



Hi TM, are you doing other significant filesystem activity during this 
rebuild, especially random accesses?

This can reduce performance a lot on HDDs.
E.g. if you were doing strenuous multithreaded random writes in the 
meantime, I would expect even less than 5MB/sec overall...
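
Incidentally, the numbers quoted above already give a rough rate and ETA
(back-of-the-envelope, assuming relocation walks the logical address space
down linearly and the remaining block groups behave similarly):

  # (8010631739392 - 6910631739392) bytes covered in 20 hours:
  echo $(( (8010631739392 - 6910631739392) / (20 * 3600) ))   # ~15277777 B/s, i.e. ~15 MB/s

  # remaining logical span at that rate, in days:
  echo $(( 6910631739392 / 15277777 / 86400 ))                # ~5 more days

which matches the week-long estimate.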




Re: Especially broken btrfs

2014-03-31 Thread Bob Marley

Hi, I hadn't noticed this post.
I think I know the reason this time: you used USB, you bad guy!
I think USB does not support flush/barriers, which are mandatory for 
btrfs to work correctly in case of power loss.
That goes for most filesystems actually, but the damage suffered by CoW 
filesystems such as btrfs is much more severe than for in-place-update 
filesystems such as ext4.


Please check if when you connect the USB drive you see in dmesg 
something like:


[ ... ] sd ...:0:0:0: [sdf] Write cache: ..., read cache: ..., 
doesn't support DPO or FUA

Regards
BM


On 21/03/2014 04:21, sepero...@gmx.com wrote:
Hello all. I submit bugs to different foss projects regularly, but I 
don't really have a bug report this time. I have a broken filesystem 
to report. And I have no idea how to reproduce it.


I am including a link to the filesystem itself, because it appears to 
be unrepairable and unrestorable. I have no personal information on 
the disk image. The filesystem is almost 512MB uncompressed. I was 
using it on an old usb drive with 512MB size limitation. I only used 
(abused?) it about 2 days before this corruption.


My goal was to use the usb as a bootable rescue system. I decided to 
try Btrfs instead of Ext4, because it supports filesystem compression.


BTRFS IMAGE LINK (please pardon my file hosting service)
http://www.mediafire.com/download/gdaydt3mz8uwtmm/sdb1.btrfs.xz


These are some things that may have helped to cause the corruption.

+Created btrfs with -M flag
+Installed Debian testing/unstable
+When mounting, I always used at least these options: 
ssd_spread,noatime,compression=zlib,autodefrag

+Occasionally force powering off computer.
+While booted into usb system, I was constantly running out of space 
while trying to install new packages.


It is my hope that this image might be used to improve the btrfs 
restore and btrfsck tools. Please let me know if I can provide any 
further information. Big thanks to everyone helping to further 
development of Btrfs.


Sepero






Re: btrfs and ECC RAM

2014-01-20 Thread Bob Marley

On 20/01/2014 15:57, Ian Hinder wrote:
i.e. that there is parity information stored with every piece of data, 
and ZFS will correct errors automatically from the parity information. 


So this is not just parity data to check correctness: there are 
additional bits to actually correct these errors, based on an 
algorithm like Reed-Solomon?


Where can I find information on how much parity is stored in ZFS?

I start to suspect that there is confusion here between checksumming 
for data integrity and parity information. If this is really how ZFS 
works, then if memory corruption interferes with this process, then I 
can see how a scrub could be devastating. 


I don't see how. If you have additional bits to correct errors (beyond 
merely detecting them), this will never be worse than having fewer of them.
No algorithm I know of behaves any worse when the erroneous bits 
land in the checksum part, or when the algorithm is detect+correct instead 
of detect-only.
If the algorithm stores X+2Y extra bits (the supposed ZFS case) in order to 
detect & correct Y erroneous bits and detect an additional X erroneous bits, 
this will not be worse than having just X checksum bits (the btrfs case).
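
(For reference, in classical coding terms: r redundant symbols can be spent
either on detecting up to r symbol errors or on correcting up to floor(r/2)
of them, so dedicating 2Y extra bits to correcting Y errors is never weaker
than dedicating them to detection alone.)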


So does ZFS really use detect & correct parity? I'd expect that to be 
quite computationally expensive.


I don't know if ZFS really works like this. It sounds very odd to do 
this without an additional checksum check. This sounds very different 
to what you say below that btrfs does, which is only to check against 
redundantly-stored copies, which I agree sounds much safer. The second 
link above from the ZFS FAQ just says that if you place a very high 
value on data integrity, you should be using ECC memory anyway, which 
I'm sure we can all agree with. 
http://zfsonlinux.org/faq.html#DoIHaveToUseECCMemory :

1.16 Do I have to use ECC memory for ZFS?
Using ECC memory for ZFS is strongly recommended for enterprise environments 
where the strongest data integrity guarantees are required. Without ECC memory 
rare random bit flips caused by cosmic rays or by faulty memory can go 
undetected. If this were to occur ZFS (or any other filesystem) will write the 
damaged data to disk and be unable to automatically detect the corruption.


The above sentence IMHO means that the data can get corrupted just before 
its first write.
Without ECC this obviously applies to every filesystem on earth, 
especially if the corruption happens before the checksum is computed.


BM



Re: [PATCH] Btrfs: fix race condition between writting and scrubing supers

2013-10-20 Thread Bob Marley

On 19/10/2013 16:03, Stefan Behrens wrote:

On 10/19/2013 12:32, Shilong Wang wrote:

 Yeah, it does not hurt, but it may output a checksum mismatch. For 
example:
 writing a 4K superblock is not totally finished, but we are trying to 
scrub it.


Have you ever seen this issue?



...

If this is really an issue and these 4K disk writes and reads 
interfere, let's find a better solution please.





Why not scrub optimistically, as is done now, and then only in case of a 
checksum mismatch re-scrub in transaction context?




Re: raid6: rmw writes all the time?

2013-05-23 Thread Bob Marley

On 23/05/2013 15:22, Bernd Schubert wrote:


Yeah, I know and I'm using iostat already. md raid6 does not do rmw, 
but does not fill the device queue, afaik it flushes the underlying 
devices quickly as it does not have barrier support - that is another 
topic, but was the reason why I started to test btrfs.


MD raid6 DOES have barrier support!



Re: BTRFS, getting darn slower everyday

2012-12-09 Thread Bob Marley

On 12/09/12 12:38, Hugo Mills wrote:

On Sun, Dec 09, 2012 at 12:20:46PM +0100, Swâmi Petaramesh wrote:

Le 09/12/2012 11:41, Roman Mamedov a écrit :

CoW filesystem incurs fragmentation by its nature, not specifically snapshots.
Even without snapshots, rewriting portions of existing files will write the
new blocks not over the original ones, but elsewhere, thus increasing
fragmentation.

Is it to expect that somewhere in the future, BTRFS will be able to
defragment itself without duplicating snapshot data ?

In the presence of snapshots that are modified, no, it's impossible
to fully defrag all the files.


Of course, but I agree with the poster that it would be important 
to partially defragment all the files, avoiding at least the unneeded 
fragmentation...

At least the fragmentation generated by normal writes.


High-sensitivity fs checker (not repairer) for btrfs

2012-11-10 Thread Bob Marley

Hello all,
I would like to know if there exists a tool to check a btrfs 
filesystem very thoroughly.

It's OK if it needs the FS unmounted to operate; mounted is also OK.
It does not need repair capability.
It needs very good checking capability: it has to return a Good / Bad 
status, where Bad means there is at least ONE inconsistency and 
Good means the filesystem is really, truly 100% consistent.


Does something like this exist?

We need to detect, as far ahead of time as possible, whether the btrfs 
filesystem has become even just a little bit inconsistent.


Thank you


Re: High-sensitivity fs checker (not repairer) for btrfs

2012-11-10 Thread Bob Marley

On 11/10/12 22:23, Hugo Mills wrote:

The closest thing is btrfsck. That's about as picky as we've got to
date.

What exactly is your use-case for this requirement?


We need a decently-available system. We can roll back the filesystem to 
last-known-good if the test detects an inconsistency in the current btrfs 
filesystem, but we need a very good test for that (i.e. if 
last-known-good is actually bad, we get into serious trouble).


So do you think btrfsck can return a false OK result? Can it miss 
an inconsistency?
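
A minimal sketch of the gate we have in mind (assuming btrfsck's exit
status reflects the verdict; the rollback step is our own tooling, named
hypothetically here):

  if btrfsck /dev/sdx; then
      mount /dev/sdx /mnt               # current state considered good
  else
      rollback-to-last-known-good.sh    # hypothetical: restore the previous snapshot
  fi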


Thank you


Systemcall for offline deduplication

2012-10-15 Thread Bob Marley

Hello all btrfs developers

I would really appreciate a system call (or ioctl or the like) to allow 
deduplication of a block of a file against a block of another file.

(It's OK if the blocks need to be aligned to filesystem blocks.)

So that if I know that bytes 32768...65536 of FileA are identical to 
bytes 131072...163840 of FileB, I can call that syscall to have the two 
regions deduplicated against each other atomically, with the 
filesystem mounted and running.
The syscall should presumably check that the regions are really equal 
and perform the deduplication atomically.
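
An interface along these lines did eventually materialize as btrfs's
extent-same ioctl (BTRFS_IOC_FILE_EXTENT_SAME, later generalized as
FIDEDUPERANGE); for illustration, the request above could be expressed
through xfs_io's dedupe command, roughly like this (a sketch; xfs_io
availability and kernel support are assumptions):

  # dedupe FileB's range [131072, +32768) against FileA's [32768, +32768);
  # the kernel verifies the bytes are identical before sharing the extent
  xfs_io -c "dedupe FileA 32768 131072 32768" FileB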


This would be the start for a lot of deduplication algorithms in userspace.
It would be a killer feature for backup systems.

Thank you,
Bob