[gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread James
Bill Kenworthy billk at iinet.net.au writes:


  The main thing keeping me away from CephFS is that it has no mechanism
  for resolving silent corruption.  Btrfs underneath it would obviously
  help, though not for failure modes that involve CephFS itself.  I'd
  feel a lot better if CephFS had some way of determining which copy was
  the right one other than the master server always wins.


Giant (v0.87) is a major release with many fixes;
it may have the features you need. The ongoing development releases are
currently up to v0.91. What I have read looks promising, but I agree it
needs to be tested with non-critical data.

http://ceph.com/docs/master/release-notes/#v0-87-giant

http://ceph.com/docs/master/release-notes/#notable-changes


 Forget ceph on btrfs for the moment - the COW kills it stone dead after
 real use.  When running a small handful of VMs on a raid1 with ceph -
 slow :)

I'm staying away from VMs. It's spark on top of mesos I'm after. Maybe
docker or another container solution, down the road.

I have read that some people use an SSD with raid 1 and bcache to improve
performance and stability a bit. I do not want to add an SSD to the mix right
now, as the (3) node development systems all have 32 GB of RAM.



 You can turn off COW and go single on btrfs to speed it up but bugs in
 ceph and btrfs lose data real fast!

Interesting idea, since I'll have raid1 underneath each node. I'll need to
dig into this idea a bit more.


 ceph itself (my last setup trashed itself 6 months ago and I've given
 up!) will only work under real use/heavy loads with lots of discrete
 systems, ideally 10G network, and small disks to spread the failure
 domain.  Using 3 hosts and 2x2g disks per host wasn't near big enough :(
  Its design means that small scale trials just won't work.

Huh. My systems are FX8350 (8-core) processors running at 4 GHz with 32 GB of RAM.
Water coolers will allow me to crank up the speed (when/if needed) to
5 or 6 GHz. Not Intel, but not low end either.


 It's not designed for small scale/low end hardware, no matter how
 attractive the idea is :(

Supposedly there are tools to measure and monitor ceph better now. That is
one of the things I need to research: how to manage the small cluster
better and back off the throughput/load while monitoring performance
on a variety of different tasks. This is definitely not production usage.

I certainly appreciate your ceph experiences. I filed a bug with a
version bump request for Giant v0.87. Did you run that version?
What versions did you experiment with?

I hope to set up Ansible to facilitate rapid installation of a variety
of gentoo systems used for cluster or ceph testing. That way configurations
should be able to reboot after bad failures.  Did the failures you
experienced with Ceph require the gentoo-btrfs based systems to be completely
reinstalled from scratch, or did you just purge Ceph from the disk and reconfigure it?

I'm hoping to configure ceph in such a way that failures do not corrupt
the gentoo-btrfs installation and only require repair to ceph; so your
comments on that strategy are most welcome.




 BillK


James


 







Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Rich Freeman
On Tue, Jan 20, 2015 at 10:07 AM, James wirel...@tampabay.rr.com wrote:
 Bill Kenworthy billk at iinet.net.au writes:

 You can turn off COW and go single on btrfs to speed it up but bugs in
 ceph and btrfs lose data real fast!

 Interesting idea, since I'll have raid1 underneath each node. I'll need to
 dig into this idea a bit more.


So, btrfs and ceph solve an overlapping set of problems in an
overlapping set of ways.  In general adding data security often comes
at the cost of performance, and obviously adding it at multiple layers
can come at the cost of additional performance.  I think the right
solution is going to depend on the circumstances.

If ceph provided that protection against bitrot I'd probably avoid a
COW filesystem entirely.  It isn't going to add any additional value,
and they do have a performance cost.  If I had mirroring at the ceph
level I'd probably just run them on ext4 on lvm with no
mdadm/btrfs/whatever below that.  Availability is already ensured by
ceph - if you lose a drive then other nodes will pick up the load.  If
I didn't have robust mirroring at the ceph level then having mirroring
of some kind at the individual node level would improve availability.

On the other hand, ceph currently has some gaps, so having it on top
of zfs/btrfs could provide protection against bitrot.  However, right
now there is no way to turn off COW while leaving checksumming
enabled.  It would be nice if you could leave the checksumming on.
Then if there was bitrot btrfs would just return an error when you
tried to read the file, and then ceph would handle it like any other
disk error and use a mirrored copy on another node.  The problem with
ceph+ext4 is that if there is bitrot neither layer will detect it.
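
The behaviour described above (a checksumming layer returns an error on a corrupt read, and the replication layer then falls back to another copy) can be sketched in a few lines of Python. This is only a toy model, not btrfs or ceph code: `store`, `read_checked`, `read_unchecked` and `replicated_read` are hypothetical names, and `zlib.crc32` stands in for whatever checksum the filesystem really uses.

```python
import zlib


class ChecksumError(IOError):
    """Raised when a stored block no longer matches its checksum."""


def store(data: bytes) -> dict:
    # Keep a CRC32 next to the data, as a checksumming filesystem would.
    return {"data": bytearray(data), "crc": zlib.crc32(data)}


def read_checked(block: dict) -> bytes:
    # Checksumming layer: refuse to return data that fails verification.
    if zlib.crc32(bytes(block["data"])) != block["crc"]:
        raise ChecksumError("checksum mismatch")
    return bytes(block["data"])


def read_unchecked(block: dict) -> bytes:
    # Non-checksumming layer: hand back whatever is on disk.
    return bytes(block["data"])


def replicated_read(replicas, read_fn):
    # Hypothetical replication layer: try each copy in turn and
    # skip the ones whose read reports an error.
    for block in replicas:
        try:
            return read_fn(block)
        except ChecksumError:
            continue
    raise IOError("no readable replica")


replicas = [store(b"hello world"), store(b"hello world")]
replicas[0]["data"][0] ^= 0x01  # simulate bitrot: flip one bit in the first copy

good = replicated_read(replicas, read_checked)   # falls back to the second copy
bad = replicated_read(replicas, read_unchecked)  # silently returns the corrupt copy
```

With checksums the corrupt first copy is skipped; without them the replication layer has no error to react to, which is the ceph+ext4 problem.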

Does btrfs+ceph really have a performance hit that is larger than
btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
Btrfs in general performs fairly poorly right now - that is expected
to improve in the future, but I doubt that it will ever outperform
ext4 other than for specific operations that benefit from it (like
reflink copies).  It will always be faster to just overwrite one block
in the middle of a file than to write the block out to unallocated
space and update all the metadata.
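
As a rough illustration of that last point, here is a toy write-count model in Python. It assumes the simplest possible case: one data block changes, and under COW every metadata node on the path from the leaf to the root is rewritten. The function names and the `tree_depth` parameter are made up for the example, and real filesystems batch metadata updates per transaction, so the amortized cost is lower than this suggests.

```python
def blocks_written_in_place() -> int:
    # Overwrite in place: only the data block itself is rewritten.
    return 1


def blocks_written_cow(tree_depth: int) -> int:
    # Copy-on-write: the new data block goes to free space, and each
    # metadata node between it and the root is rewritten to point at
    # the new location.
    return 1 + tree_depth


for depth in (1, 2, 3):
    print(depth, blocks_written_in_place(), blocks_written_cow(depth))
```

Even in this simplified model the COW path always writes at least one block more than the in-place path.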

-- 
Rich



[gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread James
Rich Freeman rich0 at gentoo.org writes:


  You can turn off COW and go single on btrfs to speed it up but bugs in
  ceph and btrfs lose data real fast!

 So, btrfs and ceph solve an overlapping set of problems in an
 overlapping set of ways.  In general adding data security often comes
 at the cost of performance, and obviously adding it at multiple layers
 can come at the cost of additional performance.  I think the right
 solution is going to depend on the circumstances.

Raid 1 with btrfs can protect not only the ceph fs files but also the gentoo
node installation itself.  I'm not so worried about performance, because
my main (end result) goal is to throttle codes so they run almost
exclusively in RAM (in memory), as designed by AMPLab. Spark plus Tachyon is a
work in progress, for sure.  The DFS will be used in lieu of HDFS for
distributed/cluster types of apps, hence ceph.  Btrfs + raid 1 is a
failsafe for the node installations, but also for all data. I only intend
to write out data once a job/run is finished; granted, that is very
experimental right now and will evolve over time.


 
 If ceph provided that protection against bitrot I'd probably avoid a
 COW filesystem entirely.  It isn't going to add any additional value,
 and they do have a performance cost.  If I had mirroring at the ceph
 level I'd probably just run them on ext4 on lvm with no
 mdadm/btrfs/whatever below that.  Availability is already ensured by
 ceph - if you lose a drive then other nodes will pick up the load.  If
 I didn't have robust mirroring at the ceph level then having mirroring
 of some kind at the individual node level would improve availability.

I've read that btrfs and ceph are a very suitable, yet very immature,
match for local-distributed file system needs.


 On the other hand, ceph currently has some gaps, so having it on top
 of zfs/btrfs could provide protection against bitrot.  However, right
 now there is no way to turn off COW while leaving checksumming
 enabled.  It would be nice if you could leave the checksumming on.
 Then if there was bitrot btrfs would just return an error when you
 tried to read the file, and then ceph would handle it like any other
 disk error and use a mirrored copy on another node.  The problem with
 ceph+ext4 is that if there is bitrot neither layer will detect it.

Good points, hence a flexible configuration where ceph can be reconfigured
and recovered as warranted, for this long term set of experiments.

 Does btrfs+ceph really have a performance hit that is larger than
 btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
 Btrfs in general performs fairly poorly right now - that is expected
 to improve in the future, but I doubt that it will ever outperform
 ext4 other than for specific operations that benefit from it (like
 reflink copies).  It will always be faster to just overwrite one block
 in the middle of a file than to write the block out to unallocated
 space and update all the metadata.

I fully expect the combination of btrfs+ceph to mature and become
competitive. It's not critical data, but a long term experiment. Surely
critical data will be backed up off the 3-node cluster. I hope to use
ansible to enable recovery, configuration changes and bringing on and
managing additional nodes; this is just a concept at the moment, but from
googling around it does seem to be a popular idea.

As always, your insight and advice are warmly received.


James


 







Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Rich Freeman
On Tue, Jan 20, 2015 at 12:27 PM, James wirel...@tampabay.rr.com wrote:

 Raid 1 with btrfs can not only protect the ceph fs files but the gentoo
 node installation itself.

Agree 100%.  Like I said, the right solution depends on your situation.

If you're using the server doing ceph storage only for file serving,
then protecting the OS installation isn't very important.  Heck, you
could just run the OS off of a USB stick.

If you're running nodes that do a combination of application and
storage, then obviously you need to worry about both, which probably
means not relying on ceph as your sole source of protection.  That
applies to a lot of kitchen sink setups where hosts don't have a
single role.

--
Rich



Re: [gentoo-user] Re: btrfs fails to balance

2015-01-20 Thread Bill Kenworthy
On 21/01/15 00:03, Rich Freeman wrote:
 On Tue, Jan 20, 2015 at 10:07 AM, James wirel...@tampabay.rr.com wrote:
 Bill Kenworthy billk at iinet.net.au writes:

 You can turn off COW and go single on btrfs to speed it up but bugs in
 ceph and btrfs lose data real fast!

 Interesting idea, since I'll have raid1 underneath each node. I'll need to
 dig into this idea a bit more.

 
 So, btrfs and ceph solve an overlapping set of problems in an
 overlapping set of ways.  In general adding data security often comes
 at the cost of performance, and obviously adding it at multiple layers
 can come at the cost of additional performance.  I think the right
 solution is going to depend on the circumstances.
 
 If ceph provided that protection against bitrot I'd probably avoid a
 COW filesystem entirely.  It isn't going to add any additional value,
 and they do have a performance cost.  If I had mirroring at the ceph
 level I'd probably just run them on ext4 on lvm with no
 mdadm/btrfs/whatever below that.  Availability is already ensured by
 ceph - if you lose a drive then other nodes will pick up the load.  If
 I didn't have robust mirroring at the ceph level then having mirroring
 of some kind at the individual node level would improve availability.
 
 On the other hand, ceph currently has some gaps, so having it on top
 of zfs/btrfs could provide protection against bitrot.  However, right
 now there is no way to turn off COW while leaving checksumming
 enabled.  It would be nice if you could leave the checksumming on.
 Then if there was bitrot btrfs would just return an error when you
 tried to read the file, and then ceph would handle it like any other
 disk error and use a mirrored copy on another node.  The problem with
 ceph+ext4 is that if there is bitrot neither layer will detect it.
 
 Does btrfs+ceph really have a performance hit that is larger than
 btrfs without ceph?  I fully expect it to be slower than ext4+ceph.
 Btrfs in general performs fairly poorly right now - that is expected
 to improve in the future, but I doubt that it will ever outperform
 ext4 other than for specific operations that benefit from it (like
 reflink copies).  It will always be faster to just overwrite one block
 in the middle of a file than to write the block out to unallocated
 space and update all the metadata.
 

answer to both you and James here:

I think it was pre 8.0 when I dropped out.  It's Ceph that suffers from
bitrot - I use the golden master approach to generating the VMs, so
corruption was obvious.  I did report one bug in the early days that
turned out to be btrfs, but I think it was largely ceph. That has been
borne out by consolidating the ceph trial hardware and using it with
btrfs alone on the same storage - problems are rare, and I can point to
hardware/power when they happen.

The performance hit was not due to lack of horsepower (cpu, ram etc) but
due to I/O - both network bandwidth and the internal bus on the hosts.  That
is why a small number of systems, no matter how powerful, won't work well.
For real performance, I saw people using SSDs and large numbers of
hosts in order to distribute the data flows - this does work and I saw
some insane numbers posted.  It also requires multiple networks
(internal and external) to separate the flows (not VLANs but dedicated
pipes) due to the extreme burstiness of the traffic.  As well as VM
images, I had backups (using dirvish) and thousands of security camera
images.  Deleting a directory with a lot of files would take many
hours.  It was the same with using ceph for a mail store (this came up on
the ceph list under "why is it so slow") - as a chunk server it's just not
suitable for lots of small files.

Towards the end of my use, I went from seeing bitrot on a system that held
data but was idle to seeing it only during heavy use.  My overall
conclusion is that lots of small hosts with no more than a couple of drives
each, and multiple networks with lots of bandwidth, is what it's designed for.

I had two reasons for looking at ceph - distributed storage where data
in use was held close to the user but could be redistributed easily
with multiple copies (think two small data stores with an intermittent
WAN link storing high and low priority data) and high performance with
high availability on HW failure.

Ceph was not the answer for me with the scale I have.

BillK




Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Bill Kenworthy
On 20/01/15 05:10, Rich Freeman wrote:
 On Mon, Jan 19, 2015 at 11:50 AM, James wirel...@tampabay.rr.com wrote:
 Bill Kenworthy billk at iinet.net.au writes:

 I was wondering what my /etc/fstab should look like using uuids, raid 1 and
 btrfs.
 
 From mine:
 /dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342  /
  btrfs   noatime,ssd,compress=none
 /dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959  /data
  btrfs   noatime,compress=none
 
 The first is a single disk, the second is 5-drive raid1.
 
 I disabled compression due to some bugs a few kernels ago.  I need to
 look into whether those were fixed - normally I'd use lzo.
 
 I use dracut - obviously you need to use some care when running root
 on a disk identified by uuid since this isn't a kernel feature.  With
 btrfs as long as you identify one device in an array it will find the
 rest.  They all have the same UUID though.
 
 Probably also worth noting that if you try to run btrfs on top of lvm
 and then create an lvm snapshot btrfs can cause spectacular breakage
 when it sees two devices whose metadata identify them as being the
 same - I don't know where it went but there was talk of trying to use
 a generation id/etc to keep track of which ones are old vs recent in
 this scenario.
 

 Eventually, I want to run CephFS on several of these raid one btrfs
 systems for some clustering code experiments. I'm not sure how that
 will affect, if at all, the raid 1-btrfs-uuid setup.

 
 Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.
 
 The main thing keeping me away from CephFS is that it has no mechanism
 for resolving silent corruption.  Btrfs underneath it would obviously
 help, though not for failure modes that involve CephFS itself.  I'd
 feel a lot better if CephFS had some way of determining which copy was
 the right one other than the master server always wins.
 

Forget ceph on btrfs for the moment - the COW kills it stone dead after
real use.  When running a small handful of VMs on a raid1 with ceph -
slow :)

You can turn off COW and go single on btrfs to speed it up but bugs in
ceph and btrfs lose data real fast!

ceph itself (my last setup trashed itself 6 months ago and I've given
up!) will only work under real use/heavy loads with lots of discrete
systems, ideally a 10G network, and small disks to spread the failure
domain.  Using 3 hosts and 2x2g disks per host wasn't nearly big enough :(
Its design means that small scale trials just won't work.

It's not designed for small scale/low end hardware, no matter how
attractive the idea is :(

BillK








Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Bill Kenworthy
On 20/01/15 00:50, James wrote:
 Bill Kenworthy billk at iinet.net.au writes:
 
 
 Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:
 
 Can someone suggest what is causing a balance on this raid 1
 
 Interesting.
 I am about to test (reboot) a btrfs, raid one installation.
 
 Brilliant, you have hit on the answer! - The ancient 300GB system disk
 was sda at one point and moved to sdb - possibly at the time I changed
 to using UUID's.  I've just resized all the disks and it's now moved past
 300G for the first time as well as the other two falling in step with
 the data moving.
 
 I was wondering what my /etc/fstab should look like using uuids, raid 1 and
 btrfs.
 
 Could you post your /etc/fstab and any other modifications you made to
 your installation related to the btrfs, raid 1 uuid setup?
 
 I'm just using (2) identical 2T disks for my new gentoo workstation.
 
 I moved to UUID's as the machine has a number of sata ports and a PCI-e
 sata adaptor and the sd* drive numbering kept moving around when I added
 the WD red.
 
 
 Eventually, I want to run CephFS on several of these raid one btrfs
 systems for some clustering code experiments. I'm not sure how that
 will affect, if at all, the raid 1-btrfs-uuid setup.
 
 
 TIA,
 James
 
 
 
 

Sorry about the line wrap:

rattus backups # lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINT
sda      8:0    0   1.8T  0 disk
sdb      8:16   0 279.5G  0 disk
├─sdb1   8:17   0   100M  0 part
├─sdb2   8:18   0     8G  0 part [SWAP]
└─sdb3   8:19   0 271.4G  0 part /
sdc      8:32   0   1.8T  0 disk /mnt/vm
sdd      8:48   0   1.8T  0 disk
sde      8:64   0   1.8T  0 disk
rattus backups #

rattus backups # blkid
/dev/sda: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1
UUID_SUB=9003b772-3487-447a-9794-50cf9880a9c0 TYPE=btrfs PTTYPE=dos
/dev/sdc: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1
UUID_SUB=20523d9d-3d90-439e-ad68-62def0824198 TYPE=btrfs
/dev/sdb1: UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40 TYPE=ext2
/dev/sdb2: UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9 TYPE=swap
/dev/sdb3: UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515
UUID_SUB=72134593-8c9f-436f-98ce-fbb07facbf35 TYPE=btrfs
/dev/sdd: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1
UUID_SUB=2ca026f7-e5c9-4ece-bba1-809ddb03979b TYPE=btrfs
rattus backups #


rattus backups # cat /etc/fstab

UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40   /boot
ext2    noauto,noatime
1 2
UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515   /
btrfs
defaults,noatime,compress=lzo,space_cache   0 0
UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9   none
swap    sw
0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1   /mnt/btrfs-root
btrfs
defaults,noatime,compress=lzo,space_cache   0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1   /home/wdk
btrfs
defaults,noatime,compress=lzo,space_cache,subvolid=258  0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1   /mnt/backups
btrfs
defaults,noatime,compress=lzo,space_cache,subvolid=365  0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1   /mnt/vm
btrfs
defaults,noatime,compress=lzo,space_cache,subvolid=149160 0

rattus backups #
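
The fstab above mounts one btrfs filesystem (a single UUID) at several mount points, selecting subvolumes with `subvolid=`. A minimal Python sketch of reading that structure back out follows. It assumes the wrapped entries have been rejoined onto one line each, it leaves out the /mnt/vm entry because its option string is not clearly legible in the paste, and `parse_fstab` and `subvol_map` are hypothetical helper names.

```python
FSTAB = """\
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /mnt/btrfs-root btrfs defaults,noatime,compress=lzo,space_cache 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /home/wdk btrfs defaults,noatime,compress=lzo,space_cache,subvolid=258 0 0
UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 /mnt/backups btrfs defaults,noatime,compress=lzo,space_cache,subvolid=365 0 0
"""


def parse_fstab(text):
    # Split each non-comment line into the first four fstab fields.
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        spec, mountpoint, fstype, options = line.split()[:4]
        entries.append({
            "spec": spec,
            "mountpoint": mountpoint,
            "fstype": fstype,
            "options": options.split(","),
        })
    return entries


def subvol_map(entries, uuid):
    # Map mount point -> subvolid (None means the default subvolume)
    # for every entry that refers to the given filesystem UUID.
    result = {}
    for entry in entries:
        if entry["spec"] != "UUID=" + uuid:
            continue
        subvolid = None
        for opt in entry["options"]:
            if opt.startswith("subvolid="):
                subvolid = int(opt.split("=", 1)[1])
        result[entry["mountpoint"]] = subvolid
    return result


mounts = subvol_map(parse_fstab(FSTAB), "f5a284b6-442f-4b3d-aa1a-8d6296f517b1")
```

Here `mounts` maps /mnt/btrfs-root to the default subvolume and the other two mount points to subvolumes 258 and 365, all on the same raid 1 filesystem.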




[gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread James
Bill Kenworthy billk at iinet.net.au writes:


  Am 19.01.2015 um 09:32 schrieb Bill Kenworthy:

  Can someone suggest what is causing a balance on this raid 1

Interesting.
I am about to test (reboot) a btrfs, raid one installation.

 Brilliant, you have hit on the answer! - The ancient 300GB system disk
 was sda at one point and moved to sdb - possibly at the time I changed
 to using UUID's.  I've just resized all the disks and it's now moved past
 300G for the first time as well as the other two falling in step with
 the data moving.

I was wondering what my /etc/fstab should look like using uuids, raid 1 and
btrfs.

Could you post your /etc/fstab and any other modifications you made to
your installation related to the btrfs, raid 1 uuid setup?

I'm just using (2) identical 2T disks for my new gentoo workstation.

 I moved to UUID's as the machine has a number of sata ports and a PCI-e
 sata adaptor and the sd* drive numbering kept moving around when I added
 the WD red.


Eventually, I want to run CephFS on several of these raid one btrfs
systems for some clustering code experiments. I'm not sure how that
will affect, if at all, the raid 1-btrfs-uuid setup.


TIA,
James






Re: [gentoo-user] Re: btrfs fails to balance

2015-01-19 Thread Rich Freeman
On Mon, Jan 19, 2015 at 11:50 AM, James wirel...@tampabay.rr.com wrote:
 Bill Kenworthy billk at iinet.net.au writes:

 I was wondering what my /etc/fstab should look like using uuids, raid 1 and
 btrfs.

From mine:
/dev/disk/by-uuid/7d9f3772-a39c-408b-9be0-5fa26eec8342  /
 btrfs   noatime,ssd,compress=none
/dev/disk/by-uuid/cd074207-9bc3-402d-bee8-6a8c77d56959  /data
 btrfs   noatime,compress=none

The first is a single disk, the second is 5-drive raid1.

I disabled compression due to some bugs a few kernels ago.  I need to
look into whether those were fixed - normally I'd use lzo.

I use dracut - obviously you need to use some care when running root
on a disk identified by uuid since this isn't a kernel feature.  With
btrfs as long as you identify one device in an array it will find the
rest.  They all have the same UUID though.
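
To make the shared-UUID point concrete, here is a small Python sketch that groups devices by filesystem UUID, using the blkid lines Bill posted in this thread (rejoined onto one line per device). `parse_blkid` and `btrfs_members` are hypothetical helper names, and the parser assumes no value contains spaces, which holds for this sample but not for blkid output in general.

```python
from collections import defaultdict

BLKID_OUTPUT = """\
/dev/sda: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 UUID_SUB=9003b772-3487-447a-9794-50cf9880a9c0 TYPE=btrfs PTTYPE=dos
/dev/sdc: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 UUID_SUB=20523d9d-3d90-439e-ad68-62def0824198 TYPE=btrfs
/dev/sdb1: UUID=cc5f4bf7-28fc-4661-9d24-a0c9d0048f40 TYPE=ext2
/dev/sdb2: UUID=dddb7e60-89a9-40d4-bf6b-ff4644e079e9 TYPE=swap
/dev/sdb3: UUID=04d8ff4f-fe19-4530-ab45-d82fcd647515 UUID_SUB=72134593-8c9f-436f-98ce-fbb07facbf35 TYPE=btrfs
/dev/sdd: UUID=f5a284b6-442f-4b3d-aa1a-8d6296f517b1 UUID_SUB=2ca026f7-e5c9-4ece-bba1-809ddb03979b TYPE=btrfs
"""


def parse_blkid(text):
    # Turn "dev: KEY=value KEY=value ..." lines into {dev: {KEY: value}}.
    devices = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        dev, _, rest = line.partition(": ")
        fields = {}
        for token in rest.split():
            key, _, value = token.partition("=")
            fields[key] = value.strip('"')  # tolerate quoted values
        devices[dev] = fields
    return devices


def btrfs_members(devices):
    # Group btrfs devices by filesystem UUID; array members share one UUID
    # and differ only in UUID_SUB.
    groups = defaultdict(list)
    for dev, fields in devices.items():
        if fields.get("TYPE") == "btrfs":
            groups[fields["UUID"]].append(dev)
    return dict(groups)


groups = btrfs_members(parse_blkid(BLKID_OUTPUT))
```

In this sample the raid 1 members /dev/sda, /dev/sdc and /dev/sdd fall into one group, while the single-disk root on /dev/sdb3 is a group of its own.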

It's probably also worth noting that if you run btrfs on top of lvm
and then create an lvm snapshot, btrfs can break spectacularly
when it sees two devices whose metadata identify them as being the
same. I don't know where it went, but there was talk of trying to use
a generation id/etc to keep track of which ones are old vs recent in
this scenario.


 Eventually, I want to run CephFS on several of these raid one btrfs
 systems for some clustering code experiments. I'm not sure how that
 will affect, if at all, the raid 1-btrfs-uuid setup.


Btrfs would run below CephFS I imagine, so it wouldn't affect it at all.

The main thing keeping me away from CephFS is that it has no mechanism
for resolving silent corruption.  Btrfs underneath it would obviously
help, though not for failure modes that involve CephFS itself.  I'd
feel a lot better if CephFS had some way of determining which copy was
the right one other than the master server always wins.

-- 
Rich