Alfred Bartsch wrote:
Am 19.10.2011 15:42, schrieb Miroslav Lachman:
[...]
UEFI will replace the old BIOS sooner or later, so what will you do
then? Then you will need to rework your servers and change your
setup routine. And I think it is better to avoid a known possible
problem than to hope "it will not bite me". You can't avoid Murphy's
law ;)
From my present point of view there are two alternatives: Hardware
RAID and (matured) ZFS.
If I were a GEOM guru, I would try to enhance the compatibility
between the upcoming UEFI and GMIRROR / GRAID3 etc. Just guessing:
what about adding a flag named "-gpt", "-efi", or just "-offset" to
these GEOM classes to reserve enough space (at least 33 sectors) behind
the metadata sector at the end of the disk provider to hold whatever
secondary GPT table is needed to satisfy UEFI?
In an ideal world that would be "the right way", but I guess it will
never happen in our real FreeBSD world. There is nobody with the time,
skills and courage to rework "all" GEOM classes. It is not an easy task
(backward compatibility, compatibility with other OSes / tools, etc.)
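For context, the clash that such an "-offset" flag would address can be
sketched like this (device names are examples, not from the original
setup):

```shell
# gmirror stores its metadata in the LAST sector of each member disk:
gmirror label -v gm0 /dev/ada0 /dev/ada1

# GPT also wants the end of the disk: the backup GPT header lives in the
# last sector and the backup partition table in the 32 sectors before it,
# so a GPT written directly to ada0 and the gmirror metadata overwrite
# each other.

# The usual workaround today is to partition the mirror provider instead;
# mirror/gm0 is one sector smaller than the raw disk, so the backup GPT
# lands just in front of the gmirror metadata:
gpart create -s gpt mirror/gm0
gpart add -t freebsd-ufs -s 100G mirror/gm0
```

Note that this workaround only protects the on-disk data from FreeBSD's
point of view; a UEFI firmware inspecting the raw disk still does not
find a backup GPT in the expected last sectors, which is exactly the
concern discussed above.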
I am using gjournal on a few of our servers, but we are slowly
removing it from our setups. Data writes to gjournaled disks
are too slow and sometimes gjournal does not play nice.
I'm heavily interested in more details.
When I did some tests in the past, gjournal could not be used in
combination with iSCSI, and I was not able to stop gjournal from
tasting providers (I was not able to remove / disable gjournal on a
device) until I stopped all of them and unloaded the gjournal kernel
module. I don't know the current state.
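For reference, the teardown sequence gjournal(8) provides looks roughly
like this (device names are hypothetical):

```shell
# stop the journal; -f forces it even if the provider is still open
gjournal stop -f gm0s2e
# clear the gjournal metadata from the data provider so the class
# stops recognizing it on every taste event
gjournal clear mirror/gm0s2e
# unload the kernel module; while geom_journal.ko is loaded it keeps
# tasting providers and can re-create the .journal device
gjournal unload
```

The complaint above is that until every journal on the system is
stopped, the final unload step fails and the class keeps re-tasting.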
Up to now I'm not using any iSCSI equipment. Good to know about some
weaknesses in advance.
Maybe ZFS or UFS+SUJ is better option.
Yes, maybe. ZFS is mainly for future use. Do you use the second
option on large filesystems?
ZFS has been there for "a long time". I feel safe using it in
production on a few of our servers. I didn't test UFS+SUJ because it
is released in the forthcoming 9.0 and we are not deploying CURRENT on
our servers.
Compared to UFS, ZFS's lifetime is relatively short. From my point of
view, ZFS in its present state is too immature to hold mission-critical
data, YMMV.
UFS2 or UFS2+SU (Soft Updates) has been there for a longer time than
ZFS, but UFS2+SUJ (journaled soft updates) has only been there for a
short time and is not much tested in production. Even UFS2+gjournal is
not widely deployed / tested.
On the other hand, ZFS needs a lot of redundant disk space and memory
to work as expected, not to forget CPU cycles. IMHO, ZFS is not 32-bit
capable, so there is no way to use it on older and small hardware.
Yes, you are right, ZFS cannot be used in some scenarios. But in some
other scenarios, ZFS is the best possible option.
E.g. for large flexible storage I will use ZFS; for a database server
I will use UFS2+SU without gjournal.
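As a sketch of the two choices (pool, dataset, and device names are
examples):

```shell
# large flexible storage: a ZFS mirror, easy to grow later by
# attaching more vdevs to the pool
zpool create tank mirror ada2 ada3
zfs create tank/storage

# database server: plain UFS2 with Soft Updates (-U), no journal;
# the database engine does its own write-ahead logging
newfs -U /dev/mirror/gm0s1d
mount /dev/mirror/gm0s1d /var/db/mysql
```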
[...]
Did you perform any benchmarks (UFS+Softupdates vs. UFS+Gjournal)? If
yes, did you compare async mounts + write cache enabled (gjournal) to
sync mounts + write cache disabled (softupdates)?
I don't have fancy graphs or tables from benchmarking software, I just
have real workload experience where the write cache was enabled in both
cases.
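For completeness, the write-cache knobs being compared here would be
set along these lines on FreeBSD of that era (mount point and device
are examples):

```shell
# legacy ata(4) disks: disable the drive write cache at boot
echo 'hw.ata.wc=0' >> /boot/loader.conf

# CAM-attached ada(4) disks (FreeBSD 8+): boot-time tunable
echo 'kern.cam.ada.write_cache=0' >> /boot/loader.conf

# the async mount for the gjournal case goes in /etc/fstab, e.g.:
# /dev/mirror/gm0s2e.journal  /vol0  ufs  rw,async,noatime  2  2
```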
If I understand you right, you prefer write speed to data consistency
in these cases. This may be the right choice in your environment.
From my point of view, I am happy to find all bits restored in /var
after an unclean shutdown for error analysis and recovery purposes,
and I hate the vision of having to restore databases from backup, even
after power failures. Furthermore, I am glad to only have to wait for
gmirror to resynchronize to regain redundancy after replacing a failed
disk.
I am not sure you can rely on data consistency with today's HDDs when
the cache is enabled, even if you use gjournal. You can always lose the
content of the device cache, as rotating disks (and some flash devices
- SSDs - too) are known to lie to the OS about "data is written", so
you end up with lost or damaged data after an unclean shutdown.
Database engines handle it in their own way with their own journal log
etc., because some of them can be installed on raw partitions without
an underlying FS. (MySQL can do it too.)
I remember one case (about 3 years ago) where a server remained
unbootable after a kernel panic and I spent a couple of hours playing
with disabling gjournal, doing a full fsck on the given partition, etc.
It is a rare case, but it can happen.
With fdisk + bsdlabel there are not enough partitions in one
slice to hold all the journals, and as I already mentioned, I
really want to minimize recovery time. With gmirror + gjournal
I'm able to activate the disk write cache without losing data
consistency, which improves performance significantly.
According to the following commit message, bsdlabel was extended to 26
partitions 3 years ago:
http://lists.freebsd.org/pipermail/cvs-all/2007-December/239719.html
(I haven't tested it yet, because I don't need it - we are using two
slices on our servers)
I didn't know this, thanks for pointing it out. I'm not sure if all
BSD utilities can deal with this.
I see what you are trying to do, and it would be nice if "all
works as one can expect", but the reality is different. So I
don't think it is a good idea to do it as you described.
I'm not yet fully convinced, that my idea of disk partitioning is
a bad one, so please let me take part in your negative
experiences with gjournal. Thanks in advance.
I am not saying that your idea is bad. It just contains some
things which I would rather avoid.
To summarize some of the pros and cons of this method of disk
partitioning:
pros:
- IMHO easy to configure
- easy to recover from a failed disk (just replace with a suitable
one and resync with gmirror, needs no preparation of the new disk)
- minimal downtime after unclean shutdowns (gjournal is responsible
for this, no sucking fsck on large file systems)
- disk write cache can and should be enabled (enhanced performance)
- all disk / partition sizes are supported (even > 2TB)
- 32 bit version of FreeBSD (i386) is sufficient (small and old
hardware remains usable)
cons:
- danger of overwriting gmirror metadata by an "unfriendly" UEFI-BIOS
- somewhat complex initial setup or future changes in partitioning
(you must have prepared the right number of partitions for the
journals, so adding more partitions is not so easy - in the case of
UFS2+SUJ or ZFS, you just add another partition)
- to be continued ...
Feel free to add some topics here which I am missing.
One thing on my mind is a longstanding problem with gjournal on heavily
loaded servers:
Aug 16 01:48:28 praha kernel: fsync: giving up on dirty
Aug 16 01:48:30 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
Aug 16 01:48:30 praha kernel: usecount 1, writecount 0, refcount 6941
mountedhere 0xc445b700
Aug 16 01:48:30 praha kernel: flags ()
Aug 16 01:48:30 praha kernel: v_object 0xc1548c00 ref 0 pages 192023
Aug 16 01:48:30 praha kernel: lock type devfs: EXCL (count 1) by thread
0xc42a7240 (pid 45)
Aug 16 01:48:30 praha kernel: dev mirror/gm0s2e.journal
Aug 16 01:48:30 praha kernel: GEOM_JOURNAL: Cannot suspend file system
/vol0 (error=35).
Aug 16 02:32:34 praha kernel: fsync: giving up on dirty
Aug 16 02:32:34 praha kernel: 0xc44ba9b4: tag devfs, type VCHR
Aug 16 02:32:34 praha kernel: usecount 1, writecount 0, refcount 1418
mountedhere 0xc445b700
Aug 16 02:32:34 praha kernel: flags ()
Aug 16 02:32:34 praha kernel: v_object 0xc1548c00 ref 0 pages 128123
Aug 16 02:32:34 praha kernel: lock type devfs: EXCL (count 1) by thread
0xc42a7240 (pid 45)
Aug 16 02:32:34 praha kernel: dev mirror/gm0s2e.journal
Aug 16 02:32:34 praha kernel: GEOM_JOURNAL: Cannot suspend file system
/vol0 (error=35).
These error messages are seen on them almost every second day and
nobody has given me a suitable explanation of what they really mean /
what causes them. The only answer I got was something like "it is not
harmful"... then why are they logged at all?
So today I removed gjournal from the next older server.
I will try UFS2+SUJ on 9.0 as one of the possible ways for future
setups where ZFS cannot be used.
Miroslav Lachman
_______________________________________________
[email protected] mailing list
http://lists.freebsd.org/mailman/listinfo/freebsd-geom
To unsubscribe, send any mail to "[email protected]"