On 06/13/2012 11:59 PM, Bill Broadley wrote:
> On 06/13/2012 06:40 AM, Bernd Schubert wrote:
>> What about an easy to setup cluster file system such as FhGFS?
>
> Great suggestion. I'm all for a generally useful parallel file system
> instead of a torrent solution with a very narrow use case.
>
>> As one of
>> its developers I'm a bit biased of course, but then I'm also familiar
>
> I think this list is exactly the place where a developer should jump in
> and suggest/explain their solutions as they relate to use in HPC clusters.
>
>> with Lustre, and I think FhGFS is far easier to set up. We also do not
>> have the problem to run clients and servers on the same node and some of
>> our customers make heavy use of that and use their compute nodes as
>> storage servers. That should provide the same or better throughput as
>> your torrent system.
>
> I found the wiki, the "view flyer", FAQ, and related.
>
> I had a few questions, I found this link
> http://www.fhgfs.com/wiki/wikka.php?wakka=FAQ#ha_support but was not
> sure of the details.
>
> What happens when a metadata server dies?
>
> What happens when a storage server dies?
Right, those are the two issues we are actively working on at present.
The current release relies on hardware RAID. Later this year there will
be metadata mirroring, and data mirroring will follow after that.

> If either above is data loss/failure/unreadable files is there a
> description of how to improve against this with drbd+heartbeat or
> equivalent?

During the next few weeks we will test fhgfs-ocf scripts for an HA
(pacemaker) installation. As we are going to be paid for the
installation, I do not know yet when we will make those scripts publicly
available. Generally, drbd+heartbeat as a mirroring solution is possible.

> Sounds like source is not available, and only binaries for CentOS?

Well, RHEL5 / RHEL6 based, SLES10 / SLES11 and Debian. And sorry, the
server daemons are not open source yet. I think the more people ask to
open it, the faster this process will be. Especially if those people are
also going to buy support contracts :)

> Looks like it does need a kernel module, does that mean only old 2.6.X
> CentOS kernels are supported?

Oh, on the contrary. We basically support any kernel from 2.6.16
onwards. Support for the most recent vanilla kernels is usually added
within a few weeks of their release.

> Does it work with mainline ofed on qlogic and mellanox hardware?

Definitely works with both, using RDMA (ibverbs) transfers. As QLogic
has some problems with ibverbs, we cooperated with QLogic to improve
performance on their hardware. Recent QLogic OFED stacks do include
performance fixes. Please also see
http://www.fhgfs.com/wiki/wikka.php?wakka=NativeInfinibandSupport
for (QLogic) tuning advice.

> From a sysadmin point of view I'm also interested in:
> * Do blocks auto balance across storage nodes?

Actually, files are balanced. The default file stripe count is 4, but it
can be adjusted by the admin. So assuming you have only one target per
server, a large file would be distributed over 4 nodes. The default
chunk size is 512kB.
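To make the striping numbers concrete, here is a minimal Python sketch.
It assumes chunks are placed round-robin across the stripe targets,
which is only an illustration and not necessarily how FhGFS places
chunks internally; the function names chunk_layout and targets_used are
made up for this example.

```python
# Illustrative only: assumes simple round-robin chunk placement across the
# stripe targets. The real FhGFS placement logic may differ.

CHUNK_SIZE = 512 * 1024   # default chunk size mentioned above (512kB)
STRIPE_COUNT = 4          # default stripe count mentioned above


def chunk_layout(file_size, chunk_size=CHUNK_SIZE, stripe_count=STRIPE_COUNT):
    """Return a list of (target_index, bytes_in_chunk) pairs for one file."""
    layout = []
    offset = 0
    chunk_no = 0
    while offset < file_size:
        # The last chunk may be shorter than chunk_size.
        length = min(chunk_size, file_size - offset)
        layout.append((chunk_no % stripe_count, length))
        offset += length
        chunk_no += 1
    return layout


def targets_used(file_size, **kwargs):
    """Number of distinct targets that hold data for a file of this size."""
    return len({target for target, _ in chunk_layout(file_size, **kwargs)})


if __name__ == "__main__":
    for size in (100 * 1024, 512 * 1024, 3 * 1024 * 1024):
        print(size, "bytes ->", targets_used(size), "target(s)")
```

Under this assumption, a 3 MiB file touches all 4 targets, while a file
that fits in a single chunk stays on one target.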
For files smaller than that size there is no stripe overhead.

> * Is managing disk space, inodes (or equiv) and related capacity
> planning complex? Or does df report useful/obvious numbers?

Hmm, right now (unix) "df -i" does not yet report inode usage for
fhgfs. We will fix that in a later release.

At least for traditional storage servers, we recommend ext4 on metadata
partitions for performance reasons. For storage partitions we usually
recommend XFS, again for performance. Storage and metadata can also be
on the very same partition; you just need to configure the paths where
those data are found in the corresponding config files.

If you are going to use all your client nodes as fhgfs servers and those
already have XFS as scratch partition, XFS is probably also fine.
However, due to a severe XFS performance issue, you either need a kernel
in which this issue is fixed, or you should disable meta-data-as-xattr
(in fhgfs-meta.conf: storeUseExtendedAttribs = false). Please also see
here for a discussion and benchmarks:

http://oss.sgi.com/archives/xfs/2011-08/msg00233.html

Christoph Hellwig later fixed the unlink issue, and this patch should be
in all recent linux-stable kernels. I have not checked RHEL5/RHEL6,
though.

Anyway, if you are going to use ext4 on your metadata partition, you
need to make sure yourself that you have sufficient inodes available.
Our wiki has recommendations for mkfs.ext4 options.

> * Can storage nodes be added/removed easily by migrating on/off of
> hardware?

Adding storage nodes on the fly works perfectly fine. Our fhgfs-ctl tool
also has a mode to migrate files off a storage node. However, we really
recommend not doing that while clients are writing to the file system.
The reason is that we do not lock files in migration yet, so a client
might write to unlinked files, which would result in silent data loss.
We have on-the-fly data migration on our todo list, but I cannot say yet
when that is going to come.
If you are going to use your clients as storage nodes, you could specify
that system as the preferred system to write files to. That would make
it easy to remove that node...

> * Does FhGFS handle 100% of the distributed file system responsibilities
> or does it layer on top of xfs/ext4 or related? (like ceph)

Like ceph, it layers on top of other file systems, such as xfs or ext4.

> * With large files does performance scale reasonably with storage
> servers?

Yes, and you may also adjust the stripe count to your needs. The default
stripe count is 4, which provides approximately the performance of 4
storage targets.

> * With small files does performance scale reasonably with metadata
> servers?

Striping over different metadata servers is done on a per-directory
basis. As most users and applications work in different directories,
metadata performance usually scales linearly with the number of metadata
servers.

Please note: our wiki has tuning advice for metadata performance, and
with our next major release we should also see greatly improved
metadata performance.

Hope it helps and please let me know if you have further questions!

Cheers,
Bernd

PS: We have a GUI, which should help you to just try it out within a few
minutes. Please see here:
http://www.fhgfs.com/wiki/wikka.php?wakka=GUIbasedInstallation

_______________________________________________
Beowulf mailing list, [email protected] sponsored by Penguin Computing
To change your subscription (digest mode or unsubscribe) visit
http://www.beowulf.org/mailman/listinfo/beowulf
