Hi,

While not necessarily CephFS specific, we somehow manage to frequently end up with objects that have inconsistent omaps. This seems to be replication-related: anecdotally it's a replica that ends up diverging, and at least a few times it happened after the OSD that held that replica was restarted. (I had hoped http://tracker.ceph.com/issues/17177 would solve this, but it doesn't appear to have solved it completely.)
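For reference, a minimal sketch of finding those inconsistencies after a deep-scrub (the pool name is just an example, adjust for your cluster):

    #!/usr/bin/env python
    # Sketch: list scrub-detected inconsistencies in the CephFS metadata
    # pool. Assumes a deep-scrub has already run, since that is what
    # flags the omap mismatches.
    import json
    import subprocess

    POOL = "cephfs_metadata"  # example name - adjust for your cluster

    def rados(*args):
        out = subprocess.check_output(("rados",) + args)
        return json.loads(out.decode("utf-8"))

    for pgid in rados("list-inconsistent-pg", POOL, "--format=json"):
        report = rados("list-inconsistent-obj", pgid, "--format=json")
        for bad in report.get("inconsistents", []):
            errors = set(bad.get("errors", [])) | set(bad.get("union_shard_errors", []))
            print("%s %s: %s" % (pgid, bad["object"]["name"], ", ".join(sorted(errors))))

From there, "ceph pg repair <pgid>" on the flagged PG is the usual way to get things consistent again, though of course it doesn't explain why the replica diverged in the first place.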
We also have one workload which we'd need to re-engineer in order to be a good fit for CephFS: we do a lot of hardlinks where there's no clear "origin" file, which is slightly at odds with the hardlink implementation. If I understand correctly, unlink moves the dentry from the directory tree into the stray directories and decrements the link count; if the link count is 0 the inode gets purged, otherwise it's kept around until another link to it is encountered and it gets re-integrated back in again. This netted us hilariously large stray directories, which combined with the above was less than ideal.

Beyond that, there have been other small(-ish) bugs we've encountered, but they've been solvable by cherry-picking fixes, upgrading, or using the available tools for doing surgery, guided by the internet and/or an approximate understanding of how it's supposed to work.
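To make the hardlink/stray behaviour above concrete, a toy model of it as I understand it (not MDS code - the names and structure are made up purely for illustration):

    # Toy model of unlink/stray handling as described above; not MDS code.
    class Inode(object):
        def __init__(self, nlink):
            self.nlink = nlink
            self.in_stray = False
            self.purged = False

    def unlink(inode):
        # the dentry leaves the directory tree either way
        inode.nlink -= 1
        if inode.nlink == 0:
            inode.purged = True    # nothing points at it any more: purge
        else:
            inode.in_stray = True  # parked in the MDS stray directories

    def traverse_remaining_link(inode):
        # only when some *other* surviving link is touched does the inode
        # get re-integrated out of the stray directories again
        if inode.in_stray and inode.nlink > 0:
            inode.in_stray = False

    inode = Inode(nlink=2)    # a file plus one extra hardlink
    unlink(inode)
    assert inode.in_stray     # sits in stray until the other link is touched
    traverse_remaining_link(inode)
    assert not inode.in_stray

With a hardlink-heavy workload where the surviving links are rarely traversed, inodes just sit in that parked state, which is how the stray directories end up so large.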
-KJ

On Wed, Jul 19, 2017 at 11:20 AM, Brady Deetz <[email protected]> wrote:
> Thanks Greg. I thought it was impossible when I reported 34MB for 52
> million files.
>
> On Jul 19, 2017 1:17 PM, "Gregory Farnum" <[email protected]> wrote:
>
>> On Wed, Jul 19, 2017 at 10:25 AM David <[email protected]> wrote:
>>
>>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite
>>> <[email protected]> wrote:
>>>
>>>> We are a data-intensive university, with an increasingly large fleet
>>>> of scientific instruments capturing various types of data (mostly
>>>> imaging of one kind or another). That data typically needs to be
>>>> stored, protected, managed, shared, connected/moved to specialised
>>>> compute for analysis. Given the large variety of use-cases we are
>>>> being somewhat more circumspect in our CephFS adoption and really
>>>> only dipping toes in the water, ultimately hoping it will become a
>>>> long-term default NAS choice from Luminous onwards.
>>>>
>>>> On 18 July 2017 at 15:21, Brady Deetz <[email protected]> wrote:
>>>> > All of that said, you could also consider using rbd and zfs or
>>>> > whatever filesystem you like. That would allow you to gain the
>>>> > benefits of scaleout while still getting a feature rich fs. But,
>>>> > there are some down sides to that architecture too.
>>>>
>>>> We do this today (KVMs with a couple of large RBDs attached via
>>>> librbd+QEMU/KVM), but the throughput able to be achieved this way is
>>>> nothing like native CephFS - adding more RBDs doesn't seem to help
>>>> increase overall throughput. Also, if you have NFS clients you will
>>>> absolutely need SSD ZIL. And of course you then have a single point
>>>> of failure and downtime for regular updates etc.
>>>>
>>>> In terms of small file performance I'm interested to hear about
>>>> experiences with in-line file storage on the MDS.
>>>>
>>>> Also, while we're talking about CephFS - what size metadata pools
>>>> are people seeing on their production systems with 10s-100s millions
>>>> of files?
>>>
>>> On a system with 10.1 million files, metadata pool is 60MB
>>>
>> Unfortunately that's not really an accurate assessment, for good but
>> terrible reasons:
>>
>> 1) CephFS metadata is principally stored via the omap interface (which
>>    is designed for handling things like the directory storage CephFS
>>    needs)
>> 2) omap is implemented via Level/RocksDB
>> 3) there is not a good way to determine which pool is responsible for
>>    which portion of RocksDB's data
>> 4) So the pool stats do not incorporate omap data usage at all in their
>>    reports (it's part of the overall space used, and is one of the
>>    things that can make that larger than the sum of the per-pool
>>    spaces)
>>
>> You could try to estimate it by looking at how much "lost" space there
>> is (and subtracting out journal sizes and things, depending on setup).
>> But I promise there's more than 60MB of CephFS metadata for 10.1
>> million files!
>> -Greg
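In case anyone wants to try the estimate Greg describes: a back-of-the-envelope sketch, assuming Jewel-era "ceph df --format=json" field names and hand-waving every pool as 3x replicated (adjust both for your cluster):

    # Sketch of the "lost space" estimate: global raw usage minus what the
    # per-pool stats account for. Field names match Jewel-era
    # "ceph df --format=json"; the 3x replication factor is an assumption.
    import json
    import subprocess

    df = json.loads(
        subprocess.check_output(["ceph", "df", "--format=json"]).decode("utf-8"))

    raw_used = df["stats"]["total_used_bytes"]
    # per-pool bytes_used is logical, so scale by the (assumed) replica count
    accounted = sum(p["stats"]["bytes_used"] * 3 for p in df["pools"])

    print("raw used:   %d" % raw_used)
    print("pools (x3): %d" % accounted)
    print("leftover:   %d  (omap/RocksDB, journals, overhead, ...)" % (raw_used - accounted))

Whatever is left after subtracting journals and filesystem overhead is roughly where the omap/RocksDB data - and with it the actual CephFS metadata footprint - hides.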
--
Kjetil Joergensen <[email protected]>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
