Re: [ceph-users] How does monitor know OSD is dead?
> I'm a bit confused about what happened here, though: that 600 second
> interval is only important if *every* OSD in the system is down. If you
> reboot the data center, why didn't *any* OSD daemons start? (And even if
> none did, having the ceph -s report all OSDs down instead of up isn't
> going to change anything except whether your pager is going off, right?)

I think you got lost in the thread of discussion. Enough OSDs for the cluster to be fully functional _did_ come back. But the cluster insisted on going to the dead ones (which it claimed all the while were up) for some I/O, even after running for 20 minutes that way, so the cluster was not functional.

The 600 second "mon osd down out interval" was a red herring.

It might be relevant that there was a grand total of three OSDs in the map. One came up; two did not. All objects were replicated across all three, with the hope that this sort of thing would not be fatal.

It's a Jewel system with that version's default of 1 for "mon osd min down reporters".

-- Bryan Henderson San Jose, California
___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] How does monitor know OSD is dead?
Here's some counter-evidence to the proposition that it's not pretty common for an entire cluster to go down because of a power failure.

Every data center class hardware storage server product I know of has dual power input and is also designed to tolerate losing power on both at once. If that happens, they don't lose data and when the power comes back, they come back up all by themselves and start serving storage again. This design usually involves an expensive battery and maintenance procedure to make sure the battery gets replaced before it wears out (the battery is to keep the system up long enough to flush write buffers when the power fails), so users must think total power loss is a serious enough threat to pay for that.

I may need to modify the above, though, now that I know how Ceph works, because I've seen storage server products that use Ceph inside. However, I'll bet the people who buy those are not aware that it's designed never to go down and if something breaks while the system is coming up, a repair action may be necessary before data is accessible again.

-- Bryan Henderson San Jose, California
Re: [ceph-users] How does monitor know OSD is dead?
> Normally in the case of a restart then somebody who used to have a
> connection to the OSD would still be running and flag it as dead. But
> if *all* the daemons in the cluster lose their soft state, that can't
> happen.

OK, thanks. I guess that explains it. But that's a pretty serious design flaw, isn't it? What I experienced is a pretty common failure mode: a power outage caused the entire cluster to die simultaneously, then when power came back, some OSDs didn't (the most common time for a server to fail is at startup).

I wonder if I could close this gap with additional monitoring of my own. I could have a cluster bringup protocol that detects OSD processes that aren't running after a while and mark those OSDs down. It would be cleaner, though, if I could just find out from the monitor what OSDs are in the map but not connected to the monitor cluster. Is that possible?

A related question: If I mark an OSD down administratively, does it stay down until I give a command to mark it back up, or will the monitor detect signs of life and declare it up again on its own?

-- Bryan Henderson San Jose, California
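The bringup check proposed above could be sketched roughly as follows. This is only an illustration: it assumes the JSON form of the OSD map dump lists each OSD with "osd" and "up" fields, and that the set of OSD ids whose daemons are really running has been gathered some other way (process checks on each host, for example). The function names are hypothetical, and the sketch only prints the "osd down" commands rather than running them.

```python
import json


def stale_up_osds(osd_dump_json, running_ids):
    """Return ids the OSD map calls 'up' whose daemons are not running.

    osd_dump_json: text of a JSON OSD map dump (assumed to hold an "osds"
                   list whose entries have "osd" and "up" fields).
    running_ids:   set of OSD ids known, by some outside check, to be alive.
    """
    dump = json.loads(osd_dump_json)
    return sorted(
        osd["osd"]
        for osd in dump["osds"]
        if osd["up"] == 1 and osd["osd"] not in running_ids
    )


def mark_down_commands(osd_ids):
    """Build the administrative commands an operator could review and run."""
    return [f"ceph osd down {osd_id}" for osd_id in osd_ids]


# Toy map resembling the three-OSD cluster in this thread: all claimed up.
SAMPLE_DUMP = json.dumps({"osds": [
    {"osd": 0, "up": 1, "in": 1},
    {"osd": 1, "up": 1, "in": 1},
    {"osd": 2, "up": 1, "in": 1},
]})

# Only osd.0 actually came back after the power failure.
for command in mark_down_commands(stale_up_osds(SAMPLE_DUMP, running_ids={0})):
    print(command)
```

Whether a manually downed OSD stays down is exactly the open question in the message above, so a real script would probably need to repeat the check.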
Re: [ceph-users] How does monitor know OSD is dead?
> I'm not sure why the monitor did not mark it _out_ after 600 seconds
> (default)

Well, that part I understand. The monitor didn't mark the OSD out because the monitor still considered the OSD up. No reason to mark an up OSD out.

I think the monitor should have marked the OSD down upon not hearing from it for 15 minutes ("mon osd report timeout"), then out 10 minutes after that ("mon osd down out interval").

And that's worst case. Though details of how OSDs watch each other are vague, I suspect an existing OSD was supposed to detect the dead OSDs and report that to the monitor, which would believe it within about a minute and mark the OSDs down (see "osd heartbeat interval", "mon osd min down reports", "mon osd min down reporters", "mon osd reporter subtree level").

-- Bryan Henderson San Jose, California
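The worst-case timing described above works out as in the following small sketch. It only does the arithmetic on the two defaults quoted in this message (15 minutes to be marked down, then 10 more minutes to be marked out); the constant and function names are illustrative, not Ceph identifiers.

```python
# Defaults as quoted in this thread, in seconds. Names are illustrative only.
REPORT_TIMEOUT = 15 * 60      # silence before the monitor marks an OSD down
DOWN_OUT_INTERVAL = 10 * 60   # further wait before a down OSD is marked out


def worst_case_timeline(report_timeout=REPORT_TIMEOUT,
                        down_out_interval=DOWN_OUT_INTERVAL):
    """Return (seconds until marked down, seconds until marked out)."""
    marked_down = report_timeout
    marked_out = marked_down + down_out_interval
    return marked_down, marked_out


down_at, out_at = worst_case_timeline()
print(f"down after {down_at // 60} min, out after {out_at // 60} min")
```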
Re: [ceph-users] How does monitor know OSD is dead?
> The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted.

I think you're confusing down with out. When an OSD is out, Ceph backfills. While it is merely down, Ceph hopes that it will come back. But it will direct I/O to other redundant OSDs instead of a down one.

Going down leads to going out, and I believe that is the 600 seconds you mention - the time between when the OSD is marked down and when Ceph marks it out (if all other conditions permit).

There is a pretty good explanation of how OSDs get marked down, which is pretty complicated, at

  http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

It just doesn't seem to match the implementation.

-- Bryan Henderson San Jose, California
[ceph-users] How does monitor know OSD is dead?
What does it take for a monitor to consider an OSD down which has been dead as a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was quite dead. Recently, a couple of OSDs were on machines that failed to boot up after a power failure. The rest of the Ceph cluster came up, though, and reported all OSDs up and in. I/Os stalled, probably because they were waiting for the dead OSDs to come back.

I waited 15 minutes, because the manual says if the monitor doesn't hear a heartbeat from an OSD in that long (default value of mon_osd_report_timeout), it marks it down. But it didn't. I did "osd down" commands for the dead OSDs and the status changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to wait that long before falling back to a redundant OSD?

-- Bryan Henderson San Jose, California
Re: [ceph-users] cephfs file block size: must it be so big?
> I tested fread on Fedora 28. fread does 8k read on even block size is 4M.

So maybe I should be looking at changing my GNU Libc instead of my Ceph. But I can't confirm that reading 8K regardless of blocksize is normal anywhere. My test on Debian 9 (about 3 years old) with glibc 2.24 shows fread causes a blocksize read. Same on a system with Glibc 2.19.

I'm using the 'stat' program from Coreutils to see the blocksize and using strace of a program that does a fopen and a 4-byte fread to see the read size.

Here is the code from Glibc 2.23 as well as current development where it appears to be designed to use the blocksize if it has one:

    size = _IO_BUFSIZ;
    if (fp->_fileno >= 0 && __builtin_expect (_IO_SYSSTAT (fp, &st), 0) >= 0)
      {
        if (S_ISCHR (st.st_mode))
          {
            ...
          }
    #if _IO_HAVE_ST_BLKSIZE
        if (st.st_blksize > 0)
          size = st.st_blksize;
    #endif
      }
    p = malloc (size);
    ...
    _IO_setb (fp, p, p + size, 1);

_IO_BUFSIZ above is 8K, so I expect an 8K read if the stat fails or reports no blocksize (st_blksize == 0). The fread code in glibc reads the full size of the buffer allocated by the above code, as recorded by that _IO_setb call.

[_IO_file_doallocate() in libio/filedoalloc.c]

> NFS reports 1M block size

Can't reproduce that one either. In some NFS experiments of mine, the blocksize reported by 'stat' appears to be controlled by the rsize and wsize mount options. Without such options, in the one case I tried, Linux 4.9, blocksize was 32K. Maybe it's affected by the server or by the filesystem the NFS server is serving. This was NFS 3.

> This patch should address this issue [massive reads of e.g. /dev/urandom]:.

Thanks!

> mount option should work.

And thanks again.

-- Bryan Henderson San Jose, California
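The buffer-size rule in the glibc excerpt above can be restated as a short sketch. It is a Python paraphrase for illustration only, not glibc code: it leaves out the character-device branch that the excerpt elides, and the function names are made up here.

```python
import os

IO_BUFSIZ = 8192  # the 8K fallback described in the message above


def stdio_buffer_size(st_blksize, fallback=IO_BUFSIZ):
    """Mirror the rule in the excerpt: use st_blksize when it is positive."""
    return st_blksize if st_blksize > 0 else fallback


def buffer_size_for(path):
    """Buffer size a stream on 'path' would get under that rule."""
    try:
        blksize = getattr(os.stat(path), "st_blksize", 0)
    except OSError:
        return IO_BUFSIZ  # stat failed: fall back to the default
    return stdio_buffer_size(blksize)


print(buffer_size_for("."))
```

With a file whose reported block size is 4M, this rule yields a 4M buffer, which is the size of read the message above observed with strace.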
Re: [ceph-users] cephfs file block size: must it be so big?
> Going back through the logs though it looks like the main reason we do a
> 4MiB block size is so that we have a chance of reporting actual cluster
> sizes to 32-bit systems,

I believe you're talking about a different block size (there are so many of them).

The 'statvfs' system call (the essence of a 'df' command) can return its space sizes in any units it wants, and tells you that unit. The unit has variously been called block size and fragment size. In Cephfs, it is hardcoded as 4 MiB so that 32 bit fields can represent large storage sizes. I'm not aware that anyone attempts to use that value for anything but interpreting statvfs results. Not saying they don't, though.

What I'm looking at, in contrast, is the block size returned by a 'stat' system call on a particular file. In Cephfs, it's the stripe unit size for the file, which is an aspect of the file's layout. In the default layout, stripe unit size is 4 MiB.

-- Bryan Henderson San Jose, California
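The two block sizes distinguished above can be looked at side by side, and the 32-bit argument checked with simple arithmetic. The following is a sketch for a POSIX system; the helper names are made up for illustration.

```python
import os

KIB = 1024
MIB = 1024 ** 2


def max_reportable_bytes(unit_bytes, count_bits=32):
    """Largest size expressible as a count_bits-wide count of units."""
    return (2 ** count_bits - 1) * unit_bytes


def block_sizes(path):
    """Return the filesystem-wide units and the per-file block size."""
    vfs = os.statvfs(path)   # what 'df' uses
    st = os.stat(path)       # what stream I/O uses
    return {
        "statvfs_f_bsize": vfs.f_bsize,
        "statvfs_f_frsize": vfs.f_frsize,
        "stat_st_blksize": st.st_blksize,
    }


# A 32-bit count of 4K units tops out just under 16 TiB;
# a 32-bit count of 4 MiB units tops out just under 16 PiB.
print(max_reportable_bytes(4 * KIB) / 1024 ** 4, "TiB")
print(max_reportable_bytes(4 * MIB) / 1024 ** 5, "PiB")
print(block_sizes("."))
```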
[ceph-users] cephfs file block size: must it be so big?
I've searched the ceph-users archives and found almost no discussion of Cephfs block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports for each file a block size, which is usually defined vaguely as the smallest read or write size that is efficient. It usually takes into account that small writes may require a read-modify-write and there may be a minimum size on reads from backing storage.

One thing that uses this information is the stream I/O implementation (fopen/fclose/fread/fwrite) in GNU libc. It always reads and usually writes full blocks, buffering as necessary.

Most filesystems report this number as 4K. Ceph reports the stripe unit (stripe column size), which is the maximum size of the RADOS objects that back the file. This is 4M by default.

One result of this is that a program uses a thousand times more buffer space when running against a Ceph file than against a traditional filesystem.

And a really pernicious result occurs when you have a special file in Cephfs. Block size doesn't make any sense at all for special files, and it's probably a bad idea to use stream I/O to read one, but I've seen it done. The Chrony clock synchronizer programs use fread to read random numbers from /dev/urandom. Should /dev/urandom be in a Cephfs filesystem, with defaults, it's going to generate 4M of random bits to satisfy a 4-byte request. On one of my computers, that takes 7 seconds - and wipes out the entropy pool.

Has stat block size been discussed much? Is there a good reason that it's the RADOS object size?

I'm thinking of modifying the cephfs filesystem driver to add a mount option to specify a fixed block size to be reported for all files, and using 4K or 64K. Would that break something?

-- Bryan Henderson San Jose, California
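To illustrate why stream I/O is the culprit in the /dev/urandom case above, here is a sketch of the alternative: reading a few bytes with no stream buffer, so each request asks the kernel for only the bytes wanted. It is written in Python for brevity and is not how Chrony is implemented; in C the analogous approach would be plain read() calls instead of fread(). The function name is hypothetical.

```python
import os
import tempfile


def read_exact_unbuffered(path, nbytes):
    """Read up to nbytes from path without a stream buffer.

    buffering=0 gives a raw file object, so each read() call requests at
    most the bytes still needed instead of a whole block-size buffer.
    """
    chunks = []
    remaining = nbytes
    with open(path, "rb", buffering=0) as raw:
        while remaining > 0:
            chunk = raw.read(remaining)
            if not chunk:          # end of file
                break
            chunks.append(chunk)
            remaining -= len(chunk)
    return b"".join(chunks)


# Demonstration on an ordinary temporary file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes(range(100)))
    demo_path = tmp.name
print(read_exact_unbuffered(demo_path, 4))
os.unlink(demo_path)
```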
[ceph-users] searching mailing list archives
Is it possible to search the mailing list archives?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/ seems to have a search function, but in my experience never finds anything.

-- Bryan Henderson San Jose, California
[ceph-users] How to repair rstats mismatch
How does one repair an rstats mismatch detected by 'scrub_path' (caused by a previous failure to write the journal)?

And how bad is an rstats mismatch? What are rstats used for?

I see one thing the mismatch does, apparently, is make it impossible to delete the directory, as Cephfs says it isn't empty, while also giving an empty list of its contents.

-- Bryan Henderson San Jose, California
Re: [ceph-users] Should OSD write error result in damaged filesystem?
>OSD write errors are not usual events: any issues with the underlying
>storage are expected to be handled by RADOS, and write operations to
>an unhealthy cluster should block, rather than returning an error. It
>would not be correct for CephFS to throw away metadata updates in the
>case of unexpected write errors -- this is a strongly consistent
>system, so when we can't make progress consistently (i.e. respecting
>all the ops we've seen in order), then we have to stop.

Thank you for that explanation; that all makes sense. I have to get used to the idea of responding to broken storage by waiting indefinitely until it isn't broken. I wasn't thinking in those terms.

>I'm guessing that you changed some related settings (like
>mds_log_segment_size) to get into this situation? Otherwise, an error
>like this would definitely be a bug.

What I changed (from default) was osd_max_write_size. I set it to its legal minimum, 1M. I've discovered that there are clients all around that expect to be able to write 4M and don't respond nicely when they can't. Rather than try to find and change them all, I'm going to capitulate and go ahead and make osd_max_write_size 4M.

Does manually tuning every client to make it consistent with the OSD's maximum write size have to be what avoids crashes like this? It sure would be nice if an MDS could detect much earlier that the log is on an OSD that's incapable of hosting that log. But I found the filesystem driver is the same way - I have to tell it how big a write it can do; it can't figure it out from the OSDs. So maybe it's a fundamental architecture thing.

-- Bryan Henderson San Jose, California
[ceph-users] Should OSD write error result in damaged filesystem?
I had a filesystem rank get damaged when the MDS had an error writing the log to the OSD. Is damage expected when a log write fails?

According to log messages, an OSD write failed because the MDS attempted to write a bigger chunk than the OSD's maximum write size. I can probably figure out why that happened and fix it, but OSD write failures can happen for lots of reasons, and I would have expected the MDS just to discard the recent filesystem updates, issue a log message, and keep going. The user had presumably not been told those updates were committed.

And how do I repair this now? Is this a job for

  cephfs-journal-tool event recover_dentries
  cephfs-journal-tool journal reset

?

This is Jewel.

-- Bryan Henderson San Jose, California
Re: [ceph-users] MDS does not always failover to hot standby
dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)

-- Bryan Henderson San Jose, California
Re: [ceph-users] MDS does not always failover to hot standby on reboot
> If the active MDS is connected to a monitor and they fail at the same time,
> the monitors can't replace the mds until they've been through their own
> election and a full mds timeout window.

So how long are we talking?

-- Bryan Henderson San Jose, California
Re: [ceph-users] Why does Ceph probe for end of MDS log?
>No, the log end in the header is a hint. This is because we can't
>atomically write to two objects (the header and the last log object) at the
>same time, so we do atomic appends to the end of the log and flush out the
>journal header lazily.

Thanks; I get it now.

>I believe zeroes at the end of the log are deliberate, as we "pre-zero" to
>avoid some rare edge cases when MDSes restart and the log might have had
>writes to later objects complete successfully while earlier ones were
>blocked. If your MDS is not restarting it is probably because of the
>non-zero data.

I don't think my zeroes are pre-zeroing, as they actually occur in the middle of the final short object. As they start at what the log header says is the write point, my guess is that the MDS thought it flushed some stuff, so advanced the flush pointer, but in reality the write never happened.

This failure to restart happened after the MDS crashed, and I lost any messages that would tell me why it crashed. I'll fix that and turn up verbosity and if it happens again, I'll have a better idea how the zeroes got there.

-- Bryan Henderson San Jose, California
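The "header is only a hint" idea explained in the quoted reply can be modelled with a toy sketch. This is not Ceph's probing code: it assumes a simplified layout in which the log is a plain sequence of fixed-size objects, and every name in it is hypothetical. It only shows why the physical end of the log can lie beyond the write position recorded in a lazily flushed header.

```python
def probe_log_end(header_write_pos, object_sizes, object_size):
    """Toy model: find the physical end of a log at or past the header hint.

    header_write_pos: write position stored in the (possibly stale) header.
    object_sizes:     dict mapping object index -> bytes physically present.
    object_size:      fixed capacity of each log object in this toy layout.
    """
    index = header_write_pos // object_size   # object holding the hint
    end = header_write_pos
    # Walk forward over every object that physically holds data.
    while object_sizes.get(index, 0) > 0:
        end = max(end, index * object_size + object_sizes[index])
        index += 1
    return end


# The header says writing stopped at byte 6, but appends reached byte 11
# before the header was next flushed.
print(probe_log_end(6, {0: 4, 1: 4, 2: 3}, object_size=4))
```

In this model nothing checks that the bytes past the hint are valid log entries, which is the concern raised in the message above.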
[ceph-users] Why does Ceph probe for end of MDS log?
I've been reading MDS log code, and I have a question: why does it "probe for the end of the log" after reading the log header when starting up?

As I understand it, the log header says the log had been written up to Location X ("write_pos") the last time the log was committed, but the end-probe code determines whether there is stuff physically in the log (based on Rados object size) beyond X and if so, ignores the header and uses the physical end of the log instead.

Wouldn't stuff after where the header says writing left off be unreliable? Maybe incompletely or incorrectly written?

I'm looking at this because I have an MDS that will not start because there is junk (zeroes) in that space after where the log header says the log ends, so replay of the log fails there.

-- Bryan Henderson San Jose, California
Re: [ceph-users] Fwd: down+peering PGs, can I move PGs from one OSD to another
>You can export and import PG's using ceph_objectstore_tool, but if the osd
>won't start you may have trouble exporting a PG.

I believe the very purpose of ceph-objectstore-tool is to manipulate OSDs while they aren't running.

If the crush map says these PGs that are on the broken OSD belong on another OSD (which I guess it ought to, since the OSD is out), ceph-objectstore-tool is what you would use to move them over there manually, since ordinary peering can't do it.

-- Bryan Henderson San Jose, California
Re: [ceph-users] Cephfs kernel driver availability
>Kernel 3.16 is not *the* LTS kernel but *an* LTS kernel. The current LTS
>kernel is 4.14

Thanks for clarifying that. I guess I forgot how long I've been trying to get Ceph to work. When I started, 3.16 was the current LTS kernel!

Had I known that it's so stable that serious bugs are left in it, I would not have given so much preference to using stable code.

I think I'll have a look at the Git history and see how practical it would be for me to proactively backport all the bug fixes to my local kernel.

FUSE really isn't an option for me because Ceph is the root filesystem for these clients.

-- Bryan Henderson San Jose, California
[ceph-users] Cephfs kernel driver availability
Is there some better place to get a filesystem driver for the longterm stable Linux kernel (3.16) than the regular kernel.org source distribution?

The reason I ask is that I have been trying to get some clients running Linux kernel 3.16 (the current long term stable Linux kernel) and so far I have run into two serious bugs that, it turns out, were found and fixed years ago in more current mainline kernels.

In both cases, I emailed Ben Hutchings, the apparent maintainer of 3.16, asking if the fixes could be added to 3.16, but was met with silence. This leads me to believe that there are many more bugs in the 3.16 cephfs filesystem driver waiting for me. Indeed, I've seen panics not yet explained.

So what are other people using? A less stable kernel? An out-of-tree driver? FUSE? Is there a working process for getting known bugs fixed in 3.16?

-- Bryan Henderson San Jose, California
Re: [ceph-users] Data recovery after loosing all monitors
> Kill all mds first , create new fs with old pools , then run ‘fs reset’
> before start any MDS.

Brilliant! I can't wait to try it. Thanks.

-- Bryan Henderson San Jose, California
Re: [ceph-users] Data recovery after loosing all monitors
>Luckily; it's not. I don't remember if the MDS maps contain entirely
>ephemeral data, but on the scale of cephfs recovery scenarios that's just
>about the easiest one. Somebody would have to walk through it; you probably
>need to look up the table states and mds counts from the RADOS store and
>generate a new (epoch 1 or 2) mdsmap which contains those settings ready to
>go. Or maybe you just need to "create" a new cephfs on the prior pools and
>set it up with the correct number of MDSes.
>
>At the moment the mostly-documented recovery procedure probably involves
>recovering the journals, flushing everything out, and resetting the server
>state to a single MDS, and if you lose all your monitors there's a good
>chance you need to be going through recovery anyway, so...*shrug*

The idea of just creating a new filesystem from old metadata and data pools intrigued me, so I looked into it further, including reading some code.

It appears that there's nothing in the MDS map that can't be regenerated, and while it's probably easy for a Ceph developer to do that, there aren't tools available that can.

'fs new' comes close, but according to

  http://docs.ceph.com/docs/master/cephfs/disaster-recovery/

it causes a new empty root directory to be created, so you lose access to all your files (and leak all the storage space they occupy).

The same document mentions 'fs reset', which also comes close and keeps the existing root directory, but it requires, perhaps gratuitously, that a filesystem already exist in the MDS map, albeit maybe corrupted, before it regenerates it.

I'm tempted to modify Ceph to try to add a 'fs recreate' that does what 'fs reset' does, but without expecting anything to be there already. Maybe that's all it takes along with 'ceph-objectstore-tool --op update-mon-db' to recover from a lost cluster map.

-- Bryan Henderson San Jose, California
[ceph-users] Data recovery after loosing all monitors
>> Suppose I lost all monitors in a ceph cluster in my laboratory. I have
>> all OSDs intact. Is it possible to recover something from Ceph?
>
>Yes, there is. Using ceph-objectstore-tool you are able to rebuild the
>MON database.
>
>BUT, this isn't something you would really want to do as you lose your
>cephx keys and such and getting them all back will be a total nightmare.

According to the section of the manual on this, TROUBLESHOOTING MONITORS -> RECOVERY USING OSDS, another thing that you lose when you use ceph-objectstore-tool --op update-mon-db to recover a lost monitor database is the MDS maps.

That seems like a pretty casual way of saying if your monitor database gets corrupted, you can kiss your entire cephfs filesystem goodbye. Is that what it means? Is there a way to recover the MDS maps or otherwise gain access to all the files once you've recovered access to the OSDs?

-- Bryan Henderson San Jose, California
Re: [ceph-users] Intepreting reason for blocked request
>>> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>>>   cluster [WRN] slow request 961.557151 seconds old,
>>>   received at 2018-05-03 01:40:33.689191:
>>>   pg_query(4.f epoch 490) currently wait for new map
>>>
>
>The OSD is waiting for a new OSD map, which it will get from one of its
>peers or the monitor (by request). This tends to happen if the client sees
>a newer version than the OSD does.

Hmmm. So the client gets the current OSD map from the Monitor and then indicates in its request to the OSD what map epoch it is using? And if the OSD has an older map, it requests a new one from another OSD or Monitor before proceeding? And I suppose if the current epoch is still older than what the client said, the OSD keeps trying until it gets the epoch the client stated.

If that's so, this situation could happen if for some reason the client got the idea that there's a newer map than what there really is.

What I'm looking at is probably just a Ceph bug, because this small test cluster got into this state immediately upon startup, before any client had connected (I assume these blocked requests are from inside the cluster), and the requests aren't just blocked for a long time; they're blocked indefinitely. The only time I've seen it is when I brought the cluster up in a different order than I usually do.

So I'm just trying to understand the inner workings in case I need to debug it if it keeps happening.

-- Bryan Henderson San Jose, California
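The behaviour guessed at in the message above can be written down as a toy model. It is a sketch of the guess only, not of Ceph's actual code: the function names, the bounded retry count, and the fetch callback are all hypothetical. It shows how a request stamped with an epoch newer than any map that really exists would wait forever.

```python
def handle_request(osd_epoch, request_epoch, fetch_newer_map, max_attempts=5):
    """Toy model: process a request only once the OSD's map is new enough.

    osd_epoch:       map epoch the OSD currently holds.
    request_epoch:   map epoch the sender stamped on the request.
    fetch_newer_map: callable taking the current epoch and returning the
                     newest epoch obtainable from a peer or monitor.
    """
    attempts = 0
    while osd_epoch < request_epoch and attempts < max_attempts:
        osd_epoch = max(osd_epoch, fetch_newer_map(osd_epoch))
        attempts += 1
    state = "process" if osd_epoch >= request_epoch else "wait for new map"
    return state, osd_epoch


def newest_is_490(_current_epoch):
    """Pretend the newest map anyone can supply is epoch 490."""
    return 490


print(handle_request(488, 490, newest_is_490))  # catches up, then proceeds
print(handle_request(490, 491, newest_is_490))  # epoch 491 never arrives
```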
[ceph-users] Intepreting reason for blocked request
I recently had some requests blocked indefinitely; I eventually cleared it up by recycling the OSDs, but I'd like some help interpreting the log messages that supposedly give a clue as to what caused the blockage:

(I reformatted for easy email reading)

2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
  cluster [WRN] 7 slow requests, 2 included below;
  oldest blocked for > 961.596517 secs

2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
  cluster [WRN] slow request 961.557151 seconds old,
  received at 2018-05-03 01:40:33.689191:
  pg_query(4.f epoch 490) currently wait for new map

2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
  cluster [WRN] slow request 961.556655 seconds old,
  received at 2018-05-03 01:40:33.689686:
  pg_query(1.d epoch 490) currently wait for new map

2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
  cluster [WRN] 2 slow requests, 2 included below;
  oldest blocked for > 960.677480 secs

2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
  cluster [WRN] slow request 960.677480 seconds old,
  received at 2018-05-03 01:40:31.238642:
  osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893 RETRY=2
  ack+retry+read+known_if_redirected e490) currently reached_pg

2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
  cluster [WRN] slow request 960.663817 seconds old,
  received at 2018-05-03 01:40:31.252305:
  osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad RETRY=2
  ack+retry+read+known_if_redirected e490) currently reached_pg

"wait for new map": what map would that be, and where is the OSD expecting it to come from?

"reached_pg"?

You see two OSDs: osd.0 and osd.1. They're basically set up as a mirrored pair.

Thanks.
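When many such messages pile up, pulling the age, the operation, and the "currently ..." state out of each one can help with sorting them. The sketch below assumes each message is on a single line in the raw cluster log (before the reformatting done for this email) and that it follows the wording shown above; the regular expression and function name are illustrative only.

```python
import re

# Matches the "slow request ... currently <state>" wording shown above.
SLOW_REQUEST_RE = re.compile(
    r"slow request (?P<age>[\d.]+) seconds old, "
    r"received at (?P<received>\S+ \S+): "
    r"(?P<op>.*) currently (?P<state>.+)$"
)


def parse_slow_request(line):
    """Return a dict of fields from one slow-request line, or None."""
    match = SLOW_REQUEST_RE.search(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["age"] = float(fields["age"])
    return fields


SAMPLE = (
    "2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 : "
    "cluster [WRN] slow request 961.557151 seconds old, "
    "received at 2018-05-03 01:40:33.689191: "
    "pg_query(4.f epoch 490) currently wait for new map"
)

print(parse_slow_request(SAMPLE))
```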
[ceph-users] stale status from monitor?
My cluster got stuck somehow, and at one point in trying to recycle things to unstick it, I ended up shutting down everything, then bringing up just the monitors. At that point, the cluster reported the status below.

With nothing but the monitors running, I don't see how the status can say there are two OSDs and an MDS up and requests are blocked. This was the status of the cluster when I previously shut down the monitors (which I probably shouldn't have done when there were still OSDs and MDSs up, but I did).

It stayed that way for about 20 minutes, and I finally brought up the OSDs and everything went back to normal.

So my question is: Is this normal and what has to happen for the status to be current?

    cluster 23352cdb-18fc-4efc-9d54-e72c000abfdb
     health HEALTH_WARN
            60 pgs peering
            60 pgs stuck inactive
            60 pgs stuck unclean
            4 requests are blocked > 32 sec
            mds cluster is degraded
            mds a is laggy
     monmap e3: 3 mons at {a=192.168.1.16:6789/0,b=192.168.1.23:6789/0,c=192.168.1.20:6789/0}
            election epoch 202, quorum 0,1,2 a,c,b
     mdsmap e315: 1/1/1 up {0=a=up:replay(laggy or crashed)}
     osdmap e495: 2 osds: 2 up, 2 in
      pgmap v33881: 160 pgs, 4 pools, 568 MB data, 14851 objects
            1430 MB used, 43704 MB / 45134 MB avail
                 100 active+clean
                  60 peering
[ceph-users] Shutting down: why OSDs first?
There is a lot of advice around on shutting down a Ceph cluster that says to shut down the OSDs before the monitors and bring up the monitors before the OSDs, but no one explains why.

I would have thought it would be better to shut down the monitors first and bring them up last, so they don't have to witness all the interim states with OSDs down. And it should make the noout, nodown, etc. settings unnecessary.

So what am I missing?

Also, how much difference does it really make? Ceph is obviously designed to tolerate any sequence of failures and recoveries of nodes, so how much risk would I be taking if I just haphazardly killed everything instead of orchestrating a shutdown?

-- Bryan Henderson San Jose, California
[ceph-users] Why keep old epochs?
Some questions about maps and epochs:

I see that I can control the minimum number of osdmap epochs to keep with "mon min osdmap epoch". Why do I care? Why would I want any but the current osdmap, and why would the system keep more than my minimum?

Similarly, "mon max pgmap epoch" controls the _maximum_ number of pgmap epochs to keep around. I believe I need more than the most recent pgmap because I need to keep previous ones until all PGs that were placed according to that pgmap have migrated to where the current pgmap says they should be. But do I need more epochs than that, and what happens if the maximum I set is too low to cover those necessary old pgmaps?

-- Bryan Henderson San Jose, California
[ceph-users] What goes in the monitor database?
Hi. Can anyone give me a rough idea of what the monitor database is for?

I'm curious about how it is behaving in an experimental system I set up. I have a single monitor and a single client that just connects once a second and does a "status" command. There is one OSD and one MDS in there too. This is a Hammer system with a LevelDB key-value store.

This produces a fair amount of activity in the database; it looks like about 25K of updates for every "status" transaction. The database compacts periodically and over the long run, does not grow in size. Using ceph_kvstore_tool after shutting down the monitor, I see hundreds of keys.

So what does the monitor have to store to do a "status" command?

I've seen clues that the activity has to do with Paxos elections, but I'm fuzzy on why elections would be happening or why they would need a persistent database.

-- Bryan Henderson San Jose, California
[ceph-users] Ceph program memory usage
A few months ago, I posted here asking why the Ceph program takes so much memory (virtual, real, and address space) for what seems to be a simple task. Nobody knew, but I have done extensive research and I have the answer now, and thought I would publish it here. All it takes to do a Ceph "status" command is to create a TCP connection to the monitor, do a small login handshake, send a JSON document that says "status command", and receive and print the text response. This could be done in 64K with maybe a few megabytes of additional address space for the shared C library. If you do it with a 'ceph status' command, though, in the Hammer release it has a 700M peak address space usage (though it varies a lot from one run to the next) and uses 60M of real memory. The reason for this is that the Ceph program uses facilities in the librados library that are meant for much more than just performing a command. These facilities are meant to be used by a full-blown server that is a Ceph client. The facilities deal with locating a monitor within a cluster and failing over when that monitor dies; they interpret the Ceph configuration file and adjust dynamically when that file changes; they do logging; and more. When you type 'ceph status', you are building a sophisticated command-issuing machine, having it issue one command, and then tearing it down. 'ceph' creates about 20 threads. They are asynchronous enough that in some runs, multiple threads exist at the same time and in other ones, they exist serially. This is why peak memory usage varies from one run to the next. In its quiescent state, ready to perform a command, the program has 13 threads standing by for various purposes. Each of these has 8M of virtual memory reserved for its stack and most have 64M for a heap. Finally, there is a lock auditor facility ("lockdep") that watches for locks being acquired out of order, as evidence of a bug in the code. This facility is not optional; it is always there. 
To keep track of all the locking, it sets up a 2000x2000 array (2000 being an upper limit on the number of locks the program might contain). That's 32M of real memory. I read in a forum that this has been greatly reduced in later releases. I was able to reduce the usage to 130M address space and 9M of real memory, while still using most of the same librados code to do the work, by creating a stripped-down version of the librados 'MonClient' class, setting the maximum thread stack size to 1M with rlimits, and making the threads share a heap with the MALLOC_ARENA_MAX environment variable. I also disabled lockdep. I just thought this might be interesting to someone searching the archives for memory usage information. -- Bryan Henderson San Jose, California
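For anyone who wants to try the stack-size and shared-heap part of this without patching anything, here is a minimal Python sketch. The helper name run_constrained is made up for illustration. It assumes Linux with glibc, where the soft RLIMIT_STACK value at program start is used as the default stack size for new threads and MALLOC_ARENA_MAX caps the number of malloc arenas. It does nothing about lockdep or MonClient.

```python
import os
import resource
import subprocess

MIB = 1024 * 1024


def run_constrained(argv, stack_bytes=1 * MIB, arena_max=1):
    """Run argv with a smaller stack rlimit and a capped malloc arena count.

    Hypothetical helper, not part of Ceph. Assumes Linux with glibc.
    """
    # MALLOC_ARENA_MAX=1 makes all threads allocate from one shared heap.
    env = dict(os.environ, MALLOC_ARENA_MAX=str(arena_max))

    def lower_stack_limit():
        # Runs in the child process after fork and before exec.
        _, hard = resource.getrlimit(resource.RLIMIT_STACK)
        if hard == resource.RLIM_INFINITY:
            soft = stack_bytes
        else:
            soft = min(stack_bytes, hard)
        resource.setrlimit(resource.RLIMIT_STACK, (soft, hard))

    return subprocess.run(
        argv,
        env=env,
        preexec_fn=lower_stack_limit,
        capture_output=True,
        text=True,
    )
```

Calling run_constrained(["ceph", "status"]) would run the stock command under those two settings. How close that gets to the figures above probably depends on the release, since the stripped-down MonClient and the lockdep change are not covered here.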
Re: [ceph-users] ceph program uses lots of memory
I did some investigation and tracked the high usage down to librados. I don't think Python has anything to do with it. I also noticed that the memory usage was really unpredictable. Sometimes I could do a whole 'ceph -s' with only 256M; most of the time I couldn't, but the program crashed at various points along the way. I was going to instrument librados and try to track it further, but I found that Ceph is too complex and resource-consuming for me to build. I wonder if there is a way to build just librados without downloading and building 3 GiB of source code. I hadn't thought before about starting the 'ceph' shell and looking at the process as it's waiting for a command, but I just did, and see the virtual memory size does vary a lot from one invocation to the next. Strange. Makes one think there's some kind of race or use of an unset variable. So I looked at the memory map (/proc/PID/maps) and see in one run (where I got lucky and it fit in my 256M limit) 165 vmareas occupying 226 MiB (compared to 49 and 25 MiB for a Python shell). I'll look closer and see if there are some particularly large ones and what varies from one invocation to the next. >Is there a reason you're worried about the address space but not the actual RAM used? Yes. The way I prevent programs from destroying my system with excessive real memory usage or paging, either by accident or by my ignorance, is by running with address space rlimits. It's the best I can do; there is no real memory or paging rate rlimit. As it stands, any normal shell on my systems has an address space limit of 256M, which has never been a problem before, but is majorly inconvenient now. -- Bryan Henderson San Jose, California
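The vmarea count and total above can be reproduced from /proc/PID/maps with something like the following sketch. The function name summarize_maps is just for illustration, and it assumes the usual Linux maps layout where the first field of each line is a hexadecimal start-end address range and the optional sixth field is the path.

```python
def summarize_maps(maps_text, top=5):
    """Return (area_count, total_bytes, largest_areas) for /proc/PID/maps text.

    largest_areas is a list of (size_in_bytes, name) tuples, biggest first.
    """
    areas = []
    for line in maps_text.splitlines():
        # Split into at most 6 fields so a path containing spaces stays whole.
        fields = line.split(None, 5)
        if not fields:
            continue
        start_hex, end_hex = fields[0].split("-")
        size = int(end_hex, 16) - int(start_hex, 16)
        name = fields[5] if len(fields) > 5 else "[anon]"
        areas.append((size, name))
    total = sum(size for size, _ in areas)
    return len(areas), total, sorted(areas, reverse=True)[:top]


def summarize_pid(pid):
    """Summarize the memory map of a live process (needs read access to it)."""
    with open(f"/proc/{pid}/maps") as maps_file:
        return summarize_maps(maps_file.read())
```

Running summarize_pid on the waiting 'ceph' shell a few times and comparing the largest entries should show which areas change from one invocation to the next.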
[ceph-users] ceph program uses lots of memory
Does anyone know why the 'ceph' program uses so much memory? If I run it with an address space rlimit of less than 300M, it usually dies with messages about not being able to allocate memory. I'm curious as to what it could be doing that requires so much address space. It doesn't matter what specific command I'm doing and it does this even when there is no ceph cluster running, so it must be something pretty basic. -- Bryan Henderson San Jose, California
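To reproduce this kind of failure outside an rlimited shell, a sketch along these lines should do. The helper name run_with_address_space_limit is invented here, and the 256M default simply mirrors the shell limit mentioned in this thread. It assumes a Unix system where Python's resource module exposes RLIMIT_AS.

```python
import resource
import subprocess

MIB = 1024 * 1024


def run_with_address_space_limit(argv, limit_bytes=256 * MIB):
    """Run argv with its address space (RLIMIT_AS) capped at limit_bytes.

    Hypothetical helper for experimenting, not part of Ceph.
    """

    def apply_limit():
        # Runs in the child after fork and before exec; the parent is unaffected.
        _, hard = resource.getrlimit(resource.RLIMIT_AS)
        if hard == resource.RLIM_INFINITY:
            soft = limit_bytes
        else:
            soft = min(limit_bytes, hard)
        resource.setrlimit(resource.RLIMIT_AS, (soft, hard))

    return subprocess.run(
        argv, preexec_fn=apply_limit, capture_output=True, text=True
    )
```

Stepping limit_bytes up and down for run_with_address_space_limit(["ceph", "-s"], ...) and watching the return code and stderr is one way to find roughly where the program starts failing to allocate.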