Re: [ceph-users] How does monitor know OSD is dead?

2019-07-03 Thread Bryan Henderson
> I'm a bit confused about what happened here, though: that 600 second 
> interval is only important if *every* OSD in the system is down.  If you 
> reboot the data center, why didn't *any* OSD daemons start?  (And even if 
> none did, having the ceph -s report all OSDs down instead of up isn't 
> going to change anything except whether your pager is going off, right?)

I think you got lost in the thread of discussion.  Enough OSDs for the cluster
to be fully functional _did_ come back.  But the cluster insisted on going to
the dead ones (which it claimed all the while were up) for some I/O, even
after running for 20 minutes that way, so the cluster was not functional.  The
600 second "mon osd down out interval" was a red herring.

It might be relevant that there was a grand total of three OSDs in the map.
One came up; two did not.  All objects were replicated across all three, with
the hope that this sort of thing would not be fatal.  It's a Jewel system with
that version's default of 1 for "mon osd min down reporters".

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-02 Thread Bryan Henderson
Here's some counter-evidence to the proposition that it's uncommon for an
entire cluster to go down because of a power failure.

Every data-center-class hardware storage server product I know of has dual
power inputs and is also designed to tolerate losing power on both at once.  If
that happens, they don't lose data and when the power comes back, they come
back up all by themselves and start serving storage again.

This design usually involves an expensive battery and maintenance procedure to
make sure the battery gets replaced before it wears out (the battery is to
keep the system up long enough to flush write buffers when the power fails),
so users must think total power loss is a serious enough threat to pay for
that.

I may need to modify the above, though, now that I know how Ceph works,
because I've seen storage server products that use Ceph inside.  However, I'll
bet the people who buy those are not aware that it's designed never to go down,
and that if something breaks while the system is coming up, a repair action
may be necessary before data is accessible again.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-07-01 Thread Bryan Henderson
> Normally in the case of a restart then somebody who used to have a
> connection to the OSD would still be running and flag it as dead. But
> if *all* the daemons in the cluster lose their soft state, that can't
> happen.

OK, thanks.  I guess that explains it.  But that's a pretty serious design
flaw, isn't it?  What I experienced is a pretty common failure mode: a power
outage caused the entire cluster to die simultaneously, then when power came
back, some OSDs didn't come back with it (the most common time for a server
to fail is at startup).

I wonder if I could close this gap with additional monitoring of my own.  I
could have a cluster bring-up protocol that detects OSD processes that aren't
running after a while and marks those OSDs down.  It would be cleaner, though,
if I could just find out from the monitor what OSDs are in the map but not
connected to the monitor cluster.  Is that possible?
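
For concreteness, here is a rough, untested sketch of the kind of check I
have in mind, written against the librados C API.  Note that it only dumps
what the osdmap claims (the up/in flags), not actual monitor connections, and
the conf path and client id are placeholders:

/* Sketch, not tested: fetch the osdmap from the monitors and dump it as
 * JSON.  A bring-up check could parse the "osds" array (each entry carries
 * "osd", "up", and "in" flags), compare that against the OSD daemons
 * actually running, and issue "ceph osd down N" for the stragglers.
 * Build with: cc osdcheck.c -lrados
 */
#include <stdio.h>
#include <rados/librados.h>

int main(void)
{
    rados_t cluster;
    const char *cmd[1] =
        { "{\"prefix\": \"osd dump\", \"format\": \"json\"}" };
    char *outbuf = NULL, *outs = NULL;
    size_t outbuf_len = 0, outs_len = 0;
    int r;

    if (rados_create(&cluster, "admin") < 0)        /* client.admin assumed */
        return 1;
    rados_conf_read_file(cluster, "/etc/ceph/ceph.conf");  /* placeholder */
    if (rados_connect(cluster) < 0) {
        rados_shutdown(cluster);
        return 1;
    }

    r = rados_mon_command(cluster, cmd, 1, "", 0,
                          &outbuf, &outbuf_len, &outs, &outs_len);
    if (r == 0)
        fwrite(outbuf, 1, outbuf_len, stdout);      /* the osdmap as JSON */

    rados_buffer_free(outbuf);
    rados_buffer_free(outs);
    rados_shutdown(cluster);
    return r == 0 ? 0 : 1;
}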

A related question: If I mark an OSD down administratively, does it stay down
until I give a command to mark it back up, or will the monitor detect signs of
life and declare it up again on its own?

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Bryan Henderson
> I'm not sure why the monitor did not mark it _out_ after 600 seconds
> (default)

Well, that part I understand.  The monitor didn't mark the OSD out because the
monitor still considered the OSD up.  No reason to mark an up OSD out.

I think the monitor should have marked the OSD down upon not hearing from it
for 15 minutes ("mon osd report timeout"), then out 10 minutes after that
("mon osd down out interval").

And that's the worst case.  Though the details of how OSDs watch each other
are vague, I suspect a surviving OSD was supposed to detect the dead OSDs and
report that to the monitor, which would believe it within about a minute and
mark the OSDs down ("osd heartbeat interval", "mon osd min down reports",
"mon osd min down reporters", "mon osd reporter subtree level").

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] How does monitor know OSD is dead?

2019-06-29 Thread Bryan Henderson
> The reason it is so long is that you don't want to move data
> around unnecessarily if the osd is just being rebooted/restarted. 

I think you're confusing down with out.  When an OSD is out, Ceph
backfills.  While it is merely down, Ceph hopes that it will come back.
But it will direct I/O to other redundant OSDs instead of a down one.

Going down leads to going out, and I believe that is the 600 seconds you
mention - the time between when the OSD is marked down and when Ceph marks it
out (if all other conditions permit).

There is a pretty good explanation of how OSDs get marked down, which is
pretty complicated, at

  http://docs.ceph.com/docs/master/rados/configuration/mon-osd-interaction/

It just doesn't seem to match the implementation.

-- 
Bryan Henderson   San Jose, California


[ceph-users] How does monitor know OSD is dead?

2019-06-27 Thread Bryan Henderson
What does it take for a monitor to consider an OSD down which has been dead as
a doornail since the cluster started?

A couple of times, I have seen 'ceph status' report an OSD was up, when it was
quite dead.  Recently, a couple of OSDs were on machines that failed to boot
up after a power failure.  The rest of the Ceph cluster came up, though, and
reported all OSDs up and in.  I/Os stalled, probably because they were waiting
for the dead OSDs to come back.

I waited 15 minutes, because the manual says if the monitor doesn't hear a
heartbeat from an OSD in that long (default value of mon_osd_report_timeout),
it marks it down.  But it didn't.  I did "osd down" commands for the dead OSDs
and the status changed to down and I/O started working.

And wouldn't even 15 minutes of grace be unacceptable if it means I/Os have to
wait that long before falling back to a redundant OSD?

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] cephfs file block size: must it be so big?

2018-12-14 Thread Bryan Henderson
> I tested fread on Fedora 28. fread does 8k read on even block size is 4M.

So maybe I should be looking at changing my GNU Libc instead of my Ceph.

But I can't confirm that reading 8K regardless of blocksize is normal anywhere.

My test on Debian 9 (about 3 years old) with glibc 2.24 shows fread causes a
blocksize read.  Same on a system with Glibc 2.19.

I'm using the 'stat' program from Coreutils to see the blocksize and using
strace of a program that does a fopen and a 4-byte fread to see the read size.
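
The test program amounts to something like this (a sketch; run it under
strace to see how big the read(2) behind the fread is):

/* Sketch: stat a file, then read 4 bytes of it through stdio.  Under strace,
 * the read(2) that backs the fread() shows the buffer size glibc chose. */
#include <stdio.h>
#include <sys/stat.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "testfile";   /* placeholder */
    struct stat st;
    char buf[4];
    FILE *fp;
    size_t n;

    if (stat(path, &st) != 0 || (fp = fopen(path, "r")) == NULL)
        return 1;
    n = fread(buf, 1, sizeof buf, fp);
    printf("st_blksize = %ld, fread returned %zu bytes\n",
           (long) st.st_blksize, n);
    fclose(fp);
    return 0;
}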

Here is the code from Glibc 2.23 as well as current development where it
appears to be designed to use the blocksize if it has one:

size = _IO_BUFSIZ;
if (fp->_fileno >= 0 && __builtin_expect (_IO_SYSSTAT (fp, &st), 0) >= 0)
  {
    if (S_ISCHR (st.st_mode))
      {
        ...
      }
#if _IO_HAVE_ST_BLKSIZE
    if (st.st_blksize > 0)
      size = st.st_blksize;
#endif
  }
p = malloc (size);
...
_IO_setb (fp, p, p + size, 1);


_IO_BUFSIZ above is 8K, so I expect an 8K read if the stat fails or reports no
blocksize (st_blksize == 0).

The fread code in glibc reads the full size of the buffer allocated by the
above code, as recorded by that _IO_setb call.

[_IO_file_doallocate() in libio/filedoalloc.c]


> NFS reports 1M block size

Can't reproduce that one either.

In some NFS experiments of mine, the blocksize reported by 'stat' appears to
be controlled by the rsize and wsize mount options.  Without such options, in
the one case I tried, Linux 4.9, blocksize was 32K.  Maybe it's affected by
the server or by the filesystem the NFS server is serving.  This was NFS 3.

> This patch should address this issue [massive reads of e.g. /dev/urandom]:.

Thanks!

> mount option should work.

And thanks again.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] cephfs file block size: must it be so big?

2018-12-14 Thread Bryan Henderson
> Going back through the logs though it looks like the main reason we do a
> 4MiB block size is so that we have a chance of reporting actual cluster
> sizes to 32-bit systems,

I believe you're talking about a different block size (there are so many of
them).

The 'statvfs' system call (the essence of a 'df' command) can return its space
sizes in any units it wants, and tells you that unit.  The unit has variously
been called block size and fragment size.  In Cephfs, it is hardcoded as 4 MiB
so that 32 bit fields can represent large storage sizes.  I'm not aware that
anyone attempts to use that value for anything but interpreting statvfs
results.  Not saying they don't, though.

What I'm looking at, in contrast, is the block size returned by a 'stat'
system call on a particular file.  In Cephfs, it's the stripe unit size for
the file, which is an aspect of the file's layout.  In the default layout,
stripe unit size is 4 MiB.
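
To make the distinction concrete, here is a small sketch that prints both
numbers for a given file.  On a default Cephfs I would expect both to come
out as 4 MiB, but they come from different calls and serve different purposes:

/* Sketch: the unit statvfs() reports (what df uses to scale sizes) versus
 * the per-file st_blksize that stat() reports (what stdio uses to size its
 * buffer). */
#include <stdio.h>
#include <sys/stat.h>
#include <sys/statvfs.h>

int main(int argc, char **argv)
{
    const char *path = argc > 1 ? argv[1] : "/mnt/cephfs/file";  /* placeholder */
    struct stat st;
    struct statvfs vfs;

    if (stat(path, &st) != 0 || statvfs(path, &vfs) != 0)
        return 1;
    printf("statvfs f_frsize (df unit)   : %lu\n", (unsigned long) vfs.f_frsize);
    printf("stat st_blksize (stdio hint) : %lu\n", (unsigned long) st.st_blksize);
    return 0;
}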

-- 
Bryan Henderson   San Jose, California


[ceph-users] cephfs file block size: must it be so big?

2018-12-13 Thread Bryan Henderson
I've searched the ceph-users archives and found essentially no discussion of
Cephfs block sizes, and I wonder how much people have thought about it.

The POSIX 'stat' system call reports for each file a block size, which is
usually defined vaguely as the smallest read or write size that is efficient.
It usually takes into account that small writes may require a
read-modify-write and there may be a minimum size on reads from backing
storage.

One thing that uses this information is the stream I/O implementation
(fopen/fclose/fread/fwrite) in GNU libc.  It always reads and usually writes
full blocks, buffering as necessary.

Most filesystems report this number as 4K.

Ceph reports the stripe unit (stripe column size), which is the maximum size
of the RADOS objects that back the file.  This is 4M by default.

One result of this is that a program uses a thousand times more buffer space
when running against a Ceph file than against a traditional filesystem.

And a really pernicious result occurs when you have a special file in Cephfs.
Block size doesn't make any sense at all for special files, and it's probably
a bad idea to use stream I/O to read one, but I've seen it done.  The Chrony
clock synchronizer programs use fread to read random numbers from
/dev/urandom.  Should /dev/urandom be in a Cephfs filesystem, with defaults,
it's going to generate 4M of random bits to satisfy a 4-byte request.  On one
of my computers, that takes 7 seconds - and wipes out the entropy pool.
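
Where you control the source, one per-program workaround (a sketch, untested
on Cephfs) is to cap the stdio buffer with setvbuf() before the first read,
which overrides the st_blksize-derived size:

/* Sketch: force a 4 KiB stdio buffer regardless of what st_blksize says.
 * setvbuf() has to be called after fopen() but before any other operation
 * on the stream. */
#include <stdio.h>

int main(void)
{
    char buf[4];
    FILE *fp = fopen("/dev/urandom", "r");   /* the problem case above */

    if (fp == NULL)
        return 1;
    setvbuf(fp, NULL, _IOFBF, 4096);         /* 4 KiB buffer instead of 4 MiB */
    fread(buf, 1, sizeof buf, fp);           /* now at most a 4 KiB read */
    fclose(fp);
    return 0;
}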


Has stat block size been discussed much?  Is there a good reason that it's
the RADOS object size?

I'm thinking of modifying the cephfs filesystem driver to add a mount option
to specify a fixed block size to be reported for all files, and using 4K or
64K.  Would that break something?

-- 
Bryan Henderson   San Jose, California


[ceph-users] searching mailing list archives

2018-11-12 Thread Bryan Henderson
Is it possible to search the mailing list archives?

http://lists.ceph.com/pipermail/ceph-users-ceph.com/

seems to have a search function, but in my experience never finds anything.

-- 
Bryan Henderson   San Jose, California


[ceph-users] How to repair rstats mismatch

2018-11-08 Thread Bryan Henderson
How does one repair an rstats mismatch detected by 'scrub_path' (caused by a
previous failure to write the journal)?

And how bad is an rstats mismatch?  What are rstats used for?  One thing the
mismatch apparently does is make it impossible to delete the directory:
Cephfs says it isn't empty, while also giving an empty list of its contents.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Should OSD write error result in damaged filesystem?

2018-11-04 Thread Bryan Henderson
>OSD write errors are not usual events: any issues with the underlying
>storage are expected to be handled by RADOS, and write operations to
>an unhealthy cluster should block, rather than returning an error.  It
>would not be correct for CephFS to throw away metadata updates in the
>case of unexpected write errors -- this is a strongly consistent
>system, so when we can't make progress consistently (i.e. respecting
>all the ops we've seen in order), then we have to stop.

Thank you for that explanation; that all makes sense.  I have to get used to
the idea of responding to broken storage by waiting indefinitely until it
isn't broken.  I wasn't thinking in those terms.

>I'm guessing that you changed some related settings (like
>mds_log_segment_size) to get into this situation?  Otherwise, an error
>like this would definitely be a bug.

What I changed (from default) was osd_max_write_size.  I set it to its legal
minimum, 1M.  I've discovered that there are clients all around that expect to
be able to write 4M and don't respond nicely when they can't.  Rather than try
to find and change them all, I'm going to capitulate and go ahead and make
osd_max_write_size 4M.

Is manually tuning every client to be consistent with the OSD's maximum write
size really the only way to avoid crashes like this?  It sure would be nice if
an MDS could detect much earlier that the log is on an OSD that's incapable of
hosting that log.  But I found the filesystem driver is the same way - I have
to tell it how big a write it can do; it can't figure it out from the OSDs.
So maybe it's a fundamental architecture thing.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Should OSD write error result in damaged filesystem?

2018-11-03 Thread Bryan Henderson
I had a filesystem rank get damaged when the MDS had an error writing the log
to the OSD.  Is damage expected when a log write fails?

According to log messages, an OSD write failed because the MDS attempted
to write a bigger chunk than the OSD's maximum write size.  I can probably
figure out why that happened and fix it, but OSD write failures can happen for
lots of reasons, and I would have expected the MDS just to discard the recent
filesystem updates, issue a log message, and keep going.  The user had
presumably not been told those updates were committed.


And how do I repair this now?  Is this a job for

  cephfs-journal-tool event recover_dentries
  cephfs-journal-tool journal reset

?

This is Jewel.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] MDS does not always failover to hot standby

2018-09-07 Thread Bryan Henderson
dub-sitv-ceph-03 mon.0 10.18.53.32:6789/0 89 : 
cluster [INF] Health check cleared: MDS_ALL_DOWN (was: 1 filesystem is offline)



-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] MDS does not always failover to hot standby on reboot

2018-09-01 Thread Bryan Henderson
> If the active MDS is connected to a monitor and they fail at the same time,
> the monitors can't replace the mds until they've been through their own
> election and a full mds timeout window.

So how long are we talking?

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Why does Ceph probe for end of MDS log?

2018-08-26 Thread Bryan Henderson
>No, the log end in the header is a hint. This is because we can't
>atomically write to two objects (the header and the last log object) at the
>same time, so we do atomic appends to the end of the log and flush out the
>journal header lazily.

Thanks; I get it now.

>I believe zeroes at the end of the log are deliberate, as we "pre-zero" to
>avoid some rare edge cases when MDSes restart and the log might have had
>writes to later objects complete successfully while earlier ones were
>blocked. If your MDS is not restarting it is probably because of the
>non-zero data.

I don't think my zeroes are pre-zeroing, as they actually occur in the middle
of the final short object.  As they start at what the log header says is the
write point, my guess is that the MDS thought it flushed some stuff, so
advanced the flush pointer, but in reality the write never happened.

This failure to restart happened after the MDS crashed, and I lost any
messages that would tell me why it crashed.  I'll fix that and turn up
verbosity and if it happens again, I'll have a better idea how the zeroes got
there.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Why does Ceph probe for end of MDS log?

2018-08-23 Thread Bryan Henderson
I've been reading MDS log code, and I have a question: why does it "probe for
the end of the log" after reading the log header when starting up?

As I understand it, the log header says the log had been written up to
Location X ("write_pos") the last time the log was committed, but the
end-probe code determines whether there is stuff physically in the log (based
on Rados object size) beyond X and if so, ignores the header and uses the
physical end of the log instead.

Wouldn't stuff after where the header says writing left off be unreliable?
Maybe incompletely or incorrectly written?

I'm looking at this because I have an MDS that will not start because there
is junk (zeroes) in that space after where the log header says the log ends,
so replay of the log fails there.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Fwd: down+peering PGs, can I move PGs from one OSD to another

2018-08-04 Thread Bryan Henderson
>You can export and import PG's using ceph_objectstore_tool, but if the osd
>won't start you may have trouble exporting a PG.

I believe the very purpose of ceph-objectstore-tool is to manipulate OSDs
while they aren't running.

If the CRUSH map says these PGs that are on the broken OSD belong on another
OSD (which I guess it ought to, since the OSD is out), ceph-objectstore-tool is
what you would use to move them over there manually, since ordinary peering
can't do it.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Cephfs kernel driver availability

2018-07-22 Thread Bryan Henderson
>Kernel 3.16 is not *the* LTS kernel but *an* LTS kernel. The current LTS
>kernel is 4.14 

Thanks for clarifying that.  I guess I forgot how long I've been trying to
get Ceph to work.  When I started, 3.16 was the current LTS kernel!

Had I known that it's so stable that serious bugs are left in it, I would
not have given so much preference to using stable code.

I think I'll have a look at the Git history and see how practical it would
be for me to proactively backport all the bug fixes to my local kernel.

FUSE really isn't an option for me because Ceph is the root filesystem for
these clients.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Cephfs kernel driver availability

2018-07-22 Thread Bryan Henderson
Is there some better place to get a filesystem driver for the longterm
stable Linux kernel (3.16) than the regular kernel.org source distribution?

The reason I ask is that I have been trying to get some clients running
Linux kernel 3.16 (the current long term stable Linux kernel) and so far
I have run into two serious bugs that, it turns out, were found and fixed
years ago in more current mainline kernels.

In both cases, I emailed Ben Hutchings, the apparent maintainer of 3.16,
asking if the fixes could be added to 3.16, but was met with silence.  This
leads me to believe that there are many more bugs in the 3.16 cephfs
filesystem driver waiting for me.  Indeed, I've seen panics not yet explained.

So what are other people using?  A less stable kernel?  An out-of-tree driver?
FUSE?  Is there a working process for getting known bugs fixed in 3.16?

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Data recovery after losing all monitors

2018-06-02 Thread Bryan Henderson
> Kill all mds first , create new fs with old pools , then run ‘fs reset’
> before start any MDS.

Brilliant! I can't wait to try it.

Thanks.

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Data recovery after losing all monitors

2018-06-01 Thread Bryan Henderson
>Luckily; it's not. I don't remember if the MDS maps contain entirely
>ephemeral data, but on the scale of cephfs recovery scenarios that's just
>about the easiest one. Somebody would have to walk through it; you probably
>need to look up the table states and mds counts from the RADOS store and
>generate a new (epoch 1 or 2) mdsmap which contains those settings ready to
>go. Or maybe you just need to "create" a new cephfs on the prior pools and
>set it up with the correct number of MDSes.
>
>At the moment the mostly-documented recovery procedure probably involves
>recovering the journals, flushing everything out, and resetting the server
>state to a single MDS, and if you lose all your monitors there's a good
>chance you need to be going through recovery anyway, so...*shrug*

The idea of just creating a new filesystem from old metadata and data pools
intrigued me, so I looked into it further, including reading some code.

It appears that there's nothing in the MDS map that can't be regenerated, and
while it's probably easy for a Ceph developer to do that, there aren't tools
available that can.

'fs new' comes close, but according to

  http://docs.ceph.com/docs/master/cephfs/disaster-recovery/

it causes a new empty root directory to be created, so you lose access to all
your files (and leak all the storage space they occupy).

The same document mentions 'fs reset', which also comes close and keeps the
existing root directory, but it requires, perhaps gratuitously, that a
filesystem already exist in the MDS map, albeit maybe corrupted, before it
regenerates it.

I'm tempted to modify Ceph to try to add a 'fs recreate' that does what 'fs
reset' does, but without expecting anything to be there already.  Maybe that's
all it takes, along with 'ceph-objectstore-tool --op update-mon-db', to recover
from a lost cluster map.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Data recovery after losing all monitors

2018-05-26 Thread Bryan Henderson
>> Suppose I lost all monitors in a ceph cluster in my laboratory. I have
>> all OSDs intact. Is it possible to recover something from Ceph?
>
>Yes, there is. Using ceph-objectstore-tool you are able to rebuild the
>MON database.
>
>BUT, this isn't something you would really want to do as you lose your
>cephx keys and such and getting them all back will be a total nightmare.

According to the section of the manual on this, TROUBLESHOOTING MONITORS ->
RECOVERY USING OSDS, another thing that you lose when you use
ceph-objectstore-tool --op update-mon-db to recover a lost monitor database is
the MDS maps.

That seems like a pretty casual way of saying if your monitor database gets
corrupted, you can kiss your entire cephfs filesystem goodbye.  Is that what
it means?  Is there a way to recover the MDS maps or otherwise gain access to
all the files once you've recovered access to the OSDs?

-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] Intepreting reason for blocked request

2018-05-19 Thread Bryan Henderson
>>> 2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
>>>   cluster [WRN] slow request 961.557151 seconds old,
>>>   received at 2018-05-03 01:40:33.689191:
>>> pg_query(4.f epoch 490) currently wait for new map
>>>
>
>The OSD is waiting for a new OSD map, which it will get from one of its
>peers or the monitor (by request). This tends to happen if the client sees
>a newer version than the OSD does.

Hmmm.  So the client gets the current OSD map from the Monitor and then
indicates in its request to the OSD what map epoch it is using?  And if the
OSD has an older map, it requests a new one from another OSD or Monitor before
proceeding?  And I suppose if the current epoch is still older than what the
client said, the OSD keeps trying until it gets the epoch the client stated.

If that's so, this situation could happen if for some reason the client got
the idea that there's a newer map than what there really is.

What I'm looking at is probably just a Ceph bug, because this small test
cluster got into this state immediately upon startup, before any client had
connected (I assume these blocked requests are from inside the cluster), and
the requests aren't just blocked for a long time; they're blocked
indefinitely.  The only time I've seen it is when I brought the cluster up in
a different order than I usually do.  So I'm just trying to understand the
inner workings in case I need to debug it if it keeps happening.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Intepreting reason for blocked request

2018-05-12 Thread Bryan Henderson
I recently had some requests blocked indefinitely; I eventually cleared it
up by recycling the OSDs, but I'd like some help interpreting the log messages
that supposedly give clue as to what caused the blockage:

(I reformatted for easy email reading)

2018-05-03 01:56:35.248623 osd.0 192.168.1.16:6800/348 53 :
  cluster [WRN] 7 slow requests, 2 included below;
  oldest blocked for > 961.596517 secs
  
2018-05-03 01:56:35.249122 osd.0 192.168.1.16:6800/348 54 :
  cluster [WRN] slow request 961.557151 seconds old,
  received at 2018-05-03 01:40:33.689191:
pg_query(4.f epoch 490) currently wait for new map

2018-05-03 01:56:35.249543 osd.0 192.168.1.16:6800/348 55 :
  cluster [WRN] slow request 961.556655 seconds old,
  received at 2018-05-03 01:40:33.689686:
pg_query(1.d epoch 490) currently wait for new map

2018-05-03 01:56:31.918589 osd.1 192.168.1.23:6800/345 80 :
  cluster [WRN] 2 slow requests, 2 included below;
  oldest blocked for > 960.677480 secs

2018-05-03 01:56:31.920076 osd.1 192.168.1.23:6800/345 81 :
  cluster [WRN] slow request 960.677480 seconds old,
  received at 2018-05-03 01:40:31.238642:
osd_op(mds.0.57:1 mds0_inotable [read 0~0] 2.b852b893
  RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

2018-05-03 01:56:31.921526 osd.1 192.168.1.23:6800/345 82 :
  cluster [WRN] slow request 960.663817 seconds old,
  received at 2018-05-03 01:40:31.252305:
osd_op(mds.0.57:3 mds_snaptable [read 0~0] 2.d90270ad
  RETRY=2 ack+retry+read+known_if_redirected e490) currently reached_pg

"wait for new map": what map would that be, and where is the OSD expecting it
to come from?

"reached_pg"?

You see two OSDs: osd.0 and osd.1.  They're basically set up as a mirrored
pair.

Thanks.


[ceph-users] stale status from monitor?

2018-05-08 Thread Bryan Henderson
My cluster got stuck somehow, and at one point in trying to recycle things to
unstick it, I ended up shutting down everything, then bringing up just the
monitors.  At that point, the cluster reported the status below.

With nothing but the monitors running, I don't see how the status can say
there are two OSDs and an MDS up and requests are blocked.  This was the
status of the cluster when I previously shut down the monitors (which I
probably shouldn't have done when there were still OSDs and MDSs up, but I
did).

It stayed that way for about 20 minutes, and I finally brought up the OSDs and
everything went back to normal.

So my question is:  Is this normal and what has to happen for the status to be
current?

cluster 23352cdb-18fc-4efc-9d54-e72c000abfdb
 health HEALTH_WARN
60 pgs peering
60 pgs stuck inactive
60 pgs stuck unclean
4 requests are blocked > 32 sec
mds cluster is degraded
mds a is laggy
 monmap e3: 3 mons at 
{a=192.168.1.16:6789/0,b=192.168.1.23:6789/0,c=192.168.1.20:6789/0}
election epoch 202, quorum 0,1,2 a,c,b
 mdsmap e315: 1/1/1 up {0=a=up:replay(laggy or crashed)}
 osdmap e495: 2 osds: 2 up, 2 in
 pgmap v33881: 160 pgs, 4 pools, 568 MB data, 14851 objects
   1430 MB used, 43704 MB / 45134 MB avail
100 active+clean
 60 peering



[ceph-users] Shutting down: why OSDs first?

2018-05-07 Thread Bryan Henderson
There is a lot of advice around on shutting down a Ceph cluster that says
to shut down the OSDs before the monitors and bring up the monitors before
the OSDs, but no one explains why.

I would have thought it would be better to shut down the monitors first and
bring them up last, so they don't have to witness all the interim states with
OSDs down.  And it should make the noout, nodown, etc. settings unnecessary.

So what am I missing?

Also, how much difference does it really make?  Ceph is obviously designed to
tolerate any sequence of failures and recoveries of nodes, so how much risk
would I be taking if I just haphazardly killed everything instead of
orchestrating a shutdown?

-- 
Bryan Henderson   San Jose, California


[ceph-users] Why keep old epochs?

2017-11-14 Thread Bryan Henderson
Some questions about maps and epochs:

I see that I can control the minimum number of osdmap epochs to keep with
"mon min osdmap epoch".  Why do I care?  Why would I want any but the current
osdmap, and why would the system keep more than my minimum?

Similarly, "mon max pgmap epoch" controls the _maximum_ number of pgmap epochs
to keep around.  I believe I need more than the most recent pgmap because I
need to keep previous ones until all PGs that were placed according to that
pgmap have migrated to where the current pgmap says they should be.  But do I
need more epochs than that, and what happens if the maximum I set is too low
to cover those necessary old pgmaps?

-- 
Bryan Henderson   San Jose, California


[ceph-users] What goes in the monitor database?

2017-11-04 Thread Bryan Henderson
Hi.  Can anyone give me a rough idea of what the monitor database is for?  I'm
curious about how it is behaving in an experimental system I set up.

I have a single monitor and a single client that just connects once a second
and does a "status" command.  There is one OSD and one MDS in there too.  This
is a Hammer system with a LevelDB key-value store.  This produces a fair
amount of activity in the database; it looks like about 25K of updates for
every "status" transaction.  The database compacts periodically and over the
longrun, does not grow in size.

Using ceph_kvstore_tool after shutting down the monitor, I see hundreds of
keys.

So what does the monitor have to store to do a "status" command?

I've seen clues that the activity has to do with Paxos elections, but I'm
fuzzy on why elections would be happening or why they would need a persistent
database.

-- 
Bryan Henderson   San Jose, California


[ceph-users] Ceph program memory usage

2017-04-29 Thread Bryan Henderson
A few months ago, I posted here asking why the Ceph program takes so much
memory (virtual, real, and address space) for what seems to be a simple task.
Nobody knew, but I have done extensive research and I have the answer now, and
thought I would publish it here.

All it takes to do a Ceph "status" command is to create a TCP connection to
the monitor, do a small login handshake, send a JSON document that says
"status command", and receive and print the text response.  This could be done
in 64K with maybe a few megabytes of additional address space for the shared
C library.

If you do it with a 'ceph status' command, though, in the Hammer release it
has a 700M peak address space usage (though it varies a lot from one run to
the next) and uses 60M of real memory.

The reason for this is that the Ceph program uses facilities in the librados
library that are meant for much more than just performing a command.  These
facilities are meant to be used by a full-blown server that is a Ceph client.
The facilities deal with locating a monitor within a cluster and failing over
when that monitor dies; they interpret the Ceph configuration file and adjust
dynamically when that file changes; they do logging; and more.  When you type
'ceph status', you are building a sophisticated command-issuing machine,
having it issue one command, and then tearing it down.

'ceph' creates about 20 threads.  They are asynchronous enough that in some
runs, multiple threads exist at the same time and in other ones, they exist
serially.  This is why peak memory usage varies from one run to the next.  In
its quiescent state, ready to perform a command, the program has 13 threads
standing by for various purposes.  Each of these has 8M of virtual memory
reserved for its stack and most have 64M for a heap.

Finally, there is a lock auditor facility ("lockdep") that watches for locks
being acquired out of order, as evidence of a bug in the code.  This facility
is not optional; it is always there.  To keep track of all the locking, it
sets up a 2000x2000 array (2000 being an upper limit on the number of locks
the program might contain).  That's 32M of real memory.  I read in a forum
that this has been greatly reduced in later releases.


I was able to reduce the usage to 130M address space and 9M of real memory,
while still using most of the same librados code to do the work, by creating a
stripped-down version of the librados 'MonClient' class, setting the maximum
thread stack size to 1M with rlimits, and making the threads share a heap with
the MALLOC_ARENA_MAX environment variable.  I also disabled lockdep.
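
The rlimit and malloc-arena parts of that need no Ceph changes at all.  A
wrapper along these lines (a sketch) should get those two reductions for the
stock 'ceph' command as well, since glibc sizes new thread stacks from the
RLIMIT_STACK soft limit and MALLOC_ARENA_MAX limits the number of malloc
arenas:

/* Sketch: run "ceph status" with a 1 MiB default thread stack and a single
 * malloc arena. */
#include <stdlib.h>
#include <unistd.h>
#include <sys/resource.h>

int main(void)
{
    struct rlimit rl;

    if (getrlimit(RLIMIT_STACK, &rl) == 0) {
        rl.rlim_cur = 1024 * 1024;          /* soft limit -> default pthread stack */
        setrlimit(RLIMIT_STACK, &rl);
    }
    setenv("MALLOC_ARENA_MAX", "1", 1);     /* one heap arena shared by all threads */
    execlp("ceph", "ceph", "status", (char *) NULL);
    return 127;                             /* only reached if exec fails */
}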


I just thought this might be interesting to someone searching the archives for
memory usage information.


-- 
Bryan Henderson   San Jose, California


Re: [ceph-users] ceph program uses lots of memory

2017-01-03 Thread Bryan Henderson
I did some investigation and tracked the high usage down to librados.  I
don't think Python has anything to do with it.

I also noticed that the memory usage was really unpredictable.  Sometimes
I could do a whole 'ceph -s' with only 256M; most of the time I couldn't, and
the program crashed at various points along the way.

I was going to instrument librados and try to track it further, but I found
that Ceph is too complex and resource-consuming for me to build.  I wonder
if there is a way to build just librados without downloading and building
3 GiB of source code.

I hadn't thought before about starting the 'ceph' shell and looking at the
process as it's waiting for a command, but I just did, and see the virtual
memory size does vary a lot from one invocation to the next.  Strange.  Makes
one think there's some kind of race or use of an unset variable.

So I looked at the memory map (/proc/PID/maps) and see in one run (where I got
lucky and it fit in my 256M limit) 165 vmareas occupying 226 MiB (compared to
49 and 25 MiB for a Python shell).  I'll look closer and see if there are some
particularly large ones and what varies from one invocation to the next.

>Is there a reason you're worried about the address space but not the
>actual RAM used?

Yes.  The way I prevent programs from destroying my system with excessive real
memory usage or paging, either by accident or by my ignorance, is by running
with address space rlimits.  It's the best I can do; there is no real memory
or paging rate rlimit.  As it stands, any normal shell on my systems has an
address space limit of 256M, which has never been a problem before, but is
majorly inconvenient now.

-- 
Bryan Henderson   San Jose, California


[ceph-users] ceph program uses lots of memory

2016-12-29 Thread Bryan Henderson
Does anyone know why the 'ceph' program uses so much memory?  If I run it with
an address space rlimit of less than 300M, it usually dies with messages about
not being able to allocate memory.

I'm curious as to what it could be doing that requires so much address space.

It doesn't matter what specific command I'm doing and it does this even with
there is no ceph cluster running, so it must be something pretty basic.

-- 
Bryan Henderson   San Jose, California