Re: [ceph-users] Ceph Balancer Limitations

2019-09-13 Thread Adam Tygart
Thanks,

I moved back to crush-compat mapping, the pool that was at "90% full"
is now under 76% full.

Before doing that, I had the automatic balancer off, and ran 'ceph
balancer optimize test'. It ran for 12 hours before I killed it. In
upmap mode, it was "balanced" or at least as balanced as it could get.

crush-compat seems to be much more flexible, and more useful in my situation.
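For anyone following along, the switch itself boils down to something like this (a
sketch, not a transcript of my exact session; the pg ids come from whatever
'ceph osd dump' shows on your cluster):

ceph balancer off
ceph osd dump | grep pg_upmap_items     # list the upmap exceptions left over from upmap mode
ceph osd rm-pg-upmap-items <pgid>       # repeat for each pg listed above
ceph balancer mode crush-compat
ceph balancer on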

--
Adam

On Wed, Sep 11, 2019 at 10:04 PM Konstantin Shalygin  wrote:
>
> We're using Nautilus 14.2.2 (upgrading soon to 14.2.3) on 29 CentOS osd 
> servers.
>
> We've got a large variation of disk sizes and host densities, such
> that the default crush mappings lead to unbalanced data and pg
> distributions.
>
> We enabled the balancer manager module in pg upmap mode. The balancer
> commands frequently hang indefinitely when enabled and then queried.
> Even issuing a balancer off will hang for hours unless issued within
> about a minute of the manager restarting. I digress.
>
> In upmap mode, it looks like ceph only moves osd mappings within a
> host. Is this the case?
>
> I bring this up because we've got one disk that is sitting at 88%
> utilization and I've been unable to bring this down. The next most
> utilized disks are at 80%, and even then, I think that could be
> reduced.
>
> If the limitation is that upmap mode cannot map osds to different
> hosts, then that might be something to document, as it is a
> significant difference from crush-compat.
>
> Another thing to document would be how to move between the two modes.
>
> I think this is needed to move between crush-compat and upmap: ceph
> osd crush weight-set rm-compat
>
> I don't know about the reverse, though.
>
> ceph osd df tree [1]
> pg upmap items from the osdmap [2]
>
> [1] https://people.cs.ksu.edu/~mozes/ceph_balancer_query/ceph_osd_df_tree.txt
> [2] https://people.cs.ksu.edu/~mozes/ceph_balancer_query/pg_upmap_items.txt
>
> To remove upmaps you can execute `ceph osd rm-pg-upmap-items ${upmap}` from 
> your dump.
>
> Don't forget to turn the balancer "off" before that operation.
>
>
>
> k


[ceph-users] Ceph Balancer Limitations

2019-09-11 Thread Adam Tygart
Hello all,

We're using Nautilus 14.2.2 (upgrading soon to 14.2.3) on 29 CentOS osd servers.

We've got a large variation of disk sizes and host densities, such
that the default crush mappings lead to unbalanced data and pg
distributions.

We enabled the balancer manager module in pg upmap mode. The balancer
commands frequently hang indefinitely when enabled and then queried.
Even issuing a balancer off will hang for hours unless issued within
about a minute of the manager restarting. I digress.

In upmap mode, it looks like ceph only moves osd mappings within a
host. Is this the case?

I bring this up because we've got one disk that is sitting at 88%
utilization and I've been unable to bring this down. The next most
utilized disks are at 80%, and even then, I think that could be
reduced.

If the limitation is that upmap mode cannot map osds to different
hosts, then that might be something to document, as it is a
significant difference from crush-compat.

Another thing to document would be how to move between the two modes.

I think this is needed to move between crush-compat and upmap: ceph
osd crush weight-set rm-compat

I don't know about the reverse, though.

ceph osd df tree [1]
pg upmap items from the osdmap [2]

[1] https://people.cs.ksu.edu/~mozes/ceph_balancer_query/ceph_osd_df_tree.txt
[2] https://people.cs.ksu.edu/~mozes/ceph_balancer_query/pg_upmap_items.txt
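(For reference, [1] and [2] were generated with roughly the following; the exact grep
is an approximation of what I used:)

ceph osd df tree > ceph_osd_df_tree.txt
ceph osd dump | grep pg_upmap_items > pg_upmap_items.txt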

--
Adam


Re: [ceph-users] MDS corruption

2019-08-14 Thread Adam
I was able to get this resolved, thanks again to Pierre Dittes!

The reason the recovery did not work the first time I tried it was
because I still had the filesystem mounted (or at least attempted to
have it mounted).  This was causing sessions to be active.  After
rebooting all the machines which were attempting to mount cephfs, the
recovery steps worked.

For posterity, here are the exact commands:
cephfs-journal-tool --rank cephfs:all event recover_dentries summary
cephfs-journal-tool --rank cephfs:all journal reset
cephfs-table-tool all reset session
cephfs-table-tool all reset snap
cephfs-table-tool all reset inodes

After that the MDS came up properly and I was able to mount the
filesystem again.

Doing a backup of the journal reveals that there is still some
filesystem corruption, so I'm going to take Pierre's advice and copy all
the data off ceph, destroy and re-create the filesystem, and then put it
back.

The backup command and output showing there's still an error:
cephfs-journal-tool --rank cephfs:all journal export
ceph-mds-backup-`date '+%Y.%m.%d'`.bin
2019-08-14 11:10:24.237 7f7b2305b7c0 -1 Bad entry start ptr
(0x11440) at 0x11401956e
journal is 4630511616~103790
wrote 103790 bytes at offset 4630511616 to ceph-mds-backup-2019.08.14.bin
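For the copy-off-and-recreate step I'm expecting something roughly like the following
(untested sketch; the mount point, backup path, and pool names are placeholders for
whatever your filesystem was actually built on):

rsync -a /mnt/cephfs/ /path/to/backup/      # copy everything off while it still mounts
# stop or fail all MDS daemons, then:
ceph fs rm cephfs --yes-i-really-mean-it    # destroys the filesystem metadata!
ceph fs new cephfs cephfs_metadata_new cephfs_data_new   # ideally brand-new pools
# start the MDS daemons again, remount, and rsync the data back in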



On 8/13/19 8:33 AM, Yan, Zheng wrote:
> nautilus version (14.2.2) of ‘cephfs-data-scan scan_links’  can fix
> snaptable. hopefully it will fix your issue.
> 
> you don't need to upgrade whole cluster. Just install nautilus in a
> temp machine or compile ceph from source.
> 
> 
> 
> On Tue, Aug 13, 2019 at 2:35 PM Adam  wrote:
>>
>> Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
>> following steps from the disaster recovery page: journal export, dentry
>> recovery, journal truncation, mds table wipes (session, snap and inode),
>> scan_extents, scan_inodes, scan_links, and cleanup.
>>
>> Now all three of my MDS servers are crashing due to a failed assert.
>> Logs with stacktrace are included (the other two servers have the same
>> stacktrace in their logs).
>>
>> Currently I can't mount cephfs (which makes sense since there aren't any
>> MDS services up for more than a few minutes before they crash).  Any
>> suggestions on next steps to troubleshoot/fix this?
>>
>> Hopefully there's some way to recover from this and I don't have to tell
>> my users that I lost all the data and we need to go back to the backups.
>>  It shouldn't be a huge problem if we do, but it'll cost us a lot of
>> confidence in ceph and its ability to keep data safe.
>>
>> Thanks,
>> Adam
>>
>> On 8/8/19 3:31 PM, Adam wrote:
>>> I had a machine with insufficient memory and it seems to have corrupted
>>> data on my MDS.  The filesystem seems to be working fine, with the
>>> exception of accessing specific files.
>>>
>>> The ceph-mds logs include things like:
>>> mds.0.1596621 unhandled write error (2) No such file or directory, force
>>> readonly...
>>> dir 0x100fb03 object missing on disk; some files may be lost
>>> (/adam/programming/bash)
>>>
>>> I'm using mimic and trying to follow the instructions here:
>>> https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
>>>
>>> The punchline is this:
>>> cephfs-journal-tool --rank all journal export backup.bin
>>> Error ((22) Invalid argument)
>>> 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
>>>
>>> I have a backup (outside of ceph) of all data which is inaccessible and
>>> I can back up anything which is accessible if need be.  There's some more
>>> information below, but my main question is: what are my next steps?
>>>
>>> On a side note, I'd like to get involved with helping with documentation
>>> (man pages, the ceph website, usage text, etc). Where can I get started?
>>>
>>>
>>>
>>> Here's the context:
>>>
>>> cephfs-journal-tool event recover_dentries summary
>>> Error ((22) Invalid argument)
>>> 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
>>> argument
>>>
>>> Seems like a bug in the documentation since `--rank` is a "mandatory
>>> option" according to the help text.  It looks like the rank of this node
>>> for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
>>> `--rank all` doesn't work either:
>>>
>>> ceph health detail
>>> HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
>>> MDS_DAMAGE 1 MDSs report damaged metadata
>>> mdsge.hax0rbana.or

Re: [ceph-users] MDS corruption

2019-08-13 Thread Adam
Pierre Dittes helped me with adding --rank=yourfsname:all and I ran the
following steps from the disaster recovery page: journal export, dentry
recovery, journal truncation, mds table wipes (session, snap and inode),
scan_extents, scan_inodes, scan_links, and cleanup.

Now all three of my MDS servers are crashing due to a failed assert.
Logs with stacktrace are included (the other two servers have the same
stacktrace in their logs).

Currently I can't mount cephfs (which makes sense since there aren't any
MDS services up for more than a few minutes before they crash).  Any
suggestions on next steps to troubleshoot/fix this?

Hopefully there's some way to recover from this and I don't have to tell
my users that I lost all the data and we need to go back to the backups.
 It shouldn't be a huge problem if we do, but it'll cost us a lot of
confidence in ceph and its ability to keep data safe.

Thanks,
Adam

On 8/8/19 3:31 PM, ☣Adam wrote:
> I had a machine with insufficient memory and it seems to have corrupted
> data on my MDS.  The filesystem seems to be working fine, with the
> exception of accessing specific files.
> 
> The ceph-mds logs include things like:
> mds.0.1596621 unhandled write error (2) No such file or directory, force
> readonly...
> dir 0x100fb03 object missing on disk; some files may be lost
> (/adam/programming/bash)
> 
> I'm using mimic and trying to follow the instructions here:
> https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/
> 
> The punchline is this:
> cephfs-journal-tool --rank all journal export backup.bin
> Error ((22) Invalid argument)
> 2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.
> 
> I have a backup (outside of ceph) of all data which is inaccessible and
> I can back up anything which is accessible if need be.  There's some more
> information below, but my main question is: what are my next steps?
> 
> On a side note, I'd like to get involved with helping with documentation
> (man pages, the ceph website, usage text, etc). Where can I get started?
> 
> 
> 
> Here's the context:
> 
> cephfs-journal-tool event recover_dentries summary
> Error ((22) Invalid argument)
> 2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
> argument
> 
> Seems like a bug in the documentation since `--rank` is a "mandatory
> option" according to the help text.  It looks like the rank of this node
> for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
> `--rank all` doesn't work either:
> 
> ceph health detail
> HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
> MDS_DAMAGE 1 MDSs report damaged metadata
> mdsge.hax0rbana.org(mds.0): Metadata damage detected
> MDS_READ_ONLY 1 MDSs are read only
> mdsge.hax0rbana.org(mds.0): MDS in read-only mode
> 
> cephfs-journal-tool --rank 0 event recover_dentries summary
> Error ((22) Invalid argument)
> 2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.
> 
> 
> The only place I've found this error message is in an unanswered
> stackoverflow question and in the source code here:
> https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114
> 
> It looks like that is trying to read a filesystem map (fsmap), which
> might be corrupted.  Running `rados export` prints part of the help text
> and then segfaults, which is rather concerning.  This is 100% repeatable
> (outside of gdb, details below).  I tried `rados df` and that worked
> fine, so it's not all rados commands which are having this problem.
> However, I tried `rados bench 60 seq` and that also printed out the
> usage text and then segfaulted.
> 
> 
> 
> 
> 
> Info on the `rados export` crash:
> rados export
> usage: rados [options] [commands]
> POOL COMMANDS
> 
> IMPORT AND EXPORT
>export [filename]
>Serialize pool contents to a file or standard out.
> 
> OMAP OPTIONS:
> --omap-key-file fileread the omap key from a file
> *** Caught signal (Segmentation fault) **
>  in thread 7fcb6bfff700 thread_name:fn_anonymous
> 
> When running it in gdb:
> (gdb) bt
> #0  0x7fffef07331f in std::_Rb_tree<std::__cxx11::basic_string<char>,
> std::pair<std::__cxx11::basic_string<char> const,
> std::map<std::__cxx11::basic_string<char>,
> boost::variant<std::__cxx11::basic_string<char>, unsigned long, long, double,
> bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
> Option::size_t, uuid_d> > >, std::_Select1st<...>,
> std::less<std::__cxx11::basic_string<char> >, std::allocator<...>
> >::find(std::__cxx11::basic_string<char> const&) const () from
> /usr/lib/ceph/libceph-common.so.0

[ceph-users] MDS corruption

2019-08-08 Thread Adam
I had a machine with insufficient memory and it seems to have corrupted
data on my MDS.  The filesystem seems to be working fine, with the
exception of accessing specific files.

The ceph-mds logs include things like:
mds.0.1596621 unhandled write error (2) No such file or directory, force
readonly...
dir 0x100fb03 object missing on disk; some files may be lost
(/adam/programming/bash)

I'm using mimic and trying to follow the instructions here:
https://docs.ceph.com/docs/mimic/cephfs/disaster-recovery/

The punchline is this:
cephfs-journal-tool --rank all journal export backup.bin
Error ((22) Invalid argument)
2019-08-08 20:02:39.847 7f06827537c0 -1 main: Couldn't determine MDS rank.

I have a backup (outside of ceph) of all data which is inaccessible and
I can back up anything which is accessible if need be.  There's some more
information below, but my main question is: what are my next steps?

On a side note, I'd like to get involved with helping with documentation
(man pages, the ceph website, usage text, etc). Where can I get started?



Here's the context:

cephfs-journal-tool event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:50:04.798 7f21f4ffe7c0 -1 main: missing mandatory "--rank"
argument

Seems like a bug in the documentation since `--rank` is a "mandatory
option" according to the help text.  It looks like the rank of this node
for MDS is 0, based on `ceph health detail`, but using `--rank 0` or
`--rank all` doesn't work either:

ceph health detail
HEALTH_ERR 1 MDSs report damaged metadata; 1 MDSs are read only
MDS_DAMAGE 1 MDSs report damaged metadata
mdsge.hax0rbana.org(mds.0): Metadata damage detected
MDS_READ_ONLY 1 MDSs are read only
mdsge.hax0rbana.org(mds.0): MDS in read-only mode

cephfs-journal-tool --rank 0 event recover_dentries summary
Error ((22) Invalid argument)
2019-08-08 19:54:45.583 7f5b37c4c7c0 -1 main: Couldn't determine MDS rank.


The only place I've found this error message is in an unanswered
stackoverflow question and in the source code here:
https://github.com/ceph/ceph/blob/master/src/tools/cephfs/JournalTool.cc#L114

It looks like that is trying to read a filesystem map (fsmap), which
might be corrupted.  Running `rados export` prints part of the help text
and then segfaults, which is rather concerning.  This is 100% repeatable
(outside of gdb, details below).  I tried `rados df` and that worked
fine, so it's not all rados commands which are having this problem.
However, I tried `rados bench 60 seq` and that also printed out the
usage text and then segfaulted.





Info on the `rados export` crash:
rados export
usage: rados [options] [commands]
POOL COMMANDS

IMPORT AND EXPORT
   export [filename]
   Serialize pool contents to a file or standard out.

OMAP OPTIONS:
--omap-key-file fileread the omap key from a file
*** Caught signal (Segmentation fault) **
 in thread 7fcb6bfff700 thread_name:fn_anonymous

When running it in gdb:
(gdb) bt
#0  0x7fffef07331f in std::_Rb_tree<std::__cxx11::basic_string<char>,
std::pair<std::__cxx11::basic_string<char> const,
std::map<std::__cxx11::basic_string<char>,
boost::variant<std::__cxx11::basic_string<char>, unsigned long, long, double,
bool, entity_addr_t, std::chrono::duration<long, std::ratio<1l, 1l> >,
Option::size_t, uuid_d> > >, std::_Select1st<...>,
std::less<std::__cxx11::basic_string<char> >, std::allocator<...>
>::find(std::__cxx11::basic_string<char> const&) const () from
/usr/lib/ceph/libceph-common.so.0
Backtrace stopped: Cannot access memory at address 0x7fffd9ff89f8



Re: [ceph-users] Need to replace OSD. How do I find physical disk

2019-07-18 Thread Adam
The block device can be found in /var/lib/ceph/osd/ceph-$ID/block
# ls -l /var/lib/ceph/osd/ceph-9/block

In my case it links to /dev/sdbvg/sdb, which makes it pretty obvious
which drive this is, but the Volume Group and Logical Volume could be
named anything.  To see what physical disk(s) make up this volume group,
use lsblk (as Reed suggested)
# lsblk

If that drive needs to be located in a computer with many drives,
smartctl can be used to pull the make, model, and serial
number
# smartctl -i /dev/sdb


I was not aware of ceph-volume, or `ceph-disk list` (which is apparently
now deprecated in favor of ceph-volume), so thank you to everyone in this
thread for teaching me about alternative (arguably more proper) ways of
doing this. :-)
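For the archives, a few more ways to tie an OSD id back to the hardware (the osd id
and device below are just examples from this thread, adjust to taste):

ceph osd metadata 9 | grep -e devices -e dev_node   # ask the cluster which device osd.9 uses
ceph-volume lvm list                                # on the OSD host: LV -> physical device mapping
lsblk -o NAME,TYPE,SIZE,MODEL,SERIAL                # serials to match against the drive sleds
smartctl -i /dev/sdb                                # make/model/serial of one specific drive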

On 7/18/19 12:58 PM, Pelletier, Robert wrote:
> How do I find the physical disk in a Ceph luminous cluster in order to
> replace it. Osd.9 is down in my cluster which resides on ceph-osd1 host.
> 
>  
> 
> If I run lsblk -io KNAME,TYPE,SIZE,MODEL,SERIAL I can get the serial
> numbers of all the physical disks for example
> 
> sdb    disk  1.8T ST2000DM001-1CH1 Z1E5VLRG
> 
>  
> 
> But how do I find out which osd is mapped to sdb and so on?
> 
> When I run df -h I get this:
> 
> [root@ceph-osd1 ~]# df -h
> Filesystem   Size  Used Avail Use% Mounted on
> /dev/mapper/ceph--osd1-root   19G  1.9G   17G  10% /
> devtmpfs  48G 0   48G   0% /dev
> tmpfs 48G 0   48G   0% /dev/shm
> tmpfs 48G  9.3M   48G   1% /run
> tmpfs 48G 0   48G   0% /sys/fs/cgroup
> /dev/sda3    947M  232M  716M  25% /boot
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-2
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-5
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-0
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-8
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-7
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-33
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-10
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-1
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-38
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-4
> tmpfs 48G   24K   48G   1% /var/lib/ceph/osd/ceph-6
> tmpfs    9.5G 0  9.5G   0% /run/user/0
> 
>  
> 
>  
> 
> Robert Pelletier, IT and Security Specialist
> 
> Eastern Maine Community College
> (207) 974-4782 | 354 Hogan Rd., Bangor, ME 04401
> 
>  
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] pgs incomplete

2019-06-30 Thread Adam
The guide on migrating from filestore to bluestore was perfect.  I was
able to get that OSD back up and running quickly.  Thanks.

As for my PGs, I tried force-create-pg and it said it was working on it
for a while, and I saw some deep scrubs happening, but when they were
done it didn't help the incomplete problem.  However, the
ceph-objectstore-tool seems to be working.  For the people of the future
(which might well be me if I mess things up again), here's the command I
ran (from the node which hosts the OSD):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --pgid 2.0
--op mark-complete --no-mon-config
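One caveat for future-me: ceph-objectstore-tool needs the OSD daemon stopped while it
works on the store, so the surrounding steps are roughly (assuming the usual systemd
unit names):

systemctl stop ceph-osd@11
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-11 --pgid 2.0 \
    --op mark-complete --no-mon-config
systemctl start ceph-osd@11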

Thanks for your help Alfredo & Paul. :-)

--Adam

On 6/27/19 11:05 AM, Alfredo Deza wrote:
> 
> 
> On Thu, Jun 27, 2019 at 10:36 AM ☣Adam <a...@dc949.org> wrote:
> 
> Well that caused some excitement (either that or the small power
> disruption did)!  One of my OSDs is now down because it keeps crashing
> due to a failed assert (stacktraces attached, also I'm apparently
> running mimic, not luminous).
> 
> In the past a failed assert on an OSD has meant removing the disk,
> wiping it, re-adding it as a new one, and then having ceph rebuild it from
> other copies of the data.
> 
> I did this all manually in the past, but I'm trying to get more familiar
> with ceph's commands.  Will the following commands do the same?
> 
> ceph-volume lvm zap --destroy --osd-id 11
> # Presumably that has to be run from the node with OSD 11, not just
> # any ceph node?
> # Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap
> 
> 
> That looks correct, and yes, you would need to run on the node with OSD 11.
> 
> 
> 
> Do I need to remove the OSD (ceph osd out 11; wait for stabilization;
> ceph osd purge 11) before I do this and run and "ceph-deploy osd create"
> afterwards?
> 
> 
> I think that what you need is essentially the same as the guide for
> migrating from filestore to bluestore:
> 
> http://docs.ceph.com/docs/mimic/rados/operations/bluestore-migration/
> 
> 
> Thanks,
> Adam
> 
> 
> On 6/26/19 6:35 AM, Paul Emmerich wrote:
> > Have you tried: ceph osd force-create-pg ?
> >
> > If that doesn't work: use objectstore-tool on the OSD (while it's not
> > running) and use it to force mark the PG as complete. (Don't know the
> > exact command off the top of my head)
> >
> > Caution: these are obviously really dangerous commands
> >
> >
> >
> > Paul
> >
> >
> >
> > --
> > Paul Emmerich
> >
> > Looking for help with your Ceph cluster? Contact us at
> https://croit.io
> >
> > croit GmbH
> > Freseniusstr. 31h
> > 81247 München
> > www.croit.io
> > Tel: +49 89 1896585 90
> >
> >
> > On Wed, Jun 26, 2019 at 1:56 AM ☣Adam <a...@dc949.org> wrote:
> >
> >     How can I tell ceph to give up on "incomplete" PGs?
> >
> >     I have 12 pgs which are "inactive, incomplete" that won't
> recover.  I
> >     think this is because in the past I have carelessly pulled
> disks too
> >     quickly without letting the system recover.  I suspect the
> disks that
> >     have the data for these are long gone.
> >
> >     Whatever the reason, I want to fix it so I have a clean cluster
> even if
> >     that means losing data.
> >
> >     I went through the "troubleshooting pgs" guide[1] which is
> excellent,
> >     but didn't get me to a fix.
> >
> >     The output of `ceph pg 2.0 query` includes this:
> >         "recovery_state": [
> >             {
> >                 "name": "Started/Primary/Peering/Incomplete",
> >                 "enter_time": "2019-06-25 18:35:20.306634",
> >                 "comment": "not enough complete instances of this PG"
> >             },
> >
> >     I've already restarted all OSDs in various orders, and I
> changed min_size
> >     to 1 to see if that would allow them to get fixed, but no such
> luck.
> >     These pools are not erasure coded and I'm using the Luminous
> release.
> >
> >     How can I tell ceph to give up 

Re: [ceph-users] pgs incomplete

2019-06-27 Thread Adam
Well that caused some excitement (either that or the small power
disruption did)!  One of my OSDs is now down because it keeps crashing
due to a failed assert (stacktraces attached, also I'm apparently
running mimic, not luminous).

In the past a failed assert on an OSD has meant removing the disk,
wiping it, re-adding it as a new one, and then having ceph rebuild it from
other copies of the data.

I did this all manually in the past, but I'm trying to get more familiar
with ceph's commands.  Will the following commands do the same?

ceph-volume lvm zap --destroy --osd-id 11
# Presumably that has to be run from the node with OSD 11, not just
# any ceph node?
# Source: http://docs.ceph.com/docs/mimic/ceph-volume/lvm/zap

Do I need to remove the OSD (ceph osd out 11; wait for stabilization;
ceph osd purge 11) before I do this and run and "ceph-deploy osd create"
afterwards?
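i.e., I'm assuming the whole dance would be something like this (unverified sketch;
the device and hostname are placeholders):

ceph osd out 11                                  # wait for rebalancing to finish
ceph osd purge 11 --yes-i-really-mean-it
ceph-volume lvm zap --destroy --osd-id 11        # on the node that hosts osd.11
ceph-deploy osd create --data /dev/sdX <osd-host>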

Thanks,
Adam


On 6/26/19 6:35 AM, Paul Emmerich wrote:
> Have you tried: ceph osd force-create-pg ?
> 
> If that doesn't work: use objectstore-tool on the OSD (while it's not
> running) and use it to force mark the PG as complete. (Don't know the
> exact command off the top of my head)
> 
> Caution: these are obviously really dangerous commands
> 
> 
> 
> Paul
> 
> 
> 
> -- 
> Paul Emmerich
> 
> Looking for help with your Ceph cluster? Contact us at https://croit.io
> 
> croit GmbH
> Freseniusstr. 31h
> 81247 München
> www.croit.io
> Tel: +49 89 1896585 90
> 
> 
> On Wed, Jun 26, 2019 at 1:56 AM ☣Adam <a...@dc949.org> wrote:
> 
> How can I tell ceph to give up on "incomplete" PGs?
> 
> I have 12 pgs which are "inactive, incomplete" that won't recover.  I
> think this is because in the past I have carelessly pulled disks too
> quickly without letting the system recover.  I suspect the disks that
> have the data for these are long gone.
> 
> Whatever the reason, I want to fix it so I have a clean cluster even if
> that means losing data.
> 
> I went through the "troubleshooting pgs" guide[1] which is excellent,
> but didn't get me to a fix.
> 
> The output of `ceph pg 2.0 query` includes this:
>     "recovery_state": [
>         {
>             "name": "Started/Primary/Peering/Incomplete",
>             "enter_time": "2019-06-25 18:35:20.306634",
>             "comment": "not enough complete instances of this PG"
>         },
> 
> I've already restarted all OSDs in various orders, and I changed min_size
> to 1 to see if that would allow them to get fixed, but no such luck.
> These pools are not erasure coded and I'm using the Luminous release.
> 
> How can I tell ceph to give up on these PGs?  There's nothing identified
>     as unfound, so mark_unfound_lost doesn't help.  I feel like `ceph osd
> lost` might be it, but at this point the OSD numbers have been reused
> for new disks, so I'd really like to limit the damage to the 12 PGs
> which are incomplete if possible.
> 
> Thanks,
> Adam
> 
> [1]
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x14e) [0x7f7372987b5e]
 2: (()+0x2c4cb7) [0x7f7372987cb7]
 3: (PG::check_past_interval_bounds() const+0xae5) [0x564b8db12f05]
 4: (PG::RecoveryState::Reset::react(PG::AdvMap const&)+0x1bb) [0x564b8db43f5b]
 5: (boost::statechart::simple_state<...>::react_impl(boost::statechart::event_base 
const&, void const*)+0x200) [0x564b8db92430]
 6: (boost::statechart::state_machine<..., 
boost::statechart::null_exception_translator>::send_event(boost::statechart::event_base
 const&)+0x4b) [0x564b8db65a4b]
 7: (PG::handle_advance_map(std::shared_ptr, 
std::shared_ptr, std::vector >&, int, 
std::vector >&, int, PG::RecoveryCtx*)+0x213) 
[0x564b8db27ca3]
 8: (OSD::advance_pg(unsigned int, PG*, ThreadPool::TPHandle&, 
PG::RecoveryCtx*)+0x2b4) [0x564b8da92fa4]
 9: (OSD::dequeue_peering_evt(OSDShard*, PG*, std::shared_ptr, 
ThreadPool::TPHandle&)+0xb4) [0x564b8da93704]
 10: (PGPeeringItem::run(OSD*, OSDShard*, boost::intrusive_ptr&, 
ThreadPool::TPHandle&)+0x52) [0x564b8dcee862]
 11: (OSD::ShardedOpWQ::_process(unsigned int, 
ceph::heartbeat_handle_d*)+0x926) [0x564b8daa0c

[ceph-users] pgs incomplete

2019-06-25 Thread Adam
How can I tell ceph to give up on "incomplete" PGs?

I have 12 pgs which are "inactive, incomplete" that won't recover.  I
think this is because in the past I have carelessly pulled disks too
quickly without letting the system recover.  I suspect the disks that
have the data for these are long gone.

Whatever the reason, I want to fix it so I have a clean cluster even if
that means losing data.

I went through the "troubleshooting pgs" guide[1] which is excellent,
but didn't get me to a fix.

The output of `ceph pg 2.0 query` includes this:
"recovery_state": [
{
"name": "Started/Primary/Peering/Incomplete",
"enter_time": "2019-06-25 18:35:20.306634",
"comment": "not enough complete instances of this PG"
},

I've already restarted all OSDs in various orders, and I changed min_size
to 1 to see if that would allow them to get fixed, but no such luck.
These pools are not erasure coded and I'm using the Luminous release.

How can I tell ceph to give up on these PGs?  There's nothing identified
as unfound, so mark_unfound_lost doesn't help.  I feel like `ceph osd
lost` might be it, but at this point the OSD numbers have been reused
for new disks, so I'd really like to limit the damage to the 12 PGs
which are incomplete if possible.

Thanks,
Adam

[1]
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/


Re: [ceph-users] Monitor stuck at "probing"

2019-06-22 Thread Adam
Thanks, this got me back on track.  After a lot of trial and error, I
found that the problem was, in fact, an authentication issue.  All fixed
now. :-)

Now that I solved it, I figured out that "nuke the monitor's store" was
probably referring to /var/lib/ceph/mon/ceph-tc.  I was following the
directions in the deployment and operations manuals[1,2].  Now I know
how monitors work a whole lot better, so that's good.

The moral of the story here is: if you think you have an auth problem,
check the keyring on disk to see if it matches the other nodes.
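Concretely, the check that would have saved me a few days (default paths, and my mon
is named tc):

cat /var/lib/ceph/mon/ceph-tc/keyring    # what this mon actually has on disk
ceph auth get mon.                       # what the rest of the quorum expects

If those don't match, fix the on-disk keyring and restart ceph-mon.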

Thanks,
Adam

[1]
http://docs.ceph.com/docs/master/rados/deployment/ceph-deploy-mon/#remove-a-monitor
[2]
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-a-monitor-manual


On 6/20/19 11:50 AM, Gregory Farnum wrote:
> Just nuke the monitor's store, remove it from the existing quorum, and
> start over again. Injecting maps correctly is non-trivial and obviously
> something went wrong, and re-syncing a monitor is pretty cheap.
> 
> On Thu, Jun 20, 2019 at 6:46 AM ☣Adam <a...@dc949.org> wrote:
> 
> Anyone have any suggestions for how to troubleshoot this issue?
> 
> 
>  Forwarded Message 
> Subject: Monitor stuck at "probing"
> Date: Fri, 14 Jun 2019 21:40:39 -0500
> From: ☣Adam <a...@dc949.org>
> To: ceph-users@lists.ceph.com
> 
> I have a monitor which I just can't seem to get to join the quorum, even
> after injecting a monmap from one of the other servers.[1]  I use NTP on
> all servers and also manually verified the clocks are synchronized.
> 
> 
> My monitors are named: ceph0, ceph2, xe, and tc
> 
> I'm transitioning away from the ceph# naming scheme, so please forgive
> the confusing [lack of a] naming convention.
> 
> 
> The relevant output from: ceph -s
> 1/4 mons down, quorum ceph0,ceph2,xe
> mon: 4 daemons, quorum ceph0,ceph2,xe, out of quorum: tc
> 
> 
> tc is up, bound to the expected IP address, and the ceph-mon service can
> be reached from xe, ceph0 and ceph2 using telnet.  The mon_host and
> mon_initial_members from `ceph daemon mon.tc <http://mon.tc> config
> show` look correct.
> 
> mon_status on tc shows the state as "probing" and the list of
> "extra_probe_peers" looks correct (correct IP addresses, and ports).
> However the monmap section looks wrong.  The "mons" has all 4 servers,
> but the addr and public_addr values are 0.0.0.0:0.  Furthermore it says
> the monmap epoch is 4.  I don't understand why because I just injected a
> monmap which has an epoch of 7.
> 
> Here's the output of: monmaptool --print ./monmap
> monmaptool: monmap file ./monmap
> epoch 7
> fsid a690e404-3152-4804-a960-8b52abf3bd65
> last_changed 2019-06-02 17:38:50.161035
> created 2018-12-28 20:26:41.443339
> 0: 192.168.60.10:6789/0 mon.ceph0
> 1: 192.168.60.11:6789/0 mon.tc
> 2: 192.168.60.12:6789/0 mon.ceph2
> 3: 192.168.60.53:6789/0 mon.xe
> 
> When I injected it, I stopped ceph-mon, ran:
> sudo ceph-mon -i tc --inject-monmap ./monmap
> 
> and started ceph-mon again.  I then rebooted to see if it would fix this
> epoch/addr issue.  It did not.
> 
> I'm attaching what I believe is the relevant section of my log file from
> the tc monitor.  I ran `ceph auth list` on tc and ceph2 and verified
> that the output is identical.  This check was based on what I saw in the
> log and what I read in a blog post.[2]
> 
> What are the next steps in troubleshooting this issue?
> 
> 
> Thanks,
> Adam
> 
> 
> [1]
> http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/
> [2]
> 
> https://medium.com/@george.shuklin/silly-mistakes-with-ceph-mon-9ef6c9eaab54
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


Re: [ceph-users] Monitor stuck at "probing"

2019-06-20 Thread Adam
Anyone have any suggestions for how to troubleshoot this issue?


 Forwarded Message 
Subject: Monitor stuck at "probing"
Date: Fri, 14 Jun 2019 21:40:39 -0500
From: ☣Adam 
To: ceph-users@lists.ceph.com

I have a monitor which I just can't seem to get to join the quorum, even
after injecting a monmap from one of the other servers.[1]  I use NTP on
all servers and also manually verified the clocks are synchronized.


My monitors are named: ceph0, ceph2, xe, and tc

I'm transitioning away from the ceph# naming scheme, so please forgive
the confusing [lack of a] naming convention.


The relevant output from: ceph -s
1/4 mons down, quorum ceph0,ceph2,xe
mon: 4 daemons, quorum ceph0,ceph2,xe, out of quorum: tc


tc is up, bound to the expected IP address, and the ceph-mon service can
be reached from xe, ceph0 and ceph2 using telnet.  The mon_host and
mon_initial_members from `ceph daemon mon.tc config show` look correct.

mon_status on tc shows the state as "probing" and the list of
"extra_probe_peers" looks correct (correct IP addresses, and ports).
However the monmap section looks wrong.  The "mons" has all 4 servers,
but the addr and public_addr values are 0.0.0.0:0.  Furthermore it says
the monmap epoch is 4.  I don't understand why because I just injected a
monmap which has an epoch of 7.

Here's the output of: monmaptool --print ./monmap
monmaptool: monmap file ./monmap
epoch 7
fsid a690e404-3152-4804-a960-8b52abf3bd65
last_changed 2019-06-02 17:38:50.161035
created 2018-12-28 20:26:41.443339
0: 192.168.60.10:6789/0 mon.ceph0
1: 192.168.60.11:6789/0 mon.tc
2: 192.168.60.12:6789/0 mon.ceph2
3: 192.168.60.53:6789/0 mon.xe

When I injected it, I stopped ceph-mon, ran:
sudo ceph-mon -i tc --inject-monmap ./monmap

and started ceph-mon again.  I then rebooted to see if it would fix this
epoch/addr issue.  It did not.
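For reference, the full cycle I was following was roughly this (assuming systemd units
and default paths):

ceph mon getmap -o ./monmap              # grab the current map from a healthy mon
monmaptool --print ./monmap              # sanity-check it
systemctl stop ceph-mon@tc
ceph-mon -i tc --inject-monmap ./monmap
systemctl start ceph-mon@tc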

I'm attaching what I believe is the relevant section of my log file from
the tc monitor.  I ran `ceph auth list` on tc and ceph2 and verified
that the output is identical.  This check was based on what I saw in the
log and what I read in a blog post.[2]

What are the next steps in troubleshooting this issue?


Thanks,
Adam


[1]
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/
[2]
https://medium.com/@george.shuklin/silly-mistakes-with-ceph-mon-9ef6c9eaab54

2019-06-14 21:16:29.293 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.10:6789/0 conn(0x557135e29500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:31.213 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.10:6789/0 conn(0x557135bfd100 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:31.217 7fa2d7d97700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d4e000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:31.221 7fa2d7596700  0 -- 192.168.60.11:6789/0 >> 192.168.60.12:6789/0 conn(0x557135bfd800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:32.173 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:32.173 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.12:6789/0
2019-06-14 21:16:32.173 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.12:6789/0 conn(0x557135d85c00 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:42.121 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:42.121 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.53:6789/0
2019-06-14 21:16:42.121 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d85500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:42.121 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:42.121 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.53:6789/0
2019-06-14 21:16:42.121 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d85500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:44.293 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 

[ceph-users] Monitor stuck at "probing"

2019-06-14 Thread Adam
I have a monitor which I just can't seem to get to join the quorum, even
after injecting a monmap from one of the other servers.[1]  I use NTP on
all servers and also manually verified the clocks are synchronized.


My monitors are named: ceph0, ceph2, xe, and tc

I'm transitioning away from the ceph# naming scheme, so please forgive
the confusing [lack of a] naming convention.


The relevant output from: ceph -s
1/4 mons down, quorum ceph0,ceph2,xe
mon: 4 daemons, quorum ceph0,ceph2,xe, out of quorum: tc


tc is up, bound to the expected IP address, and the ceph-mon service can
be reached from xe, ceph0 and ceph2 using telnet.  The mon_host and
mon_initial_members from `ceph daemon mon.tc config show` look correct.

mon_status on tc shows the state as "probing" and the list of
"extra_probe_peers" looks correct (correct IP addresses, and ports).
However the monmap section looks wrong.  The "mons" has all 4 servers,
but the addr and public_addr values are 0.0.0.0:0.  Furthermore it says
the monmap epoch is 4.  I don't understand why because I just injected a
monmap which has an epoch of 7.

Here's the output of: monmaptool --print ./monmap
monmaptool: monmap file ./monmap
epoch 7
fsid a690e404-3152-4804-a960-8b52abf3bd65
last_changed 2019-06-02 17:38:50.161035
created 2018-12-28 20:26:41.443339
0: 192.168.60.10:6789/0 mon.ceph0
1: 192.168.60.11:6789/0 mon.tc
2: 192.168.60.12:6789/0 mon.ceph2
3: 192.168.60.53:6789/0 mon.xe

When I injected it, I stopped ceph-mon, ran:
sudo ceph-mon -i tc --inject-monmap ./monmap

and started ceph-mon again.  I then rebooted to see if it would fix this
epoch/addr issue.  It did not.

I'm attaching what I believe is the relevant section of my log file from
the tc monitor.  I ran `ceph auth list` on tc and ceph2 and verified
that the output is identical.  This check was based on what I saw in the
log and what I read in a blog post.[2]

What are the next steps in troubleshooting this issue?


Thanks,
Adam


[1]
http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-mon/
[2]
https://medium.com/@george.shuklin/silly-mistakes-with-ceph-mon-9ef6c9eaab54
2019-06-14 21:16:29.293 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.10:6789/0 conn(0x557135e29500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:31.213 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.10:6789/0 conn(0x557135bfd100 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:31.217 7fa2d7d97700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d4e000 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:31.221 7fa2d7596700  0 -- 192.168.60.11:6789/0 >> 192.168.60.12:6789/0 conn(0x557135bfd800 :-1 s=STATE_CONNECTING_WAIT_CONNECT_REPLY_AUTH pgs=0 cs=0 l=0).handle_connect_reply connect got BADAUTHORIZER
2019-06-14 21:16:32.173 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:32.173 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.12:6789/0
2019-06-14 21:16:32.173 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.12:6789/0 conn(0x557135d85c00 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:42.121 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:42.121 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.53:6789/0
2019-06-14 21:16:42.121 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d85500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:42.121 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:42.121 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.53:6789/0
2019-06-14 21:16:42.121 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.53:6789/0 conn(0x557135d85500 :6789 s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=0).handle_connect_msg: got bad authorizer
2019-06-14 21:16:44.293 7fa2d6d95700  0 cephx: verify_authorizer could not decrypt ticket info: error: bad magic in decode_decrypt, 16639705050927474509 != 18374858748799134293
2019-06-14 21:16:44.293 7fa2d6d95700  0 mon.tc@-1(probing) e4 ms_verify_authorizer bad authorizer from mon 192.168.60.10:6789/0
2019-06-14 21:16:44.293 7fa2d6d95700  0 -- 192.168.60.11:6789/0 >> 192.168.60.10:

Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)

2019-06-12 Thread Lluis Arasanz i Nonell - Adam
Hi,

If there is nothing special about the defined “initial monitors” in the cluster, we’ll
try to remove mon01 from the cluster.

I mention the “initial monitor” because in our ceph deployment only one monitor is
listed as initial:

[root@mon01 ceph]# cat /etc/ceph/ceph.conf
[global]
fsid = ----
mon_initial_members = mon01
mon_host = 10.10.200.20
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
public_network = 10.10.200.0/24
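For comparison, what we believe it should look like with all five monitors listed (our
assumption only, taken from the monmap below, and not yet applied anywhere):

mon_initial_members = mon01, mon02, mon03, mon04, mon05
mon_host = 10.10.200.20, 10.10.200.21, 10.10.200.22, 10.10.200.23, 10.10.200.24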

So I could change ceph.conf on every storage-related computer, but that does not help
with the monitors. I have mon05 in a “probing” state trying to contact only mon01
(which is down), and changing “mon_initial_members” in ceph.conf to point at mon02 as
the initial monitor does not work either ☹

2019-06-12 03:39:47.033242 7f04e630f700  0 mon.mon05@4(probing).data_health(0) 
update_stats avail 98% total 223 GB, used 4255 MB, avail 219 GB

And asking the admin socket:

[root@mon05 ~]# ceph daemon mon.mon05 mon_status
{ "name": "mon05",
  "rank": 4,
  "state": "probing",
  "election_epoch": 0,
  "quorum": [],
  "outside_quorum": [
"mon05"],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 21,
  "fsid": "----",
  "modified": "2019-06-07 16:59:26.729467",
  "created": "0.00",
  "mons": [
{ "rank": 0,
  "name": "mon01",
  "addr": "10.10.200.20:6789\/0"},
{ "rank": 1,
  "name": "mon02",
  "addr": "10.10.200.21:6789\/0"},
{ "rank": 2,
  "name": "mon03",
  "addr": "10.10.200.22:6789\/0"},
{ "rank": 3,
  "name": "mon04",
  "addr": "10.10.200.23:6789\/0"},
{ "rank": 4,
  "name": "mon05",
  "addr": "10.10.200.24:6789\/0"}]}}

In any case, mon02, mon03 and mon04 are healthy and in quorum:

[root@mon02 ceph-mon02]# ceph daemon mon.mon02 mon_status
{ "name": "mon02",
  "rank": 1,
  "state": "leader",
  "election_epoch": 476,
  "quorum": [
1,
2,
3],
  "outside_quorum": [],
  "extra_probe_peers": [],
  "sync_provider": [],
  "monmap": { "epoch": 21,
  "fsid": "----",
  "modified": "2019-06-07 16:59:26.729467",
  "created": "0.00",
  "mons": [
{ "rank": 0,
  "name": "mon01",
  "addr": "10.10.200.20:6789\/0"},
{ "rank": 1,
  "name": "mon02",
  "addr": "10.10.200.21:6789\/0"},
{ "rank": 2,
  "name": "mon03",
  "addr": "10.10.200.22:6789\/0"},
{ "rank": 3,
  "name": "mon04",
  "addr": "10.10.200.23:6789\/0"},
{ "rank": 4,
  "name": "mon05",
  "addr": "10.10.200.24:6789\/0"}]}}

Of course, no communication-related problems exist.

So this is why I am wary of touching the monitors…

Regards


From: Paul Emmerich
Sent: Wednesday, June 12, 2019 15:12
To: Lluis Arasanz i Nonell - Adam
CC: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)



On Wed, Jun 12, 2019 at 11:45 AM Lluis Arasanz i Nonell - Adam <lluis.aras...@adam.es> wrote:
- Be careful adding or removing monitors in an unhealthy monitor cluster: if
they lose quorum you will be in trouble.

safe procedure: remove the dead monitor before adding a new one


Now, we have some work to do:
- Remove mon01 with "ceph mon destroy mon01": we want to remove it from the monmap,
but it is the "initial monitor", so we do not know if it is safe to do.

yes that's safe to do, there's nothing special about the first mon. Command is 
"ceph mon remove ", though

- Clean and "format" monitor data (as we do on mon02 and mon03) for mon01, but 
we have the same situation: is safe to do when is the "initial mon"?

all (fully synched and in quorum) mons have the exact same data

- Modify the monmap, deleting mon01, and inject it on mon05, but... what happens
when we delete the "initial mon" from the monmap? Is that safe?

"ceph mon remove" will modify the mon map for you; manually modifying the mon 
map is only required if the cluster is down
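So with quorum intact the whole thing should reduce to something like this (sketch
only; the monmaptool path is just for a cluster that has lost quorum):

ceph mon remove mon01
# only if the monitors cannot form quorum:
monmaptool ./monmap --rm mon01
ceph-mon -i mon05 --inject-monmap ./monmap    # with mon05 stopped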




--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90



Regards


Re: [ceph-users] [Ceph-community] Monitors not in quorum (1 of 3 live)

2019-06-12 Thread Lluis Arasanz i Nonell - Adam
 we delete "initial mon" from monmap? Is safe?

As you can understand, we now have working storage but in a critical situation,
because any problem with the monitors could make it unstable again...
And there are still 15 TB of data inside.

If someone has any "safe" idea to share, it will be appreciated.

Regards



Lluís Arasanz Nonell * Systems Department
Tel: +34 902 902 685
email: lluis.aras...@adam.es
www.adam.es







Re: [ceph-users] [lists.ceph.com代发]Re: MDS Crashing 14.2.1

2019-05-17 Thread Adam Tygart
I've not done a scrub yet, but there was no indication of any duplicate inodes
during the dentry recovery. The MDS has been rock solid so far.

I'll probably start a scrub Monday.
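If I'm reading the nautilus docs right, kicking it off will be something along these
lines (not yet run here; the daemon name is a placeholder):

ceph tell mds.<rank-0-daemon> scrub start / recursive,repair
ceph tell mds.<rank-0-daemon> damage ls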

--
Adam

On Fri, May 17, 2019, 18:40 Sergey Malinin <ad...@data-center.com> wrote:
I've had a similar problem twice (with mimic), and in both cases I ended up backing up 
and restoring to a fresh fs. Did you do an MDS scrub after recovery? My experience 
suggests that recovering duplicate inodes is not a trivial process: my MDS kept 
crashing on unlink() in some directories, and in another case newly created fs entries 
would not pass MDS scrub due to linkage errors.


May 17, 2019 3:40 PM, "Adam Tygart" <mo...@ksu.edu> wrote:

> I followed the docs from here:
> http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts
>
> I exported the journals as a backup for both ranks. I was running 2
> active MDS daemons at the time.
>
> cephfs-journal-tool --rank=combined:0 journal export
> cephfs-journal-0-201905161412.bin
> cephfs-journal-tool --rank=combined:1 journal export
> cephfs-journal-1-201905161412.bin
>
> I recovered the Dentries on both ranks
> cephfs-journal-tool --rank=combined:0 event recover_dentries summary
> cephfs-journal-tool --rank=combined:1 event recover_dentries summary
>
> I reset the journals of both ranks:
> cephfs-journal-tool --rank=combined:1 journal reset
> cephfs-journal-tool --rank=combined:0 journal reset
>
> Then I reset the session table
> cephfs-table-tool all reset session
>
> Once that was done, reboot all machines that were talking to cephfs
> (or at least unmount/remount).
>
>> On Fri, May 17, 2019 at 2:30 AM <wangzhig...@uniview.com> wrote:
>
>> Hi
>> Can you tell me the detailed recovery commands?
>>
>> I just started learning cephfs; I would be grateful.
>>
>> From: Adam Tygart <mo...@ksu.edu>
>> To: Ceph Users <ceph-users@lists.ceph.com>
>> Date: 2019/05/17 09:04
>> Subject: [lists.ceph.com代发]Re: [ceph-users] MDS Crashing 14.2.1
>> Sender: "ceph-users" <ceph-users-boun...@lists.ceph.com>
>> 
>>
>> I ended up backing up the journals of the MDS ranks, recover_dentries for 
>> both of them, resetting
>> the journals and session table. It is back up. The recover dentries stage 
>> didn't show any errors,
>> so I'm not even sure why the MDS was asserting about duplicate inodes.
>>
>> --
>> Adam
>>
>> On Thu, May 16, 2019, 13:52 Adam Tygart <mo...@ksu.edu> wrote:
>> Hello all,
>>
>> The rank 0 mds is still asserting. Is this duplicate inode situation
>> one that I should be considering using the cephfs-journal-tool to
>> export, recover dentries and reset?
>>
>> Thanks,
>> Adam
>>
>> On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mo...@ksu.edu> wrote:
>>
>> Hello all,
>>
>> I've got a 30 node cluster serving up lots of CephFS data.
>>
>> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
>> this week.
>>
>> We've been running 2 MDS daemons in an active-active setup. Tonight
>> one of the metadata daemons crashed with the following several times:
>>
>> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cent
>> s7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
>> In function 'void CIn
>> ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
>> 00:20:56.775021
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cent
>> s7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
>> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
>> ack_allow_loading_invalid_metadata"))
>>
>> I made a quick decision to move to a single MDS because I saw
>> set_primary_parent, and I thought it might be related to auto
>> balancing between the metadata servers.
>>
>> This caused one MDS to fail, the other crashed, and now rank 0 loads,
>> goes active and then crashes with the following:
>> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
>> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/cent
>> s7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
>> In function 'void 

Re: [ceph-users] [lists.ceph.com代发]Re: MDS Crashing 14.2.1

2019-05-17 Thread Adam Tygart
I followed the docs from here:
http://docs.ceph.com/docs/nautilus/cephfs/disaster-recovery-experts/#disaster-recovery-experts

I exported the journals as a backup for both ranks. I was running 2
active MDS daemons at the time.

cephfs-journal-tool --rank=combined:0 journal export
cephfs-journal-0-201905161412.bin
cephfs-journal-tool --rank=combined:1 journal export
cephfs-journal-1-201905161412.bin

I recovered the Dentries on both ranks
cephfs-journal-tool --rank=combined:0 event recover_dentries summary
cephfs-journal-tool --rank=combined:1 event recover_dentries summary

I reset the journals of both ranks:
cephfs-journal-tool --rank=combined:1 journal reset
cephfs-journal-tool --rank=combined:0 journal reset

Then I reset the session table
cephfs-table-tool all reset session

Once that was done, reboot all machines that were talking to cephfs
(or at least unmount/remount).
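(If rebooting every client isn't practical, evicting the stale sessions from the MDS
side should accomplish much the same thing; untested sketch, and the daemon name and
session id are placeholders:)

ceph tell mds.<daemon-name> client ls
ceph tell mds.<daemon-name> client evict id=<session-id>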

On Fri, May 17, 2019 at 2:30 AM  wrote:
>
> Hi
> Can you tell me the detailed recovery commands?
>
> I just started learning cephfs; I would be grateful.
>
>
>
> From: Adam Tygart <mo...@ksu.edu>
> To: Ceph Users <ceph-users@lists.ceph.com>
> Date: 2019/05/17 09:04
> Subject: [lists.ceph.com代发]Re: [ceph-users] MDS Crashing 14.2.1
> Sender: "ceph-users" <ceph-users-boun...@lists.ceph.com>
> 
>
>
>
> I ended up backing up the journals of the MDS ranks, recover_dentries for 
> both of them, resetting the journals and session table. It is back up. The 
> recover dentries stage didn't show any errors, so I'm not even sure why the 
> MDS was asserting about duplicate inodes.
>
> --
> Adam
>
> On Thu, May 16, 2019, 13:52 Adam Tygart  wrote:
> Hello all,
>
> The rank 0 mds is still asserting. Is this duplicate inode situation
> one that I should be considering using the cephfs-journal-tool to
> export, recover dentries and reset?
>
> Thanks,
> Adam
>
> On Thu, May 16, 2019 at 12:51 AM Adam Tygart  wrote:
> >
> > Hello all,
> >
> > I've got a 30 node cluster serving up lots of CephFS data.
> >
> > We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> > this week.
> >
> > We've been running 2 MDS daemons in an active-active setup. Tonight
> > one of the metadata daemons crashed with the following several times:
> >
> > -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > In function 'void CIn
> > ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
> > 00:20:56.775021
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> > 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
> > ack_allow_loading_invalid_metadata"))
> >
> > I made a quick decision to move to a single MDS because I saw
> > set_primary_parent, and I thought it might be related to auto
> > balancing between the metadata servers.
> >
> > This caused one MDS to fail, the other crashed, and now rank 0 loads,
> > goes active and then crashes with the following:
> > -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > In function 'void M
> > DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 
> > 00:29:21.149531
> > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> > 258: FAILED ceph_assert(!p)
> >
> > It now looks like we somehow have a duplicate inode in the MDS journal?
> >
> > https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
> > then became rank one after the crash and attempted drop to one active
> > MDS
> > https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
> > and crashed
> >
> > Anyone have any thoughts on this?
> >
> > Thanks,
> > Adam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ---

Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
I ended up backing up the journals of the MDS ranks, running recover_dentries for
both of them, and resetting the journals and session table. It is back up. The
recover_dentries stage didn't show any errors, so I'm not even sure why the MDS
was asserting about duplicate inodes.

--
Adam

On Thu, May 16, 2019, 13:52 Adam Tygart <mo...@ksu.edu> wrote:
Hello all,

The rank 0 mds is still asserting. Is this duplicate inode situation
one that I should be considering using the cephfs-journal-tool to
export, recover dentries and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart <mo...@ksu.edu> wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed with the following several times:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CIn
> ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
> 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
> ack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent, and I thought it might be related to auto
> balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active and then crashes with the following:
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void M
> DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 
> 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
> then became rank one after the crash and attempted drop to one active
> MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
> and crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS Crashing 14.2.1

2019-05-16 Thread Adam Tygart
Hello all,

The rank 0 mds is still asserting. Is this duplicate inode situation
one that I should be considering using the cephfs-journal-tool to
export, recover dentries and reset?

Thanks,
Adam

On Thu, May 16, 2019 at 12:51 AM Adam Tygart  wrote:
>
> Hello all,
>
> I've got a 30 node cluster serving up lots of CephFS data.
>
> We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
> this week.
>
> We've been running 2 MDS daemons in an active-active setup. Tonight
> one of the metadata daemons crashed with the following several times:
>
> -1> 2019-05-16 00:20:56.775 7f9f22405700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> In function 'void CIn
> ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
> 00:20:56.775021
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
> 1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
> ack_allow_loading_invalid_metadata"))
>
> I made a quick decision to move to a single MDS because I saw
> set_primary_parent, and I thought it might be related to auto
> balancing between the metadata servers.
>
> This caused one MDS to fail, the other crashed, and now rank 0 loads,
> goes active and then crashes with the following:
> -1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> In function 'void M
> DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 
> 00:29:21.149531
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
> 258: FAILED ceph_assert(!p)
>
> It now looks like we somehow have a duplicate inode in the MDS journal?
>
> https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
> then became rank one after the crash and attempted drop to one active
> MDS
> https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
> and crashed
>
> Anyone have any thoughts on this?
>
> Thanks,
> Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS Crashing 14.2.1

2019-05-15 Thread Adam Tygart
Hello all,

I've got a 30 node cluster serving up lots of CephFS data.

We upgraded to Nautilus 14.2.1 from Luminous 12.2.11 on Monday earlier
this week.

We've been running 2 MDS daemons in an active-active setup. Tonight
one of the metadata daemons crashed with the following several times:

-1> 2019-05-16 00:20:56.775 7f9f22405700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
In function 'void CIn
ode::set_primary_parent(CDentry*)' thread 7f9f22405700 time 2019-05-16
00:20:56.775021
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/CInode.h:
1114: FAILED ceph_assert(parent == 0 || g_conf().get_val("mds_h
ack_allow_loading_invalid_metadata"))

I made a quick decision to move to a single MDS because I saw
set_primary_parent, and I thought it might be related to auto
balancing between the metadata servers.

This caused one MDS to fail, the other crashed, and now rank 0 loads,
goes active and then crashes with the following:
-1> 2019-05-16 00:29:21.151 7fe315e8d700 -1
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
In function 'void M
DCache::add_inode(CInode*)' thread 7fe315e8d700 time 2019-05-16 00:29:21.149531
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/14.2.1/rpm/el7/BUILD/ceph-14.2.1/src/mds/MDCache.cc:
258: FAILED ceph_assert(!p)

It now looks like we somehow have a duplicate inode in the MDS journal?

https://people.cs.ksu.edu/~mozes/ceph-mds.melinoe.log <- was rank 0
then became rank one after the crash and attempted drop to one active
MDS
https://people.cs.ksu.edu/~mozes/ceph-mds.mormo.log <- current rank 0
and crashed

Anyone have any thoughts on this?

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Limiting osd process memory use in nautilus.

2019-04-16 Thread Adam Tygart
As of 13.2.3, you should use 'osd_memory_target' instead of
'bluestore_cache_size'
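
A minimal sketch of setting it (the 4 GiB figure is only an example, not a
recommendation for any particular cluster):

# apply to all OSDs via the config store (Mimic and later)
ceph config set osd osd_memory_target 4294967296    # ~4 GiB per OSD daemon
# or push it into running daemons without a restart
ceph tell osd.* injectargs '--osd_memory_target=4294967296'
# the equivalent ceph.conf line is "osd_memory_target = 4294967296" under [osd]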
--
Adam


On Tue, Apr 16, 2019 at 10:28 AM Jonathan Proulx  wrote:
>
> Hi All,
>
> I have a a few servers that are a bit undersized on RAM for number of
> osds they run.
>
> When we swithced to bluestore about 1yr ago I'd "fixed" this (well
> kept them from OOMing) by setting bluestore_cache_size_ssd and
> bluestore_cache_size_hdd, this worked.
>
> after upgrading to Nautilus the OSDs again are running away and OOMing
> out.
>
> I noticed osd_memory_target_cgroup_limit_ratio": "0.80" so tried
> setting 'MemoryHigh' and 'MemoryMax' in the unit file. But the osd
> process still happily runs right upto that line and lets the OS deal
> with it (and it deals harshly).
>
> currently I have:
>
> "bluestore_cache_size": "0",
> "bluestore_cache_size_hdd": "1073741824",
> "bluestore_cache_size_ssd": "1073741824",
>
> and
> MemoryHigh=2560M
> MemoryMax=3072M
>
> and processes keep running right upto that 3G line and getting smacked
> down which is causing performance issues as they thrash and I suspect
> some scrub issues I've seen recently.
>
> I guess my next traw to grab at is to set "bluestore_cache_size" but
> is there something I'm missing here?
>
> Thanks,
> -Jon
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Radosgw s3 subuser permissions

2019-01-25 Thread Adam C. Emerson
On 24/01/2019, Marc Roos wrote:
>
>
> This should do it sort of.
>
> {
>   "Id": "Policy1548367105316",
>   "Version": "2012-10-17",
>   "Statement": [
> {
>   "Sid": "Stmt1548367099807",
>   "Effect": "Allow",
>   "Action": "s3:ListBucket",
>   "Principal": { "AWS": "arn:aws:iam::Company:user/testuser" },
>   "Resource": "arn:aws:s3:::archive"
> },
> {
>   "Sid": "Stmt1548369229354",
>   "Effect": "Allow",
>   "Action": [
> "s3:GetObject",
> "s3:PutObject",
> "s3:ListBucket"
>   ],
>   "Principal": { "AWS": "arn:aws:iam::Company:user/testuser" },
>   "Resource": "arn:aws:s3:::archive/folder2/*"
> }
>   ]
> }


Does this work well for sub-users? I hadn't worked on them as we were
focusing on the tenant/user case, but if someone's been using policy
with sub-users, I'd like to hear their experience and any problems
they run into.
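
For anyone testing that, one way to attach a policy like the one quoted above is
with s3cmd (a sketch; "archive" and policy.json come from the example, and the
credentials used must own the bucket):

# save the JSON document above as policy.json, then:
s3cmd setpolicy policy.json s3://archive
# inspect or remove it later with:
s3cmd info s3://archive
s3cmd delpolicy s3://archive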

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
The same user's jobs seem to be the instigator of this issue again.
I've looked through their code and see nothing too onerous.

This time it was 2400+ cores/jobs on 186 nodes all working in the same
directory. Each job reads in a different 110KB file, crunches numbers
for while (1+ hours) and then outputs 7x ~11MB files.

Disabling their jobs, powering off the nodes containing them, waiting
the 5 minutes for an mds session eviction timeout and restarting the
failing mds server let the daemon function again. I then started those
compute nodes again.

--
Adam

On Sat, Jan 19, 2019 at 8:42 PM Adam Tygart  wrote:
>
> Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the
> 27th of December.
>
> --
> Adam
>
> On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart  wrote:
> >
> > Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This 
> > didn't happen before then.
> >
> > --
> > Adam
> >
> > On Sat, Jan 19, 2019, 20:17 Paul Emmerich wrote:
> >> Did this only start to happen after upgrading to 12.2.10?
> >>
> >> Paul
> >>
> >> --
> >> Paul Emmerich
> >>
> >> Looking for help with your Ceph cluster? Contact us at https://croit.io
> >>
> >> croit GmbH
> >> Freseniusstr. 31h
> >> 81247 München
> >> www.croit.io
> >> Tel: +49 89 1896585 90
> >>
> >> On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart  wrote:
> >> >
> >> > It worked for about a week, and then seems to have locked up again.
> >> >
> >> > Here is the back trace from the threads on the mds:
> >> > http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
> >> >
> >> > --
> >> > Adam
> >> >
> >> > On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng  wrote:
> >> > >
> >> > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart  wrote:
> >> > > >
> >> > > > Restarting the nodes causes the hanging again. This means that this 
> >> > > > is
> >> > > > workload dependent and not a transient state.
> >> > > >
> >> > > > I believe I've tracked down what is happening. One user was running
> >> > > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
> >> > > > wondering if the cluster was getting ready to fragment the directory
> >> > > > something freaked out, perhaps not able to get all the caps back from
> >> > > > the nodes (if that is even required).
> >> > > >
> >> > > > I've stopped that user's jobs for the time being, and will probably
> >> > > > address it with them Monday. If it is the issue, can I tell the mds 
> >> > > > to
> >> > > > pre-fragment the directory before I re-enable their jobs?
> >> > > >
> >> > >
> >> > > The log shows mds is in busy loop, but doesn't show where is it. If it
> >> > > happens again, please use gdb to attach ceph-mds, then type 'set
> >> > > logging on' and 'thread apply all bt' inside gdb. and send the output
> >> > > to us
> >> > >
> >> > > Yan, Zheng
> >> > > > --
> >> > > > Adam
> >> > > >
> >> > > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
> >> > > > >
> >> > > > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 
> >> > > > > 10
> >> > > > > minutes after that restarted the mds daemon. It replayed the 
> >> > > > > journal,
> >> > > > > evicted the dead compute nodes and is working again.
> >> > > > >
> >> > > > > This leads me to believe there was a broken transaction of some 
> >> > > > > kind
> >> > > > > coming from the compute nodes (also all running CentOS 7.6 and 
> >> > > > > using
> >> > > > > the kernel cephfs mount). I hope there is enough logging from 
> >> > > > > before
> >> > > > > to try to track this issue down.
> >> > > > >
> >> > > > > We are back up and running for the moment.
> >> > > > > --
> >> > > > > Adam
> >> > > > >
> >> > > > >
> >> > > > >
> >> > > > > On Sat, Jan 12, 20

Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
Just re-checked my notes. We updated from 12.2.8 to 12.2.10 on the
27th of December.

--
Adam

On Sat, Jan 19, 2019 at 8:26 PM Adam Tygart  wrote:
>
> Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't 
> happen before then.
>
> --
> Adam
>
> On Sat, Jan 19, 2019, 20:17 Paul Emmerich wrote:
>> Did this only start to happen after upgrading to 12.2.10?
>>
>> Paul
>>
>> --
>> Paul Emmerich
>>
>> Looking for help with your Ceph cluster? Contact us at https://croit.io
>>
>> croit GmbH
>> Freseniusstr. 31h
>> 81247 München
>> www.croit.io
>> Tel: +49 89 1896585 90
>>
>> On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart  wrote:
>> >
>> > It worked for about a week, and then seems to have locked up again.
>> >
>> > Here is the back trace from the threads on the mds:
>> > http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>> >
>> > --
>> > Adam
>> >
>> > On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng  wrote:
>> > >
>> > > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart  wrote:
>> > > >
>> > > > Restarting the nodes causes the hanging again. This means that this is
>> > > > workload dependent and not a transient state.
>> > > >
>> > > > I believe I've tracked down what is happening. One user was running
>> > > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
>> > > > wondering if the cluster was getting ready to fragment the directory
>> > > > something freaked out, perhaps not able to get all the caps back from
>> > > > the nodes (if that is even required).
>> > > >
>> > > > I've stopped that user's jobs for the time being, and will probably
>> > > > address it with them Monday. If it is the issue, can I tell the mds to
>> > > > pre-fragment the directory before I re-enable their jobs?
>> > > >
>> > >
>> > > The log shows mds is in busy loop, but doesn't show where is it. If it
>> > > happens again, please use gdb to attach ceph-mds, then type 'set
>> > > logging on' and 'thread apply all bt' inside gdb. and send the output
>> > > to us
>> > >
>> > > Yan, Zheng
>> > > > --
>> > > > Adam
>> > > >
>> > > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
>> > > > >
>> > > > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
>> > > > > minutes after that restarted the mds daemon. It replayed the journal,
>> > > > > evicted the dead compute nodes and is working again.
>> > > > >
>> > > > > This leads me to believe there was a broken transaction of some kind
>> > > > > coming from the compute nodes (also all running CentOS 7.6 and using
>> > > > > the kernel cephfs mount). I hope there is enough logging from before
>> > > > > to try to track this issue down.
>> > > > >
>> > > > > We are back up and running for the moment.
>> > > > > --
>> > > > > Adam
>> > > > >
>> > > > >
>> > > > >
>> > > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
>> > > > > >
>> > > > > > Hello all,
>> > > > > >
>> > > > > > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 
>> > > > > > 7.6.
>> > > > > >
>> > > > > > We're using cephfs and rbd.
>> > > > > >
>> > > > > > Last night, one of our two active/active mds servers went laggy and
>> > > > > > upon restart once it goes active it immediately goes laggy again.
>> > > > > >
>> > > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
>> > > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>> > > > > >
>> > > > > > It looks like I might not have the right log levels. Thoughts on 
>> > > > > > debugging this?
>> > > > > >
>> > > > > > --
>> > > > > > Adam
>> > > > > > ___
>> > > > > > ceph-users mailing list
>> > > > > > ceph-users@lists.ceph.com
>> > > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > > ___
>> > > > > ceph-users mailing list
>> > > > > ceph-users@lists.ceph.com
>> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > > > ___
>> > > > ceph-users mailing list
>> > > > ceph-users@lists.ceph.com
>> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
Yes, we upgraded to 12.2.10 from 12.2.7 on the 27th of December. This didn't 
happen before then.

--
Adam

On Sat, Jan 19, 2019, 20:17 Paul Emmerich <paul.emmer...@croit.io> wrote:
Did this only start to happen after upgrading to 12.2.10?

Paul

--
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90

On Sat, Jan 19, 2019 at 5:40 PM Adam Tygart <mo...@ksu.edu> wrote:
>
> It worked for about a week, and then seems to have locked up again.
>
> Here is the back trace from the threads on the mds:
> http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
>
> --
> Adam
>
> On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng <uker...@gmail.com> wrote:
> >
> > On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart <mo...@ksu.edu> wrote:
> > >
> > > Restarting the nodes causes the hanging again. This means that this is
> > > workload dependent and not a transient state.
> > >
> > > I believe I've tracked down what is happening. One user was running
> > > 1500-2000 jobs in a single directory with 92000+ files in it. I am
> > > wondering if the cluster was getting ready to fragment the directory
> > > something freaked out, perhaps not able to get all the caps back from
> > > the nodes (if that is even required).
> > >
> > > I've stopped that user's jobs for the time being, and will probably
> > > address it with them Monday. If it is the issue, can I tell the mds to
> > > pre-fragment the directory before I re-enable their jobs?
> > >
> >
> > The log shows mds is in busy loop, but doesn't show where is it. If it
> > happens again, please use gdb to attach ceph-mds, then type 'set
> > logging on' and 'thread apply all bt' inside gdb. and send the output
> > to us
> >
> > Yan, Zheng
> > > --
> > > Adam
> > >
> > > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart <mo...@ksu.edu> wrote:
> > > >
> > > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
> > > > minutes after that restarted the mds daemon. It replayed the journal,
> > > > evicted the dead compute nodes and is working again.
> > > >
> > > > This leads me to believe there was a broken transaction of some kind
> > > > coming from the compute nodes (also all running CentOS 7.6 and using
> > > > the kernel cephfs mount). I hope there is enough logging from before
> > > > to try to track this issue down.
> > > >
> > > > We are back up and running for the moment.
> > > > --
> > > > Adam
> > > >
> > > >
> > > >
> > > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart <mo...@ksu.edu> wrote:
> > > > >
> > > > > Hello all,
> > > > >
> > > > > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 
> > > > > 7.6.
> > > > >
> > > > > We're using cephfs and rbd.
> > > > >
> > > > > Last night, one of our two active/active mds servers went laggy and
> > > > > upon restart once it goes active it immediately goes laggy again.
> > > > >
> > > > > I've got a log available here (debug_mds 20, debug_objecter 20):
> > > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> > > > >
> > > > > It looks like I might not have the right log levels. Thoughts on 
> > > > > debugging this?
> > > > >
> > > > > --
> > > > > Adam
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-19 Thread Adam Tygart
It worked for about a week, and then seems to have locked up again.

Here is the back trace from the threads on the mds:
http://people.cs.ksu.edu/~mozes/ceph-12.2.10-laggy-mds.gdb.txt
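
For anyone wanting to reproduce that capture, the gdb steps Zheng suggested below
boil down to roughly this (a sketch; it needs the debuginfo packages for useful
symbols and briefly pauses the daemon):

gdb -p $(pidof ceph-mds)
(gdb) set logging on          # writes gdb.txt in the current directory
(gdb) thread apply all bt
(gdb) set logging off
(gdb) detach
(gdb) quit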

--
Adam

On Sun, Jan 13, 2019 at 7:41 PM Yan, Zheng  wrote:
>
> On Sun, Jan 13, 2019 at 1:43 PM Adam Tygart  wrote:
> >
> > Restarting the nodes causes the hanging again. This means that this is
> > workload dependent and not a transient state.
> >
> > I believe I've tracked down what is happening. One user was running
> > 1500-2000 jobs in a single directory with 92000+ files in it. I am
> > wondering if the cluster was getting ready to fragment the directory
> > something freaked out, perhaps not able to get all the caps back from
> > the nodes (if that is even required).
> >
> > I've stopped that user's jobs for the time being, and will probably
> > address it with them Monday. If it is the issue, can I tell the mds to
> > pre-fragment the directory before I re-enable their jobs?
> >
>
> The log shows mds is in busy loop, but doesn't show where is it. If it
> happens again, please use gdb to attach ceph-mds, then type 'set
> logging on' and 'thread apply all bt' inside gdb. and send the output
> to us
>
> Yan, Zheng
> > --
> > Adam
> >
> > On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
> > >
> > > On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
> > > minutes after that restarted the mds daemon. It replayed the journal,
> > > evicted the dead compute nodes and is working again.
> > >
> > > This leads me to believe there was a broken transaction of some kind
> > > coming from the compute nodes (also all running CentOS 7.6 and using
> > > the kernel cephfs mount). I hope there is enough logging from before
> > > to try to track this issue down.
> > >
> > > We are back up and running for the moment.
> > > --
> > > Adam
> > >
> > >
> > >
> > > On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
> > > >
> > > > Hello all,
> > > >
> > > > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6.
> > > >
> > > > We're using cephfs and rbd.
> > > >
> > > > Last night, one of our two active/active mds servers went laggy and
> > > > upon restart once it goes active it immediately goes laggy again.
> > > >
> > > > I've got a log available here (debug_mds 20, debug_objecter 20):
> > > > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> > > >
> > > > It looks like I might not have the right log levels. Thoughts on 
> > > > debugging this?
> > > >
> > > > --
> > > > Adam
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
Restarting the nodes causes the hanging again. This means that this is
workload dependent and not a transient state.

I believe I've tracked down what is happening. One user was running
1500-2000 jobs in a single directory with 92000+ files in it. I am
wondering if the cluster was getting ready to fragment the directory and
something freaked out, perhaps not being able to get all the caps back from
the nodes (if that is even required).

I've stopped that user's jobs for the time being, and will probably
address it with them Monday. If it is the issue, can I tell the mds to
pre-fragment the directory before I re-enable their jobs?

--
Adam

On Sat, Jan 12, 2019 at 7:53 PM Adam Tygart  wrote:
>
> On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
> minutes after that restarted the mds daemon. It replayed the journal,
> evicted the dead compute nodes and is working again.
>
> This leads me to believe there was a broken transaction of some kind
> coming from the compute nodes (also all running CentOS 7.6 and using
> the kernel cephfs mount). I hope there is enough logging from before
> to try to track this issue down.
>
> We are back up and running for the moment.
> --
> Adam
>
>
>
> On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
> >
> > Hello all,
> >
> > I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6.
> >
> > We're using cephfs and rbd.
> >
> > Last night, one of our two active/active mds servers went laggy and
> > upon restart once it goes active it immediately goes laggy again.
> >
> > I've got a log available here (debug_mds 20, debug_objecter 20):
> > https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
> >
> > It looks like I might not have the right log levels. Thoughts on debugging 
> > this?
> >
> > --
> > Adam
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
On a hunch, I shutdown the compute nodes for our HPC cluster, and 10
minutes after that restarted the mds daemon. It replayed the journal,
evicted the dead compute nodes and is working again.

This leads me to believe there was a broken transaction of some kind
coming from the compute nodes (also all running CentOS 7.6 and using
the kernel cephfs mount). I hope there is enough logging from before
to try to track this issue down.

We are back up and running for the moment.
--
Adam



On Sat, Jan 12, 2019 at 11:23 AM Adam Tygart  wrote:
>
> Hello all,
>
> I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6.
>
> We're using cephfs and rbd.
>
> Last night, one of our two active/active mds servers went laggy and
> upon restart once it goes active it immediately goes laggy again.
>
> I've got a log available here (debug_mds 20, debug_objecter 20):
> https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz
>
> It looks like I might not have the right log levels. Thoughts on debugging 
> this?
>
> --
> Adam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph MDS laggy

2019-01-12 Thread Adam Tygart
Hello all,

I've got a 31 machine Ceph cluster running ceph 12.2.10 and CentOS 7.6.

We're using cephfs and rbd.

Last night, one of our two active/active mds servers went laggy and
upon restart once it goes active it immediately goes laggy again.

I've got a log available here (debug_mds 20, debug_objecter 20):
https://people.cs.ksu.edu/~mozes/ceph-mds-laggy-20190112.log.gz

It looks like I might not have the right log levels. Thoughts on debugging this?

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lost 1/40 OSDs at EC 4+1, now PGs are incomplete

2018-12-11 Thread Adam Tygart
AFAIR, there is a feature request in the works to allow rebuild with K
chunks, but not allow normal read/write until min_size is met. Not
that I think running with m=1 is a good idea. I'm not seeing the
tracker issue for it at the moment, though.
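
For the record, the temporary workaround discussed below is just toggling the
pool's min_size (a sketch; "media" is the pool from the quoted output, and while
it is lowered a single further failure risks data loss):

# let I/O and recovery proceed with only K=4 chunks present
ceph osd pool set media min_size 4
# watch backfill progress
ceph -w
# once the PGs are clean again, restore the safer setting
ceph osd pool set media min_size 5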

--
Adam
On Tue, Dec 11, 2018 at 9:50 PM Ashley Merrick  wrote:
>
> Yes, if you set back to 5, every time you lose an OSD you'll have to set it to 4 
> and let the rebuild take place before putting it back to 5.
>
> I guess it's all down to how important 100% uptime is versus you manually 
> monitoring the backfill / fixing the OSD / replacing the OSD by dropping to 4, vs 
> letting it do this automatically and risking a further OSD loss.
>
> If you have the space I'd suggest going to 4 + 2 and migrating your data; this 
> would remove the ongoing issue and give you some extra data protection from 
> OSD loss.
>
> On Wed, Dec 12, 2018 at 11:43 AM David Young  
> wrote:
>>
>> (accidentally forgot to reply to the list)
>>
>> Thank you, setting min_size to 4 allowed I/O again, and the 39 incomplete 
>> PGs are now:
>>
>> 39  active+undersized+degraded+remapped+backfilling
>>
>> Once backfilling is done, I'll increase min_size to 5 again.
>>
>> Am I likely to encounter this issue whenever I lose an OSD (I/O freezes and 
>> manually reducing size is required), and is there anything I should be doing 
>> differently?
>>
>> Thanks again!
>> D
>>
>>
>>
>> Sent with ProtonMail Secure Email.
>>
>> ‐‐‐ Original Message ‐‐‐
>> On Wednesday, December 12, 2018 3:31 PM, Ashley Merrick 
>>  wrote:
>>
>> With EC the min size is set to K + 1.
>>
>> Generally EC is used with an M of 2 or more; the reason min_size is set to K+1 is 
>> that you are now in a state where a further OSD loss will cause some PGs to not 
>> have at least K chunks available, as you only have 1 extra M.
>>
>> As per the error you can get your pool back online by setting min_size to 4.
>>
>> However this would only be a temp fix while you get the OSD back online / 
>> rebuilt so you can go back to your 4 + 1 state.
>>
>> ,Ash
>>
>> On Wed, 12 Dec 2018 at 10:27 AM, David Young  
>> wrote:
>>>
>>> Hi all,
>>>
>>> I have a small 2-node cluster with 40 OSDs, using erasure coding 4+1
>>>
>>> I lost osd38, and now I have 39 incomplete PGs.
>>>
>>> ---
>>> PG_AVAILABILITY Reduced data availability: 39 pgs inactive, 39 pgs 
>>> incomplete
>>> pg 22.2 is incomplete, acting [19,33,10,8,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.f is incomplete, acting [17,9,23,14,15] (reducing pool media min_size 
>>> from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.12 is incomplete, acting [7,33,10,31,29] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.13 is incomplete, acting [23,0,15,33,13] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> pg 22.23 is incomplete, acting [29,17,18,15,12] (reducing pool media 
>>> min_size from 5 may help; search ceph.com/docs for 'incomplete')
>>> 
>>> ---
>>>
>>> My EC profile is below:
>>>
>>> ---
>>> root@prod1:~# ceph osd erasure-code-profile get ec-41-profile
>>> crush-device-class=
>>> crush-failure-domain=osd
>>> crush-root=default
>>> jerasure-per-chunk-alignment=false
>>> k=4
>>> m=1
>>> plugin=jerasure
>>> technique=reed_sol_van
>>> w=8
>>> ---
>>>
>>> When I query one of the incomplete PGs, I see this:
>>>
>>> ---
>>> "recovery_state": [
>>> {
>>> "name": "Started/Primary/Peering/Incomplete",
>>> "enter_time": "2018-12-11 20:46:11.645796",
>>> "comment": "not enough complete instances of this PG"
>>> },
>>> ---
>>>
>>> And this:
>>>
>>> ---
>>> "probing_osds": [
>>> "0(4)",
>>> "7(2)",
>>> "9(1)",
>>> "11(4)",
>>> "22(3)",
>>> "29(2)",
>>> "36(0)"
>>> ],
>>>  

Re: [ceph-users] Apply bucket policy to bucket for LDAP user: what is the correct identifier for principal

2018-10-11 Thread Adam C. Emerson
Ha Son Hai  wrote:
> Hello everyone,
> I am trying to apply a bucket policy to my bucket for an LDAP user, but it
> doesn't work.
> For a user created by radosgw-admin, the policy works fine.
>
> {
>
>   "Version": "2012-10-17",
>
>   "Statement": [{
>
> "Effect": "Allow",
>
> "Principal": {"AWS": ["arn:aws:iam:::user/radosgw-user"]},
>
> "Action": "s3:*",
>
> "Resource": [
>
>   "arn:aws:s3:::shared-tenant-test",
>
>   "arn:aws:s3:::shared-tenant-test/*"
>
> ]
>
>   }]
>
> }

LDAP users essentially are RGW users, so it should be this same
format. As I understand RGW's LDAP interface (I have not worked with
LDAP personally), every LDAP user gets a corresponding RGW user whose
name is derived from rgw_ldap_dnattr, often 'uid' or 'cn', but this is
dependent on site.

If you can, check that part of the configuration; if that doesn't work,
send some logs and I'll take a look. If something fishy is going
on we can try opening a bug.
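
As a rough illustration of the mapping (the uid "hsonhai" here is hypothetical,
just a stand-in for whatever your rgw_ldap_dnattr attribute resolves to):

# the LDAP login shows up as an ordinary RGW user named after the DN attribute
radosgw-admin user info --uid="hsonhai"
# a policy principal for that user would then look like arn:aws:iam:::user/hsonhai
# (prefix a tenant only if you actually use tenants, e.g. arn:aws:iam::sometenant:user/hsonhai)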

Thank you.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs mds cache tuning

2018-10-02 Thread Adam Tygart
It may be that having multiple mds is masking the issue, or that we
truly didn't have a large enough inode cache at 55GB. Things are
behaving for me now, even when presenting the same 0 entries in req
and rlat.

If this happens again, I'll attempt to get perf trace logs, along with
ops, ops_in_flight, perf dump and objecter requests. Thanks for your
time.
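
For reference, those all come off the MDS admin socket on the active MDS host
(a sketch; mds.$(hostname -s) assumes the daemon is named after the host):

ceph daemon mds.$(hostname -s) ops
ceph daemon mds.$(hostname -s) dump_ops_in_flight
ceph daemon mds.$(hostname -s) perf dump
ceph daemon mds.$(hostname -s) objecter_requests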

--
Adam
On Mon, Oct 1, 2018 at 10:36 PM Adam Tygart  wrote:
>
> Okay, here's what I've got: https://www.paste.ie/view/abe8c712
>
> Of note, I've changed things up a little bit for the moment. I've
> activated a second mds to see if it is a particular subtree that is
> more prone to issues. maybe EC vs replica... The one that is currently
> being slow has my EC volume pinned to it.
>
> --
> Adam
> On Mon, Oct 1, 2018 at 10:02 PM Gregory Farnum  wrote:
> >
> > Can you grab the perf dump during this time, perhaps plus dumps of the ops 
> > in progress?
> >
> > This is weird but given it’s somewhat periodic it might be something like 
> > the MDS needing to catch up on log trimming (though I’m unclear why 
> > changing the cache size would impact this).
> >
> > On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart  wrote:
> >>
> >> Hello all,
> >>
> >> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
> >> cephfs mounts (kernel client). We're currently only using 1 active
> >> mds.
> >>
> >> Performance is great about 80% of the time. MDS responses (per ceph
> >> daemonperf mds.$(hostname -s), indicates 2k-9k requests per second,
> >> with a latency under 100.
> >>
> >> It is the other 20ish percent I'm worried about. I'll check on it and
> >> it with be going 5-15 seconds with "0" requests, "0" latency, then
> >> give me 2 seconds of reasonable response times, and then back to
> >> nothing. Clients are actually seeing blocked requests for this period
> >> of time.
> >>
> >> The strange bit is that when I *reduce* the mds_cache_size, requests
> >> and latencies go back to normal for a while. When it happens again,
> >> I'll increase it back to where it was. It feels like the mds server
> >> decides that some of these inodes can't be dropped from the cache
> >> unless the cache size changes. Maybe something wrong with the LRU?
> >>
> >> I feel like I've got a reasonable cache size for my workload, 30GB on
> >> the small end, 55GB on the large. No real reason for a swing this
> >> large except to potentially delay it recurring after expansion for
> >> longer.
> >>
> >> I also feel like there is probably some magic tunable to change how
> >> inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
> >> this tunable actually does? The documentation is a little sparse.
> >>
> >> I can grab logs from the mds if needed, just let me know the settings
> >> you'd like to see.
> >>
> >> --
> >> Adam
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cephfs mds cache tuning

2018-10-01 Thread Adam Tygart
Okay, here's what I've got: https://www.paste.ie/view/abe8c712

Of note, I've changed things up a little bit for the moment. I've
activated a second mds to see if it is a particular subtree that is
more prone to issues, maybe EC vs. replica. The one that is currently
being slow has my EC volume pinned to it.

--
Adam
On Mon, Oct 1, 2018 at 10:02 PM Gregory Farnum  wrote:
>
> Can you grab the perf dump during this time, perhaps plus dumps of the ops in 
> progress?
>
> This is weird but given it’s somewhat periodic it might be something like the 
> MDS needing to catch up on log trimming (though I’m unclear why changing the 
> cache size would impact this).
>
> On Sun, Sep 30, 2018 at 9:02 PM Adam Tygart  wrote:
>>
>> Hello all,
>>
>> I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
>> cephfs mounts (kernel client). We're currently only using 1 active
>> mds.
>>
>> Performance is great about 80% of the time. MDS responses (per ceph
>> daemonperf mds.$(hostname -s), indicates 2k-9k requests per second,
>> with a latency under 100.
>>
>> It is the other 20ish percent I'm worried about. I'll check on it and
>> it with be going 5-15 seconds with "0" requests, "0" latency, then
>> give me 2 seconds of reasonable response times, and then back to
>> nothing. Clients are actually seeing blocked requests for this period
>> of time.
>>
>> The strange bit is that when I *reduce* the mds_cache_size, requests
>> and latencies go back to normal for a while. When it happens again,
>> I'll increase it back to where it was. It feels like the mds server
>> decides that some of these inodes can't be dropped from the cache
>> unless the cache size changes. Maybe something wrong with the LRU?
>>
>> I feel like I've got a reasonable cache size for my workload, 30GB on
>> the small end, 55GB on the large. No real reason for a swing this
>> large except to potentially delay it recurring after expansion for
>> longer.
>>
>> I also feel like there is probably some magic tunable to change how
>> inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
>> this tunable actually does? The documentation is a little sparse.
>>
>> I can grab logs from the mds if needed, just let me know the settings
>> you'd like to see.
>>
>> --
>> Adam
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephfs mds cache tuning

2018-09-30 Thread Adam Tygart
Hello all,

I've got a ceph (12.2.8) cluster with 27 servers, 500 osds, and 1000
cephfs mounts (kernel client). We're currently only using 1 active
mds.

Performance is great about 80% of the time. MDS responses (per 'ceph
daemonperf mds.$(hostname -s)') indicate 2k-9k requests per second,
with a latency under 100.

It is the other 20ish percent I'm worried about. I'll check on it and
it will be going 5-15 seconds with "0" requests, "0" latency, then
give me 2 seconds of reasonable response times, and then back to
nothing. Clients are actually seeing blocked requests for this period
of time.

The strange bit is that when I *reduce* the mds_cache_size, requests
and latencies go back to normal for a while. When it happens again,
I'll increase it back to where it was. It feels like the mds server
decides that some of these inodes can't be dropped from the cache
unless the cache size changes. Maybe something wrong with the LRU?
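
For completeness, the knob-flipping itself is just something like this (a sketch;
whether you tune mds_cache_memory_limit in bytes or the older mds_cache_size
inode count depends on your config, and the values below only mirror the
30GB/55GB figures mentioned here):

# shrink the cache limit, which seems to kick the MDS back into shape
ceph tell mds.* injectargs '--mds_cache_memory_limit=32212254720'   # ~30 GB
# later, grow it back
ceph tell mds.* injectargs '--mds_cache_memory_limit=59055800320'   # ~55 GB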

I feel like I've got a reasonable cache size for my workload, 30GB on
the small end, 55GB on the large. No real reason for a swing this
large except to potentially delay it recurring after expansion for
longer.

I also feel like there is probably some magic tunable to change how
inodes get stuck in the LRU. perhaps mds_cache_mid. Anyone know what
this tunable actually does? The documentation is a little sparse.

I can grab logs from the mds if needed, just let me know the settings
you'd like to see.

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Segfaults after Bluestore conversion

2018-08-27 Thread Adam Tygart
This issue was related to using Jemalloc. Jemalloc is not as well
tested with Bluestore and lead to lots of segfaults. We moved back to
the default of tcmalloc with Bluestore and these stopped.

Check /etc/sysconfig/ceph under RHEL based distros.
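
A sketch of what to look for there (the exact line differs by install; the point
is simply to stop preloading jemalloc so the OSDs fall back to tcmalloc):

# /etc/sysconfig/ceph -- comment out or remove any jemalloc preload, for example:
#   LD_PRELOAD=/usr/lib64/libjemalloc.so.1
# leave the tcmalloc defaults in place, then restart the OSDs on that host:
systemctl restart ceph-osd.target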

--
Adam
On Mon, Aug 27, 2018 at 9:51 PM Tyler Bishop
 wrote:
>
> Did you solve this?  Similar issue.
> _
>
>
> On Wed, Feb 28, 2018 at 3:46 PM Kyle Hutson  wrote:
>>
>> I'm following up from awhile ago. I don't think this is the same bug. The 
>> bug referenced shows "abort: Corruption: block checksum mismatch", and I'm 
>> not seeing that on mine.
>>
>> Now I've had 8 OSDs down on this one server for a couple of weeks, and I 
>> just tried to start it back up. Here's a link to the log of that OSD (which 
>> segfaulted right after starting up): 
>> http://people.beocat.ksu.edu/~kylehutson/ceph-osd.414.log
>>
>> To me, it looks like the logs are providing surprisingly few hints as to 
>> where the problem lies. Is there a way I can turn up logging to see if I can 
>> get any more info as to why this is happening?
>>
>> On Thu, Feb 8, 2018 at 3:02 AM, Mike O'Connor  wrote:
>>>
>>> On 7/02/2018 8:23 AM, Kyle Hutson wrote:
>>> > We had a 26-node production ceph cluster which we upgraded to Luminous
>>> > a little over a month ago. I added a 27th-node with Bluestore and
>>> > didn't have any issues, so I began converting the others, one at a
>>> > time. The first two went off pretty smoothly, but the 3rd is doing
>>> > something strange.
>>> >
>>> > Initially, all the OSDs came up fine, but then some started to
>>> > segfault. Out of curiosity more than anything else, I did reboot the
>>> > server to see if it would get better or worse, and it pretty much
>>> > stayed the same - 12 of the 18 OSDs did not properly come up. Of
>>> > those, 3 again segfaulted
>>> >
>>> > I picked one that didn't properly come up and copied the log to where
>>> > anybody can view it:
>>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.426.log
>>> >
>>> > You can contrast that with one that is up:
>>> > http://people.beocat.ksu.edu/~kylehutson/ceph-osd.428.log
>>> >
>>> > (which is still showing segfaults in the logs, but seems to be
>>> > recovering from them OK?)
>>> >
>>> > Any ideas?
>>> Ideas ? yes
>>>
>>> There is a a bug which is hitting a small number of systems and at this
>>> time there is no solution. Issues details at
>>> http://tracker.ceph.com/issues/22102.
>>>
>>> Please submit more details of your problem on the ticket.
>>>
>>> Mike
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] removing auids and auid-based cephx capabilities

2018-08-11 Thread Adam Tygart
I don't care what happens to most of these rados commands, and I've
never used the auid "functionality", but I have found the rados purge
command quite useful when testing different rados level applications.

Run a rados-level application test. Whoops it didn't do what you
wanted, purge and start over. It is significantly faster than
alternative of looping through a 'rados ls' and issuing 'rados rm' for
every object. Sure I could delete the pool and recreate one with the
same name, but that seems wasteful. Enabling pool deletion in the
monitors, allocating new pool ids, causing the mass re-peering of
placement groups, making sure all of the per-pool settings exactly
match what you had before. It gets tedious.
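
A sketch of the two approaches on a throwaway pool (the pool name is hypothetical,
and purge deletes every object in the pool, so never point it at production data):

# slow path: enumerate and delete object by object
rados -p testpool ls | while IFS= read -r obj; do rados -p testpool rm "$obj"; done
# fast path: one server-side call
rados purge testpool --yes-i-really-really-mean-it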

If the code-path for a purge is different on the server-side, perhaps
there could be an additional permission to let the cephx user perform
a purge. At least then it is protected from the casual (ab)user.

Just my two cents.

--
Adam

On Sat, Aug 11, 2018 at 1:39 PM, Sage Weil  wrote:
> On Fri, 10 Aug 2018, Gregory Farnum wrote:
>> On Wed, Aug 8, 2018 at 1:33 PM, Sage Weil  wrote:
>> > There is an undocumented part of the cephx authentication framework called
>> > the 'auid' (auth uid) that assigns an integer identifier to cephx users
>> > and to rados pools and allows you to craft cephx capabilities that apply
>> > to those pools.  This is leftover infrastructure from an ancient time in
>> > which RGW buckets mapped 1:1 to rados pools (pre-argonaut!) and it was
>> > expected the cephx capabilities would line up with that.
>> >
>> > Although in theory parts of the auid infrastructure might work and be in
>> > use, it is undocumented, untested, and a messy artifact in the code.  I'd
>> > like to remove it.
>> >
>> > ***
>> >
>> >   If you are using auid-based cephx capabilities, now is the time to tell
>> >   us!  Or, if you know of any reason we should keep it around, now is
>> >   the time to speak up.
>> >
>> >   Otherwise we will remove it!
>> >
>> > ***
>>
>> I used to be very proud of this code, but +1. I don't know of any
>> users who *could* be using it (much less are) and it really doesn't
>> make any sense in our current security architecture even if it might
>> function.
>
> Two questions so far:
>
> 1) I marked the librados calls that take auid deprecated, but I can wire
> them up to still work.  For example, if you call pool_create_with_auid it
> can still create a pool.  Alternatively, I can make those calls now return
> EOPNOTSUPP.  That could break some wayward librados user, though.
> Similarly, there are calls to get and set the pool auid.  Currently I have
> converted to no-ops, but they could also return an error instead.
> Thoughts?
>
> 2) The rados cli has a 'mkpool' command that works like 'rados mkpool
>  [auid [crush-rule]]'.  The ordering means I can't just drop
> auid.  So, I could ignore the auid argument, or change the calling
> convention completely.
>
> Or, we could remove the command completely and let people use 'ceph osd
> pool create' for this.  This is my preference!  In fact, there are
> several commands I'd suggest killing at the same time:
>
> "   mkpool  [123[ 4]] create pool '\n"
> "[with auid 123[and using crush rule
> 4]]\n"
> "   cppool copy content of a pool\n"
> "   rmpool  [ --yes-i-really-really-mean-it]\n"
> "remove pool '\n"
> "   purge  --yes-i-really-really-mean-it\n"
> "remove all objects from pool
>  without removing it\n"
>
> cppool is an incomplete implementation anyway (doesn't preserve snaps,
> for example; probably doesn't do omap either?).  The others just scare me.
>
> Thoughts?
> sage
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS damaged

2018-07-15 Thread Adam Tygart
Check out the message titled "IMPORTANT: broken luminous 12.2.6
release in repo, do not upgrade"

It sounds like 12.2.7 should come *soon* to fix this transparently.

--
Adam

On Sun, Jul 15, 2018 at 10:28 AM, Nicolas Huillard
 wrote:
> Hi all,
>
> I have the same problem here:
> * during the upgrade from 12.2.5 to 12.2.6
> * I restarted all the OSD server in turn, which did not trigger any bad
> thing
> * a few minutes after upgrading the OSDs/MONs/MDSs/MGRs (all on the
> same set of servers) and unsetting noout, I upgraded the clients, which
> triggers a temporary loss of connectivity between the two datacenters
>
> 2018-07-15 12:49:09.851204 mon.brome mon.0 172.21.0.16:6789/0 98 : cluster 
> [INF] Health check cleared: OSDMAP_FLAGS (was: noout flag(s) set)
> 2018-07-15 12:49:09.851286 mon.brome mon.0 172.21.0.16:6789/0 99 : cluster 
> [INF] Cluster is now healthy
> 2018-07-15 12:56:26.446062 mon.soufre mon.5 172.22.0.20:6789/0 34 : cluster 
> [INF] mon.soufre calling monitor election
> 2018-07-15 12:56:26.446288 mon.oxygene mon.3 172.22.0.16:6789/0 13 : cluster 
> [INF] mon.oxygene calling monitor election
> 2018-07-15 12:56:26.522520 mon.macaret mon.6 172.30.0.3:6789/0 10 : cluster 
> [INF] mon.macaret calling monitor election
> 2018-07-15 12:56:26.539575 mon.phosphore mon.4 172.22.0.18:6789/0 20 : 
> cluster [INF] mon.phosphore calling monitor election
> 2018-07-15 12:56:36.485881 mon.oxygene mon.3 172.22.0.16:6789/0 14 : cluster 
> [INF] mon.oxygene is new leader, mons oxygene,phosphore,soufre,macaret in 
> quorum (ranks 3,4,5,6)
> 2018-07-15 12:56:36.930096 mon.oxygene mon.3 172.22.0.16:6789/0 19 : cluster 
> [WRN] Health check failed: 3/7 mons down, quorum 
> oxygene,phosphore,soufre,macaret (MON_DOWN)
> 2018-07-15 12:56:37.041888 mon.oxygene mon.3 172.22.0.16:6789/0 26 : cluster 
> [WRN] overall HEALTH_WARN 3/7 mons down, quorum 
> oxygene,phosphore,soufre,macaret
> 2018-07-15 12:56:55.456239 mon.oxygene mon.3 172.22.0.16:6789/0 57 : cluster 
> [WRN] daemon mds.fluor is not responding, replacing it as rank 0 with standby 
> daemon mds.brome
> 2018-07-15 12:56:55.456365 mon.oxygene mon.3 172.22.0.16:6789/0 58 : cluster 
> [INF] Standby daemon mds.chlore is not responding, dropping it
> 2018-07-15 12:56:55.456486 mon.oxygene mon.3 172.22.0.16:6789/0 59 : cluster 
> [WRN] daemon mds.brome is not responding, replacing it as rank 0 with standby 
> daemon mds.oxygene
> 2018-07-15 12:56:55.464196 mon.oxygene mon.3 172.22.0.16:6789/0 60 : cluster 
> [WRN] Health check failed: 1 filesystem is degraded (FS_DEGRADED)
> 2018-07-15 12:56:55.691674 mds.oxygene mds.0 172.22.0.16:6800/4212961230 1 : 
> cluster [ERR] Error recovering journal 0x200: (5) Input/output error
> 2018-07-15 12:56:56.645914 mon.oxygene mon.3 172.22.0.16:6789/0 64 : cluster 
> [ERR] Health check failed: 1 mds daemon damaged (MDS_DAMAGE)
>
> Above is the hint about journal 0x200. The error appears much later
> in the logs:
>
> 2018-07-15 16:34:28.567267 osd.11 osd.11 172.22.0.20:6805/2150 21 : cluster 
> [ERR] 6.14 full-object read crc 0x38f8faae != expected 0xed23f8df on 
> 6:292cf221:::200.:head
>
> I tried a repair and deep-scrub on PG 6.14, with the same nil result as
> Alessandro.
> I can't find any other error about the MDS journal 200. on the
> other OSDs so I can't check CRCs.
>
> I'll try the next steps taken by Alessandro, but I'm in unknown
> territory...
>
> Le mercredi 11 juillet 2018 à 18:10 +0300, Alessandro De Salvo a
> écrit :
>> Hi,
>>
>> after the upgrade to luminous 12.2.6 today, all our MDSes have been
>> marked as damaged. Trying to restart the instances only result in
>> standby MDSes. We currently have 2 filesystems active and 2 MDSes
>> each.
>>
>> I found the following error messages in the mon:
>>
>>
>> mds.0 :6800/2412911269 down:damaged
>> mds.1 :6800/830539001 down:damaged
>> mds.0 :6800/4080298733 down:damaged
>>
>>
>> Whenever I try to force the repaired state with ceph mds repaired
>> : I get something like this in the MDS logs:
>>
>>
>> 2018-07-11 13:20:41.597970 7ff7e010e700  0 mds.1.journaler.mdlog(ro)
>> error getting journal off disk
>> 2018-07-11 13:20:41.598173 7ff7df90d700 -1 log_channel(cluster) log
>> [ERR] : Error recovering journal 0x201: (5) Input/output error
>>
>>
>> Any attempt of running the journal export results in errors, like
>> this one:
>>
>>
>> cephfs-journal-tool --rank=cephfs:0 journal export backup.bin
>> Error ((5) Input/output error)2018-07-11 17:01:30.631571 7f94354fff00
>> -1
>> Header 200. is unreadable
>>
>> 2018-07-11 17:01:30.631584 7f94354ff

Re: [ceph-users] MDS damaged

2018-07-13 Thread Adam Tygart
Bluestore.

On Fri, Jul 13, 2018, 05:56 Dan van der Ster  wrote:

> Hi Adam,
>
> Are your osds bluestore or filestore?
>
> -- dan
>
>
> On Fri, Jul 13, 2018 at 7:38 AM Adam Tygart  wrote:
> >
> > I've hit this today with an upgrade to 12.2.6 on my backup cluster.
> > Unfortunately there were issues with the logs (in that the files
> > weren't writable) until after the issue struck.
> >
> > 2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
> > [ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on
> > 5:aa448500:::500.:head
> >
> > It is a backup cluster and I can keep it around or blow away the data
> > (in this instance) as needed for testing purposes.
> >
> > --
> > Adam
> >
> > On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
> >  wrote:
> > > Some progress, and more pain...
> > >
> > > I was able to recover the 200. using the ceph-objectstore-tool
> for
> > > one of the OSDs (all identical copies) but trying to re-inject it just
> with
> > > rados put was giving no error while the get was still giving the same
> I/O
> > > error. So the solution was to rm the object and then put it again, that
> > > worked.
> > >
> > > However, after restarting one of the MDSes and setting it to repaired,
> I've
> > > hit another, similar problem:
> > >
> > >
> > > 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log
> [ERR] :
> > > error reading table object 'mds0_inotable' -5 ((5) Input/output error)
> > >
> > >
> > > Can I safely try to do the same as for object 200.? Should I
> check
> > > something before trying it? Again, checking the copies of the object,
> they
> > > have identical md5sums on all the replicas.
> > >
> > > Thanks,
> > >
> > >
> > > Alessandro
> > >
> > >
> > > Il 12/07/18 16:46, Alessandro De Salvo ha scritto:
> > >
> > > Unfortunately yes, all the OSDs were restarted a few times, but no
> change.
> > >
> > > Thanks,
> > >
> > >
> > > Alessandro
> > >
> > >
> > > Il 12/07/18 15:55, Paul Emmerich ha scritto:
> > >
> > > This might seem like a stupid suggestion, but: have you tried to
> restart the
> > > OSDs?
> > >
> > > I've also encountered some random CRC errors that only showed up when
> trying
> > > to read an object,
> > > but not on scrubbing, that magically disappeared after restarting the
> OSD.
> > >
> > > However, in my case it was clearly related to
> > > https://tracker.ceph.com/issues/22464 which doesn't
> > > seem to be the issue here.
> > >
> > > Paul
> > >
> > > 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo
> > > :
> > >>
> > >>
> > >> Il 12/07/18 11:20, Alessandro De Salvo ha scritto:
> > >>
> > >>>
> > >>>
> > >>> Il 12/07/18 10:58, Dan van der Ster ha scritto:
> > >>>>
> > >>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum  >
> > >>>> wrote:
> > >>>>>
> > >>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
> > >>>>>  wrote:
> > >>>>>>
> > >>>>>> OK, I found where the object is:
> > >>>>>>
> > >>>>>>
> > >>>>>> ceph osd map cephfs_metadata 200.
> > >>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.'
> -> pg
> > >>>>>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18],
> p23)
> > >>>>>>
> > >>>>>>
> > >>>>>> So, looking at the osds 23, 35 and 18 logs in fact I see:
> > >>>>>>
> > >>>>>>
> > >>>>>> osd.23:
> > >>>>>>
> > >>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster)
> log
> > >>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected
> 0x9ef2b41b
> > >>>>>> on
> > >>>>>> 10:292cf221:::200.:head
> > >>>>>>
> > >>>>>>
> > >>>>>> osd.35:
> > >>>>

Re: [ceph-users] MDS damaged

2018-07-12 Thread Adam Tygart
I've hit this today with an upgrade to 12.2.6 on my backup cluster.
Unfortunately there were issues with the logs (in that the files
weren't writable) until after the issue struck.

2018-07-13 00:16:54.437051 7f5a0a672700 -1 log_channel(cluster) log
[ERR] : 5.255 full-object read crc 0x4e97b4e != expected 0x6cfe829d on
5:aa448500:::500.:head

It is a backup cluster and I can keep it around or blow away the data
(in this instance) as needed for testing purposes.
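
In case anyone wants to reproduce the replica comparison discussed below, this
is roughly how I'd pull each on-disk copy and compare checksums. Object name,
pool and OSD ids are placeholders, and each OSD has to be stopped while
ceph-objectstore-tool reads its store:

ceph osd map cephfs_metadata <object-name>      # shows the acting set
systemctl stop ceph-osd@<id>
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-<id> \
    '<object-name>' get-bytes /tmp/<object-name>.osd<id>
systemctl start ceph-osd@<id>
md5sum /tmp/<object-name>.osd*                  # repeat per OSD and compare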

--
Adam

On Thu, Jul 12, 2018 at 10:39 AM, Alessandro De Salvo
 wrote:
> Some progress, and more pain...
>
> I was able to recover the 200. using the ceph-objectstore-tool for
> one of the OSDs (all identical copies) but trying to re-inject it just with
> rados put was giving no error while the get was still giving the same I/O
> error. So the solution was to rm the object and then put it again, that
> worked.
>
> However, after restarting one of the MDSes and setting it to repaired, I've
> hit another, similar problem:
>
>
> 2018-07-12 17:04:41.999136 7f54c3f4e700 -1 log_channel(cluster) log [ERR] :
> error reading table object 'mds0_inotable' -5 ((5) Input/output error)
>
>
> Can I safely try to do the same as for object 200.? Should I check
> something before trying it? Again, checking the copies of the object, they
> have identical md5sums on all the replicas.
>
> Thanks,
>
>
> Alessandro
>
>
> Il 12/07/18 16:46, Alessandro De Salvo ha scritto:
>
> Unfortunately yes, all the OSDs were restarted a few times, but no change.
>
> Thanks,
>
>
> Alessandro
>
>
> Il 12/07/18 15:55, Paul Emmerich ha scritto:
>
> This might seem like a stupid suggestion, but: have you tried to restart the
> OSDs?
>
> I've also encountered some random CRC errors that only showed up when trying
> to read an object,
> but not on scrubbing, that magically disappeared after restarting the OSD.
>
> However, in my case it was clearly related to
> https://tracker.ceph.com/issues/22464 which doesn't
> seem to be the issue here.
>
> Paul
>
> 2018-07-12 13:53 GMT+02:00 Alessandro De Salvo
> :
>>
>>
>> Il 12/07/18 11:20, Alessandro De Salvo ha scritto:
>>
>>>
>>>
>>> Il 12/07/18 10:58, Dan van der Ster ha scritto:
>>>>
>>>> On Wed, Jul 11, 2018 at 10:25 PM Gregory Farnum 
>>>> wrote:
>>>>>
>>>>> On Wed, Jul 11, 2018 at 9:23 AM Alessandro De Salvo
>>>>>  wrote:
>>>>>>
>>>>>> OK, I found where the object is:
>>>>>>
>>>>>>
>>>>>> ceph osd map cephfs_metadata 200.
>>>>>> osdmap e632418 pool 'cephfs_metadata' (10) object '200.' -> pg
>>>>>> 10.844f3494 (10.14) -> up ([23,35,18], p23) acting ([23,35,18], p23)
>>>>>>
>>>>>>
>>>>>> So, looking at the osds 23, 35 and 18 logs in fact I see:
>>>>>>
>>>>>>
>>>>>> osd.23:
>>>>>>
>>>>>> 2018-07-11 15:49:14.913771 7efbee672700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.:head
>>>>>>
>>>>>>
>>>>>> osd.35:
>>>>>>
>>>>>> 2018-07-11 18:01:19.989345 7f760291a700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.:head
>>>>>>
>>>>>>
>>>>>> osd.18:
>>>>>>
>>>>>> 2018-07-11 18:18:06.214933 7fabaf5c1700 -1 log_channel(cluster) log
>>>>>> [ERR] : 10.14 full-object read crc 0x976aefc5 != expected 0x9ef2b41b
>>>>>> on
>>>>>> 10:292cf221:::200.:head
>>>>>>
>>>>>>
>>>>>> So, basically the same error everywhere.
>>>>>>
>>>>>> I'm trying to issue a repair of the pg 10.14, but I'm not sure if it
>>>>>> may
>>>>>> help.
>>>>>>
>>>>>> No SMART errors (the fileservers are SANs, in RAID6 + LVM volumes),
>>>>>> and
>>>>>> no disk problems anywhere. No relevant errors in syslogs, the hosts
>>>>>> are
>>>>>> just fine. I cannot exclude an error on the RAID controllers, but 2 of
>>>>>

Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-06 Thread Adam Tygart
I set this about 15 minutes ago, with the following:
ceph tell osd.* injectargs '--osd-recovery-max-single-start 1
--osd-recovery-max-active 1'
ceph osd unset noout
ceph osd unset norecover

I also set those settings in ceph.conf just in case the "not observed"
response was true.
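
For reference, this is roughly the ceph.conf form of those settings, plus a way
to confirm what a daemon is actually running with (admin socket paths assumed
to be the defaults, and the query run on the host holding that OSD):

# ceph.conf, picked up on the next restart
[osd]
osd recovery max single start = 1
osd recovery max active = 1

# check the live value on one OSD
ceph daemon osd.422 config get osd_recovery_max_active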

Things have been stable, no segfaults at all, and recovery is
happening. Thanks for your hard work on this. I'll follow-up if
anything else crops up.

--
Adam

On Fri, Apr 6, 2018 at 11:26 AM, Josh Durgin <jdur...@redhat.com> wrote:
> You should be able to avoid the crash by setting:
>
> osd recovery max single start = 1
> osd recovery max active = 1
>
> With that, you can unset norecover to let recovery start again.
>
> A fix so you don't need those settings is here:
> https://github.com/ceph/ceph/pull/21273
>
> If you see any other backtraces let me know - especially the
> complete_read_op one from http://tracker.ceph.com/issues/21931
>
> Josh
>
>
> On 04/05/2018 08:25 PM, Adam Tygart wrote:
>>
>> Thank you! Setting norecover has seemed to work in terms of keeping
>> the osds up. I am glad my logs were of use to tracking this down. I am
>> looking forward to future updates.
>>
>> Let me know if you need anything else.
>>
>> --
>> Adam
>>
>> On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin <jdur...@redhat.com> wrote:
>>>
>>> On 04/05/2018 08:11 PM, Josh Durgin wrote:
>>>>
>>>>
>>>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>>>
>>>>>
>>>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>>>> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering
>>>>> 14 or so of them to crash repeatedly for more than an hour.
>>>>>
>>>>> I've run another one of them with more logging, debug osd = 20; debug
>>>>> ms = 1 (definitely more than one crash in there):
>>>>> http://people.cs.ksu.edu/~mozes/ceph-osd.422.log
>>>>>
>>>>> Anyone have any thoughts? My cluster feels like it is getting more and
>>>>> more unstable by the hour...
>>>>
>>>>
>>>>
>>>> Thanks to your logs, I think I've found the root cause. It looks like a
>>>> bug in the EC recovery code that's triggered by EC overwrites. I'm
>>>> working
>>>> on a fix.
>>>>
>>>> For now I'd suggest setting the noout and norecover flags to avoid
>>>> hitting this bug any more by avoiding recovery. Backfilling with no
>>>> client
>>>> I/O would also avoid the bug.
>>>
>>>
>>>
>>> I forgot to mention the tracker ticket for this bug is:
>>> http://tracker.ceph.com/issues/23195
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Thank you! Setting norecover has seemed to work in terms of keeping
the osds up. I am glad my logs were of use to tracking this down. I am
looking forward to future updates.

Let me know if you need anything else.

--
Adam

On Thu, Apr 5, 2018 at 10:13 PM, Josh Durgin <jdur...@redhat.com> wrote:
> On 04/05/2018 08:11 PM, Josh Durgin wrote:
>>
>> On 04/05/2018 06:15 PM, Adam Tygart wrote:
>>>
>>> Well, the cascading crashes are getting worse. I'm routinely seeing
>>> 8-10 of my 518 osds crash. I cannot start 2 of them without triggering
>>> 14 or so of them to crash repeatedly for more than an hour.
>>>
>>> I've run another one of them with more logging, debug osd = 20; debug
>>> ms = 1 (definitely more than one crash in there):
>>> http://people.cs.ksu.edu/~mozes/ceph-osd.422.log
>>>
>>> Anyone have any thoughts? My cluster feels like it is getting more and
>>> more unstable by the hour...
>>
>>
>> Thanks to your logs, I think I've found the root cause. It looks like a
>> bug in the EC recovery code that's triggered by EC overwrites. I'm working
>> on a fix.
>>
>> For now I'd suggest setting the noout and norecover flags to avoid
>> hitting this bug any more by avoiding recovery. Backfilling with no client
>> I/O would also avoid the bug.
>
>
> I forgot to mention the tracker ticket for this bug is:
> http://tracker.ceph.com/issues/23195
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Well, the cascading crashes are getting worse. I'm routinely seeing
8-10 of my 518 osds crash. I cannot start 2 of them without triggering
14 or so of them to crash repeatedly for more than an hour.

I've run another one of them with more logging, debug osd = 20; debug
ms = 1 (definitely more than one crash in there):
http://people.cs.ksu.edu/~mozes/ceph-osd.422.log

Anyone have any thoughts? My cluster feels like it is getting more and
more unstable by the hour...

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] EC related osd crashes (luminous 12.2.4)

2018-04-05 Thread Adam Tygart
Hello all,

I'm having some stability issues with my ceph cluster at the moment.
Using CentOS 7, and Ceph 12.2.4.

I have osds that are segfaulting regularly. roughly every minute or
so, and it seems to be getting worse, now with cascading failures.

Backtraces look like this:
 ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
luminous (stable)
 1: (()+0xa3c611) [0x55cb9249c611]
 2: (()+0xf5e0) [0x7eff83b495e0]
 3: (std::list<boost::tuples::tuple,
std::allocator<boost::tuples::tuple >
>::list(std::list<boost::tuples::tuple,
std::allocator<boost::tuples::tuple > > const&)+0x3e) [0x55cb9225562e]
 4: (ECBackend::send_all_remaining_reads(hobject_t const&,
ECBackend::ReadOp&)+0x33b) [0x55cb92243bab]
 5: (ECBackend::handle_sub_read_reply(pg_shard_t, ECSubReadReply&,
RecoveryMessages*, ZTracer::Trace const&)+0x1074) [0x55cb92245184]
 6: (ECBackend::_handle_message(boost::intrusive_ptr)+0x1af)
[0x55cb9224fa2f]
 7: (PGBackend::handle_message(boost::intrusive_ptr)+0x50)
[0x55cb921545f0]
 8: (PrimaryLogPG::do_request(boost::intrusive_ptr&,
ThreadPool::TPHandle&)+0x59c) [0x55cb920c004c]
 9: (OSD::dequeue_op(boost::intrusive_ptr,
boost::intrusive_ptr, ThreadPool::TPHandle&)+0x3f9)
[0x55cb91f45f69]
 10: (PGQueueable::RunVis::operator()(boost::intrusive_ptr
const&)+0x57) [0x55cb921c2b57]
 11: (OSD::ShardedOpWQ::_process(unsigned int,
ceph::heartbeat_handle_d*)+0xfce) [0x55cb91f749de]
 12: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x839)
[0x55cb924e1089]
 13: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0x55cb924e3020]
 14: (()+0x7e25) [0x7eff83b41e25]
 15: (clone()+0x6d) [0x7eff82c3534d]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

When I start a crashed osd, it seems to cause cascading crashes in
other osds with the same backtrace. This is making it problematic to
keep my placement groups up and active.

A full (start to finish) log file is available here:
http://people.cs.ksu.edu/~mozes/ceph-osd.44.log

Anyone have any thoughts, or workarounds?

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Linux Meltdown (KPTI) fix and how it affects performance?

2018-01-11 Thread Adam Tygart
Some people are doing hyperconverged ceph, colocating qemu
virtualization with ceph-osds. It is relevant for a decent subset of
people here. Therefore knowledge of the degree of performance
degradation is useful.

--
Adam

On Thu, Jan 11, 2018 at 11:38 AM,  <c...@jack.fr.eu.org> wrote:
> I don't understand how all of this is related to Ceph
>
> Ceph runs on dedicated hardware, there is nothing there except Ceph, and
> the ceph daemons already have full power over ceph's data.
> And there is no random-code execution allowed on this node.
>
> Thus, Spectre & Meltdown are meaningless for Ceph nodes, and mitigations
> should be disabled
>
> Is this wrong ?
>
>
> On 01/11/2018 06:26 PM, Dan van der Ster wrote:
>>
>> Hi all,
>>
>> Is anyone getting useful results with your benchmarking? I've prepared
>> two test machines/pools and don't see any definitive slowdown with
>> patched kernels from CentOS [1].
>>
>> I wonder if Ceph will be somewhat tolerant of these patches, similarly
>> to what's described here:
>> http://www.scylladb.com/2018/01/07/cost-of-avoiding-a-meltdown/
>>
>> Cheers, Dan
>>
>> [1] Ceph v12.2.2, FileStore OSDs, kernels 3.10.0-693.11.6.el7.x86_64
>> vs the ancient 3.10.0-327.18.2.el7.x86_64
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] C++17 and C++ ABI on master

2018-01-08 Thread Adam C. Emerson
Good day,

I've just merged some changes into master that set us up to compile
with C++17. This will require a reasonably new compiler to build
master.

Due to a change in how 'noexcept' is handled (it is now part of the type
signature of a function), mangled symbol names of noexcept functions are
different, so if you have custom clients using the C++ libraries, you may
need to recompile.

Do not worry, there should be no change to the C ABI. Any C clients
should be unaffected.

Thank you.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3 bucket policys

2017-11-07 Thread Adam C. Emerson
On 07/11/2017, Simon Leinen wrote:
> Simon Leinen writes:
> > Adam C Emerson writes:
> >> On 03/11/2017, Simon Leinen wrote:
> >> [snip]
> >>> Is this supported by the Luminous version of RadosGW?
> 
> >> Yes! There's a few bugfixes in master that are making their way into
> >> Luminous, but Luminous has all the features at present.
> 
> > Does that mean it should basically work in 10.2.1?
> 
> Sorry, I meant to say "in 12.2.1"!!!

Yes! I believe so. There are some bug fixes not in there, but the
whole feature is basically there.


-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3 bucket policys

2017-11-06 Thread Adam C. Emerson
On 06/11/2017, nigel davies wrote:
> ok, I am using the Jewel version
> 
> when i try setting permissions using s3cmd or an php script using s3client
> 
> i get the error
> 
>  encoding="UTF-8"?>InvalidArgumenttest_bucket
> (truncated...)
>InvalidArgument (client):  -  encoding="UTF-8"?>InvalidArgumenttest_buckettx
> 
> a-005a005b91-109f-default109f-default-default
> 
> 
> 
> in the log on the s3 server i get
> 
> 2017-11-06 12:54:41.987704 7f67a9feb700  0 failed to parse input: {
> "Version": "2012-10-17",
> "Statement": [
> {
> "Sid": "usr_upload_can_write",
> "Effect": "Allow",
> "Principal": {"AWS": ["arn:aws:iam:::user/test"]},
> "Action": ["s3:ListBucket", "s3:PutObject"],
> "Resource": ["arn:aws:s3:::test_bucket"]
> }
> 2017-11-06 12:54:41.988219 7f67a9feb700  1 == req done
> req=0x7f67a9fe57e0 op status=-22 http_status=400 ==
> 
> 
> Any advice on this one

Well! If you upgrade to Luminous the advice I gave you will work
perfectly. Also Luminous has a bunch of awesome, wonderful new
features like Bluestore in it (and really what other enterprise
storage platform promises to color your data such a lovely hue?)

But, if you can't, I think something like:

s3cmd setacl s3://bucket_name --acl-grant=read:someuser
s3cmd setacl s3://bucket_name --acl-grant=write:differentuser

Should work. Other people than I know a lot more about ACLs.
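
And once you're on Luminous, the bucket-policy route would look roughly like
this - a sketch only, with the bucket and user names taken from your log above:

cat > policy.json <<'EOF'
{
  "Version": "2012-10-17",
  "Statement": [{
    "Sid": "usr_upload_can_write",
    "Effect": "Allow",
    "Principal": {"AWS": ["arn:aws:iam:::user/test"]},
    "Action": ["s3:ListBucket", "s3:PutObject"],
    "Resource": ["arn:aws:s3:::test_bucket", "arn:aws:s3:::test_bucket/*"]
  }]
}
EOF
s3cmd setpolicy policy.json s3://test_bucket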

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3 bucket policys

2017-11-03 Thread Adam C. Emerson
On 03/11/2017, Simon Leinen wrote:
[snip]
> Is this supported by the Luminous version of RadosGW?

Yes! There's a few bugfixes in master that are making their way into
Luminous, but Luminous has all the features at present.

> (Or even Jewel?)

No!

> Does this work with Keystone integration, i.e. can we refer to Keystone
> users as principals?

In principle probably. I haven't tried it and I don't really know much
about Keystone at present. It is hooked into the various
IdentityApplier classes and if RGW thinks a Keystone user is a 'user'
and you supply whatever RGW thinks its username is, then it should
work fine. I haven't tried it, though.

> Let's say there are many read-only users rather than just one.  Would we
> simply add a new clause under "Statement" for each such user, or is
> there a better way? (I understand that RadosGW doesn't support groups,
> which could solve this elegantly and efficiently.)

If you want to give a large number of users the same permissions, just
put them all in the Principal array.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@OFTC, Actinic@Freenode
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bucket policies in Luminous

2017-07-12 Thread Adam C. Emerson
Graham Allan Wrote:
> I thought I'd try out the new bucket policy support in Luminous. My goal
> was simply to permit access on a bucket to another user.
[snip]
> Thanks for any ideas,

It's probably the 'blank' tenant. I'll make up a test case to exercise
this and come up with a patch for it. Sorry about the trouble.

-- 
Senior Software Engineer   Red Hat Storage, Ann Arbor, MI, US
IRC: Aemerson@{RedHat, OFTC}
0x80F7544B90EDBFB9 E707 86BA 0C1B 62CC 152C  7C12 80F7 544B 90ED BFB9
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Checking the current full and nearfull ratio

2017-05-04 Thread Adam Carheden
How do I check the full ratio and nearfull ratio of a running cluster?

I know I can set 'mon osd full ratio' and 'mon osd nearfull ratio' in
the [global] section of ceph.conf. But things work fine without those
lines (uses defaults, obviously).

They can also be changed with `ceph tell mon.* injectargs
"--mon_osd_full_ratio .##"` and `ceph tell mon.* injectargs
"--mon_osd_nearfull_ratio .##"`, in which case the running cluster's
notion of full/nearfull wouldn't match ceph.conf.

How do I have monitors report the values they're currently running with?
(i.e. is there something like `ceph tell mon.* dumpargs...`?)

It seems like this should be a pretty basic question, but my Googlefoo
is failing me this morning.
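
For what it's worth, the admin socket looks like it can report the in-memory
values - a sketch, assuming the default socket paths and that it is run on a
monitor host whose mon id is the short hostname:

ceph daemon mon.$(hostname -s) config get mon_osd_full_ratio
ceph daemon mon.$(hostname -s) config get mon_osd_nearfull_ratio
# or dump everything the daemon is running with and grep
ceph daemon mon.$(hostname -s) config show | grep full_ratio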

For those who find this post and want to check how full their OSDs are
rather than checking the full/nearfull limits, `ceph osd df tree` seems
to be the hot ticket.


And as long as I'm posting, I may as well get my next question out of
the way. My minimally used 4-node, 16 OSD test cluster looks like this:
# ceph osd df tree

MIN/MAX VAR: 0.75/1.31  STDDEV: 0.84

When should one be concerned about imbalance? What values for
min/max/stddev represent problems where reweighting an OSD (or some other
action) is advisable? Is that the purpose of nearfull, or
does one need to monitor individual OSDs too?
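
In case it's useful context, the knob I'm aware of for imbalance is the
built-in reweight-by-utilization - a sketch only, I haven't run it on this
cluster, and the dry-run subcommand may not exist on older releases:

ceph osd test-reweight-by-utilization     # report only, change nothing
ceph osd reweight-by-utilization 110      # reweight OSDs more than 10% above mean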


-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-05-01 Thread Adam Carheden
Perfect. There's the answer, thanks. DWPD seems like an idiotic and
meaningless measurement, but the endurance figures on those data sheets
give the total TB or PB written, which is what I really want to see.

DC S3510:  0.56 TBW/GB of drive capacity
DC S3610:  6.60 TBW/GB of drive capacity
DC S3710: 20.00 TBW/GB of drive capacity

Strangely enough there seems to be quite a bit more variance by drive
size (larger drives being better) in the better drives. Possibly that's
just due to rounding of the number presented on the data sheet though.

Thanks
-- 
Adam Carheden
Systems Administrator - NCAR/RAL
x2753

On 05/01/2017 02:59 AM, Jens Dueholm Christensen wrote:
> Sorry for topposting, but..
> 
> The Intel 35xx drives are rated for a much lower DWPD (drive-writes-per-day) 
> than the 36xx or 37xx models.
> 
> Keep in mind that a single SSD that acts as journal for 5 OSDs will recieve 
> ALL writes for those 5 OSDs before the data is moved off to the OSDs actual 
> data drives.
> 
> This makes for quite a lot of writes, and along with the consumer/enterprise 
> advice others have written about, your SSD journal devices will recieve quite 
> a lot of writes over time.
> 
> The S3510 is rated for 0.3 DWPD for 5 years 
> (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3510-spec.html)
>  
> The S3610 is rated for 3 DWPD for 5 years  
> (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3610-spec.html)
>  
> The S3710 is rated for 10 DWPD for 5 years 
> (http://www.intel.com/content/www/us/en/solid-state-drives/ssd-dc-s3710-spec.html)
>  
> 
> A 480GB S3510 has no endurance left once you have written 0.275PB to it.
> A 480GB S3610 has no endurance left once you have written 3.7PB to it.
> A 400GB S3710 has no endurance left once you have written 8.3PB to it.
> 
> This makes for quite a lot of difference over time - even if a S3510 wil only 
> act as journal for 1 or 2 OSDs, it will wear out much much much faster than 
> others.
> 
> And I know I've used the xx10 models above, but the xx00 models have all been 
> replaced by those newer models now.
> 
> And yes, the xx10 models are using MLC NAND, but so were the xx00 models, 
> that have a proven trackrecord and delivers what Intel promised in the 
> datasheet.
> 
> You could try and take a look at some of the enterprise SSDs that Samsung has 
> launched.
> Price-wise they are very competitive to Intel, but I want to see (or at least 
> hear from others) if they can deliver what their datasheet promises.
> Samsungs consumer SSDs did not (840/850 Pro), so I'm only using S3710s in my 
> cluster.
> 
> 
> Before I created our own cluster some time ago, I found these threads from 
> the mailinglist regarding the exact same disks we had been expecting to use 
> (Samsung 840/850 Pro), that was quickly changed to Intel S3710s:
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-November/044258.html
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17369.html
> 
> A longish thread about Samsung consumer drives:
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000572.html
> - highlights from that thread:
>   - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000610.html
>   - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000611.html
>   - http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000798.html
> 
> Regards,
> Jens Dueholm Christensen
> Rambøll Survey IT
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Adam 
> Carheden
> Sent: Wednesday, April 26, 2017 5:54 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Sharing SSD journals and SSD drive choice
> 
> Thanks everyone for the replies.
> 
> I will be avoiding TLC drives, it was just something easy to benchmark
> with existing equipment. I hadn't thought of unscrupulous data durability
> lies or performance suddenly tanking in unpredictable ways. I guess it
> all comes down to trusting the vendor since it would be expensive in
> time and $$ to test for such things.
> 
> Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have "DC"
> prefixes and are listed in the Data Center section of their marketing
> pages, so I assume they'll all have the same quality underlying NAND.
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-27 Thread Adam Carheden

On 04/27/2017 12:46 PM, Alexandre DERUMIER wrote:
> 
>>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>>> the single drive leaves more bays free for OSD disks, but is there any
>>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>>> mean:
> 
> where do you see this price difference ?
> 
> for me , S3520 are around 25-30% cheaper than S3610

The price difference probably has to do with size. You can get 150G
S3520 for $104 on newegg. 480G is the smallest S3610 I can find, which
is $526. (And yes, I have heard all the warnings about bigger drives
being better for reliability due to wear leveling and performance due to
the write/erase cycle and all that even though you don't need the space
for the journal)

Considering $/GB is slightly better for the S3520, is 1 S3610 still
better than 4 S3520? (Ignoring that the S3520s occupy drive bays that
could be used for OSDs, of course.)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-26 Thread Adam Carheden
Thanks everyone for the replies.

I will be avoiding TLC drives, it was just something easy to benchmark
with existing equipment. I hadn't thought of unscrupulous data durability
lies or performance suddenly tanking in unpredictable ways. I guess it
all comes down to trusting the vendor since it would be expensive in
time and $$ to test for such things.

Any thoughts on multiple Intel 35XX vs a single 36XX/37XX? All have "DC"
prefixes and are listed in the Data Center section of their marketing
pages, so I assume they'll all have the same quality underlying NAND.

-- 
Adam Carheden


On 04/26/2017 09:20 AM, Chris Apsey wrote:
> Adam,
> 
> Before we deployed our cluster, we did extensive testing on all kinds of
> SSDs, from consumer-grade TLC SATA all the way to Enterprise PCI-E NVME
> Drives.  We ended up going with a ratio of 1x Intel P3608 PCI-E 1.6 TB
> to 12x HGST 10TB SAS3 HDDs.  It provided the best
> price/performance/density balance for us overall.  As a frame of
> reference, we have 384 OSDs spread across 16 nodes.
> 
> A few (anecdotal) notes:
> 
> 1. Consumer SSDs have unpredictable performance under load; write
> latency can go from normal to unusable with almost no warning. 
> Enterprise drives generally show much less load sensitivity.
> 2. Write endurance; while it may appear that having several
> consumer-grade SSDs backing a smaller number of OSDs will yield better
> longevity than an enterprise grade SSD backing a larger number of OSDs,
> the reality is that enterprise drives that use SLC or eMLC are generally
> an order of magnitude more reliable when all is said and done.
> 3. Power Loss protection (PLP).  Consumer drives generally don't do well
> when power is suddenly lost.  Yes, we should all have UPS, etc., but
> things happen.  Enterprise drives are much more tolerant of
> environmental failures.  Recovering from misplaced objects while also
> attempting to serve clients is no fun.
> 
> 
> 
> 
> 
> ---
> v/r
> 
> Chris Apsey
> bitskr...@bitskrieg.net
> https://www.bitskrieg.net
> 
> On 2017-04-26 10:53, Adam Carheden wrote:
>> What I'm trying to get from the list is /why/ the "enterprise" drives
>> are important. Performance? Reliability? Something else?
>>
>> The Intel was the only one I was seriously considering. The others were
>> just ones I had for other purposes, so I thought I'd see how they fared
>> in benchmarks.
>>
>> The Intel was the clear winner, but my tests did show that throughput
>> tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
>> it, all with osd op threads = 2, do the benchmarks below not show that
>> the Hynix would be a better choice (at least for performance)?
>>
>> Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
>> the single drive leaves more bays free for OSD disks, but is there any
>> other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
>> mean:
>>
>> a) fewer OSDs go down if the SSD fails
>>
>> b) better throughput (I'm speculating that the S3610 isn't 4 times
>> faster than the S3520)
>>
>> c) load spread across 4 SATA channels (I suppose this doesn't really
>> matter since the drives can't throttle the SATA bus).
>>
>>
>> -- 
>> Adam Carheden
>>
>> On 04/26/2017 01:55 AM, Eneko Lacunza wrote:
>>> Adam,
>>>
>>> What David said before about SSD drives is very important. I will tell
>>> you another way: use enterprise grade SSD drives, not consumer grade.
>>> Also, pay attention to endurance.
>>>
>>> The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7,
>>> and probably it isn't even the most suitable SATA SSD disk from Intel;
>>> better use S3610 o S3710 series.
>>>
>>> Cheers
>>> Eneko
>>>
>>> El 25/04/17 a las 21:02, Adam Carheden escribió:
>>>> On 04/25/2017 11:57 AM, David wrote:
>>>>> On 19 Apr 2017 18:01, "Adam Carheden" <carhe...@ucar.edu
>>>>> <mailto:carhe...@ucar.edu>> wrote:
>>>>>
>>>>>  Does anyone know if XFS uses a single thread to write to its
>>>>> journal?
>>>>>
>>>>>
>>>>> You probably know this but just to avoid any confusion, the journal in
>>>>> this context isn't the metadata journaling in XFS, it's a separate
>>>>> journal written to by the OSD daemons
>>>> Ha! I didn't know that.
>>>>
>>>>> I think the number of threads per OSD is controlled by the 'osd op
>>>>> threads' se

Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-26 Thread Adam Carheden
What I'm trying to get from the list is /why/ the "enterprise" drives
are important. Performance? Reliability? Something else?

The Intel was the only one I was seriously considering. The others were
just ones I had for other purposes, so I thought I'd see how they fared
in benchmarks.

The Intel was the clear winner, but my tests did show that throughput
tanked with more threads. Hypothetically, if I was throwing 16 OSDs at
it, all with osd op threads = 2, do the benchmarks below not show that
the Hynix would be a better choice (at least for performance)?

Also, 4 x Intel DC S3520 costs as much as 1 x Intel DC S3610. Obviously
the single drive leaves more bays free for OSD disks, but is there any
other reason a single S3610 is preferable to 4 S3520s? Wouldn't 4xS3520s
mean:

a) fewer OSDs go down if the SSD fails

b) better throughput (I'm speculating that the S3610 isn't 4 times
faster than the S3520)

c) load spread across 4 SATA channels (I suppose this doesn't really
matter since the drives can't throttle the SATA bus).


-- 
Adam Carheden

On 04/26/2017 01:55 AM, Eneko Lacunza wrote:
> Adam,
> 
> What David said before about SSD drives is very important. I will tell
> you another way: use enterprise grade SSD drives, not consumer grade.
> Also, pay attention to endurance.
> 
> The only suitable drive for Ceph I see in your tests is SSDSC2BB150G7,
> and probably it isn't even the most suitable SATA SSD disk from Intel;
> better use S3610 o S3710 series.
> 
> Cheers
> Eneko
> 
> El 25/04/17 a las 21:02, Adam Carheden escribió:
>> On 04/25/2017 11:57 AM, David wrote:
>>> On 19 Apr 2017 18:01, "Adam Carheden" <carhe...@ucar.edu
>>> <mailto:carhe...@ucar.edu>> wrote:
>>>
>>>  Does anyone know if XFS uses a single thread to write to its
>>> journal?
>>>
>>>
>>> You probably know this but just to avoid any confusion, the journal in
>>> this context isn't the metadata journaling in XFS, it's a separate
>>> journal written to by the OSD daemons
>> Ha! I didn't know that.
>>
>>> I think the number of threads per OSD is controlled by the 'osd op
>>> threads' setting which defaults to 2
>> So the ideal (for performance) CEPH cluster would be one SSD per HDD
>> with 'osd op threads' set to whatever value fio shows as the optimal
>> number of threads for that drive then?
>>
>>> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
>>> consider going up to a 37xx and putting more OSDs on it. Of course with
>>> the caveat that you'll lose more OSDs if it goes down.
>> Why would you avoid the SanDisk and Hynix? Reliability (I think those
>> two are both TLC)? Brand trust? If it's my benchmarks in my previous
>> email, why not the Hynix? It's slower than the Intel, but sort of
>> decent, at least compared to the SanDisk.
>>
>> My final numbers are below, including an older Samsung Evo (MCL I think)
>> which did horribly, though not as bad as the SanDisk. The Seagate is a
>> 10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison.
>>
>>   SanDisk SDSSDA240G, fio  1 jobs:   7.0 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio  2 jobs:   7.6 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio  8 jobs:   7.6 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio 16 jobs:   7.6 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s (5 trials)
>>   SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.2 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio  2 jobs:   0.6 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio  4 jobs:   7.5 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio  8 jobs:  17.6 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio 16 jobs:  32.4 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio 32 jobs:  64.4 MB/s (5 trials)
>> HFS250G32TND-N1A2A 3P10, fio 64 jobs:  71.6 MB/s (5 trials)
>>  SAMSUNG SSD, fio  1 jobs:   2.2 MB/s (5 trials)
>>  SAMSUNG SSD, fio  2 jobs:   3.9 MB/s (5 trials)
>>  SAMSUNG SSD, fio  4 jobs:   7.1 MB/s (5 trials)
>>  SAMSUNG SSD, fio  8 jobs:  12.0 MB/s (5 trials)

[ceph-users] Race Condition(?) in CephFS

2017-04-25 Thread Adam Tygart
I'm using CephFS, on CentOS 7. We're currently migrating away from
using a catch-all cephx key to mount the filesystem (with the kernel
module), to a much more restricted key.

In my tests, I've come across an issue, extracting a tar archive with
a mount using the restricted key routinely cannot create files or
directories in recently created directories. I need to keep running a
CentOS based kernel on the clients because of some restrictions from
other software. Below looks like a race condition to me, although I am
not versed well enough in Ceph or the inner workings of the kernel to
know for sure.

# tar xf gmp-6.1.2.tar.lz -C /homes/mozes/tmp/
tar: gmp-6.1.2/mpn/x86_64/mulx/adx/addmul_1.asm: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors

This gets worse with tracing turned on in the kernel. (echo module
ceph +p > /sys/kernel/debug/dynamic_debug/control)

# tar xf gmp-6.1.2.tar.lz -C /homes/mozes/tmp/
tar: gmp-6.1.2/mpn/x86_64/mulx/adx: Cannot mkdir: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/mul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/adx: Cannot mkdir: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/adx/addmul_1.asm: Cannot open: No such
file or directory
tar: gmp-6.1.2/mpn/x86_64/coreinhm/popcount.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/aorrlsh_n.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/sec_tabselect.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/redc_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/hamdist.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/copyd.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/copyi.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/popcount.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/gcd_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/dive_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/redc_1.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mullo_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/fat_entry.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mod_1.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/redc_2.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/sqr_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/fat.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mul_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mullo_basecase.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mul_basecase.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/redc_1.asm: Cannot open: Permission denied

While extracting the same tar file using an unrestricted key works correctly.

I've got some kernel traces to share if anyone is interested.
https://people.beocat.ksu.edu/~mozes/ceph-20170425/

# uname -a
Linux eunomia 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We're currently running Ceph Jewel (10.2.5). We're looking to update
soon, but we wanted a clean backup of everything in CephFS first.

The new restricted key has these permissions:
caps mds = "allow r, allow rw path=/homes, allow rw
path=/bulk, allow rw path=/beocat"
caps mon = "allow r"
caps osd = "allow rw pool=scratch, allow rw pool=bulk, allow
rw pool=homes"

While the unrestricted key has these permissions:
caps mds = "allow"
caps mon = "allow *"
caps osd = "allow *"
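
For completeness, the restricted key was created with something along these
lines (the client name here is a placeholder; the caps are the ones quoted
above):

ceph auth get-or-create client.restricted \
    mds 'allow r, allow rw path=/homes, allow rw path=/bulk, allow rw path=/beocat' \
    mon 'allow r' \
    osd 'allow rw pool=scratch, allow rw pool=bulk, allow rw pool=homes'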

I would appreciate any insights anyone might have.

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-25 Thread Adam Carheden
On 04/25/2017 11:57 AM, David wrote:
> On 19 Apr 2017 18:01, "Adam Carheden" <carhe...@ucar.edu
> <mailto:carhe...@ucar.edu>> wrote:
> 
> Does anyone know if XFS uses a single thread to write to its journal?
> 
> 
> You probably know this but just to avoid any confusion, the journal in
> this context isn't the metadata journaling in XFS, it's a separate
> journal written to by the OSD daemons

Ha! I didn't know that.

> 
> I think the number of threads per OSD is controlled by the 'osd op
> threads' setting which defaults to 2

So the ideal (for performance) CEPH cluster would be one SSD per HDD
with 'osd op threads' set to whatever value fio shows as the optimal
number of threads for that drive then?
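
If I go that route, I assume it's just a matter of something like the
following - the option name is the one you mentioned, and I haven't verified
how it interacts with the newer sharded op queue:

ceph daemon osd.0 config get osd_op_threads   # value an OSD is running with
# and the ceph.conf form, picked up on restart
[osd]
osd op threads = 4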

> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
> consider going up to a 37xx and putting more OSDs on it. Of course with
> the caveat that you'll lose more OSDs if it goes down. 

Why would you avoid the SanDisk and Hynix? Reliability (I think those
two are both TLC)? Brand trust? If it's my benchmarks in my previous
email, why not the Hynix? It's slower than the Intel, but sort of
decent, at least compared to the SanDisk.

My final numbers are below, including an older Samsung Evo (MCL I think)
which did horribly, though not as bad as the SanDisk. The Seagate is a
10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison.

 SanDisk SDSSDA240G, fio  1 jobs:   7.0 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  2 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  8 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 16 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.2 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   0.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   7.5 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  17.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 16 jobs:  32.4 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  64.4 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  71.6 MB/s (5 trials)
SAMSUNG SSD, fio  1 jobs:   2.2 MB/s (5 trials)
SAMSUNG SSD, fio  2 jobs:   3.9 MB/s (5 trials)
SAMSUNG SSD, fio  4 jobs:   7.1 MB/s (5 trials)
SAMSUNG SSD, fio  8 jobs:  12.0 MB/s (5 trials)
SAMSUNG SSD, fio 16 jobs:  18.3 MB/s (5 trials)
SAMSUNG SSD, fio 32 jobs:  25.4 MB/s (5 trials)
SAMSUNG SSD, fio 64 jobs:  26.5 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  1 jobs:  91.2 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  2 jobs: 132.4 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  4 jobs: 138.2 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  8 jobs: 116.9 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 16 jobs:  61.8 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 32 jobs:  22.7 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 64 jobs:  16.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio  1 jobs:   0.7 MB/s (5 trials)
SEAGATE ST9300603SS, fio  2 jobs:   0.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio  4 jobs:   1.6 MB/s (5 trials)
SEAGATE ST9300603SS, fio  8 jobs:   2.0 MB/s (5 trials)
SEAGATE ST9300603SS, fio 16 jobs:   4.6 MB/s (5 trials)
SEAGATE ST9300603SS, fio 32 jobs:   6.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio 64 jobs:   0.6 MB/s (5 trials)

For those who come across this and are looking for drives for purposes
other than CEPH, those are all sequential write numbers with caching
disabled, a very CEPH-journal-specific test. The SanDisk held its own
against the Intel using some benchmarks on Windows that didn't disable
caching. It may very well be a perfectly good drive for other purposes.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Sharing SSD journals and SSD drive choice

2017-04-19 Thread Adam Carheden
Does anyone know if XFS uses a single thread to write to its journal?

I'm evaluating SSDs to buy as journal devices. I plan to have multiple
OSDs share a single SSD for journal. I'm benchmarking several brands as
described here:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
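
Concretely, the runs below use an fio invocation along the lines of the one in
that post (device path and numjobs are placeholders - and note it writes
directly to the device, so it destroys whatever is on it):

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=4 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test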

It appears that sequential write speed using multiple threads varies
widely between brands. Here's what I have so far:
 SanDisk SDSSDA240G, dd:6.8 MB/s
 SanDisk SDSSDA240G, fio  1 jobs:   6.7 MB/s
 SanDisk SDSSDA240G, fio  2 jobs:   7.4 MB/s
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio  8 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 16 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s
HFS250G32TND-N1A2A 3P10, dd:1.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   5.2 MB/s
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   9.5 MB/s
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  23.4 MB/s
HFS250G32TND-N1A2A 3P10, fio 16 jobs:   7.2 MB/s
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  49.8 MB/s
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  70.5 MB/s
INTEL SSDSC2BB150G7, dd:   90.1 MB/s
INTEL SSDSC2BB150G7, fio  1 jobs:  91.0 MB/s
INTEL SSDSC2BB150G7, fio  2 jobs: 108.3 MB/s
INTEL SSDSC2BB150G7, fio  4 jobs: 134.2 MB/s
INTEL SSDSC2BB150G7, fio  8 jobs: 118.2 MB/s
INTEL SSDSC2BB150G7, fio 16 jobs:  39.9 MB/s
INTEL SSDSC2BB150G7, fio 32 jobs:  25.4 MB/s
INTEL SSDSC2BB150G7, fio 64 jobs:  15.8 MB/s

The SanDisk is slow, but speed is the same at any number of threads. The
Intel peaks at 4-6 threads and then declines rapidly into sub-par
performance (at least for a pricey "enterprise" drive). The SK Hynix is
slow at low numbers of threads but gets huge performance gains with more
threads. (This is all with one trial, but I have a script running
multiple trials across all drives today.)

So if XFS has a single thread that does journaling, it looks like my
best option would be 1 intel SSD shared by 4-6 OSDs. If XFS already
throws multiple threads at the journal, then having OSDs share an Intel
drive will likely kill my SSD performance, but having as many OSDs as I
can cram in a chassis share the SK Hynix drive would get me great
performance for a fraction of the cost.

Anyone have any related advice or experience to share regarding journal
SSD selection?
-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding a new rack to crush map without pain?

2017-04-18 Thread Adam Tygart
Ceph has the ability to use a script to figure out where in the
crushmap this disk should go (on osd start):
http://docs.ceph.com/docs/master/rados/operations/crush-map/#ceph-crush-location-hook
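
A minimal sketch of such a hook - the rack value is a placeholder, and a real
script would derive it from your own host naming or an inventory lookup:

#!/bin/sh
# referenced from ceph.conf with:  crush location hook = /usr/local/bin/crush-location
# ceph calls it with --cluster/--id/--type; print the CRUSH location on stdout
echo "host=$(hostname -s) rack=rack4 root=default"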

--
Adam

On Tue, Apr 18, 2017 at 7:53 AM, Matthew Vernon <m...@sanger.ac.uk> wrote:
> On 17/04/17 21:16, Richard Hesse wrote:
>> I'm just spitballing here, but what if you set osd crush update on start
>> = false ? Ansible would activate the OSD's but not place them in any
>> particular rack, working around the ceph.conf problem you mentioned.
>> Then you could place them in your CRUSH map by hand. I know you wanted
>> to avoid editing the CRUSH map by hand, but it's usually the safest route.
>
> It scales really badly - "edit CRUSH map by hand" isn't really something
> that I can automate; presumably something could be lashed up with ceph
> osd crush add-bucket and ceph osd set ... but that feels more like a
> lash-up and less like a properly-engineered solution to what must be a
> fairly common problem?
>
> Regards,
>
> Matthew
>
>> On Wed, Apr 12, 2017 at 4:46 PM, Matthew Vernon <m...@sanger.ac.uk
>> <mailto:m...@sanger.ac.uk>> wrote:
>>
>> Hi,
>>
>> Our current (jewel) CRUSH map has rack / host / osd (and the default
>> replication rule does step chooseleaf firstn 0 type rack). We're shortly
>> going to be adding some new hosts in new racks, and I'm wondering what
>> the least-painful way of getting the new osds associated with the
>> correct (new) rack will be.
>>
>> We deploy with ceph-ansible, which can add bits of the form
>> [osd.104]
>> osd crush location = root=default rack=1 host=sto-1-1
>>
>> to ceph.conf, but I think this doesn't help for new osds, since
>> ceph-disk will activate them before ceph.conf is fully assembled (and
>> trying to arrange it otherwise would be serious hassle).
>>
>> Would making a custom crush location hook be the way to go? then it'd
>> say rack=4 host=sto-4-x and new osds would end up allocated to rack 4?
>> And would I need to have done ceph osd crush add-bucket rack4 rack
>> first, presumably?
>>
>> I am planning on adding osds to the cluster one box at a time, rather
>> than going with the add-everything-at-crush-weight-0 route; if nothing
>> else it seems easier to automate. And I'd rather avoid having to edit
>> the crush map directly...
>>
>> Any pointers welcomed :)
>>
>> Regards,
>>
>> Matthew
>>
>>
>> --
>>  The Wellcome Trust Sanger Institute is operated by Genome Research
>>  Limited, a charity registered in England with number 1021457 and a
>>  company registered in England with number 2742969, whose registered
>>  office is 215 Euston Road, London, NW1 2BE.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> <http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com>
>>
>>
>
>
>
> --
>  The Wellcome Trust Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is redundancy across failure domains guaranteed or best effort?

2017-04-14 Thread Adam Carheden
Thanks for your replies.

I think the short version is "guaranteed": CEPH will always either store
'size' copies of your data or set health to a WARN and/or ERR state to
let you know that it can't. I think that's probably the most desirable
answer.
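
A quick way to see that WARN state and the affected PGs when it does happen,
using just the standard tooling (nothing cluster-specific assumed):

ceph health detail                                # lists degraded/undersized PGs
ceph pg dump pgs_brief | grep -v 'active+clean'   # anything not fully clean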

-- 
Adam Carheden

On 04/14/2017 09:51 AM, David Turner wrote:
> If you have Replica size 3, your failure domain is host, and you have 3
> servers... you will NEVER have 2 copies of the data on 1 server.  If you
> weight your OSDs poorly on one of your servers, then one of the drives
> will fill up to the full ratio in its config and stop receiving writes. 
> You should always monitor your OSDs so that you can fix the weights
> before an OSD becomes nearfull and definitely so that the OSD never
> reaches the FULL setting and stops receiving writes.  Note that when it
> stops receiving writes, it will block the write requests and until it
> has space to fulfill the write and the cluster will be stuck.
> 
> Also to truly answer your question, if you had Replica size 3, your
> failure domain is host, and you only have 2 servers in your cluster...
> You will only be storing 2 copies of data and every single PG in your
> cluster will be degraded.  Ceph will never breach the boundary of your
> failure domain.
> 
> When dealing with 3 node clusters you want to be careful to never fill
> up your cluster past a % where you can lose a drive in one of your
> nodes.  For example, if you have 3 nodes with 3x 4TB drives in each and
> you lose a drive... the other 2 OSDs in that node need to be able to
> take the data from the dead drive without going over 80% (the default
> nearfull setting).  So in this scenario you shouldn't fill the cluster
> to be more than 53% unless you're planning to tell the cluster not to
> backfill until the dead OSD is replaced.
> 
> I will never recommend anyone to go into production with a cluster
> smaller than N+2 your replica size of failure domains.  So if you have
> the default Replica size of 3, then you should go into production with
> at least 5 servers.  This gives you enough failure domains to be able to
> handle drive failures without the situation being critical.
> 
> On Fri, Apr 14, 2017 at 11:25 AM Adam Carheden <carhe...@ucar.edu
> <mailto:carhe...@ucar.edu>> wrote:
> 
> Is redundancy across failure domains guaranteed or best effort?
> 
> Note: The best answer to the questions below is obviously to avoid the
> situation by properly weighting drives and not approaching the full ratio.
> I'm just curious how CEPH works.
> 
> Hypothetical situation:
> Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say you
> weighted the OSDs poorly such that the OSDs on one server filled up but
> the OSDs on the others still had space. CEPH could still store 3
> replicas of your data, but two of them would be on the same server. What
> happens?
> 
> (select all that apply)
> a.[ ] Clients can still read data
> b.[ ] Clients can still write data
> c.[ ] health = HEALTH_WARN
> d.[ ] health = HEALTH_OK
> e.[ ] PGs are degraded
> f.[ ] ceph stores only two copies of data
> g.[ ] ceph stores 3 copies of data, two of which are on the same server
> h.[ ] something else?
> 
> If the answer is "best effort" (a+b+d+g), how would you detect if that
> scenario is occurring?
> 
> If the answer is "guaranteed" (f+e+c+...) and you lose a drive while in
> that scenario, is there any way to tell CEPH to temporarily store
> 2 copies on a single server just in case? I suspect the answer is to
>     remove host bucket from the crushmap but that that's a really bad idea
> because it would trigger a rebuild and the extra disk activity increases
> the likelihood of additional drive failures, correct?
> 
> --
> Adam Carheden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is redundancy across failure domains guaranteed or best effort?

2017-04-14 Thread Adam Carheden
Is redundancy across failure domains guaranteed or best effort?

Note: The best answer to the questions below is obviously to avoid the
situation by properly weighting drives and not approaching the full ratio.
I'm just curious how CEPH works.

Hypothetical situation:
Say you have 1 pool of size=3 and 3 servers, each with 2 OSDs. Say you
weighted the OSDs poorly such that the OSDs on one server filled up but
the OSDs on the others still had space. CEPH could still store 3
replicas of your data, but two of them would be on the same server. What
happens?

(select all that apply)
a.[ ] Clients can still read data
b.[ ] Clients can still write data
c.[ ] health = HEALTH_WARN
d.[ ] health = HEALTH_OK
e.[ ] PGs are degraded
f.[ ] ceph stores only two copies of data
g.[ ] ceph stores 3 copies of data, two of which are on the same server
h.[ ] something else?

If the answer is "best effort" (a+b+d+g), how would you detect if that
scenario is occurring?

If the answer is "guaranteed" (f+e+c+...) and you loose a drive while in
that scenario, is there any way to tell CEPH to store temporarily store
2 copies on a single server just in case? I suspect the answer is to
remove host bucket from the crushmap but that that's a really bad idea
because it would trigger a rebuild and the extra disk activity increases
the likelihood of additional drive failures, correct?

-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Degraded: OSD failure vs crushmap change

2017-04-14 Thread Adam Carheden
Is there a difference between the degraded states triggered by an OSD
failure vs a crushmap change?

When an OSD fails the cluster is obviously degraded in the sense that
you have fewer copies of your data than the pool size mandates.

But when you change the crush map, say by adding an OSD, ceph also
reports HEALTH_WARN and degraded PGs. This makes sense in that data
isn't where it's supposed to be, but you do still have sufficient copies
of your data in the previous locations.

So what happens if you add an OSD and some multi-OSD failure occurs
such that you have zero copies of data at the target location (i.e.
crush map including the new OSD) but you still have a copy of the data in
the old location (i.e. crushmap before adding the new OSD)? Is CEPH
smart enough to pull the data from the old location, or do you lose data?

-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] slow perfomance: sanity check

2017-04-06 Thread Adam Carheden
60-80MB/s for what sort of setup? Is that 1Gbe rather than 10Gbe?

I consistently get 80-90MB/s bandwidth as measured by `rados bench -p
rbd 10 write` run from a ceph node on a cluster with:
* 3 nodes
* 4 OSD/node, 600GB 15kRPM SAS disks
* 1GB of disk controller write cache shared by all disks in each node
* No SSDs
* 2x1Gbe lacp bond for redundancy, no jumbo frames
* 512 PGs for a cluster of 12 OSDs
* All disks in one pool of size=3, min_size=2

IOzone run on a VM using an rbd as its HD confirms that setup maxes out
at just under 100 MB/s for best-case scenarios, so I assumed the
1Gb network was the bottleneck.

I'm in the process of planning a hardware purchase for a larger cluster:
more nodes, more drives, SSD journals and 10Gbe. I'm assuming I'll get
better performance.

What's the upper bound on CEPH performance for large sequential writes
from a single client with all the recommended bells and whistles (ssd
journal, 10Gbe)? I assume it depends on both the total number of OSDs
and possibly OSDs per node if one had enough to saturate the network,
correct?


-- 
Adam Carheden

On 04/06/2017 12:29 PM, Mark Nelson wrote:
> With filestore on XFS using SSD journals that have good O_DSYNC write
> performance, we typically see between 60-80MB/s per disk before
> replication for large object writes.  This is assuming there are no
> other bottlenecks or things going on though (pg splitting, recovery,
> network issues, etc).  Probably the best case scenario would be large
> writes to an RBD volume with 4MB objects and enough PGs in the pool that
> splits never need to happen.
> 
> Having said that, on setups where some of the drives are slow, the
> network is misconfigured, there are too few PGs, there are too many
> drives on one controller, or other issues, 25-30MB/s per disk is
> certainly possible.
> 
> Mark
> 
> On 04/06/2017 10:05 AM, Stanislav Kopp wrote:
>> I've reduced OSDs to 12 and moved the journals to ssd drives and now have
>> a "boost" with writes up to ~33-35MB/s. Is that the maximum without full ssd
>> pools?
>>
>> Best,
>> Stan
>>
>> 2017-04-06 9:34 GMT+02:00 Stanislav Kopp <stask...@gmail.com>:
>>> Hello,
>>>
>>> I'm evaluating a ceph cluster, to see if we can use it for our
>>> virtualization solution (proxmox). I'm using 3 nodes, running Ubuntu
>>> 16.04 with stock ceph (10.2.6), every OSD uses separate 8 TB spinning
>>> drive (XFS), MONITORs are installed on the same nodes, all nodes are
>>> connected via 10G switch.
>>>
>>> The problem is, on the client I have only ~25-30 MB/s with seq. write (dd
>>> with "oflag=direct"). Proxmox uses Firefly, which is old, I know.  But
>>> I have the same performance on my desktop running the same version as
>>> ceph nodes using rbd mount, iperf shows full speed (1GB or 10GB up to
>>> client).
>>> I know that this setup is not optimal and for production I will use
>>> separate MON nodes and ssd for OSDs, but was wondering is this
>>> performance still normal. This is my cluster status.
>>>
>>>  cluster 3ea55c7e-5829-46d0-b83a-92c6798bde55
>>>  health HEALTH_OK
>>>  monmap e5: 3 mons at
>>> {ceph01=10.1.8.31:6789/0,ceph02=10.1.8.32:6789/0,ceph03=10.1.8.33:6789/0}
>>>
>>> election epoch 60, quorum 0,1,2 ceph01,ceph02,ceph03
>>>  osdmap e570: 42 osds: 42 up, 42 in
>>> flags sortbitwise,require_jewel_osds
>>>   pgmap v14784: 1024 pgs, 1 pools, 23964 MB data, 6047 objects
>>> 74743 MB used, 305 TB / 305 TB avail
>>> 1024 active+clean
>>>
>>> btw, bench on nodes itself looks good as far I see.
>>>
>>> ceph01:~# rados bench -p rbd 10 write
>>> 
>>> Total time run: 10.159667
>>> Total writes made:  1018
>>> Write size: 4194304
>>> Object size:4194304
>>> Bandwidth (MB/sec): 400.801
>>> Stddev Bandwidth:   38.2018
>>> Max bandwidth (MB/sec): 472
>>> Min bandwidth (MB/sec): 344
>>> Average IOPS:   100
>>> Stddev IOPS:9
>>> Max IOPS:   118
>>> Min IOPS:   86
>>> Average Latency(s): 0.159395
>>> Stddev Latency(s):  0.110994
>>> Max latency(s): 1.1069
>>> Min latency(s): 0.0432668
>>>
>>>
>>> Thanks,
>>> Stan
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How do I mix drive sizes in a CEPH cluster?

2017-03-30 Thread Adam Carheden
When mixing hard drives of different sizes, what are the advantages
and disadvantages of one big pool vs multiple pools with matching
drives within each pool?

-= Long Story =-
Using a mix of new and existing hardware, I'm going to end up with
10x8T HDD and 42x600G@15krpm HDD. I can distribute drives evenly among
5 nodes for the 8T drives and among 7 nodes for the 600G drives. All
drives will have journals on SSD. 2x10G LAG for the ceph network.
Usage will be rbd for VMs.

Is the following correct?

-= 1 big pool =-
* Should work fine, but performance is in question
* Smaller I/O could be inconsistent when under load. Normally small
writes will all go to the SSDs, but under a load that saturates the SSDs,
smaller writes may be slower if the bits happen to be on the slower 8T
drives.
* Larger I/O should get the average performance off all drives
assuming images are created with appropriate striping
* Rebuilds will be bottle-necked by the 8T drives

-= 2 pools with matching disks =-
* Should work fine
* Smaller I/O should be the same for both pools due to SSD journals
* Larger I/O will be faster for pool with 600G@15krpm drives due both
to drive speed and count
* Larger I/O will be slower for pool with 8T drives for the same reasons
* Rebuilds will be significantly faster on the 600G/42-drive pool

Is either configuration a bad idea, or is it just a matter of my
space/speed needs?

It should be possible to have 3 pools:
1) 8T only (slow pool)
2) 600G only (fast pool)
3) all OSDs (medium speed pool)
...but the rebuild would impact performance on the "fast" 600G drive
pool if an 8T drive failed, since the medium speed pool would be
rebuilding across all drives, correct?
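
(For what it's worth, keeping pools on separate drive types comes down to
giving each type its own CRUSH root and rule. A rough sketch follows; the
bucket/rule names are made up, and newer releases have CRUSH device classes
that make this less manual.)

# dump and decompile the current crushmap
ceph osd getcrushmap -o crush.bin
crushtool -d crush.bin -o crush.txt

# in crush.txt, add a root (say "fast") containing only the 600G OSDs,
# plus a rule that starts from it, along these lines:
#
#   rule fast {
#           ruleset 1
#           type replicated
#           min_size 1
#           max_size 10
#           step take fast
#           step chooseleaf firstn 0 type host
#           step emit
#   }

# recompile, inject, and create a pool that uses the new rule
crushtool -c crush.txt -o crush.new
ceph osd setcrushmap -i crush.new
ceph osd pool create fast 1024 1024 replicated fast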

Thanks
-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] How to check SMR vs PMR before buying disks?

2017-03-27 Thread Adam Carheden
What's the biggest PMR disk I can buy, and how do I tell if a disk is PMR?

I'm well aware that I shouldn't use SMR disks:
http://ceph.com/planet/do-not-use-smr-disks-with-ceph/

But newegg and the like don't seem to advertise SMR vs PMR and I can't
even find it on manufacturer's websites (at least not from Seagate).

Is there any way to tell? Is there a rule of thumb, such as "4T+ is
probably SMR" or "enterprise usually means PMR"?

Thanks
-- 
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Adam Carheden
On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen  wrote:

>> c. Reads can continue from the single online OSD even in pgs that
>> happened to have two of 3 osds offline.
>>
>
> Hypothetically (This is partially informed guessing on my part):
> If the survivor happens to be the acting primary and it were up-to-date at
> the time,
> it can in theory serve reads. (Only the primary serves reads).

It makes no sense that only the primary could serve reads. That would
mean that even if only a single OSD failed, all PGs for which that OSD
was primary would be unreadable.

There must be an algorithm to appoint a new primary. So in a 2 OSD
failure scenario, a new primary should be appointed after the first
failure, no? Would the final remaining OSD not appoint itself as
primary after the 2nd failure?

This makes sense in the context of CEPH's synchronous writes too. A
write isn't complete until all 3 OSDs in the PG have the data,
correct? So shouldn't any one of them be able to act as primary at any
time?

I don't see how that would change even if 2 of 3 OSDs fail at exactly
the same time.
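
(For what it's worth, the acting set and current primary of a PG are easy to
inspect, which is one way to watch a new primary being appointed after a
failure; the pool/object below are the ones from the original post in this
thread:)

# up set, acting set and primary ("p<osd>") for the object's pg
ceph osd map rbd rbd_id.vm-100-disk-1

# the same per pg, plus full peering detail if needed
ceph pg map 0.1ea
ceph pg 0.1ea query
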
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Adam Carheden
Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
fail. Are all of the following accurate?

a. An rbd is split into lots of objects, parts of which will probably
exist on all 4 hosts.

b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.

c. Reads can continue from the single online OSD even in pgs that
happened to have two of 3 osds offline.

d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
the min_size=2 constraint.

e. Rebalancing does not occur because with only two hosts online there
is no way for CRUSH to meet the size=3 constraint even if it were to
rebalance.

f. I/O can be restored by setting min_size=1 (see the commands after this list).

g. Alternatively, I/O can be restored by setting size=2, which would
kick off rebalancing and restore I/O as the pgs come into compliance
with the size=2 constraint.

h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
two hosts fail, some pgs would have only 1 OSD online, but rebalancing
would start immediately since CRUSH can honor the size=3 constraint by
rebalancing. This means more nodes makes for a more reliable cluster.

i. If I wanted to force CRUSH to bring I/O back online with size=3 and
min_size=2 but only 2 hosts online, I could remove the host bucket from
the crushmap. CRUSH would then rebalance, but some PGs would likely end
up with 3 OSDs all on the same host. (This is theory. I promise not to
do any such thing to a production system ;)
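
(The commands behind items f and g, using the rbd pool as elsewhere in this
thread:)

# what is undersized / blocked right now
ceph health detail
ceph pg dump_stuck undersized

# what the pool currently requires before serving writes
ceph osd pool get rbd min_size

# item f: accept writes with a single surviving replica (set it back to 2
# as soon as recovery finishes; one more failure means lost objects)
ceph osd pool set rbd min_size 1

# item g: alternatively shrink the pool, which also kicks off rebalancing
ceph osd pool set rbd size 2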

Thanks
-- 
Adam Carheden


On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> If you had set min_size to 1 you would not have seen the writes pause. a
> min_size of 1 is dangerous though because it means you are 1 hard disk
> failure away from losing the objects within that placement group
> entirely. a min_size of 2 is generally considered the minimum you want
> but many people ignore that advice, some wish they hadn't. 
> 
> On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carhe...@ucar.edu
> <mailto:carhe...@ucar.edu>> wrote:
> 
> Thanks everyone for the replies. Very informative. However, should I
> have expected writes to pause if I'd had min_size set to 1 instead of 2?
> 
> And yes, I was under the false impression that my rbd device was a
> single object. That explains what all those other things are on a test
> cluster where I only created a single object!
> 
> 
> --
> Adam Carheden
> 
> On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > This is because of the min_size specification. I would bet you have it
> > set at 2 (which is good).
> >
> > ceph osd pool get rbd min_size
> >
> > With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives, 1
> > from each host) results in some of the objects only having 1 replica.
> > min_size dictates that IO freezes for those objects until min_size is
> > achieved. 
> http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> 
> <http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas>
> >
> > I can't tell if you're under the impression that your RBD device is a
> > single object. It is not. It is chunked up into many objects and spread
> > throughout the cluster, as Kjetil mentioned earlier.
> >
> > On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com 
> <mailto:kje...@medallia.com>
> > <mailto:kje...@medallia.com <mailto:kje...@medallia.com>>> wrote:
> >
> > Hi,
> >
> > rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> > will get you a "prefix", which then gets you on to
> > rbd_header.<prefix>; rbd_header.<prefix> contains block size,
> > striping, etc. The actual data-bearing objects will be named
> > something like rbd_data.<prefix>.%016x.
> >
> > Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> > data object of that image will be named something like
> > rbd_data.86ce2ae8944a.0000, the second 86ce2ae8944a.0001, and so
> > on; chances are that one of these objects is mapped to a pg which
> > has both host3 and host4 among its replicas.
> >
> > An rbd image will end up scattered across most/all osds of the pool
> > it's in.
> >
> > Cheers,
> > -KJ
> >
> > On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu 
> <mailto:carhe...@ucar.edu>
> > <mailto:carhe...@ucar.edu <mailto:carhe...@ucar.edu>>> wrote:
>  

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Adam Carheden
Thanks everyone for the replies. Very informative. However, should I
have expected writes to pause if I'd had min_size set to 1 instead of 2?

And yes, I was under the false impression that my rbd device was a
single object. That explains what all those other things are on a test
cluster where I only created a single object!
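
(A quick way to see that chunking, using the image from this thread; the
prefix comes from `rbd info`, and a given data object only exists once
something has been written to that part of the image:)

# the data-object prefix of the image
rbd info vm-100-disk-1 | grep block_name_prefix

# list the image's data objects
rados -p rbd ls | grep rbd_data.86ce2ae8944a

# map any one of them to its pg and OSDs
ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000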


-- 
Adam Carheden

On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> This is because of the min_size specification. I would bet you have it
> set at 2 (which is good). 
> 
> ceph osd pool get rbd min_size
> 
> With 4 hosts, and a size of 3, removing 2 of the hosts (or 2 drives, 1
> from each host) results in some of the objects only having 1 replica.
> min_size dictates that IO freezes for those objects until min_size is
> achieved. 
> http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
> 
> I can't tell if you're under the impression that your RBD device is a
> single object. It is not. It is chunked up into many objects and spread
> throughout the cluster, as Kjetil mentioned earlier.
> 
> On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen <kje...@medallia.com
> <mailto:kje...@medallia.com>> wrote:
> 
> Hi,
> 
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents
> will get you a "prefix", which then gets you on to
> rbd_header.<prefix>; rbd_header.<prefix> contains block size,
> striping, etc. The actual data-bearing objects will be named
> something like rbd_data.<prefix>.%016x.
> 
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first
> data object of that image will be named something like
> rbd_data.86ce2ae8944a.0000, the second 86ce2ae8944a.0001, and so
> on; chances are that one of these objects is mapped to a pg which
> has both host3 and host4 among its replicas.
> 
> An rbd image will end up scattered across most/all osds of the pool
> it's in.
> 
> Cheers,
> -KJ
> 
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden <carhe...@ucar.edu
> <mailto:carhe...@ucar.edu>> wrote:
> 
> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
> 
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance
> just as
> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on
> the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2
> still
> have quorum, so that shouldn't be an issue. The placement group
> still
> has 2 of its 3 replicas online.
> 
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive.
> 
> 
> Size?
> # ceph osd pool get rbd size
> size: 3
> 
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0
> --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
> 
> # ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308 room A
> -3 1.88754 host host1
>  4 0.40369 osd.4   up  1.0  1.0
>  5 0.40369 osd.5   up  1.0  1.0
>  6 0.54008 osd.6   up  1.0  1.0
>  7 0.54008 osd.7   up  1.0  1.0
> -2 3.61554 host host2
>  0 0.90388 osd.0   up  1.0  1.0
>  1 0.90388 osd.1   up  1.0  1.0
>  2 0.90388 osd.2   up  1.0  1.0
>  3 0.90388 osd.3   up  1.0  1.0
> -6 2.55852 room B
> -4 1.75114 host host3
>  8 0.40369 osd.8   up  1.0  1.0
>  9 0.40369 osd.9   up  1.0  1.0
> 10 0.40369 osd.10  up  1.0  1.0
> 11 0.54008 osd.11  up  1.0  1.0
> -5 0.80737 host host4
> 12 0.40369 osd.12  up  1.0 

[ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-17 Thread Adam Carheden
I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
running on hosts 1, 2 and 3. It has a single replicated pool of size
3. I have a VM with its hard drive replicated to OSDs 11(host3),
5(host1) and 3(host2).

I can 'fail' any one host by disabling the SAN network interface and
the VM keeps running with a simple slowdown in I/O performance just as
expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
(i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
have quorum, so that shouldn't be an issue. The placement group still
has 2 of its 3 replicas online.

Why does I/O hang even though host4 isn't running a monitor and
doesn't have anything to do with my VM's hard drive?


Size?
# ceph osd pool get rbd size
size: 3

Where's rbd_id.vm-100-disk-1?
# ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
rbd_id.vm-100-disk-1 /tmp/map
got osdmap epoch 1043
osdmaptool: osdmap file '/tmp/map'
 object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]

# ceph osd tree
ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 8.06160 root default
-7 5.50308 room A
-3 1.88754 host host1
 4 0.40369 osd.4   up  1.0  1.0
 5 0.40369 osd.5   up  1.0  1.0
 6 0.54008 osd.6   up  1.0  1.0
 7 0.54008 osd.7   up  1.0  1.0
-2 3.61554 host host2
 0 0.90388 osd.0   up  1.0  1.0
 1 0.90388 osd.1   up  1.0  1.0
 2 0.90388 osd.2   up  1.0  1.0
 3 0.90388 osd.3   up  1.0  1.0
-6 2.55852 room B
-4 1.75114 host host3
 8 0.40369 osd.8   up  1.0  1.0
 9 0.40369 osd.9   up  1.0  1.0
10 0.40369 osd.10  up  1.0  1.0
11 0.54008 osd.11  up  1.0  1.0
-5 0.80737 host host4
12 0.40369 osd.12  up  1.0  1.0
13 0.40369 osd.13  up  1.0  1.00000


-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Adam Carheden
On Thu, Mar 16, 2017 at 11:55 AM, Jason Dillaman <jdill...@redhat.com> wrote:
> On Thu, Mar 16, 2017 at 1:02 PM, Adam Carheden <carhe...@ucar.edu> wrote:
>> Ceph can mirror data between clusters
>> (http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
>> mirror data between pools in the same cluster?
>
> Unfortunately, that's a negative. The rbd-mirror daemon currently
> assumes that the local and remote pool names are the same. Therefore,
> you cannot mirror images between a pool named "X" and a pool named
> "Y".
I figured as much from the command syntax. Am I going about this all
wrong? There have got to be lots of orgs with two rooms that back each
other up. How do others solve that problem?

How about a single 10Gb fiber link (which is, unfortunately, used for
everything, not just CEPH)? Any advice on estimating if/when latency
over a single link will become a problem?

> At the current time, I think three separate clusters would be the only
> thing that could satisfy all use-case requirements. While I have never
> attempted this, I would think that you should be able to run two
> clusters on the same node (e.g. the HA cluster gets one OSD per node
> in both rooms and the roomX cluster gets the remainder of OSDs in each
> node in its respective room).

Great idea. I guess that could be done either by munging some port
numbers and using non-default config file locations, or by running CEPH OSDs
and monitors on VMs. Any compelling reason for one way over the other?
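
(For the record, the non-VM route is mostly a matter of giving the second
cluster its own name; "backup" below is made up, and getting the daemons
launched with the right --cluster argument is the init-system-specific part
where most of the munging would end up:)

# a second cluster lives alongside the default one as /etc/ceph/backup.conf
# (its own fsid, monitors and ports) plus /etc/ceph/backup.client.admin.keyring;
# the CLI and librados select it by name:
ceph --cluster backup -s
rbd --cluster backup ls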

-- 
Adam Carheden
Systems Administrator
UCAR/NCAR/RAL
x2753
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mirroring data between pools on the same cluster

2017-03-16 Thread Adam Carheden
Ceph can mirror data between clusters
(http://docs.ceph.com/docs/master/rbd/rbd-mirroring/), but can it
mirror data between pools in the same cluster?

My use case is DR in the event of a room failure. I have a single CEPH
cluster that spans multiple rooms. The two rooms have separate power
and cooling, but have a single 10Gbe link between them (actually 2 w/
active-passive failover). I can configure pools and crushmaps to keep
data local to each room so my single link doesn't become a bottleneck.
However, I'd like to be able to recovery quickly if a room UPS fails.

Ideally I'd like something like this:

HA pool - spans rooms but we limit how much we put on it to avoid
latency or saturation issues with our single 10Gbe link.
room1 pool - Writes only to OSDs in room 1
room2 pool - Writes only to OSDs in room 2
room1-backup pool - Asynchronous mirror of room1 pool that writes only
to OSDs in room 2
room2-backup pool - Asynchronous mirror of room2 pool that writes only
to OSDs in room 1

In the event of a room failure, my very important stuff migrates or
reboots immediately in the other room without any manual steps. For
everything else, I manually spin up new VMs (scripted, of course) that
run from the mirrored backups.

Is this possible?

If I made it two separate CEPH clusters, how would I do the automated
HA failover? I could have 3 clusters (HA, room1, room2, mirroring
between room1 and room2), but then each cluster would be so small (2
nodes, 3 nodes) that node failure becomes more of a risk than room
failure.


(And yes, I do have a 3rd small room with monitors running so if one
of the primary rooms goes down monitors in the remaining room + 3rd
room have a quorum)

Thanks
-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can Cloudstack really be HA when using CEPH?

2017-02-25 Thread Adam Carheden
You are correct again. I forgot that rrdns returns all addresses, just in
different orders. So it doesn't matter that cloudstack can't pass libvirt
multiple addresses even though libvirt can pass those to qemu and librados.

On Feb 25, 2017 8:16 AM, "Wido den Hollander" <w...@42on.com> wrote:

>
> > Op 25 februari 2017 om 15:45 schreef Adam Carheden <
> adam.carhe...@gmail.com>:
> >
> >
> > I spoke with the cloud stack guys on IRC yesterday and the only risk is
> > when libvirtd starts. Ceph is supported only with libvirt. Cloudstack can
> > only pass one monitor to libvirt even though libvirt can use more.
> Libvirt
> > uses that info when it boots, but after that it gets all the monitors
> from
> > that initial one, just as you say. If you have to reboot libvirtd when
> that
> > monitor is down, that's a problem. But RR DNS would mean just restarting
> > libvirtd again will probably fix it.
>
> It's not the case. Libvirt will simply pass down the hostname down to
> Qemu, which passes it down to librbd and librados.
>
> Librados is the one doing the DNS lookup and gets back:
>
> ceph-monitor.example.com  2001:db8::100
> ceph-monitor.example.com  2001:db8::101
> ceph-monitor.example.com  2001:db8::102
>
> It will try 2001:db8::100 first. If it works, great! It obtains the monmap.
>
> If it's down, it will try 2001:db8::101 and 2001:db8::102.
>
> After obtaining the monmap everything is good. It's actually easier to use
> RR-DNS so that you can swap Monitors easily without having to update
> CloudStack's configuration.
>
> Wido
>
> >
> > On Feb 25, 2017 6:56 AM, "Wido den Hollander" <w...@42on.com> wrote:
> >
> >
> > > Op 24 februari 2017 om 19:48 schreef Adam Carheden <
> > adam.carhe...@gmail.com>:
> > >
> > >
> > > From the docs for each project:
> > >
> > > "When a primary storage outage occurs the hypervisor immediately stops
> > > all VMs stored on that storage
> > > device"http://docs.cloudstack.apache.org/projects/
> > cloudstack-administration/en/4.8/reliability.html
> > >
> > > "CloudStack will only bind to one monitor (You can however create a
> > > Round Robin DNS record over multiple
> > > monitors)"http://docs.ceph.com/docs/master/rbd/rbd-cloudstack/
> > >
> > > Doesn't this mean that if the CEPH monitor cloudstack chooses to bind
> to
> > > goes down all your VMs stop? If so, that seems pretty risky.
> > >
> >
> > No, it doesn't. librados will fail over to another Monitor.
> >
> > > RRDNS is for poor man's load balancing, not HA. I guess it depends on
> when
> > > Cloudstack does DNS lookup and if there's some minimum unavailable
> delay
> > > before it flags primary storage as offline, but it seems like
> substituting
> > > RRDNS for whatever CEPH's internal "find an available monitor"
> algorithm
> > is
> > > is a bad idea.
> >
> > No, you are understanding it wrongly. CloudStack doesn't perform the DNS
> > lookup, this is done by librados on the hypervisor.
> >
> > It will receive all Monitors from that DNS lookup and connect to one of
> > them. As soon as it does it will obtain the monmap and know the full
> > topology.
> >
> > Fully redundant and failover proof.
> >
> > Wido
> >
> > >
> > > --
> > > Adam Carheden
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Can Cloudstack really be HA when using CEPH?

2017-02-25 Thread Adam Carheden
I spoke with the cloud stack guys on IRC yesterday and the only risk is
when libvirtd starts. Ceph is supported only with libvirt. Cloudstack can
only pass one monitor to libvirt even though libvirt can use more. Libvirt
uses that info when it boots, but after that it gets all the monitors from
that initial one, just as you say. If you have to reboot libvirtd when that
monitor is down, that's a problem. But RR DNS would mean just restarting
libvirtd again will probably fix it.

On Feb 25, 2017 6:56 AM, "Wido den Hollander" <w...@42on.com> wrote:


> Op 24 februari 2017 om 19:48 schreef Adam Carheden <
adam.carhe...@gmail.com>:
>
>
> From the docs for each project:
>
> "When a primary storage outage occurs the hypervisor immediately stops
> all VMs stored on that storage
> device"http://docs.cloudstack.apache.org/projects/
cloudstack-administration/en/4.8/reliability.html
>
> "CloudStack will only bind to one monitor (You can however create a
> Round Robin DNS record over multiple
> monitors)"http://docs.ceph.com/docs/master/rbd/rbd-cloudstack/
>
> Doesn't this mean that if the CEPH monitor cloudstack chooses to bind to
> goes down all your VMs stop? If so, that seems pretty risky.
>

No, it doesn't. librados will fail over to another Monitor.

> RRDNS is for poor man's load balancing, not HA. I guess it depends on when
> Cloudstack does DNS lookup and if there's some minimum unavailable delay
> before it flags primary storage as offline, but it seems like substituting
> RRDNS for whatever CEPH's internal "find an available monitor" algorithm
is
> is a bad idea.

No, you are understanding it wrongly. CloudStack doesn't perform the DNS
lookup, this is done by librados on the hypervisor.

It will receive all Monitors from that DNS lookup and connect to one of
them. As soon as it does it will obtain the monmap and know the full
topology.

Fully redundant and failover proof.
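
(Concretely, the only client-side configuration this relies on is pointing
mon_host at the round-robin name; a minimal sketch using the example record
above:)

# /etc/ceph/ceph.conf on the hypervisor
[global]
        mon_host = ceph-monitor.example.com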

Wido

>
> --
> Adam Carheden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can Cloudstack really be HA when using CEPH?

2017-02-24 Thread Adam Carheden
>From the docs for each project:

"When a primary storage outage occurs the hypervisor immediately stops
all VMs stored on that storage
device"http://docs.cloudstack.apache.org/projects/cloudstack-administration/en/4.8/reliability.html

"CloudStack will only bind to one monitor (You can however create a
Round Robin DNS record over multiple
monitors)"http://docs.ceph.com/docs/master/rbd/rbd-cloudstack/

Doesn't this mean that if the CEPH monitor cloudstack chooses to bind to
goes down all your VMs stop? If so, that seems pretty risky.

RRDNS is for poor man's load balancing, not HA. I guess it depends on when
Cloudstack does the DNS lookup and if there's some minimum unavailable delay
before it flags primary storage as offline, but it seems like substituting
RRDNS for CEPH's internal "find an available monitor" algorithm, whatever it
is, is a bad idea.

-- 
Adam Carheden
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Down monitors after adding mds node

2016-10-02 Thread Adam Tygart
Sent before I was ready, oops.

How might I get the osdmap from a down cluster?

--
Adam

On Mon, Oct 3, 2016 at 12:29 AM, Adam Tygart <mo...@ksu.edu> wrote:
> I put this in the #ceph-dev on Friday,
>
> (gdb) print info
> $7 = (const MDSMap::mds_info_t &) @0x5fb1da68: {
>   global_id = {<boost::totally_ordered1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base > >> =
> {<boost::less_than_comparable1<mds_gid_t,
> boost::equality_comparable1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base > > >> =
> {<boost::equality_comparable1<mds_gid_t,
> boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base > >> =
> {<boost::totally_ordered2<mds_gid_t, unsigned long,
> boost::detail::empty_base >> =
> {<boost::less_than_comparable2<mds_gid_t, unsigned long,
> boost::equality_comparable2<mds_gid_t, unsigned long,
> boost::detail::empty_base > >> =
> {<boost::equality_comparable2<mds_gid_t, unsigned long,
> boost::detail::empty_base >> =
> {<boost::detail::empty_base> = {},  fields>}, }, }, },  data fields>}, }, t = 1055992652}, name = "mormo",
> rank = -1, inc = 0,
>   state = MDSMap::STATE_STANDBY, state_seq = 2, addr = {type = 0,
> nonce = 8835, {addr = {ss_family = 2, __ss_align = 0, __ss_padding =
> '\000' }, addr4 = {sin_family = 2, sin_port =
> 36890,
> sin_addr = {s_addr = 50398474}, sin_zero =
> "\000\000\000\000\000\000\000"}, addr6 = {sin6_family = 2, sin6_port =
> 36890, sin6_flowinfo = 50398474, sin6_addr = {__in6_u = {
> __u6_addr8 = '\000' , __u6_addr16 = {0,
> 0, 0, 0, 0, 0, 0, 0}, __u6_addr32 = {0, 0, 0, 0}}}, sin6_scope_id =
> 0}}}, laggy_since = {tv = {tv_sec = 0, tv_nsec = 0}},
>   standby_for_rank = 0, standby_for_name = "", standby_for_fscid =
> 328, standby_replay = true, export_targets = std::set with 0 elements,
> mds_features = 1967095022025}
> (gdb) print target_role
> $8 = {rank = 0, fscid = <optimized out>}
>
> It looks like target_role.fscid was somehow optimized out.
>
> --
> Adam
>
> On Sun, Oct 2, 2016 at 4:26 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>> On Sat, Oct 1, 2016 at 7:19 PM, Adam Tygart <mo...@ksu.edu> wrote:
>>> The wip-fixup-mds-standby-init branch doesn't seem to allow the
>>> ceph-mons to start up correctly. I disabled all mds servers before
>>> starting the monitors up, so it would seem the pending mdsmap update
>>> is in durable storage. Now that the mds servers are down, can we clear
>>> the mdsmap of active and standby servers while initializing the mons?
>>> I would hope that, now that all the versions are in sync, a bad
>>> standby_for_fscid would not be possible with new mds servers starting.
>>
>> Looks like my first guess about the run-time initialization being
>> confused was wrong. :(
>> Given that, we're pretty befuddled. But I commented on irc:
>>
>>>if you've still got a core dump, can you go up a frame (to 
>>>MDSMonitor::maybe_promote_standby) and check the values of target_role.rank 
>>>and target_role.fscid, and how that compares to info.standby_for_fscid, 
>>>info.legacy_client_fscid, and info.standby_for_rank?
>>
>> That might pop up something and isn't accessible in the log you
>> posted. We also can't see an osdmap or dump; if you could either
>> extract and print that or get a log which includes it that might show
>> up something.
>>
> I don't think we changed the mds<->mon protocol or anything in the point
>> releases, so the different package version *shouldn't* matter...right,
>> John? ;)
>> -Greg
>>
>>>
>>> --
>>> Adam
>>>
>>> On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>>> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mo...@ksu.edu> wrote:
>>>>> Hello all,
>>>>>
>>>>> Not sure if this went through before or not, as I can't check the
>>>>> mailing list archives.
>>>>>
>>>>> I've gotten myself into a bit of a bind. I was prepping to add a new
>>>>> mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
>>>>>
>>>>> Unfortunately, it started the mds server before I was ready. My
>>>>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>>>>
>>>>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>>>>&g

Re: [ceph-users] Down monitors after adding mds node

2016-10-01 Thread Adam Tygart
The wip-fixup-mds-standby-init branch doesn't seem to allow the
ceph-mons to start up correctly. I disabled all mds servers before
starting the monitors up, so it would seem the pending mdsmap update
is in durable storage. Now that the mds servers are down, can we clear
the mdsmap of active and standby servers while initializing the mons?
I would hope that, now that all the versions are in sync, a bad
standby_for_fscid would not be possible with new mds servers starting.

--
Adam

On Fri, Sep 30, 2016 at 3:49 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Fri, Sep 30, 2016 at 11:39 AM, Adam Tygart <mo...@ksu.edu> wrote:
>> Hello all,
>>
>> Not sure if this went through before or not, as I can't check the
>> mailing list archives.
>>
>> I've gotten myself into a bit of a bind. I was prepping to add a new
>> mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
>>
>> Unfortunately, it started the mds server before I was ready. My
>> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>>
>> This caused 3 of my 5 monitors to crash. Since I immediately realized
>> the mds was a newer version, I took that opportunity to upgrade my
>> monitors to 10.2.3. Three of the 5 monitors continue to crash. And it
>> looks like they are crashing when trying to apply a pending mdsmap
>> update.
>>
>> The log is available here:
>> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>>
>> I have attempted (making backups of course) to extract the monmap from
>> a working monitor and inserting it into a broken one. No luck, and
>> backup was restored.
>>
>> Since I had 2 working monitors, I backed up the monitor stores,
>> updated the monmaps to remove the broken ones and tried to restart
>> them. I then tried to restart the "working" ones. They then failed in
>> the same way. I've now restored my backups of those monitors.
>>
>> I need to get these monitors back up post-haste.
>>
>> If you've got any ideas, I would be grateful.
>
> I'm not sure but it looks like it's now too late to keep the problem
> out of the durable storage, but if you try again make sure you turn
> off the MDS first.
>
> It sort of looks like you've managed to get a failed MDS with an
> invalid fscid (ie, a cephfs filesystem ID).
>
> ...or maybe just a terrible coding mistake. As mentioned on irc,
> wip-fixup-mds-standby-init should fix it. I've created a ticket as
> well: http://tracker.ceph.com/issues/17466
> -Greg
>
>
>>
>> --
>> Adam
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
Hello all,

Not sure if this went through before or not, as I can't check the
mailing list archives.

I've gotten myself into a bit of a bind. I was prepping to add a new
mds node to my ceph cluster. e.g. ceph-deploy mds create mormo

Unfortunately, it started the mds server before I was ready. My
cluster was running 10.2.1, and the newly deployed mds is 10.2.3.

This caused 3 of my 5 monitors to crash. Since I immediately realized
the mds was a newer version, I took that opportunity to upgrade my
monitors to 10.2.3. Three of the 5 monitors continue to crash. And it
looks like they are crashing when trying to apply a pending mdsmap
update.

The log is available here:
http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz

I have attempted (making backups of course) to extract the monmap from
a working monitor and inserting it into a broken one. No luck, and
backup was restored.

Since I had 2 working monitors, I backed up the monitor stores,
updated the monmaps to remove the broken ones and tried to restart
them. I then tried to restart the "working" ones. They then failed in
the same way. I've now restored my backups of those monitors.
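
(For anyone hitting this later, the usual monmap extract/edit/inject sequence
is roughly the following, run against stopped monitors; the monitor IDs are
placeholders:)

# pull the map out of a (stopped) monitor's store
ceph-mon -i mon-a --extract-monmap /tmp/monmap

# drop the broken members and double-check the result
monmaptool --rm mon-c /tmp/monmap
monmaptool --rm mon-d /tmp/monmap
monmaptool --print /tmp/monmap

# push the edited map into each surviving monitor, then start them
ceph-mon -i mon-a --inject-monmap /tmp/monmap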

I need to get these monitors back up post-haste.

If you've got any ideas, I would be grateful.

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
I could, I suppose, update the monmaps in the working monitors to
remove the broken ones and then re-deploy the broken ones. The main
concern I have is that if the mdsmap update isn't pending on the
working ones, what else isn't in sync.

Thoughts?

--
Adam

On Fri, Sep 30, 2016 at 11:05 AM, Adam Tygart <mo...@ksu.edu> wrote:
> Hello all,
>
> I've gotten myself into a bit of a bind. I was prepping to add a new
> mds node to my ceph cluster. e.g. ceph-deploy mds create mormo
>
> Unfortunately, it started the mds server before I was ready. My
> cluster was running 10.2.1, and the newly deployed mds is 10.2.3.
>
> This caused 3 of my 5 monitors to crash. Since I immediately realized
> the mds was a newer version, I took that opportunity to upgrade my
> monitors to 10.2.3. Three of the 5 monitors continue to crash. And it
> looks like they are crashing when trying to apply a pending mdsmap
> update.
>
> The log is available here:
> http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz
>
> I have attempted (making backups of course) to extract the monmap from
> a working monitor and inserting it into a broken one. No luck, and
> backup was restored.
>
> I need to get these monitors back up post-haste.
>
> If you've got any ideas, I would be grateful.
>
> --
> Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Down monitors after adding mds node

2016-09-30 Thread Adam Tygart
Hello all,

I've gotten myself into a bit of a bind. I was prepping to add a new
mds node to my ceph cluster. e.g. ceph-deploy mds create mormo

Unfortunately, it started the mds server before I was ready. My
cluster was running 10.2.1, and the newly deployed mds is 10.2.3.

This caused 3 of my 5 monitors to crash. Since I immediately realized
the mds was a newer version, I took that opportunity to upgrade my
monitors to 10.2.3. Three of the 5 monitors continue to crash. And it
looks like they are crashing when trying to apply a pending mdsmap
update.

The log is available here:
http://people.cis.ksu.edu/~mozes/hobbit01.mon-20160930.log.gz

I have attempted (making backups of course) to extract the monmap from
a working monitor and inserting it into a broken one. No luck, and
backup was restored.

I need to get these monitors back up post-haste.

If you've got any ideas, I would be grateful.

--
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issues with CephFS

2016-06-18 Thread Adam Tygart
Responses inline.

On Sat, Jun 18, 2016 at 4:53 PM, ServerPoint  wrote:
> Hi,
>
> I am trying to setup a Ceph cluster and mount it as CephFS
>
> These are the steps that I followed :
> -
> ceph-deploy new mon
>  ceph-deploy install admin mon node2 node5 node6
>  ceph-deploy mon create-initial
>   ceph-deploy disk zap  node2:sdb node2:sdc node2:sdd
>   ceph-deploy disk zap  node5:sdb node5:sdc node5:sdd
>   ceph-deploy disk zap  node6:sdb node6:sdc node6:sdd
>   ceph-deploy osd prepare node2:sdb node2:sdc node2:sdd
>   ceph-deploy osd prepare node5:sdb node5:sdc node5:sdd
>   ceph-deploy osd prepare node6:sdb node6:sdc node6:sdd
>   ceph-deploy osd activate node2:/dev/sdb1 node2:/dev/sdc1 node2:/dev/sdd1
> ceph-deploy osd activate node5:/dev/sdb1  node5:/dev/sdc1 node5:/dev/sdd1
> ceph-deploy osd activate node6:/dev/sdb1  node6:/dev/sdc1 node6:/dev/sdd1
> ceph-deploy admin admin mon node2 node5 node6
>
> ceph-deploy mds create mon
> ceph osd pool create cephfs_data 100
>   ceph osd pool create cephfs_metadata 100
>   ceph fs new cephfs cephfs_metadata cephfs_data
> --
>
> Health of Cluster is Ok
> 
> root@admin:~/ceph-cluster# ceph -s
> cluster 5dfaa36a-45b8-47a2-85c4-3f06f53bcd03
>  health HEALTH_OK
>  monmap e1: 1 mons at {mon=10.10.0.122:6789/0}

Monitor at 10.10.0.122...

> election epoch 5, quorum 0 mon
>   fsmap e15: 1/1/1 up {0=mon=up:active}
>  osdmap e60: 9 osds: 9 up, 9 in
> flags sortbitwise
>   pgmap v252: 264 pgs, 3 pools, 2068 bytes data, 20 objects
> 309 MB used, 3976 GB / 3977 GB avail
>  264 active+clean
>  -
>
>
> I then installed ceph on another server to make it as client.
> But I am getting the below error while mounting it.
> 
> root@node9:~#  mount -t ceph 10.10.0.121:6789:/ /mnt/mycephfs -o
> name=admin,secret=AQDhlGVXDhnoGxAAsX7HcOxbrWpSUpSuOTNWBg==
> mount: Connection timed out
> 
>

Mount is trying to talk to 10.10.0.121 (the mds server?). The monitors are
the initial point of contact for anything ceph. They will tell the
client where everything else lives.
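
(In other words, point the mount at the monitor; with the cluster above that
would be something like the following, with <admin key> being the same secret
as before:)

mount -t ceph 10.10.0.122:6789:/ /mnt/mycephfs -o name=admin,secret=<admin key>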

>
> I tried restarting all the services but with no success. I am stuck here.
> Please help.
>
> Thanks in advance!
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore RAM usage/utilization

2016-06-16 Thread Adam Tygart
According to Sage[1], Bluestore makes use of the pagecache. I don't
believe read-ahead is a filesystem tunable in Linux; it is set on the
block device itself, so read-ahead shouldn't be an issue.
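
(Since it is a block-device property, read-ahead can be checked and raised per
OSD device; /dev/sdb is just an example:)

# current read-ahead, in 512-byte sectors (blockdev) and in KB (sysfs)
blockdev --getra /dev/sdb
cat /sys/block/sdb/queue/read_ahead_kb

# raise it; 4096 sectors and 2048 KB are the same setting via two knobs
blockdev --setra 4096 /dev/sdb
echo 2048 > /sys/block/sdb/queue/read_ahead_kb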

I'm not familiar enough with Bluestore to comment on the rest.

[1] http://www.spinics.net/lists/ceph-devel/msg29398.html

--
Adam

On Thu, Jun 16, 2016 at 11:09 PM, Christian Balzer <ch...@gol.com> wrote:
>
> Hello,
>
> I don't have anything running Jewel yet, so this is for devs and people
> who have played with bluestore or read the code.
>
> With filestore, Ceph benefits from ample RAM, both in terms of pagecache
> for reads of hot objects and SLAB to keep all the dir-entries and inodes
> in memory.
>
> With bluestore not being a FS, I'm wondering what can and will be done for
> it to maximize performance by using available RAM.
> I doubt there's a dynamic cache allocation ala pagecache present or on the
> road-map.
> But how about parameters to grow caches (are there any?) and give the DB
> more breathing space?
>
> I suppose this also cuts into the current inability to do read-ahead with
> bluestore by itself (not client driven).
>
> The underlying reason for this of course to future proof OSD storage
> servers, any journal SSDs will be beneficial for RocksDB and WAL as well,
> but if available memory can't be utilized beyond what the OSDs need
> themselves it makes little sense to put extra RAM into them.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS Bug found with CentOS 7.2

2016-06-16 Thread Adam Tygart
]  [] dispatch+0xebf/0x1428
[ 2371.525755]  [] ? ceph_x_check_message_signature+0x42/0xc4
[ 2371.525758]  [] ceph_con_workfn+0xe1a/0x24f3
[ 2371.525762]  [] ? load_TLS+0xb/0xf
[ 2371.525764]  [] ? __switch_to+0x3b0/0x42b
[ 2371.525769]  [] ? finish_task_switch+0xff/0x191
[ 2371.525772]  [] process_one_work+0x175/0x2a0
[ 2371.525774]  [] worker_thread+0x1fc/0x2ae
[ 2371.525776]  [] ? rescuer_thread+0x2c0/0x2c0
[ 2371.525779]  [] kthread+0xaf/0xb7
[ 2371.525782]  [] ? kthread_parkme+0x1f/0x1f
[ 2371.525786]  [] ret_from_fork+0x3f/0x70
[ 2371.525788]  [] ? kthread_parkme+0x1f/0x1f
[ 2371.525790] ---[ end trace b054c5c6854fd2ad ]---

Whenever a readdir is performed on the directory containing the
symlink, all the stats go to ??? and the symlink is unable to be
deleted/moved/operated on.

I believe it involves the overwrites that vim performs on save (save
to a temporary file and move it over top of the existing one). I've
seen it on kernels 4.0->4.5 so far. Possibly even earlier.
Hammer->Infernalis, I've not had a chance to test on Jewel.
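
(The overwrite pattern in question is roughly the following, a temp file
renamed over the existing name; paths are made up:)

# what "save" in vim / sed -i boils down to
printf 'new contents\n' > /mnt/cephfs/dir/.file.tmp
mv /mnt/cephfs/dir/.file.tmp /mnt/cephfs/dir/file

# a later readdir of the directory is what shows the broken stats
ls -l /mnt/cephfs/dir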

I'd dump the symlink data out of the metadata pool, but I'm still
recovering from http://tracker.ceph.com/issues/16177

Not trying to hijack your thread here, though.

--
Adam

On Thu, Jun 16, 2016 at 4:03 PM, Jason Gress <jgr...@accertify.com> wrote:
> This is the latest default kernel with CentOS7.  We also tried a newer
> kernel (from elrepo), a 4.4 that has the same problem, so I don't think
> that is it.  Thank you for the suggestion though.
>
> We upgraded our cluster to the 10.2.2 release today, and it didn't resolve
> all of the issues.  It's possible that a related issue is actually
> permissions.  Something may not be right with our config (or a bug) here.
>
> While testing we noticed that there may actually be two issues here.  I am
> unsure, as we noticed that the most consistent way to reproduce our issue
> is to use vim or sed -i which does in place renames:
>
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:50 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  300 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
> [root@ftp01 cron]# sed -i 's/^/#/' file
> sed: cannot rename ./sedfB2CkO: Permission denied
>
>
> Strangely, adding or deleting files works fine, it's only renaming that
> fails.  And strangely I was able to successfully edit the file on ftp02:
>
> [root@ftp02 cron]# sed -i 's/^/#/' file
> [root@ftp02 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:49 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  313 Jun 16 15:49 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Then it worked on ftp01 this time:
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2357 Jun 16 15:49 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  313 Jun 16 15:49 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Then, I vim'd it successfully on ftp01... Then ran the sed again:
>
> [root@ftp01 cron]# sed -i 's/^/#/' file
> sed: cannot rename ./sedfB2CkO: Permission denied
> [root@ftp01 cron]# ls -la
> total 3
> drwx--   1 root root 2044 Jun 16 15:51 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root  300 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> And now we have the zero file problem again:
>
> [root@ftp02 cron]# ls -la
> total 2
> drwx--   1 root root 2044 Jun 16 15:51 .
> drwxr-xr-x. 10 root root  104 May 19 09:34 ..
> -rw-r--r--   1 root root0 Jun 16 15:50 file
> -rw---   1 root root 2044 Jun 16 13:47 root
>
>
> Anyway, I wonder how much of this issue is related to that cannot rename
> issue above.  Here are our security settings:
>
> client.ftp01
> key: 
> caps: [mds] allow r, allow rw path=/ftp
> caps: [mon] allow r
> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
> client.ftp02
> key: 
> caps: [mds] allow r, allow rw path=/ftp
> caps: [mon] allow r
> caps: [osd] allow rw pool=cephfs_metadata, allow rw pool=cephfs_data
>
>
> /ftp is the directory on cephfs under which cron lives; the full path is
> /ftp/cron .
>
> I hope this helps and thank you for your time!
>
> Jason
>
> On 6/15/16, 4:43 PM, "John Spray" <jsp...@redhat.com> wrote:
>
>>On Wed, Jun 15, 2016 at 10:21 PM, Jason Gress <jgr...@accertify.com>
>>wrote:
>>> While trying to use CephFS as a clustered filesystem, we stumbled upon a
>>> reproducible bug that is unfortunately pretty serious, as it leads to
>>>data
>>> loss.  Here is the si

Re: [ceph-users] RDMA/Infiniband status

2016-06-09 Thread Adam Tygart
I believe this is what you want:
https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/7/html/Networking_Guide/sec-Configuring_the_Subnet_Manager.html
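
(From that doc, the relevant knob is the partition definition opensm reads; a
hedged sketch, since the exact rate encoding is worth double-checking against
the opensm man page. rate=6 should be 20 Gbit/s, i.e. 4x DDR:)

# /etc/opensm/partitions.conf (path varies by distro); restart opensm after
Default=0x7fff, ipoib, rate=6 : ALL=full;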

--
Adam

On Thu, Jun 9, 2016 at 10:01 AM, Gandalf Corvotempesta
<gandalf.corvotempe...@gmail.com> wrote:
> Il 09 giu 2016 15:41, "Adam Tygart" <mo...@ksu.edu> ha scritto:
>>
>> If you're
>> using pure DDR, you may need to tune the broadcast group in your
>> subnet manager to set the speed to DDR.
>
> Do you know how to set this with opensm?
> I would like to bring up my test cluster again next days
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RDMA/Infiniband status

2016-06-09 Thread Adam Tygart
IPoIB is done with broadcast packets on the Infiniband fabric. Most
switches and opensm (by default) set up a broadcast group at the lowest
IB speed (SDR) to support all possible IB connections. If you're
using pure DDR, you may need to tune the broadcast group in your
subnet manager to set the speed to DDR.

--
Adam



On Thu, Jun 9, 2016 at 3:25 AM, Gandalf Corvotempesta
<gandalf.corvotempe...@gmail.com> wrote:
> 2016-06-09 10:18 GMT+02:00 Christian Balzer <ch...@gol.com>:
>> IPoIB is about half the speed of your IB layer, yes.
>
> Ok, so it's normal. I've seen benchmarks on net stating that IPoIB on
> DDR should reach about 16-17Gb/s
> I'll plan to move to QDR
>
>> And bandwidth is (usually) not the biggest issue, latency is.
>
> I've never checked the latency that I had.
> Even with IPoIB should be lower than 10GBASE-T
>
>> Google for "Accelio Ceph".
>
> Thanks.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-06 Thread Adam Tygart
Would it be beneficial for anyone to have an archive copy of an osd
that took more than 4 days to export. All but an hour of that time was
spent exporting 1 pg (that ended up being 197MB). I can even send
along the extracted pg for analysis...

--
Adam

On Fri, Jun 3, 2016 at 2:39 PM, Adam Tygart <mo...@ksu.edu> wrote:
> With regards to this export/import process, I've been exporting a pg
> from an osd for more than 24 hours now. The entire OSD only has 8.6GB
> of data. 3GB of that is in omap. The export for this particular PG is
> only 108MB in size right now, after more than 24 hours. How is it
> possible that a fragmented database on an ssd capable of 13,000 iops
> can be this slow?
>
> --
> Adam
>
> On Fri, Jun 3, 2016 at 11:11 AM, Brandon Morris, PMP
> <brandon.morris@gmail.com> wrote:
>> Nice catch.  That was a copy-paste error.  Sorry
>>
>> it should have read:
>>
>>  3. Flush the journal and export the primary version of the PG.  This took 1
>> minute on a well-behaved PG and 4 hours on the misbehaving PG
>>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
>> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
>> --file /root/32.10c.b.export
>>
>>   4. Import the PG into a New / Temporary OSD that is also offline,
>>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
>> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import
>> --file /root/32.10c.b.export
>>
>>
>> On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>>
>>> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
>>> <brandon.morris@gmail.com> wrote:
>>>
>>> > The only way that I was able to get back to Health_OK was to
>>> > export/import.  * Please note, any time you use the
>>> > ceph_objectstore_tool you risk data loss if not done carefully.   Never
>>> > remove a PG until you have a known good export *
>>> >
>>> > Here are the steps I used:
>>> >
>>> > 1. set NOOUT, NO BACKFILL
>>> > 2. Stop the OSD's that have the erroring PG
>>> > 3. Flush the journal and export the primary version of the PG.  This
>>> > took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
>>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
>>> > --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
>>> > --file /root/32.10c.b.export
>>> >
>>> > 4. Import the PG into a New / Temporary OSD that is also offline,
>>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
>>> > --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op 
>>> > export
>>> > --file /root/32.10c.b.export
>>>
>>> This should be an import op and presumably to a different data path
>>> and journal path more like the following?
>>>
>>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
>>> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
>>> import --file /root/32.10c.b.export
>>>
>>> Just trying to clarify for anyone that comes across this thread in the
>>> future.
>>>
>>> Cheers,
>>> Brad
>>>
>>> >
>>> > 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your
>>> > case it looks like)
>>> > 6. Start cluster OSD's
>>> > 7. Start the temporary OSD's and ensure 32.10c backfills correctly to
>>> > the 3 OSD's it is supposed to be on.
>>> >
>>> > This is similar to the recovery process described in this post from
>>> > 04/09/2015:
>>> > http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>>> > Hopefully it works in your case too and you can get the cluster back to a
>>> > state where you can make the CephFS directories smaller.
>>> >
>>> > - Brandon
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best upgrade strategy

2016-06-05 Thread Adam Tygart
If your monitor nodes are separate from the osd nodes, I'd get ceph
upgraded to the latest point release of your current line (0.94.7).
Upgrade monitors, then osds, then other dependent services (mds, rgw,
qemu).
Once everything is happy again, I'd run OS and ceph upgrades together,
starting with monitors, then osds, and (again) dependent services.
Keep in mind that you'll want to chown all of the ceph data to the ceph
user while you're doing this (per the Jewel upgrade notes).

If they're combined, I'd probably upgrade ceph, then the OS. First
from 0.94.5 to 0.94.7, then to Jewel, then I'd upgrade the OS version.
Standard order still applies, monitors->osds->dependent services.
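
A rough sketch of that order in commands, assuming the Ubuntu packages and the
upstart *-all jobs those releases shipped (the exact package/repo steps are
placeholders; the chown is from the Jewel release notes):

    # monitors first, one node at a time
    apt-get update && apt-get dist-upgrade   # with the 0.94.7 repo, later the Jewel repo
    restart ceph-mon-all                     # on systemd: systemctl restart ceph-mon.target
    ceph -s                                  # wait for quorum/HEALTH_OK before the next node

    # then each osd node, one at a time
    stop ceph-osd-all
    apt-get dist-upgrade
    chown -R ceph:ceph /var/lib/ceph /var/log/ceph   # Jewel daemons run as the ceph user
    start ceph-osd-all
    ceph -s                                  # let recovery settle before moving on

    # finally mds / rgw, then restart or migrate qemu clients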
--
Adam

On Sun, Jun 5, 2016 at 6:47 PM, Sebastian Köhler <s...@tyrion.de> wrote:
> Hi,
>
> we are running a cluster with 6 storage nodes(72 osds) and 3 monitors.
> The osds and and monitors are running on Ubuntu 14.04 and with ceph 0.94.5.
> We want to upgrade the cluster to Jewel and at the same time the OS to
> Ubuntu 16.04. What would be the best way to this? First to upgrade the
> OS and then ceph to 0.94.7 followed by 10.2.1. Or should we first
> upgrade Ceph and then Ubuntu? Or maybe doing it all at once?
>
> Regards
> Sebastian
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-03 Thread Adam Tygart
With regards to this export/import process, I've been exporting a pg
from an osd for more than 24 hours now. The entire OSD only has 8.6GB
of data. 3GB of that is in omap. The export for this particular PG is
only 108MB in size right now, after more than 24 hours. How is it
possible that a fragmented database on an ssd capable of 13,000 iops
can be this slow?

--
Adam

On Fri, Jun 3, 2016 at 11:11 AM, Brandon Morris, PMP
<brandon.morris@gmail.com> wrote:
> Nice catch.  That was a copy-paste error.  Sorry
>
> it should have read:
>
>  3. Flush the journal and export the primary version of the PG.  This took 1
> minute on a well-behaved PG and 4 hours on the misbehaving PG
>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
>   4. Import the PG into a New / Temporary OSD that is also offline,
>i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op import
> --file /root/32.10c.b.export
>
>
> On Thu, Jun 2, 2016 at 5:10 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
>>
>> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
>> <brandon.morris@gmail.com> wrote:
>>
>> > The only way that I was able to get back to Health_OK was to
>> > export/import.  * Please note, any time you use the
>> > ceph_objectstore_tool you risk data loss if not done carefully.   Never
>> > remove a PG until you have a known good export *
>> >
>> > Here are the steps I used:
>> >
>> > 1. set NOOUT, NO BACKFILL
>> > 2. Stop the OSD's that have the erroring PG
>> > 3. Flush the journal and export the primary version of the PG.  This
>> > took 1 minute on a well-behaved PG and 4 hours on the misbehaving PG
>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
>> > --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
>> > --file /root/32.10c.b.export
>> >
>> > 4. Import the PG into a New / Temporary OSD that is also offline,
>> >   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
>> > --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
>> > --file /root/32.10c.b.export
>>
>> This should be an import op and presumably to a different data path
>> and journal path more like the following?
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
>> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
>> import --file /root/32.10c.b.export
>>
>> Just trying to clarify for anyone that comes across this thread in the
>> future.
>>
>> Cheers,
>> Brad
>>
>> >
>> > 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your
>> > case it looks like)
>> > 6. Start cluster OSD's
>> > 7. Start the temporary OSD's and ensure 32.10c backfills correctly to
>> > the 3 OSD's it is supposed to be on.
>> >
>> > This is similar to the recovery process described in this post from
>> > 04/09/2015:
>> > http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>> > Hopefully it works in your case too and you can get the cluster back to a state
>> > where you can make the CephFS directories smaller.
>> >
>> > - Brandon
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Adam Tygart
I'm still exporting pgs out of some of the downed osds, but things are
definitely looking promising.

Marginally related to this thread, as these seem to be most of the
hanging objects when exporting pgs, what are inodes in the 600 range
used for within the metadata pool? I know the 200 range is used for
journaling. 8 of the 13 osds I've got left down are currently trying
to export objects in the 600 range. Are these just MDS journal objects
from an mds severely behind on trimming?

--
Adam

On Thu, Jun 2, 2016 at 6:10 PM, Brad Hubbard <bhubb...@redhat.com> wrote:
> On Thu, Jun 2, 2016 at 9:07 AM, Brandon Morris, PMP
> <brandon.morris@gmail.com> wrote:
>
>> The only way that I was able to get back to Health_OK was to export/import.  
>> * Please note, any time you use the ceph_objectstore_tool you risk data 
>> loss if not done carefully.   Never remove a PG until you have a known good 
>> export *
>>
>> Here are the steps I used:
>>
>> 1. set NOOUT, NO BACKFILL
>> 2. Stop the OSD's that have the erroring PG
>> 3. Flush the journal and export the primary version of the PG.  This took 1 
>> minute on a well-behaved PG and 4 hours on the misbehaving PG
>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16 
>> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export 
>> --file /root/32.10c.b.export
>>
>> 4. Import the PG into a New / Temporary OSD that is also offline,
>>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100 
>> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export 
>> --file /root/32.10c.b.export
>
> This should be an import op and presumably to a different data path
> and journal path more like the following?
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-101
> --journal-path /var/lib/ceph/osd/ceph-101/journal --pgid 32.10c --op
> import --file /root/32.10c.b.export
>
> Just trying to clarify for anyone that comes across this thread in the future.
>
> Cheers,
> Brad
>
>>
>> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case 
>> it looks like)
>> 6. Start cluster OSD's
>> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3 
>> OSD's it is supposed to be on.
>>
>> This is similar to the recovery process described in this post from 
>> 04/09/2015: 
>> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>> Hopefully it works in your case too and you can get the cluster back to a
>> state where you can make the CephFS directories smaller.
>>
>> - Brandon
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-02 Thread Adam Tygart
Okay,

Exporting, removing and importing the pgs seems to be working
(slowly). The question now becomes, why does and export/import work?
That would make me think there is a bug in there somewhere in the pg
loading code. Or does it have to do with re-creating the leveldb
databases? The same number of objects are still in each pg, along with
the same number of omap keys... Something doesn't seem quite right.

If it is too many files in a single directory, what would be the upper
limit to target? I'd like to know when I should be yelling and kicking
and screaming at my users to fix their code.

On Wed, Jun 1, 2016 at 6:07 PM, Brandon Morris, PMP
<brandon.morris@gmail.com> wrote:
> I concur with Greg.
>
> The only way that I was able to get back to Health_OK was to export/import.
> * Please note, any time you use the ceph_objectstore_tool you risk data
> loss if not done carefully.   Never remove a PG until you have a known good
> export *
>
> Here are the steps I used:
>
> 1. set NOOUT, NO BACKFILL
> 2. Stop the OSD's that have the erroring PG
> 3. Flush the journal and export the primary version of the PG.  This took 1
> minute on a well-behaved PG and 4 hours on the misbehaving PG
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-16
> --journal-path /var/lib/ceph/osd/ceph-16/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
> 4. Import the PG into a New / Temporary OSD that is also offline,
>   i.e.   ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-100
> --journal-path /var/lib/ceph/osd/ceph-100/journal --pgid 32.10c --op export
> --file /root/32.10c.b.export
>
> 5. remove the PG from all other OSD's  (16, 143, 214, and 448 in your case
> it looks like)
> 6. Start cluster OSD's
> 7. Start the temporary OSD's and ensure 32.10c backfills correctly to the 3
> OSD's it is supposed to be on.
>
> This is similar to the recovery process described in this post from
> 04/09/2015:
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
> Hopefully it works in your case too and you can get the cluster back to a state
> where you can make the CephFS directories smaller.
>
> - Brandon
>
> On Wed, Jun 1, 2016 at 4:22 PM, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> On Wed, Jun 1, 2016 at 2:47 PM, Adam Tygart <mo...@ksu.edu> wrote:
>> > I tried to compact the leveldb on osd 16 and the osd is still hitting
>> > the suicide timeout. I know I've got some users with more than 1
>> > million files in single directories.
>> >
>> > Now that I'm in this situation, can I get some pointers on how can I
>> > use either of your options?
>>
>> In a literal sense, you either make the CephFS directories smaller by
>> moving files out of them. Or you enable directory fragmentation with
>>
>> http://docs.ceph.com/docs/master/cephfs/experimental-features/#directory-fragmentation,
>> but if you have users I *really* wouldn't recommend it just yet.
>> (Notice the text about these experimental features being "not fully
>> stabilized or qualified for users to turn on in real deployments")
>>
>> Since you're doing recovery, you should be able to do the
>> ceph-objectstore-tool export/import thing to get the PG to its new
>> locations, but just deleting it certainly won't help!
>> -Greg
>>
>> >
>> > Thanks,
>> > Adam
>> >
>> > On Wed, Jun 1, 2016 at 4:33 PM, Gregory Farnum <gfar...@redhat.com>
>> > wrote:
>> >> If that pool is your metadata pool, it looks at a quick glance like
>> >> it's timing out somewhere while reading and building up the omap
>> >> contents (ie, the contents of a directory). Which might make sense if,
>> >> say, you have very fragmented leveldb stores combined with very large
>> >> CephFS directories. Trying to make the leveldbs happier (I think there
>> >> are some options to compact on startup, etc?) might help; otherwise
>> >> you might be running into the same "too-large omap collections" thing
>> >> that Brandon referred to. Which in CephFS can be fixed by either
>> >> having smaller folders or (if you're very nervy, and ready to turn on
>> >> something we think works but don't test enough) enabling directory
>> >> fragmentation.
>> >> -Greg
>> >>
>> >> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart <mo...@ksu.edu> wrote:
>> >>> I've been attempting to work through this, finding the pgs that are
>> >>> causing hangs, determining if they are "safe"

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
I tried to compact the leveldb on osd 16 and the osd is still hitting
the suicide timeout. I know I've got some users with more than 1
million files in single directories.

Now that I'm in this situation, can I get some pointers on how can I
use either of your options?

Thanks,
Adam

On Wed, Jun 1, 2016 at 4:33 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> If that pool is your metadata pool, it looks at a quick glance like
> it's timing out somewhere while reading and building up the omap
> contents (ie, the contents of a directory). Which might make sense if,
> say, you have very fragmented leveldb stores combined with very large
> CephFS directories. Trying to make the leveldbs happier (I think there
> are some options to compact on startup, etc?) might help; otherwise
> you might be running into the same "too-large omap collections" thing
> that Brandon referred to. Which in CephFS can be fixed by either
> having smaller folders or (if you're very nervy, and ready to turn on
> something we think works but don't test enough) enabling directory
> fragmentation.
> -Greg
>
> On Wed, Jun 1, 2016 at 2:14 PM, Adam Tygart <mo...@ksu.edu> wrote:
>> I've been attempting to work through this, finding the pgs that are
>> causing hangs, determining if they are "safe" to remove, and removing
>> them with ceph-objectstore-tool on osd 16.
>>
>> I'm now getting hangs (followed by suicide timeouts) referencing pgs
>> that I've just removed, so this doesn't seem to be all there is to the
>> issue.
>>
>> --
>> Adam
>>
>> On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP
>> <brandon.morris@gmail.com> wrote:
>>> Adam,
>>>
>>>  We ran into similar issues when we got too many objects in a bucket
>>> (around 300 million).  The .rgw.buckets.index pool became unable to complete
>>> backfill operations. The only way we were able to get past it was to
>>> export the offending placement group with the ceph-objectstore-tool and
>>> re-import it into another OSD to complete the backfill.  For us, the export
>>> operation seemed to hang and took 8 hours to export, so if you do choose to
>>> go down this route, be patient.
>>>
>>> From your logs, it appears that pg 32.10c is the offending PG on OSD.16.  If
>>> you are running into the same issue we did, when you go to export it there
>>> will be a file that will hang. For whatever reason the leveldb metadata for
>>> that file hangs and causes the backfill operation to suicide the OSD.
>>>
>>> If anyone from the community has an explanation for why this happens I would
>>> love to know.  We have run into this twice now on the Infernalis codebase.
>>> We are in the process of rebuilding our cluster to Jewel, so can't say
>>> whether or not it happens there as well.
>>>
>>> -
>>> Here is the pertinent lines from your log.
>>>
>>> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
>>> pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771
>>> n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662)
>>> [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1
>>> bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
>>> undersized+degraded+remapped+backfilling+peered] send_push_op
>>> 32:30966cd6:::100042c76a0.:head v 250315'1040233 size 0
>>> recovery_info:
>>> ObjectRecoveryInfo(32:30966cd6:::100042c76a0.:head@250315'1040233,
>>> size: 0, copy_subset: [], clone_subset: {})
>>> [...]
>>> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> [...]
>>> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
>>> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy
>>> 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
>>> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const
>>> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time
>>> 2016-06-01 09:31:57.201687
>>> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>>>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x85) [0x7f35167bb5b5]
>>>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char

Re: [ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
I've been attempting to work through this, finding the pgs that are
causing hangs, determining if they are "safe" to remove, and removing
them with ceph-objectstore-tool on osd 16.

I'm now getting hangs (followed by suicide timeouts) referencing pgs
that I've just removed, so this doesn't seem to be all there is to the
issue.

--
Adam

On Wed, Jun 1, 2016 at 12:00 PM, Brandon Morris, PMP
<brandon.morris@gmail.com> wrote:
> Adam,
>
>  We ran into similar issues when we got too many objects in a bucket
> (around 300 million).  The .rgw.buckets.index pool became unable to complete
> backfill operations. The only way we were able to get past it was to
> export the offending placement group with the ceph-objectstore-tool and
> re-import it into another OSD to complete the backfill.  For us, the export
> operation seemed to hang and took 8 hours to export, so if you do choose to
> go down this route, be patient.
>
> From your logs, it appears that pg 32.10c is the offending PG on OSD.16.  If
> you are running into the same issue we did, when you go to export it there
> will be a file that will hang. For whatever reason the leveldb metadata for
> that file hangs and causes the backfill operation to suicide the OSD.
>
> If anyone from the community has an explanation for why this happens I would
> love to know.  We have run into this twice now on the Infernalis codebase.
> We are in the process of rebuilding our cluster to Jewel, so can't say
> whether or not it happens there as well.
>
> -
> Here is the pertinent lines from your log.
>
> 2016-06-01 09:26:54.683922 7f34c5e41700  7 osd.16 pg_epoch: 497663
> pg[32.10c( v 477010'1607561 (459778'1604561,477010'1607561] local-les=493771
> n=3917 ec=44014 les/c/f 493771/486667/0 497332/497662/497662)
> [214,143,448]/[16] r=0 lpr=497662 pi=483321-497661/190 rops=1
> bft=143,214,448 crt=0'0 lcod 0'0 mlcod 0'0
> undersized+degraded+remapped+backfilling+peered] send_push_op
> 32:30966cd6:::100042c76a0.:head v 250315'1040233 size 0
> recovery_info:
> ObjectRecoveryInfo(32:30966cd6:::100042c76a0.:head@250315'1040233,
> size: 0, copy_subset: [], clone_subset: {})
> [...]
> 2016-06-01 09:27:25.091411 7f34cf856700  1 heartbeat_map is_healthy
> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> [...]
> 2016-06-01 09:31:57.201645 7f3510669700  1 heartbeat_map is_healthy
> 'OSD::recovery_tp thread 0x7f34c5e41700' had timed out after 30
> 2016-06-01 09:31:57.201671 7f3510669700  1 heartbeat_map is_healthy
> 'OSD::recovery_tp thread 0x7f34c5e41700' had suicide timed out after 300
> common/HeartbeatMap.cc: In function 'bool ceph::HeartbeatMap::_check(const
> ceph::heartbeat_handle_d*, const char*, time_t)' thread 7f3510669700 time
> 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>  ceph version 10.2.1 (3a66dd4f30852819c1bdaa8ec23c795d4ad77269)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x85) [0x7f35167bb5b5]
>  2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
> const*, long)+0x2e1) [0x7f35166f7bf1]
>  3: (ceph::HeartbeatMap::is_healthy()+0xde) [0x7f35166f844e]
>  4: (ceph::HeartbeatMap::check_touch_file()+0x2c) [0x7f35166f8c2c]
>  5: (CephContextServiceThread::entry()+0x15b) [0x7f35167d331b]
>  6: (()+0x7dc5) [0x7f35146ecdc5]
>  7: (clone()+0x6d) [0x7f3512d77ced]
>  NOTE: a copy of the executable, or `objdump -rdS ` is needed to
> interpret this.
> 2016-06-01 09:31:57.205990 7f3510669700 -1 common/HeartbeatMap.cc: In
> function 'bool ceph::HeartbeatMap::_check(const ceph::heartbeat_handle_d*,
> const char*, time_t)' thread 7f3510669700 time 2016-06-01 09:31:57.201687
> common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")
>
> Brandon
>
>
> On Wed, Jun 1, 2016 at 9:13 AM, Adam Tygart <mo...@ksu.edu> wrote:
>>
>> Hello all,
>>
>> I'm running into an issue with ceph osds crashing over the last 4
>> days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.
>>
>> A little setup information:
>> 26 hosts
>> 2x 400GB Intel DC P3700 SSDs
>> 12x6TB spinning disks
>> 4x4TB spinning disks.
>>
>> The SSDs are used for both journals and as an OSD (for the cephfs
>> metadata pool).
>>
>> We were running Ceph with some success in this configuration
>> (upgrading ceph from hammer to infernalis to jewel) for the past 8-10
>> months.
>>
>> Up through Friday, we were healthy.
>>
>> Until Saturday. On Saturday, the OSDs on the SSDs started flapping and
>> then finally dying off, hitting their suicide timeout due to missing
>> heartbeats. At the time, we

[ceph-users] Crashing OSDs (suicide timeout, following a single pool)

2016-06-01 Thread Adam Tygart
Hello all,

I'm running into an issue with ceph osds crashing over the last 4
days. I'm running Jewel (10.2.1) on CentOS 7.2.1511.

A little setup information:
26 hosts
2x 400GB Intel DC P3700 SSDs
12x6TB spinning disks
4x4TB spinning disks.

The SSDs are used for both journals and as an OSD (for the cephfs
metadata pool).

We were running Ceph with some success in this configuration
(upgrading ceph from hammer to infernalis to jewel) for the past 8-10
months.

Up through Friday, we were healthy.

Until Saturday. On Saturday, the OSDs on the SSDs started flapping and
then finally dying off, hitting their suicide timeout due to missing
heartbeats. At the time, we were running Infernalis, getting ready to
upgrade to Jewel.

I spent the weekend and Monday attempting to stabilize those OSDs,
unfortunately failing. As part of the stabilization attempts, I checked
iostat -x; the SSDs were seeing 1000 iops each. I checked wear levels
and overall SMART health of the SSDs; everything looked normal. I
checked to make sure the time was in sync between all hosts.

I also tried to move the metadata pool to the spinning disks (to
remove some dependence on the SSDs, just in case). The suicide timeout
issues followed the pool migration. The spinning disks started timing
out. This was at a time when *all* of the client IOPS to the ceph
cluster were in the low 100's as reported by ceph -s. I was
restarting failed OSDs as fast as they were dying and I couldn't keep
up. I checked the switches and NICs for errors and drops. No changes
in the frequency of them. We're talking an error every 20-25 minutes.
I would expect network issues to affect other OSDs (and pools) in the
system, too.

On Tuesday, I got together with my coworker, and we tried together to
stabilize the cluster. We finally went into emergency maintenance
mode, as we could not get the metadata pool healthy. We stopped the
MDS, we tried again to let things stabilize, with no client IO to the
pool. Again more suicide timeouts.

Then, we rebooted the ceph nodes, figuring there *might* be something
stuck in a hardware IO queue or cache somewhere. Again more crashes
when the machines came back up.

We figured at this point, there was nothing to lose by performing the
update to Jewel, and, who knows, maybe we were hitting a bug that had
been fixed. Reboots were involved again (kernel updates, too).

More crashes.

I finally decided that there *could* be an unlikely chance that jumbo
frames might suddenly be an issue (after years of using them with
these switches). I turned down the MTUs on the ceph nodes to the
standard 1500.

More crashes.

We decided to try and let things settle out overnight, with no IO.
That brings us to today:

We have 51 Intel P3700 SSDs driving this pool, and now 26 of them have
crashed due to the suicide timeout. I've tried starting them one at a
time, they're still dying off with suicide timeouts.

I've gathered the logs I could think to:
A crashing OSD: http://people.cs.ksu.edu/~mozes/osd.16.log
CRUSH Tree: http://people.cs.ksu.edu/~mozes/crushtree.txt
OSD Tree: http://people.cs.ksu.edu/~mozes/osdtree.txt
Pool Definitions: http://people.cs.ksu.edu/~mozes/pools.txt

At the moment, we're dead in the water. I would appreciate any
pointers to getting this fixed.

--
Adam Tygart
Beocat Sysadmin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hadoop on cephfs

2016-04-30 Thread Adam Tygart
Supposedly cephfs-hadoop worked and/or works on hadoop 2. I am in the
process of getting it working with cdh5.7.0 (based on hadoop 2.6.0).
I'm under the impression that it is/was working with 2.4.0 at some
point in time.

At this very moment, I can use all of the DFS tools built into hadoop
to create, list, delete, rename, and concat files. What I am not able
to do (currently) is run any jobs.

https://github.com/ceph/cephfs-hadoop

It can be built using current (at least infernalis with my testing)
cephfs-java and libcephfs. The only thing you'll for sure need to do
is patch the file referenced here:
https://github.com/ceph/cephfs-hadoop/issues/25 When building, you'll
want to tell maven to skip tests (-Dmaven.test.skip=true).
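
For reference, a minimal sketch of that build, assuming maven and the
cephfs-java/libcephfs packages are already installed (configuration of
core-site.xml afterwards follows the repo's README):

    git clone https://github.com/ceph/cephfs-hadoop.git
    cd cephfs-hadoop
    # apply the fix from https://github.com/ceph/cephfs-hadoop/issues/25 first
    mvn -Dmaven.test.skip=true package
    # the jar under target/ then goes on the hadoop classpath next to libcephfs.jar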

Like I said, I am digging into this still, and I am not entirely
convinced my issues are ceph related at the moment.

--
Adam

On Sat, Apr 30, 2016 at 1:51 PM, Erik McCormick
<emccorm...@cirrusseven.com> wrote:
> I think what you are thinking of is the driver that was built to actually
> replace hdfs with rbd. As far as I know that thing had a very short lifespan
> on one version of hadoop. Very sad.
>
> As to what you proposed:
>
> 1) Don't use Cephfs in production pre-jewel.
>
> 2) running hdfs on top of ceph is a massive waste of disk and fairly
> pointless as you make replicas of replicas.
>
> -Erik
>
> On Apr 29, 2016 9:20 PM, "Bill Sharer" <bsha...@sharerland.com> wrote:
>>
>> Actually this guy is already a fan of Hadoop.  I was just wondering
>> whether anyone has been playing around with it on top of cephfs lately.  It
>> seems like the last round of papers were from around cuttlefish.
>>
>> On 04/28/2016 06:21 AM, Oliver Dzombic wrote:
>>>
>>> Hi,
>>>
>>> bad idea :-)
>>>
>>> It's of course nice and important to draw developers towards a
>>> new/promising technology/software.
>>>
>>> But if the technology does not match the individual's requirements, you
>>> will just risk showing this developer how bad this new/promising
>>> technology is.
>>>
>>> So you will just achieve the opposite of what you want.
>>>
>>> So before you build something big, like Hadoop, on top of unstable
>>> software, maybe you should not use it at all.
>>>
>>> For the good of the developer, for your own good, and for the good of
>>> the reputation of the new/promising technology/software you favour.
>>>
>>> Forcing a penguin to somehow live in the Sahara might be possible (at
>>> least for some time), but it is usually not a good idea ;-)
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
Unfortunately, it seems that as users (and developers) get more in tune
with software projects, we forget what is and isn't common knowledge.

Perhaps said "wall of text" should be a glossary of terms. A
definition list, something that can be open in one tab, and define any
ceph-specific or domain-specific terms. Maybe linking back to the
glossary for any specific instance of that term. Maybe there should be
a glossary per topic, as cephfs has its own set of domain-specific
language that isn't necessarily any use to those using rbd.

Comment systems are great, until you need people to moderate them, and
then that takes time away from people that could either be developing
the software or updating documentation.

On Thu, Feb 25, 2016 at 11:24 PM, Nigel Williams
<nigel.d.willi...@gmail.com> wrote:
> On Fri, Feb 26, 2016 at 4:09 PM, Adam Tygart <mo...@ksu.edu> wrote:
>> The docs are already split by version, although it doesn't help that
>> it isn't linked in an obvious manner.
>>
>> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>
> Is there any reason to keep this "master" (version-less variant) given
> how much confusion it causes?
>
> I think I noticed the version split one time back but it didn't lodge
> in my mind, and when I looked for something today I hit the "master"
> and there were no hits for the version (which I should have been
> looking at).
>
> I'd be glad to contribute to the documentation effort. For example I
> would like to be able to ask questions around the terminology that is
> scattered through the documentation that I think needs better
> explanation. I'm not sure if pull-requests that try to annotate what
> is there would mean some parts would become a wall of text whereas the
> explanation would be better suited as a (more informal) comment-thread
> at the bottom of the page that can be browsed (mainly by beginners
> trying to navigate an unfamiliar architecture).
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] State of Ceph documention

2016-02-25 Thread Adam Tygart
The docs are already split by version, although it doesn't help that
it isn't linked in an obvious manner.

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/

http://docs.ceph.com/docs/hammer/rados/operations/cache-tiering/

Updating the documentation takes a lot of effort from all involved, and
in a project this size, it probably needs a team of people. From what
I can tell, all the documentation is in the ceph source tree, and
submitting pull requests/tickets is probably a good option to keep it
up to date. From my perspective it is also our failure (the users')
that we don't update the docs when we run into issues.
--
Adam

On Thu, Feb 25, 2016 at 10:59 PM, Nigel Williams
<nigel.d.willi...@gmail.com> wrote:
> On Fri, Feb 26, 2016 at 3:10 PM, Christian Balzer <ch...@gol.com> wrote:
>>
>> Then we come to a typical problem for fast evolving SW like Ceph, things
>> that are not present in older versions.
>
>
> I was going to post on this too (I had similar frustrations), and would like
> to propose a move to splitting the documentation by versions:
>
> OLD
> http://docs.ceph.com/docs/master/rados/operations/cache-tiering/
>
>
> NEW
> http://docs.ceph.com/docs/master/hammer/rados/operations/cache-tiering/
>
> http://docs.ceph.com/docs/master/infernalis/rados/operations/cache-tiering/
>
> http://docs.ceph.com/docs/master/jewel/rados/operations/cache-tiering/
>
> and so on.
>
> When a new version is started, the documentation should be 100% cloned and
> the tree restructured around the version. It could equally be a drop-down on
> the page to select the version.
>
> Postgres for example uses a similar mechanism:
>
> http://www.postgresql.org/docs/
>
> Note the version numbers are embedded in the URL. I like their commenting
> mechanism too as it provides a running narrative of changes that should be
> considered as practice develops around things to do or avoid.
>
> Once the documentation is cloned for the new version, all the inapplicable
> material should be removed and the new features/practice changes should be
> added.
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] downloads.ceph.com no longer valid?

2016-01-27 Thread Adam
It's not hosted on ceph?  Well that's your problem right there. ;-)
On Jan 27, 2016 3:55 PM, "Gregory Farnum"  wrote:

> Nah, it's not hosted on Ceph.
>
> On Wed, Jan 27, 2016 at 1:39 PM, Tyler Bishop
>  wrote:
> > tyte... ceph pool go rogue?
> >
> > - Original Message -
> > From: "Gregory Farnum" 
> > To: "John Hogenmiller" 
> > Cc: ceph-users@lists.ceph.com
> > Sent: Wednesday, January 27, 2016 2:08:36 PM
> > Subject: Re: [ceph-users] downloads.ceph.com no longer valid?
> >
> > Infrastructure guys say it's down and they are working on it.
> >
> > On Wed, Jan 27, 2016 at 11:01 AM, John Hogenmiller 
> wrote:
> >> I dug a bit more.
> >>
> >> download.ceph.com resolves (for me) to 173.236.253.173 and is not
> responding
> >> to icmp, port 80, or 443
> >>
> >> https://git.ceph.com/release.asc works
> >> https://ceph.com/keys/release.asc returns 404
> >> http://eu.ceph.com/keys/release.asc works
> >> http://au.ceph.com/keys/release.asc returns 404
> >>
> >> http://eu.ceph.com/debian-infernalis/ works
> >> http://ceph.com/debian-infernalis/ redirects to
> >> http://download.ceph.com/debian-infernalis/
> >> http://au.ceph.com/debian-infernalis/ returns 503
> >>
> >> So at this point, it looks like eu.ceph.com is working, au.ceph.com is
> out
> >> of sync, and download.ceph.com is not working (and it didn't work for
> me
> >> last week either, requiring me to use gitbuilder.ceph.com.
> >>
> >>
> >>
> >> On Wed, Jan 27, 2016 at 11:58 AM, Moulin Yoann 
> wrote:
> >>>
> >>> Hello,
> >>>
> >>>
> >>> > I installed ceph last week from the docs and noticed all the
> >>> > downloads.ceph.com  and ceph.com/download
> >>> >  links no longer worked.  After various
> >>> > searching around, I substituted and got past this. But then I forgot
> >>> > about it until I went to install again this week.
> >>> >
> >>> > Are the below URLs from my history the new correct?  If so, should I
> go
> >>> > ahead and try and get a PR for docs updated?
> >>> >
> >>> > https://raw.github.com/ceph/ceph/master/keys/autobuild.asc
> >>> >
> >>> > http://gitbuilder.ceph.com/libapache-mod-fastcgi-deb-$(lsb_release
> >>> > -sc)-x86_64-basic/ref/master
> >>>
> >>> it was OK one 1h ago, maybe something goes wrong on some servers
> >>>
> >>> --
> >>> Yoann Moulin
> >>> EPFL IC-IT
> >>
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-18 Thread Adam Tygart
It appears that with --apparent-size, du adds the "size" of the
directories to the total as well. On most filesystems this is the
block size, or the amount of metadata space the directory is using. On
CephFS, this size is fabricated to be the sum of the sizes of all files
below the directory, i.e. a cheap/free 'du -sh $folder'.

$ stat /homes/mozes/tmp/sbatten
  File: '/homes/mozes/tmp/sbatten'
  Size: 138286  Blocks: 0  IO Block: 65536  directory
Device: 0h/0d   Inode: 1099523094368  Links: 1
Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
Access: 2016-01-19 00:12:23.331201000 -0600
Modify: 2015-10-14 13:38:01.098843320 -0500
Change: 2015-10-14 13:38:01.098843320 -0500
 Birth: -
$ stat /tmp/sbatten/
  File: '/tmp/sbatten/'
  Size: 4096Blocks: 8  IO Block: 4096   directory
Device: 803h/2051d  Inode: 9568257 Links: 2
Access: (0755/drwxr-xr-x)  Uid: (163587/   mozes)   Gid: (163587/mozes_users)
Access: 2016-01-19 00:12:23.331201000 -0600
Modify: 2015-10-14 13:38:01.098843320 -0500
Change: 2016-01-19 00:17:29.658902081 -0600
 Birth: -

$ du -s --apparent-size -B1 /homes/mozes/tmp/sbatten
276572  /homes/mozes/tmp/sbatten
$ du -s -B1 /homes/mozes/tmp/sbatten
147456  /homes/mozes/tmp/sbatten

$ du -s -B1 /tmp/sbatten
225280  /tmp/sbatten
$ du -s --apparent-size -B1 /tmp/sbatten
142382  /tmp/sbatten

Notice how the apparent-size total is *exactly* the directory's Size from
stat plus the apparent size of the files it contains (138286 bytes in both
cases)?
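
A cheaper way to get the same recursive numbers, assuming a client recent
enough to expose the CephFS virtual xattrs:

    getfattr -n ceph.dir.rbytes /homes/mozes/tmp/sbatten   # sum of file sizes below the dir
    getfattr -n ceph.dir.rfiles /homes/mozes/tmp/sbatten   # number of files below the dir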

--
Adam

On Mon, Jan 18, 2016 at 11:45 PM, Francois Lafont <flafdiv...@free.fr> wrote:
> On 19/01/2016 05:19, Francois Lafont wrote:
>
>> However, I still have a question. Since my previous message, supplementary
>> data have been put in the cephfs and the values have changes as you can see:
>>
>> ~# du -sh /mnt/cephfs/
>> 1.2G  /mnt/cephfs/
>>
>> ~# du --apparent-size -sh /mnt/cephfs/
>> 6.4G  /mnt/cephfs/
>>
>> You can see that the difference between "disk usage" and "apparent size"
>> has really increased and it seems to me curious that only sparse files can
>> explain this difference (in my mind, sparse files are very specific files
>> and here the files are essentially images which doesn't seem to me potential
>> sparse files). I'm not completely sure but I think that same files are put in
>> the cephfs directory.
>>
>> Do you think it's possible that the sames file present in different 
>> directories
>> of the cephfs are stored in only one object in the cephfs pool?
>>
>> This is my feeling when I see the difference between "apparent size" and
>> "disk usage" which has increased. Am I wrong?
>
> In fact, I'm not so sure. Here another information, where /backups is a XFS 
> partition:
>
> ~# du --apparent-size -sh 
> /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
> 2.8G/mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
>
> ~# du -sh /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
> 701M/mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/
>
> ~# cp -r /mnt/cephfs/0/5/05286c08-2270-41e7-8055-64eae169bd46/data/ 
> /backups/test
>
> ~# du -sh /backups/test
> 701M/backups/test
>
> ~# du --apparent-size -sh /backups/test
> 701M/backups/test
>
> So I definitively don't understand of du --apparent-size -sh...
>
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Infernalis, cephfs: difference between df and du

2016-01-17 Thread Adam Tygart
As I understand it:

4.2G is used by ceph (all replication, metadata, et al) it is a sum of
all the space "used" on the osds.
958M is the actual space the data in cephfs is using (without replication).
3.8G means you have some sparse files in cephfs.

'ceph df detail' should return something close to 958MB used for your
cephfs "data" pool. "RAW USED" should be close to 4.2GB

--
Adam

On Sun, Jan 17, 2016 at 9:53 PM, Francois Lafont <flafdiv...@free.fr> wrote:
> On 18/01/2016 04:19, Francois Lafont wrote:
>
>> ~# du -sh /mnt/cephfs
>> 958M  /mnt/cephfs
>>
>> ~# df -h /mnt/cephfs/
>> Filesystem  Size  Used Avail Use% Mounted on
>> ceph-fuse55T  4.2G   55T   1% /mnt/cephfs
>
> Even with the option --apparent-size, the size are different (but closer 
> indeed):
>
> ~# du -sh --apparent-size /mnt/cephfs
> 3.8G/mnt/cephfs
>
>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] systemd support?

2016-01-04 Thread Adam


On 01/01/2016 08:22 PM, Adam wrote:
> I'm running into the same install problem described here:
> https://www.spinics.net/lists/ceph-users/msg23533.html
> 
> I tried compiling from source (ceph-9.2.0) to see if it had been fixed
> in the latest code, but I got the same error as with the pre-compiled
> binaries.  Is there any solution or workaround to this?

I just learned that ceph-deploy isn't included with ceph.  I was using
the latest Ubuntu package (1.5.20-0ubuntu1).  I cloned the latest from
git (1.5.31).  Both of them have the exact same error.


Here's the exact error message:
[horde.diseasedmind.com][INFO  ] Running command: sudo initctl emit
ceph-mon cluster=ceph id=horde
[horde.diseasedmind.com][WARNIN] initctl: Unable to connect to Upstart:
Failed to connect to socket /com/ubuntu/upstart: Connection refused
[horde.diseasedmind.com][ERROR ] RuntimeError: command returned non-zero
exit status: 1
[ceph_deploy.mon][ERROR ] Failed to execute command: initctl emit
ceph-mon cluster=ceph id=horde
[ceph_deploy][ERROR ] GenericError: Failed to create 1 monitors

I'm using Ubuntu 15.04 (Vivid Vervet).
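
A possible manual workaround, assuming the infernalis packages installed the
ceph-mon@.service unit and ceph-deploy has already created the mon data
directory (which it has by the time it runs initctl):

    sudo systemctl enable ceph-mon@horde
    sudo systemctl start ceph-mon@horde
    sudo systemctl status ceph-mon@horde   # then continue with ceph-deploy gatherkeys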



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] systemd support?

2016-01-02 Thread Adam
I believe so. I'm using ceph-9.2.0.
On Jan 2, 2016 9:53 AM, "Dan Nica" <dan.n...@staff.bluematrix.com> wrote:

> Are you using latest ceph-deploy ?
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Adam
> Sent: Saturday, January 2, 2016 4:22 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] systemd support?
>
>
> I'm running into the same install problem described here:
> https://www.spinics.net/lists/ceph-users/msg23533.html
>
> I tried compiling from source (ceph-9.2.0) to see if it had been fixed in
> the latest code, but I got the same error as with the pre-compiled
> binaries.  Is there any solution or workaround to this?
>
> If not, could someone point me to the code which is responsible and I'll
> see if I can fix it up and submit a patch?
>
> Thanks,
> Adam
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] systemd support?

2016-01-01 Thread Adam

I'm running into the same install problem described here:
https://www.spinics.net/lists/ceph-users/msg23533.html

I tried compiling from source (ceph-9.2.0) to see if it had been fixed
in the latest code, but I got the same error as with the pre-compiled
binaries.  Is there any solution or workaround to this?

If not, could someone point me to the code which is responsible and
I'll see if I can fix it up and submit a patch?

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coded pools and 'feature set mismatch' issue

2015-11-08 Thread Adam Tygart
The problem is that "hammer" tunables (i.e. "optimal" in v0.94.x) are
incompatible with the kernel interfaces before Linux 4.1 (namely due
to straw2 buckets). To make use of the kernel interfaces in 3.13, I
believe you'll need "firefly" tunables.
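
If dropping the profile is acceptable (note that changing tunables triggers
data movement), a minimal sketch:

    ceph osd crush show-tunables      # what the cluster currently advertises
    ceph osd crush tunables firefly   # a profile the 3.13 kernel client understands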

--
Adam

On Sun, Nov 8, 2015 at 11:48 PM, Bogdan SOLGA <bogdan.so...@gmail.com> wrote:
> Hello Greg!
>
> Thank you for your advice, first of all!
>
> I have tried to adjust the Ceph tunables detailed in this page, but without
> success. I have tried both 'ceph osd crush tunables optimal' and 'ceph osd
> crush tunables hammer', but both lead to the same 'feature set mismatch'
> issue, whenever I tried to create a new RBD image, afterwards. The only way
> I could restore the proper functioning of the cluster was to set the
> tunables to default ('ceph osd crush tunables default'), which are the
> default values for a new cluster.
>
> So... either I'm doing something incompletely, or I'm doing something wrong.
> Any further advice on how to be able to use EC pools is highly welcomed.
>
> Thank you!
>
> Regards,
> Bogdan
>
>
> On Mon, Nov 9, 2015 at 12:20 AM, Gregory Farnum <gfar...@redhat.com> wrote:
>>
>> With that release it shouldn't be the EC pool causing trouble; it's the
>> CRUSH tunables also mentioned in that thread. Instructions should be
>> available in the docs for using older tunable that are compatible with
>> kernel 3.13.
>> -Greg
>>
>>
>> On Saturday, November 7, 2015, Bogdan SOLGA <bogdan.so...@gmail.com>
>> wrote:
>>>
>>> Hello, everyone!
>>>
>>> I have recently created a Ceph cluster (v 0.94.5) on Ubuntu 14.04.3 and I
>>> have created an erasure coded pool, which has a caching pool in front of it.
>>>
>>> When trying to map RBD images, regardless if they are created in the rbd
>>> or in the erasure coded pool, the operation fails with 'rbd: map failed: (5)
>>> Input/output error'. Searching the internet for a solution... I came across
>>> this page, which seems to detail exactly the same issue - a
>>> 'misunderstanding' between erasure coded pools and the 3.13 kernel (used by
>>> Ubuntu).
>>>
>>> Can you please advise on a fix for that issue? As we would prefer to use
>>> erasure coded pools, the only solutions which came into my mind were:
>>>
>>> upgrade to the Infernalis Ceph release, although I'm not sure the issue
>>> is fixed in that version;
>>>
>>> upgrade the kernel (on all the OSDs and Ceph clients) to the 3.14+
>>> kernel;
>>>
>>> Any better / easier solution is highly appreciated.
>>>
>>> Regards,
>>>
>>> Bogdan
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

