Re: [ceph-users] CephFS: No space left on device

2016-10-07 Thread Mykola Dvornik
10.2.2

-Mykola

On 7 October 2016 at 15:43, Yan, Zheng <uker...@gmail.com> wrote:

> On Thu, Oct 6, 2016 at 4:11 PM,  <mykola.dvor...@gmail.com> wrote:
> > Is there any way to repair pgs/cephfs gracefully?
> >
>
> So far no.  We need to write a tool to repair this type of corruption.
>
> Which version of ceph did you use before upgrading to 10.2.3 ?
>
> Regards
> Yan, Zheng
>
> >
> >
> > -Mykola
> >
> >
> >
> > From: Yan, Zheng
> > Sent: Thursday, 6 October 2016 04:48
> > To: Mykola Dvornik
> > Cc: John Spray; ceph-users
> > Subject: Re: [ceph-users] CephFS: No space left on device
> >
> >
> >
> > On Wed, Oct 5, 2016 at 2:27 PM, Mykola Dvornik <mykola.dvor...@gmail.com>
> > wrote:
> >
> >> Hi Zheng,
> >>
> >> Many thanks for your reply.
> >>
> >> This indicates the MDS metadata is corrupted. Did you do any unusual
> >> operation on the cephfs? (e.g reset journal, create new fs using
> >> existing metadata pool)
> >>
> >> No, nothing has been explicitly done to the MDS. I had a few inconsistent
> >> PGs that belonged to the (3 replica) metadata pool. The symptoms were
> >> similar to http://tracker.ceph.com/issues/17177 . The PGs were eventually
> >> repaired and no data corruption was expected as explained in the ticket.
> >
> > I'm afraid that issue does cause corruption.
> >
> >> BTW, when I posted this issue on the ML the amount of ground state stry
> >> objects was around 7.5K. Now it went up to 23K. No inconsistent PGs or any
> >> other problems happened to the cluster within this time scale.
> >>
> >> -Mykola
> >>
> >> On 5 October 2016 at 05:49, Yan, Zheng <uker...@gmail.com> wrote:
> >>>
> >>> On Mon, Oct 3, 2016 at 5:48 AM, Mykola Dvornik <mykola.dvor...@gmail.com>
> >>> wrote:
> >>> > Hi Johan,
> >>> >
> >>> > Many thanks for your reply. I will try to play with the mds tunables
> >>> > and report back to you ASAP.
> >>> >
> >>> > So far I see that mds log contains a lot of errors of the following
> >>> > kind:
> >>> >
> >>> > 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
> >>> > _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
> >>> > ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728
> >>> > 1=1+0) (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07
> >>> > 23:06:29.776298
> >>> >
> >>> > 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log
> >>> > [ERR] : loaded dup inode 10005729a77 [2,head] v68621 at
> >>> > /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
> >>> > but inode 10005729a77.head v67464942 already exists at
> >>> > ~mds0/stray1/10005729a77
> >>>
> >>> This indicates the MDS metadata is corrupted. Did you do any unusual
> >>> operation on the cephfs? (e.g reset journal, create new fs using
> >>> existing metadata pool)
> >>>
> >>> > Those folders within mds.0.cache.dir that got badness report a size of
> >>> > 16EB on the clients. rm on them fails with 'Directory not empty'.
> >>> >
> >>> > As for the "Client failing to respond to cache pressure", I have 2
> >>> > kernel clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always
> >>> > running the most recent release version of ceph-fuse. The funny thing
> >>> > is that every

Re: [ceph-users] CephFS: No space left on device

2016-10-05 Thread Mykola Dvornik
Hi Zheng,

Many thanks for your reply.

This indicates the MDS metadata is corrupted. Did you do any unusual
operation on the cephfs? (e.g reset journal, create new fs using
existing metadata pool)

No, nothing has been explicitly done to the MDS. I had a few inconsistent
PGs that belonged to the (3 replica) metadata pool. The symptoms were
similar to http://tracker.ceph.com/issues/17177 . The PGs were eventually
repaired and no data corruption was expected as explained in the ticket.

BTW, when I posted this issue on the ML the amount of ground state stry
objects was around 7.5K. Now it went up to 23K. No inconsistent PGs or any
other problems happened to the cluster within this time scale.

-Mykola

On 5 October 2016 at 05:49, Yan, Zheng <uker...@gmail.com> wrote:

> On Mon, Oct 3, 2016 at 5:48 AM, Mykola Dvornik <mykola.dvor...@gmail.com>
> wrote:
> > Hi Johan,
> >
> > Many thanks for your reply. I will try to play with the mds tunables and
> > report back to you ASAP.
> >
> > So far I see that mds log contains a lot of errors of the following kind:
> >
> > 2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
> > _fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
> > ~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728 1=1+0)
> > (iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07
> 23:06:29.776298
> >
> > 2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log
> [ERR] :
> > loaded dup inode 10005729a77 [2,head] v68621 at
> > /users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/
> j002654.out/m_xrange192-320_yrange192-320_016232.dump,
> > but inode 10005729a77.head v67464942 already exists at
> > ~mds0/stray1/10005729a77
>
> This indicates the MDS metadata is corrupted. Did you do any unusual
> operation on the cephfs? (e.g reset journal, create new fs using
> existing metadata pool)
>
> >
> > Those folders within mds.0.cache.dir that got badness report a size of
> 16EB
> > on the clients. rm on them fails with 'Directory not empty'.
> >
> > As for the "Client failing to respond to cache pressure", I have 2 kernel
> > clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the most
> > recent release version of ceph-fuse. The funny thing is that every single
> > client misbehaves from time to time. I am aware of quite a discussion about
> > this issue on the ML, but cannot really follow how to debug it.
> >
> > Regards,
> >
> > -Mykola
> >
> > On 2 October 2016 at 22:27, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
> >> <mykola.dvor...@gmail.com> wrote:
> >> > After upgrading to 10.2.3 we frequently see messages like
> >>
> >> From which version did you upgrade?
> >>
> >> > 'rm: cannot remove '...': No space left on device
> >> >
> >> > The folders we are trying to delete contain approx. 50K files 193 KB
> >> > each.
> >>
> >> My guess would be that you are hitting the new
> >> mds_bal_fragment_size_max check.  This limits the number of entries
> >> that the MDS will create in a single directory fragment, to avoid
> >> overwhelming the OSD with oversized objects.  It is 100000 by default.
> >> This limit also applies to "stray" directories where unlinked files
> >> are put while they wait to be purged, so you could get into this state
> >> while doing lots of deletions.  There are ten stray directories that
> >> get a roughly even share of files, so if you have more than about one
> >> million files waiting to be purged, you could see this condition.
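
If this limit really is what is being hit, it can in principle be raised at
runtime; the value below is purely an illustration, and a larger fragment
size means larger directory objects for the OSDs to handle:

    ceph tell mds.0 injectargs '--mds_bal_fragment_size_max 200000'

where mds.0 stands for the actual MDS name.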
> >>
> >> The "Client failing to respond to cache pressure" messages may play a
> >> part here -- if you have misbehaving clients then they may cause the
> >> MDS to delay purging stray files, leading to a backlog.  If your
> >> clients are by any chance older kernel clients, you should upgrade
> >> them.  You can also unmount/remount them to clear this state, although
> >> it will reoccur until the clients are updated (or until the bug is
> >> fixed, if you're running latest clients already).
> >>
> >> The high level counters for strays are part of the default output of
> >> "ceph daemonperf mds." when run on the MDS server (the "stry" and
> >> "purg" columns).  You can look at these to watch how fast the MDS is
> >> clearing out strays.  If your backlog is just because it's not doing
> >> it fast eno

Re: [ceph-users] CephFS: No space left on device

2016-10-04 Thread Mykola Dvornik
 = '1000' (unchangeable)
*-mds--
--mds_server-- ---objecter--- -mds_cache- ---mds_log
rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs evts
subm|
  0   99k 1.0k|  000 |1110  260 |  00   68k 110 | 39   29k
111
  0   99k 1.0k|  000 |1980  260 |  00   68k 198 | 39   29k
198
  0   52k 1.0k|  000 |1090  264 |  00   68k 102 | 39   23k
106
  0   52k 1.0k|  000 |1300  265 |  00   68k 125 | 39   23k
125
  0   52k 1.0k|  010 |1270  265 |  00   67k 127 | 39   23k
127
  0   52k 1.0k|  000 | 840  264 |  00   67k  84 | 39   24k
84
  0   52k 1.0k|  000 | 800  263 |  00   67k  80 | 39   24k
80
  0   52k 1.0k|  000 | 890  260 |  00   67k  87 | 32   24k
89
  0   52k 1.0k|  000 |1340  259 |  00   67k 134 | 32   24k
134
  0   52k 1.0k|  000 |1550  259 |  00   67k 152 | 33   24k
154
  0   52k 1.0k|  000 | 990  257 |  00   67k  99 | 33   24k
99
  0   52k 1.0k|  000 | 840  257 |  00   67k  84 | 33   24k
84
  0   52k 1.0k|  000 |1170  257 |  00   67k 115 | 33   24k
115
  0   52k 1.0k|  000 |1220  257 |  00   66k 122 | 33   24k
122
  0   52k 1.0k|  000 | 730  257 |  00   66k  73 | 33   24k
73
  0   52k 1.0k|  000 |1230  257 |  00   66k 123 | 33   25k
123
  0   52k 1.0k|  000 | 870  257 |  00   66k  87 | 33   25k
87
  0   52k 1.0k|  000 | 850  257 |  00   66k  83 | 33   25k
83
  0   52k 1.0k|  000 | 550  257 |  00   66k  55 | 33   25k
55
  0   52k 1.0k|  000 | 340  257 |  00   66k  34 | 33   25k
34
  0   52k 1.0k|  000 | 580  257 |  00   66k  58 | 33   25k
58
  0   52k 1.0k|  000 | 350  257 |  00   66k  35 | 33   25k
35
  0   52k 1.0k|  000 | 650  259 |  00   66k  63 | 31   22k
64
  0   52k 1.0k|  000 | 520  258 |  00   66k  52 | 31   23k
52

Seems like purge rate is virtually not sensitive to mds_max_purge_files.
BTW, the rm completed well before the stry approached the ground state.
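
In case it is useful to anyone else reading along: the values the MDS is
actually running with can be read back over its admin socket, roughly like
this (mds.0 stands in for the local MDS name):

    ceph daemon mds.0 config get mds_max_purge_files
    ceph daemon mds.0 config show | grep purge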

-Mykola




On 4 October 2016 at 09:16, John Spray <jsp...@redhat.com> wrote:

> (Re-adding list)
>
> The 7.5k stray dentries while idle is probably indicating that clients
> are holding onto references to them (unless you unmount the clients
> and they don't purge, in which case you may well have found a bug).
> The other way you can end up with lots of dentries sitting in stray
> dirs is if you had lots of hard links and unlinked the original
> location but left the hard link in place.
>
> The rate at which your files are purging seems to roughly correspond
> to mds_max_purge_files, so I'd definitely try changing that to get
> things purging faster.
>
> John
>
> On Mon, Oct 3, 2016 at 3:21 PM, Mykola Dvornik <mykola.dvor...@gmail.com>
> wrote:
> > Hi John,
> >
> > This is how the daemonperf looks like :
> >
> > background
> >
> > -mds-- --mds_server-- ---objecter--- -mds_cache-
> > ---mds_log
> > rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs
> evts
> > subm|
> >   0   99k 177k|  000 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  000 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  000 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  050 |  000 |  00  7.5k   0 | 31
>  22k
> > 1
> >   0   99k 177k|  000 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  000 |  200 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  020 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  020 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  010 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  020 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  000 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  010 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >   0   99k 177k|  060 |  000 |  00  7.5k   0 | 31
>  22k
> > 0
> >
> > with 4 rm instances
> >
> > -mds-- --mds_server-- ---objecter--- -mds_cache-
> > ---mds_log
> > rlat inos caps|hsr  hcs  hcr |writ read actv|recd recy stry purg|segs
> evts
> > subm|
> >   0  172k 174k|  05  3.1k| 850   34 |  00   79

Re: [ceph-users] CephFS: No space left on device

2016-10-02 Thread Mykola Dvornik
Hi Johan,

Many thanks for your reply. I will try to play with the mds tunables and
> > report back to you ASAP.

So far I see that mds log contains a lot of errors of the following kind:

2016-10-02 11:58:03.002769 7f8372d54700  0 mds.0.cache.dir(100056ddecd)
_fetched  badness: got (but i already had) [inode 10005729a77 [2,head]
~mds0/stray1/10005729a77 auth v67464942 s=196728 nl=0 n(v0 b196728 1=1+0)
(iversion lock) 0x7f84acae82a0] mode 33204 mtime 2016-08-07 23:06:29.776298

2016-10-02 11:58:03.002789 7f8372d54700 -1 log_channel(cluster) log [ERR] :
loaded dup inode 10005729a77 [2,head] v68621 at
/users/mykola/mms/NCSHNO/final/120nm-uniform-h8200/j002654.out/m_xrange192-320_yrange192-320_016232.dump,
but inode 10005729a77.head v67464942 already exists at
~mds0/stray1/10005729a77

Those folders within mds.0.cache.dir that got badness report a size of 16EB
on the clients. rm on them fails with 'Directory not empty'.

As for the "Client failing to respond to cache pressure", I have 2 kernel
clients on 4.4.21, 1 on 4.7.5 and 16 fuse clients always running the most
recent release version of ceph-fuse. The funny thing is that every single
client misbehaves from time to time. I am aware of quite a discussion about
this issue on the ML, but cannot really follow how to debug it.

Regards,

-Mykola

On 2 October 2016 at 22:27, John Spray <jsp...@redhat.com> wrote:

> On Sun, Oct 2, 2016 at 11:09 AM, Mykola Dvornik
> <mykola.dvor...@gmail.com> wrote:
> > After upgrading to 10.2.3 we frequently see messages like
>
> From which version did you upgrade?
>
> > 'rm: cannot remove '...': No space left on device
> >
> > The folders we are trying to delete contain approx. 50K files 193 KB
> each.
>
> My guess would be that you are hitting the new
> mds_bal_fragment_size_max check.  This limits the number of entries
> that the MDS will create in a single directory fragment, to avoid
> overwhelming the OSD with oversized objects.  It is 100000 by default.
> This limit also applies to "stray" directories where unlinked files
> are put while they wait to be purged, so you could get into this state
> while doing lots of deletions.  There are ten stray directories that
> get a roughly even share of files, so if you have more than about one
> million files waiting to be purged, you could see this condition.
>
> The "Client failing to respond to cache pressure" messages may play a
> part here -- if you have misbehaving clients then they may cause the
> MDS to delay purging stray files, leading to a backlog.  If your
> clients are by any chance older kernel clients, you should upgrade
> them.  You can also unmount/remount them to clear this state, although
> it will reoccur until the clients are updated (or until the bug is
> fixed, if you're running latest clients already).
>
> The high level counters for strays are part of the default output of
> "ceph daemonperf mds." when run on the MDS server (the "stry" and
> "purg" columns).  You can look at these to watch how fast the MDS is
> clearing out strays.  If your backlog is just because it's not doing
> it fast enough, then you can look at tuning mds_max_purge_files and
> mds_max_purge_ops to adjust the throttles on purging.  Those settings
> can be adjusted without restarting the MDS using the "injectargs"
> command (http://docs.ceph.com/docs/master/rados/operations/
> control/#mds-subsystem)
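
For example, on a Jewel MDS the purge throttles can be bumped at runtime
roughly like this (the numbers are placeholders, not recommendations), and
the effect watched with daemonperf:

    ceph tell mds.0 injectargs '--mds_max_purge_files 256 --mds_max_purge_ops 32768'
    ceph daemonperf mds.0    # watch the stry and purg columns while it drains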
>
> Let us know how you get on.
>
> John
>
>
> > The cluster state and storage available are both OK:
> >
> > cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
> >  health HEALTH_WARN
> > mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
> > pressure
> > mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
> > pressure
> > mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
> > pressure
> > mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
> > pressure
> > mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
> > pressure
> >  monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
> > election epoch 11, quorum 0 000-s-ragnarok
> >   fsmap e62643: 1/1/1 up {0=000-s-ragnarok=up:active}
> >  osdmap e20203: 16 osds: 16 up, 16 in
> > flags sortbitwise
> >   pgmap v15284654: 1088 pgs, 2 pools, 11263 GB data, 40801 kobjects
> > 23048 GB used, 6745 GB / 29793 GB avail
> > 1085 active+clean
> >2 active+clean+scrubbing
> >1 active+clean+scrubbing+deep
> >
> >
> > Has anybody experienced this issue so far?
> >
> > Regards,
> > --
> >  Mykola
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
 Mykola
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS: No space left on device

2016-10-02 Thread Mykola Dvornik
After upgrading to 10.2.3 we frequently see messages like

'rm: cannot remove '...': No space left on device

The folders we are trying to delete contain approx. 50K files 193 KB each.

The cluster state and storage available are both OK:

cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
 health HEALTH_WARN
mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
pressure
mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
pressure
mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
pressure
mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
pressure
mds0: Client XXX.XXX.XXX.XXX failing to respond to cache
pressure
 monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
election epoch 11, quorum 0 000-s-ragnarok
  fsmap e62643: 1/1/1 up {0=000-s-ragnarok=up:active}
 osdmap e20203: 16 osds: 16 up, 16 in
flags sortbitwise
  pgmap v15284654: 1088 pgs, 2 pools, 11263 GB data, 40801 kobjects
23048 GB used, 6745 GB / 29793 GB avail
1085 active+clean
   2 active+clean+scrubbing
   1 active+clean+scrubbing+deep


Has anybody experienced this issue so far?

Regards,
-- 
 Mykola
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recovering full OSD

2016-08-08 Thread Mykola Dvornik
@Shinobu

According to
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/

"If you cannot start an OSD because it is full, you may delete some data by
deleting some placement group directories in the full OSD."
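
Before deleting anything it is probably worth checking which PG directories
actually take up the space; assuming a FileStore OSD mounted in the default
location (the osd id 12 is just an example):

    du -sh /var/lib/ceph/osd/ceph-12/current/*_head | sort -h | tail -20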


On 8 August 2016 at 13:16, Shinobu Kinjo <shinobu...@gmail.com> wrote:

> On Mon, Aug 8, 2016 at 8:01 PM, Mykola Dvornik <mykola.dvor...@gmail.com>
> wrote:
> > Dear ceph community,
> >
> > One of the OSDs in my cluster cannot start due to the
> >
> > ERROR: osd init failed: (28) No space left on device
> >
> > A while ago it was recommended to manually delete PGs on the OSD to let
> it
> > start.
>
> Who recommended that?
>
> >
> > So I am wondering what is the recommended way to fix this issue for the
> > cluster running Jewel release (10.2.2)?
> >
> > Regards,
> >
> > --
> >  Mykola
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Email:
> shin...@linux.com
> shin...@redhat.com
>



-- 
 Mykola
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recovering full OSD

2016-08-08 Thread Mykola Dvornik
Dear ceph community,

One of the OSDs in my cluster cannot start due to the

*ERROR: osd init failed: (28) No space left on device*

A while ago it was recommended to manually delete PGs on the OSD to let it
start.

So I am wondering what is the recommended way to fix this issue for the
cluster running Jewel release (10.2.2)?

Regards,

-- 
 Mykola
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-15 Thread Mykola Dvornik
I would also advise people to mind SELinux if it is enabled on the
OSD nodes.
The re-labeling should be done as part of the upgrade and this is a
rather time-consuming process.
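
Concretely, the re-labeling is essentially a recursive restorecon over the
OSD data, something along the lines of the command below, which can easily
take hours per node:

    restorecon -R -v /var/lib/ceph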

-Original Message-
From: Mart van Santen 
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel
Date: Fri, 15 Jul 2016 10:48:40 +0200

Hi Wido,

Thank you, we are currently in the same process so this information is
very useful. Can you share why you upgraded from hammer directly to
jewel, is there a reason to skip infernalis? So, I wonder why you
didn't do a hammer->infernalis->jewel upgrade, as that seems the
logical path for me.

(we did indeed see the same errors "Failed to encode map eXXX with
expected crc" when upgrading to the latest hammer)

Regards,

Mart

On 07/15/2016 03:08 AM, 席智勇 wrote:



>   good job, thank you for sharing, Wido~
> it's very useful~
> 
>   
> 
>   
> 
> 2016-07-14 14:33 GMT+08:00 Wido den
> >   Hollander :
> 
> > To add,
> > the RGWs upgraded just fine as well.
> > 
> > 
> > 
> > No regions in use here (yet!), so that upgraded as it
> > should.
> > 
> > 
> > 
> > Wido
> > 
> > 
> > 
> > > Op 13 juli 2016 om 16:56 schreef Wido den Hollander
> > > > :
> > 
> > 
> >   >
> > 
> > >
> > 
> > > Hello,
> > 
> > >
> > 
> > > > > The last 3 days I worked at a customer with a
1800
> > > > OSD cluster which had to be upgraded from Hammer
0.94.5
> > to Jewel 10.2.2
> > 
> > >
> > 
> > > > > The cluster in this case is 99% RGW, but also
some
> > RBD.
> > 
> > >
> > 
> > > > > I wanted to share some of the things we
encountered
> > during this upgrade.
> > 
> > >
> > 
> > > > > All 180 nodes are running CentOS 7.1 on a IPv6-
only
> > network.
> > 
> > >
> > 
> > > ** Hammer Upgrade **
> > 
> > > At first we upgraded from 0.94.5 to 0.94.7, this
> > went well except for the fact that the monitors got
> > spammed with these kind of messages:
> > 
> > >
> > 
> > >   "Failed to encode map eXXX with expected crc"
> > 
> > >
> > 
> > > Some searching on the list brought me to:
> > 
> > >
> > 
> > >   ceph tell osd.* injectargs --
> > --clog_to_monitors=false
> > 
> > >
> > 
> > >  This reduced the load on the 5 monitors and made
> > recovery succeed smoothly.
> > 
> > >
> > 
> > >  ** Monitors to Jewel **
> > 
> > >  The next step was to upgrade the monitors from
> > Hammer to Jewel.
> > 
> > >
> > 
> > > > >  Using Salt we upgraded the packages and
afterwards
> > it was simple:
> > 
> > >
> > 
> > >    killall ceph-mon
> > 
> > >    chown -R ceph:ceph /var/lib/ceph
> > 
> > >    chown -R ceph:ceph /var/log/ceph
> > 
> > >
> > 
> > > Now, a systemd quirck. 'systemctl start
> > > > ceph.target' does not work, I had to manually
enabled
> > the monitor and start it:
> > 
> > >
> > 
> > >   systemctl enable ceph-mon@srv-zmb04-05.service
> > 
> > >   systemctl start ceph-mon@srv-zmb04-05.service
> > 
> > >
> > 
> > > Afterwards the monitors were running just fine.
> > 
> > >
> > 
> > > ** OSDs to Jewel **
> > 
> > > > > To upgrade the OSDs to Jewel we initially used
Salt
> > > > to update the packages on all systems to 10.2.2, we
then
> > > > used a Shell script which we ran on one node at a
time.
> > 
> > >
> > 
> > > The failure domain here is 'rack', so we executed
> > this in one rack, then the next one, etc, etc.
> > 
> > >
> > 
> > > > > Script can be found on Github: https://gist.githu
b.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
> > 
> > >
> > 
> > > > > Be aware that the chown can take a long, long,
very
> > long time!
> > 
> > >
> > 
> > > > > We ran into the issue that some OSDs 

[ceph-users] Maximum possible IOPS for the given configuration

2016-06-29 Thread Mykola Dvornik
Dear ceph-users,

Are there any expressions / calculators available to calculate the
maximum expected random write IOPS of the ceph cluster?

To my understanding of the ceph IO, this should be something like

MAXIOPS = (1-OVERHEAD) * OSD_BACKENDSTORAGE_IOPS * NUM_OSD /
REPLICA_COUNT

So the question is what OSD_BACKENDSTORAGE_IOPS should stand for? 4K
random or sequential writes IOPS?  
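
Just as a rough sanity check of the expression above, with made-up numbers
(say ~150 4K random-write IOPS per spinning OSD, 16 OSDs, 3 replicas and 30%
overhead for journaling/metadata):

    OVERHEAD=0.30; OSD_IOPS=150; NUM_OSD=16; REPLICAS=3
    echo "(1 - $OVERHEAD) * $OSD_IOPS * $NUM_OSD / $REPLICAS" | bc -l
    # -> roughly 560 client-visible random write IOPS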

-Mykola


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS mds cache pressure

2016-06-28 Thread Mykola Dvornik
I have the same issues with a variety of kernel clients running 4.6.3
and 4.4.12 and fuse clients from 10.2.2.

-Mykola

-Original Message-
From: xiaoxi chen 
To: João Castro , ceph-users@lists.ceph.com 
Subject: Re: [ceph-users] CephFS mds cache pressure
Date: Wed, 29 Jun 2016 01:00:40 +

Hmm, I asked in the ML some days before :) Likely you hit the kernel
bug which was fixed by commit 5e804ac482 "ceph: don't invalidate page cache
when inode is no longer used". This fix is in 4.4 but not in 4.2. I
haven't got a chance to play with 4.4; it would be great if you can
have a try.

For the MDS OOM issue, we did an MDS RSS vs #inodes scaling test; the result
showed around 4 MB per 1000 inodes, so your MDS can likely hold up to
2~3 million inodes. But yes, even with the fix, if a client misbehaves
(opens and holds a lot of inodes, doesn't respond to cache pressure
messages), the MDS can go over the throttling and then be killed by the
OOM killer.


> To: ceph-users@lists.ceph.com
> From: castrofj...@gmail.com
> Date: Tue, 28 Jun 2016 21:34:03 +
> Subject: Re: [ceph-users] CephFS mds cache pressure
> 
> Hey John,
> 
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 4.2.0-36-generic
> 
> Thanks!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados complexity

2016-06-05 Thread Mykola Dvornik
Ok, seems like my problem could be cephfs-related. I have 16 cephfs
clients that do heavy, sub-optimal writes simultaneously. The cluster
has no problems handling the load up until circa 2 kobjects.
Above this threshold the OSDs start to go down randomly and eventually
get killed by ceph's watchdog mechanism. The funny thing is that the
CPU and HDDs are not really overloaded during these events. So I am
really puzzled at this moment.
-Mykola
-Original Message-
From: Sven Höper <l...@mno.pw>
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rados complexity
Date: Sun, 05 Jun 2016 19:18:27 +0200
We've got a simple cluster having 45 OSDs, have above 5 kobjects and did
not have any issues so far. Our cluster does mainly serve some rados pools
for an application which usually writes data once and reads it multiple times.
- Sven
Am Sonntag, den 05.06.2016, 18:47 +0200 schrieb Mykola Dvornik:
> Are there any ceph users with pools containing >2 kobjects?
> 
> If so, have you noticed any instabilities of the clusters once this
> threshold
> is reached?
> 
> -Mykola
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados complexity

2016-06-05 Thread Mykola Dvornik
Are there any ceph users with pools containing >2 kobjects?

If so, have you noticed any instabilities of the clusters once this
threshold is reached?

-Mykola___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS quotas in kernel client

2016-05-23 Thread Mykola Dvornik
Thanks for a quick reply.
On Mon, 2016-05-23 at 20:08 +0800, Yan, Zheng wrote:
> No plan so far.  Current quota design requires client to do
> bottom-to-top path walk, which is unfriendly for kernel client (due
> to
> lock design of kernel).
> 
> On Mon, May 23, 2016 at 4:55 PM, Mykola Dvornik
> <mykola.dvor...@gmail.com> wrote:
> > 
> > Any plans to support quotas in CephFS kernel client?
> > 
> > -Mykola
> > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS quotas in kernel client

2016-05-23 Thread Mykola Dvornik
Any plans to support quotas in CephFS kernel client?

-Mykola___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent help needed for ceph storage "mount error 5 = Input/output error"

2016-02-02 Thread Mykola Dvornik
Try to mount with ceph-fuse. It worked for me when I faced the same 
sort of issues you are now dealing with.
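
Roughly something like this, with the monitor name and keyring path adjusted
to your setup (both are just placeholders here):

    ceph-fuse -m igc-head:6789 --id admin -k /etc/ceph/ceph.client.admin.keyring /mnt/igcfs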


-Mykola


On Tue, Feb 2, 2016 at 8:42 PM, Zhao Xu <xuzh@gmail.com> wrote:
Thank you Mykola. The issue is that I/we strongly suggested adding 
OSDs many times, but we are not the decision makers.
For now, I just want to mount the ceph drive again, even in read-only 
mode, so that they can read the data. Any idea on how to achieve this?


Thanks,
X

On Tue, Feb 2, 2016 at 9:57 AM, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:
I would strongly(!) suggest you add a few more OSDs to the cluster 
before things get worse / corrupted.


-Mykola


On Tue, Feb 2, 2016 at 6:45 PM, Zhao Xu <xuzh@gmail.com> wrote:

Hi All,
  Recently our ceph storage is running at low performance. Today, 
we can not write to the folder. We tried to unmount the ceph 
storage then to re-mount it, however, we can not even mount it now:


# mount -v -t  ceph igc-head,is1,i1,i2,i3:6789:/ /mnt/igcfs/ -o 
name=admin,secretfile=/etc/admin.secret

parsing options: rw,name=admin,secretfile=/etc/admin.secret
mount error 5 = Input/output error

  Previously there were some nearly full OSDs, so we did the "ceph 
osd reweight-by-utilization" to rebalance the usage. The ceph 
health is not ideal but it should still be alive. Please help me to 
mount the disk again.


[root@igc-head ~]# ceph -s
cluster debdcfe9-20d3-404b-921c-2210534454e1
 health HEALTH_WARN
39 pgs degraded
39 pgs stuck degraded
3 pgs stuck inactive
332 pgs stuck unclean
39 pgs stuck undersized
39 pgs undersized
48 requests are blocked > 32 sec
recovery 129755/8053623 objects degraded (1.611%)
recovery 965837/8053623 objects misplaced (11.993%)
mds0: Behind on trimming (455/30)
clock skew detected on mon.i1, mon.i2, mon.i3
 monmap e1: 5 mons at 
{i1=10.1.10.11:6789/0,i2=10.1.10.12:6789/0,i3=10.1.10.13:6789/0,igc-head=10.1.10.1:6789/0,is1=10.1.10.100:6789/0}
election epoch 1314, quorum 0,1,2,3,4 
igc-head,i1,i2,i3,is1

 mdsmap e1602: 1/1/1 up {0=igc-head=up:active}
 osdmap e8007: 17 osds: 17 up, 17 in; 298 remapped pgs
  pgmap v5726326: 1088 pgs, 3 pools, 7442 GB data, 2621 kobjects
8 GB used, 18652 GB / 40881 GB avail
129755/8053623 objects degraded (1.611%)
965837/8053623 objects misplaced (11.993%)
 755 active+clean
 293 active+remapped
  31 active+undersized+degraded
   5 active+undersized+degraded+remapped
   3 undersized+degraded+peered
   1 active+clean+scrubbing

[root@igc-head ~]# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 39.86992 root default
-2 18.14995 host is1
 0  3.62999 osd.0   up  1.0  1.0
 1  3.62999 osd.1   up  1.0  1.0
 2  3.62999 osd.2   up  1.0  1.0
 3  3.62999 osd.3   up  1.0  1.0
 4  3.62999 osd.4   up  1.0  1.0
-3  7.23999 host i1
 5  1.81000 osd.5   up  0.44101  1.0
 6  1.81000 osd.6   up  0.40675  1.0
 7  1.81000 osd.7   up  0.60754  1.0
 8  1.81000 osd.8   up  0.50868  1.0
-4  7.23999 host i2
 9  1.81000 osd.9   up  0.54956  1.0
10  1.81000 osd.10  up  0.44815  1.0
11  1.81000 osd.11  up  0.53262  1.0
12  1.81000 osd.12  up  0.47197  1.0
-5  7.23999 host i3
13  1.81000 osd.13  up  0.7  1.0
14  1.81000 osd.14  up  0.65874  1.0
15  1.81000 osd.15  up  0.49663  1.0
16  1.81000 osd.16  up  0.50136  1.0


Thanks,
X


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS is not maintianing conistency

2016-02-02 Thread Mykola Dvornik

One of my clients is using

4.3.5-300.fc23.x86_64 (Fedora release 23)

while all the other clients reply on

3.10.0-327.4.4.el7.x86_64 (CentOS Linux release 7.2.1511)

Should I file a bug report on the Red Hat bugzilla?

On Tue, Feb 2, 2016 at 8:57 AM, Yan, Zheng <uker...@gmail.com> wrote:
On Tue, Feb 2, 2016 at 2:27 AM, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:

 What version are you running on your servers and clients?



Are you using 4.1 or 4.2 kernel?
https://bugzilla.kernel.org/show_bug.cgi?id=104911. Upgrade to 4.3+
kernel or 4.1.17 kernel or 4.2.8 kernel can resolve this issue.



 On the clients:

 ceph-fuse --version
 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

 MDS/OSD/MON:

 ceph --version
 ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

  Exactly what changes are you making that aren't visible?


 I am creating some new files in non-root folders.

 What's the output of "ceph -s"?


 ceph -s

 cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
  health HEALTH_OK
  monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
 election epoch 1, quorum 0 000-s-ragnarok
  mdsmap e576: 1/1/1 up {0=000-s-ragnarok=up:active}
  osdmap e233: 16 osds: 16 up, 16 in
 flags sortbitwise
   pgmap v1927636: 1088 pgs, 2 pools, 1907 GB data, 2428 kobjects
 3844 GB used, 25949 GB / 29793 GB avail
 1088 active+clean
   client io 4381 B/s wr, 2 op

 In addition on the clients' side I have

 cat /etc/fuse.conf

 user_allow_other
 auto_cache
 large_read
 max_write = 16777216
 max_read = 16777216


 -Mykola


 On Mon, Feb 1, 2016 at 5:06 PM, Gregory Farnum <gfar...@redhat.com> 
wrote:


 On Monday, February 1, 2016, Mykola Dvornik 
<mykola.dvor...@gmail.com>

 wrote:


 Hi guys,

 This is sort of rebuttal.

 I have a CephFS deployed and mounted on a couple of clients via 
ceph-fuse
 (due to quota support and possibility to kill the ceph-fuse 
process to avoid

 stale mounts).

 So the problems is that some times the changes made on one client 
are not
 visible on the others. It appears to me as rather random process. 
The only
 solution is to touch a new file in any particular folder that 
apparently

 triggers synchronization.

 I've been using a kernel-side client before with no such kind of 
problems.

 So the questions is it expected behavior of ceph-fuse?



 What version are you running on your servers and clients? Exactly 
what
 changes are you making that aren't visible? What's the output of 
"ceph -s"?
 We see bugs like this occasionally but I can't think of any recent 
ones in
 ceph-fuse -- they're actually seen a lot more often in the kernel 
client.

 -Greg





 Regards,

 Mykola












 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS is not maintianing conistency

2016-02-02 Thread Mykola Dvornik

No, I have not had any issues with 4.3.x.

On Tue, Feb 2, 2016 at 3:28 PM, Yan, Zheng <uker...@gmail.com> wrote:
On Tue, Feb 2, 2016 at 8:28 PM, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:

 No, I've never seen this issue on the Fedora stock kernels.

 So either my workflow is not triggering it on the Fedora software 
stack or

> the issue is CentOS / RHEL specific.


I mean did you encounter this problem when using ceph-fuse on 4.3.5
kernel ? (fuse mount can also be affected by kernel bug)

Regards
Yan, Zheng



 Anyway I will file the ceph-fuse bug then.

 On Tue, Feb 2, 2016 at 12:43 PM, Yan, Zheng <uker...@gmail.com> 
wrote:


 On Tue, Feb 2, 2016 at 5:32 PM, Mykola Dvornik 
<mykola.dvor...@gmail.com>

 wrote:

 One of my clients is using 4.3.5-300.fc23.x86_64 (Fedora release 23)

 did you encounter this problem on client using 4.3.5 kernel? If you 
did,

 this issue should be ceph-fuse bug.

 while all the other clients reply on 3.10.0-327.4.4.el7.x86_64 
(CentOS Linux

 release 7.2.1511) Should I file report a bug on the RedHat bugzilla?

 you can open a bug at 
http://tracker.ceph.com/projects/cephfs/issues Regards

 Yan, Zheng

 On Tue, Feb 2, 2016 at 8:57 AM, Yan, Zheng <uker...@gmail.com> 
wrote: On
 Tue, Feb 2, 2016 at 2:27 AM, Mykola Dvornik 
<mykola.dvor...@gmail.com>
 wrote: What version are you running on your servers and clients? 
Are you
 using 4.1 or 4.2 kernel? 
https://bugzilla.kernel.org/show_bug.cgi?id=104911.
 Upgrade to 4.3+ kernel or 4.1.17 kernel or 4.2.8 kernel can resolve 
this

 issue. On the clients: ceph-fuse --version ceph version 9.2.0
 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) MDS/OSD/MON: ceph 
--version ceph
 version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) Exactly 
what
 changes are you making that aren't visible? I am creating some new 
files in

 non-root folders. What's the output of "ceph -s"? ceph -s cluster
 98d72518-6619-4b5c-b148-9a781ef13bcb health HEALTH_OK monmap e1: 1 
mons at

 {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0} election epoch 1, quorum 0
 000-s-ragnarok mdsmap e576: 1/1/1 up {0=000-s-ragnarok=up:active} 
osdmap
 e233: 16 osds: 16 up, 16 in flags sortbitwise pgmap v1927636: 1088 
pgs, 2
 pools, 1907 GB data, 2428 kobjects 3844 GB used, 25949 GB / 29793 
GB avail
 1088 active+clean client io 4381 B/s wr, 2 op In addition on the 
clients'
 side I have cat /etc/fuse.conf user_allow_other auto_cache 
large_read
 max_write = 16777216 max_read = 16777216 -Mykola On Mon, Feb 1, 
2016 at 5:06
 PM, Gregory Farnum <gfar...@redhat.com> wrote: On Monday, February 
1, 2016,
 Mykola Dvornik <mykola.dvor...@gmail.com> wrote: Hi guys, This is 
sort of
 rebuttal. I have a CephFS deployed and mounted on a couple of 
clients via
 ceph-fuse (due to quota support and possibility to kill the 
ceph-fuse
 process to avoid stale mounts). So the problems is that some times 
the
 changes made on one client are not visible on the others. It 
appears to me
 as rather random process. The only solution is to touch a new file 
in any
 particular folder that apparently triggers synchronization. I've 
been using
 a kernel-side client before with no such kind of problems. So the 
questions
 is it expected behavior of ceph-fuse? What version are you running 
on your

 servers and clients? Exactly what changes are you making that aren't
 visible? What's the output of "ceph -s"? We see bugs like this 
occasionally
 but I can't think of any recent ones in ceph-fuse -- they're 
actually seen a

 lot more often in the kernel client. -Greg Regards, Mykola
 ___ ceph-users mailing 
list

 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent help needed for ceph storage "mount error 5 = Input/output error"

2016-02-02 Thread Mykola Dvornik
I would strongly(!) suggest you add a few more OSDs to the cluster before 
things get worse / corrupted.


-Mykola


On Tue, Feb 2, 2016 at 6:45 PM, Zhao Xu  wrote:

Hi All,
  Recently our ceph storage is running at low performance. Today, we 
can not write to the folder. We tried to unmount the ceph storage 
then to re-mount it, however, we can not even mount it now:


# mount -v -t  ceph igc-head,is1,i1,i2,i3:6789:/ /mnt/igcfs/ -o 
name=admin,secretfile=/etc/admin.secret

parsing options: rw,name=admin,secretfile=/etc/admin.secret
mount error 5 = Input/output error

  Previously there were some nearly full OSDs, so we did the "ceph osd 
reweight-by-utilization" to rebalance the usage. The ceph health is 
not ideal but it should still be alive. Please help me to mount the disk 
again.


[root@igc-head ~]# ceph -s
cluster debdcfe9-20d3-404b-921c-2210534454e1
 health HEALTH_WARN
39 pgs degraded
39 pgs stuck degraded
3 pgs stuck inactive
332 pgs stuck unclean
39 pgs stuck undersized
39 pgs undersized
48 requests are blocked > 32 sec
recovery 129755/8053623 objects degraded (1.611%)
recovery 965837/8053623 objects misplaced (11.993%)
mds0: Behind on trimming (455/30)
clock skew detected on mon.i1, mon.i2, mon.i3
 monmap e1: 5 mons at 
{i1=10.1.10.11:6789/0,i2=10.1.10.12:6789/0,i3=10.1.10.13:6789/0,igc-head=10.1.10.1:6789/0,is1=10.1.10.100:6789/0}
election epoch 1314, quorum 0,1,2,3,4 
igc-head,i1,i2,i3,is1

 mdsmap e1602: 1/1/1 up {0=igc-head=up:active}
 osdmap e8007: 17 osds: 17 up, 17 in; 298 remapped pgs
  pgmap v5726326: 1088 pgs, 3 pools, 7442 GB data, 2621 kobjects
8 GB used, 18652 GB / 40881 GB avail
129755/8053623 objects degraded (1.611%)
965837/8053623 objects misplaced (11.993%)
 755 active+clean
 293 active+remapped
  31 active+undersized+degraded
   5 active+undersized+degraded+remapped
   3 undersized+degraded+peered
   1 active+clean+scrubbing

[root@igc-head ~]# ceph osd tree
ID WEIGHT   TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 39.86992 root default
-2 18.14995 host is1
 0  3.62999 osd.0   up  1.0  1.0
 1  3.62999 osd.1   up  1.0  1.0
 2  3.62999 osd.2   up  1.0  1.0
 3  3.62999 osd.3   up  1.0  1.0
 4  3.62999 osd.4   up  1.0  1.0
-3  7.23999 host i1
 5  1.81000 osd.5   up  0.44101  1.0
 6  1.81000 osd.6   up  0.40675  1.0
 7  1.81000 osd.7   up  0.60754  1.0
 8  1.81000 osd.8   up  0.50868  1.0
-4  7.23999 host i2
 9  1.81000 osd.9   up  0.54956  1.0
10  1.81000 osd.10  up  0.44815  1.0
11  1.81000 osd.11  up  0.53262  1.0
12  1.81000 osd.12  up  0.47197  1.0
-5  7.23999 host i3
13  1.81000 osd.13  up  0.7  1.0
14  1.81000 osd.14  up  0.65874  1.0
15  1.81000 osd.15  up  0.49663  1.0
16  1.81000 osd.16  up  0.50136  1.0


Thanks,
X
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS is not maintianing conistency

2016-02-02 Thread Mykola Dvornik

No, I've never seen this issue on the Fedora stock kernels.

So either my workflow is not triggering it on the Fedora software stack 
or the issue is CentOS / RHEL specific.


Anyway I will file the ceph-fuse bug then.

On Tue, Feb 2, 2016 at 12:43 PM, Yan, Zheng <uker...@gmail.com> wrote:
On Tue, Feb 2, 2016 at 5:32 PM, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:

 One of my clients is using

 4.3.5-300.fc23.x86_64 (Fedora release 23)


did you encounter this problem on client using 4.3.5 kernel? If you
did, this issue should be ceph-fuse bug.



 while all the other clients reply on

 3.10.0-327.4.4.el7.x86_64 (CentOS Linux release 7.2.1511)

 Should I file a bug report on the Red Hat bugzilla?


you can open a bug at http://tracker.ceph.com/projects/cephfs/issues

Regards
Yan, Zheng



 On Tue, Feb 2, 2016 at 8:57 AM, Yan, Zheng <uker...@gmail.com> 
wrote:


 On Tue, Feb 2, 2016 at 2:27 AM, Mykola Dvornik 
<mykola.dvor...@gmail.com>

 wrote:

 What version are you running on your servers and clients?

 Are you using 4.1 or 4.2 kernel?
 https://bugzilla.kernel.org/show_bug.cgi?id=104911. Upgrade to 4.3+ 
kernel

 or 4.1.17 kernel or 4.2.8 kernel can resolve this issue.

 On the clients: ceph-fuse --version ceph version 9.2.0
 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) MDS/OSD/MON: ceph 
--version ceph
 version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) Exactly 
what
 changes are you making that aren't visible? I am creating some new 
files in

 non-root folders. What's the output of "ceph -s"? ceph -s cluster
 98d72518-6619-4b5c-b148-9a781ef13bcb health HEALTH_OK monmap e1: 1 
mons at

 {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0} election epoch 1, quorum 0
 000-s-ragnarok mdsmap e576: 1/1/1 up {0=000-s-ragnarok=up:active} 
osdmap
 e233: 16 osds: 16 up, 16 in flags sortbitwise pgmap v1927636: 1088 
pgs, 2
 pools, 1907 GB data, 2428 kobjects 3844 GB used, 25949 GB / 29793 
GB avail
 1088 active+clean client io 4381 B/s wr, 2 op In addition on the 
clients'
 side I have cat /etc/fuse.conf user_allow_other auto_cache 
large_read
 max_write = 16777216 max_read = 16777216 -Mykola On Mon, Feb 1, 
2016 at 5:06
 PM, Gregory Farnum <gfar...@redhat.com> wrote: On Monday, February 
1, 2016,

 Mykola Dvornik <mykola.dvor...@gmail.com> wrote:

 Hi guys, This is sort of rebuttal. I have a CephFS deployed and 
mounted on a
 couple of clients via ceph-fuse (due to quota support and 
possibility to
 kill the ceph-fuse process to avoid stale mounts). So the problems 
is that
 some times the changes made on one client are not visible on the 
others. It
 appears to me as rather random process. The only solution is to 
touch a new
 file in any particular folder that apparently triggers 
synchronization. I've
 been using a kernel-side client before with no such kind of 
problems. So the

 questions is it expected behavior of ceph-fuse?

 What version are you running on your servers and clients? Exactly 
what
 changes are you making that aren't visible? What's the output of 
"ceph -s"?
 We see bugs like this occasionally but I can't think of any recent 
ones in
 ceph-fuse -- they're actually seen a lot more often in the kernel 
client.

 -Greg

 Regards, Mykola

 ___ ceph-users mailing 
list

 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS is not maintianing conistency

2016-02-01 Thread Mykola Dvornik

Hi guys,

This is sort of rebuttal.

I have a CephFS deployed and mounted on a couple of clients via 
ceph-fuse (due to quota support and possibility to kill the ceph-fuse 
process to avoid stale mounts).


So the problem is that sometimes the changes made on one client are 
not visible on the others. It appears to me as a rather random process. 
The only solution is to touch a new file in any particular folder, which 
apparently triggers synchronization.


I've been using a kernel-side client before with no such kind of 
problems. So the question is: is this expected behavior of ceph-fuse?


Regards,

Mykola











___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS is not maintianing conistency

2016-02-01 Thread Mykola Dvornik

What version are you running on your servers and clients?


On the clients:

ceph-fuse --version
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

MDS/OSD/MON:

ceph --version
ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)


 Exactly what changes are you making that aren't visible?


I am creating some new files in non-root folders.


What's the output of "ceph -s"?


ceph -s

   cluster 98d72518-6619-4b5c-b148-9a781ef13bcb
health HEALTH_OK
monmap e1: 1 mons at {000-s-ragnarok=XXX.XXX.XXX.XXX:6789/0}
   election epoch 1, quorum 0 000-s-ragnarok
mdsmap e576: 1/1/1 up {0=000-s-ragnarok=up:active}
osdmap e233: 16 osds: 16 up, 16 in
   flags sortbitwise
 pgmap v1927636: 1088 pgs, 2 pools, 1907 GB data, 2428 kobjects
   3844 GB used, 25949 GB / 29793 GB avail
   1088 active+clean
 client io 4381 B/s wr, 2 op

In addition on the clients' side I have

cat /etc/fuse.conf

user_allow_other
auto_cache
large_read
max_write = 16777216
max_read = 16777216


-Mykola


On Mon, Feb 1, 2016 at 5:06 PM, Gregory Farnum <gfar...@redhat.com> 
wrote:
On Monday, February 1, 2016, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:

Hi guys,

This is sort of rebuttal.

I have a CephFS deployed and mounted on a couple of clients via 
ceph-fuse (due to quota support and possibility to kill the 
ceph-fuse process to avoid stale mounts).


So the problems is that some times the changes made on one client 
are not visible on the others. It appears to me as rather random 
process. The only solution is to touch a new file in any particular 
folder that apparently triggers synchronization.


I've been using a kernel-side client before with no such kind of 
problems. So the questions is it expected behavior of ceph-fuse?


What version are you running on your servers and clients? Exactly 
what changes are you making that aren't visible? What's the output of 
"ceph -s"?
We see bugs like this occasionally but I can't think of any recent 
ones in ceph-fuse -- they're actually seen a lot more often in the 
kernel client.

-Greg




Regards,

Mykola










___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-fuse inconsistent filesystem view from different clients

2015-12-30 Thread Mykola Dvornik

Dear Yan,

Thanks for your prompt reply.


what is the symptom of "out-of-sync"?


The folder does not receive any updates for days, i.e. I don't see any 
new subfolders and files that are present on (actively writing) 
clients. It happened on two clients so far, each running different 
versions of ceph-fuse and OS/userspace. One of the clients is on the 
same switch as the MDS. It appears to me that both clients stopped 
receiving updates virtually simultaneously, i.e. they have the same 
(out-of-sync) view on the folder.



which version of ceph-mds/ceph-fuse are you using?


MDS: ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)  / 
CentOS Linux release 7.2.1511


client1: ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299) 
/ CentOS Linux release 7.2.1511


client2: ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43) 
/ Fedora release 23 (Twenty Three)



you can enable client debug with the "--debug_client=20" option


Thanks. I've already remounted the clients, but once the issue is back 
I will do some debugging.
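
I guess that will be something along these lines (mount point and log path
are placeholders):

    ceph-fuse -m mon-host:6789 /mnt/cephfs --debug_client=20 --log-file=/var/log/ceph/ceph-fuse.debug.log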


And last but not least, writing a file to the folder, i.e. touch test, 
triggers synchronization.


Kind regards,

Mykola


On Wed, Dec 30, 2015 at 7:49 AM, Yan, Zheng <uker...@gmail.com> wrote:

On Wed, Dec 30, 2015 at 5:59 AM, Mykola Dvornik
<mykola.dvor...@gmail.com> wrote:

 Hi guys,

 I have 16 OSD/1MON/1MDS ceph cluster serving CephFS.

 The FS is mounted on 11 clients using ceph-fuse. In some cases 
there are
 multiple ceph-fuse processes per client, each with its own '-r' 
option.


 The problem is that some of the clients get significantly 
out-of-sync.

 Flushing caches, 'touching' things, etc. does not help.


what is the symptom of "out-of-sync"?



 ceph -s reports 'Client XXX failing to respond to cache pressure'.

 Although I've increased "mds_cache_size" to "100", perf dump 
mds still
 reports inodes exceeding the inode_max. I cannot confirm yet, but 
it appears
 to me that the out-of-sync issue started to appear since the 
mds_cache_size

 increase.


which version of ceph-mds/ceph-fuse are you using?



 mds log does not have anything suspicious in it.

 So is there any way to debug ceph-fuse?


you enable client debug by "--debug_client=20' option, you




 Regards,

 --
  Mykola

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-fuse inconsistent filesystem view from different clients

2015-12-29 Thread Mykola Dvornik
Hi guys,

I have 16 OSD/1MON/1MDS ceph cluster serving CephFS.

The FS is mounted on 11 clients using ceph-fuse. In some cases there are
multiple ceph-fuse processes per client, each with its own '-r' option.

The problem is that some of the clients get significantly out-of-sync.
Flushing caches, 'touching' things, etc. does not help.

ceph -s reports 'Client XXX failing to respond to cache pressure'.

Although I've increased "mds_cache_size" to "100", perf dump mds still
reports inodes exceeding the inode_max. I cannot confirm yet, but it
appears to me that the out-of-sync issue started to appear since the
mds_cache_size increase.

mds log does not have anything suspicious in it.

So is there any way to debug ceph-fuse?

Regards,

-- 
 Mykola
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CentOS 7.2, Infernalis, preparing osd's and partprobe issues.

2015-12-15 Thread Mykola Dvornik
I had more or less the same problem. This is most likely a synchronization
issue. I have been deploying 16 OSDs, each running exactly the same
hardware/software. The issue appeared randomly with no obvious correlation
with other stuff. The dirty workaround was to put time.sleep(5) before
invoking partprobe.
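
The command-line equivalent of that workaround, if you would rather not patch
ceph-disk itself, is roughly to let udev settle and wait a bit before
re-reading the partition table (the device name is just an example):

    udevadm settle --timeout=10
    sleep 5
    partprobe /dev/sdr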



On 16 December 2015 at 07:17, Matt Taylor  wrote:

> Hi all,
>
> After recently upgrading to CentOS 7.2 and installing a new Ceph cluster
> using Infernalis v9.2.0, I have noticed that disk's are failing to prepare.
>
> I have observed the same behaviour over multiple Ceph servers when
> preparing disk's. All the servers are identical.
>
> Disk's are zapping fine, however when running 'ceph-deploy disk prepare',
> we're encountering the following error:
>
> [ceph_deploy.cli][INFO ] Invoked (1.5.30): /usr/bin/ceph-deploy disk
>> prepare kvsrv02:/dev/sdr
>> [ceph_deploy.cli][INFO ] ceph-deploy options:
>> [ceph_deploy.cli][INFO ] username : None
>> [ceph_deploy.cli][INFO ] disk : [('kvsrv02', '/dev/sdr', None)]
>> [ceph_deploy.cli][INFO ] dmcrypt : False
>> [ceph_deploy.cli][INFO ] verbose : False
>> [ceph_deploy.cli][INFO ] overwrite_conf : False
>> [ceph_deploy.cli][INFO ] subcommand : prepare
>> [ceph_deploy.cli][INFO ] dmcrypt_key_dir : /etc/ceph/dmcrypt-keys
>> [ceph_deploy.cli][INFO ] quiet : False
>> [ceph_deploy.cli][INFO ] cd_conf : > instance at 0x7f1d54a4a7a0>
>> [ceph_deploy.cli][INFO ] cluster : ceph
>> [ceph_deploy.cli][INFO ] fs_type : xfs
>> [ceph_deploy.cli][INFO ] func : 
>> [ceph_deploy.cli][INFO ] ceph_conf : None
>> [ceph_deploy.cli][INFO ] default_release : False
>> [ceph_deploy.cli][INFO ] zap_disk : False
>> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks kvsrv02:/dev/sdr:
>> [kvsrv02][DEBUG ] connection detected need for sudo
>> [kvsrv02][DEBUG ] connected to host: kvsrv02
>> [kvsrv02][DEBUG ] detect platform information from remote host
>> [kvsrv02][DEBUG ] detect machine type
>> [kvsrv02][DEBUG ] find the location of an executable
>> [ceph_deploy.osd][INFO ] Distro info: CentOS Linux 7.2.1511 Core
>> [ceph_deploy.osd][DEBUG ] Deploying osd to kvsrv02
>> [kvsrv02][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>> [ceph_deploy.osd][DEBUG ] Preparing host kvsrv02 disk /dev/sdr journal
>> None activate False
>> [kvsrv02][INFO ] Running command: sudo ceph-disk -v prepare --cluster
>> ceph --fs-type xfs -- /dev/sdr
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-allows-journal -i 0 --cluster ceph
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-wants-journal -i 0 --cluster ceph
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-needs-journal -i 0 --cluster ceph
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --cluster=ceph --show-config-value=fsid
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_mkfs_options_xfs
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_xfs
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_mount_options_xfs
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_fs_mount_options_xfs
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --cluster=ceph --show-config-value=osd_journal_size
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_dmcrypt_type
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] INFO:ceph-disk:Will colocate journal with data on
>> /dev/sdr
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdr uuid path is
>> /sys/dev/block/65:16/dm/uuid
>> [kvsrv02][WARNIN] DEBUG:ceph-disk:Creating journal partition num 2 size
>> 5120 on /dev/sdr
>> [kvsrv02][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk
>> --new=2:0:5120M --change-name=2:ceph journal
>> --partition-guid=2:7058473f-5c4a-4566-9a11-95cae71e5086
>> --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdr
>> 

[ceph-users] CephFS: number of PGs for metadata pool

2015-12-09 Thread Mykola Dvornik

Hi guys,

I am creating a 4-node/16OSD/32TB CephFS from scratch.

According to the Ceph documentation, the metadata pool should have a small 
number of PGs, since it holds a negligible amount of data compared to the 
data pool. This makes me feel it might not be safe.


So I was wondering: how should I choose the number of PGs for the metadata 
pool to maintain its performance and reliability?
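
For reference, the rule of thumb I am referring to is roughly 
total_pgs ~= (num_osds * 100) / replica_count, rounded to a power of two 
and then split across pools according to their expected share of the data. 
As a sketch only (the pool name and numbers below are placeholders):

ceph osd pool create cephfs_metadata 64 64
# pg_num can be raised later (never lowered), e.g.:
ceph osd pool set cephfs_metadata pg_num 128
ceph osd pool set cephfs_metadata pgp_num 128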


Regards,

Mykola


Re: [ceph-users] CephFS: number of PGs for metadata pool

2015-12-09 Thread Mykola Dvornik

Good point. Thanks!

A triple failure is essentially what I faced about a month ago. So now I 
want to make sure that the new cephfs setup I am deploying at the moment 
will handle this kind of thing better.


On Wed, Dec 9, 2015 at 2:41 PM, John Spray <jsp...@redhat.com> wrote:
On Wed, Dec 9, 2015 at 1:25 PM, Mykola Dvornik 
<mykola.dvor...@gmail.com> wrote:

 Hi Jan,

 Thanks for the reply. I see your point about replicas. However, my 
 motivation was a bit different.

 Consider some given amount of objects stored in the metadata pool. If I 
 understood Ceph's data placement approach correctly, the number of 
 objects per PG should decrease as the number of PGs per pool grows.

 So my concern is that, in the catastrophic event of some PG(s) being 
 lost, I will lose more objects if the number of PGs per pool is small. 
 At the same time I don't want too few objects per PG, to keep things 
 disk-IO rather than CPU bound.


If you are especially concerned about triple-failures (i.e. permanent
PG loss), I would suggest you look at doing things like a size=4 pool
for your metadata (maybe on SSDs).

You could also look at simply segregating your size=3 metadata onto
separate spinning drives, so that these comparatively lightly loaded OSDs
will be able to recover faster in the event of a failure than an ordinary
data drive that's full of terabytes of data, and so have a lower
probability of a triple failure.
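
A minimal sketch of the pool-level part of that suggestion, assuming the
metadata pool is named cephfs_metadata (the SSD / separate-spindle
placement would additionally need a matching CRUSH rule, omitted here):

ceph osd pool set cephfs_metadata size 4
ceph osd pool set cephfs_metadata min_size 2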

John


Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2015-12-08 Thread Mykola Dvornik
The same thing happens to my setup with CentOS 7.x + a non-stock kernel 
(kernel-ml from elrepo).


I was not happy with the IOPS I got out of the stock CentOS 7.x kernel, so 
I did the kernel upgrade, and crashes started to happen until some of the 
OSDs would not boot at all. The funny thing is that I was not able to 
downgrade back to stock, since the OSDs were then crashing with 'cannot 
decode' errors. I am doing a backup at the moment and OSDs still crash 
from time to time due to the ceph watchdog, despite the 20x-increased 
timeouts.


I believe the version of kernel-ml I started with was 3.19.


On Tue, Dec 8, 2015 at 10:34 AM, Tom Christensen  
wrote:
We didn't go forward to 4.2 as it's a large production cluster, and we 
just needed the problem fixed.  We'll probably test out 4.2 in the 
next couple of months, but this one slipped past us as it didn't occur 
in our test cluster until after we had upgraded production.  In our 
experience it takes about 2 weeks to start happening, but once it 
does it's all hands on deck, because nodes are going to go down regularly.


All that being said, if/when we try 4.2 it's going to need to run 
rock solid for 1-2 months in our test cluster before it gets to 
production.


On Tue, Dec 8, 2015 at 2:30 AM, Benedikt Fraunhofer 
 wrote:

Hi Tom,

> We have been seeing this same behavior on a cluster that has been perfectly
> happy until we upgraded to the ubuntu vivid 3.19 kernel.  We are in the

I can't recall when we gave 3.19 a shot, but now that you say it... the
cluster was happy for >9 months with 3.16.
Did you try 4.2, or do you think the regression introduced somewhere
between 3.16 and 3.19 is still in 4.2?

Thx!
   Benedikt




Re: [ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-19 Thread Mykola Dvornik
Dear Yan,

Thanks for your reply.

The problem is that the backup I made was done after the data corruption
(but before any manipulations with the journal). Since the FS cannot be
mounted via the in-kernel client, I tend to believe that cephfs_metadata
corruption is the cause.

Since I do have read-only access to the filesystem via ceph-fuse, I would
rather prefer to repair it using the cephfs-data-scan tool.

I did an 'rsync --dry-run' of the whole FS and the MDS complained about a
few missing objects. I am not really sure this is a reliable way to
identify corrupted files (see the sketch below), but if it is, the damage
is marginal.
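
A heavier check that actually reads every file and logs I/O errors, as a
sketch only and assuming the FS is mounted read-only at the placeholder
path /mnt/cephfs:

find /mnt/cephfs -type f -exec md5sum {} + > /dev/null 2> /tmp/cephfs-read-errors.log

Anything the MDS/OSDs cannot serve should then show up in the error log.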

So the question is: is cephfs-data-scan designed to resolve problems with
duplicated inodes?


On 19 November 2015 at 04:17, Yan, Zheng <uker...@gmail.com> wrote:

> On Wed, Nov 18, 2015 at 5:21 PM, Mykola Dvornik <mykola.dvor...@gmail.com>
> wrote:
>
>> Hi John,
>>
>> It turned out that mds triggers an assertion
>>
>> *mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)*
>>
>> on any attempt to write data to the filesystem mounted via fuse.
>>
>> Deleting data is still OK.
>>
>> I cannot really follow why duplicated inodes appear.
>>
>> Are there any ways to flush/reset the MDS cache?
>>
>>
>>
> this may be caused by session/journal reset. Could you try restoring the
> backup of your metadata pool?
>
> Yan, Zheng
>
>
>
>
>>
>> On 17 November 2015 at 13:26, John Spray <jsp...@redhat.com> wrote:
>>
>>> On Tue, Nov 17, 2015 at 12:17 PM, Mykola Dvornik
>>> <mykola.dvor...@gmail.com> wrote:
>>> > Dear John,
>>> >
>>> > Thanks for such a prompt reply!
>>> >
>>> > Seems like something happens on the mon side, since there are no
>>> > mount-specific requests logged on the mds side (see below).
>>> > FYI, some hours ago I've disabled auth completely, but it didn't help.
>>> >
>>> > The serialized metadata pool is 9.7G. I can try to compress it with
>>> 7z, then
>>> > setup rssh account for you to scp/rsync it.
>>> >
>>> > debug mds = 20
>>> > debug mon = 20
>>>
>>> Don't worry about the mon logs.  That MDS log snippet appears to be
>>> from several minutes earlier than the client's attempt to mount.
>>>
>>> In these cases it's generally simpler if you truncate all the logs,
>>> then attempt the mount, then send all the logs in full rather than
>>> snippets, so that we can be sure nothing is missing.
>>>
>>> Please also get the client log (use the fuse client with
>>> --debug-client=20).
>>>
>>> John
>>>
>>
>>
>>
>> --
>>  Mykola
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>


-- 
 Mykola


Re: [ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-19 Thread Mykola Dvornik
I'm guessing in this context that "write data" possibly means creating
a file (as opposed to writing to an existing file).

Indeed. Sorry for the confusion.

You've pretty much hit the limits of what the disaster recovery tools
are currently capable of.  What I'd recommend you do at this stage is
mount your filesystem read-only, back it up, and then create a new
filesystem and restore from backup.

Ok. Is it somehow possible to have multiple FSs on the same ceph cluster?
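
As a minimal sketch of the read-only-mount-and-backup path, with
/mnt/cephfs and /backup/cephfs as placeholder paths and assuming the FUSE
client accepts the standard read-only mount option:

ceph-fuse -o ro /mnt/cephfs
rsync -aHAX /mnt/cephfs/ /backup/cephfs/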


On 19 November 2015 at 10:43, John Spray <jsp...@redhat.com> wrote:

> On Wed, Nov 18, 2015 at 9:21 AM, Mykola Dvornik
> <mykola.dvor...@gmail.com> wrote:
> > Hi John,
> >
> > It turned out that mds triggers an assertion
> >
> > mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)
> >
> > on any attempt to write data to the filesystem mounted via fuse.
>
> I'm guessing in this context that "write data" possibly means creating
> a file (as opposed to writing to an existing file).
>
> Currently, cephfs-data-scan injects inodes well enough that you can
> read them, but it's not updating the inode table to reflect that the
> recovered inodes are in use.  As a result, when new files are created
> they are probably trying to take inode numbers that are already in use
> (by the recovered files), and as a result you're hitting this
> assertion.  The ticket for updating the inotable after injection of
> recovered inodes is http://tracker.ceph.com/issues/12131
>
> > Deleting data is still OK.
> >
> > I cannot really follow why duplicated inodes appear.
> >
> > Are there any ways to flush/reset the MDS cache?
>
> You've pretty much hit the limits of what the disaster recovery tools
> are currently capable of.  What I'd recommend you do at this stage is
> mount your filesystem read-only, back it up, and then create a new
> filesystem and restore from backup.
>
> I'm writing a patch to handle the particular case where someone needs
> to update their inode table to mark all inodes as used up to some
> maximum, but the chances are that after that you'll still run into
> some other issue, until we've finished the tools to make it all the
> way through this path.
>
> John
>
> >
> >
> >
> > On 17 November 2015 at 13:26, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Tue, Nov 17, 2015 at 12:17 PM, Mykola Dvornik
> >> <mykola.dvor...@gmail.com> wrote:
> >> > Dear John,
> >> >
> >> > Thanks for such a prompt reply!
> >> >
> >> > Seems like something happens on the mon side, since there are no
> >> > mount-specific requests logged on the mds side (see below).
> >> > FYI, some hours ago I've disabled auth completely, but it didn't help.
> >> >
> >> > The serialized metadata pool is 9.7G. I can try to compress it with
> 7z,
> >> > then
> >> > setup rssh account for you to scp/rsync it.
> >> >
> >> > debug mds = 20
> >> > debug mon = 20
> >>
> >> Don't worry about the mon logs.  That MDS log snippet appears to be
> >> from several minutes earlier than the client's attempt to mount.
> >>
> >> In these cases it's generally simpler if you truncate all the logs,
> >> then attempt the mount, then send all the logs in full rather than
> >> snippets, so that we can be sure nothing is missing.
> >>
> >> Please also get the client log (use the fuse client with
> >> --debug-client=20).
> >>
> >> John
> >
> >
> >
> >
> > --
> >  Mykola
>



-- 
 Mykola


Re: [ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-19 Thread Mykola Dvornik
Thanks for the tip.

I will stay on the safe side and wait until it is merged into master.

Many thanks for all your help.

-Mykola

On 19 November 2015 at 11:10, John Spray <jsp...@redhat.com> wrote:

> On Thu, Nov 19, 2015 at 10:07 AM, Mykola Dvornik
> <mykola.dvor...@gmail.com> wrote:
> > I'm guessing in this context that "write data" possibly means creating
> > a file (as opposed to writing to an existing file).
> >
> > Indeed. Sorry for the confusion.
> >
> > You've pretty much hit the limits of what the disaster recovery tools
> > are currently capable of.  What I'd recommend you do at this stage is
> > mount your filesystem read-only, back it up, and then create a new
> > filesystem and restore from backup.
> >
> > Ok. Is it somehow possible to have multiple FSs on the same ceph cluster?
>
> No, we want to do this but it's not there yet.  Your scenario is one
> of the motivations :-)
>
> (for the record multi-fs branch is
> https://github.com/jcsp/ceph/commits/wip-multi-filesystems, which
> works, but we'll probably go back and re-do the mon side of it before
> finishing)
>
> John
>
> >
> >
> > On 19 November 2015 at 10:43, John Spray <jsp...@redhat.com> wrote:
> >>
> >> On Wed, Nov 18, 2015 at 9:21 AM, Mykola Dvornik
> >> <mykola.dvor...@gmail.com> wrote:
> >> > Hi John,
> >> >
> >> > It turned out that mds triggers an assertion
> >> >
> >> > mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)
> >> >
> >> > on any attempt to write data to the filesystem mounted via fuse.
> >>
> >> I'm guessing in this context that "write data" possibly means creating
> >> a file (as opposed to writing to an existing file).
> >>
> >> Currently, cephfs-data-scan injects inodes well enough that you can
> >> read them, but it's not updating the inode table to reflect that the
> >> recovered inodes are in use.  As a result, when new files are created
> >> they are probably trying to take inode numbers that are already in use
> >> (by the recovered files), and as a result you're hitting this
> >> assertion.  The ticket for updating the inotable after injection of
> >> recovered inodes is http://tracker.ceph.com/issues/12131
> >>
> >> > Deleting data is still OK.
> >> >
> >> > I cannot really follow why duplicated inodes appear.
> >> >
> >> > Are there any ways to flush/reset the MDS cache?
> >>
> >> You've pretty much hit the limits of what the disaster recovery tools
> >> are currently capable of.  What I'd recommend you do at this stage is
> >> mount your filesystem read-only, back it up, and then create a new
> >> filesystem and restore from backup.
> >>
> >> I'm writing a patch to handle the particular case where someone needs
> >> to update their inode table to mark all inodes as used up to some
> >> maximum, but the chances are that after that you'll still run into
> >> some other issue, until we've finished the tools to make it all the
> >> way through this path.
> >>
> >> John
> >>
> >> >
> >> >
> >> >
> >> > On 17 November 2015 at 13:26, John Spray <jsp...@redhat.com> wrote:
> >> >>
> >> >> On Tue, Nov 17, 2015 at 12:17 PM, Mykola Dvornik
> >> >> <mykola.dvor...@gmail.com> wrote:
> >> >> > Dear John,
> >> >> >
> >> >> > Thanks for such a prompt reply!
> >> >> >
> >> >> > Seems like something happens on the mon side, since there are no
> >> >> > mount-specific requests logged on the mds side (see below).
> >> >> > FYI, some hours ago I've disabled auth completely, but it didn't
> >> >> > help.
> >> >> >
> >> >> > The serialized metadata pool is 9.7G. I can try to compress it with
> >> >> > 7z,
> >> >> > then
> >> >> > setup rssh account for you to scp/rsync it.
> >> >> >
> >> >> > debug mds = 20
> >> >> > debug mon = 20
> >> >>
> >> >> Don't worry about the mon logs.  That MDS log snippet appears to be
> >> >> from several minutes earlier than the client's attempt to mount.
> >> >>
> >> >> In these cases it's generally simpler if you truncate all the logs,
> >> >> then attempt the mount, then send all the logs in full rather than
> >> >> snippets, so that we can be sure nothing is missing.
> >> >>
> >> >> Please also get the client log (use the fuse client with
> >> >> --debug-client=20).
> >> >>
> >> >> John
> >> >
> >> >
> >> >
> >> >
> >> > --
> >> >  Mykola
> >
> >
> >
> >
> > --
> >  Mykola
>



-- 
 Mykola


Re: [ceph-users] ceph osd prepare cmd on infernalis 9.2.0

2015-11-19 Thread Mykola Dvornik
*'Could not create partition 2 from 10485761 to 10485760'.*

Perhaps try to zap the disks first?
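
As a sketch, reusing the host/device names from the log below; either of
these should clear the old partition table before re-running prepare:

ceph-deploy disk zap cibn05:/dev/sdf
# or directly on the node:
sgdisk --zap-all --clear /dev/sdf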

On 19 November 2015 at 16:22, German Anders  wrote:

> Hi cephers,
>
> I had some issues while running the prepare osd command:
>
> ceph version: infernalis 9.2.0
>
> disk: /dev/sdf (745.2G)
>   /dev/sdf1 740.2G
>   /dev/sdf2 5G
>
> # parted /dev/sdf
> GNU Parted 2.3
> Using /dev/sdf
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) print
> Model: ATA INTEL SSDSC2BB80 (scsi)
> Disk /dev/sdf: 800GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End SizeFile system  Name  Flags
>  2  1049kB  5369MB  5368MB   ceph journal
>  1  5370MB  800GB   795GB   btrfsceph data
>
>
> cibn05:
>
>
> $ ceph-deploy osd prepare --fs-type btrfs cibn05:sdf
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/ceph/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy osd
> prepare --fs-type btrfs cibn05:sdf
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  disk  : [('cibn05',
> '/dev/sdf', None)]
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: prepare
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  fs_type   : btrfs
> [ceph_deploy.cli][INFO  ]  func  :  at 0x7fbb1e1d9050>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibn05:/dev/sdf:
> [cibn05][DEBUG ] connection detected need for sudo
> [cibn05][DEBUG ] connected to host: cibn05
> [cibn05][DEBUG ] detect platform information from remote host
> [cibn05][DEBUG ] detect machine type
> [cibn05][DEBUG ] find the location of an executable
> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
> [ceph_deploy.osd][DEBUG ] Deploying osd to cibn05
> [cibn05][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [cibn05][INFO  ] Running command: sudo udevadm trigger
> --subsystem-match=block --action=add
> [ceph_deploy.osd][DEBUG ] Preparing host cibn05 disk /dev/sdf journal None
> activate False
> [cibn05][INFO  ] Running command: sudo ceph-disk -v prepare --cluster ceph
> --fs-type btrfs -- /dev/sdf
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-allows-journal -i 0 --cluster ceph
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-wants-journal -i 0 --cluster ceph
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-needs-journal -i 0 --cluster ceph
> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
> /sys/dev/block/8:80/dm/uuid
> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
> /sys/dev/block/8:80/dm/uuid
> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
> /sys/dev/block/8:80/dm/uuid
> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
> /sys/dev/block/8:81/dm/uuid
> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf2 uuid path is
> /sys/dev/block/8:82/dm/uuid
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=fsid
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=osd_journal_size
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup 

Re: [ceph-users] Can't activate osd in infernalis

2015-11-19 Thread Mykola Dvornik
cat /etc/udev/rules.d/89-ceph-journal.rules

KERNEL=="sdd?", SUBSYSTEM=="block", OWNER="ceph", GROUP="disk", MODE="0660"
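
To apply a rule like this without a reboot, the standard udev reload
commands should do:

udevadm control --reload-rules
udevadm trigger --subsystem-match=block --action=add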

On 19 November 2015 at 13:54, Mykola  wrote:

> I am afraid one would need an udev rule to make it persistent.
>
>
>
> Sent from Outlook Mail 
> for Windows 10 phone
>
>
>
>
> *From: *David Riedl 
> *Sent: *Thursday, November 19, 2015 1:42 PM
> *To: *ceph-us...@ceph.com
> *Subject: *Re: [ceph-users] Can't activate osd in infernalis
>
>
>
> I fixed the issue and opened a ticket on the ceph-deploy bug tracker
>
> http://tracker.ceph.com/issues/13833
>
>
>
> tl;dr:
>
> change permission of the ssd journal partition with
>
> chown ceph:ceph /dev/sdd1
>
>
>
> On 19.11.2015 11:38, David Riedl wrote:
>
> > Hi everyone.
>
> > I updated one of my hammer osd nodes to infernalis today.
>
> > After many problems with the upgrading process of the running OSDs, I
>
> > decided to wipe them and start anew.
>
> > I reinstalled all packages and deleted all partitions on the OSDs and
>
> > the SSD journal drive.
>
> > I zapped the disks with ceph-deploy and also prepared them with
>
> > ceph-deploy.
>
> > Selinux state is enabled (disabling it didn't help though).
>
> >
>
> > After executing "ceph-deploy osd activate ceph01:/dev/sda1:/dev/sdd1"
>
> > I get the following error message from ceph-deploy:
>
> >
>
> >
>
> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph
>
> > --cluster ceph --name client.bootstrap-osd --keyring
>
> > /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
>
> > /var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap
>
> > [ceph01][WARNIN] 2015-11-19 11:22:53.974765 7f1a06852700  0 --
>
> > :/3225863658 >> 10.20.60.10:6789/0 pipe(0x7f19f8062590 sd=4 :0 s=1
>
> > pgs=0 cs=0 l=1 c=0x7f19f805c1b0).fault
>
> > [ceph01][WARNIN] got monmap epoch 16
>
> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>
> > --cluster ceph --mkfs --mkkey -i 0 --monmap
>
> > /var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap --osd-data
>
> > /var/lib/ceph/tmp/mnt.pmHRuu --osd-journal
>
> > /var/lib/ceph/tmp/mnt.pmHRuu/journal --osd-uuid
>
> > de162e24-16b6-4796-b6b9-774fdb8ec234 --keyring
>
> > /var/lib/ceph/tmp/mnt.pmHRuu/keyring --setuser ceph --setgroup ceph
>
> > [ceph01][WARNIN] 2015-11-19 11:22:57.237096 7fb458bb7900 -1
>
> > filestore(/var/lib/ceph/tmp/mnt.pmHRuu) mkjournal error creating
>
> > journal on /var/lib/ceph/tmp/mnt.pmHRuu/journal: (13) Permission denied
>
> > [ceph01][WARNIN] 2015-11-19 11:22:57.237118 7fb458bb7900 -1 OSD::mkfs:
>
> > ObjectStore::mkfs failed with error -13
>
> > [ceph01][WARNIN] 2015-11-19 11:22:57.237157 7fb458bb7900 -1  ** ERROR:
>
> > error creating empty object store in /var/lib/ceph/tmp/mnt.pmHRuu:
>
> > (13) Permission denied
>
> > [ceph01][WARNIN] ERROR:ceph-disk:Failed to activate
>
> > [ceph01][WARNIN] DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.pmHRuu
>
> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /bin/umount --
>
> > /var/lib/ceph/tmp/mnt.pmHRuu
>
> > [ceph01][WARNIN] Traceback (most recent call last):
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 3576, in 
>
> > [ceph01][WARNIN] main(sys.argv[1:])
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 3530, in main
>
> > [ceph01][WARNIN] args.func(args)
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 2424, in
>
> > main_activate
>
> > [ceph01][WARNIN] dmcrypt_key_dir=args.dmcrypt_key_dir,
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 2197, in
>
> > mount_activate
>
> > [ceph01][WARNIN] (osd_id, cluster) = activate(path,
>
> > activate_key_template, init)
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 2360, in activate
>
> > [ceph01][WARNIN] keyring=keyring,
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 1950, in mkfs
>
> > [ceph01][WARNIN] '--setgroup', get_ceph_user(),
>
> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 349, in
>
> > command_check_call
>
> > [ceph01][WARNIN] return subprocess.check_call(arguments)
>
> > [ceph01][WARNIN]   File "/usr/lib64/python2.7/subprocess.py", line
>
> > 542, in check_call
>
> > [ceph01][WARNIN] raise CalledProcessError(retcode, cmd)
>
> > [ceph01][WARNIN] subprocess.CalledProcessError: Command
>
> > '['/usr/bin/ceph-osd', '--cluster', 'ceph', '--mkfs', '--mkkey', '-i',
>
> > '0', '--monmap', '/var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap',
>
> > '--osd-data', '/var/lib/ceph/tmp/mnt.pmHRuu', '--osd-journal',
>
> > '/var/lib/ceph/tmp/mnt.pmHRuu/journal', '--osd-uuid',
>
> > 'de162e24-16b6-4796-b6b9-774fdb8ec234', '--keyring',
>
> > '/var/lib/ceph/tmp/mnt.pmHRuu/keyring', '--setuser', 'ceph',
>
> > '--setgroup', 'ceph']' returned non-zero exit status 1
>
> > [ceph01][ERROR ] RuntimeError: command returned non-zero exit status: 1
>
> > [ceph_deploy][ERROR ] RuntimeError: Failed to execute command:
>
> > ceph-disk -v activate --mark-init systemd 

Re: [ceph-users] Can't activate osd in infernalis

2015-11-19 Thread Mykola Dvornik
I am also using CentOS 7.x. /usr/lib/udev/rules.d/ should be fine. If not,
one can always symlink it into /etc/udev/rules.d/.
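
A sketch of an alternative rule that keys on the ceph journal partition
type GUID that ceph-disk writes (the same typecode visible in the prepare
logs in this thread), rather than on a fixed device name; this assumes
udev's blkid builtin populates ID_PART_ENTRY_TYPE for the partition:

ACTION=="add|change", SUBSYSTEM=="block", ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", OWNER:="ceph", GROUP:="ceph", MODE:="660"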

On 19 November 2015 at 14:13, David Riedl <david.ri...@wingcon.com> wrote:

> Thanks for the fix!
> Two questions though:
> Is that the right place for the udev rule? I have CentOS 7. The folder
> exists, but all the other udev rules are in /usr/lib/udev/rules.d/.
> Can I just create a new file named "89-ceph-journal.rules"  in the
> /usr/lib/udev/rules.d/ folder?
>
>
> Regards
>
> David
>
>
> On 19.11.2015 14:02, Mykola Dvornik wrote:
>
> cat /etc/udev/rules.d/89-ceph-journal.rules
>
> KERNEL=="sdd?" SUBSYSTEM=="block" OWNER="ceph" GROUP="disk" MODE="0660"
>
> On 19 November 2015 at 13:54, Mykola <mykola.dvor...@gmail.com> wrote:
>
>> I am afraid one would need an udev rule to make it persistent.
>>
>>
>>
>> Sent from Outlook Mail <http://go.microsoft.com/fwlink/?LinkId=550987>
>> for Windows 10 phone
>>
>>
>>
>>
>> *From: *David Riedl <david.ri...@wingcon.com>
>> *Sent: *Thursday, November 19, 2015 1:42 PM
>> *To: *ceph-us...@ceph.com
>> *Subject: *Re: [ceph-users] Can't activate osd in infernalis
>>
>>
>>
>> I fixed the issue and opened a ticket on the ceph-deploy bug tracker
>>
>> http://tracker.ceph.com/issues/13833
>>
>>
>>
>> tl;dr:
>>
>> change permission of the ssd journal partition with
>>
>> chown ceph:ceph /dev/sdd1
>>
>>
>>
>> On 19.11.2015 11:38, David Riedl wrote:
>>
>> > Hi everyone.
>>
>> > I updated one of my hammer osd nodes to infernalis today.
>>
>> > After many problems with the upgrading process of the running OSDs, I
>>
>> > decided to wipe them and start anew.
>>
>> > I reinstalled all packages and deleted all partitions on the OSDs and
>>
>> > the SSD journal drive.
>>
>> > I zapped the disks with ceph-deploy and also prepared them with
>>
>> > ceph-deploy.
>>
>> > Selinux state is enabled (disabling it didn't help though).
>>
>> >
>>
>> > After executing "ceph-deploy osd activate ceph01:/dev/sda1:/dev/sdd1"
>>
>> > I get the following error message from ceph-deploy:
>>
>> >
>>
>> >
>>
>> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph
>>
>> > --cluster ceph --name client.bootstrap-osd --keyring
>>
>> > /var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap
>>
>> > [ceph01][WARNIN] 2015-11-19 11:22:53.974765 7f1a06852700  0 --
>>
>> > :/3225863658 >> 10.20.60.10:6789/0 pipe(0x7f19f8062590 sd=4 :0 s=1
>>
>> > pgs=0 cs=0 l=1 c=0x7f19f805c1b0).fault
>>
>> > [ceph01][WARNIN] got monmap epoch 16
>>
>> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>>
>> > --cluster ceph --mkfs --mkkey -i 0 --monmap
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu/activate.monmap --osd-data
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu --osd-journal
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu/journal --osd-uuid
>>
>> > de162e24-16b6-4796-b6b9-774fdb8ec234 --keyring
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu/keyring --setuser ceph --setgroup ceph
>>
>> > [ceph01][WARNIN] 2015-11-19 11:22:57.237096 7fb458bb7900 -1
>>
>> > filestore(/var/lib/ceph/tmp/mnt.pmHRuu) mkjournal error creating
>>
>> > journal on /var/lib/ceph/tmp/mnt.pmHRuu/journal: (13) Permission denied
>>
>> > [ceph01][WARNIN] 2015-11-19 11:22:57.237118 7fb458bb7900 -1 OSD::mkfs:
>>
>> > ObjectStore::mkfs failed with error -13
>>
>> > [ceph01][WARNIN] 2015-11-19 11:22:57.237157 7fb458bb7900 -1  ** ERROR:
>>
>> > error creating empty object store in /var/lib/ceph/tmp/mnt.pmHRuu:
>>
>> > (13) Permission denied
>>
>> > [ceph01][WARNIN] ERROR:ceph-disk:Failed to activate
>>
>> > [ceph01][WARNIN] DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.pmHRuu
>>
>> > [ceph01][WARNIN] INFO:ceph-disk:Running command: /bin/umount --
>>
>> > /var/lib/ceph/tmp/mnt.pmHRuu
>>
>> > [ceph01][WARNIN] Traceback (most recent call last):
>>
>> > [ceph01][WARNIN]   File "/usr/sbin/ceph-disk", line 3576, in 
>>
>> > [ceph01][WARNIN] main(sys.argv[1:])
>>
>> > [ceph01

Re: [ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-18 Thread Mykola Dvornik
Hi John,

It turned out that mds triggers an assertion

*mds/MDCache.cc: 269: FAILED assert(inode_map.count(in->vino()) == 0)*

on any attempt to write data to the filesystem mounted via fuse.

Deleting data is still OK.

I cannot really follow why duplicated inodes appear.

Are there any ways to flush/reset the MDS cache?



On 17 November 2015 at 13:26, John Spray <jsp...@redhat.com> wrote:

> On Tue, Nov 17, 2015 at 12:17 PM, Mykola Dvornik
> <mykola.dvor...@gmail.com> wrote:
> > Dear John,
> >
> > Thanks for such a prompt reply!
> >
> > Seems like something happens on the mon side, since there are no
> > mount-specific requests logged on the mds side (see below).
> > FYI, some hours ago I've disabled auth completely, but it didn't help.
> >
> > The serialized metadata pool is 9.7G. I can try to compress it with 7z,
> then
> > setup rssh account for you to scp/rsync it.
> >
> > debug mds = 20
> > debug mon = 20
>
> Don't worry about the mon logs.  That MDS log snippet appears to be
> from several minutes earlier than the client's attempt to mount.
>
> In these cases it's generally simpler if you truncate all the logs,
> then attempt the mount, then send all the logs in full rather than
> snippets, so that we can be sure nothing is missing.
>
> Please also get the client log (use the fuse client with
> --debug-client=20).
>
> John
>



-- 
 Mykola


Re: [ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-17 Thread Mykola Dvornik
Dear John,

Thanks for such a prompt reply!

It seems like something happens on the mon side, since there are no
mount-specific requests logged on the MDS side (see below).
FYI, some hours ago I disabled auth completely, but it didn't help.

The serialized metadata pool is 9.7G. I can try to compress it with 7z,
then set up an rssh account for you to scp/rsync it.

debug mds = 20
debug mon = 20

*grep CLI.ENT.IPA.DDR /var/log/ceph/ceph-mon.000-s-ragnarok.log*

2015-11-17 12:46:20.763049 7ffa90d11700 10 mon.000-s-ragnarok@0(leader) e1
ms_verify_authorizer xxx.xxx.xxx.xxx:0/137313644 client protocol 0
2015-11-17 12:46:20.763687 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch new session 0x5602b5178840 MonSession(unknown.0
xxx.xxx.xxx.xxx:0/137313644 is open)
2015-11-17 12:46:20.763699 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
caps
2015-11-17 12:46:20.763720 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 preprocess_query auth(proto 0 34 bytes epoch 0) from unknown.0
xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:46:20.763726 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 prep_auth() blob_size=34
2015-11-17 12:46:20.763738 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 AuthMonitor::assign_global_id m=auth(proto 0 34 bytes epoch 0)
mon=0/1 last_allocated=1614103 max_global_id=1624096
2015-11-17 12:46:20.763741 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 next_global_id should be 1614104
2015-11-17 12:46:20.763817 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535a480 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:20.764469 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch existing session 0x5602b5178840 for unknown.0
xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:46:20.764475 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
caps
2015-11-17 12:46:20.764492 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 preprocess_query auth(proto 2 32 bytes epoch 0) from unknown.0
xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:46:20.764497 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 prep_auth() blob_size=32
2015-11-17 12:46:20.764705 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535b680 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:20.765279 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch existing session 0x5602b5178840 for unknown.0
xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:46:20.765287 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
caps allow *
2015-11-17 12:46:20.765303 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 preprocess_query auth(proto 2 165 bytes epoch 0) from unknown.0
xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:46:20.765310 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader).auth
v5435 prep_auth() blob_size=165
2015-11-17 12:46:20.765532 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535a000 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:20.766113 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch existing session 0x5602b5178840 for unknown.0
xxx.xxx.xxx.xxx:0/137313644

*and then*

2015-11-17 12:48:20.767152 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader) e1
ms_handle_reset 0x5602b5913b80 xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:48:20.767167 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader) e1
reset/close on session unknown.0 xxx.xxx.xxx.xxx:0/137313644
2015-11-17 12:48:20.767173 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader) e1
remove_session 0x5602b5178840 unknown.0 xxx.xxx.xxx.xxx:0/137313644

*session-specific stuff*

2015-11-17 12:46:20.763817 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535a480 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:20.764705 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535b680 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:20.765532 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b535a000 auth_reply(proto 2 0 (0) Success)
v1
2015-11-17 12:46:21.995713 7ffa8b2e7700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b5278900 mdsbeacon(1614101/000-s-ragnarok
up:active seq 184 v9429) v4
2015-11-17 12:46:23.039318 7ffa8d109700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b5388800 pg_stats_ack(1 pgs tid 389) v1
2015-11-17 12:47:24.056767 7ffa8d109700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b5357400 pg_stats_ack(1 pgs tid 337) v1
2015-11-17 12:47:50.082888 7ffa8d109700  2 mon.000-s-ragnarok@0(leader) e1
send_reply 0x5602b5350920 0x5602b5cd6400 pg_stats_ack(2 pgs tid 263) v1

2015-11-17 12:46:20.763687 7ffa8b2e7700 10 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch new session 0x5602b5178840 MonSession(unknown.0
xxx.xxx.xxx.xxx:0/137313644 is open)
2015-11-17 12:46:20.764469 7ffa8b2e7700 20 mon.000-s-ragnarok@0(leader) e1
_ms_dispatch existing session 0x5602b5178840 for unknown.0
xxx.xxx.xxx.xxx:0/137313644

[ceph-users] Cannot mount CephFS after irreversible OSD lost

2015-11-17 Thread Mykola Dvornik

Dear ceph experts,

I've built and am administrating a 12-OSD ceph cluster (spanning 3 
nodes) with a replication count of 2. The ceph version is:


ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)

The cluster hosts two pools (data and metadata) that are exported over 
CephFS.


At some point the OSDs approached the 'full' state and one of them got 
corrupted. The easiest solution was to remove the corrupted OSD, wipe it 
and re-add it.


It went fine, and the cluster was recovering without issues. At the point 
where only 39 degraded objects were left, another OSD got corrupted (its 
peer, actually). I was not able to recover it, so I made the hard decision 
to remove it, wipe it and re-add it to the cluster. Since no backups had 
been made, data corruption was expected.


To my surprise, when all OSDs got back online and the cluster started to 
recover, only one incomplete PG was reported. I worked around it by 
ssh'ing to the node that holds its primary OSD and exporting the corrupted 
PG with 'ceph-objectstore-tool --op export', marking it 'complete' 
afterwards. Once the cluster recovered, I imported the PG's data back into 
its primary OSD (a sketch of this sequence follows the status output 
below). The recovery then fully completed, and at the moment 'ceph -s' 
gives me:


   cluster 7972d1e9-2843-41a3-a4e7-9889d9c75850
health HEALTH_WARN
   1 near full osd(s)
monmap e1: 1 mons at {000-s-ragnarok=xxx.xxx.xxx.xxx:6789/0}
   election epoch 1, quorum 0 000-s-ragnarok
mdsmap e9393: 1/1/0 up {0=000-s-ragnarok=up:active}
osdmap e185363: 12 osds: 12 up, 12 in
 pgmap v5599327: 1024 pgs, 2 pools, 7758 GB data, 22316 kobjects
   15804 GB used, 6540 GB / 22345 GB avail
   1020 active+clean
  4 active+clean+scrubbing+deep
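
For reference, a sketch of the export / mark-complete / import sequence 
described above, with placeholder OSD id, paths and PG id, run while the 
OSD is stopped:

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --pgid 1.7f --op export --file /root/pg.1.7f.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --pgid 1.7f --op mark-complete
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --op import --file /root/pg.1.7f.export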

However, when I brought the mds back online, CephFS could not be mounted 
anymore, the client complaining 'mount error 5 = Input/output error'. 
Since the mds was running just fine, without any suspicious messages in 
its log, I decided that something had happened to its journal and that a 
CephFS disaster recovery was needed. I stopped the mds and tried to make a 
backup of the journal. Unfortunately, the tool crashed with the following 
output:


cephfs-journal-tool journal export backup.bin
journal is 1841503004303~12076
*** buffer overflow detected ***: cephfs-journal-tool terminated
=== Backtrace: =
/lib64/libc.so.6(__fortify_fail+0x37)[0x7f175ef12a57]
/lib64/libc.so.6(+0x10bc10)[0x7f175ef10c10]
/lib64/libc.so.6(+0x10b119)[0x7f175ef10119]
/lib64/libc.so.6(_IO_vfprintf+0x2f00)[0x7f175ee4f430]
/lib64/libc.so.6(__vsprintf_chk+0x88)[0x7f175ef101a8]
/lib64/libc.so.6(__sprintf_chk+0x7d)[0x7f175ef100fd]
cephfs-journal-tool(_ZN6Dumper4dumpEPKc+0x630)[0x7f1763374720]
cephfs-journal-tool(_ZN11JournalTool14journal_exportERKSsb+0x294)[0x7f1763357874]
cephfs-journal-tool(_ZN11JournalTool12main_journalERSt6vectorIPKcSaIS2_EE+0x105)[0x7f17633580c5]
cephfs-journal-tool(_ZN11JournalTool4mainERSt6vectorIPKcSaIS2_EE+0x56e)[0x7f17633514de]
cephfs-journal-tool(main+0x1de)[0x7f1763350d4e]
/lib64/libc.so.6(__libc_start_main+0xf5)[0x7f175ee26af5]
cephfs-journal-tool(+0x1ccae9)[0x7f1763356ae9]
...
-3> 2015-11-17 10:43:00.874529 7f174db4b700  1 -- 
xxx.xxx.xxx.xxx:6802/3019233561 <== osd.9 xxx.xxx.xxx.xxx:6808/13662 1 
 osd_op_reply(4 200.0006b309 [stat] v0'0 uv0 ack = -2 ((2) No such 
file or directory)) v6  179+0+0 (2303160312 0 0) 0x7f1767c719c0 con 
0x7f1767d194a0

...

So I've used the rados tool to export the cephfs_metadata pool, and then 
proceeded with:


cephfs-journal-tool event recover_dentries summary
cephfs-journal-tool journal reset
cephfs-table-tool all reset session
ceph fs reset home --yes-i-really-mean-it

After this manipulation, 'cephfs-journal-tool journal export backup.rec' 
worked, but wrote only 48 bytes at around a 1.8 TB offset!
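
For reference, a sketch of the pool backup step mentioned above, assuming 
the single-file export/import form of the rados tool and a placeholder 
output path:

rados -p cephfs_metadata export /root/cephfs_metadata.export
# and, if a rollback were ever needed:
rados -p cephfs_metadata import /root/cephfs_metadata.export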


Then I brought the mds back online, but CephFS is still not mountable.

I've tried to flush the journal with:

ceph daemon mds.000-s-ragnarok flush journal

No luck. Then I stopped the mds and relaunched it with:

ceph-mds -i 000-s-ragnarok --journal_check 0 --debug_mds=10 
--debug_ms=100


It persistently outputs this snippet for a couple of hours:

7faf0bd58700  7 mds.0.cache trim max=10  cur=17
7faf0bd58700 10 mds.0.cache trim_client_leases
7faf0bd58700  2 mds.0.cache check_memory_usage total 256288, rss 19116, 
heap 48056, malloc 1791 mmap 0, baseline 48056, buffers 0, 0 / 19 
inodes have caps, 0 caps, 0 caps per inode
7faf0bd58700 10 mds.0.log trim 1 / 30 segments, 8 / -1 events, 0 (0) 
expiring, 0 (0) expired
7faf0bd58700 10 mds.0.log _trim_expired_segments waiting for 
1841488226436/1841503004303 to expire

7faf0bd58700 10 mds.0.server find_idle_sessions.  laggy until 0.00
7faf0bd58700 10 mds.0.locker scatter_tick
7faf0bd58700 10 mds.0.cache find_stale_fragment_freeze
7faf0bd58700 10 mds.0.snap check_osd_map - version unchanged
7faf0b557700 10 mds.beacon.000-s-ragnarok _send up:active seq 12

So it appears to me that even despite