Re: [ceph-users] The OSD can be “down” but still “in”.

2019-01-22 Thread M Ranga Swami Reddy
Thanks for reply.
If the OSD is the primary for a PG, then I/O to that PG can stall,
which may lead to application failures.
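
For anyone who wants to verify whether a down-but-in OSD is currently the
primary for any PGs, a rough sketch (the OSD id is a placeholder; syntax as
on recent releases):

ceph osd tree down                    # which OSDs are currently marked down
ceph pg ls-by-primary osd.12          # PGs whose primary is osd.12
ceph pg dump pgs_brief | grep -c peered   # quick look for PGs stuck without enough replicas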



On Tue, Jan 22, 2019 at 5:32 PM Matthew Vernon  wrote:
>
> Hi,
>
> On 22/01/2019 10:02, M Ranga Swami Reddy wrote:
> > Hello - If an OSD shown as down and but its still "in" state..what
> > will happen with write/read operations on this down OSD?
>
> It depends ;-)
>
> In a typical 3-way replicated setup with min_size 2, writes to placement
> groups on that OSD will still go ahead - when 2 replicas are written OK,
> then the write will complete. Once the OSD comes back up, these writes
> will then be replicated to that OSD. If it stays down for long enough to
> be marked out, then pgs on that OSD will be replicated elsewhere.
>
> If you had min_size 3 as well, then writes would block until the OSD was
> back up (or marked out and the pgs replicated to another OSD).
>
> Regards,
>
> Matthew
>
>
> --
>  The Wellcome Sanger Institute is operated by Genome Research
>  Limited, a charity registered in England with number 1021457 and a
>  company registered in England with number 2742969, whose registered
>  office is 215 Euston Road, London, NW1 2BE.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-22 Thread Yan, Zheng
On Wed, Jan 23, 2019 at 10:02 AM Albert Yue  wrote:
>
> But with enough memory on the MDS, can't I just cache all the metadata in memory?
> Right now there is around 500 GB of metadata on the SSDs. Is that not enough?
>

The MDS needs to track lots of extra information for each object. For
500 GB of metadata, the MDS may need 1 TB or more of memory.
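
As a rough illustration of the workaround quoted below (the 10k threshold and
the parsing are only examples; the layout of the debugfs caps file differs
between kernel versions), something like this could be cron'd on each client:

for d in /sys/kernel/debug/ceph/*/; do
    used=$(awk '/used/ {print $2; exit}' "$d/caps")
    if [ "${used:-0}" -gt 10000 ]; then
        # drop clean dentries/inodes so the client releases unused caps
        echo 2 > /proc/sys/vm/drop_caches
    fi
done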

> On Tue, Jan 22, 2019 at 5:48 PM Yan, Zheng  wrote:
>>
>> On Tue, Jan 22, 2019 at 10:49 AM Albert Yue  
>> wrote:
>> >
>> > Hi Yan Zheng,
>> >
>> > In your opinion, can we resolve this issue by move MDS to a 512GB or 1TB 
>> > memory machine?
>> >
>>
>> The problem is from client side, especially clients with large memory.
>> I don't think enlarging the mds cache size is a good idea. You can
>> periodically check each kernel client's /sys/kernel/debug/ceph/xxx/caps
>> and run 'echo 2 >/proc/sys/vm/drop_caches' on any client that holds too
>> many caps (for example 10k).
>>
>> > On Mon, Jan 21, 2019 at 10:49 PM Yan, Zheng  wrote:
>> >>
>> >> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue  
>> >> wrote:
>> >> >
>> >> > Dear Ceph Users,
>> >> >
>> >> > We have set up a cephFS cluster with 6 osd machines, each with 16 8TB 
>> >> > harddisk. Ceph version is luminous 12.2.5. We created one data pool 
>> >> > with these hard disks and created another meta data pool with 3 ssd. We 
>> >> > created a MDS with 65GB cache size.
>> >> >
>> >> > But our users are keep complaining that cephFS is too slow. What we 
>> >> > observed is cephFS is fast when we switch to a new MDS instance, once 
>> >> > the cache fills up (which will happen very fast), client became very 
>> >> > slow when performing some basic filesystem operation such as `ls`.
>> >> >
>> >>
>> >> It seems that clients hold lots of unused inodes in their icache, which
>> >> prevents the mds from trimming the corresponding objects from its cache.
>> >> Mimic has the command "ceph daemon mds.x cache drop" to ask a client to
>> >> drop its cache. I'm also working on a patch that makes the kernel client
>> >> release unused inodes.
>> >>
>> >> For luminous,  there is not much we can do, except periodically run
>> >> "echo 2 > /proc/sys/vm/drop_caches"  on each client.
>> >>
>> >>
>> >> > What we know is our user are putting lots of small files into the 
>> >> > cephFS, now there are around 560 Million files. We didn't see high CPU 
>> >> > wait on MDS instance and meta data pool just used around 200MB space.
>> >> >
>> >> > My question is, what is the relationship between the metadata pool and 
>> >> > MDS? Is this performance issue caused by the hardware behind meta data 
>> >> > pool? Why the meta data pool only used 200MB space, and we saw 3k iops 
>> >> > on each of these three ssds, why can't MDS cache all these 200MB into 
>> >> > memory?
>> >> >
>> >> > Thanks very much!
>> >> >
>> >> >
>> >> > Best Regards,
>> >> >
>> >> > Albert
>> >> >
>> >> > ___
>> >> > ceph-users mailing list
>> >> > ceph-users@lists.ceph.com
>> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-22 Thread Albert Yue
But with enough memory on the MDS, can't I just cache all the metadata in memory?
Right now there is around 500 GB of metadata on the SSDs. Is that not enough?

On Tue, Jan 22, 2019 at 5:48 PM Yan, Zheng  wrote:

> On Tue, Jan 22, 2019 at 10:49 AM Albert Yue 
> wrote:
> >
> > Hi Yan Zheng,
> >
> > In your opinion, can we resolve this issue by move MDS to a 512GB or 1TB
> memory machine?
> >
>
> The problem is from client side, especially clients with large memory.
> I don't think enlarging the mds cache size is a good idea. You can
> periodically check each kernel client's /sys/kernel/debug/ceph/xxx/caps
> and run 'echo 2 >/proc/sys/vm/drop_caches' on any client that holds too
> many caps (for example 10k).
>
> > On Mon, Jan 21, 2019 at 10:49 PM Yan, Zheng  wrote:
> >>
> >> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue 
> wrote:
> >> >
> >> > Dear Ceph Users,
> >> >
> >> > We have set up a cephFS cluster with 6 osd machines, each with 16 8TB
> harddisk. Ceph version is luminous 12.2.5. We created one data pool with
> these hard disks and created another meta data pool with 3 ssd. We created
> a MDS with 65GB cache size.
> >> >
> >> > But our users are keep complaining that cephFS is too slow. What we
> observed is cephFS is fast when we switch to a new MDS instance, once the
> cache fills up (which will happen very fast), client became very slow when
> performing some basic filesystem operation such as `ls`.
> >> >
> >>
> >> It seems that clients hold lots of unused inodes in their icache, which
> >> prevents the mds from trimming the corresponding objects from its cache.
> >> Mimic has the command "ceph daemon mds.x cache drop" to ask a client to
> >> drop its cache. I'm also working on a patch that makes the kernel client
> >> release unused inodes.
> >>
> >> For luminous,  there is not much we can do, except periodically run
> >> "echo 2 > /proc/sys/vm/drop_caches"  on each client.
> >>
> >>
> >> > What we know is our user are putting lots of small files into the
> cephFS, now there are around 560 Million files. We didn't see high CPU wait
> on MDS instance and meta data pool just used around 200MB space.
> >> >
> >> > My question is, what is the relationship between the metadata pool
> and MDS? Is this performance issue caused by the hardware behind meta data
> pool? Why the meta data pool only used 200MB space, and we saw 3k iops on
> each of these three ssds, why can't MDS cache all these 200MB into memory?
> >> >
> >> > Thanks very much!
> >> >
> >> >
> >> > Best Regards,
> >> >
> >> > Albert
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Yan, Zheng
On Tue, Jan 22, 2019 at 10:42 PM Dan van der Ster  wrote:
>
> On Tue, Jan 22, 2019 at 3:33 PM Yan, Zheng  wrote:
> >
> > On Tue, Jan 22, 2019 at 9:08 PM Dan van der Ster  
> > wrote:
> > >
> > > Hi Zheng,
> > >
> > > We also just saw this today and got a bit worried.
> > > Should we change to:
> > >
> >
> > What is the error message (on stray dir or other dir)? does the
> > cluster ever enable multi-acitive mds?
> >
>
> It was during an upgrade from v12.2.8 to v12.2.10. 5 active MDS's
> during the upgrade.
>
> 2019-01-22 10:08:22.629545 mds.p01001532184554 mds.2
> 128.142.39.144:6800/268398 36 : cluster [WRN]  replayed op
> client.54045065:2282648,2282514 used ino 0x3001c85b193 but session
> next is 0x3001c28f018
> 2019-01-22 10:08:22.629617 mds.p01001532184554 mds.2
> 128.142.39.144:6800/268398 37 : cluster [WRN]  replayed op
> client.54045065:2282649,2282514 used ino 0x3001c85b194 but session
> next is 0x3001c28f018
> 2019-01-22 10:08:22.629652 mds.p01001532184554 mds.2
> 128.142.39.144:6800/268398 38 : cluster [WRN]  replayed op
> client.54045065:2282650,2282514 used ino 0x3001c85b195 but session
> next is 0x3001c28f018
> 2019-01-22 10:08:37.373704 mon.cephflax-mon-9b406e0261 mon.0
> 137.138.121.135:6789/0 2748 : cluster [INF] daemon mds.p01001532184554
> is now active in filesystem cephfs as rank 2
> 2019-01-22 10:08:37.805675 mon.cephflax-mon-9b406e0261 mon.0
> 137.138.121.135:6789/0 2749 : cluster [INF] Health check cleared:
> FS_DEGRADED (was: 1 filesystem is degraded)
> 2019-01-22 10:08:39.784260 mds.p01001532184554 mds.2
> 128.142.39.144:6800/268398 547 : cluster [ERR] bad/negative dir
> size on 0x61b f(v27 m2019-01-22 10:07:38.509466 0=-1+1)
> 2019-01-22 10:08:39.784271 mds.p01001532184554 mds.2
> 128.142.39.144:6800/268398 548 : cluster [ERR] unmatched fragstat
> on 0x61b, inode has f(v28 m2019-01-22 10:07:38.509466 0=-1+1),
> dirfrags have f(v0 m2019-01-22 10:07:38.509466 1=0+1)

An incorrect fragstat on a stray dir is not a big deal; the mds uses it only
for printing debug/warning messages. But an incorrect fragstat on another dir
may need manual intervention, so I'd prefer not to change it to a 'warning'
message.
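
For the non-stray case, a hedged sketch of one manual-intervention route via
the MDS admin socket (the MDS name and path are placeholders; check the docs
for your release before running a repair):

ceph daemon mds.<name> scrub_path /affected/dir recursive repair
ceph daemon mds.<name> damage ls      # review anything the scrub reported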

Regards
Yan, Zheng

> 2019-01-22 10:10:02.605036 mon.cephflax-mon-9b406e0261 mon.0
> 137.138.121.135:6789/0 2803 : cluster [INF] Health check cleared:
> MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons
> available)
> 2019-01-22 10:10:02.605089 mon.cephflax-mon-9b406e0261 mon.0
> 137.138.121.135:6789/0 2804 : cluster [INF] Cluster is now healthy
>
>
>
>
>
> > > diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
> > > index e8c1bc8bc1..e2539390fb 100644
> > > --- a/src/mds/CInode.cc
> > > +++ b/src/mds/CInode.cc
> > > @@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)
> > >
> > > if (pf->fragstat.nfiles < 0 ||
> > > pf->fragstat.nsubdirs < 0) {
> > > - clog->error() << "bad/negative dir size on "
> > > + clog->warn() << "bad/negative dir size on "
> > >   << dir->dirfrag() << " " << pf->fragstat;
> > >   assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);
> > >
> > > @@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
> > >   if (state_test(CInode::STATE_REPAIRSTATS)) {
> > > dout(20) << " dirstat mismatch, fixing" << dendl;
> > >   } else {
> > > -   clog->error() << "unmatched fragstat on " << ino() << ", 
> > > inode has "
> > > +   clog->warn() << "unmatched fragstat on " << ino() << ", inode 
> > > has "
> > >   << pi->dirstat << ", dirfrags have " << dirstat;
> > > assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
> > >   }
> > >
> > >
> > > Cheers, Dan
> > >
> > >
> > > On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:
> > >>
> > >> no action is required. mds fixes this type of error atomically.
> > >> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
> > >>  wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> >
> > >> > upon failover or restart, or MDS complains that something is wrong with
> > >> > one of the stray directories:
> > >> >
> > >> >
> > >> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
> > >> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
> > >> > 12:51:12.016360 -4=-5+1)
> > >> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
> > >> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
> > >> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 
> > >> > 12:51:12.016360
> > >> > 1=0+1)
> > >> >
> > >> >
> > >> > How do we handle this problem?
> > >> >
> > >> >
> > >> > Regards,
> > >> >
> > >> > Burkhard
> > >> >
> > >> >
> > >> > ___
> > >> > ceph-users mailing list
> > >> > ceph-users@lists.ceph.com
> > >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >> ___
> > >> 

Re: [ceph-users] cephfs performance degraded very fast

2019-01-22 Thread Yan, Zheng
On Tue, Jan 22, 2019 at 8:24 PM renjianxinlover  wrote:
>
> hi,
>At times, under cache pressure or after a caps release failure, client apps'
> mounts get stuck.
>My use case is a Kubernetes cluster with automatic kernel client mounts on
> the nodes.
>Has anyone faced the same issue or found a solution?
> Brs
>
>

If you mean "client.xxx failing to respond to capability release".
you'd better to make sure all clients are uptodate (newest version of
ceph-fuse, recent kernel)
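
One way to check what each client runs (a sketch; the MDS name is a
placeholder, and older kernel clients may not report every field) is the
session metadata kept by the MDS:

ceph daemon mds.<name> session ls | grep -E 'kernel_version|ceph_version|hostname'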

>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Process stuck in D+ on cephfs mount

2019-01-22 Thread Yan, Zheng
On Wed, Jan 23, 2019 at 5:50 AM Marc Roos  wrote:
>
>
> I got one again
>
> [] wait_on_page_bit_killable+0x83/0xa0
> [] __lock_page_or_retry+0xb2/0xc0
> [] filemap_fault+0x3b7/0x410
> [] ceph_filemap_fault+0x13c/0x310 [ceph]
> [] __do_fault+0x4c/0xc0
> [] do_read_fault.isra.42+0x43/0x130
> [] handle_mm_fault+0x6b1/0x1040
> [] __do_page_fault+0x154/0x450
> [] do_page_fault+0x35/0x90
> [] page_fault+0x28/0x30
> [] 0x
>
>

This is likely caused by a hung OSD request. What was your cluster health?
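
A sketch of how to confirm that from both sides (standard kernel-client
debugfs paths assumed):

cat /sys/kernel/debug/ceph/*/osdc    # in-flight OSD requests on the client
ceph health detail                   # blocked/slow requests, down OSDs, etc.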



>  >check /proc//stack to find where it is stuck
>  >
>  >>
>  >>
>  >> I have a process stuck in D+ writing to cephfs kernel mount.
> Anything
>  >> can be done about this? (without rebooting)
>  >>
>  >>
>  >> CentOS Linux release 7.5.1804 (Core)
>  >> Linux 3.10.0-514.21.2.el7.x86_64
>  >>
>
>


Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Christian Wuerdig
If you use librados directly it's up to you to ensure you can identify your
objects. Generally RADOS stores objects and not files so when you provide
your object ids you need to come up with a convention so you can correctly
identify them. If you need to provide meta data (i.e. a list of all
existing backups, when they were taken etc.) then again you need to manage
that yourself (probably in dedicated meta-data objects). Using RADOS
namespaces (like one per database) is probably a good idea.
Also keep in mind that, for example, Bluestore has a maximum object size of
4GB, so mapping files 1:1 to objects is probably not a wise approach; you
should break up your files into smaller chunks when storing them. There is
libradosstriper, which handles the striping of large objects transparently,
but I'm not sure whether it supports RADOS namespaces.
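
A hedged sketch of what such a convention could look like with the rados CLI
(pool, namespace and object names are invented for illustration):

rados -p backups -N db-alpha put db-alpha/2019-01-22/chunk-0000 ./chunk-0000
rados -p backups -N db-alpha ls      # list that database's backup objects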

Using RGW instead might be an easier route to go down

On Wed, 23 Jan 2019 at 10:10, cmonty14 <74cmo...@gmail.com> wrote:

> My backup client is using librados.
> I understand that defining a pool for the same application is recommended.
>
> However this would not answer my other questions:
> How can I identify a backup created by client A that I want to restore
> on another client Z?
> I mean typically client A would write a backup file identified by the
> filename.
> Would it be possible on client Z to identify this backup file by
> filename? If yes, how?
>
> Am Di., 22. Jan. 2019 um 15:07 Uhr schrieb :
> >
> > Hi,
> >
> > Ceph's pool are meant to let you define specific engineering rules
> > and/or application (rbd, cephfs, rgw)
> > They are not designed to be created in a massive fashion (see pgs etc)
> > So, create a pool for each engineering ruleset, and store your data in
> them
> > For what is left of your project, I believe you have to implement that
> > on top of Ceph
> >
> > For instance, let say you simply create a pool, with a rbd volume in it
> > You then create a filesystem on that, and map it on some server
> > Finally, you can push your files on that mountpoint, using various
> > Linux's user, acl or whatever : beyond that point, there is nothing more
> > specific to Ceph, it is "just" a mounted filesystem
> >
> > Regards,
> >
> > On 01/22/2019 02:16 PM, cmonty14 wrote:
> > > Hi,
> > >
> > > my use case for Ceph is providing a central backup storage.
> > > This means I will backup multiple databases in Ceph storage cluster.
> > >
> > > This is my question:
> > > What is the best practice for creating pools & images?
> > > Should I create multiple pools, means one pool per database?
> > > Or should I create a single pool "backup" and use namespace when
> writing
> > > data in the pool?
> > >
> > > This is the security demand that should be considered:
> > > DB-owner A can only modify the files that belong to A; other files
> > > (owned by B, C or D) are accessible for A.
> > >
> > > And there's another issue:
> > > How can I identify a backup created by client A that I want to restore
> > > on another client Z?
> > > I mean typically client A would write a backup file identified by the
> > > filename.
> > > Would it be possible on client Z to identify this backup file by
> > > filename? If yes, how?
> > >
> > >
> > > THX
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Process stuck in D+ on cephfs mount

2019-01-22 Thread Marc Roos
 
I got one again

[] wait_on_page_bit_killable+0x83/0xa0
[] __lock_page_or_retry+0xb2/0xc0
[] filemap_fault+0x3b7/0x410
[] ceph_filemap_fault+0x13c/0x310 [ceph]
[] __do_fault+0x4c/0xc0
[] do_read_fault.isra.42+0x43/0x130
[] handle_mm_fault+0x6b1/0x1040
[] __do_page_fault+0x154/0x450
[] do_page_fault+0x35/0x90
[] page_fault+0x28/0x30
[] 0x


 >check /proc//stack to find where it is stuck
 >
 >>
 >>
 >> I have a process stuck in D+ writing to cephfs kernel mount. 
Anything 
 >> can be done about this? (without rebooting)
 >>
 >>
 >> CentOS Linux release 7.5.1804 (Core)
 >> Linux 3.10.0-514.21.2.el7.x86_64
 >>




Re: [ceph-users] read-only mounts of RBD images on multiple nodes for parallel reads

2019-01-22 Thread Void Star Nill
Thanks all for the great advices and inputs.

Regarding Mykola's suggestion to use Read-Only snapshots, what is the
overhead of creating these snapshots? I assume these are copy-on-write
snapshots, so there's no extra space consumed except for the metadata?
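
For reference, a minimal sketch of the workflow being discussed (pool/image
names invented); since RBD snapshots are copy-on-write, they cost little more
than metadata until the parent image diverges:

rbd snap create mypool/foo@gold       # point-in-time, copy-on-write snapshot
rbd snap protect mypool/foo@gold      # only needed if you plan to clone it
rbd map mypool/foo@gold --read-only   # map the snapshot read-only on each reader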

Thanks,
Shridhar


On Fri, 18 Jan 2019 at 04:10, Ilya Dryomov  wrote:

> On Fri, Jan 18, 2019 at 11:25 AM Mykola Golub 
> wrote:
> >
> > On Thu, Jan 17, 2019 at 10:27:20AM -0800, Void Star Nill wrote:
> > > Hi,
> > >
> > > We am trying to use Ceph in our products to address some of the use
> cases.
> > > We think Ceph block device for us. One of the use cases is that we
> have a
> > > number of jobs running in containers that need to have Read-Only
> access to
> > > shared data. The data is written once and is consumed multiple times. I
> > > have read through some of the similar discussions and the
> recommendations
> > > on using CephFS for these situations, but in our case Block device
> makes
> > > more sense as it fits well with other use cases and restrictions we
> have
> > > around this use case.
> > >
> > > The following scenario seems to work as expected when we tried on a
> test
> > > cluster, but we wanted to get an expert opinion to see if there would
> be
> > > any issues in production. The usage scenario is as follows:
> > >
> > > - A block device is created with "--image-shared" options:
> > >
> > > rbd create mypool/foo --size 4G --image-shared
> >
> > "--image-shared" just means that the created image will have
> > "exclusive-lock" feature and all other features that depend on it
> > disabled. It is useful for scenarios when one wants simulteous write
> > access to the image (e.g. when using a shared-disk cluster fs like
> > ocfs2) and does not want a performance penalty due to "exlusive-lock"
> > being pinged-ponged between writers.
> >
> > For your scenario it is not necessary but is ok.
> >
> > > - The image is mapped to a host, formatted in ext4 format (or other
> file
> > > formats), mounted to a directory in read/write mode and data is
> written to
> > > it. Please note that the image will be mapped in exclusive write mode
> -- no
> > > other read/write mounts are allowed a this time.
> >
> > The map "exclusive" option works only for images with "exclusive-lock"
> > feature enabled and prevent in this case automatic exclusive lock
> > transitions (ping-pong mentioned above) from one writer to
> > another. And in this case it will not prevent from mapping and
> > mounting it ro and probably even rw (I am not familiar enough with
> > kernel rbd implementation to be sure here), though in the last case
> > the write will fail.
>
> With -o exclusive, in addition to preventing automatic lock
> transitions, the kernel will attempt to acquire the lock at map time
> (i.e. before allowing any I/O) and return an error from "rbd map" in
> case the lock cannot be acquired.
>
> However, the fact the image is mapped -o exclusive on one host doesn't
> mean that it can't be mapped without -o exclusive on another host.  If
> you then try to write though the non-exclusive mapping, the write will
> block until the exclusive mapping goes away resulting a hung tasks in
> uninterruptible sleep state -- a much less pleasant failure mode.
>
> So make sure that all writers use -o exclusive.
>
> Thanks,
>
> Ilya
>


Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Jack
AFAIK, the only AAA available with librados works at pool granularity.
So, if you create a ceph user with access to your pool, it will get
access to all the content stored in that pool.

If you want to use librados for your use case, you will need to
implement, in your code, the application logic required for your
security needs.

So, to answer precisely:
"How can I identify a backup created by client A that I want to restore
on another client Z?"
You cannot: a client will get access to all the content of the pool,
including others' backups (which are keys, at the rados level).

"Would it be possible on client Z to identify this backup file by
filename? If yes, how?"
At the rados level, AFAIK, there is no metadata associated with a key,
so you have to include that information in the key name (the keys are
what you are calling "backup", "file", etc.).
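
To illustrate the pool-granularity point, a sketch of a cap that scopes one
client to one pool (names made up):

ceph auth get-or-create client.backup-a mon 'allow r' osd 'allow rwx pool=backups'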

Regards,


On 01/22/2019 10:09 PM, cmonty14 wrote:
> My backup client is using librados.
> I understand that defining a pool for the same application is recommended.
> 
> However this would not answer my other questions:
> How can I identify a backup created by client A that I want to restore
> on another client Z?
> I mean typically client A would write a backup file identified by the
> filename.
> Would it be possible on client Z to identify this backup file by
> filename? If yes, how?
> 
> Am Di., 22. Jan. 2019 um 15:07 Uhr schrieb :
>>
>> Hi,
>>
>> Ceph's pool are meant to let you define specific engineering rules
>> and/or application (rbd, cephfs, rgw)
>> They are not designed to be created in a massive fashion (see pgs etc)
>> So, create a pool for each engineering ruleset, and store your data in them
>> For what is left of your project, I believe you have to implement that
>> on top of Ceph
>>
>> For instance, let say you simply create a pool, with a rbd volume in it
>> You then create a filesystem on that, and map it on some server
>> Finally, you can push your files on that mountpoint, using various
>> Linux's user, acl or whatever : beyond that point, there is nothing more
>> specific to Ceph, it is "just" a mounted filesystem
>>
>> Regards,
>>
>> On 01/22/2019 02:16 PM, cmonty14 wrote:
>>> Hi,
>>>
>>> my use case for Ceph is providing a central backup storage.
>>> This means I will backup multiple databases in Ceph storage cluster.
>>>
>>> This is my question:
>>> What is the best practice for creating pools & images?
>>> Should I create multiple pools, means one pool per database?
>>> Or should I create a single pool "backup" and use namespace when writing
>>> data in the pool?
>>>
>>> This is the security demand that should be considered:
>>> DB-owner A can only modify the files that belong to A; other files
>>> (owned by B, C or D) are accessible for A.
>>>
>>> And there's another issue:
>>> How can I identify a backup created by client A that I want to restore
>>> on another client Z?
>>> I mean typically client A would write a backup file identified by the
>>> filename.
>>> Would it be possible on client Z to identify this backup file by
>>> filename? If yes, how?
>>>
>>>
>>> THX
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread cmonty14
My backup client is using librados.
I understand that defining a pool for the same application is recommended.

However this would not answer my other questions:
How can I identify a backup created by client A that I want to restore
on another client Z?
I mean typically client A would write a backup file identified by the
filename.
Would it be possible on client Z to identify this backup file by
filename? If yes, how?

Am Di., 22. Jan. 2019 um 15:07 Uhr schrieb :
>
> Hi,
>
> Ceph's pool are meant to let you define specific engineering rules
> and/or application (rbd, cephfs, rgw)
> They are not designed to be created in a massive fashion (see pgs etc)
> So, create a pool for each engineering ruleset, and store your data in them
> For what is left of your project, I believe you have to implement that
> on top of Ceph
>
> For instance, let say you simply create a pool, with a rbd volume in it
> You then create a filesystem on that, and map it on some server
> Finally, you can push your files on that mountpoint, using various
> Linux's user, acl or whatever : beyond that point, there is nothing more
> specific to Ceph, it is "just" a mounted filesystem
>
> Regards,
>
> On 01/22/2019 02:16 PM, cmonty14 wrote:
> > Hi,
> >
> > my use case for Ceph is providing a central backup storage.
> > This means I will backup multiple databases in Ceph storage cluster.
> >
> > This is my question:
> > What is the best practice for creating pools & images?
> > Should I create multiple pools, means one pool per database?
> > Or should I create a single pool "backup" and use namespace when writing
> > data in the pool?
> >
> > This is the security demand that should be considered:
> > DB-owner A can only modify the files that belong to A; other files
> > (owned by B, C or D) are accessible for A.
> >
> > And there's another issue:
> > How can I identify a backup created by client A that I want to restore
> > on another client Z?
> > I mean typically client A would write a backup file identified by the
> > filename.
> > Would it be possible on client Z to identify this backup file by
> > filename? If yes, how?
> >
> >
> > THX
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Spec for Ceph Mon+Mgr?

2019-01-22 Thread jesper
Hi.

We're currently co-locating our mons with the head node of our Hadoop
installation. That may be giving us some problems (we don't know yet), so
I'm speculating about moving them to dedicated hardware.

It is hard to get specifications "small" enough: the mon specs are the kind
of thing we usually virtualize our way out of, which seems very wrong here.

Are other people just co-locating them with something random, or what are
others typically using in a small ceph cluster (< 100 OSDs, 7 OSD hosts)?

Thanks.

Jesper



Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Dan van der Ster
On Tue, Jan 22, 2019 at 3:33 PM Yan, Zheng  wrote:
>
> On Tue, Jan 22, 2019 at 9:08 PM Dan van der Ster  wrote:
> >
> > Hi Zheng,
> >
> > We also just saw this today and got a bit worried.
> > Should we change to:
> >
>
> What is the error message (on stray dir or other dir)? does the
> cluster ever enable multi-acitive mds?
>

It was during an upgrade from v12.2.8 to v12.2.10. 5 active MDS's
during the upgrade.

2019-01-22 10:08:22.629545 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 36 : cluster [WRN]  replayed op
client.54045065:2282648,2282514 used ino 0x3001c85b193 but session
next is 0x3001c28f018
2019-01-22 10:08:22.629617 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 37 : cluster [WRN]  replayed op
client.54045065:2282649,2282514 used ino 0x3001c85b194 but session
next is 0x3001c28f018
2019-01-22 10:08:22.629652 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 38 : cluster [WRN]  replayed op
client.54045065:2282650,2282514 used ino 0x3001c85b195 but session
next is 0x3001c28f018
2019-01-22 10:08:37.373704 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2748 : cluster [INF] daemon mds.p01001532184554
is now active in filesystem cephfs as rank 2
2019-01-22 10:08:37.805675 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2749 : cluster [INF] Health check cleared:
FS_DEGRADED (was: 1 filesystem is degraded)
2019-01-22 10:08:39.784260 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 547 : cluster [ERR] bad/negative dir
size on 0x61b f(v27 m2019-01-22 10:07:38.509466 0=-1+1)
2019-01-22 10:08:39.784271 mds.p01001532184554 mds.2
128.142.39.144:6800/268398 548 : cluster [ERR] unmatched fragstat
on 0x61b, inode has f(v28 m2019-01-22 10:07:38.509466 0=-1+1),
dirfrags have f(v0 m2019-01-22 10:07:38.509466 1=0+1)
2019-01-22 10:10:02.605036 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2803 : cluster [INF] Health check cleared:
MDS_INSUFFICIENT_STANDBY (was: insufficient standby MDS daemons
available)
2019-01-22 10:10:02.605089 mon.cephflax-mon-9b406e0261 mon.0
137.138.121.135:6789/0 2804 : cluster [INF] Cluster is now healthy





> > diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
> > index e8c1bc8bc1..e2539390fb 100644
> > --- a/src/mds/CInode.cc
> > +++ b/src/mds/CInode.cc
> > @@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)
> >
> > if (pf->fragstat.nfiles < 0 ||
> > pf->fragstat.nsubdirs < 0) {
> > - clog->error() << "bad/negative dir size on "
> > + clog->warn() << "bad/negative dir size on "
> >   << dir->dirfrag() << " " << pf->fragstat;
> >   assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);
> >
> > @@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
> >   if (state_test(CInode::STATE_REPAIRSTATS)) {
> > dout(20) << " dirstat mismatch, fixing" << dendl;
> >   } else {
> > -   clog->error() << "unmatched fragstat on " << ino() << ", inode 
> > has "
> > +   clog->warn() << "unmatched fragstat on " << ino() << ", inode 
> > has "
> >   << pi->dirstat << ", dirfrags have " << dirstat;
> > assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
> >   }
> >
> >
> > Cheers, Dan
> >
> >
> > On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:
> >>
> >> no action is required. mds fixes this type of error atomically.
> >> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
> >>  wrote:
> >> >
> >> > Hi,
> >> >
> >> >
> >> > upon failover or restart, or MDS complains that something is wrong with
> >> > one of the stray directories:
> >> >
> >> >
> >> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
> >> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
> >> > 12:51:12.016360 -4=-5+1)
> >> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
> >> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
> >> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 12:51:12.016360
> >> > 1=0+1)
> >> >
> >> >
> >> > How do we handle this problem?
> >> >
> >> >
> >> > Regards,
> >> >
> >> > Burkhard
> >> >
> >> >
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Yan, Zheng
On Tue, Jan 22, 2019 at 9:08 PM Dan van der Ster  wrote:
>
> Hi Zheng,
>
> We also just saw this today and got a bit worried.
> Should we change to:
>

What is the error message (on stray dir or other dir)? does the
cluster ever enable multi-acitive mds?

> diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
> index e8c1bc8bc1..e2539390fb 100644
> --- a/src/mds/CInode.cc
> +++ b/src/mds/CInode.cc
> @@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)
>
> if (pf->fragstat.nfiles < 0 ||
> pf->fragstat.nsubdirs < 0) {
> - clog->error() << "bad/negative dir size on "
> + clog->warn() << "bad/negative dir size on "
>   << dir->dirfrag() << " " << pf->fragstat;
>   assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);
>
> @@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
>   if (state_test(CInode::STATE_REPAIRSTATS)) {
> dout(20) << " dirstat mismatch, fixing" << dendl;
>   } else {
> -   clog->error() << "unmatched fragstat on " << ino() << ", inode 
> has "
> +   clog->warn() << "unmatched fragstat on " << ino() << ", inode has 
> "
>   << pi->dirstat << ", dirfrags have " << dirstat;
> assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
>   }
>
>
> Cheers, Dan
>
>
> On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:
>>
>> no action is required. mds fixes this type of error atomically.
>> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
>>  wrote:
>> >
>> > Hi,
>> >
>> >
>> > upon failover or restart, or MDS complains that something is wrong with
>> > one of the stray directories:
>> >
>> >
>> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
>> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
>> > 12:51:12.016360 -4=-5+1)
>> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
>> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
>> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 12:51:12.016360
>> > 1=0+1)
>> >
>> >
>> > How do we handle this problem?
>> >
>> >
>> > Regards,
>> >
>> > Burkhard
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] monitor cephfs mount io's

2019-01-22 Thread Mohamad Gebai
Hi Marc,

My point was that there was no way to do that for a kernel mount except
from the client that consumes the mounted filesystem.
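
For the ceph-fuse case mentioned below, a sketch of pulling the client's perf
counters through its admin socket (the socket path depends on your setup):

ceph --admin-daemon /var/run/ceph/ceph-client.<id>.asok perf dump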

Mohamad

On 1/21/19 4:29 AM, Marc Roos wrote:
>
> Hi Mohamad, How do you do that client side, I am having currently two 
> kernel mounts? 
>
>
>
>
>
> -Original Message-
> From: Mohamad Gebai [mailto:mge...@suse.de] 
> Sent: 17 January 2019 15:57
> To: Marc Roos; ceph-users
> Subject: Re: [ceph-users] monitor cephfs mount io's
>
> You can do that either straight from your client, or by querying the 
> perf dump if you're using ceph-fuse.
>
> Mohamad
>
> On 1/17/19 6:19 AM, Marc Roos wrote:
>> How / where can I monitor the ios on cephfs mount / client?
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread ceph
Hi,

Ceph's pool are meant to let you define specific engineering rules
and/or application (rbd, cephfs, rgw)
They are not designed to be created in a massive fashion (see pgs etc)
So, create a pool for each engineering ruleset, and store your data in them
For what is left of your project, I believe you have to implement that
on top of Ceph

For instance, let say you simply create a pool, with a rbd volume in it
You then create a filesystem on that, and map it on some server
Finally, you can push your files on that mountpoint, using various
Linux's user, acl or whatever : beyond that point, there is nothing more
specific to Ceph, it is "just" a mounted filesystem

Regards,

On 01/22/2019 02:16 PM, cmonty14 wrote:
> Hi,
> 
> my use case for Ceph is providing a central backup storage.
> This means I will backup multiple databases in Ceph storage cluster.
> 
> This is my question:
> What is the best practice for creating pools & images?
> Should I create multiple pools, means one pool per database?
> Or should I create a single pool "backup" and use namespace when writing
> data in the pool?
> 
> This is the security demand that should be considered:
> DB-owner A can only modify the files that belong to A; other files
> (owned by B, C or D) are accessible for A.
> 
> And there's another issue:
> How can I identify a backup created by client A that I want to restore
> on another client Z?
> I mean typically client A would write a backup file identified by the
> filename.
> Would it be possible on client Z to identify this backup file by
> filename? If yes, how?
> 
> 
> THX
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


[ceph-users] backfill_toofull while OSDs are not full

2019-01-22 Thread Wido den Hollander
Hi,

I've got a couple of PGs which are stuck in backfill_toofull, but none
of them are actually full.

  "up": [
999,
1900,
145
  ],
  "acting": [
701,
1146,
1880
  ],
  "backfill_targets": [
"145",
"999",
"1900"
  ],
  "acting_recovery_backfill": [
"145",
"701",
"999",
"1146",
"1880",
"1900"
  ],

I checked all these OSDs, but they are all <75% utilization.

full_ratio 0.95
backfillfull_ratio 0.9
nearfull_ratio 0.9

So I started checking all the PGs and I've noticed that each of these
PGs has one OSD in the 'acting_recovery_backfill' which is marked as out.

In this case osd.1880 is marked as out and thus its capacity is shown
as zero.

[ceph@ceph-mgr ~]$ ceph osd df|grep 1880
1880   hdd 4.545990 0 B  0 B  0 B 00  27
[ceph@ceph-mgr ~]$

This is on a Mimic 13.2.4 cluster. Is this expected, or is it an unknown
side-effect of one of the OSDs being marked as out?
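
For anyone hitting the same thing, a sketch of how to spot the out OSD behind
a stuck PG (the pg id is a placeholder):

ceph pg <pgid> query | grep -A 8 acting_recovery_backfill
ceph osd tree out     # OSDs currently marked out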

Thanks,

Wido


Re: [ceph-users] RBD client hangs

2019-01-22 Thread Jason Dillaman
Your "mon" cap should be "profile rbd" instead of "allow r" [1].

[1] 
http://docs.ceph.com/docs/master/rbd/rados-rbd-cmds/#create-a-block-device-user
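
For the user shown further down, a sketch of the corresponding cap update
(pool names copied from that output; note that "ceph auth caps" replaces all
caps, so the unchanged ones are repeated):

ceph auth caps client.acapp1 \
    mon 'profile rbd' \
    osd 'profile rbd pool=2copy, profile rbd pool=4copy' \
    mds 'allow r' mgr 'allow r'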

On Mon, Jan 21, 2019 at 9:05 PM ST Wong (ITSC)  wrote:
>
> Hi,
>
> > Is this an upgraded or a fresh cluster?
> It's a fresh cluster.
>
> > Does client.acapp1 have the permission to blacklist other clients?  You can 
> > check with "ceph auth get client.acapp1".
>
> No,  it's our first Ceph cluster with basic setup for testing, without any 
> blacklist implemented.
>
> --- cut here ---
> # ceph auth get client.acapp1
> exported keyring for client.acapp1
> [client.acapp1]
> key = 
> caps mds = "allow r"
> caps mgr = "allow r"
> caps mon = "allow r"
> caps osd = "allow rwx pool=2copy, allow rwx pool=4copy"
> --- cut here ---
>
> Thanks a lot.
> /st
>
>
>
> -Original Message-
> From: Ilya Dryomov 
> Sent: Monday, January 21, 2019 7:33 PM
> To: ST Wong (ITSC) 
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] RBD client hangs
>
> On Mon, Jan 21, 2019 at 11:43 AM ST Wong (ITSC)  wrote:
> >
> > Hi, we’re trying mimic on an VM farm.  It consists 4 OSD hosts (8 OSDs) and 
> > 3 MON. We tried mounting as RBD and CephFS (fuse and kernel mount) on 
> > different clients without problem.
>
> Is this an upgraded or a fresh cluster?
>
> >
> > Then one day we perform failover test and stopped one of the OSD.  Not sure 
> > if it’s related but after that testing, the RBD client freeze when trying 
> > to mount the rbd device.
> >
> >
> >
> > Steps to reproduce:
> >
> >
> >
> > # modprobe rbd
> >
> >
> >
> > (dmesg)
> >
> > [  309.997587] Key type dns_resolver registered
> >
> > [  310.043647] Key type ceph registered
> >
> > [  310.044325] libceph: loaded (mon/osd proto 15/24)
> >
> > [  310.054548] rbd: loaded
> >
> >
> >
> > # rbd -n client.acapp1 map 4copy/foo
> >
> > /dev/rbd0
> >
> >
> >
> > # rbd showmapped
> >
> > id pool  image snap device
> >
> > 0  4copy foo   -/dev/rbd0
> >
> >
> >
> >
> >
> > Then hangs if I tried to mount or reboot the server after rbd map.   There 
> > are lot of error in dmesg, e.g.
> >
> >
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: blacklist of client74700
> > failed: -13
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: failed to acquire lock: -13
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: no lock owners detected
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: client74700 seems dead,
> > breaking lock
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: blacklist of client74700
> > failed: -13
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: failed to acquire lock: -13
> >
> > Jan 20 03:43:32 acapp1 kernel: rbd: rbd0: no lock owners detected
>
> Does client.acapp1 have the permission to blacklist other clients?  You can 
> check with "ceph auth get client.acapp1".  If not, follow step 6 of 
> http://docs.ceph.com/docs/master/releases/luminous/#upgrade-from-jewel-or-kraken.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason


[ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread cmonty14
Hi,

my use case for Ceph is providing a central backup storage.
This means I will backup multiple databases in Ceph storage cluster.

This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?

This is the security demand that should be considered:
DB-owner A can only modify the files that belong to A; other files
(owned by B, C or D) are accessible for A.

And there's another issue:
How can I identify a backup created by client A that I want to restore
on another client Z?
I mean typically client A would write a backup file identified by the
filename.
Would it be possible on client Z to identify this backup file by
filename? If yes, how?


THX


Re: [ceph-users] Broken CephFS stray entries?

2019-01-22 Thread Dan van der Ster
Hi Zheng,

We also just saw this today and got a bit worried.
Should we change to:

diff --git a/src/mds/CInode.cc b/src/mds/CInode.cc
index e8c1bc8bc1..e2539390fb 100644
--- a/src/mds/CInode.cc
+++ b/src/mds/CInode.cc
@@ -2040,7 +2040,7 @@ void CInode::finish_scatter_gather_update(int type)

if (pf->fragstat.nfiles < 0 ||
pf->fragstat.nsubdirs < 0) {
- clog->error() << "bad/negative dir size on "
+ clog->warn() << "bad/negative dir size on "
  << dir->dirfrag() << " " << pf->fragstat;
  assert(!"bad/negative fragstat" == g_conf->mds_verify_scatter);

@@ -2077,7 +2077,7 @@ void CInode::finish_scatter_gather_update(int type)
  if (state_test(CInode::STATE_REPAIRSTATS)) {
dout(20) << " dirstat mismatch, fixing" << dendl;
  } else {
-   clog->error() << "unmatched fragstat on " << ino() << ", inode
has "
+   clog->warn() << "unmatched fragstat on " << ino() << ", inode
has "
  << pi->dirstat << ", dirfrags have " << dirstat;
assert(!"unmatched fragstat" == g_conf->mds_verify_scatter);
  }


Cheers, Dan


On Sat, Oct 20, 2018 at 2:33 AM Yan, Zheng  wrote:

> no action is required. mds fixes this type of error atomically.
> On Fri, Oct 19, 2018 at 6:59 PM Burkhard Linke
>  wrote:
> >
> > Hi,
> >
> >
> > upon failover or restart, or MDS complains that something is wrong with
> > one of the stray directories:
> >
> >
> > 2018-10-19 12:56:06.442151 7fc908e2d700 -1 log_channel(cluster) log
> > [ERR] : bad/negative dir size on 0x607 f(v133 m2018-10-19
> > 12:51:12.016360 -4=-5+1)
> > 2018-10-19 12:56:06.442182 7fc908e2d700 -1 log_channel(cluster) log
> > [ERR] : unmatched fragstat on 0x607, inode has f(v134 m2018-10-19
> > 12:51:12.016360 -4=-5+1), dirfrags have f(v0 m2018-10-19 12:51:12.016360
> > 1=0+1)
> >
> >
> > How do we handle this problem?
> >
> >
> > Regards,
> >
> > Burkhard
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt

2019-01-22 Thread Alfredo Deza
On Tue, Jan 22, 2019 at 6:45 AM Manuel Lausch  wrote:
>
> Hi,
>
> we want upgrade our ceph clusters from jewel to luminous. And also want
> to migrate the osds to ceph-volume described in
> http://docs.ceph.com/docs/luminous/ceph-volume/simple/scan/#ceph-volume-simple-scan
>
> The clusters are running since dumpling and are setup with dmcrypt.
> Since dumpling there are until now three different types of dmcrypt
>
> plain dmcrypt with keys local
> luks with keys local
> luks with keys on the ceph monitors
>
> Now it seems only the last type can be migrated to ceph-volume.
>
> ceph-volume simple scan trys to mount a lockbox which does not exists
> on the older OSDs. Are those OSDs not supported with ceph-volume?

This is one case we didn't anticipate :/ We supported the wonky
lockbox setup and thought we wouldn't need to go further back,
although we did add support for both
plain and luks keys.

Looking through the code, it is very tightly coupled to
storing/retrieving keys from the monitors, and I don't know what
workarounds might be possible here other than throwing away the OSD
and deploying a new one (I take it this is not an option for you at all).
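
If redeploying is the route taken, a rough per-OSD sketch with ceph-volume
(the device is a placeholder, and the OSD should be drained/purged first):

ceph-volume lvm zap /dev/sdv --destroy
ceph-volume lvm create --bluestore --dmcrypt --data /dev/sdv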


>
> This are the errors:
>
> # ceph-volume simple scan /var/lib/ceph/osd/ceph-183
>  stderr: lsblk: /var/lib/ceph/osd/ceph-183: not a block device
>  stderr: lsblk: /var/lib/ceph/osd/ceph-183: not a block device
> Running command: /usr/sbin/cryptsetup status 
> /dev/mapper/21ad7722-002f-464c-b460-a8976a7b4872
> Running command: /usr/sbin/cryptsetup status 
> 21ad7722-002f-464c-b460-a8976a7b4872
> Running command: mount -v  /tmp/tmp3t1WRC
>  stderr: mount:  is write-protected, mounting read-only
>  stderr: mount: unknown filesystem type '(null)'
> -->  RuntimeError: command returned non-zero exit status: 32
>
>
> and this is in the ceph-volume.log
>
> [2019-01-22 12:39:31,456][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/blkid -p /dev/mapper/9b68b7e9-854e-498a-8381-4eef128a9d7a
> [2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] detecting 
> if argument is a device or a directory: /var/lib/ceph/osd/ceph-183
> [2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] will scan 
> directly, path is a directory
> [2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] will scan 
> encrypted OSD directory at path: /var/lib/ceph/osd/ceph-183
> [2019-01-22 12:39:31,534][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/blkid -s PARTUUID -o value /dev/sdv1
> [2019-01-22 12:39:31,539][ceph_volume.process][INFO  ] stdout 
> 21ad7722-002f-464c-b460-a8976a7b4872
> [2019-01-22 12:39:31,540][ceph_volume.process][INFO  ] Running command: 
> /usr/sbin/cryptsetup status 21ad7722-002f-464c-b460-a8976a7b4872
> [2019-01-22 12:39:31,546][ceph_volume.process][INFO  ] stdout 
> /dev/mapper/21ad7722-002f-464c-b460-a8976a7b4872 is active and is in use.
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout type:PLAIN
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout cipher:  
> aes-cbc-essiv:sha256
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout keysize: 256 
> bits
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout key location: 
> dm-crypt
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout device:  
> /dev/sdv1
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout sector size:  
> 512
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout offset:  0 
> sectors
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout size:
> 7805646479 sectors
> [2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout mode:
> read/write
> [2019-01-22 12:39:31,548][ceph_volume.process][INFO  ] Running command: mount 
> -v  /tmp/tmp3t1WRC
> [2019-01-22 12:39:31,597][ceph_volume.process][INFO  ] stderr mount:  is 
> write-protected, mounting read-only
> [2019-01-22 12:39:31,622][ceph_volume.process][INFO  ] stderr mount: unknown 
> filesystem type '(null)'
> [2019-01-22 12:39:31,622][ceph_volume][ERROR ] exception caught by decorator
> Traceback (most recent call last):
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
> in newfunc
> return f(*a, **kw)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in 
> main
> terminal.dispatch(self.mapper, subcommand_args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", 
> line 33, in main
> terminal.dispatch(self.mapper, self.argv)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, 
> in dispatch
> instance.main()
>   File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/scan.py", 
> line 353, in main
> self.scan(args)
>   File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
> in is_root
> return func(*a, 

[ceph-users] cephfs performance degraded very fast

2019-01-22 Thread renjianxinlover
Hi,
   At times, under cache pressure or after a caps release failure, client apps'
mounts get stuck.
   My use case is a Kubernetes cluster with automatic kernel client mounts on
the nodes.
   Has anyone faced the same issue or found a solution?
Brs


Re: [ceph-users] The OSD can be “down” but still “in”.

2019-01-22 Thread Matthew Vernon
Hi,

On 22/01/2019 10:02, M Ranga Swami Reddy wrote:
> Hello - If an OSD shown as down and but its still "in" state..what
> will happen with write/read operations on this down OSD?

It depends ;-)

In a typical 3-way replicated setup with min_size 2, writes to placement
groups on that OSD will still go ahead - when 2 replicas are written OK,
then the write will complete. Once the OSD comes back up, these writes
will then be replicated to that OSD. If it stays down for long enough to
be marked out, then pgs on that OSD will be replicated elsewhere.

If you had min_size 3 as well, then writes would block until the OSD was
back up (or marked out and the pgs replicated to another OSD).
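
A quick sketch of how to check the relevant knobs (pool and mon names are
examples):

ceph osd pool get rbd size        # replica count
ceph osd pool get rbd min_size    # writes block below this many replicas
ceph daemon mon.<id> config get mon_osd_down_out_interval   # delay before down becomes out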

Regards,

Matthew


-- 
 The Wellcome Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE. 


[ceph-users] migrate ceph-disk to ceph-volume fails with dmcrypt

2019-01-22 Thread Manuel Lausch
Hi,

we want to upgrade our ceph clusters from jewel to luminous, and also want
to migrate the OSDs to ceph-volume as described in
http://docs.ceph.com/docs/luminous/ceph-volume/simple/scan/#ceph-volume-simple-scan

The clusters have been running since dumpling and are set up with dmcrypt.
Since dumpling we have used three different types of dmcrypt:

plain dmcrypt with keys local 
luks with keys local
luks with keys on the ceph monitors

Now it seems only the last type can be migrated to ceph-volume.

ceph-volume simple scan tries to mount a lockbox which does not exist
on the older OSDs. Are those OSDs not supported with ceph-volume?

This are the errors:

# ceph-volume simple scan /var/lib/ceph/osd/ceph-183 
 stderr: lsblk: /var/lib/ceph/osd/ceph-183: not a block device
 stderr: lsblk: /var/lib/ceph/osd/ceph-183: not a block device
Running command: /usr/sbin/cryptsetup status 
/dev/mapper/21ad7722-002f-464c-b460-a8976a7b4872
Running command: /usr/sbin/cryptsetup status 
21ad7722-002f-464c-b460-a8976a7b4872
Running command: mount -v  /tmp/tmp3t1WRC
 stderr: mount:  is write-protected, mounting read-only
 stderr: mount: unknown filesystem type '(null)'
-->  RuntimeError: command returned non-zero exit status: 32


and this is in the ceph-volume.log

[2019-01-22 12:39:31,456][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/blkid -p /dev/mapper/9b68b7e9-854e-498a-8381-4eef128a9d7a
[2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] detecting if 
argument is a device or a directory: /var/lib/ceph/osd/ceph-183
[2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] will scan 
directly, path is a directory
[2019-01-22 12:39:31,533][ceph_volume.devices.simple.scan][INFO  ] will scan 
encrypted OSD directory at path: /var/lib/ceph/osd/ceph-183
[2019-01-22 12:39:31,534][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/blkid -s PARTUUID -o value /dev/sdv1
[2019-01-22 12:39:31,539][ceph_volume.process][INFO  ] stdout 
21ad7722-002f-464c-b460-a8976a7b4872
[2019-01-22 12:39:31,540][ceph_volume.process][INFO  ] Running command: 
/usr/sbin/cryptsetup status 21ad7722-002f-464c-b460-a8976a7b4872
[2019-01-22 12:39:31,546][ceph_volume.process][INFO  ] stdout 
/dev/mapper/21ad7722-002f-464c-b460-a8976a7b4872 is active and is in use.
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout type:PLAIN
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout cipher:  
aes-cbc-essiv:sha256
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout keysize: 256 bits
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout key location: 
dm-crypt
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout device:  /dev/sdv1
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout sector size:  512
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout offset:  0 sectors
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout size:
7805646479 sectors
[2019-01-22 12:39:31,547][ceph_volume.process][INFO  ] stdout mode:
read/write
[2019-01-22 12:39:31,548][ceph_volume.process][INFO  ] Running command: mount 
-v  /tmp/tmp3t1WRC
[2019-01-22 12:39:31,597][ceph_volume.process][INFO  ] stderr mount:  is 
write-protected, mounting read-only
[2019-01-22 12:39:31,622][ceph_volume.process][INFO  ] stderr mount: unknown 
filesystem type '(null)'
[2019-01-22 12:39:31,622][ceph_volume][ERROR ] exception caught by decorator
Traceback (most recent call last):
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 59, 
in newfunc
return f(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/main.py", line 148, in main
terminal.dispatch(self.mapper, subcommand_args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/main.py", 
line 33, in main
terminal.dispatch(self.mapper, self.argv)
  File "/usr/lib/python2.7/site-packages/ceph_volume/terminal.py", line 182, in 
dispatch
instance.main()
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/scan.py", 
line 353, in main
self.scan(args)
  File "/usr/lib/python2.7/site-packages/ceph_volume/decorators.py", line 16, 
in is_root
return func(*a, **kw)
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/scan.py", 
line 244, in scan
osd_metadata = self.scan_encrypted(osd_path)
  File "/usr/lib/python2.7/site-packages/ceph_volume/devices/simple/scan.py", 
line 169, in scan_encrypted
with system.tmp_mount(lockbox) as lockbox_path:
  File "/usr/lib/python2.7/site-packages/ceph_volume/util/system.py", line 145, 
in __enter__
self.path
  File "/usr/lib/python2.7/site-packages/ceph_volume/process.py", line 153, in 
run
raise RuntimeError(msg)
RuntimeError: command returned non-zero exit status: 32



ceph version 12.2.10 (177915764b752804194937482a39e95e0ca3de94)
luminous (stable)
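
For reference, a rough way to tell the dmcrypt flavours apart on a given
OSD data partition (the device name is just the one from the log above;
LUKS partitions carry a crypto_LUKS signature, plain dmcrypt ones have
no signature at all):

# prints "LUKS" for luks-based OSDs, the fallback text otherwise
cryptsetup isLuks /dev/sdv1 && echo "LUKS" || echo "plain dmcrypt (or unencrypted)"
# alternatively: prints "crypto_LUKS" for LUKS, nothing for plain dmcrypt
blkid -s TYPE -o value /dev/sdv1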


Regards
Manuel




-- 

Re: [ceph-users] Using Ceph central backup storage - Best practice creating pools

2019-01-22 Thread Eugen Block

Hi Thomas,


What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?


I don't think one pool per DB is reasonable. If the number of DBs
increases you'll have to create more pools and change the respective
auth settings. One pool for your DB backups would suffice, and
restricting user access is possible at the rbd image level. You can
grant read/write access to one client and read-only access to other
clients; you have to create different clients for that, see [1] for
more details.
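
As a rough sketch of the single shared pool with one read/write and one
read-only client (pool name, PG count and client names are made up for
illustration, not a recommendation):

ceph osd pool create backup 128 128
ceph osd pool application enable backup rbd
# one client that may write backups ...
ceph auth get-or-create client.backup-writer \
    mon 'profile rbd' osd 'profile rbd pool=backup'
# ... and one that may only read them, e.g. for restores on another host
ceph auth get-or-create client.backup-reader \
    mon 'profile rbd' osd 'profile rbd-read-only pool=backup'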


Regards,
Eugen

[1]  
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2018-February/024424.html



Quoting Thomas <74cmo...@gmail.com>:


Hi,
 
my use case for Ceph is serving a central backup storage.
This means I will backup multiple databases in Ceph storage cluster.
 
This is my question:
What is the best practice for creating pools & images?
Should I create multiple pools, means one pool per database?
Or should I create a single pool "backup" and use namespace when writing
data in the pool?
 
This is the security demand that should be considered:
DB-owner A can only modify the files that belong to A; other files
(owned by B, C or D) are accessible for A.

And there's another issue:
How can I identify a backup created by client A that I want to restore
on another client Z?
I mean typically client A would write a backup file identified by the
filename.
Would it be possible on client Z to identify this backup file by
filename? If yes, how?
 
 
THX




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] The OSD can be “down” but still “in”.

2019-01-22 Thread M Ranga Swami Reddy
Hello - If an OSD is shown as down but is still in the "in" state, what
will happen with write/read operations on this down OSD?

Thanks
Swami
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS performance issue

2019-01-22 Thread Yan, Zheng
On Tue, Jan 22, 2019 at 10:49 AM Albert Yue  wrote:
>
> Hi Yan Zheng,
>
> In your opinion, can we resolve this issue by move MDS to a 512GB or 1TB 
> memory machine?
>

The problem is on the client side, especially clients with large memory.
I don't think enlarging the mds cache size is a good idea. You can
periodically check each kernel client's /sys/kernel/debug/ceph/xxx/caps
and run 'echo 2 > /proc/sys/vm/drop_caches' on a client that holds too
many caps (for example 10k).
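
As a rough illustration of that workaround (to be run on each client
host; the exact layout of the debugfs caps file varies with the kernel
version, and the 10k threshold is just the example above):

# inspect how many caps each local kernel mount currently holds
for d in /sys/kernel/debug/ceph/*; do
    echo "== $d =="
    cat "$d"/caps
done
# if a mount holds too many caps (e.g. > 10000), ask the kernel to drop
# clean dentries/inodes so the MDS can trim the matching cache entries
echo 2 > /proc/sys/vm/drop_caches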

> On Mon, Jan 21, 2019 at 10:49 PM Yan, Zheng  wrote:
>>
>> On Mon, Jan 21, 2019 at 11:16 AM Albert Yue  
>> wrote:
>> >
>> > Dear Ceph Users,
>> >
>> > We have set up a cephFS cluster with 6 osd machines, each with 16 8TB 
>> > harddisk. Ceph version is luminous 12.2.5. We created one data pool with 
>> > these hard disks and created another meta data pool with 3 ssd. We created 
>> > a MDS with 65GB cache size.
>> >
>> > But our users are keep complaining that cephFS is too slow. What we 
>> > observed is cephFS is fast when we switch to a new MDS instance, once the 
>> > cache fills up (which will happen very fast), client became very slow when 
>> > performing some basic filesystem operation such as `ls`.
>> >
>>
>> It seems that clients hold lots of unused inodes their icache, which
>> prevent mds from trimming corresponding objects from its cache.  mimic
>> has command "ceph daemon mds.x cache drop" to ask client to drop its
>> cache. I'm also working on a patch that make kclient client release
>> unused inodes.
>>
>> For luminous,  there is not much we can do, except periodically run
>> "echo 2 > /proc/sys/vm/drop_caches"  on each client.
>>
>>
>> > What we know is our user are putting lots of small files into the cephFS, 
>> > now there are around 560 Million files. We didn't see high CPU wait on MDS 
>> > instance and meta data pool just used around 200MB space.
>> >
>> > My question is, what is the relationship between the metadata pool and 
>> > MDS? Is this performance issue caused by the hardware behind meta data 
>> > pool? Why the meta data pool only used 200MB space, and we saw 3k iops on 
>> > each of these three ssds, why can't MDS cache all these 200MB into memory?
>> >
>> > Thanks very much!
>> >
>> >
>> > Best Regards,
>> >
>> > Albert
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] quick questions about a 5-node homelab setup

2019-01-22 Thread Janne Johansson
On Tue, 22 Jan 2019 at 00:50, Brian Topping  wrote:
> > I've scrounged up 5 old Atom Supermicro nodes and would like to run them 
> > 365/7 for limited production as RBD with Bluestore (ideally latest 13.2.4 
> > Mimic), triple copy redundancy. Underlying OS is a Debian 9 64 bit, minimal 
> > install.
>
> The other thing to consider about a lab is “what do you want to learn?” If 
> reliability isn’t an issue (ie you aren’t putting your family pictures on 
> it), regardless of the cluster technology, you can often learn basics more 
> quickly without the overhead of maintaining quorums and all that stuff on day 
> one. So at risk of being a heretic, start small, for instance with single 
> mon/manager and add more later.

Well, if you start small with one OSD, you are going to run into "the
defaults will work against you": as you make your first pool, it will
want to place 3 copies on separate hosts, so not only are you trying to
get accustomed to ceph terms and technologies, you are also working
against the whole cluster idea by not building a cluster at all. You
will encounter problems regular ceph admins never see, so the chances
of getting help are smaller. Things like "the OSD pre-allocates so much
data that a 10G OSD crashes at start" or "my pool won't go active since
my pgs are in a bad state because I have only one OSD or only one host
and I didn't change the crush rules" are things only people starting
small will ever experience. Anyone with 3 or more real hosts with real
drives attached will simply never see them.
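
For completeness, if someone really does want a one-box test setup, the
crush-rule part of that fight looks roughly like this (pool name, PG
count and replica counts are made up; don't do this with data you care
about):

# a rule that spreads replicas across OSDs instead of hosts
ceph osd crush rule create-replicated replicated-osd default osd
ceph osd pool create testpool 32 32 replicated replicated-osd
# and/or relax the replica count so PGs can go active on few OSDs
ceph osd pool set testpool size 2
ceph osd pool set testpool min_size 1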

Telling people to learn clusters by building a non-cluster might be
counter-productive. When you have a working ceph cluster you can
practice getting it to run on an rpi with a usb stick for a drive, but
starting at that will make you fight two or more unknowns at the same
time: ceph being new to you, and un-clustering a cluster software
suite (and possibly running on non-x86_64 for a third unknown).

-- 
May the most significant bit of your life be positive.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] predict impact of crush tunables change

2019-01-22 Thread Wolfgang Lendl

dear all,

I have a luminous cluster with tunables profile "hammer" - now all my
hammer clients are gone and I could raise the tunables level to "jewel".
Is there any good way to predict the data movement caused by such a
config change?
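
For reference, one rough way to estimate this offline is to compare PG
mappings before and after with osdmaptool (a sketch; the file names are
placeholders):

ceph osd getmap -o osdmap.bin
osdmaptool osdmap.bin --test-map-pgs-dump > mappings.before
osdmaptool osdmap.bin --export-crush crush.bin
crushtool -d crush.bin -o crush.txt
# edit the "tunable ..." lines at the top of crush.txt to the jewel
# values, then recompile and re-import into the saved osdmap
crushtool -c crush.txt -o crush.jewel
osdmaptool osdmap.bin --import-crush crush.jewel
osdmaptool osdmap.bin --test-map-pgs-dump > mappings.after
# a rough count of PG mappings that would change
diff mappings.before mappings.after | grep -c '^>'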


br
wolfgang




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Does "mark_unfound_lost delete" only delete missing/unfound objects of a PG

2019-01-22 Thread Mathijs van Veluw
Hello.
I have a question about `ceph pg {pg.num} mark_unfound_lost delete`.
Will this only delete objects which are unfound, or the whole PG which
you put in as an argument?
Objects (oids) which I can see with `ceph pg {pg.num} list_missing`?
So in the case below, would it remove the object
"rbd_data.e53c3c27c0089c.01c7"?

If so that would be great, since e53c3c27c0089c is not linked to any
volume any more.


ceph pg 65.2d0 list_missing
{
    "offset": {
        "oid": "",
        "key": "",
        "snapid": 0,
        "hash": 0,
        "max": 0,
        "pool": -9223372036854775808,
        "namespace": ""
    },
    "num_missing": 2,
    "num_unfound": 2,
    "objects": [
        {
            "oid": {
                "oid": "rbd_data.e53c3c27c0089c.01c7",
                "key": "",
                "snapid": -2,
                "hash": 3857244880,
                "max": 0,
                "pool": 65,
                "namespace": ""
            },
            "need": "613065'497508155",
            "have": "0'0",
            "locations": []
        },
        {
            "oid": {
                "oid": "rbd_data.e53c3c27c0089c.0059",
                "key": "",
                "snapid": -2,
                "hash": 2362866384,
                "max": 0,
                "pool": 65,
                "namespace": ""
            },
            "need": "612939'497508050",
            "have": "0'0",
            "locations": []
        }
    ],
    "more": 0
}

Thanks in advance.
Mathijs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] krbd reboot hung

2019-01-22 Thread Gao, Wenjun
I'm using krbd to map an rbd device to a VM. It appears that when the device is
mounted, rebooting the OS hangs for more than 7 minutes; on bare metal it can be
more than 15 minutes. Even with the latest kernel (5.0.0) the problem still occurs.
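
Roughly how the device in question is set up (pool/image names are
placeholders; the mount point is the one visible in the console log
below):

rbd map rbd/testimg            # krbd maps the image to /dev/rbd0
mkfs.xfs /dev/rbd0
mount /dev/rbd0 /root/test
reboot                         # the shutdown then stalls in the unmount
                               # step, retrying the monitors as shown below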
Here are the console logs with the 4.15.18 kernel and a mimic rbd client; the
reboot seems to be stuck in the rbd unmount operation:
[  OK  ] Stopped Update UTMP about System Boot/Shutdown.
[  OK  ] Stopped Create Volatile Files and Directories.
[  OK  ] Stopped target Local File Systems.
 Unmounting /run/user/110281572...
 Unmounting /var/tmp...
 Unmounting /root/test...
 Unmounting /run/user/78402...
 Unmounting Configuration File System...
[  OK  ] Stopped Configure read-only root support.
[  OK  ] Unmounted /var/tmp.
[  OK  ] Unmounted /run/user/78402.
[  OK  ] Unmounted /run/user/110281572.
[  OK  ] Stopped target Swap.
[  OK  ] Unmounted Configuration File System.
[  189.919062] libceph: mon4 XX.XX.XX.XX:6789 session lost, hunting for new mon
[  189.950085] libceph: connect XX.XX.XX.XX:6789 error -101
[  189.950764] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  190.687090] libceph: connect XX.XX.XX.XX:6789 error -101
[  190.694197] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  191.711080] libceph: connect XX.XX.XX.XX:6789 error -101
[  191.745254] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  193.695065] libceph: connect XX.XX.XX.XX:6789 error -101
[  193.727694] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  197.087076] libceph: connect XX.XX.XX.XX:6789 error -101
[  197.121077] libceph: mon4 XX.XX.XX.XX:6789 connect error
[  197.663082] libceph: connect XX.XX.XX.XX:6789 error -101
[  197.680671] libceph: mon4 XX.XX.XX.XX:6789 connect error
[  198.687122] libceph: connect XX.XX.XX.XX:6789 error -101
[  198.719253] libceph: mon4 XX.XX.XX.XX:6789 connect error
[  200.671136] libceph: connect XX.XX.XX.XX:6789 error -101
[  200.702717] libceph: mon4 XX.XX.XX.XX:6789 connect error
[  204.703115] libceph: connect XX.XX.XX.XX:6789 error -101
[  204.736586] libceph: mon4 XX.XX.XX.XX:6789 connect error
[  209.887141] libceph: connect XX.XX.XX.XX:6789 error -101
[  209.918721] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  210.719078] libceph: connect XX.XX.XX.XX:6789 error -101
[  210.750378] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  211.679118] libceph: connect XX.XX.XX.XX:6789 error -101
[  211.712246] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  213.663116] libceph: connect XX.XX.XX.XX:6789 error -101
[  213.696943] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  217.695062] libceph: connect XX.XX.XX.XX:6789 error -101
[  217.728511] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  225.759109] libceph: connect XX.XX.XX.XX:6789 error -101
[  225.775869] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  233.951062] libceph: connect XX.XX.XX.XX:6789 error -101
[  233.951997] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  234.719114] libceph: connect XX.XX.XX.XX:6789 error -101
[  234.720083] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  235.679112] libceph: connect XX.XX.XX.XX:6789 error -101
[  235.680060] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  237.663088] libceph: connect XX.XX.XX.XX:6789 error -101
[  237.664121] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  241.695082] libceph: connect XX.XX.XX.XX:6789 error -101
[  241.696500] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  249.823095] libceph: connect XX.XX.XX.XX:6789 error -101
[  249.824101] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  264.671119] libceph: connect XX.XX.XX.XX:6789 error -101
[  264.672102] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  265.695109] libceph: connect XX.XX.XX.XX:6789 error -101
[  265.696106] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  266.719145] libceph: connect XX.XX.XX.XX:6789 error -101
[  266.720204] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  268.703121] libceph: connect XX.XX.XX.XX:6789 error -101
[  268.704110] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  272.671115] libceph: connect XX.XX.XX.XX:6789 error -101
[  272.672159] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  281.055087] libceph: connect XX.XX.XX.XX:6789 error -101
[  281.056577] libceph: mon0 XX.XX.XX.XX:6789 connect error
[  294.879098] libceph: connect XX.XX.XX.XX:6789 error -101
[  294.880230] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  295.711107] libceph: connect XX.XX.XX.XX:6789 error -101
[  295.712102] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  296.671090] libceph: connect XX.XX.XX.XX:6789 error -101
[  296.672082] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  298.719086] libceph: connect XX.XX.XX.XX:6789 error -101
[  298.720027] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  302.687077] libceph: connect XX.XX.XX.XX:6789 error -101
[  302.688103] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  310.751132] libceph: connect XX.XX.XX.XX:6789 error -101
[  310.763103] libceph: mon3 XX.XX.XX.XX:6789 connect error
[  325.087096] 

[ceph-users] RadosGW replication and failover issues

2019-01-22 Thread Rom Freiman
Hi,
We are running the following radosgw (luminous 12.2.8) replication
scenario.
1) We have 2 clusters, each running a radosgw; Cluster1 is defined as
master and Cluster2 as slave.
2) We create a number of buckets with objects via master and slave.
3) We shut down Cluster1.
4) We execute failover on Cluster2 (the full commands are shown after
this list):
     radosgw-admin zone modify --master --default
     radosgw-admin period update --commit
5) We create some new buckets and delete some existing buckets that
were created in Step 2.
6) We restart Cluster1 and execute:
     radosgw-admin realm pull
     radosgw-admin period pull
7) We saw that the resync finished successfully, and Cluster1 is now
defined as slave and Cluster2 as master.
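
For reference, steps 4 and 6 expand to roughly the following commands
(zone name, endpoint and keys are site-specific placeholders):

# on Cluster2: promote its zone and commit a new period
radosgw-admin zone modify --rgw-zone=<zone2> --master --default
radosgw-admin period update --commit
# on Cluster1, once it is back: pull realm and period from Cluster2
radosgw-admin realm pull --url=http://<cluster2-rgw> --access-key=<key> --secret=<secret>
radosgw-admin period pull --url=http://<cluster2-rgw> --access-key=<key> --secret=<secret>
# check replication state
radosgw-admin sync status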

The issue is that now we see in Cluster1 the buckets that were deleted in
Step 5 (while this cluster was down). We waited a while to see if maybe
there were some objects left that should be deleted by GC, but even after
a few hours those buckets are still visible in Cluster1 and not visible in
Cluster2.

We also tried an alternative step 6: restart Cluster1 and execute only
'radosgw-admin period pull'. But then we see that sync is stuck, both
clusters are defined as masters, and Cluster1's current period is the
one before the last period of Cluster2.

How can we fix this issue? Is there some config command that should be
called during failover?


Thanks,

Rom
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com