Re: [ceph-users] block db sizing and calculation

2020-01-14 Thread Xiaoxi Chen
One tricky thing is that each level of RocksDB lives either 100% on the SSD
or 100% on the HDD, so you either need to tweak the RocksDB configuration or
accept a huge waste: e.g. under the default RocksDB configuration, a 20GB DB
partition makes no practical difference compared to a 3GB one.
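
A quick way to check how much of the DB device an OSD actually uses, and
whether metadata has already spilled over to the slow (HDD) device, is the
bluefs section of the OSD perf counters. A minimal sketch, assuming a
BlueStore OSD with id 0 on the local host (counter names may differ slightly
between releases):

    # show DB device capacity/usage and any spillover onto the slow device
    ceph daemon osd.0 perf dump | grep -E '"(db_total_bytes|db_used_bytes|slow_used_bytes)"'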

Janne Johansson wrote on Tue, Jan 14, 2020 at 4:43 PM:

> (sorry for empty mail just before)
>
>
>>> I'm planning to split the block DB onto a separate flash device, which I
>>> also would like to use as an OSD for erasure-coding metadata for RBD
>>> devices.
>>>
>>> If i want to use 14x 14TB HDDs per Node
>>>
>>> https://docs.ceph.com/docs/master/rados/configuration/bluestore-config-ref/#sizing
>>>
>>> recommends a minimum size of 140GB per 14TB HDD.
>>>
>>> Is there any recommendation for how many OSDs a single flash device can
>>> serve? The Optane ones can do 2000 MB/s write + 500,000 IOPS.
>>>
>>
>>
> I think many ceph admins are more concerned with having many drives
> co-using the same DB drive, since if the DB drive fails, it also means all
> OSDs are lost at the same time.
> Optanes and decent NVMEs are probably capable of handling tons of HDDs, so
> that the bottleneck ends up being somewhere else, but the failure scenarios
> are a bit scary if the whole host is lost just by that one DB device acting
> up.
>
> --
> May the most significant bit of your life be positive.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Anybody using 4x (size=4) replication?

2019-07-25 Thread Xiaoxi Chen
The real impact of changing min_size to 1 is not about whether data may be
lost, but about how much data will be lost: in both cases you lose some
data, the question is just how much.

Let PG X map to (OSD A, B, C), with min_size = 2 and size = 3.
In your description:

T1: OSD A goes down for the upgrade. The PG is now degraded with (B, C);
note that the PG is still active, so new data is written only to B and C.

T2: B goes down due to a disk failure. C is the only OSD holding the
portion of data written between [T1, T2].
The failure probability of C in this situation is independent of whether we
continue writing to C.

If C fails at T3:
without changing min_size, you lose the data from [T1, T2] together with the
data that could not be written (unavailable) during [T2, T3];
with min_size = 1, you lose the data from [T1, T3].

But agreed, it is a tradeoff, depending on how confident you are that you
won't have two drive failures in a row within the 15-minute upgrade window...
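
For anyone reading along, a minimal sketch of the workaround being discussed,
assuming a replicated pool named "rbd" (the pool name is only an example):

    # temporarily allow I/O with a single surviving replica
    ceph osd pool set rbd min_size 1
    # ... finish the maintenance / wait for recovery ...
    # restore the safer setting once the PGs are active+clean again
    ceph osd pool set rbd min_size 2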

Wido den Hollander wrote on Thu, Jul 25, 2019 at 3:39 PM:

>
>
> On 7/25/19 9:19 AM, Xiaoxi Chen wrote:
> > We had hit this case in production, but my solution was to change
> > min_size = 1 immediately so that the PG goes back to active right away.
> >
> > It somewhat trades reliability (durability) for availability during
> > that 15-minute window, but if you are certain that one of the two "failures"
> > is due to a recoverable issue, it is worth doing.
> >
>
> That's actually dangerous imho.
>
> Because while you set min_size=1 you will be mutating data on that
> single disk/OSD.
>
> If the other two OSDs come back, recovery will start. Now, if that single
> disk/OSD dies while performing the recovery, you have lost data.
>
> The PG (or PGs) becomes inactive and you either need to perform data
> recovery on the failed disk or revert back to the last state.
>
> I can't take that risk in this situation.
>
> Wido
>
> > My 0.02
> >
> > Wido den Hollander <mailto:w...@42on.com> wrote on Thu, Jul 25, 2019 at 3:48 AM:
> >
> >
> >
> > On 7/24/19 9:35 PM, Mark Schouten wrote:
> > > I’d say the cure is worse than the issue you’re trying to fix, but
> > that’s my two cents.
> > >
> >
> > I'm not completely happy with it either. Yes, the price goes up and
> > latency increases as well.
> >
> > Right now I'm just trying to find a clever solution to this. It's a 2k
> > OSD cluster and the likelihood of a host or OSD crashing is reasonable
> > while you are performing maintenance on a different host.
> >
> > All kinds of things have crossed my mind where using size=4 is one
> > of them.
> >
> > Wido
> >
> > > Mark Schouten
> > >
> > >> On 24 Jul 2019 at 21:22, Wido den Hollander <mailto:w...@42on.com> wrote:
> > >>
> > >> Hi,
> > >>
> > >> Is anybody using 4x (size=4, min_size=2) replication with Ceph?
> > >>
> > >> The reason I'm asking is that a customer of mine asked me for a solution
> > >> to prevent a situation which occurred:
> > >>
> > >> A cluster running with size=3 and replication over different racks was
> > >> being upgraded from 13.2.5 to 13.2.6.
> > >>
> > >> During the upgrade, which involved patching the OS as well, they
> > >> rebooted one of the nodes. During that reboot, suddenly a node in a
> > >> different rack rebooted. It was unclear why this happened, but the node
> > >> was gone.
> > >>
> > >> While the upgraded node was rebooting and the other node crashed, about
> > >> 120 PGs were inactive due to min_size=2.
> > >>
> > >> Waiting for the nodes to come back and recovery to finish, it took about
> > >> 15 minutes before all VMs running inside OpenStack were back again.
> > >>
> > >> As you are upgrading or performing any maintenance with size=3 you can't
> > >> tolerate the failure of a node, as that will cause PGs to go inactive.
> > >>
> > >> This made me think about using size=4 and min_size=2 to prevent this
> > >> situation.
> > >>
> > >> This obviously has implications for write latency and cost, but it would
> > >> prevent such a situation.
> > >>
> > >> Is anybody here running a Ceph cluster with size=4 and min_size=2 for
> > >> this reason?
> > >>
> > >> Thank you,
> > >>
> > >> Wido
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com <mailto:ceph-users@lists.ceph.com>
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
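
For concreteness, the size=4 / min_size=2 setup asked about above would be
applied to an existing replicated pool roughly as follows ("volumes" is only
an example pool name; raising size triggers additional data placement and
backfill):

    ceph osd pool set volumes size 4
    ceph osd pool set volumes min_size 2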


Re: [ceph-users] Changing the release cadence

2019-06-06 Thread Xiaoxi Chen
We go with upstream releases, mostly Nautilus now, and are probably among
the most aggressive of the serious production users (i.e. tens of PB+).

I would vote for November for several reasons:

 1.  Q4 is the holiday season and production rollouts are usually blocked,
especially storage-related changes, which gives teams more time to
prepare, test and LnP the new release, as well as to catch up with
new features.

 2.  Q4/Q1 is usually the planning season; having the upstream release out
and tested, so that the readiness of new features is known, greatly helps
when planning the features/offerings of the next year.

 3.  Users have a whole year to migrate their
provisioning/monitoring/deployment/remediation systems to the new version,
and enough time to fix and stabilize the surrounding systems before the
next holiday season.

A release in Feb or March puts Q4 right in the middle of the cycle, and a
lot of changes will land at the last minute (month), in which case few
things can be tested or forecast in Q4 based on the state of the art.

-Xiaoxi

Linh Vu wrote on Thu, Jun 6, 2019 at 8:32 AM:
>
> I think a 12-month cycle is much better from the cluster operations
> perspective. I also like March as a release month.
> 
> From: ceph-users  on behalf of Sage Weil 
> 
> Sent: Thursday, 6 June 2019 1:57 AM
> To: ceph-us...@ceph.com; ceph-de...@vger.kernel.org; d...@ceph.io
> Subject: [ceph-users] Changing the release cadence
>
> Hi everyone,
>
> Since luminous, we have had the follow release cadence and policy:
>  - release every 9 months
>  - maintain backports for the last two releases
>  - enable upgrades to move either 1 or 2 releases ahead
>(e.g., luminous -> mimic or nautilus; mimic -> nautilus or octopus; ...)
>
> This has mostly worked out well, except that the mimic release received
> less attention than we wanted due to the fact that multiple downstream
> Ceph products (from Red Hat and SUSE) decided to base their next release
> on nautilus.  Even though upstream every release is an "LTS" release, as a
> practical matter mimic got less attention than luminous or nautilus.
>
> We've had several requests/proposals to shift to a 12 month cadence. This
> has several advantages:
>
>  - Stable/conservative clusters only have to be upgraded every 2 years
>(instead of every 18 months)
>  - Yearly releases are more likely to intersect with downstream
>distribution release (e.g., Debian).  In the past there have been
>problems where the Ceph releases included in consecutive releases of a
>distro weren't easily upgradeable.
>  - Vendors that make downstream Ceph distributions/products tend to
>release yearly.  Aligning with those vendors means they are more likely
>to productize *every* Ceph release.  This will help make every Ceph
>release an "LTS" release (not just in name but also in terms of
>maintenance attention).
>
> So far the balance of opinion seems to favor a shift to a 12 month
> cycle[1], especially among developers, so it seems pretty likely we'll
> make that shift.  (If you do have strong concerns about such a move, now
> is the time to raise them.)
>
> That brings us to an important decision: what time of year should we
> release?  Once we pick the timing, we'll be releasing at that time *every
> year* for each release (barring another schedule shift, which we want to
> avoid), so let's choose carefully!
>
> A few options:
>
>  - November: If we release Octopus 9 months from the Nautilus release
>(planned for Feb, released in Mar) then we'd target this November.  We
>could shift to a 12 months candence after that.
>  - February: That's 12 months from the Nautilus target.
>  - March: That's 12 months from when Nautilus was *actually* released.
>
> November is nice in the sense that we'd wrap things up before the
> holidays.  It's less good in that users may not be inclined to install the
> new release when many developers will be less available in December.
>
> February kind of sucked in that the scramble to get the last few things
> done happened during the holidays.  OTOH, we should be doing what we can
> to avoid such scrambles, so that might not be something we should factor
> in.  March may be a bit more balanced, with a solid 3 months before when
> people are productive, and 3 months after before they disappear on holiday
> to address any post-release issues.
>
> People tend to be somewhat less available over the summer months due to
> holidays etc, so an early or late summer release might also be less than
> ideal.
>
> Thoughts?  If we can narrow it down to a few options maybe we could do a
> poll to gauge user preferences.
>
> Thanks!
> sage
>
>
> [1] 
> https://protect-au.mimecast.com/s/N1l6CROAEns1RN1Zu9Jwts?domain=twitter.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

Re: [ceph-users] Stealth Jewel release?

2017-07-12 Thread Xiaoxi Chen
Understood, thanks Abhishek.

 So 10.2.9 will not be another release cycle but just 10.2.8 plus the MDS
fix, and is expected to be out soon, right?


2017-07-12 23:51 GMT+08:00 Abhishek L <abhishek.lekshma...@gmail.com>:
> On Wed, Jul 12, 2017 at 9:13 PM, Xiaoxi Chen <superdebu...@gmail.com> wrote:
>> +However, it also introduced a regression that could cause MDS damage.
>> +Therefore, we do *not* recommend that Jewel users upgrade to this version -
>> +instead, we recommend upgrading directly to v10.2.9 in which the regression is
>> +fixed.
>>
>> It looks like this version is NOT production ready. Curious why we
>> want a not-recommended version to be released?
>
> We found a regression in MDS right after packages were built, and the release
> was about to be announced. This is why we didn't announce the release.
> We're  currently running tests after the fix for MDS was merged.
>
> So when we do announce the release we'll announce 10.2.9 so that users
> can upgrade from 10.2.7->10.2.9
>
> Best,
> Abhishek
>
>> 2017-07-12 22:44 GMT+08:00 David Turner <drakonst...@gmail.com>:
>>> The lack of communication on this makes me tentative to upgrade to it.  Are
>>> the packages available to Ubuntu/Debian systems production ready and
>>> intended for upgrades?
>>>
>>> On Tue, Jul 11, 2017 at 8:33 PM Brad Hubbard <bhubb...@redhat.com> wrote:
>>>>
>>>> On Wed, Jul 12, 2017 at 12:58 AM, David Turner <drakonst...@gmail.com>
>>>> wrote:
>>>> > I haven't seen any release notes for 10.2.8 yet.  Is there a document
>>>> > somewhere stating what's in the release?
>>>>
>>>> https://github.com/ceph/ceph/pull/16274 for now although it should
>>>> make it into the master doc tree soon.
>>>>
>>>> >
>>>> > On Mon, Jul 10, 2017 at 1:41 AM Henrik Korkuc <li...@kirneh.eu> wrote:
>>>> >>
>>>> >> On 17-07-10 08:29, Christian Balzer wrote:
>>>> >> > Hello,
>>>> >> >
>>>> >> > so this morning I was greeted with the availability of 10.2.8 for
>>>> >> > both
>>>> >> > Jessie and Stretch (much appreciated), but w/o any announcement here
>>>> >> > or
>>>> >> > updated release notes on the website, etc.
>>>> >> >
>>>> >> > Any reason other than "Friday" (US time) for this?
>>>> >> >
>>>> >> > Christian
>>>> >>
>>>> >> My guess is that they didn't have time to announce it yet. Maybe pkgs
>>>> >> were not ready yet on friday?
>>>> >>
>>>> >> ___
>>>> >> ceph-users mailing list
>>>> >> ceph-users@lists.ceph.com
>>>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >
>>>> >
>>>> > ___
>>>> > ceph-users mailing list
>>>> > ceph-users@lists.ceph.com
>>>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>> >
>>>>
>>>>
>>>>
>>>> --
>>>> Cheers,
>>>> Brad
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stealth Jewel release?

2017-07-12 Thread Xiaoxi Chen
+However, it also introduced a regression that could cause MDS damage.
+Therefore, we do *not* recommend that Jewel users upgrade to this version -
+instead, we recommend upgrading directly to v10.2.9 in which the regression is
+fixed.

It looks like this version is NOT production ready. Curious why we
want a not-recommended version  to be released?

2017-07-12 22:44 GMT+08:00 David Turner :
> The lack of communication on this makes me tentative to upgrade to it.  Are
> the packages available to Ubuntu/Debian systems production ready and
> intended for upgrades?
>
> On Tue, Jul 11, 2017 at 8:33 PM Brad Hubbard  wrote:
>>
>> On Wed, Jul 12, 2017 at 12:58 AM, David Turner 
>> wrote:
>> > I haven't seen any release notes for 10.2.8 yet.  Is there a document
>> > somewhere stating what's in the release?
>>
>> https://github.com/ceph/ceph/pull/16274 for now although it should
>> make it into the master doc tree soon.
>>
>> >
>> > On Mon, Jul 10, 2017 at 1:41 AM Henrik Korkuc  wrote:
>> >>
>> >> On 17-07-10 08:29, Christian Balzer wrote:
>> >> > Hello,
>> >> >
>> >> > so this morning I was greeted with the availability of 10.2.8 for
>> >> > both
>> >> > Jessie and Stretch (much appreciated), but w/o any announcement here
>> >> > or
>> >> > updated release notes on the website, etc.
>> >> >
>> >> > Any reason other than "Friday" (US time) for this?
>> >> >
>> >> > Christian
>> >>
>> >> My guess is that they didn't have time to announce it yet. Maybe pkgs
>> >> were not ready yet on friday?
>> >>
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>>
>>
>>
>> --
>> Cheers,
>> Brad
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-09 Thread Xiaoxi Chen
Yeah, I checked the dump; it is indeed the known issue.
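
For anyone wanting to repeat the check, a rough sketch of how to look for
null dentries in a cache dump, assuming the MDS daemon id is "a" and that
the admin socket "dump cache" command is available on your release (the
exact dump format varies, so treat the grep as an approximation):

    ceph daemon mds.a dump cache /tmp/mds-cache.txt
    # null dentries show up without an attached inode; counting lines
    # mentioning NULL gives a rough idea of how many there are
    grep -c 'NULL' /tmp/mds-cache.txt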

Thanks

2017-03-08 17:58 GMT+08:00 John Spray <jsp...@redhat.com>:
> On Tue, Mar 7, 2017 at 3:05 PM, Xiaoxi Chen <superdebu...@gmail.com> wrote:
>> Thanks John.
>>
>> Very likely, note that mds_mem::ino + mds_cache::strays_created ~=
>> mds::inodes, plus the MDS was the active-standby one, and become
>> active days ago due to failover.
>>
>> mds": {
>> "inodes": 1291393,
>> }
>> "mds_cache": {
>> "num_strays": 3559,
>> "strays_created": 706120,
>> "strays_purged": 702561
>> }
>> "mds_mem": {
>> "ino": 584974,
>> }
>>
>> I do have a cache dump from the mds via admin socket,  is there
>> anything I can get from it  to make 100% percent sure?
>
> You could go through that dump and look for the dentries with no inode
> number set, but honestly if this is a previously-standby-replay daemon
> and you're running pre-Kraken code I'd be pretty sure it's the known
> issue.
>
> John
>
>>
>> Xiaoxi
>>
>> 2017-03-07 22:20 GMT+08:00 John Spray <jsp...@redhat.com>:
>>> On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen <superdebu...@gmail.com> wrote:
>>>> Hi,
>>>>
>>>>   From the admin socket of mds, I got following data on our
>>>> production cephfs env, roughly we have 585K inodes and almost same
>>>> amount of caps, but we have>2x dentries than inodes.
>>>>
>>>>   I am pretty sure we dont use hard link intensively (if any).
>>>> And the #ino match with "rados ls --pool $my_data_pool}.
>>>>
>>>>   Thanks for any explanations, appreciate it.
>>>>
>>>>
>>>> "mds_mem": {
>>>> "ino": 584974,
>>>> "ino+": 1290944,
>>>> "ino-": 705970,
>>>> "dir": 25750,
>>>> "dir+": 25750,
>>>> "dir-": 0,
>>>> "dn": 1291393,
>>>> "dn+": 1997517,
>>>> "dn-": 706124,
>>>> "cap": 584560,
>>>> "cap+": 2657008,
>>>> "cap-": 2072448,
>>>> "rss": 24599976,
>>>> "heap": 166284,
>>>> "malloc": 18446744073708721289,
>>>> "buf": 0
>>>> },
>>>>
>>>
>>> One possibility is that you have many "null" dentries, which are
>>> created when we do a lookup and a file is not found -- we create a
>>> special dentry to remember that that filename does not exist, so that
>>> we can return ENOENT quickly next time.  On pre-Kraken versions, null
>>> dentries can also be left behind after file deletions when the
>>> deletion is replayed on a standbyreplay MDS
>>> (http://tracker.ceph.com/issues/16919)
>>>
>>> John
>>>
>>>
>>>
>>>>
>>>> Xiaoxi
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Much more dentries than inodes, is that normal?

2017-03-07 Thread Xiaoxi Chen
Thanks John.

Very likely; note that mds_mem::ino + mds_cache::strays_created ~=
mds::inodes. Also, the MDS was the active-standby one and became
active a few days ago due to a failover.

"mds": {
"inodes": 1291393,
}
"mds_cache": {
"num_strays": 3559,
"strays_created": 706120,
"strays_purged": 702561
}
"mds_mem": {
"ino": 584974,
}

I do have a cache dump from the MDS via the admin socket; is there
anything I can get from it to make 100% sure?


Xiaoxi

2017-03-07 22:20 GMT+08:00 John Spray <jsp...@redhat.com>:
> On Tue, Mar 7, 2017 at 9:17 AM, Xiaoxi Chen <superdebu...@gmail.com> wrote:
>> Hi,
>>
>>   From the admin socket of mds, I got following data on our
>> production cephfs env, roughly we have 585K inodes and almost same
>> amount of caps, but we have>2x dentries than inodes.
>>
>>   I am pretty sure we dont use hard link intensively (if any).
>> And the #ino match with "rados ls --pool $my_data_pool}.
>>
>>   Thanks for any explanations, appreciate it.
>>
>>
>> "mds_mem": {
>> "ino": 584974,
>> "ino+": 1290944,
>> "ino-": 705970,
>> "dir": 25750,
>> "dir+": 25750,
>> "dir-": 0,
>> "dn": 1291393,
>> "dn+": 1997517,
>> "dn-": 706124,
>> "cap": 584560,
>> "cap+": 2657008,
>> "cap-": 2072448,
>> "rss": 24599976,
>> "heap": 166284,
>> "malloc": 18446744073708721289,
>> "buf": 0
>> },
>>
>
> One possibility is that you have many "null" dentries, which are
> created when we do a lookup and a file is not found -- we create a
> special dentry to remember that that filename does not exist, so that
> we can return ENOENT quickly next time.  On pre-Kraken versions, null
> dentries can also be left behind after file deletions when the
> deletion is replayed on a standbyreplay MDS
> (http://tracker.ceph.com/issues/16919)
>
> John
>
>
>
>>
>> Xiaoxi
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Much more dentries than inodes, is that normal?

2017-03-07 Thread Xiaoxi Chen
Hi,

  From the admin socket of the MDS, I got the following data from our
production CephFS environment: roughly we have 585K inodes and almost the
same number of caps, but we have >2x as many dentries as inodes.

  I am pretty sure we don't use hard links intensively (if at all),
and the #ino matches "rados ls --pool $my_data_pool".

  Thanks for any explanations, appreciate it.


"mds_mem": {
"ino": 584974,
"ino+": 1290944,
"ino-": 705970,
"dir": 25750,
"dir+": 25750,
"dir-": 0,
"dn": 1291393,
"dn+": 1997517,
"dn-": 706124,
"cap": 584560,
"cap+": 2657008,
"cap-": 2072448,
"rss": 24599976,
"heap": 166284,
"malloc": 18446744073708721289,
"buf": 0
},
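
For reference, the counters above were taken from the MDS admin socket; a
sketch of pulling them again, assuming the MDS daemon id is "a":

    ceph daemon mds.a perf dump | grep -A 16 '"mds_mem"'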



Xiaoxi


Re: [ceph-users] How to hide internal ip on ceph mount

2017-03-02 Thread Xiaoxi Chen
2017-03-02 23:25 GMT+08:00 Ilya Dryomov <idryo...@gmail.com>:
> On Thu, Mar 2, 2017 at 1:06 AM, Sage Weil <s...@newdream.net> wrote:
>> On Thu, 2 Mar 2017, Xiaoxi Chen wrote:
>>> >Still applies. Just create a Round Robin DNS record. The clients will
>>> obtain a new monmap while they are connected to the cluster.
>>> It works to some extent, but causing issue for "mount -a". We have such
>>> deployment nowaday, a GTM(kinds of dns) record created with all MDS ips and
>>> it works fine in terms of failover/ mount.
>>>
>>> But, user usually automation such mount by fstab and even, "mount -a " are
>>> periodically called. With such DNS approach above, they will get mount point
>>> busy message every time. Just due to mount.ceph resolve the DNS name to
>>> another IP, and kernel client was feeling like you are trying to attach
>>> another fs...
>>
>> The kernel client is (should be!) smart enough to tell that it is the same
>> mount point and will share the superblock.  If you see a problem here it's
>> a bug.
>
> I think -EBUSY actually points out that the sharing code is working.
>
> The DNS name in fstab doesn't match the IPs it resolves to, so "mount
> -a" attempts to mount.  The kernel client tells that it's the same fs
> and returns the existing super to the VFS.  The VFS refuses the same
> super on the same mount point...

True:
root@lvspuppetmaster-ng2-1209253:/mnt# mount -a
mount error 16 = Device or resource busy

Do we have any chance to make this work dynamically (i.e. suppress the
-EBUSY in this case) on old kernels?
>
> We should look into enabling the in-kernel DNS resolver.

Thanks for the explanation, looking forward to it :)
>
> Thanks,
>
> Ilya


Re: [ceph-users] How to hide internal ip on ceph mount

2017-03-01 Thread Xiaoxi Chen
> Still applies. Just create a Round Robin DNS record. The clients will
> obtain a new monmap while they are connected to the cluster.

It works to some extent, but it causes an issue for "mount -a". We have such
a deployment nowadays: a GTM (a kind of DNS) record created with all MDS IPs,
and it works fine in terms of failover/mount.

But users usually automate such mounts via fstab, and "mount -a" may even be
called periodically. With the DNS approach above, they will get a "mount
point busy" message every time, simply because mount.ceph resolves the DNS
name to another IP and the kernel client thinks you are trying to attach
another fs...
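
For what it's worth, the kind of fstab entry being discussed looks roughly
like this (a sketch only; "mycephfs.mydomain.com" stands in for the GTM/DNS
record, and the mount point and secret file path are just examples):

    # /etc/fstab
    mycephfs.mydomain.com:6789:/  /mnt/cephfs  ceph  name=admin,secretfile=/etc/ceph/admin.secret,noatime,_netdev  0 2

mount.ceph resolves the name to monitor IPs at mount time, which is exactly
why a later "mount -a" can see a different IP and report -EBUSY.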



2017-03-02 0:29 GMT+08:00 Wido den Hollander <w...@42on.com>:

>
> > Op 1 maart 2017 om 16:57 schreef Sage Weil <s...@newdream.net>:
> >
> >
> > On Wed, 1 Mar 2017, Wido den Hollander wrote:
> > > > Op 1 maart 2017 om 15:40 schreef Xiaoxi Chen <superdebu...@gmail.com
> >:
> > > >
> > > >
> > > > Well , I think the argument here is not all about security gain, it
> just
> > > > NOT a user friendly way to let "df" show out 7 IPs of
> monitorsMuch
> > > > better if they seeing something like "mycephfs.mydomain.com".
> > > >
> > >
> > > mount / df simply prints the monmap. It doesn't print what you added
> when you mounted the filesystem.
> > >
> > > Totally normal behavior.
> >
> > Yep.  This *could* be changed, though: modern kernels have DNS resolution
> > capability.  Not sure if all distros compile it in, but if so, mount.ceph
> > could first try to pass in the DNS name and only do the DNS resolution if
> > the kernel can't.  And the kernel client could be updated to remember the
> > DNS name and use that.  It's a bit friendlier, but imprecise, since DNS
> > might change.  What does NFS do in this case? (Show an IP or a name?)
> >
>
> A "df" will show the entry as it's in the fstab file, but mount will show
> the IPs as well.
>
> But Ceph is a different story here due to the monmap.
>
> Wido
>
> > sage
> >
> >
> > > > And using DNS give you the flexibility of changing your monitor
> quorum
> > > > members , without notifying end user to change their fstab entry , or
> > > > whatever mount point record.
> > > >
> > >
> > > Still applies. Just create a Round Robin DNS record. The clients will
> obtain a new monmap while they are connected to the cluster.
> > >
> > > Wido
> > >
> > > > 2017-03-01 18:46 GMT+08:00 gjprabu <gjpr...@zohocorp.com>:
> > > >
> > > > > Hi Robert,
> > > > >
> > > > >   This container host will be provided to end user and we don't
> want to
> > > > > expose this ip to end users.
> > > > >
> > > > > Regards
> > > > > Prabu GJ
> > > > >
> > > > >
> > > > >  On Wed, 01 Mar 2017 16:03:49 +0530 *Robert Sander
> > > > > <r.san...@heinlein-support.de <r.san...@heinlein-support.de>>*
> wrote 
> > > > >
> > > > > On 01.03.2017 10:54, gjprabu wrote:
> > > > > > Hi,
> > > > > >
> > > > > > We try to use host name instead of ip address but mounted partion
> > > > > > showing up address only . How show the host name instead of ip
> address.
> > > > >
> > > > > What is the security gain you try to achieve by hiding the IPs?
> > > > >
> > > > > Regards
> > > > > --
> > > > > Robert Sander
> > > > > Heinlein Support GmbH
> > > > > Schwedter Str. 8/9b, 10119 Berlin
> > > > >
> > > > > http://www.heinlein-support.de
> > > > >
> > > > > Tel: 030 / 405051-43
> > > > > Fax: 030 / 405051-19
> > > > >
> > > > > Zwangsangaben lt. §35a GmbHG:
> > > > > HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> > > > > Geschäftsführer: Peer Heinlein -- Sitz: Berlin
> > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > >
> > > > >
> > > > > ___
> > > > > ceph-users mailing list
> > > > > ceph-users@lists.ceph.com
> > > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > >
> > > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
>


Re: [ceph-users] How to hide internal ip on ceph mount

2017-03-01 Thread Xiaoxi Chen
Well, I think the argument here is not really about a security gain; it is
just not user-friendly for "df" to show the IPs of 7 monitors. Much better
if users see something like "mycephfs.mydomain.com".

And using DNS gives you the flexibility of changing your monitor quorum
members without asking end users to change their fstab entries, or whatever
mount-point records they keep.

2017-03-01 18:46 GMT+08:00 gjprabu :

> Hi Robert,
>
>   This container host will be provided to end user and we don't want to
> expose this ip to end users.
>
> Regards
> Prabu GJ
>
>
>  On Wed, 01 Mar 2017 16:03:49 +0530 *Robert Sander
> >* wrote 
>
> On 01.03.2017 10:54, gjprabu wrote:
> > Hi,
> >
> > We try to use the host name instead of the IP address, but the mounted
> > partition shows only the address. How can we show the host name instead
> > of the IP address?
>
> What is the security gain you try to achieve by hiding the IPs?
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] How to hide internal ip on ceph mount

2017-02-28 Thread Xiaoxi Chen
We did try to use DNS to hide the IPs and achieve a kind of HA, but it failed:

mount.ceph resolves whatever you provide to IP addresses and passes those to
the kernel.
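
You can see the effect directly on the client; a small sketch (output
abbreviated, IPs anonymized as in the quoted report below):

    grep ceph /proc/mounts
    # 192.168.xxx.xxx:6789,192.168.xxx.xxx:6789,192.168.xxx.xxx:6789:/ /home ceph rw,...

i.e. the kernel only ever knows about the resolved monitor IPs.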

2017-02-28 16:14 GMT+08:00 Robert Sander :

> On 28.02.2017 07:19, gjprabu wrote:
>
> >  How can we hide the internal IP addresses on a CephFS mount? For
> > security reasons we need to hide the IP addresses. Also, we are running a
> > Docker container on the base machine, which shows the partition details
> > there. Kindly let us know if there is any solution for this.
> >
> > 192.168.xxx.xxx:6789,192.168.xxx.xxx:6789,192.168.xxx.xxx:6789:/
> > ceph  6.4T  2.0T  4.5T  31% /home/
>
> If this is needed as a "security measure" you should not mount CephFS on
> this host in the first place.
>
> Only mount CephFS on hosts you trust (especially the root user) as the
> Filesystem uses the local accounts for access control.
>
> Regards
> --
> Robert Sander
> Heinlein Support GmbH
> Schwedter Str. 8/9b, 10119 Berlin
>
> http://www.heinlein-support.de
>
> Tel: 030 / 405051-43
> Fax: 030 / 405051-19
>
> Zwangsangaben lt. §35a GmbHG:
> HRB 93818 B / Amtsgericht Berlin-Charlottenburg,
> Geschäftsführer: Peer Heinlein -- Sitz: Berlin
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


Re: [ceph-users] mds0: Behind on trimming (58621/30)

2016-07-05 Thread xiaoxi chen


> From: uker...@gmail.com
> Date: Tue, 5 Jul 2016 21:14:12 +0800
> To: kenneth.waege...@ugent.be
> CC: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] mds0: Behind on trimming (58621/30)
> 
> On Tue, Jul 5, 2016 at 7:56 PM, Kenneth Waegeman
>  wrote:
> >
> >
> > On 04/07/16 11:22, Kenneth Waegeman wrote:
> >>
> >>
> >>
> >> On 01/07/16 16:01, Yan, Zheng wrote:
> >>>
> >>> On Fri, Jul 1, 2016 at 6:59 PM, John Spray  wrote:
> 
>  On Fri, Jul 1, 2016 at 11:35 AM, Kenneth Waegeman
>   wrote:
> >
> > Hi all,
> >
> > While syncing a lot of files to cephfs, our mds cluster got haywire:
> > the
> > mdss have a lot of segments behind on trimming:  (58621/30)
> > Because of this the mds cluster gets degraded. RAM usage is about 50GB.
> > The
> > mdses were respawning and replaying continiously, and I had to stop all
> > syncs , unmount all clients and increase the beacon_grace to keep the
> > cluster up .
> >
> > [root@mds03 ~]# ceph status
> >  cluster 92bfcf0a-1d39-43b3-b60f-44f01b630e47
> >   health HEALTH_WARN
> >  mds0: Behind on trimming (58621/30)
> >   monmap e1: 3 mons at
> >
> > {mds01=10.141.16.1:6789/0,mds02=10.141.16.2:6789/0,mds03=10.141.16.3:6789/0}
> >  election epoch 170, quorum 0,1,2 mds01,mds02,mds03
> >fsmap e78658: 1/1/1 up {0=mds03=up:active}, 2 up:standby
> >   osdmap e19966: 156 osds: 156 up, 156 in
> >  flags sortbitwise
> >pgmap v10213164: 4160 pgs, 4 pools, 253 TB data, 203 Mobjects
> >  357 TB used, 516 TB / 874 TB avail
> >  4151 active+clean
> > 5 active+clean+scrubbing
> > 4 active+clean+scrubbing+deep
> >client io 0 B/s rd, 0 B/s wr, 63 op/s rd, 844 op/s wr
> >cache io 68 op/s promote
> >
> >
> > Now it finally is up again, it is trimming very slowly (+-120 segments
> > /
> > min)
> 
>  Hmm, so it sounds like something was wrong that got cleared by either
>  the MDS restart or the client unmount, and now it's trimming at a
>  healthier rate.
> 
>  What client (kernel or fuse, and version)?
> 
>  Can you confirm that the RADOS cluster itself was handling operations
>  reasonably quickly?  Is your metadata pool using the same drives as
>  your data?  Were the OSDs saturated with IO?
> 
>  While the cluster was accumulating untrimmed segments, did you also
>  have a "client xyz failing to advanced oldest_tid" warning?
> >>>
> >>> This does not prevent MDS from trimming log segment.
> >>>
>  It would be good to clarify whether the MDS was trimming slowly, or
>  not at all.  If you can reproduce this situation, get it to a "behind
>  on trimming" state, and the stop the client IO (but leave it mounted).
>  See if the (x/30) number stays the same.  Then, does it start to
>  decrease when you unmount the client?  That would indicate a
>  misbehaving client.
> >>>
> >>> Behind on trimming on single MDS cluster should be caused by either
> >>> slow rados operations or MDS trim too few log segments on each tick.
> >>>
> >>> Kenneth, could you try setting mds_log_max_expiring to a large value
> >>> (such as 200)
> >>
> >> I've set the mds_log_max_expiring to 200 right now. Should I see something
> >> instantly?
> >
> > The trimming finished rather quick, although I don't have any accurate time
> > measures. Cluster looks running fine right now, but running incremental
> > sync. We will try with same data again to see if it is ok now.
> > Is this mds_log_max_expiring option production ready ? (Don't seem to find
> > it in documentation)
> 
> It should be safe. Setting mds_log_max_expiring to 200 does not change
> the code path
> 
> Yan, Zheng
> 
> >
Zheng, bumping this conf from 20 -> 200 seems to increase the (concurrent)
flushing load? Would you prefer to make this the default?
Xiaoxi
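
A sketch of how the tuning Kenneth applied can be set at runtime, assuming
the active MDS daemon id is "a" (it can also be made persistent in ceph.conf):

    ceph daemon mds.a config set mds_log_max_expiring 200
    # persistent form, in ceph.conf:
    #   [mds]
    #   mds_log_max_expiring = 200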

> > Thank you!!
> >
> > K
> >
> >>
> >> This weekend , the trimming did not contunue and something happened to the
> >> cluster:
> >>
> >> mds.0.cache.dir(1000da74e85) commit error -2 v 2466977
> >> log_channel(cluster) log [ERR] : failed to commit dir 1000da74e85 object,
> >> errno -2
> >> mds.0.78429 unhandled write error (2) No such file or directory, force
> >> readonly...
> >> mds.0.cache force file system read-only
> >> log_channel(cluster) log [WRN] : force file system read-only
> >>
> >> and ceph health reported:
> >> mds0: MDS in read-only mode
> >>
> >> I restarted it and it is trimming again.
> >>
> >>
> >> Thanks again!
> >> Kenneth
> >>>
> >>> Regards
> >>> Yan, Zheng
> >>>
>  John

Re: [ceph-users] Running ceph in docker

2016-06-30 Thread xiaoxi chen
It makes sense to me to run the MDS inside Docker or k8s, as the MDS is
stateless. But the MON and OSD do have local data; what is the motivation to
run them in Docker?

> To: ceph-users@lists.ceph.com
> From: d...@redhat.com
> Date: Thu, 30 Jun 2016 08:36:45 -0400
> Subject: Re: [ceph-users] Running ceph in docker
> 
> On 06/30/2016 02:05 AM, F21 wrote:
> > Hey all,
> >
> > I am interested in running ceph in docker containers. This is extremely
> > attractive given the recent integration of swarm into the docker engine,
> > making it really easy to set up a docker cluster.
> >
> > When running ceph in docker, should monitors, radosgw and OSDs all be on
> > separate physical nodes? I watched Sebastian's video on setting up ceph
> > in docker here: https://www.youtube.com/watch?v=FUSTjTBA8f8. In the
> > video, there were 6 OSDs, with 2 OSDs running on each node.
> >
> > Is running multiple OSDs on the same node a good idea in production? Has
> > anyone operated ceph in docker containers in production? Are there any
> > things I should watch out for?
> >
> > Cheers,
> > Francis
> 
> It's actually quite common to run multiple OSDs on the same physical 
> node, since an OSD currently maps to a single block device.  Depending 
> on your load and traffic, it's usually a good idea to run monitors and 
> RGWs on separate nodes.
> 
> Daniel
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS mds cache pressure

2016-06-28 Thread xiaoxi chen
Hmm, I asked about this on the ML some days ago. :) You likely hit the
kernel bug fixed by commit 5e804ac482 "ceph: don't invalidate page cache
when inode is no longer used". The fix is in 4.4 but not in 4.2. I haven't
had a chance to play with 4.4; it would be great if you could give it a try.

For the MDS OOM issue, we did an MDS RSS vs. #inodes scaling test; the
result showed around 4MB per 1000 inodes, so your MDS can likely hold up to
2~3 million inodes. But yes, even with the fix, if a client misbehaves
(opens and holds a lot of inodes and doesn't respond to cache pressure
messages), the MDS can go over the throttle and then get killed by the OOM
killer.
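
The rough arithmetic behind that estimate (the 10 GB RAM budget is an
assumption for illustration, not a number from this thread):

    # ~4 MB of MDS RSS per 1000 cached inodes, with ~10 GB available:
    echo $(( 10 * 1024 / 4 * 1000 ))   # => 2560000, i.e. roughly 2.5M inodes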

> To: ceph-users@lists.ceph.com
> From: castrofj...@gmail.com
> Date: Tue, 28 Jun 2016 21:34:03 +
> Subject: Re: [ceph-users] CephFS mds cache pressure
> 
> Hey John,
> 
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> 4.2.0-36-generic
> 
> Thanks!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com