[ceph-users] ceph v10.2.9 - rbd cli deadlock ?

2017-07-25 Thread Kjetil Jørgensen
Hi,

I'm not sure yet whether or not this is made worse by config, however - if
I do something along the lines of:

> seq 100 | xargs -P100 -n1 bash -c 'exec rbd.original showmapped'


I'll end up with at least one of the invocations deadlocked like below.
Doing the same on our v10.2.7 clusters seems to work fine.
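
For what it's worth, grabbing backtraces from the stuck invocations can be
done with something along these lines (a hypothetical helper, assuming gdb
and pgrep are installed and the binary is still named rbd.original):

# dump all thread backtraces for every hung rbd.original invocation
for pid in $(pgrep -f 'rbd.original showmapped'); do
    echo "=== pid $pid ==="
    gdb -p "$pid" -batch -ex 'thread apply all bt'
done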

The stack traces according to GDB look something like this, at least for
all the ones I've looked at:

> warning: the debug information found in "/usr/bin/rbd" does not match
> "/usr/bin/rbd.original" (CRC mismatch).
> # Yes - we've diverted rbd to rbd.original with a shell-wrapper around it
>

[New LWP 285438]
> [New LWP 285439]
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
> 0x7fbbea58798d in pthread_join (threadid=140444952844032,
> thread_return=thread_return@entry=0x0) at pthread_join.c:90
> 90  pthread_join.c: No such file or directory.
> Thread 3 (Thread 0x7fbbe3865700 (LWP 285439)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x55a852fcf896 in Cond::Wait (mutex=..., this=0x55a85cdeb258) at
> ./common/Cond.h:56
> #2  CephContextServiceThread::entry (this=0x55a85cdeb1c0) at
> common/ceph_context.cc:101
> #3  0x7fbbea5866ba in start_thread (arg=0x7fbbe3865700) at
> pthread_create.c:333
> #4  0x7fbbe80743dd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 2 (Thread 0x7fbbe4804700 (LWP 285438)):
> #0  pthread_cond_wait@@GLIBC_2.3.2 () at
> ../sysdeps/unix/sysv/linux/x86_64/pthread_cond_wait.S:185
> #1  0x55a852fb297b in ceph::log::Log::entry (this=0x55a85cd98830) at
> log/Log.cc:457
> #2  0x7fbbea5866ba in start_thread (arg=0x7fbbe4804700) at
> pthread_create.c:333
> #3  0x7fbbe80743dd in clone () at
> ../sysdeps/unix/sysv/linux/x86_64/clone.S:109
> Thread 1 (Thread 0x7fbbfda1e100 (LWP 285436)):
> #0  0x7fbbea58798d in pthread_join (threadid=140444952844032,
> thread_return=thread_return@entry=0x0) at pthread_join.c:90
> #1  0x55a852fb6270 in Thread::join (this=this@entry=0x55a85cdeb1c0,
> prval=prval@entry=0x0) at common/Thread.cc:171
> #2  0x55a852fca060 in CephContext::join_service_thread 
> (this=this@entry=0x55a85cd95780)
> at common/ceph_context.cc:637
> #3  0x55a852fcc2c7 in CephContext::~CephContext (this=0x55a85cd95780,
> __in_chrg=) at common/ceph_context.cc:507
> #4  0x55a852fcc9bc in CephContext::put (this=0x55a85cd95780) at
> common/ceph_context.cc:578
> #5  0x55a852eac2b1 in
> boost::intrusive_ptr::~intrusive_ptr (this=0x7ffef7ef5060,
> __in_chrg=) at
> /usr/include/boost/smart_ptr/intrusive_ptr.hpp:97
> #6  main (argc=, argv=) at
> tools/rbd/rbd.cc:17


Cheers,
-- 
Kjetil Joergensen 
Staff Curmudgeon, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Re: How's cephfs going?

2017-07-19 Thread Kjetil Jørgensen
Hi,

While not necessarily CephFS specific - we somehow manage to frequently
end up with objects that have inconsistent omaps. This seems to be
replication-related (anecdotally it's a replica that ends up diverging, and
at least a few times it happened after the osd that held that replica was
restarted). I had hoped http://tracker.ceph.com/issues/17177 would solve
this, but it doesn't appear to have solved it completely.
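
For reference, the way we typically spot and band-aid these is roughly the
following (the pg id below is just an example):

ceph health detail | grep inconsistent                    # find the affected pg(s)
rados list-inconsistent-obj 1.28 --format=json-pretty     # jewel: shows which shard/omap digest disagrees
ceph pg repair 1.28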

We also have one workload which we'd need to re-engineer in order to be a
good fit for CephFS: we create a lot of hardlinks where there's no clear
"origin" file, which is slightly at odds with the hardlink implementation.
If I understand correctly, unlink moves the inode from the directory tree
into the stray directories and decrements the link count; if the link count
is 0 it gets purged, otherwise it's kept around until another link to it is
encountered and it gets re-integrated back in again. This netted us
hilariously large stray directories, which combined with the above was less
than ideal.
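
For anyone wanting to keep an eye on the same thing - the stray counts show
up in the MDS perf counters, something like the below (mds name is a
placeholder, and the counter names are from memory, so treat as approximate):

ceph daemon mds.foo perf dump | egrep 'num_strays|strays_created|strays_purged'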

Beyond that - there have been other small(-ish) bugs we've encountered, but
they've been solvable by cherry-picking fixes, upgrading, or using the
available tools for doing surgery, guided either by the internet and/or an
approximate understanding of how it's supposed to work.

-KJ

On Wed, Jul 19, 2017 at 11:20 AM, Brady Deetz  wrote:

> Thanks Greg. I thought it was impossible when I reported 34MB for 52
> million files.
>
> On Jul 19, 2017 1:17 PM, "Gregory Farnum"  wrote:
>
>>
>>
>> On Wed, Jul 19, 2017 at 10:25 AM David  wrote:
>>
>>> On Tue, Jul 18, 2017 at 6:54 AM, Blair Bethwaite <
>>> blair.bethwa...@gmail.com> wrote:
>>>
 We are a data-intensive university, with an increasingly large fleet
 of scientific instruments capturing various types of data (mostly
 imaging of one kind or another). That data typically needs to be
 stored, protected, managed, shared, connected/moved to specialised
 compute for analysis. Given the large variety of use-cases we are
 being somewhat more circumspect in our CephFS adoption and really only
 dipping toes in the water, ultimately hoping it will become a
 long-term default NAS choice from Luminous onwards.

 On 18 July 2017 at 15:21, Brady Deetz  wrote:
 > All of that said, you could also consider using rbd and zfs or
 whatever filesystem you like. That would allow you to gain the benefits of
 scaleout while still getting a feature rich fs. But, there are some down
 sides to that architecture too.

 We do this today (KVMs with a couple of large RBDs attached via
 librbd+QEMU/KVM), but the throughput able to be achieved this way is
 nothing like native CephFS - adding more RBDs doesn't seem to help
 increase overall throughput. Also, if you have NFS clients you will
 absolutely need SSD ZIL. And of course you then have a single point of
 failure and downtime for regular updates etc.

 In terms of small file performance I'm interested to hear about
 experiences with in-line file storage on the MDS.

 Also, while we're talking about CephFS - what size metadata pools are
 people seeing on their production systems with 10s-100s millions of
 files?

>>>
>>> On a system with 10.1 million files, metadata pool is 60MB
>>>
>>>
>> Unfortunately that's not really an accurate assessment, for good but
>> terrible reasons:
>> 1) CephFS metadata is principally stored via the omap interface (which is
>> designed for handling things like the directory storage CephFS needs)
>> 2) omap is implemented via Level/RocksDB
>> 3) there is not a good way to determine which pool is responsible for
>> which portion of RocksDBs data
>> 4) So the pool stats do not incorporate omap data usage at all in their
>> reports (it's part of the overall space used, and is one of the things that
>> can make that larger than the sum of the per-pool spaces)
>>
>> You could try and estimate it by looking at how much "lost" space there
>> is (and subtracting out journal sizes and things, depending on setup). But
>> I promise there's more than 60MB of CephFS metadata for 10.1 million files!
>> -Greg
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd kernel client fencing

2017-04-25 Thread Kjetil Jørgensen
Hi,

On Wed, Apr 19, 2017 at 9:08 PM, Chaofan Yu <chaofa...@owtware.com> wrote:
> Thank you so much.
>
> The blacklist entries are stored in osd map, which is supposed to be tiny and 
> clean.
> So we are doing similar cleanups after reboot.

In the face of churn - this won't necessarily matter much, as I believe
there's some osdmap history stored; it'll eventually fall off. This may
also have improved - my bad experiences were from around hammer.
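
For completeness - inspecting and trimming the entries is just the
following (the address below is made up):

ceph osd blacklist ls                    # entries live in the osdmap
ceph osd blacklist rm 10.0.0.17:0/0      # drop one early instead of waiting for it to expire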

> I’m quite interested in how the host commit suicide and reboot,

echo b >/proc/sysrq-trigger # This is about as brutal as it gets

The machine is blacklisted; it has no hope of reading/writing anything
from/to an rbd device.

There are a couple of caveats that come with this:
 - Your workload needs to structure its writes in such a way that it can
   recover from this kind of failure.
 - You need to engineer your workload in such a way that it can tolerate
   a machine falling off the face of the earth (i.e. a combination of a
   workload scheduler like mesos/aurora/kubernetes and some HA where
   necessary).

> can you successfully umount the folder and unmap the rbd block device
>
> after it is blacklisted?
>
> I wonder whether the IO will hang and the umount process will stop at D state
>
> thus the host cannot be shutdown since it is waiting for the umount to finish

No, see previous comment.

> ==
>
> and now that the CentOS 7.3 kernel supports the exclusive lock feature,
>
> could anyone give out new flow of failover ?

This may not be what you think it is, see e.g.:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004857.html

(I can't really provide you with much more context; I've primarily
registered that it isn't made for fencing image access. It's all about
arbitrating modification, in support of e.g. object-map.)

>
> Thanks.
>
>
>> On 20 Apr 2017, at 6:31 AM, Kjetil Jørgensen <kje...@medallia.com> wrote:
>>
>> Hi,
>>
>> As long as you blacklist the old owner by ip, you should be fine. Do
>> note that rbd lock remove implicitly also blacklists unless you also
>> pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
>> (that is I think "ceph osd blacklist add a.b.c.d interval" translates
>> into blacklisting a.b.c.d:0/0 - which should block every client with
>> source ip a.b.c.d).
>>
>> Regardless, I believe the client taking out the lock (rbd cli) and the
>> kernel client mapping the rbd will be different (port, nonce), so
>> specifically if it is possible to blacklist a specific client by (ip,
>> port, nonce) it wouldn't do you much good where you have different
>> clients dealing with the locking and doing the actual IO/mapping (rbd
>> cli and kernel).
>>
>> We do a variation of what you are suggesting, although additionally we
>> check for watches, if watched we give up and complain rather than
>> blacklist. If previous lock were held by my ip we just silently
>> reclaim. The hosts themselves run a process watching for
>> blacklist entries, and if they see themselves blacklisted they commit
>> suicide and re-boot. On boot, machine removes blacklist, reclaims any
>> locks it used to hold before starting the things that might map rbd
>> images. There are some warts in there, but for the most part it works
>> well.
>>
>> If you are going the fencing route - I would strongly advise you also
>> ensure your process doesn't end up with the possibility of cascading
>> blacklists, in addition to being highly disruptive, it causes osd(?)
>> map churn. (We accidentally did this - and ended up almost running our
>> monitors out of disk).
>>
>> Cheers,
>> KJ
>>
>> On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu <chaofa...@owtware.com> wrote:
>>> Hi list,
>>>
>>>  I wonder someone can help with rbd kernel client fencing (aimed to avoid
>>> simultaneously rbd map on different hosts).
>>>
>>> I know the exclusive rbd image feature is added later to avoid manual rbd
>>> lock CLIs. But want to know previous blacklist solution.
>>>
>>> The official workflow I’ve got is listed below (without exclusive rbd
>>> feature) :
>>>
>>> - identify old rbd lock holder (rbd lock list )
>>> - blacklist old owner (ceph osd blacklist add )
>>> - break old rbd lock (rbd lock remove   )
>>> - lock rbd image on new host (rbd lock add  )
>>> - map rbd image on new host
>>>
>>>
>>> The blacklisted entry identified by entity_addr_t (ip, port, nonce).
>>>
>>> However as far as I know, ceph kernel client will do socket reconnection if
>>> conn

Re: [ceph-users] rbd kernel client fencing

2017-04-19 Thread Kjetil Jørgensen
Hi,

As long as you blacklist the old owner by ip, you should be fine. Do
note that rbd lock remove implicitly also blacklists unless you also
pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
(that is I think "ceph osd blacklist add a.b.c.d interval" translates
into blacklisting a.b.c.d:0/0 - which should block every client with
source ip a.b.c.d).

Regardless, I believe the client taking out the lock (rbd cli) and the
kernel client mapping the rbd will be different (port, nonce), so
specifically if it is possible to blacklist a specific client by (ip,
port, nonce) it wouldn't do you much good where you have different
clients dealing with the locking and doing the actual IO/mapping (rbd
cli and kernel).

We do a variation of what you are suggesting, although additionally we
check for watches, if watched we give up and complain rather than
blacklist. If previous lock were held by my ip we just silently
reclaim. The hosts themselves run a process watching for
blacklist entries, and if they see themselves blacklisted they commit
suicide and re-boot. On boot, machine removes blacklist, reclaims any
locks it used to hold before starting the things that might map rbd
images. There are some warts in there, but for the most part it works
well.
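
Very roughly, that watcher boils down to something like the sketch below
(heavily simplified - the address, interval and reboot policy are
placeholders rather than our actual tooling):

#!/bin/bash
MY_IP=10.0.0.17          # this host's client-facing address (placeholder)
while sleep 10; do
    # "ceph osd blacklist ls" prints one "addr expiry" line per entry
    if ceph osd blacklist ls 2>/dev/null | grep -q "^${MY_IP}:"; then
        # we've been fenced - no rbd I/O from this host can succeed, so fail fast
        echo b > /proc/sysrq-trigger
    fi
done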

If you are going the fencing route - I would strongly advise you also
ensure your process doesn't end up with the possibility of cascading
blacklists, in addition to being highly disruptive, it causes osd(?)
map churn. (We accidentally did this - and ended up almost running our
monitors out of disk).

Cheers,
KJ

On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu  wrote:
> Hi list,
>
>   I wonder if someone can help with rbd kernel client fencing (aimed to avoid
> simultaneously rbd map on different hosts).
>
> I know the exclusive rbd image feature is added later to avoid manual rbd
> lock CLIs. But want to know previous blacklist solution.
>
> The official workflow I’ve got is listed below (without exclusive rbd
> feature) :
>
>  - identify old rbd lock holder (rbd lock list )
>  - blacklist old owner (ceph osd blacklist add )
>  - break old rbd lock (rbd lock remove   )
>  - lock rbd image on new host (rbd lock add  )
>  - map rbd image on new host
>
>
> The blacklisted entry is identified by entity_addr_t (ip, port, nonce).
>
> However as far as I know, ceph kernel client will do socket reconnection if
> connection failed. So I wonder in this scenario it won’t work:
>
> 1. old client network down for a while
> 2. perform below steps on new host to achieve failover
> - identify old rbd lock holder (rbd lock list )
>
>  - blacklist old owner (ceph osd blacklist add )
>  - break old rbd lock (rbd lock remove   )
>  - lock rbd image on new host (rbd lock add  )
>  - map rbd image on new host
>
> 3. old client network comes back and reconnects to osds with a newly
> created socket client, i.e. a new (ip, port, nonce) tuple
>
> as a result both new and old clients can write to the same rbd image, which
> might potentially cause data corruption.
>
> So does this mean if kernel client does not support exclusive-lock image
> feature, fencing is not possible ?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to cut a large file into small objects

2017-04-11 Thread Kjetil Jørgensen
Hi,

rados - does not shard your object (as far as I know; there may be a
striping API, although it may not do quite what you want)
cephfs - implemented on top of rados - does its own object sharding (I'm
fuzzy on the details)
rbd - implemented on top of rados - does shard, into 2^order sized objects
(there's striping support in librbd as well)
radosgw - implemented on top of rados - I'd imagine this does shard into
smaller objects as well
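
If you want to see the difference for yourself, something along these lines
should illustrate it (pool and file names are just examples):

# rados does not split anything - one big input file becomes one big object
rados -p data put test-object xxx.iso
rados -p data stat test-object                     # a single object of the full size

# rbd does split - import the same file as an image and count its backing objects
rbd import xxx.iso data/test-image --image-format 2
prefix=$(rbd info data/test-image | awk '/block_name_prefix/ { print $2 }')
rados -p data ls | grep -c "^${prefix}"            # many ~4MB rbd_data.* objects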

Cheers,
KJ

On Tue, Apr 11, 2017 at 7:42 AM, 冥王星 <945019...@qq.com> wrote:

> In Ceph, a large file will be cut into small objects (2MB ~ 4MB), then
> the process is Pool ---(crush)--> PG --> OSD.
> Here I have a question: how is a large file cut into small objects?
> Is it done by Ceph itself or some other way?
> I tried this command:  rados put test-object xxx.iso --pool=data  but the
> large file xxx.iso does not seem to be cut into small objects.
> I feel confused. I hope someone can help me solve the question.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Modification Time of RBD Images

2017-03-24 Thread Kjetil Jørgensen
Hi,

YMMV, riddled with assumptions (image is image-format=2, has one ext4
filesystem, no partition table, the ext4 superblock starts at 0x400, and
probably a whole boatload of other stuff; I don't know when ext4 updates
s_wtime of its superblock, nor if it's actually the superblock's last write
or the last write to the filesystem, etc.).

rados -p rbd get $(rbd info $SOME_IMAGE_NAME | awk '/block_name_prefix/ { print $2 }'). - \
  | dd if=/dev/stdin of=/dev/stdout skip=1072 bs=1 count=4 status=none \
  | perl -lane 'print scalar localtime unpack "I*", $_;'
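
If the ext4 assumptions don't hold, the generic (and much more expensive -
this is the approach dismissed elsewhere in the thread) fallback is to stat
every object of the image and take the newest mtime; a rough sketch:

prefix=$(rbd info $SOME_IMAGE_NAME | awk '/block_name_prefix/ { print $2 }')
rados -p rbd ls | grep "^${prefix}\." | while read obj; do
    rados -p rbd stat "$obj"       # prints the object name, mtime and size
done | sort -k3,4 | tail -1        # newest mtime last (output format varies a bit between versions)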

Cheers,
KJ

On Fri, Mar 24, 2017 at 12:27 AM, Dongsheng Yang <
dongsheng.y...@easystack.cn> wrote:

> Hi jason,
>
> do you think this is a good feature for rbd?
> maybe we can implement a "rbd stat" command
> to show atime, mtime and ctime of an image.
>
> Yang
>
>
> On 03/23/2017 08:36 PM, Christoph Adomeit wrote:
>
>> Hi,
>>
>> no i did not enable the journalling feature since we do not use mirroring.
>>
>>
>> On Thu, Mar 23, 2017 at 08:10:05PM +0800, Dongsheng Yang wrote:
>>
>>> Did you enable the journaling feature?
>>>
>>> On 03/23/2017 07:44 PM, Christoph Adomeit wrote:
>>>
 Hi Yang,

 I mean "any write" to this image.

 I am sure we have a lot of not-used-anymore rbd images in our pool and
 I am trying to identify them.

 The mtime would be a good hint to show which images might be unused.

 Christoph

 On Thu, Mar 23, 2017 at 07:32:49PM +0800, Dongsheng Yang wrote:

> Hi Christoph,
>
> On 03/23/2017 07:16 PM, Christoph Adomeit wrote:
>
>> Hello List,
>>
>> i am wondering if there is meanwhile an easy method in ceph to find
>> more information about rbd-images.
>>
>> For example I am interested in the modification time of an rbd image.
>>
> Do you mean some metadata changing? such as resize?
>
> Or any write to this image?
>
> Thanx
> Yang
>
>> I found some posts from 2015 that say we have to go over all the
>> objects of an rbd image and find the newest mtime but this is not a
>> preferred solution for me. It takes too much time and too many system
>> resources.
>>
>> Any Ideas ?
>>
>> Thanks
>>Christoph
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Object Map Costs (Was: Snapshot Costs (Was: Re: Pool Sizes))

2017-03-24 Thread Kjetil Jørgensen
Hi,

Depending on how you plan to use the omap - you might also want to avoid a
large number of key/value pairs as well. CephFS got its directory fragment
size capped due to large omaps being painful to deal with (see:
http://tracker.ceph.com/issues/16164 and
http://tracker.ceph.com/issues/16177).
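
A cheap way to get a feel for how big a single object's omap has grown
(pool/object below are just examples from a cephfs metadata pool):

rados -p cephfs_metadata listomapkeys 1.00000000 | wc -l      # number of keys
rados -p cephfs_metadata listomapvals 1.00000000 | wc -c      # very rough size of keys+values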

Cheers,
KJ

On Thu, Mar 9, 2017 at 3:02 PM, Max Yehorov  wrote:

> re: python library
>
> you can do some mon calls using this:
>
> ##--
> import rados
> from ceph_argparse import json_command
>
> # connect with the usual admin credentials (conffile path is an assumption)
> cluster_handle = rados.Rados(conffile='/etc/ceph/ceph.conf')
> cluster_handle.connect()   # connect() returns None, so keep the Rados object as the handle
>
> cmd = {'prefix': 'pg dump', 'dumpcontents': ['summary', ], 'format': 'json'}
> retcode, jsonret, errstr = json_command(cluster_handle, argdict=cmd)
> ##--
>
>
> MON commands
> https://github.com/ceph/ceph/blob/a68106934c5ed28d0195d6104bce59
> 81aca9aa9d/src/mon/MonCommands.h
>
> On Wed, Mar 8, 2017 at 2:01 PM, Kent Borg  wrote:
> > I'm slowly working my way through Ceph's features...
> >
> > I recently happened upon object maps. (I had heard of LevelDB being in
> there
> > but never saw how to use it: That's because I have been using Python! And
> > the Python library is missing lots of features! Grrr.)
> >
> > How fast are those omap calls?
> >
> > Which is faster: a single LevelDB query yielding a few bytes vs. a single
> > RADOS object read of that many bytes at a specific offset?
> >
> > How about iterating through a whole set of values vs. reading a RADOS
> object
> > holding the same amount of data?
> >
> > Thanks,
> >
> > -kb, the Kent who is guessing LevelDB will be slower in both cases,
> because
> > he really isn't using the key/value aspect of LevelDB but is still paying
> > for it.
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Kjetil Jørgensen
Hi,


I should clarify. When you worry about concurrent osd failures, it's more
likely that the source of that is e.g. network/rack/power - you'd organize
your osds spread across those failure domains, and tell crush to put each
replica in a separate failure domain. E.g. you have 3 or more racks, with
their own TOR switches and hopefully power circuits, and you tell crush to
spread your 3 replicas so that they're in separate racks.

We do run min_size=2, size=3, although we run with osds spread across
multiple racks and require the 3 replicas to be in 3 different racks. Our
reasoning is that two or more machines failing at the same instant, not
caused by switch/power, is unlikely enough that we'll happily live with it;
it has so far served us well.
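
For reference, checking/adjusting this per pool is just:

ceph osd pool get rbd size
ceph osd pool get rbd min_size
ceph osd pool set rbd min_size 2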

-KJ

On Wed, Mar 22, 2017 at 7:06 PM, Kjetil Jørgensen <kje...@medallia.com>
wrote:

>
> For the most part - I'm assuming min_size=2, size=3. In the min_size=3
> and size=3 this changes.
>
> size is how many replicas of an object to maintain, min_size is how many
> writes need to succeed before the primary can ack the operation to the
> client.
>
> larger min_size most likely higher latency for writes.
>
> On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carhe...@ucar.edu> wrote:
>
>> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kje...@medallia.com>
>> wrote:
>>
>> >> c. Reads can continue from the single online OSD even in pgs that
>> >> happened to have two of 3 osds offline.
>> >>
>> >
>> > Hypothetically (This is partially informed guessing on my part):
>> > If the survivor happens to be the acting primary and it were up-to-date
>> at
>> > the time,
>> > it can in theory serve reads. (Only the primary serves reads).
>>
>> It makes no sense that only the primary could serve reads. That would
>> mean that even if only a single OSD failed, all PGs for which that OSD
>> was primary would be unreadable.
>>
>
> Acting [1, 2, 3] - primary is 1, only 1 serves reads. 1 fails, 2 is now
> the new primary. It'll probably check with 3 to determine whether or not
> there were any writes it itself is unaware of - and peer if there were.
> Promotion should be near instantaneous (well, you'd in all likelihood be
> able to measure it).
>
>
>> There must be an algorithm to appoint a new primary. So in a 2 OSD
>> failure scenario, a new primary should be appointed after the first
>> failure, no? Would the final remaining OSD not appoint itself as
>> primary after the 2nd failure?
>>
>>
> Assuming min_size=2, size=3 - if 2 osds fail at the same instant,
> you have no guarantee that the survivor has all writes.
>
> Assuming min_size=3 and size=3 - then yes - you're good, the surviving
> osd can safely be promoted - you're severely degraded, but it can safely
> be promoted.
>
> If you genuinely worry about concurrent failures of 2 machines - run with
> min_size=3, the price you pay is slightly increased mean/median latency
> for writes.
>
> This makes sense in the context of CEPH's synchronous writes too. A
>> write isn't complete until all 3 OSDs in the PG have the data,
>> correct? So shouldn't any one of them be able to act as primary at any
>> time?
>
>
> See distinction between size and min_size.
>
>
>> I don't see how that would change even if 2 of 3 ODS fail at exactly
>> the same time.
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Kjetil Joergensen <kje...@medallia.com>
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580 <(650)%20739-6580>
>



-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-22 Thread Kjetil Jørgensen
For the most part - I'm assuming min_size=2, size=3. In the min_size=3
and size=3 this changes.

size is how many replicas of an object to maintain, min_size is how many
writes need to succeed before the primary can ack the operation to the
client.

larger min_size most likely higher latency for writes.

On Wed, Mar 22, 2017 at 8:05 AM, Adam Carheden <carhe...@ucar.edu> wrote:

> On Tue, Mar 21, 2017 at 1:54 PM, Kjetil Jørgensen <kje...@medallia.com>
> wrote:
>
> >> c. Reads can continue from the single online OSD even in pgs that
> >> happened to have two of 3 osds offline.
> >>
> >
> > Hypothetically (This is partially informed guessing on my part):
> > If the survivor happens to be the acting primary and it were up-to-date
> at
> > the time,
> > it can in theory serve reads. (Only the primary serves reads).
>
> It makes no sense that only the primary could serve reads. That would
> mean that even if only a single OSD failed, all PGs for which that OSD
> was primary would be unreadable.
>

Acting [1, 2, 3] - primary is 1, only 1 serves reads. 1 fails, 2 is now the
new primary. It'll probably check with 3 to determine whether or not there
were any writes it itself is unaware of - and peer if there were. Promotion
should be near instantaneous (well, you'd in all likelihood be able to
measure it).


> There must be an algorithm to appoint a new primary. So in a 2 OSD
> failure scenario, a new primary should be appointed after the first
> failure, no? Would the final remaining OSD not appoint itself as
> primary after the 2nd failure?
>
>
Assuming min_size=2, size=3 - if 2 osds fail at the same instant,
you have no guarantee that the survivor has all writes.

Assuming min_size=3 and size=3 - then yes - you're good, the surviving
osd can safely be promoted - you're severely degraded, but it can safely
be promoted.

If you genuinely worry about concurrent failures of 2 machines - run with
min_size=3, the price you pay is slightly increased mean/median latency
for writes.

This makes sense in the context of CEPH's synchronous writes too. A
> write isn't complete until all 3 OSDs in the PG have the data,
> correct? So shouldn't any one of them be able to act as primary at any
> time?


See distinction between size and min_size.


> I don't see how that would change even if 2 of 3 ODS fail at exactly
> the same time.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-21 Thread Kjetil Jørgensen
Hi,

On Tue, Mar 21, 2017 at 11:59 AM, Adam Carheden <carhe...@ucar.edu> wrote:

> Let's see if I got this. 4 host cluster. size=3, min_size=2. 2 hosts
> fail. Are all of the following accurate?
>
> a. An rdb is split into lots of objects, parts of which will probably
> exist on all 4 hosts.
>

Correct.


>
> b. Some objects will have 2 of their 3 replicas on 2 of the offline OSDs.
>
> Likely correct.


> c. Reads can continue from the single online OSD even in pgs that
> happened to have two of 3 osds offline.
>
>
Hypothetically (this is partially informed guessing on my part):
If the survivor happens to be the acting primary and it were up-to-date at
the time, it can in theory serve reads. (Only the primary serves reads.)

If the survivor weren't the acting primary - you don't have any guarantees
as to whether or not it had the most up-to-date version of any objects. I
don't know if enough state is tracked outside of the osds to make this
determination, but I doubt it (it feels costly to maintain).

Regardless of scenario - I'd guess - the PG is marked as down, and will stay
that way until you revive either of the deceased OSDs or you essentially
tell ceph that they're a lost cause and incur potential data loss over
that (see: ceph osd lost).

d. Writes hang for pgs that have 2 offline OSDs because CRUSH can't meet
> the min_size=2 constraint.
>

Correct.


> e. Rebalancing does not occur because with only two hosts online there
> is no way for CRUSH to meet the size=3 constraint even if it were to
> rebalance.
>

Partially correct, see c)

f. I/O can been restored by setting min_size=1.
>

See c)


> g. Alternatively, I/O can be restored by setting size=2, which would
> kick off rebalancing and restored I/O as the pgs come into compliance
> with the size=2 constraint.
>

See c)


> h. If I instead have a cluster with 10 hosts, size=3 and min_size=2 and
> two hosts fail, some pgs would have only 1 OSD online, but rebalancing
> would start immediately since CRUSH can honor the size=3 constraint by
> rebalancing. This means more nodes makes for a more reliable cluster.
>

See c)

Side-note: This is where you start using crush to enumerate what you'd
consider the likely failure domains for concurrent failures. E.g. you have
racks with distinct power circuits and TOR switches, so your more likely
large-scale failure is a rack, and you tell crush to maintain replicas in
distinct racks.
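
In case it helps to see it concretely, a rough sketch of what that looks
like on the CLI (bucket/rule names and the rule id are made up, and on
jewel the pool property is still called crush_ruleset):

# describe the physical layout to crush
ceph osd crush add-bucket rack1 rack
ceph osd crush move rack1 root=default
ceph osd crush move host1 rack=rack1          # repeat for the other hosts/racks

# one replica per rack, then point the pool at the new rule
ceph osd crush rule create-simple replicated-racks default rack
ceph osd pool set rbd crush_ruleset 1         # rule id from "ceph osd crush rule dump"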

i. If I wanted to force CRUSH to bring I/O back online with size=3 and
> min_size=2 but only 2 hosts online, I could remove the host bucket from
> the crushmap. CRUSH would then rebalance, but some PGs would likely end
> up with 3 OSDs all on the same host. (This is theory. I promise not to
> do any such thing to a production system ;)
>

Partially correct, see c).



> Thanks
> --
> Adam Carheden
>
>
> On 03/21/2017 11:48 AM, Wes Dillingham wrote:
> > If you had set min_size to 1 you would not have seen the writes pause. a
> > min_size of 1 is dangerous though because it means you are 1 hard disk
> > failure away from losing the objects within that placement group
> > entirely. a min_size of 2 is generally considered the minimum you want
> > but many people ignore that advice, some wish they hadn't.
> >
> > On Tue, Mar 21, 2017 at 11:46 AM, Adam Carheden <carhe...@ucar.edu
> > <mailto:carhe...@ucar.edu>> wrote:
> >
> > Thanks everyone for the replies. Very informative. However, should I
> > have expected writes to pause if I'd had min_size set to 1 instead
> of 2?
> >
> > And yes, I was under the false impression that my rdb devices was a
> > single object. That explains what all those other things are on a
> test
> > cluster where I only created a single object!
> >
> >
> > --
> > Adam Carheden
> >
> > On 03/20/2017 08:24 PM, Wes Dillingham wrote:
> > > This is because of the min_size specification. I would bet you
> have it
> > > set at 2 (which is good).
> > >
> > > ceph osd pool get rbd min_size
> > >
> > > With 4 hosts, and a size of 3, removing 2 of the hosts (or 2
> drives 1
> > > from each hosts) results in some of the objects only having 1
> replica
> > > min_size dictates that IO freezes for those objects until min_size
> is
> > > achieved. http://docs.ceph.com/docs/jewel/rados/operations/pools/#
> set-the-number-of-object-replicas
> > <http://docs.ceph.com/docs/jewel/rados/operations/pools/#
> set-the-number-of-object-replicas>
> > >
> > > I cant tell if your under the impression that your RBD device is a
> > > single objec

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Kjetil Jørgensen
Hi,

rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
you a "prefix", which then gets you on to rbd_header.<prefix>.
rbd_header.<prefix> contains block size, striping, etc. The actual
data-bearing objects will be named something like rbd_data.<prefix>.%016x.

Example - vm-100-disk-1 has the prefix 86ce2ae8944a; the first object of
that image will be named rbd_data.86ce2ae8944a.0000000000000000, the second
object will be rbd_data.86ce2ae8944a.0000000000000001, and so on. Chances
are that one of these objects is mapped to a pg which has both host3 and
host4 among its replicas.

An rbd image will end up scattered across most/all osds of the pool it's in.
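
To see which pg/osds any one of those data objects maps to (reusing the
prefix above; the output line is illustrative):

ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000
# -> osdmap e1043 pool 'rbd' (0) object '...' -> pg 0.xxxx -> up ([11,5,3], p11) acting ([11,5,3], p11)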

Cheers,
-KJ

On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  wrote:

> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
>
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance just as
> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
> have quorum, so that shouldn't be an issue. The placement group still
> has 2 of its 3 replicas online.
>
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive.
>
>
> Size?
> # ceph osd pool get rbd size
> size: 3
>
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>
> # ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308 room A
> -3 1.88754 host host1
>  4 0.40369 osd.4   up  1.0  1.0
>  5 0.40369 osd.5   up  1.0  1.0
>  6 0.54008 osd.6   up  1.0  1.0
>  7 0.54008 osd.7   up  1.0  1.0
> -2 3.61554 host host2
>  0 0.90388 osd.0   up  1.0  1.0
>  1 0.90388 osd.1   up  1.0  1.0
>  2 0.90388 osd.2   up  1.0  1.0
>  3 0.90388 osd.3   up  1.0  1.0
> -6 2.55852 room B
> -4 1.75114 host host3
>  8 0.40369 osd.8   up  1.0  1.0
>  9 0.40369 osd.9   up  1.0  1.0
> 10 0.40369 osd.10  up  1.0  1.0
> 11 0.54008 osd.11  up  1.0  1.0
> -5 0.80737 host host4
> 12 0.40369 osd.12  up  1.0  1.0
> 13 0.40369 osd.13  up  1.0  1.0
>
>
> --
> Adam Carheden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs-data-scan scan_links cross version from master on jewel ?

2017-01-12 Thread Kjetil Jørgensen
Hi,

I want/need cephfs-data-scan scan_links; it's in master, although we're
currently on jewel (10.2.5). Am I better off cherry-picking the relevant
commit onto the jewel branch rather than just using master?
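
In case it's useful, the cherry-pick route I have in mind is roughly the
following (the commit sha is a placeholder - whatever git log turns up for
the scan_links change):

git clone https://github.com/ceph/ceph.git && cd ceph
git checkout -b jewel-scan-links v10.2.5
git log --oneline origin/master -- src/tools/cephfs/DataScan.cc   # locate the scan_links commit(s)
git cherry-pick <sha>                                             # placeholder
# then rebuild the cephfs tools as usual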

Cheers,
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] jewel/ceph-osd/filestore: Moving omap to separate filesystem/device

2016-12-08 Thread Kjetil Jørgensen
Hi,

so - we're considering moving omap out to faster media than our rather slow
spinning rust. There's been some discussion around this here:
https://github.com/ceph/ceph/pull/6421

Since this hasn't landed in jewel, or the ceph-disk convenience bits -
we're thinking of "other ways" of doing this.

We're considering (ab)using partition UUIDs again, and symlinking it into
place. This should at least catch the case where the "omap partition"
wasn't mounted, and tentatively have the OSD bail for that reason before
accidentally doing something bad; we'd then amend the upstart unit to
depend on the omap filesystem being mounted.
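
For the record, the crude version of what we're contemplating looks roughly
like this per OSD (osd id, device and paths are illustrative, and it assumes
filestore keeps its leveldb under current/omap):

stop ceph-osd id=12                        # upstart; adjust for your init system
mkfs.xfs /dev/disk/by-partuuid/<omap-part-uuid>
mkdir -p /var/lib/ceph/osd/omap-12
mount /dev/disk/by-partuuid/<omap-part-uuid> /var/lib/ceph/osd/omap-12
rsync -a /var/lib/ceph/osd/ceph-12/current/omap/ /var/lib/ceph/osd/omap-12/
mv /var/lib/ceph/osd/ceph-12/current/omap /var/lib/ceph/osd/ceph-12/current/omap.old
ln -s /var/lib/ceph/osd/omap-12 /var/lib/ceph/osd/ceph-12/current/omap
start ceph-osd id=12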

Side-note, the question in
https://github.com/ceph/ceph/pull/6421#issuecomment-152807595 regarding
syncfs() and omap / objects being on separate devices, is this a legitimate
concern ?

If others have done something similar, I'd be happy to hear any
experiences, great successes, failures or anything in between.

Cheers,
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread Kjetil Jørgensen
Hi

On Fri, Oct 7, 2016 at 6:31 AM, Yan, Zheng <uker...@gmail.com> wrote:

> On Fri, Oct 7, 2016 at 8:20 AM, Kjetil Jørgensen <kje...@medallia.com>
> wrote:
> > And - I just saw another recent thread -
> > http://tracker.ceph.com/issues/17177 - can be an explanation of
> most/all of
> > the above ?
> >
> > Next question(s) would then be:
> >
> > How would one deal with duplicate stray(s)
>
> Here is an untested method
>
> list omap keys in objects 600. ~ 609.. find all duplicated
> keys
>
> for each duplicated keys, use ceph-dencoder to decode their values,
> find the one has the biggest version and delete the rest
> (ceph-dencoder type inode_t skip 9 import /tmp/ decode dump_json)


If I do this - should I turn off any active ceph-mds while/when doing so ?

Cheers,
-- 
Kjetil Joergensen <kje...@medallia.com>
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-07 Thread Kjetil Jørgensen
On Fri, Oct 7, 2016 at 4:46 AM, John Spray <jsp...@redhat.com> wrote:

> On Fri, Oct 7, 2016 at 1:05 AM, Kjetil Jørgensen <kje...@medallia.com>
> wrote:
> > Hi,
> >
> > context (i.e. what we're doing): We're migrating (or trying to) migrate
> off
> > of an nfs server onto cephfs, for a workload that's best described as
> "big
> > piles" of hardlinks. Essentially, we have a set of "sources":
> > foo/01/
> > foo/0b/<0b>
> > .. and so on
> > bar/02/..
> > bar/0c/..
> > .. and so on
> >
> > foo/bar/friends have been "cloned" numerous times to a set of names that
> > over the course of weeks end up being recycled again, the clone is
> > essentially cp -L foo copy-1-of-foo.
> >
> > We're doing "incremental" rsyncs of this onto cephfs, so the sense of
> "the
> > original source of the hardlink" will end up moving around, depending on
> the
> > whims of rsync. (if it matters, I found some allusion to "if the original
> > file hardlinked is deleted, ...".
>
> This might not be much help but... have you thought about making your
> application use hardlinks less aggressively?  They have an intrinsinc
> overhead in any system that stores inodes locally to directories (like
> we do) because you have to take an extra step to resolve them.
>
>
Under "normal" circumstances, this isn't "all that bad", the serious
hammering is
coming from trying migrate to cephfs, where I think we've for the time being
abandoned using hardlinks and take the space-penalty for now. Under "normal"
circumstances it isn't that bad (if my nfs-server stats is to be believed,
it's between
5e5 - and 1.5e6 hardlinks created and unlinked per day, it actually seems a
bit low).


> In CephFS, resolving a hard link involves reading the dentry (where we
> would usually have the inode inline), and then going and finding an
> object from the data pool by the inode number, reading the "backtrace"
> (i.e.path) from that object and then going back to the metadata pool
> to traverse that path.  It's all very fast if your metadata fits in
> your MDS cache, but will slow down a lot otherwise, especially as your
> metadata IOs are now potentially getting held up by anything hammering
> your data pool.
>
> By the way, if your workload is relatively little code and you can
> share it, it sounds like it would be a useful hardlink stress test for
> our test suite


I'll let you know if I manage to reproduce, I'm on-and-off-again trying to
tease this
out on a separate ceph cluster with a "synthetic" load that's close to
equivalent.


> ...
>
> > For RBD the ceph cluster have mostly been rather well behaved, the
> problems
> > we have had have for the most part been self-inflicted. Before
> introducing
> > the hardlink spectacle to cephfs, the same filesystem were used for
> > light-ish read-mostly loads, beint mostly un-eventful. (That being said,
> we
> > did patch it for
> >
> > Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f
> 0ae071bd06),
> > clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
> >
> > The problems we're facing:
> >
> > Maybe a "non-problem" I have ~6M strays sitting around
>
> So as you hint above, when the original file is deleted, the inode
> goes into a stray dentry.  The next time someone reads the file via
> one of its other links, the inode gets "reintegrated" (via
> eval_remote_stray()) into the dentry it was read from.
>
> > Slightly more problematic, I have duplicate stray(s) ? See log excercepts
> > below. Also; rados -p cephfs_metadata listomapkeys 60X. did/does
> > seem to agree with there being duplicate strays (assuming 60X. is
> > the directory indexes for the stray catalogs), caveat "not a perfect
> > snapshot", listomapkeys issued in serial fashion.
> > We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here
> for
> > more context)
>
> When you say you stumbled across it, do you mean that you actually had
> this same deep scrub error on your system, or just that you found the
> ticket?


No - we have done "ceph pg repair", as we did end up with single degraded
objects in the metadata pool during heavy rsync of "lots of hardlinks".


> > There's been a couple of instances of invalid backtrace(s), mostly
> solved by
> > either mds:scrub_path or just unlinking the files/directories in question
> > and re-rsync-ing.
> >
> > mismatch between head items and fnode.fragstat (See below for more of the
> > log excercep

Re: [ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-06 Thread Kjetil Jørgensen
And - I just saw another recent thread - http://tracker.ceph.com/issues/17177
- could that be an explanation of most/all of the above?

Next question(s) would then be:

   - How would one deal with duplicate stray(s)
   - How would one deal with a mismatch between head items and
   fnode.fragstat - ceph daemon mds.foo scrub_path (see the sketch below)?
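
For the fragstat mismatch, what ended up working for us elsewhere was
roughly the following via the admin socket (mds name and path are examples,
and my recollection of the exact scrub_path option spelling may be slightly
off):

ceph daemon mds.foo flush journal
ceph daemon mds.foo scrub_path /the/affected/dir recursive repair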

-KJ

On Thu, Oct 6, 2016 at 5:05 PM, Kjetil Jørgensen <kje...@medallia.com>
wrote:

> Hi,
>
> context (i.e. what we're doing): We're migrating (or trying to migrate)
> off of an nfs server onto cephfs, for a workload that's best described as
> "big piles" of hardlinks. Essentially, we have a set of "sources":
> foo/01/
> foo/0b/<0b>
> .. and so on
> bar/02/..
> bar/0c/..
> .. and so on
>
> foo/bar/friends have been "cloned" numerous times to a set of names that
> over the course of weeks end up being recycled again, the clone is
> essentially cp -L foo copy-1-of-foo.
>
> We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
> original source of the hardlink" will end up moving around, depending on
> the whims of rsync. (if it matters, I found some allusion to "if the
> original file hardlinked is deleted, ...".
>
> For RBD the ceph cluster has mostly been rather well behaved; the
> problems we have had have for the most part been self-inflicted. Before
> introducing the hardlink spectacle to cephfs, the same filesystem was used
> for light-ish read-mostly loads, being mostly uneventful. (That being
> said, we did patch it for
>
> Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
> clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.
>
> The problems we're facing:
>
>- Maybe a "non-problem" I have ~6M strays sitting around
>- Slightly more problematic, I have duplicate stray(s) ? See log
>excercepts below. Also; rados -p cephfs_metadata listomapkeys 60X.
>did/does seem to agree with there being duplicate strays (assuming
>60X. is the directory indexes for the stray catalogs), caveat "not
>a perfect snapshot", listomapkeys issued in serial fashion.
>- We stumbled across (http://tracker.ceph.com/issues/17177 - mostly
>here for more context)
>- There's been a couple of instances of invalid backtrace(s), mostly
>solved by either mds:scrub_path or just unlinking the files/directories in
>question and re-rsync-ing.
>- mismatch between head items and fnode.fragstat (See below for more
>of the log excercept), appeared to have been solved by mds:scrub_path
>
>
> Duplicate stray(s), ceph-mds complains (a lot, during rsync):
> 2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
>  badness: got (but i already had) [inode 10003f25eaf [...2,head]
> ~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
> (iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.00
> 2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR]
> : loaded dup inode 10003f25eaf [2,head] v36792929 at
> ~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already
> exists at ~mds0/stray0/10003f25eaf
>
> I briefly ran ceph-mds with debug_mds=20/20 which didn't yield anything
> immediately useful, beyond slightly-easier-to-follow the control-flow
> of src/mds/CDir.cc without becoming much wiser.
> 2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
> pos 310473 marker 'I' dname '100022e8617 [2,head]
> 2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
> (head, '100022e8617')
> 2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
> (10002a81c10,head)
> 2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
>  badness: got (but i already had) [inode 100022e8617 [...2,head]
> ~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
> (iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.00
> 2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR]
> : loaded dup inode 100022e8617 [2,head] v39284583 at
> ~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already
> exists at ~mds0/stray9/100022e8617
>
>
> 2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> mismatch between head items and fnode.fragstat! printing dentries
> 2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> get_num_head_items() = 36; fnode.fragstat.nfiles=53
> fnode.fragstat.nsubdirs=0
> 2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
> mismatch between child accounted_rstats and my rstats!
> 2016-09-25 06:23:50.94780

[ceph-users] jewel/CephFS - misc problems (duplicate strays, mismatch between head items and fnode.fragst)

2016-10-06 Thread Kjetil Jørgensen
Hi,

context (i.e. what we're doing): We're migrating (or trying to migrate) off
of an nfs server onto cephfs, for a workload that's best described as "big
piles" of hardlinks. Essentially, we have a set of "sources":
foo/01/
foo/0b/<0b>
.. and so on
bar/02/..
bar/0c/..
.. and so on

foo/bar/friends have been "cloned" numerous times to a set of names that
over the course of weeks end up being recycled again; the clone is
essentially cp -L foo copy-1-of-foo.

We're doing "incremental" rsyncs of this onto cephfs, so the sense of "the
original source of the hardlink" will end up moving around, depending on
the whims of rsync. (If it matters, I found some allusion to "if the
original file hardlinked is deleted, ...".)

For RBD the ceph cluster has mostly been rather well behaved; the problems
we have had have for the most part been self-inflicted. Before introducing
the hardlink spectacle to cephfs, the same filesystem was used for
light-ish read-mostly loads, being mostly uneventful. (That being said, we
did patch it for

Cluster is v10.2.2 (mds v10.2.2+4d15eb12298e007744486e28924a6f0ae071bd06),
clients are ubuntu's 4.4.0-32 kernel(s), and elrepo v4.4.4.

The problems we're facing:

   - Maybe a "non-problem" I have ~6M strays sitting around
   - Slightly more problematic, I have duplicate stray(s) ? See log
   excercepts below. Also; rados -p cephfs_metadata listomapkeys 60X.
   did/does seem to agree with there being duplicate strays (assuming
   60X. is the directory indexes for the stray catalogs), caveat "not
   a perfect snapshot", listomapkeys issued in serial fashion.
   - We stumbled across (http://tracker.ceph.com/issues/17177 - mostly here
   for more context)
   - There's been a couple of instances of invalid backtrace(s), mostly
   solved by either mds:scrub_path or just unlinking the files/directories in
   question and re-rsync-ing.
   - mismatch between head items and fnode.fragstat (See below for more of
   the log excercept), appeared to have been solved by mds:scrub_path


Duplicate stray(s), ceph-mds complains (a lot, during rsync):
2016-09-30 20:00:21.978314 7ffb653b8700  0 mds.0.cache.dir(603) _fetched
 badness: got (but i already had) [inode 10003f25eaf [...2,head]
~mds0/stray0/10003f25eaf auth v38836572 s=8998 nl=5 n(v0 b8998 1=1+0)
(iversion lock) 0x561082e6b520] mode 33188 mtime 2016-07-25 03:02:50.00
2016-09-30 20:00:21.978336 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
loaded dup inode 10003f25eaf [2,head] v36792929 at
~mds0/stray3/10003f25eaf, but inode 10003f25eaf.head v38836572 already
exists at ~mds0/stray0/10003f25eaf

I briefly ran ceph-mds with debug_mds=20/20 which didn't yield anything
immediately useful, beyond slightly-easier-to-follow the control-flow
of src/mds/CDir.cc without becoming much wiser.
2016-09-30 20:43:51.910754 7ffb653b8700 20 mds.0.cache.dir(606) _fetched
pos 310473 marker 'I' dname '100022e8617 [2,head]
2016-09-30 20:43:51.910757 7ffb653b8700 20 mds.0.cache.dir(606) lookup
(head, '100022e8617')
2016-09-30 20:43:51.910759 7ffb653b8700 20 mds.0.cache.dir(606)   miss ->
(10002a81c10,head)
2016-09-30 20:43:51.910762 7ffb653b8700  0 mds.0.cache.dir(606) _fetched
 badness: got (but i already had) [inode 100022e8617 [...2,head]
~mds0/stray9/100022e8617 auth v39303851 s=11470 nl=10 n(v0 b11470 1=1+0)
(iversion lock) 0x560c013904b8] mode 33188 mtime 2016-07-25 03:38:01.00
2016-09-30 20:43:51.910772 7ffb653b8700 -1 log_channel(cluster) log [ERR] :
loaded dup inode 100022e8617 [2,head] v39284583 at
~mds0/stray6/100022e8617, but inode 100022e8617.head v39303851 already
exists at ~mds0/stray9/100022e8617


2016-09-25 06:23:50.947761 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
mismatch between head items and fnode.fragstat! printing dentries
2016-09-25 06:23:50.947779 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
get_num_head_items() = 36; fnode.fragstat.nfiles=53
fnode.fragstat.nsubdirs=0
2016-09-25 06:23:50.947782 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
mismatch between child accounted_rstats and my rstats!
2016-09-25 06:23:50.947803 7ffb653b8700  1 mds.0.cache.dir(10003439a33)
total of child dentrys: n(v0 b19365007 36=36+0)
2016-09-25 06:23:50.947806 7ffb653b8700  1 mds.0.cache.dir(10003439a33) my
rstats:  n(v2 rc2016-08-28 04:48:37.685854 b49447206 53=53+0)

The slightly sad thing is - I suspect all of this is probably from
something that "happened at some time in the past", and running mds with
debugging will make my users very unhappy as writing/formatting all that
log is not exactly cheap. (debug_mds=20/20, quickly ended up with mds
beacon marked as laggy).

Bonus question: In terms of "understanding how cephfs works", is
doc/dev/mds_internals it? :) Given that making "minimal reproducible
test-cases" so far is turning out to be quite elusive from the "top down"
approach, I'm finding myself looking inside the box to try to figure out
how we got where we are.

(And many thanks for ceph-dencoder, it satisfies my 

[ceph-users] HitSet - memory requirement

2016-08-31 Thread Kjetil Jørgensen
Hi,

http://docs.ceph.com/docs/master/rados/operations/cache-tiering/ states

>
> Note A larger hit_set_count results in more RAM consumed by the ceph-osd
> process.


By how much - what order of magnitude - KB? MB? GB?

After some spelunking - there's osd_hit_set_max_size; is it fair to assume
that we're approximately upper-bounded by
(osd_hit_set_max_size + change) * number-of-hit-sets?
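
For context, these are the knobs I'm looking at, in case someone wants to
check their own cluster (the cache pool name is a placeholder):

ceph osd pool get cache-pool hit_set_count
ceph osd pool get cache-pool hit_set_period
ceph daemon osd.0 config get osd_hit_set_max_size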

-KJ
-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-24 Thread Kjetil Jørgensen
It sounds slightly similar to what I just experienced.

I had one monitor out of three which seemed to essentially run one core at
full tilt continuously, and had its virtual address space grown to the
point where top started calling it Tb. Requests hitting this monitor did
not get very timely responses (although I don't know if this was
happening consistently or arbitrarily).

I ended up re-building the monitor from the two healthy ones I had, which
made the problem go away for me.

After-the-fact inspection of the monitor I ripped out clocked it in at
1.3Gb, compared to the 250Mb of the other two; after the rebuild they're all
comparable in size.

In my case this started out on firefly, and persisted after upgrading to
hammer, which prompted the rebuild, suspecting that it was related to
something persistent for this monitor.

I do not have that much more useful to contribute to this discussion, since
I've more-or-less destroyed any evidence by re-building the monitor.

Cheers,
KJ

On Fri, Jul 24, 2015 at 1:55 PM, Luis Periquito periqu...@gmail.com wrote:

 The leveldb is smallish: around 70mb.

 I ran debug mon = 10 for a while,  but couldn't find any interesting
 information. I would run out of space quite quickly though as the log
 partition only has 10g.
 On 24 Jul 2015 21:13, Mark Nelson mnel...@redhat.com wrote:

 On 07/24/2015 02:31 PM, Luis Periquito wrote:

 Now it's official,  I have a weird one!

 Restarted one of the ceph-mons with jemalloc and it didn't make any
 difference. It's still using a lot of cpu and still not freeing up
 memory...

 The issue is that the cluster almost stops responding to requests, and
 if I restart the primary mon (that had almost no memory usage nor cpu)
 the cluster goes back to its merry way responding to requests.

 Does anyone have any idea what may be going on? The worst bit is that I
 have several clusters just like this (well they are smaller), and as we
 do everything with puppet, they should be very similar... and all the
 other clusters are just working fine, without any issues whatsoever...


 We've seen cases where leveldb can't compact fast enough and memory
 balloons, but it's usually associated with extreme CPU usage as well. It
 would be showing up in perf though if that were the case...


 On 24 Jul 2015 10:11, Jan Schermer j...@schermer.cz wrote:

 You don’t (shouldn’t) need to rebuild the binary to use jemalloc. It
 should be possible to do something like

 LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.1 ceph-osd …

 The last time we tried it segfaulted after a few minutes, so YMMV
 and be careful.

 Jan

  On 23 Jul 2015, at 18:18, Luis Periquito periqu...@gmail.com wrote:

 Hi Greg,

 I've been looking at the tcmalloc issues, but did seem to affect
 osd's, and I do notice it in heavy read workloads (even after the
 patch and
 increasing TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728). This
 is affecting the mon process though.

 looking at perf top I'm getting most of the CPU usage in mutex
 lock/unlock
   5.02%  libpthread-2.19.so    [.] pthread_mutex_unlock
   3.82%  libsoftokn3.so        [.] 0x0001e7cb
   3.46%  libpthread-2.19.so    [.] pthread_mutex_lock

 I could try to use jemalloc, are you aware of any built binaries?
 Can I mix a cluster with different malloc binaries?


  On Thu, Jul 23, 2015 at 10:50 AM, Gregory Farnum g...@gregs42.com wrote:

  On Thu, Jul 23, 2015 at 8:39 AM, Luis Periquito periqu...@gmail.com wrote:
  The ceph-mon is already taking a lot of memory, and I ran a
 heap stats
  
  MALLOC:   32391696 (   30.9 MiB) Bytes in use by
 application
  MALLOC: +  27597135872 (26318.7 MiB) Bytes in page heap
 freelist
  MALLOC: + 16598552 (   15.8 MiB) Bytes in central cache
 freelist
  MALLOC: + 14693536 (   14.0 MiB) Bytes in transfer cache
 freelist
  MALLOC: + 17441592 (   16.6 MiB) Bytes in thread cache
 freelists
  MALLOC: +116387992 (  111.0 MiB) Bytes in malloc metadata
  MALLOC:   
  MALLOC: =  27794649240 (26507.0 MiB) Actual memory used
 (physical + swap)
  MALLOC: + 26116096 (   24.9 MiB) Bytes released to OS
 (aka unmapped)
  MALLOC:   
  MALLOC: =  27820765336 (26531.9 MiB) Virtual address space
 used
  MALLOC:
  MALLOC:   5683  Spans in use
  MALLOC: 21  Thread heaps in use
  MALLOC:   8192  Tcmalloc page size