Re: [ceph-users] rbd kernel client fencing

2017-04-25 Thread Kjetil Jørgensen
Hi,

On Wed, Apr 19, 2017 at 9:08 PM, Chaofan Yu  wrote:
> Thank you so much.
>
> The blacklist entries are stored in osd map, which is supposed to be tiny and 
> clean.
> So we are doing similar cleanups after reboot.

In the face of churn this won't necessarily matter much, as I believe some
osdmap history is stored, so the entries only eventually fall off. This may
also have improved; my bad experiences were from around Hammer.

> I’m quite interested in how the host commit suicide and reboot,

echo b >/proc/sysrq-trigger # This is about as brutal as it gets

The machine is blacklisted, so it has no hope of reading from or writing to
an rbd device.

There are a couple of caveats that come with this:
 - Your workload needs to structure its writes in such a way that it can
   recover from this kind of failure.
 - You need to engineer your workload so that it can tolerate a machine
   falling off the face of the earth (i.e. a combination of a workload
   scheduler like mesos/aurora/kubernetes and some HA where necessary).
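
For reference, the fence-and-take-over sequence discussed in this thread
boils down to something like the sketch below (the pool/image name and the
angle-bracket placeholders are illustrative, not taken from this thread):

  # on the new host: fence the old owner, then take over the image
  rbd lock list rbd/myimage                      # note the lock id and locker
  ceph osd blacklist add <old-client-ip>:0/0     # blocks every client from that ip
  rbd lock remove rbd/myimage <lock-id> <locker>
  rbd lock add rbd/myimage <lock-id>
  rbd map rbd/myimage
  # once the old host has rebooted cleanly:
  ceph osd blacklist rm <old-client-ip>:0/0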

> can you successfully umount the folder and unmap the rbd block device
>
> after it is blacklisted?
>
> I wonder whether the IO will hang and the umount process will stop at D state
>
> thus the host cannot be shutdown since it is waiting for the umount to finish

No, see previous comment.

> ==
>
> and now that the CentOS 7.3 kernel supports the exclusive lock feature,
>
> could anyone give out the new failover flow?

This may not be what you think it is, see i.e.:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-September/004857.html

(I can't really provide you with much more context; I've primarily registered
that it isn't meant for fencing image access. It's all about arbitrating
modification, in support of e.g. object-map.)

>
> Thanks.
>
>
>> On 20 Apr 2017, at 6:31 AM, Kjetil Jørgensen  wrote:
>>
>> Hi,
>>
>> As long as you blacklist the old owner by ip, you should be fine. Do
>> note that rbd lock remove implicitly also blacklists unless you also
>> pass rbd lock remove the --rbd_blacklist_on_break_lock=false option.
>> (that is I think "ceph osd blacklist add a.b.c.d interval" translates
>> into blacklisting a.b.c.d:0/0 - which should block every client with
>> source ip a.b.c.d).
>>
>> Regardless, I believe the client taking out the lock (rbd cli) and the
>> kernel client mapping the rbd will be different (port, nonce), so
>> specifically if it is possible to blacklist a specific client by (ip,
>> port, nonce) it wouldn't do you much good where you have different
>> clients dealing with the locking and doing the actual IO/mapping (rbd
>> cli and kernel).
>>
>> We do a variation of what you are suggesting, although additionally we
>> check for watches, if watched we give up and complain rather than
>> blacklist. If previous lock were held by my ip we just silently
>> reclaim. The hosts themselves run a process watching for
>> blacklistentries, and if they see themselves blacklisted they commit
>> suicide and re-boot. On boot, machine removes blacklist, reclaims any
>> locks it used to hold before starting the things that might map rbd
>> images. There's some warts in there, but for the most part it works
>> well.
>>
>> If you are going the fencing route - I would strongly advise you also
>> ensure your process don't end up with the possibility of cascading
>> blacklists, in addition to being highly disruptive, it causes osd(?)
>> map churn. (We accidentally did this - and ended up almost running our
>> monitors out of disk).
>>
>> Cheers,
>> KJ
>>
>> On Wed, Apr 19, 2017 at 2:35 AM, Chaofan Yu  wrote:
>>> Hi list,
>>>
>>>  I wonder someone can help with rbd kernel client fencing (aimed to avoid
>>> simultaneously rbd map on different hosts).
>>>
>>> I know the exclusive rbd image feature is added later to avoid manual rbd
>>> lock CLIs. But want to know previous blacklist solution.
>>>
>>> The official workflow I’ve got is listed below (without exclusive rbd
>>> feature) :
>>>
>>> - identify old rbd lock holder (rbd lock list )
>>> - blacklist old owner (ceph osd blacklist add )
>>> - break old rbd lock (rbd lock remove   )
>>> - lock rbd image on new host (rbd lock add  )
>>> - map rbd image on new host
>>>
>>>
>>> The blacklisted entry identified by entity_addr_t (ip, port, nonce).
>>>
>>> However as far as I know, ceph kernel client will do socket reconnection if
>>> connection failed. So I wonder in this scenario it won’t work:
>>>
>>> 1. old client network down for a while
>>> 2. perform below steps on new host to achieve failover
>>> - identify old rbd lock holder (rbd lock list )
>>>
>>> - blacklist old owner (ceph osd blacklist add )
>>> - break old rbd lock (rbd lock remove   )
>>> - lock rbd image on new host (rbd lock add  )
>>> - map rbd image on new host
>>>
>>> 3. old client network come back and reconnect to osds with new created
>>> socket client, i.e. new (ip, port,nonce) turple
>>>
>>> as a result both new and old client can write to the same image.

Re: [ceph-users] Race Condition(?) in CephFS

2017-04-25 Thread Patrick Donnelly
Hello Adam,

On Tue, Apr 25, 2017 at 5:32 PM, Adam Tygart  wrote:
> I'm using CephFS, on CentOS 7. We're currently migrating away from
> using a catch-all cephx key to mount the filesystem (with the kernel
> module), to a much more restricted key.
>
> In my tests, I've come across an issue, extracting a tar archive with
> a mount using the restricted key routinely cannot create files or
> directories in recently created directories. I need to keep running a
> CentOS based kernel on the clients because of some restrictions from
> other software. Below looks like a race condition to me, although I am
> not versed well enough in Ceph or the inner workings of the kernel to
> know for sure.
> [...]
>
> We're currently running Ceph Jewel (10.2.5). We're looking to update
> soon, but we wanted a clean backup of everything in CephFS first.

To me, this looks like: http://tracker.ceph.com/issues/17858

Fortunately you should only need to upgrade to 10.2.6 or 10.2.7 to fix this.

HTH,

-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Race Condition(?) in CephFS

2017-04-25 Thread Adam Tygart
I'm using CephFS, on CentOS 7. We're currently migrating away from
using a catch-all cephx key to mount the filesystem (with the kernel
module), to a much more restricted key.

In my tests, I've come across an issue, extracting a tar archive with
a mount using the restricted key routinely cannot create files or
directories in recently created directories. I need to keep running a
CentOS based kernel on the clients because of some restrictions from
other software. Below looks like a race condition to me, although I am
not versed well enough in Ceph or the inner workings of the kernel to
know for sure.

# tar xf gmp-6.1.2.tar.lz -C /homes/mozes/tmp/
tar: gmp-6.1.2/mpn/x86_64/mulx/adx/addmul_1.asm: Cannot open: Permission denied
tar: Exiting with failure status due to previous errors

This gets worse with tracing turned on in the kernel
(echo module ceph +p > /sys/kernel/debug/dynamic_debug/control).

# tar xf gmp-6.1.2.tar.lz -C /homes/mozes/tmp/
tar: gmp-6.1.2/mpn/x86_64/mulx/adx: Cannot mkdir: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/mul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/adx: Cannot mkdir: Permission denied
tar: gmp-6.1.2/mpn/x86_64/mulx/adx/addmul_1.asm: Cannot open: No such
file or directory
tar: gmp-6.1.2/mpn/x86_64/coreinhm/popcount.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/aorrlsh_n.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/sec_tabselect.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/redc_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreinhm/hamdist.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/copyd.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/copyi.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/popcount.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/gcd_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/nano/dive_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/redc_1.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mullo_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/fat_entry.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mod_1.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/gmp-mparam.h: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/redc_2.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/sqr_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/fat.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/fat/mul_basecase.c: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mullo_basecase.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mul_basecase.asm: Cannot open:
Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/aorsmul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/mul_1.asm: Cannot open: Permission denied
tar: gmp-6.1.2/mpn/x86_64/coreihwl/redc_1.asm: Cannot open: Permission denied

While extracting the same tar file using an unrestricted key works correctly.

I've got some kernel traces to share if anyone is interested.
https://people.beocat.ksu.edu/~mozes/ceph-20170425/

# uname -a
Linux eunomia 3.10.0-514.16.1.el7.x86_64 #1 SMP Wed Apr 12 15:04:24
UTC 2017 x86_64 x86_64 x86_64 GNU/Linux

We're currently running Ceph Jewel (10.2.5). We're looking to update
soon, but we wanted a clean backup of everything in CephFS first.

The new restricted key has these permissions:
caps mds = "allow r, allow rw path=/homes, allow rw
path=/bulk, allow rw path=/beocat"
caps mon = "allow r"
caps osd = "allow rw pool=scratch, allow rw pool=bulk, allow
rw pool=homes"

While the unrestricted key has these permissions:
caps mds = "allow"
caps mon = "allow *"
caps osd = "allow *"

I would appreciate any insights anyone might have.

Thanks,
Adam
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding New OSD Problem

2017-04-25 Thread Reed Dier
Others will likely be able to provide some better responses, but I’ll take a 
shot to see if anything makes sense.

With 10.2.6 you should be able to set 'osd scrub during recovery' to false to 
prevent any new scrubs from occurring during a recovery event. Current scrubs 
will complete, but future scrubs will not begin until recovery has completed.

Also, adding just one OSD on the new server, assuming all 6 are ready(?), will 
cause a good deal of unnecessary data reshuffling as you add more OSDs.
And on top of that, assuming the pool's crush ruleset is 'chooseleaf firstn 0 
type host', that will create a bit of an unbalanced weighting. Any reason 
you aren't bringing in all 6 OSDs at once?
You should be able to set the noscrub, nodeep-scrub, norebalance, nobackfill, and 
norecover flags (you probably also want noout to prevent rebalancing if OSDs flap), 
wait for scrubs to complete (especially deep scrubs), add your 6 OSDs, unset the 
recovery/rebalance/backfill flags, and it will then move data only once, and 
hopefully without the scrub load. After recovery, unset the scrub flags and 
you're back to normal; a sketch of that sequence follows below.
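
Roughly, as a sketch (the flag names are real, the ordering is just one way to
do it):

ceph osd set noout
ceph osd set noscrub
ceph osd set nodeep-scrub
ceph osd set norebalance
ceph osd set nobackfill
ceph osd set norecover
# ... wait for in-flight scrubs to finish, then add all 6 OSDs ...
ceph osd unset nobackfill
ceph osd unset norecover
ceph osd unset norebalance
# after recovery completes:
ceph osd unset noscrub
ceph osd unset nodeep-scrub
ceph osd unset noout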

Caveat, no VM’s running on my cluster, but those seem like low hanging fruit 
for possible load lightening during a rebalance.

Reed

> On Apr 25, 2017, at 3:47 PM, Ramazan Terzi  wrote:
> 
> Hello,
> 
> I have a Ceph Cluster with specifications below:
> 3 x Monitor node
> 6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
> journals)
> Distributed public and private networks. All NICs are 10Gbit/s
> osd pool default size = 3
> osd pool default min size = 2
> 
> Ceph version is Jewel 10.2.6.
> 
> Current health status:
>cluster 
> health HEALTH_OK
> monmap e9: 3 mons at 
> {ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
>election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
> osdmap e1512: 36 osds: 36 up, 36 in
>flags sortbitwise,require_jewel_osds
>  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
>83871 GB used, 114 TB / 196 TB avail
>1408 active+clean
> 
> My cluster is active and a lot of virtual machines running on it (Linux and 
> Windows VM's, database clusters, web servers etc).
> 
> When I want to add a new storage node with 1 disk, I'm getting huge problems. 
> With the new osd, the crushmap is updated and the Ceph cluster goes into recovery 
> mode. Everything is OK. But after a while, some running VMs became unmanageable. 
> Servers become unresponsive one by one. The recovery process would take an 
> average of 20 hours. For this reason, I removed the new osd. The recovery process 
> completed and everything became normal.
> 
> When new osd added, health status:
>cluster 
> health HEALTH_WARN
>91 pgs backfill_wait
>1 pgs backfilling
>28 pgs degraded
>28 pgs recovery_wait
>28 pgs stuck degraded
>recovery 2195/18486602 objects degraded (0.012%)
>recovery 1279784/18486602 objects misplaced (6.923%)
> monmap e9: 3 mons at 
> {ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
>election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
> osdmap e1512: 37 osds: 37 up, 37 in
>flags sortbitwise,require_jewel_osds
>  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
>83871 GB used, 114 TB / 201 TB avail
>2195/18486602 objects degraded (0.012%)
>1279784/18486602 objects misplaced (6.923%)
>1286 active+clean
>91 active+remapped+wait_backfill
>   28 active+recovery_wait+degraded
> 2 active+clean+scrubbing+deep
> 1 active+remapped+backfilling
> recovery io 430 MB/s, 119 objects/s
> client io 36174 B/s rd, 5567 kB/s wr, 5 op/s rd, 700 op/s wr
> 
> Some Ceph config parameters:
> osd_max_backfills = 1
> osd_backfill_full_ratio = 0.85
> osd_recovery_max_active = 3
> osd_recovery_threads = 1
> 
> How can I add new OSDs safely?
> 
> Best regards,
> Ramazan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph built from source gives Rados import error

2017-04-25 Thread Henry Ngo
Mine is at /usr/local/lib/x86_64-linux-gnu/librados.so.2

I moved the libraries to /usr/lib/x86_64-linux-gnu; however, I'm still
getting the error when running ceph -v:

$ ceph -v

Traceback (most recent call last):

  File "/usr/local/bin/ceph", line 106, in 

import rados

ImportError: No module named rados
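
(Presumably the ceph CLI, which is a Python script, just can't find the rados
Python binding that the build installed under /usr/local. A sketch of what I
would try next - the site-packages path is a guess, check where your build put
rados.so / rados.py:)

echo /usr/local/lib/x86_64-linux-gnu > /etc/ld.so.conf.d/local-ceph.conf
ldconfig
export PYTHONPATH=/usr/local/lib/python2.7/dist-packages:$PYTHONPATH
python -c 'import rados; print(rados.__file__)'   # sanity check
ceph -v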

On Fri, Apr 21, 2017 at 3:28 PM, Alvaro Soto  wrote:

> Henry,
> Because you compiled from source, I don't know where your librados is, but
> you can find out where it gets loaded from and look there; maybe the error is
> not that it isn't properly imported, maybe it just isn't where it's supposed to be.
>
> $ (strace ceph -s > myout) >& myerror
>
> $ cat myerror | grep librados.so.2
>
> In my case the output of the second command is the following, but I didn't
> build from source.
>
> open("/usr/lib/x86_64-linux-gnu/librados.so.2", O_RDONLY|O_CLOEXEC) = 3
>
>
> Did you try to find the library on your system, to see if it is in another
> location? My system's output:
>
> $ locate librados.so.2
>
> /usr/lib/x86_64-linux-gnu/librados.so.2
>
> /usr/lib/x86_64-linux-gnu/librados.so.2.0.0
>
>
> Best.
>
> On Fri, Apr 21, 2017 at 5:00 PM, Henry Ngo  wrote:
>
>> Hi all,
>>
>> I built from source and proceeded to do a manual deployment starting on
>> the Mon. I'm getting the error shown below and it appears that Rados has
>> not been properly imported. How do I fix this?
>>
>> Best,
>> Henry N.
>>
>> cephadmin@node1:/var/lib/ceph/mon/ceph-node1$ sudo /etc/init.d/ceph
>> start mon.node1
>>
>> cephadmin@node1:/var/lib/ceph/mon/ceph-node1$ ceph osd lspools
>>
>> Traceback (most recent call last):
>>
>>   File "/usr/local/bin/ceph", line 106, in 
>>
>> import rados
>>
>> ImportError: librados.so.2: cannot open shared object file: No such file
>> or directory
>>
>> cephadmin@node1:/var/lib/ceph/mon/ceph-node1$ which rados
>>
>> /usr/local/bin/rados
>>
>> cephadmin@node1:/var/lib/ceph/mon/ceph-node1$ ceph -v
>>
>> Traceback (most recent call last):
>>
>>   File "/usr/local/bin/ceph", line 106, in 
>>
>> import rados
>>
>> ImportError: librados.so.2: cannot open shared object file: No such file
>> or directory
>>
>> cephadmin@node1:/var/lib/ceph/mon/ceph-node1$
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
>
> ATTE. Alvaro Soto Escobar
>
> --
> Great people talk about ideas,
> average people talk about things,
> small people talk ... about other people.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding New OSD Problem

2017-04-25 Thread Ramazan Terzi
Hello,

I have a Ceph Cluster with specifications below:
3 x Monitor node
6 x Storage Node (6 disk per Storage Node, 6TB SATA Disks, all disks have SSD 
journals)
Distributed public and private networks. All NICs are 10Gbit/s
osd pool default size = 3
osd pool default min size = 2

Ceph version is Jewel 10.2.6.

Current health status:
cluster 
 health HEALTH_OK
 monmap e9: 3 mons at 
{ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
 osdmap e1512: 36 osds: 36 up, 36 in
flags sortbitwise,require_jewel_osds
  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
83871 GB used, 114 TB / 196 TB avail
1408 active+clean

My cluster is active and a lot of virtual machines running on it (Linux and 
Windows VM's, database clusters, web servers etc).

When I want to add a new storage node with 1 disk, I'm getting huge problems. 
With the new osd, the crushmap is updated and the Ceph cluster goes into recovery 
mode. Everything is OK. But after a while, some running VMs became unmanageable. 
Servers become unresponsive one by one. The recovery process would take an average 
of 20 hours. For this reason, I removed the new osd. The recovery process completed 
and everything became normal.

When new osd added, health status:
cluster 
 health HEALTH_WARN
91 pgs backfill_wait
 1 pgs backfilling
28 pgs degraded
28 pgs recovery_wait
 28 pgs stuck degraded
recovery 2195/18486602 objects degraded (0.012%)
recovery 1279784/18486602 objects misplaced (6.923%)
 monmap e9: 3 mons at 
{ceph-mon01=xxx:6789/0,ceph-mon02=xxx:6789/0,ceph-mon03=xxx:6789/0}
election epoch 84, quorum 0,1,2 ceph-mon01,ceph-mon02,ceph-mon03
 osdmap e1512: 37 osds: 37 up, 37 in
flags sortbitwise,require_jewel_osds
  pgmap v7698673: 1408 pgs, 5 pools, 37365 GB data, 9436 kobjects
83871 GB used, 114 TB / 201 TB avail
2195/18486602 objects degraded (0.012%)
1279784/18486602 objects misplaced (6.923%)
1286 active+clean
91 active+remapped+wait_backfill
   28 active+recovery_wait+degraded
 2 active+clean+scrubbing+deep
 1 active+remapped+backfilling
recovery io 430 MB/s, 119 objects/s
 client io 36174 B/s rd, 5567 kB/s wr, 5 op/s rd, 700 op/s wr

Some Ceph config parameters:
osd_max_backfills = 1
osd_backfill_full_ratio = 0.85
osd_recovery_max_active = 3
osd_recovery_threads = 1
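
(These can also be tightened at runtime while a backfill is running; a sketch,
the values are illustrative only:)

ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'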

How can I add new OSDs safely?

Best regards,
Ramazan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?

2017-04-25 Thread Martin Millnert
On Tue, Apr 25, 2017 at 03:39:42PM -0400, Gregory Farnum wrote:
> > I'd like to understand if "prio" in Jewel is as explained, i.e.
> > something similar to the following pseudo code:
> >
> >   if len(subqueue) > 0:
> > dequeue(subqueue)
> >   if tokens(global) > some_cost:
> > for queue in queues_high_to_low:
> >   if len(queue) > 0:
> > dequeue(queue)
> > tokens = tokens - some_other_cost
> >   else:
> > for queue in queues_low_to_high:
> >   if len(queue) > 0:
> > dequeue(queue)
> > tokens = tokens - some_other_cost
> >   tokens = min(tokens + some_refill_rate, max_tokens)
> 
> That looks about right.

OK, thanks for the validation. That indeed has an impact on the entire priority
queue under stress, then. (The WPQ motivation seems clear :) )

> > The objective is to increase servicing time of client IO, especially
> > read, while deep scrub is occuring. It doesn't matter for us if a
> > deep-scrub takes x or 3x time, essentially. More consistent latency
> > to clients is more important.
> 
> I don't have any experience with SMR drives so it wouldn't surprise me
> if there are some exciting emergent effects with them.

Basically a very large chunk of the disk area needs to be rewritten on each
write, so the write amplification factor of an inode update is just silly.
They have a PMR buffer area of approx 500 GB, but that area can run out
pretty fast under sustained IO over time (the exact buffer management
logic is not known).

> But it sounds
> to me like you want to start by adjusting the osd_scrub_priority
> (default 5) and osd_scrub_cost (default 50 << 20, ie 50MB). That will
> directly impact how they move through the queue in relation to client
> ops. (There are also the family of scrub scheduling options, which
> might make sense if you are more tolerant of slow IO at certain times
> of the day/week, but I'm not familiar with them).
> -Greg

Thanks for those pointers!  It seems from a distance that it's necessary
to use WPQ if it can be suspected that the IO scheduler is running
without available tokens (not sure how to verify *that*).


#ceph also helped point out that I'm indeed missing noatime,nodiratime
in the mount options. So every read is causing an inode update, which is
extremely expensive on SMR compared with a regular HDD (e.g. PMR).
(Not sure how I missed this when I set it up, because I've been aware of
noatime earlier :) )
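
(Concretely, something like the following per OSD mount - a sketch, the mount
point and fstab device are assumptions:)

mount -o remount,noatime,nodiratime /var/lib/ceph/osd/ceph-0
# and persist it in /etc/fstab, e.g.:
# /dev/sdb1  /var/lib/ceph/osd/ceph-0  xfs  noatime,nodiratime,attr2,inode64,noquota  0 0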

I think that's the first fix we'll want to do, and the biggest source of
trouble, and then look back in a week or so to see how it's doing then.
Then after that look into the various scrub-vs-client op scheduling
artefacts.

Thanks!

/M


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?

2017-04-25 Thread Gregory Farnum
On Tue, Apr 25, 2017 at 3:04 PM, Martin Millnert  wrote:
> Hi,
>
> experiencing significant impact from deep scrubs on Jewel.
> Started investigating OP priorities. We use default values on
> related/relevant OSD priority settings.
>
> "osd op queue" on
> http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
> states:  "The normal queue is different between implementations."
>
> So... in Jewel, where except code can I learn what is the queue
> behavior? Is there anyone who's familiar with it?
>
>
> I'd like to understand if "prio" in Jewel is as explained, i.e.
> something similar to the following pseudo code:
>
>   if len(subqueue) > 0:
> dequeue(subqueue)
>   if tokens(global) > some_cost:
> for queue in queues_high_to_low:
>   if len(queue) > 0:
> dequeue(queue)
> tokens = tokens - some_other_cost
>   else:
> for queue in queues_low_to_high:
>   if len(queue) > 0:
> dequeue(queue)
> tokens = tokens - some_other_cost
>   tokens = min(tokens + some_refill_rate, max_tokens)

That looks about right.

>
>
>
> The background, for anyone interested, is:
>
> If it is similar to above, this would explain extreme OSD commit
> latencies / client latency. My current theory is that the deep scrub
> quite possibly is consuming all available tokens, such that when a
> client op arrives, and priority(client_io) > priority([deep_]scrub), the
> prio queue essentially inverts and low priority ops get priority over
> high priority ops.
>
> The OSD:s are SMR but the question here is specifically not how they
> perform (we're quite intimately aware of their performance profiles),
> but how to tame Ceph to make cluster behave as good as possible in
> normal case.
>
> I put up some graphs on https://martin.millnert.se/ceph/jewel_prio/ :
>  - OSD Journal/Commit/Apply latencies show very strong correlation with
> ongoing deep scrubs.
>  - When latencies are low and noisy there's essentially no client IO
>happening.
>  - There is some evidence the write latency shoots through the roof --
>but there isn't much client write occuring... Possible Deep Scrub
>causes disk write IO?
>* mount opts used are:
> [...] type xfs (rw,relatime,seclabel,attr2,inode64,noquota)
>
> The objective is to increase servicing time of client IO, especially
> read, while deep scrub is occuring. It doesn't matter for us if a
> deep-scrub takes x or 3x time, essentially. More consistent latency
> to clients is more important.

I don't have any experience with SMR drives so it wouldn't surprise me
if there are some exciting emergent effects with them. But it sounds
to me like you want to start by adjusting the osd_scrub_priority
(default 5) and osd_scrub_cost (default 50 << 20, ie 50MB). That will
directly impact how they move through the queue in relation to client
ops. (There are also the family of scrub scheduling options, which
might make sense if you are more tolerant of slow IO at certain times
of the day/week, but I'm not familiar with them).
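
(For example, tweaking them at runtime could look like the sketch below - the
values are illustrative, not recommendations:)

ceph tell osd.* injectargs '--osd-scrub-priority 1 --osd-scrub-cost 1048576'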
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Deepscrub IO impact on Jewel: What is osd_op_queue prio implementation?

2017-04-25 Thread Martin Millnert
Hi,

We're experiencing significant impact from deep scrubs on Jewel.
I've started investigating op priorities. We use default values on the
related/relevant OSD priority settings.

"osd op queue" on
http://docs.ceph.com/docs/master/rados/configuration/osd-config-ref/#operations
states:  "The normal queue is different between implementations."

So... in Jewel, where other than the code can I learn what the queue
behavior is? Is there anyone who's familiar with it?


I'd like to understand if "prio" in Jewel is as explained, i.e.
something similar to the following pseudo code:

  if len(subqueue) > 0:
dequeue(subqueue)
  if tokens(global) > some_cost:
for queue in queues_high_to_low:
  if len(queue) > 0:
dequeue(queue)
tokens = tokens - some_other_cost
  else:
for queue in queues_low_to_high:
  if len(queue) > 0:
dequeue(queue)
tokens = tokens - some_other_cost
  tokens = min(tokens + some_refill_rate, max_tokens)



The background, for anyone interested, is:

If it is similar to above, this would explain extreme OSD commit
latencies / client latency. My current theory is that the deep scrub
quite possibly is consuming all available tokens, such that when a
client op arrives, and priority(client_io) > priority([deep_]scrub), the
prio queue essentially inverts and low priority ops get priority over
high priority ops.

The OSD:s are SMR but the question here is specifically not how they
perform (we're quite intimately aware of their performance profiles),
but how to tame Ceph to make cluster behave as good as possible in
normal case.

I put up some graphs on https://martin.millnert.se/ceph/jewel_prio/ :
 - OSD Journal/Commit/Apply latencies show very strong correlation with
ongoing deep scrubs.
 - When latencies are low and noisy there's essentially no client IO
   happening.
 - There is some evidence the write latency shoots through the roof --
   but there isn't much client write occurring... Possibly deep scrub
   causes disk write IO?
   * mount opts used are:
[...] type xfs (rw,relatime,seclabel,attr2,inode64,noquota)

The objective is to improve the servicing of client IO, especially
reads, while a deep scrub is occurring. It doesn't matter for us if a
deep-scrub takes x or 3x time, essentially. More consistent latency
to clients is more important.

Best,
Martin Millnert


signature.asc
Description: PGP signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-25 Thread Adam Carheden
On 04/25/2017 11:57 AM, David wrote:
> On 19 Apr 2017 18:01, "Adam Carheden" wrote:
>
> Does anyone know if XFS uses a single thread to write to its journal?
> 
> 
> You probably know this but just to avoid any confusion, the journal in
> this context isn't the metadata journaling in XFS, it's a separate
> journal written to by the OSD daemons

Ha! I didn't know that.

> 
> I think the number of threads per OSD is controlled by the 'osd op
> threads' setting which defaults to 2

So the ideal (for performance) Ceph cluster would be one SSD per HDD,
with 'osd op threads' set to whatever value fio shows as the optimal
number of threads for that drive, then?

> I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
> consider going up to a 37xx and putting more OSDs on it. Of course with
> the caveat that you'll lose more OSDs if it goes down. 

Why would you avoid the SanDisk and Hynix? Reliability (I think those
two are both TLC)? Brand trust? If it's my benchmarks in my previous
email, why not the Hynix? It's slower than the Intel, but sort of
decent, at least compared to the SanDisk.

My final numbers are below, including an older Samsung Evo (MLC, I think)
which did horribly, though not as bad as the SanDisk. The Seagate is a
10kRPM SAS "spinny" drive I tested as a control/SSD-to-HDD comparison.

 SanDisk SDSSDA240G, fio  1 jobs:   7.0 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  2 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s (5 trials)
 SanDisk SDSSDA240G, fio  8 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 16 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s (5 trials)
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.2 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   0.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   7.5 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  17.6 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 16 jobs:  32.4 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  64.4 MB/s (5 trials)
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  71.6 MB/s (5 trials)
SAMSUNG SSD, fio  1 jobs:   2.2 MB/s (5 trials)
SAMSUNG SSD, fio  2 jobs:   3.9 MB/s (5 trials)
SAMSUNG SSD, fio  4 jobs:   7.1 MB/s (5 trials)
SAMSUNG SSD, fio  8 jobs:  12.0 MB/s (5 trials)
SAMSUNG SSD, fio 16 jobs:  18.3 MB/s (5 trials)
SAMSUNG SSD, fio 32 jobs:  25.4 MB/s (5 trials)
SAMSUNG SSD, fio 64 jobs:  26.5 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  1 jobs:  91.2 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  2 jobs: 132.4 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  4 jobs: 138.2 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio  8 jobs: 116.9 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 16 jobs:  61.8 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 32 jobs:  22.7 MB/s (5 trials)
INTEL SSDSC2BB150G7, fio 64 jobs:  16.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio  1 jobs:   0.7 MB/s (5 trials)
SEAGATE ST9300603SS, fio  2 jobs:   0.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio  4 jobs:   1.6 MB/s (5 trials)
SEAGATE ST9300603SS, fio  8 jobs:   2.0 MB/s (5 trials)
SEAGATE ST9300603SS, fio 16 jobs:   4.6 MB/s (5 trials)
SEAGATE ST9300603SS, fio 32 jobs:   6.9 MB/s (5 trials)
SEAGATE ST9300603SS, fio 64 jobs:   0.6 MB/s (5 trials)

For those who come across this and are looking for drives for purposes
other than Ceph: those are all sequential write numbers with caching
disabled, a very Ceph-journal-specific test. The SanDisk held its own
against the Intel in some benchmarks on Windows that didn't disable
caching. It may very well be a perfectly good drive for other purposes.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph packages on stretch from eu.ceph.com

2017-04-25 Thread Ronny Aasen

Hello

i am trying to install ceph on debian stretch from

http://eu.ceph.com/debian-jewel/dists/

but there is no stretch repo there.

now with stretch being frozen, it is a good time to be testing ceph on 
stretch. Is it possible to get packages for stretch for jewel, kraken, 
and luminous?




kind regards

Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Sharing SSD journals and SSD drive choice

2017-04-25 Thread David
On 19 Apr 2017 18:01, "Adam Carheden"  wrote:

Does anyone know if XFS uses a single thread to write to its journal?


You probably know this but just to avoid any confusion, the journal in this
context isn't the metadata journaling in XFS, it's a separate journal
written to by the OSD daemons

I think the number of threads per OSD is controlled by the 'osd op threads'
setting which defaults to 2


I'm evaluating SSDs to buy as journal devices. I plan to have multiple
OSDs share a single SSD for journal.


I'm benchmarking several brands as described here:

https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/
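
(That test boils down to small synchronous sequential writes, roughly like the
sketch below - the device name is a placeholder and the run overwrites whatever
is on it:)

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k \
    --numjobs=4 --iodepth=1 --runtime=60 --time_based \
    --group_reporting --name=journal-test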

It appears that sequential write speed using multiple threads varies
widely between brands. Here's what I have so far:
 SanDisk SDSSDA240G, dd:6.8 MB/s
 SanDisk SDSSDA240G, fio  1 jobs:   6.7 MB/s
 SanDisk SDSSDA240G, fio  2 jobs:   7.4 MB/s
 SanDisk SDSSDA240G, fio  4 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio  8 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 16 jobs:   7.5 MB/s
 SanDisk SDSSDA240G, fio 32 jobs:   7.6 MB/s
 SanDisk SDSSDA240G, fio 64 jobs:   7.6 MB/s
HFS250G32TND-N1A2A 3P10, dd:1.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  1 jobs:   4.8 MB/s
HFS250G32TND-N1A2A 3P10, fio  2 jobs:   5.2 MB/s
HFS250G32TND-N1A2A 3P10, fio  4 jobs:   9.5 MB/s
HFS250G32TND-N1A2A 3P10, fio  8 jobs:  23.4 MB/s
HFS250G32TND-N1A2A 3P10, fio 16 jobs:   7.2 MB/s
HFS250G32TND-N1A2A 3P10, fio 32 jobs:  49.8 MB/s
HFS250G32TND-N1A2A 3P10, fio 64 jobs:  70.5 MB/s
INTEL SSDSC2BB150G7, dd:   90.1 MB/s
INTEL SSDSC2BB150G7, fio  1 jobs:  91.0 MB/s
INTEL SSDSC2BB150G7, fio  2 jobs: 108.3 MB/s
INTEL SSDSC2BB150G7, fio  4 jobs: 134.2 MB/s
INTEL SSDSC2BB150G7, fio  8 jobs: 118.2 MB/s
INTEL SSDSC2BB150G7, fio 16 jobs:  39.9 MB/s
INTEL SSDSC2BB150G7, fio 32 jobs:  25.4 MB/s
INTEL SSDSC2BB150G7, fio 64 jobs:  15.8 MB/s

The SanDisk is slow, but speed is the same at any number of threads. The
Intel peaks at 4-6 threads and then declines rapidly into sub-par
performance (at least for a pricey "enterprise" drive). The SK Hynix is
slow at low numbers of threads but gets huge performance gains with more
threads. (This is all with one trial, but I have a script running
multiple trials across all drives today.)


I would avoid the SanDisk and Hynix. The s3500 isn't too bad. Perhaps
consider going up to a 37xx and putting more OSDs on it. Of course with the
caveat that you'll lose more OSDs if it goes down.


So if XFS has a single thread that does journaling, it looks like my
best option would be 1 intel SSD shared by 4-6 OSDs. If XFS already
throws multiple threads at the journal, then having OSDs share an Intel
drive will likely kill my SSD performance, but having as many OSDs as I
can cram in a chassis share the SK Hynix drive would get me great
performance for a fraction of the cost.


I don't think the hynix is going to give you great performance with
multiple OSDs


Anyone have any related advice or experience to share regarding journal
SSD selection?


Need to know a bit more about the type of cluster you're planning to build,
number of nodes, type of OSD, workload etc.



--
Adam Carheden

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW 10.2.5->10.2.7 authentication fail?

2017-04-25 Thread Radoslaw Zarzynski
Hello Ben,

Could you provide the full RadosGW log for the failed request?
I mean the lines starting from the header listing, through the start
marker ("== starting new request...") till the end marker?

At the moment we can't see any details related to the signature
calculation.
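
(If it helps, the gateway's log verbosity for that capture can be raised with
something like this in ceph.conf on the affected host - the client section name
is an assumption, adjust to your instance - followed by a radosgw restart:)

[client.rgw.bbpsrvc15]
    debug rgw = 20
    debug ms = 1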

Regards,
Radek

On Thu, Apr 20, 2017 at 5:08 PM, Ben Morrice  wrote:
> Hi all,
>
> I have tried upgrading one of our RGW servers from 10.2.5 to 10.2.7 (RHEL7)
> and authentication is in a very bad state. This installation is part of a
> multigw configuration, and I have just updated one host in the secondary
> zone (all other hosts/zones are running 10.2.5).
>
> On the 10.2.7 server I cannot authenticate as a user (normally backed by
> OpenStack Keystone), but even worse I can also not authenticate with an
> admin user.
>
> Please see [1] for the results of performing a list bucket operation with
> python boto (script works against rgw 10.2.5)
>
> Also, if I try to authenticate from the 'master' rgw zone with a
> "radosgw-admin sync status --rgw-zone=bbp-gva-master" I get:
>
> "ERROR: failed to fetch datalog info"
>
> "failed to retrieve sync info: (13) Permission denied"
>
> The above errors correlates to the errors in the log on the server running
> 10.2.7 (debug level 20) at [2]
>
> I'm not sure what I have done wrong or can try next?
>
> By the way, downgrading the packages from 10.2.7 to 10.2.5 returns
> authentication functionality
>
> [1]
> boto.exception.S3ResponseError: S3ResponseError: 403 Forbidden
>  encoding="UTF-8"?>SignatureDoesNotMatchtx4-0058f8c86a-3fa2959-bbp-gva-secondary3fa2959-bbp-gva-secondary-bbp-gva
>
> [2]
> /bbpsrvc15.cscs.ch/admin/log
> 2017-04-20 16:43:04.916253 7ff87c6c0700 15 calculated
> digest=Ofg/f/NI0L4eEG1MsGk4PsVscTM=
> 2017-04-20 16:43:04.916255 7ff87c6c0700 15
> auth_sign=qZ3qsy7AuNCOoPMhr8yNoy5qMKU=
> 2017-04-20 16:43:04.916255 7ff87c6c0700 15 compare=34
> 2017-04-20 16:43:04.916266 7ff87c6c0700 10 failed to authorize request
> 2017-04-20 16:43:04.916268 7ff87c6c0700 20 handler->ERRORHANDLER:
> err_no=-2027 new_err_no=-2027
> 2017-04-20 16:43:04.916329 7ff87c6c0700  2 req 354:0.052585:s3:GET
> /admin/log:get_obj:op status=0
> 2017-04-20 16:43:04.916339 7ff87c6c0700  2 req 354:0.052595:s3:GET
> /admin/log:get_obj:http status=403
> 2017-04-20 16:43:04.916343 7ff87c6c0700  1 == req done
> req=0x7ff87c6ba710 op status=0 http_status=403 ==
> 2017-04-20 16:43:04.916350 7ff87c6c0700 20 process_request() returned -2027
> 2017-04-20 16:43:04.916390 7ff87c6c0700  1 civetweb: 0x7ff990015610:
> 10.80.6.26 - - [20/Apr/2017:16:43:04 +0200] "GET /admin/log HTTP/1.1" 403 0
> - -
> 2017-04-20 16:43:04.917212 7ff9777e6700 20
> cr:s=0x7ff97000d420:op=0x7ff9703a5440:18RGWMetaSyncShardCR: operate()
> 2017-04-20 16:43:04.917223 7ff9777e6700 20 rgw meta sync:
> incremental_sync:1544: shard_id=20
> mdlog_marker=1_1492686039.901886_5551978.1
> sync_marker.marker=1_1492686039.901886_5551978.1 period_marker=
> 2017-04-20 16:43:04.917227 7ff9777e6700 20 rgw meta sync:
> incremental_sync:1551: shard_id=20 syncing mdlog for shard_id=20
> 2017-04-20 16:43:04.917236 7ff9777e6700 20
> cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
> 2017-04-20 16:43:04.917238 7ff9777e6700 20 rgw meta sync: operate:
> shard_id=20: init request
> 2017-04-20 16:43:04.917240 7ff9777e6700 20
> cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
> 2017-04-20 16:43:04.917241 7ff9777e6700 20 rgw meta sync: operate:
> shard_id=20: reading shard status
> 2017-04-20 16:43:04.917303 7ff9777e6700 20 run: stack=0x7ff97000d420 is io
> blocked
> 2017-04-20 16:43:04.918285 7ff9777e6700 20
> cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
> 2017-04-20 16:43:04.918295 7ff9777e6700 20 rgw meta sync: operate:
> shard_id=20: reading shard status complete
> 2017-04-20 16:43:04.918307 7ff9777e6700 20 rgw meta sync: shard_id=20
> marker=1_1492686039.901886_5551978.1 last_update=2017-04-20
> 13:00:39.0.901886s
> 2017-04-20 16:43:04.918316 7ff9777e6700 20
> cr:s=0x7ff97000d420:op=0x7ff970066b80:24RGWCloneMetaLogCoroutine: operate()
> 2017-04-20 16:43:04.918317 7ff9777e6700 20 rgw meta sync: operate:
> shard_id=20: sending rest request
> 2017-04-20 16:43:04.918381 7ff9777e6700 20 RGWEnv::set(): HTTP_DATE: Thu Apr
> 20 14:43:04 2017
> 2017-04-20 16:43:04.918390 7ff9777e6700 20 > HTTP_DATE -> Thu Apr 20
> 14:43:04 2017
> 2017-04-20 16:43:04.918404 7ff9777e6700 10 get_canon_resource():
> dest=/admin/log
> 2017-04-20 16:43:04.918406 7ff9777e6700 10 generated canonical header: GET
>
> --
> Kind regards,
>
> Ben Morrice
>
> __
> Ben Morrice | e: ben.morr...@epfl.ch | t: +41-21-693-9670
> EPFL / BBP
> Biotech Campus
> Chemin des Mines 9
> 1202 Geneva
> Switzerland
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cg

Re: [ceph-users] inconsistent of pgs due to attr_value_mismatch

2017-04-25 Thread Lomayani S. Laizer
Hello,
I managed to resolve the issue. OSD 21 had corrupted data. I removed it from
the cluster and formatted the hard drive, then re-added it to the cluster.

After the backfill finished I ran repair again and that fixed the problem.
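
(For the record, the remove/re-add dance was roughly the following - a sketch;
osd.21 is from this thread, the device path and ceph-disk usage are assumptions:)

ceph osd out 21
systemctl stop ceph-osd@21
ceph osd crush remove osd.21
ceph auth del osd.21
ceph osd rm 21
# replace/wipe the disk, then re-create the OSD, e.g. with ceph-disk:
ceph-disk prepare /dev/sdf
ceph-disk activate /dev/sdf1
# once backfill completes, re-run repair on the affected pgs, e.g.:
ceph pg repair 7.67a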

--
Lomayani

On Tue, Apr 25, 2017 at 11:42 AM, Lomayani S. Laizer 
wrote:

> Hello,
> I'm having this error in my cluster: inconsistent pgs due to
> attr_value_mismatch. It looks like all the pgs having this error are hosting one
> vm with ID 3fb4c238e1f29. I'm using replication of 3 with a min of 2.
>
> Pg repair is not working. Any suggestions to resolve this issue would be appreciated.
> More logs are available at http://www.heypasteit.com/clip/0BOJ36
>
>  ceph health detail
> HEALTH_ERR 12 pgs inconsistent; 16 scrub errors
> pg 7.765 is active+clean+inconsistent, acting [16,21,3]
> pg 7.6e7 is active+clean+inconsistent, acting [12,21,4]
> pg 7.335 is active+clean+inconsistent, acting [8,17,21]
> pg 7.304 is active+clean+inconsistent, acting [14,6,21]
> pg 7.2e0 is active+clean+inconsistent, acting [21,17,6]
> pg 7.138 is active+clean+inconsistent, acting [11,17,21]
> pg 7.6c is active+clean+inconsistent, acting [21,11,14]
> pg 7.102 is active+clean+inconsistent, acting [21,5,12]
> pg 7.198 is active+clean+inconsistent, acting [14,11,21]
> pg 7.5fc is active+clean+inconsistent, acting [6,16,21]
> pg 7.65b is active+clean+inconsistent, acting [21,17,2]
> pg 7.67a is active+clean+inconsistent, acting [16,21,6]
>
> rados list-inconsistent-obj   7.67a  --format=json-pretty
> {
> "epoch": 5699,
> "inconsistents": [
> {
> "object": {
> "name": "rbd_data.3fb4c238e1f29.00017bef",
> "nspace": "",
> "locator": "",
> "snap": "head",
> "version": 346953
> },
> "errors": [
> "object_info_inconsistency",
> "attr_value_mismatch"
> ],
> "union_shard_errors": [],
> "selected_object_info": "7:5e76a45a:::rbd_data.3fb4c238e1f29.
> 00017bef:head(5640'346953 client.2930592.0:2368795
> dirty|omap_digest s 3792896 uv 346953 od )",
> "shards": [
> {
> "osd": 6,
> "errors": [],
> "size": 3792896,
> "object_info": "7:5e76a45a:::rbd_data.3fb4c238e1f29.
> 00017bef:head(5640'346953 client.2930592.0:2368795
> dirty|omap_digest s 3792896 uv 346953 od )",
> "attrs": [
>
>
> 2017-04-25 08:56:23.333835 7f8a0835e700 -1 log_channel(cluster) log [ERR]
> : 7.102 shard 21: soid 
> 7:4081eee7:::rbd_data.3fb4c238e1f29.00017b03:head
> size 3076096 != size 2633728 from auth oi 7:4081eee7:::rbd_data.
> 3fb4c238e1f29.00017b03:head(5640'990157 client.2930592.0:2367433
> dirty|omap_digest s 2633728 uv 990157 od ), size 3076096 != size
> 2633728 from shard 5, attr value mismatch '_'
>
> --
> Lomayani
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] best practices in connecting clients to cephfs public network

2017-04-25 Thread David Turner
In the past, I've made the "public" network another vlan that only includes the
servers that need to talk to the storage back end. That way you don't open it
up to anything that doesn't need it, and if a server that should only be on
restricted vlans needs to talk on it, you satisfy that as well.

On Tue, Apr 25, 2017 at 10:58 AM Ronny Aasen 
wrote:

> hello
>
> i want to connect 3 servers to cephfs. The servers are normally not in
> the public network.
> is it best practice to connect 2 interfaces on the servers to have the
> servers directly connected to the public network ?
> or to route between the networks, via their common default gateway.
>
> the machines are vm's so it's easy to add interfaces, and the servers
> lan and the clusters public networks is on the same router so it's also
> easy to route between them. there is a separate firewall in front of the
> routed networks so the security aspect is quite similar one way or the
> other.
>
>
> what is the recommended way to connect clients to the public network ?
>
>
> kind regards
>
> Ronny Aasen
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] best practices in connecting clients to cephfs public network

2017-04-25 Thread Ronny Aasen

hello

i want to connect 3 servers to cephfs. The servers are normally not in 
the public network.
is it best practice to connect 2 interfaces on the servers to have the 
servers directly connected to the public network ?

or to route between the networks, via their common default gateway.

the machines are vm's so it's easy to add interfaces, and the servers 
lan and the clusters public networks is on the same router so it's also 
easy to route between them. there is a separate firewall in front of the 
routed networks so the security aspect is quite similar one way or the 
other.



what is the recommended way to connect clients to the public network ?


kind regards

Ronny Aasen

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.0.2 Luminous (dev) released

2017-04-25 Thread Sage Weil
I think this commit just missed 12.0.2:

commit 32b1b0476ad0d6a50d84732ce96cda6ee09f6bec 
Author: Sage Weil 
Date:   Mon Apr 10 17:36:37 2017 -0400

mon/OSDMonitor: tolerate upgrade from post-kraken dev cluster

If the 'creating' pgs key is missing, move on without crashing.

Signed-off-by: Sage Weil 

You can cherry-pick that or run a mon built from the master branch.
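
(Cherry-picking onto a source checkout would look roughly like this - a sketch,
assuming a full git clone of ceph checked out at the v12.0.2 tag:)

git cherry-pick 32b1b0476ad0d6a50d84732ce96cda6ee09f6bec
./do_cmake.sh && cd build && make ceph-mon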

sage



On Tue, 25 Apr 2017, Dan van der Ster wrote:

> Created ticket to follow up: http://tracker.ceph.com/issues/19769
> 
> 
> 
> On Tue, Apr 25, 2017 at 11:34 AM, Dan van der Ster  
> wrote:
> > Could this change be the culprit?
> >
> > commit 973829132bf7206eff6c2cf30dd0aa32fb0ce706
> > Author: Sage Weil 
> > Date:   Fri Mar 31 09:33:19 2017 -0400
> >
> > mon/OSDMonitor: spinlock -> std::mutex
> >
> > I think spinlock is dangerous here: we're doing semi-unbounded
> > work (decode).  Also seemingly innocuous code like dout macros
> > take mutexes.
> >
> > Signed-off-by: Sage Weil 
> >
> >
> > diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
> > index 543338bdf3..6fa5e8de4b 100644
> > --- a/src/mon/OSDMonitor.cc
> > +++ b/src/mon/OSDMonitor.cc
> > @@ -245,7 +245,7 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
> >  bufferlist bl;
> >  mon->store->get(OSD_PG_CREATING_PREFIX, "creating", bl);
> >  auto p = bl.begin();
> > -std::lock_guard l(creating_pgs_lock);
> > +std::lock_guard l(creating_pgs_lock);
> >  creating_pgs.decode(p);
> >  dout(7) << __func__ << " loading creating_pgs e" <<
> > creating_pgs.last_scan_epoch << dendl;
> >}
> > ...
> >
> >
> > Cheers, Dan
> >
> >
> > On Tue, Apr 25, 2017 at 11:15 AM, Dan van der Ster  
> > wrote:
> >> Hi,
> >>
> >> The mon's on my test luminous cluster do not start after upgrading
> >> from 12.0.1 to 12.0.2. Here is the backtrace:
> >>
> >>  0> 2017-04-25 11:06:02.897941 7f467ddd7880 -1 *** Caught signal
> >> (Aborted) **
> >>  in thread 7f467ddd7880 thread_name:ceph-mon
> >>
> >>  ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
> >>  1: (()+0x797e7f) [0x7f467e58ce7f]
> >>  2: (()+0xf370) [0x7f467d18d370]
> >>  3: (gsignal()+0x37) [0x7f467a44f1d7]
> >>  4: (abort()+0x148) [0x7f467a4508c8]
> >>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f467ad539d5]
> >>  6: (()+0x5e946) [0x7f467ad51946]
> >>  7: (()+0x5e973) [0x7f467ad51973]
> >>  8: (()+0x5eb93) [0x7f467ad51b93]
> >>  9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
> >> char*)+0xa5) [0x7f467e2fc715]
> >>  10: (creating_pgs_t::decode(ceph::buffer::list::iterator&)+0x3c)
> >> [0x7f467e211e8c]
> >>  11: (OSDMonitor::update_from_paxos(bool*)+0x225a) [0x7f467e1cd16a]
> >>  12: (PaxosService::refresh(bool*)+0x1a5) [0x7f467e196335]
> >>  13: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f467e12953b]
> >>  14: (Monitor::init_paxos()+0x115) [0x7f467e129975]
> >>  15: (Monitor::preinit()+0x93d) [0x7f467e13b07d]
> >>  16: (main()+0x2518) [0x7f467e07f848]
> >>  17: (__libc_start_main()+0xf5) [0x7f467a43bb35]
> >>  18: (()+0x32671e) [0x7f467e11b71e]
> >>  NOTE: a copy of the executable, or `objdump -rdS ` is
> >> needed to interpret this.
> >>
> >> Cheers, Dan
> >>
> >>
> >> On Mon, Apr 24, 2017 at 5:49 PM, Abhishek Lekshmanan  
> >> wrote:
> >>> This is the third development checkpoint release of Luminous, the next
> >>> long term
> >>> stable release.
> >>>
> >>> Major changes from v12.0.1
> >>> --
> >>> * The original librados rados_objects_list_open (C) and objects_begin
> >>>   (C++) object listing API, deprecated in Hammer, has finally been
> >>>   removed.  Users of this interface must update their software to use
> >>>   either the rados_nobjects_list_open (C) and nobjects_begin (C++) API or
> >>>   the new rados_object_list_begin (C) and object_list_begin (C++) API
> >>>   before updating the client-side librados library to Luminous.
> >>>
> >>>   Object enumeration (via any API) with the latest librados version
> >>>   and pre-Hammer OSDs is no longer supported.  Note that no in-tree
> >>>   Ceph services rely on object enumeration via the deprecated APIs, so
> >>>   only external librados users might be affected.
> >>>
> >>>   The newest (and recommended) rados_object_list_begin (C) and
> >>>   object_list_begin (C++) API is only usable on clusters with the
> >>>   SORTBITWISE flag enabled (Jewel and later).  (Note that this flag is
> >>>   required to be set before upgrading beyond Jewel.)
> >>>
> >>> * CephFS clients without the 'p' flag in their authentication capability
> >>>   string will no longer be able to set quotas or any layout fields.  This
> >>>   flag previously only restricted modification of the pool and namespace
> >>>   fields in layouts.
> >>>
> >>> * CephFS directory fragmentation (large directory support) is enabled
> >>>   by default on new filesystems.  To enable it on existing filesystems
> >>>   use "ceph fs set  allow_dirfrags".
> >>>
> >>> * CephFS will

Re: [ceph-users] Large META directory within each OSD's directory

2017-04-25 Thread David Turner
Which version of Ceph are you running? My guess is Hammer pre-0.94.9. There
is an osdmap cache bug that was introduced with Hammer that was fixed in
0.94.9. The work around is to restart all of the OSDs in your cluster.
After restarting the OSDs, the cluster will start to clean up osdmaps 20 at
a time each time you generate a new map. If you don't generate maps often,
then you can write a loop that does something like setting the min size for
a pool to the same thing every 10-20 seconds until you catch up. (Note,
that doesn't change any settings, but it does update the map).
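
(A sketch of such a loop - assuming a pool named 'rbd' whose min_size is already
2, so nothing actually changes except that a new osdmap epoch is generated:)

while true; do
    ceph osd pool set rbd min_size 2
    sleep 15
done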

On Tue, Apr 25, 2017, 4:45 AM 许雪寒  wrote:

> Hi, everyone.
>
> Recently, in one of our clusters, we found that the “META” directory in
> each OSD’s working directory is getting extremely large, about 17GB each.
> Why hasn’t the OSD cleared those old osdmaps? How should I deal with this
> problem?
>
> Thank you☺
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is single MDS data recoverable

2017-04-25 Thread Henrik Korkuc

On 17-04-25 13:43, gjprabu wrote:

Hi Team,

   I am running cephfs setup with single MDS . Suppose in 
single MDS setup if the MDS goes down what will happen for data. Is it 
advisable to run multiple MDS.


MDS data is in the Ceph cluster itself. After an MDS failure you can start 
another MDS on a different server. It should pick up where the previous MDS ended.



Regards
Prabu GJ



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph built from source, can't start ceph-mon

2017-04-25 Thread Joao Eduardo Luis

On 04/25/2017 03:52 AM, Henry Ngo wrote:

Anyone?

On Sat, Apr 22, 2017 at 12:33 PM, Henry Ngo <henry@phazr.io> wrote:

I followed the install doc however after deploying the monitor, the
doc states to start the mon using Upstart. I learned through digging
around that the Upstart package is not installed using Make Install
so it won't work. I tried running "ceph-mon -i [host]" and it gives
an error. Any ideas?

http://paste.openstack.org/show/607588/




From `ps`, you have

  cephadm+  1501 1  0 12:12 pts/0    00:00:00 ceph-mon -i node2

The monitor is already running.

The error you get,

  IO error: lock /var/lib/ceph/mon/ceph-node2/store.db/LOCK: Resource 
temporarily unavailable


is because the monitor is running and already holds the store lock, 
hence why you can't start a monitor with the same id.


  -Joao
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is single MDS data recoverable

2017-04-25 Thread gjprabu
Hi Team,



   I am running cephfs setup with single MDS . Suppose in single MDS 
setup if the MDS goes down what will happen for data. Is it advisable to run 
multiple MDS.



Regards

Prabu GJ



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs not writeable on a few clients

2017-04-25 Thread Xusangdi
The working client is running in user space (probably ceph-fuse), while the 
non-working client is using kernel mount


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Steininger,
Herbert
Sent: 25 April 2017 16:44
To: ceph-users@lists.ceph.com
Subject: [ceph-users] cephfs not writeable on a few clients

Hi,

I'm fairly new to cephfs; at my new job there is a cephfs cluster that I have 
to administer.

The problem is, I can’t write from some clients to the cephfs-mount.
When I try from the specific clients I get in the Logfile:
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 hung
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 caps stale
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 came back
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 caps still stale

Restarting the mds doesn’t make any difference.

>ceph -s
Says:
[root@cuda001:/var/log/ceph]# ceph -s
cluster cde1487e-f930-417a-9403-28e9ebf406b8
 health HEALTH_OK
 monmap e6: 1 mons at {cephcontrol=172.22.12.241:6789/0}
election epoch 1, quorum 0 cephcontrol
 mdsmap e1574: 1/1/1 up {0=A1214-2950-01=up:active}
 osdmap e9571: 6 osds: 6 up, 6 in
  pgmap v11438317: 320 pgs, 3 pools, 20427 GB data, 7102 kobjects
62100 GB used, 52968 GB / 112 TB avail
 319 active+clean
   1 active+clean+scrubbing+deep


So everything should be right, but it is not working.

The only thing I found that is different to the other hosts is when I do:
> ceph daemon mds.A1214-2950-01 session ls

On the working clients I get:
{
"id": 670317,
"num_leases": 0,
"num_caps": 35386,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.670317 172.22.7.52:0\/4290071627",
"client_metadata": {
"ceph_sha1": "mySHA1-ID",
"ceph_version": "ceph version 0.94.9 (mySHA1-ID)",
"entity_id": "admin",
"hostname": "PE8",
"mount_point": "\/cephfs01"
}


On the non-working clients it looks like:
{
"id": 670648,
"num_leases": 0,
"num_caps": 60,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.670648 172.22.20.5:0\/2770536198",
"client_metadata": {
"entity_id": "cephfs",
"hostname": "slurmgate",
"kernel_version": "3.10.0-514.16.1.el7.x86_64"
}

The biggest difference is that there are no 'ceph_sha1' or 'ceph_version'
entries, no 'mount_point', and the entity_id is also different.

Could someone please shed some light on what I did wrong?
The guy who installed it is no longer here, and there is also no documentation.
I just try to mount it via automount/autofs.

If you need more info, just let me know.

Thanks in Advance,
Steininger Herbert

-
This e-mail and its attachments contain confidential information from H3C,
which is intended only for the person or entity whose address is listed above.
Any use of the information contained herein in any way (including, but not
limited to, total or partial disclosure, reproduction, or dissemination) by
persons other than the intended recipient(s) is prohibited. If you receive
this e-mail in error, please notify the sender by phone or email immediately
and delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v12.0.2 Luminous (dev) released

2017-04-25 Thread Dan van der Ster
Created ticket to follow up: http://tracker.ceph.com/issues/19769



On Tue, Apr 25, 2017 at 11:34 AM, Dan van der Ster  wrote:
> Could this change be the culprit?
>
> commit 973829132bf7206eff6c2cf30dd0aa32fb0ce706
> Author: Sage Weil 
> Date:   Fri Mar 31 09:33:19 2017 -0400
>
> mon/OSDMonitor: spinlock -> std::mutex
>
> I think spinlock is dangerous here: we're doing semi-unbounded
> work (decode).  Also seemingly innocuous code like dout macros
> take mutexes.
>
> Signed-off-by: Sage Weil 
>
>
> diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
> index 543338bdf3..6fa5e8de4b 100644
> --- a/src/mon/OSDMonitor.cc
> +++ b/src/mon/OSDMonitor.cc
> @@ -245,7 +245,7 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
>  bufferlist bl;
>  mon->store->get(OSD_PG_CREATING_PREFIX, "creating", bl);
>  auto p = bl.begin();
> -std::lock_guard<ceph::spinlock> l(creating_pgs_lock);
> +std::lock_guard<std::mutex> l(creating_pgs_lock);
>  creating_pgs.decode(p);
>  dout(7) << __func__ << " loading creating_pgs e" <<
> creating_pgs.last_scan_epoch << dendl;
>}
> ...
>
>
> Cheers, Dan
>
>
> On Tue, Apr 25, 2017 at 11:15 AM, Dan van der Ster  
> wrote:
>> Hi,
>>
>> The mon's on my test luminous cluster do not start after upgrading
>> from 12.0.1 to 12.0.2. Here is the backtrace:
>>
>>  0> 2017-04-25 11:06:02.897941 7f467ddd7880 -1 *** Caught signal
>> (Aborted) **
>>  in thread 7f467ddd7880 thread_name:ceph-mon
>>
>>  ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
>>  1: (()+0x797e7f) [0x7f467e58ce7f]
>>  2: (()+0xf370) [0x7f467d18d370]
>>  3: (gsignal()+0x37) [0x7f467a44f1d7]
>>  4: (abort()+0x148) [0x7f467a4508c8]
>>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f467ad539d5]
>>  6: (()+0x5e946) [0x7f467ad51946]
>>  7: (()+0x5e973) [0x7f467ad51973]
>>  8: (()+0x5eb93) [0x7f467ad51b93]
>>  9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
>> char*)+0xa5) [0x7f467e2fc715]
>>  10: (creating_pgs_t::decode(ceph::buffer::list::iterator&)+0x3c)
>> [0x7f467e211e8c]
>>  11: (OSDMonitor::update_from_paxos(bool*)+0x225a) [0x7f467e1cd16a]
>>  12: (PaxosService::refresh(bool*)+0x1a5) [0x7f467e196335]
>>  13: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f467e12953b]
>>  14: (Monitor::init_paxos()+0x115) [0x7f467e129975]
>>  15: (Monitor::preinit()+0x93d) [0x7f467e13b07d]
>>  16: (main()+0x2518) [0x7f467e07f848]
>>  17: (__libc_start_main()+0xf5) [0x7f467a43bb35]
>>  18: (()+0x32671e) [0x7f467e11b71e]
>>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
>> needed to interpret this.
>>
>> Cheers, Dan
>>
>>
>> On Mon, Apr 24, 2017 at 5:49 PM, Abhishek Lekshmanan  
>> wrote:
>>> This is the third development checkpoint release of Luminous, the next
>>> long term
>>> stable release.
>>>
>>> Major changes from v12.0.1
>>> --
>>> * The original librados rados_objects_list_open (C) and objects_begin
>>>   (C++) object listing API, deprecated in Hammer, has finally been
>>>   removed.  Users of this interface must update their software to use
>>>   either the rados_nobjects_list_open (C) and nobjects_begin (C++) API or
>>>   the new rados_object_list_begin (C) and object_list_begin (C++) API
>>>   before updating the client-side librados library to Luminous.
>>>
>>>   Object enumeration (via any API) with the latest librados version
>>>   and pre-Hammer OSDs is no longer supported.  Note that no in-tree
>>>   Ceph services rely on object enumeration via the deprecated APIs, so
>>>   only external librados users might be affected.
>>>
>>>   The newest (and recommended) rados_object_list_begin (C) and
>>>   object_list_begin (C++) API is only usable on clusters with the
>>>   SORTBITWISE flag enabled (Jewel and later).  (Note that this flag is
>>>   required to be set before upgrading beyond Jewel.)
>>>
>>> * CephFS clients without the 'p' flag in their authentication capability
>>>   string will no longer be able to set quotas or any layout fields.  This
>>>   flag previously only restricted modification of the pool and namespace
>>>   fields in layouts.
>>>
>>> * CephFS directory fragmentation (large directory support) is enabled
>>>   by default on new filesystems.  To enable it on existing filesystems
>>>   use "ceph fs set <fs_name> allow_dirfrags".
>>>
>>> * CephFS will generate a health warning if you have fewer standby daemons
>>>   than it thinks you wanted.  By default this will be 1 if you ever had
>>>   a standby, and 0 if you did not.  You can customize this using
>>>   ``ceph fs set <fs> standby_count_wanted <count>``.  Setting it
>>>   to zero will effectively disable the health check.
>>>
>>> * The "ceph mds tell ..." command has been removed.  It is superseded
>>>   by "ceph tell mds.<id> ..."
>>>
>>> * RGW introduces server side encryption of uploaded objects with 3
>>> options for
>>>   the management of encryption keys, automatic encryption (only
>>> recommended for
>>>   test setups), customer provided keys similar to Amazon SSE KMS
>>> spe

Re: [ceph-users] v12.0.2 Luminous (dev) released

2017-04-25 Thread Dan van der Ster
Could this change be the culprit?

commit 973829132bf7206eff6c2cf30dd0aa32fb0ce706
Author: Sage Weil 
Date:   Fri Mar 31 09:33:19 2017 -0400

mon/OSDMonitor: spinlock -> std::mutex

I think spinlock is dangerous here: we're doing semi-unbounded
work (decode).  Also seemingly innocuous code like dout macros
take mutexes.

Signed-off-by: Sage Weil 


diff --git a/src/mon/OSDMonitor.cc b/src/mon/OSDMonitor.cc
index 543338bdf3..6fa5e8de4b 100644
--- a/src/mon/OSDMonitor.cc
+++ b/src/mon/OSDMonitor.cc
@@ -245,7 +245,7 @@ void OSDMonitor::update_from_paxos(bool *need_bootstrap)
 bufferlist bl;
 mon->store->get(OSD_PG_CREATING_PREFIX, "creating", bl);
 auto p = bl.begin();
-std::lock_guard<ceph::spinlock> l(creating_pgs_lock);
+std::lock_guard<std::mutex> l(creating_pgs_lock);
 creating_pgs.decode(p);
 dout(7) << __func__ << " loading creating_pgs e" <<
creating_pgs.last_scan_epoch << dendl;
   }
...


Cheers, Dan


On Tue, Apr 25, 2017 at 11:15 AM, Dan van der Ster  wrote:
> Hi,
>
> The mon's on my test luminous cluster do not start after upgrading
> from 12.0.1 to 12.0.2. Here is the backtrace:
>
>  0> 2017-04-25 11:06:02.897941 7f467ddd7880 -1 *** Caught signal
> (Aborted) **
>  in thread 7f467ddd7880 thread_name:ceph-mon
>
>  ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
>  1: (()+0x797e7f) [0x7f467e58ce7f]
>  2: (()+0xf370) [0x7f467d18d370]
>  3: (gsignal()+0x37) [0x7f467a44f1d7]
>  4: (abort()+0x148) [0x7f467a4508c8]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f467ad539d5]
>  6: (()+0x5e946) [0x7f467ad51946]
>  7: (()+0x5e973) [0x7f467ad51973]
>  8: (()+0x5eb93) [0x7f467ad51b93]
>  9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
> char*)+0xa5) [0x7f467e2fc715]
>  10: (creating_pgs_t::decode(ceph::buffer::list::iterator&)+0x3c)
> [0x7f467e211e8c]
>  11: (OSDMonitor::update_from_paxos(bool*)+0x225a) [0x7f467e1cd16a]
>  12: (PaxosService::refresh(bool*)+0x1a5) [0x7f467e196335]
>  13: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f467e12953b]
>  14: (Monitor::init_paxos()+0x115) [0x7f467e129975]
>  15: (Monitor::preinit()+0x93d) [0x7f467e13b07d]
>  16: (main()+0x2518) [0x7f467e07f848]
>  17: (__libc_start_main()+0xf5) [0x7f467a43bb35]
>  18: (()+0x32671e) [0x7f467e11b71e]
>  NOTE: a copy of the executable, or `objdump -rdS <executable>` is
> needed to interpret this.
>
> Cheers, Dan
>
>
> On Mon, Apr 24, 2017 at 5:49 PM, Abhishek Lekshmanan  
> wrote:
>> This is the third development checkpoint release of Luminous, the next
>> long term
>> stable release.
>>
>> Major changes from v12.0.1
>> --
>> * The original librados rados_objects_list_open (C) and objects_begin
>>   (C++) object listing API, deprecated in Hammer, has finally been
>>   removed.  Users of this interface must update their software to use
>>   either the rados_nobjects_list_open (C) and nobjects_begin (C++) API or
>>   the new rados_object_list_begin (C) and object_list_begin (C++) API
>>   before updating the client-side librados library to Luminous.
>>
>>   Object enumeration (via any API) with the latest librados version
>>   and pre-Hammer OSDs is no longer supported.  Note that no in-tree
>>   Ceph services rely on object enumeration via the deprecated APIs, so
>>   only external librados users might be affected.
>>
>>   The newest (and recommended) rados_object_list_begin (C) and
>>   object_list_begin (C++) API is only usable on clusters with the
>>   SORTBITWISE flag enabled (Jewel and later).  (Note that this flag is
>>   required to be set before upgrading beyond Jewel.)
>>
>> * CephFS clients without the 'p' flag in their authentication capability
>>   string will no longer be able to set quotas or any layout fields.  This
>>   flag previously only restricted modification of the pool and namespace
>>   fields in layouts.
>>
>> * CephFS directory fragmentation (large directory support) is enabled
>>   by default on new filesystems.  To enable it on existing filesystems
>>   use "ceph fs set <fs_name> allow_dirfrags".
>>
>> * CephFS will generate a health warning if you have fewer standby daemons
>>   than it thinks you wanted.  By default this will be 1 if you ever had
>>   a standby, and 0 if you did not.  You can customize this using
>>   ``ceph fs set <fs> standby_count_wanted <count>``.  Setting it
>>   to zero will effectively disable the health check.
>>
>> * The "ceph mds tell ..." command has been removed.  It is superseded
>>   by "ceph tell mds.<id> ..."
>>
>> * RGW introduces server side encryption of uploaded objects with 3
>> options for
>>   the management of encryption keys, automatic encryption (only
>> recommended for
>>   test setups), customer provided keys similar to Amazon SSE KMS
>> specification &
>>   using a key management service (openstack barbican)
>>
>> For a more detailed changelog, refer to
>> http://ceph.com/releases/ceph-v12-0-2-luminous-dev-released/
>>
>> Getting Ceph
>> 
>>
>> * Git at git://github.com/ceph/ceph.git
>> * Tarball at 

Re: [ceph-users] v12.0.2 Luminous (dev) released

2017-04-25 Thread Dan van der Ster
Hi,

The mon's on my test luminous cluster do not start after upgrading
from 12.0.1 to 12.0.2. Here is the backtrace:

 0> 2017-04-25 11:06:02.897941 7f467ddd7880 -1 *** Caught signal
(Aborted) **
 in thread 7f467ddd7880 thread_name:ceph-mon

 ceph version 12.0.2 (5a1b6b3269da99a18984c138c23935e5eb96f73e)
 1: (()+0x797e7f) [0x7f467e58ce7f]
 2: (()+0xf370) [0x7f467d18d370]
 3: (gsignal()+0x37) [0x7f467a44f1d7]
 4: (abort()+0x148) [0x7f467a4508c8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f467ad539d5]
 6: (()+0x5e946) [0x7f467ad51946]
 7: (()+0x5e973) [0x7f467ad51973]
 8: (()+0x5eb93) [0x7f467ad51b93]
 9: (ceph::buffer::list::iterator_impl::copy(unsigned int,
char*)+0xa5) [0x7f467e2fc715]
 10: (creating_pgs_t::decode(ceph::buffer::list::iterator&)+0x3c)
[0x7f467e211e8c]
 11: (OSDMonitor::update_from_paxos(bool*)+0x225a) [0x7f467e1cd16a]
 12: (PaxosService::refresh(bool*)+0x1a5) [0x7f467e196335]
 13: (Monitor::refresh_from_paxos(bool*)+0x19b) [0x7f467e12953b]
 14: (Monitor::init_paxos()+0x115) [0x7f467e129975]
 15: (Monitor::preinit()+0x93d) [0x7f467e13b07d]
 16: (main()+0x2518) [0x7f467e07f848]
 17: (__libc_start_main()+0xf5) [0x7f467a43bb35]
 18: (()+0x32671e) [0x7f467e11b71e]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is
needed to interpret this.

Cheers, Dan


On Mon, Apr 24, 2017 at 5:49 PM, Abhishek Lekshmanan  wrote:
> This is the third development checkpoint release of Luminous, the next
> long term
> stable release.
>
> Major changes from v12.0.1
> --
> * The original librados rados_objects_list_open (C) and objects_begin
>   (C++) object listing API, deprecated in Hammer, has finally been
>   removed.  Users of this interface must update their software to use
>   either the rados_nobjects_list_open (C) and nobjects_begin (C++) API or
>   the new rados_object_list_begin (C) and object_list_begin (C++) API
>   before updating the client-side librados library to Luminous.
>
>   Object enumeration (via any API) with the latest librados version
>   and pre-Hammer OSDs is no longer supported.  Note that no in-tree
>   Ceph services rely on object enumeration via the deprecated APIs, so
>   only external librados users might be affected.
>
>   The newest (and recommended) rados_object_list_begin (C) and
>   object_list_begin (C++) API is only usable on clusters with the
>   SORTBITWISE flag enabled (Jewel and later).  (Note that this flag is
>   required to be set before upgrading beyond Jewel.)
>
> * CephFS clients without the 'p' flag in their authentication capability
>   string will no longer be able to set quotas or any layout fields.  This
>   flag previously only restricted modification of the pool and namespace
>   fields in layouts.
>
> * CephFS directory fragmentation (large directory support) is enabled
>   by default on new filesystems.  To enable it on existing filesystems
>   use "ceph fs set <fs_name> allow_dirfrags".
>
> * CephFS will generate a health warning if you have fewer standby daemons
>   than it thinks you wanted.  By default this will be 1 if you ever had
>   a standby, and 0 if you did not.  You can customize this using
>   ``ceph fs set <fs> standby_count_wanted <count>``.  Setting it
>   to zero will effectively disable the health check.
>
> * The "ceph mds tell ..." command has been removed.  It is superseded
>   by "ceph tell mds.<id> ..."
>
> * RGW introduces server side encryption of uploaded objects with 3
> options for
>   the management of encryption keys, automatic encryption (only
> recommended for
>   test setups), customer provided keys similar to Amazon SSE KMS
> specification &
>   using a key management service (openstack barbican)
>
> For a more detailed changelog, refer to
> http://ceph.com/releases/ceph-v12-0-2-luminous-dev-released/
>
> Getting Ceph
> 
>
> * Git at git://github.com/ceph/ceph.git
> * Tarball at http://download.ceph.com/tarballs/ceph-12.0.2.tar.gz
> * For packages, see http://docs.ceph.com/docs/master/install/get-packages/
> * For ceph-deploy, see
> http://docs.ceph.com/docs/master/install/install-ceph-deploy
> * Release sha1: 5a1b6b3269da99a18984c138c23935e5eb96f73e
>
> --
> Abhishek Lekshmanan
> SUSE Linux GmbH, GF: Felix Imendörffer, Jane Smithard, Graham Norton,
> HRB 21284 (AG Nürnberg)
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Large META directory within each OSD's directory

2017-04-25 Thread 许雪寒
Hi, everyone.

Recently, in one of our clusters, we found that the “META” directory in each 
OSD’s working directory is getting extremely large, about 17GB each. Why hasn’t 
the OSD cleared those old osdmaps? How should I deal with this problem?
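
As a rough first check (osd.0 is only an example id; run it on the host that
holds the OSD), the admin socket can show how wide a range of historical
osdmaps the OSD is still keeping:

  ceph daemon osd.0 status
  # Compare the "oldest_map" and "newest_map" fields in the output; a very
  # large gap means many old maps have not been trimmed yet.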

Thank you☺
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cephfs not writeable on a few clients

2017-04-25 Thread Steininger, Herbert
Hi,

I'm fairly new to CephFS; at my new job there is a CephFS cluster that I have
to administer.

The problem is that I can't write to the CephFS mount from some clients.
When I try from the affected clients, I get this in the log file:
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 hung
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 caps stale
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 came back
>Apr 24 13:14:00 cuda002 kernel: ceph: mds0 caps still stale

Restarting the mds doesn't make any difference.

>ceph -s
Says:
[root@cuda001:/var/log/ceph]# ceph -s
cluster cde1487e-f930-417a-9403-28e9ebf406b8
 health HEALTH_OK
 monmap e6: 1 mons at {cephcontrol=172.22.12.241:6789/0}
election epoch 1, quorum 0 cephcontrol
 mdsmap e1574: 1/1/1 up {0=A1214-2950-01=up:active}
 osdmap e9571: 6 osds: 6 up, 6 in
  pgmap v11438317: 320 pgs, 3 pools, 20427 GB data, 7102 kobjects
62100 GB used, 52968 GB / 112 TB avail
 319 active+clean
   1 active+clean+scrubbing+deep


So everything should be right, but it is not working.

The only thing I found that is different from the other hosts is when I do:
> ceph daemon mds.A1214-2950-01 session ls

On the working clients I get:
{
"id": 670317,
"num_leases": 0,
"num_caps": 35386,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.670317 172.22.7.52:0\/4290071627",
"client_metadata": {
"ceph_sha1": "mySHA1-ID",
"ceph_version": "ceph version 0.94.9 (mySHA1-ID)",
"entity_id": "admin",
"hostname": "PE8",
"mount_point": "\/cephfs01"
}


On the non-working clients it looks like:
{
"id": 670648,
"num_leases": 0,
"num_caps": 60,
"state": "open",
"replay_requests": 0,
"reconnecting": false,
"inst": "client.670648 172.22.20.5:0\/2770536198",
"client_metadata": {
"entity_id": "cephfs",
"hostname": "slurmgate",
"kernel_version": "3.10.0-514.16.1.el7.x86_64"
}

The biggest difference is that there are no 'ceph_sha1' or 'ceph_version'
entries, no 'mount_point', and the entity_id is also different.

Could someone please shed some light on what I did wrong?
The guy who installed it is no longer here, and there is also no documentation.
I just try to mount it via automount/autofs.

If you need more info, just let me know.

Thanks in Advance,
Steininger Herbert

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] inconsistent of pgs due to attr_value_mismatch

2017-04-25 Thread Lomayani S. Laizer
Hello,
I am having an error in my cluster: inconsistent pgs due to
attr_value_mismatch. It looks like all pgs with this error are hosting one VM
with ID 3fb4c238e1f29. I am using replication of 3 with a min of 2.

Pg repair is not working. Any suggestions for resolving this issue would be
appreciated. More logs are available at http://www.heypasteit.com/clip/0BOJ36
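
One rough way to confirm which RBD image the rbd_data.3fb4c238e1f29 prefix
belongs to (the pool name "rbd" below is only a placeholder; substitute the
pool these PGs map to):

  for img in $(rbd -p rbd ls); do
      rbd -p rbd info "$img" \
          | grep -q 'block_name_prefix: rbd_data.3fb4c238e1f29' \
          && echo "$img"
  done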

 ceph health detail
HEALTH_ERR 12 pgs inconsistent; 16 scrub errors
pg 7.765 is active+clean+inconsistent, acting [16,21,3]
pg 7.6e7 is active+clean+inconsistent, acting [12,21,4]
pg 7.335 is active+clean+inconsistent, acting [8,17,21]
pg 7.304 is active+clean+inconsistent, acting [14,6,21]
pg 7.2e0 is active+clean+inconsistent, acting [21,17,6]
pg 7.138 is active+clean+inconsistent, acting [11,17,21]
pg 7.6c is active+clean+inconsistent, acting [21,11,14]
pg 7.102 is active+clean+inconsistent, acting [21,5,12]
pg 7.198 is active+clean+inconsistent, acting [14,11,21]
pg 7.5fc is active+clean+inconsistent, acting [6,16,21]
pg 7.65b is active+clean+inconsistent, acting [21,17,2]
pg 7.67a is active+clean+inconsistent, acting [16,21,6]

rados list-inconsistent-obj 7.67a --format=json-pretty
{
"epoch": 5699,
"inconsistents": [
{
"object": {
"name": "rbd_data.3fb4c238e1f29.00017bef",
"nspace": "",
"locator": "",
"snap": "head",
"version": 346953
},
"errors": [
"object_info_inconsistency",
"attr_value_mismatch"
],
"union_shard_errors": [],
"selected_object_info":
"7:5e76a45a:::rbd_data.3fb4c238e1f29.00017bef:head(5640'346953
client.2930592.0:2368795 dirty|omap_digest s 3792896 uv 346953 od
)",
"shards": [
{
"osd": 6,
"errors": [],
"size": 3792896,
"object_info":
"7:5e76a45a:::rbd_data.3fb4c238e1f29.00017bef:head(5640'346953
client.2930592.0:2368795 dirty|omap_digest s 3792896 uv 346953 od
)",
"attrs": [


2017-04-25 08:56:23.333835 7f8a0835e700 -1 log_channel(cluster) log [ERR] :
7.102 shard 21: soid
7:4081eee7:::rbd_data.3fb4c238e1f29.00017b03:head size 3076096 !=
size 2633728 from auth oi
7:4081eee7:::rbd_data.3fb4c238e1f29.00017b03:head(5640'990157
client.2930592.0:2367433 dirty|omap_digest s 2633728 uv 990157 od
), size 3076096 != size 2633728 from shard 5, attr value mismatch
'_'

--
Lomayani
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH MON Updates Live

2017-04-25 Thread Henrik Korkuc

On 17-04-24 19:38, Ashley Merrick wrote:

Hey,

A quick question, hopefully; I have tried a few Google searches but found
nothing concrete.

I am running KVM VMs using KRBD. If I add and remove Ceph mons, are the
running VMs updated with this information, or do I need to reboot the VMs for
them to be provided with the change of mons?
Clients are updated with this information. Just make sure that you have at
least one active mon in the client's config in case a VM gets restarted for
some other reason.
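
For example (the addresses are purely illustrative; the point is only that the
client-side config should list more than one reachable monitor):

  # Client-side /etc/ceph/ceph.conf on the KVM host:
  cat /etc/ceph/ceph.conf
  # [global]
  # mon_host = 10.0.0.1,10.0.0.2,10.0.0.3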



Thanks!
Sent from my iPhone
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com