Re: [ceph-users] Two CEPHFS Issues

2017-10-19 Thread Sage Weil
On Thu, 19 Oct 2017, Daniel Pryor wrote:
> Hello Everyone, 
> 
> We are currently running into two issues.
> 
> 1) We are noticing huge pauses during directory creation, but our file write
> times are super fast. The metadata and data pools are on the same
> infrastructure. 
>  *  https://gist.github.com/pryorda/a0d5c37f119c4a320fa4ca9d48c8752b
>  *  https://gist.github.com/pryorda/ba6e5c2f94f67ca72a744b90cc58024e

Separating metadata onto different (ideally faster) devices is usually a 
good idea if you want to protect metadata performance.  The stalls you're 
seeing could either be MDS requests getting slowed down by the OSDs, or it 
might be the MDS missing something in its cache and having to go 
fetch or flush something to RADOS.  You might see if increasing the MDS 
cache size helps.
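
For reference, a minimal sketch of bumping the MDS cache on Jewel (the
value below is only an example; the option counts cached inodes and should
be sized to the RAM available on the MDS host):

    # in ceph.conf on the MDS host(s), then restart the MDS:
    [mds]
        mds cache size = 1000000      # default is 100000 inodes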

> 2) Since we were having the issue above, we wanted to possibly move to a
> larger top-level directory, stuff everything in there, and later move
> everything out via a batch job. To do this we need to increase the
> directory limit from 100,000 to 300,000. How do we increase this limit?

I would recommend upgrading to luminous and enabling directory 
fragmentation instead of increasing the per-fragment limit on Jewel.  Big 
fragments have a negative impact on MDS performance (leading to spikes 
like you see above) and can also make life harder for the OSDs.
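
A rough sketch of what that looks like, with placeholder filesystem/MDS
names (mds_bal_fragment_size_max is the per-fragment limit that defaults
to 100,000):

    # on Luminous, enable directory fragmentation for the filesystem:
    ceph fs set <fsname> allow_dirfrags true

    # only if you really must raise the per-fragment limit instead:
    ceph tell mds.<name> injectargs '--mds_bal_fragment_size_max 300000'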

sage



 > 
> 
> dpryor@beta-ceph-node1:~$ dpkg -l |grep ceph
> ii  ceph-base                            10.2.10-1xenial                   
> amd64        common ceph daemon libraries and management tools
> ii  ceph-common                          10.2.10-1xenial                   
> amd64        common utilities to mount and interact with a ceph storage
> cluster
> ii  ceph-deploy                          1.5.38                           
>  all          Ceph-deploy is an easy to use configuration tool
> ii  ceph-mds                             10.2.10-1xenial                   
> amd64        metadata server for the ceph distributed file system
> ii  ceph-mon                             10.2.10-1xenial                   
> amd64        monitor server for the ceph storage system
> ii  ceph-osd                             10.2.10-1xenial                   
> amd64        OSD server for the ceph storage system
> ii  libcephfs1                           10.2.10-1xenial                   
> amd64        Ceph distributed file system client library
> ii  python-cephfs                        10.2.10-1xenial                   
> amd64        Python libraries for the Ceph libcephfs library
> dpryor@beta-ceph-node1:~$ 
> 
> Any direction would be appreciated!?
> 
> Thanks,
> Daniel
> 
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Two CEPHFS Issues

2017-10-19 Thread Daniel Pryor
Hello Everyone,

We are currently running into two issues.

1) We are noticing huge pauses during directory creation, but our file
write times are super fast. The metadata and data pools are on the same
infrastructure.

   - https://gist.github.com/pryorda/a0d5c37f119c4a320fa4ca9d48c8752b
   - https://gist.github.com/pryorda/ba6e5c2f94f67ca72a744b90cc58024e

2) Since we were having the issue above, we wanted to possibly move to a
larger top-level directory, stuff everything in there, and later move
everything out via a batch job. To do this we need to increase the
directory limit from 100,000 to 300,000. How do we increase this limit?


dpryor@beta-ceph-node1:~$ dpkg -l |grep ceph
ii  ceph-base        10.2.10-1xenial   amd64   common ceph daemon libraries and management tools
ii  ceph-common      10.2.10-1xenial   amd64   common utilities to mount and interact with a ceph storage cluster
ii  ceph-deploy      1.5.38            all     Ceph-deploy is an easy to use configuration tool
ii  ceph-mds         10.2.10-1xenial   amd64   metadata server for the ceph distributed file system
ii  ceph-mon         10.2.10-1xenial   amd64   monitor server for the ceph storage system
ii  ceph-osd         10.2.10-1xenial   amd64   OSD server for the ceph storage system
ii  libcephfs1       10.2.10-1xenial   amd64   Ceph distributed file system client library
ii  python-cephfs    10.2.10-1xenial   amd64   Python libraries for the Ceph libcephfs library
dpryor@beta-ceph-node1:~$

Any direction would be appreciated!?

Thanks,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Upstream @The Pub in Prague

2017-10-19 Thread Leonardo Vaz
Hi Cephers,

Brett Niver and Orit Wasserman are organizing a Ceph Upstream meeting next
Thursday, October 25, in Prague.

The meeting will happen at The Pub from 5pm to 9pm (CEST):

  http://www.thepub.cz/praha-1/?lng=en

At the moment we are working on the participant list. If you're
interested in attending the meeting, please send me a message so I can
add your name to the list of participants.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests

2017-10-19 Thread Brad Hubbard
I guess you have both read and followed
http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/?highlight=backfill#debugging-slow-requests

What was the result?
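
If not, a reasonable starting point is to catch the blocked ops on the
OSDs that report them while the warning is active (the OSD id is a
placeholder; run this on the host carrying that OSD):

    ceph daemon osd.<id> dump_ops_in_flight    # ops currently blocked and their current step
    ceph daemon osd.<id> dump_historic_ops     # recently completed slow ops and where they spent time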

On Fri, Oct 20, 2017 at 2:50 AM, J David  wrote:
> On Wed, Oct 18, 2017 at 8:12 AM, Ольга Ухина  wrote:
>> I have a problem with ceph luminous 12.2.1.
>> […]
>> I have slow requests on different OSDs at random times (for example at night),
>> but I don’t see any other problems at the time of the slow requests
>> […]
>> 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 : cluster
>> [WRN] Health check update: 49 slow requests are blocked > 32 sec
>> (REQUEST_SLOW)
>
> This looks almost exactly like what we have been experiencing, and
> your use-case (Proxmox client using rbd) is the same as ours as well.
>
> Unfortunately we were not able to find the source of the issue so far,
> and haven’t gotten much feedback from the list.  Extensive testing of
> every component has ruled out any hardware issue we can think of.
>
> Originally we thought our issue was related to deep-scrub, but that
> now appears not to be the case, as it happens even when nothing is
> being deep-scrubbed.  Nonetheless, although they aren’t the cause,
> they definitely make the problem much worse.  So you may want to check
> to see if deep-scrub operations are happening at the times where you
> see issues and (if so) whether the OSDs participating in the
> deep-scrub are the same ones reporting slow requests.
>
> Hopefully you have better luck finding/fixing this than we have!  It’s
> definitely been a very frustrating issue for us.
>
> Thanks!
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not able to start OSD

2017-10-19 Thread Brad Hubbard
On Fri, Oct 20, 2017 at 6:32 AM, Josy  wrote:
> Hi,
>
>>> have you checked the output of "ceph-disk list” on the nodes where the
>>> OSDs are not coming back on?
>
> Yes, it shows all the disks correctly mounted.
>
>>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>>> produced by the OSD itself when it starts.
>
> These are the error messages seen in one of the OSD log files. Even though the
> service starts, the OSD's status still shows as down.
>
>
> =
>
>-7> 2017-10-19 13:16:15.589465 7efefcda4d00  5 osd.28 pg_epoch: 4312
> pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -6> 2017-10-19 13:16:15.589476 7efefcda4d00  5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -5> 2017-10-19 13:16:15.591629 7efefcda4d00  5 osd.28 pg_epoch: 4312
> pg[33.10(unlocked)] enter Initial
> -4> 2017-10-19 13:16:15.591759 7efefcda4d00  5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] exit Initial 0.000130 0 0.00
> -3> 2017-10-19 13:16:15.591786 7efefcda4d00  5 osd.28 pg_epoch: 4312
> pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270
> les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 crt=0'0 unknown
> NOTIFY] enter Reset
> -2> 2017-10-19 13:16:15.591799 7efefcda4d00  5 write_log_and_missing
> with: dirty_to: 0'0, dirty_from: 4294967295'18446744073709551615,
> writeout_from: 4294967295'18446744073709551615, trimmed: , trimmed_dups: ,
> clear_divergent_priors: 0
> -1> 2017-10-19 13:16:15.594757 7efefcda4d00  5 osd.28 pg_epoch: 4306
> pg[32.ds0(unlocked)] enter Initial
>  0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)'
> thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
> /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h:
> 38: FAILED assert(stripe_width % stripe_size == 0)

What does your erasure code profile look like for pool 32?
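
For reference, one way to pull that information (the profile name is
whatever the pool reports):

    ceph osd pool ls detail | grep 'pool 32 '         # shows the erasure_code_profile in use
    ceph osd erasure-code-profile get <profile-name>  # k, m, plugin, stripe unit, etc.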

>
>
>
> On 20-10-2017 01:05, Jean-Charles Lopez wrote:
>>
>> Hi,
>>
>> have you checked the output of "ceph-disk list” on the nodes where the
>> OSDs are not coming back on?
>>
>> This should give you a hint on what’s going one.
>>
>> Also use dmesg to search for any error message
>>
>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages
>> produced by the OSD itself when it starts.
>>
>> Regards
>> JC
>>
>>> On Oct 19, 2017, at 12:11, Josy  wrote:
>>>
>>> Hi,
>>>
>>> I am not able to start some of the OSDs in the cluster.
>>>
>>> This is a test cluster and had 8 OSDs. One node was taken out for
>>> maintenance. I set the noout flag and after the server came back up I unset
>>> the noout flag.
>>>
>>> Suddenly a couple of OSDs went down.
>>>
>>> And now I can start the OSDs manually from each node, but the status is
>>> still "down"
>>>
>>> $  ceph osd stat
>>> 8 osds: 2 up, 5 in
>>>
>>>
>>> $ ceph osd tree
>>> ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
>>>   -1   7.97388 root default
>>>   -3   1.86469 host a1-osd
>>>1   ssd 1.86469 osd.1   down0 1.0
>>>   -5   0.87320 host a2-osd
>>>2   ssd 0.87320 osd.2   down0 1.0
>>>   -7   0.87320 host a3-osd
>>>4   ssd 0.87320 osd.4   down  1.0 1.0
>>>   -9   0.87320 host a4-osd
>>>8   ssd 0.87320 osd.8 up  1.0 1.0
>>> -11   0.87320 host a5-osd
>>>   12   ssd 0.87320 osd.12  down  1.0 1.0
>>> -13   0.87320 host a6-osd
>>>   17   ssd 0.87320 osd.17up  1.0 1.0
>>> -15   0.87320 host a7-osd
>>>   21   ssd 0.87320 osd.21  down  1.0 1.0
>>> -17   0.87000 host a8-osd
>>>   28   ssd 0.87000 osd.28  down0 1.0
>>>
>>> Also can see this error in each OSD node.
>>>
>>> # systemctl status ceph-osd@1
>>> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>>> Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled;
>>> vendor preset: disabled)
>>> Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18
>>> PDT; 

Re: [ceph-users] ceph inconsistent pg missing ec object

2017-10-19 Thread Gregory Farnum
Okay, you're going to need to explain in very clear terms exactly what
happened to your cluster, and *exactly* what operations you performed
manually.

The PG shards seem to have different views of the PG in question. The
primary has a different log_tail, last_user_version, and last_epoch_clean
from the others. Plus different log sizes? It's not making a ton of sense
at first glance.
-Greg
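
For what it's worth, since Jewel the scrub findings can be inspected
directly, which is usually a better starting point than manual file-level
surgery (the PG id is the one from the health output):

    rados list-inconsistent-obj 5.5e3 --format=json-pretty   # which shard(s) are missing or bad
    ceph pg repair 5.5e3                                     # only once the bad shard is understood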

On Thu, Oct 19, 2017 at 1:08 AM Stijn De Weirdt 
wrote:

> hi greg,
>
> i attached the gzip output of the query and some more info below. if you
> need more, let me know.
>
> stijn
>
> > [root@mds01 ~]# ceph -s
> > cluster 92beef0a-1239-4000-bacf-4453ab630e47
> >  health HEALTH_ERR
> > 1 pgs inconsistent
> > 40 requests are blocked > 512 sec
> > 1 scrub errors
> > mds0: Behind on trimming (2793/30)
> >  monmap e1: 3 mons at {mds01=
> 1.2.3.4:6789/0,mds02=1.2.3.5:6789/0,mds03=1.2.3.6:6789/0}
> > election epoch 326, quorum 0,1,2 mds01,mds02,mds03
> >   fsmap e238677: 1/1/1 up {0=mds02=up:active}, 2 up:standby
> >  osdmap e79554: 156 osds: 156 up, 156 in
> > flags sortbitwise,require_jewel_osds
> >   pgmap v51003893: 4096 pgs, 3 pools, 387 TB data, 243 Mobjects
> > 545 TB used, 329 TB / 874 TB avail
> > 4091 active+clean
> >4 active+clean+scrubbing+deep
> >1 active+clean+inconsistent
> >   client io 284 kB/s rd, 146 MB/s wr, 145 op/s rd, 177 op/s wr
> >   cache io 115 MB/s flush, 153 MB/s evict, 14 op/s promote, 3 PG(s)
> flushing
>
> > [root@mds01 ~]# ceph health detail
> > HEALTH_ERR 1 pgs inconsistent; 52 requests are blocked > 512 sec; 5 osds
> have slow requests; 1 scrub errors; mds0: Behind on trimming (2782/30)
> > pg 5.5e3 is active+clean+inconsistent, acting
> [35,50,91,18,139,59,124,40,104,12,71]
> > 34 ops are blocked > 524.288 sec on osd.8
> > 6 ops are blocked > 524.288 sec on osd.67
> > 6 ops are blocked > 524.288 sec on osd.27
> > 1 ops are blocked > 524.288 sec on osd.107
> > 5 ops are blocked > 524.288 sec on osd.116
> > 5 osds have slow requests
> > 1 scrub errors
> > mds0: Behind on trimming (2782/30)(max_segments: 30, num_segments: 2782)
>
> > # zgrep -C 1 ERR ceph-osd.35.log.*.gz
> > ceph-osd.35.log.5.gz:2017-10-14 11:25:52.260668 7f34d6748700  0 --
> 10.141.16.13:6801/1001792 >> 1.2.3.11:6803/1951 pipe(0x56412da80800
> sd=273 :6801 s=2 pgs=3176 cs=31 l=0 c=0x564156e83b00).fault with nothing to
> send, going to standby
> > ceph-osd.35.log.5.gz:2017-10-14 11:26:06.071011 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3s0 shard 59(5) missing
> 5:c7ae919b:::10014d3184b.:head
> > ceph-osd.35.log.5.gz:2017-10-14 11:28:36.465684 7f34ffdf5700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.21:6829/1834 pipe(0x56414e2a2000 sd=37
> :6801 s=0 pgs=0 cs=0 l=0 c=0x5641470d2a00).accept connect_seq 33 vs
> existing 33 state standby
> > ceph-osd.35.log.5.gz:--
> > ceph-osd.35.log.5.gz:2017-10-14 11:43:35.570711 7f3508efd700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.20:6825/1806 pipe(0x56413be34000 sd=138
> :6801 s=2 pgs=2763 cs=45 l=0 c=0x564132999480).fault with nothing to send,
> going to standby
> > ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235548 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3s0 deep-scrub 1 missing, 0
> inconsistent objects
> > ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235554 7f3511be4700 -1
> log_channel(cluster) log [ERR] : 5.5e3 deep-scrub 1 errors
> > ceph-osd.35.log.5.gz:2017-10-14 11:59:02.331454 7f34d6d4e700  0 --
> 1.2.3.13:6801/1001792 >> 1.2.3.11:6817/1941 pipe(0x56414d370800 sd=227
> :42104 s=2 pgs=3238 cs=89 l=0 c=0x56413122d200).fault with nothing to send,
> going to standby
>
>
>
> On 10/18/2017 10:19 PM, Gregory Farnum wrote:
> > It would help if you can provide the exact output of "ceph -s", "pg
> query",
> > and any other relevant data. You shouldn't need to do manual repair of
> > erasure-coded pools, since it has checksums and can tell which bits are
> > bad. Following that article may not have done you any good (though I
> > wouldn't expect it to hurt, either...)...
> > -Greg
> >
> > On Wed, Oct 18, 2017 at 5:56 AM Stijn De Weirdt  >
> > wrote:
> >
> >> hi all,
> >>
> >> we have a ceph 10.2.7 cluster with a 8+3 EC pool.
> >> in that pool, there is a pg in inconsistent state.
> >>
> >> we followed http://ceph.com/geen-categorie/ceph-manually-repair-object/
> ,
> >> however, we are unable to solve our issue.
> >>
> >> from the primary osd logs, the reported pg had a missing object.
> >>
> >> we found a related object on the primary osd, and then looked for
> >> similar ones on the other osds in the same path (i guess it just has the
> >> index of the osd in the pg's list of osds suffixed)
> >>
> >> one osd did not have such a file (the 10 others did).
> >>
> >> so we did the "stop osd/flush/start osd/pg repair" on both the primary
> >> osd and on the osd with 

Re: [ceph-users] Ceph iSCSI login failed due to authorization failure

2017-10-19 Thread Jason Dillaman
Development versions of the RPMs can be found here [1]. We don't have
production signed builds in place for our ceph-iscsi-XYZ packages yet and
the other packages would eventually come from a distro (or third party
add-on) repo.

[1] https://shaman.ceph.com/repos/

On Thu, Oct 19, 2017 at 8:27 PM, Tyler Bishop <
tyler.bis...@beyondhosting.net> wrote:

> Where did you find the iscsi rpms etc.?  I looked all through the repo and
> can't find anything but the documentation.
>
> _
>
> *Tyler Bishop*
> Founder EST 2007
>
>
> O: 513-299-7108 x10 <(513)%20299-7108>
> M: 513-646-5809 <(513)%20646-5809>
> http://BeyondHosting.net 
>
>
> This email is intended only for the recipient(s) above and/or
> otherwise authorized personnel. The information contained herein and
> attached is confidential and the property of Beyond Hosting. Any
> unauthorized copying, forwarding, printing, and/or disclosing
> any information related to this email is prohibited. If you received this
> message in error, please contact the sender and destroy all copies of this
> email and any attachment(s).
>
> --
> *From: *"Maged Mokhtar" 
> *To: *"Kashif Mumtaz" 
> *Cc: *"Ceph Users" 
> *Sent: *Saturday, October 14, 2017 1:40:05 PM
> *Subject: *Re: [ceph-users] Ceph iSCSI login failed due to authorization
> failure
>
> On 2017-10-14 17:50, Kashif Mumtaz wrote:
>
> Hello Dear,
>
> I am trying to configure the Ceph iSCSI gateway on Ceph Luminous, as per
> the documentation:
>
>   Ceph iSCSI Gateway — Ceph Documentation
>   http://docs.ceph.com/docs/master/rbd/iscsi-overview/
>
> The Ceph iSCSI gateways are configured and CHAP auth is set.
>
>
>
>
> /> ls
> o- / ...................................................................... [...]
>   o- clusters .................................................... [Clusters: 1]
>   | o- ceph ..................................................... [HEALTH_WARN]
>   |   o- pools ...................................................... [Pools: 2]
>   |   | o- kashif ............ [Commit: 0b, Avail: 116G, Used: 1K, Commit%: 0%]
>   |   | o- rbd .............. [Commit: 10G, Avail: 116G, Used: 3K, Commit%: 8%]
>   |   o- topology ........................................... [OSDs: 13,MONs: 3]
>   o- disks ..................................................... [10G, Disks: 1]
>   | o- rbd.disk_1 ............................................... [disk_1 (10G)]
>   o- iscsi-target ................................................. [Targets: 1]
>     o- iqn.2003-01.com.redhat.iscsi-gw:tahir .................... [Gateways: 2]
>       o- gateways ....................................... [Up: 2/2, Portals: 2]
>       | o- gateway ................................... [192.168.10.37 (UP)]
>       | o- gateway2 .................................. [192.168.10.38 (UP)]
>       o- hosts ........................................................ [Hosts: 1]
>         o- iqn.1994-05.com.redhat:rh7-client ........ [Auth: CHAP, Disks: 1(10G)]
>           o- lun 0 ......................... [rbd.disk_1(10G), Owner: gateway2]
> />
>
>
>
> But initiators are unable to mount it. I tried both on Linux and ESXi 6.
>
>
>
> Below is the error message in the iSCSI gateway server log file.
>
> Oct 14 19:34:49 gateway kernel: iSCSI Initiator Node:
> iqn.1998-01.com.vmware:esx0-36c45c69 is not authorized to access iSCSI
> target portal group: 1.
> Oct 14 19:34:49 gateway kernel: iSCSI Login negotiation failed.
>
> Oct 14 19:35:27 gateway kernel: iSCSI Initiator Node:
> iqn.1994-05.com.redhat:5ef55740c576 is not authorized to access iSCSI
> target portal group: 1.
> Oct 14 19:35:27 gateway kernel: iSCSI Login negotiation failed.
>
>
> I am giving the ceph authentication on initiator side.
>
> Discovery on initiator is happening
>
> root@server1 ~]# 

Re: [ceph-users] Ceph iSCSI login failed due to authorization failure

2017-10-19 Thread Tyler Bishop
Where did you find the iscsi rpms etc.? I looked all through the repo and can't 
find anything but the documentation. 

_ 

Tyler Bishop 
Founder EST 2007 


O: 513-299-7108 x10 
M: 513-646-5809 
[ http://beyondhosting.net/ | http://BeyondHosting.net ] 


This email is intended only for the recipient(s) above and/or otherwise 
authorized personnel. The information contained herein and attached is 
confidential and the property of Beyond Hosting. Any unauthorized copying, 
forwarding, printing, and/or disclosing any information related to this email 
is prohibited. If you received this message in error, please contact the sender 
and destroy all copies of this email and any attachment(s). 


From: "Maged Mokhtar"  
To: "Kashif Mumtaz"  
Cc: "Ceph Users"  
Sent: Saturday, October 14, 2017 1:40:05 PM 
Subject: Re: [ceph-users] Ceph iSCSI login failed due to authorization failure 



On 2017-10-14 17:50, Kashif Mumtaz wrote: 


Hello Dear, 
I am trying to configure the Ceph iSCSI gateway on Ceph Luminous, as per 
the documentation: 

  Ceph iSCSI Gateway — Ceph Documentation 
  http://docs.ceph.com/docs/master/rbd/iscsi-overview/ 

The Ceph iSCSI gateways are configured and CHAP auth is set. 
/> ls 
o- / ...................................................................... [...] 
  o- clusters .................................................... [Clusters: 1] 
  | o- ceph ..................................................... [HEALTH_WARN] 
  |   o- pools ...................................................... [Pools: 2] 
  |   | o- kashif ............ [Commit: 0b, Avail: 116G, Used: 1K, Commit%: 0%] 
  |   | o- rbd .............. [Commit: 10G, Avail: 116G, Used: 3K, Commit%: 8%] 
  |   o- topology ........................................... [OSDs: 13,MONs: 3] 
  o- disks ..................................................... [10G, Disks: 1] 
  | o- rbd.disk_1 ............................................... [disk_1 (10G)] 
  o- iscsi-target ................................................. [Targets: 1] 
    o- iqn.2003-01.com.redhat.iscsi-gw:tahir .................... [Gateways: 2] 
      o- gateways ....................................... [Up: 2/2, Portals: 2] 
      | o- gateway ................................... [192.168.10.37 (UP)] 
      | o- gateway2 .................................. [192.168.10.38 (UP)] 
      o- hosts ........................................................ [Hosts: 1] 
        o- iqn.1994-05.com.redhat:rh7-client ........ [Auth: CHAP, Disks: 1(10G)] 
          o- lun 0 ......................... [rbd.disk_1(10G), Owner: gateway2] 
/> 
But initiators are unable to mount it. I tried both on Linux and ESXi 6. 
Below is the error message in the iSCSI gateway server log file. 
Oct 14 19:34:49 gateway kernel: iSCSI Initiator Node: 
iqn.1998-01.com.vmware:esx0-36c45c69 is not authorized to access iSCSI target 
portal group: 1. 
Oct 14 19:34:49 gateway kernel: iSCSI Login negotiation failed. 
Oct 14 19:35:27 gateway kernel: iSCSI Initiator Node: 
iqn.1994-05.com.redhat:5ef55740c576 is not authorized to access iSCSI target 
portal group: 1. 
Oct 14 19:35:27 gateway kernel: iSCSI Login negotiation failed. 
I am giving the ceph authentication on initiator side. 
Discovery on initiator is happening 
root@server1 ~]# iscsiadm -m discovery -t st -p 192.168.10.37 
192.168.10.37:3260,1 iqn.2003-01.com.redhat.iscsi-gw:tahir 
192.168.10.38:3260,2 iqn.2003-01.com.redhat.iscsi-gw:tahir 
But when trying to log in, it gives "iSCSI login failed due to 
authorization failure" 
[root@server1 ~]# iscsiadm -m node -T iqn.2003-01.com.redhat.iscsi-gw:tahir -l 
Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:tahir, 
portal: 192.168.10.37,3260] (multiple) 
Logging in to [iface: default, target: iqn.2003-01.com.redhat.iscsi-gw:tahir, 
portal: 192.168.10.38,3260] (multiple) 
iscsiadm: Could not login to [iface: default, 
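
For anyone hitting the same error, a minimal sketch of wiring up CHAP on
both ends; the credentials are placeholders and have to match exactly on
the gateway and the initiator (gwcli syntax as of the ceph-iscsi-cli of
that time):

    # on a gateway node, in gwcli, inside the client (initiator IQN) entry under hosts:
    auth chap=myiscsiusername/myiscsipassword

    # on the initiator, in /etc/iscsi/iscsid.conf, then restart iscsid and retry the login:
    node.session.auth.authmethod = CHAP
    node.session.auth.username = myiscsiusername
    node.session.auth.password = myiscsipassword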

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread Christian Balzer

Hello,

On Thu, 19 Oct 2017 17:14:17 -0500 Russell Glaue wrote:

> That is a good idea.
> However, a previous rebalancing process has brought the performance of our
> Guest VMs to a slow drag.
>

Never mind that I'm not sure these SSDs are particularly well suited
for Ceph; your problem is clearly located on that one node.

Not that I think it's the case, but make sure your PG distribution is not
skewed with many more PGs per OSD on that node.
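
A quick way to check that, for what it's worth (the PGS column should be
roughly even across OSDs of the same size):

    ceph osd df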

Once you rule that out, my first guess is the RAID controller; you're
running the SSDs as single RAID0s, I presume?
If so, either a configuration difference or a failed BBU on the controller
could result in the writeback cache being disabled, which would explain
things beautifully.
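
If it is an LSI/MegaRAID-class controller (an assumption on my part), the
logical drive cache state can be compared between the slow node and the
good ones with something like:

    MegaCli64 -LDInfo -Lall -aALL | grep -i 'cache policy'
    # look for "Current Cache Policy: WriteThrough" vs "WriteBack", and any
    # "No Write Cache if Bad BBU" setting kicking in on the slow host.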

As for a temporary test/fix (with reduced redundancy of course), set noout
(or mon_osd_down_out_subtree_limit accordingly) and turn the slow host off.

This should result in much better performance than you have now and of
course be the final confirmation of that host being the culprit.

Christian

> 
> On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez 
> wrote:
> 
> > Hi Russell,
> >
> > as you have 4 servers, assuming you are not doing EC pools, just stop all
> > the OSDs on the second questionable server, mark the OSDs on that server as
> > out, let the cluster rebalance and when all PGs are active+clean just
> > replay the test.
> >
> > All IOs should then go only to the other 3 servers.
> >
> > JC
> >
> > On Oct 19, 2017, at 13:49, Russell Glaue  wrote:
> >
> > No, I have not ruled out the disk controller and backplane making the
> > disks slower.
> > Is there a way I could test that theory, other than swapping out hardware?
> > -RG
> >
> > On Thu, Oct 19, 2017 at 3:44 PM, David Turner 
> > wrote:
> >  
> >> Have you ruled out the disk controller and backplane in the server
> >> running slower?
> >>
> >> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue  wrote:
> >>  
> >>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
> >>> as suggested.
> >>>
> >>> Out of the 4 servers:
> >>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
> >>> Momentarily spiking up to 50% on one server, and 80% on another
> >>> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
> >>> wait. And more than momentarily spiking to 101% disk busy and 250% CPU 
> >>> wait.
> >>> For this 2nd newest server, this was the statistics for about 8 of 9
> >>> disks, with the 9th disk not far behind the others.
> >>>
> >>> I cannot believe all 9 disks are bad
> >>> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
> >>> and same exact server hardware too.
> >>> They were purchased at the same time in the same purchase order and
> >>> arrived at the same time.
> >>> So I cannot believe I just happened to put 9 bad disks in one server,
> >>> and 9 good ones in the other.
> >>>
> >>> I know I have Ceph configured exactly the same on all servers
> >>> And I am sure I have the hardware settings configured exactly the same
> >>> on the 1st and 2nd servers.
> >>> So if I were someone else, I would say it maybe is bad hardware on the
> >>> 2nd server.
> >>> But the 2nd server is running very well without any hint of a problem.
> >>>
> >>> Any other ideas or suggestions?
> >>>
> >>> -RG
> >>>
> >>>
> >>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar 
> >>> wrote:
> >>>  
>  just run the same 32 threaded rados test as you did before and this
>  time run atop while the test is running looking for %busy of cpu/disks. 
>  It
>  should give an idea if there is a bottleneck in them.
> 
>  On 2017-10-18 21:35, Russell Glaue wrote:
> 
>  I cannot run the write test reviewed at the ceph-how-to-test-if-your-s
>  sd-is-suitable-as-a-journal-device blog. The tests write directly to
>  the raw disk device.
>  Reading an infile (created with urandom) on one SSD, writing the
>  outfile to another osd, yields about 17MB/s.
>  But isn't this write speed limited by the speed at which the dd
>  infile can be read?
>  And I assume the best test should be run with no other load.
> 
>  How does one run the rados bench "as stress"?
> 
>  -RG
> 
> 
>  On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
>  wrote:
>   
> > measuring resource load as outlined earlier will show if the drives
> > are performing well or not. Also how many osds do you have  ?
> >
> > On 2017-10-18 19:26, Russell Glaue wrote:
> >
> > The SSD drives are Crucial M500
> > A Ceph user did some benchmarks and found it had good performance
> > https://forum.proxmox.com/threads/ceph-bad-performance-in-
> > qemu-guests.21551/
> >
> > However, a user comment from 3 years ago on the blog post you linked
> > to says to avoid the Crucial M500
> 

Re: [ceph-users] [filestore][journal][prepare_entry] rebuild data_align is 4086, maybe a bug

2017-10-19 Thread Gregory Farnum
On Thu, Oct 19, 2017 at 12:59 AM, zhaomingyue  wrote:
> Hi:
>
> when I analyzed the performance of ceph, I found that rebuild_aligned was
> time-consuming, and the analysis found that rebuild operations were
> performed every time.
>
>
>
> Source code:
>
> FileStore::queue_transactions
>
> –> journal->prepare_entry(o->tls, );
>
> -> data_align = ((*p).get_data_alignment() - bl.length()) & ~CEPH_PAGE_MASK;
>
> -> ret = ebl.rebuild_aligned(CEPH_DIRECTIO_ALIGNMENT);
>
>
>
> Log:
>
> 2017-10-17 19:49:29.706246 7fb472bfe700 10 journal  len 4196131 -> 4202496
> (head 40 pre_pad 4046 bl 4196131 post_pad 2239 tail 40) (bl alignment 4086)
>
>
>
> question:
>
> I see “alignment = 4086”, and I think it may be a bug.
>
> I think it should be 4096,
>
> because CEPH_DIRECTIO_ALIGNMENT is 4096.

What led you to this, and what version of the code are you
running/examining? I don't see any instance of "4086" in the codebase,
so there's not a typo, CEPH_DIRECTIO_ALIGNMENT is set to 4096, and our
invocations of rebuild_aligned in master do not look like the code
snippets you've got there that I can see.
-Greg
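
For what it's worth, the "(bl alignment 4086)" in that log line is the
computed data_align value, which by the quoted expression is
(data_alignment - bl.length()) reduced to a page offset, i.e. a number in
0..4095, so 4086 is a plausible value rather than a mistyped 4096. A tiny
sketch of the masking, assuming CEPH_PAGE_SIZE is 4096 and CEPH_PAGE_MASK
is ~(CEPH_PAGE_SIZE - 1):

    # (x) & ~CEPH_PAGE_MASK keeps only the low 12 bits of x:
    data_alignment=8192; bl_length=4106              # hypothetical numbers
    echo $(( (data_alignment - bl_length) & 4095 ))  # prints 4086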
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread Russell Glaue
That is a good idea.
However, a previous rebalancing process has brought the performance of our
Guest VMs to a slow drag.
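
If the OSDs on that host do have to be taken out at some point, the impact
of the rebalance can at least be throttled (a rough sketch; option names
and defaults vary a bit between releases):

    ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'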


On Thu, Oct 19, 2017 at 3:55 PM, Jean-Charles Lopez 
wrote:

> Hi Russell,
>
> as you have 4 servers, assuming you are not doing EC pools, just stop all
> the OSDs on the second questionable server, mark the OSDs on that server as
> out, let the cluster rebalance and when all PGs are active+clean just
> replay the test.
>
> All IOs should then go only to the other 3 servers.
>
> JC
>
> On Oct 19, 2017, at 13:49, Russell Glaue  wrote:
>
> No, I have not ruled out the disk controller and backplane making the
> disks slower.
> Is there a way I could test that theory, other than swapping out hardware?
> -RG
>
> On Thu, Oct 19, 2017 at 3:44 PM, David Turner 
> wrote:
>
>> Have you ruled out the disk controller and backplane in the server
>> running slower?
>>
>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue  wrote:
>>
>>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
>>> as suggested.
>>>
>>> Out of the 4 servers:
>>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
>>> Momentarily spiking up to 50% on one server, and 80% on another
>>> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
>>> wait. And more than momentarily spiking to 101% disk busy and 250% CPU wait.
>>> For this 2nd newest server, this was the statistics for about 8 of 9
>>> disks, with the 9th disk not far behind the others.
>>>
>>> I cannot believe all 9 disks are bad
>>> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
>>> and same exact server hardware too.
>>> They were purchased at the same time in the same purchase order and
>>> arrived at the same time.
>>> So I cannot believe I just happened to put 9 bad disks in one server,
>>> and 9 good ones in the other.
>>>
>>> I know I have Ceph configured exactly the same on all servers
>>> And I am sure I have the hardware settings configured exactly the same
>>> on the 1st and 2nd servers.
>>> So if I were someone else, I would say it maybe is bad hardware on the
>>> 2nd server.
>>> But the 2nd server is running very well without any hint of a problem.
>>>
>>> Any other ideas or suggestions?
>>>
>>> -RG
>>>
>>>
>>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar 
>>> wrote:
>>>
 just run the same 32 threaded rados test as you did before and this
 time run atop while the test is running looking for %busy of cpu/disks. It
 should give an idea if there is a bottleneck in them.

 On 2017-10-18 21:35, Russell Glaue wrote:

 I cannot run the write test reviewed at the ceph-how-to-test-if-your-s
 sd-is-suitable-as-a-journal-device blog. The tests write directly to
 the raw disk device.
 Reading an infile (created with urandom) on one SSD, writing the
 outfile to another osd, yields about 17MB/s.
 But isn't this write speed limited by the speed at which the dd
 infile can be read?
 And I assume the best test should be run with no other load.

 How does one run the rados bench "as stress"?

 -RG


 On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
 wrote:

> measuring resource load as outlined earlier will show if the drives
> are performing well or not. Also how many osds do you have  ?
>
> On 2017-10-18 19:26, Russell Glaue wrote:
>
> The SSD drives are Crucial M500
> A Ceph user did some benchmarks and found it had good performance
> https://forum.proxmox.com/threads/ceph-bad-performance-in-
> qemu-guests.21551/
>
> However, a user comment from 3 years ago on the blog post you linked
> to says to avoid the Crucial M500
>
> Yet, this performance posting tells that the Crucial M500 is good.
> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>
> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar 
> wrote:
>
>> Check out the following link: some SSDs perform bad in Ceph due to
>> sync writes to journal
>>
>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes
>> t-if-your-ssd-is-suitable-as-a-journal-device/
>>
>> Anther thing that can help is to re-run the rados 32 threads as
>> stress and view resource usage using atop (or collectl/sar) to check for
>> %busy cpu and %busy disks to give you an idea of what is holding down 
>> your
>> cluster..for example: if cpu/disk % are all low then check your
>> network/switches.  If disk %busy is high (90%) for all disks then your
>> disks are the bottleneck: which either means you have SSDs that are not
>> suitable for Ceph or you have too few disks (which i doubt is the case). 
>> If
>> only 1 disk %busy is high, there may be something wrong with this disk
>> should 

Re: [ceph-users] Backup VM (Base image + snapshot)

2017-10-19 Thread Oscar Segarra
Hi Richard,

Thanks a lot for sharing your experience... I have investigated further
and it looks like export-diff is the most common tool used for backups,
as you suggested.

I will run some tests with export-diff and share my experience.
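
For reference, the basic incremental cycle looks roughly like this (pool,
image and snapshot names are placeholders):

    rbd snap create rbd/vm-disk@backup-2017-10-19
    rbd export-diff --from-snap backup-2017-10-18 rbd/vm-disk@backup-2017-10-19 vm-disk.2017-10-19.diff
    rbd merge-diff vm-disk.2017-10-18.diff vm-disk.2017-10-19.diff vm-disk.merged.diff   # optional "synthetic full"
    # restore: rbd import-diff <file>.diff rbd/vm-disk, applied on top of the last full export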

Again, thanks a lot!

2017-10-16 12:00 GMT+02:00 Richard Hesketh :

> On 16/10/17 03:40, Alex Gorbachev wrote:
> > On Sat, Oct 14, 2017 at 12:25 PM, Oscar Segarra 
> wrote:
> >> Hi,
> >>
> >> In my VDI environment I have configured the suggested ceph
> >> design/arquitecture:
> >>
> >> http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/
> >>
> >> Where I have a Base Image + Protected Snapshot + 100 clones (one for
> each
> >> persistent VDI).
> >>
> >> Now, I'd like to configure a backup script/mechanism to perform backups
> of
> >> each persistent VDI VM to an external (non ceph) device, like NFS or
> >> something similar...
> >>
> >> Then, some questions:
> >>
> >> 1.- Does anybody have been able to do this kind of backups?
> >
> > Yes, we have been using export-diff successfully (note this is off a
> > snapshot and not a clone) to back up and restore ceph images to
> > non-ceph storage.  You can use merge-diff to create "synthetic fulls"
> > and even do some basic replication to another cluster.
> >
> > http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/
> >
> > http://docs.ceph.com/docs/master/dev/rbd-export/
> >
> > http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication
> >
> > --
> > Alex Gorbachev
> > Storcium
> >
> >> 2.- Is it possible to export BaseImage in qcow2 format and snapshots in
> >> qcow2 format as well as "linked clones" ?
> >> 3.- Is it possible to export the Base Image in raw format, snapshots in
> raw
> >> format as well and, when recover is required, import both images and
> >> "relink" them?
> >> 4.- What is the suggested solution for this scenario?
> >>
> >> Thanks a lot everybody!
>
> In my setup I backup individually complete raw disk images to file,
> because then they're easier to manually inspect and grab data off in the
> event of catastrophic cluster failure. I haven't personally bothered trying
> to preserve the layering between master/clone images in backup form; that
> sounds like a bunch of effort and by inspection the amount of space it'd
> actually save in my use case is really minimal.
>
> However I do use export-diff in order to make backups efficient - a
> rolling snapshot on each RBD is used to export the day's diff out of the
> cluster and then the ceph_apply_diff utility from https://gp2x.org/ceph/
> is used to apply that diff to the raw image file (though I did patch it to
> work with streaming input and eliminate the necessity for a temporary file
> containing the diff). There are a handful of very large RBDs in my cluster
> for which exporting the full disk image takes a prohibitively long time,
> which made leveraging diffs necessary.
>
> For a while, I was instead just exporting diffs and using merge-diff to
> munge them together into big super-diffs, and the restoration procedure
> would be to apply the merged diff to a freshly made image in the cluster.
> This worked, but it is a more fiddly recovery process; importing complete
> disk images is easier. I don't think it's possible to create two images in
> the cluster and then link them into a layering relationship; you'd have to
> import the base image, clone it, and them import a diff onto that clone if
> you wanted to recreate the original layering.
>
> Rich
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code failure

2017-10-19 Thread David Turner
Unless your min_size is set to 3, you are not hitting the bug in the
tracker you linked.  Most likely you are running with a min_size of 2, which
means that bug is not relevant to your cluster.  Upload this if you
wouldn't mind.  `ceph osd pool get {pool_name} all`
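
For context, whether the pool keeps serving I/O with one of the three
hosts down hinges on that min_size value relative to k (PGs go inactive
once fewer than min_size shards are available); a quick way to check it,
with the pool name as a placeholder:

    ceph osd pool get <ecpool> min_size
    ceph osd pool get <ecpool> erasure_code_profile
    ceph osd erasure-code-profile get <profile>   # shows k and m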

On Thu, Oct 19, 2017 at 5:03 PM Jorge Pinilla López 
wrote:

> Yes, I am trying it over luminous.
>
> Well, the bug has been open for 8 months and the fix hasn't been merged yet. I
> don't know if that is what's preventing me from making it work. Tomorrow I will
> try to test it again.
>
> El 19/10/2017 a las 23:00, David Turner escribió:
>
> Running a cluster on various versions of Hammer and Jewel I haven't had
> any problems.  I haven't upgraded to Luminous quite yet, but I'd be
> surprised if there is that severe of a regression especially since they did
> so many improvements to Erasure Coding.
>
> On Thu, Oct 19, 2017 at 4:59 PM Jorge Pinilla López 
> wrote:
>
>> Well I was trying it some days ago and it didn't work for me.
>>
>> maybe because of this:
>>
>> http://tracker.ceph.com/issues/18749
>>
>> https://github.com/ceph/ceph/pull/17619
>>
>> I don't know if now it's actually working
>>
>> El 19/10/2017 a las 22:55, David Turner escribió:
>>
>> In a 3 node cluster with EC k=2 m=1, you can turn off one of the nodes
>> and the cluster will still operate normally.  If you lose a disk during
>> this state or another server goes offline, then you lose access to your
>> data.  But assuming that you bring up the third node and let it finish
>> backfilling/recovering before restarting any other nodes, then you're fine.
>>
>> On Thu, Oct 19, 2017 at 4:49 PM Jorge Pinilla López 
>> wrote:
>>
>>> Imagine we have a 3 OSDs cluster and I make an erasure pool with k=2 m=1.
>>>
>>> If I have an OSD fail, we can rebuild the data but (I think) the whole
>>> cluster won't be able to perform I/O.
>>>
>>> Wouldn't it be possible to make the cluster work in a degraded mode?
>>> I think it would be a good idea to make the cluster work on degraded
>>> mode and promise to re balance/re build whenever a third OSD comes alive.
>>> On reads, it could serve the data using the live data chunks and
>>> rebuilding (if necessary) the missing ones(using cpu to calculate the data
>>> before serving// with 0 RTA) or trying to rebuild the missing parts so it
>>> actually has the 2 data chunks on the 2 live OSDs (with some RTA and space
>>> usage) or even doing both things at the same time (with high network and
>>> cpu and storage cost).
>>> On writes, it could write the 2 data parts into the live OSDs and
>>> whenever the third OSD comes up, the cluster could re balance rebuilding
>>> the parity chunk and re positioning the parts so all OSDs have the same
>>> amount of data/work.
>>>
>>> would this be possible?
>>>
>>> --
>>> *Jorge Pinilla López*
>>> jorp...@unizar.es
>>> Estudiante de ingenieria informática
>>> Becario del area de sistemas (SICUZ)
>>> Universidad de Zaragoza
>>> PGP-KeyID: A34331932EBC715A
>>> 
>>> --
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>> --
>> --
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> Estudiante de ingenieria informática
>> Becario del area de sistemas (SICUZ)
>> Universidad de Zaragoza
>> PGP-KeyID: A34331932EBC715A
>> 
>> --
>>
>
> --
> --
> *Jorge Pinilla López*
> jorp...@unizar.es
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
> --
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph delete files and status

2017-10-19 Thread nigel davies
I am using RGW, with an S3 bucket setup.

The live version also uses RBD as well.
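
For what it's worth, with RGW the space of deleted S3 objects is reclaimed
asynchronously by the garbage collector, so "ceph status" / "ceph df" can
lag behind the deletes; a quick way to look at (and nudge) it, assuming
radosgw-admin is available on an admin node:

    radosgw-admin gc list --include-all   # objects queued for deletion
    radosgw-admin gc process              # run a garbage-collection pass now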

On 19 Oct 2017 10:04 pm, "David Turner"  wrote:

How are you uploading a file?  RGW, librados, CephFS, or RBD?  There are
multiple reasons that the space might not be updating or cleaning itself
up.  The more information you can give us about how you're testing, the
more we can help you.

On Thu, Oct 19, 2017 at 5:00 PM nigel davies  wrote:

> Hay
>
> I somehow got the space back, by tweaking the reweights.
>
> But I am a tad confused: I uploaded a file (200MB) then removed the file,
> and the space has not changed. I am not sure why that happens and what I can
> do.
>
> On Thu, Oct 19, 2017 at 6:42 PM, nigel davies  wrote:
>
>> PS was not aware of fstrim
>>
>> On 19 Oct 2017 6:41 pm, "nigel davies"  wrote:
>>
> Hay
>>>
>>> My Ceph cluster is connected to a Ceph gateway for an S3 server; I was
>>> uploading the file and removing it from the bucket using s3cmd.
>>>
>>> I uploaded and removed the file a few times, and now the cluster is close to
>>> full.
>>>
>>> I was told their is a way to clear the any deleted files out or
>>> something.
>>>
>>
>>> On 19 Oct 2017 5:09 pm, "Jamie Fargen"  wrote:
>>>
 Nigel-

 What method did you use to upload and delete the file? How did you
 check the space utilization? I believe the reason that you are still seeing
 the space being utilized when you issue your ceph -df is because even after
 the file is deleted, the file system doesn't actually delete the file, it
 just removes the file inode entry pointing to the file. The file will still
 be on the disk until the blocks are re-allocated to another file.

 -Jamie

 On Thu, Oct 19, 2017 at 11:54 AM, nigel davies 
 wrote:

> Hay all
>
> I am looking at my small test Ceph cluster; I have uploaded a 200MB
> iso, checked the space with "ceph status", and seen it increase.
>
> But when i delete the file the space used does not go down.
>
> Have i missed a configuration somewhere or something?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


 --
 Jamie Fargen
 Consultant
 jfar...@redhat.com
 813-817-4430 <(813)%20817-4430>

>>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph delete files and status

2017-10-19 Thread David Turner
How are you uploading a file?  RGW, librados, CephFS, or RBD?  There are
multiple reasons that the space might not be updating or cleaning itself
up.  The more information you can give us about how you're testing, the
more we can help you.

On Thu, Oct 19, 2017 at 5:00 PM nigel davies  wrote:

> Hay
>
> I somehow got the space back, by tweaking the reweights.
>
> But I am a tad confused: I uploaded a file (200MB) then removed the file,
> and the space has not changed. I am not sure why that happens and what I can
> do.
>
> On Thu, Oct 19, 2017 at 6:42 PM, nigel davies  wrote:
>
>> PS was not aware of fstrim
>>
>> On 19 Oct 2017 6:41 pm, "nigel davies"  wrote:
>>
> Hay
>>>
>>> My Ceph cluster is connected to a Ceph gateway for an S3 server; I was
>>> uploading the file and removing it from the bucket using s3cmd.
>>>
>>> I uploaded and removed the file a few times, and now the cluster is close to
>>> full.
>>>
>>> I was told their is a way to clear the any deleted files out or
>>> something.
>>>
>>
>>> On 19 Oct 2017 5:09 pm, "Jamie Fargen"  wrote:
>>>
 Nigel-

 What method did you use to upload and delete the file? How did you
 check the space utilization? I believe the reason that you are still seeing
 the space being utilized when you issue your ceph -df is because even after
 the file is deleted, the file system doesn't actually delete the file, it
 just removes the file inode entry pointing to the file. The file will still
 be on the disk until the blocks are re-allocated to another file.

 -Jamie

 On Thu, Oct 19, 2017 at 11:54 AM, nigel davies 
 wrote:

> Hay all
>
> I am looking at my small test Ceph cluster; I have uploaded a 200MB
> iso, checked the space with "ceph status", and seen it increase.
>
> But when i delete the file the space used does not go down.
>
> Have i missed a configuration somewhere or something?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


 --
 Jamie Fargen
 Consultant
 jfar...@redhat.com
 813-817-4430 <(813)%20817-4430>

>>> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code failure

2017-10-19 Thread Jorge Pinilla López
Yes, I am trying it over luminous.

Well, the bug has been open for 8 months and the fix hasn't been merged yet.
I don't know if that is what's preventing me from making it work. Tomorrow I
will try to test it again.


El 19/10/2017 a las 23:00, David Turner escribió:
> Running a cluster on various versions of Hammer and Jewel I haven't
> had any problems.  I haven't upgraded to Luminous quite yet, but I'd
> be surprised if there is that severe of a regression especially since
> they did so many improvements to Erasure Coding.
>
> On Thu, Oct 19, 2017 at 4:59 PM Jorge Pinilla López  > wrote:
>
> Well I was trying it some days ago and it didn't work for me.
>
> maybe because of this:
>
> http://tracker.ceph.com/issues/18749
>
> https://github.com/ceph/ceph/pull/17619
>
> I don't know if now it's actually working
>
>
> El 19/10/2017 a las 22:55, David Turner escribió:
>> In a 3 node cluster with EC k=2 m=1, you can turn off one of the
>> nodes and the cluster will still operate normally.  If you lose a
>> disk during this state or another server goes offline, then you
>> lose access to your data.  But assuming that you bring up the
>> third node and let it finish backfilling/recovering before
>> restarting any other nodes, then you're fine.
>>
>> On Thu, Oct 19, 2017 at 4:49 PM Jorge Pinilla López
>> > wrote:
>>
>> Imagine we have a 3 OSDs cluster and I make an erasure pool
>> with k=2 m=1.
>>
>> If I have an OSD fail, we can rebuild the data but (I think)
>> the whole cluster won't be able to perform I/O.
>>
>> Wouldn't it be possible to make the cluster work in a degraded
>> mode?
>> I think it would be a good idea to make the cluster work on
>> degraded mode and promise to re balance/re build whenever a
>> third OSD comes alive.
>> On reads, it could serve the data using the live data chunks
>> and rebuilding (if necessary) the missing ones(using cpu to
>> calculate the data before serving// with 0 RTA) or trying to
>> rebuild the missing parts so it actually has the 2 data
>> chunks on the 2 live OSDs (with some RTA and space usage) or
>> even doing both things at the same time (with high network
>> and cpu and storage cost).
>> On writes, it could write the 2 data parts into the live OSDs
>> and whenever the third OSD comes up, the cluster could re
>> balance rebuilding the parity chunk and re positioning the
>> parts so all OSDs have the same amount of data/work.
>>
>> would this be possible?
>>
>> 
>> 
>> *Jorge Pinilla López*
>> jorp...@unizar.es 
>> Estudiante de ingenieria informática
>> Becario del area de sistemas (SICUZ)
>> Universidad de Zaragoza
>> PGP-KeyID: A34331932EBC715A
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> -- 
> 
> *Jorge Pinilla López*
> jorp...@unizar.es 
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
> 
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code failure

2017-10-19 Thread David Turner
Running a cluster on various versions of Hammer and Jewel I haven't had any
problems.  I haven't upgraded to Luminous quite yet, but I'd be surprised
if there is that severe of a regression especially since they did so many
improvements to Erasure Coding.

On Thu, Oct 19, 2017 at 4:59 PM Jorge Pinilla López 
wrote:

> Well I was trying it some days ago and it didn't work for me.
>
> maybe because of this:
>
> http://tracker.ceph.com/issues/18749
>
> https://github.com/ceph/ceph/pull/17619
>
> I don't know if now it's actually working
>
> El 19/10/2017 a las 22:55, David Turner escribió:
>
> In a 3 node cluster with EC k=2 m=1, you can turn off one of the nodes and
> the cluster will still operate normally.  If you lose a disk during this
> state or another server goes offline, then you lose access to your data.
> But assuming that you bring up the third node and let it finish
> backfilling/recovering before restarting any other nodes, then you're fine.
>
> On Thu, Oct 19, 2017 at 4:49 PM Jorge Pinilla López 
> wrote:
>
>> Imagine we have a 3 OSDs cluster and I make an erasure pool with k=2 m=1.
>>
>> If I have an OSD fail, we can rebuild the data but (I think) the whole
>> cluster won't be able to perform I/O.
>>
>> Wouldn't it be possible to make the cluster work in a degraded mode?
>> I think it would be a good idea to make the cluster work on degraded mode
>> and promise to re balance/re build whenever a third OSD comes alive.
>> On reads, it could serve the data using the live data chunks and
>> rebuilding (if necessary) the missing ones(using cpu to calculate the data
>> before serving// with 0 RTA) or trying to rebuild the missing parts so it
>> actually has the 2 data chunks on the 2 live OSDs (with some RTA and space
>> usage) or even doing both things at the same time (with high network and
>> cpu and storage cost).
>> On writes, it could write the 2 data parts into the live OSDs and
>> whenever the third OSD comes up, the cluster could re balance rebuilding
>> the parity chunk and re positioning the parts so all OSDs have the same
>> amount of data/work.
>>
>> would this be possible?
>>
>> --
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> Estudiante de ingenieria informática
>> Becario del area de sistemas (SICUZ)
>> Universidad de Zaragoza
>> PGP-KeyID: A34331932EBC715A
>> 
>> --
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> --
> --
> *Jorge Pinilla López*
> jorp...@unizar.es
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
> --
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph delete files and status

2017-10-19 Thread nigel davies
Hay

I somehow got the space back, by tweaking the reweights.

But I am a tad confused: I uploaded a file (200MB) then removed the file, and
the space has not changed. I am not sure why that happens and what I can do.

On Thu, Oct 19, 2017 at 6:42 PM, nigel davies  wrote:

> PS was not aware of fstrim
>
> On 19 Oct 2017 6:41 pm, "nigel davies"  wrote:
>
>> Hay
>>
>> My Ceph cluster is connected to a Ceph gateway for an S3 server; I was
>> uploading the file and removing it from the bucket using s3cmd.
>>
>> I uploaded and removed the file a few times, and now the cluster is close to
>> full.
>>
>> I was told their is a way to clear the any deleted files out or something.
>>
>> On 19 Oct 2017 5:09 pm, "Jamie Fargen"  wrote:
>>
>>> Nigel-
>>>
>>> What method did you use to upload and delete the file? How did you check
>>> the space utilization? I believe the reason that you are still seeing the
>>> space being utilized when you issue your ceph -df is because even after the
>>> file is deleted, the file system doesn't actually delete the file, it just
>>> removes the file inode entry pointing to the file. The file will still be
>>> on the disk until the blocks are re-allocated to another file.
>>>
>>> -Jamie
>>>
>>> On Thu, Oct 19, 2017 at 11:54 AM, nigel davies 
>>> wrote:
>>>
 Hay all

 I am looking at my small test Ceph cluster; I have uploaded a 200MB iso,
 checked the space with "ceph status", and seen it increase.

 But when i delete the file the space used does not go down.

 Have i missed a configuration somewhere or something?

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>>
>>> --
>>> Jamie Fargen
>>> Consultant
>>> jfar...@redhat.com
>>> 813-817-4430 <(813)%20817-4430>
>>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code failure

2017-10-19 Thread Jorge Pinilla López
Well I was trying it some days ago and it didn't work for me.

maybe because of this:

http://tracker.ceph.com/issues/18749

https://github.com/ceph/ceph/pull/17619

I don't know if it's actually working now


El 19/10/2017 a las 22:55, David Turner escribió:
> In a 3 node cluster with EC k=2 m=1, you can turn off one of the nodes
> and the cluster will still operate normally.  If you lose a disk
> during this state or another server goes offline, then you lose access
> to your data.  But assuming that you bring up the third node and let
> it finish backfilling/recovering before restarting any other nodes,
> then you're fine.
>
> On Thu, Oct 19, 2017 at 4:49 PM Jorge Pinilla López  > wrote:
>
> Imagine we have a 3 OSDs cluster and I make an erasure pool with
> k=2 m=1.
>
> If I have an OSD fail, we can rebuild the data but (I think) the
> whole cluster won't be able to perform I/O.
>
> Wouldn't it be possible to make the cluster work in a degraded mode?
> I think it would be a good idea to make the cluster work on
> degraded mode and promise to re balance/re build whenever a third
> OSD comes alive.
> On reads, it could serve the data using the live data chunks and
> rebuilding (if necessary) the missing ones(using cpu to calculate
> the data before serving// with 0 RTA) or trying to rebuild the
> missing parts so it actually has the 2 data chunks on the 2 live
> OSDs (with some RTA and space usage) or even doing both things at
> the same time (with high network and cpu and storage cost).
> On writes, it could write the 2 data parts into the live OSDs and
> whenever the third OSD comes up, the cluster could re balance
> rebuilding the parity chunk and re positioning the parts so all
> OSDs have the same amount of data/work.
>
> would this be possible?
>
> 
> *Jorge Pinilla López*
> jorp...@unizar.es 
> Estudiante de ingenieria informática
> Becario del area de sistemas (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria informática
Becario del area de sistemas (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread Russell Glaue
I'm better off trying to solve the first hurdle.
This ceph cluster is in production serving 186 guest VMs.
-RG

On Thu, Oct 19, 2017 at 3:52 PM, David Turner  wrote:

> Assuming the problem with swapping out hardware is having spare
> hardware... you could always switch hardware between nodes and see if the
> problem follows the component.
>
> On Thu, Oct 19, 2017 at 4:49 PM Russell Glaue  wrote:
>
>> No, I have not ruled out the disk controller and backplane making the
>> disks slower.
>> Is there a way I could test that theory, other than swapping out hardware?
>> -RG
>>
>> On Thu, Oct 19, 2017 at 3:44 PM, David Turner 
>> wrote:
>>
>>> Have you ruled out the disk controller and backplane in the server
>>> running slower?
>>>
>>> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue  wrote:
>>>
 I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
 as suggested.

 Out of the 4 servers:
 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
 Momentarily spiking up to 50% on one server, and 80% on another
 The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
 wait. And more than momentarily spiking to 101% disk busy and 250% CPU 
 wait.
 For this 2nd newest server, this was the statistics for about 8 of 9
 disks, with the 9th disk not far behind the others.

 I cannot believe all 9 disks are bad
 They are the same disks as the newest 1st
 server, Crucial_CT960M500SSD1, and same exact server hardware too.
 They were purchased at the same time in the same purchase order and
 arrived at the same time.
 So I cannot believe I just happened to put 9 bad disks in one server,
 and 9 good ones in the other.

 I know I have Ceph configured exactly the same on all servers
 And I am sure I have the hardware settings configured exactly the same
 on the 1st and 2nd servers.
 So if I were someone else, I would say it may be bad hardware on the
 2nd server.
 But the 2nd server is running very well without any hint of a problem.

 Any other ideas or suggestions?

 -RG


 On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar 
 wrote:

> just run the same 32 threaded rados test as you did before and this
> time run atop while the test is running looking for %busy of cpu/disks. It
> should give an idea if there is a bottleneck in them.
>
> On 2017-10-18 21:35, Russell Glaue wrote:
>
> I cannot run the write test reviewed at the ceph-how-to-test-if-your-
> ssd-is-suitable-as-a-journal-device blog. The tests write directly to
> the raw disk device.
> Reading an infile (created with urandom) on one SSD, writing the
> outfile to another osd, yields about 17MB/s.
> But isn't this write speed limited by the speed at which the dd
> infile can be read?
> And I assume the best test should be run with no other load.
>
> How does one run the rados bench "as stress"?
>
> -RG
>
>
> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
> wrote:
>
>> measuring resource load as outlined earlier will show if the drives
>> are performing well or not. Also how many osds do you have  ?
>>
>> On 2017-10-18 19:26, Russell Glaue wrote:
>>
>> The SSD drives are Crucial M500
>> A Ceph user did some benchmarks and found it had good performance
>> https://forum.proxmox.com/threads/ceph-bad-performance-
>> in-qemu-guests.21551/
>>
>> However, a user comment from 3 years ago on the blog post you linked
>> to says to avoid the Crucial M500
>>
>> Yet, this performance posting tells that the Crucial M500 is good.
>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>
>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar > > wrote:
>>
>>> Check out the following link: some SSDs perform bad in Ceph due to
>>> sync writes to journal
>>>
>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-
>>> test-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> Another thing that can help is to re-run the rados 32 threads as
>>> stress and view resource usage using atop (or collectl/sar) to check for
>>> %busy cpu and %busy disks to give you an idea of what is holding down 
>>> your
>>> cluster..for example: if cpu/disk % are all low then check your
>>> network/switches.  If disk %busy is high (90%) for all disks then your
>>> disks are the bottleneck: which either means you have SSDs that are not
>>> suitable for Ceph or you have too few disks (which i doubt is the 
>>> case). If
>>> only 1 disk %busy is high, there may be something wrong with this disk
>>> and it should be removed.
>>>
>>> Maged
>>>
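For reference, a minimal way to generate the 32-thread load described above and
watch per-disk utilisation at the same time (the pool name and sampling interval
are illustrative):

$ rados bench -p rbdbench 60 write -t 32 --no-cleanup    # run from a client node
$ atop 2                                                 # run on each OSD node while the bench is active
$ rados -p rbdbench cleanup                              # remove the benchmark objects afterwards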

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread Russell Glaue
No, I have not ruled out the disk controller and backplane making the disks
slower.
Is there a way I could test that theory, other than swapping out hardware?
-RG

On Thu, Oct 19, 2017 at 3:44 PM, David Turner  wrote:

> Have you ruled out the disk controller and backplane in the server running
> slower?
>
> On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue  wrote:
>
>> I ran the test on the Ceph pool, and ran atop on all 4 storage servers,
>> as suggested.
>>
>> Out of the 4 servers:
>> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
>> Momentarily spiking up to 50% on one server, and 80% on another
>> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
>> wait. And more than momentarily spiking to 101% disk busy and 250% CPU wait.
>> For this 2nd newest server, this was the statistics for about 8 of 9
>> disks, with the 9th disk not far behind the others.
>>
>> I cannot believe all 9 disks are bad
>> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
>> and same exact server hardware too.
>> They were purchased at the same time in the same purchase order and
>> arrived at the same time.
>> So I cannot believe I just happened to put 9 bad disks in one server, and
>> 9 good ones in the other.
>>
>> I know I have Ceph configured exactly the same on all servers
>> And I am sure I have the hardware settings configured exactly the same on
>> the 1st and 2nd servers.
>> So if I were someone else, I would say it may be bad hardware on the
>> 2nd server.
>> But the 2nd server is running very well without any hint of a problem.
>>
>> Any other ideas or suggestions?
>>
>> -RG
>>
>>
>> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar 
>> wrote:
>>
>>> just run the same 32 threaded rados test as you did before and this time
>>> run atop while the test is running looking for %busy of cpu/disks. It
>>> should give an idea if there is a bottleneck in them.
>>>
>>> On 2017-10-18 21:35, Russell Glaue wrote:
>>>
>>> I cannot run the write test reviewed at the ceph-how-to-test-if-your-
>>> ssd-is-suitable-as-a-journal-device blog. The tests write directly to
>>> the raw disk device.
>>> Reading an infile (created with urandom) on one SSD, writing the outfile
>>> to another osd, yields about 17MB/s.
>>> But isn't this write speed limited by the speed at which the dd
>>> infile can be read?
>>> And I assume the best test should be run with no other load.
>>>
>>> How does one run the rados bench "as stress"?
>>>
>>> -RG
>>>
>>>
>>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
>>> wrote:
>>>
 measuring resource load as outlined earlier will show if the drives are
 performing well or not. Also how many osds do you have  ?

 On 2017-10-18 19:26, Russell Glaue wrote:

 The SSD drives are Crucial M500
 A Ceph user did some benchmarks and found it had good performance
 https://forum.proxmox.com/threads/ceph-bad-performance-
 in-qemu-guests.21551/

 However, a user comment from 3 years ago on the blog post you linked to
 says to avoid the Crucial M500

 Yet, this performance posting tells that the Crucial M500 is good.
 https://inside.servers.com/ssd-performance-2017-c4307a92dea

 On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar 
 wrote:

> Check out the following link: some SSDs perform bad in Ceph due to
> sync writes to journal
>
> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-
> test-if-your-ssd-is-suitable-as-a-journal-device/
>
> Another thing that can help is to re-run the rados 32 threads as stress
> and view resource usage using atop (or collectl/sar) to check for %busy 
> cpu
> and %busy disks to give you an idea of what is holding down your
> cluster..for example: if cpu/disk % are all low then check your
> network/switches.  If disk %busy is high (90%) for all disks then your
> disks are the bottleneck: which either means you have SSDs that are not
> suitable for Ceph or you have too few disks (which i doubt is the case). 
> If
> only 1 disk %busy is high, there may be something wrong with this disk
> and it should be removed.
>
> Maged
>
> On 2017-10-18 18:13, Russell Glaue wrote:
>
> In my previous post, in one of my points I was wondering if the
> request size would increase if I enabled jumbo packets. currently it is
> disabled.
>
> @jdillama: The qemu settings for both these two guest machines, with
> RAID/LVM and Ceph/rbd images, are the same. I am not thinking that 
> changing
> the qemu settings of "min_io_size=,opt_io_size=<image object size>" will directly address the issue.
>
> @mmokhtar: Ok. So you suggest the request size is the result of the
> problem and not the cause of the problem. meaning I should go after a
> different issue.

[ceph-users] Erasure code failure

2017-10-19 Thread Jorge Pinilla López
Imagine we have a 3-OSD cluster and I make an erasure pool with k=2 m=1.

If I have an OSD fail, we can rebuild the data but (I think) the whole
cluster won't be able to perform I/O.

Wouldn't it be possible to make the cluster work in a degraded mode?
I think it would be a good idea to make the cluster work in degraded
mode and promise to rebalance/rebuild whenever a third OSD comes alive.
On reads, it could serve the data using the live data chunks and
rebuilding (if necessary) the missing ones (using CPU to calculate the
data before serving // with 0 RTA), or trying to rebuild the missing parts
so it actually has the 2 data chunks on the 2 live OSDs (with some RTA
and space usage), or even doing both things at the same time (with high
network, CPU and storage cost).
On writes, it could write the 2 data parts onto the live OSDs and,
whenever the third OSD comes up, the cluster could rebalance, rebuilding
the parity chunk and repositioning the parts so all OSDs have the same
amount of data/work.

Would this be possible?


*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Intern at the systems area (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread David Turner
Have you ruled out the disk controller and backplane in the server running
slower?

On Thu, Oct 19, 2017 at 4:42 PM Russell Glaue  wrote:

> I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as
> suggested.
>
> Out of the 4 servers:
> 3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
> Momentarily spiking up to 50% on one server, and 80% on another
> The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
> wait. And more than momentarily spiking to 101% disk busy and 250% CPU wait.
> For this 2nd newest server, this was the statistics for about 8 of 9
> disks, with the 9th disk not far behind the others.
>
> I cannot believe all 9 disks are bad
> They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
> and same exact server hardware too.
> They were purchased at the same time in the same purchase order and
> arrived at the same time.
> So I cannot believe I just happened to put 9 bad disks in one server, and
> 9 good ones in the other.
>
> I know I have Ceph configured exactly the same on all servers
> And I am sure I have the hardware settings configured exactly the same on
> the 1st and 2nd servers.
> So if I were someone else, I would say it may be bad hardware on the 2nd
> server.
> But the 2nd server is running very well without any hint of a problem.
>
> Any other ideas or suggestions?
>
> -RG
>
>
> On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar 
> wrote:
>
>> just run the same 32 threaded rados test as you did before and this time
>> run atop while the test is running looking for %busy of cpu/disks. It
>> should give an idea if there is a bottleneck in them.
>>
>> On 2017-10-18 21:35, Russell Glaue wrote:
>>
>> I cannot run the write test reviewed at
>> the ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device blog. The
>> tests write directly to the raw disk device.
>> Reading an infile (created with urandom) on one SSD, writing the outfile
>> to another osd, yields about 17MB/s.
>> But isn't this write speed limited by the speed at which the dd infile
>> can be read?
>> And I assume the best test should be run with no other load.
>>
>> How does one run the rados bench "as stress"?
>>
>> -RG
>>
>>
>> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
>> wrote:
>>
>>> measuring resource load as outlined earlier will show if the drives are
>>> performing well or not. Also how many osds do you have  ?
>>>
>>> On 2017-10-18 19:26, Russell Glaue wrote:
>>>
>>> The SSD drives are Crucial M500
>>> A Ceph user did some benchmarks and found it had good performance
>>>
>>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qemu-guests.21551/
>>>
>>> However, a user comment from 3 years ago on the blog post you linked to
>>> says to avoid the Crucial M500
>>>
>>> Yet, this performance posting tells that the Crucial M500 is good.
>>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>>
>>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar 
>>> wrote:
>>>
 Check out the following link: some SSDs perform bad in Ceph due to sync
 writes to journal


 https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 Another thing that can help is to re-run the rados 32 threads as stress
 and view resource usage using atop (or collectl/sar) to check for %busy cpu
 and %busy disks to give you an idea of what is holding down your
 cluster..for example: if cpu/disk % are all low then check your
 network/switches.  If disk %busy is high (90%) for all disks then your
 disks are the bottleneck: which either means you have SSDs that are not
 suitable for Ceph or you have too few disks (which i doubt is the case). If
 only 1 disk %busy is high, there may be something wrong with this disk
 and it should be removed.

 Maged

 On 2017-10-18 18:13, Russell Glaue wrote:

 In my previous post, in one of my points I was wondering if the request
 size would increase if I enabled jumbo packets. currently it is disabled.

 @jdillama: The qemu settings for both these two guest machines, with
 RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing
 the qemu settings of "min_io_size=,opt_io_size=<image object size>" will directly address the issue.

 @mmokhtar: Ok. So you suggest the request size is the result of the
 problem and not the cause of the problem. meaning I should go after a
 different issue.

 I have been trying to get write speeds up to what people on this mail
 list are discussing.
 It seems that for our configuration, as it matches others, we should be
 getting about 70MB/s write speed.
 But we are not getting that.
 Single writes to disk are lucky to get 5MB/s to 6MB/s, but are
 typically 1MB/s to 2MB/s.
 Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/), I
 have seen very rare momentary spikes up to 30MB/s.

Re: [ceph-users] How to increase the size of requests written to a ceph image

2017-10-19 Thread Russell Glaue
I ran the test on the Ceph pool, and ran atop on all 4 storage servers, as
suggested.

Out of the 4 servers:
3 of them performed with 17% to 30% disk %busy, and 11% CPU wait.
Momentarily spiking up to 50% on one server, and 80% on another
The 2nd newest server was almost averaging 90% disk %busy and 150% CPU
wait. And more than momentarily spiking to 101% disk busy and 250% CPU wait.
For this 2nd newest server, this was the statistics for about 8 of 9 disks,
with the 9th disk not far behind the others.

I cannot believe all 9 disks are bad
They are the same disks as the newest 1st server, Crucial_CT960M500SSD1,
and same exact server hardware too.
They were purchased at the same time in the same purchase order and arrived
at the same time.
So I cannot believe I just happened to put 9 bad disks in one server, and 9
good ones in the other.

I know I have Ceph configured exactly the same on all servers
And I am sure I have the hardware settings configured exactly the same on
the 1st and 2nd servers.
So if I were someone else, I would say it may be bad hardware on the 2nd
server.
But the 2nd server is running very well without any hint of a problem.

Any other ideas or suggestions?

-RG


On Wed, Oct 18, 2017 at 3:40 PM, Maged Mokhtar  wrote:

> just run the same 32 threaded rados test as you did before and this time
> run atop while the test is running looking for %busy of cpu/disks. It
> should give an idea if there is a bottleneck in them.
>
> On 2017-10-18 21:35, Russell Glaue wrote:
>
> I cannot run the write test reviewed at the ceph-how-to-test-if-your-
> ssd-is-suitable-as-a-journal-device blog. The tests write directly to the
> raw disk device.
> Reading an infile (created with urandom) on one SSD, writing the outfile
> to another osd, yields about 17MB/s.
> But isn't this write speed limited by the speed at which the dd infile
> can be read?
> And I assume the best test should be run with no other load.
>
> How does one run the rados bench "as stress"?
>
> -RG
>
>
> On Wed, Oct 18, 2017 at 1:33 PM, Maged Mokhtar 
> wrote:
>
>> measuring resource load as outlined earlier will show if the drives are
>> performing well or not. Also how many osds do you have  ?
>>
>> On 2017-10-18 19:26, Russell Glaue wrote:
>>
>> The SSD drives are Crucial M500
>> A Ceph user did some benchmarks and found it had good performance
>> https://forum.proxmox.com/threads/ceph-bad-performance-in-qe
>> mu-guests.21551/
>>
>> However, a user comment from 3 years ago on the blog post you linked to
>> says to avoid the Crucial M500
>>
>> Yet, this performance posting tells that the Crucial M500 is good.
>> https://inside.servers.com/ssd-performance-2017-c4307a92dea
>>
>> On Wed, Oct 18, 2017 at 11:53 AM, Maged Mokhtar 
>> wrote:
>>
>>> Check out the following link: some SSDs perform bad in Ceph due to sync
>>> writes to journal
>>>
>>> https://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-tes
>>> t-if-your-ssd-is-suitable-as-a-journal-device/
>>>
>>> Another thing that can help is to re-run the rados 32 threads as stress
>>> and view resource usage using atop (or collectl/sar) to check for %busy cpu
>>> and %busy disks to give you an idea of what is holding down your
>>> cluster..for example: if cpu/disk % are all low then check your
>>> network/switches.  If disk %busy is high (90%) for all disks then your
>>> disks are the bottleneck: which either means you have SSDs that are not
>>> suitable for Ceph or you have too few disks (which i doubt is the case). If
>>> only 1 disk %busy is high, there may be something wrong with this disk
>>> and it should be removed.
>>>
>>> Maged
>>>
>>> On 2017-10-18 18:13, Russell Glaue wrote:
>>>
>>> In my previous post, in one of my points I was wondering if the request
>>> size would increase if I enabled jumbo packets. currently it is disabled.
>>>
>>> @jdillama: The qemu settings for both these two guest machines, with
>>> RAID/LVM and Ceph/rbd images, are the same. I am not thinking that changing
>>> the qemu settings of "min_io_size=,opt_io_size=<image object size>" will directly address the issue.
>>>
>>> @mmokhtar: Ok. So you suggest the request size is the result of the
>>> problem and not the cause of the problem. meaning I should go after a
>>> different issue.
>>>
>>> I have been trying to get write speeds up to what people on this mail
>>> list are discussing.
>>> It seems that for our configuration, as it matches others, we should be
>>> getting about 70MB/s write speed.
>>> But we are not getting that.
>>> Single writes to disk are lucky to get 5MB/s to 6MB/s, but are typically
>>> 1MB/s to 2MB/s.
>>> Monitoring the entire Ceph cluster (using http://cephdash.crapworks.de/),
>>> I have seen very rare momentary spikes up to 30MB/s.
>>>
>>> My storage network is connected via a 10Gb switch
>>> I have 4 storage servers with a LSI Logic MegaRAID SAS 2208 controller
>>> Each storage server has 9 1TB SSD drives, each drive 

Re: [ceph-users] RBD-image permissions

2017-10-19 Thread Jason Dillaman
The most realistic backlog feature would be for adding support for
namespaces within RBD [1], but it's not being actively developed at
the moment. Of course, the usual caveat that "everyone with access to
the cluster network would be trusted" would still apply. It's because
of that assumption that adding support for RBD namespaces hasn't
really bubbled up the priority list in all honesty.

[1] https://trello.com/c/UafuAuEV
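In the meantime, the prefix-based caps from the blog post quoted below look
roughly like this (a sketch only; the pool, image name and block-name prefix are
illustrative, and the prefix has to be read from rbd info first):

$ rbd info rbd/vm1-disk | grep block_name_prefix
$ ceph auth get-or-create client.vm1 mon 'allow r' \
    osd 'allow rwx pool=rbd object_prefix rbd_data.101a2ae8944a, allow rwx pool=rbd object_prefix rbd_header.101a2ae8944a, allow r pool=rbd object_prefix rbd_id.vm1-disk'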

On Thu, Oct 19, 2017 at 4:16 PM, Jorge Pinilla López  wrote:
> I want to give permissions to my clients, but only for reading/writing a
> specific RBD image, not the whole pool.
>
> If I give permissions to the whole pool, a client could delete all the images
> in the pool or mount any other image, and I don't really want that.
>
> I've read about using prefix
> (https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD/)
>
> But I wonder if there is any future plan to support easier client
> RBD-image permissions?
>
> 
> Jorge Pinilla López
> jorp...@unizar.es
> Computer engineering student
> Intern at the systems area (SICUZ)
> Universidad de Zaragoza
> PGP-KeyID: A34331932EBC715A
> 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not able to start OSD

2017-10-19 Thread Josy

Hi,

>> have you checked the output of "ceph-disk list” on the nodes where 
the OSDs are not coming back on?


Yes, it shows all the disks correctly mounted.

>> And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages 
produced by the OSD itself when it starts.


These are the error messages seen in one of the OSD log files. Even though
the service is starting, the status still shows as down.



=

   -7> 2017-10-19 13:16:15.589465 7efefcda4d00  5 osd.28 pg_epoch: 4312 
pg[33.11( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 4270/4270 
les/c/f 4271/4271/0 4270/4270/4270) [1,28,12] r=1 lpr=0 crt=0'0 unknown 
NOTIFY] enter Reset
    -6> 2017-10-19 13:16:15.589476 7efefcda4d00  5 
write_log_and_missing with: dirty_to: 0'0, dirty_from: 
4294967295'18446744073709551615, writeout_from: 
4294967295'18446744073709551615, trimmed: , trimmed_dups: , 
clear_divergent_priors: 0
    -5> 2017-10-19 13:16:15.591629 7efefcda4d00  5 osd.28 pg_epoch: 
4312 pg[33.10(unlocked)] enter Initial
    -4> 2017-10-19 13:16:15.591759 7efefcda4d00  5 osd.28 pg_epoch: 
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 
crt=0'0 unknown NOTIFY] exit Initial 0.000130 0 0.00
    -3> 2017-10-19 13:16:15.591786 7efefcda4d00  5 osd.28 pg_epoch: 
4312 pg[33.10( empty local-lis/les=4270/4271 n=0 ec=4270/4270 lis/c 
4270/4270 les/c/f 4271/4271/0 4270/4270/4270) [8,17,28] r=2 lpr=0 
crt=0'0 unknown NOTIFY] enter Reset
    -2> 2017-10-19 13:16:15.591799 7efefcda4d00  5 
write_log_and_missing with: dirty_to: 0'0, dirty_from: 
4294967295'18446744073709551615, writeout_from: 
4294967295'18446744073709551615, trimmed: , trimmed_dups: , 
clear_divergent_priors: 0
    -1> 2017-10-19 13:16:15.594757 7efefcda4d00  5 osd.28 pg_epoch: 
4306 pg[32.ds0(unlocked)] enter Initial
 0> 2017-10-19 13:16:15.598295 7efefcda4d00 -1 
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h: 
In function 'ECUtil::stripe_info_t::stripe_info_t(uint64_t, uint64_t)' 
thread 7efefcda4d00 time 2017-10-19 13:16:15.594821
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/12.2.1/rpm/el7/BUILD/ceph-12.2.1/src/osd/ECUtil.h: 
38: FAILED assert(stripe_width % stripe_size == 0)
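As an aside, that assert concerns the erasure-code stripe parameters of one of
the pools this OSD carries, so a hedged first step is to look at the pools and
their erasure-code profiles (the profile name is illustrative):

$ ceph osd pool ls detail
$ ceph osd erasure-code-profile get testprofile1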



On 20-10-2017 01:05, Jean-Charles Lopez wrote:

Hi,

have you checked the output of "ceph-disk list” on the nodes where the OSDs are 
not coming back on?

This should give you a hint on what's going on.

Also use dmesg to search for any error message

And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages produced 
by the OSD itself when it starts.

Regards
JC


On Oct 19, 2017, at 12:11, Josy  wrote:

Hi,

I am not able to start some of the OSDs in the cluster.

This is a test cluster and had 8 OSDs. One node was taken out for maintenance. 
I set the noout flag and after the server came back up I unset the noout flag.

Suddenly couple of OSDs went down.

And now I can start the OSDs manually from each node, but the status is still 
"down"

$  ceph osd stat
8 osds: 2 up, 5 in


$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
  -1   7.97388 root default
  -3   1.86469 host a1-osd
   1   ssd 1.86469 osd.1   down0 1.0
  -5   0.87320 host a2-osd
   2   ssd 0.87320 osd.2   down0 1.0
  -7   0.87320 host a3-osd
   4   ssd 0.87320 osd.4   down  1.0 1.0
  -9   0.87320 host a4-osd
   8   ssd 0.87320 osd.8 up  1.0 1.0
-11   0.87320 host a5-osd
  12   ssd 0.87320 osd.12  down  1.0 1.0
-13   0.87320 host a6-osd
  17   ssd 0.87320 osd.17up  1.0 1.0
-15   0.87320 host a7-osd
  21   ssd 0.87320 osd.21  down  1.0 1.0
-17   0.87000 host a8-osd
  28   ssd 0.87000 osd.28  down0 1.0

Also can see this error in each OSD node.

# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor 
preset: disabled)
Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT; 
19min ago
   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i 
--setuser ceph --setgroup ceph (code=killed, signal=ABRT)
   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
  Main PID: 4163 (code=killed, signal=ABRT)

Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered 
failed state.

[ceph-users] RBD-image permissions

2017-10-19 Thread Jorge Pinilla López
I want to give permissions to my clients, but only for reading/writing
a specific RBD image, not the whole pool.

If I give permissions to the whole pool, a client could delete all the
images in the pool or mount any other image, and I don't really want that.

I've read about using prefix
(https://blog-fromsomedude.rhcloud.com/2016/04/26/Allowing-a-RBD-client-to-map-only-one-RBD/)


But I wonder if there is any future plan to support easier client
RBD-image permissions?


*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Intern at the systems area (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bluestore compression and existing CephFS filesystem

2017-10-19 Thread Michael Sudnick
Hello, I recently migrated to Bluestore on Luminous and have enabled
aggressive snappy compression on my CephFS data pool. I was wondering if
there was a way to see how much space was being saved. Also, are existing
files compressed at all, or do I have a bunch of resyncing ahead of me?
Sorry if this is in the documentation somewhere - I searched and haven't
been able to find anything.

Thank you,

-Michael
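For what it's worth, BlueStore exposes compression statistics through the OSD
perf counters, and as far as I know only data written after compression was
enabled gets compressed; existing objects stay as they are until they are
rewritten. A hedged way to check (the OSD id is illustrative):

$ ceph daemon osd.0 perf dump | grep bluestore_compressed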
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Not able to start OSD

2017-10-19 Thread Jean-Charles Lopez
Hi,

have you checked the output of "ceph-disk list” on the nodes where the OSDs are 
not coming back on? 

This should give you a hint on what's going on.

Also use dmesg to search for any error message

And finally inspect /var/log/ceph/ceph-osd.${id}.log to see messages produced 
by the OSD itself when it starts.

Regards
JC

> On Oct 19, 2017, at 12:11, Josy  wrote:
> 
> Hi,
> 
> I am not able to start some of the OSDs in the cluster.
> 
> This is a test cluster and had 8 OSDs. One node was taken out for 
> maintenance. I set the noout flag and after the server came back up I unset 
> the noout flag.
> 
> Suddenly couple of OSDs went down.
> 
> And now I can start the OSDs manually from each node, but the status is still 
> "down"
> 
> $  ceph osd stat
> 8 osds: 2 up, 5 in
> 
> 
> $ ceph osd tree
> ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
>  -1   7.97388 root default
>  -3   1.86469 host a1-osd
>   1   ssd 1.86469 osd.1   down0 1.0
>  -5   0.87320 host a2-osd
>   2   ssd 0.87320 osd.2   down0 1.0
>  -7   0.87320 host a3-osd
>   4   ssd 0.87320 osd.4   down  1.0 1.0
>  -9   0.87320 host a4-osd
>   8   ssd 0.87320 osd.8 up  1.0 1.0
> -11   0.87320 host a5-osd
>  12   ssd 0.87320 osd.12  down  1.0 1.0
> -13   0.87320 host a6-osd
>  17   ssd 0.87320 osd.17up  1.0 1.0
> -15   0.87320 host a7-osd
>  21   ssd 0.87320 osd.21  down  1.0 1.0
> -17   0.87000 host a8-osd
>  28   ssd 0.87000 osd.28  down0 1.0
> 
> Also can see this error in each OSD node.
> 
> # systemctl status ceph-osd@1
> ● ceph-osd@1.service - Ceph object storage daemon osd.1
>Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; vendor 
> preset: disabled)
>Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 PDT; 
> 19min ago
>   Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} --id %i 
> --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
>   Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh --cluster 
> ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)
>  Main PID: 4163 (code=killed, signal=ABRT)
> 
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered 
> failed state.
> Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff time 
> over, scheduling restart.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too 
> quickly for ceph-osd@1.service
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object 
> storage daemon osd.1.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service entered 
> failed state.
> Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Not able to start OSD

2017-10-19 Thread Josy

Hi,

I am not able to start some of the OSDs in the cluster.

This is a test cluster that had 8 OSDs. One node was taken out for 
maintenance. I set the noout flag and after the server came back up I 
unset the noout flag.


Suddenly a couple of OSDs went down.

And now I can start the OSDs manually from each node, but the status is 
still "down"


$  ceph osd stat
8 osds: 2 up, 5 in


$ ceph osd tree
ID  CLASS WEIGHT  TYPE NAME STATUS REWEIGHT PRI-AFF
 -1   7.97388 root default
 -3   1.86469 host a1-osd
  1   ssd 1.86469 osd.1   down    0 1.0
 -5   0.87320 host a2-osd
  2   ssd 0.87320 osd.2   down    0 1.0
 -7   0.87320 host a3-osd
  4   ssd 0.87320 osd.4   down  1.0 1.0
 -9   0.87320 host a4-osd
  8   ssd 0.87320 osd.8 up  1.0 1.0
-11   0.87320 host a5-osd
 12   ssd 0.87320 osd.12  down  1.0 1.0
-13   0.87320 host a6-osd
 17   ssd 0.87320 osd.17    up  1.0 1.0
-15   0.87320 host a7-osd
 21   ssd 0.87320 osd.21  down  1.0 1.0
-17   0.87000 host a8-osd
 28   ssd 0.87000 osd.28  down    0 1.0

Also can see this error in each OSD node.

# systemctl status ceph-osd@1
● ceph-osd@1.service - Ceph object storage daemon osd.1
   Loaded: loaded (/usr/lib/systemd/system/ceph-osd@.service; enabled; 
vendor preset: disabled)
   Active: failed (Result: start-limit) since Thu 2017-10-19 11:35:18 
PDT; 19min ago
  Process: 4163 ExecStart=/usr/bin/ceph-osd -f --cluster ${CLUSTER} 
--id %i --setuser ceph --setgroup ceph (code=killed, signal=ABRT)
  Process: 4158 ExecStartPre=/usr/lib/ceph/ceph-osd-prestart.sh 
--cluster ${CLUSTER} --id %i (code=exited, status=0/SUCCESS)

 Main PID: 4163 (code=killed, signal=ABRT)

Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service 
entered failed state.

Oct 19 11:34:58 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service holdoff 
time over, scheduling restart.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: start request repeated too 
quickly for ceph-osd@1.service
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Failed to start Ceph object 
storage daemon osd.1.
Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: Unit ceph-osd@1.service 
entered failed state.

Oct 19 11:35:18 ceph-las1-a1-osd systemd[1]: ceph-osd@1.service failed.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-19 Thread Sean Sullivan
I have tried using ceph-disk directly and I'm running into all sorts of
trouble, but I'm trying my best. Currently I am using the following cobbled
script, which seems to be working:
https://github.com/seapasulli/CephScripts/blob/master/provision_storage.sh
I'm at 11 right now. I hope this works.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph delete files and status

2017-10-19 Thread Jamie Fargen
Nigel-

What method did you use to upload and delete the file? How did you check
the space utilization? I believe the reason that you are still seeing the
space being utilized when you issue your ceph -df is that even after the
file is deleted, the file system doesn't actually delete the file, it just
removes the file inode entry pointing to the file. The file will still be
on the disk until the blocks are re-allocated to another file.

-Jamie
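If the object went in through radosgw, note that deleted space is reclaimed by
its garbage collector some time after the delete rather than immediately; a
hedged way to check and trigger it (assuming RGW is in use at all):

$ radosgw-admin gc list --include-all
$ radosgw-admin gc process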

On Thu, Oct 19, 2017 at 11:54 AM, nigel davies  wrote:

> Hey all
>
> I am looking at my small test Ceph cluster. I have uploaded a 200MB iso,
> checked the space in "ceph status" and seen it increase.
>
> But when I delete the file, the space used does not go down.
>
> Have I missed a configuration somewhere or something?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Jamie Fargen
Consultant
jfar...@redhat.com
813-817-4430
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code settings

2017-10-19 Thread Josy

Please ignore. I found the mistake.


On 19-10-2017 21:08, Josy wrote:

Hi,

I created a testprofile, but am not able to create a pool using it


==
$ ceph osd erasure-code-profile get testprofile1
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=10
m=4
plugin=jerasure
technique=reed_sol_van
w=8

$ ceph osd pool create ecpool 100 100 testprofile1
Error ENOENT: specified rule testprofile1 doesn't exist


On 19-10-2017 19:54, Josy wrote:

Hi,

I would like to set up an erasure code profile with k=10 and m=4 
settings.


Is there any minimum requirement of OSD nodes and OSDs to achieve 
this setting ?


Can I create a pool with 8 OSD servers, with one disk each in it ?







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph delete files and status

2017-10-19 Thread nigel davies
Hey all

I am looking at my small test Ceph cluster. I have uploaded a 200MB iso and
checked the space in "ceph status" and seen it increase.

But when I delete the file, the space used does not go down.

Have I missed a configuration somewhere or something?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how does recovery work

2017-10-19 Thread Richard Hesketh
On 19/10/17 11:00, Dennis Benndorf wrote:
> Hello @all,
> 
> given the following config:
> 
>   * ceph.conf:
> 
> ...
> mon osd down out subtree limit = host
> osd_pool_default_size = 3
> osd_pool_default_min_size = 2
> ...
> 
>   * each OSD has its journal on a 30GB partition on a PCIe-Flash-Card
>   * 3 hosts
> 
> What would happen if one host goes down? I mean, is there a limit on the downtime 
> of this host/OSDs? How does Ceph detect the differences between OSDs within 
> a placement group? Is there a binary log (which could run out of space) in the 
> journal/monitor, or will it just copy all objects within the PGs which had 
> unavailable OSDs?
> 
> Thanks in advance,
> Dennis

When the OSDs that were offline come back up, the PGs on those OSDs will 
resynchronise with the other replicas. Where there are new objects (or newer 
objects in the case of modifications), the new data will be copied from the 
other OSDs that remained active. There is no binary logging replication 
mechanism as you might be used to from mysql or similar.

Rich
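For reference, the per-PG log that drives this resynchronisation is bounded in
length; if an OSD is down for longer than the log covers, the PG falls back to a
full backfill of its objects instead. A hedged way to check the configured limits
on a running OSD (the OSD id is illustrative, defaults differ between releases):

$ ceph daemon osd.0 config show | grep pg_log_entries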



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code settings

2017-10-19 Thread Josy

Hi,

I created a testprofile, but am not able to create a pool using it


==
$ ceph osd erasure-code-profile get testprofile1
crush-device-class=
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=10
m=4
plugin=jerasure
technique=reed_sol_van
w=8

$ ceph osd pool create ecpool 100 100 testprofile1
Error ENOENT: specified rule testprofile1 doesn't exist
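The error suggests the last argument was parsed as a CRUSH rule for a replicated
pool; a hedged version of the command that states the pool type before the
profile would be:

$ ceph osd pool create ecpool 100 100 erasure testprofile1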


On 19-10-2017 19:54, Josy wrote:

Hi,

I would like to set up an erasure code profile with k=10 and m=4 
settings.


Is there any minimum requirement of OSD nodes and OSDs to achieve this 
setting ?


Can I create a pool with 8 OSD servers, with one disk each in it ?





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to recover from block.db failure?

2017-10-19 Thread Wido den Hollander

> Op 19 oktober 2017 om 16:47 schreef Caspar Smit :
> 
> 
> Hi David,
> 
> Thank you for your answer, but wouldn't scrub (deep-scrub) handle
> that? It will flag the unflushed journal pg's as inconsistent and you
> would have to repair the pg's. Or am i overlooking something here? The
> official blog doesn't state anything about this method being a bad
> idea.
> 

No, it doesn't. You would have to wipe the whole OSD when you lose a journal
in an unclean shutdown.

The same goes for BlueStore and its WAL+DB. It's not a cache; it contains vital
information about the OSD.

If you lose either the WAL or the DB you can't be sure the OSD is still
consistent, so you lose it.

Look at it the other way around: why would either the WAL or DB require
persistent storage if it could simply be discarded?

Wido
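For what it's worth, ceph-bluestore-tool can at least show what an existing OSD
has recorded about its block and block.db devices; this does not recreate a lost
DB, it is only a way to inspect the labels (paths are illustrative):

$ ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block
$ ceph-bluestore-tool show-label --dev /var/lib/ceph/osd/ceph-0/block.db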

> Caspar
> 
> 2017-10-19 16:14 GMT+02:00 David Turner :
> > I'm speaking to the method in general and don't know the specifics of
> > bluestore.  Recovering from a failed journal in this way is only a good idea
> > if you were able to flush the journal before making a new one.  If the
> > journal failed during operation and you couldn't cleanly flush the journal,
> > then the data on the OSD could not be guaranteed and would need to be wiped
> > and started over.  The same would go for the block.wal and block.db
> > partitions if you can find the corresponding commands for them.
> >
> > On Thu, Oct 19, 2017 at 7:44 AM Caspar Smit  wrote:
> >>
> >> Hi all,
> >>
> >> I'm testing some scenario's with the new Ceph luminous/bluestore
> >> combination.
> >>
> >> I've created a demo setup with 3 nodes (each has 10 HDD's and 2 SSD's)
> >> So i created 10 BlueStore OSD's with a seperate 20GB block.db on the
> >> SSD's (5 HDD's per block.db SSD).
> >>
> >> I'm testing a failure of one of those SSD's (block.db failure).
> >>
> >> With filestore i have used the following blog/script to recover from a
> >> journal SSD failure:
> >>
> >> http://ceph.com/no-category/ceph-recover-osds-after-ssd-journal-failure/
> >>
> >> I tried to adapt the script to bluestore but i couldn't find any
> >> BlueStore equivalent to the following command (where the journal is
> >> re-created):
> >>
> >> sudo ceph-osd --mkjournal -i $osd_id
> >>
> >> Tracing the 'ceph-disk prepare' command didn't result in a seperate
> >> command that the BlueStore block.db is initialized. It looks like the
> >> --mkfs switch does all the work (including the data part). Am i
> >> correct?
> >>
> >> Is there any way a seperate block.db can be initialized after the OSD
> >> was created? In other words: is it possible to recover from a block.db
> >> failure or do i need to start over?
> >>
> >> block.db is probably no equivalent to a FileStore's journal, but what
> >> about block.wal? If i use a seperate block.wal device only will the
> >> --mkjournal command re-initialize that or is the --mkjournal command
> >> only used for FileStore ?
> >>
> >> Kind regards and thanks in advance for any reply,
> >> Caspar Smit
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure code settings

2017-10-19 Thread Denes Dolhay

Hi,

If you want to split your data into 10 pieces (stripes) and hold 4 parity
pieces in addition (so your cluster can handle the loss of any 4 OSDs),
then you need a minimum of 14 OSDs to hold your data.
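With only 8 hosts and the default crush-failure-domain=host, those 14 chunks
cannot all be placed on distinct hosts; one hedged workaround, at the cost of
host-level failure tolerance, is a profile whose failure domain is the OSD (the
profile name is illustrative):

$ ceph osd erasure-code-profile set ec-k10-m4 k=10 m=4 crush-failure-domain=osd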



Denes.


On 10/19/2017 04:24 PM, Josy wrote:

Hi,

I would like to set up an erasure code profile with k=10 amd m=4 
settings.


Is there any minimum requirement of OSD nodes and OSDs to achieve this 
setting ?


Can I create a pool with 8 OSD servers, with one disk each in it ?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to recover from block.db failure?

2017-10-19 Thread Caspar Smit
Hi David,

Thank you for your answer, but wouldn't scrub (deep-scrub) handle
that? It will flag the unflushed journal pg's as inconsistent and you
would have to repair the pg's. Or am I overlooking something here? The
official blog doesn't state anything about this method being a bad
idea.

Caspar

2017-10-19 16:14 GMT+02:00 David Turner :
> I'm speaking to the method in general and don't know the specifics of
> bluestore.  Recovering from a failed journal in this way is only a good idea
> if you were able to flush the journal before making a new one.  If the
> journal failed during operation and you couldn't cleanly flush the journal,
> then the data on the OSD could not be guaranteed and would need to be wiped
> and started over.  The same would go for the block.wal and block.db
> partitions if you can find the corresponding commands for them.
>
> On Thu, Oct 19, 2017 at 7:44 AM Caspar Smit  wrote:
>>
>> Hi all,
>>
>> I'm testing some scenario's with the new Ceph luminous/bluestore
>> combination.
>>
>> I've created a demo setup with 3 nodes (each has 10 HDD's and 2 SSD's)
>> So i created 10 BlueStore OSD's with a seperate 20GB block.db on the
>> SSD's (5 HDD's per block.db SSD).
>>
>> I'm testing a failure of one of those SSD's (block.db failure).
>>
>> With filestore i have used the following blog/script to recover from a
>> journal SSD failure:
>>
>> http://ceph.com/no-category/ceph-recover-osds-after-ssd-journal-failure/
>>
>> I tried to adapt the script to bluestore but i couldn't find any
>> BlueStore equivalent to the following command (where the journal is
>> re-created):
>>
>> sudo ceph-osd --mkjournal -i $osd_id
>>
>> Tracing the 'ceph-disk prepare' command didn't result in a seperate
>> command that the BlueStore block.db is initialized. It looks like the
>> --mkfs switch does all the work (including the data part). Am i
>> correct?
>>
>> Is there any way a seperate block.db can be initialized after the OSD
>> was created? In other words: is it possible to recover from a block.db
>> failure or do i need to start over?
>>
>> block.db is probably no equivalent to a FileStore's journal, but what
>> about block.wal? If i use a seperate block.wal device only will the
>> --mkjournal command re-initialize that or is the --mkjournal command
>> only used for FileStore ?
>>
>> Kind regards and thanks in advance for any reply,
>> Caspar Smit
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure code settings

2017-10-19 Thread Josy

Hi,

I would like to set up an erasure code profile with k=10 amd m=4 settings.

Is there any minimum requirement of OSD nodes and OSDs to achieve this 
setting ?


Can I create a pool with 8 OSD servers, with one disk each in it ?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Is it possible to recover from block.db failure?

2017-10-19 Thread David Turner
I'm speaking to the method in general and don't know the specifics of
bluestore.  Recovering from a failed journal in this way is only a good
idea if you were able to flush the journal before making a new one.  If the
journal failed during operation and you couldn't cleanly flush the journal,
then the data on the OSD could not be guaranteed and would need to be wiped
and started over.  The same would go for the block.wal and block.db
partitions if you can find the corresponding commands for them.

On Thu, Oct 19, 2017 at 7:44 AM Caspar Smit  wrote:

> Hi all,
>
> I'm testing some scenario's with the new Ceph luminous/bluestore
> combination.
>
> I've created a demo setup with 3 nodes (each has 10 HDD's and 2 SSD's)
> So i created 10 BlueStore OSD's with a seperate 20GB block.db on the
> SSD's (5 HDD's per block.db SSD).
>
> I'm testing a failure of one of those SSD's (block.db failure).
>
> With filestore i have used the following blog/script to recover from a
> journal SSD failure:
>
> http://ceph.com/no-category/ceph-recover-osds-after-ssd-journal-failure/
>
> I tried to adapt the script to bluestore but i couldn't find any
> BlueStore equivalent to the following command (where the journal is
> re-created):
>
> sudo ceph-osd --mkjournal -i $osd_id
>
> Tracing the 'ceph-disk prepare' command didn't result in a seperate
> command that the BlueStore block.db is initialized. It looks like the
> --mkfs switch does all the work (including the data part). Am i
> correct?
>
> Is there any way a seperate block.db can be initialized after the OSD
> was created? In other words: is it possible to recover from a block.db
> failure or do i need to start over?
>
> block.db is probably no equivalent to a FileStore's journal, but what
> about block.wal? If i use a seperate block.wal device only will the
> --mkjournal command re-initialize that or is the --mkjournal command
> only used for FileStore ?
>
> Kind regards and thanks in advance for any reply,
> Caspar Smit
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Is it possible to recover from block.db failure?

2017-10-19 Thread Caspar Smit
Hi all,

I'm testing some scenarios with the new Ceph Luminous/BlueStore combination.

I've created a demo setup with 3 nodes (each has 10 HDD's and 2 SSD's)
So I created 10 BlueStore OSDs with a separate 20GB block.db on the
SSD's (5 HDD's per block.db SSD).

I'm testing a failure of one of those SSD's (block.db failure).

With filestore i have used the following blog/script to recover from a
journal SSD failure:

http://ceph.com/no-category/ceph-recover-osds-after-ssd-journal-failure/

I tried to adapt the script to bluestore but i couldn't find any
BlueStore equivalent to the following command (where the journal is
re-created):

sudo ceph-osd --mkjournal -i $osd_id

Tracing the 'ceph-disk prepare' command didn't reveal a separate
command with which the BlueStore block.db is initialized. It looks like the
--mkfs switch does all the work (including the data part). Am I
correct?

Is there any way a separate block.db can be initialized after the OSD
was created? In other words: is it possible to recover from a block.db
failure or do i need to start over?

block.db is probably no equivalent to a FileStore's journal, but what
about block.wal? If i use a seperate block.wal device only will the
--mkjournal command re-initialize that or is the --mkjournal command
only used for FileStore ?

Kind regards and thanks in advance for any reply,
Caspar Smit
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PG's stuck unclean active+remapped

2017-10-19 Thread Roel de Rooy
Hi all,

I'm hoping some of you have some experience in dealing with this, as 
unfortunately this is the first time we encountered this issue.
We currently have placement groups that are stuck unclean with 
'active+remapped' as last state.

The rundown of what happened:

Yesterday morning, one of our network engineers was working on some LACP bonds 
on the same switch stack which also houses this cluster's internal and public 
ceph networks.
Unfortunately the engineer also accidentally touched the LACP bonds of all 3 
monitor servers and issues started to appear.

In rapid succession, we started losing OSDs, one by one, and rebalance/recovery 
started kicking in.
As connectivity between the monitor servers appeared OK (ping connectivity was 
somehow still there, there was still a quorum visible and ceph commands worked 
on all three), we didn't suspect the monitor servers at first.

When investigating the OSDs that were marked down, the logs of those OSDs 
were full of the error messages below:

- monclient: _check_auth_rotating possible clock skew, 
rotating keys expired way too early
- auth: could not find secret_id
- cephx: verify_authorizer could not get service secret for 
service osd secret_id
- x.x.x.x:6801/1258067 >> x.x.x.x:0/1115346558 
pipe(0x560136fd8800 sd=706 :6801 s=0 pgs=0 cs=0 l=1 c=0x560122778e80).accept: 
got bad authorizer

We suspected time sync, but everything turned out ok.
As more and more OSDs started failing, we changed the crushmap to add 2 
additional OSD nodes for the affected pools, which were not housing any data at 
the moment, but the same message kept appearing on those OSDs as well.
In the meantime enough OSDs were down that everything ground to a halt.

After finding out about the LACP bonds, the changes were reverted and all osd's 
came up again.
Unfortunately, after some time the rebalance/recovery stopped and the status shows 
the following information:

health HEALTH_WARN
1088 pgs stuck unclean
recovery 92/1073206 objects degraded (0.009%)
recovery 53092/1073206 objects misplaced (4.947%)
nodeep-scrub,sortbitwise,require_jewel_osds flag(s) set
 monmap e1: 3 mons at 
{srv-ams3-cmon-01=192.168.152.3:6789/0,srv-ams3-cmon-02=192.168.152.4:6789/0,srv-ams3-cmon-03=192.168.152.5:6789/0}
election epoch 5152, quorum 0,1,2 
srv-ams3-cmon-01,srv-ams3-cmon-02,srv-ams3-cmon-03
 osdmap e30517: 39 osds: 39 up, 39 in; 1088 remapped pgs
flags nodeep-scrub,sortbitwise,require_jewel_osds
  pgmap v20285289: 2340 pgs, 22 pools, 2056 GB data, 524 kobjects
4057 GB used, 12409 GB / 16466 GB avail
92/1073206 objects degraded (0.009%)
53092/1073206 objects misplaced (4.947%)
1252 active+clean
1088 active+remapped

There does not seem to be any issue that prevents continuous service of the 
connected clients, but when querying such a placement group, it shows that:

- 2 osd's are acting (the pools have a replication size of 2 at the 
moment)

- 1 osd is primary

- Both osd's are visible as value for 'actingbackfill'

- up_primary has the value '-1'

- None is up

We already tried reweighting the affected primary osd, but the affected 
placement groups are not touched by the rebalance.
Restarting the OSDs also did not have any effect.
We even tried 'ceph osd crush tunables optimal', but as we already thought, it 
did not have any effect.

Sorry for the long read, but does someone have an idea of what we could try?
I did read about setting 'osd_find_best_info_ignore_history_les' to true, but 
I'm not sure what the implications would be when using this setting.
Additionally, we did set deep-scrub off during the recovery; could this be 
something deep-scrub would fix?

Thanks in advance!

Roel
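As a starting point, a minimal set of commands to gather more detail on the
remapped PGs and on how CRUSH currently weights the OSDs (the PG id is
illustrative):

$ ceph pg dump_stuck unclean
$ ceph pg 1.a query
$ ceph osd df tree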


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crashed while reparing inconsistent PG luminous

2017-10-19 Thread Ana Aviles
Hi Greg,

Thanks for your findings! We've updated the issue with the log files of
osd.93 and osd.69, which correspond to the period of the log we posted.
Also, we've recreated a new set of logs for that pair of OSDs. As we
explain in the issue, right now the OSDs fail on that other assert you
mentioned, and, sadly enough we can't reproduce the initial crash.

Greetings,

Ana

On 19/10/17 01:14, Gregory Farnum wrote:
> I updated the ticket with some findings. It appears that osd.93 has
> that snapshot object in its missing set that gets sent to osd.78, and
> then osd.69 claims to have the object. Can you upload debug logs of
> those OSDs that go along with this log? (Or just generate a new set of
> them together.)
> -Greg
>
> On Wed, Oct 18, 2017 at 9:16 AM Mart van Santen  > wrote:
>
>
> Dear all,
>
We are still struggling with this issue. By now, one OSD crashes
all the time (a different one than yesterday), but now on a different
> assert.
>
>
> Namely with this one:
>
> #0  0x75464428 in __GI_raise (sig=sig@entry=6) at
> ../sysdeps/unix/sysv/linux/raise.c:54
> #1  0x7546602a in __GI_abort () at abort.c:89
> #2  0x55ff157e in ceph::__ceph_assert_fail
> (assertion=assertion@entry=0x56564e97 "head_obc",
> file=file@entry=0x56566bd8
> "/build/ceph-12.2.1/src/osd/PrimaryLogPG.cc", line=line@entry=10369,
>     func=func@entry=0x5656d940
>  PGBackend::RecoveryHandle*)::__PRETTY_FUNCTION__> "int
> PrimaryLogPG::recover_missing(const hobject_t&, eversion_t, int,
> PGBackend::RecoveryHandle*)") at
> /build/ceph-12.2.1/src/common/assert.cc:66
> #3  0x55b833e9 in PrimaryLogPG::recover_missing
> (this=this@entry=0x62aca000, soid=..., v=...,
> priority=, h=h@entry=0x67a3a080) at
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc:10369
> #4  0x55bc3fd0 in PrimaryLogPG::recover_primary
> (this=this@entry=0x62aca000, max=max@entry=1, handle=...) at
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc:11588
> #5  0x55bcc81e in PrimaryLogPG::start_recovery_ops
> (this=0x62aca000, max=1, handle=...,
> ops_started=0x7fffd8b1ac68) at
> /build/ceph-12.2.1/src/osd/PrimaryLogPG.cc:11339
> #6  0x55a20a59 in OSD::do_recovery (this=0x5f95a000,
> pg=0x62aca000, queued=384619, reserved_pushes=1, handle=...)
> at /build/ceph-12.2.1/src/osd/OSD.cc:9381
> #7  0x55c94be9 in PGQueueable::RunVis::operator()
> (this=this@entry=0x7fffd8b1af00, op=...) at
> /build/ceph-12.2.1/src/osd/PGQueueable.cc:34
> #8  0x55a226c4 in
> 
> boost::detail::variant::invoke_visitor::internal_visit
> (operand=..., this=)
>     at
> 
> /build/ceph-12.2.1/obj-x86_64-linux-gnu/boost/include/boost/variant/variant.hpp:1046
> #9 
> 
> boost::detail::variant::visitation_impl_invoke_impl void*, PGRecovery> (storage=0x7fffd8b1af50, visitor= pointer>)
>     at
> 
> /build/ceph-12.2.1/obj-x86_64-linux-gnu/boost/include/boost/variant/detail/visitation_impl.hpp:114
> #10
> 
> boost::detail::variant::visitation_impl_invoke void*, PGRecovery, boost::variant PGSnapTrim, PGScrub, PGRecovery>::has_fallback_type_> (
>     t=0x0, storage=0x7fffd8b1af50, visitor=,
> internal_which=) at
> 
> /build/ceph-12.2.1/obj-x86_64-linux-gnu/boost/include/boost/variant/detail/visitation_impl.hpp:157
> #11 boost::detail::variant::visitation_impl,
> 
> boost::detail::variant::visitation_impl_step,
> boost::intrusive_ptr,
> boost::mpl::l_item, PGSnapTrim,
> boost::mpl::l_item, PGScrub,
> boost::mpl::l_item, PGRecovery, boost::mpl::l_end>
> > > > >, boost::mpl::l_iter >,
> boost::detail::variant::invoke_visitor,
> void*, boost::variant PGScrub, PGRecovery>::has_fallback_type_> (no_backup_flag=...,
> storage=0x7fffd8b1af50, visitor=,
> logical_which=, internal_which=)
>     at
> 
> /build/ceph-12.2.1/obj-x86_64-linux-gnu/boost/include/boost/variant/detail/visitation_impl.hpp:238
> #12 boost::variant PGScrub,
> 
> PGRecovery>::internal_apply_visitor_impl void*> (storage=0x7fffd8b1af50, visitor=,
>     logical_which=, internal_which=)
> at
> 
> /build/ceph-12.2.1/obj-x86_64-linux-gnu/boost/include/boost/variant/variant.hpp:2389
> #13 boost::variant PGScrub,
> 
> 

[ceph-users] how does recovery work

2017-10-19 Thread Dennis Benndorf

Hello @all,

given the following config:

 * ceph.conf:

   ...
   mon osd down out subtree limit = host
   osd_pool_default_size = 3
   osd_pool_default_min_size = 2
   ...

 * each OSD has its journal on a 30GB partition on a PCIe-Flash-Card
 * 3 hosts

What would happen if one host goes down? I mean, is there a limit on the
downtime of this host and its OSDs? How does Ceph detect the differences
between OSDs within a placement group? Is there a binary log (which could
run out of space) in the journal/monitor, or will it just copy all objects
within the PGs that had unavailable OSDs?
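
For reference, the mon settings involved in when down OSDs get marked out can
be inspected on a running monitor, e.g. (a sketch; the daemon name depends on
your deployment):

    ceph daemon mon.$(hostname -s) config get mon_osd_down_out_interval
    ceph daemon mon.$(hostname -s) config get mon_osd_down_out_subtree_limit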


Thanks in advance,
Dennis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests

2017-10-19 Thread Ольга Ухина
Mostly I'm using ceph as storage for my VMs in Proxmox. I have radosgw, but
only for tests; it doesn't seem to be the cause of the problem.
I've tuned the parameters below. They should improve request speed during the
recovery stage, but I receive the warnings anyway:
osd_client_op_priority = 63
osd_recovery_op_priority = 1
osd_max_backfills = 1
osd_recovery_max_active = 1
osd_recovery_threads = 1
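
For reference, these can also be pushed to running OSDs without a restart,
e.g. (a sketch):

    ceph tell osd.* injectargs '--osd_max_backfills 1 --osd_recovery_max_active 1 --osd_recovery_op_priority 1'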



Best regards,
Ухина Ольга

Mobile: 8(905)-566-46-62

2017-10-19 11:06 GMT+03:00 Sean Purdy:

> Are you using radosgw?  I found this page useful when I had a similar
> issue:
>
> http://www.osris.org/performance/rgw.html
>
>
> Sean
>
> On Wed, 18 Oct 2017, Ольга Ухина said:
> > Hi!
> >
> > I have a problem with ceph luminous 12.2.1. It was upgraded from kraken,
> > but I'm not sure if it was a problem in kraken.
> > I have slow requests on different OSDs on random time (for example at
> > night, but I don't see any problems at the time of problem with disks,
> CPU,
> > there is possibility of network problem at night). During daytime I have
> > not this problem.
> > Almost all requests are nearly 30 seconds, so I receive warnings like
> this:
> >
> > 2017-10-18 01:20:26.147758 mon.st3 mon.0 10.192.1.78:6789/0 22686 :
> cluster
> > [WRN] Health check failed: 1 slow requests are blocked > 32 sec
> > (REQUEST_SLOW)
> > 2017-10-18 01:20:28.025315 mon.st3 mon.0 10.192.1.78:6789/0 22687 :
> cluster
> > [WRN] overall HEALTH_WARN 1 slow requests are blocked > 32 sec
> > 2017-10-18 01:20:32.166758 mon.st3 mon.0 10.192.1.78:6789/0 22688 :
> cluster
> > [WRN] Health check update: 38 slow requests are blocked > 32 sec
> > (REQUEST_SLOW)
> > 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 :
> cluster
> > [WRN] Health check update: 49 slow requests are blocked > 32 sec
> > (REQUEST_SLOW)
> > 2017-10-18 01:20:38.727421 osd.23 osd.23 10.192.1.158:6840/3659 1758 :
> > cluster [WRN] 27 slow requests, 5 included below; oldest blocked for >
> > 30.839843 secs
> > 2017-10-18 01:20:38.727425 osd.23 osd.23 10.192.1.158:6840/3659 1759 :
> > cluster [WRN] slow request 30.814060 seconds old, received at 2017-10-18
> > 01:20:07.913300: osd_op(client.12464272.1:56610561 31.410dd55
> > 5 31:aaabb082:::rbd_data.7b3e22ae8944a.00012e2c:head
> > [set-alloc-hint object_size 4194304 write_size 4194304,write
> 2977792~4096]
> > snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 39
> > 2017-10-18 01:20:38.727431 osd.23 osd.23 10.192.1.158:6840/3659 1760 :
> > cluster [WRN] slow request 30.086589 seconds old, received at 2017-10-18
> > 01:20:08.640771: osd_repop(client.12464806.1:17326170 34.242
> > e10926/10860 34:426def95:::rbd_data.acdc9238e1f29.1231:head
> v
> > 10926'4976910) currently write_thread_in_journal_buffer
> > 2017-10-18 01:20:38.727433 osd.23 osd.23 10.192.1.158:6840/3659 1761 :
> > cluster [WRN] slow request 30.812569 seconds old, received at 2017-10-18
> > 01:20:07.914791: osd_repop(client.12464272.1:56610570 31.1eb
> > e10926/10848 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head
> v
> > 10926'135331) currently write_thread_in_journal_buffer
> > 2017-10-18 01:20:38.727436 osd.23 osd.23 10.192.1.158:6840/3659 1762 :
> > cluster [WRN] slow request 30.807328 seconds old, received at 2017-10-18
> > 01:20:07.920032: osd_op(client.12464272.1:56610586 31.3f2f2e2
> > 6 31:6474f4fc:::rbd_data.7b3e22ae8944a.00013673:head
> > [set-alloc-hint object_size 4194304 write_size 4194304,write 12288~4096]
> > snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 30
> > 2017-10-18 01:20:38.727438 osd.23 osd.23 10.192.1.158:6840/3659 1763 :
> > cluster [WRN] slow request 30.807253 seconds old, received at 2017-10-18
> > 01:20:07.920107: osd_op(client.12464272.1:56610588 31.2d23291
> > 8 31:1894c4b4:::rbd_data.7b3e22ae8944a.00013a5b:head
> > [set-alloc-hint object_size 4194304 write_size 4194304,write 700416~4096]
> > snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 28
> > 2017-10-18 01:20:38.006142 osd.39 osd.39 10.192.1.159:6808/3323 1501 :
> > cluster [WRN] 2 slow requests, 2 included below; oldest blocked for >
> > 30.092091 secs
> > 2017-10-18 01:20:38.006153 osd.39 osd.39 10.192.1.159:6808/3323 1502 :
> > cluster [WRN] slow request 30.092091 seconds old, received at 2017-10-18
> > 01:20:07.913962: osd_op(client.12464272.1:56610570 31.e683e9e
> > b 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head
> > [set-alloc-hint object_size 4194304 write_size 4194304,write 143360~4096]
> > snapc 0=[] ondisk+write e10926) currently op_applied
> > 2017-10-18 01:20:38.006159 osd.39 osd.39 10.192.1.159:6808/3323 1503 :
> > cluster [WRN] slow request 30.086123 seconds old, received at 2017-10-18
> > 01:20:07.919930: osd_op(client.12464272.1:56610587 31.e683e9eb
> > 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head
> [set-alloc-hint
> > object_size 4194304 write_size 4194304,write 3256320~4096] snapc 0=[]
> > ondisk+write e10926) currently 

Re: [ceph-users] ceph inconsistent pg missing ec object

2017-10-19 Thread Stijn De Weirdt
hi greg,

i attached the gzip output of the query and some more info below. if you
need more, let me know.

stijn

> [root@mds01 ~]# ceph -s
> cluster 92beef0a-1239-4000-bacf-4453ab630e47
>  health HEALTH_ERR
> 1 pgs inconsistent
> 40 requests are blocked > 512 sec
> 1 scrub errors
> mds0: Behind on trimming (2793/30)
>  monmap e1: 3 mons at 
> {mds01=1.2.3.4:6789/0,mds02=1.2.3.5:6789/0,mds03=1.2.3.6:6789/0}
> election epoch 326, quorum 0,1,2 mds01,mds02,mds03
>   fsmap e238677: 1/1/1 up {0=mds02=up:active}, 2 up:standby
>  osdmap e79554: 156 osds: 156 up, 156 in
> flags sortbitwise,require_jewel_osds
>   pgmap v51003893: 4096 pgs, 3 pools, 387 TB data, 243 Mobjects
> 545 TB used, 329 TB / 874 TB avail
> 4091 active+clean
>4 active+clean+scrubbing+deep
>1 active+clean+inconsistent
>   client io 284 kB/s rd, 146 MB/s wr, 145 op/s rd, 177 op/s wr
>   cache io 115 MB/s flush, 153 MB/s evict, 14 op/s promote, 3 PG(s) flushing

> [root@mds01 ~]# ceph health detail
> HEALTH_ERR 1 pgs inconsistent; 52 requests are blocked > 512 sec; 5 osds have 
> slow requests; 1 scrub errors; mds0: Behind on trimming (2782/30)
> pg 5.5e3 is active+clean+inconsistent, acting 
> [35,50,91,18,139,59,124,40,104,12,71]
> 34 ops are blocked > 524.288 sec on osd.8
> 6 ops are blocked > 524.288 sec on osd.67
> 6 ops are blocked > 524.288 sec on osd.27
> 1 ops are blocked > 524.288 sec on osd.107
> 5 ops are blocked > 524.288 sec on osd.116
> 5 osds have slow requests
> 1 scrub errors
> mds0: Behind on trimming (2782/30)(max_segments: 30, num_segments: 2782)
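
For reference, the blocked requests on a given OSD can usually be inspected
through its admin socket, e.g. (a sketch for osd.8):

    ceph daemon osd.8 dump_ops_in_flight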

> # zgrep -C 1 ERR ceph-osd.35.log.*.gz 
> ceph-osd.35.log.5.gz:2017-10-14 11:25:52.260668 7f34d6748700  0 -- 
> 10.141.16.13:6801/1001792 >> 1.2.3.11:6803/1951 pipe(0x56412da80800 sd=273 
> :6801 s=2 pgs=3176 cs=31 l=0 c=0x564156e83b00).fault with nothing to send, 
> going to standby
> ceph-osd.35.log.5.gz:2017-10-14 11:26:06.071011 7f3511be4700 -1 
> log_channel(cluster) log [ERR] : 5.5e3s0 shard 59(5) missing 
> 5:c7ae919b:::10014d3184b.:head
> ceph-osd.35.log.5.gz:2017-10-14 11:28:36.465684 7f34ffdf5700  0 -- 
> 1.2.3.13:6801/1001792 >> 1.2.3.21:6829/1834 pipe(0x56414e2a2000 sd=37 :6801 
> s=0 pgs=0 cs=0 l=0 c=0x5641470d2a00).accept connect_seq 33 vs existing 33 
> state standby
> ceph-osd.35.log.5.gz:--
> ceph-osd.35.log.5.gz:2017-10-14 11:43:35.570711 7f3508efd700  0 -- 
> 1.2.3.13:6801/1001792 >> 1.2.3.20:6825/1806 pipe(0x56413be34000 sd=138 :6801 
> s=2 pgs=2763 cs=45 l=0 c=0x564132999480).fault with nothing to send, going to 
> standby
> ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235548 7f3511be4700 -1 
> log_channel(cluster) log [ERR] : 5.5e3s0 deep-scrub 1 missing, 0 inconsistent 
> objects
> ceph-osd.35.log.5.gz:2017-10-14 11:44:02.235554 7f3511be4700 -1 
> log_channel(cluster) log [ERR] : 5.5e3 deep-scrub 1 errors
> ceph-osd.35.log.5.gz:2017-10-14 11:59:02.331454 7f34d6d4e700  0 -- 
> 1.2.3.13:6801/1001792 >> 1.2.3.11:6817/1941 pipe(0x56414d370800 sd=227 :42104 
> s=2 pgs=3238 cs=89 l=0 c=0x56413122d200).fault with nothing to send, going to 
> standby
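
For reference, the per-object detail behind the scrub error can usually be
pulled with the jewel list-inconsistent tooling, e.g. (a sketch):

    rados list-inconsistent-obj 5.5e3 --format=json-pretty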



On 10/18/2017 10:19 PM, Gregory Farnum wrote:
> It would help if you can provide the exact output of "ceph -s", "pg query",
> and any other relevant data. You shouldn't need to do manual repair of
> erasure-coded pools, since it has checksums and can tell which bits are
> bad. Following that article may not have done you any good (though I
> wouldn't expect it to hurt, either...)...
> -Greg
> 
> On Wed, Oct 18, 2017 at 5:56 AM Stijn De Weirdt 
> wrote:
> 
>> hi all,
>>
>> we have a ceph 10.2.7 cluster with a 8+3 EC pool.
>> in that pool, there is a pg in inconsistent state.
>>
>> we followed http://ceph.com/geen-categorie/ceph-manually-repair-object/,
>> however, we are unable to solve our issue.
>>
>> from the primary osd logs, the reported pg had a missing object.
>>
>> we found a related object on the primary osd, and then looked for
>> similar ones on the other osds in the same path (i guess it just has the
>> index of the osd in the pg's list of osds as a suffix)
>>
>> one osd did not have such a file (the 10 others did).
>>
>> so we did the "stop osd/flush/start os/pg repair" on both the primary
>> osd and on the osd with the missing EC part.
>>
>> however, the scrub error still exists.
>>
>> does anyone have any hints what to do in this case?
>>
>> stijn
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 


query_5.5e3.gz
Description: application/gzip
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Slow requests

2017-10-19 Thread Sean Purdy
Are you using radosgw?  I found this page useful when I had a similar issue:

http://www.osris.org/performance/rgw.html


Sean

On Wed, 18 Oct 2017, Ольга Ухина said:
> Hi!
> 
> I have a problem with ceph luminous 12.2.1. It was upgraded from kraken,
> but I'm not sure if it was a problem in kraken.
> I have slow requests on different OSDs on random time (for example at
> night, but I don't see any problems at the time of problem with disks, CPU,
> there is possibility of network problem at night). During daytime I have
> not this problem.
> Almost all requests are nearly 30 seconds, so I receive warnings like this:
> 
> 2017-10-18 01:20:26.147758 mon.st3 mon.0 10.192.1.78:6789/0 22686 : cluster
> [WRN] Health check failed: 1 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2017-10-18 01:20:28.025315 mon.st3 mon.0 10.192.1.78:6789/0 22687 : cluster
> [WRN] overall HEALTH_WARN 1 slow requests are blocked > 32 sec
> 2017-10-18 01:20:32.166758 mon.st3 mon.0 10.192.1.78:6789/0 22688 : cluster
> [WRN] Health check update: 38 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2017-10-18 01:20:38.187326 mon.st3 mon.0 10.192.1.78:6789/0 22689 : cluster
> [WRN] Health check update: 49 slow requests are blocked > 32 sec
> (REQUEST_SLOW)
> 2017-10-18 01:20:38.727421 osd.23 osd.23 10.192.1.158:6840/3659 1758 :
> cluster [WRN] 27 slow requests, 5 included below; oldest blocked for >
> 30.839843 secs
> 2017-10-18 01:20:38.727425 osd.23 osd.23 10.192.1.158:6840/3659 1759 :
> cluster [WRN] slow request 30.814060 seconds old, received at 2017-10-18
> 01:20:07.913300: osd_op(client.12464272.1:56610561 31.410dd55
> 5 31:aaabb082:::rbd_data.7b3e22ae8944a.00012e2c:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write 2977792~4096]
> snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 39
> 2017-10-18 01:20:38.727431 osd.23 osd.23 10.192.1.158:6840/3659 1760 :
> cluster [WRN] slow request 30.086589 seconds old, received at 2017-10-18
> 01:20:08.640771: osd_repop(client.12464806.1:17326170 34.242
> e10926/10860 34:426def95:::rbd_data.acdc9238e1f29.1231:head v
> 10926'4976910) currently write_thread_in_journal_buffer
> 2017-10-18 01:20:38.727433 osd.23 osd.23 10.192.1.158:6840/3659 1761 :
> cluster [WRN] slow request 30.812569 seconds old, received at 2017-10-18
> 01:20:07.914791: osd_repop(client.12464272.1:56610570 31.1eb
> e10926/10848 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head v
> 10926'135331) currently write_thread_in_journal_buffer
> 2017-10-18 01:20:38.727436 osd.23 osd.23 10.192.1.158:6840/3659 1762 :
> cluster [WRN] slow request 30.807328 seconds old, received at 2017-10-18
> 01:20:07.920032: osd_op(client.12464272.1:56610586 31.3f2f2e2
> 6 31:6474f4fc:::rbd_data.7b3e22ae8944a.00013673:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write 12288~4096]
> snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 30
> 2017-10-18 01:20:38.727438 osd.23 osd.23 10.192.1.158:6840/3659 1763 :
> cluster [WRN] slow request 30.807253 seconds old, received at 2017-10-18
> 01:20:07.920107: osd_op(client.12464272.1:56610588 31.2d23291
> 8 31:1894c4b4:::rbd_data.7b3e22ae8944a.00013a5b:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write 700416~4096]
> snapc 0=[] ondisk+write e10926) currently sub_op_commit_rec from 28
> 2017-10-18 01:20:38.006142 osd.39 osd.39 10.192.1.159:6808/3323 1501 :
> cluster [WRN] 2 slow requests, 2 included below; oldest blocked for >
> 30.092091 secs
> 2017-10-18 01:20:38.006153 osd.39 osd.39 10.192.1.159:6808/3323 1502 :
> cluster [WRN] slow request 30.092091 seconds old, received at 2017-10-18
> 01:20:07.913962: osd_op(client.12464272.1:56610570 31.e683e9e
> b 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head
> [set-alloc-hint object_size 4194304 write_size 4194304,write 143360~4096]
> snapc 0=[] ondisk+write e10926) currently op_applied
> 2017-10-18 01:20:38.006159 osd.39 osd.39 10.192.1.159:6808/3323 1503 :
> cluster [WRN] slow request 30.086123 seconds old, received at 2017-10-18
> 01:20:07.919930: osd_op(client.12464272.1:56610587 31.e683e9eb
> 31:d797c167:::rbd_data.7b3e22ae8944a.00013828:head [set-alloc-hint
> object_size 4194304 write_size 4194304,write 3256320~4096] snapc 0=[]
> ondisk+write e10926) currently op_applied
> 2017-10-18 01:20:38.374091 osd.38 osd.38 10.192.1.159:6857/236992 1387 :
> cluster [WRN] 2 slow requests, 2 included below; oldest blocked for >
> 30.449318 secs
> 2017-10-18 01:20:38.374107 osd.38 osd.38 10.192.1.159:6857/236992 1388 :
> cluster [WRN] slow request 30.449318 seconds old, received at 2017-10-18
> 01:20:07.924670: osd_op(client.12464272.1:56610603 31.fe179bed
> 31:b7d9e87f:::rbd_data.7b3e22ae8944a.00013a60:head [set-alloc-hint
> object_size 4194304 write_size 4194304,write 143360~4096] snapc 0=[]
> ondisk+write e10926) currently op_applied
> 
> 
> How can I determine the reason of problem? Should I only adjust
> 

Re: [ceph-users] High mem with Luminous/Bluestore

2017-10-19 Thread Hans van den Bogert
> Memory usage is still quite high here even with a large onode cache! 
> Are you using erasure coding?  I recently was able to reproduce a bug in 
> bluestore causing excessive memory usage during large writes with EC, 
> but have not tracked down exactly what's going on yet.
> 
> Mark
No, this is a 3-way replicated cluster for all pools.
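
For reference, a quick way to see where an OSD's memory is going on luminous
is the mempool dump over the admin socket, e.g. (a sketch for osd.0):

    ceph daemon osd.0 dump_mempools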

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [filestore][journal][prepare_entry] rebuild data_align is 4086, maybe a bug

2017-10-19 Thread zhaomingyue
Hi:
While analyzing ceph performance, I found that rebuild_aligned was
time-consuming, and that the rebuild operation was being performed for every
journal entry.

Source code:
FileStore::queue_transactions
  -> journal->prepare_entry(o->tls, &tbl);
    -> data_align = ((*p).get_data_alignment() - bl.length()) & ~CEPH_PAGE_MASK;
    -> ret = ebl.rebuild_aligned(CEPH_DIRECTIO_ALIGNMENT);

Log:
2017-10-17 19:49:29.706246 7fb472bfe700 10 journal  len 4196131 -> 4202496 
(head 40 pre_pad 4046 bl 4196131 post_pad 2239 tail 40) (bl alignment 4086)

Question:
I see “bl alignment 4086” in the log, and I think it may be a bug.
I think it should be 4096,
because CEPH_DIRECTIO_ALIGNMENT is 4096.
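
For reference, the quoted expression only keeps the low bits of the difference:
assuming CEPH_PAGE_SIZE is 4096 on this build, ~CEPH_PAGE_MASK works out to
0xfff, so any value between 0 and 4095 can come out of it. For example, with 10
bytes already in the bufferlist ahead of the payload (an illustrative number
only):

    (4096 - 10) & 0xfff = 4086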

thank you

-
This e-mail and its attachments contain confidential information from New H3C,
which is intended only for the person or entity whose address is listed above.
Any use of the information contained herein in any way (including, but not
limited to, total or partial disclosure, reproduction, or dissemination) by
persons other than the intended recipient(s) is prohibited. If you receive this
e-mail in error, please notify the sender by phone or email immediately and
delete it!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous can't seem to provision more than 32 OSDs per server

2017-10-19 Thread Marc Roos
 
What about not using deploy?
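
For example, something like the following should cover the same case with
ceph-disk directly (a sketch only; device paths are placeholders and the flags
are worth double-checking against the installed version):

    ceph-disk prepare --bluestore --dmcrypt --block.db /dev/sdX2 /dev/sdY
    ceph-disk list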




-Original Message-
From: Sean Sullivan [mailto:lookcr...@gmail.com] 
Sent: donderdag 19 oktober 2017 2:28
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Luminous can't seem to provision more than 32 OSDs 
per server

I am trying to install Ceph luminous (ceph version 12.2.1) on 4 ubuntu 
16.04 servers each with 74 disks, 60 of which are HGST 7200rpm sas 
drives::


HGST HUS724040AL sdbv  sas
root@kg15-2:~# lsblk --output MODEL,KNAME,TRAN | grep HGST | wc -l
60

I am trying to deploy them all with a line like the following::
ceph-deploy osd zap kg15-2:(sas_disk)
ceph-deploy osd create --dmcrypt --bluestore --block-db (ssd_partition) 
kg15-2:(sas_disk)

This didn't seem to work at all so I am now trying to troubleshoot by 
just provisioning the sas disks::
ceph-deploy osd create --dmcrypt --bluestore kg15-2:(sas_disk)

Across all 4 hosts I can only seem to get 32 OSDs up and after that the 
rest fail::
root@kg15-1:~# ps faux | grep '[c]eph-osd' | wc -l
32
root@kg15-2:~# ps faux | grep '[c]eph-osd' | wc -l
32

root@kg15-3:~# ps faux | grep '[c]eph-osd' | wc -l
32

The ceph-deploy tool doesn't seem to log or notice any failure but the 
host itself shows the following in the osd log:


2017-10-17 23:05:43.121016 7f8ca75c9e00  0 set uid:gid to 64045:64045 
(ceph:ceph)
2017-10-17 23:05:43.121040 7f8ca75c9e00  0 ceph version 12.2.1 
(3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable), process 
(unknown), pid 69926
2017-10-17 23:05:43.123939 7f8ca75c9e00  1 
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) mkfs path 
/var/lib/ceph/tmp/mnt.8oIc5b
2017-10-17 23:05:43.124037 7f8ca75c9e00  1 bdev create path 
/var/lib/ceph/tmp/mnt.8oIc5b/block type kernel
2017-10-17 23:05:43.124045 7f8ca75c9e00  1 bdev(0x564b7a05e900 
/var/lib/ceph/tmp/mnt.8oIc5b/block) open path 
/var/lib/ceph/tmp/mnt.8oIc5b/block
2017-10-17 23:05:43.124231 7f8ca75c9e00  1 bdev(0x564b7a05e900 
/var/lib/ceph/tmp/mnt.8oIc5b/block) open size 4000668520448 
(0x3a37a6d1000, 3725 GB) block_size 4096 (4096 B) rotational
2017-10-17 23:05:43.124296 7f8ca75c9e00  1 
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _set_cache_sizes max 0.5 < ratio 
0.99
2017-10-17 23:05:43.124313 7f8ca75c9e00  1 
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _set_cache_sizes cache_size 
1073741824 meta 0.5 kv 0.5 data 0
2017-10-17 23:05:43.124349 7f8ca75c9e00 -1 
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) _open_db 
/var/lib/ceph/tmp/mnt.8oIc5b/block.db link target doesn't exist
2017-10-17 23:05:43.124368 7f8ca75c9e00  1 bdev(0x564b7a05e900 
/var/lib/ceph/tmp/mnt.8oIc5b/block) close
2017-10-17 23:05:43.402165 7f8ca75c9e00 -1 
bluestore(/var/lib/ceph/tmp/mnt.8oIc5b) mkfs failed, (2) No such file or 
directory
2017-10-17 23:05:43.402185 7f8ca75c9e00 -1 OSD::mkfs: ObjectStore::mkfs 
failed with error (2) No such file or directory
2017-10-17 23:05:43.402258 7f8ca75c9e00 -1  ** ERROR: error creating 
empty object store in /var/lib/ceph/tmp/mnt.8oIc5b: (2) No such file or 
directory


I am not sure where to start troubleshooting, so I have a few questions.

1.) Anyone have any idea on why 32?
2.) Is there a good guide / outline on how to get the benefit of storing 
the keys in the monitor while still having ceph more or less manage the 
drives but provisioning the drives without ceph-deploy? I looked at the 
manual deployment long and short form and it doesn't mention dmcrypt or 
bluestore at all. I know I can use crypttab and cryptsetup to do this 
and then give ceph-disk the path to the mapped device but I would prefer 
to keep as much management in ceph as possible if I could.  (mailing 
list thread :: 
https://www.mail-archive.com/ceph-users@lists.ceph.com/msg38575.html)

3.) Ideally I would like to provision the drives with the DB on the SSD. (Or
would it be better to make a cache tier? I read on a reddit thread that cache
tiering in ceph isn't being developed anymore; is it still worth it?)

Sorry for the bother and thanks for all the help!!!


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com