Re: [ceph-users] High osd cpu usage

2017-11-08 Thread Vy Nguyen Tan
Hello,

I don't think this is normal behavior in Luminous. I'm testing with 3 nodes;
each node has 3 x 1TB HDD, 1 SSD for WAL + DB, an E5-2620 v3, 32GB of RAM,
and a 10Gbps NIC.

I use fio for I/O performance measurements. When I ran "fio --randrepeat=1
--ioengine=libaio --direct=1 --gtod_reduce=1 --name=test --filename=test
--bs=4k --iodepth=64 --size=4G --readwrite=randrw --rwmixread=75" I got the
%CPU of each ceph-osd shown below:

   2452 ceph  20   0 2667088 1.813g  15724 S  22.8  5.8  34:41.02
/usr/bin/ceph-osd -f --cluster ceph --id 1 --setuser ceph --setgroup ceph
   2178 ceph  20   0 2872152 2.005g  15916 S  22.2  6.4  43:22.80
/usr/bin/ceph-osd -f --cluster ceph --id 0 --setuser ceph --setgroup ceph
   1820 ceph  20   0 2713428 1.865g  15064 S  13.2  5.9  34:19.56
/usr/bin/ceph-osd -f --cluster ceph --id 2 --setuser ceph --setgroup ceph
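
By the way, if your fio build has RBD support you can also drive the cluster
directly instead of a local file; a rough sketch (the pool and image names
below are only placeholders, and the image has to be created first):

rbd create fio-test --size 4096 --pool rbd
fio --name=rbdtest --ioengine=rbd --clientname=admin --pool=rbd \
    --rbdname=fio-test --bs=4k --iodepth=64 --rw=randrw --rwmixread=75 \
    --direct=1 --gtod_reduce=1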

Are you using bluestore? How many IOPS and how much disk throughput do you
get with your cluster?
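
If you are not sure which objectstore you are on, it shows up in the OSD
metadata, and rados bench gives a quick cluster-level baseline; the OSD id
and pool name below are just examples:

ceph osd metadata 1 | grep osd_objectstore   # prints "filestore" or "bluestore"
rados bench -p rbd 30 write                  # 30-second write baseline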


Regards,

On Wed, Nov 8, 2017 at 8:13 PM, Alon Avrahami 
wrote:

> Hello Guys
>
> We have a fresh 'luminous' cluster: 12.2.0
> (32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc),
> installed using ceph-ansible.
>
> The cluster contains 6 nodes ( Intel server board S2600WTTR ): 96 OSDs
> and 3 mons in total.
>
> Each node has 64G of RAM and an Intel(R) Xeon(R) CPU E5-2620 v4 @ 2.10GHz
> ( 32 cores ), with 16 * 1.6TB Dell SSD drives ( SSDSC2BB016T7R ).
>
> The main usage is RBDs for our OpenStack environment ( Ocata ).
>
> We're at the beginning of our production tests and it looks like the
> OSDs are too busy, although we don't generate many IOPS at this stage
> ( almost nothing ).
> All ceph-osd daemons are using ~50% CPU and I can't figure out why they
> are so busy:
>
> top - 07:41:55 up 49 days,  2:54,  2 users,  load average: 6.85, 6.40, 6.37
>
> Tasks: 518 total,   1 running, 517 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 14.8 us,  4.3 sy,  0.0 ni, 80.3 id,  0.0 wa,  0.0 hi,  0.6 si,
> 0.0 st
> KiB Mem : 65853584 total, 23953788 free, 40342680 used,  1557116 buff/cache
> KiB Swap:  3997692 total,  3997692 free,        0 used. 18020584 avail Mem
>
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>   36713 ceph  20   0 3869588 2.826g  28896 S  47.2  4.5   6079:20
> ceph-osd
>   53981 ceph  20   0 3998732 2.666g  28628 S  45.8  4.2   5939:28
> ceph-osd
>   55879 ceph  20   0 3707004 2.286g  28844 S  44.2  3.6   5854:29
> ceph-osd
>   46026 ceph  20   0 3631136 1.930g  29100 S  43.2  3.1   6008:50
> ceph-osd
>   39021 ceph  20   0 4091452 2.698g  28936 S  42.9  4.3   5687:39
> ceph-osd
>   47210 ceph  20   0 3598572 1.871g  29092 S  42.9  3.0   5759:19
> ceph-osd
>   52763 ceph  20   0 3843216 2.410g  28896 S  42.2  3.8   5540:11
> ceph-osd
>   49317 ceph  20   0 3794760 2.142g  28932 S  41.5  3.4   5872:24
> ceph-osd
>   42653 ceph  20   0 3915476 2.489g  28840 S  41.2  4.0   5605:13
> ceph-osd
>   41560 ceph  20   0 3460900 1.801g  28660 S  38.5  2.9   5128:01
> ceph-osd
>   50675 ceph  20   0 3590288 1.827g  28840 S  37.9  2.9   5196:58
> ceph-osd
>   37897 ceph  20   0 4034180 2.814g  29000 S  34.9  4.5   4789:10
> ceph-osd
>   50237 ceph  20   0 3379780 1.930g  28892 S  34.6  3.1   4846:36
> ceph-osd
>   48608 ceph  20   0 3893684 2.721g  28880 S  33.9  4.3   4752:43
> ceph-osd
>   40323 ceph  20   0 4227864 2.959g  28800 S  33.6  4.7   4712:36
> ceph-osd
>   44638 ceph  20   0 3656780 2.437g  28896 S  33.2  3.9   4793:58
> ceph-osd
>   61639 ceph  20   0  527512 114300  20988 S   2.7  0.2   2722:03
> ceph-mgr
>   31586 ceph  20   0  765672 304140  21816 S   0.7  0.5 409:06.09
> ceph-mon
>  68 root  20   0   0  0  0 S   0.3  0.0   3:09.69
> ksoftirqd/12
>
> strace doesn't show anything suspicious:
>
> root@ecprdbcph10-opens:~# strace -p 36713
> strace: Process 36713 attached
> futex(0x563343c56764, FUTEX_WAIT_PRIVATE, 1, NUL
>
> Ceph logs don't reveal anything either.
> Is this "normal" behavior in Luminous?
> Looking through older threads I can only find one about time gaps, which
> is not our case.
>
> Thanks,
> Alon
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 2x replica with NVMe

2017-06-08 Thread Vy Nguyen Tan
Hi,

I think 2x replication carries the same risk on HDD and SSD. You should read
the quote from Wido below:

""Hi,

As a Ceph consultant I get numerous calls throughout the year to help people
 with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these
settings grew exponentially.

Usually a disk fails, recovery kicks in, and while recovery is happening a
second disk fails, causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks
and use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented
easily by using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon
fails. With size = 3 you always have two additional copies left thus
keeping your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool
with size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size =
2. The downtime and problems caused by missing objects/replicas are usually
big and it takes days to recover from those. But very often data is lost
and/or corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a
SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido""
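
For reference, the settings Wido recommends are applied per pool with the
standard CLI (replace <pool> with each pool name):

ceph osd pool set <pool> size 3
ceph osd pool set <pool> min_size 2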

On Thu, Jun 8, 2017 at 5:32 PM,  wrote:

> Hi all,
>
> I'm going to build an all-flash Ceph cluster. Looking around the existing
> documentation I see lots of guides and use-case scenarios from various
> vendors testing Ceph with replica 2x.
>
> Now, I'm an old-school Ceph user; I've always considered 2x replica really
> dangerous for production data, especially when the two OSDs can't decide
> which replica is the good one.
> Why do all NVMe storage vendors and partners use only 2x replica?
> They claim it's safe because NVMe is better at handling errors, but I
> usually don't trust marketing claims :)
> Is it true? Can someone confirm that NVMe is different compared to HDD and
> that replica 2 can therefore be considered safe for production?
>
> Many Thanks
> Giordano
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mix HDDs and SSDs togheter

2017-03-06 Thread Vy Nguyen Tan
Hi Jiajia zhong,

I'm mixing SSDs and HDDs on the same node, following the guide at
https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/,
and I haven't had any problems running SSDs and HDDs together. Now I want to
increase Ceph throughput by raising the network to 20Gbps (I want a single
network stream to reach 20Gbps, as tested with iperf). Could you please share
your experience with HA networking for Ceph? What type of bonding do you
have? Are you using stackable switches?

I really appreciate your help.

On Mon, Mar 6, 2017 at 11:45 AM, jiajia zhong  wrote:

> We are using a mixed setup too: 8 * Intel PCIe 400G SSDs for the metadata
> pool and the cache tier pool of our CephFS.
>
> Plus: 'osd crush update on start = false', as Vladimir replied.
>
> 2017-03-03 20:33 GMT+08:00 Дробышевский, Владимир :
>
>> Hi, Matteo!
>>
>>   Yes, I'm using a mixed cluster in production but it's pretty small at
>> the moment. I made a small step-by-step manual for myself when I did this
>> for the first time and have now put it up as a gist:
>> https://gist.github.com/vheathen/cf2203aeb53e33e3f80c8c64a02263bc#file-manual-txt.
>> It could be a little bit outdated since it was written some time ago.
>>
>>   Crush map modifications are going to be persistent across reboots
>> and maintenance if you put 'osd crush update on start = false' in the
>> [osd] section of ceph.conf.
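>>
>>   Spelled out, that is just this snippet (both the option and the section
>> name come from the sentence above):
>>
>>   [osd]
>>   osd crush update on start = false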
>>
>>   But I would also recommend starting from this article:
>> https://www.sebastien-han.fr/blog/2014/08/25/ceph-mix-sata-and-ssd-within-the-same-box/
>>
>>   P.S. While I was writing this letter I saw a letter from Maxime Guyot.
>> It seems his method is much easier, if it leads to the same results.
>>
>> Best regards,
>> Vladimir
>>
>> Best regards,
>> Vladimir Drobyshevskiy
>> "АйТи Город" (IT Gorod) company
>> +7 343 222-21-92
>>
>> Hardware and software
>> IBM, Microsoft, Eset
>> Turnkey project delivery
>> IT services outsourcing
>>
>> 2017-03-03 16:30 GMT+05:00 Matteo Dacrema :
>>
>>> Hi all,
>>>
>>> Does anyone run a production cluster with a modified crush map that
>>> creates two pools, one backed by HDDs and one by SSDs?
>>> What's the best method? Modify the crush map via the ceph CLI or via a
>>> text editor?
>>> Will the modification to the crush map be persistent across reboots and
>>> maintenance operations?
>>> Is there anything to consider when doing upgrades or other operations, or
>>> anything different from having the "original" crush map?
>>>
>>> Thank you
>>> Matteo
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] replica questions

2017-03-03 Thread Vy Nguyen Tan
Hi,

You should read this email from Wido den Hollander:
"Hi,

As a Ceph consultant I get numerous calls throughout the year to help people
 with getting their broken Ceph clusters back online.

The causes of downtime vary vastly, but one of the biggest causes is that
people use replication 2x. size = 2, min_size = 1.

In 2016 the amount of cases I have where data was lost due to these
settings grew exponentially.

Usually a disk fails, recovery kicks in, and while recovery is happening a
second disk fails, causing PGs to become incomplete.

There have been too many times where I had to use xfs_repair on broken disks
and use ceph-objectstore-tool to export/import PGs.

I really don't like these cases, mainly because they can be prevented
easily by using size = 3 and min_size = 2 for all pools.

With size = 2 you go into the danger zone as soon as a single disk/daemon
fails. With size = 3 you always have two additional copies left thus
keeping your data safe(r).

If you are running CephFS, at least consider running the 'metadata' pool
with size = 3 to keep the MDS happy.

Please, let this be a big warning to everybody who is running with size =
2. The downtime and problems caused by missing objects/replicas are usually
big and it takes days to recover from those. But very often data is lost
and/or corrupted which causes even more problems.

I can't stress this enough. Running with size = 2 in production is a
SERIOUS hazard and should not be done imho.

To anyone out there running with size = 2, please reconsider this!

Thanks,

Wido"
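
A quick way to see where a cluster currently stands, pool by pool:

ceph osd dump | grep "^pool"   # each line shows "replicated size N ... min_size M"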

Btw, could you please share your experience with HA networking for Ceph?
What type of bonding do you have? Are you using stackable switches?



On Fri, Mar 3, 2017 at 6:24 PM, Maxime Guyot  wrote:

> Hi Henrik and Matteo,
>
>
>
> I agree with Henrik: increasing your replication factor won't improve
> recovery or read performance on its own. However, if you are changing from
> replica 2 to replica 3, you might need to scale out your cluster to have
> enough space for the additional replica, and that would improve recovery
> and read performance.
>
>
>
> Cheers,
>
> Maxime
>
>
>
> From: ceph-users on behalf of Henrik Korkuc
> Date: Friday 3 March 2017 11:35
> To: "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] replica questions
>
>
>
> On 17-03-03 12:30, Matteo Dacrema wrote:
>
> Hi All,
>
>
>
> I’ve a production cluster made of 8 nodes, 166 OSDs and 4 Journal SSD
> every 5 OSDs with replica 2 for a total RAW space of 150 TB.
>
> I’ve few question about it:
>
>
>
>    - Is it critical to have replica 2? Why?
>
> Replica size 3 is highly recommended. I do not know the exact numbers, but
> it decreases the chance of data loss, as 2-disk failures appear to be quite
> a frequent thing, especially in larger clusters.
>
>
>    - Does replica 3 make recovery faster?
>
> no
>
>
>    - Does replica 3 make rebalancing and recovery less heavy for
>      customers? If I lose 1 node, does replica 3 reduce the IO impact
>      compared to replica 2?
>
> no
>
>
>- Does read performance increase with replica 3?
>
> no
>
>
>
> Thank you
>
> Regards
>
> Matteo
>
>
>
>
>
>
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] properly upgrade Ceph from 10.2.3 to 10.2.5 without downtime

2017-01-19 Thread Vy Nguyen Tan
Hello everyone,

I am planning to upgrade a Ceph cluster from 10.2.3 to 10.2.5. I am wondering
whether I can upgrade the cluster without downtime, and if so, how?
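
The rough sequence I have in mind is a rolling restart, roughly as sketched
below (assuming systemd hosts and a healthy cluster; the package command
depends on the distro) - please correct me if this is wrong:

ceph osd set noout                  # avoid rebalancing while daemons restart
# on each monitor host, one at a time:
yum update ceph                     # or the apt equivalent
systemctl restart ceph-mon.target
ceph -s                             # wait for quorum / HEALTH_OK before the next host
# then on each OSD host, one at a time:
yum update ceph
systemctl restart ceph-osd.target   # or restart each ceph-osd@<id> one by one
ceph tell osd.* version             # confirm everything reports 10.2.5
ceph osd unset noout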

Thanks for help.

Regards,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CEPH mirror down again

2016-11-25 Thread Vy Nguyen Tan
Hi Matt and Joao,

Thank you for the information. I am installing Ceph from an alternative
mirror (ceph-deploy install --repo-url http://hk.ceph.com/rpm-jewel/el7/
--gpg-url http://hk.ceph.com/keys/release.asc {host}) and everything works
again.
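
In case anyone wants to point yum at the mirror directly instead of going
through ceph-deploy, a sketch of /etc/yum.repos.d/ceph.repo (this assumes
hk.ceph.com keeps the same directory layout as download.ceph.com):

[ceph]
name=Ceph packages for $basearch
baseurl=http://hk.ceph.com/rpm-jewel/el7/$basearch
enabled=1
gpgcheck=1
gpgkey=http://hk.ceph.com/keys/release.asc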

On Sat, Nov 26, 2016 at 10:12 AM, Joao Eduardo Luis <j...@suse.de> wrote:

> On 11/26/2016 03:05 AM, Vy Nguyen Tan wrote:
>
>> Hello,
>>
>> I want to install CEPH on new nodes but I can't reach the CEPH repo; it
>> seems the repo is broken. I am using CentOS 7.2 and ceph-deploy 1.5.36.
>>
>
> Patrick sent an email to the list back on Nov 18th informing us that this
> would happen; quote:
>
> Due to Dreamhost shutting down the old DreamCompute cluster in their
>> US-East 1 region, we are in the process of beginning the migration of
>> Ceph infrastructure.  We will need to move download.ceph.com,
>> tracker.ceph.com, and docs.ceph.com to their US-East 2 region.
>>
>> The current plan is to move the VMs on 25 NOV 2016 throughout the day.
>> Expect them to be down intermittently.
>>
>
>   -Joao
>
> P.S.: also, it's Ceph; not CEPH.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CEPH mirror down again

2016-11-25 Thread Vy Nguyen Tan
Hello,

I want to install CEPH on new nodes but I can't reach the CEPH repo; it seems
the repo is broken. I am using CentOS 7.2 and ceph-deploy 1.5.36.

[root@cp ~]# ping -c 3 download.ceph.com

PING download.ceph.com (173.236.253.173) 56(84) bytes of data.

--- download.ceph.com ping statistics ---
3 packets transmitted, 0 received, 100% packet loss, time 11999ms


[root@cp ~]# curl https://download.ceph.com/debian-jewel/

curl: (7) Failed to connect to 2607:f298:6050:51f3:f816:3eff:fe71:9135:
Network is unreachable
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ubuntu repo's broken

2016-10-17 Thread Vy Nguyen Tan
Hello,


I have the same problem. I am using Debian 8.6 and ceph-deploy 1.5.36.


Logs from ceph-deploy:

[hv01][INFO  ] Running command: env DEBIAN_FRONTEND=noninteractive
DEBIAN_PRIORITY=critical apt-get --assume-yes -q
--no-install-recommends install -o Dpkg::Options::=--force-confnew
ceph-osd ceph-mds ceph-mon radosgw

[hv01][DEBUG ] Reading package lists...
[hv01][DEBUG ] Building dependency tree...
[hv01][DEBUG ] Reading state information...
[hv01][DEBUG ] Package ceph-osd is not available, but is referred
to by another package.
[hv01][DEBUG ] This may mean that the package is missing, has been
obsoleted, or
[hv01][DEBUG ] is only available from another source
[hv01][DEBUG ]
[hv01][WARNIN] E: Package 'ceph-osd' has no installation candidate
[hv01][WARNIN] E: Unable to locate package ceph-mon
[hv01][ERROR ] RuntimeError: command returned non-zero exit status: 100
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get
--assume-yes -q --no-install-recommends install -o
Dpkg::Options::=--force-confnew ceph-osd ceph-mds ceph-mon radosgw
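
For comparison, the apt source that ceph-deploy normally writes for Jewel on
Debian 8 is a single line; assuming it lands in
/etc/apt/sources.list.d/ceph.list, it should look like:

deb https://download.ceph.com/debian-jewel/ jessie main

If that line is present and apt-get update still cannot see ceph-osd /
ceph-mon, the problem is most likely on the repo side.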


> On 16 October 2016 at 11:57, "Jon Morby (FidoNet)" wrote:
>
>
> Morning
>
> It's been a few days now since the outage, but we're still unable to
> install new nodes; it seems the repos are broken … and have been for at
> least 2 days now (so not just a brief momentary issue caused by an update).
>
> [osd04][WARNIN] E: Package 'ceph-osd' has no installation candidate
> [osd04][WARNIN] E: Package 'ceph-mon' has no installation candidate
> [osd04][ERROR ] RuntimeError: command returned non-zero exit status: 100
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: env
> DEBIAN_FRONTEND=noninteractive DEBIAN_PRIORITY=critical apt-get --assume-yes
> -q --no-install-recommends install -o Dpkg::Options::=--force-confnew
> ceph-osd ceph-mds ceph-mon radosgw
>
> Is there any ETA for when this might be fixed?
>

>> What is the line in your sources.list on your system?

>> Afaik the mirrors are working fine.

>> Wido

> —
> Jon Morby
> FidoNet - the internet made simple!
> tel: 0345 004 3050 / fax: 0345 004 3051
> twitter: @fido | skype://jmorby  | web: https://www.fido.net
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com