Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Анатолий Фуников
It's strange, but the parted output for this disk (/dev/sdf) shows me that it's
GPT:

(parted) print
Model: ATA HGST HUS726020AL (scsi)
Disk /dev/sdf: 2000GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End     Size    File system  Name          Flags
 2      1049kB  1075MB  1074MB               ceph journal
 1      1075MB  2000GB  1999GB  xfs          ceph data
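
(As an aside, a couple of read-only checks that might help confirm whether the
partition entries actually carry GPT PARTUUIDs; /dev/sdf is just the device from
the output above, and none of this writes to the disk:)

    sgdisk --info=1 /dev/sdf    # prints "Partition unique GUID" for partition 1
    sgdisk --info=2 /dev/sdf
    lsblk -o NAME,PARTUUID,UUID,FSTYPE /dev/sdf   # compare what the kernel/udev report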

Wed, Feb 20, 2019 at 17:11, Alfredo Deza :

> On Wed, Feb 20, 2019 at 8:40 AM Анатолий Фуников
>  wrote:
> >
> > Thanks for the reply.
> > blkid -s PARTUUID -o value /dev/sdf1 shows me nothing, but blkid
> /dev/sdf1 shows me this: /dev/sdf1:
> UUID="b03810e4-dcc1-46c2-bc31-a1e558904750" TYPE="xfs"
>
> I think this is what happens with a non-gpt partition. GPT labels will
> use a PARTUUID to identify the partition, and I just confirmed that
> ceph-volume will enforce looking for PARTUUID if the JSON
> identified a partition (vs. an LV).
>
> From what I briefly researched it is not possible to add a GPT label
> on a non-gpt partition without losing data.
>
> My suggestion (if you confirm it is not possible to add the GPT label)
> is to start the migration towards the new way of creating OSDs
>
> >
> > Wed, Feb 20, 2019 at 16:27, Alfredo Deza :
> >>
> >> On Wed, Feb 20, 2019 at 8:16 AM Анатолий Фуников
> >>  wrote:
> >> >
> >> > Hello. I need to raise the OSD on the node after reinstalling the OS,
> some OSD were made a long time ago, not even a ceph-disk, but a set of
> scripts.
> >> > There was an idea to get their configuration in json via ceph-volume
> simple scan, and then on a fresh system I can make a ceph-volume simple
> activate --file /etc/ceph/osd/31-46eacafe-22b6-4433-8e5c-e595612d8579.json
> >> > I do ceph-volume simple scan /var/lib/ceph/osd/ceph-31/, and got this
> json: https://pastebin.com/uJ8WVZyV
> >> > It seems everything is not bad, but in the data section I see a
> direct link to the device /dev/sdf1, and the uuid field is empty. At the
> same time, in the /dev/disk/by-partuuid directory I can find and substitute
> this UUID in this json, and delete the direct link to the device in this
> json.
> >> > The question is: how correct is it and can I raise this OSD on a
> freshly installed OS with this fixed json?
> >>
> >> It worries me that it is unable to find a uuid for the device. This is
> >> important because paths like /dev/sdf1 are ephemeral and can change
> >> after a reboot. The uuid is found by running the following:
> >>
> >> blkid -s PARTUUID -o value /dev/sdf1
> >>
> >> If that is not returning anything, then ceph-volume will probably not
> >> be able to ensure this device is brought up correctly. You can correct
> >> or add to anything in the JSON after a scan and rely on that, but then
> >> again
> >> without a partuuid I don't think this will work nicely
> >>
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize recovery over backfilling

2019-02-20 Thread Frédéric Nass
Hi Sage,

Would be nice to have this one backported to Luminous if easy. 

Cheers,
Frédéric.

> On 7 Jun 2018 at 13:33, Sage Weil  wrote:
> 
> On Wed, 6 Jun 2018, Caspar Smit wrote:
>> Hi all,
>> 
>> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node
>> to it.
>> 
>> osd-max-backfills is at the default 1 so backfilling didn't go very fast
>> but that doesn't matter.
>> 
>> Once it started backfilling everything looked ok:
>> 
>> ~300 pgs in backfill_wait
>> ~10 pgs backfilling (~number of new osd's)
>> 
>> But i noticed the degraded objects increasing a lot. I presume a pg that is
>> in backfill_wait state doesn't accept any new writes anymore? Hence
>> increasing the degraded objects?
>> 
>> So far so good, but once a while i noticed a random OSD flapping (they come
>> back up automatically). This isn't because the disk is saturated but a
>> driver/controller/kernel incompatibility which 'hangs' the disk for a short
>> time (scsi abort_task error in syslog). Investigating further i noticed
>> this was already the case before the node expansion.
>> 
>> These OSD's flapping results in lots of pg states which are a bit worrying:
>> 
>> 109 active+remapped+backfill_wait
>> 80  active+undersized+degraded+remapped+backfill_wait
>> 51  active+recovery_wait+degraded+remapped
>> 41  active+recovery_wait+degraded
>> 27  active+recovery_wait+undersized+degraded+remapped
>> 14  active+undersized+remapped+backfill_wait
>> 4   active+undersized+degraded+remapped+backfilling
>> 
>> I think the recovery_wait is more important then the backfill_wait, so i
>> like to prioritize these because the recovery_wait was triggered by the
>> flapping OSD's
> 
> Just a note: this is fixed in mimic.  Previously, we would choose the 
> highest-priority PG to start recovery on at the time, but once recovery 
> had started, the appearance of a new PG with a higher priority (e.g., 
> because it finished peering after the others) wouldn't preempt/cancel the 
> other PG's recovery, so you would get behavior like the above.
> 
> Mimic implements that preemption, so you should not see behavior like 
> this.  (If you do, then the function that assigns a priority score to a 
> PG needs to be tweaked.)
> 
> sage
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 


smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Prioritize recovery over backfilling

2019-02-20 Thread Frédéric Nass
Hi,

Please keep in mind that setting the ‘nodown’ flag will prevent PGs from 
becoming degraded, but it will also prevent client requests from being served by 
the other OSDs that, without the ‘nodown’ flag, would have taken over from the 
non-responsive one in a healthy manner. And this lasts for the whole time the OSD 
is non-responsive.
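
(For reference, a minimal sketch of the commands being discussed; the PG ids are
placeholders, and force-recovery is only available from Luminous on:)

    ceph osd set nodown                  # keep flapping OSDs from being marked down
    ceph pg force-recovery 1.34 2.3c     # ask specific PGs to recover ahead of backfill
    ceph osd unset nodown                # clear the flag once things have settled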

Regards,
Frédéric.


> On 7 Jun 2018 at 08:47, Piotr Dałek  wrote:
> 
> On 18-06-06 09:29 PM, Caspar Smit wrote:
>> Hi all,
>> We have a Luminous 12.2.2 cluster with 3 nodes and i recently added a node 
>> to it.
>> osd-max-backfills is at the default 1 so backfilling didn't go very fast but 
>> that doesn't matter.
>> Once it started backfilling everything looked ok:
>> ~300 pgs in backfill_wait
>> ~10 pgs backfilling (~number of new osd's)
>> But i noticed the degraded objects increasing a lot. I presume a pg that is 
>> in backfill_wait state doesn't accept any new writes anymore? Hence 
>> increasing the degraded objects?
>> So far so good, but once a while i noticed a random OSD flapping (they come 
>> back up automatically). This isn't because the disk is saturated but a 
>> driver/controller/kernel incompatibility which 'hangs' the disk for a short 
>> time (scsi abort_task error in syslog). Investigating further i noticed this 
>> was already the case before the node expansion.
>> These OSD's flapping results in lots of pg states which are a bit worrying:
>>  109 active+remapped+backfill_wait
>>  80  active+undersized+degraded+remapped+backfill_wait
>>  51  active+recovery_wait+degraded+remapped
>>  41  active+recovery_wait+degraded
>>  27  active+recovery_wait+undersized+degraded+remapped
>>  14  active+undersized+remapped+backfill_wait
>>  4   active+undersized+degraded+remapped+backfilling
>> I think the recovery_wait is more important then the backfill_wait, so i 
>> like to prioritize these because the recovery_wait was triggered by the 
>> flapping OSD's
> >
>> furthermore the undersized ones should get absolute priority or is that 
>> already the case?
>> I was thinking about setting "nobackfill" to prioritize recovery instead of 
>> backfilling.
>> Would that help in this situation? Or am i making it even worse then?
>> ps. i tried increasing the heartbeat values for the OSD's to no avail, they 
>> still get flagged as down once in a while after a hiccup of the driver.
> 
> First of all, use "nodown" flag so osds won't be marked down automatically 
> and unset it once everything backfills/recovers and settles for good -- note 
> that there might be lingering osd down reports, so unsetting nodown might 
> cause some of problematic osds to be instantly marked as down.
> 
> Second, since Luminous you can use "ceph pg force-recovery" to ask particular 
> pgs to recover first, even if there are other pgs to backfill and/or recovery.
> 
> -- 
> Piotr Dałek
> piotr.da...@corp.ovh.com 
> https://www.ovhcloud.com 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 


smime.p7s
Description: S/MIME cryptographic signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Urgent: Reduced data availability / All pgs inactive

2019-02-20 Thread Irek Fasikhov
Hi,

You have problems with the MGR.
http://docs.ceph.com/docs/master/rados/operations/pg-states/
*The ceph-mgr hasn’t yet received any information about the PG’s state from
an OSD since mgr started up.*
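
A first thing to try (assuming a systemd deployment, with the mgr name taken from
the "ceph -s" output quoted below) is simply restarting the active mgr and checking
that it becomes active again:

    ceph mgr stat
    systemctl restart ceph-mgr@yak0.planwerk6.de
    ceph -s        # the pg states should populate once the mgr starts reporting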

Thu, Feb 21, 2019 at 09:04, Irek Fasikhov :

> Hi,
>
> You have problems with the MGR.
> http://docs.ceph.com/docs/master/rados/operations/pg-states/
> *The ceph-mgr hasn’t yet received any information about the PG’s state
> from an OSD since mgr started up.*
>
>
> Wed, Feb 20, 2019 at 23:10, Ranjan Ghosh :
>
>> Hi all,
>>
>> hope someone can help me. After restarting a node of my 2-node-cluster
>> suddenly I get this:
>>
>> root@yak2 /var/www/projects # ceph -s
>>   cluster:
>> id: 749b2473-9300-4535-97a6-ee6d55008a1b
>> health: HEALTH_WARN
>> Reduced data availability: 200 pgs inactive
>>
>>   services:
>> mon: 3 daemons, quorum yak1,yak2,yak0
>> mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de,
>> yak2.planwerk6.de
>> mds: cephfs-1/1/1 up  {0=yak1.planwerk6.de=up:active}, 1 up:standby
>> osd: 2 osds: 2 up, 2 in
>>
>>   data:
>> pools:   2 pools, 200 pgs
>> objects: 0  objects, 0 B
>> usage:   0 B used, 0 B / 0 B avail
>> pgs: 100.000% pgs unknown
>>  200 unknown
>>
>> And this:
>>
>>
>> root@yak2 /var/www/projects # ceph health detail
>> HEALTH_WARN Reduced data availability: 200 pgs inactive
>> PG_AVAILABILITY Reduced data availability: 200 pgs inactive
>> pg 1.34 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.35 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.36 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.37 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.38 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.39 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3a is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3b is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3c is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3d is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3e is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.3f is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.40 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.41 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.42 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.43 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.44 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.45 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.46 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.47 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.48 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.49 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.4a is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.4b is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.4c is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 1.4d is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.34 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.35 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.36 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.38 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.39 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3a is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3b is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3c is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3d is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3e is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.3f is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.40 is stuck inactive for 3506.815664, current state unknown,
>> last acting []
>> pg 2.41 is stuck inactive for 3506.815664, current state unknown,
>> last 

[ceph-users] ccache did not support in ceph?

2019-02-20 Thread ddu

Hi

When enabling ccache for ceph, an error occurs:
-
 ccache: invalid option -- 'E'
 ...
 Unable to determine C++ standard library, got .
-
This is because the variable "CXX_STDLIB" was empty at CMakeLists.txt line 637.
"CXX_STDLIB" comes from:
-
 execute_process(
   COMMAND ./librarytest.sh ${CMAKE_CXX_COMPILER} ${CMAKE_CXX_FLAGS}
   WORKING_DIRECTORY ${CMAKE_SOURCE_DIR}
   OUTPUT_VARIABLE CXX_STDLIB
   )
-
The script librarytest.sh in ceph accepts the compiler and flags as arguments,
but when ccache is enabled the compiler is replaced by ccache, so the script fails.

How can I solve it?
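
One workaround that may be worth trying (untested here) is to keep
CMAKE_CXX_COMPILER pointing at the real compiler and let CMake prepend ccache as
a launcher instead, so scripts such as librarytest.sh still invoke plain g++:

    # from the ceph build directory; requires CMake >= 3.4
    cmake -DCMAKE_C_COMPILER_LAUNCHER=ccache -DCMAKE_CXX_COMPILER_LAUNCHER=ccache ..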

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread David Turner
If I'm not mistaken, if you stop them at the same time during a reboot of a
node running both an mds and a mon, the mons might receive the MDS shutdown
message but wait to finish their own election before doing anything about it.
If you're trying to keep optimal uptime for your mds, then stopping it first
and on its own makes sense.

On Wed, Feb 20, 2019 at 3:46 PM Patrick Donnelly 
wrote:

> On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
> >
> > Hi!
> >
> > From documentation:
> >
> > mds beacon grace
> > Description:The interval without beacons before Ceph declares an MDS
> laggy (and possibly replace it).
> > Type:   Float
> > Default:15
> >
> > I do not understand, 15 - are is seconds or beacons?
>
> seconds
>
> > And an additional misunderstanding - if we gently turn off the MDS (or
> MON), why it does not inform everyone interested before death - "I am
> turned off, no need to wait, appoint a new active server"
>
> The MDS does inform the monitors if it has been shutdown. If you pull
> the plug or SIGKILL, it does not. :)
>
>
> --
> Patrick Donnelly
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] faster switch to another mds

2019-02-20 Thread Patrick Donnelly
On Tue, Feb 19, 2019 at 11:39 AM Fyodor Ustinov  wrote:
>
> Hi!
>
> From documentation:
>
> mds beacon grace
> Description:The interval without beacons before Ceph declares an MDS 
> laggy (and possibly replace it).
> Type:   Float
> Default:15
>
> I do not understand: is the 15 in seconds or in beacons?

seconds

> And an additional misunderstanding - if we gently turn off the MDS (or MON), 
> why does it not inform everyone interested before it dies - "I am turned off, 
> no need to wait, appoint a new active server"

The MDS does inform the monitors if it has been shutdown. If you pull
the plug or SIGKILL, it does not. :)
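
(For reference, the grace can be inspected or adjusted per daemon through the
admin socket; "mds.a" below is just a placeholder for the MDS name:)

    ceph daemon mds.a config get mds_beacon_grace
    ceph daemon mds.a config set mds_beacon_grace 15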


-- 
Patrick Donnelly
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Urgent: Reduced data availability / All pgs inactive

2019-02-20 Thread Ranjan Ghosh

Hi all,

hope someone can help me. After restarting a node of my 2-node-cluster 
suddenly I get this:


root@yak2 /var/www/projects # ceph -s
  cluster:
    id: 749b2473-9300-4535-97a6-ee6d55008a1b
    health: HEALTH_WARN
    Reduced data availability: 200 pgs inactive

  services:
    mon: 3 daemons, quorum yak1,yak2,yak0
    mgr: yak0.planwerk6.de(active), standbys: yak1.planwerk6.de, 
yak2.planwerk6.de

    mds: cephfs-1/1/1 up  {0=yak1.planwerk6.de=up:active}, 1 up:standby
    osd: 2 osds: 2 up, 2 in

  data:
    pools:   2 pools, 200 pgs
    objects: 0  objects, 0 B
    usage:   0 B used, 0 B / 0 B avail
    pgs: 100.000% pgs unknown
 200 unknown

And this:


root@yak2 /var/www/projects # ceph health detail
HEALTH_WARN Reduced data availability: 200 pgs inactive
PG_AVAILABILITY Reduced data availability: 200 pgs inactive
    pg 1.34 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.35 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.36 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.37 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.38 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.39 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3a is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3b is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3c is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3d is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3e is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.3f is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.40 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.41 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.42 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.43 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.44 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.45 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.46 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.47 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.48 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.49 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.4a is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.4b is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.4c is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 1.4d is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.34 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.35 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.36 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.38 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.39 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3a is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3b is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3c is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3d is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3e is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.3f is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.40 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.41 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.42 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.43 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.44 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.45 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.46 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.47 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.48 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.49 is stuck inactive for 3506.815664, current state unknown, 
last acting []
    pg 2.4a is stuck inactive for 3506.815664, current state unknown, 
last acting []
    

Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Marco Gaiarin
Mandi! Alfredo Deza
  In chel di` si favelave...

> > Ahem, how can i add a GPT label to a non-GPT partition (even loosing
> > data)?
> If you are coming from ceph-disk (or something else custom-made) and
> don't care about losing data, why not fully migrate to the
> new OSDs? 
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd

I'm using Proxmox, so the 'pveceph' helper, but I have trouble with the journal
labels, indeed, not the main filesystem labels...

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replicating CephFS between clusters

2019-02-20 Thread Balazs Soltesz
Hi everyone,

Thank you all for the quick replies!

So making snapshots and rsyncing them between clusters should work; I'll be 
sure to check that out. Snapshot mirroring is what we'd need, but I couldn't 
find any release date for Nautilus, and we don't really have time to wait for 
its release.

Some numbers would definitely come in useful to approximate the upper 
limit on snapshots, as we do intend to keep some to serve as 'checkpoints'. 
Currently we're snapshotting every 15 minutes, but only keep one snapshot a day 
long-term. We delete snapshots older than a month, which means we have about 200 
snapshots at any given time, so I think this'll work for us.
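
For what it's worth, the snapshot-plus-rsync approach is mostly plain directory
operations on the mounted filesystem; the paths below are made up:

    mkdir /mnt/cephfs/projects/.snap/backup-20190220    # create a snapshot
    rsync -aHx /mnt/cephfs/projects/.snap/backup-20190220/ \
          remote-cluster:/mnt/cephfs/projects/
    rmdir /mnt/cephfs/projects/.snap/backup-20190220    # drop the snapshot when done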

How does having multiple ranks in CephFS influence snapshots? Is it a kind of 
'no-no', like having multiple filesystems?

Thanks again for your answers.

Best,
Balazs
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Access to cephfs from two different networks

2019-02-20 Thread Andrés Rojas Guerrero
Ohh, sorry for the question. Now I understand why we would define 
different public networks in Ceph; I understand that clients contact 
the mon only in order to obtain the cluster map. From the 
documentation:


"When a Ceph Client binds to a Ceph Monitor, it retrieves the latest 
copy of the Cluster Map. With the cluster map, the client knows about 
all of the monitors, OSDs, and metadata servers in the cluster"


I thought that the mons were some kind of gateway for the whole "ceph service" 
and that all Ceph traffic passed through them.


I need to read more ... sorry again.






On 20/2/19 13:03, Andrés Rojas Guerrero wrote:
Hi all, sorry, we are newbies in Ceph and we have a newbie question 
about it. We have a Ceph cluster with three mon's and two public networks:


public network = 10.100.100.0/23,10.100.101.0/21

We have seen that ceph-mon are listen in only one of this network:


tcp  0  0 10.100.100.9:6789  0.0.0.0:*  LISTEN  135385/ceph-mon


but not in the other public network necessary to access to cephfs from 
this second network. We have seen that in principle that it's not

possible?:

https://access.redhat.com/solutions/1463363

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019659.html


and it's seems necessary to route the traffic from the second public 
network to the first public network, but seeing this our question is why 
you can define differents public networks in ceph if only you can access 
to only one from the ceph-mon daemons? perhaps you can configure the 
other mon's with the other public network?   Or is there anything we're 
not getting right?


Thank's in advance.


--
***
Andrés Rojas Guerrero
Unidad Sistemas Linux
Area Arquitectura Tecnológica
Secretaría General Adjunta de Informática
Consejo Superior de Investigaciones Científicas (CSIC)
Pinar 19
28006 - Madrid
Tel: +34 915680059 -- Ext. 990059
email: a.ro...@csic.es
ID comunicate.csic.es: @50852720l:matrix.csic.es
***
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread Alexandru Cucu
Hi,

I would decrease the max active recovery processes per OSD and increase
the recovery sleep.
osd recovery max active = 1 (default is 3)
osd recovery sleep = 1 (default is 0 or 0.1)

osd max backfills defaults to 1 so that should be OK if he's using the
default :D

Disabling scrubbing during recovery should also help:
osd scrub during recovery = false
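
As a sketch, the above can be injected at runtime like this (or the equivalent
entries can go into ceph.conf for a permanent change):

    ceph tell osd.* injectargs '--osd_recovery_max_active 1 --osd_recovery_sleep 1'
    ceph tell osd.* injectargs '--osd_scrub_during_recovery false'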

On Wed, Feb 20, 2019 at 5:47 PM Darius Kasparavičius  wrote:
>
> Hello,
>
>
> Check your CPU usage when you are doing those kind of operations. We
> had a similar issue where our CPU monitoring was reporting fine < 40%
> usage, but our load on the nodes was high mid 60-80. If it's possible
> try disabling ht and see the actual cpu usage.
> If you are hitting CPU limits you can try disabling crc on messages.
> ms_nocrc
> ms_crc_data
> ms_crc_header
>
> And setting all your debug messages to 0.
> If you haven't done you can also lower your recovery settings a little.
> osd recovery max active
> osd max backfills
>
> You can also lower your file store threads.
> filestore op threads
>
>
> If you can also switch to bluestore from filestore. This will also
> lower your CPU usage. I'm not sure that this is bluestore that does
> it, but I'm seeing lower cpu usage when moving to bluestore + rocksdb
> compared to filestore + leveldb .
>
>
> On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
>  wrote:
> >
> > Thats expected from Ceph by design. But in our case, we are using all
> > recommendation like rack failure domain, replication n/w,etc, still
> > face client IO performance issues during one OSD down..
> >
> > On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> > >
> > > With a RACK failure domain, you should be able to have an entire rack 
> > > powered down without noticing any major impact on the clients.  I 
> > > regularly take down OSDs and nodes for maintenance and upgrades without 
> > > seeing any problems with client IO.
> > >
> > > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy 
> > >  wrote:
> > >>
> > >> Hello - I have a couple of questions on ceph cluster stability, even
> > >> we follow all recommendations as below:
> > >> - Having separate replication n/w and data n/w
> > >> - RACK is the failure domain
> > >> - Using SSDs for journals (1:4ratio)
> > >>
> > >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> > >> impacted.
> > >> Q2 - what is stability ratio, like with above, is ceph cluster
> > >> workable condition, if one osd down or one node down,etc.
> > >>
> > >> Thanks
> > >> Swami
> > >> ___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread Darius Kasparavičius
Hello,


Check your CPU usage when you are doing those kinds of operations. We
had a similar issue where our CPU monitoring was reporting a fine < 40%
usage, but the load on the nodes was high, mid 60-80. If possible,
try disabling HT and see the actual CPU usage.
If you are hitting CPU limits you can try disabling crc on messages.
ms_nocrc
ms_crc_data
ms_crc_header

And set all your debug messages to 0.
If you haven't done so already, you can also lower your recovery settings a little.
osd recovery max active
osd max backfills

You can also lower your file store threads.
filestore op threads


If you can, also switch from filestore to bluestore. This will also
lower your CPU usage. I'm not sure that it is bluestore itself that does
it, but I'm seeing lower CPU usage when moving to bluestore + rocksdb
compared to filestore + leveldb.
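
A sketch of how those could be applied at runtime (note that the messenger CRC
options may not take full effect until the OSDs are restarted):

    ceph tell osd.* injectargs '--ms_crc_data=false --ms_crc_header=false'
    ceph tell osd.* injectargs '--debug_ms=0/0 --debug_osd=0/0'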


On Wed, Feb 20, 2019 at 4:27 PM M Ranga Swami Reddy
 wrote:
>
> Thats expected from Ceph by design. But in our case, we are using all
> recommendation like rack failure domain, replication n/w,etc, still
> face client IO performance issues during one OSD down..
>
> On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
> >
> > With a RACK failure domain, you should be able to have an entire rack 
> > powered down without noticing any major impact on the clients.  I regularly 
> > take down OSDs and nodes for maintenance and upgrades without seeing any 
> > problems with client IO.
> >
> > On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy  
> > wrote:
> >>
> >> Hello - I have a couple of questions on ceph cluster stability, even
> >> we follow all recommendations as below:
> >> - Having separate replication n/w and data n/w
> >> - RACK is the failure domain
> >> - Using SSDs for journals (1:4ratio)
> >>
> >> Q1 - If one OSD down, cluster IO down drastically and customer Apps 
> >> impacted.
> >> Q2 - what is stability ratio, like with above, is ceph cluster
> >> workable condition, if one osd down or one node down,etc.
> >>
> >> Thanks
> >> Swami
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image format v1 EOL ...

2019-02-20 Thread Mykola Golub
On Wed, Feb 20, 2019 at 10:22:47AM +0100, Jan Kasprzak wrote:

>   If I read the parallel thread about pool migration in ceph-users@
> correctly, the ability to migrate to v2 would still require to stop the client
> before the "rbd migration prepare" can be executed.

Note that even if rbd supported live migration (without any downtime), you
would still need to restart the client after the upgrade to a new
librbd with migration support.

So actually you can combine the upgrade with migration:

  upgrade client library
  stop client
  rbd migration prepare
  start client

and eventually:

  rbd migration execute
  rbd migration commit

And it would be interesting to investigate the possibility of replacing the
"stop/start client" steps with "migrate the VM to another (upgraded)
host" to avoid stopping the VM at all. The trick would be to somehow execute
"rbd migration prepare" after the source VM closes the
image, but before the destination VM opens it.
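
For completeness, the sequence above would look roughly like this with the
Nautilus-era tooling (image and pool names are made up):

    rbd migration prepare rbd/vm-disk rbd/vm-disk-new   # run while the image is closed
    # start the client again, then later:
    rbd migration execute rbd/vm-disk-new
    rbd migration commit rbd/vm-disk-new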

-- 
Mykola Golub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Alfredo Deza
On Wed, Feb 20, 2019 at 10:21 AM Marco Gaiarin  wrote:
>
> Mandi! Alfredo Deza
>   In chel di` si favelave...
>
> > I think this is what happens with a non-gpt partition. GPT labels will
> > use a PARTUUID to identify the partition, and I just confirmed that
> > ceph-volume will enforce looking for PARTUUID if the JSON
> > identified a partition (vs. an LV).
> > From what I briefly researched it is not possible to add a GPT label
> > on a non-gpt partition without losing data.
>
> Ahem, how can i add a GPT label to a non-GPT partition (even loosing
> data)?

If you are coming from ceph-disk (or something else custom-made) and
don't care about losing data, why not fully migrate to the
new OSDs? 
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd
>
> Seems the culprit around my 'Proxmox 4.4, Ceph hammer, OSD cache
> link...' thread...
>
>
> Thanks.
>
> --
> dott. Marco Gaiarin GNUPG Key ID: 240A3D66
>   Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
>   Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
>   marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797
>
> Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
>   http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
> (cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Marco Gaiarin
Mandi! Alfredo Deza
  In chel di` si favelave...

> I think this is what happens with a non-gpt partition. GPT labels will
> use a PARTUUID to identify the partition, and I just confirmed that
> ceph-volume will enforce looking for PARTUUID if the JSON
> identified a partition (vs. an LV).
> From what I briefly researched it is not possible to add a GPT label
> on a non-gpt partition without losing data.

Ahem, how can I add a GPT label to a non-GPT partition (even losing
data)?

This seems to be the culprit behind my 'Proxmox 4.4, Ceph hammer, OSD cache
link...' thread...


Thanks.

-- 
dott. Marco Gaiarin GNUPG Key ID: 240A3D66
  Associazione ``La Nostra Famiglia''  http://www.lanostrafamiglia.it/
  Polo FVG   -   Via della Bontà, 7 - 33078   -   San Vito al Tagliamento (PN)
  marco.gaiarin(at)lanostrafamiglia.it   t +39-0434-842711   f +39-0434-842797

Dona il 5 PER MILLE a LA NOSTRA FAMIGLIA!
  http://www.lanostrafamiglia.it/index.php/it/sostienici/5x1000
(cf 00307430132, categoria ONLUS oppure RICERCA SANITARIA)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph cluster stability

2019-02-20 Thread M Ranga Swami Reddy
That's expected from Ceph by design. But in our case, we are following all the
recommendations, like a rack failure domain, a separate replication n/w, etc., and
we still face client IO performance issues while one OSD is down...

On Tue, Feb 19, 2019 at 10:56 PM David Turner  wrote:
>
> With a RACK failure domain, you should be able to have an entire rack powered 
> down without noticing any major impact on the clients.  I regularly take down 
> OSDs and nodes for maintenance and upgrades without seeing any problems with 
> client IO.
>
> On Tue, Feb 12, 2019 at 5:01 AM M Ranga Swami Reddy  
> wrote:
>>
>> Hello - I have a couple of questions on ceph cluster stability, even
>> we follow all recommendations as below:
>> - Having separate replication n/w and data n/w
>> - RACK is the failure domain
>> - Using SSDs for journals (1:4ratio)
>>
>> Q1 - If one OSD down, cluster IO down drastically and customer Apps impacted.
>> Q2 - what is stability ratio, like with above, is ceph cluster
>> workable condition, if one osd down or one node down,etc.
>>
>> Thanks
>> Swami
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Alfredo Deza
On Wed, Feb 20, 2019 at 8:40 AM Анатолий Фуников
 wrote:
>
> Thanks for the reply.
> blkid -s PARTUUID -o value /dev/sdf1 shows me nothing, but blkid /dev/sdf1 
> shows me this: /dev/sdf1: UUID="b03810e4-dcc1-46c2-bc31-a1e558904750" 
> TYPE="xfs"

I think this is what happens with a non-gpt partition. GPT labels will
use a PARTUUID to identify the partition, and I just confirmed that
ceph-volume will enforce looking for PARTUUID if the JSON
identified a partition (vs. an LV).

From what I briefly researched it is not possible to add a GPT label
on a non-gpt partition without losing data.

My suggestion (if you confirm it is not possible to add the GPT label)
is to start the migration towards the new way of creating OSDs

>
> Wed, Feb 20, 2019 at 16:27, Alfredo Deza :
>>
>> On Wed, Feb 20, 2019 at 8:16 AM Анатолий Фуников
>>  wrote:
>> >
>> > Hello. I need to raise the OSD on the node after reinstalling the OS, some 
>> > OSD were made a long time ago, not even a ceph-disk, but a set of scripts.
>> > There was an idea to get their configuration in json via ceph-volume 
>> > simple scan, and then on a fresh system I can make a ceph-volume simple 
>> > activate --file /etc/ceph/osd/31-46eacafe-22b6-4433-8e5c-e595612d8579.json
>> > I do ceph-volume simple scan /var/lib/ceph/osd/ceph-31/, and got this 
>> > json: https://pastebin.com/uJ8WVZyV
>> > It seems everything is not bad, but in the data section I see a direct 
>> > link to the device /dev/sdf1, and the uuid field is empty. At the same 
>> > time, in the /dev/disk/by-partuuid directory I can find and substitute 
>> > this UUID in this json, and delete the direct link to the device in this 
>> > json.
>> > The question is: how correct is it and can I raise this OSD on a freshly 
>> > installed OS with this fixed json?
>>
>> It worries me that it is unable to find a uuid for the device. This is
>> important because paths like /dev/sdf1 are ephemeral and can change
>> after a reboot. The uuid is found by running the following:
>>
>> blkid -s PARTUUID -o value /dev/sdf1
>>
>> If that is not returning anything, then ceph-volume will probably not
>> be able to ensure this device is brought up correctly. You can correct
>> or add to anything in the JSON after a scan and rely on that, but then
>> again
>> without a partuuid I don't think this will work nicely
>>
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
On osd.8, at 01:20 when the latency began to increase, I had a scrub running:

2019-02-20 01:16:08.851 7f84d24d9700  0 log_channel(cluster) log [DBG] : 5.52 
scrub starts
2019-02-20 01:17:18.019 7f84ce4d1700  0 log_channel(cluster) log [DBG] : 5.52 
scrub ok
2019-02-20 01:20:31.944 7f84f036e700  0 -- 10.5.0.106:6820/2900 >> 
10.5.0.79:0/2442367265 conn(0x7e120300 :6820 
s=STATE_ACCEPTING_WAIT_CONNECT_MSG_AUTH pgs=0 cs=0 l=1).handle_connect_msg 
accept replacing existing (lossy) channel (new one lossy=1)
2019-02-20 01:28:35.421 7f84d34db700  0 log_channel(cluster) log [DBG] : 5.c8 
scrub starts
2019-02-20 01:29:45.553 7f84cf4d3700  0 log_channel(cluster) log [DBG] : 5.c8 
scrub ok
2019-02-20 01:32:45.737 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 
scrub starts
2019-02-20 01:33:56.137 7f84d14d7700  0 log_channel(cluster) log [DBG] : 5.c4 
scrub ok


I'll try a test with scrubbing disabled (it currently runs at night, 
between 01:00 and 05:00).
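
(A simple way to run that test would be to flip the global scrub flags for a
night and watch the latency, e.g.:)

    ceph osd set noscrub
    ceph osd set nodeep-scrub
    # ...observe the 01:00-05:00 window...
    ceph osd unset noscrub
    ceph osd unset nodeep-scrub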

- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Sent: Wednesday, February 20, 2019 12:09:08
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Something interesting, 

when I have restarted osd.8 at 11:20, 

I'm seeing another osd.1 where latency is decreasing exactly at the same time. 
(without restart of this osd). 

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other are also going down for osd.1 at this time. 




- Original Message - 
From: "aderumier"  
To: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Sent: Wednesday, February 20, 2019 11:39:34 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi, 

I have hit the bug again, but this time only on 1 osd 

here some graphs: 
http://odisoweb1.odiso.net/osd8.png 

latency was good until 01:00 

Then I'm seeing nodes miss, bluestore onodes number is increasing (seem to be 
normal), 
after that latency is slowing increasing from 1ms to 3-5ms 

after osd restart, I'm between 0.7-1ms 


- Original Message - 
From: "aderumier"  
To: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Sent: Tuesday, February 19, 2019 17:03:58 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seem to be the case, I have compared with sub_op_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 


I have done some changes in my setup: 
- 2 osd by nvme (2x3TB by osd), with 6GB memory. (instead 1osd of 6TB with 12G 
memory). 
- disabling transparent hugepage 

Since 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free), is lower than before (48GB 
(8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


- Original Message - 
From: "Igor Fedotov"  
To: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Sent: Tuesday, February 19, 2019 11:12:43 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Original Message - 
> From: "Wido den Hollander"  
> To: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Sent: Friday, February 15, 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
 Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
 OSDs as well. Over time their latency increased until we started to 
 notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
 A restart fixed it. We also increased memory target from 4G to 6G on 
 these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 

Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Анатолий Фуников
Thanks for the reply.
blkid -s PARTUUID -o value /dev/sdf1 shows me nothing, but blkid /dev/sdf1
shows me this: /dev/sdf1: UUID="b03810e4-dcc1-46c2-bc31-a1e558904750"
TYPE="xfs"

Wed, Feb 20, 2019 at 16:27, Alfredo Deza :

> On Wed, Feb 20, 2019 at 8:16 AM Анатолий Фуников
>  wrote:
> >
> > Hello. I need to raise the OSD on the node after reinstalling the OS,
> some OSD were made a long time ago, not even a ceph-disk, but a set of
> scripts.
> > There was an idea to get their configuration in json via ceph-volume
> simple scan, and then on a fresh system I can make a ceph-volume simple
> activate --file /etc/ceph/osd/31-46eacafe-22b6-4433-8e5c-e595612d8579.json
> > I do ceph-volume simple scan /var/lib/ceph/osd/ceph-31/, and got this
> json: https://pastebin.com/uJ8WVZyV
> > It seems everything is not bad, but in the data section I see a direct
> link to the device /dev/sdf1, and the uuid field is empty. At the same
> time, in the /dev/disk/by-partuuid directory I can find and substitute this
> UUID in this json, and delete the direct link to the device in this json.
> > The question is: how correct is it and can I raise this OSD on a freshly
> installed OS with this fixed json?
>
> It worries me that it is unable to find a uuid for the device. This is
> important because paths like /dev/sdf1 are ephemeral and can change
> after a reboot. The uuid is found by running the following:
>
> blkid -s PARTUUID -o value /dev/sdf1
>
> If that is not returning anything, then ceph-volume will probably not
> be able to ensure this device is brought up correctly. You can correct
> or add to anything in the JSON after a scan and rely on that, but then
> again
> without a partuuid I don't think this will work nicely
>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Alfredo Deza
On Wed, Feb 20, 2019 at 8:16 AM Анатолий Фуников
 wrote:
>
> Hello. I need to raise the OSD on the node after reinstalling the OS, some 
> OSD were made a long time ago, not even a ceph-disk, but a set of scripts.
> There was an idea to get their configuration in json via ceph-volume simple 
> scan, and then on a fresh system I can make a ceph-volume simple activate 
> --file /etc/ceph/osd/31-46eacafe-22b6-4433-8e5c-e595612d8579.json
> I do ceph-volume simple scan /var/lib/ceph/osd/ceph-31/, and got this json: 
> https://pastebin.com/uJ8WVZyV
> It seems everything is not bad, but in the data section I see a direct link 
> to the device /dev/sdf1, and the uuid field is empty. At the same time, in 
> the /dev/disk/by-partuuid directory I can find and substitute this UUID in 
> this json, and delete the direct link to the device in this json.
> The question is: how correct is it and can I raise this OSD on a freshly 
> installed OS with this fixed json?

It worries me that it is unable to find a uuid for the device. This is
important because paths like /dev/sdf1 are ephemeral and can change
after a reboot. The uuid is found by running the following:

blkid -s PARTUUID -o value /dev/sdf1

If that is not returning anything, then ceph-volume will probably not
be able to ensure this device is brought up correctly. You can correct
or add to anything in the JSON after a scan and rely on that, but then
again
without a partuuid I don't think this will work nicely

> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD after OS reinstallation.

2019-02-20 Thread Анатолий Фуников
Hello. I need to bring up the OSDs on a node after reinstalling the OS; some
OSDs were created a long time ago, not even with ceph-disk, but with a set of scripts.
The idea was to get their configuration in JSON via ceph-volume simple
scan, and then on a fresh system run ceph-volume simple activate
--file /etc/ceph/osd/31-46eacafe-22b6-4433-8e5c-e595612d8579.json
I ran ceph-volume simple scan /var/lib/ceph/osd/ceph-31/ and got this JSON:
https://pastebin.com/uJ8WVZyV
Everything seems not bad, but in the data section I see a direct link
to the device /dev/sdf1, and the uuid field is empty. At the same time, I can
find this UUID in the /dev/disk/by-partuuid directory, substitute it into this
JSON, and delete the direct link to the device.
The question is: how correct is that, and can I bring up this OSD on a freshly
installed OS with this fixed JSON?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Access to cephfs from two different networks

2019-02-20 Thread Wido den Hollander


On 2/20/19 1:03 PM, Andrés Rojas Guerrero wrote:
> Hi all, sorry, we are newbies in Ceph and we have a newbie question
> about it. We have a Ceph cluster with three mon's and two public networks:
> 
> public network = 10.100.100.0/23,10.100.101.0/21
> 
> We have seen that ceph-mon are listen in only one of this network:
> 
> 
> tcp  0  0 10.100.100.9:6789  0.0.0.0:*  LISTEN  135385/ceph-mon
> 
> 
> but not in the other public network necessary to access to cephfs from
> this second network. We have seen that in principle that it's not
> possible?:
> 
> https://access.redhat.com/solutions/1463363
> 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019659.html
> 
> 
> and it's seems necessary to route the traffic from the second public
> network to the first public network, but seeing this our question is why
> you can define differents public networks in ceph if only you can access
> to only one from the ceph-mon daemons? perhaps you can configure the
> other mon's with the other public network?   Or is there anything we're
> not getting right?
> 
No, you can't.

Routed is really the way to go. It makes your life so much easier to have
routed networks than to juggle all these separate networks.
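
As far as I understand it, each mon binds to the single address stored in the
monmap (mon host / mon addr), so listing several subnets under "public network"
only tells daemons which networks they may pick a bind address from; it does not
make a mon listen on every subnet. A rough sketch with the addresses from the
original post (the extra mon addresses are made up):

    [global]
    public network = 10.100.100.0/23, 10.100.101.0/21
    mon host = 10.100.100.9, 10.100.100.10, 10.100.100.11

So reaching the mons from the second subnet really comes down to routing between
the two networks.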

Wido

> Thank's in advance.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Access to cephfs from two different networks

2019-02-20 Thread Andrés Rojas Guerrero
Hi all, sorry, we are newbies in Ceph and we have a newbie question 
about it. We have a Ceph cluster with three mon's and two public networks:


public network = 10.100.100.0/23,10.100.101.0/21

We have seen that ceph-mon are listen in only one of this network:


tcp  0  0 10.100.100.9:6789  0.0.0.0:*  LISTEN  135385/ceph-mon


but not on the other public network, which is necessary to access CephFS from 
this second network. We have seen that in principle this is not
possible:

https://access.redhat.com/solutions/1463363

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-July/019659.html


and it seems necessary to route the traffic from the second public 
network to the first public network. Seeing this, our question is: why 
can you define different public networks in Ceph if you can reach the 
ceph-mon daemons from only one of them? Perhaps you can configure the 
other mons with the other public network? Or is there anything we're 
not getting right?


Thanks in advance.
--
***
Andrés Rojas Guerrero
Unidad Sistemas Linux
Area Arquitectura Tecnológica
Secretaría General Adjunta de Informática
Consejo Superior de Investigaciones Científicas (CSIC)
***
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
Something interesting:

when I restarted osd.8 at 11:20,

I saw another osd (osd.1) where latency decreased at exactly the same time 
(without a restart of that osd).

http://odisoweb1.odiso.net/osd1.png 

onodes and cache_other are also going down for osd.1 at this time. 




- Original Message -
From: "aderumier" 
To: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Sent: Wednesday, February 20, 2019 11:39:34
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

Hi, 

I have hit the bug again, but this time only on 1 osd 

here some graphs: 
http://odisoweb1.odiso.net/osd8.png 

latency was good until 01:00 

Then I'm seeing nodes miss, bluestore onodes number is increasing (seem to be 
normal), 
after that latency is slowing increasing from 1ms to 3-5ms 

after osd restart, I'm between 0.7-1ms 


- Original Message - 
From: "aderumier"  
To: "Igor Fedotov"  
Cc: "ceph-users" , "ceph-devel" 
 
Sent: Tuesday, February 19, 2019 17:03:58 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seem to be the case, I have compared with sub_op_latency. 

I have changed my graph, to clearly identify the osd where the latency is high. 


I have done some changes in my setup: 
- 2 osd by nvme (2x3TB by osd), with 6GB memory. (instead 1osd of 6TB with 12G 
memory). 
- disabling transparent hugepage 

Since 24h, latencies are still low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free), is lower than before (48GB 
(8osd x 6GB) vs 56GB (4osd x 12GB). 

I'll send more stats tomorrow. 

Alexandre 


- Original Message - 
From: "Igor Fedotov"  
To: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Sent: Tuesday, February 19, 2019 11:12:43 
Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexander, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> runnigh with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I send results monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also notice, that sometime when I have bad latency on an osd on node1 
> (restarted 12h ago for example). 
> (op_w_process_latency). 
> 
> If I restart osds on other nodes (last restart some days ago, so with bigger 
> latency), it's reducing latency on osd of node1 too. 
> 
> does "op_w_process_latency" counter include replication time ? 
> 
> - Original Message - 
> From: "Wido den Hollander"  
> To: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Sent: Friday, February 15, 2019 14:59:30 
> Subject: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
 Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
 OSDs as well. Over time their latency increased until we started to 
 notice I/O-wait inside VMs. 
>> I'm also notice it in the vms. BTW, what it your nvme disk size ? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
 A restart fixed it. We also increased memory target from 4G to 6G on 
 these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 osds of 3TB for 6TB nvme. 
>> (my last test was 8gb with 1osd of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> runnigh with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Mail original - 
>> De: "Wido den Hollander"  
>> À: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 14:50:34 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour 
>>> is different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't 
>>> see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with 

Re: [ceph-users] [Bluestore] Some of my osd's uses BlueFS slow storage for db - why?

2019-02-20 Thread Igor Fedotov

You're right - WAL/DB expansion capability is present in Luminous+ releases.

But David meant the volume migration functionality, which appeared in Nautilus, see:

https://github.com/ceph/ceph/pull/23103


Thanks,

Igor

On 2/20/2019 9:22 AM, Konstantin Shalygin wrote:

On 2/19/19 11:46 PM, David Turner wrote:
I don't know that there's anything that can be done to resolve this 
yet without rebuilding the OSD. Based on a Nautilus tool being able 
to resize the DB device, I'm assuming that Nautilus is also capable 
of migrating the DB/WAL between devices.  That functionality would 
allow anyone to migrate their DB back off of their spinner which is 
what's happening to you.  I don't believe that sort of tooling exists 
yet, though, without compiling the Nautilus Beta tooling for yourself.


I think you are wrong there: initially the bluestore tool could expand only 
WAL/DB devices [1]. With the latest releases of Mimic and Luminous this 
should work fine.


Only the master branch has received the feature for expanding the main device [2].



[1] 
https://github.com/ceph/ceph/commit/2184e3077caa9de5f21cc901d26f6ecfb76de9e1


[2] 
https://github.com/ceph/ceph/commit/d07c10dfc02e4cdeda288bf39b8060b10da5bbf9
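
As a rough illustration of the DB/WAL expansion path from [1] (the OSD id, volume 
group and size below are made-up placeholders; check the docs for your exact 
release before relying on this):

systemctl stop ceph-osd@12                      # stop the OSD first
lvextend -L +20G /dev/vg-db/db-osd12            # grow the underlying DB volume
ceph-bluestore-tool bluefs-bdev-expand --path /var/lib/ceph/osd/ceph-12
systemctl start ceph-osd@12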


k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd commit latency increase over time, until restart

2019-02-20 Thread Alexandre DERUMIER
Hi,

I have hit the bug again, but this time only on 1 osd

here are some graphs:
http://odisoweb1.odiso.net/osd8.png

latency was good until 01:00

Then I'm seeing onode misses; the bluestore onode count is increasing (which seems 
to be normal), and after that latency slowly increases from 1ms to 3-5ms.

After an OSD restart, I'm back between 0.7-1ms.


- Mail original -
De: "aderumier" 
À: "Igor Fedotov" 
Cc: "ceph-users" , "ceph-devel" 

Envoyé: Mardi 19 Février 2019 17:03:58
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart

>>I think op_w_process_latency includes replication times, not 100% sure 
>>though. 
>> 
>>So restarting other nodes might affect latencies at this specific OSD. 

Seems to be the case; I have compared with sub_op_latency. 

I have changed my graph to clearly identify the OSD where the latency is high. 


I have made some changes in my setup: 
- 2 OSDs per NVMe (2x3TB per OSD), with 6GB memory each (instead of 1 OSD of 6TB 
with 12GB memory) 
- disabled transparent hugepages 

For the last 24h, latencies have stayed low (between 0.7-1.2ms). 

I'm also seeing that total memory used (#free) is lower than before: 48GB 
(8 OSDs x 6GB) vs 56GB (4 OSDs x 12GB). 

I'll send more stats tomorrow. 

Alexandre 
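
For completeness, a sketch of the two host-side changes mentioned above (the 6GB 
value and the sysfs paths are examples, not a recommendation):

# per-OSD memory limit in ceph.conf, [osd] section, then restart the OSDs
#   osd_memory_target = 6442450944        # 6 GB
# disable transparent hugepages until the next reboot
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag
# to make it persistent, add transparent_hugepage=never to the kernel cmdline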


- Mail original - 
De: "Igor Fedotov"  
À: "Alexandre Derumier" , "Wido den Hollander" 
 
Cc: "ceph-users" , "ceph-devel" 
 
Envoyé: Mardi 19 Février 2019 11:12:43 
Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
restart 

Hi Alexandre, 

I think op_w_process_latency includes replication times, not 100% sure 
though. 

So restarting other nodes might affect latencies at this specific OSD. 


Thanks, 

Igor 

On 2/16/2019 11:29 AM, Alexandre DERUMIER wrote: 
>>> There are 10 OSDs in these systems with 96GB of memory in total. We are 
>>> running with memory target on 6G right now to make sure there is no 
>>> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
>>> so it will max out on 80GB leaving 16GB as spare. 
> Thanks Wido. I'll send results on Monday with my increased memory 
> 
> 
> 
> @Igor: 
> 
> I have also noticed that sometimes I have bad latency on an OSD on node1 
> (restarted 12h ago, for example) 
> (op_w_process_latency). 
> 
> If I restart OSDs on other nodes (last restarted some days ago, so with higher 
> latency), it reduces the latency on the node1 OSD too. 
> 
> does the "op_w_process_latency" counter include replication time ? 
> 
> - Mail original - 
> De: "Wido den Hollander"  
> À: "aderumier"  
> Cc: "Igor Fedotov" , "ceph-users" 
> , "ceph-devel"  
> Envoyé: Vendredi 15 Février 2019 14:59:30 
> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
> restart 
> 
> On 2/15/19 2:54 PM, Alexandre DERUMIER wrote: 
 Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
 OSDs as well. Over time their latency increased until we started to 
 notice I/O-wait inside VMs. 
>> I'm also noticing it in the VMs. BTW, what is your nvme disk size? 
> Samsung PM983 3.84TB SSDs in both clusters. 
> 
>> 
 A restart fixed it. We also increased memory target from 4G to 6G on 
 these OSDs as the memory would allow it. 
>> I have set memory to 6GB this morning, with 2 OSDs of 3TB for each 6TB NVMe. 
>> (my last test was 8GB with 1 OSD of 6TB, but that didn't help) 
> There are 10 OSDs in these systems with 96GB of memory in total. We are 
> running with memory target on 6G right now to make sure there is no 
> leakage. If this runs fine for a longer period we will go to 8GB per OSD 
> so it will max out on 80GB leaving 16GB as spare. 
> 
> As these OSDs were all restarted earlier this week I can't tell how it 
> will hold up over a longer period. Monitoring (Zabbix) shows the latency 
> is fine at the moment. 
> 
> Wido 
> 
>> 
>> - Mail original - 
>> De: "Wido den Hollander"  
>> À: "Alexandre Derumier" , "Igor Fedotov" 
>>  
>> Cc: "ceph-users" , "ceph-devel" 
>>  
>> Envoyé: Vendredi 15 Février 2019 14:50:34 
>> Objet: Re: [ceph-users] ceph osd commit latency increase over time, until 
>> restart 
>> 
>> On 2/15/19 2:31 PM, Alexandre DERUMIER wrote: 
>>> Thanks Igor. 
>>> 
>>> I'll try to create multiple OSDs per NVMe disk (6TB) to see if the behaviour 
>>> is different. 
>>> 
>>> I have other clusters (same ceph.conf), but with 1.6TB drives, and I don't 
>>> see this latency problem. 
>>> 
>>> 
>> Just wanted to chime in, I've seen this with Luminous+BlueStore+NVMe 
>> OSDs as well. Over time their latency increased until we started to 
>> notice I/O-wait inside VMs. 
>> 
>> A restart fixed it. We also increased memory target from 4G to 6G on 
>> these OSDs as the memory would allow it. 
>> 
>> But we noticed this on two different 12.2.10/11 clusters. 
>> 
>> A restart made the latency drop. Not only the numbers, but the 
>> real-world latency as experienced by a VM as well. 
>> 
>> Wido 
>> 
>>> 
>>> 
>>> 
>>> 
>>> - Mail original - 
>>> De: "Igor Fedotov"  
>>> 

Re: [ceph-users] min_size vs. K in erasure coded pools

2019-02-20 Thread Eugen Block

Hi,

I see that as a security feature ;-)
You can prevent data loss as long as k chunks are intact, but you don't want 
to operate with only the minimum required number of chunks. In a disaster 
scenario you can reduce min_size to k temporarily, but the main goal 
should always be to get the OSDs back up.
For example, in a replicated pool with size 3 we set min_size to 2, not 1, 
although 1 would also work if everything is healthy. But it's risky, since 
there's also a chance that two corrupt PGs overwrite a healthy PG.
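
To illustrate the "temporarily" part, assuming a hypothetical EC pool named ecpool 
created with k=4, m=2 (so min_size defaults to 5):

ceph osd pool get ecpool min_size      # 5, i.e. k+1
ceph osd pool set ecpool min_size 4    # disaster mode: accept I/O with only k chunks
# ... recover or replace the failed OSDs ...
ceph osd pool set ecpool min_size 5    # restore the safety margin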


Regards,
Eugen


Zitat von "Clausen, Jörn" :


Hi!

While trying to understand erasure coded pools, I would have  
expected that "min_size" of a pool is equal to the "K" parameter.  
But it turns out that it is always K+1.


Isn't the description of erasure coding misleading then? In a K+M  
setup, I would expect to be good (in the sense of "no service  
impact") even if M OSDs are lost. But in reality, my clients would  
already experience an impact once M OSDs are lost; only M-1 losses  
are tolerated without impact. This means you should always plan for  
one more spare than you would in e.g. a classic RAID setup, right?


Joern

--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] min_size vs. K in erasure coded pools

2019-02-20 Thread Clausen , Jörn

Hi!

While trying to understand erasure coded pools, I would have expected 
that "min_size" of a pool is equal to the "K" parameter. But it turns 
out that it is always K+1.


Isn't the description of erasure coding misleading then? In a K+M setup, 
I would expect to be good (in the sense of "no service impact") even if 
M OSDs are lost. But in reality, my clients would already experience an 
impact once M OSDs are lost; only M-1 losses are tolerated without impact. 
This means you should always plan for one more spare than you would in 
e.g. a classic RAID setup, right?
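
A quick way to see this behaviour on a test cluster (the profile and pool names 
here are arbitrary examples):

ceph osd erasure-code-profile set ec42 k=4 m=2 crush-failure-domain=host
ceph osd pool create ecpool 32 32 erasure ec42
ceph osd pool get ecpool min_size      # returns 5 (k+1), not 4 (k)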


Joern

--
Jörn Clausen
Daten- und Rechenzentrum
GEOMAR Helmholtz-Zentrum für Ozeanforschung Kiel
Düsternbrookerweg 20
24105 Kiel



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image format v1 EOL ...

2019-02-20 Thread Jan Kasprzak
Hello,

Jason Dillaman wrote:
: For the future Ceph Octopus release, I would like to remove all
: remaining support for RBD image format v1 images barring any
: substantial pushback.
: 
: The image format for new images has been defaulted to the v2 image
: format since Infernalis, the v1 format was officially deprecated in
: Jewel, and creation of new v1 images was prohibited starting with
: Mimic.
: 
: The forthcoming Nautilus release will add a new image migration
: feature to help provide a low-impact conversion path forward for any
: legacy images in a cluster. The ability to migrate existing images off
: the v1 image format was the last known pain point that was highlighted
: the previous time I suggested removing support.
: 
: Please let me know if anyone has any major objections or concerns.

If I read the parallel thread about pool migration in ceph-users@
correctly, migrating to v2 would still require stopping the client
before "rbd migration prepare" can be executed.

On my OpenNebula/Ceph cluster, I still have several tens of images
in v1 format, so it would be moderately painful to figure out which VMs
are using them, how availability-critical they are, and finally to migrate
the images.

But whatever, I guess I can cope with it :-)
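
For anyone in the same situation, the Nautilus workflow referred to above looks 
roughly like this (pool/image names are placeholders; the prepare step is the one 
that needs the client stopped):

rbd info mypool/myimage | grep format    # "format: 1" marks a v1 image
rbd migration prepare mypool/myimage     # convert in place to the v2 format
rbd migration execute mypool/myimage     # copy data in the background
rbd migration commit mypool/myimage      # finalize and drop the old image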

-Yenya

-- 
| Jan "Yenya" Kasprzak  |
| http://www.fi.muni.cz/~kas/ GPG: 4096R/A45477D5 |
 This is the world we live in: the way to deal with computers is to google
 the symptoms, and hope that you don't have to watch a video. --P. Zaitcev
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to change/enable/activate a different osd_memory_target value

2019-02-20 Thread Konstantin Shalygin

we ran into some OSD node freezes with out of memory and all swap eaten too. 
Until we get more physical RAM I'd like to reduce the osd_memory_target, but 
can't find where and how to set it.

We have 24 bluestore Disks in 64 GB centos nodes with Luminous v12.2.11
Just set a value for `osd_memory_target` in your ceph.conf and restart 
your OSDs (`systemctl restart ceph-osd.target` restarts all OSD 
daemons on the host).
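
A minimal example of what that looks like (2 GB here is only an illustration; pick 
a value that actually fits 24 OSDs into 64 GB of RAM):

# /etc/ceph/ceph.conf
[osd]
osd_memory_target = 2147483648    # 2 GB per OSD daemon, below the 4 GB default

# then on each OSD host:
systemctl restart ceph-osd.target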




k

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com