Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Dejan Lesjak

> On 17. okt. 2017, at 00:59, Gregory Farnum  wrote:
> 
> On Mon, Oct 16, 2017 at 3:49 PM Dejan Lesjak  wrote:
> 
> > On 17. okt. 2017, at 00:23, Gregory Farnum  wrote:
> >
> > On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak  wrote:
> > On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > > Hi,
> > >
> > > During rather high load and rebalancing, a couple of our OSDs crashed
> > > and they fail to start. This is from the log:
> > >
> > > -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
> > > opened 370 pgs
> > > -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> > > build_past_intervals_parallel over 439159-439159
> > >  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> > > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > > In function 'void OSD::build_past_intervals_parallel()' thread
> > > 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > > 4177: FAILED assert(p.same_interval_since)
> > >
> > >  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> > > (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x102) [0x55e4caa18592]
> > >  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> > >  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> > >  4: (OSD::init()+0x2227) [0x55e4ca467327]
> > >  5: (main()+0x2d5a) [0x55e4ca379b1a]
> > >  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> > >  7: (_start()+0x2a) [0x55e4ca4039aa]
> > >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > > needed to interpret this.
> > >
> > > Does anybody know how to fix or further debug this?
> >
> > Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
> > From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
> > 10.1fce. Yet pg map doesn't show osd.1 for this pg:
> >
> > # ceph pg map 10.1fce
> > osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
> > [110,213,132,182]
> >
> > Hmm, this is odd. What caused your rebalancing exactly? Can you turn on the 
> > OSD with debugging set to 20, and then upload the log file using 
> > ceph-post-file?
> >
> > The specific assert you're hitting here is supposed to cope with PGs that 
> > have been imported (via the ceph-objectstore-tool). But obviously something 
> > has gone wrong here.
> 
> It started when we bumped the number of PGs for a pool (from 2048 to 8192).
> I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba
> 
> It actually seems similar to http://tracker.ceph.com/issues/21142 in that
> the pg found in the log seems empty when checked with ceph-objectstore-tool,
> and removing it allows the osd to start. At least on one osd; I haven't tried
> that yet on all of the failed ones.
> 
> Ah. I bet we are default-constructing the "child" PGs from split with that 
> value set to zero, so it's incorrectly being flagged for later use. David, 
> does that make sense to you? Do you think it's reasonable to fix it by just 
> checking for other default-initialized values as part of that branch check?
> (I note that this code got removed once Luminous branched, so hopefully 
> there's a simple fix we can apply!)
> 
> Dejan, did you make sure the OSD you tried that on has re-created the removed 
> PG and populated it with data? If so I think you ought to be fine removing 
> any empty PGs which are causing this assert.

Well, after a while the pg apparently does get recreated on the osd, but
unfortunately the assert then happens again.



Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread David Zafman


I don't see same_interval_since being cleared by split.
PG::split_into() copies the history from the parent PG to the child. The
only code in Luminous that I see clearing it is in
ceph_objectstore_tool.cc.


David


On 10/16/17 3:59 PM, Gregory Farnum wrote:

On Mon, Oct 16, 2017 at 3:49 PM Dejan Lesjak  wrote:


On 17. okt. 2017, at 00:23, Gregory Farnum  wrote:

On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak 

wrote:

On 10/16/2017 02:02 PM, Dejan Lesjak wrote:

Hi,

During rather high load and rebalancing, a couple of our OSDs crashed
and they fail to start. This is from the log:

 -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
opened 370 pgs
 -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
build_past_intervals_parallel over 439159-439159
  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
In function 'void OSD::build_past_intervals_parallel()' thread
7f5e4c3bae80 time 2017-10-16 13:27:50.260062
/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
4177: FAILED assert(p.same_interval_since)

  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
(stable)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55e4caa18592]
  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
  4: (OSD::init()+0x2227) [0x55e4ca467327]
  5: (main()+0x2d5a) [0x55e4ca379b1a]
  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
  7: (_start()+0x2a) [0x55e4ca4039aa]
  NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Does anybody know how to fix or further debug this?

Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
 From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
10.1fce. Yet pg map doesn't show osd.1 for this pg:

# ceph pg map 10.1fce
osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
[110,213,132,182]

Hmm, this is odd. What caused your rebalancing exactly? Can you turn on

the OSD with debugging set to 20, and then upload the log file using
ceph-post-file?

The specific assert you're hitting here is supposed to cope with PGs

that have been imported (via the ceph-objectstore-tool). But obviously
something has gone wrong here.

It started when we bumped the number of PGs for a pool (from 2048 to 8192).
I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba

It actually seems similar to http://tracker.ceph.com/issues/21142 in
that the pg found in the log seems empty when checked with ceph-objectstore-tool,
and removing it allows the osd to start. At least on one osd; I haven't
tried that yet on all of the failed ones.


Ah. I bet we are default-constructing the "child" PGs from split with that
value set to zero, so it's incorrectly being flagged for later use. David,
does that make sense to you? Do you think it's reasonable to fix it by just
checking for other default-initialized values as part of that branch check?
(I note that this code got removed once Luminous branched, so hopefully
there's a simple fix we can apply!)

Dejan, did you make sure the OSD you tried that on has re-created the
removed PG and populated it with data? If so I think you ought to be fine
removing any empty PGs which are causing this assert.
-Greg





Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-16 Thread Christian Wuerdig
Maybe an additional example where the numbers don't line up quite so
nicely would be good as well. For example, it's not immediately obvious
to me what would happen with the stripe settings given in your example
if you write 97M of data.
Would it be 4 objects of 24M and 4 objects of 250KB? Or will the last
4 objects be artificially padded (with 0's) to meet the stripe_unit?
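
My mental model of the RAID-0 analogy, as a quick bash calculation that walks
the data 1M at a time (this is purely my guess at the semantics, so please
correct me if it's wrong):

total=97; unit=8; count=4; objsize=24        # sizes in MB
set_span=$((objsize * count))                # 96M of data per object set
declare -A obj
for ((off = 0; off < total; off++)); do
  oset=$((off / set_span))                   # which object set this MB lands in
  in_set=$((off % set_span))                 # offset within that set, in MB
  idx=$((oset * count + (in_set / unit) % count))
  obj[$idx]=$((${obj[$idx]:-0} + 1))         # count MB written to that object
done
for i in $(echo "${!obj[@]}" | tr ' ' '\n' | sort -n); do
  echo "object $i: ${obj[$i]}M"
done

That model gives 4 objects of 24M plus a single 1M fifth object, with no
padding, but that's exactly the kind of detail the docs should spell out.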



On Tue, Oct 17, 2017 at 12:35 PM, Alexander Kushnirenko
 wrote:
> Hi, Gregory, Ian!
>
> There is very little information on striper mode in Ceph documentation.
> Could this explanation help?
>
> The logic of striper mode is very much the same as in RAID-0.  There are 3
> parameters that drive it:
>
> stripe_unit - the stripe size  (default=4M)
> stripe_count - how many objects to write in parallel (default=1)
> object_size  - when to stop increasing object size and create new objects.
> (default =4M)
>
> For example if you write 128M of data (128 consecutive pieces of data 1M
> each) in striped mode with the following parameters:
> stripe_unit = 8M
> stripe_count = 4
> object_size = 24M
> Then 8 objects will be created - 4 objects with 24M size and 4 objects with
> 8M size.
>
> Obj1=24M    Obj2=24M    Obj3=24M    Obj4=24M
> 00 .. 07    08 .. 0f    10 .. 17    18 .. 1f   <-- consecutive 1M pieces of data
> 20 .. 27    28 .. 2f    30 .. 37    38 .. 3f
> 40 .. 47    48 .. 4f    50 .. 57    58 .. 5f
>
> Obj5=8M     Obj6=8M     Obj7=8M     Obj8=8M
> 60 .. 67    68 .. 6f    70 .. 77    78 .. 7f
>
> Alexander.
>
>
>
>
> On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko
>  wrote:
>>
> >> Oh!  I put in a wrong link, sorry. The picture which explains stripe_unit and
> >> stripe_count is here:
>>
>>
>> https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf
>>
>> I tried to attach it in the mail, but it was blocked.
>>
>>
>> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko
>>  wrote:
>>>
>>> Hi, Ian!
>>>
>>> Thank you for your reference!
>>>
>>> Could you comment on the following rule:
>>> object_size = stripe_unit * stripe_count
> >>> Or is that not necessarily the case?
>>>
>>> I refer to page 8 in this report:
>>>
>>>
>>> https://indico.cern.ch/event/531810/contributions/2298934/attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
>>>
>>>
>>> Alexander.
>>>
>>> On Wed, Oct 11, 2017 at 1:11 PM,  wrote:

 Hi Gregory

 You’re right, when setting the object layout in libradosstriper, one
 should set all three parameters (the number of stripes, the size of the
 stripe unit, and the size of the striped object). The Ceph plugin for
 GridFTP has an example of this at
 https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371



 At RAL, we use the following values:



 $STRIPER_NUM_STRIPES 1

 $STRIPER_STRIPE_UNIT 8388608

 $STRIPER_OBJECT_SIZE 67108864



 Regards,



 Ian Johnson MBCS

 Data Services Group

 Scientific Computing Department

 Rutherford Appleton Laboratory





>>>
>>
>
>


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-16 Thread Gregory Farnum
That looks right to me... PRs for the Ceph docs are welcome! :)

On Mon, Oct 16, 2017 at 4:35 PM Alexander Kushnirenko 
wrote:

> Hi, Gregory, Ian!
>
> There is very little information on striper mode in Ceph documentation.
> Could this explanation help?
>
> The logic of striper mode is very much the same as in RAID-0.  There are 3
> parameters that drive it:
>
> stripe_unit - the stripe size  (default=4M)
> stripe_count - how many objects to write in parallel (default=1)
> object_size  - when to stop increasing object size and create new objects.
>  (default =4M)
>
> For example if you write 128M of data (128 consecutive pieces of data 1M
> each) in striped mode with the following parameters:
> stripe_unit = 8M
> stripe_count = 4
> object_size = 24M
> Then 8 objects will be created - 4 objects with 24M size and 4 objects
> with 8M size.
>
> Obj1=24M    Obj2=24M    Obj3=24M    Obj4=24M
> 00 .. 07    08 .. 0f    10 .. 17    18 .. 1f   <-- consecutive 1M pieces of data
> 20 .. 27    28 .. 2f    30 .. 37    38 .. 3f
> 40 .. 47    48 .. 4f    50 .. 57    58 .. 5f
>
> Obj5=8M     Obj6=8M     Obj7=8M     Obj8=8M
> 60 .. 67    68 .. 6f    70 .. 77    78 .. 7f
>
> Alexander.
>
>
>
>
> On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko <
> kushnire...@gmail.com> wrote:
>
>> Oh!  I put in a wrong link, sorry. The picture which explains stripe_unit
>> and stripe_count is here:
>>
>>
>> https://indico.cern.ch/event/330212/contributions/1718786/attachments/642384/883834/CephPluginForXroot.pdf
>>
>> I tried to attach it in the mail, but it was blocked.
>>
>>
>> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko <
>> kushnire...@gmail.com> wrote:
>>
>>> Hi, Ian!
>>>
>>> Thank you for your reference!
>>>
>>> Could you comment on the following rule:
>>> object_size = stripe_unit * stripe_count
>>> Or is that not necessarily the case?
>>>
>>> I refer to page 8 in this report:
>>>
>>>
>>> https://indico.cern.ch/event/531810/contributions/2298934/attachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
>>>
>>>
>>> Alexander.
>>>
>>> On Wed, Oct 11, 2017 at 1:11 PM,  wrote:
>>>
 Hi Gregory

 You’re right, when setting the object layout in libradosstriper, one
 should set all three parameters (the number of stripes, the size of the
 stripe unit, and the size of the striped object). The Ceph plugin for
 GridFTP has an example of this at
 https://github.com/stfc/gridFTPCephPlugin/blob/master/ceph_posix.cpp#L371



 At RAL, we use the following values:



 $STRIPER_NUM_STRIPES 1

 $STRIPER_STRIPE_UNIT 8388608

 $STRIPER_OBJECT_SIZE 67108864



 Regards,



 Ian Johnson MBCS

 Data Services Group

 Scientific Computing Department

 Rutherford Appleton Laboratory





>>>
>>
>


Re: [ceph-users] Bareos and libradosstriper works only for 4M sripe_unit size

2017-10-16 Thread Alexander Kushnirenko
Hi, Gregory, Ian!

There is very little information on striper mode in Ceph documentation.
Could this explanation help?

The logic of striper mode is very much the same as in RAID-0.  There are 3
parameters that drive it:

stripe_unit - the stripe size  (default=4M)
stripe_count - how many objects to write in parallel (default=1)
object_size  - when to stop increasing object size and create new objects.
 (default =4M)

For example if you write 128M of data (128 consecutive pieces of data 1M
each) in striped mode with the following parameters:
stripe_unit = 8M
stripe_count = 4
object_size = 24M
Then 8 objects will be created - 4 objects with 24M size and 4 objects with
8M size.

Obj1=24M    Obj2=24M    Obj3=24M    Obj4=24M
00 .. 07    08 .. 0f    10 .. 17    18 .. 1f   <-- consecutive 1M pieces of data
20 .. 27    28 .. 2f    30 .. 37    38 .. 3f
40 .. 47    48 .. 4f    50 .. 57    58 .. 5f

Obj5=8M     Obj6=8M     Obj7=8M     Obj8=8M
60 .. 67    68 .. 6f    70 .. 77    78 .. 7f

Alexander.
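
P.S. If it helps when checking a Bareos setup: from memory, libradosstriper
records these three layout parameters as xattrs on the first chunk object (the
one whose name ends in ".0000000000000000"), so you can inspect an existing
striped object with something like the commands below. The pool and object
names here are made up, and the xattr names are from memory, so they may be
slightly off:

rados -p backup listxattr volume0001.0000000000000000
rados -p backup getxattr volume0001.0000000000000000 striper.layout.stripe_unit
rados -p backup getxattr volume0001.0000000000000000 striper.layout.stripe_count
rados -p backup getxattr volume0001.0000000000000000 striper.layout.object_size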




On Wed, Oct 11, 2017 at 3:19 PM, Alexander Kushnirenko <
kushnire...@gmail.com> wrote:

> Oh!  I put in a wrong link, sorry. The picture which explains stripe_unit and
> stripe_count is here:
>
> https://indico.cern.ch/event/330212/contributions/1718786/at
> tachments/642384/883834/CephPluginForXroot.pdf
>
> I tried to attach it in the mail, but it was blocked.
>
>
> On Wed, Oct 11, 2017 at 3:16 PM, Alexander Kushnirenko <
> kushnire...@gmail.com> wrote:
>
>> Hi, Ian!
>>
>> Thank you for your reference!
>>
>> Could you comment on the following rule:
>> object_size = stripe_unit * stripe_count
>> Or is that not necessarily the case?
>>
>> I refer to page 8 in this report:
>>
>> https://indico.cern.ch/event/531810/contributions/2298934/at
>> tachments/1358128/2053937/Ceph-Experience-at-RAL-final.pdf
>>
>>
>> Alexander.
>>
>> On Wed, Oct 11, 2017 at 1:11 PM,  wrote:
>>
>>> Hi Gregory
>>>
>>> You’re right, when setting the object layout in libradosstriper, one
>>> should set all three parameters (the number of stripes, the size of the
>>> stripe unit, and the size of the striped object). The Ceph plugin for
>>> GridFTP has an example of this at https://github.com/stfc/gridFT
>>> PCephPlugin/blob/master/ceph_posix.cpp#L371
>>>
>>>
>>>
>>> At RAL, we use the following values:
>>>
>>>
>>>
>>> $STRIPER_NUM_STRIPES 1
>>>
>>> $STRIPER_STRIPE_UNIT 8388608
>>>
>>> $STRIPER_OBJECT_SIZE 67108864
>>>
>>>
>>>
>>> Regards,
>>>
>>>
>>>
>>> Ian Johnson MBCS
>>>
>>> Data Services Group
>>>
>>> Scientific Computing Department
>>>
>>> Rutherford Appleton Laboratory
>>>
>>>
>>>
>>>
>>>
>>
>


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-16 Thread Christian Balzer
On Mon, 16 Oct 2017 18:32:06 -0400 (EDT) Anthony Verevkin wrote:

> > From: "Sage Weil" 
> > To: "Alfredo Deza" 
> > Cc: "ceph-devel" , ceph-users@lists.ceph.com
> > Sent: Monday, October 9, 2017 11:09:29 AM
> > Subject: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and 
> > disk partition support]
> > 
> > To put this in context, the goal here is to kill ceph-disk in mimic.
> >   
> 
>  
> > Perhaps the "out" here is to support a "dir" option where the user
> > can
> > manually provision and mount an OSD on /var/lib/ceph/osd/*, with
> > 'journal'
> > or 'block' symlinks, and ceph-volume will do the last bits that
> > initialize
> > the filestore or bluestore OSD from there.  Then if someone has a
> > scenario
> > that isn't captured by LVM (or whatever else we support) they can
> > always
> > do it manually?
> >   
> 
> 
> In fact, now that bluestore only requires a few small files and symlinks to 
> remain in /var/lib/ceph/osd/* without the extra requirements for xattrs 
> support and xfs, why not simply leave those folders on OS root filesystem and 
> only point symlinks to bluestore block and db devices? That would simplify 
> the osd deployment so much - and the symlinks can then point to 
> /dev/disk/by-uuid or by-path or lvm path or whatever. The only downside for 
> this approach that I see is that disks themselves would no longer be 
> transferable between the hosts as those few files that describe the OSD are 
> no longer on the disk itself.
> 

If the OS is on a RAID1, the chances of those files being lost entirely are
very much reduced, so moving OSDs to another host becomes a trivial
exercise, one would assume.

But yeah, this sounds fine to me, as it's extremely flexible.

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Rakuten Communications


Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Gregory Farnum
On Mon, Oct 16, 2017 at 3:49 PM Dejan Lesjak  wrote:

>
> > On 17. okt. 2017, at 00:23, Gregory Farnum  wrote:
> >
> > On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak 
> wrote:
> > On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > > Hi,
> > >
> > > During rather high load and rebalancing, a couple of our OSDs crashed
> > > and they fail to start. This is from the log:
> > >
> > > -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123
> load_pgs
> > > opened 370 pgs
> > > -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> > > build_past_intervals_parallel over 439159-439159
> > >  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> > >
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > > In function 'void OSD::build_past_intervals_parallel()' thread
> > > 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > >
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > > 4177: FAILED assert(p.same_interval_since)
> > >
> > >  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e)
> luminous
> > > (stable)
> > >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > const*)+0x102) [0x55e4caa18592]
> > >  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> > >  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> > >  4: (OSD::init()+0x2227) [0x55e4ca467327]
> > >  5: (main()+0x2d5a) [0x55e4ca379b1a]
> > >  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> > >  7: (_start()+0x2a) [0x55e4ca4039aa]
> > >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > > needed to interpret this.
> > >
> > > Does anybody know how to fix or further debug this?
> >
> > Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
> > From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
> > 10.1fce. Yet pg map doesn't show osd.1 for this pg:
> >
> > # ceph pg map 10.1fce
> > osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
> > [110,213,132,182]
> >
> > Hmm, this is odd. What caused your rebalancing exactly? Can you turn on
> the OSD with debugging set to 20, and then upload the log file using
> ceph-post-file?
> >
> > The specific assert you're hitting here is supposed to cope with PGs
> that have been imported (via the ceph-objectstore-tool). But obviously
> something has gone wrong here.
>
> It started when we bumped the number of PGs for a pool (from 2048 to 8192).
> I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba
>
> It actually seems similar to http://tracker.ceph.com/issues/21142 in
> that the pg found in the log seems empty when checked with ceph-objectstore-tool,
> and removing it allows the osd to start. At least on one osd; I haven't
> tried that yet on all of the failed ones.


Ah. I bet we are default-constructing the "child" PGs from split with that
value set to zero, so it's incorrectly being flagged for later use. David,
does that make sense to you? Do you think it's reasonable to fix it by just
checking for other default-initialized values as part of that branch check?
(I note that this code got removed once Luminous branched, so hopefully
there's a simple fix we can apply!)

Dejan, did you make sure the OSD you tried that on has re-created the
removed PG and populated it with data? If so I think you ought to be fine
removing any empty PGs which are causing this assert.
-Greg


Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Dejan Lesjak

> On 17. okt. 2017, at 00:23, Gregory Farnum  wrote:
> 
> On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak  wrote:
> On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > Hi,
> >
> > During rather high load and rebalancing, a couple of our OSDs crashed
> > and they fail to start. This is from the log:
> >
> > -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
> > opened 370 pgs
> > -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> > build_past_intervals_parallel over 439159-439159
> >  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > In function 'void OSD::build_past_intervals_parallel()' thread
> > 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > 4177: FAILED assert(p.same_interval_since)
> >
> >  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x102) [0x55e4caa18592]
> >  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> >  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> >  4: (OSD::init()+0x2227) [0x55e4ca467327]
> >  5: (main()+0x2d5a) [0x55e4ca379b1a]
> >  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> >  7: (_start()+0x2a) [0x55e4ca4039aa]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> > Does anybody know how to fix or further debug this?
> 
> Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
> From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
> 10.1fce. Yet pg map doesn't show osd.1 for this pg:
> 
> # ceph pg map 10.1fce
> osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
> [110,213,132,182]
> 
> Hmm, this is odd. What caused your rebalancing exactly? Can you turn on the 
> OSD with debugging set to 20, and then upload the log file using 
> ceph-post-file?
> 
> The specific assert you're hitting here is supposed to cope with PGs that 
> have been imported (via the ceph-objectstore-tool). But obviously something 
> has gone wrong here.

It started when we bumped the number of PGs for a pool (from 2048 to 8192).
I’ve sent the log with ID 3a6dea4f-05d7-4c15-9f7e-2d95d99195ba

It actually seems similar to http://tracker.ceph.com/issues/21142 in that the
pg found in the log seems empty when checked with ceph-objectstore-tool, and
removing it allows the osd to start. At least on one osd; I haven't tried that
yet on all of the failed ones.
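
For the record, the workaround on that one osd was roughly the following (ids
and paths are from my setup; I exported the pg first just in case, and
depending on the exact build ceph-objectstore-tool may also want --force on
the remove):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --pgid 10.1fce --op export --file /root/10.1fce.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 \
    --pgid 10.1fce --op remove
systemctl start ceph-osd@1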

Dejan


Re: [ceph-users] Re: assert(objiter->second->version > last_divergent_update) when testing pull out disk and insert

2017-10-16 Thread Gregory Farnum
On Sat, Oct 14, 2017 at 7:24 AM, zhaomingyue  wrote:
> 1. This assert happened only occasionally and is not easy to reproduce. In fact,
> I also suspect this assert is caused by lost device data;
> but if data was lost, how can it happen that (last_update + 1 == log.rbegin.version)?
> With lost data the state would more likely be thoroughly scrambled. At present I
> can't reason this situation through clearly.
>
> 2. Going by the read_log code, assume this situation:
> when an OSD starts, if the pg log has lost some content because of a power failure
> or an xfs error, then log.head would be bigger than log.rbegin.version in memory;
> during peering, last_update is used as one of the deciding arguments for find_best,
> so the consistent one (the osd with the shorter pg log but a normal last_update)
> may become the auth log,
> and other osds using this auth log would end up with an inconsistent pg when it is
> scrubbed, wouldn't they?

I'm not sure I understand what you're saying here.

I don't think OSDs will write down data from their peers until they
get far enough along to actually commit to somebody being primary,
though. So if we have an inconsistent guy with lost log, he'll hit one
of these asserts and then one of the remaining (consistent) OSDs will
start over again and get selected as the winner.

Keep in mind that once we know the PG metadata is inconsistent, we
don't want to keep on using that disk for data as we know it's not
trusted!
-Greg

>
>
>
> -----Original Message-----
> From: Gregory Farnum [mailto:gfar...@redhat.com]
> Sent: 14 October 2017 0:34
> To: zhaomingyue 09440 (RD)
> Cc: ceph-de...@vger.kernel.org; ceph-us...@ceph.com
> Subject: Re: [ceph-users] assert(objiter->second->version > last_divergent_update)
> when testing pull out disk and insert
>
> On Fri, Oct 13, 2017 at 12:48 AM, zhaomingyue  wrote:
>> Hi:
>> I hit an assert problem like bug 16279
>> (http://tracker.ceph.com/issues/16279) when testing disk pull-out and
>> re-insert, on ceph version 10.2.5: assert(objiter->second->version >
>> last_divergent_update)
>>
>> According to the osd log, I think this may be due to (log.head !=
>> *log.log.rbegin.version.version) when some abnormal condition
>> happened, such as a power failure or pulling the disk out and reinserting it.
>
> I don't think this is supposed to be possible. We apply all changes like this
> atomically; FileStore does all its journaling to prevent partial updates like 
> this.
>
> A few other people have reported the same issue on disk pull, so maybe
> there's some *other* issue going on, but the correct fix is to prevent
> those two from differing (unless I misunderstand the context).
>
> Given one of the reporters on that ticket confirms they also had xfs issues, 
> I find it vastly more likely that something in your kernel configuration and 
> hardware stack is not writing out data the way it claims to. Be very, very 
> sure all that is working correctly!
>
>
>> In the situation below, merge_log would push 234'1034 onto the divergent
>> list; the divergent list has only one node, which then leads to
>> assert(objiter->second->version > last_divergent_update).
>>
>> olog     (0’0, 234’1034)  olog.head = 234’1034
>>
>> log      (0’0, 234’1034)  log.head = 234’1033
>>
>>
>>
>> Looking at the osd load_pgs code, in PGLog::read_log() the code looks like this:
>>
>>   ...
>>   for (p->seek_to_first(); p->valid(); p->next()) {
>>     ...
>>     log.log.push_back(e);
>>     log.head = e.version;  // for every pg log entry
>>   }
>>   ...
>>   log.head = info.last_update;
>>
>>
>>
>> Two doubts:
>>
>> First: why set (log.head = info.last_update) after all pg log entries have been
>> processed (every entry has already updated log.head = e.version)?
>>
>> Second: can it happen that info.last_update is less than
>> *log.log.rbegin.version, and in what scenario would that occur?
>
> I'm looking at the luminous code base right now and things have changed a bit 
> so I don't have the specifics of your question on hand.
>
> But the general reason we change these versions around is because we need to 
> reconcile the logs across all OSDs. If one OSD has an entry for an operation 
> that was never returned to the client, we may need to declare it divergent 
> and undo it. (In replicated pools, entries are only divergent if the OSD 
> hosting it was either netsplit from the primary, or else managed to commit 
> something during a failure event that its peers didn't and then was 
> resubmitted under a different ID by the client on recovery. In erasure-coded 
> pools things are more complicated because we can only roll operations forward 
> if a quorum of the shards are present.) -Greg

Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-16 Thread Sage Weil
On Mon, 16 Oct 2017, Anthony Verevkin wrote:
> 
> > From: "Sage Weil" 
> > To: "Alfredo Deza" 
> > Cc: "ceph-devel" , ceph-users@lists.ceph.com
> > Sent: Monday, October 9, 2017 11:09:29 AM
> > Subject: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and 
> > disk partition support]
> > 
> > To put this in context, the goal here is to kill ceph-disk in mimic.
> > 
> 
>  
> > Perhaps the "out" here is to support a "dir" option where the user
> > can
> > manually provision and mount an OSD on /var/lib/ceph/osd/*, with
> > 'journal'
> > or 'block' symlinks, and ceph-volume will do the last bits that
> > initialize
> > the filestore or bluestore OSD from there.  Then if someone has a
> > scenario
> > that isn't captured by LVM (or whatever else we support) they can
> > always
> > do it manually?
> > 
> 
> In fact, now that bluestore only requires a few small files and symlinks 
> to remain in /var/lib/ceph/osd/* without the extra requirements for 
> xattrs support and xfs, why not simply leave those folders on OS root 
> filesystem and only point symlinks to bluestore block and db devices? 
> That would simplify the osd deployment so much - and the symlinks can 
> then point to /dev/disk/by-uuid or by-path or lvm path or whatever. The 
> only downside for this approach that I see is that disks themselves 
> would no longer be transferable between the hosts as those few files 
> that describe the OSD are no longer on the disk itself.

:) this is exactly what we're doing, actually:

https://github.com/ceph/ceph/pull/18256

We plan to backport this to luminous, hopefully in time for the next 
point release.

dm-crypt is still slightly annoying to set up, but it will still be much 
easier.

sage


Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-16 Thread Anthony Verevkin

> From: "Sage Weil" 
> To: "Alfredo Deza" 
> Cc: "ceph-devel" , ceph-users@lists.ceph.com
> Sent: Monday, October 9, 2017 11:09:29 AM
> Subject: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and 
> disk partition support]
> 
> To put this in context, the goal here is to kill ceph-disk in mimic.
> 

 
> Perhaps the "out" here is to support a "dir" option where the user
> can
> manually provision and mount an OSD on /var/lib/ceph/osd/*, with
> 'journal'
> or 'block' symlinks, and ceph-volume will do the last bits that
> initialize
> the filestore or bluestore OSD from there.  Then if someone has a
> scenario
> that isn't captured by LVM (or whatever else we support) they can
> always
> do it manually?
> 


In fact, now that bluestore only requires a few small files and symlinks to 
remain in /var/lib/ceph/osd/* without the extra requirements for xattrs support 
and xfs, why not simply leave those folders on OS root filesystem and only 
point symlinks to bluestore block and db devices? That would simplify the osd 
deployment so much - and the symlinks can then point to /dev/disk/by-uuid or 
by-path or lvm path or whatever. The only downside for this approach that I see 
is that disks themselves would no longer be transferable between the hosts as 
those few files that describe the OSD are no longer on the disk itself.
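
To make the idea concrete, roughly this (device paths and the osd id are made
up, and the exact list of small files is from memory, so treat it as a sketch
rather than a recipe):

mkdir -p /var/lib/ceph/osd/ceph-12
ln -s /dev/disk/by-partuuid/8c3e6a5e-example /var/lib/ceph/osd/ceph-12/block
ln -s /dev/disk/by-partuuid/1f02c9d3-example /var/lib/ceph/osd/ceph-12/block.db
# plus the handful of small files: fsid, ceph_fsid, whoami, type, keyring, ...
chown -R ceph:ceph /var/lib/ceph/osd/ceph-12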

Regards,
Anthony



Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Gregory Farnum
On Mon, Oct 16, 2017 at 8:24 AM Dejan Lesjak  wrote:

> On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> > Hi,
> >
> > During rather high load and rebalancing, a couple of our OSDs crashed
> > and they fail to start. This is from the log:
> >
> > -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
> > opened 370 pgs
> > -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> > build_past_intervals_parallel over 439159-439159
> >  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > In function 'void OSD::build_past_intervals_parallel()' thread
> > 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> > /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> > 4177: FAILED assert(p.same_interval_since)
> >
> >  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> > (stable)
> >  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > const*)+0x102) [0x55e4caa18592]
> >  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
> >  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
> >  4: (OSD::init()+0x2227) [0x55e4ca467327]
> >  5: (main()+0x2d5a) [0x55e4ca379b1a]
> >  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
> >  7: (_start()+0x2a) [0x55e4ca4039aa]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> > needed to interpret this.
> >
> > Does anybody know how to fix or further debug this?
>
> Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
> From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
> 10.1fce. Yet pg map doesn't show osd.1 for this pg:
>
> # ceph pg map 10.1fce
> osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
> [110,213,132,182]
>

Hmm, this is odd. What caused your rebalancing exactly? Can you turn on the
OSD with debugging set to 20, and then upload the log file using
ceph-post-file?
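
Something along these lines should do it (adjust the osd id and log path for
your setup; ceph-post-file prints an ID you can paste back here):

# in ceph.conf on that node (or as a command-line override when restarting):
#   [osd.1]
#   debug osd = 20
systemctl restart ceph-osd@1                  # let it hit the assert again
ceph-post-file /var/log/ceph/ceph-osd.1.log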

The specific assert you're hitting here is supposed to cope with PGs that
have been imported (via the ceph-objectstore-tool). But obviously something
has gone wrong here.
-Greg


Re: [ceph-users] cephfs: some metadata operations take seconds to complete

2017-10-16 Thread Linh Vu
We're using cephfs here as well for HPC scratch, but we're on Luminous 12.2.1.
This issue seems to have been fixed between Jewel and Luminous; we don't have
such problems. :) Any reason you guys aren't evaluating the latest LTS?
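
If you want to confirm what is going on before upgrading: assuming the
slowness comes from the MDS asynchronously purging the deleted file's objects
(which would fit the time being proportional to the file size), watching the
stray/purge counters while the rm runs should show it. Counter and option
names are from memory and may differ slightly on 10.2.9:

ceph daemon mds.<name> perf dump | \
    jq '.mds_cache | {num_strays, strays_created, strays_purged}'
# purge throttling is governed by mds_max_purge_files / mds_max_purge_ops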


From: ceph-users  on behalf of Tyanko 
Aleksiev 
Sent: Tuesday, 17 October 2017 4:07:26 AM
To: ceph-users
Subject: [ceph-users] cephfs: some metadata operations take seconds to complete

Hi,

At UZH we are currently evaluating cephfs as a distributed file system
for the scratch space of an HPC installation. Some slow down of the
metadata operations seems to occur under certain circumstances. In
particular, commands issued after some big file deletion could take
several seconds.

Example:

dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s

dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s

ls; time rm dd-test2 ; time ls
dd-test  dd-test2

real    0m0.004s
user    0m0.000s
sys     0m0.000s
dd-test

real    0m8.795s
user    0m0.000s
sys     0m0.000s

Additionally, the time it takes to complete the "ls" command appears to
be proportional to the size of the deleted file. The issue described
above is not limited to "ls" but extends to other commands:

ls ; time rm dd-test2 ; time du -hs ./*
dd-test  dd-test2

real    0m0.003s
user    0m0.000s
sys     0m0.000s
128G    ./dd-test

real    0m9.974s
user    0m0.000s
sys     0m0.000s

What might be causing this behavior and eventually how could we improve it?

Setup:

- ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
- 3 monitors,
- 1 mds,
- 3 storage nodes with 24 X 4TB disks on each node: 1 OSD/disk (72 OSDs
in total). 4TB disks are used for the cephfs_data pool. Journaling is on
SSDs,
- we installed a 400GB NVMe disk on each storage node and aggregated
the three disks in a crush rule. The cephfs_metadata pool was then created
using that rule and is therefore hosted on the NVMes. Journaling and
data are on the same partition here.

So far we are using the default ceph configuration settings.

Clients are mounting the file system with the kernel driver using the
following options (again default):
"rw,noatime,name=admin,secret=,acl,_netdev".

Thank you in advance for the help.

Cheers,
Tyanko





[ceph-users] cephfs: some metadata operations take seconds to complete

2017-10-16 Thread Tyanko Aleksiev

Hi,

At UZH we are currently evaluating cephfs as a distributed file system 
for the scratch space of an HPC installation. Some slow down of the 
metadata operations seems to occur under certain circumstances. In 
particular, commands issued after some big file deletion could take 
several seconds.


Example:

dd bs=$((1024*1024*128)) count=2048 if=/dev/zero of=./dd-test
274877906944 bytes (275 GB, 256 GiB) copied, 224.798 s, 1.2 GB/s

dd bs=$((1024*1024*128)) count=2048 if=./dd-test of=./dd-test2
274877906944 bytes (275 GB, 256 GiB) copied, 1228.87 s, 224 MB/s

ls; time rm dd-test2 ; time ls
dd-test  dd-test2

real    0m0.004s
user    0m0.000s
sys     0m0.000s
dd-test

real    0m8.795s
user    0m0.000s
sys     0m0.000s

Additionally, the time it takes to complete the "ls" command appears to 
be proportional to the size of the deleted file. The issue described 
above is not limited to "ls" but extends to other commands:


ls ; time rm dd-test2 ; time du -hs ./*
dd-test  dd-test2

real    0m0.003s
user    0m0.000s
sys     0m0.000s
128G    ./dd-test

real    0m9.974s
user    0m0.000s
sys     0m0.000s

What might be causing this behavior and eventually how could we improve it?

Setup:

- ceph version: 10.2.9, OS: Ubuntu 16.04, kernel: 4.8.0-58-generic,
- 3 monitors,
- 1 mds,
- 3 storage nodes with 24 X 4TB disks on each node: 1 OSD/disk (72 OSDs 
in total). 4TB disks are used for the cephfs_data pool. Journaling is on 
SSDs,
- we installed a 400GB NVMe disk on each storage node and aggregated
the three disks in a crush rule. The cephfs_metadata pool was then created
using that rule and is therefore hosted on the NVMes. Journaling and
data are on the same partition here.


So far we are using the default ceph configuration settings.

Clients are mounting the file system with the kernel driver using the 
following options (again default): 
"rw,noatime,name=admin,secret=,acl,_netdev".


Thank you in advance for the help.

Cheers,
Tyanko





[ceph-users] How to stop using (unmount) a failed OSD with BlueStore ?

2017-10-16 Thread Alejandro Comisario
Hi all, I have to hot-swap a failed OSD on a Luminous cluster with BlueStore
(the disk is SATA; WAL and DB are on NVMe).

I've issued a:
* ceph osd crush reweight osd_id 0
* systemctl stop (the osd id's daemon)
* umount /var/lib/ceph/osd/osd_id
* ceph osd destroy osd_id
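
(The same steps as a copy-pasteable sequence, with the osd number in a
variable; paths are as on my systems, and depending on the build `ceph osd
destroy` may also want --yes-i-really-mean-it:)

ID=57                                        # hypothetical osd number
ceph osd crush reweight osd.$ID 0
systemctl stop ceph-osd@$ID
umount /var/lib/ceph/osd/ceph-$ID
ceph osd destroy $ID --yes-i-really-mean-it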

Everything seems OK, but if I leave everything as is (until the replacement
disk arrives) I can see that dmesg errors about writes to the device keep
appearing.

The OSD is of course down and out of the crush map.
Am I missing something, like a step to execute or something else?

hoping to get help.
best.

​alejandrito


Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-16 Thread Richard Hesketh
On 16/10/17 13:45, Wido den Hollander wrote:
>> On 26 September 2017 at 16:39, Mark Nelson wrote:
>> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
>>> thanks David,
>>>
>>> that's confirming what I was assuming. To bad that there is no
>>> estimate/method to calculate the db partition size.
>>
>> It's possible that we might be able to get ranges for certain kinds of 
>> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
>> expect a typical metadata size of X per object.  Or maybe if you do lots 
>> of large sequential object writes in RGW, it's more like Y.  I think 
>> it's probably going to be tough to make it accurate for everyone though.
> 
> So I did a quick test. I wrote 75.000 objects to a BlueStore device:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
> 75085
> root@alpha:~# 
> 
> I then saw the RocksDB database was 450MB in size:
> 
> root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
> 459276288
> root@alpha:~#
> 
> 459276288 / 75085 = 6116
> 
> So about 6kb of RocksDB data per object.
> 
> Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
> space.
> 
> Is this a safe assumption? Do you think that 6kb is normal? Low? High?
> 
> There aren't many of these numbers out there for BlueStore right now so I'm 
> trying to gather some numbers.
> 
> Wido

If I check for the same stats on OSDs in my production cluster I see similar 
but variable values:

root@vm-ds-01:~/ceph-conf# for i in {0..9} ; do echo -n "osd.$i db per object: 
" ; expr `ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph 
daemon osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.0 db per object: 7490
osd.1 db per object: 7523
osd.2 db per object: 7378
osd.3 db per object: 7447
osd.4 db per object: 7233
osd.5 db per object: 7393
osd.6 db per object: 7074
osd.7 db per object: 7967
osd.8 db per object: 7253
osd.9 db per object: 7680

root@vm-ds-02:~# for i in {10..19} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.10 db per object: 5168
osd.11 db per object: 5291
osd.12 db per object: 5476
osd.13 db per object: 4978
osd.14 db per object: 5252
osd.15 db per object: 5461
osd.16 db per object: 5135
osd.17 db per object: 5126
osd.18 db per object: 9336
osd.19 db per object: 4986

root@vm-ds-03:~# for i in {20..29} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.20 db per object: 5115
osd.21 db per object: 4844
osd.22 db per object: 5063
osd.23 db per object: 5486
osd.24 db per object: 5228
osd.25 db per object: 4966
osd.26 db per object: 5047
osd.27 db per object: 5021
osd.28 db per object: 5321
osd.29 db per object: 5150

root@vm-ds-04:~# for i in {30..39} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.30 db per object: 6658
osd.31 db per object: 6445
osd.32 db per object: 6259
osd.33 db per object: 6691
osd.34 db per object: 6513
osd.35 db per object: 6628
osd.36 db per object: 6779
osd.37 db per object: 6819
osd.38 db per object: 6677
osd.39 db per object: 6689

root@vm-ds-05:~# for i in {40..49} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.40 db per object: 5335
osd.41 db per object: 5203
osd.42 db per object: 5552
osd.43 db per object: 5188
osd.44 db per object: 5218
osd.45 db per object: 5157
osd.46 db per object: 4956
osd.47 db per object: 5370
osd.48 db per object: 5117
osd.49 db per object: 5313

I'm not sure why there is so much variance (these nodes are basically identical),
and I think db_used_bytes includes the WAL, at least in my case, as I don't
have a separate WAL device. I'm not sure how big the WAL is relative to the
metadata and hence how much this might throw the numbers off, but ~6kB/object
seems like a reasonable value to take for back-of-envelope calculations.
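
If it helps, the bluefs section of perf dump also breaks out WAL and
slow-device usage separately (at least on my 12.2.x OSDs), so you can check
how much the WAL accounts for and whether anything has spilled over to the
slow device:

ceph daemon osd.0 perf dump | \
    jq '.bluefs | {db_used_bytes, wal_used_bytes, slow_used_bytes}'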

[bonus hilarity]
On my all-in-one-SSD OSDs, because bluestore reports them entirely as db space, 
I get results like:

root@vm-hv-01:~# for i in {60..65} ; do echo -n "osd.$i db per object: " ; expr 
`ceph daemon osd.$i perf dump | jq '.bluefs.db_used_bytes'` / `ceph daemon 
osd.$i perf dump | jq '.bluestore.bluestore_onodes'` ; done
osd.60 db per object: 80273
osd.61 db per object: 68859
osd.62 db per object: 45560
osd.63 db per object: 38209
osd.64 db per object: 48258
osd.65 db per object: 50525

Rich




Re: [ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Dejan Lesjak
On 10/16/2017 02:02 PM, Dejan Lesjak wrote:
> Hi,
> 
> During rather high load and rebalancing, a couple of our OSDs crashed
> and they fail to start. This is from the log:
> 
> -2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
> opened 370 pgs
> -1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
> build_past_intervals_parallel over 439159-439159
>  0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> In function 'void OSD::build_past_intervals_parallel()' thread
> 7f5e4c3bae80 time 2017-10-16 13:27:50.260062
> /var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
> 4177: FAILED assert(p.same_interval_since)
> 
>  ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
> (stable)
>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x102) [0x55e4caa18592]
>  2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
>  3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
>  4: (OSD::init()+0x2227) [0x55e4ca467327]
>  5: (main()+0x2d5a) [0x55e4ca379b1a]
>  6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
>  7: (_start()+0x2a) [0x55e4ca4039aa]
>  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> 
> Does anybody know how to fix or further debug this?

Bumped logging to 10 and posted log to https://pastebin.com/raw/StTeYWRt
>From "10.1fces2 needs 439159-0" it seems osd (osd.1) gets stuck at pg
10.1fce. Yet pg map doesn't show osd.1 for this pg:

# ceph pg map 10.1fce
osdmap e443665 pg 10.1fce (10.1fce) -> up [110,213,132,182] acting
[110,213,132,182]




Re: [ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Anthony Verevkin
Hi Matteo,

This looks like the 'noout' flag might be set for your cluster.

Please check it with:
ceph osd dump | grep flags

If you see 'noout' flag is set, you can unset it with:
ceph osd unset noout
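
If noout turns out not to be set, it may also just be that the down OSDs have
not been marked out yet; by default that only happens after
mon_osd_down_out_interval (600 seconds). You can check the value on one of
your monitor hosts and see which OSDs are still down with:

ceph daemon mon.controller001 config get mon_osd_down_out_interval
ceph osd tree | grep -w down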

Regards,
Anthony

- Original Message -
> From: "Matteo Dacrema" 
> To: ceph-users@lists.ceph.com
> Sent: Monday, October 16, 2017 4:21:29 AM
> Subject: [ceph-users] Ceph not recovering after osd/host failure
> 
> Hi all,
> 
> I’m testing Ceph Luminous 12.2.1 installed with ceph ansible.
> 
> Doing some failover tests I noticed that when I kill an osd or a
> host, Ceph doesn't recover automatically, remaining in this state
> until I bring the OSDs or host back online.
> I’ve 3 pools volumes, cephfs_data and cephfs_metadata with size 3 and
> min_size 1.
> 
> Is there something I’m missing ?
> 
> Below some cluster info.
> 
> Thank you all
> Regards
> 
> Matteo
> 
> 
>   cluster:
> id: ab7cb890-ee21-484e-9290-14b9e5e85125
> health: HEALTH_WARN
> 3 osds down
> Degraded data redundancy: 2842/73686 objects degraded
> (3.857%), 318 pgs unclean, 318 pgs degraded, 318 pgs
> undersized
> 
>   services:
> mon: 3 daemons, quorum controller001,controller002,controller003
> mgr: controller001(active), standbys: controller002,
> controller003
> mds: cephfs-1/1/1 up  {0=controller002=up:active}, 2 up:standby
> osd: 77 osds: 74 up, 77 in
> 
>   data:
> pools:   3 pools, 4112 pgs
> objects: 36843 objects, 142 GB
> usage:   470 GB used, 139 TB / 140 TB avail
> pgs: 2842/73686 objects degraded (3.857%)
>  3794 active+clean
>  318  active+undersized+degraded
> 
> 
> ID  CLASS WEIGHT    TYPE NAME           STATUS REWEIGHT PRI-AFF
>  -1       140.02425 root default
>  -9        20.00346     host storage001
>   0   hdd   1.81850 osd.0   up  1.0 1.0
>   6   hdd   1.81850 osd.6   up  1.0 1.0
>   8   hdd   1.81850 osd.8   up  1.0 1.0
>  11   hdd   1.81850 osd.11  up  1.0 1.0
>  14   hdd   1.81850 osd.14  up  1.0 1.0
>  18   hdd   1.81850 osd.18  up  1.0 1.0
>  24   hdd   1.81850 osd.24  up  1.0 1.0
>  28   hdd   1.81850 osd.28  up  1.0 1.0
>  33   hdd   1.81850 osd.33  up  1.0 1.0
>  40   hdd   1.81850 osd.40  up  1.0 1.0
>  45   hdd   1.81850 osd.45  up  1.0 1.0
>  -7        20.00346     host storage002
>   1   hdd   1.81850 osd.1   up  1.0 1.0
>   5   hdd   1.81850 osd.5   up  1.0 1.0
>   9   hdd   1.81850 osd.9   up  1.0 1.0
>  21   hdd   1.81850 osd.21  up  1.0 1.0
>  22   hdd   1.81850 osd.22  up  1.0 1.0
>  23   hdd   1.81850 osd.23  up  1.0 1.0
>  35   hdd   1.81850 osd.35  up  1.0 1.0
>  36   hdd   1.81850 osd.36  up  1.0 1.0
>  38   hdd   1.81850 osd.38  up  1.0 1.0
>  42   hdd   1.81850 osd.42  up  1.0 1.0
>  49   hdd   1.81850 osd.49  up  1.0 1.0
> -11        20.00346     host storage003
>  27   hdd   1.81850 osd.27  up  1.0 1.0
>  31   hdd   1.81850 osd.31  up  1.0 1.0
>  32   hdd   1.81850 osd.32  up  1.0 1.0
>  37   hdd   1.81850 osd.37  up  1.0 1.0
>  44   hdd   1.81850 osd.44  up  1.0 1.0
>  46   hdd   1.81850 osd.46  up  1.0 1.0
>  48   hdd   1.81850 osd.48  up  1.0 1.0
>  53   hdd   1.81850 osd.53  up  1.0 1.0
>  54   hdd   1.81850 osd.54  up  1.0 1.0
>  56   hdd   1.81850 osd.56  up  1.0 1.0
>  59   hdd   1.81850 osd.59  up  1.0 1.0
>  -3        20.00346     host storage004
>   2   hdd   1.81850 osd.2   up  1.0 1.0
>   4   hdd   1.81850 osd.4   up  1.0 1.0
>  10   hdd   1.81850 osd.10  up  1.0 1.0
>  16   hdd   1.81850 osd.16  up  1.0 1.0
>  17   hdd   1.81850 osd.17  up  1.0 1.0
>  19   hdd   1.81850 osd.19  up  1.0 1.0
>  26   hdd   1.81850 osd.26  up  1.0 1.0
>  29   hdd   1.81850 osd.29  up  1.0 1.0
>  39   hdd   1.81850 osd.39  up  1.0 1.0
>  43   hdd   1.81850 osd.43  up  1.0 1.0
>  50   hdd   1.81850 osd.50  up  1.0 1.0
>  -5        20.00346     host storage005
>   3   hdd   1.81850 osd.3   up  1.0 1.0

Re: [ceph-users] Bluestore OSD_DATA, WAL & DB

2017-10-16 Thread Wido den Hollander

> On 26 September 2017 at 16:39, Mark Nelson wrote:
> 
> 
> 
> 
> On 09/26/2017 01:10 AM, Dietmar Rieder wrote:
> > thanks David,
> >
> > that's confirming what I was assuming. To bad that there is no
> > estimate/method to calculate the db partition size.
> 
> It's possible that we might be able to get ranges for certain kinds of 
> scenarios.  Maybe if you do lots of small random writes on RBD, you can 
> expect a typical metadata size of X per object.  Or maybe if you do lots 
> of large sequential object writes in RGW, it's more like Y.  I think 
> it's probably going to be tough to make it accurate for everyone though.
> 

So I did a quick test. I wrote 75.000 objects to a BlueStore device:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluestore.bluestore_onodes'
75085
root@alpha:~# 

I then saw the RocksDB database was 450MB in size:

root@alpha:~# ceph daemon osd.0 perf dump|jq '.bluefs.db_used_bytes'
459276288
root@alpha:~#

459276288 / 75085 = 6116

So about 6kb of RocksDB data per object.

Let's say I want to store 1M objects in a single OSD I would need ~6GB of DB 
space.
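
Back of the envelope, in case anyone wants to plug in their own numbers:

echo "$((6116 * 1000000 / 1024 / 1024)) MiB"    # ~5832 MiB per 1M objects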

Is this a safe assumption? Do you think that 6kb is normal? Low? High?

There aren't many of these numbers out there for BlueStore right now so I'm 
trying to gather some numbers.

Wido

> Mark
> 
> >
> > Dietmar
> >
> > On 09/25/2017 05:10 PM, David Turner wrote:
> >> db/wal partitions are per OSD.  DB partitions need to be made as big as
> >> you need them.  If they run out of space, they will fall back to the
> >> block device.  If the DB and block are on the same device, then there's
> >> no reason to partition them and figure out the best size.  If they are
> >> on separate devices, then you need to make it as big as you need to to
> >> ensure that it won't spill over (or if it does that you're ok with the
> >> degraded performance while the db partition is full).  I haven't come
> >> across an equation to judge what size should be used for either
> >> partition yet.
> >>
> >> On Mon, Sep 25, 2017 at 10:53 AM Dietmar Rieder
> >> > wrote:
> >>
> >> On 09/25/2017 02:59 PM, Mark Nelson wrote:
> >> > On 09/25/2017 03:31 AM, TYLin wrote:
> >> >> Hi,
> >> >>
> >> To my understanding, the bluestore write workflow is
> >> >>
> >> >> For normal big write
> >> >> 1. Write data to block
> >> >> 2. Update metadata to rocksdb
> >> >> 3. Rocksdb write to memory and block.wal
> >> >> 4. Once reach threshold, flush entries in block.wal to block.db
> >> >>
> >> >> For overwrite and small write
> >> >> 1. Write data and metadata to rocksdb
> >> >> 2. Apply the data to block
> >> >>
> >> >> Seems we don’t have a formula or suggestion to the size of block.db.
> >> >> It depends on the object size and number of objects in your pool. 
> >> You
> >> >> can just give big partition to block.db to ensure all the database
> >> >> files are on that fast partition. If block.db full, it will use 
> >> block
> >> >> to put db files, however, this will slow down the db performance. So
> >> >> give db size as much as you can.
> >> >
> >> > This is basically correct.  What's more, it's not just the object
> >> size,
> >> > but the number of extents, checksums, RGW bucket indices, and
> >> > potentially other random stuff.  I'm skeptical how well we can
> >> estimate
> >> > all of this in the long run.  I wonder if we would be better served 
> >> by
> >> > just focusing on making it easy to understand how the DB device is
> >> being
> >> > used, how much is spilling over to the block device, and make it
> >> easy to
> >> > upgrade to a new device once it gets full.
> >> >
> >> >>
> >> >> If you want to put wal and db on same ssd, you don’t need to create
> >> >> block.wal. It will implicitly use block.db to put wal. The only case
> >> >> you need block.wal is that you want to separate wal to another disk.
> >> >
> >> > I always make explicit partitions, but only because I (potentially
> >> > illogically) like it that way.  There may actually be some benefits 
> >> to
> >> > using a single partition for both if sharing a single device.
> >>
> >> is this "Single db/wal partition" then to be used for all OSDs on a 
> >> node
> >> or do you need to create a seperate "Single  db/wal partition" for each
> >> OSD  on the node?
> >>
> >> >
> >> >>
> >> >> I’m also studying bluestore, this is what I know so far. Any
> >> >> correction is welcomed.
> >> >>
> >> >> Thanks
> >> >>
> >> >>
> >> >>> On Sep 22, 2017, at 5:27 PM, Richard Hesketh
> >> >>>  >> > wrote:
> >> >>>
> >> >>> I asked the same question a couple of weeks ago. No response I got
> >> >>> contradicted the 

Re: [ceph-users] rados export/import fail

2017-10-16 Thread Wido den Hollander

> On 16 October 2017 at 13:00, Nagy Ákos wrote:
> 
> 
> Thanks,
> 
> but I erase all of the data, I have only this backup.

I hate to bring the bad news, but it will not work. The pools have different 
IDs and that will make it very difficult to get this working again.
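
For reference, each pool's numeric ID (which is what an RBD clone's parent 
reference stores) is visible with 'ceph osd lspools' or 'ceph osd pool ls 
detail'. A quick check could look like this (pool numbers here are 
illustrative, not taken from your clusters):

root@alpha:~# ceph osd lspools
1 rbd,3 pool1,

If pool1 had a different ID on the old cluster, that would explain the 
"No such file or directory" when the imported clones try to look up their 
parent.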

Wido

> If the restore work for 3 pools, I can do it for the remaining 2?
> 
> What can I try to set, to import it or how I can find this IDs?
> 
> On 16 October 2017 at 13:39, John Spray wrote:
> > On Mon, Oct 16, 2017 at 11:35 AM, Nagy Ákos  
> > wrote:
> >> Hi,
> >>
> >> I want to upgrade my ceph from jewel to luminous, and switch to bluestore.
> >>
> >> For that I export the pools from old cluster:
> > This is not the way to do it.  You should convert your OSDs from
> > filestore to bluestore one by one, and let the data re-replicate to
> > the new OSDs.
> >
> > Dumping data out of one Ceph cluster and into another will not work,
> > because things like RBD images record things like the ID of the pool
> > where their parent image is, and pool IDs are usually different
> > between clusters.
> >
> > John
> >
> >> rados export -p pool1 pool1.ceph
> >>
> >> and after upgrade and osd recreation:
> >>
> >> rados --create -p pool1 import pool1.ceph
> >>
> >> I can import the backup without error, but when I want  to map an image, I
> >> got error:
> >>
> >> rbd --image container1 --pool pool1 map
> >>
> >> rbd: sysfs write failed
> >> In some cases useful info is found in syslog - try "dmesg | tail".
> >> rbd: map failed: (2) No such file or directory
> >>
> >> dmesg | tail
> >>
> >> [160606.729840] rbd: image container1 : WARNING: kernel layering is
> >> EXPERIMENTAL!
> >> [160606.730675] libceph: tid 86731 pool does not exist
> >>
> >>
> >> When I try to get info about the image:
> >>
> >> rbd info pool1/container1
> >>
> >> 2017-10-16 13:18:17.404858 7f35a37fe700 -1
> >> librbd::image::RefreshParentRequest: failed to open parent image: (2) No
> >> such file or directory
> >> 2017-10-16 13:18:17.404903 7f35a37fe700 -1 librbd::image::RefreshRequest:
> >> failed to refresh parent image: (2) No such file or directory
> >> 2017-10-16 13:18:17.404930 7f35a37fe700 -1 librbd::image::OpenRequest:
> >> failed to refresh image: (2) No such file or directory
> >> rbd: error opening image container1: (2) No such file or directory
> >>
> >>
> >> I check to exported image checksum after export and before import, and it's
> >> match, and I can restore three pools with one with 60 MB one with 1.2 GB 
> >> and
> >> one with 25 GB of data.
> >>
> >> The problematic has 60 GB data.
> >>
> >> The pool store LXD container images.
> >>
> >> Any help is highly appreciated.
> >>
> >> --
> >> Ákos
> >>
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> 
> -- 
> Ákos
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Osd FAILED assert(p.same_interval_since)

2017-10-16 Thread Dejan Lesjak
Hi,

During rather high load and rebalancing, a couple of our OSDs crashed
and they fail to start. This is from the log:

-2> 2017-10-16 13:27:50.235204 7f5e4c3bae80  0 osd.1 442123 load_pgs
opened 370 pgs
-1> 2017-10-16 13:27:50.239175 7f5e4c3bae80  1 osd.1 442123
build_past_intervals_parallel over 439159-439159
 0> 2017-10-16 13:27:50.261883 7f5e4c3bae80 -1
/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
In function 'void OSD::build_past_intervals_parallel()' thread
7f5e4c3bae80 time 2017-10-16 13:27:50.260062
/var/tmp/portage/sys-cluster/ceph-12.2.1/work/ceph-12.2.1/src/osd/OSD.cc:
4177: FAILED assert(p.same_interval_since)

 ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous
(stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x102) [0x55e4caa18592]
 2: (OSD::build_past_intervals_parallel()+0x1d7b) [0x55e4ca453e8b]
 3: (OSD::load_pgs()+0x14cb) [0x55e4ca45564b]
 4: (OSD::init()+0x2227) [0x55e4ca467327]
 5: (main()+0x2d5a) [0x55e4ca379b1a]
 6: (__libc_start_main()+0xf1) [0x7f5e48ee35d1]
 7: (_start()+0x2a) [0x55e4ca4039aa]
 NOTE: a copy of the executable, or `objdump -rdS ` is
needed to interpret this.

Does anybody know how to fix or further debug this?
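
One way to get more detail out of an OSD that dies during startup (a sketch, 
assuming osd.1 and a systemd deployment; the exact debug levels are a matter 
of taste) is to run it once in the foreground with higher OSD debugging, 
since the daemon never stays up long enough for 'ceph tell':

systemctl stop ceph-osd@1
ceph-osd -d -i 1 --debug-osd 20 --debug-ms 1 2>&1 | tee /tmp/ceph-osd.1.log

(-d keeps the daemon in the foreground and logs to stderr.) Alternatively, 
set 'debug osd = 20' in ceph.conf for that OSD, restart it normally, and read 
the usual log under /var/log/ceph/.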


Cheers,
Dejan Lesjak
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ocata] [cinder] cinder-volume causes high cpu load

2017-10-16 Thread Eugen Block

Hi list,

some of you also use ceph as storage backend for OpenStack, so maybe  
you can help me out.


Last week we upgraded our Mitaka cloud to Ocata (via Newton, of  
course), and also upgraded the cloud nodes from openSUSE Leap 42.1 to  
Leap 42.3. There were some issues as expected, but no showstoppers  
(luckily).


So the cloud is up and working again, but our monitoring shows a high 
CPU load for the cinder-volume service on the control node, which is also 
visible in a tcpdump showing lots of connections to the ceph nodes. Since 
all the clients are on the compute nodes, we are wondering what cinder 
actually does on the control node besides initializing the connections, 
which should only be relevant for new volumes. The data sent to the ceph 
nodes references all these rbd_header objects, e.g. rb.0.24d5b04[...]. 
I expect this kind of traffic on the compute nodes, of course, but why 
does the control node also establish so many connections?


I'd appreciate any insight!

Regards,
Eugen

--
Eugen Block voice   : +49-40-559 51 75
NDE Netzdesign und -entwicklung AG  fax : +49-40-559 51 77
Postfach 61 03 15
D-22423 Hamburg e-mail  : ebl...@nde.ag

Vorsitzende des Aufsichtsrates: Angelika Mozdzen
  Sitz und Registergericht: Hamburg, HRB 90934
  Vorstand: Jens-U. Mozdzen
   USt-IdNr. DE 814 013 983

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados export/import fail

2017-10-16 Thread Nagy Ákos
Thanks,

but I erased all of the data; I have only this backup.
If the restore worked for 3 pools, can I do it for the remaining 2?

What can I try to set to import it, or how can I find these IDs?

On 16 October 2017 at 13:39, John Spray wrote:
> On Mon, Oct 16, 2017 at 11:35 AM, Nagy Ákos  wrote:
>> Hi,
>>
>> I want to upgrade my ceph from jewel to luminous, and switch to bluestore.
>>
>> For that I export the pools from old cluster:
> This is not the way to do it.  You should convert your OSDs from
> filestore to bluestore one by one, and let the data re-replicate to
> the new OSDs.
>
> Dumping data out of one Ceph cluster and into another will not work,
> because things like RBD images record things like the ID of the pool
> where their parent image is, and pool IDs are usually different
> between clusters.
>
> John
>
>> rados export -p pool1 pool1.ceph
>>
>> and after upgrade and osd recreation:
>>
>> rados --create -p pool1 import pool1.ceph
>>
>> I can import the backup without error, but when I want  to map an image, I
>> got error:
>>
>> rbd --image container1 --pool pool1 map
>>
>> rbd: sysfs write failed
>> In some cases useful info is found in syslog - try "dmesg | tail".
>> rbd: map failed: (2) No such file or directory
>>
>> dmesg | tail
>>
>> [160606.729840] rbd: image container1 : WARNING: kernel layering is
>> EXPERIMENTAL!
>> [160606.730675] libceph: tid 86731 pool does not exist
>>
>>
>> When I try to get info about the image:
>>
>> rbd info pool1/container1
>>
>> 2017-10-16 13:18:17.404858 7f35a37fe700 -1
>> librbd::image::RefreshParentRequest: failed to open parent image: (2) No
>> such file or directory
>> 2017-10-16 13:18:17.404903 7f35a37fe700 -1 librbd::image::RefreshRequest:
>> failed to refresh parent image: (2) No such file or directory
>> 2017-10-16 13:18:17.404930 7f35a37fe700 -1 librbd::image::OpenRequest:
>> failed to refresh image: (2) No such file or directory
>> rbd: error opening image container1: (2) No such file or directory
>>
>>
>> I check to exported image checksum after export and before import, and it's
>> match, and I can restore three pools with one with 60 MB one with 1.2 GB and
>> one with 25 GB of data.
>>
>> The problematic has 60 GB data.
>>
>> The pool store LXD container images.
>>
>> Any help is highly appreciated.
>>
>> --
>> Ákos
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>

-- 
Ákos


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados export/import fail

2017-10-16 Thread John Spray
On Mon, Oct 16, 2017 at 11:35 AM, Nagy Ákos  wrote:
> Hi,
>
> I want to upgrade my ceph from jewel to luminous, and switch to bluestore.
>
> For that I export the pools from old cluster:

This is not the way to do it.  You should convert your OSDs from
filestore to bluestore one by one, and let the data re-replicate to
the new OSDs.

Dumping data out of one Ceph cluster and into another will not work,
because things like RBD images record things like the ID of the pool
where their parent image is, and pool IDs are usually different
between clusters.
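
For a single OSD, that one-by-one conversion is roughly the following 
(commands are a sketch for Luminous with ceph-volume; the OSD id and device 
name are placeholders, adjust for your deployment tooling and wait for 
HEALTH_OK after each step):

ceph osd out 12                           # let data drain off this OSD
# wait until all PGs are active+clean again ('ceph -s')
systemctl stop ceph-osd@12
ceph osd purge 12 --yes-i-really-mean-it
ceph-volume lvm zap /dev/sdX              # wipe the old filestore disk
ceph-volume lvm create --bluestore --data /dev/sdX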

John

> rados export -p pool1 pool1.ceph
>
> and after upgrade and osd recreation:
>
> rados --create -p pool1 import pool1.ceph
>
> I can import the backup without error, but when I want  to map an image, I
> got error:
>
> rbd --image container1 --pool pool1 map
>
> rbd: sysfs write failed
> In some cases useful info is found in syslog - try "dmesg | tail".
> rbd: map failed: (2) No such file or directory
>
> dmesg | tail
>
> [160606.729840] rbd: image container1 : WARNING: kernel layering is
> EXPERIMENTAL!
> [160606.730675] libceph: tid 86731 pool does not exist
>
>
> When I try to get info about the image:
>
> rbd info pool1/container1
>
> 2017-10-16 13:18:17.404858 7f35a37fe700 -1
> librbd::image::RefreshParentRequest: failed to open parent image: (2) No
> such file or directory
> 2017-10-16 13:18:17.404903 7f35a37fe700 -1 librbd::image::RefreshRequest:
> failed to refresh parent image: (2) No such file or directory
> 2017-10-16 13:18:17.404930 7f35a37fe700 -1 librbd::image::OpenRequest:
> failed to refresh image: (2) No such file or directory
> rbd: error opening image container1: (2) No such file or directory
>
>
> I check to exported image checksum after export and before import, and it's
> match, and I can restore three pools with one with 60 MB one with 1.2 GB and
> one with 25 GB of data.
>
> The problematic has 60 GB data.
>
> The pool store LXD container images.
>
> Any help is highly appreciated.
>
> --
> Ákos
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rados export/import fail

2017-10-16 Thread Nagy Ákos
Hi,

I want to upgrade my ceph from jewel to luminous, and switch to bluestore.

For that I exported the pools from the old cluster:

rados export -p pool1 pool1.ceph

and after the upgrade and OSD recreation:

rados --create -p pool1 import pool1.ceph

I can import the backup without error, but when I try to map an image,
I get an error:

rbd --image container1 --pool pool1 map

rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail".
rbd: map failed: (2) No such file or directory

dmesg | tail

[160606.729840] rbd: image container1 : WARNING: kernel layering is
EXPERIMENTAL!
[160606.730675] libceph: tid 86731 pool does not exist


When I try to get info about the image:

rbd info pool1/container1

2017-10-16 13:18:17.404858 7f35a37fe700 -1
librbd::image::RefreshParentRequest: failed to open parent image: (2) No
such file or directory
2017-10-16 13:18:17.404903 7f35a37fe700 -1
librbd::image::RefreshRequest: failed to refresh parent image: (2) No
such file or directory
2017-10-16 13:18:17.404930 7f35a37fe700 -1 librbd::image::OpenRequest:
failed to refresh image: (2) No such file or directory
rbd: error opening image container1: (2) No such file or directory


I checked the exported image checksum after export and before import, and
it matches. I could restore three pools: one with 60 MB, one with 1.2 GB
and one with 25 GB of data.

The problematic one has 60 GB of data.

The pool stores LXD container images.

Any help is highly appreciated.

-- 
Ákos

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to get current min-compat-client setting

2017-10-16 Thread Wido den Hollander

> On 13 October 2017 at 10:22, Hans van den Bogert wrote:
> 
> 
> Hi, 
> 
> I’m in the middle of debugging some incompatibilities with an upgrade of 
> Proxmox which uses Ceph. At this point I’d like to know what my current value 
> is for the min-compat-client setting, which would’ve been set by:
> 
> ceph osd set-require-min-compat-client …
> 
> AFAIK, there is no direct get-* variant of the above command. Does anybody 
> know how I can retrieve the current setting with perhaps lower level 
> commands/tools?
> 

It's in the OSDMap, see these commands:

root@alpha:~# ceph osd set-require-min-compat-client luminous
set require_min_compat_client to luminous
root@alpha:~# ceph osd dump|grep require_min_compat_client
require_min_compat_client luminous
root@alpha:~#

'ceph osd dump' will show it and you can grep for it.
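
If you want it in a machine-readable form, the JSON dump should carry the 
same field (the field name is assumed to match the plain-text output, and jq 
is of course optional):

root@alpha:~# ceph osd dump --format json | jq -r .require_min_compat_client
luminous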

Wido

> Thanks, 
> 
> Hans
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore "separate" WAL and DB

2017-10-16 Thread Wido den Hollander
I thought I'd pick up on this older thread instead of starting a new one.

For the WAL, something between 512MB and 2GB should be sufficient, as Mark 
Nelson explained in a different thread.

The DB, however, I'm not certain about at this moment. The general consensus 
seems to be "use as much as available", but that could be a lot of space.

The DB will roll over to the DATA partition in case it grows too large.

There is a relation between the amount of objects and the size of the DB. For 
each object (regardless of the size) you will have a RocksDB entry and that 
will occupy space in the DB.

Hopefully Mark (or somebody else) can shine some light on this. Eg, is 10GB 
sufficient for DB? 20GB? 100GB? What is a reasonable amount?
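
In the meantime, one way to at least see how much DB space an OSD actually 
uses, and whether it has already rolled over onto the slow device, is the 
BlueFS perf counters on the admin socket (counter names as I remember them 
from 12.2, so treat this as a sketch):

ceph daemon osd.0 perf dump | python -m json.tool | \
    grep -E '"(db|wal|slow)_(total|used)_bytes"'

A non-zero slow_used_bytes would mean RocksDB has spilled onto the data 
device.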

Wido

> On 20 September 2017 at 20:50, Alejandro Comisario wrote:
> 
> 
> Bump! i would love the thoughts about this !
> 
> On Fri, Sep 8, 2017 at 7:44 AM, Richard Hesketh <
> richard.hesk...@rd.bbc.co.uk> wrote:
> 
> > Hi,
> >
> > Reading the ceph-users list I'm obviously seeing a lot of people talking
> > about using bluestore now that Luminous has been released. I note that many
> > users seem to be under the impression that they need separate block devices
> > for the bluestore data block, the DB, and the WAL... even when they are
> > going to put the DB and the WAL on the same device!
> >
> > As per the docs at http://docs.ceph.com/docs/master/rados/configuration/
> > bluestore-config-ref/ this is nonsense:
> >
> > > If there is only a small amount of fast storage available (e.g., less
> > than a gigabyte), we recommend using it as a WAL device. If there is more,
> > provisioning a DB
> > > device makes more sense. The BlueStore journal will always be placed on
> > the fastest device available, so using a DB device will provide the same
> > benefit that the WAL
> > > device would while also allowing additional metadata to be stored there
> > (if it will fix). [sic, I assume that should be "fit"]
> >
> > I understand that if you've got three speeds of storage available, there
> > may be some sense to dividing these. For instance, if you've got lots of
> > HDD, a bit of SSD, and a tiny NVMe available in the same host, data on HDD,
> > DB on SSD and WAL on NVMe may be a sensible division of data. That's not
> > the case for most of the examples I'm reading; they're talking about
> > putting DB and WAL on the same block device, but in different partitions.
> > There's even one example of someone suggesting to try partitioning a single
> > SSD to put data/DB/WAL all in separate partitions!
> >
> > Are the docs wrong and/or I am missing something about optimal bluestore
> > setup, or do people simply have the wrong end of the stick? I ask because
> > I'm just going through switching all my OSDs over to Bluestore now and I've
> > just been reusing the partitions I set up for journals on my SSDs as DB
> > devices for Bluestore HDDs without specifying anything to do with the WAL,
> > and I'd like to know sooner rather than later if I'm making some sort of
> > horrible mistake.
> >
> > Rich
> > --
> > Richard Hesketh
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> 
> 
> -- 
> *Alejandro Comisario*
> *CTO | NUBELIU*
> E-mail: alejandro@nubeliu.com   Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Backup VM (Base image + snapshot)

2017-10-16 Thread Richard Hesketh
On 16/10/17 03:40, Alex Gorbachev wrote:
> On Sat, Oct 14, 2017 at 12:25 PM, Oscar Segarra  
> wrote:
>> Hi,
>>
>> In my VDI environment I have configured the suggested ceph
>> design/arquitecture:
>>
>> http://docs.ceph.com/docs/giant/rbd/rbd-snapshot/
>>
>> Where I have a Base Image + Protected Snapshot + 100 clones (one for each
>> persistent VDI).
>>
>> Now, I'd like to configure a backup script/mechanism to perform backups of
>> each persistent VDI VM to an external (non ceph) device, like NFS or
>> something similar...
>>
>> Then, some questions:
>>
>> 1.- Does anybody have been able to do this kind of backups?
> 
> Yes, we have been using export-diff successfully (note this is off a
> snapshot and not a clone) to back up and restore ceph images to
> non-ceph storage.  You can use merge-diff to create "synthetic fulls"
> and even do some basic replication to another cluster.
> 
> http://ceph.com/geen-categorie/incremental-snapshots-with-rbd/
> 
> http://docs.ceph.com/docs/master/dev/rbd-export/
> 
> http://cephnotes.ksperis.com/blog/2014/08/12/rbd-replication
> 
> --
> Alex Gorbachev
> Storcium
> 
>> 2.- Is it possible to export BaseImage in qcow2 format and snapshots in
>> qcow2 format as well as "linked clones" ?
>> 3.- Is it possible to export the Base Image in raw format, snapshots in raw
>> format as well and, when recover is required, import both images and
>> "relink" them?
>> 4.- What is the suggested solution for this scenario?
>>
>> Thanks a lot everybody!

In my setup I back up complete raw disk images individually to file, because 
then they're easier to manually inspect and grab data off in the event of 
catastrophic cluster failure. I haven't personally bothered trying to preserve 
the layering between master/clone images in backup form; that sounds like a 
bunch of effort and by inspection the amount of space it'd actually save in my 
use case is really minimal.

However I do use export-diff in order to make backups efficient - a rolling 
snapshot on each RBD is used to export the day's diff out of the cluster and 
then the ceph_apply_diff utility from https://gp2x.org/ceph/ is used to apply 
that diff to the raw image file (though I did patch it to work with streaming 
input and eliminate the necessity for a temporary file containing the diff). 
There are a handful of very large RBDs in my cluster for which exporting the 
full disk image takes a prohibitively long time, which made leveraging diffs 
necessary.
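
A minimal sketch of that rolling-snapshot scheme (image names, paths and the 
snapshot naming are made up for illustration, not the actual scripts):

TODAY=$(date +%F)
YESTERDAY=$(date -d yesterday +%F)
rbd snap create rbd/vm-disk@backup-$TODAY
rbd export-diff --from-snap backup-$YESTERDAY \
    rbd/vm-disk@backup-$TODAY /backup/vm-disk-$TODAY.diff
rbd snap rm rbd/vm-disk@backup-$YESTERDAY

The resulting diff file is what then gets applied to the offline raw image 
outside the cluster.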

For a while, I was instead just exporting diffs and using merge-diff to munge 
them together into big super-diffs, and the restoration procedure would be to 
apply the merged diff to a freshly made image in the cluster. This worked, but 
it is a more fiddly recovery process; importing complete disk images is easier. 
I don't think it's possible to create two images in the cluster and then link 
them into a layering relationship; you'd have to import the base image, clone 
it, and then import a diff onto that clone if you wanted to recreate the 
original layering.
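
For completeness, merging two diffs into one "super-diff" looks roughly like 
this (paths illustrative; writing to a temporary file avoids clobbering an 
input mid-merge):

rbd merge-diff /backup/vm-disk-full.diff /backup/vm-disk-2017-10-16.diff \
    /backup/vm-disk-full.new
mv /backup/vm-disk-full.new /backup/vm-disk-full.diff

The merged file can later be applied to a freshly created image with 
'rbd import-diff', as described above.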

Rich



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Matteo Dacrema

In the meantime I found out why this happened.
For some reason the 3 OSDs were not marked out of the cluster like the 
others, and this caused the cluster not to reassign PGs to other OSDs.

This is strange because I left the 3 OSDs down for two days.
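
If it happens again, a couple of things that could be worth checking (a 
sketch, not a diagnosis; option names from memory): whether the noout flag 
is set, and what the monitors' down-out settings are, e.g. on a monitor host:

ceph osd dump | grep flags
ceph daemon mon.controller001 config get mon_osd_down_out_interval
ceph daemon mon.controller001 config get mon_osd_down_out_subtree_limit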


> On 16 Oct 2017, at 10:21, Matteo Dacrema wrote:
> 
> Hi all,
> 
> I’m testing Ceph Luminous 12.2.1 installed with ceph ansible.
> 
> Doing some failover tests I noticed that when I kill an osd or and hosts Ceph 
> doesn’t recover automatically remaining in this state until I bring OSDs or 
> host back online.
> I’ve 3 pools volumes, cephfs_data and cephfs_metadata with size 3 and 
> min_size 1.
> 
> Is there something I’m missing ?
> 
> Below some cluster info.
> 
> Thank you all
> Regards
> 
> Matteo
> 
> 
>  cluster:
>id: ab7cb890-ee21-484e-9290-14b9e5e85125
>health: HEALTH_WARN
>3 osds down
>Degraded data redundancy: 2842/73686 objects degraded (3.857%), 
> 318 pgs unclean, 318 pgs degraded, 318 pgs undersized
> 
>  services:
>mon: 3 daemons, quorum controller001,controller002,controller003
>mgr: controller001(active), standbys: controller002, controller003
>mds: cephfs-1/1/1 up  {0=controller002=up:active}, 2 up:standby
>osd: 77 osds: 74 up, 77 in
> 
>  data:
>pools:   3 pools, 4112 pgs
>objects: 36843 objects, 142 GB
>usage:   470 GB used, 139 TB / 140 TB avail
>pgs: 2842/73686 objects degraded (3.857%)
> 3794 active+clean
> 318  active+undersized+degraded
> 
> 
> ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
> -1   140.02425 root default
> -920.00346 host storage001
>  0   hdd   1.81850 osd.0   up  1.0 1.0
>  6   hdd   1.81850 osd.6   up  1.0 1.0
>  8   hdd   1.81850 osd.8   up  1.0 1.0
> 11   hdd   1.81850 osd.11  up  1.0 1.0
> 14   hdd   1.81850 osd.14  up  1.0 1.0
> 18   hdd   1.81850 osd.18  up  1.0 1.0
> 24   hdd   1.81850 osd.24  up  1.0 1.0
> 28   hdd   1.81850 osd.28  up  1.0 1.0
> 33   hdd   1.81850 osd.33  up  1.0 1.0
> 40   hdd   1.81850 osd.40  up  1.0 1.0
> 45   hdd   1.81850 osd.45  up  1.0 1.0
> -720.00346 host storage002
>  1   hdd   1.81850 osd.1   up  1.0 1.0
>  5   hdd   1.81850 osd.5   up  1.0 1.0
>  9   hdd   1.81850 osd.9   up  1.0 1.0
> 21   hdd   1.81850 osd.21  up  1.0 1.0
> 22   hdd   1.81850 osd.22  up  1.0 1.0
> 23   hdd   1.81850 osd.23  up  1.0 1.0
> 35   hdd   1.81850 osd.35  up  1.0 1.0
> 36   hdd   1.81850 osd.36  up  1.0 1.0
> 38   hdd   1.81850 osd.38  up  1.0 1.0
> 42   hdd   1.81850 osd.42  up  1.0 1.0
> 49   hdd   1.81850 osd.49  up  1.0 1.0
> -1120.00346 host storage003
> 27   hdd   1.81850 osd.27  up  1.0 1.0
> 31   hdd   1.81850 osd.31  up  1.0 1.0
> 32   hdd   1.81850 osd.32  up  1.0 1.0
> 37   hdd   1.81850 osd.37  up  1.0 1.0
> 44   hdd   1.81850 osd.44  up  1.0 1.0
> 46   hdd   1.81850 osd.46  up  1.0 1.0
> 48   hdd   1.81850 osd.48  up  1.0 1.0
> 53   hdd   1.81850 osd.53  up  1.0 1.0
> 54   hdd   1.81850 osd.54  up  1.0 1.0
> 56   hdd   1.81850 osd.56  up  1.0 1.0
> 59   hdd   1.81850 osd.59  up  1.0 1.0
> -320.00346 host storage004
>  2   hdd   1.81850 osd.2   up  1.0 1.0
>  4   hdd   1.81850 osd.4   up  1.0 1.0
> 10   hdd   1.81850 osd.10  up  1.0 1.0
> 16   hdd   1.81850 osd.16  up  1.0 1.0
> 17   hdd   1.81850 osd.17  up  1.0 1.0
> 19   hdd   1.81850 osd.19  up  1.0 1.0
> 26   hdd   1.81850 osd.26  up  1.0 1.0
> 29   hdd   1.81850 osd.29  up  1.0 1.0
> 39   hdd   1.81850 osd.39  up  1.0 1.0
> 43   hdd   1.81850 osd.43  up  1.0 1.0
> 50   hdd   1.81850 osd.50  up  1.0 1.0
> -520.00346 host storage005
>  3   hdd   1.81850 osd.3   up  1.0 1.0
>  7   hdd   1.81850 osd.7   up  1.0 1.0
> 12   hdd   1.81850 osd.12  up  1.0 1.0
> 13   hdd   1.81850 osd.13  up  1.0 

[ceph-users] Ceph not recovering after osd/host failure

2017-10-16 Thread Matteo Dacrema
Hi all,

I’m testing Ceph Luminous 12.2.1 installed with ceph ansible.

Doing some failover tests I noticed that when I kill an OSD or a host, Ceph 
doesn’t recover automatically, remaining in this state until I bring the 
OSDs or the host back online.
I’ve 3 pools (volumes, cephfs_data and cephfs_metadata) with size 3 and 
min_size 1.

Is there something I’m missing?

Below some cluster info.

Thank you all
Regards

Matteo


  cluster:
id: ab7cb890-ee21-484e-9290-14b9e5e85125
health: HEALTH_WARN
3 osds down
Degraded data redundancy: 2842/73686 objects degraded (3.857%), 318 
pgs unclean, 318 pgs degraded, 318 pgs undersized

  services:
mon: 3 daemons, quorum controller001,controller002,controller003
mgr: controller001(active), standbys: controller002, controller003
mds: cephfs-1/1/1 up  {0=controller002=up:active}, 2 up:standby
osd: 77 osds: 74 up, 77 in

  data:
pools:   3 pools, 4112 pgs
objects: 36843 objects, 142 GB
usage:   470 GB used, 139 TB / 140 TB avail
pgs: 2842/73686 objects degraded (3.857%)
 3794 active+clean
 318  active+undersized+degraded


ID  CLASS WEIGHTTYPE NAME   STATUS REWEIGHT PRI-AFF
 -1   140.02425 root default
 -920.00346 host storage001
  0   hdd   1.81850 osd.0   up  1.0 1.0
  6   hdd   1.81850 osd.6   up  1.0 1.0
  8   hdd   1.81850 osd.8   up  1.0 1.0
 11   hdd   1.81850 osd.11  up  1.0 1.0
 14   hdd   1.81850 osd.14  up  1.0 1.0
 18   hdd   1.81850 osd.18  up  1.0 1.0
 24   hdd   1.81850 osd.24  up  1.0 1.0
 28   hdd   1.81850 osd.28  up  1.0 1.0
 33   hdd   1.81850 osd.33  up  1.0 1.0
 40   hdd   1.81850 osd.40  up  1.0 1.0
 45   hdd   1.81850 osd.45  up  1.0 1.0
 -720.00346 host storage002
  1   hdd   1.81850 osd.1   up  1.0 1.0
  5   hdd   1.81850 osd.5   up  1.0 1.0
  9   hdd   1.81850 osd.9   up  1.0 1.0
 21   hdd   1.81850 osd.21  up  1.0 1.0
 22   hdd   1.81850 osd.22  up  1.0 1.0
 23   hdd   1.81850 osd.23  up  1.0 1.0
 35   hdd   1.81850 osd.35  up  1.0 1.0
 36   hdd   1.81850 osd.36  up  1.0 1.0
 38   hdd   1.81850 osd.38  up  1.0 1.0
 42   hdd   1.81850 osd.42  up  1.0 1.0
 49   hdd   1.81850 osd.49  up  1.0 1.0
-1120.00346 host storage003
 27   hdd   1.81850 osd.27  up  1.0 1.0
 31   hdd   1.81850 osd.31  up  1.0 1.0
 32   hdd   1.81850 osd.32  up  1.0 1.0
 37   hdd   1.81850 osd.37  up  1.0 1.0
 44   hdd   1.81850 osd.44  up  1.0 1.0
 46   hdd   1.81850 osd.46  up  1.0 1.0
 48   hdd   1.81850 osd.48  up  1.0 1.0
 53   hdd   1.81850 osd.53  up  1.0 1.0
 54   hdd   1.81850 osd.54  up  1.0 1.0
 56   hdd   1.81850 osd.56  up  1.0 1.0
 59   hdd   1.81850 osd.59  up  1.0 1.0
 -320.00346 host storage004
  2   hdd   1.81850 osd.2   up  1.0 1.0
  4   hdd   1.81850 osd.4   up  1.0 1.0
 10   hdd   1.81850 osd.10  up  1.0 1.0
 16   hdd   1.81850 osd.16  up  1.0 1.0
 17   hdd   1.81850 osd.17  up  1.0 1.0
 19   hdd   1.81850 osd.19  up  1.0 1.0
 26   hdd   1.81850 osd.26  up  1.0 1.0
 29   hdd   1.81850 osd.29  up  1.0 1.0
 39   hdd   1.81850 osd.39  up  1.0 1.0
 43   hdd   1.81850 osd.43  up  1.0 1.0
 50   hdd   1.81850 osd.50  up  1.0 1.0
 -520.00346 host storage005
  3   hdd   1.81850 osd.3   up  1.0 1.0
  7   hdd   1.81850 osd.7   up  1.0 1.0
 12   hdd   1.81850 osd.12  up  1.0 1.0
 13   hdd   1.81850 osd.13  up  1.0 1.0
 15   hdd   1.81850 osd.15  up  1.0 1.0
 20   hdd   1.81850 osd.20  up  1.0 1.0
 25   hdd   1.81850 osd.25  up  1.0 1.0
 30   hdd   1.81850 osd.30  up  1.0 1.0
 34   hdd   1.81850 osd.34  up  1.0 1.0
 41   hdd   1.81850 osd.41  up  1.0 1.0
 47   hdd   1.81850 osd.47  up  1.0 1.0
-13