[ceph-users] CephFS performance.

2018-10-03 Thread jesper
Hi All.

First, thanks for the good discussion and the strong answers I've gotten so far.

Current cluster setup is 4 OSD hosts, each with 10 x 12TB 7.2K RPM drives, all
on 10GbitE, with metadata on rotating drives - 3x replication - 256GB memory
and 32+ cores in the OSD hosts. Drives sit behind a PERC controller, each disk
as a single-drive RAID0 with BBWC.

Planned changes:
- is to get 1-2 more OSD-hosts
- experiment with EC-pools for CephFS
- MDS onto a separate host and metadata onto SSDs.

I'm still struggling to get "non-cached" performance up to "hardware"
speed - whatever that means. I run an "fio" benchmark using 10GB files, 16
threads, 4M block size -- at which I can "almost" fill the 10GbitE NIC
sustainedly. In this configuration I would have expected to be "way above"
10Gbit speed, and thus to have the NIC not "almost" filled but fully filled -
could that be the metadata activity? But on reads of "big files" that
should not be much - right?

The above is actually OK for production, so it's not a big issue - just
information.
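
For reference, the benchmark is roughly the following (the exact fio flags
here are illustrative, not the literal command used):

  fio --name=seqread --directory=/ceph/fio-test --rw=read --bs=4M \
      --size=10G --numjobs=16 --direct=1 --group_reporting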

Single-threaded performance is still struggling:

Cold HDD (read from disk at the NFS-server end) / NFS performance:

jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   15.86 GB in 00h00m27.53s:  589.88 MB/second


Local page cache (just to show it isn't the profiling tool that is the
limitation):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary:
Piped   29.24 GB in 00h00m09.15s:    3.19 GB/second
jk@zebra03:~$

Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary:
Piped   36.79 GB in 00h03m47.66s:  165.49 MB/second

Can block/stripe-size be tuned? Does it make sense?
Does read-ahead on the CephFS kernel-client need tuning?
What performance are other people seeing?
Other thoughts - recommendations?
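
For context, one read-ahead knob on the kernel client is the rasize mount
option (the value below is purely illustrative):

  # bump client read-ahead to 64MB; rasize is in bytes, the default is 8MB
  mount -t ceph cephmon1,cephmon2:/ /ceph -o name=cephfs,secretfile=/etc/ceph/cephfs.secret,rasize=67108864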

On some of the shares we're storing pretty large files (GB size) and
need the backup to move them to tape - so it is preferred to be able
to fill an LTO6 drive's write speed with a single thread.

40-ish 7.2K RPM drives should add up to more than the above, right?
This is the only load currently being put on the cluster, plus ~100MB/s of
recovery traffic.


Thanks.

Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
On Thu, 4 Oct 2018, Goktug Yildirim wrote:
> This is our cluster state right now. I can reach rbd list and thats good! 
> Thanks a lot Sage!!!
> ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/

Progress!  Not out of the woods yet, though...

> As you can see we have 2 unfound pg since some of our OSDs can not start. 58 
> OSD gives different errors.
> How can I fix these OSD's? If I remember correctly it should not be so much 
> trouble.
> 
> These are OSDs' failed logs.
> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> https://paste.ubuntu.com/p/pkRdVjCH4D/

These are both failing in rocksdb code, with something like

Can't access /032949.sst: NotFound:

Can you check whether that .sst file actually exists?  Might be a 
weird path issue.
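
One way to check (assuming these are BlueStore OSDs, where the rocksdb files
live inside BlueFS rather than on a plain filesystem) is to export BlueFS and
look for the file - a sketch, the OSD id is made up:

  ceph-bluestore-tool bluefs-export --path /var/lib/ceph/osd/ceph-12 --out-dir /tmp/osd12-bluefs
  ls -lR /tmp/osd12-bluefs | grep 032949    # note: the export can be large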

> https://paste.ubuntu.com/p/zJTf2fzSj9/
> https://paste.ubuntu.com/p/xpJRK6YhRX/

These are failing in the rocksdb CheckConsistency code.  Not sure what to 
make of that.

> https://paste.ubuntu.com/p/SY3576dNbJ/
> https://paste.ubuntu.com/p/smyT6Y976b/

These are failing in BlueStore code.  The ceph-bluestore-tool fsck may help 
here - can you give it a shot?
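
A minimal sketch of that (run against the stopped OSD; the id is made up):

  ceph-bluestore-tool fsck --path /var/lib/ceph/osd/ceph-12
  # there is also a --deep option for a much slower but more thorough check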

sage


> 
> > On 3 Oct 2018, at 21:37, Sage Weil  wrote:
> > 
> > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >> I'm so sorry about that I missed "out" parameter. My bad..
> >> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> > 
> > Excellent, thanks.  That looks like it confirms the problem is that teh 
> > recovery tool didn't repopulate the creating pgs properly.
> > 
> > If you take that 30 byte file I sent earlier (as hex) and update the 
> > osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> > properly, and then inject it on the 3 mons, that should get you past this 
> > hump (and hopefully back up!).
> > 
> > sage
> > 
> > 
> >> 
> >> Sage Weil  şunları yazdı (3 Eki 2018 21:13):
> >> 
> >>> I bet the kvstore output it in a hexdump format?  There is another option 
> >>> to get the raw data iirc
> >>> 
> >>> 
> >>> 
>  On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
>   wrote:
>  I changed the file name to make it clear.
>  When I use your command with "+decode"  I'm getting an error like this:
>  
>  ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>  error: buffer::malformed_input: void 
>  creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer 
>  understand old encoding version 2 < 111
>  
>  My ceph version: 13.2.2
>  
>  3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
>  yazdı:
> > On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >> If I didn't do it wrong, I got the output as below.
> >> 
> >> ceph-kvstore-tool rocksdb 
> >> /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating 
> >> creating > dump
> >> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
> >> families: [default]
> >> 
> >> ceph-dencoder type creating_pgs_t import dump dump_json
> > 
> > Sorry, should be
> > 
> > ceph-dencoder type creating_pgs_t import dump decode dump_json
> > 
> > s
> > 
> >> {
> >>"last_scan_epoch": 0,
> >>"creating_pgs": [],
> >>"queue": [],
> >>"created_pools": []
> >> }
> >> 
> >> You can find the "dump" link below.
> >> 
> >> dump: 
> >> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >> 
> >> 
> >> Sage Weil  şunları yazdı (3 Eki 2018 18:45):
> >> 
>  On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>  We are starting to work on it. First step is getting the structure 
>  out and dumping the current value as you say.
>  
>  And you were correct we did not run force_create_pg.
> >>> 
> >>> Great.
> >>> 
> >>> So, eager to see what the current structure is... please attach once 
> >>> you 
> >>> have it.
> >>> 
> >>> The new replacement one should look like this (when hexdump -C'd):
> >>> 
> >>>   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
> >>> ||
> >>> 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
> >>> |..B...|
> >>> 001e
> >>> 
> >>> ...except that from byte 6 you want to put in a recent OSDMap epoch, 
> >>> in 
> >>> hex, little endian (least significant byte first), in place of the 
> >>> 0x10 
> >>> that is there now.  It should dump like this:
> >>> 
> >>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>> {
> >>>   "last_scan_epoch": 16,   <--- but with a recent epoch here
> >>>   "creating_pgs": [],
> >>>   "queue": [],
> >>>   "created_pools": [
> >>>   66
> >>>   ]
> >>> }
> >>> 
> >>> sage
> >>> 
> >>> 
>  
> > On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> > 
> > On Wed, 3 Oct 2018, Goktug Yildirim 

[ceph-users] provide cephfs to mutiple project

2018-10-03 Thread Joshua Chen
Hello all,
  I am almost ready to provide storage (CephFS in the beginning) to my
colleagues. They belong to different main projects and, according to the
budgets previously claimed, should get different capacities. For example
ProjectA will have 50TB and ProjectB will have 150TB.

I chose CephFS because it has good enough throughput compared to RBD.

But I would like clients in ProjectA to only see 50TB of mount space (via
Linux df -h, maybe) and ProjectB clients to see 150TB. So my questions are:
1. Is that possible - can CephFS make clients see different available space
respectively?

2. What is a good setup so that ProjectA has a reasonable mount source and
ProjectB has its own?

for example
on a ProjectA client, root will do
mount -t ceph cephmon1,cephmon2:/ProjectA /mnt/ProjectA

but can not

mount -t ceph cephmon1,cephmon2:/ProjectB /mnt/ProjectB

(and cannot mount the root /, or /ProjectB, which is not their area)

Or what is the official production-style setup for this need?
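
One approach I have seen mentioned (I am not sure it is the official one) is
per-directory quotas plus path-restricted client caps - a rough sketch, with
example names and sizes:

  # cap the ProjectA directory at ~50TB; quota-aware clients report this as the size in df
  setfattr -n ceph.quota.max_bytes -v 50000000000000 /mnt/cephfs/ProjectA
  # give ProjectA clients a key that can only mount /ProjectA
  ceph fs authorize cephfs client.projecta /ProjectA rw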

Thanks in advance
Cheers
Joshua
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug Yildirim
This is our cluster state right now. I can reach rbd list and that's good!
Thanks a lot Sage!!!
ceph -s: https://paste.ubuntu.com/p/xBNPr6rJg2/

As you can see, we have 2 unfound PGs since some of our OSDs can not start. 58
OSDs give different errors.
How can I fix these OSDs? If I remember correctly it should not be so much
trouble.

These are the failed OSDs' logs.
https://paste.ubuntu.com/p/ZfRD5ZtvpS/
https://paste.ubuntu.com/p/pkRdVjCH4D/
https://paste.ubuntu.com/p/zJTf2fzSj9/
https://paste.ubuntu.com/p/xpJRK6YhRX/
https://paste.ubuntu.com/p/SY3576dNbJ/
https://paste.ubuntu.com/p/smyT6Y976b/

> On 3 Oct 2018, at 21:37, Sage Weil  wrote:
> 
> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> I'm so sorry about that I missed "out" parameter. My bad..
>> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> 
> Excellent, thanks.  That looks like it confirms the problem is that teh 
> recovery tool didn't repopulate the creating pgs properly.
> 
> If you take that 30 byte file I sent earlier (as hex) and update the 
> osdmap epoch to the latest on the mon, confirm it decodes and dumps 
> properly, and then inject it on the 3 mons, that should get you past this 
> hump (and hopefully back up!).
> 
> sage
> 
> 
>> 
>> Sage Weil  şunları yazdı (3 Eki 2018 21:13):
>> 
>>> I bet the kvstore output it in a hexdump format?  There is another option 
>>> to get the raw data iirc
>>> 
>>> 
>>> 
 On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
  wrote:
 I changed the file name to make it clear.
 When I use your command with "+decode"  I'm getting an error like this:
 
 ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
 error: buffer::malformed_input: void 
 creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand 
 old encoding version 2 < 111
 
 My ceph version: 13.2.2
 
 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
 yazdı:
> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> If I didn't do it wrong, I got the output as below.
>> 
>> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ 
>> get osd_pg_creating creating > dump
>> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
>> families: [default]
>> 
>> ceph-dencoder type creating_pgs_t import dump dump_json
> 
> Sorry, should be
> 
> ceph-dencoder type creating_pgs_t import dump decode dump_json
> 
> s
> 
>> {
>>"last_scan_epoch": 0,
>>"creating_pgs": [],
>>"queue": [],
>>"created_pools": []
>> }
>> 
>> You can find the "dump" link below.
>> 
>> dump: 
>> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>> 
>> 
>> Sage Weil  şunları yazdı (3 Eki 2018 18:45):
>> 
 On Wed, 3 Oct 2018, Goktug Yildirim wrote:
 We are starting to work on it. First step is getting the structure out 
 and dumping the current value as you say.
 
 And you were correct we did not run force_create_pg.
>>> 
>>> Great.
>>> 
>>> So, eager to see what the current structure is... please attach once 
>>> you 
>>> have it.
>>> 
>>> The new replacement one should look like this (when hexdump -C'd):
>>> 
>>>   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
>>> ||
>>> 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
>>> |..B...|
>>> 001e
>>> 
>>> ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>> hex, little endian (least significant byte first), in place of the 0x10 
>>> that is there now.  It should dump like this:
>>> 
>>> $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>> {
>>>   "last_scan_epoch": 16,   <--- but with a recent epoch here
>>>   "creating_pgs": [],
>>>   "queue": [],
>>>   "created_pools": [
>>>   66
>>>   ]
>>> }
>>> 
>>> sage
>>> 
>>> 
 
> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> 
> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> Sage,
>> 
>> Pool 66 is the only pool it shows right now. This a pool created 
>> months ago.
>> ceph osd lspools
>> 66 mypool
>> 
>> As we recreated mon db from OSDs, the pools for MDS was unusable. So 
>> we deleted them.
>> After we create another cephfs fs and pools we started MDS and it 
>> stucked on creation. So we stopped MDS and removed fs and fs pools. 
>> Right now we do not have MDS running nor we have cephfs related 
>> things.
>> 
>> ceph fs dump
>> dumped fsmap epoch 1 e1
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client 
>> 

Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread solarflow99
That's strange, I recall only deleting the OSD from the crushmap, auth del, then
osd rm...


On Wed, Oct 3, 2018 at 2:54 PM Alfredo Deza  wrote:

> On Wed, Oct 3, 2018 at 3:52 PM Andras Pataki
>  wrote:
> >
> > Ok, understood (for next time).
> >
> > But just as an update/closure to my investigation - it seems this is a
> > feature of ceph-volume (that it can't just create an OSD from scratch
> > with a given ID), not of base ceph.  The underlying ceph command (ceph
> > osd new) very happily accepts an osd-id as an extra optional argument
> > (after the fsid), and creates and osd with the given ID.  In fact, a
> > quick change to ceph_volume (create_id function in prepare.py) will make
> > ceph-volume recreate the OSD with a given ID.  I'm not a ceph-volume
> > expert, but a feature to create an OSD with a given ID from scratch
> > would be nice (given that the underlying raw ceph commands already
> > support it).
>
> That is something that I wasn't aware of, thanks for bringing it up.
> I've created an issue on the tracker to accommodate for that behavior:
>
> http://tracker.ceph.com/issues/36307
>
> >
> > Andras
> >
> > On 10/3/18 11:41 AM, Alfredo Deza wrote:
> > > On Wed, Oct 3, 2018 at 11:23 AM Andras Pataki
> > >  wrote:
> > >> Thanks - I didn't realize that was such a recent fix.
> > >>
> > >> I've now tried 12.2.8, and perhaps I'm not clear on what I should have
> > >> done to the OSD that I'm replacing, since I'm getting the error "The
> osd
> > >> ID 747 is already in use or does not exist.".  The case is clearly the
> > >> latter, since I've completely removed the old OSD (osd crush remove,
> > >> auth del, osd rm, wipe disk).  Should I have done something different
> > >> (i.e. not remove the OSD completely)?
> > > Yeah, you completely removed it so now it can't be re-used. This is
> > > the proper way if wanting to re-use the ID:
> > >
> > >
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd
> > >
> > > Basically:
> > >
> > >  ceph osd destroy {id} --yes-i-really-mean-it
> > >
> > >> Searching the docs I see a command 'ceph osd destroy'.  What does that
> > >> do (compared to my removal procedure, osd crush remove, auth del, osd
> rm)?
> > >>
> > >> Thanks,
> > >>
> > >> Andras
> > >>
> > >>
> > >> On 10/3/18 10:36 AM, Alfredo Deza wrote:
> > >>> On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
> > >>>  wrote:
> >  After replacing failing drive I'd like to recreate the OSD with the
> same
> >  osd-id using ceph-volume (now that we've moved to ceph-volume from
> >  ceph-disk).  However, I seem to not be successful.  The command I'm
> using:
> > 
> >  ceph-volume lvm prepare --bluestore --osd-id 747 --data
> H901D44/H901D44
> >  --block.db /dev/disk/by-partlabel/H901J44
> > 
> >  But it created an OSD the ID 601, which was the lowest it could
> allocate
> >  and ignored the 747 apparently.  This is with ceph 12.2.7. Any
> ideas?
> > >>> Yeah, this was a problem that was fixed and released as part of
> 12.2.8
> > >>>
> > >>> The tracker issue is: http://tracker.ceph.com/issues/24044
> > >>>
> > >>> The Luminous PR is https://github.com/ceph/ceph/pull/23102
> > >>>
> > >>> Sorry for the trouble!
> >  Andras
> > 
> >  ___
> >  ceph-users mailing list
> >  ceph-users@lists.ceph.com
> >  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-03 Thread solarflow99
I use the same configuration you have, and I plan on using bluestore.  My
SSDs are only 240GB and it worked with filestore all this time, I suspect
bluestore should be fine too.


On Wed, Oct 3, 2018 at 4:25 AM Massimo Sgaravatto <
massimo.sgarava...@gmail.com> wrote:

> Hi
>
> I have a ceph cluster, running luminous, composed of 5 OSD nodes, which is
> using filestore.
> Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM, 10x6TB SATA disk
> + 2x200GB SSD disk (then I have 2 other disks in RAID for the OS), 10 Gbps.
> So each SSD disk is used for the journal for 5 OSDs. With this
> configuration everything is running smoothly ...
>
>
> We are now buying some new storage nodes, and I am trying to buy something
> which is bluestore compliant. So the idea is to consider a configuration
> something like:
>
> - 10 SATA disks (8TB / 10TB / 12TB each. TBD)
> - 2 processor (~ 10 core each)
> - 64 GB of RAM
> - 2 SSD to be used for WAL+DB
> - 10 Gbps
>
> For what concerns the size of the SSD disks I read in this mailing list
> that it is suggested to have at least 10GB of SSD disk/10TB of SATA disk.
>
>
> So, the questions:
>
> 1) Does this hardware configuration seem reasonable ?
>
> 2) Are there problems to live (forever, or until filestore deprecation)
> with some OSDs using filestore (the old ones) and some OSDs using bluestore
> (the new ones) ?
>
> 3) Would you suggest to update to bluestore also the old OSDs, even if the
> available SSDs are too small (they don't satisfy the "10GB of SSD disk/10TB
> of SATA disk" rule) ?
>
> Thanks, Massimo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] hardware heterogeneous in same pool

2018-10-03 Thread Jonathan D. Proulx
On Wed, Oct 03, 2018 at 07:09:30PM -0300, Bruno Carvalho wrote:
:Hi Cephers, I would like to know how you are growing the cluster.
:
:Using dissimilar hardware in the same pool or creating a pool for each
:different hardware group.
:
:What problem would I have many problems using different hardware (CPU,
:memory, disk) in the same pool?

I've been growing with new hardware in old pools.

Due to the way RBD gets smeared across the disks, your performance is
almost always bottlenecked by the slowest storage location.

If you're just adding slightly newer, slightly faster hardware this is
OK, as most of the performance gain in that case is from spreading
wider, not so much from the individual drive performance.

But if you are adding a faster technology like going from
spinning disk to ssd you do want to think about how to transition.

I recently added SSD to a previously all-HDD cluster (well, HDD data
with SSD WAL/DB).  For this I did fiddle with crush rules. First I made
the existing rules require HDD class devices, which should have been a
noop in my mind but actually moved 90% of my data.  The folks at CERN
made a similar discovery before me and even (I think) worked out a way
to avoid it - see
http://lists.ceph.com/pipermail/ceph-large-ceph.com/2018-June/000113.html

After that I made new rules that took one SSD and two HDD for each
replica set (in addition to spreading across racks or servers or
whatever) and, after applying the new rule to the pools I use for Nova
ephemeral storage and Cinder volumes, I set the SSD OSDs to have high
"primary affinity" and the HDDs to have low "primary affinity".

In the end this means the SSDs serve reads and writes, while writes to
the HDD replicas are buffered by the SSD WAL, so both reads and writes
are relatively fast (we'd previously been suffering on reads due to
IO load).
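
Roughly, the primary-affinity part is just the following (the OSD ids are
made up):

  ceph osd primary-affinity osd.42 1.0    # SSD: prefer as primary
  ceph osd primary-affinity osd.7 0.0     # HDD: avoid as primary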

I left Glance images on HDD only, as those don't require much
performance in my world; same with RGW object storage, though for some
that may be performance sensitive.

The plan forward is more SSD to replace HDD, probably by first
getting enough to transition the ephemeral drives, then a set to move
block storage, then the rest over the next year or two.

The mixed SSD/HDD was a big win for us though so we're happy with that
for now.

scale matters with this so we have:
245 OSDs in 12 servers
627 TiB RAW storage (267 TiB used)
19.44 M objects


hope that helps,
-Jon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hardware heterogeneous in same pool

2018-10-03 Thread Bruno Carvalho
Hi Cephers, I would like to know how you are growing the cluster.

Using dissimilar hardware in the same pool or creating a pool for each
different hardware group.

What problems would I have using different hardware (CPU,
memory, disk) in the same pool?

Could someone share their experience with OpenStack?

Att,

Bruno Carvalho
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Alfredo Deza
On Wed, Oct 3, 2018 at 3:52 PM Andras Pataki
 wrote:
>
> Ok, understood (for next time).
>
> But just as an update/closure to my investigation - it seems this is a
> feature of ceph-volume (that it can't just create an OSD from scratch
> with a given ID), not of base ceph.  The underlying ceph command (ceph
> osd new) very happily accepts an osd-id as an extra optional argument
> (after the fsid), and creates and osd with the given ID.  In fact, a
> quick change to ceph_volume (create_id function in prepare.py) will make
> ceph-volume recreate the OSD with a given ID.  I'm not a ceph-volume
> expert, but a feature to create an OSD with a given ID from scratch
> would be nice (given that the underlying raw ceph commands already
> support it).

That is something that I wasn't aware of, thanks for bringing it up.
I've created an issue on the tracker to accommodate that behavior:

http://tracker.ceph.com/issues/36307

>
> Andras
>
> On 10/3/18 11:41 AM, Alfredo Deza wrote:
> > On Wed, Oct 3, 2018 at 11:23 AM Andras Pataki
> >  wrote:
> >> Thanks - I didn't realize that was such a recent fix.
> >>
> >> I've now tried 12.2.8, and perhaps I'm not clear on what I should have
> >> done to the OSD that I'm replacing, since I'm getting the error "The osd
> >> ID 747 is already in use or does not exist.".  The case is clearly the
> >> latter, since I've completely removed the old OSD (osd crush remove,
> >> auth del, osd rm, wipe disk).  Should I have done something different
> >> (i.e. not remove the OSD completely)?
> > Yeah, you completely removed it so now it can't be re-used. This is
> > the proper way if wanting to re-use the ID:
> >
> > http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd
> >
> > Basically:
> >
> >  ceph osd destroy {id} --yes-i-really-mean-it
> >
> >> Searching the docs I see a command 'ceph osd destroy'.  What does that
> >> do (compared to my removal procedure, osd crush remove, auth del, osd rm)?
> >>
> >> Thanks,
> >>
> >> Andras
> >>
> >>
> >> On 10/3/18 10:36 AM, Alfredo Deza wrote:
> >>> On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
> >>>  wrote:
>  After replacing failing drive I'd like to recreate the OSD with the same
>  osd-id using ceph-volume (now that we've moved to ceph-volume from
>  ceph-disk).  However, I seem to not be successful.  The command I'm 
>  using:
> 
>  ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
>  --block.db /dev/disk/by-partlabel/H901J44
> 
>  But it created an OSD the ID 601, which was the lowest it could allocate
>  and ignored the 747 apparently.  This is with ceph 12.2.7. Any ideas?
> >>> Yeah, this was a problem that was fixed and released as part of 12.2.8
> >>>
> >>> The tracker issue is: http://tracker.ceph.com/issues/24044
> >>>
> >>> The Luminous PR is https://github.com/ceph/ceph/pull/23102
> >>>
> >>> Sorry for the trouble!
>  Andras
> 
>  ___
>  ceph-users mailing list
>  ceph-users@lists.ceph.com
>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread Sage Weil
On Tue, 2 Oct 2018, jes...@krogh.cc wrote:
> Hi.
> 
> Based on some recommendations we have setup our CephFS installation using
> bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
> server - 100TB-ish size.
> 
> Current setup is - a sizeable Linux host with 512GB of memory - one large
> Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.
> 
> Since our "hot" dataset is < 400GB we can actually serve the hot data
> directly out of the host page-cache and never really touch the "slow"
> underlying drives. Except when new bulk data are written where a Perc with
> BBWC is consuming the data.
> 
> In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host
> OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
> it is really hard to create a synthetic test where the hot data does not
> end up being read out of the underlying disks. Yes, the
> client side page cache works very well, but in our scenario we have 30+
> hosts pulling the same data over NFS.
> 
> Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
> the recommendation to make an SSD "overlay" on the slow drives?
> 
> Thoughts?

1. This sounds like it is primarily a matter of configuring the bluestore 
cache size.  This is the main downside of bluestore: it doesn't magically 
use any available RAM as a cache (like the OS page cache).

2. There are two other important options that control bluestore cache 
behavior:

 bluestore_default_buffered_read (default true)
 bluestore_default_buffered_write (default false)

Given your description it sounds like the default is fine: newly written 
data won't land in cache, but once it is read it will be there.  If you 
want recent writes to land in cache you can change the second option to 
true.
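
A minimal ceph.conf sketch of the knobs above plus the cache size (the
option choice and the 10 GB figure are illustrative, not a recommendation):

 [osd]
 # size the bluestore cache to what the host can spare (default is ~1GB per HDD OSD)
 bluestore_cache_size_hdd = 10737418240
 # uncomment if you want recent writes to land in cache:
 #bluestore_default_buffered_write = true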

3. Because we don't use the page cache, an OSD restart also drops the 
cache, so be sure to allow things to warm up after a restart before 
drawing conclusions about steady-state performance.

Hope that helps!
sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread Ronny Aasen

On 03.10.2018 20:10, jes...@krogh.cc wrote:

Your use case sounds like it might profit from the RADOS cache tier
feature. It's a rarely used feature because it only works in very
specific circumstances. But your scenario sounds like it might work.
Definitely worth giving it a try. Also, dm-cache with LVM *might*
help.
But if your active working set is really just 400GB: the Bluestore cache
should handle this just fine. Don't worry about "unequal"
distribution, every 4MB chunk of every file will go to a random OSD.

I tried it out - and will do more, but initial tests didn't really
convince me. I'll try more.


One very powerful and simple optimization is moving the metadata pool
to SSD only. Even if it's just 3 small but fast SSDs; that can make a
huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon.

Can I:
1) Add disks
2) Create pool
3) stop all MDS's
4) rados cppool
5) Start MDS

.. Yes, that's a cluster-down on CephFS but shouldn't take long. Or is
there a better guide?


This post
https://ceph.com/community/new-luminous-crush-device-classes/
and this document
http://docs.ceph.com/docs/master/rados/operations/pools/

explain how the OSD device class is used to define a CRUSH placement rule.
You can then set the crush_rule on the pool and Ceph will move the
data. No downtime needed.
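
For example, a minimal sketch (assuming the metadata pool is named
cephfs_metadata and the new SSDs come up with device class "ssd"):

  ceph osd crush rule create-replicated replicated-ssd default host ssd
  ceph osd pool set cephfs_metadata crush_rule replicated-ssd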


kind regards
Ronny Aasen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki

Ok, understood (for next time).

But just as an update/closure to my investigation - it seems this is a 
limitation of ceph-volume (that it can't just create an OSD from scratch 
with a given ID), not of base Ceph.  The underlying ceph command (ceph 
osd new) very happily accepts an osd-id as an extra optional argument 
(after the fsid), and creates an OSD with the given ID.  In fact, a 
quick change to ceph_volume (the create_id function in prepare.py) will make 
ceph-volume recreate the OSD with a given ID.  I'm not a ceph-volume 
expert, but a feature to create an OSD with a given ID from scratch 
would be nice (given that the underlying raw ceph commands already 
support it).
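
For illustration, the raw call I mean looks roughly like this (the uuid and
the secrets file are placeholders):

  # 747 is the osd id to reuse; the json carries the cephx secrets, as for a manual deploy
  ceph osd new <osd-fsid-uuid> 747 -i /path/to/osd-secrets.json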


Andras

On 10/3/18 11:41 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 11:23 AM Andras Pataki
 wrote:

Thanks - I didn't realize that was such a recent fix.

I've now tried 12.2.8, and perhaps I'm not clear on what I should have
done to the OSD that I'm replacing, since I'm getting the error "The osd
ID 747 is already in use or does not exist.".  The case is clearly the
latter, since I've completely removed the old OSD (osd crush remove,
auth del, osd rm, wipe disk).  Should I have done something different
(i.e. not remove the OSD completely)?

Yeah, you completely removed it so now it can't be re-used. This is
the proper way if wanting to re-use the ID:

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd

Basically:

 ceph osd destroy {id} --yes-i-really-mean-it


Searching the docs I see a command 'ceph osd destroy'.  What does that
do (compared to my removal procedure, osd crush remove, auth del, osd rm)?

Thanks,

Andras


On 10/3/18 10:36 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
 wrote:

After replacing failing drive I'd like to recreate the OSD with the same
osd-id using ceph-volume (now that we've moved to ceph-volume from
ceph-disk).  However, I seem to not be successful.  The command I'm using:

ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
--block.db /dev/disk/by-partlabel/H901J44

But it created an OSD the ID 601, which was the lowest it could allocate
and ignored the 747 apparently.  This is with ceph 12.2.7. Any ideas?

Yeah, this was a problem that was fixed and released as part of 12.2.8

The tracker issue is: http://tracker.ceph.com/issues/24044

The Luminous PR is https://github.com/ceph/ceph/pull/23102

Sorry for the trouble!

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Göktuğ Yıldırım
Also, you were asking for the RAW output.
I've been trying to fix this for days and I haven't slept. Forgive the dumb
mistakes.

RAW dump output: 
https://drive.google.com/file/d/1SzFNNjSK9Q_j4iyYJTRqOYuLWJcsFX9C/view?usp=sharing

Göktuğ Yıldırım  şunları yazdı (3 Eki 2018 21:34):

> I'm so sorry about that I missed "out" parameter. My bad..
> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/
> 
> 
> Sage Weil  şunları yazdı (3 Eki 2018 21:13):
> 
>> I bet the kvstore output it in a hexdump format?  There is another option to 
>> get the raw data iirc
>> 
>> 
>> 
>>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
>>>  wrote:
>>> I changed the file name to make it clear.
>>> When I use your command with "+decode"  I'm getting an error like this:
>>> 
>>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>>> error: buffer::malformed_input: void 
>>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand 
>>> old encoding version 2 < 111
>>> 
>>> My ceph version: 13.2.2
>>> 
>>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
>>> yazdı:
 On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
 > If I didn't do it wrong, I got the output as below.
 > 
 > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ 
 > get osd_pg_creating creating > dump
 > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
 > families: [default]
 > 
 > ceph-dencoder type creating_pgs_t import dump dump_json
 
 Sorry, should be
 
 ceph-dencoder type creating_pgs_t import dump decode dump_json
 
 s
 
 > {
 > "last_scan_epoch": 0,
 > "creating_pgs": [],
 > "queue": [],
 > "created_pools": []
 > }
 > 
 > You can find the "dump" link below.
 > 
 > dump: 
 > https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
 > 
 > 
 > Sage Weil  şunları yazdı (3 Eki 2018 18:45):
 > 
 > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
 > >> We are starting to work on it. First step is getting the structure 
 > >> out and dumping the current value as you say.
 > >> 
 > >> And you were correct we did not run force_create_pg.
 > > 
 > > Great.
 > > 
 > > So, eager to see what the current structure is... please attach once 
 > > you 
 > > have it.
 > > 
 > > The new replacement one should look like this (when hexdump -C'd):
 > > 
 > >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
 > > ||
 > > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
 > > |..B...|
 > > 001e
 > > 
 > > ...except that from byte 6 you want to put in a recent OSDMap epoch, 
 > > in 
 > > hex, little endian (least significant byte first), in place of the 
 > > 0x10 
 > > that is there now.  It should dump like this:
 > > 
 > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
 > > {
 > >"last_scan_epoch": 16,   <--- but with a recent epoch here
 > >"creating_pgs": [],
 > >"queue": [],
 > >"created_pools": [
 > >66
 > >]
 > > }
 > > 
 > > sage
 > > 
 > > 
 > >> 
 > >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
 > >>> 
 > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
 >  Sage,
 >  
 >  Pool 66 is the only pool it shows right now. This a pool created 
 >  months ago.
 >  ceph osd lspools
 >  66 mypool
 >  
 >  As we recreated mon db from OSDs, the pools for MDS was unusable. 
 >  So we deleted them.
 >  After we create another cephfs fs and pools we started MDS and it 
 >  stucked on creation. So we stopped MDS and removed fs and fs pools. 
 >  Right now we do not have MDS running nor we have cephfs related 
 >  things.
 >  
 >  ceph fs dump
 >  dumped fsmap epoch 1 e1
 >  enable_multiple, ever_enabled_multiple: 0,0
 >  compat: compat={},rocompat={},incompat={1=base v0.20,2=client 
 >  writeable ranges,3=default file layouts on dirs,4=dir inode in 
 >  separate object,5=mds uses versioned encoding,6=dirfrag is stored 
 >  in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
 >  legacy client fscid: -1
 >  
 >  No filesystems configured
 >  
 >  ceph fs ls
 >  No filesystems enabled
 >  
 >  Now pool 66 seems to only pool we have and it has been created 
 >  months ago. Then I guess there is something hidden out there.
 >  
 >  Is there any way to find and delete it?
 > >>> 
 > >>> Ok, I'm concerned that the creating pg is in there if this is an old 
 > >>> pool... did you perhaps run force_create_pg at some point?  

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> I'm so sorry about that I missed "out" parameter. My bad..
> This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/

Excellent, thanks.  That looks like it confirms the problem is that the 
recovery tool didn't repopulate the creating pgs properly.

If you take that 30 byte file I sent earlier (as hex) and update the 
osdmap epoch to the latest on the mon, confirm it decodes and dumps 
properly, and then inject it on the 3 mons, that should get you past this 
hump (and hopefully back up!).
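
For example, a sketch of that edit/verify/inject cycle (the epoch 0x00011cb3
= 72883 is only illustrative - use your mons' actual latest osdmap epoch -
and do this only with all mons stopped and backed up):

  printf '02 01 18 00 00 00 b3 1c 01 00 00 00 00 00 01 00 00 00 42 00 00 00 00 00 00 00 00 00 00 00' | xxd -r -p > creating.new
  ceph-dencoder type creating_pgs_t import creating.new decode dump_json   # sanity check before injecting
  # then, on each of the 3 mons:
  ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-<mon-id>/store.db set osd_pg_creating creating in creating.new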

sage


> 
> Sage Weil  şunları yazdı (3 Eki 2018 21:13):
> 
> > I bet the kvstore output it in a hexdump format?  There is another option 
> > to get the raw data iirc
> > 
> > 
> > 
> >> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
> >>  wrote:
> >> I changed the file name to make it clear.
> >> When I use your command with "+decode"  I'm getting an error like this:
> >> 
> >> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
> >> error: buffer::malformed_input: void 
> >> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand 
> >> old encoding version 2 < 111
> >> 
> >> My ceph version: 13.2.2
> >> 
> >> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
> >> yazdı:
> >>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> >>> > If I didn't do it wrong, I got the output as below.
> >>> > 
> >>> > ceph-kvstore-tool rocksdb 
> >>> > /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get osd_pg_creating 
> >>> > creating > dump
> >>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column 
> >>> > families: [default]
> >>> > 
> >>> > ceph-dencoder type creating_pgs_t import dump dump_json
> >>> 
> >>> Sorry, should be
> >>> 
> >>> ceph-dencoder type creating_pgs_t import dump decode dump_json
> >>> 
> >>> s
> >>> 
> >>> > {
> >>> > "last_scan_epoch": 0,
> >>> > "creating_pgs": [],
> >>> > "queue": [],
> >>> > "created_pools": []
> >>> > }
> >>> > 
> >>> > You can find the "dump" link below.
> >>> > 
> >>> > dump: 
> >>> > https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >>> > 
> >>> > 
> >>> > Sage Weil  şunları yazdı (3 Eki 2018 18:45):
> >>> > 
> >>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> > >> We are starting to work on it. First step is getting the structure 
> >>> > >> out and dumping the current value as you say.
> >>> > >> 
> >>> > >> And you were correct we did not run force_create_pg.
> >>> > > 
> >>> > > Great.
> >>> > > 
> >>> > > So, eager to see what the current structure is... please attach once 
> >>> > > you 
> >>> > > have it.
> >>> > > 
> >>> > > The new replacement one should look like this (when hexdump -C'd):
> >>> > > 
> >>> > >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
> >>> > > ||
> >>> > > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
> >>> > > |..B...|
> >>> > > 001e
> >>> > > 
> >>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, 
> >>> > > in 
> >>> > > hex, little endian (least significant byte first), in place of the 
> >>> > > 0x10 
> >>> > > that is there now.  It should dump like this:
> >>> > > 
> >>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> >>> > > {
> >>> > >"last_scan_epoch": 16,   <--- but with a recent epoch here
> >>> > >"creating_pgs": [],
> >>> > >"queue": [],
> >>> > >"created_pools": [
> >>> > >66
> >>> > >]
> >>> > > }
> >>> > > 
> >>> > > sage
> >>> > > 
> >>> > > 
> >>> > >> 
> >>> > >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> >>> > >>> 
> >>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> >  Sage,
> >>> >  
> >>> >  Pool 66 is the only pool it shows right now. This a pool created 
> >>> >  months ago.
> >>> >  ceph osd lspools
> >>> >  66 mypool
> >>> >  
> >>> >  As we recreated mon db from OSDs, the pools for MDS was unusable. 
> >>> >  So we deleted them.
> >>> >  After we create another cephfs fs and pools we started MDS and it 
> >>> >  stucked on creation. So we stopped MDS and removed fs and fs 
> >>> >  pools. Right now we do not have MDS running nor we have cephfs 
> >>> >  related things.
> >>> >  
> >>> >  ceph fs dump
> >>> >  dumped fsmap epoch 1 e1
> >>> >  enable_multiple, ever_enabled_multiple: 0,0
> >>> >  compat: compat={},rocompat={},incompat={1=base v0.20,2=client 
> >>> >  writeable ranges,3=default file layouts on dirs,4=dir inode in 
> >>> >  separate object,5=mds uses versioned encoding,6=dirfrag is stored 
> >>> >  in omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
> >>> >  legacy client fscid: -1
> >>> >  
> >>> >  No filesystems configured
> >>> >  
> >>> >  ceph fs ls
> >>> >  No filesystems enabled
> >>> >  
> >>> >  Now pool 66 seems to only pool we have and it has been created 
> >>> >  months ago. 

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Göktuğ Yıldırım
I'm so sorry about that, I missed the "out" parameter. My bad...
This is the output: https://paste.ubuntu.com/p/KwT9c8F6TF/


Sage Weil  şunları yazdı (3 Eki 2018 21:13):

> I bet the kvstore output it in a hexdump format?  There is another option to 
> get the raw data iirc
> 
> 
> 
>> On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM 
>>  wrote:
>> I changed the file name to make it clear.
>> When I use your command with "+decode"  I'm getting an error like this:
>> 
>> ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>> error: buffer::malformed_input: void 
>> creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand 
>> old encoding version 2 < 111
>> 
>> My ceph version: 13.2.2
>> 
>> 3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu 
>> yazdı:
>>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>>> > If I didn't do it wrong, I got the output as below.
>>> > 
>>> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ 
>>> > get osd_pg_creating creating > dump
>>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: 
>>> > [default]
>>> > 
>>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>> 
>>> Sorry, should be
>>> 
>>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>> 
>>> s
>>> 
>>> > {
>>> > "last_scan_epoch": 0,
>>> > "creating_pgs": [],
>>> > "queue": [],
>>> > "created_pools": []
>>> > }
>>> > 
>>> > You can find the "dump" link below.
>>> > 
>>> > dump: 
>>> > https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>>> > 
>>> > 
>>> > Sage Weil  şunları yazdı (3 Eki 2018 18:45):
>>> > 
>>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> > >> We are starting to work on it. First step is getting the structure out 
>>> > >> and dumping the current value as you say.
>>> > >> 
>>> > >> And you were correct we did not run force_create_pg.
>>> > > 
>>> > > Great.
>>> > > 
>>> > > So, eager to see what the current structure is... please attach once 
>>> > > you 
>>> > > have it.
>>> > > 
>>> > > The new replacement one should look like this (when hexdump -C'd):
>>> > > 
>>> > >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
>>> > > ||
>>> > > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
>>> > > |..B...|
>>> > > 001e
>>> > > 
>>> > > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
>>> > > hex, little endian (least significant byte first), in place of the 0x10 
>>> > > that is there now.  It should dump like this:
>>> > > 
>>> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
>>> > > {
>>> > >"last_scan_epoch": 16,   <--- but with a recent epoch here
>>> > >"creating_pgs": [],
>>> > >"queue": [],
>>> > >"created_pools": [
>>> > >66
>>> > >]
>>> > > }
>>> > > 
>>> > > sage
>>> > > 
>>> > > 
>>> > >> 
>>> > >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
>>> > >>> 
>>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> >  Sage,
>>> >  
>>> >  Pool 66 is the only pool it shows right now. This a pool created 
>>> >  months ago.
>>> >  ceph osd lspools
>>> >  66 mypool
>>> >  
>>> >  As we recreated mon db from OSDs, the pools for MDS was unusable. So 
>>> >  we deleted them.
>>> >  After we create another cephfs fs and pools we started MDS and it 
>>> >  stucked on creation. So we stopped MDS and removed fs and fs pools. 
>>> >  Right now we do not have MDS running nor we have cephfs related 
>>> >  things.
>>> >  
>>> >  ceph fs dump
>>> >  dumped fsmap epoch 1 e1
>>> >  enable_multiple, ever_enabled_multiple: 0,0
>>> >  compat: compat={},rocompat={},incompat={1=base v0.20,2=client 
>>> >  writeable ranges,3=default file layouts on dirs,4=dir inode in 
>>> >  separate object,5=mds uses versioned encoding,6=dirfrag is stored in 
>>> >  omap,8=no anchor table,9=file layout v2,10=snaprealm v2}
>>> >  legacy client fscid: -1
>>> >  
>>> >  No filesystems configured
>>> >  
>>> >  ceph fs ls
>>> >  No filesystems enabled
>>> >  
>>> >  Now pool 66 seems to only pool we have and it has been created 
>>> >  months ago. Then I guess there is something hidden out there.
>>> >  
>>> >  Is there any way to find and delete it?
>>> > >>> 
>>> > >>> Ok, I'm concerned that the creating pg is in there if this is an old 
>>> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming 
>>> > >>> you 
>>> > >>> didn't, I think this is a bug in the process for rebuilding the mon 
>>> > >>> store.. one that doesn't normally come up because the impact is this 
>>> > >>> osdmap scan that is cheap in our test scenarios but clearly not cheap 
>>> > >>> for 
>>> > >>> your aged cluster.
>>> > >>> 
>>> > >>> In any case, there is a way to clear those out of the mon, but it's a 
>>> > >>> bit 
>>> > >>> dicey. 
>>> > 

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
I bet the kvstore tool output it in a hexdump format?  There is another option
to get the raw data, IIRC.

On October 3, 2018 3:01:41 PM EDT, Goktug YILDIRIM  
wrote:
>I changed the file name to make it clear.
>When I use your command with "+decode"  I'm getting an error like this:
>
>ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
>error: buffer::malformed_input: void
>creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer
>understand
>old encoding version 2 < 111
>
>My ceph version: 13.2.2
>
>3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu
>yazdı:
>
>> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
>> > If I didn't do it wrong, I got the output as below.
>> >
>> > ceph-kvstore-tool rocksdb
>/var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/
>> get osd_pg_creating creating > dump
>> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column
>> families: [default]
>> >
>> > ceph-dencoder type creating_pgs_t import dump dump_json
>>
>> Sorry, should be
>>
>> ceph-dencoder type creating_pgs_t import dump decode dump_json
>>
>> s
>>
>> > {
>> > "last_scan_epoch": 0,
>> > "creating_pgs": [],
>> > "queue": [],
>> > "created_pools": []
>> > }
>> >
>> > You can find the "dump" link below.
>> >
>> > dump:
>>
>https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
>> >
>> >
>> > Sage Weil  şunları yazdı (3 Eki 2018 18:45):
>> >
>> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> > >> We are starting to work on it. First step is getting the
>structure
>> out and dumping the current value as you say.
>> > >>
>> > >> And you were correct we did not run force_create_pg.
>> > >
>> > > Great.
>> > >
>> > > So, eager to see what the current structure is... please attach
>once
>> you
>> > > have it.
>> > >
>> > > The new replacement one should look like this (when hexdump
>-C'd):
>> > >
>> > >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00
>> ||
>> > > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
>> |..B...|
>> > > 001e
>> > >
>> > > ...except that from byte 6 you want to put in a recent OSDMap
>epoch,
>> in
>> > > hex, little endian (least significant byte first), in place of
>the
>> 0x10
>> > > that is there now.  It should dump like this:
>> > >
>> > > $ ceph-dencoder type creating_pgs_t import myfile decode
>dump_json
>> > > {
>> > >"last_scan_epoch": 16,   <--- but with a recent epoch here
>> > >"creating_pgs": [],
>> > >"queue": [],
>> > >"created_pools": [
>> > >66
>> > >]
>> > > }
>> > >
>> > > sage
>> > >
>> > >
>> > >>
>> > >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
>> > >>>
>> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> >  Sage,
>> > 
>> >  Pool 66 is the only pool it shows right now. This a pool
>created
>> months ago.
>> >  ceph osd lspools
>> >  66 mypool
>> > 
>> >  As we recreated mon db from OSDs, the pools for MDS was
>unusable.
>> So we deleted them.
>> >  After we create another cephfs fs and pools we started MDS and
>it
>> stucked on creation. So we stopped MDS and removed fs and fs pools.
>Right
>> now we do not have MDS running nor we have cephfs related things.
>> > 
>> >  ceph fs dump
>> >  dumped fsmap epoch 1 e1
>> >  enable_multiple, ever_enabled_multiple: 0,0
>> >  compat: compat={},rocompat={},incompat={1=base v0.20,2=client
>> writeable ranges,3=default file layouts on dirs,4=dir inode in
>separate
>> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
>> anchor table,9=file layout v2,10=snaprealm v2}
>> >  legacy client fscid: -1
>> > 
>> >  No filesystems configured
>> > 
>> >  ceph fs ls
>> >  No filesystems enabled
>> > 
>> >  Now pool 66 seems to only pool we have and it has been created
>> months ago. Then I guess there is something hidden out there.
>> > 
>> >  Is there any way to find and delete it?
>> > >>>
>> > >>> Ok, I'm concerned that the creating pg is in there if this is
>an old
>> > >>> pool... did you perhaps run force_create_pg at some point? 
>Assuming
>> you
>> > >>> didn't, I think this is a bug in the process for rebuilding the
>mon
>> > >>> store.. one that doesn't normally come up because the impact is
>this
>> > >>> osdmap scan that is cheap in our test scenarios but clearly not
>> cheap for
>> > >>> your aged cluster.
>> > >>>
>> > >>> In any case, there is a way to clear those out of the mon, but
>it's
>> a bit
>> > >>> dicey.
>> > >>>
>> > >>> 1. stop all mons
>> > >>> 2. make a backup of all mons
>> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
>> > >>> key=creating key on one of the mons
>> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t
>import
>> FILE dump_json
>> > >>> 5. hex edit the structure to remove all of the creating pgs,
>and
>> adds pool
>> > >>> 66 to the created_pgs member.
>> > >>> 6. verify with ceph-dencoder dump that the edit was correct...

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug YILDIRIM
I changed the file name to make it clear.
When I use your command with "+decode"  I'm getting an error like this:

ceph-dencoder type creating_pgs_t import DUMPFILE decode dump_json
error: buffer::malformed_input: void
creating_pgs_t::decode(ceph::buffer::list::iterator&) no longer understand
old encoding version 2 < 111

My ceph version: 13.2.2

3 Eki 2018 Çar, saat 20:46 tarihinde Sage Weil  şunu
yazdı:

> On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> > If I didn't do it wrong, I got the output as below.
> >
> > ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/
> get osd_pg_creating creating > dump
> > 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column
> families: [default]
> >
> > ceph-dencoder type creating_pgs_t import dump dump_json
>
> Sorry, should be
>
> ceph-dencoder type creating_pgs_t import dump decode dump_json
>
> s
>
> > {
> > "last_scan_epoch": 0,
> > "creating_pgs": [],
> > "queue": [],
> > "created_pools": []
> > }
> >
> > You can find the "dump" link below.
> >
> > dump:
> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> >
> >
> > Sage Weil  şunları yazdı (3 Eki 2018 18:45):
> >
> > >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > >> We are starting to work on it. First step is getting the structure
> out and dumping the current value as you say.
> > >>
> > >> And you were correct we did not run force_create_pg.
> > >
> > > Great.
> > >
> > > So, eager to see what the current structure is... please attach once
> you
> > > have it.
> > >
> > > The new replacement one should look like this (when hexdump -C'd):
> > >
> > >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00
> ||
> > > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00
> |..B...|
> > > 001e
> > >
> > > ...except that from byte 6 you want to put in a recent OSDMap epoch,
> in
> > > hex, little endian (least significant byte first), in place of the
> 0x10
> > > that is there now.  It should dump like this:
> > >
> > > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> > > {
> > >"last_scan_epoch": 16,   <--- but with a recent epoch here
> > >"creating_pgs": [],
> > >"queue": [],
> > >"created_pools": [
> > >66
> > >]
> > > }
> > >
> > > sage
> > >
> > >
> > >>
> > >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> > >>>
> > >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >  Sage,
> > 
> >  Pool 66 is the only pool it shows right now. This a pool created
> months ago.
> >  ceph osd lspools
> >  66 mypool
> > 
> >  As we recreated mon db from OSDs, the pools for MDS was unusable.
> So we deleted them.
> >  After we create another cephfs fs and pools we started MDS and it
> stucked on creation. So we stopped MDS and removed fs and fs pools. Right
> now we do not have MDS running nor we have cephfs related things.
> > 
> >  ceph fs dump
> >  dumped fsmap epoch 1 e1
> >  enable_multiple, ever_enabled_multiple: 0,0
> >  compat: compat={},rocompat={},incompat={1=base v0.20,2=client
> writeable ranges,3=default file layouts on dirs,4=dir inode in separate
> object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no
> anchor table,9=file layout v2,10=snaprealm v2}
> >  legacy client fscid: -1
> > 
> >  No filesystems configured
> > 
> >  ceph fs ls
> >  No filesystems enabled
> > 
> >  Now pool 66 seems to only pool we have and it has been created
> months ago. Then I guess there is something hidden out there.
> > 
> >  Is there any way to find and delete it?
> > >>>
> > >>> Ok, I'm concerned that the creating pg is in there if this is an old
> > >>> pool... did you perhaps run force_create_pg at some point?  Assuming
> you
> > >>> didn't, I think this is a bug in the process for rebuilding the mon
> > >>> store.. one that doesn't normally come up because the impact is this
> > >>> osdmap scan that is cheap in our test scenarios but clearly not
> cheap for
> > >>> your aged cluster.
> > >>>
> > >>> In any case, there is a way to clear those out of the mon, but it's
> a bit
> > >>> dicey.
> > >>>
> > >>> 1. stop all mons
> > >>> 2. make a backup of all mons
> > >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating
> > >>> key=creating key on one of the mons
> > >>> 4. dump the object with ceph-dencoder type creating_pgs_t import
> FILE dump_json
> > >>> 5. hex edit the structure to remove all of the creating pgs, and
> adds pool
> > >>> 66 to the created_pgs member.
> > >>> 6. verify with ceph-dencoder dump that the edit was correct...
> > >>> 7. inject the updated structure into all of the mons
> > >>> 8. start all mons
> > >>>
> > >>> 4-6 will probably be an iterative process... let's start by getting
> the
> > >>> structure out and dumping the current value?
> > >>>
> > >>> The code to refer to to understand the structure is
> src/mon/CreatingPGs.h
> 

Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread Paul Emmerich
Am Mi., 3. Okt. 2018 um 20:10 Uhr schrieb :
> They are ordered and will hopefully arrive very soon.
>
> Can I:
> 1) Add disks
> 2) Create pool
> 3) stop all MDS's
> 4) rados cppool
> 5) Start MDS
>
> .. Yes, thats a cluster-down on CephFS but shouldn't take long. Or is
> there a better guide?

you can just change the crush rule of the existing metadata pool.

Paul

>
> --
> Jesper
>


-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
On Wed, 3 Oct 2018, Göktuğ Yıldırım wrote:
> If I didn't do it wrong, I got the output as below.
> 
> ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-SRV-SBKUARK14/store.db/ get 
> osd_pg_creating creating > dump
> 2018-10-03 20:08:52.070 7f07f5659b80  1 rocksdb: do_open column families: 
> [default]
> 
> ceph-dencoder type creating_pgs_t import dump dump_json

Sorry, should be

ceph-dencoder type creating_pgs_t import dump decode dump_json

s

> {
> "last_scan_epoch": 0,
> "creating_pgs": [],
> "queue": [],
> "created_pools": []
> }
> 
> You can find the "dump" link below.
> 
> dump: 
> https://drive.google.com/file/d/1ZLUiQyotQ4-778wM9UNWK_TLDAROg0yN/view?usp=sharing
> 
> 
> Sage Weil  şunları yazdı (3 Eki 2018 18:45):
> 
> >> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >> We are starting to work on it. First step is getting the structure out and 
> >> dumping the current value as you say.
> >> 
> >> And you were correct we did not run force_create_pg.
> > 
> > Great.
> > 
> > So, eager to see what the current structure is... please attach once you 
> > have it.
> > 
> > The new replacement one should look like this (when hexdump -C'd):
> > 
> >   02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  
> > ||
> > 0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00|..B...|
> > 001e
> > 
> > ...except that from byte 6 you want to put in a recent OSDMap epoch, in 
> > hex, little endian (least significant byte first), in place of the 0x10 
> > that is there now.  It should dump like this:
> > 
> > $ ceph-dencoder type creating_pgs_t import myfile decode dump_json
> > {
> >"last_scan_epoch": 16,   <--- but with a recent epoch here
> >"creating_pgs": [],
> >"queue": [],
> >"created_pools": [
> >66
> >]
> > }
> > 
> > sage
> > 
> > 
> >> 
> >>> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> >>> 
> >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>  Sage,
>  
>  Pool 66 is the only pool it shows right now. This a pool created months 
>  ago.
>  ceph osd lspools
>  66 mypool
>  
>  As we recreated mon db from OSDs, the pools for MDS was unusable. So we 
>  deleted them.
>  After we create another cephfs fs and pools we started MDS and it 
>  stucked on creation. So we stopped MDS and removed fs and fs pools. 
>  Right now we do not have MDS running nor we have cephfs related things.
>  
>  ceph fs dump
>  dumped fsmap epoch 1 e1
>  enable_multiple, ever_enabled_multiple: 0,0
>  compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
>  ranges,3=default file layouts on dirs,4=dir inode in separate 
>  object,5=mds uses versioned encoding,6=dirfrag is stored in omap,8=no 
>  anchor table,9=file layout v2,10=snaprealm v2}
>  legacy client fscid: -1
>  
>  No filesystems configured
>  
>  ceph fs ls
>  No filesystems enabled
>  
>  Now pool 66 seems to only pool we have and it has been created months 
>  ago. Then I guess there is something hidden out there.
>  
>  Is there any way to find and delete it?
> >>> 
> >>> Ok, I'm concerned that the creating pg is in there if this is an old 
> >>> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> >>> didn't, I think this is a bug in the process for rebuilding the mon 
> >>> store.. one that doesn't normally come up because the impact is this 
> >>> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> >>> your aged cluster.
> >>> 
> >>> In any case, there is a way to clear those out of the mon, but it's a bit 
> >>> dicey. 
> >>> 
> >>> 1. stop all mons
> >>> 2. make a backup of all mons
> >>> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> >>> key=creating key on one of the mons
> >>> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE 
> >>> dump_json
> >>> 5. hex edit the structure to remove all of the creating pgs, and adds 
> >>> pool 
> >>> 66 to the created_pgs member.
> >>> 6. verify with ceph-dencoder dump that the edit was correct...
> >>> 7. inject the updated structure into all of the mons
> >>> 8. start all mons
> >>> 
> >>> 4-6 will probably be an iterative process... let's start by getting the 
> >>> structure out and dumping the current value?  
> >>> 
> >>> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> >>> encode/decode methods.
> >>> 
> >>> sage
> >>> 
> >>> 
>  
>  
> > On 3 Oct 2018, at 16:46, Sage Weil  wrote:
> > 
> > Oh... I think this is the problem:
> > 
> > 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> > 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 
> > 66.124:60196 
> > 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> > 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> > 
> > You are in 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Finally goodness happened!
I applied the PR and ran repair on an OSD that was left unmodified after the
initial failure. It went through without any errors, and now I'm able to
fuse-mount the OSD and export PGs off it using ceph-objectstore-tool. To avoid
messing anything up, I won't start ceph-osd until I have the PGs backed up.
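
For reference, the backup step looks roughly like this (OSD id, PG id and
paths here are just illustrative):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --op fuse --mountpoint /mnt/osd.1
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-1 --pgid 1.2a --op export --file /backup/1.2a.export
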
Cheers Igor, you're the best!


> On 3.10.2018, at 14:39, Igor Fedotov  wrote:
> 
> To fix this specific issue please apply the following PR: 
> https://github.com/ceph/ceph/pull/24339
> 
> This wouldn't fix original issue but just in case please try to run repair 
> again. Will need log if an error is different from ENOSPC from your latest 
> email.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/3/2018 1:58 PM, Sergey Malinin wrote:
>> Repair has gone farther but failed on something different - this time it 
>> appears to be related to store inconsistency rather than lack of free space. 
>> Emailed log to you, beware: over 2GB uncompressed.
>> 
>> 
>>> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
>>> 
>>> You may want to try new updates from the PR along with disabling flush on 
>>> recovery for rocksdb (avoid_flush_during_recovery parameter).
>>> 
>>> Full cmd line might looks like:
>>> 
>>> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
>>> bin/ceph-bluestore-tool --path  repair
>>> 
>>> 
>>> To be applied for "non-expanded" OSDs where repair didn't pass.
>>> 
>>> Please collect a log during repair...
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
 Repair goes through only when LVM volume has been expanded, otherwise it 
 fails with enospc as well as any other operation. However, expanding the 
 volume immediately renders bluefs unmountable with IO error.
 2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very 
 end of bluefs-log-dump), I'm not sure whether corruption occurred before 
 or after volume expansion.
 
 
> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
> 
> You mentioned repair had worked before, is that correct? What's the 
> difference now except the applied patch? Different OSD? Anything else?
> 
> 
> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
> 
>> It didn't work, emailed logs to you.
>> 
>> 
>>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>>> 
>>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>>> bluefs_rebalance_txn assignment..
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
 PR doesn't seem to have changed since yesterday. Am I missing 
 something?
 
 
> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
> 
> Please update the patch from the PR - it didn't update bluefs extents 
> list before.
> 
> Also please set debug bluestore 20 when re-running repair and collect 
> the log.
> 
> If repair doesn't help - would you send repair and startup logs 
> directly to me as I have some issues accessing ceph-post-file uploads.
> 
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
>> Yes, I did repair all OSDs and it finished with 'repair success'. I 
>> backed up OSDs so now I have more room to play.
>> I posted log files using ceph-post-file with the following IDs:
>> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
>> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
>> 
>> 
>>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>>> 
>>> You did repair for any of this OSDs, didn't you? For all of them?
>>> 
>>> 
>>> Would you please provide the log for both types (failed on mount 
>>> and failed with enospc) of failing OSDs. Prior to collecting please 
>>> remove existing ones prior and set debug bluestore to 20.
>>> 
>>> 
>>> 
>>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
 I was able to apply patches to mimic, but nothing changed. One osd 
 that I had space expanded on fails with bluefs mount IO error, 
 others keep failing with enospc.
 
 
> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
> 
> So you should call repair which rebalances (i.e. allocates 
> additional space) BlueFS space. Hence allowing OSD to start.
> 
> Thanks,
> 
> Igor
> 
> 
> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
>> Not exactly. The rebalancing from this kv_sync_thread still 
>> might be deferred due to the nature of this thread (haven't 100% 
>> sure though).
>> 
>> Here is my PR showing the idea (still untested and perhaps 

Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread jesper
> Your use case sounds it might profit from the rados cache tier
> feature. It's a rarely used feature because it only works in very
> specific circumstances. But your scenario sounds like it might work.
> Definitely worth giving it a try. Also, dm-cache with LVM *might*
> help.
> But if your active working set is really just 400GB: Bluestore cache
> should handle this just fine. Don't worry about "unequal"
> distribution, every 4mb chunk of every file will go to a random OSD.

I tried it out, and initial tests didn't really convince me, but I'll
keep experimenting with it.

> One very powerful and simple optimization is moving the metadata pool
> to SSD only. Even if it's just 3 small but fast SSDs; that can make a
> huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon.

Can I:
1) Add disks
2) Create pool
3) stop all MDS's
4) rados cppool
5) Start MDS

.. Yes, that's a cluster-down on CephFS, but it shouldn't take long. Or is
there a better guide?

--
Jesper

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] fixing another remapped+incomplete EC 4+2 pg

2018-10-03 Thread Graham Allan
Following on from my previous adventure with recovering pgs in the face 
of failed OSDs, I now have my EC 4+2 pool operating with min_size=5, 
which is as things should be.


However I have one pg which is stuck in state remapped+incomplete 
because it has only 4 out of 6 osds running, and I have been unable to 
bring the missing two back into service.



PG_AVAILABILITY Reduced data availability: 1 pg inactive, 1 pg incomplete
pg 70.82d is remapped+incomplete, acting 
[2147483647,2147483647,190,448,61,315] (reducing pool .rgw.buckets.ec42 
min_size from 5 may help; search ceph.com/docs for 'incomplete')


I don't think I want to do anything with min_size as that would make all 
other pgs vulnerable to running dangerously undersized (unless there is 
any way to force that state for only a single pg). It seems to me that 
with 4/6 osds available, it should maybe be possible to force ceph to 
select one or two new osds to rebalance this pg to?


ceph pg query gives me (snippet):


"down_osds_we_would_probe": [
98,
233,
238,
239
],
"peering_blocked_by": [],
"peering_blocked_by_detail": [
{
"detail": "peering_blocked_by_history_les_bound"
}
]


Of these, osd 98 appears to have a corrupt xfs filesystem.

osd 239 was the original osd to hold a shard of this pg but would not 
keep running, exiting with:



/build/ceph-12.2.7/src/osd/ECBackend.cc: 619: FAILED assert(pop.data.length() 
== sinfo.aligned_logical_offset_to_chunk_offset( 
after_progress.data_recovered_to - op.recovery_progress.data_recovered_to))


osds 233 and 238 were otherwise-evacuated (weight 0) osds to which I 
imported the pg shard from osd 239 (using ceph-objectstore-tool), after 
which they crash with the same assert. More specifically, they seem to 
crash in the same way each time the pg becomes active and starts to 
backfill, on the same object:



-9> 2018-10-03 11:30:28.174586 7f94ce9c4700  5 osd.233 pg_epoch: 704441 
pg[70.82ds1( v 704329'703106 (586066'698574,704329'703106] 
local-lis/les=704439/704440 n=102585 ec=21494/21494 lis/c 704439/588565 les/c/f 
704440/588566/0 68066
6/704439/704439) 
[820,761,105,789,562,485]/[2147483647,233,190,448,61,315]p233(1) r=1 lpr=704439 
pi=[21494,704439)/4 rops=1 bft=105(2),485(5),562(4),761(1),789(3),820(0) 
crt=704329'703106 lcod 0'0 mlcod 0'0 active+undersized+remapped+ba
ckfilling] backfill_pos is 
70:b415ca14:::default.630943.7__shadow_Barley_GC_Project%2fBarley_GC_Project%2fRawdata%2fReads%2fCZOA.6150.7.38741.TGCTGG.fastq.gz.2~Vn8g0rMwpVY8eaW83TDzJ2mczLXAl3z.3_24:head
-8> 2018-10-03 11:30:28.174887 7f94ce9c4700  1 -- 10.31.0.1:6854/2210291 
--> 10.31.0.1:6854/2210291 -- MOSDECSubOpReadReply(70.82ds1 704441/704439 
ECSubReadReply(tid=1, attrs_read=0)) v2 -- 0x7f9500472280 con 0
-7> 2018-10-03 11:30:28.174902 7f94db9de700  1 -- 10.31.0.1:6854/2210291 
<== osd.233 10.31.0.1:6854/2210291 0  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  0+0+0 (0 0 0) 
0x7f9500472280
 con 0x7f94fb72b000
-6> 2018-10-03 11:30:28.176267 7f94ead66700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.4:6880/2181727 conn(0x7f94ff2a6000 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=946 cs=1 l=0). rx osd.61 seq 9 
0x7f9500472500 MOSDECSubOpRe
adReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-5> 2018-10-03 11:30:28.176281 7f94ead66700  1 -- 10.31.0.1:6854/2210291 
<== osd.61 10.31.0.4:6880/2181727 9  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (875698380 0 0) 
0x
7f9500472500 con 0x7f94ff2a6000
-4> 2018-10-03 11:30:28.177723 7f94ead66700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.9:6920/13427 conn(0x7f94ff2bc800 :-1 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=46152 cs=1 l=0). rx osd.448 seq 8 
0x7f94fe9d5980 MOSDECSubOpR
eadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-3> 2018-10-03 11:30:28.177738 7f94ead66700  1 -- 10.31.0.1:6854/2210291 
<== osd.448 10.31.0.9:6920/13427 8  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (2772477454 0 
0) 0x
7f94fe9d5980 con 0x7f94ff2bc800
-2> 2018-10-03 11:30:28.185788 7f94ea565700  5 -- 10.31.0.1:6854/2210291 >> 
10.31.0.7:6868/2012671 conn(0x7f94ff5c3800 :6854 
s=STATE_OPEN_MESSAGE_READ_FOOTER_AND_DISPATCH pgs=4193 cs=1 l=0). rx osd.190 seq 10 
0x7f9500472780 MOSDECSu
bOpReadReply(70.82ds1 704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2
-1> 2018-10-03 11:30:28.185815 7f94ea565700  1 -- 10.31.0.1:6854/2210291 
<== osd.190 10.31.0.7:6868/2012671 10  MOSDECSubOpReadReply(70.82ds1 
704441/704439 ECSubReadReply(tid=1, attrs_read=0)) v2  786745+0+0 (2670842780 0 0)
 0x7f9500472780 con 0x7f94ff5c3800
 0> 2018-10-03 11:30:28.194795 7f94ce9c4700 -1 

[ceph-users] interpreting ceph mds stat

2018-10-03 Thread Jeff Smith
I need some help deciphering the results of ceph mds stat.  I have
been digging in the docs for hours.  If someone can point me in the
right direction and/or help me understand.

In the documentation it shows a result like this.

cephfs-1/1/1 up {0=a=up:active}

What do each of the 1s represent?   What is the 0=a=up:active?  Is
that saying rank 0 of file system a is up:active?

Jeff Smith
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> We are starting to work on it. First step is getting the structure out and 
> dumping the current value as you say.
> 
> And you were correct we did not run force_create_pg.

Great.

So, eager to see what the current structure is... please attach once you 
have it.

The new replacement one should look like this (when hexdump -C'd):

  02 01 18 00 00 00 10 00  00 00 00 00 00 00 01 00  ||
0010  00 00 42 00 00 00 00 00  00 00 00 00 00 00|..B...|
001e

...except that from byte 6 you want to put in a recent OSDMap epoch, in 
hex, little endian (least significant byte first), in place of the 0x10 
that is there now.  It should dump like this:

$ ceph-dencoder type creating_pgs_t import myfile decode dump_json
{
"last_scan_epoch": 16,   <--- but with a recent epoch here
"creating_pgs": [],
"queue": [],
"created_pools": [
66
]
}
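
For example, assuming the most recent OSDMap epoch were 72883 (0x11cb3),
the four bytes starting at offset 6 would become

  b3 1c 01 00

and the dump would then show "last_scan_epoch": 72883.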

sage


 > 
> > On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> > 
> > On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >> Sage,
> >> 
> >> Pool 66 is the only pool it shows right now. This a pool created months 
> >> ago.
> >> ceph osd lspools
> >> 66 mypool
> >> 
> >> As we recreated mon db from OSDs, the pools for MDS was unusable. So we 
> >> deleted them.
> >> After we create another cephfs fs and pools we started MDS and it stucked 
> >> on creation. So we stopped MDS and removed fs and fs pools. Right now we 
> >> do not have MDS running nor we have cephfs related things.
> >> 
> >> ceph fs dump
> >> dumped fsmap epoch 1 e1
> >> enable_multiple, ever_enabled_multiple: 0,0
> >> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> >> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> >> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor 
> >> table,9=file layout v2,10=snaprealm v2}
> >> legacy client fscid: -1
> >> 
> >> No filesystems configured
> >> 
> >> ceph fs ls
> >> No filesystems enabled
> >> 
> >> Now pool 66 seems to only pool we have and it has been created months ago. 
> >> Then I guess there is something hidden out there.
> >> 
> >> Is there any way to find and delete it?
> > 
> > Ok, I'm concerned that the creating pg is in there if this is an old 
> > pool... did you perhaps run force_create_pg at some point?  Assuming you 
> > didn't, I think this is a bug in the process for rebuilding the mon 
> > store.. one that doesn't normally come up because the impact is this 
> > osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> > your aged cluster.
> > 
> > In any case, there is a way to clear those out of the mon, but it's a bit 
> > dicey. 
> > 
> > 1. stop all mons
> > 2. make a backup of all mons
> > 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> > key=creating key on one of the mons
> > 4. dump the object with ceph-dencoder type creating_pgs_t import FILE 
> > dump_json
> > 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> > 66 to the created_pgs member.
> > 6. verify with ceph-dencoder dump that the edit was correct...
> > 7. inject the updated structure into all of the mons
> > 8. start all mons
> > 
> > 4-6 will probably be an iterative process... let's start by getting the 
> > structure out and dumping the current value?  
> > 
> > The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> > encode/decode methods.
> > 
> > sage
> > 
> > 
> >> 
> >> 
> >>> On 3 Oct 2018, at 16:46, Sage Weil  wrote:
> >>> 
> >>> Oh... I think this is the problem:
> >>> 
> >>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> >>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> >>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> >>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> >>> 
> >>> You are in the midst of creating new pgs, and unfortunately pg create is 
> >>> one of the last remaining places where the OSDs need to look at a full 
> >>> history of map changes between then and the current map epoch.  In this 
> >>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> >>> later.
> >>> 
> >>> What is this new pool for?  Is it still empty, and if so, can we delete 
> >>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> >>> and restarting the OSDs will make these pg_creates go away.
> >>> 
> >>> s
> >>> 
> >>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> >>> 
>  Hello,
>  
>  It seems nothing has changed.
>  
>  OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
>  
>  OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
>  
>  
>  
> > On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
> > 
> > Hello,
> > 
> > 
> > You can also reduce the 

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Brett Chancellor
That turned out to be exactly the issue (and boy was it fun clearing pgs
out on 71 OSDs). I think it's caused by a combination of two factors.
1. This cluster has way too many placement groups per OSD (just north of
800). It was fine when we first created all the pools, but upgrades (most
recently to luminous 12.2.4) have cemented the fact that a high PG:OSD ratio
is a bad thing.
2. We had a host in a failed state for an extended period of time. That
host finally coming online is what triggered the event. The system dug
itself into a hole it couldn't get out of.

-Brett

On Wed, Oct 3, 2018 at 11:49 AM Gregory Farnum  wrote:

> Yeah, don't run these commands blind. They are changing the local metadata
> of the PG in ways that may make it inconsistent with the overall cluster
> and result in lost data.
>
> Brett, it seems this issue has come up several times in the field but we
> haven't been able to reproduce it locally or get enough info to debug
> what's going on: https://tracker.ceph.com/issues/21142
> Maybe run through that ticket and see if you can contribute new logs or
> add detail about possible sources?
> -Greg
>
> On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim 
> wrote:
>
>> Hi,
>>
>> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>>
>> I’ve corrected mine OSDs with the following commands. My OSD logs
>> (/var/log/ceph/ceph-OSDx.log) has a line including log(EER) with the PG
>> number besides and before crash dump.
>>
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> trim-pg-log --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> fix-lost --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair
>> --pgid $2
>> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
>> mark-complete --pgid $2
>> systemctl restart ceph-osd@$1
>>
>> I dont know if it works for you but it may be no harm to try for an OSD.
>>
>> There is such less information about this tools. So it might be risky. I
>> hope someone much experienced could help more.
>>
>>
>> > On 2 Oct 2018, at 23:23, Brett Chancellor 
>> wrote:
>> >
>> > Help. I have a 60 node cluster and most of the OSDs decided to crash
>> themselves at the same time. They wont restart, the messages look like...
>> >
>> > --- begin dump of recent events ---
>> >  0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
>> (Aborted) **
>> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
>> >
>> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b)
>> luminous (stable)
>> >  1: (()+0xa3c611) [0x556d618bb611]
>> >  2: (()+0xf6d0) [0x7f57a885e6d0]
>> >  3: (gsignal()+0x37) [0x7f57a787f277]
>> >  4: (abort()+0x148) [0x7f57a7880968]
>> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>> const*)+0x284) [0x556d618fa6e4]
>> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
>> const&)+0x3b2) [0x556d615c74a2]
>> >  7: (PastIntervals::check_new_interval(int, int, std::vector> std::allocator > const&, std::vector >
>> const&, int, int, std::vector > const&,
>> std::vector > const&, unsigned int, unsigned int,
>> std::shared_ptr, std::shared_ptr, pg_t,
>> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
>> [0x556d615ae6c0]
>> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
>> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
>> >  10: (OSD::init()+0x2169) [0x556d613919d9]
>> >  11: (main()+0x2d07) [0x556d61295dd7]
>> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
>> >  13: (()+0x4b53e3) [0x556d613343e3]
>> >  NOTE: a copy of the executable, or `objdump -rdS ` is
>> needed to interpret this.
>> >
>> >
>> > Some hosts have no working OSDs, others seem to have 1 working, and 2
>> dead.  It's spread all across the cluster, across several different racks.
>> Any idea on where to look next? The cluster is dead in the water right now.
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug Yildirim
We are starting to work on it. First step is getting the structure out and 
dumping the current value as you say.

And you were correct we did not run force_create_pg.

> On 3 Oct 2018, at 17:52, Sage Weil  wrote:
> 
> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>> Sage,
>> 
>> Pool 66 is the only pool it shows right now. This a pool created months ago.
>> ceph osd lspools
>> 66 mypool
>> 
>> As we recreated mon db from OSDs, the pools for MDS was unusable. So we 
>> deleted them.
>> After we create another cephfs fs and pools we started MDS and it stucked on 
>> creation. So we stopped MDS and removed fs and fs pools. Right now we do not 
>> have MDS running nor we have cephfs related things.
>> 
>> ceph fs dump
>> dumped fsmap epoch 1 e1
>> enable_multiple, ever_enabled_multiple: 0,0
>> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
>> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
>> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
>> layout v2,10=snaprealm v2}
>> legacy client fscid: -1
>> 
>> No filesystems configured
>> 
>> ceph fs ls
>> No filesystems enabled
>> 
>> Now pool 66 seems to only pool we have and it has been created months ago. 
>> Then I guess there is something hidden out there.
>> 
>> Is there any way to find and delete it?
> 
> Ok, I'm concerned that the creating pg is in there if this is an old 
> pool... did you perhaps run force_create_pg at some point?  Assuming you 
> didn't, I think this is a bug in the process for rebuilding the mon 
> store.. one that doesn't normally come up because the impact is this 
> osdmap scan that is cheap in our test scenarios but clearly not cheap for 
> your aged cluster.
> 
> In any case, there is a way to clear those out of the mon, but it's a bit 
> dicey. 
> 
> 1. stop all mons
> 2. make a backup of all mons
> 3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
> key=creating key on one of the mons
> 4. dump the object with ceph-dencoder type creating_pgs_t import FILE 
> dump_json
> 5. hex edit the structure to remove all of the creating pgs, and adds pool 
> 66 to the created_pgs member.
> 6. verify with ceph-dencoder dump that the edit was correct...
> 7. inject the updated structure into all of the mons
> 8. start all mons
> 
> 4-6 will probably be an iterative process... let's start by getting the 
> structure out and dumping the current value?  
> 
> The code to refer to to understand the structure is src/mon/CreatingPGs.h 
> encode/decode methods.
> 
> sage
> 
> 
>> 
>> 
>>> On 3 Oct 2018, at 16:46, Sage Weil  wrote:
>>> 
>>> Oh... I think this is the problem:
>>> 
>>> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
>>> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
>>> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
>>> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
>>> 
>>> You are in the midst of creating new pgs, and unfortunately pg create is 
>>> one of the last remaining places where the OSDs need to look at a full 
>>> history of map changes between then and the current map epoch.  In this 
>>> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
>>> later.
>>> 
>>> What is this new pool for?  Is it still empty, and if so, can we delete 
>>> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
>>> and restarting the OSDs will make these pg_creates go away.
>>> 
>>> s
>>> 
>>> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
>>> 
 Hello,
 
 It seems nothing has changed.
 
 OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
 
 OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
 
 
 
> On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
> 
> Hello,
> 
> 
> You can also reduce the osd map updates by adding this to your ceph
> config file. "osd crush update on start = false". This should remove
> and update that is generated when osd starts.
> 
> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> "root=default"]} v 0) v1
> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> 

Re: [ceph-users] Ceph 12.2.5 - FAILED assert(0 == "put on missing extent (nothing before)")

2018-10-03 Thread Ricardo J. Barberis
I created https://tracker.ceph.com/issues/36303

I can wait maybe a couple of days before recreating this OSD if you need me to 
extract some more info.

Thanks.

On Wednesday 03/10/2018 at 01:43, Gregory Farnum wrote:
> I'd create a new ticket and reference the older one; they may not have the
> same cause.
>
> On Tue, Oct 2, 2018 at 12:33 PM Ricardo J. Barberis 
>
> wrote:
> > Hello,
> >
> > I'm having this same issue on 12.2.8. Should I repoen the bug report?
> >
> > This cluster started on 12.2.4 and was upgraded to 12.2.5 and then
> > directly to
> > 12.2.8 (we skipped 2.6 and 2.7) but the malfunctioning OSD is on a new
> > node
> > installed with 12.2.8.
> >
> > We're using CentOS 7.5, and bluestore for ceph. This particular node has
> > SSD
> > disks.
> >
> > I have an extract of the log and objdump if needed.
> >
> > Thanks,
> >
> > El Miércoles 11/07/2018 a las 18:31, Gregory Farnum escribió:
> > > A bit delayed, but Radoslaw looked at this some and has a diagnosis on
> >
> > the
> >
> > > tracker ticket: http://tracker.ceph.com/issues/24715
> > > So it looks like a symptom of a bug that was already fixed for
> > > unrelated reasons. :)
> > > -Greg
> > >
> > > On Wed, Jun 27, 2018 at 4:51 AM Dyweni - Ceph-Users
> > > <6exbab4fy...@dyweni.com>
> > >
> > > wrote:
> > > > Good Morning,
> > > >
> > > > I have rebuilt the OSD and the cluster is healthy now.
> > > >
> > > > I have one pool with 3 replica setup.  I am a bit concerned that
> > > > removing a snapshot can cause an OSD to crash.  I've asked myself
> > > > what would have happened if 2 OSD's had crashed?  God forbid, what if
> > > > 3 or more OSD's had crashed with this same error?  How would I have
> >
> > recovered
> >
> > > > from that?
> > > >
> > > > So planning for the future:
> > > >
> > > >   1. Is there any way to proactively scan for (and even repair) this?
> > > >
> > > >   2. What could have caused this?
> > > >
> > > > We experienced a cluster wide power outage lasting several hours
> >
> > several
> >
> > > > days ago.  The outage occurred at a time when no snapshots were being
> > > > created.  The cluster was brought back up in a controlled manner and
> > > > no errors were discovered immediately afterward (Ceph reported
> > > > healthly). Could this have caused corruption?
> > > >
> > > > Thanks,
> > > > Dyweni
> > > >
> > > > On 2018-06-25 09:34, Dyweni - Ceph-Users wrote:
> > > > > Hi,
> > > > >
> > > > > Is there any information you'd like to grab off this OSD?  Anything
> > > > > I can provide to help you troubleshoot this?
> > > > >
> > > > > I ask, because if not, I'm going to reformat / rebuild this OSD
> > > > > (unless there is a faster way to repair this issue).
> > > > >
> > > > > Thanks,
> > > > > Dyweni
> > > > >
> > > > > On 2018-06-25 07:30, Dyweni - Ceph-Users wrote:
> > > > >> Good Morning,
> > > > >>
> > > > >> After removing roughly 20-some rbd shapshots, one of my OSD's has
> > > > >> begun flapping.
> > > > >>
> > > > >>
> > > > >>  ERROR 1 
> > > > >>
> > > > >> 2018-06-25 06:46:39.132257 a0ce2700 -1 osd.8 pg_epoch: 44738
> >
> > pg[4.e8(
> >
> > > > >> v 44721'485588 (44697'484015,44721'485588]
> > > > >> local-lis/les=44593/44595 n=2972 ec=9422/9422 lis/c 44593/44593
> > > > >> les/c/f 44595/44595/40729 44593/44593/44593) [8,7,10] r=0
> > > > >> lpr=44593 crt=44721'485588 lcod 44721'485586 mlcod 44721'485586
> > > > >> active+clean+snapt
> > > > >> rim snaptrimq=[276~1,280~1,2af~1,2e8~4]] removing snap head
> > > > >> 2018-06-25 06:46:41.314172 a1ce2700 -1
> >
> 
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > > > >> In function 'void bluestore_extent_ref_map_t::put(uint64_t,
> >
> > uint32_t,
> >
> > > > >> PExtentVector*, bool*)' thread a1ce2700 time 2018-06-25
> > > > >> 06:46:41.220388
> >
> 
> /var/tmp/portage/sys-cluster/ceph-12.2.5/work/ceph-12.2.5/src/os/bluestore/bluestore_types.cc:
> > > > >> 217: FAILED assert(0 == "put on missing extent (nothing before)")
> > > > >>
> > > > >>  ceph version 12.2.5 (cad919881333ac92274171586c827e01f554a70a)
> > > > >> luminous (stable)
> > > > >>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> > > > >> const*)+0x1bc) [0x2a2c314]
> > > > >>  2: (bluestore_extent_ref_map_t::put(unsigned long long, unsigned
> >
> > int,
> >
> > > > >> std::vector > > > >> mempool::pool_allocator<(mempool::pool_index_t)4,
> >
> > bluestore_pextent_t>
> >
> > > > >> >*, bool*)+0x128) [0x2893650]
> > > > >>
> > > > >>  3: (BlueStore::SharedBlob::put_ref(unsigned long long, unsigned
> >
> > int,
> >
> > > > >> std::vector > > > >> mempool::pool_allocator<(mempool::pool_index_t)4,
> >
> > bluestore_pextent_t>
> >
> > > > >> >*, std::set > > > >> > std::less,
> > > >
> > > > std::allocator >*)+0xb8) [0x2791bdc]
> > > >
> > > > >>  4: (BlueStore::_wctx_finish(BlueStore::TransContext*,
> > > > >> boost::intrusive_ptr&,
> > > > >> boost::intrusive_ptr, BlueStore::WriteContext*,
> > > > >> std::set > 

Re: [ceph-users] slow export of cephfs through samba

2018-10-03 Thread Gregory Farnum
On Thu, Sep 27, 2018 at 7:37 AM Chad W Seys  wrote:

> Hi all,
>I am exporting cephfs using samba.  It is much slower over samba than
> direct. Anyone know how to speed it up?
>Benchmarked using bonnie++ 5 times either directly to cephfs mounted
> by kernel (v4.18.6) module:
> bonnie++ -> kcephfs
> or through a cifs kernel-module-mounted (protocol version 3.02) Samba
> (v4.8.5) share on the same machine.
> bonnie++ -> Samba -> kcephfs
>
> Abbreviated results for 5 runs:
> kcephfs:  min  max
>   file created 555  619   files/sec
>   sequential block input:  106.44   108.13MB/sec
>   sequential block output: 102.82   110.61MB/sec
>
> (There is a gigabit network between the client and the ceph cluster, so
> the block in/out is pleasing.)
>
> samba -> kcephfs: min  max
>   file created 45   files/sec
>   sequential block input:  22.8529.5MB/sec
>   sequential block output: 27.9530.01   MB/sec
>
> The block input/output is okay fast, but the files created per second is
> low.  Anyone know how to tweak Samba to speed it up?
>

That seems to be a samba tuning issue, which this isn't the right list for.
:/


>Would Samba vfs_ceph speed up access?  At the moment vfs_ceph in
> Debian depends on libceph1 10.2.5, so not too modern.
>

Well, good news and bad news here. Good news is that since vfs_ceph uses
the native cephfs client library, it ought to be faster (though I don't
have any data on how much going through samba itself costs against our
client).
Bad news is that samba's connection model doesn't play very nicely with
Ceph's — if you use vfs_ceph, every samba connection will turn into a brand
new Ceph connection — smb literally runs a fork on every incoming
connection — so you will see RAM/network connections/etc scale on a
per-smb-client basis. This works fine for a small number of clients but not
so great if you're planning to attach a bunch of them. :(
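
If you do want to experiment with it, a minimal vfs_ceph share looks
roughly like this (a sketch only; the ceph:* options and the "samba" cephx
user are assumptions, and "path" is a directory inside CephFS rather than a
local mount - check the vfs_ceph man page for your Samba version):

[smb-ceph]
 path = /srv/smb
 vfs objects = ceph
 ceph:config_file = /etc/ceph/ceph.conf
 ceph:user_id = samba
 kernel share modes = no
 read only = No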
-Greg


>
> Current Samba settings:
> [global]
>  dns proxy = No
>  hostname lookups = Yes
>  kerberos method = secrets and keytab
>  logging = syslog@1 /var/log/samba/log.%m
>  max log size = 10
>  panic action = /usr/share/samba/panic-action %d
>  realm = PHYSICS.WISC.EDU
>  security = USER
>  server signing = required
>  server string = %h server
>  workgroup = PHYSICS
>  fruit:nfs_aces = no
>  idmap config * : backend = tdb
> [smb]
>  ea support = Yes
>  inherit acls = Yes
>  inherit permissions = Yes
>  msdfs root = Yes
>  path = /srv/smb
>  read only = No
>  smb encrypt = desired
>  vfs objects = catia fruit streams_xattr
>  fruit:encoding = native
>
> Thanks!
> Chad.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> Sage,
> 
> Pool 66 is the only pool it shows right now. This a pool created months ago.
> ceph osd lspools
> 66 mypool
> 
> As we recreated mon db from OSDs, the pools for MDS was unusable. So we 
> deleted them.
> After we create another cephfs fs and pools we started MDS and it stucked on 
> creation. So we stopped MDS and removed fs and fs pools. Right now we do not 
> have MDS running nor we have cephfs related things.
> 
> ceph fs dump
> dumped fsmap epoch 1 e1
> enable_multiple, ever_enabled_multiple: 0,0
> compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
> ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds 
> uses versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file 
> layout v2,10=snaprealm v2}
> legacy client fscid: -1
> 
> No filesystems configured
> 
> ceph fs ls
> No filesystems enabled
> 
> Now pool 66 seems to only pool we have and it has been created months ago. 
> Then I guess there is something hidden out there.
> 
> Is there any way to find and delete it?

Ok, I'm concerned that the creating pg is in there if this is an old 
pool... did you perhaps run force_create_pg at some point?  Assuming you 
didn't, I think this is a bug in the process for rebuilding the mon 
store.. one that doesn't normally come up because the impact is this 
osdmap scan that is cheap in our test scenarios but clearly not cheap for 
your aged cluster.

In any case, there is a way to clear those out of the mon, but it's a bit 
dicey. 

1. stop all mons
2. make a backup of all mons
3. use ceph-kvstore-tool to extract the prefix=osd_pg_creating 
key=creating key on one of the mons
4. dump the object with ceph-dencoder type creating_pgs_t import FILE dump_json
5. hex edit the structure to remove all of the creating pgs, and adds pool 
66 to the created_pgs member.
6. verify with ceph-dencoder dump that the edit was correct...
7. inject the updated structure into all of the mons
8. start all mons

4-6 will probably be an iterative process... let's start by getting the 
structure out and dumping the current value?  

The code to refer to to understand the structure is src/mon/CreatingPGs.h 
encode/decode methods.
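
For steps 3 and 7 above, the commands would look roughly like this (mon
name and file names are placeholders; double-check the set syntax against
ceph-kvstore-tool --help for your version):

ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$MON/store.db get osd_pg_creating creating > creating.bin
ceph-kvstore-tool rocksdb /var/lib/ceph/mon/ceph-$MON/store.db set osd_pg_creating creating in creating.new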

sage


> 
> 
> > On 3 Oct 2018, at 16:46, Sage Weil  wrote:
> > 
> > Oh... I think this is the problem:
> > 
> > 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> > 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> > 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> > 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> > 
> > You are in the midst of creating new pgs, and unfortunately pg create is 
> > one of the last remaining places where the OSDs need to look at a full 
> > history of map changes between then and the current map epoch.  In this 
> > case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> > later.
> > 
> > What is this new pool for?  Is it still empty, and if so, can we delete 
> > it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> > and restarting the OSDs will make these pg_creates go away.
> > 
> > s
> > 
> > On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> > 
> >> Hello,
> >> 
> >> It seems nothing has changed.
> >> 
> >> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
> >> 
> >> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
> >> 
> >> 
> >> 
> >>> On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
> >>> 
> >>> Hello,
> >>> 
> >>> 
> >>> You can also reduce the osd map updates by adding this to your ceph
> >>> config file. "osd crush update on start = false". This should remove
> >>> and update that is generated when osd starts.
> >>> 
> >>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> e14 handle_command mon_command({"prefix": "osd crush
> >>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> >>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> >>> dispatch
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> >>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> >>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> >>> "root=default"]} v 0) v1
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> >>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> >>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> >>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> >>> 2018-10-03 14:03:21.538 7fe15eddb700  0
> >>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> >>> 'osd.47' initial_weight 3.6396 at location
> >>> 

Re: [ceph-users] Help! OSDs across the cluster just crashed

2018-10-03 Thread Gregory Farnum
Yeah, don't run these commands blind. They are changing the local metadata
of the PG in ways that may make it inconsistent with the overall cluster
and result in lost data.

Brett, it seems this issue has come up several times in the field but we
haven't been able to reproduce it locally or get enough info to debug
what's going on: https://tracker.ceph.com/issues/21142
Maybe run through that ticket and see if you can contribute new logs or add
detail about possible sources?
-Greg

On Tue, Oct 2, 2018 at 3:18 PM Goktug Yildirim 
wrote:

> Hi,
>
> Sorry to hear that. I’ve been battling with mine for 2 weeks :/
>
> I’ve corrected mine OSDs with the following commands. My OSD logs
> (/var/log/ceph/ceph-OSDx.log) has a line including log(EER) with the PG
> number besides and before crash dump.
>
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
> trim-pg-log --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op fix-lost
> --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op repair
> --pgid $2
> ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-$1/ --op
> mark-complete --pgid $2
> systemctl restart ceph-osd@$1
>
> I dont know if it works for you but it may be no harm to try for an OSD.
>
> There is such less information about this tools. So it might be risky. I
> hope someone much experienced could help more.
>
>
> > On 2 Oct 2018, at 23:23, Brett Chancellor 
> wrote:
> >
> > Help. I have a 60 node cluster and most of the OSDs decided to crash
> themselves at the same time. They wont restart, the messages look like...
> >
> > --- begin dump of recent events ---
> >  0> 2018-10-02 21:19:16.990369 7f57ab5b7d80 -1 *** Caught signal
> (Aborted) **
> >  in thread 7f57ab5b7d80 thread_name:ceph-osd
> >
> >  ceph version 12.2.4 (52085d5249a80c5f5121a76d6288429f35e4e77b) luminous
> (stable)
> >  1: (()+0xa3c611) [0x556d618bb611]
> >  2: (()+0xf6d0) [0x7f57a885e6d0]
> >  3: (gsignal()+0x37) [0x7f57a787f277]
> >  4: (abort()+0x148) [0x7f57a7880968]
> >  5: (ceph::__ceph_assert_fail(char const*, char const*, int, char
> const*)+0x284) [0x556d618fa6e4]
> >  6: (pi_compact_rep::add_interval(bool, PastIntervals::pg_interval_t
> const&)+0x3b2) [0x556d615c74a2]
> >  7: (PastIntervals::check_new_interval(int, int, std::vector std::allocator > const&, std::vector >
> const&, int, int, std::vector > const&,
> std::vector > const&, unsigned int, unsigned int,
> std::shared_ptr, std::shared_ptr, pg_t,
> IsPGRecoverablePredicate*, PastIntervals*, std::ostream*)+0x380)
> [0x556d615ae6c0]
> >  8: (OSD::build_past_intervals_parallel()+0x9ff) [0x556d613707af]
> >  9: (OSD::load_pgs()+0x545) [0x556d61373095]
> >  10: (OSD::init()+0x2169) [0x556d613919d9]
> >  11: (main()+0x2d07) [0x556d61295dd7]
> >  12: (__libc_start_main()+0xf5) [0x7f57a786b445]
> >  13: (()+0x4b53e3) [0x556d613343e3]
> >  NOTE: a copy of the executable, or `objdump -rdS ` is
> needed to interpret this.
> >
> >
> > Some hosts have no working OSDs, others seem to have 1 working, and 2
> dead.  It's spread all across the cluster, across several different racks.
> Any idea on where to look next? The cluster is dead in the water right now.
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] getattr - failed to rdlock waiting

2018-10-03 Thread Gregory Farnum
On Tue, Oct 2, 2018 at 12:18 PM Thomas Sumpter 
wrote:

> Hi Folks,
>
>
>
> I am looking for advice on how to troubleshoot some long operations found
> in MDS. Most of the time performance is fantastic, but occasionally and to
> no real pattern or trend, a gettattr op will take up to ~30 seconds to
> complete in MDS which is stuck on "event": "failed to rdlock, waiting"
>
>
>
> E.g.
>
> "description": "client_request(client.84183:54794012 getattr pAsLsXsFs
> #0x1038585 2018-10-02 07:56:27.554282 caller_uid=48, caller_gid=48{})",
>
> "duration": 28.987992,
>
> {
>
> "time": "2018-09-25 07:56:27.552511",
>
> "event": "failed to rdlock, waiting"
>
> },
>
> {
>
> "time": "2018-09-25 07:56:56.529748",
>
> "event": "failed to rdlock, waiting"
>
> },
>
> {
>
> "time": "2018-09-25 07:56:56.540386",
>
> "event": "acquired locks"
>
> }
>
>
>
> I can find no corresponding long op on any of the OSDs and no other op in
> MDS which this one could be waiting for.
>
> Nearly all configuration will be the default. Currently have a small
> amount of data which is constantly being updated. 1 data pool and 1
> metadata pool.
>
> How can I track down what is holding up this op and try to stop it
> happening?
>

This is a weakness in the MDS introspection right now, unfortunately.

What the error message literally means is what it says — the op needs to
get a read lock, but it can't, so it's waiting. This might mean that
there's an MDS op in progress, but it usually means there's a client which
is holding "write capabilities" on the inode in question, and it's asking
for/waiting for that client to drop those capabilities.

This might take a while because of a buggy client, or because the client
had a very large amount of buffered writes it is now frantically trying to
flush out to RADOS as fast as it can.
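
To see who is holding things up, the MDS admin socket is usually the place
to look, e.g. (daemon name is whatever your active MDS is called):

ceph daemon mds.<name> dump_ops_in_flight   # the stuck getattr and its current flag_point
ceph daemon mds.<name> session ls           # connected clients and how many caps each holds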
-Greg


>
>
> # rados df
>
> …
>
> total_objects191
>
> total_used   5.7 GiB
>
> total_avail  367 GiB
>
> total_space  373 GiB
>
>
>
>
>
> Cephfs version 13.2.1 on CentOs 7.5
>
> Kernel: 3.10.0-862.11.6.el7.x86_64
>
> 1x Active MDS, 1x Replay Standby MDS
>
> 3x MON
>
> 4x OSD
>
> Bluestore FS
>
>
>
> Ceph kernel client on CentOs 7.4
>
> Kernel: 4.18.7-1.el7.elrepo.x86_64  (almost the latest, should be good?)
>
>
>
> Many Thanks!
>
> Tom
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Alfredo Deza
On Wed, Oct 3, 2018 at 11:23 AM Andras Pataki
 wrote:
>
> Thanks - I didn't realize that was such a recent fix.
>
> I've now tried 12.2.8, and perhaps I'm not clear on what I should have
> done to the OSD that I'm replacing, since I'm getting the error "The osd
> ID 747 is already in use or does not exist.".  The case is clearly the
> latter, since I've completely removed the old OSD (osd crush remove,
> auth del, osd rm, wipe disk).  Should I have done something different
> (i.e. not remove the OSD completely)?

Yeah, you completely removed it, so now it can't be re-used. This is
the proper way if you want to re-use the ID:

http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/#rados-replacing-an-osd

Basically:

ceph osd destroy {id} --yes-i-really-mean-it
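
A full same-ID replacement would then look roughly like this (a sketch
reusing your prepare invocation from earlier in the thread; adjust the zap
target to whatever the replacement device/LV is):

ceph osd destroy 747 --yes-i-really-mean-it
ceph-volume lvm zap H901D44/H901D44
ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44 --block.db /dev/disk/by-partlabel/H901J44
ceph-volume lvm activate --all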

> Searching the docs I see a command 'ceph osd destroy'.  What does that
> do (compared to my removal procedure, osd crush remove, auth del, osd rm)?
>
> Thanks,
>
> Andras
>
>
> On 10/3/18 10:36 AM, Alfredo Deza wrote:
> > On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
> >  wrote:
> >> After replacing failing drive I'd like to recreate the OSD with the same
> >> osd-id using ceph-volume (now that we've moved to ceph-volume from
> >> ceph-disk).  However, I seem to not be successful.  The command I'm using:
> >>
> >> ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
> >> --block.db /dev/disk/by-partlabel/H901J44
> >>
> >> But it created an OSD the ID 601, which was the lowest it could allocate
> >> and ignored the 747 apparently.  This is with ceph 12.2.7. Any ideas?
> > Yeah, this was a problem that was fixed and released as part of 12.2.8
> >
> > The tracker issue is: http://tracker.ceph.com/issues/24044
> >
> > The Luminous PR is https://github.com/ceph/ceph/pull/23102
> >
> > Sorry for the trouble!
> >> Andras
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover data from cluster / get rid of down, incomplete, unknown pgs

2018-10-03 Thread Gregory Farnum
If you've really extracted all the PGs from the down OSDs, you should have
been able to inject them into new OSDs and continue on from there with just
rebalancing activity. The use of mark_unfound_lost_revert complicates
matters a bit but I'm not sure what the behavior would be if you just put
them in place now.
Really though you'll need to find out why the incomplete PGs are marked
that way; it might be that some of the live OSDs are aware of writes
they've missed but believe those writes should still be in the cluster
somewhere. Also, which version are you running? It might be that some of
the PGs are failing to map correctly onto enough OSDs since a quarter of
them are dead.
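
For the injection step, the usual ceph-objectstore-tool pattern is roughly
(OSD id, PG id and file names are placeholders; the target OSD must be
stopped while you import):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-NEWID --op import --file /backup/PGID.export
systemctl start ceph-osd@NEWID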
-Greg

On Tue, Oct 2, 2018 at 11:14 AM Dylan Jones 
wrote:

> Our ceph cluster stopped responding to requests two weeks ago, and I have
> been trying to fix it since then.  After a semi-hard reboot, we had 11-ish
> OSDs "fail" spread across two hosts, with the pool size set to two.  I was
> able to extract a copy of every PG that resided solely on the nonfunctional
> OSDs, but the cluster is refusing to let me read from it.  I marked all the
> "failed" OSDs as lost and used ceph pg $pg mark_unfound_lost revert for
> all the PGs reporting unfound objects, but that didn't help either.
> ddrescue also breaks, because ceph will never admit that it has lost data
> and just blocks forever instead of returning a read error.
> Is there any way to tell ceph to cut its losses and just let me access my
> data again?
>
>
>   cluster:
> id: 313be153-5e8a-4275-b3aa-caea1ce7bce2
> health: HEALTH_ERR
> noout,nobackfill,norebalance flag(s) set
> 2720183/6369036 objects misplaced (42.709%)
> 9/3184518 objects unfound (0.000%)
> 39 scrub errors
> Reduced data availability: 131 pgs inactive, 16 pgs down, 114
> pgs incomplete
> Possible data damage: 7 pgs recovery_unfound, 1 pg
> inconsistent, 7 pgs snaptrim_error
> Degraded data redundancy: 1710175/6369036 objects degraded
> (26.851%), 1069 pgs degraded, 1069 pgs undersized
> Degraded data redundancy (low space): 82 pgs backfill_toofull
>
>   services:
> mon: 1 daemons, quorum waitaha
> mgr: waitaha(active)
> osd: 43 osds: 34 up, 34 in; 1786 remapped pgs
>  flags noout,nobackfill,norebalance
>
>   data:
> pools:   2 pools, 2048 pgs
> objects: 3.18 M objects, 8.4 TiB
> usage:   21 TiB used, 60 TiB / 82 TiB avail
> pgs: 0.049% pgs unknown
>  6.348% pgs not active
>  1710175/6369036 objects degraded (26.851%)
>  2720183/6369036 objects misplaced (42.709%)
>  9/3184518 objects unfound (0.000%)
>  987 active+undersized+degraded+remapped+backfill_wait
>  695 active+remapped+backfill_wait
>  124 active+clean
>  114 incomplete
>  62
> active+undersized+degraded+remapped+backfill_wait+backfill_toofull
>  20  active+remapped+backfill_wait+backfill_toofull
>  16  down
>  12  active+undersized+degraded+remapped+backfilling
>  7   active+recovery_unfound+undersized+degraded+remapped
>  7   active+clean+snaptrim_error
>  2   active+remapped+backfilling
>  1   unknown
>  1
> active+undersized+degraded+remapped+inconsistent+backfill_wait
>
> Thanks,
> Dylan
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki

Thanks - I didn't realize that was such a recent fix.

I've now tried 12.2.8, and perhaps I'm not clear on what I should have 
done to the OSD that I'm replacing, since I'm getting the error "The osd 
ID 747 is already in use or does not exist.".  The case is clearly the 
latter, since I've completely removed the old OSD (osd crush remove, 
auth del, osd rm, wipe disk).  Should I have done something different 
(i.e. not remove the OSD completely)?
Searching the docs I see a command 'ceph osd destroy'.  What does that 
do (compared to my removal procedure, osd crush remove, auth del, osd rm)?


Thanks,

Andras


On 10/3/18 10:36 AM, Alfredo Deza wrote:

On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
 wrote:

After replacing failing drive I'd like to recreate the OSD with the same
osd-id using ceph-volume (now that we've moved to ceph-volume from
ceph-disk).  However, I seem to not be successful.  The command I'm using:

ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
--block.db /dev/disk/by-partlabel/H901J44

But it created an OSD the ID 601, which was the lowest it could allocate
and ignored the 747 apparently.  This is with ceph 12.2.7. Any ideas?

Yeah, this was a problem that was fixed and released as part of 12.2.8

The tracker issue is: http://tracker.ceph.com/issues/24044

The Luminous PR is https://github.com/ceph/ceph/pull/23102

Sorry for the trouble!

Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug Yildirim
Sage,

Pool 66 is the only pool it shows right now. This a pool created months ago.
ceph osd lspools
66 mypool

As we recreated the mon db from the OSDs, the pools for the MDS were unusable, so 
we deleted them.
After we created another cephfs filesystem and its pools, we started the MDS and 
it got stuck on creation. So we stopped the MDS and removed the fs and its pools. 
Right now we do not have an MDS running, nor do we have anything cephfs-related.

ceph fs dump
dumped fsmap epoch 1 e1
enable_multiple, ever_enabled_multiple: 0,0
compat: compat={},rocompat={},incompat={1=base v0.20,2=client writeable 
ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses 
versioned encoding,6=dirfrag is stored in omap,8=no anchor table,9=file layout 
v2,10=snaprealm v2}
legacy client fscid: -1

No filesystems configured

ceph fs ls
No filesystems enabled

Now pool 66 seems to be the only pool we have, and it was created months ago. Then 
I guess there is something hidden out there.

Is there any way to find and delete it?


> On 3 Oct 2018, at 16:46, Sage Weil  wrote:
> 
> Oh... I think this is the problem:
> 
> 2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
> 66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
> 66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
> 66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916
> 
> You are in the midst of creating new pgs, and unfortunately pg create is 
> one of the last remaining places where the OSDs need to look at a full 
> history of map changes between then and the current map epoch.  In this 
> case, the pool was created in 60196 and it is now 72883, ~12k epochs 
> later.
> 
> What is this new pool for?  Is it still empty, and if so, can we delete 
> it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
> and restarting the OSDs will make these pg_creates go away.
> 
> s
> 
> On Wed, 3 Oct 2018, Goktug Yildirim wrote:
> 
>> Hello,
>> 
>> It seems nothing has changed.
>> 
>> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
>> 
>> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
>> 
>> 
>> 
>>> On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
>>> 
>>> Hello,
>>> 
>>> 
>>> You can also reduce the osd map updates by adding this to your ceph
>>> config file. "osd crush update on start = false". This should remove
>>> and update that is generated when osd starts.
>>> 
>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> e14 handle_command mon_command({"prefix": "osd crush
>>> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
>>> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
>>> dispatch
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
>>> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
>>> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
>>> "root=default"]} v 0) v1
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
>>> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
>>> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
>>> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
>>> 2018-10-03 14:03:21.538 7fe15eddb700  0
>>> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
>>> 'osd.47' initial_weight 3.6396 at location
>>> {host=SRV-SEKUARK8,root=default}
>>> 2018-10-03 14:03:22.250 7fe1615e0700  1
>>> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
>>> enabled
>>> 
>>> 
>>> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>>>  wrote:
 
 Hi Sage,
 
 Thank you for your response. Now I am sure this incident is going to be 
 resolved.
 
 The problem started when 7 server crashed same time and they came back 
 after ~5 minutes.
 
 Two of our 3 mon services were restarted in this crash. Since mon services 
 are enabled they should be started nearly at the same time. I dont know if 
 this makes any difference but some of the guys on IRC told it is required 
 that they start in order not at the same time. Otherwise it could break 
 things badly.
 
 After 9 days we still see 3400-3500 active+clear PG. But in the end we 
 have so many STUCK request and our cluster can not heal itself.
 
 When we set noup flag, OSDs can catch up epoch easily. But when we unset 
 the flag we see so many STUCKS and SLOW OPS in 1 hour.
 I/O load on all of my OSD disks are at around %95 utilization and never 
 ends. CPU and RAM usage are OK.
 OSDs get stuck that we even can't run “ceph pg osd.0 query”.
 
 Also we tried to change RBD pool replication size 2 to 1. Our goal was the 
 eliminate older PG's and leaving 

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Sage Weil
Oh... I think this is the problem:

2018-10-03 16:37:04.284 7efef2ae0700 20 slow op osd_pg_create(e72883 
66.af:60196 66.ba:60196 66.be:60196 66.d8:60196 66.f8:60196 66.124:60196 
66.14c:60196 66.1ac:60196 66.223:60196 66.248:60196 66.271:60196 
66.2d1:60196 66.47a:68641) initiated 2018-10-03 16:20:01.915916

You are in the midst of creating new pgs, and unfortunately pg create is 
one of the last remaining places where the OSDs need to look at a full 
history of map changes between then and the current map epoch.  In this 
case, the pool was created in 60196 and it is now 72883, ~12k epochs 
later.

What is this new pool for?  Is it still empty, and if so, can we delete 
it? If yes, I'm ~70% sure that will then get cleaned out at the mon end 
and restarting the OSDs will make these pg_creates go away.
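(If you want to sanity-check the gap while they churn through those maps, 
comparing the cluster's current epoch with what an OSD has actually consumed 
should show it closing - something like:

ceph osd dump | head -1        # current osdmap epoch according to the mons
ceph daemon osd.150 status     # oldest_map / newest_map as seen by that OSD, run on its host

- osd.150 just because it shows up in your logs.)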

s

On Wed, 3 Oct 2018, Goktug Yildirim wrote:

> Hello,
> 
> It seems nothing has changed.
> 
> OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 
> 
> OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 
> 
> 
> 
> > On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
> > 
> > Hello,
> > 
> > 
> > You can also reduce the osd map updates by adding this to your ceph
> > config file. "osd crush update on start = false". This should remove
> > and update that is generated when osd starts.
> > 
> > 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> > e14 handle_command mon_command({"prefix": "osd crush
> > set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> > 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> > from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> > "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> > dispatch
> > 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> > e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> > "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> > "root=default"]} v 0) v1
> > 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> > from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> > "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> > ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> > 2018-10-03 14:03:21.538 7fe15eddb700  0
> > mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> > 'osd.47' initial_weight 3.6396 at location
> > {host=SRV-SEKUARK8,root=default}
> > 2018-10-03 14:03:22.250 7fe1615e0700  1
> > mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> > enabled
> > 
> > 
> > On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
> >  wrote:
> >> 
> >> Hi Sage,
> >> 
> >> Thank you for your response. Now I am sure this incident is going to be 
> >> resolved.
> >> 
> >> The problem started when 7 server crashed same time and they came back 
> >> after ~5 minutes.
> >> 
> >> Two of our 3 mon services were restarted in this crash. Since mon services 
> >> are enabled they should be started nearly at the same time. I dont know if 
> >> this makes any difference but some of the guys on IRC told it is required 
> >> that they start in order not at the same time. Otherwise it could break 
> >> things badly.
> >> 
> >> After 9 days we still see 3400-3500 active+clear PG. But in the end we 
> >> have so many STUCK request and our cluster can not heal itself.
> >> 
> >> When we set noup flag, OSDs can catch up epoch easily. But when we unset 
> >> the flag we see so many STUCKS and SLOW OPS in 1 hour.
> >> I/O load on all of my OSD disks are at around %95 utilization and never 
> >> ends. CPU and RAM usage are OK.
> >> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
> >> 
> >> Also we tried to change RBD pool replication size 2 to 1. Our goal was the 
> >> eliminate older PG's and leaving cluster with good ones.
> >> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve 
> >> our problem.
> >> 
> >> Of course we have to save %100 of data. But we feel like even saving %50 
> >> of our data will be make us very happy right now.
> >> 
> >> This is what happens when the cluster starts. I believe it explains the 
> >> whole story very nicely.
> >> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
> >> 
> >> This is our ceph.conf:
> >> https://paste.ubuntu.com/p/8sQhfPDXnW/
> >> 
> >> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> >> https://paste.ubuntu.com/p/g5t8xnrjjZ/
> >> 
> >> This is pg dump:
> >> https://paste.ubuntu.com/p/zYqsN5T95h/
> >> 
> >> This is iostat & perf top:
> >> https://paste.ubuntu.com/p/Pgf3mcXXX8/
> >> 
> >> This strace output of ceph-osd:
> >> https://paste.ubuntu.com/p/YCdtfh5qX8/
> >> 
> >> This is OSD log (default debug):
> >> https://paste.ubuntu.com/p/Z2JrrBzzkM/
> >> 
> >> This is leader MON log (default debug):
> >> https://paste.ubuntu.com/p/RcGmsVKmzG/
> >> 
> 

Re: [ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Alfredo Deza
On Wed, Oct 3, 2018 at 9:57 AM Andras Pataki
 wrote:
>
> After replacing failing drive I'd like to recreate the OSD with the same
> osd-id using ceph-volume (now that we've moved to ceph-volume from
> ceph-disk).  However, I seem to not be successful.  The command I'm using:
>
> ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44
> --block.db /dev/disk/by-partlabel/H901J44
>
> But it created an OSD the ID 601, which was the lowest it could allocate
> and ignored the 747 apparently.  This is with ceph 12.2.7. Any ideas?

Yeah, this was a problem that was fixed and released as part of 12.2.8

The tracker issue is: http://tracker.ceph.com/issues/24044

The Luminous PR is https://github.com/ceph/ceph/pull/23102

Sorry for the trouble!
>
> Andras
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-volume: recreate OSD with same ID after drive replacement

2018-10-03 Thread Andras Pataki
After replacing a failing drive I'd like to recreate the OSD with the same 
osd-id using ceph-volume (now that we've moved to ceph-volume from 
ceph-disk).  However, I don't seem to be successful.  The command I'm using:


ceph-volume lvm prepare --bluestore --osd-id 747 --data H901D44/H901D44 
--block.db /dev/disk/by-partlabel/H901J44


But it created an OSD with the ID 601, which was the lowest it could allocate, 
and apparently ignored the 747.  This is with ceph 12.2.7. Any ideas?


Andras

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug Yildirim
Hello,

It seems nothing has changed.

OSD config: https://paste.ubuntu.com/p/MtvTr5HYW4/ 

OSD debug log: https://paste.ubuntu.com/p/7Sx64xGzkR/ 



> On 3 Oct 2018, at 14:27, Darius Kasparavičius  wrote:
> 
> Hello,
> 
> 
> You can also reduce the osd map updates by adding this to your ceph
> config file. "osd crush update on start = false". This should remove
> and update that is generated when osd starts.
> 
> 2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush
> set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
> 2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
> dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
> e14 handle_command mon_command({"prefix": "osd crush create-or-move",
> "id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
> "root=default"]} v 0) v1
> 2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
> from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
> "osd crush create-or-move", "id": 47, "weight":3.6396, "args":
> ["host=SRV-SEKUARK8", "root=default"]}]: dispatch
> 2018-10-03 14:03:21.538 7fe15eddb700  0
> mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
> 'osd.47' initial_weight 3.6396 at location
> {host=SRV-SEKUARK8,root=default}
> 2018-10-03 14:03:22.250 7fe1615e0700  1
> mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
> enabled
> 
> 
> On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
>  wrote:
>> 
>> Hi Sage,
>> 
>> Thank you for your response. Now I am sure this incident is going to be 
>> resolved.
>> 
>> The problem started when 7 server crashed same time and they came back after 
>> ~5 minutes.
>> 
>> Two of our 3 mon services were restarted in this crash. Since mon services 
>> are enabled they should be started nearly at the same time. I dont know if 
>> this makes any difference but some of the guys on IRC told it is required 
>> that they start in order not at the same time. Otherwise it could break 
>> things badly.
>> 
>> After 9 days we still see 3400-3500 active+clear PG. But in the end we have 
>> so many STUCK request and our cluster can not heal itself.
>> 
>> When we set noup flag, OSDs can catch up epoch easily. But when we unset the 
>> flag we see so many STUCKS and SLOW OPS in 1 hour.
>> I/O load on all of my OSD disks are at around %95 utilization and never 
>> ends. CPU and RAM usage are OK.
>> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>> 
>> Also we tried to change RBD pool replication size 2 to 1. Our goal was the 
>> eliminate older PG's and leaving cluster with good ones.
>> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our 
>> problem.
>> 
>> Of course we have to save %100 of data. But we feel like even saving %50 of 
>> our data will be make us very happy right now.
>> 
>> This is what happens when the cluster starts. I believe it explains the 
>> whole story very nicely.
>> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>> 
>> This is our ceph.conf:
>> https://paste.ubuntu.com/p/8sQhfPDXnW/
>> 
>> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
>> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>> 
>> This is pg dump:
>> https://paste.ubuntu.com/p/zYqsN5T95h/
>> 
>> This is iostat & perf top:
>> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>> 
>> This strace output of ceph-osd:
>> https://paste.ubuntu.com/p/YCdtfh5qX8/
>> 
>> This is OSD log (default debug):
>> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>> 
>> This is leader MON log (default debug):
>> https://paste.ubuntu.com/p/RcGmsVKmzG/
>> 
>> These are OSDs failed to start. Total number is 58.
>> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
>> https://paste.ubuntu.com/p/pkRdVjCH4D/
>> https://paste.ubuntu.com/p/zJTf2fzSj9/
>> https://paste.ubuntu.com/p/xpJRK6YhRX/
>> https://paste.ubuntu.com/p/SY3576dNbJ/
>> https://paste.ubuntu.com/p/smyT6Y976b/
>> 
>> 
>> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 
>> 20.
>> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>> 
>> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore 
>> = 20.
>> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>> 
>> As far as I understand OSD catchs up with the mon epoch and exceeds mon 
>> epoch somehow??
>> 
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 
>> e60196@2018-09-28 23:57:08.251119
>> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 
>> build_initial_pg_history 66.f8 created 60196
>> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - 
>> loading 

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Darius Kasparavičius
Hello,


You can also reduce the osd map updates by adding this to your ceph
config file: "osd crush update on start = false". This should remove
an update that is generated when the osd starts.
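For example, in ceph.conf (section placement is the usual one; adjust to
your setup):

[osd]
osd crush update on start = false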

2018-10-03 14:03:21.534 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
e14 handle_command mon_command({"prefix": "osd crush
set-device-class", "class": "hdd", "ids": ["47"]} v 0) v1
2018-10-03 14:03:21.534 7fe15eddb700  0 log_channel(audit) log [INF] :
from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
"osd crush set-device-class", "class": "hdd", "ids": ["47"]}]:
dispatch
2018-10-03 14:03:21.538 7fe15eddb700  0 mon.SRV-SBKUARK14@0(leader)
e14 handle_command mon_command({"prefix": "osd crush create-or-move",
"id": 47, "weight":3.6396, "args": ["host=SRV-SEKUARK8",
"root=default"]} v 0) v1
2018-10-03 14:03:21.538 7fe15eddb700  0 log_channel(audit) log [INF] :
from='osd.47 10.10.112.17:6803/64652' entity='osd.47' cmd=[{"prefix":
"osd crush create-or-move", "id": 47, "weight":3.6396, "args":
["host=SRV-SEKUARK8", "root=default"]}]: dispatch
2018-10-03 14:03:21.538 7fe15eddb700  0
mon.SRV-SBKUARK14@0(leader).osd e72601 create-or-move crush item name
'osd.47' initial_weight 3.6396 at location
{host=SRV-SEKUARK8,root=default}
2018-10-03 14:03:22.250 7fe1615e0700  1
mon.SRV-SBKUARK14@0(leader).osd e72601 do_prune osdmap full prune
enabled


On Wed, Oct 3, 2018 at 3:16 PM Goktug Yildirim
 wrote:
>
> Hi Sage,
>
> Thank you for your response. Now I am sure this incident is going to be 
> resolved.
>
> The problem started when 7 server crashed same time and they came back after 
> ~5 minutes.
>
> Two of our 3 mon services were restarted in this crash. Since mon services 
> are enabled they should be started nearly at the same time. I dont know if 
> this makes any difference but some of the guys on IRC told it is required 
> that they start in order not at the same time. Otherwise it could break 
> things badly.
>
> After 9 days we still see 3400-3500 active+clear PG. But in the end we have 
> so many STUCK request and our cluster can not heal itself.
>
> When we set noup flag, OSDs can catch up epoch easily. But when we unset the 
> flag we see so many STUCKS and SLOW OPS in 1 hour.
> I/O load on all of my OSD disks are at around %95 utilization and never ends. 
> CPU and RAM usage are OK.
> OSDs get stuck that we even can't run “ceph pg osd.0 query”.
>
> Also we tried to change RBD pool replication size 2 to 1. Our goal was the 
> eliminate older PG's and leaving cluster with good ones.
> With replication size=1 we saw "%13 PGS not active”. But it didn’t solve our 
> problem.
>
> Of course we have to save %100 of data. But we feel like even saving %50 of 
> our data will be make us very happy right now.
>
> This is what happens when the cluster starts. I believe it explains the whole 
> story very nicely.
> https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
>
> This is our ceph.conf:
> https://paste.ubuntu.com/p/8sQhfPDXnW/
>
> This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
> https://paste.ubuntu.com/p/g5t8xnrjjZ/
>
> This is pg dump:
> https://paste.ubuntu.com/p/zYqsN5T95h/
>
> This is iostat & perf top:
> https://paste.ubuntu.com/p/Pgf3mcXXX8/
>
> This strace output of ceph-osd:
> https://paste.ubuntu.com/p/YCdtfh5qX8/
>
> This is OSD log (default debug):
> https://paste.ubuntu.com/p/Z2JrrBzzkM/
>
> This is leader MON log (default debug):
> https://paste.ubuntu.com/p/RcGmsVKmzG/
>
> These are OSDs failed to start. Total number is 58.
> https://paste.ubuntu.com/p/ZfRD5ZtvpS/
> https://paste.ubuntu.com/p/pkRdVjCH4D/
> https://paste.ubuntu.com/p/zJTf2fzSj9/
> https://paste.ubuntu.com/p/xpJRK6YhRX/
> https://paste.ubuntu.com/p/SY3576dNbJ/
> https://paste.ubuntu.com/p/smyT6Y976b/
>
>
> This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 
> 20.
> https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
>
> This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore 
> = 20.
> https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
>
> As far as I understand OSD catchs up with the mon epoch and exceeds mon epoch 
> somehow??
>
> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 
> e60196@2018-09-28 23:57:08.251119
> 2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 
> build_initial_pg_history 66.f8 created 60196
> 2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading 
> and decoding 0x19da8400
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
> to_process <> waiting <> waiting_peering {}
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 
> OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
> NullEvt +create_info) prio 255 cost 10 e72642) queued
> 2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
> to_process  epoch_requested: 72642 

Re: [ceph-users] Mimic offline problem

2018-10-03 Thread Goktug Yildirim
Hi Sage,

Thank you for your response. Now I am sure this incident is going to be 
resolved.

The problem started when 7 servers crashed at the same time and came back after 
~5 minutes. 

Two of our 3 mon services were restarted in this crash. Since the mon services 
are enabled, they started at nearly the same time. I don't know if this makes 
any difference, but some of the guys on IRC said they are required to start in 
order, not at the same time; otherwise it could break things badly.

After 9 days we still see 3400-3500 active+clean PGs. But in the end we have so 
many STUCK requests and our cluster cannot heal itself.

When we set the noup flag, OSDs can catch up on epochs easily. But when we unset 
the flag we see so many STUCK and SLOW OPS messages within 1 hour.
I/O load on all of my OSD disks is at around 95% utilization and never ends. 
CPU and RAM usage are OK.
OSDs get so stuck that we can't even run “ceph pg osd.0 query”.

Also we tried to change the RBD pool replication size from 2 to 1. Our goal was 
to eliminate older PGs and leave the cluster with good ones. 
With replication size=1 we saw "13% PGS not active”, but it didn’t solve our 
problem. 

Of course we would like to save 100% of the data, but even saving 50% of our 
data would make us very happy right now. 

This is what happens when the cluster starts. I believe it explains the whole 
story very nicely.
https://drive.google.com/file/d/1-HHuACyXkYt7e0soafQwAbWJP1qs8-u1/view?usp=sharing
 


This is our ceph.conf:
https://paste.ubuntu.com/p/8sQhfPDXnW/ 

This is the output of "osd stat && osd epochs && ceph -s && ceph health”:
https://paste.ubuntu.com/p/g5t8xnrjjZ/ 

This is pg dump:
https://paste.ubuntu.com/p/zYqsN5T95h/ 

This is iostat & perf top:
https://paste.ubuntu.com/p/Pgf3mcXXX8/ 

This strace output of ceph-osd:
https://paste.ubuntu.com/p/YCdtfh5qX8/ 

This is OSD log (default debug):
https://paste.ubuntu.com/p/Z2JrrBzzkM/ 

This is leader MON log (default debug):
https://paste.ubuntu.com/p/RcGmsVKmzG/ 

These are OSDs failed to start. Total number is 58.
https://paste.ubuntu.com/p/ZfRD5ZtvpS/ 
https://paste.ubuntu.com/p/pkRdVjCH4D/ 
https://paste.ubuntu.com/p/zJTf2fzSj9/ 
https://paste.ubuntu.com/p/xpJRK6YhRX/ 
https://paste.ubuntu.com/p/SY3576dNbJ/ 
https://paste.ubuntu.com/p/smyT6Y976b/ 


This is OSD video with debug osd = 20 and debug ms = 1 and debug_filestore = 20.
https://drive.google.com/file/d/1UHHocK3Wy8pVpgZ4jV8Rl1z7rqK3bcJi/view?usp=sharing
 


This is OSD logfile with debug osd = 20 and debug ms = 1 and debug_filestore = 
20.
https://drive.google.com/file/d/1gH5Z0dUe36jM8FaulahEL36sxXrhORWI/view?usp=sharing
 


As far as I understand, the OSD catches up with the mon epoch and somehow 
exceeds it??

2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 mkpg 66.f8 
e60196@2018-09-28 23:57:08.251119
2018-10-03 14:55:08.653 7f66c0bf9700 10 osd.150 72642 build_initial_pg_history 
66.f8 created 60196
2018-10-03 14:55:08.653 7f66c0bf9700 20 osd.150 72642 get_map 60196 - loading 
and decoding 0x19da8400
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
to_process <> waiting <> waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 
OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
NullEvt +create_info) prio 255 cost 10 e72642) queued
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 66.d8 
to_process  waiting <> 
waiting_peering {}
2018-10-03 14:55:08.653 7f66a6bc5700 20 osd.150 op_wq(1) _process 
OpQueueItem(66.d8 PGPeeringEvent(epoch_sent: 72642 epoch_requested: 72642 
NullEvt +create_info) prio 255 cost 10 e72642) pg 0xb579400
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 pg_epoch: 72642 pg[66.d8( v 
39934'8971934 (38146'8968839,39934'8971934] local-lis/les=72206/72212 n=2206 
ec=50786/50786 lis/c 72206/72206 les/c/f 72212/72212/0 72642/72642/72642) [150] 
r=0 lpr=72642 pi=[72206,72642)/1 crt=39934'8971934 lcod 0'0 mlcod 0'0 peering 
mbc={} ps=[1~11]] do_peering_event: epoch_sent: 72642 epoch_requested: 72642 
NullEvt +create_info
2018-10-03 14:55:08.653 7f66a6bc5700 10 log is not dirty
2018-10-03 14:55:08.653 7f66a6bc5700 10 osd.150 72642 queue_want_up_thru want 
72642 <= queued 

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Paul Emmerich
There's "ceph-bluestore-tool repair/fsck"

In your scenario, a few more log files would be interesting: try
setting debug bluefs to 20/20. And if that's not enough logging, also try
setting debug osd, debug bluestore, and debug bdev to 20/20.
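For the offline tools that would be roughly (the OSD path here is just an
example):

CEPH_ARGS="--debug-bluefs 20/20 --debug-bluestore 20/20 --debug-bdev 20/20" \
    ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-29 fsck

and for the OSD itself the same levels can go into ceph.conf:

[osd]
debug bluefs = 20/20
debug bluestore = 20/20
debug bdev = 20/20
debug osd = 20/20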



Paul
Am Mi., 3. Okt. 2018 um 13:48 Uhr schrieb Kevin Olbrich :
>
> The disks were deployed with ceph-deploy / ceph-volume using the default 
> style (lvm) and not simple-mode.
>
> The disks were provisioned as a whole, no resizing. I never touched the disks 
> after deployment.
>
> It is very strange that this first happened after the update, never met such 
> an error before.
>
> I found a BUG in the tracker, that also shows such an error with count 0. 
> That was closed with „can’t reproduce“ (don’t have the link ready). For me 
> this seems like the data itself is fine and I just hit a bad transaction in 
> the replay (which maybe caused the crash in the first place).
>
> I need one of three disks back. Object corruption would not be a problem 
> (regarding drop of a journal), as this cluster hosts backups which will fail 
> validation and regenerate. Just marking the OSD lost does not seem to be an 
> option.
>
> Is there some sort of fsck for BlueFS?
>
> Kevin
>
>
> Igor Fedotov  schrieb am Mi. 3. Okt. 2018 um 13:01:
>>
>> I've seen somewhat similar behavior in a log from Sergey Malinin in another 
>> thread ("mimic: 3/4 OSDs crashed...")
>>
>> He claimed it happened after LVM volume expansion. Isn't this the case for 
>> you?
>>
>> Am I right that you use LVM volumes?
>>
>>
>> On 10/3/2018 11:22 AM, Kevin Olbrich wrote:
>>
>> Small addition: the failing disks are in the same host.
>> This is a two-host, failure-domain OSD cluster.
>>
>>
>> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :
>>>
>>> Hi!
>>>
>>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down 
>>> (EC 8+2) together.
>>> This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two hours 
>>> before.
>>> They failed exactly at the same moment, rendering the cluster unusable 
>>> (CephFS).
>>> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no 
>>> external journal / wal / db.
>>>
>>> OSD 29 (no disk failure in dmesg):
>>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
>>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2 
>>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process 
>>> ceph-osd, pid 20899
>>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty 
>>> --pid-file
>>> 2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load: isa
>>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path 
>>> /var/lib/ceph/osd/ceph-29/block type kernel
>>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2 
>>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2 
>>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 
>>> GiB) block_size 4096 (4 KiB) rotational
>>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 
>>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > 
>>> kv_ratio 0.5
>>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 
>>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 
>>> meta 0 kv 1 data 0
>>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2 
>>> /var/lib/ceph/osd/ceph-29/block) close
>>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 
>>> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
>>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path 
>>> /var/lib/ceph/osd/ceph-29/block type kernel
>>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2 
>>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>>> 2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2 
>>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 
>>> GiB) block_size 4096 (4 KiB) rotational
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 
>>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 > 
>>> kv_ratio 0.5
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 
>>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912 
>>> meta 0 kv 1 data 0
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path 
>>> /var/lib/ceph/osd/ceph-29/block type kernel
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80 
>>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80 
>>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932 
>>> GiB) block_size 4096 (4 KiB) rotational
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1 path 
>>> /var/lib/ceph/osd/ceph-29/block size 932 GiB
>>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
>>> 2018-10-03 09:47:15.538 7fb8835ce1c0 

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
The disks were deployed with ceph-deploy / ceph-volume using the default
style (lvm) and not simple-mode.

The disks were provisioned as a whole, no resizing. I never touched the
disks after deployment.

It is very strange that this first happened right after the update; I have never
seen such an error before.

I found a bug in the tracker that also shows such an error with link count 0.
It was closed as „can’t reproduce“ (I don’t have the link ready). To me
this suggests the data itself is fine and I just hit a bad transaction in
the replay (which maybe caused the crash in the first place).

I need one of three disks back. Object corruption would not be a problem
(regarding drop of a journal), as this cluster hosts backups which will
fail validation and regenerate. Just marking the OSD lost does not seem to
be an option.

Is there some sort of fsck for BlueFS?

Kevin


Igor Fedotov  schrieb am Mi. 3. Okt. 2018 um 13:01:

> I've seen somewhat similar behavior in a log from Sergey Malinin in
> another thread ("mimic: 3/4 OSDs crashed...")
>
> He claimed it happened after LVM volume expansion. Isn't this the case for
> you?
>
> Am I right that you use LVM volumes?
>
> On 10/3/2018 11:22 AM, Kevin Olbrich wrote:
>
> Small addition: the failing disks are in the same host.
> This is a two-host, failure-domain OSD cluster.
>
>
> Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :
>
>> Hi!
>>
>> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down
>> (EC 8+2) together.
>> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
>> hours before.*
>> They failed exactly at the same moment, rendering the cluster unusable
>> (CephFS).
>> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs,
>> no external journal / wal / db.
>>
>> *OSD 29 (no disk failure in dmesg):*
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
>> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
>> ceph-osd, pid 20899
>> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty
>> --pid-file
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load:
>> isa
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
>> kv_ratio 0.5
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
>> meta 0 kv 1 data 0
>> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) close
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
>> kv_ratio 0.5
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
>> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
>> meta 0 kv 1 data 0
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
>> /var/lib/ceph/osd/ceph-29/block type kernel
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
>> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
>> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
>> GiB) block_size 4096 (4 KiB) rotational
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1
>> path /var/lib/ceph/osd/ceph-29/block size 932 GiB
>> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
>> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link
>> count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev
>> 1 allocated 320 extents
>> 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov
To fix this specific issue please apply the following PR: 
https://github.com/ceph/ceph/pull/24339
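Roughly, on top of your current checkout (the branch name is just an example;
adapt to however you build your packages):

git fetch https://github.com/ceph/ceph pull/24339/head:wip-pr-24339
git merge wip-pr-24339      # or cherry-pick its commits onto your v13.2.2 tree
# then rebuild and reinstall ceph-osd / ceph-bluestore-tool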


This wouldn't fix the original issue, but just in case please try to run 
repair again. I will need a log if the error is different from the ENOSPC in 
your latest email.



Thanks,

Igor


On 10/3/2018 1:58 PM, Sergey Malinin wrote:

Repair has gone farther but failed on something different - this time it 
appears to be related to store inconsistency rather than lack of free space. 
Emailed log to you, beware: over 2GB uncompressed.



On 3.10.2018, at 13:15, Igor Fedotov  wrote:

You may want to try new updates from the PR along with disabling flush on 
recovery for rocksdb (avoid_flush_during_recovery parameter).

Full cmd line might looks like:

CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
bin/ceph-bluestore-tool --path  repair


To be applied for "non-expanded" OSDs where repair didn't pass.

Please collect a log during repair...


Thanks,

Igor

On 10/2/2018 4:32 PM, Sergey Malinin wrote:

Repair goes through only when LVM volume has been expanded, otherwise it fails 
with enospc as well as any other operation. However, expanding the volume 
immediately renders bluefs unmountable with IO error.
2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very end 
of bluefs-log-dump), I'm not sure whether corruption occurred before or after 
volume expansion.



On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's the difference 
now except the applied patch? Different OSD? Anything else?


On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it lacked 
bluefs_rebalance_txn assignment..



On 10/2/2018 2:40 PM, Sergey Malinin wrote:

PR doesn't seem to have changed since yesterday. Am I missing something?



On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs extents list 
before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me 
as I have some issues accessing ceph-post-file uploads.


Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of them?


Would you please provide the log for both types (failed on mount and failed 
with enospc) of failing OSDs. Prior to collecting please remove existing ones 
prior and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates additional space) 
BlueFS space. Hence allowing OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred 
due to the nature of this thread (haven't 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
  

[ceph-users] Some questions concerning filestore --> bluestore migration

2018-10-03 Thread Massimo Sgaravatto
Hi

I have a ceph cluster, running luminous, composed of 5 OSD nodes, which is
using filestore.
Each OSD node has 2 E5-2620 v4 processors, 64 GB of RAM, 10x6TB SATA disk +
2x200GB SSD disk (then I have 2 other disks in RAID for the OS), 10 Gbps.
So each SSD disk is used for the journal for 5 OSDs. With this
configuration everything is running smoothly ...


We are now buying some new storage nodes, and I am trying to buy something
which is bluestore compliant. So the idea is to consider a configuration
something like:

- 10 SATA disks (8TB / 10TB / 12TB each. TBD)
- 2 processor (~ 10 core each)
- 64 GB of RAM
- 2 SSD to be used for WAL+DB
- 10 Gbps

Regarding the size of the SSD disks, I read on this mailing list
that it is suggested to have at least 10 GB of SSD per 10 TB of SATA disk.


So, the questions:

1) Does this hardware configuration seem reasonable ?

2) Are there problems with running (forever, or until filestore deprecation)
with some OSDs using filestore (the old ones) and some OSDs using bluestore
(the new ones)?

3) Would you suggest updating the old OSDs to bluestore as well, even if the
available SSDs are too small (they don't satisfy the "10 GB of SSD per 10 TB
of SATA disk" rule)?
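For reference, the way I would expect to deploy each new OSD is something like
the following (device and VG/LV names are invented, with the DB LV sized per
the rule above):

ceph-volume lvm create --bluestore --data /dev/sdc --block.db ssd0-vg/db-sdc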

Thanks, Massimo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread Paul Emmerich
I would never ever start a new cluster with Filestore nowadays. Sure,
there are a few minor issues with Bluestore like that it currently
requires some manual configuration for the cache. But overall,
Bluestore is so much better.

Your use case sounds like it might profit from the rados cache tier
feature. It's a rarely used feature because it only works in very
specific circumstances. But your scenario sounds like it might work.
Definitely worth giving it a try. Also, dm-cache with LVM *might*
help.
But if your active working set is really just 400GB: Bluestore cache
should handle this just fine. Don't worry about "unequal"
distribution, every 4mb chunk of every file will go to a random OSD.
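If you do want to give the OSDs more of that RAM, it's just a ceph.conf knob,
e.g. (number purely illustrative - budget against total RAM and OSD count;
option name as I recall it for Luminous/Mimic):

[osd]
bluestore cache size hdd = 8589934592    # 8 GiB per HDD OSD instead of the much smaller default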

One very powerful and simple optimization is moving the metadata pool
to SSD only. Even if it's just 3 small but fast SSDs; that can make a
huge difference to how fast your filesystem "feels".
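With device classes that's essentially just (rule and pool names are
placeholders for whatever you use):

ceph osd crush rule create-replicated ssd-only default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-only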


Paul



Am Mi., 3. Okt. 2018 um 11:49 Uhr schrieb John Spray :
>
> On Tue, Oct 2, 2018 at 6:28 PM  wrote:
> >
> > Hi.
> >
> > Based on some recommendations we have setup our CephFS installation using
> > bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
> > server - 100TB-ish size.
> >
> > Current setup is - a sizeable Linux host with 512GB of memory - one large
> > Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.
> >
> > Since our "hot" dataset is < 400GB we can actually serve the hot data
> > directly out of the host page-cache and never really touch the "slow"
> > underlying drives. Except when new bulk data are written where a Perc with
> > BBWC is consuming the data.
> >
> > In the CephFS + Bluestore world, Ceph is "deliberatly" bypassing the host
> > OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
> > it is really hard to create a synthetic test where they hot data does not
> > end up being read out of the underlying disks. Yes, the
> > client side page cache works very well, but in our scenario we have 30+
> > hosts pulling the same data over NFS.
>
> Are you finding that the OSDs use lots of memory but you're still
> hitting disk, or just that the OSDs aren't using up all the available
> memory?  Unlike the page cache, the OSDs will not use all the memory
> in your system by default, you have to tell them how much to use
> (http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size)
>
> John
>
> > Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
> > the recommendation to make an SSD "overlay" on the slow drives?
> >
> > Thoughts?
> >
> > Jesper
> >
> > * Bluestore should be the new and shiny future - right?
> > ** Total mem 1TB+
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Update:
I rebuilt ceph-osd with the latest PR and it started, worked for a few minutes 
and eventually failed on enospc.
After that, ceph-bluestore-tool repair started to fail on enospc again. I was 
unable to collect the ceph-osd log, so I emailed you the most recent repair log.



> On 3.10.2018, at 13:58, Sergey Malinin  wrote:
> 
> Repair has gone farther but failed on something different - this time it 
> appears to be related to store inconsistency rather than lack of free space. 
> Emailed log to you, beware: over 2GB uncompressed.
> 
> 
>> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
>> 
>> You may want to try new updates from the PR along with disabling flush on 
>> recovery for rocksdb (avoid_flush_during_recovery parameter).
>> 
>> Full cmd line might looks like:
>> 
>> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
>> bin/ceph-bluestore-tool --path  repair
>> 
>> 
>> To be applied for "non-expanded" OSDs where repair didn't pass.
>> 
>> Please collect a log during repair...
>> 
>> 
>> Thanks,
>> 
>> Igor
>> 
>> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>>> Repair goes through only when LVM volume has been expanded, otherwise it 
>>> fails with enospc as well as any other operation. However, expanding the 
>>> volume immediately renders bluefs unmountable with IO error.
>>> 2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very 
>>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>>> after volume expansion.
>>> 
>>> 
 On 2.10.2018, at 16:07, Igor Fedotov  wrote:
 
 You mentioned repair had worked before, is that correct? What's the 
 difference now except the applied patch? Different OSD? Anything else?
 
 
 On 10/2/2018 3:52 PM, Sergey Malinin wrote:
 
> It didn't work, emailed logs to you.
> 
> 
>> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
>> 
>> The major change is in get_bluefs_rebalance_txn function, it lacked 
>> bluefs_rebalance_txn assignment..
>> 
>> 
>> 
>> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>>> PR doesn't seem to have changed since yesterday. Am I missing something?
>>> 
>>> 
 On 2.10.2018, at 14:15, Igor Fedotov  wrote:
 
 Please update the patch from the PR - it didn't update bluefs extents 
 list before.
 
 Also please set debug bluestore 20 when re-running repair and collect 
 the log.
 
 If repair doesn't help - would you send repair and startup logs 
 directly to me as I have some issues accessing ceph-post-file uploads.
 
 
 Thanks,
 
 Igor
 
 
 On 10/2/2018 11:39 AM, Sergey Malinin wrote:
> Yes, I did repair all OSDs and it finished with 'repair success'. I 
> backed up OSDs so now I have more room to play.
> I posted log files using ceph-post-file with the following IDs:
> 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
> 20df7df5-f0c9-4186-aa21-4e5c0172cd93
> 
> 
>> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
>> 
>> You did repair for any of this OSDs, didn't you? For all of them?
>> 
>> 
>> Would you please provide the log for both types (failed on mount and 
>> failed with enospc) of failing OSDs. Prior to collecting please 
>> remove existing ones prior and set debug bluestore to 20.
>> 
>> 
>> 
>> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>>> I was able to apply patches to mimic, but nothing changed. One osd 
>>> that I had space expanded on fails with bluefs mount IO error, 
>>> others keep failing with enospc.
>>> 
>>> 
 On 1.10.2018, at 19:26, Igor Fedotov  wrote:
 
 So you should call repair which rebalances (i.e. allocates 
 additional space) BlueFS space. Hence allowing OSD to start.
 
 Thanks,
 
 Igor
 
 
 On 10/1/2018 7:22 PM, Igor Fedotov wrote:
> Not exactly. The rebalancing from this kv_sync_thread still might 
> be deferred due to the nature of this thread (haven't 100% sure 
> though).
> 
> Here is my PR showing the idea (still untested and perhaps 
> unfinished!!!)
> 
> https://github.com/ceph/ceph/pull/24353
> 
> 
> Igor
> 
> 
> On 10/1/2018 7:07 PM, Sergey Malinin wrote:
>> Can you please confirm whether I got this right:
>> 
>> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
>> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
>> @@ -9049,22 +9049,17 @@
>>

Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Igor Fedotov
I've seen somewhat similar behavior in a log from Sergey Malinin in 
another thread ("mimic: 3/4 OSDs crashed...")


He claimed it happened after LVM volume expansion. Isn't this the case 
for you?


Am I right that you use LVM volumes?


On 10/3/2018 11:22 AM, Kevin Olbrich wrote:

Small addition: the failing disks are in the same host.
This is a two-host, failure-domain OSD cluster.


Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich >:


Hi!

Yesterday one of our (non-priority) clusters failed when 3 OSDs
went down (EC 8+2) together.
*This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or
two hours before.*
They failed exactly at the same moment, rendering the cluster
unusable (CephFS).
We are using CentOS 7 with latest updates and ceph repo. No cache
SSDs, no external journal / wal / db.

*OSD 29 (no disk failure in dmesg):*
2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167
(ceph:ceph)
2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
ceph-osd, pid 20899
2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore
empty --pid-file
2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc
load: isa
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open path
/var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664
(0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.101 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio
1 > kv_ratio 0.5
2018-10-03 09:47:15.101 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size
536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.358 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _mount path
/var/lib/ceph/osd/ceph-29
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open path
/var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664
(0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio
1 > kv_ratio 0.5
2018-10-03 09:47:15.360 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size
536870912 meta 0 kv 1 data 0
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
/var/lib/ceph/osd/ceph-29/block) open path
/var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664
(0xe8e080, 932 GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device
bdev 1 path /var/lib/ceph/osd/ceph-29/block size 932 GiB
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with
link count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02
12:24:22.632397 bdev 1 allocated 320 extents

[1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Sergey Malinin
Repair has gone farther but failed on something different - this time it 
appears to be related to store inconsistency rather than lack of free space. 
Emailed log to you, beware: over 2GB uncompressed.


> On 3.10.2018, at 13:15, Igor Fedotov  wrote:
> 
> You may want to try new updates from the PR along with disabling flush on 
> recovery for rocksdb (avoid_flush_during_recovery parameter).
> 
> Full cmd line might looks like:
> 
> CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
> bin/ceph-bluestore-tool --path  repair
> 
> 
> To be applied for "non-expanded" OSDs where repair didn't pass.
> 
> Please collect a log during repair...
> 
> 
> Thanks,
> 
> Igor
> 
> On 10/2/2018 4:32 PM, Sergey Malinin wrote:
>> Repair goes through only when LVM volume has been expanded, otherwise it 
>> fails with enospc as well as any other operation. However, expanding the 
>> volume immediately renders bluefs unmountable with IO error.
>> 2 of 3 OSDs got bluefs log currupted (bluestore tool segfaults at the very 
>> end of bluefs-log-dump), I'm not sure whether corruption occurred before or 
>> after volume expansion.
>> 
>> 
>>> On 2.10.2018, at 16:07, Igor Fedotov  wrote:
>>> 
>>> You mentioned repair had worked before, is that correct? What's the 
>>> difference now except the applied patch? Different OSD? Anything else?
>>> 
>>> 
>>> On 10/2/2018 3:52 PM, Sergey Malinin wrote:
>>> 
 It didn't work, emailed logs to you.
 
 
> On 2.10.2018, at 14:43, Igor Fedotov  wrote:
> 
> The major change is in get_bluefs_rebalance_txn function, it lacked 
> bluefs_rebalance_txn assignment..
> 
> 
> 
> On 10/2/2018 2:40 PM, Sergey Malinin wrote:
>> PR doesn't seem to have changed since yesterday. Am I missing something?
>> 
>> 
>>> On 2.10.2018, at 14:15, Igor Fedotov  wrote:
>>> 
>>> Please update the patch from the PR - it didn't update bluefs extents 
>>> list before.
>>> 
>>> Also please set debug bluestore 20 when re-running repair and collect 
>>> the log.
>>> 
>>> If repair doesn't help - would you send repair and startup logs 
>>> directly to me as I have some issues accessing ceph-post-file uploads.
>>> 
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/2/2018 11:39 AM, Sergey Malinin wrote:
 Yes, I did repair all OSDs and it finished with 'repair success'. I 
 backed up OSDs so now I have more room to play.
 I posted log files using ceph-post-file with the following IDs:
 4af9cc4d-9c73-41c9-9c38-eb6c551047a0
 20df7df5-f0c9-4186-aa21-4e5c0172cd93
 
 
> On 2.10.2018, at 11:26, Igor Fedotov  wrote:
> 
> You did repair for any of this OSDs, didn't you? For all of them?
> 
> 
> Would you please provide the log for both types (failed on mount and 
> failed with enospc) of failing OSDs. Prior to collecting please 
> remove existing ones prior and set debug bluestore to 20.
> 
> 
> 
> On 10/2/2018 2:16 AM, Sergey Malinin wrote:
>> I was able to apply patches to mimic, but nothing changed. One osd 
>> that I had space expanded on fails with bluefs mount IO error, 
>> others keep failing with enospc.
>> 
>> 
>>> On 1.10.2018, at 19:26, Igor Fedotov  wrote:
>>> 
>>> So you should call repair which rebalances (i.e. allocates 
>>> additional space) BlueFS space. Hence allowing OSD to start.
>>> 
>>> Thanks,
>>> 
>>> Igor
>>> 
>>> 
>>> On 10/1/2018 7:22 PM, Igor Fedotov wrote:
 Not exactly. The rebalancing from this kv_sync_thread still might 
 be deferred due to the nature of this thread (haven't 100% sure 
 though).
 
 Here is my PR showing the idea (still untested and perhaps 
 unfinished!!!)
 
 https://github.com/ceph/ceph/pull/24353
 
 
 Igor
 
 
 On 10/1/2018 7:07 PM, Sergey Malinin wrote:
> Can you please confirm whether I got this right:
> 
> --- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
> +++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
> @@ -9049,22 +9049,17 @@
> throttle_bytes.put(costs);
>   PExtentVector bluefs_gift_extents;
> -  if (bluefs &&
> -  after_flush - bluefs_last_balance >
> -  cct->_conf->bluestore_bluefs_balance_interval) {
> -bluefs_last_balance = after_flush;
> -int r = _balance_bluefs_freespace(&bluefs_gift_extents);
> -assert(r >= 0);
> -if (r > 0) {
> -  for (auto& p : 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov

Alex,

upstream recommendations for DB sizing are probably good enough but, as with 
most fixed allocations, they aren't optimal for all use cases. Usually one 
either wastes space or lacks it some day in such configs.


So I think we should have the means for more freedom in volume management 
(changing sizes, migrating, coalescing and splitting).


LVM usage is a big step toward that, but it is still insufficient and 
sometimes lacks additional helpers.



To avoid the issue Sergey is experiencing, IMO it's better to have a 
standalone DB volume with some extra spare space.  Even if the physical 
media is the same, it helps to avoid the lazy rebalancing procedure 
which is the issue's root cause. But this wouldn't eliminate it totally 
- if spillover to the main device takes place one might face it again.
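
As an illustration only (device, VG and LV names below are made up, and the 
VG "ceph-db" is assumed to already exist on the faster device), a standalone 
DB volume with some headroom can be set up at OSD creation time with 
ceph-volume:

  # carve out a DB LV with spare space, then create the OSD on top of it
  lvcreate -L 64G -n db-osd0 ceph-db
  ceph-volume lvm create --bluestore --data /dev/sdb --block.db ceph-db/db-osd0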
The same improvement can probably be achieved with a single-device 
configuration through proper rebalance tuning (bluestore_bluefs_min 
and other params), but that's more complicated to debug and set up 
properly IMO.
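
A very rough sketch of such tuning (option names as of Luminous/Mimic; the 
values are arbitrary examples, not recommendations, and defaults may differ 
between releases):

  [osd]
  # keep more space reserved for BlueFS on the shared main device
  bluestore_bluefs_min = 2147483648        # e.g. 2 GiB instead of the default
  bluestore_bluefs_min_ratio = 0.04        # e.g. at least 4% of the device for BlueFS
  bluestore_bluefs_balance_interval = 1    # rebalance check interval, seconds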

Anyway, I think the issue is encountered very rarely.

Sorry, given all that I can't say whether 30 GB fits your scenario or 
not. I don't know :)


Thanks,
Igor

On 10/2/2018 5:23 PM, Alex Litvak wrote:

Igor,

Thank you for your reply.  So what you are saying is that there are really no 
sensible space requirements for a collocated device? Even if I set up 
30 GB for the DB (which I really wouldn't like to do due to space waste 
considerations) there is a chance that if this space fills up I will 
be in the same trouble under some heavy load scenario?


On 10/2/2018 9:15 AM, Igor Fedotov wrote:
Even with a single device bluestore has a sort of implicit "BlueFS 
partition" where the DB is stored.  And it dynamically adjusts 
(rebalances) the space for that partition in the background. 
Unfortunately it might perform that "too lazily" and hence under some 
heavy load it might end up lacking space for that partition while the 
main device still has plenty of free space.


I'm planning to refactor this re-balancing procedure in the future to 
eliminate the root cause.
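
In the meantime the rebalancing can be watched from the OSD admin socket, 
e.g. (the osd id is just an example, and the counter names are from the 
bluefs perf section as far as I remember - please double-check on your 
release):

  ceph daemon osd.0 perf dump bluefs | \
      egrep 'gift_bytes|reclaim_bytes|db_total_bytes|db_used_bytes|slow_'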



Thanks,

Igor


On 10/2/2018 5:04 PM, Alex Litvak wrote:
I am sorry for interrupting the thread, but my understanding always 
was that BlueStore on a single device should not care about the DB 
size, i.e. it would use the data part for all operations if the DB is 
full.  And if that is not true, what would be sensible defaults on an 
800 GB SSD?  I used ceph-ansible to build my cluster with system 
defaults, and what I'm reading in this thread doesn't give me a good 
feeling at all. Documentation on the topic is very sketchy and online 
posts sometimes contradict each other.
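
In case it is useful, the values currently in effect can be checked with 
something like this (the osd id is just an example):

  ceph daemon osd.0 config show | egrep 'bluestore_bluefs|bluestore_cache_size'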


Thank you in advance,

On 10/2/2018 8:52 AM, Igor Fedotov wrote:

May I have a repair log for that "already expanded" OSD?


On 10/2/2018 4:32 PM, Sergey Malinin wrote:
Repair goes through only when LVM volume has been expanded, 
otherwise it fails with enospc as well as any other operation. 
However, expanding the volume immediately renders bluefs 
unmountable with IO error.
2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at 
the very end of bluefs-log-dump), I'm not sure whether corruption 
occurred before or after volume expansion.




On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's 
the difference now except the applied patch? Different OSD? 
Anything else?



On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it 
lacked bluefs_rebalance_txn assignment..




On 10/2/2018 2:40 PM, Sergey Malinin wrote:
PR doesn't seem to have changed since yesterday. Am I missing 
something?




On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs 
extents list before.


Also please set debug bluestore 20 when re-running repair and 
collect the log.


If repair doesn't help - would you send repair and startup 
logs directly to me as I have some issues accessing 
ceph-post-file uploads.



Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:
Yes, I did repair all OSDs and it finished with 'repair 
success'. I backed up OSDs so now I have more room to play.

I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of 
them?



Would you please provide the log for both types (failed on 
mount and failed with enospc) of failing OSDs. Prior to 
collecting please remove existing ones prior and set debug 
bluestore to 20.




On 10/2/2018 2:16 AM, Sergey Malinin wrote:
I was able to apply patches to mimic, but nothing changed. 
One osd that I had space expanded on fails with bluefs 
mount IO error, others keep failing with enospc.



On 1.10.2018, at 19:26, Igor 

Re: [ceph-users] mimic: 3/4 OSDs crashed on "bluefs enospc"

2018-10-03 Thread Igor Fedotov
You may want to try new updates from the PR along with disabling flush 
on recovery for rocksdb (avoid_flush_during_recovery parameter).


Full cmd line might look like:

CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1" 
bin/ceph-bluestore-tool --path  repair



To be applied for "non-expanded" OSDs where repair didn't pass.

Please collect a log during repair...
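
For example (the OSD path and log location are hypothetical - adjust to your 
deployment; I believe the tool honours the usual config overrides passed via 
CEPH_ARGS):

  CEPH_ARGS="--bluestore_rocksdb_options avoid_flush_during_recovery=1 --debug_bluestore 20 --log_file /tmp/ceph-osd.0-repair.log" \
      ceph-bluestore-tool --path /var/lib/ceph/osd/ceph-0 repair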


Thanks,

Igor

On 10/2/2018 4:32 PM, Sergey Malinin wrote:

Repair goes through only when LVM volume has been expanded, otherwise it fails 
with enospc as well as any other operation. However, expanding the volume 
immediately renders bluefs unmountable with IO error.
2 of 3 OSDs got bluefs log corrupted (bluestore tool segfaults at the very end 
of bluefs-log-dump), I'm not sure whether corruption occurred before or after 
volume expansion.



On 2.10.2018, at 16:07, Igor Fedotov  wrote:

You mentioned repair had worked before, is that correct? What's the difference 
now except the applied patch? Different OSD? Anything else?


On 10/2/2018 3:52 PM, Sergey Malinin wrote:


It didn't work, emailed logs to you.



On 2.10.2018, at 14:43, Igor Fedotov  wrote:

The major change is in get_bluefs_rebalance_txn function, it lacked 
bluefs_rebalance_txn assignment..



On 10/2/2018 2:40 PM, Sergey Malinin wrote:

PR doesn't seem to have changed since yesterday. Am I missing something?



On 2.10.2018, at 14:15, Igor Fedotov  wrote:

Please update the patch from the PR - it didn't update bluefs extents list 
before.

Also please set debug bluestore 20 when re-running repair and collect the log.

If repair doesn't help - would you send repair and startup logs directly to me 
as I have some issues accessing ceph-post-file uploads.


Thanks,

Igor


On 10/2/2018 11:39 AM, Sergey Malinin wrote:

Yes, I did repair all OSDs and it finished with 'repair success'. I backed up 
OSDs so now I have more room to play.
I posted log files using ceph-post-file with the following IDs:
4af9cc4d-9c73-41c9-9c38-eb6c551047a0
20df7df5-f0c9-4186-aa21-4e5c0172cd93



On 2.10.2018, at 11:26, Igor Fedotov  wrote:

You did repair for any of this OSDs, didn't you? For all of them?


Would you please provide the log for both types (failed on mount and failed 
with enospc) of failing OSDs. Prior to collecting please remove existing ones 
prior and set debug bluestore to 20.



On 10/2/2018 2:16 AM, Sergey Malinin wrote:

I was able to apply patches to mimic, but nothing changed. One osd that I had 
space expanded on fails with bluefs mount IO error, others keep failing with 
enospc.



On 1.10.2018, at 19:26, Igor Fedotov  wrote:

So you should call repair which rebalances (i.e. allocates additional space) 
BlueFS space. Hence allowing OSD to start.

Thanks,

Igor


On 10/1/2018 7:22 PM, Igor Fedotov wrote:

Not exactly. The rebalancing from this kv_sync_thread still might be deferred 
due to the nature of this thread (not 100% sure though).

Here is my PR showing the idea (still untested and perhaps unfinished!!!)

https://github.com/ceph/ceph/pull/24353


Igor


On 10/1/2018 7:07 PM, Sergey Malinin wrote:

Can you please confirm whether I got this right:

--- BlueStore.cc.bak2018-10-01 18:54:45.096836419 +0300
+++ BlueStore.cc2018-10-01 19:01:35.937623861 +0300
@@ -9049,22 +9049,17 @@
 throttle_bytes.put(costs);
   PExtentVector bluefs_gift_extents;
-  if (bluefs &&
-  after_flush - bluefs_last_balance >
-  cct->_conf->bluestore_bluefs_balance_interval) {
-bluefs_last_balance = after_flush;
-int r = _balance_bluefs_freespace(&bluefs_gift_extents);
-assert(r >= 0);
-if (r > 0) {
-  for (auto& p : bluefs_gift_extents) {
-bluefs_extents.insert(p.offset, p.length);
-  }
-  bufferlist bl;
-  encode(bluefs_extents, bl);
-  dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
-   << bluefs_extents << std::dec << dendl;
-  synct->set(PREFIX_SUPER, "bluefs_extents", bl);
+  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
+  ceph_assert(r >= 0);
+  if (r > 0) {
+for (auto& p : bluefs_gift_extents) {
+  bluefs_extents.insert(p.offset, p.length);
   }
+bufferlist bl;
+encode(bluefs_extents, bl);
+dout(10) << __func__ << " bluefs_extents now 0x" << std::hex
+ << bluefs_extents << std::dec << dendl;
+synct->set(PREFIX_SUPER, "bluefs_extents", bl);
 }
   // cleanup sync deferred keys


On 1.10.2018, at 18:39, Igor Fedotov  wrote:

So you have just a single main device per OSD

Then bluestore-tool wouldn't help; it's unable to expand the BlueFS partition on 
the main device, only standalone devices are supported.

Given that you're able to rebuild the code, I can suggest making a patch that 
triggers BlueFS rebalance (see code snippet below) during repair.
  PExtentVector bluefs_gift_extents;
  int r = _balance_bluefs_freespace(&bluefs_gift_extents);
  ceph_assert(r >= 0);
  if (r > 0) {
for (auto& p 

Re: [ceph-users] Bluestore vs. Filestore

2018-10-03 Thread John Spray
On Tue, Oct 2, 2018 at 6:28 PM  wrote:
>
> Hi.
>
> Based on some recommendations we have setup our CephFS installation using
> bluestore*. We're trying to get a strong replacement for "huge" xfs+NFS
> server - 100TB-ish size.
>
> Current setup is - a sizeable Linux host with 512GB of memory - one large
> Dell MD1200 or MD1220 - 100TB + a Linux kernel NFS server.
>
> Since our "hot" dataset is < 400GB we can actually serve the hot data
> directly out of the host page-cache and never really touch the "slow"
> underlying drives. Except when new bulk data are written where a Perc with
> BBWC is consuming the data.
>
> In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host
> OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts
> it is really hard to create a synthetic test where the hot data does not
> end up being read out of the underlying disks. Yes, the
> client side page cache works very well, but in our scenario we have 30+
> hosts pulling the same data over NFS.

Are you finding that the OSDs use lots of memory but you're still
hitting disk, or just that the OSDs aren't using up all the available
memory?  Unlike the page cache, the OSDs will not use all the memory
in your system by default, you have to tell them how much to use
(http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size)
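
For reference, the knobs from that page look roughly like this in ceph.conf 
(the sizes are just examples, not recommendations):

  [osd]
  # BlueStore manages its own cache and does not use the kernel page cache
  bluestore_cache_size_hdd = 8589934592   # e.g. 8 GiB per HDD-backed OSD
  bluestore_cache_size_ssd = 8589934592   # e.g. 8 GiB per SSD-backed OSD
  # or one value for both device types:
  # bluestore_cache_size = 8589934592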

John

> Is bluestore just a "bad fit" .. Filestore "should" do the right thing? Is
> the recommendation to make an SSD "overlay" on the slow drives?
>
> Thoughts?
>
> Jesper
>
> * Bluestore should be the new and shiny future - right?
> ** Total mem 1TB+
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] network latency setup for osd nodes combined with vm

2018-10-03 Thread Marc Roos



It was not my first intention to host VMs on the OSD nodes of the ceph 
cluster. But since this test cluster is not doing anything, I might 
as well use some of the cores.

Currently I have configured a macvtap on the ceph client network, which is 
configured as a VLAN. The disadvantage is that the local OSDs cannot be 
reached. The advantage is (I think) that the ceph client network has the 
least latency in this setup, compared to, for instance, using a bridge.

Can anyone advise on a better implementation? (Maybe putting the ceph 
client network IP also on a macvtap and not directly on the adapter?)
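
For illustration (interface names and addresses below are made up), that 
variant would look roughly like this with plain iproute2: put the host's own 
ceph client IP on a macvlan shim on the same parent VLAN interface, so the 
local OSDs stay reachable from the guests and vice versa:

  ip link add cephshim link bond0.100 type macvlan mode bridge
  ip addr add 192.168.10.5/24 dev cephshim
  ip link set cephshim up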











___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
Small addition: the failing disks are in the same host.
This is a two-host cluster with OSD as the failure domain.


Am Mi., 3. Okt. 2018 um 10:13 Uhr schrieb Kevin Olbrich :

> Hi!
>
> Yesterday one of our (non-priority) clusters failed when 3 OSDs went down
> (EC 8+2) together.
> *This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
> hours before.*
> They failed exactly at the same moment, rendering the cluster unusable
> (CephFS).
> We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no
> external journal / wal / db.
>
> *OSD 29 (no disk failure in dmesg):*
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
> (02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
> ceph-osd, pid 20899
> 2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty
> --pid-file
> 2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load: isa
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
> kv_ratio 0.5
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
> meta 0 kv 1 data 0
> 2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) close
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
> kv_ratio 0.5
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1
> bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
> meta 0 kv 1 data 0
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
> /var/lib/ceph/osd/ceph-29/block type kernel
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
> /var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
> /var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
> GiB) block_size 4096 (4 KiB) rotational
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1
> path /var/lib/ceph/osd/ceph-29/block size 932 GiB
> 2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link
> count 0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev
> 1 allocated 320 extents
> [1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log:
> (5) Input/output error
> 2018-10-03 09:47:15.538 7fb8835ce1c0  1 stupidalloc 0x0x561250b8d030
> shutdown
> 2018-10-03 09:47:15.538 7fb8835ce1c0 -1
> bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (5)
> Input/output error
> 2018-10-03 09:47:15.538 7fb8835ce1c0  1 bdev(0x561250a20a80
> /var/lib/ceph/osd/ceph-29/block) close
> 2018-10-03 09:47:15.616 7fb8835ce1c0  1 bdev(0x561250a2
> 

[ceph-users] After 13.2.2 upgrade: bluefs mount failed to replay log: (5) Input/output error

2018-10-03 Thread Kevin Olbrich
Hi!

Yesterday one of our (non-priority) clusters failed when 3 OSDs went down
(EC 8+2) together.
*This is strange as we did an upgrade from 13.2.1 to 13.2.2 one or two
hours before.*
They failed exactly at the same moment, rendering the cluster unusable
(CephFS).
We are using CentOS 7 with latest updates and ceph repo. No cache SSDs, no
external journal / wal / db.

*OSD 29 (no disk failure in dmesg):*
2018-10-03 09:47:15.074 7fb8835ce1c0  0 set uid:gid to 167:167 (ceph:ceph)
2018-10-03 09:47:15.074 7fb8835ce1c0  0 ceph version 13.2.2
(02899bfda814146b021136e9d8e80eba494e1126) mimic (stable), process
ceph-osd, pid 20899
2018-10-03 09:47:15.074 7fb8835ce1c0  0 pidfile_write: ignore empty
--pid-file
2018-10-03 09:47:15.100 7fb8835ce1c0  0 load: jerasure load: lrc load: isa
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.100 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.101 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
kv_ratio 0.5
2018-10-03 09:47:15.101 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
meta 0 kv 1 data 0
2018-10-03 09:47:15.101 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.358 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _mount path /var/lib/ceph/osd/ceph-29
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.358 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.359 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes kv_min_ratio 1 >
kv_ratio 0.5
2018-10-03 09:47:15.360 7fb8835ce1c0  1
bluestore(/var/lib/ceph/osd/ceph-29) _set_cache_sizes cache_size 536870912
meta 0 kv 1 data 0
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev create path
/var/lib/ceph/osd/ceph-29/block type kernel
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
/var/lib/ceph/osd/ceph-29/block) open path /var/lib/ceph/osd/ceph-29/block
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bdev(0x561250a20a80
/var/lib/ceph/osd/ceph-29/block) open size 1000198897664 (0xe8e080, 932
GiB) block_size 4096 (4 KiB) rotational
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs add_block_device bdev 1 path
/var/lib/ceph/osd/ceph-29/block size 932 GiB
2018-10-03 09:47:15.360 7fb8835ce1c0  1 bluefs mount
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs _replay file with link count
0: file(ino 519 size 0x31e2f42 mtime 2018-10-02 12:24:22.632397 bdev 1
allocated 320 extents
[1:0x700820+10,1:0x700900+10,1:0x700910+10,1:0x700920+10,1:0x700930+10,1:0x700940+10,1:0x700950+10,1:0x700960+10,1:0x700970+10,1:0x700980+10,1:0x700990+10,1:0x7009a0+10,1:0x7009b0+10,1:0x7009c0+10,1:0x7009d0+10,1:0x7009e0+10,1:0x7009f0+10,1:0x700a00+10,1:0x700a10+10,1:0x700a20+10,1:0x700a30+10,1:0x700a40+10,1:0x700a50+10,1:0x700a60+10,1:0x700a70+10,1:0x700a80+10,1:0x700a90+10,1:0x700aa0+10,1:0x700ab0+10,1:0x700ac0+10,1:0x700ad0+10,1:0x700ae0+10,1:0x700af0+10,1:0x700b00+10,1:0x700b10+10,1:0x700b20+10,1:0x700b30+10,1:0x700b40+10,1:0x700b50+10,1:0x700b60+10,1:0x700b70+10,1:0x700b80+10,1:0x700b90+10,1:0x700ba0+10,1:0x700bb0+10,1:0x700bc0+10,1:0x700bd0+10,1:0x700be0+10,1:0x700bf0+10,1:0x700c00+10])
2018-10-03 09:47:15.538 7fb8835ce1c0 -1 bluefs mount failed to replay log:
(5) Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0  1 stupidalloc 0x0x561250b8d030
shutdown
2018-10-03 09:47:15.538 7fb8835ce1c0 -1
bluestore(/var/lib/ceph/osd/ceph-29) _open_db failed bluefs mount: (5)
Input/output error
2018-10-03 09:47:15.538 7fb8835ce1c0  1 bdev(0x561250a20a80
/var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.616 7fb8835ce1c0  1 bdev(0x561250a2
/var/lib/ceph/osd/ceph-29/block) close
2018-10-03 09:47:15.870 7fb8835ce1c0 -1 osd.29 0 OSD:init: unable to mount
object store
2018-10-03 09:47:15.870 7fb8835ce1c0 -1  ** ERROR: osd init failed: (5)
Input/output error

*OSD 42:*
the disk is found by LVM and the tmpfs is created, but the service immediately dies on
start without logging anything...
This might be 

[ceph-users] Unfound object on erasure when recovering

2018-10-03 Thread Jan Pekař - Imatic

Hi all,

I'm playing with my testing cluster with ceph 12.2.8 installed.

It has happened to me for the second time that I have 1 unfound object on an 
erasure coded pool.

I have an erasure-coded pool with a 3+1 configuration.

The first time, I was adding an additional disk. During the cluster rebalance I noticed one unfound object. I hoped that it would be fixed after the 
rebalance finished, but it was not.


I coped with it by marking the object as lost, because disk IO on that object got stuck.

Yesterday I was trying to remove one disk, so I marked it out.

After a few hours I again noticed one unfound object. This is the dump of pg 
list_missing.

There is a strange pool number and snapid (I'm not using snapshots on that pool, it is just a pool for CephFS data), and the locations array also looks 
strange.


I decided to put the disk I wanted to remove back "in" and the unfound object 
disappeared.

Can you give me additional information on this problem? Should I debug it more?

Thank you

{
    "offset": {
    "oid": "",
    "key": "",
    "snapid": 0,
    "hash": 0,
    "max": 0,
    "pool": -9223372036854775808,
    "namespace": ""
    },
    "num_missing": 0,
    "num_unfound": 1,
    "objects": [
    {
    "oid": {
    "oid": "11eec49.",
    "key": "",
    "snapid": -2,
    "hash": 586898362,
    "max": 0,
    "pool": 10,
    "namespace": ""
    },
    "need": "13528'6795",
    "have": "0'0",
    "flags": "none",
    "locations": [
    "7(3)"
    ]
    }
    ],
    "more": false
}
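
For reference, these are roughly the commands involved (the pg id below is 
just an example); as far as I understand the last one is the "give up" option 
I used the first time and would like to avoid now:

  ceph health detail                        # shows which pg has unfound objects
  ceph pg 10.2a list_missing                # the dump above
  ceph pg 10.2a query                       # peering / might_have_unfound info
  # last resort:
  # ceph pg 10.2a mark_unfound_lost revert  # or 'delete'; EC pools support delete only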


--

Ing. Jan Pekař
jan.pe...@imatic.cz | +420603811737

Imatic | Jagellonská 14 | Praha 3 | 130 00
http://www.imatic.cz

--

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com