[ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-17 Thread Troy Ablan
I was on 12.2.5 for a couple weeks and started randomly seeing
corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
loose.  I panicked and moved to Mimic, and when that didn't solve the
problem, only then did I start to root around in the mailing list archives.

It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
out, but I'm unsure how to proceed now that the damaged cluster is
running under Mimic.  Is there anything I can do to get the cluster back
online and objects readable?

Everything is BlueStore and most of it is EC.

Thanks.

-Troy


Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-18 Thread Troy Ablan



On 07/17/2018 11:14 PM, Brad Hubbard wrote:

On Wed, Jul 18, 2018 at 2:57 AM, Troy Ablan  wrote:

I was on 12.2.5 for a couple weeks and started randomly seeing
corruption, moved to 12.2.6 via yum update on Sunday, and all hell broke
loose.  I panicked and moved to Mimic, and when that didn't solve the
problem, only then did I start to root around in the mailing list archives.

It appears I can't downgrade OSDs back to Luminous now that 12.2.7 is
out, but I'm unsure how to proceed now that the damaged cluster is
running under Mimic.  Is there anything I can do to get the cluster back
online and objects readable?

That depends on what the specific problem is. Can you provide some
data that fills in the blanks around "randomly seeing corruption"?

Thanks for the reply, Brad.  I have a feeling that almost all of this 
stems from the time the cluster spent running 12.2.6.  When booting VMs 
that use rbd as a backing store, they typically get I/O errors during 
boot and cannot read critical parts of the image.  I also get similar 
errors if I try to rbd export most of the images. Also, CephFS is not 
started as ceph -s indicates damage.
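
For context, this is roughly the kind of export I'm attempting (pool, image, and snapshot names below are made up for illustration):

  # list snapshots of an image, then export a known-good snapshot to a file
  rbd snap ls vms/instance-0001
  rbd export vms/instance-0001@before-upgrade /mnt/backup/instance-0001.raw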


Many of the OSDs have been crashing and restarting as I've tried to rbd 
export good versions of images (from older snapshots).  Here's one 
particular crash:


2018-07-18 15:52:15.809 7fcbaab77700 -1 /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: In function 'void BlueStore::SharedBlobSet::remove_last(BlueStore::SharedBlob*)' thread 7fcbaab77700 time 2018-07-18 15:52:15.750916
/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.0/rpm/el7/BUILD/ceph-13.2.0/src/os/bluestore/BlueStore.h: 455: FAILED assert(sb->nref == 0)

 ceph version 13.2.0 (79a10589f1f80dfe21e8f9794365ed98143071c4) mimic (stable)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0xff) [0x7fcbc197a53f]
 2: (()+0x286727) [0x7fcbc197a727]
 3: (BlueStore::SharedBlob::put()+0x1da) [0x5641f39181ca]
 4: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x2d) [0x5641f3977cfd]
 5: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 6: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 7: (std::_Rb_tree<boost::intrusive_ptr<BlueStore::SharedBlob>, boost::intrusive_ptr<BlueStore::SharedBlob>, std::_Identity<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::less<boost::intrusive_ptr<BlueStore::SharedBlob> >, std::allocator<boost::intrusive_ptr<BlueStore::SharedBlob> > >::_M_erase(std::_Rb_tree_node<boost::intrusive_ptr<BlueStore::SharedBlob> >*)+0x1b) [0x5641f3977ceb]
 8: (BlueStore::TransContext::~TransContext()+0xf7) [0x5641f3979297]
 9: (BlueStore::_txc_finish(BlueStore::TransContext*)+0x610) [0x5641f391c9b0]
 10: (BlueStore::_txc_state_proc(BlueStore::TransContext*)+0x9a) [0x5641f392a38a]
 11: (BlueStore::_kv_finalize_thread()+0x41e) [0x5641f392b3be]
 12: (BlueStore::KVFinalizeThread::entry()+0xd) [0x5641f397d85d]
 13: (()+0x7e25) [0x7fcbbe4d2e25]
 14: (clone()+0x6d) [0x7fcbbd5c3bad]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.



Here's the output of ceph -s, which might fill in some configuration 
questions.  Since OSDs continually restart whenever I put load on the 
cluster, it is churning a bit.  That's why I set nodown for now.


  cluster:
    id: b2873c9a-5539-4c76-ac4a-a6c9829bfed2
    health: HEALTH_ERR
    1 filesystem is degraded
    1 filesystem is offline
    1 mds daemon damaged
    nodown,noscrub,nodeep-scrub flag(s) set
    9 scrub errors
    Reduced data availability: 61 pgs inactive, 56 pgs peering, 
4 pgs stale

    Possible data damage: 3 pgs inconsistent
    16 slow requests are blocked > 32 sec
    26 stuck requests are blocked > 4096 sec

  services:
    mon: 5 daemons, quorum a,b,c,d,e
    mgr: a(active), standbys: b, d, e, c
    mds: lcs-0/1/1 up , 2 up:standby, 1 damaged
    osd: 34 osds: 34 up, 34 in
 flags nodown,noscrub,nodeep-scrub

  data:
    pools:   15 pools, 640 pgs
    objects: 9.73 M objects, 13 TiB
    usage:   24 TiB used, 55 TiB / 79 TiB avail
    pgs: 23.438% pgs not active
 487 active+clean
 73  peering
 70  activating
 5   stale+peering
 3   active+clean+inconsistent
 2   stale+activating

  io:
    client:   1.3 KiB/s wr, 0 op/s rd, 0 op/s wr
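
For completeness, the flags shown above were set with the usual commands, roughly like this (a generic sketch, not a transcript of my exact session):

  ceph osd set nodown
  ceph osd set noscrub
  ceph osd set nodeep-scrub

  # to clear them again once things settle:
  ceph osd unset nodown
  ceph osd unset noscrub
  ceph osd unset nodeep-scrub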


If there's any other information I can provide that can help point to 
the problem, I'd be glad to share.


Thanks,

-Troy

Re: [ceph-users] RAID question for Ceph

2018-07-18 Thread Troy Ablan



On 07/18/2018 07:44 PM, Satish Patel wrote:
> If I have 8 OSD drives in a server with a P410i RAID controller (HP) and I
> want to make this server an OSD node, how should I configure the RAID?
> 
> 1. Put all drives in one RAID-0?
> 2. Put each HDD in its own RAID-0, creating 8 individual RAID-0s so the OS
> can see 8 separate HDD drives?
> 
> What are most people doing in production for Ceph (BlueStore)?

In my experience, using a RAID card is not ideal for storage systems
like Ceph.  Redundancy comes from replicating data across multiple
hosts, so there's no need for this functionality in a disk controller.
Even worse, the P410i doesn't appear to support a pass-thru (JBOD/HBA)
mode, so your only sane option for using this card is to create RAID-0s.
Whenever you need to replace a bad drive, you will need to go through
the extra step of creating a RAID-0 on the new drive.

In a production environment, I would recommend an HBA that exposes all
of the drives directly to the OS. It makes management and monitoring a
lot easier.
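
To illustrate the point about redundancy living at the Ceph layer rather than in the controller: replication count and failure domain are set per pool via CRUSH, along these lines (pool and rule names below are only examples):

  # host is the failure domain, so losing a disk or a host never depends
  # on the RAID controller
  ceph osd crush rule create-replicated replicated-by-host default host
  ceph osd pool set mypool crush_rule replicated-by-host
  ceph osd pool set mypool size 3
  ceph osd pool set mypool min_size 2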

-Troy


Re: [ceph-users] Recovery from 12.2.5 (corruption) -> 12.2.6 (hair on fire) -> 13.2.0 (some objects inaccessible and CephFS damaged)

2018-07-19 Thread Troy Ablan



>>
>> I'm on IRC (as MooingLemur) if more real-time communication would help :)
> 
> Sure, I'll try to contact you there. In the meantime could you open up
> a tracker showing the crash stack trace above and a brief description
> of the current situation and the events leading up to it? Could you
> also get a debug log of one of these crashes with "debug bluestore =
> 20" and, ideally, a coredump?
> 

https://tracker.ceph.com/issues/25001

As mentioned in the bug, I was mistaken when I mentioned here that these
were SSDs.  They're SATA, so the crashing ones aren't hosting a cache.
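
For anyone following along, this is roughly how I plan to gather the debug log and core dump Brad asked for (a sketch; the OSD id is just an example, and the debug setting can equally go in ceph.conf under [osd] as "debug bluestore = 20/20"):

  # bump bluestore debugging on the crashing OSD
  ceph tell osd.12 injectargs '--debug_bluestore 20/20'

  # allow core dumps, then run the OSD in the foreground to catch the crash
  ulimit -c unlimited
  ceph-osd -f --cluster ceph --id 12 --setuser ceph --setgroup ceph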

Thanks!

-Troy


[ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-13 Thread Troy Ablan

I've opened a tracker issue at https://tracker.ceph.com/issues/41240

Background: a cluster of 13 hosts, 5 of which contain 14 SSD OSDs between 
them.  There are 409 HDD OSDs in the cluster as well.


The SSDs contain the RGW index and log pools, plus some smaller pools.
The HDDs contain all other pools, including the RGW data pool.

The RGW instance contains just over 1 billion objects across about 65k 
buckets.  I don't know of any action on the cluster that would have 
caused this.  There have been no changes to the crush map in months; 
HDDs were added a couple of weeks ago, and backfilling is still in 
progress but in the home stretch.


I don't know what I can do at this point, though something points to the 
osdmap on these being wrong and/or corrupted?  Log excerpt from crash 
included below.  All of the OSD logs I checked look very similar.





2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: [/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3362] Recovered from manifest file:db/MANIFEST-245361 succeeded,manifest_file_number is 245361, next_file_number is 245364, last_sequence is 6066685646, log_number is 0,prev_log_number is 0,max_column_family is 0,deleted_log_number is 245359


2019-08-13 18:09:52.913 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:3370] Column family 
[default] (ID 0), log number is 245360


2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1565719792920682, "job": 1, "event": "recovery_started", 
"log_files": [245362]}
2019-08-13 18:09:52.918 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:551] Recovering 
log #245362 mode 0
2019-08-13 18:09:52.919 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/version_set.cc:2863] Creating 
manifest 245364


2019-08-13 18:09:52.933 7f76484e9d80  4 rocksdb: EVENT_LOG_v1 
{"time_micros": 1565719792935329, "job": 1, "event": "recovery_finished"}
2019-08-13 18:09:52.951 7f76484e9d80  4 rocksdb: 
[/home/jenkins-build/build/workspace/ceph-build/ARCH/x86_64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/centos7/MACHINE_SIZE/huge/release/13.2.6/rpm
/el7/BUILD/ceph-13.2.6/src/rocksdb/db/db_impl_open.cc:1218] DB pointer 
0x56445a6c8000
2019-08-13 18:09:52.951 7f76484e9d80  1 bluestore(/var/lib/ceph/osd/ceph-46) _open_db opened rocksdb path db options compression=kNoCompression,max_write_buffer_number=4,min_write_buffer_number_to_merge=1,recycle_log_file_num=4,write_buffer_size=268435456,writable_file_max_buffer_size=0,compaction_readahead_size=2097152
2019-08-13 18:09:52.964 7f76484e9d80  1 freelist init
2019-08-13 18:09:52.976 7f76484e9d80  1 
bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc opening allocation metadata
2019-08-13 18:09:53.119 7f76484e9d80  1 
bluestore(/var/lib/ceph/osd/ceph-46) _open_alloc loaded 926 GiB in 13292 
extents

2019-08-13 18:09:53.133 7f76484e9d80 -1 *** Caught signal (Aborted) **
 in thread 7f76484e9d80 thread_name:ceph-osd

 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f763c4455d0]
 2: (gsignal()+0x37) [0x7f763b466207]
 3: (abort()+0x148) [0x7f763b4678f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f763bd757d5]
 5: (()+0x5e746) [0x7f763bd73746]
 6: (()+0x5e773) [0x7f763bd73773]
 7: (__cxa_rethrow()+0x49) [0x7f763bd739e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f763fcb48d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f763fa924ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f763fa94db1]
 11: (OSDService::try_get_map(unsigned int)+0x4f8) [0x5644576e1e08]
 12: (OSDService::get_map(unsigned int)+0x1e) [0x564457743dae]
 13: (OSD::init()+0x1d32) [0x5644576ef982]
 14: (main()+0x23a3) [0x5644575cc7a3]
 15: (__libc_start_main()+0xf5) [0x7f763b4523d5]
 16: (()+0x385900) [0x5644576a4900]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.



Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-18 Thread Troy Ablan



On 8/18/19 6:43 PM, Brad Hubbard wrote:

That's this code.

3114   switch (alg) {
3115   case CRUSH_BUCKET_UNIFORM:
3116 size = sizeof(crush_bucket_uniform);
3117 break;
3118   case CRUSH_BUCKET_LIST:
3119 size = sizeof(crush_bucket_list);
3120 break;
3121   case CRUSH_BUCKET_TREE:
3122 size = sizeof(crush_bucket_tree);
3123 break;
3124   case CRUSH_BUCKET_STRAW:
3125 size = sizeof(crush_bucket_straw);
3126 break;
3127   case CRUSH_BUCKET_STRAW2:
3128 size = sizeof(crush_bucket_straw2);
3129 break;
3130   default:
3131 {
3132   char str[128];
3133   snprintf(str, sizeof(str), "unsupported bucket algorithm:
%d", alg);
3134   throw buffer::malformed_input(str);
3135 }
3136   }

CRUSH_BUCKET_UNIFORM = 1
CRUSH_BUCKET_LIST = 2
CRUSH_BUCKET_TREE = 3
CRUSH_BUCKET_STRAW = 4
CRUSH_BUCKET_STRAW2 = 5

So valid values for bucket algorithms are 1 through 5 but, for
whatever reason, at least one of yours is being interpreted as "-1".

This doesn't seem like something that would just happen spontaneously
with no changes to the cluster.

What recent changes have you made to the osdmap? What recent changes
have you made to the crushmap? Have you recently upgraded?



Brad,

There were no recent changes to the cluster/osd config to my knowledge. 
The only person who would make any such changes should have been me.  A 
few weeks ago, we added 90 new HDD OSDs all at once and the cluster was 
still backfilling onto those, but none of the pools on the now-affected 
OSDs were involved in that.
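
To double-check the bucket algorithms Brad describes, at least in the live crush map (as opposed to whatever the affected OSDs have on disk), something like this should work -- a sketch, file names arbitrary:

  # pull and decompile the live crush map, then inspect the bucket algorithms
  ceph osd getcrushmap -o crushmap.bin
  crushtool -d crushmap.bin -o crushmap.txt
  grep -n 'alg ' crushmap.txt    # expect only uniform/list/tree/straw/straw2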


It seems that all of the SSDs are likely to be in this same state, but I 
haven't checked every single one.


I sent a complete image of one of the 1TB OSDs (compressed to about 
41GB) via ceph-post-file.  I put the id in the tracker issue I opened 
for this, https://tracker.ceph.com/issues/41240


I don't know if you or any other devs could use that for further 
insight, but I'm hopeful.


Thanks,

-Troy


Re: [ceph-users] RESOLVED: Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-19 Thread Troy Ablan
While I'm still unsure how this happened, this is what was done to 
solve it.


Started the OSD in the foreground with debug 10 and watched for the most 
recent osdmap epoch mentioned before the abort().  For example, if it 
mentioned that it had just tried to load epoch 80896 and then crashed:


# ceph osd getmap -o osdmap.80896 80896
# ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-77/ --file osdmap.80896


Then I restarted the OSD in the foreground with debug, and repeated for 
the next osdmap epoch until it got past the first few seconds.  This 
process worked for all but two OSDs.  For the ones that succeeded, I'd 
^C and then start the OSD via systemd.
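
In loop form, the procedure looked roughly like this (a sketch from memory; the OSD id and epoch range are examples, not the exact values I used, and the OSD must be stopped while injecting):

  # fetch each suspect epoch from the mons and inject it into the OSD's store
  for epoch in $(seq 80896 80910); do
      ceph osd getmap -o osdmap.$epoch $epoch
      ceph-objectstore-tool --op set-osdmap \
          --data-path /var/lib/ceph/osd/ceph-77/ --file osdmap.$epoch
  done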


For the remaining two, the OSD would try to load the incremental map and 
then crash.  I had the presence of mind to make dd images of every OSD 
before starting this process, so I reverted these two to the state from 
before injecting the osdmaps.


I then injected the last 15 or so epochs of the osdmap in sequential 
order before starting the osd, with success.


This leads me to believe that the step-wise injection didn't work 
because the osd had more subtle corruption that it got past, but it was 
confused when it requested the next incremental delta.


Thanks again to Brad/badone for the guidance!

Tracker issue updated.

Here's the closing IRC dialogue re this issue (UTC-0700)

2019-08-19 16:27:42 < MooingLemur> badone: I appreciate you reaching out 
yesterday, you've helped a ton, twice now :)  I'm still concerned 
because I don't know how this happened.  I'll feel better once 
everything's active+clean, but it's all at least active.


2019-08-19 16:30:28 < badone> MooingLemur: I had a quick discussion with 
Josh earlier and he shares my opinion this is likely somehow related to 
these drives or perhaps controllers, or at least specific to these machines


2019-08-19 16:31:04 < badone> however, there is a possibility you are 
seeing some extremely rare race that no one up to this point has seen before


2019-08-19 16:31:20 < badone> that is less likely though

2019-08-19 16:32:50 < badone> the osd read the osdmap over the wire 
successfully but wrote it out to disk in a format that it could not then 
read back in (unlikely) or...


2019-08-19 16:33:21 < badone> the map "changed" after it had been 
written to disk


2019-08-19 16:33:46 < badone> the second is considered most likely by us 
but I recognise you may not share that opinion



Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-19 Thread Troy Ablan
Yes, it's possible that they do, but since all of the affected OSDs are 
still down and the monitors have been restarted since, all of those 
pools have pgs that are in unknown state and don't return anything in 
ceph pg ls.


There weren't that many placement groups for the SSDs, but also I don't 
know that there were that many objects.  There were of course a ton of 
omap key/values.
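
Once the PGs report in again, the ratio Brett asks about could be estimated roughly like this (a sketch; the pool name is only an example of where the RGW index might live):

  # object counts per pool, and the pool's pg_num, give objects per PG;
  # per-PG object counts can also be listed directly
  ceph df detail
  ceph osd pool get default.rgw.buckets.index pg_num
  ceph pg ls-by-pool default.rgw.buckets.index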


-Troy

On 8/18/19 10:57 PM, Brett Chancellor wrote:
This sounds familiar. Do any of these pools on the SSD have fairly dense 
placement group to object ratios? Like more than 500k objects per pg? 
(ceph pg ls)





Re: [ceph-users] Sudden loss of all SSD OSDs in a cluster, immediate abort on restart [Mimic 13.2.6]

2019-08-14 Thread Troy Ablan

Paul,

Thanks for the reply.  All of these seemed to fail except for pulling 
the osdmap from the live cluster.


-Troy

-[~:#]- ceph-objectstore-tool --op get-osdmap --data-path 
/var/lib/ceph/osd/ceph-45/ --file osdmap45
terminate called after throwing an instance of 
'ceph::buffer::malformed_input'

  what():  buffer::malformed_input: unsupported bucket algorithm: -1
*** Caught signal (Aborted) **
 in thread 7f945ee04f00 thread_name:ceph-objectstor
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f94531935d0]
 2: (gsignal()+0x37) [0x7f9451d80207]
 3: (abort()+0x148) [0x7f9451d818f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f945268f7d5]
 5: (()+0x5e746) [0x7f945268d746]
 6: (()+0x5e773) [0x7f945268d773]
 7: (__cxa_rethrow()+0x49) [0x7f945268d9e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f94553218d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f94550ff4ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9455101db1]
 11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, 
ceph::buffer::list&)+0x1d0) [0x55de1f9a6e60]

 12: (main()+0x5340) [0x55de1f8c8870]
 13: (__libc_start_main()+0xf5) [0x7f9451d6c3d5]
 14: (()+0x3adc10) [0x55de1f9a1c10]
Aborted

-[~:#]- ceph-objectstore-tool --op get-osdmap --data-path 
/var/lib/ceph/osd/ceph-46/ --file osdmap46
terminate called after throwing an instance of 
'ceph::buffer::malformed_input'

  what():  buffer::malformed_input: unsupported bucket algorithm: -1
*** Caught signal (Aborted) **
 in thread 7f9ce4135f00 thread_name:ceph-objectstor
 ceph version 13.2.6 (7b695f835b03642f85998b2ae7b6dd093d9fbce4) mimic 
(stable)

 1: (()+0xf5d0) [0x7f9cd84c45d0]
 2: (gsignal()+0x37) [0x7f9cd70b1207]
 3: (abort()+0x148) [0x7f9cd70b28f8]
 4: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f9cd79c07d5]
 5: (()+0x5e746) [0x7f9cd79be746]
 6: (()+0x5e773) [0x7f9cd79be773]
 7: (__cxa_rethrow()+0x49) [0x7f9cd79be9e9]
 8: (CrushWrapper::decode(ceph::buffer::list::iterator&)+0x18b8) 
[0x7f9cda6528d8]

 9: (OSDMap::decode(ceph::buffer::list::iterator&)+0x4ad) [0x7f9cda4304ad]
 10: (OSDMap::decode(ceph::buffer::list&)+0x31) [0x7f9cda432db1]
 11: (get_osdmap(ObjectStore*, unsigned int, OSDMap&, 
ceph::buffer::list&)+0x1d0) [0x55cea26c8e60]

 12: (main()+0x5340) [0x55cea25ea870]
 13: (__libc_start_main()+0xf5) [0x7f9cd709d3d5]
 14: (()+0x3adc10) [0x55cea26c3c10]
Aborted

-[~:#]- ceph osd getmap -o osdmap
got osdmap epoch 81298

-[~:#]- ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-46/ --file osdmap

osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.

-[~:#]- ceph-objectstore-tool --op set-osdmap --data-path 
/var/lib/ceph/osd/ceph-45/ --file osdmap

osdmap (#-1:92f679f2:::osdmap.81298:0#) does not exist.



On 8/14/19 2:54 AM, Paul Emmerich wrote:

Starting point to debug/fix this would be to extract the osdmap from
one of the dead OSDs:

ceph-objectstore-tool --op get-osdmap --data-path /var/lib/ceph/osd/...

Then try to run osdmaptool on that osdmap to see if it also crashes,
set some --debug options (don't know which one off the top of my
head).
Does it also crash? How does it differ from the map retrieved with
"ceph osd getmap"?

You can also set the osdmap with "--op set-osdmap", does it help to
set the osdmap retrieved by "ceph osd getmap"?

Paul
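
For reference, the comparison Paul suggests would look something like this had the --op get-osdmap extraction above succeeded (a sketch; file names follow the commands above):

  # inspect the map pulled from a dead OSD and diff it against the live one
  osdmaptool osdmap45 --print > osdmap45.txt
  osdmaptool osdmap --print > osdmap.live.txt
  diff -u osdmap45.txt osdmap.live.txt

  # the crush portion can also be pulled out and decompiled separately
  osdmaptool osdmap45 --export-crush crush45.bin
  crushtool -d crush45.bin -o crush45.txt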




[ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan

Hi folks,

Mimic cluster here, RGW with only the default zone.  I have a 
persistent error here:


LARGE_OMAP_OBJECTS 1 large omap objects
    1 large objects found in pool 'default.rgw.log'
    Search the cluster log for 'Large omap object found' for more details.


I think I've narrowed it down to this namespace:

-[~:#]- rados -p default.rgw.log --namespace usage ls
usage.17
usage.19
usage.24

-[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.17 | wc -l
21284

-[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.19 | wc -l
1355968

-[~:#]- rados -p default.rgw.log --namespace usage listomapkeys usage.24 | wc -l
0


Now, this last one takes a long time to return -- minutes, even with a 0 
response, and listomapvals indeed returns a very large amount of data. 
I'm running `wc -c` on the listomapvals output, but it hasn't returned 
at the time of writing this message.  Is there anything I can do about this?
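
The same check over every object in the namespace, as a loop (just a convenience form of the commands above):

  # count omap keys for each object in the usage namespace
  for obj in $(rados -p default.rgw.log --namespace usage ls); do
      printf '%s: ' "$obj"
      rados -p default.rgw.log --namespace usage listomapkeys "$obj" | wc -l
  done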


Whenever PGs in the default.rgw.log pool are recovering or backfilling, 
my RGW cluster appears to block writes for almost two hours, and I think 
it points to this object, or at least this pool.


I've been having trouble finding any documentation about how this log 
pool is used by RGW.  I have a feeling updating this object happens on 
every write to the cluster.  How would I remove this bottleneck?  Can I?


Thanks

-Troy


Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan
Yep, that's on me.  I did enable it in the config originally, thinking 
at the time that it might be useful, but I wasn't aware of the sharding 
caveat, given that most of our traffic comes from one RGW user.


I think I know what I need to do to fix it now though.
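
Concretely, the fix I have in mind is along these lines (a sketch only -- the dates and uid are examples, trimming is destructive, and exact flags may vary by release):

  # see what's in the usage log first
  radosgw-admin usage show --show-log-entries=false

  # trim old entries, either globally or for the one busy user
  radosgw-admin usage trim --start-date=2018-01-01 --end-date=2019-10-01
  radosgw-admin usage trim --uid=busy-user --start-date=2018-01-01 --end-date=2019-10-01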

Thanks again!

-Troy

On 10/14/19 3:23 PM, Paul Emmerich wrote:

Yeah, the number of shards is configurable ("rgw usage num shards"? or
something).

Are you sure you aren't using it? This feature is not enabled by
default, someone had to explicitly set "rgw enable usage log" for you
to run into this problem.


Paul




Re: [ceph-users] default.rgw.log contains large omap object

2019-10-14 Thread Troy Ablan

Paul,

Apparently never.  It appears to (potentially) contain every request 
since the beginning of time (late last year, in my case).  In our use 
case we don't really need this data (not multi-tenant), so I might 
simply clear it.


But in the case of an extremely high-transaction cluster where we did 
care about keeping historical data longer, could this be spread or 
sharded across more than just this small handful of objects?
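
For what it's worth, my understanding (unverified, so please treat the option names and defaults below as assumptions to check against the docs for your release) is that the usage log can be spread over more shards, and more per-user shards, via ceph.conf:

  # ceph.conf, rgw section -- assumed option names:
  #   rgw enable usage log = true
  #   rgw usage max shards = 32        # total shards for the usage log
  #   rgw usage max user shards = 1    # per-user shards; raise for one busy user
  # then restart the gateways, e.g.:
  systemctl restart ceph-radosgw.target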


Thanks for the insight.

-Troy


On 10/14/19 3:01 PM, Paul Emmerich wrote:

Looks like the usage log (radosgw-admin usage show), how often do you trim it?


