Thank you very much!

Is this problem then related to the weird sizes I see:
      pgmap v55220: 1216 pgs, 3 pools, 3406 GB data, 852 kobjects
            418 GB used, 88130 GB / 88549 GB avail

A calculation with df indeed shows about 400 GB used on the disks, but the tests I ran should have generated about 3.5 TB, as also seen in rados df:

pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
cache           -              59150443        15466            0            0            0      1388365   5686734850      3665984   4709621763
ecdata          -            3512807425       857620            0            0            0      1109938    312332288       857621   3512807426
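(For what it's worth: with the k=8, m=3 profile used here, 3406 GB of data should occupy roughly 3406 × 11/8 ≈ 4680 GB of raw space, so the 418 GB used looks far too low.)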

I thought it was related to the inconsistency?
Or could this be a sparse objects thing? (But I can't seem to find anything in the docs about that.)

Thanks again!

Kenneth



----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
   Date: Sun, 7 Sep 2014 20:34:39 +0800
   From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: ceph cluster inconsistency keyvaluestore
     To: Kenneth Waegeman <kenneth.waege...@ugent.be>
     Cc: ceph-users@lists.ceph.com


I have found the root cause. It's a bug.

When a chunky scrub happens, it iterates over the whole PG's object set, and each iteration scans only a few objects.

osd/PG.cc:3758
            ret = get_pgbackend()->objects_list_partial(
              start,
              cct->_conf->osd_scrub_chunk_min,
              cct->_conf->osd_scrub_chunk_max,
              0,
              &objects,
              &candidate_end);

candidate_end is the end of the object set and is used to indicate the start position for the next scrub iteration. But it gets truncated:

osd/PG.cc:3777
            while (!boundary_found && objects.size() > 1) {
              hobject_t end = objects.back().get_boundary();
              objects.pop_back();

              if (objects.back().get_filestore_key() != end.get_filestore_key()) {
                candidate_end = end;
                boundary_found = true;
              }
            }
end, which as an hobject_t only contains the "hash" field, will be assigned to candidate_end. So in the next scrub iteration, an hobject_t containing only the "hash" field will be passed in to get_pgbackend()->objects_list_partial.

This causes incorrect results for the KeyValueStore backend, because it uses strict key ordering for its collection_list_partial method. An hobject_t that only contains the "hash" field is encoded as:

1%e79s0_head!972F1B5D!!none!!!00000000000000000000!0!0

and the actual object is
1%e79s0_head!972F1B5D!!1!!!object-name!head

In other words, an object that only contains the "hash" field can't be used to look up an actual object with the same "hash" field.
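To make the ordering problem concrete, here is a minimal standalone sketch (my own illustration, not Ceph code; the two key strings are the ones above):

#include <cassert>
#include <iostream>
#include <string>

int main() {
    // Key encoded from the hash-only hobject_t that scrub passes in:
    std::string start = "1%e79s0_head!972F1B5D!!none!!!00000000000000000000!0!0";
    // Key of the actual object with the same hash:
    std::string actual = "1%e79s0_head!972F1B5D!!1!!!object-name!head";

    // Byte-wise, the 'n' of "none" sorts after '1', so the hash-only key
    // lands *after* the real object in the strictly ordered keyspace,
    // and an iterator seeded with it never sees the object.
    assert(start > actual);
    std::cout << "hash-only key sorts after the real object's key\n";
    return 0;
}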

@sage The simple way is to modify the obj->key function, which will change the storage format. Because it's an experimental backend, I would like to provide an external format-conversion program to help users migrate. Is that OK?


On Wed, Sep 3, 2014 at 9:16 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:
I can also reproduce it on a new, slightly different setup (also EC on KV plus cache) by running ceph pg scrub on a KV pg: that pg then gets the 'inconsistent' status.



----- Message from Kenneth Waegeman <kenneth.waege...@ugent.be> ---------
   Date: Mon, 01 Sep 2014 16:28:31 +0200
   From: Kenneth Waegeman <kenneth.waege...@ugent.be>
Subject: Re: ceph cluster inconsistency keyvaluestore
     To: Haomai Wang <haomaiw...@gmail.com>
     Cc: ceph-users@lists.ceph.com



Hi,


The cluster got installed with quattor, which uses ceph-deploy for the installation of daemons, writes the config file, and installs the crushmap. I have 3 hosts with 12 disks each; each disk has a large KV partition (3.6T) for the ecdata pool and a small cache partition (50G) for the cache pool.

I manually did this:

ceph osd pool create cache 1024 1024
ceph osd pool set cache size 2
ceph osd pool set cache min_size 1
ceph osd erasure-code-profile set profile11 k=8 m=3
ruleset-failure-domain=osd
ceph osd pool create ecdata 128 128 erasure profile11
ceph osd tier add ecdata cache
ceph osd tier cache-mode cache writeback
ceph osd tier set-overlay ecdata cache
ceph osd pool set cache hit_set_type bloom
ceph osd pool set cache hit_set_count 1
ceph osd pool set cache hit_set_period 3600
ceph osd pool set cache target_max_bytes $((280*1024*1024*1024))
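(To double-check the tiering afterwards, "ceph osd dump | grep pool" should show the tier_of/read_tier/write_tier fields on the pools, as in the pool dump further below.)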

(But the previous time I already had the problem without the cache part.)



Cluster live since 2014-08-29 15:34:16

Config file on host ceph001:

[global]
auth_client_required = cephx
auth_cluster_required = cephx
auth_service_required = cephx
cluster_network = 10.143.8.0/24
filestore_xattr_use_omap = 1
fsid = 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
mon_cluster_log_to_syslog = 1
mon_host = ceph001.cubone.os, ceph002.cubone.os, ceph003.cubone.os
mon_initial_members = ceph001, ceph002, ceph003
osd_crush_update_on_start = 0
osd_journal_size = 10240
osd_pool_default_min_size = 2
osd_pool_default_pg_num = 512
osd_pool_default_pgp_num = 512
osd_pool_default_size = 3
public_network = 10.141.8.0/24

[osd.11]
osd_objectstore = keyvaluestore-dev

[osd.13]
osd_objectstore = keyvaluestore-dev

[osd.15]
osd_objectstore = keyvaluestore-dev

[osd.17]
osd_objectstore = keyvaluestore-dev

[osd.19]
osd_objectstore = keyvaluestore-dev

[osd.21]
osd_objectstore = keyvaluestore-dev

[osd.23]
osd_objectstore = keyvaluestore-dev

[osd.25]
osd_objectstore = keyvaluestore-dev

[osd.3]
osd_objectstore = keyvaluestore-dev

[osd.5]
osd_objectstore = keyvaluestore-dev

[osd.7]
osd_objectstore = keyvaluestore-dev

[osd.9]
osd_objectstore = keyvaluestore-dev


OSDs:
# id    weight  type name       up/down reweight
-12     140.6   root default-cache
-9      46.87           host ceph001-cache
2       3.906                   osd.2   up      1
4       3.906                   osd.4   up      1
6       3.906                   osd.6   up      1
8       3.906                   osd.8   up      1
10      3.906                   osd.10  up      1
12      3.906                   osd.12  up      1
14      3.906                   osd.14  up      1
16      3.906                   osd.16  up      1
18      3.906                   osd.18  up      1
20      3.906                   osd.20  up      1
22      3.906                   osd.22  up      1
24      3.906                   osd.24  up      1
-10     46.87           host ceph002-cache
28      3.906                   osd.28  up      1
30      3.906                   osd.30  up      1
32      3.906                   osd.32  up      1
34      3.906                   osd.34  up      1
36      3.906                   osd.36  up      1
38      3.906                   osd.38  up      1
40      3.906                   osd.40  up      1
42      3.906                   osd.42  up      1
44      3.906                   osd.44  up      1
46      3.906                   osd.46  up      1
48      3.906                   osd.48  up      1
50      3.906                   osd.50  up      1
-11     46.87           host ceph003-cache
54      3.906                   osd.54  up      1
56      3.906                   osd.56  up      1
58      3.906                   osd.58  up      1
60      3.906                   osd.60  up      1
62      3.906                   osd.62  up      1
64      3.906                   osd.64  up      1
66      3.906                   osd.66  up      1
68      3.906                   osd.68  up      1
70      3.906                   osd.70  up      1
72      3.906                   osd.72  up      1
74      3.906                   osd.74  up      1
76      3.906                   osd.76  up      1
-8      140.6   root default-ec
-5      46.87           host ceph001-ec
3       3.906                   osd.3   up      1
5       3.906                   osd.5   up      1
7       3.906                   osd.7   up      1
9       3.906                   osd.9   up      1
11      3.906                   osd.11  up      1
13      3.906                   osd.13  up      1
15      3.906                   osd.15  up      1
17      3.906                   osd.17  up      1
19      3.906                   osd.19  up      1
21      3.906                   osd.21  up      1
23      3.906                   osd.23  up      1
25      3.906                   osd.25  up      1
-6      46.87           host ceph002-ec
29      3.906                   osd.29  up      1
31      3.906                   osd.31  up      1
33      3.906                   osd.33  up      1
35      3.906                   osd.35  up      1
37      3.906                   osd.37  up      1
39      3.906                   osd.39  up      1
41      3.906                   osd.41  up      1
43      3.906                   osd.43  up      1
45      3.906                   osd.45  up      1
47      3.906                   osd.47  up      1
49      3.906                   osd.49  up      1
51      3.906                   osd.51  up      1
-7      46.87           host ceph003-ec
55      3.906                   osd.55  up      1
57      3.906                   osd.57  up      1
59      3.906                   osd.59  up      1
61      3.906                   osd.61  up      1
63      3.906                   osd.63  up      1
65      3.906                   osd.65  up      1
67      3.906                   osd.67  up      1
69      3.906                   osd.69  up      1
71      3.906                   osd.71  up      1
73      3.906                   osd.73  up      1
75      3.906                   osd.75  up      1
77      3.906                   osd.77  up      1
-4      23.44   root default-ssd
-1      7.812           host ceph001-ssd
0       3.906                   osd.0   up      1
1       3.906                   osd.1   up      1
-2      7.812           host ceph002-ssd
26      3.906                   osd.26  up      1
27      3.906                   osd.27  up      1
-3      7.812           host ceph003-ssd
52      3.906                   osd.52  up      1
53      3.906                   osd.53  up      1

Cache OSDs are each 50G, the EC KV OSDs 3.6T (ssds not used right now).

Pools:
pool 0 'rbd' replicated size 3 min_size 2 crush_ruleset 0 object_hash
rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0
pool 1 'cache' replicated size 2 min_size 1 crush_ruleset 0 object_hash
rjenkins pg_num 1024 pgp_num 1024 last_change 174 flags
hashpspool,incomplete_clones tier_of 2 cache_mode writeback target_bytes
300647710720 hit_set bloom{false_positive_probability: 0.05, target_size: 0,
seed: 0} 3600s x1 stripe_width 0
pool 2 'ecdata' erasure size 11 min_size 8 crush_ruleset 2 object_hash
rjenkins pg_num 128 pgp_num 128 last_change 170 lfor 170 flags hashpspool
tiers 1 read_tier 1 write_tier 1 stripe_width 4096


Crushmap:
# begin crush map
tunable choose_local_fallback_tries 0
tunable choose_local_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47
device 48 osd.48
device 49 osd.49
device 50 osd.50
device 51 osd.51
device 52 osd.52
device 53 osd.53
device 54 osd.54
device 55 osd.55
device 56 osd.56
device 57 osd.57
device 58 osd.58
device 59 osd.59
device 60 osd.60
device 61 osd.61
device 62 osd.62
device 63 osd.63
device 64 osd.64
device 65 osd.65
device 66 osd.66
device 67 osd.67
device 68 osd.68
device 69 osd.69
device 70 osd.70
device 71 osd.71
device 72 osd.72
device 73 osd.73
device 74 osd.74
device 75 osd.75
device 76 osd.76
device 77 osd.77

# types
type 0 osd
type 1 host
type 2 root

# buckets
host ceph001-ssd {
        id -1           # do not change unnecessarily
        # weight 7.812
        alg straw
        hash 0  # rjenkins1
        item osd.0 weight 3.906
        item osd.1 weight 3.906
}
host ceph002-ssd {
        id -2           # do not change unnecessarily
        # weight 7.812
        alg straw
        hash 0  # rjenkins1
        item osd.26 weight 3.906
        item osd.27 weight 3.906
}
host ceph003-ssd {
        id -3           # do not change unnecessarily
        # weight 7.812
        alg straw
        hash 0  # rjenkins1
        item osd.52 weight 3.906
        item osd.53 weight 3.906
}
root default-ssd {
        id -4           # do not change unnecessarily
        # weight 23.436
        alg straw
        hash 0  # rjenkins1
        item ceph001-ssd weight 7.812
        item ceph002-ssd weight 7.812
        item ceph003-ssd weight 7.812
}
host ceph001-ec {
        id -5           # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.3 weight 3.906
        item osd.5 weight 3.906
        item osd.7 weight 3.906
        item osd.9 weight 3.906
        item osd.11 weight 3.906
        item osd.13 weight 3.906
        item osd.15 weight 3.906
        item osd.17 weight 3.906
        item osd.19 weight 3.906
        item osd.21 weight 3.906
        item osd.23 weight 3.906
        item osd.25 weight 3.906
}
host ceph002-ec {
        id -6           # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.29 weight 3.906
        item osd.31 weight 3.906
        item osd.33 weight 3.906
        item osd.35 weight 3.906
        item osd.37 weight 3.906
        item osd.39 weight 3.906
        item osd.41 weight 3.906
        item osd.43 weight 3.906
        item osd.45 weight 3.906
        item osd.47 weight 3.906
        item osd.49 weight 3.906
        item osd.51 weight 3.906
}
host ceph003-ec {
        id -7           # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.55 weight 3.906
        item osd.57 weight 3.906
        item osd.59 weight 3.906
        item osd.61 weight 3.906
        item osd.63 weight 3.906
        item osd.65 weight 3.906
        item osd.67 weight 3.906
        item osd.69 weight 3.906
        item osd.71 weight 3.906
        item osd.73 weight 3.906
        item osd.75 weight 3.906
        item osd.77 weight 3.906
}
root default-ec {
        id -8           # do not change unnecessarily
        # weight 140.616
        alg straw
        hash 0  # rjenkins1
        item ceph001-ec weight 46.872
        item ceph002-ec weight 46.872
        item ceph003-ec weight 46.872
}
host ceph001-cache {
        id -9           # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.2 weight 3.906
        item osd.4 weight 3.906
        item osd.6 weight 3.906
        item osd.8 weight 3.906
        item osd.10 weight 3.906
        item osd.12 weight 3.906
        item osd.14 weight 3.906
        item osd.16 weight 3.906
        item osd.18 weight 3.906
        item osd.20 weight 3.906
        item osd.22 weight 3.906
        item osd.24 weight 3.906
}
host ceph002-cache {
        id -10          # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.28 weight 3.906
        item osd.30 weight 3.906
        item osd.32 weight 3.906
        item osd.34 weight 3.906
        item osd.36 weight 3.906
        item osd.38 weight 3.906
        item osd.40 weight 3.906
        item osd.42 weight 3.906
        item osd.44 weight 3.906
        item osd.46 weight 3.906
        item osd.48 weight 3.906
        item osd.50 weight 3.906
}
host ceph003-cache {
        id -11          # do not change unnecessarily
        # weight 46.872
        alg straw
        hash 0  # rjenkins1
        item osd.54 weight 3.906
        item osd.56 weight 3.906
        item osd.58 weight 3.906
        item osd.60 weight 3.906
        item osd.62 weight 3.906
        item osd.64 weight 3.906
        item osd.66 weight 3.906
        item osd.68 weight 3.906
        item osd.70 weight 3.906
        item osd.72 weight 3.906
        item osd.74 weight 3.906
        item osd.76 weight 3.906
}
root default-cache {
        id -12          # do not change unnecessarily
        # weight 140.616
        alg straw
        hash 0  # rjenkins1
        item ceph001-cache weight 46.872
        item ceph002-cache weight 46.872
        item ceph003-cache weight 46.872
}

# rules
rule cache {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default-cache
        step chooseleaf firstn 0 type host
        step emit
}
rule metadata {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take default-ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule ecdata {
        ruleset 2
        type erasure
        min_size 3
        max_size 20
        step set_chooseleaf_tries 5
        step take default-ec
        step choose indep 0 type osd
        step emit
}

# end crush map

The benchmarks I then did:

./benchrw 50000

benchrw:
# $1 = bench duration in seconds
/usr/bin/rados -p ecdata bench $1 write --no-cleanup
/usr/bin/rados -p ecdata bench $1 seq
# second pass: read and write running concurrently
/usr/bin/rados -p ecdata bench $1 seq &
/usr/bin/rados -p ecdata bench $1 write --no-cleanup


Scrubbing errors started soon after that: 2014-08-31 10:59:14


Please let me know if you need more information, and thanks !

Kenneth

----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
   Date: Mon, 1 Sep 2014 21:30:16 +0800
   From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: ceph cluster inconsistency keyvaluestore
     To: Kenneth Waegeman <kenneth.waege...@ugent.be>
     Cc: ceph-users@lists.ceph.com


Hmm, could you please list your steps, including how long the cluster has existed and all relevant ops? I want to reproduce it.


On Mon, Sep 1, 2014 at 4:45 PM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:

Hi,

I reinstalled the cluster with 0.84 and tried again running rados bench on an EC-coded pool on keyvaluestore.
Nothing crashed this time, but when I check the status:

     health HEALTH_ERR 128 pgs inconsistent; 128 scrub errors; too few pgs per osd (15 < min 20)
     monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 8, quorum 0,1,2 ceph001,ceph002,ceph003
     osdmap e174: 78 osds: 78 up, 78 in
      pgmap v147680: 1216 pgs, 3 pools, 14758 GB data, 3690 kobjects
            1753 GB used, 129 TB / 131 TB avail
                1088 active+clean
                 128 active+clean+inconsistent

The 128 inconsistent pgs are ALL the pgs of the EC KV store (the others are on Filestore).

The only thing I can see in the logs is that after the rados tests, it starts scrubbing, and for each KV pg I get something like this:

2014-08-31 11:14:09.050747 osd.11 10.141.8.180:6833/61098 4 : [ERR] 2.3s0 scrub stat mismatch, got 28164/29291 objects, 0/0 clones, 28164/29291 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 118128377856/122855358464 bytes.

What could the problem be here?
Thanks again!!

Kenneth


----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
  Date: Tue, 26 Aug 2014 17:11:43 +0800
  From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
    To: Kenneth Waegeman <kenneth.waege...@ugent.be>
    Cc: ceph-users@lists.ceph.com


Hmm, it looks like you hit this bug (http://tracker.ceph.com/issues/9223).

Sorry for the late message; I forgot that this fix was merged into 0.84.

Thanks for your patience :-)

On Tue, Aug 26, 2014 at 4:39 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:


Hi,

In the meantime I already tried upgrading the cluster to 0.84 to see if that made a difference, and it seems it does: I can't reproduce the crashing osds by doing a 'rados -p ecdata ls' anymore.

But now the cluster detects it is inconsistent:

     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
      health HEALTH_ERR 40 pgs inconsistent; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
      monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
      mdsmap e78951: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
      osdmap e145384: 78 osds: 78 up, 78 in
       pgmap v247095: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
             1502 GB used, 129 TB / 131 TB avail
                  279 active+clean
                   40 active+clean+inconsistent
                    1 active+clean+scrubbing+deep


I tried to do ceph pg repair for all the inconsistent pgs:

     cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
      health HEALTH_ERR 40 pgs inconsistent; 1 pgs repair; 40 scrub errors; too few pgs per osd (4 < min 20); mon.ceph002 low disk space
      monmap e3: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 30, quorum 0,1,2 ceph001,ceph002,ceph003
      mdsmap e79486: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
      osdmap e146452: 78 osds: 78 up, 78 in
       pgmap v248520: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
             1503 GB used, 129 TB / 131 TB avail
                  279 active+clean
                   39 active+clean+inconsistent
                    1 active+clean+scrubbing+deep
                    1 active+clean+scrubbing+deep+inconsistent+repair

I let it recover through the night, but this morning the mons were all gone, with nothing to see in the log files. The osds were all still up!

   cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
    health HEALTH_ERR 36 pgs inconsistent; 1 pgs repair; 36 scrub errors; too few pgs per osd (4 < min 20)
    monmap e7: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 44, quorum 0,1,2 ceph001,ceph002,ceph003
    mdsmap e109481: 1/1/1 up {0=ceph003.cubone.os=up:active}, 3 up:standby
    osdmap e203410: 78 osds: 78 up, 78 in
     pgmap v331747: 320 pgs, 4 pools, 15251 GB data, 3812 kobjects
           1547 GB used, 129 TB / 131 TB avail
                  1 active+clean+scrubbing+deep+inconsistent+repair
                284 active+clean
                 35 active+clean+inconsistent

I have restarted the monitors now; I will let you know when I see something more.




----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
    Date: Sun, 24 Aug 2014 12:51:41 +0800

    From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
      To: Kenneth Waegeman <kenneth.waege...@ugent.be>,
ceph-users@lists.ceph.com


It's really strange! I wrote a test program following the key ordering you provided and parsed the corresponding values, and it holds true!

I have no idea now. If you have time, could you add this debug code to "src/os/GenericObjectMap.cc", inserted *before* "assert(start <= header.oid);":

 dout(0) << "start: " << start << " header.oid: " << header.oid << dendl;

Then you need to recompile ceph-osd and run it again. The output log should help!

On Tue, Aug 19, 2014 at 10:19 PM, Haomai Wang <haomaiw...@gmail.com>
wrote:


I feel a little embarrassed; the 1024 rows still hold true for me.

I was wondering if you could send all your keys via "ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ > keys.log".

thanks!

On Tue, Aug 19, 2014 at 4:58 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:



----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
Date: Tue, 19 Aug 2014 12:28:27 +0800

From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
  To: Kenneth Waegeman <kenneth.waege...@ugent.be>
  Cc: Sage Weil <sw...@redhat.com>, ceph-users@lists.ceph.com


On Mon, Aug 18, 2014 at 7:32 PM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:




----- Message from Haomai Wang <haomaiw...@gmail.com> ---------
Date: Mon, 18 Aug 2014 18:34:11 +0800

From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
  To: Kenneth Waegeman <kenneth.waege...@ugent.be>
  Cc: Sage Weil <sw...@redhat.com>, ceph-users@lists.ceph.com



On Mon, Aug 18, 2014 at 5:38 PM, Kenneth Waegeman <kenneth.waege...@ugent.be> wrote:




Hi,

I tried this after restarting the osd, but I guess that was not the aim:

# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ | grep 6adb1100 -A 100
IO error: lock /var/lib/ceph/osd/ceph-67/current//LOCK: Resource temporarily unavailable
tools/ceph_kvstore_tool.cc: In function 'StoreTool::StoreTool(const string&)' thread 7f8fecf7d780 time 2014-08-18 11:12:29.551780
tools/ceph_kvstore_tool.cc: 38: FAILED assert(!db_ptr->open(std::cerr))
..

When I run it after bringing the osd down, it takes a while, but it has no output. (When running it without the grep, I'm getting a huge list.)





Oh, sorry about that! I made a mistake: the hash value (6adb1100) is stored reversed in leveldb. So grepping for "benchmark_data_ceph001.cubone.os_5560_object789734" should help.

this gives:

[root@ceph003 ~]# ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ | grep 5560_object789734 -A 100


_GHOBJTOSEQ_:3%e0s0_head!0011BDA6!!3!!benchmark_data_ceph001%ecubone%eos_5560_object789734!head
_GHOBJTOSEQ_:3%e0s0_head!0011C027!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1330170!head
_GHOBJTOSEQ_:3%e0s0_head!0011C6FD!!3!!benchmark_data_ceph001%ecubone%eos_4919_object227366!head
_GHOBJTOSEQ_:3%e0s0_head!0011CB03!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1363631!head
_GHOBJTOSEQ_:3%e0s0_head!0011CDF0!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1573957!head
_GHOBJTOSEQ_:3%e0s0_head!0011D02C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1019282!head
_GHOBJTOSEQ_:3%e0s0_head!0011E2B5!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1283563!head
_GHOBJTOSEQ_:3%e0s0_head!0011E511!!3!!benchmark_data_ceph001%ecubone%eos_4919_object273736!head
_GHOBJTOSEQ_:3%e0s0_head!0011E547!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1170628!head
_GHOBJTOSEQ_:3%e0s0_head!0011EAAB!!3!!benchmark_data_ceph001%ecubone%eos_4919_object256335!head
_GHOBJTOSEQ_:3%e0s0_head!0011F446!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1484196!head
_GHOBJTOSEQ_:3%e0s0_head!0011FC59!!3!!benchmark_data_ceph001%ecubone%eos_5560_object884178!head
_GHOBJTOSEQ_:3%e0s0_head!001203F3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object853746!head
_GHOBJTOSEQ_:3%e0s0_head!001208E3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object36633!head
_GHOBJTOSEQ_:3%e0s0_head!00120B37!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1235337!head
_GHOBJTOSEQ_:3%e0s0_head!001210B6!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1661351!head
_GHOBJTOSEQ_:3%e0s0_head!001210CB!!3!!benchmark_data_ceph001%ecubone%eos_5560_object238126!head
_GHOBJTOSEQ_:3%e0s0_head!0012184C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object339943!head
_GHOBJTOSEQ_:3%e0s0_head!00121916!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1047094!head
_GHOBJTOSEQ_:3%e0s0_head!001219C1!!3!!benchmark_data_ceph001%ecubone%eos_31461_object520642!head
_GHOBJTOSEQ_:3%e0s0_head!001222BB!!3!!benchmark_data_ceph001%ecubone%eos_5560_object639565!head
_GHOBJTOSEQ_:3%e0s0_head!001223AA!!3!!benchmark_data_ceph001%ecubone%eos_4919_object231080!head
_GHOBJTOSEQ_:3%e0s0_head!0012243C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object858050!head
_GHOBJTOSEQ_:3%e0s0_head!0012289C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object241796!head
_GHOBJTOSEQ_:3%e0s0_head!00122D28!!3!!benchmark_data_ceph001%ecubone%eos_4919_object7462!head
_GHOBJTOSEQ_:3%e0s0_head!00122DFE!!3!!benchmark_data_ceph001%ecubone%eos_5560_object243798!head
_GHOBJTOSEQ_:3%e0s0_head!00122EFC!!3!!benchmark_data_ceph001%ecubone%eos_8961_object109512!head
_GHOBJTOSEQ_:3%e0s0_head!001232D7!!3!!benchmark_data_ceph001%ecubone%eos_31461_object653973!head
_GHOBJTOSEQ_:3%e0s0_head!001234A3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1378169!head
_GHOBJTOSEQ_:3%e0s0_head!00123714!!3!!benchmark_data_ceph001%ecubone%eos_5560_object512925!head
_GHOBJTOSEQ_:3%e0s0_head!001237D9!!3!!benchmark_data_ceph001%ecubone%eos_4919_object23289!head
_GHOBJTOSEQ_:3%e0s0_head!00123854!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1108852!head
_GHOBJTOSEQ_:3%e0s0_head!00123971!!3!!benchmark_data_ceph001%ecubone%eos_5560_object704026!head
_GHOBJTOSEQ_:3%e0s0_head!00123F75!!3!!benchmark_data_ceph001%ecubone%eos_8961_object250441!head
_GHOBJTOSEQ_:3%e0s0_head!00124083!!3!!benchmark_data_ceph001%ecubone%eos_31461_object706178!head
_GHOBJTOSEQ_:3%e0s0_head!001240FA!!3!!benchmark_data_ceph001%ecubone%eos_5560_object316952!head
_GHOBJTOSEQ_:3%e0s0_head!0012447D!!3!!benchmark_data_ceph001%ecubone%eos_5560_object538734!head
_GHOBJTOSEQ_:3%e0s0_head!001244D9!!3!!benchmark_data_ceph001%ecubone%eos_31461_object789215!head
_GHOBJTOSEQ_:3%e0s0_head!001247CD!!3!!benchmark_data_ceph001%ecubone%eos_8961_object265993!head
_GHOBJTOSEQ_:3%e0s0_head!00124897!!3!!benchmark_data_ceph001%ecubone%eos_31461_object610597!head
_GHOBJTOSEQ_:3%e0s0_head!00124BE4!!3!!benchmark_data_ceph001%ecubone%eos_31461_object691723!head
_GHOBJTOSEQ_:3%e0s0_head!00124C9B!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1306135!head
_GHOBJTOSEQ_:3%e0s0_head!00124E1D!!3!!benchmark_data_ceph001%ecubone%eos_5560_object520580!head
_GHOBJTOSEQ_:3%e0s0_head!0012534C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object659767!head
_GHOBJTOSEQ_:3%e0s0_head!00125A81!!3!!benchmark_data_ceph001%ecubone%eos_5560_object184060!head
_GHOBJTOSEQ_:3%e0s0_head!00125E77!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1292867!head
_GHOBJTOSEQ_:3%e0s0_head!00126562!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1201410!head
_GHOBJTOSEQ_:3%e0s0_head!00126B34!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1657326!head
_GHOBJTOSEQ_:3%e0s0_head!00127383!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1269787!head
_GHOBJTOSEQ_:3%e0s0_head!00127396!!3!!benchmark_data_ceph001%ecubone%eos_31461_object500115!head
_GHOBJTOSEQ_:3%e0s0_head!001277F8!!3!!benchmark_data_ceph001%ecubone%eos_31461_object394932!head
_GHOBJTOSEQ_:3%e0s0_head!001279DD!!3!!benchmark_data_ceph001%ecubone%eos_4919_object252963!head
_GHOBJTOSEQ_:3%e0s0_head!00127B40!!3!!benchmark_data_ceph001%ecubone%eos_31461_object936811!head
_GHOBJTOSEQ_:3%e0s0_head!00127BAC!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1481773!head
_GHOBJTOSEQ_:3%e0s0_head!0012894E!!3!!benchmark_data_ceph001%ecubone%eos_5560_object999885!head
_GHOBJTOSEQ_:3%e0s0_head!00128D05!!3!!benchmark_data_ceph001%ecubone%eos_31461_object943667!head
_GHOBJTOSEQ_:3%e0s0_head!0012908A!!3!!benchmark_data_ceph001%ecubone%eos_5560_object212990!head
_GHOBJTOSEQ_:3%e0s0_head!00129519!!3!!benchmark_data_ceph001%ecubone%eos_5560_object437596!head
_GHOBJTOSEQ_:3%e0s0_head!00129716!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1585330!head
_GHOBJTOSEQ_:3%e0s0_head!00129798!!3!!benchmark_data_ceph001%ecubone%eos_5560_object603505!head
_GHOBJTOSEQ_:3%e0s0_head!001299C9!!3!!benchmark_data_ceph001%ecubone%eos_31461_object808800!head
_GHOBJTOSEQ_:3%e0s0_head!00129B7A!!3!!benchmark_data_ceph001%ecubone%eos_31461_object23193!head
_GHOBJTOSEQ_:3%e0s0_head!00129B9A!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1158397!head
_GHOBJTOSEQ_:3%e0s0_head!0012A932!!3!!benchmark_data_ceph001%ecubone%eos_5560_object542450!head
_GHOBJTOSEQ_:3%e0s0_head!0012B77A!!3!!benchmark_data_ceph001%ecubone%eos_8961_object195480!head
_GHOBJTOSEQ_:3%e0s0_head!0012BE8C!!3!!benchmark_data_ceph001%ecubone%eos_4919_object312911!head
_GHOBJTOSEQ_:3%e0s0_head!0012BF74!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1563783!head
_GHOBJTOSEQ_:3%e0s0_head!0012C65C!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1123980!head
_GHOBJTOSEQ_:3%e0s0_head!0012C6FE!!3!!benchmark_data_ceph001%ecubone%eos_3411_object913!head
_GHOBJTOSEQ_:3%e0s0_head!0012CCAD!!3!!benchmark_data_ceph001%ecubone%eos_31461_object400863!head
_GHOBJTOSEQ_:3%e0s0_head!0012CDBB!!3!!benchmark_data_ceph001%ecubone%eos_5560_object789667!head
_GHOBJTOSEQ_:3%e0s0_head!0012D14B!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1020723!head
_GHOBJTOSEQ_:3%e0s0_head!0012D95B!!3!!benchmark_data_ceph001%ecubone%eos_8961_object106293!head
_GHOBJTOSEQ_:3%e0s0_head!0012E3C8!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1355526!head
_GHOBJTOSEQ_:3%e0s0_head!0012E5B3!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1491348!head
_GHOBJTOSEQ_:3%e0s0_head!0012F2BB!!3!!benchmark_data_ceph001%ecubone%eos_8961_object338872!head
_GHOBJTOSEQ_:3%e0s0_head!0012F374!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1337264!head
_GHOBJTOSEQ_:3%e0s0_head!0012FBE5!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1512395!head
_GHOBJTOSEQ_:3%e0s0_head!0012FCE3!!3!!benchmark_data_ceph001%ecubone%eos_8961_object298610!head
_GHOBJTOSEQ_:3%e0s0_head!0012FEB6!!3!!benchmark_data_ceph001%ecubone%eos_4919_object120824!head
_GHOBJTOSEQ_:3%e0s0_head!001301CA!!3!!benchmark_data_ceph001%ecubone%eos_5560_object816326!head
_GHOBJTOSEQ_:3%e0s0_head!00130263!!3!!benchmark_data_ceph001%ecubone%eos_5560_object777163!head
_GHOBJTOSEQ_:3%e0s0_head!00130529!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1413173!head
_GHOBJTOSEQ_:3%e0s0_head!001317D9!!3!!benchmark_data_ceph001%ecubone%eos_31461_object809510!head
_GHOBJTOSEQ_:3%e0s0_head!0013204F!!3!!benchmark_data_ceph001%ecubone%eos_31461_object471416!head
_GHOBJTOSEQ_:3%e0s0_head!00132400!!3!!benchmark_data_ceph001%ecubone%eos_5560_object695087!head
_GHOBJTOSEQ_:3%e0s0_head!00132A19!!3!!benchmark_data_ceph001%ecubone%eos_31461_object591945!head
_GHOBJTOSEQ_:3%e0s0_head!00132BF8!!3!!benchmark_data_ceph001%ecubone%eos_31461_object302000!head
_GHOBJTOSEQ_:3%e0s0_head!00132F5B!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1645443!head
_GHOBJTOSEQ_:3%e0s0_head!00133B8B!!3!!benchmark_data_ceph001%ecubone%eos_5560_object761911!head
_GHOBJTOSEQ_:3%e0s0_head!0013433E!!3!!benchmark_data_ceph001%ecubone%eos_31461_object1467727!head
_GHOBJTOSEQ_:3%e0s0_head!00134446!!3!!benchmark_data_ceph001%ecubone%eos_31461_object791960!head
_GHOBJTOSEQ_:3%e0s0_head!00134678!!3!!benchmark_data_ceph001%ecubone%eos_31461_object677078!head
_GHOBJTOSEQ_:3%e0s0_head!00134A96!!3!!benchmark_data_ceph001%ecubone%eos_31461_object254923!head
_GHOBJTOSEQ_:3%e0s0_head!001355D0!!3!!benchmark_data_ceph001%ecubone%eos_31461_object321528!head
_GHOBJTOSEQ_:3%e0s0_head!00135690!!3!!benchmark_data_ceph001%ecubone%eos_4919_object36935!head
_GHOBJTOSEQ_:3%e0s0_head!00135B62!!3!!benchmark_data_ceph001%ecubone%eos_5560_object1228272!head
_GHOBJTOSEQ_:3%e0s0_head!00135C72!!3!!benchmark_data_ceph001%ecubone%eos_4812_object2180!head
_GHOBJTOSEQ_:3%e0s0_head!00135DEE!!3!!benchmark_data_ceph001%ecubone%eos_5560_object425705!head
_GHOBJTOSEQ_:3%e0s0_head!00136366!!3!!benchmark_data_ceph001%ecubone%eos_5560_object141569!head
_GHOBJTOSEQ_:3%e0s0_head!00136371!!3!!benchmark_data_ceph001%ecubone%eos_5560_object564213!head




The 100 rows look correct to me. I found the minimum number of objects listed is 1024. Could you please run
"ceph-kvstore-tool /var/lib/ceph/osd/ceph-67/current/ list _GHOBJTOSEQ_ | grep 6adb1100 -A 1024"




I put the output in the attachment.




Or should I run this immediately after the osd has crashed (because it may have rebalanced? I had already restarted the cluster)?


I don't know if it is related, but before I could do all that, I had to fix something else: a monitor ran out of disk space, using 8GB for its store.db folder (lots of sst files). The other monitors are also near that level; I never had that problem on previous setups. I recreated the monitor and now it uses 3.8GB.





There is some duplicate data that needs to be compacted.






Another idea: maybe you can make KeyValueStore's stripe size align with the EC stripe size.




How can I do that? Is there some documentation about that?




ceph --show-config | grep keyvaluestore




debug_keyvaluestore = 0/0
keyvaluestore_queue_max_ops = 50
keyvaluestore_queue_max_bytes = 104857600
keyvaluestore_debug_check_backend = false
keyvaluestore_op_threads = 2
keyvaluestore_op_thread_timeout = 60
keyvaluestore_op_thread_suicide_timeout = 180
keyvaluestore_default_strip_size = 4096
keyvaluestore_max_expected_write_size = 16777216
keyvaluestore_header_cache_size = 4096
keyvaluestore_backend = leveldb

keyvaluestore_default_strip_size is the one you want.
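For example, a minimal sketch of pinning it in ceph.conf (here the default of 4096 already matches the ecdata pool's stripe_width of 4096, so this is just illustrative):

[osd]
# keep the KeyValueStore strip size equal to the EC pool's stripe_width
keyvaluestore_default_strip_size = 4096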



I haven't thought about it deeply; maybe I will try it later.


Thanks!


Kenneth



----- Message from Sage Weil <sw...@redhat.com> ---------
Date: Fri, 15 Aug 2014 06:10:34 -0700 (PDT)
From: Sage Weil <sw...@redhat.com>

Subject: Re: [ceph-users] ceph cluster inconsistency?
  To: Haomai Wang <haomaiw...@gmail.com>
  Cc: Kenneth Waegeman <kenneth.waege...@ugent.be>,
ceph-users@lists.ceph.com



On Fri, 15 Aug 2014, Haomai Wang wrote:






Hi Kenneth,

I don't find valuable info in your logs; they lack the necessary debug output from where the crash happens.

But I scanned the encode/decode implementation in GenericObjectMap and found something bad.

For example, two oids have the same hash and their names are:
A: "rb.data.123"
B: "rb-123"

At the ghobject_t comparison level, A < B. But GenericObjectMap encodes "." as "%e", so the keys in the DB are:
A: _GHOBJTOSEQ_:blah!51615000!!none!!rb%edata%e123!head
B: _GHOBJTOSEQ_:blah!51615000!!none!!rb-123!head

A > B

And it seems that the escape function is useless and should be disabled.

I'm not sure whether Kenneth's problem is caused by this bug, because this scenario only occurs when the object set is very large and two objects end up with the same hash value.

Kenneth, could you find time to run "ceph-kvstore-tool [path-to-osd] list _GHOBJTOSEQ_ | grep 6adb1100 -A 100"? ceph-kvstore-tool is a debug tool which can be compiled from source: clone the ceph repo and run "./autogen.sh; ./configure; cd src; make ceph-kvstore-tool". "path-to-osd" should be "/var/lib/ceph/osd-[id]/current/". "6adb1100" is from your verbose log, and the next 100 rows should give the necessary info.






You can also get ceph-kvstore-tool from the 'ceph-tests' package.

Hi sage, do you think we need to provide an upgrade function to fix it?






Hmm, we might.  This only affects the key/value encoding right?  The FileStore is using its own function to map these to file names?

Can you open a ticket in the tracker for this?

Thanks!
sage



On Thu, Aug 14, 2014 at 7:36 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:


----- Message from Haomai Wang <haomaiw...@gmail.com>
---------
 Date: Thu, 14 Aug 2014 19:11:55 +0800

 From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman <kenneth.waege...@ugent.be>


Could you add the config "debug_keyvaluestore = 20/20" to the crashed osd and replay the command causing the crash?

I would like to get more debug info! Thanks.




I included the log as an attachment!
Thanks!


On Thu, Aug 14, 2014 at 4:41 PM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:




I have:
osd_objectstore = keyvaluestore-dev

in the global section of my ceph.conf


[root@ceph002 ~]# ceph osd erasure-code-profile get
profile11
directory=/usr/lib64/ceph/erasure-code
k=8
m=3
plugin=jerasure
ruleset-failure-domain=osd
technique=reed_sol_van

the ecdata pool has this as profile

pool 3 'ecdata' erasure size 11 min_size 8 crush_ruleset 2 object_hash rjenkins pg_num 128 pgp_num 128 last_change 161 flags hashpspool stripe_width 4096

ECrule in crushmap

rule ecdata {
      ruleset 2
      type erasure
      min_size 3
      max_size 20
      step set_chooseleaf_tries 5
      step take default-ec
      step choose indep 0 type osd
      step emit
}
root default-ec {
      id -8           # do not change unnecessarily
      # weight 140.616
      alg straw
      hash 0  # rjenkins1
      item ceph001-ec weight 46.872
      item ceph002-ec weight 46.872
      item ceph003-ec weight 46.872
...

Cheers!
Kenneth

----- Message from Haomai Wang <haomaiw...@gmail.com>
---------
 Date: Thu, 14 Aug 2014 10:07:50 +0800
 From: Haomai Wang <haomaiw...@gmail.com>
Subject: Re: [ceph-users] ceph cluster inconsistency?
   To: Kenneth Waegeman <kenneth.waege...@ugent.be>
   Cc: ceph-users <ceph-users@lists.ceph.com>



Hi Kenneth,


Could you give your configuration related to EC and KeyValueStore? Not sure whether it's a bug in KeyValueStore.

On Thu, Aug 14, 2014 at 12:06 AM, Kenneth Waegeman
<kenneth.waege...@ugent.be> wrote:




Hi,

I was doing some tests with rados bench on an Erasure Coded pool (using the keyvaluestore-dev objectstore) on 0.83, and I see some strange things:


[root@ceph001 ~]# ceph status
    cluster 82766e04-585b-49a6-a0ac-c13d9ffd0a7d
     health HEALTH_WARN too few pgs per osd (4 < min 20)
     monmap e1: 3 mons at {ceph001=10.141.8.180:6789/0,ceph002=10.141.8.181:6789/0,ceph003=10.141.8.182:6789/0}, election epoch 6, quorum 0,1,2 ceph001,ceph002,ceph003
     mdsmap e116: 1/1/1 up {0=ceph001.cubone.os=up:active}, 2 up:standby
     osdmap e292: 78 osds: 78 up, 78 in
      pgmap v48873: 320 pgs, 4 pools, 15366 GB data, 3841 kobjects
            1381 GB used, 129 TB / 131 TB avail
                 320 active+clean

There is around 15T of data, but only 1.3T used.

This is also visible in rados:

[root@ceph001 ~]# rados df
pool name       category                 KB      objects       clones     degraded      unfound           rd        rd KB           wr        wr KB
data            -                          0            0            0            0            0            0            0            0            0
ecdata          -                16113451009      3933959            0            0            0            1            1      3935632  16116850711
metadata        -                          2           20            0            0            0           33           36           21            8
rbd             -                          0            0            0            0            0            0            0            0            0
  total used      1448266016      3933979
  total avail   139400181016
  total space   140848447032


Another (related?) thing: if I do rados -p ecdata ls, I trigger osd shutdowns (each time):
I get a list followed by an error:

...
benchmark_data_ceph001.cubone.os_8961_object243839
benchmark_data_ceph001.cubone.os_5560_object801983
benchmark_data_ceph001.cubone.os_31461_object856489
benchmark_data_ceph001.cubone.os_8961_object202232
benchmark_data_ceph001.cubone.os_4919_object33199
benchmark_data_ceph001.cubone.os_5560_object807797
benchmark_data_ceph001.cubone.os_4919_object74729
benchmark_data_ceph001.cubone.os_31461_object1264121
benchmark_data_ceph001.cubone.os_5560_object1318513
benchmark_data_ceph001.cubone.os_5560_object1202111
benchmark_data_ceph001.cubone.os_31461_object939107
benchmark_data_ceph001.cubone.os_31461_object729682
benchmark_data_ceph001.cubone.os_5560_object122915
benchmark_data_ceph001.cubone.os_5560_object76521
benchmark_data_ceph001.cubone.os_5560_object113261
benchmark_data_ceph001.cubone.os_31461_object575079
benchmark_data_ceph001.cubone.os_5560_object671042
benchmark_data_ceph001.cubone.os_5560_object381146
2014-08-13 17:57:48.736150 7f65047b5700  0 -- 10.141.8.180:0/1023295 >> 10.141.8.182:6839/4471 pipe(0x7f64fc019b20 sd=5 :0 s=1 pgs=0 cs=0 l=1 c=0x7f64fc019db0).fault

And I can see this in the log files:

 -25> 2014-08-13 17:52:56.323908 7f8a97fa4700  1 -- 10.143.8.182:6827/64670 <== osd.57 10.141.8.182:0/15796 51 ==== osd_ping(ping e220 stamp 2014-08-13 17:52:56.323092) v2 ==== 47+0+0 (3227325175 0 0) 0xf475940 con 0xee89fa0
 -24> 2014-08-13 17:52:56.323938 7f8a97fa4700  1 -- 10.143.8.182:6827/64670 --> 10.141.8.182:0/15796 -- osd_ping(ping_reply e220 stamp 2014-08-13 17:52:56.323092) v2 -- ?+0 0xf815b00 con 0xee89fa0
 -23> 2014-08-13 17:52:56.324078 7f8a997a7700  1 -- 10.141.8.182:6840/64670 <== osd.57 10.141.8.182:0/15796 51 ==== osd_ping(ping e220 stamp 2014-08-13 17:52:56.323092) v2 ==== 47+0+0 (3227325175 0 0) 0xf132bc0 con 0xee8a680
 -22> 2014-08-13 17:52:56.324111 7f8a997a7700  1 -- 10.141.8.182:6840/64670 --> 10.141.8.182:0/15796 -- osd_ping(ping_reply e220 stamp 2014-08-13 17:52:56.323092) v2 -- ?+0 0xf811a40 con 0xee8a680
 -21> 2014-08-13 17:52:56.584461 7f8a997a7700  1 -- 10.141.8.182:6840/64670 <== osd.29 10.143.8.181:0/12142 47 ==== osd_ping(ping e220 stamp 2014-08-13 17:52:56.583010) v2 ==== 47+0+0 (3355887204 0 0) 0xf655940 con 0xee88b00
 -20> 2014-08-13 17:52:56.584486 7f8a997a7700  1 -- 10.141.8.182:6840/64670 --> 10.143.8.181:0/12142 -- osd_ping(ping_reply e220 stamp 2014-08-13 17:52:56.583010) v2 -- ?+0 0xf132bc0 con 0xee88b00
 -19> 2014-08-13 17:52:56.584498 7f8a97fa4700  1 -- 10.143.8.182:6827/64670 <== osd.29 10.143.8.181:0/12142 47 ==== osd_ping(ping e220 stamp 2014-08-13 17:52:56.583010) v2 ==== 47+0+0 (3355887204 0 0) 0xf20e040 con 0xee886e0
 -18> 2014-08-13 17:52:56.584526 7f8a97fa4700  1 -- 10.143.8.182:6827/64670 --> 10.143.8.181:0/12142 -- osd_ping(ping_reply e220 stamp 2014-08-13 17:52:56.583010) v2 -- ?+0 0xf475940 con 0xee886e0
 -17> 2014-08-13 17:52:56.594448 7f8a798c7700  1 -- 10.141.8.182:6839/64670 >> :/0 pipe(0xec15f00 sd=74 :6839 s=0 pgs=0 cs=0 l=0 c=0xee856a0).accept sd=74 10.141.8.180:47641/0
 -16> 2014-08-13 17:52:56.594921 7f8a798c7700  1 -- 10.141.8.182:6839/64670 <== client.7512 10.141.8.180:0/1018433 1 ==== osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220) v4 ==== 151+0+39 (1972163119 0 4174233976) 0xf3bca40 con 0xee856a0
 -15> 2014-08-13 17:52:56.594957 7f8a798c7700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.594874, event: header_read, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
 -14> 2014-08-13 17:52:56.594970 7f8a798c7700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.594880, event: throttled, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
 -13> 2014-08-13 17:52:56.594978 7f8a798c7700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.594917, event: all_read, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
 -12> 2014-08-13 17:52:56.594986 7f8a798c7700  5 -- op tracker -- , seq: 299, time: 0.000000, event: dispatched, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
 -11> 2014-08-13 17:52:56.595127 7f8a90795700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.595104, event: reached_pg, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
 -10> 2014-08-13 17:52:56.595159 7f8a90795700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.595153, event: started, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
  -9> 2014-08-13 17:52:56.602179 7f8a90795700  1 -- 10.141.8.182:6839/64670 --> 10.141.8.180:0/1018433 -- osd_op_reply(1 [pgls start_epoch 0] v164'30654 uv30654 ondisk = 0) v6 -- ?+0 0xec16180 con 0xee856a0
  -8> 2014-08-13 17:52:56.602211 7f8a90795700  5 -- op tracker -- , seq: 299, time: 2014-08-13 17:52:56.602205, event: done, op: osd_op(client.7512.0:1  [pgls start_epoch 0] 3.0 ack+read+known_if_redirected e220)
  -7> 2014-08-13 17:52:56.614839 7f8a798c7700  1 -- 10.141.8.182:6839/64670 <== client.7512 10.141.8.180:0/1018433 2 ==== osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220) v4 ==== 151+0+89 (3460833343 0 2600845095) 0xf3bcec0 con 0xee856a0
  -6> 2014-08-13 17:52:56.614864 7f8a798c7700  5 -- op tracker -- , seq: 300, time: 2014-08-13 17:52:56.614789, event: header_read, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
  -5> 2014-08-13 17:52:56.614874 7f8a798c7700  5 -- op tracker -- , seq: 300, time: 2014-08-13 17:52:56.614792, event: throttled, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
  -4> 2014-08-13 17:52:56.614884 7f8a798c7700  5 -- op tracker -- , seq: 300, time: 2014-08-13 17:52:56.614835, event: all_read, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
  -3> 2014-08-13 17:52:56.614891 7f8a798c7700  5 -- op tracker -- , seq: 300, time: 0.000000, event: dispatched, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
  -2> 2014-08-13 17:52:56.614972 7f8a92f9a700  5 -- op tracker -- , seq: 300, time: 2014-08-13 17:52:56.614958, event: reached_pg, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
  -1> 2014-08-13 17:52:56.614993 7f8a92f9a700  5 -- op tracker -- , seq: 300, time: 2014-08-13 17:52:56.614986, event: started, op: osd_op(client.7512.0:2  [pgls start_epoch 220] 3.0 ack+read+known_if_redirected e220)
   0> 2014-08-13 17:52:56.617087 7f8a92f9a700 -1 os/GenericObjectMap.cc: In function 'int GenericObjectMap::list_objects(const coll_t&, ghobject_t, int, std::vector<ghobject_t>*, ghobject_t*)' thread 7f8a92f9a700 time 2014-08-13 17:52:56.615073
os/GenericObjectMap.cc: 1118: FAILED assert(start <= header.oid)

 ceph version 0.83 (78ff1f0a5dfd3c5850805b4021738564c36c92b8)
 1: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x474) [0x98f774]
 2: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x8c5b54]
 3: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x1c9) [0x862de9]
 4: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0xea5) [0x7f67f5]
 5: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1f3) [0x8177b3]
 6: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x5d5) [0x7b8045]
 7: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x47d) [0x62bf8d]
 8: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x35c) [0x62c56c]
 9: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8cd) [0xa776fd]
 10: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa79980]
 11: (()+0x7df3) [0x7f8aac71fdf3]
 12: (clone()+0x6d) [0x7f8aab1963dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 ceph version 0.83 (78ff1f0a5dfd3c5850805b4021738564c36c92b8)
 1: /usr/bin/ceph-osd() [0x99b466]
 2: (()+0xf130) [0x7f8aac727130]
 3: (gsignal()+0x39) [0x7f8aab0d5989]
 4: (abort()+0x148) [0x7f8aab0d7098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f8aab9e89d5]
 6: (()+0x5e946) [0x7f8aab9e6946]
 7: (()+0x5e973) [0x7f8aab9e6973]
 8: (()+0x5eb9f) [0x7f8aab9e6b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa8805f]
 10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x474) [0x98f774]
 11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x8c5b54]
 12: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x1c9) [0x862de9]
 13: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0xea5) [0x7f67f5]
 14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1f3) [0x8177b3]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x5d5) [0x7b8045]
 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x47d) [0x62bf8d]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x35c) [0x62c56c]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8cd) [0xa776fd]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa79980]
 20: (()+0x7df3) [0x7f8aac71fdf3]
 21: (clone()+0x6d) [0x7f8aab1963dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

--- begin dump of recent events ---
   0> 2014-08-13 17:52:56.714214 7f8a92f9a700 -1 *** Caught signal (Aborted) **
 in thread 7f8a92f9a700

 ceph version 0.83 (78ff1f0a5dfd3c5850805b4021738564c36c92b8)
 1: /usr/bin/ceph-osd() [0x99b466]
 2: (()+0xf130) [0x7f8aac727130]
 3: (gsignal()+0x39) [0x7f8aab0d5989]
 4: (abort()+0x148) [0x7f8aab0d7098]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7f8aab9e89d5]
 6: (()+0x5e946) [0x7f8aab9e6946]
 7: (()+0x5e973) [0x7f8aab9e6973]
 8: (()+0x5eb9f) [0x7f8aab9e6b9f]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x1ef) [0xa8805f]
 10: (GenericObjectMap::list_objects(coll_t const&, ghobject_t, int, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x474) [0x98f774]
 11: (KeyValueStore::collection_list_partial(coll_t, ghobject_t, int, int, snapid_t, std::vector<ghobject_t, std::allocator<ghobject_t> >*, ghobject_t*)+0x274) [0x8c5b54]
 12: (PGBackend::objects_list_partial(hobject_t const&, int, int, snapid_t, std::vector<hobject_t, std::allocator<hobject_t> >*, hobject_t*)+0x1c9) [0x862de9]
 13: (ReplicatedPG::do_pg_op(std::tr1::shared_ptr<OpRequest>)+0xea5) [0x7f67f5]
 14: (ReplicatedPG::do_op(std::tr1::shared_ptr<OpRequest>)+0x1f3) [0x8177b3]
 15: (ReplicatedPG::do_request(std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x5d5) [0x7b8045]
 16: (OSD::dequeue_op(boost::intrusive_ptr<PG>, std::tr1::shared_ptr<OpRequest>, ThreadPool::TPHandle&)+0x47d) [0x62bf8d]
 17: (OSD::ShardedOpWQ::_process(unsigned int, ceph::heartbeat_handle_d*)+0x35c) [0x62c56c]
 18: (ShardedThreadPool::shardedthreadpool_worker(unsigned int)+0x8cd) [0xa776fd]
 19: (ShardedThreadPool::WorkThreadSharded::entry()+0x10) [0xa79980]
 20: (()+0x7df3) [0x7f8aac71fdf3]
 21: (clone()+0x6d) [0x7f8aab1963dd]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

I guess this has something to do with using the dev KeyValueStore?


Thanks!

Kenneth

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com







--
Best Regards,

Wheat






----- End message from Haomai Wang <haomaiw...@gmail.com>
-----

--

Met vriendelijke groeten,
Kenneth Waegeman




--
Best Regards,

Wheat





----- End message from Haomai Wang <haomaiw...@gmail.com>
-----

--

Met vriendelijke groeten,
Kenneth Waegeman




--
Best Regards,

Wheat
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




----- End message from Sage Weil <sw...@redhat.com> -----


--

Met vriendelijke groeten,


Kenneth Waegeman




--
Best Regards,

Wheat






----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman





--
Best Regards,

Wheat





----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman




--
Best Regards,

Wheat





--
Best Regards,

Wheat




----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman






--
Best Regards,

Wheat



----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman





--

Best Regards,

Wheat



----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman



----- End message from Kenneth Waegeman <kenneth.waege...@ugent.be> -----


--

Met vriendelijke groeten,
Kenneth Waegeman




--
Best Regards,

Wheat


----- End message from Haomai Wang <haomaiw...@gmail.com> -----

--

Met vriendelijke groeten,
Kenneth Waegeman


_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
