Hi all, 

Apologies for necromancing an old thread, but I was wondering if anyone had any 
more thoughts on this. We're running v10.2.9 now and still have 3 PGs in our 
cache pool exhibiting this behavior after scrubs, deep-scrubs, and repair 
attempts. Some more information is below.
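
For reference, these are the commands I've been re-running against one of the 
affected PGs (36.14f0, whose list-inconsistent-obj output is in [1] below); if 
I'm missing an obvious step here, please let me know:

# ceph pg scrub 36.14f0
# ceph pg deep-scrub 36.14f0
# ceph pg repair 36.14f0
# rados list-inconsistent-obj 36.14f0 | jq .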

Thanks much,
Lincoln


[1]

# rados list-inconsistent-obj 36.14f0 | jq
{
  "epoch": 820795,
  "inconsistents": [
    {
      "object": {
        "name": "1002378e2a6.00000001",
        "nspace": "",
        "locator": "",
        "snap": "head",
        "version": 2251698
      },
      "errors": [],
      "union_shard_errors": [
        "size_mismatch_oi"
      ],
      "selected_object_info": 
"36:0f29a1d4:::1002378e2a6.00000001:head(737930'2208087 
client.36346283.1:5757188 dirty s 4136960 uv 2251698 alloc_hint [0 0])",
      "shards": [
        {
          "osd": 173,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 242,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        },
        {
          "osd": 295,
          "errors": [
            "size_mismatch_oi"
          ],
          "size": 0
        }
      ]
    }
  ]
}
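
For what it's worth, this is roughly how I've been locating the object on disk 
to compare against the shard sizes above (osd.173 here; this assumes the 
default FileStore data path, so adjust for your layout):

# find /var/lib/ceph/osd/ceph-173/current/36.14f0_head \
    -name '1002378e2a6.00000001__head_*' -exec ls -l {} \;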

[2]

2017-08-08 13:26:23.243245 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
36.2c85 shard 212 missing 36:a13626c6:::1002378e9a9.00000001:head
2017-08-08 13:26:23.243250 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
36.2c85 shard 295: soid 36:a13626c6:::1002378e9a9.00000001:head size 0 != size 
4173824 from auth oi 36:a13626c6:::1002378e9a9.00000001:head(737930'2123468 
client.36346283.1:5782375 dirty s 4173824 uv 2164627 alloc_hint [0 0])
2017-08-08 13:26:23.243253 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
36.2c85 shard 353 missing 36:a13626c6:::1002378e9a9.00000001:head
2017-08-08 13:26:23.243255 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
36.2c85 soid 36:a13626c6:::1002378e9a9.00000001:head: failed to pick suitable 
auth object
2017-08-08 13:26:23.243362 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
scrub 36.2c85 36:a13626c6:::1002378e9a9.00000001:head on disk size (0) does not 
match object info size (4173824) adjusted for ondisk to (4173824)
2017-08-08 13:26:34.310237 7fafac78a700 -1 log_channel(cluster) log [ERR] : 
36.2c85 scrub 4 errors
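
Given the earlier suspicion about cache eviction, would it be reasonable to try 
manually flushing/evicting the affected objects from the cache pool, something 
along these lines (pool name is a placeholder; the object name is taken from 
the scrub errors above)?

# rados -p <cache-pool> cache-flush 1002378e9a9.00000001
# rados -p <cache-pool> cache-evict 1002378e9a9.00000001

Or is that likely to make things worse while the object is flagged inconsistent?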

> On May 15, 2017, at 5:28 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> 
> 
> 
> On Mon, May 15, 2017 at 3:19 PM Lincoln Bryant <linco...@uchicago.edu> wrote:
> Hi Greg,
> 
> Curiously, some of these scrub errors went away on their own. The example pg 
> in the original post is now active+clean, and nothing interesting in the logs:
> 
> # zgrep "36.277b" ceph-osd.244*gz
> ceph-osd.244.log-20170510.gz:2017-05-09 06:56:40.739855 7f0184623700  0 
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170510.gz:2017-05-09 06:58:01.872484 7f0186e28700  0 
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170511.gz:2017-05-10 20:40:47.536974 7f0186e28700  0 
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170511.gz:2017-05-10 20:41:38.399614 7f0184623700  0 
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170514.gz:2017-05-13 20:49:47.063789 7f0186e28700  0 
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170514.gz:2017-05-13 20:50:42.085718 7f0186e28700  0 
> log_channel(cluster) log [INF] : 36.277b scrub ok
> ceph-osd.244.log-20170515.gz:2017-05-15 00:10:39.417578 7f0184623700  0 
> log_channel(cluster) log [INF] : 36.277b scrub starts
> ceph-osd.244.log-20170515.gz:2017-05-15 00:11:26.189777 7f0186e28700  0 
> log_channel(cluster) log [INF] : 36.277b scrub ok
> 
> (No matches in the logs for osd 175 and osd 297 — perhaps those logs have 
> already been rotated away?)
> 
> Other PGs still exhibit this behavior though:
> 
> # rados list-inconsistent-obj 36.2953 | jq .
> {
>   "epoch": 737940,
>   "inconsistents": [
>     {
>       "object": {
>         "name": "1002378da6c.00000001",
>         "nspace": "",
>         "locator": "",
>         "snap": "head",
>         "version": 2213621
>       },
>       "errors": [],
>       "union_shard_errors": [
>         "size_mismatch_oi"
>       ],
>       "selected_object_info": 
> "36:ca95a23b:::1002378da6c.00000001:head(737930'2177823 
> client.36346283.1:5635626 dirty s 4067328 uv 2213621)",
>       "shards": [
>         {
>           "osd": 113,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         },
>         {
>           "osd": 123,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         },
>         {
>           "osd": 173,
>           "errors": [
>             "size_mismatch_oi"
>           ],
>           "size": 0
>         }
>       ]
>     }
>   ]
> }
> 
> Perhaps new data being written to this pg cleared things up?
> 
> Hmm, somebody else did report the same thing (and the symptoms disappearing) 
> recently as well. I wonder if we broke the synchronization around eviction 
> and scrubbing within cache pools. Sage, you've done work on cache pools 
> recently; any thoughts?
> -Greg
>  
> 
> The only other data point that I can add is that, due to some tweaking of the 
> cache tier size before this happened, the cache tier was reporting near full 
> / full in `ceph -s` for a brief amount of time (maybe <1hr ?).
> 
> Thanks for looking into this.
> 
> --Lincoln
> 
> > On May 15, 2017, at 4:50 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> >
> > On Mon, May 1, 2017 at 9:28 AM, Lincoln Bryant <linco...@uchicago.edu> 
> > wrote:
> >> Hi all,
> >>
> >> I’ve run across a peculiar issue on 10.2.7. On my 3x replicated cache tier 
> >> pool, routine scrubbing suddenly found a bunch of PGs with 
> >> size_mismatch_oi errors. From the “rados list-inconsistent-obj” tool [1], 
> >> I see that all three OSDs are reporting size 0 for a particular object. 
> >> I’ve checked this object on disk, and it is indeed 0 bytes:
> >>        -rw-r--r--  1 root root    0 Apr 29 06:12 
> >> 100235614fe.00000005__head_6E9A677B__24
> >>
> >> I’ve tried re-issuing a scrub, which informs me that the object info size 
> >> (2994176) doesn’t match the on-disk size (0) (see [2]). I’ve tried a 
> >> repair operation as well to no avail.
> >>
> >> For what it’s worth, this particular cluster is currently migrating 
> >> several disks from one CRUSH root to another, and there is a nightly cache 
> >> flush/eviction script that is lowering the cache_target_*_ratios before 
> >> raising them again in the morning.
> >>
> >> This issue is currently affecting ~10 PGs in my cache pool. Any ideas how 
> >> to proceed here?
> >
> > Did anything come from this? It's tickling my brain (especially with
> > the cache pool) but I'm not seeing anything relevant when I search my
> > email.
> >
> >>
> >> Thanks,
> >> Lincoln
> >>
> >> [1]:
> >> {
> >>  "epoch": 721312,
> >>  "inconsistents": [
> >>    {
> >>      "object": {
> >>        "name": "100235614fe.00000005",
> >>        "nspace": "",
> >>        "locator": "",
> >>        "snap": "head",
> >>        "version": 2233551
> >>      },
> >>      "errors": [],
> >>      "union_shard_errors": [
> >>        "size_mismatch_oi"
> >>      ],
> >>      "selected_object_info": 
> >> "36:dee65976:::100235614fe.00000005:head(737928'2182216 
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)",
> >>      "shards": [
> >>        {
> >>          "osd": 175,
> >>          "errors": [
> >>            "size_mismatch_oi"
> >>          ],
> >>          "size": 0
> >>        },
> >>        {
> >>          "osd": 244,
> >>          "errors": [
> >>            "size_mismatch_oi"
> >>          ],
> >>          "size": 0
> >>        },
> >>        {
> >>          "osd": 297,
> >>          "errors": [
> >>            "size_mismatch_oi"
> >>          ],
> >>          "size": 0
> >>        }
> >>      ]
> >>    }
> >>  ]
> >> }
> >>
> >> [2]:
> >> 2017-05-01 10:50:13.812992 7f0184623700  0 log_channel(cluster) log [INF] 
> >> : 36.277b scrub starts
> >> 2017-05-01 10:51:02.495229 7f0186e28700 -1 log_channel(cluster) log [ERR] 
> >> : 36.277b shard 175: soid 36:dee65976:::100235614fe.00000005:head size 0 
> >> != size 2994176 from auth oi 
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216 
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495234 7f0186e28700 -1 log_channel(cluster) log [ERR] 
> >> : 36.277b shard 244: soid 36:dee65976:::100235614fe.00000005:head size 0 
> >> != size 2994176 from auth oi 
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216 
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495326 7f0186e28700 -1 log_channel(cluster) log [ERR] 
> >> : 36.277b shard 297: soid 36:dee65976:::100235614fe.00000005:head size 0 
> >> != size 2994176 from auth oi 
> >> 36:dee65976:::100235614fe.00000005:head(737928'2182216 
> >> client.36346283.1:5754260 dirty s 2994176 uv 2233551)
> >> 2017-05-01 10:51:02.495328 7f0186e28700 -1 log_channel(cluster) log [ERR] 
> >> : 36.277b soid 36:dee65976:::100235614fe.00000005:head: failed to pick 
> >> suitable auth object
> >> 2017-05-01 10:51:02.495450 7f0186e28700 -1 log_channel(cluster) log [ERR] 
> >> : scrub 36.277b 36:dee65976:::100235614fe.00000005:head on disk size (0) 
> >> does not match object info size (2994176) adjusted for ondisk to (2994176)
> >> 2017-05-01 10:51:20.223733 7f0184623700 -1 log_channel(cluster) log [ERR] 
> >> : 36.277b scrub 4 errors
> >>
> 

_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
