cluster state:
     osdmap e3240: 24 osds: 12 up, 12 in
      pgmap v46050: 1088 pgs, 2 pools, 20322 GB data, 5080 kobjects
            22224 GB used, 61841 GB / 84065 GB avail
            4745644/10405374 objects degraded (45.608%)
            3688079/10405374 objects misplaced (35.444%)
                   5 stale+active+clean
                  59 active+clean
                  74 active+undersized+degraded+remapped+backfilling
                  53 active+remapped
                 577 active+undersized+degraded
                  37 down+peering
                 283 active+undersized+degraded+remapped+wait_backfill
  recovery io 844 MB/s, 211 objects/s
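
If anyone wants to dig into the 37 down+peering PGs above, the usual next
step is to ask the cluster which PGs are stuck and why (the pgid below is
a placeholder):

  # list unhealthy PGs and any blocked requests
  ceph health detail
  # dump PGs stuck inactive or stale
  ceph pg dump_stuck inactive
  ceph pg dump_stuck stale
  # ask one affected PG why it is down+peering
  ceph pg 1.f4 query
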
On Wed, Jul 15, 2015 at 2:29 PM, Mallikarjun Biradar
<[email protected]> wrote:
> Sorry for the delay in replying; I was doing some retries on this issue
> so I could summarise.
>
>
> Tony,
> Setup details:
> Two storage boxes (each with 12 drives), each connected to 4 hosts.
> Each host owns 3 disks from its storage box, for a total of 24 OSDs.
> The failure domain is at the chassis level.
>
> OSD tree:
> -1   164.2   root default
> -7    82.08    chassis chassis1
> -2    20.52      host host-1
>  0     6.84        osd.0    up  1
>  1     6.84        osd.1    up  1
>  2     6.84        osd.2    up  1
> -3    20.52      host host-2
>  3     6.84        osd.3    up  1
>  4     6.84        osd.4    up  1
>  5     6.84        osd.5    up  1
> -4    20.52      host host-3
>  6     6.84        osd.6    up  1
>  7     6.84        osd.7    up  1
>  8     6.84        osd.8    up  1
> -5    20.52      host host-4
>  9     6.84        osd.9    up  1
> 10     6.84        osd.10   up  1
> 11     6.84        osd.11   up  1
> -8    82.08    chassis chassis2
> -6    20.52      host host-5
> 12     6.84        osd.12   up  1
> 13     6.84        osd.13   up  1
> 14     6.84        osd.14   up  1
> -9    20.52      host host-6
> 15     6.84        osd.15   up  1
> 16     6.84        osd.16   up  1
> 17     6.84        osd.17   up  1
> -10   20.52      host host-7
> 18     6.84        osd.18   up  1
> 19     6.84        osd.19   up  1
> 20     6.84        osd.20   up  1
> -11   20.52      host host-8
> 21     6.84        osd.21   up  1
> 22     6.84        osd.22   up  1
> 23     6.84        osd.23   up  1
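>
> For reference, a chassis-level failure domain corresponds to a CRUSH rule
> along these lines (a sketch of a decompiled crushmap; the rule name and
> ruleset number are placeholders):
>
>   rule replicated_chassis {
>       ruleset 1
>       type replicated
>       min_size 1
>       max_size 10
>       step take default
>       step chooseleaf firstn 0 type chassis
>       step emit
>   }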
>
> The cluster had ~30TB of data, and client IO was in progress on the cluster.
> After chassis1 underwent a power cycle:
> 1> all OSDs under chassis2 were intact, up & running;
> 2> all OSDs under chassis1 were down, as expected.
>
> But client IO was paused until all the hosts/OSDs under chassis1 came
> back up. This issue was observed in two out of five attempts.
>
> Size is 2 & min_size is 1.
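>
> With size 2 and min_size 1, PGs that still have one surviving copy should
> keep serving IO, so it is worth confirming the values the pools are
> actually using (<pool> below is a placeholder):
>
>   ceph osd pool get <pool> size
>   ceph osd pool get <pool> min_size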
>
> -Thanks,
> Mallikarjun
>
>
> On Thu, Jul 9, 2015 at 8:01 PM, Tony Harris <[email protected]> wrote:
>> Sounds to me like you've put yourself at too much risk - *if* I'm reading
>> your message right about your configuration, you have multiple hosts
>> accessing OSDs that are stored on a single shared box. If that single
>> shared box (a single point of failure for multiple nodes) goes down, it's
>> possible for multiple replicas to disappear at the same time, which could
>> halt the operation of your cluster if the primaries and the replicas are
>> both on OSDs within that single shared storage system...
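>>
>> You can check where the copies of a given object or PG actually land
>> (pool, object, and pgid here are placeholders):
>>
>>   ceph osd map <pool> <object>
>>   ceph pg map <pgid>
>>
>> If both OSDs reported for a PG sit behind the same enclosure, that PG has
>> no surviving copy when the enclosure dies.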
>>
>> On Thu, Jul 9, 2015 at 5:42 AM, Mallikarjun Biradar
>> <[email protected]> wrote:
>>>
>>> Hi all,
>>>
>>> Setup details:
>>> Two storage enclosures, each connected to 4 OSD nodes (shared storage).
>>> The failure domain is the chassis (enclosure) level. Replication count
>>> is 2. Each host is allotted 4 drives.
>>>
>>> I have active client IO running on the cluster (random write profile
>>> with 4M block size & 64 queue depth).
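>>>
>>> For reference, that profile corresponds roughly to the following fio job
>>> against an RBD image (pool and image names are placeholders):
>>>
>>>   [randwrite-test]
>>>   ioengine=rbd
>>>   clientname=admin
>>>   pool=<pool>
>>>   rbdname=<image>
>>>   rw=randwrite
>>>   bs=4M
>>>   iodepth=64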
>>>
>>> One of the enclosures had a power loss, so all OSDs on the hosts
>>> connected to that enclosure went down, as expected.
>>>
>>> But client IO got paused. After some time the enclosure & the hosts
>>> connected to it came back up, and all OSDs on those hosts came up.
>>>
>>> Until then, the cluster was not serving IO. Once all hosts & OSDs
>>> pertaining to that enclosure came up, client IO resumed.
>>>
>>>
>>> Can anybody help me understand why the cluster does not serve IO during
>>> an enclosure failure? Or is it a bug?
>>>
>>> -Thanks & regards,
>>> Mallikarjun Biradar