If you manually create your journal partition, you need to set the
correct Ceph partition type GUID so that the system and Ceph can identify
the partition as a Ceph journal and so that udev can apply the correct
ownership and permissions at boot.

I used something like this to create the partition:

sudo sgdisk --new=1:0G:15G \
     --typecode=1:45B0969E-9B03-4F30-B4C6-B4B80CEFF106 \
     --partition-guid=1:$(uuidgen -r) --mbrtogpt -- /dev/sda
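
To double-check the result (illustrative only, adjust the device and
partition number to your own layout), the type GUID should then show up
like this:

sudo sgdisk --info=1 /dev/sda    # "Partition GUID code" should be the Ceph journal GUID
ls -l /dev/disk/by-partuuid/     # the journal should appear under its unique partition GUID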

45B0969E-9B03-4F30-B4C6-B4B80CEFF106 is the Ceph journal partition type
GUID. More info on GPT partition type GUIDs is available on Wikipedia [1].
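
For reference, here is a minimal sketch of the kind of udev rule that keys
on that type GUID to fix ownership at boot; the actual rule shipped by the
Ceph packages (95-ceph-osd.rules) also triggers ceph-disk and may differ
between releases:

# match partitions carrying the Ceph journal type GUID and hand them to ceph:ceph
ACTION=="add", SUBSYSTEM=="block", ENV{DEVTYPE}=="partition", \
  ENV{ID_PART_ENTRY_TYPE}=="45b0969e-9b03-4f30-b4c6-b4b80ceff106", \
  OWNER:="ceph", GROUP:="ceph", MODE:="660"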

I think the issue you had with the label was linked to some bugs in the
disk initialization process; this was discussed a few weeks back on this
mailing list.
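
For what it's worth, letting ceph-disk carve the journal partition itself
sets the type GUID for you. Assuming a whole-disk OSD with an SSD journal,
something like the following should do it (device names are placeholders):

sudo ceph-disk prepare /dev/sdX /dev/sdY   # /dev/sdX = OSD data disk, /dev/sdY = SSD hosting the journal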

[1] https://en.wikipedia.org/wiki/GUID_Partition_Table

On Tue, Mar 8, 2016 at 5:21 PM, Yoann Moulin <yoann.mou...@epfl.ch> wrote:

> Hello Adrien,
>
> > I think I faced the same issue setting up my own cluster. If it is the
> > same, it's one of the many issues people encounter(ed) during disk
> > initialization. Could you please give the output of:
> >  - ll /dev/disk/by-partuuid/
> >  - ll /var/lib/ceph/osd/ceph-*
>
> Unfortunately, I have already reinstalled my test cluster, but I got some
> information that might explain this issue.
>
> I was creating the journal partitions before running the Ansible playbook.
> Firstly, ownership and permissions were not persistent at boot (I had to
> add udev rules). And I strongly suspect a side effect of not letting
> ceph-disk create the journal partitions.
>
> Yoann
>
> > On Thu, Mar 3, 2016 at 3:42 PM, Yoann Moulin <yoann.mou...@epfl.ch> wrote:
> >
> >     Hello,
> >
> >     I'm (almost) a new user of Ceph (a couple of months). At my university,
> >     we started to do some tests with Ceph a couple of months ago.
> >
> >     We have 2 clusters. Each cluster has 100 OSDs on 10 servers:
> >
> >     Each server has this setup:
> >
> >     CPU : 2 x Intel(R) Xeon(R) CPU E5-2680 v3 @ 2.50GHz
> >     Memory : 128GB
> >     OS Storage : 2 x SSD 240GB Intel S3500 DC (raid 1)
> >     Journal Storage : 2 x SSD 400GB Intel S3300 DC (no Raid)
> >     OSD Disk : 10 x HGST ultrastar-7k6000 6TB
> >     Network : 1 x 10Gb/s
> >     OS : Ubuntu 14.04
> >     Ceph version : infernalis 9.2.0
> >
> >     One cluster gives access to some users through an S3 gateway (the
> >     service is still in beta). We call this cluster "ceph-beta".
> >
> >     The other cluster is for our internal needs, to learn more about Ceph.
> >     We call this cluster "ceph-test". (Those servers will be integrated
> >     into the ceph-beta cluster when we need more space.)
> >
> >     We have deployed both clusters with the ceph-ansible playbook [1].
> >
> >     Journals are raw partitions on SSDs (400GB Intel S3300 DC) with no
> >     RAID, 5 journal partitions on each SSD.
> >
> >     OSD disks are formatted with XFS.
> >
> >     1. https://github.com/ceph/ceph-ansible
> >
> >     We have an issue: some OSDs go down and don't start. It seems to be
> >     related to the fsid of the journal partition:
> >
> >     > -1> 2016-03-03 14:09:05.422515 7f31118d0940 -1 journal FileJournal::open: ondisk fsid 00000000-0000-0000-0000-000000000000 doesn't match expected eeadbce2-f096-4156-ba56-dfc634e59106, invalid (someone else's?) journal
> >
> >     In attachment, the full logs of one of the dead OSDs.
> >
> >     We had this issue with 2 OSDs on the ceph-beta cluster; it was fixed
> >     by removing, zapping and re-adding them.
> >
> >     Now we have the same issue on the ceph-test cluster, but on 18 OSDs.
> >
> >     Here are the current stats of this cluster:
> >
> >     > root@icadmin004:~# ceph -s
> >     >     cluster 4fb4773c-0873-44ad-a65f-269f01bfcff8
> >     >      health HEALTH_WARN
> >     >             1024 pgs incomplete
> >     >             1024 pgs stuck inactive
> >     >             1024 pgs stuck unclean
> >     >      monmap e1: 3 mons at {iccluster003=10.90.37.4:6789/0,iccluster014=10.90.37.15:6789/0,iccluster022=10.90.37.23:6789/0}
> >     >             election epoch 62, quorum 0,1,2 iccluster003,iccluster014,iccluster022
> >     >      osdmap e242: 100 osds: 82 up, 82 in
> >     >             flags sortbitwise
> >     >       pgmap v469212: 2304 pgs, 10 pools, 2206 bytes data, 181 objects
> >     >             4812 MB used, 447 TB / 447 TB avail
> >     >                 1280 active+clean
> >     >                 1024 creating+incomplete
> >
> >     We installed this cluster at the beginning of February. We have not
> >     used that cluster at all, except at the beginning to troubleshoot an
> >     issue with ceph-ansible. We did not push any data nor create any
> >     pools. What could explain this behaviour?
> >
> >     Thanks for your help
> >
> >     Best regards,
> >
> >     --
> >     Yoann Moulin
> >     EPFL IC-IT
> >
>
> --
> Yoann Moulin
> EPFL IC-IT
>
_______________________________________________
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
