Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Дробышевский , Владимир
2016-08-10 9:30 GMT+05:00 Александр Пивушков :

> I want to use Ceph only as  user data storage.
> user program writes data to a folder that is mounted on a Ceph.
> Virtual machine images are not stored on the Сeph.
> Fiber channel and 40GbE  are used only for the rapid transmission of
> information between the cluster Ceph and the virtual machine on oVirt.
> In this scheme I can use oVirt?
>
What kind of OSes do you use on the guests? If it's Linux, then it's better
to use RBD directly (in the case of per-VM dedicated storage) or CephFS (if the
storage has to be shared) right inside the guest; if it's Windows, export
CephFS with Samba. And I believe you definitely want to use 40GbE or
Infiniband instead of FC.
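
As an illustration of what that can look like (the monitor address, secret file
and share name below are placeholders, not values from this thread): a CephFS
kernel mount on the Linux guest, re-exported with Samba for the Windows clients:

  # on the Linux guest: mount CephFS with the kernel client
  sudo mount -t ceph 192.168.0.1:6789:/ /mnt/cephfs \
      -o name=admin,secretfile=/etc/ceph/admin.secret

  # minimal /etc/samba/smb.conf share on the same host for the Windows clients:
  #   [cephfs-share]
  #       path = /mnt/cephfs
  #       browseable = yes
  #       writeable = yes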

--
> Александр Пивушков.
> +7(961)5097964
> Wednesday, 10 August 2016, 01:26 +03:00 from Christian Balzer:
>
>
>
> Hello,
>
> On Tue, 9 Aug 2016 14:15:59 -0400 Jeff Bailey wrote:
>
> >
> >
> > On 8/9/2016 10:43 AM, Wido den Hollander wrote:
> > >
> > >> On 9 August 2016 at 16:36, Александр Пивушков wrote:
> > >>
> > >>
> > >> > >> Hello dear community!
> > >> I'm new to the Ceph and not long ago took up the theme of
> building clusters.
> > >> Therefore it is very important to your opinion.
> > >> It is necessary to create a cluster from 1.2 PB storage and very
> rapid access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB
> NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of all
> satisfies, but with increase of volume of storage, the price of such
> cluster very strongly grows and therefore there was an idea to use Ceph.
> > >
> > > You may want to tell us more about your environment, use case and
> in
> > > particular what your clients are.
> > > Large amounts of data usually means graphical or scientific data,
> > > extremely high speed (IOPS) requirements usually mean database
> > > like applications, which one is it, or is it a mix?
> > 
> >  This is a mixed project, with combined graphics and science.
> Project linking the vast array of image data. Like google MAP :)
> >  Previously, customers were Windows that are connected to powerful
> servers directly.
> >  Ceph cluster connected on FC to servers of the virtual machines is
> now planned. Virtualization - oVirt.
> > >>>
> > >>> Stop right there. oVirt, despite being from RedHat, doesn't really
> support
> > >>> Ceph directly all that well, last I checked.
> > >>> That is probably where you get the idea/need for FC from.
> > >>>
> > >>> If anyhow possible, you do NOT want another layer and protocol
> conversion
> > >>> between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
> > >>>
> > >>> So if you're free to choose your Virtualization platform, use
> KVM/qemu at
> > >>> the bottom and something like Openstack, OpenNebula, ganeti,
> Pacemake with
> > >>> KVM resource agents on top.
> > >> oh, that's too bad ...
> > >> I do not understand something...
> > >>
> > >> oVirt built on kvm
> > >> https://www.ovirt.org/documentation/introduction/about-ovirt/
> > >>
> > >> Ceph, such as support kvm
> > >> http://docs.ceph.com/docs/master/architecture/
> > >>
> > >
> > > KVM is just the hypervisor. oVirt is a tool which controls KVM and it
> doesn't have support for Ceph. That means that it can't pass down the
> proper arguments to KVM to talk to RBD.
> > >
> > >> What could be the overhead costs and how big they are?
> > >>
> > >>
> > >> I do not understand why oVirt bad, and the qemu in the Openstack,
> it's good.
> > >> What can be read?
> > >>
> > >
> > > Like I said above. oVirt and OpenStack both control KVM. OpenStack
> also knows how to 'configure' KVM to use RBD, oVirt doesn't.
> > >
> > > Maybe Proxmox is a better solution in your case.
> > >
> >
> > oVirt can use ceph through cinder. It doesn't currently provide all the
> > functionality of
> > other oVirt storage domains but it does work.
> >
> Well, I saw this before I gave my answer:
> http://www.ovirt.org/develop/release-management/features/sto
> rage/cinder-integration/
>
> And based on that I would say oVirt is not a good fit for Ceph at this
> time.
>
> Even less so than OpenNebula, which currently needs an additional shared
> network FS or hacks to allow live migration with RBD.
>
> Christian
>
> > > Wido
> > >
> > >>
> > >> --
> > >> Александр Пивушков___
> > >> ceph-users mailing list
> > >> ceph-users@lists.ceph.com
> 
> > >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> 
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > >
> > ___
> > ceph-users mailing list
> > 

Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Christian Balzer

Hello Vladimir,

On Wed, 10 Aug 2016 09:12:39 +0500 Дробышевский, Владимир wrote:

> Christian,
> 
>   I have to say that OpenNebula 5 doesn't need any additional hacks (ok,
> just two lines of code to support rescheduling in case of original node
> failure, and even that patch is scheduled to be added in 5.2 after my question
> a couple of weeks ago; but it isn't about 'live' migration) or an additional shared
> fs to support live migration with Ceph. It works like a charm. I have an
> installation I just finished with OpenNebula 5.0.1 + Ceph with a dual root
> (HDD + SSD journal and pure SSD), so this is first-hand information.
>
Thanks for bringing that to my attention.
I was of course referring to 4.14 and wasn't aware that 5 had been
released, thanks to the way their repository (apt sources lines) works.
 
>   In ONE 5 it's possible to use Ceph as a system datastore, which
> eliminates any problems with live migration. For a file-based datastore
> (which is recommended for custom kernels and configs) it's possible
> to use CephFS (but that doesn't belong to ONE, of course).
> 
Right.

>   P.S. If somebody needs to reschedule (restore) VM from a host in the
> ERROR state then here is the patch for the ceph driver:
> https://github.com/OpenNebula/one/pull/106
> This patch doesn't require rebuilding ONE from source; it can be applied
> to a working system (since ONE drivers are mostly a set of shell scripts).
> 
Thanks, I'll give that a spin next week.

Christian

> Best regards,
> Vladimir
> 
> 
> Best regards,
> Дробышевский Владимир
> "АйТи Город" company
> +7 343 192
> 
> Hardware and software:
> IBM, Microsoft, Eset
> Turnkey project delivery
> IT services outsourcing
> 
> 2016-08-10 3:26 GMT+05:00 Christian Balzer :
> 
> >
> > Hello,
> >
> > On Tue, 9 Aug 2016 14:15:59 -0400 Jeff Bailey wrote:
> >
> > >
> > >
> > > On 8/9/2016 10:43 AM, Wido den Hollander wrote:
> > > >
> > > >> On 9 August 2016 at 16:36, Александр Пивушков wrote:
> > > >>
> > > >>
> > > >>  > >> Hello dear community!
> > > >> I'm new to the Ceph and not long ago took up the theme of
> > building clusters.
> > > >> Therefore it is very important to your opinion.
> > > >> It is necessary to create a cluster from 1.2 PB storage and very
> > rapid access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB
> > NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of all
> > satisfies, but with increase of volume of storage, the price of such
> > cluster very strongly grows and therefore there was an idea to use Ceph.
> > > >
> > > > You may want to tell us more about your environment, use case and
> > in
> > > > particular what your clients are.
> > > > Large amounts of data usually means graphical or scientific data,
> > > > extremely high speed (IOPS) requirements usually mean database
> > > > like applications, which one is it, or is it a mix?
> > > 
> > >  This is a mixed project, with combined graphics and science.
> > Project linking the vast array of image data. Like google MAP :)
> > >  Previously, customers were Windows that are connected to powerful
> > servers directly.
> > >  Ceph cluster connected on FC to servers of the virtual machines is
> > now planned. Virtualization - oVirt.
> > > >>>
> > > >>> Stop right there. oVirt, despite being from RedHat, doesn't really
> > support
> > > >>> Ceph directly all that well, last I checked.
> > > >>> That is probably where you get the idea/need for FC from.
> > > >>>
> > > >>> If anyhow possible, you do NOT want another layer and protocol
> > conversion
> > > >>> between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
> > > >>>
> > > >>> So if you're free to choose your Virtualization platform, use
> > KVM/qemu at
> > > >>> the bottom and something like Openstack, OpenNebula, ganeti,
> > Pacemake with
> > > >>> KVM resource agents on top.
> > > >> oh, that's too bad ...
> > > >> I do not understand something...
> > > >>
> > > >> oVirt built on kvm
> > > >> https://www.ovirt.org/documentation/introduction/about-ovirt/
> > > >>
> > > >> Ceph, such as support kvm
> > > >> http://docs.ceph.com/docs/master/architecture/
> > > >>
> > > >
> > > > KVM is just the hypervisor. oVirt is a tool which controls KVM and it
> > doesn't have support for Ceph. That means that it can't pass down the
> > proper arguments to KVM to talk to RBD.
> > > >
> > > >> What could be the overhead costs and how big they are?
> > > >>
> > > >>
> > > >> I do not understand why oVirt bad, and the qemu in the Openstack,
> > it's good.
> > > >> What can be read?
> > > >>
> > > >
> > > > Like I said above. oVirt and OpenStack both control KVM. OpenStack
> > also knows how to  'configure' KVM to use RBD, oVirt doesn't.
> > > >
> > > > Maybe Proxmox is a better solution in your case.
> > > >
> > >
> > > oVirt can use ceph through cinder.  It doesn't currently provide all the
> > > functionality of
> 

Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-09 Thread Georgios Dimitrakakis


Hello!

Brad,

is that possible from the default logging or verbose one is needed??

I've managed to get the UUID of the deleted volume from OpenStack, but I
don't really know how to get the offsets and OSD maps, since "rbd info"
doesn't provide any information for that volume.


Is it possible to somehow get them from leveldb?

Best,

G.


On Tue, Aug 9, 2016 at 7:39 AM, George Mihaiescu
 wrote:
Look in the cinder db, the volumes table to find the Uuid of the 
deleted volume.


You could also look through the logs at the time of the delete and I
suspect you should
be able to see how the rbd image was prefixed/named at the time of
the delete.

HTH,
Brad



If you go through your OSDs and look for the directories for PG
index 20, you might find some fragments from the deleted volume, but
it's a long shot...
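
If it comes to searching the OSDs, a rough and purely illustrative way to look
for leftover object files on filestore OSDs would be something like the
following; the name patterns assume a format 1 image (rb.0.<id>.*) or a
format 2 image (rbd_data.<id>.*, with '_' escaped in the on-disk filename):

  # run on each OSD host; filestore keeps one file per (4MB) RADOS object
  find /var/lib/ceph/osd/ceph-*/current -type f \
      \( -name 'rb.0.*' -o -name 'rbd*data.*' \) | head -n 20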


On Aug 8, 2016, at 4:39 PM, Georgios Dimitrakakis 
 wrote:


Dear David (and all),

the data are considered very critical therefore all this attempt to 
recover them.


Although the cluster hasn't been fully stopped, all user actions
have been. I mean services are running but users are not able to
read/write/delete.


The deleted image was the exact same size as the example (500GB),
but it wasn't the only one deleted today. Our user was trying to do a
"massive" cleanup by deleting 11 volumes and unfortunately one of
them was very important.


Let's assume that I "dd" all the drives what further actions should 
I do to recover the files? Could you please elaborate a bit more on 
the phrase "If you've never deleted any other rbd images and assuming 
you can recover data with names, you may be able to find the rbd 
objects"??


Do you mean that if I know the file names I can go through and 
check for them? How?
Do I have to know *all* file names or by searching for a few of 
them I can find all data that exist?


Thanks a lot for taking the time to answer my questions!

All the best,

G.

I don't think there's a way of getting the prefix from the cluster at
this point.

If the deleted image was a similar size to the example you've given,
you will likely have had objects on every OSD. If this data is
absolutely critical you need to stop your cluster immediately or make
copies of all the drives with something like dd. If you've never
deleted any other rbd images and assuming you can recover data with
names, you may be able to find the rbd objects.

On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:


Hi,

On 08.08.2016 10:50, Georgios Dimitrakakis wrote:


Hi,


On 08.08.2016 09:58, Georgios Dimitrakakis wrote:

Dear all,

I would like your help with an emergency issue but first
let me describe our environment.

Our environment consists of 2OSD nodes with 10x 2TB HDDs
each and 3MON nodes (2 of them are the OSD nodes as well)
all with ceph version 0.80.9
(b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)

This environment provides RBD volumes to an OpenStack
Icehouse installation.

Although not a state of the art environment is working
well and within our expectations.

The issue now is that one of our users accidentally
deleted one of the volumes without keeping its data first!

Is there any way (since the data are considered critical
and very important) to recover them from CEPH?


Short answer: no

Long answer: no, but

Consider the way Ceph stores data... each RBD is striped into chunks
(RADOS objects with 4MB size by default); the chunks are distributed
among the OSDs with the configured number of replicas (probably two
in your case since you use 2 OSD hosts). RBD uses thin provisioning,
so chunks are allocated upon first write access.
If an RBD is deleted, all of its chunks are deleted on the
corresponding OSDs. If you want to recover a deleted RBD, you need to
recover all individual chunks. Whether this is possible depends on
your filesystem and whether the space of a former chunk is already
assigned to other RADOS objects. The RADOS object names are composed
of the RBD name and the offset position of the chunk, so if an
undelete mechanism exists for the OSDs' filesystem, you have to be
able to recover files by their filenames, otherwise you might end up
mixing the content of various deleted RBDs. Due to the thin
provisioning there might be some chunks missing (e.g. never allocated
before).

Given the fact that
- you probably use XFS on the OSDs since it is the preferred
  filesystem for OSDs (there is RDR-XFS, but I've never had to use it)
- you would need to stop the complete ceph cluster (recovery tools do
  not work on mounted filesystems)
- your cluster has been in use after the RBD was deleted and thus
  parts of its former space might already have been overwritten
  (replication might help you here, since there are two OSDs to try)
- XFS undelete does not work well on fragmented files (and OSDs tend
  to introduce fragmentation...)

the answer is no, since it might not be feasible and the chances of
success are way too low.

If you want to spend time 

Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Александр Пивушков

I want to use Ceph only as  user data storage.
user program writes data to a folder that is mounted on a Ceph.
Virtual machine images are not stored on the Сeph.
Fiber channel and 40GbE  are used only for the rapid transmission of 
information between the cluster Ceph and the virtual machine on oVirt.
In this scheme I can use oVirt?
--
Александр Пивушков.
+7(961)5097964
Wednesday, 10 August 2016, 01:26 +03:00 from Christian Balzer <ch...@gol.com>:

>
>Hello,
>
>On Tue, 9 Aug 2016 14:15:59 -0400 Jeff Bailey wrote:
>
>> 
>> 
>> On 8/9/2016 10:43 AM, Wido den Hollander wrote:
>> >
>> >> On 9 August 2016 at 16:36, Александр Пивушков < p...@mail.ru > wrote:
>> >>
>> >>
>> >>  > >> Hello dear community!
>> >> I'm new to the Ceph and not long ago took up the theme of building 
>> >> clusters.
>> >> Therefore it is very important to your opinion.
>> >> It is necessary to create a cluster from 1.2 PB storage and very 
>> >> rapid access to data. Earlier disks of "Intel® SSD DC P3608 Series 
>> >> 1.6TB NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of 
>> >> all satisfies, but with increase of volume of storage, the price of 
>> >> such cluster very strongly grows and therefore there was an idea to 
>> >> use Ceph.
>> >
>> > You may want to tell us more about your environment, use case and in
>> > particular what your clients are.
>> > Large amounts of data usually means graphical or scientific data,
>> > extremely high speed (IOPS) requirements usually mean database
>> > like applications, which one is it, or is it a mix?
>> 
>>  This is a mixed project, with combined graphics and science. Project 
>>  linking the vast array of image data. Like google MAP :)
>>  Previously, customers were Windows that are connected to powerful 
>>  servers directly.
>>  Ceph cluster connected on FC to servers of the virtual machines is now 
>>  planned. Virtualization - oVirt.
>> >>>
>> >>> Stop right there. oVirt, despite being from RedHat, doesn't really 
>> >>> support
>> >>> Ceph directly all that well, last I checked.
>> >>> That is probably where you get the idea/need for FC from.
>> >>>
>> >>> If anyhow possible, you do NOT want another layer and protocol conversion
>> >>> between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
>> >>>
>> >>> So if you're free to choose your Virtualization platform, use KVM/qemu at
>> >>> the bottom and something like Openstack, OpenNebula, ganeti, Pacemake 
>> >>> with
>> >>> KVM resource agents on top.
>> >> oh, that's too bad ...
>> >> I do not understand something...
>> >>
>> >> oVirt built on kvm
>> >>  https://www.ovirt.org/documentation/introduction/about-ovirt/
>> >>
>> >> Ceph, such as support kvm
>> >>  http://docs.ceph.com/docs/master/architecture/
>> >>
>> >
>> > KVM is just the hypervisor. oVirt is a tool which controls KVM and it 
>> > doesn't have support for Ceph. That means that it can't pass down the 
>> > proper arguments to KVM to talk to RBD.
>> >
>> >> What could be the overhead costs and how big they are?
>> >>
>> >>
>> >> I do not understand why oVirt bad, and the qemu in the Openstack, it's 
>> >> good.
>> >> What can be read?
>> >>
>> >
>> > Like I said above. oVirt and OpenStack both control KVM. OpenStack also 
>> > knows how to  'configure' KVM to use RBD, oVirt doesn't.
>> >
>> > Maybe Proxmox is a better solution in your case.
>> >
>> 
>> oVirt can use ceph through cinder.  It doesn't currently provide all the 
>> functionality of
>> other oVirt storage domains but it does work.
>>
>Well, I saw this before I gave my answer: 
>http://www.ovirt.org/develop/release-management/features/storage/cinder-integration/
>
>And based on that I would say oVirt is not a good fit for Ceph at this
>time.
>
>Even less so than OpenNebula, which currently needs an additional shared
>network FS or hacks to allow live migration with RBD.
>
>Christian
>
>> > Wido
>> >
>> >>
>> >> --
>> >> Александр Пивушков___
>> >> ceph-users mailing list
>> >>  ceph-users@lists.ceph.com
>> >>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> > ___
>> > ceph-users mailing list
>> >  ceph-users@lists.ceph.com
>> >  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>>  ceph-users@lists.ceph.com
>>  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>-- 
>Christian Balzer        Network/Systems Engineer 
>ch...@gol.com Global OnLine Japan/Rakuten Communications
>http://www.gol.com/
>___
>ceph-users mailing list
>ceph-users@lists.ceph.com
>http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Дробышевский , Владимир
Christian,

  I have to say that OpenNebula 5 doesn't need any additional hacks (ok,
just two lines of code to support rescheduling in case of original node
failure, and even that patch is scheduled to be added in 5.2 after my question
a couple of weeks ago; but it isn't about 'live' migration) or an additional shared
fs to support live migration with Ceph. It works like a charm. I have an
installation I just finished with OpenNebula 5.0.1 + Ceph with a dual root
(HDD + SSD journal and pure SSD), so this is first-hand information.

  In ONE 5 it's possible to use Ceph as a system datastore, which
eliminates any problems with live migration. For a file-based datastore
(which is recommended for custom kernels and configs) it's possible
to use CephFS (but that doesn't belong to ONE, of course).
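
For anyone curious, the datastore definition for such a setup is only a
handful of attributes; a minimal sketch (pool, bridge hosts and monitor names
below are made up) could be:

  # ceph-system.ds -- OpenNebula datastore template (illustrative values)
  #   NAME        = ceph_system
  #   TYPE        = SYSTEM_DS
  #   TM_MAD      = ceph
  #   POOL_NAME   = one
  #   BRIDGE_LIST = "kvm1 kvm2"
  #   CEPH_HOST   = "mon1 mon2 mon3"
  #   CEPH_USER   = libvirt
  #   CEPH_SECRET = "uuid-of-the-libvirt-secret"
  onedatastore create ceph-system.ds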

  P.S. If somebody needs to reschedule (restore) VM from a host in the
ERROR state then here is the patch for the ceph driver:
https://github.com/OpenNebula/one/pull/106
This patch doesn't require rebuilding ONE from source; it can be applied
to a working system (since ONE drivers are mostly a set of shell scripts).

Best regards,
Vladimir


Best regards,
Дробышевский Владимир
"АйТи Город" company
+7 343 192

Hardware and software:
IBM, Microsoft, Eset
Turnkey project delivery
IT services outsourcing

2016-08-10 3:26 GMT+05:00 Christian Balzer :

>
> Hello,
>
> On Tue, 9 Aug 2016 14:15:59 -0400 Jeff Bailey wrote:
>
> >
> >
> > On 8/9/2016 10:43 AM, Wido den Hollander wrote:
> > >
> > >> On 9 August 2016 at 16:36, Александр Пивушков wrote:
> > >>
> > >>
> > >>  > >> Hello dear community!
> > >> I'm new to the Ceph and not long ago took up the theme of
> building clusters.
> > >> Therefore it is very important to your opinion.
> > >> It is necessary to create a cluster from 1.2 PB storage and very
> rapid access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB
> NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of all
> satisfies, but with increase of volume of storage, the price of such
> cluster very strongly grows and therefore there was an idea to use Ceph.
> > >
> > > You may want to tell us more about your environment, use case and
> in
> > > particular what your clients are.
> > > Large amounts of data usually means graphical or scientific data,
> > > extremely high speed (IOPS) requirements usually mean database
> > > like applications, which one is it, or is it a mix?
> > 
> >  This is a mixed project, with combined graphics and science.
> Project linking the vast array of image data. Like google MAP :)
> >  Previously, customers were Windows that are connected to powerful
> servers directly.
> >  Ceph cluster connected on FC to servers of the virtual machines is
> now planned. Virtualization - oVirt.
> > >>>
> > >>> Stop right there. oVirt, despite being from RedHat, doesn't really
> support
> > >>> Ceph directly all that well, last I checked.
> > >>> That is probably where you get the idea/need for FC from.
> > >>>
> > >>> If anyhow possible, you do NOT want another layer and protocol
> conversion
> > >>> between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
> > >>>
> > >>> So if you're free to choose your Virtualization platform, use
> KVM/qemu at
> > >>> the bottom and something like Openstack, OpenNebula, ganeti,
> Pacemake with
> > >>> KVM resource agents on top.
> > >> oh, that's too bad ...
> > >> I do not understand something...
> > >>
> > >> oVirt built on kvm
> > >> https://www.ovirt.org/documentation/introduction/about-ovirt/
> > >>
> > >> Ceph, such as support kvm
> > >> http://docs.ceph.com/docs/master/architecture/
> > >>
> > >
> > > KVM is just the hypervisor. oVirt is a tool which controls KVM and it
> doesn't have support for Ceph. That means that it can't pass down the
> proper arguments to KVM to talk to RBD.
> > >
> > >> What could be the overhead costs and how big they are?
> > >>
> > >>
> > >> I do not understand why oVirt bad, and the qemu in the Openstack,
> it's good.
> > >> What can be read?
> > >>
> > >
> > > Like I said above. oVirt and OpenStack both control KVM. OpenStack
> also knows how to  'configure' KVM to use RBD, oVirt doesn't.
> > >
> > > Maybe Proxmox is a better solution in your case.
> > >
> >
> > oVirt can use ceph through cinder.  It doesn't currently provide all the
> > functionality of
> > other oVirt storage domains but it does work.
> >
> Well, I saw this before I gave my answer:
> http://www.ovirt.org/develop/release-management/features/
> storage/cinder-integration/
>
> And based on that I would say oVirt is not a good fit for Ceph at this
> time.
>
> Even less so than OpenNebula, which currently needs an additional shared
> network FS or hacks to allow live migration with RBD.
>
> Christian
>
> > > Wido
> > >
> > >>
> > >> --
> > >> Александр Пивушков___
> > >> ceph-users mailing list
> > >> 

Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-09 Thread Brad Hubbard
On Tue, Aug 9, 2016 at 7:39 AM, George Mihaiescu  wrote:
> Look in the cinder db, the volumes table to find the Uuid of the deleted 
> volume.

You could also look through the logs at the time of the delete and I
suspect you should
be able to see how the rbd image was prefixed/named at the time of the delete.

HTH,
Brad

>
> If you go through your OSDs and look for the directories for PG index 20,
> you might find some fragments from the deleted volume, but it's a long shot...
>
>> On Aug 8, 2016, at 4:39 PM, Georgios Dimitrakakis  
>> wrote:
>>
>> Dear David (and all),
>>
>> the data are considered very critical therefore all this attempt to recover 
>> them.
>>
>> Although the cluster hasn't been fully stopped all users actions have. I 
>> mean services are running but users are not able to read/write/delete.
>>
>> The deleted image was the exact same size of the example (500GB) but it 
>> wasn't the only one deleted today. Our user was trying to do a "massive" 
>> cleanup by deleting 11 volumes and unfortunately one of them was very 
>> important.
>>
>> Let's assume that I "dd" all the drives what further actions should I do to 
>> recover the files? Could you please elaborate a bit more on the phrase "If 
>> you've never deleted any other rbd images and assuming you can recover data 
>> with names, you may be able to find the rbd objects"??
>>
>> Do you mean that if I know the file names I can go through and check for 
>> them? How?
>> Do I have to know *all* file names or by searching for a few of them I can 
>> find all data that exist?
>>
>> Thanks a lot for taking the time to answer my questions!
>>
>> All the best,
>>
>> G.
>>
>>> I don't think there's a way of getting the prefix from the cluster at
>>> this point.
>>>
>>> If the deleted image was a similar size to the example you've given,
>>> you will likely have had objects on every OSD. If this data is
>>> absolutely critical you need to stop your cluster immediately or make
>>> copies of all the drives with something like dd. If you've never
>>> deleted any other rbd images and assuming you can recover data with
>>> names, you may be able to find the rbd objects.
>>>
>>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>>>
>> Hi,
>>
>> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>>
 Hi,

> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>
> Dear all,
>
> I would like your help with an emergency issue but first
> let me describe our environment.
>
> Our environment consists of 2OSD nodes with 10x 2TB HDDs
> each and 3MON nodes (2 of them are the OSD nodes as well)
> all with ceph version 0.80.9
> (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)
>
> This environment provides RBD volumes to an OpenStack
> Icehouse installation.
>
> Although not a state of the art environment is working
> well and within our expectations.
>
> The issue now is that one of our users accidentally
> deleted one of the volumes without keeping its data first!
>
> Is there any way (since the data are considered critical
> and very important) to recover them from CEPH?

 Short answer: no

 Long answer: no, but

 Consider the way Ceph stores data... each RBD is striped into chunks
 (RADOS objects with 4MB size by default); the chunks are distributed
 among the OSDs with the configured number of replicas (probably two
 in your case since you use 2 OSD hosts). RBD uses thin provisioning,
 so chunks are allocated upon first write access.
 If an RBD is deleted, all of its chunks are deleted on the
 corresponding OSDs. If you want to recover a deleted RBD, you need to
 recover all individual chunks. Whether this is possible depends on
 your filesystem and whether the space of a former chunk is already
 assigned to other RADOS objects. The RADOS object names are composed
 of the RBD name and the offset position of the chunk, so if an
 undelete mechanism exists for the OSDs' filesystem, you have to be
 able to recover files by their filenames, otherwise you might end up
 mixing the content of various deleted RBDs. Due to the thin
 provisioning there might be some chunks missing (e.g. never allocated
 before).

 Given the fact that
 - you probably use XFS on the OSDs since it is the preferred
   filesystem for OSDs (there is RDR-XFS, but I've never had to use it)
 - you would need to stop the complete ceph cluster (recovery tools do
   not work on mounted filesystems)

Re: [ceph-users] installing multi osd and monitor of ceph in single VM

2016-08-09 Thread Brad Hubbard
On Wed, Aug 10, 2016 at 12:26 AM, agung Laksono  wrote:
>
> Hi Ceph users,
>
> I am new to Ceph. I've succeeded in installing Ceph in 4 VMs using the Quick
> Installation guide in the Ceph documentation.
>
> I've also managed to compile Ceph from source code and build and install it
> in a single VM.
>
> What I want to do next is run Ceph with multiple nodes in a cluster,
> but only inside a single machine. I need this because I will
> learn the Ceph code and will modify some of it, recompile and
> redeploy it on the node/VM. For my study I also have to be able to run/kill
> a particular node.
>
> Does somebody know how to configure a single VM to run multiple OSDs and
> monitors of Ceph?
>
> Advice and comments are very much appreciated. Thanks.

Hi,

Did you see this?

http://docs.ceph.com/docs/hammer/dev/quick_guide/#running-a-development-deployment
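
The short version of that page, for a tree you have already built (the counts
are just an example):

  cd ceph/src
  # 3 mons, 3 osds, 1 mds on localhost; -n = new cluster, -d = debug, -x = cephx
  MON=3 OSD=3 MDS=1 ./vstart.sh -n -d -x
  ./ceph -s     # talk to the local development cluster
  ./stop.sh     # tear it down again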

Also take a look at the AIO (all in one) options in ceph-ansible.

HTH,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Christian Balzer

Hello,

On Tue, 9 Aug 2016 14:15:59 -0400 Jeff Bailey wrote:

> 
> 
> On 8/9/2016 10:43 AM, Wido den Hollander wrote:
> >
> >> On 9 August 2016 at 16:36, Александр Пивушков wrote:
> >>
> >>
> >>  > >> Hello dear community!
> >> I'm new to the Ceph and not long ago took up the theme of building 
> >> clusters.
> >> Therefore it is very important to your opinion.
> >> It is necessary to create a cluster from 1.2 PB storage and very rapid 
> >> access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB 
> >> NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of all 
> >> satisfies, but with increase of volume of storage, the price of such 
> >> cluster very strongly grows and therefore there was an idea to use 
> >> Ceph.
> >
> > You may want to tell us more about your environment, use case and in
> > particular what your clients are.
> > Large amounts of data usually means graphical or scientific data,
> > extremely high speed (IOPS) requirements usually mean database
> > like applications, which one is it, or is it a mix?
> 
>  This is a mixed project, with combined graphics and science. Project 
>  linking the vast array of image data. Like google MAP :)
>  Previously, customers were Windows that are connected to powerful 
>  servers directly.
>  Ceph cluster connected on FC to servers of the virtual machines is now 
>  planned. Virtualization - oVirt.
> >>>
> >>> Stop right there. oVirt, despite being from RedHat, doesn't really support
> >>> Ceph directly all that well, last I checked.
> >>> That is probably where you get the idea/need for FC from.
> >>>
> >>> If anyhow possible, you do NOT want another layer and protocol conversion
> >>> between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
> >>>
> >>> So if you're free to choose your Virtualization platform, use KVM/qemu at
> >>> the bottom and something like Openstack, OpenNebula, ganeti, Pacemake with
> >>> KVM resource agents on top.
> >> oh, that's too bad ...
> >> I do not understand something...
> >>
> >> oVirt built on kvm
> >> https://www.ovirt.org/documentation/introduction/about-ovirt/
> >>
> >> Ceph, such as support kvm
> >> http://docs.ceph.com/docs/master/architecture/
> >>
> >
> > KVM is just the hypervisor. oVirt is a tool which controls KVM and it 
> > doesn't have support for Ceph. That means that it can't pass down the 
> > proper arguments to KVM to talk to RBD.
> >
> >> What could be the overhead costs and how big they are?
> >>
> >>
> >> I do not understand why oVirt bad, and the qemu in the Openstack, it's 
> >> good.
> >> What can be read?
> >>
> >
> > Like I said above. oVirt and OpenStack both control KVM. OpenStack also 
> > knows how to  'configure' KVM to use RBD, oVirt doesn't.
> >
> > Maybe Proxmox is a better solution in your case.
> >
> 
> oVirt can use ceph through cinder.  It doesn't currently provide all the 
> functionality of
> other oVirt storage domains but it does work.
>
Well, I saw this before I gave my answer: 
http://www.ovirt.org/develop/release-management/features/storage/cinder-integration/

And based on that I would say oVirt is not a good fit for Ceph at this
time.

Even less so than OpenNebula, which currently needs an additional shared
network FS or hacks to allow live migration with RBD.

Christian

> > Wido
> >
> >>
> >> --
> >> Александр Пивушков___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Jeff Bailey



On 8/9/2016 10:43 AM, Wido den Hollander wrote:



On 9 August 2016 at 16:36, Александр Пивушков wrote:


 > >> Hello dear community!

I'm new to the Ceph and not long ago took up the theme of building clusters.
Therefore it is very important to your opinion.
It is necessary to create a cluster from 1.2 PB storage and very rapid access to data. 
Earlier disks of "Intel® SSD DC P3608 Series 1.6TB NVMe PCIe 3.0 x4 Solid State 
Drive" were used, their speed of all satisfies, but with increase of volume of 
storage, the price of such cluster very strongly grows and therefore there was an idea to 
use Ceph.


You may want to tell us more about your environment, use case and in
particular what your clients are.
Large amounts of data usually means graphical or scientific data,
extremely high speed (IOPS) requirements usually mean database
like applications, which one is it, or is it a mix?


This is a mixed project, with combined graphics and science. Project linking 
the vast array of image data. Like google MAP :)
Previously, customers were Windows that are connected to powerful servers 
directly.
Ceph cluster connected on FC to servers of the virtual machines is now planned. 
Virtualization - oVirt.


Stop right there. oVirt, despite being from RedHat, doesn't really support
Ceph directly all that well, last I checked.
That is probably where you get the idea/need for FC from.

If anyhow possible, you do NOT want another layer and protocol conversion
between Ceph and the VMs, like a FC gateway or iSCSI or NFS.

So if you're free to choose your Virtualization platform, use KVM/qemu at
the bottom and something like Openstack, OpenNebula, ganeti, Pacemake with
KVM resource agents on top.

oh, that's too bad ...
I do not understand something...

oVirt built on kvm
https://www.ovirt.org/documentation/introduction/about-ovirt/

Ceph, such as support kvm
http://docs.ceph.com/docs/master/architecture/



KVM is just the hypervisor. oVirt is a tool which controls KVM and it doesn't 
have support for Ceph. That means that it can't pass down the proper arguments 
to KVM to talk to RBD.


What could be the overhead costs and how big they are?


I do not understand why oVirt bad, and the qemu in the Openstack, it's good.
What can be read?



Like I said above. oVirt and OpenStack both control KVM. OpenStack also knows 
how to  'configure' KVM to use RBD, oVirt doesn't.

Maybe Proxmox is a better solution in your case.



oVirt can use ceph through cinder.  It doesn't currently provide all the 
functionality of

other oVirt storage domains but it does work.
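
For reference, the Cinder side of that is the standard RBD backend; a rough
sketch of the cinder.conf backend section (pool, user and secret UUID are
placeholders):

  [ceph]
  volume_driver   = cinder.volume.drivers.rbd.RBDDriver
  rbd_pool        = volumes
  rbd_ceph_conf   = /etc/ceph/ceph.conf
  rbd_user        = cinder
  rbd_secret_uuid = <libvirt-secret-uuid>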


Wido



--
Александр Пивушков___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices for extending a ceph cluster with minimal client impact data movement

2016-08-09 Thread Martin Palma
Hi Wido,

thanks for your advice.

Best,
Martin

On Tue, Aug 9, 2016 at 10:05 AM, Wido den Hollander  wrote:
>
>> On 8 August 2016 at 16:45, Martin Palma wrote:
>>
>>
>> Hi all,
>>
>> we are in the process of expanding our cluster and I would like to
>> know if there are some best practices in doing so.
>>
>> Our current cluster is composted as follows:
>> - 195 OSDs (14 Storage Nodes)
>> - 3 Monitors
>> - Total capacity 620 TB
>> - Used 360 TB
>>
>> We will expand the cluster by other 14 Storage Nodes and 2 Monitor
>> nodes. So we are doubling the current deployment:
>>
>> - OSDs: 195 --> 390
>> - Total capacity: 620 TB --> 1250 TB
>>
>> During the expansion we would like to minimize the client impact and
>> data movement. Any suggestions?
>>
>
> There are a few routes you can take, I would suggest that you:
>
> - set max backfills to 1
> - set max recovery to 1
>
> Now, add the OSDs to the cluster, but NOT to the CRUSHMap.
>
> When all the OSDs are online, inject a new CRUSHMap where you add the new 
> OSDs to the data placement.
>
> $ ceph osd setcrushmap -i 
>
> The OSDs will now start to migrate data, but this is throttled by the max 
> recovery and backfill settings.
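>
> For example, something along these lines (the exact values and the map
> filename are illustrative):
>
>   ceph tell osd.\* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
>   ceph osd getcrushmap -o crushmap.bin
>   crushtool -d crushmap.bin -o crushmap.txt   # decompile, add the new hosts/OSDs
>   crushtool -c crushmap.txt -o crushmap.new
>   ceph osd setcrushmap -i crushmap.new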
>
> Wido
>
>> Best,
>> Martin
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Александр Пивушков
 Tuesday, 9 August 2016, 17:43 +03:00 from Wido den Hollander:
>
>
>> On 9 August 2016 at 16:36, Александр Пивушков < p...@mail.ru > wrote:
>> 
>> 
>>  > >> Hello dear community!
>> >> >> I'm new to the Ceph and not long ago took up the theme of building 
>> >> >> clusters.
>> >> >> Therefore it is very important to your opinion.
>> >> >> It is necessary to create a cluster from 1.2 PB storage and very rapid 
>> >> >> access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB 
>> >> >> NVMe PCIe 3.0 x4 Solid State Drive" were used, their speed of all 
>> >> >> satisfies, but with increase of volume of storage, the price of such 
>> >> >> cluster very strongly grows and therefore there was an idea to use 
>> >> >> Ceph.
>> >> >
>> >> >You may want to tell us more about your environment, use case and in
>> >> >particular what your clients are.
>> >> >Large amounts of data usually means graphical or scientific data,
>> >> >extremely high speed (IOPS) requirements usually mean database
>> >> >like applications, which one is it, or is it a mix? 
>> >>
>> >>This is a mixed project, with combined graphics and science. Project 
>> >>linking the vast array of image data. Like google MAP :)
>> >> Previously, customers were Windows that are connected to powerful servers 
>> >> directly. 
>> >> Ceph cluster connected on FC to servers of the virtual machines is now 
>> >> planned. Virtualization - oVirt. 
>> >
>> >Stop right there. oVirt, despite being from RedHat, doesn't really support
>> >Ceph directly all that well, last I checked.
>> >That is probably where you get the idea/need for FC from.
>> >
>> >If anyhow possible, you do NOT want another layer and protocol conversion
>> >between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
>> >
>> >So if you're free to choose your Virtualization platform, use KVM/qemu at
>> >the bottom and something like Openstack, OpenNebula, ganeti, Pacemake with
>> >KVM resource agents on top.
>> oh, that's too bad ...
>> I do not understand something...
>> 
>> oVirt built on kvm
>>  https://www.ovirt.org/documentation/introduction/about-ovirt/
>> 
>> Ceph, such as support kvm
>>  http://docs.ceph.com/docs/master/architecture/
>> 
>
>KVM is just the hypervisor. oVirt is a tool which controls KVM and it doesn't 
>have support for Ceph. That means that it can't pass down the proper arguments 
>to KVM to talk to RBD.
>
>> What could be the overhead costs and how big they are?
>> 
>> 
>> I do not understand why oVirt bad, and the qemu in the Openstack, it's good.
>> What can be read?
>> 
>
>Like I said above. oVirt and OpenStack both control KVM. OpenStack also knows 
>how to  'configure' KVM to use RBD, oVirt doesn't.
>
>Maybe Proxmox is a better solution in your case.
No, before I implement OpenStack I would like to understand why! :)
Why does OpenStack have support for Ceph?
For example, in oVirt I am running CentOS 7, and I have a Ceph directory mounted in it.
The Ceph cluster is installed on 13 physical servers.
Do I understand correctly that the overhead is only because of the virtualized
network card?


-- 
Александр Пивушков
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Wido den Hollander

> On 9 August 2016 at 16:36, Александр Пивушков wrote:
> 
> 
>  > >> Hello dear community!
> >> >> I'm new to the Ceph and not long ago took up the theme of building 
> >> >> clusters.
> >> >> Therefore it is very important to your opinion.
> >> >> It is necessary to create a cluster from 1.2 PB storage and very rapid 
> >> >> access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB NVMe 
> >> >> PCIe 3.0 x4 Solid State Drive" were used, their speed of all satisfies, 
> >> >> but with increase of volume of storage, the price of such cluster very 
> >> >> strongly grows and therefore there was an idea to use Ceph.
> >> >
> >> >You may want to tell us more about your environment, use case and in
> >> >particular what your clients are.
> >> >Large amounts of data usually means graphical or scientific data,
> >> >extremely high speed (IOPS) requirements usually mean database
> >> >like applications, which one is it, or is it a mix? 
> >>
> >>This is a mixed project, with combined graphics and science. Project 
> >>linking the vast array of image data. Like google MAP :)
> >> Previously, customers were Windows that are connected to powerful servers 
> >> directly. 
> >> Ceph cluster connected on FC to servers of the virtual machines is now 
> >> planned. Virtualization - oVirt. 
> >
> >Stop right there. oVirt, despite being from RedHat, doesn't really support
> >Ceph directly all that well, last I checked.
> >That is probably where you get the idea/need for FC from.
> >
> >If anyhow possible, you do NOT want another layer and protocol conversion
> >between Ceph and the VMs, like a FC gateway or iSCSI or NFS.
> >
> >So if you're free to choose your Virtualization platform, use KVM/qemu at
> >the bottom and something like Openstack, OpenNebula, ganeti, Pacemake with
> >KVM resource agents on top.
> oh, that's too bad ...
> I do not understand something...
> 
> oVirt built on kvm
> https://www.ovirt.org/documentation/introduction/about-ovirt/  
> 
> Ceph, such as support kvm
> http://docs.ceph.com/docs/master/architecture/  
> 

KVM is just the hypervisor. oVirt is a tool which controls KVM and it doesn't 
have support for Ceph. That means that it can't pass down the proper arguments 
to KVM to talk to RBD.
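
(For comparison, a Ceph-aware manager ends up generating a disk argument for
QEMU roughly like the one below; the pool, image and user names are placeholders.)

  qemu-system-x86_64 -m 1024 \
      -drive format=raw,if=virtio,file=rbd:volumes/volume-0001:id=cinder:conf=/etc/ceph/ceph.conf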

> What could be the overhead costs and how big they are?
> 
> 
> I do not understand why oVirt bad, and the qemu in the Openstack, it's good.
> What can be read?
> 

Like I said above. oVirt and OpenStack both control KVM. OpenStack also knows 
how to  'configure' KVM to use RBD, oVirt doesn't.

Maybe Proxmox is a better solution in your case.

Wido

> 
> -- 
> Александр Пивушков___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] installing multi osd and monitor of ceph in single VM

2016-08-09 Thread agung Laksono
Hi Ceph users,

I am new to Ceph. I've succeeded in installing Ceph in 4 VMs using the Quick
Installation guide in the Ceph documentation.

I've also managed to compile Ceph from source code and build and install it
in a single VM.

What I want to do next is run Ceph with multiple nodes in a cluster,
but only inside a single machine. I need this because I will
learn the Ceph code and will modify some of it, recompile and
redeploy it on the node/VM. For my study I also have to be able to run/kill
a particular node.

Does somebody know how to configure a single VM to run multiple OSDs and
monitors of Ceph?

Advice and comments are very much appreciated. Thanks.

-- 
Cheers,

Agung Laksono
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] how to debug pg inconsistent state - no ioerrors seen

2016-08-09 Thread Gregory Farnum
On Tue, Aug 9, 2016 at 2:00 AM, Kenneth Waegeman
 wrote:
> Hi,
>
> I did a diff on the directories of all three OSDs, no difference... So I
> don't know what's wrong.

omap (as implied by the omap_digest complaint) is stored in the OSD
leveldb, not in the data directories, so you wouldn't expect to see
any differences from a raw diff. I think you can extract the omaps as
well by using the ceph-objectstore-tool or whatever it's called (I
haven't done it myself) and compare those. Should see if you get more
useful information out of the pg query first, though!
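
A sketch of what that could look like (untested here, with placeholder paths,
PG and object name; the OSD daemon has to be stopped while you run it):

  # on each OSD in the acting set of pg 6.2f4
  ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-3 \
      --journal-path /var/lib/ceph/osd/ceph-3/journal \
      --pgid 6.2f4 '<object-name>' list-omap
  # then diff the dumped keys/values between the three replicas
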
-Greg

>
> Only thing I see different is a scrub file in the TEMP folder (it is already
> another pg than last mail):
>
> -rw-r--r--1 ceph ceph 0 Aug  9 09:51
> scrub\u6.107__head_0107__fff8
>
> But it is empty..
>
> Thanks!
>
>
>
> On 09/08/16 04:33, Goncalo Borges wrote:
>>
>> Hi Kenneth...
>>
>> The previous default behavior of 'ceph pg repair' was to copy the pg
>> objects from the primary osd to the others. Not sure if it is still the case in
>> Jewel. For this reason, once we get this kind of error in a data pool, the
>> best practice is to compare the md5 checksums of the damaged object on all
>> osds involved in the inconsistent pg. Since we have a 3-replica cluster, we
>> should find a quorum of 2 good objects. If by chance the primary osd has the
>> wrong object, it should be deleted before running the repair.
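>>
>> Something along these lines on each OSD host of the acting set (PG id and
>> object name are placeholders) is usually enough for that comparison:
>>
>>   find /var/lib/ceph/osd/ceph-*/current/6.2f4_head -name '<object>*' \
>>       -exec md5sum {} \;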
>>
>> On a metadata pool, I am not sure exactly how to cross check since all
>> objects are size 0 and therefore, md5sum is meaningless. Maybe, one way
>> forward could be to check the contents of the pg directories (ex:
>> /var/lib/ceph/osd/ceph-0/current/5.161_head/) in all osds involved for the
>> pg and see if we spot something wrong?
>>
>> Cheers
>>
>> G.
>>
>>
>> On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:
>>>
>>> Hi all,
>>>
>>> Since last week, some pg's are going in the inconsistent state after a
>>> scrub error. Last week we had 4 pgs in that state, They were on different
>>> OSDS, but all of the metadata pool.
>>> I did a pg repair on them, and all were healthy again. But now again one
>>> pg is inconsistent.
>>>
>>> with health detail I see:
>>>
>>> pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
>>> 1 scrub errors
>>>
>>> And in the log of the primary:
>>>
>>> 2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 shard 5: soid 6:2f55791f:::606.:head omap_digest 0x3a105358
>>> != best guess omap_digest 0xc85c4361 from auth shard 1
>>> 2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
>>> 2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log [ERR]
>>> : 6.2f4 deep-scrub 1 errors
>>>
>>> I looked in dmesg but I couldn't see any IO errors on any of the OSDs in
>>> the acting set.  Last week it was another set. It is of course possible more
>>> than 1 OSD is failing, but how can we check this, since there is nothing
>>> more in the logs?
>>>
>>> Thanks !!
>>>
>>> K
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fast Ceph a Cluster with PB storage

2016-08-09 Thread Christian Balzer

Hello,

[re-added the list]

Also try to leave a line-break, paragraph between quoted and new text,
your mail looked like it was all written by me...

On Tue, 09 Aug 2016 11:00:27 +0300 Александр Пивушков wrote:

>  Thank you for your response!
> 
> 
> >Tuesday, 9 August 2016, 5:11 +03:00 from Christian Balzer:
> >
> >
> >Hello,
> >
> >On Mon, 08 Aug 2016 17:39:07 +0300 Александр Пивушков wrote:
> >
> >> 
> >> Hello dear community!
> >> I'm new to the Ceph and not long ago took up the theme of building 
> >> clusters.
> >> Therefore it is very important to your opinion.
> >> It is necessary to create a cluster from 1.2 PB storage and very rapid 
> >> access to data. Earlier disks of "Intel® SSD DC P3608 Series 1.6TB NVMe 
> >> PCIe 3.0 x4 Solid State Drive" were used, their speed of all satisfies, 
> >> but with increase of volume of storage, the price of such cluster very 
> >> strongly grows and therefore there was an idea to use Ceph.
> >
> >You may want to tell us more about your environment, use case and in
> >particular what your clients are.
> >Large amounts of data usually means graphical or scientific data,
> >extremely high speed (IOPS) requirements usually mean database
> >like applications, which one is it, or is it a mix? 
>
>This is a mixed project, with combined graphics and science. Project linking 
>the vast array of image data. Like google MAP :)
> Previously, customers were Windows that are connected to powerful servers 
> directly. 
> Ceph cluster connected on FC to servers of the virtual machines is now 
> planned. Virtualization - oVirt. 

Stop right there. oVirt, despite being from RedHat, doesn't really support
Ceph directly all that well, last I checked.
That is probably where you get the idea/need for FC from.

If anyhow possible, you do NOT want another layer and protocol conversion
between Ceph and the VMs, like a FC gateway or iSCSI or NFS.

So if you're free to choose your Virtualization platform, use KVM/qemu at
the bottom and something like OpenStack, OpenNebula, ganeti, Pacemaker with
KVM resource agents on top.

>Clients on 40 GB ethernet are connected to servers of virtualization.

Your VM clients (if using RBD instead of FC) and the end-users could use
the same network infrastructure.

>Clients on Windows.
> Customers use their software. It is written by them. About the base I do not 
> know, probably not. The processing results are stored in conventional files. 
> In total about 160 GB.

1 image file being 160GB?

> We need very quickly to process these images, so as not to cause 
> dissatisfaction among customers. :) Per minute.

Explain. 
Writing 160GB/minute is going to be a challenge on many levels.
Even with 40Gb/s networks this assumes no contention on the network OR the
storage backend...
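
Back of the envelope, if the requirement really is 160 GB per minute:

  echo '160 / 60'     | bc -l   # ~2.67 GB/s sustained at the client
  echo '160 / 60 * 8' | bc -l   # ~21.3 Gbit/s on the wire, before any overhead
  echo '160 / 60 * 3' | bc -l   # ~8 GB/s aggregate OSD writes with 3x replication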


> >
> >
> >For example, how were the above NVMes deployed and how did they serve data
> >to the clients?
> >The fiber channel bit in your HW list below makes me think you're using
> >VMware, FC and/or iSCSI right now. 
>
>Data is stored on the SSD disk 1.6TB NVMe, and processed and stored directly 
>on it. In one powerful server. Gave for this task. Used 40 GB ethernet. Server 
>- CentOS 7 

So you're going from a single server with all NVMe storage to a
distributed storage. 

You will be disappointed by the cost/performance in direct comparison.


> 
> >
> >
> >> There are following requirements:
> >> - The amount of data 160 GB should be read and written at speeds of SSD 
> >> P3608
> >Again, how are they serving data now?
> >The speeds (latency!) a local NVMe can reach is of course impossible with
> >a network attached SDS like Ceph. 
>
>It is sad. Wouldn't parallelizing across 13 servers help? And the FC?
>
Ceph does not do FC internally.
It only uses IP (so you can use IPoIB if you want).
Never mind that the problem is that the replication (x3) is causing the
largest part of the latency.

> >
> >160GB is tiny, are you sure about this number? 
>
>Yes, it's small, and it is exactly. But, it is the most sensitive data 
>processing time. Even in the background and can be a slower process more data. 
>Their treatment is not so nervous clients.

Still not getting it, but it seems more and more like 160GB/s.

> >
> >
> >> - There must be created a high-speed storage of the SSD drives 36 TB 
> >> volume with read / write speed tends to SSD P3608
> >How is that different to the point above? The data of this volume can be 
> >processed in the background, running in parallel with the processing of 160 
> >GB. The speed of processing is not so important. Previously, the entire 
> >amount was placed in a server on Ssd disk lesser performance. Therefore, I 
> >declare Ceph cluster ssd drives of the same of volume that can quickly read 
> >and write data.
> >
> >
> >> - Must be created store 1.2 PB with the access speed than the bigger, the 
> >> better ...
> >Ceph scales well.
> >> - Must have triple redundancy
> >Also not an issue, depending on how you define this. 
>

Re: [ceph-users] Recover Data from Deleted RBD Volume

2016-08-09 Thread David
On Mon, Aug 8, 2016 at 9:39 PM, Georgios Dimitrakakis 
wrote:

> Dear David (and all),
>
> the data are considered very critical therefore all this attempt to
> recover them.
>
> Although the cluster hasn't been fully stopped all users actions have. I
> mean services are running but users are not able to read/write/delete.
>
> The deleted image was the exact same size of the example (500GB) but it
> wasn't the only one deleted today. Our user was trying to do a "massive"
> cleanup by deleting 11 volumes and unfortunately one of them was very
> important.
>
> Let's assume that I "dd" all the drives what further actions should I do
> to recover the files? Could you please elaborate a bit more on the phrase
> "If you've never deleted any other rbd images and assuming you can recover
> data with names, you may be able to find the rbd objects"??
>

Sorry that last comment was a bit confusing, I was suggesting at this stage
you just need to concentrate on recovering everything you can and then try
and find the data you need.

The dd is to make a backup of the partition so you can work on it safely.
Ideally you would make a 2nd copy of the dd'd partition and work on that.
Then you need to find tools to attempt the recovery which is going to be
slow and painful and not guaranteed to be successful.
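
Purely as an illustration (device names and destinations are placeholders, and
the OSD in question must not be in use while you copy it):

  # raw image of the OSD data partition, plus a working copy to experiment on
  dd if=/dev/sdb1 of=/backup/osd-0.img bs=4M conv=noerror,sync
  cp /backup/osd-0.img /backup/osd-0.work.img
  # mount the working copy read-only for inspection
  mkdir -p /mnt/osd-0-copy
  mount -o ro,loop /backup/osd-0.work.img /mnt/osd-0-copy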



>
> Do you mean that if I know the file names I can go through and check for
> them? How?
> Do I have to know *all* file names or by searching for a few of them I can
> find all data that exist?
>
> Thanks a lot for taking the time to answer my questions!
>
> All the best,
>
> G.
>
>> I don't think there's a way of getting the prefix from the cluster at
>> this point.
>>
>> If the deleted image was a similar size to the example you've given,
>> you will likely have had objects on every OSD. If this data is
>> absolutely critical you need to stop your cluster immediately or make
>> copies of all the drives with something like dd. If you've never
>> deleted any other rbd images and assuming you can recover data with
>> names, you may be able to find the rbd objects.
>>
>> On Mon, Aug 8, 2016 at 7:28 PM, Georgios Dimitrakakis  wrote:
>>
>> Hi,
>
> On 08.08.2016 10:50, Georgios Dimitrakakis wrote:
>
> Hi,
>>>
>>> On 08.08.2016 09:58, Georgios Dimitrakakis wrote:
>>>
>>> Dear all,

 I would like your help with an emergency issue but first
 let me describe our environment.

 Our environment consists of 2 OSD nodes with 10x 2TB HDDs
 each and 3 MON nodes (2 of them are the OSD nodes as well),
 all with ceph version 0.80.9
 (b5a67f0e1d15385bc0d60a6da6e7fc810bde6047)

 This environment provides RBD volumes to an OpenStack
 Icehouse installation.

 Although not a state-of-the-art environment, it is working
 well and within our expectations.

 The issue now is that one of our users accidentally
 deleted one of the volumes without keeping its data first!

 Is there any way (since the data are considered critical
 and very important) to recover them from CEPH?

>>>
>>> Short answer: no
>>>
>>> Long answer: no, but
>>>
>>> Consider the way Ceph stores data... each RBD is striped into chunks
>>> (RADOS objects of 4MB size by default); the chunks are distributed
>>> among the OSDs with the configured number of replicas (probably two
>>> in your case since you use 2 OSD hosts). RBD uses thin provisioning,
>>> so chunks are allocated upon first write access.
>>> If an RBD is deleted, all of its chunks are deleted on the
>>> corresponding OSDs. If you want to recover a deleted RBD, you need to
>>> recover all of its individual chunks. Whether this is possible depends
>>> on your filesystem and on whether the space of a former chunk has
>>> already been assigned to other RADOS objects. The RADOS object names
>>> are composed of the RBD image's block name prefix and the offset
>>> position of the chunk, so if an undelete mechanism exists for the
>>> OSDs' filesystem, you have to be able to recover files by their
>>> filenames, otherwise you might end up mixing the content of various
>>> deleted RBDs. Due to the thin provisioning there might also be some
>>> chunks missing (e.g. never allocated before).
>>>
>>> Given the facts that
>>> - you probably use XFS on the OSDs, since it is the preferred
>>> filesystem for OSDs (there is RDR-XFS, but I've never had to use it),
>>> - you would need to stop the complete Ceph cluster (recovery tools do
>>> not work on mounted filesystems), and
>>> - your cluster has been in use after the RBD was deleted, so
>>> parts of its former space might already have been overwritten,
>>> 
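
To make the naming scheme above concrete, here is a rough sketch (the pool and
image names and the prefix value are only illustrative; for an image that still
exists the prefix can be read with rbd info, while for a deleted image you would
have to search the raw filesystem for files with such names):

$ rbd info volumes/important-volume | grep block_name_prefix
  block_name_prefix: rb.0.1234.6b8b4567    # format 1 image; format 2 uses rbd_data.<id>
# On a filestore OSD the 4MB chunks are then stored as files named along the
# lines of rb.0.1234.6b8b4567.000000000005__head_XXXXXXXX__<pool-id> under
# /var/lib/ceph/osd/ceph-N/current/<pgid>_head/, so an undelete attempt has to
# bring back the files matching that prefix.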

Re: [ceph-users] Advice on migrating from legacy tunables to Jewel tunables.

2016-08-09 Thread Andrei Mikhailovsky
Gregory,

I've been given a tip by one of the ceph user list members on tuning values, 
data migration and cluster IO. I have had this issue twice already, where my vms 
would simply lose IO and crash while the cluster is being optimised for the new 
tunables.

The recommendation was to upgrade the cluster and migrate all of your vms at 
least once, so that the migrated vms are launched using the new version of the 
ceph libraries. Once this is done you should be okay with data movement and 
your vms should keep their IO.
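
For anyone following along, here is a rough sketch of the kind of commands
involved (the option names and the crush tunables profiles below are the
standard ones, but double-check them against your release before running
anything):

# throttle recovery/backfill before touching the tunables
$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1 --osd-recovery-op-priority 1'
# see which tunables profile the cluster currently uses
$ ceph osd crush show-tunables
# switch straight to the jewel profile (expect a lot of data movement)
$ ceph osd crush tunables jewel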

Andrei

- Original Message -
> From: "Gregory Farnum" 
> To: "Goncalo Borges" 
> Cc: "ceph-users" 
> Sent: Tuesday, 9 August, 2016 02:34:11
> Subject: Re: [ceph-users] Advice on migrating from legacy tunables to Jewel   
> tunables.

> On Mon, Aug 8, 2016 at 5:14 PM, Goncalo Borges
>  wrote:
>> Thanks for replying Greg.
>>
>> I am trying to figure out what parameters I should tune to mitigate the
>> impact of the data movement. For now, I've set
>>
>>osd max backfills = 1
>>
>> Are there others you think we should set?
>>
>> What do you reckon?
> 
> That is generally the big one, but I think you'll need advice from
> people who actually run clusters to see if there's anything more
> that's useful. :)
> -Greg
> 
>> 
>> Cheers
>>
>> Goncalo
>>
>>
>> On 08/09/2016 09:26 AM, Gregory Farnum wrote:
>>>
>>> On Thu, Aug 4, 2016 at 8:57 PM, Goncalo Borges
>>>  wrote:

 Dear cephers...

 I am looking for some advice on migrating from legacy tunables to Jewel
 tunables.

 What would be the best strategy?

 1) A step by step approach?
 - starting with the transition from bobtail to firefly (and, in this
 particular step, by first setting chooseleaf_vary_r=5 and then
 decreasing it slowly to 1?)
  - then from firefly to hammer
  - then from hammer to jewel

 2) or going directly to jewel tunables?

 Any advice on how to minimize the data movement?
>>>
>>> If you're switching tunables, there's going to be a ton of movement
>>> and you need to prepare for it. But unlike with reweighting, there
>>> isn't really a way to do it incrementally. I don't know if we have
>>> experimental evidence but stepping through them is very unlikely to
>>> help; I think you just want to pick the least bad time for a bunch of
>>> migration and enable jewel tunables directly.
>>> -Greg
>>
>>
>> --
>> Goncalo Borges
>> Research Computing
>> ARC Centre of Excellence for Particle Physics at the Terascale
>> School of Physics A28 | University of Sydney, NSW  2006
>> T: +61 2 93511937
>>


Re: [ceph-users] Giant to Jewel poor read performance with Rados bench

2016-08-09 Thread David
Hi Mark, thanks for following up. I'm now pretty convinced I have issues
with my network, it's not Ceph related. My cursory iperf tests between
pairs of hosts were looking fine but with multiple clients I'm seeing
really high tcp retransmissions.
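
For anyone hitting something similar, here is a rough sketch of the sort of
checks I mean (the host and pool names, stream counts and durations are just
placeholders):

# single streams can look clean while parallel streams expose retransmits
$ iperf -s                            # on one storage node
$ iperf -c storage-node -P 8 -t 30    # from several client hosts at once
$ netstat -s | grep -i retrans        # watch the retransmit counters on both ends
# then re-run rados bench with an explicit duration and concurrency for comparison
$ rados bench -p testpool 60 write --no-cleanup
$ rados bench -p testpool 60 seq -t 16
$ rados bench -p testpool 60 rand -t 16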

On Mon, Aug 8, 2016 at 1:07 PM, Mark Nelson  wrote:

> Hi David,
>
> We haven't done any direct giant to jewel comparisons, but I wouldn't
> expect a drop that big, even for cached tests.  How long are you running
> the test for, and how large are the IOs?  Did you upgrade anything else at
> the same time Ceph was updated?
>
> Mark
>
>
> On 08/06/2016 03:38 PM, David wrote:
>
>> Hi All
>>
>> I've just installed Jewel 10.2.2 on hardware that has previously been
>> running Giant. Rados Bench with the default rand and seq tests is giving
>> me approx 40% of the throughput I used to achieve. On Giant I would get
>> ~1000MB/s (so probably limited by the 10GbE interface), now I'm getting
>> 300 - 400MB/s.
>>
>> I can see there is no activity on the disks during the bench so the data
>> is all coming out of cache. The cluster isn't doing anything else during
>> the test. I'm fairly sure my network is sound, I've done the usual
>> testing with iperf etc. The write test seems about the same as I used to
>> get (~400MB/s).
>>
>> This was a fresh install rather than an upgrade.
>>
>> Are there any gotchas I should be aware of?
>>
>> Some more details:
>>
>> OS: CentOS 7
>> Kernel: 3.10.0-327.28.2.el7.x86_64
>> 5 nodes (each 10 * 4TB SATA, 2 * Intel dc3700 SSD partitioned up for
>> journals).
>> 10GbE public network
>> 10GbE cluster network
>> MTU 9000 on all interfaces and switch
>> Ceph installed from ceph repo
>>
>> Ceph.conf is pretty basic (IPs, hosts etc omitted):
>>
>> filestore_xattr_use_omap = true
>> osd_journal_size = 1
>> osd_pool_default_size = 3
>> osd_pool_default_min_size = 2
>> osd_pool_default_pg_num = 4096
>> osd_pool_default_pgp_num = 4096
>> osd_crush_chooseleaf_type = 1
>> max_open_files = 131072
>> mon_clock_drift_allowed = .15
>> mon_clock_drift_warn_backoff = 30
>> mon_osd_down_out_interval = 300
>> mon_osd_report_timeout = 300
>> mon_osd_full_ratio = .95
>> mon_osd_nearfull_ratio = .80
>> osd_backfill_full_ratio = .80
>>
>> Thanks
>> David
>>
>>
>>


Re: [ceph-users] how to debug pg inconsistent state - no ioerrors seen

2016-08-09 Thread Kenneth Waegeman

Hi,

I did a diff on the directories of all three OSDs and found no difference, 
so I don't know what's wrong.


The only thing I see that is different is a scrub file in the TEMP folder (it 
is already a different pg than in my last mail):


-rw-r--r--1 ceph ceph 0 Aug  9 09:51 
scrub\u6.107__head_0107__fff8


But it is empty..

Thanks!


On 09/08/16 04:33, Goncalo Borges wrote:

Hi Kenneth...

The previous default behavior of 'ceph pg repair' was to copy the pg 
objects from the primary osd to the others. I am not sure whether that is still 
the case in Jewel. For this reason, once we get this kind of error in a 
data pool, the best practice is to compare the md5 checksums of the 
damaged object on all osds involved in the inconsistent pg. Since we 
have a 3-replica cluster, we should find a quorum of 2 good copies. If by 
chance the primary osd has the wrong object, you should delete it 
before running the repair.
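
A rough sketch of that cross-check on a filestore data pool (the pg id and 
acting set are the ones from your health detail output, the object name is a 
placeholder; as noted further down, this does not help on a metadata pool 
because the objects there are empty):

# on each osd host in the acting set of the inconsistent pg (here [3,5,1])
$ find /var/lib/ceph/osd/ceph-3/current/6.2f4_head/ -name '<object-name>*' -exec md5sum {} \;
# Compare the sums from the three osds; with 3 replicas, two matching copies
# form the quorum. If the primary holds the odd one out, stop that osd, move
# the bad file aside, start the osd again, and only then run:
$ ceph pg repair 6.2f4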


On a metadata pool, I am not sure exactly how to cross check since all 
objects are size 0 and therefore, md5sum is meaningless. Maybe, one 
way forward could be to check the contents of the pg directories (ex: 
/var/lib/ceph/osd/ceph-0/current/5.161_head/) in all osds involved for 
the pg and see if we spot something wrong?


Cheers

G.


On 08/08/2016 09:40 PM, Kenneth Waegeman wrote:

Hi all,

Since last week, some pgs have been going into the inconsistent state after 
a scrub error. Last week we had 4 pgs in that state. They were on 
different OSDs, but all in the metadata pool.
I did a pg repair on them, and all were healthy again. But now one pg is 
inconsistent again.


with health detail I see:

pg 6.2f4 is active+clean+inconsistent, acting [3,5,1]
1 scrub errors

And in the log of the primary:

2016-08-06 06:24:44.723224 7fc5493f3700 -1 log_channel(cluster) log 
[ERR] : 6.2f4 shard 5: soid 6:2f55791f:::606.:head 
omap_digest 0x3a105358 != best guess omap_digest 0xc85c4361 from auth 
shard 1
2016-08-06 06:24:53.931029 7fc54bbf8700 -1 log_channel(cluster) log 
[ERR] : 6.2f4 deep-scrub 0 missing, 1 inconsistent objects
2016-08-06 06:24:53.931055 7fc54bbf8700 -1 log_channel(cluster) log 
[ERR] : 6.2f4 deep-scrub 1 errors


I looked in dmesg but I couldn't see any IO errors on any of the OSDs 
in the acting set.  Last week it was another set. It is of course 
possible more than 1 OSD is failing, but how can we check this, since 
there is nothing more in the logs?


Thanks !!

K


Re: [ceph-users] Best practices for extending a ceph cluster with minimal client impact and data movement

2016-08-09 Thread Wido den Hollander

> On 8 August 2016 at 16:45, Martin Palma wrote:
> 
> 
> Hi all,
> 
> we are in the process of expanding our cluster and I would like to
> know if there are some best practices in doing so.
> 
> Our current cluster is composed as follows:
> - 195 OSDs (14 Storage Nodes)
> - 3 Monitors
> - Total capacity 620 TB
> - Used 360 TB
> 
> We will expand the cluster by another 14 Storage Nodes and 2 Monitor
> nodes. So we are doubling the current deployment:
> 
> - OSDs: 195 --> 390
> - Total capacity: 620 TB --> 1250 TB
> 
> During the expansion we would like to minimize the client impact and
> data movement. Any suggestions?
> 

There are a few routes you can take. I would suggest that you:

- set max backfills to 1
- set max recovery to 1

Now, add the OSDs to the cluster, but NOT to the CRUSHMap.

When all the OSDs are online, inject a new CRUSHMap where you add the new OSDs 
to the data placement.

$ ceph osd setcrushmap -i 

The OSDs will now start to migrate data, but this is throttled by the max 
recovery and backfill settings.
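
For completeness, here is a rough sketch of that sequence (the file names are 
placeholders and the config option is the usual one, but verify everything 
against your release; the CRUSH edits themselves depend on your layout):

# 1. throttle recovery/backfill cluster-wide
$ ceph tell osd.* injectargs '--osd-max-backfills 1 --osd-recovery-max-active 1'
# 2. bring the new OSDs up without letting them add themselves to CRUSH,
#    e.g. with 'osd crush update on start = false' in their ceph.conf section
# 3. once they are all up, edit and re-inject the CRUSH map in one go
$ ceph osd getcrushmap -o crushmap.bin
$ crushtool -d crushmap.bin -o crushmap.txt
#    ... add the new hosts/OSDs to crushmap.txt ...
$ crushtool -c crushmap.txt -o crushmap.new
$ ceph osd setcrushmap -i crushmap.new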

Wido

> Best,
> Martin


[ceph-users] Large file storage having problem with deleting

2016-08-09 Thread zhu tong
Hi,


I saved a 5GB file in the cluster. The OSD disk "Used space" increases by 15GB 
in total because replication is 3, and radosgw-admin bucket stats --uid=someuid 
shows that num-objects increased by 1.


However, after I removed the object, I observe this:

- the OSD disk usage does NOT change,
- ceph df shows default.rgw.buckets.data "USED" is still 5119M,
- radosgw-admin bucket stats --uid=someuid shows that num-objects decreased by 1.


Now I have deleted every object for this user (the only user using this 
cluster), but the results do not change.


Is there a way to print object information for a pool, like name, size, etc.?

Is this a case of "Ceph should behave like this", or is it a random bug?
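
Something along the lines of the sketch below is what I am after, if such 
commands exist (the pool name is the one from ceph df above and the object 
name is a placeholder):

$ rados df                                       # per-pool usage and object counts
$ rados -p default.rgw.buckets.data ls | head    # names of objects still in the pool
$ rados -p default.rgw.buckets.data stat <object-name>   # size and mtime of one object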


Thanks.