Re: [ceph-users] pgs inconsistent
On 15.08.2019 16:38, huxia...@horebdata.cn wrote:
Dear folks, I had a Ceph cluster with replication 2, 3 nodes, each node with 3 OSDs, on Luminous 12.2.12. Some days ago I had one OSD go down (the disk is still fine) due to a rocksdb crash. I tried to restart that OSD but failed, so I tried to rebalance but encountered inconsistent PGs. What can I do to make the cluster work again? Thanks a lot for helping me out. Samuel

# ceph -s
  cluster:
    id:     289e3afa-f188-49b0-9bea-1ab57cc2beb8
    health: HEALTH_ERR
            pauserd,pausewr,noout flag(s) set
            191444 scrub errors
            Possible data damage: 376 pgs inconsistent
  services:
    mon: 3 daemons, quorum horeb71,horeb72,horeb73
    mgr: horeb73(active), standbys: horeb71, horeb72
    osd: 9 osds: 8 up, 8 in
         flags pauserd,pausewr,noout
  data:
    pools:   1 pools, 1024 pgs
    objects: 524.29k objects, 1.99TiB
    usage:   3.67TiB used, 2.58TiB / 6.25TiB avail
    pgs:     645 active+clean
             376 active+clean+inconsistent
             3   active+clean+scrubbing+deep

That is a lot of inconsistent pg's. When you say replication = 2, do you mean size=3 min_size=2, or size=2 min_size=1? The reason I ask is that min_size=1 is a well-known way to get into lots of problems: one disk can accept a write alone, and before it is recovered/backfilled that drive can die.

If you have min_size=1 I would recommend you set min_size=2 as the first step, to avoid creating more inconsistency while troubleshooting. If you have the space for it in the cluster you should also set size=3.

If you run "ceph health detail" you will get a list of the pg's that are inconsistent. Check whether there is a repeat-offender OSD in that list of pg's, and check that disk for issues: look at dmesg, the logs of the OSD, and whether there are SMART errors. You can try to repair the inconsistent pg's automatically by running "ceph pg repair [pg id]", but make sure the hardware is good first.

good luck
Ronny
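In practice, the checks described above look roughly like this (the pg id 2.5 is only a placeholder, not taken from the cluster above):

# list the inconsistent pgs and the osds they map to
ceph health detail | grep inconsistent
# inspect what kind of inconsistency a given pg has (Luminous and later)
rados list-inconsistent-obj 2.5 --format=json-pretty
# once the underlying disk is known to be healthy, repair that pg
ceph pg repair 2.5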
Re: [ceph-users] VM management setup
Proxmox VE is a simple solution: https://www.proxmox.com/en/proxmox-ve It is based on Debian, can administer an internal ceph cluster or connect to an externally connected one, and has an easy and almost self-explanatory web interface. Good luck in your search!
Ronny

On 05.04.2019 21:34, jes...@krogh.cc wrote:
Hi. Knowing this is a bit off-topic, but seeking recommendations and advice anyway. We're looking for a "management" solution for VMs - currently in the 40-50 VM range - and would like better tooling for managing them: potentially migrating them across multiple hosts, setting up block devices, etc. This is only to be used internally in a department where a bunch of engineering people will manage it; no customers and that kind of thing. Up until now we have been using virt-manager with KVM and have been quite satisfied while we were in the "few VMs" stage, but it seems like time to move on. Thus we're looking for something "simple" that can help manage a ceph+kvm based setup - the simpler and more to the point the better. Any recommendations? We found a lot of names already: OpenStack, CloudStack, Proxmox ... but recommendations are truly welcome. Thanks.
Re: [ceph-users] v14.2.0 Nautilus released
With Debian buster frozen, if there are issues with ceph on Debian that would best be fixed in Debian, now is the last chance to get anything into buster before the next release. It is also important to get mimic and luminous packages built for buster, since you want to avoid a situation where you have to upgrade both the OS and ceph at the same time.
kind regards
Ronny Aasen

On 20.03.2019 07:09, Alfredo Deza wrote: There aren't any Debian packages built for this release because we haven't updated the infrastructure to build (and test) Debian packages yet.

On Tue, Mar 19, 2019 at 10:24 AM Sean Purdy wrote: Hi, Will debian packages be released? I don't see them in the nautilus repo. I thought that Nautilus was going to be debian-friendly, unlike Mimic. Sean

On Tue, 19 Mar 2019 14:58:41 +0100 Abhishek Lekshmanan wrote: We're glad to announce the first release of the Nautilus v14.2.0 stable series. There have been a lot of changes across components from the previous Ceph releases, and we advise everyone to go through the release and upgrade notes carefully.
[ceph-users] debian packages on download.ceph.com
On 2019-02-12 Debian buster went into soft freeze (https://release.debian.org/buster/freeze_policy.html), so all the Debian developers are hard at work getting buster ready for release. It would be really awesome if we could get Debian buster packages built on http://download.ceph.com/ both for luminous and mimic, so one can test upgrades. Since there is no mimic on stretch, we are forced to upgrade to buster before upgrading ceph to mimic. This is basically the last chance to try ceph on buster and find all the bugs potentially affecting ceph on Debian before the release, while it is still possible to get them fixed.

On a related note: is the build infrastructure for ceph on git somewhere?

kind regards
Ronny Aasen
Re: [ceph-users] Mimic 13.2.3?
On 09.01.2019 17:27, Matthew Vernon wrote: Hi, On 08/01/2019 18:58, David Galloway wrote: The current distro matrix is: Luminous: xenial centos7 trusty jessie stretch; Mimic: bionic xenial centos7. Thanks for clarifying :) This may have been different in previous point releases because, as Greg mentioned in an earlier post in this thread, the release process has changed hands and I'm still working on getting a solid/bulletproof process documented, in place, and (more) automated. I wouldn't be the final decision maker, but if you think we should be building Mimic packages for Debian (for example), we could consider it. The build process should support it, I believe. Could I suggest building Luminous for Bionic, and Mimic for Buster, please?

Getting mimic and luminous built for buster would be awesome. It would let us start some testing on mimic, but it would also allow us to detect and fix potential bugs before buster hard freezes. It is important to get luminous built as well, since we do not want to upgrade both the OS and ceph in the same process - too many moving parts. Since it is impossible to get (official) mimic on current Debian stable, one would assume you first upgrade Debian to buster while still running luminous, and afterwards upgrade luminous to mimic.

kind regards
Ronny Aasen
Re: [ceph-users] disk controller failure
On 13.12.2018 18:19, Alex Gorbachev wrote: On Thu, Dec 13, 2018 at 10:48 AM Dietmar Rieder wrote: Hi Cephers, one of our OSD nodes is experiencing a disk controller problem/failure (frequent resetting), so the OSDs on this controller are flapping (up/down in/out). I will hopefully get the replacement part soon. I have some simple questions: what are the best steps to take now, before and after replacement of the controller? - mark down and shut down all OSDs on that node? - wait until rebalance is finished - replace the controller - just restart the OSDs? Or redeploy them, since they still hold data? We are running ceph version 12.2.7 (3ec878d1e53e1aeb47a9f619c49d9e7c0aa384d5) luminous (stable) on CentOS 7.5. Sorry for my naive questions.

I usually do "ceph osd set noout" first to prevent any recoveries, then replace the hardware and make sure all OSDs come back online, then "ceph osd unset noout". Best regards, Alex

Setting noout prevents the osd's from being marked out and hence from triggering rebalancing. Use it when you are doing a short fix and do not want rebalancing to start because you know the data will be available again shortly, e.g. a reboot or similar. If osd's are flapping you normally want them out of the cluster, so they do not impact performance any more.

kind regards
Ronny Aasen
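A rough sketch of the noout workflow described above (the osd ids are placeholders for the osds behind the failing controller):

ceph osd set noout
systemctl stop ceph-osd@10 ceph-osd@11
# ... replace the controller, boot the node, then:
systemctl start ceph-osd@10 ceph-osd@11
ceph -s            # wait until all osds are up and pgs are active+clean
ceph osd unset noout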
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11.12.2018 12:59, Kevin Olbrich wrote: Hi! Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). The server has access to both local and cluster storage; I only need to live-migrate the storage, not the machine. I have never used live migration as it can cause more issues, and the VMs that are already migrated had planned downtime. Taking the VM offline and converting/importing using qemu-img would take some hours, but I would like to keep serving clients, even if it is slower. The VM is I/O-heavy in terms of the old storage (LSI/Adaptec with BBU). There are two HDDs bound as RAID1 which are constantly under 30% - 60% load (this goes up to 100% during reboots, updates or login prime-time). What happens when either the local compute node or the ceph cluster fails (degraded)? Or the network is unavailable? Are all writes performed to both locations? Is this fail-safe? Or does the VM crash in the worst case, which can lead to a dirty shutdown for MS-EX DBs?

The disk stays on the source location until the migration is finalized. If the local compute node crashes and the VM dies with it before the migration is done, the disk is on the source location as expected. If nodes in the ceph cluster die but the cluster is operational, ceph just self-heals and the migration finishes. If the cluster dies hard enough to actually break, the migration will time out and abort, and the disk remains on the source location. If the network is unavailable the transfer will also time out.

good luck
Ronny Aasen
Re: [ceph-users] KVM+Ceph: Live migration of I/O-heavy VM
On 11.12.2018 17:39, Lionel Bouton wrote: Le 11/12/2018 à 15:51, Konstantin Shalygin a écrit : Currently I plan a migration of a large VM (MS Exchange, 300 mailboxes and 900GB DB) from qcow2 on ext4 (RAID1) to an all-flash Ceph luminous cluster (which already holds lots of images). [...] The node currently has 4GB free RAM and 29GB listed as cache / available. These numbers need caution because we have "tuned" enabled, which causes de-duplication of RAM, and this host runs about 10 Windows VMs. During reboots or updates, RAM can get full again. Maybe I am too cautious about live storage migration, maybe I am not. What are your experiences or advice? Thank you very much!

I have read your message twice and still can't figure out what your question is. Do you need to move your block image from some storage to Ceph? No, you can't do this without downtime because of fs consistency. You can easily migrate your filesystem via rsync, for example, with a small downtime for a VM reboot.

I believe the OP is trying to use the storage migration feature of QEMU. I've never tried it and I wouldn't recommend it (probably not very tested and there is a large window for failure).

I use the qemu storage migration feature via the proxmox webui several times a day, never any issues. I regularly migrate between ceph rbd, local directories, shared lvm over fibre channel, and nfs servers. Super easy and convenient.

Ronny Aasen
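Outside of Proxmox, the same QEMU feature can be driven through libvirt; a rough sketch (the domain name, disk target and destination path are made up, and depending on the libvirt version the job may need extra flags):

# live-copy the disk of a running guest to a new destination, then pivot the guest onto it
virsh blockcopy exchange-vm vda --dest /mnt/newstore/exchange.qcow2 \
    --format qcow2 --wait --verbose --pivot
# copying onto an rbd-backed target instead requires passing a full <disk> definition via --xml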
Re: [ceph-users] Ceph Bluestore : Deep Scrubbing vs Checksums
On 22.11.2018 17:06, Eddy Castillon wrote: Hello dear ceph users: We are running a ceph cluster with Luminous (BlueStore). As you may know, this new ceph version has a new feature called "checksums". I would like to ask if this feature replaces deep-scrub. In our cluster we run deep-scrub every month, however the impact on performance is high. Source: ceph's documentation: Checksums: BlueStore calculates, stores, and verifies checksums for all data and metadata it stores. Any time data is read off of disk, a checksum is used to verify the data is correct before it is exposed to any other part of the system (or the user).

Checksums and deep-scrub do different things, and you want to keep doing both. A checksum helps determine whether the data is OK or not when it is read off the disk. But if data sits idle on a drive it can become unreadable due to bad blocks over time, and if the data is never read you can end up in a situation where all of an object's replicas have become unreadable. Deep-scrub periodically reads the data on the drive to verify it is still readable and correct (checking against the other replicas and the checksums). You can schedule deep-scrub to run in off-peak hours.

kind regards
Ronny Aasen
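Pushing scrubs into off-peak hours is done with the osd scrub time-window options; a minimal example (the 22:00-06:00 window is only illustrative):

[osd]
# only start (deep-)scrubs between 22:00 and 06:00 local time
osd scrub begin hour = 22
osd scrub end hour = 6

or, applied at runtime:

ceph tell osd.* injectargs '--osd_scrub_begin_hour 22 --osd_scrub_end_hour 6'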
Re: [ceph-users] https://ceph-storage.slack.com
On 18.09.2018 21:15, Alfredo Daniel Rezinovsky wrote: Can anyone add me to this slack? with my email alfrenov...@gmail.com Thanks.

Why would a ceph slack be invite-only? Also, is the slack bridged to matrix? If so, what is the room id?

kind regards
Ronny Aasen
Re: [ceph-users] CephFS performance.
On 10/4/18 7:04 AM, jes...@krogh.cc wrote: Hi All. First, thanks for the good discussion and strong answers I've gotten so far. The current cluster setup is 4 hosts x 10 x 12TB 7.2K RPM drives, 10GbitE, and metadata on rotating drives - 3x replication - 256GB memory in the OSD hosts and 32+ cores, behind a Perc with each disk as RAID0 and BBWC. Planned changes: - get 1-2 more OSD hosts - experiment with EC pools for CephFS - MDS onto a separate host and metadata onto SSDs. I'm still struggling to get "non-cached" performance up to "hardware" speed - whatever that means. I do a "fio" benchmark using 10GB files, 16 threads, 4M block size - at which I can "almost" sustainably fill the 10GbitE NIC. In this configuration I would have expected it to be "way above" 10Gbit speed and thus have the NIC not "almost" filled but fully filled - could that be the metadata activity? But on "big files" and reads that should not be much, right? The above is actually OK for production, thus not a big issue, just information. Single-threaded performance is still struggling.

Cold HDD (read from disk on the NFS-server end) / NFS performance:
jk@zebra01:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 15.86 GB in 00h00m27.53s: 589.88 MB/second
Local page cache (just to show it isn't the profiling tool delivering limitations):
jk@zebra03:~$ pipebench < /nfs/16GB.file > /dev/null
Summary: Piped 29.24 GB in 00h00m09.15s: 3.19 GB/second
Now from the Ceph system:
jk@zebra01:~$ pipebench < /ceph/bigfile.file > /dev/null
Summary: Piped 36.79 GB in 00h03m47.66s: 165.49 MB/second

Can block/stripe-size be tuned? Does it make sense? Does read-ahead on the CephFS kernel client need tuning? What performance are other people seeing? Other thoughts - recommendations? On some of the shares we're storing pretty large files (GB size) and need the backup to move them to tape, so it is preferred to be capable of filling an LTO6 drive's write speed with a single thread. 40-ish 7.2K RPM drives should add up to more than the above, right? This is the only current load being put on the cluster, plus ~100MB/s recovery traffic.

The problem with single-threaded performance in ceph is that it reads the spindles serially: you are practically reading one drive at a time, and see a single disk's performance minus all the overheads from ceph, network, mds, etc. You do not get the combined performance of the drives, only one drive at a time. So the trick for ceph performance is to get more spindles working for you at the same time. There are ways to get more performance out of a single thread:
- faster components in the path, i.e. faster disk/network/cpu/memory
- larger pre-fetching/read-ahead; with a large enough read-ahead more osd's will participate in reading simultaneously. [1] shows a table of benchmarks with different read-ahead sizes.
- erasure coding. While erasure coding adds latency vs replicated pools, you get more spindles involved in reading in parallel, so for large sequential loads erasure coding can be a benefit.
- some sort of extra caching scheme. I have not looked at cachefiles, but it may provide some benefit. You can also play with different cephfs implementations: there is a FUSE client where you can play with different cache solutions, but generally the kernel client is faster.
- in rbd there is a fancy striping option, using --stripe-unit and --stripe-count. This would get more spindles running; perhaps consider using rbd instead of cephfs if it fits the workload (a sketch of both knobs follows below the link).
[1] https://tracker.ceph.com/projects/ceph/wiki/Kernel_client_read_ahead_optimization

good luck
Ronny Aasen
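For reference, the two knobs mentioned above look roughly like this (the sizes, monitor address and pool/image names are only examples):

# larger read-ahead for the cephfs kernel client: rasize is in bytes (here 64 MiB)
mount -t ceph mon1:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/secret,rasize=67108864

# rbd "fancy striping": spread each chunk of the image over more osds
rbd create mypool/myimage --size 100G --stripe-unit 65536 --stripe-count 16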
Re: [ceph-users] Bluestore vs. Filestore
On 03.10.2018 20:10, jes...@krogh.cc wrote: Your use case sounds like it might profit from the rados cache tier feature. It's a rarely used feature because it only works in very specific circumstances, but your scenario sounds like it might work. Definitely worth giving it a try. Also, dm-cache with LVM *might* help. But if your active working set is really just 400GB: the Bluestore cache should handle this just fine. Don't worry about "unequal" distribution; every 4MB chunk of every file will go to a random OSD.

I tried it out - and will try it more - but initial tests didn't really convince me.

One very powerful and simple optimization is moving the metadata pool to SSD only. Even if it's just 3 small but fast SSDs, that can make a huge difference to how fast your filesystem "feels".

They are ordered and will hopefully arrive very soon. Can I: 1) Add disks 2) Create pool 3) stop all MDS's 4) rados cppool 5) Start MDS? Yes, that's a cluster-down on CephFS, but it shouldn't take long. Or is there a better guide?

This post https://ceph.com/community/new-luminous-crush-device-classes/ and this document http://docs.ceph.com/docs/master/rados/operations/pools/ explain how the osd device class is used to define a crush placement rule. You can then set the crush_rule on the existing pool and ceph will move the data. No downtime needed.

kind regards
Ronny Aasen
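With device classes there is no need for a pool copy; a rough sketch (the rule name and the metadata pool name are placeholders):

# make a replicated rule that only picks ssd-class osds, then point the metadata pool at it
ceph osd crush rule create-replicated ssd-rule default host ssd
ceph osd pool set cephfs_metadata crush_rule ssd-rule
# ceph migrates the pgs onto the ssd osds in the background; the mds keeps running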
Re: [ceph-users] Bluestore vs. Filestore
On 02.10.2018 21:21, jes...@krogh.cc wrote: On 02.10.2018 19:28, jes...@krogh.cc wrote: In the cephfs world there is no central server that holds the cache; each cephfs client reads data directly from the osd's. -- I can accept this argument, but nevertheless: if I used Filestore, it would work.

Bluestore is fairly new though, so if your use case fits filestore better, there is no huge reason not to just use that. This also means no single point of failure, and you can scale out performance by spreading metadata tree information over multiple MDS servers, and scale out storage and throughput with added osd nodes. So if the cephfs client cache is not sufficient, you can look at the bluestore cache. http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size

I have been there, but it seems to "not work" - I think the need to slice per OSD and statically allocate memory per OSD breaks the efficiency (but I cannot prove it).

Or you can look at adding an ssd layer over the spinning disks, with e.g. bcache. I assume you are using an ssd/nvram for the bluestore db already.

My current bluestore(s) are backed by 10TB 7.2K RPM drives, although behind BBWC. Can you elaborate on the "assumption"? As we're not doing that, I'd like to explore it.

https://ceph.com/community/new-luminous-bluestore/ - read about "multiple devices". You can split out the DB part of bluestore to a faster drive (ssd); many tend to put the db's for 4 spinners on a single ssd. The db is the osd metadata - it says where on the block device the objects are - and it increases the performance of bluestore significantly. You should also look at tuning the cephfs metadata servers: make sure the metadata pool is on fast ssd osd's, and tune the mds cache to the mds server's ram, so you cache as much metadata as possible.

Yes, we're in the process of doing that - I believe we're seeing the MDS suffering when we saturate a few disks in the setup - and they are sharing. Thus we'll move the metadata as per recommendations to SSD.

good luck
Ronny Aasen
Re: [ceph-users] Bluestore vs. Filestore
On 02.10.2018 19:28, jes...@krogh.cc wrote: Hi. Based on some recommendations we have set up our CephFS installation using bluestore*. We're trying to get a strong replacement for a "huge" xfs+NFS server - 100TB-ish in size. The current setup is a sizeable Linux host with 512GB of memory, one large Dell MD1200 or MD1220 - 100TB - and a Linux kernel NFS server. Since our "hot" dataset is < 400GB we can actually serve the hot data directly out of the host page-cache and never really touch the "slow" underlying drives, except when new bulk data are written, where a Perc with BBWC is consuming the data. In the CephFS + Bluestore world, Ceph is "deliberately" bypassing the host OS page-cache, so even when we have 4-5 x 256GB memory** in the OSD hosts it is really hard to create a synthetic test where the hot data does not end up being read off the underlying disks. Yes, the client-side page cache works very well, but in our scenario we have 30+ hosts pulling the same data over NFS. Is bluestore just a "bad fit"? Would Filestore "do the right thing"? Is the recommendation to make an SSD "overlay" on the slow drives? Thoughts? Jesper (* Bluestore should be the new and shiny future - right? ** Total mem 1TB+)

In the cephfs world there is no central server that holds the cache; each cephfs client reads data directly from the osd's. This also means no single point of failure, and you can scale out performance by spreading metadata tree information over multiple MDS servers, and scale out storage and throughput with added osd nodes. So if the cephfs client cache is not sufficient, you can look at the bluestore cache: http://docs.ceph.com/docs/mimic/rados/configuration/bluestore-config-ref/#cache-size Or you can look at adding an ssd layer over the spinning disks, with e.g. bcache. I assume you are using an ssd/nvram for the bluestore db already. You should also look at tuning the cephfs metadata servers: make sure the metadata pool is on fast ssd osd's, and tune the mds cache to the mds server's ram, so you cache as much metadata as possible.

good luck
Ronny Aasen
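The two cache knobs mentioned above are plain config options; an example ceph.conf fragment (the sizes are only illustrative and have to fit the hosts' RAM):

[osd]
# per-osd bluestore cache for hdd-backed osds, default is 1 GiB (here 4 GiB)
bluestore cache size hdd = 4294967296

[mds]
# memory the mds may use for its metadata cache (Luminous 12.2.1 and later), here 16 GiB
mds cache memory limit = 17179869184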
Re: [ceph-users] Slow Ceph: Any plans on torrent-like transfers from OSDs ?
Ceph is a distributed system; it scales by concurrent access to nodes. Generally a single client will access a single OSD at a time, i.e. the max possible single-thread read is the read speed of the drive, and the max possible write is roughly a single drive's write speed divided by (replication size - 1). But when you have many VMs accessing the same cluster, the load is spread all over (just like when you see recovery running). A single spinning disk should be able to do 100-150MB/s depending on make and model, even with the overhead of ceph and networking, so I still think 20MB/s is a bit on the low side, depending on how you benchmark. I would start by going through this benchmarking guide and see if you find some issues: https://tracker.ceph.com/projects/ceph/wiki/Benchmark_Ceph_Cluster_Performance

In order to get more single-thread performance out of ceph you must get faster individual parts (nvram disks, fast RAM and processors, fast network, etc.), or you can cheat by spreading the load over more disks - e.g. rbd fancy striping, or attaching multiple disks with individual controllers in the VM - or use caching and/or readahead. When it comes to cache tiering I would remove that; it does not get the love it needs, and Red Hat has even stopped supporting it in deployments. But you can use dm-cache or bcache on the osd's and/or rbd cache on the kvm clients.

good luck
Ronny Aasen

On 09.09.2018 11:20, Alex Lupsa wrote: Hi, Any ideas about the below? Thanks, Alex -- Hi, I have a really small homelab 3-node ceph cluster on consumer hardware - thanks to Proxmox for making it easy to deploy. The problem I am having is very, very bad transfer rates, i.e. 20MB/sec for both read and write on 17 OSDs with a cache layer. However, during recovery the speed hovers between 250 and 700MB/sec, which proves that the cluster IS capable of reaching way above those 20MB/sec in KVM. Reading the documentation, I see that during recovery "nearly all OSDs participate in resilvering a new drive" - kind of a torrent of data incoming from multiple sources at once, causing a huge deluge. However, I believe this does not happen during normal transfers, so my question is simply: are there any hidden tunables I can enable for this, with the implied cost of network and heavy usage of disks? Will there be in the future if not? I have tried disabling cephx, upgrading the network to 10gbit, using bigger journals and more bluestore cache, and disabling the debugging logs, as has been advised on the list. The only thing that did help a bit was cache tiering, but this only helps somewhat, as the ops do not get promoted unless I am very adamant about keeping programs in KVM open for very long times so that the writes/reads are promoted. To add insult to injury, once the cache gets full the whole 3-node cluster grinds to a full halt until I start forcefully evicting data from the cache... manually! So I am guessing a really bad misconfiguration on my side. The next step would be removing the cache layer and using those SSDs as bcache instead, as it seems to yield 5x the results, even though it does add yet another layer of complexity and RAM requirements.
Full config details: https://pastebin.com/xUM7VF9k

rados bench -p ceph_pool 30 write
Total time run:         30.983343
Total writes made:      762
Write size:             4194304
Object size:            4194304
Bandwidth (MB/sec):     98.3754
Stddev Bandwidth:       20.9586
Max bandwidth (MB/sec): 132
Min bandwidth (MB/sec): 16
Average IOPS:           24
Stddev IOPS:            5
Max IOPS:               33
Min IOPS:               4
Average Latency(s):     0.645017
Stddev Latency(s):      0.326411
Max latency(s):         2.08067
Min latency(s):         0.0355789
Cleaning up (deleting benchmark objects)
Removed 762 objects
Clean up completed and total clean up time: 3.925631

Thanks,
Alex
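When comparing against those numbers it also helps to benchmark with a single thread, since that is the case the original post struggles with; roughly (pool name as above, -t sets the number of concurrent operations, default 16):

rados bench -p ceph_pool 30 write -t 1 --no-cleanup
rados bench -p ceph_pool 30 seq -t 1
rados -p ceph_pool cleanup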
Re: [ceph-users] how to swap osds between servers
On 03.09.2018 17:42, Andrei Mikhailovsky wrote: Hello everyone, I am in the process of adding an additional osd server to my small ceph cluster as well as migrating from filestore to bluestore. Here is my setup at the moment: Ceph 12.2.5, running on Ubuntu 16.04 with latest updates; 3 x osd servers with 10x3TB SAS drives, 2 x Intel S3710 200GB ssd and 64GB ram in each server. The same servers are also mon servers. I am adding the following to the cluster: 1 x osd+mon server with 64GB of ram and 2 x Intel S3710 200GB ssds, adding 4 x 6TB disks and 2 x 3TB disks. Thus, the new setup will have the following configuration: 4 x osd servers with 8x3TB SAS drives and 1x6TB SAS drive, 2 x Intel S3710 200GB ssd and 64GB ram in each server. This will make sure that all servers have the same amount/capacity of drives. There will be 3 mon servers in total. As a result, I will have to remove 2 x 3TB drives from each of the existing three osd servers and place them into the new osd server, and add a 6TB drive to each osd server. As those 6 x 3TB drives which will be taken from the existing osd servers and placed in the new server still have data stored on them, what is the best way to do this? I would like to minimise the data migration all over the place as it creates havoc on the cluster performance. What is the best workflow to achieve the hardware upgrade? If I add the new osd host server into the cluster and physically take an osd disk from one server and place it in the other server, will it be recognised and accepted by the cluster?

Data will migrate no matter how you change the crushmap, and since you want to migrate to bluestore this is also unavoidable. If it is critical data and you want to minimize impact, I prefer to do it the slow and steady way: add a new bluestore drive to the new host with weight 0 and gradually increase its weight, while gradually lowering the weight of the filestore drive being removed. A worse option, if you do not have a drive to spare for that, is to gradually drain a drive, remove it from the cluster, move it over, zap and recreate it as bluestore, and gradually fill it again. But this takes longer, and if you have space issues it can be complicated. An even worse option is to move the osd drive over (with its journal and data) and have the cluster shuffle all the data around; this is a big impact, and then you are still running filestore, so you still need to migrate to bluestore.

kind regards
Ronny Aasen
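The gradual approach boils down to repeated crush reweights; a rough sketch (osd ids and step sizes are placeholders):

# bring the new bluestore osd in slowly ...
ceph osd crush reweight osd.30 0.5      # then 1.0, 1.5, ... up to its full weight
# ... while draining the old filestore osd
ceph osd crush reweight osd.12 2.0      # then 1.5, 1.0, ... down to 0
# wait for HEALTH_OK between steps; once empty, take the old osd out
ceph osd out 12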
Re: [ceph-users] Help Basically..
On 02.09.2018 17:12, Lee wrote: Should I just out the OSD's first, or completely zap them and recreate them? Or delete them and let the cluster repair itself? On the second node, when it started back up, I had problems with the journals for ID 5 and 7; they were also recreated. All the rest are still the originals. I know that some PG's are on both 24 and 5 and 7, i.e.

Personally I would never wipe a disk until the cluster is HEALTH_OK. Out them from the cluster, and if you need the slots for healthy disks you can remove them physically, but label them and store them together with their journals until you are HEALTH_OK.

kind regards
Ronny Aasen
Re: [ceph-users] ls operation is too slow in cephfs
What are you talking about when you say you have an MDS in a region? AFAIK only radosgw supports multisite and regions. It sounds like you have a cluster spread out over a geographical area, and this will have a massive impact on latency. What is the latency between all servers in the cluster?

kind regards
Ronny Aasen

On 25.07.2018 12:03, Surya Bala wrote: The time got reduced when an MDS from the same region became active. Each region has an MDS. The OSD nodes are in one region and the active MDS was in another region, hence the delay.

On Tue, Jul 17, 2018 at 6:23 PM, John Spray <jsp...@redhat.com> wrote: On Tue, Jul 17, 2018 at 8:26 AM Surya Bala <sooriya.ba...@gmail.com> wrote:
> Hi folks,
> We have a production cluster with 8 nodes and each node has 60 disks of size 6TB each. We are using cephfs and the FUSE client with a global mount point. We are doing rsync from our old server to this cluster; rsync is slow compared to a normal server.
> When we do 'ls' inside some folder which has a very large number of files, like 100k or 200k, the response is too slow.

The first thing to check is what kind of "ls" you're doing. Some systems colorize ls by default, and that involves statting every file in addition to listing the directory. Try with "ls --color=never". It also helps to be more specific about what "too slow" means. How many seconds, and how many files?

John
Re: [ceph-users] Reclaim free space on RBD images that use Bluestore?????
On 23.07.2018 22:18, Sean Bolding wrote: I have XenServers that connect via iSCSI to Ceph gateway servers that use lrbd and targetcli. On my ceph cluster the RBD images I create are used as storage repositories in XenServer for the virtual machine vdisks. Whenever I delete a virtual machine, XenServer shows that the repository size has decreased. This also happens when I mount a virtual drive in XenServer as a virtual drive in a Windows guest: if I delete a large file, such as an exported VM, it shows as deleted and the space as available. However, when I check in Ceph using ceph -s or ceph df, it still shows the space as being used. I checked everywhere and it seems there was a reference to it here https://github.com/ceph/ceph/pull/14727 but I am not sure if a way to trim or discard freed blocks was ever implemented. The only way I have found is to play musical chairs and move the VMs to different repositories and then completely remove the old RBD images in ceph. This is not exactly easy to do. Is there a way to reclaim free space on RBD images that use Bluestore? What commands do I use and where do I use them? Do I run them on the ceph cluster or from XenServer? Please help. Sean

I am not familiar with Xen, but it does sound like you have an rbd mounted with a filesystem on the XenServer. In that case it is the same as for other filesystems: deleted files are just marked deleted in the file allocation table, and the RBD space is "reclaimed" when the filesystem discards the now-unused blocks. On many filesystems you would run the fstrim command to discard freed blocks, optionally mounting the fs with the discard option. In XenServer > 6.5 there should be a button in XenCenter to reclaim freed space.

kind regards
Ronny Aasen
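On a Linux guest or gateway the reclaim itself is the usual filesystem discard; roughly (mountpoint and device are placeholders, and this only reaches ceph if every layer in between - targetcli/LIO, the rbd mapping - passes discards through):

# one-off: tell the filesystem to discard its unused blocks down to the rbd image
fstrim -v /mnt/repository
# or mount with continuous discard (higher overhead) via /etc/fstab:
# /dev/xvdb1  /mnt/repository  ext4  defaults,discard  0 2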
Re: [ceph-users] active+clean+inconsistent PGs after upgrade to 12.2.7
On 19. juli 2018 10:37, Robert Sander wrote: Hi, just a quick warning: we currently see active+clean+inconsistent PGs on two clusters after upgrading to 12.2.7. I created http://tracker.ceph.com/issues/24994 Regards

Did you upgrade from 12.2.5 or 12.2.6? It sounds like you hit the reason for the 12.2.7 release; read https://ceph.com/releases/12-2-7-luminous-released/ - 12.2.8 should bring features that can deal with the "objects are in sync but checksums are wrong" scenario.

kind regards
Ronny Aasen
Re: [ceph-users] ceph cluster
On 12. juni 2018 12:17, Muneendra Kumar M wrote: conf file as shown below. If I reconfigure my IP addresses from 10.xx.xx.xx to 192.xx.xx.xx by changing the public network and mon_host fields in ceph.conf, will my cluster work as it is? Below are my ceph.conf details. Any inputs will really help me understand more about this.

No. Changing the subnet of the cluster is a complex operation. Since you are using private IP addresses anyway, I would reconsider changing them, and only change them if there is no other way. This is the documentation for mimic on how to change a monitor's IP address:
http://docs.ceph.com/docs/mimic/rados/operations/add-or-rm-mons/#changing-a-monitor-s-ip-address-the-messy-way

kind regards
Ronny Aasen
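For completeness, the "messy way" in that document boils down to rewriting the monmap; very roughly (the mon name and the new address are placeholders):

ceph mon getmap -o /tmp/monmap               # grab the current monmap while the cluster is up
monmaptool --rm mon-a /tmp/monmap            # drop the monitor's old address
monmaptool --add mon-a 192.168.1.10:6789 /tmp/monmap
systemctl stop ceph-mon@mon-a
ceph-mon -i mon-a --inject-monmap /tmp/monmap
systemctl start ceph-mon@mon-a               # then repeat per monitor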
Re: [ceph-users] Ceph Mimic on Debian 9 Stretch
On 04.06.2018 21:08, Joao Eduardo Luis wrote: On 06/04/2018 07:39 PM, Sage Weil wrote: [1] http://lists.ceph.com/private.cgi/ceph-maintainers-ceph.com/2018-April/000603.html [2] http://lists.ceph.com/private.cgi/ceph-maintainers-ceph.com/2018-April/000611.html Just a heads up, seems the ceph-maintainers archives are not public. -Joao

The debian-gcc list is public: https://lists.debian.org/debian-gcc/2018/04/msg00137.html

Ronny Aasen
Re: [ceph-users] Ceph Mimic on Debian 9 Stretch
On 04. juni 2018 06:41, Charles Alva wrote: Hi Guys, When will the Ceph Mimic packages for Debian Stretch be released? I could not find the packages even after changing the sources.list.

I am also eager to test mimic on my ceph; debian-mimic only contains ceph-deploy at the moment.

kind regards
Ronny Aasen
Re: [ceph-users] How to normally expand OSD’s capacity?
On 10.05.2018 12:24, Yi-Cian Pu wrote: Hi All, We are wondering if there is any way to expand an OSD's capacity. We are studying this and conducted an experiment. However, in the result, the size of the expanded capacity is counted in the USED part rather than the AVAIL one. The following shows the process of our experiment:

1. We prepare a small cluster of luminous v12.2.4 and write some data into a pool. osd.1 is manually deployed and uses a disk partition of size 100GB (the whole disk is 320GB).

[root@workstation /]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0 hdd   0.28999 1.0      297G 27062M 271G    8.89 0.67  32
 1 hdd   0.0     1.0      100G 27062M 76361M 26.17 1.97  32
            TOTAL         398G 54125M 345G   13.27
MIN/MAX VAR: 0.67/1.97  STDDEV: 9.63

2. Then, we expand the disk partition used by osd.1 with the following steps: (1) stop the osd.1 daemon, (2) use the "parted" command to grow the disk partition by 50GB, (3) restart the osd.1 daemon.

3. After we do the above steps, we get the result that the expanded size is counted in the USED part:

[root@workstation /]# ceph osd df
ID CLASS WEIGHT  REWEIGHT SIZE USE    AVAIL  %USE  VAR  PGS
 0 hdd   0.28999 1.0      297G 27063M 271G    8.89 0.39  32
 1 hdd   0.0     1.0      150G 78263M 76360M 50.62 2.21  32
            TOTAL         448G 102G   345G   22.94
MIN/MAX VAR: 0.39/2.21  STDDEV: 21.95

This is what we have tried, and the result looks very confusing. We'd really like to know if there is any way to properly expand an OSD's capacity. Any feedback or suggestions would be much appreciated.

You do not do this in ceph. You would normally not partition the osd drive; you use the whole drive, so you never get into the position of needing to grow it. You add space by adding osd's and adding nodes, so increasing osd size is not the normal approach. If you must, for some oddball reason, you can remove the osd (drain or destroy it), repartition, re-add the osd and let ceph backfill the drive. Or you can just make a new osd with the remaining disk space. Since the space increase will change the crushmap, there is no way to avoid some data movement anyway.

mvh
Ronny Aasen
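The "remove - repartition - re-add" route is simply the normal osd replacement procedure; roughly (the osd id and device are placeholders):

ceph osd out 1                      # drain it, wait for HEALTH_OK
systemctl stop ceph-osd@1
ceph osd purge 1 --yes-i-really-mean-it
# wipe / repartition the disk, then recreate the osd on the whole device
# (ceph-volume is available from Luminous 12.2.2 on)
ceph-volume lvm create --bluestore --data /dev/sdb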
Re: [ceph-users] 3 monitor servers to monitor 2 different OSD set of servers
On 26.04.2018 17:05, DHD.KOHA wrote: Hello, I am wondering if this is possible. I am currently running a ceph cluster consisting of 3 servers as monitors and 6 OSD servers that host the disk drives, that is 10x8T osds on each server. Since we did some cleaning up of old servers, I am able to create another set of OSDs with 3 servers having 16 drives of 10T each. Since adding the above servers to expand the current set of OSD servers doesn't seem to be a good idea according to the documentation, because the disk drives are of different sizes (8T and 10T), I wonder if it is possible to create another cluster using the same 3 monitors and have them monitor a second cluster as well, so that --cluster ceph and --cluster ceph2 processes are running on the same set of monitor servers.

I do not think you can easily make a new cluster that shares mon servers. But you can make a new class of hdd, e.g. call it hdd10 or something, and create/move some pools to use that class of device. It is the same cluster, but the pools will not share disks, so you need to "think" about them almost like separate clusters. That should be quite straightforward: https://ceph.com/community/new-luminous-crush-device-classes/

kind regards
Ronny Aasen
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
The difference in cost between 2 and 3 servers is not huge, but the reliability difference between a size 2/1 pool and a 3/2 pool is massive. A 2/1 pool is just a single fault during maintenance away from data loss, while you need multiple simultaneous faults and very bad luck to break a 3/2 pool. I would recommend rather using 2/2 pools if you are willing to accept a little downtime when a disk dies: cluster I/O would stop until the disks backfill to cover for the lost disk, but that is better than having inconsistent pg's or data loss because a disk crashed during a routine reboot, or because 2 disks failed. It is also worth reading this link, a good explanation: https://www.spinics.net/lists/ceph-users/msg32895.html If you have good backups and are willing to restore the whole pool, it is of course your privilege to run 2/1 pools, but be mindful of the risks of doing so.

kind regards
Ronny Aasen

BTW: I did not know Ubuntu automatically rebooted after an upgrade. You can probably avoid that reboot somehow in Ubuntu and do the restarts of services manually, if you wish to maintain service during the upgrade.

On 25.04.2018 11:52, Ranjan Ghosh wrote: Thanks a lot for your detailed answer. The problem for us, however, was that we use the Ceph packages that come with the Ubuntu distribution. If you do an Ubuntu upgrade, all packages are upgraded in one go and the server is rebooted. You cannot influence anything or start/stop services one by one, etc. This was concerning me, because the upgrade instructions didn't mention anything about an alternative or what to do in this case. But someone here enlightened me that - in general - it all doesn't matter that much *if you are just accepting a downtime*. And, indeed, it all worked nicely. We stopped all services on all servers, upgraded the Ubuntu version, rebooted all servers and were ready to go again. We didn't encounter any problems there. The only problem turned out to be our own fault and simply a firewall misconfiguration. And, yes, we're running "size:2 min_size:1" because we're on a very tight budget. If I understand correctly, this means: make changes to files on one server, *eventually* copy them to the other server. I hope this *eventually* means after a few minutes. Up until now I've never experienced *any* problems with file integrity with this configuration. In fact, Ceph is incredibly stable. Amazing. I have never ever had any issues whatsoever with broken files, partially written files, files that contain garbage, etc. - even after starting/stopping services, rebooting, etc. With GlusterFS and other cluster file systems I've experienced many such problems over the years, so this is what makes Ceph so great. I now have a lot of trust in Ceph, that it will eventually repair everything :-) And: if a file that was written a few seconds ago is really lost, it wouldn't be that bad for our use case. It's a web server. The most important stuff is in the DB. We have hourly backups of everything. In a huge emergency, we could even restore the backup from an hour ago if we really had to. Not nice, but if it happens every 6 years or so due to some freak hardware failure, I think it is manageable. I accept it's not the recommended/perfect solution if you have infinite amounts of money at your hands, but in our case, I think it's not extremely audacious either to do it like this, right?

On 11.04.2018 19:25, Ronny Aasen wrote: Ceph upgrades are usually not a problem: ceph has to be upgraded in the right order.
Normally, when each service is on its own machine, this is not difficult. But when you have mon, mgr, osd, mds, and clients on the same host you have to do it a bit carefully. I tend to have a terminal open with "watch ceph -s" running, and I never do another service until the health is OK again. First, apt upgrade the packages on all the hosts. This only updates the software on disk, not the running services. Then do the restart of services in the right order, and only on one host at a time.

mons: first restart the mon service on all mon-running hosts. All 3 mons are active at the same time, so there is no "shifting around", but make sure the quorum is OK again before you do the next mon.

mgr: then restart mgr on all hosts that run mgr. There is only one active mgr at a time, so here there will be a bit of shifting around, but it is only for statistics/management, so it may affect your "ceph -s" output but not cluster operation.

osd: restart osd processes one osd at a time; make sure the health is OK before doing the next osd process. Do this for all hosts that have osd's.

mds: restart the mds's one at a time. You will notice the standby mds taking over for the mds that was restarted. Do both.

clients: restart clients; that means remounting filesystems, migrating or restarting VMs, or restarting whatever process uses the old ceph libraries (a sketch of the per-service commands follows below).

about pools: s
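On systemd-based hosts, the per-service restarts described above map to something like the following (host names and osd ids are placeholders):

watch ceph -s                          # keep running in a second terminal
systemctl restart ceph-mon.target      # on each mon host, one host at a time
systemctl restart ceph-mgr.target      # on each mgr host
systemctl restart ceph-osd@12          # one osd at a time, wait for HEALTH_OK in between
systemctl restart ceph-mds.target      # one mds host at a time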
Re: [ceph-users] Cephalocon APAC 2018 report, videos and slides
On 24.04.2018 17:30, Leonardo Vaz wrote: Hi, Last night I posted the Cephalocon 2018 conference report on the Ceph blog[1], published the video recordings from the sessions on YouTube[2] and the slide decks on Slideshare[3]. [1] https://ceph.com/community/cephalocon-apac-2018-report/ [2] https://www.youtube.com/playlist?list=PLrBUGiINAakNgeLvjald7NcWps_yDCblr [3] https://www.slideshare.net/Inktank_Ceph/tag/cephalocon-apac-2018 I'd like to take the opportunity to apologize for the flood of posts on Twitter and Google+ about the video uploads last night. It seems that even though I disabled the checkbox for announcing new uploads on social media, YouTube decided to post them anyway. Sorry for the inconvenience. Kindest regards, Leo

Thanks to the presenters and yourself for your awesome work. This is a goldmine for those of us who could not attend. :)

kind regards
Ronny Aasen
Re: [ceph-users] configuration section for each host
On 24.04.2018 18:24, Robert Stanford wrote: In examples I see that each host has a section in ceph.conf, on every host (host-a has a section in its conf on host-a, but there's also a host-a section in the ceph.conf on host-b, etc.). Is this really necessary? I've been using just generic osd and monitor sections, and that has worked out fine so far. Am I setting myself up for unexpected problems?

Only if you want to override default values for that individual host. I have never had anything but generic sections. Ceph is moving more and more away from must-have information in the configuration file; in the next version you will probably not need initial monitors either, since they can be discovered via SRV DNS records (see the sketch below).

kind regards
Ronny Aasen
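The DNS-based mon lookup uses ordinary SRV records; a sketch of what such zone entries could look like (the domain and host names are made up, 6789 is the default mon port):

_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon1.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon2.example.com.
_ceph-mon._tcp.example.com. 3600 IN SRV 10 60 6789 mon3.example.com.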
Re: [ceph-users] osds with different disk sizes may killing performance (?? ?)
On 13. april 2018 05:32, Chad William Seys wrote: Hello, I think your observations suggest that, to a first approximation, filling drives with bytes to the same absolute level is better for performance than filling drives to the same percentage. Assuming a random distribution of PGs, this would cause the smallest drives to be as active as the largest drives. E.g. if every drive had 1TB of data, each would be equally likely to contain the PG of interest. Of course, as more data was added the smallest drives could not hold more and the larger drives would become more active, but at least the smaller drives would be as active as possible.

But in this case you would have a steep drop-off in performance: when you reach the fill level where the small drives do not accept more data, suddenly you would have a performance cliff where only your larger disks are doing new writes, and only the larger disks are doing reads on new data. It is also easier to make the logical connection while you are installing new nodes/disks than a year later when your cluster just happens to reach that fill level. It would also be an easier job balancing disks between nodes when you are adding osd's anyway and the new ones are mostly empty, rather than when your small osd's are full and your large disks have significant data on them.

kind regards
Ronny Aasen
Re: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2
Ceph upgrades are usually not a problem: ceph has to be upgraded in the right order. Normally, when each service is on its own machine, this is not difficult. But when you have mon, mgr, osd, mds, and clients on the same host you have to do it a bit carefully. I tend to have a terminal open with "watch ceph -s" running, and I never do another service until the health is OK again. First, apt upgrade the packages on all the hosts. This only updates the software on disk, not the running services. Then do the restart of services in the right order, and only on one host at a time.

mons: first restart the mon service on all mon-running hosts. All 3 mons are active at the same time, so there is no "shifting around", but make sure the quorum is OK again before you do the next mon.

mgr: then restart mgr on all hosts that run mgr. There is only one active mgr at a time, so here there will be a bit of shifting around, but it is only for statistics/management, so it may affect your "ceph -s" output but not cluster operation.

osd: restart osd processes one osd at a time; make sure the health is OK before doing the next osd process. Do this for all hosts that have osd's.

mds: restart the mds's one at a time. You will notice the standby mds taking over for the mds that was restarted. Do both.

clients: restart clients; that means remounting filesystems, migrating or restarting VMs, or restarting whatever process uses the old ceph libraries.

About pools: since you only have 2 osd's you can obviously not be running the recommended 3x replication pools. This makes me worry that you may be running size:2 min_size:1 pools, and are daily running the risk of data loss due to corruption and inconsistencies, especially when you restart osd's. If your pools are size:2 min_size:2 then your cluster will block when any osd is restarted, until the osd is up and healthy again, but you have less chance of data loss than with 2/1 pools. If you add an osd on a third host you can run size:3 min_size:2, the recommended config, where you have both redundancy and high availability.

kind regards
Ronny Aasen

On 11.04.2018 17:42, Ranjan Ghosh wrote: Ah, never mind, we've solved it. It was a firewall issue. The only thing that's weird is that it became an issue immediately after an update. Perhaps it has something to do with monitor nodes shifting around or something. Well, thanks again for your quick support, though. It's much appreciated. BR Ranjan

On 11.04.2018 17:07, Ranjan Ghosh wrote: Thank you for your answer. Do you have any specifics on which thread you're talking about? I would be very interested to read about a success story, because I fear that if I update the other node the whole cluster comes down.

On 11.04.2018 10:47, Marc Roos wrote: I think you have to update all osd's, mon's etc. I can remember running into a similar issue. You should be able to find more about this in the mailing list archive.

-Original Message- From: Ranjan Ghosh [mailto:gh...@pw6.de] Sent: Wednesday, 11 April 2018 16:02 To: ceph-users Subject: [ceph-users] Cluster degraded after Ceph Upgrade 12.2.1 => 12.2.2

Hi all, We have a two-node cluster (with a third "monitoring-only" node). Over the last months, everything ran *perfectly* smoothly. Today, I did an Ubuntu "apt-get upgrade" on one of the two servers. Among others, the ceph packages were upgraded from 12.2.1 to 12.2.2. A minor release update, one might think. But, to my surprise, after restarting the services, Ceph is now in a degraded state :-( (see below).
Only the first node - which is still on 12.2.1 - seems to be running. I did a bit of research and found this: https://ceph.com/community/new-luminous-pg-overdose-protection/ I did set "mon_max_pg_per_osd = 300" to no avail. I don't know if this is the problem at all. Looking at the status it seems we have 264 pgs, right? When I enter "ceph osd df" (which I found on another website claiming it should print the number of PGs per OSD), it just hangs (I need to abort with Ctrl+C). I hope anybody can help me. The cluster now works with the single node, but it is definitely quite worrying because we don't have redundancy. Thanks in advance, Ranjan

root@tukan2 /var/www/projects # ceph -s
  cluster:
    id:     19895e72-4a0c-4d5d-ae23-7f631ec8c8e4
    health: HEALTH_WARN
            insufficient standby MDS daemons available
            Reduced data availability: 264 pgs inactive
            Degraded data redundancy: 264 pgs unclean
  services:
    mon: 3 daemons, quorum tukan1,tukan2,tukan0
    mgr: tukan0(active), standbys: tukan2
    mds: cephfs-1/1/1 up {0=tukan2=up:active}
    osd: 2 osds: 2 up, 2 in
  data:
    pools:   3 pools, 264 pgs
    objects: 0 objects, 0 bytes
    usage:   0 kB used, 0 kB / 0 kB avail
    pgs:     100.000% pgs unknown
Re: [ceph-users] Bluestore and scrubbing/deep scrubbing
On 29.03.2018 20:02, Alex Gorbachev wrote: With a Luminous 12.2.4 cluster with Bluestore, I see a good deal of scrub and deep-scrub operations. I tried to find a reference, but there's nothing obvious out there - wasn't it supposed to no longer need scrubbing due to the CRC checks?

CRC gives you checks as you read data, i.e. you do not hand corrupt data to clients while thinking it is OK data. Scrubs periodically check your data for corruption by reading it and comparing it to the CRCs and the other replicas; this protects against bitrot [1]. CRC also helps the system know which object is good and which object is bad during a scrub. CRC is not a replacement for scrub, but a complement: it improves the quality of the data you provide to clients, and it makes it easier for scrub to detect errors.

kind regards
Ronny Aasen

[1] https://en.wikipedia.org/wiki/Data_degradation
Re: [ceph-users] split brain case
On 29.03.2018 11:13, ST Wong (ITSC) wrote: Hi, Thanks.

> of course the 4 osd's left working now want to self-heal by recreating all objects stored on the 4 split-off osd's and have a huge recovery job. And you risk that the osd's go into a too_full error, unless you have enough free space in your osd's to recreate all the data from the defective part of the cluster; or they will be stuck in recovery mode until you get the second room running, depending on your crush map.

Does that mean we have to give the 4 OSD machines sufficient space to hold all data, and thus the usable space will be halved?

Yes, if you want to be able to operate one room as if it were the whole cluster (HA) then you need this. Also, if you want to have 4+2 instead of 3+2 pool sizes to avoid the blocking during recovery, that would take a whole lot of extra space. You can optionally let the cluster run degraded with 4+2 while one room is down, or temporarily set pools to 2+2 while the other room is down, to reduce the space requirements.

> the point is that splitting the cluster hurts. And if HA is the most important thing, then you may want to check out rbd mirror.

We will consider it when there is budget to set up another ceph cluster for rbd mirror.

I do not know your needs or applications, but while you only have 2 rooms you may just think of it as a single cluster that happens to occupy 2 rooms. With that few osd's you should perhaps just put the cluster in a single room; the pain of splitting a cluster down the middle is quite significant, and I would perhaps use the resources to improve the redundancy of the networks between the buildings instead: have multiple paths between the buildings to prevent service disruption in the building that does not house the cluster. Having 5 mons is quite a lot; I think most clusters have 3 mons up to several hundred osd hosts. How many servers are your osd's split over? Keep in mind that ceph by default picks one osd from each host, so you would need a minimum of 4 osd hosts in total to be able to use 4+2 pools, and with only 4 hosts you have no failure domain to spare. But 4 hosts is the minimum sane starting point for a regular small cluster with 3+2 pools (you can lose a node and ceph self-heals as long as there is enough free space).

kind regards
Ronny Aasen
Re: [ceph-users] split brain case
On 29.03.2018 10:25, ST Wong (ITSC) wrote: Hi all, We put 8 (4+4) OSD and 5 (2+3) MON servers in server rooms in 2 buildings for redundancy. The buildings are connected through direct connection. While servers in each building have alternate uplinks. What will happen in case the link between the buildings is broken (application servers in each server room will continue to write to OSDs in the same room) ? Thanks a lot. Rgds /st wong my guesstimate is that the server room with 3 mons will retain quorum, and continue operation. the room with 2 mons will notice they are split out and block. assuming you have 3+2 pools and one of the objects is always in the other server room: some pg's will be active because you have 2 objects in the working room, but some pg's will be inactive until they can self-heal and backfill a second copy of the objects. i assume you could have 4+2 replication to avoid this issue. of course the 4 osd's left working now want to self-heal by recreating all objects stored on the 4 split off osd's and have a huge recovery job. and you may risk that the osd's go into a too_full error, unless you have free space in your osd's to recreate all the data in the defective part of the cluster. or they will be stuck in recovery mode until you get the second room running, this depends on your crush map. if you really need to split a cluster into separate rooms, i would have used 3 rooms, with redundant data paths between them. primary path between room A and C is direct. redundant path is via A-B-C. this should reduce the disaster if a single path is broken. with 1 mon in each room you can lose a whole room to power loss, and still have a working cluster. and you would only need 33% instead of 50% cluster capacity as free space in your cluster to be able to self-heal. the point is that splitting the cluster hurts. and if HA is the most important then you may want to check out rbd mirror. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Separate BlueStore WAL/DB : best scenario ?
keep in mind that with 4+2 =6 erasure coding, ceph can not self heal if a node dies if you have only 6 nodes. that means that you have a degraded cluster with low performance, and higher risk until you replace or fix or buy a new node. it is kind of like loosing a disk in raid5 you have to scramble to fix it asap. if you have an additional node. ceph can self heal if a node dies and you can look at it on monday after the meeting... no stress. in my opinion ceph's self healing is one of THE killer apps for ceph. that makes a ceph cluster robust and reliable. pity not to take advantage of that in your design/pool configuration. kind regards Ronny Aasen On 22.03.2018 10:53, Hervé Ballans wrote: Le 21/03/2018 à 11:48, Ronny Aasen a écrit : On 21. mars 2018 11:27, Hervé Ballans wrote: Hi all, I have a question regarding a possible scenario to put both wal and db in a separate SSD device for an OSD node composed by 22 OSDs (HDD SAS 10k 1,8 To). I'm thinking of 2 options (at about the same price) : - add 2 SSD SAS Write Intensive (10DWPD) - or add a unique SSD NVMe 800 Go (it's the minimum capacity currently on the market !..) In both case, that's a lot of partitions on each SSD disk, especially on the second solution where we would have 44 partitions (22 WAL and 22 DB) ! Is this solution workable (I mean in term of i/o speeds), or is it unsafe despite the high PCIe bus transfer rate ? I just want to talk here about throughput performances, not data integrity on the node in case of SSD crashes... Thanks in advance for your advices, if you put the wal and db on the same device anyway, there is no real benefit to having a partition for each. the reason you can split them up is if you have them on different devices. Eg db on ssd, but wal on nvram. it is easier to just colocat wal and db into the same partition since they live on the same device in your case anyway. if you have too many osd's db's on the same ssd, you may end up with the ssd beeing the bottleneck. 4 osd's db's on a ssd have been a "golden rule" on the mailinglist for a while. for nvram you can possibly have some more. but the bottleneck is only one part of the problem. when the 22 partitions db nvram dies, it brings down 22 osd's at once and will be a huge pain on your cluster. (depending on how large it is...) i would spread the db's on more devices to reduce the bottleneck and failure domains in this situation. Hi Ronny, Thank you for your clear answer. OK for putting both wal and db on the same partition, I didn't have this information, but indeed it seems more interesting in my case (in particular if I choose the fastest device, i.e. NVMe*) I plan to have 6 OSDs nodes (same configuration for each) but I don't know yet if I will use replication (x3) or Erasure Coding (4+2 ?) pools. Also in both cases, I could eventually accept the loss of a node on a reduced time (replacement of the journals disk + OSDs reconfiguration). But you're right, I will start on a configuration where I will spread the db's on at least 2 fast disks. Regards, Hervé * Just for information, I look closely at the SAMSUNG PM1725 NVMe PCIe SSD. The (theorical) technical specifications seem interesting, especilly on the IOPS : up to 750K IOPS for Random Read and 120K IOPS for Random Write... 
kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Separate BlueStore WAL/DB : best scenario ?
On 21. mars 2018 11:27, Hervé Ballans wrote: Hi all, I have a question regarding a possible scenario to put both wal and db on a separate SSD device for an OSD node composed of 22 OSDs (HDD SAS 10k 1,8 To). I'm thinking of 2 options (at about the same price) : - add 2 SSD SAS Write Intensive (10DWPD) - or add a unique SSD NVMe 800 Go (it's the minimum capacity currently on the market !..) In both cases, that's a lot of partitions on each SSD disk, especially on the second solution where we would have 44 partitions (22 WAL and 22 DB) ! Is this solution workable (I mean in terms of i/o speeds), or is it unsafe despite the high PCIe bus transfer rate ? I just want to talk here about throughput performance, not data integrity on the node in case of SSD crashes... Thanks in advance for your advice, if you put the wal and db on the same device anyway, there is no real benefit to having a partition for each. the reason you can split them up is if you have them on different devices, e.g. db on ssd, but wal on nvram. it is easier to just colocate wal and db in the same partition since they live on the same device in your case anyway. if you have too many osd's db's on the same ssd, you may end up with the ssd being the bottleneck. 4 osd's db's on a ssd has been a "golden rule" on the mailing list for a while. for nvram you can possibly have some more. but the bottleneck is only one part of the problem. when the nvme device holding the 22 db partitions dies, it brings down 22 osd's at once and will be a huge pain on your cluster (depending on how large it is...). i would spread the db's over more devices to reduce the bottleneck and failure domains in this situation. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
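For reference, creating one such OSD with its db (and the wal, which lives with the db unless told otherwise) on a separate fast device looks roughly like this on luminous with ceph-volume; the device names are placeholders and the db partition must be created beforehand:
# ceph-volume lvm create --bluestore --data /dev/sdc --block.db /dev/nvme0n1p1
With wal and db colocated on the same device there is no need for a separate --block.wal argument.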
Re: [ceph-users] Delete a Pool - how hard should be?
On 06. mars 2018 10:26, Max Cuttins wrote: On 05/03/2018 20:17, Gregory Farnum wrote: You're not wrong, and indeed that's why I pushed back on the latest attempt to make deleting pools even more cumbersome. But having a "trash" concept is also pretty weird. If admins can override it to just immediately delete the data (if they need the space), how is that different from just being another hoop to jump through? If we want to give the data owners a chance to undo, how do we identify and notify *them* rather than the admin running the command? But if admins can't override the trash and delete immediately, what do we do for things like testing and proofs of concept where large-scale data creates and deletes are to be expected? -Greg I'm talking about my experience: * Data Owners are a little bit in their LA LA LAND, and think that they can safely delete some of their data without losses. * Data Owners should think that their pool has really been deleted * Data Owners should not be aware of the existence of the "/trash/" * So Data Owners ask to restore from backup (but instead we'll simply use the trash). That said, we also have to think that: * The Administrator is always GOD, so he needs to be able to override if needed whenever he needs. * However the Administrator should just put the pool in status delete, without overriding this behaviour if there is no need to do so. * Override should be allowed only with many cumbersome warnings telling you that YOU SHOULD NOT OVERRIDE - PLEASE AVOID OVERRIDE I don't like that the software can limit administrators in doing their job... in the end the Administrator will always find a way to do what he wants (it's root). Of course I like the feature to push the Admin to follow the right behaviour. some sort of active/inactive toggle both on RBD images, pools, buckets and filesystem trees is nice to allow admins to perform scream tests. "data owner requests deletion - admin disables pool (kicks all clients) - data owner screams - admin reactivates" sounds much better than the last step being the admin checking if the backups are good... i try to do something similar by renaming pools to be deleted, but that is not always the same as inactive. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
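The rename trick is a one-liner (pool names are examples); clients still pointing at the old name will start failing, which is exactly the scream test, and the data stays intact until the pool is really deleted:
# ceph osd pool rename images images-to-be-deleted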
Re: [ceph-users] Ceph newbie(?) issues
On 05. mars 2018 14:45, Jan Marquardt wrote: Am 05.03.18 um 13:13 schrieb Ronny Aasen: i had some similar issues when i started my proof of concept. especialy the snapshot deletion i remember well. the rule of thumb for filestore that i assume you are running is 1GB ram per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for osd's + some GB's for the mon service, + some GB's for the os itself. i suspect if you inspect your dmesg log and memory graphs you will find that the out of memory killer ends your osd's when the snap deletion (or any other high load task) runs. I ended up reducing the number of osd's per node, since the old mainboard i used was maxed for memory. Well, thanks for the broad hint. Somehow I assumed we fulfill the recommendations, but of course you are right. We'll check if our boards support 48 GB RAM. Unfortunately, there are currently no corresponding messages. But I can't rule out that there haven't been any. corruptions occured for me as well. and they was normaly associated with disks dying or giving read errors. ceph often managed to fix them but sometimes i had to just remove the hurting OSD disk. hage some graph's to look at. personaly i used munin/munin-node since it was just an apt-get away from functioning graphs also i used smartmontools to send me emails about hurting disks. and smartctl to check all disks for errors. I'll check S.M.A.R.T stuff. I am wondering if scrubbing errors are always caused by disk problems or if they also could be triggered by flapping OSDs or other circumstances. good luck with ceph ! Thank you! in my not that extensive experience, schrub errors come mainly from 2 issues. Either disk's giving read errors (should be visible both in the log and dmesg.) or having pools with size=2/min_size=1 instead of the default and recomended size=3/min_size=2 but i can not say that they do not come from crashing OSD's but my case the osd kept crashing due to bad disk and/or low memory. If you have scrub errors you can not get rid of on filestore (not bluestore!) you can read the two following urls. http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ basicaly the steps are: - find the pg :: rados list-inconsistent-pg [pool] - find the problem :: rados list-inconsistent-obj 0.6 --format=json-pretty ; give you the object name look for hints to what is the bad object - find the object on disks :: manually check the objects on each osd for the given pg, check the object metadata (size/date/etc), run md5sum on them all and compare. check objects on the nonrunning osd's and compare there as well. anything to try to determine what object is ok and what is bad. - fix the problem :: assuming you find the bad object, stop the affected osd with the bad object, remove the object manually, restart osd. issue repair command. Once i fixed my min_size=1 misconfiguration, and pulled the dying (but functional) disks from my cluster, and reduced osd count to prevent dying osd's all of those scrub errors went away. have not seen one in 6 months now. kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
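Put together, the repair procedure above looks roughly like this (pool name and pg id are taken from the thread; adapt them to your own ceph health detail output):
# rados list-inconsistent-pg rbd
# rados list-inconsistent-obj 0.1b2 --format=json-pretty
# ceph pg repair 0.1b2
The manual object comparison and removal step in between is only needed when repair alone cannot decide which copy is good.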
Re: [ceph-users] Ceph newbie(?) issues
On 05. mars 2018 11:21, Jan Marquardt wrote: Hi, we are relatively new to Ceph and are observing some issues, where I'd like to know how likely they are to happen when operating a Ceph cluster. Currently our setup consists of three servers which are acting as OSDs and MONs. Each server has two Intel Xeon L5420 (yes, I know, it's not state of the art, but we thought it would be sufficient for a Proof of Concept. Maybe we were wrong?) and 24 GB RAM and is running 8 OSDs with 4 TB harddisks. 4 OSDs are sharing one SSD for journaling. We started on Kraken and upgraded lately to Luminous. The next two OSD servers and three separate MONs are ready for deployment. Please find attached our ceph.conf. Current usage looks like this: data: pools: 1 pools, 768 pgs objects: 5240k objects, 18357 GB usage: 59825 GB used, 29538 GB / 89364 GB avail We have only one pool which is exclusively used for rbd. We started filling it with data and creating snapshots in January until Mid of February. Everything was working like a charm until we started removing old snapshots then. While we were removing snapshots for the first time, OSDs started flapping. Besides this there was no other load on the cluster. For idle times we solved it by adding osd snap trim priority = 1 osd snap trim sleep = 0.1 to ceph.conf. When there is load from other operations and we remove big snapshots OSD flapping still occurs. Last week our first scrub errors appeared. Repairing the first one was no big deal. The second one however was, because the instructed OSD started crashing. First on Friday osd.17 and today osd.11. ceph1:~# ceph pg repair 0.1b2 instructing pg 0.1b2 on osd.17 to repair ceph1:~# ceph pg repair 0.1b2 instructing pg 0.1b2 on osd.11 to repair I am still researching on the crashes, but already would be thankful for any input. Any opinions, hints and advices would really be appreciated. i had some similar issues when i started my proof of concept. especialy the snapshot deletion i remember well. the rule of thumb for filestore that i assume you are running is 1GB ram per TB of osd. so with 8 x 4TB osd's you are looking at 32GB of ram for osd's + some GB's for the mon service, + some GB's for the os itself. i suspect if you inspect your dmesg log and memory graphs you will find that the out of memory killer ends your osd's when the snap deletion (or any other high load task) runs. I ended up reducing the number of osd's per node, since the old mainboard i used was maxed for memory. corruptions occured for me as well. and they was normaly associated with disks dying or giving read errors. ceph often managed to fix them but sometimes i had to just remove the hurting OSD disk. hage some graph's to look at. personaly i used munin/munin-node since it was just an apt-get away from functioning graphs also i used smartmontools to send me emails about hurting disks. and smartctl to check all disks for errors. good luck with ceph ! kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cannot delete a pool
On 01. mars 2018 13:04, Max Cuttins wrote: I was testing IO and I created a bench pool. But if I tried to delete I get: Error EPERM: pool deletion is disabled; you must first set the mon_allow_pool_delete config option to true before you can destroy a pool So I run: ceph tell mon.\* injectargs '--mon-allow-pool-delete=true' mon.ceph-node1: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.ceph-node2: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) mon.ceph-node3: injectargs:mon_allow_pool_delete = 'true' (not observed, change may require restart) I restarted all the nodes. But the flag has not been observed. Is this the right way to remove a pool? i think you need to set the option in the ceph.conf of the monitors. and then restart the mon's one by one. afaik that is by design. https://blog.widodh.nl/2015/04/protecting-your-ceph-pools-against-removal-or-property-changes/ kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
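In other words, something like this on each monitor host, restarting the mons one at a time, and then the delete itself (pool name bench as in the mail; the repeated pool name and the flag are required by the command):
[mon]
mon_allow_pool_delete = true
# systemctl restart ceph-mon@ceph-node1
# ceph osd pool delete bench bench --yes-i-really-really-mean-it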
Re: [ceph-users] Install previous version of Ceph
On 23. feb. 2018 23:37, Scottix wrote: Hey, We had one of our monitor servers die on us and I have a replacement computer now. In between that time you have released 12.2.3 but we are still on 12.2.2. We are on Ubuntu servers I see all the binaries are in the repo but your package cache only shows 12.2.3, is there a reason for not keeping the previous builds like in my case. I could do an install like apt install ceph-mon=12.2.2 Also how would I go installing 12.2.2 in my scenario since I don't want to update till have this monitor running again. Thanks, Scott did you figure out a solution to this ? I have the same problem now. I assume you have to download the old version manually and install with dpkg -i optionally mirror the ceph repo and build your own repo index containing all versions. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
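A sketch of both approaches on Ubuntu, assuming the standard download.ceph.com repository layout (the exact version strings below are illustrative and should be checked against the repo's pool directory):
# apt-get install ceph-mon=12.2.2-1xenial ceph-common=12.2.2-1xenial
or download the matching .deb files and install them directly:
# dpkg -i ceph-common_12.2.2-1xenial_amd64.deb ceph-mon_12.2.2-1xenial_amd64.deb
An apt pin in /etc/apt/preferences.d/ceph.pref can then keep the host from being upgraded past that version:
Package: ceph*
Pin: version 12.2.2*
Pin-Priority: 1001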
Re: [ceph-users] Luminous and calamari
On 16.02.2018 06:20, Laszlo Budai wrote: Hi, I've just started up the dashboard component of the ceph mgr. It looks OK, but from what can be seen, and what I was able to find in the docs, the dashboard is just for monitoring. Is there any plugin that allows management of the ceph resources (pool create/delete)? openattic allows for web administration, but i think it is only possible to run it comfortably on opensuse leap atm. I could not find updated debian packages last time i checked. proxmox also allows for ceph administration, but proxmox is probably a bit overkill for only ceph admin, since it is a web admin tool for kvm vm's and lxc containers as well as ceph. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
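For completeness, the luminous mgr dashboard referred to above (read-only) is enabled and located with:
# ceph mgr module enable dashboard
# ceph mgr services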
Re: [ceph-users] Query regarding min_size.
On 03. jan. 2018 14:51, James Poole wrote: Hi all, Whilst on a training course recently I was told that 'min_size' had an affect on client write performance, in that it's the required number of copies before ceph reports back to the client that an object has been written therefore setting a 'min_size' of 0 would only require a write to be accepted by the journal before confirming it's been accepted. This is contrary to further reading elsewhere that the 'min_size' is the minimum number of copies required of an object to allow I/O and that 'size' is the parameter that would affect write speed i.e. desired number of replicas. Setting 'min_size' to 0 with a 'size' of 3 you would still have an effective 'min_size' of 2 from: https://raw.githubusercontent.com/ceph/ceph/master/doc/release-notes.rst "* Degraded mode (when there fewer than the desired number of replicas) is now more configurable on a per-pool basis, with the min_size parameter. By default, with min_size 0, this allows I/O to objects with N - floor(N/2) replicas, where N is the total number of expected copies. Argonaut behavior was equivalent to having min_size = 1, so I/O would always be possible if any completely up to date copy remained. min_size = 1 could result in lower overall availability in certain cases, such as flapping network partition" Which leads to the conclusion that changing 'min_size' has nothing to do with performance but is solely related to data integrity/resilience. Could someone confirm my assertion is correct? Many thanks James you are correct that it is related to data integrity. the writes to a osd filestore is allways acked internally when it have hit the journal. unrelated to size/min_size. in normal operation, all osd's must ack the write before the write is acked to the client: iow all 3 (size 3) must ack. and min_size is not relevant in any case. min_size is only relevant when a pg is degraded while being remapped or backfilled (or degraded because of no space to remap/backfill into) because of a osd or node failure. in that case min_size specify how many osd's must ack the write before the write is acked to the client. since failure is most likely when disks are stressing (eg with rebuild), reducing min_size is just asking for corruption and data loss. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Running Jewel and Luminous mixed for a longer period
On 30.12.2017 15:41, Milanov, Radoslav Nikiforov wrote: Performance as well - in my testing FileStore was much quicker than BlueStore. with filestore you often have an ssd journal in front; this will often mask/hide slow spinning disk write performance, until the journal size becomes the bottleneck. with bluestore only the metadata db and wal are on ssd, so there is no double write, and there is no journal bottleneck. but write latency will be the speed of the disk, and not the speed of the ssd journal. this will feel like a write performance regression. you can use bcache in front of bluestore to regain the "journal + double write" characteristic of filestore+journal. kind regards ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph status doesnt show available and used disk space after upgrade
On 20.12.2017 19:02, kevin parrikar wrote: hi All, I have upgraded the cluster from Hammer to Jewel and to Luminous . i am able to upload/download glance images but ceph -s shows 0kb used and Available and probably because of that cinder create is failing. ceph -s cluster: id: 06c5c906-fc43-499f-8a6f-6c8e21807acf health: HEALTH_WARN Reduced data availability: 6176 pgs inactive Degraded data redundancy: 6176 pgs unclean services: mon: 3 daemons, quorum controller3,controller2,controller1 mgr: controller3(active) osd: 71 osds: 71 up, 71 in rgw: 1 daemon active data: pools: 4 pools, 6176 pgs objects: 0 objects, 0 bytes usage: 0 kB used, 0 kB / 0 kB avail pgs: 100.000% pgs unknown 6176 unknown i deployed ceph-mgr using ceph-deploy gather-keys && ceph-deploy mgr create ,it was successfull but for some reason ceph -s is not showing correct values. Can some one help me here please Regards, Kevin is ceph-mgr actually running ? all statistics now require a ceph-mgr to be running. also check the mgr's logfile to see if it is able to authenticate/start properly. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
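A quick way to check that, assuming the mgr id matches the hostname as in the status output above and the default log location:
# ceph -s | grep mgr
# systemctl status ceph-mgr@controller3
# tail -n 50 /var/log/ceph/ceph-mgr.controller3.log
If the daemon is running but cannot authenticate, the log will typically show access denied / auth errors.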
Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)
if you have a global setting in ceph.conf it will only affect the creation of new pools. i reccomend using the default size:3 + min_size:2 also check your pools that you have min_size=2 kind regards Ronny Aasen On 15.12.2017 23:00, James Okken wrote: This whole effort went extremely well, thanks to Cary, and Im not used to that with CEPH so far. (And openstack ever) Thank you Cary. Ive upped the replication factor and now I see "replicated size 3" in each of my pools. Is this the only place to check replication level? Is there a Global setting or only a setting per Pool? ceph osd pool ls detail pool 0 'rbd' replicated size 3.. pool 1 'images' replicated size 3... ... One last question! At this replication level how can I tell how much total space I actually have now? Do I just 1/3 the Global size? ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 13680G 12998G 682G 4.99 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd 0 0 0 6448G 0 images 1 216G 3.24 6448G 27745 backups 2 0 0 6448G 0 volumes 3 117G 1.79 6448G 30441 compute 4 0 0 6448G 0 ceph osd df ID WEIGHT REWEIGHT SIZE USEAVAIL %USE VAR PGS 0 0.81689 1.0 836G 36549M 800G 4.27 0.86 67 4 3.7 1.0 3723G 170G 3553G 4.58 0.92 270 1 0.81689 1.0 836G 49612M 788G 5.79 1.16 56 5 3.7 1.0 3723G 192G 3531G 5.17 1.04 282 2 0.81689 1.0 836G 33639M 803G 3.93 0.79 58 3 3.7 1.0 3723G 202G 3521G 5.43 1.09 291 TOTAL 13680G 682G 12998G 4.99 MIN/MAX VAR: 0.79/1.16 STDDEV: 0.67 Thanks! -Original Message- From: Cary [mailto:dynamic.c...@gmail.com] Sent: Friday, December 15, 2017 4:05 PM To: James Okken Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster) James, Those errors are normal. Ceph creates the missing files. You can check "/var/lib/ceph/osd/ceph-6", before and after you run those commands to see what files are added there. Make sure you get the replication factor set. Cary -Dynamic On Fri, Dec 15, 2017 at 6:11 PM, James Okken <james.ok...@dialogic.com> wrote: Thanks again Cary, Yes, once all the backfilling was done I was back to a Healthy cluster. I moved on to the same steps for the next server in the cluster, it is backfilling now. Once that is done I will do the last server in the cluster, and then I think I am done! Just checking on one thing. I get these messages when running this command. I assume this is OK, right? root@node-54:~# ceph-osd -i 4 --mkfs --mkkey --osd-uuid 25c21708-f756-4593-bc9e-c5506622cf07 2017-12-15 17:28:22.849534 7fd2f9e928c0 -1 journal FileJournal::_open: disabling aio for non-block journal. Use journal_force_aio to force use of aio anyway 2017-12-15 17:28:22.855838 7fd2f9e928c0 -1 journal FileJournal::_open: disabling aio for non-block journal. 
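For the question about where to check: "ceph osd pool ls detail" or "ceph osd pool get <pool> size" per pool is authoritative; the only global knobs are osd_pool_default_size / osd_pool_default_min_size, which, as noted above, affect new pools only. A quick loop over all pools:
# for p in $(ceph osd pool ls); do echo -n "$p "; ceph osd pool get $p size; done
# for p in $(ceph osd pool ls); do echo -n "$p "; ceph osd pool get $p min_size; done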
Use journal_force_aio to force use of aio anyway 2017-12-15 17:28:22.856444 7fd2f9e928c0 -1 filestore(/var/lib/ceph/osd/ceph-4) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory 2017-12-15 17:28:22.893443 7fd2f9e928c0 -1 created object store /var/lib/ceph/osd/ceph-4 for osd.4 fsid 2b9f7957-d0db-481e-923e-89972f6c594f 2017-12-15 17:28:22.893484 7fd2f9e928c0 -1 auth: error reading file: /var/lib/ceph/osd/ceph-4/keyring: can't open /var/lib/ceph/osd/ceph-4/keyring: (2) No such file or directory 2017-12-15 17:28:22.893662 7fd2f9e928c0 -1 created new key in keyring /var/lib/ceph/osd/ceph-4/keyring thanks -Original Message- From: Cary [mailto:dynamic.c...@gmail.com] Sent: Thursday, December 14, 2017 7:13 PM To: James Okken Cc: ceph-users@lists.ceph.com Subject: Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster) James, Usually once the misplaced data has balanced out the cluster should reach a healthy state. If you run a "ceph health detail" Ceph will show you some more detail about what is happening. Is Ceph still recovering, or has it stalled? has the "objects misplaced (62.511%" changed to a lower %? Cary -Dynamic On Thu, Dec 14, 2017 at 10:52 PM, James Okken <james.ok...@dialogic.com> wrote: Thanks Cary! Your directions worked on my first sever. (once I found the missing carriage return in your list of commands, the email musta messed it up. For anyone else: chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring really is 2 commands: chown -R ceph:ceph /var/lib/ceph/osd/ceph-4 and ceph auth add osd.4 osd 'allow *' mon 'allow profile osd' -i /etc/ceph/ceph.osd.4.keyring Cary, what am I looking for in ceph -w and c
Re: [ceph-users] add hard drives to 3 CEPH servers (3 server cluster)
On 14.12.2017 18:34, James Okken wrote: Hi all, Please let me know if I am missing steps or using the wrong steps I'm hoping to expand my small CEPH cluster by adding 4TB hard drives to each of the 3 servers in the cluster. I also need to change my replication factor from 1 to 3. This is part of an Openstack environment deployed by Fuel and I had foolishly set my replication factor to 1 in the Fuel settings before deploy. I know this would have been done better at the beginning. I do want to keep the current cluster and not start over. I know this is going thrash my cluster for a while replicating, but there isn't too much data on it yet. To start I need to safely turn off each CEPH server and add in the 4TB drive: To do that I am going to run: ceph osd set noout systemctl stop ceph-osd@1 (or 2 or 3 on the other servers) ceph osd tree (to verify it is down) poweroff, install the 4TB drive, bootup again ceph osd unset noout Next step wouyld be to get CEPH to use the 4TB drives. Each CEPH server already has a 836GB OSD. ceph> osd df ID WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 0.81689 1.0 836G 101G 734G 12.16 0.90 167 1 0.81689 1.0 836G 115G 721G 13.76 1.02 166 2 0.81689 1.0 836G 121G 715G 14.49 1.08 179 TOTAL 2509G 338G 2171G 13.47 MIN/MAX VAR: 0.90/1.08 STDDEV: 0.97 ceph> df GLOBAL: SIZE AVAIL RAW USED %RAW USED 2509G 2171G 338G 13.47 POOLS: NAMEID USED %USED MAX AVAIL OBJECTS rbd 0 0 0 2145G 0 images 1 216G 9.15 2145G 27745 backups 2 0 0 2145G 0 volumes 3 114G 5.07 2145G 29717 compute 4 0 0 2145G 0 Once I get the 4TB drive into each CEPH server should I look to increasing the current OSD (ie: to 4836GB)? Or create a second 4000GB OSD on each CEPH server? If I am going to create a second OSD on each CEPH server I hope to use this doc: http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/ As far as changing the replication factor from 1 to 3: Here are my pools now: ceph osd pool ls detail pool 0 'rbd' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 1 flags hashpspool stripe_width 0 pool 1 'images' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 116 flags hashpspool stripe_width 0 removed_snaps [1~3,b~6,12~8,20~2,24~6,2b~8,34~2,37~20] pool 2 'backups' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 7 flags hashpspool stripe_width 0 pool 3 'volumes' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 256 pgp_num 256 last_change 73 flags hashpspool stripe_width 0 removed_snaps [1~3] pool 4 'compute' replicated size 1 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 64 pgp_num 64 last_change 34 flags hashpspool stripe_width 0 I plan on using these steps I saw online: ceph osd pool set rbd size 3 ceph -s (Verify that replication completes successfully) ceph osd pool set images size 3 ceph -s ceph osd pool set backups size 3 ceph -s ceph osd pool set volumes size 3 ceph -s please let me know any advice or better methods... you normaly want each drive to be it's own osd. it is the number of osd's that give ceph it's scaleabillity. so more osd's = more aggeregate performance. only exception is if you are limited by something like cpu or ram and must limit osd count becouse of that. also remember to up your min_size from 1 to the default 2. with 1 your cluster will accept writes with only a single operational osd. and if that one fail you will have dataloss corruption and inconsistencies. 
you might also consider upping your size and min_size before taking down an osd, since you obviously will have the pg's on that osd unavailable, and you may want to have the extra redundancy before shaking the tree. with max usage 15% on the most used OSD you should have the space for it. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Corrupted files on CephFS since Luminous upgrade
On 08. des. 2017 14:49, Florent B wrote: On 08/12/2017 14:29, Yan, Zheng wrote: On Fri, Dec 8, 2017 at 6:51 PM, Florent B <flor...@coppint.com> wrote: I don't know I didn't touched that setting. Which one is recommended ? If multiple dovecot instances are running at the same time and they all modify the same files. you need to set fuse_disable_pagecache to true. Ok, but in my configuration, each mail user is mapped to a single server. So files are accessed only by a single server at a time. how about mail delivery ? if you use dovecot deliver a delivery can occur (and rewrite dovecot index/cache) at the same time as a user accesses imap and writes to dovecot index/cache. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] I cannot make the OSD to work, Journal always breaks 100% time
entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 2017-12-05 13:19:04.442866 7f243d9a1700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362 os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error") ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55569c1ff790] 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e] 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b] 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961] 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. 0> 2017-12-05 13:19:04.442866 7f243d9a1700 -1 os/filestore/FileStore.cc: In function 'void FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f243d9a1700 time 2017-12-05 13:19:04.435362 os/filestore/FileStore.cc: 2930: FAILED assert(0 == "unexpected error") ceph version 10.2.10 (5dc1e4c05cb68dbf62ae6fce3f0700e4654fdbbe) 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x80) [0x55569c1ff790] 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, ThreadPool::TPHandle*)+0xb8e) [0x55569be9d58e] 3: (FileStore::_do_transactions(std::vector<ObjectStore::Transaction, std::allocator >&, unsigned long, ThreadPool::TPHandle*)+0x3b) [0x55569bea3a1b] 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x39d) [0x55569bea3ded] 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xdb1) [0x55569c1f1961] 6: (ThreadPool::WorkThread::entry()+0x10) [0x55569c1f2a60] 7: (()+0x76ba) [0x7f24503e36ba] 8: (clone()+0x6d) [0x7f244e45b3dd] NOTE: a copy of the executable, or `objdump -rdS ` is needed to interpret this. *** Caught signal (Aborted) ** in thread 7f243d1a0700 thread_name:tp_fstore_op I tried to boot it several times. I zero the journal dd if=/dev/zero of=/dev/sde2 This probably kills the OSD, at the very least it destroys objects that was written to journal (and cluster assumed was safe), unless you flushed it successfully previously. create a new journal ceph-osd --mkjournal -i 6 Flush it. But's empty so ok. /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph --flush-journal and boot manually the osd. /usr/bin/ceph-osd -f --cluster ceph --id 6 --setuser ceph --setgroup ceph Then it breaks. I pasted bin my whole configuration in https://pastebin.com/QfrE71Dg. But I changed also the journal partition from sde4 to sde2 to see if this has something to do. sde is SSD disk so wanted to see no block is corrupting everything. Nothing it breaks 100% of time after a while. I'm desperate to see how it breaks. I must say that this is other OSD that failed and I recovered. Smartscan long is correct xfs_repair is ok on disk everything seems correct. But it keep crashing. Any advice? 
Can I run the disk without a journal for a while until all pgs are backed up to the other disks? I just increased the size of the pools and min size as well, and I need this disk in order to recover all information. you need this disk to recover all information ? do you not have replication, so the objects are safe? i can not see from your pastebin that you have missing objects (that are only on this one disk). if you need the actual objects from this disk, then you need to do a recovery. that is a whole other job. if you only need the space of the disk, then you should zap and wipe it, and insert it as a new fresh OSD. but these 2 lines from your pastebin are a bit over the top. how you can have this many degraded objects based on only 289090 objects is hard to understand. recovery 20266198323167232/289090 objects degraded (7010342219781.809%) 37154696925806625 scrub errors i have not seen that before so hopefully someone else can chime in. also what exact os kernel and ceph versions are you running? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
just as long as you are aware that size=3, min_size=2 is the right config for everyone except those that really know what they are doing. and if you ever run min_size=1 you better be expecting to corrupt your cluster sooner or later. Ronny On 05.12.2017 21:22, Denes Dolhay wrote: Hi, So for this to happen you have to lose another osd before backfilling is done. Thank You! This clarifies it! Denes On 12/05/2017 03:32 PM, Ronny Aasen wrote: On 05. des. 2017 10:26, Denes Dolhay wrote: Hi, This question popped up a few times already under filestore and bluestore too, but please help me understand, why this is? "when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know witch of the 2 objects are the correct one." Doesn't it use an epoch, or an omap epoch when storing new data? If so why can it not use the recent one? this have been discussed a few times on the list. generally you have 2 disks. first disk fail. and writes happen to the other disk.. first disk recovers, and second disk fail before recovery is done. writes happen to second disk.. all objects have correct checksum. and both osd's think they are the correct one. so your cluster is inconsistent. so bluestore checksums does not solve this problem, both objects are objectivly "correct" :) with min_size =2 the cluster would not accept a write unless 2 disks accepted the write. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 05. des. 2017 10:26, Denes Dolhay wrote: Hi, This question popped up a few times already under filestore and bluestore too, but please help me understand, why this is? "when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know which of the 2 objects is the correct one." Doesn't it use an epoch, or an omap epoch when storing new data? If so why can it not use the recent one? this has been discussed a few times on the list. generally you have 2 disks. the first disk fails, and writes happen to the other disk.. the first disk recovers, and the second disk fails before recovery is done. writes now happen to the first disk.. all objects have correct checksums, and both osd's think they are the correct one. so your cluster is inconsistent. so bluestore checksums do not solve this problem, both objects are objectively "correct" :) with min_size=2 the cluster would not accept a write unless 2 disks accepted the write. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Adding multiple OSD
On 05. des. 2017 00:14, Karun Josy wrote: Thank you for detailed explanation! Got one another doubt, This is the total space available in the cluster : TOTAL : 23490G Use : 10170G Avail : 13320G But ecpool shows max avail as just 3 TB. What am I missing ? == $ ceph df GLOBAL: SIZE AVAIL RAW USED %RAW USED 23490G 13338G 10151G 43.22 POOLS: NAME ID USED %USED MAX AVAIL OBJECTS ostemplates 1 162G 2.79 1134G 42084 imagepool 34 122G 2.11 1891G 34196 cvm1 54 8058 0 1891G 950 ecpool1 55 4246G 42.77 3546G 1232590 $ ceph osd df ID CLASS WEIGHT REWEIGHT SIZE USE AVAIL %USE VAR PGS 0 ssd 1.86469 1.0 1909G 625G 1284G 32.76 0.76 201 1 ssd 1.86469 1.0 1909G 691G 1217G 36.23 0.84 208 2 ssd 0.87320 1.0 894G 587G 306G 65.67 1.52 156 11 ssd 0.87320 1.0 894G 631G 262G 70.68 1.63 186 3 ssd 0.87320 1.0 894G 605G 288G 67.73 1.56 165 14 ssd 0.87320 1.0 894G 635G 258G 71.07 1.64 177 4 ssd 0.87320 1.0 894G 419G 474G 46.93 1.08 127 15 ssd 0.87320 1.0 894G 373G 521G 41.73 0.96 114 16 ssd 0.87320 1.0 894G 492G 401G 55.10 1.27 149 5 ssd 0.87320 1.0 894G 288G 605G 32.25 0.74 87 6 ssd 0.87320 1.0 894G 342G 551G 38.28 0.88 102 7 ssd 0.87320 1.0 894G 300G 593G 33.61 0.78 93 22 ssd 0.87320 1.0 894G 343G 550G 38.43 0.89 104 8 ssd 0.87320 1.0 894G 267G 626G 29.90 0.69 77 9 ssd 0.87320 1.0 894G 376G 518G 42.06 0.97 118 10 ssd 0.87320 1.0 894G 322G 571G 36.12 0.83 102 19 ssd 0.87320 1.0 894G 339G 554G 37.95 0.88 109 12 ssd 0.87320 1.0 894G 360G 534G 40.26 0.93 112 13 ssd 0.87320 1.0 894G 404G 489G 45.21 1.04 120 20 ssd 0.87320 1.0 894G 342G 551G 38.29 0.88 103 23 ssd 0.87320 1.0 894G 148G 745G 16.65 0.38 61 17 ssd 0.87320 1.0 894G 423G 470G 47.34 1.09 117 18 ssd 0.87320 1.0 894G 403G 490G 45.18 1.04 120 21 ssd 0.87320 1.0 894G 444G 450G 49.67 1.15 130 TOTAL 23490G 10170G 13320G 43.30 Karun Josy On Tue, Dec 5, 2017 at 4:42 AM, Karun Josy <karunjo...@gmail.com <mailto:karunjo...@gmail.com>> wrote: Thank you for detailed explanation! Got one another doubt, This is the total space available in the cluster : TOTAL 23490G Use 10170G Avail : 13320G But ecpool shows max avail as just 3 TB. without knowing details of your cluster, this is just assumption guessing, but... perhaps one of your hosts have less free space then the others, replicated can pick 3 of the hosts that have plenty of space, but erasure perhaps require more hosts, so the host with least space is the limiting factor. check ceph osd df tree to see how it looks. kinds regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 05. des. 2017 09:18, Gonzalo Aguilar Delgado wrote: Hi, I created this. http://paste.debian.net/999172/ But the expiration date is too short. So I did this too https://pastebin.com/QfrE71Dg. What I want to mention is that there's no known cause for what's happening. It's true that time desync happens on reboot because of a few millis skew. But ntp corrects it fast. There are no network issues and the log of the osd is in the output. I only see in other osds the errors that are becoming more and more usual: 2017-12-05 08:58:56.637773 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 2: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od alloc_hint [0 0]) 2017-12-05 08:58:56.637775 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a shard 6: soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head data_digest 0xfae07534 != data_digest 0xe2de2a76 from auth oi 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head(3873'5250781 client.5697316.0:51282235 dirty|data_digest|omap_digest s 4194304 uv 5250781 dd e2de2a76 od alloc_hint [0 0]) 2017-12-05 08:58:56.63 7f0feff7f700 -1 log_channel(cluster) log [ERR] : 10.7a soid 10:5ff4f7a3:::rbd_data.56bf3a4775a618.2efa:head: failed to pick suitable auth object Digests not matching basically. Someone told me that this can be caused by a faulty disk. So I replaced the offending drive, and now I find the new disk is doing the same. Ok. But this thread is not for checking the source of the problem. This will be done later. This thread is to try to recover an OSD that seems ok to the object store tool. That is: why does it break here? if i get errors on a disk that i suspect are from reasons other than the disk being faulty, i remove the disk from the cluster, run it thru smart disk tests + long test, then run it thru the vendor's diagnostic tools (i have a separate 1u machine for this). if the disk clears as OK i wipe it and reinsert it as a new OSD. the reason you are getting corrupt digests is probably the very common way most people get corruptions.. you have size=2 , min_size=1. when you have 2 different objects, both with correct digests, in your cluster, the cluster can not know which of the 2 objects is the correct one. just search this list for all the users that end up in your situation for the same reason, also read this : http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-March/016663.html simple rule of thumb size=2, min_size=1 :: i do not care about my data, the data is volatile, but i want the cluster to accept writes _all the time_ size=2, min_size=2 :: i can not afford real redundancy, but i do care a little about my data, i accept that the cluster will block writes in error situations until the problem is fixed. size=3, min_size=2 :: i want safe and available data, and i understand that the ceph defaults are there for a reason. basically: size=3, min_size=2 if you want to avoid corruptions. remove-wipe-reinstall disks that have developed corruptions/inconsistencies with the cluster. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
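The smart test sequence referred to above is roughly this (the device name is a placeholder; the long test runs in the background and can take hours, so check the result afterwards):
# smartctl -t long /dev/sdX
# smartctl -l selftest /dev/sdX
# smartctl -a /dev/sdX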
Re: [ceph-users] HELP with some basics please
On 04.12.2017 19:18, tim taler wrote: With size=2 losing any 2 disks on different hosts would probably cause data to be unavailable / lost, as the pg copies are randomly distributed across the osds. Chances are that you can find a pg whose acting group is the two failed osds (you lost all your replicas) okay I see, getting clearer at least ;-) you can also consider running size=2, min_size=2 while restructuring. it will block your problematic pg's if there is a failure, until the rebuild/rebalance is done. But it should be a bit more resistant to full cluster loss and/or corruption. basically it means: if there are fewer than 2 copies, do not accept writes. whether you want to do this depends on your requirements: is it a bigger disaster to be unavailable for a while, or to have to restore from backup. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Another OSD broken today. How can I recover it?
On 04. des. 2017 10:22, Gonzalo Aguilar Delgado wrote: Hello, Things are getting worse every day. ceph -w cluster 9028f4da-0d77-462b-be9b-dbdf7fa57771 health HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds 8 pgs inconsistent 1 pgs repair 1 pgs stale 1 pgs stuck stale recovery 20266198323167232/288980 objects degraded (7013010700798.405%) 37154696925806624 scrub errors no legacy OSD present but 'sortbitwise' flag is not set But I'm finally finding time to recover. The disk seems to be correct, no smart errors and everything looks fine, just ceph not starting. Today I started to look at the ceph-objectstore-tool, which I don't really know much about. It just works nicely; no crash like on the OSD, which I had expected. So I'm lost. Since both the OSD and the ceph objectstore tool use the same backend, how is this possible? Can someone help me on fixing this, please? this line seems quite insane: recovery 20266198323167232/288980 objects degraded (7013010700798.405%) there is obviously something wrong in your cluster. once the defective osd is down/out, does the cluster eventually heal to HEALTH_OK ? you should start by reading and understanding this page: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-osd/ also, in order to get assistance you need to provide a lot more detail. how many nodes, how many osd's per node, what kind of nodes (cpu/ram), what kind of networking setup. show the output from ceph -s ceph osd tree ceph osd pool ls detail ceph health detail since you are systematically losing osd's i would start by checking the timestamp in the defective osd for when it died. double-check your clock sync settings so that all servers are time synchronized, and then check all logs for the time in question. especially dmesg: did the OOM killer do something ? was networking flaky ? mon logs ? did they complain about the osd in some fashion ? also, since you fail to start the osd again there is probably some corruption going on. bump the log level for that osd in the node's ceph.conf, something like [osd.XX] debug osd = 20 rename the log for the osd so you have a fresh file, and try to start the osd once. put the log on some pastebin and send the url. read http://ceph.com/planet/how-to-increase-debug-levels-and-harvest-a-detailed-osd-log/ for details. generally: try to make it easy for people to help you without having to drag details out of you. If you can collect all of the above on a pastebin like http://paste.debian.net/ instead of piecing it together from 3-4 different email threads, you will find a lot more eyeballs willing to give it a look. good luck and kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
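Concretely, the debug bump and log capture could look like this for an example osd.6 on filestore (paths are the defaults; drop the filestore line on bluestore OSDs):
[osd.6]
debug osd = 20
debug filestore = 20
# mv /var/log/ceph/ceph-osd.6.log /var/log/ceph/ceph-osd.6.log.old
# systemctl start ceph-osd@6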
Re: [ceph-users] Ceph - SSD cluster
On 20. nov. 2017 23:06, Christian Balzer wrote: On Mon, 20 Nov 2017 15:53:31 +0100 Ansgar Jazdzewski wrote: Hi *, just one note because we hit it: take a look at your discard options and make sure discard does not run on all OSDs at the same time. Any SSD that actually _requires_ the use of TRIM/DISCARD to maintain either speed or endurance I'd consider unfit for Ceph to boot. hello is there some sort of hardware compatibility list for this part ? perhaps community maintained on the wiki or similar. there are some older blog posts covering some devices, but it is hard to find ceph-related information for current devices. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Moving bluestore WAL and DB after bluestore creation
On 16.11.2017 09:45, Loris Cuoghi wrote: Le Wed, 15 Nov 2017 19:46:48 +, Shawn Edwards <lesser.e...@gmail.com> a écrit : On Wed, Nov 15, 2017, 11:07 David Turner <drakonst...@gmail.com> wrote: I'm not going to lie. This makes me dislike Bluestore quite a bit. Using multiple OSDs to an SSD journal allowed for you to monitor the write durability of the SSD and replace it without having to out and re-add all of the OSDs on the device. Having to now out and backfill back onto the HDDs is awful and would have made a time when I realized that 20 journal SSDs all ran low on writes at the same time nearly impossible to recover from. Flushing journals, replacing SSDs, and bringing it all back online was a slick process. Formatting the HDDs and backfilling back onto the same disks sounds like a big regression. A process to migrate the WAL and DB onto the HDD and then back off to a new device would be very helpful. On Wed, Nov 15, 2017 at 10:51 AM Mario Giammarco <mgiamma...@gmail.com> wrote: It seems it is not possible. I recreated the OSD 2017-11-12 17:44 GMT+01:00 Shawn Edwards <lesser.e...@gmail.com>: I've created some Bluestore OSD with all data (wal, db, and data) all on the same rotating disk. I would like to now move the wal and db onto an nvme disk. Is that possible without re-creating the OSD? ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com This. Exactly this. Not being able to move the .db and .wal data on and off the main storage disk on Bluestore is a regression. Hello, What stops you from dd'ing the DB/WAL's partitions on another disk and updating the symlinks in the OSD's mount point under /var/lib/ceph/osd? this probably works when you deployed bluestore with partitions, but if you did not create partitions for block.db on orginal bluestore creation there is no block.db symlink, db and wal are mixed into the block partition and not easy to extract. also just dd the block device may not help if you want to change the size of the db partition. this needs more testing. probably tools can be created in the future for resizing db and wal partitions, and for extracting db data from block into a separate block.db partition. dd block.db would probably work when you need to replace a worn out ssd drive. but not so much if you want to deploy separate block.db from a bluestore made without block.db kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Cluster network slower than public network
On 15.11.2017 13:50, Gandalf Corvotempesta wrote: As 10gb switches are expensive, what would happen by using a gigabit cluster network and a 10gb public network? Replication and rebalance should be slow, but what about public I/O ? When a client wants to write to a file, does it write over the public network and ceph automatically replicates it over the cluster network, or is the whole IO made over the public network? public io would be slow. each write goes from the client to the primary osd on the public network, then is replicated 2 times to the secondary osd's over the cluster network, and then the client is informed the block is written. the cluster network would see 2x the write traffic of the public network when things are OK, and many times the traffic of the public network when things are recovering or backfilling. i would prioritize the cluster network for the highest speed if one could not have 10Gbps on everything. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
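The split itself is just two options in the [global] section of ceph.conf on every node (the subnets are placeholders):
[global]
public network = 10.0.1.0/24
cluster network = 10.0.2.0/24
OSDs pick up the change on restart; mon and client traffic always stays on the public network.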
Re: [ceph-users] Undersized fix for small cluster, other than adding a 4th node?
On 09. nov. 2017 22:52, Marc Roos wrote: I added an erasure k=3,m=2 coded pool on a 3 node test cluster and am getting these errors. pg 48.0 is stuck undersized for 23867.00, current state active+undersized+degraded, last acting [9,13,2147483647,7,2147483647] pg 48.1 is stuck undersized for 27479.944212, current state active+undersized+degraded, last acting [12,1,2147483647,8,2147483647] pg 48.2 is stuck undersized for 27479.944514, current state active+undersized+degraded, last acting [12,1,2147483647,3,2147483647] pg 48.3 is stuck undersized for 27479.943845, current state active+undersized+degraded, last acting [11,0,2147483647,2147483647,5] pg 48.4 is stuck undersized for 27479.947473, current state active+undersized+degraded, last acting [8,4,2147483647,2147483647,5] pg 48.5 is stuck undersized for 27479.940289, current state active+undersized+degraded, last acting [6,5,11,2147483647,2147483647] pg 48.6 is stuck undersized for 27479.947125, current state active+undersized+degraded, last acting [5,8,2147483647,1,2147483647] pg 48.7 is stuck undersized for 23866.977708, current state active+undersized+degraded, last acting [13,11,2147483647,0,2147483647] Mentioned here http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-May/009572.html is that the problem was resolved by adding an extra node. I already changed the min_size to 3. Or should I change to k=2,m=2, but do I still have a good saving on storage then? How can you calculate the storage saving of an erasure pool? the minimum number of nodes for a cluster is k+m, and with that you have no nodes for an additional failure domain. IOW, if a node fails your cluster is degraded and can not heal itself. (the 2147483647 entries in the acting sets mean crush could not find an osd to place that shard on: with the default host failure domain, k=3,m=2 needs 5 different hosts and you only have 3.) having ceph heal on failures is kind of one of the best things about ceph. so when choosing how many nodes to have in your cluster, you need to think: k + m + how many node failures do i want to tolerate without stressing = minimum number of nodes. basically with a 3 node cluster, you can either run 3x replication or k=2 + m=1. for space savings you can read http://ceph.com/geen-categorie/ceph-erasure-coding-overhead-in-a-nutshell/ kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
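For a 3 node test cluster the profile has to fit the host count; a sketch of the two usual options (profile and pool names are examples; on luminous the parameter is crush-failure-domain, older releases called it ruleset-failure-domain):
# ceph osd erasure-code-profile set ec21 k=2 m=1 crush-failure-domain=host
# ceph osd pool create ecpool21 64 64 erasure ec21
or, for testing only, keep k=3,m=2 but let several chunks share a host:
# ceph osd erasure-code-profile set ec32osd k=3 m=2 crush-failure-domain=osd
Note that crush-failure-domain=osd gives no protection against losing a whole host.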
Re: [ceph-users] How to enable jumbo frames on IPv6 only cluster?
On 27. okt. 2017 14:22, Félix Barbeira wrote: Hi, I'm trying to configure a ceph cluster using IPv6 only but I can't enable jumbo frames. I made the definition on the 'interfaces' file and it seems like the value is applied but when I test it looks like only works on IPv4, not IPv6. It works on IPv4: root@ceph-node01:~# ping -c 3 -M do -s 8972 ceph-node02 PING ceph-node02 (x.x.x.x) 8972(9000) bytes of data. 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=1 ttl=64 time=0.474 ms 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=2 ttl=64 time=0.254 ms 8980 bytes from ceph-node02 (x.x.x.x): icmp_seq=3 ttl=64 time=0.288 ms --- ceph-node02 ping statistics --- 3 packets transmitted, 3 received, 0% packet loss, time 2000ms rtt min/avg/max/mdev = 0.254/0.338/0.474/0.099 ms root@ceph-node01:~# But *not* in IPv6: root@ceph-node01:~# ping6 -c 3 -M do -s 8972 ceph-node02 PING ceph-node02(x:x:x:x:x:x:x:x) 8972 data bytes ping: local error: Message too long, mtu=1500 ping: local error: Message too long, mtu=1500 ping: local error: Message too long, mtu=1500 --- ceph-node02 ping statistics --- 4 packets transmitted, 0 received, +4 errors, 100% packet loss, time 3024ms root@ceph-node01:~# root@ceph-node01:~# ifconfig eno1 Link encap:Ethernet HWaddr 24:6e:96:05:55:f8 inet6 addr: 2a02:x:x:x:x:x:x:x/64 Scope:Global inet6 addr: fe80::266e:96ff:fe05:55f8/64 Scope:Link UP BROADCAST RUNNING MULTICAST *MTU:9000* Metric:1 RX packets:633318 errors:0 dropped:0 overruns:0 frame:0 TX packets:649607 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1000 RX bytes:463355602 (463.3 MB) TX bytes:498891771 (498.8 MB) lo Link encap:Local Loopback inet addr:127.0.0.1 Mask:255.0.0.0 inet6 addr: ::1/128 Scope:Host UP LOOPBACK RUNNING MTU:65536 Metric:1 RX packets:127420 errors:0 dropped:0 overruns:0 frame:0 TX packets:127420 errors:0 dropped:0 overruns:0 carrier:0 collisions:0 txqueuelen:1 RX bytes:179470326 (179.4 MB) TX bytes:179470326 (179.4 MB) root@ceph-node01:~# root@ceph-node01:~# cat /etc/network/interfaces # This file describes network interfaces avaiulable on your system # and how to activate them. For more information, see interfaces(5). source /etc/network/interfaces.d/* # The loopback network interface auto lo iface lo inet loopback # The primary network interface auto eno1 iface eno1 inet6 auto post-up ifconfig eno1 mtu 9000 root@ceph-node01:# Please help! hello have you changed on all nodes ? also the ipv6 icmpv6 protocol can advertise a link MTU value. the client will pick up this mtu value and store it in/proc/sys/net/ipv6/conf/eth0/mtu if /proc/sys/net/ipv6/conf/ens32/accept_ra_mtu is enabled. you can perhaps change what mtu is advertised on the link by altering your Router or device that advertise RA's kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
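A few commands that may help confirm whether a router advertisement is clamping the IPv6 MTU; the interface name is taken from the mail above, and rdisc6 comes from the ndisc6 package:

cat /proc/sys/net/ipv6/conf/eno1/mtu         # the MTU IPv6 is actually using on this link
sysctl net.ipv6.conf.eno1.accept_ra_mtu      # 1 = the MTU option in RAs is honoured
rdisc6 eno1                                  # shows the MTU (if any) the router advertises

If the RA advertises 1500, the cleanest fix is on the router (for example AdvLinkMTU 9000 in radvd.conf); setting accept_ra_mtu to 0 on the nodes is a possible workaround.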
Re: [ceph-users] MDS damaged
if you were following this page: http://docs.ceph.com/docs/jewel/rados/troubleshooting/troubleshooting-pg/ then there is normally hours of troubleshooting in the following paragraph, before finally admitting defeat and marking the object as lost: "It is possible that there are other locations where the object can exist that are not listed. For example, if a ceph-osd is stopped and taken out of the cluster, the cluster fully recovers, and due to some future set of failures ends up with an unfound object, it won’t consider the long-departed ceph-osd as a potential location to consider. (This scenario, however, is unlikely.)" Also this warning is important regarding the loosing of objects: "Use this with caution, as it may confuse applications that expected the object to exist." mds is definitiftly such an application. i think rgw would be the only application that loosing a object could be acceptable, depending on what used the object storage. rbd and cephfs will have issues of varying degree. One could argue that the mark-unfound-lost command should have a --yes-i-mean-it type of warning, especialy of the pool application is cephfs or rbd This is ofcourse a bit late now that the object is marked as lost. but for your future reference: since you had a inconsistent pg, most likely you had one corrupt object and 1 or more OK object on some osd. and using the methods written about in http://ceph.com/geen-categorie/ceph-manually-repair-object/ might have recovered that object for you. kind regards Ronny Aasen On 26. okt. 2017 04:38, dani...@igb.illinois.edu wrote: Hi Ronny, From the documentation, I thought this was the proper way to resolve the issue. Dan On 24. okt. 2017 19:14, Daniel Davidson wrote: Our ceph system is having a problem. A few days a go we had a pg that was marked as inconsistent, and today I fixed it with a: #ceph pg repair 1.37c then a file was stuck as missing so I did a: #ceph pg 1.37c mark_unfound_lost delete pg has 1 objects unfound and apparently lost marking sorry i can not assist on the corrupt mds part. i have no experience in that part. But I felt this escaleted a bit quick. since this is a "i accept lost object" type of command, the consequences are quite ugly, depending on what the missing object was for. Did you do much troubleshooting before jumping to this command so you were certain there was no other non dataloss options ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
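Before reaching for mark_unfound_lost, the usual sequence is roughly the following (the pg id here is just the one from this thread, used as a placeholder):

ceph health detail                  # which pgs report unfound objects
ceph pg 1.37c list_missing          # names and versions of the unfound objects
ceph pg 1.37c query                 # which osds were probed, and which might still hold a copy

Then try to bring back (or temporarily re-add) any down osd that the query output says might still hold the data, before declaring anything lost.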
Re: [ceph-users] MDS damaged
On 24. okt. 2017 19:14, Daniel Davidson wrote: Our ceph system is having a problem. A few days ago we had a pg that was marked as inconsistent, and today I fixed it with a: #ceph pg repair 1.37c then a file was stuck as missing so I did a: #ceph pg 1.37c mark_unfound_lost delete pg has 1 objects unfound and apparently lost marking Sorry, I cannot assist on the corrupt MDS part; I have no experience there. But I felt this escalated a bit quickly. Since this is an "I accept losing the object" type of command, the consequences can be quite ugly, depending on what the missing object was for. Did you do much troubleshooting before jumping to this command, so you were certain there were no other non-dataloss options? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profile
Yes you can. But just like a raid5 array with a lost disk, it is not a comfortable way to run your cluster for any significant time, and you also get performance degradation. Having a warning active all the time makes it harder to detect new issues; one becomes numb to a warning that is always on. Strive to have your cluster in HEALTH_OK all the time, and design so that you have the fault tolerance you want as overhead. Having more nodes than strictly needed allows ceph to self-heal quickly, and also gives better performance by spreading load over more machines. 10+4 on 14 nodes means each and every node is hit on every write. kind regards Ronny Aasen On 23. okt. 2017 21:12, Jorge Pinilla López wrote: I have one question: what can or can't a cluster do when working in degraded mode? With K=10 + M=4, if one of my OSD nodes fails it will start working in degraded mode, but can I still do writes and reads from that pool? On 23/10/2017 at 21:01, Ronny Aasen wrote: On 23.10.2017 20:29, Karun Josy wrote: Hi, While creating a pool with erasure code profile k=10, m=4, I get PG status as "200 creating+incomplete" While creating a pool with profile k=5, m=3 it works fine. Cluster has 8 OSDs with a total of 23 disks. Are there any requirements for setting the first profile? You need K+M+X osd nodes. K and M come from the profile; X is how many nodes you want to be able to tolerate the failure of without becoming degraded (how many failed nodes ceph should be able to heal from automatically). So with K=10 + M=4 you need a minimum of 14 nodes and have 0 fault tolerance (a single failure = a degraded cluster), so you have to scramble to replace the node to get HEALTH_OK again. If you have 15 nodes you can lose 1 node and ceph will automatically rebalance onto the 14 needed nodes, and you can replace the lost node at your leisure. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com -- *Jorge Pinilla López* jorp...@unizar.es Computer engineering student Systems area intern (SICUZ) Universidad de Zaragoza PGP-KeyID: A34331932EBC715A <http://pgp.rediris.es:11371/pks/lookup?op=get=0xA34331932EBC715A> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Erasure code profile
On 23.10.2017 20:29, Karun Josy wrote: Hi, While creating a pool with erasure code profile k=10, m=4, I get PG status as "200 creating+incomplete" While creating a pool with profile k=5, m=3 it works fine. Cluster has 8 OSDs with a total of 23 disks. Are there any requirements for setting the first profile? You need K+M+X osd nodes. K and M come from the profile; X is how many nodes you want to be able to tolerate the failure of without becoming degraded (how many failed nodes ceph should be able to heal from automatically). So with K=10 + M=4 you need a minimum of 14 nodes and have 0 fault tolerance (a single failure = a degraded cluster), so you have to scramble to replace the node to get HEALTH_OK again. If you have 15 nodes you can lose 1 node and ceph will automatically rebalance onto the 14 needed nodes, and you can replace the lost node at your leisure. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
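For a test cluster with only a handful of hosts, one way around the k+m host requirement is to let the profile place shards per osd instead of per host, at the cost of host-level fault tolerance. A sketch (the profile and pool names are made up, and the option was called ruleset-failure-domain on pre-luminous releases):

ceph osd erasure-code-profile set ec-k10-m4 k=10 m=4 crush-failure-domain=osd
ceph osd erasure-code-profile get ec-k10-m4
ceph osd pool create ecpool 128 128 erasure ec-k10-m4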
Re: [ceph-users] Brand new cluster -- pg is stuck inactive
strange that no osd is acting for your pg's can you show the output from ceph osd tree mvh Ronny Aasen On 13.10.2017 18:53, dE wrote: Hi, I'm running ceph 10.2.5 on Debian (official package). It cant seem to create any functional pools -- ceph health detail HEALTH_ERR 64 pgs are stuck inactive for more than 300 seconds; 64 pgs stuck inactive; too few PGs per OSD (21 < min 30) pg 0.39 is stuck inactive for 652.741684, current state creating, last acting [] pg 0.38 is stuck inactive for 652.741688, current state creating, last acting [] pg 0.37 is stuck inactive for 652.741690, current state creating, last acting [] pg 0.36 is stuck inactive for 652.741692, current state creating, last acting [] pg 0.35 is stuck inactive for 652.741694, current state creating, last acting [] pg 0.34 is stuck inactive for 652.741696, current state creating, last acting [] pg 0.33 is stuck inactive for 652.741698, current state creating, last acting [] pg 0.32 is stuck inactive for 652.741701, current state creating, last acting [] pg 0.3 is stuck inactive for 652.741762, current state creating, last acting [] pg 0.2e is stuck inactive for 652.741715, current state creating, last acting [] pg 0.2d is stuck inactive for 652.741719, current state creating, last acting [] pg 0.2c is stuck inactive for 652.741721, current state creating, last acting [] pg 0.2b is stuck inactive for 652.741723, current state creating, last acting [] pg 0.2a is stuck inactive for 652.741725, current state creating, last acting [] pg 0.29 is stuck inactive for 652.741727, current state creating, last acting [] pg 0.28 is stuck inactive for 652.741730, current state creating, last acting [] pg 0.27 is stuck inactive for 652.741732, current state creating, last acting [] pg 0.26 is stuck inactive for 652.741734, current state creating, last acting [] pg 0.3e is stuck inactive for 652.741707, current state creating, last acting [] pg 0.f is stuck inactive for 652.741761, current state creating, last acting [] pg 0.3f is stuck inactive for 652.741708, current state creating, last acting [] pg 0.10 is stuck inactive for 652.741763, current state creating, last acting [] pg 0.4 is stuck inactive for 652.741773, current state creating, last acting [] pg 0.5 is stuck inactive for 652.741774, current state creating, last acting [] pg 0.3a is stuck inactive for 652.741717, current state creating, last acting [] pg 0.b is stuck inactive for 652.741771, current state creating, last acting [] pg 0.c is stuck inactive for 652.741772, current state creating, last acting [] pg 0.3b is stuck inactive for 652.741721, current state creating, last acting [] pg 0.d is stuck inactive for 652.741774, current state creating, last acting [] pg 0.3c is stuck inactive for 652.741722, current state creating, last acting [] pg 0.e is stuck inactive for 652.741776, current state creating, last acting [] pg 0.3d is stuck inactive for 652.741724, current state creating, last acting [] pg 0.22 is stuck inactive for 652.741756, current state creating, last acting [] pg 0.21 is stuck inactive for 652.741758, current state creating, last acting [] pg 0.a is stuck inactive for 652.741783, current state creating, last acting [] pg 0.20 is stuck inactive for 652.741761, current state creating, last acting [] pg 0.9 is stuck inactive for 652.741787, current state creating, last acting [] pg 0.1f is stuck inactive for 652.741764, current state creating, last acting [] pg 0.8 is stuck inactive for 652.741790, current state creating, last acting [] pg 0.7 is stuck inactive 
for 652.741792, current state creating, last acting [] pg 0.6 is stuck inactive for 652.741794, current state creating, last acting [] pg 0.1e is stuck inactive for 652.741770, current state creating, last acting [] pg 0.1d is stuck inactive for 652.741772, current state creating, last acting [] pg 0.1c is stuck inactive for 652.741774, current state creating, last acting [] pg 0.1b is stuck inactive for 652.741777, current state creating, last acting [] pg 0.1a is stuck inactive for 652.741784, current state creating, last acting [] pg 0.2 is stuck inactive for 652.741812, current state creating, last acting [] pg 0.31 is stuck inactive for 652.741762, current state creating, last acting [] pg 0.19 is stuck inactive for 652.741789, current state creating, last acting [] pg 0.11 is stuck inactive for 652.741797, current state creating, last acting [] pg 0.18 is stuck inactive for 652.741793, current state creating, last acting [] pg 0.1 is stuck inactive for 652.741820, current state creating, last acting [] pg 0.30 is stuck inactive for 652.741769, current state creating, last acting [] pg 0.17 is stuck inactive for 652.741797, current state creating, last acting [] pg 0.0 is stuck inactive for 652.741829, current state creating, last acting [] pg 0.2f is stuck inactive for 652.741774, current state creating, last acting [] pg 0.16 is stuck inact
[ceph-users] windows server 2016 refs3.1 veeam syntetic backup with fast block clone
Greetings. When using Windows Storage Spaces and ReFS 3.1, Veeam backups can use something called block clone to build synthetic backups and to reduce the time taken to back up VMs. I have used Windows Server 2016 with ReFS 3.1 on ceph. My question is whether it is possible to get fast block clone and fast synthetic full backups when using ReFS on RBD on ceph. I of course have other backup solutions, but this is specific to VMware backups. Possible? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Ceph luminous repo not working on Ubuntu xenial
"apt-cache policy" shows you the different versions that are possible to install, and the prioritized order they have. the highest version will normally be installed unless priorities are changed. example: apt-cache policy ceph ceph: Installed: 12.2.1-1~bpo90+1 Candidate: 12.2.1-1~bpo90+1 Version table: *** 12.2.1-1~bpo90+1 500 500 http://download.ceph.com/debian-luminous stretch/main amd64 Packages 100 /var/lib/dpkg/status 10.2.5-7.2 500 500 http://deb.debian.org/debian stretch/main amd64 Packages apt-get install ceph=$version will install that spesific version. example in my case: apt install ceph=10.2.5-7.2 will downgrade to the previous version. kind regards Ronny Aasen On 29.09.2017 15:40, Kashif Mumtaz wrote: Dear Stefan, Thanks for your help. You are right. I was missing apt update" after adding repo. After doing apt update I am able to install luminous cadmin@admin:~/my-cluster$ ceph --version ceph version 12.2.1 (3e7492b9ada8bdc9a5cd0feafd42fbca27f9c38e) luminous (stable) I am not much in practice with Ubuntu. I use Centos/RHEL only . This time a specific requirement to install it on Ubuntu. I want to ask one thing. Now ceph two version availbe in repository. 1- Jewel in Ubuntu update repository 2 - Manually added ceph Repository If one package available in multiple repository with different version How can I install specific version ? . On Friday, September 29, 2017 9:57 AM, Stefan Kooman <ste...@bit.nl> wrote: Quoting Kashif Mumtaz (kashif.mum...@yahoo.com <mailto:kashif.mum...@yahoo.com>): > > Dear User, > I am striving had to install Ceph luminous version on Ubuntu 16.04.3 ( xenial ). > Its repo is available at https://download.ceph.com/debian-luminous/ <https://download.ceph.com/debian-luminous/%C2%A0> > I added it like sudo apt-add-repository 'deb https://download.ceph.com/debian-luminous/ xenial main' > # more sources.list > deb https://download.ceph.com/debian-luminous/ xenial main ^^ That looks good. > It say no package available. Did anybody able to install Luminous on Xenial by using repo? Just checkin': you did a "apt update" after adding the repo? The repo works fine for me. Is the Ceph gpg key installed? apt-key list |grep Ceph uid Ceph.com (release key) <secur...@ceph.com <mailto:secur...@ceph.com>> Make sure you have "apt-transport-https" installed (as the repos uses TLS). Gr. Stefan -- | BIT BV http://www.bit.nl/ Kamer van Koophandel 09090351 | GPG: 0xD14839C6 +31 318 648 688 / i...@bit.nl <mailto:i...@bit.nl> ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
On 28. sep. 2017 18:53, hjcho616 wrote: Yay! Finally, after almost exactly one month, I am able to mount the drive! Now it is time to see how my data is doing. =P Doesn't look too bad though. Got to love open source. =) I downloaded the ceph source code, built it, then tried to run the ceph-objectstore-tool export on that osd.4 and started debugging it. Obviously I don't have any idea what everything does... but I was able to trace to the error message. The corruption appears to be in the mount region. When it tries to decode a buffer, most buffers had very periodic access to data (looking at the printfs I put in), but a few of them had huge numbers. That "1" that didn't make sense came from the corruption: the struct_v portion of the data changed to the ASCII value of 1, which happily printed 1. =P Since it was the mount portion... and hoping it doesn't impact the data much... I went ahead and allowed those corrupted values. I was able to export osd.4 with journal! Congratulations and well done :) Just imagine trying to do this on $vendor's proprietary black box... Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] PG in active+clean+inconsistent, but list-inconsistent-obj doesn't show it
On 28. sep. 2017 09:27, Olivier Migeot wrote: Greetings, we're in the process of recovering a cluster after an electrical disaster. It hasn't gone badly so far, we managed to clear most of the errors. All that prevents a return to HEALTH_OK now is a bunch (6) of scrub errors, apparently from a PG that's marked as active+clean+inconsistent. Thing is, rados list-inconsistent-obj doesn't return anything but an empty list (plus, in the most recent attempts: error 2: (2) No such file or directory). We're on Jewel (waiting for this to be fixed before planning an upgrade), and the pool our PG belongs to has a replica of 2. No success with ceph pg repair, and I already tried to remove and import the most recent version of said PG in both its acting OSDs: it doesn't change a thing. Is there anything else I could try? Thanks, size=2 is of course horrible, and I assume you know that... But even more important: I hope you have min_size=2 so you avoid generating more problems in the future, or while troubleshooting. First of all, read this link a few times: http://ceph.com/geen-categorie/ceph-manually-repair-object/ You need to locate the bad objects to fix them. Since rados list-inconsistent-obj does not work, you need to manually check the logs of the osd's that are participating in the pg in question. Grep for ERR; once you find the name of the problem object, locate it using find /path/of/pg -name 'objectname'. Once you have the object path you need to compare the 2 objects and find out which one is the bad one. This is where 3x replication would have helped, since when one is bad, how do you know the bad from the good... The error message in the log may give hints; read and understand what the error message is, since it is critical to understanding what is wrong with the object. The object type also helps when determining the wrong one: is it a rados object, an rbd block, or a cephfs metadata or data object? Knowing what it should be helps in determining the wrong one. Things to try: ls -lh $path ; compare metadata, are there obvious problems? Refer to the error in the log. - does one have size 0 when there should have been a size? - does one have a size greater than 0 when it should have been size 0? - is one significantly larger than the other, perhaps one truncated or with garbage appended? md5sum $path - perhaps a block has a read error; it would show on this command and be a dead giveaway to the problem object. - compare checksums; do you know what sum the object should have? Actually look at the object: use strings or hexdump to try to determine the contents versus what the object should contain. If you can locate the bad object: stop the osd, flush its journal, move the bad object away (I just mv it somewhere else), restart the osd, run repair on the pg, tail the logs and wait for the repair and scrub to finish. -- If you are unable to tell the good object from the bad, you can try to determine which file it refers to in cephfs, or which block it refers to in rbd, and by overwriting that file or block in cephfs or rbd you can indirectly overwrite both objects with new data. If this is an rbd you should run a filesystem check on the fs on that rbd after all the ceph problems are repaired. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
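A condensed sketch of that procedure, with ids, paths and object names as placeholders only:

grep ERR /var/log/ceph/ceph-osd.21.log                        # object name and nature of the error
find /var/lib/ceph/osd/ceph-21/current/0.6_head/ -name 'objectname*'
ls -lh <path>; md5sum <path>                                  # compare against the copy on the other osd
systemctl stop ceph-osd@21
ceph-osd -i 21 --flush-journal
mv <path> /root/backup/                                       # move the bad copy out of the way
systemctl start ceph-osd@21
ceph pg repair 0.6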
Re: [ceph-users] Re install ceph
On 27. sep. 2017 10:09, Pierre Palussiere wrote: Hi, Does anyone know if it is possible to reinstall ceph on a host and keep the OSDs without wiping the data on them? Hope you can help me. It depends... If you have the journal on the same drive as the OSD, you should be able to eject the drive from one server, connect it to another, and udev should mount and activate the OSD (the data will of course move). I cannot see why a reinstall of a host would be much different from moving the disk. If you have the journal on a separate device, then you need to move the OSD and the journal device together. You can also have configurations that make this process less automatic. BUT: I would not in any way risk reinstalling a host with live OSDs! I would either set all its OSDs out and let the data remap to other OSDs, so you have a temporary replica elsewhere while reinstalling (the backfill should be fast since the data is still on disk), or I would set the crush weight to 0 and drain all OSDs off the node before reinstalling; here the backfill will take longer, since you actually have to refill the disks. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
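The two drain variants as commands (the osd id is a placeholder):

ceph osd out 12                     # temporary: data is remapped away but stays on the disk
# or, to move the data off the disk for good:
ceph osd crush reweight osd.12 0
# in both cases wait for recovery to finish before reinstalling:
ceph -s
ceph osd df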
Re: [ceph-users] Power outages!!! help!
I would only tar the pg you have missing objects from; trying to inject older objects when the pg is correct cannot be good. Scrub errors are kind of the issue with only 2 replicas: when you have 2 different objects, how do you know which one is correct and which one is bad? As you have read on http://ceph.com/geen-categorie/ceph-manually-repair-object/ and on http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/ you need to - find the pg :: rados list-inconsistent-pg [pool] - find the problem :: rados list-inconsistent-obj 0.6 --format=json-pretty ; gives you the object name, look for hints to what the bad object is - find the object :: manually check the objects, check the object metadata, run md5sum on them all and compare. Check the objects on the non-running osd's and compare there as well. Anything to try to determine which object is ok and which is bad. - fix the problem :: assuming you find the bad object, stop the affected osd holding the bad object, remove the object manually, restart the osd, and issue the repair command. If the rados commands do not give you the info you need, do it all manually as on http://ceph.com/geen-categorie/ceph-manually-repair-object/ good luck Ronny Aasen On 20.09.2017 22:17, hjcho616 wrote: Thanks Ronny. I decided to try to tar everything under the current directory. Is this the correct command for it? Is there any directory we do not want on the new drive? commit_op_seq, meta, nosnap, omap? tar --xattrs --preserve-permissions -zcvf osd.4.tar.gz . As far as inconsistent PGs... I am running into these errors. I tried moving one copy of the pg to another location, but it just says the moved shard is missing. Tried setting 'noout' and turning one of them down; that seems to work on something but then it is back to the same error. Currently trying to move to a different osd... making sure the drive is not faulty, got a few of them.. but still persisting.. I've been kicking off ceph pg repair PG#, hoping it would fix them. =P Any other suggestion?
2017-09-20 09:39:48.481400 7f163c5fa700 0 log_channel(cluster) log [INF] : 0.29 repair starts 2017-09-20 09:47:37.384921 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od alloc_hint [0 0]) 2017-09-20 09:47:37.384931 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97126ead:::200014ce4c3.028f:head data_digest 0x8f679a50 != data_digest 0x979f2ed4 from auth oi 0:97126ead:::200014ce4c3.028f:head(19366'539375 client.535319.1:2361163 dirty|data_digest|omap_digest s 4194304 uv 539375 dd 979f2ed4 od alloc_hint [0 0]) 2017-09-20 09:47:37.384936 7f163c5fa700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97126ead:::200014ce4c3.028f:head: failed to pick suitable auth object 2017-09-20 09:48:11.138566 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 6: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od alloc_hint [0 0]) 2017-09-20 09:48:11.138575 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 shard 7: soid 0:97d5c15a:::10101b4.6892:head data_digest 0xd65b4014 != data_digest 0xf41cfab8 from auth oi 0:97d5c15a:::10101b4.6892:head(12962'65557 osd.4.0:42234 dirty|data_digest|omap_digest s 4194304 uv 776 dd f41cfab8 od alloc_hint [0 0]) 2017-09-20 09:48:11.138581 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 soid 0:97d5c15a:::10101b4.6892:head: failed to pick suitable auth object 2017-09-20 09:48:55.584022 7f1639df5700 -1 log_channel(cluster) log [ERR] : 0.29 repair 4 errors, 0 fixed Latest health... HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs down; 1 pgs incomplete; 9 pgs inconsistent; 1 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 68 scrub errors; mds rank 0 has failed; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set Regards, Hong On Wednesday, September 20, 2017 11:53 AM, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: On 20.09.2017 16:49, hjcho616 wrote: Anyone? Can this page be saved? If not what are my options? Regards, Hong On Saturday, September 16, 2017 1:55 AM, hjcho616 <hjcho...@yahoo.com> <mailto:hjcho...@yahoo.com> wrote: Looking better... working on scrubbing.. HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degrade
Re: [ceph-users] Power outages!!! help!
On 20.09.2017 16:49, hjcho616 wrote: Anyone? Can this page be saved? If not what are my options? Regards, Hong On Saturday, September 16, 2017 1:55 AM, hjcho616 <hjcho...@yahoo.com> wrote: Looking better... working on scrubbing.. HEALTH_ERR 1 pgs are stuck inactive for more than 300 seconds; 1 pgs incomplete; 12 pgs inconsistent; 2 pgs repair; 1 pgs stuck inactive; 1 pgs stuck unclean; 109 scrub errors; too few PGs per OSD (29 < min 30); mds rank 0 has failed; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set Now PG1.28.. looking at all old osds dead or alive. Only one with DIR_* directory is in osd.4. This appears to be metadata pool! 21M of metadata can be quite a bit of stuff.. so I would like to rescue this! But I am not able to start this OSD. exporting through ceph-objectstore-tool appears to crash. Even with --skip-journal-replay and --skip-mount-omap (different failure). As I mentioned in earlier email, that exception thrown message is bogus... # ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file ~/1.28.export terminate called after throwing an instance of 'std::domain_error' [SNIP] What can I do to save that PG1.28? Please let me know if you need more information. So close!... =) Regards, Hong 12 inconsistent and 109 scrub errors is something you should fix first of all. also you can consider using the paid-services of many ceph support companies. that specialize in these kind of situations. -- that beeing said, here are some suggestions... when it comes to lost object recovery you have come about as far as i have ever experienced. so everything after here is just assumptions and wild guesswork to what you can try. I hope others shouts out if i tell you wildly wrong things. if you have found date pg1.28 from the broken osd and have checked all other working and nonworking drives, for that pg. then you need to try and extract the pg from the broken drive. As always in recovery cases, take a dd clone of the drive and work from the cloned image. to avoid more damage to the drive, and to allow you to try multiple times. you should add a temporary injection drive large enough for that pg, and set its crush weight to 0 so it always drains. make sure it is up and registered properly in ceph. the idea is to copy the pg manually from broken-osd to the injection drive, since the export/import fails.. making sure you get all xattrs included. one can either copy the whole pg, or just the "missing" objects. if there are few objects i would go for that, if there are many i would take the whole pg. you wont get data from leveldb. so i am not at all sure this would work. but worth a shot. - stop your injection osd, verify it is down and the proccess not running. - from the mountpoint of your broken-osd go into the current directory. and tar up the pg1.28 make sure you use -p and --xattrs when you create the archive. - if tar errors out on unreadable files, just rm those (since you are working on a copy of your rescue image, you can allways try again) - copy the tar file to the injection drive and extract while sitting in the current directory (remember --xattrs) - set debug options on the injection drive in ceph.conf - start the injection drive, and follow along in the log file. hopefully it should scan, locate the pg, and replicate the pg1.28 objects off to the current primary drive for pg1.28. and since it have crush weight 0 it should drain out. 
- if that works, verify the injection drive is drained, stop it and remove it from ceph, and zap the drive. (A rough command sketch of the copy step follows below.) This is all, as I said, guesswork, so your mileage may vary. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
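A rough sketch of the copy step described above; the paths, pg id and osd id are placeholders, the injection osd must be stopped first, and the source should be the cloned image rather than the failing disk:

cd /mnt/broken-osd-clone/current
tar --xattrs -cpf /tmp/pg1.28.tar 1.28_head
cd /var/lib/ceph/osd/ceph-12/current
tar --xattrs -xpf /tmp/pg1.28.tar
chown -R ceph:ceph 1.28_head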
Re: [ceph-users] Power outages!!! help!
you write you had all pg's exported except one. so i assume you have injected those pg's into the cluster again using the method linked a few times in this thread. How did that go, were you successfull in recovering those pg's ? kind regards. Ronny Aasen On 15. sep. 2017 07:52, hjcho616 wrote: I just did this and backfilling started. Let's see where this takes me. ceph osd lost 0 --yes-i-really-mean-it Regards, Hong On Friday, September 15, 2017 12:44 AM, hjcho616 <hjcho...@yahoo.com> wrote: Ronny, Working with all of the pgs shown in the "ceph health detail", I ran below for each PG to export. ceph-objectstore-tool --op export --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-0 --journal-path /var/lib/ceph/osd/ceph-0/journal --skip-journal-replay --file 0.1c.export I have all PGs exported, except 1... PG 1.28. It is on ceph-4. This error doesn't make much sense to me. Looking at the source code from https://github.com/ceph/ceph/blob/master/src/osd/osd_types.cc, that message is telling me struct_v is 1... but not sure how it ended up in the default in the case statement when 1 case is defined... I tried with --skip-journal-replay, fails with same error message. ceph-objectstore-tool --op export --pgid 1.28 --data-path /var/lib/ceph/osd/ceph-4 --journal-path /var/lib/ceph/osd/ceph-4/journal --file 1.28.export terminate called after throwing an instance of 'std::domain_error' what(): coll_t::decode(): don't know how to decode version 1 *** Caught signal (Aborted) ** in thread 7fabc5ecc940 thread_name:ceph-objectstor ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0) 1: (()+0x996a57) [0x55b2d3323a57] 2: (()+0x110c0) [0x7fabc46d50c0] 3: (gsignal()+0xcf) [0x7fabc2b08fcf] 4: (abort()+0x16a) [0x7fabc2b0a3fa] 5: (__gnu_cxx::__verbose_terminate_handler()+0x15d) [0x7fabc33efb3d] 6: (()+0x5ebb6) [0x7fabc33edbb6] 7: (()+0x5ec01) [0x7fabc33edc01] 8: (()+0x5ee19) [0x7fabc33ede19] 9: (coll_t::decode(ceph::buffer::list::iterator&)+0x21e) [0x55b2d2ff401e] 10: (DBObjectMap::_Header::decode(ceph::buffer::list::iterator&)+0x125) [0x55b2d31315f5] 11: (DBObjectMap::check(std::ostream&, bool)+0x279) [0x55b2d3126bb9] 12: (DBObjectMap::init(bool)+0x288) [0x55b2d3125eb8] 13: (FileStore::mount()+0x2525) [0x55b2d305ceb5] 14: (main()+0x28c0) [0x55b2d2c8d400] 15: (__libc_start_main()+0xf1) [0x7fabc2af62b1] 16: (()+0x34f747) [0x55b2d2cdc747] Aborted Then wrote a simple script to run import process... just created an OSD per PG. Basically ran below for each PG. mkdir /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph-disk prepare /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph-disk activate /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ ceph osd crush reweight osd.$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) 0 systemctl stop ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) ceph-objectstore-tool --op import --pgid 0.1c --data-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) --journal-path /var/lib/ceph/osd/ceph-$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami)/journal --file ./export/0.1c.export chown -R ceph.ceph /var/lib/ceph/osd/ceph-5/tmposd_0.1c/ systemctl start ceph-osd@$(cat /var/lib/ceph/osd/ceph-5/tmposd_0.1c/whoami) Sometimes import didn't work.. but stopping OSD and rerunning ceph-objectstore-tool again seems to help or when some PG didn't really want to import . Unfound messages are gone! But I still have down+peering, or down+remapped+peering. 
# ceph health detail HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs down; 1 pgs inconsistent; 22 pgs peering; 22 pgs stuck inactive; 22 pgs stuck unclean; 1 requests are blocked > 32 sec; 1 osds have slow requests; 2 scrub errors; mds cluster is degraded; noout flag(s) set; no legacy OSD present but 'sortbitwise' flag is not set pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2] pg 0.a is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 2.8 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 2.b is stuck inactive since forever, current state down+remapped+peering, last acting [7,11] pg 1.9 is stuck inactive since forever, current state down+remapped+peering, last acting [11,7] pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.3d is stuck inactive since forever, current state down+remapped+peering, last acting [10,6] pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.0 is stuck inactive since forever, current state down+remapped+peering, last acting [10,7] pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6]
Re: [ceph-users] OSD_OUT_OF_ORDER_FULL even when the ratios are in order.
On 14. sep. 2017 11:58, dE . wrote: Hi, I got a ceph cluster where I'm getting a OSD_OUT_OF_ORDER_FULL health error, even though it appears that it is in order -- full_ratio 0.99 backfillfull_ratio 0.97 nearfull_ratio 0.98 These don't seem like a mistake to me but ceph is complaining -- OSD_OUT_OF_ORDER_FULL full ratio(s) out of order backfillfull_ratio (0.97) < nearfull_ratio (0.98), increased osd_failsafe_full_ratio (0.97) < full_ratio (0.99), increased ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com post output from ceph osd df ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
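The health check fires when the ratios do not satisfy nearfull < backfillfull < full. On luminous and newer they can be put back in order at runtime, for example back to the defaults:

ceph osd set-nearfull-ratio 0.85
ceph osd set-backfillfull-ratio 0.90
ceph osd set-full-ratio 0.95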
Re: [ceph-users] access ceph filesystem at storage level and not via ethernet
On 14. sep. 2017 00:34, James Okken wrote: Thanks Ronny! Exactly the info I need. And kind of what I thought the answer would be as I was typing and thinking more clearly about what I was asking. I was just hoping CEPH would work like this since the openstack fuel tools deploy CEPH storage nodes easily. I agree I would not be using CEPH for its strengths. I am interested further in what you've said in this paragraph though: "if you want to have FC SAN attached storage on servers, shareable between servers in a usable fashion I would rather mount the same SAN lun on multiple servers and use a cluster filesystem like ocfs or gfs that is made for this kind of solution." Please allow me to ask you a few questions regarding that even though it isn't CEPH specific. Do you mean gfs/gfs2, the global file system? Do ocfs and/or gfs require some sort of management/clustering server to maintain and manage them? (akin to a CEPH OSD) I'd love to find a distributed/cluster filesystem where I can just partition and format, and then be able to mount and use that same SAN datastore from multiple servers without a management server. If ocfs or gfs do need a server of this sort, does it need to be involved in the I/O? Or will I be able to mount the datastore like any other disk, with the IO going across the fibre channel? I only have experience with ocfs, but I think gfs works similarly. There are quite a few cluster filesystems to choose from: https://en.wikipedia.org/wiki/Clustered_file_system Servers that mount ocfs shared filesystems must have ocfs2-tools installed, have access to the common shared LUN via FC, and be aware of the other ocfs servers on the same lun, which you define in the /etc/ocfs2/cluster.conf config file; the ocfs daemon must also be running. Then it is just a matter of making the ocfs filesystem (on one server), adding it to fstab (on all servers) and mounting it. One final question, if you don't mind: do you think I could use ext4 or xfs and "mount the same SAN lun on multiple servers" if I can guarantee each server will only write to its own specific directory and never anywhere the other servers will be writing? (I even have the SAN mapped to each server using different lun's) Mounting the same (non-cluster) filesystem on multiple servers is guaranteed to destroy the filesystem: you will have multiple servers writing in the same metadata area and the same journal area, and generally trampling over each other. Luckily I think most modern filesystems would detect that the FS is mounted somewhere else and prevent you from mounting it again without big fat warnings. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
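A minimal sketch of what an ocfs2 setup looks like; the node names, addresses and LUN path are made up:

# /etc/ocfs2/cluster.conf (identical on every node, o2cb service enabled)
cluster:
        node_count = 2
        name = sancluster
node:
        ip_port = 7777
        ip_address = 10.0.0.1
        number = 1
        name = server1
        cluster = sancluster
node:
        ip_port = 7777
        ip_address = 10.0.0.2
        number = 2
        name = server2
        cluster = sancluster

mkfs.ocfs2 -L shared /dev/mapper/sanlun     # run once, on one node
mount /dev/mapper/sanlun /srv/shared        # then mount on every node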
Re: [ceph-users] access ceph filesystem at storage level and not via ethernet
On 13.09.2017 19:03, James Okken wrote: Hi, Novice question here: The way I understand CEPH is that it distributes data in OSDs in a cluster. The reads and writes come across the ethernet as RBD requests and the actual data IO then also goes across the ethernet. I have a CEPH environment being setup on a fiber channel disk array (via an openstack fuel deploy). The servers using the CEPH storage also have access to the same fiber channel disk array. From what I understand those servers would need to make the RDB requests and do the IO across ethernet, is that correct? Even though with this infrastructure setup there is a “shorter” and faster path to those disks, via the fiber channel. Is there a way to access storage on a CEPH cluster when one has this “better” access to the disks in the cluster? (how about if it were to be only a single OSD with replication set to 1) Sorry if this question is crazy… thanks a bit cracy :) if the disks are directly attached on a OSD node, or attachable on Fiberchannel does not make a difference. you can not shortcut the ceph cluster and talk to the osd disks directly without eventually destroying the ceph cluster. Even if you did, ceph is an object storage on disk, so you would not find filesystem or RBD diskimages there, only objects on your FC attached osd node disks with filestore, and with bluestore not even readable objects. that beeing said I think a FC SAN attached ceph osd node sounds a bit strange. ceph's strength is the distributed scaleable solution. and having the osd nodes collected on a SAN array would nuter ceph's strengths, and amplify ceph's weakness of high latency. i would only consider such a solution for testing, learning or playing around without having actual hardware for a distributed system. and in that case use 1 lun for each osd disk, give 8-10 vm's some luns/osd's each, just to learn how to work with ceph. if you want to have FC SAN attached storage on servers, shareable between servers in a usable fashion I would rather mount the same SAN lun on multiple servers and use a cluster filesystem like ocfs or gfs that is made for this kind of solution. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
On 13. sep. 2017 07:04, hjcho616 wrote: Ronny, Did a bunch of ceph pg repair pg# and got the scrub errors down to 10... well it was 9, trying to fix one it became 10.. waiting for it to fix (I did that noout trick as I only have two copies). 8 of those scrub errors look like they would need data from osd.0. HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs degraded; 6 pgs down; 3 pgs inconsistent; 6 pgs peering; 6 pgs recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 1 requests are blocked > 32 sec; recovery 221990/4503980 objects degraded (4.929%); recovery 147/2251990 unfound (0.007%); 10 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set From what I saw in ceph health detail, running osd.0 would solve the majority of the problems. But that was the disk with the smart error earlier. I did move to a new drive using ddrescue. When trying to start osd.0, I get this. Is there any way I can get around this? Running a rescued disk is not something you should try; this is when you should try to export using the ceph-objectstore-tool. Was this the drive that failed to export pg's because of the missing superblock? You could also try the export directly on the failed drive, just to see if that works. You may have to run the tool as the ceph user if that is the user owning all the files. You could try running the export of one of the pg's on osd.0 again and post all commands and output. good luck Ronny ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Power outages!!! help!
you can start by posting more details. atleast "ceph osd tree" "cat ceph.conf" and "ceph osd df" so we can see what settings you are running, and how your cluster is balanced at the moment. generally: inconsistent pg's are pg's that have scrub errors. use rados list-inconsistent-pg [pool] and rados-list-inconsistent-obj [pg] to locate the objects with problems. compare and fix the objects using info from http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#pgs-inconsistent also read http://ceph.com/geen-categorie/ceph-manually-repair-object/ since you have so many scrub errors i would assume there are more bad disks, check all disk's smart values and look for read errors in logs. if you find any you should drain those disks by setting crush weight to 0. and when they are empty remove them from the cluster. personally i use smartmontools it sends me emails about bad disks, and check disks manually withsmartctl -a /dev/sda || echo bad-disk: $? pg's that are down+peering need to have one of the acting osd's started again. or to have the objects recovered using the methods we have discussed previously. ref: http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#placement-group-down-peering-failure nb: do not mark any osd's as lost since that = dataloss. I would - check smart stats of all disks. drain disks that are going bad. make sure you have enough space on good disks to drain them properly. - check scrub errors and objects. fix those that are fixable. some may require an object from a down osd. - try to get down osd's running again if possible. if you manage to get one running, let it recover and stabilize. - recover and inject objects from osd's that do not run. stasrt by doing one and one pg. and once you get the hang of the method you can do multiple pg's at the same time. good luck Ronny Aasen On 11. sep. 2017 06:51, hjcho616 wrote: It took a while. It appears to have cleaned up quite a bit... but still has issues. I've been seeing below message for more than a day and cpu utilization and io utilization is low... looks like something is stuck... I rebooted OSDs several times when it looked like it was stuck earlier and it would work on something else, but now it is not changing much. What can I try now? 
Regards, Hong # ceph health detail HEALTH_ERR 22 pgs are stuck inactive for more than 300 seconds; 22 pgs degraded; 6 pgs down; 11 pgs inconsistent; 6 pgs peering; 6 pgs recovering; 16 pgs stale; 22 pgs stuck degraded; 6 pgs stuck inactive; 16 pgs stuck stale; 28 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 1 requests are blocked > 32 sec; 1 osds have slow requests; recovery 221990/4503980 objects degraded (4.929%); recovery 147/2251990 unfound (0.007%); 95 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set pg 0.e is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.d is stuck inactive since forever, current state down+peering, last acting [11,2] pg 1.28 is stuck inactive since forever, current state down+peering, last acting [11,6] pg 0.29 is stuck inactive since forever, current state down+peering, last acting [11,6] pg 1.2b is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.2c is stuck inactive since forever, current state down+peering, last acting [1,11] pg 0.e is stuck unclean since forever, current state down+peering, last acting [11,2] pg 0.a is stuck unclean for 1233182.248198, current state stale+active+undersized+degraded+inconsistent, last acting [0] pg 2.8 is stuck unclean for 1238044.714421, current state stale+active+undersized+degraded, last acting [0] pg 2.1a is stuck unclean for 1238933.203920, current state active+recovering+degraded, last acting [2,11] pg 2.3 is stuck unclean for 1238882.443876, current state stale+active+undersized+degraded, last acting [0] pg 2.27 is stuck unclean for 1295260.765981, current state active+recovering+degraded, last acting [11,6] pg 0.d is stuck unclean for 1230831.504001, current state stale+active+undersized+degraded, last acting [0] pg 1.c is stuck unclean for 1238044.715698, current state stale+active+undersized+degraded, last acting [0] pg 1.3d is stuck unclean for 1232066.572856, current state stale+active+undersized+degraded, last acting [0] pg 1.28 is stuck unclean since forever, current state down+peering, last acting [11,6] pg 0.29 is stuck unclean since forever, current state down+peering, last acting [11,6] pg 1.2b is stuck unclean since forever, current state down+peering, last acting [1,11] pg 2.2f is stuck unclean for 1238127.474088, current state active+recovering+degraded+remapped, last acting [9,10] pg 0.0 is stuck unclean for 1233182.247776, current state stale+active+undersized+degraded, last acting [0] pg 0.2c is stuck unclean since forever, current
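For the smart checks mentioned in the advice above, something along these lines works; the device list is only an example:

apt-get install smartmontools
for d in /dev/sd[a-f]; do
  echo "== $d"; smartctl -H $d
  smartctl -A $d | egrep -i 'Reallocated_Sector|Current_Pending|Offline_Uncorrectable'
done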
Re: [ceph-users] Power outages!!! help!
I would not even attempt to connect a recovered drive to ceph, especially not one that has had xfs errors and corruption. Your pg's that are undersized lead me to believe you still need to either expand with more disks or nodes, or that you need to set osd crush chooseleaf type = 0 to let ceph pick 2 disks on the same node as a valid object placement (temporary, until you get 2 balanced nodes; see the command sketch after this message). Generally, let ceph self-heal as much as possible (no misplaced or degraded objects); this requires that ceph has space for the recovery. I would run with size=2 min_size=2. You should also look at the 7 scrub errors. They indicate that there can be other drives with issues; you want to locate where those inconsistent objects are and fix them. Read this page about fixing scrub errors: http://ceph.com/geen-categorie/ceph-manually-repair-object/ Then you would be left with the 103 unfound objects, and those you should try to recover from the recovered drive, by using the ceph-objectstore-tool export/import to export the pg's with missing objects to a dedicated, temporarily added import drive. The import drive does not need to be very large, since you can do one pg at a time, and you should only recover pg's that contain unfound objects; there are really only 103 unfound objects that you need to recover. Once the recovery is complete you can wipe the functioning recovery drive and install it as a new osd in the cluster. kind regards Ronny Aasen On 03.09.2017 06:20, hjcho616 wrote: I checked with ceph-2, 3, 4, 5 so I figured it was safe to assume that superblock file is the same. I copied it over and started OSD. It still fails with the same error message. Looks like when I updated to 10.2.9, some osd needs to be updated and that process is not finding the data it needs? What can I do about this situation? 2017-09-01 22:27:35.590041 7f68837e5800 1 filestore(/var/lib/ceph/osd/ceph-0) upgrade 2017-09-01 22:27:35.590149 7f68837e5800 -1 filestore(/var/lib/ceph/osd/ceph-0) could not find #-1:7b3f43c4:::osd_superblock:0# in index: (2) No such file or directory Regards, Hong On Friday, September 1, 2017 11:10 PM, hjcho616 <hjcho...@yahoo.com> wrote: Just realized there is a file called superblock in the ceph directory. ceph-1 and ceph-2's superblock file is identical, ceph-6 and ceph-7 are identical, but not between the two groups. When I originally created the OSDs, I created ceph-0 through 5. Can the superblock file be copied over from ceph-1 to ceph-0? Hmm.. it appears to be doing something in the background even though osd.0 is down. ceph health output is changing! # ceph health HEALTH_ERR 40 pgs are stuck inactive for more than 300 seconds; 14 pgs backfill_wait; 21 pgs degraded; 10 pgs down; 2 pgs inconsistent; 10 pgs peering; 3 pgs recovering; 2 pgs recovery_wait; 30 pgs stale; 21 pgs stuck degraded; 10 pgs stuck inactive; 30 pgs stuck stale; 45 pgs stuck unclean; 16 pgs stuck undersized; 16 pgs undersized; 2 requests are blocked > 32 sec; recovery 221826/2473662 objects degraded (8.968%); recovery 254711/2473662 objects misplaced (10.297%); recovery 103/2251966 unfound (0.005%); 7 scrub errors; mds cluster is degraded; no legacy OSD present but 'sortbitwise' flag is not set Regards, Hong On Friday, September 1, 2017 10:37 PM, hjcho616 <hjcho...@yahoo.com> wrote: Tried connecting recovered osd. Looks like some of the files in the lost+found are super blocks. Below is the log. What can I do about this?
2017-09-01 22:27:27.634228 7f68837e5800 0 set uid:gid to 1001:1001 (ceph:ceph) 2017-09-01 22:27:27.634245 7f68837e5800 0 ceph version 10.2.9 (2ee413f77150c0f375ff6f10edd6c8f9c7d060d0), process ceph-osd, pid 5432 2017-09-01 22:27:27.635456 7f68837e5800 0 pidfile_write: ignore empty --pid-file 2017-09-01 22:27:27.646849 7f68837e5800 0 filestore(/var/lib/ceph/osd/ceph-0) backend xfs (magic 0x58465342) 2017-09-01 22:27:27.647077 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: FIEMAP ioctl is disabled via 'filestore fiemap' config option 2017-09-01 22:27:27.647080 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: SEEK_DATA/SEEK_HOLE is disabled via 'filestore seek data hole' config option 2017-09-01 22:27:27.647091 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: splice is supported 2017-09-01 22:27:27.678937 7f68837e5800 0 genericfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_features: syncfs(2) syscall fully supported (by glibc and kernel) 2017-09-01 22:27:27.679044 7f68837e5800 0 xfsfilestorebackend(/var/lib/ceph/osd/ceph-0) detect_feature: extsize is disabled by conf 2017-09-01 22:27:27.680718 7f68837e5800 1 leveldb: Recovering log #28054 2017-09-01 22:27:27.804501 7f68837e5800 1 leveldb: Delete type=0 #28054 2017-09-01 22:27:27.804579 7f68837e5800 1 leveldb: Delete type=3 #28053 2
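The chooseleaf change mentioned above can be set in ceph.conf before a cluster is first deployed, or applied to an existing cluster by editing the crushmap. A sketch:

# ceph.conf (only affects the rule created at cluster creation):
[global]
osd crush chooseleaf type = 0

# existing cluster: edit the crush rule instead
ceph osd getcrushmap -o cm.bin
crushtool -d cm.bin -o cm.txt
# in cm.txt change "step chooseleaf firstn 0 type host" to "... type osd"
crushtool -c cm.txt -o cm-new.bin
ceph osd setcrushmap -i cm-new.bin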
Re: [ceph-users] Power outages!!! help!
On 30.08.2017 15:32, Steve Taylor wrote: I'm not familiar with dd_rescue, but I've just been reading about it. I'm not seeing any features that would be beneficial in this scenario that aren't also available in dd. What specific features give it "really a far better chance of restoring a copy of your disk" than dd? I'm always interested in learning about new recovery tools. i see i wrote dd_rescue from old habit, but the package one should use on debian is gddrescue or also called gnu ddrecue. this page have some details on the differences on dd vs the ddrescue variants. http://www.toad.com/gnu/sysadmin/index.html#ddrescue kind regards Ronny Aasen *Steve Taylor* | Senior Software Engineer |***StorageCraft Technology Corporation* <https://storagecraft.com> 380 Data Drive Suite 300 | Draper | Utah | 84020 *Office:* 801.871.2799 | If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited. On Tue, 2017-08-29 at 21:49 +0200, Willem Jan Withagen wrote: On 29-8-2017 19:12, Steve Taylor wrote: Hong, Probably your best chance at recovering any data without special, expensive, forensic procedures is to perform a dd from /dev/sdb to somewhere else large enough to hold a full disk image and attempt to repair that. You'll want to use 'conv=noerror' with your dd command since your disk is failing. Then you could either re-attach the OSD from the new source or attempt to retrieve objects from the filestore on it. Like somebody else already pointed out In problem "cases like disk, use dd_rescue. It has really a far better chance of restoring a copy of your disk --WjW I have actually done this before by creating an RBD that matches the disk size, performing the dd, running xfs_repair, and eventually adding it back to the cluster as an OSD. RBDs as OSDs is certainly a temporary arrangement for repair only, but I'm happy to report that it worked flawlessly in my case. I was able to weight the OSD to 0, offload all of its data, then remove it for a full recovery, at which point I just deleted the RBD. The possibilities afforded by Ceph inception are endless. ☺ Steve Taylor | Senior Software Engineer | StorageCraft Technology Corporation 380 Data Drive Suite 300 | Draper | Utah | 84020 Office: 801.871.2799 | If you are not the intended recipient of this message or received it erroneously, please notify the sender and delete it, together with any attachments, and be advised that any dissemination or copying of this message is prohibited. On Mon, 2017-08-28 at 23:17 +0100, Tomasz Kusmierz wrote: Rule of thumb with batteries is: - more “proper temperature” you run them at the more life you get out of them - more battery is overpowered for your application the longer it will survive. Get your self a LSI 94** controller and use it as HBA and you will be fine. but get MORE DRIVES ! … On 28 Aug 2017, at 23:10, hjcho616 <hjcho...@yahoo.com <mailto:hjcho...@yahoo.com>> wrote: Thank you Tomasz and Ronny. I'll have to order some hdd soon and try these out. Car battery idea is nice! I may try that.. =) Do they last longer? Ones that fit the UPS original battery spec didn't last very long... part of the reason why I gave up on them.. =P My wife probably won't like the idea of car battery hanging out though ha! The OSD1 (one with mostly ok OSDs, except that smart failure) motherboard doesn't have any additional SATA connectors available. 
Would it be safe to add another OSD host? Regards, Hong On Monday, August 28, 2017 4:43 PM, Tomasz Kusmierz <tom.kusmierz@gmail.com> wrote: Sorry for being brutal … anyway 1. get the battery for the UPS ( a car battery will do as well, I've modded an UPS in the past with a truck battery and it was working like a charm :D ) 2. get spare drives and put those in, because your cluster CAN NOT get out of error due to lack of space 3. follow the advice of Ronny Aasen on how to recover data from the hard drives 4. get cooling to the drives or you will lose more ! On 28 Aug 2017, at 22:39, hjcho616 <hjcho...@yahoo.com> wrote: Tomasz, Those machines are behind a surge protector. Doesn't appear to be a good one! I do have a UPS... but it is my fault... no battery. Power was pretty reliable for a while... and the UPS was just beeping every chance it had, disrupting some sleep.. =P So running on surge protector only. I am running this in a home environment. So far, HDD failures have been very rare for this environment. =) It just doesn't get loaded as much! I am not s
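To make the dd vs ddrescue comparison in this thread concrete, here is a minimal sketch of the two approaches; the device name, image path and map file are just examples, not taken from the thread:

# dd if=/dev/sdb of=/mnt/backup/sdb.img bs=64K conv=noerror,sync

dd does a single pass, keeps going on read errors (noerror) and pads unreadable blocks with zeroes (sync), but it has no memory of where the bad areas were.

# ddrescue -n /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map
# ddrescue -r3 /dev/sdb /mnt/backup/sdb.img /mnt/backup/sdb.map

GNU ddrescue (package gddrescue on debian) copies the easy areas first (-n skips the slow scraping phase), the second run retries the bad areas (-r3 means 3 retry passes), and the map file lets you stop and resume without losing progress or re-stressing areas that already copied fine.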
Re: [ceph-users] Power outages!!! help!
[snip] I'm not sure if I am liking what I see on fdisk... it doesn't show sdb1. I hope it shows up when I run dd_rescue to the other drive... =P # fdisk /dev/sdb Welcome to fdisk (util-linux 2.25.2). Changes will remain in memory only, until you decide to write them. Be careful before using the write command. /dev/sdb: device contains a valid 'xfs' signature, it's strongly recommended to wipe the device by command wipefs(8) if this setup is unexpected to avoid possible collisions. Device does not contain a recognized partition table. Created a new DOS disklabel with disk identifier 0xe684adb6. Command (m for help): p Disk /dev/sdb: 1.8 TiB, 2000398934016 bytes, 3907029168 sectors Units: sectors of 1 * 512 = 512 bytes Sector size (logical/physical): 512 bytes / 512 bytes I/O size (minimum/optimal): 512 bytes / 512 bytes Disklabel type: dos Disk identifier: 0xe684adb6 Command (m for help): Do not use fdisk for osd drives: they use the GPT partition structure and depend on the GPT uuids being correct. So use either parted or gdisk/cgdisk/sgdisk if you want to look at it. Writing an MBR partition table to the osd will naturally break it. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
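A hedged example of how to look at that disk without touching it; both commands only print the existing table (using /dev/sdb from the quote above):

# parted /dev/sdb print
# sgdisk -p /dev/sdb

Unlike an interactive fdisk session, neither of these creates a new DOS disklabel in memory, so there is nothing to accidentally write to a disk you are still trying to rescue.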
Re: [ceph-users] Power outages!!! help!
> [SNIP - bad drives] Generally when a disk is displaying bad blocks to the OS, the drive has been remapping blocks for ages in the background, and the disk is really on its last legs. a bit unlikely that you get so many disks dying at the same time though. but the problem can have been silently worsening and was not really noticed until the osd had to restart due to the power loss. if this is _very_ important data i would recommend you start by taking the bad drives out of operation, and cloning each bad drive block by block onto a good one using dd_rescue. it is also a good idea to store an image of the disk so you can try the different rescue methods several times. in the very worst case send the disk to a professional data recovery company. once that is done, you have 2 options: try to make the osd run again: xfs_repair, plus manually finding corrupt objects (find + md5sum, looking for read errors) and deleting them, has helped me in the past. if you manage to get the osd to run, drain it by setting its crush weight to 0, and eventually remove the disk from the cluster. alternatively, if you can not get the osd running again: use ceph-objectstore-tool to extract objects and inject them using a clean node and osd, as described in http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ read the man page and help for the tool; i think the arguments have changed slightly since that blog post. you may also run into read errors on corrupt objects, stopping your export. in that case rm the offending object and rerun the export. repeat for all bad drives. when doing the inject it is important that your cluster is operational and able to accept objects from the draining drive, so either set the crush failure domain to osd, or even better, add more osd nodes to make an operational cluster (with missing objects). also i see in your log you have os-prober testing all partitions. i tend to remove os-prober on machines that do not dual-boot with another os. rules of thumb for future ceph clusters: min_size=2 for a reason; it should never be 1 unless data loss is wanted. size=3 if you need the cluster to keep operating with a drive or node in an error state. size=2 gives you more space but the cluster will block on errors until the recovery is done. better to be blocking than losing data. if you have size=3 and 3 nodes and you lose a node, then your cluster can not self heal. you should have more nodes than you have set size to. have free space on the drives, this is where data is replicated to in case of a down node. if you have 4 nodes and you want to be able to lose one and still operate, you need leftover room on your 3 remaining nodes to cover for the lost one. the more nodes you have, the less the impact of a node failure is, and the less spare room is needed. for a 4 node cluster you should not fill more than 66% if you want to be able to self-heal + operate. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
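To make the ceph-objectstore-tool route above a bit more concrete, here is a rough sketch of the export/import steps from that blog post. The osd ids, pg id and file paths are only examples (not taken from this thread), and the arguments have changed a little between releases, so check ceph-objectstore-tool --help on your version first. Both the source osd and the destination osd must be stopped while the tool runs.

# ceph-objectstore-tool --op export --pgid 0.1f --data-path /var/lib/ceph/osd/ceph-3 --journal-path /var/lib/ceph/osd/ceph-3/journal --file /var/tmp/0.1f.export
# ceph-objectstore-tool --op import --data-path /var/lib/ceph/osd/ceph-30 --journal-path /var/lib/ceph/osd/ceph-30/journal --file /var/tmp/0.1f.export

The first command dumps one pg from the broken osd to a file, the second injects it into a spare empty osd; once that osd is started the cluster can find the previously missing objects and backfill them to wherever crush wants them. Draining an osd that does still run is the crush weight trick mentioned above: # ceph osd crush reweight osd.3 0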
Re: [ceph-users] Power outages!!! help!
comments inline On 28.08.2017 18:31, hjcho616 wrote: I'll see what I can do on that... Looks like I may have to add another OSD host as I utilized all of the SATA ports on those boards. =P Ronny, I am running with size=2 min_size=1. I created everything with ceph-deploy and didn't touch much of those pool settings... I hope not, but sounds like I may have lost some files! I do want some of those OSDs to come back online somehow... to get that confidence level up. =P This is a bad idea, as you have found out. Once your cluster is healthy you should look at improving this. The dead osd.3 message is probably me trying to stop and start the osd. There were some cases where stop didn't kill the ceph-osd process. I just started or restarted the osd to try and see if that worked.. After that, there were some reboots and I am not seeing those messages after it... when providing logs, try to move away the old one, do a single startup, and post that. it makes it easier to read when you have a single run in the file. This is something I am running at home. I am the only user. In a way it is a production environment but just driven by me. =) Do you have any suggestions to get any of those osd.3, osd.4, osd.5, and osd.8 come back up without removing them? I have a feeling I can get some data back with some of them intact. just in case: not being able to make them run again does not automatically mean the data is lost. i have successfully recovered lost objects using these instructions http://ceph.com/geen-categorie/incomplete-pgs-oh-my/ I would start by renaming the osd's log file, doing a single try at starting the osd, and posting that log. have you done anything to the osd's that could make them not run ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
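A minimal sketch of the two suggestions above; the pool name "rbd", osd id 3 and the systemd unit name are assumptions, so adjust them to your setup (older sysvinit installs would use /etc/init.d/ceph start osd.3 instead):

# ceph osd pool set rbd min_size 2
# ceph osd pool set rbd size 3
# mv /var/log/ceph/ceph-osd.3.log /var/log/ceph/ceph-osd.3.log.old
# systemctl start ceph-osd@3

Raising size only helps once there is enough space and enough hosts for a third copy; raising min_size to 2 can be done right away to stop new single-copy writes while troubleshooting.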
Re: [ceph-users] Power outages!!! help!
h-3' is currently in use. (Is ceph-osd already running?) 7faf16e23800 -1 ** ERROR: osd pre_init failed: (16) Device or resource busy This can indicate that you have a dead osd.3 process keeping the resources open, and preventing a new osd from starting. check with ps aux if you can see any ceph processes. If you do find something relating to your down osd's, you should try stopping it normally, and if that fails, kill it manually before trying to restart the osd. also check dmesg for messages relating to faulty hardware or the OOM killer. i have had experiences with the OOM killer where the osd node became unreliable until i rebooted the machine. kind regards, and good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
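A small sketch of those checks; <pid> stands for whatever leftover ceph-osd process ps shows:

# ps aux | grep '[c]eph-osd'
# kill <pid>
# kill -9 <pid>
# dmesg | egrep -i 'i/o error|ata|out of memory|oom'

The bracketed first letter keeps grep itself out of the listing. Try the plain kill first and only fall back to -9 if the process refuses to die; the dmesg filter is just a starting point for spotting disk errors and OOM kills.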
Re: [ceph-users] Monitoring a rbd map rbd connection
write to a subdirectory on the RBD. so if it is not mounted, the directory will be missing, and you get a no such file error. Ronny Aasen On 25.08.2017 18:04, David Turner wrote: Additionally, solely testing if you can write to the path could give a false sense of security if the path is writable when the RBD is not mounted. It would write a file to the system drive and you would see it as successful. On Fri, Aug 25, 2017 at 2:27 AM Adrian Saul <adrian.s...@tpgtelecom.com.au> wrote: If you are monitoring to ensure that it is mounted and active, a simple check_disk on the mountpoint should work. If the mount is not present, or the filesystem is non-responsive then this should pick it up. A second check to perhaps test you can actually write files to the file system would not go astray either. Other than that I don't think there is much point checking anything else like rbd mapped output. > -Original Message- > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of > Hauke Homburg > Sent: Friday, 25 August 2017 1:35 PM > To: ceph-users <ceph-us...@ceph.com> > Subject: [ceph-users] Monitoring a rbd map rbd connection > > Hello, > > I want to monitor the mapped connection between an rbd image mapped with rbd map > and a /dev/rbd device. > > I want to do this with icinga. > > Has anyone an idea how i can do this? > > My first idea is to touch and remove a file in the mount point. I am not sure > that this is the only thing i have to do > > > Thanks for help > > Hauke > > -- > www.w3-creative.de > > www.westchat.de > > ___ > ceph-users mailing list > ceph-users@lists.ceph.com > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com Confidentiality: This email and any attachments are confidential and may be subject to copyright, legal or some other professional privilege. They are intended solely for the attention and use of the named addressee(s). They may only be copied, distributed or disclosed with the consent of the copyright owner. If you have received this email by mistake or by breach of the confidentiality clause, please notify the sender immediately by return email and delete or destroy all copies of the email. Any confidentiality, privilege or copyright is not waived or lost because this email has been sent to you by mistake. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
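A tiny sketch of the "write to a subdirectory" idea above, as a nagios/icinga style check; the mount point and marker path are made-up examples:

#!/bin/sh
# fails unless the marker directory exists on the mounted RBD and is writable
MARKER=/mnt/rbd0/.mounted/heartbeat
if date > "$MARKER" 2>/dev/null; then
    echo "OK: rbd filesystem mounted and writable"
    exit 0
else
    echo "CRITICAL: cannot write $MARKER (rbd not mounted?)"
    exit 2
fi

Because the .mounted directory only exists inside the RBD filesystem, a write that would otherwise land on the system disk (when the mount is missing) fails with "no such file", which is exactly the signal David's caveat asks for.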
Re: [ceph-users] luminous/bluestore osd memory requirements
On 10.08.2017 17:30, Gregory Farnum wrote: This has been discussed a lot in the performance meetings so I've added Mark to discuss. My naive recollection is that the per-terabyte recommendation will be more realistic than it was in the past (an effective increase in memory needs), but also that it will be under much better control than previously. Is there any way to tune or reduce the memory footprint? perhaps by sacrificing performance? our jewel cluster's osd servers are maxed out on memory, and with the added memory requirements I fear we may not be able to upgrade to luminous/bluestore. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] ceph packages on stretch from eu.ceph.com
Thanks for the suggestions. i did do a trial with the proxmox ones, on a single node machine though. But i hope, now that debian 9 is released and stable, that the ceph repos will include stretch soon.. Hint Hint :) I am itching to try to upgrade my testing cluster. :) kind regards Ronny Aasen On 26. april 2017 19:46, Alexandre DERUMIER wrote: you can try the proxmox stretch repository if you want http://download.proxmox.com/debian/ceph-luminous/dists/stretch/ - Original Message - From: "Wido den Hollander" <w...@42on.com> To: "ceph-users" <ceph-users@lists.ceph.com>, "Ronny Aasen" <ronny+ceph-us...@aasen.cx> Sent: Wednesday 26 April 2017 16:58:04 Subject: Re: [ceph-users] ceph packages on stretch from eu.ceph.com On 25 April 2017 at 20:07, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: Hello i am trying to install ceph on debian stretch from http://eu.ceph.com/debian-jewel/dists/ but there is no stretch repo there. now with stretch being frozen, it is a good time to be testing ceph on stretch. is it possible to get packages for stretch on jewel, kraken, and luminous ? Afaik packages are only built for stable releases. As Stretch isn't out there are no packages. You can try if the Ubuntu 16.04 (Xenial) packages work. Wido kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
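For anyone finding this later, a hedged sketch of what using that proxmox repository would look like on stretch; the component name ("main") and the key handling are assumptions, so check the dists/ listing above and proxmox's own documentation before relying on it:

# echo 'deb http://download.proxmox.com/debian/ceph-luminous stretch main' > /etc/apt/sources.list.d/ceph-luminous.list
# apt-get update && apt-get install ceph

You would also need the proxmox release key in apt's trusted keyring (apt-key add, or a file under /etc/apt/trusted.gpg.d/) for apt to accept packages from the repository.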
[ceph-users] ceph packages on stretch from eu.ceph.com
Hello i am trying to install ceph on debian stretch from http://eu.ceph.com/debian-jewel/dists/ but there is no stretch repo there. now with stretch being frozen, it is a good time to be testing ceph on stretch. is it possible to get packages for stretch on jewel, kraken, and luminous ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
[ceph-users] best practices in connecting clients to cephfs public network
hello i want to connect 3 servers to cephfs. The servers are normally not in the public network. is it best practice to connect 2 interfaces on the servers, so they are directly connected to the public network? or to route between the networks via their common default gateway? the machines are vm's so it's easy to add interfaces, and the servers' lan and the cluster's public network are on the same router, so it's also easy to route between them. there is a separate firewall in front of the routed networks, so the security aspect is quite similar one way or the other. what is the recommended way to connect clients to the public network ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Directly addressing files on individual OSD
On 16.03.2017 08:26, Youssef Eldakar wrote: Thanks for the reply, Anthony, and I am sorry my question did not give sufficient background. This is the cluster behind archive.bibalex.org. Storage nodes keep archived webpages as multi-member GZIP files on the disks, which are formatted using XFS as standalone file systems. The access system consults an index that says where a URL is stored, which is then fetched over HTTP from the individual storage node that has the URL somewhere on one of the disks. So far, we have pretty much been managing the storage using homegrown scripts to have each GZIP file stored on 2 separate nodes. This obviously has been requiring a good deal of manual work and as such has not been very effective. Given that description, do you feel Ceph could be an appropriate choice? if you adapt your scripts to something like... "Storage nodes archive webpages as gzip files, hash the url to use as an object name, and save the gzip files as objects in ceph via the S3 interface. The access system gets a request for a url, hashes the url into an object name, and fetches the gzip (object) using regular S3 GET syntax." ceph would deal with replication; you would only put objects in and fetch them out. you could, if you need it, keep the list of urls and hashes as a record of what you have stored. this is just an example though. you could also use cephfs, mounted on the nodes, and serve files as today. ceph is just a storage tool; it could work very nicely for your needs, but accessing the files on osd's directly will only bring pain. ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
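A minimal sketch of that hash-and-put flow using s3cmd against a radosgw; the bucket name, the configured endpoint and the choice of sha256 are assumptions, not anything from the original setup:

# KEY=$(echo -n "http://example.org/some/page.html" | sha256sum | awk '{print $1}')
# s3cmd put page-12345.warc.gz s3://webarchive/$KEY
# s3cmd get s3://webarchive/$KEY /tmp/page.gz

The same hashing in the access system turns a url lookup into a single GET, and ceph handles placement and the two (or three) copies that the homegrown scripts manage today.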
[ceph-users] ceph osd crash on startup / crashed first during snap removal
greetings when i removed a single large rbd snap today, from a 20 TB rbd, my osd's had very high load for a while. during this period of high load, where multiple osd's were marked down and marked themselves up again, 2 of my osd's crashed, and these do not want to start again. the log does not show anything obvious to me as to why the osd should crash so quickly like that on startup. the logs do not show anything wrong with the hardware either. i have shared a complete log file, using debug osd/filestore/journal = 20, where i try to start the osd. https://owncloud.fjordane-it.no/index.php/s/gYEmYOcuil8ANG2 i still have the osd available so i can try starting it again with other debug values if that is valuable. i hope someone can shed some light on why this osd crashes. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
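For reference, debug levels like those are typically set either in ceph.conf (useful here, since the osd crashes at startup and runtime injection never gets a chance) or injected into running daemons; a small sketch:

[osd]
    debug osd = 20
    debug filestore = 20
    debug journal = 20

# ceph tell osd.* injectargs '--debug-osd 20 --debug-filestore 20 --debug-journal 20'

Level 20 logging is very chatty, so remember to turn it back down afterwards and watch the free space on the log partition.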
Re: [ceph-users] pg stuck with unfound objects on non-existing osd's
thanks for the suggestion. is a rolling reboot sufficient? or must all osd's be down at the same time? one is no problem. the other takes some scheduling.. Ronny Aasen On 01.11.2016 21:52, c...@elchaka.de wrote: Hello Ronny, if it is possible for you, try to reboot all OSD nodes. I had this issue on my test cluster and it became healthy after rebooting. Hth - Mehmet On 1 November 2016 19:55:07 MEZ, Ronny Aasen <ronny+ceph-us...@aasen.cx> wrote: Hello. I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 unfound objects. [snip - the original post is quoted in full further down in this thread] this is hammer 0.94.9 on debian 8. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Need help! Ceph backfill_toofull and recovery_wait+degraded
if you have the default crushmap and osd pool default size = 3, then ceph creates 3 copies of each object. and store it on 3 separate nodes. so the best way to solve your space problems is to try to even out the space between your hosts. either by adding disks to ceph1 ceph2 ceph3, or by adding more nodes. kind regards Ronny Aasen On 01.11.2016 20:14, Marcus Müller wrote: > Hi all, > > i have a big problem and i really hope someone can help me! > > We are running a ceph cluster since a year now. Version is: 0.94.7 (Hammer) > Here is some info: > > Our osd map is: > > ID WEIGHT TYPE NAME UP/DOWN REWEIGHT PRIMARY-AFFINITY > -1 26.67998 root default > -2 3.64000 host ceph1 > 0 3.64000 osd.0 up 1.0 1.0 > -3 3.5 host ceph2 > 1 3.5 osd.1 up 1.0 1.0 > -4 3.64000 host ceph3 > 2 3.64000 osd.2 up 1.0 1.0 > -5 15.89998 host ceph4 > 3 4.0 osd.3 up 1.0 1.0 > 4 3.5 osd.4 up 1.0 1.0 > 5 3.2 osd.5 up 1.0 1.0 > 6 5.0 osd.6 up 1.0 1.0 > > ceph df: > > GLOBAL: > SIZE AVAIL RAW USED %RAW USED > 40972G 26821G 14151G 34.54 > POOLS: > NAMEID USED %USED MAX AVAIL OBJECTS > blocks 7 4490G 10.96 1237G 7037004 > commits 8 473M 0 1237G 802353 > fs 9 9666M 0.02 1237G 7863422 > > ceph osd df: > > ID WEIGHT REWEIGHT SIZE USEAVAIL %USE VAR > 0 3.64000 1.0 3724G 3128G 595G 84.01 2.43 > 1 3.5 1.0 3724G 3237G 487G 86.92 2.52 > 2 3.64000 1.0 3724G 3180G 543G 85.41 2.47 > 3 4.0 1.0 7450G 1616G 5833G 21.70 0.63 > 4 3.5 1.0 7450G 1246G 6203G 16.74 0.48 > 5 3.2 1.0 7450G 1181G 6268G 15.86 0.46 > 6 5.0 1.0 7450G 560G 6889G 7.52 0.22 > TOTAL 40972G 14151G 26820G 34.54 > MIN/MAX VAR: 0.22/2.52 STDDEV: 36.53 > > > Our current cluster state is: > > health HEALTH_WARN > 63 pgs backfill > 8 pgs backfill_toofull > 9 pgs backfilling > 11 pgs degraded > 1 pgs recovering > 10 pgs recovery_wait > 11 pgs stuck degraded > 89 pgs stuck unclean > recovery 8237/52179437 objects degraded (0.016%) > recovery 9620295/52179437 objects misplaced (18.437%) > 2 near full osd(s) > noout,noscrub,nodeep-scrub flag(s) set > monmap e8: 4 mons at {ceph1=192.168.10.3:6789/0,ceph2=192.168.10.4:6789/0,ceph3=192.168.10.5:6789/0,ceph4=192.168.60.6:6789/0} > election epoch 400, quorum 0,1,2,3 ceph1,ceph2,ceph3,ceph4 > osdmap e1774: 7 osds: 7 up, 7 in; 84 remapped pgs > flags noout,noscrub,nodeep-scrub > pgmap v7316159: 320 pgs, 3 pools, 4501 GB data, 15336 kobjects > 14152 GB used, 26820 GB / 40972 GB avail > 8237/52179437 objects degraded (0.016%) > 9620295/52179437 objects misplaced (18.437%) > 231 active+clean > 61 active+remapped+wait_backfill >9 active+remapped+backfilling >6 active+recovery_wait+degraded+remapped >6 active+remapped+backfill_toofull >4 active+recovery_wait+degraded >2 active+remapped+wait_backfill+backfill_toofull >1 active+recovering+degraded > recovery io 11754 kB/s, 35 objects/s > client io 1748 kB/s rd, 249 kB/s wr, 44 op/s > > > My main problems are: > > - As you can see from the osd tree, we have three separate hosts with only one osd each. Another one has four osds. Ceph allows me not to get data back from these three nodes with only one HDD, which are all near full. I tried to set the weight of the osds in the bigger node higher but this just does not work. So i added a new osd yesterday which made things not better, as you can see now. What do i have to do to just become these three nodes empty again and put more data on the other node with the four HDDs. > > - I added the „ceph4“ node later, this resulted in a strange ip change as you can see in the mon list. 
The public network and the cluster network were swapped or not assigned right. See ceph.conf > > [global] > fsid = xxx > mon_initial_members = ceph1 > mon_host = 192.168.10.3, 192.168.10.4, 192.168.10.5, 192.168.10.11 > auth_cluster_required = ce
[ceph-users] pg stuck with unfound objects on non-existing osd's
Hello. I have a cluster stuck with 2 pg's stuck undersized degraded, with 25 unfound objects. # ceph health detail HEALTH_WARN 2 pgs degraded; 2 pgs recovering; 2 pgs stuck degraded; 2 pgs stuck unclean; 2 pgs stuck undersized; 2 pgs undersized; recovery 294599/149522370 objects degraded (0.197%); recovery 640073/149522370 objects misplaced (0.428%); recovery 25/46579241 unfound (0.000%); noout flag(s) set pg 6.d4 is stuck unclean for 8893374.380079, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck unclean for 8896787.249470, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is stuck undersized for 438122.427341, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck undersized for 416947.461950, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is stuck degraded for 438122.427402, current state active+recovering+undersized+degraded+remapped, last acting [62] pg 6.ab is stuck degraded for 416947.462010, current state active+recovering+undersized+degraded+remapped, last acting [18,12] pg 6.d4 is active+recovering+undersized+degraded+remapped, acting [62], 25 unfound pg 6.ab is active+recovering+undersized+degraded+remapped, acting [18,12] recovery 294599/149522370 objects degraded (0.197%) recovery 640073/149522370 objects misplaced (0.428%) recovery 25/46579241 unfound (0.000%) noout flag(s) set have been following the troubleshooting guide at http://docs.ceph.com/docs/hammer/rados/troubleshooting/troubleshooting-pg/ but gets stuck without a resolution. luckily it is not critical data. so i wanted to mark the pg lost so it could become health-ok # ceph pg 6.d4 mark_unfound_lost delete Error EINVAL: pg has 25 unfound objects but we haven't probed all sources, not marking lost querying the pg i see that it would want osd.80 and osd 36 { "osd": "80", "status": "osd is down" }, trying to mark the osd's lost does not work either. since the osd's was removed from the cluster a long time ago. # ceph osd lost 80 --yes-i-really-mean-it osd.80 is not down or doesn't exist # ceph osd lost 36 --yes-i-really-mean-it osd.36 is not down or doesn't exist and this is where i am stuck. have tried stopping and starting the 3 osd's but that did not have any effect. Anyone have any advice how to proceed ? full output at: http://paste.debian.net/hidden/be03a185/ this is hammer 0.94.9 on debian 8. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] offending shards are crashing osd's
On 19. okt. 2016 13:00, Ronny Aasen wrote: On 06. okt. 2016 13:41, Ronny Aasen wrote: hello I have a few osd's in my cluster that are regularly crashing. [snip] ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen i again have this problem with crashing osd's. a more detailed log is on the tail of this mail. Does anyone have any suggestions on how i can identify what shard that needs to be removed to allow the EC to recover. ? and more importantly how i can stop the osd's from crashing? kind regards Ronny Aasen Answering my own question for googleability. using this one-liner: for dir in $(find /var/lib/ceph/osd/ceph-* -maxdepth 2 -type d -name '5.26*' | sort | uniq) ; do find $dir -name '*3a3938238e1f29.002d80ca*' -type f -ls ;done i got a list of all shards of the problematic object. One of the objects had size 0 but was otherwise readable without any io errors. I guess this explains the inconsistent size, but it does not explain why ceph decides it's better to crash 3 osd's, rather than move a 0 byte file into a "LOST+FOUND" style directory structure. Or just delete it, since it will not have any useful data anyway. Deleting this file (mv to /tmp) allowed the 3 broken osd's to start, and they have been running for >24h now, while usually they crash within 10 minutes. Yay! Generally you need to check _all_ shards on the given pg, not just the 3 crashing. This was what confused me, since i only focused on the crashing osd's. I used the one-liner that checked all osd's for the pg, since due to backfilling the pg was spread all over the place, and i could run it from ansible to reduce tedious work. Also it would be convenient to be able to mark a broken/inconsistent pg manually "inactive", instead of crashing 3 osd's and taking lots of other pg's down with them. One could set the pg inactive while troubleshooting, and unset pg-inactive when done, without having osd's crash and all the following high load rebalancing. Also i ran a find for 0 size files on that pg and there are multiple other files. is a 0 byte rbd_data file in a pg a normal occurrence, or can i have more similar problems in the future due to the other 0 size files ? kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
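For googleability as well: the "find for 0 size files" mentioned above can be done with a variant of the same one-liner (the pg id 5.26 is the one from this thread, the rest is a generic sketch):

# for dir in $(find /var/lib/ceph/osd/ceph-* -maxdepth 2 -type d -name '5.26*' | sort | uniq); do find $dir -type f -size 0 -ls; done

-size 0 only matches completely empty files, so the output is a candidate list of shards to compare against their siblings on the other osd's before deciding whether any of them are the same kind of problem.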
Re: [ceph-users] offending shards are crashing osd's
On 06. okt. 2016 13:41, Ronny Aasen wrote: hello I have a few osd's in my cluster that are regularly crashing. [snip] ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen i again have this problem with crashing osd's. a more detailed log is on the tail of this mail. Does anyone have any suggestions on how i can identify what shard that needs to be removed to allow the EC to recover. ? and more importantly how i can stop the osd's from crashing? kind regards Ronny Aasen -- query of pg in question -- # ceph pg 5.26 query { "state": "active+undersized+degraded+remapped+wait_backfill", "snap_trimq": "[]", "epoch": 138744, "up": [ 27, 109, 2147483647, 2147483647, 62, 75 ], "acting": [ 2147483647, 2147483647, 32, 107, 62, 38 ], "backfill_targets": [ "27(0)", "75(5)", "109(1)" ], "actingbackfill": [ "27(0)", "32(2)", "38(5)", "62(4)", "75(5)", "107(3)", "109(1)" ], "info": { "pgid": "5.26s2", "last_update": "84093'35622", "last_complete": "84093'35622", "log_tail": "82361'32622", "last_user_version": 0, "last_backfill": "MAX", "purged_snaps": "[1~7]", "history": { "epoch_created": 61149, "last_epoch_started": 138692, "last_epoch_clean": 136567, "last_epoch_split": 0, "same_up_since": 138691, "same_interval_since": 138691, "same_primary_since": 138691, "last_scrub": "84093'35622", "last_scrub_stamp": "2016-10-18 06:18:28.253508", "last_deep_scrub": "84093'35622", "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167", "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167" }, "stats": { "version": "84093'35622", "reported_seq": "210475", "reported_epoch": "138730", "state": "active+undersized+degraded+remapped+wait_backfill", "last_fresh": "2016-10-19 12:40:32.982617", "last_change": "2016-10-19 12:03:29.377914", "last_active": "2016-10-19 12:40:32.982617", "last_peered": "2016-10-19 12:40:32.982617", "last_clean": "2016-07-19 12:03:54.814292", "last_became_active": "0.00", "last_became_peered": "0.00", "last_unstale": "2016-10-19 12:40:32.982617", "last_undegraded": "2016-10-19 12:02:03.030755", "last_fullsized": "2016-10-19 12:02:03.030755", "mapping_epoch": 138627, "log_start": "82361'32622", "ondisk_log_start": "82361'32622", "created": 61149, "last_epoch_clean": 136567, "parent": "0.0", "parent_split_bits": 0, "last_scrub": "84093'35622", "last_scrub_stamp": "2016-10-18 06:18:28.253508", "last_deep_scrub": "84093'35622", "last_deep_scrub_stamp": "2016-10-14 05:33:56.701167", "last_clean_scrub_stamp": "2016-10-14 05:33:56.701167", "log_size": 3000, "ondisk_log_size
Re: [ceph-users] Recovery/Backfill Speedup
how did you set the parameter ? editing ceph.conf only works when you restart the osd nodes. but running something like ceph tell osd.* injectargs '--osd-max-backfills 6' would set all osd's max backfill dynamically without restarting the osd. and you should fairly quickly afterwards see more backfills in ceph -s I have also noticed that if i run ceph -n osd.0 --show-config on one of my mon nodes, it shows the deafult settings. it does not actualy talk to osd.0 and get the current settings. but if i run it from any osd node it works. But i am on hammer and not on jewel so this might have changed and actualy work for you. Kind regards Ronny Aasen On 05. okt. 2016 21:52, Dan Jakubiec wrote: Thank Ronny, I am working with Reed on this problem. Yes something is very strange. Docs say osd_max_backfills default to 10, but when we examined the run-time configuration using "ceph --show-config" it was showing osd_max_backfills set to 1 (we are running latest Jewel release). We have explicitly set this parameter to 10 now. Sadly, about 2 hours in backfills continue to be anemic. Any other ideas? $ ceph -s cluster edeb727e-c6d3-4347-bfbb-b9ce7f60514b health HEALTH_WARN 246 pgs backfill_wait 3 pgs backfilling 329 pgs degraded 83 pgs recovery_wait 332 pgs stuck unclean 257 pgs undersized recovery 154681996/676556815 objects degraded (22.863%) recovery 278768286/676556815 objects misplaced (41.204%) noscrub,nodeep-scrub,sortbitwise flag(s) set monmap e1: 3 mons at {core=10.0.1.249:6789/0,db=10.0.1.251:6789/0,dev=10.0.1.250:6789/0} election epoch 210, quorum 0,1,2 core,dev,db osdmap e4274: 16 osds: 16 up, 16 in; 279 remapped pgs flags noscrub,nodeep-scrub,sortbitwise pgmap v1657039: 576 pgs, 2 pools, 6427 GB data, 292 Mobjects 15308 GB used, 101 TB / 116 TB avail 154681996/676556815 objects degraded (22.863%) 278768286/676556815 objects misplaced (41.204%) 244 active+clean 242 active+undersized+degraded+remapped+wait_backfill 53 active+recovery_wait+degraded 17 active+recovery_wait+degraded+remapped 13 active+recovery_wait+undersized+degraded+remapped 3 active+remapped+wait_backfill 2 active+undersized+degraded+remapped+backfilling 1 active+degraded+remapped+wait_backfill 1 active+degraded+remapped+backfilling recovery io 1568 kB/s, 109 objects/s client io 5629 kB/s rd, 411 op/s rd, 0 op/s wr Here is what our current configuration looks like: $ ceph -n osd.0 --show-config | grep osd | egrep "recovery|backfill" | sort osd_allow_recovery_below_min_size = true osd_backfill_full_ratio = 0.85 osd_backfill_retry_interval = 10 osd_backfill_scan_max = 512 osd_backfill_scan_min = 64 osd_debug_reject_backfill_probability = 0 osd_debug_skip_full_check_in_backfill_reservation = false osd_kill_backfill_at = 0 osd_max_backfills = 10 osd_min_recovery_priority = 0 osd_recovery_delay_start = 0 osd_recovery_forget_lost_objects = false osd_recovery_max_active = 15 osd_recovery_max_chunk = 8388608 osd_recovery_max_single_start = 1 osd_recovery_op_priority = 63 osd_recovery_op_warn_multiple = 16 osd_recovery_sleep = 0 osd_recovery_thread_suicide_timeout = 300 osd_recovery_thread_timeout = 30 osd_recovery_threads = 5 -- Dan Ronny Aasen wrote: On 04.10.2016 16:31, Reed Dier wrote: Attempting to expand our small ceph cluster currently. Have 8 nodes, 3 mons, and went from a single 8TB disk per node to 2x 8TB disks per node, and the rebalancing process is excruciatingly slow. Originally at 576 PGs before expansion, and wanted to allow rebalance to finish before expanding the PG count for the single pool, and the replication size. 
I have stopped scrubs for the time being, as well as set client and recovery io to equal parts so that client io is not burying the recovery io. Also have increased the number of recovery threads per osd. [osd] osd_recovery_threads = 5 filestore_max_sync_interval = 30 osd_client_op_priority = 32 osd_recovery_op_priority = 32 Also, this is 10G networking we are working with and recovery io typically hovers between 0-35 MB’s but typically very bursty. Disks are 8TB 7.2k SAS disks behind an LSI 3108 controller, configured as individual RAID0 VD’s, with pdcache disabled, but BBU backed write back caching enabled at the controller level. Have thought about increasing the ‘osd_max_backfills’ as well as ‘osd_recovery_max_active’, and possibly ‘osd_recovery_max_chunk’ to attempt to speed it up, but will hopefully get some insight from the community here. ceph -s about 4 days in: health HEALTH_WARN 255 pgs backfill_wait 4 pgs backfilling 385 pgs degraded 1
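A hedged footnote on verifying what a running osd actually uses, since (as noted above) running --show-config on a mon only reports defaults and does not talk to the osd: the admin socket on the osd node itself answers for the live daemon.

# ceph daemon osd.0 config get osd_max_backfills
# ceph daemon osd.0 config get osd_recovery_max_active
# ceph tell osd.* injectargs '--osd-max-backfills 6 --osd-recovery-max-active 6'

The first two must be run on the host where osd.0 lives; the injectargs line changes every osd at runtime, and the effect should show up as more parallel backfills in ceph -s within a few minutes.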
[ceph-users] offending shards are crashing osd's
hello I have a few osd's in my cluster that are regularly crashing. in the log of them i can see osd.7 -1> 2016-10-06 08:09:18.869687 7ffaa037f700 -1 osd.7 pg_epoch: 128840 pg[5.3as0( v 84797'30080 (67219'27080,84797'30080] local-les=128834 n=13146 ec=61149 les/c 128834/127358 128829/128829/128829) [7,109,4,0,62,32]/[7,109,32,0,62,39] r=0 lpr=128829 pi=127357-128828/12 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) osd.32 -411> 2016-10-06 13:21:15.166968 7fe45b6cb700 -1 osd.32 pg_epoch: 129181 pg[5.3as2( v 84797'30080 (67219'27080,84797'30080] local-les=129171 n=13146 ec=61149 les/c 129171/127358 129170/129170/129170) [2147483647,2147483647,4,0,62,32]/[2147483647,2147483647,32,0,62,39] r=2 lpr=129170 pi=121260-129169/43 rops=5 bft=4(2),32(5) crt=0'0 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) osd.109 -1> 2016-10-06 13:17:36.748340 7fa53d36c700 -1 osd.109 pg_epoch: 129167 pg[5.3as1( v 84797'30080 (66310'24592,84797'30080] local-les=129163 n=13146 ec=61149 les/c 129163/127358 129162/129162/129162) [2147483647,109,4,0,62,32]/[2147483647,109,32,0,62,39] r=1 lpr=129162 pi=112552-129161/59 rops=5 bft=4(2),32(5) crt=84797'30076 lcod 0'0 mlcod 0'0 active+undersized+degraded+remapped+backfilling] handle_recovery_read_complete: inconsistent shard sizes 5/abc6d43a/rbd_data.33640a238e1f29.0003b165/head the offending shard must be manually removed after verifying there are enough shards to recover (0, 8388608, [32(2),0, 39(5),0]) ofcourse having 3 osd's dying regularly is not good for my health. so i have set noout, to avoid heavy recoveries. googeling this error messages gives exactly 1 hit: https://github.com/ceph/ceph/pull/6946 where it saies: "the shard must be removed so it can be reconstructed" but with my 3 osd's failing, i am not certain witch of them contain the broken shard. (or perhaps all 3 of them?) a bit reluctant to delete on all 3. I have 4+2 erasure coding. ( erasure size 6 min_size 4 ) so finding out witch one is bad would be nice. hope someone have an idea how to progress. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Give up on backfill, remove slow OSD
On 22. sep. 2016 09:16, Iain Buclaw wrote: Hi, I currently have an OSD that has been backfilling data off it for a little over two days now, and it's gone from approximately 68 PGs to 63. As data is still being read from, and written to it by clients whilst I'm trying to get it out of the cluster, this is not helping it at all. I figured that it's probably best just to cut my losses and just force it out entirely so that all new writes and reads to those PGs get redirected elsewhere to a functional disk, and the rest of the recovery can proceed without being blocked heavily by this one disk. Granted that objects and files have a 1:1 relationship, I can just rsync the data to a new server and write it back into ceph afterwards. Now, I know that as soon as I bring down this OSD, the entire cluster will stop operating. So what's the most swift method of telling the cluster to forget about this disk and everything that may be stored on it. Thanks It should normally not get new writes to it if you want to remove it from the cluster. I assume you did something wrong here. How did you define the osd out of the cluster ? generally my procedure for a working osd is something like 1. ceph osd crush reweight osd.X 0 2. ceph osd tree - check that the osd in question actually has 0 weight (the first number after ID) and that the host weight has been reduced accordingly. 3. ls /var/lib/ceph/osd/ceph-X/current ; periodically, wait for the osd to drain. there should be no PG directories (n.xxx_head or n.xxx_TEMP) left. this will take a while depending on the size of the osd. in reality i just wait until the disk usage graph settles, then double-check with ls. 4. once empty, I mark the osd out, stop the process, and remove the osd from the cluster as written in the documentation: - ceph auth del osd.X - ceph osd crush remove osd.X - ceph osd rm osd.X PS: if your cluster stops operating when an osd goes down, you have something else fundamentally wrong. you should look into this as well, as a separate case. kind regards Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] Is it possible to recover the data of which all replicas are lost?
On 27. sep. 2016 13:29, xxhdx1985126 wrote: Hi, everyone. I've got a problem here. Due to some mis-operations, I deleted all three replicas of my data, is there any way to recover it? This is a very urgent problem. Please help me, Thanks. you do not give any details on how you deleted the data, so i am assuming a lot. But if you pulled 3 disks at the same time, and the disks are working, you can connect and mount the disks, and use the ceph-objectstore-tool to export a pg to a datafile, and then run the tool again to import it to a fresh empty osd. this older writeup gives an overview of the process. keep in mind the tool has changed name and is now part of the default install http://ceph.com/community/incomplete-pgs-oh-my/ if you actually deleted the pg's off the disks, or the disks are dead, then you need to stop writing to those osd's and use some kind of file recovery tool or service, and then as step 2 use the tool above to get the objects back onto the cluster. i would start by marking the 3 osd's out, so no more writes take place, and stop them as soon as possible (you do not want to make the problem worse). then try some file recovery tools, or send the disks to someone like ibas https://www.krollontrack.com/ if they are dead. keep in mind you need the xattr information as well in order to get the functioning objects back. once you have the file structure in place you use the ceph-objectstore-tool to export/import to a working osd. good luck Ronny Aasen ___ ceph-users mailing list ceph-users@lists.ceph.com http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Re: [ceph-users] problem starting osd ; PGLog.cc: 984: FAILED assert hammer 0.94.9
added debug journal = 20 and got some new lines in the log. that i added to the end of this email. any of you can make something out of them ? kind regards Ronny Aasen On 18.09.2016 18:59, Kostis Fardelas wrote: If you are aware of the problematic PGs and they are exportable, then ceph-objectstore-tool is a viable solution. If not, then running gdb and/or higher debug osd level logs may prove useful (to understand more about the problem or collect info to ask for more in ceph-devel). On 13 September 2016 at 17:26, Henrik Korkuc <li...@kirneh.eu> wrote: On 16-09-13 11:13, Ronny Aasen wrote: I suspect this must be a difficult question since there have been no replies on irc or mailinglist. assuming it's impossible to get these osd's running again. Is there a way to recover objects from the disks. ? they are mounted and data is readable. I have pg's down since they want to probe these osd's that do not want to start. pg query claim it can continue if i mark the osd as lost. but i would prefer to not loose data. especially since the data is ok and readable on the nonfunctioning osd. also let me know if there is other debug i can extract in order to troubleshoot the non starting osd's kind regards Ronny Aasen I cannot help you with this, but you can try using http://ceph.com/community/incomplete-pgs-oh-my/ and http://lists.ceph.com/pipermail/ceph-users-ceph.com/2015-April/000238.html (found this mail thread googling for the objectool post). ymmv On 12. sep. 2016 13:16, Ronny Aasen wrote: after adding more osd's and having a big backfill running 2 of my osd's keep on stopping. We also recently upgraded from 0.94.7 to 0.94.9 but i do not know if that is related. the log say. [snip old error log. ] -17> 2016-09-18 22:52:06.405881 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/head '_' = 266 -16> 2016-09-18 22:52:06.405915 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/21 '_' -15> 2016-09-18 22:52:06.406049 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/578c53b6/rb.0.392c.238e1f29.000513d5/21 '_' = 251 -14> 2016-09-18 22:52:06.406079 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/4ecf13b6/rb.0.392c.238e1f29.0037c4cb/21 '_' -13> 2016-09-18 22:52:06.406166 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.0037c4c b__21_4ECF13B6__1 with flags=2: (2) No such file or directory -12> 2016-09-18 22:52:06.406187 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/4ecf13b6/rb.0.392c.238e1f29.0037c4cb/21 '_' = -2 -11> 2016-09-18 22:52:06.406190 7f878791b880 15 read_log missing 104661'46956,1/4ecf13b6/rb.0.392c.238e 1f29.0037c4cb/21 -10> 2016-09-18 22:52:06.406195 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/head '_' -9> 2016-09-18 22:52:06.406279 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.00b5bb3 b__head_E85F13B6__1 with flags=2: (2) No such file or directory -8> 2016-09-18 22:52:06.406293 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/head '_' = -2 -7> 2016-09-18 22:52:06.406297 7f878791b880 15 read_log missing 
104661'46955,1/e85f13b6/rb.0.392c.238e 1f29.00b5bb3b/head -6> 2016-09-18 22:52:06.406311 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/21 '_' -5> 2016-09-18 22:52:06.406363 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) error opening file /var/lib/ceph/osd/ceph-106/current/1.3b6_head/DIR_6/DIR_B/DIR_3/DIR_1/DIR_F/rb.0.392c.238e1f29.00b5bb3 b__21_E85F13B6__1 with flags=2: (2) No such file or directory -4> 2016-09-18 22:52:06.406369 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/e85f13b6/rb.0.392c.238e1f29.00b5bb3b/21 '_' = -2 -3> 2016-09-18 22:52:06.406372 7f878791b880 15 read_log missing 91332'39092,1/e85f13b6/rb.0.392c.238e1 f29.00b5bb3b/21 -2> 2016-09-18 22:52:06.406375 7f878791b880 15 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/d9c303b6/rb.0.392c.238e1f29.4943/head '_' -1> 2016-09-18 22:52:06.426875 7f878791b880 10 filestore(/var/lib/ceph/osd/ceph-106) getattr 1.3b6_head /1/d9c303b6/rb.0.392c.238e1f29.4943/head '_' = 266 0> 2016-09-18 22:52:06.455911 7f878791b880 -1 osd/PGLog.cc: In function 'static void PGLog::read_log(O