Re: [ceph-users] Cannot create bucket via the S3 (s3cmd)

2016-02-17 Thread Arvydas Opulskis
Hi,

Are you using rgw_dns_name parameter in config? Sometimes it’s needed (when s3 
client sends bucket name as subdomain).
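A minimal sketch of the matching server- and client-side settings, assuming a
hypothetical domain s3.example.com and the rgw instance name seen in the logs
below:

# ceph.conf, on the gateway host
[client.rgw.gateway]
rgw_dns_name = s3.example.com

# ~/.s3cfg for s3cmd, using the same name
host_base = s3.example.com
host_bucket = %(bucket)s.s3.example.com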

Arvydas

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Alexandr Porunov
Sent: Wednesday, February 17, 2016 10:37 PM
To: Василий Ангапов ; ceph-commun...@lists.ceph.com; 
ceph-maintain...@lists.ceph.com; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Cannot create bucket via the S3 (s3cmd)

Because I have created them manually and then I have installed Rados Gateway.
After that I realised that Rados Gateway didn't work. I thought it was
because I had created the pools manually, so I removed those buckets which I had
created and reinstalled Rados Gateway. But without success, of course.


On Wed, Feb 17, 2016 at 10:13 PM, Василий Ангапов 
> wrote:
First, seems to me you should not delete pools .rgw.buckets and
.rgw.buckets.index because that's the pools where RGW stores buckets
actually.
But why did you do that?


2016-02-18 3:08 GMT+08:00 Alexandr Porunov 
>:
> When I try to create bucket:
> s3cmd mb s3://first-bucket
>
> I always get this error:
> ERROR: S3 error: 405 (MethodNotAllowed)
>
> /var/log/ceph/ceph-client.rgw.gateway.log :
> 2016-02-17 20:22:49.282715 7f86c50f3700  1 handle_sigterm
> 2016-02-17 20:22:49.282750 7f86c50f3700  1 handle_sigterm set alarm for 120
> 2016-02-17 20:22:49.282646 7f9478ff9700  1 handle_sigterm
> 2016-02-17 20:22:49.282689 7f9478ff9700  1 handle_sigterm set alarm for 120
> 2016-02-17 20:22:49.285830 7f949b842880 -1 shutting down
> 2016-02-17 20:22:49.285289 7f86f36c3880 -1 shutting down
> 2016-02-17 20:22:49.370173 7f86f36c3880  1 final shutdown
> 2016-02-17 20:22:49.467154 7f949b842880  1 final shutdown
> 2016-02-17 22:23:33.388956 7f4a94adf880  0 ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 889
> 2016-02-17 20:23:44.344574 7f4a94adf880  0 framework: civetweb
> 2016-02-17 20:23:44.344583 7f4a94adf880  0 framework conf key: port, val: 80
> 2016-02-17 20:23:44.344590 7f4a94adf880  0 starting handler: civetweb
> 2016-02-17 20:23:44.344630 7f4a94adf880  0 civetweb: 0x7f4a951c8b00:
> set_ports_option: cannot bind to 80: 13 (Permission denied)
> 2016-02-17 20:23:44.495510 7f4a65ffb700  0 ERROR: can't read user header:
> ret=-2
> 2016-02-17 20:23:44.495516 7f4a65ffb700  0 ERROR: sync_user() failed,
> user=alex ret=-2
> 2016-02-17 20:26:47.425354 7fb50132b880  0 ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 3149
> 2016-02-17 20:26:47.471472 7fb50132b880 -1 asok(0x7fb503e51340)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to '/var/run/ceph/ceph-client.rgw.gateway.asok':
> (17) File exists
> 2016-02-17 20:26:47.554305 7fb50132b880  0 framework: civetweb
> 2016-02-17 20:26:47.554319 7fb50132b880  0 framework conf key: port, val: 80
> 2016-02-17 20:26:47.554328 7fb50132b880  0 starting handler: civetweb
> 2016-02-17 20:26:47.576110 7fb4d2ffd700  0 ERROR: can't read user header:
> ret=-2
> 2016-02-17 20:26:47.576119 7fb4d2ffd700  0 ERROR: sync_user() failed,
> user=alex ret=-2
> 2016-02-17 20:27:03.504131 7fb49d7a2700  1 == starting new request
> req=0x7fb4e40008c0 =
> 2016-02-17 20:27:03.522989 7fb49d7a2700  1 == req done
> req=0x7fb4e40008c0 http_status=200 ==
> 2016-02-17 20:27:03.523023 7fb49d7a2700  1 civetweb: 0x7fb4e40022a0:
> 192.168.56.100 - - [17/Feb/2016:20:27:03 +0200] "GET / HTTP/1.1" 200 0 - -
> 2016-02-17 20:27:08.796459 7fb49bf9f700  1 == starting new request
> req=0x7fb4ec0343a0 =
> 2016-02-17 20:27:08.796755 7fb49bf9f700  1 == req done
> req=0x7fb4ec0343a0 http_status=405 ==
> 2016-02-17 20:27:08.796807 7fb49bf9f700  1 civetweb: 0x7fb4ec0008c0:
> 192.168.56.100 - - [17/Feb/2016:20:27:08 +0200] "PUT / HTTP/1.1" 405 0 - -
> 2016-02-17 20:28:22.088508 7fb49e7a4700  1 == starting new request
> req=0x7fb503f1bfd0 =
> 2016-02-17 20:28:22.090993 7fb49e7a4700  1 == req done
> req=0x7fb503f1bfd0 http_status=200 ==
> 2016-02-17 20:28:22.091035 7fb49e7a4700  1 civetweb: 0x7fb503f2e9f0:
> 192.168.56.100 - - [17/Feb/2016:20:28:22 +0200] "GET / HTTP/1.1" 200 0 - -
> 2016-02-17 20:28:35.943110 7fb4a77b6700  1 == starting new request
> req=0x7fb4cc0047b0 =
> 2016-02-17 20:28:35.945233 7fb4a77b6700  1 == req done
> req=0x7fb4cc0047b0 http_status=200 ==
> 2016-02-17 20:28:35.945282 

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 09:19:39 - Nick Fisk wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Christian Balzer
> > Sent: 17 February 2016 02:41
> > To: ceph-users 
> > Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> > 
> > 
> > Hello,
> > 
> > On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:
> > 
> > > Nick, Tyler, many thanks for very helpful feedback!
> > > I spent many hours meditating on the following two links:
> > > http://www.supermicro.com/solutions/storage_ceph.cfm
> > > http://s3s.eu/cephshop
> > >
> > > 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> > > CPUs (even the fastest ones) be enough to handle Erasure Coding?
> > >
> > Depends.
> > Since you're doing sequential writes (and reads I assume as you're
> > dealing with videos), CPU usage is going to be a lot lower than with
> > random, small 4KB block I/Os.
> > So most likely, yes.
> 
> That was my initial thought, but reading that paper I linked, the 4MB
> tests are the ones that bring the CPU's to their knees. I think the
> erasure calculation is a large part of the overall CPU usage and more
> data with the larger IO's causes a significant increase in CPU
> requirements.
> 
This is clearly where my total lack of EC exposure and experience is
showing, but it certainly makes sense as well.

> Correct me if I'm wrong, but I recall Christian, that your cluster is a
> full SSD cluster? 
No, but we talked back when I was building our 2nd production cluster, and
while waiting for parts I did make a temporary all-SSD one by using all the
prospective journal SSDs.

And definitely maxed out on CPU long before the SSDs got busy when doing
4KB rados benches or similar.

OTOH that same machine only uses about 4 cores out of 16 when doing the
same thing in its current configuration with 8 HDDs and 4 journal SSDs.

> I think we touched on this before, that the GHz per
> OSD is probably more like 100MHz per IOP. In a spinning disk cluster,
> you effectively have a cap on the number of IOs you can serve before the
> disks max out. So the difference between large and small IO's is not
> that great. But on a SSD cluster there is no cap and so you just end up
> with more IO's, hence the higher CPU.
> 
Yes and that number is a good baseline (still).

My own rule of thumb is 1GHz or slightly less per OSD for pure HDD based
clusters and about 1.5GHz for ones with SSD journals. 
Round up for OS and (in my case frequently) MON usage.
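As a rough sanity check against the 29-OSD node config discussed below, with a
purely hypothetical CPU choice (my arithmetic, not a benchmark):

  29 OSDs x ~1.5 GHz (SSD-journal rule of thumb)   ~= 44 GHz aggregate
  2x 12-core 2.2 GHz CPUs                          ~= 53 GHz

so a dual-socket box in that class should be in the right ballpark, with some
headroom left for OS and MON duties.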

Of course for purely SSD based OSDs, throw the kitchen sink at it, if
your wallet allows for it.
 
Christian
> > 
> > > Also as Nick stated with 4-5 nodes I cannot use high-M "K+M"
> > > combinations. I've did some calculations and found that the most
> > > efficient and safe configuration is to use 10 nodes with 29*6TB SATA
> > > and 7*200GB S3700 for journals. Assuming 6+3 EC profile that will
> > > give me
> > > 1.16 PB of effective space. Also I prefer not to use precious NVMe
> > > drives. Don't see any reason to use them.
> > >
> > This is probably your best way forward, dense is nice and cost saving,
> > but comes with a lot of potential gotchas.
> > Dense and large clusters can work, dense and small not so much.
> > 
> > > But what about RAM? Can I go with 64GB per node with above config?
> > > I've seen OSDs are consuming not more than 1GB RAM for replicated
> > > pools (even 6TB ones). But what is the typical memory usage of EC
> > > pools? Does anybody know that?
> > >
> > Above config (29 OSDs) that would be just about right.
> > I always go with at least 2GB RAM per OSD, since during a full node
> > restart and the consecutive peering OSDs will grow large, a LOT larger
> > than their usual steady state size.
> > RAM isn't that expensive these days and additional RAM comes in very
> > handy when used for pagecache and SLAB (dentry) stuff.
> > 
> > Something else to think about in your specific use case is to have
> > RAID'ed OSDs.
> > It's a bit of zero sum game probably, but compare the above config
> > with this. 11 nodes, each with:
> > 34 6TB SATAs (2x 17HDDs RAID6)
> > 2 200GB S3700 SSDs (journal/OS)
> > Just 2 OSDs per node.
> > Ceph with replication of 2.
> > Just shy of 1PB of effective space.
> > 
> > Minus: More physical space, less efficient HDD usage (replication vs.
> > EC).
> > 
> > Plus: A lot less expensive SSDs, less CPU and RAM requirements, smaller
> > impact in case of node failure/maintenance.
> > 
> > No ideas about the stuff below.
> > 
> > Christian
> > > Also, am I right that for 6+3 EC profile i need at least 10 nodes to
> > > feel comfortable (one extra node for redundancy)?
> > >
> > > And finally can someone recommend what EC plugin to use in my case? I
> > > know it's a difficult question but anyway?
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > 2016-02-16 16:12 GMT+08:00 Nick Fisk :

Re: [ceph-users] Performance Testing of CEPH on ARM MicroServer

2016-02-17 Thread Christian Balzer


Hello,

On Wed, 17 Feb 2016 21:47:31 +0530 Swapnil Jain wrote:

> Thanks Christian,
> 
> 
> 
> > On 17-Feb-2016, at 7:25 AM, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > On Mon, 15 Feb 2016 21:10:33 +0530 Swapnil Jain wrote:
> > 
> >> For most of you CEPH on ARMv7 might not sound good. This is our setup
> >> and our FIO testing report.  I am not able to understand ….
> >> 
> > Just one OSD per Microserver as in your case should be fine.
> > As always, use atop (or similar) on your storage servers when running
> > these tests to see where your bottlenecks are (HDD/network/CPU).
> > 
> >> 1) Are these results good or bad
> >> 2) Write is much better than read, where as read should be better.
> >> 
> > Your testing is flawed, more below.
> > 
> >> Hardware:
> >> 
> >> 8 x ARMv7 MicroServer with 4 x 10G Uplink
> >> 
> >> Each MicroServer with:
> >> 2GB RAM
> > Barely OK for one OSD, not enough if you run MONs as well on it (as you
> > do).
> > 
> >> Dual Core 1.6 GHz processor
> >> 2 x 2.5 Gbps Ethernet (1 for Public / 1 for Cluster Network)
> >> 1 x 3TB SATA HDD
> >> 1 x 128GB MSata Flash
> > Exact model/maker please.
> 
> Its Seagate ST3000NC000 & Phison Msata
>
There's quite a large number of Phison MSata drive models available it
seems.
And their specifications don't mention endurance, DWPD or TBW...

Anyway, you will want to look into this to see if they are a good match
for Ceph journals:
http://www.sebastien-han.fr/blog/2014/10/10/ceph-how-to-test-if-your-ssd-is-suitable-as-a-journal-device/

 
> 
> > 
> >> 
> >> Software:
> >> Debian 8.3 32bit
> >> ceph version 9.2.0-25-gf480cea
> >> 
> >> Setup:
> >> 
> >> 3 MON (Shared with 3 OSD)
> >> 8 OSD
> >> Data on 3TB SATA with XFS
> >> Journal on 128GB MSata Flash
> >> 
> >> pool with replica 1
> > Not a very realistic test of course.
> > For a production, fault resilient cluster you would have to divide your
> > results by 3 (at least).
> > 
> >> 500GB image with 4M object size
> >> 
> >> FIO command: fio --name=unit1 --filename=/dev/rbd1 --bs=4k
> >> --runtime=300 --readwrite=write
> >> 
> > 
> > If that is your base FIO command line, I'm assuming you mounted that
> > image on the client via the kernel RBD module?
> 
> Yes, its via kernel RBD module
> 
> 
> > 
> > Either way, the main reason you're seeing writes being faster than
> > reads is that with this command line (no direct=1 flag) fio will use
> > the page cache on your client host for writes, speeding things up
> > dramatically. To get a realistic idea of your clusters ability, use
> > direct=1 and also look into rados bench.
> > 
> > Another reason for the slow reads is that Ceph (RBD) does badly with
> > regards to read-ahead, setting /sys/block/rbd1/queue/read_ahead_kb to
> > something like 2048 should improve things.
> > 
> > That all being said, your read values look awfully low.
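For reference, those two suggestions translate to roughly the following on the
client (a sketch, assuming the image is still mapped as /dev/rbd1):

echo 2048 > /sys/block/rbd1/queue/read_ahead_kb    # raise read-ahead before read tests
fio --name=unit1 --filename=/dev/rbd1 --bs=4k --direct=1 --runtime=300 --readwrite=write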
> 
> Thanks again for the suggestion. Below are some results using rados
> bench, here read looks much better than write. Still is it good or can
> be better? 

rados bench with default setting operates on 4MB blocks, which matches
Ceph objects. 
Meaning it is optimized for giving the best performance figures in terms
of throughput. 
In real-life situations you're likely to be more interested in IOPS than
MB/s.
If you run it with "-b 4096" (aka 4KB blocks) you're likely to see with
atop that your CPUs are getting much MUCH more of a workout.
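Something like the following would do it — a sketch assuming a test pool named
"rbd" (substitute whatever pool you actually have); the write pass keeps its
objects so the read pass has something to fetch:

rados bench -p rbd 60 write -b 4096 --no-cleanup
rados bench -p rbd 60 rand
# remember to remove the benchmark objects afterwards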

> I also checked atop, couldn't see any bottleneck except that
> that sda disk was busy 80-90% of time during the test.
> 
Well, if that is true (on average) for all your nodes, then you found the
bottleneck. 
Also, which one is "sda", the HDD or the SSD?

> 
> WRITE Throughput (MB/sec): 297.544
> WRITE Average Latency:     0.21499
> 
> READ Throughput (MB/sec):  478.026
> READ Average Latency:      0.133818
> 
These are pretty good numbers for this setup indeed. 
But again, with a replication size of 1 they're not representative of
reality at all.

Regards,

Christian
> —
> Swapnil
> 
> > 
> > Christian
> >> Client:
> >> 
> >> Ubuntu on Intel 24core/16GB RAM 10G Ethernet
> >> 
> >> Result for different tests
> >> 
> >> 128k-randread.txt:  read : io=2587.4MB, bw=8830.2KB/s, iops=68, runt=300020msec
> >> 128k-randwrite.txt: write: io=48549MB, bw=165709KB/s, iops=1294, runt=35msec
> >> 128k-read.txt:      read : io=26484MB, bw=90397KB/s, iops=706, runt=32msec
> >> 128k-write.txt:     write: io=89538MB, bw=305618KB/s, iops=2387, runt=34msec
> >> 16k-randread.txt:   read : io=383760KB, bw=1279.2KB/s, iops=79, runt=31msec
> >> 16k-randwrite.txt:  write: io=8720.7MB, bw=29764KB/s, iops=1860, runt=32msec
> >> 16k-read.txt:       read : io=27444MB, bw=93676KB/s, iops=5854, runt=31msec
> >> 16k-write.txt:      write: io=87811MB, bw=299726KB/s, iops=18732, runt=31msec
> >> 1M-randread.txt:    read : io=10439MB, bw=35631KB/s, iops=34, runt=38msec
> >> 1M-randwrite.txt:   write: io=98943MB, bw=337721KB/s, iops=329, 

Re: [ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Gregory Farnum
You probably don't want to try and replace the dead OSD with a new one
until stuff is otherwise recovered. Just import the PG into any osd in the
cluster and it should serve the data up for proper recovery (and then
delete it when done).
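A sketch of how that could look with ceph-objectstore-tool, assuming the
crashed OSD is the one reported as down (osd.104 here), default FileStore
paths, and osd.NN being any healthy OSD with enough free space (both daemons
stopped while the tool runs):

# on the crashed OSD's host
ceph-objectstore-tool --op export --pgid 3.5a9 \
  --data-path /var/lib/ceph/osd/ceph-104/ \
  --journal-path /var/lib/ceph/osd/ceph-104/journal --file 3.5a9.export

# on the healthy OSD's host
ceph-objectstore-tool --op import \
  --data-path /var/lib/ceph/osd/ceph-NN/ \
  --journal-path /var/lib/ceph/osd/ceph-NN/journal --file 3.5a9.export
# start osd.NN again, let recovery pick the objects up, then remove the
# imported copy (ceph-objectstore-tool --op remove --pgid 3.5a9 ...)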

I've never done this or worked on the tooling though, so that's about the
extent of my knowledge.
-Greg

On Wednesday, February 17, 2016, Kostis Fardelas 
wrote:

> Right now the PG is served by two other OSDs and fresh data is written
> to them. Is it safe to export the stale pg contents from the crashed
> OSD and try to just import them again back to the cluster (the PG is
> not entirely lost, only some objects didn't make it).
>
> What could be the right sequence of commands in that case?
> a. ceph-objectstore-tool --op export --pgid 3.5a9 --data-path
> /var/lib/ceph/osd/ceph-xx/ --journal-path
> /var/lib/ceph/osd/ceph-xx/journal --file 3.5a9.export
> b. rm the crashed OSD, remove it from the crushmap and create a fresh new one
> with the same ID
> c. ceph-objectstore-tool --op import --data-path
> /var/lib/ceph/osd/ceph-xx/ --journal-path
> /var/lib/ceph/osd/ceph-xx/journal --file 3.5a9.export
> d. start the osd
>
> Regards,
> Kostis
>
>
> On 18 February 2016 at 02:54, Gregory Farnum  > wrote:
> > On Wed, Feb 17, 2016 at 4:44 PM, Kostis Fardelas  > wrote:
> >> Thanks Greg,
> >> I gather from reading about ceph_objectstore_tool that it acts at the
> >> level of the PG. The fact is that I do not want to wipe the whole PG,
> >> only export certain objects (the unfound ones) and import them again
> >> into the cluster. To be precise the pg with the unfound objects is
> >> mapped like this:
> >> osdmap e257960 pg 3.5a9 (3.5a9) -> up [86,30] acting [86]
> >>
> >> but by searching in the underlying filesystem of the crashed OSD, I can
> >> verify that it contains the 4MB unfound objects which I get with pg
> >> list_missing and cannot be found on every other probed OSD.
> >>
> >> Do you know if and how could I achieve this with ceph_objectstore_tool?
> >
> > You can't just pull out single objects. What you can do is export the
> > entire PG containing the objects, import it into a random OSD, and
> > then let the cluster recover from that OSD.
> > (Assuming all the data you need is there — just because you can see
> > the files on disk doesn't mean all the separate metadata is
> > available.)
> > -Greg
> >
> >>
> >> Regards,
> >> Kostis
> >>
> >>
> >> On 18 February 2016 at 01:22, Gregory Farnum  > wrote:
> >>> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas  > wrote:
>  Hello cephers,
>  due to an unfortunate sequence of events (disk crashes, network
>  problems), we are currently in a situation with one PG that reports
>  unfound objects. There is also an OSD which cannot start-up and
>  crashes with the following:
> 
>  2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
>  function 'virtual int FileStore::read(coll_t, const ghobject_t&,
>  uint64_t, size_t, ceph::bufferlist&, bool)
>  ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
>  os/FileStore.cc: 2650: FAILED assert(allow_eio ||
>  !m_filestore_fail_eio || got != -5)
> 
>  (There is probably a problem with the OSD's underlying disk storage)
> 
>  By querying the PG that is stuck in
>  active+recovering+degraded+remapped state due to the unfound objects,
>  I understand that all possible OSDs are probed except for the crashed
>  one:
> 
>  "might_have_unfound": [
>    { "osd": "30",
> "status": "already probed"},
>    { "osd": "102",
> "status": "already probed"},
>    { "osd": "104",
> "status": "osd is down"},
>    { "osd": "105",
> "status": "already probed"},
>    { "osd": "145",
>  "status": "already probed"}],
> 
>  so I understand that the crashed OSD may have the latest version of
>  the objects. I can also verify that I I can find the 4MB objects in
>  the underlying filesystem of the crashed OSD.
> 
>  By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
>  information like this:
> 
>  { "oid": { "oid":
>  "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
>    "key": "",
>    "snapid": -2,
>    "hash": 3880052137,
>    "max": 0,
>    "pool": 3,
>    "namespace": ""},
>    "need": "255658'37078125",
>    "have": "255651'37077081",
>    "locations": []}
> 
> 
>  My question is what is the best solution that I should follow?
>  a. Is there any way to export the objects from the crashed OSD's
>  filesystem and reimport it to the cluster? How could that be done?
> >>>
> 

Re: [ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Kostis Fardelas
Right now the PG is served by two other OSDs and fresh data is written
to them. Is it safe to export the stale pg contents from the crashed
OSD and try to just import them back into the cluster (the PG is
not entirely lost, only some objects didn't make it)?

What could be the right sequence of commands in that case?
a. ceph-objectstore-tool --op export --pgid 3.5a9 --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 3.5a9.export
b. rm the crashed OSD, remove it from the crushmap and create a fresh new one
with the same ID
c. ceph-objectstore-tool --op import --data-path
/var/lib/ceph/osd/ceph-xx/ --journal-path
/var/lib/ceph/osd/ceph-xx/journal --file 3.5a9.export
d. start the osd

Regards,
Kostis


On 18 February 2016 at 02:54, Gregory Farnum  wrote:
> On Wed, Feb 17, 2016 at 4:44 PM, Kostis Fardelas  wrote:
>> Thanks Greg,
>> I gather from reading about ceph_objectstore_tool that it acts at the
>> level of the PG. The fact is that I do not want to wipe the whole PG,
>> only export certain objects (the unfound ones) and import them again
>> into the cluster. To be precise the pg with the unfound objects is
>> mapped like this:
>> osdmap e257960 pg 3.5a9 (3.5a9) -> up [86,30] acting [86]
>>
>> but by searching in the underlying filesystem of the crashed OSD, I can
>> verify that it contains the 4MB unfound objects which I get with pg
>> list_missing and cannot be found on every other probed OSD.
>>
>> Do you know if and how could I achieve this with ceph_objectstore_tool?
>
> You can't just pull out single objects. What you can do is export the
> entire PG containing the objects, import it into a random OSD, and
> then let the cluster recover from that OSD.
> (Assuming all the data you need is there — just because you can see
> the files on disk doesn't mean all the separate metadata is
> available.)
> -Greg
>
>>
>> Regards,
>> Kostis
>>
>>
>> On 18 February 2016 at 01:22, Gregory Farnum  wrote:
>>> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas  
>>> wrote:
 Hello cephers,
 due to an unfortunate sequence of events (disk crashes, network
 problems), we are currently in a situation with one PG that reports
 unfound objects. There is also an OSD which cannot start-up and
 crashes with the following:

 2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
 function 'virtual int FileStore::read(coll_t, const ghobject_t&,
 uint64_t, size_t, ceph::bufferlist&, bool)
 ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
 os/FileStore.cc: 2650: FAILED assert(allow_eio ||
 !m_filestore_fail_eio || got != -5)

 (There is probably a problem with the OSD's underlying disk storage)

 By querying the PG that is stuck in
 active+recovering+degraded+remapped state due to the unfound objects,
 I understand that all possible OSDs are probed except for the crashed
 one:

 "might_have_unfound": [
   { "osd": "30",
"status": "already probed"},
   { "osd": "102",
"status": "already probed"},
   { "osd": "104",
"status": "osd is down"},
   { "osd": "105",
"status": "already probed"},
   { "osd": "145",
 "status": "already probed"}],

 so I understand that the crashed OSD may have the latest version of
 the objects. I can also verify that I can find the 4MB objects in
 the underlying filesystem of the crashed OSD.

 By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
 information like this:

 { "oid": { "oid":
 "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
   "key": "",
   "snapid": -2,
   "hash": 3880052137,
   "max": 0,
   "pool": 3,
   "namespace": ""},
   "need": "255658'37078125",
   "have": "255651'37077081",
   "locations": []}


 My question is what is the best solution that I should follow?
 a. Is there any way to export the objects from the crashed OSD's
 filesystem and reimport it to the cluster? How could that be done?
>>>
>>> Look at ceph_objecstore_tool. eg,
>>> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>>>
 b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
 expect that the "have" version of this object (thus an older version
 of the object) will become enabled?
>>>
>>> It should, although I gather this sometimes takes some contortions for
>>> reasons I've never worked out.
>>> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Gregory Farnum
On Wed, Feb 17, 2016 at 4:44 PM, Kostis Fardelas  wrote:
> Thanks Greg,
> I gather from reading about ceph_objectstore_tool that it acts at the
> level of the PG. The fact is that I do not want to wipe the whole PG,
> only export certain objects (the unfound ones) and import them again
> into the cluster. To be precise the pg with the unfound objects is
> mapped like this:
> osdmap e257960 pg 3.5a9 (3.5a9) -> up [86,30] acting [86]
>
> but by searching in the underlying filesystem of the crashed OSD, I can
> verify that it contains the 4MB unfound objects which I get with pg
> list_missing and cannot be found on every other probed OSD.
>
> Do you know if and how could I achieve this with ceph_objectstore_tool?

You can't just pull out single objects. What you can do is export the
entire PG containing the objects, import it into a random OSD, and
then let the cluster recover from that OSD.
(Assuming all the data you need is there — just because you can see
the files on disk doesn't mean all the separate metadata is
available.)
-Greg

>
> Regards,
> Kostis
>
>
> On 18 February 2016 at 01:22, Gregory Farnum  wrote:
>> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas  wrote:
>>> Hello cephers,
>>> due to an unfortunate sequence of events (disk crashes, network
>>> problems), we are currently in a situation with one PG that reports
>>> unfound objects. There is also an OSD which cannot start-up and
>>> crashes with the following:
>>>
>>> 2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
>>> function 'virtual int FileStore::read(coll_t, const ghobject_t&,
>>> uint64_t, size_t, ceph::bufferlist&, bool)
>>> ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
>>> os/FileStore.cc: 2650: FAILED assert(allow_eio ||
>>> !m_filestore_fail_eio || got != -5)
>>>
>>> (There is probably a problem with the OSD's underlying disk storage)
>>>
>>> By querying the PG that is stuck in
>>> active+recovering+degraded+remapped state due to the unfound objects,
>>> I understand that all possible OSDs are probed except for the crashed
>>> one:
>>>
>>> "might_have_unfound": [
>>>   { "osd": "30",
>>>"status": "already probed"},
>>>   { "osd": "102",
>>>"status": "already probed"},
>>>   { "osd": "104",
>>>"status": "osd is down"},
>>>   { "osd": "105",
>>>"status": "already probed"},
>>>   { "osd": "145",
>>> "status": "already probed"}],
>>>
>>> so I understand that the crashed OSD may have the latest version of
> >>> the objects. I can also verify that I can find the 4MB objects in
>>> the underlying filesystem of the crashed OSD.
>>>
>>> By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
>>> information like this:
>>>
>>> { "oid": { "oid":
>>> "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
>>>   "key": "",
>>>   "snapid": -2,
>>>   "hash": 3880052137,
>>>   "max": 0,
>>>   "pool": 3,
>>>   "namespace": ""},
>>>   "need": "255658'37078125",
>>>   "have": "255651'37077081",
>>>   "locations": []}
>>>
>>>
>>> My question is what is the best solution that I should follow?
>>> a. Is there any way to export the objects from the crashed OSD's
>>> filesystem and reimport it to the cluster? How could that be done?
>>
>> Look at ceph_objecstore_tool. eg,
>> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>>
>>> b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
>>> expect that the "have" version of this object (thus an older version
>>> of the object) will become enabled?
>>
>> It should, although I gather this sometimes takes some contortions for
>> reasons I've never worked out.
>> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Kostis Fardelas
Thanks Greg,
I gather from reading about ceph_objectstore_tool that it acts at the
level of the PG. The fact is that I do not want to wipe the whole PG,
only export certain objects (the unfound ones) and import them again
into the cluster. To be precise the pg with the unfound objects is
mapped like this:
osdmap e257960 pg 3.5a9 (3.5a9) -> up [86,30] acting [86]

but by searching in the underlying filesystem of the crashed OSD, I can
verify that it contains the 4MB unfound objects which I get with pg
list_missing and cannot be found on every other probed OSD.

Do you know if and how could I achieve this with ceph_objectstore_tool?

Regards,
Kostis


On 18 February 2016 at 01:22, Gregory Farnum  wrote:
> On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas  wrote:
>> Hello cephers,
>> due to an unfortunate sequence of events (disk crashes, network
>> problems), we are currently in a situation with one PG that reports
>> unfound objects. There is also an OSD which cannot start-up and
>> crashes with the following:
>>
>> 2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
>> function 'virtual int FileStore::read(coll_t, const ghobject_t&,
>> uint64_t, size_t, ceph::bufferlist&, bool)
>> ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
>> os/FileStore.cc: 2650: FAILED assert(allow_eio ||
>> !m_filestore_fail_eio || got != -5)
>>
>> (There is probably a problem with the OSD's underlying disk storage)
>>
>> By querying the PG that is stuck in
>> active+recovering+degraded+remapped state due to the unfound objects,
>> I understand that all possible OSDs are probed except for the crashed
>> one:
>>
>> "might_have_unfound": [
>>   { "osd": "30",
>>"status": "already probed"},
>>   { "osd": "102",
>>"status": "already probed"},
>>   { "osd": "104",
>>"status": "osd is down"},
>>   { "osd": "105",
>>"status": "already probed"},
>>   { "osd": "145",
>> "status": "already probed"}],
>>
>> so I understand that the crashed OSD may have the latest version of
>> the objects. I can also verify that I I can find the 4MB objects in
>> the underlying filesystem of the crashed OSD.
>>
>> By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
>> information like this:
>>
>> { "oid": { "oid":
>> "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
>>   "key": "",
>>   "snapid": -2,
>>   "hash": 3880052137,
>>   "max": 0,
>>   "pool": 3,
>>   "namespace": ""},
>>   "need": "255658'37078125",
>>   "have": "255651'37077081",
>>   "locations": []}
>>
>>
>> My question is what is the best solution that I should follow?
>> a. Is there any way to export the objects from the crashed OSD's
>> filesystem and reimport it to the cluster? How could that be done?
>
> Look at ceph_objecstore_tool. eg,
> http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool
>
>> b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
>> expect that the "have" version of this object (thus an older version
>> of the object) will become enabled?
>
> It should, although I gather this sometimes takes some contortions for
> reasons I've never worked out.
> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Stillwell, Bryan
Vlad,

First off your cluster is rather full (80.31%).  Hopefully you have
hardware ordered for an expansion in the near future.

Based on your 'ceph osd tree' output, it doesn't look like the
reweight-by-utilization did anything for you.  That last number for each
OSD is set to 1, which means it didn't reweight any of the OSDs.  This is
a different weight than the CRUSH weight, and something you can manually
modify as well.

For example you could manually tweak the weights of the fullest OSDs with:

ceph osd reweight osd.23 0.95
ceph osd reweight osd.7 0.95
ceph osd reweight osd.8 0.95

Then just keep tweaking those numbers until the cluster gets an even
distribution of PGs across the OSDs.  The reweight-by-utilization option
can help make this quicker.

Your volumes pool also doesn't have a power of two for pg_num, so your PGs
will have uneven sizes.  Since you can't go back down to 256 PGs, you
should look at gradually increasing them up to 512 PGs.
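A minimal sketch of that increase, assuming the pool is literally named
"volumes"; pgp_num has to follow pg_num, and stepping up in stages limits how
much data moves at once:

ceph osd pool set volumes pg_num 512
ceph osd pool set volumes pgp_num 512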

There are also inconsistent PGs that you should look at repairing.  It
won't help you with the data distribution, but it's good cluster
maintenance.
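For the repairs, something like the following, using the two PGs flagged in
your health detail below (note the caveat elsewhere in this digest that repair
trusts the primary copy, so verify the replicas first if in doubt):

ceph pg repair 5.7f
ceph pg repair 5.38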

Bryan

From:  ceph-users  on behalf of Vlad
Blando 
Date:  Wednesday, February 17, 2016 at 5:11 PM
To:  ceph-users 
Subject:  [ceph-users] How to properly deal with NEAR FULL OSD


>Hi This been bugging me for some time now, the distribution of data on
>the OSD is not balanced so some OSD are near full, i did ceph
> osd reweight-by-utilization but it not helping much.
>
>
>[root@controller-node ~]# ceph osd tree
># idweight  type name   up/down reweight
>-1  98.28   root default
>-2  32.76   host ceph-node-1
>0   3.64osd.0   up  1
>1   3.64osd.1   up  1
>2   3.64osd.2   up  1
>3   3.64osd.3   up  1
>4   3.64osd.4   up  1
>5   3.64osd.5   up  1
>6   3.64osd.6   up  1
>7   3.64osd.7   up  1
>8   3.64osd.8   up  1
>-3  32.76   host ceph-node-2
>9   3.64osd.9   up  1
>10  3.64osd.10  up  1
>11  3.64osd.11  up  1
>12  3.64osd.12  up  1
>13  3.64osd.13  up  1
>14  3.64osd.14  up  1
>15  3.64osd.15  up  1
>16  3.64osd.16  up  1
>17  3.64osd.17  up  1
>-4  32.76   host ceph-node-3
>18  3.64osd.18  up  1
>19  3.64osd.19  up  1
>20  3.64osd.20  up  1
>21  3.64osd.21  up  1
>22  3.64osd.22  up  1
>23  3.64osd.23  up  1
>24  3.64osd.24  up  1
>25  3.64osd.25  up  1
>26  3.64osd.26  up  1
>[root@controller-node ~]#
>
>
>[root@controller-node ~]# /opt/df-osd.sh
>ceph-node-1
>===
>/dev/sdb1  3.7T  2.0T  1.7T  54% /var/lib/ceph/osd/ceph-0
>/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-1
>/dev/sdd1  3.7T  3.3T  431G  89% /var/lib/ceph/osd/ceph-2
>/dev/sde1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-3
>/dev/sdf1  3.7T  3.3T  379G  90% /var/lib/ceph/osd/ceph-4
>/dev/sdg1  3.7T  2.9T  762G  80% /var/lib/ceph/osd/ceph-5
>/dev/sdh1  3.7T  3.0T  733G  81% /var/lib/ceph/osd/ceph-6
>/dev/sdi1  3.7T  3.4T  284G  93% /var/lib/ceph/osd/ceph-7
>/dev/sdj1  3.7T  3.4T  342G  91% /var/lib/ceph/osd/ceph-8
>===
>ceph-node-2
>===
>/dev/sdb1  3.7T  3.1T  622G  84% /var/lib/ceph/osd/ceph-9
>/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-10
>/dev/sdd1  3.7T  3.1T  557G  86% /var/lib/ceph/osd/ceph-11
>/dev/sde1  3.7T  3.3T  392G  90% /var/lib/ceph/osd/ceph-12
>/dev/sdf1  3.7T  2.6T  1.1T  72% /var/lib/ceph/osd/ceph-13
>/dev/sdg1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-14
>/dev/sdh1  3.7T  2.7T  984G  74% /var/lib/ceph/osd/ceph-15
>/dev/sdi1  3.7T  3.2T  463G  88% /var/lib/ceph/osd/ceph-16
>/dev/sdj1  3.7T  3.1T  594G  85% /var/lib/ceph/osd/ceph-17
>===
>ceph-node-3
>===
>/dev/sdb1  3.7T  

Re: [ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Jan Schermer
It would be helpful to see your crush map (there are also some tunables available
that help with this issue, if you're not running ancient versions).
That said, Ceph's distribution uniformity isn't that great, really.
It helps to increase the number of PGs, but beware that there's no turning back.

Other than that, play with reweights (and possibly crush weights) regularly - 
that's what we do...
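If you want to post it, a sketch of how to dump the map and the tunables:

ceph osd getcrushmap -o crushmap.bin
crushtool -d crushmap.bin -o crushmap.txt
ceph osd crush show-tunables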

Jan


> On 18 Feb 2016, at 01:11, Vlad Blando  wrote:
> 
> Hi This been bugging me for some time now, the distribution of data on the 
> OSD is not balanced so some OSD are near full, i did ceph osd 
> reweight-by-utilization but it not helping much.
> 
> 
> [root@controller-node ~]# ceph osd tree
> # idweight  type name   up/down reweight
> -1  98.28   root default
> -2  32.76   host ceph-node-1
> 0   3.64osd.0   up  1
> 1   3.64osd.1   up  1
> 2   3.64osd.2   up  1
> 3   3.64osd.3   up  1
> 4   3.64osd.4   up  1
> 5   3.64osd.5   up  1
> 6   3.64osd.6   up  1
> 7   3.64osd.7   up  1
> 8   3.64osd.8   up  1
> -3  32.76   host ceph-node-2
> 9   3.64osd.9   up  1
> 10  3.64osd.10  up  1
> 11  3.64osd.11  up  1
> 12  3.64osd.12  up  1
> 13  3.64osd.13  up  1
> 14  3.64osd.14  up  1
> 15  3.64osd.15  up  1
> 16  3.64osd.16  up  1
> 17  3.64osd.17  up  1
> -4  32.76   host ceph-node-3
> 18  3.64osd.18  up  1
> 19  3.64osd.19  up  1
> 20  3.64osd.20  up  1
> 21  3.64osd.21  up  1
> 22  3.64osd.22  up  1
> 23  3.64osd.23  up  1
> 24  3.64osd.24  up  1
> 25  3.64osd.25  up  1
> 26  3.64osd.26  up  1
> [root@controller-node ~]#
> 
> 
> [root@controller-node ~]# /opt/df-osd.sh
> ceph-node-1
> ===
> /dev/sdb1  3.7T  2.0T  1.7T  54% /var/lib/ceph/osd/ceph-0
> /dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-1
> /dev/sdd1  3.7T  3.3T  431G  89% /var/lib/ceph/osd/ceph-2
> /dev/sde1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-3
> /dev/sdf1  3.7T  3.3T  379G  90% /var/lib/ceph/osd/ceph-4
> /dev/sdg1  3.7T  2.9T  762G  80% /var/lib/ceph/osd/ceph-5
> /dev/sdh1  3.7T  3.0T  733G  81% /var/lib/ceph/osd/ceph-6
> /dev/sdi1  3.7T  3.4T  284G  93% /var/lib/ceph/osd/ceph-7
> /dev/sdj1  3.7T  3.4T  342G  91% /var/lib/ceph/osd/ceph-8
> ===
> ceph-node-2
> ===
> /dev/sdb1  3.7T  3.1T  622G  84% /var/lib/ceph/osd/ceph-9
> /dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-10
> /dev/sdd1  3.7T  3.1T  557G  86% /var/lib/ceph/osd/ceph-11
> /dev/sde1  3.7T  3.3T  392G  90% /var/lib/ceph/osd/ceph-12
> /dev/sdf1  3.7T  2.6T  1.1T  72% /var/lib/ceph/osd/ceph-13
> /dev/sdg1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-14
> /dev/sdh1  3.7T  2.7T  984G  74% /var/lib/ceph/osd/ceph-15
> /dev/sdi1  3.7T  3.2T  463G  88% /var/lib/ceph/osd/ceph-16
> /dev/sdj1  3.7T  3.1T  594G  85% /var/lib/ceph/osd/ceph-17
> ===
> ceph-node-3
> ===
> /dev/sdb1  3.7T  2.8T  910G  76% /var/lib/ceph/osd/ceph-18
> /dev/sdc1  3.7T  2.7T 1012G  73% /var/lib/ceph/osd/ceph-19
> /dev/sdd1  3.7T  3.2T  537G  86% /var/lib/ceph/osd/ceph-20
> /dev/sde1  3.7T  3.2T  465G  88% /var/lib/ceph/osd/ceph-21
> /dev/sdf1  3.7T  3.0T  663G  83% /var/lib/ceph/osd/ceph-22
> /dev/sdg1  3.7T  3.4T  248G  94% /var/lib/ceph/osd/ceph-23
> /dev/sdh1  3.7T  2.8T  928G  76% /var/lib/ceph/osd/ceph-24
> /dev/sdi1  3.7T  2.9T  802G  79% /var/lib/ceph/osd/ceph-25
> /dev/sdj1  3.7T  2.7T  1.1T  73% /var/lib/ceph/osd/ceph-26
> ===
> [root@controller-node ~]#
> 
> 
> [root@controller-node ~]# ceph health detail
> HEALTH_ERR 2 pgs inconsistent; 10 near full osd(s); 2 scrub 

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Hmm, it's possible there aren't any safeguards against filling the whole drive
when increasing PGs; actually I think ceph only cares about free space when
backfilling, which is not what happened (at least directly) in your case.
However, having a completely full OSD filesystem is not going to end well - 
better trash the OSD if it crashes because of it.
Be aware that whenever ceph starts backfilling it temporarily needs more space, 
and sometimes it shuffles more data than you'd expect. What can happen is that 
while OSD1 is trying to get rid of data, it simultaneously gets filled with 
data from another OSD (because crush-magic happens) and if that eats the last 
bits of space it's going to go FUBAR. 

You can set "nobackfill" on the cluster, that will prevent ceph from shuffling 
anything temporarily (set that before you restart the OSDs), but I wonder if 
it's too late - that 20MB free in the df output scares me. 
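For reference, the flag is set and cleared with:

ceph osd set nobackfill      # before restarting the full OSDs
ceph osd unset nobackfill    # once there is room to move data again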

The safest way would probably be to trash osd.5 and osd.4 in your case, create 
two new OSDs in their place and backfill them again (with lower reweight). It's 
up to you whether you can afford the IO it will cause.
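A sketch of the remove-and-recreate sequence for one of them, assuming osd.4
and a hypothetical data device /dev/sdX (the usual manual procedure — adjust to
your deployment tooling):

ceph osd out 4
# stop the ceph-osd daemon, then remove it from the cluster:
ceph osd crush remove osd.4
ceph auth del osd.4
ceph osd rm 4
# wipe and re-prepare the disk (it will normally come back with the freed id):
ceph-disk prepare /dev/sdX
# once the new OSD is up, keep its reweight low at first, as suggested:
ceph osd reweight 4 0.5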

Which OSDs actually crashed? 4 and 5? Too late to save them methinks...

Jan



> On 17 Feb 2016, at 23:06, Lukáš Kubín  wrote:
> 
> You're right, the "full" osd was still up and in until I increased the pg 
> values of one of the pools. The redistribution has not completed yet and 
> perhaps that's what is still filling the drive. With this info - do you think 
> I'm still safe to follow the steps suggested in previous post?
> 
> Thanks!
> 
> Lukas
> 
> On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer  > wrote:
> Something must be on those 2 OSDs that ate all that space - ceph by default 
> doesn't allow OSD to get completely full (filesystem-wise) and from what 
> you've shown those filesystems are really really full.
> OSDs don't usually go down when "full" (95%) .. or do they? I don't think 
> so... so the reason they stopped is likely a completely full filesystem. You 
> have to move something out of the way, restart those OSDs with lower reweight 
> and hopefully everything will be good.
> 
> Jan
> 
> 
>> On 17 Feb 2016, at 22:22, Lukáš Kubín > > wrote:
>> 
>> Ahoj Jan, thanks for the quick hint!
>> 
>> Those 2 OSDs are currently full and down. How should I handle that? Is it ok 
>> that I delete some pg directories again and start the OSD daemons, on both 
>> drives in parallel. Then set the weights as recommended ?
>> 
>> What effect should I expect then - will the cluster attempt to move some pgs 
>> out of these drives to different local OSDs? I'm asking because when I've 
>> attempted to delete pg dirs and restart OSD for the first time, the OSD get 
>> full again very fast.
>> 
>> Thank you.
>> 
>> Lukas
>> 
>> 
>> 
>> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer > > wrote:
>> Ahoj ;-)
>> 
>> You can reweight them temporarily, that shifts the data from the full drives.
>> 
>> ceph osd reweight osd.XX YY
>> (XX = the number of full OSD, YY is "weight" which default to 1)
>> 
>> This is different from "crush reweight" which defaults to drive size in TB.
>> 
>> Beware that reweighting will (afaik) only shuffle the data to other local 
>> drives, so you should reweight both the full drives at the same time and 
>> only by little bit at a time (0.95 is a good starting point).
>> 
>> Jan
>> 
>>  
>> 
>>> On 17 Feb 2016, at 21:43, Lukáš Kubín >> > wrote:
>>> 
>> 
>>> Hi,
>>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
>>> pools, each of size=2. Today, one of our OSDs got full, another 2 near 
>>> full. Cluster turned into ERR state. I have noticed uneven space 
>>> distribution among OSD drives between 70 and 100 perce. I have realized 
>>> there's a low amount of pgs in those 2 pools (128 each) and increased one 
>>> of them to 512, expecting a magic to happen and redistribute the space 
>>> evenly. 
>>> 
>>> Well, something happened - another OSD became full during the 
>>> redistribution and cluster stopped both OSDs and marked them down. After 
>>> some hours the remaining drives partially rebalanced and cluster get to 
>>> WARN state. 
>>> 
>>> I've deleted 3 placement group directories from one of the full OSD's 
>>> filesystem which allowed me to start it up again. Soon, however this drive 
>>> became full again.
>>> 
>>> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no 
>>> drives to add. 
>>> 
>>> Is there a way how to get out of this situation without adding OSDs? I will 
>>> attempt to release some space, just waiting for colleague to identify RBD 
>>> volumes (openstack images and volumes) which can be deleted.
>>> 
>>> Thank you.
>>> 
>>> Lukas
>>> 
>>> 
>>> This is my cluster state now:
>>> 
>>> 

[ceph-users] How to properly deal with NEAR FULL OSD

2016-02-17 Thread Vlad Blando
Hi, this has been bugging me for some time now: the distribution of data on the
OSDs is not balanced, so some OSDs are near full. I did ceph osd
reweight-by-utilization but it is not helping much.


[root@controller-node ~]# ceph osd tree
# id    weight  type name           up/down reweight
-1      98.28   root default
-2      32.76       host ceph-node-1
0       3.64            osd.0       up      1
1       3.64            osd.1       up      1
2       3.64            osd.2       up      1
3       3.64            osd.3       up      1
4       3.64            osd.4       up      1
5       3.64            osd.5       up      1
6       3.64            osd.6       up      1
7       3.64            osd.7       up      1
8       3.64            osd.8       up      1
-3      32.76       host ceph-node-2
9       3.64            osd.9       up      1
10      3.64            osd.10      up      1
11      3.64            osd.11      up      1
12      3.64            osd.12      up      1
13      3.64            osd.13      up      1
14      3.64            osd.14      up      1
15      3.64            osd.15      up      1
16      3.64            osd.16      up      1
17      3.64            osd.17      up      1
-4      32.76       host ceph-node-3
18      3.64            osd.18      up      1
19      3.64            osd.19      up      1
20      3.64            osd.20      up      1
21      3.64            osd.21      up      1
22      3.64            osd.22      up      1
23      3.64            osd.23      up      1
24      3.64            osd.24      up      1
25      3.64            osd.25      up      1
26      3.64            osd.26      up      1
[root@controller-node ~]#


[root@controller-node ~]# /opt/df-osd.sh
ceph-node-1
===
/dev/sdb1  3.7T  2.0T  1.7T  54% /var/lib/ceph/osd/ceph-0
/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-1
/dev/sdd1  3.7T  3.3T  431G  89% /var/lib/ceph/osd/ceph-2
/dev/sde1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-3
/dev/sdf1  3.7T  3.3T  379G  90% /var/lib/ceph/osd/ceph-4
/dev/sdg1  3.7T  2.9T  762G  80% /var/lib/ceph/osd/ceph-5
/dev/sdh1  3.7T  3.0T  733G  81% /var/lib/ceph/osd/ceph-6
/dev/sdi1  3.7T  3.4T  284G  93% /var/lib/ceph/osd/ceph-7
/dev/sdj1  3.7T  3.4T  342G  91% /var/lib/ceph/osd/ceph-8
===
ceph-node-2
===
/dev/sdb1  3.7T  3.1T  622G  84% /var/lib/ceph/osd/ceph-9
/dev/sdc1  3.7T  2.7T  1.1T  72% /var/lib/ceph/osd/ceph-10
/dev/sdd1  3.7T  3.1T  557G  86% /var/lib/ceph/osd/ceph-11
/dev/sde1  3.7T  3.3T  392G  90% /var/lib/ceph/osd/ceph-12
/dev/sdf1  3.7T  2.6T  1.1T  72% /var/lib/ceph/osd/ceph-13
/dev/sdg1  3.7T  2.8T  879G  77% /var/lib/ceph/osd/ceph-14
/dev/sdh1  3.7T  2.7T  984G  74% /var/lib/ceph/osd/ceph-15
/dev/sdi1  3.7T  3.2T  463G  88% /var/lib/ceph/osd/ceph-16
/dev/sdj1  3.7T  3.1T  594G  85% /var/lib/ceph/osd/ceph-17
===
ceph-node-3
===
/dev/sdb1  3.7T  2.8T  910G  76% /var/lib/ceph/osd/ceph-18
/dev/sdc1  3.7T  2.7T 1012G  73% /var/lib/ceph/osd/ceph-19
/dev/sdd1  3.7T  3.2T  537G  86% /var/lib/ceph/osd/ceph-20
/dev/sde1  3.7T  3.2T  465G  88% /var/lib/ceph/osd/ceph-21
/dev/sdf1  3.7T  3.0T  663G  83% /var/lib/ceph/osd/ceph-22
/dev/sdg1  3.7T  3.4T  248G  94% /var/lib/ceph/osd/ceph-23
/dev/sdh1  3.7T  2.8T  928G  76% /var/lib/ceph/osd/ceph-24
/dev/sdi1  3.7T  2.9T  802G  79% /var/lib/ceph/osd/ceph-25
/dev/sdj1  3.7T  2.7T  1.1T  73% /var/lib/ceph/osd/ceph-26
===
[root@controller-node ~]#


[root@controller-node ~]# ceph health detail
HEALTH_ERR 2 pgs inconsistent; 10 near full osd(s); 2 scrub errors
pg 5.7f is active+clean+inconsistent, acting [2,12,18]
pg 5.38 is active+clean+inconsistent, acting [7,9,24]
osd.2 is near full at 88%
osd.4 is near full at 89%
osd.7 is near full at 92%
osd.8 is near full at 90%
osd.11 is near full at 85%
osd.12 is near full at 89%
osd.16 is near full at 87%
osd.20 is near full at 85%
osd.21 is near full at 87%
osd.23 is near full at 93%
2 scrub errors
[root@controller-node ~]#

[root@controller-node ~]# ceph df
GLOBAL:
    SIZE      AVAIL    RAW USED   %RAW USED
    100553G   19796G   80757G     80.31
POOLS:
    NAME     ID   USED   %USED   OBJECTS
    images   4 

Re: [ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Gregory Farnum
On Wed, Feb 17, 2016 at 3:05 PM, Kostis Fardelas  wrote:
> Hello cephers,
> due to an unfortunate sequence of events (disk crashes, network
> problems), we are currently in a situation with one PG that reports
> unfound objects. There is also an OSD which cannot start-up and
> crashes with the following:
>
> 2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
> function 'virtual int FileStore::read(coll_t, const ghobject_t&,
> uint64_t, size_t, ceph::bufferlist&, bool)
> ' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
> os/FileStore.cc: 2650: FAILED assert(allow_eio ||
> !m_filestore_fail_eio || got != -5)
>
> (There is probably a problem with the OSD's underlying disk storage)
>
> By querying the PG that is stuck in
> active+recovering+degraded+remapped state due to the unfound objects,
> I understand that all possible OSDs are probed except for the crashed
> one:
>
> "might_have_unfound": [
>   { "osd": "30",
>"status": "already probed"},
>   { "osd": "102",
>"status": "already probed"},
>   { "osd": "104",
>"status": "osd is down"},
>   { "osd": "105",
>"status": "already probed"},
>   { "osd": "145",
> "status": "already probed"}],
>
> so I understand that the crashed OSD may have the latest version of
> the objects. I can also verify that I can find the 4MB objects in
> the underlying filesystem of the crashed OSD.
>
> By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
> information like this:
>
> { "oid": { "oid":
> "829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
>   "key": "",
>   "snapid": -2,
>   "hash": 3880052137,
>   "max": 0,
>   "pool": 3,
>   "namespace": ""},
>   "need": "255658'37078125",
>   "have": "255651'37077081",
>   "locations": []}
>
>
> My question is what is the best solution that I should follow?
> a. Is there any way to export the objects from the crashed OSD's
> filesystem and reimport it to the cluster? How could that be done?

Look at ceph_objecstore_tool. eg,
http://ceph-users.ceph.narkive.com/lwDkR2fZ/recovering-incomplete-pgs-with-ceph-objectstore-tool

> b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
> expect that the "have" version of this object (thus an older version
> of the object) will become enabled?

It should, although I gather this sometimes takes some contortions for
reasons I've never worked out.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Recover unfound objects from crashed OSD's underlying filesystem

2016-02-17 Thread Kostis Fardelas
Hello cephers,
due to an unfortunate sequence of events (disk crashes, network
problems), we are currently in a situation with one PG that reports
unfound objects. There is also an OSD which cannot start-up and
crashes with the following:

2016-02-17 18:40:01.919546 7fecb0692700 -1 os/FileStore.cc: In
function 'virtual int FileStore::read(coll_t, const ghobject_t&,
uint64_t, size_t, ceph::bufferlist&, bool)
' thread 7fecb0692700 time 2016-02-17 18:40:01.889980
os/FileStore.cc: 2650: FAILED assert(allow_eio ||
!m_filestore_fail_eio || got != -5)

(There is probably a problem with the OSD's underlying disk storage)

By querying the PG that is stuck in
active+recovering+degraded+remapped state due to the unfound objects,
I understand that all possible OSDs are probed except for the crashed
one:

"might_have_unfound": [
  { "osd": "30",
   "status": "already probed"},
  { "osd": "102",
   "status": "already probed"},
  { "osd": "104",
   "status": "osd is down"},
  { "osd": "105",
   "status": "already probed"},
  { "osd": "145",
"status": "already probed"}],

so I understand that the crashed OSD may have the latest version of
the objects. I can also verify that I can find the 4MB objects in
the underlying filesystem of the crashed OSD.

By issuing ceph pg 3.5a9 list_missing, I get for all unfound objects,
information like this:

{ "oid": { "oid":
"829d5be29cd6e231e7e951ba58ad3d0baf7fba65aad40083cef39bb03d5ec0fd",
  "key": "",
  "snapid": -2,
  "hash": 3880052137,
  "max": 0,
  "pool": 3,
  "namespace": ""},
  "need": "255658'37078125",
  "have": "255651'37077081",
  "locations": []}


My question is what is the best solution that I should follow?
a. Is there any way to export the objects from the crashed OSD's
filesystem and reimport it to the cluster? How could that be done?
b. If I issue ceph pg {pg-id} mark_unfound_lost revert, should I
expect that the "have" version of this object (thus an older version
of the object) will become enabled?

Best regards,
Kostis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
You're right, the "full" osd was still up and in until I increased the pg
values of one of the pools. The redistribution has not completed yet and
perhaps that's what is still filling the drive. With this info - do you
think I'm still safe to follow the steps suggested in previous post?

Thanks!

Lukas

On Wed, Feb 17, 2016 at 10:29 PM Jan Schermer  wrote:

> Something must be on those 2 OSDs that ate all that space - ceph by
> default doesn't allow OSD to get completely full (filesystem-wise) and from
> what you've shown those filesystems are really really full.
> OSDs don't usually go down when "full" (95%) .. or do they? I don't think
> so... so the reason they stopped is likely a completely full filesystem.
> You have to move something out of the way, restart those OSDs with lower
> reweight and hopefully everything will be good.
>
> Jan
>
>
> On 17 Feb 2016, at 22:22, Lukáš Kubín  wrote:
>
> Ahoj Jan, thanks for the quick hint!
>
> Those 2 OSDs are currently full and down. How should I handle that? Is it
> ok that I delete some pg directories again and start the OSD daemons, on
> both drives in parallel. Then set the weights as recommended ?
>
> What effect should I expect then - will the cluster attempt to move some
> pgs out of these drives to different local OSDs? I'm asking because when
> I've attempted to delete pg dirs and restart OSD for the first time, the
> OSD get full again very fast.
>
> Thank you.
>
> Lukas
>
>
>
> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer  wrote:
>
>> Ahoj ;-)
>>
>> You can reweight them temporarily, that shifts the data from the full
>> drives.
>>
>> ceph osd reweight osd.XX YY
>> (XX = the number of full OSD, YY is "weight" which default to 1)
>>
>> This is different from "crush reweight" which defaults to drive size in
>> TB.
>>
>> Beware that reweighting will (afaik) only shuffle the data to other local
>> drives, so you should reweight both the full drives at the same time and
>> only by little bit at a time (0.95 is a good starting point).
>>
>> Jan
>>
>>
>>
>> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
>>
>> Hi,
>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
>> pools, each of size=2. Today, one of our OSDs got full, another 2 near
>> full. Cluster turned into ERR state. I have noticed uneven space
>> distribution among OSD drives between 70 and 100 percent. I have realized
>> there's a low amount of pgs in those 2 pools (128 each) and increased one
>> of them to 512, expecting a magic to happen and redistribute the space
>> evenly.
>>
>> Well, something happened - another OSD became full during the
>> redistribution and cluster stopped both OSDs and marked them down. After
>> some hours the remaining drives partially rebalanced and cluster get to
>> WARN state.
>>
>> I've deleted 3 placement group directories from one of the full OSD's
>> filesystem which allowed me to start it up again. Soon, however this drive
>> became full again.
>>
>> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
>> drives to add.
>>
>> Is there a way how to get out of this situation without adding OSDs? I
>> will attempt to release some space, just waiting for colleague to identify
>> RBD volumes (openstack images and volumes) which can be deleted.
>>
>> Thank you.
>>
>> Lukas
>>
>>
>> This is my cluster state now:
>>
>> [root@compute1 ~]# ceph -w
>> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>>  health HEALTH_WARN
>> 10 pgs backfill_toofull
>> 114 pgs degraded
>> 114 pgs stuck degraded
>> 147 pgs stuck unclean
>> 114 pgs stuck undersized
>> 114 pgs undersized
>> 1 requests are blocked > 32 sec
>> recovery 56923/640724 objects degraded (8.884%)
>> recovery 29122/640724 objects misplaced (4.545%)
>> 3 near full osd(s)
>>  monmap e3: 3 mons at {compute1=
>> 10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>> }
>> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
>> 4365 GB used, 890 GB / 5256 GB avail
>> 56923/640724 objects degraded (8.884%)
>> 29122/640724 objects misplaced (4.545%)
>>  493 active+clean
>>  108 active+undersized+degraded
>>   29 active+remapped
>>6 active+undersized+degraded+remapped+backfill_toofull
>>4 active+remapped+backfill_toofull
>>
>> [root@ceph1 ~]# df|grep osd
>> /dev/sdg1   580496384 500066812  80429572  87%
>> /var/lib/ceph/osd/ceph-3
>> /dev/sdf1   580496384 502131428  78364956  87%
>> /var/lib/ceph/osd/ceph-2
>> /dev/sde1   580496384 506927100  73569284  

Re: [ceph-users] pg repair behavior? (Was: Re: getting rid of misplaced objects)

2016-02-17 Thread George Mihaiescu
We have three replicas, so we just performed md5sum on all of them in order
to find the correct ones, then we deleted the bad file and ran pg repair.
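For anyone wanting to replicate that, a rough sketch assuming default FileStore
paths; run it on each host holding a replica of the inconsistent PG, with the
pg id and acting set taken from "ceph health detail", and the placeholders
filled in for your cluster:

find /var/lib/ceph/osd/ceph-<id>/current/<pgid>_head/ -name '*<object-name>*' -exec md5sum {} \;
# compare the checksums across the replicas, delete the odd one out, then:
ceph pg repair <pgid>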
On 15 Feb 2016 10:42 a.m., "Zoltan Arnold Nagy" 
wrote:

> Hi Bryan,
>
> You were right: we’ve modified our PG weights a little (from 1 to around
> 0.85 on some OSDs) and once I’ve changed them back to 1, the remapped PGs
> and misplaced objects were gone.
> So thank you for the tip.
>
> For the inconsistent ones and scrub errors, I’m a little wary of using pg
> repair as that - if I understand correctly - only copies the primary PG’s
> data to the other replicas and thus can easily corrupt the whole object if the
> primary is corrupted.
>
> I haven’t seen an update on this since last May, when this was brought up
> as a concern by several people and there were mentions of adding
> checksumming to the metadata and doing a checksum comparison on repair.
>
> Can anybody give an update on the current status of how exactly pg repair
> works in Hammer or will work in Jewel?
>
> > On 11 Feb 2016, at 22:17, Stillwell, Bryan 
> wrote:
> >
> > What does 'ceph osd tree' look like for this cluster?  Also have you done
> > anything special to your CRUSH rules?
> >
> > I've usually found this to be caused by modifying OSD weights a little
> too
> > much.
> >
> > As for the inconsistent PG, you should be able to run 'ceph pg repair' on
> > it:
> >
> >
> http://docs.ceph.com/docs/master/rados/troubleshooting/troubleshooting-pg/#
> > pgs-inconsistent
> >
> >
> > Bryan
> >
> > On 2/11/16, 11:21 AM, "ceph-users on behalf of Zoltan Arnold Nagy"
> >  zol...@linux.vnet.ibm.com>
> > wrote:
> >
> >> Hi,
> >>
> >> Are there any tips and tricks around getting rid of misplaced objects? I
>> did check the archive but didn't find anything.
> >>
> >> Right now my cluster looks like this:
> >>
> >> pgmap v43288593: 16384 pgs, 4 pools, 45439 GB data, 10383 kobjects
> >>   109 TB used, 349 TB / 458 TB avail
> >>   330/25160461 objects degraded (0.001%)
> >>   31280/25160461 objects misplaced (0.124%)
> >>  16343 active+clean
> >> 40 active+remapped
> >>  1 active+clean+inconsistent
> >>
> >> This is how it has been for a while and I thought for sure that the
>> misplaced would converge down to 0, but nevertheless, it didn't.
> >>
> >> Any pointers on how I could get it back to all active+clean?
> >>
> >> Cheers,
> >> Zoltan
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Something must be on those 2 OSDs that ate all that space - Ceph by default 
doesn't allow an OSD to get completely full (filesystem-wise), and from what you've 
shown those filesystems are really, really full.
OSDs don't usually go down when "full" (95%) .. or do they? I don't think so... 
so the reason they stopped is most likely a completely full filesystem. You have to 
move something out of the way, restart those OSDs with a lower reweight and 
hopefully everything will be good.
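
For example, something along these lines (the osd ids match the two 100% full
drives in your df output, and the exact service commands depend on your init
system):

# free a few GB on each full filesystem first (move a PG directory out of the
# way rather than deleting it), then start the two full OSDs again:
systemctl start ceph-osd@4 ceph-osd@5
# and lower their reweight a little so data shifts off them:
ceph osd reweight osd.4 0.95
ceph osd reweight osd.5 0.95
ceph -w        # watch the backfill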

Jan


> On 17 Feb 2016, at 22:22, Lukáš Kubín  wrote:
> 
> Ahoj Jan, thanks for the quick hint!
> 
> Those 2 OSDs are currently full and down. How should I handle that? Is it ok 
> if I delete some pg directories again and start the OSD daemons on both 
> drives in parallel, then set the weights as recommended?
> 
> What effect should I expect then - will the cluster attempt to move some pgs 
> out of these drives to different local OSDs? I'm asking because when I've 
> attempted to delete pg dirs and restart the OSD for the first time, the OSD got 
> full again very fast.
> 
> Thank you.
> 
> Lukas
> 
> 
> 
> On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer  > wrote:
> Ahoj ;-)
> 
> You can reweight them temporarily, that shifts the data from the full drives.
> 
> ceph osd reweight osd.XX YY
> (XX = the number of the full OSD, YY is "weight", which defaults to 1)
> 
> This is different from "crush reweight" which defaults to drive size in TB.
> 
> Beware that reweighting will (afaik) only shuffle the data to other local 
> drives, so you should reweight both the full drives at the same time and only 
> by a little bit at a time (0.95 is a good starting point).
> 
> Jan
> 
>  
> 
>> On 17 Feb 2016, at 21:43, Lukáš Kubín > > wrote:
>> 
> 
>> Hi,
>> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
>> pools, each of size=2. Today, one of our OSDs got full, another 2 near full. 
>> Cluster turned into ERR state. I have noticed uneven space distribution 
>> among OSD drives between 70 and 100 percent. I have realized there's a low 
>> number of pgs in those 2 pools (128 each) and increased one of them to 512, 
>> expecting magic to happen and redistribute the space evenly. 
>> 
>> Well, something happened - another OSD became full during the redistribution 
>> and the cluster stopped both OSDs and marked them down. After some hours the 
>> remaining drives partially rebalanced and the cluster got to WARN state. 
>> 
>> I've deleted 3 placement group directories from one of the full OSD's 
>> filesystem which allowed me to start it up again. Soon, however this drive 
>> became full again.
>> 
>> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives 
>> to add. 
>> 
>> Is there a way to get out of this situation without adding OSDs? I will 
>> attempt to release some space, just waiting for a colleague to identify RBD 
>> volumes (openstack images and volumes) which can be deleted.
>> 
>> Thank you.
>> 
>> Lukas
>> 
>> 
>> This is my cluster state now:
>> 
>> [root@compute1 ~]# ceph -w
>> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>>  health HEALTH_WARN
>> 10 pgs backfill_toofull
>> 114 pgs degraded
>> 114 pgs stuck degraded
>> 147 pgs stuck unclean
>> 114 pgs stuck undersized
>> 114 pgs undersized
>> 1 requests are blocked > 32 sec
>> recovery 56923/640724 objects degraded (8.884%)
>> recovery 29122/640724 objects misplaced (4.545%)
>> 3 near full osd(s)
>>  monmap e3: 3 mons at 
>> {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>>  
>> }
>> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
>> 4365 GB used, 890 GB / 5256 GB avail
>> 56923/640724 objects degraded (8.884%)
>> 29122/640724 objects misplaced (4.545%)
>>  493 active+clean
>>  108 active+undersized+degraded
>>   29 active+remapped
>>6 active+undersized+degraded+remapped+backfill_toofull
>>4 active+remapped+backfill_toofull
>> 
>> [root@ceph1 ~]# df|grep osd
>> /dev/sdg1   580496384 500066812  80429572  87% 
>> /var/lib/ceph/osd/ceph-3
>> /dev/sdf1   580496384 502131428  78364956  87% 
>> /var/lib/ceph/osd/ceph-2
>> /dev/sde1   580496384 506927100  73569284  88% 
>> /var/lib/ceph/osd/ceph-0
>> /dev/sdb1   287550208 28755018820 100% 
>> /var/lib/ceph/osd/ceph-5
>> /dev/sdd1   580496384 58049636420 100% 
>> 

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
Ahoj Jan, thanks for the quick hint!

Those 2 OSDs are currently full and down. How should I handle that? Is it
ok if I delete some pg directories again and start the OSD daemons on
both drives in parallel, then set the weights as recommended?

What effect should I expect then - will the cluster attempt to move some
pgs out of these drives to different local OSDs? I'm asking because when
I've attempted to delete pg dirs and restart the OSD for the first time, the
OSD got full again very fast.

Thank you.

Lukas



On Wed, Feb 17, 2016 at 9:48 PM Jan Schermer  wrote:

> Ahoj ;-)
>
> You can reweight them temporarily, that shifts the data from the full
> drives.
>
> ceph osd reweight osd.XX YY
> (XX = the number of the full OSD, YY is "weight", which defaults to 1)
>
> This is different from "crush reweight" which defaults to drive size in TB.
>
> Beware that reweighting will (afaik) only shuffle the data to other local
> drives, so you should reweight both the full drives at the same time and
> only by a little bit at a time (0.95 is a good starting point).
>
> Jan
>
>
>
> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
>
> Hi,
> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
> pools, each of size=2. Today, one of our OSDs got full, another 2 near
> full. Cluster turned into ERR state. I have noticed uneven space
> distribution among OSD drives, between 70 and 100 percent. I have realized
> there's a low number of pgs in those 2 pools (128 each) and increased one
> of them to 512, expecting magic to happen and redistribute the space
> evenly.
>
> Well, something happened - another OSD became full during the
> redistribution and the cluster stopped both OSDs and marked them down. After
> some hours the remaining drives partially rebalanced and the cluster got to
> WARN state.
>
> I've deleted 3 placement group directories from one of the full OSD's
> filesystem which allowed me to start it up again. Soon, however this drive
> became full again.
>
> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
> drives to add.
>
> Is there a way to get out of this situation without adding OSDs? I
> will attempt to release some space, just waiting for a colleague to identify
> RBD volumes (openstack images and volumes) which can be deleted.
>
> Thank you.
>
> Lukas
>
>
> This is my cluster state now:
>
> [root@compute1 ~]# ceph -w
> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>  health HEALTH_WARN
> 10 pgs backfill_toofull
> 114 pgs degraded
> 114 pgs stuck degraded
> 147 pgs stuck unclean
> 114 pgs stuck undersized
> 114 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 56923/640724 objects degraded (8.884%)
> recovery 29122/640724 objects misplaced (4.545%)
> 3 near full osd(s)
>  monmap e3: 3 mons at {compute1=
> 10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
> }
> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
> 4365 GB used, 890 GB / 5256 GB avail
> 56923/640724 objects degraded (8.884%)
> 29122/640724 objects misplaced (4.545%)
>  493 active+clean
>  108 active+undersized+degraded
>   29 active+remapped
>6 active+undersized+degraded+remapped+backfill_toofull
>4 active+remapped+backfill_toofull
>
> [root@ceph1 ~]# df|grep osd
> /dev/sdg1   580496384 500066812  80429572  87%
> /var/lib/ceph/osd/ceph-3
> /dev/sdf1   580496384 502131428  78364956  87%
> /var/lib/ceph/osd/ceph-2
> /dev/sde1   580496384 506927100  73569284  88%
> /var/lib/ceph/osd/ceph-0
> /dev/sdb1   287550208 28755018820 100%
> /var/lib/ceph/osd/ceph-5
> /dev/sdd1   580496384 58049636420 100%
> /var/lib/ceph/osd/ceph-4
> /dev/sdc1   580496384 478675672 101820712  83%
> /var/lib/ceph/osd/ceph-1
>
> [root@ceph2 ~]# df|grep osd
> /dev/sdf1   580496384 448689872 131806512  78%
> /var/lib/ceph/osd/ceph-7
> /dev/sdb1   287550208 227054336  60495872  79%
> /var/lib/ceph/osd/ceph-11
> /dev/sdd1   580496384 464175196 116321188  80%
> /var/lib/ceph/osd/ceph-10
> /dev/sdc1   580496384 489451300  91045084  85%
> /var/lib/ceph/osd/ceph-6
> /dev/sdg1   580496384 470559020 109937364  82%
> /var/lib/ceph/osd/ceph-9
> /dev/sde1   580496384 490289388  90206996  85%
> /var/lib/ceph/osd/ceph-8
>
> [root@ceph2 ~]# ceph df
> GLOBAL:
> SIZE  AVAIL RAW USED %RAW USED
> 5256G  890G4365G 83.06
> POOLS:
> NAME   ID USED  %USED MAX AVAIL OBJECTS

Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Somnath Roy
If you are not sure about what weight to put, ‘ceph osd 
reweight-by-utilization’ should also do the job for you automatically.
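
For example (the threshold argument is optional; 120 here means only OSDs more
than 20% above the average utilization get reweighted):

ceph osd reweight-by-utilization 120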

Thanks & Regards
Somnath


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Wednesday, February 17, 2016 12:48 PM
To: Lukáš Kubín
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to recover from OSDs full in small cluster

Ahoj ;-)

You can reweight them temporarily, that shifts the data from the full drives.

ceph osd reweight osd.XX YY
(XX = the number of the full OSD, YY is "weight", which defaults to 1)

This is different from "crush reweight" which defaults to drive size in TB.

Beware that reweighting will (afaik) only shuffle the data to other local 
drives, so you should reweight both the full drives at the same time and only 
by a little bit at a time (0.95 is a good starting point).

Jan


On 17 Feb 2016, at 21:43, Lukáš Kubín 
> wrote:

Hi,
I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 pools, 
each of size=2. Today, one of our OSDs got full, another 2 near full. Cluster 
turned into ERR state. I have noticed uneven space distribution among OSD 
drives between 70 and 100 percent. I have realized there's a low number of pgs in 
those 2 pools (128 each) and increased one of them to 512, expecting magic to 
happen and redistribute the space evenly.

Well, something happened - another OSD became full during the redistribution 
and the cluster stopped both OSDs and marked them down. After some hours the 
remaining drives partially rebalanced and the cluster got to WARN state.

I've deleted 3 placement group directories from one of the full OSD's 
filesystem which allowed me to start it up again. Soon, however this drive 
became full again.

So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives to 
add.

Is there a way to get out of this situation without adding OSDs? I will 
attempt to release some space, just waiting for a colleague to identify RBD 
volumes (openstack images and volumes) which can be deleted.

Thank you.

Lukas


This is my cluster state now:

[root@compute1 ~]# ceph -w
cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
 health HEALTH_WARN
10 pgs backfill_toofull
114 pgs degraded
114 pgs stuck degraded
147 pgs stuck unclean
114 pgs stuck undersized
114 pgs undersized
1 requests are blocked > 32 sec
recovery 56923/640724 objects degraded (8.884%)
recovery 29122/640724 objects misplaced (4.545%)
3 near full osd(s)
 monmap e3: 3 mons at 
{compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0}
election epoch 128, quorum 0,1,2 compute1,compute2,compute3
 osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
  pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
4365 GB used, 890 GB / 5256 GB avail
56923/640724 objects degraded (8.884%)
29122/640724 objects misplaced (4.545%)
 493 active+clean
 108 active+undersized+degraded
  29 active+remapped
   6 active+undersized+degraded+remapped+backfill_toofull
   4 active+remapped+backfill_toofull

[root@ceph1 ~]# df|grep osd
/dev/sdg1   580496384 500066812  80429572  87% 
/var/lib/ceph/osd/ceph-3
/dev/sdf1   580496384 502131428  78364956  87% 
/var/lib/ceph/osd/ceph-2
/dev/sde1   580496384 506927100  73569284  88% 
/var/lib/ceph/osd/ceph-0
/dev/sdb1   287550208 28755018820 100% 
/var/lib/ceph/osd/ceph-5
/dev/sdd1   580496384 58049636420 100% 
/var/lib/ceph/osd/ceph-4
/dev/sdc1   580496384 478675672 101820712  83% 
/var/lib/ceph/osd/ceph-1

[root@ceph2 ~]# df|grep osd
/dev/sdf1   580496384 448689872 131806512  78% 
/var/lib/ceph/osd/ceph-7
/dev/sdb1   287550208 227054336  60495872  79% 
/var/lib/ceph/osd/ceph-11
/dev/sdd1   580496384 464175196 116321188  80% 
/var/lib/ceph/osd/ceph-10
/dev/sdc1   580496384 489451300  91045084  85% 
/var/lib/ceph/osd/ceph-6
/dev/sdg1   580496384 470559020 109937364  82% 
/var/lib/ceph/osd/ceph-9
/dev/sde1   580496384 490289388  90206996  85% 
/var/lib/ceph/osd/ceph-8

[root@ceph2 ~]# ceph df
GLOBAL:
SIZE  AVAIL RAW USED %RAW USED
5256G  890G4365G 83.06
POOLS:
NAME   ID USED  %USED MAX AVAIL OBJECTS
glance 6  1714G 32.61  385G  219579
cinder 7   676G 12.86  385G   97488

[root@ceph2 ~]# ceph osd pool get glance pg_num
pg_num: 512
[root@ceph2 ~]# ceph osd pool get cinder pg_num
pg_num: 

[ceph-users] Idea for speedup RadosGW for buckets with many objects.

2016-02-17 Thread Krzysztof Księżyk
Hi,

I'm experiencing a problem with poor performance of RadosGW while
operating on a bucket with many objects. That's a known issue with LevelDB
and can be partially resolved using sharding, but I have one more idea.
As I can see in the ceph osd logs, all slow requests occur while making a call to
rgw.bucket_list:

2016-02-17 03:17:56.846694 7f5396f63700  0 log_channel(cluster) log
[WRN] : slow request 30.272904 seconds old, received at 2016-02-17
03:17:26.573742: osd_op(client.12611484.0:15137332 .dir.default.4162.3
[call rgw.bucket_list] 9.2955279 ack+read+known_if_redirected e3252)
currently started

I don't know exactly how Ceph works internally, but maybe the data required
to return results for rgw.bucket_list could be cached for some time.
The cache TTL would be parametrized and could be disabled to keep the same
behaviour as the current one. There can be 3 cases when there's a call to
rgw.bucket_list:
1. no cached data
2. up-to-date cache
3. outdated cache

Ad 1. The first call starts generating the full list. All new requests are put
on hold. When the list is ready it's saved to the cache.
Ad 2. All calls are served from the cache.
Ad 3. The first request starts generating the full list. All new requests are
served from the outdated cache until the new cached data is ready.

This could even be optimized by periodically regenerating the cache, even
before it expires, to reduce the cases where the cache is outdated.

Maybe this idea is stupid, maybe not, but if it's doable it would be
nice to have the choice.

Kind regards -
Krzysztof Księżyk


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Jan Schermer
Ahoj ;-)

You can reweight them temporarily, that shifts the data from the full drives.

ceph osd reweight osd.XX YY
(XX = the number of the full OSD, YY is "weight", which defaults to 1)

This is different from "crush reweight" which defaults to drive size in TB.

Beware that reweighting will (afaik) only shuffle the data to other local 
drives, so you should reweight both the full drives at the same time and only 
by a little bit at a time (0.95 is a good starting point).

Jan

 
> On 17 Feb 2016, at 21:43, Lukáš Kubín  wrote:
> 
> Hi,
> I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 
> pools, each of size=2. Today, one of our OSDs got full, another 2 near full. 
> Cluster turned into ERR state. I have noticed uneven space distribution among 
> OSD drives between 70 and 100 percent. I have realized there's a low number of 
> pgs in those 2 pools (128 each) and increased one of them to 512, expecting 
> magic to happen and redistribute the space evenly. 
> 
> Well, something happened - another OSD became full during the redistribution 
> and the cluster stopped both OSDs and marked them down. After some hours the 
> remaining drives partially rebalanced and the cluster got to WARN state. 
> 
> I've deleted 3 placement group directories from one of the full OSD's 
> filesystem which allowed me to start it up again. Soon, however this drive 
> became full again.
> 
> So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives 
> to add. 
> 
> Is there a way to get out of this situation without adding OSDs? I will 
> attempt to release some space, just waiting for a colleague to identify RBD 
> volumes (openstack images and volumes) which can be deleted.
> 
> Thank you.
> 
> Lukas
> 
> 
> This is my cluster state now:
> 
> [root@compute1 ~]# ceph -w
> cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
>  health HEALTH_WARN
> 10 pgs backfill_toofull
> 114 pgs degraded
> 114 pgs stuck degraded
> 147 pgs stuck unclean
> 114 pgs stuck undersized
> 114 pgs undersized
> 1 requests are blocked > 32 sec
> recovery 56923/640724 objects degraded (8.884%)
> recovery 29122/640724 objects misplaced (4.545%)
> 3 near full osd(s)
>  monmap e3: 3 mons at 
> {compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
>  
> }
> election epoch 128, quorum 0,1,2 compute1,compute2,compute3
>  osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
>   pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
> 4365 GB used, 890 GB / 5256 GB avail
> 56923/640724 objects degraded (8.884%)
> 29122/640724 objects misplaced (4.545%)
>  493 active+clean
>  108 active+undersized+degraded
>   29 active+remapped
>6 active+undersized+degraded+remapped+backfill_toofull
>4 active+remapped+backfill_toofull
> 
> [root@ceph1 ~]# df|grep osd
> /dev/sdg1   580496384 500066812  80429572  87% 
> /var/lib/ceph/osd/ceph-3
> /dev/sdf1   580496384 502131428  78364956  87% 
> /var/lib/ceph/osd/ceph-2
> /dev/sde1   580496384 506927100  73569284  88% 
> /var/lib/ceph/osd/ceph-0
> /dev/sdb1   287550208 28755018820 100% 
> /var/lib/ceph/osd/ceph-5
> /dev/sdd1   580496384 58049636420 100% 
> /var/lib/ceph/osd/ceph-4
> /dev/sdc1   580496384 478675672 101820712  83% 
> /var/lib/ceph/osd/ceph-1
> 
> [root@ceph2 ~]# df|grep osd
> /dev/sdf1   580496384 448689872 131806512  78% 
> /var/lib/ceph/osd/ceph-7
> /dev/sdb1   287550208 227054336  60495872  79% 
> /var/lib/ceph/osd/ceph-11
> /dev/sdd1   580496384 464175196 116321188  80% 
> /var/lib/ceph/osd/ceph-10
> /dev/sdc1   580496384 489451300  91045084  85% 
> /var/lib/ceph/osd/ceph-6
> /dev/sdg1   580496384 470559020 109937364  82% 
> /var/lib/ceph/osd/ceph-9
> /dev/sde1   580496384 490289388  90206996  85% 
> /var/lib/ceph/osd/ceph-8
> 
> [root@ceph2 ~]# ceph df
> GLOBAL:
> SIZE  AVAIL RAW USED %RAW USED
> 5256G  890G4365G 83.06
> POOLS:
> NAME   ID USED  %USED MAX AVAIL OBJECTS
> glance 6  1714G 32.61  385G  219579
> cinder 7   676G 12.86  385G   97488
> 
> [root@ceph2 ~]# ceph osd pool get glance pg_num
> pg_num: 512
> [root@ceph2 ~]# ceph osd pool get cinder pg_num
> pg_num: 128
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users 

[ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Lukáš Kubín
Hi,
I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2
pools, each of size=2. Today, one of our OSDs got full, another 2 near
full. Cluster turned into ERR state. I have noticed uneven space
distribution among OSD drives, between 70 and 100 percent. I have realized
there's a low number of pgs in those 2 pools (128 each) and increased one
of them to 512, expecting magic to happen and redistribute the space
evenly.

Well, something happened - another OSD became full during the
redistribution and the cluster stopped both OSDs and marked them down. After
some hours the remaining drives partially rebalanced and the cluster got to
WARN state.

I've deleted 3 placement group directories from one of the full OSD's
filesystem which allowed me to start it up again. Soon, however this drive
became full again.

So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no
drives to add.

Is there a way to get out of this situation without adding OSDs? I will
attempt to release some space, just waiting for a colleague to identify RBD
volumes (openstack images and volumes) which can be deleted.

Thank you.

Lukas


This is my cluster state now:

[root@compute1 ~]# ceph -w
cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
 health HEALTH_WARN
10 pgs backfill_toofull
114 pgs degraded
114 pgs stuck degraded
147 pgs stuck unclean
114 pgs stuck undersized
114 pgs undersized
1 requests are blocked > 32 sec
recovery 56923/640724 objects degraded (8.884%)
recovery 29122/640724 objects misplaced (4.545%)
3 near full osd(s)
 monmap e3: 3 mons at {compute1=
10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0
}
election epoch 128, quorum 0,1,2 compute1,compute2,compute3
 osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
  pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
4365 GB used, 890 GB / 5256 GB avail
56923/640724 objects degraded (8.884%)
29122/640724 objects misplaced (4.545%)
 493 active+clean
 108 active+undersized+degraded
  29 active+remapped
   6 active+undersized+degraded+remapped+backfill_toofull
   4 active+remapped+backfill_toofull

[root@ceph1 ~]# df|grep osd
/dev/sdg1   580496384 500066812  80429572  87%
/var/lib/ceph/osd/ceph-3
/dev/sdf1   580496384 502131428  78364956  87%
/var/lib/ceph/osd/ceph-2
/dev/sde1   580496384 506927100  73569284  88%
/var/lib/ceph/osd/ceph-0
/dev/sdb1   287550208 28755018820 100%
/var/lib/ceph/osd/ceph-5
/dev/sdd1   580496384 58049636420 100%
/var/lib/ceph/osd/ceph-4
/dev/sdc1   580496384 478675672 101820712  83%
/var/lib/ceph/osd/ceph-1

[root@ceph2 ~]# df|grep osd
/dev/sdf1   580496384 448689872 131806512  78%
/var/lib/ceph/osd/ceph-7
/dev/sdb1   287550208 227054336  60495872  79%
/var/lib/ceph/osd/ceph-11
/dev/sdd1   580496384 464175196 116321188  80%
/var/lib/ceph/osd/ceph-10
/dev/sdc1   580496384 489451300  91045084  85%
/var/lib/ceph/osd/ceph-6
/dev/sdg1   580496384 470559020 109937364  82%
/var/lib/ceph/osd/ceph-9
/dev/sde1   580496384 490289388  90206996  85%
/var/lib/ceph/osd/ceph-8

[root@ceph2 ~]# ceph df
GLOBAL:
SIZE  AVAIL RAW USED %RAW USED
5256G  890G4365G 83.06
POOLS:
NAME   ID USED  %USED MAX AVAIL OBJECTS
glance 6  1714G 32.61  385G  219579
cinder 7   676G 12.86  385G   97488

[root@ceph2 ~]# ceph osd pool get glance pg_num
pg_num: 512
[root@ceph2 ~]# ceph osd pool get cinder pg_num
pg_num: 128
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot create bucket via the S3 (s3cmd)

2016-02-17 Thread Alexandr Porunov
Because I have created them manually and then I have installed Rados
Gateway. After that I realised that Rados Gateway didn't work. I thought
that it was because I had created the pools manually, so I removed those
buckets which I had created and reinstalled Rados Gateway. But without
success, of course.

On Wed, Feb 17, 2016 at 10:35 PM, Alexandr Porunov <
alexandr.poru...@gmail.com> wrote:

> Because I have created them manually and then I have installed Rados
> Gateway. After that I realised that Rados Gateway didn't work. I thought
> that it was because I had created the pools manually, so I removed those
> buckets which I had created and reinstalled Rados Gateway. But without
> success, of course.
>
> On Wed, Feb 17, 2016 at 10:13 PM, Василий Ангапов 
> wrote:
>
>> First, it seems to me you should not delete the pools .rgw.buckets and
>> .rgw.buckets.index because those are the pools where RGW actually stores
>> buckets.
>> But why did you do that?
>>
>>
>> 2016-02-18 3:08 GMT+08:00 Alexandr Porunov :
>> > When I try to create bucket:
>> > s3cmd mb s3://first-bucket
>> >
>> > I always get this error:
>> > ERROR: S3 error: 405 (MethodNotAllowed)
>> >
>> > /var/log/ceph/ceph-client.rgw.gateway.log :
>> > 2016-02-17 20:22:49.282715 7f86c50f3700  1 handle_sigterm
>> > 2016-02-17 20:22:49.282750 7f86c50f3700  1 handle_sigterm set alarm for
>> 120
>> > 2016-02-17 20:22:49.282646 7f9478ff9700  1 handle_sigterm
>> > 2016-02-17 20:22:49.282689 7f9478ff9700  1 handle_sigterm set alarm for
>> 120
>> > 2016-02-17 20:22:49.285830 7f949b842880 -1 shutting down
>> > 2016-02-17 20:22:49.285289 7f86f36c3880 -1 shutting down
>> > 2016-02-17 20:22:49.370173 7f86f36c3880  1 final shutdown
>> > 2016-02-17 20:22:49.467154 7f949b842880  1 final shutdown
>> > 2016-02-17 22:23:33.388956 7f4a94adf880  0 ceph version 9.2.0
>> > (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 889
>> > 2016-02-17 20:23:44.344574 7f4a94adf880  0 framework: civetweb
>> > 2016-02-17 20:23:44.344583 7f4a94adf880  0 framework conf key: port,
>> val: 80
>> > 2016-02-17 20:23:44.344590 7f4a94adf880  0 starting handler: civetweb
>> > 2016-02-17 20:23:44.344630 7f4a94adf880  0 civetweb: 0x7f4a951c8b00:
>> > set_ports_option: cannot bind to 80: 13 (Permission denied)
>> > 2016-02-17 20:23:44.495510 7f4a65ffb700  0 ERROR: can't read user
>> header:
>> > ret=-2
>> > 2016-02-17 20:23:44.495516 7f4a65ffb700  0 ERROR: sync_user() failed,
>> > user=alex ret=-2
>> > 2016-02-17 20:26:47.425354 7fb50132b880  0 ceph version 9.2.0
>> > (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 3149
>> > 2016-02-17 20:26:47.471472 7fb50132b880 -1 asok(0x7fb503e51340)
>> > AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen:
>> failed to
>> > bind the UNIX domain socket to
>> '/var/run/ceph/ceph-client.rgw.gateway.asok':
>> > (17) File exists
>> > 2016-02-17 20:26:47.554305 7fb50132b880  0 framework: civetweb
>> > 2016-02-17 20:26:47.554319 7fb50132b880  0 framework conf key: port,
>> val: 80
>> > 2016-02-17 20:26:47.554328 7fb50132b880  0 starting handler: civetweb
>> > 2016-02-17 20:26:47.576110 7fb4d2ffd700  0 ERROR: can't read user
>> header:
>> > ret=-2
>> > 2016-02-17 20:26:47.576119 7fb4d2ffd700  0 ERROR: sync_user() failed,
>> > user=alex ret=-2
>> > 2016-02-17 20:27:03.504131 7fb49d7a2700  1 == starting new request
>> > req=0x7fb4e40008c0 =
>> > 2016-02-17 20:27:03.522989 7fb49d7a2700  1 == req done
>> > req=0x7fb4e40008c0 http_status=200 ==
>> > 2016-02-17 20:27:03.523023 7fb49d7a2700  1 civetweb: 0x7fb4e40022a0:
>> > 192.168.56.100 - - [17/Feb/2016:20:27:03 +0200] "GET / HTTP/1.1" 200 0
>> - -
>> > 2016-02-17 20:27:08.796459 7fb49bf9f700  1 == starting new request
>> > req=0x7fb4ec0343a0 =
>> > 2016-02-17 20:27:08.796755 7fb49bf9f700  1 == req done
>> > req=0x7fb4ec0343a0 http_status=405 ==
>> > 2016-02-17 20:27:08.796807 7fb49bf9f700  1 civetweb: 0x7fb4ec0008c0:
>> > 192.168.56.100 - - [17/Feb/2016:20:27:08 +0200] "PUT / HTTP/1.1" 405 0
>> - -
>> > 2016-02-17 20:28:22.088508 7fb49e7a4700  1 == starting new request
>> > req=0x7fb503f1bfd0 =
>> > 2016-02-17 20:28:22.090993 7fb49e7a4700  1 == req done
>> > req=0x7fb503f1bfd0 http_status=200 ==
>> > 2016-02-17 20:28:22.091035 7fb49e7a4700  1 civetweb: 0x7fb503f2e9f0:
>> > 192.168.56.100 - - [17/Feb/2016:20:28:22 +0200] "GET / HTTP/1.1" 200 0
>> - -
>> > 2016-02-17 20:28:35.943110 7fb4a77b6700  1 == starting new request
>> > req=0x7fb4cc0047b0 =
>> > 2016-02-17 20:28:35.945233 7fb4a77b6700  1 == req done
>> > req=0x7fb4cc0047b0 http_status=200 ==
>> > 2016-02-17 20:28:35.945282 7fb4a77b6700  1 civetweb: 0x7fb4cc004c90:
>> > 192.168.56.100 - - [17/Feb/2016:20:28:35 +0200] "GET / HTTP/1.1" 200 0
>> - -
>> > 2016-02-17 20:29:07.447283 7fb49dfa3700  1 == starting new request
>> > req=0x7fb4d8000bf0 =
>> > 2016-02-17 20:29:07.447743 7fb49dfa3700  1 == req done
>> > req=0x7fb4d8000bf0 

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Nick Fisk
Ah, typo: I meant to say 10 MHz per IO. So a 7.2k disk does around 80 IOPS = ~ 
800 MHz, which is close to the 1 GHz figure.

 

From: John Hogenmiller [mailto:j...@hogenmiller.net] 
Sent: 17 February 2016 13:15
To: Nick Fisk 
Cc: Василий Ангапов ; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code

 

I hadn't come across this ratio before, but now that I've read that PDF you 
linked and I've narrowed my search in the mailing list, I think that the 0.5 - 
1 GHz per OSD ratio is pretty spot on. The 100 MHz per IOP figure is also pretty 
interesting, and we do indeed use 7200 RPM drives. 

 

I'll look up a few more things, but based on what I've seen so far, the 
hardware we're using will most likely not be suitable, which is unfortunate as 
that adds some more complexity at OSI Level 8. :D

 

 

On Wed, Feb 17, 2016 at 4:14 AM, Nick Fisk  > wrote:

Thanks for posting your experiences John, very interesting read. I think the 
golden rule of around 1 GHz is still a realistic goal to aim for. It looks like 
you probably have around 16 GHz for 60 OSDs, or 0.26 GHz per OSD. Do you have any 
idea of how much CPU you think you would need to just be able to get away with 
it?

I have 24 GHz for 12 OSDs (2x 2620v2) and I typically don't see CPU usage over 
about 20%, which indicates to me that the bare minimum for a replicated pool is 
probably around 0.5 GHz per 7.2k rpm OSD. The next nodes we have will certainly 
have less CPU.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
>  ] On Behalf Of
> John Hogenmiller
> Sent: 17 February 2016 03:04
> To: Василий Ангапов  >; 
> ceph-users@lists.ceph.com  
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
>

> Turns out I didn't do reply-all.
>
> On Tue, Feb 16, 2016 at 9:18 AM, John Hogenmiller   >
> wrote:
> > And again - is dual Xeon's power enough for 60-disk node and Erasure
> Code?
>
>
> This is something I've been attempting to determine as well. I'm not yet
> getting
> I'm testing with some white-label hardware, but essentially supermicro
> 2twinu's with a pair of E5-2609 Xeons and 64GB of
> memory.  (http://www.supermicro.com/products/system/2U/6028/SYS-
> 6028TR-HTFR.cfm). This is attached to DAEs with 60 x 6TB drives, in JBOD.
>
> Conversely, Supermicro sells a 72-disk OSD node, which Redhat considers a
> supported "reference architecture" device. The processors in those nodes
> are E5-269 12-core, vs what I have which is quad-
> core. http://www.supermicro.com/solutions/storage_ceph.cfm  (SSG-
> 6048R-OSD432). I would highly recommend reflecting on the supermicro
> hardware and using that as your reference as well. If you could get an eval
> unit, use that to compare with the hardware you're working with.
>
> I currently have mine setup with 7 nodes, 60 OSDs each, radosgw running
> one each node, and 5 ceph monitors. I plan to move the monitors to their
> own dedicated hardware, and in reading, I may only need 3 to manage the
> 420 OSDs.   I am currently just setup for replication instead of EC, though I
> want to redo this cluster to use EC. Also, I am still trying to work out how
> much of an impact placement groups have on performance, and I may have a
> performance-hampering amount..
>
> We test the system using locust speaking S3 to the radosgw. Transactions are
> distributed equally across all 7 nodes and we track the statistics. We started
> first emulating 1000 users and got over 4Gbps, but load average on all nodes
> was in the mid-100s, and after 15 minutes we started getting socket
> timeouts. We stopped the test, let load settle, and started back at 100
> users.  We've been running this test about 5 days now.  Load average on all
> nodes floats between 40 and 70. The nodes with ceph-mon running on them
> do not appear to be taxed any more than the ones without. The radosgw
> itself seems to take up a decent amount of cpu (running civetweb, no
> ssl).  iowait is non existent, everything appears to be cpu bound.
>
> At 1000 users, we had 4.3Gbps of PUTs and 2.2Gbps of GETs. Did not capture
> the TPS on that short test.
> At 100 users, we're pushing 2Gbps  in PUTs and 1.24Gpbs in GETs. Averaging
> 115 TPS.
>
> All in all, the speeds are not bad for a single rack, but the CPU utilization 
> is a
> big concern. We're currently using other (proprietary) object storage
> platforms on this hardware configuration. They have their own set of issues,
> but CPU utilization is typically not the problem, even at higher utilization.
>
>
>
> root@ljb01:/home/ceph/rain-cluster# ceph status
> cluster 4ebe7995-6a33-42be-bd4d-20f51d02ae45
>  health HEALTH_OK
>  monmap e5: 5 mons at 

Re: [ceph-users] Cannot create bucket via the S3 (s3cmd)

2016-02-17 Thread Василий Ангапов
First, it seems to me you should not delete the pools .rgw.buckets and
.rgw.buckets.index because those are the pools where RGW actually stores
buckets.
But why did you do that?


2016-02-18 3:08 GMT+08:00 Alexandr Porunov :
> When I try to create bucket:
> s3cmd mb s3://first-bucket
>
> I always get this error:
> ERROR: S3 error: 405 (MethodNotAllowed)
>
> /var/log/ceph/ceph-client.rgw.gateway.log :
> 2016-02-17 20:22:49.282715 7f86c50f3700  1 handle_sigterm
> 2016-02-17 20:22:49.282750 7f86c50f3700  1 handle_sigterm set alarm for 120
> 2016-02-17 20:22:49.282646 7f9478ff9700  1 handle_sigterm
> 2016-02-17 20:22:49.282689 7f9478ff9700  1 handle_sigterm set alarm for 120
> 2016-02-17 20:22:49.285830 7f949b842880 -1 shutting down
> 2016-02-17 20:22:49.285289 7f86f36c3880 -1 shutting down
> 2016-02-17 20:22:49.370173 7f86f36c3880  1 final shutdown
> 2016-02-17 20:22:49.467154 7f949b842880  1 final shutdown
> 2016-02-17 22:23:33.388956 7f4a94adf880  0 ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 889
> 2016-02-17 20:23:44.344574 7f4a94adf880  0 framework: civetweb
> 2016-02-17 20:23:44.344583 7f4a94adf880  0 framework conf key: port, val: 80
> 2016-02-17 20:23:44.344590 7f4a94adf880  0 starting handler: civetweb
> 2016-02-17 20:23:44.344630 7f4a94adf880  0 civetweb: 0x7f4a951c8b00:
> set_ports_option: cannot bind to 80: 13 (Permission denied)
> 2016-02-17 20:23:44.495510 7f4a65ffb700  0 ERROR: can't read user header:
> ret=-2
> 2016-02-17 20:23:44.495516 7f4a65ffb700  0 ERROR: sync_user() failed,
> user=alex ret=-2
> 2016-02-17 20:26:47.425354 7fb50132b880  0 ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 3149
> 2016-02-17 20:26:47.471472 7fb50132b880 -1 asok(0x7fb503e51340)
> AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
> bind the UNIX domain socket to '/var/run/ceph/ceph-client.rgw.gateway.asok':
> (17) File exists
> 2016-02-17 20:26:47.554305 7fb50132b880  0 framework: civetweb
> 2016-02-17 20:26:47.554319 7fb50132b880  0 framework conf key: port, val: 80
> 2016-02-17 20:26:47.554328 7fb50132b880  0 starting handler: civetweb
> 2016-02-17 20:26:47.576110 7fb4d2ffd700  0 ERROR: can't read user header:
> ret=-2
> 2016-02-17 20:26:47.576119 7fb4d2ffd700  0 ERROR: sync_user() failed,
> user=alex ret=-2
> 2016-02-17 20:27:03.504131 7fb49d7a2700  1 == starting new request
> req=0x7fb4e40008c0 =
> 2016-02-17 20:27:03.522989 7fb49d7a2700  1 == req done
> req=0x7fb4e40008c0 http_status=200 ==
> 2016-02-17 20:27:03.523023 7fb49d7a2700  1 civetweb: 0x7fb4e40022a0:
> 192.168.56.100 - - [17/Feb/2016:20:27:03 +0200] "GET / HTTP/1.1" 200 0 - -
> 2016-02-17 20:27:08.796459 7fb49bf9f700  1 == starting new request
> req=0x7fb4ec0343a0 =
> 2016-02-17 20:27:08.796755 7fb49bf9f700  1 == req done
> req=0x7fb4ec0343a0 http_status=405 ==
> 2016-02-17 20:27:08.796807 7fb49bf9f700  1 civetweb: 0x7fb4ec0008c0:
> 192.168.56.100 - - [17/Feb/2016:20:27:08 +0200] "PUT / HTTP/1.1" 405 0 - -
> 2016-02-17 20:28:22.088508 7fb49e7a4700  1 == starting new request
> req=0x7fb503f1bfd0 =
> 2016-02-17 20:28:22.090993 7fb49e7a4700  1 == req done
> req=0x7fb503f1bfd0 http_status=200 ==
> 2016-02-17 20:28:22.091035 7fb49e7a4700  1 civetweb: 0x7fb503f2e9f0:
> 192.168.56.100 - - [17/Feb/2016:20:28:22 +0200] "GET / HTTP/1.1" 200 0 - -
> 2016-02-17 20:28:35.943110 7fb4a77b6700  1 == starting new request
> req=0x7fb4cc0047b0 =
> 2016-02-17 20:28:35.945233 7fb4a77b6700  1 == req done
> req=0x7fb4cc0047b0 http_status=200 ==
> 2016-02-17 20:28:35.945282 7fb4a77b6700  1 civetweb: 0x7fb4cc004c90:
> 192.168.56.100 - - [17/Feb/2016:20:28:35 +0200] "GET / HTTP/1.1" 200 0 - -
> 2016-02-17 20:29:07.447283 7fb49dfa3700  1 == starting new request
> req=0x7fb4d8000bf0 =
> 2016-02-17 20:29:07.447743 7fb49dfa3700  1 == req done
> req=0x7fb4d8000bf0 http_status=405 ==
> 2016-02-17 20:29:07.447913 7fb49dfa3700  1 civetweb: 0x7fb4d8002bb0:
> 192.168.56.100 - - [17/Feb/2016:20:29:07 +0200] "PUT / HTTP/1.1" 405 0 - -
>
> My ceph.conf:
> [global]
> fsid = 54060180-f49f-4cfb-a04e-72ecbda8692b
> mon_initial_members = node1
> mon_host = 192.168.56.101
> auth_cluster_required = cephx
> auth_service_required = cephx
> auth_client_required = cephx
> filestore_xattr_use_omap = true
> osd_pool_default_size = 2
> public_network = 192.168.56.0/24
> cluster_network = 192.168.57.0/24
> [client.rgw.gateway]
> rgw_frontends = "civetweb port=80"
>
> I have created several pools (like .rgw.buckets.index and so on) and then I
> have deleted several of them (like .rgw.buckets.index and so on). It is my
> current list of pools:
> 0 rbd,1 .rgw.root,2 .rgw.control,3 .rgw,5 .log,6 .users.uid,7 data,12
> .intent-log,13 .usage,14 .users,15 .users.email,16 .users.swift,17 .rgw.gc,
>
> After reboot my ceph-radosgw@rgw.gateway.service is running but I cannot
> send any request on Ceph Gateway 

[ceph-users] Cannot create bucket via the S3 (s3cmd)

2016-02-17 Thread Alexandr Porunov
When I try to create bucket:
s3cmd mb s3://first-bucket

I always get this error:
ERROR: S3 error: 405 (MethodNotAllowed)

/var/log/ceph/ceph-client.rgw.gateway.log :
2016-02-17 20:22:49.282715 7f86c50f3700  1 handle_sigterm
2016-02-17 20:22:49.282750 7f86c50f3700  1 handle_sigterm set alarm for 120
2016-02-17 20:22:49.282646 7f9478ff9700  1 handle_sigterm
2016-02-17 20:22:49.282689 7f9478ff9700  1 handle_sigterm set alarm for 120
2016-02-17 20:22:49.285830 7f949b842880 -1 shutting down
2016-02-17 20:22:49.285289 7f86f36c3880 -1 shutting down
2016-02-17 20:22:49.370173 7f86f36c3880  1 final shutdown
2016-02-17 20:22:49.467154 7f949b842880  1 final shutdown
2016-02-17 22:23:33.388956 7f4a94adf880  0 ceph version 9.2.0
(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 889
2016-02-17 20:23:44.344574 7f4a94adf880  0 framework: civetweb
2016-02-17 20:23:44.344583 7f4a94adf880  0 framework conf key: port, val: 80
2016-02-17 20:23:44.344590 7f4a94adf880  0 starting handler: civetweb
2016-02-17 20:23:44.344630 7f4a94adf880  0 civetweb: 0x7f4a951c8b00:
set_ports_option: cannot bind to 80: 13 (Permission denied)
2016-02-17 20:23:44.495510 7f4a65ffb700  0 ERROR: can't read user header:
ret=-2
2016-02-17 20:23:44.495516 7f4a65ffb700  0 ERROR: sync_user() failed,
user=alex ret=-2
2016-02-17 20:26:47.425354 7fb50132b880  0 ceph version 9.2.0
(bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 3149
2016-02-17 20:26:47.471472 7fb50132b880 -1 asok(0x7fb503e51340)
AdminSocketConfigObs::init: failed: AdminSocket::bind_and_listen: failed to
bind the UNIX domain socket to
'/var/run/ceph/ceph-client.rgw.gateway.asok': (17) File exists
2016-02-17 20:26:47.554305 7fb50132b880  0 framework: civetweb
2016-02-17 20:26:47.554319 7fb50132b880  0 framework conf key: port, val: 80
2016-02-17 20:26:47.554328 7fb50132b880  0 starting handler: civetweb
2016-02-17 20:26:47.576110 7fb4d2ffd700  0 ERROR: can't read user header:
ret=-2
2016-02-17 20:26:47.576119 7fb4d2ffd700  0 ERROR: sync_user() failed,
user=alex ret=-2
2016-02-17 20:27:03.504131 7fb49d7a2700  1 == starting new request
req=0x7fb4e40008c0 =
2016-02-17 20:27:03.522989 7fb49d7a2700  1 == req done
req=0x7fb4e40008c0 http_status=200 ==
2016-02-17 20:27:03.523023 7fb49d7a2700  1 civetweb: 0x7fb4e40022a0:
192.168.56.100 - - [17/Feb/2016:20:27:03 +0200] "GET / HTTP/1.1" 200 0 - -
2016-02-17 20:27:08.796459 7fb49bf9f700  1 == starting new request
req=0x7fb4ec0343a0 =
2016-02-17 20:27:08.796755 7fb49bf9f700  1 == req done
req=0x7fb4ec0343a0 http_status=405 ==
2016-02-17 20:27:08.796807 7fb49bf9f700  1 civetweb: 0x7fb4ec0008c0:
192.168.56.100 - - [17/Feb/2016:20:27:08 +0200] "PUT / HTTP/1.1" 405 0 - -
2016-02-17 20:28:22.088508 7fb49e7a4700  1 == starting new request
req=0x7fb503f1bfd0 =
2016-02-17 20:28:22.090993 7fb49e7a4700  1 == req done
req=0x7fb503f1bfd0 http_status=200 ==
2016-02-17 20:28:22.091035 7fb49e7a4700  1 civetweb: 0x7fb503f2e9f0:
192.168.56.100 - - [17/Feb/2016:20:28:22 +0200] "GET / HTTP/1.1" 200 0 - -
2016-02-17 20:28:35.943110 7fb4a77b6700  1 == starting new request
req=0x7fb4cc0047b0 =
2016-02-17 20:28:35.945233 7fb4a77b6700  1 == req done
req=0x7fb4cc0047b0 http_status=200 ==
2016-02-17 20:28:35.945282 7fb4a77b6700  1 civetweb: 0x7fb4cc004c90:
192.168.56.100 - - [17/Feb/2016:20:28:35 +0200] "GET / HTTP/1.1" 200 0 - -
2016-02-17 20:29:07.447283 7fb49dfa3700  1 == starting new request
req=0x7fb4d8000bf0 =
2016-02-17 20:29:07.447743 7fb49dfa3700  1 == req done
req=0x7fb4d8000bf0 http_status=405 ==
2016-02-17 20:29:07.447913 7fb49dfa3700  1 civetweb: 0x7fb4d8002bb0:
192.168.56.100 - - [17/Feb/2016:20:29:07 +0200] "PUT / HTTP/1.1" 405 0 - -

My ceph.conf:
[global]
fsid = 54060180-f49f-4cfb-a04e-72ecbda8692b
mon_initial_members = node1
mon_host = 192.168.56.101
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_pool_default_size = 2
public_network = 192.168.56.0/24
cluster_network = 192.168.57.0/24
[client.rgw.gateway]
rgw_frontends = "civetweb port=80"

I have created several pools (like .rgw.buckets.index and so on) and then I
have deleted several of them (like .rgw.buckets.index and so on). It is my
current list of pools:
0 rbd,1 .rgw.root,2 .rgw.control,3 .rgw,5 .log,6 .users.uid,7 data,12
.intent-log,13 .usage,14 .users,15 .users.email,16 .users.swift,17 .rgw.gc,

After a reboot my ceph-radosgw@rgw.gateway.service is running but I cannot
send any request to the Ceph Gateway Node (it shows errors).

I manually start it with this command:
/usr/bin/radosgw --id=rgw.gateway

After this command the ceph gateway becomes responsive, but s3cmd mb
s3://first-bucket still doesn't work.

Please help me figure out how to create buckets.

Regards
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd become unusable, blocked by xfsaild (?) and load > 5000

2016-02-17 Thread Scottix
Looks like the bug with the kernel using Ceph and XFS was fixed. I haven't
tested it yet but just wanted to give an update.

https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1527062

On Tue, Dec 8, 2015 at 8:05 AM Scottix  wrote:

> I can confirm it seems to be kernels greater than 3.16; we had this
> problem where servers would lock up and we had to perform restarts on a weekly
> basis.
> We downgraded to 3.16, since then we have not had to do any restarts.
>
> I did find this thread in the XFS forums and I am not sure if has been
> fixed or not
> http://oss.sgi.com/archives/xfs/2015-07/msg00034.html
>
>
> On Tue, Dec 8, 2015 at 2:06 AM Tom Christensen  wrote:
>
>> We run deep scrubs via cron with a script so we know when deep scrubs are
>> happening, and we've seen nodes fail both during deep scrubbing and while
>> no deep scrubs are occurring so I'm pretty sure its not related.
>>
>>
>> On Tue, Dec 8, 2015 at 2:42 AM, Benedikt Fraunhofer <
>> fraunho...@traced.net> wrote:
>>
>>> Hi Tom,
>>>
>>> 2015-12-08 10:34 GMT+01:00 Tom Christensen :
>>>
>>> > We didn't go forward to 4.2 as it's a large production cluster, and we
>>> just
>>> > needed the problem fixed.  We'll probably test out 4.2 in the next
>>> couple
>>>
>>> unfortunately we don't have the luxury of a test cluster.
>>> and to add to that, we couldnt simulate the load, altough it does not
>>> seem to be load related.
>>> Did you try running with nodeep-scrub as a short-term workaround?
>>>
>>> I'll give ~30% of the nodes 4.2 and see how it goes.
>>>
>>> > In our experience it takes about 2 weeks to start happening
>>>
>>> we're well below that. Somewhat between 1 and 4 days.
>>> And yes, once one goes south, it affects the rest of the cluster.
>>>
>>> Thx!
>>>
>>>  Benedikt
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues related to scrubbing

2016-02-17 Thread Cullen King
On Wed, Feb 17, 2016 at 12:13 AM, Christian Balzer  wrote:

>
> Hello,
>
> On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:
>
> > Thanks for the helpful commentary Christian. The cluster is performing much
> > better with 50% more spindles (12 to 18 drives), along with setting scrub
> > sleep to 0.1. I didn't really see any gain from moving from the Samsung 850
> > Pro journal drives to Intel 3710's, even though dd and other direct tests
> > of the drives yielded much better results. rados bench with 4k requests
> > is still awfully low. I'll figure that problem out next.
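> >
> > (By "scrub sleep" I mean the ceph.conf option, roughly:
> >
> > [osd]
> > osd_scrub_sleep = 0.1
> >
> > either set in the config or injected at runtime.)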
> >
> Got examples, numbers, watched things with atop?
> 4KB rados benches are what can make my CPUs melt on the cluster here
> that's most similar to yours. ^o^
>
> > I ended up bumping up the number of placement groups from 512 to 1024
> > which should help a little bit. Basically it'll change the worst case
> > scrub performance such that it is distributed a little more across
> > drives rather than clustered on a single drive for longer.
> >
> Of course with osd_max_scrubs at its default of 1 there should never be
> more than one scrub per OSD.
> However I seem to vaguely remember that this is per "primary" scrub, so in
> case of deep-scrubs there could still be plenty of contention going on.
> Again, I've always had good success with that manually kicked-off scrub
> of all OSDs.
> It seems to sequence things nicely and finishes within 4 hours on my
> "good" production cluster.
>
> > I think the real solution here is to create a secondary SSD pool, pin
> > some radosgw buckets to it and put my thumbnail data on the smaller,
> > faster pool. I'll reserve the spindle based pool for original high res
> > photos, which are only read to create thumbnails when necessary. This
> > should put the majority of my random read IO on SSDs, and thumbnails
> > average 50kb each so it shouldn't be too spendy. I am considering trying
> > the newer samsung sm863 drives as we are read heavy, any potential data
> > loss on this thumbnail pool will not be catastrophic.
> >
> I seriously detest it when makers don't have their endurance data on the
> web page with all the other specifications and make you look up things in
> a slightly hidden PDF.
> Then giving the total endurance and making you calculate drive writes per
> day. ^o^
> Only to find that these have 3 DWPD, which is nothing to be ashamed off
> and should be fine for this particular use case.
>
> However take a look at this old posting of mine:
>
> http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
>
> With that in mind, I'd recommend you do some testing with real world data
> before you invest too much into something that will wear out long before
> it has payed for itself.
>

We are not write heavy at all; if my current drives are any indication, I'd
only do one drive write per year on the things.


>
> Christian
>
> > Third, it seems that I am also running into the known "Lots Of Small
> > Files" performance issue. Looks like performance in my use case will be
> > drastically improved with the upcoming bluestore, though migrating to it
> > sounds painful!
> >
> > On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> > >
> > > > Replies in-line:
> > > >
> > > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> > > >  wrote:
> > > >
> > > > >
> > > > > Hello,
> > > > >
> > > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I've been trying to nail down a nasty performance issue related
> > > > > > to scrubbing. I am mostly using radosgw with a handful of buckets
> > > > > > containing millions of various sized objects. When ceph scrubs,
> > > > > > both regular and deep, radosgw blocks on external requests, and
> > > > > > my cluster has a bunch of requests that have blocked for > 32
> > > > > > seconds. Frequently OSDs are marked down.
> > > > > >
> > > > > From my own (painful) experiences let me state this:
> > > > >
> > > > > 1. When your cluster runs out of steam during deep-scrubs, drop
> > > > > what you're doing and order more HW (OSDs).
> > > > > Because this is a sign that it would also be in trouble when doing
> > > > > recoveries.
> > > > >
> > > >
> > > > When I've initiated recoveries from working on the hardware the
> > > > cluster hasn't had a problem keeping up. It seems that it only has a
> > > > problem with scrubbing, meaning it feels like the IO pattern is
> > > > drastically different. I would think that with scrubbing I'd see
> > > > something closer to bursty sequential reads, rather than just
> > > > thrashing the drives with a more random IO pattern, especially given
> > > > our low cluster utilization.
> > > >
> > > It's probably more pronounced when phasing in/out entire OSDs, where it
> > > also has to read the entire 

Re: [ceph-users] Performance Testing of CEPH on ARM MicroServer

2016-02-17 Thread Swapnil Jain
Thanks Christian,



> On 17-Feb-2016, at 7:25 AM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Mon, 15 Feb 2016 21:10:33 +0530 Swapnil Jain wrote:
> 
>> For most of you, Ceph on ARMv7 might not sound good. This is our setup
>> and our FIO testing report. I am not able to understand:
>> 
> Just one OSD per Microserver as in your case should be fine.
> As always, use atop (or similar) on your storage servers when running
> these tests to see where your bottlenecks are (HDD/network/CPU).
> 
>> 1) Are these results good or bad?
>> 2) Write is much better than read, whereas read should be better.
>> 
> Your testing is flawed, more below.
> 
>> Hardware:
>> 
>> 8 x ARMv7 MicroServer with 4 x 10G Uplink
>> 
>> Each MicroServer with:
>> 2GB RAM
> Barely OK for one OSD, not enough if you run MONs as well on it (as you
> do).
> 
>> Dual Core 1.6 GHz processor
>> 2 x 2.5 Gbps Ethernet (1 for Public / 1 for Cluster Network)
>> 1 x 3TB SATA HDD
>> 1 x 128GB MSata Flash
> Exact model/maker please.

It's a Seagate ST3000NC000 and a Phison mSATA.


> 
>> 
>> Software:
>> Debian 8.3 32bit
>> ceph version 9.2.0-25-gf480cea
>> 
>> Setup:
>> 
>> 3 MON (Shared with 3 OSD)
>> 8 OSD
>> Data on 3TB SATA with XFS
>> Journal on 128GB MSata Flash
>> 
>> pool with replica 1
> Not a very realistic test of course.
> For a production, fault resilient cluster you would have to divide your
> results by 3 (at least).
> 
>> 500GB image with 4M object size
>> 
>> FIO command: fio --name=unit1 --filename=/dev/rbd1 --bs=4k --runtime=300
>> --readwrite=write
>> 
> 
> If that is your base FIO command line, I'm assuming you mounted that image
> on the client via the kernel RBD module?

Yes, it's via the kernel RBD module.


> 
> Either way, the main reason you're seeing writes being faster than reads
> is that with this command line (no direct=1 flag) fio will use the page
> cache on your client host for writes, speeding things up dramatically.
> To get a realistic idea of your clusters ability, use direct=1 and also
> look into rados bench.
> 
> Another reason for the slow reads is that Ceph (RBD) does badly with
> regards to read-ahead, setting /sys/block/rbd1/queue/read_ahead_kb to
> something like 2048 should improve things.
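>
> For example, on the client that has the image mapped (not persistent across
> reboots):
>
> echo 2048 > /sys/block/rbd1/queue/read_ahead_kb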
> 
> That all being said, your read values look awfully low.

Thanks again for the suggestion. Below are some results using rados bench; here 
read looks much better than write. Still, is it good or could it be better? I also 
checked atop and couldn't see any bottleneck except that the sda disk was busy 
80-90% of the time during the test.
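
Roughly the following invocations (the pool name is just an example here;
--no-cleanup keeps the objects from the write phase around so the sequential
read phase has something to read):

rados bench -p rbd 300 write --no-cleanup
rados bench -p rbd 300 seq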


WRITE Throughput (MB/sec): 297.544
WRITE Average Latency:     0.21499

READ Throughput (MB/sec):  478.026
READ Average Latency:      0.133818

—
Swapnil

> 
> Christian
>> Client:
>> 
>> Ubuntu on Intel 24core/16GB RAM 10G Ethernet
>> 
>> Result for different tests
>> 
>> 128k-randread.txt:  read : io=2587.4MB, bw=8830.2KB/s, iops=68, runt=300020msec
>> 128k-randwrite.txt: write: io=48549MB, bw=165709KB/s, iops=1294, runt=35msec
>> 128k-read.txt:      read : io=26484MB, bw=90397KB/s, iops=706, runt=32msec
>> 128k-write.txt:     write: io=89538MB, bw=305618KB/s, iops=2387, runt=34msec
>> 16k-randread.txt:   read : io=383760KB, bw=1279.2KB/s, iops=79, runt=31msec
>> 16k-randwrite.txt:  write: io=8720.7MB, bw=29764KB/s, iops=1860, runt=32msec
>> 16k-read.txt:       read : io=27444MB, bw=93676KB/s, iops=5854, runt=31msec
>> 16k-write.txt:      write: io=87811MB, bw=299726KB/s, iops=18732, runt=31msec
>> 1M-randread.txt:    read : io=10439MB, bw=35631KB/s, iops=34, runt=38msec
>> 1M-randwrite.txt:   write: io=98943MB, bw=337721KB/s, iops=329, runt=34msec
>> 1M-read.txt:        read : io=25717MB, bw=87779KB/s, iops=85, runt=37msec
>> 1M-write.txt:       write: io=74264MB, bw=253487KB/s, iops=247, runt=31msec
>> 4k-randread.txt:    read : io=116920KB, bw=399084B/s, iops=97, runt=32msec
>> 4k-randwrite.txt:   write: io=5579.2MB, bw=19043KB/s, iops=4760, runt=34msec
>> 4k-read.txt:        read : io=27032MB, bw=92271KB/s, iops=23067, runt=31msec
>> 4k-write.txt:       write: io=92955MB, bw=317284KB/s, iops=79320, runt=31msec
>> 64k-randread.txt:   read : io=1400.2MB, bw=4778.2KB/s, iops=74, runt=300020msec
>> 64k-randwrite.txt:  write: io=27676MB, bw=94467KB/s, iops=1476, runt=35msec
>> 64k-read.txt:       read : io=27805MB, bw=94909KB/s, iops=1482, runt=32msec
>> 64k-write.txt:      write: io=95484MB, bw=325917KB/s, iops=5092, runt=33msec
>> 
>> 
>> —
>> Swapnil Jain | swap...@linux.com  
>> Solution Architect & Red Hat Certified Instructor
>> RHC{A,DS,E,I,SA,SA-RHOS,VA}, CE{H,I}, CC{DA,NA}, MCSE, CNE
>> 
>> 
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com           Global OnLine Japan/Rakuten Communications
> http://www.gol.com/ 



Re: [ceph-users] Adding multiple OSDs to existing cluster

2016-02-17 Thread Ed Rowley
On 17 February 2016 at 14:59, Christian Balzer  wrote:
>
> Hello,
>
> On Wed, 17 Feb 2016 13:44:17 + Ed Rowley wrote:
>
>> On 17 February 2016 at 12:04, Christian Balzer  wrote:
>> >
>> > Hello,
>> >
>> > On Wed, 17 Feb 2016 11:18:40 + Ed Rowley wrote:
>> >
>> >> Hi,
>> >>
>> >> We have been running Ceph in production for a few months and looking
>> >> at our first big expansion. We are going to be adding 8 new OSDs
>> >> across 3 hosts to our current cluster of 13 OSD across 5 hosts. We
>> >> obviously want to minimize the amount of disruption this is going to
>> >> cause but we are unsure about the impact on the crush map as we add
>> >> each OSD.
>> >>
>> > So you are adding new hosts as well?
>> >
>>
>> Yes, we are adding 2 new hosts with 3 OSDs each, and adding two drives/OSDs
>> to an existing host.
>>
> Nods.
>
>> >> From the docs I can see that an OSD is added as 'in' and 'down' and
>> >> wont get objects until the OSD service has started. But what happens
>> >> to the crushmap while the OSD is 'down', is it recalculated? are
>> >> objects misplaced and moved on the existing cluster?
>> >>
>> > Yes, even more so when adding hosts (well, the first OSD on a new
>> > host).
>> >
>> > Find my "Storage node refurbishing, a  "freeze" OSD feature would be
>> > nice" thread in the ML archives.
>> >
>> > Christian
>> >
>>
>> Thanks for the reference, the thread is useful,
>>
>> I am right with the assumption that adding an OSD with:
>>
>> [osd]
>> osd_crush_initial_weight = 0
>>
> Or by adding it with a weight of zero like this:
>
> ceph osd crush add <osd-id> 0 host=<hostname>
>

Thanks, we will give it a try.

>> will not change the existing crush map
>>
> Well it will change it (ceph osd tree will show it), but no data movement
> will result from it, yes.
>
> Christian
>>
>> >> We think we would like to limit the rebuild of the crush map, is this
>> >> possible or beneficial.
>> >>
>> >> Thanks,
>> >>
>> >> Ed Rowley
>> >> ___
>> >> ceph-users mailing list
>> >> ceph-users@lists.ceph.com
>> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >>
>> >
>> >
>> > --
>> > Christian Balzer        Network/Systems Engineer
>> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
>> > http://www.gol.com/
>>
>> Regards,
>>
>> Ed Rowley
>>
>
>
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com   Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
Regards,

Ed Rowley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSDs to existing cluster

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 13:44:17 + Ed Rowley wrote:

> On 17 February 2016 at 12:04, Christian Balzer  wrote:
> >
> > Hello,
> >
> > On Wed, 17 Feb 2016 11:18:40 + Ed Rowley wrote:
> >
> >> Hi,
> >>
> >> We have been running Ceph in production for a few months and looking
> >> at our first big expansion. We are going to be adding 8 new OSDs
> >> across 3 hosts to our current cluster of 13 OSD across 5 hosts. We
> >> obviously want to minimize the amount of disruption this is going to
> >> cause but we are unsure about the impact on the crush map as we add
> >> each OSD.
> >>
> > So you are adding new hosts as well?
> >
> 
> Yes, we are adding 2 new hosts with 3 OSDs each, and adding two drives/OSDs
> to an existing host.
>
Nods.
 
> >> From the docs I can see that an OSD is added as 'in' and 'down' and
> >> wont get objects until the OSD service has started. But what happens
> >> to the crushmap while the OSD is 'down', is it recalculated? are
> >> objects misplaced and moved on the existing cluster?
> >>
> > Yes, even more so when adding hosts (well, the first OSD on a new
> > host).
> >
> > Find my "Storage node refurbishing, a  "freeze" OSD feature would be
> > nice" thread in the ML archives.
> >
> > Christian
> >
> 
> Thanks for the reference, the thread is useful,
> 
> I am right with the assumption that adding an OSD with:
> 
> [osd]
> osd_crush_initial_weight = 0
> 
Or by adding it with a weight of zero like this:

ceph osd crush add <osd-id> 0 host=<hostname>

> will not change the existing crush map
> 
Well it will change it (ceph osd tree will show it), but no data movement
will result from it, yes.
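A minimal sketch of the full workflow under that approach (the OSD id, hostname
and weights are illustrative, not from your cluster):

  ceph osd crush add osd.13 0 host=node6     # shows up in the tree, no data movement yet
  ceph osd crush reweight osd.13 1.0         # start a controlled backfill
  ceph osd crush reweight osd.13 2.73        # final weight, e.g. for a ~3TB drive

Raising the weight in a couple of steps keeps the amount of concurrent backfill
under control.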

Christian
> 
> >> We think we would like to limit the rebuild of the crush map, is this
> >> possible or beneficial.
> >>
> >> Thanks,
> >>
> >> Ed Rowley
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >
> >
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> 
> Regards,
> 
> Ed Rowley
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot change the gateway port (civetweb)

2016-02-17 Thread Jaroslaw Owsiewski
Probably this is the reason:

https://www.w3.org/Daemon/User/Installation/PrivilegedPorts.html

Regards,
-- 
Jarosław Owsiewski

2016-02-17 15:28 GMT+01:00 Alexandr Porunov :

> Hello,
>
> I have problem with port changes of rados gateway node.
> I don't know why but I cannot change listening port of civetweb.
>
> My steps to install radosgw:
> *ceph-deploy install --rgw gateway*
> *ceph-deploy admin gateway*
> *ceph-deploy create rgw gateway*
>
> (gateway starts on port 7480 as expected)
>
> To change the port I add the following lines to ceph.conf:
> *[client.rgw.gateway]*
> *rgw_frontends = "civetweb port=80"*
>
> then I update it on all nodes:
> *ceph-deploy --overwrite-conf config push admin-node node1 node2 node3
> gateway*
>
> After this I try to restart rados gateway:
> *systemctl restart ceph-radosgw@rgw.gateway*
>
> But after restart it doesn't work and
> /var/log/ceph/ceph-client.rgw.gateway.log shows this:
> 2016-02-17 16:06:03.766890 7f3a9215d880  0 set uid:gid to 167:167
> 2016-02-17 16:06:03.766976 7f3a9215d880  0 ceph version 9.2.0
> (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299), process radosgw, pid 2810
> 2016-02-17 16:06:03.859469 7f3a9215d880  0 framework: civetweb
> 2016-02-17 16:06:03.859480 7f3a9215d880  0 framework conf key: port, val:
> 80
> 2016-02-17 16:06:03.859488 7f3a9215d880  0 starting handler: civetweb
> 2016-02-17 16:06:03.859534 7f3a9215d880  0 civetweb: 0x7f3a92846b00:
> set_ports_option: cannot bind to 80: 13 (Permission denied)
> 2016-02-17 16:06:03.876508 7f3a5f7fe700  0 ERROR: can't read user header:
> ret=-2
> 2016-02-17 16:06:03.876516 7f3a5f7fe700  0 ERROR: sync_user() failed,
> user=alex ret=-2
>
> I have added 80 port to iptables and I haven't any firewalls on nodes.
>
> Please help me change the port.
>
> Regards, Alexandr
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot change the gateway port (civetweb)

2016-02-17 Thread Karol Mroz
On Wed, Feb 17, 2016 at 04:28:38PM +0200, Alexandr Porunov wrote:
[...]
> set_ports_option: cannot bind to 80: 13 (Permission denied)

Hi,

The problem is that civetweb can't bind to privileged port 80 because it
currently drops permissions _before_ the bind.

https://github.com/ceph/ceph/pull/7313 is trying to address this problem.

If you can use a non-privileged port for the time being, that would be
best. You could also set --setuser/--setgroup to root, but this has
security implications.
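A minimal sketch of the non-privileged-port workaround, reusing the config block
from earlier in the thread (8080 is an arbitrary choice):

  [client.rgw.gateway]
  rgw_frontends = "civetweb port=8080"

then push the config and restart the radosgw service as before.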

-- 
Regards,
Karol


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Tyler Bishop
I'm using 2x replica on that pool for storing RBD volumes. Our workload is 
pretty heavy; I'd imagine objects on an EC pool would be light in comparison. 







Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 




From: "John Hogenmiller"  
To: "Tyler Bishop"  
Cc: "Nick Fisk" , ceph-users@lists.ceph.com 
Sent: Wednesday, February 17, 2016 7:50:11 AM 
Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code 

Tyler, 
E5-2660 V2 is a 10-core, 2.2Ghz, giving you roughly 44Ghz or 0.78Ghz per OSD. 
That seems to fall in line with Nick's "golden rule" or 0.5Ghz - 1Ghz per OSD. 

Are you doing EC or Replication? If EC, what profile? Could you also provide an 
average of CPU utilization? 

I'm still researching, but so far, the ratio seems to be pretty realistic. 

-John 

On Tue, Feb 16, 2016 at 9:22 AM, Tyler Bishop < tyler.bis...@beyondhosting.net 
> wrote: 


We use dual E5-2660 V2 with 56 6T and performance has not been an issue. It 
will easily saturate the 40G interfaces and saturate the spindle io. 

And yes, you can run dual servers attached to 30 disk each. This gives you lots 
of density. Your failure domain will remain as individual servers. The only 
thing shared is the quad power supplies. 

Tyler Bishop 
Chief Technical Officer 
513-299-7108 x10 



tyler.bis...@beyondhosting.net 


If you are not the intended recipient of this transmission you are notified 
that disclosing, copying, distributing or taking any action in reliance on the 
contents of this information is strictly prohibited. 

- Original Message - 
From: "Nick Fisk" < n...@fisk.me.uk > 
To: "Василий Ангапов" < anga...@gmail.com >, "Tyler Bishop" < 
tyler.bis...@beyondhosting.net > 
Cc: ceph-users@lists.ceph.com 
Sent: Tuesday, February 16, 2016 8:24:33 AM 
Subject: RE: [ceph-users] Recomendations for building 1PB RadosGW with Erasure 
Code 

> -Original Message- 
> From: Василий Ангапов [mailto: anga...@gmail.com ] 
> Sent: 16 February 2016 13:15 
> To: Tyler Bishop < tyler.bis...@beyondhosting.net > 
> Cc: Nick Fisk < n...@fisk.me.uk >; < ceph-users@lists.ceph.com > 
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with 
> Erasure Code 
> 
> 2016-02-16 17:09 GMT+08:00 Tyler Bishop 
> < tyler.bis...@beyondhosting.net >: 
> > With ucs you can run dual server and split the disk. 30 drives per node. 
> > Better density and easier to manage. 
> I don't think I got your point. Can you please explain it in more details? 

I think he means that the 60 bays can be zoned, so you end up with one physical 
JBOD split into two logical 30-disk JBODs, each connected to a different server. 
What this does to your failure domains is another question. 

> 
> And again - is dual Xeon's power enough for 60-disk node and Erasure Code? 

I would imagine yes, but you would most likely need to go for the 12-18 core 
versions with a high clock. These are serious . I don't know at what point 
this becomes more expensive than 12 disk nodes with "cheap" Xeon-D's or Xeon 
E3's. 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread John Hogenmiller
I hadn't come across this ratio prior, but now that I've read that PDF you
linked and I've narrowed my search in the mailing list, I think that the
0.5 - 1ghz per OSD ratio is pretty spot on. The 100Mhz per IOP is also
pretty interesting, and we do indeed use 7200 RPM drives.

I'll look up a few more things, but based on what I've seen so far, the
hardware we're using will most likely not be suitable, which is unfortunate
as that adds some more complexity at OSI Level 8. :D


On Wed, Feb 17, 2016 at 4:14 AM, Nick Fisk  wrote:

> Thanks for posting your experiences John, very interesting read. I think
> the golden rule of around 1Ghz is still a realistic goal to aim for. It
> looks like you probably have around 16ghz for 60OSD's, or 0.26Ghz per OSD.
> Do you have any idea on how much CPU you think you would need to just be
> able to get away with it?
>
> I have 24Ghz for 12 OSD's (2x2620v2) and I typically don't see CPU usage
> over about 20%, which indicates to me the bare minimum for a replicated
> pool is probably around 0.5Ghz per 7.2k rpm OSD. The next nodes we have
> will certainly have less CPU.
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > John Hogenmiller
> > Sent: 17 February 2016 03:04
> > To: Василий Ангапов ; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> >
> > Turns out i didn't do reply-all.
> >
> > On Tue, Feb 16, 2016 at 9:18 AM, John Hogenmiller 
> > wrote:
> > > And again - is dual Xeon's power enough for 60-disk node and Erasure
> > Code?
> >
> >
> > This is something I've been attempting to determine as well. I'm not yet
> > getting
> > I'm testing with some white-label hardware, but essentially supermicro
> > 2twinu's with a pair of E5-2609 Xeons and 64GB of
> > memory.  (http://www.supermicro.com/products/system/2U/6028/SYS-
> > 6028TR-HTFR.cfm). This is attached to DAEs with 60 x 6TB drives, in JBOD.
> >
> > Conversely, Supermicro sells a 72-disk OSD node, which Redhat considers a
> > supported "reference architecture" device. The processors in those nodes
> > are E5-269 12-core, vs what I have which is quad-
> > core. http://www.supermicro.com/solutions/storage_ceph.cfm  (SSG-
> > 6048R-OSD432). I would highly recommend reflecting on the supermicro
> > hardware and using that as your reference as well. If you could get an
> eval
> > unit, use that to compare with the hardware you're working with.
> >
> > I currently have mine setup with 7 nodes, 60 OSDs each, radosgw running
> > one each node, and 5 ceph monitors. I plan to move the monitors to their
> > own dedicated hardware, and in reading, I may only need 3 to manage the
> > 420 OSDs.   I am currently just setup for replication instead of EC,
> though I
> > want to redo this cluster to use EC. Also, I am still trying to work out
> how
> > much of an impact placement groups have on performance, and I may have a
> > performance-hampering amount..
> >
> > We test the system using locust speaking S3 to the radosgw. Transactions
> are
> > distributed equally across all 7 nodes and we track the statistics. We
> started
> > first emulating 1000 users and got over 4Gbps, but load average on all
> nodes
> > was in the mid-100s, and after 15 minutes we started getting socket
> > timeouts. We stopped the test, let load settle, and started back at 100
> > users.  We've been running this test about 5 days now.  Load average on
> all
> > nodes floats between 40 and 70. The nodes with ceph-mon running on them
> > do not appear to be taxed any more than the ones without. The radosgw
> > itself seems to take up a decent amount of cpu (running civetweb, no
> > ssl).  iowait is non existent, everything appears to be cpu bound.
> >
> > At 1000 users, we had 4.3Gbps of PUTs and 2.2Gbps of GETs. Did not
> capture
> > the TPS on that short test.
> > At 100 users, we're pushing 2Gbps  in PUTs and 1.24Gpbs in GETs.
> Averaging
> > 115 TPS.
> >
> > All in all, the speeds are not bad for a single rack, but the CPU
> utilization is a
> > big concern. We're currently using other (proprietary) object storage
> > platforms on this hardware configuration. They have their own set of
> issues,
> > but CPU utilization is typically not the problem, even at higher
> utilization.
> >
> >
> >
> > root@ljb01:/home/ceph/rain-cluster# ceph status
> > cluster 4ebe7995-6a33-42be-bd4d-20f51d02ae45
> >  health HEALTH_OK
> >  monmap e5: 5 mons at {hail02-r01-06=172.29.4.153:6789/0,hail02-r01-
> > 08=172.29.4.155:6789/0,rain02-r01-01=172.29.4.148:6789/0,rain02-r01-
> > 03=172.29.4.150:6789/0,rain02-r01-04=172.29.4.151:6789/0}
> > election epoch 86, quorum 0,1,2,3,4
> rain02-r01-01,rain02-r01-03,rain02-
> > r01-04,hail02-r01-06,hail02-r01-08
> >  osdmap e2543: 423 osds: 419 up, 419 in
> > flags sortbitwise
> >   pgmap v676131: 33848 pgs, 

Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-02-17 Thread Mark Nelson

On 02/17/2016 06:36 AM, Christian Balzer wrote:


Hello,

On Wed, 17 Feb 2016 09:23:11 - Nick Fisk wrote:


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
Of Christian Balzer
Sent: 17 February 2016 04:22
To: ceph-users@lists.ceph.com
Cc: Piotr Wachowicz 
Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,

which is

better?


[snip]

I'm sure both approaches have their own merits, and might be better
for some specific tasks, but with all other things being equal, I
would expect that using SSDs as the "Writeback" cache tier should, on
average, provide better performance than suing the same SSDs for

Journals.

Specifically in the area of read throughput/latency.


Cache tiers (currently) work only well if all your hot data fits into

them.

In which case you'd even better off with with a dedicated SSD pool for

that

data.

Because (currently) Ceph has to promote a full object (4MB by default)
to the cache for each operation, be it read or or write.
That means the first time you want to read a 2KB file in your RBD
backed

VM,

Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
This has of course a significant impact on read performance, in my
crappy

test

cluster reading cold data is half as fast as using the actual
non-cached

HDD

pool.



Just a FYI, there will most likely be several fixes/improvements going
into Jewel which will address most of these problems with caching.
Objects will now only be promoted if they are hit several
times(configurable) and, if it makes it in time, a promotion throttle to
stop too many promotions hindering cluster performance.


Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).

The 2nd one, if done right, will be probably a game changer.
Robert LeBlanc and me will be most pleased.


The branch is wip-promote-throttle and we need testing from more people 
besides me to make sure it's the right path forward. :)


I'm including the a link to the results we've gotten so far here. 
There's still a degenerate case in small random mixed workloads, but 
initial testing seems to indicate that the promotion throttling is 
helping in many other cases, especially at *very* low promotion rates. 
Small random read and write performance for example improves 
dramatically.  Highly skewed zipf distribution writes are also much 
improved (except for large writes).


https://drive.google.com/open?id=0B2gTBZrkrnpZUFV4OC1UaGVlTm8

Note: You will likely need to download the document and open it in open 
office to see the graphs.


In the graphs I have different series labeled as VH, H, M, L, VL, 0, 
etc.  The throttle rates that correspond to those are:


#VH (ie, let everything through)
#osd tier promote max objects sec = 2
#osd tier promote max bytes sec = 1610612736

#H (Almost allow the cache tier to be saturated with writes)
#osd tier promote max objects sec = 2000
#osd tier promote max bytes sec = 268435456

# M (Allow about 20% writes into the cache tier)
#osd tier promote max objects sec = 500
#osd tier promote max bytes sec = 67108864

# L (Allow about 5% writes into the cache tier)
#osd tier promote max objects sec = 125
#osd tier promote max bytes sec = 16777216

# VL (Only allow 4MB/sec to be promoted into the cache tier)
#osd tier promote max objects sec = 25
#osd tier promote max bytes sec = 4194304

# 0 (Technically not zero, something like 1/1000 still allowed through)
#osd tier promote max objects sec = 0
#osd tier promote max bytes sec = 0
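If you want to flip between these levels at runtime while testing, injectargs
should work, assuming the branch exposes them as ordinary OSD options, e.g.:

  ceph tell osd.* injectargs '--osd_tier_promote_max_objects_sec 500 --osd_tier_promote_max_bytes_sec 67108864'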

Mark




However in the context of this thread, Christian is correct, SSD journals
first and then caching if needed.


Yeah, thus my overuse of "currently". ^o^

Christian



And once your cache pool has to evict objects because it is getting
full,

it has

to write out 4MB for each such object to the HDD pool.
Then read it back in later, etc.


The main difference, I suspect, between the two approaches is that in
the case of multiple HDDs (multiple ceph-osd processes), all of those
processes share access to the same shared SSD storing their journals.
Whereas it's likely not the case with Cache tiering, right? Though I
must say I failed to find any detailed info on this. Any
clarification will be appreciated.


In your specific case writes to the OSDs (HDDs) will be (at least) 50%

slower if

your journals are on disk instead of the SSD.
(Which SSDs do you plan to use anyway?)
I don't think you'll be happy with the resulting performance.

Christian.


So, is the above correct, or am I missing some pieces here? Any other
major differences between the two approaches?

Thanks.
P.



--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread John Hogenmiller
Tyler,

E5-2660 V2 is a 10-core, 2.2Ghz, giving you roughly 44Ghz or 0.78Ghz per
OSD.  That seems to fall in line with Nick's "golden rule" or 0.5Ghz - 1Ghz
per OSD.

Are you doing EC or Replication? If EC, what profile?  Could you also
provide an average of CPU utilization?

I'm still researching, but so far, the ratio seems to be pretty realistic.

-John

On Tue, Feb 16, 2016 at 9:22 AM, Tyler Bishop <
tyler.bis...@beyondhosting.net> wrote:

> We use dual E5-2660 V2 with 56 6T and performance has not been an issue.
> It will easily saturate the 40G interfaces and saturate the spindle io.
>
> And yes, you can run dual servers attached to 30 disk each.  This gives
> you lots of density.  Your failure domain will remain as individual
> servers.  The only thing shared is the quad power supplies.
>
> Tyler Bishop
> Chief Technical Officer
> 513-299-7108 x10
>
>
>
> tyler.bis...@beyondhosting.net
>
>
> If you are not the intended recipient of this transmission you are
> notified that disclosing, copying, distributing or taking any action in
> reliance on the contents of this information is strictly prohibited.
>
> - Original Message -
> From: "Nick Fisk" 
> To: "Василий Ангапов" , "Tyler Bishop" <
> tyler.bis...@beyondhosting.net>
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, February 16, 2016 8:24:33 AM
> Subject: RE: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
>
> > -Original Message-
> > From: Василий Ангапов [mailto:anga...@gmail.com]
> > Sent: 16 February 2016 13:15
> > To: Tyler Bishop 
> > Cc: Nick Fisk ; ceph-users@lists.ceph.com
> > Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> > Erasure Code
> >
> > 2016-02-16 17:09 GMT+08:00 Tyler Bishop
> > :
> > > With ucs you can run dual server and split the disk.  30 drives per
> node.
> > > Better density and easier to manage.
> > I don't think I got your point. Can you please explain it in more
> details?
>
> I think he means that the 60 bays can be zoned, so you end up with one
> physical JBOD split into two logical 30-disk JBODs, each connected to a
> different server. What this does to your failure domains is another
> question.
>
> >
> > And again - is dual Xeon's power enough for 60-disk node and Erasure
> Code?
>
> I would imagine yes, but you would most likely need to go for the
> 12-18 core versions with a high clock. These are serious . I don't know
> at what point this becomes more expensive than 12 disk nodes with "cheap"
> Xeon-D's or Xeon E3's.
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 09:23:11 - Nick Fisk wrote:

> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > Of Christian Balzer
> > Sent: 17 February 2016 04:22
> > To: ceph-users@lists.ceph.com
> > Cc: Piotr Wachowicz 
> > Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
> which is
> > better?
> > 
[snip]
> > > I'm sure both approaches have their own merits, and might be better
> > > for some specific tasks, but with all other things being equal, I
> > > would expect that using SSDs as the "Writeback" cache tier should, on
> > > average, provide better performance than suing the same SSDs for
> > Journals.
> > > Specifically in the area of read throughput/latency.
> > >
> > Cache tiers (currently) work only well if all your hot data fits into
> them.
> > In which case you'd even better off with with a dedicated SSD pool for
> that
> > data.
> > 
> > Because (currently) Ceph has to promote a full object (4MB by default)
> > to the cache for each operation, be it read or or write.
> > That means the first time you want to read a 2KB file in your RBD
> > backed
> VM,
> > Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
> > This has of course a significant impact on read performance, in my
> > crappy
> test
> > cluster reading cold data is half as fast as using the actual
> > non-cached
> HDD
> > pool.
> > 
> 
> Just a FYI, there will most likely be several fixes/improvements going
> into Jewel which will address most of these problems with caching.
> Objects will now only be promoted if they are hit several
> times(configurable) and, if it makes it in time, a promotion throttle to
> stop too many promotions hindering cluster performance.
> 
Ah, both of these would be very nice indeed, especially since the first
one is something that's supposedly already present (but broken).

The 2nd one, if done right, will be probably a game changer.
Robert LeBlanc and me will be most pleased. 

> However in the context of this thread, Christian is correct, SSD journals
> first and then caching if needed.
>
Yeah, thus my overuse of "currently". ^o^

Christian 
> 
> > And once your cache pool has to evict objects because it is getting
> > full,
> it has
> > to write out 4MB for each such object to the HDD pool.
> > Then read it back in later, etc.
> > 
> > > The main difference, I suspect, between the two approaches is that in
> > > the case of multiple HDDs (multiple ceph-osd processes), all of those
> > > processes share access to the same shared SSD storing their journals.
> > > Whereas it's likely not the case with Cache tiering, right? Though I
> > > must say I failed to find any detailed info on this. Any
> > > clarification will be appreciated.
> > >
> > In your specific case writes to the OSDs (HDDs) will be (at least) 50%
> slower if
> > your journals are on disk instead of the SSD.
> > (Which SSDs do you plan to use anyway?)
> > I don't think you'll be happy with the resulting performance.
> > 
> > Christian.
> > 
> > > So, is the above correct, or am I missing some pieces here? Any other
> > > major differences between the two approaches?
> > >
> > > Thanks.
> > > P.
> > 
> > 
> > --
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Adding multiple OSDs to existing cluster

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 11:18:40 + Ed Rowley wrote:

> Hi,
> 
> We have been running Ceph in production for a few months and looking
> at our first big expansion. We are going to be adding 8 new OSDs
> across 3 hosts to our current cluster of 13 OSD across 5 hosts. We
> obviously want to minimize the amount of disruption this is going to
> cause but we are unsure about the impact on the crush map as we add
> each OSD.
> 
So you are adding new hosts as well?

> From the docs I can see that an OSD is added as 'in' and 'down' and
> wont get objects until the OSD service has started. But what happens
> to the crushmap while the OSD is 'down', is it recalculated? are
> objects misplaced and moved on the existing cluster?
> 
Yes, even more so when adding hosts (well, the first OSD on a new host).

Find my "Storage node refurbishing, a  "freeze" OSD feature would be nice" 
thread in the ML archives.

Christian

> We think we would like to limit the rebuild of the crush map, is this
> possible or beneficial.
> 
> Thanks,
> 
> Ed Rowley
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Adding multiple OSDs to existing cluster

2016-02-17 Thread Ed Rowley
Hi,

We have been running Ceph in production for a few months and looking
at our first big expansion. We are going to be adding 8 new OSDs
across 3 hosts to our current cluster of 13 OSD across 5 hosts. We
obviously want to minimize the amount of disruption this is going to
cause but we are unsure about the impact on the crush map as we add
each OSD.

From the docs I can see that an OSD is added as 'in' and 'down' and
wont get objects until the OSD service has started. But what happens
to the crushmap while the OSD is 'down', is it recalculated? are
objects misplaced and moved on the existing cluster?

We think we would like to limit the rebuild of the crush map, is this
possible or beneficial.

Thanks,

Ed Rowley
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-02-17 Thread Christian Balzer

Hello,

On Wed, 17 Feb 2016 10:04:11 +0100 Piotr Wachowicz wrote:

> Thanks for your reply.
> 
> 
> > > Let's consider both cases:
> > > Journals on SSDs - for writes, the write operation returns right
> > > after data lands on the Journal's SSDs, but before it's written to
> > > the backing HDD. So, for writes, SSD journal approach should be
> > > comparable to having a SSD cache tier.
> > Not quite, see below.
> >
> >
> Could you elaborate a bit more?
> 
> Are you saying that with a Journal on a SSD writes from clients, before
> they can return from the operation to the client, must end up on both the
> SSD (Journal) *and* HDD (actual data store behind that journal)? 

No, your initial statement is correct. 

However that burst of speed doesn't last indefinitely. 

Aside from the size of the journal (which is incidentally NOT the most
limiting factor) there are various "filestore" parameters in Ceph, in
particular the sync interval ones. 
There was a more in-depth explanation by a developer about this in this ML,
try your google-foo. 

For short bursts of activity, the journal helps a LOT.
If you send a huge number of for example 4KB writes to your cluster, the
speed will eventually (after a few seconds) go down to what your backing
storage (HDDs) are capable of sustaining.
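For reference, the knobs in question live in ceph.conf under [osd]; the values
below are, as far as I recall, the stock defaults, so treat them only as a
starting point for experiments:

  [osd]
  filestore min sync interval = 0.01
  filestore max sync interval = 5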

> I was
> under the impression that one of the benefits of having a journal on a
> SSD is deferring the write to the slow HDD to a later time, until after
> the write call returns to the client. Is that not the case? If so, that
> would mean SSD cache tier should be much faster in terms of write
> latency than SSD journal.
> 
> 
> > In your specific case writes to the OSDs (HDDs) will be (at least) 50%
> > slower if your journals are on disk instead of the SSD.
> >
> 
> Is that because of the above -- with Journal on the same disk (HDD) as
> the data, writes have to be written twice (assuming no btrfs/zfs cow) to
> the HDD (journal, and data). Whereas with a Journal on the SSD write to
> the Journal and disk can be done in parallel with write to the HDD? 
Yes, as far as the doubling of the I/O and thus the halving of speed is
concerned. Even with disk based journals the ACK of course happens when
ALL journal OSDs have done their writing. 

>(But
> still both of those have to be completed before the write operation
> returns to the client).
>
See above, eventually, kind-a-sorta.  
> 
> 
> > (Which SSDs do you plan to use anyway?)
> >
> 
> Intel DC S3700
> 
Good choice, with the 200GB model prefer the 3700 over the 3710 (higher
sequential write speed).
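If you want to double-check a journal candidate yourself, the usual quick test is
a small synchronous direct write against the raw device (destructive, so only on
an empty disk; the device name is a placeholder):

  dd if=/dev/zero of=/dev/sdX bs=4k count=10000 oflag=direct,dsync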

Christian
> 
> Thanks,
> Piotr


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 9.2.0 mds cluster went down and now constantly crashes with Floating point exception

2016-02-17 Thread Kenneth Waegeman



On 05/02/16 11:43, John Spray wrote:

On Fri, Feb 5, 2016 at 9:36 AM, Kenneth Waegeman
 wrote:


On 04/02/16 16:17, Gregory Farnum wrote:

On Thu, Feb 4, 2016 at 1:42 AM, Kenneth Waegeman
 wrote:

Hi,

Hi, we are running ceph 9.2.0.
Overnight, our ceph state went to 'mds mds03 is laggy' . When I checked
the
logs, I saw this mds crashed with a stacktrace. I checked the other mdss,
and I saw the same there.
When I try to start the mds again, I get again a stacktrace and it won't
come up:

   -12> 2016-02-04 10:23:46.837131 7ff9ea570700  1 --
10.141.16.2:6800/193767 <== osd.146 10.141.16.25:6800/7036 1 
osd_op_reply(207 15ef982. [stat] v0'0 uv22184 ondisk = 0) v6
 187+0+16 (113
2261152 0 506978568) 0x7ffa171ae940 con 0x7ffa189cc3c0
  -11> 2016-02-04 10:23:46.837317 7ff9ed6a1700  1 --
10.141.16.2:6800/193767 <== osd.136 10.141.16.24:6800/6764 6 
osd_op_reply(209 148aaac. [delete] v0'0 uv23797 ondisk = -2
((2)
No such file o
r directory)) v6  187+0+0 (64699207 0 0) 0x7ffa171acb00 con
0x7ffa014fd9c0
  -10> 2016-02-04 10:23:46.837406 7ff9ec994700  1 --
10.141.16.2:6800/193767 <== osd.36 10.141.16.14:6800/5395 5 
osd_op_reply(175 15f631f. [stat] v0'0 uv22466 ondisk = 0) v6
 187+0+16 (1037
61047 0 2527067705) 0x7ffa08363700 con 0x7ffa189ca580
   -9> 2016-02-04 10:23:46.837463 7ff9eba85700  1 --
10.141.16.2:6800/193767 <== osd.47 10.141.16.15:6802/7128 2 
osd_op_reply(211 148aac8. [delete] v0'0 uv22990 ondisk = -2
((2)
No such file or
directory)) v6  187+0+0 (1138385695 0 0) 0x7ffa01cd0dc0 con
0x7ffa189cadc0
   -8> 2016-02-04 10:23:46.837468 7ff9eb27d700  1 --
10.141.16.2:6800/193767 <== osd.16 10.141.16.12:6800/5739 2 
osd_op_reply(212 148aacd. [delete] v0'0 uv23991 ondisk = -2
((2)
No such file or
directory)) v6  187+0+0 (1675093742 0 0) 0x7ffa171ac840 con
0x7ffa189cb760
   -7> 2016-02-04 10:23:46.837477 7ff9eab76700  1 --
10.141.16.2:6800/193767 <== osd.66 10.141.16.17:6800/6353 2 
osd_op_reply(210 148aab9. [delete] v0'0 uv24583 ondisk = -2
((2)
No such file or
directory)) v6  187+0+0 (603192739 0 0) 0x7ffa19054680 con
0x7ffa189cbce0
   -6> 2016-02-04 10:23:46.838140 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 43 
osd_op_reply(121 200.9d96 [write 1459360~980] v943'4092 uv4092 ondisk
=
0) v6  179+0+0 (3939130488 0 0) 0x7ffa01590100 con 0x7ffa014fab00
   -5> 2016-02-04 10:23:46.838342 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 44 
osd_op_reply(124 200.9d96 [write 1460340~956] v943'4093 uv4093 ondisk
=
0) v6  179+0+0 (1434265886 0 0) 0x7ffa01590100 con 0x7ffa014fab00
   -4> 2016-02-04 10:23:46.838531 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 45 
osd_op_reply(126 200.9d96 [write 1461296~954] v943'4094 uv4094 ondisk
=
0) v6  179+0+0 (25292940 0 0) 0x7ffa01590100 con 0x7ffa014fab00
   -3> 2016-02-04 10:23:46.838700 7ff9ecd98700  1 --
10.141.16.2:6800/193767 <== osd.57 10.141.16.16:6802/7067 3 
osd_op_reply(199 15ef976. [stat] v0'0 uv22557 ondisk = 0) v6
 187+0+16 (354652996 0 2244692791) 0x7ffa171ade40 con 0x7ffa189ca160
   -2> 2016-02-04 10:23:46.839301 7ff9ed8a3700  1 --
10.141.16.2:6800/193767 <== osd.107 10.141.16.21:6802/7468 3 
osd_op_reply(115 1625476. [stat] v0'0 uv22587 ondisk = 0) v6
 187+0+16 (664308076 0 998461731) 0x7ffa08363c80 con 0x7ffa014fdb20
   -1> 2016-02-04 10:23:46.839322 7ff9f0bcf700  1 --
10.141.16.2:6800/193767 <== osd.2 10.141.16.2:6802/126856 46 
osd_op_reply(128 200.9d96 [write 1462250~954] v943'4095 uv4095 ondisk
=
0) v6  179+0+0 (1379768629 0 0) 0x7ffa01590100 con 0x7ffa014fab00
0> 2016-02-04 10:23:46.839379 7ff9f30d8700 -1 *** Caught signal
(Floating point exception) **
in thread 7ff9f30d8700

ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
1: (()+0x4b6fa2) [0x7ff9fd091fa2]
2: (()+0xf100) [0x7ff9fbfd3100]
3: (StrayManager::_calculate_ops_required(CInode*, bool)+0xa2)
[0x7ff9fcf0adc2]
4: (StrayManager::enqueue(CDentry*, bool)+0x169) [0x7ff9fcf10459]
5: (StrayManager::__eval_stray(CDentry*, bool)+0xa49) [0x7ff9fcf111c9]
6: (StrayManager::eval_stray(CDentry*, bool)+0x1e) [0x7ff9fcf113ce]
7: (MDCache::scan_stray_dir(dirfrag_t)+0x13d) [0x7ff9fce6741d]
8: (MDSInternalContextBase::complete(int)+0x1e3) [0x7ff9fcff4993]
9: (MDSRank::_advance_queues()+0x382) [0x7ff9fcdd4652]
10: (MDSRank::ProgressThread::entry()+0x4a) [0x7ff9fcdd4aca]
11: (()+0x7dc5) [0x7ff9fbfcbdc5]
12: (clone()+0x6d) [0x7ff9faeb621d]

Does someone have an idea? We can't use our fs right now.

Hey, fun! Just looking for FPE opportunities in that function, it
looks like someone managed to set either the object size or stripe
count to 0 

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 17 February 2016 02:41
> To: ceph-users 
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> 
> Hello,
> 
> On Tue, 16 Feb 2016 16:39:06 +0800 Василий Ангапов wrote:
> 
> > Nick, Tyler, many thanks for very helpful feedback!
> > I spent many hours meditating on the following two links:
> > http://www.supermicro.com/solutions/storage_ceph.cfm
> > http://s3s.eu/cephshop
> >
> > 60- or even 72-disk nodes are very capacity-efficient, but will the 2
> > CPUs (even the fastest ones) be enough to handle Erasure Coding?
> >
> Depends.
> Since you're doing sequential writes (and reads I assume as you're dealing
> with videos), CPU usage is going to be a lot lower than with random, small
> 4KB block I/Os.
> So most likely, yes.

That was my initial thought, but reading that paper I linked, the 4MB tests are 
the ones that bring the CPU's to their knees. I think the erasure calculation 
is a large part of the overall CPU usage and more data with the larger IO's 
causes a significant increase in CPU requirements.

Correct me if I'm wrong, but I recall Christian, that your cluster is a full 
SSD cluster? I think we touched on this before, that the ghz per OSD is 
probably more like 100mhz per IOP. In a spinning disk cluster, you effectively 
have a cap on the number of IOs you can serve before the disks max out. So the 
difference between large and small IO's is not that great. But on a SSD cluster 
there is no cap and so you just end up with more IO's, hence the higher CPU.

> 
> > Also as Nick stated with 4-5 nodes I cannot use high-M "K+M"
> > combinations. I've done some calculations and found that the most
> > efficient and safe configuration is to use 10 nodes with 29*6TB SATA
> > and 7*200GB S3700 for journals. Assuming 6+3 EC profile that will give
> > me
> > 1.16 PB of effective space. Also I prefer not to use precious NVMe
> > drives. Don't see any reason to use them.
> >
> This is probably your best way forward, dense is nice and cost saving, but
> comes with a lot of potential gotchas.
> Dense and large clusters can work, dense and small not so much.
> 
> > But what about RAM? Can I go with 64GB per node with above config?
> > I've seen OSDs are consuming not more than 1GB RAM for replicated
> > pools (even 6TB ones). But what is the typical memory usage of EC
> > pools? Does anybody know that?
> >
> Above config (29 OSDs) that would be just about right.
> I always go with at least 2GB RAM per OSD, since during a full node restart
> and the consecutive peering OSDs will grow large, a LOT larger than their
> usual steady state size.
> RAM isn't that expensive these days and additional RAM comes in very
> handy when used for pagecache and SLAB (dentry) stuff.
> 
> Something else to think about in your specific use case is to have RAID'ed
> OSDs.
> It's a bit of zero sum game probably, but compare the above config with this.
> 11 nodes, each with:
> 34 6TB SATAs (2x 17HDDs RAID6)
> 2 200GB S3700 SSDs (journal/OS)
> Just 2 OSDs per node.
> Ceph with replication of 2.
> Just shy of 1PB of effective space.
> 
> Minus: More physical space, less efficient HDD usage (replication vs. EC).
> 
> Plus: A lot less expensive SSDs, less CPU and RAM requirements, smaller
> impact in case of node failure/maintenance.
> 
> No ideas about the stuff below.
> 
> Christian
> > Also, am I right that for a 6+3 EC profile I need at least 10 nodes to
> > feel comfortable (one extra node for redundancy)?
> >
> > And finally can someone recommend what EC plugin to use in my case? I
> > know it's a difficult question but anyway?
> >
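For what it's worth, the default plugin is jerasure, and a 6+3 profile with a
host-level failure domain is declared roughly like this on a pre-Jewel cluster
(profile name, pool name and PG count below are placeholders):

  ceph osd erasure-code-profile set ec-6-3 k=6 m=3 plugin=jerasure ruleset-failure-domain=host
  ceph osd pool create ecpool 2048 2048 erasure ec-6-3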
> >
> >
> >
> >
> >
> >
> >
> >
> > 2016-02-16 16:12 GMT+08:00 Nick Fisk :
> > >
> > >
> > >> -Original Message-
> > >> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > >> Behalf Of Tyler Bishop
> > >> Sent: 16 February 2016 04:20
> > >> To: Василий Ангапов 
> > >> Cc: ceph-users 
> > >> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW
> > >> with Erasure Code
> > >>
> > >> You should look at a 60 bay 4U chassis like a Cisco UCS C3260.
> > >>
>> We run 4 systems at 56x6TB with dual E5-2660 v2 and 256GB RAM.
> > >> Performance is excellent.
> > >
> > > Only thing I will say to the OP, is that if you only need 1PB, then
> > > likely 4-5 of these will give you enough capacity. Personally I
> > > would prefer to spread the capacity around more nodes. If you are
> > > doing anything serious with Ceph its normally a good idea to try and
> > > make each node no more than 10% of total capacity. Also with Ec
> > > pools you will be limited to the K+M combo's you can achieve with
> > > smaller number of nodes.
> > >
> > >>
> > >> I would recommend a cache tier for sure if your data is busy for
> > >> reads.
> > >>
> 

Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-02-17 Thread Nick Fisk
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 17 February 2016 04:22
> To: ceph-users@lists.ceph.com
> Cc: Piotr Wachowicz 
> Subject: Re: [ceph-users] SSDs for journals vs SSDs for a cache tier,
which is
> better?
> 
> 
> Hello,
> 
> On Tue, 16 Feb 2016 18:56:43 +0100 Piotr Wachowicz wrote:
> 
> > Hey,
> >
> > Which one's "better": to use SSDs for storing journals, vs to use them
> > as a writeback cache tier? All other things being equal.
> >
> Pears are better than either oranges or apples. ^_-
> 
> > The usecase is a 15 osd-node cluster, with 6 HDDs and 1 SSDs per node.
> > Used for block storage for a typical 20-hypervisor OpenStack cloud
> > (with bunch of VMs running Linux). 10GigE public net + 10 GigE
> > replication network.
> >
> > Let's consider both cases:
> > Journals on SSDs - for writes, the write operation returns right after
> > data lands on the Journal's SSDs, but before it's written to the
> > backing HDD. So, for writes, SSD journal approach should be comparable
> > to having a SSD cache tier.
> Not quite, see below.
> 
> > In both cases we're writing to an SSD (and to replica's SSDs), and
> > returning to the client immediately after that.
> > Data is only flushed to HDD later on.
> >
> Correct, note that the flushing is happening by the OSD process submitting
> this write to the underlying device/FS.
> It doesn't go from the journal to the OSD storage device, which has the
> implication that with default settings and plain HDDs you quickly wind up
> being being limited to what your actual HDDs can handle in a sustained
> manner.
> 
> >
> > However for reads (of hot data) I would expect a SSD Cache Tier to be
> > faster/better. That's because, in the case of having journals on SSDs,
> > even if data is in the journal, it's always read from the (slow)
> > backing disk anyway, right? But with a SSD cache tier, if the data is
> > hot, it would be read from the (fast) SSD.
> >
> It will be read from the even faster pagecache if it is a sufficiently hot
object
> and you have sufficient RAM.
> 
> > I'm sure both approaches have their own merits, and might be better
> > for some specific tasks, but with all other things being equal, I
> > would expect that using SSDs as the "Writeback" cache tier should, on
> > average, provide better performance than suing the same SSDs for
> Journals.
> > Specifically in the area of read throughput/latency.
> >
> Cache tiers (currently) work only well if all your hot data fits into
them.
> In which case you'd even better off with with a dedicated SSD pool for
that
> data.
> 
> Because (currently) Ceph has to promote a full object (4MB by default) to
> the cache for each operation, be it read or or write.
> That means the first time you want to read a 2KB file in your RBD backed
VM,
> Ceph has to copy 4MB from the HDD pool to the SSD cache tier.
> This has of course a significant impact on read performance, in my crappy
test
> cluster reading cold data is half as fast as using the actual non-cached
HDD
> pool.
> 

Just a FYI, there will most likely be several fixes/improvements going into
Jewel which will address most of these problems with caching. Objects will
now only be promoted if they are hit several times(configurable) and, if it
makes it in time, a promotion throttle to stop too many promotions hindering
cluster performance.

However in the context of this thread, Christian is correct, SSD journals
first and then caching if needed.


> And once your cache pool has to evict objects because it is getting full,
it has
> to write out 4MB for each such object to the HDD pool.
> Then read it back in later, etc.
> 
> > The main difference, I suspect, between the two approaches is that in
> > the case of multiple HDDs (multiple ceph-osd processes), all of those
> > processes share access to the same shared SSD storing their journals.
> > Whereas it's likely not the case with Cache tiering, right? Though I
> > must say I failed to find any detailed info on this. Any clarification
> > will be appreciated.
> >
> In your specific case writes to the OSDs (HDDs) will be (at least) 50%
slower if
> your journals are on disk instead of the SSD.
> (Which SSDs do you plan to use anyway?)
> I don't think you'll be happy with the resulting performance.
> 
> Christian.
> 
> > So, is the above correct, or am I missing some pieces here? Any other
> > major differences between the two approaches?
> >
> > Thanks.
> > P.
> 
> 
> --
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Recomendations for building 1PB RadosGW with Erasure Code

2016-02-17 Thread Nick Fisk
Thanks for posting your experiences John, very interesting read. I think the 
golden rule of around 1Ghz is still a realistic goal to aim for. It looks like 
you probably have around 16ghz for 60OSD's, or 0.26Ghz per OSD. Do you have any 
idea on how much CPU you think you would need to just be able to get away with 
it?

I have 24Ghz for 12 OSD's (2x2620v2) and I typically don't see CPU usage over 
about 20%, which indicates to me the bare minimum for a replicated pool is 
probably around 0.5Ghz per 7.2k rpm OSD. The next nodes we have will certainly 
have less CPU.

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> John Hogenmiller
> Sent: 17 February 2016 03:04
> To: Василий Ангапов ; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Recomendations for building 1PB RadosGW with
> Erasure Code
> 
> Turns out i didn't do reply-all.
> 
> On Tue, Feb 16, 2016 at 9:18 AM, John Hogenmiller 
> wrote:
> > And again - is dual Xeon's power enough for 60-disk node and Erasure
> Code?
> 
> 
> This is something I've been attempting to determine as well. I'm not yet
> getting
> I'm testing with some white-label hardware, but essentially supermicro
> 2twinu's with a pair of E5-2609 Xeons and 64GB of
> memory.  (http://www.supermicro.com/products/system/2U/6028/SYS-
> 6028TR-HTFR.cfm). This is attached to DAEs with 60 x 6TB drives, in JBOD.
> 
> Conversely, Supermicro sells a 72-disk OSD node, which Redhat considers a
> supported "reference architecture" device. The processors in those nodes
> are E5-269 12-core, vs what I have which is quad-
> core. http://www.supermicro.com/solutions/storage_ceph.cfm  (SSG-
> 6048R-OSD432). I would highly recommend reflecting on the supermicro
> hardware and using that as your reference as well. If you could get an eval
> unit, use that to compare with the hardware you're working with.
> 
> I currently have mine setup with 7 nodes, 60 OSDs each, radosgw running
> one each node, and 5 ceph monitors. I plan to move the monitors to their
> own dedicated hardware, and in reading, I may only need 3 to manage the
> 420 OSDs.   I am currently just setup for replication instead of EC, though I
> want to redo this cluster to use EC. Also, I am still trying to work out how
> much of an impact placement groups have on performance, and I may have a
> performance-hampering amount..
> 
> We test the system using locust speaking S3 to the radosgw. Transactions are
> distributed equally across all 7 nodes and we track the statistics. We started
> first emulating 1000 users and got over 4Gbps, but load average on all nodes
> was in the mid-100s, and after 15 minutes we started getting socket
> timeouts. We stopped the test, let load settle, and started back at 100
> users.  We've been running this test about 5 days now.  Load average on all
> nodes floats between 40 and 70. The nodes with ceph-mon running on them
> do not appear to be taxed any more than the ones without. The radosgw
> itself seems to take up a decent amount of cpu (running civetweb, no
> ssl).  iowait is non existent, everything appears to be cpu bound.
> 
> At 1000 users, we had 4.3Gbps of PUTs and 2.2Gbps of GETs. Did not capture
> the TPS on that short test.
> At 100 users, we're pushing 2Gbps  in PUTs and 1.24Gpbs in GETs. Averaging
> 115 TPS.
> 
> All in all, the speeds are not bad for a single rack, but the CPU utilization 
> is a
> big concern. We're currently using other (proprietary) object storage
> platforms on this hardware configuration. They have their own set of issues,
> but CPU utilization is typically not the problem, even at higher utilization.
> 
> 
> 
> root@ljb01:/home/ceph/rain-cluster# ceph status
> cluster 4ebe7995-6a33-42be-bd4d-20f51d02ae45
>  health HEALTH_OK
>  monmap e5: 5 mons at {hail02-r01-06=172.29.4.153:6789/0,hail02-r01-
> 08=172.29.4.155:6789/0,rain02-r01-01=172.29.4.148:6789/0,rain02-r01-
> 03=172.29.4.150:6789/0,rain02-r01-04=172.29.4.151:6789/0}
> election epoch 86, quorum 0,1,2,3,4 
> rain02-r01-01,rain02-r01-03,rain02-
> r01-04,hail02-r01-06,hail02-r01-08
>  osdmap e2543: 423 osds: 419 up, 419 in
> flags sortbitwise
>   pgmap v676131: 33848 pgs, 14 pools, 50834 GB data, 29660 kobjects
> 149 TB used, 2134 TB / 2284 TB avail
>33848 active+clean
>   client io 129 MB/s rd, 182 MB/s wr, 1562 op/s
> 
> 
> 
>  # ceph-osd + ceph-mon + radosgw
> top - 13:29:22 up 40 days, 22:05,  1 user,  load average: 47.76, 47.33, 47.08
> Tasks: 1001 total,   7 running, 994 sleeping,   0 stopped,   0 zombie
> %Cpu(s): 39.2 us, 44.7 sy,  0.0 ni,  9.9 id,  2.4 wa,  0.0 hi,  3.7 si,  0.0 
> st
> KiB Mem:  65873180 total, 64818176 used,  1055004 free, 9324 buffers
> KiB Swap:  8388604 total,  7801828 used,   586776 free. 17610868 cached Mem
> 
>     PID USER      PR  NI    VIRT    RES    SHR S  %CPU %MEM     TIME+ COMMAND
>  178129 ceph  

Re: [ceph-users] Hammer OSD crash during deep scrub

2016-02-17 Thread Maksym Krasilnikov
Hello!

On Wed, Feb 17, 2016 at 07:38:15AM +, ceph.user wrote:

>  ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
>  1: /usr/bin/ceph-osd() [0xbf03dc]
>  2: (()+0xf0a0) [0x7f29e4c4d0a0]
>  3: (gsignal()+0x35) [0x7f29e35b7165]
>  4: (abort()+0x180) [0x7f29e35ba3e0]
>  5: (__gnu_cxx::__verbose_terminate_handler()+0x11d) [0x7f29e3e0d89d]
>  6: (()+0x63996) [0x7f29e3e0b996]
>  7: (()+0x639c3) [0x7f29e3e0b9c3]
>  8: (()+0x63bee) [0x7f29e3e0bbee]
>  9: (ceph::__ceph_assert_fail(char const*,
>  char const*, int, char const*)+0x220) [0xcddda0]
>  10: (FileStore::read(coll_t, ghobject_t const&, unsigned long,
>  unsigned long, ceph::buffer::list&, unsigned int, bool)+0x8cb) [0xa296cb]
>  11: (ReplicatedBackend::be_deep_scrub(hobject_t const&,
>  unsigned int, ScrubMap::object&, ThreadPool::TPHandle&)+0x287) [0xb1a527]
>  12: (PGBackend::be_scan_list(ScrubMap&, std::vector  std::allocator > const&, bool, unsigned int,
>  ThreadPool::TPHandle&)+0x52c) [0x9f8ddc]
>  13: (PG::build_scrub_map_chunk(ScrubMap&, hobject_t, hobject_t,
>  bool, unsigned int, ThreadPool::TPHandle&)+0x124) [0x910ee4]
>  14: (PG::replica_scrub(MOSDRepScrub*, ThreadPool::TPHandle&)+0x481)
>  [0x9116d1]
>  15: (OSD::RepScrubWQ::_process(MOSDRepScrub*, ThreadPool::TPHandle&)+0xf4)
>  [0x8119f4]
>  16: (ThreadPool::worker(ThreadPool::WorkThread*)+0x629) [0xccfd69]
>  17: (ThreadPool::WorkThread::entry()+0x10) [0xcd0f70]
>  18: (()+0x6b50) [0x7f29e4c44b50]
>  19: (clone()+0x6d) [0x7f29e366095d]

> Looks like an IO error during read maybe,
> only nothing logged in syslog messages at the time.
> But currently this drive shows predictive error status
> in the RAID controller, so maybe...

I have an issue like yours:

 ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)
 1: (()+0x6149ea) [0x55944d5669ea]
 2: (()+0x10340) [0x7f6ff4271340]
 3: (gsignal()+0x39) [0x7f6ff2710cc9]
 4: (abort()+0x148) [0x7f6ff27140d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7f6ff301b535]
 6: (()+0x5e6d6) [0x7f6ff30196d6]
 7: (()+0x5e703) [0x7f6ff3019703]
 8: (()+0x5e922) [0x7f6ff3019922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x278) [0x55944d65f368]
 10: (SnapSet::get_clone_bytes(snapid_t) const+0xb6) [0x55944d2d0306]
 11: (ReplicatedPG::_scrub(ScrubMap&, std::map > > const&)+0xa1c) [0x55944d3af3cc]
 12: (PG::scrub_compare_maps()+0xec9) [0x55944d31ed19]
 13: (PG::chunky_scrub(ThreadPool::TPHandle&)+0x1ee) [0x55944d321dce]
 14: (PG::scrub(ThreadPool::TPHandle&)+0x2ee) [0x55944d32374e]
 15: (OSD::ScrubWQ::_process(PG*, ThreadPool::TPHandle&)+0x19) [0x55944d207fa9]
 16: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0x55944d64fd66]
 17: (ThreadPool::WorkThread::entry()+0x10) [0x55944d650e10]
 18: (()+0x8182) [0x7f6ff4269182]
 19: (clone()+0x6d) [0x7f6ff27d447d]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to
interpret this.

This appears when scrubbing, deep-scrubbing or repairing PG 5.ca. I can
reproduce it every time.

I tried removing and re-creating the OSD, but it did not help.

Now I'm going to check the OSD's filesystem, but I see neither strange
messages in syslog nor SMART errors for this drive.
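
For reference, narrowing this down usually comes down to mapping the PG to
its OSDs and then checking the disk underneath; the commands below are only
a sketch (the PG id is the one from above, the OSD id and device paths are
placeholders):

  ceph pg map 5.ca                   # which OSDs serve the affected PG?
  ceph pg deep-scrub 5.ca            # reproduce the crash on demand
  # on the node hosting the suspect OSD (osd.12 and /dev/sdX are examples):
  smartctl -a /dev/sdX
  ceph osd set noout
  service ceph stop osd.12           # init-system dependent
  umount /var/lib/ceph/osd/ceph-12
  xfs_repair -n /dev/sdX1            # read-only check, assuming an XFS data partition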

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs for journals vs SSDs for a cache tier, which is better?

2016-02-17 Thread Piotr Wachowicz
Thanks for your reply.


> > Let's consider both cases:
> > Journals on SSDs - for writes, the write operation returns right after
> > data lands on the Journal's SSDs, but before it's written to the backing
> > HDD. So, for writes, SSD journal approach should be comparable to having
> > a SSD cache tier.
> Not quite, see below.
>
>
Could you elaborate a bit more?

Are you saying that with a journal on an SSD, writes from clients must end
up on both the SSD (journal) *and* the HDD (the actual data store behind
that journal) before the operation can return to the client? I was under
the impression that one of the benefits of having the journal on an SSD is
deferring the write to the slow HDD until after the write call returns to
the client. Is that not the case? If so, that would mean an SSD cache tier
should be much faster in terms of write latency than an SSD journal.


> In your specific case writes to the OSDs (HDDs) will be (at least) 50%
> slower if your journals are on disk instead of the SSD.
>

Is that because of the above: with the journal on the same disk (HDD) as
the data, writes have to hit the HDD twice (journal and data, assuming no
btrfs/zfs CoW), whereas with the journal on an SSD the journal write can
happen in parallel with the write to the HDD? (But still, both of those
have to complete before the write operation returns to the client.)
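
(As an aside, whether a given OSD's journal actually lives on the SSD or on
the data disk is easy to check from the journal symlink; the paths below
are just examples:)

  ls -l /var/lib/ceph/osd/ceph-*/journal    # should point at SSD partitions
  ceph-disk list                            # shows the data/journal pairing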



> (Which SSDs do you plan to use anyway?)
>

Intel DC S3700
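
(A common sanity check for a journal candidate is a small synchronous 4k
write test, since that is roughly the journal's IO pattern; the file path,
size and runtime below are placeholders:)

  fio --name=journal-test --filename=/mnt/ssd/fio.tmp --size=1G \
      --bs=4k --rw=write --direct=1 --sync=1 --numjobs=1 --iodepth=1 \
      --runtime=60 --time_based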


Thanks,
Piotr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Infernalis sortbitwise flag

2016-02-17 Thread Markus Blank-Burian
Hi,

 

I recently saw that a new osdmap is created with the sortbitwise flag. Can
this safely be enabled on an existing cluster, and would there be any
advantages in doing so?
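
(For context, the flag is toggled cluster-wide like any other osdmap flag;
a minimal sketch, assuming it is only run after checking the release notes
for the implications:)

  ceph osd set sortbitwise       # enable
  ceph osd dump | grep flags     # verify
  ceph osd unset sortbitwise     # roll back if needed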

 

Markus

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance issues related to scrubbing

2016-02-17 Thread Christian Balzer

Hello,

On Tue, 16 Feb 2016 10:46:32 -0800 Cullen King wrote:

> Thanks for the helpful commentary Christian. The cluster is performing much
> better with 50% more spindles (12 to 18 drives), along with setting scrub
> sleep to 0.1. Didn't really see any gain from moving from the Samsung 850
> Pro journal drives to Intel 3710's, even though dd and other direct tests
> of the drives yielded much better results. rados bench results with 4k
> requests are still awfully low. I'll figure that problem out next.
> 
Got examples, numbers, watched things with atop?
4KB rados benches are what can make my CPUs melt on the cluster here
that's most similar to yours. ^o^
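
(For anyone wanting to compare numbers: the scrub sleep and a 4k bench
mentioned above are typically run along these lines; the pool name, sleep
value and thread count are placeholders:)

  ceph tell osd.* injectargs '--osd_scrub_sleep 0.1'
  rados -p testpool bench 60 write -b 4096 -t 16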

> I ended up bumping the number of placement groups from 512 to 1024,
> which should help a little. Basically it changes the worst-case scrub
> behaviour so that the load is spread a little more evenly across
> drives rather than concentrated on a single drive for longer.
> 
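(The pg_num bump is the usual two-step change; the pool name below is an
example, and note that pg_num can only ever be increased:)

  ceph osd pool set .rgw.buckets pg_num 1024
  ceph osd pool set .rgw.buckets pgp_num 1024   # pgp_num has to follow pg_num
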
Of course with osd_max_scrubs at its default of 1 there should never be
more than one scrub per OSD. 
However, I seem to vaguely remember that this limit applies per "primary"
scrub, so in the case of deep scrubs there could still be plenty of
contention going on.
Again, I've always had good success with a manually kicked-off scrub
of all OSDs.
It seems to sequence things nicely and finishes within 4 hours on my
"good" production cluster.

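(A sketch of what such a manual kick-off can look like; the config check
targets one example OSD and assumes access to its admin socket:)

  ceph daemon osd.0 config get osd_max_scrubs
  for id in $(ceph osd ls); do ceph osd deep-scrub "$id"; done
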
> I think the real solution here is to create a secondary SSD pool, pin
> some radosgw buckets to it and put my thumbnail data on the smaller,
> faster pool. I'll reserve the spindle based pool for original high res
> photos, which are only read to create thumbnails when necessary. This
> should put the majority of my random read IO on SSDs, and thumbnails
> average 50kb each, so it shouldn't be too spendy. I am considering trying
> the newer Samsung SM863 drives as we are read-heavy; any potential data
> loss on this thumbnail pool would not be catastrophic.
> 
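(For what it's worth, carving out such an SSD pool usually boils down to a
dedicated CRUSH rule plus a pool using it; the rule name, root, pool name,
PG count and ruleset id below are placeholders, and the radosgw placement
wiring is left out:)

  ceph osd crush rule create-simple ssd-rule ssd host
  ceph osd pool create thumbs-ssd 256 256
  ceph osd crush rule dump                        # note the new rule's ruleset id
  ceph osd pool set thumbs-ssd crush_ruleset 1    # id taken from the dump above
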
I seriously detest it when makers don't put their endurance data on the
web page with all the other specifications and make you look things up in
a slightly hidden PDF.
Then they give only the total endurance, leaving you to calculate drive
writes per day yourself. ^o^
Only to find that these have 3 DWPD, which is nothing to be ashamed of
and should be fine for this particular use case.
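(For anyone doing the same conversion, with purely illustrative numbers: a
480 GB drive rated for 2628 TB written over a 5-year warranty works out to
2628 / (0.48 * 5 * 365) = 3 drive writes per day.)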

However take a look at this old posting of mine:
http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html

With that in mind, I'd recommend you do some testing with real-world data
before you invest too much into something that will wear out long before
it has paid for itself.

Christian

> Third, it seems that I am also running into the known "Lots Of Small
> Files" performance issue. Looks like performance in my use case will be
> drastically improved with the upcoming bluestore, though migrating to it
> sounds painful!
> 
> On Thu, Feb 4, 2016 at 7:56 PM, Christian Balzer  wrote:
> 
> >
> > Hello,
> >
> > On Thu, 4 Feb 2016 08:44:25 -0800 Cullen King wrote:
> >
> > > Replies in-line:
> > >
> > > On Wed, Feb 3, 2016 at 9:54 PM, Christian Balzer
> > >  wrote:
> > >
> > > >
> > > > Hello,
> > > >
> > > > On Wed, 3 Feb 2016 17:48:02 -0800 Cullen King wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > I've been trying to nail down a nasty performance issue related
> > > > > to scrubbing. I am mostly using radosgw with a handful of buckets
> > > > > containing millions of various sized objects. When ceph scrubs,
> > > > > both regular and deep, radosgw blocks on external requests, and
> > > > > my cluster has a bunch of requests that have blocked for > 32
> > > > > seconds. Frequently OSDs are marked down.
> > > > >
> > > > From my own (painful) experiences let me state this:
> > > >
> > > > 1. When your cluster runs out of steam during deep-scrubs, drop
> > > > what you're doing and order more HW (OSDs).
> > > > Because this is a sign that it would also be in trouble when doing
> > > > recoveries.
> > > >
> > >
> > > When I've initiated recoveries while working on the hardware, the
> > > cluster hasn't had a problem keeping up. It seems that it only has a
> > > problem with scrubbing, meaning it feels like the IO pattern is
> > > drastically different. I would think that with scrubbing I'd see
> > > something closer to bursty sequential reads, rather than just
> > > thrashing the drives with a more random IO pattern, especially given
> > > our low cluster utilization.
> > >
> > It's probably more pronounced when phasing in/out entire OSDs, where it
> > also has to read the entire (primary) data off it.
> >
> > >
> > > >
> > > > 2. If you cluster is inconvenienced by even mere scrubs, you're
> > > > really in trouble.
> > > > Threaten the penny pincher with bodily violence and have that new
> > > > HW phased in yesterday.
> > > >
> > >
> > > I am the penny pincher, biz owner, dev and ops guy for
> > > http://ridewithgps.com :) More hardware isn't an issue, it just feels
> > >