Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Scott Laird
I think I just solved at least part of the problem.

Because of the somewhat peculiar way that I have Docker configured, docker
instances on another system were being assigned my OSD's IP address,
running for a couple seconds, and then failing (for unrelated reasons).
Effectively, there was something sitting on the network throwing random
RSTs at my TCP connections and then vanishing.

Amazingly, Ceph seems to have been able to handle it *just* well enough to
make it non-obvious that the problem was external and network related.

That doesn't quite explain the issues with local OSDs acting up, though.

For now, I've moved all of my OSDs back to Ubuntu; it's more work to
manage, but on the other hand it's actually working.


Scott

On Tue Nov 18 2014 at 3:14:54 PM Gregory Farnum  wrote:

> It's a little strange, but with just the one-sided log it looks as
> though the OSD is setting up a bunch of connections and then
> deliberately tearing them down again within a second or two (i.e., this
> is not a direct messenger bug, but it might be an OSD one, or it might
> be something else).
> Is it possible that you have some firewalls set up that are allowing
> through some traffic but not others? The OSDs use a bunch of ports and
> it looks like maybe there are at least intermittent issues with them
> heartbeating.
> -Greg
>
> On Wed, Nov 12, 2014 at 11:32 AM, Scott Laird  wrote:
> > Here are the first 33k lines or so:
> > https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt
> >
> > This is a different (but more or less identical) machine from the past
> set
> > of logs.  This system doesn't have quite as many drives in it, so I
> couldn't
> > spot a same-host error burst, but it's logging tons of the same errors
> while
> > trying to talk to 10.2.0.34.
> >
> > On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum 
> wrote:
> >>
> >> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
> >> > I'm having a problem with my cluster.  It's running 0.87 right now,
> but
> >> > I
> >> > saw the same behavior with 0.80.5 and 0.80.7.
> >> >
> >> > The problem is that my logs are filling up with "replacing existing
> >> > (lossy)
> >> > channel" log lines (see below), to the point where I'm filling drives
> to
> >> > 100% almost daily just with logs.
> >> >
> >> > It doesn't appear to be network related, because it happens even when
> >> > talking to other OSDs on the same host.
> >>
> >> Well, that means it's probably not physical network related, but there
> >> can still be plenty wrong with the networking stack... ;)
> >>
> >> > The logs pretty much all point to
> >> > port 0 on the remote end.  Is this an indicator that it's failing to
> >> > resolve
> >> > port numbers somehow, or is this normal at this point in connection
> >> > setup?
> >>
> >> That's definitely unusual, but I'd need to see a little more to be
> >> sure if it's bad. My guess is that these pipes are connections from
> >> the other OSD's Objecter, which is treated as a regular client and
> >> doesn't bind to a socket for incoming connections.
> >>
> >> The repetitive channel replacements are concerning, though — they can
> >> be harmless in some circumstances but this looks more like the
> >> connection is simply failing to establish and so it's retrying over
> >> and over again. Can you restart the OSDs with "debug ms = 10" in their
> >> config file and post the logs somewhere? (There is not really any
> >> documentation available on what they mean, but the deeper detail ones
> >> might also be more understandable to you.)
> >> -Greg
> >>
> >> >
> >> > The systems that are causing this problem are somewhat unusual;
> they're
> >> > running OSDs in Docker containers, but they *should* be configured to
> >> > run as
> >> > root and have full access to the host's network stack.  They manage to
> >> > work,
> >> > mostly, but things are still really flaky.
> >> >
> >> > Also, is there documentation on what the various fields mean, short of
> >> > digging through the source?  And how does Ceph resolve OSD numbers
> into
> >> > host/port addresses?
> >> >
> >> >
> >> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1e070580).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
> >> > c=0x1f3db2e0).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
> >> > c=0x1e070420).accept replacing existing (lossy) channel (new one
> >> > lossy=1)
> >> >
> >> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
> >> > 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
> >> > c=0x1f3d8420).accept replacing existing (lossy

Re: [ceph-users] osd crashed while there was no space

2014-11-18 Thread han vincent
Hmm, the problem is that I had not modified any config; everything
is at the defaults.
As you said, all the IO should be stopped by the
"mon_osd_full_ratio" or "osd_failsafe_full_ratio" settings. In my test, when
the osd neared full, the IO from "rest bench" stopped, but the backfill
IO did not stop.  Each osd had 20G of space, which I think is big
enough.

2014-11-19 3:18 GMT+08:00 Craig Lewis :
> You shouldn't let the cluster get so full that losing a few OSDs will make
> you go toofull.  Letting the cluster get to 100% full is such a bad idea
> that you should make sure it doesn't happen.
>
>
> Ceph is supposed to stop moving data to an OSD once that OSD hits
> osd_backfill_full_ratio, which defaults to 0.85.  Any disk at 86% full will
> stop backfilling.
>
> I have verified this works when the disks fill up while the cluster is
> healthy, but I haven't failed a disk once I'm in the toofull state.  Even
> so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default
> 0.97) should stop all IO until a human gets involved.
>
> The only gotcha I can find is that the values are percentages, and the test
> is a "greater than" done with two significant digits.  ie, if the
> osd_backfill_full_ratio is 0.85, it will continue backfilling until the disk
> is 86% full.  So values of 0.99 and 1.00 will cause problems.
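(A minimal command-line sketch of inspecting and temporarily lowering these
thresholds; the OSD id and the 0.80 value are illustrative, not from this thread:)

  ceph daemon osd.0 config get osd_backfill_full_ratio         # current backfill cutoff on this host
  ceph tell osd.\* injectargs '--osd-backfill-full-ratio 0.80'  # lower it at runtime, cluster-wide
  ceph df                                                       # watch global and per-pool utilisation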
>
>
> On Mon, Nov 17, 2014 at 6:50 PM, han vincent  wrote:
>>
>> hi, craig:
>>
>> Your solution did work very well. But if the data is very
>> important, removing PG directories from the OSDs means a small mistake will
>> result in loss of data. And if the cluster is very large, don't you think
>> deleting the data on the disks to get from 100% down to 95% is a tedious and
>> error-prone thing, with so many OSDs, large disks, and so on?
>>
>>  So my key question is: if there is no space in the cluster while
>> some OSDs have crashed, why does the cluster choose to migrate at all? And
>> during the migration, other
>> OSDs will crash one by one until the cluster can no longer work.
>>
>> 2014-11-18 5:28 GMT+08:00 Craig Lewis :
>> > At this point, it's probably best to delete the pool.  I'm assuming the
>> > pool
>> > only contains benchmark data, and nothing important.
>> >
>> > Assuming you can delete the pool:
>> > First, figure out the ID of the data pool.  You can get that from ceph
>> > osd
>> > dump | grep '^pool'
>> >
>> > Once you have the number, delete the data pool: rados rmpool data data
>> > --yes-i-really-really-mean-it
>> >
> That will only free up space on OSDs that are up.  You'll need to
> manually delete
> some PGs on the OSDs that are 100% full.  Go to
> /var/lib/ceph/osd/ceph-<id>/current, and delete a few directories
> that
> start with your data pool ID.  You don't need to delete all of them.
>> > Once
>> > the disk is below 95% full, you should be able to start that OSD.  Once
>> > it's
>> > up, it will finish deleting the pool.
>> >
>> > If you can't delete the pool, it is possible, but it's more work, and
>> > you
>> > still run the risk of losing data if you make a mistake.  You need to
>> > disable backfilling, then delete some PGs on each OSD that's full. Try
>> > to
>> > only delete one copy of each PG.  If you delete every copy of a PG on
>> > all
>> > OSDs, then you lost the data that was in that PG.  As before, once you
>> > delete enough that the disk is less than 95% full, you can start the
>> > OSD.
>> > Once you start it, start deleting your benchmark data out of the data
>> > pool.
>> > Once that's done, you can re-enable backfilling.  You may need to scrub
>> > or
>> > deep-scrub the OSDs you deleted data from to get everything back to
>> > normal.
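(A rough shell sketch of the pool-removal path described above; the pool ID 0,
OSD number 3 and PG directory names are purely illustrative:)

  ceph osd dump | grep '^pool'                 # note the benchmark pool's ID, e.g. pool 0 'data'
  rados rmpool data data --yes-i-really-really-mean-it
  # on a host whose OSD is 100% full and refuses to start:
  cd /var/lib/ceph/osd/ceph-3/current
  ls -d 0.*_head | head                        # PG directories belonging to pool 0
  rm -rf 0.1f_head 0.2a_head                   # remove a few, never every copy of a PG cluster-wide
  df -h .                                      # once under 95%, the OSD can be started again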
>> >
>> >
>> > So how did you get the disks 100% full anyway?  Ceph normally won't let
>> > you
>> > do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio,
>> > or
>> > osd_failsafe_full_ratio?
>> >
>> >
>> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent  wrote:
>> >>
>> >> hello, every one:
>> >>
>> >> These days a problem with ceph has been troubling me for a long time.
>> >>
>> >> I built a cluster with 3 hosts, each host having three OSDs in it.
>> >> After that
>> >> I used the command "rados bench 360 -p data -b 4194304 -t 300 write
>> >> --no-cleanup"
>> >> to test the write performance of the cluster.
>> >>
>> >> When the cluster was nearly full, no more data could be written to
>> >> it. Unfortunately,
>> >> a host then hung, and a lot of PGs started to migrate to
>> >> other
>> >> OSDs.
>> >> After a while, a lot of OSDs were marked down and out, and my cluster
>> >> couldn't
>> >> work
>> >> any more.
>> >>
>> >> The following is the output of "ceph -s":
>> >>
>> >> cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
>> >> health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
>> >> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
>> >> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
>> >> recovery 945/29649 objects degraded (3.187%); 1 full osd(s)

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread Ramakrishna Nishtala (rnishtal)
Hi Dave

Did you say iSCSI only? The tracker issue doesn't say, though.

I am on Giant, with both the client and Ceph on RHEL 7, and it seems to work ok, unless
I am missing something here. RBD on bare metal with kmod-rbd and caching
disabled.



[root@compute4 ~]# time fio --name=writefile --size=100G --filesize=100G
--filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0
--rw=write --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio

writefile: (g=0): rw=write, bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=200
fio-2.1.11
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/853.0MB/0KB /s] [0/853/0 iops] [eta 00m:00s]
...
Disk stats (read/write):
  rbd0: ios=184/204800, merge=0/0, ticks=70/16164931, in_queue=16164942, util=99.98%

real    1m56.175s
user    0m18.115s
sys     0m10.430s

Regards,
Rama





-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David 
Moreau Simard
Sent: Tuesday, November 18, 2014 3:49 PM
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target



Testing without the cache tiering is the next test I want to do when I have
time..

When it's hanging, there is no activity at all on the cluster.
Nothing in "ceph -w", nothing in "ceph osd pool stats".

I'll provide an update when I have a chance to test without tiering.
--
David Moreau Simard

> On Nov 18, 2014, at 3:28 PM, Nick Fisk <n...@fisk.me.uk> wrote:
>
> Hi David,
>
> Have you tried on a normal replicated pool with no cache? I've seen a
> number of threads recently where caching is causing various things to
> block/hang.
> It would be interesting to see if this still happens without the
> caching layer, at least it would rule it out.
>
> Also is there any sign that as the test passes ~50GB that the cache
> might start flushing to the backing pool causing slow performance?
>
> I am planning a deployment very similar to yours so I am following
> this with great interest. I'm hoping to build a single node test
> "cluster" shortly, so I might be in a position to work with you on
> this issue and hopefully get it resolved.
>
> Nick
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of David Moreau Simard
> Sent: 18 November 2014 19:58
> To: Mike Christie
> Cc: ceph-users@lists.ceph.com; Christopher Spearman
> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
>
> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and
> chatted with "dis" on #ceph-devel.
>
> I ran a LOT of tests on a LOT of combinations of kernels (sometimes
> with tunables legacy). I haven't found a magical combination in which
> the following test does not hang:
> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0
> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>
> Either directly on a mapped rbd device, on a mounted filesystem (over
> rbd), exported through iSCSI.. nothing.
> I guess that rules out a potential issue with iSCSI overhead.
>
> Now, something I noticed out of pure luck is that I am unable to
> reproduce the issue if I drop the size of the test to 50GB. Tests will
> complete in under 2 minutes.
> 75GB will hang right at the end and take more than 10 minutes.
>
> TL;DR of tests:
> - 3x fio --name=writefile --size=50G --filesize=50G
> --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200
> --ioengine=libaio
> -- 1m44s, 1m49s, 1m40s
>
> - 3x fio --name=writefile --size=75G --filesize=75G
> --filename=/dev/rbd0 --bs=1M --nrfiles=1 --direct=1 --sync=0
> --randrepeat=0 --rw=write --refill_buffers --end_fsync=1 --iodepth=200
> --ioengine=libaio
> -- 10m12s, 10m11s, 10m13s
>
> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
>
> Does that ring a bell for you guys?
>
> --
> David Moreau Simard
>
>
>> On Nov 13, 2014, at 3:31 PM, Mike Christie <mchri...@redhat.com> wrote:
>>
>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>> Running into weird issues here as well in a test environment. I don't
> have a solution either but perhaps we can find some things in common..
>>>
>>> Setup in a nutshell:
>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with
>>> separate public/cluster network in 10 Gbps)
>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph
>>> 0.87-1 (10 Gbps)
>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>>
>>> Relevant cluster config: Writeback cache tiering with NVME PCI-E
>>> cards (2
> replica) in front of an erasure coded pool (k=3,m=2) backed by spindles.
>>>
>>> I'm following the instructions here:
>>> http://www

Re: [ceph-users] Bug or by design?

2014-11-18 Thread Robert LeBlanc
On Nov 18, 2014 4:48 PM, "Gregory Farnum"  wrote:
>
> On Tue, Nov 18, 2014 at 3:38 PM, Robert LeBlanc 
wrote:
> > I was going to submit this as a bug, but thought I would put it here for
> > discussion first. I have a feeling that it could be behavior by design.
> >
> > ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
> >
> > I'm using a cache pool and was playing around with the size and
min_size on
> > the pool to see the effects of replication. I set size/min_size to 1,
then I
> > ran "ceph osd pool set ssd size 3; ceph osd pool set ssd min_size 2".
Client
> > I/O immediately blocked as there was not 2 copies yet (as expected).
> > However, after the degraded objects are cleared up, there are several
PGs in
> > the remapped+incomplete state and client I/O continues to be blocked
even
> > though all OSDs are up and healthy (even left overnight). If I set
min_size
> > back down to 1, the cluster recovers and client I/O continues.
> >
> > I expected that as long as there is one copy of the data, the cluster
can
> > copy that data to min_size and cluster operations resume.
> >
> > Where I think it could be by design is when min_size was already set to
2
> > and you lose enough OSDs fast enough to dip below that level. There
could be
> > the chance that the serving OSD could have bad data (but we wouldn't
know
> > that anyway at the moment). The bad data could then be replicated and
the
> > ability to recover any good data would be lost.
> >
> > However, if Ceph immediately replicated the sole OSD to get back to
min_size
> > then when the other(s) came back online, it could back fill and just
destroy
> > the extras.
> >
> > It seems that immediately replication to keep the cluster operational
seems
> > like a good thing overall. Am I missing something?
>
> This is sort of by design, but mostly an accident of many other
> architecture choices. Sam is actually working now to enable PG
> recovery when you have fewer than min_size copies available; I very
> much doubt it will be backported to any existing LTS releases but it
> ought to be in Hammer.
> -Greg

Greg, thanks for the update. I'll refrain from submitting a bug request
since it is already being worked on. For now we will make sure that we
don't increase min_size until size has been increased and the objects have
been completely replicated.

Robert LeBlanc

Sent from a mobile device please excuse any typos.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread Gregory Farnum
On Wed, Nov 12, 2014 at 1:41 PM, houmles  wrote:
> Hi,
>
> I have 2 hosts with 8 2TB drives in each.
> I want to have 2 replicas across both hosts and then 2 replicas between OSDs
> on each host. That way even when I lose one host I still have 2 replicas.
>
> Currently I have this ruleset:
>
> rule repl {
> ruleset 5
> type replicated
> min_size 1
> max_size 10
> step take asterix
> step choose firstn -2 type osd
> step emit
> step take obelix
> step choose firstn 2 type osd
> step emit
> }
>
> Which works ok. I have 4 replicas as I want and PGs are distributed perfectly,
> but when I run ceph df I see only 1/2 of the capacity I should have.
> In total it's 32TB, 16TB in each host. If there are 2 replicas on each host
> it should report around 8TB, right? It's reporting only 4TB in the pool, which
> is 1/8 of the total capacity.
> Can anyone tell me what is wrong?

What version are you running? Can you copy-paste the command and
output, pointing out which bit you think is wrong? There are
occasionally oddities in the source data that confuse things and I
think there's new functionality to try and predict the "effective"
size that might have an issue.
-Greg
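(For what it's worth, one way to see exactly what the rule returns is to test the
compiled map offline; a sketch, with the rule number taken from the ruleset above:)

  ceph osd getcrushmap -o crushmap.bin
  crushtool -i crushmap.bin --test --rule 5 --num-rep 4 --show-mappings | head
  crushtool -i crushmap.bin --test --rule 5 --num-rep 4 --show-utilization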
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Bug or by design?

2014-11-18 Thread Gregory Farnum
On Tue, Nov 18, 2014 at 3:38 PM, Robert LeBlanc  wrote:
> I was going to submit this as a bug, but thought I would put it here for
> discussion first. I have a feeling that it could be behavior by design.
>
> ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)
>
> I'm using a cache pool and was playing around with the size and min_size on
> the pool to see the effects of replication. I set size/min_size to 1, then I
> ran "ceph osd pool set ssd size 3; ceph osd pool set ssd min_size 2". Client
> I/O immediately blocked as there was not 2 copies yet (as expected).
> However, after the degraded objects are cleared up, there are several PGs in
> the remapped+incomplete state and client I/O continues to be blocked even
> though all OSDs are up and healthy (even left overnight). If I set min_size
> back down to 1, the cluster recovers and client I/O continues.
>
> I expected that as long as there is one copy of the data, the cluster can
> copy that data to min_size and cluster operations resume.
>
> Where I think it could be by design is when min_size was already set to 2
> and you lose enough OSDs fast enough to dip below that level. There could be
> the chance that the serving OSD could have bad data (but we wouldn't know
> that anyway at the moment). The bad data could then be replicated and the
> ability to recover any good data would be lost.
>
> However, if Ceph immediately replicated the sole OSD to get back to min_size
> then when the other(s) came back online, it could back fill and just destroy
> the extras.
>
> It seems that immediately replication to keep the cluster operational seems
> like a good thing overall. Am I missing something?

This is sort of by design, but mostly an accident of many other
architecture choices. Sam is actually working now to enable PG
recovery when you have fewer than min_size copies available; I very
much doubt it will be backported to any existing LTS releases but it
ought to be in Hammer.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread David Moreau Simard
Testing without the cache tiering is the next test I want to do when I have 
time..

When it's hanging, there is no activity at all on the cluster.
Nothing in "ceph -w", nothing in "ceph osd pool stats".

I'll provide an update when I have a chance to test without tiering. 
--
David Moreau Simard


> On Nov 18, 2014, at 3:28 PM, Nick Fisk  wrote:
> 
> Hi David,
> 
> Have you tried on a normal replicated pool with no cache? I've seen a number
> of threads recently where caching is causing various things to block/hang.
> It would be interesting to see if this still happens without the caching
> layer, at least it would rule it out.
> 
> Also is there any sign that as the test passes ~50GB that the cache might
> start flushing to the backing pool causing slow performance?
> 
> I am planning a deployment very similar to yours so I am following this with
> great interest. I'm hoping to build a single node test "cluster" shortly, so
> I might be in a position to work with you on this issue and hopefully get it
> resolved.
> 
> Nick
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Moreau Simard
> Sent: 18 November 2014 19:58
> To: Mike Christie
> Cc: ceph-users@lists.ceph.com; Christopher Spearman
> Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target
> 
> Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted
> with "dis" on #ceph-devel.
> 
> I ran a LOT of tests on a LOT of combinations of kernels (sometimes with
> tunables legacy). I haven't found a magical combination in which the
> following test does not hang:
> fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0
> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
> 
> Either directly on a mapped rbd device, on a mounted filesystem (over rbd),
> exported through iSCSI.. nothing.
> I guess that rules out a potential issue with iSCSI overhead.
> 
> Now, something I noticed out of pure luck is that I am unable to reproduce
> the issue if I drop the size of the test to 50GB. Tests will complete in
> under 2 minutes.
> 75GB will hang right at the end and take more than 10 minutes.
> 
> TL;DR of tests:
> - 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0
> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
> -- 1m44s, 1m49s, 1m40s
> 
> - 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0
> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
> -- 10m12s, 10m11s, 10m13s
> 
> Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP
> 
> Does that ring a bell for you guys?
> 
> --
> David Moreau Simard
> 
> 
>> On Nov 13, 2014, at 3:31 PM, Mike Christie  wrote:
>> 
>> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>>> Running into weird issues here as well in a test environment. I don't
> have a solution either but perhaps we can find some things in common..
>>> 
>>> Setup in a nutshell:
>>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with 
>>> separate public/cluster network in 10 Gbps)
>>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 
>>> 0.87-1 (10 Gbps)
>>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>>> 
>>> Relevant cluster config: Writeback cache tiering with NVME PCI-E cards (2
> replica) in front of an erasure coded pool (k=3,m=2) backed by spindles.
>>> 
>>> I'm following the instructions here: 
>>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-ima
>>> ges-san-storage-devices No issues with creating and mapping a 100GB 
>>> RBD image and then creating the target.
>>> 
>>> I'm interested in finding out the overhead/performance impact of
> re-exporting through iSCSI so the idea is to run benchmarks.
>>> Here's a fio test I'm trying to run on the client node on the mounted
> iscsi device:
>>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu 
>>> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write 
>>> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>>> 
>>> The benchmark will eventually hang towards the end of the test for some
> long seconds before completing.
>>> On the proxy node, the kernel complains with iscsi portal login 
>>> timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance 
>>> errors in syslog: http://pastebin.com/AiRTWDwR
>>> 
>> 
>> You are hitting a different issue. German Anders is most likely 
>> correct and you hit the rbd hang. That then caused the iscsi/scsi 
>> command to timeout which caused the scsi error handler to run. In your 
>> logs we see the LIO error handler has received a task abort from the 
>> initiator and that timed out which caused the escalation (iscsi portal 
>> login related messages).
> 
> _

Re: [ceph-users] Cache tiering and cephfs

2014-11-18 Thread Gregory Farnum
I believe the reason we don't allow you to do this right now is that
there was not a good way of coordinating the transition (so that
everybody starts routing traffic through the cache pool at the same
time), which could lead to data inconsistencies. Looks like the OSDs
handle this appropriately now, though, so I'll create a bug for
backport to giant. Until that happens I think you'll need to associate
the cache and base pool prior to giving them to the MDS; sorry.
-Greg

On Mon, Nov 17, 2014 at 1:07 PM, Scott Laird  wrote:
> Hmm.  I'd rather not recreate by cephfs filesystem from scratch if I don't
> have do.  Has anyone managed to add a cache tier to a running cephfs
> filesystem?
>
>
> On Sun Nov 16 2014 at 1:39:47 PM Erik Logtenberg  wrote:
>>
>> I know that it is possible to run CephFS with a cache tier on the data
>> pool in Giant, because that's what I do. However when I configured it, I
>> was on the previous release. When I upgraded to Giant, everything just
>> kept working.
>>
>> By the way when I set it up, I used the following commmands:
>>
>> ceph osd pool create cephfs-data 192 192 erasure
>> ceph osd pool create cephfs-metadata 192 192 replicated ssd
>> ceph osd pool create cephfs-data-cache 192 192 replicated ssd
>> ceph osd pool set cephfs-data-cache crush_ruleset 1
>> ceph osd pool set cephfs-metadata crush_ruleset 1
>> ceph osd tier add cephfs-data cephfs-data-cache
>> ceph osd tier cache-mode cephfs-data-cache writeback
>> ceph osd tier set-overlay cephfs-data cephfs-data-cache
>> ceph osd dump
>> ceph mds newfs 5 6 --yes-i-really-mean-it
>>
>> So actually I didn't add a cache tier to an existing CephFS, but first
>> made the pools and added CephFS directly after. In my case, the "ssd"
>> pool is ssd-backed (obviously), while the default pool is on rotating
>> media; the crush_ruleset 1 is meant to place both the cache pool and the
>> metadata pool on the ssd's.
>>
>> Erik.
>>
>>
>> On 11/16/2014 08:01 PM, Scott Laird wrote:
>> > Is it possible to add a cache tier to cephfs's data pool in giant?
>> >
>> > I'm getting a error:
>> >
>> > $ ceph osd tier set-overlay data data-cache
>> >
>> > Error EBUSY: pool 'data' is in use by CephFS via its tier
>> >
>> >
>> > From what I can see in the code, that comes from
>> > OSDMonitor::_check_remove_tier; I don't understand why set-overlay needs
>> > to call _check_remove_tier.  A quick look makes it look like set-overlay
>> > will always fail once MDS has been set up.  Is this a bug, or am I doing
>> > something wrong?
>> >
>> >
>> > Scott
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> >
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bug or by design?

2014-11-18 Thread Robert LeBlanc
I was going to submit this as a bug, but thought I would put it here for
discussion first. I have a feeling that it could be behavior by design.

ceph version 0.87 (c51c8f9d80fa4e0168aa52685b8de40e42758578)

I'm using a cache pool and was playing around with the size and min_size on
the pool to see the effects of replication. I set size/min_size to 1, then
I ran "ceph osd pool set ssd size 3; ceph osd pool set ssd min_size 2".
Client I/O immediately blocked as there was not 2 copies yet (as expected).
However, after the degraded objects are cleared up, there are several PGs
in the remapped+incomplete state and client I/O continues to be blocked
even though all OSDs are up and healthy (even left overnight). If I set
min_size back down to 1, the cluster recovers and client I/O continues.

I expected that as long as there is one copy of the data, the cluster can
copy that data to min_size and cluster operations resume.

Where I think it could be by design is when min_size was already set to 2
and you lose enough OSDs fast enough to dip below that level. There could
be the chance that the serving OSD could have bad data (but we wouldn't
know that anyway at the moment). The bad data could then be replicated and
the ability to recover any good data would be lost.

However, if Ceph immediately replicated the sole OSD to get back to
min_size then when the other(s) came back online, it could back fill and
just destroy the extras.

It seems that immediately replication to keep the cluster operational seems
like a good thing overall. Am I missing something?

Thanks,
Robert
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Sam, 

Pastebin or similar will not take tens of megabytes worth of logs. If we are
talking about the debug_ms 10 setting, I've got about 7GB worth of logs generated
every half an hour or so. Not really sure what to do with that much data.
Anything more constructive?

Thanks 
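(One option, purely a suggestion: debug logs compress very well, and ceph-post-file
exists specifically for handing large files to the developers; the filename and
description below are illustrative:)

  gzip -9 ceph-osd.12.log
  ceph-post-file -d "giant osd flapping" ceph-osd.12.log.gz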
- Original Message -

From: "Samuel Just"  
To: "Andrei Mikhailovsky"  
Cc: ceph-users@lists.ceph.com 
Sent: Tuesday, 18 November, 2014 8:53:47 PM 
Subject: Re: [ceph-users] Giant upgrade - stability issues 

pastebin or something, probably. 
-Sam 

On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky  
wrote: 
> Sam, the logs are rather large in size. Where should I post it to? 
> 
> Thanks 
>  
> From: "Samuel Just"  
> To: "Andrei Mikhailovsky"  
> Cc: ceph-users@lists.ceph.com 
> Sent: Tuesday, 18 November, 2014 7:54:56 PM 
> Subject: Re: [ceph-users] Giant upgrade - stability issues 
> 
> 
> Ok, why is ceph marking osds down? Post your ceph.log from one of the 
> problematic periods. 
> -Sam 
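(A quick way to pull just the problematic periods out of a large cluster log; a
sketch assuming the default /var/log/ceph/ceph.log path, with example patterns
and timestamps:)

  grep -E 'marked (down|out)|wrongly marked|failed' /var/log/ceph/ceph.log | less
  sed -n '/2014-11-18 09:10/,/2014-11-18 09:20/p' /var/log/ceph/ceph.log > incident.log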
> 
> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky  
> wrote: 
>> Hello cephers, 
>> 
>> I need your help and suggestion on what is going on with my cluster. A few 
>> weeks ago i've upgraded from Firefly to Giant. I've previously written 
>> about 
>> having issues with Giant where in two weeks period the cluster's IO froze 
>> three times after ceph down-ed two osds. I have in total just 17 osds 
>> between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04 
>> with 
>> latest updates. 
>> 
>> I've got zabbix agents monitoring the osd servers and the cluster. I get 
>> alerts of any issues, such as problems with PGs, etc. Since upgrading to 
>> Giant, I am now frequently seeing emails alerting of the cluster having 
>> degraded PGs. I am getting around 10-15 such emails per day stating that the 
>> cluster has degraded PGs. The number of degraded PGs varies between a couple 
>> of PGs and over a thousand. After several minutes the cluster repairs 
>> itself. 
>> The total number of PGs in the cluster is 4412 between all the pools. 
>> 
>> I am also seeing more alerts from VMs stating that there is high IO wait, 
>> and also seeing hung tasks. Some VMs report over 50% IO wait. 
>> 
>> This has not happened on Firefly or the previous releases of ceph. Not 
>> much 
>> has changed in the cluster since the upgrade to Giant. Networking and 
>> hardware is still the same and it is still running the same version of 
>> Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the 
>> issues 
>> above are related to the upgrade of ceph to Giant. 
>> 
>> Here is the ceph.conf that I use: 
>> 
>> [global] 
>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe 
>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib 
>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13 
>> auth_supported = cephx 
>> osd_journal_size = 10240 
>> filestore_xattr_use_omap = true 
>> public_network = 192.168.168.0/24 
>> rbd_default_format = 2 
>> osd_recovery_max_chunk = 8388608 
>> osd_recovery_op_priority = 1 
>> osd_max_backfills = 1 
>> osd_recovery_max_active = 1 
>> osd_recovery_threads = 1 
>> filestore_max_sync_interval = 15 
>> filestore_op_threads = 8 
>> filestore_merge_threshold = 40 
>> filestore_split_multiple = 8 
>> osd_disk_threads = 8 
>> osd_op_threads = 8 
>> osd_pool_default_pg_num = 1024 
>> osd_pool_default_pgp_num = 1024 
>> osd_crush_update_on_start = false 
>> 
>> [client] 
>> rbd_cache = true 
>> admin_socket = /var/run/ceph/$name.$pid.asok 
>> 
>> 
>> I would like to get to the bottom of these issues. Not sure if the issues 
>> could be fixed with changing some settings in ceph.conf or a full 
>> downgrade 
>> back to the Firefly. Is the downgrade even possible on a production 
>> cluster? 
>> 
>> Thanks for your help 
>> 
>> Andrei 
>> 
>> ___ 
>> ceph-users mailing list 
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds cluster degraded

2014-11-18 Thread Gregory Farnum
Hmm, last time we saw this it meant that the MDS log had gotten
corrupted somehow and was a little short (in that case due to the OSDs
filling up). What do you mean by "rebuilt the OSDs"?
-Greg

On Mon, Nov 17, 2014 at 12:52 PM, JIten Shah  wrote:
> After i rebuilt the OSD’s, the MDS went into the degraded mode and will not
> recover.
>
>
> [jshah@Lab-cephmon001 ~]$ sudo tail -100f
> /var/log/ceph/ceph-mds.Lab-cephmon001.log
> 2014-11-17 17:55:27.855861 7fffef5d3700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=0 pgs=0 cs=0 l=0
> c=0x1e02c00).accept peer addr is really X.X.16.114:0/838757053 (socket is
> X.X.16.114:34672/0)
> 2014-11-17 17:57:27.855519 7fffef5d3700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/838757053 pipe(0x1e18000 sd=22 :6800 s=2 pgs=2 cs=1 l=0
> c=0x1e02c00).fault with nothing to send, going to standby
> 2014-11-17 17:58:47.883799 7fffef3d1700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=0 pgs=0 cs=0 l=0
> c=0x1e04ba0).accept peer addr is really X.X.16.114:0/26738200 (socket is
> X.X.16.114:34699/0)
> 2014-11-17 18:00:47.882484 7fffef3d1700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/26738200 pipe(0x1e1be80 sd=23 :6800 s=2 pgs=2 cs=1 l=0
> c=0x1e04ba0).fault with nothing to send, going to standby
> 2014-11-17 18:01:47.886662 7fffef1cf700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=0 pgs=0 cs=0 l=0
> c=0x1e05540).accept peer addr is really X.X.16.114:0/3673954317 (socket is
> X.X.16.114:34718/0)
> 2014-11-17 18:03:47.885488 7fffef1cf700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/3673954317 pipe(0x1e1c380 sd=24 :6800 s=2 pgs=2 cs=1 l=0
> c=0x1e05540).fault with nothing to send, going to standby
> 2014-11-17 18:04:47.888983 7fffeefcd700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=0 pgs=0 cs=0 l=0
> c=0x1e05280).accept peer addr is really X.X.16.114:0/3403131574 (socket is
> X.X.16.114:34744/0)
> 2014-11-17 18:06:47.888427 7fffeefcd700  0 -- X.X.16.111:6800/3046050 >>
> X.X.16.114:0/3403131574 pipe(0x1e18a00 sd=25 :6800 s=2 pgs=2 cs=1 l=0
> c=0x1e05280).fault with nothing to send, going to standby
> 2014-11-17 20:02:03.558250 707de700 -1 mds.0.1 *** got signal Terminated
> ***
> 2014-11-17 20:02:03.558297 707de700  1 mds.0.1 suicide.  wanted
> down:dne, now up:active
> 2014-11-17 20:02:56.053339 77fe77a0  0 ceph version 0.80.5
> (38b73c67d375a2552d8ed67843c8a65c2c0feba6), process ceph-mds, pid 3424727
> 2014-11-17 20:02:56.121367 730e4700  1 mds.-1.0 handle_mds_map standby
> 2014-11-17 20:02:56.124343 730e4700  1 mds.0.2 handle_mds_map i am now
> mds.0.2
> 2014-11-17 20:02:56.124345 730e4700  1 mds.0.2 handle_mds_map state
> change up:standby --> up:replay
> 2014-11-17 20:02:56.124348 730e4700  1 mds.0.2 replay_start
> 2014-11-17 20:02:56.124359 730e4700  1 mds.0.2  recovery set is
> 2014-11-17 20:02:56.124362 730e4700  1 mds.0.2  need osdmap epoch 93,
> have 92
> 2014-11-17 20:02:56.124363 730e4700  1 mds.0.2  waiting for osdmap 93
> (which blacklists prior instance)
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Unclear about CRUSH map and more than one "step emit" in rule

2014-11-18 Thread Gregory Farnum
On Sun, Nov 16, 2014 at 4:17 PM, Anthony Alba  wrote:
> The step emit documentation states
>
> "Outputs the current value and empties the stack. Typically used at
> the end of a rule, but may also be used to pick from different trees
> in the same rule."
>
> What use case is there for more than one "step emit"? Where would you
> put it since
> a rule looks like
>
> rule <rulename> {
>
> ruleset <ruleset>
> type [ replicated | raid4 ]
> min_size <min-size>
> max_size <max-size>
> step take <bucket-name>
> step [choose|chooseleaf] [firstn|indep] <N> <bucket-type>
> step emit
> }
>
> Hazard a guess: after "step emit" you start with step take... all over again?

Yep, that's it exactly. You could use this to do something like

step take ssd-root
step chooseleaf firstn 1 type host
step emit
step take hdd-root
step chooseleaf firstn -1 type host
step emit

-Greg
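For context, wrapped in a complete rule definition that pattern might look like the
sketch below (the rule name, ruleset number and bucket names are made up):

rule ssd-then-hdd {
        ruleset 6
        type replicated
        min_size 1
        max_size 10
        step take ssd-root
        step chooseleaf firstn 1 type host
        step emit
        step take hdd-root
        step chooseleaf firstn -1 type host
        step emit
}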
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Log reading/how do I tell what an OSD is trying to connect to

2014-11-18 Thread Gregory Farnum
It's a little strange, but with just the one-sided log it looks as
though the OSD is setting up a bunch of connections and then
deliberately tearing them down again within a second or two (i.e., this
is not a direct messenger bug, but it might be an OSD one, or it might
be something else).
Is it possible that you have some firewalls set up that are allowing
through some traffic but not others? The OSDs use a bunch of ports and
it looks like maybe there are at least intermittent issues with them
heartbeating.
-Greg
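(A quick sanity check along those lines; the port range below is the usual OSD
default and the commands are only a sketch, so adjust for your own setup:)

  ss -tlnp | grep ceph-osd                               # which ports each OSD actually bound
  iptables -L -n | less                                  # confirm nothing drops traffic in that range
  iptables -A INPUT -p tcp --dport 6800:7300 -j ACCEPT   # open the default OSD/heartbeat port range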

On Wed, Nov 12, 2014 at 11:32 AM, Scott Laird  wrote:
> Here are the first 33k lines or so:
> https://dl.dropboxusercontent.com/u/104949139/ceph-osd-log.txt
>
> This is a different (but more or less identical) machine from the past set
> of logs.  This system doesn't have quite as many drives in it, so I couldn't
> spot a same-host error burst, but it's logging tons of the same errors while
> trying to talk to 10.2.0.34.
>
> On Wed Nov 12 2014 at 10:47:30 AM Gregory Farnum  wrote:
>>
>> On Tue, Nov 11, 2014 at 6:28 PM, Scott Laird  wrote:
>> > I'm having a problem with my cluster.  It's running 0.87 right now, but
>> > I
>> > saw the same behavior with 0.80.5 and 0.80.7.
>> >
>> > The problem is that my logs are filling up with "replacing existing
>> > (lossy)
>> > channel" log lines (see below), to the point where I'm filling drives to
>> > 100% almost daily just with logs.
>> >
>> > It doesn't appear to be network related, because it happens even when
>> > talking to other OSDs on the same host.
>>
>> Well, that means it's probably not physical network related, but there
>> can still be plenty wrong with the networking stack... ;)
>>
>> > The logs pretty much all point to
>> > port 0 on the remote end.  Is this an indicator that it's failing to
>> > resolve
>> > port numbers somehow, or is this normal at this point in connection
>> > setup?
>>
>> That's definitely unusual, but I'd need to see a little more to be
>> sure if it's bad. My guess is that these pipes are connections from
>> the other OSD's Objecter, which is treated as a regular client and
>> doesn't bind to a socket for incoming connections.
>>
>> The repetitive channel replacements are concerning, though — they can
>> be harmless in some circumstances but this looks more like the
>> connection is simply failing to establish and so it's retrying over
>> and over again. Can you restart the OSDs with "debug ms = 10" in their
>> config file and post the logs somewhere? (There is not really any
>> documentation available on what they mean, but the deeper detail ones
>> might also be more understandable to you.)
>> -Greg
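(Concretely, that would be something like the following sketch; the injectargs
variant avoids a restart but only affects daemons that are already running:)

  # in ceph.conf on the OSD hosts:
  [osd]
          debug ms = 10

  # or at runtime:
  ceph tell osd.\* injectargs '--debug-ms 10'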
>>
>> >
>> > The systems that are causing this problem are somewhat unusual; they're
>> > running OSDs in Docker containers, but they *should* be configured to
>> > run as
>> > root and have full access to the host's network stack.  They manage to
>> > work,
>> > mostly, but things are still really flaky.
>> >
>> > Also, is there documentation on what the various fields mean, short of
>> > digging through the source?  And how does Ceph resolve OSD numbers into
>> > host/port addresses?
>> >
>> >
>> > 2014-11-12 01:50:40.802604 7f7828db8700  0 -- 10.2.0.36:6819/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ce31c80 sd=135 :6819 s=0 pgs=0 cs=0 l=1
>> > c=0x1e070580).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.802708 7f7816538700  0 -- 10.2.0.36:6830/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ff61080 sd=120 :6830 s=0 pgs=0 cs=0 l=1
>> > c=0x1f3db2e0).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.803346 7f781ba8d700  0 -- 10.2.0.36:6819/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ce31180 sd=125 :6819 s=0 pgs=0 cs=0 l=1
>> > c=0x1e070420).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.803944 7f781996c700  0 -- 10.2.0.36:6830/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ff618c0 sd=107 :6830 s=0 pgs=0 cs=0 l=1
>> > c=0x1f3d8420).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.804185 7f7816538700  0 -- 10.2.0.36:6819/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ffd1e40 sd=20 :6819 s=0 pgs=0 cs=0 l=1
>> > c=0x1e070840).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.805235 7f7813407700  0 -- 10.2.0.36:6819/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ffd1340 sd=60 :6819 s=0 pgs=0 cs=0 l=1
>> > c=0x1b2d6260).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> > 2014-11-12 01:50:40.806364 7f781bc8f700  0 -- 10.2.0.36:6819/1 >>
>> > 10.2.0.36:0/1 pipe(0x1ffd0b00 sd=162 :6819 s=0 pgs=0 cs=0 l=1
>> > c=0x675c580).accept replacing existing (lossy) channel (new one lossy=1)
>> >
>> > 2014-11-12 01:50:40.806425 7f781aa7d700  0 -- 10.2.0.36:6830/1 >>
>> > 10.2.0.36:0/1 pipe(0x1db29600 sd=143 :6830 s=0 pgs=0 cs=0 l=1
>> > c=0x1f3d9600).accept replacing existing (lossy) channel (new one
>> > lossy=1)
>> >
>> >
>> >
>> > __

[ceph-users] Replacing Ceph mons & understanding initial members

2014-11-18 Thread Scottix
We currently have a 3 node system with 3 monitor nodes. I created them in
the initial setup, and the ceph.conf has:

mon initial members = Ceph200, Ceph201, Ceph202
mon host = 10.10.5.31,10.10.5.32,10.10.5.33

We are in the process of expanding and installing dedicated mon servers.

I know I can run:
ceph-deploy mon create Ceph300, etc.
to install the new mons, but then I will eventually need to destroy the old
mons (Ceph200, etc.).
Will this create issues, or is there anything I need to update in "mon
initial members" or "mon host"?

Thanks in advance
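(For reference, the usual sequence looks roughly like this sketch; Ceph301/Ceph302
are made-up hostnames following your naming, and the exact ceph.conf handling
depends on how you distribute it:)

  ceph-deploy mon create Ceph300 Ceph301 Ceph302
  ceph quorum_status --format json-pretty        # confirm the new mons joined the quorum
  ceph-deploy mon destroy Ceph200                # then retire the old mons one at a time
  # update "mon initial members" / "mon host" in ceph.conf and push it out:
  ceph-deploy --overwrite-conf config push Ceph300 Ceph301 Ceph302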
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini

Then.
...very good! :)

Ok, the next bad thing is that I have installed GIANT on the admin node.
However, ceph-deploy ignores the ADMIN node installation and installs FIREFLY.
Now I have a Giant ceph-deploy on my ADMIN node and my first OSD node 
with FIREFLY.

It seems odd to me. Is it fine, or should I prepare myself to format again?



Il 18/11/2014 23:03, Travis Rhoden ha scritto:

I've captured this at http://tracker.ceph.com/issues/10133

On Tue, Nov 18, 2014 at 4:48 PM, Travis Rhoden > wrote:


Hi Massimiliano,

I just recreated this bug myself.  Ceph-deploy is supposed to
install EPEL automatically on the platforms that need it.  I just
confirmed that it is not doing so, and will be opening up a bug in
the Ceph tracker.  I'll paste it here when I do so you can follow
it.  Thanks for the report!

 - Travis

On Tue, Nov 18, 2014 at 4:41 PM, Massimiliano Cuttini
mailto:m...@phoenixweb.it>> wrote:

I solved by installing EPEL repo on yum.
I think that somebody should write down in the documentation
that EPEL is mandatory
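(On CentOS 7 that typically means something like the following before re-running
ceph-deploy install; a sketch, and exact package naming may vary with your mirrors:)

  sudo yum install -y epel-release
  sudo yum clean all && sudo yum makecache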



Il 18/11/2014 14:29, Massimiliano Cuttini ha scritto:

Dear all,

i try to install ceph but i get errors:

#ceph-deploy install node1
[]
[ceph_deploy.install][DEBUG ] Installing stable version
*firefly *on cluster ceph hosts node1
[ceph_deploy.install][DEBUG ] Detecting platform for host
node1 ...
[]
[node1][DEBUG ] ---> Pacchetto libXxf86vm.x86_64
0:1.1.3-2.1.el7 settato per essere installato
[node1][DEBUG ] ---> Pacchetto mesa-libgbm.x86_64
0:9.2.5-6.20131218.el7_0 settato per essere installato
[node1][DEBUG ] ---> Pacchetto mesa-libglapi.x86_64
0:9.2.5-6.20131218.el7_0 settato per essere installato
[node1][DEBUG ] --> Risoluzione delle dipendenze completata
[node1][WARNIN] Errore: Pacchetto:
ceph-common-0.80.7-0.el7.centos.x86_64 (Ceph)
[node1][WARNIN] Richiede:
libtcmalloc.so.4()(64bit)
[node1][WARNIN] Errore: Pacchetto:
ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
[node1][DEBUG ]  Si può provare ad usare --skip-broken
per aggirare il problema
[node1][WARNIN] Richiede:
libleveldb.so.1()(64bit)
[node1][WARNIN] Errore: Pacchetto:
ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
[node1][WARNIN] Richiede:
libtcmalloc.so.4()(64bit)
[node1][DEBUG ]  Provare ad eseguire: rpm -Va --nofiles
--nodigest
[node1][ERROR ] RuntimeError: command returned non-zero
exit status: 1
*[ceph_deploy][ERROR ] RuntimeError: Failed to execute
command: yum -y install ceph*

I installed GIANT version not FIREFLY on admin-node.
Is it a typo error in the config file or is it truly trying
to install FIREFLY instead of GIANT.

About the error, i see that it's related to wrong python
default libraries.
It seems that CEPH require libraries not available in the
current distro:

[node1][WARNIN] Richiede: libtcmalloc.so.4()(64bit)
[node1][WARNIN] Richiede:
libleveldb.so.1()(64bit)
[node1][WARNIN] Richiede:
libtcmalloc.so.4()(64bit)

This seems strange.
Can you fix this?


Thanks,
Massimiliano Cuttini





___
ceph-users mailing list
ceph-users@lists.ceph.com  
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Travis Rhoden
I've captured this at http://tracker.ceph.com/issues/10133

On Tue, Nov 18, 2014 at 4:48 PM, Travis Rhoden  wrote:

> Hi Massimiliano,
>
> I just recreated this bug myself.  Ceph-deploy is supposed to install EPEL
> automatically on the platforms that need it.  I just confirmed that it is
> not doing so, and will be opening up a bug in the Ceph tracker.  I'll paste
> it here when I do so you can follow it.  Thanks for the report!
>
>  - Travis
>
> On Tue, Nov 18, 2014 at 4:41 PM, Massimiliano Cuttini 
> wrote:
>
>>  I solved by installing EPEL repo on yum.
>> I think that somebody should write down in the documentation that EPEL is
>> mandatory
>>
>>
>>
>> Il 18/11/2014 14:29, Massimiliano Cuttini ha scritto:
>>
>> Dear all,
>>
>> i try to install ceph but i get errors:
>>
>> #ceph-deploy install node1
>> []
>> [ceph_deploy.install][DEBUG ] Installing stable version *firefly *on
>> cluster ceph hosts node1
>> [ceph_deploy.install][DEBUG ] Detecting platform for host node1 ...
>> []
>> [node1][DEBUG ] ---> Pacchetto libXxf86vm.x86_64 0:1.1.3-2.1.el7 settato
>> per essere installato
>> [node1][DEBUG ] ---> Pacchetto mesa-libgbm.x86_64
>> 0:9.2.5-6.20131218.el7_0 settato per essere installato
>> [node1][DEBUG ] ---> Pacchetto mesa-libglapi.x86_64
>> 0:9.2.5-6.20131218.el7_0 settato per essere installato
>> [node1][DEBUG ] --> Risoluzione delle dipendenze completata
>> [node1][WARNIN] Errore: Pacchetto: ceph-common-0.80.7-0.el7.centos.x86_64
>> (Ceph)
>> [node1][WARNIN] Richiede: libtcmalloc.so.4()(64bit)
>> [node1][WARNIN] Errore: Pacchetto: ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
>> [node1][DEBUG ]  Si può provare ad usare --skip-broken per aggirare il
>> problema
>> [node1][WARNIN] Richiede: libleveldb.so.1()(64bit)
>> [node1][WARNIN] Errore: Pacchetto: ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
>> [node1][WARNIN] Richiede: libtcmalloc.so.4()(64bit)
>> [node1][DEBUG ]  Provare ad eseguire: rpm -Va --nofiles --nodigest
>> [node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
>> *[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
>> install ceph*
>>
>> I installed GIANT version not FIREFLY on admin-node.
>> Is it a typo error in the config file or is it truly trying to install
>> FIREFLY instead of GIANT.
>>
>> About the error, i see that it's related to wrong python default
>> libraries.
>> It seems that CEPH require libraries not available in the current distro:
>>
>> [node1][WARNIN] Richiede: libtcmalloc.so.4()(64bit)
>> [node1][WARNIN] Richiede: libleveldb.so.1()(64bit)
>> [node1][WARNIN] Richiede: libtcmalloc.so.4()(64bit)
>>
>> This seems strange.
>> Can you fix this?
>>
>>
>> Thanks,
>> Massimiliano Cuttini
>>
>>
>>
>>
>>
>> ___
>> ceph-users mailing 
>> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mds continuously crashing on Firefly

2014-11-18 Thread Gregory Farnum
On Thu, Nov 13, 2014 at 9:34 AM, Lincoln Bryant  wrote:
> Hi all,
>
> Just providing an update to this -- I started the mds daemon on a new server 
> and rebooted a box with a hung CephFS mount (from the first crash) and the 
> problem seems to have gone away.
>
> I'm still not sure why the mds was shutting down with a "Caught signal", 
> though.

I imagine the logs are gone now, but if you still have any from the
MDS crashes they might be useful to examine; a client shouldn't be
able to crash the MDS. :/
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
Yes Openstack also uses libvirt/qemu/kvm, thanks.
On 18 Nov 2014 23:50, "Campbell, Bill" 
wrote:

> I can't speak for OpenStack, but OpenNebula uses Libvirt/QEMU/KVM to
> access an RBD directly for each virtual instance deployed, live-migration
> included (as each RBD is in and of itself a separate block device, not file
> system).  I would imagine OpenStack works in a similar fashion.
>
> --
> *From: *"hp cre" 
> *To: *"Gregory Farnum" 
> *Cc: *ceph-users@lists.ceph.com
> *Sent: *Tuesday, November 18, 2014 4:43:07 PM
> *Subject: *Re: [ceph-users] Concurrency in ceph
>
> Ok thanks Greg.
> But what openstack does,  AFAIU, is use rbd devices directly,  one for
> each Vm instance,  right?  And that's how it supports live migrations on
> KVM, etc.. Right? Openstack and similar cloud frameworks don't need to
> create vm instances on filesystems,  am I correct?
> On 18 Nov 2014 23:33, "Gregory Farnum"  wrote:
>
>> On Tue, Nov 18, 2014 at 1:26 PM, hp cre  wrote:
>> > Hello everyone,
>> >
>> > I'm new to ceph but been working with proprietary clustered filesystem
>> for
>> > quite some time.
>> >
>> > I almost understand how ceph works,  but have a couple of questions
>> which
>> > have been asked before here,  but i didn't understand the answer.
>> >
>> > In the closed source world,  we use clustered filesystems like Veritas
>> > clustered filesystem to mount a shared block device (using San) to more
>> than
>> > one compute node concurrently for shared read/write.
>> >
>> > What I can't seem to get a solid and clear answer for its this..
>> > How can I use ceph to do the same thing?  Can RADOS guarantee coherency
>> and
>> > integrity of my data if I use an rbd device with any filesystem on top
>> of
>> > it?  Or must I still use a cluster aware filesystem such as vxfs or
>> ocfs?
>>
>> RBD behaves just like a regular disk if you mount it to multiple nodes
>> at once (although you need to disable the client caching). This means
>> that the disk accesses will be coherent, but using ext4 on top of it
>> won't work because ext4 assumes it is the only accessor — you have to
>> use a cluster-aware FS like ocfs2. A SAN would have the same problem
>> here, so I'm not sure why you think it works with them...
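(As a concrete illustration of the point above, a minimal sketch with an invented
pool/image name; the kernel RBD client does not use the librbd cache, and the
ocfs2 cluster stack still has to be configured separately:)

  rbd map rbd/shared-img            # run on every node that needs the block device
  mkfs.ocfs2 -L shared /dev/rbd0    # cluster-aware filesystem, formatted once
  mount /dev/rbd0 /mnt/shared       # mounted on each node once o2cb is running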
>>
>>
>> > And is CephFS going to some this problem? Or does it not have support
>> for
>> > concurrent read/write access among all now mounting it?
>>
>> CephFS definitely does support concurrent access to the same data.
>>
>> > And,  does iscsi targets over rbd devices behave the same?
>>
>> Uh, yes, iSCSI over rbd will be the same as regular RBD in this
>> regard, modulo anything the iSCSI gateway might be set up to do.
>> -Greg
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Gregory Farnum
On Tue, Nov 18, 2014 at 1:43 PM, hp cre  wrote:
> Ok thanks Greg.
> But what openstack does,  AFAIU, is use rbd devices directly,  one for each
> Vm instance,  right?  And that's how it supports live migrations on KVM,
> etc.. Right? Openstack and similar cloud frameworks don't need to create vm
> instances on filesystems,  am I correct?

Right; these systems are doing the cache coherency (by duplicating all
the memory, including that of ext4/whatever) so that they work.
-Greg

>
> On 18 Nov 2014 23:33, "Gregory Farnum"  wrote:
>>
>> On Tue, Nov 18, 2014 at 1:26 PM, hp cre  wrote:
>> > Hello everyone,
>> >
>> > I'm new to ceph but been working with proprietary clustered filesystem
>> > for
>> > quite some time.
>> >
>> > I almost understand how ceph works,  but have a couple of questions
>> > which
>> > have been asked before here,  but i didn't understand the answer.
>> >
>> > In the closed source world,  we use clustered filesystems like Veritas
>> > clustered filesystem to mount a shared block device (using San) to more
>> > than
>> > one compute node concurrently for shared read/write.
>> >
>> > What I can't seem to get a solid and clear answer for is this:
>> > How can I use ceph to do the same thing?  Can RADOS guarantee coherency
>> > and
>> > integrity of my data if I use an rbd device with any filesystem on top
>> > of
>> > it?  Or must I still use a cluster aware filesystem such as vxfs or
>> > ocfs?
>>
>> RBD behaves just like a regular disk if you mount it to multiple nodes
>> at once (although you need to disable the client caching). This means
>> that the disk accesses will be coherent, but using ext4 on top of it
>> won't work because ext4 assumes it is the only accessor — you have to
>> use a cluster-aware FS like ocfs2. A SAN would have the same problem
>> here, so I'm not sure why you think it works with them...
>>
>>
>> > And is CephFS going to solve this problem? Or does it not have support
>> > for
>> > concurrent read/write access among all nodes mounting it?
>>
>> CephFS definitely does support concurrent access to the same data.
>>
>> > And,  does iscsi targets over rbd devices behave the same?
>>
>> Uh, yes, iSCSI over rbd will be the same as regular RBD in this
>> regard, modulo anything the iSCSI gateway might be set up to do.
>> -Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Campbell, Bill
I can't speak for OpenStack, but OpenNebula uses Libvirt/QEMU/KVM to access an 
RBD directly for each virtual instance deployed, live-migration included (as 
each RBD is in and of itself a separate block device, not file system). I would 
imagine OpenStack works in a similar fashion. 
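
(For illustration, the direct attachment looks roughly like this at the QEMU
level; the pool/image names and the cephx user "admin" below are just
placeholders, and OpenNebula/OpenStack generate the equivalent libvirt XML
for you.)

    # boot a VM straight off an RBD image -- no shared filesystem in between
    qemu-system-x86_64 -m 2048 -enable-kvm \
        -drive "format=raw,if=virtio,file=rbd:rbd/vm-disk-1:id=admin:conf=/etc/ceph/ceph.conf"

Each image is only actively written by one hypervisor at a time, which is why
no cluster-aware filesystem is needed for this use case.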

- Original Message -

From: "hp cre"  
To: "Gregory Farnum"  
Cc: ceph-users@lists.ceph.com 
Sent: Tuesday, November 18, 2014 4:43:07 PM 
Subject: Re: [ceph-users] Concurrency in ceph 



Ok thanks Greg. 
But what openstack does, AFAIU, is use rbd devices directly, one for each Vm 
instance, right? And that's how it supports live migrations on KVM, etc.. 
Right? Openstack and similar cloud frameworks don't need to create vm instances 
on filesystems, am I correct? 
On 18 Nov 2014 23:33, "Gregory Farnum" < g...@gregs42.com > wrote: 


On Tue, Nov 18, 2014 at 1:26 PM, hp cre < hpc...@gmail.com > wrote: 
> Hello everyone, 
> 
> I'm new to ceph but been working with proprietary clustered filesystem for 
> quite some time. 
> 
> I almost understand how ceph works, but have a couple of questions which 
> have been asked before here, but i didn't understand the answer. 
> 
> In the closed source world, we use clustered filesystems like Veritas 
> clustered filesystem to mount a shared block device (using San) to more than 
> one compute node concurrently for shared read/write. 
> 
> What I can't seem to get a solid and clear answer for is this: 
> How can I use ceph to do the same thing? Can RADOS guarantee coherency and 
> integrity of my data if I use an rbd device with any filesystem on top of 
> it? Or must I still use a cluster aware filesystem such as vxfs or ocfs? 

RBD behaves just like a regular disk if you mount it to multiple nodes 
at once (although you need to disable the client caching). This means 
that the disk accesses will be coherent, but using ext4 on top of it 
won't work because ext4 assumes it is the only accessor — you have to 
use a cluster-aware FS like ocfs2. A SAN would have the same problem 
here, so I'm not sure why you think it works with them... 


> And is CephFS going to solve this problem? Or does it not have support for 
> concurrent read/write access among all nodes mounting it? 

CephFS definitely does support concurrent access to the same data. 

> And, does iscsi targets over rbd devices behave the same? 

Uh, yes, iSCSI over rbd will be the same as regular RBD in this 
regard, modulo anything the iSCSI gateway might be set up to do. 
-Greg 




___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Travis Rhoden
Hi Massimiliano,

I just recreated this bug myself.  Ceph-deploy is supposed to install EPEL
automatically on the platforms that need it.  I just confirmed that it is
not doing so, and will be opening up a bug in the Ceph tracker.  I'll paste
it here when I do so you can follow it.  Thanks for the report!

 - Travis

On Tue, Nov 18, 2014 at 4:41 PM, Massimiliano Cuttini 
wrote:

>  I solved by installing EPEL repo on yum.
> I think that somebody should write down in the documentation that EPEL is
> mandatory
>
>
>
> On 18/11/2014 14:29, Massimiliano Cuttini wrote:
>
> Dear all,
>
> i try to install ceph but i get errors:
>
> #ceph-deploy install node1
> []
> [ceph_deploy.install][DEBUG ] Installing stable version *firefly *on
> cluster ceph hosts node1
> [ceph_deploy.install][DEBUG ] Detecting platform for host node1 ...
> []
> [node1][DEBUG ] ---> Package libXxf86vm.x86_64 0:1.1.3-2.1.el7 will be
> installed
> [node1][DEBUG ] ---> Package mesa-libgbm.x86_64 0:9.2.5-6.20131218.el7_0
> will be installed
> [node1][DEBUG ] ---> Package mesa-libglapi.x86_64
> 0:9.2.5-6.20131218.el7_0 will be installed
> [node1][DEBUG ] --> Finished Dependency Resolution
> [node1][WARNIN] Error: Package: ceph-common-0.80.7-0.el7.centos.x86_64
> (Ceph)
> [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
> [node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
> [node1][DEBUG ]  You could try using --skip-broken to work around the
> problem
> [node1][WARNIN] Requires: libleveldb.so.1()(64bit)
> [node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64 (Ceph)
> [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
> [node1][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
> [node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
> *[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
> install ceph*
>
> I installed GIANT version not FIREFLY on admin-node.
> Is it a typo error in the config file or is it truly trying to install
> FIREFLY instead of GIANT.
>
> About the error, i see that it's related to wrong python default libraries.
> It seems that CEPH require libraries not available in the current distro:
>
> [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
> [node1][WARNIN] Requires: libleveldb.so.1()(64bit)
> [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
>
> This seems strange.
> Can you fix this?
>
>
> Thanks,
> Massimiliano Cuttini
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
Ok, thanks Greg.
But what OpenStack does, AFAIU, is use RBD devices directly, one for each
VM instance, right? And that's how it supports live migration on KVM,
etc., right? OpenStack and similar cloud frameworks don't need to create VM
instances on filesystems, am I correct?
On 18 Nov 2014 23:33, "Gregory Farnum"  wrote:

> On Tue, Nov 18, 2014 at 1:26 PM, hp cre  wrote:
> > Hello everyone,
> >
> > I'm new to ceph but been working with proprietary clustered filesystem
> for
> > quite some time.
> >
> > I almost understand how ceph works,  but have a couple of questions which
> > have been asked before here,  but i didn't understand the answer.
> >
> > In the closed source world,  we use clustered filesystems like Veritas
> > clustered filesystem to mount a shared block device (using San) to more
> than
> > one compute node concurrently for shared read/write.
> >
> > What I can't seem to get a solid and clear answer for is this:
> > How can I use ceph to do the same thing?  Can RADOS guarantee coherency
> and
> > integrity of my data if I use an rbd device with any filesystem on top of
> > it?  Or must I still use a cluster aware filesystem such as vxfs or ocfs?
>
> RBD behaves just like a regular disk if you mount it to multiple nodes
> at once (although you need to disable the client caching). This means
> that the disk accesses will be coherent, but using ext4 on top of it
> won't work because ext4 assumes it is the only accessor — you have to
> use a cluster-aware FS like ocfs2. A SAN would have the same problem
> here, so I'm not sure why you think it works with them...
>
>
> > And is CephFS going to solve this problem? Or does it not have support for
> > concurrent read/write access among all nodes mounting it?
>
> CephFS definitely does support concurrent access to the same data.
>
> > And,  does iscsi targets over rbd devices behave the same?
>
> Uh, yes, iSCSI over rbd will be the same as regular RBD in this
> regard, modulo anything the iSCSI gateway might be set up to do.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini

I solved it by installing the EPEL repo via yum.
I think somebody should write down in the documentation that EPEL
is mandatory.
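
For reference, the manual workaround boils down to something like this until
ceph-deploy handles it by itself (the epel-release URL may need adjusting for
your CentOS minor release):

    # on each CentOS 7 node, enable EPEL before retrying the install
    yum -y install epel-release || \
        rpm -Uvh https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm

    # then, from the admin node
    ceph-deploy install node1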




On 18/11/2014 14:29, Massimiliano Cuttini wrote:

Dear all,

I tried to install Ceph but I got errors:

#ceph-deploy install node1
[]
[ceph_deploy.install][DEBUG ] Installing stable version *firefly
*on cluster ceph hosts node1
[ceph_deploy.install][DEBUG ] Detecting platform for host node1 ...
[]
[node1][DEBUG ] ---> Package libXxf86vm.x86_64 0:1.1.3-2.1.el7
will be installed
[node1][DEBUG ] ---> Package mesa-libgbm.x86_64
0:9.2.5-6.20131218.el7_0 will be installed
[node1][DEBUG ] ---> Package mesa-libglapi.x86_64
0:9.2.5-6.20131218.el7_0 will be installed
[node1][DEBUG ] --> Finished Dependency Resolution
[node1][WARNIN] Error: Package:
ceph-common-0.80.7-0.el7.centos.x86_64 (Ceph)
[node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
[node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64
(Ceph)
[node1][DEBUG ]  You could try using --skip-broken to work
around the problem
[node1][WARNIN] Requires: libleveldb.so.1()(64bit)
[node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64
(Ceph)
[node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
[node1][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
[node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
*[ceph_deploy][ERROR ] RuntimeError: Failed to execute command:
yum -y install ceph*

I installed the GIANT version, not FIREFLY, on the admin node.
Is it a typo in the config file, or is it truly trying to install
FIREFLY instead of GIANT?


About the error, I see that it's related to missing default libraries.

It seems that Ceph requires libraries that are not available in the current distro:

[node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
[node1][WARNIN] Requires: libleveldb.so.1()(64bit)
[node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)

This seems strange.
Can you fix this?


Thanks,
Massimiliano Cuttini





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rados mkpool fails, but not ceph osd pool create

2014-11-18 Thread Gregory Farnum
On Tue, Nov 11, 2014 at 11:43 PM, Gauvain Pocentek
 wrote:
> Hi all,
>
> I'm facing a problem on a ceph deployment. rados mkpool always fails:
>
> # rados -n client.admin mkpool test
> error creating pool test: (2) No such file or directory
>
> rados lspool and rmpool commands work just fine, and the following also
> works:
>
> # ceph osd pool create test 128 128
> pool 'test' created
>
> I've enabled rados debug but it really didn't help much. Should I look at
> mons or osds debug logs?
>
> Any idea about what could be happening?

This worked for me when I tested it on a dev branch. What version are
you running? Can you run "rados -n client.admin mkpool test --debug_ms
1" and show the results?
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Concurrency in ceph

2014-11-18 Thread Gregory Farnum
On Tue, Nov 18, 2014 at 1:26 PM, hp cre  wrote:
> Hello everyone,
>
> I'm new to ceph but been working with proprietary clustered filesystem for
> quite some time.
>
> I almost understand how ceph works,  but have a couple of questions which
> have been asked before here,  but i didn't understand the answer.
>
> In the closed source world,  we use clustered filesystems like Veritas
> clustered filesystem to mount a shared block device (using San) to more than
> one compute node concurrently for shared read/write.
>
> What I can't seem to get a solid and clear answer for is this:
> How can I use ceph to do the same thing?  Can RADOS guarantee coherency and
> integrity of my data if I use an rbd device with any filesystem on top of
> it?  Or must I still use a cluster aware filesystem such as vxfs or ocfs?

RBD behaves just like a regular disk if you mount it to multiple nodes
at once (although you need to disable the client caching). This means
that the disk accesses will be coherent, but using ext4 on top of it
won't work because ext4 assumes it is the only accessor — you have to
use a cluster-aware FS like ocfs2. A SAN would have the same problem
here, so I'm not sure why you think it works with them...
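
As a rough illustration of that split (image and mount point names are
arbitrary, the /dev/rbd0 name assumes it's the first image mapped on each
node, and ocfs2 needs its own o2cb cluster configuration everywhere before
the mount will succeed):

    # create the image once, from any client node
    rbd create shared-disk --size 102400        # size is in MB, so ~100 GB

    # format it once, from a single node, with a cluster-aware filesystem
    rbd map shared-disk
    mkfs.ocfs2 -L shared /dev/rbd0

    # then map and mount it on every node that needs concurrent access
    rbd map shared-disk
    mount /dev/rbd0 /mnt/shared

(The kernel client used here doesn't go through the librbd cache; with
librbd/QEMU you would set rbd cache = false instead.) Putting ext4 in that
mkfs is exactly the case that will eat your data, because each node would
then cache and modify the same metadata blocks independently.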


> And is CephFS going to solve this problem? Or does it not have support for
> concurrent read/write access among all nodes mounting it?

CephFS definitely does support concurrent access to the same data.

> And,  does iscsi targets over rbd devices behave the same?

Uh, yes, iSCSI over rbd will be the same as regular RBD in this
regard, modulo anything the iSCSI gateway might be set up to do.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Concurrency in ceph

2014-11-18 Thread hp cre
Hello everyone,

I'm new to Ceph but have been working with proprietary clustered filesystems
for quite some time.

I mostly understand how Ceph works, but I have a couple of questions which
have been asked here before, though I didn't understand the answers.

In the closed-source world, we use clustered filesystems like Veritas
Cluster File System to mount a shared block device (over a SAN) on more
than one compute node concurrently for shared read/write.

What I can't seem to get a solid and clear answer for is this:
How can I use Ceph to do the same thing? Can RADOS guarantee coherency and
integrity of my data if I use an RBD device with any filesystem on top of
it? Or must I still use a cluster-aware filesystem such as VxFS or OCFS?
And is CephFS going to solve this problem? Or does it not have support for
concurrent read/write access among all nodes mounting it?
And do iSCSI targets over RBD devices behave the same?

Thanks for the help, guys.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-18 Thread Kevin Sumner
Hi Thomas,

I looked over the mds config reference a bit yesterday, but mds cache size 
seems to be the most relevant tunable.

As suggested, I upped mds-cache-size to 1 million yesterday and started the 
load generator.  During load generation, we’re seeing similar behavior on the 
filesystem and the mds.  The mds process is running a little hotter now with 
higher CPU average and 11GB resident size (was just under 10GB iirc).  
Enumerating files on the filesystem, e.g., with ls, is still hanging though.

With load generation disabled, the behavior is the same as before, i.e., things
work as expected.

I’ve got a lot of memory and CPU headroom on the box hosting the mds, so unless
there’s a good reason not to, I’m going to continue increasing the mds cache
iteratively in the hope of finding a size that produces good behavior.  Right
now, I’d expect us to hit around 2 million inodes each minute, so a cache of 1
million is still undersized.  If that doesn’t work, we’re currently running
Firefly on the cluster and I’ll be upgrading it to Giant.
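
For the archives, each iteration amounts to something like the following (the
2000000 value is simply the next step I plan to try, not a recommendation,
and the admin socket path depends on the MDS name):

    # bump the cache on the running MDS
    ceph mds tell 0 injectargs '--mds-cache-size 2000000'

    # make it persistent in ceph.conf on the MDS host:
    #   [mds]
    #       mds cache size = 2000000

    # the mds section of perf dump shows inode and cap counters to watch
    ceph --admin-daemon /var/run/ceph/ceph-mds.*.asok perf dump
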
--
Kevin Sumner
ke...@sumner.io



> On Nov 18, 2014, at 1:36 AM, Thomas Lemarchand 
>  wrote:
> 
> Hi Kevin,
> 
> There are every (I think) MDS tunables listed on this page with a short
> description : http://ceph.com/docs/master/cephfs/mds-config-ref/ 
> 
> 
> Can you tell us how your cluster behave after the mds-cache-size
> change ? What is your MDS ram consumption, before and after ?
> 
> Thanks !
> -- 
> Thomas Lemarchand
> Cloud Solutions SAS - Responsable des systèmes d'information
> 
> 
> 
> On lun., 2014-11-17 at 16:06 -0800, Kevin Sumner wrote:
>>> On Nov 17, 2014, at 15:52, Sage Weil  wrote:
>>> 
>>> On Mon, 17 Nov 2014, Kevin Sumner wrote:
 I've got a test cluster together with ~500 OSDs, 5 MONs, and
 1 MDS.  All
 the OSDs also mount CephFS at /ceph.  I've got Graphite pointing
 at a space
 under /ceph.  Over the weekend, I drove almost 2 million metrics,
 each of
 which creates a ~3MB file in a hierarchical path, each sending a
 datapoint
 into the metric file once a minute.  CephFS seemed to handle the
 writes ok
 while I was driving load.  All files containing each metric are at
 paths
 like this:
 /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
 
 Today, however, with the load generator still running, reading
 metadata of
 files (e.g. directory entries and stat(2) info) in the filesystem
 (presumably MDS-managed data) seems nearly impossible, especially
 deeper
 into the tree.  For example, in a shell cd seems to work but
 ls hangs,
 seemingly indefinitely.  After turning off the load generator and
 allowing a
 while for things to settle down, everything seems to behave
 better.
 
 ceph status and ceph health both return good statuses the entire
 time.
 During load generation, the ceph-mds process seems pegged at
 between 100%
 and 150%, but with load generation turned off, the process has
 some high
 variability from near-idle up to similar 100-150% CPU.
 
 Hopefully, I've missed something in the CephFS tuning.  However,
 I'm looking for
 direction on figuring out if it is, indeed, a tuning problem or if
 this
 behavior is a symptom of the "not ready for production" banner in
 the
 documentation.
>>> 
>>> My first guess is that the MDS cache is just too small and it is 
>>> thrashing.  Try
>>> 
>>> ceph mds tell 0 injectargs '--mds-cache-size 1000000'
>>> 
>>> That's 10x bigger than the default, tho be aware that it will eat up
>>> 10x 
>>> as much RAM too.
>>> 
>>> We've also seen the cache behave in a non-optimal way when evicting 
>>> things, making it thrash more often than it should.  I'm hoping we
>>> can 
>>> implement something like MQ instead of our two-level LRU, but it
>>> isn't 
>>> high on the priority list right now.
>>> 
>>> sage
>> 
>> 
>> Thanks!  I’ll pursue mds cache size tuning.  Is there any guidance on
>> setting the cache and other mds tunables correctly, or is it an
>> adjust-and-test sort of thing?  Cursory searching doesn’t return any
>> relevant documentation for ceph.com.  I’m plowing through some other
>> list posts now.
>> --
>> Kevin Sumner
>> ke...@sumner.io
>> 
>> 
>> 
>> 
>> -- 
>> This message has been scanned for viruses and 
>> dangerous content by MailScanner, and is 
>> believed to be clean. 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
>> 
> 
> 
> -- 
> This message has been scanned for viruses and
> dangerous content by MailScanner, and is
> believed to be clean.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Samuel Just
pastebin or something, probably.
-Sam

On Tue, Nov 18, 2014 at 12:34 PM, Andrei Mikhailovsky  wrote:
> Sam, the logs are rather large in size. Where should I post it to?
>
> Thanks
> 
> From: "Samuel Just" 
> To: "Andrei Mikhailovsky" 
> Cc: ceph-users@lists.ceph.com
> Sent: Tuesday, 18 November, 2014 7:54:56 PM
> Subject: Re: [ceph-users] Giant upgrade - stability issues
>
>
> Ok, why is ceph marking osds down?  Post your ceph.log from one of the
> problematic periods.
> -Sam
>
> On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky 
> wrote:
>> Hello cephers,
>>
>> I need your help and suggestion on what is going on with my cluster. A few
>> weeks ago i've upgraded from Firefly to Giant. I've previously written
>> about
>> having issues with Giant where in two weeks period the cluster's IO froze
>> three times after ceph down-ed two osds. I have in total just 17 osds
>> between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04
>> with
>> latest updates.
>>
>> I've got zabbix agents monitoring the osd servers and the cluster. I get
>> alerts of any issues, such as problems with PGs, etc. Since upgrading to
>> Giant, I am now frequently seeing emails alerting of the cluster having
>> degraded PGs. I am getting around 10-15 such emails per day stating that
>> the
>> cluster has degraded PGs. The number of degraded PGs very between a couple
>> of PGs to over a thousand. After several minutes the cluster repairs
>> itself.
>> The total number of PGs in the cluster is 4412 between all the pools.
>>
>> I am also seeing more alerts from vms stating that there is a high IO wait
>> and also seeing hang tasks. Some vms reporting over 50% io wait.
>>
>> This has not happened on Firefly or the previous releases of ceph. Not
>> much
>> has changed in the cluster since the upgrade to Giant. Networking and
>> hardware is still the same and it is still running the same version of
>> Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the
>> issues
>> above are related to the upgrade of ceph to Giant.
>>
>> Here is the ceph.conf that I use:
>>
>> [global]
>> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
>> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
>> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
>> auth_supported = cephx
>> osd_journal_size = 10240
>> filestore_xattr_use_omap = true
>> public_network = 192.168.168.0/24
>> rbd_default_format = 2
>> osd_recovery_max_chunk = 8388608
>> osd_recovery_op_priority = 1
>> osd_max_backfills = 1
>> osd_recovery_max_active = 1
>> osd_recovery_threads = 1
>> filestore_max_sync_interval = 15
>> filestore_op_threads = 8
>> filestore_merge_threshold = 40
>> filestore_split_multiple = 8
>> osd_disk_threads = 8
>> osd_op_threads = 8
>> osd_pool_default_pg_num = 1024
>> osd_pool_default_pgp_num = 1024
>> osd_crush_update_on_start = false
>>
>> [client]
>> rbd_cache = true
>> admin_socket = /var/run/ceph/$name.$pid.asok
>>
>>
>> I would like to get to the bottom of these issues. Not sure if the issues
>> could be fixed with changing some settings in ceph.conf or a full
>> downgrade
>> back to the Firefly. Is the downgrade even possible on a production
>> cluster?
>>
>> Thanks for your help
>>
>> Andrei
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stackforge Puppet Module

2014-11-18 Thread David Moreau Simard
Great find Nick.

I've discussed it on IRC and it does look like a real issue: 
https://github.com/enovance/edeploy-roles/blob/master/puppet-master.install#L48-L52
I've pushed the fix for review: https://review.openstack.org/#/c/135421/

--
David Moreau Simard


> On Nov 18, 2014, at 3:32 PM, Nick Fisk  wrote:
> 
> Hi David,
> 
> Just to let you know I finally managed to get to the bottom of this.
> 
> In the repo.pp one of the authors has a non ASCII character in his name, for
> whatever reason this was tripping up my puppet environment. After removing
> the following line:-
> 
> # Author: François Charlier 
> 
> The module proceeds further, I'm now getting an error about a missing arg
> parameter, but I hope this should be pretty easy to solve.
> 
> Nick
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> David Moreau Simard
> Sent: 12 November 2014 14:25
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Stackforge Puppet Module
> 
> What comes to mind is that you need to make sure that you've cloned the git
> repository to /etc/puppet/modules/ceph and not
> /etc/puppet/modules/puppet-ceph.
> 
> Feel free to hop on IRC to discuss about puppet-ceph on freenode in
> #puppet-openstack.
> You can find me there as dmsimard.
> 
> --
> David Moreau Simard
> 
>> On Nov 12, 2014, at 8:58 AM, Nick Fisk  wrote:
>> 
>> Hi David,
>> 
>> Many thanks for your reply.
>> 
>> I must admit I have only just started looking at puppet, but a lot of 
>> what you said makes sense to me and understand the reason for not 
>> having the module auto discover disks.
>> 
>> I'm currently having a problem with the ceph::repo class when trying 
>> to push this out to a test server:-
>> 
>> Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
>> Could not find class ceph::repo for ceph-puppet-test on node 
>> ceph-puppet-test
>> Warning: Not using cache on failed catalog
>> Error: Could not retrieve catalog; skipping run
>> 
>> I'm a bit stuck but will hopefully work out why it's not working soon 
>> and then I can attempt your idea of using a script to dynamically pass 
>> disks to the puppet module.
>> 
>> Thanks,
>> Nick
>> 
>> 
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
>> Of David Moreau Simard
>> Sent: 11 November 2014 12:05
>> To: Nick Fisk
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] Stackforge Puppet Module
>> 
>> Hi Nick,
>> 
>> The great thing about puppet-ceph's implementation on Stackforge is 
>> that it is both unit and integration tested.
>> You can see the integration tests here:
>> https://github.com/ceph/puppet-ceph/tree/master/spec/system
>> 
>> Where I'm getting at is that the tests allow you to see how you can 
>> use the module to a certain extent.
>> For example, in the OSD integration tests:
>> -
>> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_s
>> pec.rb
>> #L24 and then:
>> -
>> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_s
>> pec.rb
>> #L82-L110
>> 
>> There's no auto discovery mechanism built-in the module right now. 
>> It's kind of dangerous, you don't want to format the wrong disks.
>> 
>> Now, this doesn't mean you can't "discover" the disks yourself and 
>> pass them to the module from your site.pp or from a composition layer.
>> Here's something I have for my CI environment that uses the 
>> $::blockdevices fact to discover all devices, split that fact into a 
>> list of the devices and then reject the drives I don't want (such as the
> OS disk):
>> 
>>   # Assume OS is installed on xvda/sda/vda.
>>   # On an Openstack VM, vdb is ephemeral, we don't want to use vdc.
>>   # WARNING: ALL OTHER DISKS WILL BE FORMATTED/PARTITIONED BY CEPH!
>>   $block_devices = reject(split($::blockdevices, ','),
>> '(xvda|sda|vda|vdc|sr0)')
>>   $devices = prefix($block_devices, '/dev/')
>> 
>> And then you can pass $devices to the module.
>> 
>> Let me know if you have any questions !
>> --
>> David Moreau Simard
>> 
>>> On Nov 11, 2014, at 6:23 AM, Nick Fisk  wrote:
>>> 
>>> Hi,
>>> 
>>> I'm just looking through the different methods of deploying Ceph and 
>>> I particularly liked the idea that the stackforge puppet module 
>>> advertises of using discover to automatically add new disks. I 
>>> understand the principle of how it should work; using ceph-disk list 
>>> to find unknown disks, but I would like to see in a little more 
>>> detail on
>> how it's been implemented.
>>> 
>>> I've been looking through the puppet module on Github, but I can't 
>>> see anyway where this discovery is carried out.
>>> 
>>> Could anyone confirm if this puppet modules does currently support 
>>> the auto discovery and where  in the code its carried out?
>>> 
>>> Many Thanks,
>>> Nick
>>> 
>>> 
>>> 
>>> 
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com

Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Sam, the logs are rather large. Where should I post them? 

Thanks 
- Original Message -

From: "Samuel Just"  
To: "Andrei Mikhailovsky"  
Cc: ceph-users@lists.ceph.com 
Sent: Tuesday, 18 November, 2014 7:54:56 PM 
Subject: Re: [ceph-users] Giant upgrade - stability issues 

Ok, why is ceph marking osds down? Post your ceph.log from one of the 
problematic periods. 
-Sam 

On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky  wrote: 
> Hello cephers, 
> 
> I need your help and suggestion on what is going on with my cluster. A few 
> weeks ago i've upgraded from Firefly to Giant. I've previously written about 
> having issues with Giant where in two weeks period the cluster's IO froze 
> three times after ceph down-ed two osds. I have in total just 17 osds 
> between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04 with 
> latest updates. 
> 
> I've got zabbix agents monitoring the osd servers and the cluster. I get 
> alerts of any issues, such as problems with PGs, etc. Since upgrading to 
> Giant, I am now frequently seeing emails alerting of the cluster having 
> degraded PGs. I am getting around 10-15 such emails per day stating that the 
> cluster has degraded PGs. The number of degraded PGs varies from a couple 
> of PGs to over a thousand. After several minutes the cluster repairs itself. 
> The total number of PGs in the cluster is 4412 between all the pools. 
> 
> I am also seeing more alerts from vms stating that there is a high IO wait 
> and also seeing hang tasks. Some vms reporting over 50% io wait. 
> 
> This has not happened on Firefly or the previous releases of ceph. Not much 
> has changed in the cluster since the upgrade to Giant. Networking and 
> hardware is still the same and it is still running the same version of 
> Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the issues 
> above are related to the upgrade of ceph to Giant. 
> 
> Here is the ceph.conf that I use: 
> 
> [global] 
> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe 
> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib 
> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13 
> auth_supported = cephx 
> osd_journal_size = 10240 
> filestore_xattr_use_omap = true 
> public_network = 192.168.168.0/24 
> rbd_default_format = 2 
> osd_recovery_max_chunk = 8388608 
> osd_recovery_op_priority = 1 
> osd_max_backfills = 1 
> osd_recovery_max_active = 1 
> osd_recovery_threads = 1 
> filestore_max_sync_interval = 15 
> filestore_op_threads = 8 
> filestore_merge_threshold = 40 
> filestore_split_multiple = 8 
> osd_disk_threads = 8 
> osd_op_threads = 8 
> osd_pool_default_pg_num = 1024 
> osd_pool_default_pgp_num = 1024 
> osd_crush_update_on_start = false 
> 
> [client] 
> rbd_cache = true 
> admin_socket = /var/run/ceph/$name.$pid.asok 
> 
> 
> I would like to get to the bottom of these issues. Not sure if the issues 
> could be fixed with changing some settings in ceph.conf or a full downgrade 
> back to the Firefly. Is the downgrade even possible on a production cluster? 
> 
> Thanks for your help 
> 
> Andrei 
> 
> ___ 
> ceph-users mailing list 
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Stackforge Puppet Module

2014-11-18 Thread Nick Fisk
Hi David,

Just to let you know I finally managed to get to the bottom of this.

In repo.pp one of the authors has a non-ASCII character in his name; for
whatever reason this was tripping up my puppet environment. After removing
the following line:

# Author: François Charlier 

The module proceeds further, I'm now getting an error about a missing arg
parameter, but I hope this should be pretty easy to solve.
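
For anyone else who hits this, something like the following will flag any
offending manifests (assuming GNU grep and the module cloned to
/etc/puppet/modules/ceph):

    # list lines containing non-ASCII bytes in the module's manifests
    grep -rnP '[^\x00-\x7F]' /etc/puppet/modules/ceph/manifests/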

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
David Moreau Simard
Sent: 12 November 2014 14:25
To: Nick Fisk
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Stackforge Puppet Module

What comes to mind is that you need to make sure that you've cloned the git
repository to /etc/puppet/modules/ceph and not
/etc/puppet/modules/puppet-ceph.

Feel free to hop on IRC to discuss about puppet-ceph on freenode in
#puppet-openstack.
You can find me there as dmsimard.

--
David Moreau Simard

> On Nov 12, 2014, at 8:58 AM, Nick Fisk  wrote:
> 
> Hi David,
> 
> Many thanks for your reply.
> 
> I must admit I have only just started looking at puppet, but a lot of 
> what you said makes sense to me and understand the reason for not 
> having the module auto discover disks.
> 
> I'm currently having a problem with the ceph::repo class when trying 
> to push this out to a test server:-
> 
> Error: Could not retrieve catalog from remote server: Error 400 on SERVER:
> Could not find class ceph::repo for ceph-puppet-test on node 
> ceph-puppet-test
> Warning: Not using cache on failed catalog
> Error: Could not retrieve catalog; skipping run
> 
> I'm a bit stuck but will hopefully work out why it's not working soon 
> and then I can attempt your idea of using a script to dynamically pass 
> disks to the puppet module.
> 
> Thanks,
> Nick
> 
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of David Moreau Simard
> Sent: 11 November 2014 12:05
> To: Nick Fisk
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Stackforge Puppet Module
> 
> Hi Nick,
> 
> The great thing about puppet-ceph's implementation on Stackforge is 
> that it is both unit and integration tested.
> You can see the integration tests here:
> https://github.com/ceph/puppet-ceph/tree/master/spec/system
> 
> Where I'm getting at is that the tests allow you to see how you can 
> use the module to a certain extent.
> For example, in the OSD integration tests:
> -
> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_s
> pec.rb
> #L24 and then:
> -
> https://github.com/ceph/puppet-ceph/blob/master/spec/system/ceph_osd_s
> pec.rb
> #L82-L110
> 
> There's no auto discovery mechanism built-in the module right now. 
> It's kind of dangerous, you don't want to format the wrong disks.
> 
> Now, this doesn't mean you can't "discover" the disks yourself and 
> pass them to the module from your site.pp or from a composition layer.
> Here's something I have for my CI environment that uses the 
> $::blockdevices fact to discover all devices, split that fact into a 
> list of the devices and then reject the drives I don't want (such as the
OS disk):
> 
># Assume OS is installed on xvda/sda/vda.
># On an Openstack VM, vdb is ephemeral, we don't want to use vdc.
># WARNING: ALL OTHER DISKS WILL BE FORMATTED/PARTITIONED BY CEPH!
>$block_devices = reject(split($::blockdevices, ','),
> '(xvda|sda|vda|vdc|sr0)')
>$devices = prefix($block_devices, '/dev/')
> 
> And then you can pass $devices to the module.
> 
> Let me know if you have any questions !
> --
> David Moreau Simard
> 
>> On Nov 11, 2014, at 6:23 AM, Nick Fisk  wrote:
>> 
>> Hi,
>> 
>> I'm just looking through the different methods of deploying Ceph and 
>> I particularly liked the idea that the stackforge puppet module 
>> advertises of using discover to automatically add new disks. I 
>> understand the principle of how it should work; using ceph-disk list 
>> to find unknown disks, but I would like to see in a little more 
>> detail on
> how it's been implemented.
>> 
>> I've been looking through the puppet module on Github, but I can't 
>> see anyway where this discovery is carried out.
>> 
>> Could anyone confirm if this puppet modules does currently support 
>> the auto discovery and where  in the code its carried out?
>> 
>> Many Thanks,
>> Nick
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com

Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread Nick Fisk
Hi David,

Have you tried on a normal replicated pool with no cache? I've seen a number
of threads recently where caching is causing various things to block/hang.
It would be interesting to see if this still happens without the caching
layer; at the very least it would rule it out.

Also, is there any sign that as the test passes ~50GB the cache might
start flushing to the backing pool, causing slow performance?

I am planning a deployment very similar to yours so I am following this with
great interest. I'm hoping to build a single node test "cluster" shortly, so
I might be in a position to work with you on this issue and hopefully get it
resolved.
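
Something along these lines would isolate the cache tier as a variable (pool
name, PG count, image size and the /dev/rbd1 device name below are only
placeholders):

    # plain 2-replica pool, no cache tier in front
    ceph osd pool create rbdtest 256 256
    ceph osd pool set rbdtest size 2
    rbd create rbdtest/test --size 102400
    rbd map rbdtest/test

    # same fio job, pointed at the new device
    fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd1 \
        --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write \
        --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio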

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
David Moreau Simard
Sent: 18 November 2014 19:58
To: Mike Christie
Cc: ceph-users@lists.ceph.com; Christopher Spearman
Subject: Re: [ceph-users] Poor RBD performance as LIO iSCSI target

Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted
with "dis" on #ceph-devel.

I ran a LOT of tests on a LOT of combinations of kernels (sometimes with
legacy tunables). I haven't found a magical combination in which the
following test does not hang:
fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0
--bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
--refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio

Either directly on a mapped rbd device, on a mounted filesystem (over rbd),
exported through iSCSI.. nothing.
I guess that rules out a potential issue with iSCSI overhead.

Now, something I noticed out of pure luck is that I am unable to reproduce
the issue if I drop the size of the test to 50GB. Tests will complete in
under 2 minutes.
75GB will hang right at the end and take more than 10 minutes.

TL;DR of tests:
- 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0
--bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
--refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
-- 1m44s, 1m49s, 1m40s

- 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0
--bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write
--refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
-- 10m12s, 10m11s, 10m13s

Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP

Does that ring you guys a bell ?

--
David Moreau Simard


> On Nov 13, 2014, at 3:31 PM, Mike Christie  wrote:
> 
> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>> Running into weird issues here as well in a test environment. I don't
have a solution either but perhaps we can find some things in common..
>> 
>> Setup in a nutshell:
>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with 
>> separate public/cluster network in 10 Gbps)
>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 
>> 0.87-1 (10 Gbps)
>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>> 
>> Relevant cluster config: Writeback cache tiering with NVME PCI-E cards (2
replica) in front of a erasure coded pool (k=3,m=2) backed by spindles.
>> 
>> I'm following the instructions here: 
>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-ima
>> ges-san-storage-devices No issues with creating and mapping a 100GB 
>> RBD image and then creating the target.
>> 
>> I'm interested in finding out the overhead/performance impact of
re-exporting through iSCSI so the idea is to run benchmarks.
>> Here's a fio test I'm trying to run on the client node on the mounted
iscsi device:
>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu 
>> --bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write 
>> --refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
>> 
>> The benchmark will eventually hang towards the end of the test for some
long seconds before completing.
>> On the proxy node, the kernel complains with iscsi portal login 
>> timeout: http://pastebin.com/Q49UnTPr and I also see irqbalance 
>> errors in syslog: http://pastebin.com/AiRTWDwR
>> 
> 
> You are hitting a different issue. German Anders is most likely 
> correct and you hit the rbd hang. That then caused the iscsi/scsi 
> command to timeout which caused the scsi error handler to run. In your 
> logs we see the LIO error handler has received a task abort from the 
> initiator and that timed out which caused the escalation (iscsi portal 
> login related messages).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Bonding woes

2014-11-18 Thread Roland Giesler
Hi people, I have two identical servers (both Sun X2100 M2's) that form
part of a cluster of 3 machines (other machines will be added later).   I
want to bond two GB ethernet ports on these, which works perfectly on the
one, but not on the other.

How can this be?

The one machine (named S2) detects no links as up (with ethtool), yet the
links are up.  When I assign an IP to eth2, for instance, it works 100%
despite ethtool claiming there is no link.

I understand that bonding uses ethtool to determine whether a link is up
and then activates the bond.  So how can I "fix" this?

both machines have the following:

/etc/network/interfaces

# network interface settings
auto lo
iface lo inet loopback

auto eth2
iface eth2 inet static

auto eth3
iface eth3 inet static

iface eth0 inet manual

iface eth1 inet manual

auto bond0
iface bond0 inet manual
slaves eth2, eth3
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer2

auto vmbr0
iface vmbr0 inet static
address  192.168.121.32
netmask  255.255.255.0
gateway  192.168.121.1
bridge_ports bond0
bridge_stp off
bridge_fd 0

And furthermore: /etc/udev/rules.d/70-persistent-net.rules

# This file was automatically generated by the /lib/udev/write_net_rules
# program, run by the persistent-net-generator.rules rules file.
#
# You can modify it, as long as you keep each rule on a single
# line, and change only the value of the NAME= key.

# PCI device
0x14e4:/sys/devices/pci:00/:00:0d.0/:05:00.0/:06:04.0 (tg3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
ATTR{address}=="00:16:36:76:0f:3d", ATTR{dev_id}=="0x0", ATTR{type}=="1",
KERNEL=="eth*", NAME="eth0"

# PCI device
0x14e4:/sys/devices/pci:00/:00:0d.0/:05:00.0/:06:04.1 (tg3)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
ATTR{address}=="00:16:36:76:0f:3e", ATTR{dev_id}=="0x0", ATTR{type}=="1",
KERNEL=="eth*", NAME="eth1"

# PCI device 0x10de:/sys/devices/pci:00/:00:09.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
ATTR{address}=="00:16:36:76:0f:40", ATTR{dev_id}=="0x0", ATTR{type}=="1",
KERNEL=="eth*", NAME="eth2"

# PCI device 0x10de:/sys/devices/pci:00/:00:08.0 (forcedeth)
SUBSYSTEM=="net", ACTION=="add", DRIVERS=="?*",
ATTR{address}=="00:16:36:76:0f:3f", ATTR{dev_id}=="0x0", ATTR{type}=="1",
KERNEL=="eth*", NAME="eth3"

The MAC addresses correlate with the hardware.
The above is from the machine that works.

On the one that doesn't, the following:

/etc/network/interfaces

# network interface settings
auto lo
iface lo inet loopback

auto eth2
iface eth2 inet static

auto eth3
iface eth3 inet static

iface eth0 inet manual

iface eth1 inet manual

auto bond0
iface bond0 inet manual
slaves eth2, eth3
bond_miimon 100
bond_mode 802.3ad
bond_xmit_hash_policy layer2

auto vmbr0
iface vmbr0 inet static
address  192.168.121.31
netmask  255.255.255.0
gateway  192.168.121.1
bridge_ports bond0
bridge_stp off
bridge_fd 0

The MAC addresses differ in the udev rules, but nothing else.

ethtool says eth2 and eth3 don't have a link.

On S2 (the working machine) it says eth2 is down and eth3 is up, but a bond
is formed and the machine is connected.

What is happening here and how can it be resolved?
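
For reference, comparing the link and bond state on the two boxes with
standard tools usually shows where miimon and ethtool disagree:

    # per-NIC link state and driver, as the bonding driver sees it
    ethtool eth2 | grep 'Link detected'
    ethtool -i eth2                  # driver/firmware (forcedeth vs tg3 here)
    cat /sys/class/net/eth2/carrier

    # aggregate view of the bond and its slaves
    cat /proc/net/bonding/bond0

It may also be worth double-checking the "slaves eth2, eth3" line; the
Debian/Ubuntu bonding options normally take a space-separated list
(bond-slaves eth2 eth3), so the comma is at least unusual.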

thanks

Roland
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Poor RBD performance as LIO iSCSI target

2014-11-18 Thread David Moreau Simard
Thanks guys. I looked at http://tracker.ceph.com/issues/8818 and chatted with 
"dis" on #ceph-devel.

I ran a LOT of tests on a LOT of combinations of kernels (sometimes with 
legacy tunables). I haven't found a magical combination in which the following 
test does not hang:
fio --name=writefile --size=100G --filesize=100G --filename=/dev/rbd0 --bs=1M 
--nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers 
--end_fsync=1 --iodepth=200 --ioengine=libaio

Either directly on a mapped rbd device, on a mounted filesystem (over rbd), 
exported through iSCSI.. nothing.
I guess that rules out a potential issue with iSCSI overhead.

Now, something I noticed out of pure luck is that I am unable to reproduce the 
issue if I drop the size of the test to 50GB. Tests will complete in under 2 
minutes.
75GB will hang right at the end and take more than 10 minutes.

TL;DR of tests:
- 3x fio --name=writefile --size=50G --filesize=50G --filename=/dev/rbd0 
--bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write 
--refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
-- 1m44s, 1m49s, 1m40s

- 3x fio --name=writefile --size=75G --filesize=75G --filename=/dev/rbd0 
--bs=1M --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write 
--refill_buffers --end_fsync=1 --iodepth=200 --ioengine=libaio
-- 10m12s, 10m11s, 10m13s

Details of tests here: http://pastebin.com/raw.php?i=3v9wMtYP

Does that ring you guys a bell ?

--
David Moreau Simard


> On Nov 13, 2014, at 3:31 PM, Mike Christie  wrote:
> 
> On 11/13/2014 10:17 AM, David Moreau Simard wrote:
>> Running into weird issues here as well in a test environment. I don't have a 
>> solution either but perhaps we can find some things in common..
>> 
>> Setup in a nutshell:
>> - Ceph cluster: Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 (OSDs with separate 
>> public/cluster network in 10 Gbps)
>> - iSCSI Proxy node (targetcli/LIO): Ubuntu 14.04, Kernel 3.16.7, Ceph 0.87-1 
>> (10 Gbps)
>> - Client node: Ubuntu 12.04, Kernel 3.11 (10 Gbps)
>> 
>> Relevant cluster config: Writeback cache tiering with NVME PCI-E cards (2 
>> replica) in front of a erasure coded pool (k=3,m=2) backed by spindles.
>> 
>> I'm following the instructions here: 
>> http://www.hastexo.com/resources/hints-and-kinks/turning-ceph-rbd-images-san-storage-devices
>> No issues with creating and mapping a 100GB RBD image and then creating the 
>> target.
>> 
>> I'm interested in finding out the overhead/performance impact of 
>> re-exporting through iSCSI so the idea is to run benchmarks.
>> Here's a fio test I'm trying to run on the client node on the mounted iscsi 
>> device:
>> fio --name=writefile --size=100G --filesize=100G --filename=/dev/sdu --bs=1M 
>> --nrfiles=1 --direct=1 --sync=0 --randrepeat=0 --rw=write --refill_buffers 
>> --end_fsync=1 --iodepth=200 --ioengine=libaio
>> 
>> The benchmark will eventually hang towards the end of the test for some long 
>> seconds before completing.
>> On the proxy node, the kernel complains with iscsi portal login timeout: 
>> http://pastebin.com/Q49UnTPr and I also see irqbalance errors in syslog: 
>> http://pastebin.com/AiRTWDwR
>> 
> 
> You are hitting a different issue. German Anders is most likely correct
> and you hit the rbd hang. That then caused the iscsi/scsi command to
> timeout which caused the scsi error handler to run. In your logs we see
> the LIO error handler has received a task abort from the initiator and
> that timed out which caused the escalation (iscsi portal login related
> messages).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Samuel Just
Ok, why is ceph marking osds down?  Post your ceph.log from one of the
problematic periods.
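
A rough way to carve out just the problematic window before uploading it
(timestamps and paths are illustrative):

    # grab an hour around one of the incidents from the cluster log on a mon
    sed -n '/^2014-11-18 09:/,/^2014-11-18 10:/p' /var/log/ceph/ceph.log \
        > ceph-incident.log
    gzip -9 ceph-incident.log
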
-Sam

On Tue, Nov 18, 2014 at 1:35 AM, Andrei Mikhailovsky  wrote:
> Hello cephers,
>
> I need your help and suggestion on what is going on with my cluster. A few
> weeks ago i've upgraded from Firefly to Giant. I've previously written about
> having issues with Giant where in two weeks period the cluster's IO froze
> three times after ceph down-ed two osds. I have in total just 17 osds
> between two osd servers, 3 mons. The cluster is running on Ubuntu 12.04 with
> latest updates.
>
> I've got zabbix agents monitoring the osd servers and the cluster. I get
> alerts of any issues, such as problems with PGs, etc. Since upgrading to
> Giant, I am now frequently seeing emails alerting of the cluster having
> degraded PGs. I am getting around 10-15 such emails per day stating that the
> cluster has degraded PGs. The number of degraded PGs varies from a couple
> of PGs to over a thousand. After several minutes the cluster repairs itself.
> The total number of PGs in the cluster is 4412 between all the pools.
>
> I am also seeing more alerts from vms stating that there is a high IO wait
> and also seeing hang tasks. Some vms reporting over 50% io wait.
>
> This has not happened on Firefly or the previous releases of ceph. Not much
> has changed in the cluster since the upgrade to Giant. Networking and
> hardware is still the same and it is still running the same version of
> Ubuntu OS. The cluster load hasn't changed as well. Thus, I think the issues
> above are related to the upgrade of ceph to Giant.
>
> Here is the ceph.conf that I use:
>
> [global]
> fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe
> mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib
> mon_host = 192.168.168.200,192.168.168.201,192.168.168.13
> auth_supported = cephx
> osd_journal_size = 10240
> filestore_xattr_use_omap = true
> public_network = 192.168.168.0/24
> rbd_default_format = 2
> osd_recovery_max_chunk = 8388608
> osd_recovery_op_priority = 1
> osd_max_backfills = 1
> osd_recovery_max_active = 1
> osd_recovery_threads = 1
> filestore_max_sync_interval = 15
> filestore_op_threads = 8
> filestore_merge_threshold = 40
> filestore_split_multiple = 8
> osd_disk_threads = 8
> osd_op_threads = 8
> osd_pool_default_pg_num = 1024
> osd_pool_default_pgp_num = 1024
> osd_crush_update_on_start = false
>
> [client]
> rbd_cache = true
> admin_socket = /var/run/ceph/$name.$pid.asok
>
>
> I would like to get to the bottom of these issues. Not sure if the issues
> could be fixed with changing some settings in ceph.conf or a full downgrade
> back to the Firefly. Is the downgrade even possible on a production cluster?
>
> Thanks for your help
>
> Andrei
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD commits suicide

2014-11-18 Thread Andrey Korolyov
On Tue, Nov 18, 2014 at 10:04 PM, Craig Lewis  wrote:
> That would probably have helped.  The XFS deadlocks would only occur when
> there was relatively little free memory.  Kernel 3.18 is supposed to have a
> fix for that, but I haven't tried it yet.
>
> Looking at my actual usage, I don't even need 64k inodes.  64k inodes should
> make things a bit faster when you have a large number of files in a
> directory.  Ceph will automatically split directories with too many files
> into multiple sub-directories, so it's kinda pointless.
>
> I may try the experiment again, but probably not.  It took several weeks to
> reformat all of the OSDS.  Even on a single node, it takes 4-5 days to
> drain, format, and backfill.  That was months ago, and I'm still dealing
> with the side effects.  I'm not eager to try again.
>
>
> On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov  wrote:
>>
>> On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis 
>> wrote:
>> > I did have a problem in my secondary cluster that sounds similar to
>> > yours.
>> > I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
>> > options xfs = -i size=64k).   This showed up with a lot of "XFS:
>> > possible
>> > memory allocation deadlock in kmem_alloc" in the kernel logs.  I was
>> > able to
>> > keep things limping along by flushing the cache frequently, but I
>> > eventually
>> > re-formatted every OSD to get rid of the 64k inodes.
>> >
>> > After I finished the reformat, I had problems because of deep-scrubbing.
>> > While reformatting, I disabled deep-scrubbing.  Once I re-enabled it,
>> > Ceph
>> > wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs
>> > would
>> > be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to
>> > spread
>> > out the schedule a bit.  Once this finishes in a few day, I should be
>> > able
>> > to re-enable deep-scrubbing and keep my HEALTH_OK.
>> >
>> >
>>
>> Would you mind to check suggestions by following mine hints or hints
>> from mentioned URLs from there
>> http://marc.info/?l=linux-mm&m=141607712831090&w=2 with 64k again? As
>> for me, I am not observing lock loop after setting min_free_kbytes for
>> a half of gigabyte per OSD. Even if your locks has a different nature,
>> it may be worthy to try anyway.
>
>

Thanks, I understand this perfectly. But if you have a low enough
OSD/node ratio, it can be possible to check the problem at node
scale. By the way, I do not see a real reason to use a lower
allocsize on a cluster that is not designed for object storage.
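
For concreteness, the knob from the hints above is set like this (the per-OSD
reserve and the OSD count are of course site-specific):

    # reserve roughly 512 MB per OSD on a node with 8 OSDs
    sysctl -w vm.min_free_kbytes=$((8 * 512 * 1024))
    # and persist it across reboots
    echo 'vm.min_free_kbytes = 4194304' >> /etc/sysctl.conf
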
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd crashed while there was no space

2014-11-18 Thread Craig Lewis
You shouldn't let the cluster get so full that losing a few OSDs will make
you go toofull.  Letting the cluster get to 100% full is such a bad idea
that you should make sure it doesn't happen.


Ceph is supposed to stop moving data to an OSD once that OSD hits
osd_backfill_full_ratio, which defaults to 0.85.  Any disk at 86% full will
stop backfilling.

I have verified this works when the disks fill up while the cluster is
healthy, but I haven't failed a disk once I'm in the toofull state.  Even
so, mon_osd_full_ratio (default 0.95) or osd_failsafe_full_ratio (default
0.97) should stop all IO until a human gets involved.

The only gotcha I can find is that the values are ratios, and the test
is a "greater than" done with two significant digits.  I.e., if
osd_backfill_full_ratio is 0.85, it will continue backfilling until the
disk is 86% full.  So values like 0.99 and 1.00 will cause problems.
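
For reference, these are easy to inspect and adjust on a live cluster (the
values shown are the defaults, and the admin socket path varies by distro):

    # what a running OSD actually has
    ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok config get osd_backfill_full_ratio

    # the cluster-wide thresholds the monitors enforce
    ceph pg set_full_ratio 0.95
    ceph pg set_nearfull_ratio 0.85

    # runtime change of the backfill threshold on every OSD
    ceph tell osd.* injectargs '--osd-backfill-full-ratio 0.85'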


On Mon, Nov 17, 2014 at 6:50 PM, han vincent  wrote:

> hi, craig:
>
> Your solution did work very well. But if the data is very
> important, when remove directory of PG from OSDs, a small mistake will
> result in loss of data. And if cluster is very large, do not you think
> delete the data on the disk from 100% to 95% is a tedious and
> error-prone thing, for so many OSDs, large disks, and so on.
>
>  so my key question is: if there is no space in the cluster while
> some OSDs crashed,  why the cluster should choose to migrate? And in
> the migrating, other
> OSDs will crashed one by one until the cluster could not work.
>
> 2014-11-18 5:28 GMT+08:00 Craig Lewis :
> > At this point, it's probably best to delete the pool.  I'm assuming the
> pool
> > only contains benchmark data, and nothing important.
> >
> > Assuming you can delete the pool:
> > First, figure out the ID of the data pool.  You can get that from ceph
> osd
> > dump | grep '^pool'
> >
> > Once you have the number, delete the data pool: rados rmpool data data
> > --yes-i-really-really-mean-it
> >
> > That will only free up space on OSDs that are up.  You'll need to manually
> > delete some PGs on the OSDs that are 100% full.  Go to
> > /var/lib/ceph/osd/ceph-/current, and delete a few directories that
> > start with your data pool ID.  You don't need to delete all of them.
> Once
> > the disk is below 95% full, you should be able to start that OSD.  Once
> it's
> > up, it will finish deleting the pool.
> >
> > If you can't delete the pool, it is possible, but it's more work, and you
> > still run the risk of losing data if you make a mistake.  You need to
> > disable backfilling, then delete some PGs on each OSD that's full. Try to
> > only delete one copy of each PG.  If you delete every copy of a PG on all
> > OSDs, then you lost the data that was in that PG.  As before, once you
> > delete enough that the disk is less than 95% full, you can start the OSD.
> > Once you start it, start deleting your benchmark data out of the data
> pool.
> > Once that's done, you can re-enable backfilling.  You may need to scrub
> or
> > deep-scrub the OSDs you deleted data from to get everything back to
> normal.
> >
> >
> > So how did you get the disks 100% full anyway?  Ceph normally won't let
> you
> > do that.  Did you increase mon_osd_full_ratio, osd_backfill_full_ratio,
> or
> > osd_failsafe_full_ratio?
> >
> >
> > On Mon, Nov 17, 2014 at 7:00 AM, han vincent  wrote:
> >>
> >> hello, every one:
> >>
> >> These days a problem with Ceph has been troubling me for a long time.
> >>
> >> I built a cluster with 3 hosts, each host with three OSDs in it.
> >> After that I used the command "rados bench 360 -p data -b 4194304
> >> -t 300 write --no-cleanup"
> >> to test the write performance of the cluster.
> >>
> >> When the cluster was nearly full, no more data could be written to
> >> it. Unfortunately,
> >> a host hung, and then a lot of PGs started to migrate to other
> >> OSDs.
> >> After a while, a lot of OSDs were marked down and out, and my cluster
> >> couldn't work
> >> any more.
> >>
> >> The following is the output of "ceph -s":
> >>
> >> cluster 002c3742-ab04-470f-8a7a-ad0658b547d6
> >> health HEALTH_ERR 103 pgs degraded; 993 pgs down; 617 pgs
> >> incomplete; 1008 pgs peering; 12 pgs recovering; 534 pgs stale; 1625
> >> pgs stuck inactive; 534 pgs stuck stale; 1728 pgs stuck unclean;
> >> recovery 945/29649 objects degraded (3.187%); 1 full osd(s); 1 mons
> >> down, quorum 0,2 2,1
> >>  monmap e1: 3 mons at
> >> {0=10.0.0.97:6789/0,1=10.0.0.98:6789/0,2=10.0.0.70:6789/0}, election
> >> epoch 40, quorum 0,2 2,1
> >>  osdmap e173: 9 osds: 2 up, 2 in
> >> flags full
> >>   pgmap v1779: 1728 pgs, 3 pools, 39528 MB data, 9883 objects
> >> 37541 MB used, 3398 MB / 40940 MB avail
> >> 945/29649 objects degraded (3.187%)
> >>   34 stale+active+degraded+remapped
> >>  176 stale+incomplete
> >>  320 stale+down+peering
> >>

Re: [ceph-users] OSD commits suicide

2014-11-18 Thread Craig Lewis
That would probably have helped.  The XFS deadlocks would only occur when
there was relatively little free memory.  Kernel 3.18 is supposed to have a
fix for that, but I haven't tried it yet.

Looking at my actual usage, I don't even need 64k inodes.  64k inodes
should make things a bit faster when you have a large number of files in a
directory.  Ceph will automatically split directories with too many files
into multiple sub-directories, so it's kinda pointless.
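
(The directory splitting is governed by the filestore merge and split
settings; a minimal sketch of where they would live in ceph.conf, with the
values shown being just the defaults as I understand them:)

    [osd]
    filestore_merge_threshold = 10
    filestore_split_multiple = 2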

I may try the experiment again, but probably not.  It took several weeks to
reformat all of the OSDs.  Even on a single node, it takes 4-5 days to
drain, format, and backfill.  That was months ago, and I'm still dealing
with the side effects.  I'm not eager to try again.


On Mon, Nov 17, 2014 at 2:04 PM, Andrey Korolyov  wrote:

> On Tue, Nov 18, 2014 at 12:54 AM, Craig Lewis 
> wrote:
> > I did have a problem in my secondary cluster that sounds similar to
> yours.
> > I was using XFS, and traced my problem back to 64 kB inodes (osd mkfs
> > options xfs = -i size=64k).   This showed up with a lot of "XFS: possible
> > memory allocation deadlock in kmem_alloc" in the kernel logs.  I was
> able to
> > keep things limping along by flushing the cache frequently, but I
> eventually
> > re-formatted every OSD to get rid of the 64k inodes.
> >
> > After I finished the reformat, I had problems because of deep-scrubbing.
> > While reformatting, I disabled deep-scrubbing.  Once I re-enabled it,
> Ceph
> > wanted to deep-scrub the whole cluster, and sometimes 90% of my OSDs
> would
> > be doing a deep-scrub.  I'm manually deep-scrubbing now, trying to spread
> > out the schedule a bit.  Once this finishes in a few days, I should be
> able
> > to re-enable deep-scrubbing and keep my HEALTH_OK.
> >
> >
>
> Would you mind checking the suggestions in my hints, or the hints in the
> URLs mentioned at
> http://marc.info/?l=linux-mm&m=141607712831090&w=2, with 64k inodes again?
> For my part, I am not observing the lock loop after setting
> min_free_kbytes to half a gigabyte per OSD. Even if your locks have a
> different nature, it may be worth trying anyway.
>
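
(For anyone who wants to try that tuning, a minimal sketch — the 9 OSDs per
node and the half-gigabyte figure are only the example from above; adjust to
your own layout:)

    # reserve ~512 MB per OSD for the kernel, e.g. 9 OSDs -> ~4.5 GB
    sysctl -w vm.min_free_kbytes=$((9 * 512 * 1024))
    echo "vm.min_free_kbytes = 4718592" >> /etc/sysctl.d/99-ceph.conf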
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Rados Gateway Replication - Containers not accessible via slave zone !

2014-11-18 Thread Vinod H I
Hi,

I am trying to test disaster recovery of rados gateways.
I setup a federated architecture for rados gateway as explained in the docs.
I am using ceph version - 0.80.7
I have setup only one region, "us", with two zones.
"us-west" slave zone having user "us-east"
"us-east" master zone having user "us-east"
The details of specific users are given below.
Details of user for us-east-1 gateway.
{ "user_id": "us-east",
  "display_name": "Region-US Zone-East",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
        { "id": "us-east:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "us-east:swift",
          "access_key": "0DQH33TDOLDPNUOHDGLX",
          "secret_key": ""},
        { "user": "us-east",
          "access_key": "PAA0BEG7ALEEDYXOJ7NE",
          "secret_key": "BBQNeJ9il5lVWU0u897KK3oJRcifQcQdntuqNufu"}],
  "swift_keys": [
        { "user": "us-east:swift",
          "secret_key": "yLbRVIs7QIWcSYLS8KMqzdGWyc3LaKqqvaXJNdF6"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "system": "true",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "user_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "temp_url_keys": []}

Details of user for us-west-1 gateway
{ "user_id": "us-west",
  "display_name": "Region-US Zone-West",
  "email": "",
  "suspended": 0,
  "max_buckets": 1000,
  "auid": 0,
  "subusers": [
        { "id": "us-west:swift",
          "permissions": "full-control"}],
  "keys": [
        { "user": "us-west:swift",
          "access_key": "ABAI43X3JZ2LE734XC71",
          "secret_key": ""},
        { "user": "us-west",
          "access_key": "98VDZ8ZTWZMFAT1YWXIL",
          "secret_key": "wKQfBqJtYCZ4VSK26JIYN9tad2GC6t9BKyUsHEb3"}],
  "swift_keys": [
        { "user": "us-west:swift",
          "secret_key": "KrjdheLazRpMRzUIpzLgxd0pjN81quFlnp97pwHs"}],
  "caps": [],
  "op_mask": "read, write, delete",
  "system": "true",
  "default_placement": "",
  "placement_tags": [],
  "bucket_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "user_quota": { "enabled": false,
      "max_size_kb": -1,
      "max_objects": -1},
  "temp_url_keys": []}

Now I created a bucket in "us-east" zone with read permissions for all.
vinod@LT05:~$ swift --verbose  -A http://us-east-1.lt.com/auth -U
us-east:swift
-K yLbRVIs7QIWcSYLS8KMqzdGWyc3LaKqqvaXJNdF6 stat Container1
  Account: v1
  Container: Container1
  Objects: 0
  Bytes: 0
  Read ACL: .r:*
  Write ACL:
  Sync To:
  Sync Key:
  Vary: Accept-Encoding
  Server: Apache/2.2.22 (Ubuntu)
  X-Container-Bytes-Used-Actual: 0
  Content-Type: text/plain; charset=utf-8

There are no containers on the us-west zone.
When I try to create a new container directly in the us-west zone, it
returns status 403.
I guess this is because it's the slave zone.
But the doc says "You may read objects from secondary zones.
Currently, the Gateway does not prevent you from writing to a secondary
zone, but DON’T DO IT."
I am just curious why I am not able to create containers!

Now I sync the zones using 'radosgw-agent', with the command
sudo radosgw-agent
--dest-access-key=wKQfBqJtYCZ4VSK26JIYN9tad2GC6t9BKyUsHEb3
--dest-secret-key=wKQfBqJtYCZ4VSK26JIYN9tad2GC6t9BKyUsHEb3
--src-access-key=PAA0BEG7ALEEDYXOJ7NE
--src-secret-key=BBQNeJ9il5lVWU0u897KK3oJRcifQcQdntuqNufu
--source=http://us-east-1.lt.com
--sync-scope=full --log-file=/var/log/radosgw/zone-sync-us-east-west.log
http://us-west-1.lt.com

There are no errors logged during this process.
But I am not able to see this container in the us-west zone.

vinod@LT05:~$ swift --verbose  -A http://us-west-1.lt.com/auth -U
us-west:swift
-K KrjdheLazRpMRzUIpzLgxd0pjN81quFlnp97pwHs stat
StorageURL: http://us-west-1.lt.com/swift/v1
Auth Token:
AUTH_rgwtk0d0075732d776573743a7377696674080418ee247db6d6c5986c54a00cc1145bcd8fba363322c25ba6508535b5f513c29b3a53
  Account: v1
  Containers: 0
  Objects: 0
  Bytes: 0
  Vary: Accept-Encoding
  Server: Apache/2.2.22 (Ubuntu)
  X-Account-Bytes-Used-Actual: 0
  Content-Type: text/plain; charset=utf-8

How can I access the container from the us-west-1 rgw instance?
Do I need to manually create the us-east user on the us-west-1 instance
as well?
Right now there is one common storage cluster for both zones. Is it that
replication will work only when the storage clusters are different?
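
(In case it matters, this is roughly how the system users were created — a
sketch following the federated-config docs; the cephx instance names are my
own naming and the keys are elided:)

    radosgw-admin user create --uid="us-east" --display-name="Region-US Zone-East" \
        --name client.radosgw.us-east-1 --system
    radosgw-admin user create --uid="us-west" --display-name="Region-US Zone-West" \
        --name client.radosgw.us-west-1 --system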


-- 
Vinod H I
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

[ceph-users] Dependency issues in fresh ceph/CentOS 7 install

2014-11-18 Thread Massimiliano Cuttini

Dear all,

I try to install Ceph but I get errors:

   #ceph-deploy install node1
   []
   [ceph_deploy.install][DEBUG ] Installing stable version *firefly *on
   cluster ceph hosts node1
   [ceph_deploy.install][DEBUG ] Detecting platform for host node1 ...
   []
   [node1][DEBUG ] ---> Package libXxf86vm.x86_64 0:1.1.3-2.1.el7
   will be installed
   [node1][DEBUG ] ---> Package mesa-libgbm.x86_64
   0:9.2.5-6.20131218.el7_0 will be installed
   [node1][DEBUG ] ---> Package mesa-libglapi.x86_64
   0:9.2.5-6.20131218.el7_0 will be installed
   [node1][DEBUG ] --> Finished Dependency Resolution
   [node1][WARNIN] Error: Package:
   ceph-common-0.80.7-0.el7.centos.x86_64 (Ceph)
   [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
   [node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64
   (Ceph)
   [node1][DEBUG ]  You could try using --skip-broken to work around
   the problem
   [node1][WARNIN] Requires: libleveldb.so.1()(64bit)
   [node1][WARNIN] Error: Package: ceph-0.80.7-0.el7.centos.x86_64
   (Ceph)
   [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
   [node1][DEBUG ]  You could try running: rpm -Va --nofiles --nodigest
   [node1][ERROR ] RuntimeError: command returned non-zero exit status: 1
   *[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum
   -y install ceph*

I installed the GIANT version, not FIREFLY, on the admin node.
Is it a typo in the config file, or is it truly trying to install 
FIREFLY instead of GIANT?


About the error, I see that it's related to missing system libraries.
It seems that Ceph requires libraries that are not available in the current distro:

   [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)
   [node1][WARNIN] Requires: libleveldb.so.1()(64bit)
   [node1][WARNIN] Requires: libtcmalloc.so.4()(64bit)

This seems strange.
Can you fix this?
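
(A possible workaround sketch, assuming the missing libtcmalloc/leveldb
packages are the ones shipped in EPEL and that your ceph-deploy supports
--release — not tested here:)

    yum install -y epel-release
    yum install -y gperftools-libs leveldb
    ceph-deploy install --release giant node1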


Thanks,
Massimiliano Cuttini



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubleshooting an erasure coded pool with a cache tier

2014-11-18 Thread Nick Fisk
Has anyone tried applying this fix to see if it makes any difference?

https://github.com/ceph/ceph/pull/2374

I might be in a position in a few days to build a test cluster to test myself, 
but was wondering if anyone else has had any luck with it?

Nick

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: 18 November 2014 01:11
To: ceph-users
Subject: Re: [ceph-users] Troubleshooting an erasure coded pool with a cache 
tier


Hello,

On Mon, 17 Nov 2014 17:45:54 +0100 Laurent GUERBY wrote:

> Hi,
> 
> Just a follow-up on this issue, we're probably hitting:
> 
> http://tracker.ceph.com/issues/9285
> 
> We had the issue a few weeks ago with replicated SSD pool in front of 
> rotational pool and turned off cache tiering.
> 
> Yesterday we made a new test and activating cache tiering on a single 
> erasure pool threw the whole ceph cluster performance to the floor 
> (including non cached non erasure coded pools) with frequent "slow 
> write" in the logs. Removing cache tiering was enough to go back to 
> normal performance.
>
Ouch!
 
> I assume no one uses cache tiering on 0.80.7 in production clusters?
>
Not me, and now I'm even less inclined to do so, 
since this particular item is not the first one that puts cache tiers in doubt, 
but it is certainly the most compelling one.

I wonder how much pressure was on that cache tier, though. 
If I understand the bug report correctly, this should only happen if some 
object gets evicted before it was fully replicated.
So I suppose if the cache pool is sized "correctly" for the working set in 
question (which of course is a bugger given a 4MB granularity), things should 
work. Until you hit the threshold and they don't anymore...
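
(For anyone experimenting, a sketch of the sizing knobs involved — the pool
name "hot-pool" and the values are purely illustrative:)

    ceph osd pool set hot-pool target_max_bytes 200000000000
    ceph osd pool set hot-pool cache_target_dirty_ratio 0.4
    ceph osd pool set hot-pool cache_target_full_ratio 0.8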

Given that this isn't fixed in Giant either, there goes my plan to speed up a 
cluster with ample space but insufficient IOPS with cache tiering.

Christian
 
> Sincerely,
> 
> Laurent
> 
> Le Sunday 09 November 2014 à 00:24 +0100, Loic Dachary a écrit :
> > 
> > On 09/11/2014 00:03, Gregory Farnum wrote:
> > > It's all about the disk accesses. What's the slow part when you 
> > > dump historic and in-progress ops?
> > 
> > This is what I see on g1 (6% iowait)
> > 
> > root@g1:~# ceph daemon osd.0 dump_ops_in_flight
> > { "num_ops": 0,
> >   "ops": []}
> > 
> > root@g1:~# ceph daemon osd.0 dump_ops_in_flight
> > { "num_ops": 1,
> >   "ops": [
> >         { "description": "osd_op(client.4407100.0:11030174 rb.0.410809.238e1f29.1038 [set-alloc-hint object_size 4194304 write_size 4194304,write 4095488~4096] 58.3aabb66d ack+ondisk+write e15613)",
> >           "received_at": "2014-11-09 00:14:17.385256",
> >           "age": "0.538802",
> >           "duration": "0.011955",
> >           "type_data": [
> >                 "waiting for sub ops",
> >                 { "client": "client.4407100",
> >                   "tid": 11030174},
> >                 [
> >                     { "time": "2014-11-09 00:14:17.385393",
> >                       "event": "waiting_for_osdmap"},
> >                     { "time": "2014-11-09 00:14:17.385563",
> >                       "event": "reached_pg"},
> >                     { "time": "2014-11-09 00:14:17.385793",
> >                       "event": "started"},
> >                     { "time": "2014-11-09 00:14:17.385807",
> >                       "event": "started"},
> >                     { "time": "2014-11-09 00:14:17.385875",
> >                       "event": "waiting for subops from 1,10"},
> >                     { "time": "2014-11-09 00:14:17.386201",
> >                       "event": "commit_queued_for_journal_write"},
> >                     { "time": "2014-11-09 00:14:17.386336",
> >                       "event": "write_thread_in_journal_buffer"},
> >                     { "time": "2014-11-09 00:14:17.396293",
> >                       "event": "journaled_completion_queued"},
> >                     { "time": "2014-11-09 00:14:17.396332",
> >                       "event": "op_commit"},
> >                     { "time": "2014-11-09 00:14:17.396678",
> >                       "event": "op_applied"},
> >                     { "time": "2014-11-09 00:14:17.397211",
> >                       "event": "sub_op_commit_rec"}]]}]}
> > 
> > and it looks ok. When I go to n7 which has 20% iowait, I see a much 
> > larger output http://pastebin.com/DPxsaf6z which includes a number 
> > of
> > "event": "waiting_for_osdmap".
> > 
> > I'm not sure what to make of this and it would certainly be better 
> > if
> > n7 had a lower iowait. Also when I ceph -w I see a new pgmap is 
> > created every second which is also not a good sign.
> > 
> > 2014-11-09 00:22:47.090795 mon.0 [INF] pgmap v4389613: 460 pgs: 460
> > active+clean; 2580 GB data, 6735 GB used, 18850 GB / 26955 GB avail;
> > 3889 B/s rd, 2125 kB/s wr, 237 op/s 2014-11-09 00:22:48.143412 mon.0 
> > [INF] pgmap v4389614: 460 pgs: 460 active+clean; 2580 GB data, 6735 
> > GB used, 18850 GB / 26955 GB avail; 1586 kB/s wr, 204 

Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread houmles
What do you mean by OSD level? The pool has size 4 and min_size 1.
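
(In case it helps, a quick sketch of commands to double-check what the
cluster thinks — the pool name is a placeholder, the rule number matches the
ruleset below:)

    ceph osd pool get <pool> size
    ceph osd crush rule dump repl
    ceph osd getcrushmap -o /tmp/crushmap
    crushtool -i /tmp/crushmap --test --rule 5 --num-rep 4 --show-mappings | head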


On Tue, Nov 18, 2014 at 10:32:11AM +, Anand Bhat wrote:
> What are the settings for min_size and size at the OSD level in your Ceph 
> configuration?  It looks like size is set to 2, which halves your total storage, 
> as two copies of the data need to be stored.
> 
> Regards,
> Anand
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> houmles
> Sent: Tuesday, November 18, 2014 3:31 PM
> To: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] incorrect pool size, wrong ruleset?
> 
> Does nobody know where the problem could be?
> 
> 
> On Wed, Nov 12, 2014 at 10:41:36PM +0100, houmles wrote:
> > Hi,
> >
> > I have 2 hosts with 8 2TB drives in each.
> > I want to have 2 replicas between both hosts and then 2 replicas between
> > the OSDs on each host. That way, even if I lose one host, I still have 2
> > replicas.
> >
> > Currently I have this ruleset:
> >
> > rule repl {
> > ruleset 5
> > type replicated
> > min_size 1
> > max_size 10
> > step take asterix
> > step choose firstn -2 type osd
> > step emit
> > step take obelix
> > step choose firstn 2 type osd
> > step emit
> > }
> >
> > This works OK. I have 4 replicas as I want and the PGs are distributed 
> > perfectly, but when I run ceph df I see only half of the capacity I 
> > should have.
> > In total it's 32TB, 16TB in each host. If there are 2 replicas on each 
> > host it should report around 8TB, right? It's reporting only 4TB in the pool, 
> > which is 1/8 of the total capacity.
> > Can anyone tell me what is wrong?
> >
> > Thanks
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] incorrect pool size, wrong ruleset?

2014-11-18 Thread houmles
Does nobody know where the problem could be?


On Wed, Nov 12, 2014 at 10:41:36PM +0100, houmles wrote:
> Hi,
> 
> I have 2 hosts with 8 2TB drives in each.
> I want to have 2 replicas between both hosts and then 2 replicas between the OSDs 
> on each host. That way, even if I lose one host, I still have 2 replicas.
> 
> Currently I have this ruleset:
> 
> rule repl {
> ruleset 5
> type replicated
> min_size 1
> max_size 10
> step take asterix
> step choose firstn -2 type osd
> step emit
> step take obelix
> step choose firstn 2 type osd
> step emit
> }
> 
> This works OK. I have 4 replicas as I want and the PGs are distributed perfectly, 
> but when I run ceph df I see only half of the capacity I should have.
> In total it's 32TB, 16TB in each host. If there are 2 replicas on each host 
> it should report around 8TB, right? It's reporting only 4TB in the pool, which is 
> 1/8 of the total capacity.
> Can anyone tell me what is wrong?
> 
> Thanks
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS unresponsive at scale (2M files,

2014-11-18 Thread Thomas Lemarchand
Hi Kevin,

Every MDS tunable is (I think) listed on this page with a short
description: http://ceph.com/docs/master/cephfs/mds-config-ref/

Can you tell us how your cluster behaves after the mds-cache-size
change? What is your MDS RAM consumption, before and after?
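
(A rough way to capture both on the MDS host — a sketch, adjust the daemon
name to your own:)

    ceph daemon mds.<id> perf dump | grep -iE 'inode|cap'   # cache object counters
    ps -o rss,vsz,cmd -C ceph-mds                           # resident memory of ceph-mds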

Thanks !
-- 
Thomas Lemarchand
Cloud Solutions SAS - Responsable des systèmes d'information



On lun., 2014-11-17 at 16:06 -0800, Kevin Sumner wrote:
> > On Nov 17, 2014, at 15:52, Sage Weil  wrote:
> > 
> > On Mon, 17 Nov 2014, Kevin Sumner wrote:
> > > I've got a test cluster together with ~500 OSDs, 5 MONs, and
> > > 1 MDS.  All
> > > the OSDs also mount CephFS at /ceph.  I've got Graphite pointing
> > > at a space
> > > under /ceph.  Over the weekend, I drove almost 2 million metrics,
> > > each of
> > > which creates a ~3MB file in a hierarchical path, each sending a
> > > datapoint
> > > into the metric file once a minute.  CephFS seemed to handle the
> > > writes ok
> > > while I was driving load.  All files containing each metric are at
> > > paths
> > > like this:
> > > /ceph/whisper/sandbox/cephtest-osd0013/2/3/4/5.wsp
> > > 
> > > Today, however, with the load generator still running, reading
> > > metadata of
> > > files (e.g. directory entries and stat(2) info) in the filesystem
> > > (presumably MDS-managed data) seems nearly impossible, especially
> > > deeper
> > > into the tree.  For example, in a shell cd seems to work but
> > > ls hangs,
> > > seemingly indefinitely.  After turning off the load generator and
> > > allowing a
> > > while for things to settle down, everything seems to behave
> > > better.
> > > 
> > > ceph status and ceph health both return good statuses the entire
> > > time.
> > >  During load generation, the ceph-mds process seems pegged at
> > > between 100%
> > > and 150%, but with load generation turned off, the process has
> > > some high
> > > variability from near-idle up to similar 100-150% CPU.
> > > 
> > > Hopefully, I've missed something in the CephFS tuning.  However,
> > > I'm looking for
> > > direction on figuring out if it is, indeed, a tuning problem or if
> > > this
> > > behavior is a symptom of the "not ready for production" banner in
> > > the
> > > documentation.
> > 
> > My first guess is that the MDS cache is just too small and it is 
> > thrashing.  Try
> > 
> > ceph mds tell 0 injectargs '--mds-cache-size 1000000'
> > 
> > That's 10x bigger than the default, though be aware that it will eat up
> > 10x 
> > as much RAM too.
> > 
> > We've also seen the cache behave in a non-optimal way when evicting 
> > things, making it thrash more often than it should.  I'm hoping we
> > can 
> > implement something like MQ instead of our two-level LRU, but it
> > isn't 
> > high on the priority list right now.
> > 
> > sage
> 
> 
> Thanks!  I’ll pursue mds cache size tuning.  Is there any guidance on
> setting the cache and other mds tunables correctly, or is it an
> adjust-and-test sort of thing?  Cursory searching doesn’t return any
> relevant documentation for ceph.com.  I’m plowing through some other
> list posts now.
> --
> Kevin Sumner
> ke...@sumner.io
> 
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Giant upgrade - stability issues

2014-11-18 Thread Andrei Mikhailovsky
Hello cephers, 

I need your help and suggestions on what is going on with my cluster. A few 
weeks ago I upgraded from Firefly to Giant. I've previously written about 
having issues with Giant where, in a two-week period, the cluster's IO froze three 
times after Ceph marked two OSDs down. I have in total just 17 OSDs between two OSD 
servers, plus 3 mons. The cluster is running on Ubuntu 12.04 with the latest updates. 

I've got zabbix agents monitoring the osd servers and the cluster. I get alerts 
of any issues, such as problems with PGs, etc. Since upgrading to Giant, I am 
now frequently seeing emails alerting of the cluster having degraded PGs. I am 
getting around 10-15 such emails per day stating that the cluster has degraded 
PGs. The number of degraded PGs very between a couple of PGs to over a 
thousand. After several minutes the cluster repairs itself. The total number of 
PGs in the cluster is 4412 between all the pools. 

I am also seeing more alerts from VMs stating that there is high IO wait, and 
am also seeing hung tasks. Some VMs are reporting over 50% IO wait. 

This has not happened on Firefly or the previous releases of ceph. Not much has 
changed in the cluster since the upgrade to Giant. Networking and hardware is 
still the same and it is still running the same version of Ubuntu OS. The 
cluster load hasn't changed either. Thus, I think the issues above are related 
to the upgrade of ceph to Giant. 
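
(A sketch of what I can collect next time it happens, assuming the default
log locations — happy to gather anything else that would help:)

    ceph health detail | grep -Ei 'degraded|down'
    ceph osd tree | grep -i down
    grep -i 'wrongly marked me down' /var/log/ceph/ceph-osd.*.log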

Here is the ceph.conf that I use: 

[global] 
fsid = 51e9f641-372e-44ec-92a4-b9fe55cbf9fe 
mon_initial_members = arh-ibstorage1-ib, arh-ibstorage2-ib, arh-cloud13-ib 
mon_host = 192.168.168.200,192.168.168.201,192.168.168.13 
auth_supported = cephx 
osd_journal_size = 10240 
filestore_xattr_use_omap = true 
public_network = 192.168.168.0/24 
rbd_default_format = 2 
osd_recovery_max_chunk = 8388608 
osd_recovery_op_priority = 1 
osd_max_backfills = 1 
osd_recovery_max_active = 1 
osd_recovery_threads = 1 
filestore_max_sync_interval = 15 
filestore_op_threads = 8 
filestore_merge_threshold = 40 
filestore_split_multiple = 8 
osd_disk_threads = 8 
osd_op_threads = 8 
osd_pool_default_pg_num = 1024 
osd_pool_default_pgp_num = 1024 
osd_crush_update_on_start = false 

[client] 
rbd_cache = true 
admin_socket = /var/run/ceph/$name.$pid.asok 


I would like to get to the bottom of these issues. I am not sure if they could 
be fixed by changing some settings in ceph.conf, or whether a full downgrade back to 
Firefly is needed. Is a downgrade even possible on a production cluster? 

Thanks for your help 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] jbod + SMART : how to identify failing disks ?

2014-11-18 Thread SCHAER Frederic
Wow. Thanks
Not very operations friendly though…

Wouldn’t it be just OK to pull the disk that we think is the bad one, check the 
serial number, and if it’s not the right one, just replug it and let the udev rules do 
their job and re-insert the disk into the ceph cluster ?
(provided XFS doesn’t freeze for good when we do that)
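
(For the record, what I plan to try first, in case it helps someone — the
device names are placeholders, and ledctl assumes the ledmon package and a
compatible enclosure:)

    dd if=/dev/sdX of=/dev/null bs=1M iflag=direct   # read-only stream to make the activity LED blink
    smartctl -i /dev/sdX | grep -i serial            # confirm the serial before pulling the drive
    ledctl locate=/dev/sdX                           # SES locate LED, if supported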

Regards

From: Craig Lewis [mailto:cle...@centraldesktop.com]
Sent: Monday, 17 November 2014 22:32
To: SCHAER Frederic
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] jbod + SMART : how to identify failing disks ?

I use `dd` to force activity to the disk I want to replace, and watch the 
activity lights.  That only works if your disks aren't 100% busy.  If they are, 
stop the ceph-osd daemon, and see which drive stops having activity.  Repeat 
until you're 100% confident that you're pulling the right drive.

On Wed, Nov 12, 2014 at 5:05 AM, SCHAER Frederic
<frederic.sch...@cea.fr> wrote:
Hi,

I’m used to RAID software giving me the failing disks’ slots, and most often 
blinking the disks in the disk bays.
I recently installed a DELL “6GB HBA SAS” JBOD card, said to be an LSI 2008 
one, and I now have to identify 3 pre-failed disks (so says S.M.A.R.T.).

Since this is an LSI card, I thought I’d use MegaCli to identify the disk slots, but 
MegaCli does not see the HBA card.
Then I found the LSI “sas2ircu” utility, but again, this one fails to give me 
the disk slots (it finds the disks, serials and so on, but the slot is always 0).
Because of this, I’m going to head over to the disk bay and unplug the disk 
which I think corresponds to the alphabetical order in linux, and see if it’s 
the correct one…. But even if this is correct this time, it might not be next 
time.

But this makes me wonder: how do you guys, Ceph users, manage your disks if 
you really have JBOD servers ?
I can’t imagine having to guess slots like that each time, and I can’t imagine 
creating serial number stickers for every single disk I could have to 
manage, either …
Is there any specific advice regarding JBOD cards people should (not) use in 
their systems ?
Any magical way to “blink” a drive in linux ?

Thanks && regards

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com