Re: [ceph-users] ceph osd won't boot, resource shortage?

2015-09-18 Thread Shinobu Kinjo
Sorry for that. That's my fault.

 Disclaimer:
  This is what I always do for advanced investigation.
  It is NOT a common solution.
  Other experts may have different approaches.

Here is what you can do to see what is actually going on at the I/O layer:

   1. Install fio.
   2. Change the following parameter:

 /proc/sys/fs/aio-max-nr

 For this example, I used a very low number: 5

 sudo sysctl -w fs.aio-max-nr=5

   3. Monitor:

 /proc/sys/fs/aio-nr

  while :
  do
  date
  cat /proc/sys/fs/aio-nr
  sleep 1
  done

   4. Run fio with:

 ioengine=libaio
 direct=1
 # man fio(1)
 # Number of clones (processes/threads 
 # performing the same workload) of this
 # job.
 numjobs=100 
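
 Putting it together, a minimal fio job file (say, aio-test.fio) could
 look like the sketch below. Only ioengine, direct and numjobs come from
 the steps above; the job name, target path, I/O pattern, block size and
 file size are placeholders I picked for illustration, so adjust them to
 your environment (the path must be on a filesystem that supports
 O_DIRECT, since direct=1 is used).

 [aio-test]
 ioengine=libaio
 direct=1
 numjobs=100
 rw=randwrite
 bs=4k
 size=64m
 filename=/var/tmp/fio-aio-test

 Then run "fio aio-test.fio" while the monitor loop from step 3 is
 running in another terminal.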

After running the fio job under that condition, you should see the
following:

// Monitor
Sat Sep 19 08:14:10 JST 2015
5
Sat Sep 19 08:14:10 JST 2015
5

This means that your job hit the maximum number of AIO requests allowed
by the kernel.

// Fio

fio: pid=29607, err=11/file:engines/libaio.c:273, func=io_queue_init, \
 error=Resource temporarily unavailable
fio: check /proc/sys/fs/aio-max-nr

This means that, because aio-max-nr was set to only 5, you ran out of
AIO resources.

If you have any questions or concerns, just let us know.

Shinobu

- Original Message -
From: "Peter Sabaini" 
To: "Shinobu Kinjo" 
Cc: ceph-users@lists.ceph.com
Sent: Friday, September 18, 2015 10:21:29 PM
Subject: Re: [ceph-users] ceph osd won't boot, resource shortage?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 18.09.15 14:47, Shinobu Kinjo wrote:
> I do not think that it's best practice to increase that number
> at the moment. It's kind of lack of consideration.
> 
> We might need to do that as a result.
> 
> But what we should do, first, is to check current actual number
> of aio using:
> 
> watch -dc cat /proc/sys/fs/aio-nr

I did, it got up to about 138240

> then increase, if it's necessary.
> 
> Anyway you have to be more careful, otherwise there might be
> back-and-forth meaningless configuration changes -;

I'm sorry, I don't quite understand what you mean. Could you
elaborate? Are there specific risks associated with a high setting
of fs.aio-max-nr?

FWIW, I've done some load testing (using rados bench and rados
load-gen) -- anything I should watch out for in your opinion?


Thanks,
peter.


> Shinobu
> 
> - Original Message - From: "Peter Sabaini"
>  To: ceph-users@lists.ceph.com Sent:
> Thursday, September 17, 2015 11:51:11 PM Subject: Re:
> [ceph-users] ceph osd won't boot, resource shortage?
> 
> On 16.09.15 16:41, Peter Sabaini wrote:
>> Hi all,
> 
>> I'm having trouble adding OSDs to a storage node; I've got 
>> about 28 OSDs running, but adding more fails.
> 
> So, it seems the requisite knob was sysctl fs.aio-max-nr By
> default, this was set to 64K here. I set it:
> 
> # echo 2097152 > /proc/sys/fs/aio-max-nr
> 
> This let me add my remaining OSDs.
> 
> 
> 
>> Typical log excerpt:
> 
>> 2015-09-16 13:55:58.083797 7f3e7b821800  1 journal _open 
>> /var/lib/ceph/osd/ceph-28/journal fd 20: 21474836480 bytes, 
>> block size 4096 bytes, directio = 1, aio = 1 2015-09-16 
>> 13:55:58.090709 7f3e7b821800 -1 journal FileJournal::_open: 
>> unable to setup io_context (61) No data available 2015-09-16 
>> 13:55:58.090825 7f3e74a96700 -1 journal io_submit to 0~4096
>> got (22) Invalid argument 2015-09-16 13:55:58.091061
>> 7f3e7b821800 1 journal close
>> /var/lib/ceph/osd/ceph-28/journal 2015-09-16 13:55:58.091993
>> 7f3e74a96700 -1 os/FileJournal.cc: In function 'int
>> FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&, 
>> uint64_t)' thread 7f3e74a96700 time 2 015-09-16 
>> 13:55:58.090842 os/FileJournal.cc: 1337: FAILED assert(0 == 
>> "io_submit got unexpected error")
> 
>> More complete: http://pastebin.ubuntu.com/12427041/
> 
>> If, however, I stop one of the running OSDs, starting the 
>> original OSD works fine. I'm guessing I'm running out of 
>> resources somewhere, but where?
> 
>> Some poss. relevant sysctl values:
> 
>> vm.max_map_count=524288 kernel.pid_max=2097152 
>> kernel.threads-max=2097152 fs.aio-max-nr = 65536 fs.aio-nr = 
>> 129024 fs.dentry-state = 75710   49996   45  0   0   0 
>> fs.file-max = 
>> 26244198 fs.file-nr = 13504  0   26244198 fs.inode-nr = 60706
>> 202 fs.nr_open = 1048576
> 
>> I've also set max open files = 1048576 in ceph.conf
> 
>> The OSDs are setup with dedicated journal disks - 3 OSDs
>> share one journal device.
> 
>> Any advice on what I'm missing, or where I should dig
>> deeper?
> 
>> Thanks, peter.
> 
> 
> 
> 
> 
> 
>> ___ ceph-users 
>> mailing list ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___ ceph-users
> mailing list 

[ceph-users] ESXI 5.5 Update 3 and LIO

2015-09-18 Thread Nick Fisk
Hi All,

 

Just browsing through the release notes of the latest ESXi update, I can
see this:

 

During transient error conditions, I/O to a device might repeatedly fail and
not failover to an alternate working path
During transient error conditions like BUS BUSY, QFULL, HOST ABORTS, HOST
RETRY and so on, you might repeatedly attempt commands on current path and
do not failover to another path even after a reasonable amount of time.

This issue is resolved in this release. During occurrence of such transient
errors, if the path is busy after a couple of retries, the path state is now
changed to DEAD. As a result, a failover is triggered and an alternate
working path to the device is used to send I/Os.

 

Just wondering if this has any positive effect on the problems many of us
experience with LIO and ESXi entering a never-ending loop when an RBD hangs
for more than a couple of seconds? I can probably put something together and
test this next week, but I would be interested in hearing from anyone else.

 

Copying you in, Mike, as I don't know whether this would be interesting to you
or not.

 

Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help! Ceph Manual Depolyment

2015-09-18 Thread Henrik Korkuc

On 15-09-17 18:59, wikison wrote:


Is there any detailed manual deployment document? I downloaded the 
source and built ceph, then installed ceph on 7 computers. I used 
three as monitors and four as OSD. I followed the official document on 
ceph.com. But it didn't work and it seemed to be out-dated. Could 
anybody help me?


What documentation did you follow? What doesn't work for you? I recently
launched a Ceph cluster without ceph-deploy, so maybe I'll be able to help
you out.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] erasure pool, ruleset-root

2015-09-18 Thread Loic Dachary


On 18/09/2015 09:00, Loic Dachary wrote:
> Hi Tom,
> 
> Could you please share command you're using and their output ? A dump of the 
> crush rules would also be useful to figure out why it did not work as 
> expected.
> 

s/command/the commands/

> Cheers
> 
> On 18/09/2015 01:01, Deneau, Tom wrote:
>> I see that I can create a crush rule that only selects osds
>> from a certain node by this:
>>ceph osd crush rule create-simple byosdn1 myhostname osd
>>
>> and if I then create a replicated pool that uses that rule,
>> it does indeed select osds only from that node.
>>
>> I would like to do a similar thing with an erasure pool.
>>
>> When creating the ec-profile, I have successfully used
>>ruleset-failure-domain=osd
>> but when I try to use
>>ruleset-root=myhostname
>> and then use that profile to create an erasure pool,
>> the resulting pool doesn't seem to limit to that node.
>>
>> What is the correct syntax for creating such an erasure pool?
>>
>> -- Tom Deneau
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] C example of using libradosstriper?

2015-09-18 Thread Paul Mansfield
Hello
sorry for delay in replying.

I have found your example code very useful.

My problem now is that I am using LTTng to trace my program, and it seems
that libradosstriper also uses LTTng; both try to initialise it and the
program exits.
I don't really want to rip out all my trace and debug code :-(


On 17/09/15 04:01, 张冬卯 wrote:
> 
> Hi,
> 
> src/tools/rados.c has some  striper rados snippet.
> 
> and I have  this little project using striper rados.
> see:https://github.com/thesues/striprados
> 
> wish could help you
> 
> Dongmao Zhang
> 
> On 2015-09-17 01:05, Paul Mansfield wrote:
>> Hello,
>> I'm using the C interface librados striper and am looking for examples
>> on how to use it.
>>
>>
>> Please can someone point me to any useful code snippets? All I've found
>> so far is the source code :-(
>>
>> Thanks very much
>> Paul
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> 

-- 
Paul Mansfield
DevOps Engineer
Velocix - An Alcatel Lucent Company - http://www.velocix.com
t: +44 1223 435858
paul.mansfi...@alcatel-lucent.com
3 Ely Road, Milton, Cambridge, CB24 6DD

Alcatel-Lucent Telecom Limited - www.alcatel-lucent.com
Registered Office: 740 Waterside Drive, Aztec West, Bristol,  BS32 4UF
Registered in England & Wales number 02650571
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] help! Ceph Manual Depolyment

2015-09-18 Thread Max A. Krasilnikov
Hello!

On Thu, Sep 17, 2015 at 11:59:47PM +0800, wikison wrote:


> Is there any detailed manual deployment document? I downloaded the source and 
> built ceph, then installed ceph on 7 computers. I used three as monitors and 
> four as OSD. I followed the official document on ceph.com. But it didn't work 
> and it seemed to be out-dated. Could anybody help me?

This works for me:
http://docs.ceph.com/docs/master/install/manual-deployment/
http://docs.ceph.com/docs/master/rados/operations/add-or-rm-osds/
http://www.sebastien-han.fr/blog/2013/05/13/deploy-a-ceph-mds-server/
http://docs.ceph.com/docs/master/cephfs/createfs/

-- 
WBR, Max A. Krasilnikov
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Le vendredi 18 septembre 2015 à 12:04 +0200, Jan Schermer a écrit :
> > On 18 Sep 2015, at 11:28, Christian Balzer  wrote:
> > 
> > On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
> > 
> > > Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit
> > > :
> > > > In that case it can either be slow monitors (slow network, slow
> > > > disks(!!!)  or a CPU or memory problem).
> > > > But it still can also be on the OSD side in the form of either
> > > > CPU
> > > > usage or memory pressure - in my case there were lots of memory
> > > > used
> > > > for pagecache (so for all intents and purposes considered
> > > > "free") but
> > > > when peering the OSD had trouble allocating any memory from it
> > > > and it
> > > > caused lots of slow ops and peering hanging in there for a
> > > > while.
> > > > This also doesn't show as high CPU usage, only kswapd spins up
> > > > a bit
> > > > (don't be fooled by its name, it has nothing to do with swap in
> > > > this
> > > > case).
> > > 
> > > My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM
> > > (for
> > > 4x800GB ones), so I will try track this too. Thanks !
> > > 
> > I haven't seen this (known problem) with 64GB or 128GB nodes,
> > probably
> > because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
> > respectively.
> > 
> 
> I had this set to 6G and that doesn't help. This "buffer" is probably
> only useful for some atomic allocations that can use it, not for
> userland processes and their memory. Or maybe they get memory from
> this pool but it gets replenished immediately.
> QEMU has no problem allocating 64G on the same host, OSD struggles to
> allocate memory during startup or when PGs are added during
> rebalancing - probably because it does a lot of smaller allocations
> instead of one big.
> 

For now I dropped the cache *and* set min_free_kbytes to 1GB. I haven't
triggered any rebalance yet, but I can already see a reduced
filestore.commitcycle_latency.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Jan Schermer
In that case it can either be slow monitors (slow network, slow disks(!!!)  or 
a CPU or memory problem).
But it still can also be on the OSD side in the form of either CPU usage or 
memory pressure - in my case there were lots of memory used for pagecache (so 
for all intents and purposes considered "free") but when peering the OSD had 
trouble allocating any memory from it and it caused lots of slow ops and 
peering hanging in there for a while. This also doesn't show as high CPU usage, 
only kswapd spins up a bit (don't be fooled by its name, it has nothing to do 
with swap in this case).

echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a routine 
now and that problem is gone.

Jan

> On 18 Sep 2015, at 10:53, Olivier Bonvalet  wrote:
> 
> mmm good point.
> 
> I don't see CPU or IO problem on mons, but in logs, I have this :
> 
> 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
> 6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill, 25
> peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used, 58578
> GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s; 8417/15680513
> objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> 
> 
> So... it can be a peering problem. Didn't see that, thanks.
> 
> 
> 
> Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
>> Could this be caused by monitors? In my case lagging monitors can
>> also cause slow requests (because of slow peering). Not sure if
>> that's expected or not, but it of course doesn't show on the OSDs as
>> any kind of bottleneck when you try to investigate...
>> 
>> Jan
>> 
>>> On 18 Sep 2015, at 09:37, Olivier Bonvalet 
>>> wrote:
>>> 
>>> Hi,
>>> 
>>> sorry for missing informations. I was to avoid putting too much
>>> inappropriate infos ;)
>>> 
>>> 
>>> 
>>> Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
>>> écrit :
 Hello,
 
 On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
 
 The items below help, but be a s specific as possible, from OS,
 kernel
 version to Ceph version, "ceph -s", any other specific details
 (pool
 type,
 replica size).
 
>>> 
>>> So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
>>> kernel,
>>> and Ceph 0.80.10.
>>> I don't have anymore ceph status right now. But I have
>>> data to move tonight again, so I'll track that.
>>> 
>>> The affected pool is a standard one (no erasure coding), with only
>>> 2 replica (size=2).
>>> 
>>> 
>>> 
>>> 
> Some additionnal informations :
> - I have 4 SSD per node.
 Type, if nothing else for anecdotal reasons.
>>> 
>>> I have 7 storage nodes here :
>>> - 3 nodes which have each 12 OSD of 300GB
>>> SSD
>>> - 4 nodes which have each  4 OSD of 800GB SSD
>>> 
>>> And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
>>> 
>>> 
>>> 
> - the CPU usage is near 0
> - IO wait is near 0 too
 Including the trouble OSD(s)?
>>> 
>>> Yes
>>> 
>>> 
 Measured how, iostat or atop?
>>> 
>>> iostat, htop, and confirmed with Zabbix supervisor.
>>> 
>>> 
>>> 
>>> 
> - bandwith usage is also near 0
> 
 Yeah, all of the above are not surprising if everything is stuck
 waiting
 on some ops to finish. 
 
 How many nodes are we talking about?
>>> 
>>> 
>>> 7 nodes, 52 OSDs.
>>> 
>>> 
>>> 
> The whole cluster seems waiting for something... but I don't
> see
> what.
> 
 Is it just one specific OSD (or a set of them) or is that all
 over
 the
 place?
>>> 
>>> A set of them. When I increase the weight of all 4 OSDs of a node,
>>> I
>>> frequently have blocked IO from 1 OSD of this node.
>>> 
>>> 
>>> 
 Does restarting the OSD fix things?
>>> 
>>> Yes. For several minutes.
>>> 
>>> 
 Christian
> 
> Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> écrit :
>> Hi,
>> 
>> I have a cluster with lot of blocked operations each time I
>> try
>> to
>> move
>> data (by reweighting a little an OSD).
>> 
>> It's a full SSD cluster, with 10GbE network.
>> 
>> In logs, when I have blocked OSD, on the main OSD I can see
>> that
>> :
>> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for > 33.976680
>> secs
>> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
>> request
>> 30.125556 seconds old, received at 2015-09-18
>> 01:54:46.855821:
>> osd_op(client.29760717.1:18680817544
>> rb.0.1c16005.238e1f29.027f [write 180224~16384]
>> 6.c11916a4
>> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
>> currently
>> reached pg
>> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
>> requests, 1 included below; oldest blocked for > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer
On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:

> Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit :
> > In that case it can either be slow monitors (slow network, slow
> > disks(!!!)  or a CPU or memory problem).
> > But it still can also be on the OSD side in the form of either CPU
> > usage or memory pressure - in my case there were lots of memory used
> > for pagecache (so for all intents and purposes considered "free") but
> > when peering the OSD had trouble allocating any memory from it and it
> > caused lots of slow ops and peering hanging in there for a while.
> > This also doesn't show as high CPU usage, only kswapd spins up a bit
> > (don't be fooled by its name, it has nothing to do with swap in this
> > case).
> 
> My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM (for
> 4x800GB ones), so I will try track this too. Thanks !
> 
I haven't seen this (known problem) with 64GB or 128GB nodes, probably
because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB respectively.
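
(For reference, the value is in kilobytes, so the 512MB case is
something like:

sysctl -w vm.min_free_kbytes=524288

plus the matching "vm.min_free_kbytes = 524288" line in /etc/sysctl.conf
so it survives reboots.)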

Christian.

> 
> > echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a
> > routine now and that problem is gone.
> > 
> > Jan
> > 
> > > On 18 Sep 2015, at 10:53, Olivier Bonvalet 
> > > wrote:
> > > 
> > > mmm good point.
> > > 
> > > I don't see CPU or IO problem on mons, but in logs, I have this :
> > > 
> > > 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap
> > > v86359128:
> > > 6632 pgs: 77 inactive, 1 remapped, 10
> > > active+remapped+wait_backfill, 25
> > > peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> > > active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
> > > 58578
> > > GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
> > > 8417/15680513
> > > objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> > > 
> > > 
> > > So... it can be a peering problem. Didn't see that, thanks.
> > > 
> > > 
> > > 
> > > Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> > > > Could this be caused by monitors? In my case lagging monitors can
> > > > also cause slow requests (because of slow peering). Not sure if
> > > > that's expected or not, but it of course doesn't show on the OSDs
> > > > as
> > > > any kind of bottleneck when you try to investigate...
> > > > 
> > > > Jan
> > > > 
> > > > > On 18 Sep 2015, at 09:37, Olivier Bonvalet  > > > > >
> > > > > wrote:
> > > > > 
> > > > > Hi,
> > > > > 
> > > > > sorry for missing informations. I was to avoid putting too much
> > > > > inappropriate infos ;)
> > > > > 
> > > > > 
> > > > > 
> > > > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > > > écrit :
> > > > > > Hello,
> > > > > > 
> > > > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > > > 
> > > > > > The items below help, but be a s specific as possible, from
> > > > > > OS,
> > > > > > kernel
> > > > > > version to Ceph version, "ceph -s", any other specific
> > > > > > details
> > > > > > (pool
> > > > > > type,
> > > > > > replica size).
> > > > > > 
> > > > > 
> > > > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > > > kernel,
> > > > > and Ceph 0.80.10.
> > > > > I don't have anymore ceph status right now. But I have
> > > > > data to move tonight again, so I'll track that.
> > > > > 
> > > > > The affected pool is a standard one (no erasure coding), with
> > > > > only
> > > > > 2 replica (size=2).
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > > Some additionnal informations :
> > > > > > > - I have 4 SSD per node.
> > > > > > Type, if nothing else for anecdotal reasons.
> > > > > 
> > > > > I have 7 storage nodes here :
> > > > > - 3 nodes which have each 12 OSD of 300GB
> > > > > SSD
> > > > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > > > 
> > > > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > > > 
> > > > > 
> > > > > 
> > > > > > > - the CPU usage is near 0
> > > > > > > - IO wait is near 0 too
> > > > > > Including the trouble OSD(s)?
> > > > > 
> > > > > Yes
> > > > > 
> > > > > 
> > > > > > Measured how, iostat or atop?
> > > > > 
> > > > > iostat, htop, and confirmed with Zabbix supervisor.
> > > > > 
> > > > > 
> > > > > 
> > > > > 
> > > > > > > - bandwith usage is also near 0
> > > > > > > 
> > > > > > Yeah, all of the above are not surprising if everything is
> > > > > > stuck
> > > > > > waiting
> > > > > > on some ops to finish. 
> > > > > > 
> > > > > > How many nodes are we talking about?
> > > > > 
> > > > > 
> > > > > 7 nodes, 52 OSDs.
> > > > > 
> > > > > 
> > > > > 
> > > > > > > The whole cluster seems waiting for something... but I
> > > > > > > don't
> > > > > > > see
> > > > > > > what.
> > > > > > > 
> > > > > > Is it just one specific OSD (or a set of them) or is that all
> > > > > > over
> > > > > > the
> > > > > > place?
> > > > > 
> > > > > A set of them. When I increase the weight of all 4 OSDs of a
> > > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Le vendredi 18 septembre 2015 à 17:04 +0900, Christian Balzer a écrit :
> Hello,
> 
> On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> Nah, everything helps, there are known problems with some versions,
> kernels, file systems, etc.
> 
> Speaking of which, what FS are you using on your OSDs?
> 

XFS.

> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> All my stuff is on Jessie, but at least Firefly should be stable and
> I
> haven't seen anything like your problem with it.
> And while 3.14 is a LTS kernel I wonder if something newer may be
> beneficial, but probably not.
> 

Well, I can try a 3.18.x kernel. But for that I have to restart all the
nodes, which will trigger some backfilling and probably some blocked IO
too ;)


> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> I was interested in that to see how many pools and PGs you have.

Well :

cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
 health HEALTH_OK
 monmap e21: 3 mons at 
{faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0}, 
election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh
 osdmap e847496: 88 osds: 88 up, 87 in
  pgmap v86390609: 6632 pgs, 16 pools, 18883 GB data, 5266 kobjects
68559 GB used, 59023 GB / 124 TB avail
6632 active+clean
  client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s


There are mainly 2 pools in use: an "ssd" pool and an "hdd" pool. The hdd
pool uses different OSDs, on different nodes.
Since I don't often rebalance data in this hdd pool, I haven't yet seen
problems on it.



> >  The affected pool is a standard one (no erasure coding), with only
> > 2
> > replica (size=2).
> > 
> Good, nothing fancy going on there then.
> 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which have each 12 OSD of 300GB
> > SSD
> > - 4 nodes which have each  4 OSD of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> Type as in model/maker, but helpful information.
> 

300GB models are Intel SSDSC2BB300G4 (DC S3500).
800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).




> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> 
> Good. I'm sure you checked for network errors. 
> Single network or split client/cluster network?
> 

It's the first thing I checked; latency and packet loss are monitored
between each node and the mons, but maybe I missed some checks.


> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> That be below the threshold for most system tunables (there are
> various
> threads and articles on how to tune Ceph for "large" clusters).
> 
> Since this happens only when your cluster reshuffles data (and thus
> has
> more threads going) what is your ulimit setting for open files?


Wow... the default one on Debian Wheezy : 1024.



> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I increase the weight of all 4 OSDs of a node,
> > I
> > frequently have blocked IO from 1 OSD of this node.
> > 
> The plot thickens, as in, the target of most writes (new PGs being
> moved
> there) is the culprit.
> 

Yes.


> > 
> > 
> > > Does restarting the OSD fix things?
> > 
> > Yes. For several minutes.
> > 
> That also ties into a resource starvation of sorts, I'd investigate
> along
> those lines.

Yes, I agree. I will increase the verbosity of the OSD.


> Christian
> > 
> > > Christian
> > > > 
> > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > > écrit :
> > > > > Hi,
> > > > > 
> > > > > I have a cluster with lot of blocked operations each time I
> > > > > try
> > > > > to
> > > > > move
> > > > > data (by reweighting a little an OSD).
> > > > > 
> > > > > It's a full SSD 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit :
> In that case it can either be slow monitors (slow network, slow
> disks(!!!)  or a CPU or memory problem).
> But it still can also be on the OSD side in the form of either CPU
> usage or memory pressure - in my case there were lots of memory used
> for pagecache (so for all intents and purposes considered "free") but
> when peering the OSD had trouble allocating any memory from it and it
> caused lots of slow ops and peering hanging in there for a while.
> This also doesn't show as high CPU usage, only kswapd spins up a bit
> (don't be fooled by its name, it has nothing to do with swap in this
> case).

My nodes have 256GB of RAM (for the 12x300GB ones) or 128GB of RAM (for the
4x800GB ones), so I will try to track this too. Thanks!


> echo 1 >/proc/sys/vm/drop_caches before I touch anything has become a
> routine now and that problem is gone.
> 
> Jan
> 
> > On 18 Sep 2015, at 10:53, Olivier Bonvalet 
> > wrote:
> > 
> > mmm good point.
> > 
> > I don't see CPU or IO problem on mons, but in logs, I have this :
> > 
> > 2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap
> > v86359128:
> > 6632 pgs: 77 inactive, 1 remapped, 10
> > active+remapped+wait_backfill, 25
> > peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
> > active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used,
> > 58578
> > GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s;
> > 8417/15680513
> > objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering
> > 
> > 
> > So... it can be a peering problem. Didn't see that, thanks.
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> > > Could this be caused by monitors? In my case lagging monitors can
> > > also cause slow requests (because of slow peering). Not sure if
> > > that's expected or not, but it of course doesn't show on the OSDs
> > > as
> > > any kind of bottleneck when you try to investigate...
> > > 
> > > Jan
> > > 
> > > > On 18 Sep 2015, at 09:37, Olivier Bonvalet  > > > >
> > > > wrote:
> > > > 
> > > > Hi,
> > > > 
> > > > sorry for missing informations. I was to avoid putting too much
> > > > inappropriate infos ;)
> > > > 
> > > > 
> > > > 
> > > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > > écrit :
> > > > > Hello,
> > > > > 
> > > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > > 
> > > > > The items below help, but be a s specific as possible, from
> > > > > OS,
> > > > > kernel
> > > > > version to Ceph version, "ceph -s", any other specific
> > > > > details
> > > > > (pool
> > > > > type,
> > > > > replica size).
> > > > > 
> > > > 
> > > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > > kernel,
> > > > and Ceph 0.80.10.
> > > > I don't have anymore ceph status right now. But I have
> > > > data to move tonight again, so I'll track that.
> > > > 
> > > > The affected pool is a standard one (no erasure coding), with
> > > > only
> > > > 2 replica (size=2).
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > Some additionnal informations :
> > > > > > - I have 4 SSD per node.
> > > > > Type, if nothing else for anecdotal reasons.
> > > > 
> > > > I have 7 storage nodes here :
> > > > - 3 nodes which have each 12 OSD of 300GB
> > > > SSD
> > > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > > 
> > > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > > 
> > > > 
> > > > 
> > > > > > - the CPU usage is near 0
> > > > > > - IO wait is near 0 too
> > > > > Including the trouble OSD(s)?
> > > > 
> > > > Yes
> > > > 
> > > > 
> > > > > Measured how, iostat or atop?
> > > > 
> > > > iostat, htop, and confirmed with Zabbix supervisor.
> > > > 
> > > > 
> > > > 
> > > > 
> > > > > > - bandwith usage is also near 0
> > > > > > 
> > > > > Yeah, all of the above are not surprising if everything is
> > > > > stuck
> > > > > waiting
> > > > > on some ops to finish. 
> > > > > 
> > > > > How many nodes are we talking about?
> > > > 
> > > > 
> > > > 7 nodes, 52 OSDs.
> > > > 
> > > > 
> > > > 
> > > > > > The whole cluster seems waiting for something... but I
> > > > > > don't
> > > > > > see
> > > > > > what.
> > > > > > 
> > > > > Is it just one specific OSD (or a set of them) or is that all
> > > > > over
> > > > > the
> > > > > place?
> > > > 
> > > > A set of them. When I increase the weight of all 4 OSDs of a
> > > > node,
> > > > I
> > > > frequently have blocked IO from 1 OSD of this node.
> > > > 
> > > > 
> > > > 
> > > > > Does restarting the OSD fix things?
> > > > 
> > > > Yes. For several minutes.
> > > > 
> > > > 
> > > > > Christian
> > > > > > 
> > > > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier
> > > > > > Bonvalet a
> > > > > > écrit :
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I have a cluster with lot of blocked operations each time
> > > > > > > I
> > > > > > > try
> > > > > > > 

Re: [ceph-users] erasure pool, ruleset-root

2015-09-18 Thread Loic Dachary
Hi Tom,

Could you please share command you're using and their output ? A dump of the 
crush rules would also be useful to figure out why it did not work as expected.
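
For instance, any of these would give us what we need (replace
<your-profile> with whatever name you used when creating the ec-profile):

ceph osd crush rule dump
ceph osd erasure-code-profile get <your-profile>
ceph osd getcrushmap -o /tmp/crush && crushtool -d /tmp/crush -o /tmp/crush.txt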

Cheers

On 18/09/2015 01:01, Deneau, Tom wrote:
> I see that I can create a crush rule that only selects osds
> from a certain node by this:
>ceph osd crush rule create-simple byosdn1 myhostname osd
> 
> and if I then create a replicated pool that uses that rule,
> it does indeed select osds only from that node.
> 
> I would like to do a similar thing with an erasure pool.
> 
> When creating the ec-profile, I have successfully used
>ruleset-failure-domain=osd
> but when I try to use
>ruleset-root=myhostname
> and then use that profile to create an erasure pool,
> the resulting pool doesn't seem to limit to that node.
> 
> What is the correct syntax for creating such an erasure pool?
> 
> -- Tom Deneau
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer

Hello,

On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:

> Hi,
> 
> sorry for missing informations. I was to avoid putting too much
> inappropriate infos ;)
> 
Nah, everything helps, there are known problems with some versions,
kernels, file systems, etc.

Speaking of which, what FS are you using on your OSDs?

> 
> 
> Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a écrit :
> > Hello,
> > 
> > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > 
> > The items below help, but be a s specific as possible, from OS,
> > kernel
> > version to Ceph version, "ceph -s", any other specific details (pool
> > type,
> > replica size).
> > 
> 
> So, all nodes use Debian Wheezy, running on a vanilla 3.14.x kernel,
> and Ceph 0.80.10.
All my stuff is on Jessie, but at least Firefly should be stable and I
haven't seen anything like your problem with it.
And while 3.14 is an LTS kernel I wonder if something newer may be
beneficial, but probably not.

> I don't have anymore ceph status right now. But I have
> data to move tonight again, so I'll track that.
>
I was interested in that to see how many pools and PGs you have.
 
> The affected pool is a standard one (no erasure coding), with only 2
> replica (size=2).
> 
Good, nothing fancy going on there then.

> 
> 
> 
> > > Some additionnal informations :
> > > - I have 4 SSD per node.
> > Type, if nothing else for anecdotal reasons.
> 
> I have 7 storage nodes here :
> - 3 nodes which have each 12 OSD of 300GB
> SSD
> - 4 nodes which have each  4 OSD of 800GB SSD
> 
> And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> 
Type as in model/maker, but helpful information.

> 
> 
> > > - the CPU usage is near 0
> > > - IO wait is near 0 too
> > Including the trouble OSD(s)?
> 
> Yes
> 
> 
> > Measured how, iostat or atop?
> 
> iostat, htop, and confirmed with Zabbix supervisor.
>

Good. I'm sure you checked for network errors. 
Single network or split client/cluster network?

> 
> 
> 
> > > - bandwith usage is also near 0
> > > 
> > Yeah, all of the above are not surprising if everything is stuck
> > waiting
> > on some ops to finish. 
> > 
> > How many nodes are we talking about?
> 
> 
> 7 nodes, 52 OSDs.
> 
That be below the threshold for most system tunables (there are various
threads and articles on how to tune Ceph for "large" clusters).

Since this happens only when your cluster reshuffles data (and thus has
more threads going) what is your ulimit setting for open files?

> 
> 
> > > The whole cluster seems waiting for something... but I don't see
> > > what.
> > > 
> > Is it just one specific OSD (or a set of them) or is that all over
> > the
> > place?
> 
> A set of them. When I increase the weight of all 4 OSDs of a node, I
> frequently have blocked IO from 1 OSD of this node.
> 
The plot thickens, as in, the target of most writes (new PGs being moved
there) is the culprit.

> 
> 
> > Does restarting the OSD fix things?
> 
> Yes. For several minutes.
> 
That also ties into a resource starvation of sorts, I'd investigate along
those lines.

Christian
> 
> > Christian
> > > 
> > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > écrit :
> > > > Hi,
> > > > 
> > > > I have a cluster with lot of blocked operations each time I try
> > > > to
> > > > move
> > > > data (by reweighting a little an OSD).
> > > > 
> > > > It's a full SSD cluster, with 10GbE network.
> > > > 
> > > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > > :
> > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > > request
> > > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently
> > > > reached pg
> > > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow
> > > > request
> > > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > > osd_op(client.29760717.1:18680817544
> > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > 6.c11916a4
> > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > currently
> > > > reached pg
> > > > 
> > > > How should I read that ? What this OSD is waiting for ?
> > > > 
> > > > Thanks for any help,
> > > > 
> > > > Olivier
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > > 
> > > ___
> > > ceph-users mailing list
> > > 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Christian Balzer

Hello,

On Fri, 18 Sep 2015 10:35:37 +0200 Olivier Bonvalet wrote:

> Le vendredi 18 septembre 2015 à 17:04 +0900, Christian Balzer a écrit :
> > Hello,
> > 
> > On Fri, 18 Sep 2015 09:37:24 +0200 Olivier Bonvalet wrote:
> > 
> > > Hi,
> > > 
> > > sorry for missing informations. I was to avoid putting too much
> > > inappropriate infos ;)
> > > 
> > Nah, everything helps, there are known problems with some versions,
> > kernels, file systems, etc.
> > 
> > Speaking of which, what FS are you using on your OSDs?
> > 
> 
> XFS.
>
No surprises there, one hopes.
 
> > > 
> > > 
> > > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > > écrit :
> > > > Hello,
> > > > 
> > > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > > 
> > > > The items below help, but be a s specific as possible, from OS,
> > > > kernel
> > > > version to Ceph version, "ceph -s", any other specific details
> > > > (pool
> > > > type,
> > > > replica size).
> > > > 
> > > 
> > > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > > kernel,
> > > and Ceph 0.80.10.
> > All my stuff is on Jessie, but at least Firefly should be stable and
> > I
> > haven't seen anything like your problem with it.
> > And while 3.14 is a LTS kernel I wonder if something newer may be
> > beneficial, but probably not.
> > 
> 
> Well, I can try a 3.18.x kernel. But for that I have to restart all
> nodes, which will throw some backfilling and probably some blocked IO
> too ;)
>
Yeah, as I said, it might be helpful in some other ways, but probably not
related to your problems.
  
> 
> > > I don't have anymore ceph status right now. But I have
> > > data to move tonight again, so I'll track that.
> > > 
> > I was interested in that to see how many pools and PGs you have.
> 
> Well :
> 
> cluster de035250-323d-4cf6-8c4b-cf0faf6296b1
>  health HEALTH_OK
>  monmap e21: 3 mons at
> {faude=10.0.0.13:6789/0,murmillia=10.0.0.18:6789/0,rurkh=10.0.0.19:6789/0},
> election epoch 4312, quorum 0,1,2 faude,murmillia,rurkh osdmap e847496:
> 88 osds: 88 up, 87 in pgmap v86390609: 6632 pgs, 16 pools, 18883 GB
> data, 5266 kobjects 68559 GB used, 59023 GB / 124 TB avail 6632
> active+clean client io 3194 kB/s rd, 23542 kB/s wr, 1450 op/s
> 
> 
> There is mainly 2 pools used. A "ssd" pool, and a "hdd" pool. This hdd
> pool use different OSD, on different nodes.
> Since I don't often balance data of this hdd pool, I don't yet see
> problem on it.
> 
How many PGs in the SSD pool? 
I can see this easily exceeding your open file limits.

> 
> 
> > >  The affected pool is a standard one (no erasure coding), with only
> > > 2
> > > replica (size=2).
> > > 
> > Good, nothing fancy going on there then.
> > 
> > > 
> > > 
> > > 
> > > > > Some additionnal informations :
> > > > > - I have 4 SSD per node.
> > > > Type, if nothing else for anecdotal reasons.
> > > 
> > > I have 7 storage nodes here :
> > > - 3 nodes which have each 12 OSD of 300GB
> > > SSD
> > > - 4 nodes which have each  4 OSD of 800GB SSD
> > > 
> > > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > > 
> > Type as in model/maker, but helpful information.
> > 
> 
> 300GB models are Intel SSDSC2BB300G4 (DC S3500).
> 800GB models are Intel SSDSC2BB800H4 (DC S3500 I think).
> 
0.3 DWPD, but I guess you know that.

> 
> 
> 
> > > 
> > > 
> > > > > - the CPU usage is near 0
> > > > > - IO wait is near 0 too
> > > > Including the trouble OSD(s)?
> > > 
> > > Yes
> > > 
> > > 
> > > > Measured how, iostat or atop?
> > > 
> > > iostat, htop, and confirmed with Zabbix supervisor.
> > > 
> > 
> > Good. I'm sure you checked for network errors. 
> > Single network or split client/cluster network?
> > 
> 
> It's the first thing I checked, and latency and packet loss is
> monitored between each node and mons, but maybe I forgot some checks.
> 
> 
> > > 
> > > 
> > > 
> > > > > - bandwith usage is also near 0
> > > > > 
> > > > Yeah, all of the above are not surprising if everything is stuck
> > > > waiting
> > > > on some ops to finish. 
> > > > 
> > > > How many nodes are we talking about?
> > > 
> > > 
> > > 7 nodes, 52 OSDs.
> > > 
> > That be below the threshold for most system tunables (there are
> > various
> > threads and articles on how to tune Ceph for "large" clusters).
> > 
> > Since this happens only when your cluster reshuffles data (and thus
> > has
> > more threads going) what is your ulimit setting for open files?
> 
> 
> Wow... the default one on Debian Wheezy : 1024.
>
You want to fix this, both during startup and in general.
 
I use this for sysv-init systems like Wheezy:
---
# cat /etc/initscript 
#
ulimit -Hn 65536
ulimit -Sn 16384

# Execute the program.
eval exec "$4"
---

and:
---
# cat /etc/security/limits.d/tuning.conf 
root    soft    nofile  16384
root    hard    nofile  65536
*       soft    nofile  16384
*       hard    nofile  65536
---

Adjust as needed.
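
To double-check what a running OSD actually got, something along these
lines works (just a sanity check, picking one ceph-osd PID):

grep "Max open files" /proc/$(pidof -s ceph-osd)/limits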

Christian
> 
> 

[ceph-users] Delete pool with cache tier

2015-09-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Is there a way to delete a pool with a cache tier without first
evicting the cache tier and removing it (ceph 0.94.3)?

Something like:

ceph osd pool delete   --delete-cache-tier --yes-i-really-really-mean-it

Evicting the cache tier has taken over 24 hours and I just want to
trash it and recreate it. It takes too long to delete all the RBDs in
the pool as well, so it seems that I'm stuck. Trying to delete the pool
or the cache errors out, saying that it is participating in a cache
tier.
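
For reference, the slow path I'm trying to avoid is roughly the
documented removal sequence for a writeback tier (pool names below are
placeholders):

ceph osd tier cache-mode <cachepool> forward
rados -p <cachepool> cache-flush-evict-all
ceph osd tier remove-overlay <basepool>
ceph osd tier remove <basepool> <cachepool>
ceph osd pool delete <cachepool> <cachepool> --yes-i-really-really-mean-it

It's the cache-flush-evict-all step that has been running for over a day.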

There are not many RBDs, but one has 10,000 clones to test a high
number of clones.

Thanks,
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/FISCRDmVDuy+mK58QAAsgsP/1WSFOHSkcVy6O582ECf
9JN6/ZyvcH5HMSsB1cE0bEziDEsgoYEGdTSgB+GytABb+0BRip9SjH7kDpcs
5/1ZvAv9dgnNPbwBb9zOWp6p27fDwQ5vdmy4UnT7iyPbNFz2Cyp09g8mwEpe
nAu2mDorwOGtzPv4z7LUnevHigyhWIZpGJhw7hJGuKVNxRXwAbSXQqk7YzKI
xg6Ccs/eDpTiSygD+PeAfC7uizjqrw28lBaEmlgaK5OfFvrhXWSNZr6AuQap
wEv4QG+0NsUzptOMSTIlhBHS/0HZb87vscckBlaitfWzOiv73PuKQRpwVFRD
IkcIZx83I2pGQ4ir7LkjmmyQgW1cY5FcHrC6V70kaqpjgT5tuEIpwtlfJoUc
qpt1fKeEGF4sutLWpQCgPyY/bQC144PZyMYWjNS17GbzkoR/AwaZ8cBiK2+P
1Htc9ntk4hwtPOWXa6kYfRASLbC+nqTBFWTq5hcDkD/5ViTKhMpW+ldInaSt
8g8jukLWQ0ZjFET0MfYogYWEAdsER4dhk3bawfA/0dKiAyE5UPN5L9CvZ9kp
JPVqmSPKhXU68xGQXE7Ugx9BpWZXiRyO8hBOzoivrcvbZOLUhklQW61Qmq0A
EKWyeXy7P9cE3OONelWmgUXxVPuZT2ZAaoM+KjwJKdT1Jt7VtNVjqgS9NvwZ
alzR
=y/7e
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multi-datacenter crush map

2015-09-18 Thread Gregory Farnum
On Fri, Sep 18, 2015 at 4:57 AM, Wouter De Borger  wrote:
> Hi all,
>
> I have found on the mailing list that it should be possible to have a multi
> datacenter setup, if latency is low enough.
>
> I would like to set this up, so that each datacenter has at least two
> replicas and each PG has a replication level of 3.
>
> In this mail, it is suggested that I should use the following crush map for
> multi DC:
>
> rule dc {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take default
> step chooseleaf firstn 0 type datacenter
> step emit
> }
>
> This looks suspicious to me, as it will only generate a list of two PG's,
> (and only one PG if one DC is down).
>
> I think I should use:
>
> rule replicated_ruleset {
> ruleset 0
> type replicated
> min_size 1
> max_size 10
> step take root
> step choose firstn 2 type datacenter
> step chooseleaf firstn 2 type host
> step emit
> step take root
> step chooseleaf firstn -4 type host
> step emit
> }
>
> This correctly generates a list with 2 PG's in one DC, then 2 PG's in the
> other and then a list of PG's
>
> The problem is that this list contains duplicates (e.g. for 8 OSDS per DC)
>
> [13,11,1,8,13,11,16,4,3,7]
> [9,2,13,11,9,15,12,18,3,5]
> [3,5,17,10,3,5,7,13,18,10]
> [7,6,11,14,7,14,3,16,4,11]
> [6,3,15,18,6,3,12,9,16,15]
>
> Will this be a problem?

For replicated pools, it probably will cause trouble. For EC pools I
think it should work fine, but obviously you're losing all kinds of
redundancy. Nothing in the system will do work to avoid colocating
them if you use a rule like this. Rather than distributing some of the
replicas randomly across DCs, you really just want to split them up
evenly across datacenters (or in some ratio, if one has more space
than the other). Given CRUSH's current abilities that does require
building the replication size into the rule, but such is life.
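
As a concrete sketch of that (reusing the bucket names from the rule
above, and assuming a fixed replication size of 4, i.e. exactly two
copies per datacenter):

rule replicated_2dc {
ruleset 1
type replicated
min_size 4
max_size 4
step take root
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
}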


> If crush is executed, will it only consider osd's which are (up,in)  or all
> OSD's in the map and then filter them from the list afterwards?

CRUSH will consider all OSDs, but if it selects any OSDs which are out
then it retries until it gets one that is still marked in.
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Delete pool with cache tier

2015-09-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Created request http://tracker.ceph.com/issues/13163
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 18, 2015 at 12:06 PM, John Spray  wrote:
> On Fri, Sep 18, 2015 at 7:04 PM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> Is there a way to delete a pool with a cache tier without first
>> evicting the cache tier and removing it (ceph 0.94.3)?
>>
>> Something like:
>>
>> ceph osd pool delete   --delete-cache-tier --yes-i-really-really-mean-it
>
> Not as far as I know, but it seems like a pretty reasonable feature
> request to me.
>
> John
>
>>
>> Evicting the cache tier has taken over 24 hours and I just want to
>> trash it and recreate it. It takes too long to delete all the RBDs in
>> the pool as well so it seems that I'm stuck. Trying to delete the pool
>> or the cache error with saying that it is participating in a cache
>> tier.
>>
>> There are not many RBDs, but one has 10,000 clones to test a high
>> number of clones.
>>
>> Thanks,
>> - 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.1.0
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJV/FISCRDmVDuy+mK58QAAsgsP/1WSFOHSkcVy6O582ECf
>> 9JN6/ZyvcH5HMSsB1cE0bEziDEsgoYEGdTSgB+GytABb+0BRip9SjH7kDpcs
>> 5/1ZvAv9dgnNPbwBb9zOWp6p27fDwQ5vdmy4UnT7iyPbNFz2Cyp09g8mwEpe
>> nAu2mDorwOGtzPv4z7LUnevHigyhWIZpGJhw7hJGuKVNxRXwAbSXQqk7YzKI
>> xg6Ccs/eDpTiSygD+PeAfC7uizjqrw28lBaEmlgaK5OfFvrhXWSNZr6AuQap
>> wEv4QG+0NsUzptOMSTIlhBHS/0HZb87vscckBlaitfWzOiv73PuKQRpwVFRD
>> IkcIZx83I2pGQ4ir7LkjmmyQgW1cY5FcHrC6V70kaqpjgT5tuEIpwtlfJoUc
>> qpt1fKeEGF4sutLWpQCgPyY/bQC144PZyMYWjNS17GbzkoR/AwaZ8cBiK2+P
>> 1Htc9ntk4hwtPOWXa6kYfRASLbC+nqTBFWTq5hcDkD/5ViTKhMpW+ldInaSt
>> 8g8jukLWQ0ZjFET0MfYogYWEAdsER4dhk3bawfA/0dKiAyE5UPN5L9CvZ9kp
>> JPVqmSPKhXU68xGQXE7Ugx9BpWZXiRyO8hBOzoivrcvbZOLUhklQW61Qmq0A
>> EKWyeXy7P9cE3OONelWmgUXxVPuZT2ZAaoM+KjwJKdT1Jt7VtNVjqgS9NvwZ
>> alzR
>> =y/7e
>> -END PGP SIGNATURE-
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/G95CRDmVDuy+mK58QAAMswP/jt8ECEuctxhO8Mu7gdl
a3jbGAq8JKE7mfzm0CaZuizfJX6gqJiLT8uul0TDaDlSMgdNON//U28YXjPQ
voDJg+IJV7bBt87t2PFFrIXJRPlos6YuXhg3D8hsljPEIXsWCD2LDu2mLysn
1wN9v8Xk1FqGzejszU2VP6i05Pvf55F3tRsXi1unOMlAeMB6E7BadspziaGP
7W4s4CLz2osNDAreNtUPIjFQZEIwQbAZTrcTYehjwHCS5BTgUjWtoP4MxyzO
h5ZbM+3czdsl0Zys+dVOTmo0F3qzRjSW/MjPI04PxYrvEDoNGXmdX22s+MEY
D8hCAxeEp3N3+LTdUtiWl8mWd4PblYxuQXK+15tkLQoNf2kO3yVNduR88BFm
NTKyOPqRsZFanXSWr/MGK93BYIn9xKi/L9zfOKl2tsLKveg4WRWfyu+Pf4/v
B37mGcmVfFeqiuFt/wtvk/xjJVJDSx12Z6Ak4uiQETRy/+bM+QUbTKuB6lTZ
K8tkS1hwbbnJuOrQCQyLyIbBQAwzJQscxsclbWQjsoy92nC/kXXjDRDLx5uS
Sm84HNHjmZCtK4/fjQ4gsZIBzdaRRP3QTDgz/RFC9ywZDzTHTMdn82lKr836
dtcOyQqsGha7iD4O3MnYWvu7QgEBZ1tU2AMuV30rrVzpIE5QxBO2PmkR7Sf3
z0KC
=eZ9D
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread 张冬卯
Yes, a RAID 1 system disk is necessary, from my perspective.

And a swap partition is still needed even if the memory is large.

Martin Palma wrote on 2015-09-18 at 23:07:
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] radosgw and keystone version 3 domains

2015-09-18 Thread Shinobu Kinjo
What error message did you see when you tried?

Shinobu

- Original Message -
From: "Abhishek L" 
To: "Robert Duncan" 
Cc: ceph-us...@ceph.com
Sent: Friday, September 18, 2015 12:29:20 PM
Subject: Re: [ceph-users] radosgw and keystone version 3 domains

On Fri, Sep 18, 2015 at 4:38 AM, Robert Duncan  wrote:
>
> Hi
>
>
>
> It seems that radosgw cannot find users in Keystone V3 domains, that is,
>
> When keystone is configured for domain-specific drivers, radosgw cannot find
> the users in the keystone users table (as they are not there).
>
> I have a deployment in which ceph provides object, block, ephemeral and user
> storage; however, any user outside of the ‘default’ sql-backed domain cannot
> be found by radosgw.
>
> Has anyone seen this issue before when using ceph in openstack? Is it 
> possible to configure radosgw to use a keystone v3 url?

I'm not sure whether keystone v3 support for radosgw is there yet,
particularly for the swift api. Currently keystone v2 api is supported,
and due to the change in format between v2 and v3 tokens, I'm not sure
whether swift apis will work with v3 yet, though keystone v3 *might*
just work on the s3 interface due to the different format used.


>
>
> Thanks,
>
> Rob.
>
> 
>
> The information contained and transmitted in this e-mail is confidential 
> information, and is intended only for the named recipient to which it is 
> addressed. The content of this e-mail may not have been sent with the 
> authority of National College of Ireland. Any views or opinions presented are 
> solely those of the author and do not necessarily represent those of National 
> College of Ireland. If the reader of this message is not the named recipient 
> or a person responsible for delivering it to the named recipient, you are 
> notified that the review, dissemination, distribution, transmission, printing 
> or copying, forwarding, or any other use of this message or any part of it, 
> including any attachments, is strictly prohibited. If you have received this 
> communication in error, please delete the e-mail and destroy all record of 
> this communication. Thank you for your assistance.
>
> 
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

If you decide to use swap, be warned that significant parts of the OSD
code can be swapped out even without memory pressure. This has caused
OSD processes to take 5 minutes to shut down in my experience. I would
recommend tuning swappiness in this case. My strongest recommendation
is not to have swap if it is a pure OSD node.
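
If you do decide to keep swap on an OSD node, the tuning can be as
simple as (the value is just an example, not a one-size-fits-all
recommendation):

sysctl -w vm.swappiness=1

plus the matching "vm.swappiness = 1" line in /etc/sysctl.conf or a file
under /etc/sysctl.d/ so it persists across reboots.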
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 18, 2015 at 8:15 PM, 张冬卯  wrote:
> yes, a raid1 system disk is necessary, from my perspective.
>
> And a swap partition is still  needed even though the memory is big.
> Martin Palma wrote on 2015-09-18 at 23:07: Hi,
>
> Is it a good idea to use a software raid for the system disk (Operating
> System) on a Ceph storage node? I mean only for the OS not for the OSD
> disks.
>
> And what about a swap partition? Is that needed?
>
> Best,
> Martin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/OEjCRDmVDuy+mK58QAAzWMQAI/sVjPQP+sJPPAl/7lq
r75SccEO5i7rGZuTrrDn6kl1p2Nackb9o0PKh0tnyOSgouO2Jj3IOvqlnv/S
t40SxkoDoaG2//JmBHZs2BtOUWQCBh4R9h5UjwROXvByKBCNzzH+tsrJ4cST
7HdVNB3Cjg/kcxC7kB88OOF1Kl8mrPSsbb9kG1RSlqqCFh82pZ3GJjDpNP30
mCQHMcTWwlYoAx+5Lii0cCks5Csc78BAv7gFtv47NAvXyRT5LPN8ZvXYv71N
Bm5pqkIC5H38nVrTb9UCyrVdFmL7M9FeaEaJ+CvNOZFYVgjrAUWNw0LGAgey
yAH9q8GHEJAmvtZLWebCXMucLmNM6LUDySuxgN2sx8upNFg57Zwz8zLFmitp
mvp8YdWS+blSk8gyMWFMLUbtIPu6QepzYGpY5lKy6HJI4pqzd5g8HK9gIjjO
0EY330T7KE03HjQS4Nuj3xmSWeY5lOQ1sSMANBACXLtDpgTQ8/rQpni9DQBP
VHyk0t3DWwE83MbF6T8o1h+vS06BIeVc4mOehAxvmsSMITmORejJmLQtu3sS
bUEWDZS7KcVPj0/FIqGJbf4d7CIYKfDouNUb1J2aXjz9YwK9CbfFZzHbGIfT
gKsgbL8wHsHNIjOlQVbvSgJ+CjRz6H7xO3hHVUrYL4pXLjSt+VJodfB7azlk
8Dq9
=ny7s
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread Martin Palma
Thanks all for the suggestions.

Our storage nodes have plenty of RAM and their only purpose is to host the
OSD daemons, so we will not create a swap partition on provisioning.

For the OS disk we will then use a software RAID 1 to handle eventual
disk failures. For provisioning the hosts we use Kickstart and then Ansible
to install and prepare the hosts so they are ready for ceph-deploy.

Thanks all, your opinions and suggestions helped a lot.

Best,
Martin

On Sat, Sep 19, 2015 at 6:14 AM, Robert LeBlanc 
wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> If you decide to use swap, be warned that significant parts of the OSD
> code can be swapped out even without memory pressure. This has caused
> OSD processes to take 5 minutes to shut down in my experience. I would
> recommend tuning swappiness in this case. My strongest recommendation
> is not to have swap if it is a pure OSD node.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Sep 18, 2015 at 8:15 PM, 张冬卯  wrote:
> > yes, a raid1 system disk is necessary, from my perspective.
> >
> > And a swap partition is still  needed even though the memory is big.
> > Martin Palma  于 2015年9月18日,下午11:07写道: Hi,
> >
> > Is it a good idea to use a software raid for the system disk (Operating
> > System) on a Ceph storage node? I mean only for the OS not for the OSD
> > disks.
> >
> > And what about a swap partition? Is that needed?
> >
> > Best,
> > Martin
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJV/OEjCRDmVDuy+mK58QAAzWMQAI/sVjPQP+sJPPAl/7lq
> r75SccEO5i7rGZuTrrDn6kl1p2Nackb9o0PKh0tnyOSgouO2Jj3IOvqlnv/S
> t40SxkoDoaG2//JmBHZs2BtOUWQCBh4R9h5UjwROXvByKBCNzzH+tsrJ4cST
> 7HdVNB3Cjg/kcxC7kB88OOF1Kl8mrPSsbb9kG1RSlqqCFh82pZ3GJjDpNP30
> mCQHMcTWwlYoAx+5Lii0cCks5Csc78BAv7gFtv47NAvXyRT5LPN8ZvXYv71N
> Bm5pqkIC5H38nVrTb9UCyrVdFmL7M9FeaEaJ+CvNOZFYVgjrAUWNw0LGAgey
> yAH9q8GHEJAmvtZLWebCXMucLmNM6LUDySuxgN2sx8upNFg57Zwz8zLFmitp
> mvp8YdWS+blSk8gyMWFMLUbtIPu6QepzYGpY5lKy6HJI4pqzd5g8HK9gIjjO
> 0EY330T7KE03HjQS4Nuj3xmSWeY5lOQ1sSMANBACXLtDpgTQ8/rQpni9DQBP
> VHyk0t3DWwE83MbF6T8o1h+vS06BIeVc4mOehAxvmsSMITmORejJmLQtu3sS
> bUEWDZS7KcVPj0/FIqGJbf4d7CIYKfDouNUb1J2aXjz9YwK9CbfFZzHbGIfT
> gKsgbL8wHsHNIjOlQVbvSgJ+CjRz6H7xO3hHVUrYL4pXLjSt+VJodfB7azlk
> 8Dq9
> =ny7s
> -END PGP SIGNATURE-
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Hi,

I think I found the problem: a way too large journal.
I caught this in the logs of an OSD that had blocked requests:

OSD.15 :

2015-09-19 00:41:12.717062 7fb8a3c44700  1 journal check_for_full at 3548528640 
: JOURNAL FULL 3548528640 >= 1376255 (max_size 4294967296 start 3549904896)
2015-09-19 00:41:43.124590 7fb8a6181700  0 log [WRN] : 6 slow requests, 6 
included below; oldest blocked for > 30.405719 secs
2015-09-19 00:41:43.124596 7fb8a6181700  0 log [WRN] : slow request 30.405719 
seconds old, received at 2015-09-19 00:41:12.718829: 
osd_op(client.31621623.1:5392489797 rb.0.1b844d6.238e1f29.04d3 [write 
0~4096] 6.3aed306f snapc 4=[4,11096,11018] ondisk+write e847952) v4 
currently waiting for subops from 19
2015-09-19 00:41:43.124599 7fb8a6181700  0 log [WRN] : slow request 30.172735 
seconds old, received at 2015-09-19 00:41:12.951813: 
osd_op(client.31435077.1:8423014905 rb.0.1c39394.238e1f29.037a [write 
1499136~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124602 7fb8a6181700  0 log [WRN] : slow request 30.172703 
seconds old, received at 2015-09-19 00:41:12.951845: 
osd_op(client.31435077.1:8423014906 rb.0.1c39394.238e1f29.037a [write 
1523712~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124604 7fb8a6181700  0 log [WRN] : slow request 30.172576 
seconds old, received at 2015-09-19 00:41:12.951972: 
osd_op(client.31435077.1:8423014907 rb.0.1c39394.238e1f29.037a [write 
1515520~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28
2015-09-19 00:41:43.124606 7fb8a6181700  0 log [WRN] : slow request 30.172546 
seconds old, received at 2015-09-19 00:41:12.952002: 
osd_op(client.31435077.1:8423014909 rb.0.1c39394.238e1f29.037a [write 
1531904~8192] 6.2ffed26e snapc 8=[8,1109a,1101c] ondisk+write e847952) 
v4 currently waiting for subops from 28

and at the same time on OSD.19:

2015-09-19 00:41:19.549508 7f55973c0700  0 -- 192.168.42.22:6806/28596 >> 
192.168.42.16:6828/38905 pipe(0x230f sd=358 :6806 s=2 pgs=14268 cs=3 l=0 
c=0x6d9cb00).fault with nothing to send, going to standby
2015-09-19 00:41:43.246421 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 30.253274 secs
2015-09-19 00:41:43.246428 7f55ba277700  0 log [WRN] : slow request 30.253274 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15
2015-09-19 00:42:13.251591 7f55ba277700  0 log [WRN] : 1 slow requests, 1 
included below; oldest blocked for > 60.258446 secs
2015-09-19 00:42:13.251596 7f55ba277700  0 log [WRN] : slow request 60.258446 
seconds old, received at 2015-09-19 00:41:12.993123: 
osd_op(client.31626115.1:4664205553 rb.0.1c918ad.238e1f29.2da9 [write 
3063808~16384] 6.604ba242 snapc 10aaf=[10aaf,10a31,109b3] ondisk+write e847952) 
v4 currently waiting for subops from 15

So the blocking seems to be the "JOURNAL FULL" event, with big numbers.
Is 3548528640 the journal size?
I just reduced filestore_max_sync_interval to 30s, and everything
seems to work fine.
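
For reference, a hedged ceph.conf sketch of that kind of journal/filestore
tuning (the numbers are illustrative assumptions, not values taken from this
cluster):

   [osd]
   osd journal size = 5120             # MB; keep the journal modest when it shares the SSD with the OSD
   filestore max sync interval = 30    # seconds, as described above
   filestore min sync interval = 0.01  # the default, listed only for context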

For SSD OSDs with the journal on the same device, a big journal is a crazy
thing... I suppose I broke this setup when trying to tune the journal
for the HDD pool.

At the same time, are there any tips for tuning the journal for HDD OSDs with
a (potentially big) SSD journal and a hardware RAID card that handles
write-back?

Thanks for your help.

Olivier


Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a écrit :
> Hi,
> 
> I have a cluster with lot of blocked operations each time I try to
> move
> data (by reweighting a little an OSD).
> 
> It's a full SSD cluster, with 10GbE network.
> 
> In logs, when I have blocked OSD, on the main OSD I can see that :
> 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 33.976680 secs
> 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow request
> 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> requests, 1 included below; oldest blocked for > 63.981596 secs
> 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow request
> 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> osd_op(client.29760717.1:18680817544
> rb.0.1c16005.238e1f29.027f [write 180224~16384] 6.c11916a4
> snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4 currently
> reached pg
> 
> 

Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
Le vendredi 18 septembre 2015 à 14:14 +0200, Paweł Sadowski a écrit :
> It might be worth checking how many threads you have in your system
> (ps
> -eL | wc -l). By default there is a limit of 32k (sysctl -q
> kernel.pid_max). There is/was a bug in fork()
> (https://lkml.org/lkml/2015/2/3/345) reporting ENOMEM when PID limit
> is
> reached. We hit a situation when OSD trying to create new thread was
> killed and reports 'Cannot allocate memory' (12 OSD per node created
> more than 32k threads).
> 

Thanks! For now I don't see more than 5k threads on nodes with 12 OSDs,
but maybe during recovery/backfilling?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd won't boot, resource shortage?

2015-09-18 Thread Shinobu Kinjo
I do not think that it's best practice to increase that number at the moment.
It's kind of lack of consideration.

We might need to do that as a result.

But what we should do, first, is to check current actual number of aio using:

 watch -dc cat /proc/sys/fs/aio-nr

then increase, if it's necessary.

Anyway you have to be more careful otherwise there might be back-and-force 
meaningless configuration change -;

Shinobu

- Original Message -
From: "Peter Sabaini" 
To: ceph-users@lists.ceph.com
Sent: Thursday, September 17, 2015 11:51:11 PM
Subject: Re: [ceph-users] ceph osd won't boot, resource shortage?

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 16.09.15 16:41, Peter Sabaini wrote:
> Hi all,
> 
> I'm having trouble adding OSDs to a storage node; I've got
> about 28 OSDs running, but adding more fails.

So, it seems the requisite knob was sysctl fs.aio-max-nr
By default, this was set to 64K here. I set it:

# echo 2097152 > /proc/sys/fs/aio-max-nr

This let me add my remaining OSDs.
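
To make that survive a reboot, a small sketch (the file name is just an example):

   # /etc/sysctl.d/30-ceph-aio.conf
   fs.aio-max-nr = 2097152

   # load it without rebooting:
   sysctl -p /etc/sysctl.d/30-ceph-aio.conf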



> Typical log excerpt:
> 
> 2015-09-16 13:55:58.083797 7f3e7b821800  1 journal _open 
> /var/lib/ceph/osd/ceph-28/journal fd 20: 21474836480 bytes,
> block size 4096 bytes, directio = 1, aio = 1 2015-09-16
> 13:55:58.090709 7f3e7b821800 -1 journal FileJournal::_open:
> unable to setup io_context (61) No data available 2015-09-16
> 13:55:58.090825 7f3e74a96700 -1 journal io_submit to 0~4096 got
> (22) Invalid argument 2015-09-16 13:55:58.091061 7f3e7b821800
> 1 journal close /var/lib/ceph/osd/ceph-28/journal 2015-09-16
> 13:55:58.091993 7f3e74a96700 -1 os/FileJournal.cc: In function
> 'int FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&,
> uint64_t)' thread 7f3e74a96700 time 2 015-09-16
> 13:55:58.090842 os/FileJournal.cc: 1337: FAILED assert(0 ==
> "io_submit got unexpected error")
> 
> More complete: http://pastebin.ubuntu.com/12427041/
> 
> If, however, I stop one of the running OSDs, starting the
> original OSD works fine. I'm guessing I'm running out of
> resources somewhere, but where?
> 
> Some poss. relevant sysctl values:
> 
> vm.max_map_count = 524288
> kernel.pid_max = 2097152
> kernel.threads-max = 2097152
> fs.aio-max-nr = 65536
> fs.aio-nr = 129024
> fs.dentry-state = 75710  49996  45  0  0  0
> fs.file-max = 26244198
> fs.file-nr = 13504  0  26244198
> fs.inode-nr = 60706  202
> fs.nr_open = 1048576
> 
> I've also set max open files = 1048576 in ceph.conf
> 
> The OSDs are setup with dedicated journal disks - 3 OSDs share
> one journal device.
> 
> Any advice on what I'm missing, or where I should dig deeper?
> 
> Thanks, peter.
> 
> 
> 
> 
> 
> 
> ___ ceph-users
> mailing list ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-BEGIN PGP SIGNATURE-

iQIcBAEBAgAGBQJV+tNfAAoJEDg5mUAO12PZxn8P/16Q+N9//a9msK6tWOqohj37
fF9z3N/MkoWXPZAyDLnhQIghvGQNa6hKb/ArnjTYRhLzsa9pK5BMc/h1Hfotutng
a982tyEYQ8anHpVfAUZ+Ww7vJTgNWNfSeo05rc+meZlE0VowbJxCPJ2iEFUH9lDm
qxCUk0FYI0utH/dxZN6KUnnw8vwhqulj96Wa5qw/9PJGqHLBhFnmthXJXWaVStGd
Yoq27DHY0z5PMMC80zKttKDhYdRXo0psvGtTgWvZRoOsSLcIvA1aF68iMnoYeY57
aH8k1U6aIz7BohdDaxSL58btV3oI3Xv6XuMn2qcvM12iKzqy2Fwf2EQnxBag96d7
loDZuCgamJWuvXBv0FV5NsNKnzcK6UBvHq4Z+LIq9KzaRlx8c3EfF5RnKjw+DNd2
TOsy6a3qIYvJP+2qmiCCFGwUI7glAUPnCkiMy9T9zXMyNRnUtPxALNUjzuHG8o+k
tKg2kYG9EBT3+26SCxfOCRBjDmo2+NdjuQICmCE0qBVLct2bckwAHseuZZfPIHj3
n6oFy8tcb/ycZiknCOZkyuQS+MNjN9vL/v1imdnU2OVnudsS8JxhWehwULOjx9zq
bv9pNwkavg0oj/hl9JbPtLLGHzFw9QnI8TcUDMfMaetpuvyr5mEzt55zCsPQdLuL
fzH2ArRGwonjKM73raHR
=9VS3
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] multi-datacenter crush map

2015-09-18 Thread Wouter De Borger
Hi all,

I have found on the mailing list that it should be possible to have a multi
datacenter setup, if latency is low enough.

I would like to set this up, so that each datacenter has at least two
replicas and each PG has a replication level of 3.

In this mail, it is suggested that I should use the following crush map for
multi DC:

rule dc {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type datacenter
step emit
}

This looks suspicious to me, as it will only generate a list of two
OSDs (and only one if one DC is down).

I think I should use:

rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take root
step choose firstn 2 type datacenter
step chooseleaf firstn 2 type host
step emit
step take root
step chooseleaf firstn -4 type host
step emit
}

This correctly generates a list with 2 OSDs in one DC, then 2 OSDs in
the other, and then a list of further OSDs.

The problem is that this list contains duplicates (e.g. for 8 OSDs per DC):

[*13*,*11*,1,8,*13*,*11*,16,4,3,7]
[9,2,13,11,9,15,12,18,3,5]
[3,5,17,10,3,5,7,13,18,10]
[7,6,11,14,7,14,3,16,4,11]
[6,3,15,18,6,3,12,9,16,15]

Will this be a problem?

If crush is executed, will it only consider osd's which are (up,in)
or all OSD's in the map and then filter them from the list afterwards?
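
One way to inspect this is to run the candidate map through crushtool and look
at the OSD lists it produces; note that crushtool only sees the compiled map,
not the live up/in state. File names below are assumptions:

   crushtool -c crushmap.txt -o crushmap.bin     # compile the edited map
   crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-mappings
   crushtool -i crushmap.bin --test --rule 0 --num-rep 6 --show-bad-mappings   # mappings with fewer than 6 OSDs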

Thx,

Wouter
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Paweł Sadowski
On 09/18/2015 12:17 PM, Olivier Bonvalet wrote:
> Le vendredi 18 septembre 2015 à 12:04 +0200, Jan Schermer a écrit :
>>> On 18 Sep 2015, at 11:28, Christian Balzer  wrote:
>>>
>>> On Fri, 18 Sep 2015 11:07:49 +0200 Olivier Bonvalet wrote:
>>>
 Le vendredi 18 septembre 2015 à 10:59 +0200, Jan Schermer a écrit
 :
> In that case it can either be slow monitors (slow network, slow
> disks(!!!)  or a CPU or memory problem).
> But it still can also be on the OSD side in the form of either
> CPU
> usage or memory pressure - in my case there were lots of memory
> used
> for pagecache (so for all intents and purposes considered
> "free") but
> when peering the OSD had trouble allocating any memory from it
> and it
> caused lots of slow ops and peering hanging in there for a
> while.
> This also doesn't show as high CPU usage, only kswapd spins up
> a bit
> (don't be fooled by its name, it has nothing to do with swap in
> this
> case).
 My nodes have 256GB of RAM (for 12x300GB ones) or 128GB of RAM
 (for
 4x800GB ones), so I will try to track this too. Thanks!

>>> I haven't seen this (known problem) with 64GB or 128GB nodes,
>>> probably
>>> because I set /proc/sys/vm/min_free_kbytes to 512MB or 1GB
>>> respectively.
>>>
>> I had this set to 6G and that doesn't help. This "buffer" is probably
>> only useful for some atomic allocations that can use it, not for
>> userland processes and their memory. Or maybe they get memory from
>> this pool but it gets replenished immediately.
>> QEMU has no problem allocating 64G on the same host, OSD struggles to
>> allocate memory during startup or when PGs are added during
>> rebalancing - probably because it does a lot of smaller allocations
>> instead of one big.
>>
> For now I dropped cache *and* set min_free_kbytes to 1GB. I don't throw
> any rebalance, but I can see a reduced filestore.commitcycle_latency.

It might be worth checking how many threads you have in your system (ps
-eL | wc -l). By default there is a limit of 32k (sysctl -q
kernel.pid_max). There is/was a bug in fork()
(https://lkml.org/lkml/2015/2/3/345) reporting ENOMEM when PID limit is
reached. We hit a situation when OSD trying to create new thread was
killed and reports 'Cannot allocate memory' (12 OSD per node created
more than 32k threads).
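
A small sketch of checking and, if needed, raising that limit (the new value is
an illustrative assumption):

   ps -eL | wc -l                       # total number of threads on the node
   sysctl kernel.pid_max                # current limit
   sysctl -w kernel.pid_max=4194303     # raise it; persist via /etc/sysctl.d if it helps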

-- 
PS

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
I use Ceph 0.80.10.

I see IO wait is near 0 thanks to iostat, htop (in detailed mode), and
rechecked with Zabbix supervisor.


Le jeudi 17 septembre 2015 à 20:28 -0700, GuangYang a écrit :
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-18 Thread Jan Schermer
"850 PRO" is a workstation drive. You shouldn't put it in the server...
But it should not just die either way, so don't tell them you use it for Ceph 
next time.

Do the drives work when replugged? Can you get anything from SMART?

Jan


> On 18 Sep 2015, at 02:57, James (Fei) Liu-SSI  
> wrote:
> 
> Hi Quentin,
> Samsung has so different type of SSD for different type of workload with 
> different SSD media like SLC,MLC,TLC ,3D NAND etc. They were designed for 
> different workloads for different purposes. Thanks for your understanding and 
> support.
>  
> Regards,
> James
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com 
> ] On Behalf Of Quentin Hartman
> Sent: Thursday, September 17, 2015 4:05 PM
> To: Andrija Panic
> Cc: ceph-users@lists.ceph.com 
> Subject: Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel 
> s3700
>  
> I ended up having 7 total die. 5 while in service, 2 more when I hooked them 
> up to a test machine to collect information from them. To Samsung's credit, 
> they've been great to deal with and are replacing the failed drives, on the 
> condition that I don't use them for ceph again. Apparently they sent some of 
> my failed drives to an engineer in Korea and they did a failure analysis on 
> them and came to the conclusion they were put to an "unintended use". I have 
> seven left I'm not sure what to do with.
>  
> I've honestly always really liked Samsung, and I'm disappointed that I wasn't 
> able to find anyone with their DC-class drives actually in stock so I ended 
> up switching the to Intel S3700s. My users will be happy to have some SSDs to 
> put in their workstations though!
>  
> QH
>  
> On Thu, Sep 17, 2015 at 4:49 PM, Andrija Panic  > wrote:
> Another one bites the dust...
>  
> This is Samsung 850 PRO 256GB... (6 journals on this SSDs just died...)
>  
> [root@cs23 ~]# smartctl -a /dev/sda
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.66-1.el6.elrepo.x86_64] 
> (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net 
> 
>  
> Vendor:   /1:0:0:0
> Product:
> User Capacity:600,332,565,813,390,450 bytes [600 PB]
> Logical block size:   774843950 bytes
> >> Terminate command early due to bad response to IEC mode page
> A mandatory SMART command failed: exiting. To continue, add one or more '-T 
> permissive' options
>  
> On 8 September 2015 at 18:01, Quentin Hartman  > wrote:
> On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  > wrote:
> A list of hardware that is known to work well would be incredibly
> valuable to people getting started. It doesn't have to be exhaustive,
> nor does it have to provide all the guidance someone could want. A
> simple "these things have worked for others" would be sufficient. If
> nothing else, it will help people justify more expensive gear when their
> approval people say "X seems just as good and is cheaper, why can't we
> get that?".
> 
> So I have my opinions on different drives, but I think we do need to be 
> really careful not to appear to endorse or pick on specific vendors. The more 
> we can stick to high-level statements like:
> 
> - Drives should have high write endurance
> - Drives should perform well with O_DSYNC writes
> - Drives should support power loss protection for data in motion
> 
> The better I think.  Once those are established, I think it's reasonable to 
> point out that certain drives meet (or do not meet) those criteria and get 
> feedback from the community as to whether or not vendor's marketing actually 
> reflects reality.  It'd also be really nice to see more information available 
> like the actual hardware (capacitors, flash cells, etc) used in the drives.  
> I've had to show photos of the innards of specific drives to vendors to get 
> them to give me accurate information regarding certain drive capabilities.  
> Having a database of such things available to the community would be really 
> helpful.
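
As an illustration of the O_DSYNC point above, a hedged fio sketch of the usual
sync-write test (this overwrites the target device, so only run it against a
disposable drive; /dev/sdX is a placeholder):

   fio --name=dsync-test --filename=/dev/sdX --direct=1 --sync=1 \
       --rw=write --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based
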
> 
>  
> That's probably a very good approach. I think it would be pretty simple to 
> avoid the appearance of endorsement if the data is presented correctly.
>  
> 
> To that point, I think perhaps though something more important than a
> list of known "good" hardware would be a list of known "bad" hardware,
> 
> I'm rather hesitant to do this unless it's been specifically confirmed by the 
> vendor.  It's too easy to point fingers (see the recent kernel trim bug 
> situation).
>  
> I disagree. I think that only comes into play if you claim to know why the 
> hardware has problems. In this case, if you simply state "people who have 
> used this drive have experienced a large number of seemingly 

[ceph-users] Strange rbd hung with non-standard crush location

2015-09-18 Thread Max A. Krasilnikov
Hello!

I have 3-node ceph cluster under ubuntu 14.04.3 with hammer 0.94.2 from
ubuntu-cloud repository. My config and crush map is attached below.

After adding a volume with Cinder, any of my OpenStack instances hangs after a
short period of time with an "[sda]: abort" message in the VM's kernel log. When
connecting the volume directly to my compute node with

rbd map --name client.openstack --keyfile client.openstack.key 
openstack-hdd/volume-da53d8d0-b361-4697-94ed-218b92c1541e

I see the same thing: a small amount of data gets written, and it hangs after that:

Sep 15 16:36:24 compute001 kernel: [ 1620.258823] Key type ceph registered
Sep 15 16:36:24 compute001 kernel: [ 1620.259143] libceph: loaded (mon/osd 
proto 15/24)
Sep 15 16:36:24 compute001 kernel: [ 1620.263448] rbd: loaded (major 251)
Sep 15 16:36:24 compute001 kernel: [ 1620.264948] libceph: client13757843 fsid 
b490cb36-ab9b-4dd1-b3bf-c022061a977e
Sep 15 16:36:24 compute001 kernel: [ 1620.265359] libceph: mon2 10.0.66.3:6789 
session established
Sep 15 16:36:24 compute001 kernel: [ 1620.275268]  rbd0: p1
Sep 15 16:36:24 compute001 kernel: [ 1620.275484] rbd: rbd0: added with size 
0xe60
Sep 15 16:41:24 compute001 kernel: [ 1920.445112] INFO: task fio:31185 blocked 
for more than 120 seconds.
Sep 15 16:41:24 compute001 kernel: [ 1920.445484]   Not tainted 
3.16.0-49-generic #65~14.04.1-Ubuntu
Sep 15 16:41:24 compute001 kernel: [ 1920.445835] "echo 0 > 
/proc/sys/kernel/hung_task_timeout_secs" disables this message.
Sep 15 16:41:24 compute001 kernel: [ 1920.446286] fio D 
881fffab30c0 0 31185  1 0x0004
Sep 15 16:41:24 compute001 kernel: [ 1920.446295]  881fba167b60 
0046 881fac8cbd20 881fba167fd8
Sep 15 16:41:24 compute001 kernel: [ 1920.446302]  000130c0 
000130c0 881fd2a18a30 881fba167c88
Sep 15 16:41:24 compute001 kernel: [ 1920.446308]  881fba167c90 
7fff 881fac8cbd20 881fac8cbd20
Sep 15 16:41:24 compute001 kernel: [ 1920.446315] Call Trace:
Sep 15 16:41:24 compute001 kernel: [ 1920.446333]  [] 
schedule+0x29/0x70
Sep 15 16:41:24 compute001 kernel: [ 1920.446338]  [] 
schedule_timeout+0x229/0x2a0
Sep 15 16:41:24 compute001 kernel: [ 1920.446350]  [] ? 
__wake_up+0x44/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446357]  [] ? 
__call_rcu_nocb_enqueue+0xc8/0xd0
Sep 15 16:41:24 compute001 kernel: [ 1920.446363]  [] 
wait_for_completion+0xa6/0x160
Sep 15 16:41:24 compute001 kernel: [ 1920.446370]  [] ? 
wake_up_state+0x20/0x20
Sep 15 16:41:24 compute001 kernel: [ 1920.446380]  [] 
exit_aio+0xe0/0xf0
Sep 15 16:41:24 compute001 kernel: [ 1920.446388]  [] 
mmput+0x30/0x120
Sep 15 16:41:24 compute001 kernel: [ 1920.446395]  [] 
do_exit+0x26c/0xa60
Sep 15 16:41:24 compute001 kernel: [ 1920.446401]  [] ? 
dequeue_entity+0x142/0x5c0
Sep 15 16:41:24 compute001 kernel: [ 1920.446407]  [] 
do_group_exit+0x3f/0xa0
Sep 15 16:41:24 compute001 kernel: [ 1920.446416]  [] 
get_signal_to_deliver+0x1d0/0x6f0
Sep 15 16:41:24 compute001 kernel: [ 1920.446426]  [] 
do_signal+0x48/0xad0
Sep 15 16:41:24 compute001 kernel: [ 1920.446434]  [] ? 
hrtimer_cancel+0x1a/0x30
Sep 15 16:41:24 compute001 kernel: [ 1920.446440]  [] ? 
read_events+0x207/0x230
Sep 15 16:41:24 compute001 kernel: [ 1920.446445]  [] ? 
hrtimer_get_res+0x50/0x50
Sep 15 16:41:24 compute001 kernel: [ 1920.446451]  [] 
do_notify_resume+0x69/0xb0
Sep 15 16:41:24 compute001 kernel: [ 1920.446459]  [] 
int_signal+0x12/0x17

At the same time I have no problems with cephfs mounted on this host using fuse.

I rebuilt my cluster with an almost default config and ended up with strange
behavior:
When using the crush map named "crush-good", the cluster is doing well. When removing
the unused root "default", or even OSDs from hosts in this root, the problem comes back.
Adding OSDs and hosts back into the "default" root fixes the problem.

Hosts storage00[1-3] are listed in /etc/hosts; even [ssd|hdd]-st00[1-3] are listed
there with their public IPs, although I know that this is not necessary.

All OSDs run on ext4 created like this:
mkfs.ext4 -L osd-[n] -m0 -Tlargefile /dev/drive
and mounted with noatime.

All journals lie on separate SSDs, two per host (one for the SSD OSDs, one for the
HDD OSDs), created as 24GB partitions.

crush-good (almost copy from Ceph site :)):

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 20 osd.20
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 40 osd.40
device 50 osd.50
device 51 osd.51
device 52 osd.52

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host ssd-st001 {
id -1   # do not change unnecessarily
# weight 1.000
alg straw
hash 0  # rjenkins1
  

[ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread Martin Palma
Hi,

Is it a good idea to use a software raid for the system disk (Operating
System) on a Ceph storage node? I mean only for the OS not for the OSD
disks.

And what about a swap partition? Is that needed?

Best,
Martin
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Brian Kroth
Hmm, apparently I haven't gotten that far in my email backlog yet.  
That's good to know too.


Thanks,
Brian

Olivier Bonvalet  2015-09-18 16:02:

Hi,

not sure if it's related, but there is recent changes because of a
security issue :

http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/




Le vendredi 18 septembre 2015 à 08:45 -0500, Brian Kroth a écrit :

Hi all, we've had the following in our
/etc/apt/sources.list.d/ceph.list
for a while based on some previous docs,

# ceph upstream stable (currently giant) release packages for wheezy:
deb http://ceph.com/debian/ wheezy main

# ceph extras:
deb http://ceph.com/packages/ceph-extras/debian wheezy main

but it seems like the straight "debian/" portion of that path has
gone
missing recently, and now there's only debian-firefly/, debian
-giant/,
debian-hammer/, etc.

Is that just an oversight, or should we be switching our sources to
one
of the named releases?  I figured that the unnamed one would
automatically track what ceph currently considered "stable" for the
target distro release for me, but maybe that's not the case.

Thanks,
Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Alfredo Deza
The new locations are in:


http://packages.ceph.com/

For debian this would be:

http://packages.ceph.com/debian-{release}

Note that ceph-extras is no longer available: the current repos should
provide everything that is needed to properly install
ceph. Otherwise, please let us know.

On Fri, Sep 18, 2015 at 10:35 AM, Brian Kroth  wrote:
> Hmm, apparently I haven't gotten that far in my email backlog yet.  That's
> good to know too.
>
> Thanks,
> Brian
>
> Olivier Bonvalet  2015-09-18 16:02:
>
>> Hi,
>>
>> not sure if it's related, but there is recent changes because of a
>> security issue :
>>
>>
>> http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
>>
>>
>>
>>
>> Le vendredi 18 septembre 2015 à 08:45 -0500, Brian Kroth a écrit :
>>>
>>> Hi all, we've had the following in our
>>> /etc/apt/sources.list.d/ceph.list
>>> for a while based on some previous docs,
>>>
>>> # ceph upstream stable (currently giant) release packages for wheezy:
>>> deb http://ceph.com/debian/ wheezy main
>>>
>>> # ceph extras:
>>> deb http://ceph.com/packages/ceph-extras/debian wheezy main
>>>
>>> but it seems like the straight "debian/" portion of that path has
>>> gone
>>> missing recently, and now there's only debian-firefly/, debian
>>> -giant/,
>>> debian-hammer/, etc.
>>>
>>> Is that just an oversight, or should we be switching our sources to
>>> one
>>> of the named releases?  I figured that the unnamed one would
>>> automatically track what ceph currently considered "stable" for the
>>> target distro release for me, but maybe that's not the case.
>>>
>>> Thanks,
>>> Brian
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] missing SRPMs - for librados2 and libradosstriper1?

2015-09-18 Thread Paul Mansfield

I was looking to download the SRPMs associated with the packages in
  http://download.ceph.com/rpm-hammer/rhel6/x86_64/
or
  http://download.ceph.com/rpm-hammer/rhel7/x86_64/


but there's only a subset; the things I am really looking for are
librados2 and libradosstriper1 source rpms.

Please can someone enlighten me if there's a different place to download
them?

thanks
Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph osd won't boot, resource shortage?

2015-09-18 Thread Peter Sabaini
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

On 18.09.15 14:47, Shinobu Kinjo wrote:
> I do not think that it's best practice to increase that number
> at the moment. It's kind of lack of consideration.
> 
> We might need to do that as a result.
> 
> But what we should do, first, is to check current actual number
> of aio using:
> 
> watch -dc cat /proc/sys/fs/aio-nr

I did, it got up to about 138240

> then increase, if it's necessary.
> 
> Anyway you have to be more careful otherwise there might be
> back-and-force meaningless configuration change -;

I'm sorry, I don't quite understand what you mean. Could you
elaborate? Are there specific risks associated with a high setting
of fs.aio-max-nr?

FWIW, I've done some load testing (using rados bench and rados
load-gen) -- anything I should watch out for in your opinion?


Thanks,
peter.


> Shinobu
> 
> - Original Message - From: "Peter Sabaini"
>  To: ceph-users@lists.ceph.com Sent:
> Thursday, September 17, 2015 11:51:11 PM Subject: Re:
> [ceph-users] ceph osd won't boot, resource shortage?
> 
> On 16.09.15 16:41, Peter Sabaini wrote:
>> Hi all,
> 
>> I'm having trouble adding OSDs to a storage node; I've got 
>> about 28 OSDs running, but adding more fails.
> 
> So, it seems the requisite knob was sysctl fs.aio-max-nr By
> default, this was set to 64K here. I set it:
> 
> # echo 2097152 > /proc/sys/fs/aio-max-nr
> 
> This let me add my remaining OSDs.
> 
> 
> 
>> Typical log excerpt:
> 
>> 2015-09-16 13:55:58.083797 7f3e7b821800  1 journal _open 
>> /var/lib/ceph/osd/ceph-28/journal fd 20: 21474836480 bytes, 
>> block size 4096 bytes, directio = 1, aio = 1 2015-09-16 
>> 13:55:58.090709 7f3e7b821800 -1 journal FileJournal::_open: 
>> unable to setup io_context (61) No data available 2015-09-16 
>> 13:55:58.090825 7f3e74a96700 -1 journal io_submit to 0~4096
>> got (22) Invalid argument 2015-09-16 13:55:58.091061
>> 7f3e7b821800 1 journal close
>> /var/lib/ceph/osd/ceph-28/journal 2015-09-16 13:55:58.091993
>> 7f3e74a96700 -1 os/FileJournal.cc: In function 'int
>> FileJournal::write_aio_bl(off64_t&, ceph::bufferlist&, 
>> uint64_t)' thread 7f3e74a96700 time 2 015-09-16 
>> 13:55:58.090842 os/FileJournal.cc: 1337: FAILED assert(0 == 
>> "io_submit got unexpected error")
> 
>> More complete: http://pastebin.ubuntu.com/12427041/
> 
>> If, however, I stop one of the running OSDs, starting the 
>> original OSD works fine. I'm guessing I'm running out of 
>> resources somewhere, but where?
> 
>> Some poss. relevant sysctl values:
> 
>> vm.max_map_count = 524288
>> kernel.pid_max = 2097152
>> kernel.threads-max = 2097152
>> fs.aio-max-nr = 65536
>> fs.aio-nr = 129024
>> fs.dentry-state = 75710  49996  45  0  0  0
>> fs.file-max = 26244198
>> fs.file-nr = 13504  0  26244198
>> fs.inode-nr = 60706  202
>> fs.nr_open = 1048576
> 
>> I've also set max open files = 1048576 in ceph.conf
> 
>> The OSDs are setup with dedicated journal disks - 3 OSDs
>> share one journal device.
> 
>> Any advice on what I'm missing, or where I should dig
>> deeper?
> 
>> Thanks, peter.
> 
> 
> 
> 
> 
> 
>> ___ ceph-users 
>> mailing list ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___ ceph-users
> mailing list ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-BEGIN PGP SIGNATURE-

iQIcBAEBAgAGBQJV/A/ZAAoJEDg5mUAO12PZ3tMP/06JdIoNf3DM00UPMHCZdZUm
Uz5ZhQV7/Cc9ZurLkD1VSC/OAtTfIR99MJeoozczN6KKL6euGafUk1oJRuGlMst/
1LDu28EbWmBn29k4szyLnqZZcj49JZFBDQ3zHEAAvPmmglQOeENooWoMbjjGb/+p
wX6ANBOBkaVYbwmG8pRndab0DYdV/GBsTDDIbHVp4GnOwg/wOQriKIfRhHw1q4l6
KcGeZs84bhzfiqRQHHJXDieHAsUpKKUbLH0ofLxzCYOjrmpUgrHoVPV2YlNV0BYU
WS2dJaOs0EwVK4iTdnb3B8VH11QsdKk0zCpC40+jaxU7Zn7THoMIURmDCIaI8OGB
B1I4/Ima1Z6CMmPqDQIvebtnhdizgCpq11z6LRAb50TnNPnMuzIccyl5z013Sk8J
JGG5/0sMDjE+apKx/bZdC+Q0TyJ8I49zcizo5qfHhvAqW51McTXEVspJy9ZlQvwK
2Q9bVZsdHBHbM6B45iILOel/K/ids6PzypzKMrwRDmsLI4NfB/fAvWcaWXW7GeQ0
fVbjEv9m12gWhJugJt5ue5JcRcnP8gdg2oG2kzAggGvqkaYrns2VwUXCux+wzkjw
V418bjOWs78eHofmhhteitIItYDROYj9HSioDoaE15cqjOujn6N46PRRToY2eaGP
s2LCkcql3hrWMBKp2h2D
=GijP
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] missing SRPMs - for librados2 and libradosstriper1?

2015-09-18 Thread Paul Mansfield

p.s. this page:
   http://docs.ceph.com/docs/giant/install/get-packages/

is entirely wrong and the links to
   http://ceph.com/packages/ceph-extras/rpm

are all useless; http://ceph.com/packages seems to have gone away, I
also tried https.

;-(
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Olivier Bonvalet
Hi,

not sure if it's related, but there is recent changes because of a
security issue :

http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/




Le vendredi 18 septembre 2015 à 08:45 -0500, Brian Kroth a écrit :
> Hi all, we've had the following in our
> /etc/apt/sources.list.d/ceph.list 
> for a while based on some previous docs,
> 
> # ceph upstream stable (currently giant) release packages for wheezy:
> deb http://ceph.com/debian/ wheezy main
> 
> # ceph extras:
> deb http://ceph.com/packages/ceph-extras/debian wheezy main
> 
> but it seems like the straight "debian/" portion of that path has
> gone 
> missing recently, and now there's only debian-firefly/, debian
> -giant/, 
> debian-hammer/, etc.
> 
> Is that just an oversight, or should we be switching our sources to
> one 
> of the named releases?  I figured that the unnamed one would 
> automatically track what ceph currently considered "stable" for the 
> target distro release for me, but maybe that's not the case.
> 
> Thanks,
> Brian
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
But yes, I will try to increase OSD verbosity.
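
For reference, a runtime sketch of raising verbosity on one of the affected OSDs
(osd.15 is taken from the log excerpts earlier in the thread; the exact debug
levels are just an example):

   ceph tell osd.15 injectargs '--debug-osd 20 --debug-ms 1'
   # reproduce the blocked requests, check /var/log/ceph/ceph-osd.15.log, then revert:
   ceph tell osd.15 injectargs '--debug-osd 0/5 --debug-ms 0/5'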

Le jeudi 17 septembre 2015 à 20:28 -0700, GuangYang a écrit :
> Which version are you using?
> 
> My guess is that the request (op) is waiting for lock (might be
> ondisk_read_lock of the object, but a debug_osd=20 should be helpful
> to tell what happened to the op).
> 
> How do you tell the IO wait is near to 0 (by top?)? 
> 
> Thanks,
> Guang
> 
> > From: ceph.l...@daevel.fr
> > To: ceph-users@lists.ceph.com
> > Date: Fri, 18 Sep 2015 02:43:49 +0200
> > Subject: Re: [ceph-users] Lot of blocked operations
> > 
> > Some additionnal informations :
> > - I have 4 SSD per node.
> > - the CPU usage is near 0
> > - IO wait is near 0 too
> > - bandwith usage is also near 0
> > 
> > The whole cluster seems waiting for something... but I don't see
> > what.
> > 
> > 
> > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > écrit :
> > > Hi,
> > > 
> > > I have a cluster with lot of blocked operations each time I try
> > > to
> > > move
> > > data (by reweighting a little an OSD).
> > > 
> > > It's a full SSD cluster, with 10GbE network.
> > > 
> > > In logs, when I have blocked OSD, on the main OSD I can see that
> > > :
> > > 2015-09-18 01:55:16.981396 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 33.976680 secs
> > > 2015-09-18 01:55:16.981402 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 30.125556 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 2015-09-18 01:55:46.986319 7f89e8cb8700 0 log [WRN] : 2 slow
> > > requests, 1 included below; oldest blocked for > 63.981596 secs
> > > 2015-09-18 01:55:46.986324 7f89e8cb8700 0 log [WRN] : slow
> > > request
> > > 60.130472 seconds old, received at 2015-09-18 01:54:46.855821:
> > > osd_op(client.29760717.1:18680817544
> > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > 6.c11916a4
> > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > currently
> > > reached pg
> > > 
> > > How should I read that ? What this OSD is waiting for ?
> > > 
> > > Thanks for any help,
> > > 
> > > Olivier
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lot of blocked operations

2015-09-18 Thread Olivier Bonvalet
mmm good point.

I don't see a CPU or IO problem on the mons, but in the logs I have this:

2015-09-18 01:55:16.921027 7fb951175700  0 log [INF] : pgmap v86359128:
6632 pgs: 77 inactive, 1 remapped, 10 active+remapped+wait_backfill, 25
peering, 5 active+remapped, 6 active+remapped+backfilling, 6499
active+clean, 9 remapped+peering; 18974 GB data, 69004 GB used, 58578
GB / 124 TB avail; 915 kB/s rd, 26383 kB/s wr, 1671 op/s; 8417/15680513
objects degraded (0.054%); 1062 MB/s, 274 objects/s recovering


So... it can be a peering problem. Didn't see that, thanks.
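
A few commands that can help pin down where those ops sit (osd.15 is just an
example id; ceph daemon has to be run on the host that carries that OSD):

   ceph health detail                        # lists the PGs behind the slow/blocked requests
   ceph daemon osd.15 dump_ops_in_flight     # what each in-flight op is currently waiting on
   ceph daemon osd.15 dump_historic_ops      # recent slow ops with per-step timestamps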



Le vendredi 18 septembre 2015 à 09:52 +0200, Jan Schermer a écrit :
> Could this be caused by monitors? In my case lagging monitors can
> also cause slow requests (because of slow peering). Not sure if
> that's expected or not, but it of course doesn't show on the OSDs as
> any kind of bottleneck when you try to investigate...
> 
> Jan
> 
> > On 18 Sep 2015, at 09:37, Olivier Bonvalet 
> > wrote:
> > 
> > Hi,
> > 
> > sorry for missing informations. I was to avoid putting too much
> > inappropriate infos ;)
> > 
> > 
> > 
> > Le vendredi 18 septembre 2015 à 12:30 +0900, Christian Balzer a
> > écrit :
> > > Hello,
> > > 
> > > On Fri, 18 Sep 2015 02:43:49 +0200 Olivier Bonvalet wrote:
> > > 
> > > The items below help, but be a s specific as possible, from OS,
> > > kernel
> > > version to Ceph version, "ceph -s", any other specific details
> > > (pool
> > > type,
> > > replica size).
> > > 
> > 
> > So, all nodes use Debian Wheezy, running on a vanilla 3.14.x
> > kernel,
> > and Ceph 0.80.10.
> > I don't have anymore ceph status right now. But I have
> > data to move tonight again, so I'll track that.
> > 
> > The affected pool is a standard one (no erasure coding), with only
> > 2 replica (size=2).
> > 
> > 
> > 
> > 
> > > > Some additionnal informations :
> > > > - I have 4 SSD per node.
> > > Type, if nothing else for anecdotal reasons.
> > 
> > I have 7 storage nodes here :
> > - 3 nodes which each have 12 OSDs of 300GB SSD
> > - 4 nodes which each have 4 OSDs of 800GB SSD
> > 
> > And I'm trying to replace 12x300GB nodes by 4x800GB nodes.
> > 
> > 
> > 
> > > > - the CPU usage is near 0
> > > > - IO wait is near 0 too
> > > Including the trouble OSD(s)?
> > 
> > Yes
> > 
> > 
> > > Measured how, iostat or atop?
> > 
> > iostat, htop, and confirmed with Zabbix supervisor.
> > 
> > 
> > 
> > 
> > > > - bandwith usage is also near 0
> > > > 
> > > Yeah, all of the above are not surprising if everything is stuck
> > > waiting
> > > on some ops to finish. 
> > > 
> > > How many nodes are we talking about?
> > 
> > 
> > 7 nodes, 52 OSDs.
> > 
> > 
> > 
> > > > The whole cluster seems waiting for something... but I don't
> > > > see
> > > > what.
> > > > 
> > > Is it just one specific OSD (or a set of them) or is that all
> > > over
> > > the
> > > place?
> > 
> > A set of them. When I increase the weight of all 4 OSDs of a node,
> > I
> > frequently have blocked IO from 1 OSD of this node.
> > 
> > 
> > 
> > > Does restarting the OSD fix things?
> > 
> > Yes. For several minutes.
> > 
> > 
> > > Christian
> > > > 
> > > > Le vendredi 18 septembre 2015 à 02:35 +0200, Olivier Bonvalet a
> > > > écrit :
> > > > > Hi,
> > > > > 
> > > > > I have a cluster with lot of blocked operations each time I
> > > > > try
> > > > > to
> > > > > move
> > > > > data (by reweighting a little an OSD).
> > > > > 
> > > > > It's a full SSD cluster, with 10GbE network.
> > > > > 
> > > > > In logs, when I have blocked OSD, on the main OSD I can see
> > > > > that
> > > > > :
> > > > > 2015-09-18 01:55:16.981396 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > > requests, 1 included below; oldest blocked for > 33.976680
> > > > > secs
> > > > > 2015-09-18 01:55:16.981402 7f89e8cb8700  0 log [WRN] : slow
> > > > > request
> > > > > 30.125556 seconds old, received at 2015-09-18
> > > > > 01:54:46.855821:
> > > > > osd_op(client.29760717.1:18680817544
> > > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > > 6.c11916a4
> > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > > currently
> > > > > reached pg
> > > > > 2015-09-18 01:55:46.986319 7f89e8cb8700  0 log [WRN] : 2 slow
> > > > > requests, 1 included below; oldest blocked for > 63.981596
> > > > > secs
> > > > > 2015-09-18 01:55:46.986324 7f89e8cb8700  0 log [WRN] : slow
> > > > > request
> > > > > 60.130472 seconds old, received at 2015-09-18
> > > > > 01:54:46.855821:
> > > > > osd_op(client.29760717.1:18680817544
> > > > > rb.0.1c16005.238e1f29.027f [write 180224~16384]
> > > > > 6.c11916a4
> > > > > snapc 11065=[11065,10fe7,10f69] ondisk+write e845819) v4
> > > > > currently
> > > > > reached pg
> > > > > 
> > > > > How should I read that ? What this OSD is waiting for ?
> > > > > 
> > > > > Thanks for any help,
> > > > > 
> > > > > Olivier
> > > > > ___
> > > > > ceph-users mailing list
> > > > > 

Re: [ceph-users] Delete pool with cache tier

2015-09-18 Thread John Spray
On Fri, Sep 18, 2015 at 7:04 PM, Robert LeBlanc  wrote:
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Is there a way to delete a pool with a cache tier without first
> evicting the cache tier and removing it (ceph 0.94.3)?
>
> Something like:
>
> ceph osd pool delete   --delete-cache-tier --yes-i-really-really-mean-it

Not as far as I know, but it seems like a pretty reasonable feature
request to me.

John
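
For reference, the long way round today is roughly the documented sequence below
(pool names are placeholders; with thousands of clones the flush/evict step is
exactly the slow part this thread is trying to avoid):

   ceph osd tier cache-mode hot-pool forward   # stop admitting new objects into the cache
   rados -p hot-pool cache-flush-evict-all     # flush/evict everything back to the base pool
   ceph osd tier remove-overlay cold-pool
   ceph osd tier remove cold-pool hot-pool
   ceph osd pool delete hot-pool hot-pool --yes-i-really-really-mean-it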

>
> Evicting the cache tier has taken over 24 hours and I just want to
> trash it and recreate it. It takes too long to delete all the RBDs in
> the pool as well so it seems that I'm stuck. Trying to delete the pool
> or the cache error with saying that it is participating in a cache
> tier.
>
> There are not many RBDs, but one has 10,000 clones to test a high
> number of clones.
>
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.1.0
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJV/FISCRDmVDuy+mK58QAAsgsP/1WSFOHSkcVy6O582ECf
> 9JN6/ZyvcH5HMSsB1cE0bEziDEsgoYEGdTSgB+GytABb+0BRip9SjH7kDpcs
> 5/1ZvAv9dgnNPbwBb9zOWp6p27fDwQ5vdmy4UnT7iyPbNFz2Cyp09g8mwEpe
> nAu2mDorwOGtzPv4z7LUnevHigyhWIZpGJhw7hJGuKVNxRXwAbSXQqk7YzKI
> xg6Ccs/eDpTiSygD+PeAfC7uizjqrw28lBaEmlgaK5OfFvrhXWSNZr6AuQap
> wEv4QG+0NsUzptOMSTIlhBHS/0HZb87vscckBlaitfWzOiv73PuKQRpwVFRD
> IkcIZx83I2pGQ4ir7LkjmmyQgW1cY5FcHrC6V70kaqpjgT5tuEIpwtlfJoUc
> qpt1fKeEGF4sutLWpQCgPyY/bQC144PZyMYWjNS17GbzkoR/AwaZ8cBiK2+P
> 1Htc9ntk4hwtPOWXa6kYfRASLbC+nqTBFWTq5hcDkD/5ViTKhMpW+ldInaSt
> 8g8jukLWQ0ZjFET0MfYogYWEAdsER4dhk3bawfA/0dKiAyE5UPN5L9CvZ9kp
> JPVqmSPKhXU68xGQXE7Ugx9BpWZXiRyO8hBOzoivrcvbZOLUhklQW61Qmq0A
> EKWyeXy7P9cE3OONelWmgUXxVPuZT2ZAaoM+KjwJKdT1Jt7VtNVjqgS9NvwZ
> alzR
> =y/7e
> -END PGP SIGNATURE-
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread Jan Schermer
Hi,
> On 18 Sep 2015, at 17:06, Martin Palma  wrote:
> 
> Hi,
> 
> Is it a good idea to use a software raid for the system disk (Operating 
> System) on a Ceph storage node? I mean only for the OS not for the OSD disks.
> 

Yes, absolutely. Or even a hardware RAID if that's what you use elsewhere.

> And what about a swap partition? Is that needed?
> 

If it's only for the OSDs then no, you don't need swap; all the OSDs need to 
fit in memory to perform well, and the OS daemons are tiny in comparison. It saves 
some hassle tuning swappiness and such, too.

Jan


> Best,
> Martin
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Using cephfs with hadoop

2015-09-18 Thread Gregory Farnum
On Thu, Sep 17, 2015 at 7:48 PM, Fulin Sun  wrote:
> Hi, guys
>
> I am wondering if I am able to deploy ceph and hadoop into different cluster
> nodes and I can
>
> still use cephfs as the backend for hadoop access.
>
> For example, ceph in cluster 1 and hadoop in cluster 2, while cluster 1 and
> cluster 2 can be mutally accessed.
>
> If so, what would be done for the hadoop cluster ? Do I need to setup cephfs
> client in every hadoop node ?
>
> Or I just need to configure core-site.xml and put ceph-hadoop.jar into
> hadoop classpath ?

It should be basically this. We're still working on nice docs for
Hadoop-on-Ceph but the ceph-hadoop jar handles configuring libcephfs
and connecting to the Ceph cluster on its own, so having a separate
mount on the machines is totally pointless.
-Greg
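
A minimal core-site.xml sketch for that setup (the monitor address, auth id and
paths are placeholders; property names follow the CephFS Hadoop plugin docs):

   <property>
     <name>fs.ceph.impl</name>
     <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
   </property>
   <property>
     <name>fs.default.name</name>
     <value>ceph://mon-host:6789/</value>
   </property>
   <property>
     <name>ceph.conf.file</name>
     <value>/etc/ceph/ceph.conf</value>
   </property>
   <property>
     <name>ceph.auth.id</name>
     <value>admin</value>
   </property>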

>
> Really need your kind advice and experience.
>
> Best,
> Sun.
>
> 
> 
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] lttng duplicate registration problem when using librados2 and libradosstriper

2015-09-18 Thread Paul Mansfield
Hello,
thanks for your attention.

I have started using the rados striper library, calling its functions from a
C program.

As soon as I add libradosstriper to the linking process, I get this
error when the program runs, even though I am not calling any functions
from the rados striper library (I commented them out).

LTTng-UST: Error (-17) while registering tracepoint probe. Duplicate
registration of tracepoint probes having the same name is not allowed.
/bin/sh: line 1: 61001 Aborted (core dumped) ./$test


I had been using lttng in my program but removed it to ensure it wasn't
causing the problem.

I have tried running the program using gdb but the calls to initialise
lttng occur before main() is called and so I cannot add a break point to
see what is happening.


thanks
Paul
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Ken Dreyer
On Fri, Sep 18, 2015 at 9:28 AM, Sage Weil  wrote:
> On Fri, 18 Sep 2015, Alfredo Deza wrote:
>> The new locations are in:
>>
>>
>> http://packages.ceph.com/
>>
>> For debian this would be:
>>
>> http://packages.ceph.com/debian-{release}
>
> Make that download.ceph.com .. the packages url was temporary while we got
> the new site ready and will go away shortly!
>
> (Also, HTTPS is enabled now.)


To avoid confusion here, I've deleted packages.ceph.com from DNS
today, and the change will propagate soon.

Please use download.ceph.com (it's the same IP address and server,
173.236.248.54)
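
For a Debian/Ubuntu host the switch then amounts to something like this (release
and distro names are examples; the key URL is the one from the security notice
mentioned earlier in the thread):

   # /etc/apt/sources.list.d/ceph.list
   deb https://download.ceph.com/debian-hammer/ wheezy main

   wget -qO- https://download.ceph.com/keys/release.asc | apt-key add -
   apt-get update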

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Software Raid 1 for system disks on storage nodes (not for OSD disks)

2015-09-18 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Depends on how easy it is to rebuild an OS from scratch. If you have
something like Puppet or Chef that configure a node completely for
you, it may not be too much of a pain to forgo the RAID. We run our
OSD nodes from a single SATADOM and use Puppet for configuration. We
also don't use swap (not very effective on SATADOM), but have enough
RAM that we feel comfortable enough with that decision.

If you use ceph-disk or ceph-deploy to configure the OSDs, then they
should automatically come back up when you lay down the new OS and set
up the necessary ceph config items (ceph.conf and the OSD bootstrap
keys).
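
A minimal sketch of what that looks like in practice (stock paths assumed; adjust
for your deployment):

   # after the OS reinstall, restore the cluster config and the OSD bootstrap key:
   #   /etc/ceph/ceph.conf
   #   /var/lib/ceph/bootstrap-osd/ceph.keyring
   # then let ceph-disk pick up the GPT-tagged OSD disks again:
   ceph-disk activate-all
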
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Fri, Sep 18, 2015 at 9:06 AM, Martin Palma  wrote:
> Hi,
>
> Is it a good idea to use a software raid for the system disk (Operating
> System) on a Ceph storage node? I mean only for the OS not for the OSD
> disks.
>
> And what about a swap partition? Is that needed?
>
> Best,
> Martin
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.1.0
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV/C7UCRDmVDuy+mK58QAAoTMQAMZBv4/lphmntC23b9/l
JWUPjZfbXUtNgnfMvWcVyTSXsTtM5mY/4/iSZ4ZfCQ4YyqWWMpSlocHONHFz
nFTtGupqV3vPCo4X8bl58/iv4J0H2iWUr2klk7jtTj+e+JjyWDo25l8V2ofP
edt5g7qcMAwiWYrrpjxQBK4AFNiPJKSMxrzK1Mgic15nwX0OJu0DDNS5twzZ
s8Y+UfS80+hZvyBTUGhsO8pkYoJQvYRGgyqYtCdxA+m1T8lWVe8SC0eLWOXy
xoyGR7dqcvEXQadrqfmU618eNpNEECPoHeIkeCqpTohrUVsyRcfSGAtfM0YY
Ixf2SCaDMAaRwvXGJUf5OP/3HHWps0m4YyLBOddPZ5XZb1utZiclh26KuOyw
QdGkP7uoYEMO0v40dcsIbOVhtgTdX+HrpEGuqEtNEGe194sS1nluw+49aLxe
eozHSRGq3GmRm/q3bR5f2p+WXwKqmdDRFhqII8H11bb5F7etU2PBo1JA2bTW
hUFqu6+ST8eI34OeC7LbC9Txfw/iUhL62kiCm+gj8Rg+m+TZ7a1HEaVc8uyq
Jw1+5hIgyTWFvKdIiW65k++8w9my6kUIsY8RT8p08DTSPzxuwGtHr7UJJ629
K/tlpGdQTRf7PXgmea6sSodnmaF5HRIUdU0nhQpRRxjX/V+PENI8Qq45KyfX
BovV
=Gzvl
-END PGP SIGNATURE-
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] which SSD / experiences with Samsung 843T vs. Intel s3700

2015-09-18 Thread Quentin Hartman
No, they are dead dead dead. Can't get anything off of them. If you look
back further on this thread I think the most noteworthy part of this whole
experience is just how far off my write estimates were. The ones that have
not died have somewhere between 24 and 32 TB written to them after 9 months
in service. This is almost 4x what I thought they would get.

QH

On Fri, Sep 18, 2015 at 1:48 AM, Jan Schermer  wrote:

> "850 PRO" is a workstation drive. You shouldn't put it in the server...
> But it should not just die either way, so don't tell them you use it for
> Ceph next time.
>
> Do the drives work when replugged? Can you get anything from SMART?
>
> Jan
>
>
> On 18 Sep 2015, at 02:57, James (Fei) Liu-SSI 
> wrote:
>
> Hi Quentin,
> Samsung has many different types of SSD for different types of workload, with
> different SSD media like SLC, MLC, TLC, 3D NAND, etc. They were designed for
> different workloads and different purposes. Thanks for your understanding
> and support.
>
> Regards,
> James
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com
> ] *On Behalf Of *Quentin Hartman
> *Sent:* Thursday, September 17, 2015 4:05 PM
> *To:* Andrija Panic
> *Cc:* ceph-users@lists.ceph.com
> *Subject:* Re: [ceph-users] which SSD / experiences with Samsung 843T vs.
> Intel s3700
>
> I ended up having 7 total die. 5 while in service, 2 more when I hooked
> them up to a test machine to collect information from them. To Samsung's
> credit, they've been great to deal with and are replacing the failed
> drives, on the condition that I don't use them for ceph again. Apparently
> they sent some of my failed drives to an engineer in Korea and they did a
> failure analysis on them and came to the conclusion they were put to an
> "unintended use". I have seven left I'm not sure what to do with.
>
> I've honestly always really liked Samsung, and I'm disappointed that I
> wasn't able to find anyone with their DC-class drives actually in stock so
> I ended up switching to the Intel S3700s. My users will be happy to have
> some SSDs to put in their workstations though!
>
> QH
>
> On Thu, Sep 17, 2015 at 4:49 PM, Andrija Panic 
> wrote:
> Another one bites the dust...
>
> This is Samsung 850 PRO 256GB... (6 journals on this SSDs just died...)
>
> [root@cs23 ~]# smartctl -a /dev/sda
> smartctl 5.43 2012-06-30 r3573 [x86_64-linux-3.10.66-1.el6.elrepo.x86_64]
> (local build)
> Copyright (C) 2002-12 by Bruce Allen, http://smartmontools.sourceforge.net
>
> Vendor:   /1:0:0:0
> Product:
> User Capacity:600,332,565,813,390,450 bytes [600 PB]
> Logical block size:   774843950 bytes
> >> Terminate command early due to bad response to IEC mode page
> A mandatory SMART command failed: exiting. To continue, add one or more
> '-T permissive' options
>
> On 8 September 2015 at 18:01, Quentin Hartman <
> qhart...@direwolfdigital.com> wrote:
>
> On Tue, Sep 8, 2015 at 9:05 AM, Mark Nelson  wrote:
>
> A list of hardware that is known to work well would be incredibly
> valuable to people getting started. It doesn't have to be exhaustive,
> nor does it have to provide all the guidance someone could want. A
> simple "these things have worked for others" would be sufficient. If
> nothing else, it will help people justify more expensive gear when their
> approval people say "X seems just as good and is cheaper, why can't we
> get that?".
>
>
> So I have my opinions on different drives, but I think we do need to be
> really careful not to appear to endorse or pick on specific vendors. The
> more we can stick to high-level statements like:
>
> - Drives should have high write endurance
> - Drives should perform well with O_DSYNC writes
> - Drives should support power loss protection for data in motion
>
> The better I think.  Once those are established, I think it's reasonable
> to point out that certain drives meet (or do not meet) those criteria and
> get feedback from the community as to whether or not vendor's marketing
> actually reflects reality.  It'd also be really nice to see more
> information available like the actual hardware (capacitors, flash cells,
> etc) used in the drives.  I've had to show photos of the innards of
> specific drives to vendors to get them to give me accurate information
> regarding certain drive capabilities.  Having a database of such things
> available to the community would be really helpful.
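
As a rough illustration of the O_DSYNC criterion above, a commonly used sanity
check is a small synchronous-write fio job run against the candidate journal
device. This is only a sketch: the device name is a placeholder, and fio's sync
flag uses O_SYNC rather than O_DSYNC, which is close enough for a first pass.

   # dsync-journal-test.fio -- hypothetical job file; /dev/sdX is a placeholder
   # and this workload is destructive to data on that device.
   [journal-write-test]
   filename=/dev/sdX
   ioengine=psync
   direct=1
   sync=1
   rw=write
   bs=4k
   numjobs=1
   runtime=60
   time_based

Drives with proper power loss protection generally sustain far higher IOPS on
this kind of small synchronous-write workload than consumer drives do.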
>
> That's probably a very good approach. I think it would be pretty simple to
> avoid the appearance of endorsement if the data is presented correctly.
>
>
>
> To that point, though, I think perhaps something more important than a
> list of known "good" hardware would be a list of known "bad" hardware,
>
>
> I'm rather hesitant to do this unless it's been specifically confirmed by
> the vendor.  It's too easy to point fingers (see the recent kernel trim bug
> situation).
>
>
> I 

Re: [ceph-users] debian repositories path change?

2015-09-18 Thread Sage Weil
On Fri, 18 Sep 2015, Alfredo Deza wrote:
> The new locations are in:
> 
> 
> http://packages.ceph.com/
> 
> For debian this would be:
> 
> http://packages.ceph.com/debian-{release}

Make that download.ceph.com ... the packages URL was temporary while we got 
the new site ready and will go away shortly!

(Also, HTTPS is enabled now.)
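
Putting Alfredo's and Sage's notes together, an updated wheezy entry would look
something like the following (the hammer release name is only an example, and the
key re-import line assumes the standard release.asc location on the new host):

   # /etc/apt/sources.list.d/ceph.list -- hypothetical example for wheezy + hammer
   deb https://download.ceph.com/debian-hammer/ wheezy main

   # The signing key was rotated as part of the recent security incident; re-import it:
   wget -qO- https://download.ceph.com/keys/release.asc | sudo apt-key add -
   sudo apt-get update
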

sage

> 
> Note that ceph-extras is no longer available: the current repos should
> provide everything/anything that is needed to properly install
> ceph. Otherwise, please let us know.
> 
> On Fri, Sep 18, 2015 at 10:35 AM, Brian Kroth  wrote:
> > Hmm, apparently I haven't gotten that far in my email backlog yet.  That's
> > good to know too.
> >
> > Thanks,
> > Brian
> >
> > Olivier Bonvalet  2015-09-18 16:02:
> >
> >> Hi,
> >>
> >> not sure if it's related, but there are recent changes because of a
> >> security issue:
> >>
> >>
> >> http://ceph.com/releases/important-security-notice-regarding-signing-key-and-binary-downloads-of-ceph/
> >>
> >>
> >>
> >>
> >>> On Friday, 18 September 2015 at 08:45 -0500, Brian Kroth wrote:
> >>>
> >>> Hi all, we've had the following in our
> >>> /etc/apt/sources.list.d/ceph.list
> >>> for a while based on some previous docs,
> >>>
> >>> # ceph upstream stable (currently giant) release packages for wheezy:
> >>> deb http://ceph.com/debian/ wheezy main
> >>>
> >>> # ceph extras:
> >>> deb http://ceph.com/packages/ceph-extras/debian wheezy main
> >>>
> >>> but it seems like the straight "debian/" portion of that path has gone
> >>> missing recently, and now there's only debian-firefly/, debian-giant/,
> >>> debian-hammer/, etc.
> >>>
> >>> Is that just an oversight, or should we be switching our sources to one
> >>> of the named releases?  I figured that the unnamed one would
> >>> automatically track what ceph currently considered "stable" for the
> >>> target distro release for me, but maybe that's not the case.
> >>>
> >>> Thanks,
> >>> Brian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com