Re: [ceph-users] question on harvesting freed space

2014-04-16 Thread Wido den Hollander

On 04/17/2014 02:39 AM, Somnath Roy wrote:

It seems discard support for kernel RBD is targeted for v0.80.

http://tracker.ceph.com/issues/190



True, but it will obviously take time before this hits the upstream 
kernels and goes into distributions.


For RHEL 7 the krbd module from the Ceph extra repo might work. For Ubuntu it 
is a matter of waiting for newer kernels to be backported to the LTS releases.


Wido


Thanks & Regards
Somnath

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: Wednesday, April 16, 2014 5:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] question on harvesting freed space


On Wed, 16 Apr 2014 13:12:15 -0500 John-Paul Robinson wrote:


So having learned some about fstrim, I ran it on an SSD backed file
system and it reported space freed. I ran it on an RBD backed file
system and was told it's not implemented.

This is consistent with the test for FITRIM.

$ cat /sys/block/rbd3/queue/discard_max_bytes
0


This looks like you're using the kernelspace RBD interface.

And very sadly, trim/discard is not implemented in it, which is a bummer for 
anybody running for example a HA NFS server with RBD as the backing storage. 
Even sadder is the fact that this was last brought up a year or even longer ago.

Only the userspace (librbd) interface supports this; however, the client (KVM as 
the prime example) of course needs to use a pseudo disk interface that ALSO 
supports it. The standard virtio-block does not, while the very slow IDE emulation 
does, as does the speedier virtio-scsi (which, however, isn't configurable with 
Ganeti, for example).

Regards,

Christian


On my SSD backed device I get:

$ cat /sys/block/sda/queue/discard_max_bytes
2147450880

Is this just not needed by RBD or is cleanup handled in a different way?

I'm wondering what will happen to a thin-provisioned RBD image
over time on a file system with lots of file create/delete activity.
Will the storage in the ceph pool stay allocated to this application
(the file system) in that case?

Thanks for any additional insights.

~jpr

On 04/15/2014 04:16 PM, John-Paul Robinson wrote:

Thanks for the insight.

Based on that I found the fstrim command for xfs file systems.

http://xfs.org/index.php/FITRIM/discard

Anyone had experience using this command with RBD image backends?

~jpr

On 04/15/2014 02:00 PM, Kyle Bader wrote:

I'm assuming Ceph/RBD doesn't have any direct awareness of this
since the file system doesn't traditionally have a "give back blocks"
operation to the block device.  Is there anything special RBD does
in this case that communicates the release of the Ceph storage
back to the pool?

VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing
TRIM.

http://wiki.qemu.org/Features/QED/Trim






--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/










--
Wido den Hollander
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
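
As an aside to the thread above, here is a minimal sketch of how one might
check and exercise discard on an RBD-backed filesystem and estimate how much
of a thin-provisioned image is actually allocated. Device, mount point and
pool/image names are placeholders, and the rbd diff trick only approximates
the allocated extents:

# 0 means the device does not advertise discard/TRIM support
$ cat /sys/block/rbd3/queue/discard_max_bytes
0

# where discard is supported, either mount with -o discard or trim periodically
$ sudo mount -o defaults,discard /dev/rbd3 /mnt/data
$ sudo fstrim -v /mnt/data

# rough estimate of how much of a thin-provisioned image is really allocated
$ rbd diff rbd/myimage | awk '{ sum += $2 } END { print sum/1024/1024 " MB allocated" }'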


[ceph-users] Does anyone success deploy FEDERATED GATEWAYS

2014-04-16 Thread maoqi1982
Hi list,

I followed http://ceph.com/docs/master/radosgw/federated-config/ to test the
multi-geography function, but it failed. Has anyone successfully deployed
federated gateways? Is this function working in Ceph or not? If anyone has
deployed it successfully, please give me some help.

Thanks.


Re: [ceph-users] force_create_pg not working

2014-04-16 Thread Gregory Farnum
Do you have any logging running on those OSDs? I'm going to need to
get somebody else to look at this, but if we could check the probe
messages being sent that might be helpful.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Tue, Apr 15, 2014 at 4:36 PM, Craig Lewis  wrote:
> http://pastebin.com/ti1VYqfr
>
> I assume the problem is at the very end:
>   "probing_osds": [
> 0,
> 2,
> 3,
> 4,
> 11,
> 13],
>   "down_osds_we_would_probe": [],
>   "peering_blocked_by": []},
>
>
> OSDs 3, 4, and 11 have been UP and IN for hours.  OSDs 0, 2, and 13 have
> been UP and IN since the problems started, but they never complete probing.
>
>
>
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> Central Desktop. Work together in ways you never thought possible.
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog
>
> On 4/15/14 16:07 , Gregory Farnum wrote:
>
> What are the results of "ceph osd pg 11.483 query"?
> -Greg
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
> On Tue, Apr 15, 2014 at 4:01 PM, Craig Lewis 
> wrote:
>
> I have 1 incomplete PG.  The data is gone, but I can upload it again.  I
> just need to make the cluster start working so I can upload it.
>
> I've read a bunch of mailing list posts, and found ceph pg force_create_pg.
> Except, it doesn't work.
>
> I run:
> root@ceph1c:/var/lib/ceph/osd# ceph pg force_create_pg 11.483
> pg 11.483 now creating, ok
>
> The incomplete PG switches to creating.  It sits in creating for a while,
> then flips back to incomplete:
> 2014-04-15 15:06:11.876535 mon.0 [INF] pgmap v5719605: 2592 pgs: 2586
> active+clean, 1 incomplete, 5 active+clean+scrubbing+deep; 15086 GB data,
> 27736 GB used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:13.899681 mon.0 [INF] pgmap v5719606: 2592 pgs: 1 creating,
> 2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
> used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:14.965676 mon.0 [INF] pgmap v5719607: 2592 pgs: 1 creating,
> 2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
> used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:15.995570 mon.0 [INF] pgmap v5719608: 2592 pgs: 1 creating,
> 2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
> used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:17.019972 mon.0 [INF] pgmap v5719609: 2592 pgs: 1 creating,
> 2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
> used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:18.048487 mon.0 [INF] pgmap v5719610: 2592 pgs: 1 creating,
> 2586 active+clean, 5 active+clean+scrubbing+deep; 15086 GB data, 27736 GB
> used, 28127 GB / 55864 GB avail
> 2014-04-15 15:06:19.093757 mon.0 [INF] pgmap v5719611: 2592 pgs: 2586
> active+clean, 1 incomplete, 5 active+clean+scrubbing+deep; 15086 GB data,
> 27736 GB used, 28127 GB / 55864 GB avail
>
> I'm on:
> root@ceph0c:~# ceph -v
> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>
> root@ceph0c:~# uname -a
> Linux ceph0c 3.5.0-46-generic #70~precise1-Ubuntu SMP Thu Jan 9 23:55:12 UTC
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
> root@ceph0c:~# cat /etc/lsb-release
> DISTRIB_ID=Ubuntu
> DISTRIB_RELEASE=12.04
> DISTRIB_CODENAME=precise
> DISTRIB_DESCRIPTION="Ubuntu 12.04.4 LTS"
>
> --
>
> Craig Lewis
> Senior Systems Engineer
> Office +1.714.602.1309
> Email cle...@centraldesktop.com
>
> Central Desktop. Work together in ways you never thought possible.
> Connect with us   Website  |  Twitter  |  Facebook  |  LinkedIn  |  Blog
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
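
A hedged sketch of the kind of OSD logging and peering information being asked
for above -- the PG id and OSD numbers follow the thread, and the debug levels
should be reverted afterwards because they are verbose:

# raise debug logging on the OSDs involved in pg 11.483
$ for osd in 0 2 3 4 11 13; do ceph tell osd.$osd injectargs '--debug-osd 20 --debug-ms 1'; done

# capture the peering/probing state of the stuck PG
$ ceph pg 11.483 query > /tmp/pg-11.483-query.json

# broader context on anything else that is stuck
$ ceph pg dump_stuck inactive
$ ceph pg dump_stuck unclean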


Re: [ceph-users] Russian-speaking community CephRussian!

2014-04-16 Thread Ирек Фасихов
Loic, thanks for the link!


2014-04-16 18:46 GMT+04:00 Loic Dachary :

> Hi Ирек,
>
> If you organize meetups, feel free to add yourself to
> https://wiki.ceph.com/Community/Meetups :-)
>
> Cheers
>
> On 16/04/2014 13:22, Ирек Фасихов wrote:
> > Hi,All.
> >
> > I created the Russian-speaking community CephRussian in Google+
> > <https://plus.google.com/communities/104570726102090628516>! Welcome!
> >
> > URL: https://plus.google.com/communities/104570726102090628516
> >
> > --
> > Best regards, Фасихов Ирек Нургаязович
> > Mobile: +79229045757
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> --
> Loïc Dachary, Artisan Logiciel Libre
>
>


-- 
Best regards, Фасихов Ирек Нургаязович
Mobile: +79229045757


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Blair Bethwaite
Hi Kyle,

Thanks for the response. Further comments/queries...

> Message: 42
> Date: Wed, 16 Apr 2014 06:53:41 -0700
> From: Kyle Bader 
> Cc: ceph-users 
> Subject: Re: [ceph-users] SSDs: cache pool/tier versus node-local
> block cache
>
> >> Obviously the ssds could be used as journal devices, but I'm not really
> >> convinced whether this is worthwhile when all nodes have 1GB of hardware
> >> writeback cache (writes to journal and data areas on the same spindle have
> >> time to coalesce in the cache and minimise seek time hurt). Any advice on
> >> this?
>
> All writes need to be written to the journal before being written to
> the data volume, so it's going to impact your overall throughput and
> cause seeking; a hardware cache will only help with the latter (unless
> you use btrfs).

Right, good point. So back of envelope calculations for throughput
scenarios based on our hardware, just saying 150MB/s r/w for the spindles
and 450/350MB/s r/w for the ssds, and pretending no controller bottlenecks
etc:

1 OSD node (without ssd journals, hence divide by 2):
9 * 150 / 2 = 675MB/s write throughput

1 OSD node (with ssd journals):
min(9 * 150, 3 * 350) = 1050MB/s write throughput

Aggregates for 12 OSD nodes: ~8GB/s versus ~12.5GB/s

So the general naive case seems like a no-brainer, we should use SSD
journals. But then we don't require even 8GB/s most of the time...
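
For reference, a sketch of how the SSD journals in this comparison might be
set up -- the device names are hypothetical, and ceph-disk creates the journal
partition on the SSD by itself:

# data on a spindle, journal on the shared SSD
$ sudo ceph-disk prepare --fs-type xfs /dev/sdb /dev/sdk

# or point an OSD at an explicit journal path in ceph.conf:
# [osd.12]
#     osd journal = /dev/disk/by-partlabel/journal-12
#     osd journal size = 10240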

> >> I think the timing should work that we'll be deploying with Firefly and so
> >> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
> >> versus Tier to act as node-local block cache device. Does anybody have real
> >> or anecdotal evidence about which approach has better performance?
> > New idea that is dependent on failure behaviour of the cache tier...
>
> The problem with this type of configuration is it ties a VM to a
> specific hypervisor, in theory it should be faster because you don't
> have network latency from round trips to the cache tier, resulting in
> higher iops. Large sequential workloads may achieve higher throughput
> by parallelizing across many OSDs in a cache tier, whereas local flash
> would be limited to single device throughput.

Ah, I was ambiguous. When I said node-local I meant OSD-local. So I'm
really looking at:
2-copy write-back object ssd cache-pool
versus
OSD write-back ssd block-cache
versus
1-copy write-around object cache-pool & ssd journal

> > Carve the ssds 4-ways: each with 3 partitions for journals servicing the
> > backing data pool and a fourth larger partition serving a write-around cache
> > tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
> > capacity is not halved by replication for availability.
> >
> > ...The crux is how the current implementation behaves in the face of cache
> > tier OSD failures?
>
> Cache tiers are durable by way of replication or erasure coding, OSDs
> will remap degraded placement groups and backfill as appropriate. With
> single replica cache pools loss of OSDs becomes a real concern, in the
> case of RBD this means losing arbitrary chunk(s) of your block devices
> - bad news. If you want host independence, durability and speed your
> best bet is a replicated cache pool (2-3x).

This is undoubtedly true for a write-back cache-tier. But in the scenario
I'm suggesting, a write-around cache, that needn't be bad news - if a
cache-tier OSD is lost the cache simply just got smaller and some cached
objects were unceremoniously flushed. The next read on those objects should
just miss and bring them into the now smaller cache.

The thing I'm trying to avoid with the above is double read-caching of
objects (so as to get more aggregate read cache). I assume the standard
wisdom with write-back cache-tiering is that the backing data pool
shouldn't bother with ssd journals?

--
Cheers,
~Blairo
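
For anyone following along, a sketch of the Firefly-era commands for putting a
cache pool in front of a backing pool. Pool names, PG counts and the CRUSH
ruleset id are placeholders, and note that the stock cache modes are writeback
and readonly -- the single-copy write-around variant discussed above is not an
out-of-the-box mode:

$ ceph osd pool create ssd-cache 512 512
$ ceph osd pool set ssd-cache crush_ruleset 1    # SSD-only CRUSH rule (placeholder id)
$ ceph osd pool set ssd-cache size 2

$ ceph osd tier add rbd ssd-cache
$ ceph osd tier cache-mode ssd-cache writeback
$ ceph osd tier set-overlay rbd ssd-cache

$ ceph osd pool set ssd-cache hit_set_type bloom
$ ceph osd pool set ssd-cache target_max_bytes 214748364800    # ~200 GB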


Re: [ceph-users] multi-mds and directory sharding

2014-04-16 Thread Qing Zheng
>It seems that you mount cephfs on the same nodes that run MDS or OSD.
That can cause deadlock
(http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6648).
>Please try using a separate node for the cephfs mount.
>
>Regards
>Yan, Zheng



Thanks, Zheng.

Got it

-- Qing Zheng




Re: [ceph-users] multi-mds and directory sharding

2014-04-16 Thread Yan, Zheng
On Thu, Apr 17, 2014 at 8:54 AM, Qing Zheng  wrote:
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: Wednesday, April 16, 2014 7:44 PM
> To: Qing Zheng
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] multi-mds and directory sharding
>
> On Thu, Apr 17, 2014 at 6:21 AM, Qing Zheng  wrote:
> It seems that with kernel 3.14 and the latest source code from github, we
> still run into troubles when testing multi-mds and directory sharding.
>>
>
> what's the problem you encountered?
>
> Hi Zheng,
>
> We are using mdtest to simulate workloads where multiple parallel client
> procs will keep inserting empty files into a single newly created directory.
> We are expecting CephFS to balance its metadata servers, and eventually all
> metadata servers will get a share of the directory.
>
> We found that CephFS was only able to run for the first 5-10 mins under
> such a workload, and then stopped making progress -- the clients' "NewFile"
> calls would no longer return. From the client point of view, it was as if
> the server was no longer processing any requests.
>
> Our test deployment had 32 osds, 8 mds, and 1 mon.
> CephFS was kernel mounted. Clients were co-located with metadata servers.
>

It seems that you mount cephfs on the same nodes that run MDS or
OSD. That can cause deadlock
(http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/6648).
Please try using a separate node for the cephfs mount.

Regards
Yan, Zheng



> Cheers,
>
> -- Qing Zheng
>
>
>
>
>
>> Are there any limits either in the max number of active metadata
>> servers that we could possibly run or in the number of directory
>> entries that we could set for Ceph to trigger a directory split?
>>
>> Is it okay to run 128 or more active metadata servers, for example?
>> Is it okay to let Ceph split directories once a directory has
>> accumulated
>> 200 entries?
>>
>> Cheers,
>>
>> -- Qing Zheng
>>
>> -Original Message-
>> From: Yan, Zheng [mailto:uker...@gmail.com]
>> Sent: Sunday, April 13, 2014 6:43 PM
>> To: Qing Zheng
>> Cc: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] multi-mds and directory sharding
>>
>> On Mon, Apr 14, 2014 at 2:54 AM, Qing Zheng  wrote:
>>> Hi -
>>>
>>> We are currently evaluating CephFS's metadata scalability and
>>> performance. One important feature of CephFS is its support for
>>> running multiple "active" mds instances and partitioning huge
>>> directories into small shards.
>>>
>>> We use mdtest to simulate workloads where multiple parallel client
>>> processes will keep inserting empty files into several large directories.
> >> We found that CephFS is only able to run for the first 5-10 mins, and
> >> then stops making progress -- the clients' "creat" calls no longer return.
>>>
>>> We were using Ceph 0.72 and Ubuntu 12.10 with kernel 3.6.6.
>>> Our setup consisted of 8 osds, 3 mds, and 1 mon. All mds were active,
>>> instead of standby, and they were all configured to split directories
>>> once the directory size is greater than 2k. We kernel (not fuse)
>>> mounted CephFS on all 8 osd nodes.
>>
>> The 3.6 kernel is too old for cephfs. Please use a kernel compiled from the
>> testing branch https://github.com/ceph/ceph-client and the newest
>> development version of Ceph. There are a large number of fixes for
>> directory fragmentation and multi-mds.
>>
>> Regards
>> Yan, Zheng
>>
>>>
>>> To test CephFS, we launched 64 client processes on 8 osd nodes (8
>>> procs per osd). Each client would create 1 directory and then insert
>>> 5k empty files into that directory. In total 64 directories and 320k
>>> files would be created. CephFS gave an avg throughput of 300~1k for
>>> the first 5 minutes, and then stopped making any progress.
>>>
>>> What might go wrong?
>>>
>>> If each client insert 200 files, instead of 5k, then CephFS could
>>> finish the workload with 1.5K ops/s. If each client insert 1k files,
>>> then ~500 ops/s If 2k files (the split threshold), then ~400 ops/s
>>>
>>> Are these numbers reasonable?
>>>
>>> -- Qing
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
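
For anyone reproducing this test, a sketch of the knobs being discussed -- a
low directory split threshold, directory fragmentation, several active MDS
ranks, and a kernel mount from a node that is not an OSD/MDS host. The values
mirror the thread; the option names are the mds_bal_* family and the defaults
differ:

# on the MDS nodes
$ cat >> /etc/ceph/ceph.conf <<'EOF'
[mds]
    mds bal frag = true
    mds bal split size = 2000
EOF

# raise the number of active MDS ranks (8 in the test deployment)
$ ceph mds set_max_mds 8

# kernel-mount CephFS from a separate client node
$ sudo mount -t ceph mon-host:6789:/ /mnt/cephfs -o name=admin,secretfile=/etc/ceph/admin.secret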


Re: [ceph-users] RBD write access patterns and atime

2014-04-16 Thread Somnath Roy
I think it still makes sense to try 'noatime'. Here is the reason.

'relatime' requires a write for the first read after a write, while 'atime'
requires a write for every read. With 'noatime', no read triggers a write at all.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: Wednesday, April 16, 2014 5:43 PM
To: Dan van der Ster
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RBD write access patterns and atime

On Wed, 16 Apr 2014 17:08:09 +0200 Dan van der Ster wrote:

> Dear ceph-users,
>
> I've recently started looking through our FileStore logs to better
> understand the VM/RBD IO patterns, and noticed something interesting.
> Here is a snapshot of the write lengths for one OSD server (with 24
> OSDs) -- I've listed the top 10 write lengths ordered by number of
> writes in one day:
>
> Writes per length:
> 4096: 2011442
> 8192: 438259
> 4194304: 207293
> 12288: 175848
> 16384: 148274
> 20480: 69050
> 24576: 58961
> 32768: 54771
> 28672: 43627
> 65536: 34208
> 49152: 31547
> 40960: 28075
>
> There were ~400 writes to that server on that day, so you see that
> ~50% of the writes were 4096 bytes, and then the distribution drops
> off sharply before a peak again at 4MB (the object size, i.e. the max
> write size). (For those interested, read lengths are below in the
> P.S.)
>
> I'm trying to understand that distribution, and the best explanation
> I've come up with is that these are ext4/xfs metadata updates,
> probably atime updates. Based on that theory, I'm going to test
> noatime on a few VMs and see if I notice a change in the distribution.
>
That strikes me as odd, as since kernel 2.6.30 the default option for mounts is 
relatime, which should have an effect quite close to that of a strict noatime.

Regards,

Christian
--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
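
A minimal sketch of checking and changing the atime behaviour inside a VM, for
anyone who wants to run the same comparison -- the mount point is just an
example:

# see whether the filesystem is currently mounted with relatime/noatime
$ grep ' / ' /proc/mounts

# switch to noatime without a reboot (add it to /etc/fstab to make it stick)
$ sudo mount -o remount,noatime /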








Re: [ceph-users] multi-mds and directory sharding

2014-04-16 Thread Qing Zheng
-Original Message-
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: Wednesday, April 16, 2014 7:44 PM
To: Qing Zheng
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] multi-mds and directory sharding

On Thu, Apr 17, 2014 at 6:21 AM, Qing Zheng  wrote:
> It seems that with kernel 3.14 and the latest source code from github, we
> still run into troubles when testing multi-mds and directory sharding.
>

what's the problem you encountered?

Hi Zheng,

We are using mdtest to simulate workloads where multiple parallel client
procs will keep inserting empty files into a single newly created directory.
We are expecting CephFS to balance its metadata servers, and eventually all
metadata servers will get a share of the directory.

We found that CephFS was only able to run for the first 5-10 mins under
such a workload, and then stopped making progress -- the clients' "NewFile"
calls would no longer return. From the client point of view, it was as if
the server was no longer processing any requests.

Our test deployment had 32 osds, 8 mds, and 1 mon.
CephFS was kernel mounted. Clients were co-located with metadata servers.

Cheers,

-- Qing Zheng





> Are there any limits either in the max number of active metadata 
> servers that we could possibly run or in the number of directory 
> entries that we could set for Ceph to trigger a directory split?
>
> Is it okay to run 128 or more active metadata servers, for example?
> Is it okay to let Ceph split directories once a directory has 
> accumulated
> 200 entries?
>
> Cheers,
>
> -- Qing Zheng
>
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: Sunday, April 13, 2014 6:43 PM
> To: Qing Zheng
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] multi-mds and directory sharding
>
> On Mon, Apr 14, 2014 at 2:54 AM, Qing Zheng  wrote:
>> Hi -
>>
>> We are currently evaluating CephFS's metadata scalability and 
>> performance. One important feature of CephFS is its support for 
>> running multiple "active" mds instances and partitioning huge 
>> directories into small shards.
>>
>> We use mdtest to simulate workloads where multiple parallel client 
>> processes will keep inserting empty files into several large directories.
> >> We found that CephFS is only able to run for the first 5-10 mins, and
> >> then stops making progress -- the clients' "creat" calls no longer return.
>>
>> We were using Ceph 0.72 and Ubuntu 12.10 with kernel 3.6.6.
>> Our setup consisted of 8 osds, 3 mds, and 1 mon. All mds were active, 
>> instead of standby, and they were all configured to split directories 
>> once the directory size is greater than 2k. We kernel (not fuse) 
>> mounted CephFS on all 8 osd nodes.
>
> The 3.6 kernel is too old for cephfs. Please use a kernel compiled from the
> testing branch https://github.com/ceph/ceph-client and the newest
> development version of Ceph. There are a large number of fixes for
> directory fragmentation and multi-mds.
>
> Regards
> Yan, Zheng
>
>>
>> To test CephFS, we launched 64 client processes on 8 osd nodes (8 
>> procs per osd). Each client would create 1 directory and then insert 
>> 5k empty files into that directory. In total 64 directories and 320k 
>> files would be created. CephFS gave an avg throughput of 300~1k for 
>> the first 5 minutes, and then stopped making any progress.
>>
>> What might go wrong?
>>
>> If each client insert 200 files, instead of 5k, then CephFS could 
>> finish the workload with 1.5K ops/s. If each client insert 1k files, 
>> then ~500 ops/s If 2k files (the split threshold), then ~400 ops/s
>>
>> Are these numbers reasonable?
>>
>> -- Qing
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



Re: [ceph-users] RBD write access patterns and atime

2014-04-16 Thread Christian Balzer
On Wed, 16 Apr 2014 17:08:09 +0200 Dan van der Ster wrote:

> Dear ceph-users,
> 
> I've recently started looking through our FileStore logs to better 
> understand the VM/RBD IO patterns, and noticed something interesting. 
> Here is a snapshot of the write lengths for one OSD server (with 24 
> OSDs) -- I've listed the top 10 write lengths ordered by number of 
> writes in one day:
> 
> Writes per length:
> 4096: 2011442
> 8192: 438259
> 4194304: 207293
> 12288: 175848
> 16384: 148274
> 20480: 69050
> 24576: 58961
> 32768: 54771
> 28672: 43627
> 65536: 34208
> 49152: 31547
> 40960: 28075
> 
> There were ~400 writes to that server on that day, so you see that 
> ~50% of the writes were 4096 bytes, and then the distribution drops off 
> sharply before a peak again at 4MB (the object size, i.e. the max write 
> size). (For those interested, read lengths are below in the P.S.)
> 
> I'm trying to understand that distribution, and the best explanation 
> I've come up with is that these are ext4/xfs metadata updates, probably 
> atime updates. Based on that theory, I'm going to test noatime on a few 
> VMs and see if I notice a change in the distribution.
> 
That strikes me as odd, as since kernel 2.6.30 the default option for
mounts is relatime, which should have an effect quite close to that of a
strict noatime.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/


Re: [ceph-users] question on harvesting freed space

2014-04-16 Thread Somnath Roy
It seems discard support for kernel RBD is targeted for v0.80.

http://tracker.ceph.com/issues/190

Thanks & Regards
Somnath

-Original Message-
From: ceph-users-boun...@lists.ceph.com 
[mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Christian Balzer
Sent: Wednesday, April 16, 2014 5:36 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] question on harvesting freed space


On Wed, 16 Apr 2014 13:12:15 -0500 John-Paul Robinson wrote:

> So having learned some about fstrim, I ran it on an SSD backed file
> system and it reported space freed. I ran it on an RBD backed file
> system and was told it's not implemented.
>
> This is consistent with the test for FITRIM.
>
> $ cat /sys/block/rbd3/queue/discard_max_bytes
> 0
>
This looks like you're using the kernelspace RBD interface.

And very sadly, trim/discard is not implemented in it, which is a bummer for 
anybody running for example a HA NFS server with RBD as the backing storage. 
Even sadder is the fact that this was last brought up a year or even longer ago.

Only the userspace (librbd) interface supports this; however, the client (KVM as 
the prime example) of course needs to use a pseudo disk interface that ALSO 
supports it. The standard virtio-block does not, while the very slow IDE emulation 
does, as does the speedier virtio-scsi (which, however, isn't configurable with 
Ganeti, for example).

Regards,

Christian

> On my SSD backed device I get:
>
> $ cat /sys/block/sda/queue/discard_max_bytes
> 2147450880
>
> Is this just not needed by RBD or is cleanup handled in a different way?
>
> I'm wondering what will happen to a thin-provisioned RBD image
> over time on a file system with lots of file create/delete activity.
> Will the storage in the ceph pool stay allocated to this application
> (the file system) in that case?
>
> Thanks for any additional insights.
>
> ~jpr
>
> On 04/15/2014 04:16 PM, John-Paul Robinson wrote:
> > Thanks for the insight.
> >
> > Based on that I found the fstrim command for xfs file systems.
> >
> > http://xfs.org/index.php/FITRIM/discard
> >
> > Anyone had experience using this command with RBD image backends?
> >
> > ~jpr
> >
> > On 04/15/2014 02:00 PM, Kyle Bader wrote:
> >>> I'm assuming Ceph/RBD doesn't have any direct awareness of this
> >>> since the file system doesn't traditionally have a "give back blocks"
> >>> operation to the block device.  Is there anything special RBD does
> >>> in this case that communicates the release of the Ceph storage
> >>> back to the pool?
> >> VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing
> >> TRIM.
> >>
> >> http://wiki.qemu.org/Features/QED/Trim
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


--
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/








Re: [ceph-users] question on harvesting freed space

2014-04-16 Thread Christian Balzer

On Wed, 16 Apr 2014 13:12:15 -0500 John-Paul Robinson wrote:

> So having learned some about fstrim, I ran it on an SSD backed file
> system and it reported space freed. I ran it on an RBD backed file
> system and was told it's not implemented. 
> 
> This is consistent with the test for FITRIM. 
> 
> $ cat /sys/block/rbd3/queue/discard_max_bytes
> 0
> 
This looks like you're using the kernelspace RBD interface.

And very sadly, trim/discard is not implemented in it, which is a bummer
for anybody running for example a HA NFS server with RBD as the backing
storage. Even sadder is the fact that this was last brought up a year or
even longer ago.

Only the userspace (librbd) interface supports this; however, the client
(KVM as the prime example) of course needs to use a pseudo disk interface
that ALSO supports it. The standard virtio-block does not, while the very
slow IDE emulation does, as does the speedier virtio-scsi (which, however,
isn't configurable with Ganeti, for example).

Regards,

Christian

> On my SSD backed device I get:
> 
> $ cat /sys/block/sda/queue/discard_max_bytes
> 2147450880
> 
> Is this just not needed by RBD or is cleanup handled in a different way?
> 
> I'm wondering what will happen to a thin-provisioned RBD image over time
> on a file system with lots of file create/delete activity.  Will the
> storage in the ceph pool stay allocated to this application (the file
> system) in that case?
> 
> Thanks for any additional insights.
> 
> ~jpr
> 
> On 04/15/2014 04:16 PM, John-Paul Robinson wrote:
> > Thanks for the insight.
> >
> > Based on that I found the fstrim command for xfs file systems. 
> >
> > http://xfs.org/index.php/FITRIM/discard
> >
> > Anyone had experience using this command with RBD image backends?
> >
> > ~jpr
> >
> > On 04/15/2014 02:00 PM, Kyle Bader wrote:
> >>> I'm assuming Ceph/RBD doesn't have any direct awareness of this since
> >>> the file system doesn't traditionally have a "give back blocks"
> >>> operation to the block device.  Is there anything special RBD does in
> >>> this case that communicates the release of the Ceph storage back to
> >>> the pool?
> >> VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing
> >> TRIM.
> >>
> >> http://wiki.qemu.org/Features/QED/Trim
> >>
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
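
For completeness, a sketch of a QEMU/librbd invocation along the lines
described above, attaching an RBD image on a virtio-scsi controller with
discard enabled. The pool/image, cephx user and the rest of the command line
are placeholders, and it assumes a QEMU recent enough to understand
discard=unmap and built with rbd support:

$ qemu-system-x86_64 -enable-kvm -m 2048 \
    -device virtio-scsi-pci,id=scsi0 \
    -drive file=rbd:rbd/vm-disk:id=admin,format=raw,if=none,id=drive0,cache=writeback,discard=unmap \
    -device scsi-hd,bus=scsi0.0,drive=drive0

# inside the guest, the disk should then report a non-zero value here
$ cat /sys/block/sda/queue/discard_max_bytes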


Re: [ceph-users] multi-mds and directory sharding

2014-04-16 Thread Yan, Zheng
On Thu, Apr 17, 2014 at 6:21 AM, Qing Zheng  wrote:
> It seems that with kernel 3.14 and the latest source code from github, we
> still run into troubles when testing multi-mds and directory sharding.
>
what's the problem you encountered?




> Are there any limits either in the max number of active metadata servers
> that we could possibly run
> or in the number of directory entries that we could set for Ceph to trigger
> a directory split?
>
> Is it okay to run 128 or more active metadata servers, for example?
> Is it okay to let Ceph split directories once a directory has accumulated
> 200 entries?
>
> Cheers,
>
> -- Qing Zheng
>
> -Original Message-
> From: Yan, Zheng [mailto:uker...@gmail.com]
> Sent: Sunday, April 13, 2014 6:43 PM
> To: Qing Zheng
> Cc: ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] multi-mds and directory sharding
>
> On Mon, Apr 14, 2014 at 2:54 AM, Qing Zheng  wrote:
>> Hi -
>>
>> We are currently evaluating CephFS's metadata scalability and
>> performance. One important feature of CephFS is its support for
>> running multiple "active" mds instances and partitioning huge
>> directories into small shards.
>>
>> We use mdtest to simulate workloads where multiple parallel client
>> processes will keep inserting empty files into several large directories.
>> We found that CephFS is only able to run for the first 5-10 mins, and
>> then stops making progress -- the clients' "creat" calls no longer return.
>>
>> We were using Ceph 0.72 and Ubuntu 12.10 with kernel 3.6.6.
>> Our setup consisted of 8 osds, 3 mds, and 1 mon. All mds were active,
>> instead of standby, and they were all configured to split directories
>> once the directory size is greater than 2k. We kernel (not fuse)
>> mounted CephFS on all 8 osd nodes.
>
> The 3.6 kernel is too old for cephfs. Please use a kernel compiled from the
> testing branch https://github.com/ceph/ceph-client and the newest development
> version of Ceph. There are a large number of fixes for directory fragmentation
> and multi-mds.
>
> Regards
> Yan, Zheng
>
>>
>> To test CephFS, we launched 64 client processes on 8 osd nodes (8
>> procs per osd). Each client would create 1 directory and then insert
>> 5k empty files into that directory. In total 64 directories and 320k
>> files would be created. CephFS gave an avg throughput of 300~1k for
>> the first 5 minutes, and then stopped making any progress.
>>
>> What might go wrong?
>>
>> If each client insert 200 files, instead of 5k, then CephFS could
>> finish the workload with 1.5K ops/s. If each client insert 1k files,
>> then ~500 ops/s If 2k files (the split threshold), then ~400 ops/s
>>
>> Are these numbers reasonable?
>>
>> -- Qing
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>


Re: [ceph-users] multi-mds and directory sharding

2014-04-16 Thread Qing Zheng
It seems that with kernel 3.14 and the latest source code from github, we
still run into troubles when testing multi-mds and directory sharding.

Are there any limits either in the max number of active metadata servers
that we could possibly run
or in the number of directory entries that we could set for Ceph to trigger
a directory split?

Is it okay to run 128 or more active metadata servers, for example?
Is it okay to let Ceph split directories once a directory has accumulated
200 entries?

Cheers,

-- Qing Zheng

-Original Message-
From: Yan, Zheng [mailto:uker...@gmail.com] 
Sent: Sunday, April 13, 2014 6:43 PM
To: Qing Zheng
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] multi-mds and directory sharding

On Mon, Apr 14, 2014 at 2:54 AM, Qing Zheng  wrote:
> Hi -
>
> We are currently evaluating CephFS's metadata scalability and 
> performance. One important feature of CephFS is its support for 
> running multiple "active" mds instances and partitioning huge 
> directories into small shards.
>
> We use mdtest to simulate workloads where multiple parallel client 
> processes will keep inserting empty files into several large directories.
> We found that CephFS is only able to run for the first 5-10 mins, and
> then stops making progress -- the clients' "creat" calls no longer return.
>
> We were using Ceph 0.72 and Ubuntu 12.10 with kernel 3.6.6.
> Our setup consisted of 8 osds, 3 mds, and 1 mon. All mds were active, 
> instead of standby, and they were all configured to split directories 
> once the directory size is greater than 2k. We kernel (not fuse) 
> mounted CephFS on all 8 osd nodes.

The 3.6 kernel is too old for cephfs. Please use a kernel compiled from the
testing branch https://github.com/ceph/ceph-client and the newest development
version of Ceph. There are a large number of fixes for directory fragmentation
and multi-mds.

Regards
Yan, Zheng

>
> To test CephFS, we launched 64 client processes on 8 osd nodes (8 
> procs per osd). Each client would create 1 directory and then insert 
> 5k empty files into that directory. In total 64 directories and 320k 
> files would be created. CephFS gave an avg throughput of 300~1k for 
> the first 5 minutes, and then stopped making any progress.
>
> What might go wrong?
>
> If each client insert 200 files, instead of 5k, then CephFS could 
> finish the workload with 1.5K ops/s. If each client insert 1k files, 
> then ~500 ops/s If 2k files (the split threshold), then ~400 ops/s
>
> Are these numbers reasonable?
>
> -- Qing
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] question on harvesting freed space

2014-04-16 Thread John-Paul Robinson
So having learned some about fstrim, I ran it on an SSD backed file
system and it reported space freed. I ran it on an RBD backed file
system and was told it's not implemented. 

This is consistent with the test for FITRIM. 

$ cat /sys/block/rbd3/queue/discard_max_bytes
0

On my SSD backed device I get:

$ cat /sys/block/sda/queue/discard_max_bytes
2147450880

Is this just not needed by RBD or is cleanup handled in a different way?

I'm wondering what will happen to a thin-provisioned RBD image over time
on a file system with lots of file create/delete activity.  Will the
storage in the ceph pool stay allocated to this application (the file
system) in that case?

Thanks for any additional insights.

~jpr

On 04/15/2014 04:16 PM, John-Paul Robinson wrote:
> Thanks for the insight.
>
> Based on that I found the fstrim command for xfs file systems. 
>
> http://xfs.org/index.php/FITRIM/discard
>
> Anyone had experience using this command with RBD image backends?
>
> ~jpr
>
> On 04/15/2014 02:00 PM, Kyle Bader wrote:
>>> I'm assuming Ceph/RBD doesn't have any direct awareness of this since
>>> the file system doesn't traditionally have a "give back blocks"
>>> operation to the block device.  Is there anything special RBD does in
>>> this case that communicates the release of the Ceph storage back to the
>>> pool?
>> VMs running a 3.2+ kernel (iirc) can "give back blocks" by issuing TRIM.
>>
>> http://wiki.qemu.org/Features/QED/Trim
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] RBD write access patterns and atime

2014-04-16 Thread Mark Nelson

On 04/16/2014 12:14 PM, Gregory Farnum wrote:

On Wed, Apr 16, 2014 at 8:08 AM, Dan van der Ster
 wrote:

Dear ceph-users,

I've recently started looking through our FileStore logs to better
understand the VM/RBD IO patterns, and noticed something interesting. Here
is a snapshot of the write lengths for one OSD server (with 24 OSDs) -- I've
listed the top 10 write lengths ordered by number of writes in one day:

Writes per length:
4096: 2011442
8192: 438259
4194304: 207293
12288: 175848
16384: 148274
20480: 69050
24576: 58961
32768: 54771
28672: 43627
65536: 34208
49152: 31547
40960: 28075

There were ~400 writes to that server on that day, so you see that ~50%
of the writes were 4096 bytes, and then the distribution drops off sharply
before a peak again at 4MB (the object size, i.e. the max write size). (For
those interested, read lengths are below in the P.S.)

I'm trying to understand that distribution, and the best explanation I've
come up with is that these are ext4/xfs metadata updates, probably atime
updates. Based on that theory, I'm going to test noatime on a few VMs and
see if I notice a change in the distribution.

Did anyone already go through such an exercise, or does anyone already
enforce/recommend specific mount options for their clients' RBD volumes? Of
course I realize that noatime is a generally recommended mount option for
"performance", but I've never heard a discussion about noatime specifically
in relation to RBD volumes.


I don't think we have any standard recommendations, but Mark might
have more insight into this than I do. I forget which clients you're
using — is rbd caching enabled? (If it's atime updates that could
definitely still happen with ext4/xfs, but it's a bummer.) Still, it's
good we're getting a bunch of larger writes as well.


Nope, I haven't specifically looked at the difference between 
enabling/disabling noatime on the rbd volumes myself.  It would be a 
good exercise though!  Same thing with read_ahead_kb and the IO 
elevator, especially in this context.   Sounds like a fun project.  :D


Mark
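
For reference, the knobs Mark mentions can be poked non-persistently through
sysfs -- a sketch, with rbd0 standing in for whichever block device is being
tested:

$ cat /sys/block/rbd0/queue/read_ahead_kb
$ cat /sys/block/rbd0/queue/scheduler

# try different values (not persistent across a reboot)
$ echo 512 | sudo tee /sys/block/rbd0/queue/read_ahead_kb
$ echo deadline | sudo tee /sys/block/rbd0/queue/scheduler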


Re: [ceph-users] RBD write access patterns and atime

2014-04-16 Thread Mike Dawson

Dan,

Could you describe how you harvested and analyzed this data? Even 
better, could you share the code?


Cheers,
Mike

On 4/16/2014 11:08 AM, Dan van der Ster wrote:

Dear ceph-users,

I've recently started looking through our FileStore logs to better
understand the VM/RBD IO patterns, and noticed something interesting.
Here is a snapshot of the write lengths for one OSD server (with 24
OSDs) -- I've listed the top 10 write lengths ordered by number of
writes in one day:

Writes per length:
4096: 2011442
8192: 438259
4194304: 207293
12288: 175848
16384: 148274
20480: 69050
24576: 58961
32768: 54771
28672: 43627
65536: 34208
49152: 31547
40960: 28075

There were ~400 writes to that server on that day, so you see that
~50% of the writes were 4096 bytes, and then the distribution drops off
sharply before a peak again at 4MB (the object size, i.e. the max write
size). (For those interested, read lengths are below in the P.S.)

I'm trying to understand that distribution, and the best explanation
I've come up with is that these are ext4/xfs metadata updates, probably
atime updates. Based on that theory, I'm going to test noatime on a few
VMs and see if I notice a change in the distribution.

Did anyone already go through such an exercise, or does anyone already
enforce/recommend specific mount options for their clients' RBD volumes?
Of course I realize that noatime is a generally recommended mount option
for "performance", but I've never heard a discussion about noatime
specifically in relation to RBD volumes.

Best Regards, Dan

P.S. Reads per length:
524288: 1235401
4096: 675012
8192: 488194
516096: 342771
16384: 187577
65536: 87783
131072: 87279
12288: 66735
49152: 50170
24576: 47794
262144: 45199
466944: 23064

So reads are mostly 512kB, which is probably some default read-ahead size.

-- Dan van der Ster || Data & Storage Services || CERN IT Department --
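
Not Dan's actual tooling, but a minimal sketch of how such a histogram could
be pulled out of an OSD log, assuming debug filestore logging is turned up and
that the logged write operations carry offset~length tokens like the ones
quoted elsewhere in this digest:

# temporarily raise filestore debug logging on one OSD (it is chatty)
$ ceph tell osd.0 injectargs '--debug-filestore 10'

# histogram of write lengths, largest counts first
$ grep ' write ' /var/log/ceph/ceph-osd.0.log \
    | grep -oE '[0-9]+~[0-9]+' \
    | awk -F'~' '{ count[$2]++ } END { for (len in count) print len": "count[len] }' \
    | sort -t: -k2 -rn | head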


Re: [ceph-users] RBD write access patterns and atime

2014-04-16 Thread Gregory Farnum
On Wed, Apr 16, 2014 at 8:08 AM, Dan van der Ster
 wrote:
> Dear ceph-users,
>
> I've recently started looking through our FileStore logs to better
> understand the VM/RBD IO patterns, and noticed something interesting. Here
> is a snapshot of the write lengths for one OSD server (with 24 OSDs) -- I've
> listed the top 10 write lengths ordered by number of writes in one day:
>
> Writes per length:
> 4096: 2011442
> 8192: 438259
> 4194304: 207293
> 12288: 175848
> 16384: 148274
> 20480: 69050
> 24576: 58961
> 32768: 54771
> 28672: 43627
> 65536: 34208
> 49152: 31547
> 40960: 28075
>
> There were ~400 writes to that server on that day, so you see that ~50%
> of the writes were 4096 bytes, and then the distribution drops off sharply
> before a peak again at 4MB (the object size, i.e. the max write size). (For
> those interested, read lengths are below in the P.S.)
>
> I'm trying to understand that distribution, and the best explanation I've
> come up with is that these are ext4/xfs metadata updates, probably atime
> updates. Based on that theory, I'm going to test noatime on a few VMs and
> see if I notice a change in the distribution.
>
> Did anyone already go through such an exercise, or does anyone already
> enforce/recommend specific mount options for their clients' RBD volumes? Of
> course I realize that noatime is a generally recommended mount option for
> "performance", but I've never heard a discussion about noatime specifically
> in relation to RBD volumes.

I don't think we have any standard recommendations, but Mark might
have more insight into this than I do. I forget which clients you're
using — is rbd caching enabled? (If it's atime updates that could
definitely still happen with ext4/xfs, but it's a bummer.) Still, it's
good we're getting a bunch of larger writes as well.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
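
For anyone checking Greg's question, the librbd cache is controlled from the
client side -- a minimal ceph.conf sketch (the size shown is the default, and
the writethrough-until-flush knob is an optional safety setting):

# on the hypervisor / librbd client
$ cat >> /etc/ceph/ceph.conf <<'EOF'
[client]
    rbd cache = true
    rbd cache writethrough until flush = true
    rbd cache size = 33554432
EOF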


Re: [ceph-users] DCB-X...

2014-04-16 Thread N. Richard Solis
I am working to do some testing using Ceph in a RED and non-RED
environment.  That might be a long way off though.

Can you tell me if the OSD to OSD flows are via long-lived TCP sessions for
writes or is the session set-up/torn-down once the block is ACK'ed as
durable back to the client?






On Sun, Apr 13, 2014 at 7:00 PM, Adam Clark  wrote:

> Most definitely, I actually had the same thoughts after I replied.
>
> The many more flows behaviour is more likely to be kept healthy with RED
> rather than pause frames.
>
> Pause frames would halt all flows at time of congestion (while keeping TCP
> windowing unaware of the congestion issues),
>
> RED would impact a fraction of the flows and keep the overall throughput
> of the system high using TCP as the primary method for individual flow
> rates.
>
> You are most likely to have congestion into an OSD node during multiple
> clients writes, or into a client node for reads from multiple OSDs.  Using
> DCB-X may inadvertently impact other Ceph IOPS that are not directly
> experiencing congestion.
>
> Would be interesting to do some tests though.
>
> Regards
>
> Adam
>
>
>
> On Mon, Apr 14, 2014 at 12:46 AM, N. Richard Solis wrote:
>
>> Adam,
>>
>> Thank you so much for replying.
>>
>> The links you posted are excellent discussions for me to reference.
>>
>> What's interesting about the difference between iSCSI and Ceph/RBD type
>> storage is the fact that under Ceph the client will make use of lots of
>> individual TCP sessions to access any particular virtual block device,
>> whereas iSCSI will use a single TCP session for every block.  That's how
>> you get into the elephant-flow vs mice-flow problem in a converged network.
>>  The (elephant) iSCSI device would compete for limited bandwidth with all
>> of the other TCP flows and you get crappy storage performance as the TCP
>> loss mechanisms kick in and drop the iSCSI TCP frames.
>>
>> If I'm thinking correctly, then Ceph wouldn't have this problem since any
>> individual block device would be spread across multiple OSDs and have
>> multiple TCP sessions connected to any particular client.  So the problem
>> of a single elephant flow really doesn't develop because you don't have
>> those large single-session TCP flows.  Am I thinking correctly?
>>
>>
>>
>>
>> On Sun, Apr 13, 2014 at 5:00 AM, Adam Clark wrote:
>>
>>> Heya,
>>>   I have read a bunch of stuff regarding PFC/DCB-X and iSCSI, so it
>>> might be a similar argument with Ceph given the underlying transport is TCP.
>>> DCB-X was needed as FC has no tolerance for loss and deals with
>>> congestion by instructing senders to slow down using pause frames.
>>>
>>> From what I read, it can help, but you really need to know what you are
>>> doing, as TCP has its own congestion avoidance mechanisms and it works under
>>> most use cases.  It is probably better to randomly introduce drops during
>>> congestion rather than use a pause frame.
>>>
>>> Check these out:
>>> http://blog.ipspace.net/2013/07/iscsi-with-pfc.html
>>> https://blogs.cisco.com/datacenter/the-napkin-dialogues-lossless-iscsi/
>>>
>>> Cisco centric, but but applicable to other vendors also.
>>>
>>> If you are planning on running your ceph traffic alongside other traffic
>>> types, I would definitely be putting it in it's own class and guaranteeing
>>> bandwidth under times of congestion, short queues (<50ms max delay) with
>>> somewhat aggressive RED thresholds would probably suffice.
>>>
>>> Cheers
>>>
>>> Adam
>>>
>>>
>>>
>>> On Fri, Apr 11, 2014 at 11:42 PM, N. Richard Solis wrote:
>>>
 Guys,

 I'm new to ceph in general but I'm wondering if anyone out there is
 using Ceph with any of the Data Center Bridging (DCB) technologies?  I'm
 specifically thinking of DCB-X support provided by the open-lldp package.

 I'm wondering if there is any benefit to be gained by making use of the
 lossless Ethernet support in a properly configured network for the OSD to
 OSD traffic that would normally occur on the "cluster" network.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>
>
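
One quick way to look at the many-flows behaviour discussed above is to count
the established TCP sessions on an OSD host's messenger ports -- a sketch,
assuming the default 6800-7300 port range:

# number of established OSD sessions
$ ss -tn state established '( sport >= :6800 and sport <= :7300 )' | tail -n +2 | wc -l

# sessions per peer host
$ ss -tn state established '( sport >= :6800 and sport <= :7300 )' \
    | awk 'NR>1 { sub(/:[0-9]+$/, "", $4); print $4 }' | sort | uniq -c | sort -rn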


[ceph-users] Unsubscribe

2014-04-16 Thread Knut Moe
Unsubscribe


Re: [ceph-users] Access denied error

2014-04-16 Thread Yehuda Sadeh
On Tue, Apr 15, 2014 at 11:33 PM, Punit Dambiwal  wrote:
> Hi,
>
> Still i am getting the same error,when i run the following :-
>
> --
> curl -i 'http://xxx.xlinux.com/admin/usage?format=json' -X GET -H
> 'Authorization: AWS
> YHFQ4D8BM835BCGERHTN:kXpM0XB9UjOadexDu2ZoP8s4nKjuoL0iIZhE\/+Gv' -H 'Host:


Where did you come up with this authorization field? You need to sign
the message appropriately.

Yehuda

> xxx.xlinux.com' -H 'Content-Length: 0'
> HTTP/1.1 403 Forbidden
> Date: Wed, 16 Apr 2014 06:26:45 GMT
>
> Server: Apache/2.2.22 (Ubuntu)
> Accept-Ranges: bytes
> Content-Length: 23
> Content-Type: application/json
>
> {"Code":"AccessDenied"}
> ---
>
> Can any body help me to resolve this issue..
>
> Thanks,
> punit
>
>
> On Mon, Apr 14, 2014 at 11:55 AM, Punit Dambiwal  wrote:
>>
>> Hi,
>>
>> I am trying to list out all users using the Ceph S3 api and php. These are
>> the lines of code which i used for
>>
>>
>>
>>
>> --
>>
>>
>>
>>
>>
>> $url = "http://.Xlinux.com/admin/user?format=json";
>>
>> $ch = curl_init ($url);
>>
>> curl_setopt($ch, CURLOPT_CUSTOMREQUEST, "GET"); curl_setopt($ch,
>> CURLOPT_USERPWD,
>> "P8K3750Z3PP5MGUKQYBL:CB+Ioydr1XsmQF\/gQmE\/X3YsDjtDbxLZzByaU9t\/");
>>
>> curl_setopt($ch, CURLOPT_HTTPHEADER, array("Authorization: AWS
>> P8K3750Z3PP5MGUKQYBL:CB+Ioydr1XsmQF\/gQmE\/X3YsDjtDbxLZzByaU9t\/"));
>>
>> curl_setopt($ch, CURLOPT_HEADER, 0);
>>
>> curl_setopt($ch, CURLOPT_RETURNTRANSFER, 1); curl_setopt($ch,
>> CURLOPT_BINARYTRANSFER,1); $response = curl_exec($ch); $output =
>> json_encode($response);
>>
>>
>>
>> print_r($output);
>>
>>
>>
>>
>> --
>>
>>
>>
>>
>>
>> I am getting an error: access denied as output. Could you please check it
>> ?
>>
>>
>>
>> I have also tried the same using the curl command like
>>
>>
>>
>> curl -i 'http://XXX.Xlinux.com/admin/usage?format=json' -X GET -H
>>
>> 'Authorization: AWS
>>
>> P8K3750Z3PP5MGUKQYBL:CB+Ioydr1XsmQF\/gQmE\/X3YsDjtDbxLZzByaU9t\/' -H
>>
>> 'Host: XXX.Xlinux.com' -H 'Content-Length: 0'
>>
>>
>>
>> HTTP/1.1 403 Forbidden
>>
>> Date: Fri, 11 Apr 2014 10:08:20 GMT
>>
>> Server: Apache/2.2.22 (Ubuntu)
>>
>> Accept-Ranges: bytes
>>
>> Content-Length: 23
>>
>> Content-Type: application/json
>>
>>
>>
>> {"Code":"AccessDenied"}
>>
>>
>>
>> Can any body let me know if anything wrong in the above... ??
>>
>>
>>
>>
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
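
To illustrate what "sign the message appropriately" means here, a hedged
sketch of building an AWS-v2-style signature for an admin request with plain
shell tools. The keys and host are placeholders taken from the thread, the
canonical resource is assumed to be the bare /admin/usage path, and the
calling user still needs the relevant admin caps:

$ ACCESS_KEY="YHFQ4D8BM835BCGERHTN"    # placeholder
$ SECRET_KEY="your-secret-key"         # placeholder
$ RESOURCE="/admin/usage"
$ HTTP_DATE=$(date -u '+%a, %d %b %Y %H:%M:%S GMT')
$ STRING_TO_SIGN=$(printf 'GET\n\n\n%s\n%s' "$HTTP_DATE" "$RESOURCE")
$ SIGNATURE=$(printf '%s' "$STRING_TO_SIGN" | openssl dgst -sha1 -hmac "$SECRET_KEY" -binary | base64)
$ curl -i "http://xxx.xlinux.com/admin/usage?format=json" \
    -H "Date: $HTTP_DATE" \
    -H "Authorization: AWS $ACCESS_KEY:$SIGNATURE"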


Re: [ceph-users] Slow request problem - help please.

2014-04-16 Thread Ian Colle
Moving to ceph-users.

Ian R. Colle
Director of Engineering
Inktank
Delivering the Future of Storage
http://www.linkedin.com/in/ircolle
http://www.twitter.com/ircolle
Cell: +1.303.601.7713
Email: i...@inktank.com



On 4/16/14, 7:52 AM, "Ilya Storozhilov"  wrote:

>Hello Ceph developers,
>
>we encountered a problem with a 3 OSD / 3 MON Ceph cluster (CentOS 6.5, 64
>bit, Ceph "Emperor", v0.72.2 from RPM), which causes errors on the RGW node.
>The respective RGW output is:
>
>2014-04-15 09:50:03.906527 7fcd4c5e6700  1 == starting new request
>req=0x1811b20 =
>2014-04-15 09:50:03.908509 7fcd64dfa700  1 Executing get_obj operation
>2014-04-15 09:50:03.909138 7fcd64dfa700  0 ERROR: s->cio->print()
>returned err=-1
>2014-04-15 09:50:03.909164 7fcd64dfa700  0 ERROR: s->cio->print()
>returned err=-1
>2014-04-15 09:50:03.909168 7fcd64dfa700  0 ERROR: s->cio->print()
>returned err=-1
>2014-04-15 09:50:03.909171 7fcd64dfa700  0 ERROR: s->cio->print()
>returned err=-1
>2014-04-15 09:50:03.909401 7fcd64dfa700  1 == req done req=0x16f4c10
>http_status=403 ==
>2014-04-15 09:50:03.909444 7fcd64dfa700  1 == starting new request
>req=0x1816610 =
>2014-04-15 09:50:07.444606 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd66bfd700' had timed out after 600
>2014-04-15 09:50:07.444628 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd245a6700' had timed out after 600
>2014-04-15 09:50:07.444634 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd263a9700' had timed out after 600
>2014-04-15 09:50:07.444639 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd28bad700' had timed out after 600
>2014-04-15 09:50:07.444642 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd2a9b0700' had timed out after 600
>2014-04-15 09:50:07.444649 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd349c0700' had timed out after 600
>2014-04-15 09:50:07.444652 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd367c3700' had timed out after 600
>2014-04-15 09:50:07.444668 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd385c6700' had timed out after 600
>2014-04-15 09:50:07.444673 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd3adca700' had timed out after 600
>2014-04-15 09:50:07.444676 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd3b7cb700' had timed out after 600
>2014-04-15 09:50:07.444681 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd3e9d0700' had timed out after 600
>2014-04-15 09:50:07.444685 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd3fdd2700' had timed out after 600
>2014-04-15 09:50:07.444689 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd411d4700' had timed out after 600
>2014-04-15 09:50:07.444694 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd475de700' had timed out after 600
>2014-04-15 09:50:07.444699 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd4a7e3700' had timed out after 600
>2014-04-15 09:50:07.444703 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd4bbe5700' had timed out after 600
>2014-04-15 09:50:07.444708 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd4cfe7700' had timed out after 600
>2014-04-15 09:50:07.444713 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd533f1700' had timed out after 600
>2014-04-15 09:50:07.444717 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd547f3700' had timed out after 600
>2014-04-15 09:50:07.444728 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd55bf5700' had timed out after 600
>2014-04-15 09:50:07.444734 7fce4f6f8700  1 heartbeat_map is_healthy
>'RGWProcess::m_tp thread 0x7fcd579f8700' had timed out after 600
>2014-04-15 09:50:10.155239 7fcd2a9b0700  1 == req done req=0x15831f0
>http_status=200 ==
>
>
>The output of the "ceph -s" command is:
>
>$ sudo ceph -s
>cluster 5c6c95b8-8234-4b34-96a9-facc3c561b72
> health HEALTH_WARN 88 requests are blocked > 32 sec
> monmap e2: 3 mons at
>{ceph-81=10.44.1.81:6789/0,ceph-82=10.44.1.82:6789/0,ceph-83=10.44.1.83:67
>89/0}, election epoch 22, quorum 0,1,2 ceph-81,ceph-82,ceph-83
> osdmap e519: 3 osds: 3 up, 3 in
>  pgmap v407538: 296 pgs, 32 pools, 64060 kB data, 280 objects
>17388 MB used, 1058 GB / 1133 GB avail
> 296 active+clean
>  client io 2148 B/s rd, 102 B/s wr, 3 op/s
>
>Ceph complains with the following messages in the OSD log file:
>
>2014-04-16 02:21:18.065305 osd.0 [WRN] slow request 30.613493 seconds
>old, received at 2014-04-16 02:20:47.451757: osd_op(client.5793.0:1270
>default.5793.1__shadow__F0jnzwQRYyOtc-QfW0S2_M0ptLm-o-b_1 [write
>1572864~524288] 172.d7dc6717 e608) v4 currently commit sen
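
For reference, the usual first steps for a "requests are blocked" warning in
that era are to ask the cluster which OSDs are affected and then pull the op
tracker off the slow OSD's admin socket, along these lines (osd.0 stands in for
whichever OSD the warning names):

ceph health detail
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_ops_in_flight
ceph --admin-daemon /var/run/ceph/ceph-osd.0.asok dump_historic_ops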

[ceph-users] RBD write access patterns and atime

2014-04-16 Thread Dan van der Ster

Dear ceph-users,

I've recently started looking through our FileStore logs to better 
understand the VM/RBD IO patterns, and noticed something interesting. 
Here is a snapshot of the write lengths for one OSD server (with 24 
OSDs) -- I've listed the most frequent write lengths, ordered by number of 
writes in one day:


Writes per length:
4096: 2011442
8192: 438259
4194304: 207293
12288: 175848
16384: 148274
20480: 69050
24576: 58961
32768: 54771
28672: 43627
65536: 34208
49152: 31547
40960: 28075

There were ~4 million writes to that server on that day, so you see that 
~50% of the writes were 4096 bytes, and then the distribution drops off 
sharply before a peak again at 4MB (the object size, i.e. the max write 
size). (For those interested, read lengths are below in the P.S.)
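
A tally like this can be reproduced roughly as follows, assuming OSD logs whose
op lines carry the usual "write <offset>~<length>" notation (the same notation
as in the slow-request message earlier in this digest); this is only a sketch,
not necessarily the script used here:

grep write ceph-osd.*.log \
  | grep -oE '[0-9]+~[0-9]+' \
  | awk -F'~' '{count[$2]++} END {for (len in count) print len": "count[len]}' \
  | sort -t: -k2 -rn | head -12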


I'm trying to understand that distribution, and the best explanation 
I've come up with is that these are ext4/xfs metadata updates, probably 
atime updates. Based on that theory, I'm going to test noatime on a few 
VMs and see if I notice a change in the distribution.
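
For reference, the change under test is just a mount option inside the guest;
an illustrative fstab entry and the equivalent live remount (the device and
mount point are made up):

# /etc/fstab inside the VM
/dev/vdb1  /data  ext4  defaults,noatime  0  2

# or, without a reboot:
mount -o remount,noatime /data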


Did anyone already go through such an exercise, or does anyone already 
enforce/recommend specific mount options for their clients' RBD volumes? 
Of course I realize that noatime is a generally recommended mount option 
for "performance", but I've never heard a discussion about noatime 
specifically in relation to RBD volumes.


Best Regards, Dan

P.S. Reads per length:
524288: 1235401
4096: 675012
8192: 488194
516096: 342771
16384: 187577
65536: 87783
131072: 87279
12288: 66735
49152: 50170
24576: 47794
262144: 45199
466944: 23064

So reads are mostly 512kB, which is probably some default read-ahead size.

-- Dan van der Ster || Data & Storage Services || CERN IT Department --
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Russian-speaking community CephRussian!

2014-04-16 Thread Loic Dachary
Hi Ирек,

If you organize meetups, feel free to add yourself to 
https://wiki.ceph.com/Community/Meetups :-)

Cheers

On 16/04/2014 13:22, Ирек Фасихов wrote:
> Hi,All.
> 
> I created the Russian-speaking community CephRussian in Google+ 
> ! Welcome!
> 
> URL: https://plus.google.com/communities/104570726102090628516
> 
> -- 
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

-- 
Loïc Dachary, Artisan Logiciel Libre



signature.asc
Description: OpenPGP digital signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Troubles MDS

2014-04-16 Thread Gregory Farnum
What's the backtrace from the MDS crash?
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
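
The backtrace normally lands in the MDS log when the daemon dies; something
along these lines will dig it out and raise verbosity for the next occurrence
(default log path, illustrative debug levels):

# look for the crash dump around the time of the failure
grep -B 2 -A 20 -E 'Caught signal|FAILED assert' /var/log/ceph/ceph-mds.*.log

# and raise MDS logging for the next occurrence, e.g. in ceph.conf:
[mds]
    debug mds = 20
    debug ms = 1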


On Wed, Apr 16, 2014 at 7:11 AM, Georg Höllrigl
 wrote:
> Hello,
>
> Using Ceph MDS with one active and one standby server - a day ago one of the
> mds crashed and I restarted it.
> Tonight it crashed again, a few hours later, also the second mds crashed.
>
> #ceph -v
> ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)
>
>  At the moment cephfs is dead - with the following health status:
>
> #ceph -s
> cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
>  health HEALTH_WARN mds cluster is degraded; mds c is laggy
>  monmap e3: 3 mons at
> {ceph-m-01=10.0.0.176:6789/0,ceph-m-02=10.0.1.107:6789/0,ceph-m-03=10.0.1.108:6789/0},
> election epoch 6274, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03
>  mdsmap e2055: 1/1/1 up {0=ceph-m-03=up:rejoin(laggy or crashed)}
>  osdmap e3752: 39 osds: 39 up, 39 in
>   pgmap v3277576: 8328 pgs, 17 pools, 6461 GB data, 17066 kobjects
> 13066 GB used, 78176 GB / 91243 GB avail
> 8328 active+clean
>   client io 1193 B/s rd, 0 op/s
>
> I couldn't really find any useful info in the log files or in the
> documentation. Any ideas how to get cephfs up and running?
>
> Here is part of mds log:
> 2014-04-16 14:07:05.603501 7ff184c64700  1 mds.0.server reconnect gave up on
> client.7846580 10.0.1.152:0/14639
> 2014-04-16 14:07:05.603525 7ff184c64700  1 mds.0.46 reconnect_done
> 2014-04-16 14:07:05.674990 7ff186d69700  1 mds.0.46 handle_mds_map i am now
> mds.0.46
> 2014-04-16 14:07:05.674996 7ff186d69700  1 mds.0.46 handle_mds_map state
> change up:reconnect --> up:rejoin
> 2014-04-16 14:07:05.674998 7ff186d69700  1 mds.0.46 rejoin_start
> 2014-04-16 14:07:22.347521 7ff17f825700  0 -- 10.0.1.107:6815/17325 >>
> 10.0.1.68:0/4128280551 pipe(0x5e2ac80 sd=930 :6815 s=2 pgs=153 cs=1 l=0
> c=0x5e2e160).fault with nothing to send, going to standby
>
> Any ideas how to solve "laggy or crashed"?
>
>
> Georg
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Troubles MDS

2014-04-16 Thread Georg Höllrigl

Hello,

Using Ceph MDS with one active and one standby server - a day ago one of 
the mds crashed and I restarted it.

Tonight it crashed again, a few hours later, also the second mds crashed.

#ceph -v
ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60)

At the moment cephfs is dead - with the following health status:

#ceph -s
cluster b04fc583-9e71-48b7-a741-92f4dff4cfef
 health HEALTH_WARN mds cluster is degraded; mds c is laggy
 monmap e3: 3 mons at 
{ceph-m-01=10.0.0.176:6789/0,ceph-m-02=10.0.1.107:6789/0,ceph-m-03=10.0.1.108:6789/0}, 
election epoch 6274, quorum 0,1,2 ceph-m-01,ceph-m-02,ceph-m-03

 mdsmap e2055: 1/1/1 up {0=ceph-m-03=up:rejoin(laggy or crashed)}
 osdmap e3752: 39 osds: 39 up, 39 in
  pgmap v3277576: 8328 pgs, 17 pools, 6461 GB data, 17066 kobjects
13066 GB used, 78176 GB / 91243 GB avail
8328 active+clean
  client io 1193 B/s rd, 0 op/s

I couldn't really find any useful info in the log files or in the 
documentation. Any ideas how to get cephfs up and running?


Here is part of mds log:
2014-04-16 14:07:05.603501 7ff184c64700  1 mds.0.server reconnect gave 
up on client.7846580 10.0.1.152:0/14639

2014-04-16 14:07:05.603525 7ff184c64700  1 mds.0.46 reconnect_done
2014-04-16 14:07:05.674990 7ff186d69700  1 mds.0.46 handle_mds_map i am 
now mds.0.46
2014-04-16 14:07:05.674996 7ff186d69700  1 mds.0.46 handle_mds_map state 
change up:reconnect --> up:rejoin

2014-04-16 14:07:05.674998 7ff186d69700  1 mds.0.46 rejoin_start
2014-04-16 14:07:22.347521 7ff17f825700  0 -- 10.0.1.107:6815/17325 >> 
10.0.1.68:0/4128280551 pipe(0x5e2ac80 sd=930 :6815 s=2 pgs=153 cs=1 l=0 
c=0x5e2e160).fault with nothing to send, going to standby


Any ideas how to solve "laggy or crashed"?


Georg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Kyle Bader
>> Obviously the ssds could be used as journal devices, but I'm not really
>> convinced whether this is worthwhile when all nodes have 1GB of hardware
>> writeback cache (writes to journal and data areas on the same spindle have
>> time to coalesce in the cache and minimise seek time hurt). Any advice on
>> this?

All writes need to be written to the journal before being written to
the data volume, so it's going to impact your overall throughput and
cause seeking; a hardware cache will only help with the latter (unless
you use btrfs).

>> I think the timing should work that we'll be deploying with Firefly and so
>> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
>> versus Tier to act as node-local block cache device. Does anybody have real
>> or anecdotal evidence about which approach has better performance?
> New idea that is dependent on failure behaviour of the cache tier...

The problem with this type of configuration is that it ties a VM to a
specific hypervisor. In theory it should be faster because you don't
have network latency from round trips to the cache tier, resulting in
higher IOPS. Large sequential workloads may achieve higher throughput
by parallelizing across many OSDs in a cache tier, whereas local flash
would be limited to single-device throughput.

> Carve the ssds 4-ways: each with 3 partitions for journals servicing the
> backing data pool and a fourth larger partition serving a write-around cache
> tier with only 1 object copy. Thus both reads and writes hit ssd but the ssd
> capacity is not halved by replication for availability.
>
> ...The crux is how the current implementation behaves in the face of cache
> tier OSD failures?

Cache tiers are durable by way of replication or erasure coding; OSDs
will remap degraded placement groups and backfill as appropriate. With
single-replica cache pools the loss of OSDs becomes a real concern; in
the case of RBD this means losing arbitrary chunk(s) of your block
devices - bad news. If you want host independence, durability and speed,
your best bet is a replicated cache pool (2-3x).
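
For completeness, a replicated cache tier in Firefly is wired up roughly like
this (pool names, PG count and target size are illustrative only):

ceph osd pool create cache-pool 512
ceph osd pool set cache-pool size 3
ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool target_max_bytes 1099511627776   # ~1 TB, pick your own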

-- 

Kyle
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)

2014-04-16 Thread Kenneth Waegeman


- Message from "Yan, Zheng"  -
   Date: Wed, 16 Apr 2014 14:45:37 +0800
   From: "Yan, Zheng" 
Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
 To: Stijn De Weirdt 
 Cc: ceph-users@lists.ceph.com



On Wed, Apr 16, 2014 at 2:08 PM, Stijn De Weirdt
 wrote:

What do you mean by the MDS journal? Where can I find this journal?
Can a better CPU solve the slow trimming? ( Now 2 hexacore AMD Opteron
4334)



The MDS uses a journal to record recent metadata updates. The journal is
stored in the metadata pool (object names 200.*). The speed of trimming the
journal is limited by OSD performance; a better CPU does not help.


I'm a bit confused now. So the metadata journal is stored in the metadata
pool (which is using OSDs that also have journals), but where is the
metadata itself (for lack of better wording) stored? Also in the metadata pool?

We have configured a separate set of OSDs and modified the crushmap so the
metadata pool uses those OSDs. If these are only used as a journal for the
metadata, then it's not so odd that the metadata journal is way ahead.


Metadata is also stored in the metadata pool. When handling an update
request, the MDS writes metadata changes to the metadata journal. The MDS
writes metadata changes to the corresponding metadata objects when it trims
the journal.





Please try
https://github.com/ceph/ceph/commit/6dfed2693e9002dbaf82b3dc1a637e1c53878fe3
I hope it can solve this issue.


We are now running the patched version, thank you for that! We'll  
monitor what happens.

Is there a way to see how much of a backlog we have?




ok, we'll rebuild and try asap

stijn




Regards
Yan, Zheng


Thanks!



Regards
Yan, Zheng


Thanks!

Kenneth

- Message from Stijn De Weirdt  -
Date: Fri, 04 Apr 2014 20:31:34 +0200
From: Stijn De Weirdt 

Subject: Re: [ceph-users] ceph 0.78 mon and mds crashing (bus error)
  To: "Yan, Zheng" 
  Cc: ceph-users@lists.ceph.com




hi yan,

(taking the list in CC)

On 04/04/2014 04:44 PM, Yan, Zheng wrote:




On Thu, Apr 3, 2014 at 2:52 PM, Stijn De Weirdt

wrote:




hi,

latest pprof output attached.

this is no kernel client, this is ceph-fuse on EL6. starting the mds
without
any ceph-fuse mounts works without issue. mounting ceph-fuse
afterwards
also
works fine. simple filesystem operations work as expected.

we'll check the state of the fuse mount via fusectl once we've
reproduced
the issue.





One dark side of the fuse client is that there is no way to purge inodes
from the kernel inode cache. Each cached inode in the kernel pins the
corresponding inode in ceph-mds. I think that's why ceph-fuse and ceph-mds
use so much memory. Could you try executing "echo 3 >
/proc/sys/vm/drop_caches" periodically on the ceph-fuse mount machine?




Thanks a lot for the suggestion. That might very well explain the issue.
We're currently looking into other issues that we thought were related
(extremely low mdtest performance), but we will retry your suggestion
once we start to reproduce the condition.



stijn



Regards
Yan, Zheng




stijn


On 04/03/2014 02:39 AM, Yan, Zheng wrote:





which version of kernel client did you use? please send out context
of
client node's /sys/kernel/debug/ceph/*/caps when the MDS uses lots
memory.

Regards
Yan, Zheng

On Thu, Apr 3, 2014 at 2:58 AM, Stijn De Weirdt

wrote:





Wow, kudos for integrating this in Ceph. More projects should do it
like that!

Anyway, attached is a gzipped ps file. The heap is at 4.4GB; top reports
6.5GB mem usage.

Care to point out what to look for? I'll send a new one when the usage
is starting to cause swapping.

thanks,

stijn


On 04/02/2014 06:35 PM, Gregory Farnum wrote:






Did you see

http://ceph.com/docs/master/rados/troubleshooting/memory-profiling?
That should have what you need to get started, although you'll
also
need to learn the basics of using the heap analysis tools
elsewhere.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com


On Wed, Apr 2, 2014 at 9:23 AM, Stijn De Weirdt

wrote:






hi,



1) How big and what shape the filesystem is. Do you have some
extremely large directory that the MDS keeps trying to load
and
then
dump?







Any way to extract this from the MDS without having to start it? As it
was an rsync operation, I can try to locate possible candidates on the
source filesystem, but what would be considered "large"?







The total number of files is 13M, spread over 800k directories, but it's
unclear how far the sync was at the time of failing. I've not found a good
way to check for directories with lots of files and/or subdirs.





2) Use tcmalloc's heap analyzer to see where all the memory is
being
allocated.







we'll giv ethat a try







I run ceph-mds with HEAPCHECK=normal (via the init script), but how can
we "stop" the mds without killing it? The heap checker only seems to dump
at the end of a run; maybe there's a way to get an intermediate dump like
valgrind, but the documentation is not very helpful.

stijn





3) Look through the logs for when the beacon

Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
Yes, Ilya. From that command output I assumed this must be a
rados_classes issue. After that I copied it to the exact location and
restarted all the nodes.

Thanks,
Srinivas.



On Wed, Apr 16, 2014 at 5:50 PM, Ilya Dryomov wrote:

> On Wed, Apr 16, 2014 at 4:00 PM, Srinivasa Rao Ragolu
>  wrote:
> > Hi All,
> >
> > Thanks a lot to one and all.. Thank you so much for your support. I found
> > the issue with your clues.
> >
> > Issue is : root filesystem does not have /usr/lib64/rados_classes
> >
> > After adding rados_classes and restarting all the nodes, I was able to map
> > the block devices.
>
> Great, I was just going to have you restart the osds.  Do you know why
> there wasn't a rados_classes on one (?) of the osds?  (I assume the
> 'find' result that you pasted came from the monitor and it is one or
> more osds that lacked rados_classes?)
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Ilya Dryomov
On Wed, Apr 16, 2014 at 4:00 PM, Srinivasa Rao Ragolu
 wrote:
> Hi All,
>
> Thanks a lot to one and all.. Thank you so much for your support. I found
> the issue with your clues.
>
> Issue is : root filesystem does not have /usr/lib64/rados_classes
>
> After adding rados_classes and restarting all the nodes, I was able to map
> the block devices.

Great, I was just going to have you restart the osds.  Do you know why
there wasn't a rados_classes on one (?) of the osds?  (I assume the
'find' result that you pasted came from the monitor and it is one or
more osds that lacked rados_classes?)

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
Hi All,

Thanks a lot to one and all.. Thank you so much for your support. I found
the issue with your clues.

*Issue: the root filesystem does not have /usr/lib64/rados-classes*

After adding rados_classes and restarting all the nodes, I was able to map
the block devices.

Thanks,
Srinivas.


On Wed, Apr 16, 2014 at 5:12 PM, Srinivasa Rao Ragolu wrote:

> root@node1:/etc/ceph# ceph daemon osd.0 config get osd_class_dir
> { "osd_class_dir": "\/usr\/lib64\/rados-classes"}
>
> Thanks,
> Srinivas.
>
>
> On Wed, Apr 16, 2014 at 4:37 PM, Ilya Dryomov wrote:
>
>> On Wed, Apr 16, 2014 at 2:45 PM, Srinivasa Rao Ragolu
>>  wrote:
>> > root@mon:/etc/ceph# find / -name "libcls_rbd.so"
>> > /usr/lib64/rados-classes/libcls_rbd.so
>> > root@mon:/etc/ceph# echo $osd_class_dir
>> >
>> > root@mon:/etc/ceph#
>> >
>> > Please let me know how to find osd_class_dir value
>>
>> ceph daemon osd.0 config get osd_class_dir
>>
>> Thanks,
>>
>> Ilya
>>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cinder Volume Attach Error(Please help me)CENTOS OPENSTACK + CINDER + CEPH

2014-04-16 Thread 一ノ瀬峻
Hi All,

I followed the guide on Ceph to integrate Ceph into OpenStack
(https://ceph.com/docs/master/rbd/rbd-openstack/).

But, failed to attach the cinder volume, the following error is output
in /var/log/nova/compute.log

internal error unable to execute QEMU command '__com.redhat_drive_add':
Device 'drive-virtio-disk1' could not be initialized

I found the following information
http://waipeng.wordpress.com/2013/05/20/centos-openstack-cinder-ceph/

But my environment is newer than that:
>qemu-kvm-0.12.1.2-2.415.el6_5.7.x86_64

Please tell me how to solve this issue if you know it.

===Environment===
CENTOS 6.5
  kernel Version
Ceph nodes:2.6.32-431.11.2.el6.x86_64
OpenStack Controller:2.6.32-431.1.2.0.1.el6.x86_64
OpenStack Compute:2.6.32-431.el6.x86_64
OPENSTACK RDO Havana openstack-cinder-2013.2.1-1.el6.noarch
CEPH ceph-0.72.2-0.el6.x86_64


Thanks.
Shun.
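
At the time, two usual suspects for this error on EL6 were a qemu-kvm build
without RBD block-driver support (the stock EL6 package, which is what the
linked blog post covers) and a missing libvirt cephx secret for the Cinder
user. The secret setup from the rbd-openstack guide referenced above goes
roughly like this; the UUID and the client.cinder name are the guide's
examples, not values from this environment:

cat > secret.xml <<EOF
<secret ephemeral='no' private='no'>
  <uuid>457eb676-33da-42ec-9a8c-9293d545c337</uuid>
  <usage type='ceph'>
    <name>client.cinder secret</name>
  </usage>
</secret>
EOF
virsh secret-define --file secret.xml
virsh secret-set-value --secret 457eb676-33da-42ec-9a8c-9293d545c337 \
    --base64 "$(ceph auth get-key client.cinder)"

The same UUID then has to be referenced as rbd_secret_uuid in cinder.conf,
per that guide.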
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
root@node1:/etc/ceph# ceph daemon osd.0 config get osd_class_dir
{ "osd_class_dir": "\/usr\/lib64\/rados-classes"}

Thanks,
Srinivas.


On Wed, Apr 16, 2014 at 4:37 PM, Ilya Dryomov wrote:

> On Wed, Apr 16, 2014 at 2:45 PM, Srinivasa Rao Ragolu
>  wrote:
> > root@mon:/etc/ceph# find / -name "libcls_rbd.so"
> > /usr/lib64/rados-classes/libcls_rbd.so
> > root@mon:/etc/ceph# echo $osd_class_dir
> >
> > root@mon:/etc/ceph#
> >
> > Please let me know how to find osd_class_dir value
>
> ceph daemon osd.0 config get osd_class_dir
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Russian-speaking community CephRussian!

2014-04-16 Thread Ирек Фасихов
Hi,All.

I created the Russian-speaking community CephRussian in
Google+!
Welcome!

URL: https://plus.google.com/communities/104570726102090628516

-- 
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Ilya Dryomov
On Wed, Apr 16, 2014 at 2:45 PM, Srinivasa Rao Ragolu
 wrote:
> root@mon:/etc/ceph# find / -name "libcls_rbd.so"
> /usr/lib64/rados-classes/libcls_rbd.so
> root@mon:/etc/ceph# echo $osd_class_dir
>
> root@mon:/etc/ceph#
>
> Please let me know how to find osd_class_dir value

ceph daemon osd.0 config get osd_class_dir

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
root@mon:/etc/ceph# find / -name "libcls_rbd.so"
/usr/lib64/rados-classes/libcls_rbd.so
root@mon:/etc/ceph# echo $osd_class_dir

root@mon:/etc/ceph#

Please let me know how to find osd_class_dir value

Thanks,
Srinivas.


On Wed, Apr 16, 2014 at 4:00 PM, Ilya Dryomov wrote:

> On Wed, Apr 16, 2014 at 2:13 PM, Srinivasa Rao Ragolu
>  wrote:
> > Thanks. Please see the output of above command
> >
> > root@mon:/etc/ceph# rbd ls -l
> > rbd: error opening blk2: (95) Operation not supported2014-04-16
> > 10:12:13.947625 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> > Operation not supported
> >
> > rbd: error opening blk3: (95) Operation not supported2014-04-16
> > 10:12:13.961595 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> > Operation not supported
> >
> > rbd: error opening ceph-block1: (95) Operation not supported2014-04-16
> > 10:12:13.974869 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> > Operation not supported
> >
> > rbd: error opening sample: (95) Operation not supported
> > NAME SIZE PARENT FMT PROT LOCK
> > 2014-04-16 10:12:13.986056 7f3a2a0c7780 -1 librbd: Error listing
> snapshots:
> > (95) Operation not supported
>
> OK, so the kernel is not to blame here.  This is probably a class path
> issue.
>
> - Can you try to locate a libcls_rbd.so library on your
> system?
>
> - What's the value of osd_class_dir conf variable?
>
> Thanks,
>
> Ilya
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] mon server down

2014-04-16 Thread Jonathan Gowar
On Wed, 2014-04-16 at 03:02 +0100, Joao Eduardo Luis wrote:
> You didn't recreate mon.ceph-3.
> 
> The following should take care of that:
> 
> 1. stop mon.ceph-3
> 2. ceph mon remove ceph-3
> 3. mv /var/lib/ceph/mon/ceph-3 /someplace/ceph-3
> 4. ceph mon getmap -o /tmp/monmap
> 5. ceph-mon -i ceph-3 --keyring /someplace/ceph-3 --monmap /tmp/monmap
> 6. ceph mon add ceph-3 IP:PORT
> 7. start mon.ceph-3
> 
> Alternatively, you can inject a modified monmap into ceph-3:
> 
> 1. stop mon.ceph-3
> 2. ceph mon getmap -o /tmp/monmap
> 3. monmaptool --rm ceph-3 /tmp/monmap
> 4. monmaptool --add ceph-3 IP:PORT /tmp/monmap
> 5. ceph-mon -i ceph-3 --inject-monmap /tmp/monmap
> 6. start mon.ceph-3
> 
>-Joao
> 

Thanks, working now.

Regards,
Jon

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Ilya Dryomov
On Wed, Apr 16, 2014 at 2:13 PM, Srinivasa Rao Ragolu
 wrote:
> Thanks. Please see the output of above command
>
> root@mon:/etc/ceph# rbd ls -l
> rbd: error opening blk2: (95) Operation not supported2014-04-16
> 10:12:13.947625 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> Operation not supported
>
> rbd: error opening blk3: (95) Operation not supported2014-04-16
> 10:12:13.961595 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> Operation not supported
>
> rbd: error opening ceph-block1: (95) Operation not supported2014-04-16
> 10:12:13.974869 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
> Operation not supported
>
> rbd: error opening sample: (95) Operation not supported
> NAME SIZE PARENT FMT PROT LOCK
> 2014-04-16 10:12:13.986056 7f3a2a0c7780 -1 librbd: Error listing snapshots:
> (95) Operation not supported

OK, so the kernel is not to blame here.  This is probably a class path
issue.

- Can you try to locate a libcls_rbd.so library on your
system?

- What's the value of osd_class_dir conf variable?

Thanks,

Ilya
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Karan Singh
Your ceph.conf, please.


Karan Singh 
Systems Specialist , Storage Platforms
CSC - IT Center for Science,
Keilaranta 14, P. O. Box 405, FIN-02101 Espoo, Finland
mobile: +358 503 812758
tel. +358 9 4572001
fax +358 9 4572302
http://www.csc.fi/


On 16 Apr 2014, at 13:13, Srinivasa Rao Ragolu  wrote:

> Thanks. Please see the output of above command
> 
> root@mon:/etc/ceph# rbd ls -l 
> rbd: error opening blk2: (95) Operation not supported2014-04-16 
> 10:12:13.947625 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95) 
> Operation not supported
> 
> rbd: error opening blk3: (95) Operation not supported2014-04-16 
> 10:12:13.961595 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95) 
> Operation not supported
> 
> rbd: error opening ceph-block1: (95) Operation not supported2014-04-16 
> 10:12:13.974869 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95) 
> Operation not supported
> 
> rbd: error opening sample: (95) Operation not supported
> NAME SIZE PARENT FMT PROT LOCK 
> 2014-04-16 10:12:13.986056 7f3a2a0c7780 -1 librbd: Error listing snapshots: 
> (95) Operation not supported
> 
> 
> Thanks,
> Srinivas.
> 
> 
> On Wed, Apr 16, 2014 at 3:37 PM, Ирек Фасихов  wrote:
> Show command output rbd ls -l.
> 
> 
> 2014-04-16 13:59 GMT+04:00 Srinivasa Rao Ragolu :
> 
> Hi Wido,
> 
> Output of info command is given below
> 
> root@mon:/etc/ceph# rbd info sample
> rbd: error opening image sample: (95) Operation not supported2014-04-16 
> 09:57:24.575279 7f661c6e5780 -1 librbd: Error listing snapshots: (95) 
> Operation not supported
> 
> root@mon:/etc/ceph# ceph status
> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>  health HEALTH_OK
>  monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 1, 
> quorum 0 mon
>  osdmap e13: 2 osds: 2 up, 2 in
>   pgmap v68: 192 pgs, 3 pools, 513 bytes data, 5 objects
> 2077 MB used, 9113 MB / 11837 MB avail
>  192 active+clean
>   client io 13 B/s rd, 0 op/s
> 
> After this monitor daemon getting killed. Need to start it again.
> 
> Thanks,
> Srinivas.
> 
> 
> On Wed, Apr 16, 2014 at 3:18 PM, Wido den Hollander  wrote:
> On 04/16/2014 11:41 AM, Srinivasa Rao Ragolu wrote:
> HI all,
> 
> I have created ceph cluster with 1 monitor node and 2 OSd nodes. Cluster
> health is OK and Active.
> 
> My deployment is on our private distribution of Linux kernel 3.10.33 and
> ceph version is 0.72.2
> 
> I could able to create image with command " rbd create sample --size 200".
> 
> What is the RBD format of the image?
> 
> $ rbd info sample
> 
> I don't think it's the problem, but it could be that the krbd doesn't support 
> format 2 yet.
> 
> inserted rbd.ko successfully with modprobe command " modprobe rbd"
> 
> Now when I try to map it with command:
> 
> #*rbd map sample
> *
> *[10584.497492] libceph: client4301 fsid
> 
> a7f64266-0894-4f1e-a635-d0aeaca0e993
> [10584.535926] libceph: mon0 192.168.0.102:6789
>  session established
> rbd: add failed: (34) Numerical result out of range*
> 
> 
> 
> Please help me in solving this issue.
> 
> Thanks,
> Srinivas.
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Wido den Hollander
> Ceph consultant and trainer
> 42on B.V.
> 
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 
> -- 
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
Thanks. Please see the output of the above command:

root@mon:/etc/ceph# rbd ls -l
rbd: error opening blk2: (95) Operation not supported2014-04-16
10:12:13.947625 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
Operation not supported

rbd: error opening blk3: (95) Operation not supported2014-04-16
10:12:13.961595 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
Operation not supported

rbd: error opening ceph-block1: (95) Operation not supported2014-04-16
10:12:13.974869 7f3a2a0c7780 -1 librbd: Error listing snapshots: (95)
Operation not supported

rbd: error opening sample: (95) Operation not supported
NAME SIZE PARENT FMT PROT LOCK
2014-04-16 10:12:13.986056 7f3a2a0c7780 -1 librbd: Error listing snapshots:
(95) Operation not supported


Thanks,
Srinivas.


On Wed, Apr 16, 2014 at 3:37 PM, Ирек Фасихов  wrote:

> Show command output rbd ls -l.
>
>
> 2014-04-16 13:59 GMT+04:00 Srinivasa Rao Ragolu :
>
> Hi Wido,
>>
>> Output of info command is given below
>>
>> root@mon:/etc/ceph#
>> * rbd info sample rbd: error opening image sample: (95) Operation not
>> supported2014-04-16 09:57:24.575279 7f661c6e5780 -1 librbd: Error listing
>> snapshots: (95) Operation not supported*
>>
>> root@mon:/etc/ceph# ceph status
>> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>>  health HEALTH_OK
>>  monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 1,
>> quorum 0 mon
>>  osdmap e13: 2 osds: 2 up, 2 in
>>   pgmap v68: 192 pgs, 3 pools, 513 bytes data, 5 objects
>> 2077 MB used, 9113 MB / 11837 MB avail
>>  192 active+clean
>>   client io 13 B/s rd, 0 op/s
>>
>> After this monitor daemon getting killed. Need to start it again.
>>
>> Thanks,
>> Srinivas.
>>
>>
>> On Wed, Apr 16, 2014 at 3:18 PM, Wido den Hollander wrote:
>>
>>> On 04/16/2014 11:41 AM, Srinivasa Rao Ragolu wrote:
>>>
 HI all,

 I have created ceph cluster with 1 monitor node and 2 OSd nodes. Cluster
 health is OK and Active.

 My deployment is on our private distribution of Linux kernel 3.10.33 and
 ceph version is 0.72.2

 I could able to create image with command " rbd create sample --size
 200".

>>>
>>> What is the RBD format of the image?
>>>
>>> $ rbd info sample
>>>
>>> I don't think it's the problem, but it could be that the krbd doesn't
>>> support format 2 yet.
>>>
>>>  inserted rbd.ko successfully with modprobe command " modprobe rbd"

 Now when I try to map it with command:

 #*rbd map sample
 *
 *[10584.497492] libceph: client4301 fsid

 a7f64266-0894-4f1e-a635-d0aeaca0e993
 [10584.535926] libceph: mon0 192.168.0.102:6789
  session established
 rbd: add failed: (34) Numerical result out of range*



 Please help me in solving this issue.

 Thanks,
 Srinivas.



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


>>>
>>> --
>>> Wido den Hollander
>>> Ceph consultant and trainer
>>> 42on B.V.
>>>
>>> Phone: +31 (0)20 700 9902
>>> Skype: contact42on
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Best regards, Фасихов Ирек Нургаязович
> Mob.: +79229045757
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Ирек Фасихов
Show the output of the command "rbd ls -l".


2014-04-16 13:59 GMT+04:00 Srinivasa Rao Ragolu :

> Hi Wido,
>
> Output of info command is given below
>
> root@mon:/etc/ceph#
> * rbd info samplerbd: error opening image sample: (95) Operation not
> supported2014-04-16 09:57:24.575279 7f661c6e5780 -1 librbd: Error listing
> snapshots: (95) Operation not supported*
>
> root@mon:/etc/ceph# ceph status
> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>  health HEALTH_OK
>  monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 1,
> quorum 0 mon
>  osdmap e13: 2 osds: 2 up, 2 in
>   pgmap v68: 192 pgs, 3 pools, 513 bytes data, 5 objects
> 2077 MB used, 9113 MB / 11837 MB avail
>  192 active+clean
>   client io 13 B/s rd, 0 op/s
>
> After this monitor daemon getting killed. Need to start it again.
>
> Thanks,
> Srinivas.
>
>
> On Wed, Apr 16, 2014 at 3:18 PM, Wido den Hollander  wrote:
>
>> On 04/16/2014 11:41 AM, Srinivasa Rao Ragolu wrote:
>>
>>> HI all,
>>>
>>> I have created ceph cluster with 1 monitor node and 2 OSd nodes. Cluster
>>> health is OK and Active.
>>>
>>> My deployment is on our private distribution of Linux kernel 3.10.33 and
>>> ceph version is 0.72.2
>>>
>>> I could able to create image with command " rbd create sample --size
>>> 200".
>>>
>>
>> What is the RBD format of the image?
>>
>> $ rbd info sample
>>
>> I don't think it's the problem, but it could be that the krbd doesn't
>> support format 2 yet.
>>
>>  inserted rbd.ko successfully with modprobe command " modprobe rbd"
>>>
>>> Now when I try to map it with command:
>>>
>>> #*rbd map sample
>>> *
>>> *[10584.497492] libceph: client4301 fsid
>>>
>>> a7f64266-0894-4f1e-a635-d0aeaca0e993
>>> [10584.535926] libceph: mon0 192.168.0.102:6789
>>>  session established
>>> rbd: add failed: (34) Numerical result out of range*
>>>
>>>
>>>
>>> Please help me in solving this issue.
>>>
>>> Thanks,
>>> Srinivas.
>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> --
>> Wido den Hollander
>> Ceph consultant and trainer
>> 42on B.V.
>>
>> Phone: +31 (0)20 700 9902
>> Skype: contact42on
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
Hi Wido,

Output of info command is given below

root@mon:/etc/ceph# rbd info sample
rbd: error opening image sample: (95) Operation not supported
2014-04-16 09:57:24.575279 7f661c6e5780 -1 librbd: Error listing snapshots: (95) Operation not supported

root@mon:/etc/ceph# ceph status
cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
 health HEALTH_OK
 monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 1,
quorum 0 mon
 osdmap e13: 2 osds: 2 up, 2 in
  pgmap v68: 192 pgs, 3 pools, 513 bytes data, 5 objects
2077 MB used, 9113 MB / 11837 MB avail
 192 active+clean
  client io 13 B/s rd, 0 op/s

After this the monitor daemon gets killed and needs to be started again.

Thanks,
Srinivas.


On Wed, Apr 16, 2014 at 3:18 PM, Wido den Hollander  wrote:

> On 04/16/2014 11:41 AM, Srinivasa Rao Ragolu wrote:
>
>> HI all,
>>
>> I have created ceph cluster with 1 monitor node and 2 OSd nodes. Cluster
>> health is OK and Active.
>>
>> My deployment is on our private distribution of Linux kernel 3.10.33 and
>> ceph version is 0.72.2
>>
>> I could able to create image with command " rbd create sample --size 200".
>>
>
> What is the RBD format of the image?
>
> $ rbd info sample
>
> I don't think it's the problem, but it could be that the krbd doesn't
> support format 2 yet.
>
>  inserted rbd.ko successfully with modprobe command " modprobe rbd"
>>
>> Now when I try to map it with command:
>>
>> #*rbd map sample
>> *
>> *[10584.497492] libceph: client4301 fsid
>>
>> a7f64266-0894-4f1e-a635-d0aeaca0e993
>> [10584.535926] libceph: mon0 192.168.0.102:6789
>>  session established
>> rbd: add failed: (34) Numerical result out of range*
>>
>>
>>
>> Please help me in solving this issue.
>>
>> Thanks,
>> Srinivas.
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> --
> Wido den Hollander
> Ceph consultant and trainer
> 42on B.V.
>
> Phone: +31 (0)20 700 9902
> Skype: contact42on
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Wido den Hollander

On 04/16/2014 11:41 AM, Srinivasa Rao Ragolu wrote:

HI all,

I have created ceph cluster with 1 monitor node and 2 OSd nodes. Cluster
health is OK and Active.

My deployment is on our private distribution of Linux kernel 3.10.33 and
ceph version is 0.72.2

I was able to create an image with the command "rbd create sample --size 200".


What is the RBD format of the image?

$ rbd info sample

I don't think it's the problem, but it could be that the krbd doesn't 
support format 2 yet.



inserted rbd.ko successfully with modprobe command " modprobe rbd"

Now when I try to map it with command:

# rbd map sample
[10584.497492] libceph: client4301 fsid a7f64266-0894-4f1e-a635-d0aeaca0e993
[10584.535926] libceph: mon0 192.168.0.102:6789 session established
rbd: add failed: (34) Numerical result out of range


Please help me in solving this issue.

Thanks,
Srinivas.



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--
Wido den Hollander
Ceph consultant and trainer
42on B.V.

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] rbd: add failed: (34) Numerical result out of range ( Please help me)

2014-04-16 Thread Srinivasa Rao Ragolu
Hi all,

I have created a Ceph cluster with 1 monitor node and 2 OSD nodes. The cluster
health is OK and active.

My deployment is on our private distribution of Linux (kernel 3.10.33) and
the Ceph version is 0.72.2.

I was able to create an image with the command "rbd create sample --size 200"
and inserted rbd.ko successfully with "modprobe rbd".

Now when I try to map it with the command:

# rbd map sample
[10584.497492] libceph: client4301 fsid a7f64266-0894-4f1e-a635-d0aeaca0e993
[10584.535926] libceph: mon0 192.168.0.102:6789 session established
rbd: add failed: (34) Numerical result out of range


Please help me in solving this issue.

Thanks,
Srinivas.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Errors while mapping the created image (Numerical result out of range)

2014-04-16 Thread Ирек Фасихов
Show the output of the dmesg command.


2014-04-16 12:18 GMT+04:00 Srinivasa Rao Ragolu :

> Hi All,
>
> I could successfully able to create ceph cluster on our proprietary
> distribution with manual ceph commands
>
> *ceph.conf*
>
> [global]
> fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
> mon initial members = mon
> mon host = 192.168.0.102
> public network = 192.168.0.0/22
> auth cluster required = cephx
> auth service required = cephx
> auth client required = cephx
> osd journal size = 1024
> filestore xattr use omap = true
> osd pool default size = 2
> osd pool default min size = 1
> osd pool default pg num = 333
> osd pool default pgp num = 333
> osd crush chooseleaf type = 1
>
> [mon.mon]
> host = mon
> mon addr = 192.168.0.102
>
> [osd]
> filestore xattr use omap = true
> osd data = /var/lib/ceph/osd/$cluster-$id
> osd journal size = 1024
> osd mkfs type = ext4
>
> [osd.0]
> host = node1
>
> [osd.1]
> host = node2
>
> *#ceph status*
> root@mon:/etc/ceph# ceph status
> cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
>  health HEALTH_OK
>  monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 2,
> quorum 0 mon
>  osdmap e13: 2 osds: 2 up, 2 in
>   pgmap v46: 192 pgs, 3 pools, 389 bytes data, 4 objects
> 2077 MB used, 9113 MB / 11837 MB avail
>  192 active+clean
>
> *Problem/Issue*
>
> On Monitor node
>
> 1) rbd create ceph-block1 --size 10   Fine
> 2) modprobe rbd  Fine
> 3) rbd map ceph-block1  - Error
>
> root@mon:/etc/ceph# rbd map sample
> [ 4884.565320] libceph: client4160 fsid a7f64266-0894-4f1e-a635-d0aeaca0e993
> [ 4884.603480] libceph: mon0 192.168.0.102:6789 session established
> rbd: add failed: (34) Numerical result out of range
>
> Ceph version is 0.72.2
> Linux kernel is 3.10.33
>
> Please help me in solving this issue ASAP
>
> Thanks,
> Srinivas.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Best regards, Фасихов Ирек Нургаязович
Mob.: +79229045757
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Errors while mapping the created image (Numerical result out of range)

2014-04-16 Thread Srinivasa Rao Ragolu
Hi All,

I was able to successfully create a Ceph cluster on our proprietary
distribution with manual ceph commands.

*ceph.conf*

[global]
fsid = a7f64266-0894-4f1e-a635-d0aeaca0e993
mon initial members = mon
mon host = 192.168.0.102
public network = 192.168.0.0/22
auth cluster required = cephx
auth service required = cephx
auth client required = cephx
osd journal size = 1024
filestore xattr use omap = true
osd pool default size = 2
osd pool default min size = 1
osd pool default pg num = 333
osd pool default pgp num = 333
osd crush chooseleaf type = 1

[mon.mon]
host = mon
mon addr = 192.168.0.102

[osd]
filestore xattr use omap = true
osd data = /var/lib/ceph/osd/$cluster-$id
osd journal size = 1024
osd mkfs type = ext4

[osd.0]
host = node1

[osd.1]
host = node2

*#ceph status*
root@mon:/etc/ceph# ceph status
cluster a7f64266-0894-4f1e-a635-d0aeaca0e993
 health HEALTH_OK
 monmap e1: 1 mons at {mon=192.168.0.102:6789/0}, election epoch 2,
quorum 0 mon
 osdmap e13: 2 osds: 2 up, 2 in
  pgmap v46: 192 pgs, 3 pools, 389 bytes data, 4 objects
2077 MB used, 9113 MB / 11837 MB avail
 192 active+clean

*Problem/Issue*

On Monitor node

1) rbd create ceph-block1 --size 10   Fine
2) modprobe rbd  Fine
3) rbd map ceph-block1  - Error

root@mon:/etc/ceph# rbd map sample
[ 4884.565320] libceph: client4160 fsid a7f64266-0894-4f1e-a635-d0aeaca0e993
[ 4884.603480] libceph: mon0 192.168.0.102:6789 session established
rbd: add failed: (34) Numerical result out of range

Ceph version is 0.72.2
Linux kernel is 3.10.33

Please help me in solving this issue ASAP

Thanks,
Srinivas.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSDs: cache pool/tier versus node-local block cache

2014-04-16 Thread Blair Bethwaite
New idea that is dependent on failure behaviour of the cache tier...

Carve the ssds 4-ways: each with 3 partitions for journals servicing the
backing data pool and a fourth larger partition serving a write-around
cache tier with only 1 object copy. Thus both reads and writes hit ssd but
the ssd capacity is not halved by replication for availability.

...The crux is how the current implementation behaves in the face of cache
tier OSD failures?

Cheers, Blairo
On 16/04/2014 4:45 PM, "Blair Bethwaite"  wrote:

> Hi all,
>
> We'll soon be configuring a new cluster, hardware is already purchased -
> OSD nodes are Dell R720XDs (E5-2630v2, 32GB RAM, PERC 710p, 9x 4TB NL-SAS,
> 3x 200GB Intel DC S3700, Mellanox CX3 10GE DP). 12 of these to start with.
>
> So we have a 3:1 spindle:ssd ratio, but as yet I'm not sure how we'll
> configure things...
>
> Obviously the ssds could be used as journal devices, but I'm not really
> convinced whether this is worthwhile when all nodes have 1GB of hardware
> writeback cache (writes to journal and data areas on the same spindle have
> time to coalesce in the cache and minimise seek time hurt). Any advice on
> this?
>
> I think the timing should work that we'll be deploying with Firefly and so
> have Ceph cache pool tiering as an option, but I'm also evaluating Bcache
> versus Tier to act as node-local block cache device. Does anybody have real
> or anecdotal evidence about which approach has better performance?
>
> --
> Cheers,
> ~Blairo
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] NAND-backed DRAM for Ceph journals

2014-04-16 Thread Charles 'Boyo
Hello list.

It is a well-known fact that speeding up the OSD journals results in overall 
performance improvement. And most installations use SSDs to gain this benefit.

But is anyone using or considering using NAND-backed DRAM like the Viking 
ArxCiS-NV and similar NVDIMM solutions?

I think these will be even faster - eliminating the disk/PCIe bottlenecks - and 
more performant given DRAM's order of magnitude speed improvement. Who knows, 
with the entire system memory consisting of "non-volatile" RAM, maybe we can 
even turn off the journal and count on the OSD filesystem caches to write back 
to disk eventually?

What do you think?

Charles Oluboyo
Sent from my BlackBerry® wireless handheld from Glo Mobile.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [Bug]radosgw.log won't be generated when deleted

2014-04-16 Thread Craig Lewis
This is a standard feature of logs in Unix.  The radosgw process has a 
filehandle for the log file, and continues to write to that filehandle 
after you move or delete the log.  If you rename the log, you'll notice 
the log keeps growing.


If you move or delete the log, you need to tell radosgw that you've done 
so.  If you run service radosgw reload, radosgw will recreate its log file.


In general though, you shouldn't be messing with the logs directly.  If 
they're taking up too much disk space, reduce the logging level, or 
adjust the logrotated schedule.  If you want to rotate faster, or keep 
fewer files, take a look in /etc/logrotate.d/radosgw.  That's on Ubuntu, 
and I think it's there on CentOS 6.
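
A minimal stanza of that sort, assuming the default log path and an init
script that supports reload, might look like:

/var/log/ceph/radosgw.log {
    weekly
    rotate 7
    compress
    missingok
    notifempty
    postrotate
        service radosgw reload >/dev/null 2>&1 || true
    endscript
}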



*Craig Lewis*
Senior Systems Engineer
Office +1.714.602.1309
Email cle...@centraldesktop.com 

*Central Desktop. Work together in ways you never thought possible.*



On 4/15/14 23:49 , wsnote wrote:

OS: CentOS 6.5
Ceph version: 0.67.7

When I delete or move /var/log/ceph/radosgw.log, I can continue 
operating on files through RGW, but then I find there is no log. The log 
won't be regenerated automatically. Even if I create it, nothing is 
written to it. If I restart radosgw, the log is generated again.
I think this is a bug and don't know whether it was fixed or not. 
Whenever I delete or move the log, it should be regenerated and 
record subsequent operations through RGW.






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com