Re: [ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Somnath Roy
Sorry, I missed that you are upgrading from Hammer. I think it is probably a
bug introduced post-Hammer. Here is why it is happening, IMO.

In hammer:
-

https://github.com/ceph/ceph/blob/hammer/src/os/FileJournal.cc#L158

In Master/Infernalis/Jewel:
-

https://github.com/ceph/ceph/blob/infernalis/src/os/FileJournal.cc#L151

That value is hard-coded to 4096.
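
Paraphrasing the linked lines (not the exact source), the effective difference
is roughly:

    Hammer:              journal block size follows the kernel page size (64K on these hosts)
    Infernalis/Master:   block_size = 4096   (hard-coded constant)

so a journal laid out by a 64K-page Hammer OSD presumably no longer matches
what the new code expects.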

Not sure why this was changed, Sam/Sage?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@caviumnetworks.com]
Sent: Tuesday, March 01, 2016 9:34 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: INFARNALIS with 64K Kernel PAGES

The OSDs were created with a 64K page size, and mkfs was done with the same size.
After the upgrade, I have not changed anything on the machine (except applying the
ownership fix for files for user ceph:ceph)

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, March 01, 2016 9:32 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: INFARNALIS with 64K Kernel PAGES

Did you recreate the OSDs on this setup, i.e. did you do mkfs with a 64K page size?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Tuesday, March 01, 2016 9:07 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] INFARNALIS with 64K Kernel PAGES

Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size, came up OK and OSDs are all online.
Systems with 64K Page size are all seeing the OSDs crash with following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_command 0 hook 0xff6380c0
   -27> 2016-03-01 20:52:56.503479 97e38f10  5 asok(0xff6c) 

Re: [ceph-users] rbd cache did not help improve performance

2016-03-01 Thread Josh Durgin

On 03/01/2016 10:03 PM, min fang wrote:

thanks, with your help, I set the read ahead parameter. What is the
cache parameter for kernel module rbd?
Such as:
1) what is the cache size?
2) Does it support write back?
3) Will read ahead be disabled once max bytes have been read into the cache?
(similar to the concept of "rbd_readahead_disable_after_bytes").


The kernel rbd module does not implement any caching itself. If you're
doing I/O to a file on a filesystem on top of a kernel rbd device, it will
go through the usual kernel page cache (unless you use O_DIRECT of course).

Josh



2016-03-01 21:31 GMT+08:00 Adrien Gillard >:

As Tom stated, RBD cache only works if your client is using librbd
(KVM clients for instance).
Using the kernel RBD client, one of the parameter you can tune to
optimize sequential read is increasing
/sys/class/block/rbd4/queue/read_ahead_kb

Adrien



On Tue, Mar 1, 2016 at 12:48 PM, min fang > wrote:

I can use the following command to change parameter, for example
as the following,  but not sure whether it will work.

  ceph --admin-daemon /var/run/ceph/ceph-mon.openpower-0.asok
config set rbd_readahead_disable_after_bytes 0

2016-03-01 15:07 GMT+08:00 Tom Christensen >:

If you are mapping the RBD with the kernel driver then
you're not using librbd so these settings will have no
effect I believe.  The kernel driver does its own caching
but I don't believe there are any settings to change its
default behavior.


On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo
> wrote:

You may want to set "ioengine=rbd", I guess.
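
A librbd-backed fio invocation along those lines would look roughly like
this (pool and image names are placeholders, and fio must be built with
rbd support):

  fio --ioengine=rbd --pool=rbd --rbdname=myimage --clientname=admin \
      --rw=read --bs=4k --iodepth=64 --runtime=300 --name=mytest2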

Cheers,

- Original Message -
From: "min fang" >
To: "ceph-users" >
Sent: Tuesday, March 1, 2016 1:28:54 PM
Subject: [ceph-users]  rbd cache did not help improve
performance

Hi, I set the following parameters in ceph.conf

[client]
rbd cache=true
rbd cache size= 25769803776
rbd readahead disable after byte=0


map a rbd image to a rbd device then run fio testing on
4k read as the command
./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread
-rw=read -ioengine=aio -bs=4K -size=500G -numjobs=32
-runtime=300 -group_reporting -name=mytest2

Comparing the result with rbd cache=false and the cache enabled, I did
not see performance improved by the librbd cache.

Is my setting not right, or is it true that the ceph librbd cache will
not benefit 4k sequential reads?

thanks.


___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache did not help improve performance

2016-03-01 Thread min fang
Thanks, with your help I set the read-ahead parameter. What is the cache
parameter for the kernel rbd module?
Such as:
1) what is the cache size?
2) Does it support write back?
3) Will read ahead be disabled once max bytes have been read into the cache?
(similar to the concept of "rbd_readahead_disable_after_bytes").

thanks again.

2016-03-01 21:31 GMT+08:00 Adrien Gillard :

> As Tom stated, RBD cache only works if your client is using librbd (KVM
> clients for instance).
> Using the kernel RBD client, one of the parameter you can tune to optimize
> sequential read is increasing /sys/class/block/rbd4/queue/read_ahead_kb
>
> Adrien
>
>
>
> On Tue, Mar 1, 2016 at 12:48 PM, min fang  wrote:
>
>> I can use the following command to change parameter, for example as the
>> following,  but not sure whether it will work.
>>
>>  ceph --admin-daemon /var/run/ceph/ceph-mon.openpower-0.asok config set
>> rbd_readahead_disable_after_bytes 0
>>
>> 2016-03-01 15:07 GMT+08:00 Tom Christensen :
>>
>>> If you are mapping the RBD with the kernel driver then you're not using
>>> librbd so these settings will have no effect I believe.  The kernel driver
>>> does its own caching but I don't believe there are any settings to change
>>> its default behavior.
>>>
>>>
>>> On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo 
>>> wrote:
>>>
 You may want to set "ioengine=rbd", I guess.

 Cheers,

 - Original Message -
 From: "min fang" 
 To: "ceph-users" 
 Sent: Tuesday, March 1, 2016 1:28:54 PM
 Subject: [ceph-users]  rbd cache did not help improve performance

 Hi, I set the following parameters in ceph.conf

 [client]
 rbd cache=true
 rbd cache size= 25769803776
 rbd readahead disable after byte=0


 map a rbd image to a rbd device then run fio testing on 4k read as the
 command
 ./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread -rw=read
 -ioengine=aio -bs=4K -size=500G -numjobs=32 -runtime=300 -group_reporting
 -name=mytest2

 Compared the result with setting rbd cache=false and enable cache
 model, I did not see performance improved by librbd cache.

 Is my setting not right, or it is true that ceph librbd cache will not
 have benefit on 4k seq read?

 thanks.


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Garg, Pankaj
The OSDs were created with a 64K page size, and mkfs was done with the same size.
After the upgrade, I have not changed anything on the machine (except applying the
ownership fix for files for user ceph:ceph)

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, March 01, 2016 9:32 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: INFARNALIS with 64K Kernel PAGES

Did you recreate the OSDs on this setup, i.e. did you do mkfs with a 64K page size?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Tuesday, March 01, 2016 9:07 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] INFARNALIS with 64K Kernel PAGES

Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size, came up OK and OSDs are all online.
Systems with 64K Page size are all seeing the OSDs crash with following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_command 0 hook 0xff6380c0
   -27> 2016-03-01 20:52:56.503479 97e38f10  5 asok(0xff6c) 
register_command version hook 0xff6380c0
   -26> 2016-03-01 20:52:56.503490 97e38f10  5 asok(0xff6c) 
register_command git_version hook 0xff6380c0
   -25> 2016-03-01 20:52:56.503500 97e38f10  5 asok(0xff6c) 
register_command help hook 0xff63c1e0
   -24> 2016-03-01 20:52:56.503510 97e38f10  5 asok(0xff6c) 
register_command get_command_descriptions hook 0xff63c1f0
   -23> 2016-03-01 20:52:56.503566 9643f030  5 asok(0xff6c) entry 
start
   -22> 2016-03-01 20:52:56.503635 97e38f10 10 monclient(hunting): 
build_initial_monmap
   -21> 2016-03-01 20:52:56.520227 97e38f10  5 adding auth protocol: cephx
   -20> 

Re: [ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Somnath Roy
Did you recreate the OSDs on this setup, i.e. did you do mkfs with a 64K page size?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Tuesday, March 01, 2016 9:07 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] INFARNALIS with 64K Kernel PAGES

Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size, came up OK and OSDs are all online.
Systems with 64K Page size are all seeing the OSDs crash with following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_command 0 hook 0xff6380c0
   -27> 2016-03-01 20:52:56.503479 97e38f10  5 asok(0xff6c) 
register_command version hook 0xff6380c0
   -26> 2016-03-01 20:52:56.503490 97e38f10  5 asok(0xff6c) 
register_command git_version hook 0xff6380c0
   -25> 2016-03-01 20:52:56.503500 97e38f10  5 asok(0xff6c) 
register_command help hook 0xff63c1e0
   -24> 2016-03-01 20:52:56.503510 97e38f10  5 asok(0xff6c) 
register_command get_command_descriptions hook 0xff63c1f0
   -23> 2016-03-01 20:52:56.503566 9643f030  5 asok(0xff6c) entry 
start
   -22> 2016-03-01 20:52:56.503635 97e38f10 10 monclient(hunting): 
build_initial_monmap
   -21> 2016-03-01 20:52:56.520227 97e38f10  5 adding auth protocol: cephx
   -20> 2016-03-01 20:52:56.520244 97e38f10  5 adding auth protocol: cephx
   -19> 2016-03-01 20:52:56.520427 97e38f10  5 asok(0xff6c) 
register_command objecter_requests hook 0xff63c2b0
   -18> 2016-03-01 20:52:56.520538 97e38f10  1 -- 10.18.240.124:6802/17095 
messenger.start
   -17> 2016-03-01 20:52:56.520601 97e38f10  1 -- :/0 messenger.start
   -16> 2016-03-01 20:52:56.520655 97e38f10  1 -- 

Re: [ceph-users] Upgrade to INFERNALIS

2016-03-01 Thread Garg, Pankaj
Thanks François. That was the issue. After changing Journal partition 
permissions, things look better now.

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Francois Lafont
Sent: Tuesday, March 01, 2016 4:06 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Upgrade to INFERNALIS

Hi,

On 02/03/2016 00:12, Garg, Pankaj wrote:

> I have upgraded my cluster from 0.94.4 as recommended to the just released 
> Infernalis (9.2.1) Update directly (skipped 9.2.0).
> I installed the packaged on each system, manually (.deb files that I built).
> 
> After that I followed the steps :
> 
> Stop ceph-all
> chown -R  ceph:ceph /var/lib/ceph
> start ceph-all

Ok, and the journals?

> I am still getting errors on starting OSDs.
> 
> 2016-03-01 22:44:45.991043 7fa185f000 -1 filestore(/var/lib/ceph/osd/ceph-69) 
> mount failed to open journal /var/lib/ceph/osd/ceph-69/journal: (13) 
> Permission denied

I suppose your journal is a symlink which targets a raw partition, correct?
In this case, the ceph Unix account currently seems to be unable to read and
write to this partition. If this partition is /dev/sdb2 (for instance), you
have to set the Unix rights on this "file" /dev/sdb2 (manually or via a udev rule).

> 2016-03-01 22:44:46.001112 7fa185f000 -1 osd.69 0 OSD:init: unable to mount 
> object store
> 2016-03-01 22:44:46.001128 7fa185f000 -1  ** ERROR: osd init failed: (13) 
> Permission denied
> 
> 
> What am I missing?

I think you forgot to set the Unix rights on the journal partitions. The ceph
account must be able to read/write in /var/lib/ceph/osd/$cluster-$id/ _and_ in
the journal partitions too.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Garg, Pankaj
Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size, came up OK and OSDs are all online.
Systems with 64K Page size are all seeing the OSDs crash with following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_command 0 hook 0xff6380c0
   -27> 2016-03-01 20:52:56.503479 97e38f10  5 asok(0xff6c) 
register_command version hook 0xff6380c0
   -26> 2016-03-01 20:52:56.503490 97e38f10  5 asok(0xff6c) 
register_command git_version hook 0xff6380c0
   -25> 2016-03-01 20:52:56.503500 97e38f10  5 asok(0xff6c) 
register_command help hook 0xff63c1e0
   -24> 2016-03-01 20:52:56.503510 97e38f10  5 asok(0xff6c) 
register_command get_command_descriptions hook 0xff63c1f0
   -23> 2016-03-01 20:52:56.503566 9643f030  5 asok(0xff6c) entry 
start
   -22> 2016-03-01 20:52:56.503635 97e38f10 10 monclient(hunting): 
build_initial_monmap
   -21> 2016-03-01 20:52:56.520227 97e38f10  5 adding auth protocol: cephx
   -20> 2016-03-01 20:52:56.520244 97e38f10  5 adding auth protocol: cephx
   -19> 2016-03-01 20:52:56.520427 97e38f10  5 asok(0xff6c) 
register_command objecter_requests hook 0xff63c2b0
   -18> 2016-03-01 20:52:56.520538 97e38f10  1 -- 10.18.240.124:6802/17095 
messenger.start
   -17> 2016-03-01 20:52:56.520601 97e38f10  1 -- :/0 messenger.start
   -16> 2016-03-01 20:52:56.520655 97e38f10  1 -- 10.18.240.124:6803/17095 
messenger.start
   -15> 2016-03-01 20:52:56.520712 97e38f10  1 -- 
192.168.240.124:6803/17095 messenger.start
   -14> 2016-03-01 20:52:56.520768 97e38f10  1 -- 
192.168.240.124:6802/17095 messenger.start
   -13> 2016-03-01 20:52:56.520824 97e38f10  1 -- :/0 

Re: [ceph-users] Manual or fstab mount on Ceph FS

2016-03-01 Thread Yan, Zheng
On Wed, Mar 2, 2016 at 4:57 AM, Jose M  wrote:
> Hi guys, easy question.
>
> If I need to mount a CephFS on a client (manual mount or fstab), but this
> client won't be part of the ceph cluster (neither an OSD nor a monitor node),
> do I still have to run the "ceph-deploy install ceph-client" command from the
> ceph admin node, or is there another way?
>
> I mean, what are the minimal requirements to get "mount -t ceph 1.2.3.4:/
> mountpoint" to work?
>
> The OS on my client is Ubuntu 14.04 LTS.

No other package is required. But the kernel shipped by 14.04 is a
little old, please update it if possible.
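
For an fstab entry, a minimal line (assuming cephx is enabled; the monitor
address and key below are placeholders) would be something like:

  1.2.3.4:6789:/  /mnt/cephfs  ceph  name=admin,secret=<base64-key>,noatime  0  0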

Yan, Zheng


>
> Thanks in advance,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Restrict cephx commands

2016-03-01 Thread chris holcombe
Hey Ceph Users!

I'm wondering if it's possible to restrict the ceph keyring to only
being able to run certain commands.  I think the answer to this is no
but I just wanted to ask.  I haven't seen any documentation indicating
whether or not this is possible.  Anyone know?

Thanks,
Chris
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing OSD drive without remapping pg's

2016-03-01 Thread Lindsay Mathieson

On 02/03/16 02:41, Robert LeBlanc wrote:

With a fresh disk, you will need to remove the old key in ceph (ceph
auth del osd.X) and the old osd (ceph osd rm X), but I think you can
leave the CRUSH map alone (don't do ceph osd crush rm osd.X) so that
there isn't any additional data movement (if there aren't any
available OSD numbers less than the OSD being replaced, it will get
the same ID. There may also be a way to specify an ID, but I haven't
used it). Then when you add the new disk in, it only backfills what
the previous disk had, unless the size is different, then it will take
on more or less and shuffle some things around the cluster.



Thanks Robert, that makes sense. I'll try it out tonight.

--
Lindsay Mathieson

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Francois Lafont
On 01/03/2016 18:14, John Spray wrote:

>> And what is the meaning of the first and the second number below?
>>
>> mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>>^ ^
> 
> Your whitespace got lost here I think, but I guess you're talking
> about the 1/1 part.

Yes indeed.

> The shorthand MDS status is up/in/max_mds
> (https://github.com/ceph/ceph/blob/master/src/mds/MDSMap.cc#L248)
> 
> up: how many daemons are up and holding a rank (they may be active or
> replaying, etc)
> in: how many ranks exist in the MDS cluster
> max_mds: if there are this many MDSs already, new daemons will be made
> standbys instead of having ranks created for them.
> 
> On single-active-daemon systems, this is really just going to be 1/1/1
> or 0/1/1 for whether you have an up MDS or not.

Ok thx John for the explanations.


-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.6 Hammer released

2016-03-01 Thread Chris Dunlop
Hi,

The "old list of supported platforms" includes debian wheezy.
Will v0.94.6 be built for this?

Chris

On Mon, Feb 29, 2016 at 10:57:53AM -0500, Sage Weil wrote:
> The intention was to continue building stable releases (0.94.x) on the old 
> list of supported platforms (which includes 12.04 and el6).  I think it was 
> just an oversight that they weren't built this time around.  I think the 
> overhead of doing so is just keeping a 12.04 and el6 jenkins build slave 
> around.
> 
> Doing these builds in the existing environment sounds much better than 
> trying to pull in externally built binaries...
> 
> sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph RGW NFS

2016-03-01 Thread David Wang
Thanks for reply. I will wait for Jewel.

2016-03-02 0:29 GMT+08:00 Yehuda Sadeh-Weinraub :

> On Tue, Mar 1, 2016 at 7:23 AM, Daniel Gryniewicz  wrote:
> > On 02/28/2016 08:36 PM, David Wang wrote:
> >>
> >> Hi All,
> >>  How is the progress of NFS on RGW? Is it released in Infernalis? The
> >> NFS on RGW work is described at
> >> http://tracker.ceph.com/projects/ceph/wiki/RGW_-_NFS
> >>
> >>
> >
> > The FSAL has been integrated into upstream Ganesha
> > (https://github.com/nfs-ganesha/nfs-ganesha/tree/next).  It's in
> testing at
> > this point, and so not production ready.  It will be released with Jewel
> > (has been merged into master), and may or may not be backported at some
> > point.
> >
>
> It's really unlikely that we'll backport the nfs work to Infernalis.
>
> Yehuda
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] User Interface

2016-03-01 Thread Vlad Blando
Any ideas guys?

/Vlad

On Tue, Mar 1, 2016 at 10:42 AM, Vlad Blando  wrote:

> Hi,
>
> We already have user interfaces that are admin-facing (e.g. calamari,
> kraken, ceph-dash); how about a client-facing interface that can cater for
> both block and object store? For object store I can use Swift via the Horizon
> dashboard, but for block store, I'm not sure how.
>
> Thanks.
>
>
> /Vlad
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade to INFERNALIS

2016-03-01 Thread Francois Lafont
Hi,

On 02/03/2016 00:12, Garg, Pankaj wrote:

> I have upgraded my cluster from 0.94.4 as recommended to the just released 
> Infernalis (9.2.1) Update directly (skipped 9.2.0).
> I installed the packaged on each system, manually (.deb files that I built).
> 
> After that I followed the steps :
> 
> Stop ceph-all
> chown -R  ceph:ceph /var/lib/ceph
> start ceph-all

Ok, and the journals?

> I am still getting errors on starting OSDs.
> 
> 2016-03-01 22:44:45.991043 7fa185f000 -1 filestore(/var/lib/ceph/osd/ceph-69) 
> mount failed to open journal /var/lib/ceph/osd/ceph-69/journal: (13) 
> Permission denied

I suppose your journal is a symlink which targets a raw partition, correct?
In this case, the ceph Unix account currently seems to be unable to read and
write to this partition. If this partition is /dev/sdb2 (for instance), you
have to set the Unix rights on this "file" /dev/sdb2 (manually or via a udev rule).
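
For example (sdb2 is only a placeholder, adjust to your journal device):

  chown ceph:ceph /dev/sdb2

or, to make it persist across reboots, a udev rule such as
/etc/udev/rules.d/90-ceph-journal.rules containing:

  KERNEL=="sdb2", OWNER="ceph", GROUP="ceph", MODE="0660"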

> 2016-03-01 22:44:46.001112 7fa185f000 -1 osd.69 0 OSD:init: unable to mount 
> object store
> 2016-03-01 22:44:46.001128 7fa185f000 -1  ** ERROR: osd init failed: (13) 
> Permission denied
> 
> 
> What am I missing?

I think you forgot to set the Unix rights on the journal partitions. The ceph
account must be able to read/write in /var/lib/ceph/osd/$cluster-$id/ _and_ in
the journal partitions too.

Regards.

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrade to INFERNALIS

2016-03-01 Thread Garg, Pankaj
Hi,
I have upgraded my cluster from 0.94.4, as recommended, directly to the just
released Infernalis (9.2.1) update (skipping 9.2.0).
I installed the packages on each system manually (.deb files that I built).

After that I followed the steps :

Stop ceph-all
chown -R  ceph:ceph /var/lib/ceph
start ceph-all


I am still getting errors on starting OSDs.

2016-03-01 22:44:45.991043 7fa185f000 -1 filestore(/var/lib/ceph/osd/ceph-69) 
mount failed to open journal /var/lib/ceph/osd/ceph-69/journal: (13) Permission 
denied
2016-03-01 22:44:46.001112 7fa185f000 -1 osd.69 0 OSD:init: unable to mount 
object store
2016-03-01 22:44:46.001128 7fa185f000 -1  ** ERROR: osd init failed: (13) 
Permission denied


What am I missing?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] blocked i/o on rbd device

2016-03-01 Thread Randy Orr
Hello,

I am running the following:

ceph version 9.2.0 (bb2ecea240f3a1d525bcb35670cb07bd1f0ca299)
ubuntu 14.04 with kernel 3.19.0-49-generic #55~14.04.1-Ubuntu SMP

For this use case I am mapping and mounting an rbd using the kernel client
and exporting the ext4 filesystem via NFS to a number of clients.

Once or twice a week we've seen disk io "stuck" or "blocked" on the rbd
device. When this happens iostat shows avgqu-sz at a constant number with
utilization at 100%. All i/o operations via NFS blocks, though I am able to
traverse the filesystem locally on the nfs server and read/write data. If I
wait long enough the device will eventually recover and avgqu-sz goes to
zero.

The only issue I could find that was similar to this is:
http://tracker.ceph.com/issues/8818 - However, I am not seeing the error
messages described and I am running a more recent version of the kernel
that should contain the fix from that issue. So, I assume this is likely a
different problem.

The ceph cluster reports as healthy the entire time, all pgs up and in,
there was no scrubbing going on, no osd failures or anything like that.

I ran echo t > /proc/sysrq-trigger and the output is here:
https://gist.github.com/anonymous/89c305443080149e9f45

 Any ideas on what could be going on here? Any additional information I can
provide?

Thanks,
Randy Orr
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Manual or fstab mount on Ceph FS

2016-03-01 Thread Jose M
Hi guys, easy question. 

If I need to mount a CephFS on a client (manual mount or fstab), but this
client won't be part of the ceph cluster (neither an OSD nor a monitor node),
do I still have to run the "ceph-deploy install ceph-client" command from the
ceph admin node, or is there another way?

I mean, what are the minimal requirements to get "mount -t ceph 1.2.3.4:/
mountpoint" to work?

The OS on my client is Ubuntu 14.04 LTS.

Thanks in advance,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] systemd & sysvinit scripts mix ?

2016-03-01 Thread Ken Dreyer
In theory the RPM should contain either the init script, or the
systemd .service files, but not both.

If that's not the case, you can file a bug @ http://tracker.ceph.com/
. Patches are even better!

- Ken

On Tue, Mar 1, 2016 at 2:36 AM, Florent B  wrote:
> By the way, why is the /etc/init.d/ceph script packaged in the Infernalis "ceph"
> package not the same script as the one in git
> (https://github.com/ceph/ceph/blob/master/systemd/ceph)?
> Is it expected? I think there's too much mixing between the two init systems...
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] babeltrace and lttng-ust headed to EPEL 7

2016-03-01 Thread Ken Dreyer
lttng is destined for EPEL 7, so we will finally have lttng
tracepoints in librbd for our EL7 Ceph builds, as we've done with the
EL6 builds.

https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-200bd827c6
https://bodhi.fedoraproject.org/updates/FEDORA-EPEL-2016-8c74b0b27f

If you are running RHEL 7 or CentOS 7 with EPEL enabled, please try
those two packages out on a test machine. If they don't cause issues
for you, add positive karma in Fedora's Bodhi tool (note that you have
to log in for it to take effect). Karma points will allow the updates
to land in EPEL sooner than the normal waiting period (2 weeks).

- Ken
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread John Spray
On Tue, Mar 1, 2016 at 11:41 AM, Francois Lafont  wrote:
> Hi,
>
> On 01/03/2016 10:32, John Spray wrote:
>
>> As Zheng has said, that last number is the "max_mds" setting.
>
> And what is the meaning of the first and the second number below?
>
> mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>^ ^

Your whitespace got lost here I think, but I guess you're talking
about the 1/1 part.

The shorthand MDS status is up/in/max_mds
(https://github.com/ceph/ceph/blob/master/src/mds/MDSMap.cc#L248)

up: how many daemons are up and holding a rank (they may be active or
replaying, etc)
in: how many ranks exist in the MDS cluster
max_mds: if there are this many MDSs already, new daemons will be made
standbys instead of having ranks created for them.

On single-active-daemon systems, this is really just going to be 1/1/1
or 0/1/1 for whether you have an up MDS or not.

John

>
> --
> François Lafont
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Replacing OSD drive without remapping pg's

2016-03-01 Thread Robert LeBlanc
-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

With a fresh disk, you will need to remove the old key in ceph (ceph
auth del osd.X) and the old osd (ceph osd rm X), but I think you can
leave the CRUSH map alone (don't do ceph osd crush rm osd.X) so that
there isn't any additional data movement (if there aren't any
available OSD numbers less than the OSD being replaced, it will get
the same ID. There may also be a way to specify an ID, but I haven't
used it). Then when you add the new disk in, it only backfills what
the previous disk had, unless the size is different, then it will take
on more or less and shuffle some things around the cluster.
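
In command form (with X being the id of the OSD you are replacing), that is
roughly:

  ceph auth del osd.X     # remove the old key
  ceph osd rm X           # remove the old OSD entry
  # do NOT run 'ceph osd crush rm osd.X', so the CRUSH entry stays put
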
-BEGIN PGP SIGNATURE-
Version: Mailvelope v1.3.6
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJW1cZBCRDmVDuy+mK58QAAm9EQAMVOnCOrBkqGaczy5+ds
yplotd9kKt/eyhp1nSgPJD+4RdOVQjoL4VVLtCfApXcMfxHkW/vBjpOWD1Bh
l14NDjCzpkXM5HpHqQkiel/7thcN45u/Z7wSX8T+x9ontYn1Bv0CfI/6qaFb
DmIYdAGjdLgWKpORyeN1WjgrU5DzUbCMHw/3sLfieVpoYsh91dMuxt33366z
mcMQ6RYIE/5xpm8LkTsjYkmnl7Xes5fGsIAlx6kJDHpAoBBWEfstjgtCXIBt
PgDnBJ/SwisAQKXuQOZg87/3OE+qFQUyILwFE3USD3ugx8xvo1aUGnerY/mT
8rUNfFLCPLhdiAp1fr2kkQW/SfV7spkNkZ/v99J/9dEwSj2pgJ7iHMGNr/Em
K3oLezrm7NO2RHsMrn/pz82bO1CSzHrRQ5Aq7Re2r48zYeFxSgvcbMk6Ogzh
rDPb2q+QEw/UbIuotl09ab3OGCjzXxhfDIQ44iEUEj0l2Cl5MQQcakdYakoC
WCPaqIN7ocqiWnQPY/RnSXuhUgsd8uTBtxcXtHp+y0feAf/80nxc3dFWDfiK
8sKmt+rHoBQKQz0yhc0A0YqM8vnWYatVrVh1+SZe7iJE3/qyglNFmbJQ0O54
au/AJ7OqEy1MnJ06fIaLbSIQMXXMWdEqcib2gIKeunhLDkwoUbi+JRJLBY5X
ITts
=yIjM
-END PGP SIGNATURE-

Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Mon, Feb 29, 2016 at 10:29 PM, Lindsay Mathieson
 wrote:
> I was looking at replacing an osd drive in place as per the procedure here:
>
> http://www.spinics.net/lists/ceph-users/msg05959.html
>
> "If you are going to replace the drive immediately, set the “noout” flag.
> Take the OSD “down” and replace drive.  Assuming it is mounted in the same
> place as the bad drive, bring the OSD back up.  This will replicate exactly
> the same PGs the bad drive held back to the replacement drive."
>
>
>
> But the new drive mount will be blank - what happens with the journal,
> keyring etc? does starting the OSD process recreate them automatically?
>
>
> thanks,
>
> --
> Lindsay Mathieson
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph RGW NFS

2016-03-01 Thread Daniel Gryniewicz

On 02/28/2016 08:36 PM, David Wang wrote:

Hi All,
 How is the progress of NFS on RGW? Is it released in Infernalis? The
NFS on RGW work is described at
http://tracker.ceph.com/projects/ceph/wiki/RGW_-_NFS




The FSAL has been integrated into upstream Ganesha 
(https://github.com/nfs-ganesha/nfs-ganesha/tree/next).  It's in testing 
at this point, and so not production ready.  It will be released with 
Jewel (has been merged into master), and may or may not be backported at 
some point.


Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache did not help improve performance

2016-03-01 Thread Adrien Gillard
As Tom stated, RBD cache only works if your client is using librbd (KVM
clients for instance).
With the kernel RBD client, one of the parameters you can tune to optimize
sequential reads is increasing /sys/class/block/rbd4/queue/read_ahead_kb.
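
For example, to bump it to 4 MB (rbd4 being whatever device you mapped):

  echo 4096 > /sys/class/block/rbd4/queue/read_ahead_kb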

Adrien



On Tue, Mar 1, 2016 at 12:48 PM, min fang  wrote:

> I can use the following command to change parameter, for example as the
> following,  but not sure whether it will work.
>
>  ceph --admin-daemon /var/run/ceph/ceph-mon.openpower-0.asok config set
> rbd_readahead_disable_after_bytes 0
>
> 2016-03-01 15:07 GMT+08:00 Tom Christensen :
>
>> If you are mapping the RBD with the kernel driver then you're not using
>> librbd so these settings will have no effect I believe.  The kernel driver
>> does its own caching but I don't believe there are any settings to change
>> its default behavior.
>>
>>
>> On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo  wrote:
>>
>>> You may want to set "ioengine=rbd", I guess.
>>>
>>> Cheers,
>>>
>>> - Original Message -
>>> From: "min fang" 
>>> To: "ceph-users" 
>>> Sent: Tuesday, March 1, 2016 1:28:54 PM
>>> Subject: [ceph-users]  rbd cache did not help improve performance
>>>
>>> Hi, I set the following parameters in ceph.conf
>>>
>>> [client]
>>> rbd cache=true
>>> rbd cache size= 25769803776
>>> rbd readahead disable after byte=0
>>>
>>>
>>> map a rbd image to a rbd device then run fio testing on 4k read as the
>>> command
>>> ./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread -rw=read
>>> -ioengine=aio -bs=4K -size=500G -numjobs=32 -runtime=300 -group_reporting
>>> -name=mytest2
>>>
>>> Compared the result with setting rbd cache=false and enable cache model,
>>> I did not see performance improved by librbd cache.
>>>
>>> Is my setting not right, or it is true that ceph librbd cache will not
>>> have benefit on 4k seq read?
>>>
>>> thanks.
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS memory sizing

2016-03-01 Thread Yan, Zheng
On Tue, Mar 1, 2016 at 7:28 PM, Dietmar Rieder
 wrote:
> Dear ceph users,
>
>
> I'm in the very initial phase of planning a ceph cluster an have a
> question regarding the RAM recommendation for an MDS.
>
> According to the ceph docs the minimum amount of RAM should be "1 GB
> minimum per daemon". Is this per OSD in the cluster or per MDS in the
> cluster?
>
> I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3
> ceph-msd on these  machines as well. The raw capacity of the cluster
> should be ~1.9PB. Would 64GB of RAM then be enough for the
> ceph-mon/ceph-msd nodes?
>

Each file inode in the MDS uses about 2k of memory (it's not related to file
size).  MDS memory usage depends on how large the active file set is.
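
As a rough illustration with that figure: a working set of 5 million active
inodes would need on the order of 5,000,000 x 2 KB ≈ 10 GB of RAM for the MDS
cache alone.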

Regards
Yan, Zheng

> Thanks
>   Dietmar
>
> --
> _
> D i e t m a r  R i e d e r
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MDS memory sizing

2016-03-01 Thread Simon Hallam
Hi Dietmar,

I asked the same question not long ago, this this may be relevant to you:
http://www.spinics.net/lists/ceph-users/msg24359.html

Cheers,

Si

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Dietmar Rieder
> Sent: 01 March 2016 11:30
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] MDS memory sizing
> 
> Dear ceph users,
> 
> 
> I'm in the very initial phase of planning a ceph cluster an have a
> question regarding the RAM recommendation for an MDS.
> 
> According to the ceph docs the minimum amount of RAM should be "1 GB
> minimum per daemon". Is this per OSD in the cluster or per MDS in the
> cluster?
> 
> I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3
> ceph-msd on these  machines as well. The raw capacity of the cluster
> should be ~1.9PB. Would 64GB of RAM then be enough for the
> ceph-mon/ceph-msd nodes?
> 
> Thanks
>   Dietmar
> 
> --
> _
> D i e t m a r  R i e d e r



Please visit our new website at www.pml.ac.uk and follow us on Twitter  
@PlymouthMarine

Winner of the Environment & Conservation category, the Charity Awards 2014.

Plymouth Marine Laboratory (PML) is a company limited by guarantee registered 
in England & Wales, company number 4178503. Registered Charity No. 1091222. 
Registered Office: Prospect Place, The Hoe, Plymouth  PL1 3DH, UK. 

This message is private and confidential. If you have received this message in 
error, please notify the sender and remove it from your system. You are 
reminded that e-mail communications are not secure and may contain viruses; PML 
accepts no liability for any loss or damage which may be caused by viruses.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd cache did not help improve performance

2016-03-01 Thread min fang
I can use the following command to change the parameter, for example as
follows, but I'm not sure whether it will work.

 ceph --admin-daemon /var/run/ceph/ceph-mon.openpower-0.asok config set
rbd_readahead_disable_after_bytes 0

2016-03-01 15:07 GMT+08:00 Tom Christensen :

> If you are mapping the RBD with the kernel driver then you're not using
> librbd so these settings will have no effect I believe.  The kernel driver
> does its own caching but I don't believe there are any settings to change
> its default behavior.
>
>
> On Mon, Feb 29, 2016 at 9:36 PM, Shinobu Kinjo  wrote:
>
>> You may want to set "ioengine=rbd", I guess.
>>
>> Cheers,
>>
>> - Original Message -
>> From: "min fang" 
>> To: "ceph-users" 
>> Sent: Tuesday, March 1, 2016 1:28:54 PM
>> Subject: [ceph-users]  rbd cache did not help improve performance
>>
>> Hi, I set the following parameters in ceph.conf
>>
>> [client]
>> rbd cache=true
>> rbd cache size= 25769803776
>> rbd readahead disable after byte=0
>>
>>
>> map a rbd image to a rbd device then run fio testing on 4k read as the
>> command
>> ./fio -filename=/dev/rbd4 -direct=1 -iodepth 64 -thread -rw=read
>> -ioengine=aio -bs=4K -size=500G -numjobs=32 -runtime=300 -group_reporting
>> -name=mytest2
>>
>> Compared the result with setting rbd cache=false and enable cache model,
>> I did not see performance improved by librbd cache.
>>
>> Is my setting not right, or it is true that ceph librbd cache will not
>> have benefit on 4k seq read?
>>
>> thanks.
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Francois Lafont
Hi,

On 01/03/2016 10:32, John Spray wrote:

> As Zheng has said, that last number is the "max_mds" setting.

And what is the meaning of the first and the second number below?

mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
   ^ ^

-- 
François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MDS memory sizing

2016-03-01 Thread Dietmar Rieder
Dear ceph users,


I'm in the very initial phase of planning a ceph cluster and have a
question regarding the RAM recommendation for an MDS.

According to the ceph docs the minimum amount of RAM should be "1 GB
minimum per daemon". Is this per OSD in the cluster or per MDS in the
cluster?

I plan to run 3 ceph-mon on 3 dedicated machines and would like to run 3
ceph-mds on these machines as well. The raw capacity of the cluster
should be ~1.9PB. Would 64GB of RAM then be enough for the
ceph-mon/ceph-mds nodes?

Thanks
  Dietmar

-- 
_
D i e t m a r  R i e d e r



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] omap support with erasure coded pools

2016-03-01 Thread Puerta Treceno, Jesus Ernesto (Nokia - ES)
Hi cephers,

It seems that explicit omap insertions are not supported by EC pools (errno 
EOPNOTSUPP):
  $ rados -p  setomapval 'dummy_obj' 'test_key' 'test_value'
  error setting omap value cdvr_ec/dummy/test: (95) Operation not supported

When trying the same with replicated pools, the above command succeeded. 
Besides, storing xattrs on omap does work even for EC pools. Having a look at 
code, it seems that pg_pool_t struct (osd_types.h) "supports_omap()" returns 
false on EC pools.

Does anyone know:

-  Why is omap storage not supported in EC pools? Is there any design
limitation?

-  How, then, is storage of xattrs in omap supported by EC pools?

-  Is massive (thousands of keys) omap storage through xattrs
(bypassing FS storage) efficient?

I'll try to find out how the libradosfs guys manage to circumvent this limitation
and store metadata in EC pools (perhaps relying on replicated pools for omaps).

Thanks a lot!



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Shinobu Kinjo
Thanks, John.
Your additional explanation would be much help for the community.

Cheers,
S

- Original Message -
From: "John Spray" 
To: "1" <10...@candesoft.com>
Cc: "ceph-users" 
Sent: Tuesday, March 1, 2016 6:32:34 PM
Subject: Re: [ceph-users] Cannot mount cephfs after some disaster recovery

On Tue, Mar 1, 2016 at 3:51 AM, 1 <10...@candesoft.com> wrote:
> Hi,
> I ran into trouble mounting cephfs after doing some disaster recovery
> following the official
> document (http://docs.ceph.com/docs/master/cephfs/disaster-recovery).
>
> Now when I try to mount the cephfs, I get "mount error 5 = Input/output
> error".
> When run "ceph -s" on clusters, it print like this:
>  cluster 15935dde-1d19-486e-9e1c-67414f9927f6
>  health HEALTH_OK
>  monmap e1: 4 mons at
> {HK-IDC1-10-1-72-151=172.17.17.151:6789/0,HK-IDC1-10-1-72-152=172.17.17.152:6789/0,HK-IDC1-10-1-72-153=172.17.17.153:6789/0,HK-IDC1-10-1-72-160=10.1.72.160:6789/0}
> election epoch 528, quorum 0,1,2,3
> HK-IDC1-10-1-72-160,HK-IDC1-10-1-72-151,HK-IDC1-10-1-72-152,HK-IDC1-10-1-72-153
>  mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>  osdmap e10536: 108 osds: 108 up, 108 in
> flags sortbitwise
>   pgmap v424957: 6564 pgs, 3 pools, 3863 GB data, 67643 kobjects
> 8726 GB used, 181 TB / 189 TB avail
> 6560 active+clean
>3 active+clean+scrubbing+deep
>1 active+clean+scrubbing
>
>  It seems there should be "1/1/1 up" at mdsmap instead of "1/1/0 up" and
> I really don't know what the last number mean.

As Zheng has said, that last number is the "max_mds" setting.  It's a
little bit weird the way that "fs reset" leaves it at zero, but it
shouldn't be causing any problems (you still have one active MDS
daemon here).

>  And there is cephfs if I run "ceph fs ls" which print this:
>
>  name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
> ]
>
>  I try my best to Google such problem however i get nothing. And I still
> want to know if i can bring the cephfs back. So does any one have ideas?
>
>  Oh, I do the disaster recovery because I get "mdsmap e21012: 0/1/1 up,
> 1 up:standby, 1 damaged" at first. And to bring the fs back to work, I do
> "JOURNAL TRUNCATION", "MDS TABLE WIPES", "MDS MAP RESET". However I think
> there must exist (and most) files that their metadata have been saved at
> OSDs (metadata pool, in RADOS). I just want to get them.

*before* trying to run any disaster recovery tools, you must diagnose
what is actually wrong with the filesystem.  It is too late for that
now, but I say it anyway so that people reading this mailing list will
be reminded.

1. Go look in your logs to see what caused the MDS to go damaged: the
cluster log should indicate when it happened, and then you can go and
look at your MDS daemon logs to see what was going on that caused it.
2. Go look in your logs to see what is going on now when you try to
mount and get EIO.  Either the client logs, or the MDS logs, or both
should contain some clues.
3. Hopefully you followed the first section on the page you linked,
which says to make a backup of your journal (see the command noted after
this list).  Keep that backup somewhere safe for the moment.
4. Keep a careful log of which commands you run from this point onwards.
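
For reference, the journal backup mentioned in step 3 is the one from the
start of that disaster-recovery page, along the lines of:

  cephfs-journal-tool journal export backup.bin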

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] s3 bucket creation time

2016-03-01 Thread Abhishek Varshney
I once faced a similar issue. Did you try increasing the rgw log
level to see what's happening? In my case, it was a lot of GC activity
on the rgw cache that was causing the slow operations.
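
Something along these lines is what I mean -- a sketch only; the section
name below is just an example, use whatever your gateway is called in
ceph.conf:

  [client.radosgw.gateway]
      debug rgw = 20
      debug ms = 1

  # or, without a restart, via the gateway's admin socket:
  # ceph daemon client.radosgw.gateway config set debug_rgw 20

Then time a bucket creation and see where the request stalls in the rgw log.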

Thanks
Abhishek

On Tue, Mar 1, 2016 at 3:35 PM, Luis Periquito  wrote:
> On Mon, Feb 29, 2016 at 11:20 PM, Robin H. Johnson  wrote:
>> On Mon, Feb 29, 2016 at 04:58:07PM +, Luis Periquito wrote:
>>> Hi all,
>>>
>>> I have a biggish ceph environment and currently creating a bucket in
>>> radosgw can take as long as 20s.
>>>
>>> What affects the time a bucket takes to be created? How can I improve that 
>>> time?
>>>
>>> I've tried to create in several "bucket-location" with different
>>> backing pools (some of them empty) and the time was the same.
>> How many shards do you have configured for the bucket index?
>
> this is an older cluster (originally bobtail) and we never tuned those
> variables. Just double-checked and it's 0.
> Forgot to say it's running hammer (0.94.5).
>
>>
>> I was recently benchmarking different bucket index shard values, and
>> also saw a notable increase relative to the number of shards.
>>
>> Plus a concerning increase directly correlated to number of keys in the
>> bucket, but I need more data before I post to the lists about it.
>>
>> --
>> Robin Hugh Johnson
>> Gentoo Linux: Developer, Infrastructure Lead, Foundation Trustee
>> E-Mail : robb...@gentoo.org
>> GnuPG FP   : 11ACBA4F 4778E3F6 E4EDF38E B27B944E 34884E85
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier weirdness

2016-03-01 Thread Nick Fisk
Interesting... see below

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Christian Balzer
> Sent: 01 March 2016 08:20
> To: ceph-users@lists.ceph.com
> Cc: Nick Fisk 
> Subject: Re: [ceph-users] Cache tier weirdness
> 
> 
> 
> Talking to myself again ^o^, see below:
> 
> On Sat, 27 Feb 2016 01:48:49 +0900 Christian Balzer wrote:
> 
> >
> > Hello Nick,
> >
> > On Fri, 26 Feb 2016 09:46:03 - Nick Fisk wrote:
> >
> > > Hi Christian,
> > >
> > > > -Original Message-
> > > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On
> > > > Behalf Of Christian Balzer
> > > > Sent: 26 February 2016 09:07
> > > > To: ceph-users@lists.ceph.com
> > > > Subject: [ceph-users] Cache tier weirdness
> > > >
> > > >
> > > > Hello,
> > > >
> > > > still my test cluster with 0.94.6.
> > > > It's a bit fuzzy, but I don't think I saw this with Firefly, but
> > > > then
> > > again that is
> > > > totally broken when it comes to cache tiers (switching between
> > > > writeback and forward mode).
> > > >
> > > > goat is a cache pool for rbd:
> > > > ---
> > > > # ceph osd pool ls detail
> > > > pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 2
> > > > object_hash
> > > rjenkins
> > > > pg_num 512 pgp_num 512 last_change 11729 lfor 11662 flags
> > > > hashpspool tiers 9 read_tier 9 write_tier 9 stripe_width 0
> > > >
> > > > pool 9 'goat' replicated size 1 min_size 1 crush_ruleset 3
> > > > object_hash
> > > rjenkins
> > > > pg_num 128 pgp_num 128 last_change 11730 flags
> > > > hashpspool,incomplete_clones tier_of 2 cache_mode writeback
> > > > target_bytes 524288000 hit_set bloom{false_positive_probability:
> > > > 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > > > ---
> > > >
> > > > Initial state is this:
> > > > ---
> > > > # rados df
> > > > pool name KB  objects   clones degraded
> > > unfound   rd
> > > > rd KB   wrwr KB
> > > > goat  34  42900
> > > 0 1051  4182046   145803
> > > > 10617422
> > > > rbd1640807024074700
> > > 0   419664 71142697
> > > > 4430922531299267
> > > >   total used   59946106041176
> > > >   total avail 5301740284
> > > >   total space 5940328912
> > > > ---
> > > >
> > > > First we put some data in there with "rados -p rbd  bench 20 write
> > > > -t 32 --no-cleanup"
> > > > which easily exceeds the target bytes of 512MB and gives us:
> > > > ---
> > > > pool name KB  objects
> > > > goat  356386  372
> > > > ---
> > > >
> > > > For starters, that's not the number I would have expected given
> > > > how this configured:
> > > > cache_target_dirty_ratio: 0.5
> > > > cache_target_full_ratio: 0.9
> > > >
> > > > Lets ignore (but not forget) that discrepancy for now.
> > >
> > > One of the things I have noticed is that whilst the target_max_bytes
> > > is set per pool, its actually acted on per PG. So each PG will
> > > flush/evict based on its share of the pool capacity. Depending on
> > > where data resides, PG's will normally have differing amount of data
> > > stored which leads to inconsistent cache pool flush/eviction limits.
> > > I believe there is also a "slop" factor in the cache code so that
> > > the caching agents are not always working on hard limits. I think
> > > with artificially small cache sizes, both of these cause adverse effects.
> > >
> > Interesting, that goes a long way to explain this mismatch.
> > It is probably a spawn of the same logic that warns about too many or
> > little PGs per OSD in Hammer by averaging the numbers, totally
> > ignoring the actual usage per OSD.
> >
> 
> I did that test with increased target_max_bytes to 50GB and had about the
> same odd ratio.
> Then I thought, hmm, slightly low number of PGs for this pool, increased it
> from 128 to 512.
> That dropped things even further, from 36GB to about 25GB (or half the
> configured target) for the eviction threshold.

I wonder if there is a way to use du to walk the PG directory tree and
confirm whether any of the PGs are sitting at the 80% mark. If my rough
maths are correct, each PG should evict when it has around 80-90MB in it.

Actually, I wonder if that is the problem: there is not a lot of capacity
per PG if you have 512 PGs over 50G. When you are dealing with 4MB objects,
you only need a few unbalanced ones to shift the percentages by quite a
bit. That might explain why it got worse when you increased the number of
PGs.
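
Something like this on the cache-tier OSD hosts should show it (a rough
sketch, assuming FileStore OSDs under the default /var/lib/ceph/osd path
and that the cache pool is still pool 9 from your ls detail output):

  # per-PG on-disk usage for the cache pool on this host, smallest first
  du -sk /var/lib/ceph/osd/ceph-*/current/9.*_head | sort -n

  # rough budget: 50G / 512 PGs ~= 100MB per PG, so with
  # cache_target_full_ratio 0.9 each PG would start evicting around 90MB,
  # and a few skewed 4MB objects are already several percent of that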

> 
> I hope in real life it will fare a little better because when dealing with
> HDD based OSDs people are more forgiving when it comes to wasting
> significant ratios of space by default (full and near_full) and even more
> due to uneven distribution.
> 
> In the case of a quite expensive SSD based cache pool, adding the 

Re: [ceph-users] Cannot mount cephfs after some disaster recovery

2016-03-01 Thread Yan, Zheng
On Tue, Mar 1, 2016 at 11:51 AM, 1 <10...@candesoft.com> wrote:
> Hi,
> I meet a trouble on mount the cephfs after doing some disaster recovery
> introducing by official
> document(http://docs.ceph.com/docs/master/cephfs/disaster-recovery).
> Now when I try to mount the cephfs, I get "mount error 5 = Input/output
> error".
> When run "ceph -s" on clusters, it print like this:
>  cluster 15935dde-1d19-486e-9e1c-67414f9927f6
>  health HEALTH_OK
>  monmap e1: 4 mons at
> {HK-IDC1-10-1-72-151=172.17.17.151:6789/0,HK-IDC1-10-1-72-152=172.17.17.152:6789/0,HK-IDC1-10-1-72-153=172.17.17.153:6789/0,HK-IDC1-10-1-72-160=10.1.72.160:6789/0}
> election epoch 528, quorum 0,1,2,3
> HK-IDC1-10-1-72-160,HK-IDC1-10-1-72-151,HK-IDC1-10-1-72-152,HK-IDC1-10-1-72-153
>  mdsmap e21038: 1/1/0 up {0=HK-IDC1-10-1-72-160=up:active}
>  osdmap e10536: 108 osds: 108 up, 108 in
> flags sortbitwise
>   pgmap v424957: 6564 pgs, 3 pools, 3863 GB data, 67643 kobjects
> 8726 GB used, 181 TB / 189 TB avail
> 6560 active+clean
>3 active+clean+scrubbing+deep
>1 active+clean+scrubbing
>
>  It seems there should be "1/1/1 up" at mdsmap instead of "1/1/0 up" and
> I really don't know what the last number mean.
>  And there is cephfs if I run "ceph fs ls" which print this:
>
>  name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data
> ]
>
>  I try my best to Google such problem however i get nothing. And I still
> want to know if i can bring the cephfs back. So does any one have ideas?
>
>  Oh, I do the disaster recovery because I get "mdsmap e21012: 0/1/1 up,
> 1 up:standby, 1 damaged" at first. And to bring the fs back to work, I do
> "JOURNAL TRUNCATION", "MDS TABLE WIPES", "MDS MAP RESET". However I think
> there must exist (and most) files that their metadata have been saved at
> OSDs (metadata pool, in RADOS). I just want to get them.

try running command "ceph mds set max_mds 1"
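
i.e. something like this, then check the result:

  ceph mds set max_mds 1
  ceph mds stat    # the mdsmap line should now show 1/1/1 up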

>
>   Thanks.
>
> Yingdi Guo
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache tier weirdness

2016-03-01 Thread Christian Balzer


Talking to myself again ^o^, see below:

On Sat, 27 Feb 2016 01:48:49 +0900 Christian Balzer wrote:

> 
> Hello Nick,
> 
> On Fri, 26 Feb 2016 09:46:03 - Nick Fisk wrote:
> 
> > Hi Christian,
> > 
> > > -Original Message-
> > > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> > > Of Christian Balzer
> > > Sent: 26 February 2016 09:07
> > > To: ceph-users@lists.ceph.com
> > > Subject: [ceph-users] Cache tier weirdness
> > > 
> > > 
> > > Hello,
> > > 
> > > still my test cluster with 0.94.6.
> > > It's a bit fuzzy, but I don't think I saw this with Firefly, but then
> > again that is
> > > totally broken when it comes to cache tiers (switching between
> > > writeback and forward mode).
> > > 
> > > goat is a cache pool for rbd:
> > > ---
> > > # ceph osd pool ls detail
> > > pool 2 'rbd' replicated size 3 min_size 1 crush_ruleset 2 object_hash
> > rjenkins
> > > pg_num 512 pgp_num 512 last_change 11729 lfor 11662 flags hashpspool
> > > tiers 9 read_tier 9 write_tier 9 stripe_width 0
> > > 
> > > pool 9 'goat' replicated size 1 min_size 1 crush_ruleset 3
> > > object_hash
> > rjenkins
> > > pg_num 128 pgp_num 128 last_change 11730 flags
> > > hashpspool,incomplete_clones tier_of 2 cache_mode writeback
> > > target_bytes 524288000 hit_set bloom{false_positive_probability:
> > > 0.05, target_size: 0, seed: 0} 3600s x1 stripe_width 0
> > > ---
> > > 
> > > Initial state is this:
> > > ---
> > > # rados df
> > > pool name KB  objects   clones degraded
> > unfound   rd
> > > rd KB   wrwr KB
> > > goat  34  42900
> > 0 1051  4182046   145803
> > > 10617422
> > > rbd1640807024074700
> > 0   419664 71142697
> > > 4430922531299267
> > >   total used   59946106041176
> > >   total avail 5301740284
> > >   total space 5940328912
> > > ---
> > > 
> > > First we put some data in there with
> > > "rados -p rbd  bench 20 write -t 32 --no-cleanup"
> > > which easily exceeds the target bytes of 512MB and gives us:
> > > ---
> > > pool name KB  objects
> > > goat  356386  372
> > > ---
> > > 
> > > For starters, that's not the number I would have expected given how
> > > this configured:
> > > cache_target_dirty_ratio: 0.5
> > > cache_target_full_ratio: 0.9
> > > 
> > > Lets ignore (but not forget) that discrepancy for now.
> > 
> > One of the things I have noticed is that whilst the target_max_bytes is
> > set per pool, its actually acted on per PG. So each PG will flush/evict
> > based on its share of the pool capacity. Depending on where data
> > resides, PG's will normally have differing amount of data stored which
> > leads to inconsistent cache pool flush/eviction limits. I believe there
> > is also a "slop" factor in the cache code so that the caching agents
> > are not always working on hard limits. I think with artificially small
> > cache sizes, both of these cause adverse effects.
> >
> Interesting, that goes a long way to explain this mismatch.
> It is probably a spawn of the same logic that warns about too many or
> little PGs per OSD in Hammer by averaging the numbers, totally ignoring
> the actual usage per OSD.
> 

I did that test with increased target_max_bytes to 50GB and had about the
same odd ratio. 
Then I thought, hmm, slightly low number of PGs for this pool, increased
it from 128 to 512. 
That dropped things even further, from 36GB to about 25GB (or half the
configured target) for the eviction threshold.
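
For anyone reproducing this, the knobs in question were changed roughly
like so (a sketch, with my pool name and sizes):

  ceph osd pool set goat target_max_bytes 53687091200   # 50GB
  ceph osd pool set goat pg_num 512
  ceph osd pool set goat pgp_num 512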

I hope in real life it will fare a little better because when dealing with
HDD based OSDs people are more forgiving when it comes to wasting
significant ratios of space by default (full and near_full) and even more
due to uneven distribution.

In the case of a quite expensive SSD based cache pool, adding the PG based
calculation and an error ratio of 30-40% is going to cause people to game
the system to get the utilization they paid for. 
With of course potentially disastrous consequences. 

Christian
> My test cluster has 4 OSDs in the cache pool, the production one will
> have 8 (and 2TB of raw data), and while not exactly artificially small I
> foresee lots of parameter fondling.
>  
> > > 
> > > After doing a read with "rados -p rbd  bench 20 rand -t 32" to my
> > > utter bafflement I get:
> > > ---
> > > pool name KB  objects
> > > goat8226  199
> > > ---
> > > 
> > > And after a second read it's all gone, looking at the network traffic
> > > it
> > all
> > > originated from the base pool nodes and got relayed through the node
> > > hosting the cache pool:
> > > ---
> > > pool name KB  objects
> > > goat  34  191
> > > ---
> > > 
> > > I verified that the actual objects are on the base pool with 4MB
> > > each,
> > while
> > >