Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Somnath Roy
Also, there has been a lot of discussion in the community about SSDs that are not
suitable for the Ceph write workload (with filestore), since they handle O_DIRECT/O_DSYNC
style writes poorly. Hopefully your SSDs tolerate that kind of write well.
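(For reference, a common way to check this is a short fio run of direct, synchronous 4k writes at queue depth 1 against the raw device; the device path below is a placeholder and the run overwrites whatever is on it:

fio --filename=/dev/sdX --direct=1 --sync=1 --rw=write --bs=4k --numjobs=1 \
    --iodepth=1 --runtime=60 --time_based --group_reporting --name=journal-test

An SSD that only sustains a few hundred IOPS in this test is likely to struggle with the filestore journal/O_DSYNC write pattern.)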

-Original Message-
From: Somnath Roy
Sent: Friday, January 13, 2017 10:06 AM
To: 'Mohammed Naser'; Wido den Hollander
Cc: ceph-users@lists.ceph.com
Subject: RE: [ceph-users] All SSD cluster performance

<< Both OSDs are pinned to two cores on the system
Is there any reason you are pinning OSDs like that? I would say for an object
workload there is no need to pin OSDs.
With the configuration you mentioned, Ceph doing 4M object PUTs should be
saturating your network first.

Have you run, say, a 4M object GET test to see what bandwidth you are getting?

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Mohammed Naser
Sent: Friday, January 13, 2017 9:51 AM
To: Wido den Hollander
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] All SSD cluster performance


> On Jan 13, 2017, at 12:41 PM, Wido den Hollander <w...@42on.com> wrote:
>
>
>> On 13 January 2017 at 18:39, Mohammed Naser <mna...@vexxhost.com> wrote:
>>
>>
>>
>>> On Jan 13, 2017, at 12:37 PM, Wido den Hollander <w...@42on.com> wrote:
>>>
>>>
>>>> On 13 January 2017 at 18:18, Mohammed Naser <mna...@vexxhost.com> wrote:
>>>>
>>>>
>>>> Hi everyone,
>>>>
>>>> We have a deployment with 90 OSDs at the moment, all SSD, which in my
>>>> opinion is not quite hitting the performance it should; a `rados bench`
>>>> run gives numbers along these lines:
>>>>
>>>> Maintaining 16 concurrent writes of 4194304 bytes to objects of
>>>> size 4194304 for up to 10 seconds or 0 objects
>>>> Object prefix: benchmark_data_bench.vexxhost._30340
>>>> sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
>>>>   0   0 0 0 0 0   -   0
>>>>   1  16   158   142   568.513   568   0.0965336   0.0939971
>>>>   2  16   287   271   542.191   516   0.0291494    0.107503
>>>>   3  16   375   359    478.75   352   0.0892724    0.118463
>>>>   4  16   477   461   461.042   408   0.0243493    0.126649
>>>>   5  16   540   524   419.216   252    0.239123    0.132195
>>>>   6  16   644   628    418.67   416    0.347606    0.146832
>>>>   7  16   734   718   410.281   360   0.0534447    0.147413
>>>>   8  16   811   795   397.487   308   0.0311927     0.15004
>>>>   9  16   879   863   383.537   272   0.0894534    0.158513
>>>>  10  16   980   964   385.578   404   0.0969865    0.162121
>>>>  11   3   981   978   355.613    56    0.798949    0.171779
>>>> Total time run: 11.063482
>>>> Total writes made:  981
>>>> Write size: 4194304
>>>> Object size:4194304
>>>> Bandwidth (MB/sec): 354.68
>>>> Stddev Bandwidth:   137.608
>>>> Max bandwidth (MB/sec): 568
>>>> Min bandwidth (MB/sec): 56
>>>> Average IOPS:   88
>>>> Stddev IOPS:34
>>>> Max IOPS:   142
>>>> Min IOPS:   14
>>>> Average Latency(s): 0.175273
>>>> Stddev Latency(s):  0.294736
>>>> Max latency(s): 1.97781
>>>> Min latency(s): 0.0205769
>>>> Cleaning up (deleting benchmark objects)
>>>> Clean up completed and total clean up time: 3.895293
>>>>
>>>> We’ve verified the network by running `iperf` across both the replication and
>>>> public networks, and it resulted in 9.8Gb/s (10G links for both).  The
>>>> machine that’s running the benchmark doesn’t even saturate its port.  The
>>>> SSDs are S3520 960GB drives which we’ve benchmarked, and they can handle
>>>> the load using fio/etc.  At this point I’m not really sure where to look
>>>> next; is anyone running all-SSD clusters who might be able to share their
>>>> experience?
>>>
>>> I suggest that you search a bit on the ceph-users list since this topic has 
>>> been discussed multiple times in the past and even recently.
>>>
>>> Ceph isn't your average storage system and you have to keep that in

Re: [ceph-users] All SSD cluster performance

2017-01-13 Thread Somnath Roy
<< Both OSDs are pinned to two cores on the system
Is there any reason you are pinning OSDs like that? I would say for an object
workload there is no need to pin OSDs.
With the configuration you mentioned, Ceph doing 4M object PUTs should be
saturating your network first.

Have you run, say, a 4M object GET test to see what bandwidth you are getting?
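(For reference, a minimal sketch of such a GET test with rados bench, assuming a pool named "testpool": first write 4M objects without cleaning them up, then read them back sequentially:

rados bench -p testpool 60 write -b 4194304 -t 16 --no-cleanup
rados bench -p testpool 60 seq -t 16
rados -p testpool cleanup

The seq pass reports the read bandwidth, and the cleanup step removes the benchmark objects afterwards.)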

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Mohammed Naser
Sent: Friday, January 13, 2017 9:51 AM
To: Wido den Hollander
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] All SSD cluster performance


> On Jan 13, 2017, at 12:41 PM, Wido den Hollander  wrote:
>
>
>> On 13 January 2017 at 18:39, Mohammed Naser wrote:
>>
>>
>>
>>> On Jan 13, 2017, at 12:37 PM, Wido den Hollander  wrote:
>>>
>>>
 On 13 January 2017 at 18:18, Mohammed Naser wrote:


 Hi everyone,

 We have a deployment with 90 OSDs at the moment, all SSD, which in my opinion
 is not quite hitting the performance it should; a `rados bench` run gives
 numbers along these lines:

 Maintaining 16 concurrent writes of 4194304 bytes to objects of
 size 4194304 for up to 10 seconds or 0 objects
 Object prefix: benchmark_data_bench.vexxhost._30340
 sec Cur ops   started  finished  avg MB/s  cur MB/s last lat(s)  avg lat(s)
   0   0 0 0 0 0   -   0
   1  16   158   142   568.513   568   0.0965336   0.0939971
  2  16   287   271   542.191   516   0.0291494    0.107503
  3  16   375   359    478.75   352   0.0892724    0.118463
  4  16   477   461   461.042   408   0.0243493    0.126649
  5  16   540   524   419.216   252    0.239123    0.132195
  6  16   644   628    418.67   416    0.347606    0.146832
  7  16   734   718   410.281   360   0.0534447    0.147413
  8  16   811   795   397.487   308   0.0311927     0.15004
  9  16   879   863   383.537   272   0.0894534    0.158513
 10  16   980   964   385.578   404   0.0969865    0.162121
 11   3   981   978   355.613    56    0.798949    0.171779
 Total time run: 11.063482
 Total writes made:  981
 Write size: 4194304
 Object size:4194304
 Bandwidth (MB/sec): 354.68
 Stddev Bandwidth:   137.608
 Max bandwidth (MB/sec): 568
 Min bandwidth (MB/sec): 56
 Average IOPS:   88
 Stddev IOPS:34
 Max IOPS:   142
 Min IOPS:   14
 Average Latency(s): 0.175273
 Stddev Latency(s):  0.294736
 Max latency(s): 1.97781
 Min latency(s): 0.0205769
 Cleaning up (deleting benchmark objects)
 Clean up completed and total clean up time: 3.895293

 We’ve verified the network by running `iperf` across both the replication and
 public networks, and it resulted in 9.8Gb/s (10G links for both).  The
 machine that’s running the benchmark doesn’t even saturate its port.  The
 SSDs are S3520 960GB drives which we’ve benchmarked, and they can handle
 the load using fio/etc.  At this point I’m not really sure where to look
 next; is anyone running all-SSD clusters who might be able to share their
 experience?
>>>
>>> I suggest that you search a bit on the ceph-users list since this topic has 
>>> been discussed multiple times in the past and even recently.
>>>
>>> Ceph isn't your average storage system and you have to keep that in mind. 
>>> Nothing is free in this world. Ceph provides excellent consistency and 
>>> distribution of data, but that also means that you have things like network 
>>> and CPU latency.
>>>
>>> However, I suggest you look up a few threads on this list which have 
>>> valuable tips.
>>>
>>> Wido
>>
>> Thanks for the reply, I’ve actually done quite a lot of research and gone
>> through many of the previous posts.  While I agree 100% with your
>> statement, I’ve found that other people with similar setups have been able
>> to reach numbers that I cannot, which leads me to believe that there is
>> actually an issue here.  They have been able to max out at 1200 MB/s,
>> which is the maximum of their benchmarking host.  We’d like to reach that,
>> and I think that given the specifications of the cluster, it can do it with
>> no problems.
>
> A few tips:
>
> - Disable all logging in Ceph (debug_osd, debug_ms, debug_auth, etc,
> etc)

All logging is configured to the default settings; should those be turned down?
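(For illustration, the kind of ceph.conf block usually meant by "disable all logging" looks like the following; the subsystem list here is only a partial example:

[global]
debug osd = 0/0
debug ms = 0/0
debug auth = 0/0
debug filestore = 0/0
debug journal = 0/0
debug mon = 0/0

The same values can also be applied at runtime with `ceph tell osd.* injectargs` if restarting the OSDs is inconvenient.)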

> - Disable power saving on the CPUs

All disabled as well; everything is running in `performance` mode.
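(A quick, hedged way to double-check that on each node with the standard cpupower tooling:

cat /sys/devices/system/cpu/cpu0/cpufreq/scaling_governor
sudo cpupower frequency-set -g performance

The first command shows the active governor; the second forces the performance governor if any core still reports powersave or ondemand.)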

>
> Can you also share how the 90 OSDs are 

Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
I generally do a 1M sequential write to fill up the device. The block size doesn’t really
matter here, but a bigger block size fills the device faster, and that’s why people use it.

From: V Plus [mailto:v.plussh...@gmail.com]
Sent: Sunday, December 11, 2016 7:03 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks!

One more question: what do you mean by "bigger"?
Do you mean a bigger block size (say, if I will run a read test with bs=4K, do I need to
first write the rbd with bs>4K)? Or a size that is big enough to cover the area where
the test will be executed?


On Sun, Dec 11, 2016 at 9:54 PM, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:
A block needs to be written before it is read, otherwise you will get funny results.
For example, in the case of flash (depending on how the FW is implemented), it will
mostly return 0 if a block has not been written. I have seen some flash FW that is
really inefficient at manufacturing this data (say, 0) for unwritten blocks, and some
that is really fast.
So, to get predictable results you should always be reading blocks that have been
written. If only half of a device's blocks are written and you are doing full-device
random reads, you will get unpredictable/spiky/imbalanced results.
The same applies to rbd: consider it a storage device and the behavior will be
similar. So it is always recommended to precondition (fill up) an rbd image with
big-block sequential writes before you run any synthetic test on it. For the
filestore backend, an added advantage of preconditioning the rbd is that the files
in the filesystem will be created beforehand.
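(As a minimal sketch of such a preconditioning pass, assuming the image is mapped at /dev/rbd0 and may be overwritten:

fio --name=precondition --filename=/dev/rbd0 --rw=write --bs=1M \
    --ioengine=libaio --iodepth=32 --direct=1

With no size given, fio writes through the whole block device once, which is exactly the fill-up described above.)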

Thanks & Regards
Somnath

From: V Plus [mailto:v.plussh...@gmail.com<mailto:v.plussh...@gmail.com>]
Sent: Sunday, December 11, 2016 6:01 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks Somnath!
As you recommended, I executed:
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1

Then the output results look more reasonable!
Could you tell me why??

Btw, the purpose of my run is to test the performance of rbd in ceph. Does my 
case mean that before every test, I have to "initialize" all the images???

Great thanks!!

On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:
Fill up the image with big writes (say 1M) first, before reading, and you should
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
 On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a Ceph cluster with 6 machines (6 OSDs per host).
1. I created 2 images in Ceph and mapped them to another host A (outside the Ceph
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I started two fio jobs to perform a READ test on rbd0 and rbd1 (the fio job
descriptions can be found below):
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt we got bw=1162.7MB/s, and in b.txt we got
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its
limit is 10 Gbps (1.25GB/s).

I suspect it is because of a cache setting, but I am sure that in the file
/etc/ceph/ceph.conf on host A I already added:
[client]
rbd cache = false

Could anyone give me a hint about what is missing, and why?
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
A block needs to be written before it is read, otherwise you will get funny results.
For example, in the case of flash (depending on how the FW is implemented), it will
mostly return 0 if a block has not been written. I have seen some flash FW that is
really inefficient at manufacturing this data (say, 0) for unwritten blocks, and some
that is really fast.
So, to get predictable results you should always be reading blocks that have been
written. If only half of a device's blocks are written and you are doing full-device
random reads, you will get unpredictable/spiky/imbalanced results.
The same applies to rbd: consider it a storage device and the behavior will be
similar. So it is always recommended to precondition (fill up) an rbd image with
big-block sequential writes before you run any synthetic test on it. For the
filestore backend, an added advantage of preconditioning the rbd is that the files
in the filesystem will be created beforehand.

Thanks & Regards
Somnath

From: V Plus [mailto:v.plussh...@gmail.com]
Sent: Sunday, December 11, 2016 6:01 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance is too good (impossible..)...

Thanks Somnath!
As you recommended, I executed:
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd0
dd if=/dev/zero bs=1M count=4096 of=/dev/rbd1

Then the output results look more reasonable!
Could you tell me why??

Btw, the purpose of my run is to test the performance of rbd in ceph. Does my 
case mean that before every test, I have to "initialize" all the images???

Great thanks!!

On Sun, Dec 11, 2016 at 8:47 PM, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:
Fill up the image with big writes (say 1M) first, before reading, and you should
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
 On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a Ceph cluster with 6 machines (6 OSDs per host).
1. I created 2 images in Ceph and mapped them to another host A (outside the Ceph
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I started two fio jobs to perform a READ test on rbd0 and rbd1 (the fio job
descriptions can be found below):
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt we got bw=1162.7MB/s, and in b.txt we got
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its
limit is 10 Gbps (1.25GB/s).

I suspect it is because of a cache setting, but I am sure that in the file
/etc/ceph/ceph.conf on host A I already added:
[client]
rbd cache = false

Could anyone give me a hint about what is missing, and why?
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance is too good (impossible..)...

2016-12-11 Thread Somnath Roy
Fill up the image with big writes (say 1M) first, before reading, and you should
see sane throughput.

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of V Plus
Sent: Sunday, December 11, 2016 5:44 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance is too good (impossible..)...

Hi Guys,
we have a Ceph cluster with 6 machines (6 OSDs per host).
1. I created 2 images in Ceph and mapped them to another host A (outside the Ceph
cluster). On host A, I got /dev/rbd0 and /dev/rbd1.
2. I started two fio jobs to perform a READ test on rbd0 and rbd1 (the fio job
descriptions can be found below):
"sudo fio fioA.job -output a.txt & sudo  fio fioB.job -output b.txt  & wait"
3. After the test, in a.txt we got bw=1162.7MB/s, and in b.txt we got
bw=3579.6MB/s.
The results do NOT make sense because there is only one NIC on host A, and its
limit is 10 Gbps (1.25GB/s).

I suspect it is because of a cache setting, but I am sure that in the file
/etc/ceph/ceph.conf on host A I already added:
[client]
rbd cache = false

Could anyone give me a hint about what is missing, and why?
Thank you very much.

fioA.job:
[A]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd0
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

fioB.job:
[B]
direct=1
group_reporting=1
unified_rw_reporting=1
size=100%
time_based=1
filename=/dev/rbd1
rw=read
bs=4MB
numjobs=16
ramp_time=10
runtime=20

Thanks...
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Wei/Pavan for the response; it seems I need to debug the OSDs to find out
what is causing the slowdown.
I will update the community if I find anything conclusive.

Regards
Somnath

-Original Message-
From: Wei Jin [mailto:wjin...@gmail.com] 
Sent: Monday, October 17, 2016 2:13 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com; ceph-de...@vger.kernel.org
Subject: Re: [ceph-users] OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 3:16 PM, Somnath Roy <somnath@sandisk.com> wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767 
> submit_message osd_op_reply(1463 
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size 
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890 
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.

I think you need to tune the threads' timeout values, as heartbeat messages will be
dropped during a timeout and suicide (the health check will fail).
That's why you observe the 'wrongly marked me down' message while the OSD process is
still alive. See the function OSD::handle_osd_ping().

Also, you could backport this PR
(https://github.com/ceph/ceph/pull/8808) to accelerate the handling of heartbeat
messages.

After that, you may consider tuning grace time.
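(As a hedged sketch, the knobs usually involved look like this in ceph.conf; the values are purely illustrative and the right numbers depend on the cluster:

[osd]
osd heartbeat grace = 30
osd op thread timeout = 30
osd op thread suicide timeout = 300

[mon]
mon osd min down reporters = 5

Raising the grace period and the reporter count makes the monitors slower to mark a busy-but-alive OSD down, at the cost of slower detection of genuinely failed OSDs.)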


>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?
>
> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?
>
> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?
>
> Any help on clarifying this would be very helpful.
>
> Thanks & Regards
> Somnath
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Thanks Piotr, Wido for quick response.

@Wido, yes, I thought of trying those values, but I am seeing in the log messages
that at least 7 OSDs are reporting the failure, so I didn't try. BTW, I found that the
default mon_osd_min_down_reporters is 2, not 1, and the latest master no longer has
mon_osd_min_down_reports. Not sure what it was replaced with.

@Piotr, yes, your PR really helps, thanks! The point that each messenger needs
to respond to HB is confusing to me; I know each thread has an HB timeout value
beyond which it will crash with a suicide timeout, are you talking about that?

Regards
Somnath

-Original Message-
From: Piotr Dałek [mailto:bra...@predictor.org.pl]
Sent: Monday, October 17, 2016 12:52 AM
To: ceph-users@lists.ceph.com; Somnath Roy; ceph-de...@vger.kernel.org
Subject: Re: OSDs are flapping and marked down wrongly

On Mon, Oct 17, 2016 at 07:16:44AM +, Somnath Roy wrote:
> Hi Sage et. al,
>
> I know this issue is reported number of times in community and attributed to 
> either network issue or unresponsive OSDs.
> Recently, we are seeing this issue when our all SSD cluster (Jewel based)  is 
> stressed with large block size and very high QD. Lowering QD it is working 
> just fine.
> We are seeing the lossy connection message like below and followed by the osd 
> marked down by monitor.
>
> 2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
> submit_message osd_op_reply(1463
> rbd_data.55246b8b4567.d633 [set-alloc-hint object_size
> 4194304 write_size 4194304,write 3932160~262144] v222'95890 uv95890
> ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping message
>
> In the monitor log, I am seeing the osd is reported down by peers and 
> subsequently monitor is marking it down.
> OSDs is rejoining the cluster after detecting it is marked down wrongly and 
> rebalancing started. This is hurting performance very badly.
>
> My question is the following.
>
> 1. I have 40Gb network and I am seeing network is not utilized beyond 
> 10-12Gb/s , no network error is reported. So, why this lossy connection 
> message is coming ? what could go wrong here ? Is it network prioritization 
> issue of smaller ping packets ? I tried to gaze ping round time during this 
> and nothing seems abnormal.
>
> 2. Nothing is saturated on the OSD side , plenty of network/memory/cpu/disk 
> is left. So, I doubt my osds are unresponsive but yes it is really busy on IO 
> path. Heartbeat is going through separate messenger and threads as well, so, 
> busy op threads should not be making heartbeat delayed. Increasing osd 
> heartbeat grace is only delaying this phenomenon , but, eventually happens 
> after several hours. Anything else we can tune here ?

There's a bunch of messengers in the OSD code; if ANY of them doesn't respond to
heartbeat messages in a reasonable time, the OSD is marked as down. Since packets are
processed in a FIFO/synchronous manner, overloading an OSD with large I/O will cause
it to time out on at least one messenger.
There was an idea to have heartbeat messages go in an OOB TCP/IP stream and be
processed asynchronously, but I don't know if that went beyond the idea
stage.

> 3. What could be the side effect of big grace period ? I understand that 
> detecting a faulty osd will be delayed, anything else ?

Yes - stalled ops. Assume that the primary OSD goes down and the replicas are still
alive. A big grace period will cause all ops going to that OSD to stall
until that particular OSD is marked down or resumes normal operation.

> 4. I saw if an OSD is crashed, monitor will detect the down osd almost 
> instantaneously and it is not waiting till this grace period. How it is 
> distinguishing between unresponsive and crashed osds ? In which scenario this 
> heartbeat grace is coming into picture ?

This is the effect of my PR#8558 (https://github.com/ceph/ceph/pull/8558),
which causes any OSD that crashes to be immediately marked as down, preventing
stalled I/Os in the most common cases. The grace period is only applied to unresponsive
OSDs (i.e. temporary packet loss, bad cases of network lag, routing issues; in
other words, everything that is known to be at least possible to resolve by
itself in a finite amount of time). OSDs that crash and burn won't respond;
instead, the OS will respond with ECONNREFUSED, indicating that the OSD is not
listening, and in that case the OSD will be immediately marked down.

--
Piotr Dałek
bra...@predictor.org.pl
http://blog.predictor.org.pl

[ceph-users] OSDs are flapping and marked down wrongly

2016-10-17 Thread Somnath Roy
Hi Sage et. al,

I know this issue has been reported a number of times in the community and attributed to
either network issues or unresponsive OSDs.
Recently, we have been seeing it when our all-SSD cluster (Jewel based) is
stressed with large block sizes and a very high QD. With a lower QD it works just
fine.
We are seeing lossy connection messages like the one below, followed by the OSD
being marked down by the monitor.

2016-10-15 14:30:13.957534 7f6297bff700  0 -- 10.10.10.94:6810/2461767
submit_message osd_op_reply(1463 rbd_data.55246b8b4567.d633
[set-alloc-hint object_size 4194304 write_size 4194304,write 3932160~262144]
v222'95890 uv95890 ondisk = 0) v7 remote, 10.10.10.98:0/1174431362, dropping
message

In the monitor log, I see the OSD being reported down by its peers, and the
monitor subsequently marks it down.
The OSD rejoins the cluster after detecting that it was marked down wrongly, and
rebalancing starts. This is hurting performance very badly.
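(As a hedged aside, one stop-gap while investigating is to keep the flapping from triggering data movement by temporarily setting the noout flag:

ceph osd set noout
# investigate, then
ceph osd unset noout

This does not address the wrongly-marked-down OSDs themselves; it only suppresses the resulting rebalancing.)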

My question is the following.

1. I have a 40Gb network and I see it is not utilized beyond 10-12Gb/s,
and no network errors are reported. So why is this lossy connection message coming?
What could go wrong here? Is it a network prioritization issue with the smaller ping
packets? I watched the ping round-trip time during this and nothing seemed
abnormal.

2. Nothing is saturated on the OSD side; plenty of network/memory/CPU/disk headroom is
left. So I doubt my OSDs are unresponsive, but they are really busy on the IO
path. Heartbeats go through a separate messenger and threads as well, so
busy op threads should not be delaying them. Increasing the OSD
heartbeat grace only delays this phenomenon; it eventually happens
after several hours. Is there anything else we can tune here?

3. What could be the side effects of a big grace period? I understand that
detecting a faulty OSD will be delayed; anything else?

4. I saw that if an OSD crashes, the monitor detects the down OSD almost
instantaneously and does not wait for the grace period. How does it
distinguish between unresponsive and crashed OSDs? In which scenario does
the heartbeat grace come into the picture?

Any help on clarifying this would be very helpful.

Thanks & Regards
Somnath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread Somnath Roy
This is the usual feature mismatch issue; the in-box krbd you are using does not
support the features enabled by Jewel.
Try googling the error and I am sure you will find a lot of prior discussion
around it.

From: EP Komarla [mailto:ep.koma...@flextronics.com]
Sent: Tuesday, August 16, 2016 4:15 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Rbd map command doesn't work

Somnath,

Thanks.

I am trying your suggestion; see the commands below. It still doesn't seem to work.

I am missing something here...

Thanks,

- epk

=
[test@ep-c2-client-01 ~]$ rbd create rbd/test1 --size 1G --image-format 1
rbd: image format 1 is deprecated
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1
^C[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$
[test@ep-c2-client-01 ~]$ dmesg|tail -20
[1201954.248195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201954.253365] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201964.274082] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201964.281195] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1201974.298195] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1201974.305300] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204128.917562] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204128.924173] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204138.956737] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204138.964011] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204148.980701] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204148.987892] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204159.004939] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204159.012136] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1204169.028802] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204169.035992] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204476.803192] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1204476.810578] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1204486.821279] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400



From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, August 16, 2016 3:59 PM
To: EP Komarla <ep.koma...@flextronics.com<mailto:ep.koma...@flextronics.com>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Rbd map command doesn't work

The default rbd image format in Jewel is 2, along with a bunch of other
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or set this in ceph.conf under [client] or [global] before creating the
image:
rbd_default_features = 3

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The commands below used to work in
Hammer; now the same commands do not work in Jewel.  I see a message about a
feature set mismatch - what features are we talking about here?  Is this a
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libc

Re: [ceph-users] Rbd map command doesn't work

2016-08-16 Thread Somnath Roy
The default rbd image format in Jewel is 2, along with a bunch of other
features enabled, so you have the following two options:

1. Create a format 1 image with --image-format 1

2. Or set this in ceph.conf under [client] or [global] before creating the
image:
rbd_default_features = 3
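(A third option, offered here only as a hedged aside, is to strip the krbd-unsupported features from an already created image, for example:

rbd feature disable rbd/test1 deep-flatten fast-diff object-map exclusive-lock

After that, only the layering feature remains and the image should map with the older kernel client.)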

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, August 16, 2016 2:52 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Rbd map command doesn't work

All,

I am creating an image and mapping it.  The commands below used to work in
Hammer; now the same commands do not work in Jewel.  I see a message about a
feature set mismatch - what features are we talking about here?  Is this a
known issue in Jewel with a workaround?

Thanks,

- epk

=


[test@ep-c2-client-01 ~]$  rbd create rbd/test1 --size 1G
[test@ep-c2-client-01 ~]$ rbd info test1
rbd image 'test1':
size 1024 MB in 256 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.8146238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:
[test@ep-c2-client-01 ~]$ rbd map rbd/test1
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (13) Permission denied
[test@ep-c2-client-01 ~]$ dmesg|tail
[1197731.547522] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197731.554621] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1197741.571645] libceph: mon2 172.20.60.53:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1197741.578760] libceph: mon2 172.20.60.53:6789 missing required protocol 
features
[1198586.766120] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198586.771248] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[1198596.789453] libceph: mon0 172.20.60.51:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198596.796557] libceph: mon0 172.20.60.51:6789 missing required protocol 
features
[1198606.813825] libceph: mon1 172.20.60.52:6789 feature set mismatch, my 
102b84a842a42 < server's 40102b84a842a42, missing 400
[1198606.820929] libceph: mon1 172.20.60.52:6789 missing required protocol 
features
[test@ep-c2-client-01 ~]$ sudo rbd map rbd/test1


EP KOMARLA,
Email: ep.koma...@flextronics.com
Address: 677 Gibraltor Ct, Building #2, Milpitas, CA 94035, USA
Phone: 408-674-6090 (mobile)


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Somnath Roy
Yes Greg, agreed. I found some corruption during BlueFS replay; it could perhaps also be
caught in more detail if I ran fsck().
I will do it, but in a dev environment the time consumed by fsck() could be a
challenge (though I have no idea how long it takes per TB of data, I have never run
it), considering the number of times we need to restart OSDs.

Thanks & Regards
Somnath

-Original Message-
From: Gregory Farnum [mailto:gfar...@redhat.com]
Sent: Wednesday, August 03, 2016 4:03 PM
To: Somnath Roy
Cc: Stillwell, Bryan J; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

On Wed, Aug 3, 2016 at 3:50 PM, Somnath Roy <somnath@sandisk.com> wrote:
> It is probably better to move to the latest master and reproduce this defect; a lot
> of stuff has changed in between.
> This is a good test case, and I doubt any of us are testing with fsck() enabled on
> mount/unmount.

Given that the allocator keeps changing, running fsck frequently while testing 
is probably a good idea... ;)
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

2016-08-03 Thread Somnath Roy
It is probably better to move to the latest master and reproduce this defect; a lot
of stuff has changed in between.
This is a good test case, and I doubt any of us are testing with fsck() enabled on
mount/unmount.
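(If the build exposes it, the toggle involved is BlueStore's fsck-on-mount option; treat its exact name and availability on a given branch as an assumption to verify:

[osd]
bluestore fsck on mount = true

With that set, every OSD (re)start runs fsck before the store is used, which is slow but would catch this kind of corruption early.)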

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stillwell, Bryan J
Sent: Wednesday, August 03, 2016 3:41 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Multi-device BlueStore OSDs multiple fsck failures

I've been doing some benchmarking of BlueStore in 10.2.2 the last few days and 
have come across a failure that keeps happening after stressing the cluster 
fairly heavily.  Some of the OSDs started failing and attempts to restart them 
fail to log anything in /var/log/ceph/, so I tried starting them manually and 
ran into these error messages:

# /usr/bin/ceph-osd --cluster=ceph -i 4 -f --setuser ceph --setgroup ceph
2016-08-02 22:52:01.190226 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.190340 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.190497 7f97d75e1800 -1 WARNING: experimental feature 
'bluestore' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

starting osd.4 at :/0 osd_data /var/lib/ceph/osd/ceph-4/ 
/var/lib/ceph/osd/ceph-4/journal
2016-08-02 22:52:01.194461 7f97d75e1800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-02 22:52:01.237619 7f97d75e1800 -1 WARNING: experimental feature 
'rocksdb' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

2016-08-02 22:52:01.501405 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  a#20:bac03f87:::4_454:head# nid
67134 already in use
2016-08-02 22:52:01.629900 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/)  9#20:e64f44a7:::4_258:head# nid
78351 already in use
2016-08-02 22:52:01.967599 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 256983760896~1245184 
intersects allocated blocks
2016-08-02 22:52:01.967605 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [256984940544~65536]
2016-08-02 22:52:01.978635 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck free extent 258455044096~196608 
intersects allocated blocks
2016-08-02 22:52:01.978640 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck overlap: [258455175168~65536]
2016-08-02 22:52:01.978647 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) fsck leaked some space; free+used =
[0~252138684416,252138815488~4844945408,256984940544~1470103552,258455175168~5732719067136] != expected 0~5991174242304
2016-08-02 22:52:02.987479 7f97d75e1800 -1
bluestore(/var/lib/ceph/osd/ceph-4/) mount fsck found 5 errors
2016-08-02 22:52:02.987488 7f97d75e1800 -1 osd.4 0 OSD:init: unable to mount 
object store
2016-08-02 22:52:02.987498 7f97d75e1800 -1  ** ERROR: osd init failed: (5) 
Input/output error


Here's another one:

# /usr/bin/ceph-osd --cluster=ceph -i 11 -f --setuser ceph --setgroup ceph
2016-08-03 22:16:49.052319 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.052445 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.052690 7f0e4d949800 -1 WARNING: experimental feature 
'bluestore' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

starting osd.11 at :/0 osd_data /var/lib/ceph/osd/ceph-11/ 
/var/lib/ceph/osd/ceph-11/journal
2016-08-03 22:16:49.056779 7f0e4d949800 -1 WARNING: the following dangerous and 
experimental features are enabled: *
2016-08-03 22:16:49.095695 7f0e4d949800 -1 WARNING: experimental feature 
'rocksdb' is enabled Please be aware that this feature is experimental, 
untested, unsupported, and may result in data corruption, data loss, and/or 
irreparable damage to your cluster.  Do not use feature with important data.

2016-08-03 22:16:49.821451 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/)  6#20:2eed99bf:::4_257:head# nid
72869 already in use
2016-08-03 22:16:49.961943 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck free extent 257123155968~65536 
intersects allocated blocks
2016-08-03 22:16:49.961950 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck overlap: [257123155968~65536]
2016-08-03 22:16:49.962012 7f0e4d949800 -1
bluestore(/var/lib/ceph/osd/ceph-11/) fsck leaked some space; free+used = 

Re: [ceph-users] RocksDB compression

2016-07-28 Thread Somnath Roy
I am using snappy and it is working fine with BlueStore.
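(As a hedged sketch of where that is configured: BlueStore passes a RocksDB option string through the bluestore_rocksdb_options setting, so enabling snappy looks roughly like this in ceph.conf, merged into whatever defaults your build ships with:

[osd]
bluestore rocksdb options = compression=kSnappyCompression

The default option string varies between builds, so check the current value with `ceph daemon osd.N config get bluestore_rocksdb_options` before overriding it.)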

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Thursday, July 28, 2016 2:03 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RocksDB compression

Should work fine AFAIK, let us know if it doesn't. :)

FWIW, the goal at the moment is to make the onode so dense that rocksdb 
compression isn't going to help after we are done optimizing it.

Mark

On 07/28/2016 03:37 PM, Garg, Pankaj wrote:
> Hi,
>
> Has anyone configured compression in RocksDB for BlueStore? Does it work?
>
> Thanks
>
> Pankaj
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] bluestore overlay write failure

2016-07-26 Thread Somnath Roy
BlueStore has evolved a long way and I don’t think we support this overlay
anymore. Please try BlueStore with the latest master.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haitao Wang
Sent: Tuesday, July 26, 2016 7:09 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] bluestore overlay write failure


Hi All



I'm using ceph-10.1.1. When I enable overlay writes, some OSDs go down and out
while I use fio to run a 4K write test against rbd.

The default option is below :

OPTION(bluestore_overlay_max, OPT_INT, 0)



I changed the 0 to 512 so that small writes (smaller than 64K) are processed via
the overlay path, and then some OSDs go down.



Could someone tell me what's wrong?

Thanks!



Kind Regards,

Haitao Wang

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-26 Thread Somnath Roy
<< Ceph performance in general (without read_ahead_kb) will be lower, especially
in all flash, as the requests will be serialized within a PG

I meant to say Ceph sequential performance. Sorry for the spam.

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Tuesday, July 26, 2016 5:08 PM
To: EP Komarla; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph performance pattern

Not exactly, but we are seeing some drop with 256K compared to 64K. This is
with random reads, though, on Ubuntu. We had to bump read_ahead_kb up from the
default 128KB to 512KB to work around that.
But on RHEL we saw all sorts of issues with read_ahead_kb for small-block
random reads, and I think it already defaults to 4MB or so; if so, try reducing
it to 512KB and see.
Generally, for sequential reads, you need to play with read_ahead_kb to achieve
better performance. Ceph performance in general (without read_ahead_kb) will be
lower, especially in all flash, as the requests will be serialized within a PG.
Our test is with all flash, though, so take my comments with a grain of salt in
the case of Ceph + HDD.

Thanks & Regards
Somnath


From: EP Komarla [mailto:ep.koma...@flextronics.com]
Sent: Tuesday, July 26, 2016 4:50 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Ceph performance pattern

Thanks Somnath.

I am running with CentOS7.2.  Have you seen this pattern before?

- epk

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, July 26, 2016 4:44 PM
To: EP Komarla <ep.koma...@flextronics.com<mailto:ep.koma...@flextronics.com>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Ceph performance pattern

Which OS/kernel are you running?
Try setting a bigger read_ahead_kb for sequential runs.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, July 26, 2016 4:38 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Ceph performance pattern

Hi,

I am showing below the fio results for sequential reads on my Ceph cluster.  I am
trying to understand this pattern:

- Why is there a dip in performance for block sizes 32k-256k?
- Is this an expected performance graph?
- Have you seen this kind of pattern before?

[inline image: fio sequential read bandwidth vs. block size]

My cluster details:
Ceph: Hammer release
Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 OSD 
journals on one SSD)
Client network: 10Gbps
Cluster network: 10Gbps
FIO test:
- 2 Client servers
- Sequential Read
- Run time of 600 seconds
- Filesize = 1TB
- 10 rbd images per client
- Queue depth=16

Any ideas on tuning this cluster?  Where should I look first?

Thanks,

- epk


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-26 Thread Somnath Roy
Not exactly, but we are seeing some drop with 256K compared to 64K. This is
with random reads, though, on Ubuntu. We had to bump read_ahead_kb up from the
default 128KB to 512KB to work around that.
But on RHEL we saw all sorts of issues with read_ahead_kb for small-block
random reads, and I think it already defaults to 4MB or so; if so, try reducing
it to 512KB and see.
Generally, for sequential reads, you need to play with read_ahead_kb to achieve
better performance. Ceph performance in general (without read_ahead_kb) will be
lower, especially in all flash, as the requests will be serialized within a PG.
Our test is with all flash, though, so take my comments with a grain of salt in
the case of Ceph + HDD.
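(For reference, a minimal sketch of checking and bumping the value; the device name is a placeholder for the device being read sequentially, e.g. an rbd or OSD data disk:

cat /sys/block/<device>/queue/read_ahead_kb
echo 512 | sudo tee /sys/block/<device>/queue/read_ahead_kb

The setting does not persist across reboots, so it is usually made permanent with a udev rule or an rc.local entry.)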

Thanks & Regards
Somnath


From: EP Komarla [mailto:ep.koma...@flextronics.com]
Sent: Tuesday, July 26, 2016 4:50 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Ceph performance pattern

Thanks Somnath.

I am running with CentOS7.2.  Have you seen this pattern before?

- epk

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, July 26, 2016 4:44 PM
To: EP Komarla <ep.koma...@flextronics.com<mailto:ep.koma...@flextronics.com>>; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Ceph performance pattern

Which OS/kernel are you running?
Try setting a bigger read_ahead_kb for sequential runs.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, July 26, 2016 4:38 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Ceph performance pattern

Hi,

I am showing below fio results for Sequential Read on my Ceph cluster.  I am 
trying to understand this pattern:

- why there is a dip in the performance for block sizes 32k-256k?
- is this an expected performance graph?
- have you seen this kind of pattern before?

[inline image: fio sequential read performance graph (image001.png)]

My cluster details:
Ceph: Hammer release
Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 OSD 
journals on one SSD)
Client network: 10Gbps
Cluster network: 10Gbps
FIO test:
- 2 Client servers
- Sequential Read
- Run time of 600 seconds
- Filesize = 1TB
- 10 rbd images per client
- Queue depth=16

Any ideas on tuning this cluster?  Where should I look first?

Thanks,

- epk


Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph performance pattern

2016-07-26 Thread Somnath Roy
Which OS/kernel you are running with ?
Try setting bigger read_ahead_kb for sequential runs.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of EP 
Komarla
Sent: Tuesday, July 26, 2016 4:38 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph performance pattern

Hi,

I am showing below fio results for Sequential Read on my Ceph cluster.  I am 
trying to understand this pattern:

- why there is a dip in the performance for block sizes 32k-256k?
- is this an expected performance graph?
- have you seen this kind of pattern before?

[inline image: fio sequential read performance graph (image001.png)]

My cluster details:
Ceph: Hammer release
Cluster: 6 nodes (dual Intel sockets) each with 20 OSDs and 4 SSDs (5 OSD 
journals on one SSD)
Client network: 10Gbps
Cluster network: 10Gbps
FIO test:
- 2 Client servers
- Sequential Read
- Run time of 600 seconds
- Filesize = 1TB
- 10 rbd images per client
- Queue depth=16

Any ideas on tuning this cluster?  Where should I look first?

Thanks,

- epk


Legal Disclaimer:
The information contained in this message may be privileged and confidential. 
It is intended to be read only by the individual or entity to whom it is 
addressed or by their designee. If the reader of this message is not the 
intended recipient, you are on notice that any distribution of this message, in 
any form, is strictly prohibited. If you have received this message in error, 
please immediately notify the sender and delete or destroy any copy of this 
message!
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Too much pgs backfilling

2016-07-19 Thread Somnath Roy
The settings are per OSD, and the messages you are seeing are aggregated across the 
cluster, with multiple OSDs doing backfill (working on multiple PGs in 
parallel)..
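
In other words, with osd_max_backfills = 1 the limit is one backfill per OSD, so a 
cluster with many OSDs can legitimately show 15 PGs backfilling at once. A quick 
sanity check on a live OSD (osd.0 is just an example id; this assumes you run it on 
the host where that OSD's admin socket lives):

ceph daemon osd.0 config get osd_max_backfills
# or adjust it at runtime across all OSDs:
ceph tell osd.* injectargs '--osd-max-backfills 1'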

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jimmy 
Goffaux
Sent: Tuesday, July 19, 2016 5:19 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Too much pgs backfilling


Hello,



This is my configuration :

-> "osd_max_backfills": "1",
-> "osd_recovery_threads": "1"
->  "osd_recovery_max_active": "1",
-> "osd_recovery_op_priority": "3",

-> "osd_client_op_priority": "63",



I have run the command: ceph osd crush tunables optimal

after upgrading from Hammer to Jewel.

My cluster is now overloaded: 15 pgs are in active+remapped+backfilling...

Why 15? Is my configuration bad? Normally I should have a max of 1.



Thanks

PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Multi-device BlueStore testing

2016-07-19 Thread Somnath Roy
I don't think ceph-disk has support for separating block.db and block.wal yet (?).
You need to create the cluster manually by running mkfs.
Or, if you have the old mkcephfs script (which is sadly deprecated), you can point it 
at the db/wal paths and it will create the cluster for you. I am using that to 
configure bluestore on multiple devices.
Alternatively, vstart.sh also has support for a multi-device bluestore config, I believe.
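
For what it's worth, a rough sketch of the manual route (paths and the osd id are 
placeholders, and the option names should be double-checked against your build; this 
is not a tested recipe): point BlueStore at the extra partitions in ceph.conf, then 
run mkfs yourself.

[osd.0]
# hypothetical NVMe partitions for the DB and WAL
bluestore_block_db_path = /dev/nvme0n1p1
bluestore_block_wal_path = /dev/nvme0n1p2

# then initialize the OSD store manually
ceph-osd -i 0 --mkfs --mkkey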

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Stillwell, Bryan J
Sent: Tuesday, July 19, 2016 3:36 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Multi-device BlueStore testing

I would like to do some BlueStore testing using multiple devices like mentioned 
here:

https://www.sebastien-han.fr/blog/2016/05/04/Ceph-Jewel-configure-BlueStore-with-multiple-devices/

However, simply creating the block.db and block.wal symlinks and pointing them 
at empty partitions doesn't appear to be enough:

2016-07-19 21:30:15.717827 7f48ec4d9800  1 bluestore(/var/lib/ceph/osd/ceph-0) 
mount path /var/lib/ceph/osd/ceph-0
2016-07-19 21:30:15.717855 7f48ec4d9800  1 bluestore(/var/lib/ceph/osd/ceph-0) 
fsck
2016-07-19 21:30:15.717869 7f48ec4d9800  1 bdev create path 
/var/lib/ceph/osd/ceph-0/block type kernel
2016-07-19 21:30:15.718367 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
open path /var/lib/ceph/osd/ceph-0/block
2016-07-19 21:30:15.718462 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
open size 6001069202944 (5588 GB) block_size 4096 (4096 B)
2016-07-19 21:30:15.718786 7f48ec4d9800  1 bdev create path 
/var/lib/ceph/osd/ceph-0/block.db type kernel
2016-07-19 21:30:15.719305 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) open path 
/var/lib/ceph/osd/ceph-0/block.db
2016-07-19 21:30:15.719388 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) open size 1023410176 (976 MB) 
block_size 4096 (4096 B)
2016-07-19 21:30:15.719394 7f48ec4d9800  1 bluefs add_block_device bdev 1 path 
/var/lib/ceph/osd/ceph-0/block.db size 976 MB
2016-07-19 21:30:15.719586 7f48ec4d9800 -1 
bluestore(/var/lib/ceph/osd/ceph-0/block.db) _read_bdev_label unable to decode 
label at offset 66: buffer::malformed_input: void 
bluestore_bdev_label_t::decode(ceph::buffer::list::iterator&) decode past end 
of struct encoding
2016-07-19 21:30:15.719597 7f48ec4d9800 -1 bluestore(/var/lib/ceph/osd/ceph-0) 
_open_db check block device(/var/lib/ceph/osd/ceph-0/block.db) label returned: 
(22) Invalid argument
2016-07-19 21:30:15.719602 7f48ec4d9800  1 
bdev(/var/lib/ceph/osd/ceph-0/block.db) close
2016-07-19 21:30:15.999311 7f48ec4d9800  1 bdev(/var/lib/ceph/osd/ceph-0/block) 
close
2016-07-19 21:30:16.243312 7f48ec4d9800 -1 osd.0 0 OSD:init: unable to mount 
object store

I originally used 'ceph-disk prepare --bluestore' to create the OSD, but I feel 
like there is some kind of initialization step I need to do when moving the db 
and wal over to an NVMe device.  My google searches just aren't turning up 
much.  Could someone point me in the right direction?

Thanks,
Bryan
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-14 Thread Somnath Roy
Try increasing the following to say 10

osd_op_num_shards = 10
filestore_fd_cache_size = 128

I hope you introduced the following after I told you, so it shouldn't be the 
cause, it seems (?)

filestore_odsync_write = true

Also, comment out the following.

filestore_wbthrottle_enable = false
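
Put together, the [osd] section would then carry roughly these deltas (a sketch of 
the suggested changes only; the OSDs need a restart for them to take effect):

[osd]
osd_op_num_shards = 10
filestore_fd_cache_size = 128
filestore_odsync_write = true
#filestore_wbthrottle_enable = false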



From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Thursday, July 14, 2016 10:05 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Something in this section is causing the 0 IOPS issue. I have not been able to nail 
it down yet. (I did comment out the filestore_max_inline_xattr_size entries, and the 
problem still exists.)
If I take out the whole [osd] section, I can get rid of the IOPS staying at 0 for 
long periods of time. Performance is still not where I would expect it.
[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
#filestore_max_inline_xattr_size = 254
#filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 7:05 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0 and stayed there for 
90 seconds... then it started again and within seconds went back to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
a

Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
I am not sure whether you need to set the following. What's the point of 
reducing inline xattr stuff ? I forgot the calculation but lower values could 
redirect your xattrs to omap. Better comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could do some improvement on some of the params but nothing it seems 
responsible for the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing io on the 
drives during that time ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0 and stayed there for 
90 seconds... then it started again and within seconds went back to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_i

Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.
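
A minimal fio sketch for that kind of preconditioning pass (rbd engine; pool and 
image names are placeholders, and clientname=admin assumes the default client.admin 
keyring is readable):

fio --name=precondition --ioengine=rbd --clientname=admin \
    --pool=rbd --rbdname=testimg --rw=write --bs=1M --iodepth=16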

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?
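
As a back-of-the-envelope check of that estimate (assuming roughly 20K aggregate 4K 
write IOPS at peak, 3X replication and the default 5GB journal):

20,000 IOPS x 4 KB x 3 replicas = ~240 MB/s of journal writes cluster-wide
240 MB/s / 12 OSDs = ~20 MB/s per journal
5 GB / 20 MB/s = ~250 seconds per journal

With the load concentrated on only a couple of journals, the same math drops into 
the tens of seconds, which is presumably where the earlier ~30 second estimate came from.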

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is the default 5GB, with the rate you mentioned (assuming 4K RW) and 
considering 3X replication, it can fill up your journal and stall IO within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

2016-06-30 Thread Somnath Roy
It seems your client kernel is pretty old?
Either upgrade your kernel to 3.15 or later, or you need to disable CRUSH_TUNABLES3.
'ceph osd crush tunables bobtail' or 'ceph osd crush tunables legacy' should help. 
This will start rebalancing, and you will also lose the improvements added in 
Firefly, so it's better to upgrade the client kernel IMO.
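
To double-check what the cluster currently requires before deciding (the first 
command is read-only):

ceph osd crush show-tunables
# and, if you choose not to upgrade the client kernel (expect rebalancing):
ceph osd crush tunables bobtail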

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mike 
Jacobacci
Sent: Thursday, June 30, 2016 7:27 PM
To: Jake Young
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Mounting Ceph RBD image to XenServer 7 as SR

Thanks Jake!  I enabled the epel 7 repo and was able to get ceph-common 
installed.  Here is what happens when I try to map the drive:

rbd map rbd/enterprise-vm0 --name client.admin -m mon0 -k 
/etc/ceph/ceph.client.admin.keyring
rbd: sysfs write failed
In some cases useful info is found in syslog - try "dmesg | tail" or so.
rbd: map failed: (5) Input/output error

dmesg | tail:

[35034.469236] libceph: mon0 192.168.10.187:6789 socket error on read
[35044.469183] libceph: mon0 192.168.10.187:6789 feature set mismatch, my 
4a042a42 < server's 2004a042a42, missing 200
[35044.469199] libceph: mon0 192.168.10.187:6789 socket error on read
[35054.469076] libceph: mon0 192.168.10.187:6789 feature set mismatch, my 
4a042a42 < server's 2004a042a42, missing 200
[35054.469083] libceph: mon0 192.168.10.187:6789 socket error on read
[35064.469287] libceph: mon0 192.168.10.187:6789 feature set mismatch, my 
4a042a42 < server's 2004a042a42, missing 200
[35064.469302] libceph: mon0 192.168.10.187:6789 socket error on read
[35074.469162] libceph: mon0 192.168.10.187:6789 feature set mismatch, my 
4a042a42 < server's 2004a042a42, missing 200
[35074.469178] libceph: mon0 192.168.10.187:6789 socket error on read




On Jun 30, 2016, at 6:15 PM, Jake Young 
> wrote:

See https://www.mail-archive.com/ceph-users@lists.ceph.com/msg17112.html


On Thursday, June 30, 2016, Mike Jacobacci 
> wrote:
So after adding the ceph repo and enabling the CentOS 7 repo... it fails trying to 
install ceph-common:

Loaded plugins: fastestmirror
Loading mirror speeds from cached hostfile
 * base: mirror.web-ster.com
Resolving Dependencies
--> Running transaction check
---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
--> Processing Dependency: python-cephfs = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: python-rados = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: librbd1 = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libcephfs1 = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: python-rbd = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: librados2 = 1:10.2.2-0.el7 for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: python-requests for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libboost_program_options-mt.so.1.53.0()(64bit) for 
package: 1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: librgw.so.2()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libradosstriper.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libboost_regex-mt.so.1.53.0()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libboost_iostreams-mt.so.1.53.0()(64bit) for 
package: 1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: librbd.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libtcmalloc.so.4()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: librados.so.2()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libbabeltrace.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Running transaction check
---> Package boost-iostreams.x86_64 0:1.53.0-25.el7 will be installed
---> Package boost-program-options.x86_64 0:1.53.0-25.el7 will be installed
---> Package boost-regex.x86_64 0:1.53.0-25.el7 will be installed
---> Package ceph-common.x86_64 1:10.2.2-0.el7 will be installed
--> Processing Dependency: libbabeltrace-ctf.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
--> Processing Dependency: libbabeltrace.so.1()(64bit) for package: 
1:ceph-common-10.2.2-0.el7.x86_64
---> Package gperftools-libs.x86_64 0:2.4-7.el7 will be installed
--> Processing Dependency: libunwind.so.8()(64bit) for package: 
gperftools-libs-2.4-7.el7.x86_64
---> Package libcephfs1.x86_64 1:10.2.2-0.el7 will be installed
--> Processing 

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-23 Thread Somnath Roy
Oops, typo, 128 GB :-)...

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, June 23, 2016 5:08 PM
To: ceph-users@lists.ceph.com
Cc: Somnath Roy; Warren Wang - ISD; Wade Holler; Blair Bethwaite; Ceph 
Development
Subject: Re: [ceph-users] Dramatic performance drop at certain number of 
objects in pool


Hello,

On Thu, 23 Jun 2016 22:24:59 + Somnath Roy wrote:

> Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to
> *pin* inode/dentries in memory. We are using that for long now (with
> 128 TB node memory) and it seems helping specially for the random
> write workload and saving xattrs read in between.
>
128TB node memory, really?
Can I have some of those, too? ^o^
And here I was thinking that Wade's 660GB machines were on the excessive side.

There's something to be said (and optimized) when your storage nodes have the 
same or more RAM as your compute nodes...

As for Warren, well spotted.
I personally use vm.vfs_cache_pressure = 1, this avoids the potential fireworks 
if your memory is really needed elsewhere, while keeping things in memory 
normally.

Christian

> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Warren Wang - ISD Sent: Thursday, June 23, 2016 3:09 PM
> To: Wade Holler; Blair Bethwaite
> Cc: Ceph Development; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Dramatic performance drop at certain number
> of objects in pool
>
> vm.vfs_cache_pressure = 100
>
> Go the other direction on that. You¹ll want to keep it low to help
> keep inode/dentry info in memory. We use 10, and haven¹t had a problem.
>
>
> Warren Wang
>
>
>
>
> On 6/22/16, 9:41 PM, "Wade Holler" <wade.hol...@gmail.com> wrote:
>
> >Blairo,
> >
> >We'll speak in pre-replication numbers, replication for this pool is 3.
> >
> >23.3 Million Objects / OSD
> >pg_num 2048
> >16 OSDs / Server
> >3 Servers
> >660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
> >vm.vfs_cache_pressure = 100
> >
> >Workload is native librados with python.  ALL 4k objects.
> >
> >Best Regards,
> >Wade
> >
> >
> >On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> ><blair.bethwa...@gmail.com> wrote:
> >> Wade, good to know.
> >>
> >> For the record, what does this work out to roughly per OSD? And how
> >> much RAM and how many PGs per OSD do you have?
> >>
> >> What's your workload? I wonder whether for certain workloads (e.g.
> >> RBD) it's better to increase default object size somewhat before
> >> pushing the split/merge up a lot...
> >>
> >> Cheers,
> >>
> >> On 23 June 2016 at 11:26, Wade Holler <wade.hol...@gmail.com> wrote:
> >>> Based on everyones suggestions; The first modification to 50 / 16
> >>> enabled our config to get to ~645Mill objects before the behavior
> >>> in question was observed (~330 was the previous ceiling).
> >>> Subsequent modification to 50 / 24 has enabled us to get to 1.1
> >>> Billion+
> >>>
> >>> Thank you all very much for your support and assistance.
> >>>
> >>> Best Regards,
> >>> Wade
> >>>
> >>>
> >>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer <ch...@gol.com>
> >>>wrote:
> >>>>
> >>>> Hello,
> >>>>
> >>>> On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:
> >>>>
> >>>>> Sorry, late to the party here. I agree, up the merge and split
> >>>>>thresholds. We're as high as 50/12. I chimed in on an RH ticket
> >>>>>here.
> >>>>> One of those things you just have to find out as an operator
> >>>>>since it's  not well documented :(
> >>>>>
> >>>>> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
> >>>>>
> >>>>> We have over 200 million objects in this cluster, and it's still
> >>>>>doing  over 15000 write IOPS all day long with 302 spinning
> >>>>>drives
> >>>>>+ SATA SSD  journals. Having enough memory and dropping your
> >>>>>vfs_cache_pressure  should also help.
> >>>>>
> >>>> Indeed.
> >>>>
> >>>> Since it was asked in that bug report and also my first
> >>>>suspicion, it  would probably be good time to clarify that it
> >>>>isn't the 

Re: [ceph-users] Dramatic performance drop at certain number of objects in pool

2016-06-23 Thread Somnath Roy
Or even vm.vfs_cache_pressure = 0 if you have sufficient memory to *pin* 
inode/dentries in memory.
We have been using that for a long time now (with 128 TB node memory) and it seems to 
help, especially for the random write workload, by saving xattr reads in between.
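
For anyone who wants to try it, a minimal sketch (vfs_cache_pressure = 0 only makes 
sense when the node really has enough RAM to hold all inodes/dentries; a small 
non-zero value such as 1 or 10, as suggested elsewhere in this thread, is the safer 
choice; the file name under /etc/sysctl.d/ is arbitrary):

# apply immediately
sysctl -w vm.vfs_cache_pressure=1
# persist across reboots
echo 'vm.vfs_cache_pressure = 1' > /etc/sysctl.d/99-vfs-cache.conf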

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Warren 
Wang - ISD
Sent: Thursday, June 23, 2016 3:09 PM
To: Wade Holler; Blair Bethwaite
Cc: Ceph Development; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Dramatic performance drop at certain number of 
objects in pool

vm.vfs_cache_pressure = 100

Go the other direction on that. You¹ll want to keep it low to help keep 
inode/dentry info in memory. We use 10, and haven¹t had a problem.


Warren Wang




On 6/22/16, 9:41 PM, "Wade Holler"  wrote:

>Blairo,
>
>We'll speak in pre-replication numbers, replication for this pool is 3.
>
>23.3 Million Objects / OSD
>pg_num 2048
>16 OSDs / Server
>3 Servers
>660 GB RAM Total, 179 GB Used (free -t) / Server vm.swappiness = 1
>vm.vfs_cache_pressure = 100
>
>Workload is native librados with python.  ALL 4k objects.
>
>Best Regards,
>Wade
>
>
>On Wed, Jun 22, 2016 at 9:33 PM, Blair Bethwaite
> wrote:
>> Wade, good to know.
>>
>> For the record, what does this work out to roughly per OSD? And how
>> much RAM and how many PGs per OSD do you have?
>>
>> What's your workload? I wonder whether for certain workloads (e.g.
>> RBD) it's better to increase default object size somewhat before
>> pushing the split/merge up a lot...
>>
>> Cheers,
>>
>> On 23 June 2016 at 11:26, Wade Holler  wrote:
>>> Based on everyones suggestions; The first modification to 50 / 16
>>> enabled our config to get to ~645Mill objects before the behavior in
>>> question was observed (~330 was the previous ceiling).  Subsequent
>>> modification to 50 / 24 has enabled us to get to 1.1 Billion+
>>>
>>> Thank you all very much for your support and assistance.
>>>
>>> Best Regards,
>>> Wade
>>>
>>>
>>> On Mon, Jun 20, 2016 at 6:58 PM, Christian Balzer 
>>>wrote:

 Hello,

 On Mon, 20 Jun 2016 20:47:32 + Warren Wang - ISD wrote:

> Sorry, late to the party here. I agree, up the merge and split
>thresholds. We're as high as 50/12. I chimed in on an RH ticket here.
> One of those things you just have to find out as an operator since
>it's  not well documented :(
>
> https://bugzilla.redhat.com/show_bug.cgi?id=1219974
>
> We have over 200 million objects in this cluster, and it's still
>doing  over 15000 write IOPS all day long with 302 spinning drives
>+ SATA SSD  journals. Having enough memory and dropping your
>vfs_cache_pressure  should also help.
>
 Indeed.

 Since it was asked in that bug report and also my first suspicion,
it  would probably be good time to clarify that it isn't the splits
that cause  the performance degradation, but the resulting inflation
of dir entries  and exhaustion of SLAB and thus having to go to disk
for things that  normally would be in memory.

 Looking at Blair's graph from yesterday pretty much makes that
clear, a  purely split caused degradation should have relented much
quicker.


> Keep in mind that if you change the values, it won't take effect
> immediately. It only merges them back if the directory is under
> the calculated threshold and a write occurs (maybe a read, I forget).
>
 If it's a read a plain scrub might do the trick.

 Christian
> Warren
>
>
> From: ceph-users on behalf of Wade Holler
> Date: Monday, June 20, 2016 at 2:48 PM
> To: Blair Bethwaite, Wido den Hollander
> Cc: Ceph Development, "ceph-users@lists.ceph.com"
> Subject: Re: [ceph-users] Dramatic performance drop at certain number of
> objects in pool
>
> Thanks everyone for your replies.  I sincerely appreciate it. We
> are testing with different pg_num and filestore_split_multiple settings.
> Early indications are  well not great. Regardless it is nice
> to understand the symptoms better so we try to design around it.
>
> Best Regards,
> Wade
>
>
> On Mon, Jun 20, 2016 at 2:32 AM Blair Bethwaite wrote:
> On 20 June 2016 at 09:21, Blair Bethwaite

Re: [ceph-users] rbd ioengine for fio

2016-06-16 Thread Somnath Roy
What is your fio script ?

Make sure you do this..

1. Run, say, ‘ceph -s’ from the server you are trying to connect from and see if it 
is connecting properly or not. If so, you don’t have any keyring issues.

2. Now, make sure you have set the following param values properly based on 
your setup.

pool=
rbdname=
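
For comparison, a minimal job-file sketch for the rbd engine (pool and image names 
are placeholders; clientname=admin assumes the default client.admin keyring under 
/etc/ceph):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=fio_test
rw=randwrite
bs=4k
iodepth=32

[rbd_iodepth32]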

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mavis 
Xiang
Sent: Thursday, June 16, 2016 1:47 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] rbd ioengine for fio

Hi all,
I am new to the rbd engine for fio, and ran into the following problems when i 
try to run a 4k write with my rbd image:



rbd_iodepth32: (g=0): rw=randwrite, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, 
iodepth=32

fio-2.11-17-ga275

Starting 1 process

rbd engine: RBD version: 0.1.8

rados_connect failed.

fio_rbd_connect failed.



It seems that the rbd client cannot connect to the ceph cluster.

Ceph health output:

cluster e414604c-29d7-4adb-a889-7f70fc252dfa

 health HEALTH_WARN clock skew detected on mon.h02, mon.h05



But that should not affect the connection to the cluster.

Ceph.conf:

[global]

fsid = e414604c-29d7-4adb-a889-7f70fc252dfa

mon_initial_members = h02

mon_host = XXX.X.XXX.XXX

auth_cluster_required = cephx

auth_service_required = cephx

auth_client_required = cephx

filestore_xattr_use_omap = true

osd_pool_default_pg_num = 2400

osd_pool_default_pgp_num = 2400

public_network = XXX.X.XXX.X/21



[osd]

osd_crush_update_on_start = false






Could this be something about the keyring? I did not find any keyring options that 
can be set in the fio file.
Can anyone please give some insight into this problem?
Any help would be appreciated!

Thanks!

Yu


PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

2016-06-15 Thread Somnath Roy
You ran out of the fd limit. Increase it with ulimit.
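
A quick sketch of checking and raising it (the number is illustrative; for OSDs 
started by the init system the limit usually has to be raised in its own config 
instead, e.g. 'max open files' in ceph.conf or LimitNOFILE in the systemd unit):

# current soft limit on open files
ulimit -n
# raise it for this shell before starting the daemon by hand
ulimit -n 65536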

From: Mansour Shafaei Moghaddam [mailto:mansoor.shaf...@gmail.com]
Sent: Wednesday, June 15, 2016 2:08 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

It fails at "FileStore.cc: 2761". Here is a more complete log:

-9> 2016-06-15 10:55:13.205014 7fa2dcd85700 -1 dump_open_fds unable to open 
/proc/self/fd
-8> 2016-06-15 10:55:13.205085 7fa2cb402700  2 
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 > 104857600
-7> 2016-06-15 10:55:13.205094 7fa2cd406700  2 
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328389 > 104857600
-6> 2016-06-15 10:55:13.205111 7fa2cac01700  2 
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328317 > 104857600
-5> 2016-06-15 10:55:13.205118 7fa2ca400700  2 
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 > 104857600
-4> 2016-06-15 10:55:13.205121 7fa2cdc07700  2 
filestore(/var/lib/ceph/osd/ceph-0) waiting 51 > 50 ops || 328390 > 104857600
-3> 2016-06-15 10:55:13.205153 7fa2de588700  5 -- op tracker -- seq: 1476, 
time: 2016-06-15 10:55:13.205153, event: journaled_completion_queued, op: 
osd_op(client.4109.0:1457 rb.0.100a.6b8b4567.6b6c [set-alloc-hint 
object_size 4194304 write_size 4194304,write 1884160~4096] 0.cbe1d8a4 
ack+ondisk+write e9)
-2> 2016-06-15 10:55:13.205183 7fa2de588700  5 -- op tracker -- seq: 1483, 
time: 2016-06-15 10:55:13.205183, event: write_thread_in_journal_buffer, op: 
osd_op(client.4109.0:1464 rb.0.100a.6b8b4567.524d [set-alloc-hint 
object_size 4194304 write_size 4194304,write 3051520~4096] 0.6778c255 
ack+ondisk+write e9)
-1> 2016-06-15 10:55:13.205400 7fa2de588700  5 -- op tracker -- seq: 1483, 
time: 2016-06-15 10:55:13.205400, event: journaled_completion_queued, op: 
osd_op(client.4109.0:1464 rb.0.100a.6b8b4567.524d [set-alloc-hint 
object_size 4194304 write_size 4194304,write 3051520~4096] 0.6778c255 
ack+ondisk+write e9)
 0> 2016-06-15 10:55:13.206559 7fa2dcd85700 -1 os/FileStore.cc: In function 
'unsigned int FileStore::_do_transaction(ObjectStore::Transaction&, uint64_t, 
int, ThreadPool::TPHandle*)' thread 7fa2dcd85700 time 2016-06-15 10:55:13.205018
os/FileStore.cc: 2761: FAILED assert(0 == "unexpected error")

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x78) 
[0xacd718]
 2: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, 
ThreadPool::TPHandle*)+0xa24) [0x8b8114]
 3: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long, 
ThreadPool::TPHandle*)+0x64) [0x8bcf34]
 4: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x17e) 
[0x8bd0ce]
 5: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xabe326]
 6: (ThreadPool::WorkThread::entry()+0x10) [0xabf3d0]
 7: (()+0x7dc5) [0x7fa2e88f3dc5]
 8: (clone()+0x6d) [0x7fa2e73d528d]


On Wed, Jun 15, 2016 at 2:05 PM, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:
There should be a line in the log specifying which assert is failing , post 
that along with say 10 lines from top of that..

Thanks & Regards
Somnath

From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
 On Behalf Of Mansour Shafaei Moghaddam
Sent: Wednesday, June 15, 2016 1:57 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

Hi All,

Has anyone faced a similar issue? I do not have a problem with random read, 
sequential read, and sequential writes though. Everytime I try running fio for 
random writes, one osd in the cluster crashes. Here is the what I see at the 
tail of the log:

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: ceph-osd() [0x9d6334]
 2: (()+0xf100) [0x7fa2e88fb100]
 3: (gsignal()+0x37) [0x7fa2e73145f7]
 4: (abort()+0x148) [0x7fa2e7315ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fa2e7c189d5]
 6: (()+0x5e946) [0x7fa2e7c16946]
 7: (()+0x5e973) [0x7fa2e7c16973]
 8: (()+0x5eb93) [0x7fa2e7c16b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x24a) [0xacd8ea]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, 
ThreadPool::TPHandle*)+0xa24) [0x8b8114]
 11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long, 
ThreadPool::TPHandle*)+0x64) [0x8bcf34]
 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x17e) 
[0x8bd0ce]
 13: (ThreadPool::worker(ThreadPool::WorkThread

Re: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

2016-06-15 Thread Somnath Roy
There should be a line in the log specifying which assert is failing; post that 
along with, say, 10 lines from the top of that.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Mansour Shafaei Moghaddam
Sent: Wednesday, June 15, 2016 1:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Fio randwrite does not work on Centos 7.2 VM

Hi All,

Has anyone faced a similar issue? I do not have a problem with random read, 
sequential read, and sequential writes though. Everytime I try running fio for 
random writes, one osd in the cluster crashes. Here is the what I see at the 
tail of the log:

 ceph version 0.94.6 (e832001feaf8c176593e0325c8298e3f16dfb403)
 1: ceph-osd() [0x9d6334]
 2: (()+0xf100) [0x7fa2e88fb100]
 3: (gsignal()+0x37) [0x7fa2e73145f7]
 4: (abort()+0x148) [0x7fa2e7315ce8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x165) [0x7fa2e7c189d5]
 6: (()+0x5e946) [0x7fa2e7c16946]
 7: (()+0x5e973) [0x7fa2e7c16973]
 8: (()+0x5eb93) [0x7fa2e7c16b93]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char 
const*)+0x24a) [0xacd8ea]
 10: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned long, int, 
ThreadPool::TPHandle*)+0xa24) [0x8b8114]
 11: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*, 
std::allocator<ObjectStore::Transaction*> >&, unsigned long, 
ThreadPool::TPHandle*)+0x64) [0x8bcf34]
 12: (FileStore::_do_op(FileStore::OpSequencer*, ThreadPool::TPHandle&)+0x17e) 
[0x8bd0ce]
 13: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa56) [0xabe326]
 14: (ThreadPool::WorkThread::entry()+0x10) [0xabf3d0]
 15: (()+0x7dc5) [0x7fa2e88f3dc5]
 16: (clone()+0x6d) [0x7fa2e73d528d]


PLEASE NOTE: The information contained in this electronic mail message is 
intended only for the use of the designated recipient(s) named above. If the 
reader of this message is not the intended recipient, you are hereby notified 
that you have received this message in error and that any review, 
dissemination, distribution, or copying of this message is strictly prohibited. 
If you have received this communication in error, please notify the sender by 
telephone or e-mail (as shown above) immediately and destroy any and all copies 
of this message in your possession (whether hard copies or electronically 
stored copies).
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Pool JERASURE issue.

2016-06-01 Thread Somnath Roy
You need to either change the failure domain to osd or have at least 5 hosts to 
satisfy the host failure domain (k=3 + m=2 = 5 chunks).
Since the failure domain is not satisfied, the pgs are undersized and degraded.
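
If changing the failure domain is acceptable, a sketch of that route (profile and 
pool names are placeholders; it keeps the ruleset-* key names used in your profile 
and creates a new profile, since the existing pool will not pick up changes to the 
old one):

ceph osd erasure-code-profile set myprofile_osd \
    k=3 m=2 plugin=jerasure technique=reed_sol_van \
    ruleset-failure-domain=osd
ceph osd pool create testpool_osd 8 8 erasure myprofile_osd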

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Khang 
Nguy?n Nh?t
Sent: Wednesday, June 01, 2016 9:33 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph Pool JERASURE issue.

Hi,
I have 1 cluster as pictured below:
[inline image: cluster diagram omitted]

- OSD-host1 runs 2 ceph-osd daemons, mounted at /var/ceph/osd0 and 
/var/ceph/osd1.
- OSD-host2 runs 2 ceph-osd daemons, mounted at /var/ceph/osd2 and 
/var/ceph/osd3.
- OSD-host3 runs only 1 ceph-osd daemon, mounted at /var/ceph/osd4.
- This is my myprofile:
 jerasure-per-chunk-alignment = false
 k = 3
 m = 2
 plugin = jerasure
 ruleset-failure-domain = host
 ruleset-root = default
 technique = reed_sol_van
 w = 8
When I used it to create a pool:
CLI: ceph osd pool create testpool 8 8 erasure myprofile   (pool id of testpool = 62)
CLI: ceph -s
Here are the results:
///
 health HEALTH_WARN
8 pgs degraded
8 pgs stuck unclean
8 pgs undersized
 monmap e1: 1 mons at {mon0=x.x.x.x:6789/0}
            election epoch 7, quorum 0 mon0
 osdmap e441: 5 osds: 5 up, 5 in
            flags sortbitwise
  pgmap ///
            8 active+undersized+degraded

CLI: ceph health detail
HEALTH_WARN 8 pgs degraded; 8 pgs stuck unclean; 8 pgs undersized
pg 62.6 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,2,2147483647,2147483647,4]
pg 62.7 is stuck unclean since forever, current state active+undersized+degraded, last acting [2,0,2147483647,4,2147483647]
pg 62.4 is stuck unclean since forever, current state active+undersized+degraded, last acting [3,0,4,2147483647,2147483647]
pg 62.5 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,4,2147483647,3,2147483647]
pg 62.2 is stuck unclean since forever, current state active+undersized+degraded, last acting [1,2147483647,2147483647,4,2]
pg 62.3 is stuck unclean since forever, current state active+undersized+degraded, last acting [2,2147483647,0,4,2147483647]
pg 62.0 is stuck unclean since forever, current state active+undersized+degraded, last acting [0,3,2147483647,4,2147483647]
pg 62.1 is stuck unclean since forever, current state active+undersized+degraded, last acting [4,0,3,2147483647,2147483647]
pg 62.1 is active+undersized+degraded, acting [4,0,3,2147483647,2147483647]
pg 62.0 is active+undersized+degraded, acting [0,3,2147483647,4,2147483647]
pg 62.3 is active+undersized+degraded, acting [2,2147483647,0,4,2147483647]
pg 62.2 is active+undersized+degraded, acting [1,2147483647,2147483647,4,2]
pg 62.5 is active+undersized+degraded, acting [0,4,2147483647,3,2147483647]
pg 62.4 is active+undersized+degraded, acting [3,0,4,2147483647,2147483647]
pg 62.7 is active+undersized+degraded, acting [2,0,2147483647,4,2147483647]
pg 62.6 is active+undersized+degraded, acting [1,2,2147483647,2147483647,4]

Is this related to the ruleset-failure-domain setting? Can somebody please help 
me out?
Thanks!
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] NVRAM cards as OSD journals

2016-05-24 Thread Somnath Roy
If you are not tweaking ceph.conf settings when using NVRAM as the journal, I 
would highly recommend trying the following.

1. Since you have a very small journal, try reducing 
filestore_max_sync_interval/min_sync_interval significantly (see the sketch below).

2. If you are using Jewel, there are a bunch of filestore throttle parameters 
introduced (discussed over ceph-devel) which now do no throttling by 
default. But, since your journal size is small and NVRAM is much faster, you may 
need to tweak those to extract better and more stable performance.
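A sketch of item 1 in ceph.conf; the values are purely illustrative and need tuning for your hardware and journal size:

  [osd]
  filestore_min_sync_interval = 0.01
  filestore_max_sync_interval = 1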

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brian 
::
Sent: Tuesday, May 24, 2016 1:37 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] NVRAM cards as OSD journals

Hello List

To confirm what Christian has said: we have been playing with a 3-node,
4 SSD (3610) per node cluster. Putting the journals on the OSD SSDs we were 
getting 770MB/s sustained with large sequential writes, and 35MB/s and about 
9200 IOPS with small random writes. Putting an NVMe in as the journal decreased the 
sustained throughput marginally, probably by 40MB/s, and consistently increased 
the small random writes by about 10MB/s and 3100 IOPS or so. But now with my 
small cluster I've got a huge failure domain in each OSD server.

As the number of OSDs increases I would imagine the value of backing SSDs with 
NVMe journals diminishes.

B

On Tue, May 24, 2016 at 3:28 AM, Christian Balzer  wrote:
>
> Hello,
>
> On Fri, 20 May 2016 15:52:45 + EP Komarla wrote:
>
>> Hi,
>>
>> I am contemplating using a NVRAM card for OSD journals in place of
>> SSD drives in our ceph cluster.
>>
>> Configuration:
>>
>> * 4 Ceph servers
>>
>> * Each server has 24 OSDs (each OSD is a 1TB SAS drive)
>>
>> * 1 PCIe NVRAM card of 16GB capacity per ceph server
>>
>> * Both Client & cluster network is 10Gbps
>>
> Since you were afraid of losing just 5 OSDs if a single journal SSD
> would fail, putting all your eggs in one NVRAM basket is quite the leap.
>
> Your failure domains should match your cluster size and abilities, and
> 4 nodes is a small cluster; losing one because your NVRAM card failed
> will have massive impacts during re-balancing, and then you'll have a
> 3-node cluster with less overall performance until you can fix things.
>
> And while a node can of course fail as well in its entirety (like bad
> Mainboard, CPU, RAM) these things often times can be fixed quickly
> (especially if you have spares on hand) and don't need to involve a
> full re-balancing if Ceph is configured accordingly
> (mon_osd_down_out_subtree_limit = host).
>
> As for your question, this has been discussed to some extend less than
> two months ago, especially concerning journal size and usage:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg28003.html
>
> That being said, it would be best to have a comparison between a
> normal sized journal on a fast SSD/NVMe versus the 600MB NVRAM journals.
>
> I'd expect small write IOPS to be faster with the NVRAM and _maybe_ to
> see some slowdown compared to SSDs when comes to large writes, like
> during a backfill.
>
>> As per ceph documents:
>> The expected throughput number should include the expected disk
>> throughput (i.e., sustained data transfer rate), and network throughput.
>> For example, a 7200 RPM disk will likely have approximately 100 MB/s.
>> Taking the min() of the disk and network throughput should provide a
>> reasonable expected throughput. Some users just start off with a 10GB
>> journal size. For example: osd journal size = 10000. Given that I have
>> a single 16GB card per server that has to be carved among all 24OSDs,
>> I will have to configure each OSD journal to be much smaller around
>> 600MB, i.e., 16GB/24 drives.  This value is much smaller than
>> 10GB/OSD journal that is generally used.  So, I am wondering if this
>> configuration and journal size is valid.  Is there a performance
>> benefit of having a journal that is this small?  Also, do I have to
>> reduce the default "filestore maxsync interval" from 5 seconds to a
>> smaller value say 2 seconds to match the smaller journal size?
>>
> Yes, just to be on the safe side.
>
> Regards,
>
> Christian
>
>> Have people used NVRAM cards in the Ceph clusters as journals?  What
>> is their experience?
>>
>> Any thoughts?
>>
>>
>>
>> Legal Disclaimer:
>> The information contained in this message may be privileged and
>> confidential. It is intended to be read only by the individual or
>> entity to whom it is addressed or by their designee. If the reader of
>> this message is not the intended recipient, you are on notice that
>> any distribution of this message, in any form, is strictly
>> prohibited. If you have received this message in error, please
>> immediately notify the sender and delete or destroy any copy of this message!
>
>
> --
> Christian BalzerNetwork/Systems 

Re: [ceph-users] Public and Private network over 1 interface

2016-05-23 Thread Somnath Roy
For a 4MB block size EC-object use case (mostly for reads, not so much for 
writes) we saw some benefit from separating the public/cluster network on 40GbE. We 
didn’t use two NICs though; we configured two ports on one NIC.
Both networks can give up to 48Gb/s, but with a Mellanox card/Mellanox switch 
combination it can go up to 56Gb/s.
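For reference, the split itself is just two options in ceph.conf; a minimal sketch with made-up subnets:

  [global]
  public network  = 192.168.10.0/24
  cluster network = 192.168.20.0/24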

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Brady 
Deetz
Sent: Monday, May 23, 2016 12:53 PM
To: ceph-users
Subject: [ceph-users] Public and Private network over 1 interface

TLDR;
Has anybody deployed a Ceph cluster using a single 40 gig nic? This is 
discouraged in 
http://docs.ceph.com/docs/master/rados/configuration/network-config-ref/

"One NIC OSD in a Two Network Cluster:
Generally, we do not recommend deploying an OSD host with a single NIC in a 
cluster with two networks. --- [cut] --- Additionally, the public network and 
cluster network must be able to route traffic to each other, which we don’t 
recommend for security reasons."


Reason for this question:
My hope is that I can keep capital expenses down for this year then add a 
second switch and second 40 gig DAC to each node next year.

Thanks for any wisdom you can provide.
-

Details:
Planned configuration - 40 gig interconnect via Brocade VDX 6940 and 8x OSD 
nodes configured as follows:
2x E5-2660v4
8x 16GB ECC DDR4 (128 GB RAM)
1x dual port Mellanox ConnectX-3 Pro EN
24x 6TB enterprise sata
2x P3700 400GB pcie nvme (journals)
2x 200GB SSD (OS drive)

1) From a security perspective, why not keep the networks segmented all the way 
to the node using tagged VLANs or VXLANs then untag them at the node? From a 
security perspective, that's no different than sending 2 networks to the same 
host on different interfaces.

2) By using VLANs, I wouldn't have to worry about the special configuration of 
Ceph mentioned in referenced documentation, since the untagged VLANs would show 
up as individual interfaces on the host.

3) From a performance perspective, has anybody observed a significant 
performance hit by untagging vlans on the node? This is something I can't test, 
since I don't currently own 40 gig gear.

3.a) If I used a VXLAN offloading nic, wouldn't this remove this potential 
issue?

3.b) My back-of-napkin estimate shows that total OSD read throughput per node 
could max out around 38Gbps (4800MB/s). But in reality, with plenty of random 
I/O, I'm expecting to see something more around 30Gbps. So a single 40 gig 
connection ought to leave plenty of headroom, right?
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] using jemalloc in trusty

2016-05-23 Thread Somnath Roy
Yes, if you are using do_autogen, use the -J option. If you are running the config files 
directly, use --with-jemalloc.
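A minimal sketch of those two options, assuming you are at the top of the Ceph source tree of that era:

  ./do_autogen.sh -J
  # or, when running configure directly:
  ./configure --with-jemalloc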

-Original Message-
From: Luis Periquito [mailto:periqu...@gmail.com] 
Sent: Monday, May 23, 2016 7:44 AM
To: Somnath Roy
Cc: Ceph Users
Subject: Re: [ceph-users] using jemalloc in trusty

Thanks Somnath, I expected that much. But given the hint in the config files do 
you know if they are built to use jemalloc? it seems not...

On Mon, May 23, 2016 at 3:34 PM, Somnath Roy <somnath@sandisk.com> wrote:
> You need to build the Ceph code base with jemalloc to use it for OSDs; LD_PRELOAD won't
> work.
>
> Thanks & regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
> Of Luis Periquito
> Sent: Monday, May 23, 2016 7:30 AM
> To: Ceph Users
> Subject: [ceph-users] using jemalloc in trusty
>
> I've been running some tests with jewel, and wanted to enable jemalloc.
> I noticed that the new jewel release now loads properly /etc/default/ceph and 
> has an option to use jemalloc.
>
> I've installed jemalloc, enabled the LD_PRELOAD option, however doing some 
> tests it seems that it's still using tcmalloc: I still see the 
> "tcmalloc::CentralFreeList::FetchFromSpans()" and it's accompanying lines in 
> perf top.
>
> Also from a lsof I can see the tcmalloc libraries being used, but not the 
> jemalloc ones...
>
> Does anyone know what I'm doing wrong? I'm using the standard binaries from 
> the repo 10.2.1 and ubuntu trusty with kernel 3.13.0-52-generic.
>
> thanks,
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] using jemalloc in trusty

2016-05-23 Thread Somnath Roy
You need to build the Ceph code base with jemalloc to use it for OSDs; LD_PRELOAD won't 
work.

Thanks & regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Luis 
Periquito
Sent: Monday, May 23, 2016 7:30 AM
To: Ceph Users
Subject: [ceph-users] using jemalloc in trusty

I've been running some tests with jewel, and wanted to enable jemalloc.
I noticed that the new jewel release now loads properly /etc/default/ceph and 
has an option to use jemalloc.

I've installed jemalloc, enabled the LD_PRELOAD option, however doing some 
tests it seems that it's still using tcmalloc: I still see the 
"tcmalloc::CentralFreeList::FetchFromSpans()" and it's accompanying lines in 
perf top.

Also from a lsof I can see the tcmalloc libraries being used, but not the 
jemalloc ones...

Does anyone know what I'm doing wrong? I'm using the standard binaries from the 
repo 10.2.1 and ubuntu trusty with kernel 3.13.0-52-generic.

thanks,
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD process doesn't die immediately after device disappears

2016-05-17 Thread Somnath Roy
Hi Marcel,
FileStore doesn't subscribe to any such event from the device. Presently, it 
relies on the filesystem (for the FileStore assert) to return an error during 
IO, and based on that error it asserts.
The FileJournal assert you are getting in the aio path relies on the Linux aio 
system to report an error.
It should hit these asserts pretty quickly, not after a couple of minutes, if IO is going on.
Are you saying the crash timestamp is a couple of minutes after?
BTW, if you are on Ubuntu, upstart will restart the OSDs after a crash, and based 
on some logic (too-frequent crashes) it will eventually decide not to. So, in 
the log, try to find the very first crash trace and see when it occurred.
BTW, hope you are aware that recovery will not kick in until a 
(configurable) grace period is over.
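For reference, that grace period is a monitor option; a minimal ceph.conf sketch (the value shown is just an example, not a recommendation):

  [mon]
  mon_osd_down_out_interval = 300   # seconds a down OSD waits before being marked out and backfill begins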

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Marcel 
Lauhoff
Sent: Tuesday, May 17, 2016 5:59 AM
To: ceph-users
Subject: [ceph-users] OSD process doesn't die immediately after device 
disappears


Hi,

we recently played the good ol' pull a harddrive game and wondered, why the OSD 
process took a couple of minutes to recognize their misfortune.

In our configuration two OSDs share an HDD:
  OSD n as its journal device,
  OSD n+1 as its filesystem.

We expected that OSDs detect this kind of failure and immediately shut down, so 
that transactions aren't blocked and recovery can start as soon as possible.

What do you think?


I read through the FileStore code about a year ago and can't remember any code 
that somehow subscribes to events of the underlying devices.

Does anyone use external watchdog tools for this type of failure?



~irq0


The last messages of the two OSD daemons:

2016-04-27 14:57:25.613408 7f1b9ed10700 -1 journal aio to 0~4096 wrote 
18446744073709551611
2016-04-27 14:57:25.642669 7f1b9ed10700 -1 os/FileJournal.cc: In function 'void 
FileJournal::write_finish_thread_entry()' thread 7f1b9ed10700 time 2016-04-27 
14:57:25.613475
os/FileJournal.cc: 1426: FAILED assert(0 == "unexpected aio error")

2016-04-27 14:57:22.534578 7f0e0c6a5700 -1 os/FileStore.cc: In function 
'unsigned int FileStore::_do_transaction(ObjectStore::Trans
action&, uint64_t, int, ThreadPool::TPHandle*)' thread 7f0e0c6a5700 time 
2016-04-27 14:57:22.489978
os/FileStore.cc: 2757: FAILED assert(0 == "unexpected error")

--
Marcel Lauhoff
Mail: lauh...@uni-mainz.de
XMPP: mlauh...@jabber.uni-mainz.de
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Segfault in libtcmalloc.so.4.2.2

2016-05-13 Thread Somnath Roy
I am not sure about Debian, but for Ubuntu the latest tcmalloc is not 
incorporated until 3.16.0.50.
You can use the attached program to detect whether your tcmalloc is okay or not. Do 
this:

$ g++ -o gperftest tcmalloc_test.c -ltcmalloc
   $ TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=67108864 ./gperftest

BTW, I am not saying latest tcmalloc will fix the issue , but worth trying.

Thanks & Regards
Somnath

From: David [mailto:da...@visions.se]
Sent: Friday, May 13, 2016 7:49 AM
To: Somnath Roy
Cc: ceph-users
Subject: Re: [ceph-users] Segfault in libtcmalloc.so.4.2.2

Linux osd11.storage 3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt20-1+deb8u3 
(2016-01-17) x86_64 GNU/Linux

apt-show-versions linux-image-3.16.0-4-amd64
linux-image-3.16.0-4-amd64:amd64/jessie-updates 3.16.7-ckt20-1+deb8u3 
upgradeable to 3.16.7-ckt25-2

apt-show-versions libtcmalloc-minimal4
libtcmalloc-minimal4:amd64/jessie 2.2.1-0.2 uptodate



13 maj 2016 kl. 16:02 skrev Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>>:

What is the exact kernel version ?
Ubuntu has a new tcmalloc incorporated from 3.16.0.50 kernel onwards. If you 
are using older kernel than this better to upgrade kernel or try building 
latest tcmalloc and try to see if this is happening there.
Ceph is not packaging tcmalloc it is using the tcmalloc available with distro.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David
Sent: Friday, May 13, 2016 6:13 AM
To: ceph-users
Subject: [ceph-users] Segfault in libtcmalloc.so.4.2.2

Hi,

Been getting some segfaults in our newest ceph cluster running ceph 9.2.1-1 on 
Debian 8.3
segfault at 0 ip 7f27e85120f7 sp 7f27cff9e860 error 4 in 
libtcmalloc.so.4.2.2

I saw there’s already a bug up there on the tracker: 
http://tracker.ceph.com/issues/15628
Don’t know how many other are affected by it. We stop and start the osd to 
bring it up again but it’s quite annoying.

I’m guessing this affects Jewel as well?

Kind Regards,

David Majchrzak


#include <iostream>
#include <stdlib.h>
#ifdef HAVE_GPERFTOOLS_HEAP_PROFILER_H
#include <gperftools/heap-profiler.h>
#else
#include <google/heap-profiler.h>
#endif

#ifdef HAVE_GPERFTOOLS_MALLOC_EXTENSION_H
#include <gperftools/malloc_extension.h>
#else
#include <google/malloc_extension.h>
#endif

using namespace std;

int main ()
{
  size_t tc_cache_sz;
  size_t env_cache_sz;
  char *env_cache_sz_str;
  int st;

  env_cache_sz_str = getenv("TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES");
  if (env_cache_sz_str) {
env_cache_sz = strtoul(env_cache_sz_str, NULL, 0);
if (env_cache_sz == 33554432) {
cout << "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES Value same as default:"
" 33554432 export a different value for test" << endl;
exit(EXIT_FAILURE);
}
tc_cache_sz = 0;
MallocExtension::instance()->
GetNumericProperty("tcmalloc.max_total_thread_cache_bytes",
&tc_cache_sz);
if (tc_cache_sz == env_cache_sz) {
  cout << "Tcmalloc OK! Internal and Env cache size are same:" <<
  tc_cache_sz << endl;
  st = EXIT_SUCCESS;
} else {
  cout << "Tcmalloc BUG! TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES: "
  << env_cache_sz << " Internal Size: " << tc_cache_sz
  << " different" << endl;
  st = EXIT_FAILURE;
}
  } else {
cout << "TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES Env Not Set" << endl;
st = EXIT_FAILURE;
  }
  exit(st);
}
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weighted Priority Queue testing

2016-05-13 Thread Somnath Roy
Thanks Christian for the input.
I will start digging the code and look for possible explanation.

Regards
Somnath

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, May 12, 2016 11:52 PM
To: Somnath Roy
Cc: Scottix; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] Weighted Priority Queue testing


Hello,

On Fri, 13 May 2016 05:46:41 + Somnath Roy wrote:

> FYI in my test I used osd_max_backfills = 10 which is hammer default.
> Post hammer it's been changed to 1.
>
All my tests, experiences are with Firefly and Hammer.

Also FYI and possibly pertinent to this discussion, I just added a node with 6 
OSDs to one of my clusters.
I did this by initially adding things with a crush weight of 0 (so nothing
happened) and then in one fell swoop set the weights of all those OSDs to 5.
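A sketch of that procedure (OSD id and host name are made up here):

  ceph osd crush add osd.42 0 host=newnode     # crush weight 0: no data movement yet
  ceph osd crush reweight osd.42 5.0           # set the real weight: backfill starts now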

Now what I'm seeing (and remember seeing before) is that Ceph is processing 
this very sequentially, meaning it is currently backfilling the first 2 OSDs 
and doing nothing of the sort with the other 4; they are idle.

"osd_max_backfills" is set to 4, which is incidentally the number of backfills 
happening on the new node now, however this is per OSD, so in theory we could 
expect 24 backfills.
The prospective source OSDs aren't pegged with backfills either, they have
1-2 going on.

I'm seriously wondering if this behavior is related to what we're talking about 
here.

Christian

> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com]
> Sent: Thursday, May 12, 2016 10:40 PM
> To: Scottix
> Cc: Somnath Roy; ceph-users@lists.ceph.com; Nick Fisk
> Subject: Re: [ceph-users] Weighted Priority Queue testing
>
>
> Hello,
>
> On Thu, 12 May 2016 15:41:13 + Scottix wrote:
>
> > We have run into this same scenarios in terms of the long tail
> > taking much longer on recovery than the initial.
> >
> > Either time we are adding osd or an osd get taken down. At first we
> > have max-backfill set to 1 so it doesn't kill the cluster with io.
> > As time passes by the single osd is performing the backfill. So we
> > are gradually increasing the max-backfill up to 10 to reduce the
> > amount of time it needs to recover fully. I know there are a few
> > other factors at play here but for us we tend to do this procedure every 
> > time.
> >
>
> Yeah, as I wrote in my original mail "This becomes even more obvious
> when backfills and recovery settings are lowered".
>
> However my test cluster is at the default values, so it starts with a
> (much too big) bang and ends with a whimper, not because it's
> throttled but simply because there are so few PGs/OSDs to choose from.
> Or so it seems, purely from observation.
>
> Christian
> > On Wed, May 11, 2016 at 6:29 PM Christian Balzer <ch...@gol.com> wrote:
> >
> > > On Wed, 11 May 2016 16:10:06 + Somnath Roy wrote:
> > >
> > > > I bumped up the backfill/recovery settings to match up Hammer.
> > > > It is probably unlikely that long tail latency is a parallelism
> > > > issue. If so, entire recovery would be suffering not the tail
> > > > alone. It's probably a prioritization issue. Will start looking
> > > > and update my findings. I can't add devl because of the table
> > > > but needed to add community that's why ceph-users :-).. Also,
> > > > wanted to know from Ceph's user if they are also facing similar issues..
> > > >
> > >
> > > What I meant with lack of parallelism is that at the start of a
> > > rebuild, there are likely to be many candidate PGs for recovery
> > > and backfilling, so many things happen at the same time, up to the
> > > limits of what is configured (max backfill etc).
> > >
> > > From looking at my test cluster, it starts with 8-10 backfills and
> > > recoveries (out of 140 affected PGs), but later on in the game
> > > there are less and less PGs (and OSDs/nodes) to choose from, so
> > > things slow down around 60 PGs to just 3-4 backfills.
> > > And around 20 PGs it's down to 1-2 backfills, so the parallelism
> > > is clearly gone at that point and recovery speed is down to what a
> > > single PG/OSD can handle.
> > >
> > > Christian
> > >
> > > > Thanks & Regards
> > > > Somnath
> > > >
> > > > -Original Message-
> > > > From: Christian Balzer [mailto:ch...@gol.com]
> > > > Sent: Wednesday, May 11, 2016 12:31 AM
> > > > To: Somnath Roy
> > > > Cc: Mark Nelson; Nick Fisk; ceph-users@lists.ceph.com

Re: [ceph-users] Segfault in libtcmalloc.so.4.2.2

2016-05-13 Thread Somnath Roy
What is the exact kernel version?
Ubuntu has a newer tcmalloc incorporated from the 3.16.0.50 kernel onwards. If you 
are using an older kernel than this, it is better to upgrade the kernel, or try building the 
latest tcmalloc and see whether this is still happening there.
Ceph is not packaging tcmalloc; it uses the tcmalloc available with the distro.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of David
Sent: Friday, May 13, 2016 6:13 AM
To: ceph-users
Subject: [ceph-users] Segfault in libtcmalloc.so.4.2.2

Hi,

Been getting some segfaults in our newest ceph cluster running ceph 9.2.1-1 on 
Debian 8.3
segfault at 0 ip 7f27e85120f7 sp 7f27cff9e860 error 4 in 
libtcmalloc.so.4.2.2

I saw there’s already a bug up there on the tracker: 
http://tracker.ceph.com/issues/15628
Don’t know how many other are affected by it. We stop and start the osd to 
bring it up again but it’s quite annoying.

I’m guessing this affects Jewel as well?

Kind Regards,

David Majchrzak

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weighted Priority Queue testing

2016-05-12 Thread Somnath Roy
FYI, in my test I used osd_max_backfills = 10, which is the Hammer default.
Post-Hammer it's been changed to 1.
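For anyone wanting to experiment, the value can be checked and bumped at runtime (osd.0 below is just an example; the daemon command needs the admin socket on that OSD's host):

  ceph daemon osd.0 config get osd_max_backfills
  ceph tell osd.* injectargs '--osd-max-backfills 10'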

Thanks & Regards
Somnath

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Thursday, May 12, 2016 10:40 PM
To: Scottix
Cc: Somnath Roy; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] Weighted Priority Queue testing


Hello,

On Thu, 12 May 2016 15:41:13 + Scottix wrote:

> We have run into this same scenarios in terms of the long tail taking
> much longer on recovery than the initial.
>
> Either time we are adding osd or an osd get taken down. At first we
> have max-backfill set to 1 so it doesn't kill the cluster with io. As
> time passes by the single osd is performing the backfill. So we are
> gradually increasing the max-backfill up to 10 to reduce the amount of
> time it needs to recover fully. I know there are a few other factors
> at play here but for us we tend to do this procedure every time.
>

Yeah, as I wrote in my original mail "This becomes even more obvious when 
backfills and recovery settings are lowered".

However my test cluster is at the default values, so it starts with a (much too 
big) bang and ends with a whimper, not because it's throttled but simply 
because there are so few PGs/OSDs to choose from.
Or so it seems, purely from observation.

Christian
> On Wed, May 11, 2016 at 6:29 PM Christian Balzer <ch...@gol.com> wrote:
>
> > On Wed, 11 May 2016 16:10:06 + Somnath Roy wrote:
> >
> > > I bumped up the backfill/recovery settings to match up Hammer. It
> > > is probably unlikely that long tail latency is a parallelism
> > > issue. If so, entire recovery would be suffering not the tail
> > > alone. It's probably a prioritization issue. Will start looking
> > > and update my findings. I can't add devl because of the table but
> > > needed to add community that's why ceph-users :-).. Also, wanted
> > > to know from Ceph's user if they are also facing similar issues..
> > >
> >
> > What I meant with lack of parallelism is that at the start of a
> > rebuild, there are likely to be many candidate PGs for recovery and
> > backfilling, so many things happen at the same time, up to the
> > limits of what is configured (max backfill etc).
> >
> > From looking at my test cluster, it starts with 8-10 backfills and
> > recoveries (out of 140 affected PGs), but later on in the game there
> > are less and less PGs (and OSDs/nodes) to choose from, so things
> > slow down around 60 PGs to just 3-4 backfills.
> > And around 20 PGs it's down to 1-2 backfills, so the parallelism is
> > clearly gone at that point and recovery speed is down to what a
> > single PG/OSD can handle.
> >
> > Christian
> >
> > > Thanks & Regards
> > > Somnath
> > >
> > > -Original Message-
> > > From: Christian Balzer [mailto:ch...@gol.com]
> > > Sent: Wednesday, May 11, 2016 12:31 AM
> > > To: Somnath Roy
> > > Cc: Mark Nelson; Nick Fisk; ceph-users@lists.ceph.com
> > > Subject: Re: [ceph-users] Weighted Priority Queue testing
> > >
> > >
> > >
> > > Hello,
> > >
> > > not sure if the Cc: to the users ML was intentional or not, but
> > > either way.
> > >
> > > The issue seen in the tracker:
> > > http://tracker.ceph.com/issues/15763
> > > and what you have seen (and I as well) feels a lot like the lack
> > > of parallelism towards the end of rebuilds.
> > >
> > > This becomes even more obvious when backfills and recovery
> > > settings are lowered.
> > >
> > > Regards,
> > >
> > > Christian
> > > --
> > > Christian BalzerNetwork/Systems Engineer
> > > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > > http://www.gol.com/
> > >
> >

Re: [ceph-users] Weighted Priority Queue testing

2016-05-11 Thread Somnath Roy
Yes Mark, I have the following io profile going on during recovery.

[recover-test]
ioengine=rbd
clientname=admin
pool=mypool
rbdname=<>
direct=1
invalidate=0
rw=randrw
norandommap
randrepeat=0
rwmixread=40
rwmixwrite=60
iodepth=256
numjobs=6
end_fsync=0
bssplit=512/4:1024/1:1536/1:2048/1:2560/1:3072/1:3584/1:4k/67:8k/10:16k/7:32k/3:64k/3
group_reporting=1
time_based
runtime=24h

There is a degradation on the client io but unfortunately I didn't quantify 
that for all the cases. Will do that next time. I have one for scenario 2 
though (attached here).

Thanks & Regards
Somnath



-Original Message-
From: Mark Nelson [mailto:mnel...@redhat.com]
Sent: Wednesday, May 11, 2016 5:16 AM
To: Somnath Roy; Nick Fisk; Ben England; Kyle Bader
Cc: Sage Weil; Samuel Just; ceph-users@lists.ceph.com
Subject: Re: Weighted Priority Queue testing

> 1. First scenario, only 4 node scenario and since it is chassis level
> replication single node remaining on the chassis taking all the traffic.
> It seems that is a bottleneck as for the host level replication on the
> similar setup recovery time is much less (data is not in this table).
>
>
>
> 2. In the second scenario , I kept everything else same but doubled
> the node/chassis. Recovery time is also half.
>
>
>
> 3.  For the third scenario, increased cluster data and also now I have
> doubled the number of  OSDs in the cluster (since each drive size is
> 4TB now). Recovery time came down further.
>
>
>
> 4. Moved to Jewel keeping everything else same, got further improvement.
> Mostly because of improved write performance in jewel (?).
>
>
>
> 5. Last scenario is interesting. I got improved recovery speed than
> any other scenario with this WPQ. Degraded PG % came down to 2% within
> 3 hours , ~0.6% within 4 hours and 15 min , but *last 0.6% took ~4
> hours* hurting overall time for recovery.
>
> 6. In fact, this long tail latency is hurting the overall recovery
> time for every other scenarios. Related tracker I found is
> http://tracker.ceph.com/issues/15763
>
>
>
> Any feedback much appreciated. We can discuss this in tomorrow’s
> performance call if needed.

Hi Somnath,

Thanks for these!  Interesting results.  Did you have a client load going at 
the same time as recovery?  It would be interesting to know how client IO 
performance was affected in each case.  Too bad about the long tail on WPQ.  I 
wonder if the long tail is consistently higher with WPQ or it just happened to 
be higher in that test.

Anyway, thanks for the results!  Glad to see the recovery time in general is 
lower in hammer.

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weighted Priority Queue testing

2016-05-11 Thread Somnath Roy
I bumped up the backfill/recovery settings to match up Hammer. It is probably 
unlikely that long tail latency is a parallelism issue. If so, entire recovery 
would be suffering not the tail alone. It's probably a prioritization issue. 
Will start looking and update my findings.
I can't add devl because of the table but needed to add community that's why 
ceph-users :-).. Also, wanted to know from Ceph's user if they are also facing 
similar issues..

Thanks & Regards
Somnath

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com]
Sent: Wednesday, May 11, 2016 12:31 AM
To: Somnath Roy
Cc: Mark Nelson; Nick Fisk; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Weighted Priority Queue testing



Hello,

not sure if the Cc: to the users ML was intentional or not, but either way.

The issue seen in the tracker:
http://tracker.ceph.com/issues/15763
and what you have seen (and I as well) feels a lot like the lack of parallelism 
towards the end of rebuilds.

This becomes even more obvious when backfills and recovery settings are lowered.

Regards,

Christian
--
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Weighted Priority Queue testing

2016-05-11 Thread Somnath Roy
+ceph users

Hi,



Here is first cut result. I can only manage 128TB box for now.




Ceph code base | Capacity      | Each drive capacity | Compute-nodes | Total copy | Total data-set | Failure domain | Fault-injected    | Percentage of degraded PGs | Full recovery time | Last 1% of degraded PG recovery time
Hammer         | 2X128TB IF150 | 8TB                 | 2             | 2          | ~80TB          | Chassis        | One OSD node down | ~20%                       | ~24 hours          | ~3-4 hours
Hammer         | 2X128TB IF150 | 8TB                 | 4             | 2          | ~80TB          | Chassis        | One OSD node down | ~10%                       | 10 hours 3 min     | ~3 hours
Hammer         | 2X128TB IF150 | 4TB                 | 4             | 2          | ~100TB         | Chassis        | One OSD node down | ~12.5%                     | 7 hours 5 min      | ~2.5 hours
Jewel          | 2X128TB IF150 | 4TB                 | 4             | 2          | ~100TB         | Chassis        | One OSD node down | ~12.5%                     | 6 hours 10 min     | ~1 hour 30 min
Jewel + wpq    | 2X128TB IF150 | 4TB                 | 4             | 2          | ~100TB         | Chassis        | One OSD node down | ~12.5%                     | 8 hours 30 min     | ~4 hours 30 min




Summary :





1. First scenario: only a 4-node setup, and since it is chassis-level 
replication, the single node remaining on the chassis takes all the traffic. That 
seems to be the bottleneck, as with host-level replication on a similar 
setup the recovery time is much less (data is not in this table).



2. In the second scenario , I kept everything else same but doubled the 
node/chassis. Recovery time is also half.



3.  For the third scenario, increased cluster data and also now I have doubled 
the number of  OSDs in the cluster (since each drive size is 4TB now). Recovery 
time came down further.



4. Moved to Jewel keeping everything else same, got further improvement. Mostly 
because of improved write performance in jewel (?).



5. Last scenario is interesting. I got improved recovery speed than any other 
scenario with this WPQ. Degraded PG % came down to 2% within 3 hours , ~0.6% 
within 4 hours and 15 min , but last 0.6% took ~4 hours hurting overall time 
for recovery.

6. In fact, this long tail latency is hurting the overall recovery time for 
every other scenarios. Related tracker I found is 
http://tracker.ceph.com/issues/15763



Any feedback much appreciated. We can discuss this in tomorrow’s performance 
call if needed.



Thanks & Regards

Somnath



-Original Message-
From: Somnath Roy
Sent: Wednesday, May 04, 2016 11:47 AM
To: 'Mark Nelson'; Nick Fisk; Ben England; Kyle Bader
Cc: Sage Weil; Samuel Just
Subject: RE: Weighted Priority Queue testing



Thanks Mark, I will come back to you with some data on that. This is what I am 
planning to run.



1. One 2X IF150 chassis with 256 TB  flash each and total 8 node cluster (4 
servers on each). Will generate ~100TB of data on the cluster.



2. Will go for host and chassis level replication with 2 copies.



3. Heavy IO will be on (different block sizes 60% RW and 40% RR)



Hammer took me ~4 hours to complete recovery for a host level replication and 
single host down.

~12 hours when single host down with chassis level replication.



Bear with me till I find all the HW for this :-) Let me know if you guys want 
to add something here..



Regards

Somnath



-Original Message-

From: Mark Nelson [mailto:mnel...@redhat.com]

Sent: Wednesday, May 04, 2016 8:40 AM

To: Somnath Roy; Nick Fisk; Ben England; Kyle Bader

Cc: Sage Weil; Samuel Just

Subject: Weighted Priority Queue testing



Hi Guys,



I think all of you have expressed some interest in recovery testing either now 
or in the past, so I wanted to get folks together to talk.

We need to get the new weighted priority queue tested to:



a) see when/how it's breaking

b) hopefully see better behavior



It's available in Jewel through a simple ceph.conf change:



osd_op_queue = wpq



For those of you who have run cbt recovery tests in the past, it might be worth 
running some new stress tests comparing:



a) jewel + wpq

b) jewel + prio queue

c) hammer



In the past I've done this under various concurrent client workloads (say large 
sequential or small random writes).  I think Kyle has done quite a bit of this 
kind of testing in the recent past with Intel as well, so he might have some 
insights as to where we've been hurting recently.



Thanks,

Mark

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] performance drop a lot when running fio mix read/write

2016-05-02 Thread Somnath Roy
Yes, reads will be affected a lot in mixed read/write scenarios, as Ceph 
serializes ops on a PG. The write path is inefficient and that affects reads 
in turn.
Hope you are following all the config settings (shards/threads, pg numbers, etc.) 
already discussed in the community.
You may want to try a bigger QD and see whether it improves things or not.
BTW, try with Jewel or latest master (if not already) and you should see mixed 
read/write performance improvement, as write performance has improved in Jewel.
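A sketch of the shard/thread settings referred to above, with illustrative values only (defaults and optimal values depend on release and hardware):

  [osd]
  osd_op_num_shards = 10
  osd_op_num_threads_per_shard = 2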

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of min 
fang
Sent: Monday, May 02, 2016 8:18 PM
To: ceph-users
Subject: [ceph-users] performance drop a lot when running fio mix read/write

Hi, I ran random fio with rwmixread=70 and found read IOPS is 707 and write is 
303 (see the output below). These values are lower than the pure random write and read 
values: the 4K random write IOPS is 529 and the 4K randread IOPS is 11343. Apart 
from the rw type being different, all other parameters are the same.
I do not understand why mixing writes and reads has such a huge impact on 
performance. All random IOs. Thanks.


fio -filename=/dev/rbd2 -direct=1 -iodepth 64 -thread -rw=randrw -rwmixread=70 
-ioengine=libaio -bs=4k -size=100G -numjobs=1 -runtime=1000 -group_reporting 
-name=mytest1
mytest1: (g=0): rw=randrw, bs=4K-4K/4K-4K/4K-4K, ioengine=libaio, iodepth=64
fio-2.8
time 7423  cycles_start=1062103697308843
Starting 1 thread
Jobs: 1 (f=1): [m(1)] [100.0% done] [2144KB/760KB/0KB /s] [536/190/0 iops] [eta 
00m:00s]
mytest1: (groupid=0, jobs=1): err= 0: pid=7425: Sat Apr 30 08:55:14 2016
  read : io=2765.2MB, bw=2830.5KB/s, iops=707, runt=1000393msec
slat (usec): min=2, max=268, avg= 8.93, stdev= 4.17
clat (usec): min=203, max=1939.9K, avg=34039.43, stdev=93674.48
 lat (usec): min=207, max=1939.9K, avg=34048.93, stdev=93674.50
clat percentiles (usec):
 |  1.00th=[  516],  5.00th=[  836], 10.00th=[ 1112], 20.00th=[ 1448],
 | 30.00th=[ 1736], 40.00th=[ 6944], 50.00th=[13376], 60.00th=[17280],
 | 70.00th=[21888], 80.00th=[30848], 90.00th=[49920], 95.00th=[103936],
 | 99.00th=[552960], 99.50th=[675840], 99.90th=[880640], 99.95th=[954368],
 | 99.99th=[1105920]
bw (KB  /s): min=  350, max= 5944, per=100.00%, avg=2837.77, stdev=1272.84
  write: io=1184.8MB, bw=1212.8KB/s, iops=303, runt=1000393msec
slat (usec): min=2, max=310, avg= 9.35, stdev= 4.50
clat (msec): min=5, max=2210, avg=131.60, stdev=226.47
 lat (msec): min=5, max=2210, avg=131.61, stdev=226.47
clat percentiles (msec):
 |  1.00th=[9],  5.00th=[   13], 10.00th=[   15], 20.00th=[   20],
 | 30.00th=[   25], 40.00th=[   34], 50.00th=[   44], 60.00th=[   61],
 | 70.00th=[   84], 80.00th=[  125], 90.00th=[  449], 95.00th=[  709],
 | 99.00th=[ 1037], 99.50th=[ 1139], 99.90th=[ 1369], 99.95th=[ 1450],
 | 99.99th=[ 1663]
bw (KB  /s): min=   40, max= 2562, per=100.00%, avg=1215.62, stdev=564.19
lat (usec) : 250=0.01%, 500=0.60%, 750=1.94%, 1000=2.95%
lat (msec) : 2=18.69%, 4=2.46%, 10=4.21%, 20=22.05%, 50=26.40%
lat (msec) : 100=9.65%, 250=4.64%, 500=2.76%, 750=2.13%, 1000=1.11%
lat (msec) : 2000=0.39%, >=2000=0.01%
  cpu  : usr=0.83%, sys=1.47%, ctx=971080, majf=0, minf=1
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.1%, >=64=0.0%
 issued: total=r=707885/w=303294/d=0, short=r=0/w=0/d=0, 
drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=64

Run status group 0 (all jobs):
   READ: io=2765.2MB, aggrb=2830KB/s, minb=2830KB/s, maxb=2830KB/s, 
mint=1000393msec, maxt=1000393msec
  WRITE: io=1184.8MB, aggrb=1212KB/s, minb=1212KB/s, maxb=1212KB/s, 
mint=1000393msec, maxt=1000393msec

Disk stats (read/write):
  rbd2: ios=707885/303293, merge=0/0, ticks=24085792/39904840, 
in_queue=64045864, util=100.00%


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD Crashes

2016-04-29 Thread Somnath Roy
Check the system log and search for the corresponding drive. It should have the 
information about what is failing.
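For example (the device name is a placeholder for the data/journal device behind the failing OSD):

  dmesg | grep -i 'I/O error'
  grep -i sdX /var/log/syslog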

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Friday, April 29, 2016 8:59 AM
To: Samuel Just
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD Crashes

I can see that. I guess what would that be symptomatic of? How is it doing that 
on 6 different systems and on multiple OSDs?

-Original Message-
From: Samuel Just [mailto:sj...@redhat.com]
Sent: Friday, April 29, 2016 8:57 AM
To: Garg, Pankaj
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] OSD Crashes

Your fs is throwing an EIO on open.
-Sam

On Fri, Apr 29, 2016 at 8:54 AM, Garg, Pankaj  
wrote:
> Hi,
>
> I had a fully functional Ceph cluster with 3 x86 Nodes and 3 ARM64
> nodes, each with 12 HDD Drives and 2SSD Drives. All these were
> initially running Hammer, and then were successfully updated to Infernalis 
> (9.2.0).
>
> I recently deleted all my OSDs and swapped my drives with new ones on
> the
> x86 Systems, and the ARM servers were swapped with different ones
> (keeping drives same).
>
> I again provisioned the OSDs, keeping the same cluster and Ceph
> versions as before. But now, every time I try to run RADOS bench, my
> OSDs start crashing (on both ARM and x86 servers).
>
> I’m not sure why this is happening on all 6 systems. On the x86, it’s
> the same Ceph bits as before, and the only thing different is the new drives.
>
> It’s the same stack (pasted below) on all the OSDs too.
>
> Can anyone provide any clues?
>
>
>
> Thanks
>
> Pankaj
>
>
>
>
>
>
>
>
>
>
>
>   -14> 2016-04-28 08:09:45.423950 7f1ef05b1700  1 --
> 192.168.240.117:6820/14377 <== osd.93 192.168.240.116:6811/47080 1236
> 
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26) v1  981+0+4759 (3923326827 0 3705383247) 0x5634cbabc400
> con 0x5634c5168420
>
>-13> 2016-04-28 08:09:45.423981 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423882, event: header_read, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26)
>
>-12> 2016-04-28 08:09:45.423991 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423884, event: throttled, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26)
>
>-11> 2016-04-28 08:09:45.423996 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.423942, event: all_read, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26)
>
>-10> 2016-04-28 08:09:45.424001 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 0.00, event: dispatched, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26)
>
> -9> 2016-04-28 08:09:45.424014 7f1ef05b1700  5 -- op tracker -- seq:
> 29404, time: 2016-04-28 08:09:45.424014, event: queued_for_pg, op:
> osd_repop(client.2794263.0:37721 284.6d4
> 284/afa8fed4/benchmark_data_x86Ceph1_147212_object37720/head v
> 12284'26)
>
> -8> 2016-04-28 08:09:45.561827 7f1f15799700  5 osd.102 12284
> tick_without_osd_lock
>
> -7> 2016-04-28 08:09:45.973944 7f1f0801a700  1 --
> 192.168.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306
>  osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2 
> 47+0+0
> (846632602 0 0) 0x5634c8305c00 con 0x5634c58dd760
>
> -6> 2016-04-28 08:09:45.973995 7f1f0801a700  1 --
> 192.168.240.117:6821/14377 --> 192.168.240.115:0/26572 --
> osd_ping(ping_reply e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0
> 0x5634c7ba8000 con 0x5634c58dd760
>
> -5> 2016-04-28 08:09:45.974300 7f1f0981d700  1 --
> 10.18.240.117:6821/14377 <== osd.73 192.168.240.115:0/26572 1306 
> osd_ping(ping e12284 stamp 2016-04-28 08:09:45.971751) v2  47+0+0
> (846632602 0 0) 0x5634c8129400 con 0x5634c58dcf20
>
> -4> 2016-04-28 08:09:45.974337 7f1f0981d700  1 --
> 10.18.240.117:6821/14377 --> 192.168.240.115:0/26572 --
> osd_ping(ping_reply
> e12284 stamp 2016-04-28 08:09:45.971751) v2 -- ?+0 0x5634c617d600 con
> 0x5634c58dcf20
>
> -3> 2016-04-28 08:09:46.174079 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102) write couldn't open
> 287.6f9_head/287/ae33fef9/benchmark_data_ceph7_17591_object39895/head:
> (117) Structure needs cleaning
>
> -2> 2016-04-28 08:09:46.174103 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102)  error (117) Structure needs
> cleaning not handled on operation 0x5634c885df9e (16590.1.0, or op 0,
> counting from
> 0)
>
> -1> 2016-04-28 08:09:46.174109 7f1f11f92700  0
> filestore(/var/lib/ceph/osd/ceph-102) unexpected error code
>
>  0> 2016-04-28 08:09:46.178707 

Re: [ceph-users] krbd map on Jewel, sysfs write failed when rbd map

2016-04-26 Thread Somnath Roy
By default the image format is 2 in Jewel, which is not supported by krbd. Try 
creating the image with --image-format 1 and it should be resolved.
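For example (image name and size are illustrative; Jewel may print a deprecation warning for format 1):

  rbd create block_data/data03 --size 10240 --image-format 1
  rbd map block_data/data03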

Thanks
Somnath

Sent from my iPhone

On Apr 25, 2016, at 9:38 PM, 
"wd_hw...@wistron.com" 
> wrote:

Dear Cephers:
  I got the same issue under Ubuntu 14.04, even I try to use the image format 
‘1’.
# modinfo rbd
filename:   /lib/modules/3.13.0-85-generic/kernel/drivers/block/rbd.ko
license:GPL
author: Jeff Garzik >
description:rados block device
author: Yehuda Sadeh 
>
author: Sage Weil >
author: Alex Elder >
srcversion: 48BFBD5C3D31D799F01D218
depends:libceph
intree: Y
vermagic:   3.13.0-85-generic SMP mod_unload modversions
signer: Magrathea: Glacier signing key
sig_key:C6:33:E9:BF:A6:CA:49:D5:3D:2E:B5:25:6A:35:87:7D:04:F1:64:F8
sig_hashalgo:   sha512

##
# rbd info block_data/data01
rbd image 'data01':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.540752ae8944a
format: 2
features:
flags:
# rbd map block_data/data01
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

##
# rbd info block_data/data02
rbd image 'data02':
size 10240 MB in 2560 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.2aac0238e1f29
format: 2
features: layering
   flags:
# rbd map block_data/data02
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

  Is there any new idea to solve this issue?

Thanks a lot,
WD


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] On-going Bluestore Performance Testing Results

2016-04-22 Thread Somnath Roy
Yes, the kernel should do read-ahead; it's a block device setting. But if there is 
something extra XFS is doing for sequential workloads, I'm not sure...
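For what it's worth, a quick way to inspect/adjust block-level read-ahead (the device name sdb and the 4096 KB value below are only examples):

# blockdev --getra /dev/sdb
(reports read-ahead in 512-byte sectors)
# cat /sys/block/sdb/queue/read_ahead_kb
# echo 4096 > /sys/block/sdb/queue/read_ahead_kb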

Sent from my iPhone

> On Apr 22, 2016, at 8:54 AM, Jan Schermer  wrote:
>
> Having correlated graphs of CPU and block device usage would be helpful.
>
> To my cynical eye this looks like a clear regression in CPU usage, which was 
> always bottlenecking pure-SSD OSDs, and now got worse.
> The gains are from doing less IO on IO-saturated HDDs.
>
> Regression of 70% in 16-32K random writes is the most troubling; that's 
> coincidentally the average IO size for a DB2, and the biggest bottleneck to 
> its performance I've seen (other databases will be similar).
> It's great
>
> Btw readahead is not dependent on the filesystem (it's a mechanism in the IO 
> scheduler), so it should be present even on a block device, I think?
>
> Jan
>
>
>> On 22 Apr 2016, at 17:35, Mark Nelson  wrote:
>>
>> Hi Guys,
>>
>> Now that folks are starting to dig into bluestore with the Jewel release, I 
>> wanted to share some of our on-going performance test data. These are from 
>> 10.1.0, so almost, but not quite, Jewel.  Generally bluestore is looking 
>> very good on HDDs, but there are a couple of strange things to watch out 
>> for, especially with NVMe devices.  Mainly:
>>
>> 1) in HDD+NVMe configurations performance increases dramatically when 
>> replacing the stock CentOS7 kernel with Kernel 4.5.1.
>>
>> 2) In NVMe only configurations performance is often lower at middle-sized 
>> IOs.  Kernel 4.5.1 doesn't really help here.  In fact it seems to amplify 
>> both the cases where bluestore is faster and where it is slower.
>>
>> 3) Medium sized sequential reads are where bluestore consistently tends to 
>> be slower than filestore.  It's not clear yet if this is simply due to 
>> Bluestore not doing read ahead at the OSD (ie being entirely dependent on 
>> client read ahead) or something else as well.
>>
>> I wanted to post this so other folks have some ideas of what to look for as 
>> they do their own bluestore testing.  This data is shown as percentage 
>> differences vs filestore, but I can also release the raw throughput values 
>> if people are interested in those as well.
>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZOTVQNkV0M2tIWkk/view?usp=sharing
>>
>> Thanks!
>> Mark
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread Somnath Roy
We faced this issue too and figured out that in Jewel the default image creation 
is with format 2.
Not sure if changing the default was a good idea though, as almost all the 
LTS releases ship with older kernels and will face incompatibility issues.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jason 
Dillaman
Sent: Tuesday, March 29, 2016 3:15 PM
To: Stefan Lissmats
Cc: ceph-users
Subject: Re: [ceph-users] Scrubbing a lot

Image format 1 is still supported -- just trying to slowly move users off of it 
and onto image format 2 through lots of log message nagging.

--

Jason Dillaman


- Original Message -

> From: "Stefan Lissmats" 
> To: "German Anders" 
> Cc: "ceph-users" 
> Sent: Tuesday, March 29, 2016 4:55:02 PM
> Subject: Re: [ceph-users] Scrubbing a lot

> Ok, I also got the warning but was able to use it anyway. Could be
> blocked in the new release of Jewel. Probably the more correct answer
> is the other one (to use --image-feature layering), but I
> haven't tried that myself.

> Skickat från min Samsung-enhet

>  Originalmeddelande 
> Från: German Anders 
> Datum: 2016-03-29 22:48 (GMT+01:00)
> Till: Stefan Lissmats 
> Kopia: Samuel Just , ceph-users
> 
> Rubrik: Re: [ceph-users] Scrubbing a lot

>  it seems that the image-format option is deprecated:

> # rbd --id cinder --cluster cephIB create e60host01v2 --size 100G
> --image-format 1 --pool cinder-volumes -k
> /etc/ceph/cephIB.client.cinder.keyring
> rbd: image format 1 is deprecated

> # rbd --cluster cephIB info e60host01v2 --pool cinder-volumes
> 2016-03-29 16:45:39.073198 7fb859eb7700 -1 librbd::image::OpenRequest:
> RBD image format 1 is deprecated. Please copy this image to image format 2.
> rbd image 'e60host01v2':
> size 102400 MB in 25600 objects
> order 22 (4096 kB objects)
> block_name_prefix: rb.0.37d7.238e1f29
> format: 1

> and the map operations still doesn't work :(

> # rbd --cluster cephIB map e60host01v2 --pool cinder-volumes -k
> /etc/ceph/cephIB.client.cinder.keyring
> rbd: sysfs write failed
> rbd: map failed: (5) Input/output error

> also, I'm running kernel 3.19.0-39-generic

> German

> 2016-03-29 17:40 GMT-03:00 Stefan Lissmats < ste...@trimmat.se > :

> > I agree. I ran into the same issue and the error message is not
> > that clear.
> > Mapping with the kernel rbd client (rbd map) needs a quite new
> > kernel to handle the new image format. The work-around is to use
> > --image-format 1 when creating the image.
>

> >  Originalmeddelande 
>
> > Från: Samuel Just < sj...@redhat.com >
>
> > Datum: 2016-03-29 22:24 (GMT+01:00)
>
> > Till: German Anders < gand...@despegar.com >
>
> > Kopia: ceph-users < ceph-users@lists.ceph.com >
>
> > Rubrik: Re: [ceph-users] Scrubbing a lot
>

> > Sounds like a version/compatibility thing. Are your rbd clients really old?
>
> > -Sam
>

> > On Tue, Mar 29, 2016 at 1:19 PM, German Anders <
> > gand...@despegar.com >
> > wrote:
>
> > > I've just upgrade to jewel, and the scrubbing seems to been corrected...
> > > but
>
> > > now I'm not able to map an rbd on a host (before I was able to),
> > > basically
>
> > > I'm getting this error msg:
>
> > >
>
> > > rbd: sysfs write failed
>
> > > rbd: map failed: (5) Input/output error
>
> > >
>
> > > # rbd --cluster cephIB create host01 --size 102400 --pool
> > > cinder-volumes -k
>
> > > /etc/ceph/cephIB.client.cinder.keyring
>
> > > # rbd --cluster cephIB map host01 --pool cinder-volumes -k
>
> > > /etc/ceph/cephIB.client.cinder.keyring
>
> > > rbd: sysfs write failed
>
> > > rbd: map failed: (5) Input/output error
>
> > >
>
> > > Any ideas? on the /etc/ceph directory on the host I've:
>
> > >
>
> > > -rw-r--r-- 1 ceph ceph 92 Nov 17 15:45 rbdmap
>
> > > -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
>
> > > -rw-r--r-- 1 ceph ceph 37 Dec 15 15:12 virsh-secret
>
> > > -rw-r--r-- 1 ceph ceph 0 Dec 15 15:12 virsh-secret-set
>
> > > -rw-r--r-- 1 ceph ceph 37 Dec 21 14:53 virsh-secretIB
>
> > > -rw-r--r-- 1 ceph ceph 0 Dec 21 14:53 virsh-secret-setIB
>
> > > -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
>
> > > -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
>
> > > -rw-r--r-- 1 ceph ceph 72 Dec 23 09:51 ceph.client.cinder.keyring
>
> > > -rw-r--r-- 1 ceph ceph 63 Mar 28 09:03
> > > cephIB.client.cinder.keyring
>
> > > -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
>
> > > -rw--- 1 ceph ceph 63 Mar 29 16:11 cephIB.client.admin.keyring
>
> > >
>
> > > Thanks in advance,
>
> > >
>
> > > Best,
>
> > >
>
> > > German
>
> > >
>
> > > 2016-03-29 14:45 GMT-03:00 German Anders < gand...@despegar.com >:
>
> > >>
>
> > >> Sure, also the scrubbing is happening on all the osds :S
>
> > >>
>
> > >> # ceph --cluster cephIB daemon osd.4 config diff

Re: [ceph-users] SSD and Journal

2016-03-15 Thread Somnath Roy
Yes, if you can manage the *cost*, separating the journal onto a different device 
should improve write performance. But you need to evaluate how many OSD journals 
you can dedicate to a single journal device, as at some point you will be 
bottlenecked by that journal device's bandwidth.
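As a rough back-of-envelope sketch (the numbers are assumptions for illustration, not measurements):

journal SSD sustained sync-write BW  ~ 400 MB/s   (assumed)
per-OSD backend write BW             ~ 100 MB/s   (assumed)
OSD journals one SSD can serve       ~ 400 / 100 = 4 before the SSD becomes the bottleneck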

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Yair 
Magnezi
Sent: Tuesday, March 15, 2016 6:44 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD and Journal

Hi Guys .

On a full SSD cluster, is it meaningful to put the journal on a different 
drive? Does it have any impact on performance?

Thanks




Yair Magnezi
Storage & Data Protection   // Kenshoo
Office +972 7 32862423   // Mobile +972 50 575-2955
__



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Somnath Roy
Sorry, I missed that you are upgrading from Hammer... I think it is probably a 
bug introduced post-Hammer. Here is why it is happening, IMO:

In hammer:
-

https://github.com/ceph/ceph/blob/hammer/src/os/FileJournal.cc#L158

In Master/Infernalis/Jewel:
-

https://github.com/ceph/ceph/blob/infernalis/src/os/FileJournal.cc#L151

which is hard-coded to 4096.

Not sure why this is changed, Sam/Sage ?

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@caviumnetworks.com]
Sent: Tuesday, March 01, 2016 9:34 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: INFARNALIS with 64K Kernel PAGES

The OSDS were created with 64K page size, and mkfs was done with the same size.
After upgrade, I have not changed anything on the machine (except applied the 
ownership fix for files for user ceph:ceph)

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Tuesday, March 01, 2016 9:32 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: RE: INFARNALIS with 64K Kernel PAGES

Did you recreate the OSDs on this setup, meaning did you do mkfs with a 64K page 
size?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Tuesday, March 01, 2016 9:07 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] INFARNALIS with 64K Kernel PAGES

Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size came up OK and its OSDs are all online.
Systems with 64K page size are all seeing the OSDs crash with the following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_com

Re: [ceph-users] INFARNALIS with 64K Kernel PAGES

2016-03-01 Thread Somnath Roy
Did you recreate the OSDs on this setup, meaning did you do mkfs with a 64K page 
size?

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Tuesday, March 01, 2016 9:07 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] INFARNALIS with 64K Kernel PAGES

Hi,
Is there a known issue with using 64K Kernel PAGE_SIZE?
I am using ARM64 systems, and I upgraded from 0.94.4 to 9.2.1 today. The system 
which was on 4K page size came up OK and its OSDs are all online.
Systems with 64K page size are all seeing the OSDs crash with the following stack:

Begin dump of recent events ---
   -54> 2016-03-01 20:52:56.489752 97e38f10  5 asok(0xff6c) 
register_command perfcounters_dump hook 0xff63c030
   -53> 2016-03-01 20:52:56.489798 97e38f10  5 asok(0xff6c) 
register_command 1 hook 0xff63c030
   -52> 2016-03-01 20:52:56.489809 97e38f10  5 asok(0xff6c) 
register_command perf dump hook 0xff63c030
   -51> 2016-03-01 20:52:56.489819 97e38f10  5 asok(0xff6c) 
register_command perfcounters_schema hook 0xff63c030
   -50> 2016-03-01 20:52:56.489829 97e38f10  5 asok(0xff6c) 
register_command 2 hook 0xff63c030
   -49> 2016-03-01 20:52:56.489839 97e38f10  5 asok(0xff6c) 
register_command perf schema hook 0xff63c030
   -48> 2016-03-01 20:52:56.489849 97e38f10  5 asok(0xff6c) 
register_command perf reset hook 0xff63c030
   -47> 2016-03-01 20:52:56.489858 97e38f10  5 asok(0xff6c) 
register_command config show hook 0xff63c030
   -46> 2016-03-01 20:52:56.489868 97e38f10  5 asok(0xff6c) 
register_command config set hook 0xff63c030
   -45> 2016-03-01 20:52:56.489877 97e38f10  5 asok(0xff6c) 
register_command config get hook 0xff63c030
   -44> 2016-03-01 20:52:56.489886 97e38f10  5 asok(0xff6c) 
register_command config diff hook 0xff63c030
   -43> 2016-03-01 20:52:56.489896 97e38f10  5 asok(0xff6c) 
register_command log flush hook 0xff63c030
   -42> 2016-03-01 20:52:56.489905 97e38f10  5 asok(0xff6c) 
register_command log dump hook 0xff63c030
   -41> 2016-03-01 20:52:56.489914 97e38f10  5 asok(0xff6c) 
register_command log reopen hook 0xff63c030
   -40> 2016-03-01 20:52:56.497924 97e38f10  0 set uid:gid to 64045:64045
   -39> 2016-03-01 20:52:56.498074 97e38f10  0 ceph version 9.2.1 
(752b6a3020c3de74e07d2a8b4c5e48dab5a6b6fd), process ceph-osd, pid 17095
   -38> 2016-03-01 20:52:56.499547 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -37> 2016-03-01 20:52:56.499572 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6802/17095 need_addr=0
   -36> 2016-03-01 20:52:56.499620 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -35> 2016-03-01 20:52:56.499638 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6802/17095 need_addr=0
   -34> 2016-03-01 20:52:56.499673 97e38f10  1 -- 192.168.240.124:0/0 
learned my addr 192.168.240.124:0/0
   -33> 2016-03-01 20:52:56.499690 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 192.168.240.124:6803/17095 need_addr=0
   -32> 2016-03-01 20:52:56.499724 97e38f10  1 -- 10.18.240.124:0/0 learned 
my addr 10.18.240.124:0/0
   -31> 2016-03-01 20:52:56.499741 97e38f10  1 accepter.accepter.bind 
my_inst.addr is 10.18.240.124:6803/17095 need_addr=0
   -30> 2016-03-01 20:52:56.503307 97e38f10  5 asok(0xff6c) init 
/var/run/ceph/ceph-osd.100.asok
   -29> 2016-03-01 20:52:56.503329 97e38f10  5 asok(0xff6c) 
bind_and_listen /var/run/ceph/ceph-osd.100.asok
   -28> 2016-03-01 20:52:56.503460 97e38f10  5 asok(0xff6c) 
register_command 0 hook 0xff6380c0
   -27> 2016-03-01 20:52:56.503479 97e38f10  5 asok(0xff6c) 
register_command version hook 0xff6380c0
   -26> 2016-03-01 20:52:56.503490 97e38f10  5 asok(0xff6c) 
register_command git_version hook 0xff6380c0
   -25> 2016-03-01 20:52:56.503500 97e38f10  5 asok(0xff6c) 
register_command help hook 0xff63c1e0
   -24> 2016-03-01 20:52:56.503510 97e38f10  5 asok(0xff6c) 
register_command get_command_descriptions hook 0xff63c1f0
   -23> 2016-03-01 20:52:56.503566 9643f030  5 asok(0xff6c) entry 
start
   -22> 2016-03-01 20:52:56.503635 97e38f10 10 monclient(hunting): 
build_initial_monmap
   -21> 2016-03-01 20:52:56.520227 97e38f10  5 adding auth protocol: cephx
   -20> 2016-03-01 20:52:56.520244 97e38f10  5 adding auth protocol: cephx
   -19> 2016-03-01 20:52:56.520427 97e38f10  5 asok(0xff6c) 
register_command objecter_requests hook 0xff63c2b0
   -18> 2016-03-01 20:52:56.520538 97e38f10  1 -- 10.18.240.124:6802/17095 
messenger.start
   -17> 2016-03-01 20:52:56.520601 97e38f10  1 -- :/0 messenger.start
   -16> 2016-03-01 20:52:56.520655 97e38f10  1 -- 

Re: [ceph-users] SSD Journal Performance Priorties

2016-02-26 Thread Somnath Roy
You need to make sure the SSD's O_DIRECT|O_DSYNC performance is good. Not all 
SSDs are good at it; refer to the prior discussions in the community for that.

<< Presumably as long as the SSD read speed exceeds that of the spinners, that 
is sufficient.
You probably meant the write speed of the SSDs? The journal will not be read in 
the IO path; it is read only during journal replay at the time of OSD restart.
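A minimal fio sketch for that O_DIRECT|O_DSYNC check (assumes a scratch device /dev/sdX whose data you can destroy; block size and runtime are arbitrary):

# fio --name=journal-test --filename=/dev/sdX --direct=1 --sync=1 \
      --rw=write --bs=4k --iodepth=1 --numjobs=1 --runtime=60 --time_based

Drives without power-loss protection often collapse to a few hundred IOPS in this mode, which is what makes them poor journal devices.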

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Lindsay Mathieson
Sent: Friday, February 26, 2016 4:16 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] SSD Journal Performance Priorties

Ignoring the durability and network issues for now :) Are there any aspects of 
a journal's performance that matter most for overall Ceph performance?

i.e. my initial thought is that if I want to improve Ceph write performance, 
journal sequential write speed is what matters. Does random write speed factor in at all?

Presumably as long as the SSD read speed exceeds that of the spinners, that is 
sufficient.

--
Lindsay Mathieson

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to recover from OSDs full in small cluster

2016-02-17 Thread Somnath Roy
If you are not sure about what weight to put, ‘ceph osd 
reweight-by-utilization’ should also do the job for you automatically.
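For example (the 110 threshold below is just an illustration; the default threshold is 120, i.e. only OSDs more than 20% above average utilization get reweighted):

# ceph osd reweight-by-utilization 110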

Thanks & Regards
Somnath


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Wednesday, February 17, 2016 12:48 PM
To: Lukáš Kubín
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] How to recover from OSDs full in small cluster

Ahoj ;-)

You can reweight them temporarily, that shifts the data from the full drives.

ceph osd reweight osd.XX YY
(XX = the number of the full OSD, YY is the "weight", which defaults to 1)

This is different from "crush reweight" which defaults to drive size in TB.

Beware that reweighting will (afaik) only shuffle the data to other local 
drives, so you should reweight both the full drives at the same time and only 
by a little bit at a time (0.95 is a good starting point).
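A concrete sketch for the two 100% full OSDs visible in the df output further down (osd.4 and osd.5 here; adjust the IDs and the step to your cluster):

ceph osd reweight osd.4 0.95
ceph osd reweight osd.5 0.95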

Jan


On 17 Feb 2016, at 21:43, Lukáš Kubín wrote:

Hi,
I'm running a very small setup of 2 nodes with 6 OSDs each. There are 2 pools, 
each of size=2. Today, one of our OSDs got full and another 2 are near full. The 
cluster turned into ERR state. I have noticed uneven space distribution among the 
OSD drives, between 70 and 100 percent. I realized there was a low number of PGs in 
those 2 pools (128 each) and increased one of them to 512, expecting magic to 
happen and redistribute the space evenly.

Well, something happened - another OSD became full during the redistribution 
and the cluster stopped both OSDs and marked them down. After some hours the 
remaining drives partially rebalanced and the cluster got to WARN state.

I've deleted 3 placement group directories from one of the full OSD's 
filesystem, which allowed me to start it up again. Soon, however, this drive 
became full again.

So now, there are 2 of 12 OSDs down, cluster is in WARN and I have no drives to 
add.

Is there a way how to get out of this situation without adding OSDs? I will 
attempt to release some space, just waiting for colleague to identify RBD 
volumes (openstack images and volumes) which can be deleted.

Thank you.

Lukas


This is my cluster state now:

[root@compute1 ~]# ceph -w
cluster d35174e9-4d17-4b5e-80f2-02440e0980d5
 health HEALTH_WARN
10 pgs backfill_toofull
114 pgs degraded
114 pgs stuck degraded
147 pgs stuck unclean
114 pgs stuck undersized
114 pgs undersized
1 requests are blocked > 32 sec
recovery 56923/640724 objects degraded (8.884%)
recovery 29122/640724 objects misplaced (4.545%)
3 near full osd(s)
 monmap e3: 3 mons at 
{compute1=10.255.242.14:6789/0,compute2=10.255.242.15:6789/0,compute3=10.255.242.16:6789/0}
election epoch 128, quorum 0,1,2 compute1,compute2,compute3
 osdmap e1073: 12 osds: 10 up, 10 in; 39 remapped pgs
  pgmap v21609066: 640 pgs, 2 pools, 2390 GB data, 309 kobjects
4365 GB used, 890 GB / 5256 GB avail
56923/640724 objects degraded (8.884%)
29122/640724 objects misplaced (4.545%)
 493 active+clean
 108 active+undersized+degraded
  29 active+remapped
   6 active+undersized+degraded+remapped+backfill_toofull
   4 active+remapped+backfill_toofull

[root@ceph1 ~]# df|grep osd
/dev/sdg1   580496384 500066812  80429572  87% 
/var/lib/ceph/osd/ceph-3
/dev/sdf1   580496384 502131428  78364956  87% 
/var/lib/ceph/osd/ceph-2
/dev/sde1   580496384 506927100  73569284  88% 
/var/lib/ceph/osd/ceph-0
/dev/sdb1   287550208 287550188        20 100% 
/var/lib/ceph/osd/ceph-5
/dev/sdd1   580496384 580496364        20 100% 
/var/lib/ceph/osd/ceph-4
/dev/sdc1   580496384 478675672 101820712  83% 
/var/lib/ceph/osd/ceph-1

[root@ceph2 ~]# df|grep osd
/dev/sdf1   580496384 448689872 131806512  78% 
/var/lib/ceph/osd/ceph-7
/dev/sdb1   287550208 227054336  60495872  79% 
/var/lib/ceph/osd/ceph-11
/dev/sdd1   580496384 464175196 116321188  80% 
/var/lib/ceph/osd/ceph-10
/dev/sdc1   580496384 489451300  91045084  85% 
/var/lib/ceph/osd/ceph-6
/dev/sdg1   580496384 470559020 109937364  82% 
/var/lib/ceph/osd/ceph-9
/dev/sde1   580496384 490289388  90206996  85% 
/var/lib/ceph/osd/ceph-8

[root@ceph2 ~]# ceph df
GLOBAL:
SIZE  AVAIL RAW USED %RAW USED
5256G  890G4365G 83.06
POOLS:
NAME   ID USED  %USED MAX AVAIL OBJECTS
glance 6  1714G 32.61  385G  219579
cinder 7   676G 12.86  385G   97488

[root@ceph2 ~]# ceph osd pool get glance pg_num
pg_num: 512
[root@ceph2 ~]# ceph osd pool get cinder pg_num
pg_num: 

Re: [ceph-users] Extra RAM to improve OSD write performance ?

2016-02-14 Thread Somnath Roy
I doubt it will do much good in case of a 100% write workload. You can tweak your 
VM dirty ratio settings to help the buffered writes, but the downside is that the 
more data it has to sync (when eventually dumping the dirty buffers), the more 
spikiness it will induce. The write behavior won't be smooth and the gain won't be 
much (or not there at all).
But Ceph does xattr reads in the write path, so if you have a very big workload 
this extra RAM will help you hold dentry caches in memory (or go for a 
swappiness setting that does not swap out dentry caches) and will effectively save 
some disk hits. Also, in a mixed read/write scenario this should help, as some 
reads could benefit from it. It all depends on how random and how big your 
workload is.
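If you do experiment with that, these are the usual knobs (the values below are only illustrative starting points, not recommendations):

vm.dirty_background_ratio = 5    (start background writeback earlier)
vm.dirty_ratio = 10              (cap dirty memory so the eventual sync is smaller)
vm.vfs_cache_pressure = 50       (prefer keeping dentry/inode caches)
vm.swappiness = 10               (avoid swapping them out)

# sysctl -p    (after adding the lines above to /etc/sysctl.conf)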


Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickey 
Singh
Sent: Sunday, February 14, 2016 1:55 AM
To: ceph-users@lists.ceph.com; ceph-users
Subject: [ceph-users] Extra RAM to improve OSD write performance ?

Hello Community

Happy Valentines Day ;-)

I need some advice on using extra RAM on my OSD servers to improve Ceph's write 
performance.

I have 20 OSD servers, each with 256GB RAM and 16 x 6TB OSDs, so assuming the 
cluster is not recovering, most of the time the system will have at least ~150GB 
of RAM free. And for 20 machines that's a lot, ~3.0 TB of RAM.

Is there any way to use this free RAM to improve the write performance of the 
cluster, something like the Linux page cache for OSD write operations?

I assume that by default the Linux page cache can use free memory to improve OSD 
read performance (please correct me if I am wrong). But how about OSD write 
improvement? How can that be improved with free RAM?

PS : My Ceph cluster's workload is just OpenStack Cinder , Glance , Nova for 
instance disk

- Vickey -



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: HEALTH_WARN pool vol has too few pgs

2016-02-03 Thread Somnath Roy
You can increase it, but that will trigger rebalancing, and based on the amount 
of data it will take some time before the cluster comes back into a clean state.
Client IO performance will be affected during this.
BTW, this is not really an error; it is a warning, because performance on that 
pool will be affected by the low PG count.
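If you do go ahead, a rough sketch of the steps (pool name 'volumes' as in your output; note that pgp_num must follow pg_num before data actually rebalances, and many people step up in stages rather than in one big jump):

# ceph osd pool set volumes pg_num 8192
# ceph osd pool set volumes pgp_num 8192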

Thanks & Regards
Somnath
-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of M 
Ranga Swami Reddy
Sent: Wednesday, February 03, 2016 9:48 PM
To: Ferhat Ozkasgarli
Cc: ceph-users
Subject: Re: [ceph-users] Fwd: HEALTH_WARN pool vol has too few pgs

Current pg_num: 4096.  As per the PG num formula, number of OSDs * 100 / pool size ->
184 * 100 / 3 = 6133, so I can increase to 8192. Will this solve the problem?

Thanks
Swami

On Thu, Feb 4, 2016 at 2:14 AM, Ferhat Ozkasgarli  wrote:
> As the message states, you must increase the placement group number for the pool,
> because 108T of data requires a larger PG number.
>
> On Feb 3, 2016 8:09 PM, "M Ranga Swami Reddy"  wrote:
>>
>> Hi,
>>
>> I am using ceph for my storage cluster and health shows as WARN state
>> with too few pgs.
>>
>> ==
>> health HEALTH_WARN pool volumes has too few pgs ==
>>
>> The volume pool has 4096 pgs
>> --
>> ceph osd pool get volumes pg_num
>> pg_num: 4096
>> ---
>>
>> and
>> >ceph df
>> NAME   ID USED  %USED MAX AVAIL
>> OBJECTS
>> volumes4  2830G  0.82  108T
>> 763509
>> --
>>
>> How do we fix this, without downtime?
>>
>> Thanks
>> Swami
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-01-28 Thread Somnath Roy
Hi,
Ceph needs to maintain a journal in the case of filestore, as an underlying 
filesystem like XFS *doesn’t have* any transactional semantics. Ceph has to do a 
transactional write with data and metadata in the write path. It does this in the 
following way.

1. It creates a transaction object having multiple metadata operations and the 
actual payload write.

2. It is passed to Objectstore layer.

3. Objectstore can complete the transaction in sync or async (Filestore) way.

4. Filestore dumps the entire Transaction object to the journal. The journal is a 
circular buffer, written to the disk sequentially in O_DIRECT | O_DSYNC mode.

5. Once the journal write is successful, the write is acknowledged to the client. 
Reading this data is not allowed yet, as it has still not been written to the 
actual location in the filesystem.

6. The actual execution of the transaction is done in parallel for filesystems 
that can do checkpointing, like BTRFS. For filesystems like XFS/ext4 the journal 
is write-ahead, i.e. the Tx object will be written to the journal first and then 
the Tx execution will happen.

7. Tx execution is done in parallel by the filestore worker threads. The 
payload write is a buffered write, and a sync thread within filestore 
periodically calls ‘syncfs’ to persist data/metadata to the actual location.

8. Before each ‘syncfs’ call it determines the seq number up to which everything 
is persisted and trims the transaction objects from the journal up to that point. 
This makes room for more writes in the journal. If the journal is full, writes will be stuck.

9. If the OSD crashes after the write is acknowledged, the Tx will be replayed from 
the last successful backend commit seq number (maintained in a file after 
‘syncfs’).

So, as you can see, it’s not a flaw but a necessity to have a journal for 
filestore in the case of an rbd workload, as it can do partial overwrites. It is 
not needed for full writes, like for objects, and that’s the reason Sage came up 
with the new store which will not be doing double writes for an object workload.
The keyvaluestore backend also doesn’t have any journal, as it relies on a 
backend like leveldb/rocksdb for that.
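A quick way to watch this journal/apply pipeline on a live OSD is the admin socket perf counters (osd.0 is just an example id, and this assumes the admin socket is at its default path):

# ceph daemon osd.0 perf dump | grep -E 'journal_queue|journal_latency|commitcycle|apply_latency'
(journal_queue_* shows how far the journal is ahead; commitcycle/apply_latency show how the syncfs side keeps up)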

Regarding Jan’s point about a block vs a file journal, IMO the only advantage of 
the journal being a block device is that filestore can do aio writes to it.

Now, here is what SanDisk changed..

1. In the write path Filestore has to do some throttling, as the journal can’t go 
much further ahead than the actual backend write (Tx execution). We have introduced 
dynamic throttling based on the journal fill rate and a % increase over the config 
option filestore_queue_max_bytes. This config option keeps track of outstanding 
backend byte writes.

2. Instead of a buffered write we have introduced an O_DSYNC write during 
transaction execution, as it reduces the amount of data syncfs has to write 
and thus gives a more stable performance.

3. The main reason we can’t allow the journal to go much further ahead is that the 
Tx object will not be deleted till the Tx executes. The further behind the Tx 
execution is, the more memory growth will happen. Previously, the Tx object was 
deleted asynchronously (and thus taking more time) and we changed it to delete it 
from the filestore worker thread itself.

4. The sync thread is optimized to do a fast sync. The extra last-commit-seq 
file is no longer maintained for *the write-ahead journal*, as this 
information can be found in the journal header.

Here is the related pull requests..




https://github.com/ceph/ceph/pull/7271

https://github.com/ceph/ceph/pull/7303

https://github.com/ceph/ceph/pull/7278

https://github.com/ceph/ceph/pull/6743



Regarding bypassing the filesystem and accessing the block device directly: yes, 
that should be a cleaner/simpler and more efficient solution. With Sage’s 
Bluestore, Ceph is moving towards that very fast!

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tyler 
Bishop
Sent: Thursday, January 28, 2016 1:35 PM
To: Jan Schermer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] SSD Journal

What approach did sandisk take with this for jewel?






Tyler Bishop
Chief Technical Officer
513-299-7108 x10


tyler.bis...@beyondhosting.net








From: "Jan Schermer" >
To: "Tyler Bishop" 
>
Cc: "Bill WONG" >, 
ceph-users@lists.ceph.com
Sent: Thursday, January 28, 2016 4:32:54 PM
Subject: Re: [ceph-users] SSD Journal

You can't run Ceph OSD without a journal. The journal is always there.
If you don't have a journal 

Re: [ceph-users] SSD Journal

2016-01-28 Thread Somnath Roy
From: Jan Schermer [mailto:j...@schermer.cz]
Sent: Thursday, January 28, 2016 3:51 PM
To: Somnath Roy
Cc: Tyler Bishop; ceph-users@lists.ceph.com
Subject: Re: SSD Journal

Thanks for a great walkthrough explanation.
I am not really going to comment on everything (nor am I capable of it), but.. see 
below

On 28 Jan 2016, at 23:35, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:

Hi,
Ceph needs to maintain a journal in case of filestore as underlying filesystem 
like XFS *doesn’t have* any transactional semantics. Ceph has to do a 
transactional write with data and metadata in the write path. It does in the 
following way.

"Ceph has to do a transactional write with data and metadata in the write path"
Why? Isn't that only to provide that to itself?

[Somnath] Yes, that is for Ceph..That’s 2 setattrs (for rbd) + PGLog/Info..

1. It creates a transaction object having multiple metadata operations and the 
actual payload write.

2. It is passed to Objectstore layer.

3. Objectstore can complete the transaction in sync or async (Filestore) way.

Depending on whether the write was flushed or not? How is that decided?
[Somnath] It depends on how the ObjectStore backend is written; it's not 
dynamic. Filestore is implemented in an async way; I think BlueStore is written in 
a sync way (?).


4.  Filestore dumps the entire Transaction object to the journal. It is a 
circular buffer and written to the disk sequentially with O_DIRECT | O_DSYNC 
way.

Just FYI, O_DIRECT doesn't really guarantee "no buffering"; its purpose is 
just to avoid needless caching.
It should behave the way you want on Linux, but you must not rely on it since 
this guarantee is not portable.

[Somnath] O_DIRECT alone is not guaranteed, but with O_DSYNC it is guaranteed to 
reach the disk. It may still be in the disk cache, but that is taken 
care of by the disks.

5. Once journal write is successful , write is acknowledged to the client. Read 
for this data is not allowed yet as it is still not been written to the actual 
location in the filesystem.

Now you are providing a guarantee for something nobody really needs. There is 
no guarantee with traditional filesystems of not returning dirty unwritten 
data. The guarantees are on writes, not reads. It might be easier to do it this 
way if you plan for some sort of concurrent access to the same data from 
multiple readers (that don't share the cache) - but is that really the case 
here if it's still the same OSD that serves the data?
Do the journals absorb only the unbuffered IO or all IO?

And what happens currently if I need to read the written data right away? When 
do I get it then?

[Somnath] Well, this is debatable, but currently reads are blocked till entire 
Tx execution is completed (not after doing syncfs)..Journal absorbs all the IO..

6. The actual execution of the transaction is done in parallel for the 
filesystem that can do check pointing like BTRFS. For the filesystem like 
XFS/ext4 the journal is write ahead i.e Tx object will be written to journal 
first and then the Tx execution will happen.

7. Tx execution is done in parallel by the filestore worker threads. The 
payload write is a buffered write and a sync thread within filestore is 
periodically calling ‘syncfs’ to persist data/metadata to the actual location.

8. Before each ‘syncfs’ call it determines the seq number till it is persisted 
and trim the transaction objects from journal upto that point. This will make 
room for more writes in the journal. If journal is full, write will be stuck.

9. If OSD is crashed after write is acknowledge, the Tx will be replayed from 
the last successful backend commit seq number (maintained in a file after 
‘syncfs’).


You can just completely rip at least 6-9 out and mirror what the client sends 
to the filesystem with the same effect (and without a journal). Who cares how the 
filesystem implements it then; everybody can choose the filesystem that matches 
the workload (e.g. the one they already use on a physical volume they are 
migrating from).
It's a sensible solution to a non-existing problem...

[Somnath] Maybe, but different clients have different requirements; I guess we 
can’t design the OSD based on what a client will do. One has to make every effort 
to make the OSD crash consistent, IMO.
Probably it would be better if filestore gave the user a choice whether to use the 
journal or not based on the client’s need… If a client can live without being 
consistent, so be it.


So, as you can see, it’s not a flaw but a necessity to have a journal for 
filestore in case of rbd workload as it can do partial overwrites. It is not 
needed for full writes like for objects and that’s the reason Sage came up with 
new store which will not be doing double writes for Object workload.
The keyvaluestore backend also doesn’t have any journal as it is relying on 
backend like leveldb/rocksdb for that.

Regarding Jan’s point for block vs file journal, IMO the only advantage of 
journal being a b

Re: [ceph-users] optimized SSD settings for hammer

2016-01-25 Thread Somnath Roy
Yes, I think you should try with crc enabled, as it is recommended for 
network-level corruption detection.
It will definitely add some CPU cost, but the overhead is ~5x lower with the new 
Intel CPU instruction set.
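In other words, for Hammer that means dropping the old ms_nocrc line and leaving the new options at their defaults (or setting them explicitly), e.g.:

ms_crc_data = true
ms_crc_header = true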

-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
Sent: Monday, January 25, 2016 12:09 AM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: Re: optimized SSD settings for hammer


Am 25.01.2016 um 08:54 schrieb Somnath Roy:
> ms_nocrc options is changed to the following in Hammer..
> 
> ms_crc_data = false
> ms_crc_header = false

If I add those, the OSDs and clients can't communicate any longer.

> Rest looks good , you need to tweak the shard/thread based on your cpu 
> complex and total number of OSDs running on a box..
> BTW, with latest Intel instruction sets crc overhead is reduced significantly 
> and you may want to turn back on..

Ah OK, so I can remove ms_nocrc in general and also skip the data and header 
stuff you mentioned above?

Stefan

> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag]
> Sent: Sunday, January 24, 2016 11:48 PM
> To: ceph-users@lists.ceph.com
> Cc: Somnath Roy
> Subject: optimized SSD settings for hammer
> 
> Hi,
> 
> is there a guide or recommendation to optimized SSD settings for hammer?
> 
> We have:
> CPU E5-1650 v3 @ 3.50GHz (12 core incl. HT) 10x SSD / Node journal and 
> fs on the same ssd
> 
> currently we're runnig:
> - with auth disabled
> - all debug settings to 0
> 
> and
> 
> ms_nocrc = true
> osd_op_num_threads_per_shard = 2
> osd_op_num_shards = 12
> filestore_fd_cache_size = 512
> filestore_fd_cache_shards = 32
> ms_dispatch_throttle_bytes = 0
> osd_client_message_size_cap = 0
> osd_client_message_cap = 0
> osd_enable_op_tracker = false
> filestore_op_threads = 8
> filestore_min_sync_interval = 1
> filestore_max_sync_interval = 10
> 
> Thanks!
> 
> Greets,
> Stefan
> 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] optimized SSD settings for hammer

2016-01-24 Thread Somnath Roy
The ms_nocrc option is changed to the following in Hammer:

ms_crc_data = false
ms_crc_header = false

Rest looks good , you need to tweak the shard/thread based on your cpu complex 
and total number of OSDs running on a box..
BTW, with latest Intel instruction sets crc overhead is reduced significantly 
and you may want to turn back on..

Thanks & Regards
Somnath

-Original Message-
From: Stefan Priebe - Profihost AG [mailto:s.pri...@profihost.ag] 
Sent: Sunday, January 24, 2016 11:48 PM
To: ceph-users@lists.ceph.com
Cc: Somnath Roy
Subject: optimized SSD settings for hammer

Hi,

Is there a guide or recommendation for optimized SSD settings for Hammer?

We have:
CPU E5-1650 v3 @ 3.50GHz (12 cores incl. HT), 10x SSD per node, journal and fs on 
the same SSD

currently we're runnig:
- with auth disabled
- all debug settings to 0

and

ms_nocrc = true
osd_op_num_threads_per_shard = 2
osd_op_num_shards = 12
filestore_fd_cache_size = 512
filestore_fd_cache_shards = 32
ms_dispatch_throttle_bytes = 0
osd_client_message_size_cap = 0
osd_client_message_cap = 0
osd_enable_op_tracker = false
filestore_op_threads = 8
filestore_min_sync_interval = 1
filestore_max_sync_interval = 10

Thanks!

Greets,
Stefan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph scale testing

2016-01-20 Thread Somnath Roy
Hi,
Here is the copy of the ppt I presented in today's performance meeting..

https://docs.google.com/presentation/d/1j4Lcb9fx0OY7eQlQ_iUI6TPVJ6t_orZWKJyhz0S_3ic/edit?usp=sharing

Thanks & Regards
Somnath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD OSDs - more Cores or more GHz

2016-01-20 Thread Somnath Roy
Yes, thanks for the data.
BTW, Nick, do we know what is more important, more CPU cores or higher frequency?
For example, we have Xeon CPUs available with a bit lower frequency but with 
more cores per socket, so which one should we go with for OSD servers?

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Wednesday, January 20, 2016 6:54 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz

Excellent testing Nick!

Mark

On 01/20/2016 08:18 AM, Nick Fisk wrote:
> See this benchmark I did last year
>
> http://www.sys-pro.co.uk/ceph-storage-fast-cpus-ssd-performance/
>
>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
>> Of Oliver Dzombic
>> Sent: 20 January 2016 13:33
>> To: ceph-us...@ceph.com
>> Subject: Re: [ceph-users] SSD OSDs - more Cores or more GHz
>>
>> Hi,
>>
>> to be honest, i never made real benchmarks about that.
>>
>> But to me, i doubt that the higher frequency of a cpu will have a "real"
>> impact on ceph's performance.
>>
>> I mean, yes, mathematically, just like Wade pointed out, it's true.
>>> frequency = < latency
>>
>> But when we compare CPU's of the same model, with different frequencies.
>>
>> How much time ( in nano seconds ), do we save ?
>> I mean i have really no numbers here.
>>
>> But the difference between a 2.1 GHz and a 2.9 GHz ( Low End Xeon E5 
>> / High End Xeon E5 ) ( when it comes to delay in "memory/what ever" 
>> allocation ), will be, inside a Linux OS, quite small. And I mean 
>> nanoseconds tiny/non-existing small.
>> But again, thats just my guess. Of course, if we talk about complete 
>> different CPU Models ( E5 vs. I7 vs. AMD vs. what ever ) we will have 
>> different 1st/2nd level Caches in CPU, different Architecture/RAM/everything.
>>
>> But we are talking here about pure frequency issues. So we compare 
>> identical CPU Models, just with different frequencies.
>>
>> And there, the difference, especially inside an OS and inside a 
>> productive environment must be nearly not existing.
>>
>> I can not imagine how much an OSD / HDD needs to be hammered, that a 
>> server is in general not totally overloaded and that the higher 
>> frequency will make a measureable difference.
>>
>> 
>>
>> But again, I have no numbers/benchmarks here that could prove this 
>> pure theory of mine.
>>
>> In the very end, more cores will usually mean more GHz frequency in sum.
>>
>> So maybe the whole discussion is very theoretically, because usually 
>> we wont run in a situation where we have to choose frequency vs. cores.
>>
>> Simply because more cores always means more frequency in sum.
>>
>> Except if you compare totally different CPU models and generations, and 
>> that is even more theoretical and maybe pointless, since different 
>> CPU generations have totally different inner architectures, 
>> which has a great impact on overall performance ( aside from numbers of 
>> frequency and cores ).
>>
>> --
>> Mit freundlichen Gruessen / Best regards
>>
>> Oliver Dzombic
>> IP-Interactive
>>
>> mailto:i...@ip-interactive.de
>>
>> Anschrift:
>>
>> IP Interactive UG ( haftungsbeschraenkt ) Zum Sonnenberg 1-3
>> 63571 Gelnhausen
>>
>> HRB 93402 beim Amtsgericht Hanau
>> Geschäftsführung: Oliver Dzombic
>>
>> Steuer Nr.: 35 236 3622 1
>> UST ID: DE274086107
>>
>>
>> Am 20.01.2016 um 14:14 schrieb Wade Holler:
>>> Great commentary.
>>>
>>> While it is fundamentally true that higher clock speed equals lower 
>>> latency, in my practical experience we are more often interested in 
>>> latency at the concurrency profile of the applications.
>>>
>>> So in this regard I favor more cores when I have to choose, such 
>>> that we can support more concurrent operations at a queue depth of 0.
>>>
>>> Cheers
>>> Wade
>>> On Wed, Jan 20, 2016 at 7:58 AM Jan Schermer >> > wrote:
>>>
>>>  I'm using Ceph with all SSDs, I doubt you have to worry about speed 
>>> that
>>>  much with HDD (it will be abysmall either way).
>>>  With SSDs you need to start worrying about processor caches and
>> memory
>>>  colocation in NUMA systems, linux scheduler is not really that smart
>>>  right now.
>>>  Yes, the process will get its own core, but it might be a different
>>>  core every
>>>  time it spins up, this increases latencies considerably if you start
>>>  hammering
>>>  the OSDs on the same host.
>>>
>>>  But as always, YMMV ;-)
>>>
>>>  Jan
>>>
>>>
>>>  > On 20 Jan 2016, at 13:28, Oliver Dzombic >>  > wrote:
>>>  >
>>>  > Hi Jan,
>>>  >
>>>  > actually the linux kernel does this automatically anyway ( sending 
>>> new
>>>  > processes to "empty/low used" cores ).
>>>  >
>>>  > A single scrubbing/recovery or what ever 

Re: [ceph-users] OSD size and performance

2016-01-03 Thread Somnath Roy
Hi Prabu,
Check the krbd version (and libceph) running in the kernel. You can try 
building the latest krbd source for the 7.1 kernel if this is an option for you.
As I mentioned in my earlier mail, please isolate the problem the way I suggested, 
if that seems reasonable to you.
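A few quick checks that may help narrow it down (module and message names are the usual ones on a CentOS 7.1 kernel; adjust as needed):

# uname -r
# modinfo rbd | grep -E 'filename|vermagic'
# modinfo libceph | grep -E 'filename|vermagic'
# dmesg | grep -iE 'libceph|rbd|hung task'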

Thanks & Regards
Somnath

From: gjprabu [mailto:gjpr...@zohocorp.com]
Sent: Sunday, January 03, 2016 10:53 PM
To: gjprabu
Cc: Somnath Roy; ceph-users; Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance

Hi Somnath,

   Just check the below details and let us know do you need any 
other information.

Regards
Prabu

 On Sat, 02 Jan 2016 08:47:05 +0530 gjprabu 
<gjpr...@zohocorp.com<mailto:gjpr...@zohocorp.com>>wrote 

Hi Somnath,

   Please check the details and help me on this issue.

Regards
Prabu

 On Thu, 31 Dec 2015 12:50:36 +0530 gjprabu 
<gjpr...@zohocorp.com<mailto:gjpr...@zohocorp.com>>wrote 



___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
Hi Somnath,

 We are using RBD; please find the Linux and rbd versions below. I agree this is 
related to a client-side issue. My thought went to the backup because a full (not 
incremental) backup is taken once a week, and we noticed the issue around that time 
once, but I'm not sure.

Linux version
CentOS Linux release 7.1.1503 (Core)
Kernel : - 3.10.91

rbd --version
ceph version 0.94.5 (9764da52395923e0b32908d83a9f7304401fee43)

rbd showmapped
id pool image   snap device
1  rbd  downloads  -/dev/rbd1

rbd ls
downloads

Client server RBD mounted using ocfs2 file system.
/dev/rbd1  ocfs2 9.6T  2.6T  7.0T  27% /data/downloads

Client-level cluster configuration is done with 5 clients, and we are using the 
below procedure on the client nodes.

1) rbd map downloads --pool rbd --name client.admin -m 
192.168.112.192,192.168.112.193,192.168.112.194 -k 
/etc/ceph/ceph.client.admin.keyring


2)  Formatting rbd with ocfs2
mkfs.ocfs2 -b4K -C 4K -L label -T mail -N5 /dev/rbd/rbd/downloads

3) We have do ocfs2 client level configuration and start ocfs2 service.

4) mount /dev/rbd/rbd/downloads /data/downloads

 Please let me know do you need any other information.

Regards
Prabu




 On Thu, 31 Dec 2015 01:04:39 +0530 Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>>wrote 



___
ceph-users mailing list
ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Prabu,



I assume you are using krbd then. Could you please let us know the Linux 
version/flavor you are using?

Krbd had some hang issues which are supposed to be fixed in the latest versions 
available. Also, it could be due to the OCFS2->krbd integration as well (?). 
Handling data consistency is the responsibility of OCFS2, as krbd doesn’t 
guarantee that. So, I would suggest doing the following to root-cause it, if your 
cluster is not in production:



1. Do a synthetic fio run  on krbd alone (or creating a filesystem on top) and 
see if you can reproduce the hang



2. Try building the latest krbd or upgrade your Linux version to get a newer 
krbd and see if it is still happening.





<< Also we are taking backup from client, we feel that could be the reason for 
this hang



I assume this is a regular filesystem backup? Why do you think this could be a 
problem?



I think it is a client side issue , I doubt it could be because of large OSD 
size..





Thanks & Regards

Somnath



From: gjprabu [mailto:gjpr...@zohocorp.com<mailto:gjpr...@zohocorp.com>]
Sent: Wednesday, December 30, 2015 4:29 AM
To: gjprabu
Cc: Somnath Roy; ceph-users; Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance



Hi Somnath,



 Thanks for your reply. In the current setup we are having a client hang 
issue; it hangs frequently and works again after a reboot. The clients mount 
with the OCFS2 file system for multiple concurrent client access to the same data. 
Also, we are taking backups from a client; we feel that could be the reason for 
this hang.



Regards

Prabu







 On Wed, 30 Dec 2015 11:33:20 +0530 Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>>wrote 



FYI, we are using 8TB SSD drives as OSDs and are not seeing any problems so far. 
Failure domain could be a concern for bigger OSDs.



Thanks & Regards

Somnath



From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
 On Behalf Of gjprabu
Sent: Tuesday, December 29, 2015 9:38 PM
To: ceph-users
Cc: Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance



Hi Team,



 Anybody please clarify the below queries.



Regards

Prabu



 On Tue, 29 Dec 2015 13:03:45 +0530 gjprabu 
<gjpr...@zohocorp.com<mailt

Re: [ceph-users] more performance issues :(

2015-12-30 Thread Somnath Roy
Well, that’s not as straightforward. You can’t let journal writes run unchecked 
for a long time. The main reason is that OSD memory usage will keep increasing, 
and depending on your HW the OSD will crash at some point.
You can try tweaking the following..

journal_max_write_bytes
journal_max_write_entries
journal_queue_max_ops
journal_queue_max_bytes
filestore_queue_max_ops
filestore_queue_max_bytes

The existing code base will also force-flush the journal if it is more than half full.
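Purely as an illustration of where those options live (the numbers below are made-up placeholders sized for a ~15GB journal, not recommendations), they go in the [osd] section of ceph.conf:

[osd]
journal_max_write_bytes = 1073741824
journal_max_write_entries = 10000
journal_queue_max_ops = 50000
journal_queue_max_bytes = 10737418240
filestore_queue_max_ops = 5000
filestore_queue_max_bytes = 1073741824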

If you have a test cluster you may want to try the following pull request, which 
should be able to utilize a big journal better. But you need to tweak some 
config options according to your setup (I never ran it on an HDD-based setup). If 
you are interested, let me know and I can help you with that.

https://github.com/ceph/ceph/pull/6670

Thanks & Regards
Somnath

From: Florian Rommel [mailto:florian.rom...@datalounges.com]
Sent: Wednesday, December 30, 2015 2:54 AM
To: Somnath Roy
Cc: Tyler Bishop; ceph-users@lists.ceph.com
Subject: Re: more performance issues :(

Hi all, again thanks for all the suggestions..

I have now narrowed it down to this problem:

Data gets written to the journal (SSD), but when flushing things out 
to the SATA disks the journal doesn’t continue taking new writes; it kind of stops 
until the journal is flushed to SATA. During that time, the data transfer rate 
drops quite low; it almost looks like it’s the same speed as the SATA disks. When 
the flush is done, you can see an uptick in speed until the next flush.

I have 15GB SSD journal partitions for 2-3 SATA disks on each server, a total of 
10 OSDs for now. Is there a ceph.conf option to let the journal fill up before 
flushing, or to continue writing to the journal while flushing?

Thanks,
//florian



On 26 Dec 2015, at 20:46, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:

FYI, osd_op_threads is not used in the main io path anymore (from Giant). I 
don’t think increasing this will do any good.
If you want to tweak threads in io path play with the following two.

osd_op_num_threads_per_shard
osd_op_num_shards

But, It may not be the problem with writes..Default value should work just 
fine..Need some more info..

1. Are you using fio-rbd ? If so, try running with rbd_cache = false in the 
client side ceph.conf and see if that is making any difference.

2. What is the block size you are trying with ?

3. Check how the SSD is behaving with raw fio o_direct and o_dsync mode with 
the same block size

4. What kind of fio write io profile are you running ? Hope you are doing 
similar IO profile with benchwrite.

5. How many OSDs a single SSD as journal is serving ? How many OSDs total you 
are running ? what is the replication factor ?

6. Hope none of the resources are saturating

Thanks & Regards
Somnath


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tyler 
Bishop
Sent: Saturday, December 26, 2015 8:38 AM
To: Florian Rommel
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] more performance issues :(

Add this under osd.

osd op threads = 8



Restart the osd services and try that.




From: "Florian Rommel" 
<florian.rom...@datalounges.com<mailto:florian.rom...@datalounges.com>>
To: "Wade Holler" <wade.hol...@gmail.com<mailto:wade.hol...@gmail.com>>
Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Sent: Saturday, December 26, 2015 4:55:06 AM
Subject: Re: [ceph-users] more performance issues :(

Hi, iostat shows all OSDs working when data is benched. It looks like the 
culprit is nowhere to be found. If I add SSD journals with the SSDs that we 
have, even though they give a much higher result with fio than the SATA 
drives, the speed of the cluster is exactly the same… 150-180MB/s, while reads 
max out the 10GBe network with no problem.
rbd benchwrite however gives me NICE throughput… about 500MB /s to start with 
and then dropping and flattening out at 320MB/s, 9 IOPs…. so what the hell 
is going on?.


if i take the journals off and move them to the disks themselves, same results. 
Something is really really off with my config i guess, and I need to do some 
serious troubleshooting to figure this out.

Thanks for the help so far .
//Florian



On 24 Dec 2015, at 13:54, Wade Holler 
<wade.hol...@gmail.com<mailto:wade.hol...@gmail.com>> wrote:

Have a look at the iostat -x 1 1000 output to see what the drives are doing

On Wed, Dec 23, 2015 at 4:35 PM Florian Rommel 
<florian.rom...@datalounges.com<mailto:florian.rom...@datalounges.com>> wrote:
Ah, totally forgot the additional details :)

OS is SUSE Enterprise Linux 12.0 with all patches,
Ceph version 0.94.3
4 node cluster with 2x 10GBe networking, one for cluster and one for public 
network, 1 additional server purely as an admin server.
Test machine is also 10gbe connected

ceph.conf is incl

Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Somnath Roy
Jan,
Two monitors should be just fine as long as both are up and can agree on every 
action. An even number of monitors is not useful, as the total number of 
failures tolerated is the same as with the next lower odd number: for example, 
1 and 2 monitors tolerate no failures, and 3 and 4 both tolerate only 1 failure 
(see the table below).
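
For reference, the usual majority-quorum arithmetic (quorum = floor(n/2) + 1, 
failures tolerated = n - quorum) works out to:

monitors   quorum needed   failures tolerated
   1             1                 0
   2             2                 0
   3             2                 1
   4             3                 1
   5             3                 2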

Thanks & Regards
Somnath

From: Jan Schermer [mailto:j...@schermer.cz]
Sent: Tuesday, December 29, 2015 3:32 PM
To: Ing. Martin Samek
Cc: Somnath Roy; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] My OSDs are down and not coming UP

Just try putting something like the following in ceph.conf:

[global]
mon_host = ::2:1612::50 ::2:1612::30
mon_initial_members = node-1 node-2

Also, I just noticed you have two MONs? It should always be an odd number. Not 
sure if they can ever get quorum now?

Jan



On 30 Dec 2015, at 00:15, Ing. Martin Samek 
<samek...@fel.cvut.cz<mailto:samek...@fel.cvut.cz>> wrote:

I'm deploying ceph cluster manually following different guides. I didn't use 
ceph-deploy yet.

MS:

Dne 30.12.2015 v 00:13 Somnath Roy napsal(a):
It should be monitor host names..If you are deploying with ceph-deploy it 
should be added in the conf file automatically..How are you creating your 
cluster ?
Did you change conf file after installing ?

From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz]
Sent: Tuesday, December 29, 2015 3:09 PM
To: Jan Schermer
Cc: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] My OSDs are down and not coming UP

Hi,

No, never. It is my first attempt, first ceph cluster i try ever run.

im not sure, if "mon initial members" should contain mon servers ids or 
hostnames ?

MS:
Dne 30.12.2015 v 00:04 Jan Schermer napsal(a):
Has the cluster ever worked?

Are you sure that "mon initial members = 0" is correct? How do the OSDs know 
where to look for MONs?

Jan


On 29 Dec 2015, at 21:41, Ing. Martin Samek 
<samek...@fel.cvut.cz<mailto:samek...@fel.cvut.cz>> wrote:

Hi,

network is OK, all nodes are in one VLAN, in one switch, in one rack.




tracepath6 node2

 1?: [LOCALHOST]0.030ms pmtu 1500

 1:  node2 0.634ms reached

 1:  node2 0.296ms reached

 Resume: pmtu 1500 hops 1 back 64

tracepath6 node3

 1?: [LOCALHOST]0.022ms pmtu 1500

 1:  node3 0.643ms reached

 1:  node3 1.065ms reached

 Resume: pmtu 1500 hops 1 back 64

There is no firewall installed or configured.

Martin







___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Somnath Roy
6 
wait: waiting for dispatch queue

2015-12-29 20:21:43.258688 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
wait: dispatch queue is stopped

2015-12-29 20:21:43.258695 7f8a59b2f800 20 -- [::2:1612::60]:6801/15406 
wait: stopping accepter thread

2015-12-29 20:21:43.258699 7f8a59b2f800 10 accepter.stop accepter

2015-12-29 20:21:43.258715 7f8a44858700 20 accepter.accepter poll got 1

2015-12-29 20:21:43.258737 7f8a44858700 20 accepter.accepter closing

2015-12-29 20:21:43.258745 7f8a44858700 10 accepter.accepter stopping

2015-12-29 20:21:43.258791 7f8a59b2f800 20 -- [::2:1612::60]:6801/15406 
wait: stopped accepter thread

2015-12-29 20:21:43.258802 7f8a59b2f800 20 -- [::2:1612::60]:6801/15406 
wait: stopping reaper thread

2015-12-29 20:21:43.258819 7f8a51059700 10 -- [::2:1612::60]:6801/15406 
reaper_entry done

2015-12-29 20:21:43.258892 7f8a59b2f800 20 -- [::2:1612::60]:6801/15406 
wait: stopped reaper thread

2015-12-29 20:21:43.258903 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
wait: closing pipes

2015-12-29 20:21:43.258910 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
reaper

2015-12-29 20:21:43.258914 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
reaper done

2015-12-29 20:21:43.258919 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
wait: waiting for pipes  to close

2015-12-29 20:21:43.258922 7f8a59b2f800 10 -- [::2:1612::60]:6801/15406 
wait: done.

2015-12-29 20:21:43.258926 7f8a59b2f800  1 -- [::2:1612::60]:6801/15406 
shutdown complete.

2015-12-29 20:21:43.258930 7f8a59b2f800 10 -- :/15406 wait: waiting for 
dispatch queue

2015-12-29 20:21:43.258960 7f8a59b2f800 10 -- :/15406 wait: dispatch queue is 
stopped

2015-12-29 20:21:43.258965 7f8a59b2f800 20 -- :/15406 wait: stopping reaper 
thread

2015-12-29 20:21:43.258979 7f8a50858700 10 -- :/15406 reaper_entry done

2015-12-29 20:21:43.259043 7f8a59b2f800 20 -- :/15406 wait: stopped reaper 
thread

2015-12-29 20:21:43.259051 7f8a59b2f800 10 -- :/15406 wait: closing pipes

2015-12-29 20:21:43.259053 7f8a59b2f800 10 -- :/15406 reaper

2015-12-29 20:21:43.259056 7f8a59b2f800 10 -- :/15406 reaper done

2015-12-29 20:21:43.259057 7f8a59b2f800 10 -- :/15406 wait: waiting for pipes  
to close

2015-12-29 20:21:43.259059 7f8a59b2f800 10 -- :/15406 wait: done.

2015-12-29 20:21:43.259060 7f8a59b2f800  1 -- :/15406 shutdown complete.

Dne 29.12.2015 v 00:21 Ing. Martin Samek napsal(a):
Hi,
all nodes are in one VLAN connected to one switch. Connectivity is OK, MTU 1500, 
can transfer data over netcat and mbuffer at 660 Mbps.

debug_ms, there is nothing of interest:

/usr/bin/ceph-osd --debug_ms 100 -f -i 0 --pid-file /run/ceph/osd.0.pid -c 
/etc/ceph/ceph.conf

starting osd.0 at :/0 osd_data /var/lib/ceph/osd/ceph-0 
/var/lib/ceph/osd/ceph-0/journal

2015-12-29 00:18:05.878954 7fd9892e7800 -1 journal FileJournal::_open: 
disabling aio for non-block journal.  Use journal_force_aio to force use of aio 
anyway

2015-12-29 00:18:05.899633 7fd9892e7800 -1 osd.0 24 log_to_monitors 
{default=true}

Thanks,
Martin

Dne 29.12.2015 v 00:08 Somnath Roy napsal(a):

It could be a network issue..May be related to MTU (?)..Try running with 
debug_ms = 1 and see if you find anything..Also, try running command like 
'traceroute' and see if it is reporting any error..

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ing. 
Martin Samek
Sent: Monday, December 28, 2015 2:59 PM
To: Ceph Users
Subject: [ceph-users] My OSDs are down and not coming UP

Hi,

I'm a newbie in the Ceph world. I'm trying to set up my first testing Ceph 
cluster; my MON servers are running and talking to each other, but my OSDs are 
still down and won't come up. Actually, only the one OSD running on the same 
node as the elected master is able to connect and come UP.

To be technical. I have 4 physical nodes living in pure IPv6 environment, 
running Gentoo Linux and Ceph 9.2. All nodes names are resolvable in DNS and 
also saved in hosts files.

I'm running OSD with command like this:

node1# /usr/bin/ceph-osd -f -i 1 --pid-file /run/ceph/osd.1.pid -c 
/etc/ceph/ceph.conf

single mon.0 is running also at node1, and OSD come up:

2015-12-28 23:37:27.931686 mon.0 [INF] osd.1 [::2:1612::50]:6800/23709 
boot

2015-12-28 23:37:27.932605 mon.0 [INF] osdmap e19: 2 osds: 1 up, 1 in

2015-12-28 23:37:27.933963 mon.0 [INF] pgmap v24: 64 pgs: 64 
stale+active+undersized+degraded; 0 bytes data, 1057 MB used, 598 GB / 599 GB 
avail

but running osd.0 at node2:

# /usr/bin/ceph-osd -f -i 0 --pid-file /run/ceph/osd.0.pid -c 
/etc/ceph/ceph.conf

did nothing, process is running, netstat shows opened connection from ceph-osd 
between node2 and node1. Here I'm lost. IPv6 connectivity is OK, DNS is OK, 
time is in sync, 1 mon running, 2 osds but only one UP.
What is missing?

ceph-osd in debug mode show differences at node1 and node2:

node1, UP

Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Somnath Roy
It should be monitor host names..If you are deploying with ceph-deploy it 
should be added in the conf file automatically..How are you creating your 
cluster ?
Did you change conf file after installing ?

From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz]
Sent: Tuesday, December 29, 2015 3:09 PM
To: Jan Schermer
Cc: Somnath Roy; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] My OSDs are down and not coming UP

Hi,

No, never. It is my first attempt, first ceph cluster i try ever run.

im not sure, if "mon initial members" should contain mon servers ids or 
hostnames ?

MS:
Dne 30.12.2015 v 00:04 Jan Schermer napsal(a):
Has the cluster ever worked?

Are you sure that "mon initial members = 0" is correct? How do the OSDs know 
where to look for MONs?

Jan


On 29 Dec 2015, at 21:41, Ing. Martin Samek 
<samek...@fel.cvut.cz<mailto:samek...@fel.cvut.cz>> wrote:

Hi,

network is OK, all nodes are in one VLAN, in one switch, in one rack.



tracepath6 node2

 1?: [LOCALHOST]0.030ms pmtu 1500

 1:  node2 0.634ms reached

 1:  node2 0.296ms reached

 Resume: pmtu 1500 hops 1 back 64

tracepath6 node3

 1?: [LOCALHOST]0.022ms pmtu 1500

 1:  node3 0.643ms reached

 1:  node3 1.065ms reached

 Resume: pmtu 1500 hops 1 back 64

There is no firewall installed or configured.

Martin




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] My OSDs are down and not coming UP

2015-12-29 Thread Somnath Roy
May be try commenting out mon_initial_members (or give mon host name) and 
see..It is certainly not correct as Jan pointed out..

From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz]
Sent: Tuesday, December 29, 2015 3:16 PM
To: Somnath Roy; Jan Schermer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] My OSDs are down and not coming UP

I'm deploying ceph cluster manually following different guides. I didn't use 
ceph-deploy yet.

MS:
Dne 30.12.2015 v 00:13 Somnath Roy napsal(a):
It should be monitor host names..If you are deploying with ceph-deploy it 
should be added in the conf file automatically..How are you creating your 
cluster ?
Did you change conf file after installing ?

From: Ing. Martin Samek [mailto:samek...@fel.cvut.cz]
Sent: Tuesday, December 29, 2015 3:09 PM
To: Jan Schermer
Cc: Somnath Roy; ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: Re: [ceph-users] My OSDs are down and not coming UP

Hi,

No, never. It is my first attempt, first ceph cluster i try ever run.

im not sure, if "mon initial members" should contain mon servers ids or 
hostnames ?

MS:
Dne 30.12.2015 v 00:04 Jan Schermer napsal(a):
Has the cluster ever worked?

Are you sure that "mon initial members = 0" is correct? How do the OSDs know 
where to look for MONs?

Jan


On 29 Dec 2015, at 21:41, Ing. Martin Samek 
<samek...@fel.cvut.cz<mailto:samek...@fel.cvut.cz>> wrote:

Hi,

network is OK, all nodes are in one VLAN, in one switch, in one rack.




tracepath6 node2

 1?: [LOCALHOST]0.030ms pmtu 1500

 1:  node2 0.634ms reached

 1:  node2 0.296ms reached

 Resume: pmtu 1500 hops 1 back 64

tracepath6 node3

 1?: [LOCALHOST]0.022ms pmtu 1500

 1:  node3 0.643ms reached

 1:  node3 1.065ms reached

 Resume: pmtu 1500 hops 1 back 64

There is no firewall installed or configured.

Martin






___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD size and performance

2015-12-29 Thread Somnath Roy
FYI , we are using 8TB SSD drive as OSD and not seeing any problem so far. 
Failure domain could be a concern for bigger OSDs.

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of gjprabu
Sent: Tuesday, December 29, 2015 9:38 PM
To: ceph-users
Cc: Siva Sokkumuthu
Subject: Re: [ceph-users] OSD size and performance

Hi Team,

 Anybody please clarify the below queries.

Regards
Prabu

 On Tue, 29 Dec 2015 13:03:45 +0530 gjprabu wrote 

Hi Team,

 We are using ceph with 3 osd and 2 replicas. Each osd size is 13TB 
and current data is reached to 2.5TB (each osd). Because of this huge size do 
we face any problem.

OSD server configuration
Hard disk -- 13TB
RAM -- 96GB
CPU -- 2 CPU with multi 8 core processor.


Regards
Prabu



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] My OSDs are down and not coming UP

2015-12-28 Thread Somnath Roy
It could be a network issue..May be related to MTU (?)..Try running with 
debug_ms = 1 and see if you find anything..Also, try running command like 
'traceroute' and see if it is reporting any error..

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Ing. 
Martin Samek
Sent: Monday, December 28, 2015 2:59 PM
To: Ceph Users
Subject: [ceph-users] My OSDs are down and not coming UP

Hi,

I'm a newbie in the Ceph world. I'm trying to set up my first testing Ceph 
cluster; my MON servers are running and talking to each other, but my OSDs are 
still down and won't come up. Actually, only the one OSD running on the same 
node as the elected master is able to connect and come UP.

To be technical. I have 4 physical nodes living in pure IPv6 environment, 
running Gentoo Linux and Ceph 9.2. All nodes names are resolvable in DNS and 
also saved in hosts files.

I'm running OSD with command like this:

node1# /usr/bin/ceph-osd -f -i 1 --pid-file /run/ceph/osd.1.pid -c 
/etc/ceph/ceph.conf

single mon.0 is running also at node1, and OSD come up:

2015-12-28 23:37:27.931686 mon.0 [INF] osd.1 [2001:718:2:1612::50]:6800/23709 
boot

2015-12-28 23:37:27.932605 mon.0 [INF] osdmap e19: 2 osds: 1 up, 1 in

2015-12-28 23:37:27.933963 mon.0 [INF] pgmap v24: 64 pgs: 64 
stale+active+undersized+degraded; 0 bytes data, 1057 MB used, 598 GB / 599 GB 
avail

but running osd.0 at node2:

# /usr/bin/ceph-osd -f -i 0 --pid-file /run/ceph/osd.0.pid -c 
/etc/ceph/ceph.conf

did nothing, process is running, netstat shows opened connection from ceph-osd 
between node2 and node1. Here I'm lost. IPv6 connectivity is OK, DNS is OK, 
time is in sync, 1 mon running, 2 osds but only one UP. 
What is missing?

ceph-osd in debug mode show differences at node1 and node2:

node1, UP:
> 2015-12-28 01:42:59.084371 7f72f9873800 20 osd.1 15  clearing temps in 
> 0.3f_head pgid 0.3f
> 2015-12-28 01:42:59.084453 7f72f9873800  0 osd.1 15 load_pgs
> 2015-12-28 01:42:59.085248 7f72f9873800 10 osd.1 15 load_pgs ignoring 
> unrecognized meta
> 2015-12-28 01:42:59.094690 7f72f9873800 10 osd.1 15 pgid 0.0 coll 
> 0.0_head
> 2015-12-28 01:42:59.094835 7f72f9873800 30 osd.1 0 get_map 15 -cached
> 2015-12-28 01:42:59.094848 7f72f9873800 10 osd.1 15 _open_lock_pg 0.0
> 2015-12-28 01:42:59.094857 7f72f9873800 10 osd.1 15 _get_pool 0
> 2015-12-28 01:42:59.094928 7f72f9873800  5 osd.1 pg_epoch: 15 
> pg[0.0(unlocked)] enter Initial
> 2015-12-28 01:42:59.094980 7f72f9873800 20 osd.1 pg_epoch: 15 
> pg[0.0(unlocked)] enter NotTrimming
> 2015-12-28 01:42:59.094998 7f72f9873800 30 osd.1 pg_epoch: 15 pg[0.0( 
> DNE empty local-les=0 n=0 ec=0 les/c/f 0/0/0 0/0/0) [] r=0 lpr=0 
> crt=0'0 inactive NIBBLEW
> 2015-12-28 01:42:59.095186 7f72f9873800 20 read_log coll 0.0_head 
> log_oid 0///head

node2, DOWN:
> 2015-12-28 01:36:54.437246 7f4507957800  0 osd.0 11 load_pgs
> 2015-12-28 01:36:54.437267 7f4507957800 10 osd.0 11 load_pgs ignoring 
> unrecognized meta
> 2015-12-28 01:36:54.437274 7f4507957800  0 osd.0 11 load_pgs opened 0 
> pgs
> 2015-12-28 01:36:54.437278 7f4507957800 10 osd.0 11 
> build_past_intervals_parallel nothing to build
> 2015-12-28 01:36:54.437282 7f4507957800  2 osd.0 11 superblock: i am 
> osd.0
> 2015-12-28 01:36:54.437287 7f4507957800 10 osd.0 11 create_logger
> 2015-12-28 01:36:54.438157 7f4507957800 -1 osd.0 11 log_to_monitors 
> {default=true}
> 2015-12-28 01:36:54.449278 7f4507957800 10 osd.0 11 
> set_disk_tp_priority class  priority -1
> 2015-12-28 01:36:54.450813 7f44ddbff700 30 osd.0 11 heartbeat
> 2015-12-28 01:36:54.452558 7f44ddbff700 30 osd.0 11 heartbeat checking 
> stats
> 2015-12-28 01:36:54.452592 7f44ddbff700 20 osd.0 11 update_osd_stat 
> osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist 
> [])
> 2015-12-28 01:36:54.452611 7f44ddbff700  5 osd.0 11 heartbeat: 
> osd_stat(1056 MB used, 598 GB avail, 599 GB total, peers []/[] op hist 
> [])
> 2015-12-28 01:36:54.452618 7f44ddbff700 30 osd.0 11 heartbeat check
> 2015-12-28 01:36:54.452622 7f44ddbff700 30 osd.0 11 heartbeat lonely?
> 2015-12-28 01:36:54.452624 7f44ddbff700 30 osd.0 11 heartbeat done
> 2015-12-28 01:36:54.452627 7f44ddbff700 30 osd.0 11 heartbeat_entry 
> sleeping for 2.3
> 2015-12-28 01:36:54.452588 7f44da7fc700 10 osd.0 11 agent_entry start
> 2015-12-28 01:36:54.453338 7f44da7fc700 20 osd.0 11 agent_entry empty 
> queue

My ceph.conf looks like this:

[global]

fsid = b186d870-9c6d-4a8b-ac8a-e263f4c205da

ms_bind_ipv6 = true

public_network = ::2:1612::/64

mon initial members = 0

mon host = [::2:1612::50]:6789

auth cluster required = cephx

auth service required = cephx

auth client required = cephx

osd pool default size = 2

osd pool default min size = 1

osd journal size = 1024

osd mkfs type = xfs

osd mount options xfs = rw,inode64

osd crush chooseleaf type = 1

[mon.0]

host = node1

mon addr = [::2:1612::50]:6789

[mon.1]

host = node3


Re: [ceph-users] more performance issues :(

2015-12-26 Thread Somnath Roy
FYI, osd_op_threads is not used in the main io path anymore (from Giant). I 
don’t think increasing this will do any good.
If you want to tweak threads in io path play with the following two.

osd_op_num_threads_per_shard
osd_op_num_shards

But, It may not be the problem with writes..Default value should work just 
fine..Need some more info..

1. Are you using fio-rbd ? If so, try running with rbd_cache = false in the 
client side ceph.conf and see if that is making any difference.

2. What is the block size you are trying with ?

3. Check how the SSD is behaving with raw fio o_direct and o_dsync mode with 
the same block size

4. What kind of fio write io profile are you running ? Hope you are doing 
similar IO profile with benchwrite.

5. How many OSDs a single SSD as journal is serving ? How many OSDs total you 
are running ? what is the replication factor ?

6. Hope none of the resources are saturating

Thanks & Regards
Somnath


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Tyler 
Bishop
Sent: Saturday, December 26, 2015 8:38 AM
To: Florian Rommel
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] more performance issues :(

Add this under osd.

osd op threads = 8


Restart the osd services and try that.





From: "Florian Rommel" 
>
To: "Wade Holler" >
Cc: ceph-users@lists.ceph.com
Sent: Saturday, December 26, 2015 4:55:06 AM
Subject: Re: [ceph-users] more performance issues :(

Hi, iostat shows all OSDs working when data is benched. It looks like the 
culprit is nowhere to be found. If I add SSD journals with the SSDs that we 
have, even though they give a much higher result with fio than the SATA 
drives, the speed of the cluster is exactly the same… 150-180MB/s, while reads 
max out the 10GBe network with no problem.
rbd benchwrite however gives me NICE throughput… about 500MB /s to start with 
and then dropping and flattening out at 320MB/s, 9 IOPs…. so what the hell 
is going on?.


if i take the journals off and move them to the disks themselves, same results. 
Something is really really off with my config i guess, and I need to do some 
serious troubleshooting to figure this out.

Thanks for the help so far .
//Florian



On 24 Dec 2015, at 13:54, Wade Holler 
> wrote:

Have a look at the iostat -x 1 1000 output to see what the drives are doing

On Wed, Dec 23, 2015 at 4:35 PM Florian Rommel 
> wrote:
Ah, totally forgot the additional details :)

OS is SUSE Enterprise Linux 12.0 with all patches,
Ceph version 0.94.3
4 node cluster with 2x 10GBe networking, one for cluster and one for public 
network, 1 additional server purely as an admin server.
Test machine is also 10gbe connected

ceph.conf is included:
[global]
fsid = 312e0996-a13c-46d3-abe3-903e0b4a589a
mon_initial_members = ceph-admin, ceph-01, ceph-02, ceph-03, ceph-04
mon_host = 192.168.0.190,192.168.0.191,192.168.0.192,192.168.0.193,192.168.0.194
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public network = 192.168.0.0/24
cluster network = 192.168.10.0/24

osd pool default size = 2
[osd]
osd journal size = 2048

Thanks again for any help and merry xmas already .
//F
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW pool contents

2015-12-22 Thread Somnath Roy
Thanks for responding back, unfortunately Cosbench setup is not there..
Good to know that there are cleanup steps for Cosbench data.

Regards
Somnath

From: ghislain.cheval...@orange.com [mailto:ghislain.cheval...@orange.com]
Sent: Tuesday, December 22, 2015 11:28 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: RGW pool contents

Hi,
Did you try to use the cleanup and dispose steps of cosbench?
brgds

De : ceph-users [mailto:ceph-users-boun...@lists.ceph.com] De la part de 
Somnath Roy
Envoyé : mardi 24 novembre 2015 20:49
À : ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Objet : [ceph-users] RGW pool contents

Hi Yehuda/RGW experts,
I have one cluster with RGW up and running in the customer site.
I did some heavy performance testing on that with CosBench and as a result 
written significant amount of data to showcase performance on that.
Over time, customer also wrote significant amount of data using S3 api into the 
cluster.
Now, I want to remove the buckets/objects created by CosBench and need some 
help on that.
I ran the following command to list the buckets.

"radosgw-admin bucket list"

The output is the following snippet..

"rgwdef42",
"rgwdefghijklmnop79",
"rgwyzabc43",
"rgwdefgh43",
"rgwdefghijklm200",

..
..

My understanding is , cosbench should create containers with "mycontainers_" 
 and objects with format "myobjects_" prefix (?). But, it's not there in the 
output of the above command.

Next, I tried to list the contents of the different rgw pools..

rados -p .rgw.buckets.index ls

.dir.default.5407.17
.dir.default.6063.24
.dir.default.6068.23
.dir.default.6046.7
.dir.default.6065.44
.dir.default.5409.3
...
...

Nothing with the rgw prefix...Shouldn't the bucket index objects have a prefix 
similar to the bucket names?


Now, tried to get the actual objects...
rados -p .rgw.buckets ls

default.6662.5_myobjects57862
default.5193.18_myobjects6615
default.5410.5_myobjects68518
default.6661.8_myobjects7407
default.5410.22_myobjects54939
default.6651.6_myobjects23790


...

So, looking at these, it seems cosbench run is creating the .dir.default.* 
buckets and the default._myobjects* objects (?)

But, these buckets are not listed by the first "radosgw-admin" command, why ?

Next, I listed the contents of the .rgw pool and here is the output..

rados -p .rgw ls

.bucket.meta.rgwdefghijklm78:default.6069.18
rgwdef42
rgwdefghijklmnop79
rgwyzabc43
.bucket.meta.rgwdefghijklmnopqr71:default.6655.3
rgwdefgh43
.bucket.meta.rgwdefghijklm119:default.6066.25
rgwdefghijklm200
.bucket.meta.rgwxghi2:default.5203.4
rgwxjk17
rgwdefghijklm196

...
...

It seems this pool has the buckets listed by the radosgw-admin command.

Can anybody explain what is .rgw pool supposed to contain ?

Also, what is the difference between .users.uid and .users pool ?


Appreciate any help on this.

Thanks & Regards
Somnath

_



Ce message et ses pieces jointes peuvent contenir des informations 
confidentielles ou privilegiees et ne doivent donc

pas etre diffuses, exploites ou copies sans autorisation. Si vous avez recu ce 
message par erreur, veuillez le signaler

a l'expediteur et le detruire ainsi que les pieces jointes. Les messages 
electroniques etant susceptibles d'alteration,

Orange decline toute responsabilite si ce message a ete altere, deforme ou 
falsifie. Merci.



This message and its attachments may contain confidential or privileged 
information that may be protected by law;

they should not be distributed, used or copied without authorisation.

If you have received this email in error, please notify the sender and delete 
this message and its attachments.

As emails may be altered, Orange is not liable for messages that have been 
modified, changed or falsified.

Thank you.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-02 Thread Somnath Roy
I think each write will create 2 objects (512 KB head object + rest of the 
contents)  if your object size > 512KB. Also, it is writing some xattrs on top 
of what OSD is writing. Don't take my word blindly as I am not fully familiar 
with  RGW :-)
This will pollute a significant number of inodes, I guess..
But, I think the effect will be much more severe in the RBD partial random 
write case.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of flisky
Sent: Wednesday, December 02, 2015 6:39 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] does anyone know what xfsaild and kworker are?they 
make osd disk busy. produce 100-200iops per osd disk?

Ignore my last reply. I read the thread "Re: XFS Syncd" 
(http://oss.sgi.com/archives/xfs/2015-06/msg00111.html), and found that it might 
be okay.

The xfs_ail_push calls are almost all INODE rather than BUF (1579 vs 99).
Our Ceph cluster is dedicated to the S3 service, and the writes are small.
So, where do so many INODE changes come from? How can I decrease them?

Thanks in advanced!

==
Mount Options:
rw,noatime,seclabel,swalloc,attr2,largeio,nobarrier,inode64,logbsize=256k,noquota

==
XFS Info:
meta-data=/dev/sdb1  isize=2048   agcount=4,
agsize=182979519 blks
 =   sectsz=512   attr=2, projid32bit=1
 =   crc=0finobt=0
data =   bsize=4096   blocks=731918075, imaxpct=5
 =   sunit=0  swidth=0 blks
naming   =version 2  bsize=4096   ascii-ci=0 ftype=0
log  =internal   bsize=4096   blocks=357381, version=2
 =   sectsz=512   sunit=0 blks, lazy-count=1
realtime =none   extsz=4096   blocks=0, rtextents=0




On 2015年12月02日 16:20, flisky wrote:
> It works. However, I think the root cause is due to the xfs_buf missing?
> 
> trace-cmd record -e xfs\*
> trace-cmd report > xfs.txt
> awk '{print $4}' xfs2.txt |sort -n |uniq -c|sort -n|tail -n 20
> 
>14468 xfs_file_splice_write:
>16562 xfs_buf_find:
>19597 xfs_buf_read:
>19634 xfs_buf_get:
>21943 xfs_get_blocks_alloc:
>23265 xfs_perag_put:
>26327 xfs_perag_get:
>27853 xfs_ail_locked:
>39252 xfs_buf_iorequest:
>40187 xfs_ail_delete:
>41590 xfs_buf_ioerror:
>42523 xfs_buf_hold:
>44659 xfs_buf_trylock:
>47986 xfs_ail_flushing:
>50793 xfs_ilock_nowait:
>57585 xfs_ilock:
>58293 xfs_buf_unlock:
>79977 xfs_buf_iodone:
>   104165 xfs_buf_rele:
>   108383 xfs_iunlock:
> 
> Could you please give me another hint? :) Thanks!
> 
> On 2015年12月02日 05:14, Somnath Roy wrote:
>> Sure..The following settings helped me minimizing the effect a bit 
>> for the PR https://github.com/ceph/ceph/pull/6670
>>
>>
>> sysctl -w fs.xfs.xfssyncd_centisecs=72
>> sysctl -w fs.xfs.xfsbufd_centisecs=3000
>> sysctl -w fs.xfs.age_buffer_centisecs=72
>>
>> But, for existing Ceph write path you may need to tweak this..
>>
>> Thanks & Regards
>> Somnath
>>
>> -Original Message-
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf 
>> Of flisky
>> Sent: Tuesday, December 01, 2015 11:04 AM
>> To: ceph-users@lists.ceph.com
>> Subject: Re: [ceph-users] does anyone know what xfsaild and kworker are?they 
>> make osd disk busy. produce 100-200iops per osd disk?
>>
>> On 2015年12月02日 01:31, Somnath Roy wrote:
>>> This is xfs metadata sync process...when it is waking up and there 
>>> are lot of data to sync it will throttle all the process accessing 
>>> the drive...There are some xfs settings to control the behavior, but 
>>> you can't stop that
>> May I ask how to tune the xfs settings? Thanks!
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread Somnath Roy
Sure..The following settings helped me minimizing the effect a bit for the PR 
https://github.com/ceph/ceph/pull/6670


  sysctl -w fs.xfs.xfssyncd_centisecs=72
  sysctl -w fs.xfs.xfsbufd_centisecs=3000
  sysctl -w fs.xfs.age_buffer_centisecs=72

But, for existing Ceph write path you may need to tweak this..

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of flisky
Sent: Tuesday, December 01, 2015 11:04 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] does anyone know what xfsaild and kworker are?they 
make osd disk busy. produce 100-200iops per osd disk?

On 2015年12月02日 01:31, Somnath Roy wrote:
> This is xfs metadata sync process...when it is waking up and there are lot of 
> data to sync it will throttle all the process accessing the drive...There are 
> some xfs settings to control the behavior, but you can't stop that
May I ask how to tune the xfs settings? Thanks!

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] does anyone know what xfsaild and kworker are?they make osd disk busy. produce 100-200iops per osd disk?

2015-12-01 Thread Somnath Roy
This is xfs metadata sync process...when it is waking up and there are lot of 
data to sync it will throttle all the process accessing the drive...There are 
some xfs settings to control the behavior, but you can't stop that

Sent from my iPhone

>> On Dec 1, 2015, at 8:26 AM, flisky  wrote:
>> 
>> On 2014年11月11日 12:23, duan.xuf...@zte.com.cn wrote:
>> 
>>  ZTE Information
>> Security Notice: The information contained in this mail (and any
>> attachment transmitted herewith) is privileged and confidential and is
>> intended for the exclusive use of the addressee(s). If you are not an
>> intended recipient, any disclosure, reproduction, distribution or other
>> dissemination or use of the information contained is strictly
>> prohibited. If you have received this mail in error, please delete it
>> and notify us immediately.
>> 
>> 
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> I'm facing the exactly same problem, and my situation is much worst.
> 
> BTT gives me this -
> 
>Q2Q   MIN   AVG   MAX   N
> --- - - - ---
> ceph-osd  0.01243   0.009448228   7.065125958   12643
> kworker   0.01491   0.479659256  30.080631593 226
> pid002853761  0.000668293  20.053390778  30.080227966   3
> xfsaild   0.01097   0.008947398  30.073285005   10879
> 
>D2C   MIN   AVG   MAX   N
> --- - - - ---
> ceph-osd  0.36810   0.014268501   1.626915131   12642
> kworker   0.44483   0.005548645   0.653310778 203
> pid002853761  0.000156094   0.001594357   0.005841911   4
> xfsaild   0.000307363   0.190863515   1.3219928029849
> 
> The disk util is almost 100%, while avgrq-sz and avgqu-sz are very low, which 
> makes me very confused.
> 
> Could any one give me some hint on this? Thanks!
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph OSD: Memory Leak problem

2015-11-29 Thread Somnath Roy
It could be a network issue in your environment.. First thing to check is MTU 
(if you have changed it) and run tool like traceroute to see if all the cluster 
nodes are reachable from each other..

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of prasad 
pande
Sent: Saturday, November 28, 2015 8:15 PM
To: ceph-us...@ceph.com
Subject: [ceph-users] Ceph OSD: Memory Leak problem

Hi,
I installed a ceph cluster with 2 MON, 1 MDS and 10 OSDs.
While performing the rados put operation to put objects in ceph cluster I am 
getting the OSD errors as follows:

2015-11-28 23:02:03.276821 7f7f5affb700  0 -- 
10.176.128.135:0/1009266 >> 
10.176.128.136:6800/22824 pipe(0x7f7f6000e190 
sd=6 :0 s=1 pgs=0 cs=0 l=1 c=0x7f7f60012430).fault
According to the comments in Bug #3883 I 
restarted the corresponding OSD (10.176.128.135) but it is not working for me.
Also, I observe that during the operation some of the OSDs go down and after 
sometime come up automatically.
Following is the output of OSD tree map.

ID  WEIGHT  TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
 -1 0.41992 root default
 -2 0.03999 host ceph-node2
  0 0.03999 osd.0 up  1.0  1.0
 -3 0.04999 host ceph-node4
  1 0.04999 osd.1 up  1.0  1.0
 -4 0.03999 host ceph-node1
  2 0.03999 osd.2 up  1.0  1.0
 -5 0.04999 host ceph-node6
  3 0.04999 osd.3   down0  1.0
 -6 0.03000 host ceph-node5
  4 0.03000 osd.4 up  1.0  1.0
 -7 0.04999 host ceph-node7
  5 0.04999 osd.5 up  1.0  1.0
 -8 0.03999 host ceph-node8
  6 0.03999 osd.6 up  1.0  1.0
 -9 0.03999 host ceph-node9
  7 0.03999 osd.7 up  1.0  1.0
-10 0.07999 host ceph-node10
  8 0.03999 osd.8 up  1.0  1.0
  9 0.03999 osd.9 up  1.0  1.0

Can someone help me with this issue?




Thanks & Regards

Prasad Pande
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RGW pool contents

2015-11-26 Thread Somnath Roy
Thanks Wido !
Could you please explain a bit more about the relationship between user-created 
buckets and the objects within the .rgw.buckets.index pool?
I am not seeing one entry created within .rgw.buckets.index for each bucket.

Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Wido 
den Hollander
Sent: Wednesday, November 25, 2015 10:56 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] RGW pool contents

On 11/24/2015 08:48 PM, Somnath Roy wrote:
> Hi Yehuda/RGW experts,
> 
> I have one cluster with RGW up and running in the customer site.
> 
> I did some heavy performance testing on that with CosBench and as a 
> result written significant amount of data to showcase performance on that.
> 
> Over time, customer also wrote significant amount of data using S3 api 
> into the cluster.
> 
> Now, I want to remove the buckets/objects created by CosBench and need 
> some help on that.
> 
> I ran the following command to list the buckets.
> 
>  
> 
> "radosgw-admin bucket list"
> 
>  
> 
> The output is the following snippet..
> 
>  
> 
> "rgwdef42",
> 
> "rgwdefghijklmnop79",
> 
> "rgwyzabc43",
> 
> "rgwdefgh43",
> 
> "rgwdefghijklm200",
> 
>  
> 
> ..
> 
> ..
> 
>  
> 
> My understanding is , cosbench should create containers with 
> "*mycontainers_*"  and objects with format "*myobjects*_" prefix 
> (?). But, it's not there in the output of the above command.
> 
>  

Well, if it did, they should show up there.

> 
> Next, I tried to list the contents of the different rgw pools..
> 
>  
> 
> *rados -p .rgw.buckets.index ls*
> 
>  
> 
> .dir.default.5407.17
> 
> .dir.default.6063.24
> 
> .dir.default.6068.23
> 
> .dir.default.6046.7
> 
> .dir.default.6065.44
> 
> .dir.default.5409.3
> 
> ...
> 
> ...
> 
>  
> 
> Nothing with rgw prefix...Shouldn't the bucketindex objects having 
> similar prefix with bucket names ?
> 

No, there are the internal IDs of the buckets. You can find the actual bucket 
objects in the ".rgw" pool.

>  
> 
>  
> 
> Now, tried to get the actual objects...
> 
> *rados -p .rgw.buckets ls*
> 
>  
> 
> default.6662.5_myobjects57862
> 
> default.5193.18_myobjects6615
> 
> default.5410.5_myobjects68518
> 
> default.6661.8_myobjects7407
> 
> default.5410.22_myobjects54939
> 
> default.6651.6_myobjects23790
> 
>  
> 
> 
> 
> ...
> 
>  
> 
> So, looking at these, it seems cosbench run is creating the
> .dir.default.* buckets and the default._myobjects* objects 
> (?)
> 

No, again, the .dir.default.X is the internal ID of the bucket. It creates 
"myobject" object on those buckets.

>  
> 
> But, these buckets are not listed by the first "radosgw-admin" 
> command, *why ?*
> 
>  
> 
> Next, I listed the contents of the .rgw pool and here is the output..
> 
>  
> 
> *rados -p .rgw ls*
> 
>  
> 
> .bucket.meta.rgwdefghijklm78:default.6069.18
> 
> rgwdef42
> 
> rgwdefghijklmnop79
> 
> rgwyzabc43
> 
> .bucket.meta.rgwdefghijklmnopqr71:default.6655.3
> 
> rgwdefgh43
> 
> .bucket.meta.rgwdefghijklm119:default.6066.25
> 
> rgwdefghijklm200
> 
> .bucket.meta.rgwxghi2:default.5203.4
> 
> rgwxjk17
> 
> rgwdefghijklm196
> 
>  
> 
> ...
> 
> ...
> 
>  
> 
> It seems this pool has the buckets listed by the radosgw-admin command.
> 
>  
> 
> Can anybody explain what is *.rgw pool* supposed to contain ?
> 
>  

This pool contains only the bucket metadata objects, here it references to the 
internal IDs.

You can fetch this with 'radosgw-admin metadata get bucket:XX'
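
For example (a sketch only; "rgwdef42" and the bucket ID are reused from the 
listings above, and exact output fields vary by version), you can tie a bucket 
name back to its index object like this:

radosgw-admin metadata get bucket:rgwdef42
# the output contains a bucket_id such as default.5407.17; the matching
# index object in .rgw.buckets.index is then:
rados -p .rgw.buckets.index listomapkeys .dir.default.5407.17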

> 
> Also, what is the difference between .*users.uid and .users pool* ?
> 
>  

In the .users.uid pool the RGW can do a quick query for user IDs, since that is 
required for matching ACLs which might be on a bucket and/or object.

Wido

> 
>  
> 
> Appreciate any help on this.
> 
>  
> 
> Thanks & Regards
> 
> Somnath
> 
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


--
Wido den Hollander
42on B.V.
Ceph trainer and consultant

Phone: +31 (0)20 700 9902
Skype: contact42on
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RGW pool contents

2015-11-24 Thread Somnath Roy
Hi Yehuda/RGW experts,
I have one cluster with RGW up and running in the customer site.
I did some heavy performance testing on that with CosBench and as a result 
written significant amount of data to showcase performance on that.
Over time, customer also wrote significant amount of data using S3 api into the 
cluster.
Now, I want to remove the buckets/objects created by CosBench and need some 
help on that.
I ran the following command to list the buckets.

"radosgw-admin bucket list"

The output is the following snippet..

"rgwdef42",
"rgwdefghijklmnop79",
"rgwyzabc43",
"rgwdefgh43",
"rgwdefghijklm200",

..
..

My understanding is , cosbench should create containers with "mycontainers_" 
 and objects with format "myobjects_" prefix (?). But, it's not there in the 
output of the above command.

Next, I tried to list the contents of the different rgw pools..

rados -p .rgw.buckets.index ls

.dir.default.5407.17
.dir.default.6063.24
.dir.default.6068.23
.dir.default.6046.7
.dir.default.6065.44
.dir.default.5409.3
...
...

Nothing with the rgw prefix...Shouldn't the bucket index objects have a prefix 
similar to the bucket names?


Now, tried to get the actual objects...
rados -p .rgw.buckets ls

default.6662.5_myobjects57862
default.5193.18_myobjects6615
default.5410.5_myobjects68518
default.6661.8_myobjects7407
default.5410.22_myobjects54939
default.6651.6_myobjects23790


...

So, looking at these, it seems cosbench run is creating the .dir.default.* 
buckets and the default._myobjects* objects (?)

But, these buckets are not listed by the first "radosgw-admin" command, why ?

Next, I listed the contents of the .rgw pool and here is the output..

rados -p .rgw ls

.bucket.meta.rgwdefghijklm78:default.6069.18
rgwdef42
rgwdefghijklmnop79
rgwyzabc43
.bucket.meta.rgwdefghijklmnopqr71:default.6655.3
rgwdefgh43
.bucket.meta.rgwdefghijklm119:default.6066.25
rgwdefghijklm200
.bucket.meta.rgwxghi2:default.5203.4
rgwxjk17
rgwdefghijklm196

...
...

It seems this pool has the buckets listed by the radosgw-admin command.

Can anybody explain what is .rgw pool supposed to contain ?

Also, what is the difference between .users.uid and .users pool ?


Appreciate any help on this.

Thanks & Regards
Somnath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] about PG_Number

2015-11-13 Thread Somnath Roy
Use the following link, it should give an idea about pg_number.

http://ceph.com/pgcalc/

The number of PGs per OSD has implications for the number of TCP connections in 
the system, along with some CPU/memory resources. So, if you have lots of PGs 
per OSD it may degrade performance, I think mainly because of the extensive 
number of TCP connections, especially if you have a pretty decent high-end node 
and aren't worried about cpu/memory. Basically, we should be targeting ~200 PGs 
per OSD.

On the other hand, very few PGs per OSD will hurt parallelism, because Ceph will 
not allow multiple operations on a single PG to proceed in parallel. I am not 
sure data distribution will be affected that much (considering you have more 
PGs than OSDs :-) ), but I must admit that I never tried with very few PGs...
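
As a rough sketch of the rule of thumb behind pgcalc (a guideline only, not an 
exact prescription): total PGs for a pool ~= (number of OSDs * target PGs per 
OSD) / replica count, rounded to a nearby power of two. With assumed numbers:

# hypothetical: 40 OSDs, ~200 target PGs per OSD, 3 replicas
echo $(( 40 * 200 / 3 ))    # -> 2666, so you would pick 2048 (or 4096) PGs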

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Francois Lafont
Sent: Friday, November 13, 2015 4:34 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] about PG_Number

Hi,

On 13/11/2015 09:13, Vickie ch wrote:

> If you have a large amount of OSDs but less pg number. You will find 
> your data write unevenly.
> Some OSD have no change to write data.
> In the other side, pg number too large but OSD number too small that 
> have a chance to cause data lost.

Data lost, are you sure?

Personally, I would have said:

  few PGs/OSD                           many PGs/OSD
  -----------                           ------------
  * Data distributed less evenly        * Good, balanced distribution of data
  * Uses less CPU and RAM               * Uses more CPU and RAM

No?


François Lafont
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] iSCSI over RDB is a good idea ?

2015-11-04 Thread Somnath Roy
We are using SCST over RBD and are not seeing much of a degradation...Need to 
make sure you tune SCST properly and use multiple sessions..

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Hugo 
Slabbert
Sent: Wednesday, November 04, 2015 1:44 PM
To: Jason Dillaman; Gaetan SLONGO
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] iSCSI over RDB is a good idea ?

> The disadvantage of the iSCSI design is that it adds an extra hop between 
> your VMs and the backing Ceph cluster.

...and introduces a bottleneck. iSCSI initiators are "dumb" in comparison to 
native ceph/rbd clients. Whereas native clients will talk to all the relevant 
OSDs directly, iSCSI initiators will just talk to the target (unless there is 
some awesome magic in the RBD/tgt integration that I'm unaware of). So the 
targets and their connectivity are a bottleneck.

--
Hugo
h...@slabnet.com: email, xmpp/jabber
also on Signal

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding the number of TCP connections between clients and OSDs

2015-11-04 Thread Somnath Roy
Hope this will be helpful..



Total connections per OSD =
    (Target PGs per OSD) * (# of pool replicas) * 3
    + (2 * # of clients) + (min_hb_peer)

where:
  # of pool replicas = configurable, default is 3
  3 = the number of data communication messengers (cluster, hb_backend, hb_frontend)
  min_hb_peer = default is 20, I guess..

Total connections per node = (total connections per OSD) * (number of OSDs per node)
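
A quick back-of-the-envelope example with assumed numbers (100 target PGs per 
OSD, 3 replicas, 2 clients, 20 heartbeat peers, 10 OSDs per node):

echo $(( 100 * 3 * 3 + 2 * 2 + 20 ))   # -> 924 connections per OSD
echo $(( 924 * 10 ))                   # -> 9240 connections per node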

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rick 
Balsano
Sent: Wednesday, November 04, 2015 12:28 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Understanding the number of TCP connections between 
clients and OSDs

Just following up since this thread went silent after a few comments showing 
similar concerns, but no explanation of the behavior. Can anyone point to some 
code or documentation which explains how to estimate the expected number of TCP 
connections a client would open based on read/write volume, # of volumes, # of 
OSDs in the pool, etc?


On Tue, Oct 27, 2015 at 5:05 AM, Dan van der Ster 
> wrote:
On Mon, Oct 26, 2015 at 10:48 PM, Jan Schermer 
> wrote:
> If we're talking about RBD clients (qemu) then the number also grows with
> number of volumes attached to the client.

I never thought about that but it might explain a problem we have
where multiple attached volumes crashes an HV. I had assumed that
multiple volumes would reuse the same rados client instance, and thus
reuse the same connections to the OSDs.

-- dan



--
Rick Balsano
Senior Software Engineer
Opower

O +1 571 384 1210
We're Hiring! See jobs here.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Somnath Roy
One thing, *don't* trust iostat disk util% in case of SSDs..100% doesn't mean 
you are saturating SSDs there..I have seen a large performance delta even if 
iostat is reporting 100% disk util in both the cases.
Also, the ceph.conf file you are using is not optimal..Try to add these..

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your cpu, considering you have powerful cpu 
complex for SSDs tweak this to high number of shards..It also depends on number 
of OSDs per box..

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer and newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false
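
Pulling the non-debug suggestions together, a sketch of how they might sit in 
ceph.conf (the shard/thread counts below are placeholders that depend on your 
CPU and OSD count per box, not recommendations):

[global]
ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

[osd]
osd_op_num_shards = 10                 # placeholder - size to your CPU
osd_op_num_threads_per_shard = 2       # placeholder
filestore_op_threads = 8               # placeholder
filestore_max_sync_interval = 1
filestore_min_sync_interval = 0.1
osd_enable_op_tracker = false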


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and write per
> > seconds from iostat match fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be these particular NVMe SSDs might just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is re-affirmed by seeing Samsung list them as Client SSDs, not 
data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVM-E SSDs work best with very high queue depths and
> parallel IOps.
>
> Also, be aware that Ceph add multiple layers to the whole I/O
> subsystem and that there will be a performance impact when Ceph is used in 
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > *From:*Nick Fisk [mailto:n...@fisk.me.uk]
> > *Sent:* lunedì 26 ottobre 2015 13:06
> > *To:* Matteo Dacrema ; ceph-us...@ceph.com
> > *Subject:* RE: BAD nvme SSD performance
> >
> >
> >
> > Hi Matteo,
> >
> >
> >
> > Ceph introduces latency into the write path and so what you are
> > seeing is typical. If you increase the iodepth of the fio test you
> > should get higher results though, until you start maxing out your CPU.
> >
> >
> >
> > Nick
> >
> >
> >
> > *From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
> > Behalf Of *Matteo Dacrema
> > *Sent:* 26 October 2015 11:20
> > *To:* ceph-us...@ceph.com 
> > *Subject:* [ceph-users] BAD nvme SSD performance
> >
> >
> >
> > Hi all,
> >
> >
> >
> > I’ve recently buy two Samsung SM951 256GB nvme PCIe SSDs and built a
> > 2 OSD ceph cluster with min_size = 1.
> >
> > I’ve tested them with fio ad I obtained two very different results
> > with these two situations with fio.
> >
> > This is the command : *fio  --ioengine=libaio --direct=1
> > --name=test --filename=test --bs=4k  --size=100M
> > --readwrite=randwrite
> > --numjobs=200  --group_reporting*
> >
> >
> >
> > On the OSD host I’ve obtained this result:
> >
> > *bw=575493KB/s, iops=143873*
> >
> > * *
> >
> > On the client host with a mounted volume I’ve obtained this result:
> >
> >
> >
> > Fio executed on the client osd with a mounted volume:
> >
> > *bw=9288.1KB/s, iops=2322*
> >
> > * *
> >
> > I’ve obtained this results with Journal and data on the same disk
> > and also with Journal on separate SSD.
> >
> > * *
> >
> > I’ve two OSD host with 64GB of RAM and 2x Intel Xeon 

Re: [ceph-users] BAD nvme SSD performance

2015-10-26 Thread Somnath Roy
Another point,
As Christian mentioned, try to evaluate the O_DIRECT|O_DSYNC performance of an 
SSD before choosing it for Ceph..
Try running fio with direct=1 and sync=1 against the raw SSD drive..
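
A commonly used form of that raw-device test is something like this (the device 
path is a placeholder, and writing to the raw device destroys any data on it):

fio --name=ssd-sync-test --filename=/dev/sdX --ioengine=libaio --direct=1 --sync=1 \
    --rw=write --bs=4k --iodepth=1 --numjobs=1 \
    --runtime=60 --time_based --group_reporting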

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Monday, October 26, 2015 9:20 AM
To: Christian Balzer; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance

One thing, *don't* trust iostat disk util% in case of SSDs..100% doesn't mean 
you are saturating SSDs there..I have seen a large performance delta even if 
iostat is reporting 100% disk util in both the cases.
Also, the ceph.conf file you are using is not optimal..Try to add these..

debug_lockdep = 0/0
debug_context = 0/0
debug_crush = 0/0
debug_buffer = 0/0
debug_timer = 0/0
debug_filer = 0/0
debug_objecter = 0/0
debug_rados = 0/0
debug_rbd = 0/0
debug_journaler = 0/0
debug_objectcatcher = 0/0
debug_client = 0/0
debug_osd = 0/0
debug_optracker = 0/0
debug_objclass = 0/0
debug_filestore = 0/0
debug_journal = 0/0
debug_ms = 0/0
debug_monc = 0/0
debug_tp = 0/0
debug_auth = 0/0
debug_finisher = 0/0
debug_heartbeatmap = 0/0
debug_perfcounter = 0/0
debug_asok = 0/0
debug_throttle = 0/0
debug_mon = 0/0
debug_paxos = 0/0
debug_rgw = 0/0

You didn't mention anything about your cpu, considering you have powerful cpu 
complex for SSDs tweak this to high number of shards..It also depends on number 
of OSDs per box..

osd_op_num_threads_per_shard
osd_op_num_shards


Don't need to change the following..

osd_disk_threads
osd_op_threads


Instead, try increasing..

filestore_op_threads

Use the following in the global section..

ms_dispatch_throttle_bytes = 0
throttler_perf_counter = false

Change the following..
filestore_max_sync_interval = 1   (or even lower, need to lower 
filestore_min_sync_interval as well)


I am assuming you are using hammer and newer..

Thanks & Regards
Somnath

Try increasing the following to very big numbers..

> > filestore_queue_max_ops = 2000
> >
> > filestore_queue_max_bytes = 536870912
> >
> > filestore_queue_committing_max_ops = 500
> >
> > filestore_queue_committing_max_bytes = 268435456

Use the following..

osd_enable_op_tracker = false


-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Christian Balzer
Sent: Monday, October 26, 2015 8:23 AM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] BAD nvme SSD performance


Hello,

On Mon, 26 Oct 2015 14:35:19 +0100 Wido den Hollander wrote:

>
>
> On 26-10-15 14:29, Matteo Dacrema wrote:
> > Hi Nick,
> >
> >
> >
> > I also tried to increase iodepth but nothing has changed.
> >
> >
> >
> > With iostat I noticed that the disk is fully utilized and write per 
> > seconds from iostat match fio output.
> >
>
> Ceph isn't fully optimized to get the maximum potential out of NVME 
> SSDs yet.
>
Indeed. Don't expect Ceph to be near raw SSD performance.

However he writes that per iostat the SSD is fully utilized.

Matteo, can you run atop instead of iostat and confirm that:

a) utilization of the SSD is 100%.
b) CPU is not the bottleneck.

My guess would be that these particular NVMe SSDs just suffer from the same 
direct sync I/O deficiencies as other Samsung SSDs.
This feeling is reaffirmed by seeing Samsung list them as client SSDs, 
not data center ones.
http://www.samsung.com/semiconductor/products/flash-storage/client-ssd/MZHPV256HDGL?ia=831

Regards,

Christian

> For example, NVMe SSDs work best with very high queue depths and 
> parallel IOps.
>
> Also, be aware that Ceph adds multiple layers to the whole I/O 
> subsystem and that there will be a performance impact when Ceph is used in 
> between.
>
> Wido
>
> >
> >
> > Matteo
> >
> >
> >
> > From: Nick Fisk [mailto:n...@fisk.me.uk]
> > Sent: Monday, October 26, 2015 1:06 PM
> > To: Matteo Dacrema <mdacr...@enter.it>; ceph-us...@ceph.com
> > Subject: RE: BAD nvme SSD performance
> >
> >
> >
> > Hi Matteo,
> >
> >
> >
> > Ceph introduces latency into the write path and so what you are 
> > seeing is typical. If you increase the iodepth of the fio test you 
> > should get higher results though, until you start maxing out your CPU.
> >
> >
> >
> > Nick
> >
> >
> >
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On 
> > Behalf Of Matteo Dacrema
> > Sent: 26 October 2015 11:20
> > To: ceph-us...@ceph.com <mailto:ceph-us...@ceph.com>
> > Subject: [ceph-users] BAD nvme SSD performance
> >
> >
> >
> > Hi all,
> >
> >
> >

Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Somnath Roy
Jan,
Journal helps FileStore to maintain the transactional integrity in the event of 
a crash. That's the main reason.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Wednesday, October 14, 2015 2:28 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

Hi,
I've been thinking about this for a while now - does Ceph really need a 
journal? Filesystems are already pretty good at committing data to disk when 
asked (and much faster too), we have external journals in XFS and Ext4...
In a scenario where client does an ordinary write, there's no need to flush it 
anywhere (the app didn't ask for it) so it ends up in pagecache and gets 
committed eventually.
If a client asks for the data to be flushed then fdatasync/fsync on the 
filestore object takes care of that, including ordering and stuff.
For reads, you just read from filestore (no need to differentiate between 
filestore/journal) - pagecache gives you the right version already.

Or is the journal there to achieve some tiering for writes when running 
spindles with SSDs? This is IMO the only thing ordinary filesystems don't do 
out of the box even when the filesystem journal is put on SSD - the data gets 
flushed to the spindle whenever fsync-ed (even with data=journal). But in 
reality, most of the data will hit the spindle either way, and when you run 
with SSDs it will always be much slower. And even for tiering - there are 
already many options (bcache, flashcache or even ZFS L2ARC) that are much more 
performant and proven stable. I think the fact that people have a need to 
combine Ceph with stuff like that already proves the point.

So a very interesting scenario would be to disable Ceph journal and at most use 
data=journal on ext4. The complexity of the data path would drop significantly, 
latencies decrease, CPU time is saved...
I just feel that Ceph has lots of unnecessary complexity inside that duplicates 
what filesystems (and pagecache...) have been doing for a while now without 
eating most of our CPU cores - why don't we use that? Is it possible to disable 
journal completely?

Did I miss something that makes journal essential?

Jan

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

2015-10-14 Thread Somnath Roy
A filesystem like XFS guarantees a single file write, but in a Ceph transaction 
we are touching the file, xattrs and leveldb (omap), so there is no way the 
filesystem can guarantee that transaction. That's why FileStore has implemented 
a write-ahead journal. Basically, it writes the entire transaction object there 
and only trims it from the journal when it is actually applied (all the 
operations executed) and persisted in the backend. 

Thanks & Regards
Somnath

-Original Message-
From: Jan Schermer [mailto:j...@schermer.cz] 
Sent: Wednesday, October 14, 2015 9:06 AM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?

But that's exactly what filesystems and their own journals do already :-)

Jan

> On 14 Oct 2015, at 17:02, Somnath Roy <somnath@sandisk.com> wrote:
> 
> Jan,
> Journal helps FileStore to maintain the transactional integrity in the event 
> of a crash. That's the main reason.
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
> Schermer
> Sent: Wednesday, October 14, 2015 2:28 AM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Ceph journal - isn't it a bit redundant sometimes?
> 
> Hi,
> I've been thinking about this for a while now - does Ceph really need a 
> journal? Filesystems are already pretty good at committing data to disk when 
> asked (and much faster too), we have external journals in XFS and Ext4...
> In a scenario where client does an ordinary write, there's no need to flush 
> it anywhere (the app didn't ask for it) so it ends up in pagecache and gets 
> committed eventually.
> If a client asks for the data to be flushed then fdatasync/fsync on the 
> filestore object takes care of that, including ordering and stuff.
> For reads, you just read from filestore (no need to differentiate between 
> filestore/journal) - pagecache gives you the right version already.
> 
> Or is journal there to achieve some tiering for writes when the running 
> spindles with SSDs? This is IMO the only thing ordinary filesystems don't do 
> out of box even when filesystem journal is put on SSD - the data get flushed 
> to spindle whenever fsync-ed (even with data=journal). But in reality, most 
> of the data will hit the spindle either way and when you run with SSDs it 
> will always be much slower. And even for tiering - there are already many 
> options (bcache, flashcache or even ZFS L2ARC) that are much more performant 
> and proven stable. I think the fact that people  have a need to combine Ceph 
> with stuff like that already proves the point.
> 
> So a very interesting scenario would be to disable Ceph journal and at most 
> use data=journal on ext4. The complexity of the data path would drop 
> significantly, latencies decrease, CPU time is saved...
> I just feel that Ceph has lots of unnecessary complexity inside that 
> duplicates what filesystems (and pagecache...) have been doing for a while 
> now without eating most of our CPU cores - why don't we use that? Is it 
> possible to disable journal completely?
> 
> Did I miss something that makes journal essential?
> 
> Jan
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-13 Thread Somnath Roy
Thanks Haomai..
Since the async messenger is always using a constant number of threads, could 
there be a potential performance problem when scaling up the client connections 
while keeping the number of OSDs constant?
Maybe it's a good tradeoff..

Regards
Somnath


-Original Message-
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Monday, October 12, 2015 11:35 PM
To: Somnath Roy
Cc: Mark Nelson; ceph-devel; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs 
AsyncMessenger results

On Tue, Oct 13, 2015 at 12:18 PM, Somnath Roy <somnath@sandisk.com> wrote:
> Mark,
>
> Thanks for this data. This means probably simple messenger (not OSD
> core) is not doing optimal job of handling memory.
>
>
>
> Haomai,
>
> I am not that familiar with Async messenger code base, do you have an
> explanation of the behavior (like good performance with default
> tcmalloc) Mark reported ? Is it using lot less thread overall than Simple ?

Originally the async messenger mainly wanted to solve the high thread count 
problem which limited Ceph cluster size: the high context switching and CPU 
usage caused by the simple messenger on a large cluster.

Recently we have had the memory problem discussed on the ML and I also spent 
time thinking about the root cause. Currently I consider the simple messenger's 
memory usage to be deviating from the design of tcmalloc. Tcmalloc aims to 
provide memory with a local cache, and it also has memory control among all 
threads; if we have too many threads, it may keep tcmalloc busy with memory 
lock contention.

The async messenger uses a thread pool to serve connections; it makes all the 
blocking calls in the simple messenger async.

>
> Also, it seems Async messenger has some inefficiencies in the io path
> and that’s why it is not performing as well as simple if the memory
> allocation stuff is optimally handled.

Yep, the simple messenger uses two threads (one for read, one for write) to 
serve one connection; the async messenger has at most one thread serving a 
connection, and multiple connections will share the same thread.

Next, I have several plans to improve performance:
1. Add poll mode support; I hope it can help with high performance storage 
needs.
2. Add load balancing ability among the worker threads.
3. Move more work out of the messenger thread.

>
> Could you please send out any documentation around Async messenger ? I
> tried to google it , but, not even blueprint is popping up.

>
>
>
>
>
> Thanks & Regards
>
> Somnath
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
> Of Haomai Wang
> Sent: Monday, October 12, 2015 7:57 PM
> To: Mark Nelson
> Cc: ceph-devel; ceph-users@lists.ceph.com
> Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger
> vs AsyncMessenger results
>
>
>
> COOL
>
>
>
> Interesting that async messenger will consume more memory than simple,
> in my mind I always think async should use less memory. I will give a
> look at this
>
>
>
> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson <mnel...@redhat.com> wrote:
>
> Hi Guy,
>
> Given all of the recent data on how different memory allocator
> configurations improve SimpleMessenger performance (and the effect of
> memory allocators and transparent hugepages on RSS memory usage), I
> thought I'd run some tests looking how AsyncMessenger does in
> comparison.  We spoke about these a bit at the last performance meeting but 
> here's the full write up.
> The rough conclusion as of right now appears to be:
>
> 1) AsyncMessenger performance is not dependent on the memory allocator
> like with SimpleMessenger.
>
> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB
> (ie
> default) thread cache.
>
> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
> random reads.
>
> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
> allocator optimizations are used.
>
> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>
> Here's a link to the paper:
>
> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
>
> Mark
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> --
>
> Best Regards,
>
> Wheat
>
>
> 
>

Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-12 Thread Somnath Roy
Mark,
Thanks for this data. This probably means the simple messenger (not the OSD 
core) is not doing an optimal job of handling memory.

Haomai,
I am not that familiar with the async messenger code base; do you have an 
explanation of the behavior Mark reported (like the good performance with 
default tcmalloc)? Is it using a lot fewer threads overall than Simple?
Also, it seems the async messenger has some inefficiencies in the IO path and 
that’s why it is not performing as well as simple when the memory allocation 
stuff is optimally handled.
Could you please send out any documentation around the async messenger? I tried 
to google it, but not even a blueprint is popping up.


Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Haomai 
Wang
Sent: Monday, October 12, 2015 7:57 PM
To: Mark Nelson
Cc: ceph-devel; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs 
AsyncMessenger results

COOL

Interesting that async messenger will consume more memory than simple, in my 
mind I always think async should use less memory. I will give a look at this

On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson 
> wrote:
Hi Guy,

Given all of the recent data on how different memory allocator configurations 
improve SimpleMessenger performance (and the effect of memory allocators and 
transparent hugepages on RSS memory usage), I thought I'd run some tests 
looking how AsyncMessenger does in comparison.  We spoke about these a bit at 
the last performance meeting but here's the full write up.  The rough 
conclusion as of right now appears to be:

1) AsyncMessenger performance is not dependent on the memory allocator like 
with SimpleMessenger.

2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie 
default) thread cache.

3) AsyncMessenger is consistently faster than SimpleMessenger for 128K random 
reads.

4) AsyncMessenger is sometimes slower than SimpleMessenger when memory 
allocator optimizations are used.

5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.

Here's a link to the paper:

https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view

Mark
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--

Best Regards,

Wheat




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph, SSD, and NVMe

2015-09-30 Thread Somnath Roy
David,
You should move to Hammer to get all of the performance benefits. They were all 
added in Giant and carried into the present Hammer LTS release.
FYI, the focus so far has been on read performance improvement, and what we saw 
in our environment with 6Gb SAS SSDs is that we are able to saturate the drives 
bandwidth-wise with 64K blocks and larger. But with smaller blocks like 4K we 
are not yet able to saturate the SAS SSD drives.
Still, considering Ceph's scale-out nature you can get some very good numbers 
out of a cluster. For example, with 8 SAS SSD drives (in a JBOF) and 2 heads in 
front (so, a 2-node Ceph cluster) we are able to hit ~300K random read IOPS, 
while the aggregated performance of the 8 SSDs would be ~400K. Not too bad. At 
this point we are saturating the host CPUs.
We have seen almost linear scaling if you add similar setups, i.e. adding say 
~3 of the above setups, you could hit ~900K RR IOPS. So, I would say it is 
definitely there in terms of read IOPS, and more improvements are coming.
But the write path is very awful compared to read, and that's where the problem 
is. Because, in the mainstream, no workload is 100% RR (IMO). So, even if you 
have say a 90/10 read/write mix, the performance numbers would be ~6-7X slower.
So, it is very much dependent on your workload/application access pattern and 
obviously the cost you are willing to spend.

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Mark 
Nelson
Sent: Wednesday, September 30, 2015 12:04 PM
To: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph, SSD, and NVMe

On 09/30/2015 09:34 AM, J David wrote:
> Because we have a good thing going, our Ceph clusters are still
> running Firefly on all of our clusters including our largest, all-SSD
> cluster.
>
> If I understand right, newer versions of Ceph make much better use of
> SSDs and give overall much higher performance on the same equipment.
> However, the impression I get of newer versions is that they are also
> not as stable as Firefly and should only be used with caution.
>
> Given our storage consumers have an effectively unlimited appetite for
> IOPs and throughput, more performance would be very welcome.  But not
> if it leads to cluster crashes and lost data.
>
> What really prompts this is that we are starting to see large-scale
> NVMe equipment appearing in the channel ( e.g.
> http://www.supermicro.com/products/system/1U/1028/SYS-1028U-TN10RT_.cf
> m ).  The cost is significantly higher with commensurately higher
> theoretical perfomance.  But if we're already not pushing our SSD's to
> the max over SAS, the added benefit of NVMe would largely be lost.
>
> On the other hand, if we could safely upgrade to a more recent version
> that is as stable and bulletproof as Firefly has been for us, but has
> better performance with SSDs, that would not only benefit our current
> setup, it would be a necessary first step for moving onto NVMe.
>
> So this raises three questions:
>
> 1) Have I correctly understood that one or more post-FireFly releases
> exist that (c.p.) perform significantly better with all-SSD setups?
>
> 2) Is there any such release that (generally) is as rock-solid as
> FireFly.  Of course this is somewhat situationally dependent, so I
> would settle for: is there any such release that doesn't have any
> known minding-my-own-business-suddenly-lost-data bugs in a 100% RBD
> use case?
>
> 3) Has anyone done anything with NVMe as storage (not just journals)
> who would care to share what kind of performance they experienced?
>
> (Of course if we do upgrade we will do so carefully, do a test cluster
> first, have backups standing by, etc.  But if it's already known that
> doing so will either not improve anything or is likely to blow up in
> our faces, it would be better to leave well enough alone.  The current
> performance is by no means bad, we're just always greedy for more. :)
> )
>
> Thanks for any advice/suggestions!

Hi David,

The single biggest performance improvement we've seen for SSDs has resulted 
from the memory allocator investigation that Chaitanya Hulgol and Somnath Roy 
spearheaded at Sandisk and others including myself have followed up and tried 
to expand on since then.

See:

http://www.spinics.net/lists/ceph-devel/msg25823.html
https://www.mail-archive.com/ceph-devel@vger.kernel.org/msg23100.html
http://www.spinics.net/lists/ceph-devel/msg21582.html

I haven't tested firefly, but there's a good chance that you may see a 
significant performance improvement simply by upgrading your systems to 
tcmalloc 2.4 and loading the OSDs with 128MB of thread cache or LD_PRELOAD 
jemalloc.  This isn't something we officially support in RHCS yet, but we'll 
likely be moving toward it for future releases based on the very positive 
results we are seeing.  The biggest thing to kee
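For anyone wanting to experiment with the two variants mentioned above, they 
are typically applied through the OSD's environment before the daemon starts, 
roughly like this (the library path and cache size are placeholders and vary by 
distro and gperftools/jemalloc version):

  # give tcmalloc a 128MB total thread cache for the OSD process
  export TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES=134217728

  # ...or preload jemalloc instead of tcmalloc when launching the OSD
  LD_PRELOAD=/usr/lib64/libjemalloc.so.1 ceph-osd -i <osd-id>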

Re: [ceph-users] OSD reaching file open limit - known issues?

2015-09-25 Thread Somnath Roy
Yes, known issue, make sure your system open file limit is pretty high..
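As a sketch of what "pretty high" usually means in practice (the numbers below 
are placeholders, size them to your OSD and PG counts), the limit is raised 
both at the OS level and in ceph.conf:

  # /etc/security/limits.conf (or the ulimit -n call in the init script)
  *   soft   nofile   131072
  *   hard   nofile   131072

  # ceph.conf
  [global]
      max open files = 131072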

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Jan 
Schermer
Sent: Friday, September 25, 2015 4:42 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] OSD reaching file open limit - known issues?

Hi,
we recently migrated some of our nodes to Ubuntu 12, which helped everything 
quite a bit.

But we hit a snag where the upstart initscript would not set the file open 
ulimit correctly and some OSDs ran out of fds.

Some problems manifested since then on the node where this happened, such as 
scrub errors (which were corrected - don't ask me how, I was sleeping :-)) - 
but now two of the OSDs on this node started failing with SIGABORT:

2015-09-25 10:45:17.860461 361ea17e700 -1 osd.12 pg_epoch: 1209679 pg[4.3d85( v 
1209679'13913518 (1209090'13910508,1209679'13913518] local-les=1209679 n=235 
ec=3 les/c 1209679/1209679 1209678/1209678/1209678) [12,36,59] r=0 lpr=1209678 
mlcod 1209656'13913517 active+clean snaptrimq=[857c0~1,857f4~1,85a43~2]] 
trim_objectcould not find coid 
783dbd85/rbd_data.1a785181f15746a.000238df/857c0//4
2015-09-25 10:45:17.862019 361ea17e700 -1 
osd/ReplicatedPG.cc: In function 
'ReplicatedPG::RepGather* ReplicatedPG::trim_object(const hobject_t&)' thread 
361ea17e700 time 2015-09-25 10:45:17.860501
osd/ReplicatedPG.cc: 1510: FAILED assert(0)

 ceph version 0.67.11-82-ge5b6eea (e5b6eea91cc37434f78a987d2dd1d3edd4a23f3f)
 1: (ReplicatedPG::trim_object(hobject_t const&)+0x150) [0x6e8bd0]
 2: (ReplicatedPG::TrimmingObjects::react(ReplicatedPG::SnapTrim const&)+0x2e7) 
[0x6ee0d7]
 3: (boost::statechart::detail::reaction_result 
boost::statechart::simple_state, 
(boost::statechart::history_mode)0>::local_react_impl_non_empty::local_react_impl, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, mpl_::na, 
mpl_::na, mpl_::na, mpl_::na, mpl_::na>, 
boost::statechart::simple_state, (boost::statechart::history_mode)0> 
>(boost::statechart::simple_state, (boost::statechart::history_mode)0>&, boost::statechart::event_base 
const&, void const*)+0x96) [0x740fa6]
 4: (boost::statechart::state_machine::process_queued_events()+0x137) 
[0x71bdf7]
 5: (boost::statechart::state_machine::process_event(boost::statechart::event_base
 const&)+0x26) [0x71cfe6]
 6: (ReplicatedPG::snap_trimmer()+0x4ed) [0x6b59ad]
 7: (OSD::SnapTrimWQ::_process(PG*)+0x14) [0x790c54]
 8: (ThreadPool::worker(ThreadPool::WorkThread*)+0x68c) [0x9a69cc]
 9: (ThreadPool::WorkThread::entry()+0x10) [0x9a7c20]
 10: (()+0x7e9a) [0x36258121e9a]
 11: (clone()+0x6d) [0x3625669638d]
 NOTE: a copy of the executable, or `objdump -rdS ` is needed to 
interpret this.
(full log available on request, file system will be thrashed shortly so tell me 
if it's helpful to look for something in there)

Could this all be caused by the OSD running out of file descriptors? Is it 
supposed to handle a problem like this (meaning both the assert that happened 
now and the file descriptor limit) gracefuly? Or is it a known issue that this 
could happen?
The thing about upstart is it apparently keeps restarting the OSD, which makes 
the problem even worse.

Luckily we caught this in time and it only happened on one node, so we are 
thrashing all the OSDs here.

Looks like a problem that could hit anyone, and if it actually damages data 
then it could be pretty bad and maybe worth looking into - tell me what more is 
needed.

Config:
XFS filesystem
Ubuntu 12 with 3.14.37 kernel
FIEMAP disabled
Ceph Dumpling 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-11 Thread Somnath Roy
Check this..

http://www.spinics.net/lists/ceph-users/msg16294.html

http://tracker.ceph.com/issues/9344

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Bill 
Sanders
Sent: Friday, September 11, 2015 11:17 AM
To: Jan Schermer
Cc: Rafael Lopez; ceph-users@lists.ceph.com; Nick Fisk
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Is there a thread on the mailing list (or LKML?) with some background about 
tcp_low_latency and TCP_NODELAY?
Bill

On Fri, Sep 11, 2015 at 2:30 AM, Jan Schermer 
<j...@schermer.cz<mailto:j...@schermer.cz>> wrote:
Can you try

echo 1 > /proc/sys/net/ipv4/tcp_low_latency

And see if it improves things? I remember there being an option to disable 
nagle completely, but it's gone apparently.
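If it does help and you want to keep the setting across reboots, the usual 
route is sysctl (shown only as an illustration):

  # one-off
  sysctl -w net.ipv4.tcp_low_latency=1

  # persistent, e.g. in /etc/sysctl.d/90-ceph.conf or /etc/sysctl.conf
  net.ipv4.tcp_low_latency = 1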

Jan

> On 11 Sep 2015, at 10:43, Nick Fisk <n...@fisk.me.uk<mailto:n...@fisk.me.uk>> 
> wrote:
>
>
>
>
>
>> -Original Message-
>> From: ceph-users 
>> [mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
>>  On Behalf Of
>> Somnath Roy
>> Sent: 11 September 2015 06:23
>> To: Rafael Lopez <rafael.lo...@monash.edu<mailto:rafael.lo...@monash.edu>>
>> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>>
>> That’s probably because the krbd version you are using doesn’t have the
>> TCP_NODELAY patch. We have submitted it (and you can build it from latest
>> rbd source) , but, I am not sure when it will be in linux mainline.
>
> From memory it landed in 3.19, but there are also several issues with max IO 
> size, max nr_requests and readahead. I would suggest for testing, try one of 
> these:-
>
> http://gitbuilder.ceph.com/kernel-deb-precise-x86_64-basic/ref/ra-bring-back/
>
>
>>
>> Thanks & Regards
>> Somnath
>>
>> From: Rafael Lopez 
>> [mailto:rafael.lo...@monash.edu<mailto:rafael.lo...@monash.edu>]
>> Sent: Thursday, September 10, 2015 10:12 PM
>> To: Somnath Roy
>> Cc: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
>> Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO
>>
>> Ok I ran the two tests again with direct=1, smaller block size (4k) and 
>> smaller
>> total io (100m), disabled cache at ceph.conf side on client by adding:
>>
>> [client]
>> rbd cache = false
>> rbd cache max dirty = 0
>> rbd cache size = 0
>> rbd cache target dirty = 0
>>
>>
>> The result seems to have swapped around, now the librbd job is running
>> ~50% faster than the krbd job!
>>
>> ### krbd job:
>>
>> [root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
>> job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
>> fio-2.2.8
>> Starting 1 process
>> Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta
>> 00m:00s]
>> job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
>>  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
>>clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>> lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
>>clat percentiles (usec):
>> |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
>> | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
>> | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
>> | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
>> | 99.99th=[19328]
>>bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
>>lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
>>  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
>>  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%,
>>> =64=0.0%
>> submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>> =64=0.0%
>> issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0,
>> drop=r=0/w=0/d=0
>> latency   : target=0, window=0, percentile=100.00%, depth=16
>>
>> Run status group 0 (all jobs):
>>  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s,
>> mint=162033msec, maxt=162033msec
>>
>> Disk stats (read/write):
>>  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745,
>> util=99.11%
>> [root@rcprsdc1r72-01-ac rafaell]#
>>
>> ## librb job:
>>
>> [root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_tes

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
I am not an expert on that, but these settings will probably help the backfill 
go slower and thus cause less degradation of client IO. You may want to try 
them..

Thanks & Regards
Somnath

-Original Message-
From: Robert LeBlanc [mailto:rob...@leblancnet.us] 
Sent: Thursday, September 10, 2015 3:16 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

Do the recovery options kick in when there is only backfill going on?
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Thu, Sep 10, 2015 at 3:01 PM, Somnath Roy  wrote:
> Try all these..
>
> osd recovery max active = 1
> osd max backfills = 1
> osd recovery threads = 1
> osd recovery op priority = 1
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
> Robert LeBlanc
> Sent: Thursday, September 10, 2015 1:56 PM
> To: ceph-users@lists.ceph.com
> Subject: [ceph-users] Hammer reduce recovery impact
>
> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> We are trying to add some additional OSDs to our cluster, but the impact of 
> the backfilling has been very disruptive to client I/O and we have been 
> trying to figure out how to reduce the impact. We have seen some client I/O 
> blocked for more than 60 seconds. There has been CPU and RAM head room on the 
> OSD nodes, network has been fine, disks have been busy, but not terrible.
>
> 11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
> dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.
>
> Clients are QEMU VMs.
>
> [ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
> (5fb85614ca8f354284c713a2f9c610860720bbf3)
>
> Some nodes are 0.94.3
>
> [ulhglive-root@ceph5 current]# ceph status
> cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
>  health HEALTH_WARN
> 3 pgs backfill
> 1 pgs backfilling
> 4 pgs stuck unclean
> recovery 2382/33044847 objects degraded (0.007%)
> recovery 50872/33044847 objects misplaced (0.154%)
> noscrub,nodeep-scrub flag(s) set
>  monmap e2: 3 mons at
> {mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
> election epoch 180, quorum 0,1,2 mon1,mon2,mon3
>  osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
> flags noscrub,nodeep-scrub
>   pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
> 128 TB used, 322 TB / 450 TB avail
> 2382/33044847 objects degraded (0.007%)
> 50872/33044847 objects misplaced (0.154%)
> 2300 active+clean
>3 active+remapped+wait_backfill
>1 active+remapped+backfilling recovery io 70401 kB/s, 16 
> objects/s
>   client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s
>
> Each pool is size 4 with min_size 2.
>
> One problem we have is that the requirements of the cluster changed after 
> setting up our pools, so our PGs are really out of wack. Our most active pool 
> has only 256 PGs and each PG is about 120 GB is size.
> We are trying to clear out a pool that has way too many PGs so that we can 
> split the PGs in that pool. I think these large PGs is part of our issues.
>
> Things I've tried:
>
> * Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
> latency sometimes up to 3000 ms down to a max of 500-700 ms.
> it has also reduced the huge swings in  latency, but has also reduced 
> throughput somewhat.
> * Changed the scheduler from deadline to CFQ. I'm not sure if the the OSD 
> process gives the recovery threads a different disk priority or if changing 
> the scheduler without restarting the OSD allows the OSD to use disk 
> priorities.
> * Reduced the number of osd_max_backfills from 2 to 1.
> * Tried setting noin to give the new OSDs time to get the PG map and peer 
> before starting the backfill. This caused more problems than solved as we had 
> blocked I/O (over 200 seconds) until we set the new OSDs to in.
>
> Even adding one OSD disk into the cluster is causing these slow I/O messages. 
> We still have 5 more disks to add from this server and four more servers to 
> add.
>
> In addition to trying to minimize these impacts, would it be better to split 
> the PGs then add the rest of the servers, or add the servers then do the PG 
> split. I'm thinking splitting first would be better, but I'd like to get 
> other opinions.
>
> No spindle stays at high utilization for long and the await drops below 20 ms 
> usually 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
That’s probably because the krbd version you are using doesn’t have the 
TCP_NODELAY patch. We have submitted it (and you can build it from the latest 
rbd source), but I am not sure when it will be in the Linux mainline.

Thanks & Regards
Somnath

From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
Sent: Thursday, September 10, 2015 10:12 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Ok I ran the two tests again with direct=1, smaller block size (4k) and smaller 
total io (100m), disabled cache at ceph.conf side on client by adding:

[client]
rbd cache = false
rbd cache max dirty = 0
rbd cache size = 0
rbd cache target dirty = 0


The result seems to have swapped around, now the librbd job is running ~50% 
faster than the krbd job!

### krbd job:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 process
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/571KB/0KB /s] [0/142/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=29095: Fri Sep 11 14:48:21 2015
  write: io=102400KB, bw=647137B/s, iops=157, runt=162033msec
clat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
 lat (msec): min=2, max=25, avg= 6.32, stdev= 1.21
clat percentiles (usec):
 |  1.00th=[ 2896],  5.00th=[ 4320], 10.00th=[ 4768], 20.00th=[ 5536],
 | 30.00th=[ 5920], 40.00th=[ 6176], 50.00th=[ 6432], 60.00th=[ 6624],
 | 70.00th=[ 6816], 80.00th=[ 7136], 90.00th=[ 7584], 95.00th=[ 7968],
 | 99.00th=[ 9024], 99.50th=[ 9664], 99.90th=[15808], 99.95th=[17536],
 | 99.99th=[19328]
bw (KB  /s): min=  506, max= 1171, per=100.00%, avg=632.22, stdev=104.77
lat (msec) : 4=2.88%, 10=96.69%, 20=0.43%, 50=0.01%
  cpu  : usr=0.17%, sys=0.71%, ctx=25634, majf=0, minf=35
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=631KB/s, minb=631KB/s, maxb=631KB/s, 
mint=162033msec, maxt=162033msec

Disk stats (read/write):
  rbd0: ios=0/25638, merge=0/32, ticks=0/160765, in_queue=160745, util=99.11%
[root@rcprsdc1r72-01-ac rafaell]#

## librb job:

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=4K-4K/4K-4K/4K-4K, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 process
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/703KB/0KB /s] [0/175/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=30568: Fri Sep 11 14:50:24 2015
  write: io=102400KB, bw=950141B/s, iops=231, runt=110360msec
slat (usec): min=70, max=992, avg=115.05, stdev=30.07
clat (msec): min=13, max=117, avg=67.91, stdev=24.93
 lat (msec): min=13, max=117, avg=68.03, stdev=24.93
clat percentiles (msec):
 |  1.00th=[   19],  5.00th=[   26], 10.00th=[   38], 20.00th=[   40],
 | 30.00th=[   46], 40.00th=[   62], 50.00th=[   77], 60.00th=[   85],
 | 70.00th=[   88], 80.00th=[   91], 90.00th=[   95], 95.00th=[   99],
 | 99.00th=[  105], 99.50th=[  110], 99.90th=[  116], 99.95th=[  117],
 | 99.99th=[  118]
bw (KB  /s): min=  565, max= 3174, per=100.00%, avg=935.74, stdev=407.67
lat (msec) : 20=2.41%, 50=29.85%, 100=64.46%, 250=3.29%
  cpu  : usr=2.43%, sys=0.29%, ctx=7847, majf=0, minf=2750
  IO depths: 1=6.2%, 2=12.5%, 4=25.0%, 8=50.0%, 16=6.2%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=94.1%, 8=0.0%, 16=5.9%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=25600/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400KB, aggrb=927KB/s, minb=927KB/s, maxb=927KB/s, 
mint=110360msec, maxt=110360msec

Disk stats (read/write):
dm-1: ios=240/369, merge=0/0, ticks=742/40, in_queue=782, util=0.38%, 
aggrios=240/379, aggrmerge=0/19, aggrticks=742/41, aggrin_queue=783, 
aggrutil=0.39%
  sda: ios=240/379, merge=0/19, ticks=742/41, in_queue=783, util=0.39%
[root@rcprsdc1r72-01-ac rafaell]#



Confirmed speed (at least for krbd) using dd:
[root@rcprsdc1r72-01-ac rafaell]# dd if=/mnt/ssd/random100g 
of=/mnt/rbd/dd_io_test bs=4k count=1 oflag=direct
1+0 records in
1+0 records out
4096 bytes (41 MB) copied, 64.9799 s, 630 kB/s
[root@rcprsdc1r72-01-ac rafaell]#


Back to FIO, it's worse for 1M block size (librbd is about ~100% better perf).
1M librbd:
Run status group 0 (all jobs):
  WRITE: io=1024.0MB, aggrb=112641KB/s, minb=112641KB/s, maxb=112641KB/s, 
mint=9309msec, maxt=9309msec

1M krbd:
R

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
It may be due to the rbd cache effect..
Try the following..

Run your test with direct=1 in both cases and rbd_cache = false (disable all 
the other rbd cache options as well). This should give you a similar result to 
krbd.

In the direct=1 case, we saw ~10-20% degradation if we set rbd_cache = true.
But in the direct=0 case, it could be more, as you are seeing..

I think there is a delta (or a need to tune properly) if you want to use the 
rbd cache.

Thanks & Regards
Somnath



From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Rafael 
Lopez
Sent: Thursday, September 10, 2015 8:24 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] bad perf for librbd vs krbd using FIO

Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance using 
FIO with single RBD image. RBD images are coming from same RBD pool, same size 
and settings for both. The librbd results are quite bad by comparison, and in 
addition if I scale up the kRBD FIO job with more jobs/threads it increases up 
to 3-4x results below, but librbd doesn't seem to scale much at all. I figured 
that it should be close to the kRBD result for a single job/thread before 
parallelism comes into play though. RBD cache settings are all default.

I can see some obvious differences in FIO output, but not being well versed 
with FIO I'm not sure what to make of it or where to start diagnosing the 
discrepancy. Hunted around but haven't found anything useful, any 
suggestions/insights would be appreciated.

RBD cache settings:
[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon 
/var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --

[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --


Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 2015
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
 lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
clat percentiles (usec):
 |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
 | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
 | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
 | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
 | 99.99th=[464896]
bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27, 
stdev=375092.66
lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 issued: total=r=0/w=102400/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency   : target=0, window=0, percentile=100.00%, depth=16

Run status group 0 (all jobs):
  WRITE: io=102400MB, aggrb=399740KB/s, minb=399740KB/s, maxb=399740KB/s, 
mint=262314msec, maxt=262314msec

Disk stats (read/write):
  rbd0: ios=0/150890, merge=0/49, ticks=0/36117700, in_queue=36145277, 
util=96.97%
[root@rcprsdc1r72-01-ac rafaell]#

[root@rcprsdc1r72-01-ac rafaell]# fio fio_rbd_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=rbd, iodepth=16
fio-2.2.8
Starting 1 thread
rbd engine: RBD version: 0.1.9
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/65405KB/0KB /s] [0/63/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=43960: Fri Sep 11 12:54:25 2015
  write: io=102400MB, bw=121882KB/s, iops=119, runt=860318msec
slat (usec): min=355, max=7300, 

Re: [ceph-users] bad perf for librbd vs krbd using FIO

2015-09-10 Thread Somnath Roy
Only changing client side ceph.conf and rerunning the tests is sufficient.

Thanks & Regards
Somnath

From: Rafael Lopez [mailto:rafael.lo...@monash.edu]
Sent: Thursday, September 10, 2015 8:58 PM
To: Somnath Roy
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] bad perf for librbd vs krbd using FIO

Thanks for the quick reply Somnath, will give this a try.

In order to set the rbd cache settings, is it a matter of updating the 
ceph.conf file on the client only prior to running the test, or do I need to 
inject args to all OSDs ?

Raf


On 11 September 2015 at 13:39, Somnath Roy 
<somnath@sandisk.com<mailto:somnath@sandisk.com>> wrote:
It may be due to rbd cache effect..
Try the following..

Run your test with direct = 1 both the cases and rbd_cache = false  (disable 
all other rbd cache option as well). This should give you similar result like 
krbd.

In direct =1 case, we saw ~10-20% degradation if we make rbd_cache = true.
But, direct = 0 case, it could be more as you are seeing..

I think there is a delta (or need to tune properly) if you want to use rbd 
cache.

Thanks & Regards
Somnath



From: ceph-users 
[mailto:ceph-users-boun...@lists.ceph.com<mailto:ceph-users-boun...@lists.ceph.com>]
 On Behalf Of Rafael Lopez
Sent: Thursday, September 10, 2015 8:24 PM
To: ceph-users@lists.ceph.com<mailto:ceph-users@lists.ceph.com>
Subject: [ceph-users] bad perf for librbd vs krbd using FIO

Hi all,

I am seeing a big discrepancy between librbd and kRBD/ext4 performance using 
FIO with single RBD image. RBD images are coming from same RBD pool, same size 
and settings for both. The librbd results are quite bad by comparison, and in 
addition if I scale up the kRBD FIO job with more jobs/threads it increases up 
to 3-4x results below, but librbd doesn't seem to scale much at all. I figured 
that it should be close to the kRBD result for a single job/thread before 
parallelism comes into play though. RBD cache settings are all default.

I can see some obvious differences in FIO output, but not being well versed 
with FIO I'm not sure what to make of it or where to start diagnosing the 
discrepancy. Hunted around but haven't found anything useful, any 
suggestions/insights would be appreciated.

RBD cache settings:
[root@rcmktdc1r72-09-ac rafaell]# ceph --admin-daemon 
/var/run/ceph/ceph-osd.659.asok config show | grep rbd_cache
"rbd_cache": "true",
"rbd_cache_writethrough_until_flush": "true",
"rbd_cache_size": "33554432",
"rbd_cache_max_dirty": "25165824",
"rbd_cache_target_dirty": "16777216",
"rbd_cache_max_dirty_age": "1",
"rbd_cache_max_dirty_object": "0",
"rbd_cache_block_writes_upfront": "false",
[root@rcmktdc1r72-09-ac rafaell]#

This is the FIO job file for the kRBD job:

[root@rcprsdc1r72-01-ac rafaell]# cat ext4_test
; -- start job file --
[global]
rw=rw
size=100g
filename=/mnt/rbd/fio_test_file_ext4
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
[job1]
; -- end job file --

[root@rcprsdc1r72-01-ac rafaell]#

This is the FIO job file for the librbd job:

[root@rcprsdc1r72-01-ac rafaell]# cat fio_rbd_test
; -- start job file --
[global]
rw=rw
size=100g
rwmixread=0
rwmixwrite=100
percentage_random=0
bs=1024k
direct=0
iodepth=16
thread=1
numjobs=1
ioengine=rbd
rbdname=nas1-rds-stg31
pool=rbd
[job1]
; -- end job file --


Here are the results:

[root@rcprsdc1r72-01-ac rafaell]# fio ext4_test
job1: (g=0): rw=rw, bs=1M-1M/1M-1M/1M-1M, ioengine=sync, iodepth=16
fio-2.2.8
Starting 1 thread
job1: Laying out IO file(s) (1 file(s) / 102400MB)
Jobs: 1 (f=1): [W(1)] [100.0% done] [0KB/321.7MB/0KB /s] [0/321/0 iops] [eta 
00m:00s]
job1: (groupid=0, jobs=1): err= 0: pid=37981: Fri Sep 11 12:33:13 
2015<tel:13%202015>
  write: io=102400MB, bw=399741KB/s, iops=390, runt=262314msec
clat (usec): min=411, max=574082, avg=2492.91, stdev=7316.96
 lat (usec): min=418, max=574113, avg=2520.12, stdev=7318.53
clat percentiles (usec):
 |  1.00th=[  446],  5.00th=[  458], 10.00th=[  474], 20.00th=[  510],
 | 30.00th=[ 1064], 40.00th=[ 1096], 50.00th=[ 1160], 60.00th=[ 1320],
 | 70.00th=[ 1592], 80.00th=[ 2448], 90.00th=[ 7712], 95.00th=[ 7904],
 | 99.00th=[11072], 99.50th=[11712], 99.90th=[13120], 99.95th=[73216],
 | 99.99th=[464896]
bw (KB  /s): min=  264, max=2156544, per=100.00%, avg=412986.27, 
stdev=375092.66
lat (usec) : 500=18.68%, 750=7.43%, 1000=2.11%
lat (msec) : 2=48.89%, 4=4.35%, 10=16.79%, 20=1.67%, 50=0.03%
lat (msec) : 100=0.03%, 250=0.02%, 500=0.01%, 750=0.01%
  cpu  : usr=1.24%, sys=45.38%, ctx=19298, majf=0, minf=974
  IO depths: 1=100.0%, 2=0.0%, 4=0.0%, 8=0.0%, 16=0.0%, 32=0.0%, >=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, &

Re: [ceph-users] Hammer reduce recovery impact

2015-09-10 Thread Somnath Roy
Try all these..

osd recovery max active = 1
osd max backfills = 1
osd recovery threads = 1
osd recovery op priority = 1
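If restarting the OSDs is not convenient, the same values can usually be 
applied at runtime, along these lines (a sketch; the option names mirror the 
list above):

  ceph tell osd.* injectargs '--osd-recovery-max-active 1 --osd-max-backfills 1 --osd-recovery-threads 1 --osd-recovery-op-priority 1'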

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Robert 
LeBlanc
Sent: Thursday, September 10, 2015 1:56 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Hammer reduce recovery impact

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA256

We are trying to add some additional OSDs to our cluster, but the impact of the 
backfilling has been very disruptive to client I/O and we have been trying to 
figure out how to reduce the impact. We have seen some client I/O blocked for 
more than 60 seconds. There has been CPU and RAM head room on the OSD nodes, 
network has been fine, disks have been busy, but not terrible.

11 OSD servers: 10 4TB disks with two Intel S3500 SSDs for journals (10GB), 
dual 40Gb Ethernet, 64 GB RAM, single CPU E5-2640 Quanta S51G-1UL.

Clients are QEMU VMs.

[ulhglive-root@ceph5 current]# ceph --version ceph version 0.94.2 
(5fb85614ca8f354284c713a2f9c610860720bbf3)

Some nodes are 0.94.3

[ulhglive-root@ceph5 current]# ceph status
cluster 48de182b-5488-42bb-a6d2-62e8e47b435c
 health HEALTH_WARN
3 pgs backfill
1 pgs backfilling
4 pgs stuck unclean
recovery 2382/33044847 objects degraded (0.007%)
recovery 50872/33044847 objects misplaced (0.154%)
noscrub,nodeep-scrub flag(s) set
 monmap e2: 3 mons at
{mon1=10.217.72.27:6789/0,mon2=10.217.72.28:6789/0,mon3=10.217.72.29:6789/0}
election epoch 180, quorum 0,1,2 mon1,mon2,mon3
 osdmap e54560: 125 osds: 124 up, 124 in; 4 remapped pgs
flags noscrub,nodeep-scrub
  pgmap v10274197: 2304 pgs, 3 pools, 32903 GB data, 8059 kobjects
128 TB used, 322 TB / 450 TB avail
2382/33044847 objects degraded (0.007%)
50872/33044847 objects misplaced (0.154%)
2300 active+clean
   3 active+remapped+wait_backfill
   1 active+remapped+backfilling recovery io 70401 kB/s, 16 
objects/s
  client io 93080 kB/s rd, 46812 kB/s wr, 4927 op/s

Each pool is size 4 with min_size 2.

One problem we have is that the requirements of the cluster changed after 
setting up our pools, so our PGs are really out of wack. Our most active pool 
has only 256 PGs and each PG is about 120 GB is size.
We are trying to clear out a pool that has way too many PGs so that we can 
split the PGs in that pool. I think these large PGs is part of our issues.

Things I've tried:

* Lowered nr_requests on the spindles from 1000 to 100. This reduced the max 
latency sometimes up to 3000 ms down to a max of 500-700 ms.
it has also reduced the huge swings in  latency, but has also reduced 
throughput somewhat.
* Changed the scheduler from deadline to CFQ. I'm not sure if the OSD 
process gives the recovery threads a different disk priority or if changing the 
scheduler without restarting the OSD allows the OSD to use disk priorities.
* Reduced the number of osd_max_backfills from 2 to 1.
* Tried setting noin to give the new OSDs time to get the PG map and peer 
before starting the backfill. This caused more problems than solved as we had 
blocked I/O (over 200 seconds) until we set the new OSDs to in.

Even adding one OSD disk into the cluster is causing these slow I/O messages. 
We still have 5 more disks to add from this server and four more servers to add.

In addition to trying to minimize these impacts, would it be better to split 
the PGs then add the rest of the servers, or add the servers then do the PG 
split. I'm thinking splitting first would be better, but I'd like to get other 
opinions.

No spindle stays at high utilization for long and the await drops below 20 ms 
usually within 10 seconds so I/O should be serviced "pretty quick". My next 
guess is that the journals are getting full and blocking while waiting for 
flushes, but I'm not exactly sure how to identify that. We are using the 
defaults for the journal except for size (10G). We'd like to have journals 
large to handle bursts, but if they are getting filled with backfill traffic, 
it may be counter productive. Can/does backfill/recovery bypass the journal?

Thanks,

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1 -BEGIN 
PGP SIGNATURE-
Version: Mailvelope v1.0.2
Comment: https://www.mailvelope.com

wsFcBAEBCAAQBQJV8e5qCRDmVDuy+mK58QAAaIwQAMN5DJlhrZkqwqsVXaKB
nnegQjG6Y02ObLRrg96ghHr+AGgY/HRm3iShng6E1N9CL+XjcHSLeb1JqH9n
2SgGQGoRAU1dY6DIlOs5K8Fwd2bBECh863VymYbO+OLgtXbpp2mWfZZVAkTf
V9ryaEh7tZOY1Mhx7mSIyr9Ur7IxTUOjzExAFPGfTLP1cbjE/FXoQMHh10fe
zSzk/qK0AvajFD0PR04uRyEsGYeCLl68kGQi1R7IQlxZWc7hMhWXKNIFlbKB
lk5+8OGx/LawW7qxpFm8a1SNoiAwMtrPKepvHYGi8u3rfXJa6ZE38jGuoqRs
8jD+b+gS0yxKbahT6S/gAEbgzAH0JF4YSz+nHNrvS6eSebykE9/7HGe9W7WA

Re: [ceph-users] maximum object size

2015-09-08 Thread Somnath Roy
I think the limit is 90 MB from the OSD side, isn't it ?
If so, how are you able to write objects up to 1.99 GB ?
Am I missing anything ?
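For what it's worth, the knob in question can be inspected on a running OSD and 
overridden in ceph.conf, roughly like this (osd.0 and the 2048 value are only 
illustrations, not recommendations):

  # current value, in MB (the default is 90)
  ceph daemon osd.0 config get osd_max_write_size

  # ceph.conf override
  [osd]
      osd max write size = 2048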

Thanks & Regards
Somnath

-Original Message-
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
HEWLETT, Paul (Paul)
Sent: Tuesday, September 08, 2015 8:55 AM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] maximum object size

Hi All

We have recently encountered a problem on Hammer (0.94.2) whereby we cannot 
write objects > 2GB in size to the rados backend.
(NB not RadosGW, CephFS or RBD)

I found the following issue
https://wiki.ceph.com/Planning/Blueprints/Firefly/Object_striping_in_librad
os which seems to address this but no progress reported.

What are the implications of writing such large objects to RADOS? What impact 
is expected on the XFS backend particularly regarding the size and location of 
the journal?

Any prospect of progressing the issue reported in the enclosed link?

Interestingly I could not find anywhere in the ceph documentation that 
describes the 2GB limitation. The implication of most of the website docs is 
that there is no limit on objects stored in Ceph. The only hint is that 
osd_max_write_size is a 32 bit signed integer.

If we use erasure coding will this reduce the impact? I.e. 4+1 EC will only 
write 500MB to each OSD and then this value will be tested against the chunk 
size instead of the total file size?

The relevant code in Ceph is:

src/FileJournal.cc:

  needed_space = ((int64_t)g_conf->osd_max_write_size) << 20;
  needed_space += (2 * sizeof(entry_header_t)) + get_top();
  if (header.max_size - header.start < needed_space) {
derr << "FileJournal::create: OSD journal is not large enough to hold "
<< "osd_max_write_size bytes!" << dendl;
ret = -ENOSPC;
goto free_buf;
  }

src/osd/OSD.cc:

// too big?
if (cct->_conf->osd_max_write_size &&
m->get_data_len() > cct->_conf->osd_max_write_size << 20) {
// journal can't hold commit!
 derr << "handle_op msg data len " << m->get_data_len()
 << " > osd_max_write_size " << (cct->_conf->osd_max_write_size << 20)
 << " on " << *m << dendl;
service.reply_op_error(op, -OSD_WRITETOOBIG);
return;
  }

Interestingly, the code in OSD.cc looks like a bug - the max_write value should 
be cast to an int64_t before shifting left by 20 bits (which is done correctly 
in FileJournal.cc). Otherwise overflow may occur and negative values may be 
generated.
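If that reading is right, the fix would be a small change along these lines (a 
sketch only, not a tested patch):

// sketch: widen before shifting, as FileJournal.cc already does
if (cct->_conf->osd_max_write_size &&
    m->get_data_len() > ((int64_t)cct->_conf->osd_max_write_size) << 20) {
  // journal can't hold commit!
  derr << "handle_op msg data len " << m->get_data_len()
   << " > osd_max_write_size " << (((int64_t)cct->_conf->osd_max_write_size) << 20)
   << " on " << *m << dendl;
  service.reply_op_error(op, -OSD_WRITETOOBIG);
  return;
}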


Any comments welcome - any help appreciated.

Regards
Paul


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Extra RAM use as Read Cache

2015-09-07 Thread Somnath Roy
Vickey,
OSDs sit on top of a filesystem, and that unused memory will automatically be
used by the filesystem as page cache.
But the read performance improvement depends on the application's read pattern
and the size of the working set.
Sequential patterns will benefit most (you may need to tweak read_ahead_kb to
bigger values). Random workloads will also benefit if the working set is not
too big. For example, a 1 TB LUN with, say, 200GB of aggregate OSD page cache
will benefit more than a 100TB LUN with a similar amount of page cache
(assuming a truly random pattern).
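
For the sequential case, the usual knob is the block-layer readahead. A
minimal C++ sketch of setting it via sysfs (the device name and 4096 KB value
are only examples; running 'echo 4096 > /sys/block/<dev>/queue/read_ahead_kb'
as root does the same thing):

#include <fstream>
#include <iostream>
#include <string>

int main() {
  const std::string dev = "sdb";          // example OSD data disk, adjust to your layout
  const int read_ahead_kb = 4096;         // 4 MB readahead, tune for your workload
  // Writing the sysfs node requires root; the change does not persist across reboots.
  std::ofstream f("/sys/block/" + dev + "/queue/read_ahead_kb");
  if (!f) {
    std::cerr << "cannot open sysfs node for " << dev << " (need root?)" << std::endl;
    return 1;
  }
  f << read_ahead_kb << std::endl;
  return f.good() ? 0 : 1;
}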

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Vickey 
Singh
Sent: Monday, September 07, 2015 2:19 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Extra RAM use as Read Cache

Hello Experts,

I want to increase my Ceph cluster's read performance.

I have several OSD nodes with 196G of RAM. On my OSD nodes Ceph uses just 15-20
GB of RAM.

So, can I instruct Ceph to make use of the remaining 150GB+ of RAM as a read
cache, so that it caches data in RAM and serves it to clients very quickly?

I hope that if this can be done, I can get a good read performance boost.

By the way, we have a Lustre cluster that uses extra RAM as a read cache, and
we can get up to 2.5GB/s read performance. I am looking to do the same with
Ceph.

- Vickey -









Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Thanks!
I think you should try installing from Ceph mainline; some bug fixes went in
after Hammer (I am not sure whether they were backported).
I would say try with 1 drive -> 1 OSD first, since we have currently seen some
stability issues (mainly due to resource constraints) with more OSDs in a box.
Another point: the installation itself is not straightforward. You will
probably need to build all the components yourself; I am not sure whether it is
added as a git submodule or not. Vu, could you please confirm?

Since we are working to make this solution work at scale, could you please give
us some idea of the scale you are looking at for future deployment?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:19 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Hi Roy,
   I understand; we are looking at using Accelio with a small starting cluster
of 3 MON and 8 OSD servers:
3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
We are thinking of using the Infernalis v9.0.0 or Hammer release. Comments?
Recommendations?

German

2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Hi German,
We are working to make it production ready ASAP. As you know, RDMA is very
resource constrained, but at the same time it will outperform TCP. There will
be a definite tradeoff between cost and performance.
We do not have a good idea of how big RDMA deployments might be, so it would be
really helpful if you could give us some idea of how you are planning to deploy
it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication, etc.).

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of German Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response, Robert. Any idea when it's going to be
ready for production? Any alternative solution with similar performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

Accelio and Ceph are still in heavy development and not ready for production.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,

 I would like to know the production-readiness status of Accelio & Ceph. Does
anyone have a home-made procedure implemented on Ubuntu?

Recommendations, comments?

Thanks in advance,

Best regards,

German

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Thanks!
6 OSD daemons per server should be good.

Vu,
Could you please send out the doc you are maintaining?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:36 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks Roy, we're planning to grow this cluster if we can get the performance
that we need. The idea is to run non-relational databases here, so it will be
very IO-intensive. In terms of growth, we are talking about 40-50 OSD servers
with no more than 6 OSD daemons per server. If you have any hints or docs out
there on how to compile Ceph with Accelio, that would be awesome.

German

2015-09-01 15:31 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Thanks!
I think you should try installing from Ceph mainline; some bug fixes went in
after Hammer (I am not sure whether they were backported).
I would say try with 1 drive -> 1 OSD first, since we have currently seen some
stability issues (mainly due to resource constraints) with more OSDs in a box.
Another point: the installation itself is not straightforward. You will
probably need to build all the components yourself; I am not sure whether it is
added as a git submodule or not. Vu, could you please confirm?

Since we are working to make this solution work at scale, could you please give
us some idea of the scale you are looking at for future deployment?

Regards
Somnath

From: German Anders [mailto:gand...@despegar.com]
Sent: Tuesday, September 01, 2015 11:19 AM
To: Somnath Roy
Cc: Robert LeBlanc; ceph-users

Subject: Re: [ceph-users] Accelio & Ceph

Hi Roy,
   I understand; we are looking at using Accelio with a small starting cluster
of 3 MON and 8 OSD servers:
3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
We are thinking of using the Infernalis v9.0.0 or Hammer release. Comments?
Recommendations?

German

2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
Hi German,
We are working to make it production ready ASAP. As you know, RDMA is very
resource constrained, but at the same time it will outperform TCP. There will
be a definite tradeoff between cost and performance.
We do not have a good idea of how big RDMA deployments might be, so it would be
really helpful if you could give us some idea of how you are planning to deploy
it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication, etc.).

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com]
 On Behalf Of German Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response, Robert. Any idea when it's going to be
ready for production? Any alternative solution with similar performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

Accelio and Ceph are still in heavy development and not ready for production.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,

 I would like to know the production-readiness status of Accelio & Ceph. Does
anyone have a home-made procedure implemented on Ubuntu?

Recommendations, comments?

Thanks in advance,

Best regards,

German




Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread Somnath Roy
Hi German,
We are working to make it production ready ASAP. As you know, RDMA is very
resource constrained, but at the same time it will outperform TCP. There will
be a definite tradeoff between cost and performance.
We do not have a good idea of how big RDMA deployments might be, so it would be
really helpful if you could give us some idea of how you are planning to deploy
it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication, etc.).

Thanks & Regards
Somnath

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of German 
Anders
Sent: Tuesday, September 01, 2015 10:39 AM
To: Robert LeBlanc
Cc: ceph-users
Subject: Re: [ceph-users] Accelio & Ceph

Thanks a lot for the quick response, Robert. Any idea when it's going to be
ready for production? Any alternative solution with similar performance?
Best regards,

German

2015-09-01 13:42 GMT-03:00 Robert LeBlanc:

Accelio and Ceph are still in heavy development and not ready for production.

- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:

Hi cephers,

 I would like to know the production-readiness status of Accelio & Ceph. Does
anyone have a home-made procedure implemented on Ubuntu?

Recommendations, comments?

Thanks in advance,

Best regards,

German




