Re: [ceph-users] osd failing to start

2016-07-13 Thread Brad Hubbard
On Thu, Jul 14, 2016 at 06:06:58AM +0200, Martin Wilderoth wrote:
>  Hello,
> 
> I have a ceph cluster where one OSD is failing to start. I have been
> upgrading ceph to see if the error disappeared. Now I'm running jewel but I
> still get the error message.
> 
> -1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map is_healthy
> 'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150

This appears to indicate that an OSD thread pool thread (work queue thread)
has failed to complete an operation within the 150 second grace period.

The most likely and common cause for this is hardware failure, so I would
suggest you thoroughly check this device and look for indicators in
syslog, dmesg, diagnostics, etc. that this device may have failed.
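
For example (a rough sketch; substitute the data disk backing this OSD for
/dev/sdX):

# dmesg -T | grep -iE 'error|fail'
# smartctl -a /dev/sdX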

-- 
HTH,
Brad


[ceph-users] osd failing to start

2016-07-13 Thread Martin Wilderoth
 Hello,

I have a ceph cluster where one OSD is failing to start. I have been
upgrading ceph to see if the error disappeared. Now I'm running jewel but I
still get the error message.


   -31> 2016-07-13 17:03:30.474321 7fda18a8b700  2 -- 10.0.6.21:6800/1876
>> 10.0.5.71:6789/0 pipe(0x7fdb5712a800 sd=111 :36196 s=2 pgs=486 cs=1 l=1
c=0x7fdaaf060400).reader got KEEPALIVE_ACK
   -30> 2016-07-13 17:03:32.054328 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -29> 2016-07-13 17:03:32.054353 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -28> 2016-07-13 17:03:37.054430 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -27> 2016-07-13 17:03:37.054456 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -26> 2016-07-13 17:03:42.054535 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -25> 2016-07-13 17:03:42.054553 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -24> 2016-07-13 17:03:47.054633 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -23> 2016-07-13 17:03:47.054658 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -22> 2016-07-13 17:03:52.054735 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -21> 2016-07-13 17:03:52.054752 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -20> 2016-07-13 17:03:57.054829 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
   -19> 2016-07-13 17:03:57.054847 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
   -18> 2016-07-13 17:04:00.473446 7fda275db700 10 monclient(hunting): tick
   -17> 2016-07-13 17:04:00.473485 7fda275db700  1 monclient(hunting):
continuing hunt
   -16> 2016-07-13 17:04:00.473488 7fda275db700 10 monclient(hunting):
_reopen_session rank -1 name
   -15> 2016-07-13 17:04:00.473498 7fda275db700  1 -- 10.0.6.21:6800/1876
mark_down 0x7fdaaf060400 -- 0x7fdb5712a800
   -14> 2016-07-13 17:04:00.473678 7fda275db700 10 monclient(hunting):
picked mon.c con 0x7fdaaf060580 addr 10.0.5.73:6789/0
   -13> 2016-07-13 17:04:00.473698 7fda275db700 10 monclient(hunting):
_send_mon_message to mon.c at 10.0.5.73:6789/0
   -12> 2016-07-13 17:04:00.473705 7fda275db700  1 -- 10.0.6.21:6800/1876
--> 10.0.5.73:6789/0 -- auth(proto 0 27 bytes epoch 17) v1 -- ?+0
0x7fdad949 con 0x7fdaaf060580
   -11> 2016-07-13 17:04:00.473720 7fda275db700 10 monclient(hunting):
renew_subs
   -10> 2016-07-13 17:04:02.054922 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
-9> 2016-07-13 17:04:02.054938 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
-8> 2016-07-13 17:04:07.055017 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
-7> 2016-07-13 17:04:07.055035 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
-6> 2016-07-13 17:04:12.055114 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
-5> 2016-07-13 17:04:12.055144 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
-4> 2016-07-13 17:04:17.055223 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
-3> 2016-07-13 17:04:17.055243 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda265d9700' had timed out after 15
-2> 2016-07-13 17:04:22.055321 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had timed out after 15
-1> 2016-07-13 17:04:22.061384 7fda4d24e700  1 heartbeat_map is_healthy
'OSD::osd_tp thread 0x7fda25dd8700' had suicide timed out after 150
 0> 2016-07-13 17:04:24.244698 7fda4d24e700 -1 common/HeartbeatMap.cc:
In function 'bool ceph::HeartbeatMap::_check(const
ceph::heartbeat_handle_d*, const char*, time_t)' thread 7fda4d24e700 time
2016-07-13 17:04:22.078324
common/HeartbeatMap.cc: 86: FAILED assert(0 == "hit suicide timeout")

 ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x82) [0x7fda53cbd5d2]
 2: (ceph::HeartbeatMap::_check(ceph::heartbeat_handle_d const*, char
const*, long)+0x11f) [0x7fda53bf30bf]
 3: (ceph::HeartbeatMap::is_healthy()+0xd6) [0x7fda53bf3ae6]
 4: (ceph::HeartbeatMap::check_touch_file()+0x2a) [0x7fda53bf42fa]
 5: (CephContextServiceThread::entry()+0x16c) [0x7fda53cd767c]
 

Re: [ceph-users] cephfs change metadata pool?

2016-07-13 Thread Christian Balzer

Hello,

On Wed, 13 Jul 2016 22:47:05 -0500 Di Zhang wrote:

> Hi,
>   I changed to only use the infiniband network. For the 4KB write, the 
> IOPS doesn’t improve much. 

That's mostly going to be bound by latencies (as I just wrote in the other
thread), both network and internal Ceph ones.

The cluster I described in the other thread has 32 OSDs and does about
1050 "IOPS" with "rados -p rbd bench 30 write -t 32 -b 4096".
So about half with your 15 OSDs isn't all that unexpected.

Once again, to get something more realistic use fio.
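
For example, something along these lines on the mounted CephFS (just a sketch,
adjust the mount point, size and runtime to your setup):

# fio --name=4krandwrite --directory=/mnt/cephfs/fiotest --size=1G --ioengine=libaio --direct=1 --rw=randwrite --bs=4k --iodepth=32 --runtime=60 --time_based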

>I also logged into the OSD nodes and atop showed the disks are not always
at 100% busy. Please check a snapshot of one node below:

When you do the 4KB bench (for 60 seconds or so), also watch the CPU
usage, rados bench is a killer there.
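
Something like this, with atop left running on the OSD nodes at the same time
(same parameters as the bench above, just a 60 second runtime):

# rados -p rbd bench 60 write -t 32 -b 4096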

Christian

> 
> DSK |  sdc  | busy 72% |  read20/s |  write   86/s | KiB/w
>  13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |
> DSK |  sda  | busy 47% |  read 0/s |  write  589/s | KiB/w
>   4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |
> DSK |  sdb  | busy 31% |  read14/s |  write   77/s | KiB/w
>  10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |
> DSK |  sdd  | busy 19% |  read 4/s |  write   50/s | KiB/w
>  11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |
> NET | transport | tcpi   656/s |  tcpo   655/s |  udpi 0/s | udpo 
> 0/s  | tcpao0/s |  tcppo0/s |  tcprs0/s |
> NET | network   | ipi657/s |  ipo655/s |  ipfrw0/s | deliv  
> 657/s  |  |  icmpi0/s |  icmpo0/s |
> NET | p10p1 0%  | pcki 0/s |  pcko 0/s |  si0 Kbps | so1 
> Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |
> NET | ib0   | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 
> Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |
> NET | lo    | pcki19/s |  pcko19/s |  si   14 Kbps | so   14 
> Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |
>   
>   /dev/sda is the OS and journaling SSD. The other three are OSDs.
> 
>   Am I missing anything?
> 
>   Thanks,
> 
>   
> 
>   
> Zhang, Di
> Postdoctoral Associate
> Baylor College of Medicine
> 
> > On Jul 13, 2016, at 6:56 PM, Christian Balzer  wrote:
> > 
> > 
> > Hello,
> > 
> > On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> > 
> >> I also tried 4K write bench. The IOPS is ~420. 
> > 
> > That's what people usually mean (4KB blocks) when talking about IOPS.
> > This number is pretty low, my guess would be network latency on your 1Gbs
> > network for the most part.
> > 
> > You should run atop on your storage nodes while running a test like this
> > and see if the OSDs (HDDs) are also very busy.
> > 
> > Lastly the rados bench gives you some basic numbers but it is not the same
> > as real client I/O, for that you want to run fio inside a VM or in your
> > case on a mounted CephFS.
> > 
> >> I used to have better
> >> bandwidth when I use the same network for both the cluster and clients. Now
> >> the bandwidth must be limited by the 1G ethernet. 
> > That's the bandwidth you also see in your 4MB block tests below.
> > For small I/Os the real killer is latency, though.
> > 
> >> What would you suggest to
> >> me to do?
> >> 
> > That depends on your budget mostly (switch ports, client NICs).
> > 
> > A uniform, single 10Gb/s network would be better in all aspects than the
> > split network you have now.
> > 
> > Christian
> > 
> >> Thanks,
> >> 
> >> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang  wrote:
> >> 
> >>> Hello,
> >>> Sorry for the misunderstanding about IOPS. Here are some summary stats
> >>> of my benchmark (Does 20 - 30 IOPS seem normal to you?):
> >>> 
> >>> ceph osd pool create test 512 512
> >>> 
> >>> rados bench -p test 10 write --no-cleanup
> >>> 
> >>> Total time run: 10.480383
> >>> Total writes made:  288
> >>> Write size: 4194304
> >>> Object size:4194304
> >>> Bandwidth (MB/sec): 109.92
> >>> Stddev Bandwidth:   11.9926
> >>> Max bandwidth (MB/sec): 124
> >>> Min bandwidth (MB/sec): 80
> >>> Average IOPS:   27
> >>> Stddev IOPS:3
> >>> Max IOPS:   31
> >>> Min IOPS:   20
> >>> Average Latency(s): 0.579105
> >>> Stddev Latency(s):  0.19902
> >>> Max latency(s): 1.32831
> >>> Min latency(s): 0.245505
> >>> 
> >>> rados bench -p bench -p test 10 seq
> >>> Total time run:   10.340724
> >>> Total reads made: 288
> >>> Read size:4194304
> >>> Object size:  4194304
> >>> Bandwidth (MB/sec):   111.404
> >>> Average IOPS  27
> >>> Stddev IOPS:  2
> >>> Max IOPS: 31
> >>> Min IOPS: 22
> >>> Average Latency(s):   0.564858
> >>> Max latency(s):   1.65278
> >>> Min latency(s):   0.141504
> >>> 
> >>> rados bench -p bench -p test 10 rand
> 

Re: [ceph-users] Question on Sequential Write performance at 4K blocksize

2016-07-13 Thread Christian Balzer

Hello,

On Wed, 13 Jul 2016 18:15:10 + EP Komarla wrote:

> Hi All,
> 
> Have a question on the performance of sequential write @ 4K block sizes.
> 
Which version of Ceph?
Any significant ceph.conf modifications?

> Here is my configuration:
> 
> Ceph Cluster: 6 Nodes. Each node with :-
> 20x HDDs (OSDs) - 10K RPM 1.2 TB SAS disks
> SSDs - 4x - Intel S3710, 400GB; for OSD journals shared across 20 HDDs (i.e., 
> SSD journal ratio 1:5)
> 
> Network:
> - Client network - 10Gbps
> - Cluster network - 10Gbps
> - Each node with dual NIC - Intel 82599 ES - driver version 4.0.1
> 
> Traffic generators:
> 2 client servers - running on dual Intel sockets with 16 physical cores (32 
> cores with hyper-threading enabled)
> 
Are you mounting a RBD image on those servers via the kernel interface and
if so which kernel version?
Are you running the test inside a VM on those servers, or are you using
the RBD ioengine with fio?

> Test program:
> FIO - sequential read/write; random read/write

Exact fio command line please.


> Blocksizes - 4k, 32k, 256k...
> FIO - Number of jobs = 32; IO depth = 64
> Runtime = 10 minutes; Ramptime = 5 minutes
> Filesize = 4096g (5TB)
> 
> I observe that my sequential write performance at 4K block size is very low - 
> I am getting around 6MB/sec bandwidth.  The performance improves 
> significantly at larger block sizes (shown below)
> 
This is to some extent expected and normal.
You can see this behavior on local storage as well, just not as
pronounced.

Your main enemy here is latency: each write potentially needs to be sent
to the storage servers (replication!) and then ACK'ed back to the client.

If your fio command line has sync writes (aka direct=1) things will be the
worst.

Small IOPs also stress your CPU's, look at atop on your storage nodes
during a 4KB fio run. 
That might also show other issues (as in overloaded HDDs/SSDs).

RBD caching (is it enabled on your clients?) can help with non-direct
writes.
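
If it isn't, the client-side bit of ceph.conf would look roughly like this
(a sketch only; rbd cache defaults to on in recent releases):

[client]
rbd cache = true
rbd cache writethrough until flush = true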

That all being said, if I run this fio inside a VM (with RBD caching
enabled) against a cluster here with 4 nodes connected by QDR (40Gb/s)
Infiniband, 4x100GB DC S3700 and 8x plain SATA HDDs, I get:
---
# fio --size=4G --ioengine=libaio --invalidate=1 --direct=1 --numjobs=1 
--rw=write --name=fiojob --blocksize=4K --iodepth=32 

  write: io=4096.0MB, bw=134274KB/s, iops=33568 , runt= 31237msec

Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=134273KB/s, minb=134273KB/s, maxb=134273KB/s, 
mint=31237msec, maxt=31237msec
---

And with buffered I/O (direct=0) I get:
---
  write: io=4096.0MB, bw=359194KB/s, iops=89798 , runt= 11677msec


Run status group 0 (all jobs):
  WRITE: io=4096.0MB, aggrb=359193KB/s, minb=359193KB/s, maxb=359193KB/s, 
mint=11677msec, maxt=11677msec
---

Increasing numjobs of course reduces the performance per job, so numjob=2
will give half the speed per individual job.

So something is fishy with your setup, unless the 5.6MB/s below there are
the results PER JOB, which would make it 180MB/s with 32 jobs or even
360MB/s with 64 jobs and a pretty decent and expected result.

Christian

> FIO - Sequential Write test
> 
> Block Size    Sequential Write Bandwidth (KB/sec)
> 4K            5694
> 32K           141020
> 256K          747421
> 1024K         602236
> 4096K         683029
> 
> 
> Here are my questions:
> - Why is the sequential write performance at 4K block size so low? Is this 
> in-line what others see?
> - Is it because of less number of clients, i.e., traffic generators? I am 
> planning to increase the number of clients to 4 servers.
> - There is a later version on NIC driver from Intel, v4.3.15 - do you think 
> upgrading to later version (v4.3.15) will improve performance?
> 
> Any thoughts or pointers will be helpful.
> 
> Thanks,
> 
> - epk
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] cephfs change metadata pool?

2016-07-13 Thread Di Zhang
Hi,
I changed to only use the infiniband network. For the 4KB write, the 
IOPS doesn’t improve much. I also logged into the OSD nodes and atop showed the 
disks are not always at 100% busy. Please check a snapshot of one node below:

DSK |  sdc  | busy 72% |  read20/s |  write   86/s | KiB/w 
13  | MBr/s   0.16 |  MBw/s   1.12 |  avio 6.69 ms |
DSK |  sda  | busy 47% |  read 0/s |  write  589/s | KiB/w  
4  | MBr/s   0.00 |  MBw/s   2.83 |  avio 0.79 ms |
DSK |  sdb  | busy 31% |  read14/s |  write   77/s | KiB/w 
10  | MBr/s   0.11 |  MBw/s   0.76 |  avio 3.42 ms |
DSK |  sdd  | busy 19% |  read 4/s |  write   50/s | KiB/w 
11  | MBr/s   0.03 |  MBw/s   0.55 |  avio 3.40 ms |
NET | transport | tcpi   656/s |  tcpo   655/s |  udpi 0/s | udpo 
0/s  | tcpao0/s |  tcppo0/s |  tcprs0/s |
NET | network   | ipi657/s |  ipo655/s |  ipfrw0/s | deliv  
657/s  |  |  icmpi0/s |  icmpo0/s |
NET | p10p1 0%  | pcki 0/s |  pcko 0/s |  si0 Kbps | so1 
Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |
NET | ib0   | pcki   637/s |  pcko   636/s |  si 8006 Kbps | so 5213 
Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |
NET | lo    | pcki19/s |  pcko19/s |  si   14 Kbps | so   14 
Kbps  | erri 0/s |  erro 0/s |  drpo 0/s |

/dev/sda is the OS and journaling SSD. The other three are OSDs.

Am I missing anything?

Thanks,




Zhang, Di
Postdoctoral Associate
Baylor College of Medicine

> On Jul 13, 2016, at 6:56 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Wed, 13 Jul 2016 12:01:14 -0500 Di Zhang wrote:
> 
>> I also tried 4K write bench. The IOPS is ~420. 
> 
> That's what people usually mean (4KB blocks) when talking about IOPS.
> This number is pretty low, my guess would be network latency on your 1Gbs
> network for the most part.
> 
> You should run atop on your storage nodes while running a test like this
> and see if the OSDs (HDDs) are also very busy.
> 
> Lastly the rados bench gives you some basic numbers but it is not the same
> as real client I/O, for that you want to run fio inside a VM or in your
> case on a mounted CephFS.
> 
>> I used to have better
>> bandwidth when I use the same network for both the cluster and clients. Now
>> the bandwidth must be limited by the 1G ethernet. 
> That's the bandwidth you also see in your 4MB block tests below.
> For small I/Os the real killer is latency, though.
> 
>> What would you suggest to
>> me to do?
>> 
> That depends on your budget mostly (switch ports, client NICs).
> 
> A uniform, single 10Gb/s network would be better in all aspects than the
> split network you have now.
> 
> Christian
> 
>> Thanks,
>> 
>> On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang  wrote:
>> 
>>> Hello,
>>> Sorry for the misunderstanding about IOPS. Here are some summary stats
>>> of my benchmark (Does 20 - 30 IOPS seem normal to you?):
>>> 
>>> ceph osd pool create test 512 512
>>> 
>>> rados bench -p test 10 write --no-cleanup
>>> 
>>> Total time run: 10.480383
>>> Total writes made:  288
>>> Write size: 4194304
>>> Object size:4194304
>>> Bandwidth (MB/sec): 109.92
>>> Stddev Bandwidth:   11.9926
>>> Max bandwidth (MB/sec): 124
>>> Min bandwidth (MB/sec): 80
>>> Average IOPS:   27
>>> Stddev IOPS:3
>>> Max IOPS:   31
>>> Min IOPS:   20
>>> Average Latency(s): 0.579105
>>> Stddev Latency(s):  0.19902
>>> Max latency(s): 1.32831
>>> Min latency(s): 0.245505
>>> 
>>> rados bench -p bench -p test 10 seq
>>> Total time run:   10.340724
>>> Total reads made: 288
>>> Read size:4194304
>>> Object size:  4194304
>>> Bandwidth (MB/sec):   111.404
>>> Average IOPS  27
>>> Stddev IOPS:  2
>>> Max IOPS: 31
>>> Min IOPS: 22
>>> Average Latency(s):   0.564858
>>> Max latency(s):   1.65278
>>> Min latency(s):   0.141504
>>> 
>>> rados bench -p bench -p test 10 rand
>>> Total time run:   10.546251
>>> Total reads made: 293
>>> Read size:4194304
>>> Object size:  4194304
>>> Bandwidth (MB/sec):   111.13
>>> Average IOPS: 27
>>> Stddev IOPS:  2
>>> Max IOPS: 32
>>> Min IOPS: 24
>>> Average Latency(s):   0.57092
>>> Max latency(s):   1.8631
>>> Min latency(s):   0.161936
>>> 
>>> 
>>> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer  wrote:
>>> 
 
 Hello,
 
 On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
 
> I am using 10G infiniband for cluster network and 1G ethernet for
 public.
 Hmm, very unbalanced, but I guess that's HW you already had.
 
> Because I don't have enough slots on the node, 

Re: [ceph-users] SSD Journal

2016-07-13 Thread Christian Balzer

Hello,

On Wed, 13 Jul 2016 09:34:35 + Ashley Merrick wrote:

> Hello,
> 
> Looking at using 2 x 960GB SSD's (SM863)
>
Massive overkill.
 
> Reason for larger is I was thinking would be better off with them in Raid 1 
> so enough space for OS and all Journals.
>
As I pointed out several times in this ML, Ceph journal usage rarely
exceeds hundreds of MB, let alone several GB with default parameters.
So 10GB per journal is plenty, unless you're doing something very special
(and you aren't with normal HDDs as OSDs).
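
For reference, the journal size is set in ceph.conf in MB, so a 10GB journal
would be along the lines of (sketch only):

[osd]
osd journal size = 10240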
 
> Instead am I better off using 2 x 200GB S3700's instead, with 5 disks per a 
> SSD?
>
S3700s are unfortunately EOL'ed, the 200GB ones were great at 375MB/s.
200GB S3710s are about on par for 5 HDDs at 300MB/s, but if you can afford
it and have a 10Gb/s network, the 400GB ones at 470MB/s would be optimal.

As for sharing the SSDs with OS, I do that all the time, the minute
logging of a storage node really has next to no impact.

I prefer this over using DoMs for reasons of:
1. Redundancy
2. hot-swapability  

If you go the DoM route, make sure its size AND endurance are a match for
what you need. 
This is especially important if you were to run a MON on those machines as
well.
 
Christian

> Thanks,
> Ashley
> 
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com] 
> Sent: 13 July 2016 01:12
> To: ceph-users@lists.ceph.com
> Cc: Wido den Hollander ; Ashley Merrick 
> Subject: Re: [ceph-users] SSD Journal
> 
> 
> Hello,
> 
> On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
> 
> > 
> > > Op 12 juli 2016 om 15:31 schreef Ashley Merrick :
> > > 
> > > 
> > > Hello,
> > > 
> > > Looking at final stages of planning / setup for a CEPH Cluster.
> > > 
> > > Per a Storage node looking @
> > > 
> > > 2 x SSD OS / Journal
> > > 10 x SATA Disk
> > > 
> > > Will have a small Raid 1 Partition for the OS, however not sure if best 
> > > to do:
> > > 
> > > 5 x Journal Per a SSD
> > 
> > Best solution. Will give you the most performance for the OSDs. RAID-1 will 
> > just burn through cycles on the SSDs.
> > 
> > SSDs don't fail that often.
> >
> What Wido wrote, but let us know what SSDs you're planning to use.
> 
> Because the detailed version of that sentence should read: 
> "Well known and tested DC level SSDs whose size/endurance levels are matched 
> to the workload rarely fail, especially unexpected."
>  
> > Wido
> > 
> > > 10 x Journal on Raid 1 of two SSD's
> > > 
> > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
> > > worth the "issue" caused when one SSD goes down?
> > > 
> As always, assume at least a node being the failure domain you need to be 
> able to handle.
> 
> Christian
> 
> > > Thanks,
> > > Ashley
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Mark Nelson
As Somnath mentioned, you've got a lot of tunables set there.  Are you 
sure those are all doing what you think they are doing?


FWIW, the xfs -n size=64k option is probably not a good idea. 
Unfortunately it can't be changed without making a new filesystem.


See:

http://lists.ceph.com/pipermail/ceph-users-ceph.com/2016-February/007645.html

Typically that seems to manifest as suicide timeouts on the OSDs though. 
 You'd also see kernel log messages that look like:


kernel: XFS: possible memory allocation deadlock in kmem_alloc (mode:0x8250)
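
If you want to double check what an existing OSD filesystem was actually
created with, xfs_info on the mounted OSD path shows it, e.g. (path is just an
example):

# xfs_info /var/lib/ceph/osd/ceph-0

The "naming" line in the output reflects the -n size used at mkfs time
(bsize=65536 for -n size=64k).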

Mark

On 07/13/2016 08:39 PM, Garg, Pankaj wrote:

I agree, but I’m dealing with something else out here with this setup.

I just ran a test, and within 3 seconds my IOPS went to 0, and stayed
there for 90 seconds….then started and within seconds again went to 0.

This doesn’t seem normal at all. Here is my ceph.conf:



[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64





*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* Wednesday, July 13, 2016 6:06 PM
*To:* Garg, Pankaj; ceph-users@lists.ceph.com
*Subject:* RE: Terrible RBD performance with Jewel



You should do that first to get a stable performance out with filestore.

1M seq write for the entire image should be sufficient to precondition it.



*From:*Garg, Pankaj [mailto:pankaj.g...@cavium.com]
*Sent:* Wednesday, July 13, 2016 6:04 PM
*To:* Somnath Roy; ceph-users@lists.ceph.com

*Subject:* RE: Terrible RBD performance with Jewel



No I have not.



*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* Wednesday, July 13, 2016 6:00 PM
*To:* Garg, Pankaj; ceph-users@lists.ceph.com

*Subject:* RE: Terrible RBD performance with Jewel



In fact, I was wrong , I missed you are running with 12 OSDs
(considering one OSD per SSD). In that case, it will take ~250 second to
fill up the journal.

Have you preconditioned the entire image with bigger block say 1M before
doing any real test ?



*From:*Garg, Pankaj [mailto:pankaj.g...@cavium.com]
*Sent:* Wednesday, July 13, 2016 5:55 PM
*To:* Somnath Roy; ceph-users@lists.ceph.com

*Subject:* RE: Terrible RBD performance with Jewel



Thanks Somnath. I will try all these, but I think there is something
else going on too.

Firstly my test reaches 0 IOPS within 10 seconds sometimes.

Secondly, when I’m at 0 IOPS, I see NO disk activity on IOSTAT and no
CPU activity either. This part is strange.



Thanks

Pankaj



*From:*Somnath Roy [mailto:somnath@sandisk.com]
*Sent:* Wednesday, July 13, 2016 5:49 PM
*To:* Somnath Roy; Garg, Pankaj; ceph-users@lists.ceph.com

*Subject:* RE: Terrible RBD performance with Jewel



Also increase the following..



filestore_op_threads



*From:*ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
Of *Somnath Roy
*Sent:* Wednesday, July 13, 2016 5:47 PM
*To:* Garg, Pankaj; ceph-users@lists.ceph.com

*Subject:* Re: [ceph-users] Terrible RBD performance with Jewel



Pankaj,



Could be related to the new throttle parameter introduced in jewel. By
default these throttles are off , you need to tweak it according to your
setup.

What is your journal size and fio block size ?

If it is default 5GB , with this rate (assuming 4K RW)   you mentioned
and considering 3X replication , it can fill up your journal and stall
io within ~30 seconds or so.

If you think this is what is happening in your system , you need to turn
this throttle on (see
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt )
and also need to lower the filestore_max_sync_interval to ~1 (or even
lower). Since you are trying on SSD , I would 

Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
I am not sure whether you need to set the following. What's the point of 
reducing the inline xattr limits? I forgot the calculation, but lower values 
could redirect your xattrs to omap. Better to comment those out.

filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6

We could improve some of the params, but none of them seems responsible for 
the behavior you are seeing.
Could you run iotop and see if any process (like xfsaild) is doing I/O on the 
drives during that time?
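
Something like the following, left running while the IOPS drop to zero, should
show it (a sketch; option letters can differ between iotop versions):

# iotop -obt -d 2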

Thanks & Regards
Somnath

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:40 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for 
90 seconds, then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 

[ceph-users] CEPH-Developer Opportunity - Bangalore, India

2016-07-13 Thread Janardhan Husthimme
Hello CEPH-users,


I am looking to hire CEPH developers for my team in Bangalore; if anyone is 
keen to explore, do unicast me at jhusthi...@walmartlabs.com.


Sorry folks for using this forum, thought of dropping a note.



Thanks,

Janardhan


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Garg, Pankaj
I agree, but I'm dealing with something else out here with this setup.
I just ran a test, and within 3 seconds my IOPS went to 0, and stayed there for 
90 seconds, then started and within seconds again went to 0.
This doesn't seem normal at all. Here is my ceph.conf:

[global]
fsid = xx
public_network = 
cluster_network = 
mon_initial_members = ceph1
mon_host = 
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
osd_mkfs_options = -f -i size=2048 -n size=64k
osd_mount_options_xfs = inode64,noatime,logbsize=256k
filestore_merge_threshold = 40
filestore_split_multiple = 8
osd_op_threads = 12
osd_pool_default_size = 2
mon_pg_warn_max_object_skew = 10
mon_pg_warn_min_per_osd = 0
mon_pg_warn_max_per_osd = 32768
filestore_op_threads = 6

[osd]
osd_enable_op_tracker = false
osd_op_num_shards = 2
filestore_wbthrottle_enable = false
filestore_max_sync_interval = 1
filestore_odsync_write = true
filestore_max_inline_xattr_size = 254
filestore_max_inline_xattrs = 6
filestore_queue_committing_max_bytes = 1048576000
filestore_queue_committing_max_ops = 5000
filestore_queue_max_bytes = 1048576000
filestore_queue_max_ops = 500
journal_max_write_bytes = 1048576000
journal_max_write_entries = 1000
journal_queue_max_bytes = 1048576000
journal_queue_max_ops = 3000
filestore_fd_cache_shards = 32
filestore_fd_cache_size = 64


From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:06 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS 

Re: [ceph-users] rbd command anomaly

2016-07-13 Thread EP Komarla
Thanks.  It works.

From: c.y. lee [mailto:c...@inwinstack.com]
Sent: Wednesday, July 13, 2016 6:17 PM
To: EP Komarla 
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] rbd command anomaly

Hi,

You need to specify pool name.

rbd -p testpool info testvol11

On Thu, Jul 14, 2016 at 8:55 AM, EP Komarla 
> wrote:
Hi,

I am seeing an issue.  I created 5 images testvol11-15 and I mapped them to 
/dev/rbd0-4.  When I execute the command ‘rbd showmapped’, it shows correctly 
the image and the mappings as shown below:

[root@ep-compute-2-16 run1]# rbd showmapped
id pool image snap device
0  testpool testvol11 -/dev/rbd0
1  testpool testvol12 -/dev/rbd1
2  testpool testvol13 -/dev/rbd2
3  testpool testvol14 -/dev/rbd3
4  testpool testvol15 -/dev/rbd4

I created image by this command:
rbd create testvol11 -p testpool --size 512 -m ep-compute-2-15

mapping was done using this command:
rbd map testvol11 -p testpool --name client.admin -m ep-compute-2-15

However, when I try to find details about each image it is failing

[root@ep-compute-2-16 run1]# rbd info testvol11
2016-07-13 17:50:23.093293 7f3372c1a7c0 -1 librbd::ImageCtx: error finding 
header: (2) No such file or directory
rbd: error opening image testvol11: (2) No such file or directory

Even the image list fails:
[root@ep-compute-2-16 run1]# rbd ls
[root@ep-compute-2-16 run1]#

Unable to understand why I am seeing this anomaly.  Any clues or pointers are 
appreciated.

Thanks,

- epk



Re: [ceph-users] rbd command anomaly

2016-07-13 Thread c.y. lee
Hi,

You need to specify pool name.

rbd -p testpool info testvol11
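
The rbd tool defaults to a pool named 'rbd' when -p is not given, which is why
'rbd ls' came back empty. The pool/image form works as well, e.g. (assuming all
your images live in testpool):

rbd ls -p testpool
rbd info testpool/testvol11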

On Thu, Jul 14, 2016 at 8:55 AM, EP Komarla 
wrote:

> Hi,
>
>
>
> I am seeing an issue.  I created 5 images testvol11-15 and I mapped them
> to /dev/rbd0-4.  When I execute the command ‘rbd showmapped’, it shows
> correctly the image and the mappings as shown below:
>
>
>
> [root@ep-compute-2-16 run1]# rbd showmapped
>
> id pool image snap device
>
> 0  testpool testvol11 -/dev/rbd0
>
> 1  testpool testvol12 -/dev/rbd1
>
> 2  testpool testvol13 -/dev/rbd2
>
> 3  testpool testvol14 -/dev/rbd3
>
> 4  testpool testvol15 -/dev/rbd4
>
>
>
> I created image by this command:
>
> rbd create testvol11 -p testpool --size 512 -m ep-compute-2-15
>
>
>
> mapping was done using this command:
>
> rbd map testvol11 -p testpool --name client.admin -m ep-compute-2-15
>
>
>
> However, when I try to find details about each image it is failing
>
>
>
> [root@ep-compute-2-16 run1]# rbd info testvol11
>
> 2016-07-13 17:50:23.093293 7f3372c1a7c0 -1 librbd::ImageCtx: error finding
> header: (2) No such file or directory
>
> rbd: error opening image testvol11: (2) No such file or directory
>
>
>
> Even the image list fails:
>
> [root@ep-compute-2-16 run1]# rbd ls
>
> [root@ep-compute-2-16 run1]#
>
>
>
> Unable to understand why I am seeing this anomaly.  Any clues or pointers
> are appreciated.
>
>
>
> Thanks,
>
>
>
> - epk
>


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
You should do that first to get a stable performance out with filestore.
1M seq write for the entire image should be sufficient to precondition it.
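
For example, with fio's rbd engine something like this per image should do it
(a sketch only; substitute your pool and image names):

# fio --ioengine=rbd --clientname=admin --pool=<pool> --rbdname=<image> --rw=write --bs=1M --iodepth=32 --name=precondition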

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 6:04 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Garg, Pankaj
No I have not.

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 6:00 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
In fact, I was wrong , I missed you are running with 12 OSDs (considering one 
OSD per SSD). In that case, it will take ~250 second to fill up the journal.
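(Rough arithmetic, assuming ~15K 4K IOPS at the client: 15000 x 4KB x 3 replicas
is roughly 180 MB/s landing in the journals, and 12 OSDs x 5GB is ~60GB of
journal space, hence a few hundred seconds before they fill if flushing cannot
keep up.)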
Have you preconditioned the entire image with bigger block say 1M before doing 
any real test ?

From: Garg, Pankaj [mailto:pankaj.g...@cavium.com]
Sent: Wednesday, July 13, 2016 5:55 PM
To: Somnath Roy; ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Thanks Somnath. I will try all these, but I think there is something else going 
on too.
Firstly my test reaches 0 IOPS within 10 seconds sometimes.
Secondly, when I'm at 0 IOPS, I see NO disk activity on IOSTAT and no CPU 
activity either. This part is strange.

Thanks
Pankaj

From: Somnath Roy [mailto:somnath@sandisk.com]
Sent: Wednesday, July 13, 2016 5:49 PM
To: Somnath Roy; Garg, Pankaj; 
ceph-users@lists.ceph.com
Subject: RE: Terrible RBD performance with Jewel

Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameter introduced in jewel. By default 
these throttles are off , you need to tweak it according to your setup.
What is your journal size and fio block size ?
If it is default 5GB , with this rate (assuming 4K RW)   you mentioned and 
considering 3X replication , it can fill up your journal and stall io within 
~30 seconds or so.
If you think this is what is happening in your system , you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
need to lower the filestore_max_sync_interval to ~1 (or even lower). Since you 
are trying on SSD , I would also recommend to turn the following parameter on 
for the stable performance out.


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj


[ceph-users] rbd command anomaly

2016-07-13 Thread EP Komarla
Hi,

I am seeing an issue.  I created 5 images testvol11-15 and I mapped them to 
/dev/rbd0-4.  When I execute the command 'rbd showmapped', it shows correctly 
the image and the mappings as shown below:

[root@ep-compute-2-16 run1]# rbd showmapped
id pool image snap device
0  testpool testvol11 -/dev/rbd0
1  testpool testvol12 -/dev/rbd1
2  testpool testvol13 -/dev/rbd2
3  testpool testvol14 -/dev/rbd3
4  testpool testvol15 -/dev/rbd4

I created image by this command:
rbd create testvol11 -p testpool --size 512 -m ep-compute-2-15

mapping was done using this command:
rbd map testvol11 -p testpool --name client.admin -m ep-compute-2-15

However, when I try to find details about each image it is failing

[root@ep-compute-2-16 run1]# rbd info testvol11
2016-07-13 17:50:23.093293 7f3372c1a7c0 -1 librbd::ImageCtx: error finding 
header: (2) No such file or directory
rbd: error opening image testvol11: (2) No such file or directory

Even the image list fails:
[root@ep-compute-2-16 run1]# rbd ls
[root@ep-compute-2-16 run1]#

Unable to understand why I am seeing this anomaly.  Any clues or pointers are 
appreciated.

Thanks,

- epk



Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
Also increase the following..

filestore_op_threads

From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of 
Somnath Roy
Sent: Wednesday, July 13, 2016 5:47 PM
To: Garg, Pankaj; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Terrible RBD performance with Jewel

Pankaj,

Could be related to the new throttle parameters introduced in Jewel. By default 
these throttles are off; you need to tweak them according to your setup.
What are your journal size and fio block size?
If the journal is the default 5GB, then at the rate you mentioned (assuming 4K 
random writes) and considering 3X replication, it can fill up and stall IO 
within ~30 seconds or so.
If you think this is what is happening in your system, you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
lower filestore_max_sync_interval to ~1 (or even lower). Since you are testing 
on SSDs, I would also recommend turning on the following parameter for stable 
performance:


filestore_odsync_write = true

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Somnath Roy
Pankaj,

Could be related to the new throttle parameters introduced in Jewel. By default 
these throttles are off; you need to tweak them according to your setup.
What are your journal size and fio block size?
If the journal is the default 5GB, then at the rate you mentioned (assuming 4K 
random writes) and considering 3X replication, it can fill up and stall IO 
within ~30 seconds or so.
If you think this is what is happening in your system, you need to turn this 
throttle on (see 
https://github.com/ceph/ceph/blob/jewel/src/doc/dynamic-throttle.txt ) and also 
lower filestore_max_sync_interval to ~1 (or even lower). Since you are testing 
on SSDs, I would also recommend turning on the following parameter for stable 
performance:


filestore_odsync_write = true
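
For reference, a sketch of how these settings might look in ceph.conf (the 
values are illustrative only, not recommendations, and the OSDs need a restart 
to pick them up):

  [osd]
  filestore_max_sync_interval = 1
  filestore_odsync_write = true
  filestore_op_threads = 8        # from the follow-up mail; value is an example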

Thanks & Regards
Somnath
From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Garg, 
Pankaj
Sent: Wednesday, July 13, 2016 4:57 PM
To: ceph-users@lists.ceph.com
Subject: [ceph-users] Terrible RBD performance with Jewel

Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.

What am I missing?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd inside LXC

2016-07-13 Thread Łukasz Jagiełło
Hi,

Just wondering why you want each OSD inside a separate LXC container? Just to
pin them to specific CPUs?

On Tue, Jul 12, 2016 at 6:33 AM, Guillaume Comte <
guillaume.co...@blade-group.com> wrote:

> Hi,
>
> I am currently defining a storage architecture based on Ceph, and I wish
> to know whether I have misunderstood some things.
>
> So, I plan to deploy one OSD per free hard drive on each server; each OSD
> will be inside an LXC container.
>
> Then, I wish to use the server itself as an RBD client for objects created
> in the pools. I also wish to have an SSD to enable caching (and to store
> the OSD logs as well).
>
> The idea behind this is to create CRUSH rules which will keep a set of
> objects within a couple of servers connected to the same pair of switches,
> in order to have the best proximity between where I store the objects and
> where I use them (I am not worried about a very strong guarantee against
> losing data if my whole rack powers down).
>
> Am I already on the wrong track? Is there a way to guarantee proximity of
> data with Ceph without the twisted configuration I am prepared to make?
>
> Thanks in advance,
>
> Regards
> --
> *Guillaume Comte*
> 06 25 85 02 02  | guillaume.co...@blade-group.com
> 
> 90 avenue des Ternes, 75 017 Paris
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Łukasz Jagiełło
lukaszjagielloorg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Terrible RBD performance with Jewel

2016-07-13 Thread Garg, Pankaj
Hi,
I just  installed jewel on a small cluster of 3 machines with 4 SSDs each. I 
created 8 RBD images, and use a single client, with 8 threads, to do random 
writes (using FIO with RBD engine) on the images ( 1 thread per image).
The cluster has 3X replication and 10G cluster and client networks.
FIO prints the aggregate IOPS every second for the cluster. Before Jewel, I get 
roughly 10K IOPS. It was up and down, but still kept going.
Now I see IOPS that go to 13-15K, but then it drops, and eventually drops to 
ZERO for several seconds, and then starts back up again.
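
For reference, the workload corresponds roughly to a fio RBD-engine job file 
along these lines (a sketch; pool and image names are assumptions):

  [global]
  ioengine=rbd
  clientname=admin
  pool=rbd
  rw=randwrite
  bs=4k
  iodepth=32
  [img1]
  rbdname=testimg1
  [img2]
  rbdname=testimg2
  # ...one job section per image, 8 in total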

What am I missing?

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question on Sequential Write performance at 4K blocksize

2016-07-13 Thread EP Komarla
Hi All,

Have a question on the performance of sequential write @ 4K block sizes.

Here is my configuration:

Ceph Cluster: 6 Nodes. Each node with :-
20x HDDs (OSDs) - 10K RPM 1.2 TB SAS disks
SSDs - 4x - Intel S3710, 400GB; for OSD journals shared across 20 HDDs (i.e., 
SSD journal ratio 1:5)

Network:
- Client network - 10Gbps
- Cluster network - 10Gbps
- Each node with dual NIC - Intel 82599 ES - driver version 4.0.1

Traffic generators:
2 client servers - running on dual Intel sockets with 16 physical cores (32 
cores with hyper-threading enabled)

Test program:
FIO - sequential read/write; random read/write
Blocksizes - 4k, 32k, 256k...
FIO - Number of jobs = 32; IO depth = 64
Runtime = 10 minutes; Ramptime = 5 minutes
Filesize = 4096g (5TB)
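
For reference, the 4K sequential-write case above corresponds roughly to a fio 
job like the following (a sketch; the ioengine, direct flag and target 
directory are assumptions):

  [seq-write-4k]
  ioengine=libaio
  direct=1
  rw=write
  bs=4k
  numjobs=32
  iodepth=64
  runtime=600
  ramp_time=300
  size=4096g
  directory=/mnt/test    # assumption: whatever mount point the test targets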

I observe that my sequential write performance at 4K block size is very low - I 
am getting around 6MB/sec bandwidth.  The performance improves significantly at 
larger block sizes (shown below)

FIO - Sequential Write test

Block Size    Sequential Write Bandwidth (KB/sec)
4K                 5694
32K              141020
256K             747421
1024K            602236
4096K            683029


Here are my questions:
- Why is the sequential write performance at 4K block size so low? Is this in 
line with what others see?
- Is it because of the small number of clients, i.e., traffic generators? I am 
planning to increase the number of clients to 4 servers.
- There is a later version of the Intel NIC driver, v4.3.15 - do you think 
upgrading to it will improve performance?

Any thoughts or pointers will be helpful.

Thanks,

- epk

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs change metadata pool?

2016-07-13 Thread Di Zhang
I also tried a 4K write bench. The IOPS is ~420. I used to have better
bandwidth when I used the same network for both the cluster and the clients. Now
the bandwidth must be limited by the 1G ethernet. What would you suggest I do?
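
For reference, a 4K bench like that can be run against the same test pool 
along these lines (a sketch):

  rados bench -p test 10 write -b 4096 --no-cleanup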

Thanks,

On Wed, Jul 13, 2016 at 11:37 AM, Di Zhang  wrote:

> Hello,
> Sorry for the misunderstanding about IOPS. Here are some summary stats
> of my benchmark (does 20 - 30 IOPS seem normal to you?):
>
> ceph osd pool create test 512 512
>
> rados bench -p test 10 write --no-cleanup
>
> Total time run: 10.480383
> Total writes made:  288
> Write size: 4194304
> Object size:4194304
> Bandwidth (MB/sec): 109.92
> Stddev Bandwidth:   11.9926
> Max bandwidth (MB/sec): 124
> Min bandwidth (MB/sec): 80
> Average IOPS:   27
> Stddev IOPS:3
> Max IOPS:   31
> Min IOPS:   20
> Average Latency(s): 0.579105
> Stddev Latency(s):  0.19902
> Max latency(s): 1.32831
> Min latency(s): 0.245505
>
> rados bench -p bench -p test 10 seq
> Total time run:   10.340724
> Total reads made: 288
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   111.404
> Average IOPS  27
> Stddev IOPS:  2
> Max IOPS: 31
> Min IOPS: 22
> Average Latency(s):   0.564858
> Max latency(s):   1.65278
> Min latency(s):   0.141504
>
> rados bench -p bench -p test 10 rand
> Total time run:   10.546251
> Total reads made: 293
> Read size:4194304
> Object size:  4194304
> Bandwidth (MB/sec):   111.13
> Average IOPS: 27
> Stddev IOPS:  2
> Max IOPS: 32
> Min IOPS: 24
> Average Latency(s):   0.57092
> Max latency(s):   1.8631
> Min latency(s):   0.161936
>
>
> On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer  wrote:
>
>>
>> Hello,
>>
>> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
>>
>> > I am using 10G infiniband for cluster network and 1G ethernet for
>> public.
>> Hmm, very unbalanced, but I guess that's HW you already had.
>>
>> > Because I don't have enough slots on the node, so I am using three
>> files on
>> > the OS drive (SSD) for journaling, which really improved but not
>> entirely
>> > solved the problem.
>> >
>> If you can, use partitions instead of files, less overhead.
>> What model SSD is that?
>>
>> Also putting the meta-data pool on SSDs might help.
>>
>> > I am quite happy with the current IOPS, which range from 200 MB/s to 400
>> > MB/s sequential write, depending on the block size.
>> That's not IOPS, that's bandwidth, throughput.
>>
>> >But the problem is,
>> > when I transfer data to the cephfs at a rate below 100MB/s, I can
>> observe
>> > the slow/blocked requests warnings after a few minutes via "ceph -w".
>>
>> I doubt the speed has anything to do with this, but the actual block size
>> and IOPS numbers.
>>
>> As always, watch your storage nodes with atop (or iostat) during such
>> scenarios/tests and spot the bottlenecks.
>>
>> >It's
>> > not specific to any particular OSDs. So I started to doubt if the
>> > configuration is correct or upgrading to Jewel can solve it.
>> >
>> Jewel is likely to help in general, but can't fix insufficient HW or
>> broken configurations.
>>
>> > There are about 5,000,000 objects currently in the cluster.
>> >
>> You're probably not hitting this, but read the recent filestore merge and
>> split threads, including the entirety of this thread:
>> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29243.html
>>
>> Christian
>>
>> > Thanks for the hints.
>> >
>> > On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer 
>> wrote:
>> >
>> > >
>> > > Hello,
>> > >
>> > > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
>> > >
>> > > > It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512
>> for both
>> > > > cephfs_data and cephfs_metadata. I experienced some slow/blocked
>> requests
>> > > > issues when I was using hammer 0.94.x and prior. So I was thinking
>> if the
>> > > > pg_num is too large for metadata.
>> > >
>> > > Very, VERY much doubt this.
>> > >
>> > > Your "ideal" values for a cluster of this size (are you planning to
>> grow
>> > > it?) would be about 1024 PGs for data and 128 or 256 PGs for
>> meta-data.
>> > >
>> > > Not really that far off and more importantly not overloading the OSDs
>> with
>> > > too many PGs in total. Or do you have more pools?
>> > >
>> > >
>> > > >I just upgraded the cluster to Jewel
>> > > > today. Will watch if the problem remains.
>> > > >
>> > > Jewel improvements might mask things, but I'd venture that your
>> problems
>> > > were caused by your HW not being sufficient for the load.
>> > >
>> > > As in, do you use SSD journals, etc?
>> > > How many IOPS do you need/expect from your CephFS?
>> > > How many objects are in there?
>> > >
>> > > Christian
>> > >
>> > > > Thank you.
>> 

Re: [ceph-users] cephfs change metadata pool?

2016-07-13 Thread Di Zhang
Hello,
Sorry for the misunderstanding about IOPS. Here are some summary stats
of my benchmark (does 20 - 30 IOPS seem normal to you?):

ceph osd pool create test 512 512

rados bench -p test 10 write --no-cleanup

Total time run: 10.480383
Total writes made:  288
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): 109.92
Stddev Bandwidth:   11.9926
Max bandwidth (MB/sec): 124
Min bandwidth (MB/sec): 80
Average IOPS:   27
Stddev IOPS:3
Max IOPS:   31
Min IOPS:   20
Average Latency(s): 0.579105
Stddev Latency(s):  0.19902
Max latency(s): 1.32831
Min latency(s): 0.245505

rados bench -p bench -p test 10 seq
Total time run:   10.340724
Total reads made: 288
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   111.404
Average IOPS  27
Stddev IOPS:  2
Max IOPS: 31
Min IOPS: 22
Average Latency(s):   0.564858
Max latency(s):   1.65278
Min latency(s):   0.141504

rados bench -p bench -p test 10 rand
Total time run:   10.546251
Total reads made: 293
Read size:4194304
Object size:  4194304
Bandwidth (MB/sec):   111.13
Average IOPS: 27
Stddev IOPS:  2
Max IOPS: 32
Min IOPS: 24
Average Latency(s):   0.57092
Max latency(s):   1.8631
Min latency(s):   0.161936


On Tue, Jul 12, 2016 at 9:18 PM, Christian Balzer  wrote:

>
> Hello,
>
> On Tue, 12 Jul 2016 20:57:00 -0500 Di Zhang wrote:
>
> > I am using 10G infiniband for cluster network and 1G ethernet for public.
> Hmm, very unbalanced, but I guess that's HW you already had.
>
> > Because I don't have enough slots on the node, so I am using three files
> on
> > the OS drive (SSD) for journaling, which really improved but not entirely
> > solved the problem.
> >
> If you can, use partitions instead of files, less overhead.
> What model SSD is that?
>
> Also putting the meta-data pool on SSDs might help.
>
> > I am quite happy with the current IOPS, which range from 200 MB/s to 400
> > MB/s sequential write, depending on the block size.
> That's not IOPS, that's bandwidth, throughput.
>
> >But the problem is,
> > when I transfer data to the cephfs at a rate below 100MB/s, I can observe
> > the slow/blocked requests warnings after a few minutes via "ceph -w".
>
> I doubt the speed has anything to do with this, but the actual block size
> and IOPS numbers.
>
> As always, watch your storage nodes with atop (or iostat) during such
> scenarios/tests and spot the bottlenecks.
>
> >It's
> > not specific to any particular OSDs. So I started to doubt if the
> > configuration is correct or upgrading to Jewel can solve it.
> >
> Jewel is likely to help in general, but can't fix insufficient HW or
> broken configurations.
>
> > There are about 5,000,000 objects currently in the cluster.
> >
> You're probably not hitting this, but read the recent filestore merge and
> split threads, including the entirety of this thread:
> https://www.mail-archive.com/ceph-users@lists.ceph.com/msg29243.html
>
> Christian
>
> > Thanks for the hints.
> >
> > On Tue, Jul 12, 2016 at 8:19 PM, Christian Balzer  wrote:
> >
> > >
> > > Hello,
> > >
> > > On Tue, 12 Jul 2016 19:54:38 -0500 Di Zhang wrote:
> > >
> > > > It's a 5 nodes cluster. Each node has 3 OSDs. I set pg_num = 512 for
> both
> > > > cephfs_data and cephfs_metadata. I experienced some slow/blocked
> requests
> > > > issues when I was using hammer 0.94.x and prior. So I was thinking
> if the
> > > > pg_num is too large for metadata.
> > >
> > > Very, VERY much doubt this.
> > >
> > > Your "ideal" values for a cluster of this size (are you planning to
> grow
> > > it?) would be about 1024 PGs for data and 128 or 256 PGs for meta-data.
> > >
> > > Not really that far off and more importantly not overloading the OSDs
> with
> > > too many PGs in total. Or do you have more pools?
> > >
> > >
> > > >I just upgraded the cluster to Jewel
> > > > today. Will watch if the problem remains.
> > > >
> > > Jewel improvements might mask things, but I'd venture that your
> problems
> > > were caused by your HW not being sufficient for the load.
> > >
> > > As in, do you use SSD journals, etc?
> > > How many IOPS do you need/expect from your CephFS?
> > > How many objects are in there?
> > >
> > > Christian
> > >
> > > > Thank you.
> > > >
> > > > On Tue, Jul 12, 2016 at 6:45 PM, Gregory Farnum 
> > > wrote:
> > > >
> > > > > I'm not at all sure that rados cppool actually captures everything
> (it
> > > > > might). Doug has been working on some similar stuff for disaster
> > > > > recovery testing and can probably walk you through moving over.
> > > > >
> > > > > But just how large *is* your metadata pool in relation to others?
> > > > > Having a too-large pool doesn't cost much unless it's
> > > > > grossly-inflated, and having 

Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread David
Aside from the 10GbE vs 40GbE question, if you're planning to export an RBD
image over smb/nfs I think you are going to struggle to reach anywhere near
1GB/s in a single-threaded read. This is because even with readahead
cranked right up you're still only going to be hitting a handful of disks at a
time. There are a few threads on this list about sequential reads with the
kernel rbd client. I think CephFS would be more appropriate in your use
case.
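
For what it's worth, 'readahead cranked right up' on a mapped krbd device 
usually means something along these lines (a sketch; device name and values 
are examples only):

  echo 16384 > /sys/block/rbd0/queue/read_ahead_kb   # 16 MB readahead
  # or equivalently, in 512-byte sectors:
  blockdev --setra 32768 /dev/rbd0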

On Wed, Jul 13, 2016 at 1:52 PM, Götz Reinicke - IT Koordinator <
goetz.reini...@filmakademie.de> wrote:

> Am 13.07.16 um 14:27 schrieb Wido den Hollander:
> >> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de>:
> >>
> >>
> >> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
>  Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de>:
> 
> 
>  Hi,
> 
>  can anybody give some realworld feedback on what hardware
>  (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The
> Ceph
>  Cluster will be mostly rbd images. S3 in the future, CephFS we will
> see :)
> 
>  Thanks for some feedback and hints! Regadrs . Götz
> 
> >>> Why do you think you need 40Gb? That's some serious traffic to the
> OSDs and I doubt it's really needed.
> >>>
> >>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with
> that?
> >>>
> >>> It's also better to have more smaller nodes than a few big nodes with
> Ceph.
> >>>
> >>> Wido
> >>>
> >> Hi Wido,
> >>
> >> may be my post was misleading. The OSD Nodes do have 10G, the FIleserver
> >> in front to the Clients/Destops should have 40G.
> >>
> > Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do
> just fine I think.
> True @re-export
> > Still, 40GbE is a lot of bandwidth!
> :) I know, but we have users who like to transfer e.g. raw movie
> footage for a normal project, which can quickly reach 1TB, and they don't
> want to wait hours ;). Or others like to screen/stream raw 4K video
> footage, which is +- 10Gb/second ... That's the challenge :)
>
> And yes, our Ceph cluster is well designed .. on paper ;) SSDs
> considered, with lots of helpful feedback from the list!!
>
> I am just trying to find Linux/Ceph users with 40Gb experience :)
>
> cheers . Götz
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Jake Young
We use all Cisco UCS servers (C240 M3 and M4s) with the PCIE VIC 1385 40G
NIC.  The drivers were included in Ubuntu 14.04.  I've had no issues with
the NICs or my network what so ever.

We have two Cisco Nexus 5624Q that the OSD servers connect to.  The
switches are just switching two VLANs (ceph client and cluster networks),
no layer 3 routing.  Those switches connect directly to two pairs of 6248
Fabric Interconnects, which are like TOR switches for UCS Blade Server
Chassis.

On Wed, Jul 13, 2016 at 11:08 AM,  wrote:

> I am using these for other stuff:
> http://www.supermicro.com/products/accessories/addon/AOC-STG-b4S.cfm
>
> If you want such a NIC, also think of the "network side": SFP+ switches are
> very common, 40G is less common, 25G is really new (= really few products)
>
>
>
> On 13/07/2016 16:50, Warren Wang - ISD wrote:
> > I've run the Mellanox 40 gig card. Connectx 3-Pro, but that's old now.
> > Back when I ran it, the  drivers were kind of a pain to deal with in
> > Ubuntu, primarily during PXE. It should be better now though.
> >
> > If you have the network to support it, 25Gbe is quite a bit cheaper per
> > port, and won¹t be so hard to drive. 40Gbe is very hard to fill. I
> > personally probably would not do 40 again.
> >
> > Warren Wang
> >
> >
> >
> > On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
> > Koordinator"  > goetz.reini...@filmakademie.de> wrote:
> >
> >> Am 13.07.16 um 14:59 schrieb Joe Landman:
> >>>
> >>>
> >>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>  40Gbps can be used as 4*10Gbps
> 
>  I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>  ports", but extented to "usage of more than a single 10Gbps port, eg
>  20Gbps etc too"
> 
>  Is there people here that are using more than 10G on an ceph server ?
> >>>
> >>> We have built, and are building Ceph units for some of our customers
> >>> with dual 100Gb links.  The storage box was one of our all flash
> >>> Unison units for OSDs.  Similarly, we have several customers actively
> >>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
> >>> (SRD) box.
> >>>
> >> Now we get closer. Can you tell me which 40G Nic you use?
> >>
> >>/götz
> >>
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] multiple journals on SSD

2016-07-13 Thread George Shuklin

Hello.

On 07/13/2016 03:31 AM, Christian Balzer wrote:

Hello,

did you actually read my full reply last week, the in-line parts,
not just the top bit?

http://www.spinics.net/lists/ceph-users/msg29266.html

On Tue, 12 Jul 2016 16:16:09 +0300 George Shuklin wrote:


Yes, linear IO speed was a concern during the benchmark. I cannot predict how
much linear IO will be generated by clients (compared to IOPS), so we are going
to balance HDD OSDs per SSD according to real usage. If users generate too much
random IO, we will raise the HDD/SSD ratio; if they generate more linear-write
load, we will reduce that number. I plan to do this by reserving space for
'more HDD' or 'more SSD' in the planned servers - they will go to production
with ~50% slot utilization.


Journal writes are always "linear", in a fashion.
And Ceph journals only see writes, never reads.

So what your SSD sees is n sequential (with varying lengths, mind ya)
write streams and that's all.
Where n is the number of journals.


Yes, I knew this. I mean that under real production load there is going 
to be so much random IO (directed at the OSDs) that the HDD inside each 
OSD will not be able to accept much linear writing (it has to serve 
random writes & reads in parallel, and this significantly reduces the 
linear IO speed of the same device). I expect the HDDs to be busy enough 
with random writes/reads not to saturate the SSD's linear write performance.

My main concern is that the random IO hitting an OSD includes not only writes
but reads too, and cold random reads will slow HDD performance
significantly. In my previous experience, any weekly cronjob on a server
with backups (or just 'find /') causes bad spikes of cold reads, and that
drastically diminishes HDD performance.


As I wrote last week, reads have nothing to do with journals.


Reads have nothing to do with journals except for one thing: if the underlying 
HDD is very busy with reads, it will accept write operations slowly. And 
because it accepts them slowly, the SSD with the journal will not get much 
IO of any kind.


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-13 Thread Luis Periquito
Thanks for sharing Wido.

From your information you only talk about MON and OSD. What about the
RGW nodes? You stated in the beginning that 99% is rgw...

On Wed, Jul 13, 2016 at 3:56 PM, Wido den Hollander  wrote:
> Hello,
>
> The last 3 days I worked at a customer with a 1800 OSD cluster which had to 
> be upgraded from Hammer 0.94.5 to Jewel 10.2.2
>
> The cluster in this case is 99% RGW, but also some RBD.
>
> I wanted to share some of the things we encountered during this upgrade.
>
> All 180 nodes are running CentOS 7.1 on a IPv6-only network.
>
> ** Hammer Upgrade **
> At first we upgraded from 0.94.5 to 0.94.7, this went well except for the 
> fact that the monitors got spammed with these kind of messages:
>
>   "Failed to encode map eXXX with expected crc"
>
> Some searching on the list brought me to:
>
>   ceph tell osd.* injectargs -- --clog_to_monitors=false
>
>  This reduced the load on the 5 monitors and made recovery succeed smoothly.
>
>  ** Monitors to Jewel **
>  The next step was to upgrade the monitors from Hammer to Jewel.
>
>  Using Salt we upgraded the packages and afterwards it was simple:
>
>killall ceph-mon
>chown -R ceph:ceph /var/lib/ceph
>chown -R ceph:ceph /var/log/ceph
>
> Now, a systemd quirk: 'systemctl start ceph.target' does not work; I had to 
> manually enable the monitor and start it:
>
>   systemctl enable ceph-mon@srv-zmb04-05.service
>   systemctl start ceph-mon@srv-zmb04-05.service
>
> Afterwards the monitors were running just fine.
>
> ** OSDs to Jewel **
> To upgrade the OSDs to Jewel we initially used Salt to update the packages on 
> all systems to 10.2.2, we then used a Shell script which we ran on one node 
> at a time.
>
> The failure domain here is 'rack', so we executed this in one rack, then the 
> next one, etc, etc.
>
> Script can be found on Github: 
> https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6
>
> Be aware that the chown can take a long, long, very long time!
>
> We ran into the issue that some OSDs crashed after start. But after trying 
> again they would start.
>
>   "void FileStore::init_temp_collections()"
>
> I reported this in the tracker as I'm not sure what is happening here: 
> http://tracker.ceph.com/issues/16672
>
> ** New OSDs with Jewel **
> We also had some new nodes which we wanted to add to the Jewel cluster.
>
> Using Salt and ceph-disk we ran into a partprobe issue in combination with 
> ceph-disk. There was already a Pull Request for the fix, but that was not 
> included in Jewel 10.2.2.
>
> We manually applied the PR and it fixed our issues: 
> https://github.com/ceph/ceph/pull/9330
>
> Hope this helps other people with their upgrades to Jewel!
>
> Wido
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] New to Ceph - osd autostart problem

2016-07-13 Thread George Shuklin
As you can see you have 'unknown' partition type. It should be 'ceph 
journal' and 'ceph data'.


Stop ceph-osd, unmount the partitions, and set the partition typecodes 
properly:
/sbin/sgdisk --typecode=PART:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- 
/dev/DISK


PART - number of partition with data (1 in your case), so:

/sbin/sgdisk --typecode=1:4fbd7e29-9d25-41b8-afd0-062c0ceff05d -- 
/dev/sdb (sdc, etc).


You can change typecode for journal partition too:

/sbin/sgdisk --typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 -- /dev/sdb
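
After changing the typecodes, something along these lines should make udev and 
ceph-disk pick the partitions up again without a reboot (a sketch):

  partprobe /dev/sdb                       # re-read the partition table
  sgdisk -i 1 /dev/sdb                     # verify the new type GUID took
  /usr/sbin/ceph-disk trigger /dev/sdb1    # or simply reboot the node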


On 07/12/2016 01:05 AM, Dirk Laurenz wrote:


root@cephosd01:~# fdisk -l /dev/sdb

Disk /dev/sdb: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 87B152E0-EB5D-4EB0-8FFB-C27096CBB1ED

DeviceStart   End  Sectors Size Type
/dev/sdb1  10487808 104857566 94369759  45G unknown
/dev/sdb2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.
root@cephosd01:~# fdisk -l /dev/sdc

Disk /dev/sdc: 50 GiB, 53687091200 bytes, 104857600 sectors
Units: sectors of 1 * 512 = 512 bytes
Sector size (logical/physical): 512 bytes / 512 bytes
I/O size (minimum/optimal): 512 bytes / 512 bytes
Disklabel type: gpt
Disk identifier: 31B81FCA-9163-4723-B195-97AEC9568AD0

DeviceStart   End  Sectors Size Type
/dev/sdc1  10487808 104857566 94369759  45G unknown
/dev/sdc2  2048  10487807 10485760   5G unknown

Partition table entries are not in disk order.


Am 11.07.2016 um 18:01 schrieb George Shuklin:

Check out partition type for data partition for ceph.

fdisk -l /dev/sdc

On 07/11/2016 04:03 PM, Dirk Laurenz wrote:


hmm, helps partially ... running


/usr/sbin/ceph-disk trigger /dev/sdc1 or sdb1 works and brings osd up..


systemctl enable does not help


Am 11.07.2016 um 14:49 schrieb George Shuklin:

Short story of how OSDs are started in systemd environments:

Ceph OSD partitions have a specific typecode (partition type 
4FBD7E29-9D25-41B8-AFD0-062C0CEFF05D). It is handled by udev rules 
shipped with the ceph package:

/lib/udev/rules.d/95-ceph-osd.rules

These set the proper owner/group for the disk ('ceph' instead of 'root') 
and call /usr/sbin/ceph-disk trigger.


ceph-disk then triggers the creation of an instance of the ceph-disk@ systemd 
unit (to mount the disk to /var/lib/ceph/osd/...), and of ceph-osd@ (I'm not 
sure about the exact sequence of events).


Basically, for OSDs to autostart, their partitions NEED to have the proper 
typecode. If you are using something different (like a 
'directory based OSD') you should enable OSD autostart yourself:


systemctl enable ceph-osd@42


On 07/11/2016 03:32 PM, Dirk Laurenz wrote:

Hello,


I'm new to Ceph and am trying to take some first steps with it to 
understand the concepts.


My setup is, at first, completely in VMs.


I deployed (with ceph-deploy) three monitors and three OSD hosts 
(3+3 VMs).


My first test was to find out whether everything comes back online 
after a system restart. This works fine for the monitors, but 
fails for the OSDs; I have to start them manually.



The OS is Debian Jessie; Ceph is the current release.


Where can I find out what's going wrong?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread ceph
I am using these for other stuff:
http://www.supermicro.com/products/accessories/addon/AOC-STG-b4S.cfm

If you want such a NIC, also think of the "network side": SFP+ switches are
very common, 40G is less common, 25G is really new (= really few products)



On 13/07/2016 16:50, Warren Wang - ISD wrote:
> I've run the Mellanox 40 gig card. Connectx 3-Pro, but that's old now.
> Back when I ran it, the  drivers were kind of a pain to deal with in
> Ubuntu, primarily during PXE. It should be better now though.
> 
> If you have the network to support it, 25Gbe is quite a bit cheaper per
> port, and won¹t be so hard to drive. 40Gbe is very hard to fill. I
> personally probably would not do 40 again.
> 
> Warren Wang
> 
> 
> 
> On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
> Koordinator"  goetz.reini...@filmakademie.de> wrote:
> 
>> Am 13.07.16 um 14:59 schrieb Joe Landman:
>>>
>>>
>>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
 40Gbps can be used as 4*10Gbps

 I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
 ports", but extented to "usage of more than a single 10Gbps port, eg
 20Gbps etc too"

 Is there people here that are using more than 10G on an ceph server ?
>>>
>>> We have built, and are building Ceph units for some of our customers
>>> with dual 100Gb links.  The storage box was one of our all flash
>>> Unison units for OSDs.  Similarly, we have several customers actively
>>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
>>> (SRD) box.
>>>
>> Now we get closer. Can you tell me which 40G Nic you use?
>>
>>/götz
>>
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Warren Wang - ISD
I've run the Mellanox 40 gig card. Connectx 3-Pro, but that's old now.
Back when I ran it, the  drivers were kind of a pain to deal with in
Ubuntu, primarily during PXE. It should be better now though.

If you have the network to support it, 25Gbe is quite a bit cheaper per
port, and won¹t be so hard to drive. 40Gbe is very hard to fill. I
personally probably would not do 40 again.

Warren Wang



On 7/13/16, 9:10 AM, "ceph-users on behalf of Götz Reinicke - IT
Koordinator"  wrote:

>Am 13.07.16 um 14:59 schrieb Joe Landman:
>>
>>
>> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>>> 40Gbps can be used as 4*10Gbps
>>>
>>> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>>> ports", but extented to "usage of more than a single 10Gbps port, eg
>>> 20Gbps etc too"
>>>
>>> Is there people here that are using more than 10G on an ceph server ?
>>
>> We have built, and are building Ceph units for some of our customers
>> with dual 100Gb links.  The storage box was one of our all flash
>> Unison units for OSDs.  Similarly, we have several customers actively
>> using multiple 40GbE links on our 60 bay Unison spinning rust disk
>> (SRD) box.
>>
>Now we get closer. Can you tell me which 40G Nic you use?
>
>/götz
>

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Lessons learned upgrading Hammer -> Jewel

2016-07-13 Thread Wido den Hollander
Hello,

The last 3 days I worked at a customer with a 1800 OSD cluster which had to be 
upgraded from Hammer 0.94.5 to Jewel 10.2.2

The cluster in this case is 99% RGW, but also some RBD.

I wanted to share some of the things we encountered during this upgrade.

All 180 nodes are running CentOS 7.1 on a IPv6-only network.

** Hammer Upgrade **
At first we upgraded from 0.94.5 to 0.94.7, this went well except for the fact 
that the monitors got spammed with these kind of messages:

  "Failed to encode map eXXX with expected crc"

Some searching on the list brought me to:

  ceph tell osd.* injectargs -- --clog_to_monitors=false
  
 This reduced the load on the 5 monitors and made recovery succeed smoothly.
 
 ** Monitors to Jewel **
 The next step was to upgrade the monitors from Hammer to Jewel.
 
 Using Salt we upgraded the packages and afterwards it was simple:
 
   killall ceph-mon
   chown -R ceph:ceph /var/lib/ceph
   chown -R ceph:ceph /var/log/ceph

Now, a systemd quirk: 'systemctl start ceph.target' does not work; I had to 
manually enable the monitor and start it:

  systemctl enable ceph-mon@srv-zmb04-05.service
  systemctl start ceph-mon@srv-zmb04-05.service

Afterwards the monitors were running just fine.

** OSDs to Jewel **
To upgrade the OSDs to Jewel we initially used Salt to update the packages on 
all systems to 10.2.2, we then used a Shell script which we ran on one node at 
a time.

The failure domain here is 'rack', so we executed this in one rack, then the 
next one, etc, etc.

Script can be found on Github: 
https://gist.github.com/wido/06eac901bd42f01ca2f4f1a1d76c49a6

Be aware that the chown can take a long, long, very long time!
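
For readers who cannot reach the gist, the per-node sequence is roughly the 
following (a sketch only; the unit names and the noout handling are 
assumptions, and the gist above is authoritative):

  ceph osd set noout                                # on an admin node
  systemctl stop ceph-osd.target                    # stop all OSDs on this node
  chown -R ceph:ceph /var/lib/ceph /var/log/ceph    # the slow part
  systemctl start ceph-osd.target
  ceph osd unset noout                              # once the cluster is healthy again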

We ran into the issue that some OSDs crashed after start. But after trying 
again they would start.

  "void FileStore::init_temp_collections()"
  
I reported this in the tracker as I'm not sure what is happening here: 
http://tracker.ceph.com/issues/16672

** New OSDs with Jewel **
We also had some new nodes which we wanted to add to the Jewel cluster.

Using Salt and ceph-disk we ran into a partprobe issue in combination with 
ceph-disk. There was already a Pull Request for the fix, but that was not 
included in Jewel 10.2.2.

We manually applied the PR and it fixed our issues: 
https://github.com/ceph/ceph/pull/9330

Hope this helps other people with their upgrades to Jewel!

Wido
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Jason Dillaman
The RAW file will appear to be the exact image size but the filesystem
will know about the holes in the image and it will be sparsely
allocated on disk.  For example:

# dd if=/dev/zero of=sparse-file bs=1 count=1 seek=2GiB
# ll sparse-file
-rw-rw-r--. 1 jdillaman jdillaman 2147483649 Jul 13 09:20 sparse-file
# du -sh sparse-file
4.0K sparse-file

Now, running qemu-img to copy the image into the backing RBD pool:

# qemu-img convert -f raw -O raw ~/sparse-file rbd:rbd/sparse-file
# rbd disk-usage sparse-file
NAMEPROVISIONED USED
sparse-file   2048M0


On Wed, Jul 13, 2016 at 3:31 AM, Fran Barrera  wrote:
> Yes, but it is the same problem, isn't it? The image will be too large
> because the format is raw.
>
> Thanks.
>
> 2016-07-13 9:24 GMT+02:00 Kees Meijs :
>>
>> Hi Fran,
>>
>> Fortunately, qemu-img(1) is able to directly utilise RBD (supporting
>> sparse block devices)!
>>
>> Please refer to http://docs.ceph.com/docs/hammer/rbd/qemu-rbd/ for
>> examples.
>>
>> Cheers,
>> Kees
>>
>> On 13-07-16 09:18, Fran Barrera wrote:
>> > Can you explain how you do this procedure? I have the same problem
>> > with the large images and snapshots.
>> >
>> > This is what I do:
>> >
>> > # qemu-img convert -f qcow2 -O raw image.qcow2 image.img
>> > # openstack image create image.img
>> >
>> > But the image.img is too large.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs change metadata pool?

2016-07-13 Thread John Spray
On Wed, Jul 13, 2016 at 12:14 AM, Di Zhang  wrote:
> Hi,
>
> Is there any way to change the metadata pool for a cephfs without losing
> any existing data? I know how to clone the metadata pool using rados cppool.
> But the filesystem still links to the original metadata pool no matter what
> you name it.
>
> The motivation here is to decrease the pg_num of the metadata pool. I
> created this cephfs cluster sometime ago, while I didn't realize that I
> shouldn't assign a large pg_num to such a small pool.
>
> I'm not sure if I can delete the fs and re-create it using the existing
> data pool and the cloned metadata pool.

It may not be impossible, but it is a wild and crazy untested
procedure that I would only do if my system was badly broken and I had
no other options.

**COMPLETELY UNTESTED AND DANGEROUS**

stop all MDS daemons
delete your filesystem (but leave the pools)
use "rados export" and "rados import" to do a full copy of the
metadata to a new pool (*not* cppool, it doesn't copy OMAP data)
use "ceph fs new" to create a new filesystem that uses your new metadata pool
use "ceph fs reset" to skip the creating phase of the new filesystem
start MDS daemons

**COMPLETELY UNTESTED AND DANGEROUS**
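
A command-level sketch of the above, with placeholder pool/fs names and the 
same caveat applying:

  systemctl stop ceph-mds.target                     # on every MDS host
  ceph fs rm cephfs --yes-i-really-mean-it           # may need 'ceph mds fail' first
  rados -p cephfs_metadata export /tmp/meta.export   # full copy incl. omap
  ceph osd pool create cephfs_metadata_new 128
  rados -p cephfs_metadata_new import /tmp/meta.export
  ceph fs new cephfs cephfs_metadata_new cephfs_data
  ceph fs reset cephfs --yes-i-really-mean-it
  systemctl start ceph-mds.target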

As Christian mentions later in the thread, 512 PGs isn't such a crazy
number, so you may well find that your real issue isn't the
configuration of your metadata pool.

John

> Thank you.
>
>
> Zhang Di
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
Looks good.
You can start several OSDs at a time as long as you have enough CPU and you're 
not saturating your drives or controllers.

Jan

> On 13 Jul 2016, at 15:09, Wido den Hollander  wrote:
> 
> 
>> Op 13 juli 2016 om 14:47 schreef Kees Meijs :
>> 
>> 
>> Thanks!
>> 
>> So to sum up, I'd best:
>> 
>>  * set the noout flag
>>  * stop the OSDs one by one
>>  * shut down the physical node
>>  * yank the OSD drives to prevent ceph-disk(8) from automatically
>>    activating them at boot time
>>  * do my maintenance
>>  * start the physical node
>>  * reseat and activate the OSD drives one by one
>>  * unset the noout flag
>> 
> 
> That should do it indeed. Take your time between the OSDs and that should 
> limit the 'downtime' for clients.
> 
> Wido
> 
>> On 13-07-16 14:39, Jan Schermer wrote:
>>> If you stop the OSDs cleanly then that should cause no disruption to 
>>> clients.
>>> Starting the OSD back up is another story, expect slow request for a while 
>>> there and unless you have lots of very fast CPUs on the OSD node, start 
>>> them one-by-one and not all at once.
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Götz Reinicke - IT Koordinator
Am 13.07.16 um 14:59 schrieb Joe Landman:
>
>
> On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:
>> 40Gbps can be used as 4*10Gbps
>>
>> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
>> ports", but extented to "usage of more than a single 10Gbps port, eg
>> 20Gbps etc too"
>>
>> Is there people here that are using more than 10G on an ceph server ?
>
> We have built, and are building Ceph units for some of our customers
> with dual 100Gb links.  The storage box was one of our all flash
> Unison units for OSDs.  Similarly, we have several customers actively
> using multiple 40GbE links on our 60 bay Unison spinning rust disk
> (SRD) box.
>
Now we get closer. Can you tell me which 40G Nic you use?

/götz



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 14:47 schreef Kees Meijs :
> 
> 
> Thanks!
> 
> So to sum up, I'd best:
> 
>   * set the noout flag
>   * stop the OSDs one by one
>   * shut down the physical node
>   * yank the OSD drives to prevent ceph-disk(8) from automatically
>     activating them at boot time
>   * do my maintenance
>   * start the physical node
>   * reseat and activate the OSD drives one by one
>   * unset the noout flag
> 

That should do it indeed. Take your time between the OSDs and that should limit 
the 'downtime' for clients.

Wido

> On 13-07-16 14:39, Jan Schermer wrote:
> > If you stop the OSDs cleanly then that should cause no disruption to 
> > clients.
> > Starting the OSD back up is another story, expect slow request for a while 
> > there and unless you have lots of very fast CPUs on the OSD node, start 
> > them one-by-one and not all at once.
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Jake Young
My OSDs have dual 40G NICs.  I typically don't use more than 1Gbps on
either network. During heavy recovery activity (like if I lose a whole
server), I've seen up to 12Gbps on the cluster network.

For reference my cluster is 9 OSD nodes with 9x 7200RPM 2TB OSDs. They all
have RAID cards with 4GB of RAM and a BBU. The disks are in single disk
RAID 1 to make use of the card's WB cache.

I can imagine with more servers, the peak recovery BW usage may go up even
more, to the max write rate to the RAID card's cache.

Jake



On Wednesday, July 13, 2016,  wrote:

> 40Gbps can be used as 4*10Gbps
>
> I guess welcome feedbacks should not be stuck by "usage of a 40Gbps
> ports", but extented to "usage of more than a single 10Gbps port, eg
> 20Gbps etc too"
>
> Is there people here that are using more than 10G on an ceph server ?
>
> On 13/07/2016 14:27, Wido den Hollander wrote:
> >
> >> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de >:
> >>
> >>
> >> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
>  Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator <
> goetz.reini...@filmakademie.de >:
> 
> 
>  Hi,
> 
>  can anybody give some realworld feedback on what hardware
>  (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The
> Ceph
>  Cluster will be mostly rbd images. S3 in the future, CephFS we will
> see :)
> 
>  Thanks for some feedback and hints! Regadrs . Götz
> 
> >>> Why do you think you need 40Gb? That's some serious traffic to the
> OSDs and I doubt it's really needed.
> >>>
> >>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with
> that?
> >>>
> >>> It's also better to have more smaller nodes than a few big nodes with
> Ceph.
> >>>
> >>> Wido
> >>>
> >> Hi Wido,
> >>
> >> may be my post was misleading. The OSD Nodes do have 10G, the FIleserver
> >> in front to the Clients/Destops should have 40G.
> >>
> >
> > Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do
> just fine I think.
> >
> > Still, 40GbE is a lot of bandwidth!
> >
> > Wido
> >
> >>
> >> OSD NODEs/Cluster 2*10Gb Bond  40G Fileserver 40G  1G/10G
> Clients
> >>
> >> /Götz
> >>
> >> ___
> >> ceph-users mailing list
> >> ceph-users@lists.ceph.com 
> >> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com 
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Joe Landman



On 07/13/2016 08:41 AM, c...@jack.fr.eu.org wrote:

40Gbps can be used as 4*10Gbps

I guess welcome feedback should not be restricted to "usage of a 40Gbps
port", but extended to "usage of more than a single 10Gbps port, e.g.
20Gbps etc. too"

Are there people here who are using more than 10G on a Ceph server?


We have built, and are building Ceph units for some of our customers 
with dual 100Gb links.  The storage box was one of our all flash Unison 
units for OSDs.  Similarly, we have several customers actively using 
multiple 40GbE links on our 60 bay Unison spinning rust disk (SRD) box.


--
Joe Landman
Scalable Informatics, Inc.
e: land...@scalableinformatics.com
w: http://scalableinformatics.com
t: @scalableinfo
p: +1 734 786 8423 x121
c: +1 734 612 4615

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Götz Reinicke - IT Koordinator
Am 13.07.16 um 14:27 schrieb Wido den Hollander:
>> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator 
>> :
>>
>>
>> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
 Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator 
 :


 Hi,

 can anybody give some realworld feedback on what hardware
 (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The Ceph
 Cluster will be mostly rbd images. S3 in the future, CephFS we will see :)

 Thanks for some feedback and hints! Regadrs . Götz

>>> Why do you think you need 40Gb? That's some serious traffic to the OSDs and 
>>> I doubt it's really needed.
>>>
>>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with that?
>>>
>>> It's also better to have more smaller nodes than a few big nodes with Ceph.
>>>
>>> Wido
>>>
>> Hi Wido,
>>
>> may be my post was misleading. The OSD Nodes do have 10G, the FIleserver
>> in front to the Clients/Destops should have 40G.
>>
> Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do just 
> fine I think.
True @re-export
> Still, 40GbE is a lot of bandwidth!
:) I know, but we have users who like to transfer e.g. raw movie
footage for a normal project, which can quickly reach 1TB, and they don't
want to wait hours ;). Or others like to screen/stream raw 4K video
footage, which is +- 10Gb/second ... That's the challenge :)

And yes, our Ceph cluster is well designed .. on paper ;) SSDs
considered, with lots of helpful feedback from the list!!

I am just trying to find Linux/Ceph users with 40Gb experience :)

cheers . Götz



smime.p7s
Description: S/MIME Cryptographic Signature
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Kees Meijs
Thanks!

So to sum up, I'd best do the following (a command sketch follows the list):

  * set the noout flag
  * stop the OSDs one by one
  * shut down the physical node
  * yank the OSD drives to prevent ceph-disk(8) from automatically
    activating them at boot time
  * do my maintenance
  * start the physical node
  * reseat and activate the OSD drives one by one
  * unset the noout flag
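
A command-level sketch of that list (OSD ids, unit names and the systemd style
are assumptions; adjust to the release and distribution in use):

  ceph osd set noout
  systemctl stop ceph-osd@3          # one OSD at a time, waiting in between
  systemctl stop ceph-osd@4
  shutdown -h now                    # then pull the OSD drives
  # ...maintenance...
  # after boot: reseat the drives one by one (udev/ceph-disk activates them),
  # or start them by hand:
  systemctl start ceph-osd@3
  systemctl start ceph-osd@4
  ceph osd unset noout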

On 13-07-16 14:39, Jan Schermer wrote:
> If you stop the OSDs cleanly then that should cause no disruption to clients.
> Starting the OSD back up is another story, expect slow request for a while 
> there and unless you have lots of very fast CPUs on the OSD node, start them 
> one-by-one and not all at once.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread ceph
40Gbps can be used as 4*10Gbps

I guess welcome feedback should not be restricted to "usage of a 40Gbps
port", but extended to "usage of more than a single 10Gbps port, e.g.
20Gbps etc. too"

Are there people here who are using more than 10G on a Ceph server?

On 13/07/2016 14:27, Wido den Hollander wrote:
> 
>> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator 
>> :
>>
>>
>> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
 Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator 
 :


 Hi,

 can anybody give some realworld feedback on what hardware
 (CPU/Cores/NIC) you use for a 40Gb (file)server (smb and nfs)? The Ceph
 Cluster will be mostly rbd images. S3 in the future, CephFS we will see :)

 Thanks for some feedback and hints! Regadrs . Götz

>>> Why do you think you need 40Gb? That's some serious traffic to the OSDs and 
>>> I doubt it's really needed.
>>>
>>> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with that?
>>>
>>> It's also better to have more smaller nodes than a few big nodes with Ceph.
>>>
>>> Wido
>>>
>> Hi Wido,
>>
>> may be my post was misleading. The OSD Nodes do have 10G, the FIleserver
>> in front to the Clients/Destops should have 40G.
>>
> 
> Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do just 
> fine I think.
> 
> Still, 40GbE is a lot of bandwidth!
> 
> Wido
> 
>>
>> OSD NODEs/Cluster 2*10Gb Bond  40G Fileserver 40G  1G/10G Clients
>>
>> /Götz
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Jan Schermer
If you stop the OSDs cleanly then that should cause no disruption to clients.
Starting the OSD back up is another story, expect slow request for a while 
there and unless you have lots of very fast CPUs on the OSD node, start them 
one-by-one and not all at once.


Jan


> On 13 Jul 2016, at 14:37, Wido den Hollander  wrote:
> 
> 
>> Op 13 juli 2016 om 14:31 schreef Kees Meijs :
>> 
>> 
>> Hi Cephers,
>> 
>> There's some physical maintenance I need to perform on an OSD node.
>> Very likely the maintenance is going to take a while since it involves
>> replacing components, so I would like to be well prepared.
>> 
>> Unfortunately it is not an option to add another OSD node or to rebalance at
>> this time, so I'm planning to operate in a degraded state during the
>> maintenance.
>> 
>> If at all possible, I would like to shut down the OSD node cleanly and
>> prevent slow (or even blocking) requests on Ceph clients.
>> 
>> Just setting the noout flag and shutting down the OSDs on the given node
>> is not enough as it seems. In fact clients do not act that well in this
>> case. Connections time out and for a while I/O seems to stall.
>> 
> 
> noout doesn't do anything with the clients, it just tells the cluster not to 
> mark any OSD as out after they go down.
> 
> If you want to do this slowly, take the OSDs down one by one and wait for the 
> PGs to become active+X again.
> 
> When you start, do the same again, start them one by one.
> 
> You will always have a short moment where the PGs are inactive.
> 
>> Any thoughts on this, anyone? For example, is it a sensible idea and are
>> writes still possible? Let's assume there are OSDs on to the
>> to-be-maintained host which are primary for sure.
>> 
>> Thanks in advance!
>> 
>> Cheers,
>> Kees
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Physical maintainance

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 14:31 schreef Kees Meijs :
> 
> 
> Hi Cephers,
> 
> There's some physical maintenance I need to perform on an OSD node.
> Very likely the maintenance is going to take a while since it involves
> replacing components, so I would like to be well prepared.
> 
> Unfortunately it is not an option to add another OSD node or rebalance at
> this time, so I'm planning to operate in a degraded state during the
> maintenance.
> 
> If at all possible, I would like to shut down the OSD node cleanly and
> prevent slow (or even blocking) requests on Ceph clients.
> 
> Just setting the noout flag and shutting down the OSDs on the given node
> is not enough, it seems. In fact clients do not behave that well in this
> case. Connections time out and for a while I/O seems to stall.
> 

noout doesn't do anything for the clients; it just tells the cluster not to 
mark any OSD as out after it goes down.

If you want to do this slowly, take the OSDs down one by one and wait for the 
PGs to become active+X again.

When you start them again, do the same: start them one by one.

You will always have a short moment where the PGs are inactive.
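
In practice that boils down to something like this (rough sketch; the OSD ids
and the init commands depend on your distribution):

  ceph osd set noout
  systemctl stop ceph-osd@12     # or: service ceph stop osd.12
  ceph pg stat                   # wait for the PGs to go back to active+X
  systemctl stop ceph-osd@13
  # repeat for the remaining OSDs on the node, do the maintenance,
  # start them one by one again, and finally:
  ceph osd unset noout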

> Any thoughts on this, anyone? For example, is it a sensible idea and are
> writes still possible? Let's assume there are OSDs on the
> to-be-maintained host which are primary for some PGs.
> 
> Thanks in advance!
> 
> Cheers,
> Kees
> 
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 12:00 schreef Götz Reinicke - IT Koordinator 
> :
> 
> 
> Am 13.07.16 um 11:47 schrieb Wido den Hollander:
> >> Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator 
> >> :
> >>
> >>
> >> Hi,
> >>
> >> Can anybody give some real-world feedback on what hardware
> >> (CPU/Cores/NIC) you use for a 40Gb (file)server (SMB and NFS)? The Ceph
> >> cluster will be mostly RBD images. S3 in the future, CephFS we will see :)
> >>
> >> Thanks for some feedback and hints! Regards, Götz
> >>
> > Why do you think you need 40Gb? That's some serious traffic to the OSDs and 
> > I doubt it's really needed.
> >
> > Latency-wise 40Gb isn't much better than 10Gb, so why not stick with that?
> >
> > It's also better to have more smaller nodes than a few big nodes with Ceph.
> >
> > Wido
> >
> Hi Wido,
> 
> Maybe my post was misleading. The OSD nodes do have 10G; the fileserver
> in front, towards the clients/desktops, should have 40G.
> 

Ah, the fileserver will re-export RBD/Samba? Any Xeon E5 CPU will do just fine 
I think.

Still, 40GbE is a lot of bandwidth!

Wido

> 
> OSD NODEs/Cluster 2*10Gb Bond  40G Fileserver 40G  1G/10G Clients
> 
> /Götz
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Renaming pools

2016-07-13 Thread Mateusz Skała
Hello,

Is it safe to rename a pool that has a cache tier? I want to standardize the
pool names, for example pools 'prod01' and 'cache-prod01'. Should I remove the
cache tier before renaming?
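
For reference, the operations I have in mind are roughly (a sketch only;
'oldpool'/'old-cache' are placeholders for the current names, and the exact
cache-tier steps depend on the cache mode):

  # option 1: rename in place
  ceph osd pool rename oldpool prod01
  ceph osd pool rename old-cache cache-prod01

  # option 2: detach the cache tier first, rename, then re-add it
  rados -p old-cache cache-flush-evict-all
  ceph osd tier remove-overlay oldpool
  ceph osd tier remove oldpool old-cache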

Regards,

-- 
Mateusz Skała
mateusz.sk...@budikom.net

budikom.net
ul. Trzy Lipy 3, GPNT, bud. C
80-172 Gdańsk
email: bi...@budikom.net
tel. +48 58 58 58 708

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Götz Reinicke - IT Koordinator
Am 13.07.16 um 11:47 schrieb Wido den Hollander:
>> Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator 
>> :
>>
>>
>> Hi,
>>
>> Can anybody give some real-world feedback on what hardware
>> (CPU/Cores/NIC) you use for a 40Gb (file)server (SMB and NFS)? The Ceph
>> cluster will be mostly RBD images. S3 in the future, CephFS we will see :)
>>
>> Thanks for some feedback and hints! Regards, Götz
>>
> Why do you think you need 40Gb? That's some serious traffic to the OSDs and I 
> doubt it's really needed.
>
> Latency-wise 40Gb isn't much better than 10Gb, so why not stick with that?
>
> It's also better to have more smaller nodes than a few big nodes with Ceph.
>
> Wido
>
Hi Wido,

Maybe my post was misleading. The OSD nodes do have 10G; the fileserver
in front, towards the clients/desktops, should have 40G.


OSD NODEs/Cluster 2*10Gb Bond  40G Fileserver 40G  1G/10G Clients

/Götz



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-07-13 Thread Kees Meijs
Hi,

This is an OSD box running Hammer on Ubuntu 14.04 LTS with additional
systems administration tools:
> $ df -h | grep -v /var/lib/ceph/osd
> Filesystem  Size  Used Avail Use% Mounted on
> udev5,9G  4,0K  5,9G   1% /dev
> tmpfs   1,2G  892K  1,2G   1% /run
> /dev/dm-1   203G  2,1G  200G   2% /
> none4,0K 0  4,0K   0% /sys/fs/cgroup
> none5,0M 0  5,0M   0% /run/lock
> none5,9G 0  5,9G   0% /run/shm
> none100M 0  100M   0% /run/user
> /dev/dm-1   203G  2,1G  200G   2% /home

As you can see, less than 10G is actually used.

Regards,
Kees

On 13-07-16 11:51, Ashley Merrick wrote:
> It may sound like a random question, but what size would you recommend for the
> SATA-DOM? Obviously I know the standard OS space requirements, but will Ceph
> require much on the root OS of an OSD-only node apart from standard logs?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 11:51 schreef Ashley Merrick :
> 
> 
> Okie perfect.
> 
> It may sound like a random question, but what size would you recommend for the
> SATA-DOM? Obviously I know the standard OS space requirements, but will Ceph
> require much on the root OS of an OSD-only node apart from standard logs?
> 

32GB should be sufficient; even with 16GB you should be OK. But I'd go with 
32GB so you have enough space when you need it.

Wido

> ,Ashley
> 
> -Original Message-
> From: Wido den Hollander [mailto:w...@42on.com] 
> Sent: 13 July 2016 10:44
> To: Ashley Merrick ; ceph-users@lists.ceph.com; 
> Christian Balzer 
> Subject: RE: [ceph-users] SSD Journal
> 
> 
> > Op 13 juli 2016 om 11:34 schreef Ashley Merrick :
> > 
> > 
> > Hello,
> > 
> > Looking at using 2 x 960GB SSD's (SM863)
> > 
> > The reason for the larger size is I was thinking we'd be better off with them
> > in RAID 1, so there's enough space for the OS and all journals.
> > 
> > Or am I better off using 2 x 200GB S3700s instead, with 5 disks per SSD?
> > 
> 
> Both the Samsung SM and Intel DC (3510/3710) SSDs are good. If you can, put 
> the OS on its own device. Maybe a SATA-DOM, for example?
> 
> Wido
> 
> > Thanks,
> > Ashley
> > 
> > -Original Message-
> > From: Christian Balzer [mailto:ch...@gol.com] 
> > Sent: 13 July 2016 01:12
> > To: ceph-users@lists.ceph.com
> > Cc: Wido den Hollander ; Ashley Merrick 
> > 
> > Subject: Re: [ceph-users] SSD Journal
> > 
> > 
> > Hello,
> > 
> > On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
> > 
> > > 
> > > > Op 12 juli 2016 om 15:31 schreef Ashley Merrick :
> > > > 
> > > > 
> > > > Hello,
> > > > 
> > > > Looking at final stages of planning / setup for a CEPH Cluster.
> > > > 
> > > > Per a Storage node looking @
> > > > 
> > > > 2 x SSD OS / Journal
> > > > 10 x SATA Disk
> > > > 
> > > > Will have a small Raid 1 Partition for the OS, however not sure if best 
> > > > to do:
> > > > 
> > > > 5 x Journal Per a SSD
> > > 
> > > Best solution. Will give you the most performance for the OSDs. RAID-1 
> > > will just burn through cycles on the SSDs.
> > > 
> > > SSDs don't fail that often.
> > >
> > What Wido wrote, but let us know what SSDs you're planning to use.
> > 
> > Because the detailed version of that sentence should read: 
> > "Well known and tested DC level SSDs whose size/endurance levels are 
> > matched to the workload rarely fail, especially unexpected."
> >  
> > > Wido
> > > 
> > > > 10 x Journal on Raid 1 of two SSD's
> > > > 
> > > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
> > > > worth the "issue" caused when one SSD goes down?
> > > > 
> > As always, assume at least a node being the failure domain you need to be 
> > able to handle.
> > 
> > Christian
> > 
> > > > Thanks,
> > > > Ashley
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > 
> > 
> > 
> > -- 
> > Christian Balzer        Network/Systems Engineer
> > ch...@gol.com   Global OnLine Japan/Rakuten Communications
> > http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-07-13 Thread Ashley Merrick
Okie perfect.

It may sound like a random question, but what size would you recommend for the
SATA-DOM? Obviously I know the standard OS space requirements, but will Ceph
require much on the root OS of an OSD-only node apart from standard logs?

,Ashley

-Original Message-
From: Wido den Hollander [mailto:w...@42on.com] 
Sent: 13 July 2016 10:44
To: Ashley Merrick ; ceph-users@lists.ceph.com; 
Christian Balzer 
Subject: RE: [ceph-users] SSD Journal


> Op 13 juli 2016 om 11:34 schreef Ashley Merrick :
> 
> 
> Hello,
> 
> Looking at using 2 x 960GB SSD's (SM863)
> 
> The reason for the larger size is I was thinking we'd be better off with them
> in RAID 1, so there's enough space for the OS and all journals.
> 
> Or am I better off using 2 x 200GB S3700s instead, with 5 disks per SSD?
> 

Both the Samsung SM and Intel DC (3510/3710) SSDs are good. If you can, put the 
OS on its own device. Maybe a SATA-DOM, for example?

Wido

> Thanks,
> Ashley
> 
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com] 
> Sent: 13 July 2016 01:12
> To: ceph-users@lists.ceph.com
> Cc: Wido den Hollander ; Ashley Merrick 
> Subject: Re: [ceph-users] SSD Journal
> 
> 
> Hello,
> 
> On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
> 
> > 
> > > Op 12 juli 2016 om 15:31 schreef Ashley Merrick :
> > > 
> > > 
> > > Hello,
> > > 
> > > Looking at final stages of planning / setup for a CEPH Cluster.
> > > 
> > > Per a Storage node looking @
> > > 
> > > 2 x SSD OS / Journal
> > > 10 x SATA Disk
> > > 
> > > Will have a small Raid 1 Partition for the OS, however not sure if best 
> > > to do:
> > > 
> > > 5 x Journal Per a SSD
> > 
> > Best solution. Will give you the most performance for the OSDs. RAID-1 will 
> > just burn through cycles on the SSDs.
> > 
> > SSDs don't fail that often.
> >
> What Wido wrote, but let us know what SSDs you're planning to use.
> 
> Because the detailed version of that sentence should read: 
> "Well known and tested DC level SSDs whose size/endurance levels are matched 
> to the workload rarely fail, especially unexpected."
>  
> > Wido
> > 
> > > 10 x Journal on Raid 1 of two SSD's
> > > 
> > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
> > > worth the "issue" caused when one SSD goes down?
> > > 
> As always, assume at least a node being the failure domain you need to be 
> able to handle.
> 
> Christian
> 
> > > Thanks,
> > > Ashley
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 8:19 schreef Götz Reinicke - IT Koordinator 
> :
> 
> 
> Hi,
> 
> Can anybody give some real-world feedback on what hardware
> (CPU/Cores/NIC) you use for a 40Gb (file)server (SMB and NFS)? The Ceph
> cluster will be mostly RBD images. S3 in the future, CephFS we will see :)
> 
> Thanks for some feedback and hints! Regards, Götz
> 

Why do you think you need 40Gb? That's some serious traffic to the OSDs and I 
doubt it's really needed.

Latency-wise 40Gb isn't much better than 10Gb, so why not stick with that?

It's also better to have more smaller nodes than a few big nodes with Ceph.

Wido

> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-07-13 Thread Wido den Hollander

> Op 13 juli 2016 om 11:34 schreef Ashley Merrick :
> 
> 
> Hello,
> 
> Looking at using 2 x 960GB SSD's (SM863)
> 
> The reason for the larger size is I was thinking we'd be better off with them
> in RAID 1, so there's enough space for the OS and all journals.
> 
> Or am I better off using 2 x 200GB S3700s instead, with 5 disks per SSD?
> 

Both the Samsung SM and Intel DC (3510/3710) SSDs are good. If you can, put the 
OS on its own device. Maybe a SATA-DOM, for example?

Wido

> Thanks,
> Ashley
> 
> -Original Message-
> From: Christian Balzer [mailto:ch...@gol.com] 
> Sent: 13 July 2016 01:12
> To: ceph-users@lists.ceph.com
> Cc: Wido den Hollander ; Ashley Merrick 
> Subject: Re: [ceph-users] SSD Journal
> 
> 
> Hello,
> 
> On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:
> 
> > 
> > > Op 12 juli 2016 om 15:31 schreef Ashley Merrick :
> > > 
> > > 
> > > Hello,
> > > 
> > > Looking at final stages of planning / setup for a CEPH Cluster.
> > > 
> > > Per a Storage node looking @
> > > 
> > > 2 x SSD OS / Journal
> > > 10 x SATA Disk
> > > 
> > > Will have a small Raid 1 Partition for the OS, however not sure if best 
> > > to do:
> > > 
> > > 5 x Journal Per a SSD
> > 
> > Best solution. Will give you the most performance for the OSDs. RAID-1 will 
> > just burn through cycles on the SSDs.
> > 
> > SSDs don't fail that often.
> >
> What Wido wrote, but let us know what SSDs you're planning to use.
> 
> Because the detailed version of that sentence should read: 
> "Well known and tested DC level SSDs whose size/endurance levels are matched 
> to the workload rarely fail, especially unexpected."
>  
> > Wido
> > 
> > > 10 x Journal on Raid 1 of two SSD's
> > > 
> > > Is the "Performance" increase from splitting 5 Journal's on each SSD 
> > > worth the "issue" caused when one SSD goes down?
> > > 
> As always, assume at least a node being the failure domain you need to be 
> able to handle.
> 
> Christian
> 
> > > Thanks,
> > > Ashley
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > 
> 
> 
> -- 
> Christian Balzer        Network/Systems Engineer
> ch...@gol.com Global OnLine Japan/Rakuten Communications
> http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Journal

2016-07-13 Thread Ashley Merrick
Hello,

Looking at using 2 x 960GB SSD's (SM863)

The reason for the larger size is I was thinking we'd be better off with them in
RAID 1, so there's enough space for the OS and all journals.

Or am I better off using 2 x 200GB S3700s instead, with 5 disks per SSD?

Thanks,
Ashley

-Original Message-
From: Christian Balzer [mailto:ch...@gol.com] 
Sent: 13 July 2016 01:12
To: ceph-users@lists.ceph.com
Cc: Wido den Hollander ; Ashley Merrick 
Subject: Re: [ceph-users] SSD Journal


Hello,

On Tue, 12 Jul 2016 19:14:14 +0200 (CEST) Wido den Hollander wrote:

> 
> > Op 12 juli 2016 om 15:31 schreef Ashley Merrick :
> > 
> > 
> > Hello,
> > 
> > Looking at final stages of planning / setup for a CEPH Cluster.
> > 
> > Per a Storage node looking @
> > 
> > 2 x SSD OS / Journal
> > 10 x SATA Disk
> > 
> > Will have a small Raid 1 Partition for the OS, however not sure if best to 
> > do:
> > 
> > 5 x Journal Per a SSD
> 
> Best solution. Will give you the most performance for the OSDs. RAID-1 will 
> just burn through cycles on the SSDs.
> 
> SSDs don't fail that often.
>
What Wido wrote, but let us know what SSDs you're planning to use.

Because the detailed version of that sentence should read: 
"Well known and tested DC level SSDs whose size/endurance levels are matched to 
the workload rarely fail, especially unexpected."
 
> Wido
> 
> > 10 x Journal on Raid 1 of two SSD's
> > 
> > Is the "Performance" increase from splitting 5 Journal's on each SSD worth 
> > the "issue" caused when one SSD goes down?
> > 
As always, assume at least a node being the failure domain you need to be able 
to handle.

Christian

> > Thanks,
> > Ashley
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Rakuten Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Change with disk from 1TB to 2TB

2016-07-13 Thread 王和勇
yes
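
I.e. the usual replace-a-failed-OSD sequence; roughly (sketch only, osd.12 and
the device names are just examples):

  ceph osd out 12
  # wait for rebalancing to finish, then:
  systemctl stop ceph-osd@12        # or: service ceph stop osd.12
  ceph osd crush remove osd.12
  ceph auth del osd.12
  ceph osd rm 12
  # swap the 1T disk for the 2T one, then recreate the OSD:
  ceph-disk prepare /dev/sdX        # add a journal device if you use one
  ceph-disk activate /dev/sdX1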

From: Fabio - NS3 srl
Date: 2016-07-13 16:11
To: ceph-us...@ceph.com
Subject: [ceph-users] Change with disk from 1TB to 2TB
Hello,
I have a Ceph cluster with many 2T disks and only one 1T disk (by mistake).

I want to replace the 1T disk with a 2T one...

What is the correct procedure?

The same as for replacing a broken disk?


many thanks

-- 

Fabio  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Change with disk from 1TB to 2TB

2016-07-13 Thread Fabio - NS3 srl

Hello,
I have a Ceph cluster with many 2T disks and only one 1T disk (by mistake).

I want to replace the 1T disk with a 2T one...

What is the correct procedure?

The same as for replacing a broken disk?


many thanks
--
*Fabio *


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Kees Meijs
Hi,

If qemu-img is able to handle RBD in a clever way (and I assume it is), it
will write the image sparsely to the Ceph pool.

But, it is an assumption! Maybe someone else could shed some light on this?

Or even better: read the source, the RBD handler specifically.

And last but not least, create an empty test image in qcow2 sparse
format of e.g. 10G and store it on Ceph. In other words: just test it
and you'll know for sure.
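
A quick way to test it (sketch, untested; pool and image names are examples):

  qemu-img create -f qcow2 test.qcow2 10G
  qemu-img convert -f qcow2 -O raw test.qcow2 rbd:rbd/test-image
  rbd du rbd/test-image    # compare allocated vs. provisioned size
                           # (on older releases, sum the output of 'rbd diff')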

Cheers,
Kees

On 13-07-16 09:31, Fran Barrera wrote:
> Yes, but it's the same problem, isn't it? The image will be too large
> because the format is raw.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Fran Barrera
Yes, but it's the same problem, isn't it? The image will be too large because
the format is raw.

Thanks.

2016-07-13 9:24 GMT+02:00 Kees Meijs :

> Hi Fran,
>
> Fortunately, qemu-img(1) is able to directly utilise RBD (supporting
> sparse block devices)!
>
> Please refer to http://docs.ceph.com/docs/hammer/rbd/qemu-rbd/ for
> examples.
>
> Cheers,
> Kees
>
> On 13-07-16 09:18, Fran Barrera wrote:
> > Can you explain how you do this procedure? I have the same problem
> > with the large images and snapshots.
> >
> > This is what I do:
> >
> > # qemu-img convert -f qcow2 -O raw image.qcow2 image.img
> > # openstack image create image.img
> >
> > But the image.img is too large.
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Kees Meijs
Hi Fran,

Fortunately, qemu-img(1) is able to directly utilise RBD (supporting
sparse block devices)!

Please refer to http://docs.ceph.com/docs/hammer/rbd/qemu-rbd/ for examples.
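
I.e. something like the following writes the converted image straight into the
pool, without an intermediate raw file (pool/image names are just examples):

  qemu-img convert -f qcow2 -O raw image.qcow2 rbd:images/image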

Cheers,
Kees

On 13-07-16 09:18, Fran Barrera wrote:
> Can you explain how you do this procedure? I have the same problem
> with the large images and snapshots.
>
> This is what I do:
>
> # qemu-img convert -f qcow2 -O raw image.qcow2 image.img
> # openstack image create image.img
>
> But the image.img is too large.

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Fran Barrera
Hello,

Can you explain how you do this procedure? I have the same problem with the
large images and snapshots.

This is what I do:

# qemu-img convert -f qcow2 -O raw image.qcow2 image.img
# openstack image create image.img

But the image.img is too large.

Thanks,
Fran.

2016-07-13 8:29 GMT+02:00 Kees Meijs :

> Sorry, should have posted this to the list.
>
>  Forwarded Message 
> Subject: Re: [ceph-users] (no subject)
> Date: Tue, 12 Jul 2016 08:30:49 +0200
> From: Kees Meijs  
> To: Gaurav Goyal  
>
> Hi Gaurav,
>
> It might seem a little far-fetched, but I'd use the qemu-img(1) tool to
> convert the qcow2 image file to a Ceph-backed volume.
>
> First of all, create a volume of appropriate size in Cinder. The volume
> will be sparse. Then, figure out the identifier and use rados(8) to find
> the exact name of the volume in Ceph.
>
> Finally, use qemu-img(1) and point to the volume you just found out about.
>
> Cheers,
> Kees
>
> On 11-07-16 18:07, Gaurav Goyal wrote:
> > Thanks!
> >
> > I need to create a VM whose qcow2 image file is 6.7 GB, but the raw image
> > is 600 GB, which is too big.
> > Is there a way that I don't need to convert the qcow2 file to raw and it
> > still works well with RBD?
>
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Re: (no subject)

2016-07-13 Thread Kees Meijs
Sorry, should have posted this to the list.

 Forwarded Message 
Subject:Re: [ceph-users] (no subject)
Date:   Tue, 12 Jul 2016 08:30:49 +0200
From:   Kees Meijs 
To: Gaurav Goyal 



Hi Gaurav,

It might seem a little far-fetched, but I'd use the qemu-img(1) tool to
convert the qcow2 image file to a Ceph-backed volume.

First of all, create a volume of appropriate size in Cinder. The volume
will be sparse. Then, figure out the identifier and use rados(8) to find
the exact name of the volume in Ceph.

Finally, use qemu-img(1) and point to the volume you just found out about.
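
Roughly like this (a sketch; the pool name 'volumes', the size and the UUID are
placeholders, and Cinder normally names the RBD image volume-<uuid>):

  cinder create --display-name temp-vol 600
  rados -p volumes ls | grep volume-     # find the exact name in the pool
  qemu-img convert -f qcow2 -O raw image.qcow2 rbd:volumes/volume-<uuid>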

Cheers,
Kees

On 11-07-16 18:07, Gaurav Goyal wrote:
> Thanks!
>
> I need to create a VM whose qcow2 image file is 6.7 GB, but the raw image
> is 600 GB, which is too big.
> Is there a way that I don't need to convert the qcow2 file to raw and it
> still works well with RBD?



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] setting crushmap while creating pool fails

2016-07-13 Thread Wido den Hollander

> Op 12 juli 2016 om 22:30 schreef Oliver Dzombic :
> 
> 
> Hi,
> 
> i have a crushmap which looks like:
> 
> http://pastebin.com/YC9FdTUd
> 
> I issue:
> 
> # ceph osd pool create vmware1 64 cold-storage-rule
> pool 'vmware1' created
> 
> I would expect the pool to have ruleset 2.
> 
> #ceph osd pool ls detail
> 
> pool 10 'vmware1' replicated size 3 min_size 2 crush_ruleset 1
> object_hash rjenkins pg_num 64 pgp_num 64 last_change 483 flags
> hashpspool stripe_width 0
> 
> but it has crush_ruleset 1.
> 
> 
> Why ?

What happens if you set 'osd_pool_default_crush_replicated_ruleset' to 2 and 
try again?

Should be set in the [global] or [mon] section.
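
Something like this in ceph.conf (assuming ruleset 2 is indeed the id of
cold-storage-rule in your map):

  [global]
  osd_pool_default_crush_replicated_ruleset = 2

Or simply move the existing pool to the right ruleset:

  ceph osd pool set vmware1 crush_ruleset 2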

Wido

> 
> Thank you !
> 
> 
> -- 
> Mit freundlichen Gruessen / Best regards
> 
> Oliver Dzombic
> IP-Interactive
> 
> mailto:i...@ip-interactive.de
> 
> Anschrift:
> 
> IP Interactive UG ( haftungsbeschraenkt )
> Zum Sonnenberg 1-3
> 63571 Gelnhausen
> 
> HRB 93402 beim Amtsgericht Hanau
> Geschäftsführung: Oliver Dzombic
> 
> Steuer Nr.: 35 236 3622 1
> UST ID: DE274086107
> 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 40Gb fileserver/NIC suggestions

2016-07-13 Thread Götz Reinicke - IT Koordinator
Hi,

Can anybody give some real-world feedback on what hardware
(CPU/Cores/NIC) you use for a 40Gb (file)server (SMB and NFS)? The Ceph
cluster will be mostly RBD images. S3 in the future, CephFS we will see :)

Thanks for some feedback and hints! Regards, Götz




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Emergency! Production cluster is down

2016-07-13 Thread Wido den Hollander

> Op 12 juli 2016 om 23:10 schreef Chandrasekhar Reddy 
> :
> 
> 
> Hi Wido,
> 
> Thank you for helping out. It worked like a charm. I followed these steps:
> 
> http://docs.ceph.com/docs/master/rados/operations/add-or-rm-mons/#removing-monitors
> 
> Can you help by sharing any good docs which deal with backups?
> 

Backups for Ceph really depend on the use case; there is no general 
recommendation for backups.

With Jewel, for example, you can use RBD mirroring to back up RBD data, or with 
CephFS you can use old-fashioned rsync.
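
For example (names are placeholders, just to illustrate):

  # one-off backup of a single image
  rbd export rbd/myimage /backup/myimage-20160713.img

  # or, with Jewel's RBD mirroring (needs the journaling image feature,
  # a running rbd-mirror daemon and a configured peer cluster):
  rbd mirror pool enable rbd image
  rbd mirror image enable rbd/myimage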

Wido

> Thanks,
> Chandra.
> 
> On Tue, Jul 12, 2016 at 10:37 PM, Chandrasekhar Reddy <
> chandrasekha...@payoda.com> wrote:
> 
> > Thanks wido..  I will give a try.
> >
> > Thanks,
> > Chandra
> > On Tue, Jul 12, 2016 at 10:35 PM, Wido den Hollander 
> > wrote:
> >
> >
> > > Op 12 juli 2016 om 19:00 schreef Chandrasekhar Reddy <
> > chandrasekha...@payoda.com>:
> > >
> > >
> > > Thanks for quick reply..
> > >
> > > Should I need to remove cephx in osd nodes also??
> > >
> > disable all cephx on all nodes in the ceph.conf
> >
> > See: http://docs.ceph.com/docs/master/rados/configuration/auth-config-ref/
> >
> > Add this to the [global] section:
> >
> > auth_cluster_required = none
> > auth_service_required = none
> > auth_client_required = none
> >
> > You still have the problem that your monitor map contains 3 monitors. You
> > removed it from the ceph.conf, but that is not sufficient. You will need to
> > inject the monmap with just one monitor into the remaining monitor.
> >
> > BEFORE YOU DO, CREATE A BACKUP OF THE MON'S DATA STORE.
> >
> > I don't know the commands from the top of my head, but 'monmaptool' is
> > something you will need/want.
> >
> > Wido
> >
> > > Thanks,
> > > Chandra
> > >
> > > On Tue, Jul 12, 2016 at 10:22 PM, Oliver Dzombic <
> > i...@ip-interactive.de [i...@ip-interactive.de] > wrote:
> > > Hi,
> > >
> > > fast aid: remove cephx authentication.
> > >
> > > --
> > > Mit freundlichen Gruessen / Best regards
> > >
> > > Oliver Dzombic
> > > IP-Interactive
> > >
> > > mailto:i...@ip-interactive.de
> > >
> > > Anschrift:
> > >
> > > IP Interactive UG ( haftungsbeschraenkt )
> > > Zum Sonnenberg 1-3
> > > 63571 Gelnhausen
> > >
> > > HRB 93402 beim Amtsgericht Hanau
> > > Geschäftsführung: Oliver Dzombic
> > >
> > > Steuer Nr.: 35 236 3622 1
> > > UST ID: DE274086107
> > >
> > >
> > > Am 12.07.2016 um 18:45 schrieb Chandrasekhar Reddy:
> > > > Hi Guys,
> > > >
> > > > Need help. I had 3 monitor nodes and 2 went down (disk got corrupted).
> > > > After some time even the 3rd monitor went unresponsive, so I rebooted the
> > > > 3rd node. It came up but Ceph is not working.
> > > >
> > > > So I tried to remove the 2 failed monitors from the ceph.conf file and
> > > > restarted the mon and OSDs, but Ceph is still not up.
> > > >
> > > > Please find the log files attached:
> > > >
> > > > 1. Log file of ceph-mon.openstack01-vm001.log ( Monitor node )
> > > >
> > > > http://paste.openstack.org/show/530944/
> > > >
> > > > 2. ceph.conf
> > > >
> > > > http://paste.openstack.org/show/530945/
> > > >
> > > > 3. ceph -w output
> > > >
> > > > http://paste.openstack.org/show/530947/
> > > >
> > > > 4. ceph mon dump
> > > >
> > > > http://paste.openstack.org/show/530950/
> > > > 
> > > >
> > > > The errors I see are:
> > > >
> > > > monclient(hunting): authenticate timed out after 300
> > > >
> > > > librados: client.admin authentication error (110) Connection timed out
> > > >
> > > > Any suggestions? Please help ...
> > > >
> > > > Thanks
> > > > Chandra
> > > >
> > > >
> > > >
> > > > ___
> > > > ceph-users mailing list
> > > > ceph-users@lists.ceph.com
> > > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> > > >
> > > ___
> > > ceph-users mailing list
> > > ceph-users@lists.ceph.com
> > >
> > > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
> >
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com