[ceph-users] creating empty object store when osd activate

2017-03-20 Thread 奧圖碼
Hi all,

There is an error message when I run “ceph-deploy osd activate …”.
Before that, I had run “ceph-deploy osd prepare t-ceph-01:/mnt/dev/
t-ceph-02:/mnt/dev/” successfully.
Does anyone know the reason? Did I do anything wrong?
Thank you.

ceph@t-ceph-01:/home/ubuntu$ ceph --version
ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)

ceph@t-ceph-01:/home/ubuntu$ df -Th
Filesystem Type  Size  Used Avail Use% Mounted on
udev   devtmpfs  492M   12K  492M   1% /dev
tmpfs  tmpfs 100M  336K   99M   1% /run
/dev/xvda1 ext4  7.8G  1.8G  5.6G  25% /
none   tmpfs 4.0K 0  4.0K   0% /sys/fs/cgroup
none   tmpfs 5.0M 0  5.0M   0% /run/lock
none   tmpfs 497M 0  497M   0% /run/shm
none   tmpfs 100M 0  100M   0% /run/user
/dev/xvdf  ext4  2.0G  1.8G  4.0K 100% /mnt/dev

Here is the error message:
ceph@t-ceph-01:~$ ceph-deploy osd activate t-ceph-01:/mnt/dev/ 
t-ceph-02:/mnt/dev/
[ceph_deploy.conf][DEBUG ] found configuration file at: 
/var/lib/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.37): /usr/bin/ceph-deploy osd activate 
t-ceph-01:/mnt/dev/ t-ceph-02:/mnt/dev/
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: activate
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   : 

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  disk  : [('t-ceph-01', 
'/mnt/dev/', None), ('t-ceph-02', '/mnt/dev/', None)]
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks t-ceph-01:/mnt/dev/: 
t-ceph-02:/mnt/dev/:
[t-ceph-01][DEBUG ] connection detected need for sudo
[t-ceph-01][DEBUG ] connected to host: t-ceph-01
[t-ceph-01][DEBUG ] detect platform information from remote host
[t-ceph-01][DEBUG ] detect machine type
[t-ceph-01][DEBUG ] find the location of an executable
[t-ceph-01][INFO  ] Running command: sudo /sbin/initctl version
[t-ceph-01][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host t-ceph-01 disk /mnt/dev/
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[t-ceph-01][DEBUG ] find the location of an executable
[t-ceph-01][INFO  ] Running command: sudo /usr/sbin/ceph-disk -v activate 
--mark-init upstart --mount /mnt/dev/
[t-ceph-01][WARNIN] main_activate: path = /mnt/dev/
[t-ceph-01][WARNIN] activate: Cluster uuid is 
cc34bc6d-d5c0-4b2e-a8e7-12a3331830a2
[t-ceph-01][WARNIN] command: Running command: /usr/bin/ceph-osd --cluster=ceph 
--show-config-value=fsid
[t-ceph-01][WARNIN] activate: Cluster name is ceph
[t-ceph-01][WARNIN] activate: OSD uuid is c06b3214-befc-4670-a99b-b80ef70c8c14
[t-ceph-01][WARNIN] activate: OSD id is 1
[t-ceph-01][WARNIN] activate: Initializing OSD...
[t-ceph-01][WARNIN] command_check_call: Running command: /usr/bin/ceph 
--cluster ceph --name client.bootstrap-osd --keyring 
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o /mnt/dev/activate.monmap
[t-ceph-01][WARNIN] got monmap epoch 1
[t-ceph-01][WARNIN] command: Running command: /usr/bin/timeout 300 ceph-osd 
--cluster ceph --mkfs --mkkey -i 1 --monmap /mnt/dev/activate.monmap --osd-data 
/mnt/dev/ --osd-journal /mnt/dev/journal --osd-uuid 
c06b3214-befc-4670-a99b-b80ef70c8c14 --keyring /mnt/dev/keyring --setuser ceph 
--setgroup ceph
[t-ceph-01][WARNIN] Traceback (most recent call last):
[t-ceph-01][WARNIN]   File "/usr/sbin/ceph-disk", line 9, in 
[t-ceph-01][WARNIN] load_entry_point('ceph-disk==1.0.0', 'console_scripts', 
'ceph-disk')()
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 5047, in run
[t-ceph-01][WARNIN] main(sys.argv[1:])
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 4998, in main
[t-ceph-01][WARNIN] args.func(args)
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3365, in 
main_activate
[t-ceph-01][WARNIN] init=args.mark_init,
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3185, in activate_dir
[t-ceph-01][WARNIN] (osd_id, cluster) = activate(path, 
activate_key_template, init)
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 3290, in activate
[t-ceph-01][WARNIN] keyring=keyring,
[t-ceph-01][WARNIN]   File 
"/usr/lib/python2.7/dist-packages/ceph_disk/main.py", line 2773, in mkfs
[t-ceph-01][WARNIN] 

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Wes Dillingham
This is because of the min_size specification. I would bet you have it set
at 2 (which is good).

ceph osd pool get rbd min_size

With 4 hosts and a size of 3, removing 2 of the hosts (or 2 drives, 1 from
each host) results in some of the objects having only 1 replica.
min_size dictates that I/O freezes for those objects until min_size is
achieved.
http://docs.ceph.com/docs/jewel/rados/operations/pools/#set-the-number-of-object-replicas
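
If you need I/O to resume before the failed hosts come back, one option is to
temporarily lower min_size and raise it again once recovery has caught up.
This is only a sketch, it assumes the pool is named rbd as above, and it does
reduce your safety margin while it is in place:

# check the current value
ceph osd pool get rbd min_size
# temporarily allow I/O with a single surviving replica
ceph osd pool set rbd min_size 1
# restore the original value once enough replicas are back
ceph osd pool set rbd min_size 2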

I can't tell if you're under the impression that your RBD device is a single
object. It is not. It is chunked up into many objects and spread throughout
the cluster, as Kjetil mentioned earlier.

On Mon, Mar 20, 2017 at 8:48 PM, Kjetil Jørgensen 
wrote:

> Hi,
>
> rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
> you a "prefix", which then gets you on to rbd_header.<prefix>;
> rbd_header.prefix contains block size, striping, etc. The actual data
> bearing objects will be named something like rbd_data.prefix.%016x.
>
> Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <object
> size> of that image will be named rbd_data.86ce2ae8944a.0000000000000000, the
> second <object size> will be rbd_data.86ce2ae8944a.0000000000000001, and so
> on; chances are that one of these objects is mapped to a PG which has both
> host3 and host4 among its replicas.
>
> An rbd image will end up scattered across most/all osds of the pool it's
> in.
>
> Cheers,
> -KJ
>
> On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  wrote:
>
>> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
>> running on hosts 1, 2 and 3. It has a single replicated pool of size
>> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
>> 5(host1) and 3(host2).
>>
>> I can 'fail' any one host by disabling the SAN network interface and
>> the VM keeps running with a simple slowdown in I/O performance just as
>> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
>> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
>> have quorum, so that shouldn't be an issue. The placement group still
>> has 2 of its 3 replicas online.
>>
>> Why does I/O hang even though host4 isn't running a monitor and
>> doesn't have anything to do with my VM's hard drive?
>>
>>
>> Size?
>> # ceph osd pool get rbd size
>> size: 3
>>
>> Where's rbd_id.vm-100-disk-1?
>> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
>> rbd_id.vm-100-disk-1 /tmp/map
>> got osdmap epoch 1043
>> osdmaptool: osdmap file '/tmp/map'
>>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>>
>> # ceph osd tree
>> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
>> -1 8.06160 root default
>> -7 5.50308 room A
>> -3 1.88754 host host1
>>  4 0.40369 osd.4   up  1.0  1.0
>>  5 0.40369 osd.5   up  1.0  1.0
>>  6 0.54008 osd.6   up  1.0  1.0
>>  7 0.54008 osd.7   up  1.0  1.0
>> -2 3.61554 host host2
>>  0 0.90388 osd.0   up  1.0  1.0
>>  1 0.90388 osd.1   up  1.0  1.0
>>  2 0.90388 osd.2   up  1.0  1.0
>>  3 0.90388 osd.3   up  1.0  1.0
>> -6 2.55852 room B
>> -4 1.75114 host host3
>>  8 0.40369 osd.8   up  1.0  1.0
>>  9 0.40369 osd.9   up  1.0  1.0
>> 10 0.40369 osd.10  up  1.0  1.0
>> 11 0.54008 osd.11  up  1.0  1.0
>> -5 0.80737 host host4
>> 12 0.40369 osd.12  up  1.0  1.0
>> 13 0.40369 osd.13  up  1.0  1.0
>>
>>
>> --
>> Adam Carheden
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
>
>
> --
> Kjetil Joergensen 
> SRE, Medallia Inc
> Phone: +1 (650) 739-6580
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 
Respectfully,

Wes Dillingham
wes_dilling...@harvard.edu
Research Computing | Infrastructure Engineer
Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs will turn down after the fio test when Ceph use RDMA as its ms_type.

2017-03-20 Thread 邱宏瑋
Hi.

I rebuilt my Ceph yesterday with the latest master branch and the
problem still occurs.
I also found that the number of receive errors increases during the
testing (/sys/class/infiniband/mlx4_0/ports/1/counters/port_rcv_errors),
and I think that is the reason why the OSDs' connections break. I
will try to figure it out.

Thanks.


Best Regards,

Hung-Wei Chiu(邱宏瑋)
--
Computer Center, Department of Computer Science
National Chiao Tung University

2017-03-21 5:07 GMT+08:00 Haomai Wang :

> plz uses master branch to test rdma
>
> On Sun, Mar 19, 2017 at 11:08 PM, Hung-Wei Chiu (邱宏瑋) <
> hwc...@cs.nctu.edu.tw> wrote:
>
>> Hi
>>
>> I want to test the performance of Ceph with RDMA, so I built Ceph
>> with RDMA and deployed it into my test environment manually.
>>
>> I use fio for my performance evaluation and it works fine if Ceph
>> uses *async + posix* as its ms_type.
>> After changing the ms_type from *async + posix* to *async + rdma*, some
>> OSDs' status will turn down during the performance testing, and that causes
>> fio to be unable to finish its job.
>> The log files of those strange OSDs show that there is something wrong
>> when an OSD tries to send a message, as you can see below.
>>
>> ...
>> 2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error
>> -104: (104) Connection reset by peer
>> 2017-03-20 09:43:10.096314 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29
>> l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.251755 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24
>> l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.254375 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275
>> cs=30 l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.260693 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11
>> l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.264682 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23
>> l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.291895 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25
>> l=0).fault initiating reconnect
>> 2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send
>> returned error 32: (32) Broken pipe
>> 2017-03-20 09:43:10.387635 7faac2e41700 0 -- 10.0.0.16:6809/23853 >>
>> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268
>> cs=23 l=0).fault with nothing to send, going to standby
>> 2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no
>> reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507
>> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no
>> reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507
>> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no
>> reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507
>> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> 2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no
>> reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507
>> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
>> *...*
>>
>>
>> The following is my environment.
>> *[Software]*
>> *Ceph Version*: ceph version 12.0.0-1356-g7ba32cb (I build my self with
>> master branch)
>>
>> *Deployment*: Without ceph-deploy and systemd, just manually invoke
>> every daemons.
>>
>> *Host*: Ubuntu 16.04.1 LTS (x86_64 ), with linux kernel 4.4.0-66-generic.
>>
>> *NIC*: Ethernet controller: Mellanox Technologies MT27520 Family
>> [ConnectX-3 Pro]
>>
>> *NIC Driver*: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1):
>>
>>
>> *[Configuration]*
>> Ceph.conf
>>
>> [global]
>> fsid = 0612cc7e-6239-456c-978b-b4df781fe831
>> mon initial members = ceph-1,ceph-2,ceph-3

Re: [ceph-users] I/O hangs with 2 node failure even if one node isn't involved in I/O

2017-03-20 Thread Kjetil Jørgensen
Hi,

rbd_id.vm-100-disk-1 is only a "meta object"; IIRC, its contents will get
you a "prefix", which then gets you on to rbd_header.<prefix>;
rbd_header.prefix contains block size, striping, etc. The actual data
bearing objects will be named something like rbd_data.prefix.%016x.

Example - vm-100-disk-1 has the prefix 86ce2ae8944a, so the first <object size>
of that image will be named rbd_data.86ce2ae8944a.0000000000000000, the second
<object size> will be rbd_data.86ce2ae8944a.0000000000000001, and so on;
chances are that one of these objects is mapped to a PG which has both host3
and host4 among its replicas.

An rbd image will end up scattered across most/all osds of the pool it's in.
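
If you want to see this for your image, the commands below are a rough sketch
(assuming the pool is called rbd and the image is vm-100-disk-1, as in your
osdmaptool example; adjust the names for your setup):

# shows the block_name_prefix, e.g. rbd_data.86ce2ae8944a
rbd -p rbd info vm-100-disk-1
# list a few of the data objects backing the image
rados -p rbd ls | grep rbd_data.86ce2ae8944a | head
# map one of those objects to its PG and acting OSD set
ceph osd map rbd rbd_data.86ce2ae8944a.0000000000000000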

Cheers,
-KJ

On Fri, Mar 17, 2017 at 12:30 PM, Adam Carheden  wrote:

> I have a 4 node cluster shown by `ceph osd tree` below. Monitors are
> running on hosts 1, 2 and 3. It has a single replicated pool of size
> 3. I have a VM with its hard drive replicated to OSDs 11(host3),
> 5(host1) and 3(host2).
>
> I can 'fail' any one host by disabling the SAN network interface and
> the VM keeps running with a simple slowdown in I/O performance just as
> expected. However, if I 'fail' both nodes 3 and 4, I/O hangs on the VM.
> (i.e. `df` never completes, etc.) The monitors on hosts 1 and 2 still
> have quorum, so that shouldn't be an issue. The placement group still
> has 2 of its 3 replicas online.
>
> Why does I/O hang even though host4 isn't running a monitor and
> doesn't have anything to do with my VM's hard drive?
>
>
> Size?
> # ceph osd pool get rbd size
> size: 3
>
> Where's rbd_id.vm-100-disk-1?
> # ceph osd getmap -o /tmp/map && osdmaptool --pool 0 --test-map-object
> rbd_id.vm-100-disk-1 /tmp/map
> got osdmap epoch 1043
> osdmaptool: osdmap file '/tmp/map'
>  object 'rbd_id.vm-100-disk-1' -> 0.1ea -> [11,5,3]
>
> # ceph osd tree
> ID WEIGHT  TYPE NAME  UP/DOWN REWEIGHT PRIMARY-AFFINITY
> -1 8.06160 root default
> -7 5.50308 room A
> -3 1.88754 host host1
>  4 0.40369 osd.4   up  1.0  1.0
>  5 0.40369 osd.5   up  1.0  1.0
>  6 0.54008 osd.6   up  1.0  1.0
>  7 0.54008 osd.7   up  1.0  1.0
> -2 3.61554 host host2
>  0 0.90388 osd.0   up  1.0  1.0
>  1 0.90388 osd.1   up  1.0  1.0
>  2 0.90388 osd.2   up  1.0  1.0
>  3 0.90388 osd.3   up  1.0  1.0
> -6 2.55852 room B
> -4 1.75114 host host3
>  8 0.40369 osd.8   up  1.0  1.0
>  9 0.40369 osd.9   up  1.0  1.0
> 10 0.40369 osd.10  up  1.0  1.0
> 11 0.54008 osd.11  up  1.0  1.0
> -5 0.80737 host host4
> 12 0.40369 osd.12  up  1.0  1.0
> 13 0.40369 osd.13  up  1.0  1.0
>
>
> --
> Adam Carheden
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Kjetil Joergensen 
SRE, Medallia Inc
Phone: +1 (650) 739-6580
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Jason Dillaman
On Mon, Mar 20, 2017 at 6:49 PM, Alejandro Comisario
 wrote:
> Jason, thanks for the reply, you really got my question right.
> So, here are some doubts that might show that I lack some general knowledge.
>
> When I read that someone is testing a Ceph cluster with sequential 4K
> block writes, could that be happening inside a VM that is using an
> RBD-backed OS?

You can use some benchmarks directly against librbd (e.g. see fio's
rbd engine), some within a VM against an RBD-backed block device, and
some within a VM against a filesystem backed by an RBD-backed block
device.
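
For the librbd case, a minimal fio job file for the rbd engine would look
something like the sketch below (pool, image and client names are placeholders
you would adjust for your cluster):

[global]
ioengine=rbd
clientname=admin
pool=rbd
rbdname=test-image
direct=1
iodepth=32
bs=4k
[seq-4k-write]
rw=write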

> In that case, should the VM's FS be formatted to allow 4K writes
> so that the block level of the VM writes 4K down to the hypervisor?
>
> In that case, assuming that I have a 9K MTU between the compute node
> and the Ceph cluster:
> what is the default RADOS block size by which the data is divided
> into objects?

MTU size (network maximum packet size) and the RBD block object size
are not interrelated.

>
> On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman  wrote:
>> It's a very broad question -- are you trying to determine something
>> more specific?
>>
>> Notionally, your DB engine will safely journal the changes to disk,
>> commit the changes to the backing table structures, and prune the
>> journal. Your mileage may vary depending on the specific DB engine and
>> its configuration settings.
>>
>> The VM's OS will send write requests addressed by block offset and
>> block counts (e.g. 512 blocks) through the block device hardware
>> (either a slower emulated block device or a faster paravirtualized
>> block device like virtio-blk/virtio-scsi). Within the internals of
>> QEMU, these block-addressed write requests will be delivered to librbd
>> in byte-addressed format (the blocks are converted to absolute byte
>> ranges).
>>
>> librbd will take the provided byte offset and length and quickly
>> calculate which backing RADOS objects are associated with the provided
>> range [1]. If the extent intersects multiple backing objects, the
>> sub-operation is sent to each affected object in parallel. These
>> operations will be sent to the OSDs responsible for handling the
>> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
>> maximum size of each IP packet -- larger MTUs allow you to send more
>> data within a single packet [2].
>>
>> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
>> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
>>
>>
>>
>> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>>  wrote:
>>> anyone ?
>>>
>>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>>  wrote:
 Hi, it's been a while since I've been using Ceph, and still I'm a little
 ashamed that when a certain situation happens, I don't have the knowledge
 to explain or plan things.

 Basically, here is what I don't know, and I will do an exercise.

 EXERCISE:
 a virtual machine running on KVM has an extra block device where the
 datafiles of a database live (this block device is exposed to the VM
 using libvirt)

 facts:
 * the DB writes to disk in 8K blocks
 * the connection between the physical compute node and Ceph has an MTU of
 1500
 * the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4.
 * everything else is default

 So conceptually, if someone can explain to me: what happens from the
 moment the DB contained on the VM commits to disk a query of
 20 MBytes? What happens on the compute node, what happens on the
 client's file striping, what happens on the network (regarding
 packets, if anything other than creating 1500-byte packets), what happens
 with RADOS objects, block sizes, etc.?

 I would love to read this from the best, mainly because, as I said, I
 don't understand the whole workflow of blocks, objects, etc.

 thanks to everyone !

 --
 Alejandrito
>>>
>>>
>>>
>>> --
>>> Alejandro Comisario
>>> CTO | NUBELIU
>>> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
>>> _
>>> www.nubeliu.com
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
Jason, thanks for the reply, you really got my question right.
So, here are some doubts that might show that I lack some general knowledge.

When I read that someone is testing a Ceph cluster with sequential 4K
block writes, could that be happening inside a VM that is using an
RBD-backed OS?
In that case, should the VM's FS be formatted to allow 4K writes
so that the block level of the VM writes 4K down to the hypervisor?

In that case, assuming that I have a 9K MTU between the compute node
and the Ceph cluster:
what is the default RADOS block size by which the data is divided
into objects?


On Mon, Mar 20, 2017 at 7:06 PM, Jason Dillaman  wrote:
> It's a very broad question -- are you trying to determine something
> more specific?
>
> Notionally, your DB engine will safely journal the changes to disk,
> commit the changes to the backing table structures, and prune the
> journal. Your mileage may vary depending on the specific DB engine and
> its configuration settings.
>
> The VM's OS will send write requests addressed by block offset and
> block counts (e.g. 512 blocks) through the block device hardware
> (either a slower emulated block device or a faster paravirtualized
> block device like virtio-blk/virtio-scsi). Within the internals of
> QEMU, these block-addressed write requests will be delivered to librbd
> in byte-addressed format (the blocks are converted to absolute byte
> ranges).
>
> librbd will take the provided byte offset and length and quickly
> calculate which backing RADOS objects are associated with the provided
> range [1]. If the extent intersects multiple backing objects, the
> sub-operation is sent to each affected object in parallel. These
> operations will be sent to the OSDs responsible for handling the
> object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
> maximum size of each IP packet -- larger MTUs allow you to send more
> data within a single packet [2].
>
> [1] http://docs.ceph.com/docs/master/architecture/#data-striping
> [2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
>
>
>
> On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
>  wrote:
>> anyone ?
>>
>> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>>  wrote:
>>> Hi, it's been a while since I've been using Ceph, and still I'm a little
>>> ashamed that when a certain situation happens, I don't have the knowledge
>>> to explain or plan things.
>>>
>>> Basically, here is what I don't know, and I will do an exercise.
>>>
>>> EXERCISE:
>>> a virtual machine running on KVM has an extra block device where the
>>> datafiles of a database live (this block device is exposed to the VM
>>> using libvirt)
>>>
>>> facts:
>>> * the DB writes to disk in 8K blocks
>>> * the connection between the physical compute node and Ceph has an MTU of
>>> 1500
>>> * the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4.
>>> * everything else is default
>>>
>>> So conceptually, if someone can explain to me: what happens from the
>>> moment the DB contained on the VM commits to disk a query of
>>> 20 MBytes? What happens on the compute node, what happens on the
>>> client's file striping, what happens on the network (regarding
>>> packets, if anything other than creating 1500-byte packets), what happens
>>> with RADOS objects, block sizes, etc.?
>>>
>>> I would love to read this from the best, mainly because, as I said, I
>>> don't understand the whole workflow of blocks, objects, etc.
>>>
>>> thanks to everyone !
>>>
>>> --
>>> Alejandrito
>>
>>
>>
>> --
>> Alejandro Comisario
>> CTO | NUBELIU
>> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
>> _
>> www.nubeliu.com
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason



-- 
Alejandro Comisario
CTO | NUBELIU
E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Mike Lovell
On Mon, Mar 20, 2017 at 4:20 PM, Nick Fisk  wrote:

> Just a few corrections, hope you don't mind
>
> > -Original Message-
> > From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> > Mike Lovell
> > Sent: 20 March 2017 20:30
> > To: Webert de Souza Lima 
> > Cc: ceph-users 
> > Subject: Re: [ceph-users] cephfs cache tiering - hitset
> >
> > i'm not an expert but here is my understanding of it. a hit_set keeps
> track of
> > whether or not an object was accessed during the timespan of the hit_set.
> > for example, if you have a hit_set_period of 600, then the hit_set
> covers a
> > period of 10 minutes. the hit_set_count defines how many of the hit_sets
> to
> > keep a record of. setting this to a value of 12 with the 10 minute
> > hit_set_period would mean that there is a record of objects accessed
> over a
> > 2 hour period. the min_read_recency_for_promote, and its newer
> > min_write_recency_for_promote sibling, define how many of these hit_sets
> > an object must be in before an object is promoted from the storage pool
> > into the cache pool. if this were set to 6 with the previous examples,
> it means
> > that the cache tier will promote an object if that object has been
> accessed at
> > least once in 6 of the 12 10-minute periods. it doesn't matter how many
> > times the object was used in each period and so 6 requests in one
> 10-minute
> > hit_set will not cause a promotion. it would be any number of access in 6
> > separate 10-minute periods over the 2 hours.
>
> Sort of, the recency looks at the last N most recent hitsets. So if set to
> 6, then the object would have to be in all of the last 6 hitsets. Because of
> this, during testing I found setting recency above 2 or 3 made the behavior
> quite binary. If an object was hot enough, it would probably be in every
> hitset; if it was only warm it would never be in enough hitsets in a row. I
> did experiment with X-out-of-N promotion logic, i.e. must be in 3 hitsets out
> of 10, non-sequential. If you could find the right number to configure, you
> could get improved cache behavior, but if not, then there was a large
> chance it would be worse.
>
> For promotion I think having more hitsets probably doesn't add much value,
> but they may help when it comes to determining what to flush.
>

that's good to know. i just made an assumption without actually digging in
to the code. do you recommend keeping the number of hitsets equal to the
max of either min_read_recency_for_promote and
min_write_recency_for_promote? how are the hitsets checked during flush
and/or eviction?

mike
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Nick Fisk
Just a few corrections, hope you don't mind

> -Original Message-
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Mike Lovell
> Sent: 20 March 2017 20:30
> To: Webert de Souza Lima 
> Cc: ceph-users 
> Subject: Re: [ceph-users] cephfs cache tiering - hitset
> 
> i'm not an expert but here is my understanding of it. a hit_set keeps track of
> whether or not an object was accessed during the timespan of the hit_set.
> for example, if you have a hit_set_period of 600, then the hit_set covers a
> period of 10 minutes. the hit_set_count defines how many of the hit_sets to
> keep a record of. setting this to a value of 12 with the 10 minute
> hit_set_period would mean that there is a record of objects accessed over a
> 2 hour period. the min_read_recency_for_promote, and its newer
> min_write_recency_for_promote sibling, define how many of these hit_sets
> an object must be in before an object is promoted from the storage pool
> into the cache pool. if this were set to 6 with the previous examples, it 
> means
> that the cache tier will promote an object if that object has been accessed at
> least once in 6 of the 12 10-minute periods. it doesn't matter how many
> times the object was used in each period and so 6 requests in one 10-minute
> hit_set will not cause a promotion. it would be any number of access in 6
> separate 10-minute periods over the 2 hours.

Sort of, the recency looks at the last N most recent hitsets. So if set to 6,
then the object would have to be in all of the last 6 hitsets. Because of this,
during testing I found setting recency above 2 or 3 made the behavior quite
binary. If an object was hot enough, it would probably be in every hitset; if
it was only warm it would never be in enough hitsets in a row. I did experiment
with X-out-of-N promotion logic, i.e. must be in 3 hitsets out of 10,
non-sequential. If you could find the right number to configure, you could get
improved cache behavior, but if not, then there was a large chance it would be worse.

For promotion I think having more hitsets probably doesn't add much value, but 
they may help when it comes to determining what to flush.

> 
> this is just an example and might not fit well for your use case. the systems 
> i
> run have a lower hit_set_period, higher hit_set_count, and higher recency
> options. that means that the osds use some more memory (each hit_set
> takes space but i think they use the same amount of space regardless of
> period) but hit_set covers a smaller amount of time. the longer the period,
> the more likely a given object is in the hit_set. without knowing your access
> patterns, it would be hard to recommend settings. the overhead of a
> promotion can be substantial and so i'd probably go with settings that only
> promote after many requests to an object.

Also in Jewel there is a promotion throttle which will limit promotions to 4MB/s

> 
> one thing to note is that the recency options only seemed to work for me in
> jewel. i haven't tried infernalis. the older versions of hammer didn't seem to
> use the min_read_recency_for_promote properly and 0.94.6 definitely had a
> bug that could corrupt data when min_read_recency_for_promote was more
> than 1. even though that was fixed in 0.94.7, i was hesitant to increase it 
> while
> still on hammer. min_write_recency_for_promote wasn't added till after
> hammer.
> 
> hopefully that helps.
> mike
> 
> On Fri, Mar 17, 2017 at 2:02 PM, Webert de Souza Lima
>  wrote:
> Hello everyone,
> 
> I'm deploying a ceph cluster with cephfs and I'd like to tune ceph cache
> tiering, and I'm
> a little bit confused about the
> settings hit_set_count, hit_set_period and min_read_recency_for_promote.
> The docs are very lean and I can't find any more detailed explanation
> anywhere.
> 
> Could someone provide me a better understandment of this?
> 
> Thanks in advance!
> 
> ___
> ceph-users mailing list
> mailto:ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Jason Dillaman
It's a very broad question -- are you trying to determine something
more specific?

Notionally, your DB engine will safely journal the changes to disk,
commit the changes to the backing table structures, and prune the
journal. Your mileage may vary depending on the specific DB engine and
its configuration settings.

The VM's OS will send write requests addressed by block offset and
block counts (e.g. 512 blocks) through the block device hardware
(either a slower emulated block device or a faster paravirtualized
block device like virtio-blk/virtio-scsi). Within the internals of
QEMU, these block-addressed write requests will be delivered to librbd
in byte-addressed format (the blocks are converted to absolute byte
ranges).

librbd will take the provided byte offset and length and quickly
calculate which backing RADOS objects are associated with the provided
range [1]. If the extent intersects multiple backing objects, the
sub-operation is sent to each affected object in parallel. These
operations will be sent to the OSDs responsible for handling the
object (as per the CRUSH map) -- by default via TCP/IP. The MTU is the
maximum size of each IP packet -- larger MTUs allow you to send more
data within a single packet [2].

[1] http://docs.ceph.com/docs/master/architecture/#data-striping
[2] https://en.wikipedia.org/wiki/Maximum_transmission_unit
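
As a rough, concrete illustration (assuming the default 4 MiB object size and
no custom striping; the image name "test" is only a placeholder):

# a 10 MiB write starting at byte offset 6 MiB touches objects 1..3, since
#   first object = 6 MiB / 4 MiB = 1, last object = (16 MiB - 1 byte) / 4 MiB = 3
# the object size ("order") of an image is shown by:
rbd info test
# and the PG/OSD set for one of its backing objects can be checked with:
ceph osd map rbd rbd_data.<prefix>.0000000000000001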



On Mon, Mar 20, 2017 at 5:24 PM, Alejandro Comisario
 wrote:
> anyone ?
>
> On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
>  wrote:
>> Hi, it's been a while since I've been using Ceph, and still I'm a little
>> ashamed that when a certain situation happens, I don't have the knowledge
>> to explain or plan things.
>>
>> Basically, here is what I don't know, and I will do an exercise.
>>
>> EXERCISE:
>> a virtual machine running on KVM has an extra block device where the
>> datafiles of a database live (this block device is exposed to the VM
>> using libvirt)
>>
>> facts:
>> * the DB writes to disk in 8K blocks
>> * the connection between the physical compute node and Ceph has an MTU of
>> 1500
>> * the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4.
>> * everything else is default
>>
>> So conceptually, if someone can explain to me: what happens from the
>> moment the DB contained on the VM commits to disk a query of
>> 20 MBytes? What happens on the compute node, what happens on the
>> client's file striping, what happens on the network (regarding
>> packets, if anything other than creating 1500-byte packets), what happens
>> with RADOS objects, block sizes, etc.?
>>
>> I would love to read this from the best, mainly because, as I said, I
>> don't understand the whole workflow of blocks, objects, etc.
>>
>> thanks to everyone !
>>
>> --
>> Alejandrito
>
>
>
> --
> Alejandro Comisario
> CTO | NUBELIU
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
> _
> www.nubeliu.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about block sizes, rados objects and file striping (and maybe more)

2017-03-20 Thread Alejandro Comisario
anyone ?

On Fri, Mar 17, 2017 at 5:40 PM, Alejandro Comisario
 wrote:
> Hi, it's been a while since I've been using Ceph, and still I'm a little
> ashamed that when a certain situation happens, I don't have the knowledge
> to explain or plan things.
>
> Basically, here is what I don't know, and I will do an exercise.
>
> EXERCISE:
> a virtual machine running on KVM has an extra block device where the
> datafiles of a database live (this block device is exposed to the VM
> using libvirt)
>
> facts:
> * the DB writes to disk in 8K blocks
> * the connection between the physical compute node and Ceph has an MTU of 1500
> * the QEMU RBD driver uses a stripe unit of 2048 kB and a stripe count of 4.
> * everything else is default
>
> So conceptually, if someone can explain to me: what happens from the
> moment the DB contained on the VM commits to disk a query of
> 20 MBytes? What happens on the compute node, what happens on the
> client's file striping, what happens on the network (regarding
> packets, if anything other than creating 1500-byte packets), what happens
> with RADOS objects, block sizes, etc.?
>
> I would love to read this from the best, mainly because, as I said, I
> don't understand the whole workflow of blocks, objects, etc.
>
> thanks to everyone !
>
> --
> Alejandrito



-- 
Alejandro Comisario
CTO | NUBELIU
> E-mail: alejandro@nubeliu.com  Cell: +54 9 11 3770 1857
_
www.nubeliu.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSDs will turn down after the fio test when Ceph use RDMA as its ms_type.

2017-03-20 Thread Haomai Wang
plz use the master branch to test rdma

On Sun, Mar 19, 2017 at 11:08 PM, Hung-Wei Chiu (邱宏瑋)  wrote:

> Hi
>
> I want to test the performance of Ceph with RDMA, so I built Ceph
> with RDMA and deployed it into my test environment manually.
>
> I use fio for my performance evaluation and it works fine if Ceph
> uses *async + posix* as its ms_type.
> After changing the ms_type from *async + posix* to *async + rdma*, some
> OSDs' status will turn down during the performance testing, and that causes
> fio to be unable to finish its job.
> The log files of those strange OSDs show that there is something wrong when
> an OSD tries to send a message, as you can see below.
>
> ...
> 2017-03-20 09:43:10.096042 7faac163e700 -1 Infiniband recv_msg got error
> -104: (104) Connection reset by peer
> 2017-03-20 09:43:10.096314 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.17:6813/32315 conn(0x563de5282000 :-1 s=STATE_OPEN pgs=264 cs=29
> l=0).fault initiating reconnect
> 2017-03-20 09:43:10.251606 7faac1e3f700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.251755 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.17:6821/32509 conn(0x563de51f1000 :-1 s=STATE_OPEN pgs=314 cs=24
> l=0).fault initiating reconnect
> 2017-03-20 09:43:10.254103 7faac1e3f700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.254375 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.15:6821/48196 conn(0x563de514b000 :6809 s=STATE_OPEN pgs=275 cs=30
> l=0).fault initiating reconnect
> 2017-03-20 09:43:10.260622 7faac1e3f700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.260693 7faac1e3f700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.15:6805/47835 conn(0x563de537d800 :-1 s=STATE_OPEN pgs=310 cs=11
> l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:10.264621 7faac163e700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.264682 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.15:6829/48397 conn(0x563de5fdb000 :-1 s=STATE_OPEN pgs=231 cs=23
> l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:10.291832 7faac163e700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.291895 7faac163e700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.17:6817/32412 conn(0x563de50f5800 :-1 s=STATE_OPEN pgs=245 cs=25
> l=0).fault initiating reconnect
> 2017-03-20 09:43:10.387540 7faac2e41700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.387565 7faac2e41700 -1 Infiniband send_msg send
> returned error 32: (32) Broken pipe
> 2017-03-20 09:43:10.387635 7faac2e41700 0 -- 10.0.0.16:6809/23853 >>
> 10.0.0.17:6801/32098 conn(0x563de51ab800 :6809 s=STATE_OPEN pgs=268 cs=23
> l=0).fault with nothing to send, going to standby
> 2017-03-20 09:43:11.453373 7faabdee0700 -1 osd.10 902 heartbeat_check: no
> reply from 10.0.0.15:6803 osd.0 since back 2017-03-20 09:42:50.610507
> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453422 7faabdee0700 -1 osd.10 902 heartbeat_check: no
> reply from 10.0.0.15:6807 osd.1 since back 2017-03-20 09:42:50.610507
> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453435 7faabdee0700 -1 osd.10 902 heartbeat_check: no
> reply from 10.0.0.15:6811 osd.2 since back 2017-03-20 09:42:50.610507
> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> 2017-03-20 09:43:11.453444 7faabdee0700 -1 osd.10 902 heartbeat_check: no
> reply from 10.0.0.15:6815 osd.3 since back 2017-03-20 09:42:50.610507
> front 2017-03-20 09:42:50.610507 (cutoff 2017-03-20 09:42:51.453371)
> *...*
>
>
> The following is my environment.
> *[Software]*
> *Ceph Version*: ceph version 12.0.0-1356-g7ba32cb (I build my self with
> master branch)
>
> *Deployment*: Without ceph-deploy and systemd, just manually invoke every
> daemons.
>
> *Host*: Ubuntu 16.04.1 LTS (x86_64 ), with linux kernel 4.4.0-66-generic.
>
> *NIC*: Ethernet controller: Mellanox Technologies MT27520 Family
> [ConnectX-3 Pro]
>
> *NIC Driver*: MLNX_OFED_LINUX-4.0-1.0.1.0 (OFED-4.0-1.0.1):
>
>
> *[Configuration]*
> Ceph.conf
>
> [global]
> fsid = 0612cc7e-6239-456c-978b-b4df781fe831
> mon initial members = ceph-1,ceph-2,ceph-3
> mon host = 10.0.0.15,10.0.0.16,10.0.0.17
> osd pool default size = 2
> osd pool default pg num = 1024
> osd pool default pgp num = 1024
> ms_type=async+rdma
> ms_async_rdma_device_name = mlx4_0
>
> Fio.conf
>
> [global]
>
> ioengine=rbd
> clientname=admin
> pool=rbd
> rbdname=rbd
> clustername=ceph
> runtime=120
> iodepth=128
> numjobs=6
> group_reporting
> size=256G
> direct=1
> ramp_time=5
> [r75w25]
> bs=4k
> rw=randrw
> rwmixread=75
>
>
> *[Cluster Env]*
>
>1. Total three Node.
>2. 3 ceph monitors on each node.
>3. 8 ceph osd on each node (total 24 osd).
>
>
> Thanks
>
>
>
>
>
> 

Re: [ceph-users] Understanding Ceph in case of a failure

2017-03-20 Thread Christian Wuerdig
On Tue, Mar 21, 2017 at 8:57 AM, Karol Babioch  wrote:

> Hi,
>
> Am 20.03.2017 um 05:34 schrieb Christian Balzer:
> > you do realize that you very much have a corner case setup there, right?
>
> Yes, I know that this is not exactly a recommendation, but I hoped it
> would be good enough for the start :-).
>
> > That being said, if you'd search the archives, a similar question was
> > raised by me a long time ago.
>
> Do you have some sort of reference to this? Sounds interesting, but
> couldn't find a particular thread, and you posted quite a lot on this
> list already :-).
>
> > The new CRUSH map of course results in different computations of where
> PGs
> > should live, so they get copied to their new primary OSDs.
> > This is the I/O you're seeing and that's why it stops eventually.
>
> Hm, ok, that might be an explanation. Haven't considered the fact that
> it gets removed from the CRUSH map and a new location is calculated. Is
> there a way to prevent this in my case?
>
>
If an OSD doesn't respond it will be marked as down, and then after some
time (default 300 sec) it will be marked as out.
Data will start to move once the OSD is marked out (i.e. no longer part of
the CRUSH map), which is what you are observing.

The settings you are probably interested in are (docs from here:
http://docs.ceph.com/docs/jewel/rados/configuration/mon-osd-interaction/)

1. mon osd down out interval   - defaults to 300sec after which a down OSD
will be marked out
2. mon osd down out subtree limit - will prevent down OSDs being marked out
automatically if the whole subtree disappears. This defaults to rack - if
you change it to host then turning off an entire host should prevent all
those OSDs from being marked out automatically
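
For example, to use a longer grace period and to stop a whole host's OSDs from
being marked out automatically, something along these lines in the [mon]
section of ceph.conf should do it (the values here are only illustrative):

[mon]
# wait 10 minutes instead of 5 before marking a down OSD out
mon osd down out interval = 600
# do not automatically mark out the OSDs of an entire down host
mon osd down out subtree limit = host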



> Thank you very much for your insights!
>
> Best regards,
> Karol Babioch
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cephfs cache tiering - hitset

2017-03-20 Thread Mike Lovell
i'm not an expert but here is my understanding of it. a hit_set keeps track
of whether or not an object was accessed during the timespan of the
hit_set. for example, if you have a hit_set_period of 600, then the hit_set
covers a period of 10 minutes. the hit_set_count defines how many of the
hit_sets to keep a record of. setting this to a value of 12 with the 10
minute hit_set_period would mean that there is a record of objects accessed
over a 2 hour period. the min_read_recency_for_promote, and its newer
min_write_recency_for_promote sibling, define how many of these hit_sets
an object must be in before an object is promoted from the storage pool
into the cache pool. if this were set to 6 with the previous examples, it
means that the cache tier will promote an object if that object has been
accessed at least once in 6 of the 12 10-minute periods. it doesn't matter
how many times the object was used in each period and so 6 requests in one
10-minute hit_set will not cause a promotion. it would be any number of
access in 6 separate 10-minute periods over the 2 hours.
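
for reference, all of those knobs are per-pool settings on the cache pool, so
the example above would be configured roughly like this (the pool name
"cache-pool" is just a placeholder, and a bloom filter is the usual
hit_set_type):

ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_period 600
ceph osd pool set cache-pool hit_set_count 12
ceph osd pool set cache-pool min_read_recency_for_promote 6
ceph osd pool set cache-pool min_write_recency_for_promote 6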

this is just an example and might not fit well for your use case. the
systems i run have a lower hit_set_period, higher hit_set_count, and higher
recency options. that means that the osds use some more memory (each
hit_set takes space but i think they use the same amount of space
regardless of period) but hit_set covers a smaller amount of time. the
longer the period, the more likely a given object is in the hit_set.
without knowing your access patterns, it would be hard to recommend
settings. the overhead of a promotion can be substantial and so i'd
probably go with settings that only promote after many requests to an
object.

one thing to note is that the recency options only seemed to work for me in
jewel. i haven't tried infernalis. the older versions of hammer didn't seem
to use the min_read_recency_for_promote properly and 0.94.6 definitely had
a bug that could corrupt data when min_read_recency_for_promote was more
than 1. even though that was fixed in 0.94.7, i was hesitant to increase it
while still on hammer. min_write_recency_for_promote wasn't added till after
hammer.

hopefully that helps.
mike

On Fri, Mar 17, 2017 at 2:02 PM, Webert de Souza Lima  wrote:

> Hello everyone,
>
> I'm deploying a ceph cluster with cephfs and I'd like to tune ceph cache
> tiering, and I'm
> a little bit confused about the settings hit_set_count, hit_set_period and
> min_read_recency_for_promote. The docs are very lean and I can't find any
> more detailed explanation anywhere.
>
> Could someone provide me a better understandment of this?
>
> Thanks in advance!
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Understanding Ceph in case of a failure

2017-03-20 Thread Karol Babioch
Hi,

Am 20.03.2017 um 05:34 schrieb Christian Balzer:
> you do realize that you very much have a corner case setup there, right?

Yes, I know that this is not exactly a recommendation, but I hoped it
would be good enough for the start :-).

> That being said, if you'd search the archives, a similar question was
> raised by me a long time ago.

Do you have some sort of reference to this? Sounds interesting, but
couldn't find a particular thread, and you posted quite a lot on this
list already :-).

> The new CRUSH map of course results in different computations of where PGs
> should live, so they get copied to their new primary OSDs.
> This is the I/O you're seeing and that's why it stops eventually.

Hm, ok, that might be an explanation. Haven't considered the fact that
it gets removed from the CRUSH map and a new location is calculated. Is
there a way to prevent this in my case?

Thank you very much for your insights!

Best regards,
Karol Babioch



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] radosgw: Multipart upload fails while using AWS Java SDK and signature v4

2017-03-20 Thread Shaon
Hi,

When I try to do a multipart upload using s3cmd and signature v4 against
ceph (ceph version 10.2.6 (656b5b63ed7c43bd014bcafd81b001959d5f089f)) it
succeeds [1].

But when I try to do the same thing with AWS Java SDK (1.11.97), it fails
with 403 SignatureDoesNotMatch.

(FYI, it works fine with signature version 2 and upload without mpu with
sigv4 works as well.)


In both cases I am trying to upload a 17mb file with 15mb part size.

SignatureDoesNotMatch exception from AWS Java SDK:

2017-03-20 09:29:09.933 DEBUG wire:72 - http-outgoing-0 >> "POST
/testbucket/testkeyawssdk?uploads HTTP/1.1[\r][\n]"
2017-03-20 09:29:09.936 DEBUG wire:72 - http-outgoing-0 >> "Host:
10.111.5.141:7480[\r][\n]"
2017-03-20 09:29:09.937 DEBUG wire:72 - http-outgoing-0 >>
"x-amz-content-sha256:
e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855[\r][\n]"
2017-03-20 09:29:09.937 DEBUG wire:72 - http-outgoing-0 >> "Authorization:
AWS4-HMAC-SHA256
Credential=4J31KSQ9040IGL2DDA7Y/20170320/us-east-1/s3/aws4_request,
SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date,
Signature=c4a36f357f8d87601dfd7b1cb5262988c3594d568bfb5a756bc4ad5fa09ac5c0[\r][\n]"
2017-03-20 09:29:09.937 DEBUG wire:72 - http-outgoing-0 >> "X-Amz-Date:
20170320T162904Z[\r][\n]"
2017-03-20 09:29:09.937 DEBUG wire:72 - http-outgoing-0 >> "User-Agent:
aws-sdk-java/1.11.97 Mac_OS_X/10.12.3
Java_HotSpot(TM)_64-Bit_Server_VM/25.66-b17/1.8.0_66[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >>
"amz-sdk-invocation-id: 404a6785-b78b-e30d-7748-59c06ae5ae97[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >> "amz-sdk-retry:
0/0/500[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >> "Content-Type:
application/octet-stream[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >> "Content-Length:
0[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >> "Connection:
Keep-Alive[\r][\n]"
2017-03-20 09:29:09.938 DEBUG wire:72 - http-outgoing-0 >> "[\r][\n]"
2017-03-20 09:29:09.997 DEBUG wire:72 - http-outgoing-0 << "HTTP/1.1 200
OK[\r][\n]"
2017-03-20 09:29:09.998 DEBUG wire:72 - http-outgoing-0 <<
"x-amz-request-id: tx02010-0058d0034e-d83b-default[\r][\n]"
2017-03-20 09:29:09.998 DEBUG wire:72 - http-outgoing-0 << "Content-Type:
application/xml[\r][\n]"
2017-03-20 09:29:09.999 DEBUG wire:72 - http-outgoing-0 << "Content-Length:
254[\r][\n]"
2017-03-20 09:29:10.001 DEBUG wire:72 - http-outgoing-0 << "Date: Mon, 20
Mar 2017 16:29:02 GMT[\r][\n]"
2017-03-20 09:29:10.001 DEBUG wire:72 - http-outgoing-0 << "Connection:
Keep-Alive[\r][\n]"
2017-03-20 09:29:10.001 DEBUG wire:72 - http-outgoing-0 << "[\r][\n]"
2017-03-20 09:29:10.046 DEBUG wire:86 - http-outgoing-0 << "http://s3.amazonaws.com/doc/2006-03-01/
">testbuckettestkeyawssdk2~o3Ixqoi-90fZeQCVYpwAJloD5S3iz5S"
2017-03-20 09:29:10.059 DEBUG wire:72 - http-outgoing-0 >> "PUT
/testbucket/testkeyawssdk?uploadId=2%7Eo3Ixqoi-90fZeQCVYpwAJloD5S3iz5S=1
HTTP/1.1[\r][\n]"
2017-03-20 09:29:10.059 DEBUG wire:72 - http-outgoing-0 >> "Host:
10.111.5.141:7480[\r][\n]"
2017-03-20 09:29:10.059 DEBUG wire:72 - http-outgoing-0 >>
"x-amz-content-sha256: STREAMING-AWS4-HMAC-SHA256-PAYLOAD[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >> "Authorization:
AWS4-HMAC-SHA256
Credential=4J31KSQ9040IGL2DDA7Y/20170320/us-east-1/s3/aws4_request,
SignedHeaders=amz-sdk-invocation-id;amz-sdk-retry;content-length;content-type;host;user-agent;x-amz-content-sha256;x-amz-date;x-amz-decoded-content-length,
Signature=63a15ad0819c4a309f0772dd41273eb91303d29b29e296bd3be178e86ef71a32[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >> "X-Amz-Date:
20170320T162910Z[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >> "User-Agent:
aws-sdk-java/1.11.97 Mac_OS_X/10.12.3
Java_HotSpot(TM)_64-Bit_Server_VM/25.66-b17/1.8.0_66[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >>
"amz-sdk-invocation-id: fecabaaf-d753-3a2b-6b33-73bb46a71413[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >>
"x-amz-decoded-content-length: 15728640[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >> "amz-sdk-retry:
0/0/500[\r][\n]"
2017-03-20 09:29:10.060 DEBUG wire:72 - http-outgoing-0 >> "Content-Type:
application/octet-stream[\r][\n]"
2017-03-20 09:29:10.061 DEBUG wire:72 - http-outgoing-0 >> "Content-Length:
15739526[\r][\n]"
2017

Re: [ceph-users] OSDs cannot match up with fast OSD map changes (epochs) during recovery

2017-03-20 Thread Wido den Hollander

> Op 18 maart 2017 om 10:39 schreef Muthusamy Muthiah 
> :
> 
> 
> Hi,
> 
> We had a similar issue on one of the 5-node clusters again during
> recovery (200/335 OSDs are to be recovered); we see a lot of differences
> in the OSDMap epochs between the OSD which is booting and the current one,
> as shown below:
> 
> -  In the current situation the OSDs are trying to register with an
> old OSDMap version, *7620*, but the current version in the cluster is
> higher, *13102*; as a result it takes longer for the OSD to update to this
> version.
> 

Do you see these OSDs eating 100% CPU at that moment? E.g., could it be that the
CPUs are not fast enough to process all the map updates quickly enough?

iirc map updates are not processed multi-threaded.
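
As a side note, you can usually see how far behind an OSD is by comparing its
oldest/newest map epochs with the cluster's current epoch, e.g. (osd.315 is
taken from the log below; run the first command on the host carrying that OSD):

# via the admin socket; look at the oldest_map / newest_map fields
ceph daemon osd.315 status
# current osdmap epoch according to the monitors
ceph osd stat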

Wido

> 
> We also see 2017-03-18 09:19:04.628206 7f2056735700 0 --
> 10.139.4.69:6836/777372 >> - conn(0x7f20c1bfa800 :6836
> s=STATE_ACCEPTING_WAIT_BANNER_ADDR pgs=0 cs=0 l=0).fault with nothing to
> send and in the half accept state just closed messages on many osds which
> are recovering.
> 
> Suggestions would be helpful.
> 
> 
> Thanks,
> 
> Muthu
> 
> On 13 February 2017 at 18:14, Wido den Hollander  wrote:
> 
> >
> > > Op 13 februari 2017 om 12:57 schreef Muthusamy Muthiah <
> > muthiah.muthus...@gmail.com>:
> > >
> > >
> > > Hi All,
> > >
> > > We also have the same issue on one of our platforms, which was upgraded from
> > > 11.0.2 to 11.2.0. The issue occurs on one node alone, where CPU hits 100%
> > > and the OSDs of that node are marked down. The issue is not seen on a cluster
> > > which was installed from scratch with 11.2.0.
> > >
> >
> > How many maps is this OSD behind?
> >
> > Does it help if you set the nodown flag for a moment to let it catch up?
> >
> > Wido
> >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > >
> > > *[r...@cn3.c7.vna ~] # systemctl start ceph-osd@315.service
> > >  [r...@cn3.c7.vna ~] # cd /var/log/ceph/
> > > [r...@cn3.c7.vna ceph] # tail -f *osd*315.log 2017-02-13 11:29:46.752897
> > > 7f995c79b940  0 
> > > /home/jenkins-build/build/workspace/ceph-build/ARCH/x86_
> > 64/AVAILABLE_ARCH/x86_64/AVAILABLE_DIST/centos7/DIST/
> > centos7/MACHINE_SIZE/huge/release/11.2.0/rpm/el7/BUILD/
> > ceph-11.2.0/src/cls/hello/cls_hello.cc:296:
> > > loading cls_hello 2017-02-13 11:29:46.753065 7f995c79b940  0 _get_class
> > not
> > > permitted to load kvs 2017-02-13 11:29:46.757571 7f995c79b940  0
> > _get_class
> > > not permitted to load lua 2017-02-13 11:29:47.058720 7f995c79b940  0
> > > osd.315 44703 crush map has features 288514119978713088, adjusting msgr
> > > requires for clients 2017-02-13 11:29:47.058728 7f995c79b940  0 osd.315
> > > 44703 crush map has features 288514394856620032 was 8705, adjusting msgr
> > > requires for mons 2017-02-13 11:29:47.058732 7f995c79b940  0 osd.315
> > 44703
> > > crush map has features 288531987042664448, adjusting msgr requires for
> > osds
> > > 2017-02-13 11:29:48.343979 7f995c79b940  0 osd.315 44703 load_pgs
> > > 2017-02-13 11:29:55.913550 7f995c79b940  0 osd.315 44703 load_pgs opened
> > > 130 pgs 2017-02-13 11:29:55.913604 7f995c79b940  0 osd.315 44703 using 1
> > op
> > > queue with priority op cut off at 64. 2017-02-13 11:29:55.914102
> > > 7f995c79b940 -1 osd.315 44703 log_to_monitors {default=true} 2017-02-13
> > > 11:30:19.384897 7f9939bbb700  1 heartbeat_map reset_timeout 'tp_osd
> > thread
> > > tp_osd' had timed out after 15 2017-02-13 11:30:31.073336 7f9955a2b700  1
> > > heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after 15
> > > 2017-02-13 11:30:31.073343 7f9955a2b700  1 heartbeat_map is_healthy
> > 'tp_osd
> > > thread tp_osd' had timed out after 15 2017-02-13 11:30:31.073344
> > > 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed
> > > out after 15 2017-02-13 11:30:31.073345 7f9955a2b700  1 heartbeat_map
> > > is_healthy 'tp_osd thread tp_osd' had timed out after 15 2017-02-13
> > > 11:30:31.073347 7f9955a2b700  1 heartbeat_map is_healthy 'tp_osd thread
> > > tp_osd' had timed out after 15 2017-02-13 11:30:31.073348 7f9955a2b700  1
> > > heartbeat_map is_healthy 'tp_osd thread tp_osd' had timed out after
> > > 152017-02-13 11:30:54.772516 7f995c79b940  0 osd.315 44703 done with
> > init,
> > > starting boot process*
> > >
> > >
> > > *Thanks,*
> > > *Muthu*
> > >
> > > On 13 February 2017 at 10:50, Andreas Gerstmayr <
> > andreas.gerstm...@gmail.com
> > > > wrote:
> > >
> > > > Hi,
> > > >
> > > > Due to a faulty upgrade from Jewel 10.2.0 to Kraken 11.2.0 our test
> > > > cluster has been unhealthy for about two weeks and can't recover itself
> > > > anymore (unfortunately I skipped the upgrade to 10.2.5 because I
> > > > missed the ".z" in "All clusters must first be upgraded to Jewel
> > > > 10.2.z").
> > > >
> > > > Immediately after the upgrade I saw the following in the OSD logs:
> > > > 



[ceph-users] For what `radosgw-admin bucket check` is used?

2017-03-20 Thread yaozongyou
Hi, 

I want to know what `radosgw-admin bucket check` is used for. The manual at
http://docs.ceph.com/docs/jewel/man/8/radosgw-admin/ only says that it checks
the bucket index. I want to know how the bucket index is checked.


thanks
yaozongyou
2017/3/20
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: rbd: show the snapshot tree

2017-03-20 Thread Dongsheng Yang

Resending.

As the mailing list does not accept too large a mail, I removed the
attachments from the original message.

Please refer to [1] for the pictures.

[1]: https://github.com/ceph/ceph/pull/13870#issuecomment-287042199

Thanx
Dongsheng

 Original Message 
Subject:rbd: show the snapshot tree
Date:   Fri, 17 Mar 2017 11:50:17 +0800
From:   Dongsheng Yang 
To: 'Ceph Users' 
CC: jason , chenhanx...@gmail.com



Hi guys,

 There is an idea about showing the snapshots of an image in a tree
view, like what VMware is doing in the attached screenshot (vmware.jpeg).

 So I think it would be good to implement a similar feature in rbd, as
attached (rbd_snap_tree.jpeg).
But is that a hot requirement for us? As Jason mentioned, a rollback is
a pretty large sledgehammer
(i.e. not efficient) approach to switching between a hierarchy of
historical configurations. Maybe clone is
much better. If so, the hierarchy of snapshots is not a "tree", but a
"line".

 Therefore, I am here to collect more opinions on this topic. What
do you think about this feature?

BTW, Jason provided another idea: show a tree view of the relationship
between parents and
children. Yes, that's another idea, but I think it's good. What do you
think about this one?

Thanx
Dongsheng



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com