[ceph-users] subscribe to ceph-user list

2018-01-15 Thread German Anders



Re: [ceph-users] Recommendations for I/O (blk-mq) scheduler for HDDs and SSDs?

2017-12-11 Thread German Anders
Hi Patrick,

Some thoughts about blk-mq:

*(virtio-blk)*

   - it's activated by default on kernels >= 3.13 for the virtio-blk driver

   - *The blk-mq feature is currently implemented, and enabled by default,
   in the following drivers: virtio-blk, mtip32xx, nvme, and rbd*.
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

   - can be checked with "cat /sys/block/vda/queue/scheduler"; it appears
   as none

   - https://serverfault.com/questions/693348/what-does-it-mean-when-linux-has-no-i-o-scheduler

*host local disks (scsi-mq)*

   - with "sda" (SCSI) disks, either rotational or SSD, it's NOT activated
   by default on Ubuntu (cat /sys/block/sda/queue/scheduler)

   - Canonical deactivated it in kernels >= 3.18:
   https://bugs.launchpad.net/ubuntu/+source/linux/+bug/1397061

   - SUSE says it does not suit rotational SCSI disks, but it's OK for SSDs:
   https://doc.opensuse.org/documentation/leap/tuning/html/book.sle.tuning/cha.tuning.io.html#cha.tuning.io.scsimq


   - redhat says: "*The scsi-mq feature is provided as a Technology Preview
   in Red Hat Enterprise Linux 7.2. To enable scsi-mq, specify
   scsi_mod.use_blk_mq=y on the kernel command line. The default value is n
   (disabled).*"
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

   - how to change it: edit /etc/default/grub and set
   GRUB_CMDLINE_LINUX="scsi_mod.use_blk_mq=1", then run update-grub and
   reboot (see the quick check below)
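
A quick sanity check after the reboot (a hedged sketch; sda is just an
example device, and the scheduler list varies by kernel version):

# cat /sys/module/scsi_mod/parameters/use_blk_mq
Y
# cat /sys/block/sda/queue/scheduler
[mq-deadline] kyber bfq none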

*ceph (rbd)*

   - it's activated by default: *The blk-mq feature is currently
   implemented, and enabled by default, in the following drivers: virtio-blk,
   mtip32xx, nvme, and rbd*.
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

*multipath (device mapper; dm / dm-mpath)*

   - how to change it: dm_mod.use_blk_mq=y

   - deactivated by default, how to verify: *To determine whether DM
   multipath is using blk-mq on a system, cat the file
   /sys/block/dm-X/dm/use_blk_mq, where dm-X is replaced by the DM multipath
   device of interest. This file is read-only and reflects what the global
   value in /sys/module/dm_mod/parameters/use_blk_mq was at the time the
   request-based DM multipath device was created*.
   (https://access.redhat.com/documentation/en-us/red_hat_enterprise_linux/7/html/7.2_release_notes/storage)

   - I thought it would not make any sense, since iSCSI is by definition
   (network) much slower than SSD/NVMe, which is what blk-mq was designed
   for, but...: *It may be beneficial to set dm_mod.use_blk_mq=y if the
   underlying SCSI devices are also using blk-mq, as doing so reduces locking
   overhead at the DM layer*. (redhat)


observations

   - WARNING low performance: https://www.redhat.com/archives/dm-devel/2016-February/msg00036.html

   - request-based device mapper targets were planned for kernel 4.1

   - as of kernel >= 4.12, Linux also ships with BFQ, a scheduler based on blk-mq

We tried several schedulers in our environment, but we didn't notice an
improvement significant enough to justify a global change across the whole
environment. But the best thing is to change/test/document and repeat, again
and again :)
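
For example, a minimal fio run for comparing schedulers on a single device
could look like this (a sketch only; the device name and job values are
examples, and the write portion is destructive, so only point it at a
scratch device):

# echo kyber > /sys/block/sda/queue/scheduler
# fio --name=sched-test --filename=/dev/sda --direct=1 --rw=randrw \
    --bs=4k --iodepth=32 --runtime=60 --time_based --group_reporting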

Hope it helps

Best,



*German*

2017-12-11 18:17 GMT-03:00 Patrick Fruh :

> Hi,
>
>
>
> after reading a lot about I/O schedulers and performance gains with
> blk-mq, I switched to a custom 4.14.5 kernel with  CONFIG_SCSI_MQ_DEFAULT
> enabled to have blk-mq for all devices on my cluster.
>
>
>
> This allows me to use the following schedulers for HDDs and SSDs:
>
> mq-deadline, kyber, bfq, none
>
>
>
> I’ve currently set the HDD scheduler to bfq and the SSD scheduler to none,
> however I’m still not sure if this is the best solution performance-wise.
>
> Does anyone have more experience with this and can maybe give me a
> recommendation? I’m not even sure if blk-mq is a good idea for ceph, since
> I haven’t really found anything on the topic.
>
>
>
> Best,
>
> Patrick
>

Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread German Anders
Yes, it includes all the available pools on the cluster:

*# ceph df*
GLOBAL:
SIZE   AVAIL  RAW USED %RAW USED
53650G 42928G   10722G 19.99
POOLS:
NAME            ID USED  %USED MAX AVAIL OBJECTS
volumes         13 2979G 33.73     5854G  767183
db              18  856G  4.65    17563G 1657174
cephfs_data     22   880     0     5854G       6
cephfs_metadata 23  977k     0     5854G      65

*# rados lspools*
volumes
db
cephfs_data
cephfs_metadata

The good news is that after restarting the ceph-mgr, it started to work :)
but like you said, it would be nice to know how the system got into this
state.
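
For the record, the restart was just the standard systemd unit (the
instance name is host-specific; cpm01 is our mon/mgr host):

# systemctl restart ceph-mgr@cpm01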

Thanks a lot John :)

Best,


*German*

2017-12-11 12:17 GMT-03:00 John Spray <jsp...@redhat.com>:

> On Mon, Dec 11, 2017 at 3:13 PM, German Anders <gand...@despegar.com>
> wrote:
> > Hi John,
> >
> > how are you? no problem :) . Unfortunately the error on the 'ceph fs
> status'
> > command is still happening:
>
> OK, can you check:
>  - does the "ceph df" output include all the pools?
>  - does restarting ceph-mgr clear the issue?
>
> We probably need to modify this code to handle stats-less pools
> anyway, but I'm curious about how the system got into this state.
>
> John
>
>
> > # ceph fs status
> > Error EINVAL: Traceback (most recent call last):
> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> > return self.handle_fs_status(cmd)
> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> handle_fs_status
> > stats = pool_stats[pool_id]
> > KeyError: (15L,)
> >
> >
> >
> > German
> > 2017-12-11 12:08 GMT-03:00 John Spray <jsp...@redhat.com>:
> >>
> >> On Mon, Dec 4, 2017 at 6:37 PM, German Anders <gand...@despegar.com>
> >> wrote:
> >> > Hi,
> >> >
> >> > I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2
> >> > (stable),
> >> > and i'm getting a traceback while trying to run:
> >> >
> >> > # ceph fs status
> >> >
> >> > Error EINVAL: Traceback (most recent call last):
> >> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in
> handle_command
> >> > return self.handle_fs_status(cmd)
> >> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> >> > handle_fs_status
> >> > stats = pool_stats[pool_id]
> >> > KeyError: (15L,)
> >> >
> >> >
> >> > # ceph fs ls
> >> > name: cephfs, metadata pool: cephfs_metadata, data pools:
> [cephfs_data ]
> >> >
> >> >
> >> > Any ideas?
> >>
> >> (I'm a bit late but...)
> >>
> >> Is this still happening or did it self-correct?  It could have been
> >> happening when the pool had just been created but the mgr hadn't heard
> >> about any stats from the OSDs about that pool yet (which we should
> >> fix, anyway)
> >>
> >> John
> >>
> >>
> >> >
> >> > Thanks in advance,
> >> >
> >> > German
> >> >
> >
> >
>


Re: [ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-11 Thread German Anders
Hi John,

how are you? no problem :) . Unfortunately the error on the 'ceph fs
status' command is still happening:

*# ceph fs status*
Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
return self.handle_fs_status(cmd)
  File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
stats = pool_stats[pool_id]
KeyError: (15L,)



*German*
2017-12-11 12:08 GMT-03:00 John Spray <jsp...@redhat.com>:

> On Mon, Dec 4, 2017 at 6:37 PM, German Anders <gand...@despegar.com>
> wrote:
> > Hi,
> >
> > I just upgrade a ceph cluster from version 12.2.0 (rc) to 12.2.2
> (stable),
> > and i'm getting a traceback while trying to run:
> >
> > # ceph fs status
> >
> > Error EINVAL: Traceback (most recent call last):
> >   File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
> > return self.handle_fs_status(cmd)
> >   File "/usr/lib/ceph/mgr/status/module.py", line 219, in
> handle_fs_status
> > stats = pool_stats[pool_id]
> > KeyError: (15L,)
> >
> >
> > # ceph fs ls
> > name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]
> >
> >
> > Any ideas?
>
> (I'm a bit late but...)
>
> Is this still happening or did it self-correct?  It could have been
> happening when the pool had just been created but the mgr hadn't heard
> about any stats from the OSDs about that pool yet (which we should
> fix, anyway)
>
> John
>
>
> >
> > Thanks in advance,
> >
> > German
> >
>


[ceph-users] luminous 12.2.2 traceback (ceph fs status)

2017-12-04 Thread German Anders
Hi,

I just upgraded a ceph cluster from version 12.2.0 (rc) to 12.2.2 (stable),
and I'm getting a traceback while trying to run:

*# ceph fs status*

Error EINVAL: Traceback (most recent call last):
  File "/usr/lib/ceph/mgr/status/module.py", line 301, in handle_command
return self.handle_fs_status(cmd)
  File "/usr/lib/ceph/mgr/status/module.py", line 219, in handle_fs_status
stats = pool_stats[pool_id]
KeyError: (15L,)


*# ceph fs ls*
name: cephfs, metadata pool: cephfs_metadata, data pools: [cephfs_data ]


Any ideas?

Thanks in advance,

*German*


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-12-04 Thread German Anders
Could anyone run the tests and share some results?

Thanks in advance,

Best,


*German*

2017-11-30 14:25 GMT-03:00 German Anders <gand...@despegar.com>:

> That's correct, IPoIB for the backend (already configured the irq
> affinity),  and 10GbE on the frontend. I would love to try rdma but like
> you said is not stable for production, so I think I'll have to wait for
> that. Yeah, the thing is that it's not my decision to go for 50GbE or
> 100GbE... :( so.. 10GbE for the front-end will be...
>
> Would be really helpful if someone could run the following sysbench test
> on a mysql db so I could make some compares:
>
> *my.cnf *configuration file:
>
> [mysqld_safe]
> nice= 0
> pid-file= /home/test_db/mysql/mysql.pid
>
> [client]
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
>
> [mysqld]
> user= test_db
> port= 33033
> socket  = /home/test_db/mysql/mysql.sock
> pid-file= /home/test_db/mysql/mysql.pid
> log-error   = /home/test_db/mysql/mysql.err
> datadir = /home/test_db/mysql/data
> tmpdir  = /tmp
> server-id   = 1
>
> # ** Binlogging **
> #log-bin= /home/test_db/mysql/binlog/mysql-bin
> #log_bin_index  = /home/test_db/mysql/binlog/mysql-bin.index
> expire_logs_days= 1
> max_binlog_size = 512MB
>
> thread_handling = pool-of-threads
> thread_pool_max_threads = 300
>
>
> # ** Slow query log **
> slow_query_log  = 1
> slow_query_log_file = /home/test_db/mysql/mysql-slow.log
> long_query_time = 10
> log_output  = FILE
> log_slow_slave_statements   = 1
> log_slow_verbosity  = query_plan,innodb,explain
>
> # ** INNODB Specific options **
> transaction_isolation   = READ-COMMITTED
> innodb_buffer_pool_size = 12G
> innodb_data_file_path   = ibdata1:256M:autoextend
> innodb_thread_concurrency   = 16
> innodb_log_file_size= 256M
> innodb_log_files_in_group   = 3
> innodb_file_per_table
> innodb_log_buffer_size  = 16M
> innodb_stats_on_metadata= 0
> innodb_lock_wait_timeout= 30
> # innodb_flush_method   = O_DSYNC
> innodb_flush_method = O_DIRECT
> max_connections = 1
> max_connect_errors  = 99
> max_allowed_packet  = 128M
> skip-host-cache
> skip-name-resolve
> explicit_defaults_for_timestamp = 1
> performance_schema  = OFF
> log_warnings= 2
> event_scheduler = ON
>
> # ** Specific Galera Cluster Settings **
> binlog_format   = ROW
> default-storage-engine  = innodb
> query_cache_size= 0
> query_cache_type= 0
>
>
> Volume is just an RBD (on a RF=3 pool) with the default 22 bit order
> mounted on */home/test_db/mysql/data*
>
> commands for the test:
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null
>
> sysbench 
> --test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
> --oltp-read-only=off --oltp-table-size=20 --threads=10
> --rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null
>
> sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
> --mysql-host= --mysql-port=33033 --mysql-user=sysbench
> --mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
> --db-driver=mysql --oltp

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-30 Thread German Anders
That's correct, IPoIB for the backend (already configured the irq
affinity), and 10GbE on the frontend. I would love to try rdma, but like
you said it's not stable for production, so I think I'll have to wait for
that. Yeah, the thing is that it's not my decision to go for 50GbE or
100GbE... :( so.. 10GbE for the front-end it will be...

It would be really helpful if someone could run the following sysbench test
on a mysql db so I could make some comparisons:

*my.cnf *configuration file:

[mysqld_safe]
nice= 0
pid-file= /home/test_db/mysql/mysql.pid

[client]
port= 33033
socket  = /home/test_db/mysql/mysql.sock

[mysqld]
user= test_db
port= 33033
socket  = /home/test_db/mysql/mysql.sock
pid-file= /home/test_db/mysql/mysql.pid
log-error   = /home/test_db/mysql/mysql.err
datadir = /home/test_db/mysql/data
tmpdir  = /tmp
server-id   = 1

# ** Binlogging **
#log-bin= /home/test_db/mysql/binlog/mysql-bin
#log_bin_index  = /home/test_db/mysql/binlog/mysql-bin.index
expire_logs_days= 1
max_binlog_size = 512MB

thread_handling = pool-of-threads
thread_pool_max_threads = 300


# ** Slow query log **
slow_query_log  = 1
slow_query_log_file = /home/test_db/mysql/mysql-slow.log
long_query_time = 10
log_output  = FILE
log_slow_slave_statements   = 1
log_slow_verbosity  = query_plan,innodb,explain

# ** INNODB Specific options **
transaction_isolation   = READ-COMMITTED
innodb_buffer_pool_size = 12G
innodb_data_file_path   = ibdata1:256M:autoextend
innodb_thread_concurrency   = 16
innodb_log_file_size= 256M
innodb_log_files_in_group   = 3
innodb_file_per_table
innodb_log_buffer_size  = 16M
innodb_stats_on_metadata= 0
innodb_lock_wait_timeout= 30
# innodb_flush_method   = O_DSYNC
innodb_flush_method = O_DIRECT
max_connections = 1
max_connect_errors  = 99
max_allowed_packet  = 128M
skip-host-cache
skip-name-resolve
explicit_defaults_for_timestamp = 1
performance_schema  = OFF
log_warnings= 2
event_scheduler = ON

# ** Specific Galera Cluster Settings **
binlog_format   = ROW
default-storage-engine  = innodb
query_cache_size= 0
query_cache_type= 0


Volume is just an RBD (on a RF=3 pool) with the default 22-bit order,
mounted on */home/test_db/mysql/data* (a provisioning sketch is below).
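
For completeness, a sketch of how such a volume can be provisioned (pool,
image name and size here are examples; a 22-bit order means 4 MiB objects,
which is the rbd default anyway):

# rbd create volumes/mysql-test --size 200G --object-size 4M
# rbd map volumes/mysql-test
# mkfs.xfs /dev/rbd0
# mount /dev/rbd0 /home/test_db/mysql/data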

commands for the test:

sysbench
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=10
--rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null

sysbench
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=10
--rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=20
--rand-type=uniform --rand-init=on --time=120 run >
result_sysbench_perf_test.out 2>/dev/null

I'm looking for tps, qps and the 95th percentile; could anyone with an
all-nvme cluster run the test and share the results? I would really
appreciate the help :)

Thanks in advance,

Best,


*German *

2017-11-29 19:14 GMT-03:00 Zoltan Arnold Nagy <zol...@linux.vnet.ibm.com>:

> On 2017-11-27 14:02, German Anders wrote:
>
>> 4x 2U servers:
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter 

Re: [ceph-users] Transparent huge pages

2017-11-29 Thread German Anders
Is it possible that, on Ubuntu with kernel version 4.12.14 at least, it
comes by default with the parameter set to [madvise]?
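
The active setting can be checked directly; the bracketed value is the one
in effect:

# cat /sys/kernel/mm/transparent_hugepage/enabled
always [madvise] never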



*German*

2017-11-28 12:07 GMT-03:00 Nigel Williams :

> Given that memory is a key resource for Ceph, this advice about switching
> Transparent Huge Pages kernel setting to madvise would be worth testing to
> see if THP is helping or hindering.
>
> Article:
> https://blog.nelhage.com/post/transparent-hugepages/
>
> Discussion:
> https://news.ycombinator.com/item?id=15795337
>
>
> echo madvise | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
>
>
>
>


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread German Anders
I don't know if there are any statistics available really, but I'm running
some sysbench tests with mysql before the changes, and the idea is to run
those tests again after the 'tuning' and see if the numbers get better in
any way. I'm also gathering numbers from some collectd and statsd collectors
running on the osd nodes, so I hope to get some info about that :)


*German*

2017-11-28 16:12 GMT-03:00 Marc Roos <m.r...@f1-outsourcing.eu>:

>
> I was wondering if there are any statistics available that show the
> performance increase of doing such things?
>
>
>
>
>
>
> -Original Message-
> From: German Anders [mailto:gand...@despegar.com]
> Sent: dinsdag 28 november 2017 19:34
> To: Luis Periquito
> Cc: ceph-users
> Subject: Re: [ceph-users] ceph all-nvme mysql performance tuning
>
> Thanks a lot Luis, I agree with you regarding the CPUs, but
> unfortunately those were the best CPU model that we can afford :S
>
> For the NUMA part, I manage to pinned the OSDs by changing the
> /usr/lib/systemd/system/ceph-osd@.service file and adding the
> CPUAffinity list to it. But, this is for ALL the OSDs to specific nodes
> or specific CPU list. But I can't find the way to specify a list for
> only a specific number of OSDs.
>
> Also, I notice that the NVMe disks are all on the same node (since I'm
> using half of the shelf - so the other half will be pinned to the other
> node), so the lanes of the NVMe disks are all on the same CPU (in this
> case 0). Also, I find that the IB adapter that is mapped to the OSD
> network (osd replication) is pinned to CPU 1, so this will cross the QPI
> path.
>
> And for the memory, from the other email, we are already using the
> TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of
> 134217728
>
> In this case I can pinned all the actual OSDs to CPU 0, but in the near
> future when I add more nvme disks to the OSD nodes, I'll definitely need
> to pinned the other half OSDs to CPU 1, someone already did this?
>
> Thanks a lot,
>
> Best,
>
>
>
> German
>
> 2017-11-28 6:36 GMT-03:00 Luis Periquito <periqu...@gmail.com>:
>
>
> There are a few things I don't like about your machines... If you
> want latency/IOPS (as you seemingly do) you really want the highest
> frequency CPUs, even over number of cores. These are not too bad, but
> not great either.
>
> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA
> nodes? Ideally OSD is pinned to same NUMA node the NVMe device is
> connected to. Each NVMe device will be running on PCIe lanes generated
> by one of the CPUs...
>
> What versions of TCMalloc (or jemalloc) are you running? Have you
> tuned them to have a bigger cache?
>
> These are from what I've learned using filestore - I've yet to run
> full tests on bluestore - but they should still apply...
>
> On Mon, Nov 27, 2017 at 5:10 PM, German Anders
> <gand...@despegar.com> wrote:
>
>
> Hi Nick,
>
> yeah, we are using the same nvme disk with an additional
> partition to use as journal/wal. We double check the c-state and it was
> not configure to use c1, so we change that on all the osd nodes and mon
> nodes and we're going to make some new tests, and see how it goes. I'll
> get back as soon as get got those tests running.
>
> Thanks a lot,
>
> Best,
>
>
>
>
>
>
> German
>
> 2017-11-27 12:16 GMT-03:00 Nick Fisk <n...@fisk.me.uk>:
>
>
> From: ceph-users
> [mailto:ceph-users-boun...@lists.ceph.com
> <mailto:ceph-users-boun...@lists.ceph.com> ] On Behalf Of German Anders
> Sent: 27 November 2017 14:44
> To: Maged Mokhtar <mmokh...@petasan.org>
> Cc: ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] ceph all-nvme mysql
> performance
> tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with
> different
> number of threads and we're getting almost the same kind of difference
> between the storage types. Going to try with different rbd stripe size,
> object size values and see if we get more competitive numbers. Will get
> back with more tests and param changes to see if we get better :)
>
>
>
>
>
> Just to echo a couple of comments. Ceph will always
> struggle to match the performance of a traditional array for mainly 2
> reasons.
>
>
>
> 1.  You are replacing some sort

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-28 Thread German Anders
Thanks a lot Luis, I agree with you regarding the CPUs, but unfortunately
those were the best CPU model that we could afford :S

For the NUMA part, I managed to pin the OSDs by changing the
/usr/lib/systemd/system/ceph-osd@.service file and adding a CPUAffinity list
to it. But that applies to ALL the OSDs at once, for specific nodes or a
specific CPU list; I can't find a way to specify a list for only a specific
subset of OSDs (see the sketch below).
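
One option - untested here, but standard systemd behavior - is a
per-instance drop-in, which overrides the affinity for a single OSD only
(the osd id and CPU list below are just examples):

# systemctl edit ceph-osd@12

which creates /etc/systemd/system/ceph-osd@12.service.d/override.conf with:

[Service]
CPUAffinity=0 2 4 6 8 10

followed by:

# systemctl restart ceph-osd@12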

Also, I noticed that the NVMe disks are all on the same node (since I'm
using half of the shelf - so the other half will be pinned to the other
node), so the lanes of the NVMe disks are all on the same CPU (in this case
0). I also found that the IB adapter that is mapped to the OSD network (osd
replication) is pinned to CPU 1, so that traffic will cross the QPI path.

And for the memory, from the other email, we are already using the
TCMALLOC_MAX_TOTAL_THREAD_CACHE_BYTES parameter with a value of 134217728

In this case I can pin all the current OSDs to CPU 0, but in the near
future, when I add more nvme disks to the OSD nodes, I'll definitely need to
pin the other half of the OSDs to CPU 1 - has anyone already done this?

Thanks a lot,

Best,


*German*

2017-11-28 6:36 GMT-03:00 Luis Periquito <periqu...@gmail.com>:

> There are a few things I don't like about your machines... If you want
> latency/IOPS (as you seemingly do) you really want the highest frequency
> CPUs, even over number of cores. These are not too bad, but not great
> either.
>
> Also you have 2x CPU meaning NUMA. Have you pinned OSDs to NUMA nodes?
> Ideally OSD is pinned to same NUMA node the NVMe device is connected to.
> Each NVMe device will be running on PCIe lanes generated by one of the
> CPUs...
>
> What versions of TCMalloc (or jemalloc) are you running? Have you tuned
> them to have a bigger cache?
>
> These are from what I've learned using filestore - I've yet to run full
> tests on bluestore - but they should still apply...
>
> On Mon, Nov 27, 2017 at 5:10 PM, German Anders <gand...@despegar.com>
> wrote:
>
>> Hi Nick,
>>
>> yeah, we are using the same nvme disk with an additional partition to use
>> as journal/wal. We double check the c-state and it was not configure to use
>> c1, so we change that on all the osd nodes and mon nodes and we're going to
>> make some new tests, and see how it goes. I'll get back as soon as get got
>> those tests running.
>>
>> Thanks a lot,
>>
>> Best,
>>
>>
>> *German*
>>
>> 2017-11-27 12:16 GMT-03:00 Nick Fisk <n...@fisk.me.uk>:
>>
>>> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On
>>> Behalf Of *German Anders
>>> *Sent:* 27 November 2017 14:44
>>> *To:* Maged Mokhtar <mmokh...@petasan.org>
>>> *Cc:* ceph-users <ceph-users@lists.ceph.com>
>>> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>>>
>>>
>>>
>>> Hi Maged,
>>>
>>>
>>>
>>> Thanks a lot for the response. We try with different number of threads
>>> and we're getting almost the same kind of difference between the storage
>>> types. Going to try with different rbd stripe size, object size values and
>>> see if we get more competitive numbers. Will get back with more tests and
>>> param changes to see if we get better :)
>>>
>>>
>>>
>>>
>>>
>>> Just to echo a couple of comments. Ceph will always struggle to match
>>> the performance of a traditional array for mainly 2 reasons.
>>>
>>>
>>>
>>>1. You are replacing some sort of dual ported SAS or internally RDMA
>>>connected device with a network for Ceph replication traffic. This will
>>>instantly have a large impact on write latency
>>>2. Ceph locks at the PG level and a PG will most likely cover at
>>>least one 4MB object, so lots of small accesses to the same blocks (on a
>>>block device) will wait on each other and go effectively at a single
>>>threaded rate.
>>>
>>>
>>>
>>> The best thing you can do to mitigate these, is to run the fastest
>>> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
>>> run your CPU’s at max C and P states.
>>>
>>>
>>>
>>> You stated that you are running the performance profile on the CPU’s.
>>> Could you also just double check that the C-states are being held at C1(e)?
>>> There are a few utilities that can show this in realtime.
>>>
>>>
>>>
>>> Other than that, although there could be some minor tweaks, you are
>>> proba

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Nick,

yeah, we are using the same nvme disk with an additional partition to use
as journal/wal. We double-checked the c-states and they were not configured
to use C1, so we changed that on all the osd and mon nodes, and we're going
to run some new tests and see how it goes. I'll get back as soon as we've
got those tests running.
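
In case it's useful to others, C-state residency can be watched in realtime
with, e.g. (both tools ship in the linux-tools packages on Ubuntu):

# cpupower idle-info     # lists the available C-states and usage counters
# turbostat -i 5         # per-core C-state residency every 5 seconds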

Thanks a lot,

Best,


*German*

2017-11-27 12:16 GMT-03:00 Nick Fisk <n...@fisk.me.uk>:

> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 27 November 2017 14:44
> *To:* Maged Mokhtar <mmokh...@petasan.org>
> *Cc:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* Re: [ceph-users] ceph all-nvme mysql performance tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with different number of threads and
> we're getting almost the same kind of difference between the storage types.
> Going to try with different rbd stripe size, object size values and see if
> we get more competitive numbers. Will get back with more tests and param
> changes to see if we get better :)
>
>
>
>
>
> Just to echo a couple of comments. Ceph will always struggle to match the
> performance of a traditional array for mainly 2 reasons.
>
>
>
>1. You are replacing some sort of dual ported SAS or internally RDMA
>connected device with a network for Ceph replication traffic. This will
>instantly have a large impact on write latency
>2. Ceph locks at the PG level and a PG will most likely cover at least
>one 4MB object, so lots of small accesses to the same blocks (on a block
>device) will wait on each other and go effectively at a single threaded
>rate.
>
>
>
> The best thing you can do to mitigate these, is to run the fastest
> journal/WAL devices you can, fastest network connections (ie 25Gb/s) and
> run your CPU’s at max C and P states.
>
>
>
> You stated that you are running the performance profile on the CPU’s.
> Could you also just double check that the C-states are being held at C1(e)?
> There are a few utilities that can show this in realtime.
>
>
>
> Other than that, although there could be some minor tweaks, you are
> probably nearing the limit of what you can hope to achieve.
>
>
>
> Nick
>
>
>
>
>
> Thanks,
>
>
>
> Best,
>
>
> *German*
>
>
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
>
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
>
>
> The cluster configuration is the following:
>
>
>
> *MON Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 3x 1U servers:
>
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>
>
> *OSD Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 4x 2U servers:
>
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   1x Ethernet Controller 10G X550T
>
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
>
>
>
> Here's the tree:
>
>
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -7   48.0 root root
>
> -5   24.0 rack rack1
>
> -1   12.0 node cpn01
>
>  0  nvme  1.0 osd.0  up  1.0 1.0
>
>  1  nvme  1.0 osd.1  up  1.0 1.0
>
>  2  nvme  1.0 osd.2  up  1.0 1.0
>
>  3  nvme  1.0 osd.3  up  1.0 1.0
>
>  4  nvme  1.0 osd.4  up  1.0 1.0
>
>  5  nvme  1.0 osd.5  up  1.0 1.0
>
>  6  nvme  1.0 osd.6  up  1.0 1.0
>
>  7  nvme  1.0 osd.7  up  1.0 1.0
>
>  8  nvme  1.0 osd.8  up  1.0 1.0
>
>  9  nvme  1.0 osd.9  up  1.0 1.0
>
> 10  nvme  1.0 osd.10 up  1.0 1.0
>
> 11  nvme  1.0 osd.11 up  1.0 1.0
>
> -3  

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi David,
Thanks a lot for the response. In fact, we first tried not using any
scheduler at all, but then we tried the kyber iosched and noticed a slight
improvement in terms of performance; that's why we actually kept it.
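
For reference, switching is just a sysfs write at runtime (nvme0n1 is an
example device; kyber requires kernel >= 4.12):

# cat /sys/block/nvme0n1/queue/scheduler
[none] mq-deadline kyber bfq
# echo kyber > /sys/block/nvme0n1/queue/scheduler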


*German*

2017-11-27 13:48 GMT-03:00 David Byte <db...@suse.com>:

> From the benchmarks I have seen and done myself, I’m not sure why you are
> using an i/o scheduler at all with NVMe.  While there are a few cases where
> it may provide a slight benefit, simply having mq enabled with no scheduler
> seems to provide the best performance for an all flash, especially all
> NVMe, environment.
>
>
>
> David Byte
>
> Sr. Technology Strategist
>
> *SCE Enterprise Linux*
>
> *SCE Enterprise Storage*
>
> Alliances and SUSE Embedded
>
> db...@suse.com
>
> 918.528.4422
>
>
>
> *From: *ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of
> German Anders <gand...@despegar.com>
> *Date: *Monday, November 27, 2017 at 8:44 AM
> *To: *Maged Mokhtar <mmokh...@petasan.org>
> *Cc: *ceph-users <ceph-users@lists.ceph.com>
> *Subject: *Re: [ceph-users] ceph all-nvme mysql performance tuning
>
>
>
> Hi Maged,
>
>
>
> Thanks a lot for the response. We try with different number of threads and
> we're getting almost the same kind of difference between the storage types.
> Going to try with different rbd stripe size, object size values and see if
> we get more competitive numbers. Will get back with more tests and param
> changes to see if we get better :)
>
>
>
> Thanks,
>
>
>
> Best,
>
>
> *German*
>
>
>
> 2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:
>
> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
>
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
>
>
> The cluster configuration is the following:
>
>
>
> *MON Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 3x 1U servers:
>
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>
>
> *OSD Nodes:*
>
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>
> 4x 2U servers:
>
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>
>   128G RAM
>
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>
>   1x Ethernet Controller 10G X550T
>
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
>
>
>
> Here's the tree:
>
>
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>
> -7   48.0 root root
>
> -5   24.0 rack rack1
>
> -1   12.0 node cpn01
>
>  0  nvme  1.0 osd.0  up  1.0 1.0
>
>  1  nvme  1.0 osd.1  up  1.0 1.0
>
>  2  nvme  1.0 osd.2  up  1.0 1.0
>
>  3  nvme  1.0 osd.3  up  1.0 1.0
>
>  4  nvme  1.0 osd.4  up  1.0 1.0
>
>  5  nvme  1.0 osd.5  up  1.0 1.0
>
>  6  nvme  1.0 osd.6  up  1.0 1.0
>
>  7  nvme  1.0 osd.7  up  1.0 1.0
>
>  8  nvme  1.0 osd.8  up  1.0 1.0
>
>  9  nvme  1.0 osd.9  up  1.0 1.0
>
> 10  nvme  1.0 osd.10 up  1.0 1.0
>
> 11  nvme  1.0 osd.11 up  1.0 1.0
>
> -3   12.0 node cpn03
>
> 24  nvme  1.0 osd.24 up  1.0 1.0
>
> 25  nvme  1.0 osd.25 up  1.0 1.0
>
> 26  nvme  1.0 osd.26 up  1.0 1.0
>
> 27  nvme  1.0 osd.27 up  1.0 1.0
>
> 28  nvme  1.0 osd.28 up  1.0 1.0
>
> 29  nvme  1.0 osd.29 up  1.0 1.0
>
> 30  nvme  1.0 osd.30 up  1.0 1.0
>
> 31  nvme  1.0 osd.31 up  1.0 1.0
>
> 32  nvme  1.0 osd.32 up  1.0 1.0
>
> 33  nvme  1.0 osd.33 up  1.0 1.0
>
> 34  nvme  1.0 osd.34 up  

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Maged,

Thanks a lot for the response. We tried with different numbers of threads
and we're getting almost the same kind of difference between the storage
types. Going to try different rbd stripe sizes and object size values and
see if we get more competitive numbers. Will get back with more tests and
param changes to see if we get better :)

Thanks,

Best,

*German*

2017-11-27 11:36 GMT-03:00 Maged Mokhtar <mmokh...@petasan.org>:

> On 2017-11-27 15:02, German Anders wrote:
>
> Hi All,
>
> I've a performance question, we recently install a brand new Ceph cluster
> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> The back-end of the cluster is using a bond IPoIB (active/passive) , and
> for the front-end we are using a bonding config with active/active (20GbE)
> to communicate with the clients.
>
> The cluster configuration is the following:
>
> *MON Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 3x 1U servers:
>   2x Intel Xeon E5-2630v4 @2.2Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>
> *OSD Nodes:*
> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> 4x 2U servers:
>   2x Intel Xeon E5-2640v4 @2.4Ghz
>   128G RAM
>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>   1x Ethernet Controller 10G X550T
>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>
>
> Here's the tree:
>
> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -7   48.0 root root
> -5   24.0 rack rack1
> -1   12.0 node cpn01
>  0  nvme  1.0 osd.0  up  1.0 1.0
>  1  nvme  1.0 osd.1  up  1.0 1.0
>  2  nvme  1.0 osd.2  up  1.0 1.0
>  3  nvme  1.0 osd.3  up  1.0 1.0
>  4  nvme  1.0 osd.4  up  1.0 1.0
>  5  nvme  1.0 osd.5  up  1.0 1.0
>  6  nvme  1.0 osd.6  up  1.0 1.0
>  7  nvme  1.0 osd.7  up  1.0 1.0
>  8  nvme  1.0 osd.8  up  1.0 1.0
>  9  nvme  1.0 osd.9  up  1.0 1.0
> 10  nvme  1.0 osd.10 up  1.0 1.0
> 11  nvme  1.0 osd.11 up  1.0 1.0
> -3   12.0 node cpn03
> 24  nvme  1.0 osd.24 up  1.0 1.0
> 25  nvme  1.0 osd.25 up  1.0 1.0
> 26  nvme  1.0 osd.26 up  1.0 1.0
> 27  nvme  1.0 osd.27 up  1.0 1.0
> 28  nvme  1.0 osd.28 up  1.0 1.0
> 29  nvme  1.0 osd.29 up  1.0 1.0
> 30  nvme  1.0 osd.30 up  1.0 1.0
> 31  nvme  1.0 osd.31 up  1.0 1.0
> 32  nvme  1.0 osd.32 up  1.0 1.0
> 33  nvme  1.0 osd.33 up  1.0 1.0
> 34  nvme  1.0 osd.34 up  1.0 1.0
> 35  nvme  1.0 osd.35 up  1.0 1.0
> -6   24.0 rack rack2
> -2   12.0 node cpn02
> 12  nvme  1.0 osd.12 up  1.0 1.0
> 13  nvme  1.0 osd.13 up  1.0 1.0
> 14  nvme  1.0 osd.14 up  1.0 1.0
> 15  nvme  1.0 osd.15 up  1.0 1.0
> 16  nvme  1.0 osd.16 up  1.0 1.0
> 17  nvme  1.0 osd.17 up  1.0 1.0
> 18  nvme  1.0 osd.18 up  1.0 1.0
> 19  nvme  1.0 osd.19 up  1.0 1.0
> 20  nvme  1.0 osd.20 up  1.0 1.0
> 21  nvme  1.0 osd.21 up  1.0 1.0
> 22  nvme  1.0 osd.22 up  1.0 1.0
> 23  nvme  1.0 osd.23 up  1.0 1.0
> -4   12.0 node cpn04
> 36  nvme  1.0 osd.36 up  1.0 1.0
> 37  nvme  1.0 osd.37 up  1.0 1.0
> 38  nvme  1.0 osd.38 up  1.0 1.0
> 39  nvme  1.0 osd.39 up  1.0 1.0
> 40  nvme  1.0 osd.40 up  1.0 1.0
> 41  nvme  1.0 osd.41 up  1.0 1.0
> 42  nvme  1.0 osd.42 up  1.0 1.0
> 43  nvme  1.0 osd.43 up  1.0 1.0
> 44  nvme  1.0 osd.44 up  1.0 1.0
> 45  nvme  1.0 osd.45   

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Wido, thanks a lot for the quick response, regarding the questions:

Have you tried to attach multiple RBD volumes:

- Root for OS (the root partition has local SSDs)
- MySQL data dir (the idea is to have all the storage tests use the same
scheme; the first test uses one volume holding the data dir, innodb logs and
bin log)
- MySQL InnoDB Logfile
- MySQL Binary Logging

So 4 disks in total to spread the I/O over. (The following tests are going
to be spread across 3 disks, and we'll make a new comparison between the
arrays - a sketch of that split is below.)
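
A sketch of what that split could look like in my.cnf (the mount points are
examples; each directory sits on its own RBD-backed mount):

datadir                   = /mnt/rbd-data/mysql
innodb_log_group_home_dir = /mnt/rbd-redolog
log-bin                   = /mnt/rbd-binlog/mysql-bin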

Regarding the version of librbd, it's not a typo - we use this server with
an old ceph cluster as well. We are going to upgrade the version and see if
the tests get better.

Thanks


*German*

2017-11-27 10:16 GMT-03:00 Wido den Hollander <w...@42on.com>:

>
> > Op 27 november 2017 om 14:02 schreef German Anders <gand...@despegar.com
> >:
> >
> >
> > Hi All,
> >
> > I've a performance question, we recently install a brand new Ceph cluster
> > with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
> > The back-end of the cluster is using a bond IPoIB (active/passive) , and
> > for the front-end we are using a bonding config with active/active
> (20GbE)
> > to communicate with the clients.
> >
> > The cluster configuration is the following:
> >
> > *MON Nodes:*
> > OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> > 3x 1U servers:
> >   2x Intel Xeon E5-2630v4 @2.2Ghz
> >   128G RAM
> >   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >
> > *OSD Nodes:*
> > OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
> > 4x 2U servers:
> >   2x Intel Xeon E5-2640v4 @2.4Ghz
> >   128G RAM
> >   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
> >   1x Ethernet Controller 10G X550T
> >   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
> >   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
> >   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
> >
> >
> > Here's the tree:
> >
> > ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> > -7   48.0 root root
> > -5   24.0 rack rack1
> > -1   12.0 node cpn01
> >  0  nvme  1.0 osd.0  up  1.0 1.0
> >  1  nvme  1.0 osd.1  up  1.0 1.0
> >  2  nvme  1.0 osd.2  up  1.0 1.0
> >  3  nvme  1.0 osd.3  up  1.0 1.0
> >  4  nvme  1.0 osd.4  up  1.0 1.0
> >  5  nvme  1.0 osd.5  up  1.0 1.0
> >  6  nvme  1.0 osd.6  up  1.0 1.0
> >  7  nvme  1.0 osd.7  up  1.0 1.0
> >  8  nvme  1.0 osd.8  up  1.0 1.0
> >  9  nvme  1.0 osd.9  up  1.0 1.0
> > 10  nvme  1.0 osd.10 up  1.0 1.0
> > 11  nvme  1.0 osd.11 up  1.0 1.0
> > -3   12.0 node cpn03
> > 24  nvme  1.0 osd.24 up  1.0 1.0
> > 25  nvme  1.0 osd.25 up  1.0 1.0
> > 26  nvme  1.0 osd.26 up  1.0 1.0
> > 27  nvme  1.0 osd.27 up  1.0 1.0
> > 28  nvme  1.0 osd.28 up  1.0 1.0
> > 29  nvme  1.0 osd.29 up  1.0 1.0
> > 30  nvme  1.0 osd.30 up  1.0 1.0
> > 31  nvme  1.0 osd.31 up  1.0 1.0
> > 32  nvme  1.0 osd.32 up  1.0 1.0
> > 33  nvme  1.0 osd.33 up  1.0 1.0
> > 34  nvme  1.0 osd.34 up  1.0 1.0
> > 35  nvme  1.0 osd.35 up  1.0 1.0
> > -6   24.0 rack rack2
> > -2   12.0 node cpn02
> > 12  nvme  1.0 osd.12 up  1.0 1.0
> > 13  nvme  1.0 osd.13 up  1.0 1.0
> > 14  nvme  1.0 osd.14 up  1.0 1.0
> > 15  nvme  1.0 osd.15 up  1.0 1.0
> > 16  nvme  1.0 osd.16 up  1.0 1.0
> > 17  nvme  1.0 osd.17 up  1.0 1.0
> > 18  nvme  1.0 osd.18 up  1.0 1.0
> > 19  nvme  1.0 osd.19 up  1.0 1.0
> > 20  nvme  1.0 osd.20 up  1.0 1.0
> > 21  nvme  1.0 osd.21 up  1.0 1.0
> > 22  nvme  1.000

Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi Jason,

We are using librbd (librbd1-0.80.5-9.el6.x86_64). OK, I will change those
parameters and see if that changes something.
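
Concretely, that would be something like this in ceph.conf on the client
side (a sketch of the suggested change):

[client]
debug_ms = 0/0    # disable per-message logging, as suggested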

thanks a lot

best,


*German*

2017-11-27 10:09 GMT-03:00 Jason Dillaman <jdill...@redhat.com>:

> Are you using krbd or librbd? You might want to consider "debug_ms = 0/0"
> as well since per-message log gathering takes a large hit on small IO
> performance.
>
> On Mon, Nov 27, 2017 at 8:02 AM, German Anders <gand...@despegar.com>
> wrote:
>
>> Hi All,
>>
>> I've a performance question, we recently install a brand new Ceph cluster
>> with all-nvme disks, using ceph version 12.2.0 with bluestore configured.
>> The back-end of the cluster is using a bond IPoIB (active/passive) , and
>> for the front-end we are using a bonding config with active/active (20GbE)
>> to communicate with the clients.
>>
>> The cluster configuration is the following:
>>
>> *MON Nodes:*
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 3x 1U servers:
>>   2x Intel Xeon E5-2630v4 @2.2Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   2x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>
>> *OSD Nodes:*
>> OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
>> 4x 2U servers:
>>   2x Intel Xeon E5-2640v4 @2.4Ghz
>>   128G RAM
>>   2x Intel SSD DC S3520 150G (in RAID-1 for OS)
>>   1x Ethernet Controller 10G X550T
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
>>
>> Here's the tree:
>>
>> ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
>> -7   48.0 root root
>> -5   24.0 rack rack1
>> -1   12.0 node cpn01
>>  0  nvme  1.0 osd.0  up  1.0 1.0
>>  1  nvme  1.0 osd.1  up  1.0 1.0
>>  2  nvme  1.0 osd.2  up  1.0 1.0
>>  3  nvme  1.0 osd.3  up  1.0 1.0
>>  4  nvme  1.0 osd.4  up  1.0 1.0
>>  5  nvme  1.0 osd.5  up  1.0 1.0
>>  6  nvme  1.0 osd.6  up  1.0 1.0
>>  7  nvme  1.0 osd.7  up  1.0 1.0
>>  8  nvme  1.0 osd.8  up  1.0 1.0
>>  9  nvme  1.0 osd.9  up  1.0 1.0
>> 10  nvme  1.0 osd.10 up  1.0 1.0
>> 11  nvme  1.0 osd.11 up  1.0 1.0
>> -3   12.0 node cpn03
>> 24  nvme  1.0 osd.24 up  1.0 1.0
>> 25  nvme  1.0 osd.25 up  1.0 1.0
>> 26  nvme  1.0 osd.26 up  1.0 1.0
>> 27  nvme  1.0 osd.27 up  1.0 1.0
>> 28  nvme  1.0 osd.28 up  1.0 1.0
>> 29  nvme  1.0 osd.29 up  1.0 1.0
>> 30  nvme  1.0 osd.30 up  1.0 1.0
>> 31  nvme  1.0 osd.31 up  1.0 1.0
>> 32  nvme  1.0 osd.32 up  1.0 1.0
>> 33  nvme  1.0 osd.33 up  1.0 1.0
>> 34  nvme  1.0 osd.34 up  1.0 1.0
>> 35  nvme  1.0 osd.35 up  1.0 1.0
>> -6   24.0 rack rack2
>> -2   12.0 node cpn02
>> 12  nvme  1.0 osd.12 up  1.0 1.0
>> 13  nvme  1.0 osd.13 up  1.0 1.0
>> 14  nvme  1.0 osd.14 up  1.0 1.0
>> 15  nvme  1.0 osd.15 up  1.0 1.0
>> 16  nvme  1.0 osd.16 up  1.0 1.0
>> 17  nvme  1.0 osd.17 up  1.0 1.0
>> 18  nvme  1.0 osd.18 up  1.0 1.0
>> 19  nvme  1.0 osd.19 up  1.0 1.0
>> 20  nvme  1.0 osd.20 up  1.0 1.0
>> 21  nvme  1.0 osd.21 up  1.0 1.0
>> 22  nvme  1.0 osd.22 up  1.0 1.0
>> 23  nvme  1.0 osd.23 up  1.0 1.0
>> -4   12.0 node cpn04
>> 36  nvme  1.0 osd.36 up  1.0 1.0
>> 37  nvme  1.0 osd.37 up  1.0 1.0
>> 38  nvme  1.0 osd.38 up  1.0 1.0
>> 39  nvme  1.0 osd.39 up  1.0 1.0
>> 40  nvme  1.0 

[ceph-users] ceph all-nvme mysql performance tuning

2017-11-27 Thread German Anders
Hi All,

I have a performance question: we recently installed a brand new Ceph
cluster with all-nvme disks, using ceph version 12.2.0 with bluestore
configured. The back-end of the cluster is using a bonded IPoIB link
(active/passive), and for the front-end we are using a bonding config with
active/active (20GbE) to communicate with the clients.

The cluster configuration is the following:

*MON Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
3x 1U servers:
  2x Intel Xeon E5-2630v4 @2.2Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  2x 82599ES 10-Gigabit SFI/SFP+ Network Connection

*OSD Nodes:*
OS: Ubuntu 16.04.3 LTS | kernel 4.12.14
4x 2U servers:
  2x Intel Xeon E5-2640v4 @2.4Ghz
  128G RAM
  2x Intel SSD DC S3520 150G (in RAID-1 for OS)
  1x Ethernet Controller 10G X550T
  1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
  12x Intel SSD DC P3520 1.2T (NVMe) for OSD daemons
  1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)


Here's the tree:

ID CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-7   48.0 root root
-5   24.0 rack rack1
-1   12.0 node cpn01
 0  nvme  1.0 osd.0  up  1.0 1.0
 1  nvme  1.0 osd.1  up  1.0 1.0
 2  nvme  1.0 osd.2  up  1.0 1.0
 3  nvme  1.0 osd.3  up  1.0 1.0
 4  nvme  1.0 osd.4  up  1.0 1.0
 5  nvme  1.0 osd.5  up  1.0 1.0
 6  nvme  1.0 osd.6  up  1.0 1.0
 7  nvme  1.0 osd.7  up  1.0 1.0
 8  nvme  1.0 osd.8  up  1.0 1.0
 9  nvme  1.0 osd.9  up  1.0 1.0
10  nvme  1.0 osd.10 up  1.0 1.0
11  nvme  1.0 osd.11 up  1.0 1.0
-3   12.0 node cpn03
24  nvme  1.0 osd.24 up  1.0 1.0
25  nvme  1.0 osd.25 up  1.0 1.0
26  nvme  1.0 osd.26 up  1.0 1.0
27  nvme  1.0 osd.27 up  1.0 1.0
28  nvme  1.0 osd.28 up  1.0 1.0
29  nvme  1.0 osd.29 up  1.0 1.0
30  nvme  1.0 osd.30 up  1.0 1.0
31  nvme  1.0 osd.31 up  1.0 1.0
32  nvme  1.0 osd.32 up  1.0 1.0
33  nvme  1.0 osd.33 up  1.0 1.0
34  nvme  1.0 osd.34 up  1.0 1.0
35  nvme  1.0 osd.35 up  1.0 1.0
-6   24.0 rack rack2
-2   12.0 node cpn02
12  nvme  1.0 osd.12 up  1.0 1.0
13  nvme  1.0 osd.13 up  1.0 1.0
14  nvme  1.0 osd.14 up  1.0 1.0
15  nvme  1.0 osd.15 up  1.0 1.0
16  nvme  1.0 osd.16 up  1.0 1.0
17  nvme  1.0 osd.17 up  1.0 1.0
18  nvme  1.0 osd.18 up  1.0 1.0
19  nvme  1.0 osd.19 up  1.0 1.0
20  nvme  1.0 osd.20 up  1.0 1.0
21  nvme  1.0 osd.21 up  1.0 1.0
22  nvme  1.0 osd.22 up  1.0 1.0
23  nvme  1.0 osd.23 up  1.0 1.0
-4   12.0 node cpn04
36  nvme  1.0 osd.36 up  1.0 1.0
37  nvme  1.0 osd.37 up  1.0 1.0
38  nvme  1.0 osd.38 up  1.0 1.0
39  nvme  1.0 osd.39 up  1.0 1.0
40  nvme  1.0 osd.40 up  1.0 1.0
41  nvme  1.0 osd.41 up  1.0 1.0
42  nvme  1.0 osd.42 up  1.0 1.0
43  nvme  1.0 osd.43 up  1.0 1.0
44  nvme  1.0 osd.44 up  1.0 1.0
45  nvme  1.0 osd.45 up  1.0 1.0
46  nvme  1.0 osd.46 up  1.0 1.0
47  nvme  1.0 osd.47 up  1.0 1.0

The disk partition of one of the OSD nodes:

NAME   MAJ:MIN RM   SIZE RO TYPE  MOUNTPOINT
nvme6n1259:10   1.1T  0 disk
├─nvme6n1p2259:15   0   1.1T  0 part
└─nvme6n1p1259:13   0   100M  0 part  /var/lib/ceph/osd/ceph-6
nvme9n1259:00   1.1T  0 disk
├─nvme9n1p2259:80   1.1T  0 part
└─nvme9n1p1259:70   100M  0 part  /var/lib/ceph/osd/ceph-9
sdb  8:16   0 139.8G  0 disk
└─sdb1   8:17   0 139.8G  0 part
  └─md0  9:00 139.6G  0 raid1
├─md0p2259:31   0 1K  0 md
├─md0p5259:32   0 139.1G  0 md
│ ├─cpn01--vg-swap 253:10  27.4G  0 lvm   [SWAP]
│ └─cpn01--vg-root 253:0  

Re: [ceph-users] Ceph monitoring

2017-10-02 Thread German Anders
Prometheus has a nice data exporter written in Go; you can then feed the
data into Grafana or any other tool:

https://github.com/digitalocean/ceph_exporter
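
A minimal way to get it running, sketched from the project README (the
image name, the default port 9128 and the Prometheus job below are
assumptions to verify against the repo):

# docker run -d --net=host -v /etc/ceph:/etc/ceph digitalocean/ceph_exporter

and then scrape it from prometheus.yml:

scrape_configs:
  - job_name: 'ceph'
    static_configs:
      - targets: ['ceph-host:9128']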

*German*

2017-10-02 8:34 GMT-03:00 Osama Hasebou :

> Hi Everyone,
>
> Is there a guide/tutorial about how to setup Ceph monitoring system using
> collectd / grafana / graphite ? Other suggestions are welcome as well !
>
> I found some GitHub solutions but not much documentation on how to
> implement.
>
> Thanks.
>
> Regards,
> Ossi
>
>


Re: [ceph-users] Minimum requirements to mount luminous cephfs ?

2017-09-27 Thread German Anders
Try to work with the tunables:

$ *ceph osd crush show-tunables*
{
"choose_local_tries": 0,
"choose_local_fallback_tries": 0,
"choose_total_tries": 50,
"chooseleaf_descend_once": 1,
"chooseleaf_vary_r": 1,
"chooseleaf_stable": 0,
"straw_calc_version": 1,
"allowed_bucket_algs": 54,
"profile": "hammer",
"optimal_tunables": 0,
"legacy_tunables": 0,
"minimum_required_version": "firefly",
"require_feature_tunables": 1,
"require_feature_tunables2": 1,
"has_v2_rules": 0,
"require_feature_tunables3": 1,
"has_v3_rules": 0,
"has_v4_buckets": 0,
"require_feature_tunables5": 0,
"has_v5_rules": 0
}

try to 'disable' the '*require_feature_tunables5*'; with that I think you
should be OK. Maybe there's another way, but that works for me. One way to
change it is to comment out the option "*tunable chooseleaf_stable 1*" in
the crushmap and inject the crushmap back into the cluster (of course that
would cause a lot of data movement on the pgs); the full edit cycle is
sketched below.
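
The full edit cycle, using the standard crushtool workflow:

# ceph osd getcrushmap -o crushmap.bin
# crushtool -d crushmap.bin -o map.txt
# (edit map.txt and comment out the tunable line)
# crushtool -c map.txt -o crushmap.new
# ceph osd setcrushmap -i crushmap.new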




*German*
2017-09-27 9:08 GMT-03:00 Yoann Moulin :

> Hello,
>
> I try to mount a cephfs filesystem from fresh luminous cluster.
>
> With the latest kernel 4.13.3, it works
>
> > $ sudo mount.ceph 
> > iccluster041.iccluster,iccluster042.iccluster,iccluster054.iccluster:/
> /mnt -v -o name=container001,secretfile=/tmp/secret
> > parsing options: name=container001,secretfile=/tmp/secret
>
> > $ df -h /mnt
> > FilesystemSize  Used Avail Use% Mounted on
> > 10.90.38.17,10.90.38.18,10.90.39.5:/   66T   19G   66T   1% /mnt
>
>
> > root@iccluster054:~# ceph auth get client.container001
> > exported keyring for client.container001
> > [client.container001]
> >   key = 
> >   caps mds = "allow rw"
> >   caps mon = "allow r"
> >   caps osd = "allow rw pool=cephfs_data"
>
> > root@iccluster05:~#:/var/log# ceph --cluster container fs authorize
> cephfs client.container001 / rw
> > [client.container001]
> >   key = 
>
> With the latest Ubuntu 16.04 LTS Kernel and ceph-common 12.2.0, I'm not
> able to mount it
>
> > Linux iccluster013 4.4.0-96-generic #119~14.04.1-Ubuntu SMP Wed Sep 13
> 08:40:48 UTC 2017 x86_64 x86_64 x86_64 GNU/Linux
> > ii  ceph-common  12.2.0-1trusty
>amd64common utilities to mount and interact with a ceph
> storage cluster
>
> > root@iccluster013:~# mount.ceph  iccluster041,iccluster042,iccluster054:/
> /mnt -v -o name=container001,secretfile=/tmp/secret
> > parsing options: name=container001,secretfile=/tmp/secret
> > mount error 110 = Connection timed out
>
> here the dmesg :
>
> > [  417.528621] Key type ceph registered
> > [  417.528996] libceph: loaded (mon/osd proto 15/24)
> > [  417.540534] FS-Cache: Netfs 'ceph' registered for caching
> > [  417.540546] ceph: loaded (mds proto 32)
> > [...]
> > [ 2596.609885] libceph: mon1 10.90.38.18:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2596.626797] libceph: mon1 10.90.38.18:6789 missing required protocol
> features
> > [ 2606.960704] libceph: mon0 10.90.38.17:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2606.977621] libceph: mon0 10.90.38.17:6789 missing required protocol
> features
> > [ 2616.944998] libceph: mon0 10.90.38.17:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2616.961917] libceph: mon0 10.90.38.17:6789 missing required protocol
> features
> > [ 2626.961329] libceph: mon0 10.90.38.17:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2626.978290] libceph: mon0 10.90.38.17:6789 missing required protocol
> features
> > [ 2636.945765] libceph: mon0 10.90.38.17:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2636.962677] libceph: mon0 10.90.38.17:6789 missing required protocol
> features
> > [ 2646.962255] libceph: mon1 10.90.38.18:6789 feature set mismatch, my
> 107b84a842aca < server's 40107b84a842aca, missing 400
> > [ 2646.979228] libceph: mon1 10.90.38.18:6789 missing required protocol
> features
>
> Is there specific option to set on the cephfs to be able to mount it on a
> kernel 4.4 ?
>
> Best regards,
>
> --
> Yoann Moulin
> EPFL IC-IT


Re: [ceph-users] after reboot node appear outside the root root tree

2017-09-13 Thread German Anders
Thanks a lot Maxime, I set osd_crush_update_on_start = false in ceph.conf
and pushed it to all the nodes, and then I created a map file:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable chooseleaf_vary_r 1
tunable chooseleaf_stable 1
tunable straw_calc_version 1
tunable allowed_bucket_algs 54

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35
device 36 osd.36
device 37 osd.37
device 38 osd.38
device 39 osd.39
device 40 osd.40
device 41 osd.41
device 42 osd.42
device 43 osd.43
device 44 osd.44
device 45 osd.45
device 46 osd.46
device 47 osd.47

# types
type 0 osd
type 1 node
type 2 rack
type 3 root

# buckets
node cpn01 {
id -1 # do not change unnecessarily
# weight 12.000
alg straw
hash 0 # rjenkins1
item osd.0 weight 1.000
item osd.1 weight 1.000
item osd.2 weight 1.000
item osd.3 weight 1.000
item osd.4 weight 1.000
item osd.5 weight 1.000
item osd.6 weight 1.000
item osd.7 weight 1.000
item osd.8 weight 1.000
item osd.9 weight 1.000
item osd.10 weight 1.000
item osd.11 weight 1.000
}
node cpn02 {
id -2 # do not change unnecessarily
# weight 12.000
alg straw
hash 0 # rjenkins1
item osd.12 weight 1.000
item osd.13 weight 1.000
item osd.14 weight 1.000
item osd.15 weight 1.000
item osd.16 weight 1.000
item osd.17 weight 1.000
item osd.18 weight 1.000
item osd.19 weight 1.000
item osd.20 weight 1.000
item osd.21 weight 1.000
item osd.22 weight 1.000
item osd.23 weight 1.000
}
node cpn03 {
id -3 # do not change unnecessarily
# weight 12.000
alg straw
hash 0 # rjenkins1
item osd.24 weight 1.000
item osd.25 weight 1.000
item osd.26 weight 1.000
item osd.27 weight 1.000
item osd.28 weight 1.000
item osd.29 weight 1.000
item osd.30 weight 1.000
item osd.31 weight 1.000
item osd.32 weight 1.000
item osd.33 weight 1.000
item osd.34 weight 1.000
item osd.35 weight 1.000
}
node cpn04 {
id -4 # do not change unnecessarily
# weight 12.000
alg straw
hash 0 # rjenkins1
item osd.36 weight 1.000
item osd.37 weight 1.000
item osd.38 weight 1.000
item osd.39 weight 1.000
item osd.40 weight 1.000
item osd.41 weight 1.000
item osd.42 weight 1.000
item osd.43 weight 1.000
item osd.44 weight 1.000
item osd.45 weight 1.000
item osd.46 weight 1.000
item osd.47 weight 1.000
}
rack rack1 {
id -5 # do not change unnecessarily
# weight 24.000
alg straw
hash 0 # rjenkins1
item cpn01 weight 12.000
item cpn03 weight 12.000
}
rack rack2 {
id -6 # do not change unnecessarily
# weight 24.000
alg straw
hash 0 # rjenkins1
item cpn02 weight 12.000
item cpn04 weight 12.000
}
root root {
id -7 # do not change unnecessarily
# weight 48.000
alg straw
hash 0 # rjenkins1
item rack1 weight 24.000
item rack2 weight 24.000
}

# rules
rule replicated_rule {
id 0
type replicated
min_size 1
max_size 10
step take root
step chooseleaf firstn 0 type node
step emit
}

# end crush map

and finally issued:
# *crushtool -c map.txt -o crushmap*
# *ceph osd setcrushmap -i crushmap*

since it's a new cluster, there's no problem with rebalancing
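
For what it's worth, before injecting a hand-edited map, the compiled map
can be sanity-checked with crushtool's test mode (a minimal sketch, using
the 'crushmap' file compiled above):

$ crushtool -i crushmap --test --rule 0 --num-rep 2 --show-utilization
$ crushtool -i crushmap --test --rule 0 --num-rep 2 --show-mappings | head

Each sample mapping should show the two replicas on two different nodes.
Note that the rule as pasted separates replicas by node, not by rack, so
both copies of a PG can still land in the same rack.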


Best,

*German*

2017-09-13 13:46 GMT-03:00 Maxime Guyot <max...@root314.com>:

> Hi,
>
> This is a common problem when doing a custom CRUSH map: the default behavior
> is to update the OSD's location in the CRUSH map on start. Did you
> keep the defaults there?
>
> If that is the problem, you can either:
> 1) Disable the update on start option: "osd crush update on start = false"
> (see http://docs.ceph.com/docs/master/rados/operations/crush-map/#crush-location)
> 2) Customize the script defining the location of OSDs with "crush location
> hook = /path/to/customized-ceph-crush-location" (see
> https://github.com/ceph/ceph/blob/master/src/ceph-crush-location.in).
>
> Cheers,
> Maxime
>
> On Wed, 13 Sep 2017 at 18:35 German Anders <gand...@despegar.com> wrote:
>
>> *# ceph health detail*
>> HEALTH_OK
>>
>> *# ceph osd stat*
>> 48 osds: 48 up, 48 in
>>
>> *# ceph pg stat*
>> 3200 pgs: 3200 active+clean; 5336 MB data, 79455 MB used, 53572 GB /
>> 53650 GB avail
>>
>>
>> *German*
>>
>> 2017-09-13 13:24 GMT-03:00 dE <de.tec...@gmail.com>:
>>
>>> On 09/13/2017 09:08 PM, German Anders wrote:
>>>
>>> Hi cephers,
>>>
>>> I'm having an issue with a newly c

Re: [ceph-users] after reboot node appear outside the root root tree

2017-09-13 Thread German Anders
*# ceph health detail*
HEALTH_OK

*# ceph osd stat*
48 osds: 48 up, 48 in

*# ceph pg stat*
3200 pgs: 3200 active+clean; 5336 MB data, 79455 MB used, 53572 GB / 53650
GB avail


*German*

2017-09-13 13:24 GMT-03:00 dE <de.tec...@gmail.com>:

> On 09/13/2017 09:08 PM, German Anders wrote:
>
> Hi cephers,
>
> I'm having an issue with a newly created cluster 12.2.0 (
> 32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc). Basically when I
> reboot one of the nodes, when it comes back it comes up outside of the root
> type in the tree:
>
> root@cpm01:~# ceph osd tree
> ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
> -15   12.0 *root default*
> * 36  nvme  1.0 osd.36 up  1.0 1.0*
> * 37  nvme  1.0 osd.37 up  1.0 1.0*
> * 38  nvme  1.0 osd.38 up  1.0 1.0*
> * 39  nvme  1.0 osd.39 up  1.0 1.0*
> * 40  nvme  1.0 osd.40 up  1.0 1.0*
> * 41  nvme  1.0 osd.41 up  1.0 1.0*
> * 42  nvme  1.0 osd.42 up  1.0 1.0*
> * 43  nvme  1.0 osd.43 up  1.0 1.0*
> * 44  nvme  1.0 osd.44 up  1.0 1.0*
> * 45  nvme  1.0 osd.45 up  1.0 1.0*
> * 46  nvme  1.0 osd.46 up  1.0 1.0*
> * 47  nvme  1.0 osd.47 up  1.0 1.0*
>  -7   36.0 *root root*
>  -5   24.0 rack rack1
>  -1   12.0 node cpn01
>   01.0 osd.0  up  1.0 1.0
>   11.0 osd.1  up  1.0 1.0
>   21.0 osd.2  up  1.0 1.0
>   31.0 osd.3  up  1.0 1.0
>   41.0 osd.4  up  1.0 1.0
>   51.0 osd.5  up  1.0 1.0
>   61.0 osd.6  up  1.0 1.0
>   71.0 osd.7  up  1.0 1.0
>   81.0 osd.8  up  1.0 1.0
>   91.0 osd.9  up  1.0 1.0
>  101.0 osd.10 up  1.0 1.0
>  111.0 osd.11 up  1.0 1.0
>  -3   12.0 node cpn03
>  241.0 osd.24 up  1.0 1.0
>  251.0 osd.25 up  1.0 1.0
>  261.0 osd.26 up  1.0 1.0
>  271.0 osd.27 up  1.0 1.0
>  281.0 osd.28 up  1.0 1.0
>  291.0 osd.29 up  1.0 1.0
>  301.0 osd.30 up  1.0 1.0
>  311.0 osd.31 up  1.0 1.0
>  321.0 osd.32 up  1.0 1.0
>  331.0 osd.33 up  1.0 1.0
>  341.0 osd.34 up  1.0 1.0
>  351.0 osd.35 up  1.0 1.0
>  -6   12.0 rack rack2
>  -2   12.0 node cpn02
>  121.0 osd.12 up  1.0 1.0
>  131.0 osd.13 up  1.0 1.0
>  141.0 osd.14 up  1.0 1.0
>  151.0 osd.15 up  1.0 1.0
>  161.0 osd.16 up  1.0 1.0
>  171.0 osd.17 up  1.0 1.0
>  181.0 osd.18 up  1.0 1.0
>  191.0 osd.19 up  1.0 1.0
>  201.0 osd.20 up  1.0 1.0
>  211.0 osd.21 up  1.0 1.0
>  221.0 osd.22 up  1.0 1.0
>  231.0 osd.23 up  1.0 1.0
> * -4  0 node cpn04*
>
> Any ideas why this happens? And how can I fix it? It's supposed to be
> inside rack2.
>
> Thanks in advance,
>
> Best,
>
> *German*
>
>
> ___
> ceph-users mailing 
> listceph-us...@lists.ceph.comhttp://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Can we see the output of ceph health detail? Maybe they're in the
> process of recovering.
>
> Also post the output of ceph osd stat so we can see what nodes are up/in
> etc... and ceph pg stat to see the status of various PGs (a pointer to the
> recovery process).
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] after reboot node appear outside the root root tree

2017-09-13 Thread German Anders
Hi cephers,

I'm having an issue with a newly created cluster 12.2.0
(32ce2a3ae5239ee33d6150705cdb24d43bab910c) luminous (rc). Basically when I
reboot one of the nodes, when it comes back it comes up outside of the root
type in the tree:

root@cpm01:~# ceph osd tree
ID  CLASS WEIGHT   TYPE NAME  STATUS REWEIGHT PRI-AFF
-15   12.0 *root default*
* 36  nvme  1.0 osd.36 up  1.0 1.0*
* 37  nvme  1.0 osd.37 up  1.0 1.0*
* 38  nvme  1.0 osd.38 up  1.0 1.0*
* 39  nvme  1.0 osd.39 up  1.0 1.0*
* 40  nvme  1.0 osd.40 up  1.0 1.0*
* 41  nvme  1.0 osd.41 up  1.0 1.0*
* 42  nvme  1.0 osd.42 up  1.0 1.0*
* 43  nvme  1.0 osd.43 up  1.0 1.0*
* 44  nvme  1.0 osd.44 up  1.0 1.0*
* 45  nvme  1.0 osd.45 up  1.0 1.0*
* 46  nvme  1.0 osd.46 up  1.0 1.0*
* 47  nvme  1.0 osd.47 up  1.0 1.0*
 -7   36.0 *root root*
 -5   24.0 rack rack1
 -1   12.0 node cpn01
  01.0 osd.0  up  1.0 1.0
  11.0 osd.1  up  1.0 1.0
  21.0 osd.2  up  1.0 1.0
  31.0 osd.3  up  1.0 1.0
  41.0 osd.4  up  1.0 1.0
  51.0 osd.5  up  1.0 1.0
  61.0 osd.6  up  1.0 1.0
  71.0 osd.7  up  1.0 1.0
  81.0 osd.8  up  1.0 1.0
  91.0 osd.9  up  1.0 1.0
 101.0 osd.10 up  1.0 1.0
 111.0 osd.11 up  1.0 1.0
 -3   12.0 node cpn03
 241.0 osd.24 up  1.0 1.0
 251.0 osd.25 up  1.0 1.0
 261.0 osd.26 up  1.0 1.0
 271.0 osd.27 up  1.0 1.0
 281.0 osd.28 up  1.0 1.0
 291.0 osd.29 up  1.0 1.0
 301.0 osd.30 up  1.0 1.0
 311.0 osd.31 up  1.0 1.0
 321.0 osd.32 up  1.0 1.0
 331.0 osd.33 up  1.0 1.0
 341.0 osd.34 up  1.0 1.0
 351.0 osd.35 up  1.0 1.0
 -6   12.0 rack rack2
 -2   12.0 node cpn02
 121.0 osd.12 up  1.0 1.0
 131.0 osd.13 up  1.0 1.0
 141.0 osd.14 up  1.0 1.0
 151.0 osd.15 up  1.0 1.0
 161.0 osd.16 up  1.0 1.0
 171.0 osd.17 up  1.0 1.0
 181.0 osd.18 up  1.0 1.0
 191.0 osd.19 up  1.0 1.0
 201.0 osd.20 up  1.0 1.0
 211.0 osd.21 up  1.0 1.0
 221.0 osd.22 up  1.0 1.0
 231.0 osd.23 up  1.0 1.0
* -4  0 node cpn04*

Any ideas why this happens? And how can I fix it? It's supposed to be
inside rack2.

Thanks in advance,

Best,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Upgrade target for 0.82

2017-06-27 Thread German Anders
Thanks a lot Wido

Best,

*German*

2017-06-27 16:08 GMT-03:00 Wido den Hollander <w...@42on.com>:

>
> > Op 27 juni 2017 om 20:56 schreef German Anders <gand...@despegar.com>:
> >
> >
> > Hi Cephers,
> >
> >I want to upgrade an existing cluster (version 0.82), and I would like
> > to know if there's any recommended upgrade-path and also the recommended
> > target version.
> >
>
> I would go to Hammer (0.94) first, make sure the cluster is updated to the
> latest 0.94.11 version and then proceed to Jewel.
>
> Make sure all clients are up to date as well and you are using the latest
> CRUSH tunables before you proceed to Jewel.
>
> Wido
>
> > Thanks in advance,
> >
> > *German*
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
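
For reference, a rough sketch of the pre-Jewel checks Wido describes above
(daemon versions plus CRUSH tunables; exact command support varies by
release, so treat this as a starting point):

$ ceph tell 'osd.*' version          # running version of every OSD daemon
$ ceph osd crush show-tunables       # current CRUSH tunables profile
$ ceph osd crush tunables hammer     # raise the profile; this triggers data movement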
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgrade target for 0.82

2017-06-27 Thread German Anders
Hi Cephers,

   I want to upgrade an existing cluster (version 0.82), and I would like
to know if there's any recommended upgrade-path and also the recommended
target version.

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy to a particular version

2017-05-02 Thread German Anders
I think you can do *$ ceph-deploy install --release <release> --repo-url
http://download.ceph.com/... *; you can also
replace the --release flag with --dev or --testing and specify the
version. I've done it with the release and dev flags and it works great :)
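
A rough sketch of both routes (the hostname and version are placeholders,
and the repo path depends on the distro):

$ ceph-deploy install --release jewel --repo-url http://download.ceph.com/debian-jewel node1
$ sudo yum install -y ceph-10.2.3 ceph-common-10.2.3   # or pin an exact point release with yum and skip ceph-deploy's install step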

hope it helps

best,


*German*

2017-05-02 10:03 GMT-03:00 David Turner :

> You can indeed install ceph via yum and then utilize ceph-deploy to finish
> things up. You just skip the Ceph install portion. I haven't done it in a
> while and you might need to manually place the config and key on the new
> servers yourself.
>
> On Tue, May 2, 2017, 8:57 AM Puff, Jonathon 
> wrote:
>
>> From what I can find, ceph-deploy only allows installs for a release, i.e.
>> jewel, which is giving me 10.2.7, but I'd like to specify the particular
>> update. For instance, I want to go to 10.2.3. Do I need to avoid
>> ceph-deploy entirely to do this, or can I install the correct version via
>> yum and then leverage ceph-deploy for the remaining configuration?
>>
>>
>>
>> -JP
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph UPDATE (not upgrade)

2017-04-26 Thread German Anders
Oh sorry, my bad, I thought he wanted to upgrade the Ceph cluster, not the
OS packages.
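
For that, pinning the Ceph packages before the OS update is the usual
trick; a sketch (Ubuntu and CentOS respectively; package names vary by
release):

$ sudo apt-mark hold ceph ceph-common        # Ubuntu: hold ceph, then update the rest
$ sudo apt-get update && sudo apt-get dist-upgrade
$ sudo yum update --exclude='ceph*'          # CentOS: update everything except ceph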

best,


*German*


2017-04-26 14:29 GMT-03:00 David Turner <drakonst...@gmail.com>:

> He's asking how NOT to upgrade Ceph, but to update the rest of the
> packages on his system.  In Ubuntu, you have to type `apt-get dist-upgrade`
> instead of just `apt-get upgrade` when you want to upgrade ceph.  That
> becomes a problem when trying to update the kernel, but not too bad.  I
> think in CentOS you need to do something like `yum update
> --exclude=ceph*`. You should also be able to disable the packages in the
> repo files so that you have to explicitly include them when you want to
> update the ceph packages.
>
> On Wed, Apr 26, 2017 at 1:12 PM German Anders <gand...@despegar.com>
> wrote:
>
>> Hi Massimiliano,
>>
>> I think you'd best go with the upgrade process from the Ceph site; take a
>> look at it, since you need to do it in a specific order:
>>
>> 1. the MONs
>> 2. the OSDs
>> 3. the MDS
>> 4. the Object gateways
>>
>> http://docs.ceph.com/docs/master/install/upgrading-ceph/
>>
>> it's better to do it like that so things turn out fine :)
>>
>> hope it helps,
>>
>> Best,
>>
>>
>> *German Anders*
>>
>>
>> 2017-04-26 11:21 GMT-03:00 Massimiliano Cuttini <m...@phoenixweb.it>:
>>
>>> On a Ceph Monitor/OSD server can I run just:
>>>
>>> *yum update -y*
>>>
>>> in order to upgrade the system and packages, or will this mess up Ceph?
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph UPDATE (not upgrade)

2017-04-26 Thread German Anders
Hi Massimiliano,

I think you'd best go with the upgrade process from the Ceph site; take a
look at it, since you need to do it in a specific order:

1. the MONs
2. the OSDs
3. the MDS
4. the Object gateways

http://docs.ceph.com/docs/master/install/upgrading-ceph/

it's better to do it like that so things turn out fine :)

hope it helps,

Best,


*German Anders*


2017-04-26 11:21 GMT-03:00 Massimiliano Cuttini <m...@phoenixweb.it>:

> On a Ceph Monitor/OSD server can I run just:
>
> *yum update -y*
>
> in order to upgrade the system and packages, or will this mess up Ceph?
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] How to think a two different disk's technologies architecture

2017-03-25 Thread German Anders
Like Alex said, the network is not an issue now. For example, I've got a
6-node cluster with a mix of SAS and SSD disks serving Cassandra clusters
under heavy load and also MySQL clusters, and I'm getting less than 1ms of
IO latency. On the network side I have InfiniBand FDR configured, with the
cluster and public networks running on the same IB network (no separate
cluster network), and so far it runs well! The IB network is shared by a
total of 128 hosts plus the 6-node, 3-mon Ceph cluster.

We are now planning an all-NVMe Ceph cluster with IB. It would be nice if
Ceph supported RDMA out of the box in the near future; does anyone know if
this is on the roadmap? I know there's a "possible" config for it, but it's
not production ready and needs a lot of tuning and configuration.

Best


On Fri, Mar 24, 2017 at 11:04 Alejandro Comisario <alejan...@nubeliu.com>
wrote:

Thanks for the recommendations so far.
Anyone with more experiences and thoughts?

best

On Mar 23, 2017 16:36, "Maxime Guyot" <maxime.gu...@elits.com> wrote:

Hi Alexandro,

As I understand it, you are planning NVMe journals for the SATA HDDs and
collocated journals for the SATA SSDs?

Option 1:
- 24x SATA SSDs per server will have a bottleneck at the storage
bus/controller. Also, consider the network capacity: 24x SSDs will
deliver more performance than 24x HDDs with journals, but you have the same
network capacity on both types of nodes.
- This option is a little easier to implement: just move nodes in different
CRUSHmap root
- Failure of a server (assuming size = 3) will impact all PGs
Option 2:
- You may have noisy neighbors effect between HDDs and SSDs, if HDDs are
able to saturate your NICs or storage controller. So be mindful of this
with the hardware design
- To configure the CRUSH map for this you need to split each server in two; I
usually use “server1-hdd” and “server1-ssd” and map the right OSDs into the
right buckets. A little extra work here, but you can easily write a “crush
location hook” script for it (see the example at
http://www.root314.com/2017/01/15/Ceph-storage-tiers/ and the sketch after
this list)
- In case of a server failure, recovery will be faster than option 1 and
will impact fewer PGs
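
A crush location hook along those lines is just an executable that prints
the OSD's CRUSH location; Ceph runs it per OSD and uses the printed
key=value pairs to place the OSD. A minimal sketch (the rotational test
and bucket names are illustrative only and must match your map):

#!/bin/sh
# Ceph calls this with: --cluster <name> --id <osd id> --type osd,
# and expects key=value pairs describing the OSD's location on stdout.
HOST=$(hostname -s)
# Illustrative check only: derive hdd/ssd from one device's rotational flag.
if [ "$(cat /sys/block/sda/queue/rotational)" = "1" ]; then
    SUFFIX=hdd
else
    SUFFIX=ssd
fi
echo "root=default host=${HOST}-${SUFFIX}"

It gets wired in via ceph.conf with something like "crush location hook =
/usr/local/bin/custom-crush-location" (the path is a placeholder).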

Some general notes:
- SSD pools perform better with higher frequency CPUs
- the 1GB of RAM per TB is a little outdated, the current consensus for HDD
OSDs is around 2GB/OSD (see
https://www.redhat.com/cms/managed-files/st-rhcs-config-guide-technology-detail-inc0387897-201604-en.pdf
)
- Network wise, if the SSD OSDs are rated for 500MB/s and use collocated
journal you could generate up to 250MB/s of traffic per SSD OSD (24Gbps for
12x or 48Gbps for 24x) therefore I would consider doing 4x10G and
consolidate both client and cluster network on that

Cheers,
Maxime

On 23/03/17 18:55, "ceph-users on behalf of Alejandro Comisario" <
ceph-users-boun...@lists.ceph.com on behalf of alejan...@nubeliu.com> wrote:

Hi everyone!
I have to install a ceph cluster (6 nodes) with two "flavors" of
disks, 3 servers with SSD and 3 servers with SATA.

I will purchase 24-disk servers (the SATA ones with NVMe SSD for
the SATA journals).
Processors will be 2 x E5-2620v4 with HT, and RAM will be 20GB for the
OS plus 1.3GB of RAM per TB of storage.

The servers will have 2 x 10Gb bonding for public network and 2 x 10Gb
for cluster network.
My doubts reside here, and I want to ask the community about experiences
and the pains and gains of choosing between:

Option 1
3 x servers just for SSD
3 x servers just for SATA

Option 2
6 x servers with 12 SSD and 12 SATA each

Regarding crushmap configuration and rules, everything is clear for making
sure that the two pools (poolSSD and poolSATA) use the right disks.

But what about performance, maintenance, architecture scalability, etc.?

thank you very much !

--
Alejandrito
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

-- 

*German Anders*
Storage Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] News on RDMA on future releases

2016-12-07 Thread German Anders
Hi all, I want to know if there's any news on future releases regarding
RDMA, i.e. whether it's going to be integrated or not, since RDMA should
increase IOPS performance a lot, especially at small block sizes.

Thanks in advance,

Best,



*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A VM with 6 volumes - hangs

2016-11-14 Thread German Anders
Try looking at the logs of those particular OSDs and see if anything stands
out; also take a close look at the PGs that map to those OSDs.
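
Concretely, something along these lines (osd.12 is a placeholder id, and
the admin-socket command has to run on the host carrying that OSD):

$ grep -i 'slow request' /var/log/ceph/ceph-osd.12.log   # warnings on a suspect OSD
$ ceph daemon osd.12 dump_historic_ops                   # slowest recent ops on that OSD
$ ceph pg ls-by-osd osd.12                               # PGs that map to that OSD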

Best,


*German*

2016-11-14 12:04 GMT-03:00 M Ranga Swami Reddy <swamire...@gmail.com>:

> When this issue is seen, the ceph logs show "slow requests to OSD"
>
> But Ceph status is in OK state.
>
> Thanks
> Swami
>
> On Mon, Nov 14, 2016 at 8:27 PM, German Anders <gand...@despegar.com>
> wrote:
>
>> Could you share some info about the ceph cluster? Logs? Did you see
>> anything different from normal operation in the logs?
>>
>> Best,
>>
>>
>> *German*
>>
>> 2016-11-14 11:46 GMT-03:00 M Ranga Swami Reddy <swamire...@gmail.com>:
>>
>>> +ceph-devel
>>>
>>> On Fri, Nov 11, 2016 at 5:09 PM, M Ranga Swami Reddy <
>>> swamire...@gmail.com> wrote:
>>>
>>>> Hello,
>>>> I am using the ceph volumes with a VM. Details are below:
>>>>
>>>> VM:
>>>>   OS: Ubuntu 14.0.4
>>>>CPU: 12 Cores
>>>>RAM: 40 GB
>>>>
>>>> Volumes:
>>>>Size: 1 TB
>>>> No:   6 Volumes
>>>>
>>>>
>>>> With the above, the VM hung without any read/write operation.
>>>>
>>>> Any suggestions..
>>>>
>>>> Thanks
>>>> Swami
>>>>
>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] A VM with 6 volumes - hangs

2016-11-14 Thread German Anders
Could you share some info about the ceph cluster? Logs? Did you see
anything different from normal operation in the logs?

Best,


*German*

2016-11-14 11:46 GMT-03:00 M Ranga Swami Reddy :

> +ceph-devel
>
> On Fri, Nov 11, 2016 at 5:09 PM, M Ranga Swami Reddy  > wrote:
>
>> Hello,
>> I am using the ceph volumes with a VM. Details are below:
>>
>> VM:
>>   OS: Ubuntu 14.0.4
>>CPU: 12 Cores
>>RAM: 40 GB
>>
>> Volumes:
>>Size: 1 TB
>> No:   6 Volumes
>>
>>
>> With the above, the VM hung without any read/write operation.
>>
>> Any suggestions..
>>
>> Thanks
>> Swami
>>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-25 Thread German Anders
For example: one cluster with four instances, two in each data center,
sharing the same storage cluster. So if data center #1, let's say, goes
down along with instances 1 & 2, you can still run the cluster with
instances 3 & 4 in data center #2, with the same storage cluster pool
and no data loss or impact (other than performance).
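
For the snapshot-shipping approach Wes mentions below, the usual pattern is
an export-diff/import-diff loop; a sketch with placeholder pool, image and
snapshot names:

$ rbd -p volumes snap create image1@snap2          # new point-in-time snapshot
$ rbd export-diff --from-snap snap1 volumes/image1@snap2 - | \
    ssh remote-site rbd import-diff - volumes/image1   # ship only the delta since snap1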

Best,



*German*

2016-10-21 15:01 GMT-03:00 Wes Dillingham <wes_dilling...@harvard.edu>:

> What is the use case that requires you to have it in two datacenters?
> In addition to RBD mirroring already mentioned by others, you can do
> RBD snapshots and ship those snapshots to a remote location (separate
> cluster or separate pool). Similar to RBD mirroring, in this situation
> your client writes are not subject to that latency.
>
> On Thu, Oct 20, 2016 at 1:51 PM, German Anders <gand...@despegar.com>
> wrote:
> > Thanks, that's too far actually lol. And how are things going with rbd
> > mirroring?
> >
> > German
> >
> > 2016-10-20 14:49 GMT-03:00 yan cui <ccuiy...@gmail.com>:
> >>
> >> The two data centers are actually cross US.  One is in the west, and the
> >> other in the east.
> >> We try to sync RBD images using RBD mirroring.
> >>
> >> 2016-10-20 9:54 GMT-07:00 German Anders <gand...@despegar.com>:
> >>>
> >>> Out of curiosity, I wanted to ask what kind of network topology you
> >>> are trying to use across the cluster? In this type of scenario you
> >>> really need an ultra-low-latency network; how far apart are they?
> >>>
> >>> Best,
> >>>
> >>> German
> >>>
> >>> 2016-10-18 16:22 GMT-03:00 Sean Redmond <sean.redmo...@gmail.com>:
> >>>>
> >>>> Maybe this would be an option for you:
> >>>>
> >>>> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
> >>>>
> >>>>
> >>>> On Tue, Oct 18, 2016 at 8:18 PM, yan cui <ccuiy...@gmail.com> wrote:
> >>>>>
> >>>>> Hi Guys,
> >>>>>
> >>>>> Our company has a use case which needs the support of Ceph across
> >>>>> two data centers (one data center is far away from the other). The
> >>>>> experience of using one data center is good. We did some benchmarking
> >>>>> on two data centers, and the performance is bad because of the
> >>>>> synchronization feature in Ceph and the large latency between data
> >>>>> centers. So, are there setups like data-center-aware features in Ceph,
> >>>>> so that we have good locality? Usually, we use rbd to create volumes
> >>>>> and snapshots. But we want the volume to be highly available with
> >>>>> acceptable performance in case one data center is down. Our current
> >>>>> setup does not consider data center differences. Any ideas?
> >>>>>
> >>>>>
> >>>>> Thanks, Yan
> >>>>>
> >>>>> --
> >>>>> Think big; Dream impossible; Make it happen.
> >>>>>
> >>>>> ___
> >>>>> ceph-users mailing list
> >>>>> ceph-users@lists.ceph.com
> >>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>>
> >>>>
> >>>>
> >>>> ___
> >>>> ceph-users mailing list
> >>>> ceph-users@lists.ceph.com
> >>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>>>
> >>>
> >>
> >>
> >>
> >> --
> >> Think big; Dream impossible; Make it happen.
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>
>
>
> --
> Respectfully,
>
> Wes Dillingham
> wes_dilling...@harvard.edu
> Research Computing | Infrastructure Engineer
> Harvard University | 38 Oxford Street, Cambridge, Ma 02138 | Room 210
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-20 Thread German Anders
Thanks, that's too far actually lol. And how are things going with rbd
mirroring?

*German*

2016-10-20 14:49 GMT-03:00 yan cui <ccuiy...@gmail.com>:

> The two data centers are actually cross US.  One is in the west, and the
> other in the east.
> We try to sync RBD images using RBD mirroring.
>
> 2016-10-20 9:54 GMT-07:00 German Anders <gand...@despegar.com>:
>
>> Out of curiosity, I wanted to ask what kind of network topology you are
>> trying to use across the cluster? In this type of scenario you really need
>> an ultra-low-latency network; how far apart are they?
>>
>> Best,
>>
>> *German*
>>
>> 2016-10-18 16:22 GMT-03:00 Sean Redmond <sean.redmo...@gmail.com>:
>>
>>> Maybe this would be an option for you:
>>>
>>> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
>>>
>>>
>>> On Tue, Oct 18, 2016 at 8:18 PM, yan cui <ccuiy...@gmail.com> wrote:
>>>
>>>> Hi Guys,
>>>>
>>>>Our company has a use case which needs the support of Ceph across
>>>> two data centers (one data center is far away from the other). The
>>>> experience of using one data center is good. We did some benchmarking on
>>>> two data centers, and the performance is bad because of the synchronization
>>>> feature in Ceph and the large latency between data centers. So, are there
>>>> setups like data-center-aware features in Ceph, so that we have good
>>>> locality? Usually, we use rbd to create volumes and snapshots. But we want
>>>> the volume to be highly available with acceptable performance in case one
>>>> data center is down. Our current setup does not consider data center
>>>> differences. Any ideas?
>>>>
>>>>
>>>> Thanks, Yan
>>>>
>>>> --
>>>> Think big; Dream impossible; Make it happen.
>>>>
>>>> ___
>>>> ceph-users mailing list
>>>> ceph-users@lists.ceph.com
>>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>>
>>>>
>>>
>>> ___
>>> ceph-users mailing list
>>> ceph-users@lists.ceph.com
>>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>>
>>>
>>
>
>
> --
> Think big; Dream impossible; Make it happen.
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph on two data centers far away

2016-10-20 Thread German Anders
Out of curiosity, I wanted to ask what kind of network topology you are
trying to use across the cluster? In this type of scenario you really need
an ultra-low-latency network; how far apart are they?

Best,

*German*

2016-10-18 16:22 GMT-03:00 Sean Redmond :

> Maybe this would be an option for you:
>
> http://docs.ceph.com/docs/jewel/rbd/rbd-mirroring/
>
>
> On Tue, Oct 18, 2016 at 8:18 PM, yan cui  wrote:
>
>> Hi Guys,
>>
>>Our company has a use case which needs the support of Ceph across two
>> data centers (one data center is far away from the other). The experience
>> of using one data center is good. We did some benchmarking on two data
>> centers, and the performance is bad because of the synchronization feature
>> in Ceph and the large latency between data centers. So, are there setups
>> like data-center-aware features in Ceph, so that we have good locality?
>> Usually, we use rbd to create volumes and snapshots. But we want the volume
>> to be highly available with acceptable performance in case one data center
>> is down. Our current setup does not consider data center differences. Any
>> ideas?
>>
>>
>> Thanks, Yan
>>
>> --
>> Think big; Dream impossible; Make it happen.
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] is the web site down ?

2016-10-12 Thread German Anders
I think that you can check it over here:

http://www.dreamhoststatus.com/2016/10/11/dreamcompute-us-east-1-cluster-service-disruption/

*German Anders*
Storage Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com

2016-10-12 16:44 GMT-03:00 Andrey Shevel <shevel.and...@gmail.com>:

> does anybody know when the site http://docs.ceph.com/docs/jewel/cephfs
> will be available ?
>
> thanks in advance
>
> --
> Andrey Y Shevel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread German Anders
Also, isn't Jewel supposed to get more 'performance', since it uses
BlueStore to store metadata? Or do I need to specify during install to use
BlueStore?

Thanks,


*German*

2016-04-07 16:55 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Ceph is not able to use native Infiniband protocols yet and so it is
> only leveraging IPoIB at the moment. The most likely reason you are
> only getting ~10 Gb performance is that IPoIB heavily leverages
> multicast in Infiniband (if you do so research in this area you will
> understand why unicast IP still uses multicast on an Inifiniband
> network). To be extremely compatible with all adapters, the subnet
> manager will set the speed of multicast to 10 Gb/s so that SDR
> adapters can be used and not drop packets. If you know that you will
> never have adapters under a certain speed, you can configure the
> subnet manager to use a higher speed. This does not change IPoIB
> networks that are already configured (I had to down all the IPoIB
> adapter at the same time and bring them back up to upgrade the speed).
> Even after that, there still wasn't similar performance to native
> Infiniband, but I got at least a 2x improvement (along with setting
> the MTU to 64K) on the FDR adapters. There is still a ton of overhead
> for doing IPoIB so it is not an ideal transport to get performance on
> Infiniband, I think of it as a compatibility feature. Hopefully, that
> will give you enough information to perform the research. If you
> search the OFED mailing list, you will see some posts from me 2-3
> years ago regarding this very topic.
>
> Good luck and keep holding out for Ceph with XIO.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.3.6
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJXBrtICRDmVDuy+mK58QAAqVkP/2hpe93FYIbQtpV4Qta4
> 9Fohqf478kVPX/v6XkAYOlAFFAISxfbDdm0FxOjbGSEOMKGNs/oSaFRCsqb9
> +T5dfMUHyhY51wyaNeVF3k3zgvGpNUO1xEQ1IenUquZp9825VRBze5/T6r8Z
> PMFySNtuHBp8AhARisPJcXqKv/Vowfy/LqyvlL6ytIHfwqsVHngbtVN7L/HX
> vzMZM93cLwwV44v2bT8t63U76GKyQpbksDx02CktMIFzNbfApsiMaA1dyx1O
> 9HEgirtddMO358f+1DN/OjNc/Z3zECILaw3tq/HUWJyBJqO95uBw++znIacb
> UKwqJ1HmUeDvdqY72ZQa2fQT7ayMMlPPwzoVtdQGMZnSaAjn8MlunDFCrdLw
> +JPT+kt0qnjzs9qK0zEp5drfUwnV5BXS4hZhKUvuxWmVjUv1EfJrIFCszSFO
> 2be/xLxqBTpCEcHL9fsc16P7HsrdBW8GDy3X5PC2sOl/2DSes4y2TpCfr7w9
> V8Mhs7mmkEQtwcvyaYQ0bx0Bs3o4cvTTeYbJUpLWEgMmGAEBZbf7Sx+y3dIp
> jUHb2jPEchBb83BGeLvAkCTfouq/J3pzQK96gA2Kh/KOlVJTpFdKUU5x+wpM
> ACqD+S/AFkgnfGm4fcgBexhro7GImiO6VIaOdxvTSdQbSsaoKckZOxFhVWih
> XyBJ
> =EF9A
> -END PGP SIGNATURE-
> --------
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Thu, Apr 7, 2016 at 1:43 PM, German Anders <gand...@despegar.com>
> wrote:
> > Hi Cephers,
> >
> > I've setup a production environment Ceph cluster with the Jewel release
> > (10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
> > Servers and 6 OSD Servers:
> >
> > 3x MON Servers:
> > 2x Intel Xeon E5-2630v3@2.40Ghz
> > 384GB RAM
> > 2x 200G Intel DC3700 in RAID-1 for OS
> > 1x InfiniBand ConnectX-3 ADPT DP
> >
> > 6x OSD Servers:
> > 2x Intel Xeon E5-2650v2@2.60Ghz
> > 128GB RAM
> > 2x 200G Intel DC3700 in RAID-1 for OS
> > 12x 800G Intel DC3510 (osd & journal) on same device
> > 1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other
> on
> > the CLUS network)
> >
> > ceph.conf file is:
> >
> > [global]
> > fsid = xxx
> > mon_initial_members = cibm01, cibm02, cibm03
> > mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
> > auth_cluster_required = cephx
> > auth_service_required = cephx
> > auth_client_required = cephx
> > filestore_xattr_use_omap = true
> > public_network = xx.xx.16.0/20
> > cluster_network = xx.xx.32.0/20
> >
> > [mon]
> >
> > [mon.cibm01]
> > host = cibm01
> > mon_addr = xx.xx.xx.1:6789
> >
> > [mon.cibm02]
> > host = cibm02
> > mon_addr = xx.xx.xx.2:6789
> >
> > [mon.cibm03]
> > host = cibm03
> > mon_addr = xx.xx.xx.3:6789
> >
> > [osd]
> > osd_pool_default_size = 2
> > osd_pool_default_min_size = 1
> >
> > ## OSD Configuration ##
> > [osd.0]
> > host = cibn01
> > public_addr = xx.xx.17.1
> > cluster_addr = xx.xx.32.1
> >
> > [osd.1]
> > host = cibn01
> > public_addr = xx.xx.17.1
> > cluster_addr = xx.xx.32.1
> >
> > ...
> >
> >
> >
> > They are all running Ubuntu 14.04.4 LTS. Journals ar

[ceph-users] Ceph InfiniBand Cluster - Jewel - Performance

2016-04-07 Thread German Anders
Hi Cephers,

I've setup a production environment Ceph cluster with the Jewel release
(10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)) consisting of 3 MON
Servers and 6 OSD Servers:

3x MON Servers:
2x Intel Xeon E5-2630v3@2.40Ghz
384GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
1x InfiniBand ConnectX-3 ADPT DP

6x OSD Servers:
2x Intel Xeon E5-2650v2@2.60Ghz
128GB RAM
2x 200G Intel DC3700 in RAID-1 for OS
12x 800G Intel DC3510 (osd & journal) on same device
1x InfiniBand ConnectX-3 ADPT DP (one port on PUB network and the other on
the CLUS network)

ceph.conf file is:

[global]
fsid = xxx
mon_initial_members = cibm01, cibm02, cibm03
mon_host = xx.xx.xx.1,xx.xx.xx.2,xx.xx.xx.3
auth_cluster_required = cephx
auth_service_required = cephx
auth_client_required = cephx
filestore_xattr_use_omap = true
public_network = xx.xx.16.0/20
cluster_network = xx.xx.32.0/20

[mon]

[mon.cibm01]
host = cibm01
mon_addr = xx.xx.xx.1:6789

[mon.cibm02]
host = cibm02
mon_addr = xx.xx.xx.2:6789

[mon.cibm03]
host = cibm03
mon_addr = xx.xx.xx.3:6789

[osd]
osd_pool_default_size = 2
osd_pool_default_min_size = 1

## OSD Configuration ##
[osd.0]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

[osd.1]
host = cibn01
public_addr = xx.xx.17.1
cluster_addr = xx.xx.32.1

...



They are all running *Ubuntu 14.04.4 LTS*. Journals are 5GB partitions on
each disk, since all the OSDs are on SSD disks (Intel DC3510 800G). For
example:

sdc  8:32   0 745.2G  0 disk
|-sdc1   8:33   0 740.2G  0 part
/var/lib/ceph/osd/ceph-0
`-sdc2   8:34   0 5G  0 part

The purpose of this cluster will be to serve as a backend storage for
Cinder volumes (RBD) and Glance images in an OpenStack cloud; most of the
clusters on OpenStack will be non-relational databases like Cassandra with
many instances each.

All of the nodes of the cluster are running InfiniBand FDR 56Gb/s with
Mellanox Technologies MT27500 Family [ConnectX-3] adapters.


So I assumed that performance would be really nice, right? ...but I'm
getting some numbers that I think could be quite a bit better.

# rados --pool rbd bench 10 write -t 16

Total writes made:  1964
Write size: 4194304
Object size:4194304
Bandwidth (MB/sec): *755.435*

Stddev Bandwidth:   90.3288
Max bandwidth (MB/sec): 884
Min bandwidth (MB/sec): 612
Average IOPS:   188
Stddev IOPS:22
Max IOPS:   221
Min IOPS:   153
Average Latency(s): 0.0836802
Stddev Latency(s):  0.147561
Max latency(s): 1.50925
Min latency(s): 0.0192736


Then I connect to another server (this one is running on QDR, so I would
expect something between 2-3GB/s), map an RBD on the host, create an
ext4 fs and mount it, and finally run a fio test:

# fio --rw=randwrite --bs=4M --numjobs=8 --iodepth=32 --runtime=22
--time_based --size=10G --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name cephV1 --filename=/mnt/host01v1/test1

fio-2.1.3
Starting 8 processes
cephIBV1: Laying out IO file(s) (1 file(s) / 10240MB)
Jobs: 7 (f=7): [ww_w] [100.0% done] [0KB/431.6MB/0KB /s] [0/107/0 iops]
[eta 00m:00s]
cephIBV1: (groupid=0, jobs=8): err= 0: pid=6203: Thu Apr  7 15:24:12 2016
  write: io=15284MB, bw=676412KB/s, iops=165, runt= 23138msec
slat (msec): min=1, max=480, avg=46.15, stdev=63.68
clat (msec): min=64, max=8966, avg=1459.91, stdev=1252.64
 lat (msec): min=87, max=8969, avg=1506.06, stdev=1253.63
clat percentiles (msec):
 |  1.00th=[  235],  5.00th=[  478], 10.00th=[  611], 20.00th=[  766],
 | 30.00th=[  889], 40.00th=[  988], 50.00th=[ 1106], 60.00th=[ 1237],
 | 70.00th=[ 1434], 80.00th=[ 1680], 90.00th=[ 2474], 95.00th=[ 4555],
 | 99.00th=[ 6915], 99.50th=[ 7439], 99.90th=[ 8291], 99.95th=[ 8586],
 | 99.99th=[ 8979]
bw (KB  /s): min= 3091, max=209877, per=12.31%, avg=83280.51,
stdev=35226.98
lat (msec) : 100=0.16%, 250=0.97%, 500=4.61%, 750=12.93%, 1000=22.61%
lat (msec) : 2000=45.04%, >=2000=13.69%
  cpu  : usr=0.87%, sys=4.77%, ctx=6803, majf=0, minf=16337
  IO depths: 1=0.2%, 2=0.4%, 4=0.8%, 8=1.7%, 16=3.3%, 32=93.5%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=99.8%, 8=0.0%, 16=0.0%, 32=0.2%, 64=0.0%,
>=64=0.0%
 issued: total=r=0/w=3821/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
  WRITE: io=15284MB, aggrb=676411KB/s, minb=676411KB/s, maxb=676411KB/s,
mint=23138msec, maxt=23138msec

Disk stats (read/write):
  rbd1: ios=0/4189, merge=0/26613, ticks=0/2852032, in_queue=2857996,
util=99.08%


Does it look acceptable? I mean, for an InfiniBand network, I'd guess the
throughput needs to be better. How much more can I expect to achieve by
tuning the servers? The MTU on the OSD servers is:


Re: [ceph-users] Scrubbing a lot

2016-03-30 Thread German Anders
OK, but I have kernel 3.19.0-39-generic, so the new image format is supposed
to work, right? And I'm still getting issues while trying to map the RBD:

$ *sudo rbd --cluster cephIB create e60host01vX --size 100G --pool rbd -c
/etc/ceph/cephIB.conf*
$ *sudo rbd -p rbd bench-write e60host01vX --io-size 4096 --io-threads 1
--io-total 4096 --io-pattern rand -c /etc/ceph/cephIB.conf*
bench-write  io_size 4096 io_threads 1 bytes 4096 pattern random
  SEC   OPS   OPS/SEC   BYTES/SEC
elapsed: 0  ops:1  ops/sec:29.67  bytes/sec: 121536.32

$ *sudo rbd --cluster cephIB map e60host01vX --pool rbd -c
/etc/ceph/cephIB.conf*
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

$ *sudo rbd -p rbd info e60host01vX -c /etc/ceph/cephIB.conf*
rbd image 'e60host01vX':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.5f03238e1f29
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags:

Any other ideas about what could be the problem here?
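
(Per Jason's replies elsewhere in this thread, the krbd-compatible fix is
to strip the image features this kernel client doesn't understand, or to
create new images with layering only; e60host01vY is a placeholder name:)

$ sudo rbd --cluster cephIB feature disable e60host01vX exclusive-lock,object-map,fast-diff,deep-flatten --pool rbd
$ sudo rbd --cluster cephIB create e60host01vY --size 100G --pool rbd --image-feature layering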


*German*

2016-03-30 5:15 GMT-03:00 Ilya Dryomov :

> On Wed, Mar 30, 2016 at 3:03 AM, Jason Dillaman 
> wrote:
> > Understood -- format 2 was promoted to the default image format starting
> with Infernalis (which not all users would have played with since it isn't
> LTS).  The defaults can be overridden via the command-line when creating
> new images or via the Ceph configuration file.
> >
> > I'll let Ilya provide input on which kernels support image format 2, but
> from a quick peek on GitHub it looks like support was added around the v3.8
> timeframe.
>
> Layering (i.e. format 2 with default striping parameters) is supported
> starting with 3.10.  We don't really support older kernels - backports
> are pretty much all 3.10+, etc.
>
> Thanks,
>
> Ilya
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread German Anders
Jason, I tried that, but the mapping still isn't working:

# rbd --cluster cephIB map e60host02 --pool cinder-volumes -k
/etc/ceph/cephIB.client.cinder.keyring
rbd: sysfs write failed
rbd: map failed: (5) Input/output error


*German*

2016-03-29 17:46 GMT-03:00 Jason Dillaman <dilla...@redhat.com>:

> Running the following should fix your image up for krbd usage:
>
> # rbd --cluster cephIB feature disable e60host02
> exclusive-lock,object-map,fast-diff,deep-flatten --pool cinder-volumes
>
> In the future, you can create krbd-compatible images by adding
> "--image-feature layering" to the "rbd create" command-line (or by updating
> your config as documented in the release notes).
>
> --
>
> Jason Dillaman
>
>
> - Original Message -
>
> > From: "German Anders" <gand...@despegar.com>
> > To: "Jason Dillaman" <dilla...@redhat.com>
> > Cc: "Samuel Just" <sj...@redhat.com>, "ceph-users"
> > <ceph-users@lists.ceph.com>
> > Sent: Tuesday, March 29, 2016 4:38:02 PM
> > Subject: Re: [ceph-users] Scrubbing a lot
>
> > # rbd --cluster cephIB info e60host02 --pool cinder-volumes
> > rbd image 'e60host02':
> > size 102400 MB in 25600 objects
> > order 22 (4096 kB objects)
> > block_name_prefix: rbd_data.5ef1238e1f29
> > format: 2
> > features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
> > flags:
>
> > German
>
> > 2016-03-29 17:36 GMT-03:00 Jason Dillaman < dilla...@redhat.com > :
>
> > > Under Jewel, newly created images default to features that are not
> > > currently
> > > compatible with krbd. If you run 'rbd --cluster cephIB info host01
> --pool
> > > cinder-volumes', what features do you see? If you see more than
> layering,
> > > you need to disable them via the 'rbd feature disable' command.
> >
>
> > > [1]
> https://github.com/ceph/ceph/blob/master/doc/release-notes.rst#L302
> >
>
> > > --
> >
>
> > > Jason Dillaman
> >
>
> > > - Original Message -
> >
>
> > > > From: "German Anders" < gand...@despegar.com >
> >
> > > > To: "Samuel Just" < sj...@redhat.com >
> >
> > > > Cc: "ceph-users" < ceph-users@lists.ceph.com >
> >
> > > > Sent: Tuesday, March 29, 2016 4:19:03 PM
> >
> > > > Subject: Re: [ceph-users] Scrubbing a lot
> >
>
> > > > I've just upgrade to jewel , and the scrubbing seems to been
> corrected...
> > > > but
> >
> > > > now I'm not able to map an rbd on a host (before I was able to),
> > > > basically
> >
> > > > I'm getting this error msg:
> >
>
> > > > rbd: sysfs write failed
> >
> > > > rbd: map failed: (5) Input/output error
> >
>
> > > > # rbd --cluster cephIB create host01 --size 102400 --pool
> cinder-volumes
> > > > -k
> >
> > > > /etc/ceph/cephIB.client.cinder.keyring
> >
> > > > # rbd --cluster cephIB map host01 --pool cinder-volumes -k
> >
> > > > /etc/ceph/cephIB.client.cinder.keyring
> >
> > > > rbd: sysfs write failed
> >
> > > > rbd: map failed: (5) Input/output error
> >
>
> > > > Any ideas? on the /etc/ceph directory on the host I've:
> >
>
> > > > -rw-r--r-- 1 ceph ceph 92 Nov 17 15:45 rbdmap
> >
> > > > -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
> >
> > > > -rw-r--r-- 1 ceph ceph 37 Dec 15 15:12 virsh-secret
> >
> > > > -rw-r--r-- 1 ceph ceph 0 Dec 15 15:12 virsh-secret-set
> >
> > > > -rw-r--r-- 1 ceph ceph 37 Dec 21 14:53 virsh-secretIB
> >
> > > > -rw-r--r-- 1 ceph ceph 0 Dec 21 14:53 virsh-secret-setIB
> >
> > > > -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
> >
> > > > -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
> >
> > > > -rw-r--r-- 1 ceph ceph 72 Dec 23 09:51 ceph.client.cinder.keyring
> >
> > > > -rw-r--r-- 1 ceph ceph 63 Mar 28 09:03 cephIB.client.cinder.keyring
> >
> > > > -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
> >
> > > > -rw--- 1 ceph ceph 63 Mar 29 16:11 cephIB.client.admin.keyring
> >
>
> > > > ​Thanks in advance,
> >
>
> > > > Best,
> >
>
> > > > German
> >
>
> > > > 2016-03-29 14:45 GM

Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread German Anders
It seems that the image-format option is deprecated:

# rbd --id cinder --cluster cephIB create e60host01v2 --size 100G
--image-format 1 --pool cinder-volumes -k
/etc/ceph/cephIB.client.cinder.keyring
rbd: image format 1 is deprecated

# rbd --cluster cephIB info e60host01v2 --pool cinder-volumes
2016-03-29 16:45:39.073198 7fb859eb7700 -1 librbd::image::OpenRequest: RBD
image format 1 is deprecated. Please copy this image to image format 2.
rbd image 'e60host01v2':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rb.0.37d7.238e1f29
format: 1

and the map operation still doesn't work :(

# rbd --cluster cephIB map e60host01v2 --pool cinder-volumes -k
/etc/ceph/cephIB.client.cinder.keyring
rbd: sysfs write failed
rbd: map failed: (5) Input/output error

also, I'm running kernel *3.19.0-39-generic*

*German*

2016-03-29 17:40 GMT-03:00 Stefan Lissmats <ste...@trimmat.se>:

> I agree. I ran into the same issue and the error message is not that
> clear. Mapping with the kernel rbd client (rbd map) needs a quite new
> kernel to handle the new image format. The work-around is to use
> --image-format 1 when creating the image.
>
>
>
>  Originalmeddelande 
> Från: Samuel Just <sj...@redhat.com>
> Datum: 2016-03-29 22:24 (GMT+01:00)
> Till: German Anders <gand...@despegar.com>
> Kopia: ceph-users <ceph-users@lists.ceph.com>
> Rubrik: Re: [ceph-users] Scrubbing a lot
>
> Sounds like a version/compatibility thing.  Are your rbd clients really
> old?
> -Sam
>
> On Tue, Mar 29, 2016 at 1:19 PM, German Anders <gand...@despegar.com>
> wrote:
> > I've just upgrade to jewel, and the scrubbing seems to been corrected...
> but
> > now I'm not able to map an rbd on a host (before I was able to),
> basically
> > I'm getting this error msg:
> >
> > rbd: sysfs write failed
> > rbd: map failed: (5) Input/output error
> >
> > # rbd --cluster cephIB create host01 --size 102400 --pool cinder-volumes
> -k
> > /etc/ceph/cephIB.client.cinder.keyring
> > # rbd --cluster cephIB map host01 --pool cinder-volumes -k
> > /etc/ceph/cephIB.client.cinder.keyring
> > rbd: sysfs write failed
> > rbd: map failed: (5) Input/output error
> >
> > Any ideas? on the /etc/ceph directory on the host I've:
> >
> > -rw-r--r-- 1 ceph ceph  92 Nov 17 15:45 rbdmap
> > -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
> > -rw-r--r-- 1 ceph ceph  37 Dec 15 15:12 virsh-secret
> > -rw-r--r-- 1 ceph ceph   0 Dec 15 15:12 virsh-secret-set
> > -rw-r--r-- 1 ceph ceph  37 Dec 21 14:53 virsh-secretIB
> > -rw-r--r-- 1 ceph ceph   0 Dec 21 14:53 virsh-secret-setIB
> > -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
> > -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
> > -rw-r--r-- 1 ceph ceph  72 Dec 23 09:51 ceph.client.cinder.keyring
> > -rw-r--r-- 1 ceph ceph  63 Mar 28 09:03 cephIB.client.cinder.keyring
> > -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
> > -rw--- 1 ceph ceph  63 Mar 29 16:11 cephIB.client.admin.keyring
> >
> > Thanks in advance,
> >
> > Best,
> >
> > German
> >
> > 2016-03-29 14:45 GMT-03:00 German Anders <gand...@despegar.com>:
> >>
> >> Sure, also the scrubbing is happening on all the osds :S
> >>
> >> # ceph --cluster cephIB daemon osd.4 config diff
> >> {
> >> "diff": {
> >> "current": {
> >> "admin_socket": "\/var\/run\/ceph\/cephIB-osd.4.asok",
> >> "auth_client_required": "cephx",
> >> "filestore_fd_cache_size": "10240",
> >> "filestore_journal_writeahead": "true",
> >> "filestore_max_sync_interval": "10",
> >> "filestore_merge_threshold": "40",
> >> "filestore_op_threads": "20",
> >> "filestore_queue_max_ops": "10",
> >> "filestore_split_multiple": "8",
> >> "fsid": "a4bce51b-4d6b-4394-9737-3e4d9f5efed2",
> >> "internal_safe_to_start_threads": "true",
> >> "keyring": "\/var\/lib\/ceph\/osd\/cephIB-4\/keyring",
> >> "leveldb_log": "",
> >> "log_file": "\/var\/log\/ceph\/cephIB-osd.4.log",
> >> "log_to_stderr": "false",
> &g

Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread German Anders
On the host:

# ceph --cluster cephIB --version
*ceph version 10.1.0* (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)

# rbd --version
*ceph version 10.1.0* (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)

If I run the command without root or sudo, it fails with a
Permission denied error.

*German*

2016-03-29 17:24 GMT-03:00 Samuel Just <sj...@redhat.com>:

> Or you needed to run it as root?
> -Sam
>
> On Tue, Mar 29, 2016 at 1:24 PM, Samuel Just <sj...@redhat.com> wrote:
> > Sounds like a version/compatibility thing.  Are your rbd clients really
> old?
> > -Sam
> >
> > On Tue, Mar 29, 2016 at 1:19 PM, German Anders <gand...@despegar.com>
> wrote:
> >> I've just upgrade to jewel, and the scrubbing seems to been
> corrected... but
> >> now I'm not able to map an rbd on a host (before I was able to),
> basically
> >> I'm getting this error msg:
> >>
> >> rbd: sysfs write failed
> >> rbd: map failed: (5) Input/output error
> >>
> >> # rbd --cluster cephIB create host01 --size 102400 --pool
> cinder-volumes -k
> >> /etc/ceph/cephIB.client.cinder.keyring
> >> # rbd --cluster cephIB map host01 --pool cinder-volumes -k
> >> /etc/ceph/cephIB.client.cinder.keyring
> >> rbd: sysfs write failed
> >> rbd: map failed: (5) Input/output error
> >>
> >> Any ideas? on the /etc/ceph directory on the host I've:
> >>
> >> -rw-r--r-- 1 ceph ceph  92 Nov 17 15:45 rbdmap
> >> -rw-r--r-- 1 ceph ceph 170 Dec 15 14:47 secret.xml
> >> -rw-r--r-- 1 ceph ceph  37 Dec 15 15:12 virsh-secret
> >> -rw-r--r-- 1 ceph ceph   0 Dec 15 15:12 virsh-secret-set
> >> -rw-r--r-- 1 ceph ceph  37 Dec 21 14:53 virsh-secretIB
> >> -rw-r--r-- 1 ceph ceph   0 Dec 21 14:53 virsh-secret-setIB
> >> -rw-r--r-- 1 ceph ceph 173 Dec 22 13:34 secretIB.xml
> >> -rw-r--r-- 1 ceph ceph 619 Dec 22 13:38 ceph.conf
> >> -rw-r--r-- 1 ceph ceph  72 Dec 23 09:51 ceph.client.cinder.keyring
> >> -rw-r--r-- 1 ceph ceph  63 Mar 28 09:03 cephIB.client.cinder.keyring
> >> -rw-r--r-- 1 ceph ceph 526 Mar 28 12:06 cephIB.conf
> >> -rw--- 1 ceph ceph  63 Mar 29 16:11 cephIB.client.admin.keyring
> >>
> >> Thanks in advance,
> >>
> >> Best,
> >>
> >> German
> >>
> >> 2016-03-29 14:45 GMT-03:00 German Anders <gand...@despegar.com>:
> >>>
> >>> Sure, also the scrubbing is happening on all the osds :S
> >>>
> >>> # ceph --cluster cephIB daemon osd.4 config diff
> >>> {
> >>> "diff": {
> >>> "current": {
> >>> "admin_socket": "\/var\/run\/ceph\/cephIB-osd.4.asok",
> >>> "auth_client_required": "cephx",
> >>> "filestore_fd_cache_size": "10240",
> >>> "filestore_journal_writeahead": "true",
> >>> "filestore_max_sync_interval": "10",
> >>> "filestore_merge_threshold": "40",
> >>> "filestore_op_threads": "20",
> >>> "filestore_queue_max_ops": "10",
> >>> "filestore_split_multiple": "8",
> >>> "fsid": "a4bce51b-4d6b-4394-9737-3e4d9f5efed2",
> >>> "internal_safe_to_start_threads": "true",
> >>> "keyring": "\/var\/lib\/ceph\/osd\/cephIB-4\/keyring",
> >>> "leveldb_log": "",
> >>> "log_file": "\/var\/log\/ceph\/cephIB-osd.4.log",
> >>> "log_to_stderr": "false",
> >>> "mds_data": "\/var\/lib\/ceph\/mds\/cephIB-4",
> >>> "mon_cluster_log_file":
> >>> "default=\/var\/log\/ceph\/cephIB.$channel.log
> >>> cluster=\/var\/log\/ceph\/cephIB.log",
> >>> "mon_data": "\/var\/lib\/ceph\/mon\/cephIB-4",
> >>> "mon_debug_dump_location":
> >>> "\/var\/log\/ceph\/cephIB-osd.4.tdump",
> >>> "mon_host": "172.23.16.1,172.23.16.2,172.23.16.3",
> >>> "mon_initial_members": "cibm01, cibm02, cibm03",
> >>> "o

Re: [ceph-users] Scrubbing a lot

2016-03-29 Thread German Anders
Sure. Also, the scrubbing is happening on all the OSDs :S

# ceph --cluster cephIB daemon osd.4 config diff
{
"diff": {
"current": {
"admin_socket": "\/var\/run\/ceph\/cephIB-osd.4.asok",
"auth_client_required": "cephx",
"filestore_fd_cache_size": "10240",
"filestore_journal_writeahead": "true",
"filestore_max_sync_interval": "10",
"filestore_merge_threshold": "40",
"filestore_op_threads": "20",
"filestore_queue_max_ops": "10",
"filestore_split_multiple": "8",
"fsid": "a4bce51b-4d6b-4394-9737-3e4d9f5efed2",
"internal_safe_to_start_threads": "true",
"keyring": "\/var\/lib\/ceph\/osd\/cephIB-4\/keyring",
"leveldb_log": "",
"log_file": "\/var\/log\/ceph\/cephIB-osd.4.log",
"log_to_stderr": "false",
"mds_data": "\/var\/lib\/ceph\/mds\/cephIB-4",
"mon_cluster_log_file":
"default=\/var\/log\/ceph\/cephIB.$channel.log
cluster=\/var\/log\/ceph\/cephIB.log",
"mon_data": "\/var\/lib\/ceph\/mon\/cephIB-4",
"mon_debug_dump_location":
"\/var\/log\/ceph\/cephIB-osd.4.tdump",
"mon_host": "172.23.16.1,172.23.16.2,172.23.16.3",
"mon_initial_members": "cibm01, cibm02, cibm03",
"osd_data": "\/var\/lib\/ceph\/osd\/cephIB-4",
"osd_journal": "\/var\/lib\/ceph\/osd\/cephIB-4\/journal",
"osd_op_threads": "8",
"rgw_data": "\/var\/lib\/ceph\/radosgw\/cephIB-4",
"setgroup": "ceph",
"setuser": "ceph"
},
"defaults": {
"admin_socket": "\/var\/run\/ceph\/ceph-osd.4.asok",
"auth_client_required": "cephx, none",
"filestore_fd_cache_size": "128",
"filestore_journal_writeahead": "false",
"filestore_max_sync_interval": "5",
"filestore_merge_threshold": "10",
"filestore_op_threads": "2",
"filestore_queue_max_ops": "50",
"filestore_split_multiple": "2",
"fsid": "----",
"internal_safe_to_start_threads": "false",
"keyring":
"\/etc\/ceph\/ceph.osd.4.keyring,\/etc\/ceph\/ceph.keyring,\/etc\/ceph\/keyring,\/etc\/ceph\/keyring.bin",
"leveldb_log": "\/dev\/null",
"log_file": "\/var\/log\/ceph\/ceph-osd.4.log",
"log_to_stderr": "true",
"mds_data": "\/var\/lib\/ceph\/mds\/ceph-4",
    "mon_cluster_log_file":
"default=\/var\/log\/ceph\/ceph.$channel.log
cluster=\/var\/log\/ceph\/ceph.log",
"mon_data": "\/var\/lib\/ceph\/mon\/ceph-4",
"mon_debug_dump_location": "\/var\/log\/ceph\/ceph-osd.4.tdump",
"mon_host": "",
"mon_initial_members": "",
"osd_data": "\/var\/lib\/ceph\/osd\/ceph-4",
"osd_journal": "\/var\/lib\/ceph\/osd\/ceph-4\/journal",
"osd_op_threads": "2",
"rgw_data": "\/var\/lib\/ceph\/radosgw\/ceph-4",
"setgroup": "",
"setuser": ""
}
},
"unknown": []
}


Thanks a lot!

Best,


*German*

2016-03-29 14:10 GMT-03:00 Samuel Just <sj...@redhat.com>:

> That seems to be scrubbing pretty often.  Can you attach a config diff
> from osd.4 (ceph daemon osd.4 config diff)?
> -Sam
>
> On Tue, Mar 29, 2016 at 9:30 AM, German Anders <gand...@despegar.com>
> wrote:
> > Hi All,
> >
> > Maybe a simple question: I've set up a new cluster with the Infernalis
> > release; there's no IO going on at the cluster level and I'm receiving
> > a lot of these messages:
> >
> > 2016-03-29 12:22:07.462818 mon.0 [INF] pgmap v158062: 8192 pgs: 8192
> > active+clean; 2061

[ceph-users] Crush Map tunning recommendation and validation

2016-03-23 Thread German Anders
Hi all,

I have a question: I'm in the middle of a new Ceph cluster deployment and
I have 6 OSD servers split between two racks, so rack1 would have
osdserver1, 3 and 5, and rack2 osdserver2, 4 and 6. I've edited the
following crush map and I want to know if it's OK and also whether the
objects would be stored one on each rack's hosts, so that if I lose one
rack, I still have one copy on the other rack/server:

*http://pastebin.com/raw/QJf1VeeJ *

Also, do I need to run any command in order to 'apply' the new crush map
to the existing pools (currently only two)? A sketch follows after the list:

- 0 rbd(pg_num: 4096 | pgp_num: 4096 | size: 2 | min_size: 1)
- 1 cinder-volumes (pg_num: 4096 | pgp_num: 4096 | size: 2 | min_size: 1)
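
A rough sketch of the compile/test/inject sequence, plus pointing the pools
at the rule (this assumes the rule in the pasted map has id 0; pre-Luminous
the pool option is called crush_ruleset):

$ crushtool -c map.txt -o crushmap.bin
$ crushtool -i crushmap.bin --test --rule 0 --num-rep 2 --show-mappings | head   # spot-check placements
$ ceph --cluster cephIB osd setcrushmap -i crushmap.bin
$ ceph --cluster cephIB osd pool set rbd crush_ruleset 0
$ ceph --cluster cephIB osd pool set cinder-volumes crush_ruleset 0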

# ceph --cluster cephIB osd tree
ID WEIGHT   TYPE NAMEUP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 51.29668 root default
-8 26.00958 rack cage5-rack1
-2  8.66986 host cibn01
 0  0.72249 osd.0 up  1.0  1.0
 1  0.72249 osd.1 up  1.0  1.0
 2  0.72249 osd.2 up  1.0  1.0
 3  0.72249 osd.3 up  1.0  1.0
 4  0.72249 osd.4 up  1.0  1.0
 5  0.72249 osd.5 up  1.0  1.0
 6  0.72249 osd.6 up  1.0  1.0
 7  0.72249 osd.7 up  1.0  1.0
 8  0.72249 osd.8 up  1.0  1.0
 9  0.72249 osd.9 up  1.0  1.0
10  0.72249 osd.10 up  1.0  1.0
11  0.72249 osd.11 up  1.0  1.0
-4  8.66986 host cibn03
24  0.72249 osd.24 up  1.0  1.0
25  0.72249 osd.25 up  1.0  1.0
26  0.72249 osd.26 up  1.0  1.0
27  0.72249 osd.27 up  1.0  1.0
28  0.72249 osd.28 up  1.0  1.0
29  0.72249 osd.29 up  1.0  1.0
30  0.72249 osd.30 up  1.0  1.0
31  0.72249 osd.31 up  1.0  1.0
32  0.72249 osd.32 up  1.0  1.0
33  0.72249 osd.33 up  1.0  1.0
34  0.72249 osd.34 up  1.0  1.0
35  0.72249 osd.35 up  1.0  1.0
-6  8.66986 host cibn05
48  0.72249 osd.48 up  1.0  1.0
49  0.72249 osd.49 up  1.0  1.0
50  0.72249 osd.50 up  1.0  1.0
51  0.72249 osd.51 up  1.0  1.0
52  0.72249 osd.52 up  1.0  1.0
53  0.72249 osd.53 up  1.0  1.0
54  0.72249 osd.54 up  1.0  1.0
55  0.72249 osd.55 up  1.0  1.0
56  0.72249 osd.56 up  1.0  1.0
57  0.72249 osd.57 up  1.0  1.0
58  0.72249 osd.58 up  1.0  1.0
59  0.72249 osd.59 up  1.0  1.0
-9 25.28709 rack cage5-rack2
-3  8.66986 host cibn02
12  0.72249 osd.12 up  1.0  1.0
13  0.72249 osd.13 up  1.0  1.0
14  0.72249 osd.14 up  1.0  1.0
15  0.72249 osd.15 up  1.0  1.0
16  0.72249 osd.16 up  1.0  1.0
17  0.72249 osd.17 up  1.0  1.0
18  0.72249 osd.18 up  1.0  1.0
19  0.72249 osd.19 up  1.0  1.0
20  0.72249 osd.20 up  1.0  1.0
21  0.72249 osd.21 up  1.0  1.0
22  0.72249 osd.22 up  1.0  1.0
23  0.72249 osd.23 up  1.0  1.0
-5  8.66986 host cibn04
36  0.72249 osd.36 up  1.0  1.0
37  0.72249 osd.37 up  1.0  1.0
38  0.72249 osd.38 up  1.0  1.0
39  0.72249 osd.39 up  1.0  1.0
40  0.72249 osd.40 up  1.0  1.0
41  0.72249 osd.41 up  1.0  1.0
42  0.72249 osd.42 up  1.0  1.0
43  0.72249 osd.43 up  1.0  1.0
44  0.72249 osd.44 up  1.0  1.0
45  0.72249 osd.45 up  1.0  1.0
46  0.72249 osd.46 up

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
I'll try to set the ports on the HP IB QDR switch to 4K and then configure
the interfaces to mtu 4096 as well, run the same tests again and see what
the results are. However, is there any other parameter that I need to take
into account to tune for this? For example, this is the port configuration
of one of the Blade hosts:

$ ibportstate -L 29 17 query
Switch PortInfo:
# Port info: Lid 29 port 17
LinkState:...Active
PhysLinkState:...LinkUp
Lid:.75
SMLid:...2328
LMC:.0
LinkWidthSupported:..1X or 4X
LinkWidthEnabled:1X or 4X
LinkWidthActive:.4X
LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.10.0 Gbps
Peer PortInfo:
# Port info: Lid 29 DR path slid 4; dlid 65535; 0,17 port 1
LinkState:...Active
PhysLinkState:...LinkUp
Lid:.32
SMLid:...2
LMC:.0
LinkWidthSupported:..1X or 4X
LinkWidthEnabled:1X or 4X
LinkWidthActive:.4X
LinkSpeedSupported:..2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedEnabled:2.5 Gbps or 5.0 Gbps or 10.0 Gbps
LinkSpeedActive:.10.0 Gbps
Mkey:
MkeyLeasePeriod:.0
ProtectBits:.0

OK, changing the MTU on the hosts from 65520 down to 4096 hurts performance
really badly, from 1.7GB/s to 143.8MB/s... I'll keep looking into this.
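
(To double-check what actually took effect after a change like this,
something along these lines should do; ib0 is the interface name used on
these hosts:)

cat /sys/class/net/ib0/mode    # datagram vs connected
ip link show ib0 | grep mtu    # effective MTU on the interface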

Best,



*German*

2015-11-24 14:35 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I've gotten about 3.2 GB/s with IPoIB on QDR, but it took a couple of
> weeks of tuning to get that rate. If your switch is at 2048 MTU, it is
> really hard to get it increased without an outage if I remember
> correctly. Connected mode is much easier to get higher MTUs, but it
> was a bit flaky with IPoIB (had to send several pings to get the
> connection established sometimes). This was all a couple of years ago
> now so my memory is a bit fuzzy. My current IB Ceph cluster is so
> small that doing any tuning is not going to help because the
> bottleneck is my disks and CPU.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Nov 24, 2015 at 10:26 AM, German Anders  wrote:
> > Thanks a lot Robert for the explanation. I understand what you are saying
> > and I'm also excited to see more about IB with Ceph to get those
> performance
> > numbers up, and hopefully (hopefully soon) to see accelio working for
> > production. Regarding the HP IB switch we got 4 ports (uplinks)
> connected to
> > our IB SW, and internally the blades are connected through the backplane
> to
> > two ports so they used the total number of ports inside the Encl SW (16
> > ports). The bonding that I've configured is active/backup, I didn't know
> > that active/active is possible with IPoIB. Also, the adapters that we
> got on
> > the ceph nodes (supermicro servers), are Mellanox Technologies MT27500
> > Family [ConnectX-3], I also double check the port type configuration on
> the
> > IB SW and see that it's speed rate is 14.0 Gbps and also that the MTU
> > supported is 4096 and the current line rate is 56.0 Gbps.
> >
> > I've try almost all possible combinations and I'm not getting any
> > improvement that's more than 1.8 GB/s, so I was wondering if this is my
> top
> > limit speed with this kind of setup.
> >
> > Best,
> >
> >
> > German
> >
> > 2015-11-24 14:11 GMT-03:00 Robert LeBlanc :
> >>
> >> -BEGIN PGP SIGNED MESSAGE-
> >> Hash: SHA256
> >>
> >> I've had wildly different iperf results based on the version of the
> >> kernel, OFED and whether you are using datagram or connected mode as
> >> well as the MTU. You really have to just try all the different options
> >> to figure out what works the best.
> >>
> >> Please also remember that you will not get iSER performance out of
> >> Ceph at the moment (probably never), but the work being done will
> >> help. Even if you get the network transport optimally tuned, unless
> >> you have a massive Ceph cluster, you won't get the performance out
> >> of the SSDs. I'm just as excited about Ceph on Infiniband, but I've
> >> had to just chill out and let the devs do their work.
> >>
> >> I've never had good experiences with active/active

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
Thanks a lot Robert for the explanation. I understand what you are saying,
and I'm also excited to see more about IB with Ceph to get those performance
numbers up, and hopefully (hopefully soon) to see accelio working in
production. Regarding the HP IB switch, we have 4 ports (uplinks) connected
to our IB SW, and internally the blades are connected through the backplane
to two ports, so they use the total number of ports inside the Encl SW (16
ports). The bonding that I've configured is active/backup; I didn't know that
active/active was possible with IPoIB. Also, the adapters on the ceph nodes
(supermicro servers) are Mellanox Technologies MT27500 Family [ConnectX-3].
I also double-checked the port type configuration on the IB SW: its speed
rate is 14.0 Gbps, the supported MTU is 4096, and the current line rate is
56.0 Gbps.

I've tried almost all possible combinations and I'm not getting any
improvement beyond 1.8 GB/s, so I was wondering if this is the top speed
limit for this kind of setup.
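
(One way to confirm the link rate from the host side, given Robert's FDR vs
FDR-10 warning below; both tools come with the standard infiniband-diags
package:)

ibstat         # check the "Rate:" line per port (40 = QDR, 56 = FDR)
iblinkinfo     # per-link width and speed as seen across the fabric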

Best,


*German*

2015-11-24 14:11 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> I've had wildly different iperf results based on the version of the
> kernel, OFED and whether you are using datagram or connected mode as
> well as the MTU. You really have to just try all the different options
> to figure out what works the best.
>
> Please also remember that you will not get iSER performance out of
> Ceph at the moment (probably never), but the work being done will
> help. Even if you get the network transport optimally tuned, unless
> you have a massive Ceph cluster, you won't get the performance out
> of the SSDs. I'm just as excited about Ceph on Infiniband, but I've
> had to just chill out and let the devs do their work.
>
> I've never had good experiences with active/active bonding on IPoIB.
> For two blades in the same chassis, you should get non-blocking line
> rate. For going out of the chassis, you will be limited by the number
> of ports you connect to the upstream switch (that is why there is
> usually the same number of uplink ports as there are blades so that
> you can do non-blocking, however HP has been selling switches with
> only half the uplinks making your oversubscription 2:1, it really
> depends on what you actually need). Between QDR and FDR, you should
> get QDR speed. Also be sure it is full FDR and not FDR-10 which is the
> same signal rate as QDR but with the new 64/66 encoding, it won't give
> you as much speed improvement as FDR and it can be difficult to tell
> which your adapter has if you don't research it. We thought we bought
> FDR cards only to find out later they were FDR-10.
> -BEGIN PGP SIGNATURE-
> Version: Mailvelope v1.2.3
> Comment: https://www.mailvelope.com
>
> wsFcBAEBCAAQBQJWVJpCCRDmVDuy+mK58QAAEX4P/jFvdBzNob2xdftEkD2K
> rSB5i/Idmi7BAe1/JUzMF/t7l7zFXEpq96oLbt5NMbreOhCe6MitEApfhpWq
> dmt3IZYyUYVvXCxNGE/U7L58wi9DGPKJTWsigKScFtqjcQkIOlCh2VAHCmnE
> /WZBtlMnBsoibqq+zZsM4GEBwvPCwUwpGDKU13DhpuvmiN09jICEHH05wZzq
> ig/Ia309ioAZJ8PEKZ61kHUxAzTIMhwe1LV2jtlGQcJB4jMq7TQzOyizq0mQ
> 7DJTNNkMVpB9IEBCuOzzs/ByjKz+Tu31Jw2Y8R9MjtoDpOo+WQzzn6W4+NS0
> jG0cFiumIBKVwoMJyXpQeS6UC0w7balHaXy+8F4SUa+J/9X5w4bH9MmlJBfh
> p81YDtNs7mQYKsuDOkjNe0BkthhHbdQThHn4A75j8Hqaltwr28UqL83ywCUJ
> SqTGkhRLyU9O74snPfG+T7hM4fIVpH7DS4ebmK7yvSVzwwuExPgwWhjvAsmt
> DRnXv0qd8UAIgza0VYTyZuElUC4V39wMe503tXo5By+NGKWzVNOWR1X0+46i
> Xq2zvZQzc9MPtGHMmnm1dkJ+d6imfLzTf099njZ+Wl1xbagnQiKbiwKL8T/k
> d3OClf514rV4i7FtwOoB8NQcUMUjaeZGmPVDhmVt7fRYz/+rARkN/jwXH4qG
> x/Dk
> =/88f
> -END PGP SIGNATURE-----
> ----
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Tue, Nov 24, 2015 at 8:24 AM, German Anders <gand...@despegar.com>
> wrote:
> > Another test make between two HP blades with QDR (with bonding)
> >
> > e60-host01# iperf -s
> > 
> > Server listening on TCP port 5001
> > TCP window size: 85.3 KByte (default)
> > 
> > [  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
> > [  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
> > [  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
> > [  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
> > [ ID] Interval   Transfer Bandwidth
> > [  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
> > [  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
> > [  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
> > [  7]  0.0-10.0 sec  3.57 GBytes  3.07 

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
Yes, I'm wondering if this is the top performance threshold for this kind
of setup, although I'd assume that native IB perf would be better... :(

*German*

2015-11-24 14:24 GMT-03:00 Mark Nelson <mnel...@redhat.com>:

> On 11/24/2015 09:05 AM, German Anders wrote:
>
>> Thanks a lot for the response Mark, I will take a look at the guide that
>> you point me out. Regarding the iperf results find them below:
>>
>> *FDR-HOST -> to -> QDR-Blade-HOST
>> *
>> *(client)  (server)*
>>
>>
>> server:
>> --
>>
>> # iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51863
>> [  5] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51864
>> [  6] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51865
>> [  7] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51866
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.0 sec  4.24 GBytes  3.64 Gbits/sec
>> [  5]  0.0-10.0 sec  4.28 GBytes  3.67 Gbits/sec
>> [  6]  0.0-10.0 sec  4.30 GBytes  3.69 Gbits/sec
>> [  7]  0.0-10.0 sec  4.34 GBytes  3.73 Gbits/sec
>> [SUM]  0.0-10.0 sec  17.2 GBytes  14.7 Gbits/sec
>>
>> client:
>> --
>>
>> # iperf -c 172.23.18.1 -P 4
>> 
>> Client connecting to 172.23.18.1, TCP port 5001
>> TCP window size: 2.50 MByte (default)
>> 
>> [  6] local 172.23.16.1 port 51866 connected with 172.23.18.1 port 5001
>> [  3] local 172.23.16.1 port 51864 connected with 172.23.18.1 port 5001
>> [  4] local 172.23.16.1 port 51863 connected with 172.23.18.1 port 5001
>> [  5] local 172.23.16.1 port 51865 connected with 172.23.18.1 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  6]  0.0-10.0 sec  4.34 GBytes  3.73 Gbits/sec
>> [  3]  0.0-10.0 sec  4.28 GBytes  3.68 Gbits/sec
>> [  4]  0.0-10.0 sec  4.24 GBytes  3.64 Gbits/sec
>> [  5]  0.0-10.0 sec  4.30 GBytes  3.69 Gbits/sec
>> [SUM]  0.0-10.0 sec  17.2 GBytes  14.7 Gbits/sec
>>
>>
>
> hrm, pretty crappy. :/
>
>
>> *FDR-HOST -> to -> FDR-HOST
>> *
>> *(client)  (server)
>> *
>>
>> server:
>> --
>>
>> # iperf -s
>> 
>> Server listening on TCP port 5001
>> TCP window size: 85.3 KByte (default)
>> 
>> [  4] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59900
>> [  6] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59902
>> [  5] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59901
>> [  7] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59903
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0-10.0 sec  6.76 GBytes  5.80 Gbits/sec
>> [  6]  0.0-10.0 sec  6.71 GBytes  5.76 Gbits/sec
>> [  5]  0.0-10.0 sec  6.81 GBytes  5.84 Gbits/sec
>> [  7]  0.0-11.0 sec  9.24 GBytes  7.22 Gbits/sec
>> [SUM]  0.0-11.0 sec  29.5 GBytes  23.1 Gbits/sec
>>
>>
>> client:
>> --
>>
>> # iperf -c 172.23.17.5 -P 4
>> 
>> Client connecting to 172.23.17.5, TCP port 5001
>> TCP window size: 2.50 MByte (default)
>> 
>> [  6] local 172.23.16.1 port 59903 connected with 172.23.17.5 port 5001
>> [  4] local 172.23.16.1 port 59900 connected with 172.23.17.5 port 5001
>> [  3] local 172.23.16.1 port 59901 connected with 172.23.17.5 port 5001
>> [  5] local 172.23.16.1 port 59902 connected with 172.23.17.5 port 5001
>> [ ID] Interval   Transfer Bandwidth
>> [  4]  0.0- 9.0 sec  6.76 GBytes  6.45 Gbits/sec
>> [  3]  0.0- 9.0 sec  6.81 GBytes  6.49 Gbits/sec
>> [  5]  0.0- 9.0 sec  6.71 GBytes  6.40 Gbits/sec
>> [  6]  0.0-10.0 sec  9.24 GBytes  7.94 Gbits/sec
>> [SUM]  0.0-10.0 sec  29.5 GBytes  25.4 Gbits/sec
>>
>>
> Looking better, though maybe not as good as I would expect for FDR...
>
> Were these tests with the affinity tuning from the mellanox guide?
>
>
>> **
>>
>> *German
>>
>> *
>> 2015-11-24 11:51 GMT-03:00 Mark Nelson <mnel...@redhat.com
>> <mailto:mnel...@r

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
Is anyone on the list using ceph + IB FDR or QDR and getting around 3GB/s
with fio or any other tool? If so, could you share some config variables so
I can see where to tweak a little? I've already used mlnx_tune and
mlnx_affinity to improve irq affinity and other parameters, but I'm still
getting no more than 1.8GB/s with fio.
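
(For what it's worth, the all-to-all iperf test Mark suggests below can be
scripted as a quick sweep; a rough sketch, where the hostnames and thread
count are placeholders and an "iperf -s" is assumed to be listening on every
host:)

for src in host01 host02 host03 host04; do
  for dst in host01 host02 host03 host04; do
    [ "$src" = "$dst" ] && continue
    echo "== $src -> $dst =="
    ssh "$src" iperf -c "$dst" -P 4 -t 10
  done
done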

Thanks in advance,

Best,


*German*

2015-11-24 11:51 GMT-03:00 Mark Nelson <mnel...@redhat.com>:

> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
> PCIe or card related bottlenecks.  IPoIB will further limit that, especially
> if you haven't done any kind of interrupt affinity tuning.
>
> Assuming these are mellanox cards you'll want to read this guide:
>
>
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>
> For QDR I think the maximum throughput with IPoIB I've ever seen was about
> 2.7GB/s for a single port.  Typically 2-2.5GB/s is probably about what you
> should expect for a well tuned setup.
>
> I'd still suggest doing iperf tests.  It's really easy:
>
> "iperf -s" on one node to act as a server.
>
> "iperf -c  -P " on the client
>
> This will give you an idea of how your network is doing.  All-To-All
> network tests are also useful, in that sometimes network issues can crop up
> only when there's lots of traffic across many ports.  We've seen this in
> lab environments, especially with bonded ethernet.
>
> Mark
>
> On 11/24/2015 07:22 AM, German Anders wrote:
>
>> After doing some more in deep research and tune some parameters I've
>> gain a little bit more of performance:
>>
>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> --group_reporting --exitall --name
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> --filename=/mnt/e60host01vol1/test1
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> ...
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> fio-2.1.3
>> Starting 4 processes
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
>> (1 file(s) / 16384MB)
>> Jobs: 4 (f=4): [] [60.5% done] [*1714MB*/0KB/0KB /s] [1713/0/0 iops]
>>
>> [eta 00m:15s]
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>>read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
>>  slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
>>  clat (msec): min=2, max=321, avg=70.56, stdev=35.80
>>   lat (msec): min=2, max=321, avg=72.81, stdev=36.13
>>  clat percentiles (msec):
>>   |  1.00th=[   13],  5.00th=[   24], 10.00th=[   30], 20.00th=[
>>  40],
>>   | 30.00th=[   50], 40.00th=[   57], 50.00th=[   65], 60.00th=[
>>  75],
>>   | 70.00th=[   85], 80.00th=[   98], 90.00th=[  120], 95.00th=[
>> 139],
>>   | 99.00th=[  178], 99.50th=[  194], 99.90th=[  229], 99.95th=[
>> 247],
>>   | 99.99th=[  273]
>>  bw (KB  /s): min=301056, max=612352, per=25.01%, avg=449291.87,
>> stdev=54288.85
>>  lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
>>  lat (msec) : 250=18.34%, 500=0.03%
>>cpu  : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
>>IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%,
>>  >=64=0.0%
>>   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>>  >=64=0.0%
>>   complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>>  >=64=0.0%
>>   issued: total=r=38699/w=0/d=0, short=r=0/w=0/d=0
>>
>> Run status group 0 (all jobs):
>> READ: io=38699MB, aggrb=*1754.2MB/s*, minb=1754.2MB/s,
>>
>> maxb=1754.2MB/s, mint=22062msec, maxt=22062msec
>>
>> Disk stats (read/write):
>>rbd1: ios=77386/17, merge=0/122, ticks=3168312/500, in_queue=3170168,
>> util=99.76%
>>
>> The thing is that this test was running from a 'HP Blade enclosure with
>> QDR' so I think that if in QDR the max Throughput is around 3.2 GB/s (I
>> guess that this number must be divided by the total number of ports, in
>> this case 2, so a maximum of 1.6GB/s is the max of throughput that I'll
>> get on a single port, is that correct? Also 

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
Thanks a lot for the response Mark, I will take a look at the guide that
you pointed me to. Regarding the iperf results, find them below:


*FDR-HOST -> to -> QDR-Blade-HOST*
*(client)  (server)*

server:
--

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51863
[  5] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51864
[  6] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51865
[  7] local 172.23.18.1 port 5001 connected with 172.23.16.1 port 51866
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  4.24 GBytes  3.64 Gbits/sec
[  5]  0.0-10.0 sec  4.28 GBytes  3.67 Gbits/sec
[  6]  0.0-10.0 sec  4.30 GBytes  3.69 Gbits/sec
[  7]  0.0-10.0 sec  4.34 GBytes  3.73 Gbits/sec
[SUM]  0.0-10.0 sec  17.2 GBytes  14.7 Gbits/sec

client:
--

# iperf -c 172.23.18.1 -P 4

Client connecting to 172.23.18.1, TCP port 5001
TCP window size: 2.50 MByte (default)

[  6] local 172.23.16.1 port 51866 connected with 172.23.18.1 port 5001
[  3] local 172.23.16.1 port 51864 connected with 172.23.18.1 port 5001
[  4] local 172.23.16.1 port 51863 connected with 172.23.18.1 port 5001
[  5] local 172.23.16.1 port 51865 connected with 172.23.18.1 port 5001
[ ID] Interval   Transfer Bandwidth
[  6]  0.0-10.0 sec  4.34 GBytes  3.73 Gbits/sec
[  3]  0.0-10.0 sec  4.28 GBytes  3.68 Gbits/sec
[  4]  0.0-10.0 sec  4.24 GBytes  3.64 Gbits/sec
[  5]  0.0-10.0 sec  4.30 GBytes  3.69 Gbits/sec
[SUM]  0.0-10.0 sec  17.2 GBytes  14.7 Gbits/sec



*FDR-HOST -> to -> FDR-HOST*

*(client)  (server)*
server:
--

# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  4] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59900
[  6] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59902
[  5] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59901
[  7] local 172.23.17.5 port 5001 connected with 172.23.16.1 port 59903
[ ID] Interval   Transfer Bandwidth
[  4]  0.0-10.0 sec  6.76 GBytes  5.80 Gbits/sec
[  6]  0.0-10.0 sec  6.71 GBytes  5.76 Gbits/sec
[  5]  0.0-10.0 sec  6.81 GBytes  5.84 Gbits/sec
[  7]  0.0-11.0 sec  9.24 GBytes  7.22 Gbits/sec
[SUM]  0.0-11.0 sec  29.5 GBytes  23.1 Gbits/sec


client:
--

# iperf -c 172.23.17.5 -P 4

Client connecting to 172.23.17.5, TCP port 5001
TCP window size: 2.50 MByte (default)

[  6] local 172.23.16.1 port 59903 connected with 172.23.17.5 port 5001
[  4] local 172.23.16.1 port 59900 connected with 172.23.17.5 port 5001
[  3] local 172.23.16.1 port 59901 connected with 172.23.17.5 port 5001
[  5] local 172.23.16.1 port 59902 connected with 172.23.17.5 port 5001
[ ID] Interval   Transfer Bandwidth
[  4]  0.0- 9.0 sec  6.76 GBytes  6.45 Gbits/sec
[  3]  0.0- 9.0 sec  6.81 GBytes  6.49 Gbits/sec
[  5]  0.0- 9.0 sec  6.71 GBytes  6.40 Gbits/sec
[  6]  0.0-10.0 sec  9.24 GBytes  7.94 Gbits/sec
[SUM]  0.0-10.0 sec  29.5 GBytes  25.4 Gbits/sec




*German*
2015-11-24 11:51 GMT-03:00 Mark Nelson <mnel...@redhat.com>:

> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
> PCIe or card related bottlenecks.  IPoIB will further limit that, especially
> if you haven't done any kind of interrupt affinity tuning.
>
> Assuming these are mellanox cards you'll want to read this guide:
>
>
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>
> For QDR I think the maximum throughput with IPoIB I've ever seen was about
> 2.7GB/s for a single port.  Typically 2-2.5GB/s is probably about what you
> should expect for a well tuned setup.
>
> I'd still suggest doing iperf tests.  It's really easy:
>
> "iperf -s" on one node to act as a server.
>
> "iperf -c  -P " on the client
>
> This will give you an idea of how your network is doing.  All-To-All
> network tests are also useful, in that sometimes network issues can crop up
> only when there's lots of traffic across many ports.  We've seen this in
> lab environments, especially with bonded ethernet.
>
> Mark
>
> On 11/24/2015 07:22 AM, German Anders wrote:
>
>> After doing some more in deep research and tune some parameters I've
>> gain a little bit more of performance:
>>
>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --ru

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
Another test, made between two HP blades with QDR (with bonding):

e60-host01# iperf -s

Server listening on TCP port 5001
TCP window size: 85.3 KByte (default)

[  5] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41807
[  4] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41806
[  6] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41808
[  7] local 172.23.18.2 port 5001 connected with 172.23.18.1 port 41809
[ ID] Interval   Transfer Bandwidth
[  5]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
[  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
[  6]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
[  7]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
[SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec

e60-host02# iperf -c 172.23.18.2 -P 4


Client connecting to 172.23.18.2, TCP port 5001
TCP window size: 2.50 MByte (default)

[  3] local 172.23.18.1 port 41806 connected with 172.23.18.2 port 5001
[  5] local 172.23.18.1 port 41808 connected with 172.23.18.2 port 5001
[  4] local 172.23.18.1 port 41807 connected with 172.23.18.2 port 5001
[  6] local 172.23.18.1 port 41809 connected with 172.23.18.2 port 5001
[ ID] Interval   Transfer Bandwidth
[  3]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
[  5]  0.0-10.0 sec  3.58 GBytes  3.08 Gbits/sec
[  4]  0.0-10.0 sec  2.64 GBytes  2.27 Gbits/sec
[  6]  0.0-10.0 sec  3.57 GBytes  3.07 Gbits/sec
[SUM]  0.0-10.0 sec  12.4 GBytes  10.7 Gbits/sec

Note that the blades are also in the same enclosure.

Bonding configuration (a quick way to check the active slave is sketched
after the interfaces stanza below):

alias bond-ib bonding
options bonding mode=1 miimon=100 downdelay=100 updelay=100 max_bonds=2

## INFINIBAND CONF

auto ib0
iface ib0 inet manual
bond-master bond-ib

auto ib1
iface ib1 inet manual
bond-master bond-ib

auto bond-ib
iface bond-ib inet static
address 172.23.xx.xx
netmask 255.255.xx.xx
slaves ib0 ib1
bond_miimon 100
bond_mode active-backup
pre-up echo connected > /sys/class/net/ib0/mode
pre-up echo connected > /sys/class/net/ib1/mode
pre-up /sbin/ifconfig ib0 mtu 65520
pre-up /sbin/ifconfig ib1 mtu 65520
pre-up modprobe bond-ib
pre-up /sbin/ifconfig bond-ib mtu 65520
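
(To see which slave is actually carrying traffic in active-backup mode, the
kernel exposes the bond state here:)

cat /proc/net/bonding/bond-ib    # shows the currently active slave and MII status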


*German*

2015-11-24 11:51 GMT-03:00 Mark Nelson <mnel...@redhat.com>:

> Each port should be able to do 40Gb/s or 56Gb/s minus overhead and any
> PCIe or card related bottlenecks.  IPoIB will further limit that, especially
> if you haven't done any kind of interrupt affinity tuning.
>
> Assuming these are mellanox cards you'll want to read this guide:
>
>
> http://www.mellanox.com/related-docs/prod_software/Performance_Tuning_Guide_for_Mellanox_Network_Adapters.pdf
>
> For QDR I think the maximum throughput with IPoIB I've ever seen was about
> 2.7GB/s for a single port.  Typically 2-2.5GB/s is probably about what you
> should expect for a well tuned setup.
>
> I'd still suggest doing iperf tests.  It's really easy:
>
> "iperf -s" on one node to act as a server.
>
> "iperf -c  -P " on the client
>
> This will give you an idea of how your network is doing.  All-To-All
> network tests are also useful, in that sometimes network issues can crop up
> only when there's lots of traffic across many ports.  We've seen this in
> lab environments, especially with bonded ethernet.
>
> Mark
>
> On 11/24/2015 07:22 AM, German Anders wrote:
>
>> After doing some more in deep research and tune some parameters I've
>> gain a little bit more of performance:
>>
>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> --group_reporting --exitall --name
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
>> --filename=/mnt/e60host01vol1/test1
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> ...
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> fio-2.1.3
>> Starting 4 processes
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
>> (1 file(s) / 16384MB)
>> Jobs: 4 (f=4): [] [60.5% done] [*1714MB*/0KB/0KB /s] [1713/0/0 iops]
>>
>> [eta 00m:15s]
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> err= 0: pid=54857: Tue Nov 24 07:56:30 2015
>>read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062m

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-24 Thread German Anders
After doing some more in-depth research and tuning some parameters, I've
gained a little more performance:

# fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
--time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
--filename=/mnt/e60host01vol1/test1
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
...
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
fio-2.1.3
Starting 4 processes
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s) (1
file(s) / 16384MB)
Jobs: 4 (f=4): [] [60.5% done] [*1714MB*/0KB/0KB /s] [1713/0/0 iops]
[eta 00m:15s]
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4): err=
0: pid=54857: Tue Nov 24 07:56:30 2015
  read : io=38699MB, bw=1754.2MB/s, iops=1754, runt= 22062msec
slat (usec): min=131, max=63426, avg=2249.87, stdev=4320.91
clat (msec): min=2, max=321, avg=70.56, stdev=35.80
 lat (msec): min=2, max=321, avg=72.81, stdev=36.13
clat percentiles (msec):
 |  1.00th=[   13],  5.00th=[   24], 10.00th=[   30], 20.00th=[   40],
 | 30.00th=[   50], 40.00th=[   57], 50.00th=[   65], 60.00th=[   75],
 | 70.00th=[   85], 80.00th=[   98], 90.00th=[  120], 95.00th=[  139],
 | 99.00th=[  178], 99.50th=[  194], 99.90th=[  229], 99.95th=[  247],
 | 99.99th=[  273]
bw (KB  /s): min=301056, max=612352, per=25.01%, avg=449291.87,
stdev=54288.85
lat (msec) : 4=0.11%, 10=0.61%, 20=2.11%, 50=27.87%, 100=50.92%
lat (msec) : 250=18.34%, 500=0.03%
  cpu  : usr=0.19%, sys=33.60%, ctx=66708, majf=0, minf=636
  IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.2%, 32=99.7%,
>=64=0.0%
 submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%,
>=64=0.0%
 complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%,
>=64=0.0%
 issued: total=r=38699/w=0/d=0, short=r=0/w=0/d=0

Run status group 0 (all jobs):
   READ: io=38699MB, aggrb=*1754.2MB/s*, minb=1754.2MB/s, maxb=1754.2MB/s,
mint=22062msec, maxt=22062msec

Disk stats (read/write):
  rbd1: ios=77386/17, merge=0/122, ticks=3168312/500, in_queue=3170168,
util=99.76%

The thing is that this test was run from an HP Blade enclosure with QDR. If
the max throughput in QDR is around 3.2 GB/s, I guess that number must be
divided by the total number of ports (in this case 2), so a maximum of
1.6GB/s is what I'd get on a single port; is that correct? I also ran another
test on another host with FDR (max throughput would be around 6.8 GB/s), and
if the same theory holds, that would give me 3.4 GB/s per port, but I'm not
getting more than 1.4 - 1.6 GB/s. Any ideas? Same tuning on both servers.

Basically I changed the scaling_governor of the cpufreq of all cpus to
'performance' (sketched after the sysctl list below) and then set the
following values:

sysctl -w net.ipv4.tcp_timestamps=0
sysctl -w net.core.netdev_max_backlog=25
sysctl -w net.core.rmem_max=4194304
sysctl -w net.core.wmem_max=4194304
sysctl -w net.core.rmem_default=4194304
sysctl -w net.core.wmem_default=4194304
sysctl -w net.core.optmem_max=4194304
sysctl -w net.ipv4.tcp_rmem="4096 87380 4194304"
sysctl -w net.ipv4.tcp_wmem="4096 65536 4194304"
sysctl -w net.ipv4.tcp_low_latency=1
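
(The governor change was done roughly like this; a sketch that assumes the
sysfs cpufreq interface is present. Note the sysctl -w settings above are
runtime-only; the same keys would go into /etc/sysctl.conf to survive a
reboot:)

for g in /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor; do
    echo performance > "$g"
done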


However, the HP blade doesn't have Intel CPUs like the other server, so this
kind of 'tuning' can't be done there; I left it at the defaults and only
changed the TCP networking part.

Any comments or hint would be really appreciated.

Thanks in advance,

Best,




*German*
2015-11-23 15:06 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Are you using unconnected mode or connected mode? With connected mode
> you can up your MTU to 64K which may help on the network side.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 23, 2015 at 10:40 AM, German Anders  wrote:
> > Hi Mark,
> >
> > Thanks a lot for the quick response. Regarding the numbers that you
> > sent me, they look REALLY nice. I have the following setup:
> >
> > 4 OSD nodes:
> >
> > 2 x Intel Xeon E5-2650v2 @2.60Ghz
> > 1 x Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> > Dual-Port (1 for PUB and 1 for CLUS)
> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same
> > drive, so 1:1 relationship)
> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
> > 128GB RAM
> >
> > [0:0:0:0]diskATA  INT

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-23 Thread German Anders
Hi Robert,

Thanks for the response. It was configured as 'datagram', so I tried to
change it in the /etc/network/interfaces file by adding the following:

## IB0 PUBLIC_CEPH
auto ib0
iface ib0 inet static
address 172.23.17.8
netmask 255.255.240.0
network 172.23.16.0
post-up echo connected > /sys/class/net/ib0/mode
post-up /sbin/ifconfig ib0 mtu 65520

## IB1 CLUS_CEPH
auto ib1
iface ib1 inet static
address 172.23.32.8
netmask 255.255.240.0
network 172.23.47.254
post-up echo connected > /sys/class/net/ib1/mode
post-up /sbin/ifconfig ib1 mtu 65520



then rebooted, but when it came up again the mode still said 'datagram'
instead of 'connected'. Any idea?

Regards,


*German*

2015-11-23 15:06 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> Are you using unconnected mode or connected mode? With connected mode
> you can up your MTU to 64K which may help on the network side.
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Mon, Nov 23, 2015 at 10:40 AM, German Anders  wrote:
> > Hi Mark,
> >
> > Thanks a lot for the quick response. Regarding the numbers that you
> > sent me, they look REALLY nice. I have the following setup:
> >
> > 4 OSD nodes:
> >
> > 2 x Intel Xeon E5-2650v2 @2.60Ghz
> > 1 x Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
> > Dual-Port (1 for PUB and 1 for CLUS)
> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same
> > drive, so 1:1 relationship)
> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
> > 128GB RAM
> >
> > [0:0:0:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sdc
> > [0:0:1:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sdd
> > [0:0:2:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sde
> > [0:0:3:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdf
> > [0:0:4:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdg
> > [0:0:5:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdh
> > [0:0:6:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdi
> > [0:0:7:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdj
> > [0:0:8:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdk
> > [0:0:9:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdl
> > [0:0:10:0]   diskATA  INTEL SSDSC2BB80 0130  /dev/sdm
> >
> > sdf8:80   0 745.2G  0 disk
> > |-sdf1 8:81   0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-16
> > `-sdf2 8:82   0 5G  0 part
> > sdg8:96   0 745.2G  0 disk
> > |-sdg1 8:97   0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-17
> > `-sdg2 8:98   0 5G  0 part
> > sdh8:112  0 745.2G  0 disk
> > |-sdh1 8:113  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-18
> > `-sdh2 8:114  0 5G  0 part
> > sdi8:128  0 745.2G  0 disk
> > |-sdi1 8:129  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-19
> > `-sdi2 8:130  0 5G  0 part
> > sdj8:144  0 745.2G  0 disk
> > |-sdj1 8:145  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-20
> > `-sdj2 8:146  0 5G  0 part
> > sdk8:160  0 745.2G  0 disk
> > |-sdk1 8:161  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-21
> > `-sdk2 8:162  0 5G  0 part
> > sdl8:176  0 745.2G  0 disk
> > |-sdl1 8:177  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-22
> > `-sdl2 8:178  0 5G  0 part
> > sdm8:192  0 745.2G  0 disk
> > |-sdm1 8:193  0 740.2G  0 part
> > /var/lib/ceph/osd/ceph-23
> > `-sdm2 8:194  0 5G  0 part
> >
> >
> > $ rados bench -p rbd 20 write --no-cleanup -t 4
> >  Maintaining 4 concurrent writes of 4194304 bytes for up to 20 seconds
> or 0
> > objects
> >  Object prefix: benchmark_data_cibm01_1409
> >sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg
> lat
> >  0  

[ceph-users] Ceph 0.94.5 with accelio

2015-11-23 Thread German Anders
Hi all,

I want to know if there's any improvement or update regarding ceph 0.94.5
with accelio. I have an already configured cluster (with no data on it) and I
would like to know if there's a way to 'modify' the cluster in order to use
accelio. Any info would be really appreciated.

Cheers,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-23 Thread German Anders
lts from a similar environment?

Thanks in advance,

Best,

*German*

2015-11-23 13:08 GMT-03:00 Gregory Farnum <gfar...@redhat.com>:

> On Mon, Nov 23, 2015 at 10:05 AM, German Anders <gand...@despegar.com>
> wrote:
> > Hi all,
> >
> > I want to know if there's any improvement or update regarding ceph 0.94.5
> > with accelio, I've an already configured cluster (with no data on it)
> and I
> > would like to know if there's a way to 'modify' the cluster in order to
> use
> > accelio. Any info would be really appreciated.
>
> The XioMessenger is still experimental. As far as I know it's not
> expected to be stable any time soon and I can't imagine it will be
> backported to Hammer even when done.
> -Greg
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-23 Thread German Anders
Got it Robert, it was my mistake: I put post-up instead of pre-up. Now the
mode changes OK; I'll do new tests with this config and let you know.
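
For the record, the working stanza is the same as before with post-up
replaced by pre-up, so the mode and MTU are set before the interface comes
up (and analogously for ib1):

## IB0 PUBLIC_CEPH
auto ib0
iface ib0 inet static
address 172.23.17.8
netmask 255.255.240.0
network 172.23.16.0
pre-up echo connected > /sys/class/net/ib0/mode
pre-up /sbin/ifconfig ib0 mtu 65520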

Regards,


*German*

2015-11-23 15:36 GMT-03:00 German Anders <gand...@despegar.com>:

> Hi Robert,
>
> Thanks for the response. I was configured as 'datagram', so I try to
> changed it in the /etc/network/interfaces file and add the following:
>
> ## IB0 PUBLIC_CEPH
> auto ib0
> iface ib0 inet static
> address 172.23.17.8
> netmask 255.255.240.0
> network 172.23.16.0
> post-up echo connected > /sys/class/net/ib0/mode
> post-up /sbin/ifconfig ib0 mtu 65520
>
> ## IB1 CLUS_CEPH
> auto ib1
> iface ib1 inet static
> address 172.23.32.8
> netmask 255.255.240.0
> network 172.23.47.254
> post-up echo connected > /sys/class/net/ib1/mode
> post-up /sbin/ifconfig ib1 mtu 65520
>
>
>
> then reboot, but when it come up again, the mode is still saying
> 'datagram' instead of 'connected', any idea?
>
> Regards,
>
>
> *German*
>
> 2015-11-23 15:06 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:
>
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> Are you using unconnected mode or connected mode? With connected mode
>> you can up your MTU to 64K which may help on the network side.
>> - ----
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 23, 2015 at 10:40 AM, German Anders  wrote:
>> > Hi Mark,
>> >
>> > Thanks a lot for the quick response. Regarding the numbers that you
>> > sent me, they look REALLY nice. I have the following setup:
>> >
>> > 4 OSD nodes:
>> >
>> > 2 x Intel Xeon E5-2650v2 @2.60Ghz
>> > 1 x Network controller: Mellanox Technologies MT27500 Family
>> [ConnectX-3]
>> > Dual-Port (1 for PUB and 1 for CLUS)
>> > 1 x SAS2308 PCI-Express Fusion-MPT SAS-2
>> > 8 x Intel SSD DC S3510 800GB (1 OSD on each drive + journal on the same
>> > drive, so 1:1 relationship)
>> > 3 x Intel SSD DC S3710 200GB (to be used maybe as a cache tier)
>> > 128GB RAM
>> >
>> > [0:0:0:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sdc
>> > [0:0:1:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sdd
>> > [0:0:2:0]diskATA  INTEL SSDSC2BA20 0110  /dev/sde
>> > [0:0:3:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdf
>> > [0:0:4:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdg
>> > [0:0:5:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdh
>> > [0:0:6:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdi
>> > [0:0:7:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdj
>> > [0:0:8:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdk
>> > [0:0:9:0]diskATA  INTEL SSDSC2BB80 0130  /dev/sdl
>> > [0:0:10:0]   diskATA  INTEL SSDSC2BB80 0130  /dev/sdm
>> >
>> > sdf8:80   0 745.2G  0 disk
>> > |-sdf1 8:81   0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-16
>> > `-sdf2 8:82   0 5G  0 part
>> > sdg8:96   0 745.2G  0 disk
>> > |-sdg1 8:97   0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-17
>> > `-sdg2 8:98   0 5G  0 part
>> > sdh8:112  0 745.2G  0 disk
>> > |-sdh1 8:113  0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-18
>> > `-sdh2 8:114  0 5G  0 part
>> > sdi8:128  0 745.2G  0 disk
>> > |-sdi1 8:129  0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-19
>> > `-sdi2 8:130  0 5G  0 part
>> > sdj8:144  0 745.2G  0 disk
>> > |-sdj1 8:145  0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-20
>> > `-sdj2 8:146  0 5G  0 part
>> > sdk8:160  0 745.2G  0 disk
>> > |-sdk1 8:161  0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-21
>> > `-sdk2 8:162  0 5G  0 part
>> > sdl8:176  0 745.2G  0 disk
>> > |-sdl1 8:177  0 740.2G  0 part
>> > /var/lib/ceph/osd/ceph-

Re: [ceph-users] Ceph 0.94.5 with accelio

2015-11-23 Thread German Anders
 bandwidth (MB/sec): 0
Average Latency:0.0376536
Stddev Latency: 0.032886
Max latency:0.27063
Min latency:0.0229266


$ rados bench -p rbd 20 seq --no-cleanup -t 4
   sec Cur ops   started  finished  avg MB/s  cur MB/s  last lat   avg lat
 0   0 0 0 0 0 - 0
 1   4   394   390   1559.52  1560 0.014 0.0102236
 2   4   753   749   1496.68  1436 0.0129162 0.0106595
 3   4  1137  1133   1509.65  1536 0.0101854 0.0105731
 4   4  1526  1522   1521.17  1556 0.0122154 0.0103827
 5   4  1890  1886   1508.07  14560.00825445 0.0105908
 Total time run:5.675418
Total reads made: 2141
Read size:4194304
Bandwidth (MB/sec):1508.964

Average Latency:   0.0105951
Max latency:   0.211469
Min latency:   0.00603694


I'm not even close to the numbers that you are getting... :( Any ideas or
hints? I've also configured NOOP as the scheduler for all the SSD disks (see
the snippet below). I really don't know what else to look at in order to
improve performance and get numbers similar to yours.
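
(The scheduler setting mentioned above, for reference; it is runtime-only,
so it usually gets persisted via a udev rule or the kernel command line.
sdf here is one of the OSD SSDs listed earlier:)

echo noop > /sys/block/sdf/queue/scheduler
cat /sys/block/sdf/queue/scheduler     # the active one is bracketed: [noop] deadline cfq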


Thanks in advance,

Cheers,


*German*

2015-11-23 13:32 GMT-03:00 Mark Nelson <mnel...@redhat.com>:

> Hi German,
>
> I don't have exactly the same setup, but on the ceph community cluster I
> have tests with:
>
> 4 nodes, each of which are configured in some tests with:
>
> 2 x Intel Xeon E5-2650
> 1 x Intel XL710 40GbE (currently limited to about 2.5GB/s each)
> 1 x Intel P3700 800GB (4 OSDs per card using 4 data and 4 journal
> partitions)
> 64GB RAM
>
> With filestore, I can get an aggregate throughput of:
>
> 1MB randread: 8715.3MB/s
> 4MB randread: 8046.2MB/s
>
> This is with 4 fio instances on the same nodes as the OSDs using the fio
> librbd engine.
>
> A couple of things I would suggest trying:
>
> 1) See how rados bench does.  This is an easy test and you can see how
> different the numbers look.
>
> 2) try fio with librbd to see if it might be a qemu limitation.
>
> 3) Assuming you are using IPoIB, try some iperf tests to see how your
> network is doing.
>
> Mark
>
>
> On 11/23/2015 10:17 AM, German Anders wrote:
>
>> Thanks a lot for the quick update Greg. This leads me to ask if there's
>> anything out there to improve performance in an Infiniband environment
>> with Ceph. In the cluster that I mentioned earlier, I've set up 4 OSD
>> server nodes, each with 8 OSD daemons running with 800x Intel SSD
>> DC S3710 disks (740.2G for OSD and 5G for Journal) and also using IB FDR
>> 56Gb/s for the PUB and CLUS network, and I'm getting the following fio
>> numbers:
>>
>>
>> # fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
>> --time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
>> --invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
>> --group_reporting --exitall --name
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec --filename=/mnt/rbd/test1
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> ...
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (g=0): rw=randread,
>> bs=1M-1M/1M-1M/1M-1M, ioengine=libaio, iodepth=32
>> fio-2.1.3
>> Starting 4 processes
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: Laying out IO file(s)
>> (1 file(s) / 16384MB)
>> Jobs: 4 (f=4): [] [33.8% done] [1082MB/0KB/0KB /s] [1081/0/0 iops]
>> [eta 00m:45s]
>> dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec: (groupid=0, jobs=4):
>> err= 0: pid=63852: Mon Nov 23 10:48:07 2015
>>read : io=21899MB, bw=988.23MB/s, iops=988, runt= 22160msec
>>  slat (usec): min=192, max=186274, avg=3990.48, stdev=7533.77
>>  clat (usec): min=10, max=808610, avg=125099.41, stdev=90717.56
>>   lat (msec): min=6, max=809, avg=129.09, stdev=91.14
>>  clat percentiles (msec):
>>   |  1.00th=[   27],  5.00th=[   38], 10.00th=[   45], 20.00th=[
>>  61],
>>   | 30.00th=[   74], 40.00th=[   85], 50.00th=[  100], 60.00th=[
>> 117],
>>   | 70.00th=[  141], 80.00th=[  174], 90.00th=[  235], 95.00th=[
>> 297],
>>   | 99.00th=[  482], 99.50th=[  578], 99.90th=[  717], 99.95th=[
>> 750],
>>   | 99.99th=[  775]
>>  bw (KB  /s): min=134691, max=335872, per=25.08%, avg=253748.08,
>> stdev=40454.88
>>  lat (usec) : 20=0.01%
>>  lat (msec) : 10=0.02%, 20=0.27%, 50=12.90%, 100=36.93%, 250=41.39%
>>  lat (msec) : 500=7.59%, 750=0.84%, 1000=0.05%
>>cpu  : usr=0.11%, sy

Re: [ceph-users] ceph osd prepare cmd on infernalis 9.2.0

2015-11-20 Thread German Anders
Paul, thanks for the reply. I ran the command and then ran the ceph-deploy
command again with the same parameters, but I'm getting the same error msg
(the zap step is sketched below). I tried with XFS and it ran OK, so the
problem is with btrfs.
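
(For reference, the zap step was roughly the following; sgdisk -Z wipes both
the primary GPT and the backup copy at the end of the disk, which dd over
the start of the disk does not:)

sgdisk -Z /dev/sdf
# or, equivalently, through ceph-deploy:
ceph-deploy disk zap cibn05:sdf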


*German*

2015-11-20 6:08 GMT-03:00 HEWLETT, Paul (Paul) <
paul.hewl...@alcatel-lucent.com>:

> Flushing a GPT partition table using dd does not work as the table is
> duplicated at the end of the disk as well
>
> Use the sgdisk -Z command
>
> Paul
>
> From: ceph-users <ceph-users-boun...@lists.ceph.com> on behalf of Mykola <
> mykola.dvor...@gmail.com>
> Date: Thursday, 19 November 2015 at 18:43
> To: German Anders <gand...@despegar.com>
> Cc: ceph-users <ceph-users@lists.ceph.com>
> Subject: Re: [ceph-users] ceph osd prepare cmd on infernalis 9.2.0
>
> I believe the error message says that there is no space left on the device
> for the second partition to be created. Perhaps try to flush gpt with good
> old dd.
>
>
>
> Sent from Outlook Mail <http://go.microsoft.com/fwlink/?LinkId=550987>
> for Windows 10 phone
>
>
>
>
> *From: *German Anders <gand...@despegar.com>
> *Sent: *Thursday, November 19, 2015 7:25 PM
> *To: *Mykola Dvornik <mykola.dvor...@gmail.com>
> *Cc: *ceph-users <ceph-users@lists.ceph.com>
> *Subject: *Re: ceph osd prepare cmd on infernalis 9.2.0
>
>
>
> I've already try that with no luck at all
>
>
>
> On Thursday, 19 November 2015, Mykola Dvornik <mykola.dvor...@gmail.com>
> wrote:
>
> *'Could not create partition 2 from 10485761 to 10485760'.*
>
> Perhaps try to zap the disks first?
>
>
>
> On 19 November 2015 at 16:22, German Anders <gand...@despegar.com> wrote:
>
> Hi cephers,
>
> I had some issues while running the prepare osd command:
>
> ceph version: infernalis 9.2.0
>
> disk: /dev/sdf (745.2G)
>
>   /dev/sdf1 740.2G
>
>   /dev/sdf2 5G
>
> # parted /dev/sdf
> GNU Parted 2.3
> Using /dev/sdf
> Welcome to GNU Parted! Type 'help' to view a list of commands.
> (parted) print
> Model: ATA INTEL SSDSC2BB80 (scsi)
> Disk /dev/sdf: 800GB
> Sector size (logical/physical): 512B/4096B
> Partition Table: gpt
>
> Number  Start   End     Size    File system  Name          Flags
>  2      1049kB  5369MB  5368MB               ceph journal
>  1      5370MB  800GB   795GB   btrfs        ceph data
>
> cibn05:
>
>
>
> $ ceph-deploy osd prepare --fs-type btrfs cibn05:sdf
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/ceph/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy osd
> prepare --fs-type btrfs cibn05:sdf
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  disk  : [('cibn05',
> '/dev/sdf', None)]
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: prepare
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  fs_type   : btrfs
> [ceph_deploy.cli][INFO  ]  func  :  at 0x7fbb1e1d9050>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibn05:/dev/sdf:
> [cibn05][DEBUG ] connection detected need for sudo
> [cibn05][DEBUG ] connected to host: cibn05
> [cibn05][DEBUG ] detect platform information from remote host
> [cibn05][DEBUG ] detect machine type
> [cibn05][DEBUG ] find the location of an executable
> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
> [ceph_deploy.osd][DEBUG ] Deploying osd to cibn05
> [cibn05][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [cibn05][INFO  ] Running command: sudo udevadm trigger
> --subsystem-match=block --action=add
> [ceph_deploy.osd][DEBUG ] Preparing host cibn05 disk /dev/sdf journal None
> activate False
> [cibn05][INFO  ] Running command: sudo ceph-disk -v prepare --cluster ceph
> --fs-type btrfs -- /dev/sdf
> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --check-allows

[ceph-users] ceph infernalis pg creating forever

2015-11-20 Thread German Anders
Hi all, I've finished installing a new ceph cluster with the infernalis
9.2.0 release, but I'm getting the following error msg:

$ ceph -w
cluster 29xx-3xxx-xxx9-xxx7-b8xx
 health HEALTH_WARN
64 pgs degraded
64 pgs stale
64 pgs stuck degraded
1024 pgs stuck inactive
64 pgs stuck stale
1024 pgs stuck unclean
64 pgs stuck undersized
64 pgs undersized
pool rbd pg_num 1024 > pgp_num 64
 monmap e1: 3 mons at {cibm01=
172.23.16.1:6789/0,cibm02=172.23.16.2:6789/0,cibm03=172.23.16.3:6789/0}
election epoch 6, quorum 0,1,2 cibm01,cibm02,cibm03
 osdmap e113: 32 osds: 32 up, 32 in
flags sortbitwise
  pgmap v1264: 1024 pgs, 1 pools, 0 bytes data, 0 objects
1344 MB used, 23673 GB / 23675 GB avail
* 960 creating*
  64 stale+undersized+degraded+peered

2015-11-20 10:22:27.947850 mon.0 [INF] pgmap v1264: 1024 pgs: 960 creating,
64 stale+undersized+degraded+peered; 0 bytes data, 1344 MB used, 23673 GB /
23675 GB avail


It seems like it's 'creating' the new pgs... but it has been in this state
for a while, with actually no changes at all. Here is the ceph health detail
output as well:

$ ceph health
HEALTH_WARN 64 pgs degraded; 64 pgs stale; 64 pgs stuck degraded; 1024 pgs
stuck inactive; 64 pgs stuck stale; 1024 pgs stuck unclean; 64 pgs stuck
undersized; 64 pgs undersized; pool rbd pg_num 1024 > pgp_num 64

$ ceph health detail
(...)
pg 0.31 is stuck stale for 85287.493086, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.32 is stuck stale for 85287.493090, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.33 is stuck stale for 85287.493093, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.34 is stuck stale for 85287.493097, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.35 is stuck stale for 85287.493101, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.36 is stuck stale for 85287.493105, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.37 is stuck stale for 85287.493110, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.38 is stuck stale for 85287.493114, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.39 is stuck stale for 85287.493119, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3a is stuck stale for 85287.493123, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3b is stuck stale for 85287.493127, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3c is stuck stale for 85287.493131, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3d is stuck stale for 85287.493135, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3e is stuck stale for 85287.493139, current state
stale+undersized+degraded+peered, last acting [0]
pg 0.3f is stuck stale for 85287.493149, current state
stale+undersized+degraded+peered, last acting [0]
pool rbd pg_num 1024 > pgp_num 64

If I try to increase pgp_num, it says:

$ ceph osd pool set rbd pgp_num 1024
Error EBUSY: currently creating pgs, wait

$ ceph osd lspools
0 rbd,

$ ceph osd tree
ID WEIGHT   TYPE NAME   UP/DOWN REWEIGHT PRIMARY-AFFINITY
-1 23.11963 root default
-2  5.77991 host cibn05
 0  0.72249 osd.0 up  1.0  1.0
 1  0.72249 osd.1 up  1.0  1.0
 2  0.72249 osd.2 up  1.0  1.0
 3  0.72249 osd.3 up  1.0  1.0
 4  0.72249 osd.4 up  1.0  1.0
 5  0.72249 osd.5 up  1.0  1.0
 6  0.72249 osd.6 up  1.0  1.0
 7  0.72249 osd.7 up  1.0  1.0
-3  5.77991 host cibn06
 8  0.72249 osd.8 up  1.0  1.0
 9  0.72249 osd.9 up  1.0  1.0
10  0.72249 osd.10   up  1.0  1.0
11  0.72249 osd.11   up  1.0  1.0
12  0.72249 osd.12   up  1.0  1.0
13  0.72249 osd.13   up  1.0  1.0
14  0.72249 osd.14   up  1.0  1.0
15  0.72249 osd.15   up  1.0  1.0
-4  5.77991 host cibn07
16  0.72249 osd.16   up  1.0  1.0
17  0.72249 osd.17   up  1.0  1.0
18  0.72249 osd.18   up  1.0  1.0
19  0.72249 osd.19   up  1.0  1.0
20  0.72249 osd.20   up  1.0  1.0
21  0.72249 osd.21   up  1.0  1.0
22  0.72249 osd.22   up  1.0  1.0
23  0.72249 osd.23   up  1.0  1.0
-5  5.77991 host 

Re: [ceph-users] ceph infernalis pg creating forever

2015-11-20 Thread German Anders
Here's the actual crush map:

$ cat /home/ceph/actual_map.out

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
tunable straw_calc_version 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cibn05 {
id -2# do not change unnecessarily
# weight 5.780
alg straw
hash 0# rjenkins1
item osd.0 weight 0.722
item osd.1 weight 0.722
item osd.2 weight 0.722
item osd.3 weight 0.722
item osd.4 weight 0.722
item osd.5 weight 0.722
item osd.6 weight 0.722
item osd.7 weight 0.722
}
host cibn06 {
id -3# do not change unnecessarily
# weight 5.780
alg straw
hash 0# rjenkins1
item osd.8 weight 0.722
item osd.9 weight 0.722
item osd.10 weight 0.722
item osd.11 weight 0.722
item osd.12 weight 0.722
item osd.13 weight 0.722
item osd.14 weight 0.722
item osd.15 weight 0.722
}
host cibn07 {
id -4# do not change unnecessarily
# weight 5.780
alg straw
hash 0# rjenkins1
item osd.16 weight 0.722
item osd.17 weight 0.722
item osd.18 weight 0.722
item osd.19 weight 0.722
item osd.20 weight 0.722
item osd.21 weight 0.722
item osd.22 weight 0.722
item osd.23 weight 0.722
}
host cibn08 {
id -5# do not change unnecessarily
# weight 5.780
alg straw
hash 0# rjenkins1
item osd.24 weight 0.722
item osd.25 weight 0.722
item osd.26 weight 0.722
item osd.27 weight 0.722
item osd.28 weight 0.722
item osd.29 weight 0.722
item osd.30 weight 0.722
item osd.31 weight 0.722
}
root default {
id -1# do not change unnecessarily
# weight 23.120
alg straw
hash 0# rjenkins1
item cibn05 weight 5.780
item cibn06 weight 5.780
item cibn07 weight 5.780
item cibn08 weight 5.780
}

# rules
rule replicated_ruleset {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}

# end crush map
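
(One way to check whether the rule is actually satisfiable, per Greg's
suggestion quoted below; a sketch using crushtool, with placeholder file
names and --num-rep set to the pool's actual size:)

crushtool -c /home/ceph/actual_map.out -o actual_map.bin
crushtool -i actual_map.bin --test --rule 0 --num-rep 2 --show-bad-mappings
# no output from --show-bad-mappings means every pg gets its full set of OSDs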

The kernel version of the mon and osd servers is *3.19.0-25-generic*.


Regards,


*German*

2015-11-20 12:56 GMT-03:00 Gregory Farnum <gfar...@redhat.com>:

> This usually means your crush mapping for the pool in question is
> unsatisfiable. Check what the rule is doing.
> -Greg
>
>
> On Friday, November 20, 2015, German Anders <gand...@despegar.com> wrote:
>
>> Hi all, I've finished the install of a new ceph cluster with infernalis
>> 9.2.0 release. But I'm getting the following error msg:
>>
>> $ ceph -w
>> cluster 29xx-3xxx-xxx9-xxx7-b8xx
>>  health HEALTH_WARN
>> 64 pgs degraded
>> 64 pgs stale
>> 64 pgs stuck degraded
>> 1024 pgs stuck inactive
>> 64 pgs stuck stale
>> 1024 pgs stuck unclean
>> 64 pgs stuck undersized
>> 64 pgs undersized
>> pool rbd pg_num 1024 > pgp_num 64
>>  monmap e1: 3 mons at {cibm01=
>> 172.23.16.1:6789/0,cibm02=172.23.16.2:6789/0,cibm03=172.23.16.3:6789/0}
>> election epoch 6, quorum 0,1,2 cibm01,cibm02,cibm03
>>  osdmap e113: 32 osds: 32 up, 32 in
>> flags sortbitwise
>>   pgmap v1264: 1024 pgs, 1 pools, 0 bytes data, 0 objects
>> 1344 MB used, 23673 GB / 23675 GB avail
>> * 960 creating*
>>   64 stale+undersized+degraded+peered
>>
>> 2015-11-20 10:22:27.947850 mon.0 [INF] pgmap v1264: 1024 pgs: 960
>> creating, 64 stale+undersized+degraded+peered; 0 bytes data, 1344 MB used,
>> 23673 GB / 23675 GB avail
>>
>>
>> It seems like it's 'creating' the new pgs... but, I had been in this
>> state for a while, with actually no changes at all. Also the ceph health
>> detail output:
>>
>> $ ceph health
>> HEALTH_WARN 64 pgs degraded; 64 pgs stale; 64 pgs stuck degraded; 1024
>> pgs stuck inactive; 64 pgs stuck stale; 1024 pgs stuck unclean; 64 pgs
>> stuck undersized; 64 pgs undersized; pool rbd p

Re: [ceph-users] ceph osd prepare cmd on infernalis 9.2.0

2015-11-19 Thread German Anders
I've already tried that, with no luck at all.


On Thursday, 19 November 2015, Mykola Dvornik <mykola.dvor...@gmail.com>
wrote:

> *'Could not create partition 2 from 10485761 to 10485760'.*
>
> Perhaps try to zap the disks first?
>
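> A minimal zap sequence (a sketch, assuming /dev/sdf is the disk in
> question) would be something like:
>
> sudo sgdisk --zap-all --clear -g /dev/sdf
> sudo ceph-disk zap /dev/sdf
>
> and then re-run the prepare.
>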
> On 19 November 2015 at 16:22, German Anders <gand...@despegar.com> wrote:
>
>> Hi cephers,
>>
>> I had some issues while running the prepare osd command:
>>
>> ceph version: infernalis 9.2.0
>>
>> disk: /dev/sdf (745.2G)
>>   /dev/sdf1 740.2G
>>   /dev/sdf2 5G
>>
>> # parted /dev/sdf
>> GNU Parted 2.3
>> Using /dev/sdf
>> Welcome to GNU Parted! Type 'help' to view a list of commands.
>> (parted) print
>> Model: ATA INTEL SSDSC2BB80 (scsi)
>> Disk /dev/sdf: 800GB
>> Sector size (logical/physical): 512B/4096B
>> Partition Table: gpt
>>
>> Number  Start   End SizeFile system  Name  Flags
>>  2  1049kB  5369MB  5368MB   ceph journal
>>  1  5370MB  800GB   795GB   btrfsceph data
>>
>>
>> cibn05:
>>
>>
>> $ ceph-deploy osd prepare --fs-type btrfs cibn05:sdf
>> [ceph_deploy.conf][DEBUG ] found configuration file at:
>> /home/ceph/.cephdeploy.conf
>> [ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy
>> osd prepare --fs-type btrfs cibn05:sdf
>> [ceph_deploy.cli][INFO  ] ceph-deploy options:
>> [ceph_deploy.cli][INFO  ]  username  : None
>> [ceph_deploy.cli][INFO  ]  disk  : [('cibn05',
>> '/dev/sdf', None)]
>> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
>> [ceph_deploy.cli][INFO  ]  verbose   : False
>> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
>> [ceph_deploy.cli][INFO  ]  subcommand: prepare
>> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
>> /etc/ceph/dmcrypt-keys
>> [ceph_deploy.cli][INFO  ]  quiet : False
>> [ceph_deploy.cli][INFO  ]  cd_conf   :
>> 
>> [ceph_deploy.cli][INFO  ]  cluster   : ceph
>> [ceph_deploy.cli][INFO  ]  fs_type   : btrfs
>> [ceph_deploy.cli][INFO  ]  func  : <function osd at 0x7fbb1e1d9050>
>> [ceph_deploy.cli][INFO  ]  ceph_conf : None
>> [ceph_deploy.cli][INFO  ]  default_release   : False
>> [ceph_deploy.cli][INFO  ]  zap_disk  : False
>> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibn05:/dev/sdf:
>> [cibn05][DEBUG ] connection detected need for sudo
>> [cibn05][DEBUG ] connected to host: cibn05
>> [cibn05][DEBUG ] detect platform information from remote host
>> [cibn05][DEBUG ] detect machine type
>> [cibn05][DEBUG ] find the location of an executable
>> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
>> [ceph_deploy.osd][DEBUG ] Deploying osd to cibn05
>> [cibn05][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
>> [cibn05][INFO  ] Running command: sudo udevadm trigger
>> --subsystem-match=block --action=add
>> [ceph_deploy.osd][DEBUG ] Preparing host cibn05 disk /dev/sdf journal
>> None activate False
>> [cibn05][INFO  ] Running command: sudo ceph-disk -v prepare --cluster
>> ceph --fs-type btrfs -- /dev/sdf
>> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-allows-journal -i 0 --cluster ceph
>> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-wants-journal -i 0 --cluster ceph
>> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --check-needs-journal -i 0 --cluster ceph
>> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
>> /sys/dev/block/8:80/dm/uuid
>> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
>> /sys/dev/block/8:80/dm/uuid
>> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
>> /sys/dev/block/8:80/dm/uuid
>> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
>> /sys/dev/block/8:81/dm/uuid
>> [cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf2 uuid path is
>> /sys/dev/block/8:82/dm/uuid
>> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
>> --cluster=ceph --show-config-value=fsid
>> [cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
>> --cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
>> [cibn05][WARNIN] INFO:ceph-disk:Running comm

Re: [ceph-users] Can't activate osd in infernalis

2015-11-19 Thread German Anders
I have a similar problem when trying to run the prepare osd command:

ceph version: infernalis 9.2.0

disk: /dev/sdf (745.2G)
  /dev/sdf1 740.2G
  /dev/sdf2 5G

# parted /dev/sdf
GNU Parted 2.3
Using /dev/sdf
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ATA INTEL SSDSC2BB80 (scsi)
Disk /dev/sdf: 800GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End SizeFile system  Name  Flags
 2  1049kB  5369MB  5368MB   ceph journal
 1  5370MB  800GB   795GB   btrfsceph data


cibn05:


$ ceph-deploy osd prepare --fs-type btrfs cibn05:sdf
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy osd
prepare --fs-type btrfs cibn05:sdf
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  disk  : [('cibn05',
'/dev/sdf', None)]
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : btrfs
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibn05:/dev/sdf:
[cibn05][DEBUG ] connection detected need for sudo
[cibn05][DEBUG ] connected to host: cibn05
[cibn05][DEBUG ] detect platform information from remote host
[cibn05][DEBUG ] detect machine type
[cibn05][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to cibn05
[cibn05][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[cibn05][INFO  ] Running command: sudo udevadm trigger
--subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host cibn05 disk /dev/sdf journal None
activate False
[cibn05][INFO  ] Running command: sudo ceph-disk -v prepare --cluster ceph
--fs-type btrfs -- /dev/sdf
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-allows-journal -i 0 --cluster ceph
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-wants-journal -i 0 --cluster ceph
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-needs-journal -i 0 --cluster ceph
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
/sys/dev/block/8:81/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf2 uuid path is
/sys/dev/block/8:82/dm/uuid
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] INFO:ceph-disk:Will colocate journal with data on /dev/sdf
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] 

[ceph-users] ceph osd prepare cmd on infernalis 9.2.0

2015-11-19 Thread German Anders
Hi cephers,

I had some issues while running the prepare osd command:

ceph version: infernalis 9.2.0

disk: /dev/sdf (745.2G)
  /dev/sdf1 740.2G
  /dev/sdf2 5G

# parted /dev/sdf
GNU Parted 2.3
Using /dev/sdf
Welcome to GNU Parted! Type 'help' to view a list of commands.
(parted) print
Model: ATA INTEL SSDSC2BB80 (scsi)
Disk /dev/sdf: 800GB
Sector size (logical/physical): 512B/4096B
Partition Table: gpt

Number  Start   End SizeFile system  Name  Flags
 2  1049kB  5369MB  5368MB   ceph journal
 1  5370MB  800GB   795GB   btrfsceph data


cibn05:


$ ceph-deploy osd prepare --fs-type btrfs cibn05:sdf
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy osd
prepare --fs-type btrfs cibn05:sdf
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  disk  : [('cibn05',
'/dev/sdf', None)]
[ceph_deploy.cli][INFO  ]  dmcrypt   : False
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: prepare
[ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
/etc/ceph/dmcrypt-keys
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  fs_type   : btrfs
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  zap_disk  : False
[ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibn05:/dev/sdf:
[cibn05][DEBUG ] connection detected need for sudo
[cibn05][DEBUG ] connected to host: cibn05
[cibn05][DEBUG ] detect platform information from remote host
[cibn05][DEBUG ] detect machine type
[cibn05][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] Deploying osd to cibn05
[cibn05][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[cibn05][INFO  ] Running command: sudo udevadm trigger
--subsystem-match=block --action=add
[ceph_deploy.osd][DEBUG ] Preparing host cibn05 disk /dev/sdf journal None
activate False
[cibn05][INFO  ] Running command: sudo ceph-disk -v prepare --cluster ceph
--fs-type btrfs -- /dev/sdf
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-allows-journal -i 0 --cluster ceph
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-wants-journal -i 0 --cluster ceph
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--check-needs-journal -i 0 --cluster ceph
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
/sys/dev/block/8:81/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf2 uuid path is
/sys/dev/block/8:82/dm/uuid
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[cibn05][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] INFO:ceph-disk:Will colocate journal with data on /dev/sdf
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid
[cibn05][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf uuid path is
/sys/dev/block/8:80/dm/uuid

[ceph-users] Bcache and Ceph Question

2015-11-17 Thread German Anders
Hi all,

Is there any way to use bcache in an already configured Ceph cluster? I have
both OSDs and journals on the same OSD server, and I want to try bcache
in front of the OSD daemons and also move the journals onto the bcache
device, so for example I have:

/dev/sdc  --> SSD disk
/dev/sdc1  --> 1st partition GPT for Journal (5G)
/dev/sdc2  --> 2nd partition GPT for Journal (5G)
/dev/sdc3  --> 3rd partition GPT for Journal (5G)
/dev/sdc4  --> partition GPT for bcache device (200G)

Then attach */dev/sdf1*, */dev/sdg1* and */dev/sdh1* (all *OSD daemons, 2TB
each in size*) to the bcache partition on the */dev/sdc* device.
Is this possible, or will I need to 'format' all the devices in order to do
this kind of procedure? Any ideas, or is there a better approach to this?
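
From what I've read so far, make-bcache writes its own superblock on the
backing device, which makes me think the existing OSDs would have to be
re-created; a from-scratch sketch (using the device names above, with a
placeholder UUID) would be:

# make-bcache -C /dev/sdc4       <- cache device on the SSD partition
# make-bcache -B /dev/sdf1       <- backing device; this wipes it
# echo <cache-set-uuid> > /sys/block/bcache0/bcache/attach

Please correct me if that's wrong.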

Thanks in advance,

Regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Performance output for Ceph IB with fio examples

2015-11-17 Thread German Anders
Hi cephers,

Is there anyone out there using Ceph (any version) with an Infiniband FDR
network topology (both public and cluster networks) who could share some
performance results? To be more specific, running something like this on an
RBD volume mapped to an IB host:

# fio --rw=randread --bs=4m --numjobs=4 --iodepth=32 --runtime=22
--time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name
dev-ceph-randread-4m-4thr-libaio-32iodepth-22sec
--filename=/mnt/rbdtest/test1

# fio --rw=randread --bs=1m --numjobs=4 --iodepth=32 --runtime=22
--time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name
dev-ceph-randread-1m-4thr-libaio-32iodepth-22sec
--filename=/mnt/rbdtest/test2

# fio --rw=randwrite --bs=1m --numjobs=4 --iodepth=32 --runtime=22
--time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name
dev-ceph-randwrite-1m-4thr-libaio-32iodepth-22sec
--filename=/mnt/rbdtest/test3

# fio --rw=randwrite --bs=4m --numjobs=4 --iodepth=32 --runtime=22
--time_based --size=16777216k --loops=1 --ioengine=libaio --direct=1
--invalidate=1 --fsync_on_close=1 --randrepeat=1 --norandommap
--group_reporting --exitall --name
dev-ceph-randwrite-4m-4thr-libaio-32iodepth-22sec
--filename=/mnt/rbdtest/test4

I would really appreciate the outputs.
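
For context, the test volume was created, mapped and mounted along these
lines (size and names are just examples):

# rbd create rbdtest --size 102400
# rbd map rbdtest
# mkfs.xfs /dev/rbd0
# mount /dev/rbd0 /mnt/rbdtest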

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Question about OSD activate with ceph-deploy

2015-11-13 Thread German Anders
Hi all,

I'm having some issues while trying to run the osd activate command with
the ceph-deploy tool (1.5.28); the osd prepare command ran fine, but then...

osd: sdf1
journal: /dev/sdc1


$ ceph-deploy osd activate cibn01:sdf1:/dev/sdc1
[ceph_deploy.conf][DEBUG ] found configuration file at:
/home/ceph/.cephdeploy.conf
[ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/local/bin/ceph-deploy osd
activate cibn01:sdf1:/dev/sdc1
[ceph_deploy.cli][INFO  ] ceph-deploy options:
[ceph_deploy.cli][INFO  ]  username  : None
[ceph_deploy.cli][INFO  ]  verbose   : False
[ceph_deploy.cli][INFO  ]  overwrite_conf: False
[ceph_deploy.cli][INFO  ]  subcommand: activate
[ceph_deploy.cli][INFO  ]  quiet : False
[ceph_deploy.cli][INFO  ]  cd_conf   :

[ceph_deploy.cli][INFO  ]  cluster   : ceph
[ceph_deploy.cli][INFO  ]  func  : 
[ceph_deploy.cli][INFO  ]  ceph_conf : None
[ceph_deploy.cli][INFO  ]  default_release   : False
[ceph_deploy.cli][INFO  ]  disk  : [('cibn01',
'/dev/sdf1', '/dev/sdc1')]
[ceph_deploy.osd][DEBUG ] Activating cluster ceph disks
cibn01:/dev/sdf1:/dev/sdc1
ceph@cibn01's password:
[cibn01][DEBUG ] connection detected need for sudo
ceph@cibn01's password:
[cibn01][DEBUG ] connected to host: cibn01
[cibn01][DEBUG ] detect platform information from remote host
[cibn01][DEBUG ] detect machine type
[cibn01][DEBUG ] find the location of an executable
[ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
[ceph_deploy.osd][DEBUG ] activating host cibn01 disk /dev/sdf1
[ceph_deploy.osd][DEBUG ] will use init type: upstart
[cibn01][INFO  ] Running command: sudo ceph-disk -v activate --mark-init
upstart --mount /dev/sdf1
[cibn01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
/sys/dev/block/8:81/dm/uuid
[cibn01][WARNIN] DEBUG:ceph-disk:get_dm_uuid /dev/sdf1 uuid path is
/sys/dev/block/8:81/dm/uuid
[cibn01][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk -i 1 /dev/sdf
[cibn01][WARNIN] INFO:ceph-disk:Running command: /sbin/blkid -p -s TYPE
-ovalue -- /dev/sdf1
[cibn01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[cibn01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
[cibn01][WARNIN] DEBUG:ceph-disk:Mounting /dev/sdf1 on
/var/lib/ceph/tmp/mnt.zv_wAh with options noatime,user_subvol_rm_allowed
[cibn01][WARNIN] INFO:ceph-disk:Running command: /bin/mount -t btrfs -o
noatime,user_subvol_rm_allowed -- /dev/sdf1 /var/lib/ceph/tmp/mnt.zv_wAh
[cibn01][WARNIN] DEBUG:ceph-disk:Cluster uuid is
1661668a-bc97-419f-9000-6fb23f364479
[cibn01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cibn01][WARNIN] DEBUG:ceph-disk:Cluster name is ceph
[cibn01][WARNIN] DEBUG:ceph-disk:OSD uuid is --x--x
[cibn01][WARNIN] DEBUG:ceph-disk:OSD id is 0
[cibn01][WARNIN] DEBUG:ceph-disk:Initializing OSD...
[cibn01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph --cluster
ceph --name client.bootstrap-osd --keyring
/var/lib/ceph/bootstrap-osd/ceph.keyring mon getmap -o
/var/lib/ceph/tmp/mnt.zv_wAh/activate.monmap
[cibn01][WARNIN] got monmap epoch 2
[cibn01][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster ceph --mkfs --mkkey -i 0 --monmap
/var/lib/ceph/tmp/mnt.zv_wAh/activate.monmap --osd-data
/var/lib/ceph/tmp/mnt.zv_wAh --osd-journal
/var/lib/ceph/tmp/mnt.zv_wAh/journal --osd-uuid
81d60fe0-3b40-4045-8674-7f4723c6a67a --keyring
/var/lib/ceph/tmp/mnt.zv_wAh/keyring --setuser ceph --setgroup ceph
[cibn01][WARNIN] 2015-11-13 09:34:35.903047 7f2be4e3a940 -1
filestore(/var/lib/ceph/tmp/mnt.zv_wAh) mkjournal error creating journal on
/var/lib/ceph/tmp/mnt.zv_wAh/journal: (13) Permission denied
[cibn01][WARNIN] 2015-11-13 09:34:35.903059 7f2be4e3a940 -1 OSD::mkfs:
ObjectStore::mkfs failed with error -13
[cibn01][WARNIN] 2015-11-13 09:34:35.903080 7f2be4e3a940 -1  ** ERROR:
error creating empty object store in /var/lib/ceph/tmp/mnt.zv_wAh: (13)
Permission denied
[cibn01][WARNIN] ERROR:ceph-disk:Failed to activate
[cibn01][WARNIN] DEBUG:ceph-disk:Unmounting /var/lib/ceph/tmp/mnt.zv_wAh
[cibn01][WARNIN] INFO:ceph-disk:Running command: /bin/umount --
/var/lib/ceph/tmp/mnt.zv_wAh
[cibn01][WARNIN] Traceback (most recent call last):
[cibn01][WARNIN]   File "/usr/sbin/ceph-disk", line 3576, in <module>
[cibn01][WARNIN]     main(sys.argv[1:])
[cibn01][WARNIN]   File "/usr/sbin/ceph-disk", line 3530, in main
[cibn01][WARNIN]     args.func(args)
[cibn01][WARNIN]   File "/usr/sbin/ceph-disk", line 2424, in main_activate
[cibn01][WARNIN]     dmcrypt_key_dir=args.dmcrypt_key_dir,
[cibn01][WARNIN]   File "/usr/sbin/ceph-disk", line 2197, in mount_activate
[cibn01][WARNIN]     (osd_id, cluster) =
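
My current guess is that the mkjournal "Permission denied" is the key part:
the mkfs above runs with --setuser ceph --setgroup ceph, while the journal
symlink points at the raw partition /dev/sdc1, which is probably still
owned by root. So something like this may be needed before retrying the
activate:

$ sudo chown ceph:ceph /dev/sdc1

Can anyone confirm?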

[ceph-users] v0.94.4 Hammer released upgrade

2015-10-20 Thread German Anders
Trying to upgrade from hammer 0.94.3 to 0.94.4, I'm getting the following
error msg while trying to restart the mon daemons ($ sudo restart
ceph-mon-all):

2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.506244 7fd971a858c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6847
2015-10-20 08:56:37.524079 7fd971a858c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.524101 7fd971a858c0 -1 error checking features: (1)
Operation not permitted

any ideas?

$ ceph -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)


Thanks in advance,

Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread German Anders
Hi Udo,

I've already tried that, with no luck at all.

Cheers,


*German* <gand...@despegar.com>

2015-10-20 15:06 GMT-03:00 Udo Lembke <ulem...@polarzone.de>:

> Hi,
> have you changed the ownership as described in Sage's mail about
> "v9.1.0 Infernalis release candidate released"?
>
>   #. Fix the ownership::
>
>  chown -R ceph:ceph /var/lib/ceph
>
> or set ceph.conf to use root instead?
>   When upgrading, administrators have two options:
>
>#. Add the following line to ``ceph.conf`` on all hosts::
>
> setuser match path = /var/lib/ceph/$type/$cluster-$id
>
>   This will make the Ceph daemons run as root (i.e., not drop
>   privileges and switch to user ceph) if the daemon's data
>   directory is still owned by root.  Newly deployed daemons will
>   be created with data owned by user ceph and will run with
>   reduced privileges, but upgraded daemons will continue to run as
>   root.
>
>
>
> Udo
>
> On 20.10.2015 14:59, German Anders wrote:
>
> Trying to upgrade from hammer 0.94.3 to 0.94.4, I'm getting the following
> error msg while trying to restart the mon daemons:
>
> 2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
> 2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={7=support shec
> erasure code}
> 2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
> Operation not permitted
> 2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
> (95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
> 2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
> unsupported features: compat={},rocompat={},incompat={7=support shec
> erasure code}
> 2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
> Operation not permitted
>
>
> any ideas?
>
> $ ceph -v
> ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)
>
>
> Thanks in advance,
>
> Cheers,
>
> *German*
>
> 2015-10-19 18:07 GMT-03:00 Sage Weil <s...@redhat.com>:
>
>> This Hammer point fixes several important bugs in Hammer, as well as
>> fixing interoperability issues that are required before an upgrade to
>> Infernalis. That is, all users of earlier version of Hammer or any
>> version of Firefly will first need to upgrade to hammer v0.94.4 or
>> later before upgrading to Infernalis (or future releases).
>>
>> All v0.94.x Hammer users are strongly encouraged to upgrade.
>>
>> Changes
>> ---
>>
>> * build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166,
>> Nathan Cutler)
>> * build/ops: ceph.spec.in: ceph-common needs python-argparse on older
>> distros, but doesn't require it (#12034, Nathan Cutler)
>> * build/ops: ceph.spec.in: radosgw requires apache for SUSE only --
>> makes no sense (#12358, Nathan Cutler)
>> * build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
>> (#11991, Nathan Cutler)
>> * build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992,
>> Owen Synge)
>> * build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan
>> Cutler)
>> * build/ops: ceph.spec.in: snappy-devel for all supported distros
>> (#12361, Nathan Cutler)
>> * build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
>> (#11629, Nathan Cutler)
>> * build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build
>> (#12351, Nathan Cutler)
>> * build/ops: error in ext_mime_map_init() when /etc/mime.types is missing
>> (#11864, Ken Dreyer)
>> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
>> (#11798, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#10927, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#11140, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#11686, Sage Weil)
>> * build/ops: With root as default user, unable to have multiple RGW
>> instances running (#12407, Sage Weil)
>> * cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu
>> Chai)
>> * cli: ceph tell: broken error message / misleading hinting (#11101, Kefu
>> Chai)
>> * common: arm: all programs that link to librados2 hang forever on
>> startup (#12505, Boris Ranto)
>> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
>> * common

Re: [ceph-users] v0.94.4 Hammer released

2015-10-20 Thread German Anders
Trying to upgrade from hammer 0.94.3 to 0.94.4, I'm getting the following
error msg while trying to restart the mon daemons:

2015-10-20 08:56:37.410321 7f59a8c9d8c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6821
2015-10-20 08:56:37.429036 7f59a8c9d8c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.429066 7f59a8c9d8c0 -1 error checking features: (1)
Operation not permitted
2015-10-20 08:56:37.458637 7f67460958c0  0 ceph version 0.94.4
(95292699291242794510b39ffde3f4df67898d3a), process ceph-mon, pid 6834
2015-10-20 08:56:37.478365 7f67460958c0 -1 ERROR: on disk data includes
unsupported features: compat={},rocompat={},incompat={7=support shec
erasure code}
2015-10-20 08:56:37.478387 7f67460958c0 -1 error checking features: (1)
Operation not permitted


any ideas?

$ ceph -v
ceph version 0.94.4 (95292699291242794510b39ffde3f4df67898d3a)


Thanks in advance,

Cheers,

*German* 

2015-10-19 18:07 GMT-03:00 Sage Weil :

> This Hammer point fixes several important bugs in Hammer, as well as
> fixing interoperability issues that are required before an upgrade to
> Infernalis. That is, all users of earlier version of Hammer or any
> version of Firefly will first need to upgrade to hammer v0.94.4 or
> later before upgrading to Infernalis (or future releases).
>
> All v0.94.x Hammer users are strongly encouraged to upgrade.
>
> Changes
> ---
>
> * build/ops: ceph.spec.in: 50-rbd.rules conditional is wrong (#12166,
> Nathan Cutler)
> * build/ops: ceph.spec.in: ceph-common needs python-argparse on older
> distros, but doesn't require it (#12034, Nathan Cutler)
> * build/ops: ceph.spec.in: radosgw requires apache for SUSE only -- makes
> no sense (#12358, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: cephfs_java not fully conditionalized
> (#11991, Nathan Cutler)
> * build/ops: ceph.spec.in: rpm: not possible to turn off Java (#11992,
> Owen Synge)
> * build/ops: ceph.spec.in: running fdupes unnecessarily (#12301, Nathan
> Cutler)
> * build/ops: ceph.spec.in: snappy-devel for all supported distros
> (#12361, Nathan Cutler)
> * build/ops: ceph.spec.in: SUSE/openSUSE builds need libbz2-devel
> (#11629, Nathan Cutler)
> * build/ops: ceph.spec.in: useless %py_requires breaks SLE11-SP3 build
> (#12351, Nathan Cutler)
> * build/ops: error in ext_mime_map_init() when /etc/mime.types is missing
> (#11864, Ken Dreyer)
> * build/ops: upstart: limit respawn to 3 in 30 mins (instead of 5 in 30s)
> (#11798, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#10927, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#11140, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#11686, Sage Weil)
> * build/ops: With root as default user, unable to have multiple RGW
> instances running (#12407, Sage Weil)
> * cli: ceph: cli throws exception on unrecognized errno (#11354, Kefu Chai)
> * cli: ceph tell: broken error message / misleading hinting (#11101, Kefu
> Chai)
> * common: arm: all programs that link to librados2 hang forever on startup
> (#12505, Boris Ranto)
> * common: buffer: critical bufferlist::zero bug (#12252, Haomai Wang)
> * common: ceph-object-corpus: add 0.94.2-207-g88e7ee7 hammer objects
> (#13070, Sage Weil)
> * common: do not insert emtpy ptr when rebuild emtpy bufferlist (#12775,
> Xinze Chi)
> * common: [  FAILED  ] TestLibRBD.BlockingAIO (#12479, Jason Dillaman)
> * common: LibCephFS.GetPoolId failure (#12598, Yan, Zheng)
> * common: Memory leak in Mutex.cc, pthread_mutexattr_init without
> pthread_mutexattr_destroy (#11762, Ketor Meng)
> * common: object_map_update fails with -EINVAL return code (#12611, Jason
> Dillaman)
> * common: Pipe: Drop connect_seq increase line (#13093, Haomai Wang)
> * common: recursive lock of md_config_t (0) (#12614, Josh Durgin)
> * crush: ceph osd crush reweight-subtree does not reweight parent node
> (#11855, Sage Weil)
> * doc: update docs to point to download.ceph.com (#13162, Alfredo Deza)
> * fs: ceph-fuse 0.94.2-1trusty segfaults / aborts (#12297, Greg Farnum)
> * fs: segfault launching ceph-fuse with bad --name (#12417, John Spray)
> * librados: Change radosgw pools default crush ruleset (#11640, Yuan Zhou)
> * librbd: correct issues discovered via lockdep / helgrind (#12345, Jason
> Dillaman)
> * librbd: Crash during TestInternal.MultipleResize (#12664, Jason Dillaman)
> * librbd: deadlock during cooperative exclusive lock transition (#11537,
> Jason Dillaman)
> * librbd: Possible crash while concurrently writing and shrinking an image
> (#11743, Jason Dillaman)
> * mon: add a cache layer over MonitorDBStore (#12638, Kefu Chai)
> * mon: fix crush testing for new pools (#13400, Sage Weil)
> * mon: get pools health'info have error (#12402, 

[ceph-users] Error after upgrading to Infernalis

2015-10-16 Thread German Anders
Hi all,

I'm trying to upgrade a ceph cluster (previously on the hammer 0.94.3
release) to the latest release of *infernalis* (9.1.0-61-gf2b9f89). So far
so good while upgrading the mon servers; all worked fine. But when trying
to upgrade the OSD servers, I got an error while starting the osd services
again.

What I did was first upgrade the packages, then stop the osd daemons, run
the chown -R ceph:ceph /var/lib/ceph command, and then try to start all the
daemons again. Well, they are not coming back, and the error on one of the
OSDs is the following:

(...)
5 10:21:05.910850
os/FileStore.cc: 1698: FAILED assert(r == 0)

 ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x7fec7b74489b]
 2: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
 3: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
 4: (OSD::init()+0x269) [0x7fec7b1bc2f9]
 5: (main()+0x2817) [0x7fec7b142bb7]
 6: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
 7: (()+0x30a9e7) [0x7fec7b1729e7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

any ideas? find the complete log output:

http://pastebin.com/raw.php?i=zVABvJX3


Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] error while upgrading to infernalis last release on OSD serv

2015-10-15 Thread German Anders
Hi all,

I'm trying to upgrade a ceph cluster (previously on a hammer release) to
the latest release of infernalis. So far so good while upgrading the mon
servers; all worked fine. But when trying to upgrade the OSD servers, I got
an error while starting the osd services again.

What I did was first upgrade the packages, then stop the osd daemons, run
the chown -R ceph:ceph /var/lib/ceph command, and then try to start all the
daemons again. Well, they are not coming back, and the error on one of the
OSDs is the following:

(...)
5 10:21:05.910850
os/FileStore.cc: 1698: FAILED assert(r == 0)

 ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x8b) [0x7fec7b74489b]
 2: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
 3: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
 4: (OSD::init()+0x269) [0x7fec7b1bc2f9]
 5: (main()+0x2817) [0x7fec7b142bb7]
 6: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
 7: (()+0x30a9e7) [0x7fec7b1729e7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
  20/20 filestore
   1/ 3 keyvaluestore
  20/20 journal
   1/ 1 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---
2015-10-15 10:21:05.923314 7fec7bc4f980 -1 *** Caught signal (Aborted) **
 in thread 7fec7bc4f980

 ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
 1: (()+0x7f031a) [0x7fec7b65831a]
 2: (()+0x10340) [0x7fec79b02340]
 3: (gsignal()+0x39) [0x7fec77d7dcc9]
 4: (abort()+0x148) [0x7fec77d810d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec78688535]
 6: (()+0x5e6d6) [0x7fec786866d6]
 7: (()+0x5e703) [0x7fec78686703]
 8: (()+0x5e922) [0x7fec78686922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0x7fec7b744a88]
 10: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
 11: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
 12: (OSD::init()+0x269) [0x7fec7b1bc2f9]
 13: (main()+0x2817) [0x7fec7b142bb7]
 14: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
 15: (()+0x30a9e7) [0x7fec7b1729e7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- begin dump of recent events ---
 0> 2015-10-15 10:21:05.923314 7fec7bc4f980 -1 *** Caught signal
(Aborted) **
 in thread 7fec7bc4f980

 ceph version 9.1.0-61-gf2b9f89 (f2b9f898074db6473d993436e6aa566a945e3b40)
 1: (()+0x7f031a) [0x7fec7b65831a]
 2: (()+0x10340) [0x7fec79b02340]
 3: (gsignal()+0x39) [0x7fec77d7dcc9]
 4: (abort()+0x148) [0x7fec77d810d8]
 5: (__gnu_cxx::__verbose_terminate_handler()+0x155) [0x7fec78688535]
 6: (()+0x5e6d6) [0x7fec786866d6]
 7: (()+0x5e703) [0x7fec78686703]
 8: (()+0x5e922) [0x7fec78686922]
 9: (ceph::__ceph_assert_fail(char const*, char const*, int, char
const*)+0x278) [0x7fec7b744a88]
 10: (FileStore::init_temp_collections()+0xb2d) [0x7fec7b40ea9d]
 11: (FileStore::mount()+0x33bb) [0x7fec7b41206b]
 12: (OSD::init()+0x269) [0x7fec7b1bc2f9]
 13: (main()+0x2817) [0x7fec7b142bb7]
 14: (__libc_start_main()+0xf5) [0x7fec77d68ec5]
 15: (()+0x30a9e7) [0x7fec7b1729e7]
 NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed
to interpret this.

--- logging levels ---
   0/ 5 none
   0/ 1 lockdep
   0/ 1 context
   1/ 1 crush
   1/ 5 mds
   1/ 5 mds_balancer
   1/ 5 mds_locker
   1/ 5 mds_log
   1/ 5 mds_log_expire
   1/ 5 mds_migrator
   0/ 1 buffer
   0/ 1 timer
   0/ 1 filer
   0/ 1 striper
   0/ 1 objecter
   0/ 5 rados
   0/ 5 rbd
   0/ 5 rbd_replay
   0/ 5 journaler
   0/ 5 objectcacher
   0/ 5 client
  20/20 osd
   0/ 5 optracker
   0/ 5 objclass
  20/20 filestore
   1/ 3 keyvaluestore
  20/20 journal
   1/ 1 ms
   1/ 5 mon
   0/10 monc
   1/ 5 paxos
   0/ 5 tp
   1/ 5 auth
   1/ 5 crypto
   1/ 1 finisher
   1/ 5 heartbeatmap
   1/ 5 perfcounter
   1/ 5 rgw
   1/10 civetweb
   1/ 5 javaclient
   1/ 5 asok
   1/ 1 throttle
   0/ 0 refs
   1/ 5 xio
   1/ 5 compressor
   1/ 5 newstore
  -2/-2 (syslog threshold)
  -1/-1 (stderr threshold)
  max_recent 1
  max_new 1000
  log_file /var/log/ceph/ceph-osd.0.log
--- end dump of recent events ---

any ideas?

Thanks in advance,

cheers,

*German* 

[ceph-users] Proc for Impl XIO mess with Infernalis

2015-10-14 Thread German Anders
Hi all,

I would like to know whether, with this new release of Infernalis, there is
a procedure somewhere for implementing the xio messenger with IB and ceph,
and also whether it's possible to change an existing ceph cluster over to
this kind of setup (the existing cluster does not hold any production data
yet).

Thanks in advance,

Cheers,

*German* 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Fwd: Proc for Impl XIO mess with Infernalis

2015-10-14 Thread German Anders
Let me be more specific about what I need in order to move forward with
this kind of install:

setup:

3 mon servers
8 osd servers (4 with SAS disks and SSD journals at a 1:3 ratio, and 4 with
SSD disks, with osd & journal on the same disk)


running ceph version 0.94.3

I've already installed and tested *fio*, *accelio* and *rdma* on all the
nodes (mon + osd).

The ceph cluster is already set up and running (with no production data on
it). What I want to do is change this cluster over to support the XIO
messenger with RDMA.

a couple of questions...

First, is this a valid option? Is it possible to change it without re-doing
the whole cluster?
Second, is there anyone around who has already done this?
Third, I know that it's not production ready, but if someone has a
procedure for turning an existing cluster into one with rdma/xio support,
it would be really appreciated.
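
From what I've pieced together from the dev threads so far, the switch
itself seems to come down to running packages built with xio support
(--enable-xio) and then pointing every daemon and client at the xio
messenger in ceph.conf, roughly:

[global]
ms_type = xio

and restarting everything, so in principle no data migration would be
involved. I haven't confirmed this on a live cluster though, so please
treat it as a sketch rather than a procedure.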

thanks in advance,

cheers,

*German* <gand...@despegar.com>

-- Forwarded message ------
From: German Anders <gand...@despegar.com>
Date: 2015-10-14 12:46 GMT-03:00
Subject: Proc for Impl XIO mess with Infernalis
To: ceph-users <ceph-users@lists.ceph.com>


Hi all,

I would like to know whether, with this new release of Infernalis, there is
a procedure somewhere for implementing the xio messenger with IB and ceph,
and also whether it's possible to change an existing ceph cluster over to
this kind of setup (the existing cluster does not hold any production data
yet).

Thanks in advance,

Cheers,

*German* <gand...@despegar.com>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-deploy prepare btrfs osd error

2015-09-07 Thread German Anders
Thanks a lot Simon, this helped me resolve the issue; it was the bug
that you mentioned.

Best regards,

*German*

2015-09-07 5:34 GMT-03:00 Simon Hallam <s...@pml.ac.uk>:

> Hi German,
>
>
>
> This is what I’m running to redo an OSD as btrfs (not sure if this is the
> exact error you’re getting):
>
>
>
> DISK_LETTER=( a b c d e f g h i j k l )
>
>
>
> i=0
>
>
>
> for OSD_NUM in {12..23}; do
>
> sudo /etc/init.d/ceph stop osd.${OSD_NUM}
>
> sudo umount /var/lib/ceph/osd/ceph-${OSD_NUM}
>
> sudo ceph auth del osd.${OSD_NUM}
>
> sudo ceph osd crush remove osd.${OSD_NUM}
>
> sudo ceph osd rm ${OSD_NUM}
>
>
>
> # recreate again
>
> sudo wipefs /dev/sd${DISK_LETTER[$i]}1
>
> sudo dd if=/dev/zero of=/dev/sd${DISK_LETTER[$i]}1 bs=4k count=1
>
> sudo sgdisk --zap-all --clear -g /dev/sd${DISK_LETTER[$i]}
>
> sudo kpartx -dug /dev/sd${DISK_LETTER[$i]}
>
> sudo partprobe /dev/sd${DISK_LETTER[$i]}
>
> sudo dd if=/dev/zero of=/dev/sd${DISK_LETTER[$i]} bs=4k count=1
>
> sudo ceph-disk zap /dev/sd${DISK_LETTER[$i]}
>
> echo ""
>
> echo "ceph-deploy --overwrite-conf disk prepare --fs-type btrfs
> ceph2:sd${DISK_LETTER[$i]}"
>
> echo ""
>
> read -p "Press [Enter] key to continue next disk after running the above
> command on ceph1"
>
> i=$((i + 1))
>
> done
>
>
>
> There appears to be an issue with zap not wiping the partitions correctly.
> http://tracker.ceph.com/issues/6258
>
>
>
> Yours seems slightly different though. Curious, what size disk are you
> trying to use?
>
>
>
> Cheers,
>
>
>
> Simon
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 04 September 2015 19:53
> *To:* ceph-users
> *Subject:* [ceph-users] ceph-deploy prepare btrfs osd error
>
>
>
> Any ideas?
>
> ceph@cephdeploy01:~/ceph-ib$ ceph-deploy osd prepare --fs-type btrfs
> cibosd04:sdc
> [ceph_deploy.conf][DEBUG ] found configuration file at:
> /home/ceph/.cephdeploy.conf
> [ceph_deploy.cli][INFO  ] Invoked (1.5.28): /usr/bin/ceph-deploy osd
> prepare --fs-type btrfs cibosd04:sdc
> [ceph_deploy.cli][INFO  ] ceph-deploy options:
> [ceph_deploy.cli][INFO  ]  username  : None
> [ceph_deploy.cli][INFO  ]  disk  : [('cibosd04',
> '/dev/sdc', None)]
> [ceph_deploy.cli][INFO  ]  dmcrypt   : False
> [ceph_deploy.cli][INFO  ]  verbose   : False
> [ceph_deploy.cli][INFO  ]  overwrite_conf: False
> [ceph_deploy.cli][INFO  ]  subcommand: prepare
> [ceph_deploy.cli][INFO  ]  dmcrypt_key_dir   :
> /etc/ceph/dmcrypt-keys
> [ceph_deploy.cli][INFO  ]  quiet : False
> [ceph_deploy.cli][INFO  ]  cd_conf   :
> 
> [ceph_deploy.cli][INFO  ]  cluster   : ceph
> [ceph_deploy.cli][INFO  ]  fs_type   : btrfs
> [ceph_deploy.cli][INFO  ]  func  : <function osd at 0x7faf71576938>
> [ceph_deploy.cli][INFO  ]  ceph_conf : None
> [ceph_deploy.cli][INFO  ]  default_release   : False
> [ceph_deploy.cli][INFO  ]  zap_disk  : False
> [ceph_deploy.osd][DEBUG ] Preparing cluster ceph disks cibosd04:/dev/sdc:
> [cibosd04][DEBUG ] connection detected need for sudo
> [cibosd04][DEBUG ] connected to host: cibosd04
> [cibosd04][DEBUG ] detect platform information from remote host
> [cibosd04][DEBUG ] detect machine type
> [cibosd04][DEBUG ] find the location of an executable
> [ceph_deploy.osd][INFO  ] Distro info: Ubuntu 14.04 trusty
> [ceph_deploy.osd][DEBUG ] Deploying osd to cibosd04
> [cibosd04][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
> [cibosd04][INFO  ] Running command: sudo udevadm trigger
> --subsystem-match=block --action=add
> [ceph_deploy.osd][DEBUG ] Preparing host cibosd04 disk /dev/sdc journal
> None activate False
> [cibosd04][INFO  ] Running command: sudo ceph-disk -v prepare --cluster
> ceph --fs-type btrfs -- /dev/sdc
> [cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
> --cluster=ceph --show-config-value=fsid
> [cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
> [cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
> [cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
> --cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
> [cibosd04][WARN

Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-05 Thread German Anders
Hi Christian,

OK, so you would say it's better to rearrange the nodes so I don't mix
the HDD and SSD disks, right? And create high-performance nodes with SSDs
and others with HDDs; that's fine, since it's a new deployment.
   Also, the nodes have different CPUs and RAM: 4 have more CPU and more
memory (384GB), and the other 3 have less CPU and 128GB of RAM, so maybe I
can put the SSDs in the nodes with much more CPU and leave the HDDs for the
other nodes. The network is going to be InfiniBand FDR at 56Gb/s on all the
nodes, for both the public and cluster networks.
   Any other suggestions/comments?

Thanks a lot!

Best regards

German


On Saturday, September 5, 2015, Christian Balzer <ch...@gol.com> wrote:

>
> Hello,
>
> On Fri, 4 Sep 2015 12:30:12 -0300 German Anders wrote:
>
> > Hi cephers,
> >
> >I've the following scheme:
> >
> > 7x OSD servers with:
> >
> Is this a new cluster, total initial deployment?
>
> What else are these nodes made of, CPU/RAM/network?
> While uniform nodes have some appeal (interchangeability, one node down
> does impact the cluster uniformly) they tend to be compromise solutions.
> I personally would go with optimized HDD and SSD nodes.
>
> > 4x 800GB SSD Intel DC S3510 (OSD-SSD)
> Only 0.3DWPD, 450TB total in 5 years.
> If you can correctly predict your write volume and it is below that per
> SSD, fine. I'd use 3610s, with internal journals.
>
> > 3x 120GB SSD Intel DC S3500 (Journals)
> In this case even more so the S3500 is a bad choice. 3x 135MB/s is
> nowhere near your likely network speed of 10Gb/s.
>
> You will get vastly superior performance and endurance with two 200GB
> S3610 (2x 230MB/s) or S3700 (2x 365MB/s)
>
> Why the uneven number of journals SSDs?
> You want uniform utilization, wear. 2 journal SSDs for 6 HDDs would be a
> good ratio.
>
> > 5x 3TB SAS disks (OSD-SAS)
> >
> See above, even numbers make a lot more sense.
>
> >
> > The OSD servers are located on two separate Racks with two power circuits
> > each.
> >
> I would like to know what is the best way to implement this: use the
> 4x 800GB SSDs as an SSD pool, or use them as a cache pool? Or any other
> suggestion? Also, any advice for the crush design?
> >
> Nick touched on that already, for right now SSD pools would be definitely
> better.
>
> Christian
> --
> Christian BalzerNetwork/Systems Engineer
> ch...@gol.com <javascript:;>Global OnLine Japan/Fusion
> Communications
> http://www.gol.com/
>


-- 

*German Anders*
Storage Engineer Manager
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread German Anders
Hi cephers,

   I've the following scheme:

7x OSD servers with:
4x 800GB SSD Intel DC S3510 (OSD-SSD)
3x 120GB SSD Intel DC S3500 (Journals)
5x 3TB SAS disks (OSD-SAS)

The OSD servers are located on two separate Racks with two power circuits
each.

   I would like to know what is the best way to implement this: use the 4x
800GB SSDs as an SSD pool, or use them as a cache pool? Or any other
suggestion? Also, any advice for the crush design?

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph osd prepare btrfs

2015-09-04 Thread German Anders
Trying to do a prepare on an osd with btrfs, and getting this error:

[cibosd04][INFO  ] Running command: sudo ceph-disk -v prepare --cluster
ceph --fs-type btrfs -- /dev/sdc
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=fsid
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mkfs_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mkfs_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_mount_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_fs_mount_options_btrfs
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-osd
--cluster=ceph --show-config-value=osd_journal_size
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_cryptsetup_parameters
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_key_size
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /usr/bin/ceph-conf
--cluster=ceph --name=osd. --lookup osd_dmcrypt_type
[cibosd04][WARNIN] INFO:ceph-disk:Will colocate journal with data on
/dev/sdc
[cibosd04][WARNIN] DEBUG:ceph-disk:Creating journal partition num 2 size
5120 on /dev/sdc
[cibosd04][WARNIN] INFO:ceph-disk:Running command: /sbin/sgdisk
--new=2:0:5120M --change-name=2:ceph journal
--partition-guid=2:2d7cd194-6185-4515-ae32-40b88524d03a
--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106 --mbrtogpt -- /dev/sdc
[cibosd04][WARNIN] *Invalid partition data!*
[cibosd04][WARNIN] ceph-disk: Error: Command '['/sbin/sgdisk',
'--new=2:0:5120M', '--change-name=2:ceph journal',
'--partition-guid=2:2d7cd194-6185-4515-ae32-40b88524d03a',
'--typecode=2:45b0969e-9b03-4f30-b4c6-b4b80ceff106', '--mbrtogpt', '--',
'/dev/sdc']' returned non-zero exit status 2
[cibosd04][ERROR ] *RuntimeError: command returned non-zero exit status: 1*
[ceph_deploy.osd][ERROR ] Failed to execute command: ceph-disk -v prepare
--cluster ceph --fs-type btrfs -- /dev/sdc
[ceph_deploy][ERROR ] *GenericError: Failed to create 1 OSDs*


I tried to format the device, but with no luck. Any ideas?

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best layout for SSD & SAS OSDs

2015-09-04 Thread German Anders
Thanks a lot Nick. Regarding the power feeds, we only have two circuits for
all the racks, so I'll create a "rack" bucket in the crushmap and separate
the OSD servers across the rack buckets. Regarding the SSD pools, I've
installed the hammer version, and I'm wondering whether to upgrade to
Infernalis v9.0.3 and apply the SSD cache, or stay on Hammer, build the SSD
pools, and maybe leave two 800GB SSDs (1.6TB per OSD server) for later use
as cache. Do you have a crushmap example for this type of config?
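
What I've sketched so far, extrapolating from our current map (hostnames
and weights below are placeholders, and each physical server appears twice,
once as an "-ssd" host bucket and once as a "-sas" one), is roughly:

root ssd {
        id -10          # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item cibosd01-ssd weight 2.912
        item cibosd02-ssd weight 2.912
}
root sas {
        id -11          # do not change unnecessarily
        alg straw
        hash 0          # rjenkins1
        item cibosd01-sas weight 13.650
        item cibosd02-sas weight 13.650
}
rule ssd_pool {
        ruleset 1
        type replicated
        min_size 1
        max_size 10
        step take ssd
        step chooseleaf firstn 0 type host
        step emit
}
rule sas_pool {
        ruleset 2
        type replicated
        min_size 1
        max_size 10
        step take sas
        step chooseleaf firstn 0 type host
        step emit
}

and then create each pool with the matching ruleset (ceph osd pool set
<pool> crush_ruleset 1). Does that look sane?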

Thanks a lot,

Best regards,


*German*

2015-09-04 13:10 GMT-03:00 Nick Fisk <n...@fisk.me.uk>:

> Hi German,
>
>
>
> Are the power feeds completely separate (ie 4 feeds in total), or just
> each rack has both feeds? If it’s the latter I don’t see any benefit from
> including this into the crushmap and would just create a “rack” bucket.
> Also assuming your servers have dual PSU’s, this also changes the power
> failure scenarios quite a bit as well.
>
>
>
> In regards to the pools, unless you know your workload will easily fit
> into a cache pool with room to spare, I would suggest not going down that
> route currently. Performance in many cases can actually end up being worse
> if you end up doing a lot of promotions.
>
>
>
> **However** I’ve been doing a bit of testing with the current master and
> there are a lot of improvements around cache tiering that are starting to
> have a massive improvement on performance. If you can get by with just the
> SAS disks for now and make a more informed decision about the cache tiering
> when Infernalis is released then that might be your best bet.
>
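> For reference, the basic cache-tier wiring itself is only a few commands
> (pool names here are just examples):
>
> ceph osd tier add sas-pool ssd-cache
> ceph osd tier cache-mode ssd-cache writeback
> ceph osd tier set-overlay sas-pool ssd-cache
>
> It is the promotion and flushing behaviour under real load that needs the
> careful testing.
>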
>
>
> Otherwise you might just be best using them as a basic SSD only Pool.
>
>
>
> Nick
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* 04 September 2015 16:30
> *To:* ceph-users <ceph-users@lists.ceph.com>
> *Subject:* [ceph-users] Best layout for SSD & SAS OSDs
>
>
>
> Hi cephers,
>
>I've the following scheme:
>
> 7x OSD servers with:
>
> 4x 800GB SSD Intel DC S3510 (OSD-SSD)
>
> 3x 120GB SSD Intel DC S3500 (Journals)
>
> 5x 3TB SAS disks (OSD-SAS)
>
> The OSD servers are located on two separate Racks with two power circuits
> each.
>
> I would like to know what is the best way to implement this: use the
> 4x 800GB SSDs as an SSD pool, or use them as a cache pool? Or any other
> suggestion? Also, any advice for the crush design?
>
> Thanks in advance,
>
>
> *German*
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph new mon deploy v9.0.3-1355

2015-09-02 Thread German Anders
Hi cephers, I'm trying to deploy a new ceph cluster with the master release
(v9.0.3), and when trying to create the initial mons an error appears
saying "admin_socket: exception getting command descriptions: [Errno
2] No such file or directory". Find the log below:


...
[ceph_deploy.mon][INFO  ] distro info: Ubuntu 14.04 trusty
[cibmon01][DEBUG ] determining if provided host has same hostname in remote
[cibmon01][DEBUG ] get remote short hostname
[cibmon01][DEBUG ] deploying mon to cibmon01
[cibmon01][DEBUG ] get remote short hostname
[cibmon01][DEBUG ] remote hostname: cibmon01
[cibmon01][DEBUG ] write cluster configuration to /etc/ceph/{cluster}.conf
[cibmon01][DEBUG ] create the mon path if it does not exist
[cibmon01][DEBUG ] checking for done path:
/var/lib/ceph/mon/ceph-cibmon01/done
[cibmon01][DEBUG ] done path does not exist:
/var/lib/ceph/mon/ceph-cibmon01/done
[cibmon01][INFO  ] creating keyring file:
/var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][DEBUG ] create the monitor keyring file
[cibmon01][INFO  ] Running command: ceph-mon --cluster ceph --mkfs -i
cibmon01 --keyring /var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][WARNING] libust[16029/16029]: Warning: HOME environment variable
not set. Disabling LTTng-UST per-user tracing. (in setup_local_apps() at
lttng-ust
-comm.c:305)
[cibmon01][DEBUG ] ceph-mon: set fsid to xx----x
[cibmon01][DEBUG ] ceph-mon: created monfs at
/var/lib/ceph/mon/ceph-cibmon01 for mon.cibmon01
[cibmon01][INFO  ] unlinking keyring file
/var/lib/ceph/tmp/ceph-cibmon01.mon.keyring
[cibmon01][DEBUG ] create a done file to avoid re-doing the mon deployment
[cibmon01][DEBUG ] create the init path if it does not exist
[cibmon01][DEBUG ] locating the `service` executable...
[cibmon01][INFO  ] Running command: initctl emit ceph-mon cluster=ceph
id=cibmon01
[cibmon01][INFO  ] Running command: ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.cibmon01.asok mon_status
*[cibmon01][ERROR ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory*
[cibmon01][WARNING] monitor: mon.cibmon01, might not be running yet
[cibmon01][INFO  ] Running command: ceph --cluster=ceph --admin-daemon
/var/run/ceph/ceph-mon.cibmon01.asok mon_status
*[cibmon01][ERROR ] admin_socket: exception getting command descriptions:
[Errno 2] No such file or directory*
[cibmon01][WARNING] monitor cibmon01 does not exist in monmap
[ceph_deploy.mon][DEBUG ] detecting platform for host cibmon02 ...
[cibmon02][DEBUG ] connected to host: cibmon02
...

Any ideas?
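
The error itself seems to just mean the mon process never came up, so the
obvious next checks (this is upstart on trusty; exact command forms
assumed) are whether the process is running and what the mon log says:

$ ps aux | grep ceph-mon
$ sudo status ceph-mon cluster=ceph id=cibmon01
$ tail -n 50 /var/log/ceph/ceph-mon.cibmon01.log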

Thanks in advance,

Regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi Roy,

   I understand; we are looking at using accelio with a small starting
cluster of 3 mon and 8 osd servers:

3x MON servers
   2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
   24x 16GB DIMM DDR3 1333Mhz (384GB)
   2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

4x OSD servers
   2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
   8x 16GB DIMM DDR3 1333Mhz (128GB)
   2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
   3x 120GB Intel SSD DC SC3500 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

4x OSD servers
   2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
   8x 16GB DIMM DDR3 1866Mhz (128GB)
   2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
   3x 200GB Intel SSD DC S3700 (Journals)
   4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
   5x 3TB SAS (OSD-SAS-POOL)
   1x ConnectX-3 VPI FDR 56Gb/s ADPT DP

and we are thinking of using the *infernalis v9.0.0* or *hammer* release.
Comments? Recommendations?


*German*

2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:

> Hi German,
>
> We are working to make it production ready ASAP. As you know, RDMA is
> very resource constrained but at the same time will outperform TCP.
> There will be some definite tradeoff between cost and performance.
>
> We are lacking ideas on how big the RDMA deployment could be, and it
> would be really helpful if you could give some idea of how you are planning
> to deploy it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication,
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response, Robert. Any idea when it's going to be
> ready for production? Any alternative solution with similar performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:
>
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the production-readiness status of Accelio & Ceph. Does
> anyone have a home-made procedure implemented on Ubuntu?
>
>
>
> recommendations, comments?
>
>
>
> Thanks in advance,
>
>
>
> Best regards,
>
>
>
> German
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
>
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Thanks Roy, we're planning to grow this cluster if we can get the
performance that we need. The idea is to run non-relational databases here,
so it would be IO-intensive. We are talking growth on the order of
40-50 OSD servers with no more than 6 OSD daemons per server. If you have
some hints or docs out there on how to compile ceph with accelio, that would
be awesome.
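
For reference, a minimal sketch of the kind of build we have in mind (the
configure flag and the ms_type option are from memory for the xio messenger
of that era, so treat them as assumptions and verify against the docs):

   git clone https://github.com/ceph/ceph.git && cd ceph
   ./autogen.sh
   ./configure --enable-xio
   make -j$(nproc)

   # then, in ceph.conf on every daemon and client, switch the messenger:
   [global]
   ms_type = xio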


*German*

2015-09-01 15:31 GMT-03:00 Somnath Roy <somnath@sandisk.com>:

> Thanks !
>
> I think you should try installing from ceph mainstream. Some bug fixes
> went in after Hammer (not sure if they were backported).
>
> I would say try with 1 drive -> 1 OSD first, since presently we have seen
> some stability issues (mainly due to resource constraints) with more OSDs in
> a box.
>
> The other point is that the installation itself is not straightforward. You
> probably need to build all the components; not sure if it is added as a git
> submodule or not. Vu, could you please confirm?
>
>
>
> Since we are working to make this solution work at scale, could you please
> give us some idea of the scale you are looking at for future
> deployment?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:19 AM
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
>
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Hi Roy,
>
>I understand, we are looking at using Accelio with a small starting
> cluster of 3 MON and 8 OSD servers:
>
> 3x MON servers
>
>2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
>
>24x 16GB DIMM DDR3 1333Mhz (384GB)
>
>2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
>
>8x 16GB DIMM DDR3 1333Mhz (128GB)
>
>2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
>
>3x 120GB Intel SSD DC SC3500 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
>
>8x 16GB DIMM DDR3 1866Mhz (128GB)
>
>2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
>
>3x 200GB Intel SSD DC S3700 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> and we are thinking of using the *infernalis v9.0.0* or *hammer* release.
> Comments? Recommendations?
>
>
> *German*
>
>
>
> 2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
>
> Hi German,
>
> We are working to make it production ready ASAP. As you know, RDMA is
> very resource constrained but at the same time will outperform TCP.
> There will be some definite tradeoff between cost and performance.
>
> We are lacking ideas on how big the RDMA deployment could be, and it
> would be really helpful if you could give some idea of how you are planning
> to deploy it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication,
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
> *To:* Robert LeBlanc
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot for the quick response, Robert. Any idea when it's going to be
> ready for production? Any alternative solution with similar performance?
>
> Best regards,
>
>
> *German *
>
>
>
> 2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:
>
>
>
>
> Accelio and Ceph are still in heavy development and not ready for production.
>
>
>
> - 
>
> Robert LeBlanc
>
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
>
> Hi cephers,
>
>
>
>  I would like to know the production-readiness status of Accelio & Ceph. Does
> anyone have a home-made procedure implemented on Ubuntu?
>
>
>
> recommendations, comments?
>
>
>
> Thanks in advance,
>
>
>
> Best regards,
>
>
>
> German
>
>
>
> ___
>
> ceph-users mailing list
>
> ceph-users@lists.ceph.com
>
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>

Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Thanks a lot for the quick response, Robert. Any idea when it's going to be
ready for production? Any alternative solution with similar performance?

Best regards,


*German *

2015-09-01 13:42 GMT-03:00 Robert LeBlanc <rob...@leblancnet.us>:

>
> Accelio and Ceph are still in heavy development and not ready for production.
>
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
> On Tue, Sep 1, 2015 at 10:31 AM, German Anders  wrote:
> Hi cephers,
>
>  I would like to know the production-readiness status of Accelio & Ceph. Does
> anyone have a home-made procedure implemented on Ubuntu?
>
> recommendations, comments?
>
> Thanks in advance,
>
> Best regards,
>
> German
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi Vu,
   Thanks a lot for the link

Best regards,

*German*

2015-09-01 19:02 GMT-03:00 Vu Pham <vuhu...@mellanox.com>:

> Hi German,
>
>
>
> You can try this small wiki to setup ceph/accelio
>
>
>
> https://community.mellanox.com/docs/DOC-2141
>
>
>
> thanks,
>
> -vu
>
>
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 12:00 PM
> *To:* Somnath Roy
>
> *Cc:* ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks a lot guys, I'll configure the cluster and send you some feedback
> once we test it
>
> Best regards,
>
>
> *German*
>
>
>
> 2015-09-01 15:38 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
>
> Thanks !
>
> 6 OSD daemons per server should be good.
>
>
>
> Vu,
>
> Could you please send out the doc you are maintaining ?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:36 AM
>
>
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Thanks Roy, we're planning to grow this cluster if we can get the
> performance that we need. The idea is to run non-relational databases here,
> so it would be IO-intensive. We are talking growth on the order of
> 40-50 OSD servers with no more than 6 OSD daemons per server. If you have
> some hints or docs out there on how to compile ceph with accelio, that would
> be awesome.
>
>
> *German*
>
>
>
> 2015-09-01 15:31 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
>
> Thanks !
>
> I think you should try installing from ceph mainstream. Some bug fixes
> went in after Hammer (not sure if they were backported).
>
> I would say try with 1 drive -> 1 OSD first, since presently we have seen
> some stability issues (mainly due to resource constraints) with more OSDs in
> a box.
>
> The other point is that the installation itself is not straightforward. You
> probably need to build all the components; not sure if it is added as a git
> submodule or not. Vu, could you please confirm?
>
>
>
> Since we are working to make this solution work at scale, could you please
> give us some idea of the scale you are looking at for future
> deployment?
>
>
>
> Regards
>
> Somnath
>
>
>
> *From:* German Anders [mailto:gand...@despegar.com]
> *Sent:* Tuesday, September 01, 2015 11:19 AM
> *To:* Somnath Roy
> *Cc:* Robert LeBlanc; ceph-users
>
>
> *Subject:* Re: [ceph-users] Accelio & Ceph
>
>
>
> Hi Roy,
>
>I understand, we are looking at using Accelio with a small starting
> cluster of 3 MON and 8 OSD servers:
>
> 3x MON servers
>
>2x Intel Xeon E5-2630v3 @2.40Ghz (32C with HT)
>
>24x 16GB DIMM DDR3 1333Mhz (384GB)
>
>2x 120GB Intel SSD DC S3500 (RAID-1 for OS)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2609v2 @2.50Ghz (8C)
>
>8x 16GB DIMM DDR3 1333Mhz (128GB)
>
>2x 120GB Intel SSD DC SC3500 (RAID-1 for OS)
>
>3x 120GB Intel SSD DC SC3500 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> 4x OSD servers
>
>2x Intel Xeon E5-2650v2 @2.60Ghz (32C with HT)
>
>8x 16GB DIMM DDR3 1866Mhz (128GB)
>
>2x 200GB Intel SSD DC S3700 (RAID-1 for OS)
>
>3x 200GB Intel SSD DC S3700 (Journals)
>
>4x 800GB Intel SSD DC SC3510 (OSD-SSD-POOL)
>
>5x 3TB SAS (OSD-SAS-POOL)
>
>1x ConnectX-3 VPI FDR 56Gb/s ADPT DP
>
> and we are thinking of using the *infernalis v9.0.0* or *hammer* release.
> Comments? Recommendations?
>
>
> *German*
>
>
>
> 2015-09-01 14:46 GMT-03:00 Somnath Roy <somnath@sandisk.com>:
>
> Hi German,
>
> We are working to make it production ready ASAP. As you know, RDMA is
> very resource constrained but at the same time will outperform TCP.
> There will be some definite tradeoff between cost and performance.
>
> We are lacking ideas on how big the RDMA deployment could be, and it
> would be really helpful if you could give some idea of how you are planning
> to deploy it (i.e. how many nodes/OSDs, SSDs or HDDs, EC or replication,
> etc.).
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
> *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
> Of *German Anders
> *Sent:* Tuesday, September 01, 2015 10:39 AM
>

[ceph-users] Accelio & Ceph

2015-09-01 Thread German Anders
Hi cephers,

 I would like to know the production-readiness status of Accelio & Ceph;
does anyone have a home-made procedure implemented on Ubuntu?

recommendations, comments?

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph version for productive clusters?

2015-08-31 Thread German Anders
Hi cephers,

   What's the recommended version for new production clusters?

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph version for productive clusters?

2015-08-31 Thread German Anders
Thanks a lot Kobi

*German*

2015-08-31 14:20 GMT-03:00 Kobi Laredo <kobi.lar...@dreamhost.com>:

> Hammer should be very stable at this point.
>
> *Kobi Laredo*
> *Cloud Systems Engineer* | (*408) 409-KOBI*
>
> On Mon, Aug 31, 2015 at 8:51 AM, German Anders <gand...@despegar.com>
> wrote:
>
>> Hi cephers,
>>
>>What's the recommended version for new production clusters?
>>
>> Thanks in advance,
>>
>> Best regards,
>>
>> *German*
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk/Pool Layout

2015-08-27 Thread German Anders
I already have the SC3510s... so I would stick with those :S... the S3700s
were a little bit high on price and the manager didn't approve that... not
my decision... unfortunately.

Also, I find it very hard to add another node at the moment, so maybe I can
start with 6 OSD servers instead of 8, leave those 2 extra servers aside, and
in a couple of months buy one more and add the three servers at a time.

I have the nodes installed in two separate racks, with two different circuits
and PDUs. I would need some info on CRUSH tuning so I could start building
the solution. Anyone?
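
A minimal sketch of the intermediate level Jan describes below (bucket names
hypothetical; the existing "chassis" type is reused as the 3-node group, and
the group weight is the sum of its hosts):

   chassis group1 {
           id -10          # do not change unnecessarily
           alg straw
           hash 0  # rjenkins1
           item cephosd01 weight 24.570
           item cephosd02 weight 24.570
           item cephosd03 weight 24.570
   }
   # define group2 and group3 the same way, hang them under the root,
   # then replicate across groups instead of plain hosts:
   rule replicated_by_group {
           ruleset 2
           type replicated
           min_size 1
           max_size 10
           step take default
           step chooseleaf firstn 0 type chassis
           step emit
   }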

Thanks a lot!

Best regards,


*German*

2015-08-27 13:40 GMT-03:00 Jan Schermer j...@schermer.cz:

 In that case you need fair IOPS and high throughput. Go with the S3610 or the
 Samsungs (or something else that people here can recommend, but for the
 love of god don't save on drives :)). It's easier to stick to one type of
 drive and not complicate things.

 I would also recommend you add one storage node and build the CRUSH map with
 an intermediate level for groups of 3 nodes. It makes maintenance much easier
 and the data more durable in case of failure. (Best to put the nodes in
 separate cages on different UPSes; then you can do stuff like disabling
 barriers if you go with some cheaper drives that need it.) I'm not a CRUSH
 expert; there are more tricks to do before you set this up.

 Jan

 On 27 Aug 2015, at 18:31, German Anders gand...@despegar.com wrote:

 Hi Jan,

   Thanks for responding to the email. Regarding the cluster usage, we are
 going to use it for non-relational databases (Cassandra, MongoDB) and
 other apps, so we need this cluster to respond well to IO-intensive apps.
 It's going to be connected to HP enclosures with IB FDR as well, and mapped
 through Cinder to mount it on VMs (KVM hypervisor); then on the VMs we are
 going to run the non-relational DBs.

 Thanks in advance,


 *German*

 2015-08-27 13:25 GMT-03:00 Jan Schermer j...@schermer.cz:

 Some comments inline.
 A lot of it depends on your workload, but I'd say you almost certainly
 need higher-grade SSDs. You can save money on memory.

 What will be the role of this cluster? VM disks? Object storage?
 Streaming?...

 Jan

 On 27 Aug 2015, at 17:56, German Anders gand...@despegar.com wrote:

 Hi all,

I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've
 the following HW:

 *3x MON Servers:*
2x Intel Xeon E5-2600@v3 8C

256GB RAM


 I don't think you need that much memory, 64GB should be plenty (if that's
 the only role for the servers).

1xIB FRD ADPT-DP (two ports for PUB network)
1xGB ADPT-DP

Disk Layout:

SOFT-RAID:
SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)


 I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
 ones (but they can be fairly small). Should be the same grade as journal
 drives IMO.
 NOT S3500!
 I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
 DWPD rating, better go with 3 DWPD.


 *8x OSD Servers:*
2x Intel Xeon E5-2600@v3 10C


 Go for the fastest you can afford if you need the latency - even at the
 expense of cores.
 Go for cores if you want bigger throughput.

256GB RAM


 Again - I think too much if that's the only role for those nodes, 64GB
 should be plenty.


1xIB FRD ADPT-DP (one port for PUB and one for CLUS network)
1xGB ADPT-DP

Disk Layout:

SOFT-RAID:
SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)

JBOD:
SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)


 No no no. Those SSDs will die a horrible death, too little endurance.
 Better go with 2x 3700 in RAID1 and partition them for journals. Or just
 don't use journaling drives and buy better SSDs for storage.

SCSI9 (0,3,0) (sdg) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,4,0) (sdh) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,5,0) (sdi) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,6,0) (sdj) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)


 Too little endurance.


SCSI9 (0,7,0) (sdk) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,8,0) (sdl) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,9,0) (sdm) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,10,0) (sdn) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,11,0) (sdo) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)


 I would like to have an expert opinion on what would be the best
 deploy/config disk pools and crush map? any other advice?

 Thanks in advance,

 Best regards,

 *German*
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http

[ceph-users] Disk/Pool Layout

2015-08-27 Thread German Anders
Hi all,

   I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s, and I have
the following HW:

*3x MON Servers:*
   2x Intel Xeon E5-2600@v3 8C
   256GB RAM
   1xIB FRD ADPT-DP (two ports for PUB network)
   1xGB ADPT-DP

   Disk Layout:

   SOFT-RAID:
   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)

*8x OSD Servers:*
   2x Intel Xeon E5-2600@v3 10C
   256GB RAM
   1xIB FRD ADPT-DP (one port for PUB and one for CLUS network)
   1xGB ADPT-DP

   Disk Layout:

   SOFT-RAID:
   SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
   SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)

   JBOD:
   SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
   SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
   SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
   SCSI9 (0,3,0) (sdg) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
   SCSI9 (0,4,0) (sdh) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
   SCSI9 (0,5,0) (sdi) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
   SCSI9 (0,6,0) (sdj) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)

   SCSI9 (0,7,0) (sdk) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
   SCSI9 (0,8,0) (sdl) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
   SCSI9 (0,9,0) (sdm) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
   SCSI9 (0,10,0) (sdn) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
   SCSI9 (0,11,0) (sdo) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)


I would like to have an expert opinion on what would be the best
deploy/config disk pools and crush map? any other advice?

Thanks in advance,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk/Pool Layout

2015-08-27 Thread German Anders
Hi Jan,

   Thanks for responding to the email. Regarding the cluster usage, we are
going to use it for non-relational databases (Cassandra, MongoDB) and
other apps, so we need this cluster to respond well to IO-intensive apps.
It's going to be connected to HP enclosures with IB FDR as well, and mapped
through Cinder to mount it on VMs (KVM hypervisor); then on the VMs we are
going to run the non-relational DBs.

Thanks in advance,


*German*

2015-08-27 13:25 GMT-03:00 Jan Schermer j...@schermer.cz:

 Some comments inline.
 A lot of it depends on your workload, but I'd say you almost certainly
 need higher-grade SSDs. You can save money on memory.

 What will be the role of this cluster? VM disks? Object storage?
 Streaming?...

 Jan

 On 27 Aug 2015, at 17:56, German Anders gand...@despegar.com wrote:

 Hi all,

I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've
 the following HW:

 *3x MON Servers:*
2x Intel Xeon E5-2600@v3 8C

256GB RAM


 I don't think you need that much memory, 64GB should be plenty (if that's
 the only role for the servers).

1xIB FRD ADPT-DP (two ports for PUB network)
1xGB ADPT-DP

Disk Layout:

SOFT-RAID:
SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)


 I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
 ones (but they can be fairly small). Should be the same grade as journal
 drives IMO.
 NOT S3500!
 I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
 DWPD rating, better go with 3 DWPD.


 *8x OSD Servers:*
2x Intel Xeon E5-2600@v3 10C


 Go for the fastest you can afford if you need the latency - even at the
 expense of cores.
 Go for cores if you want bigger throughput.

256GB RAM


 Again - I think too much if that's the only role for those nodes, 64GB
 should be plenty.


1xIB FRD ADPT-DP (one port for PUB and one for CLUS network)
1xGB ADPT-DP

Disk Layout:

SOFT-RAID:
SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)

JBOD:
SCSI9 (0,0,0) (sdd) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
SCSI9 (0,1,0) (sde) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)
SCSI9 (0,2,0) (sdf) - 120.0 GB ATA INTEL SC3500 SSDSC2BB12 (Journal)


 No no no. Those SSDs will die a horrible death, too little endurance.
 Better go with 2x 3700 in RAID1 and partition them for journals. Or just
 don't use journaling drives and buy better SSDs for storage.

SCSI9 (0,3,0) (sdg) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,4,0) (sdh) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,5,0) (sdi) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)
SCSI9 (0,6,0) (sdj) - 800.2 GB ATA INTEL SC3510 SSDSC2BB80 (Pool-SSD)


 Too little endurance.


SCSI9 (0,7,0) (sdk) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,8,0) (sdl) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,9,0) (sdm) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,10,0) (sdn) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)
SCSI9 (0,11,0) (sdo) - 3.0 TB SEAGATE ST3000NM0023 (Pool-SATA)


 I would like to have an expert opinion on what would be the best
 deploy/config disk pools and crush map? any other advice?

 Thanks in advance,

 Best regards,

 *German*
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Disk/Pool Layout

2015-08-27 Thread German Anders
Thanks a lot Robert and Jan for the comments about the available and
possible disk layouts. Is there any advice from the configuration point of
view? Any tunable parameters? CRUSH algorithm?
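
For what it's worth, the CRUSH tunables are usually switched per named
profile; a minimal example (note that changing tunables on a populated
cluster can trigger significant data movement):

   ceph osd crush tunables optimal    # or a named profile such as "firefly"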

Thanks a lot,

Best regards,

*German*

2015-08-27 16:37 GMT-03:00 Robert LeBlanc rob...@leblancnet.us:



 On Thu, Aug 27, 2015 at 1:13 PM, Jan Schermer  wrote:
 
  On 27 Aug 2015, at 20:57, Robert LeBlanc  wrote:
 
 
 
 
 
  On Thu, Aug 27, 2015 at 10:25 AM, Jan Schermer  wrote:
  Some comments inline.
  A lot of it depends on your workload, but I'd say you almost certainly 
  need
  higher-grade SSDs. You can save money on memory.
 
  What will be the role of this cluster? VM disks? Object storage?
  Streaming?...
 
  Jan
 
  On 27 Aug 2015, at 17:56, German Anders  wrote:
 
  Hi all,
 
I'm planning to deploy a new Ceph cluster with IB FDR 56Gb/s and I've 
  the
  following HW:
 
  3x MON Servers:
2x Intel Xeon E5-2600@v3 8C
 
  This is overkill if only a monitor server.
 
  Maybe with newer releases of Ceph, but my Mons spin CPU pretty high (100% 
  core, which means it doesn't scale that well with cores), and when 
  adding/removing OSDs or shuffling data some of the peering issues I've seen 
  were caused by lagging Mons.

 If I remember right, you have a fairly large cluster. This is a pretty small 
 cluster, so probably OK with less CPU. Are you running Dumpling? I haven't 
 seen many issues with Hammer.

 
 
 
256GB RAM
 
 
  I don't think you need that much memory, 64GB should be plenty (if that's
  the only role for the servers).
 
 
  If it is only monitor, you can get by with even less.
 
 
1xIB FRD ADPT-DP (two ports for PUB network)
1xGB ADPT-DP
 
Disk Layout:
 
SOFT-RAID:
SCSI1 (0,0,0) (sda) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
SCSI2 (0,0,0) (sdb) - 120.0 GB ATA INTEL SSDSC2BB12 (OS-RAID1)
 
 
  I 100% recommend going with SSDs for the /var/lib/ceph/mon storage, fast
  ones (but they can be fairly small). Should be the same grade as journal
  drives IMO.
  NOT S3500!
  I can recommend S3610 (just got some :)), Samsung 845 DC PRO. At least 1
  DWPD rating, better go with 3 DWPD.
 
  S3500 should be just fine here. I get 25% better performance on the
  S3500 vs the S3700 doing sync direct writes. Write endurance should be
  just fine as the volume of data is not going to be that great. Unless
  there is something else I'm not aware of.
 
 
  S3500 is faster than S3700? I can compare 3700 x 3510 x 3610 tomorrow but 
  I'd be very surprised if the S3500 had a _sustained_ throughput better than 
  36xx or 37xx. Were you comparing that on the same HBA and in the same way? 
  (No offense, just curious)

 None taken. I used the same box and swapped out the drives. The only 
 difference was the S3500 has been heavily used, the 3700 was fresh from the 
 package (if anything that should have helped the S3700).

 for i in {1..8}; do fio --filename=/dev/sda --direct=1 --sync=1 --rw=write 
 --bs=4k --numjobs=$i --iodepth=1 --runtime=60 --time_based --group_reporting 
 --name=journal-test; done

 # jobs  IOPs   Bandwidth (KB/s)

 Intel S3500 (SSDSC2BB240G4) Max 4K RW 7,500
 1   5,617  22,468.0
 2   8,326  33,305.0
 3  11,575  46,301.0
 4  13,882  55,529.0
 5  16,254  65,020.0
 6  17,890  71,562.0
 7  19,438  77,752.0
 8  20,894  83,576.0

 Intel S3700 (SSDSC2BA200G3) Max 4K RW 32,000
  1  4,417  17,670.0
  2  5,544  22,178.0
  3  7,337  29,352.0
  4  9,243  36,975.0
  5 11,189  44,759.0
  6 13,218  52,874.0
  7 14,801  59,207.0
  8 16,604  66,419.0
  9 17,671  70,685.0
 10 18,715  74,861.0
 11 20,079  80,318.0
 12 20,832  83,330.0
 13 20,571  82,288.0
 14 23,033  92,135.0
 15 22,169  88,679.0
 16 22,875  91,502.0

 
  Mons can use some space, I've experienced logging havoc, leveldb bloating 
  havoc  (I have to compact manually or it just grows and grows), and my Mons 
  write quite a lot at times. I guesstimate my mons can write 200GB a day, 
  often less but often more. Maybe that's not normal. I can confirm those 
  numbers tomorrow.

 True, I haven't had the compact issues so I can't comment on that. He has a 
 small cluster so I don't think he will get to the level you have.

 
 
 
  8x OSD Servers:
2x Intel Xeon E5-2600@v3 10C
 
 
  Go for the fastest you can afford if you need the latency - even at the
  expense of cores.
  Go for cores if you want bigger throughput.
 
  I'm in the middle of my testing, but it seems that with lots of I/O
  depth (either from a single client or multiple clients) that clock
  speed does not have as much of an impact as core count does. Once I'm
  done, I'll be posting my results. Unless you have a single client that
  has a QD=1, go for cores at this point.
 
  NoSQL is basically still a database, and while NoSQL is mostly a more

Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-02 Thread German Anders
[truncated iostat output]

The other OSD server had pretty much the same load.

The config of the OSDs is the following:

- 2x Intel Xeon E5-2609 v2 @ 2.50GHz (4C)
- 128G RAM
- 2x 120G SSD Intel SSDSC2BB12 (RAID-1) for OS
- 2x 10GbE ADPT DP
- Journals are configured to run on RAMDISK (tmpfs), but on the first OSD
server the journals go to a FusionIO (/dev/fioa) adapter with battery backup.

The CRUSH map is the following:

# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1

# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
device 9 osd.9
device 10 osd.10
device 11 osd.11
device 12 osd.12
device 13 osd.13
device 14 osd.14
device 15 osd.15
device 16 osd.16
device 17 osd.17
device 18 osd.18
device 19 osd.19
device 20 osd.20
device 21 osd.21
device 22 osd.22
device 23 osd.23
device 24 osd.24
device 25 osd.25
device 26 osd.26
device 27 osd.27
device 28 osd.28
device 29 osd.29
device 30 osd.30
device 31 osd.31
device 32 osd.32
device 33 osd.33
device 34 osd.34
device 35 osd.35

# types
type 0 osd
type 1 host
type 2 chassis
type 3 rack
type 4 row
type 5 pdu
type 6 pod
type 7 room
type 8 datacenter
type 9 region
type 10 root

# buckets
host cephosd03 {
        id -4           # do not change unnecessarily
        # weight 24.570
        alg straw
        hash 0          # rjenkins1
        item osd.18 weight 2.730
        item osd.19 weight 2.730
        item osd.20 weight 2.730
        item osd.21 weight 2.730
        item osd.22 weight 2.730
        item osd.23 weight 2.730
        item osd.24 weight 2.730
        item osd.25 weight 2.730
        item osd.26 weight 2.730
}
host cephosd04 {
        id -5           # do not change unnecessarily
        # weight 24.570
        alg straw
        hash 0          # rjenkins1
        item osd.27 weight 2.730
        item osd.28 weight 2.730
        item osd.29 weight 2.730
        item osd.30 weight 2.730
        item osd.31 weight 2.730
        item osd.32 weight 2.730
        item osd.33 weight 2.730
        item osd.34 weight 2.730
        item osd.35 weight 2.730
}
root default {
        id -1           # do not change unnecessarily
        # weight 49.140
        alg straw
        hash 0          # rjenkins1
        item cephosd03 weight 24.570
        item cephosd04 weight 24.570
}
host cephosd01 {
        id -2           # do not change unnecessarily
        # weight 24.570
        alg straw
        hash 0          # rjenkins1
        item osd.0 weight 2.730
        item osd.1 weight 2.730
        item osd.2 weight 2.730
        item osd.3 weight 2.730
        item osd.4 weight 2.730
        item osd.5 weight 2.730
        item osd.6 weight 2.730
        item osd.7 weight 2.730
        item osd.8 weight 2.730
}
host cephosd02 {
        id -3           # do not change unnecessarily
        # weight 24.570
        alg straw
        hash 0          # rjenkins1
        item osd.9 weight 2.730
        item osd.10 weight 2.730
        item osd.11 weight 2.730
        item osd.12 weight 2.730
        item osd.13 weight 2.730
        item osd.14 weight 2.730
        item osd.15 weight 2.730
        item osd.16 weight 2.730
        item osd.17 weight 2.730
}
root fusionio {
        id -6           # do not change unnecessarily
        # weight 49.140
        alg straw
        hash 0          # rjenkins1
        item cephosd01 weight 24.570
        item cephosd02 weight 24.570
}

# rules
rule replicated_ruleset {
        ruleset 0
        type replicated
        min_size 1
        max_size 10
        step take default
        step chooseleaf firstn 0 type host
        step emit
}
rule fusionio_ruleset {
        ruleset 1
        type replicated
        min_size 0
        max_size 10
        step take fusionio
        step chooseleaf firstn 1 type host
        step emit
        step take default
        step chooseleaf firstn -1 type host
        step emit
}

# end crush map
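
For completeness, a pool is pointed at one of these rules with, e.g.:

   ceph osd pool set <pool-name> crush_ruleset 1    # use fusionio_ruleset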





*German*

2015-07-02 8:15 GMT-03:00 Lionel Bouton lionel+c...@bouton.name:

 On 07/02/15 12:48, German Anders wrote:
  The idea is to cache RBD at the host level. It could also be possible to
  cache at the OSD level. We have high iowait and we need to lower it a
  bit, since we are getting the max from our SAS disks, 100-110 IOPS per
  disk (3TB OSDs). Any advice? Flashcache?

 It's hard to suggest anything without knowing more about your setup. Are
 your I/O mostly reads or writes? Reads: can you add enough RAM on your
 guests or on your OSD to cache your working set? Writes: do you use SSD
 for journals already?

 Lionel

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-02 Thread German Anders
The idea is to cache RBD at the host level. It could also be possible to cache
at the OSD level. We have high iowait and we need to lower it a bit, since
we are getting the max from our SAS disks, 100-110 IOPS per disk (3TB
OSDs). Any advice? Flashcache?


On Thursday, July 2, 2015, Jan Schermer j...@schermer.cz wrote:

 I think I posted my experience here ~1 month ago.

 My advice for EnhanceIO: don’t use it.

 But you didn’t exactly say what you want to cache - do you want to cache
 the OSD filestore disks? RBD devices on hosts? RBD devices inside guests?

 Jan

  On 02 Jul 2015, at 11:29, Emmanuel Florac eflo...@intellique.com wrote:
 
  Le Wed, 1 Jul 2015 17:13:03 -0300
  German Anders gand...@despegar.com wrote:
 
  Hi cephers,
 
   Is anyone out there running EnhanceIO in a production
  environment? Any recommendation? Any perf output to share showing the
  diff between using it and not?
 
  I've tried EnhanceIO back when it wasn't too stale, but never put it in
  production. I've set up bcache on trial, it has its problems (load is
  stuck at 1.0 because of the bcache_writeback kernel thread, and I
  suspect a crash was due to it) but works pretty well overall.
 
  --
  
  Emmanuel Florac |   Direction technique
 |   Intellique
 |  eflo...@intellique.com
 |   +33 1 78 94 84 02
  
  ___
  ceph-users mailing list
  ceph-users@lists.ceph.com
  http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 

*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] any recommendation of using EnhanceIO?

2015-07-02 Thread German Anders
yeah 3TB SAS disks

*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com

2015-07-02 9:04 GMT-03:00 Jan Schermer j...@schermer.cz:

 And those disks are spindles?
 Looks like there’s simply too few of there….

 Jan

 On 02 Jul 2015, at 13:49, German Anders gand...@despegar.com wrote:

 output from iostat:

 *CEPHOSD01:*

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 sdc(ceph-0)   0.00 0.001.00  389.00 0.0035.98
 188.9660.32  120.12   16.00  120.39   1.26  49.20
 sdd(ceph-1)   0.00 0.000.000.00 0.00 0.00
 0.00 0.000.000.000.00   0.00   0.00
 sdf(ceph-2)   0.00 1.006.00  521.00 0.0260.72
 236.05   143.10  309.75  484.00  307.74   1.90 100.00
 sdg(ceph-3)   0.00 0.00   11.00  535.00 0.0442.41
 159.22   139.25  279.72  394.18  277.37   1.83 100.00
 sdi(ceph-4)   0.00 1.004.00  560.00 0.0254.87
 199.32   125.96  187.07  562.00  184.39   1.65  93.20
 sdj(ceph-5)   0.00 0.000.00  566.00 0.0061.41
 222.19   109.13  169.620.00  169.62   1.53  86.40
 sdl(ceph-6)   0.00 0.008.000.00 0.09 0.00
 23.00 0.12   12.00   12.000.00   2.50   2.00
 sdm(ceph-7)   0.00 0.002.00  481.00 0.0144.59
 189.12   116.64  241.41  268.00  241.30   2.05  99.20
 sdn(ceph-8)   0.00 0.001.000.00 0.00 0.00
 8.00 0.018.008.000.00   8.00   0.80
 fioa  0.00 0.000.00 1016.00 0.0019.09
 38.47 0.000.060.000.06   0.00   0.00

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 sdc(ceph-0)   0.00 1.00   10.00  278.00 0.0426.07
 185.6960.82  257.97  309.60  256.12   2.83  81.60
 sdd(ceph-1)   0.00 0.002.000.00 0.02 0.00
 20.00 0.02   10.00   10.000.00  10.00   2.00
 sdf(ceph-2)   0.00 1.006.00  579.00 0.0254.16
 189.68   142.78  246.55  328.67  245.70   1.71 100.00
 sdg(ceph-3)   0.00 0.00   10.00   75.00 0.05 5.32
 129.41 4.94  185.08   11.20  208.27   4.05  34.40
 sdi(ceph-4)   0.00 0.00   19.00  147.00 0.0912.61
 156.6317.88  230.89  114.32  245.96   3.37  56.00
 sdj(ceph-5)   0.00 1.002.00  629.00 0.0143.66
 141.72   143.00  223.35  426.00  222.71   1.58 100.00
 sdl(ceph-6)   0.00 0.00   10.000.00 0.04 0.00
 8.00 0.16   18.40   18.400.00   5.60   5.60
 sdm(ceph-7)   0.00 0.00   11.004.00 0.05 0.01
 8.00 0.48   35.20   25.82   61.00  14.13  21.20
 sdn(ceph-8)   0.00 0.009.000.00 0.07 0.00
 15.11 0.078.008.000.00   4.89   4.40
 fioa  0.00 0.000.00 6415.00 0.00   125.81
 40.16 0.000.140.000.14   0.00   0.00

 *CEPHOSD02:*

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 sdc1(ceph-9)  0.00 0.00   13.000.00 0.11 0.00
 16.62 0.17   13.23   13.230.00   4.92   6.40
 sdd1(ceph-10) 0.00 0.00   15.000.00 0.13 0.00
 18.13 0.26   17.33   17.330.00   1.87   2.80
 sdf1(ceph-11) 0.00 0.00   22.00  650.00 0.1151.75
 158.04   143.27  212.07  308.55  208.81   1.49 100.00
 sdg1(ceph-12) 0.00 0.00   12.00  282.00 0.0554.60
 380.6813.16  120.52  352.00  110.67   2.91  85.60
 sdi1(ceph-13) 0.00 0.001.000.00 0.00 0.00
 8.00 0.018.008.000.00   8.00   0.80
 sdj1(ceph-14) 0.00 0.00   20.000.00 0.08 0.00
 8.00 0.26   12.80   12.800.00   3.60   7.20
 sdl1(ceph-15) 0.00 0.000.000.00 0.00 0.00
 0.00 0.000.000.000.00   0.00   0.00
 sdm1(ceph-16) 0.00 0.00   20.00  424.00 0.1132.20
 149.0589.69  235.30  243.00  234.93   2.14  95.20
 sdn1(ceph-17) 0.00 0.005.00  411.00 0.0245.47
 223.9498.32  182.28 1057.60  171.63   2.40 100.00

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 sdc1(ceph-9)  0.00 0.00   26.00  383.00 0.1134.32
 172.4486.92  258.64  297.08  256.03   2.29  93.60
 sdd1(ceph-10) 0.00 0.008.00   31.00 0.09 1.86
 101.95 0.84  178.15   94.00  199.87   6.46  25.20
 sdf1(ceph-11) 0.00 1.005.00  409.00 0.0548.34
 239.3490.94  219.43  383.20  217.43   2.34  96.80
 sdg1(ceph-12) 0.00 0.000.00  238.00 0.00 1.64
 14.1258.34  143.600.00  143.60   1.83  43.60
 sdi1(ceph-13) 0.00 0.00   11.000.00

[ceph-users] any recommendation of using EnhanceIO?

2015-07-01 Thread German Anders
Hi cephers,

   Is anyone out there running EnhanceIO in a production
environment? Any recommendation? Any perf output to share showing the diff
between using it and not?

Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I would probably go with smaller OSD disks; 4TB is too much to lose in
case of a broken disk, so maybe more OSD daemons of a smaller size, maybe 1TB
or 2TB. A 4:1 relationship is good enough. Also, I think that a 200G disk
for the journals would be OK, so you can save some money there. The OSDs,
of course, should be configured as JBOD, without any RAID underneath, and use
two different networks for the public and cluster nets.
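
For reference, the sizing rule from the osd-config-ref page cited below is
roughly: osd journal size = 2 * (expected throughput * filestore max sync
interval). A rough worked example (numbers illustrative):

   # 6TB SAS drive at ~150 MB/s sustained, default filestore max sync interval 5s:
   #   2 * 150 MB/s * 5 s = 1500 MB per OSD journal
   # four journals per SSD -> ~6 GB actually used, so a 400GB SSD buys
   # endurance and speed headroom rather than capacity you need.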

*German*

2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 I would like to get some clarification on the size of the journal disks
 that I should get for my new Ceph cluster I am planning.  I read about the
 journal settings on
 http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
 but that didn't really clarify it for me, or I just didn't get it.  The
 Learning Ceph Packt book states that you should have one
 disk for journalling for every 4 OSDs.  Using that as a reference I was
 planning on getting multiple systems with 8 x 6TB inline SAS drives for
 OSDs with two SSDs for journalling per host as well as 2 hot spares for the
 6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
 am wondering if that is too much.  Any informed opinions would be
 appreciated.

 Thanks,

 *Nate Curry*


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Journal Disk Size

2015-07-01 Thread German Anders
I'm interested in such a configuration; can you share some performance
tests/numbers?

Thanks in advance,

Best regards,

*German*

2015-07-01 21:16 GMT-03:00 Shane Gibson shane_gib...@symantec.com:


 It also depends a lot on the size of your cluster ... I have a test
 cluster I'm standing up right now with 60 nodes - a total of 600 OSDs each
 at 4 TB ... If I lose 4 TB - that's a very small fraction of the data.  My
 replicas are going to be spread out across a lot of spindles, and
 replicating that missing 4 TB isn't much of an issue, across 3 racks each
 with 80 gbit/sec ToR uplinks to Spine.  Each node has 20 gbit/sec to ToR in
 a bond.

 On the other hand ... if you only have 4 .. or 8 ... or 10 servers ... and
 a smaller number of OSDs - you have fewer spindles replicating that loss,
 and it might be more of an issue.

 It just depends on the size/scale of  your environment.

 We're going to 8 TB drives - and that will ultimately be spread over a 100
 or more physical servers w/ 10 OSD disks per server.   This will be across
 7 to 10 racks (same network topology) ... so an 8 TB drive loss isn't too
 big of an issue.   Now that assumes that replication actually works well in
 that size cluster.  We're still sussing out this part of the PoC
 engagement.

 ~~shane




 On 7/1/15, 5:05 PM, ceph-users on behalf of German Anders 
 ceph-users-boun...@lists.ceph.com on behalf of gand...@despegar.com
 wrote:

 ask the other guys on the list, but for me losing 4TB of data is too much.
 The cluster will keep running fine, but at some point you need to recover
 that disk; and if you lose one server with all the 4TB disks, in that
 case, yeah, it will hurt the cluster. Also take into account that with that
 kind of disk you will get no more than 100-110 IOPS per disk.

 *German Anders*
 Storage System Engineer Leader
 *Despegar* | IT Team
 *office* +54 11 4894 3500 x3408
 *mobile* +54 911 3493 7262
 *mail* gand...@despegar.com

 2015-07-01 20:54 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 4TB is too much to lose?  Why would it matter if you lost one 4TB disk, with
 the redundancy?  Won't it auto-recover from the disk failure?

 Nate Curry
 On Jul 1, 2015 6:12 PM, German Anders gand...@despegar.com wrote:

 I would probably go with smaller OSD disks; 4TB is too much to lose in
 case of a broken disk, so maybe more OSD daemons of a smaller size, maybe 1TB
 or 2TB. A 4:1 relationship is good enough. Also, I think that a 200G disk
 for the journals would be OK, so you can save some money there. The OSDs,
 of course, should be configured as JBOD, without any RAID underneath, and use
 two different networks for the public and cluster nets.

 *German*

 2015-07-01 18:49 GMT-03:00 Nate Curry cu...@mosaicatm.com:

 I would like to get some clarification on the size of the journal disks
 that I should get for my new Ceph cluster I am planning.  I read about the
 journal settings on
 http://ceph.com/docs/master/rados/configuration/osd-config-ref/#journal-settings
 but that didn't really clarify it for me, or I just didn't get it.  The
 Learning Ceph Packt book states that you should have one
 disk for journalling for every 4 OSDs.  Using that as a reference I was
 planning on getting multiple systems with 8 x 6TB inline SAS drives for
 OSDs with two SSDs for journalling per host as well as 2 hot spares for the
 6TB drives and 2 drives for the OS.  I was thinking of 400GB SSD drives but
 am wondering if that is too much.  Any informed opinions would be
 appreciated.

 Thanks,

 *Nate Curry*


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] infiniband implementation

2015-06-29 Thread German Anders
hi cephers,

   I want to know if there's any 'best' practice or procedure for implementing
Ceph with Infiniband FDR 56Gb/s for front- and back-end connectivity. Any
CRUSH tuning parameters, etc.?

The Ceph cluster has:

- 8 OSD servers
- 2x Intel Xeon E5 8C with HT
- 128G RAM
- 2x 200G Intel DC S3700 (RAID-1) OS
- 3x 200G Intel DC S3500 - Journals
- 4x 800G Intel DC S3500 - OSD SSD  Journal on same disks
- 4x 3TB - OSD SATA
- 1x IB FDR ADPT DP

- 3 MON servers
- 2x Intel Xeon E5 6C with HT
- 128G RAM
- 2x 200G Intel SSD (RAID-1) OS
- 1x IB FDR ADP DP

All with Ubuntu 14.04.1 LTS and kernel 4.0.6
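
For reference, with IPoIB the front/back split is the usual two-network
ceph.conf setup; a minimal sketch (subnets made up, ib0/ib1 standing for the
two ports of the FDR adapter):

   [global]
   public network  = 10.10.1.0/24   # client-facing, on ib0
   cluster network = 10.10.2.0/24   # replication/recovery, on ib1

   # IPoIB mode also matters for throughput; connected mode allows a 64k MTU:
   echo connected > /sys/class/net/ib0/mode
   ip link set ib0 mtu 65520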


Thanks in advance,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] infiniband implementation

2015-06-29 Thread German Anders
Thanks a lot Adam, it was a typo: the 3700s are for the journals and
the 3500s for the OS. Any special CRUSH configuration for IB and
for the mix of SSD and SATA OSD daemons?
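
A minimal sketch of the usual split (two roots, one rule each; names
hypothetical and host buckets/weights elided):

   root ssd {
           id -20
           alg straw
           hash 0  # rjenkins1
           # item <per-host ssd buckets> weight ...
   }
   root sata {
           id -21
           alg straw
           hash 0  # rjenkins1
           # item <per-host sata buckets> weight ...
   }
   rule ssd_ruleset {
           ruleset 1
           type replicated
           min_size 1
           max_size 10
           step take ssd
           step chooseleaf firstn 0 type host
           step emit
   }
   # sata_ruleset is identical except for "step take sata"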

Thanks in advance,


*German*

2015-06-29 14:05 GMT-03:00 Adam Boyhan ad...@medent.com:

 One thing that jumps out at me is using the S3700 for OS but the S3500 for
 journals.  I would use the S3700 for journals and S3500 for the OS.  Looks
 pretty good other than that!



 --
 *From: *German Anders gand...@despegar.com
 *To: *ceph-users ceph-users@lists.ceph.com
 *Sent: *Monday, June 29, 2015 12:24:41 PM
 *Subject: *[ceph-users] infiniband implementation

 hi cephers,

    I want to know if there's any 'best' practice or procedure for implementing
  Ceph with Infiniband FDR 56Gb/s for front- and back-end connectivity. Any
  CRUSH tuning parameters, etc.?

 The Ceph cluster has:

 - 8 OSD servers
 - 2x Intel Xeon E5 8C with HT
 - 128G RAM
 - 2x 200G Intel DC S3700 (RAID-1) OS
 - 3x 200G Intel DC S3500 - Journals
 - 4x 800G Intel DC S3500 - OSD SSD  Journal on same disks
 - 4x 3TB - OSD SATA
 - 1x IB FDR ADPT DP

 - 3 MON servers
 - 2x Intel Xeon E5 6C with HT
 - 128G RAM
 - 2x 200G Intel SSD (RAID-1) OS
 - 1x IB FDR ADP DP

 All with Ubuntu 14.04.1LTS with Kern 4.0.6


 Thanks in advance,

 *German*

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel 3.18 io bottlenecks?

2015-06-24 Thread German Anders
Hi Lincoln,
   how are you? It's with RBD

Thanks a lot,

Best regards,


*German*

2015-06-24 11:53 GMT-03:00 Lincoln Bryant linco...@uchicago.edu:

 Hi German,

 Is this with CephFS, or RBD?

 Thanks,
 Lincoln

 On Jun 24, 2015, at 9:44 AM, German Anders gand...@despegar.com wrote:

 Hi all,

    Is there any IO bottleneck reported on kernel 3.18.3-031803-generic?
 I'm having a lot of iowait and the cluster is really getting slow,
 even though there's not much going on. I read some time ago that there
 were some issues with kernel 3.18, so I would like to know the 'best'
 kernel to go with; I'm using Ubuntu 14.04.1 LTS and Ceph v0.82.

 Thanks a lot,

 Best regards,

 *German*
  ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] kernel 3.18 io bottlenecks?

2015-06-24 Thread German Anders
Hi all,

   Is there any IO bottleneck reported on kernel 3.18.3-031803-generic?
I'm having a lot of iowait and the cluster is really getting slow,
even though there's not much going on. I read some time ago that there
were some issues with kernel 3.18, so I would like to know the 'best'
kernel to go with; I'm using Ubuntu 14.04.1 LTS and Ceph v0.82.

Thanks a lot,

Best regards,

*German*
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] kernel 3.18 io bottlenecks?

2015-06-24 Thread German Anders
Thanks a lot Nick. OK, I see; is there any lower version that could be
used, or is it best to go with 4.0+?


*German*

2015-06-24 11:55 GMT-03:00 Nick Fisk n...@fisk.me.uk:

 That kernel probably has the bug where tcp_nodelay is not enabled. That is
 fixed in kernel 4.0+; however, 4.0 also introduced blk-mq, which
 brings two other limitations:



 1.   Max queue depth of 128

 2.   IO’s sizes are restricted/split to 128kb
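
 A quick way to see what a mapped device reports (rbd0 illustrative):

 cat /sys/block/rbd0/queue/nr_requests      # queue depth limit
 cat /sys/block/rbd0/queue/max_sectors_kb   # largest IO the queue advertises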



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *German Anders
 *Sent:* 24 June 2015 15:45
 *To:* ceph-users
 *Subject:* [ceph-users] kernel 3.18 io bottlenecks?



 Hi all,

    Is there any IO bottleneck reported on kernel 3.18.3-031803-generic?
  I'm having a lot of iowait and the cluster is really getting slow,
  even though there's not much going on. I read some time ago that there
  were some issues with kernel 3.18, so I would like to know the 'best'
  kernel to go with; I'm using Ubuntu 14.04.1 LTS and Ceph v0.82.

 Thanks a lot,

 Best regards,

 *German*


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] krbd splitting large IO's into smaller IO's

2015-06-10 Thread German Anders
hi guys, sorry for jumping on this email thread. I have four OSD servers with
Ubuntu 14.04.1 LTS with 9 OSD daemons each, 3TB drive size, and 3 SSD journal
drives (each journal drive holds 3 OSD daemons). The kernel version I'm
using is 3.18.3-031803-generic, and ceph version 0.82. I would like to know
the 'best' parameters in terms of IO for my 3TB devices; I have:

scheduler: deadline
max_hw_sectors_kb: 16383
max_sectors_kb: 4096
read_ahead_kb: 128
nr_requests: 128
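
For reference, those values are set per device through sysfs and do not
survive a reboot; a minimal sketch (device name illustrative, values simply
the ones above rather than a recommendation):

echo deadline > /sys/block/sdc/queue/scheduler
echo 4096 > /sys/block/sdc/queue/max_sectors_kb
echo 128 > /sys/block/sdc/queue/read_ahead_kb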

I'm experiencing high IO waits on all the OSD servers:

avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   1.740.00   15.43   *64.80*0.00   18.03

Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s avgrq-sz
avgqu-sz   await r_await w_await  svctm  %util
sda1610.40   322.20  374.80   11.00  7940.80  1330.00
48.06 0.080.210.200.44   0.20   7.68
sdb 130.60   322.20   55.00   11.00   742.40  1330.00
62.80 0.020.230.170.51   0.19   1.28
md0   0.00 0.00 2170.80  332.40  8683.20  1329.60
8.00 0.000.000.000.00   0.00   0.00
dm-0  0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
dm-1  0.00 0.00 2170.80  332.40  8683.20  1329.60
8.00 0.870.350.211.26   0.03   7.84
sdd   0.00 0.00   11.80  384.40  4217.60 33197.60
188.8775.17  189.72  130.78  191.53   1.88  74.64
sdc   0.00 0.00   18.80  313.40   581.60 33154.40
203.1178.09  235.08   66.85  245.17   2.16  71.84
sde   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdf   0.00 0.80   78.20  181.40 10400.80 19204.80
228.0931.75  110.93   43.09  140.18   2.99  77.52
sdg   0.00 0.001.60  304.6051.20 31647.20
207.0464.05  209.19   73.50  209.90   1.90  58.32
sdh   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdi   0.00 0.006.60   17.20   159.20  2784.80
247.39 0.279.14   12.128.00   3.19   7.60
sdk   0.00 0.000.000.00 0.00 0.00
0.00 0.000.000.000.00   0.00   0.00
sdj   0.00 0.00   13.40  120.00   428.80  8487.20
133.6723.91  203.37   36.18  222.04   2.64  35.28
sdl   0.00 0.80   12.40  524.20  2088.80 40842.40
160.0193.53  168.27  183.35  167.91   1.64  88.24
sdn   0.00 1.404.00  433.8092.80 35926.40
164.5588.72  196.29  299.40  195.33   1.71  74.96
sdm   0.00 0.000.60  544.6019.20 40348.00
148.08   118.31  217.00   17.33  217.22   1.67  90.80

Thanks in advance,

Best regards,


*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com

2015-06-10 13:07 GMT-03:00 Ilya Dryomov idryo...@gmail.com:

 On Wed, Jun 10, 2015 at 7:04 PM, Nick Fisk n...@fisk.me.uk wrote:
-Original Message-
From: Ilya Dryomov [mailto:idryo...@gmail.com]
Sent: 10 June 2015 14:06
To: Nick Fisk
Cc: ceph-users
Subject: Re: [ceph-users] krbd splitting large IO's into smaller
IO's
   
On Wed, Jun 10, 2015 at 2:47 PM, Nick Fisk n...@fisk.me.uk
 wrote:
 Hi,

 Using Kernel RBD client with Kernel 4.03 (I have also tried some
 older kernels with the same effect) and IO is being split into
 smaller IO's which is having a negative impact on performance.

 cat /sys/block/sdc/queue/max_hw_sectors_kb
 4096

 cat /sys/block/rbd0/queue/max_sectors_kb
 4096

 Using DD
 dd if=/dev/rbd0 of=/dev/null bs=4M

 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  201.500.00 25792.00 0.00
  256.00
 1.99   10.15   10.150.00   4.96 100.00


 Using FIO with 4M blocks
 Device: rrqm/s   wrqm/s r/s w/srkB/swkB/s
  avgrq-sz
 avgqu-sz   await r_await w_await  svctm  %util
 rbd0  0.00 0.00  232.000.00 118784.00
  0.00
  1024.00
 11.29   48.58   48.580.00   4.31 100.00

 Any ideas why IO sizes are limited to 128k (256 blocks) in DD's
 case and 512k in Fio's case?
   
128k vs 512k is probably buffered vs direct IO - add iflag=direct
to your dd invocation.
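
 i.e., something along the lines of:

 dd if=/dev/rbd0 of=/dev/null bs=4M iflag=direct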
   
Yes, thanks for this, that was the case
   
   
As for the 512k - I'm pretty sure it's a regression in our switch
to blk-mq.  I tested it around 3.18-3.19 and saw steady 4M IOs.  I
hope we are just missing a knob - I'll take a look.
   
I've tested both 4.0.3 and 3.16 and both seem to be split into 512k.
Let me know if you need me to test any other particular kernel.

Re: [ceph-users] High IO Waits

2015-06-10 Thread German Anders
Thanks a lot Nick, I'll try with more PGs, and if I don't see any
improvement I'll add more OSD servers to the cluster.

Best regards,


*German Anders*
Storage System Engineer Leader
*Despegar* | IT Team
*office* +54 11 4894 3500 x3408
*mobile* +54 911 3493 7262
*mail* gand...@despegar.com

2015-06-10 13:58 GMT-03:00 Nick Fisk n...@fisk.me.uk:

 From the looks of it you are maxing out your OSDs. Some of them are
 pushing over 500 IOPS, which is a lot for a 7.2k disk, and at these high
 queue depths IOs will have to wait a long time to reach the front of the
 queue. The only real thing I can suggest is to add more OSDs, which will
 spread your workload over more disks. It’s possible creating more PGs may
 distribute the workload a little better, but I don’t see it making a
 massive difference.
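
 If you do experiment with PG counts, the knobs are per-pool. A minimal
 sketch, assuming the data lives in a pool named rbd (pick the target
 from the usual PGs-per-OSD guideline rather than copying this number):

    # current placement-group count for the pool
    ceph osd pool get rbd pg_num

    # raise pg_num first, then pgp_num so the new PGs actually rebalance
    ceph osd pool set rbd pg_num 1024
    ceph osd pool set rbd pgp_num 1024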



 *From:* ceph-users [mailto:ceph-users-boun...@lists.ceph.com] *On Behalf
 Of *German Anders
 *Sent:* 10 June 2015 17:20
 *To:* Ilya Dryomov
 *Cc:* ceph-users; Nick Fisk
 *Subject:* Re: [ceph-users] krbd splitting large IO's into smaller IO's



 hi guys, sorry to piggyback on this email. I've got four OSD servers
 running Ubuntu 14.04.1 LTS with 9 OSD daemons each (3TB drives) and 3 SSD
 journal drives (each journal holds 3 OSD daemons). The kernel version I'm
 using is 3.18.3-031803-generic and the ceph version is 0.82. I would like
 to know what would be the 'best' parameters in terms of IO for my 3TB
 devices. I've got:

 scheduler: deadline

 max_hw_sectors_kb: 16383

 max_sectors_kb: 4096

 read_ahead_kb: 128

 nr_requests: 128

 I'm experiencing some high IO waits on all the OSD servers:

 [iostat output and signature snipped -- identical to the message quoted
 earlier in the thread]







