Re: [ceph-users] ceph-volume lvm for bluestore for newer disk

2017-11-30 Thread Brad Hubbard


On Thu, Nov 30, 2017 at 5:30 PM, nokia ceph  wrote:
> Hello,
>
> I'm following
> http://docs.ceph.com/docs/master/ceph-volume/lvm/prepare/#ceph-volume-lvm-prepare-bluestore
> to create new OSD's.
>
> I took the latest branch from https://shaman.ceph.com/repos/ceph/luminous/
>
> # ceph -v
> ceph version 12.2.1-851-g6d9f216
>
> What I did, formatted the device.
>
> #sgdisk -Z /dev/sdv
> Creating new GPT entries.
> GPT data structures destroyed! You may now partition the disk using fdisk or
> other utilities.
>
>
> Getting the below error during the creation of the bluestore OSDs:
>
> # ceph-volume lvm prepare --bluestore  --data /dev/sdv
> Running command: sudo vgcreate --force --yes
> ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 # use uuidgen to create an ID, use
> this for all ceph nodes in your cluster /dev/sdv
>  stderr: Name contains invalid character, valid set includes:
> [a-zA-Z0-9.-_+].
>   New volume group name "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 # use
> uuidgen to create an ID, use this for all ceph nodes in your cluster" is
> invalid.
>   Run `vgcreate --help' for more information.
> -->  RuntimeError: command returned non-zero exit status: 3

Can you remove the comment "# use `uuidgen` to generate your own UUID" from the
line for 'fsid' in your ceph.conf and try again?
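Judging by the vgcreate error, ceph-volume seems to be picking up the raw
value of the 'fsid' line including that trailing comment. After removing it,
the line should contain nothing but the bare UUID, i.e. something like:

fsid = b2f1b9b9-eecc-4c17-8b92-cfa60b31c121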

>
> # grep fsid /etc/ceph/ceph.conf
> fsid = b2f1b9b9-eecc-4c17-8b92-cfa60b31c121
>
>
> My question
>
> 1. We have 68 disks per server, so will all 68 disks share the same volume
> group --> "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121" ?
> 2. Why did ceph-volume fail to create a VG with this name? Even when I tried
> to create it manually it failed, as vgcreate asks for a physical volume as
> an argument:
> #vgcreate --force --yes "ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121"
>   No command with matching syntax recognised.  Run 'vgcreate --help' for
> more information.
>   Correct command syntax is:
>   vgcreate VG_new PV ...
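Right - vgcreate expects both the new VG name and at least one PV, so run by
hand (with your device from above) it would be something like:

# vgcreate --force --yes ceph-b2f1b9b9-eecc-4c17-8b92-cfa60b31c121 /dev/sdv

which is essentially what ceph-volume attempted, except that the comment
from ceph.conf ended up appended to the VG name.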
>
> Please let me know the comments.
>
> Thanks
> Jayaram
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Cheers,
Brad
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph Developers Monthly - October

2017-11-30 Thread kefu chai
On Tue, Nov 7, 2017 at 3:12 AM, Leonardo Vaz  wrote:
> On Mon, Nov 06, 2017 at 09:54:41PM +0800, kefu chai wrote:
>> On Thu, Oct 5, 2017 at 12:16 AM, Leonardo Vaz  wrote:
>> > On Wed, Oct 04, 2017 at 03:02:09AM -0300, Leonardo Vaz wrote:
>> >> On Thu, Sep 28, 2017 at 12:08:00AM -0300, Leonardo Vaz wrote:
>> >> > Hey Cephers,
>> >> >
>> >> > This is just a friendly reminder that the next Ceph Developer Montly
>> >> > meeting is coming up:
>> >> >
>> >> >  http://wiki.ceph.com/Planning
>> >> >
>> >> > If you have work that you're doing that it a feature work, significant
>> >> > backports, or anything you would like to discuss with the core team,
>> >> > please add it to the following page:
>> >> >
>> >> >  http://wiki.ceph.com/CDM_04-OCT-2017
>> >> >
>> >> > If you have questions or comments, please let us know.
>> >>
>>
>> Leo,
>>
>> do we have a recording for the CDM in Oct?
>
> We didn't record the CDM in October because some people were not able to
> join us to discuss the topics and we ended up having a very quick meeting
> instead.
>
> I have the video recording for the CDM we had on Nov 1st, however I need
> to upload it. We are in Sydney for the OpenStack Summit and I'll be able
> to do that later this week.

Leo, thank you! could you update
http://tracker.ceph.com/projects/ceph/wiki/Planning with the URL of
the recording once you upload it?


-- 
Regards
Kefu Chai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] RBD corruption when removing tier cache

2017-11-30 Thread Jan Pekař - Imatic

Hi all,
today I tested adding an SSD cache tier to a pool.
Everything worked, but when I tried to remove it and ran

rados -p hot-pool cache-flush-evict-all

I got

rbd_data.9c000238e1f29.
failed to flush /rbd_data.9c000238e1f29.: (2) No such 
file or directory

rbd_data.9c000238e1f29.0621
failed to flush /rbd_data.9c000238e1f29.0621: (2) No such 
file or directory

rbd_data.9c000238e1f29.0001
failed to flush /rbd_data.9c000238e1f29.0001: (2) No such 
file or directory

rbd_data.9c000238e1f29.0a2c
failed to flush /rbd_data.9c000238e1f29.0a2c: (2) No such 
file or directory

rbd_data.9c000238e1f29.0200
failed to flush /rbd_data.9c000238e1f29.0200: (2) No such 
file or directory

rbd_data.9c000238e1f29.0622
failed to flush /rbd_data.9c000238e1f29.0622: (2) No such 
file or directory

rbd_data.9c000238e1f29.0009
failed to flush /rbd_data.9c000238e1f29.0009: (2) No such 
file or directory

rbd_data.9c000238e1f29.0208
failed to flush /rbd_data.9c000238e1f29.0208: (2) No such 
file or directory

rbd_data.9c000238e1f29.00c1
failed to flush /rbd_data.9c000238e1f29.00c1: (2) No such 
file or directory

rbd_data.9c000238e1f29.0625
failed to flush /rbd_data.9c000238e1f29.0625: (2) No such 
file or directory

rbd_data.9c000238e1f29.00d8
failed to flush /rbd_data.9c000238e1f29.00d8: (2) No such 
file or directory

rbd_data.9c000238e1f29.0623
failed to flush /rbd_data.9c000238e1f29.0623: (2) No such 
file or directory

rbd_data.9c000238e1f29.0624
failed to flush /rbd_data.9c000238e1f29.0624: (2) No such 
file or directory

error from cache-flush-evict-all: (1) Operation not permitted

I also noticed that switching the cache tier to "forward" is reported as not safe:

Error EPERM: 'forward' is not a well-supported cache mode and may 
corrupt your data.  pass --yes-i-really-mean-it to force.
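For completeness, the mode switch I attempted was essentially the standard
command (pool name as above):

ceph osd tier cache-mode hot-pool forward --yes-i-really-mean-it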


At the moment of flushing (or switching to forward mode) the RBD image got 
corrupted, and even fsck was unable to repair it (unable to set superblock 
flags). I don't know whether that is because the cache tier is still active 
and corrupted, or because ext4 itself got damaged so badly that it cannot 
work anymore.


Even with the VM that was using that pool stopped, I cannot flush it.

So what did I do wrong? Can I get my data back? Is it safe to remove the 
cache tier, and how?


Using rados get I can dump the objects to disk, so why can I not flush 
(evict) them?


It looks like the same issue as on
http://tracker.ceph.com/issues/12659
but it is unresolved.

I also have some snapshot of RBD image in the cold pool, but that should 
not cause problems in production.


I'm using 12.2.1 version on all 4 nodes.

With regards
Jan Pekar
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Duplicate snapid's

2017-11-30 Thread Kjetil Joergensen
Hi,

we currently do not understand how we got into this situation; nevertheless,
we have a set of rbd images which have the same SNAPID in the same pool.

kjetil@sc2-r10-u09:~$ rbd snap ls _qa-staging_foo_partial_db
SNAPID NAME   SIZE
478104 2017-11-29.001 2 MB
kjetil@sc2-r10-u09:~$ rbd snap ls _qa-staging_bar_decimated_be
SNAPID NAME   SIZE
478104 2017-11-27.001 30720 kB

(We have a small collection of these)

I currently believe this is bad - is that correct?

My rudimentary understanding is that a snapid is monotonically increasing
and unique within a pool. If that is correct, this becomes bad the moment
one of the snapshots gets removed: its snapid would get put into
removed_snaps, and at some point the OSDs would go trimming and might
prematurely get rid of clones.
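The removed_snaps intervals I'm referring to are the ones visible per pool
in the osdmap, e.g.:

ceph osd dump | grep removed_snaps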

Cheers,
-- 
Kjetil Joergensen 
SRE, Medallia Inc
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] dropping trusty

2017-11-30 Thread David Galloway
On 11/30/2017 12:21 PM, Sage Weil wrote:
> We're talking about dropping trusty support for mimic due to the old 
> compiler (incomplete C++11), hassle of using an updated toolchain, general 
> desire to stop supporting old stuff, and lack of user objections to 
> dropping it in the next release.
> 
> We would continue to build trusty packages for luminous and older 
> releases, just not mimic going forward.
> 
> My question is whether we should drop all of the trusty installs on smithi 
> and focus testing on xenial and centos.  I haven't seen any trusty related 
> failures in half a year.  There were some kernel-related issues 6+ months 
> ago that are resolved, and there is a valgrind issue with xenial that is 
> making us do valgrind only on centos, but otherwise I don't recall any 
> other problems.  I think the likelihood of a trusty-specific regression on 
> luminous/jewel is low.  Note that we can still do install and smoke 
> testing on VMs to ensure the packages work; we just wouldn't stress test.
> 
> Does this seem reasonable?  If so, we could reimage the trusty hosts 
> immediately, right?
> 
> Am I missing anything?
> 

Someone would need to prune through the qa dir and make sure nothing
relies on trusty for tests.  We've gotten into a bind recently with the
testing of FOG [1] where jobs are stuck in Waiting for a long time
(tying up workers) because jobs are requesting Trusty.  We got close to
having zero Trusty testnodes since the wip-fog branch has been reimaging
baremetal testnodes on every job.
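A rough first pass would just be grepping the suites for anything
hard-coding trusty, something along these lines against a ceph.git checkout:

grep -rl -e trusty -e '14\.04' qa/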

But other than that, yes, I can reimage the Trusty testnodes.  Once FOG
is merged into teuthology master, we won't have to worry about this
anymore since jobs will automatically reimage machines based on what
distro they require.

[1] https://github.com/ceph/teuthology/compare/wip-fog
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph Developers Monthly - December

2017-11-30 Thread Leonardo Vaz
Hey Cephers,

This is just a friendly reminder that the next Ceph Developer Monthly
meeting is coming up:

 http://wiki.ceph.com/Planning

If you have work that you're doing that is feature work, significant
backports, or anything you would like to discuss with the core team,
please add it to the following page:

 http://wiki.ceph.com/CDM_06-DEC-2017

If you have questions or comments, please let us know.

Kindest regards,

Leo

-- 
Leonardo Vaz
Ceph Community Manager
Open Source and Standards Team
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph all-nvme mysql performance tuning

2017-11-30 Thread German Anders
That's correct, IPoIB for the backend (already configured the irq
affinity), and 10GbE on the frontend. I would love to try RDMA, but like
you said it is not stable for production, so I think I'll have to wait for
that. Yeah, the thing is that it's not my decision to go for 50GbE or
100GbE... :( so.. 10GbE for the front-end it will be...

It would be really helpful if someone could run the following sysbench test
on a MySQL DB so I could make some comparisons:

*my.cnf *configuration file:

[mysqld_safe]
nice= 0
pid-file= /home/test_db/mysql/mysql.pid

[client]
port= 33033
socket  = /home/test_db/mysql/mysql.sock

[mysqld]
user= test_db
port= 33033
socket  = /home/test_db/mysql/mysql.sock
pid-file= /home/test_db/mysql/mysql.pid
log-error   = /home/test_db/mysql/mysql.err
datadir = /home/test_db/mysql/data
tmpdir  = /tmp
server-id   = 1

# ** Binlogging **
#log-bin= /home/test_db
/mysql/binlog/mysql-bin
#log_bin_index  = /home/test_db
/mysql/binlog/mysql-bin.index
expire_logs_days= 1
max_binlog_size = 512MB

thread_handling = pool-of-threads
thread_pool_max_threads = 300


# ** Slow query log **
slow_query_log  = 1
slow_query_log_file = /home/test_db/mysql/mysql-slow.log
long_query_time = 10
log_output  = FILE
log_slow_slave_statements   = 1
log_slow_verbosity  = query_plan,innodb,explain

# ** INNODB Specific options **
transaction_isolation   = READ-COMMITTED
innodb_buffer_pool_size = 12G
innodb_data_file_path   = ibdata1:256M:autoextend
innodb_thread_concurrency   = 16
innodb_log_file_size= 256M
innodb_log_files_in_group   = 3
innodb_file_per_table
innodb_log_buffer_size  = 16M
innodb_stats_on_metadata= 0
innodb_lock_wait_timeout= 30
# innodb_flush_method   = O_DSYNC
innodb_flush_method = O_DIRECT
max_connections = 1
max_connect_errors  = 99
max_allowed_packet  = 128M
skip-host-cache
skip-name-resolve
explicit_defaults_for_timestamp = 1
performance_schema  = OFF
log_warnings= 2
event_scheduler = ON

# ** Specific Galera Cluster Settings **
binlog_format   = ROW
default-storage-engine  = innodb
query_cache_size= 0
query_cache_type= 0


Volume is just an RBD (on a RF=3 pool) with the default 22 bit order
mounted on */home/test_db/mysql/data*

commands for the test:

sysbench
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=10
--rand-type=uniform --rand-init=on cleanup > /dev/null 2>/dev/null

sysbench
--test=/usr/share/sysbench/tests/include/oltp_legacy/parallel_prepare.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=10
--rand-type=uniform --rand-init=on prepare > /dev/null 2>/dev/null

sysbench --test=/usr/share/sysbench/tests/include/oltp_legacy/oltp.lua
--mysql-host= --mysql-port=33033 --mysql-user=sysbench
--mysql-password=sysbench --mysql-db=sysbench --mysql-table-engine=innodb
--db-driver=mysql --oltp_tables_count=10 --oltp-test-mode=complex
--oltp-read-only=off --oltp-table-size=20 --threads=20
--rand-type=uniform --rand-init=on --time=120 run >
result_sysbench_perf_test.out 2>/dev/null

I'm looking for tps, qps and the 95th percentile latency. Could anyone with
an all-NVMe cluster run the test and share the results? I would really
appreciate the help :)
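In case it helps, I'm just pulling those numbers straight out of the
sysbench output, roughly like this (the exact labels may differ a bit
between sysbench versions):

grep -E 'transactions:|queries:|95th percentile' result_sysbench_perf_test.out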

Thanks in advance,

Best,


*German *

2017-11-29 19:14 GMT-03:00 Zoltan Arnold Nagy :

> On 2017-11-27 14:02, German Anders wrote:
>
>> 4x 2U servers:
>>   1x 82599ES 10-Gigabit SFI/SFP+ Network Connection
>>   1x Mellanox ConnectX-3 InfiniBand FDR 56Gb/s Adapter (dual port)
>>
> so I assume you are using IPoIB as 

[ceph-users] dropping trusty

2017-11-30 Thread Sage Weil
We're talking about dropping trusty support for mimic due to the old 
compiler (incomplete C++11), hassle of using an updated toolchain, general 
desire to stop supporting old stuff, and lack of user objections to 
dropping it in the next release.

We would continue to build trusty packages for luminous and older 
releases, just not mimic going forward.

My question is whether we should drop all of the trusty installs on smithi 
and focus testing on xenial and centos.  I haven't seen any trusty related 
failures in half a year.  There were some kernel-related issues 6+ months 
ago that are resolved, and there is a valgrind issue with xenial that is 
making us do valgrind only on centos, but otherwise I don't recall any 
other problems.  I think the likelihood of a trusty-specific regression on 
luminous/jewel is low.  Note that we can still do install and smoke 
testing on VMs to ensure the packages work; we just wouldn't stress test.

Does this seem reasonable?  If so, we could reimage the trusty hosts 
immediately, right?

Am I missing anything?

sage
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk removal roadmap (was ceph-disk is now deprecated)

2017-11-30 Thread Peter Woodman
How quickly are you planning to cut 12.2.3?

On Thu, Nov 30, 2017 at 4:25 PM, Alfredo Deza  wrote:
> Thanks all for your feedback on deprecating ceph-disk, we are very
> excited to be able to move forwards on a much more robust tool and
> process for deploying and handling activation of OSDs, removing the
> dependency on UDEV which has been a tremendous source of constant
> issues.
>
> Initially (see "killing ceph-disk" thread [0]) we planned for removal
> of Mimic, but we didn't want to introduce the deprecation warnings up
> until we had an out for those who had OSDs deployed in previous
> releases with ceph-disk (we are now able to handle those as well).
> That is the reason ceph-volume, although present since the first
> Luminous release, hasn't been pushed forward much.
>
> Now that we feel like we can cover almost all cases, we would really
> like to see a wider usage so that we can improve on issues/experience.
>
> Given that 12.2.2 is already in the process of getting released, we
> can't undo the deprecation warnings for that version, but we will
> remove them for 12.2.3, add them back again in Mimic, which will mean
> ceph-disk will be kept around a bit longer, and finally fully removed
> by N.
>
> To recap:
>
> * ceph-disk deprecation warnings will stay for 12.2.2
> * deprecation warnings will be removed in 12.2.3 (and from all later
> Luminous releases)
> * deprecation warnings will be added again in ceph-disk for all Mimic releases
> * ceph-disk will no longer be available for the 'N' release, along
> with the UDEV rules
>
> I believe these four points address most of the concerns voiced in
> this thread, and should give enough time to port clusters over to
> ceph-volume.
>
> [0] 
> http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021358.html
>
> On Thu, Nov 30, 2017 at 8:22 AM, Daniel Baumann  wrote:
>> On 11/30/17 14:04, Fabian Grünbichler wrote:
>>> point is - you should not purposefully attempt to annoy users and/or
>>> downstreams by changing behaviour in the middle of an LTS release cycle,
>>
>> exactly. upgrading the patch level (x.y.z to x.y.z+1) should imho never
>> introduce a behaviour-change, regardless if it's "just" adding new
>> warnings or not.
>>
>> this is a stable update we're talking about, even more so since it's an
>> LTS release. you never know how people use stuff (e.g. by parsing stupid
>> things), so such behaviour-change will break stuff for *some* people
>> (granted, most likely a really low number).
>>
>> my expection to an stable release is, that it stays, literally, stable.
>> that's the whole point of having it in the first place. otherwise we
>> would all be running git snapshots and update randomly to newer ones.
>>
>> adding deprecation messages in mimic makes sense, and getting rid of
>> it/not provide support for it in mimic+1 is reasonable.
>>
>> Regards,
>> Daniel
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph-disk removal roadmap (was ceph-disk is now deprecated)

2017-11-30 Thread Alfredo Deza
Thanks all for your feedback on deprecating ceph-disk, we are very
excited to be able to move forwards on a much more robust tool and
process for deploying and handling activation of OSDs, removing the
dependency on UDEV which has been a tremendous source of constant
issues.

Initially (see "killing ceph-disk" thread [0]) we planned for removal
of Mimic, but we didn't want to introduce the deprecation warnings up
until we had an out for those who had OSDs deployed in previous
releases with ceph-disk (we are now able to handle those as well).
That is the reason ceph-volume, although present since the first
Luminous release, hasn't been pushed forward much.

Now that we feel like we can cover almost all cases, we would really
like to see a wider usage so that we can improve on issues/experience.

Given that 12.2.2 is already in the process of getting released, we
can't undo the deprecation warnings for that version, but we will
remove them for 12.2.3, add them back again in Mimic, which will mean
ceph-disk will be kept around a bit longer, and finally fully removed
by N.

To recap:

* ceph-disk deprecation warnings will stay for 12.2.2
* deprecation warnings will be removed in 12.2.3 (and from all later
Luminous releases)
* deprecation warnings will be added again in ceph-disk for all Mimic releases
* ceph-disk will no longer be available for the 'N' release, along
with the UDEV rules

I believe these four points address most of the concerns voiced in
this thread, and should give enough time to port clusters over to
ceph-volume.

[0] http://lists.ceph.com/pipermail/ceph-users-ceph.com/2017-October/021358.html

On Thu, Nov 30, 2017 at 8:22 AM, Daniel Baumann  wrote:
> On 11/30/17 14:04, Fabian Grünbichler wrote:
>> point is - you should not purposefully attempt to annoy users and/or
>> downstreams by changing behaviour in the middle of an LTS release cycle,
>
> exactly. upgrading the patch level (x.y.z to x.y.z+1) should imho never
> introduce a behaviour-change, regardless if it's "just" adding new
> warnings or not.
>
> this is a stable update we're talking about, even more so since it's an
> LTS release. you never know how people use stuff (e.g. by parsing stupid
> things), so such behaviour-change will break stuff for *some* people
> (granted, most likely a really low number).
>
> my expection to an stable release is, that it stays, literally, stable.
> that's the whole point of having it in the first place. otherwise we
> would all be running git snapshots and update randomly to newer ones.
>
> adding deprecation messages in mimic makes sense, and getting rid of
> it/not provide support for it in mimic+1 is reasonable.
>
> Regards,
> Daniel
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread Denes Dolhay
As per your ceph status, it seems that you have 19 pools; are all of them 
erasure coded as 3+2?


It seems that when you took the node offline Ceph could move some of 
the PGs to other nodes (it seems that one or more pools do not require 
all 5 OSDs to be healthy. Maybe they are replicated, or not 3+2 
erasure coded?)


These PGs are the active+clean+remapped ones. (Ceph could successfully put 
these on other OSDs to maintain the replica count / erasure coding 
profile, and this remapping process completed.)


Some other PGs do seem to require all 5 OSDs to be present; these are 
the "undersized" ones.



One other thing: if your failure domain is osd and not host or a larger 
unit, then Ceph will not try to place all replicas on different servers, 
just on different OSDs, and hence it can satisfy the criteria even if one 
of the hosts is down. Such a setting would be highly inadvisable on a 
production system!
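A quick way to double-check is to dump the crush rule and the EC profile in
use and look at the failure domain type there, e.g. (profile name is
whatever you used):

ceph osd crush rule dump
ceph osd erasure-code-profile get <profile-name>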



Denes.

On 11/30/2017 02:45 PM, David Turner wrote:


active+clean+remapped is not a healthy state for a PG. If it actually 
we're going to a new osd it would say backfill+wait or backfilling and 
eventually would get back to active+clean.


I'm not certain what the active+clean+remapped state means. Perhaps a 
PG query, PG dump, etc can give more insight. In any case, this is not 
a healthy state and you're still testing removing a node to have less 
than you need to be healthy.



On Thu, Nov 30, 2017, 5:38 AM Jakub Jaszewski wrote:


I've just did ceph upgrade jewel -> luminous and am facing the
same case...

# EC profile
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8

5 hosts in the cluster and I run systemctl stop ceph.target on one
of them
some PGs from EC pool were remapped (active+clean+remapped state)
even when there was not enough hosts in the cluster but some are
still in active+undersized+degraded state


root@host01:~# ceph status
  cluster:
    id: a6f73750-1972-47f6-bcf5-a99753be65ad
    health: HEALTH_WARN
            Degraded data redundancy: 876/9115 objects degraded
(9.611%), 540 pgs unclean, 540 pgs degraded, 540 pgs undersized
  services:
    mon: 3 daemons, quorum host01,host02,host03
    mgr: host01(active), standbys: host02, host03
    osd: 60 osds: 48 up, 48 in; 484 remapped pgs
    rgw: 3 daemons active
  data:
    pools:   19 pools, 3736 pgs
    objects: 1965 objects, 306 MB
    usage:   5153 MB used, 174 TB / 174 TB avail
    pgs:     876/9115 objects degraded (9.611%)
             2712 active+clean
             540  active+undersized+degraded
             484  active+clean+remapped
  io:
    client:   17331 B/s rd, 20 op/s rd, 0 op/s wr
root@host01:~#



Anyone here able to explain this behavior to me ?

Jakub
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Wido den Hollander

> Op 30 november 2017 om 14:19 schreef Jason Dillaman :
> 
> 
> On Thu, Nov 30, 2017 at 4:00 AM, Wido den Hollander  wrote:
> >
> >> Op 29 november 2017 om 14:56 schreef Jason Dillaman :
> >>
> >>
> >> We experienced this problem in the past on older (pre-Jewel) releases
> >> where a PG split that affected the RBD header object would result in
> >> the watch getting lost by librados. Any chance you know if the
> >> affected RBD header objects were involved in a PG split? Can you
> >> generate a gcore dump of one of the affected VMs and ceph-post-file it
> >> for analysis?
> >>
> >
> > There was no PG splitting in the recent months on this cluster, so that's 
> > not something that might have happened here.
> 
> Possible alternative explanation: are you using cache tiering?

No, not that either. It's running 3x replication. Standard RBD behind OpenStack.

Cluster has around 2.000 OSDs running all with 4TB disks and 3x replication.

I'll wait for the gcore dump of a running VM, but that may take a few days.
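(What I asked them for is basically nothing more than a plain core dump of
the qemu process of an affected VM, posted with ceph-post-file, roughly:

gcore -o /tmp/affected-vm <pid-of-qemu-process>
ceph-post-file /tmp/affected-vm.<pid>

so no special tooling is needed on their side.)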

Wido

> 
> > I've asked the OpenStack team for a gcore dump, but they have to get that 
> > cleared before they can send it to me.
> >
> > This might take a bit of time!
> >
> > Wido
> >
> >> As for the VM going R/O, that is the expected behavior when a client
> >> breaks the exclusive lock held by a (dead) client.
> >>
> >> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
> >> > Hi,
> >> >
> >> > On a OpenStack environment I encountered a VM which went into R/O mode 
> >> > after a RBD snapshot was created.
> >> >
> >> > Digging into this I found 10s (out of thousands) RBD images which DO 
> >> > have a running VM, but do NOT have a watcher on the RBD image.
> >> >
> >> > For example:
> >> >
> >> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >> >
> >> > 'Watchers: none'
> >> >
> >> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on 
> >> > the client.
> >> >
> >> > In the meantime the cluster was already upgraded to 10.2.10
> >> >
> >> > Looking further I also found a Compute node with 10.2.10 installed which 
> >> > also has RBD images without watchers.
> >> >
> >> > Restarting or live migrating the VM to a different host resolves this 
> >> > issue.
> >> >
> >> > The internet is full of posts where RBD images still have Watchers when 
> >> > people don't expect them, but in this case I'm expecting a watcher which 
> >> > isn't there.
> >> >
> >> > The main problem right now is that creating a snapshot potentially puts 
> >> > a VM in Read-Only state because of the lack of notification.
> >> >
> >> > Has anybody seen this as well?
> >> >
> >> > Thanks,
> >> >
> >> > Wido
> >> > ___
> >> > ceph-users mailing list
> >> > ceph-users@lists.ceph.com
> >> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >>
> >>
> >>
> >> --
> >> Jason
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] rbd mount unmap network outage

2017-11-30 Thread David Turner
This doesn't answer your question, but maybe nudges you in a different
direction. CephFS seems like the much better solution for what you're
doing. You linked a 5-year-old blog post. CephFS was not a stable
technology at the time, but it's an excellent method to share a network FS
to multiple clients. There are even methods to export it over nfs, although
I'd personally set them up to mount it using ceph-fuse.
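Mounting it with ceph-fuse is a one-liner on the client, something like
(assuming a standard admin keyring in /etc/ceph):

ceph-fuse -m <mon-host>:6789 /mnt/cephfs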

On Thu, Nov 30, 2017, 2:34 AM Hauke Homburg  wrote:

> Hello,
>
> Currently I am working on an NFS HA cluster to export RBD images with NFS.
> To test the failover i tried the following:
>
> https://www.sebastien-han.fr/blog/2012/07/06/nfs-over-rbd/
>
> i set the rbdimage to exclusive Lock and the osd and mon timeout to 20
> Seconds.
>
> On one NFS server I mapped the RBD image with rbd map. After mapping I
> blocked the TCP ports with iptables to simulate a network outage (TCP
> ports 6789 and 6800:7300).
>
> I can see with rbd status that the watchers on the cluster itself are
> gone after the timeout.
>
> The nfs server gives encountered watch error -110.
>
> The NFS Server tries to connect libceph to another mon.
>
> When all this happens i cannot unmap the image.
>
> Ceph CLuster is 10.2.10 with Centos 7 the NFS Server is Debian 9. The
> Pacemaker ra is ceph-resource.agents 10.2.10.
>
> My intention is to unmap the image when the network outage happens,
> because of the failover, and because I don't want to mount one RBD image
> on two servers after the network outage is resolved, to prevent data
> damage.
>
> Thanks for the help
>
> Hauke
>
>
>
> --
> www.w3-creative.de
>
> www.westchat.de
>
> https://friendica.westchat.de/profile/hauke
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread David Turner
active+clean+remapped is not a healthy state for a PG. If it actually were
going to a new osd it would say backfill+wait or backfilling and eventually
would get back to active+clean.

I'm not certain what the active+clean+remapped state means. Perhaps a PG
query, PG dump, etc can give more insight. In any case, this is not a
healthy state and you're still testing removing a node to have less than
you need to be healthy.
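Something like this should be a start - pick one of the remapped PGs from
the dump and query it (syntax from memory, adjust as needed):

ceph pg dump pgs_brief | grep remapped
ceph pg <pgid> query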

On Thu, Nov 30, 2017, 5:38 AM Jakub Jaszewski 
wrote:

> I've just did ceph upgrade jewel -> luminous and am facing the same case...
>
> # EC profile
> crush-failure-domain=host
> crush-root=default
> jerasure-per-chunk-alignment=false
> k=3
> m=2
> plugin=jerasure
> technique=reed_sol_van
> w=8
>
> 5 hosts in the cluster and I run systemctl stop ceph.target on one of them
> some PGs from EC pool were remapped (active+clean+remapped state) even
> when there was not enough hosts in the cluster but some are still in
> active+undersized+degraded state
>
>
> root@host01:~# ceph status
>   cluster:
> id: a6f73750-1972-47f6-bcf5-a99753be65ad
> health: HEALTH_WARN
> Degraded data redundancy: 876/9115 objects degraded (9.611%),
> 540 pgs unclean, 540 pgs degraded, 540 pgs undersized
>
>   services:
> mon: 3 daemons, quorum host01,host02,host03
> mgr: host01(active), standbys: host02, host03
> osd: 60 osds: 48 up, 48 in; 484 remapped pgs
> rgw: 3 daemons active
>
>   data:
> pools:   19 pools, 3736 pgs
> objects: 1965 objects, 306 MB
> usage:   5153 MB used, 174 TB / 174 TB avail
> pgs: 876/9115 objects degraded (9.611%)
>  2712 active+clean
>  540  active+undersized+degraded
>  484  active+clean+remapped
>
>   io:
> client:   17331 B/s rd, 20 op/s rd, 0 op/s wr
>
> root@host01:~#
>
>
>
> Anyone here able to explain this behavior to me ?
>
> Jakub
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is now deprecated

2017-11-30 Thread Daniel Baumann
On 11/30/17 14:04, Fabian Grünbichler wrote:
> point is - you should not purposefully attempt to annoy users and/or
> downstreams by changing behaviour in the middle of an LTS release cycle,

exactly. upgrading the patch level (x.y.z to x.y.z+1) should imho never
introduce a behaviour-change, regardless if it's "just" adding new
warnings or not.

this is a stable update we're talking about, even more so since it's an
LTS release. you never know how people use stuff (e.g. by parsing stupid
things), so such behaviour-change will break stuff for *some* people
(granted, most likely a really low number).

my expectation of a stable release is that it stays, literally, stable.
that's the whole point of having it in the first place. otherwise we
would all be running git snapshots and update randomly to newer ones.

adding deprecation messages in mimic makes sense, and getting rid of
it/not provide support for it in mimic+1 is reasonable.

Regards,
Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Jason Dillaman
On Thu, Nov 30, 2017 at 4:00 AM, Wido den Hollander  wrote:
>
>> Op 29 november 2017 om 14:56 schreef Jason Dillaman :
>>
>>
>> We experienced this problem in the past on older (pre-Jewel) releases
>> where a PG split that affected the RBD header object would result in
>> the watch getting lost by librados. Any chance you know if the
>> affected RBD header objects were involved in a PG split? Can you
>> generate a gcore dump of one of the affected VMs and ceph-post-file it
>> for analysis?
>>
>
> There was no PG splitting in the recent months on this cluster, so that's not 
> something that might have happened here.

Possible alternative explanation: are you using cache tiering?

> I've asked the OpenStack team for a gcore dump, but they have to get that 
> cleared before they can send it to me.
>
> This might take a bit of time!
>
> Wido
>
>> As for the VM going R/O, that is the expected behavior when a client
>> breaks the exclusive lock held by a (dead) client.
>>
>> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
>> > Hi,
>> >
>> > On a OpenStack environment I encountered a VM which went into R/O mode 
>> > after a RBD snapshot was created.
>> >
>> > Digging into this I found 10s (out of thousands) RBD images which DO have 
>> > a running VM, but do NOT have a watcher on the RBD image.
>> >
>> > For example:
>> >
>> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
>> >
>> > 'Watchers: none'
>> >
>> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on 
>> > the client.
>> >
>> > In the meantime the cluster was already upgraded to 10.2.10
>> >
>> > Looking further I also found a Compute node with 10.2.10 installed which 
>> > also has RBD images without watchers.
>> >
>> > Restarting or live migrating the VM to a different host resolves this 
>> > issue.
>> >
>> > The internet is full of posts where RBD images still have Watchers when 
>> > people don't expect them, but in this case I'm expecting a watcher which 
>> > isn't there.
>> >
>> > The main problem right now is that creating a snapshot potentially puts a 
>> > VM in Read-Only state because of the lack of notification.
>> >
>> > Has anybody seen this as well?
>> >
>> > Thanks,
>> >
>> > Wido
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-disk is now deprecated

2017-11-30 Thread Fabian Grünbichler
On Thu, Nov 30, 2017 at 07:04:33AM -0500, Alfredo Deza wrote:
> On Thu, Nov 30, 2017 at 6:31 AM, Fabian Grünbichler
>  wrote:
> > On Tue, Nov 28, 2017 at 10:39:31AM -0800, Vasu Kulkarni wrote:
> >> On Tue, Nov 28, 2017 at 9:22 AM, David Turner  
> >> wrote:
> >> > Isn't marking something as deprecated meaning that there is a better 
> >> > option
> >> > that we want you to use and you should switch to it sooner than later? I
> >> > don't understand how this is ready to be marked as such if ceph-volume 
> >> > can't
> >> > be switched to for all supported use cases. If ZFS, encryption, FreeBSD, 
> >> > etc
> >> > are all going to be supported under ceph-volume, then how can ceph-disk 
> >> > be
> >> > deprecated before ceph-volume can support them? I can imagine many Ceph
> >> > admins wasting time chasing an erroneous deprecated warning because it 
> >> > came
> >> > out before the new solution was mature enough to replace the existing
> >> > solution.
> >>
> >> There is no need to worry about this deprecation, Its mostly for
> >> admins to be prepared
> >> for the changes coming ahead and its mostly for *new* installations
> >> that can plan on using ceph-volume which provides
> >> great flexibility compared to ceph-disk.
> >
> > changing existing installations to output deprecation warnings from one
> > minor release to the next means it is not just for new installations
> > though, no matter how you spin it. a mention in the release notes and
> > docs would be enough to get admins to test and use ceph-volume on new
> > installations.
> >
> > I am pretty sure many admins will be bothered by all nodes running OSDs
> > spamming the logs and their terminals with huge deprecation warnings on
> > each OSD activation[1] or other actions involving ceph-disk, and having
> > this state for the remainder of Luminous unless they switch to a new
> > (and as of yet not battle-tested) way of activating their OSDs seems
> > crazy to me.
> >
> > I know our users will be, and given the short notice and huge impact
> > this would have we will likely have to remove the deprecation warnings
> > altogether in our (downstream) packages until we have completed testing
> > of and implementing support for ceph-volume..
> >
> >>
> >> a) many dont use ceph-disk or ceph-volume directly, so the tool you
> >> have right now eg: ceph-deploy or ceph-ansible
> >> will still support the ceph-disk, the previous ceph-deploy release is
> >> still available from pypi
> >>   https://pypi.python.org/pypi/ceph-deploy
> >
> > we have >> 10k (user / customer managed!) installations on Ceph Luminous
> > alone, all using our wrapper around ceph-disk - changing something like
> > this in the middle of a release causes huge headaches for downstreams
> > like us, and is not how a stable project is supposed to be run.
> 
> If you are using a wrapper around ceph-disk, then silencing the
> deprecation warnings should be easy to do.
> 
> These are plain Python warnings, and can be silenced within Python or
> environment variables. There are some details
> on how to do that here https://github.com/ceph/ceph/pull/18989

the problem is not how to get rid of the warnings, but having to when
upgrading from one bug fix release to the next.

> >
> >>
> >> b) also the current push will help anyone who is using ceph-deploy or
> >> ceph-disk in scripts/chef/etc
> >>to have time to think about using newer cli based on ceph-volume
> >
> > a regular deprecate at the beginning of the release cycle were the
> > replacement is deemed stable, remove in the next release cycle would be
> > adequate for this purpose.
> >
> > I don't understand the rush to shoe-horn ceph-volume into existing
> > supposedly stable Ceph installations at all - especially given the
> > current state of ceph-volume (we'll file bugs once we are done writing
> > them up, but a quick rudimentary test already showed stuff like choking
> > on valid ceph.conf files because they contain leading whitespace and
> > incomplete error handling leading to crush map entries for failed OSD
> > creation attempts).
> 
> Any ceph-volume bugs are welcomed as soon as you can get them to us.
> Waiting to get them reported is a problem, since ceph-volume
> is tied to Ceph releases, it means that these will now have to wait
> for another point release instead of having them in the upcoming one.

we started evaluating ceph-volume at the start of this thread in order
to see whether a switch-over pre-Mimic is feasible. we don't
artificially delay bug reports, it just takes time to test, find bugs
and report them properly.

> 
> >
> > I DO understand the motivation behind ceph-volume and the desire to get
> > rid of the udev-based trigger mess, but the solution is not to scare
> > users into switching in the middle of a release by introducing
> > deprecation warnings for a core piece of the deployment stack.
> >
> > IMHO the only reason to push or force such a 

Re: [ceph-users] ceph-disk is now deprecated

2017-11-30 Thread Alfredo Deza
On Thu, Nov 30, 2017 at 6:31 AM, Fabian Grünbichler
 wrote:
> On Tue, Nov 28, 2017 at 10:39:31AM -0800, Vasu Kulkarni wrote:
>> On Tue, Nov 28, 2017 at 9:22 AM, David Turner  wrote:
>> > Isn't marking something as deprecated meaning that there is a better option
>> > that we want you to use and you should switch to it sooner than later? I
>> > don't understand how this is ready to be marked as such if ceph-volume 
>> > can't
>> > be switched to for all supported use cases. If ZFS, encryption, FreeBSD, 
>> > etc
>> > are all going to be supported under ceph-volume, then how can ceph-disk be
>> > deprecated before ceph-volume can support them? I can imagine many Ceph
>> > admins wasting time chasing an erroneous deprecated warning because it came
>> > out before the new solution was mature enough to replace the existing
>> > solution.
>>
>> There is no need to worry about this deprecation, Its mostly for
>> admins to be prepared
>> for the changes coming ahead and its mostly for *new* installations
>> that can plan on using ceph-volume which provides
>> great flexibility compared to ceph-disk.
>
> changing existing installations to output deprecation warnings from one
> minor release to the next means it is not just for new installations
> though, no matter how you spin it. a mention in the release notes and
> docs would be enough to get admins to test and use ceph-volume on new
> installations.
>
> I am pretty sure many admins will be bothered by all nodes running OSDs
> spamming the logs and their terminals with huge deprecation warnings on
> each OSD activation[1] or other actions involving ceph-disk, and having
> this state for the remainder of Luminous unless they switch to a new
> (and as of yet not battle-tested) way of activating their OSDs seems
> crazy to me.
>
> I know our users will be, and given the short notice and huge impact
> this would have we will likely have to remove the deprecation warnings
> altogether in our (downstream) packages until we have completed testing
> of and implementing support for ceph-volume..
>
>>
>> a) many dont use ceph-disk or ceph-volume directly, so the tool you
>> have right now eg: ceph-deploy or ceph-ansible
>> will still support the ceph-disk, the previous ceph-deploy release is
>> still available from pypi
>>   https://pypi.python.org/pypi/ceph-deploy
>
> we have >> 10k (user / customer managed!) installations on Ceph Luminous
> alone, all using our wrapper around ceph-disk - changing something like
> this in the middle of a release causes huge headaches for downstreams
> like us, and is not how a stable project is supposed to be run.

If you are using a wrapper around ceph-disk, then silencing the
deprecation warnings should be easy to do.

These are plain Python warnings, and can be silenced within Python or
environment variables. There are some details
on how to do that here https://github.com/ceph/ceph/pull/18989
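For a wrapper the quickest knob is probably the PYTHONWARNINGS environment
variable, e.g. something along the lines of:

PYTHONWARNINGS=ignore ceph-disk activate ...

or a more targeted warning filter as described in that PR.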
>
>>
>> b) also the current push will help anyone who is using ceph-deploy or
>> ceph-disk in scripts/chef/etc
>>to have time to think about using newer cli based on ceph-volume
>
> a regular deprecate at the beginning of the release cycle were the
> replacement is deemed stable, remove in the next release cycle would be
> adequate for this purpose.
>
> I don't understand the rush to shoe-horn ceph-volume into existing
> supposedly stable Ceph installations at all - especially given the
> current state of ceph-volume (we'll file bugs once we are done writing
> them up, but a quick rudimentary test already showed stuff like choking
> on valid ceph.conf files because they contain leading whitespace and
> incomplete error handling leading to crush map entries for failed OSD
> creation attempts).

Any ceph-volume bugs are welcomed as soon as you can get them to us.
Waiting to get them reported is a problem, since ceph-volume
is tied to Ceph releases, it means that these will now have to wait
for another point release instead of having them in the upcoming one.

>
> I DO understand the motivation behind ceph-volume and the desire to get
> rid of the udev-based trigger mess, but the solution is not to scare
> users into switching in the middle of a release by introducing
> deprecation warnings for a core piece of the deployment stack.
>
> IMHO the only reason to push or force such a switch in this manner would
> be a (grave) security or data corruption bug, which is not the case at
> all here..

There is no forcing here. A deprecation warning was added, which can
be silenced.
>
> 1: have you looked at the journal / boot logs of a mid-sized OSD node
> using ceph-disk for activation with the deprecation warning active?  if
> my boot log is suddenly filled with 20% warnings, my first reaction will
> be that something is very wrong.. my likely second reaction when
> realizing what is going on is probably not fit for posting to a public
> mailing list ;)

The purpose of the deprecation warning is to be 

Re: [ceph-users] ceph-disk is now deprecated

2017-11-30 Thread Fabian Grünbichler
On Tue, Nov 28, 2017 at 10:39:31AM -0800, Vasu Kulkarni wrote:
> On Tue, Nov 28, 2017 at 9:22 AM, David Turner  wrote:
> > Isn't marking something as deprecated meaning that there is a better option
> > that we want you to use and you should switch to it sooner than later? I
> > don't understand how this is ready to be marked as such if ceph-volume can't
> > be switched to for all supported use cases. If ZFS, encryption, FreeBSD, etc
> > are all going to be supported under ceph-volume, then how can ceph-disk be
> > deprecated before ceph-volume can support them? I can imagine many Ceph
> > admins wasting time chasing an erroneous deprecated warning because it came
> > out before the new solution was mature enough to replace the existing
> > solution.
> 
> There is no need to worry about this deprecation, Its mostly for
> admins to be prepared
> for the changes coming ahead and its mostly for *new* installations
> that can plan on using ceph-volume which provides
> great flexibility compared to ceph-disk.

changing existing installations to output deprecation warnings from one
minor release to the next means it is not just for new installations
though, no matter how you spin it. a mention in the release notes and
docs would be enough to get admins to test and use ceph-volume on new
installations.

I am pretty sure many admins will be bothered by all nodes running OSDs
spamming the logs and their terminals with huge deprecation warnings on
each OSD activation[1] or other actions involving ceph-disk, and having
this state for the remainder of Luminous unless they switch to a new
(and as of yet not battle-tested) way of activating their OSDs seems
crazy to me.

I know our users will be, and given the short notice and huge impact
this would have we will likely have to remove the deprecation warnings
altogether in our (downstream) packages until we have completed testing
of and implementing support for ceph-volume..

> 
> a) many dont use ceph-disk or ceph-volume directly, so the tool you
> have right now eg: ceph-deploy or ceph-ansible
> will still support the ceph-disk, the previous ceph-deploy release is
> still available from pypi
>   https://pypi.python.org/pypi/ceph-deploy

we have >> 10k (user / customer managed!) installations on Ceph Luminous
alone, all using our wrapper around ceph-disk - changing something like
this in the middle of a release causes huge headaches for downstreams
like us, and is not how a stable project is supposed to be run.

> 
> b) also the current push will help anyone who is using ceph-deploy or
> ceph-disk in scripts/chef/etc
>to have time to think about using newer cli based on ceph-volume

a regular deprecation at the beginning of a release cycle where the
replacement is deemed stable, with removal in the next release cycle, would
be adequate for this purpose.

I don't understand the rush to shoe-horn ceph-volume into existing
supposedly stable Ceph installations at all - especially given the
current state of ceph-volume (we'll file bugs once we are done writing
them up, but a quick rudimentary test already showed stuff like choking
on valid ceph.conf files because they contain leading whitespace and
incomplete error handling leading to crush map entries for failed OSD
creation attempts).

I DO understand the motivation behind ceph-volume and the desire to get
rid of the udev-based trigger mess, but the solution is not to scare
users into switching in the middle of a release by introducing
deprecation warnings for a core piece of the deployment stack.

IMHO the only reason to push or force such a switch in this manner would
be a (grave) security or data corruption bug, which is not the case at
all here..

1: have you looked at the journal / boot logs of a mid-sized OSD node
using ceph-disk for activation with the deprecation warning active?  if
my boot log is suddenly filled with 20% warnings, my first reaction will
be that something is very wrong.. my likely second reaction when
realizing what is going on is probably not fit for posting to a public
mailing list ;)
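a quick and unscientific way to gauge the noise level on a single node after
boot - assuming the warning text contains "deprecat" - is something like:

journalctl -b | grep -c -i deprecat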

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CRUSH rule seems to work fine not for all PGs in erasure coded pools

2017-11-30 Thread Jakub Jaszewski
I've just done a ceph upgrade jewel -> luminous and am facing the same case...

# EC profile
crush-failure-domain=host
crush-root=default
jerasure-per-chunk-alignment=false
k=3
m=2
plugin=jerasure
technique=reed_sol_van
w=8

5 hosts in the cluster, and I ran systemctl stop ceph.target on one of them.
Some PGs from the EC pool were remapped (active+clean+remapped state) even
though there were not enough hosts in the cluster, but some are still in
the active+undersized+degraded state


root@host01:~# ceph status
  cluster:
id: a6f73750-1972-47f6-bcf5-a99753be65ad
health: HEALTH_WARN
Degraded data redundancy: 876/9115 objects degraded (9.611%),
540 pgs unclean, 540 pgs degraded, 540 pgs undersized

  services:
mon: 3 daemons, quorum host01,host02,host03
mgr: host01(active), standbys: host02, host03
osd: 60 osds: 48 up, 48 in; 484 remapped pgs
rgw: 3 daemons active

  data:
pools:   19 pools, 3736 pgs
objects: 1965 objects, 306 MB
usage:   5153 MB used, 174 TB / 174 TB avail
pgs: 876/9115 objects degraded (9.611%)
 2712 active+clean
 540  active+undersized+degraded
 484  active+clean+remapped

  io:
client:   17331 B/s rd, 20 op/s rd, 0 op/s wr

root@host01:~#



Anyone here able to explain this behavior to me ?

Jakub
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Can not delete snapshot with "ghost" children

2017-11-30 Thread Valery Tschopp

Hi,

We have a problem deleting a snapshot. There was a child image of the 
snapshot, but the child image was flattened. Now the snapshot still 
"thinks" it has children and cannot be deleted.


Snapshot and children:

$ rbd snap ls volumes/volume-49ccf5a6-4c17-434a-a087-f04acef978ef
SNAPID NAME  SIZE
 94183 snapshot-376e23d6-e723-4dbb-b558-174b275244b5 40960 MB

$ rbd children 
volumes/volume-49ccf5a6-4c17-434a-a087-f04acef978ef@snapshot-376e23d6-e723-4dbb-b558-174b275244b5

volumes/volume-a86350ad-2d4e-4863-bff3-67304b4b7b3c


Child image (was flattened, and is now without a parent):

$ rbd info volumes/volume-a86350ad-2d4e-4863-bff3-67304b4b7b3c
rbd image 'volume-a86350ad-2d4e-4863-bff3-67304b4b7b3c':
size 40960 MB in 10240 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.1d4f6a3cd198f8
format: 2
features: layering, exclusive-lock, object-map
flags:


When I try to delete the snapshot, ceph tells me that the snapshot is 
protected. And when I try to unprotect it, it fails telling me there is 
still a child!?!?


$ rbd snap rm 
volumes/volume-49ccf5a6-4c17-434a-a087-f04acef978ef@snapshot-376e23d6-e723-4dbb-b558-174b275244b5
rbd: snapshot 'snapshot-376e23d6-e723-4dbb-b558-174b275244b5' is 
protected from removal.
2017-11-30 11:09:31.054416 7fe80e8f9100 -1 librbd::Operations: snapshot 
is protected


$ rbd snap unprotect 
volumes/volume-49ccf5a6-4c17-434a-a087-f04acef978ef@snapshot-376e23d6-e723-4dbb-b558-174b275244b5
2017-11-30 11:09:56.899548 7fc6432cd700 -1 
librbd::SnapshotUnprotectRequest: cannot unprotect: at least 1 
child(ren) [1d4f6a3cd198f8] in pool 'volumes'
2017-11-30 11:09:56.899578 7fc6432cd700 -1 
librbd::SnapshotUnprotectRequest: encountered error: (16) Device or 
resource busy
2017-11-30 11:09:56.899588 7fc6432cd700 -1 
librbd::SnapshotUnprotectRequest: 0xab0d918540 should_complete_error: 
ret_val=-16
2017-11-30 11:09:56.902702 7fc6432cd700 -1 
librbd::SnapshotUnprotectRequest: 0xab0d918540 should_complete_error: 
ret_val=-16

rbd: unprotecting snap failed: (16) Device or resource busy
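
If it helps with debugging: the child bookkeeping that this check relies on is
kept in the rbd_children object of the pool containing the clones, and it can
be dumped with the plain rados CLI (the keys are binary-encoded, so this is
mainly useful to spot a leftover entry):

$ rados -p volumes listomapkeys rbd_children
$ rados -p volumes listomapvals rbd_children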


How can we solve this issue?

Cheers,
Valery

--
SWITCH
Valéry Tschopp, Software Engineer
Werdstrasse 2, P.O. Box, 8021 Zurich, Switzerland
email: valery.tsch...@switch.ch phone: +41 44 268 1544

30 years of pioneering the Swiss Internet.
Celebrate with us at https://swit.ch/30years




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] RBD image has no active watchers while OpenStack KVM VM is running

2017-11-30 Thread Wido den Hollander

> On 29 November 2017 at 14:56, Jason Dillaman wrote:
> 
> 
> We experienced this problem in the past on older (pre-Jewel) releases
> where a PG split that affected the RBD header object would result in
> the watch getting lost by librados. Any chance you know if the
> affected RBD header objects were involved in a PG split? Can you
> generate a gcore dump of one of the affected VMs and ceph-post-file it
> for analysis?
> 

There has been no PG splitting on this cluster in recent months, so that is
not what happened here.
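
For anyone who wants to cross-check this on their own cluster: the watch can
also be listed directly on the image's header object with rados. The header
object name is rbd_header.<id>, where <id> is the suffix of the
block_name_prefix (rbd_data.<id>) reported by rbd info; the id below is a
placeholder:

$ rbd info volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086 | grep block_name_prefix
$ rados -p volumes listwatchers rbd_header.<image-id>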

I've asked the OpenStack team for a gcore dump, but they have to get that 
cleared before they can send it to me.

This might take a bit of time!

Wido

> As for the VM going R/O, that is the expected behavior when a client
> breaks the exclusive lock held by a (dead) client.
> 
> On Wed, Nov 29, 2017 at 8:48 AM, Wido den Hollander  wrote:
> > Hi,
> >
> > In an OpenStack environment I encountered a VM which went into R/O mode 
> > after an RBD snapshot was created.
> >
> > Digging into this I found 10s (out of thousands) RBD images which DO have a 
> > running VM, but do NOT have a watcher on the RBD image.
> >
> > For example:
> >
> > $ rbd status volumes/volume-79773f2e-1f40-4eca-b9f0-953fa8d83086
> >
> > 'Watchers: none'
> >
> > The VM is however running since September 5th 2017 with Jewel 10.2.7 on the 
> > client.
> >
> > In the meantime the cluster was already upgraded to 10.2.10
> >
> > Looking further I also found a Compute node with 10.2.10 installed which 
> > also has RBD images without watchers.
> >
> > Restarting or live migrating the VM to a different host resolves this issue.
> >
> > The internet is full of posts where RBD images still have Watchers when 
> > people don't expect them, but in this case I'm expecting a watcher which 
> > isn't there.
> >
> > The main problem right now is that creating a snapshot potentially puts a 
> > VM in Read-Only state because of the lack of notification.
> >
> > Has anybody seen this as well?
> >
> > Thanks,
> >
> > Wido
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> 
> 
> 
> -- 
> Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS: costly MDS cache misses?

2017-11-30 Thread Yan, Zheng
On Thu, Nov 30, 2017 at 2:08 AM, Jens-U. Mozdzen  wrote:
> Hi *,
>
> while tracking down a different performance issue with CephFS (creating tar
> balls from CephFS-based directories takes multiple times as long as when
> backing up the same data from local disks, i.e. 56 hours instead of 7), we
> had a look at CephFS performance related to the size of the MDS process.
>
> Our Ceph cluster (Luminous 12.2.1) is using file-based OSDs, CephFS data is
> on SAS HDDs, meta data is on SAS SSDs.
>
> It came to mind that MDS memory consumption might cause the delays with
> "tar". But while below results don't confirm this (it actually confirms that
> MDS memory size does not affect CephFS read speed when the cache is
> sufficiently warm), it does show an almost 30% performance drop if the cache
> is filled with the wrong entries.
>
> After a fresh process start, our MDS takes about 450k memory, with 56k
> resident. I then start a tar run over 36 GB of small files (which I had also run
> a few minutes before MDS restart, to warm up disk caches):
>
> --- cut here ---
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>1233 ceph  20   0  446584  56000  15908 S  3.960 0.085   0:01.08
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 17:38:21 CET 2017
> 38245529600
> Wed Nov 29 17:44:27 CET 2017
> server01:~ #
>
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>   1233 ceph  20   0  485760 109156  16148 S  0.331 0.166   0:10.76
> ceph-mds
> --- cut here ---
>
> As you can see, there's only small growth in MDS virtual size.
>
> The job took 366 seconds, that's an average of about 100 MB/s.
>
> I repeat that job a few minutes later, to get numbers with a previously
> active MDS (the MDS cache should be warmed up now):
>
> --- cut here ---
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>   1233 ceph  20   0  494976 118404  16148 S  2.961 0.180   0:16.21
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 17:53:09 CET 2017
> 38245529600
> Wed Nov 29 17:58:53 CET 2017
> server01:~ #
>
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>   1233 ceph  20   0  508288 131368  16148 S  1.980 0.200   0:25.45
> ceph-mds
> --- cut here ---
>
> The job took 344 seconds, that's an average of about 106 MB/s. With only a
> single run per situation, these numbers aren't more than a rough estimate, of
> course.
>
> At 18:00:00, a file-based incremental backup job kicks in, which reads
> through most of the files on the CephFS, but only backing up those that were
> changed since the last run. This has nothing to do with our "tar" and is
> running on a different node, where CephFS is kernel-mounted as well. That
> backup job makes the MDS cache grow drastically, you can see MDS at more
> than 8 GB now.
>
> We then start another tar job (or rather two, to account for MDS caching),
> as before:
>
> --- cut here ---
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>   1233 ceph  20   0 8644776 7.750g  16184 S  0.990 12.39   6:45.24
> ceph-mds
>
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 18:13:20 CET 2017
> 38245529600
> Wed Nov 29 18:21:50 CET 2017
> server01:~ # date; tar -C /srv/cephfs/prod/fileshare/stuff/ -cf- . | wc -c;
> date
> Wed Nov 29 18:22:52 CET 2017
> 38245529600
> Wed Nov 29 18:28:28 CET 2017
> server01:~ #
>
>PID USER  PR  NIVIRTRESSHR S   %CPU  %MEM TIME+
> COMMAND
>   1233 ceph  20   0 8761512 7.642g  16184 S  3.300 12.22   7:03.52
> ceph-mds
> --- cut here ---
>
> The second run is even a bit quicker than the "warmed-up" run with the only
> partially filled cache (336 seconds, that's 108,5 MB/s).
>
> But the run against the filled-up MDS cache, where most (if not all) entries
> are no match for our tar lookups, took 510 seconds - that's 71,5 MB/s, instead
> of the roughly 100 MB/s when the cache was empty.
>
> This is by no means a precise benchmark, of course. But it at least seems to
> be an indicator that MDS cache misses are costly. (During the tests, only
> small amounts of changes in CephFS were likely - especially compared to the
> amount of reads and file lookups for their metadata.)
>
> Regards,
> Jens
>
> PS: Why so much memory for MDS in the first place? Because during those
> (hourly) incremental backup runs, we got a large number of MDS warnings
> about insufficient cache pressure responses from clients. Increasing the MDS
> cache size did help to avoid these.
>

I just found a kernel client bug. The kernel can fail to trim as many
capabilities as the MDS asked. The MDS needs to keep the corresponding inode
in its cache while a client holds a capability. The bug can explain the large
memory usage of the MDS during the backup job. It also could
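
Side note on the cache-pressure warnings mentioned in the quoted mail: on
Luminous the MDS cache can be capped by memory instead of inode count via
mds_cache_memory_limit. A minimal ceph.conf sketch, with the 8 GiB value as an
example only, not a recommendation:

[mds]
# cap the MDS cache at roughly 8 GiB (example value)
mds cache memory limit = 8589934592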