Re: [ceph-users] ceph-deploy jewel stopped working

2016-04-21 Thread Stephen Lord
Sorry about the mangled URLs in there; these are all from download.ceph.com 
rpm-jewel el7 x86_64 

Steve


> On Apr 21, 2016, at 1:17 PM, Stephen Lord  wrote:
> 
> 
> 
> Running this command
> 
> ceph-deploy install --stable jewel  ceph00 
> 
> And using the 1.5.32 version of ceph-deploy onto a redhat 7.2 system is 
> failing today (worked yesterday)
> 
> [ceph00][DEBUG ] ================================================================================
> [ceph00][DEBUG ]  Package            Arch      Version           Repository  Size
> [ceph00][DEBUG ] ================================================================================
> [ceph00][DEBUG ] Installing:
> [ceph00][DEBUG ]  ceph-mds           x86_64    1:10.2.0-0.el7    ceph        2.8 M
> [ceph00][DEBUG ]  ceph-mon           x86_64    1:10.2.0-0.el7    ceph        2.8 M
> [ceph00][DEBUG ]  ceph-osd           x86_64    1:10.2.0-0.el7    ceph        9.0 M
> [ceph00][DEBUG ]  ceph-radosgw       x86_64    1:10.2.0-0.el7    ceph        245 k
> [ceph00][DEBUG ] Installing for dependencies:
> [ceph00][DEBUG ]  ceph-base          x86_64    1:10.2.0-0.el7    ceph        4.2 M
> [ceph00][DEBUG ]  ceph-common        x86_64    1:10.2.0-0.el7    ceph         15 M
> [ceph00][DEBUG ]  ceph-selinux       x86_64    1:10.2.0-0.el7    ceph         19 k
> [ceph00][DEBUG ] Updating for dependencies:
> [ceph00][DEBUG ]  libcephfs1         x86_64    1:10.2.0-0.el7    ceph        1.8 M
> [ceph00][DEBUG ]  librados2          x86_64    1:10.2.0-0.el7    ceph        1.9 M
> [ceph00][DEBUG ]  librados2-devel    x86_64    1:10.2.0-0.el7    ceph        474 k
> [ceph00][DEBUG ]  libradosstriper1   x86_64    1:10.2.0-0.el7    ceph        1.8 M
> [ceph00][DEBUG ]  librbd1            x86_64    1:10.2.0-0.el7    ceph        2.4 M
> [ceph00][DEBUG ]  librgw2            x86_64    1:10.2.0-0.el7    ceph        2.8 M
> [ceph00][DEBUG ]  python-cephfs      x86_64    1:10.2.0-0.el7    ceph         66 k
> [ceph00][DEBUG ]  python-rados       x86_64    1:10.2.0-0.el7    ceph        145 k
> [ceph00][DEBUG ]  python-rbd         x86_64    1:10.2.0-0.el7    ceph         61 k
> [ceph00][DEBUG ] 
> [ceph00][DEBUG ] Transaction Summary
> [ceph00][DEBUG ] ================================================================================
> 
> [ceph00][DEBUG ] Install  4 Packages (+3 Dependent packages)
> [ceph00][DEBUG ] Upgrade ( 9 Dependent packages)
> [ceph00][DEBUG ] 
> [ceph00][DEBUG ] Total download size: 45 M
> [ceph00][DEBUG ] Downloading packages:
> [ceph00][DEBUG ] Delta RPMs disabled because /usr/bin/applydeltarpm not 
> installed.
> [ceph00][WARNIN] 
> http://download.ceph.com/rpm-jewel/el7/x86_64/ceph-common-10.2.0-0.el7.x86_64.rpm:
>   [Errno -1] Package does not match intended download. Suggestion: run yum 
> --enablerepo=ceph clean metadata
> [ceph00][WARNIN] Trying other mirror.
> …..
> 
> I have cleaned up all the repo info on this end and it makes no difference. I 
> suspect something in the last update to the site is wrong or missing, the 
> repomd.xml file here:
> 
> https://download.ceph.com/rpm-jewel/el7/x86_64/repodata/
>  
> 
> Is a day older than all the packages which may or may not be part of the 
> issue.
> 
> Steve
> 

[ceph-users] ceph-deploy jewel stopped working

2016-04-21 Thread Stephen Lord


Running this command

ceph-deploy install --stable jewel  ceph00 

using version 1.5.32 of ceph-deploy on a Red Hat 7.2 system started failing 
today (it worked yesterday):

[ceph00][DEBUG ] ================================================================================
[ceph00][DEBUG ]  Package            Arch      Version           Repository  Size
[ceph00][DEBUG ] ================================================================================
[ceph00][DEBUG ] Installing:
[ceph00][DEBUG ]  ceph-mds           x86_64    1:10.2.0-0.el7    ceph        2.8 M
[ceph00][DEBUG ]  ceph-mon           x86_64    1:10.2.0-0.el7    ceph        2.8 M
[ceph00][DEBUG ]  ceph-osd           x86_64    1:10.2.0-0.el7    ceph        9.0 M
[ceph00][DEBUG ]  ceph-radosgw       x86_64    1:10.2.0-0.el7    ceph        245 k
[ceph00][DEBUG ] Installing for dependencies:
[ceph00][DEBUG ]  ceph-base          x86_64    1:10.2.0-0.el7    ceph        4.2 M
[ceph00][DEBUG ]  ceph-common        x86_64    1:10.2.0-0.el7    ceph         15 M
[ceph00][DEBUG ]  ceph-selinux       x86_64    1:10.2.0-0.el7    ceph         19 k
[ceph00][DEBUG ] Updating for dependencies:
[ceph00][DEBUG ]  libcephfs1         x86_64    1:10.2.0-0.el7    ceph        1.8 M
[ceph00][DEBUG ]  librados2          x86_64    1:10.2.0-0.el7    ceph        1.9 M
[ceph00][DEBUG ]  librados2-devel    x86_64    1:10.2.0-0.el7    ceph        474 k
[ceph00][DEBUG ]  libradosstriper1   x86_64    1:10.2.0-0.el7    ceph        1.8 M
[ceph00][DEBUG ]  librbd1            x86_64    1:10.2.0-0.el7    ceph        2.4 M
[ceph00][DEBUG ]  librgw2            x86_64    1:10.2.0-0.el7    ceph        2.8 M
[ceph00][DEBUG ]  python-cephfs      x86_64    1:10.2.0-0.el7    ceph         66 k
[ceph00][DEBUG ]  python-rados       x86_64    1:10.2.0-0.el7    ceph        145 k
[ceph00][DEBUG ]  python-rbd         x86_64    1:10.2.0-0.el7    ceph         61 k
[ceph00][DEBUG ] 
[ceph00][DEBUG ] Transaction Summary
[ceph00][DEBUG ] ================================================================================

[ceph00][DEBUG ] Install  4 Packages (+3 Dependent packages)
[ceph00][DEBUG ] Upgrade ( 9 Dependent packages)
[ceph00][DEBUG ] 
[ceph00][DEBUG ] Total download size: 45 M
[ceph00][DEBUG ] Downloading packages:
[ceph00][DEBUG ] Delta RPMs disabled because /usr/bin/applydeltarpm not 
installed.
[ceph00][WARNIN] 
http://download.ceph.com/rpm-jewel/el7/x86_64/ceph-common-10.2.0-0.el7.x86_64.rpm:
 [Errno -1] Package does not match intended download. Suggestion: run yum 
--enablerepo=ceph clean metadata
[ceph00][WARNIN] Trying other mirror.
…..

I have cleaned up all the repo info on this end and it makes no difference. I 
suspect something in the last update to the site is wrong or missing; the 
repomd.xml file here:

https://download.ceph.com/rpm-jewel/el7/x86_64/repodata/

is a day older than all the packages, which may or may not be part of the issue.
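
For anyone hitting the same mismatch, a minimal recovery sketch on the client side
(assuming the repo id is "ceph", which is what ceph-deploy writes into
/etc/yum.repos.d):

  # throw away any cached metadata for the ceph repo and re-fetch it
  yum --enablerepo=ceph clean metadata
  yum clean all
  # rebuild the cache and check the advertised package before retrying ceph-deploy
  yum --enablerepo=ceph makecache
  yum info ceph-common

None of that helps here, which is why I suspect the repodata on the server side.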

Steve



Re: [ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Stephen Lord
o, not how fast you can get things out of it :-(

I have been using readforward and that is working OK; there is sufficient read
bandwidth that it does not matter whether data is coming from the cache pool or
the disk backing pool.
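
For reference, switching the tier mode is a one-liner; a sketch using the nvme
cache pool from the original post (newer releases may also want
--yes-i-really-mean-it appended):

  ceph osd tier cache-mode nvme readforward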

Steve


> On Apr 19, 2016, at 7:47 PM, Christian Balzer  wrote:
> 
> 
> Hello,
> 
> On Tue, 19 Apr 2016 20:21:39 + Stephen Lord wrote:
> 
>> 
>> 
>> I Have a setup using some Intel P3700 devices as a cache tier, and 33
>> sata drives hosting the pool behind them. 
> 
> A bit more details about the setup would be nice, as in how many nodes,
> interconnect, replication size of the cache tier and the backing HDD
> pool, etc. 
> And "some" isn't a number, how many P3700s (which size?) in how many nodes?
> One assumes there are no further SSDs involved with those SATA HDDs?

> 
>> I setup the cache tier with
>> writeback, gave it a size and max object count etc:
>> 
>> ceph osd pool set target_max_bytes 5000
>^^^
> This should have given you an error, it needs the pool name, as in your
> next line.
> 
>> ceph osd pool set nvme target_max_bytes 5000
>> ceph osd pool set nvme target_max_objects 50
>> ceph osd pool set nvme cache_target_dirty_ratio 0.5
>> ceph osd pool set nvme cache_target_full_ratio 0.8
>> 
>> This is all running Jewel using bluestore OSDs (I know experimental).
> Make sure to report all pyrotechnics, trap doors and sharp edges. ^_-
> 
>> The cache tier will write at about 900 Mbytes/sec and read at 2.2
>> Gbytes/sec, the sata pool can take writes at about 600 Mbytes/sec in
>> aggregate. 
>  ^
> Key word there.
> 
> That's just 18MB/s per HDD (60MB/s with a replication of 3), a pretty
> disappointing result for the supposedly twice as fast BlueStore. 
> Again, replication size and topology might explain that up to a point, but
> we don't know them (yet).
> 
> Also exact methodology of your tests please, i.e. the fio command line, how
> was the RBD device (if you tested with one) mounted and where, etc...
> 
>> However, it looks like the mechanism for cleaning the cache
>> down to the disk layer is being massively rate limited and I see about
>> 47 Mbytes/sec of read activity from each SSD while this is going on.
>> 
> This number is meaningless w/o knowing home many NVMe's you have.
> That being said, there are 2 levels of flushing past Hammer, but if you
> push the cache tier to the 2nd limit (cache_target_dirty_high_ratio) you
> will get full speed.
> 
>> This means that while I could be pushing data into the cache at high
>> speed, It cannot evict old content very fast at all, and it is very easy
>> to hit the high water mark and the application I/O drops dramatically as
>> it becomes throttled by how fast the cache can flush.
>> 
>> I suspect it is operating on a placement group at a time so ends up
>> targeting a very limited number of objects and hence disks at any one
>> time. I can see individual disk drives going busy for very short
>> periods, but most of them are idle at any one point in time. The only
>> way to drive the disk based OSDs fast is to hit a lot of them at once
>> which would mean issuing many cache flush operations in parallel.
>> 
> Yes, it is all PG based, so your observations match the expectations and
> what everybody else is seeing. 
> See also the thread "Cache tier operation clarifications" by me, version 2
> is in the works.
> There are also some new knobs in Jewel that may be helpful, see:
> http://www.spinics.net/lists/ceph-users/msg25679.html
>  
> 
> If you have a use case with a clearly defined idle/low use time and a
> small enough growth in dirty objects, consider what I'm doing, dropping the
> cache_target_dirty_ratio a few percent (in my case 2-3% is enough for a
> whole day) via cron job, wait a bit and then up again to its normal value. 
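> 
> A rough sketch of that cron approach (pool name, times and ratios are just
> placeholders for whatever fits your cluster):
> 
>   # 02:00 - lower the dirty threshold so flushing happens in the idle window
>   0 2 * * * ceph osd pool set nvme cache_target_dirty_ratio 0.45
>   # 05:00 - restore the normal threshold before peak hours
>   0 5 * * * ceph osd pool set nvme cache_target_dirty_ratio 0.5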
> 
> That way flushes won't normally happen at all during your peak usage
> times, though in my case that's purely cosmetic, flushes are not
> problematic at any time in that cluster currently.
> 
>> Are there any controls which can influence this behavior?
>> 
> See above (cache_target_dirty_high_ratio).
> 
> Aside from that you might want to reflect on what your use case, workload
> is going to be and how your testing reflects on it.
> 
> 

[ceph-users] ceph cache tier clean rate too low

2016-04-19 Thread Stephen Lord


I Have a setup using some Intel P3700 devices as a cache tier, and 33 sata 
drives hosting the pool behind them. I setup the cache tier with writeback, 
gave it a size and max object count etc:

 ceph osd pool set target_max_bytes 5000
 ceph osd pool set nvme target_max_bytes 5000
 ceph osd pool set nvme target_max_objects 50
 ceph osd pool set nvme cache_target_dirty_ratio 0.5
 ceph osd pool set nvme cache_target_full_ratio 0.8

This is all running Jewel using bluestore OSDs (I know, experimental). The cache 
tier will write at about 900 Mbytes/sec and read at 2.2 Gbytes/sec, and the sata 
pool can take writes at about 600 Mbytes/sec in aggregate. However, it looks 
like the mechanism for cleaning the cache down to the disk layer is being 
massively rate limited: I see only about 47 Mbytes/sec of read activity from each 
SSD while this is going on.

This means that while I could be pushing data into the cache at high speed, it 
cannot evict old content very fast at all, and it is very easy to hit the high 
water mark, at which point application I/O drops dramatically as it becomes 
throttled by how fast the cache can flush.

I suspect it is operating on one placement group at a time, so it ends up targeting 
a very limited number of objects, and hence disks, at any one time. I can see 
individual disk drives going busy for very short periods, but most of them are 
idle at any one point in time. The only way to drive the disk based OSDs fast 
is to hit a lot of them at once, which would mean issuing many cache flush 
operations in parallel.

Are there any controls which can influence this behavior?

Thanks

  Steve



[ceph-users] Bluestore OSD died - error (39) Directory not empty not handled on operation 21

2016-04-05 Thread Stephen Lord

I was experimenting with using bluestore OSDs and appear to have found a fairly 
consistent way to crash them…

Changing the number of copies in a pool down from 3 to 1 has now twice caused 
the mass panic of a whole pool of OSDs. In one case it was a cache tier, in 
another case it was just a pool hosting rbd images. 

From the log file of one of the OSDs:

2016-04-05 12:09:54.272475 7f5a58027700  0 bluestore(/var/lib/ceph/osd/ceph-43) 
 error (39) Directory not empty not handled on operation 21 (op 1, counting 
from 0)
2016-04-05 12:09:54.272489 7f5a58027700  0 bluestore(/var/lib/ceph/osd/ceph-43) 
 transaction dump:
{
"ops": [
{
"op_num": 0,
"op_name": "remove",
"collection": "2.354_head",
"oid": "#2:2ac0head#"
},
{
"op_num": 1,
"op_name": "rmcoll",
"collection": "2.354_head"
}
]
}


2016-04-05 12:09:54.275114 7f5a58027700 -1 os/bluestore/BlueStore.cc: In 
function 'void BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)' thread 7f5a58027700 time 2016-04-05 12:09:54.272532
os/bluestore/BlueStore.cc: 4357: FAILED assert(0 == "unexpected error")

 ceph version 10.1.0 (96ae8bd25f31862dbd5302f304ebf8bf1166aba6)
 1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x85) 
[0x7f5a82e74a55]
 2: (BlueStore::_txc_add_transaction(BlueStore::TransContext*, 
ObjectStore::Transaction*)+0x77a) [0x7f5a82b02eba]
 3: (BlueStore::queue_transactions(ObjectStore::Sequencer*, 
std::vector 
>&, std::shared_ptr, ThreadPool::TPHandle*)+0x3a5) [0x7f5a82b056e5]
 4: (ObjectStore::queue_transactions(ObjectStore::Sequencer*, 
std::vector 
>&, Context*, Context*, Context*, Context*, std::shared_ptr)+0x2a6) 
[0x7f5a82aad0b6]
 5: (OSD::RemoveWQ::_process(std::pair, 
std::shared_ptr >, ThreadPool::TPHandle&)+0x6e4) [0x7f5a827debb4]
 6: (ThreadPool::WorkQueueVal, 
std::shared_ptr >, std::pair, 
std::shared_ptr > >::_void_process(void*, 
ThreadPool::TPHandle&)+0x11a) [0x7f5a8283a15a]
 7: (ThreadPool::worker(ThreadPool::WorkThread*)+0xa7e) [0x7f5a82e65a9e]
 8: (ThreadPool::WorkThread::entry()+0x10) [0x7f5a82e66980]
 9: (()+0x7dc5) [0x7f5a80dbedc5]
 10: (clone()+0x6d) [0x7f5a7f44a28d]

In both cases a replicated pool with 3 copies was created, some content was added, 
and then the number of copies was set down to 1. Not a common thing to do, I know, 
but it works on FileStore OSDs.
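
A minimal reproduction sketch, using a scratch pool (the name and PG count are
arbitrary):

  ceph osd pool create reprotest 64 64
  rados -p reprotest bench 30 write --no-cleanup   # put some objects into it
  ceph osd pool set reprotest size 1               # this is the step that sets the OSDs asserting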

This is a cluster deployed using Red Hat 7 Jewel (10.1) RPMs from 
download.ceph.com.

Steve





Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-15 Thread Stephen Lord
My ceph-deploy came from the download.ceph.com site and it is 1.5.31-0. This 
code is in ceph itself though; the deploy logic is where the code appears to do 
the right thing ;-)

Steve

> On Mar 15, 2016, at 2:38 PM, Vasu Kulkarni  wrote:
> 
> Thanks for the steps that should be enough to test it out, I hope you got the 
> latest ceph-deploy either from pip or throught github.
> 
> On Tue, Mar 15, 2016 at 12:29 PM, Stephen Lord  wrote:
> I would have to nuke my cluster right now, and I do not have a spare one..
> 
> The procedure though is literally this, given a 3 node redhat 7.2 cluster, 
> ceph00, ceph01 and ceph02
> 
> ceph-deploy install --testing ceph00 ceph01 ceph02
> ceph-deploy new ceph00 ceph01 ceph02
> 
> ceph-deploy mon create  ceph00 ceph01 ceph02
> ceph-deploy gatherkeys  ceph00
> 
> ceph-deploy osd create ceph00:sdb:/dev/sdi
> ceph-deploy osd create ceph00:sdc:/dev/sdi
> 
> All devices have their partition tables wiped before this. They are all just 
> SATA devices, no special devices in the way.
> 
> sdi is an ssd and it is being carved up for journals. The first osd create 
> works, the second one gets stuck in a loop in the update_partition call in 
> ceph_disk for the 5 iterations before it gives up. When I look in 
> /sys/block/sdi the partition for the first osd is visible, the one for the 
> second is not. However looking at /proc/partitions it sees the correct thing. 
> So something about partprobe is not kicking udev into doing the right thing 
> when the second partition is added I suspect.
> 
> If I do not use the separate journal device then it usually works, but 
> occasionally I see a single retry in that same loop.
> 
> There is code in ceph_deploy which uses partprobe or partx depending on which 
> distro it detects, that is how I worked out what to change here.
> 
> If I have to tear things down again I will reproduce and post here.
> 
> Steve
> 
> > On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni  wrote:
> >
> > Do you mind giving the full failed logs somewhere in fpaste.org along with 
> > some os version details?
> >  There are some known issues on RHEL,  If you use 'osd prepare' and 'osd 
> > activate'(specifying just the journal partition here) it might work better.
> >
> > On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord  
> > wrote:
> > Not multipath if you mean using the multipath driver, just trying to setup 
> > OSDs which use a data disk and a journal ssd. If I run just a disk based 
> > OSD and only specify one device to ceph-deploy then it usually works 
> > although sometimes has to retry. In the case where I am using it to carve 
> > an SSD into several partitions for journals it fails on the second one.
> >
> > Steve
> >
> 
> 


Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-15 Thread Stephen Lord
I would have to nuke my cluster right now, and I do not have a spare one..

The procedure though is literally this, given a 3 node redhat 7.2 cluster, 
ceph00, ceph01 and ceph02

ceph-deploy install --testing ceph00 ceph01 ceph02
ceph-deploy new ceph00 ceph01 ceph02 

ceph-deploy mon create  ceph00 ceph01 ceph02
ceph-deploy gatherkeys  ceph00

ceph-deploy osd create ceph00:sdb:/dev/sdi
ceph-deploy osd create ceph00:sdc:/dev/sdi

All devices have their partition tables wiped before this. They are all just 
SATA devices, no special devices in the way.

sdi is an SSD and it is being carved up for journals. The first osd create 
works; the second one gets stuck in a loop in the update_partition call in 
ceph-disk for the 5 iterations before it gives up. When I look in 
/sys/block/sdi the partition for the first OSD is visible but the one for the 
second is not. However, /proc/partitions shows the correct layout. So I suspect 
something about partprobe is not kicking udev into doing the right thing when 
the second partition is added.

If I do not use the separate journal device then it usually works, but 
occasionally I see a single retry in that same loop.

There is code in ceph-deploy which uses partprobe or partx depending on which 
distro it detects; that is how I worked out what to change here.

If I have to tear things down again I will reproduce and post here.

Steve

> On Mar 15, 2016, at 2:12 PM, Vasu Kulkarni  wrote:
> 
> Do you mind giving the full failed logs somewhere in fpaste.org along with 
> some os version details?
>  There are some known issues on RHEL,  If you use 'osd prepare' and 'osd 
> activate'(specifying just the journal partition here) it might work better.
> 
> On Tue, Mar 15, 2016 at 12:05 PM, Stephen Lord  wrote:
> Not multipath if you mean using the multipath driver, just trying to setup 
> OSDs which use a data disk and a journal ssd. If I run just a disk based OSD 
> and only specify one device to ceph-deploy then it usually works although 
> sometimes has to retry. In the case where I am using it to carve an SSD into 
> several partitions for journals it fails on the second one.
> 
> Steve
> 




Re: [ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-15 Thread Stephen Lord
Not multipath, if you mean using the multipath driver; I am just trying to set up 
OSDs which use a data disk and a journal SSD. If I run a disk-only OSD and 
specify just one device to ceph-deploy then it usually works, although it 
sometimes has to retry. In the case where I am using it to carve an SSD into 
several partitions for journals, it fails on the second one.

Steve


> On Mar 15, 2016, at 1:45 PM, Vasu Kulkarni  wrote:
> 
> Ceph-deploy suite and also selinux suite(which isn't merged yet) indirectly 
> tests ceph-disk and has been run on Jewel as well. I guess the issue Stephen 
> is seeing is on multipath device
> which I believe is a known issue.
> 
> On Tue, Mar 15, 2016 at 11:42 AM, Gregory Farnum  wrote:
> There's a ceph-disk suite from last August that Loïc set up, but based
> on the qa list it wasn't running for a while and isn't in great shape.
> :/ I know there are some CentOS7 boxes in the sepia lab but it might
> not be enough for a small and infrequently-run test to reliably get
> tested against them.
> -Greg
> 
> On Tue, Mar 15, 2016 at 11:04 AM, Ben Hines  wrote:
> > It seems like ceph-disk is often breaking on centos/redhat systems. Does it
> > have automated tests in the ceph release structure?
> >
> > -Ben
> >
> >
> > On Tue, Mar 15, 2016 at 8:52 AM, Stephen Lord 
> > wrote:
> >>
> >>
> >> Hi,
> >>
> >> The ceph-disk (10.0.4 version) command seems to have problems operating on
> >> a Redhat 7 system, it uses the partprobe command unconditionally to update
> >> the partition table, I had to change this to partx -u to get past this.
> >>
> >> @@ -1321,13 +1321,13 @@
> >>      processed, i.e. the 95-ceph-osd.rules actions and mode changes,
> >>      group changes etc. are complete.
> >>      """
> >> -    LOG.debug('Calling partprobe on %s device %s', description, dev)
> >> +    LOG.debug('Calling partx on %s device %s', description, dev)
> >>      partprobe_ok = False
> >>      error = 'unknown error'
> >>      for i in (1, 2, 3, 4, 5):
> >>          command_check_call(['udevadm', 'settle', '--timeout=600'])
> >>          try:
> >> -            _check_output(['partprobe', dev])
> >> +            _check_output(['partx', '-u', dev])
> >>              partprobe_ok = True
> >>              break
> >>          except subprocess.CalledProcessError as e:
> >>
> >>
> >> It really needs to be doing that conditional on the operating system
> >> version.
> >>
> >> Steve
> >>
> >>


[ceph-users] ceph-disk from jewel has issues on redhat 7

2016-03-15 Thread Stephen Lord

Hi,

The ceph-disk command (10.0.4 version) seems to have problems operating on a 
Red Hat 7 system: it uses the partprobe command unconditionally to update the 
partition table, and I had to change this to partx -u to get past it.

@@ -1321,13 +1321,13 @@
     processed, i.e. the 95-ceph-osd.rules actions and mode changes,
     group changes etc. are complete.
     """
-    LOG.debug('Calling partprobe on %s device %s', description, dev)
+    LOG.debug('Calling partx on %s device %s', description, dev)
     partprobe_ok = False
     error = 'unknown error'
     for i in (1, 2, 3, 4, 5):
         command_check_call(['udevadm', 'settle', '--timeout=600'])
         try:
-            _check_output(['partprobe', dev])
+            _check_output(['partx', '-u', dev])
             partprobe_ok = True
             break
         except subprocess.CalledProcessError as e:


It really needs to make that call conditional on the operating system version.
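
Something along these lines is what I mean, just as a sketch; the distro test is
simplified and $dev stands for the device being updated:

  # partprobe does not reliably trigger udev on RHEL/CentOS 7, so use partx -u
  # there and keep partprobe everywhere else
  if grep -qiE '^ID=.*(rhel|centos)' /etc/os-release; then
      partx -u "$dev"
  else
      partprobe "$dev"
  fi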

Steve
 



Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-11 Thread Stephen Lord

I saw this go by in the commit log:

commit cc2200c5e60caecf7931e546f6522b2ba364227f
Merge: f8d5807 12c083e
Author: Sage Weil 
Date:   Thu Feb 11 08:44:35 2016 -0500

Merge pull request #7537 from ifed01/wip-no-promote-for-delete-fix

osd: fix unnecessary object promotion when deleting from cache pool

Reviewed-by: Sage Weil 


Is there any chance that I was basically seeing the same thing from the 
filesystem standpoint?

Thanks

  Steve

> On Feb 5, 2016, at 8:42 AM, Gregory Farnum  wrote:
> 
> On Fri, Feb 5, 2016 at 6:39 AM, Stephen Lord  wrote:
>> 
>> I looked at this system this morning, and the it actually finished what it 
>> was
>> doing. The erasure coded pool still contains all the data and the cache
>> pool has about a million zero sized objects:
>> 
>> 
>> GLOBAL:
>>     SIZE   AVAIL RAW USED %RAW USED OBJECTS
>>     15090G 9001G    6080G     40.29   2127k
>> POOLS:
>>     NAME        ID CATEGORY USED  %USED MAX AVAIL OBJECTS DIRTY READ  WRITE
>>     cache-data  21 -            0     0     7962G 1162258 1057k 22969 3220k
>>     cephfs-data 22 -        3964G 26.27     5308G 1014840  991k  891k 1143k
>> 
>> Definitely seems like a bug since I removed all references to these from the 
>> filesystem
>> which created them.
>> 
>> I originally wrote 4.5 Tbytes of data into the file system, the erasure coded
>> pool is setup as 4+2, and the cache has a size limit of 1 Tbyte. Looks like 
>> not
>> all the data made it out of the cache tier before I removed content, it 
>> removed the
>> content which was only present in the cache tier and created a zero sized 
>> object
>> in the cache for all the content. The used capacity is somewhat consistent 
>> with
>> this.
>> 
>> I tried to look at the extended attributes on one of the zero size object 
>> with ceph-dencoder,
>> but it failed:
>> 
>> error: buffer::malformed_input: void 
>> object_info_t::decode(ceph::buffer::list::iterator&) unknown encoding 
>> version > 15
>> 
>> Same error on one of the objects in the erasure coded pool.
>> 
>> Looks like I am a little too bleeding edge for this, or the contents of the 
>> .ceph_ attribute are not an object_info_t
> 
> ghobject_info_t
> 
> You can get the EC stuff actually deleted by getting the cache pool to
> flush everything. That's discussed in the docs and in various mailing
> list archives.
> -Greg
> 
>> 
>> 
>> 
>> Steve
>> 
>>> On Feb 4, 2016, at 7:10 PM, Gregory Farnum  wrote:
>>> 
>>> On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord  wrote:
>>>> 
>>>>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum  wrote:
>>>>> 
>>>>> I presume we're doing reads in order to gather some object metadata
>>>>> from the cephfs-data pool; and the (small) newly-created objects in
>>>>> cache-data are definitely whiteout objects indicating the object no
>>>>> longer exists logically.
>>>>> 
>>>>> What kinds of reads are you actually seeing? Does it appear to be
>>>>> transferring data, or merely doing a bunch of seeks? I thought we were
>>>>> trying to avoid doing reads-to-delete, but perhaps the way we're
>>>>> handling snapshots or something is invoking behavior that isn't
>>>>> amicable to a full-FS delete.
>>>>> 
>>>>> I presume you're trying to characterize the system's behavior, but of
>>>>> course if you just want to empty it out entirely you're better off
>>>>> deleting the pools and the CephFS instance entirely and then starting
>>>>> it over again from scratch.
>>>>> -Greg
>>>> 
>>>> I believe it is reading all the data, just from the volume of traffic and
>>>> the cpu load on the OSDs maybe suggests it is doing more than
>>>> just that.
>>>> 
>>>> iostat is showing a lot of data moving, I am seeing about the same volume
>>>> of read and write activity here. Because the OSDs underneath both pools
>>>> are the same ones, I know that’s not exactly optimal, it is hard to tell 
>>>> what
>>>> which pool is responsible for which I/O. Large reads and small writes 
>>>> suggest
>>>> it is reading up all the data from th

Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-05 Thread Stephen Lord

I looked at this system this morning, and it actually finished what it was
doing. The erasure coded pool still contains all the data, and the cache
pool has about a million zero sized objects:


GLOBAL:
    SIZE   AVAIL RAW USED %RAW USED OBJECTS
    15090G 9001G    6080G     40.29   2127k
POOLS:
    NAME        ID CATEGORY USED  %USED MAX AVAIL OBJECTS DIRTY READ  WRITE
    cache-data  21 -            0     0     7962G 1162258 1057k 22969 3220k
    cephfs-data 22 -        3964G 26.27     5308G 1014840  991k  891k 1143k

Definitely seems like a bug since I removed all references to these from the 
filesystem
which created them.

I originally wrote 4.5 Tbytes of data into the file system, the erasure coded
pool is set up as 4+2, and the cache has a size limit of 1 Tbyte. It looks like not
all the data made it out of the cache tier before I removed content: the delete
removed the content which was only present in the cache tier and created a zero
sized object in the cache for all the content. The used capacity is somewhat
consistent with this.

I tried to look at the extended attributes on one of the zero sized objects with 
ceph-dencoder, but it failed:

error: buffer::malformed_input: void 
object_info_t::decode(ceph::buffer::list::iterator&) unknown encoding version > 
15

Same error on one of the objects in the erasure coded pool.

Looks like I am a little too bleeding edge for this, or the contents of the 
.ceph_ attribute are not an object_info_t
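
For reference, this is roughly how I was trying to decode it (a sketch; the object
path is made up and the xattr name is my guess, it may differ by object store):

  # pull the object_info_t xattr off an object file on one of the OSDs
  attr -q -g "ceph._" /var/lib/ceph/osd/ceph-0/current/21.0_head/<object> > /tmp/oi
  ceph-dencoder type object_info_t import /tmp/oi decode dump_json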



Steve

> On Feb 4, 2016, at 7:10 PM, Gregory Farnum  wrote:
> 
> On Thu, Feb 4, 2016 at 5:07 PM, Stephen Lord  wrote:
>> 
>>> On Feb 4, 2016, at 6:51 PM, Gregory Farnum  wrote:
>>> 
>>> I presume we're doing reads in order to gather some object metadata
>>> from the cephfs-data pool; and the (small) newly-created objects in
>>> cache-data are definitely whiteout objects indicating the object no
>>> longer exists logically.
>>> 
>>> What kinds of reads are you actually seeing? Does it appear to be
>>> transferring data, or merely doing a bunch of seeks? I thought we were
>>> trying to avoid doing reads-to-delete, but perhaps the way we're
>>> handling snapshots or something is invoking behavior that isn't
>>> amicable to a full-FS delete.
>>> 
>>> I presume you're trying to characterize the system's behavior, but of
>>> course if you just want to empty it out entirely you're better off
>>> deleting the pools and the CephFS instance entirely and then starting
>>> it over again from scratch.
>>> -Greg
>> 
>> I believe it is reading all the data, just from the volume of traffic and
>> the cpu load on the OSDs maybe suggests it is doing more than
>> just that.
>> 
>> iostat is showing a lot of data moving, I am seeing about the same volume
>> of read and write activity here. Because the OSDs underneath both pools
>> are the same ones, I know that’s not exactly optimal, it is hard to tell what
>> which pool is responsible for which I/O. Large reads and small writes suggest
>> it is reading up all the data from the objects,  the write traffic is I 
>> presume all
>> journal activity relating to deleting objects and creating the empty ones.
>> 
>> The 9:1 ratio between things being deleted and created seems odd though.
>> 
>> A previous version of this exercise with just a regular replicated data pool
>> did not read anything, just a lot of write activity and eventually the 
>> content
>> disappeared. So definitely related to the pool configuration here and 
>> probably
>> not to the filesystem layer.
> 
> Sam, does this make any sense to you in terms of how RADOS handles deletes?
> -Greg




Re: [ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Stephen Lord

> On Feb 4, 2016, at 6:51 PM, Gregory Farnum  wrote:
> 
> I presume we're doing reads in order to gather some object metadata
> from the cephfs-data pool; and the (small) newly-created objects in
> cache-data are definitely whiteout objects indicating the object no
> longer exists logically.
> 
> What kinds of reads are you actually seeing? Does it appear to be
> transferring data, or merely doing a bunch of seeks? I thought we were
> trying to avoid doing reads-to-delete, but perhaps the way we're
> handling snapshots or something is invoking behavior that isn't
> amicable to a full-FS delete.
> 
> I presume you're trying to characterize the system's behavior, but of
> course if you just want to empty it out entirely you're better off
> deleting the pools and the CephFS instance entirely and then starting
> it over again from scratch.
> -Greg

I believe it is reading all the data, judging from the volume of traffic, and
the CPU load on the OSDs maybe suggests it is doing more than just that.

iostat is showing a lot of data moving; I am seeing about the same volume
of read and write activity here. Because the OSDs underneath both pools
are the same ones (I know that's not exactly optimal), it is hard to tell
which pool is responsible for which I/O. Large reads and small writes suggest
it is reading up all the data from the objects, and the write traffic is, I
presume, all journal activity relating to deleting objects and creating the
empty ones.

The 9:1 ratio between things being deleted and created seems odd though.

A previous version of this exercise with just a regular replicated data pool
did not read anything, just showed a lot of write activity, and eventually the
content disappeared. So this is definitely related to the pool configuration
here and probably not to the filesystem layer.

I will eventually just put this out of its misery and wipe it.

Steve





[ceph-users] why is there heavy read traffic during object delete?

2016-02-04 Thread Stephen Lord
I set up a cephfs file system with a cache tier over an erasure coded tier as an 
experiment:

  ceph osd erasure-code-profile set raid6 k=4 m=2 
  ceph osd pool create cephfs-metadata 512 512 
  ceph osd pool set cephfs-metadata size 3
  ceph osd pool create cache-data 2048 2048
  ceph osd pool create cephfs-data 256 256 erasure raid6 default_erasure
  ceph osd tier add cephfs-data cache-data
  ceph osd tier cache-mode cache-data writeback
  ceph osd tier set-overlay cephfs-data cache-data
  ceph osd pool set cache-data hit_set_type bloom
  ceph osd pool set cache-data target_max_bytes 1099511627776 

The file system was created from the cephfs-metadata and cephfs-data pools
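
For reference, that step is roughly the following (the filesystem name is
arbitrary):

  ceph fs new cephfs cephfs-metadata cephfs-data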

After adding a lot of data to this and waiting for the pools to idle down and 
stabilize, I removed the file system content with rm. I am seeing very strange 
behavior: the file system remove was quick, and then it started removing the 
data from the pools. However, it appears to be reading the data from the erasure 
coded pool and creating empty content in the cache pool.

At its peak capacity the system looked like this:

NAME        ID CATEGORY USED  %USED MAX AVAIL OBJECTS DIRTY READ  WRITE
cache-data  21 -         791G  5.25     6755G  256302  140k 22969 2138k
cephfs-data 22 -        4156G 27.54     4503G 1064086 1039k 51271 1046k


2 hours later it looked like this:

NAME        ID CATEGORY USED  %USED MAX AVAIL OBJECTS DIRTY READ  WRITE
cache-data  21 -         326G  2.17     7576G  702142  559k 22969 2689k
cephfs-data 22 -        3964G 26.27     5051G 1014842  991k  476k 1143k

The object count in the erasure coded pool has gone down a little, the count in 
the cache pool has gone up a lot, and there has been a lot of read activity in 
the erasure coded pool and write activity into both pools. The used capacity in 
the cache pool is also going down. It looks like the cache pool is gaining 9 
objects for each one removed from the erasure coded pool. Looking at the actual 
files being created by the OSDs for this, they are empty.

What is going on here? It looks like this will take a day or so to complete at 
this rate of progress.

The ceph version here is the master branch from a couple of days ago.

Thanks

  Steve Lord





[ceph-users] how to get even placement group distribution across OSDs - looking for hints

2016-01-27 Thread Stephen Lord

I have a configuration with 18 OSDs spread across 3 hosts. I am struggling to 
get an even distribution of placement groups between the OSDs for a 
specific pool. All the OSDs are the same size with the same weight in the crush 
map. The fill levels of the individual placement groups are very close when I put 
data into them; however, I get a fairly uneven spread of placement groups across 
the OSDs. This leads to the pool filling one of the OSDs well before the 
others: a file system can report 70% full in aggregate, but one of the OSDs 
fills so it can take no more data. In addition, this will clearly lead to a less 
than balanced load between the different devices, which are all of the same 
physical type with the same throughput.

If I dump the placement groups and count them by OSD I typically see something 
like this:

pool :     4    5    6    7    8    9    0    1   14    2   15    3  | SUM

osd.17     4    5    7    5   11    6    9    4   17   12   42    8  | 130
osd.4      7    8    8    7    4    6    1    4   12    8   23    8  | 96
osd.5      8    5   10   10    5    6    3    7   13    7   34   13  | 121
osd.6      9    6    8    2    3   10    1    4   12   10   26   10  | 101
osd.7      7   10    7    7    9   13    1    6   20    5   29    5  | 119
osd.8      6    7    4    6    6    3    7   11   20    7   28    9  | 114
osd.9      8   10    9    9    5    6    4    5   15    5   22    4  | 102
osd.10     3    2    4    5   11    9    3    4   20    7   38    8  | 114
osd.11     8   11   10    7    7   13    3    4   19    8   29    6  | 125
osd.12     7    6   10    5    8    4    2    8   18    6   37    9  | 120
osd.0      3    6   11   13    7    5    6   11   17    6   35    9  | 129
osd.13     7    8    5   10   11    8    4   13   18   11   35    5  | 135
osd.1     13    8    9    4    7    7    4    6   10   10   43    3  | 124
osd.14     8    7    4    7    8    3    3    8   16    3   28    6  | 101
osd.15     9    7    5    3    4   10    5    6   17    7   35    5  | 113
osd.2      7    9    9   11   11    8    2    8    9    6   34    9  | 123
osd.16     9    4    5    7    4    0    3    6   21    4   26    6  | 95
osd.3      5    9    3   10    7   11    3   13   14    6   32    5  | 118

SUM :    128  128  128  128  128  128   64  128  288  128  576  128 |
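
The counts above were produced along these lines; a rough sketch only, since the
pg dump column layout varies between releases (pool 15 is used as the example and
column 3 is assumed to be the "up" set):

  ceph pg dump pgs_brief 2>/dev/null | \
    awk '$1 ~ /^15\./ {gsub(/[][]/,"",$3); n=split($3,a,","); for (i=1;i<=n;i++) c[a[i]]++}
         END {for (o in c) print "osd." o, c[o]}' | sort -t. -k2 -n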

In this example I want to put a filesystem across pools 14 and 15. The data 
pool has between 23 and 43 placement groups per OSD.

Am I just missing something here in defining the crush map? All I can find is 
recommendations to get a more even balance by having more PGs per OSD. 
Eventually I just get warnings about too many placement groups per OSD.

Or is the issue that there are multiple pools on this set of OSDs and placement 
groups are being created in parallel for several of them? In this case, though, 
pool 15 was created after all the other pools existed and all their placement 
groups were created, and even the first pool is unevenly spread.

So are there any controls which influence how placement groups are allocated to 
OSDs in the initial pool creation?
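
For completeness, the only related knobs I have come across so far rebalance data
after the fact rather than controlling the initial allocation (the threshold and
pool name below are placeholders):

  # nudge PGs off OSDs that are more than 10% over the average utilization
  ceph osd reweight-by-utilization 110
  # or reweight based on PG counts for a specific pool
  ceph osd reweight-by-pg 110 <data pool>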

Thanks

   Steve

