Re: [ceph-users] assertion error trying to start mds server

2017-10-12 Thread Bill Sharer
After your comment about the dual mds servers I decided to just give up
trying to get the second restarted.  After eyeballing what I had on one
of the new Ryzen boxes for drive space, I decided to just dump the
filesystem.  That will also make things go faster if and when I flip
everything over to bluestore.  So far so good...  I just took a peek and
saw the files being owned by Mr root though.  Is there going to be an
ownership reset at some point or will I have to resolve that by hand?


On 10/12/2017 06:09 AM, John Spray wrote:
> On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer  wrote:
>> I was wondering, if I can't get the second mds back up: that offline
>> backward scrub check sounds like it should also be able to salvage what
>> it can of the two pools to a normal filesystem.  Is there an option for
>> that, or has someone written some form of salvage tool?
> Yep, cephfs-data-scan can do that.
>
> To scrape the files out of a CephFS data pool to a local filesystem, do this:
> cephfs-data-scan scan_extents <data pool>   # this discovers all the file sizes
> cephfs-data-scan scan_inodes --output-dir /tmp/my_output <data pool>
>
> The time taken by both these commands scales linearly with the number
> of objects in your data pool.
>
> This tool may not see the correct filename for recently created files
> (any file whose metadata is in the journal but not yet flushed); these
> files will go into a lost+found directory, named after their inode
> number.
>
> John
>
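(For anyone following along, the full sequence John describes looks roughly like this -- a sketch only: the pool name and output directory are placeholders, and it is run against the offline filesystem:

  cephfs-data-scan scan_extents <data pool>                              # discover file sizes from the data objects
  cephfs-data-scan scan_inodes --output-dir /tmp/my_output <data pool>   # write the recovered files to a local directory
)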
>> On 10/11/2017 07:07 AM, John Spray wrote:
>>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer  wrote:
 I've been in the process of updating my gentoo based cluster both with
 new hardware and a somewhat postponed update.  This includes some major
 stuff including the switch from gcc 4.x to 5.4.0 on existing hardware
 and using gcc 6.4.0 to make better use of AMD Ryzen on the new
 hardware.  The existing cluster was on 10.2.2, but I was going to
 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
 transitioning to bluestore on the osd's.

 The Ryzen units are slated to be bluestore based OSD servers if and when
 I get to that point.  Up until the mds failure, they were simply cephfs
 clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
 MON) and had two servers left to update.  Both of these are also MONs
 and were acting as a pair of dual active MDS servers running 10.2.2.
 Monday morning I found out the hard way that the UPS one of them was on
 has a dead battery.  After I fsck'd and the box came back up, I saw the
 following assertion error when it was trying to start its mds.B server:


  mdsbeacon(64162/B up:replay seq 3 v4699) v7  126+0+0 (709014160 0 0) 0x7f6fb4001bc0 con 0x55f94779d8d0
  0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In function 'virtual void EImportStart::replay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
 mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)

  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char const*)+0x82) [0x55f93d64a122]
  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
  5: (()+0x74a4) [0x7f6fd009b4a4]
  6: (clone()+0x6d) [0x7f6fce5a598d]
  NOTE: a copy of the executable, or `objdump -rdS <executable>` is needed to interpret this.

 --- logging levels ---
0/ 5 none
0/ 1 lockdep
0/ 1 context
1/ 1 crush
1/ 5 mds
1/ 5 mds_balancer
1/ 5 mds_locker
1/ 5 mds_log
1/ 5 mds_log_expire
1/ 5 mds_migrator
0/ 1 buffer
0/ 1 timer
0/ 1 filer
0/ 1 striper
0/ 1 objecter
0/ 5 rados
0/ 5 rbd
0/ 5 rbd_mirror
0/ 5 rbd_replay
0/ 5 journaler
0/ 5 objectcacher
0/ 5 client
0/ 5 osd
0/ 5 optracker
0/ 5 objclass
1/ 3 filestore
1/ 3 journal
0/ 5 ms
1/ 5 mon
0/10 monc
1/ 5 paxos
0/ 5 tp
1/ 5 auth
1/ 5 crypto
1/ 1 finisher
1/ 5 heartbeatmap
1/ 5 perfcounter
1/ 5 rgw
1/10 civetweb
1/ 5 javaclient
1/ 5 asok
1/ 1 throttle
0/ 0 refs
1/ 5 xio
1/ 5 compressor
1/ 5 newstore
1/ 5 bluestore
1/ 5 bluefs
1/ 3 bdev
1/ 5 kstore
4/ 5 rocksdb
4/ 5 leveldb
1/ 5 kinetic
1/ 5 fuse
   -2/-2 (syslog threshold)
   -1/-1 (stderr threshold)
   max_recent 1
   max_new 1000
   log_file /var/log/ceph/ceph-mds.B.log



 

Re: [ceph-users] using Bcache on blueStore

2017-10-12 Thread Jorge Pinilla López
Well, I wouldn't use bcache on FileStore at all. First, there are the problems you
have already described, and second (and more important) you get double writes: in
FileStore, data was written to the journal and to the storage disk at the same time,
so if the journal and the data disk were the same device, throughput was effectively
halved, giving really bad output.
In BlueStore things change quite a lot. First, there are no double writes; there is
no "journal" (well, there is something called a WAL, but it's not used in the same
way). Data goes directly to the data disk, and you only write a little metadata and
make a commit into the DB. Rebalancing and scrub go through RocksDB rather than a
file system, which makes them much simpler and more effective, so you aren't supposed
to have all the problems that you had with FileStore.
In addition, cache tiering has been deprecated on Red Hat Ceph Storage, so I
personally wouldn't use something deprecated by the developers and support.

 Original message  From: Marek Grzybowski 
 Date: 13/10/17 12:22 AM (GMT+01:00)  To: Jorge Pinilla López , ceph-users@lists.ceph.com
 Subject: Re: [ceph-users] using Bcache on blueStore 
On 12.10.2017 20:28, Jorge Pinilla López wrote:
> Hey all!
> I have a Ceph cluster with multiple HDDs and 1 really fast SSD (30GB per OSD) per
> host.
> 
> I have been thinking, and all the docs say that I should give all the SSD space
> to RocksDB, so I would have the data on HDD and a 30GB partition for RocksDB.
> 
> But it came to my mind that if the OSD isn't full, maybe I am not using all the
> space on the SSD; or maybe I would prefer having a really small amount of hot k/v
> and metadata, and the data itself, on a really fast device rather than just storing
> all the cold metadata there.
> 
> So I thought of using bcache to make the SSD a cache: as metadata and
> k/v are usually hot, they should be placed in the cache. But this doesn't
> guarantee that k/v and metadata are actually always on the SSD, because under
> heavy cache load they can be pushed out (for example by really big data files).
> 
> So I came up with the idea of setting a small 5-10GB partition for the hot
> RocksDB and using the rest as a cache, so I make sure that the really hot
> metadata is actually always on the SSD and the colder data should also be on
> the SSD (via bcache) if it's not really freezing cold, in which case it would be
> pushed to the HDD. It also doesn't make any sense to have metadata that you
> never use taking up space on the SSD; I would rather use that space to store hotter
> data.
> 
> This would also make writes faster, and in BlueStore we don't have the double
> write problem, so it should work fine.
> 
> What do you think about this? Does it have any downside? Is there any other
> way?

Hi Jorge
  I was inexperienced and tried bcache on an old FileStore OSD once. It was bad,
mostly because bcache does not have any typical disk scheduling algorithm.
So when a scrub or rebalance was running, latency on such storage was very high and
unpredictable.
The OSD daemon could not set any ioprio for disk reads or writes, and additionally
the bcache cache was poisoned by scrub/rebalance.

Fortunately for me, it is very easy to do a rolling replacement of OSDs.
I now use some SSD partitions for journals and what is left for pure SSD storage.
This works really great.

If I ever need a cache, I will use cache tiering instead.


-- 
  Kind Regards
    Marek Grzybowski





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] using Bcache on blueStore

2017-10-12 Thread Marek Grzybowski
On 12.10.2017 20:28, Jorge Pinilla López wrote:
> Hey all!
> I have a Ceph cluster with multiple HDDs and 1 really fast SSD (30GB per OSD) per
> host.
> 
> I have been thinking, and all the docs say that I should give all the SSD space
> to RocksDB, so I would have the data on HDD and a 30GB partition for RocksDB.
> 
> But it came to my mind that if the OSD isn't full, maybe I am not using all the
> space on the SSD; or maybe I would prefer having a really small amount of hot k/v
> and metadata, and the data itself, on a really fast device rather than just storing
> all the cold metadata there.
> 
> So I thought of using bcache to make the SSD a cache: as metadata and
> k/v are usually hot, they should be placed in the cache. But this doesn't
> guarantee that k/v and metadata are actually always on the SSD, because under
> heavy cache load they can be pushed out (for example by really big data files).
> 
> So I came up with the idea of setting a small 5-10GB partition for the hot
> RocksDB and using the rest as a cache, so I make sure that the really hot
> metadata is actually always on the SSD and the colder data should also be on
> the SSD (via bcache) if it's not really freezing cold, in which case it would be
> pushed to the HDD. It also doesn't make any sense to have metadata that you
> never use taking up space on the SSD; I would rather use that space to store hotter
> data.
> 
> This would also make writes faster, and in BlueStore we don't have the double
> write problem, so it should work fine.
> 
> What do you think about this? Does it have any downside? Is there any other
> way?

Hi Jorge
  I was inexperienced and tried bcache on an old FileStore OSD once. It was bad,
mostly because bcache does not have any typical disk scheduling algorithm.
So when a scrub or rebalance was running, latency on such storage was very high and
unpredictable.
The OSD daemon could not set any ioprio for disk reads or writes, and additionally
the bcache cache was poisoned by scrub/rebalance.

Fortunately for me, it is very easy to do a rolling replacement of OSDs.
I now use some SSD partitions for journals and what is left for pure SSD storage.
This works really great.

If I ever need a cache, I will use cache tiering instead.


-- 
  Kind Regards
Marek Grzybowski





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Flattening loses sparseness

2017-10-12 Thread Jason Dillaman
The sparseness is actually preserved, but the fast-diff stats are
incorrect because zero-byte objects were being created during the
flatten operation. This should be fixed under Luminous [1] where the
flatten operation (and any writes to a clone more generally) no longer
performs a zero-byte copy-up operation. The change was a little too
invasive for comfort to backport to Jewel, however.

[1] http://tracker.ceph.com/issues/15028
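If you want to confirm that on your own cluster, one rough way (names taken from Kevin's example below; the rbd_data prefix is the one shown by rbd info) is to stat the clone's backing objects directly and check that they are still zero-length:

  # list the flattened clone's objects and print their actual sizes
  rados -p rbd ls | grep rbd_data.18c852eb141f2 | while read obj; do
      rados -p rbd stat "$obj"
  done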

On Thu, Oct 12, 2017 at 5:06 PM, Massey, Kevin  wrote:
> Hi,
>
> I'm evaluating ceph (Jewel) for an application that will have a chain of 
> layered images, with the need to sometimes flatten from the top to limit 
> chain length. However, it appears that running "rbd flatten" causes loss of 
> sparseness in the clone. For example:
>
> $ rbd --version
> ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
> $ rbd du
> NAME        PROVISIONED  USED
> child            10240k     0
> parent@snap      10240k     0
> parent           10240k     0
>                  20480k     0
> $ rbd info child
> rbd image 'child':
> size 10240 kB in 3 objects
> order 22 (4096 kB objects)
> block_name_prefix: rbd_data.18c852eb141f2
> format: 2
> features: layering, exclusive-lock, object-map, fast-diff, 
> deep-flatten
> flags:
> parent: rbd/parent@snap
> overlap: 10240 kB
> $ rbd flatten child
> Image flatten: 100% complete...done.
> $ rbd du
> NAME        PROVISIONED    USED
> child            10240k  10240k
> parent@snap      10240k       0
> parent           10240k       0
>                  20480k  10240k
>
> Is there any way to flatten a clone while retaining its sparseness, perhaps 
> in Luminous or with BlueStore backend?
>
> Thanks,
> Kevin
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS metadata pool to SSDs

2017-10-12 Thread David Turner
John covered everything better than I was going to, so I'll just remove
that from my reply.

If you aren't using DC SSDs and this is prod, then I wouldn't recommend
moving towards this model.  However you are correct on how to move the pool
to the SSDs from the HDDs and based on how simple and quick it can be for a
healthy cluster to do that, you can always let it run for a few weeks and
see how it affects the durability of your SSDs before deciding to leave it
or go back to your current setup.

On Thu, Oct 12, 2017 at 4:43 PM Reed Dier  wrote:

> I found an older ML entry from 2015 and not much else, mostly detailing
> performance testing done to dispel the poor performance numbers presented
> by the OP.
>
> Currently have the metadata pool on my slow 24 HDDs, and am curious if I
> should see any increased performance with CephFS by moving the metadata
> pool onto SSD medium.
> My thought is that the SSDs are lower latency, and it removes those iops
> from the slower spinning disks.
>
> My next concern would be write amplification on the SSDs. Would this
> thrash the SSD lifespan with tons of little writes or should it not be too
> heavy of a workload to matter too much?
>
> My last question from the operations standpoint, if I use:
> # ceph osd pool set fs-metadata crush_ruleset 
> Will this just start to backfill the metadata pool over to the SSDs until
> it satisfies the crush requirements for size and failure domains and not
> skip a beat?
>
> Obviously things like enabling dirfrags, and multiple MDS ranks will be
> more likely to improve performance with CephFS, but the metadata pool uses
> very little space, and I have the SSDs already, so I figured I would
> explore it as an option.
>
> Thanks,
>
> Reed
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Jason Dillaman
Yes -- all image creation commands (create, clone, copy, import, etc)
should accept the "--data-pool" optional.

On Thu, Oct 12, 2017 at 5:01 PM, Josy  wrote:
> I would also like to create a clone of the image present in another pool.
>
> Will this command work ?
>
>  rbd clone --data-pool ecpool pool/image@snap_image ecpool/cloneimage
>
>
> On 13-10-2017 00:19, Jorge Pinilla López wrote:
>
> An rbd image has 2 kinds of data, metadata and the data itself. The metadata
> gives information about the rbd image and holds small amounts of internal
> information. It is placed on the replicated pool, and that's why it looks as
> if the image is in the replicated pool.
> But the actual data (the big part) is stored in the erasure-coded pool.
> So when you start filling your image you will see your replicated
> metadata pool grow a little bit while the actual erasure-coded data pool grows
> way more!
>
> You can get information about your image using rbd info {replicated
> pool}/{image name}
>
>  Original message 
> From: Josy 
> Date: 12/10/17 8:40 PM (GMT+01:00)
> To: David Turner , dilla...@redhat.com
> Cc: ceph-users 
> Subject: Re: [ceph-users] Erasure coding with RBD
>
> Thank you for your reply.
>
> I created a erasure coded pool 'ecpool' and a replicated pool to store
> metadata 'ec_rep_pool'
>
>  And created image as you mentioned :
>
> rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1
>
> But the image seems to be  created in ec_rep_pool
>
> 
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
> NAME           SIZE PARENT FMT PROT LOCK
> ectestimage  20480M          2
> ectestimage1 20480M          2
>
>
> Is that how it suppose to work ?
>
>
> On 12-10-2017 23:53, David Turner wrote:
>
> Here is your friend.
> http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites
>
> On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman  wrote:
>>
>> The image metadata still needs to live in a replicated data pool --
>> only the data blocks can be stored in an EC pool. Therefore, when
>> creating the image, you should provide the "--data-pool "
>> optional to specify the EC pool name.
>>
>>
>> On Thu, Oct 12, 2017 at 2:06 PM, Josy  wrote:
>> > Hi,
>> >
>> > I am trying to setup an erasure coded pool with rbd image.
>> >
>> > The ceph version is Luminous 12.2.1. and I understand,  since Luminous,
>> > RBD
>> > and Cephfs can store their data in an erasure coded pool without use of
>> > cache tiring.
>> >
>> > I created a pool ecpool and when trying to create a rbd image, gets this
>> > error.
>> >
>> > ==
>> >
>> > [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
>> > ecpool/ectestimage2
>> > 2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest:
>> > 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
>> > directory: (95) Operation not supported
>> > rbd: create error: (95) Operation not supported
>> > ==
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS metadata pool to SSDs

2017-10-12 Thread John Spray
On Thu, Oct 12, 2017 at 9:34 PM, Reed Dier  wrote:
> I found an older ML entry from 2015 and not much else, mostly detailing
> performance testing done to dispel the poor performance numbers presented by
> the OP.
>
> Currently have the metadata pool on my slow 24 HDDs, and am curious if I
> should see any increased performance with CephFS by moving the metadata pool
> onto SSD medium.

It depends a lot on the workload.

The primary advantage of moving metadata to dedicated drives
(especially SSDs) is that it makes the system more deterministic under
load.  The most benefit will be seen on systems which had previously
had shared HDD OSDs that were fully saturated with data IO, and were
consequently suffering from very slow metadata writes.

The impact will also depend on whether the metadata workload fit in
the mds_cache_size or not: if the MDS is frequently missing its cache
then the metadata pool latency will be more important.
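(One way to get a feel for that on a live system is the MDS admin socket -- a sketch only, assuming socket access on the MDS host and a daemon named mds.a; counter and option names vary somewhat between releases:

  ceph daemon mds.a config get mds_cache_size
  ceph daemon mds.a perf dump mds        # inspect inode/cap counters to gauge cache pressure
)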

On systems with plenty of spare IOPs, with non-latency-sensitive
workloads, one might see little or no difference in performance when
using SSDs, as those systems would typically bottleneck on the number
of operations per second the MDS daemon can handle (CPU bound).  Systems
like that would benefit more from multiple MDS daemons.

Then again, systems with plenty of spare IOPs can quickly become
congested during recovery/backfill scenarios, so having SSDs for
metadata is a nice risk mitigation to keep the system more responsive
during bad times.

> My thought is that the SSDs are lower latency, and it removes those iops
> from the slower spinning disks.
>
> My next concern would be write amplification on the SSDs. Would this thrash
> the SSD lifespan with tons of little writes or should it not be too heavy of
> a workload to matter too much?

The MDS is comparatively efficient in how it writes out metadata:
journal writes get batched up into larger IOs, and if something is
frequently modified then it doesn't get written back every time (just
when it falls off the end of the journal, or periodically).

If you've got SSDs that you're confident enough to use for data or
general workloads, I wouldn't be too worried about using them for
CephFS metadata.

> My last question from the operations standpoint, if I use:
> # ceph osd pool set fs-metadata crush_ruleset 
> Will this just start to backfill the metadata pool over to the SSDs until it
> satisfies the crush requirements for size and failure domains and not skip a
> beat?

On a healthy cluster, yes, this should just work.  The level of impact
you see will depend on how much else you're trying to do with the
system.  The prioritization of client IO vs. backfill IO has been
improved in luminous, so you should use luminous if you can.

Because the overall size of the metadata pool is small, the smart
thing is probably to find a time that is quiet for your system, and do
the crush rule change at that time to get it over with quickly, rather
than trying to do it during normal operations.
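For reference, on a Luminous cluster with device classes set, the change being discussed can be done roughly like this (the rule name is made up; fs-metadata is the pool name used above; pre-Luminous releases use "crush_ruleset <id>" rather than "crush_rule <name>"):

  # a replicated rule restricted to SSD OSDs
  ceph osd crush rule create-replicated ssd-only default host ssd
  # repoint the CephFS metadata pool; backfill onto the SSDs starts on a healthy cluster
  ceph osd pool set fs-metadata crush_rule ssd-only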

Cheers,
John

>
> Obviously things like enabling dirfrags, and multiple MDS ranks will be more
> likely to improve performance with CephFS, but the metadata pool uses very
> little space, and I have the SSDs already, so I figured I would explore it
> as an option.
>
> Thanks,
>
> Reed
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Flattening loses sparseness

2017-10-12 Thread Massey, Kevin
Hi,

I'm evaluating ceph (Jewel) for an application that will have a chain of 
layered images, with the need to sometimes flatten from the top to limit chain 
length. However, it appears that running "rbd flatten" causes loss of 
sparseness in the clone. For example:

$ rbd --version
ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
$ rbd du
NAME        PROVISIONED  USED
child            10240k     0
parent@snap      10240k     0
parent           10240k     0
                 20480k     0
$ rbd info child
rbd image 'child':
size 10240 kB in 3 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.18c852eb141f2
format: 2
features: layering, exclusive-lock, object-map, fast-diff, deep-flatten
flags: 
parent: rbd/parent@snap
overlap: 10240 kB
$ rbd flatten child
Image flatten: 100% complete...done.
$ rbd du
NAME        PROVISIONED    USED
child            10240k  10240k
parent@snap      10240k       0
parent           10240k       0
                 20480k  10240k

Is there any way to flatten a clone while retaining its sparseness, perhaps in 
Luminous or with BlueStore backend?

Thanks,
Kevin




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Josy

I would also like to create a clone of the image present in another pool.

Will this command work ?

 rbd clone --data-pool ecpool pool/image@snap_image ecpool/cloneimage


On 13-10-2017 00:19, Jorge Pinilla López wrote:
An rbd image has 2 kinds of data, metadata and the data itself. The metadata
gives information about the rbd image and holds small amounts of internal
information. It is placed on the replicated pool, and that's why it looks as
if the image is in the replicated pool.

But the actual data (the big part) is stored in the erasure-coded pool.
So when you start filling your image you will see your replicated
metadata pool grow a little bit while the actual erasure-coded data pool
grows way more!


You can get information about your image using rbd info {replicated
pool}/{image name}


 Original message 
From: Josy 
Date: 12/10/17 8:40 PM (GMT+01:00)
To: David Turner , dilla...@redhat.com
Cc: ceph-users 
Subject: Re: [ceph-users] Erasure coding with RBD

Thank you for your reply.

I created an erasure coded pool 'ecpool' and a replicated pool to store
the metadata, 'ec_rep_pool'


 And created the image as you mentioned:

rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1

But the image seems to be created in ec_rep_pool


[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
NAME           SIZE PARENT FMT PROT LOCK
ectestimage  20480M          2
ectestimage1 20480M          2


Is that how it is supposed to work?


On 12-10-2017 23:53, David Turner wrote:
Here is your friend. 
http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites


On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman > wrote:


The image metadata still needs to live in a replicated data pool --
only the data blocks can be stored in an EC pool. Therefore, when
creating the image, you should provide the "--data-pool "
optional to specify the EC pool name.


On Thu, Oct 12, 2017 at 2:06 PM, Josy > wrote:
> Hi,
>
> I am trying to setup an erasure coded pool with rbd image.
>
> The ceph version is Luminous 12.2.1. and I understand, since
Luminous, RBD
> and Cephfs can store their data in an erasure coded pool
without use of
> cache tiring.
>
> I created a pool ecpool and when trying to create a rbd image,
gets this
> error.
>
> ==
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> ecpool/ectestimage2
> 2017-10-12 10:55:37.992965 7f18857fa700 -1
librbd::image::CreateRequest:
> 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
> ==
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Josy

Thank you all!


On 13-10-2017 00:19, Jorge Pinilla López wrote:
An rbd image has 2 kinds of data, metadata and the data itself. The metadata
gives information about the rbd image and holds small amounts of internal
information. It is placed on the replicated pool, and that's why it looks as
if the image is in the replicated pool.

But the actual data (the big part) is stored in the erasure-coded pool.
So when you start filling your image you will see your replicated
metadata pool grow a little bit while the actual erasure-coded data pool
grows way more!


You can get information about your image using rbd info {replicated
pool}/{image name}


 Original message 
From: Josy 
Date: 12/10/17 8:40 PM (GMT+01:00)
To: David Turner , dilla...@redhat.com
Cc: ceph-users 
Subject: Re: [ceph-users] Erasure coding with RBD

Thank you for your reply.

I created an erasure coded pool 'ecpool' and a replicated pool to store
the metadata, 'ec_rep_pool'


 And created the image as you mentioned:

rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1

But the image seems to be created in ec_rep_pool


[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
NAME           SIZE PARENT FMT PROT LOCK
ectestimage  20480M          2
ectestimage1 20480M          2


Is that how it is supposed to work?


On 12-10-2017 23:53, David Turner wrote:
Here is your friend. 
http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites


On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman > wrote:


The image metadata still needs to live in a replicated data pool --
only the data blocks can be stored in an EC pool. Therefore, when
creating the image, you should provide the "--data-pool "
optional to specify the EC pool name.


On Thu, Oct 12, 2017 at 2:06 PM, Josy > wrote:
> Hi,
>
> I am trying to setup an erasure coded pool with rbd image.
>
> The ceph version is Luminous 12.2.1. and I understand, since
Luminous, RBD
> and Cephfs can store their data in an erasure coded pool
without use of
> cache tiring.
>
> I created a pool ecpool and when trying to create a rbd image,
gets this
> error.
>
> ==
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> ecpool/ectestimage2
> 2017-10-12 10:55:37.992965 7f18857fa700 -1
librbd::image::CreateRequest:
> 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
> ==
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS metadata pool to SSDs

2017-10-12 Thread Reed Dier
I found an older ML entry from 2015 and not much else, mostly detailing 
performance testing done to dispel the poor performance numbers presented by the OP.

Currently have the metadata pool on my slow 24 HDDs, and am curious if I should 
see any increased performance with CephFS by moving the metadata pool onto SSD 
medium.
My thought is that the SSDs are lower latency, and it removes those iops from 
the slower spinning disks.

My next concern would be write amplification on the SSDs. Would this thrash the 
SSD lifespan with tons of little writes or should it not be too heavy of a 
workload to matter too much?

My last question from the operations standpoint, if I use:
# ceph osd pool set fs-metadata crush_ruleset 
Will this just start to backfill the metadata pool over to the SSDs until it 
satisfies the crush requirements for size and failure domains and not skip a 
beat?

Obviously things like enabling dirfrags, and multiple MDS ranks will be more 
likely to improve performance with CephFS, but the metadata pool uses very 
little space, and I have the SSDs already, so I figured I would explore it as 
an option.

Thanks,

Reed
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Jorge Pinilla López
An rbd image has 2 kinds of data, metadata and the data itself. The metadata
gives information about the rbd image and holds small amounts of internal
information. It is placed on the replicated pool, and that's why it looks as if
the image is in the replicated pool. But the actual data (the big part) is
stored in the erasure-coded pool. So when you start filling your image you will
see your replicated metadata pool grow a little bit while the actual
erasure-coded data pool grows way more!
You can get information about your image using rbd info {replicated
pool}/{image name}
 Original message  From: Josy  Date: 12/10/17 8:40 PM (GMT+01:00)
 To: David Turner , dilla...@redhat.com  Cc: ceph-users  Subject: Re: 
[ceph-users] Erasure coding with RBD 

Thank you for your reply.

I created an erasure coded pool 'ecpool' and a replicated pool to store
the metadata, 'ec_rep_pool'

 And created the image as you mentioned:

rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1

But the image seems to be created in ec_rep_pool

[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
NAME           SIZE PARENT FMT PROT LOCK
ectestimage  20480M          2
ectestimage1 20480M          2

Is that how it is supposed to work?


On 12-10-2017 23:53, David Turner wrote:

Here is your friend.
http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites

On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman  wrote:

The image metadata still needs to live in a replicated data pool --
only the data blocks can be stored in an EC pool. Therefore, when
creating the image, you should provide the "--data-pool "
optional to specify the EC pool name.

On Thu, Oct 12, 2017 at 2:06 PM, Josy  wrote:
> Hi,
>
> I am trying to setup an erasure coded pool with rbd image.
>
> The ceph version is Luminous 12.2.1. and I understand, since Luminous, RBD
> and Cephfs can store their data in an erasure coded pool without use of
> cache tiring.
>
> I created a pool ecpool and when trying to create a rbd image, gets this
> error.
>
> ==
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> ecpool/ectestimage2
> 2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest:
> 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
> ==
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


  



  ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Jason Dillaman
Yes -- the "image" will be in the replicated pool and its data blocks
will be in the specified data pool. An "rbd info" against the image
will show the data pool.
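A quick way to check, using the names from this thread (on Luminous, rbd info reports a data_pool field for images created with --data-pool):

  rbd info ec_rep_pool/ectestimage1 | grep data_pool
  # expected to print something like:  data_pool: ecpool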

On Thu, Oct 12, 2017 at 2:40 PM, Josy  wrote:
> Thank you for your reply.
>
> I created a erasure coded pool 'ecpool' and a replicated pool to store
> metadata 'ec_rep_pool'
>
>  And created image as you mentioned :
>
> rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1
>
> But the image seems to be  created in ec_rep_pool
>
> 
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
> NAME           SIZE PARENT FMT PROT LOCK
> ectestimage  20480M          2
> ectestimage1 20480M          2
>
>
> Is that how it suppose to work ?
>
>
> On 12-10-2017 23:53, David Turner wrote:
>
> Here is your friend.
> http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites
>
> On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman  wrote:
>>
>> The image metadata still needs to live in a replicated data pool --
>> only the data blocks can be stored in an EC pool. Therefore, when
>> creating the image, you should provide the "--data-pool "
>> optional to specify the EC pool name.
>>
>>
>> On Thu, Oct 12, 2017 at 2:06 PM, Josy  wrote:
>> > Hi,
>> >
>> > I am trying to setup an erasure coded pool with rbd image.
>> >
>> > The ceph version is Luminous 12.2.1. and I understand,  since Luminous,
>> > RBD
>> > and Cephfs can store their data in an erasure coded pool without use of
>> > cache tiring.
>> >
>> > I created a pool ecpool and when trying to create a rbd image, gets this
>> > error.
>> >
>> > ==
>> >
>> > [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
>> > ecpool/ectestimage2
>> > 2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest:
>> > 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
>> > directory: (95) Operation not supported
>> > rbd: create error: (95) Operation not supported
>> > ==
>> >
>> >
>> >
>> > ___
>> > ceph-users mailing list
>> > ceph-users@lists.ceph.com
>> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>>
>> --
>> Jason
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Josy

Thank you for your reply.

I created an erasure coded pool 'ecpool' and a replicated pool to store
the metadata, 'ec_rep_pool'


 And created the image as you mentioned:

rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage1

But the image seems to be created in ec_rep_pool


[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ecpool
[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd ls -l ec_rep_pool
NAME           SIZE PARENT FMT PROT LOCK
ectestimage  20480M          2
ectestimage1 20480M          2


Is that how it is supposed to work?


On 12-10-2017 23:53, David Turner wrote:
Here is your friend. 
http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites


On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman > wrote:


The image metadata still needs to live in a replicated data pool --
only the data blocks can be stored in an EC pool. Therefore, when
creating the image, you should provide the "--data-pool "
optional to specify the EC pool name.


On Thu, Oct 12, 2017 at 2:06 PM, Josy > wrote:
> Hi,
>
> I am trying to setup an erasure coded pool with rbd image.
>
> The ceph version is Luminous 12.2.1. and I understand, since
Luminous, RBD
> and Cephfs can store their data in an erasure coded pool without
use of
> cache tiring.
>
> I created a pool ecpool and when trying to create a rbd image,
gets this
> error.
>
> ==
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> ecpool/ectestimage2
> 2017-10-12 10:55:37.992965 7f18857fa700 -1
librbd::image::CreateRequest:
> 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
> ==
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com 
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



--
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] using Bcache on blueStore

2017-10-12 Thread Jorge Pinilla López
Hey all!
I have a Ceph cluster with multiple HDDs and 1 really fast SSD (30GB per
OSD) per host.

I have been thinking, and all the docs say that I should give all the SSD
space to RocksDB, so I would have the data on HDD and a 30GB partition for
RocksDB.

But it came to my mind that if the OSD isn't full, maybe I am not using
all the space on the SSD; or maybe I would prefer having a really small
amount of hot k/v and metadata, and the data itself, on a really fast device
rather than just storing all the cold metadata there.

So I thought of using bcache to make the SSD a cache: as metadata
and k/v are usually hot, they should be placed in the cache. But this
doesn't guarantee that k/v and metadata are actually always on the SSD,
because under heavy cache load they can be pushed out (for example by
really big data files).

So I came up with the idea of setting a small 5-10GB partition for the
hot RocksDB and using the rest as a cache, so I make sure that the
really hot metadata is actually always on the SSD and the colder data
should also be on the SSD (via bcache) if it's not really freezing cold,
in which case it would be pushed to the HDD. It also doesn't make any
sense to have metadata that you never use taking up space on the SSD; I
would rather use that space to store hotter data.

This would also make writes faster, and in BlueStore we don't have the
double write problem, so it should work fine.

What do you think about this? Does it have any downside? Is there any
other way?
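For anyone wanting to try this layout, the rough shape of the setup would be something like the following -- a sketch only: device names are examples, bcache-tools and the Luminous ceph-volume tool are assumed, and the cache-set UUID is a placeholder:

  # small SSD partition (e.g. /dev/nvme0n1p1) reserved for RocksDB,
  # the rest (/dev/nvme0n1p2) used as the bcache cache device
  make-bcache -B /dev/sdb                 # HDD becomes the backing device (/dev/bcache0)
  make-bcache -C /dev/nvme0n1p2           # big SSD partition becomes the cache set
  bcache-super-show /dev/nvme0n1p2        # note the cset.uuid
  echo <cset-uuid> > /sys/block/bcache0/bcache/attach
  # build the OSD on the cached device, keeping the DB on the raw SSD partition
  ceph-volume lvm create --bluestore --data /dev/bcache0 --block.db /dev/nvme0n1p1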

*Jorge Pinilla López*
jorp...@unizar.es
Computer engineering student
Systems area intern (SICUZ)
Universidad de Zaragoza
PGP-KeyID: A34331932EBC715A


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread David Turner
Here is your friend.
http://docs.ceph.com/docs/luminous/rados/operations/erasure-code/#erasure-coding-with-overwrites
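In short, the steps from that page applied to the pools in this thread (Luminous 12.2.x; EC overwrites require the pool's OSDs to be BlueStore):

  ceph osd pool set ecpool allow_ec_overwrites true
  rbd create --size 20G --data-pool ecpool ec_rep_pool/ectestimage2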

On Thu, Oct 12, 2017 at 2:09 PM Jason Dillaman  wrote:

> The image metadata still needs to live in a replicated data pool --
> only the data blocks can be stored in an EC pool. Therefore, when
> creating the image, you should provide the "--data-pool "
> optional to specify the EC pool name.
>
>
> On Thu, Oct 12, 2017 at 2:06 PM, Josy  wrote:
> > Hi,
> >
> > I am trying to setup an erasure coded pool with rbd image.
> >
> > The ceph version is Luminous 12.2.1. and I understand,  since Luminous,
> RBD
> > and Cephfs can store their data in an erasure coded pool without use of
> > cache tiring.
> >
> > I created a pool ecpool and when trying to create a rbd image, gets this
> > error.
> >
> > ==
> >
> > [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> > ecpool/ectestimage2
> > 2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest:
> > 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> > directory: (95) Operation not supported
> > rbd: create error: (95) Operation not supported
> > ==
> >
> >
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Jason
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] BlueStore Cache Ratios

2017-10-12 Thread Jorge Pinilla López
Hey Mark,

Thanks a lot for the info. You should really make a paper of it and post
it :)

First of all, I am sorry if I say something wrong; I am still learning
about this topic and speaking from a position of total unawareness.

Second, I understood that the ratios are a way of controlling priorities and
that they ensure bloom filters and indexes don't get paged out of the cache,
which really makes sense.

Also, the 512MB restriction kind of makes sense, but I don't really know if
it would make any sense to give more space to the rocksdb block cache (like
1GB). I think only testing can resolve that question, because I think it
really depends on the workload.

What I don't understand is why data isn't cached at all, even if there is
free space for it. I understand the order of importance would be: bloom
filter and index >> metadata >> data, but if there is free space left
for data then why not go for it? Maybe setting ratios of 0.90 k/v, 0.09
metadata and 0.01 data would make more sense.
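Expressed as config options, the split being speculated about above would look something like this (option names as of Luminous; the values are only the ones mentioned in this thread, not a recommendation):

  [osd]
  bluestore_cache_size_ssd   = 3221225472   # 3 GB total, the SSD default discussed above
  bluestore_cache_kv_max     = 1073741824   # let the rocksdb block cache grow past the 512 MB default
  bluestore_cache_kv_ratio   = 0.90
  bluestore_cache_meta_ratio = 0.09
  # the remaining 0.01 of the cache is left for BlueStore's data cache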


El 11/10/2017 a las 15:44, Mark Nelson escribió:
> Hi Jorge,
>
> I was sort of responsible for all of this. :)
>
> So basically there are different caches in different places:
>
> - rocksdb cache
> - rocksdb block cache (which can be configured to include filters and
> indexes)
> - rocksdb compressed block cache
> - bluestore onode cache
>
> The bluestore onode cache is the only one that stores
> onode/extent/blob metadata before it is encoded, ie it's bigger but
> has lower impact on the CPU.  The next step is the regular rocksdb
> block cache where we've already encoded the data, but it's not
> compressed.  Optionally we could also compress the data and then cache
> it using rocksdb's compressed block cache.  Finally, rocksdb can set
> memory aside for bloom filters and indexes but we're configuring those
> to go into the block cache so we can get a better accounting for how
> memory is being used (otherwise it's difficult to control how much
> memory index and filters get).  The downside is that bloom filters and
> indexes can theoretically get paged out under heavy cache pressure. 
> We set these to be high priority in the block cache and also pin the
> L0 filters/index though to help avoid this.
>
> In the testing I did earlier this year, what I saw is that in low
> memory scenarios it's almost always best to give all of the cache to
> rocksdb's block cache.  Once you hit about the 512MB mark, we start
> seeing bigger gains by giving additional memory to bluestore's onode
> cache.  So we devised a mechanism where you can decide where to cut
> over.  It's quite possible that on very fast CPUs it might make sense
> to use rocksdb compressed cache, or possibly if you have a huge number
> of objects these ratios might change.  The values we have now were
> sort of the best jack-of-all-trades values we found.
>
> Mark
>
> On 10/11/2017 08:32 AM, Jorge Pinilla López wrote:
>> okay, thanks for the explanation, so from the 3GB of Cache (default
>> cache for SSD) only a 0.5GB is going to K/V and 2.5 going to metadata.
>>
>> Is there a way of knowing how much k/v, metadata, data is storing and
>> how full cache is so I can adjust my ratios?, I was thinking some ratios
>> (like 0.9 k/v, 0.07 meta 0.03 data) but only speculating, I dont have
>> any real data.
>>
>> El 11/10/2017 a las 14:32, Mohamad Gebai escribió:
>>> Hi Jorge,
>>>
>>> On 10/10/2017 07:23 AM, Jorge Pinilla López wrote:
 Are .99 KV, .01 MetaData and .0 Data ratios right? they seem a little
 too disproporcionate.
>>> Yes, this is correct.
>>>
 Also .99 KV and Cache of 3GB for SSD means that almost the 3GB would
 be used for KV but there is also another attributed called
 bluestore_cache_kv_max which is by fault 512MB, then what is the rest
 of the cache used for?, nothing? shouldnt it be more kv_max value or
 less KV ratio?
>>> Anything over the *cache_kv_max value goes to the metadata cache. You
>>> can look in your logs to see the final values of kv, metadata and data
>>> cache ratios. To get data cache, you need to lower the ratios of
>>> metadata and kv caches.
>>>
>>> Mohamad
>>
>> -- 
>> 
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> Computer engineering student
>> Systems area intern (SICUZ)
>> Universidad de Zaragoza
>> PGP-KeyID: A34331932EBC715A
>> 
>>
>> 
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>

-- 

*Jorge Pinilla López*
jorp...@unizar.es
Estudiante de ingenieria 

Re: [ceph-users] Erasure coding with RBD

2017-10-12 Thread Jason Dillaman
The image metadata still needs to live in a replicated data pool --
only the data blocks can be stored in an EC pool. Therefore, when
creating the image, you should provide the "--data-pool "
option to specify the EC pool name.


On Thu, Oct 12, 2017 at 2:06 PM, Josy  wrote:
> Hi,
>
> I am trying to setup an erasure coded pool with rbd image.
>
> The ceph version is Luminous 12.2.1. and I understand,  since Luminous, RBD
> and Cephfs can store their data in an erasure coded pool without use of
> cache tiring.
>
> I created a pool ecpool and when trying to create a rbd image, gets this
> error.
>
> ==
>
> [cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G
> ecpool/ectestimage2
> 2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest:
> 0x55608e1e0c20 handle_add_image_to_directory: error adding image to
> directory: (95) Operation not supported
> rbd: create error: (95) Operation not supported
> ==
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Jason
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Erasure coding with RBD

2017-10-12 Thread Josy

Hi,

I am trying to set up an erasure coded pool with an rbd image.

The ceph version is Luminous 12.2.1, and I understand that, since Luminous,
RBD and CephFS can store their data in an erasure coded pool without the use
of cache tiering.


I created a pool ecpool, and when trying to create an rbd image, I get this
error.


==

[cephuser@ceph-las-admin-a1 ceph-cluster]$ rbd create --size 20G 
ecpool/ectestimage2
2017-10-12 10:55:37.992965 7f18857fa700 -1 librbd::image::CreateRequest: 
0x55608e1e0c20 handle_add_image_to_directory: error adding image to 
directory: (95) Operation not supported

rbd: create error: (95) Operation not supported
==



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] objects degraded higher than 100%

2017-10-12 Thread Gregory Farnum
On Thu, Oct 12, 2017 at 10:52 AM Florian Haas  wrote:

> On Thu, Oct 12, 2017 at 7:22 PM, Gregory Farnum 
> wrote:
> >
> >
> > On Thu, Oct 12, 2017 at 3:50 AM Florian Haas 
> wrote:
> >>
> >> On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann 
> >> wrote:
> >> > Hi,
> >> >
> >> > how could this happen:
> >> >
> >> > pgs: 197528/1524 objects degraded (12961.155%)
> >> >
> >> > I did some heavy failover tests, but a value higher than 100% looks
> >> > strange
> >> > (ceph version 12.2.0). Recovery is quite slow.
> >> >
> >> >   cluster:
> >> > health: HEALTH_WARN
> >> > 3/1524 objects misplaced (0.197%)
> >> > Degraded data redundancy: 197528/1524 objects degraded
> >> > (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
> >> >
> >> >   data:
> >> > pools:   1 pools, 2048 pgs
> >> > objects: 508 objects, 1467 MB
> >> > usage:   127 GB used, 35639 GB / 35766 GB avail
> >> > pgs: 197528/1524 objects degraded (12961.155%)
> >> >  3/1524 objects misplaced (0.197%)
> >> >  1042 active+recovery_wait+degraded
> >> >  991  active+clean
> >> >  8active+recovering+degraded
> >> >  3active+undersized+degraded+remapped+backfill_wait
> >> >  2active+recovery_wait+degraded+remapped
> >> >  2active+remapped+backfill_wait
> >> >
> >> >   io:
> >> > recovery: 340 kB/s, 80 objects/s
> >>
> >> Did you ever get to the bottom of this? I'm seeing something very
> >> similar on a 12.2.1 reference system:
> >>
> >> https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c
> >>
> >> I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":
> >> https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85
> >>
> >> The odd thing in there is that the "bench" pool was empty when the
> >> recovery started (that pool had been wiped with "rados cleanup"), so
> >> the number of objects deemed to be missing from the primary really
> >> ought to be zero.
> >>
> >> It seems like it's considering these deleted objects to still require
> >> replication, but that sounds rather far fetched to be honest.
> >
> >
> > Actually, that makes some sense. This cluster had an OSD down while (some
> > of) the deletes were happening?
>
> I thought of exactly that too, but no it didn't. That's the problem.
>

Okay, in that case I've no idea. What was the timeline for the recovery
versus the rados bench and cleanup versus the degraded object counts, then?


>
> > I haven't dug through the code but I bet it is considering those as
> degraded
> > objects because the out-of-date OSD knows it doesn't have the latest
> > versions on them! :)
>
> Yeah I bet against that. :)
>
> Another tidbit: these objects were not deleted with rados rm, they
> were cleaned up after rados bench. In the case quoted above, this was
> an explicit "rados cleanup" after "rados bench --no-cleanup"; in
> another, I saw the same behavior after a regular "rados bench" that
> included the automatic cleanup.
>
> So there are two hypotheses here:
> (1) The deletion in rados bench is neglecting to do something that a
> regular object deletion does do. Given the fact that at least one
> other thing is fishy in rados bench
> (http://tracker.ceph.com/issues/21375), this may be due to some simple
> oversight in the Luminous cycle, and thus would constitute a fairly
> minor (if irritating) issue.
> (2) Regular object deletion is buggy in some previously unknown
> fashion. That would be a rather major problem.
>

These both seem exceedingly unlikely. *shrug*


>
> By the way, *deleting the pool* altogether makes the degraded object
> count drop to expected levels immediately. Probably no surprise there,
> though.
>
> Cheers,
> Florian
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] objects degraded higher than 100%

2017-10-12 Thread Florian Haas
On Thu, Oct 12, 2017 at 7:22 PM, Gregory Farnum  wrote:
>
>
> On Thu, Oct 12, 2017 at 3:50 AM Florian Haas  wrote:
>>
>> On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann 
>> wrote:
>> > Hi,
>> >
>> > how could this happen:
>> >
>> > pgs: 197528/1524 objects degraded (12961.155%)
>> >
>> > I did some heavy failover tests, but a value higher than 100% looks
>> > strange
>> > (ceph version 12.2.0). Recovery is quite slow.
>> >
>> >   cluster:
>> > health: HEALTH_WARN
>> > 3/1524 objects misplaced (0.197%)
>> > Degraded data redundancy: 197528/1524 objects degraded
>> > (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
>> >
>> >   data:
>> > pools:   1 pools, 2048 pgs
>> > objects: 508 objects, 1467 MB
>> > usage:   127 GB used, 35639 GB / 35766 GB avail
>> > pgs: 197528/1524 objects degraded (12961.155%)
>> >  3/1524 objects misplaced (0.197%)
>> >  1042 active+recovery_wait+degraded
>> >  991  active+clean
>> >  8active+recovering+degraded
>> >  3active+undersized+degraded+remapped+backfill_wait
>> >  2active+recovery_wait+degraded+remapped
>> >  2active+remapped+backfill_wait
>> >
>> >   io:
>> > recovery: 340 kB/s, 80 objects/s
>>
>> Did you ever get to the bottom of this? I'm seeing something very
>> similar on a 12.2.1 reference system:
>>
>> https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c
>>
>> I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":
>> https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85
>>
>> The odd thing in there is that the "bench" pool was empty when the
>> recovery started (that pool had been wiped with "rados cleanup"), so
>> the number of objects deemed to be missing from the primary really
>> ought to be zero.
>>
>> It seems like it's considering these deleted objects to still require
>> replication, but that sounds rather far fetched to be honest.
>
>
> Actually, that makes some sense. This cluster had an OSD down while (some
> of) the deletes were happening?

I thought of exactly that too, but no it didn't. That's the problem.

> I haven't dug through the code but I bet it is considering those as degraded
> objects because the out-of-date OSD knows it doesn't have the latest
> versions on them! :)

Yeah I bet against that. :)

Another tidbit: these objects were not deleted with rados rm, they
were cleaned up after rados bench. In the case quoted above, this was
an explicit "rados cleanup" after "rados bench --no-cleanup"; in
another, I saw the same behavior after a regular "rados bench" that
included the automatic cleanup.

So there are two hypotheses here:
(1) The deletion in rados bench is neglecting to do something that a
regular object deletion does do. Given the fact that at least one
other thing is fishy in rados bench
(http://tracker.ceph.com/issues/21375), this may be due to some simple
oversight in the Luminous cycle, and thus would constitute a fairly
minor (if irritating) issue.
(2) Regular object deletion is buggy in some previously unknown
fashion. That would be a rather major problem.

By the way, *deleting the pool* altogether makes the degraded object
count drop to expected levels immediately. Probably no surprise there,
though.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] objects degraded higher than 100%

2017-10-12 Thread Gregory Farnum
On Thu, Oct 12, 2017 at 3:50 AM Florian Haas  wrote:

> On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann 
> wrote:
> > Hi,
> >
> > how could this happen:
> >
> > pgs: 197528/1524 objects degraded (12961.155%)
> >
> > I did some heavy failover tests, but a value higher than 100% looks
> strange
> > (ceph version 12.2.0). Recovery is quite slow.
> >
> >   cluster:
> > health: HEALTH_WARN
> > 3/1524 objects misplaced (0.197%)
> > Degraded data redundancy: 197528/1524 objects degraded
> > (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
> >
> >   data:
> > pools:   1 pools, 2048 pgs
> > objects: 508 objects, 1467 MB
> > usage:   127 GB used, 35639 GB / 35766 GB avail
> > pgs: 197528/1524 objects degraded (12961.155%)
> >  3/1524 objects misplaced (0.197%)
> >  1042 active+recovery_wait+degraded
> >  991  active+clean
> >  8active+recovering+degraded
> >  3active+undersized+degraded+remapped+backfill_wait
> >  2active+recovery_wait+degraded+remapped
> >  2active+remapped+backfill_wait
> >
> >   io:
> > recovery: 340 kB/s, 80 objects/s
>
> Did you ever get to the bottom of this? I'm seeing something very
> similar on a 12.2.1 reference system:
>
> https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c
>
> I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":
> https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85
>
> The odd thing in there is that the "bench" pool was empty when the
> recovery started (that pool had been wiped with "rados cleanup"), so
> the number of objects deemed to be missing from the primary really
> ought to be zero.
>
> It seems like it's considering these deleted objects to still require
> replication, but that sounds rather far fetched to be honest.
>

Actually, that makes some sense. This cluster had an OSD down while (some
of) the deletes were happening?

I haven't dug through the code but I bet it is considering those as
degraded objects because the out-of-date OSD knows it doesn't have the
latest versions on them! :)
-Greg
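
As a quick sanity check on the arithmetic (reading the denominator as
object copies rather than objects, which matches the numbers shown, since
1524 = 508 x 3):

    197528 / 1524 = 129.61155...  ->  12961.155%

So the percentage is computed consistently from those two counters; the
striking part is that far more degraded copies are counted than the pool
currently holds, which would fit the idea that deleted objects are still
being tracked.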
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] min_size & hybrid OSD latency

2017-10-12 Thread Gregory Farnum
On Wed, Oct 11, 2017 at 7:42 AM Reed Dier  wrote:

> Just for the sake of putting this in the public forum,
>
> In theory, by placing the primary copy of the object on an SSD medium, and
> placing replica copies on HDD medium, it should still yield *some* improvement
> in writes, compared to an all HDD scenario.
>
> My logic here is rooted in the idea that the first copy requires a write,
> ACK, and then a read to send a copy to the replicas.
> So instead of a slow write, and a slow read on your first hop, you have a
> fast write and fast read on the first hop, before pushing out to the slower
> second hop of 2x slow writes and ACKs.
>

Nope. A write remains in memory the whole time, and it does *not* need to
be committed on the primary before the primary sends it to replica OSDs.
They do their writing in parallel.


> Doubly so, if you have active io on the cluster, the SSD is taking all of
> the read io away from the slow HDDs, freeing up iops on the HDDs, which in
> turn should clear write ops quicker.
>

That's accurate, though. :)
-Greg
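
For completeness, the hybrid placement being discussed (primary on SSD,
replicas on HDD) is normally expressed with a CRUSH rule along these
lines. This is only a sketch, assuming Luminous device classes and a root
named "default", not something taken from this thread:

    rule hybrid {
            id 1
            type replicated
            min_size 1
            max_size 10
            # first replica (and therefore the primary) comes from an SSD host
            step take default class ssd
            step chooseleaf firstn 1 type host
            step emit
            # remaining replicas come from HDD hosts
            step take default class hdd
            step chooseleaf firstn -1 type host
            step emit
    }

Applied with "ceph osd pool set <pool> crush_rule hybrid", reads are then
served by the SSD primary, which is the part confirmed as accurate above.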


>
> Please poke holes in this if you can.
>
> Hopefully this will be useful for someone searching the ML.
>
> Thanks,
>
> Reed
>
>
> On Oct 10, 2017, at 6:50 PM, Christian Balzer  wrote:
>
> All writes have to be ACKed, the only time where hybrid stuff helps is to
> accelerate reads.
> Which is something that people like me at least have very little interest
> in as the writes need to be fast.
>
> Christian
>
> (the same setup could involve some high latency OSDs, in the case of
> country-level cluster)
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
> --
> Christian Balzer            Network/Systems Engineer
> ch...@gol.com               Rakuten Communications
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] MGR Dashboard hostname missing

2017-10-12 Thread Richard Hesketh
On 12/10/17 17:15, Josy wrote:
> Hello,
> 
> After taking down a couple of OSDs, the dashboard is not showing the
> corresponding hostname.

Ceph-mgr is known to have issues associating services with hostnames
sometimes, e.g. http://tracker.ceph.com/issues/20887

Fixes look to be incoming.

Rich



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] MGR Dashboard hostname missing

2017-10-12 Thread Josy

Hello,

After taking down a couple of OSDs, the dashboard is not showing the
corresponding hostname.



It shows correctly in ceph osd tree output
--

-15    3.49280 host ceph-las1-a7-osd
 21   ssd  0.87320 osd.21    up  1.0 1.0
 22   ssd  0.87320 osd.22    up  1.0 1.0
 23   ssd  0.87320 osd.23    up  1.0 1.0
 24   ssd  0.87320 osd.24    up  1.0 1.0
-17    2.61960 host ceph-las1-a8-osd
 25   ssd  0.87320 osd.25    up  1.0 1.0
 27   ssd  0.87320 osd.27    up  1.0 1.0
 28   ssd  0.87320 osd.28    up  1.0 1.0



Host              ID  Status  PGs  Usage            Read bytes  Write bytes  Read ops  Write ops

                  21  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  22  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  23  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  24  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  25  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  27  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
                  28  up,in   0    0B/0B            0B/s        0B/s         0/s       0/s
ceph-las1-a2-osd  2   up,in   198  56.1GiB/894GiB   20.1KiB/s   263KiB/s     16.2/s    18.8/s





___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Luminous 12.2.1 - RadosGW Multisite doesnt replicate multipart uploads

2017-10-12 Thread Casey Bodley
Thanks Enrico. I wrote a test case that reproduces the issue, and opened 
http://tracker.ceph.com/issues/21772 to track the bug. It sounds like 
this is a regression in luminous.
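
For anyone hitting the same thing, the sync state can be watched from the
affected zone with something like the following (a sketch; the bucket name
is the one used in this thread):

    radosgw-admin sync status
    radosgw-admin bucket sync status --bucket=testbucket
    radosgw-admin sync error list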



On 10/11/2017 06:41 PM, Enrico Kern wrote:

or this:

   {
"shard_id": 22,
"entries": [
{
"id": "1_1507761448.758184_10459.1",
"section": "data",
"name": 
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3/Wireshark-win64-2.2.7.exe",

"timestamp": "2017-10-11 22:37:28.758184Z",
"info": {
"source_zone": "6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync object"
}
}
]
},


 




On Thu, Oct 12, 2017 at 12:39 AM, Enrico Kern 
> wrote:


its 45MB, but it happens with all multipart uploads.

sync error list shows

   {
"shard_id": 31,
"entries": [
{
"id": "1_1507761459.607008_8197.1",
"section": "data",
"name":
"testbucket:6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.3",
"timestamp": "2017-10-11 22:37:39.607008Z",
"info": {
"source_zone":
"6a9448d2-bdba-4bec-aad6-aba72cd8eac6",
"error_code": 5,
"message": "failed to sync bucket instance:
(5) Input/output error"
}
}
]
}

for multiple shards not just this one



On Thu, Oct 12, 2017 at 12:31 AM, Yehuda Sadeh-Weinraub
> wrote:

What is the size of the object? Is it only this one?

Try this command: 'radosgw-admin sync error list'. Does it
show anything related to that object?

Thanks,
Yehuda


On Wed, Oct 11, 2017 at 3:26 PM, Enrico Kern
> wrote:

if i change permissions the sync status shows that it is
syncing 1 shard, but no files ends up in the pool (testing
with empty data pool). after a while it shows that data is
back in sync but there is no file

On Wed, Oct 11, 2017 at 11:26 PM, Yehuda Sadeh-Weinraub
> wrote:

Thanks for your report. We're looking into it. You can
try to see if touching the object (e.g., modifying its
permissions) triggers the sync.

Yehuda

On Wed, Oct 11, 2017 at 1:36 PM, Enrico Kern
> wrote:

Hi David,

yeah seems you are right, they are stored as
different filenames in the data bucket when using
multisite upload. But anyway it stil doesnt get
replicated. As example i have files like


6a9448d2-bdba-4bec-aad6-aba72cd8eac6.21344646.1__multipart_Wireshark-win64-2.2.7.exe.2~0LAfq93OMdk7hrijvyzW_EBRkVQLX37.6

in the data pool on one zone. But its not
replicated to the other zone. naming is not
relevant, the other data bucket doesnt have any
file multipart or not.

im really missing the file on the other zone.








On Wed, Oct 11, 2017 at 10:25 PM, David Turner
> wrote:

Multipart is a client side setting when
uploading. Multisite in and of itself is a
client and it doesn't use multipart (at least
not by default). I have a Jewel RGW Multisite
cluster and one site has the object as
multi-part while the second site just has it
as a single object.  I had to change from
looking at the objects in the pool for
monitoring to looking at an ls of the buckets
 

Re: [ceph-users] Re : general protection fault: 0000 [#1] SMP

2017-10-12 Thread Luis Henriques
Olivier Bonvalet  writes:

> Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit :
>> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
>> ceph-induced, it looks like.  I don't remember seeing anything
>> similar
>> in the context of krbd.
>> 
>> This is a Xen dom0 kernel, right?  What did the workload look like?
>> Can you provide dmesg before the crash?
>
> Hi,
>
> yes it's a Xen dom0 kernel. Linux 4.13.3, Xen 4.8.2, with an old
> 0.94.10 Ceph (so, Hammer).
>
> Before this error, I add this in logs :
>
> Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message 
> 88021a910200 data crc 2306836368 != exp. 2215155875
> Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 
> 10.0.0.31:6804 bad crc/signature
> Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message 
> 880220b10100 data crc 842840543 != exp. 2657161714
> Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 
> bad crc/signature
> Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message 
> 88025d1aa400 data crc 938978341 != exp. 4154366769
> Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 
> 10.0.0.31:6804 bad crc/signature
> Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message 
> 880260300400 data crc 2988747958 != exp. 319958859
> Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 
> bad crc/signature
> Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message 
> 8802208b8300 data crc 369498361 != exp. 906022772
> Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 
> bad crc/signature
> Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message 
> 880236f9a000 data crc 2586662963 != exp. 2886241494
> Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 
> bad crc/signature
> Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message 
> 88027e633c00 data crc 1102363051 != exp. 679243837
> Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 
> bad crc/signature
> Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message 
> 88024ce0a800 data crc 2467415342 != exp. 1753860323
> Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 
> 10.0.0.30:6804 bad crc/signature
> Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 
>  [#1] SMP
>
>
> We had to switch to TCP Cubic (instead of badly configured TCP BBR, without 
> FQ), to reduce the data crc errors.
> But since we still had some errors, last night we rebooted all the OSD nodes 
> in Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.
>
> Since the last 7 hours, we haven't got any data crc errors from OSD, but we 
> had one from a MON. Without hang/crash.

Since there are a bunch of errors before the GPF, I suspect this bug is
related to some error paths that haven't been thoroughly tested (as is
the case for error paths in general, I guess).

My initial guess was a race in ceph_con_workfn:

 - An error returned from try_read() would cause a delayed retry (in
   function con_fault())
 - con_fault_finish() would then trigger a ceph_con_close/ceph_con_open in
   osd_fault.
 - the delayed retry kicks-in and the above close+open, which includes
   releasing con->in_msg and con->out_msg, could cause this GPF.

Unfortunately, I wasn't yet able to find any race there (probably because
there's none), but maybe there's a small window where this could occur.

I wonder if this occurred only once, or if this is something that is
easily triggerable.

Cheers,
-- 
Luis
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread Jason Dillaman
On Thu, Oct 12, 2017 at 5:02 AM, Maged Mokhtar  wrote:

> On 2017-10-11 14:57, Jason Dillaman wrote:
>
> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López 
> wrote:
>
>> As far as I am able to understand there are 2 ways of setting iscsi for
>> ceph
>>
>> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
>>
>
> The target_core_rbd approach is only utilized by SUSE (and its derivatives
> like PetaSAN) as far as I know. This was the initial approach for Red
> Hat-derived kernels as well until the upstream kernel maintainers indicated
> that they really do not want a specialized target backend for just krbd.
> The next attempt was to re-use the existing target_core_iblock to interface
> with krbd via the kernel's block layer, but that hit similar upstream walls
> trying to get support for SCSI command passthrough to the block layer.
>
>
>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
>>
>
> The TCMU approach is what upstream and Red Hat-derived kernels will
> support going forward.
>
> The lrbd project was developed by SUSE to assist with configuring a
> cluster of iSCSI gateways via the cli.  The ceph-iscsi-config +
> ceph-iscsi-cli projects are similar in goal but take a slightly different
> approach. ceph-iscsi-config provides a set of common Python libraries that
> can be re-used by ceph-iscsi-cli and ceph-ansible for deploying and
> configuring the gateway. The ceph-iscsi-cli project provides the gwcli tool
> which acts as a cluster-aware replacement for targetcli.
>
>
>> I don't know which one is better, I am seeing that oficial support is
>> pointing to tcmu but i havent done any testbench.
>>
>
> We (upstream Ceph) provide documentation for the TCMU approach because
> that is what is available against generic upstream kernels (starting with
> 4.14 when it's out). Since it uses librbd (which still needs to undergo
> some performance improvements) instead of krbd, we know that librbd 4k IO
> performance is slower compared to krbd, but 64k and 128k IO performance is
> comparable. However, I think most iSCSI tuning guides would already tell
> you to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks).
>
>
>> Does anyone tried both? Do they give the same output? Are both able to
>> manage multiple iscsi targets mapped to a single rbd disk?
>>
>
> Assuming you mean multiple portals mapped to the same RBD disk, the answer
> is yes, both approaches should support ALUA. The ceph-iscsi-config tooling
> will only configure Active/Passive because we believe there are certain
> edge conditions that could result in data corruption if configured for
> Active/Active ALUA.
>
> The TCMU approach also does not currently support SCSI persistent
> reservation groups (needed for Windows clustering) because that support
> isn't available in the upstream kernel. The SUSE kernel has an approach
> that utilizes two round-trips to the OSDs for each IO to simulate PGR
> support. Earlier this summer I believe SUSE started to look into how to get
> generic PGR support merged into the upstream kernel using corosync/dlm to
> synchronize the states between multiple nodes in the target. I am not sure
> of the current state of that work, but it would benefit all LIO targets
> when complete.
>
>
>> I will try to make my own testing but if anyone has tried in advance it
>> would be really helpful.
>>
>> --
>> *Jorge Pinilla López*
>> jorp...@unizar.es
>> --
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> Jason
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
> Hi Jason,
>
> Similar to TCMU user space backstore approach, i would prefer cluster sync
> of PR and other task management be done user space. It really does not
> belong in the kernel and will give more flexibility in implementation. A
> user space PR get/set interface could be implemented via:
>
> -corosync
> -Writing PR metada to Ceph / network share
> -Use Ceph watch/notify
>
> Also in the future it may be beneficial to build/extend on Ceph features
> such as exclusive locks and paxos based leader election for applications
> such as iSCSI gateways to use for resource distribution and fail over as an
> alternative to Pacemaker which has sociability limits.
>
> Maged
>

I would definitely love to eventually see a pluggable TCMU interface for
distributing PGRs, port states, and any other shared state to avoid the
need to configure a separate 

Re: [ceph-users] killing ceph-disk [was Re: ceph-volume: migration and disk partition support]

2017-10-12 Thread Matthew Vernon

Hi,

On 09/10/17 16:09, Sage Weil wrote:

To put this in context, the goal here is to kill ceph-disk in mimic.

One proposal is to make it so new OSDs can *only* be deployed with LVM,
and old OSDs with the ceph-disk GPT partitions would be started via
ceph-volume support that can only start (but not deploy new) OSDs in that
style.

Is the LVM-only-ness concerning to anyone?

Looking further forward, NVMe OSDs will probably be handled a bit
differently, as they'll eventually be using SPDK and kernel-bypass (hence,
no LVM).  For the time being, though, they would use LVM.


This seems the best point to jump in on this thread. We have a ceph 
(Jewel / Ubuntu 16.04) cluster with around 3k OSDs, deployed with 
ceph-ansible. They are plain-disk OSDs with journal on NVME partitions. 
I don't think this is an unusual configuration :)


I think to get rid of ceph-disk, we would want at least some of the 
following:


* solid scripting for "move slowly through cluster migrating OSDs from 
disk to lvm" - 1 OSD at a time isn't going to produce unacceptable 
rebalance load, but it is going to take a long time, so such scripting 
would have to cope with being stopped and restarted and suchlike (and be 
able to use the correct journal partitions); a rough sketch of such a 
per-OSD step follows this list


* ceph-ansible support for "some lvm, some plain disk" arrangements - 
presuming a "create new OSDs as lvm" approach when adding new OSDs or 
replacing failed disks


* support for plain disk (regardless of what provides it) that remains 
solid for some time yet
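
For the first point, the per-OSD step would presumably look something like
the sketch below (Luminous-era commands; the OSD id, data device and
journal partition are placeholders, and the cluster should settle between
iterations):

    ID=42                                            # placeholder OSD id
    ceph osd out $ID
    while ! ceph health | grep -q HEALTH_OK; do sleep 60; done
    systemctl stop ceph-osd@$ID
    ceph osd purge $ID --yes-i-really-mean-it
    ceph-volume lvm zap /dev/sdX                     # placeholder data disk
    ceph-volume lvm create --filestore --data /dev/sdX --journal /dev/nvme0n1pY

Anything production-grade would also need to pick the right journal
partition per OSD and cope with being interrupted part-way, as noted
above.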



On Fri, 6 Oct 2017, Alfredo Deza wrote:



Bluestore support should be the next step for `ceph-volume lvm`, and
while that is planned we are thinking of ways to improve the current
caveats (like OSDs not coming up) for clusters that have deployed OSDs
with ceph-disk.


These issues seem mostly to be down to timeouts being too short and the 
single global lock for activating OSDs.



IMO we can't require any kind of data migration in order to upgrade, which
means we either have to (1) keep ceph-disk around indefinitely, or (2)
teach ceph-volume to start existing GPT-style OSDs.  Given all of the
flakiness around udev, I'm partial to #2.  The big question for me is
whether #2 alone is sufficient, or whether ceph-volume should also know
how to provision new OSDs using partitions and no LVM.  Hopefully not?


I think this depends on how well tools such as ceph-ansible can cope 
with mixed OSD types (my feeling at the moment is "not terribly well", 
but I may be being unfair).
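
For what it's worth on option (2), ceph-volume's "simple" mode is aimed at
exactly this case, taking over existing ceph-disk/GPT OSDs without
re-provisioning them. Roughly (a sketch, assuming a ceph-volume recent
enough to ship it, with a placeholder data partition):

    ceph-volume simple scan /dev/sdb1       # records the OSD metadata as JSON under /etc/ceph/osd/
    ceph-volume simple activate --all       # starts the scanned OSDs without relying on udev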


Regards,

Matthew



--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cephalocon 2018?

2017-10-12 Thread Matthew Vernon

Hi,

The recent FOSDEM CFP reminded me to wonder if there's likely to be a 
Cephalocon in 2018? It was mentioned as a possibility when the 2017 one 
was cancelled...


Regards,

Matthew


--
The Wellcome Trust Sanger Institute is operated by Genome Research 
Limited, a charity registered in England with number 1021457 and a 
company registered in England with number 2742969, whose registered 
office is 215 Euston Road, London, NW1 2BE. 
___

ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FOSDEM Call for Participation: Software Defined Storage devroom

2017-10-12 Thread Jan Fajerski


CfP for the Software Defined Storage devroom at FOSDEM 2018
(Brussels, Belgium, February 4th).

FOSDEM is a free software event that offers open source communities a place to
meet, share ideas and collaborate.  It is renowned for being highly developer-
oriented and brings together 8000+ participants from all over the world.  It
is held in the city of Brussels (Belgium).

FOSDEM 2018 will take place during the weekend of February 3rd-4th 2018. More
details about the event can be found at http://fosdem.org/

** Call For Participation

The Software Defined Storage devroom will go into its second round for
talks around Open Source Software Defined Storage projects, management tools
and real world deployments.

Presentation topics could include but are not limited to:

- Your work on a SDS project like Ceph, GlusterFS or LizardFS

- Your work on or with SDS related projects like SWIFT or Container Storage
Interface

- Management tools for SDS deployments

- Monitoring tools for SDS clusters

** Important dates:

- 26 Nov 2017:  submission deadline for talk proposals
- 15 Dec 2017:  announcement of the final schedule
-  4 Feb 2018:  Software Defined Storage dev room

Talk proposals will be reviewed by a steering committee:
- Leonardo Vaz (Ceph Community Manager - Red Hat Inc.)
- Joao Luis (Core Ceph contributor - SUSE)
- Jan Fajerski (Ceph Developer - SUSE)

Use the FOSDEM 'pentabarf' tool to submit your proposal:
https://penta.fosdem.org/submission/FOSDEM18

- If necessary, create a Pentabarf account and activate it.
Please reuse your account from previous years if you have
already created it.

- In the "Person" section, provide First name, Last name
(in the "General" tab), Email (in the "Contact" tab)
and Bio ("Abstract" field in the "Description" tab).

- Submit a proposal by clicking on "Create event".

- Important! Select the "Software Defined Storage devroom" track
(on the "General" tab).

- Provide the title of your talk ("Event title" in the "General" tab).

- Provide a description of the subject of the talk and the
intended audience (in the "Abstract" field of the "Description" tab)

- Provide a rough outline of the talk or goals of the session (a short
list of bullet points covering topics that will be discussed) in the
"Full description" field in the "Description" tab

- Provide an expected length of your talk in the "Duration" field. Please
count at least 10 minutes of discussion into your proposal.

Suggested talk length would be 15, 20+10, 30+15, and 45+15 minutes.

** Recording of talks

The FOSDEM organizers plan to have live streaming and recording fully working,
both for remote/later viewing of talks, and so that people can watch streams
in the hallways when rooms are full. This requires speakers to consent to
being recorded and streamed. If you plan to be a speaker, please understand
that by doing so you implicitly give consent for your talk to be recorded and
streamed. The recordings will be published under the same license as all
FOSDEM content (CC-BY).

Hope to hear from you soon! And please forward this announcement.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] general protection fault: 0000 [#1] SMP

2017-10-12 Thread Ilya Dryomov
On Thu, Oct 12, 2017 at 12:23 PM, Jeff Layton  wrote:
> On Thu, 2017-10-12 at 09:12 +0200, Ilya Dryomov wrote:
>> On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet  
>> wrote:
>> > Hi,
>> >
>> > I had a "general protection fault: " with Ceph RBD kernel client.
>> > Not sure how to read the call, is it Ceph related ?
>> >
>> >
>> > Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 
>> >  [#1] SMP
>> > Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid 
>> > binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev 
>> > br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph 
>> > xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi 
>> > sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul 
>> > ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel 
>> > aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit 
>> > drm_kms_helper intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore 
>> > sysimgblt fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp 
>> > acpi_power_meter ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge 
>> > efivarfs ip_tables x_tables autofs4 dm_mod dax raid10 raid456 
>> > async_raid6_recov async_memcpy async_pq async_xor xor async_tx raid6_pq
>> > Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 
>> > multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core 
>> > xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca 
>> > ptp usb_common pps_core mdio
>> > Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: 
>> > kworker/1:0 Not tainted 4.13-dae-dom0 #2
>> > Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel 
>> > Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 
>> > 10/12/2016
>> > Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr 
>> > ceph_con_workfn [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.898355] task: 8801ce434280 
>> > task.stack: c900151bc000
>> > Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: 
>> > e030:memcpy_erms+0x6/0x10
>> > Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:c900151bfac0 
>> > EFLAGS: 00010202
>> > Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: 8801b63df000 RBX: 
>> > 88021b41be00 RCX: 04df
>> > Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 04df RSI: 
>> > 4450736e24806564 RDI: 8801b63df000
>> > Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ea0005fdd8c8 R08: 
>> > 88028545d618 R09: 0010
>> > Oct 11 16:15:11 lorunde kernel: [311418.902104] R10:  R11: 
>> > 880215815000 R12: 
>> > Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: 8802158156c0 R14: 
>> >  R15: 8801ce434280
>> > Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  
>> > () GS:88028544() knlGS:88028544
>> > Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS:  ES: 
>> >  CR0: 80050033
>> > Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 55a8461cfc20 CR3: 
>> > 01809000 CR4: 00042660
>> > Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
>> > Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? 
>> > skb_copy_ubufs+0xef/0x290
>> > Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
>> > Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? 
>> > tcp_transmit_skb+0x74/0x930
>> > Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? 
>> > tcp_write_xmit+0x1bd/0xfb0
>> > Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? 
>> > __sk_mem_raise_allocated+0x4e/0x220
>> > Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? 
>> > __tcp_push_pending_frames+0x28/0x90
>> > Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? 
>> > do_tcp_sendpages+0x4fc/0x590
>> > Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
>> > Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
>> > Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? 
>> > kernel_sendpage+0x12/0x20
>> > Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? 
>> > ceph_tcp_sendpage+0x5c/0xc0 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? 
>> > ceph_tcp_recvmsg+0x53/0x70 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? 
>> > ceph_con_workfn+0xd08/0x22a0 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? 
>> > ceph_osdc_start_request+0x23/0x30 [libceph]
>> > Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? 
>> > rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
>> > Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? 
>> > process_one_work+0x1ad/0x340
>> > Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
>> > Oct 11 

Re: [ceph-users] objects degraded higher than 100%

2017-10-12 Thread Florian Haas
On Mon, Sep 11, 2017 at 8:13 PM, Andreas Herrmann  wrote:
> Hi,
>
> how could this happen:
>
> pgs: 197528/1524 objects degraded (12961.155%)
>
> I did some heavy failover tests, but a value higher than 100% looks strange
> (ceph version 12.2.0). Recovery is quite slow.
>
>   cluster:
> health: HEALTH_WARN
> 3/1524 objects misplaced (0.197%)
> Degraded data redundancy: 197528/1524 objects degraded
> (12961.155%), 1057 pgs unclean, 1055 pgs degraded, 3 pgs undersized
>
>   data:
> pools:   1 pools, 2048 pgs
> objects: 508 objects, 1467 MB
> usage:   127 GB used, 35639 GB / 35766 GB avail
> pgs: 197528/1524 objects degraded (12961.155%)
>  3/1524 objects misplaced (0.197%)
>  1042 active+recovery_wait+degraded
>  991  active+clean
>  8    active+recovering+degraded
>  3    active+undersized+degraded+remapped+backfill_wait
>  2    active+recovery_wait+degraded+remapped
>  2    active+remapped+backfill_wait
>
>   io:
> recovery: 340 kB/s, 80 objects/s

Did you ever get to the bottom of this? I'm seeing something very
similar on a 12.2.1 reference system:

https://gist.github.com/fghaas/f547243b0f7ebb78ce2b8e80b936e42c

I'm also seeing an unusual MISSING_ON_PRIMARY count in "rados df":
https://gist.github.com/fghaas/59cd2c234d529db236c14fb7d46dfc85

The odd thing in there is that the "bench" pool was empty when the
recovery started (that pool had been wiped with "rados cleanup"), so
the number of objects deemed to be missing from the primary really
ought to be zero.

It seems like it's considering these deleted objects to still require
replication, but that sounds rather far fetched to be honest.

Cheers,
Florian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph auth doesn't work on cephfs?

2017-10-12 Thread Frank Yu
John,

I tried to write some data to the newly created files, and it failed, just
as you said.
Thanks very much.



On Thu, Oct 12, 2017 at 6:20 PM, John Spray  wrote:

> On Thu, Oct 12, 2017 at 11:12 AM, Frank Yu  wrote:
> > Hi,
> > I have a ceph cluster with three nodes, and I have a cephfs, use pool
> > cephfs_data, cephfs_metadata, and there're also a rbd pool with name
> > 'rbd-test'.
> >
> > # rados lspools
> > .rgw.root
> > default.rgw.control
> > default.rgw.meta
> > default.rgw.log
> > cephfs_data
> > cephfs_metadata
> > default.rgw.buckets.index
> > default.rgw.buckets.data
> > rbd-test
> >
> > then I add a user with name cephfs-ct, and have 'rw' permission on pool
> > 'rbd-test' only.
> >
> > # ceph auth add client.cephfs-ct mon 'allow rw' osd 'allow rw
> pool=rbd-test'
> > mds 'allow rw'
> > added key for client.cephfs-ct
> >
> > # ceph auth ls |grep client.cephfs-ct -A4
> > installed auth entries:
> >
> > client.cephfs-ct
> > key:AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
> > caps: [mds] allow rw
> > caps: [mon] allow rw
> > caps: [osd] allow rw pool=rbd-test
> >
> > then I try to mount cephfs with this user cephfs-ct on another host, and
> try
> > to do some write operations.
> >
> > # mount -t ceph HOST:6789:/ /mnt/ceph/ -o
> > name=cephfs-ct,secret=AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
> > # touch /mnt/ceph/testceph
> > # ll /mnt/ceph/testceph
> > -rw-r--r-- 1 root root 0 Oct 12 18:04 /mnt/ceph/testceph
> >
> > So my question, should user cephfs-ct have no write permission on pool
> > cephfs_data, this mean, I should can't write data under mountpoint
> > /mnt/ceph/?? or I'm wrong ?
>
> Because your client has "allow rw" mds permissions, it can read and
> write all metadata, such as listing a directory.
>
> If you tried to put some data in a file and sync it, you would find that
> failed.
>
> John
>
> >
> > thanks
> >
> > --
> > Regards
> > Frank Yu
> >
> > ___
> > ceph-users mailing list
> > ceph-users@lists.ceph.com
> > http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
> >
>



-- 
Regards
Frank Yu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread Maged Mokhtar
On 2017-10-12 11:32, David Disseldorp wrote:

> On Wed, 11 Oct 2017 14:03:59 -0400, Jason Dillaman wrote:
> 
> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> Hmmm, If you failover the identity of the LIO configuration including PGRs
> (I believe they are files on disk), this would work no?  Using an 2 ISCSI
> gateways which have shared storage to store the LIO configuration and PGR
> data.
> Are you referring to the Active Persist Through Power Loss (APTPL)
> support in LIO where it writes the PR metadata to
> "/var/target/pr/aptpl_"? I suppose that would work for a
> Pacemaker failover if you had a shared file system mounted between all
> your gateways *and* the initiator requests APTPL mode(?).

I'm going off on a tangent here, but I can't seem to find where LIO
reads the /var/target/pr/aptpl_ PR state back off disk -
__core_scsi3_write_aptpl_to_file() seems to be the only function that
uses the path. Otherwise I would have thought the same, that the
propagating the file to backup gateways prior to failover would be
sufficient.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

This code from rtslib may help:

https://github.com/open-iscsi/rtslib-fb/blob/master/rtslib/tcm.py 

def _config_pr_aptpl(self):
    """
    LIO actually *writes* pr aptpl info to the filesystem, so we
    need to read it in and squirt it back into configfs when we configure
    the storage object. BLEH.
    """
    from .root import RTSRoot
    aptpl_dir = "%s/pr" % RTSRoot().dbroot

    try:
        lines = fread("%s/aptpl_%s" % (aptpl_dir, self.wwn)).split()
    except:
        return

    if not lines[0].startswith("PR_REG_START:"):
        return

    reservations = []
    for line in lines:
        if line.startswith("PR_REG_START:"):
            res_list = []
        elif line.startswith("PR_REG_END:"):
            reservations.append(res_list)
        else:
            res_list.append(line.strip())

    for res in reservations:
        fwrite(self.path + "/pr/res_aptpl_metadata", ",".join(res))
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph auth doesn't work on cephfs?

2017-10-12 Thread John Spray
On Thu, Oct 12, 2017 at 11:12 AM, Frank Yu  wrote:
> Hi,
> I have a ceph cluster with three nodes, and I have a cephfs, use pool
> cephfs_data, cephfs_metadata, and there're also a rbd pool with name
> 'rbd-test'.
>
> # rados lspools
> .rgw.root
> default.rgw.control
> default.rgw.meta
> default.rgw.log
> cephfs_data
> cephfs_metadata
> default.rgw.buckets.index
> default.rgw.buckets.data
> rbd-test
>
> then I add a user with name cephfs-ct, and have 'rw' permission on pool
> 'rbd-test' only.
>
> # ceph auth add client.cephfs-ct mon 'allow rw' osd 'allow rw pool=rbd-test'
> mds 'allow rw'
> added key for client.cephfs-ct
>
> # ceph auth ls |grep client.cephfs-ct -A4
> installed auth entries:
>
> client.cephfs-ct
> key:AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
> caps: [mds] allow rw
> caps: [mon] allow rw
> caps: [osd] allow rw pool=rbd-test
>
> then I try to mount cephfs with this user cephfs-ct on another host, and try
> to do some write operations.
>
> # mount -t ceph HOST:6789:/ /mnt/ceph/ -o
> name=cephfs-ct,secret=AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
> # touch /mnt/ceph/testceph
> # ll /mnt/ceph/testceph
> -rw-r--r-- 1 root root 0 Oct 12 18:04 /mnt/ceph/testceph
>
> So my question, should user cephfs-ct have no write permission on pool
> cephfs_data, this mean, I should can't write data under mountpoint
> /mnt/ceph/?? or I'm wrong ?

Because your client has "allow rw" mds permissions, it can read and
write all metadata, such as listing a directory.

If you tried to put some data in a file and sync it, you would find that failed.

John
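
For what it's worth, here is a quick way to see that distinction from the
client side, plus the cap change that makes data writes actually work (a
sketch reusing the names from this thread; adjust pool names to your
setup):

    # metadata ops succeed, but flushing file data fails without a cap on the data pool
    dd if=/dev/zero of=/mnt/ceph/testceph bs=4K count=1 conv=fsync

    # grant the client the CephFS data pool as well (keeping the rbd-test cap)
    ceph auth caps client.cephfs-ct mon 'allow rw' mds 'allow rw' \
        osd 'allow rw pool=cephfs_data, allow rw pool=rbd-test'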

>
> thanks
>
> --
> Regards
> Frank Yu
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] ceph auth doesn't work on cephfs?

2017-10-12 Thread Frank Yu
Hi,
I have a Ceph cluster with three nodes and a CephFS that uses the pools
cephfs_data and cephfs_metadata; there is also an RBD pool named
'rbd-test'.

# rados lspools
.rgw.root
default.rgw.control
default.rgw.meta
default.rgw.log
cephfs_data
cephfs_metadata
default.rgw.buckets.index
default.rgw.buckets.data
rbd-test

Then I add a user named cephfs-ct with 'rw' permission on pool
'rbd-test' only.

# ceph auth add client.cephfs-ct mon 'allow rw' osd 'allow rw
pool=rbd-test' mds 'allow rw'
added key for client.cephfs-ct

# ceph auth ls |grep client.cephfs-ct -A4
installed auth entries:

client.cephfs-ct
key:AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
caps: [mds] allow rw
caps: [mon] allow rw
caps: [osd] allow rw pool=rbd-test

then I try to mount cephfs with this user cephfs-ct on another host, and
try to do some write operations.

# mount -t ceph HOST:6789:/ /mnt/ceph/ -o name=cephfs-ct,secret=
AQDIPd9ZyXcTLBAAvcG82SFL3wOBAMLMcrJxMA==
# touch /mnt/ceph/testceph
# ll /mnt/ceph/testceph
-rw-r--r-- 1 root root 0 Oct 12 18:04 /mnt/ceph/testceph

So my question: shouldn't user cephfs-ct have no write permission on pool
cephfs_data, meaning I shouldn't be able to write data under the mountpoint
/mnt/ceph/? Or am I wrong?

thanks

-- 
Regards
Frank Yu
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] assertion error trying to start mds server

2017-10-12 Thread John Spray
On Thu, Oct 12, 2017 at 12:23 AM, Bill Sharer  wrote:
> I was wondering if I can't get the second mds back up That offline
> backward scrub check sounds like it should be able to also salvage what
> it can of the two pools to a normal filesystem.  Is there an option for
> that or has someone written some form of salvage tool?

Yep, cephfs-data-scan can do that.

To scrape the files out of a CephFS data pool to a local filesystem, do this:
cephfs-data-scan scan_extents   # this is discovering
all the file sizes
cephfs-data-scan scan_inodes --output-dir /tmp/my_output 

The time taken by both these commands scales linearly with the number
of objects in your data pool.

This tool may not see the correct filename for recently created files
(any file whose metadata is in the journal but not flushed), these
files will go into a lost+found directory, named after their inode
number.

John

>
> On 10/11/2017 07:07 AM, John Spray wrote:
>> On Wed, Oct 11, 2017 at 1:42 AM, Bill Sharer  wrote:
>>> I've been in the process of updating my gentoo based cluster both with
>>> new hardware and a somewhat postponed update.  This includes some major
>>> stuff including the switch from gcc 4.x to 5.4.0 on existing hardware
>>> and using gcc 6.4.0 to make better use of AMD Ryzen on the new
>>> hardware.  The existing cluster was on 10.2.2, but I was going to
>>> 10.2.7-r1 as an interim step before moving on to 12.2.0 to begin
>>> transitioning to bluestore on the osd's.
>>>
>>> The Ryzen units are slated to be bluestore based OSD servers if and when
>>> I get to that point.  Up until the mds failure, they were simply cephfs
>>> clients.  I had three OSD servers updated to 10.2.7-r1 (one is also a
>>> MON) and had two servers left to update.  Both of these are also MONs
>>> and were acting as a pair of dual active MDS servers running 10.2.2.
>>> Monday morning I found out the hard way that an UPS one of them was on
>>> has a dead battery.  After I fsck'd and came back up, I saw the
>>> following assertion error when it was trying to start it's mds.B server:
>>>
>>>
>>>  mdsbeacon(64162/B up:replay seq 3 v4699) v7  126+0+0 (709014160
>>> 0 0) 0x7f6fb4001bc0 con 0x55f94779d
>>> 8d0
>>>  0> 2017-10-09 11:43:06.935662 7f6fa9ffb700 -1 mds/journal.cc: In
>>> function 'virtual void EImportStart::r
>>> eplay(MDSRank*)' thread 7f6fa9ffb700 time 2017-10-09 11:43:06.934972
>>> mds/journal.cc: 2929: FAILED assert(mds->sessionmap.get_version() == cmapv)
>>>
>>>  ceph version 10.2.2 (45107e21c568dd033c2f0a3107dec8f0b0e58374)
>>>  1: (ceph::__ceph_assert_fail(char const*, char const*, int, char
>>> const*)+0x82) [0x55f93d64a122]
>>>  2: (EImportStart::replay(MDSRank*)+0x9ce) [0x55f93d52a5ce]
>>>  3: (MDLog::_replay_thread()+0x4f4) [0x55f93d4a8e34]
>>>  4: (MDLog::ReplayThread::entry()+0xd) [0x55f93d25bd4d]
>>>  5: (()+0x74a4) [0x7f6fd009b4a4]
>>>  6: (clone()+0x6d) [0x7f6fce5a598d]
>>>  NOTE: a copy of the executable, or `objdump -rdS ` is
>>> needed to interpret this.
>>>
>>> --- logging levels ---
>>>0/ 5 none
>>>0/ 1 lockdep
>>>0/ 1 context
>>>1/ 1 crush
>>>1/ 5 mds
>>>1/ 5 mds_balancer
>>>1/ 5 mds_locker
>>>1/ 5 mds_log
>>>1/ 5 mds_log_expire
>>>1/ 5 mds_migrator
>>>0/ 1 buffer
>>>0/ 1 timer
>>>0/ 1 filer
>>>0/ 1 striper
>>>0/ 1 objecter
>>>0/ 5 rados
>>>0/ 5 rbd
>>>0/ 5 rbd_mirror
>>>0/ 5 rbd_replay
>>>0/ 5 journaler
>>>0/ 5 objectcacher
>>>0/ 5 client
>>>0/ 5 osd
>>>0/ 5 optracker
>>>0/ 5 objclass
>>>1/ 3 filestore
>>>1/ 3 journal
>>>0/ 5 ms
>>>1/ 5 mon
>>>0/10 monc
>>>1/ 5 paxos
>>>0/ 5 tp
>>>1/ 5 auth
>>>1/ 5 crypto
>>>1/ 1 finisher
>>>1/ 5 heartbeatmap
>>>1/ 5 perfcounter
>>>1/ 5 rgw
>>>1/10 civetweb
>>>1/ 5 javaclient
>>>1/ 5 asok
>>>1/ 1 throttle
>>>0/ 0 refs
>>>1/ 5 xio
>>>1/ 5 compressor
>>>1/ 5 newstore
>>>1/ 5 bluestore
>>>1/ 5 bluefs
>>>1/ 3 bdev
>>>1/ 5 kstore
>>>4/ 5 rocksdb
>>>4/ 5 leveldb
>>>1/ 5 kinetic
>>>1/ 5 fuse
>>>   -2/-2 (syslog threshold)
>>>   -1/-1 (stderr threshold)
>>>   max_recent 1
>>>   max_new 1000
>>>   log_file /var/log/ceph/ceph-mds.B.log
>>>
>>>
>>>
>>> When I was googling around, I ran into this Cern presentation and tried
>>> out the offline backware scrubbing commands on slide 25 first:
>>>
>>> https://indico.cern.ch/event/531810/contributions/2309925/attachments/1357386/2053998/GoncaloBorges-HEPIX16-v3.pdf
>>>
>>>
>>> Both ran without any messages, so I'm assuming I have sane contents in
>>> the cephfs_data and cephfs_metadata pools.  Still no luck getting things
>>> restarted, so I tried the cephfs-journal-tool journal reset on slide
>>> 23.  That didn't work either.  Just for giggles, I tried setting up the
>>> two Ryzen boxes as new mds.C and mds.D servers which would run on
>>> 10.2.7-r1 instead of using 

Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread David Disseldorp
On Wed, 11 Oct 2017 14:03:59 -0400, Jason Dillaman wrote:

> On Wed, Oct 11, 2017 at 1:10 PM, Samuel Soulard
>  wrote:
> > Hmmm, If you failover the identity of the LIO configuration including PGRs
> > (I believe they are files on disk), this would work no?  Using an 2 ISCSI
> > gateways which have shared storage to store the LIO configuration and PGR
> > data.  
> 
> Are you referring to the Active Persist Through Power Loss (APTPL)
> support in LIO where it writes the PR metadata to
> "/var/target/pr/aptpl_"? I suppose that would work for a
> Pacemaker failover if you had a shared file system mounted between all
> your gateways *and* the initiator requests APTPL mode(?).

I'm going off on a tangent here, but I can't seem to find where LIO
reads the /var/target/pr/aptpl_ PR state back off disk -
__core_scsi3_write_aptpl_to_file() seems to be the only function that
uses the path. Otherwise I would have thought the same, that
propagating the file to backup gateways prior to failover would be
sufficient.

Cheers, David
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-ISCSI

2017-10-12 Thread Maged Mokhtar
On 2017-10-11 14:57, Jason Dillaman wrote:

> On Wed, Oct 11, 2017 at 6:38 AM, Jorge Pinilla López  
> wrote:
> 
>> As far as I am able to understand there are 2 ways of setting iscsi for ceph
>> 
>> 1- using kernel (lrbd) only able on SUSE, CentOS, fedora...
> 
> The target_core_rbd approach is only utilized by SUSE (and its derivatives 
> like PetaSAN) as far as I know. This was the initial approach for Red 
> Hat-derived kernels as well until the upstream kernel maintainers indicated 
> that they really do not want a specialized target backend for just krbd. The 
> next attempt was to re-use the existing target_core_iblock to interface with 
> krbd via the kernel's block layer, but that hit similar upstream walls trying 
> to get support for SCSI command passthrough to the block layer. 
> 
>> 2- using userspace (tcmu , ceph-iscsi-conf, ceph-iscsi-cli)
> 
> The TCMU approach is what upstream and Red Hat-derived kernels will support 
> going forward.  
> 
> The lrbd project was developed by SUSE to assist with configuring a cluster 
> of iSCSI gateways via the cli.  The ceph-iscsi-config + ceph-iscsi-cli 
> projects are similar in goal but take a slightly different approach. 
> ceph-iscsi-config provides a set of common Python libraries that can be 
> re-used by ceph-iscsi-cli and ceph-ansible for deploying and configuring the 
> gateway. The ceph-iscsi-cli project provides the gwcli tool which acts as a 
> cluster-aware replacement for targetcli. 
> 
>> I don't know which one is better, I am seeing that oficial support is 
>> pointing to tcmu but i havent done any testbench.
> 
> We (upstream Ceph) provide documentation for the TCMU approach because that 
> is what is available against generic upstream kernels (starting with 4.14 
> when it's out). Since it uses librbd (which still needs to undergo some 
> performance improvements) instead of krbd, we know that librbd 4k IO 
> performance is slower compared to krbd, but 64k and 128k IO performance is 
> comparable. However, I think most iSCSI tuning guides would already tell you 
> to use larger block sizes (i.e. 64K NTFS blocks or 32K-128K ESX blocks). 
> 
>> Does anyone tried both? Do they give the same output? Are both able to 
>> manage multiple iscsi targets mapped to a single rbd disk?
> 
> Assuming you mean multiple portals mapped to the same RBD disk, the answer is 
> yes, both approaches should support ALUA. The ceph-iscsi-config tooling will 
> only configure Active/Passive because we believe there are certain edge 
> conditions that could result in data corruption if configured for 
> Active/Active ALUA. 
> 
> The TCMU approach also does not currently support SCSI persistent reservation 
> groups (needed for Windows clustering) because that support isn't available 
> in the upstream kernel. The SUSE kernel has an approach that utilizes two 
> round-trips to the OSDs for each IO to simulate PGR support. Earlier this 
> summer I believe SUSE started to look into how to get generic PGR support 
> merged into the upstream kernel using corosync/dlm to synchronize the states 
> between multiple nodes in the target. I am not sure of the current state of 
> that work, but it would benefit all LIO targets when complete. 
> 
>> I will try to make my own testing but if anyone has tried in advance it 
>> would be really helpful.
>> 
>> -
>> JORGE PINILLA LÓPEZ
>> jorp...@unizar.es
>> 
>> -
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com [2]
> 
> -- 
> 
> Jason 
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

Hi Jason, 

Similar to TCMU user space backstore approach, i would prefer cluster
sync of PR and other task management be done user space. It really does
not belong in the kernel and will give more flexibility in
implementation. A user space PR get/set interface could be implemented
via: 

-corosync 
-Writing PR metadata to Ceph / network share
-Use Ceph watch/notify 
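
As a rough illustration of the second option (keeping the PR metadata in
RADOS itself rather than on a network share), even the plain rados CLI
shows the shape of it. A sketch only, with a hypothetical pool and object
naming scheme; a real gateway would do the same through librados:

    # one object per WWN holding the serialized PR state
    rados -p iscsi-gw-meta put pr_<wwn> /var/target/pr/aptpl_<wwn>    # publish local state
    rados -p iscsi-gw-meta get pr_<wwn> /var/target/pr/aptpl_<wwn>    # fetch it on a peer

    # and the third option on top of the same object:
    rados -p iscsi-gw-meta watch pr_<wwn>             # peers wait for changes
    rados -p iscsi-gw-meta notify pr_<wwn> updated    # sender announces an update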

Also in the future it may be beneficial to build/extend on Ceph features
such as exclusive locks and paxos based leader election for applications
such as iSCSI gateways to use for resource distribution and fail over as
an alternative to Pacemaker, which has scalability limits.

Maged 

  

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Re : general protection fault: 0000 [#1] SMP

2017-10-12 Thread Olivier Bonvalet
Le jeudi 12 octobre 2017 à 09:12 +0200, Ilya Dryomov a écrit :
> It's a crash in memcpy() in skb_copy_ubufs().  It's not in ceph, but
> ceph-induced, it looks like.  I don't remember seeing anything
> similar
> in the context of krbd.
> 
> This is a Xen dom0 kernel, right?  What did the workload look like?
> Can you provide dmesg before the crash?

Hi,

yes it's a Xen dom0 kernel. Linux 4.13.3, Xen 4.8.2, with an old
0.94.10 Ceph (so, Hammer).

Before this error, I add this in logs :

Oct 11 16:00:41 lorunde kernel: [310548.899082] libceph: read_partial_message 
88021a910200 data crc 2306836368 != exp. 2215155875
Oct 11 16:00:41 lorunde kernel: [310548.899841] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:02:25 lorunde kernel: [310652.695015] libceph: read_partial_message 
880220b10100 data crc 842840543 != exp. 2657161714
Oct 11 16:02:25 lorunde kernel: [310652.695731] libceph: osd3 10.0.0.26:6804 
bad crc/signature
Oct 11 16:07:24 lorunde kernel: [310952.485202] libceph: read_partial_message 
88025d1aa400 data crc 938978341 != exp. 4154366769
Oct 11 16:07:24 lorunde kernel: [310952.485870] libceph: osd117 10.0.0.31:6804 
bad crc/signature
Oct 11 16:10:44 lorunde kernel: [311151.841812] libceph: read_partial_message 
880260300400 data crc 2988747958 != exp. 319958859
Oct 11 16:10:44 lorunde kernel: [311151.842672] libceph: osd9 10.0.0.51:6802 
bad crc/signature
Oct 11 16:10:57 lorunde kernel: [311165.211412] libceph: read_partial_message 
8802208b8300 data crc 369498361 != exp. 906022772
Oct 11 16:10:57 lorunde kernel: [311165.212135] libceph: osd87 10.0.0.5:6800 
bad crc/signature
Oct 11 16:12:27 lorunde kernel: [311254.635767] libceph: read_partial_message 
880236f9a000 data crc 2586662963 != exp. 2886241494
Oct 11 16:12:27 lorunde kernel: [311254.636493] libceph: osd90 10.0.0.5:6814 
bad crc/signature
Oct 11 16:14:31 lorunde kernel: [311378.808191] libceph: read_partial_message 
88027e633c00 data crc 1102363051 != exp. 679243837
Oct 11 16:14:31 lorunde kernel: [311378.808889] libceph: osd13 10.0.0.21:6804 
bad crc/signature
Oct 11 16:15:01 lorunde kernel: [311409.431034] libceph: read_partial_message 
88024ce0a800 data crc 2467415342 != exp. 1753860323
Oct 11 16:15:01 lorunde kernel: [311409.431718] libceph: osd111 10.0.0.30:6804 
bad crc/signature
Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault:  
[#1] SMP


We had to switch to TCP Cubic (instead of badly configured TCP BBR, without 
FQ), to reduce the data crc errors.
But since we still had some errors, last night we rebooted all the OSD nodes in 
Linux 4.4.91, instead of Linux 4.9.47 & 4.9.53.

Since the last 7 hours, we haven't got any data crc errors from OSD, but we had 
one from a MON. Without hang/crash.

About the workload, the Xen VMs are mainly LAMP servers: HTTP traffic handled
by nginx or apache, PHP, and MySQL databases.

Thanks,

Olivier
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] general protection fault: 0000 [#1] SMP

2017-10-12 Thread Ilya Dryomov
On Wed, Oct 11, 2017 at 4:40 PM, Olivier Bonvalet  wrote:
> Hi,
>
> I had a "general protection fault: " with Ceph RBD kernel client.
> Not sure how to read the call, is it Ceph related ?
>
>
> Oct 11 16:15:11 lorunde kernel: [311418.891238] general protection fault: 
>  [#1] SMP
> Oct 11 16:15:11 lorunde kernel: [311418.891855] Modules linked in: cpuid 
> binfmt_misc nls_iso8859_1 nls_cp437 vfat fat tcp_diag inet_diag xt_physdev 
> br_netfilter iptable_filter xen_netback loop xen_blkback cbc rbd libceph 
> xen_gntdev xen_evtchn xenfs xen_privcmd ipmi_ssif intel_rapl iosf_mbi sb_edac 
> x86_pkg_temp_thermal intel_powerclamp coretemp crct10dif_pclmul 
> ghash_clmulni_intel iTCO_wdt pcbc iTCO_vendor_support mxm_wmi aesni_intel 
> aes_x86_64 crypto_simd glue_helper cryptd mgag200 i2c_algo_bit drm_kms_helper 
> intel_rapl_perf ttm drm syscopyarea sysfillrect efi_pstore sysimgblt 
> fb_sys_fops lpc_ich efivars mfd_core evdev ioatdma shpchp acpi_power_meter 
> ipmi_si wmi button ipmi_devintf ipmi_msghandler bridge efivarfs ip_tables 
> x_tables autofs4 dm_mod dax raid10 raid456 async_raid6_recov async_memcpy 
> async_pq async_xor xor async_tx raid6_pq
> Oct 11 16:15:11 lorunde kernel: [311418.895403]  libcrc32c raid1 raid0 
> multipath linear md_mod hid_generic usbhid i2c_i801 crc32c_intel i2c_core 
> xhci_pci ahci ixgbe xhci_hcd libahci ehci_pci ehci_hcd libata usbcore dca ptp 
> usb_common pps_core mdio
> Oct 11 16:15:11 lorunde kernel: [311418.896551] CPU: 1 PID: 4916 Comm: 
> kworker/1:0 Not tainted 4.13-dae-dom0 #2
> Oct 11 16:15:11 lorunde kernel: [311418.897134] Hardware name: Intel 
> Corporation S2600CWR/S2600CWR, BIOS SE5C610.86B.01.01.0019.101220160604 
> 10/12/2016
> Oct 11 16:15:11 lorunde kernel: [311418.897745] Workqueue: ceph-msgr 
> ceph_con_workfn [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.898355] task: 8801ce434280 
> task.stack: c900151bc000
> Oct 11 16:15:11 lorunde kernel: [311418.899007] RIP: e030:memcpy_erms+0x6/0x10
> Oct 11 16:15:11 lorunde kernel: [311418.899616] RSP: e02b:c900151bfac0 
> EFLAGS: 00010202
> Oct 11 16:15:11 lorunde kernel: [311418.900228] RAX: 8801b63df000 RBX: 
> 88021b41be00 RCX: 04df
> Oct 11 16:15:11 lorunde kernel: [311418.900848] RDX: 04df RSI: 
> 4450736e24806564 RDI: 8801b63df000
> Oct 11 16:15:11 lorunde kernel: [311418.901479] RBP: ea0005fdd8c8 R08: 
> 88028545d618 R09: 0010
> Oct 11 16:15:11 lorunde kernel: [311418.902104] R10:  R11: 
> 880215815000 R12: 
> Oct 11 16:15:11 lorunde kernel: [311418.902723] R13: 8802158156c0 R14: 
>  R15: 8801ce434280
> Oct 11 16:15:11 lorunde kernel: [311418.903359] FS:  () 
> GS:88028544() knlGS:88028544
> Oct 11 16:15:11 lorunde kernel: [311418.903994] CS:  e033 DS:  ES:  
> CR0: 80050033
> Oct 11 16:15:11 lorunde kernel: [311418.904627] CR2: 55a8461cfc20 CR3: 
> 01809000 CR4: 00042660
> Oct 11 16:15:11 lorunde kernel: [311418.905271] Call Trace:
> Oct 11 16:15:11 lorunde kernel: [311418.905909]  ? skb_copy_ubufs+0xef/0x290
> Oct 11 16:15:11 lorunde kernel: [311418.906548]  ? skb_clone+0x82/0x90
> Oct 11 16:15:11 lorunde kernel: [311418.907225]  ? tcp_transmit_skb+0x74/0x930
> Oct 11 16:15:11 lorunde kernel: [311418.907858]  ? tcp_write_xmit+0x1bd/0xfb0
> Oct 11 16:15:11 lorunde kernel: [311418.908490]  ? 
> __sk_mem_raise_allocated+0x4e/0x220
> Oct 11 16:15:11 lorunde kernel: [311418.909122]  ? 
> __tcp_push_pending_frames+0x28/0x90
> Oct 11 16:15:11 lorunde kernel: [311418.909755]  ? 
> do_tcp_sendpages+0x4fc/0x590
> Oct 11 16:15:11 lorunde kernel: [311418.910386]  ? tcp_sendpage+0x7c/0xa0
> Oct 11 16:15:11 lorunde kernel: [311418.911026]  ? inet_sendpage+0x37/0xe0
> Oct 11 16:15:11 lorunde kernel: [311418.911655]  ? kernel_sendpage+0x12/0x20
> Oct 11 16:15:11 lorunde kernel: [311418.912297]  ? 
> ceph_tcp_sendpage+0x5c/0xc0 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.912926]  ? ceph_tcp_recvmsg+0x53/0x70 
> [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.913553]  ? 
> ceph_con_workfn+0xd08/0x22a0 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.914179]  ? 
> ceph_osdc_start_request+0x23/0x30 [libceph]
> Oct 11 16:15:11 lorunde kernel: [311418.914807]  ? 
> rbd_img_obj_request_submit+0x1ac/0x3c0 [rbd]
> Oct 11 16:15:11 lorunde kernel: [311418.915458]  ? 
> process_one_work+0x1ad/0x340
> Oct 11 16:15:11 lorunde kernel: [311418.916083]  ? worker_thread+0x45/0x3f0
> Oct 11 16:15:11 lorunde kernel: [311418.916706]  ? kthread+0xf2/0x130
> Oct 11 16:15:11 lorunde kernel: [311418.917327]  ? 
> process_one_work+0x340/0x340
> Oct 11 16:15:11 lorunde kernel: [311418.917946]  ? 
> kthread_create_on_node+0x40/0x40
> Oct 11 16:15:11 lorunde kernel: [311418.918565]  ? do_group_exit+0x35/0xa0
> Oct 11 16:15:11 lorunde kernel: [311418.919215]  ? ret_from_fork+0x25/0x30
> Oct 11 16:15:11 

Re: [ceph-users] Crush Map for test lab

2017-10-12 Thread Stefan Kooman
Quoting Ashley Merrick (ash...@amerrick.co.uk):
> Hello,
> 
> Setting up a new test lab, single server 5 disks/OSD.
> 
> Want to run an EC pool that has more shards than available OSDs; is
> it possible to force CRUSH to re-use an OSD for another shard?
> 
> I know normally this is bad practice but is for testing only on a
> single server setup.

Create multiple OSDs per disk? Nowadays (luminous) this should be
possible by creating multiple LVM volumes (filestore), by creating
different mounts and using filestore OSDs, or by creating different
partitions and putting bluestore on top (if you need to test bluestore as
well). Forcing CRUSH to do things it shouldn't will be much harder I
guess / hope ;-)).

Gr. Stefan
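
Something like the following would carve one disk into two bluestore OSDs
and then let an EC profile use OSDs rather than hosts as the failure
domain. A sketch with made-up device/VG names, assuming a ceph-volume
recent enough for --bluestore:

    pvcreate /dev/sdb
    vgcreate ceph-sdb /dev/sdb
    lvcreate -l 50%VG    -n osd-a ceph-sdb
    lvcreate -l 100%FREE -n osd-b ceph-sdb
    ceph-volume lvm create --bluestore --data ceph-sdb/osd-a
    ceph-volume lvm create --bluestore --data ceph-sdb/osd-b

    # with 2 OSDs per disk, 5 disks give 10 OSDs, enough for e.g. k=6 m=2
    ceph osd erasure-code-profile set ec62 k=6 m=2 crush-failure-domain=osd
    ceph osd pool create ecpool 32 32 erasure ec62

CRUSH still won't put two shards of one PG on the same OSD; this just gives
it more OSDs to choose from.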


-- 
| BIT BV  http://www.bit.nl/Kamer van Koophandel 09090351
| GPG: 0xD14839C6   +31 318 648 688 / i...@bit.nl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com