Re: [ceph-users] scrubbing for a long time and not finished

2015-03-19 Thread Xinze Chi
Currently, users do not know when a PG has been scrubbing for a long time.
I wonder whether we could give a warning when that happens (with a threshold
defined as osd_scrub_max_time).
It would tell the user that something may be wrong in the cluster.
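
Until such a warning exists, a rough manual check can be built from the plain
`ceph pg dump` output shown further down in this thread; the awk column
positions below are an assumption based on that format and may differ between
releases:

    # list PGs currently scrubbing together with their state and state_stamp,
    # so stamps much older than the current date stand out
    ceph pg dump | awk '/scrubbing/ {print $1, $10, $11, $12}'
    date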


2015-03-17 21:21 GMT+08:00 池信泽 xmdx...@gmail.com:

 On Tue, Mar 17, 2015 at 10:01 AM, Xinze Chi xmdx...@gmail.com wrote:

 hi,all:

  I found a PG on my test cluster that has been scrubbing for a long time
  without finishing, and there is no useful scrubbing log. scrubs_active
  is 1, so inc_scrubs_pending returns false. I think the reason is that
  some scrub message was lost, so the primary cannot continue chunky_scrub,
  and it hangs at scrubbing.

Could anyone give some suggestions?

Thanks


 [root@ceph0 ~]# date
 Tue Mar 17 09:54:54 CST 2015
 [root@ceph0 ~]# ceph pg dump | grep scrub
 dumped all in format plain
 pg_stat objects mip degr misp unf bytes log disklog state state_stamp
 v reported up up_primary acting acting_primary last_scrub
 scrub_stamp last_deep_scrub deep_scrub_stamp
 1.97 30 0 0 0 0 117702656 31 31 active+clean+scrubbing 2015-03-16
 14:50:02.110796 78'31 78:50 [9,6,1] 9 [9,6,1] 9 0'0 2015-03-15
 14:49:33.661597 0'0 2015-03-13 14:48:53.341679

The attachment is the log from the primary; the scrubbing PG is 1.97s0.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Christian Balzer

Hello,

On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

 On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Greg,
 
  Thanks for your input and completely agree that we cannot expect
  developers to fully document what impact each setting has on a
  cluster, particularly in a performance related way
 
  That said, if you or others could spare some time for a few pointers it
  would be much appreciated and I will endeavour to create some useful
  results/documents that are more relevant to end users.
 
  I have taken on board what you said about the WB throttle and have been
  experimenting with it by switching it on and off. I know it's a bit of
  a blunt configuration change, but it was useful to understand its
  effect. With it off, I do see initially quite a large performance
  increase, but over time it actually starts to slow the average
  throughput down. Like you said, I am guessing this is to do with it
  making sure the journal doesn't get too far ahead, leaving it with
  massive syncs to carry out.
 
  One thing I do see with the WBT enabled and to some extent with it
  disabled, is that there are large periods of small block writes at the
  max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
  seekwatcher traces of performing an OSD bench (64kb io's for 500MB)
  where this behaviour can be seen.
 
 If you're doing 64k IOs then I believe it's creating a new on-disk
 file for each of those writes. How that's laid out on-disk will depend
 on your filesystem and the specific config options that we're using to
 try to avoid running too far ahead of the journal.
 
Could you elaborate on that a bit?
I would have expected those 64KB writes to go to the same object (file)
until it is full (4MB).

Because this behavior would explain some (if not all) of the write
amplification I've seen in the past with small writes (see the SSD
Hardware recommendation thread).

Christian

 I think you're just using these config options in conflict with
 each other. You've set the min sync time to 20 seconds for some reason,
 presumably to try and batch stuff up? So in that case you probably
 want to let your journal run for twenty seconds' worth of backing disk
 IO before you start throttling it, and probably 10-20 seconds' worth of
 IO before forcing file flushes. That means increasing the throttle
 limits while still leaving the flusher enabled.
 -Greg
 
 
  http://www.sys-pro.co.uk/misc/wbt_on.png
 
  http://www.sys-pro.co.uk/misc/wbt_off.png
 
  I would really appreciate if someone could comment on why this type of
  behaviour happens? As can be seen in the trace, if the blocks are
  submitted to the disk as larger IO's and with higher concurrency,
  hundreds of MB of data can be flushed in seconds. Is this something
  specific to the filesystem behaviour which Ceph cannot influence, like
  dirty filesystem metadata/inodes which can't be merged into larger
  IO's?
 
  For sequential writes, I would have thought that in an optimum
  scenario, a spinning disk should be able to almost maintain its large
  block write speed (100MB/s) no matter the underlying block size. That
  being said, from what I understand when a sync is called it will try
  and flush all dirty data so the end result is probably slightly
  different to a traditional battery backed write back cache.
 
  Chris, would you be interested in forming a ceph-users based
  performance team? There's a developer performance meeting which is
  mainly concerned with improving the internals of Ceph. There is also a
  raft of information on the mailing list archives where people have
  said "hey, look at my SSD speed at x,y,z settings", but making
  comparisons or recommendations is not that easy. It may also reduce a
  lot of the repetitive posts of "why is X so slow", etc.
 


-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Hardware recommendation

2015-03-19 Thread Christian Balzer
On Wed, 18 Mar 2015 08:59:14 +0100 Josef Johansson wrote:

 Hi,
 
  On 18 Mar 2015, at 05:29, Christian Balzer ch...@gol.com wrote:
  
  
  Hello,
  
  On Wed, 18 Mar 2015 03:52:22 +0100 Josef Johansson wrote:

[snip]
  We though of doing a cluster with 3 servers, and any recommendation of
  supermicro servers would be appreciated.
  
  Why 3, replication of 3? 
  With Intel SSDs and diligent (SMART/NAGIOS) wear level monitoring I'd
  personally feel safe with a replication factor of 2.
  
 I’ve seen recommendations  of replication 2!  The Intel SSDs are indeed
 endurable. This is only with Intel SSDs I assume?

From the specifications and reviews I've seen, the Samsung 845DC PRO, the
SM 843T and even more so the SV843
(http://www.samsung.com/global/business/semiconductor/product/flash-ssd/overview
don't you love it when the same company has different, competing
products?) should do just fine when it comes to endurance and performance.
Alas, I have no first-hand experience with either, just the
(read-optimized) 845DC EVO.


 This 1U
 http://www.supermicro.com.tw/products/system/1U/1028/SYS-1028U-TR4T_.cfm
 is really nice, missing the SuperDOM peripherals though.. 
While I certainly see use cases for SuperDOM, not all models have 2
connectors, so there is no chance to RAID1 things, and thus you would
definitely have to pull the server out (and re-install the OS) should it fail.

 so you really
 get 8 drives if you need two for OS. And the rails.. don’t get me
 started, but lately they do just snap into the racks! No screws needed.
 That’s a refresh from earlier 1U SM rails.
 
Ah, the only 1U servers I'm currently deploying from SM are older ones, so
still no snap-in rails. Everything 2U has been that way for at least 2
years, though. ^^

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping OSD to physical device

2015-03-19 Thread Robert LeBlanc
I don't use ceph-deploy, but using ceph-disk for creating the OSDs
automatically uses the by-partuuid reference for the journals (at
least I recall only using /dev/sdX for the journal reference, which is
what I have in my documentation). Since ceph-disk does all the
partitioning, it automatically finds the volume with udev, mounts it
in the correct location and accesses the journal on the right disk.

It also may be a limitation on the version of ceph-deploy/ceph-disk
you are using.

On Thu, Mar 19, 2015 at 5:54 PM, Colin Corr co...@pc-doctor.com wrote:
 On 03/19/2015 12:27 PM, Robert LeBlanc wrote:
 Udev already provides some of this for you. Look in /dev/disk/by-*.
 You can reference drives by UUID, id or path (for
 SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across
 reboots and hardware changes.

 Thanks for the quick responses. And to Kobi (off list) as well.

 It seems the optimal way to do this is to create the OSDs by ID in the first 
 place.

 So, for /dev/sde with a journal on /dev/sda5:

 root@osd1:~$ ls -l /dev/disk/by-id/ | grep sde
 lrwxrwxrwx 1 root root  9 Mar 19 23:36 
 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5TUCJX9 -> ../../sde
 lrwxrwxrwx 1 root root  9 Mar 19 23:36 wwn-0x50014ee20a66aefe -> ../../sde

 root@osd1:~$ ls -l /dev/disk/by-id/ | grep sda5
 lrwxrwxrwx 1 root root 10 Mar 19 23:36 
 ata-Crucial_CT480M500SSD1_14210C292B50-part5 -> ../../sda5
 lrwxrwxrwx 1 root root 10 Mar 19 23:36 wwn-0x500a07510c292b50-part5 -> 
 ../../sda5

 The deploy command looks like this:

 ceph-deploy --overwrite-conf osd create 
 osd1:/dev/disk/by-id/wwn-0x50014ee20a66aefe:/dev/disk/by-id/wwn-0x500a07510c292b50-part5

 And alternatively, create a udev rule set for existing devices.

 I haven't tested yet, but I am guessing that the udev rule for that same disk 
 (deployed as sde) would look something like this:

 KERNEL=="sde", SUBSYSTEM=="block", 
 DEVLINKS=="/dev/disk/by-id/wwn-0x50014ee20a66aefe"


 Many thanks for the assistance!

 Colin


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] hadoop namenode not starting due to bindException while deploying hadoop with cephFS

2015-03-19 Thread Ridwan Rashid
Hi,

I have a 5-node Ceph (v0.87) cluster and am trying to deploy Hadoop with
CephFS. I have installed hadoop-1.1.1 on the nodes and changed the
conf/core-site.xml file according to the Ceph documentation
(http://ceph.com/docs/master/cephfs/hadoop/), but after changing the file the
namenode does not start (it can still be formatted), while the other
services (datanode, jobtracker, tasktracker) are running in Hadoop.

The default Hadoop setup works fine, but when I change the core-site.xml file as
above I get the following bindException, as can be seen in the namenode log:


2015-03-19 01:37:31,436 ERROR
org.apache.hadoop.hdfs.server.namenode.NameNode: java.net.BindException:
Problem binding to node1/10.242.144.225:6789 : Cannot assign requested address
   

I have one monitor in the Ceph cluster (node1/10.242.144.225), and I set
ceph://10.242.144.225:6789 as the value of fs.default.name in the
core-site.xml file. Port 6789 is the default port used by the Ceph
monitor, so that may be the reason for the bindException, but the Ceph
documentation says it should be configured like this in the
core-site.xml file. It would be really helpful to get some pointers on where
I am going wrong in the setup.
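
For reference, the relevant core-site.xml snippet looks roughly like this (a
minimal sketch: the fs.default.name value is the one described above, while
the fs.ceph.impl property and class name are taken from the linked Ceph
documentation as I remember it, so treat them as assumptions):

    <property>
      <name>fs.default.name</name>
      <value>ceph://10.242.144.225:6789</value>
    </property>
    <property>
      <name>fs.ceph.impl</name>
      <value>org.apache.hadoop.fs.ceph.CephFileSystem</value>
    </property>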

Thank you.   


 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] 'pgs stuck unclean ' problem

2015-03-19 Thread houguanghua
Dear all, 
Ceph 0.72.2 is deployed on three hosts, but the cluster's status is HEALTH_WARN. 
The status is as follows:

 # ceph -s
cluster e25909ed-25d9-42fd-8c97-0ed31eec6194
 health HEALTH_WARN 768 pgs degraded; 768 pgs stuck unclean; recovery 2/3 
objects degraded (66.667%)
 monmap e3: 3 mons at 
{ceph-node1=192.168.57.101:6789/0,ceph-node2=192.168.57.102:6789/0,ceph-node3=192.168.57.103:6789/0},
 election epoch 34, quorum 0,1,2 ceph-node1,ceph-node2,ceph-node3
 osdmap e170: 9 osds: 9 up, 9 in
  pgmap v1741: 768 pgs, 7 pools, 36 bytes data, 1 objects
367 MB used, 45612 MB / 45980 MB avail
2/3 objects degraded (66.667%)
 768 active+degraded
There are 3 pools created, but 7 pools appear in the above ceph status.
# ceph osd lspools
5 data,6 metadata,7 rbd,
The object in pool 'data' has just one replica, but the pool's replication 
size is set to 3.
 # ceph osd map data object1
osdmap e170 pool 'data' (5) object 'object1' -> pg 5.bac5debc (5.bc) -> up [6] 
acting [6]
 
# ceph osd dump|more
epoch 170
fsid e25909ed-25d9-42fd-8c97-0ed31eec6194
created 2015-03-16 11:23:28.805286
modified 2015-03-19 15:45:39.451077
flags 
pool 5 'data' rep size 3 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 
256 pgp_num 256 last_change 155 owner 0
pool 6 'metadata' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins 
pg_num 256 pgp_num 256 last_change 161 owner 0
pool 7 'rbd' rep size 2 min_size 1 crush_ruleset 0 object_hash rjenkins pg_num 
256 pgp_num 256 last_change 163 owner 0

Other info is depicted here.
# ceph osd tree
# id    weight  type name       up/down reweight
-1  0   root default
-7  0   rack rack03
-4  0   host ceph-node3
6   0   osd.6   up  1
7   0   osd.7   up  1
8   0   osd.8   up  1
-6  0   rack rack02
-3  0   host ceph-node2
3   0   osd.3   up  1
4   0   osd.4   up  1
5   0   osd.5   up  1
-5  0   rack rack01
-2  0   host ceph-node1
0   0   osd.0   up  1
1   0   osd.1   up  1
2   0   osd.2   up  1

The crushmap is :
# begin crush map
tunable choose_local_tries 0
tunable choose_local_fallback_tries 0
tunable choose_total_tries 50
tunable chooseleaf_descend_once 1
# devices
device 0 osd.0
device 1 osd.1
device 2 osd.2
device 3 osd.3
device 4 osd.4
device 5 osd.5
device 6 osd.6
device 7 osd.7
device 8 osd.8
# types
type 0 osd
type 1 host
type 2 rack
type 3 row
type 4 room
type 5 datacenter
type 6 root
# buckets
host ceph-node3 {
id -4   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item osd.6 weight 0.000
item osd.7 weight 0.000
item osd.8 weight 0.000
}
rack rack03 {
id -7   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item ceph-node3 weight 0.000
}
host ceph-node2 {
id -3   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item osd.3 weight 0.000
item osd.4 weight 0.000
item osd.5 weight 0.000
}
rack rack02 {
id -6   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item ceph-node2 weight 0.000
}
host ceph-node1 {
id -2   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item osd.0 weight 0.000
item osd.1 weight 0.000
item osd.2 weight 0.000
}
rack rack01 {
id -5   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item ceph-node1 weight 0.000
}
root default {
id -1   # do not change unnecessarily
# weight 0.000
alg straw
hash 0  # rjenkins1
item rack03 weight 0.000
item rack02 weight 0.000
item rack01 weight 0.000
}
# rules
rule data {
ruleset 0
type replicated
min_size 1
max_size 10
step take default
step chooseleaf firstn 0 type host
step emit
}
# end crush map
 
# ceph health detail |more
HEALTH_WARN 768 pgs degraded; 768 pgs stuck unclean; recovery 2/3 objects 
degraded (66.667%)
pg 5.17 is stuck unclean since forever, current state active+degraded, last 
acting [6]
pg 6.14 is stuck unclean since forever, current state active+degraded, last 
acting [6]
pg 7.15 is stuck unclean since forever, current state active+degraded, last 
acting [6]
pg 5.14 is stuck unclean since forever, current 

[ceph-users] Segfault after modifying CRUSHMAP

2015-03-19 Thread gian

Hi guys,

I was creating new buckets and adjusting the crush map when 1 monitor 
stopped replying.


The scenario is:
2 servers
2 MONs
21 OSDs each server

Error message in the mon.log:

NOTE: a copy of the executable, or `objdump -rdS <executable>` is 
needed to interpret this.


I uploaded the stderr to:
http://ur1.ca/jxbrp

Does anybody have any idea?


Thank you,
Gian
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] SSD Hardware recommendation

2015-03-19 Thread Christian Balzer

Hello,

On Wed, 18 Mar 2015 11:41:17 +0100 Francois Lafont wrote:

 Hi,
 
 Christian Balzer wrote :
 
  Consider what you think your IO load (writes) generated by your
  client(s) will be, multiply that by your replication factor, divide by
  the number of OSDs, that will give you the base load per OSD. 
  Then multiply by 2 (journal on OSD) per OSD.
  Finally based on my experience and measurements (link below) multiply
  that by at least 6, probably 10 to be on safe side. Use that number to
  find the SSD that can handle this write load for the time period
  you're budgeting that cluster for.
  http://lists.opennebula.org/pipermail/ceph-users-ceph.com/2014-October/043949.html
 
 Thanks Christian for these interesting explanations. I have read your link
 and I'd like to understand why the write amplification is greater than
 the replication factor. To me, in theory, write amplification
 should be approximately equal to the replication factor. What are the
 reasons for this difference?
 
 Er... in fact, after thinking about it a little, I imagine that 1 write
 IO on the client side becomes 2*R IOs on the cluster side (where R is the
 replication factor), because there are R IOs for the OSDs and R IOs for the
 journals. So, with R = 2, I can imagine a write amplification equal to 4,
 but I don't understand why it's 5 or 6. Is it possible to have an
 explanation for this?
 
You're asking the wrong person, as I'm neither a Ceph nor a kernel
developer. ^o^
Back then Mark Nelson from the Ceph team didn't expect to see those
numbers either, but both Mark Wu and I saw them.

Anyway, let's start with the basics and the things that are understandable
without any detailed knowledge.

Assume a cluster with 2 nodes, 10 OSDs each, and a replication of 2 (since
we're talking about an SSD cluster here, to keep things related to the
question of the OP).

Now a client writes 40MB of data to the cluster.
Assuming an ideal scenario where all PGs are evenly distributed (they won't
be) and this is totally fresh data (resulting in 10 4MB Ceph objects), this
would mean that each OSD will receive 4MB (10 primary PGs, 10 secondary
ones).
With journals on the same SSD (currently the best way based on tests), we
get a write amplification of 2, as that data is written both to the
journal and the actual storage space.
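
Spelled out with those numbers, as a quick sanity check rather than a
measurement:

    40 MB client write x 2 replicas            = 80 MB landing on the cluster
    80 MB / 20 OSDs                            =  4 MB of object data per OSD
    4 MB journal write + 4 MB filestore write  =  8 MB per SSD, i.e. 2x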

But as my results in the link above showed, that is very much dependent on
the write size. With a 4MB block size (the ideal size for default RBD
pools and objects) I saw even slightly less than the expected 2x
amplification; I assume that was due to caching and PG imbalances.

Now my guess as to what happens with small (4KB) writes is that all these
small writes do not coalesce sufficiently before being written to the
object on the OSD.
So up to 1000 4KB writes could happen to that 4MB object (clearly it is
much less than that, but how much I can't tell), resulting in the same
blocks being rewritten several times.

There's also the journaling done by the respective file system (I used
ext4 during that test), and while there are bound to be some differences,
in a worst-case scenario that could result in another 2x write amplification
(FS journal and actual file).

In addition, Ceph updates various files like the omap leveldb and
meta-data; quantifying that, however, would require much more detailed
analysis or familiarity with the Ceph code.

Regards,

Christian
-- 
Christian Balzer        Network/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Readonly cache tiering and rbd.

2015-03-19 Thread Matthijs Möhlmann

-BEGIN PGP SIGNED MESSAGE-
Hash: SHA1

Hi,

- From the documentation:

Cache Tier readonly:

Read-only Mode: When admins configure tiers with readonly mode, Ceph
clients write data to the backing tier. On read, Ceph copies the
requested object(s) from the backing tier to the cache tier. Stale
objects get removed from the cache tier based on the defined policy.
This approach is ideal for immutable data (e.g., presenting
pictures/videos on a social network, DNA data, X-Ray imaging, etc.),
because reading data from a cache pool that might contain out-of-date
data provides weak consistency. Do not use readonly mode for mutable data.

Does this mean that when a client (xen / kvm with a RBD volume) writes
some data that the OSD does not mark the readonly cache dirty?

In other words, what does 'weak consistency' mean here?

Regards, Matthijs
-BEGIN PGP SIGNATURE-
Version: GnuPG/MacGPG2 v2.0.22 (Darwin)

iQIcBAEBAgAGBQJVCrcNAAoJEBXBjvSJ+ky+TK4P/2EbWeZICmzwS1RIeZZhRJL7
0tdcrzlETH7E6UZJS/dkOK/qea2ouXPipwnO8axj9nBc9ixHDx4ODTqeJ8t2Tm9T
6xtIcVtjatBsI9chkAcLhYK/vfLCTVeJFLwPeQLu/miYHmcn88eHuhkn/A2ARCdj
MsmIYfTaV8VEY/4oUD2kMHog1yL/Io36vgAEgnMJrtSC2wQvyqiVO9ZVCaStkP8H
ztIeKyhlCJRRWBA0PsvIiBX9brQhIPFIWDA8h+ypppA4YQLNsMq7xrNezrF4mSJt
/keMwqUSeTsm7wkL1PLSAByosOjFsXKJkUHDsNtT6Dyzb5hzTTaA5XcWS7FFrA1p
GnIEXGqf1Xk41zWFQhSzvUImxCtAAIF4DBDvndtEroMmofNLKGbfULKHJvvrkSKd
uVswpSa7diA7dQXkUmisp/ZtoXuMtgA4WtJ4FmKRkCx1OpXHjKQjPm212ZD7hiQk
z8zpasnQAvfE/0otvKaBXU5jTaMI8bhDaIZwY6wqpTxvok1MghsFMM619SQqy0nM
tg0qf2Qb2NQIz0jvvlSsfhzyUmKP9WrSNVvYGeNCkxF1T0i1pRun1f4gMo+6lalj
zLsoLufjgvd4w6e9G+p8eoLv4rcBEtNa8bX0o1vpC7k+Rh8STXcYeTSDAkU0xnf4
jgQXA5kan6ezsEMyqU7I
=WCq1
-END PGP SIGNATURE-


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Code for object deletion

2015-03-19 Thread khyati joshi
Can anyone tell me where the code for deleting objects using the command
rados rm test-object-1 --pool=data can be found, for Ceph version 0.80.5?


Thanks.
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Server Specific Pools

2015-03-19 Thread Garg, Pankaj
Hi,

I have a Ceph cluster with both ARM and x86 based servers in the same cluster. 
Is there a way for me to define pools, or some logical separation, that would 
allow me to use only one set of machines for a particular test?
That way it would be easy for me to run tests on either x86 or ARM and do some 
comparison testing.

Thanks
Pankaj
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Server Specific Pools

2015-03-19 Thread David Burley
Pankaj,

You can define them via different crush rules, and then assign a pool to a
given crush rule. This is the same in practice as having a node type with
all SSDs and another with all spinners. You can read more about how to set
this up here:

http://ceph.com/docs/master/rados/operations/crush-map/#placing-different-pools-on-different-osds
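
As a minimal sketch (the bucket names, IDs and weights below are made up, not
taken from any real cluster): give the ARM hosts their own CRUSH root, add a
rule that only draws from it, and point the test pool at that rule.

    root arm {
            id -10          # do not change unnecessarily
            alg straw
            hash 0          # rjenkins1
            item arm-host1 weight 1.000
            item arm-host2 weight 1.000
    }
    rule arm_only {
            ruleset 2
            type replicated
            min_size 1
            max_size 10
            step take arm
            step chooseleaf firstn 0 type host
            step emit
    }

    # after injecting the edited CRUSH map:
    ceph osd pool set <poolname> crush_ruleset 2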

Cheers,

On Thu, Mar 19, 2015 at 9:28 PM, Garg, Pankaj 
pankaj.g...@caviumnetworks.com wrote:

  Hi,



 I have a Ceph cluster with both ARM and x86 based servers in the same
 cluster. Is there a way for me to define Pools or some logical separation
 that would allow me to use only 1 set of machines for a particular test.

 That way it makes easy for me to run tests either on x86 or ARM and do
 some comparison testing.



 Thanks

 Pankaj

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
David Burley
NOC Manager, Sr. Systems Programmer/Analyst
Slashdot Media

e: da...@slashdotmedia.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping OSD to physical device

2015-03-19 Thread Colin Corr
On 03/19/2015 12:27 PM, Robert LeBlanc wrote:
 Udev already provides some of this for you. Look in /dev/disk/by-*.
 You can reference drives by UUID, id or path (for
 SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across
 reboots and hardware changes.

Thanks for the quick responses. And to Kobi (off list) as well.

It seems the optimal way to do this is to create the OSDs by ID in the first 
place.

So, for /dev/sde with a journal on /dev/sda5:

root@osd1:~$ ls -l /dev/disk/by-id/ | grep sde
lrwxrwxrwx 1 root root  9 Mar 19 23:36 ata-WDC_WD40EFRX-68WT0N0_WD-WCC4E5TUCJX9 
-> ../../sde
lrwxrwxrwx 1 root root  9 Mar 19 23:36 wwn-0x50014ee20a66aefe -> ../../sde

root@osd1:~$ ls -l /dev/disk/by-id/ | grep sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 
ata-Crucial_CT480M500SSD1_14210C292B50-part5 -> ../../sda5
lrwxrwxrwx 1 root root 10 Mar 19 23:36 wwn-0x500a07510c292b50-part5 -> 
../../sda5

The deploy command looks like this:

ceph-deploy --overwrite-conf osd create 
osd1:/dev/disk/by-id/wwn-0x50014ee20a66aefe:/dev/disk/by-id/wwn-0x500a07510c292b50-part5

And alternatively, create a udev rule set for existing devices.

I haven't tested yet, but I am guessing that the udev rule for that same disk 
(deployed as sde) would look something like this:

KERNEL=="sde", SUBSYSTEM=="block", 
DEVLINKS=="/dev/disk/by-id/wwn-0x50014ee20a66aefe"


Many thanks for the assistance!

Colin


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FastCGI and RadosGW issue?

2015-03-19 Thread Yehuda Sadeh-Weinraub


- Original Message -
 From: Potato Farmer potato_far...@outlook.com
 To: ceph-users@lists.ceph.com
 Sent: Thursday, March 19, 2015 12:26:41 PM
 Subject: [ceph-users] FastCGI and RadosGW issue?
 
 
 
 Hi,
 
 
 
 I am running into an issue uploading to a bucket over an s3 connection to
 ceph. I can create buckets just fine. I just can’t create a key and copy
 data to it.
 
 
 
 Command that causes the error:
 
  key.set_contents_from_string("testing from string")
 
 
 
 I encounter the following error:
 
 Traceback (most recent call last):
 
 File "<stdin>", line 1, in <module>
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1424, in
 set_contents_from_string
 
 encrypt_key=encrypt_key)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1291, in
 set_contents_from_file
 
 chunked_transfer=chunked_transfer, size=size)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 748, in
 send_file
 
 chunked_transfer=chunked_transfer, size=size)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 949, in
 _send_file_internal
 
 query_args=query_args
 
 File /usr/lib/python2.7/site-packages/boto/s3/connection.py, line 664, in
 make_request
 
 retry_handler=retry_handler
 
 File /usr/lib/python2.7/site-packages/boto/connection.py, line 1068, in
 make_request
 
 retry_handler=retry_handler)
 
 File /usr/lib/python2.7/site-packages/boto/connection.py, line 1025, in
 _mexe
 
 raise BotoServerError(response.status, response.reason, body)
 
 boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error
 
 None
 
 
 
 In the Apache logs I see the following:
 
 [Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server
 /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
 
 [Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 bytes)
 received from server /var/www/s3gw.fcgi
 
 [Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server
 /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
 
 [Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 bytes)
 received from server /var/www/s3gw.fcgi
 
 
 
 I do not get any data to show in the radosgw logs, it is empty. I have turned
 off FastCGIWrapper and set rgw print continue to false in ceph.conf. I am
 using the version of FastCGI provided by the ceph repo.

In this case you don't need to have 'rgw print continue' set to false; either 
remove that line, or set it to true.
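
In ceph.conf terms that is simply (the section name below is a guess at how
the poster's gateway section is named; adjust to match):

    [client.radosgw.gateway]
    rgw print continue = true    # or remove the line and take the default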

Yehuda
 
 
 
 Has anyone run into this before? Any suggestions?
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] FastCGI and RadosGW issue?

2015-03-19 Thread Potato Farmer
Yehuda, 

You rock! Thank you for the suggestion. That fixed the issue.  :)



-Original Message-
From: Yehuda Sadeh-Weinraub [mailto:yeh...@redhat.com] 
Sent: Thursday, March 19, 2015 12:45 PM
To: Potato Farmer
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] FastCGI and RadosGW issue?



- Original Message -
 From: Potato Farmer potato_far...@outlook.com
 To: ceph-users@lists.ceph.com
 Sent: Thursday, March 19, 2015 12:26:41 PM
 Subject: [ceph-users] FastCGI and RadosGW issue?
 
 
 
 Hi,
 
 
 
 I am running into an issue uploading to a bucket over an s3 connection 
 to ceph. I can create buckets just fine. I just can’t create a key and 
 copy data to it.
 
 
 
 Command that causes the error:
 
  key.set_contents_from_string("testing from string")
 
 
 
 I encounter the following error:
 
 Traceback (most recent call last):
 
 File "<stdin>", line 1, in <module>
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1424, in 
 set_contents_from_string
 
 encrypt_key=encrypt_key)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1291, in 
 set_contents_from_file
 
 chunked_transfer=chunked_transfer, size=size)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 748, in 
 send_file
 
 chunked_transfer=chunked_transfer, size=size)
 
 File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 949, in 
 _send_file_internal
 
 query_args=query_args
 
 File /usr/lib/python2.7/site-packages/boto/s3/connection.py, line 
 664, in make_request
 
 retry_handler=retry_handler
 
 File /usr/lib/python2.7/site-packages/boto/connection.py, line 1068, 
 in make_request
 
 retry_handler=retry_handler)
 
 File /usr/lib/python2.7/site-packages/boto/connection.py, line 1025, 
 in _mexe
 
 raise BotoServerError(response.status, response.reason, body)
 
 boto.exception.BotoServerError: BotoServerError: 500 Internal Server 
 Error
 
 None
 
 
 
 In the Apache logs I see the following:
 
 [Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server 
 /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
 
 [Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 
 bytes) received from server /var/www/s3gw.fcgi
 
 [Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server 
 /var/www/s3gw.fcgi aborted: idle timeout (30 sec)
 
 [Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 
 bytes) received from server /var/www/s3gw.fcgi
 
 
 
 I do not get any data to show in the radosgw logs, it is empty. I have 
 turned off FastCGIWrapper and set rgw print continue to false in 
 ceph.conf. I am using the version of FastCGI provided by the ceph repo.

In this case you don't need to have 'rgw print continue' set to false; either 
remove that line, or set it to true.

Yehuda
 
 
 
 Has anyone run into this before? Any suggestions?
 
 
 
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Readonly cache tiering and rbd.

2015-03-19 Thread Gregory Farnum
On Thu, Mar 19, 2015 at 4:46 AM, Matthijs Möhlmann
matth...@cacholong.nl wrote:

 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1

 Hi,

 - From the documentation:

 Cache Tier readonly:

 Read-only Mode: When admins configure tiers with readonly mode, Ceph
 clients write data to the backing tier. On read, Ceph copies the
 requested object(s) from the backing tier to the cache tier. Stale
 objects get removed from the cache tier based on the defined policy.
 This approach is ideal for immutable data (e.g., presenting
 pictures/videos on a social network, DNA data, X-Ray imaging, etc.),
 because reading data from a cache pool that might contain out-of-date
 data provides weak consistency. Do not use readonly mode for mutable data.

 Does this mean that when a client (xen / kvm with a RBD volume) writes
 some data that the OSD does not mark the readonly cache dirty?

Yes, exactly. Reads are directed to the cache but writes go directly
to the base tier, and there's no attempt at communication about the
changed objects.
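
For context, a read-only tier of the kind being discussed is set up along
these lines (pool names are placeholders; check the cache-tiering docs for
your release before relying on this):

    ceph osd tier add cold-storage hot-cache
    ceph osd tier cache-mode hot-cache readonly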
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceiling on number of PGs in a OSD

2015-03-19 Thread Sreenath BH
Hi,

Is there a ceiling on the number of placement groups per OSD beyond
which steady-state and/or recovery performance will start to suffer?

Example: I need to create a pool over 750 OSDs (25 OSDs per server, 50 servers).
The PG calculator gives me 65536 placement groups, with 300 PGs per OSD.
Now as the cluster expands, the number of PGs per OSD has to increase as well.

If the cluster size increases by a factor of 10, the number of PGs per
OSD will also need to be increased.
What would be the impact of a large PG count per OSD on peering and rebalancing?

There is 3GB per OSD available.

thanks,
Sreenath
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] scrubbing for a long time and not finished

2015-03-19 Thread Sage Weil
On Thu, 19 Mar 2015, Xinze Chi wrote:
 Currently, users do not know when a PG has been scrubbing for a long time.
 I wonder whether we could give a warning when that happens (with a threshold
 defined as osd_scrub_max_time).
 It would tell the user that something may be wrong in the cluster.

This should be pretty straightforward to add along with the other "stuck 
X" warnings based on the pg_stat_t state timestamps.  On the other hand, 
that may be a somewhat heavyweight approach (each new warning bloats the 
stat structure a bit); open to other ideas!

sage
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] cciss driver package for RHEL7

2015-03-19 Thread O'Reilly, Dan
I understand there's a KMOD_CCISS package available.  However, I can't find it 
for download.  Anybody have any ideas?

Thanks!

Dan O'Reilly
UNIX Systems Administration
9601 S. Meridian Blvd.
Englewood, CO 80112
720-514-6293


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Issue with Ceph mons starting up- leveldb store

2015-03-19 Thread Andrew Diller
Hello:

We have a cuttlefish (0.61.9) 192-OSD cluster that we are trying to get
back to quorum. We have 2 mon nodes up and ready; we just need this 3rd one.

We moved the data dir (/var/lib/ceph/mon) over from one of the good ones to
this 3rd node, but it won't start. We see this error, after which no
further logging occurs:

2015-03-19 06:25:05.395210 7fcb57f1c7c0 -1 failed to create new leveldb
store
2015-03-19 06:25:05.417716 7f272ae0d7c0  0 ceph version 0.61.9
(7440dcd135750839fa0f00263f80722ff6f51e90), process ceph-mon, pid 37967

Does anyone have an idea why the mon process would have issues creating the
leveldb store (we've seen this error since the outage) and where does it
create it? Is it part of the paxos implementation?

Thanks for any help,

-andy
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] FastCGI and RadosGW issue?

2015-03-19 Thread Potato Farmer
Hi, 

 

I am running into an issue uploading to a bucket over an s3 connection to
ceph. I can create buckets just fine. I just can't create a key and copy
data to it. 

 

Command that causes the error: 

 key.set_contents_from_string("testing from string")

 

I encounter the following error: 

Traceback (most recent call last):

  File "<stdin>", line 1, in <module>

  File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1424, in
set_contents_from_string

encrypt_key=encrypt_key)

  File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 1291, in
set_contents_from_file

chunked_transfer=chunked_transfer, size=size)

  File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 748, in
send_file

chunked_transfer=chunked_transfer, size=size)

  File /usr/lib/python2.7/site-packages/boto/s3/key.py, line 949, in
_send_file_internal

query_args=query_args

  File /usr/lib/python2.7/site-packages/boto/s3/connection.py, line 664,
in make_request

retry_handler=retry_handler

  File /usr/lib/python2.7/site-packages/boto/connection.py, line 1068, in
make_request

retry_handler=retry_handler)

  File /usr/lib/python2.7/site-packages/boto/connection.py, line 1025, in
_mexe

raise BotoServerError(response.status, response.reason, body)

boto.exception.BotoServerError: BotoServerError: 500 Internal Server Error

None

 

In the Apache logs I see the following: 

[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: comm with server
/var/www/s3gw.fcgi aborted: idle timeout (30 sec)

[Thu Mar 19 12:03:13 2015] [error] [] FastCGI: incomplete headers (0 bytes)
received from server /var/www/s3gw.fcgi

[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: comm with server
/var/www/s3gw.fcgi aborted: idle timeout (30 sec)

[Thu Mar 19 12:03:32 2015] [error] [] FastCGI: incomplete headers (0 bytes)
received from server /var/www/s3gw.fcgi

 

I do not get any data to show in the radosgw logs, it is empty. I have
turned off FastCGIWrapper and set rgw print continue to false in ceph.conf.
I am using the version of FastCGI provided by the ceph repo. 

 

Has anyone run into this before? Any suggestions? 

 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Mapping OSD to physical device

2015-03-19 Thread Robert LeBlanc
Udev already provides some of this for you. Look in /dev/disk/by-*.
You can reference drives by UUID, id or path (for
SAS/SCSI/FC/iSCSI/etc) which will provide some consistency across
reboots and hardware changes.

On Thu, Mar 19, 2015 at 1:10 PM, Colin Corr co...@pc-doctor.com wrote:
 Greetings Cephers,

 I have been lurking on this list for a while, but this is my first inquiry. I 
 have been playing with Ceph for the past 9 months and am in the process of 
 deploying a production Ceph cluster. I am seeking advice on an issue that I 
 have encountered. I do not believe it is a Ceph specific issue, but more of a 
 Linux issue. Technically, it's not an issue, just undesired behaviour that I 
 am hoping someone here has encountered and can provide some insight as to a 
 work around.

 Basically, there are occasions when an OSD host machine gets rebooted. 
 Sometimes one or more drives does not spin up properly. This causes the OSD 
 to go offline, along with all other OSDs after it in sequence.

 I created my OSDs using the online docs with the Linux device name (ex. 
 /dev/sdc, sdd, sde, etc). So, osd.0 = /dev/sdc, osd.1 = /dev/sdd, osd.2 = 
 /dev/sde, osd.3 = dev/sdf, etc.

 But, if one of the drives fails/does not spin up, then Linux will rename the 
 drives. Example, /dev/sdd fails on reboot, so now osd.1 comes up with 
 /dev/sde, but /dev/sde is actually the osd.2 drive and osd.2 comes up with 
 what was the osd.3 drive, then they all fall offline in sequence after the 
 one failed osd.1.

 As expected, if I replace the failed drive and reboot, Linux enumerates the 
 drives and gives them the original device names and Ceph behaves properly by 
 marking the affected osd as down and out, while the remaining drives in 
 sequence come up and recover gracefully.

 Does anyone have any thoughts or experience with how one can ensure that 
 Linux device names will always map to the physical device ID? I was thinking 
 along the lines of a udev ruleset for the drives or something similar. Or, is 
 there a better way to create the OSD using the physical device ID? Basically, 
 some sort of way to ensure that a specific physical drive always gets mapped 
 to the same device name and OSD.

 Thanks for any insight or thoughts on this,

 Colin

 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Mapping OSD to physical device

2015-03-19 Thread Colin Corr
Greetings Cephers,

I have been lurking on this list for a while, but this is my first inquiry. I 
have been playing with Ceph for the past 9 months and am in the process of 
deploying a production Ceph cluster. I am seeking advice on an issue that I 
have encountered. I do not believe it is a Ceph specific issue, but more of a 
Linux issue. Technically, it's not an issue, just undesired behaviour that I am 
hoping someone here has encountered and can provide some insight as to a work 
around.

Basically, there are occasions when an OSD host machine gets rebooted. 
Sometimes one or more drives does not spin up properly. This causes the OSD to 
go offline, along with all other OSDs after it in sequence.

I created my OSDs using the online docs with the Linux device name (ex. 
/dev/sdc, sdd, sde, etc). So, osd.0 = /dev/sdc, osd.1 = /dev/sdd, osd.2 = 
/dev/sde, osd.3 = dev/sdf, etc.

But, if one of the drives fails/does not spin up, then Linux will rename the 
drives. Example, /dev/sdd fails on reboot, so now osd.1 comes up with /dev/sde, 
but /dev/sde is actually the osd.2 drive and osd.2 comes up with what was the 
osd.3 drive, then they all fall offline in sequence after the one failed osd.1.

As expected, if I replace the failed drive and reboot, Linux enumerates the 
drives and gives them the original device names and Ceph behaves properly by 
marking the affected osd as down and out, while the remaining drives in 
sequence come up and recover gracefully.

Does anyone have any thoughts or experience with how one can ensure that Linux 
device names will always map to the physical device ID? I was thinking along 
the lines of a udev ruleset for the drives or something similar. Or, is there a 
better way to create the OSD using the physical device ID? Basically, some sort 
of way to ensure that a specific physical drive always gets mapped to the same 
device name and OSD.

Thanks for any insight or thoughts on this,

Colin

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Nick Fisk
I think this could be part of what I am seeing. I found this post from a 
while back:

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

Which seems to describe a workaround for the behaviour I am seeing. 
The constant small block IO I was seeing looks like it was either the pg log 
and info updates or FS metadata. I have been going through the blktraces I did 
today and 90% of the time I am just seeing 8KB writes and journal writes. 

I think the journal and filestore settings I have been adjusting, have just 
been moving the data sync around the benchmark timeline and altering when the 
journal starts throttling. It seems that with small IO's the metadata overhead 
takes several times longer than the actual data writing. This probably also 
explains why a full SSD OSD is faster than a HDD+SSD even for brief bursts of 
IO.

In the thread I posted above, it seems that adding something like flashcache 
can massively help overcome this problem, so this is something I might look 
into. It’s a shame I didn't get BBWC with my OSD nodes as this would have also 
likely alleviated this problem with a lot less hassle.


 Ah, no, you're right. With the bench command it all goes in to one object, 
 it's
 just a separate transaction for each 64k write. But again depending on flusher
 and throttler settings in the OSD, and the backing FS' configuration, it can 
 be
 a lot of individual updates — in particular, every time there's a sync it has 
 to
 update the inode.
 Certainly that'll be the case in the described configuration, with relatively 
 low
 writeahead limits on the journal but high sync intervals — once you hit the
 limits, every write will get an immediate flush request.
 
 But none of that should have much impact on your write amplification tests
 unless you're actually using osd bench to test it. You're more likely to be
 seeing the overhead of the pg log entry, pg info change, etc that's associated
 with each write.
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Tier Flush = immediate base tier journal sync?

2015-03-19 Thread Gregory Farnum
On Wed, Mar 18, 2015 at 11:10 PM, Christian Balzer ch...@gol.com wrote:

 Hello,

 On Wed, 18 Mar 2015 11:05:47 -0700 Gregory Farnum wrote:

 On Wed, Mar 18, 2015 at 8:04 AM, Nick Fisk n...@fisk.me.uk wrote:
  Hi Greg,
 
  Thanks for your input and completely agree that we cannot expect
  developers to fully document what impact each setting has on a
  cluster, particularly in a performance related way
 
  That said, if you or others could spare some time for a few pointers it
  would be much appreciated and I will endeavour to create some useful
  results/documents that are more relevant to end users.
 
  I have taken on board what you said about the WB throttle and have been
  experimenting with it by switching it on and off. I know it's a bit of
  a blunt configuration change, but it was useful to understand its
  effect. With it off, I do see initially quite a large performance
  increase, but over time it actually starts to slow the average
  throughput down. Like you said, I am guessing this is to do with it
  making sure the journal doesn't get too far ahead, leaving it with
  massive syncs to carry out.
 
  One thing I do see with the WBT enabled and to some extent with it
  disabled, is that there are large periods of small block writes at the
  max speed of the underlying SATA disk (70-80 IOPS). Here are 2 blktrace
  seekwatcher traces of performing an OSD bench (64kb io's for 500MB)
  where this behaviour can be seen.

 If you're doing 64k IOs then I believe it's creating a new on-disk
 file for each of those writes. How that's laid out on-disk will depend
 on your filesystem and the specific config options that we're using to
 try to avoid running too far ahead of the journal.

 Could you elaborate on that a bit?
 I would have expected those 64KB writes to go to the same object (file)
 until it is full (4MB).

 Because this behavior would explain some (if not all) of the write
 amplification I've seen in the past with small writes (see the SSD
 Hardware recommendation thread).

Ah, no, you're right. With the bench command it all goes in to one
object, it's just a separate transaction for each 64k write. But again
depending on flusher and throttler settings in the OSD, and the
backing FS' configuration, it can be a lot of individual updates — in
particular, every time there's a sync it has to update the inode.
Certainly that'll be the case in the described configuration, with
relatively low writeahead limits on the journal but high sync
intervals — once you hit the limits, every write will get an immediate
flush request.

But none of that should have much impact on your write amplification
tests unless you're actually using osd bench to test it. You're more
likely to be seeing the overhead of the pg log entry, pg info change,
etc that's associated with each write.
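
A sketch of the kind of settings being discussed, with purely illustrative
values aimed at roughly 10-20 seconds of backing-disk IO as suggested earlier
in the thread (an assumption to be tested, not a recommendation):

    [osd]
    filestore min sync interval = 20
    filestore max sync interval = 30
    # journal write/queue limits, i.e. how far the journal may run ahead
    journal max write bytes = 1073741824
    journal queue max bytes = 1073741824
    # keep the writeback throttle (flusher) enabled, but let it start later
    filestore wbthrottle enable = true
    filestore wbthrottle xfs bytes start flusher = 536870912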
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs issue

2015-03-19 Thread Bogdan SOLGA
Hello, everyone!

I have created a Ceph cluster (v0.87.1-1) using the info from the 'Quick
deploy http://docs.ceph.com/docs/master/start/quick-ceph-deploy/' page,
with the following setup:

   - 1 x admin / deploy node;
   - 3 x OSD and MON nodes;
  - each OSD node has 2 x 8 GB HDDs;

The setup was made using Virtual Box images, on Ubuntu 14.04.2.
After performing all the steps, the 'ceph health' output lists the cluster
in the HEALTH_WARN state, with the following details:
HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck unclean;
64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10 < min
20)

The output of 'ceph -s':
cluster b483bc59-c95e-44b1-8f8d-86d3feffcfab
 health HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs
stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per
osd (10 < min 20)
 monmap e1: 3 mons at {osd-003=
192.168.122.23:6789/0,osd-002=192.168.122.22:6789/0,osd-001=192.168.122.21:6789/0},
election epoch 6, quorum 0,1,2 osd-001,osd-002,osd-003
 osdmap e20: 6 osds: 6 up, 6 in
  pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
199 MB used, 18166 MB / 18365 MB avail
  64 active+undersized+degraded

I have tried to increase the pg_num and pgp_num to 512, as advised here
http://ceph.com/docs/master/rados/operations/placement-groups/#a-preselection-of-pg-num,
but Ceph refused to do that, with the following error:
Error E2BIG: specified pg_num 512 is too large (creating 384 new PGs on ~6
OSDs exceeds per-OSD max of 32)

After changing the pg*_num to 256, as advised here
http://ceph.com/docs/master/rados/operations/placement-groups/#choosing-the-number-of-placement-groups,
the warning was changed to:
health HEALTH_WARN 256 pgs degraded; 256 pgs stuck unclean; 256 pgs
undersized

What is the issue behind these warning? and what do I need to do to fix it?

I'm a newcomer in the Ceph world, so please don't shoot me if this issue
has been answered / discussed countless times before :) I have searched the
web and the mailing list for the answers, but I couldn't find a valid
solution.

Any help is highly appreciated. Thank you!

Regards,
Bogdan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs issue

2015-03-19 Thread Nick Fisk




 -Original Message-
 From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
 Bogdan SOLGA
 Sent: 19 March 2015 20:51
 To: ceph-users@lists.ceph.com
 Subject: [ceph-users] PGs issue
 
 Hello, everyone!
 I have created a Ceph cluster (v0.87.1-1) using the info from the 'Quick
 deploy' page, with the following setup:
 • 1 x admin / deploy node;
 • 3 x OSD and MON nodes;
 o each OSD node has 2 x 8 GB HDDs;
 The setup was made using Virtual Box images, on Ubuntu 14.04.2.
 After performing all the steps, the 'ceph health' output lists the cluster in 
 the
 HEALTH_WARN state, with the following details:
 HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs stuck
 unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per osd (10
 < min 20)
 The output of 'ceph -s':
 cluster b483bc59-c95e-44b1-8f8d-86d3feffcfab
  health HEALTH_WARN 64 pgs degraded; 64 pgs stuck degraded; 64 pgs
 stuck unclean; 64 pgs stuck undersized; 64 pgs undersized; too few pgs per
 osd (10 < min 20)
  monmap e1: 3 mons at {osd-003=192.168.122.23:6789/0,osd-
 002=192.168.122.22:6789/0,osd-001=192.168.122.21:6789/0}, election epoch
 6, quorum 0,1,2 osd-001,osd-002,osd-003
  osdmap e20: 6 osds: 6 up, 6 in
   pgmap v36: 64 pgs, 1 pools, 0 bytes data, 0 objects
 199 MB used, 18166 MB / 18365 MB avail
   64 active+undersized+degraded
 
 I have tried to increase the pg_num and pgp_num to 512, as advised here,
 but Ceph refused to do that, with the following error:
 Error E2BIG: specified pg_num 512 is too large (creating 384 new PGs on ~6
 OSDs exceeds per-OSD max of 32)
 
 After changing the pg*_num to 256, as advised here, the warning was
 changed to:
 health HEALTH_WARN 256 pgs degraded; 256 pgs stuck unclean; 256 pgs
 undersized
 
 What is the issue behind these warning? and what do I need to do to fix it?

It's basically telling you that your currently available OSDs don't meet the 
requirements to suit the number of replicas you have requested.

What replica size have you configured for that pool?
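
A quick way to check is (the pool name is whichever pool those 64 PGs belong
to; 'rbd' is just the default pool created by the installer):

    ceph osd pool get rbd size
    ceph osd pool get rbd min_size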

 
 I'm a newcomer in the Ceph world, so please don't shoot me if this issue has
 been answered / discussed countless times before :) I have searched the
 web and the mailing list for the answers, but I couldn't find a valid 
 solution.
 Any help is highly appreciated. Thank you!
 Regards,
 Bogdan




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-19 Thread Nick Fisk
I'm looking at trialling OSDs with a small flashcache device over them to
hopefully reduce the impact of metadata updates when doing small block IO.
Inspiration from here:-

http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

One thing I suspect will happen is that when the OSD node starts up, udev
could possibly mount the base OSD partition instead of the flashcached device,
as the base disk will have the Ceph partition uuid type. This could result
in quite nasty corruption.

I have had a look at the Ceph udev rules and can see that something similar
has been done for encrypted OSDs. Am I correct in assuming that what I need
to do is to create a new partition uuid type for flashcached OSDs and then
create a udev rule to activate these new uuid'd OSDs once flashcache has
finished assembling them?

Many Thanks,
Nick




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD + Flashcache + udev + Partition uuid

2015-03-19 Thread Gregory Farnum
On Thu, Mar 19, 2015 at 2:41 PM, Nick Fisk n...@fisk.me.uk wrote:
 I'm looking at trialling OSD's with a small flashcache device over them to
 hopefully reduce the impact of metadata updates when doing small block io.
 Inspiration from here:-

 http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/12083

 One thing I suspect will happen, is that when the OSD node starts up udev
 could possibly mount the base OSD partition instead of flashcached device,
 as the base disk will have the ceph partition uuid type. This could result
 in quite nasty corruption.

 I have had a look at the Ceph udev rules and can see that something similar
 has been done for encrypted OSD's. Am I correct in assuming that what I need
 to do is to create a new partition uuid type for flashcached OSD's and then
 create a udev rule to activate these new uuid'd OSD's once flashcache has
 finished assembling them?

I haven't worked with the udev rules in a while, but that sounds like
the right way to go.
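
A minimal sketch of what that could look like; the partition type GUID, rule
file name and helper script below are placeholders I made up, not values Ceph
ships with:

    # /etc/udev/rules.d/96-ceph-flashcache.rules
    # Match the custom partition type GUID assigned to flashcache-backed OSD
    # data partitions (so the stock Ceph rules no longer grab the raw
    # partition) and only activate the OSD once flashcache has assembled it.
    ACTION=="add", SUBSYSTEM=="block", \
      ENV{ID_PART_ENTRY_TYPE}=="deadbeef-0000-4000-8000-000000000001", \
      RUN+="/usr/local/sbin/flashcache-assemble-and-activate /dev/$name"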
-Greg
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Issue with Ceph mons starting up- leveldb store

2015-03-19 Thread Steffen W Sørensen
On 19/03/2015, at 15.50, Andrew Diller dill...@gmail.com wrote:
 We moved the data dir over (/var/lib/ceph/mon) from one of the good ones to 
 this 3rd node, but it won't start- we see this error, after which no further 
 logging occurs:
 
 2015-03-19 06:25:05.395210 7fcb57f1c7c0 -1 failed to create new leveldb store
 2015-03-19 06:25:05.417716 7f272ae0d7c0  0 ceph version 0.61.9 
 (7440dcd135750839fa0f00263f80722ff6f51e90), process ceph-mon, pid 37967
 
 Does anyone have an idea why the mon process would have issues creating the 
 leveldb store (we've seen this error since the outage) and where does it 
 create it? Is it part of the paxos implementation?
Just guessing... maybe the simple, often-seen root cause: permissions on the dirs along the path.

/Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cciss driver package for RHEL7

2015-03-19 Thread Steffen W Sørensen
 On 19/03/2015, at 15.57, O'Reilly, Dan daniel.orei...@dish.com wrote:
 
 I understand there’s a KMOD_CCISS package available.  However, I can’t find 
 it for download.  Anybody have any ideas?
Oh, I believe HP swapped cciss for the hpsa (Smart Array) driver long ago… so maybe 
just download the latest cciss source and compile it yourself, or…

Sourceforge http://cciss.sourceforge.net/ says:

*New* The cciss driver has been removed from RHEL7 and SLES12. If you really 
want cciss on RHEL7 checkout the elrepo http://elrepo.org/ directory.


/Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] cciss driver package for RHEL7

2015-03-19 Thread O'Reilly, Dan
The problem with using the hpsa driver is that I need to install RHEL 7.1 on a 
Proliant system using the SmartArray 400 controller.  Therefore, I need a 
driver that supports it to even install RHEL 7.1.  RHEL 7.1 doesn’t generically 
recognize that controller out of the box.

From: Steffen W Sørensen [mailto:ste...@me.com]
Sent: Thursday, March 19, 2015 10:08 AM
To: O'Reilly, Dan
Cc: ceph-users@lists.ceph.com
Subject: Re: [ceph-users] cciss driver package for RHEL7

On 19/03/2015, at 15.57, O'Reilly, Dan 
daniel.orei...@dish.com wrote:

I understand there’s a KMOD_CCISS package available.  However, I can’t find it 
for download.  Anybody have any ideas?
Oh, I believe HP swapped cciss for the hpsa (Smart Array) driver long ago… so maybe 
just download the latest cciss source and compile it yourself, or…

Sourceforge http://cciss.sourceforge.net says:

*New* The cciss driver has been removed from RHEL7 and SLES12. If you really 
want cciss on RHEL7 checkout the elrepo http://elrepo.org/ directory.


/Steffen
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com