Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-22 Thread SCHAER Frederic
Hi Gregory,



Thanks for your replies.

Let's take the 2-host config setup (3 MONs + 3 idle MDSs on the same hosts).



2 Dell R510 servers, CentOS 7.0.1406, dual Xeon 5620 (8
cores + hyperthreading), 16GB RAM, 2 or 1x10Gbit/s Ethernet (same results with
and without a private 10Gbit network), PERC H700 + 12 2TB SAS disks, and PERC
H800 + 11 2TB SAS disks (one unused SSD...)

The EC pool is defined with k=4, m=1

I set the failure domain to OSD for the test
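
For reference, a sketch of how such a pool can be created (the profile name and
PG count are illustrative; in Hammer the failure-domain key is
ruleset-failure-domain):

ceph osd erasure-code-profile set ec41 k=4 m=1 ruleset-failure-domain=osd
ceph osd pool create testec 1024 1024 erasure ec41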

The OSDs are set up with XFS and a 10GB journal 1st partition (the single 
doomed-dell SSD was a bottleneck for 23 disks…)

All disks are presently configured as single-disk RAID0 volumes because the
H700/H800 do not support JBOD.



I have 5 clients (CentOS 7.1), 10Gbit/s Ethernet, all running this command:

rados -k ceph.client.admin.keyring -p testec bench 120 write -b 4194304 -t 32 
--run-name bench_`hostname -s` --no-cleanup

I'm aggregating the average bandwidth at the end of the tests.
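
For reference, the aggregation can be scripted roughly as follows (a sketch:
client host names are placeholders, and it assumes the rados bench summary
line is labelled "Bandwidth (MB/sec)"):

for h in client1 client2 client3 client4 client5 ; do
    ssh "$h" rados -k ceph.client.admin.keyring -p testec bench 120 write \
        -b 4194304 -t 32 --run-name "bench_$h" --no-cleanup &
done > /tmp/bench.out
wait
# sum the per-client averages to get the aggregate figure
awk '/Bandwidth \(MB\/sec\)/ { sum += $NF } END { print sum, "MB/s aggregate" }' /tmp/bench.out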

I'm monitoring the Ceph servers stats live with this dstat command: dstat -N 
p2p1,p2p2,total

The network MTU is 9000 on all nodes.
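
(Jumbo frames can be verified end to end with a don't-fragment ping; the peer
name is a placeholder, and 8972 = 9000 minus 28 bytes of IP/ICMP headers.)

ping -M do -s 8972 <peer-host>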



With this, the average client throughput is around 130 MiB/s, i.e. 650 MiB/s for
the whole 2-node ceph cluster across the 5 clients.

I have since tried removing (ceph osd out / ceph osd crush reweight 0) either the
H700 or the H800 disks, thus only using 11 or 12 disks per server, and I get
either 550 MiB/s or 590 MiB/s of aggregated client bandwidth. Not much less,
considering I removed half the disks!

I'm therefore starting to think I am CPU- or memory-bandwidth limited...?



That's not, however, what I'm tempted to conclude (for the CPU at least) when I
look at the dstat output, as it says the CPUs still sit partly idle or IO-waiting:



----total-cpu-usage---- -dsk/total- --net/p2p1----net/p2p2---net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send: recv  send: recv  send|  in   out | int   csw

  1   1  97   0   0   0| 586k 1870k|   0     0 :   0     0 :   0     0 |  49B  455B|8167    15k
 29  17  24  27   0   3| 128k  734M| 367M  870k:   0     0 : 367M  870k|   0     0 |  61k   61k
 30  17  34  16   0   3| 432k  750M| 229M  567k: 199M  168M: 427M  168M|   0     0 |  65k   68k
 25  14  38  20   0   3|  16k  634M| 232M  654k: 162M  133M: 393M  134M|   0     0 |  56k   64k
 19  10  46  23   0   2| 232k  463M| 244M  670k: 184M  138M: 428M  139M|   0     0 |  45k   55k
 15   8  46  29   0   1| 368k  422M| 213M  623k: 149M  110M: 362M  111M|   0     0 |  35k   41k
 25  17  37  19   0   3|  48k  584M| 139M  394k: 137M   90M: 276M   91M|   0     0 |  54k   53k



Could it be the interrupts or system context switches that cause this
relatively poor per-node performance?

PCI-E interactions with the PERC cards?
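
One thing worth checking is how the NIC and RAID-controller interrupts are
spread across the cores (a sketch; the interrupt names are assumptions, the
PERC H700/H800 normally show up under the megaraid_sas driver as "megasas"):

grep -E 'p2p1|p2p2|megasas' /proc/interrupts
# if everything lands on one or two cores, pinning IRQs or irqbalance may help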

I know I can get way more disk throughput with dd (command below):

----total-cpu-usage---- -dsk/total- -net/total- ---paging-- ---system--
usr sys idl wai hiq siq| read  writ| recv  send|  in   out | int   csw

  1   1  97   0   0   0| 595k 2059k|   0     0 | 634B 2886B|7971    15k
  1  93   0   3   0   3|   0  1722M|  49k   78k|   0     0 |  40k   47k
  1  93   0   3   0   3|   0  1836M|  40k   69k|   0     0 |  45k   57k
  1  95   0   2   0   2|   0  1805M|  40k   69k|   0     0 |  38k   34k
  1  94   0   3   0   2|   0  1864M|  37k   38k|   0     0 |  35k   24k
(…)



Dd command (use at your own risk; note the backgrounding "&" so the writes run
as $FS_THR parallel threads per filesystem):

FS_THR=64 ; FILE_MB=8 ; N_FS=`mount|grep ceph|wc -l`
time (for i in `mount|grep ceph|awk '{print $3}'` ; do
    echo "writing $FS_THR times (threads) $[ 4 * FILE_MB ] mb on $i..."
    for j in `seq 1 $FS_THR` ; do
      dd conv=fsync if=/dev/zero of=$i/test.zero.$j bs=4M count=$[ FILE_MB / 4 ] &
    done ; done ; wait)
echo "wrote $[ N_FS * FILE_MB * FS_THR ] MB on $N_FS FS with $FS_THR threads"
rm -f /var/lib/ceph/osd/*/test.zero*





I hope this gives you more insight into what I'm trying to achieve, and where
I'm failing.



Regards





-----Original Message-----
From: Gregory Farnum [mailto:g...@gregs42.com]
Sent: Wednesday, July 22, 2015 16:01
To: Florent MONTHEL
Cc: SCHAER Frederic; ceph-users@lists.ceph.com
Subject: Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??



We might also be able to help you improve or better understand your

results if you can tell us exactly what tests you're conducting that

are giving you these numbers.

-Greg



On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL
fmont...@flox-arts.net wrote:

 Hi Frederic,



 When you have a Ceph cluster with 1 node you don't experience the network and
 communication overhead due to the distributed model.
 With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you
 will keep some internal communication (2 chunks on the first node and 3 chunks on
 the second node).
 On your configuration the EC pool is set up with 4+1, so for each write you will
 have overhead due to the write spreading on 5 nodes (for 1 customer IO, you
 will experience 5 Ceph IOs due to EC 4+1).
 It's the reason why I think you're reaching performance stability with 5 nodes
 and more in your cluster.

Re: [ceph-users] PGs going inconsistent after stopping the primary

2015-07-22 Thread Samuel Just
Looks like it's just a stat error.  The primary appears to have the correct 
stats, but the replica for some reason doesn't (thinks there's an object for 
some reason).  I bet it clears itself if you perform a write on the pg since
the primary will send over its stats.  We'd need information from when the stat 
error originally occurred to debug further.
-Sam
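
A minimal sketch of forcing such a write (the pool name is taken from the
report below; the object names are arbitrary, and ceph osd map shows which PG
each object lands in):

for i in $(seq 1 20) ; do
    rados -p tapetest put repair-touch-$i /etc/hosts
    ceph osd map tapetest repair-touch-$i
done
# clean up the throwaway objects afterwards
rados -p tapetest ls | grep '^repair-touch-' | xargs -n1 rados -p tapetest rm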

- Original Message -
From: Dan van der Ster d...@vanderster.com
To: ceph-users@lists.ceph.com
Sent: Wednesday, July 22, 2015 7:49:00 AM
Subject: [ceph-users] PGs going inconsistent after stopping the primary

Hi Ceph community,

Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

We wanted to post here before the tracker to see if someone else has
had this problem.

We have a few PGs (different pools) which get marked inconsistent when
we stop the primary OSD. The problem is strange because once we
restart the primary, then scrub the PG, the PG is marked active+clean.
But inevitably next time we stop the primary OSD, the same PG is
marked inconsistent again.

There is no user activity on this PG, and nothing interesting is
logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
mentioning the PG already says inactive+inconsistent).


We suspect this is related to garbage files left in the PG folder. One
of our PGs is acting basically like above, except it goes through this
cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
(repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
active+clean+inconsistent. This one at least logs:

2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

and this should be debuggable because there is only one object in the pool:

tapetest      55          0         0        73575G           1

even though rados ls returns no objects:

# rados ls -p tapetest
#

Any ideas?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] ceph-mon cpu usage

2015-07-22 Thread Luis Periquito
This cluster is serving RBD storage for OpenStack, and today all the I/O
just stopped.
After looking at the boxes, ceph-mon was using 17G of RAM - and this was on
*all* the mons. Restarting the main one just made it work again (I
restarted the other ones because they were using a lot of RAM).
This has happened twice now (the first time was last Monday).

As this is considered a prod cluster there is no logging enabled, and I
can't reproduce it - our test/dev clusters have been working fine and show
neither symptom, but they were upgraded from Firefly.
What can we do to help debug the issue? Any ideas on how to identify the
underlying issue?
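
Something like this could be captured on the busy mons the next time it
happens (a sketch: the mon id is assumed to equal the short hostname, and the
heap stats assume the tcmalloc build that the perf output below suggests):

ceph daemon mon.$(hostname -s) perf dump > /tmp/mon-perf.json
ceph tell mon.\* injectargs '--debug_mon 10 --debug_ms 1'
ceph tell mon.$(hostname -s) heap stats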

thanks,

On Mon, Jul 20, 2015 at 1:59 PM, Luis Periquito periqu...@gmail.com wrote:

 Hi all,

 I have a cluster with 28 nodes (all physical, 4Cores, 32GB Ram), each node
 has 4 OSDs for a total of 112 OSDs. Each OSD has 106 PGs (counted including
 replication). There are 3 MONs on this cluster.
 I'm running on Ubuntu trusty with kernel 3.13.0-52-generic, with Hammer
 (0.94.2).

 This cluster was installed with Hammer (0.94.1) and has only been upgraded
 to the latest available version.

 On the three mons one is mostly idle, one is using ~170% CPU, and one is
 using ~270% CPU. They will change as I restart the process (usually the
 idle one is the one with the lowest uptime).

 Running perf top against the ceph-mon PID on the non-idle boxes yields
 something like this:

   4.62%  libpthread-2.19.so[.] pthread_mutex_unlock
   3.95%  libpthread-2.19.so[.] pthread_mutex_lock
   3.91%  libsoftokn3.so[.] 0x0001db26
   2.38%  [kernel]  [k] _raw_spin_lock
   2.09%  libtcmalloc.so.4.1.2  [.] operator new(unsigned long)
   1.79%  ceph-mon  [.] DispatchQueue::enqueue(Message*, int,
 unsigned long)
   1.62%  ceph-mon  [.] RefCountedObject::get()
   1.58%  libpthread-2.19.so[.] pthread_mutex_trylock
   1.32%  libtcmalloc.so.4.1.2  [.] operator delete(void*)
   1.24%  libc-2.19.so  [.] 0x00097fd0
   1.20%  ceph-mon  [.] ceph::buffer::ptr::release()
   1.18%  ceph-mon  [.] RefCountedObject::put()
   1.15%  libfreebl3.so [.] 0x000542a8
   1.05%  [kernel]  [k] update_cfs_shares
   1.00%  [kernel]  [k] tcp_sendmsg

 The cluster is mostly idle, and it's healthy. The store is 69MB big, and
 the MONs are consuming around 700MB of RAM.

 Any ideas on this situation? Is it safe to ignore?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS vs RBD

2015-07-22 Thread Robert LeBlanc

RBD can be safely mounted on multiple machines at once, but the file
system has to be designed for such scenarios. File systems like ext,
xfs, btrfs, etc are only designed to be accessed by a single system.
Clustered file systems like OCFS, GFS, etc are designed to have
multiple discrete machines access the file system at the same time. As
long as you use a clustered file system on RBD, you will be OK.
Now, whether that performs better than CephFS is a question you will have
to answer through testing.
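
A minimal sketch of that idea, assuming an o2cb/OCFS2 cluster is already
configured on every node that will mount the device and that the RBD is
already mapped as /dev/rbd0:

mkfs.ocfs2 -N 4 -L shared0 /dev/rbd0    # -N = max number of simultaneous mounters
mount -t ocfs2 /dev/rbd0 /mnt/shared    # repeat the map + mount on each node
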
- 
Robert LeBlanc
PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


On Wed, Jul 22, 2015 at 1:17 PM, Lincoln Bryant  wrote:
 Hi Hadi,

 AFAIK, you can’t safely mount RBD as R/W on multiple machines. You could
 re-export the RBD as NFS, but that’ll introduce a bottleneck and probably
 tank your performance gains over CephFS.

 For what it’s worth, some of our RBDs are mapped to multiple machines,
 mounted read-write on one and read-only on the others. We haven’t seen any
 strange effects from that, but I seem to recall it being ill advised.

 —Lincoln

 On Jul 22, 2015, at 2:05 PM, Hadi Montakhabi  wrote:

 Hello Cephers,

 I've been experimenting with CephFS and RBD for some time now.
 From what I have seen so far, RBD outperforms CephFS by far. However, there
 is a catch!
 RBD could be mounted on one client at a time!
 Now, assuming that we have multiple clients running some MPI code (and doing
 some distributed I/O), all these clients need to read/write from the same
 location and sometimes even the same file.
 Is this at all possible by using RBD, and not CephFS?

 Thanks,
 Hadi
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] CephFS vs RBD

2015-07-22 Thread Hadi Montakhabi
Hello Cephers,

I've been experimenting with CephFS and RBD for some time now.
From what I have seen so far, RBD outperforms CephFS by far. However, there
is a catch!
RBD can only be mounted on one client at a time!
Now, assuming that we have multiple clients running some MPI code (and
doing some distributed I/O), all these clients need to read/write from the
same location and sometimes even the same file.
Is this at all possible by using RBD, and not CephFS?

Thanks,
Hadi
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs going inconsistent after stopping the primary

2015-07-22 Thread Dan van der Ster
Cool, writing some objects to the affected PGs has stopped the
consistent/inconsistent cycle. I'll keep an eye on them but this seems
to have fixed the problem.
Thanks!!
Dan

On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just sj...@redhat.com wrote:
 Looks like it's just a stat error.  The primary appears to have the correct 
 stats, but the replica for some reason doesn't (thinks there's an object for 
 some reason).  I bet it clears itself if you perform a write on the pg since
 the primary will send over its stats.  We'd need information from when the 
 stat error originally occurred to debug further.
 -Sam

 - Original Message -
 From: Dan van der Ster d...@vanderster.com
 To: ceph-users@lists.ceph.com
 Sent: Wednesday, July 22, 2015 7:49:00 AM
 Subject: [ceph-users] PGs going inconsistent after stopping the primary

 Hi Ceph community,

 Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

 We wanted to post here before the tracker to see if someone else has
 had this problem.

 We have a few PGs (different pools) which get marked inconsistent when
 we stop the primary OSD. The problem is strange because once we
 restart the primary, then scrub the PG, the PG is marked active+clean.
 But inevitably next time we stop the primary OSD, the same PG is
 marked inconsistent again.

 There is no user activity on this PG, and nothing interesting is
 logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
 mentioning the PG already says inactive+inconsistent).


 We suspect this is related to garbage files left in the PG folder. One
 of our PGs is acting basically like above, except it goes through this
 cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
 (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
 active+clean+inconsistent. This one at least logs:

 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
 mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

 and this should be debuggable because there is only one object in the pool:

 tapetest      55          0         0        73575G           1

 even though rados ls returns no objects:

 # rados ls -p tapetest
 #

 Any ideas?

 Cheers, Dan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS vs RBD

2015-07-22 Thread Lionel Bouton
Le 22/07/2015 21:17, Lincoln Bryant a écrit :
 Hi Hadi,

 AFAIK, you can’t safely mount RBD as R/W on multiple machines. You
 could re-export the RBD as NFS, but that’ll introduce a bottleneck and
 probably tank your performance gains over CephFS.

 For what it’s worth, some of our RBDs are mapped to multiple machines,
 mounted read-write on one and read-only on the others. We haven’t seen
 any strange effects from that, but I seem to recall it being ill advised.

Yes it is, for several reasons. Here are two off the top of my head.

Some (many/most/all?) filesystems update on-disk data when they are
mounted, even if the mount is read-only. If you map your RBD devices
read-only before mounting the filesystem itself read-only, you should be
safe from corruption occurring at mount time, though.
The system with read-write access will keep its in-memory data in sync
with the on-disk data. The others with read-only access will not, as they
won't be aware of the writes being done. This means they will eventually
see incoherent data and will generate fs access errors of varying severity,
from a benign read error to a potential full kernel crash, with
whole-filesystem freezes in between.

Don't do that unless you:
- carefully set up your rbd mappings read-only everywhere except the system
doing the writes,
- can withstand a (simultaneous) system crash on all the systems
mounting the rbd mappings read-only.
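
A sketch of the read-only variant, with placeholder names (norecovery applies
to XFS; ext4 would use noload instead):

rbd map --read-only rbd/myimage
mount -o ro,norecovery /dev/rbd0 /mnt/ro-copy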

Lionel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Clients' connection for concurrent access to ceph

2015-07-22 Thread Shneur Zalman Mattern
Workaround... We're now building a huge computing cluster: 140 DISKLESS
computing nodes, all writing a lot of computing data to storage
concurrently.
The user that submits a job to the cluster also needs access to the same
storage location (to check progress & results).

We've built Ceph cluster:
3 mon nodes (one of them is combined with mds)
3 osd nodes (each one have 10 osd + ssd for journaling)
switch 24 ports x 10G
10 gigabit - for public network
20 gigabit bonding - between osds
Ubuntu 12.04.05
Ceph 0.87.2 - giant
-
Clients have:
10 gigabit for ceph-connection
CentOS 6.6 with upgraded kernel 3.19.8 (already running computing cluster)

Of course, all nodes, switches and clients were configured for jumbo
frames.

=

First test:
I thought of making one big shared RBD, but:
  -  RBD supports multiple clients mapping & mounting it, but not parallel
writes ...

Second test:
NFS over RBD - it works pretty well, but:
1. The NFS gateway is a single point of failure
2. There's no performance scaling of the scale-out storage, i.e. a bottleneck
(limited by the bandwidth of the NFS gateway)

Third test:
We wanted to try CephFS, because our client is familiar with Lustre, which
is very close to CephFS in capabilities:
   1. I've used my CEPH nodes in the client's role. I've mounted CephFS
on one of the nodes, and ran dd with bs=1M ...
- I got wonderful write performance, ~ 1.1 GBytes/s
(really close to the 10Gbit network throughput)

2. I've connected a CentOS client to the 10gig public network, mounted
CephFS, but ...
- It was just ~ 250 MBytes/s

3. I've connected an Ubuntu client (non-ceph member) to the 10gig public
network, mounted CephFS, and ...
- It was also ~ 260 MBytes/s

Now I have to know: do ceph member nodes perhaps have privileged
access ???

I'm sure you have more ceph deployment experience -
have you seen this kind of CephFS performance deviation?
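
(A hedged side note: running a raw RADOS benchmark from both a cluster node
and an external client would separate the network/client side from CephFS
itself; the pool name is a placeholder.)

rados -p <pool> bench 30 write -t 32 --no-cleanup
rados -p <pool> bench 30 seq -t 32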

Thanks,
Shneur









___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Scrubbing optymalisation

2015-07-22 Thread Mateusz Skała
Hi Cephers.

I'm looking for a way to optimize the scrubbing process. In our
environment this process has a big impact on performance. For monitoring
disks we are using Monitorix. While scrubbing is running, 'Disk I/O activity
(R+W)' shows 20-60 reads+writes per second. After disabling the scrub and
deep-scrub processes this value drops to 0-40 reads+writes. It makes a
difference in performance.

Ceph config settings:

ceph --admin-daemon /var/run/ceph/ceph-osd.1.asok config show | grep ioprio
  osd_disk_thread_ioprio_class: idle,
  osd_disk_thread_ioprio_priority: 7,

All disks have cfq scheduler enabled.

The cluster has 6 servers, 5 monitors, and 4-6 OSDs per server + 1 SSD for
journals in each server.

 

Maybe I can set some other config options to reduce the impact of the
scrubbing process? Attached is a screenshot from Monitorix (scrubbing
disabled in week 27).
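
A few other throttling knobs that could be experimented with (the values are
only illustrative, and the begin/end-hour options depend on the Ceph version):

[osd]
osd_scrub_sleep = 0.1
osd_scrub_load_threshold = 0.5
osd_scrub_begin_hour = 22
osd_scrub_end_hour = 6
osd_deep_scrub_interval = 1209600    # 14 days, in seconds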

 

Best Regards, 

Mateusz

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] 10d

2015-07-22 Thread Dan van der Ster
I just filed a ticket after trying ceph-objectstore-tool:
http://tracker.ceph.com/issues/12428
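
For anyone finding this later, the kind of invocation involved looks roughly
like this (a sketch only, run with the OSD stopped; option names as of Hammer,
data path and PG id from the thread below, and exporting first is prudent):

ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 36.10d --op export --file /tmp/36.10d.export
ceph-objectstore-tool --data-path /var/lib/ceph/osd/ceph-30 \
    --journal-path /var/lib/ceph/osd/ceph-30/journal \
    --pgid 36.10d --op remove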

On Fri, Jul 17, 2015 at 3:36 PM, Dan van der Ster d...@vanderster.com wrote:
 A bit of progress: rm'ing everything from inside current/36.10d_head/
 actually let the OSD start and continue deleting other PGs.

 Cheers, Dan

 On Fri, Jul 17, 2015 at 3:26 PM, Dan van der Ster d...@vanderster.com wrote:
 Thanks for the quick reply.

 We /could/ just wipe these OSDs and start from scratch (the only other
 pools were 4+2 ec and recovery already brought us to 100%
 active+clean).

 But it'd be good to understand and prevent this kind of crash...

 Cheers, Dan




 On Fri, Jul 17, 2015 at 3:18 PM, Gregory Farnum g...@gregs42.com wrote:
 I think you'll need to use the ceph-objectstore-tool to remove the
 PG/data consistently, but I've not done this — David or Sam will need
 to chime in.
 -Greg

 On Fri, Jul 17, 2015 at 2:15 PM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi Greg + list,

 Sorry to reply to this old'ish thread, but today one of these PGs bit
 us in the ass.

 Running hammer 0.94.2, we are deleting pool 36 and the OSDs 30, 171,
 and 69 all crash when trying to delete pg 36.10d. They all crash with

ENOTEMPTY suggests garbage data in osd data dir

 (full log below). There is indeed some garbage in there:

 # find 36.10d_head/
 36.10d_head/
 36.10d_head/DIR_D
 36.10d_head/DIR_D/DIR_0
 36.10d_head/DIR_D/DIR_0/DIR_1
 36.10d_head/DIR_D/DIR_0/DIR_1/__head_BD49D10D__24
 36.10d_head/DIR_D/DIR_0/DIR_9


 Do you have any suggestion how to get these OSDs back running? We
 already tried manually moving 36.10d_head to 36.10d_head.bak but then
 the OSD crashes for a different reason:

 -1 2015-07-17 15:07:42.442851 7fe11fc0b800 10 osd.69 92595 pgid
 36.10d coll 36.10d_head
  0 2015-07-17 15:07:42.443925 7fe11fc0b800 -1 osd/PG.cc: In
 function 'static epoch_t PG::peek_map_epoch(ObjectStore*, spg_t,
 ceph::bufferlist*)' thread 7fe11fc0b800 time 2015-07-17
 15:07:42.442902
 osd/PG.cc: 2839: FAILED assert(r > 0)


 Any clues?

 Cheers, Dan

 2015-07-17 14:40:54.493935 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  error (39) Directory not empty
 not handled on operation 0xedd0b88 (18879615.0.1, or op 1, counting
 from 0)
 2015-07-17 14:40:54.494019 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30) ENOTEMPTY suggests garbage data
 in osd data dir
 2015-07-17 14:40:54.494021 7f0ba60f4700  0
 filestore(/var/lib/ceph/osd/ceph-30)  transaction dump:
 {
ops: [
{
op_num: 0,
op_name: remove,
collection: 36.10d_head,
oid: 10d\/\/head\/\/36
},
{
op_num: 1,
op_name: rmcoll,
collection: 36.10d_head
}
]
 }

 2015-07-17 14:40:54.606399 7f0ba60f4700 -1 os/FileStore.cc: In
 function 'unsigned int
 FileStore::_do_transaction(ObjectStore::Transaction, uint64_t, int,
 ThreadPool::TPHandle*)' thread 7f0ba60f4700 time 2015-07-17
 14:40:54.502996
 os/FileStore.cc: 2757: FAILED assert(0 == unexpected error)

 ceph version 0.94.2 (5fb85614ca8f354284c713a2f9c610860720bbf3)
 1: (FileStore::_do_transaction(ObjectStore::Transaction&, unsigned
 long, int, ThreadPool::TPHandle*)+0xc16) [0x975a06]
 2: (FileStore::_do_transactions(std::list<ObjectStore::Transaction*,
 std::allocator<ObjectStore::Transaction*> >&, unsigned long,
 ThreadPool::TPHandle*)+0x64) [0x97d794]
 3: (FileStore::_do_op(FileStore::OpSequencer*,
 ThreadPool::TPHandle&)+0x2a0) [0x97da50]
 4: (ThreadPool::worker(ThreadPool::WorkThread*)+0x4e6) [0xaffdc6]
 5: (ThreadPool::WorkThread::entry()+0x10) [0xb01a10]
 6: /lib64/libpthread.so.0() [0x3fbec079d1]
 7: (clone()+0x6d) [0x3fbe8e88fd]

 On Wed, Jun 17, 2015 at 11:09 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 On Wed, Jun 17, 2015 at 10:52 AM, Gregory Farnum g...@gregs42.com wrote:
 On Wed, Jun 17, 2015 at 8:56 AM, Dan van der Ster d...@vanderster.com 
 wrote:
 Hi,

 After upgrading to 0.94.2 yesterday on our test cluster, we've had 3
 PGs go inconsistent.

 First, immediately after we updated the OSDs PG 34.10d went 
 inconsistent:

 2015-06-16 13:42:19.086170 osd.52 137.138.39.211:6806/926964 2 :
 cluster [ERR] 34.10d scrub stat mismatch, got 4/5 objects, 0/0 clones,
 0/0 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 136/136
 bytes,0/0 hit_set_archive bytes.

 Second, an hour later 55.10d went inconsistent:

 2015-06-16 14:27:58.336550 osd.303 128.142.23.56:6812/879385 10 :
 cluster [ERR] 55.10d deep-scrub stat mismatch, got 0/1 objects, 0/0
 clones, 0/1 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0 whiteouts, 0/0
 bytes,0/0 hit_set_archive bytes.

 Then last night 36.10d suffered the same fate:

 2015-06-16 23:05:17.857433 osd.30 188.184.18.39:6800/2260103 16 :
 cluster [ERR] 36.10d deep-scrub stat mismatch, got 5833/5834 objects,
 0/0 clones, 5758/5759 dirty, 0/0 omap, 0/0 hit_set_archive, 0/0
 whiteouts, 24126649216/24130843520 bytes,0/0 hit_set_archive bytes.


 In all cases, one 

Re: [ceph-users] how to recover from: 1 pgs down; 10 pgs incomplete; 10 pgs stuck inactive; 10 pgs stuck unclean

2015-07-22 Thread Jelle de Jong
On 15/07/15 10:55, Jelle de Jong wrote:
 On 13/07/15 15:40, Jelle de Jong wrote:
 I was testing a ceph cluster with osd_pool_default_size = 2, and while
 rebuilding the OSD on one ceph node a disk in another node started
 getting read errors and ceph kept taking that OSD down. Instead of
 executing ceph osd set nodown while the other node was rebuilding, I kept
 restarting the OSD for a while, and ceph took the OSD in for a few
 minutes and then took it back down.

 I then removed the bad OSD from the cluster and later added it back in
 with nodown flag set and a weight of zero, moving all the data away.
 Then removed the OSD again and added a new OSD with a new hard drive.

 However I ended up with the following cluster status and I can't seem to
 find how to get the cluster healthy again. I'm doing this as tests
 before taking this ceph configuration in further production.

 http://paste.debian.net/plain/281922

 If I lost data, my bad, but how could I figure out in what pool the data
 was lost and in what rbd volume (so what kvm guest lost data).
 
 Anybody that can help?
 
 Can I somehow reweight some OSD to resolve the problems? Or should I
 rebuild the whole cluster and loose all data?

# ceph pg 3.12 query
http://paste.debian.net/284812/

I used ceph pg force_create_pg x.xx on all the incomplete pgs and I
don’t have any stuck pgs any more but there are still incomplete ones.

# ceph health detail
http://paste.debian.net/284813/

How can I get the incomplete pgs active again?

Kind regards,

Jelle de Jong
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance dégradation after upgrade to hammer

2015-07-22 Thread Mark Nelson

Ok,

So good news that RADOS appears to be doing well.  I'd say next is to 
follow some of the recommendations here:


http://ceph.com/docs/master/radosgw/troubleshooting/

If you examine the objecter_requests and perfcounters during your 
cosbench write test, it might help explain where the requests are 
backing up.  Another thing to look for (as noted in the above URL) are 
HTTP errors in the apache logs (if relevant).
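
For example, roughly (a sketch; the admin socket name depends on the client id
radosgw runs as, and "client.radosgw.gateway" here is only an assumption):

ceph daemon /var/run/ceph/ceph-client.radosgw.gateway.asok objecter_requests
ceph daemon /var/run/ceph/ceph-client.radosgw.gateway.asok perf dump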


Other general thoughts:  When you upgraded to hammer did you change the 
RGW configuration at all?  Are you using civetweb now?  Does the 
rgw.buckets pool have enough PGs?



Mark

On 07/21/2015 08:17 PM, Florent MONTHEL wrote:

Hi Mark

I've something like 600 write IOPs on EC pool and 800 write IOPs on replicated 
3 pool with rados bench

With  Radosgw  I have 30/40 write IOPs with Cosbench (1 radosgw- the same with 
2) and servers are sleeping :
- 0.005 core for radosgw process
- 0.01 core for osd process

I don't know if we can have .rgw* pool locking or something like that with 
Hammer (or situation specific to me)

On 100% read profile, Radosgw and Ceph servers are working very well with more 
than 6000 IOPs on one radosgw server :
- 7 cores for radosgw process
- 1 core for each osd process
- 0,5 core for each Apache process

Thanks

Sent from my iPhone


On 14 juil. 2015, at 21:03, Mark Nelson mnel...@redhat.com wrote:

Hi Florent,

10x degradation is definitely unusual!  A couple of things to look at:

Are 8K rados bench writes to the rgw.buckets pool slow?  You can check with
something like:

rados -p rgw.buckets bench 30 write -t 256 -b 8192

You may also want to try targeting a specific RGW server to make sure the 
RR-DNS setup isn't interfering (at least while debugging).  It may also be 
worth creating a new replicated pool and try writes to that pool as well to see 
if you see much difference.

Mark


On 07/14/2015 07:17 PM, Florent MONTHEL wrote:
Yes of course thanks Mark

Infrastructure : 5 servers with 10 sata disks (50 osd at all) - 10gb connected 
- EC 2+1 on rgw.buckets pool - 2 radosgw RR-DNS like installed on 2 cluster 
servers
No SSD drives used

We're using Cosbench to send :
- 8k object size : 100% read with 256 workers : better results with Hammer
  - 8k object size : 80% read - 20% write with 256 workers : real degradation 
between Firefly and Hammer (divided by something like 10)
- 8k object size : 100% write with 256 workers : real degradation between 
Firefly and Hammer (divided by something like 10)

Thanks

Sent from my iPhone


On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote:

On 07/14/2015 06:42 PM, Florent MONTHEL wrote:
Hi All,

I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) to 
Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM prefork 
mode
I'm experiencing huge write performance degradation just after upgrade 
(Cosbench).

Do you already run performance tests between Hammer and Firefly ?

No problem with read performance that was amazing


Hi Florent,

Can you talk a little bit about how your write tests are setup?  How many 
concurrent IOs and what size?  Also, do you see similar problems with rados 
bench?

We have done some testing and haven't seen significant performance degradation 
except when switching to civetweb which appears to perform deletes more slowly 
than what we saw with apache+fcgi.

Mark




Sent from my iPhone

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph 0.94 (and lower) performance on 1 hosts ??

2015-07-22 Thread Gregory Farnum
We might also be able to help you improve or better understand your
results if you can tell us exactly what tests you're conducting that
are giving you these numbers.
-Greg

On Wed, Jul 22, 2015 at 4:44 AM, Florent MONTHEL fmont...@flox-arts.net wrote:
 Hi Frederic,

 When you have a Ceph cluster with 1 node you don't experience the network and
 communication overhead due to the distributed model.
 With 2 nodes and EC 4+1 you will have communication between the 2 nodes, but you
 will keep some internal communication (2 chunks on the first node and 3 chunks on
 the second node).
 On your configuration the EC pool is set up with 4+1, so for each write you will
 have overhead due to the write spreading on 5 nodes (for 1 customer IO, you
 will experience 5 Ceph IOs due to EC 4+1).
 It's the reason why I think you're reaching performance stability with
 5 nodes and more in your cluster


 On Jul 20, 2015, at 10:35 AM, SCHAER Frederic frederic.sch...@cea.fr
 wrote:

 Hi,

 As I explained in various previous threads, I’m having a hard time getting
 the most out of my test ceph cluster.
 I’m benching things with rados bench.
 All Ceph hosts are on the same 10GB switch.

 Basically, I know I can get about 1GB/s of disk write performance per host,
 when I bench things with dd (hundreds of dd threads) +iperf 10gbit
 inbound+iperf 10gbit outbound.
 I also can get 2GB/s or even more if I don’t bench the network at the same
 time, so yes, there is a bottleneck between disks and network, but I can’t
 identify which one, and it’s not relevant for what follows anyway
 (Dell R510 + MD1200 + PERC H700 + PERC H800 here, if anyone has hints about
 this strange bottleneck though…)

 My hosts each are connected though a single 10Gbits/s link for now.

 My problem is the following. Please note I see the same kind of poor
 performance with replicated pools...
 When testing EC pools, I ended putting a 4+1 pool on a single node in order
 to track down the ceph bottleneck.
 On that node, I can get approximately 420MB/s write performance using rados
 bench, but that’s fair enough since the dstat output shows that real data
 throughput on disks is about 800+MB/s (that’s the ceph journal effect, I
 presume).

 I tested Ceph on my other standalone nodes : I can also get around 420MB/s,
 since they’re identical.
 I’m testing things with 5 10Gbits/s clients, each running rados bench.

 But what I really don’t get is the following :

 -  With 1 host : throughput is 420MB/s
 -  With 2 hosts : I get 640MB/s. That’s surely not 2x420MB/s.
 -  With 5 hosts : I get around 1375MB/s . That’s far from the
 expected 2GB/s.

 The network never is maxed out, nor are the disks or CPUs.
 The hosts throughput I see with rados bench seems to match the dstat
 throughput.
 That’s as if each additional host was only capable of adding 220MB/s of
 throughput. Compare this to the 1GB/s they are capable of (420MB/s with
 journals)…

 I’m therefore wondering what could possibly be so wrong with my setup ??
 Why would it impact so much the performance to add hosts ?

 On the hardware side, I have Broadcam BCM57711 10-Gigabit PCIe cards.
 I know, not perfect, but not THAT bad neither… ?

 Any hint would be greatly appreciated !

 Thanks
 Frederic Schaer
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OSD crashes

2015-07-22 Thread Alex Gorbachev
We have been error free for almost 3 weeks now.  The following settings on
all OSD nodes were changed:

vm.swappiness=1
vm.min_free_kbytes=262144

My discussion on XFS list is here:
http://www.spinics.net/lists/xfs/msg33645.html
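
For reference, a sketch of making those settings persistent across reboots
(the file name is arbitrary):

cat > /etc/sysctl.d/90-ceph-osd.conf <<'EOF'
vm.swappiness = 1
vm.min_free_kbytes = 262144
EOF
sysctl --system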

Thanks,
Alex

On Fri, Jul 3, 2015 at 6:27 AM, Jan Schermer j...@schermer.cz wrote:

 What’s the value of /proc/sys/vm/min_free_kbytes on your system? Increase
 it to 256M (better do it if there’s lots of free memory) and see if it
 helps.
 It can also be set too high, hard to find any formula how to set it
 correctly...

 Jan


 On 03 Jul 2015, at 10:16, Alex Gorbachev a...@iss-integration.com wrote:

 Hello, we are experiencing severe OSD timeouts, OSDs are not taken out and
 we see the following in syslog on Ubuntu 14.04.2 with Firefly 0.80.9.

 Thank you for any advice.

 Alex


 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.261899] BUG: unable to
 handle kernel paging request at 0019001c
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.261923] IP:
 [8118e476] find_get_entries+0x66/0x160
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.261941] PGD 1035954067 PUD 0
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.261955] Oops:  [#1] SMP
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.261969] Modules linked in:
 xfs libcrc32c ipmi_ssif intel_rapl iosf_mbi x86_pkg_temp_thermal
 intel_powerclamp co
 retemp kvm crct10dif_pclmul crc32_pclmul ghash_clmulni_intel aesni_intel
 aes_x86_64 lrw gf128mul glue_helper ablk_helper cryptd sb_edac edac_core
 lpc_ich joy
 dev mei_me mei ioatdma wmi 8021q ipmi_si garp 8250_fintek mrp
 ipmi_msghandler stp llc bonding mac_hid lp parport mlx4_en vxlan
 ip6_udp_tunnel udp_tunnel hid_
 generic usbhid hid igb ahci mpt2sas mlx4_core i2c_algo_bit libahci dca
 raid_class ptp scsi_transport_sas pps_core arcmsr
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262182] CPU: 10 PID: 8711
 Comm: ceph-osd Not tainted 4.1.0-040100-generic #201506220235
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262197] Hardware name:
 Supermicro X9DRD-7LN4F(-JBOD)/X9DRD-EF/X9DRD-7LN4F, BIOS 3.0a 12/05/2013
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262215] task:
 8800721f1420 ti: 880fbad54000 task.ti: 880fbad54000
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262229] RIP:
 0010:[8118e476]  [8118e476] find_get_entries+0x66/0x160
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262248] RSP:
 0018:880fbad571a8  EFLAGS: 00010246
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262258] RAX:
 880004000158 RBX: 000e RCX: 
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262303] RDX:
 880004000158 RSI: 880fbad571c0 RDI: 0019
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262347] RBP:
 880fbad57208 R08: 00c0 R09: 00ff
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262391] R10:
  R11: 0220 R12: 00b6
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262435] R13:
 880fbad57268 R14: 000a R15: 880fbad572d8
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262479] FS:
  7f98cb0e0700() GS:88103f48() knlGS:
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262524] CS:  0010 DS: 
 ES:  CR0: 80050033
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262551] CR2:
 0019001c CR3: 001034f0e000 CR4: 000407e0
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262596] Stack:
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262618]  880fbad571f8
 880cf6076b30 880bdde05da8 00e6
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262669]  0100
 880cf6076b28 00b5 880fbad57258
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262721]  880fbad57258
 880fbad572d8  880cf6076b28
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262772] Call Trace:
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262801]
  [8119b482] pagevec_lookup_entries+0x22/0x30
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262831]
  [8119bd84] truncate_inode_pages_range+0xf4/0x700
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262862]
  [8119c415] truncate_inode_pages+0x15/0x20
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262891]
  [8119c53f] truncate_inode_pages_final+0x5f/0xa0
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262949]
  [c0431c2c] xfs_fs_evict_inode+0x3c/0xe0 [xfs]
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.262981]
  [81220558] evict+0xb8/0x190
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.263009]
  [81220671] dispose_list+0x41/0x50
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.263037]
  [8122176f] prune_icache_sb+0x4f/0x60
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.263067]
  [81208ab5] super_cache_scan+0x155/0x1a0
 Jul  3 03:42:06 roc-4r-sca020 kernel: [554036.263096]
  [8119d26f] do_shrink_slab+0x13f/0x2c0
 Jul  3 03:42:06 

Re: [ceph-users] CephFS vs RBD

2015-07-22 Thread Lincoln Bryant
Hi Hadi,

AFAIK, you can’t safely mount RBD as R/W on multiple machines. You could 
re-export the RBD as NFS, but that’ll introduce a bottleneck and probably tank 
your performance gains over CephFS.

For what it’s worth, some of our RBDs are mapped to multiple machines, mounted 
read-write on one and read-only on the others. We haven’t seen any strange 
effects from that, but I seem to recall it being ill advised. 

—Lincoln

 On Jul 22, 2015, at 2:05 PM, Hadi Montakhabi h...@cs.uh.edu wrote:
 
 Hello Cephers,
 
 I've been experimenting with CephFS and RBD for some time now.
 From what I have seen so far, RBD outperforms CephFS by far. However, there 
 is a catch!
 RBD could be mounted on one client at a time!
 Now, assuming that we have multiple clients running some MPI code (and doing 
 some distributed I/O), all these clients need to read/write from the same 
 location and sometimes even the same file.
 Is this at all possible by using RBD, and not CephFS?
 
 Thanks,
 Hadi
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] PGs going inconsistent after stopping the primary

2015-07-22 Thread Samuel Just
Annoying that we don't know what caused the replica's stat structure to get out 
of sync.  Let us know if you see it recur.  What were those pools used for?
-Sam

- Original Message -
From: Dan van der Ster d...@vanderster.com
To: Samuel Just sj...@redhat.com
Cc: ceph-users@lists.ceph.com
Sent: Wednesday, July 22, 2015 12:36:53 PM
Subject: Re: [ceph-users] PGs going inconsistent after stopping the primary

Cool, writing some objects to the affected PGs has stopped the
consistent/inconsistent cycle. I'll keep an eye on them but this seems
to have fixed the problem.
Thanks!!
Dan

On Wed, Jul 22, 2015 at 6:07 PM, Samuel Just sj...@redhat.com wrote:
 Looks like it's just a stat error.  The primary appears to have the correct 
 stats, but the replica for some reason doesn't (thinks there's an object for 
 some reason).  I bet it clears itself if you perform a write on the pg since
 the primary will send over its stats.  We'd need information from when the 
 stat error originally occurred to debug further.
 -Sam

 - Original Message -
 From: Dan van der Ster d...@vanderster.com
 To: ceph-users@lists.ceph.com
 Sent: Wednesday, July 22, 2015 7:49:00 AM
 Subject: [ceph-users] PGs going inconsistent after stopping the primary

 Hi Ceph community,

 Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

 We wanted to post here before the tracker to see if someone else has
 had this problem.

 We have a few PGs (different pools) which get marked inconsistent when
 we stop the primary OSD. The problem is strange because once we
 restart the primary, then scrub the PG, the PG is marked active+clean.
 But inevitably next time we stop the primary OSD, the same PG is
 marked inconsistent again.

 There is no user activity on this PG, and nothing interesting is
 logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
 mentioning the PG already says inactive+inconsistent).


 We suspect this is related to garbage files left in the PG folder. One
 of our PGs is acting basically like above, except it goes through this
 cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
 (repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
 active+clean+inconsistent. This one at least logs:

 2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
 2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
 mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
 hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
 2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

 and this should be debuggable because there is only one object in the pool:

 tapetest      55          0         0        73575G           1

 even though rados ls returns no objects:

 # rados ls -p tapetest
 #

 Any ideas?

 Cheers, Dan
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Performance dégradation after upgrade to hammer

2015-07-22 Thread Florent MONTHEL
Hi Mark

Yes enough PG and no error on Apache logs
We identified a bottleneck on the bucket index, with huge IOPS on one OSD (the
IOPS hit only 1 bucket)

With bucket sharding (32) configured, write IOPS are now 5x better (and after a
bucket delete/create). But we don't yet reach Firefly performance

RedHat case in progress. I will share later with community 

Sent from my iPhone

 On 22 juil. 2015, at 08:20, Mark Nelson mnel...@redhat.com wrote:
 
 Ok,
 
 So good news that RADOS appears to be doing well.  I'd say next is to follow 
 some of the recommendations here:
 
 http://ceph.com/docs/master/radosgw/troubleshooting/
 
 If you examine the objecter_requests and perfcounters during your cosbench 
 write test, it might help explain where the requests are backing up.  Another 
 thing to look for (as noted in the above URL) are HTTP errors in the apache 
 logs (if relevant).
 
 Other general thoughts:  When you upgraded to hammer did you change the RGW 
 configuration at all?  Are you using civetweb now?  Does the rgw.buckets pool 
 have enough PGs?
 
 
 Mark
 
 On 07/21/2015 08:17 PM, Florent MONTHEL wrote:
 Hi Mark
 
 I've something like 600 write IOPs on EC pool and 800 write IOPs on 
 replicated 3 pool with rados bench
 
 With  Radosgw  I have 30/40 write IOPs with Cosbench (1 radosgw- the same 
 with 2) and servers are sleeping :
 - 0.005 core for radosgw process
 - 0.01 core for osd process
 
 I don't know if we can have .rgw* pool locking or something like that with 
 Hammer (or situation specific to me)
 
 On 100% read profile, Radosgw and Ceph servers are working very well with 
 more than 6000 IOPs on one radosgw server :
 - 7 cores for radosgw process
 - 1 core for each osd process
 - 0,5 core for each Apache process
 
 Thanks
 
 Sent from my iPhone
 
 On 14 juil. 2015, at 21:03, Mark Nelson mnel...@redhat.com wrote:
 
 Hi Florent,
 
 10x degradation is definitely unusual!  A couple of things to look at:
 
 Are 8K rados bench writes to the rgw.buckets pool slow?  You can check with
 something like:
 
 rados -p rgw.buckets bench 30 write -t 256 -b 8192
 
 You may also want to try targeting a specific RGW server to make sure the 
 RR-DNS setup isn't interfering (at least while debugging).  It may also be 
 worth creating a new replicated pool and try writes to that pool as well to 
 see if you see much difference.
 
 Mark
 
 On 07/14/2015 07:17 PM, Florent MONTHEL wrote:
 Yes of course thanks Mark
 
 Infrastructure : 5 servers with 10 sata disks (50 osd at all) - 10gb 
 connected - EC 2+1 on rgw.buckets pool - 2 radosgw RR-DNS like installed 
 on 2 cluster servers
 No SSD drives used
 
 We're using Cosbench to send :
 - 8k object size : 100% read with 256 workers : better results with Hammer
  - 8k object size : 80% read - 20% write with 256 workers : real 
 degradation between Firefly and Hammer (divided by something like 10)
 - 8k object size : 100% write with 256 workers : real degradation between 
 Firefly and Hammer (divided by something like 10)
 
 Thanks
 
 Sent from my iPhone
 
 On 14 juil. 2015, at 19:57, Mark Nelson mnel...@redhat.com wrote:
 
 On 07/14/2015 06:42 PM, Florent MONTHEL wrote:
 Hi All,
 
 I've just upgraded Ceph cluster from Firefly 0.80.8 (Redhat Ceph 1.2.3) 
 to Hammer (Redhat Ceph 1.3) - Usage : radosgw with Apache 2.4.19 on MPM 
 prefork mode
 I'm experiencing huge write performance degradation just after upgrade 
 (Cosbench).
 
 Do you already run performance tests between Hammer and Firefly ?
 
 No problem with read performance that was amazing
 
 Hi Florent,
 
 Can you talk a little bit about how your write tests are setup?  How many 
 concurrent IOs and what size?  Also, do you see similar problems with 
 rados bench?
 
 We have done some testing and haven't seen significant performance 
 degradation except when switching to civetweb which appears to perform 
 deletes more slowly than what we saw with apache+fcgi.
 
 Mark
 
 
 
 Sent from my iPhone
 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] load-gen throughput numbers

2015-07-22 Thread Deneau, Tom
If I run rados load-gen with the following parameters:
   --num-objects 50
   --max-ops 16
   --min-object-size 4M
   --max-object-size 4M
   --min-op-len 4M
   --max-op-len 4M
   --percent 100
   --target-throughput 2000

So every object is 4M in size and all the ops are reads of the entire 4M.
I would assume this is equivalent to running rados bench rand on that pool
if the pool has been previously filled with 50 4M objects.  And I am assuming
the --max-ops=16 is equivalent to having 16 concurrent threads in rados bench.
And I have set the target throughput higher than is possible with my network.

But when I run both rados load-gen and rados bench as described, I see that 
rados bench gets
about twice the throughput of rados load-gen.  Why would that be?

I see there is a --max-backlog parameter, is there some setting of that 
parameter
that would help the throughput?

-- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] load-gen throughput numbers

2015-07-22 Thread Deneau, Tom
Ah, I see that --max-backlog must be expressed in bytes/sec,
in spite of what the --help message says.
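
So the invocation becomes something like the following (a sketch: 2000 MB/s
expressed in bytes, and the pool name is a placeholder):

rados -p <pool> load-gen --num-objects 50 --max-ops 16 \
      --min-object-size 4M --max-object-size 4M \
      --min-op-len 4M --max-op-len 4M --percent 100 \
      --target-throughput 2000 --max-backlog 2097152000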

-- Tom

 -Original Message-
 From: Deneau, Tom
 Sent: Wednesday, July 22, 2015 5:09 PM
 To: 'ceph-users@lists.ceph.com'
 Subject: load-gen throughput numbers
 
 If I run rados load-gen with the following parameters:
--num-objects 50
--max-ops 16
--min-object-size 4M
--max-object-size 4M
--min-op-len 4M
--max-op-len 4M
--percent 100
--target-throughput 2000
 
 So every object is 4M in size and all the ops are reads of the entire 4M.
 I would assume this is equivalent to running rados bench rand on that pool
 if the pool has been previously filled with 50 4M objects.  And I am assuming
 the --max-ops=16 is equivalent to having 16 concurrent threads in rados
 bench.
 And I have set the target throughput higher than is possible with my network.
 
 But when I run both rados load-gen and rados bench as described, I see that
 rados bench gets
 about twice the throughput of rados load-gen.  Why would that be?
 
 I see there is a --max-backlog parameter, is there some setting of that
 parameter
 that would help the throughput?
 
 -- Tom Deneau
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Ceph KeyValueStore configuration settings

2015-07-22 Thread Sai Srinath Sundar-SSI
Hi,
I'm rather new to ceph and I was trying to launch a test cluster with the 
Hammer release with the default OSD backend as KeyValueStore instead of 
FileStore. I am deploying my cluster using ceph-deploy. Can someone who has 
already done this please share the changes they have made for this? I am not 
able to see any documentation on the same. I apologize if this question has 
been asked previously.
Thanks
Sai
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph KeyValueStore configuration settings

2015-07-22 Thread Haomai Wang
I guess you only need to add "osd objectstore = keyvaluestore" and
"enable experimental unrecoverable data corrupting features =
keyvaluestore".

And you need to know that keyvaluestore is an experimental backend now; it's
not recommended to deploy it in a production env!
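
In ceph.conf terms that would be something like the following (a sketch; it
needs to be in place, e.g. in the conf that ceph-deploy pushes out, before the
OSDs are created):

[osd]
osd objectstore = keyvaluestore
enable experimental unrecoverable data corrupting features = keyvaluestore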

On Thu, Jul 23, 2015 at 7:13 AM, Sai Srinath Sundar-SSI
sai.srin...@ssi.samsung.com wrote:
 Hi,

 I’m rather new to ceph and I was trying to launch a test cluster with the
 Hammer release with the default OSD backend as KeyValueStore instead of
 FileStore. I am deploying my cluster using ceph-deploy. Can someone who has
 already done this please share the changes they have made for this? I am not
 able to see any documentation on the same. I apologize if this question has
 been asked previously.

 Thanks

 Sai


 ___
 ceph-users mailing list
 ceph-users@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] PGs going inconsistent after stopping the primary

2015-07-22 Thread Dan van der Ster
Hi Ceph community,

Env: hammer 0.94.2, Scientific Linux 6.6, kernel 2.6.32-431.5.1.el6.x86_64

We wanted to post here before the tracker to see if someone else has
had this problem.

We have a few PGs (different pools) which get marked inconsistent when
we stop the primary OSD. The problem is strange because once we
restart the primary, then scrub the PG, the PG is marked active+clean.
But inevitably next time we stop the primary OSD, the same PG is
marked inconsistent again.

There is no user activity on this PG, and nothing interesting is
logged in any of the 2nd/3rd OSDs (with debug_osd=20, the first line
mentioning the PG already says inactive+inconsistent).


We suspect this is related to garbage files left in the PG folder. One
of our PGs is acting basically like above, except it goes through this
cycle: active+clean -> (deep-scrub) -> active+clean+inconsistent ->
(repair) -> active+clean -> (restart primary OSD) -> (deep-scrub) ->
active+clean+inconsistent. This one at least logs:

2015-07-22 16:42:41.821326 osd.303 [INF] 55.10d deep-scrub starts
2015-07-22 16:42:41.823834 osd.303 [ERR] 55.10d deep-scrub stat
mismatch, got 0/1 objects, 0/0 clones, 0/1 dirty, 0/0 omap, 0/0
hit_set_archive, 0/0 whiteouts, 0/0 bytes,0/0 hit_set_archive bytes.
2015-07-22 16:42:41.823842 osd.303 [ERR] 55.10d deep-scrub 1 errors

and this should be debuggable because there is only one object in the pool:

tapetest      55          0         0        73575G           1

even though rados ls returns no objects:

# rados ls -p tapetest
#

Any ideas?

Cheers, Dan
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] client io doing unrequested reads

2015-07-22 Thread Hadi Montakhabi
1. Is the layout default, apart from the change to object_size?
It is default. The only change I make is object_size and stripe_unit. I set
both to the same value (i.e. stripe_count is 1 in all cases); see the sketch
after point 4 below.

2. What version are the client and server?
ceph version 0.94.1

3. Not really... are you using the fuse client?  Enabling debug objecter =
10 on the client will give you a log that says what writes the client is
doing.
I am using the kernel module. Does this work with the kernel module? How
can I set it up?

4. This is probably a client issue, so I would expect killing the client to
get you out of it.
You are absolutely right. It goes away when I reboot the client node.
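
(Referring back to point 1: a sketch of setting such a layout through the
CephFS virtual xattrs, with a placeholder directory; new files created
underneath inherit the directory layout.)

setfattr -n ceph.dir.layout.object_size -v 33554432 /mnt/cephfs/bench
setfattr -n ceph.dir.layout.stripe_unit -v 33554432 /mnt/cephfs/bench
getfattr -n ceph.dir.layout /mnt/cephfs/bench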

Thanks,
Hadi


On Tue, Jul 21, 2015 at 4:57 PM, John Spray john.sp...@redhat.com wrote:



 On 21/07/15 21:54, Hadi Montakhabi wrote:

  Hello Cephers,

  I am using CephFS, and running some benchmarks using fio.
 After increasing the object_size to 33554432, when I try to run some read
 and write tests with different block sizes, when I get to block size of 64m
 and beyond, Ceph does not finish the operation (I tried letting it run for
 more than a day at least three times).
 However, when I cancel the job and I expect to see no io  operations, here
 is what I get:


 Is the layout default, apart from the change to object_size?

 What version are the client and server?


  [cephuser@node01 ~]$ ceph -s
 cluster b7beebf6-ea9f-4560-a916-a58e106c6e8e
  health HEALTH_OK
  monmap e3: 3 mons at {node02=
 192.168.17.212:6789/0,node03=192.168.17.213:6789/0,node04=192.168.17.214:6789/0
 }
 election epoch 8, quorum 0,1,2 node02,node03,node04
  mdsmap e74: 1/1/1 up {0=node02=up:active}
  osdmap e324: 14 osds: 14 up, 14 in
   pgmap v155699: 768 pgs, 3 pools, 15285 MB data, 1772 objects
 91283 MB used, 7700 GB / 7817 GB avail
  768 active+clean
  client io 2911 MB/s rd, 90 op/s


  If I do ceph -w, it shows me that it is constantly doing reads, but I
 have no idea from where and when it would stop?
 I had to remove my CephFS file system and the associated pools and start
 things from scratch.

  1. Any idea what is happening?


 Not really... are you using the fuse client?  Enabling debug objecter =
 10 on the client will give you a log that says what writes the client is
 doing.



   2. When this happens, do you know a better way to get out of the
 situation without destroying the filesystem and the pools?


 This is probably a client issue, so I would expect killing the client to
 get you out of it.

 Cheers,
 John

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] osd_agent_max_ops relating to number of OSDs in the cache pool

2015-07-22 Thread Gregory Farnum
On Sat, Jul 18, 2015 at 10:25 PM, Nick Fisk n...@fisk.me.uk wrote:
 Hi All,

 I’m doing some testing on the new High/Low speed cache tiering flushing and 
 I’m trying to get my head round the effect that changing these 2 settings 
 have on the flushing speed.  When setting the osd_agent_max_ops to 1, I can 
 get up to 20% improvement before the osd_agent_max_high_ops value kicks in 
 for high speed flushing. Which is great for bursty workloads.

 As I understand it, these settings loosely affect the number of concurrent
 operations the cache pool OSDs will flush down to the base pool.

 I may have got completely the wrong idea in my head but I can’t understand 
 how a static default setting will work with different cache/base ratios. For 
 example if I had a relatively small number of very fast cache tier OSD’s 
 (PCI-E SSD perhaps) and a much larger number of base tier OSD’s, would the 
 value need to be increased to ensure sufficient utilisation of the base tier 
 and make sure that the cache tier doesn’t fill up too fast?

 Alternatively where the cache tier is based on spinning disks or where the 
 base tier is not as comparatively large, this value may need to be reduced to 
 stop it saturating the disks.

 Any Thoughts?

I'm not terribly familiar with these exact values, but I think you've
got it right. We can't make decisions at the level of the entire cache
pool (because sharing that information isn't feasible), so we let you
specify it on a per-OSD basis according to what setup you have.

I've no idea if anybody has gathered up a matrix of baseline good
settings or not.
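
For experimentation, the values can at least be adjusted at runtime per OSD (a
sketch; the numbers are purely illustrative):

ceph tell osd.\* injectargs '--osd_agent_max_ops 2 --osd_agent_max_high_ops 8'
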
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com