Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K IOPS

2014-09-10 Thread Alexandre DERUMIER
Hi Sebastien,

here are my first results with the crucial m550 (I'll send results with the intel s3500 
later):

- 3 nodes
- dell r620 without expander backplane
- sas controller : LSI 9207 (no hardware raid or cache)
- 2 x E5-2603v2 1.8GHz (4 cores)
- 32GB ram
- network : 2x gigabit lacp + 2x gigabit lacp for cluster replication
- os : debian wheezy, with kernel 3.10

os + ceph mon : 2x intel s3500 100GB, linux soft raid
osd : crucial m550 (1TB).


3 mons in the ceph cluster,
and 1 osd (journal and data on the same disk)


ceph.conf 
-
  debug_lockdep = 0/0
  debug_context = 0/0
  debug_crush = 0/0
  debug_buffer = 0/0
  debug_timer = 0/0
  debug_filer = 0/0
  debug_objecter = 0/0
  debug_rados = 0/0
  debug_rbd = 0/0
  debug_journaler = 0/0
  debug_objectcacher = 0/0
  debug_client = 0/0
  debug_osd = 0/0
  debug_optracker = 0/0
  debug_objclass = 0/0
  debug_filestore = 0/0
  debug_journal = 0/0
  debug_ms = 0/0
  debug_monc = 0/0
  debug_tp = 0/0
  debug_auth = 0/0
  debug_finisher = 0/0
  debug_heartbeatmap = 0/0
  debug_perfcounter = 0/0
  debug_asok = 0/0
  debug_throttle = 0/0
  debug_mon = 0/0
  debug_paxos = 0/0
  debug_rgw = 0/0
  osd_op_threads = 5
  filestore_op_threads = 4

 ms_nocrc = true
 cephx sign messages = false
 cephx require signatures = false

 ms_dispatch_throttle_bytes = 0

 #0.85
 throttler_perf_counter = false
 filestore_fd_cache_size = 64
 filestore_fd_cache_shards = 32
 osd_op_num_threads_per_shard = 1
 osd_op_num_shards = 25
 osd_enable_op_tracker = true



Fio disk 4K benchmark
--
rand read 4k : fio --filename=/dev/sdb --direct=1 --rw=randread --bs=4k 
--iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
bw=271755KB/s, iops=67938 

rand write 4k : fio --filename=/dev/sdb --direct=1 --rw=randwrite --bs=4k 
--iodepth=32 --group_reporting --invalidate=0 --name=abc --ioengine=aio
bw=228293KB/s, iops=57073



fio osd benchmark (through librbd)
--
[global]
ioengine=rbd
clientname=admin
pool=test
rbdname=test
invalidate=0    # mandatory
rw=randwrite
# rw=randread was used for the read runs (only one rw= line is active per run)
bs=4k
direct=1
numjobs=4
group_reporting=1

[rbd_iodepth32]
iodepth=32



FIREFLY RESULTS

fio randwrite : bw=5009.6KB/s, iops=1252

fio randread: bw=37820KB/s, iops=9455



0.85 RESULTS


fio randwrite : bw=11658KB/s, iops=2914

fio randread : bw=38642KB/s, iops=9660



0.85 + osd_enable_op_tracker=false
---
fio randwrite : bw=11630KB/s, iops=2907
fio randread : bw=80606KB/s, iops=20151,   (cpu 100% - GREAT !)
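
For anyone reproducing this, a minimal sketch of how the op tracker can be toggled 
at runtime, assuming the injectargs interface is available on your version (setting 
osd_enable_op_tracker = false in the [osd] section of ceph.conf and restarting the 
osd works too):

ceph tell osd.* injectargs '--osd_enable_op_tracker false'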



So, for reads, it seems that osd_enable_op_tracker is the bottleneck.


Now for writes, I really don't understand why they are so low.


I have done some iostat:


FIO directly on /dev/sdb
bw=228293KB/s, iops=57073

Device:         rrqm/s   wrqm/s    r/s        w/s    rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00     0.00   0.00   63613.00     0.00  254452.00     8.00    31.24    0.49    0.00    0.49   0.02 100.00


FIO directly on osd through librbd
bw=11658KB/s, iops=2914

Device:         rrqm/s   wrqm/s    r/s        w/s    rkB/s      wkB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdb               0.00   355.00   0.00    5225.00     0.00   29678.00    11.36    57.63   11.03    0.00   11.03   0.19  99.70


(I don't understand exactly what %util means here: it is 100% in both cases, yet
ceph is 10x slower.)

It could be a dsync problem; the results below look pretty poor:

# dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=direct
65536+0 records in
65536+0 records out
268435456 bytes (268 MB) copied, 2.77433 s, 96.8 MB/s


# dd if=rand.file of=/dev/sdb bs=4k count=65536 oflag=dsync,direct
^C17228+0 records in
17228+0 records out
70565888 bytes (71 MB) copied, 70.4098 s, 1.0 MB/s
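
For comparison, a minimal fio sketch of the same O_SYNC 4k write pattern, the kind
of test often used to qualify journal SSDs (it writes to the raw device, so it is
destructive, and the job name is arbitrary):

fio --name=sync-write-test --filename=/dev/sdb --direct=1 --sync=1 --rw=write \
    --bs=4k --numjobs=1 --iodepth=1 --runtime=60 --time_based --group_reporting

A drive that holds up under this pattern is usually a much better journal candidate
than one that collapses to ~1 MB/s as above.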



I'll do tests with intel s3500 tomorrow to compare

- Original Message - 

From: "Sebastien Han"  
To: "Warren Wang"  
Cc: ceph-users@lists.ceph.com 
Sent: Monday, 8 September 2014 22:58:25 
Subject: Re: [ceph-users] [Single OSD performance on SSD] Can't go over 3, 2K 
IOPS 

They definitely are Warren! 

Thanks for bringing this here :). 

On 05 Sep 2014, at 23:02, Wang, Warren  wrote: 

> +1 to what Cedric said. 
> 
> Anything more than a few minutes of heavy sustained writes tended to get our 
> solid state devices into a state where garbage collection could not keep up. 
> Originally we used small SSDs and did not overprovision the journals by much. 
> Manufacturers publish their SSD stats, and then in very small font, state 
> that the attained IOPS are with empty drives, and the tests are only run for 
> ver

[ceph-users] upload data using swift API

2014-09-10 Thread pragya jain
Hi all!

From the Swift perspective, the following steps are performed to upload data:

* the user must have an account in the Swift cluster
* the user creates a container within the account to store his data
* the user stores his data as an object in the container within the account
* the object information is updated in the container database, and the container 
information is updated in the account database
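
Against radosgw, that client-side flow looks roughly like the sketch below
(hypothetical endpoint, user and key; any Swift-compatible client would do):

swift -A http://radosgw.example.com/auth/1.0 -U testuser:swift -K 'SECRETKEY' post mycontainer
swift -A http://radosgw.example.com/auth/1.0 -U testuser:swift -K 'SECRETKEY' upload mycontainer myfile.txt
swift -A http://radosgw.example.com/auth/1.0 -U testuser:swift -K 'SECRETKEY' stat mycontainer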

From the Ceph documentation, I am unable to understand this flow.

#1. How is the object and container information updated in the container and 
account databases, respectively?
#2. Where is all the account and container information stored? If there are 
different pools to store this information, how is it propagated from object to 
container and from container to account?

Please elaborate on these concepts so that I can be clearer about them.
Please reply

Regards 
Pragya Jain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread 廖建锋
The current ceph cluster was compiled by hand, and for now I have disabled scrub 
and deep-scrub until your new dev version is released.
I hope the new version will be able to scrub all the data that has already shown errors.
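
For reference, a minimal sketch of the cluster-wide flags used to pause scrubbing
(and how to undo it once the fixed version is in place):

ceph osd set noscrub
ceph osd set nodeep-scrub
# later, to re-enable:
ceph osd unset noscrub
ceph osd unset nodeep-scrub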


From: Haomai Wang
Date: 2014-09-11 12:00
To: 廖建锋
CC: ceph-users
Subject: Re: Re: [ceph-users] Why so much inconsistent error in 0.85?
No, you need to wait the next develop release, or you can compile it by hand.


On Thu, Sep 11, 2014 at 10:35 AM, 廖建锋 <de...@f-club.cn> wrote:
haomai wang,
i already use 0.85 which is the latest version of CEPH,   is there any 
new version than 0.85?

From: Haomai Wang
Date: 2014-09-11 10:02
To: 廖建锋
CC: ceph-users
Subject: Re: [ceph-users] Why so much inconsistent error in 0.85?
Please check (http://tracker.ceph.com/issues/8589). KeyValueStore is a 
experiment backend. We still make it stable now.
You can checkout master branch to fix it.


On Thu, Sep 11, 2014 at 8:35 AM, 廖建锋 <de...@f-club.cn> wrote:
dear,
 Is this another big bug of CEPH?


[inline screenshot attachment omitted]


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best Regards,

Wheat



--

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] monitoring tool for ceph which monitor end-user level usage

2014-09-10 Thread pragya jain
hi all!

Is there any monitoring tool for Ceph which monitors end-user-level usage and 
data transfer for the Ceph object storage service?

Please help me with any type of information related to it.

Please reply.
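
(For the radosgw case, the built-in usage log may be relevant; a minimal sketch
with a hypothetical user id, assuming "rgw enable usage log = true" is set in the
radosgw section of ceph.conf:

radosgw-admin usage show --uid=johndoe --start-date=2014-09-01 --end-date=2014-09-10
radosgw-admin usage show --show-log-entries=false

The second form prints per-user summary totals only.)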

Regards
Pragya Jain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] different storage disks as a single storage

2014-09-10 Thread pragya jain
hi all!

I have a very low-level query. Please help clarify it.

To store data on a storage cluster, at the bottom there is a heterogeneous set 
of storage disks: there can be a variety of devices, such as SSDs, HDDs, flash 
drives, tapes, and any other type as well. The documentation says that the 
provider views this heterogeneous set of storage disks as a single pool. My 
questions are:

#1. How does the provider arrive at this abstraction layer that presents 
different storage disks as a single storage pool?
#2. Is it a part of storage virtualization?
#3. Is there a set of APIs to interact with this heterogeneous set of storage 
disks?

Please somebody reply.
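
(In Ceph this aggregation is done by the CRUSH map: OSDs of any type are placed in
a hierarchy, and each pool picks a CRUSH rule that decides which part of the
hierarchy it uses. A minimal sketch with hypothetical bucket, rule and pool names,
separating SSD and HDD devices:

ceph osd crush add-bucket ssd-root root
ceph osd crush add-bucket hdd-root root
ceph osd crush move ssd-host1 root=ssd-root
ceph osd crush move hdd-host1 root=hdd-root
ceph osd crush rule create-simple ssd-rule ssd-root host
ceph osd crush rule create-simple hdd-rule hdd-root host
ceph osd pool set fastpool crush_ruleset 1      # rule id of ssd-rule

Clients only talk to pools through librados/RBD/RGW/CephFS, so the mix of devices
underneath is not visible to them.)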

Regards
Pragya Jain
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] error while installing ceph in cluster node

2014-09-10 Thread Subhadip Bagui
Hi,

I'm getting the below error while installing Ceph on a node using
ceph-deploy. I'm executing the command on the admin node as:

[root@ceph-admin ~]$ ceph-deploy install ceph-mds

[ceph-mds][DEBUG ] Loaded plugins: fastestmirror, security
[ceph-mds][WARNIN] You need to be root to perform this command.
[ceph-mds][ERROR ] RuntimeError: command returned non-zero exit status: 1
[ceph_deploy][ERROR ] RuntimeError: Failed to execute command: yum -y
install wget

I have changed the Defaults requiretty setting to Defaults:ceph !requiretty
in the /etc/sudoers file, and also gave the ceph user the same sudo rights as
root on node ceph-mds. I added root privileges on node ceph-mds using the
command: echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers; sudo
chmod 0440 /etc/sudoers, as mentioned in the doc.
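
For comparison, the preflight documentation puts those rights in a drop-in file
rather than writing to /etc/sudoers itself; a minimal sketch, run on the target
node ceph-mds:

echo "ceph ALL = (root) NOPASSWD:ALL" | sudo tee /etc/sudoers.d/ceph
sudo chmod 0440 /etc/sudoers.d/ceph

(Piping tee straight into /etc/sudoers replaces the whole file and can break sudo
entirely, so the sudoers.d variant is the safer form of the same idea.)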

All servers are on centOS 6.5

Please let me know what the issue could be here.


Regards,
Subhadip
---
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-10 Thread Gregory Farnum
On Wed, Sep 10, 2014 at 8:29 PM, yuelongguang  wrote:
>
>
>
>
> as for ack and ondisk, ceph has size and min_size to decide there are  how
> many replications.
> if client receive ack or ondisk, which means there are at least min_size
> osds  have  done the ops?
>
> i am reading the cource code, could you help me with the two questions.
>
> 1.
>  on osd, where is the code that reply ops  separately  according to ack or
> ondisk.
>  i check the code, but i thought they always are replied together.

It depends on what journaling mode you're in, but generally they're
triggered separately (unless it goes on disk first, in which case it
will skip the ack — this is the mode it uses for non-btrfs
filesystems). The places where it actually replies are pretty clear
about doing one or the other, though...

>
> 2.
>  now i just know how client write ops to primary osd, inside osd cluster,
> how it promises min_size copy are reached.
> i mean  when primary osd receives ops , how it spreads ops to others, and
> how it processes other's reply.

That's not how it works. The primary for a PG will not go "active"
with it until it has at least min_size copies that it knows about.
Once the OSD is doing any processing of the PG, it requires all
participating members to respond before it sends any messages back to
the client.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com

>
>
> greg, thanks very much
>
>
>
>
>
> On 2014-09-11 01:36:39, "Gregory Farnum"  wrote:
>
> The important bit there is actually near the end of the message output line,
> where the first says "ack" and the second says "ondisk".
>
> I assume you're using btrfs; the ack is returned after the write is applied
> in-memory and readable by clients. The ondisk (commit) message is returned
> after it's durable to the journal or the backing filesystem.
> -Greg
>
> On Wednesday, September 10, 2014, yuelongguang  wrote:
>>
>> hi,all
>> i recently debug ceph rbd, the log tells that  one write to osd can get
>> two if its reply.
>> the difference between them is seq.
>> why?
>>
>> thanks
>> ---log-
>> reader got message 6 0x7f58900010a0 osd_op_reply(15
>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
>> write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
>> 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue
>> 0x7f58900010a0 prio 127
>> 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader reading tag...
>> 2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader got MSG
>> 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
>> 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
>> 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader got front 247
>> 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).aborted = 0
>> 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader got 247 + 0 + 0 byte message
>> 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq #
>> = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
>> 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
>> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
>> c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15
>> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
>> write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
>>
>>
>
>
> --
> Software Engineer #42 @ http://inktank.com | http://ceph.com
>
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



Re: [ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread Haomai Wang
No, you need to wait the next develop release, or you can compile it by
hand.


On Thu, Sep 11, 2014 at 10:35 AM, 廖建锋  wrote:

>  haomai wang,
> i already use 0.85 which is the latest version of CEPH,   is
> there any new version than 0.85?
>
>
>  *From:* Haomai Wang 
> *Date:* 2014-09-11 10:02
> *To:* 廖建锋 
> *CC:* ceph-users 
> *Subject:* Re: [ceph-users] Why so much inconsistent error in 0.85?
>   Please check (http://tracker.ceph.com/issues/8589). KeyValueStore is a
> experiment backend. We still make it stable now.
> You can checkout master branch to fix it.
>
>
> On Thu, Sep 11, 2014 at 8:35 AM, 廖建锋  wrote:
>
>>  dear,
>>  Is this another big bug of CEPH?
>>
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
>  --
>
> Best Regards,
>
> Wheat
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Sage Weil
I had two substantive comments on the first patch and then some trivial 
whitespace nits. Otherwise looks good!

thanks-
sage

On Thu, 11 Sep 2014, Somnath Roy wrote:

> Sam/Sage,
> I have incorporated all of your comments. Please have a look at the same pull 
> request.
> 
> https://github.com/ceph/ceph/pull/2440
> 
> Thanks & Regards
> Somnath
> 
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com] 
> Sent: Wednesday, September 10, 2014 3:25 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
> 
> Oh, I changed my mind, your approach is fine.  I was unclear.
> Currently, I just need you to address the other comments.
> -Sam
> 
> On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
> > As I understand, you want me to implement the following.
> >
> > 1.  Keep this implementation one sharded optracker for the ios going 
> > through ms_dispatch path.
> >
> > 2. Additionally, for ios going through ms_fast_dispatch, you want me 
> > to implement optracker (without internal shard) per opwq shard
> >
> > Am I right ?
> >
> > Thanks & Regards
> > Somnath
> >
> > -Original Message-
> > From: Samuel Just [mailto:sam.j...@inktank.com]
> > Sent: Wednesday, September 10, 2014 3:08 PM
> > To: Somnath Roy
> > Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> > ceph-users@lists.ceph.com
> > Subject: Re: OpTracker optimization
> >
> > I don't quite understand.
> > -Sam
> >
> > On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  
> > wrote:
> >> Thanks Sam.
> >> So, you want me to go with optracker/shadedopWq , right ?
> >>
> >> Regards
> >> Somnath
> >>
> >> -Original Message-
> >> From: Samuel Just [mailto:sam.j...@inktank.com]
> >> Sent: Wednesday, September 10, 2014 2:36 PM
> >> To: Somnath Roy
> >> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> >> ceph-users@lists.ceph.com
> >> Subject: Re: OpTracker optimization
> >>
> >> Responded with cosmetic nonsense.  Once you've got that and the other 
> >> comments addressed, I can put it in wip-sam-testing.
> >> -Sam
> >>
> >> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  
> >> wrote:
> >>> Thanks Sam..I responded back :-)
> >>>
> >>> -Original Message-
> >>> From: ceph-devel-ow...@vger.kernel.org 
> >>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> >>> Sent: Wednesday, September 10, 2014 11:17 AM
> >>> To: Somnath Roy
> >>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> >>> ceph-users@lists.ceph.com
> >>> Subject: Re: OpTracker optimization
> >>>
> >>> Added a comment about the approach.
> >>> -Sam
> >>>
> >>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  
> >>> wrote:
>  Hi Sam/Sage,
> 
>  As we discussed earlier, enabling the present OpTracker code 
>  degrading performance severely. For example, in my setup a single 
>  OSD node with
>  10 clients is reaching ~103K read iops with io served from memory 
>  while optracking is disabled but enabling optracker it is reduced to 
>  ~39K iops.
>  Probably, running OSD without enabling OpTracker is not an option 
>  for many of Ceph users.
> 
>  Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
>  ops_in_flight) and removing some other bottlenecks I am able to 
>  match the performance of OpTracking enabled OSD with OpTracking 
>  disabled, but with the expense of ~1 extra cpu core.
> 
>  In this process I have also fixed the following tracker.
> 
> 
> 
>  http://tracker.ceph.com/issues/9384
> 
> 
> 
>  and probably http://tracker.ceph.com/issues/8885 too.
> 
> 
> 
>  I have created following pull request for the same. Please review it.
> 
> 
> 
>  https://github.com/ceph/ceph/pull/2440
> 
> 
> 
>  Thanks & Regards
> 
>  Somnath
> 
> 
> 
> 
>  
> 
> 
> >>> --
> >>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
> >>> in the body of a message to majord...@vger.kernel.org More majordomo 
> >>> info at  http://vger.kernel.org/majordomo-info.html
> >>>
> >>> ___

Re: [ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-10 Thread yuelongguang
 

 

As for ack and ondisk: ceph has size and min_size to decide how many replicas 
there are.
If the client receives ack or ondisk, does that mean at least min_size OSDs 
have completed the ops?
 
I am reading the source code; could you help me with two questions?
 
1.
 On the OSD, where is the code that replies to ops separately for ack and 
ondisk?
 I checked the code, but I thought they were always replied together.
 
2.
 So far I only know how the client writes ops to the primary OSD. Inside the OSD 
cluster, how does it guarantee that min_size copies are reached?
I mean, when the primary OSD receives ops, how does it spread them to the 
others, and how does it process their replies?
 
 
Greg, thanks very much 







On 2014-09-11 01:36:39, "Gregory Farnum"  wrote:
The important bit there is actually near the end of the message output line, 
where the first says "ack" and the second says "ondisk".


I assume you're using btrfs; the ack is returned after the write is applied 
in-memory and readable by clients. The ondisk (commit) message is returned 
after it's durable to the journal or the backing filesystem.
-Greg

On Wednesday, September 10, 2014, yuelongguang  wrote:

Hi all,
I was recently debugging ceph rbd, and the log shows that one write to an osd 
can get two replies.
The difference between them is the seq.
Why?
 
thanks
---log-
reader got message 6 0x7f58900010a0 osd_op_reply(15 
rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 
write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 
0x7f58900010a0 prio 127
2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader reading tag...
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got MSG
2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got front 247
2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).aborted = 0
2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got 247 + 0 + 0 byte message
2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 
front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 
rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 
write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6






--
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread 廖建锋
Haomai Wang,
I already use 0.85, which is the latest version of Ceph. Is there any 
newer version than 0.85?

From: Haomai Wang
Date: 2014-09-11 10:02
To: 廖建锋
CC: ceph-users
Subject: Re: [ceph-users] Why so much inconsistent error in 0.85?
Please check (http://tracker.ceph.com/issues/8589). KeyValueStore is a 
experiment backend. We still make it stable now.
You can checkout master branch to fix it.


On Thu, Sep 11, 2014 at 8:35 AM, 廖建锋 <de...@f-club.cn> wrote:
dear,
 Is this another big bug of CEPH?


[inline screenshot attachment omitted]


___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




--

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread Christian Balzer

Hello,

On Thu, 11 Sep 2014 00:35:15 + 廖建锋 wrote:

First and foremost, screenshot images really have no place in a mailing
list.
Never mind the wasted space and that what you had there was plain text to
begin with, nobody searching for something similar will ever find this
thread, as all the information you actually posted was in that image.

Which in and by itself is way too little to make any guess how this
happened, that would probably be in the logs.

> dear,
>  Is this another big bug of CEPH?
> 
You are running 0.85. 
Which is a development version and I sure hope this is just a test cluster.

So finding bugs is hardly surprising, especially since you're using an
experimental backend on top of that.
 
Also due it being a development version, you might get more feedback on
the ceph-devel mailing list.

Christian
-- 
Christian BalzerNetwork/Systems Engineer
ch...@gol.com   Global OnLine Japan/Fusion Communications
http://www.gol.com/
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread Haomai Wang
Please check http://tracker.ceph.com/issues/8589. KeyValueStore is an
experimental backend; we are still working to make it stable.
You can check out the master branch to get the fix.


On Thu, Sep 11, 2014 at 8:35 AM, 廖建锋  wrote:

>  dear,
>  Is this another big bug of CEPH?
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>


-- 

Best Regards,

Wheat
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Somnath Roy
Sam/Sage,
I have incorporated all of your comments. Please have a look at the same pull 
request.

https://github.com/ceph/ceph/pull/2440

Thanks & Regards
Somnath

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Wednesday, September 10, 2014 3:25 PM
To: Somnath Roy
Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
ceph-users@lists.ceph.com
Subject: Re: OpTracker optimization

Oh, I changed my mind, your approach is fine.  I was unclear.
Currently, I just need you to address the other comments.
-Sam

On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
> As I understand, you want me to implement the following.
>
> 1.  Keep this implementation one sharded optracker for the ios going through 
> ms_dispatch path.
>
> 2. Additionally, for ios going through ms_fast_dispatch, you want me 
> to implement optracker (without internal shard) per opwq shard
>
> Am I right ?
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 3:08 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> I don't quite understand.
> -Sam
>
> On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  wrote:
>> Thanks Sam.
>> So, you want me to go with optracker/shadedopWq , right ?
>>
>> Regards
>> Somnath
>>
>> -Original Message-
>> From: Samuel Just [mailto:sam.j...@inktank.com]
>> Sent: Wednesday, September 10, 2014 2:36 PM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
>> ceph-users@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Responded with cosmetic nonsense.  Once you've got that and the other 
>> comments addressed, I can put it in wip-sam-testing.
>> -Sam
>>
>> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
>>> Thanks Sam..I responded back :-)
>>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org 
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>>> Sent: Wednesday, September 10, 2014 11:17 AM
>>> To: Somnath Roy
>>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
>>> ceph-users@lists.ceph.com
>>> Subject: Re: OpTracker optimization
>>>
>>> Added a comment about the approach.
>>> -Sam
>>>
>>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
 Hi Sam/Sage,

 As we discussed earlier, enabling the present OpTracker code 
 degrading performance severely. For example, in my setup a single 
 OSD node with
 10 clients is reaching ~103K read iops with io served from memory 
 while optracking is disabled but enabling optracker it is reduced to ~39K 
 iops.
 Probably, running OSD without enabling OpTracker is not an option 
 for many of Ceph users.

 Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
 ops_in_flight) and removing some other bottlenecks I am able to 
 match the performance of OpTracking enabled OSD with OpTracking 
 disabled, but with the expense of ~1 extra cpu core.

 In this process I have also fixed the following tracker.



 http://tracker.ceph.com/issues/9384



 and probably http://tracker.ceph.com/issues/8885 too.



 I have created following pull request for the same. Please review it.



 https://github.com/ceph/ceph/pull/2440



 Thanks & Regards

 Somnath




 


>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majord...@vger.kernel.org More majordomo 
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> 
>>>

Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-10 Thread 廖建锋
I bet he didn't set hit_set yet
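
For reference, a cache tier only starts tracking hot objects once the hit_set
parameters are configured; a minimal sketch with hypothetical values, using the
hot-storage pool name from the commands quoted below:

ceph osd pool set hot-storage hit_set_type bloom
ceph osd pool set hot-storage hit_set_count 1
ceph osd pool set hot-storage hit_set_period 3600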

From: ceph-users
Date: 2014-09-11 09:00
To: Andrei Mikhailovsky; 
ceph-users
Subject: Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?
Could you show your cache tiering configuration? Especially this three 
parameters.
ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

ceph osd pool set hot-storage cache_target_full_ratio 0.8

ceph osd pool set {cachepool} target_max_bytes {#bytes}


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei 
Mikhailovsky
Sent: Wednesday, September 10, 2014 8:51 PM
To: ceph-users
Subject: [ceph-users] Cache Pool writing too much on ssds, poor performance?

Hello guys,

I am experimenting with the cache pool and running some tests to see how adding the 
cache pool improves the overall performance of our small cluster.

While doing testing I've noticed that it seems that the cache pool is writing 
too much on the cache pool ssds. Not sure what the issue is here; perhaps someone 
could help me understand what is going on.

My test cluster is:
2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds 
journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 
25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2.
4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 
12gbit/s over ipoib)

So, my test is:
Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M 
count=2000 iflag=direct". I am concurrently starting this command on 10 virtual 
machines which are running on 4 host servers. The aim is to monitor the use of 
cache pool when reading the same data over and over again.


Running the above command for the first time does what I was expecting. The 
osds are doing a lot of reads, the cache pool does a lot of writes (around 
250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are 
poor. The results of the "ceph -w" shows consistent performance across the time.

Running the above for the second and consequent times produces IO patterns 
which I was not expecting at all. The hdd osds are not doing much (this part I 
expected), the cache pool still does a lot of writes and very little reads! The 
dd results have improved just a little, but not much. The results of the "ceph 
-w" shows performance breaks over time. For instance, I have a peak of 
throughput in the first couple of seconds (data is probably coming from the osd 
server's ram at high rate). After the peak throughput has finished, the ceph 
reads are done in the following way: 2-3 seconds of activity followed by 2 
seconds of inactivity, and it keeps doing that throughout the length of the 
test. So, to put the numbers in perspective, when running tests over and over 
again I would get around 2000 - 3000MB/s for the first two seconds, followed by 
0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 
seconds, followed by 0MB/s for 2 seconds, followed by 150-250MB/s over 2-3 
seconds, followed by 0MB/s over 2 seconds, and the pattern repeats until the 
test is done.


I kept running the dd command for about 15-20 times and observed the same 
behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when 
guest vms are reading the same data over and over again. There is very little 
read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected 
the 80GB of data that is being read from the vms over and over again to be 
firmly recognised as the hot data and kept in the cache pool and read from it 
when guest vms request the data. Instead, I mainly get writes on the cache pool 
ssds and I am not really sure where these writes are coming from as my hdd osds 
are being pretty idle.

From the overall tests so far, introducing the cache pool has drastically 
slowed down my cluster (by as much as 50-60%).

Thanks for any help

Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-10 Thread Chen, Xiaoxi
Could you show your cache tiering configuration? Especially this three 
parameters.
ceph osd pool set hot-storage cache_target_dirty_ratio 0.4

ceph osd pool set hot-storage cache_target_full_ratio 0.8

ceph osd pool set {cachepool} target_max_bytes {#bytes}


From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of Andrei 
Mikhailovsky
Sent: Wednesday, September 10, 2014 8:51 PM
To: ceph-users
Subject: [ceph-users] Cache Pool writing too much on ssds, poor performance?

Hello guys,

I am experimenting with the cache pool and running some tests to see how adding the 
cache pool improves the overall performance of our small cluster.

While doing testing I've noticed that it seems that the cache pool is writing 
too much on the cache pool ssds. Not sure what the issue is here; perhaps someone 
could help me understand what is going on.

My test cluster is:
2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds 
journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 
25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2.
4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 
12gbit/s over ipoib)

So, my test is:
Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M 
count=2000 iflag=direct". I am concurrently starting this command on 10 virtual 
machines which are running on 4 host servers. The aim is to monitor the use of 
cache pool when reading the same data over and over again.


Running the above command for the first time does what I was expecting. The 
osds are doing a lot of reads, the cache pool does a lot of writes (around 
250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are 
poor. The results of the "ceph -w" shows consistent performance across the time.

Running the above for the second and consequent times produces IO patterns 
which I was not expecting at all. The hdd osds are not doing much (this part I 
expected), the cache pool still does a lot of writes and very little reads! The 
dd results have improved just a little, but not much. The results of the "ceph 
-w" shows performance breaks over time. For instance, I have a peak of 
throughput in the first couple of seconds (data is probably coming from the osd 
server's ram at high rate). After the peak throughput has finished, the ceph 
reads are done in the following way: 2-3 seconds of activity followed by 2 
seconds of inactivity, and it keeps doing that throughout the length of the 
test. So, to put the numbers in perspective, when running tests over and over 
again I would get around 2000 - 3000MB/s for the first two seconds, followed by 
0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 
seconds, followed by 0MB/s for 2 seconds, followed 150-250MB/s over 2-3 
seconds, followed by 0MB/s over 2 secods, and the pattern repeats until the 
test is done.


I kept running the dd command for about 15-20 times and observed the same 
behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when 
guest vms are reading the same data over and over again. There is very little 
read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected 
the 80GB of data that is being read from the vms over and over again to be 
firmly recognised as the hot data and kept in the cache pool and read from it 
when guest vms request the data. Instead, I mainly get writes on the cache pool 
ssds and I am not really sure where these writes are coming from as my hdd osds 
are being pretty idle.

From the overall tests so far, introducing the cache pool has drastically 
slowed down my cluster (by as much as 50-60%).

Thanks for any help

Andrei
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Why so much inconsistent error in 0.85?

2014-09-10 Thread 廖建锋
dear,
 Is this another big bug of CEPH?


[inline screenshot attachment omitted]

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)

2014-09-10 Thread Blair Bethwaite
On 11 September 2014 08:47, John Spray  wrote:
> I do think this is something we could think about building a tool for:
> lots of people will have comparatively tiny quantities of metadata so
> full dumps would be a nice thing to have in our back pockets.  Reminds
> me of the way Lustre people used LVM snapshots for their metadata.
> Some users have set maintenance windows in which even a purely offline
> metadata dump tool could be useful.

This is the sort of thing I was thinking of. Operationally we would
have full tape backups of the filesystems (sans 24 hours) - but
restoring petabytes of data from tape is such an exercise you really
want other types of insurance, especially where problems with the MDS
can be expected. Something like an atomic snapshot of both data and
metadata pools, where you could then export a copy of the metadata
snapshot out of the operational cluster. Then if the MDS went haywire
and corrupted things there'd be a relatively lightweight path to a
rollback recovery. Though I say this with complete ignorance of how
the MDS works...

-- 
Cheers,
~Blairo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Upgraded now MDS won't start

2014-09-10 Thread McNamara, Bradley
Hello,

This is my first real issue since running Ceph for several months.  Here's the 
situation:

I've been running an Emperor cluster for several months.  All was good.  I 
decided to upgrade since I'm running Ubuntu 13.10 and 0.72.2.  I decided to 
first upgrade Ceph to 0.80.4, which was the last version in the apt repository 
for 13.10.  I upgraded the MONs, then the OSD servers, to 0.80.4; all went as 
expected with no issues.  The last thing I did was upgrade the MDS using the 
same process, but now the MDS won't start.  I've tried to manually start the 
MDS with debugging on, and I have attached the file.  It complains that it's 
looking for "mds.0.20  need osdmap epoch 3602, have 3601".

Anyway, I don't really use CephFS or RGW, so I don't need the MDS, but I'd 
like to have it.  Can someone tell me how to fix it, or delete it, so I can 
start over when I do need it?  Right now my cluster is HEALTH_WARN because of 
it.

Thanks!

Brad

2014-09-10 15:48:13.830787 7fae3c48e7c0  0 ceph version 0.72.2 (a913ded2ff138aefb8cb84d347d72164099cfd60), process ceph-mds, pid 3166
2014-09-10 15:48:13.834336 7fae3c48e7c0 10 mds.-1.0 168	MDSCacheObject
2014-09-10 15:48:13.834349 7fae3c48e7c0 10 mds.-1.0 2168	CInode
2014-09-10 15:48:13.834355 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item   *7=112
2014-09-10 15:48:13.834359 7fae3c48e7c0 10 mds.-1.0 392	 inode_t 
2014-09-10 15:48:13.834361 7fae3c48e7c0 10 mds.-1.0 56	  nest_info_t 
2014-09-10 15:48:13.834364 7fae3c48e7c0 10 mds.-1.0 32	  frag_info_t 
2014-09-10 15:48:13.834370 7fae3c48e7c0 10 mds.-1.0 40	 SimpleLock   *5=200
2014-09-10 15:48:13.834373 7fae3c48e7c0 10 mds.-1.0 48	 ScatterLock  *3=144
2014-09-10 15:48:13.834377 7fae3c48e7c0 10 mds.-1.0 488	CDentry
2014-09-10 15:48:13.834379 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item
2014-09-10 15:48:13.834383 7fae3c48e7c0 10 mds.-1.0 40	 SimpleLock
2014-09-10 15:48:13.834385 7fae3c48e7c0 10 mds.-1.0 1024	CDir 
2014-09-10 15:48:13.834387 7fae3c48e7c0 10 mds.-1.0 16	 elist<>::item   *2=32
2014-09-10 15:48:13.834390 7fae3c48e7c0 10 mds.-1.0 192	 fnode_t 
2014-09-10 15:48:13.834392 7fae3c48e7c0 10 mds.-1.0 56	  nest_info_t *2
2014-09-10 15:48:13.834394 7fae3c48e7c0 10 mds.-1.0 32	  frag_info_t *2
2014-09-10 15:48:13.834399 7fae3c48e7c0 10 mds.-1.0 168	Capability 
2014-09-10 15:48:13.834402 7fae3c48e7c0 10 mds.-1.0 32	 xlist<>::item   *2=64
2014-09-10 15:48:13.835815 7fae3c486700 10 mds.-1.0 MDS::ms_get_authorizer type=mon
2014-09-10 15:48:13.836113 7fae37292700  5 mds.-1.0 ms_handle_connect on 156.74.237.50:6789/0
2014-09-10 15:48:13.839873 7fae3c48e7c0 10 mds.-1.0 beacon_send up:boot seq 1 (currently up:boot)
2014-09-10 15:48:13.840110 7fae3c48e7c0 10 mds.-1.0 create_logger
2014-09-10 15:48:13.867040 7fae37292700  5 mds.-1.0 handle_mds_map epoch 149 from mon.0
2014-09-10 15:48:13.867109 7fae37292700 10 mds.-1.0  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:13.867122 7fae37292700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:13.867136 7fae37292700 10 mds.-1.-1 map says i am 156.74.237.56:6800/3166 mds.-1.-1 state down:dne
2014-09-10 15:48:13.867151 7fae37292700 10 mds.-1.-1 not in map yet
2014-09-10 15:48:14.164620 7fae37292700  5 mds.-1.-1 handle_mds_map epoch 150 from mon.0
2014-09-10 15:48:14.164706 7fae37292700 10 mds.-1.-1  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.164716 7fae37292700 10 mds.-1.-1  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.164727 7fae37292700 10 mds.-1.0 map says i am 156.74.237.56:6800/3166 mds.-1.0 state up:standby
2014-09-10 15:48:14.164739 7fae37292700 10 mds.-1.0  peer mds gid 5192121 removed from map
2014-09-10 15:48:14.164757 7fae37292700  1 mds.-1.0 handle_mds_map standby
2014-09-10 15:48:14.237027 7fae37292700  5 mds.-1.0 handle_mds_map epoch 151 from mon.0
2014-09-10 15:48:14.237060 7fae37292700 10 mds.-1.0  my compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.237070 7fae37292700 10 mds.-1.0  mdsmap compat compat={},rocompat={},incompat={1=base v0.20,2=client writeable ranges,3=default file layouts on dirs,4=dir inode in separate object,5=mds uses versioned encoding}
2014-09-10 15:48:14.237079 7fae37292700 10 mds.0.20 map says i am 156.74.237.56:6800/3166 mds.0.20 state up:replay
2014-09-10 15:48:14.237091 7fae37292700  1 mds.0.20 handle_mds

Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)

2014-09-10 Thread John Spray
On Wed, Sep 10, 2014 at 9:05 PM, Gregory Farnum  wrote:
>> Related, given there is no fsck, how would one go about backing up the
>> metadata in order to facilitate DR? Is there even a way for that to
>> make sense given the decoupling of data & metadata pools...?
>
> Uh, depends on the kind of DR you're going for, I guess. There are
> lots of things that will backup a generic filesystem; you could do
> something smarter with a bit of custom scripting using Ceph's rstats.

I do think this is something we could think about building a tool for:
lots of people will have comparatively tiny quantities of metadata so
full dumps would be a nice thing to have in our back pockets.  Reminds
me of the way Lustre people used LVM snapshots for their metadata.
Some users have set maintenance windows in which even a purely offline
metadata dump tool could be useful.

The other one would be the tool for rebuilding an approximate metadata
picture from the backtraces of objects in the data pool.  It's a
degenerate case of an online backward scrub where you create a fresh
filesystem and add the old/orphaned data pool to it, but it's such a
simple thing that it's tempting to implement it as a separate tool as
well/in advance.

John
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Samuel Just
Oh, I changed my mind, your approach is fine.  I was unclear.
Currently, I just need you to address the other comments.
-Sam

On Wed, Sep 10, 2014 at 3:13 PM, Somnath Roy  wrote:
> As I understand, you want me to implement the following.
>
> 1.  Keep this implementation one sharded optracker for the ios going through 
> ms_dispatch path.
>
> 2. Additionally, for ios going through ms_fast_dispatch, you want me to 
> implement optracker (without internal shard) per opwq shard
>
> Am I right ?
>
> Thanks & Regards
> Somnath
>
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 3:08 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> I don't quite understand.
> -Sam
>
> On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  wrote:
>> Thanks Sam.
>> So, you want me to go with optracker/shadedopWq , right ?
>>
>> Regards
>> Somnath
>>
>> -Original Message-
>> From: Samuel Just [mailto:sam.j...@inktank.com]
>> Sent: Wednesday, September 10, 2014 2:36 PM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> ceph-users@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Responded with cosmetic nonsense.  Once you've got that and the other 
>> comments addressed, I can put it in wip-sam-testing.
>> -Sam
>>
>> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
>>> Thanks Sam..I responded back :-)
>>>
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org
>>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>>> Sent: Wednesday, September 10, 2014 11:17 AM
>>> To: Somnath Roy
>>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>>> ceph-users@lists.ceph.com
>>> Subject: Re: OpTracker optimization
>>>
>>> Added a comment about the approach.
>>> -Sam
>>>
>>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
 Hi Sam/Sage,

 As we discussed earlier, enabling the present OpTracker code
 degrading performance severely. For example, in my setup a single
 OSD node with
 10 clients is reaching ~103K read iops with io served from memory
 while optracking is disabled but enabling optracker it is reduced to ~39K 
 iops.
 Probably, running OSD without enabling OpTracker is not an option
 for many of Ceph users.

 Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
 ops_in_flight) and removing some other bottlenecks I am able to
 match the performance of OpTracking enabled OSD with OpTracking
 disabled, but with the expense of ~1 extra cpu core.

 In this process I have also fixed the following tracker.



 http://tracker.ceph.com/issues/9384



 and probably http://tracker.ceph.com/issues/8885 too.



 I have created following pull request for the same. Please review it.



 https://github.com/ceph/ceph/pull/2440



 Thanks & Regards

 Somnath




 


>>> --
>>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>>> in the body of a message to majord...@vger.kernel.org More majordomo
>>> info at  http://vger.kernel.org/majordomo-info.html
>>>
>>> 
>>>
>>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Somnath Roy
As I understand, you want me to implement the following.

1.  Keep this implementation one sharded optracker for the ios going through 
ms_dispatch path.

2. Additionally, for ios going through ms_fast_dispatch, you want me to 
implement optracker (without internal shard) per opwq shard 

Am I right ?

Thanks & Regards
Somnath

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Wednesday, September 10, 2014 3:08 PM
To: Somnath Roy
Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
ceph-users@lists.ceph.com
Subject: Re: OpTracker optimization

I don't quite understand.
-Sam

On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  wrote:
> Thanks Sam.
> So, you want me to go with optracker/shadedopWq , right ?
>
> Regards
> Somnath
>
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 2:36 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> Responded with cosmetic nonsense.  Once you've got that and the other 
> comments addressed, I can put it in wip-sam-testing.
> -Sam
>
> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
>> Thanks Sam..I responded back :-)
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org 
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>> Sent: Wednesday, September 10, 2014 11:17 AM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
>> ceph-users@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Added a comment about the approach.
>> -Sam
>>
>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
>>> Hi Sam/Sage,
>>>
>>> As we discussed earlier, enabling the present OpTracker code 
>>> degrading performance severely. For example, in my setup a single 
>>> OSD node with
>>> 10 clients is reaching ~103K read iops with io served from memory 
>>> while optracking is disabled but enabling optracker it is reduced to ~39K 
>>> iops.
>>> Probably, running OSD without enabling OpTracker is not an option 
>>> for many of Ceph users.
>>>
>>> Now, by sharding the Optracker:: ops_in_flight_lock (thus xlist
>>> ops_in_flight) and removing some other bottlenecks I am able to 
>>> match the performance of OpTracking enabled OSD with OpTracking 
>>> disabled, but with the expense of ~1 extra cpu core.
>>>
>>> In this process I have also fixed the following tracker.
>>>
>>>
>>>
>>> http://tracker.ceph.com/issues/9384
>>>
>>>
>>>
>>> and probably http://tracker.ceph.com/issues/8885 too.
>>>
>>>
>>>
>>> I have created following pull request for the same. Please review it.
>>>
>>>
>>>
>>> https://github.com/ceph/ceph/pull/2440
>>>
>>>
>>>
>>> Thanks & Regards
>>>
>>> Somnath
>>>
>>>
>>>
>>>
>>> 
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo 
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>> 
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
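For readers following this thread: the OpTracker being discussed is what feeds the OSD admin socket op dumps, which is also a convenient way to reproduce the enabled/disabled comparison. A minimal sketch, assuming an OSD id of 0, the default admin socket path, and that the option is runtime-injectable on your version:

# ops currently tracked by the op tracker on osd.0 (only populated while it is enabled)
ceph daemon osd.0 dump_ops_in_flight

# recently completed ops, with per-stage timestamps
ceph daemon osd.0 dump_historic_ops

# toggle the tracker at runtime to compare iops with and without it
ceph tell osd.* injectargs '--osd_enable_op_tracker=false'
ceph tell osd.* injectargs '--osd_enable_op_tracker=true'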


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Samuel Just
I don't quite understand.
-Sam

On Wed, Sep 10, 2014 at 2:38 PM, Somnath Roy  wrote:
> Thanks Sam.
> So, you want me to go with optracker/shadedopWq , right ?
>
> Regards
> Somnath
>
> -Original Message-
> From: Samuel Just [mailto:sam.j...@inktank.com]
> Sent: Wednesday, September 10, 2014 2:36 PM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> Responded with cosmetic nonsense.  Once you've got that and the other 
> comments addressed, I can put it in wip-sam-testing.
> -Sam
>
> On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
>> Thanks Sam..I responded back :-)
>>
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org
>> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
>> Sent: Wednesday, September 10, 2014 11:17 AM
>> To: Somnath Roy
>> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org;
>> ceph-users@lists.ceph.com
>> Subject: Re: OpTracker optimization
>>
>> Added a comment about the approach.
>> -Sam
>>
>> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
>>> Hi Sam/Sage,
>>>
>>> As we discussed earlier, enabling the present OpTracker code degrades
>>> performance severely. For example, in my setup a single OSD node with
>>> 10 clients is reaching ~103K read iops with io served from memory
>>> while optracking is disabled, but with optracker enabled it is reduced to ~39K
>>> iops.
>>> Running an OSD without OpTracker enabled is probably not an option for
>>> many Ceph users.
>>>
>>> Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
>>> ops_in_flight) and removing some other bottlenecks, I am able to match
>>> the performance of an OpTracking-enabled OSD with OpTracking disabled,
>>> but at the expense of ~1 extra cpu core.
>>>
>>> In this process I have also fixed the following tracker.
>>>
>>>
>>>
>>> http://tracker.ceph.com/issues/9384
>>>
>>>
>>>
>>> and probably http://tracker.ceph.com/issues/8885 too.
>>>
>>>
>>>
>>> I have created following pull request for the same. Please review it.
>>>
>>>
>>>
>>> https://github.com/ceph/ceph/pull/2440
>>>
>>>
>>>
>>> Thanks & Regards
>>>
>>> Somnath
>>>
>>>
>>>
>>>
>>> 
>>>
>>>
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel"
>> in the body of a message to majord...@vger.kernel.org More majordomo
>> info at  http://vger.kernel.org/majordomo-info.html
>>
>> 
>>
>>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-deploy bug; CentOS 7, Firefly

2014-09-10 Thread Piers Dawson-Damer
Thanks Alfredo,

However, the domain/folder part of the URL is fine; it is the RPM
file name that seems incorrect.

regards,

Piers


On 10 Sep 2014, at 10:55 pm, Alfredo Deza  wrote:

> This should get fixed pretty soon, we already have an open issue in
> the tracker for it: http://tracker.ceph.com/issues/9376
> 
> However, it is easy to workaround this issue with ceph-deploy by
> passing explicitly the repo url so that it will
> skip the step to fetch this:
> 
>ceph-deploy install --repo-url http://ceph.com/rpm-firefly/el7/ stor
> 
> On Wed, Sep 10, 2014 at 12:38 AM, Piers Dawson-Damer  wrote:
>> Ceph-deploy wants;
>> 
>> ceph-release-1-0.el7.noarch.rpm
>> 
>> But the contents of ceph.com/rpm-firefly/el7/noarch only include the file;
>> 
>> ceph-release-1-0.el7.centos.noarch.rpm
>> 
>> Piers
>> 
>> 
>> 
>> [stor][DEBUG ] Retrieving
>> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
>> [stor][WARNIN] curl: (22) The requested URL returned error: 404 Not Found
>> [stor][WARNIN] error: skipping
>> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm -
>> transfer failed
>> [stor][ERROR ] RuntimeError: command returned non-zero exit status: 1
>> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: rpm -Uvh
>> --replacepkgs
>> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
>> 
>> 
>> 
>> 
>> Linux stor.domain 3.16.2-1.el7.elrepo.x86_64 #1 SMP Sat Sep 6 11:34:36 EDT
>> 2014 x86_64 x86_64 x86_64 GNU/Linux
>> 
>> 
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Somnath Roy
Thanks Sam.
So, you want me to go with optracker/shadedopWq , right ?

Regards
Somnath

-Original Message-
From: Samuel Just [mailto:sam.j...@inktank.com] 
Sent: Wednesday, September 10, 2014 2:36 PM
To: Somnath Roy
Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
ceph-users@lists.ceph.com
Subject: Re: OpTracker optimization

Responded with cosmetic nonsense.  Once you've got that and the other comments 
addressed, I can put it in wip-sam-testing.
-Sam

On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
> Thanks Sam..I responded back :-)
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> Sent: Wednesday, September 10, 2014 11:17 AM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> Added a comment about the approach.
> -Sam
>
> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
>> Hi Sam/Sage,
>>
>> As we discussed earlier, enabling the present OpTracker code degrades
>> performance severely. For example, in my setup a single OSD node with
>> 10 clients is reaching ~103K read iops with io served from memory
>> while optracking is disabled, but with optracker enabled it is reduced to ~39K
>> iops.
>> Running an OSD without OpTracker enabled is probably not an option for
>> many Ceph users.
>>
>> Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
>> ops_in_flight) and removing some other bottlenecks, I am able to match
>> the performance of an OpTracking-enabled OSD with OpTracking disabled,
>> but at the expense of ~1 extra cpu core.
>>
>> In this process I have also fixed the following tracker.
>>
>>
>>
>> http://tracker.ceph.com/issues/9384
>>
>>
>>
>> and probably http://tracker.ceph.com/issues/8885 too.
>>
>>
>>
>> I have created following pull request for the same. Please review it.
>>
>>
>>
>> https://github.com/ceph/ceph/pull/2440
>>
>>
>>
>> Thanks & Regards
>>
>> Somnath
>>
>>
>>
>>
>> 
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" 
> in the body of a message to majord...@vger.kernel.org More majordomo 
> info at  http://vger.kernel.org/majordomo-info.html
>
> 
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Samuel Just
Responded with cosmetic nonsense.  Once you've got that and the other
comments addressed, I can put it in wip-sam-testing.
-Sam

On Wed, Sep 10, 2014 at 1:30 PM, Somnath Roy  wrote:
> Thanks Sam..I responded back :-)
>
> -Original Message-
> From: ceph-devel-ow...@vger.kernel.org 
> [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
> Sent: Wednesday, September 10, 2014 11:17 AM
> To: Somnath Roy
> Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
> ceph-users@lists.ceph.com
> Subject: Re: OpTracker optimization
>
> Added a comment about the approach.
> -Sam
>
> On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
>> Hi Sam/Sage,
>>
>> As we discussed earlier, enabling the present OpTracker code degrades
>> performance severely. For example, in my setup a single OSD node with
>> 10 clients is reaching ~103K read iops with io served from memory
>> while optracking is disabled, but with optracker enabled it is reduced to ~39K
>> iops.
>> Running an OSD without OpTracker enabled is probably not an option for
>> many Ceph users.
>>
>> Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
>> ops_in_flight) and removing some other bottlenecks, I am able to match
>> the performance of an OpTracking-enabled OSD with OpTracking disabled,
>> but at the expense of ~1 extra cpu core.
>>
>> In this process I have also fixed the following tracker.
>>
>>
>>
>> http://tracker.ceph.com/issues/9384
>>
>>
>>
>> and probably http://tracker.ceph.com/issues/8885 too.
>>
>>
>>
>> I have created following pull request for the same. Please review it.
>>
>>
>>
>> https://github.com/ceph/ceph/pull/2440
>>
>>
>>
>> Thanks & Regards
>>
>> Somnath
>>
>>
>>
>>
>> 
>>
>>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
> body of a message to majord...@vger.kernel.org More majordomo info at  
> http://vger.kernel.org/majordomo-info.html
>
> 
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Somnath Roy
Thanks Sam..I responded back :-)

-Original Message-
From: ceph-devel-ow...@vger.kernel.org 
[mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Samuel Just
Sent: Wednesday, September 10, 2014 11:17 AM
To: Somnath Roy
Cc: Sage Weil (sw...@redhat.com); ceph-de...@vger.kernel.org; 
ceph-users@lists.ceph.com
Subject: Re: OpTracker optimization

Added a comment about the approach.
-Sam

On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
> Hi Sam/Sage,
>
> As we discussed earlier, enabling the present OpTracker code degrades
> performance severely. For example, in my setup a single OSD node with
> 10 clients is reaching ~103K read iops with io served from memory
> while optracking is disabled, but with optracker enabled it is reduced to ~39K
> iops.
> Running an OSD without OpTracker enabled is probably not an option for
> many Ceph users.
>
> Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
> ops_in_flight) and removing some other bottlenecks, I am able to match
> the performance of an OpTracking-enabled OSD with OpTracking disabled,
> but at the expense of ~1 extra cpu core.
>
> In this process I have also fixed the following tracker.
>
>
>
> http://tracker.ceph.com/issues/9384
>
>
>
> and probably http://tracker.ceph.com/issues/8885 too.
>
>
>
> I have created following pull request for the same. Please review it.
>
>
>
> https://github.com/ceph/ceph/pull/2440
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
>
> 
>
>
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in the 
body of a message to majord...@vger.kernel.org More majordomo info at  
http://vger.kernel.org/majordomo-info.html




___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] CephFS roadmap (was Re: NAS on RBD)

2014-09-10 Thread Gregory Farnum
On Tue, Sep 9, 2014 at 6:10 PM, Blair Bethwaite
 wrote:
> Hi Sage,
>
> Thanks for weighing into this directly and allaying some concerns.
>
> It would be good to get a better understanding about where the rough
> edges are - if deployers have some knowledge of those then they can be
> worked around to some extent.

It's just a very long process to qualify a filesystem, even in this
limited sense. We're still at the point where we're solving bugs that
the open-source community brings us rather than setting out to make it
stable for a particular identified workload.
For the moment most of our development effort is focused on
1) instrumentation that makes it possible for users (and developers!)
to identify the cause of problems we run across
2) basic mechanisms for fixing "ephemeral" bugs (things like booting
dead clients, restarting hung metadata ops, etc)
3) general usability issues that our newer developers and users are
reporting to us
4) the beginnings of fsck (correctness checking for now, no fixing yet)

> E.g., for our use-case it may be that
> whilst Inktank/RedHat won't provide support for CephFS that we are
> better off using it in a tightly controlled fashion (e.g., no
> snapshots, restricted set of native clients acting as presentation
> layer with others coming in via SAMBA & Ganesha, no dynamic metadata
> tree/s, ???) where we're less likely to run into issues.

Well, snapshots are definitely going to break your install (they're
disabled by default, now). Multi-mds is unstable enough that nobody
should be running with it.
We run samba and NFS tests in our nightlies and they mostly work,
although we've got some odd issues we've not tracked down when
*ending* the samba process or unmounting nfs. (Our best guess on these
is test or environment issues, rather than actual FS issues.) But
these are probably not complete.

> Related, given there is no fsck, how would one go about backing up the
> metadata in order to facilitate DR? Is there even a way for that to
> make sense given the decoupling of data & metadata pools...?

Uh, depends on the kind of DR you're going for, I guess. There are
lots of things that will backup a generic filesystem; you could do
something smarter with a bit of custom scripting using Ceph's rstats.
-Greg
Software Engineer #42 @ http://inktank.com | http://ceph.com
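As a concrete illustration of the rstats idea: CephFS exposes recursive statistics as virtual extended attributes on directories, which a backup or DR script can read to decide whether a subtree has changed. A minimal sketch, assuming a kernel or FUSE mount at /mnt/cephfs (the path and directory below are placeholders):

# recursive size, entry count and most recent change time under a directory
getfattr -n ceph.dir.rbytes   /mnt/cephfs/projects
getfattr -n ceph.dir.rentries /mnt/cephfs/projects
getfattr -n ceph.dir.rctime   /mnt/cephfs/projects

A script could, for example, skip re-backing-up a subtree whose ceph.dir.rctime has not advanced since the previous run.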
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] OpTracker optimization

2014-09-10 Thread Samuel Just
Added a comment about the approach.
-Sam

On Tue, Sep 9, 2014 at 1:33 PM, Somnath Roy  wrote:
> Hi Sam/Sage,
>
> As we discussed earlier, enabling the present OpTracker code degrades
> performance severely. For example, in my setup a single OSD node with 10
> clients is reaching ~103K read iops with io served from memory while
> optracking is disabled, but with optracker enabled it is reduced to ~39K iops.
> Running an OSD without OpTracker enabled is probably not an option for many
> Ceph users.
>
> Now, by sharding the OpTracker::ops_in_flight_lock (and thus the xlist
> ops_in_flight) and removing some other bottlenecks, I am able to match the
> performance of an OpTracking-enabled OSD with OpTracking disabled, but at the
> expense of ~1 extra cpu core.
>
> In this process I have also fixed the following tracker.
>
>
>
> http://tracker.ceph.com/issues/9384
>
>
>
> and probably http://tracker.ceph.com/issues/8885 too.
>
>
>
> I have created following pull request for the same. Please review it.
>
>
>
> https://github.com/ceph/ceph/pull/2440
>
>
>
> Thanks & Regards
>
> Somnath
>
>
>
>
> 
>
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] max_bucket limit -- safe to disable?

2014-09-10 Thread Gregory Farnum
On Wednesday, September 10, 2014, Daniel Schneller <
daniel.schnel...@centerdevice.com> wrote:

> On 09 Sep 2014, at 21:43, Gregory Farnum wrote:
>
>
> Yehuda can talk about this with more expertise than I can, but I think
> it should be basically fine. By creating so many buckets you're
> decreasing the effectiveness of RGW's metadata caching, which means
>
> the initial lookup in a particular bucket might take longer.
>
>
> Thanks for your thoughts. With “initial lookup in a particular bucket”
> do you mean accessing any of the objects in a bucket? If we directly
> access the object (not enumerating the buckets content), would that
> still be an issue?
> Just trying to understand the inner workings a bit better to make
> more educated guesses :)
>

When doing an object lookup, the gateway combines the "bucket ID" with a
mangled version of the object name to try and do a read out of RADOS. It
first needs to get that bucket ID though -- it will cache the bucket
name->ID mapping, but if you have a ton of buckets there could be enough
entries to degrade the cache's effectiveness. (So, you're more likely to
pay that extra disk access lookup.)


>
>
> The big concern is that we do maintain a per-user list of all their
> buckets — which is stored in a single RADOS object — so if you have an
> extreme number of buckets that RADOS object could get pretty big and
> become a bottleneck when creating/removing/listing the buckets. You
>
>
> Alright. Listing buckets is no problem, that we don’t do. Can you
> say what “pretty big” would be in terms of MB? How much space does a
> bucket record consume in there? Based on that I could run a few numbers.
>

Uh, a kilobyte per bucket? You could look it up in the source (I'm on my
phone) but I *believe* the bucket name is allowed to be larger than the
rest combined...
More particularly, though, if you've got a single user uploading documents,
each creating a new bucket, then those bucket creates are going to
serialize on this one object.
-Greg
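To put a rough number on "pretty big" using the ~1 KB per bucket figure above: a million buckets for a single user works out to roughly a gigabyte in that one object. A quick sanity check, with the caveat that the pool and object naming below are assumptions and vary between RGW versions (depending on version the entries live in omap or in the object data):

# back-of-the-envelope: 1,000,000 buckets at roughly 1 KB each
echo "$(( 1000000 / 1024 )) MB"    # about 976 MB for that single per-user object

# to measure the real thing, inspect the per-user object directly
rados -p .users.uid stat <uid>.buckets
rados -p .users.uid listomapkeys <uid>.buckets | wc -l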


>
>
> should run your own experiments to figure out what the limits are
> there; perhaps you have an easy way of sharding up documents into
> different users.
>
>
> Good advice. We can do that per distributor (an org unit in our
> software) to at least compartmentalize any potential locking issues
> in this area to that single entity. Still, there would be quite
> a lot of buckets/objects per distributor, so some more detail on
> the above items would be great.
>
> Thanks a lot!
>
>
> Daniel
>


-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-10 Thread Gregory Farnum
The important bit there is actually near the end of the message output
line, where the first says "ack" and the second says "ondisk".

I assume you're using btrfs; the ack is returned after the write is applied
in-memory and readable by clients. The ondisk (commit) message is returned
after it's durable to the journal or the backing filesystem.
-Greg

On Wednesday, September 10, 2014, yuelongguang  wrote:

> hi,all
> i recently debugged ceph rbd, and the log shows that one write to the osd can
> get two of its replies.
> the difference between them is seq.
> why?
>
> thanks
> ---log-
> reader got message 6 0x7f58900010a0 osd_op_reply(15
> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
> write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
> 2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669
> queue 0x7f58900010a0 prio 127
> 2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader reading tag...
> 2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader got MSG
> 2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
> 2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
> 2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader got front 247
> 2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).aborted = 0
> 2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader got 247 + 0 + 0 byte message
> 2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq #
> = 7 front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
> 2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >>
> 10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1
> c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15
> rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304
> write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
>
>
>

-- 
Software Engineer #42 @ http://inktank.com | http://ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] [ANN] ceph-deploy 1.5.14 released

2014-09-10 Thread Scottix
Suggestion:
Can you link to a changelog of any new features or major bug fixes
when you do new releases.

Thanks,
Scottix

On Wed, Sep 10, 2014 at 6:45 AM, Alfredo Deza  wrote:
> Hi All,
>
> There is a new bug-fix release of ceph-deploy that helps prevent the
> environment variable problems that can sometimes arise when depending on
> them on remote hosts.
>
> It is also now possible to specify public and cluster networks when
> creating a new ceph.conf file.
>
> Make sure you update!
>
>
> -Alfredo
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Follow Me: @Scottix
http://about.me/scottix
scot...@gmail.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] [ANN] ceph-deploy 1.5.14 released

2014-09-10 Thread Alfredo Deza
Hi All,

There is a new bug-fix release of ceph-deploy that helps prevent the
environment variable problems that can sometimes arise when depending on
them on remote hosts.

It is also now possible to specify public and cluster networks when
creating a new ceph.conf file.
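For reference, the networks support can be exercised along these lines; the hostnames and CIDRs are placeholders, and it is worth checking ceph-deploy new --help on your version for the exact flag names:

ceph-deploy new --public-network 192.168.1.0/24 --cluster-network 192.168.2.0/24 mon1 mon2 mon3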

Make sure you update!


-Alfredo
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about RGW

2014-09-10 Thread Sage Weil
[Moving this to ceph-devel, where you're more likely to get a response 
from a developer!]

On Wed, 10 Sep 2014, baijia...@126.com wrote:

> When I read the RGW code, I can't understand master_ver inside struct
> rgw_bucket_dir_header.
> Who can explain this struct, in particular master_ver and stats? Thanks
>  
> 
> 
> baijia...@126.com
> 
> ___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph-deploy bug; CentOS 7, Firefly

2014-09-10 Thread Alfredo Deza
This should get fixed pretty soon, we already have an open issue in
the tracker for it: http://tracker.ceph.com/issues/9376

However, it is easy to workaround this issue with ceph-deploy by
passing explicitly the repo url so that it will
skip the step to fetch this:

ceph-deploy install --repo-url http://ceph.com/rpm-firefly/el7/ stor
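Alternatively, until the package naming is sorted out, the release package that actually exists in the el7 repo can be installed by hand first (this is the file name reported in this thread), after which ceph-deploy install can proceed:

rpm -Uvh --replacepkgs http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.centos.noarch.rpm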

On Wed, Sep 10, 2014 at 12:38 AM, Piers Dawson-Damer  wrote:
> Ceph-deploy wants;
>
> ceph-release-1-0.el7.noarch.rpm
>
> But the contents of ceph.com/rpm-firefly/el7/noarch only include the file;
>
> ceph-release-1-0.el7.centos.noarch.rpm
>
> Piers
>
>
>
> [stor][DEBUG ] Retrieving
> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
> [stor][WARNIN] curl: (22) The requested URL returned error: 404 Not Found
> [stor][WARNIN] error: skipping
> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm -
> transfer failed
> [stor][ERROR ] RuntimeError: command returned non-zero exit status: 1
> [ceph_deploy][ERROR ] RuntimeError: Failed to execute command: rpm -Uvh
> --replacepkgs
> http://ceph.com/rpm-firefly/el7/noarch/ceph-release-1-0.el7.noarch.rpm
>
>
>
>
> Linux stor.domain 3.16.2-1.el7.elrepo.x86_64 #1 SMP Sat Sep 6 11:34:36 EDT
> 2014 x86_64 x86_64 x86_64 GNU/Linux
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] Cache Pool writing too much on ssds, poor performance?

2014-09-10 Thread Andrei Mikhailovsky

Hello guys, 

I am experimenting with a cache pool and running some tests to see how adding the 
cache pool improves the overall performance of our small cluster. 

While testing I've noticed that the cache pool seems to be writing too much 
to the cache pool ssds. Not sure what the issue is here; perhaps someone 
could help me understand what is going on. 

My test cluster is: 
2 x OSD servers (Each server has: 24GB ram, 12 cores, 8 hdd osds, 2 ssds 
journals, 2 ssds for cache pool, 40gbit/s infiniband network capable of 
25gbit/s over ipoib). Cache pool is set to 500GB with replica of 2. 
4 x host servers (128GB ram, 24 core, 40gbit/s infiniband network capable of 
12gbit/s over ipoib) 
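
For context, a writeback cache tier of the kind described above is normally wired up roughly as follows; the pool names, the 500GB target and the hit_set values are assumptions chosen to match this description rather than the exact commands used on this cluster:

ceph osd tier add rbd cache-pool
ceph osd tier cache-mode cache-pool writeback
ceph osd tier set-overlay rbd cache-pool
ceph osd pool set cache-pool size 2
ceph osd pool set cache-pool hit_set_type bloom
ceph osd pool set cache-pool hit_set_count 1
ceph osd pool set cache-pool hit_set_period 3600
ceph osd pool set cache-pool target_max_bytes 536870912000    # ~500 GB

One general point worth keeping in mind when reading the results below: with this generation of cache tiering, a read that misses the cache promotes the object, i.e. it is first written into the cache pool, which is a common source of writes on the cache SSDs during read-only benchmarks.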

So, my test is: 
Simple tests using the following command: "dd if=/dev/vda of=/dev/null bs=4M 
count=2000 iflag=direct". I am concurrently starting this command on 10 virtual 
machines which are running on 4 host servers. The aim is to monitor the use of 
cache pool when reading the same data over and over again. 


Running the above command for the first time does what I was expecting. The 
osds are doing a lot of reads, the cache pool does a lot of writes (around 
250-300MB/s per ssd disk) and no reads. The dd results for the guest vms are 
poor. The results of the "ceph -w" shows consistent performance across the 
time. 

Running the above for the second and consequent times produces IO patterns 
which I was not expecting at all. The hdd osds are not doing much (this part I 
expected), the cache pool still does a lot of writes and very little reads! The 
dd results have improved just a little, but not much. The results of the "ceph 
-w" shows performance breaks over time. For instance, I have a peak of 
throughput in the first couple of seconds (data is probably coming from the osd 
server's ram at high rate). After the peak throughput has finished, the ceph 
reads are done in the following way: 2-3 seconds of activity followed by 2 
seconds of inactivity, and it keeps doing that throughout the length of the 
test. So, to put the numbers in perspective, when running tests over and over 
again I would get around 2000 - 3000MB/s for the first two seconds, followed by 
0MB/s for the next two seconds, followed by around 150-250MB/s over 2-3 
seconds, followed by 0MB/s for 2 seconds, followed 150-250MB/s over 2-3 
seconds, followed by 0MB/s over 2 secods, and the pattern repeats until the 
test is done. 


I kept running the dd command about 15-20 times and observed the same 
behaviour. The cache pool does mainly writes (around 200MB/s per ssd) when 
guest vms are reading the same data over and over again. There is very little 
read IO (around 20-40MB/s). Why am I not getting high read IO? I have expected 
the 80GB of data that is being read from the vms over and over again to be 
firmly recognised as the hot data and kept in the cache pool and read from it 
when guest vms request the data. Instead, I mainly get writes on the cache pool 
ssds and I am not really sure where these writes are coming from as my hdd osds 
are being pretty idle. 

From the overall tests so far, introducing the cache pool has drastically 
slowed down my cluster (by as much as 50-60%). 

Thanks for any help 

Andrei 
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] question about librbd io

2014-09-10 Thread Josh Durgin

On 09/09/2014 07:06 AM, yuelongguang wrote:

hi, josh.durgin:
i want to know how librbd launches io requests.
use case:
inside the vm, i use fio to test the rbd disk's io performance.
fio's parameters are bs=4k, direct io, qemu cache=none.
in this case, if librbd just sends what it gets from the vm, i mean no
gather/scatter, is the ratio io inside vm : io at librbd : io at osd
filestore = 1:1:1?


If the rbd image is not a clone, the io issued from the vm's block
driver will match the io issued by librbd. With caching disabled
as you have it, the io from the OSDs will be similar, with some
small amount extra for OSD bookkeeping.
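
A quick way to confirm whether an image is a clone (and therefore whether the 1:1 mapping above applies) is to look for a parent in its metadata; the pool and image names here are placeholders:

rbd info rbd/vm-disk
# a cloned image shows a "parent:" line in this output; a non-clone does not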

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] osd cpu usage is bigger than 100%

2014-09-10 Thread yuelongguang
hi,all
i am testing rbd performance. right now there is only one vm, which is using rbd as 
its disk, and inside it fio is doing r/w.
the big difference is that i set a big iodepth rather than iodepth=1.
 
what do you think? which part is using up the cpu? i want to find the root 
cause.
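
One way to narrow down where that CPU is going, assuming an OSD id of 0, the default admin socket path, and the pid shown in the top output below (all of which may differ on your node):

# internal ceph counters, including op queue and filestore latencies
ceph daemon osd.0 perf dump

# confirm the thread-related settings the running daemon actually uses
ceph daemon osd.0 config show | grep threads

# sample the hot functions inside the osd process
perf top -p 4312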
 
 
---default options
osd_op_threads": "2",
  "osd_disk_threads": "1",
  "osd_recovery_threads": "1",
"filestore_op_threads": "2",
 
 
thanks
 
 
 
top - 19:50:08 up 1 day, 10:26,  2 users,  load average: 1.55, 0.97, 0.81
Tasks:  97 total,   1 running,  96 sleeping,   0 stopped,   0 zombie
Cpu(s): 37.6%us, 14.2%sy,  0.0%ni, 37.0%id,  9.4%wa,  0.0%hi,  1.3%si,  0.5%st
Mem:   1922540k total,  1820196k used,   102344k free,    23100k buffers
Swap:  1048568k total,    91724k used,   956844k free,  1052292k cached

  PID USER  PR  NI  VIRT  RES  SHR S  %CPU %MEM     TIME+  COMMAND
 4312 root  20   0 1100m 337m 5192 S 107.3 18.0  88:33.27  ceph-osd
 1704 root  20   0  514m 272m 3648 S   0.7 14.5   3:27.19  ceph-mon
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-10 Thread yrabl

On 09/07/2014 07:15 PM, Loic Dachary wrote:
> 
> 
> On 07/09/2014 14:11, yr...@redhat.com wrote:
>> Hi,
>> 
>> I'm trying to install Ceph Firefly on RHEL 7 on my three of my
>> storage servers. Each server has 17 HD, thus I thought each will
>> have 17 OSD's and I've installed monitors on all three servers.
>> 
>> After the installation I get this output # ceph health 
>> HEALTH_WARN 89 pgs degraded; 67 pgs incomplete; 67 pgs stuck
>> inactive; 192 pgs stuck unclean
>> 
>> Is there a known issue with this distribution, if not, how can I 
>> troubleshoot and provide more info?
> 
> What does ceph osd dump say? I'm loicd on irc.oftc.net#ceph if
> you want to chat ;-)
> 
> Cheers
> 
>> 
>> thanks, Yogev Rabl
>> ___ 
>> ceph-users mailing list ceph-users@lists.ceph.com 
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>> 
> 

Hi Loic,

First, thanks for your response and sorry for my lateness.
Second, here's a link to the osd dump:
mind you, I've made some changes to Ceph and reinstalled
everything on a single server, just for testing.

the dump is here:
http://pastebin.com/SLn1LAjk

thanks,
Yogev Rabl
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] region creation is failing

2014-09-10 Thread Santhosh Fernandes
Thanks John, I am following the same link.

Do I need to follow the "Create a Gateway Configuration" section if I am not
using S3?

Still getting this error :
radosgw-admin region default --rgw-region=in --name client.radosgw.in-east-1
failed to init region: (2) No such file or directory
2014-09-10 17:02:04.406142 7f71779877c0 -1 failed reading region info from
.in.rgw.root:region_info.in: (2) No such file or directory

Thanks,
santhosh
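
The "failed reading region info" error generally means the region object has not yet been written to the region root pool, so the ordering of the set/default steps matters. A sketch of the usual sequence for this vintage of radosgw-admin, reusing the names from this thread (compare it against the federated-config documentation before relying on it):

radosgw-admin region set --infile in.json --name client.radosgw.in-east-1
radosgw-admin region default --rgw-region=in --name client.radosgw.in-east-1
radosgw-admin regionmap update --name client.radosgw.in-east-1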





On Sat, Sep 6, 2014 at 3:47 AM, John Wilkins 
wrote:

> librados indicates communication between radosgw and the Ceph Storage
> Cluster. So the authentication error is likely due to the key you have set
> up using this procedure:
> http://ceph.com/docs/master/radosgw/federated-config/#create-a-keyring
>
> Check to see if you have the keys you generated imported to your ceph
> storage cluster:
>
> ceph auth list
>
> Ensure that there are a matching set of keys in the keyring and from ceph
> auth list.
>
> Also, ensure that you have appropriate permissions on the keyring.
>
> You might also check to ensure that your ceph.conf file has
> client.radosgw.in-east-1 as an entry and that it has a keyring entry
> pointing to your keyring.
>
> Let me know if this helps.
>
>
>
>
>
> On Fri, Sep 5, 2014 at 1:27 AM, Santhosh Fernandes <
> santhosh.fernan...@gmail.com> wrote:
>
>> Hi All,
>>
>> I am trying to configure Ceph with 2 OSD, one MON, One ADMIN, and One 
>> ObjectGW nodes.
>>
>> radosgw-admin region set --infile in.json --name client.radosgw.in-east-1
>> 2014-09-05 13:48:45.133983 7f7dda4c57c0  0 librados: 
>> client.radosgw.in-east-1 authentication error (1) Operation not permitted
>> couldn't init storage provider
>>
>> Can anyone help me to resolve this issue?
>>
>> Thank you!
>>
>> Regards,
>> Santhosh
>>
>>
>>
>> ___
>> ceph-users mailing list
>> ceph-users@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>>
>>
>
>
> --
> John Wilkins
> Senior Technical Writer
> Inktank
> john.wilk...@inktank.com
> (415) 425-9599
> http://inktank.com
>
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?

2014-09-10 Thread Andrei Mikhailovsky
Keith, 

I think the hypervisor / infrastructure orchestration layer should be able to 
handle proper snapshotting with io freezing. For instance, we use CloudStack 
and you can set up automatic snapshots and snapshot retention policies. 

Cheers 

Andrei 
- Original Message -

From: "Ilya Dryomov"  
To: "Keith Phua"  
Cc: "Andrei Mikhailovsky" , ceph-users@lists.ceph.com, 
y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg 
Sent: Wednesday, 10 September, 2014 11:51:04 AM 
Subject: Re: [ceph-users] Best practices on Filesystem recovery on RBD block 
volume? 

On Wed, Sep 10, 2014 at 2:45 PM, Keith Phua  wrote: 
> Hi Andrei, 
> 
> Thanks for the suggestion. 
> 
> But rbd volume snapshots may only work if the filesystem is in a 
> consistent state, which means no IO during snapshotting. With cronjob 
> snapshotting, we usually have no control over the client doing any IOs. Having 

xfs_freeze -f /mnt 
 
xfs_freeze -u /mnt 

Thanks, 

Ilya 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] max_bucket limit -- safe to disable?

2014-09-10 Thread Daniel Schneller
On 09 Sep 2014, at 21:43, Gregory Farnum  wrote:


> Yehuda can talk about this with more expertise than I can, but I think
> it should be basically fine. By creating so many buckets you're
> decreasing the effectiveness of RGW's metadata caching, which means
> the initial lookup in a particular bucket might take longer.

Thanks for your thoughts. With “initial lookup in a particular bucket”
do you mean accessing any of the objects in a bucket? If we directly
access the object (not enumerating the buckets content), would that
still be an issue?
Just trying to understand the inner workings a bit better to make
more educated guesses :)


> The big concern is that we do maintain a per-user list of all their
> buckets — which is stored in a single RADOS object — so if you have an
> extreme number of buckets that RADOS object could get pretty big and
> become a bottleneck when creating/removing/listing the buckets. You

Alright. Listing buckets is no problem, that we don’t do. Can you
say what “pretty big” would be in terms of MB? How much space does a
bucket record consume in there? Based on that I could run a few numbers.


> should run your own experiments to figure out what the limits are
> there; perhaps you have an easy way of sharding up documents into
> different users.

Good advice. We can do that per distributor (an org unit in our
software) to at least compartmentalize any potential locking issues
in this area to that single entity. Still, there would be quite
a lot of buckets/objects per distributor, so some more detail on
the above items would be great.

Thanks a lot!


Daniel
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?

2014-09-10 Thread Ilya Dryomov
On Wed, Sep 10, 2014 at 2:45 PM, Keith Phua  wrote:
> Hi Andrei,
>
> Thanks for the suggestion.
>
> But rbd volume snapshots may only work if the filesystem is in a
> consistent state, which means no IO during snapshotting.  With cronjob
> snapshotting, we usually have no control over the client doing any IOs.  Having

xfs_freeze -f /mnt

xfs_freeze -u /mnt

Thanks,

Ilya
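
Putting the freeze and the snapshot together, a minimal consistent-snapshot sketch for an XFS filesystem on an RBD image looks like this; the pool, image and mount point are placeholders:

xfs_freeze -f /mnt
rbd snap create rbd/myimage@nightly-$(date +%Y%m%d)
xfs_freeze -u /mnt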
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Ceph on RHEL 7 with multiple OSD's

2014-09-10 Thread BG
Michal/Marco,

thanks for your help, the issue I had was indeed with firewalld blocking ports.

Still adjusting to the changes in el7 but at least now I have an "active+clean"
cluster to start playing with ;)
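
For anyone else hitting this on el7, the ports Ceph needs can be opened in firewalld roughly like this; 6800-7300 is the usual default range for OSDs and MDS, adjust to your own settings:

firewall-cmd --permanent --add-port=6789/tcp        # monitors
firewall-cmd --permanent --add-port=6800-7300/tcp   # OSDs / MDS
firewall-cmd --reload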

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?

2014-09-10 Thread Keith Phua
Hi Andrei, 

Thanks for the suggestion. 

But rbd volume snapshots may only work if the filesystem is in a consistent 
state, which means no IO during snapshotting. With cronjob snapshotting, we 
usually have no control over the client doing any IOs. Having said that, regular 
snapshotting is still better than none, and we may find some snapshots that are 
usable. 

Any other suggestions is greatly appreciated. 

Regards, 

Keith 

- Original Message -

> From: "Andrei Mikhailovsky" 
> To: "Keith Phua" 
> Cc: y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg,
> ceph-users@lists.ceph.com
> Sent: Wednesday, September 10, 2014 6:21:44 PM
> Subject: Re: [ceph-users] Best practices on Filesystem recovery on RBD block
> volume?

> Keith,

> You should consider doing regular rbd volume snapshots and keep them for N
> amount of hours/days/months depending on your needs.

> Cheers

> Andrei
> - Original Message -

> From: "Keith Phua" 
> To: ceph-users@lists.ceph.com
> Cc: y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg
> Sent: Wednesday, 10 September, 2014 3:22:53 AM
> Subject: [ceph-users] Best practices on Filesystem recovery on RBD block
> volume?

> Dear ceph-users,

> Recently we encountered an XFS filesystem corruption on a NAS box.
> After repairing the filesystem, we discovered the files were gone. This
> triggered some questions with regard to filesystems on RBD block devices, on
> which I hope the community can enlighten me.

> 1. If a local filesystem on a rbd block is corrupted, is it fair to say that
> regardless of how many replicated copies we specified for the pool, unless
> the filesystem is properly repaired and recovered, we may not get our data
> back?

> 2. If the above statement is true, does it mean that severe filesystem
> corruption on an RBD block constitutes a single point of failure, since
> filesystem corruption can happen when the RBD client is not properly shut
> down or due to a kernel bug?

> 3. Other than existing best practices for a filesystem recovery, does ceph
> have any other best practices for filesystem on RBD which we can adopt for
> data recovery?

> Thanks in advance.

> Regards,

> Keith
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Best practices on Filesystem recovery on RBD block volume?

2014-09-10 Thread Andrei Mikhailovsky

Keith, 

You should consider doing regular rbd volume snapshots and keep them for N 
amount of hours/days/months depending on your needs. 

Cheers 

Andrei 
- Original Message -

From: "Keith Phua"  
To: ceph-users@lists.ceph.com 
Cc: y...@nus.edu.sg, cheechi...@nus.edu.sg, eng...@nus.edu.sg 
Sent: Wednesday, 10 September, 2014 3:22:53 AM 
Subject: [ceph-users] Best practices on Filesystem recovery on RBD block 
volume? 

Dear ceph-users, 

Recently we encountered an XFS filesystem corruption on a NAS box. After 
repairing the filesystem, we discovered the files were gone. This triggered some 
questions with regard to filesystems on RBD block devices, on which I hope the 
community can enlighten me. 

1. If a local filesystem on an rbd block is corrupted, is it fair to say that, 
regardless of how many replicated copies we specified for the pool, unless the 
filesystem is properly repaired and recovered we may not get our data back? 

2. If the above statement is true, does it mean that severe filesystem 
corruption on an RBD block constitutes a single point of failure, since 
filesystem corruption can happen when the RBD client is not properly shut 
down or due to a kernel bug? 

3. Other than existing best practices for filesystem recovery, does ceph have 
any other best practices for filesystems on RBD which we can adopt for data 
recovery? 


Thanks in advance. 

Regards, 

Keith 
___ 
ceph-users mailing list 
ceph-users@lists.ceph.com 
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com 

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] why one osd-op from client can get two osd-op-reply?

2014-09-10 Thread yuelongguang
hi,all
i recently debugged ceph rbd, and the log shows that one write to the osd can get 
two of its replies.
the difference between them is seq.
why?
 
thanks
---log-
reader got message 6 0x7f58900010a0 osd_op_reply(15 
rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 
write_size 4194304,write 0~3145728] v211'518 uv518 ack = 0) v6
2014-09-10 08:47:32.348213 7f58bc16b700 20 -- 10.58.100.92:0/1047669 queue 
0x7f58900010a0 prio 127
2014-09-10 08:47:32.348230 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader reading tag...
2014-09-10 08:47:32.348245 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got MSG
2014-09-10 08:47:32.348257 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got envelope type=43 src osd.1 front=247 data=0 off 0
2014-09-10 08:47:32.348269 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader wants 247 from dispatch throttler 247/104857600
2014-09-10 08:47:32.348286 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got front 247
2014-09-10 08:47:32.348303 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).aborted = 0
2014-09-10 08:47:32.348312 7f58bc16b700 20 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got 247 + 0 + 0 byte message
2014-09-10 08:47:32.348332 7f58bc16b700 10 check_message_signature: seq # = 7 
front_crc_ = 3699418201 middle_crc = 0 data_crc = 0
2014-09-10 08:47:32.348369 7f58bc16b700 10 -- 10.58.100.92:0/1047669 >> 
10.154.249.4:6800/2473 pipe(0xfae6d0 sd=6 :64407 s=2 pgs=133 cs=1 l=1 
c=0xfae940).reader got message 7 0x7f5890003660 osd_op_reply(15 
rbd_data.19d92ae8944a.0001 [set-alloc-hint object_size 4194304 
write_size 4194304,write 0~3145728] v211'518 uv518 ondisk = 0) v6
___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


[ceph-users] bad performance of leveldb on 0.85

2014-09-10 Thread 廖建锋
Dear all,
has anybody compared leveldb performance between 0.80.5 and 0.85?
  In my previous cluster (0.80.5), the average writing speed was 10MB-15MB.
  In the current cluster (0.85), the average writing speed is 5MB-8MB.

What is going on? Could the superblock of the leveldb disk cause this?

___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com


Re: [ceph-users] Problem with customized crush rule for EC pool

2014-09-10 Thread Loic Dachary
Right : I thought about data loss but what you're after is data availability. 
Thanks for explaining :-)

On 10/09/2014 04:29, Lei Dong wrote:
> Yes, my goal is to make it so that losing 3 OSDs does not lose data.
> 
> My 6 racks may not be in different rooms, but they use 6 different
> switches, so I want that when any switch is down or unreachable, my data can
> still be accessed. I think it’s not an unrealistic requirement.
> 
> 
> Thanks!
> 
> LeiDong.
> 
> On 9/9/14, 10:02 PM, "Loic Dachary"  wrote:
> 
>>
>>
>> On 09/09/2014 14:21, Lei Dong wrote:
>>> Thanks loic!
>>>
>>> Actually I've found that increasing choose_local_fallback_tries can
>>> help (chooseleaf_tries helps not so significantly), but I'm afraid that when
>>> an osd failure happens and a new acting set is needed, it may fail to
>>> find enough racks again. So I'm trying to find a more guaranteed way in
>>> case of osd failure.
>>>
>>> My profile is nothing special other than k=8 m=3.
>>
>> So your goal is to make it so that losing 3 OSDs simultaneously does not mean
>> losing data. By forcing each rack to hold at most 2 OSDs for a given
>> object, you make it so that losing a full rack does not mean losing data.
>> Are these racks in the same room in the datacenter ? In the event of a
>> catastrophic failure that permanently destroys one rack, how realistic is
>> it that the other racks are unharmed ? If the rack is destroyed by fire
>> and is in a row with the six other racks, there is a very high chance
>> that the other racks will also be damaged. Note that I am not a system
>> architect nor a system administrator : I may be completely wrong ;-) If
>> it turns out that the probability of a single rack failing entirely and
>> independently of the others is negligible, it may not be necessary to
>> make a complex ruleset; instead use the default ruleset.
>>
>> My 2cts
>>
>>>
>>> Thanks again!
>>>
>>> Leidong
>>>
>>>
>>>
>>>
>>>
 On 2014年9月9日, at 下午7:53, "Loic Dachary"  wrote:

 Hi,

 It is indeed possible that mapping fails if there are just enough
 racks to match the constraint. And the probability of a bad mapping
 increases when the number of PG increases because there is a need for
 more mapping. You can tell crush to try harder with

 step set_chooseleaf_tries 10

 Be careful though : increasing this number will change mapping. It
 will not just fix the bad mappings you're seeing, it will also change
 the mappings that succeeded with a lower value. Once you've set this
 parameter, it cannot be modified.

 Would you mind sharing the erasure code profile you plan to work with ?

 Cheers

> On 09/09/2014 12:39, Lei Dong wrote:
> Hi ceph users:
>
> I want to create a customized crush rule for my EC pool (with
> replica_size = 11) to distribute replicas into 6 different Racks.
>
> I use the following rule at first:
>
> Step take default  // root
> Step choose firstn 6 type rack// 6 racks, I have and only have 6 racks
> Step chooseleaf indep 2 type osd // 2 osds per rack
> Step emit
>
> It looks fine and works fine when the PG num is small.
> But when the pg num increases, there are always some PGs which cannot
> take all the 6 racks.
> It looks like “Step choose firstn 6 type rack” sometimes returns only
> 5 racks.
> After some investigation, I think it may be caused by collision of
> choices.
>
> Then I come up with another solution to solve collision like this:
>
> Step take rack0
> Step chooseleaf indep 2 type osd
> Step emit
> Step take rack1
> ….
> (manually take every rack)
>
> This won’t cause rack collision, because I specify rack by name at
> first. But the problem is that osd in rack0 will always be primary osd
> because I choose from rack0 first.
>
> So the question is what is the recommended way to meet such a need
> (distribute 11 replicas into 6 racks evenly in case of rack failure)?
>
>
> Thanks!
> LeiDong
>
>
>
>
> ___
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com

 -- 
 Loïc Dachary, Artisan Logiciel Libre

>>
>> -- 
>> Loïc Dachary, Artisan Logiciel Libre
>>
> 

-- 
Loïc Dachary, Artisan Logiciel Libre
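
For readers who want to try the combination quoted above in this thread (the six-rack rule plus the retry steps), a single rule putting the pieces together looks roughly like this. The rule name, ruleset id, min/max size and the set_choose_tries value are placeholders, and indep is used throughout since the pool is erasure coded:

rule ecpool-6racks {
        ruleset 1
        type erasure
        min_size 3
        max_size 11
        step set_chooseleaf_tries 10
        step set_choose_tries 100
        step take default
        step choose indep 6 type rack
        step chooseleaf indep 2 type osd
        step emit
}

crushtool can then check the mapping offline before the map is injected:

crushtool -c crushmap.txt -o crushmap.bin
crushtool -i crushmap.bin --test --rule 1 --num-rep 11 --show-bad-mappings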



___
ceph-users mailing list
ceph-users@lists.ceph.com
http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com