Re: Ceph_objectstore_bench crashed on keyvaluestore bench with ceph master branch

2015-12-02 Thread Haomai Wang
Thanks! Fixed in https://github.com/ceph/ceph/pull/6783. Please review.
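For reference, roughly how the enqueue side is expected to keep the two queues in sync (a sketch along the lines of Finisher::queue(); treat it as illustrative rather than the exact tree code): every entry placed into finisher_queue_rval gets a NULL placeholder pushed into finisher_queue, so the assert below should only fire if a NULL reaches the queue without its matching rval entry.

void Finisher::queue(Context *c, int r) {
  finisher_lock.Lock();
  if (finisher_queue.empty())
    finisher_cond.Signal();
  if (r) {
    // non-zero completion code: park (Context*, r) in the rval queue and
    // leave a NULL placeholder in the main queue to preserve ordering
    finisher_queue_rval.push_back(std::make_pair(c, r));
    finisher_queue.push_back(NULL);
  } else {
    finisher_queue.push_back(c);
  }
  if (logger)
    logger->inc(l_finisher_queue_len);
  finisher_lock.Unlock();
}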

On Thu, Dec 3, 2015 at 3:19 AM, James (Fei) Liu-SSI
 wrote:
> Hi Haomai,
> I happened to run ceph_objectstore_bench against the key/value store on the
> master branch. It always crashes at finisher_thread_entry:
> assert(!ls_rval.empty());
>
> It looks like the completion has a NULL entry in the finisher queue but no
> corresponding entry in finisher_queue_rval. I checked, and that is actually
> the case.
>
> Would you mind telling us in which case a NULL entry in the finisher queue can
> have no matching entry in finisher_queue_rval?
>
>   Thanks,
>
>
>   Please refer to issue http://tracker.ceph.com/issues/13961
>
>
> Regards,
> James
>
> void *Finisher::finisher_thread_entry()
> {
>   finisher_lock.Lock();
>   ldout(cct, 10) << "finisher_thread start" << dendl;
>
>   utime_t start;
>   while (!finisher_stop) {
> /// Every time we are woken up, we process the queue until it is empty.
> while (!finisher_queue.empty()) {
>   if (logger)
> start = ceph_clock_now(cct);
>   // To reduce lock contention, we swap out the queue to process.
>   // This way other threads can submit new contexts to complete while we 
> are working.
>   vector<Context*> ls;
>   list<pair<Context*,int> > ls_rval;
>   ls.swap(finisher_queue);
>   ls_rval.swap(finisher_queue_rval);
>   finisher_running = true;
>   finisher_lock.Unlock();
>   ldout(cct, 10) << "finisher_thread doing " << ls << dendl;
>   // ldout(cct, 10) << "...Finisher thread is calling again over here" << dendl;
>
>   // Now actually process the contexts.
>   for (vector<Context*>::iterator p = ls.begin();
>p != ls.end();
>++p) {
> if (*p) {
>   (*p)->complete(0);
> } else {
>   // When an item is NULL in the finisher_queue, it means
>   // we should instead process an item from finisher_queue_rval,
>   // which has a parameter for complete() other than zero.
>   // This preserves the order while saving some storage.
>   assert(!ls_rval.empty());
>   Context *c = ls_rval.front().first;
>   c->complete(ls_rval.front().second);
>   ls_rval.pop_front();
> }
> if (logger) {
>   logger->dec(l_finisher_queue_len);
>   logger->tinc(l_finisher_complete_lat, ceph_clock_now(cct) - start);
> }
>   }
>   ldout(cct, 10) << "finisher_thread done with " << ls << dendl;
>   ls.clear();
>
>   finisher_lock.Lock();
>   finisher_running = false;
> }
> ldout(cct, 10) << "finisher_thread empty" << dendl;
> finisher_empty_cond.Signal();
> if (finisher_stop)
>   break;
>
> ldout(cct, 10) << "finisher_thread sleeping" << dendl;
> finisher_cond.Wait(finisher_lock);
>   }
>   // If we are exiting, we signal the thread waiting in stop(),
>   // otherwise it would never unblock
>   finisher_empty_cond.Signal();
>
>   ldout(cct, 10) << "finisher_thread stop" << dendl;
>   finisher_stop = false;
>   finisher_lock.Unlock();
>   return 0;
> }
>



-- 
Best Regards,

Wheat


Re: queue_transaction interface + unique_ptr

2015-12-02 Thread Haomai Wang
On Thu, Dec 3, 2015 at 8:17 AM, Somnath Roy  wrote:
> Hi Sage/Sam,
> As discussed in today's performance meeting , I am planning to change the 
> queue_transactions() interface to the following.
>
>   int queue_transactions(Sequencer *osr, list<TransactionRef>& tls,
>  Context *onreadable, Context *ondisk=0,
>  Context *onreadable_sync=0,
>  TrackedOpRef op = TrackedOpRef(),
>  ThreadPool::TPHandle *handle = NULL) ;
>
> typedef unique_ptr<Transaction> TransactionRef;
>
>
> IMO, there is a problem with this approach.
>
> The interfaces like apply_transaction(), queue_transaction(), etc. (basically
> the interfaces that take a single transaction pointer and internally form a
> list to call queue_transactions()) also need to be changed to accept
> TransactionRef, which will be *bad*. The reason is that while preparing the
> list internally we need to move the unique_ptr, and callers won't be aware of
> that.
>
> Also, changing every interface (and caller) that takes Transaction* will
> produce a very big delta (and a big testing effort as well).
>
> So, should we *reconsider* keeping both queue_transactions() interfaces
> co-existing and calling the new one from the IO path?

I like this; isolating any shared_ptr area is a hard job. Looking forward to it!
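For reference, a minimal sketch of the move-semantics surprise described above (hypothetical helper names and a stand-in body; the real queue_transactions() takes more arguments):

#include <list>
#include <memory>

struct Transaction { /* ... */ };
typedef std::unique_ptr<Transaction> TransactionRef;

// the proposed list-taking interface (stand-in body, just for illustration)
int queue_transactions(std::list<TransactionRef>& tls) {
  return (int)tls.size();
}

// single-transaction convenience wrapper, keeping a reference parameter
int queue_transaction(TransactionRef& t) {
  std::list<TransactionRef> tls;
  tls.push_back(std::move(t));   // we must move the unique_ptr to build the list...
  return queue_transactions(tls);
}

int main() {
  TransactionRef t(new Transaction);
  queue_transaction(t);
  // ...so after the call t is empty, even though nothing at the call site
  // hints that ownership was taken.  That silent transfer is the "*bad*"
  // part pointed out above.
  return 0;
}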

>
> Thanks & Regards
> Somnath
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Compiling for FreeBSD

2015-11-29 Thread Haomai Wang
On Mon, Nov 30, 2015 at 1:44 AM, Willem Jan Withagen  wrote:
> Hi,
>
> Not unlike many others running FreeBSD I'd like to see if I/we can get
> Ceph to build and run on FreeBSD. If not all components than at least
> certain components.
>
> With compilation I get quite some way, even with the CLANG compiler.
> But I run into the obvious parts where Linux goes in a different direction
> from what is available on FreeBSD.
>
> If I google one of the reports I ran into, I get at:
> http://lists.ceph.com/pipermail/ceph-commit-ceph.com/2013-November/005812.html
>
> Which sort of suggests that some of the code for FreeBSD has again been
> purged from the tree...
>
> Is that to remove FreeBSD completely from the tree?
> Or just because that code did not work?

I guess we still want FreeBSD support. Which version are you trying to
compile? I'd like to help make the BSD build work :-)

>
> Thanx,
> --WjW
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: why my cluster become unavailable (min_size of pool)

2015-11-26 Thread Haomai Wang
On Thu, Nov 26, 2015 at 3:54 PM, hzwulibin <hzwuli...@gmail.com> wrote:
> Hi, Sage
>
> I has a question about min_size of pool.
>
> The default value of min_size is 2, but with this setting, when two OSDs are
> down (meaning two replicas are lost) at the same time, IO will be blocked.
> We want to set min_size to 1 in our production environment, as we think it is
> a normal case for two OSDs (on different hosts, of course) to be down at the
> same time.

min_size = 2 means each object must have at least two available copies in this
pool before IO is served. It mainly reduces the risk that permanent storage
media corruption causes actual data loss. That means if min_size is 1 and, in
this degraded case, one more OSD is permanently corrupted, data will be lost.
If min_size is 2, it takes at least two such failures.

>
> So is there any potential problem with this setting?
>
> We use 0.80.10 version.
>
> Thanks!
>
>
> --
> hzwulibin
> 2015-11-26
>
> -
> 发件人:"hzwulibin"<hzwuli...@gmail.com>
> 发送日期:2015-11-23 09:00
> 收件人:Sage Weil,Haomai Wang
> 抄送:ceph-devel
> 主题:Re: why my cluster become unavailable
>
> Hi, Sage
>
> Thanks! Will try it when next testing!
>
> --
> hzwulibin
> 2015-11-23
>
> -
> From: Sage Weil <s...@newdream.net>
> Date: 2015-11-22 01:49
> To: Haomai Wang
> Cc: Libin Wu, ceph-devel
> Subject: Re: why my cluster become unavailable
>
> On Sun, 22 Nov 2015, Haomai Wang wrote:
>> On Thu, Nov 19, 2015 at 11:26 PM, Libin Wu <hzwuli...@gmail.com> wrote:
>> > Hi, cepher
>> >
>> > I have a cluster of 6 OSD server, every server has 8 OSDs.
>> >
>> > I out 4 OSDs on every server, then my client io is blocking.
>> >
>> > I reboot my client and then create a new rbd device, but the new
>> > device also can't write io.
>> >
>> > Yeah, i understand that some data may lost as threee replicas of some
>> > object were lost, but why the cluster become unavailable?
>> >
>> > There 80 incomplete pg and 4 down+incomplete pg.
>> >
>> > Any solution i could solve the problem?
>>
>> Yes, if you doesn't have a special crushmap to control the data
>> replcement policy, pg will lack of necessary metadata to boot. If need
>> to readd outed osds or force remove pg which is incomplete(hope it's
>> just a test).
>
> Is min_size 2 or 1?  Reducing it to 1 will generally clear some of the
> incomplete pgs.  Just remember to raise it back to 2 after the cluster
> recovers.
>
> sage
>
>



-- 
Best Regards,

Wheat


Re: why my cluster become unavailable

2015-11-21 Thread Haomai Wang
On Thu, Nov 19, 2015 at 11:26 PM, Libin Wu  wrote:
> Hi, cephers
>
> I have a cluster of 6 OSD servers; every server has 8 OSDs.
>
> I marked 4 OSDs out on every server, and then my client IO is blocked.
>
> I rebooted my client and then created a new rbd device, but the new
> device also can't write IO.
>
> Yeah, I understand that some data may be lost since all three replicas of some
> objects were lost, but why does the cluster become unavailable?
>
> There are 80 incomplete PGs and 4 down+incomplete PGs.
>
> Is there any solution to this problem?

Yes. If you don't have a special crushmap to control the data placement policy,
the PGs will lack the necessary metadata to boot. You need to re-add the outed
OSDs or force-remove the PGs which are incomplete (hope it's just a test).

>
> Thanks!
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: RE: journal alignment

2015-11-20 Thread Haomai Wang
On Fri, Nov 20, 2015 at 9:08 PM, Sage Weil <s...@newdream.net> wrote:
> On Fri, 20 Nov 2015, Haomai Wang wrote:
>> On Fri, Nov 20, 2015 at 7:41 PM, Sage Weil <s...@newdream.net> wrote:
>> > On Fri, 20 Nov 2015, changtao381 wrote:
>> >> Hi All,
>> >>
>> >> Thanks for you apply!
>> >>
>> >> If directioIO + async IO requirement that alignment, it shouldn't aligned 
>> >> by PAGE for each journal entry.
>> >> For it may write many entries of journal once time
>> >
>> > We also want to avoid copying the data around in memory to change the
>> > alignment.  The messenger takes care to read data off the wire into
>> > buffers with the correct alignment so that we can later use them for
>> > direct-io.
>> >
>> > If you're worried about the small io case, I think this is just a matter
>> > of setting a threshold for small ios so that we don't bother with all of
>> > the padding when the memory copy isn't that expensive.  But... given that
>> > we have a header *and* footer in the journal format and almost all IOs are
>> > 4k multiples I think it'd save you a single 4k block at most.
>> >
>> > (Also, I thought we already did something like this, but perhaps not!)
>>
>> Hmm, based on our recently test, the data from messenger is aligned.
>> But the encoded data(pglog, transaction) will make thing worse, like
>> PR(https://github.com/ceph/ceph/pull/6368) solved, we even will get 14
>> ptr in the bufferlist which passed into filejournal before. So it make
>> we rebuild each time within filejournal thread. Like this
>> PR(https://github.com/ceph/ceph/pull/6484), we try to make it rebuild
>> not in filejournal thread which is single.
>
> buffer::list::rebuild_page_aligned() should only copy/rebuild ptrs that
> are unaligned, and leave aligned ones untouched.  It looks like the
> journal code is already doing this?

Yes and no. For example, say we have a bufferlist containing 2 ptrs: the first
is unaligned, the second is aligned. The current implementation ignores the
fact that the second one is aligned. Look at the code:

  void buffer::list::rebuild_aligned_size_and_memory(unsigned align_size,
                                                     unsigned align_memory)
  {

      list unaligned;
      unsigned offset = 0;
      do {
        /*cout << " segment " << (void*)p->c_str()
               << " offset " << ((unsigned long)p->c_str() & (align - 1))
               << " length " << p->length() << " " << (p->length() & (align - 1))
               << " overall offset " << offset << " " << (offset & (align - 1))
               << " not ok" << std::endl;
        */
        offset += p->length();
        unaligned.push_back(*p);
        _buffers.erase(p++);
      } while (p != _buffers.end() &&
               (!p->is_aligned(align_memory) ||
                !p->is_n_align_sized(align_size) ||
                (offset % align_size)));
      // ^^ the loop condition also checks the overall offset alignment, so it
      //    does not stop after consuming the first unaligned ptr

      if (!(unaligned.is_contiguous() &&
            unaligned._buffers.front().is_aligned(align_memory))) {
        ptr nb(buffer::create_aligned(unaligned._len, align_memory));
        unaligned.rebuild(nb);
        _memcopy_count += unaligned._len;
      }
      _buffers.insert(p, unaligned._buffers.front());
    }
  }
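To make the point concrete, a small standalone illustration (plain C++, not Ceph code) of that loop condition: with 4 KiB alignment, an unaligned first segment followed by a perfectly aligned second segment still drags the second one into the rebuild copy, because the running offset is no longer a multiple of align_size.

#include <cstdio>
#include <vector>

int main() {
  const unsigned align_size = 4096;
  // segment lengths: first unaligned (100 bytes), second aligned (4096 bytes)
  std::vector<unsigned> seg_len = {100, 4096};
  std::vector<bool> seg_aligned = {false, true};

  unsigned offset = 0, rebuilt = 0;
  std::size_t i = 0;
  do {
    offset += seg_len[i];
    rebuilt++;            // this segment gets copied into the rebuilt buffer
    ++i;
  } while (i < seg_len.size() &&
           (!seg_aligned[i] ||
            (seg_len[i] % align_size) ||
            (offset % align_size)));

  // prints "rebuilt segments = 2": the aligned 4096-byte segment is copied too
  std::printf("rebuilt segments = %u\n", rebuilt);
  return 0;
}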


>
> sage



-- 
Best Regards,

Wheat


Re: RE: journal alignment

2015-11-20 Thread Haomai Wang
On Fri, Nov 20, 2015 at 7:41 PM, Sage Weil  wrote:
> On Fri, 20 Nov 2015, changtao381 wrote:
>> Hi All,
>>
>> Thanks for you apply!
>>
>> If directioIO + async IO requirement that alignment, it shouldn't aligned by 
>> PAGE for each journal entry.
>> For it may write many entries of journal once time
>
> We also want to avoid copying the data around in memory to change the
> alignment.  The messenger takes care to read data off the wire into
> buffers with the correct alignment so that we can later use them for
> direct-io.
>
> If you're worried about the small io case, I think this is just a matter
> of setting a threshold for small ios so that we don't bother with all of
> the padding when the memory copy isn't that expensive.  But... given that
> we have a header *and* footer in the journal format and almost all IOs are
> 4k multiples I think it'd save you a single 4k block at most.
>
> (Also, I thought we already did something like this, but perhaps not!)

Hmm, based on our recent tests, the data from the messenger is aligned. But the
encoded data (pglog, transaction) makes things worse: as the PR
https://github.com/ceph/ceph/pull/6368 solved, we would even get 14 ptrs in the
bufferlist passed into FileJournal before. So it makes us rebuild the bufferlist
each time within the FileJournal thread. With this PR
https://github.com/ceph/ceph/pull/6484, we try to make the rebuild happen
outside the FileJournal thread, which is single-threaded.



>
> sage



-- 
Best Regards,

Wheat


Re: journal alignment

2015-11-20 Thread Haomai Wang
On Fri, Nov 20, 2015 at 4:33 PM, changtao381  wrote:
> Hi All,
>
> Why does a journal entry need to be aligned by CEPH_PAGE_MASK? It causes the
> journal write data to be amplified by 2X for small IO.
>

Linux AIO/DIO requires this alignment.
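A minimal sketch of that constraint (Linux O_DIRECT, illustrative only; build with g++, which defines _GNU_SOURCE so O_DIRECT is visible): with direct IO the buffer address, the IO length and the file offset all have to be block-aligned, which is why a 4324-byte journal entry ends up padded to 8192 bytes as in the example below.

#include <fcntl.h>
#include <unistd.h>
#include <stdlib.h>
#include <cstdio>
#include <cstring>

int main() {
  int fd = ::open("journal.bin", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) { perror("open"); return 1; }

  const size_t align = 4096;                  // page / logical block size
  const size_t entry = 4324;                  // head + encoded entry + tail
  const size_t padded = ((entry + align - 1) / align) * align;   // -> 8192

  void *buf = nullptr;
  if (posix_memalign(&buf, align, padded)) return 1;   // aligned memory
  memset(buf, 0, padded);

  ssize_t r = ::pwrite(fd, buf, padded, 0);   // aligned length and offset: OK
  std::printf("wrote %zd padded bytes for a %zu-byte entry\n", r, entry);
  // pwrite(fd, buf, entry, 0) would typically fail with EINVAL under O_DIRECT.

  free(buf);
  ::close(fd);
  return 0;
}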

> For example write io size 4096 bytes, it may write 8192 bytes
>
> prepare_single_write 2 will write 98304 : seq 24 len 4324 -> 8192 (head 40
> pre_pad 0 ebl 4324 post_pad 3788 tail 40) (ebl alignment -1)
>
> Thanks!
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: Request for Comments: Weighted Round Robin OP Queue

2015-11-09 Thread Haomai Wang
On Tue, Nov 10, 2015 at 2:19 AM, Samuel Just  wrote:
> Ops are hashed from the messenger (or any of the other enqueue sources
> for non-message items) into one of N queues, each of which is serviced
> by M threads.  We can't quite have a single thread own a single queue
> yet because the current design allows multiple threads/queue
> (important because if a sync read blocks on one thread, other threads
> working on that queue can continue to make progress).  However, the
> queue contents are hashed to a queue based on the PG, so if a PG
> queues work, it'll be on the same queue as it is already operating
> from (which I think is what you are getting at?).  I'm moving away
> from that with the async read work I'm doing (ceph-devel subject
> "Async reads, sync writes, op thread model discussion"), but I'll
> still need a replacement for PrioritizedQueue.

I haven't thought clearly about the idea of making the PriorityQueue (or
whatever weight-based queue) client-oriented. Because each connection is
currently owned by an async messenger thread, if the later queue is PG-oriented,
huge lock contention can't be avoided as IOPS increase.

The only way, I guess, is to map msgr thread -> OSD thread via the same hash key
(or whatever lets us keep the two threads paired). What's more, the msgr thread
could use the same approach as Sam's branch; it could be only one thread.
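A minimal sketch of the pairing/sharding idea discussed here (illustrative only, not the actual OSD or messenger code): hash each op by its PG to a fixed shard, so the lock protecting a queue is only contended by the threads bound to that shard, and all work for one PG stays on the same queue.

#include <cstdint>
#include <deque>
#include <mutex>
#include <vector>

struct Op { uint64_t pg_id; /* ... payload ... */ };

class ShardedOpQueue {
  struct Shard {
    std::mutex lock;                 // contention is per shard, not global
    std::deque<Op> q;
  };
  std::vector<Shard> shards;
public:
  explicit ShardedOpQueue(unsigned n) : shards(n) {}

  void enqueue(const Op& op) {
    Shard& s = shards[op.pg_id % shards.size()];   // same PG -> same shard
    std::lock_guard<std::mutex> l(s.lock);
    s.q.push_back(op);
  }

  bool dequeue(unsigned shard_idx, Op* out) {
    Shard& s = shards[shard_idx];
    std::lock_guard<std::mutex> l(s.lock);
    if (s.q.empty()) return false;
    *out = s.q.front();
    s.q.pop_front();
    return true;
  }
};

int main() {
  ShardedOpQueue q(8);     // e.g. 8 shards, each serviced by its own worker(s)
  q.enqueue(Op{42});       // every op for PG 42 lands in shard 42 % 8
  Op op{};
  if (q.dequeue(42 % 8, &op)) { /* hand op to that shard's worker thread */ }
  return 0;
}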

> -Sam
>
> On Mon, Nov 9, 2015 at 9:19 AM, Robert LeBlanc  wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> I should probably work against this branch.
>>
>> I've got some more reading of code to do, but I'm thinking that there
>> isn't one of these queues for each OSD, it seems like there is one
>> queue for each thread in the OSD. If this is true, I think it makes
>> sense to break the queue into its own thread and have each 'worker'
>> thread push and pop OPs out of that thread. I have been so focused on the
>> Queue code that I haven't really looked at the OSD/PG code until last
>> Friday, and it is like trying to drink from a fire hose going through
>> that code, so I may be misunderstanding something.
>>
>> I'd appreciate any pointers to quickly understanding the OSD/PG code
>> specifically around the OPs and the queue.
>>
>> Thanks,
>> -BEGIN PGP SIGNATURE-
>> Version: Mailvelope v1.2.3
>> Comment: https://www.mailvelope.com
>>
>> wsFcBAEBCAAQBQJWQNWzCRDmVDuy+mK58QAAAGAQAJ44uFZNl84eGrHIMzDc
>> EyMBCE/STAOtZINV0DRmnKqrKLeWZ2ajHhr7gYdXByMdCi9QTnz/pYH8fP4m
>> sTtf8MnaEdDuFYpc+kVP4sOZx+efF64s4isN8lDpoa6noqDR68W3xJ7MV9/l
>> WJizoD9LWOvPVdPlO6M1jw3waL1eZMrxzPGpz2Xws4XnyGjIWeoUWl0kZYyT
>> EwGNGaQXBsioowd2PySc3axAY/zaeaJFPp4trw2k2sE9Yi4NT39R3tWgljkC
>> Ras8TjfHml1+xPeVadB4fdbYl2TaR8xYsVWCp+k1IuiEk/CAeljMjfST/Dqf
>> TBMhhw8h24AP1GLPwiOFdGIh6h6gj0UoXeXsfHKhSuW6M8Ur+9fuynyuhBUV
>> V0707nVmu9eiBwkgDHBcIRlnMQ0dDH60Uubf6ShagwjQSg6yfh6MNHVt6FFv
>> PJCcGDfEqzCjbcGhRyG0bE4aAXXAlHnUy4y2VRGIodmTHqUcZAfXoQd3dklC
>> KdSNyY+z/inOZip1Pbal4jNv3jAJBABn6Y1nNuB3W+33s/Jvt/aQbJpwYlkQ
>> iivTMkoMsimVNKAhoTybZpVwJ2Hy5TL/tWqDNwg3TBXtWSFU5S1XgJzoAQm5
>> yE7dbMwhAObw3XQ/eGMTmyICs1vwD0+mxaNHHWzSubtFKcdblUDW6BUxc+lj
>> ztfA
>> =GSDL
>> -END PGP SIGNATURE-
>> 
>> Robert LeBlanc
>> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>>
>>
>> On Mon, Nov 9, 2015 at 9:49 AM, Samuel Just  wrote:
>>> It's partially in the unified queue.  The primary's background work
>>> for kicking off a recovery operation is not in the unified queue, but
>>> the messages to the replicas (pushes, pull, backfill scans) as well as
>>> their replies are in the unified queue as normal messages.  I've got a
>>> branch moving the primary's work to the queue as well (didn't quite
>>> make infernalis) --
>>> https://github.com/athanatos/ceph/tree/wip-recovery-wq.  I'm trying to
>>> stabilize it now for merge that infernalis is out.
>>> -Sam
>>>
>>> On Sun, Nov 8, 2015 at 6:20 AM, Sage Weil  wrote:
 On Fri, 6 Nov 2015, Robert LeBlanc wrote:

> -BEGIN PGP SIGNED MESSAGE-
> Hash: SHA256
>
> After trying to look through the recovery code, I'm getting the
> feeling that recovery OPs are not scheduled in the OP queue that I've
> been working on. Does that sound right? In the OSD logs I'm only
> seeing priority 63, 127 and 192 (osd_op, osd_repop, osd_repop_reply).
> If the recovery is in another separate queue, then there is no
> reliable way to prioritize OPs between them.
>
> If I'm going off in to the weeds, please help me get back on the trail.

 Yeah, the recovery work isn't in the unified queue yet.

 sage



>
> Thanks,
> - 
> Robert LeBlanc
> PGP Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1
>
>
> On Fri, Nov 6, 2015 at 10:03 AM, Robert LeBlanc  wrote:
> > -BEGIN PGP SIGNED MESSAGE-
> > Hash: SHA256
> >
> > On Fri, Nov 6, 

Re: ceph encoding optimization

2015-11-07 Thread Haomai Wang
Hi Sage,

Could you let us know your progress on refactoring MSubOP and on the
hobject_t/pg_stat_t decode problem?

We could work on this based on your work, if there is any.


On Thu, Nov 5, 2015 at 1:29 AM, Haomai Wang <hao...@xsky.com> wrote:
> On Thu, Nov 5, 2015 at 1:19 AM, piotr.da...@ts.fujitsu.com
> <piotr.da...@ts.fujitsu.com> wrote:
>>> -Original Message-
>>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>>> ow...@vger.kernel.org] On Behalf Of ???
>>> Sent: Wednesday, November 04, 2015 4:34 PM
>>> To: Gregory Farnum
>>> Cc: ceph-devel@vger.kernel.org
>>> Subject: Re: ceph encoding optimization
>>>
>>> I agree with pg_stat_t (and friends) is a good first start.
>>> The eversion_t and utime_t are also good choice to start because they are
>>> used at many places.
>>
>> On Ceph Hackathon, Josh Durgin made initial steps in right direction in 
>> terms of pg_stat_t encoding and decoding optimization, with the 
>> endianness-awareness thing left out. Even in that state, performance 
>> improvements offered by this change were huge enough to make it worthwhile. 
>> I'm attaching the patch, but please note that this is prototype and based on 
>> mid-August state of code, so you might need to take that into account when 
>> applying the patch.
>
> Cool, it's exactly we want to see.
>
>>
>>
>> With best regards / Pozdrawiam
>> Piotr Dałek
>>


Re: ceph encoding optimization

2015-11-04 Thread Haomai Wang
On Thu, Nov 5, 2015 at 1:19 AM, piotr.da...@ts.fujitsu.com
 wrote:
>> -Original Message-
>> From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
>> ow...@vger.kernel.org] On Behalf Of ???
>> Sent: Wednesday, November 04, 2015 4:34 PM
>> To: Gregory Farnum
>> Cc: ceph-devel@vger.kernel.org
>> Subject: Re: ceph encoding optimization
>>
>> I agree with pg_stat_t (and friends) is a good first start.
>> The eversion_t and utime_t are also good choice to start because they are
>> used at many places.
>
> At the Ceph Hackathon, Josh Durgin made initial steps in the right direction
> in terms of pg_stat_t encoding and decoding optimization, with the
> endianness-awareness thing left out. Even in that state, the performance
> improvements offered by this change were huge enough to make it worthwhile.
> I'm attaching the patch, but please note that this is a prototype based on the
> mid-August state of the code, so you might need to take that into account when
> applying the patch.

Cool, that's exactly what we want to see.
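To make the direction concrete, a rough illustration of the kind of optimization being discussed (not Josh's actual patch): for a type whose members are fixed-width and laid out contiguously, appending the raw bytes in one go beats encoding each field separately, at the cost of the endianness handling Piotr mentions.

#include <cstdint>
#include <string>

struct eversion_like {           // stand-in for eversion_t: fixed-width fields
  uint64_t version;
  uint32_t epoch;
} __attribute__((packed));

// field-by-field: one append per member
void encode_slow(const eversion_like& v, std::string& out) {
  out.append(reinterpret_cast<const char*>(&v.version), sizeof(v.version));
  out.append(reinterpret_cast<const char*>(&v.epoch), sizeof(v.epoch));
}

// bulk: a single append of the whole packed struct
void encode_fast(const eversion_like& v, std::string& out) {
  out.append(reinterpret_cast<const char*>(&v), sizeof(v));
}

int main() {
  eversion_like v{12345, 678};
  std::string a, b;
  encode_slow(v, a);
  encode_fast(v, b);
  return a == b ? 0 : 1;   // identical bytes; neither handles endianness for the wire
}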

>
>
> With best regards / Pozdrawiam
> Piotr Dałek
>


Re: Re: [ceph-users] Understanding the number of TCP connections between clients and OSDs

2015-10-26 Thread Haomai Wang
On Tue, Oct 27, 2015 at 9:12 AM, hzwulibin  wrote:
> Hi, develops
>
> I am also concerned about this problem. And my question is: how many threads
> will the qemu-system-x86 process have?
> When will it cut down the threads?

It's because of the network model: each connection has two threads. We are
actually working on avoiding this.

BTW, at the client level, maybe we can add a proxy for Ceph messages to avoid
too many TCP sockets on the client host. But that requires us to improve
single-connection performance.

>
> From what I tested, it could be between 100 and 800; yeah, maybe it is related
> to the number of OSDs. But it seems to affect performance when there are many
> threads. From what I tested, 4k randwrite drops from 15k to 4k IOPS. That's
> really unacceptable!
>
> My evnironment:
>
> 1. nine OSD storage servers with two intel DC 3500 SSD on each
> 2. hammer 0.94.3
> 3. QEMU emulator version 2.1.2 (Debian 1:2.1+dfsg-12+deb8u4~bpo70+1)
>
> Thanks!
>
> --
> hzwulibin
> 2015-10-27
>
> -
> From: Jan Schermer
> Date: 2015-10-27 05:48
> To: Rick Balsano
> Cc: ceph-us...@lists.ceph.com
> Subject: Re: [ceph-users] Understanding the number of TCP connections
> between clients and OSDs
>
> If we're talking about RBD clients (qemu) then the number also grows with 
> number of volumes attached to the client. With a single volume it was <1000. 
> It grows when there's heavy IO happening in the guest.
> I had to bump up the open file limits to several thousands (8000 was it?) to
> accommodate a client with 10 volumes in our cluster. We just scaled the number 
> of OSDs down so hopefully I could have a graph of that.
> But I just guesstimated what it could become, and that's not necessarily what 
> the theoretical limit is. Very bad things happen when you reach that 
> threshold. It could also depend on the guest settings (like queue depth), and 
> how much it seeks over the drive (how many different PGs it hits), but 
> knowing the upper bound is most critical.
>
> Jan
>
>> On 26 Oct 2015, at 21:32, Rick Balsano  wrote:
>>
>> We've run into issues with the number of open TCP connections from a single 
>> client to the OSDs in our Ceph cluster.
>>
>> We can (& have) increased the open file limit to work around this, but we're 
>> looking to understand what determines the number of open connections 
>> maintained between a client and a particular OSD. Our naive assumption was 1 
>> open TCP connection per OSD or per port made available by the Ceph node. 
>> There are many more than this, presumably to allow parallel connections, 
>> because we see 1-4 connections from each client per open port on a Ceph node.
>>
>> Here is some background on our cluster:
>> * still running Firefly 0.80.8
>> * 414 OSDs, 35 nodes, one massive pool
>> * clients are KVM processes, accessing Ceph RBD images using virtio
>> * total number of open TCP connections from one client to all nodes between 
>> 500-1000
>>
>> Is there any way to either know or cap the maximum number of connections we 
>> should expect?
>>
>> I can provide more info as required. I've done some searches and found 
>> references to "huge number of TCP connections" but nothing concrete to tell 
>> me how to predict how that scales.
>>
>> Thanks,
>> Rick
>> --
>> Rick Balsano
>> Senior Software Engineer
>> Opower 
>>
>> O +1 571 384 1210
>> We're Hiring! See jobs here .
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>



-- 
Best Regards,

Wheat


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Haomai Wang
On Tue, Oct 20, 2015 at 8:47 PM, Sage Weil  wrote:
> On Tue, 20 Oct 2015, Z Zhang wrote:
>> Hi Guys,
>>
>> I am trying latest ceph-9.1.0 with rocksdb 4.1 and ceph-9.0.3 with
>> rocksdb 3.11 as OSD backend. I use rbd to test performance and following
>> is my cluster info.
>>
>> [ceph@xxx ~]$ ceph -s
>> cluster b74f3944-d77f-4401-a531-fa5282995808
>>  health HEALTH_OK
>>  monmap e1: 1 mons at {xxx=xxx.xxx.xxx.xxx:6789/0}
>> election epoch 1, quorum 0 xxx
>>  osdmap e338: 44 osds: 44 up, 44 in
>> flags sortbitwise
>>   pgmap v1476: 2048 pgs, 1 pools, 158 MB data, 59 objects
>> 1940 MB used, 81930 GB / 81932 GB avail
>> 2048 active+clean
>>
>> All the disks are spinning ones with write cache turning on. Rocksdb's
>> WAL and sst files are on the same disk as every OSD.
>
> Are you using the KeyValueStore backend?
>
>> Using fio to generate following write load:
>> fio -direct=1 -rw=randwrite -ioengine=sync -size=10M -bs=4K -group_reporting 
>> -directory /mnt/rbd_test/ -name xxx.1 -numjobs=1
>>
>> Test result:
>> WAL enabled + sync: false + disk write cache: on  will get ~700 IOPS.
>> WAL enabled + sync: true (default) + disk write cache: on|off  will get only 
>> ~25 IOPS.
>>
>> I tuned some other rocksdb options, but with no luck.
>
> The wip-newstore-frags branch sets some defaults for rocksdb that I think
> look pretty reasonable (at least given how newstore is using rocksdb).
>
>> I tracked down the rocksdb code and found each writer's Sync operation
>> takes ~30ms to finish. And as shown above, it is strange that
>> performance shows little difference no matter whether the disk write cache
>> is on or off.
>>
>> Do your guys encounter the similar issue? Or do I miss something to
>> cause rocksdb's poor write performance?
>
> Yes, I saw the same thing.  This PR addresses the problem and is nearing
> merge upstream:
>
> https://github.com/facebook/rocksdb/pull/746
>

Cool, that looks like a reasonable explanation for the degraded performance.

> There is also an XFS performance bug that is contributing to the problem,

Are you referring to this
(http://comments.gmane.org/gmane.comp.file-systems.ceph.devel/27645)?

I think newstore also hits this situation.

> but it looks like Dave Chinner just put together a fix for that.
>
> But... we likely won't be using KeyValueStore in its current form over
> rocksdb (or any other kv backend).  It stripes object data over key/value
> pairs, which IMO is not the best approach.
>
> sage
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat


Re: [ceph-users] Write performance issue under rocksdb kvstore

2015-10-20 Thread Haomai Wang
Actually KeyValueStore submits transactions with the sync flag too (relying on
the KeyValueDB implementation's journal/log file).

Yes, if we disable the sync flag, KeyValueStore's performance increases a lot.
But we don't provide this option now.
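For reference, what the sync flag means at the key/value DB level (a sketch using the standard RocksDB C++ API; the KeyValueStore backend drives this through its KeyValueDB wrapper):

#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

int main() {
  rocksdb::DB *db = nullptr;
  rocksdb::Options options;
  options.create_if_missing = true;
  rocksdb::Status s = rocksdb::DB::Open(options, "/tmp/kvstore-test", &db);
  assert(s.ok());

  rocksdb::WriteBatch batch;          // one transaction's keys/values
  batch.Put("object_key", "object_data");

  rocksdb::WriteOptions wo;
  wo.sync = true;   // fsync the WAL before returning: durable, but this is the
                    // ~25 IOPS case on a spinner; wo.sync = false gives the
                    // ~700 IOPS case at the risk of losing page-cache data on
                    // a machine crash.
  s = db->Write(wo, &batch);
  assert(s.ok());

  delete db;
  return 0;
}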

On Tue, Oct 20, 2015 at 9:22 PM, Z Zhang  wrote:
> Thanks, Sage, for pointing out the PR and the Ceph branch. I will take a
> closer look. Yes, I am trying the KVStore backend. The reason we are trying it
> is that a few users don't have such strict requirements on occasional data
> loss. It seems the KVStore backend without a synchronized WAL could achieve
> better performance than FileStore. And only data still in the page cache would
> get lost on a machine crash, not a process crash, if we use the WAL but no
> synchronization. What do you think?
>
> Thanks.
> Zhi Zhang (David)
>
> [Quoted text: Sage Weil's reply of Tue, 20 Oct 2015 05:47 -0700, reproduced in
> full in the previous message above.]
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>



-- 
Best Regards,

Wheat


Re: newstore direction

2015-10-19 Thread Haomai Wang
On Tue, Oct 20, 2015 at 3:49 AM, Sage Weil  wrote:
> The current design is based on two simple ideas:
>
>  1) a key/value interface is better way to manage all of our internal
> metadata (object metadata, attrs, layout, collection membership,
> write-ahead logging, overlay data, etc.)
>
>  2) a file system is well suited for storage object data (as files).
>
> So far 1 is working out well, but I'm questioning the wisdom of #2.  A few
> things:
>
>  - We currently write the data to the file, fsync, then commit the kv
> transaction.  That's at least 3 IOs: one for the data, one for the fs
> journal, one for the kv txn to commit (at least once my rocksdb changes
> land... the kv commit is currently 2-3).  So two people are managing
> metadata, here: the fs managing the file metadata (with its own
> journal) and the kv backend (with its journal).
>
>  - On read we have to open files by name, which means traversing the fs
> namespace.  Newstore tries to keep it as flat and simple as possible, but
> at a minimum it is a couple btree lookups.  We'd love to use open by
> handle (which would reduce this to 1 btree traversal), but running
> the daemon as ceph and not root makes that hard...
>
>  - ...and file systems insist on updating mtime on writes, even when it is
> a overwrite with no allocation changes.  (We don't care about mtime.)
> O_NOCMTIME patches exist but it is hard to get these past the kernel
> brainfreeze.
>
>  - XFS is (probably) never going going to give us data checksums, which we
> want desperately.
>
> But what's the alternative?  My thought is to just bite the bullet and
> consume a raw block device directly.  Write an allocator, hopefully keep
> it pretty simple, and manage it in kv store along with all of our other
> metadata.

This is really a tough decision, although building a block-device-based
objectstore has never left my mind over the last two years.

We would be much more concerned about the efficiency of space utilization
compared to a local fs, the potential bugs, and the time consumed to build a
tiny local filesystem. I'm a little afraid we would get stuck in it.
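For concreteness, a minimal sketch of the kind of "pretty simple" allocator mentioned above (first-fit over an in-memory free-extent map; a real one would persist its state in the kv store and handle alignment and concurrency):

#include <cstdint>
#include <cstdio>
#include <iterator>
#include <map>

class SimpleExtentAllocator {
  std::map<uint64_t, uint64_t> free_map;   // offset -> length of free extent
public:
  explicit SimpleExtentAllocator(uint64_t device_size) {
    free_map[0] = device_size;             // the whole device starts free
  }

  // First-fit allocate; returns true and sets *offset on success.
  bool allocate(uint64_t len, uint64_t *offset) {
    for (auto it = free_map.begin(); it != free_map.end(); ++it) {
      if (it->second >= len) {
        *offset = it->first;
        uint64_t remaining = it->second - len;
        uint64_t new_off = it->first + len;
        free_map.erase(it);
        if (remaining)
          free_map[new_off] = remaining;
        return true;
      }
    }
    return false;                          // no extent big enough
  }

  // Return an extent to the free map, merging with neighbours.
  void release(uint64_t offset, uint64_t len) {
    auto next = free_map.lower_bound(offset);
    if (next != free_map.begin()) {
      auto prev = std::prev(next);
      if (prev->first + prev->second == offset) {   // merge with left neighbour
        offset = prev->first;
        len += prev->second;
        free_map.erase(prev);
      }
    }
    if (next != free_map.end() && offset + len == next->first) {  // merge right
      len += next->second;
      free_map.erase(next);
    }
    free_map[offset] = len;
  }
};

int main() {
  SimpleExtentAllocator alloc(1ULL << 30);   // 1 GiB device
  uint64_t off = 0;
  if (!alloc.allocate(4096, &off)) return 1;
  std::printf("allocated 4K at offset %llu\n", (unsigned long long)off);
  alloc.release(off, 4096);
  return 0;
}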

>
> Wins:
>
>  - 2 IOs for most: one to write the data to unused space in the block
> device, one to commit our transaction (vs 4+ before).  For overwrites,
> we'd have one io to do our write-ahead log (kv journal), then do
> the overwrite async (vs 4+ before).

Compared to FileJournal, it seems a key/value DB doesn't play well in the WAL
area, based on my perf results.

>
>  - No concern about mtime getting in the way
>
>  - Faster reads (no fs lookup)
>
>  - Similarly sized metadata for most objects.  If we assume most objects
> are not fragmented, then the metadata to store the block offsets is about
> the same size as the metadata to store the filenames we have now.
>
> Problems:
>
>  - We have to size the kv backend storage (probably still an XFS
> partition) vs the block storage.  Maybe we do this anyway (put metadata on
> SSD!) so it won't matter.  But what happens when we are storing gobs of
> rgw index data or cephfs metadata?  Suddenly we are pulling storage out of
> a different pool and those aren't currently fungible.
>
>  - We have to write and maintain an allocator.  I'm still optimistic this
> can be reasonably simple, especially for the flash case (where
> fragmentation isn't such an issue as long as our blocks are reasonably
> sized).  For disk we may need to be moderately clever.
>
>  - We'll need a fsck to ensure our internal metadata is consistent.  The
> good news is it'll just need to validate what we have stored in the kv
> store.
>
> Other thoughts:
>
>  - We might want to consider whether dm-thin or bcache or other block
> layers might help us with elasticity of file vs block areas.
>
>  - Rocksdb can push colder data to a second directory, so we could have a
> fast ssd primary area (for wal and most metadata) and a second hdd
> directory for stuff it has to push off.  Then have a conservative amount
> of file space on the hdd.  If our block fills up, use the existing file
> mechanism to put data there too.  (But then we have to maintain both the
> current kv + file approach and not go all-in on kv + block.)

A complex way...

Actually I would like to try a FileStore2 implementation, which means we still
use FileJournal (or something like it). But we need to use more memory to keep
metadata/xattrs and use AIO+DIO to flush the disk. A userspace page cache would
need to be implemented. Then we can skip the journal for full-object writes;
because the OSD has PG isolation, we could put a barrier on a single PG when
skipping the journal. @Sage, are there other concerns about FileStore skipping
the journal?

In a word, I like the model that FileStore uses, but we need a big refactor of
the existing implementation.

Sorry to disturb the discussion.

>
> Thoughts?
> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--

Re: XFS xattr limit and Ceph

2015-10-15 Thread Haomai Wang
XFS has three storage formats for xattrs: inline (local), extent and btree.
We only want xattrs to be stored inline so that reading them doesn't need to
hit disk. That's why we need to limit the number of xattrs.
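A rough sketch of the policy that limit implements (illustrative only; the attribute-name prefix and size cutoff here are made up, and the real FileStore logic is more involved): keep only a few small xattrs on the file itself so XFS can keep them in the inode, and spill the rest into the omap (kv store).

#include <fcntl.h>
#include <unistd.h>
#include <sys/xattr.h>
#include <map>
#include <string>

void store_xattrs(int fd,
                  const std::map<std::string, std::string>& attrs,
                  std::map<std::string, std::string>& spill_to_omap,
                  unsigned max_inline = 10,        // filestore_max_inline_xattrs_xfs
                  size_t max_inline_size = 254)    // hypothetical size cutoff
{
  unsigned inline_count = 0;
  for (const auto& kv : attrs) {
    if (inline_count < max_inline && kv.second.size() <= max_inline_size) {
      // small attr: let the filesystem store it (ideally inline in the inode)
      fsetxattr(fd, ("user.ceph." + kv.first).c_str(),
                kv.second.data(), kv.second.size(), 0);
      ++inline_count;
    } else {
      // big or overflow attr: the caller persists it in the kv store instead
      spill_to_omap[kv.first] = kv.second;
    }
  }
}

int main() {
  int fd = ::open("obj_file", O_RDWR | O_CREAT, 0644);
  if (fd < 0) return 1;
  std::map<std::string, std::string> attrs = {{"_", "small"}, {"snapset", "also small"}};
  std::map<std::string, std::string> omap_spill;
  store_xattrs(fd, attrs, omap_spill);
  ::close(fd);
  return 0;
}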

On Thu, Oct 15, 2015 at 10:54 PM, Somnath Roy  wrote:
> Sage,
> Why are we using an XFS max inline xattr value of only 10?
>
> OPTION(filestore_max_inline_xattrs_xfs, OPT_U32, 10)
>
> XFS supports a 1k limit, I guess. Is there any performance reason behind
> that?
>
> Thanks & Regards
> Somnath
>
> 
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: [ceph-users] Potential OSD deadlock?

2015-10-14 Thread Haomai Wang
On Wed, Oct 14, 2015 at 1:03 AM, Sage Weil  wrote:
> On Mon, 12 Oct 2015, Robert LeBlanc wrote:
>> -BEGIN PGP SIGNED MESSAGE-
>> Hash: SHA256
>>
>> After a weekend, I'm ready to hit this from a different direction.
>>
>> I replicated the issue with Firefly so it doesn't seem an issue that
>> has been introduced or resolved in any nearby version. I think overall
>> we may be seeing [1] to a great degree. From what I can extract from
>> the logs, it looks like in situations where OSDs are going up and
>> down, I see I/O blocked at the primary OSD waiting for peering and/or
>> the PG to become clean before dispatching the I/O to the replicas.
>>
>> In an effort to understand the flow of the logs, I've attached a small
>> 2 minute segment of a log I've extracted what I believe to be
>> important entries in the life cycle of an I/O along with my
>> understanding. If someone would be kind enough to help my
>> understanding, I would appreciate it.
>>
>> 2015-10-12 14:12:36.537906 7fb9d2c68700 10 -- 192.168.55.16:6800/11295
>> >> 192.168.55.12:0/2013622 pipe(0x26c9 sd=47 :6800 s=2 pgs=2 cs=1
>> l=1 c=0x32c85440).reader got message 19 0x2af81700
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Messenger has recieved the message from the client (previous
>> entries in the 7fb9d2c68700 thread are the individual segments that
>> make up this message).
>>
>> 2015-10-12 14:12:36.537963 7fb9d2c68700  1 -- 192.168.55.16:6800/11295
>> <== client.6709 192.168.55.12:0/2013622 19 
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>  235+0+4194304 (2317308138 0 2001296353) 0x2af81700 con 0x32c85440
>>
>> - ->OSD process acknowledges that it has received the write.
>>
>> 2015-10-12 14:12:36.538096 7fb9d2c68700 15 osd.4 44 enqueue_op
>> 0x3052b300 prio 63 cost 4194304 latency 0.012371
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>>
>> - ->Not sure exactly what is going on here; the op is being enqueued
>> somewhere...
>>
>> 2015-10-12 14:13:06.542819 7fb9e2d3a700 10 osd.4 44 dequeue_op
>> 0x3052b300 prio 63 cost 4194304 latency 30.017094
>> osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v
>> 5 pg pg[0.29( v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c
>> 40/44 32/32/10) [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702
>> active+clean]
>>
>> - ->The op is dequeued from this mystery queue 30 seconds later in a
>> different thread.
>
> ^^ This is the problem.  Everything after this looks reasonable.  Looking
> at the other dequeue_op calls over this period, it looks like we're just
> overwhelmed with higher priority requests.  New clients are 63, while
> osd_repop (replicated write from another primary) are 127 and replies from
> our own replicated ops are 196.  We do process a few other prio 63 items,
> but you'll see that their latency is also climbing up to 30s over this
> period.
>
> The question is why we suddenly get a lot of them.. maybe the peering on
> other OSDs just completed so we get a bunch of these?  It's also not clear
> to me what makes osd.4 or this op special.  We expect a mix of primary and
> replica ops on all the OSDs, so why would we suddenly have more of them
> here

I guess this tracker bug (http://tracker.ceph.com/issues/13482) is related to
this thread.

So does it mean there is a livelock between client ops and repops? We permit
all clients to issue too many client ops, which makes some OSDs a bottleneck,
while other OSDs may still be idle enough to accept more client ops. Finally,
all OSDs get stuck behind the bottleneck OSD. That seems reasonable, but why
would it last so long?
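To illustrate the failure mode (toy code only; Ceph's PrioritizedQueue is more sophisticated, with strict priority only above a cutoff and token-based sharing below it): when higher-priority items keep arriving, a lower-priority client op can wait arbitrarily long.

#include <cstdio>
#include <iterator>
#include <map>
#include <queue>
#include <string>

int main() {
  // priority -> FIFO of ops at that priority (higher number = more urgent)
  std::map<int, std::queue<std::string>> q;
  q[63].push("client osd_op");
  q[127].push("osd_repop from a peer");
  q[196].push("repop reply");
  q[127].push("another osd_repop");

  while (!q.empty()) {
    auto it = std::prev(q.end());            // always take the highest priority
    std::printf("dequeue prio %d: %s\n", it->first, it->second.front().c_str());
    it->second.pop();
    if (it->second.empty()) q.erase(it);
  }
  // The prio-63 client op is only served once all higher-priority work drains;
  // with a steady stream of repops it can wait arbitrarily long.
  return 0;
}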

>
> sage
>
>
>>
>> 2015-10-12 14:13:06.542912 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 active+clean]
>> do_op osd_op(client.6709.0:67 rbd_data.103c74b0dc51.003a
>> [set-alloc-hint object_size 4194304 write_size 4194304,write
>> 0~4194304] 0.474a01a9 ack+ondisk+write+known_if_redirected e44) v5
>> may_write -> write-ordered flags ack+ondisk+write+known_if_redirected
>>
>> - ->Not sure what this message is. Look up of secondary OSDs?
>>
>> 2015-10-12 14:13:06.544999 7fb9e2d3a700 10 osd.4 pg_epoch: 44 pg[0.29(
>> v 44'703 (0'0,44'703] local-les=40 n=641 ec=1 les/c 40/44 32/32/10)
>> [4,5,0] r=0 lpr=32 crt=44'700 lcod 44'702 mlcod 44'702 

Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-13 Thread Haomai Wang
Yep, as I said below, I am considering adding auto scale-up/down for worker
threads with connection load-balancing ability. It may keep users from getting
entangled in how many threads they need. :-(

Actually, thread count as a config value is a pain throughout the Ceph OSD IO
stack.

On Tue, Oct 13, 2015 at 2:45 PM, Somnath Roy <somnath@sandisk.com> wrote:
> Thanks Haomai..
> Since the async messenger always uses a constant number of threads, could
> there be a potential performance problem when scaling up the client
> connections while keeping a constant number of OSDs?
> May be it's a good tradeoff..
>
> Regards
> Somnath
>
>
> -----Original Message-
> From: Haomai Wang [mailto:haomaiw...@gmail.com]
> Sent: Monday, October 12, 2015 11:35 PM
> To: Somnath Roy
> Cc: Mark Nelson; ceph-devel; ceph-us...@lists.ceph.com
> Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs 
> AsyncMessenger results
>
> On Tue, Oct 13, 2015 at 12:18 PM, Somnath Roy <somnath@sandisk.com> wrote:
>> Mark,
>>
>> Thanks for this data. This means probably simple messenger (not OSD
>> core) is not doing optimal job of handling memory.
>>
>>
>>
>> Haomai,
>>
>> I am not that familiar with Async messenger code base, do you have an
>> explanation of the behavior (like good performance with default
>> tcmalloc) Mark reported ? Is it using lot less thread overall than Simple ?
>
> Originally async messenger mainly want to solve with high thread number 
> problem which limited the ceph cluster size. High context switch and cpu 
> usage caused by simple messenger under large cluster.
>
> Recently we have memory problem discussed on ML and I also spend times to 
> think about the root cause. Currently I would like to consider the simple 
> messenger's memory usage is deviating from the design of tcmalloc. Tcmalloc 
> is aimed to provide memory with local cache, and it also has memory control 
> among all threads, if we have too much threads, it may let tcmalloc busy with 
> memory lock contention.
>
> Async messenger uses thread pool to serve connections, it make all blocking 
> calls in simple messenger async.
>
>>
>> Also, it seems Async messenger has some inefficiencies in the io path
>> and that’s why it is not performing as well as simple if the memory
>> allocation stuff is optimally handled.
>
> Yep, simple messenger use two threads(one for read, one for write) to serve 
> one connection, async messenger at most have one thread to serve one 
> connection and multi connection  will share the same thread.
>
> Next, I would like to have several plans to improve performance:
> 1. add poll mode support, I hope it can help enhance high performance storage 
> need 2. add load balance ability among worker threads 3. move more works out 
> of messenger thread.
>
>>
>> Could you please send out any documentation around Async messenger ? I
>> tried to google it , but, not even blueprint is popping up.
>
>>
>>
>>
>>
>>
>> Thanks & Regards
>>
>> Somnath
>>
>> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf
>> Of Haomai Wang
>> Sent: Monday, October 12, 2015 7:57 PM
>> To: Mark Nelson
>> Cc: ceph-devel; ceph-us...@lists.ceph.com
>> Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger
>> vs AsyncMessenger results
>>
>>
>>
>> COOL
>>
>>
>>
>> Interesting that async messenger will consume more memory than simple,
>> in my mind I always think async should use less memory. I will give a
>> look at this
>>
>>
>>
>> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>
>> Hi Guy,
>>
>> Given all of the recent data on how different memory allocator
>> configurations improve SimpleMessenger performance (and the effect of
>> memory allocators and transparent hugepages on RSS memory usage), I
>> thought I'd run some tests looking how AsyncMessenger does in
>> comparison.  We spoke about these a bit at the last performance meeting but 
>> here's the full write up.
>> The rough conclusion as of right now appears to be:
>>
>> 1) AsyncMessenger performance is not dependent on the memory allocator
>> like with SimpleMessenger.
>>
>> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB
>> (ie
>> default) thread cache.
>>
>> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
>> random reads.
>>
>> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
>>

Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-13 Thread Haomai Wang
On Tue, Oct 13, 2015 at 12:18 PM, Somnath Roy <somnath@sandisk.com> wrote:
> Mark,
>
> Thanks for this data. This means probably simple messenger (not OSD core) is
> not doing optimal job of handling memory.
>
>
>
> Haomai,
>
> I am not that familiar with Async messenger code base, do you have an
> explanation of the behavior (like good performance with default tcmalloc)
> Mark reported ? Is it using lot less thread overall than Simple ?

Originally the async messenger mainly wanted to solve the high-thread-count
problem which limited the Ceph cluster size: the high context switching and CPU
usage caused by the simple messenger under a large cluster.

Recently we have had the memory problem discussed on the ML, and I also spent
time thinking about the root cause. Currently I consider that the simple
messenger's memory usage deviates from the design of tcmalloc. Tcmalloc aims to
provide memory from a thread-local cache, and it also has memory control among
all threads; if we have too many threads, it may keep tcmalloc busy with memory
lock contention.

The async messenger uses a thread pool to serve connections; it makes all the
blocking calls in the simple messenger async.

>
> Also, it seems Async messenger has some inefficiencies in the io path and
> that’s why it is not performing as well as simple if the memory allocation
> stuff is optimally handled.

Yep, the simple messenger uses two threads (one for reads, one for writes) to
serve one connection; the async messenger has at most one thread serving a
connection, and multiple connections will share the same thread.
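A minimal sketch of the difference (Linux epoll, illustrative only, not the actual AsyncMessenger code): one worker thread multiplexes many connections instead of dedicating a reader and a writer thread to each connection.

#include <sys/epoll.h>
#include <unistd.h>
#include <vector>

// One worker thread services every connection registered on its epoll fd.
void worker_loop(int epfd) {
  std::vector<epoll_event> events(128);
  for (;;) {
    int n = epoll_wait(epfd, events.data(), (int)events.size(), -1);
    for (int i = 0; i < n; ++i) {
      int fd = events[i].data.fd;
      if (events[i].events & EPOLLIN) {
        char buf[4096];
        ssize_t r = ::read(fd, buf, sizeof(buf));  // pull bytes off the wire
        if (r <= 0) { ::close(fd); continue; }
        // ... decode and dispatch the message without blocking the loop ...
      }
      if (events[i].events & EPOLLOUT) {
        // ... flush any queued outgoing bytes for this connection ...
      }
    }
  }
}

// Registering a new connection with the shared worker.
void add_connection(int epfd, int conn_fd) {
  epoll_event ev{};
  ev.events = EPOLLIN | EPOLLOUT | EPOLLET;
  ev.data.fd = conn_fd;
  epoll_ctl(epfd, EPOLL_CTL_ADD, conn_fd, &ev);
}

int main() {
  int epfd = epoll_create1(0);
  // accepted sockets would be handed to add_connection(epfd, fd) elsewhere
  worker_loop(epfd);        // normally run once per worker thread
  return 0;
}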

Next, I have several plans to improve performance:
1. add poll mode support, which I hope can help high-performance storage needs
2. add load balancing ability among worker threads
3. move more work out of the messenger thread.

>
> Could you please send out any documentation around the async messenger? I
> tried to google it, but not even a blueprint is popping up.

>
>
>
>
>
> Thanks & Regards
>
> Somnath
>
> From: ceph-users [mailto:ceph-users-boun...@lists.ceph.com] On Behalf Of
> Haomai Wang
> Sent: Monday, October 12, 2015 7:57 PM
> To: Mark Nelson
> Cc: ceph-devel; ceph-us...@lists.ceph.com
> Subject: Re: [ceph-users] Initial performance cluster SimpleMessenger vs
> AsyncMessenger results
>
>
>
> COOL
>
>
>
> Interesting that async messenger will consume more memory than simple, in my
> mind I always think async should use less memory. I will give a look at this
>
>
>
> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson <mnel...@redhat.com> wrote:
>
> Hi Guy,
>
> Given all of the recent data on how different memory allocator
> configurations improve SimpleMessenger performance (and the effect of memory
> allocators and transparent hugepages on RSS memory usage), I thought I'd run
> some tests looking how AsyncMessenger does in comparison.  We spoke about
> these a bit at the last performance meeting but here's the full write up.
> The rough conclusion as of right now appears to be:
>
> 1) AsyncMessenger performance is not dependent on the memory allocator like
> with SimpleMessenger.
>
> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie
> default) thread cache.
>
> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
> random reads.
>
> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
> allocator optimizations are used.
>
> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>
> Here's a link to the paper:
>
> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
>
> Mark
> ___
> ceph-users mailing list
> ceph-us...@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
>
> --
>
> Best Regards,
>
> Wheat
>
>
> 
>
>



-- 
Best Regards,

Wheat


Re: [ceph-users] Initial performance cluster SimpleMessenger vs AsyncMessenger results

2015-10-12 Thread Haomai Wang
resend

On Tue, Oct 13, 2015 at 10:56 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
> COOL
>
> Interesting that async messenger will consume more memory than simple, in my
> mind I always think async should use less memory. I will give a look at this
>
> On Tue, Oct 13, 2015 at 12:50 AM, Mark Nelson <mnel...@redhat.com> wrote:
>>
>> Hi Guy,
>>
>> Given all of the recent data on how different memory allocator
>> configurations improve SimpleMessenger performance (and the effect of memory
>> allocators and transparent hugepages on RSS memory usage), I thought I'd run
>> some tests looking how AsyncMessenger does in comparison.  We spoke about
>> these a bit at the last performance meeting but here's the full write up.
>> The rough conclusion as of right now appears to be:
>>
>> 1) AsyncMessenger performance is not dependent on the memory allocator
>> like with SimpleMessenger.
>>
>> 2) AsyncMessenger is faster than SimpleMessenger with TCMalloc + 32MB (ie
>> default) thread cache.
>>
>> 3) AsyncMessenger is consistently faster than SimpleMessenger for 128K
>> random reads.
>>
>> 4) AsyncMessenger is sometimes slower than SimpleMessenger when memory
>> allocator optimizations are used.
>>
>> 5) AsyncMessenger currently uses far more RSS memory than SimpleMessenger.
>>
>> Here's a link to the paper:
>>
>> https://drive.google.com/file/d/0B2gTBZrkrnpZS1Q4VktjZkhrNHc/view
>>
>> Mark
>> ___
>> ceph-users mailing list
>> ceph-us...@lists.ceph.com
>> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>
>
>
>
> --
>
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: wip-addr

2015-10-09 Thread Haomai Wang
resend to ML

On Sat, Oct 10, 2015 at 11:20 AM, Haomai Wang <haomaiw...@gmail.com> wrote:
>
>
> On Sat, Oct 10, 2015 at 5:49 AM, Sage Weil <s...@newdream.net> wrote:
>>
>> Hey Marcus,
>>
>> On Fri, 2 Oct 2015, Marcus Watts wrote:
>> > wip-addr
>> >
>> > 1. where is it?
>> > 2. current state
>> > 3. more info
>> > 4. cheap fixes
>> > 5. in case you were wondering why?
>> >
>> >  1. where is it?
>> >
>> > I've just pushed another update to wip-addr:
>> >
>> > g...@github.com:linuxbox2/linuxbox-ceph.git
>> > wip-addr
>> >
>> >  2. current state
>> >
>> > This version
>> > 1/ compiles
>> > 2/ ran an extremely limited set of tests successfully
>> >   (was able to bring up ceph-mon, ceph-osd).
>> >
>> > In theory, it should do everything a recent "master" branch copy of ceph
>> > can do and little or nothing past that.  Internally it adds "address
>> > vector"
>> > support, some parsing/print logic, and lots of encoding rules to pass
>> > them
>> > around, but there's nothing that can create and little that makes any
>> > sensible use of this.  So this is just the back end encoding and storage
>> > rules.
>>
>> This is looking pretty good. I left some comments.  There are still a few
>> XXX's left... but not many.  Haomai, can you help with the async msgr one?
>> (Also, Marcus, can you check if the msg/Simple/Pipe.cc connect() and
>> accept() code doing the right thing?)
>
>
> I have a quick view among all commits, looks a great improvement for the
> future enhancing.
>
>>
>>
>> One minor thing.. please put the subsystem as the prefix to the commit
>> message instead of the branch name (e.g., mds: add features to event
>> types).
>>
>> > Phase 2 is to add logic to actually make it useful.
>> >   (the very start of this is on linuxbox2 "wip-addr-p2",
>> >   just monmap changes so far...)
>> >  3. more info
>> >
>> > There's an etherpad document that describes this in more detail,
>> >
>> > http://pad.ceph.com/p/wip_addr
>> >
>> >  4. cheap fixes
>> >
>> > a couple of minor issues that should be easy to resolve,
>> > 1.
>> > AsyncConnection.cc
>> > this passes addresses back and forth as it's setting up the connection,
>> > and it also exchanges features.  As best I can tell, it looks like
>> > it exchanges addresses before it knows what features the other end
>> > supports.  There should be something in here that
>> > does this after knowing what features the other end supports.
>>
>> Copying Haomai.
>
>
> Right, it should be the same as simple messenger. The "features" bit is
> exchanged in "ceph_msg_connect" and "ceph_msg_connect_reply".
>
> I'm afraid that making features before addr exchange isn't a smooth way.
> Maybe we need a middle release to help format migrating. Or we need to add
> retry mechanism, we could add proper way to let new-style addr side detect
> peer format.
>
>>
>>
>> > 2.
>> > (about line 2067 in src/tools/ceph_objectstore_tool.cc)
>> > (use via ceph cmd?) tools - "object store tool".
>> > This has a way to serialize objects which includes a watch list
>> > which includes an address.  There should be an option here to say
>> > whether to include exported addresses.
>>
>> I think it's safe to use defaults here.. what do you think, David?
>>
>> Thanks!
>> sage
>>
>
>
>
> --
>
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: advice on indexing sequential data?

2015-10-01 Thread Haomai Wang
resend

On Thu, Oct 1, 2015 at 7:56 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
>
>
> On Thu, Oct 1, 2015 at 6:44 PM, Tom Nakamura <tnakam...@eml.cc> wrote:
>>
>> Hello ceph-devel,
>>
>> My lab is concerned with developing data mining application for
>> detecting and 'deanonymizing' spamming botnets from high-volume spam
>> feeds.
>>
>> Currently, we just store everything in large mbox files in directories
>> separated by day (think: /data/feed1/2015/09/13.mbox) on a single NFS
>> server. We have ad-hoc scripts to extract features from these mboxes and
>> pass them to our analysis pipelines (written in a mixture of
>> R/matlab/python/etc). This system is reaching its limit point.
>>
>> We already have a small Ceph installation with which we've had good luck
>> for storing other data,and would like to see how we can use it to solve
>> our mail problem. Our basic requirements are that:
>>
>> - We need to be able to access each message by its extracted features.
>> These features include simple information found in its header (for
>> example From: and To:) as well as more complex information like
>> signatures from attachments and network information (for example,
>> presence in blacklists).
>> - We will frequently add/remove features.
>> - Faster access to recent data is more important than to older data.
>> - Maintaining strict ordering of incoming messages is not necessary. In
>> other words, if we received two spam messages on our feeds, it doesn't
>> matter too much if they are stored in that order, as long as we can have
>> coarse-grained temporal accuracy (say, 5 minutes). So we don't need
>> anything as sophisticated as Zlog.
>> - We need to be able to remove messages older than some specific age,
>> due to storage constraints.
>>
>> Any advice on how to use Ceph and librados to accomplish this?  Here are
>> my initial thoughts:
>>
>> - Each message is an object with some unique ID. Use omap to store all
>> its features in the same object.
>> - For each time period (which will have to be pre-specified to, say, an
>> hour), we have an object which contains a list of ID's, as a bytestring
>> of contatenated ID's. This should make expiring old messages trivial.
>> - For each feature, we have a timestamped index (like
>> 20150930-from-...@bar.com or
>> 20150813-has-attachment-with-hash-123abddeadbeef) the which contains a
>> list of ID's.
>> - Hopefully use Rados classes to index/feature-extract on the OSD's.
>>
>> How does this sound? One glaring omission is that I do not know how to
>> create indices which would support querying by inequality/ranges ('find
>> all messages between 1000 and 2000 bytes').
>
>
> I guess it's like labels in Gmail?
>
> Hmm, one object per message is a luxurious way to do it. I guess we need a
> primary index, which could be used to combine multiple messages into one
> rados object and store the offset/len mapping in omap/xattr. A secondary
> index can also be stored as an object, with omap used to refer to the
> actual data.
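
For illustration, a rough sketch of that layout using the librados C++ API; the
pool/object/key names and the "offset:length" encoding are made up, and a real
index would need its own error handling:

// Illustrative sketch only: pack messages into one rados object per time
// bucket (append) and keep the id -> offset:len mapping in omap, so a single
// operate() call updates data and index atomically for that object.
#include <rados/librados.hpp>
#include <cstdint>
#include <map>
#include <string>

int index_message(librados::IoCtx &io, const std::string &bucket_oid,
                  const std::string &msg_id, const librados::bufferlist &msg,
                  uint64_t bucket_size_before_append)
{
  librados::ObjectWriteOperation op;
  op.append(msg);                        // raw message body goes at the tail

  librados::bufferlist loc;
  loc.append(std::to_string(bucket_size_before_append) + ":" +
             std::to_string(msg.length()));
  std::map<std::string, librados::bufferlist> kv;
  kv[msg_id] = loc;                      // primary index entry for this message
  op.omap_set(kv);

  return io.operate(bucket_oid, &op);    // atomic for this bucket object
}

Secondary/feature indexes could then be separate objects whose omap values are
just message ids, and expiring a time bucket removes the data and its primary
index in one object delete.
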
>
>
>
>>
>>
>> Thank you,
>> Tom N.
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html
>
>
>
>
> --
>
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About Fio backend with ObjectStore API

2015-09-12 Thread Haomai Wang
It's really cool. Do you plan to push it upstream? I think it would be
more convenient if we made the fio repo a submodule.

On Sat, Sep 12, 2015 at 5:04 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
> I found my problem why segment:
>
> because fio links librbd/librados from my /usr/local/lib but use
> ceph/src/.libs/libfio_ceph_objectstore.so. They are different ceph
> version.
>
> So maybe we need to add check for abi version?
>
> On Sat, Sep 12, 2015 at 4:08 AM, Casey Bodley <cbod...@redhat.com> wrote:
>> Hi James,
>>
>> I just looked back at the results you posted, and saw that you were using 
>> iodepth=1. Setting this higher should help keep the FileStore busy.
>>
>> Casey
>>
>> - Original Message -
>>> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
>>> To: "Casey Bodley" <cbod...@redhat.com>
>>> Cc: "Haomai Wang" <haomaiw...@gmail.com>, ceph-devel@vger.kernel.org
>>> Sent: Friday, September 11, 2015 1:18:31 PM
>>> Subject: RE: About Fio backend with ObjectStore API
>>>
>>> Hi Casey,
>>>   You are right. I think the bottleneck is in fio side rather than in
>>>   filestore side in this case. The fio did not issue the io commands faster
>>>   enough to saturate the filestore.
>>>   Here is one of possible solution for it: Create a  async engine which are
>>>   normally way faster than sync engine in fio.
>>>
>>>Here is possible framework. This new Objectstore-AIO engine in FIO in
>>>theory will be way faster than sync engine. Once we have FIO which can
>>>saturate newstore, memstore and filestore, we can investigate them in
>>>very details of where the bottleneck in their design.
>>>
>>> .
>>> struct objectstore_aio_data {
>>>   struct aio_ctx *q_aio_ctx;
>>>   struct aio_completion_data *a_data;
>>>   aio_ses_ctx_t *p_ses_ctx;
>>>   unsigned int entries;
>>> };
>>> ...
>>> /*
>>>  * Note that the structure is exported, so that fio can get it via
>>>  * dlsym(..., "ioengine");
>>>  */
>>> struct ioengine_ops us_aio_ioengine = {
>>>   .name   = "objectstore-aio",
>>>   .version= FIO_IOOPS_VERSION,
>>>   .init   = fio_objectstore_aio_init,
>>>   .prep   = fio_objectstore_aio_prep,
>>>   .queue  = fio_objectstore_aio_queue,
>>>   .cancel = fio_objectstore_aio_cancel,
>>>   .getevents  = fio_objectstore_aio_getevents,
>>>   .event  = fio_objectstore_aio_event,
>>>   .cleanup= fio_objectstore_aio_cleanup,
>>>   .open_file  = fio_objectstore_aio_open,
>>>   .close_file = fio_objectstore_aio_close,
>>> };
>>>
>>>
>>> Let me know what you think.
>>>
>>> Regards,
>>> James
>>> 
>>> -Original Message-
>>> From: Casey Bodley [mailto:cbod...@redhat.com]
>>> Sent: Friday, September 11, 2015 7:28 AM
>>> To: James (Fei) Liu-SSI
>>> Cc: Haomai Wang; ceph-devel@vger.kernel.org
>>> Subject: Re: About Fio backend with ObjectStore API
>>>
>>> Hi James,
>>>
>>> That's great that you were able to get fio-objectstore running! Thanks to 
>>> you
>>> and Haomai for all the help with testing.
>>>
>>> In terms of performance, it's possible that we're not handling the
>>> completions optimally. When profiling with MemStore I remember seeing a
>>> significant amount of cpu time spent in polling with
>>> fio_ceph_os_getevents().
>>>
>>> The issue with reads is more of a design issue than a bug. Because the test
>>> starts with a mkfs(), there are no objects to read from initially. You would
>>> just have to add a write job to run before the read job, to make sure that
>>> the objects are initialized. Or perhaps the mkfs() step could be an optional
>>> part of the configuration.
>>>
>>> Casey
>>>
>>> - Original Message -
>>> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
>>> To: "Haomai Wang" <haomaiw...@gmail.com>, "Casey Bodley" 
>>> <cbod...@redhat.com>
>>> Cc: ceph-devel@vger.kernel.org
>>> Sent: Thursday, September 10, 2015 8:08:0

Re: About Fio backend with ObjectStore API

2015-09-12 Thread Haomai Wang
I found out why my segfault happens:

fio links librbd/librados from my /usr/local/lib but uses
ceph/src/.libs/libfio_ceph_objectstore.so, and they come from
different Ceph versions.

So maybe we need to add a check for the ABI version?
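
For illustration, the mismatch between the headers a plugin was built against
and the librados actually loaded at runtime could be caught with something like
the sketch below. This only covers the librados side and is not the existing
fio engine code; checking the objectstore ABI itself would need an extra hook.

// Sketch of a load-time sanity check: compare the librados version the plugin
// was compiled against (LIBRADOS_VER_* macros) with the version of the shared
// library that actually got loaded (rados_version()).
#include <cstdio>
#include <rados/librados.h>

static bool librados_version_matches()
{
  int major = 0, minor = 0, extra = 0;
  rados_version(&major, &minor, &extra);          // runtime library version
  if (major != LIBRADOS_VER_MAJOR || minor != LIBRADOS_VER_MINOR) {
    fprintf(stderr,
            "librados mismatch: built against %d.%d.%d, loaded %d.%d.%d\n",
            LIBRADOS_VER_MAJOR, LIBRADOS_VER_MINOR, LIBRADOS_VER_EXTRA,
            major, minor, extra);
    return false;
  }
  return true;
}

Calling something like this from the engine's init hook and refusing to run on
a mismatch would at least turn the segfault into a clear error message.
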

On Sat, Sep 12, 2015 at 4:08 AM, Casey Bodley <cbod...@redhat.com> wrote:
> Hi James,
>
> I just looked back at the results you posted, and saw that you were using 
> iodepth=1. Setting this higher should help keep the FileStore busy.
>
> Casey
>
> - Original Message -
>> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
>> To: "Casey Bodley" <cbod...@redhat.com>
>> Cc: "Haomai Wang" <haomaiw...@gmail.com>, ceph-devel@vger.kernel.org
>> Sent: Friday, September 11, 2015 1:18:31 PM
>> Subject: RE: About Fio backend with ObjectStore API
>>
>> Hi Casey,
>>   You are right. I think the bottleneck is in fio side rather than in
>>   filestore side in this case. The fio did not issue the io commands faster
>>   enough to saturate the filestore.
>>   Here is one of possible solution for it: Create a  async engine which are
>>   normally way faster than sync engine in fio.
>>
>>Here is possible framework. This new Objectstore-AIO engine in FIO in
>>theory will be way faster than sync engine. Once we have FIO which can
>>saturate newstore, memstore and filestore, we can investigate them in
>>very details of where the bottleneck in their design.
>>
>> .
>> struct objectstore_aio_data {
>>   struct aio_ctx *q_aio_ctx;
>>   struct aio_completion_data *a_data;
>>   aio_ses_ctx_t *p_ses_ctx;
>>   unsigned int entries;
>> };
>> ...
>> /*
>>  * Note that the structure is exported, so that fio can get it via
>>  * dlsym(..., "ioengine");
>>  */
>> struct ioengine_ops us_aio_ioengine = {
>>   .name   = "objectstore-aio",
>>   .version= FIO_IOOPS_VERSION,
>>   .init   = fio_objectstore_aio_init,
>>   .prep   = fio_objectstore_aio_prep,
>>   .queue  = fio_objectstore_aio_queue,
>>   .cancel = fio_objectstore_aio_cancel,
>>   .getevents  = fio_objectstore_aio_getevents,
>>   .event  = fio_objectstore_aio_event,
>>   .cleanup= fio_objectstore_aio_cleanup,
>>   .open_file      = fio_objectstore_aio_open,
>>   .close_file = fio_objectstore_aio_close,
>> };
>>
>>
>> Let me know what you think.
>>
>> Regards,
>> James
>> 
>> -Original Message-
>> From: Casey Bodley [mailto:cbod...@redhat.com]
>> Sent: Friday, September 11, 2015 7:28 AM
>> To: James (Fei) Liu-SSI
>> Cc: Haomai Wang; ceph-devel@vger.kernel.org
>> Subject: Re: About Fio backend with ObjectStore API
>>
>> Hi James,
>>
>> That's great that you were able to get fio-objectstore running! Thanks to you
>> and Haomai for all the help with testing.
>>
>> In terms of performance, it's possible that we're not handling the
>> completions optimally. When profiling with MemStore I remember seeing a
>> significant amount of cpu time spent in polling with
>> fio_ceph_os_getevents().
>>
>> The issue with reads is more of a design issue than a bug. Because the test
>> starts with a mkfs(), there are no objects to read from initially. You would
>> just have to add a write job to run before the read job, to make sure that
>> the objects are initialized. Or perhaps the mkfs() step could be an optional
>> part of the configuration.
>>
>> Casey
>>
>> - Original Message -
>> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
>> To: "Haomai Wang" <haomaiw...@gmail.com>, "Casey Bodley" <cbod...@redhat.com>
>> Cc: ceph-devel@vger.kernel.org
>> Sent: Thursday, September 10, 2015 8:08:04 PM
>> Subject: RE: About Fio backend with ObjectStore API
>>
>> Hi Casey and Haomai,
>>
>>   We finally made the fio-objectstore works in our end . Here is fio data
>>   against filestore with Samsung 850 Pro. It is sequential write and the
>>   performance is very poor which is expected though.
>>
>> Run status group 0 (all jobs):
>>   WRITE: io=524288KB, aggrb=9467KB/s, minb=9467KB/s, maxb=9467KB/s,
>>   mint=55378msec, maxt=55378msec
>>
>>   But anyway, it works even though still some bugs to fix like read and
>>   filesytem issue

Re: Failed on starting osd-daemon after upgrade giant-0.87.1 tohammer-0.94.3

2015-09-11 Thread Haomai Wang
Yesterday I had a chat with wangrui, and the reason is that the
"infos" (legacy oid) object is missing. I'm not sure why it's missing.

PS: resend again because of plain text

On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil  wrote:
> On Fri, 11 Sep 2015, ?? wrote:
>> Thank Sage Weil:
>>
>> 1. I delete some testing pools in the past, but is was a long time ago (may 
>> be 2 months ago), in recently upgrade, do not delete pools.
>> 2.  ceph osd dump please see the (attachment file ceph.osd.dump.log)
>> 3. debug osd = 20' and 'debug filestore = 20  (attachment file 
>> ceph.osd.5.log.tar.gz)
>
> This one is failing on pool 54, which has been deleted.  In this case you
> can work around it by renaming current/54.* out of the way.
>
>> 4. i install the ceph-test, but output error
>> ceph-kvstore-tool /ceph/data5/current/db list
>> Invalid argument: /ceph/data5/current/db: does not exist (create_if_missing 
>> is false)
>
> Sorry, I should have said current/omap, not current/db.  I'm still curious
> to see the key dump.  I'm not sure why the leveldb key for these pgs is
> missing...
>
> Thanks!
> sage
>
>
>>
>> ls -l /ceph/data5/current/db
>> total 0
>> -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
>> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
>> -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
>>
>> Thanks very much!
>> Wang Rui
>>
>> -- Original --
>> From:  "Sage Weil";
>> Date:  Fri, Sep 11, 2015 06:23 AM
>> To:  "??";
>> Cc:  "ceph-devel";
>> Subject:  Re: Failed on starting osd-daemon after upgrade giant-0.87.1 
>> tohammer-0.94.3
>>
>> Hi!
>>
>> On Wed, 9 Sep 2015, ?? wrote:
>> > Hi all:
>> >
>> > I got on error after upgrade my ceph cluster from giant-0.87.2 to 
>> > hammer-0.94.3, my local environment is:
>> > CentOS 6.7 x86_64
>> > Kernel 3.10.86-1.el6.elrepo.x86_64
>> > HDD: XFS, 2TB
>> > Install Package: ceph.com official RPMs x86_64
>> >
>> > step 1:
>> > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
>> >
>> > step 2:
>> > Upgrade OSD server from 0.87.1 to 0.94.3. i just upgrade two servers and 
>> > noticed that some osds can not started!
>> > server-1 have 4 osds, all of them can not started;
>> > server-2 have 3 osds, 2 of them can not started, but 1 of them 
>> > successfully started and work in good.
>> >
>> > Error log 1:
>> > service ceph start osd.4
>> > /var/log/ceph/ceph-osd.24.log
>> > (attachment file: ceph.24.log)
>> >
>> > Error log 2:
>> > /usr/bin/ceph-osd -c /etc/ceph/ceph.conf -i 4 -f
>> >  (attachment file: cli.24.log)
>>
>> This looks a lot like a problem with a stray directory that older versions
>> did not clean up (#11429)... but not quite.  Have you deleted pools in the
>> past? (Can you attach a 'ceph osd dump'?)?  Also, i fyou start the osd
>> with 'debug osd = 20' and 'debug filestore = 20' we can see which PG is
>> problematic.  If you install the 'ceph-test' package which contains
>> ceph-kvstore-tool, the output of
>>
>>  ceph-kvstore-tool /var/lib/ceph/osd/ceph-$id/current/db list
>>
>> would also be helpful.
>>
>> Thanks!
>> sage
> --
> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> the body of a message to majord...@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Failed on starting osd-daemon after upgrade giant-0.87.1 tohammer-0.94.3

2015-09-11 Thread Haomai Wang
On Fri, Sep 11, 2015 at 10:09 PM, Sage Weil <s...@newdream.net> wrote:
> On Fri, 11 Sep 2015, Haomai Wang wrote:
>> On Fri, Sep 11, 2015 at 8:56 PM, Sage Weil <s...@newdream.net> wrote:
>>   On Fri, 11 Sep 2015, ?? wrote:
>>   > Thank Sage Weil:
>>   >
>>   > 1. I delete some testing pools in the past, but is was a long
>>   time ago (may be 2 months ago), in recently upgrade, do not
>>   delete pools.
>>   > 2.  ceph osd dump please see the (attachment file
>>   ceph.osd.dump.log)
>>   > 3. debug osd = 20' and 'debug filestore = 20  (attachment file
>>   ceph.osd.5.log.tar.gz)
>>
>>   This one is failing on pool 54, which has been deleted.  In this
>>   case you
>>   can work around it by renaming current/54.* out of the way.
>>
>>   > 4. i install the ceph-test, but output error
>>   > ceph-kvstore-tool /ceph/data5/current/db list
>>   > Invalid argument: /ceph/data5/current/db: does not exist
>>   (create_if_missing is false)
>>
>>   Sorry, I should have said current/omap, not current/db.  I'm
>>   still curious
>>   to see the key dump.  I'm not sure why the leveldb key for these
>>   pgs is
>>   missing...
>>
>>
>> Yesterday I have a chat with wangrui and the reason is "infos"(legacy oid)
>> is missing. I'm not sure why it's missing.
>
> Probably
>
> https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
>
> Oh, I think I see what happened:
>
>  - the pg removal was aborted pre-hammer.  On pre-hammer, thsi means that
> load_pgs skips it here:
>
>  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L2121
>
>  - we upgrade to hammer.  we skip this pg (same reason), don't upgrade it,
> but delete teh legacy infos object
>
>  https://github.com/ceph/ceph/blob/hammer/src/osd/OSD.cc#L2908
>
>  - now we see this crash...
>
> I think the fix is, in hammer, to bail out of peek_map_epoch if the infos
> object isn't present, here
>
>  https://github.com/ceph/ceph/blob/hammer/src/osd/PG.cc#L2867
>
> Probably we should restructure so we can return a 'fail' value
> instead of a magic epoch_t meaning the same...
>
> This is similar to the bug I'm fixing on master (and I think I just
> realized what I was doing wrong there).

Hmm, I got it. So we could either skip this assert, or check whether
the pool exists the way load_pgs does?

I think it's an urgent bug, because I remember several people showing
me a similar crash.
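
To illustrate the "fail value instead of a magic epoch_t" restructuring, here
is a toy version of the idea; the types below are stand-ins, not the real
ObjectStore/PG classes.

// Toy illustration of returning an explicit failure instead of a magic
// epoch_t when the legacy infos object is missing.
#include <cstdint>
#include <iostream>

typedef uint64_t epoch_t;

struct InfosObject {     // stand-in for the per-PG "infos" object
  bool present;
  epoch_t map_epoch;
};

// Returns false when the infos object is gone, so the caller (load_pgs-style
// code) can skip the PG instead of tripping an assert.
static bool peek_map_epoch(const InfosObject &infos, epoch_t *out)
{
  if (!infos.present)
    return false;
  *out = infos.map_epoch;
  return true;
}

int main()
{
  InfosObject missing{false, 0};
  epoch_t e = 0;
  if (!peek_map_epoch(missing, &e))
    std::cout << "infos object missing, skipping this PG\n";
  return 0;
}
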


>
> Thanks!
> sage
>
>
>
>>
>>
>>   Thanks!
>>   sage
>>
>>
>>   >
>>   > ls -l /ceph/data5/current/db
>>   > total 0
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:41 LOCK
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG
>>   > -rw-r--r-- 1 root root 0 Sep 11 09:54 LOG.old
>>   >
>>   > Thanks very much!
>>   > Wang Rui
>>   >
>>   > -- Original --
>>   > From:  "Sage Weil"<s...@newdream.net>;
>>   > Date:  Fri, Sep 11, 2015 06:23 AM
>>   > To:  "??"<wang...@tvmining.com>;
>>   > Cc:  "ceph-devel"<ceph-devel@vger.kernel.org>;
>>   > Subject:  Re: Failed on starting osd-daemon after upgrade
>>   giant-0.87.1 tohammer-0.94.3
>>   >
>>   > Hi!
>>   >
>>   > On Wed, 9 Sep 2015, ?? wrote:
>>   > > Hi all:
>>   > >
>>   > > I got on error after upgrade my ceph cluster from
>>   giant-0.87.2 to hammer-0.94.3, my local environment is:
>>   > > CentOS 6.7 x86_64
>>   > > Kernel 3.10.86-1.el6.elrepo.x86_64
>>   > > HDD: XFS, 2TB
>>   > > Install Package: ceph.com official RPMs x86_64
>>   > >
>>   > > step 1:
>>   > > Upgrade MON server from 0.87.1 to 0.94.3, all is fine!
>>   > >
>>   > > step 2:
>>   > > Upgrade OSD server from 0.87.1 to 0.94.3. i just upgrade two
>>   servers and noticed that some osds can not started!
>>   > > server-1 have 4 osds, all of them can not started;
>>   > > server-2 have 3 osds, 2 of them can not started, but 1 of
>>   them successfully started and work in good.
>>   > >
>>   > > Error log 1:
>>   > > service ceph start osd.4
>>   > > /var/log/cep

[NewStore]About PGLog Workload With RocksDB

2015-09-08 Thread Haomai Wang
Hi Sage,

I notice your post in rocksdb page about make rocksdb aware of short
alive key/value pairs.

I think it would be great if one keyvalue db impl could support
different key types with different store behaviors. But it looks like
difficult for me to add this feature to an existing db.

So combine my experience with filestore, I just think let
NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
could be easy and effective. PGLog owned by PG and maintain the
history of ops. It's alike Journal Data but only have several hundreds
bytes. Actually we only need to have several hundreds MB at most to
store all pgs pglog. For FileStore, we already have FileJournal have a
copy of PGLog, previously I always think about reduce another copy in
leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
it need a lot of works to be done in FileJournal to aware of pglog
things. NewStore doesn't use FileJournal and it should be easier to
settle down my idea(?).

Actually I think a rados write op in current objectstore impl that
omap key/value pairs hurts performance hugely. Lots of cpu cycles are
consumed and contributes to short-alive keys(pglog). It should be a
obvious optimization point. In the other hands, pglog is dull and
doesn't need rich keyvalue api supports. Maybe a lightweight
filejournal to settle down pglogs keys is also worth to try.

In short, I think it would be cleaner and easier than improving
rocksdb to impl a pglog-optimization structure to store this.
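
To make the lightweight-journal idea a bit more concrete, here is a minimal
sketch of a ring-style pglog log: a preallocated file where new entries simply
overwrite the oldest ones, which matches the short-lived nature of these keys.
It assumes a single writer and skips checksums, per-PG indexing and replay
entirely; it is only meant to show how small such a structure could be
compared to going through leveldb/rocksdb.

// Minimal sketch of a ring-style pglog journal: a preallocated file where
// new entries overwrite the oldest ones once the ring wraps.
#include <cstdint>
#include <string>
#include <fcntl.h>
#include <unistd.h>

class PGLogRing {
  int fd = -1;
  uint64_t capacity;   // ring size in bytes (a few hundred MB at most)
  uint64_t head = 0;   // next write offset
public:
  PGLogRing(const std::string &path, uint64_t bytes) : capacity(bytes) {
    fd = ::open(path.c_str(), O_CREAT | O_RDWR, 0644);
    if (fd >= 0 && ::ftruncate(fd, capacity) != 0) {   // preallocate the ring
      ::close(fd);
      fd = -1;
    }
  }
  ~PGLogRing() { if (fd >= 0) ::close(fd); }

  // Append one serialized pglog entry with a length prefix; wrap to the
  // start of the file when the end is reached.
  bool append(const void *entry, uint32_t len) {
    uint32_t need = sizeof(len) + len;
    if (fd < 0 || need > capacity)
      return false;
    if (head + need > capacity)
      head = 0;                                // overwrite the oldest entries
    if (::pwrite(fd, &len, sizeof(len), head) < 0 ||
        ::pwrite(fd, entry, len, head + sizeof(len)) < 0)
      return false;
    head += need;
    return true;
  }
};

A real implementation would still need trimming driven by the pg log bounds
and a way to rebuild the in-memory index on startup, but each append here is a
couple of pwrite() calls instead of a full key/value insert-and-delete cycle.
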

PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NewStore]About PGLog Workload With RocksDB

2015-09-08 Thread Haomai Wang
On Tue, Sep 8, 2015 at 10:12 PM, Gregory Farnum <gfar...@redhat.com> wrote:
> On Tue, Sep 8, 2015 at 3:06 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
>> Hit "Send" by accident for previous mail. :-(
>>
>> some points about pglog:
>> 1. short-alive but frequency(HIGH)
>
> Is this really true? The default length of the log is 1000 entries,
> and most OSDs have ~100 PGs, so on a hard drive running at 80
> writes/second that's about 10^5 seconds (~27 hours) before we delete

I had the all-SSD case in mind... Yep, for HDD the pglog entries
aren't just passing travellers.

The main point, I think, is that pglog, journal data and omap keys are
three different types of data.

> an entry. In reality most deployments aren't writing that
> quickly... and if something goes wrong with the PG we increase to
> 10,000 log entries!
> -Greg
>
>> 2. small and related to the number of pgs
>> 3. typical seq read/write scene
>> 4. doesn't need rich structure like LSM or B-tree to support apis, has
>> obvious different to user-side/other omap keys.
>> 5. a simple loopback impl is efficient and simple
>>
>>
>> On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
>>> Hi Sage,
>>>
>>> I notice your post in rocksdb page about make rocksdb aware of short
>>> alive key/value pairs.
>>>
>>> I think it would be great if one keyvalue db impl could support
>>> different key types with different store behaviors. But it looks like
>>> difficult for me to add this feature to an existing db.
>>>
>>> So combine my experience with filestore, I just think let
>>> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
>>> could be easy and effective. PGLog owned by PG and maintain the
>>> history of ops. It's alike Journal Data but only have several hundreds
>>> bytes. Actually we only need to have several hundreds MB at most to
>>> store all pgs pglog. For FileStore, we already have FileJournal have a
>>> copy of PGLog, previously I always think about reduce another copy in
>>> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
>>> it need a lot of works to be done in FileJournal to aware of pglog
>>> things. NewStore doesn't use FileJournal and it should be easier to
>>> settle down my idea(?).
>>>
>>> Actually I think a rados write op in current objectstore impl that
>>> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
>>> consumed and contributes to short-alive keys(pglog). It should be a
>>> obvious optimization point. In the other hands, pglog is dull and
>>> doesn't need rich keyvalue api supports. Maybe a lightweight
>>> filejournal to settle down pglogs keys is also worth to try.
>>>
>>> In short, I think it would be cleaner and easier than improving
>>> rocksdb to impl a pglog-optimization structure to store this.
>>>
>>> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>>>
>>>
>>>
>>> --
>>> Best Regards,
>>>
>>> Wheat
>>
>>
>>
>> --
>> Best Regards,
>>
>> Wheat
>> --
>> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
>> the body of a message to majord...@vger.kernel.org
>> More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [NewStore]About PGLog Workload With RocksDB

2015-09-08 Thread Haomai Wang
Hit "Send" by accident for previous mail. :-(

some points about pglog:
1. short-alive but frequency(HIGH)
2. small and related to the number of pgs
3. typical seq read/write scene
4. doesn't need rich structure like LSM or B-tree to support apis, has
obvious different to user-side/other omap keys.
5. a simple loopback impl is efficient and simple


On Tue, Sep 8, 2015 at 9:58 PM, Haomai Wang <haomaiw...@gmail.com> wrote:
> Hi Sage,
>
> I notice your post in rocksdb page about make rocksdb aware of short
> alive key/value pairs.
>
> I think it would be great if one keyvalue db impl could support
> different key types with different store behaviors. But it looks like
> difficult for me to add this feature to an existing db.
>
> So combine my experience with filestore, I just think let
> NewStore/FileStore aware of this short-alive keys(Or just PGLog keys)
> could be easy and effective. PGLog owned by PG and maintain the
> history of ops. It's alike Journal Data but only have several hundreds
> bytes. Actually we only need to have several hundreds MB at most to
> store all pgs pglog. For FileStore, we already have FileJournal have a
> copy of PGLog, previously I always think about reduce another copy in
> leveldb to reduce leveldb calls which consumes lots of cpu cycles. But
> it need a lot of works to be done in FileJournal to aware of pglog
> things. NewStore doesn't use FileJournal and it should be easier to
> settle down my idea(?).
>
> Actually I think a rados write op in current objectstore impl that
> omap key/value pairs hurts performance hugely. Lots of cpu cycles are
> consumed and contributes to short-alive keys(pglog). It should be a
> obvious optimization point. In the other hands, pglog is dull and
> doesn't need rich keyvalue api supports. Maybe a lightweight
> filejournal to settle down pglogs keys is also worth to try.
>
> In short, I think it would be cleaner and easier than improving
> rocksdb to impl a pglog-optimization structure to store this.
>
> PS(off topic): a keyvaluedb benchmark http://sphia.org/benchmarks.html
>
>
>
> --
> Best Regards,
>
> Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About Fio backend with ObjectStore API

2015-09-04 Thread Haomai Wang
On Sat, Sep 5, 2015 at 4:29 AM, James (Fei) Liu-SSI
<james@ssi.samsung.com> wrote:
> Hi Casey,
>Thanks. I even got a compiling error with fio-objectstore branch.
>
> Here is error message:
>
>   make[3]: *** No rule to make target `rbd_fuse/rbd-fuse.c', needed by 
> `rbd_fuse/rbd-fuse.o'.  Stop.
> make[3]: *** Waiting for unfinished jobs
>   CXX  ceph_fuse.o

This is an old make problem. I nearly forgot the details, but you can
run "make clean" and then try again. Or remove this Ceph build
directory entirely and git clone a clean tree.

> make[3]: Leaving directory `/home/jamesliu/WorkSpace/ceph_fio/src'
> make[2]: *** [all-recursive] Error 1
> make[2]: Leaving directory `/home/jamesliu/WorkSpace/ceph_fio/src'
> make[1]: *** [all] Error 2
> make[1]: Leaving directory `/home/jamesliu/WorkSpace/ceph_fio/src'
> make: *** [all-recursive] Error 1
> jamesliu@jamesliu-OptiPlex-7010:~/WorkSpace/ceph_fio$ git branch
> * fio-objectstore
>   master
>
>
> Regards,
> James
>
> -Original Message-
> From: Casey Bodley [mailto:cbod...@redhat.com]
> Sent: Thursday, September 03, 2015 10:44 AM
> To: James (Fei) Liu-SSI
> Cc: ceph-devel@vger.kernel.org
> Subject: Re: About Fio backend with ObjectStore API
>
> Hi James,
>
> I'm sorry for not following up on that segfault, but I wasn't ever able to 
> reproduce it. I used it recently for memstore testing without any problems. I 
> wonder if there's a problem with the autotools build? I've only tested it 
> with cmake. When I find some time, I'll rebase it on master and do another 
> round of testing.
>
> Casey
>
> - Original Message -
>> From: "James (Fei) Liu-SSI" <james@ssi.samsung.com>
>> To: "Haomai Wang" <haomaiw...@gmail.com>, "Casey Bodley"
>> <cbod...@redhat.com>
>> Cc: "Casey Bodley" <cbod...@gmail.com>, "Matt W. Benjamin"
>> <m...@cohortfs.com>, ceph-devel@vger.kernel.org
>> Sent: Wednesday, September 2, 2015 8:06:14 PM
>> Subject: RE: About Fio backend with ObjectStore API
>>
>> Hi Haomai and Case,
>>  Do you have any fixes for that segfault?
>>
>> Thanks,
>> James
>>
>> -Original Message-
>> From: Haomai Wang [mailto:haomaiw...@gmail.com]
>> Sent: Wednesday, July 22, 2015 6:07 PM
>> To: Casey Bodley
>> Cc: Casey Bodley; Matt W. Benjamin; James (Fei) Liu-SSI;
>> ceph-devel@vger.kernel.org
>> Subject: Re: About Fio backend with ObjectStore API
>>
>> no special
>>
>> [global]
>> #logging
>> #write_iops_log=write_iops_log
>> #write_bw_log=write_bw_log
>> #write_lat_log=write_lat_log
>> ioengine=./ceph-int/src/.libs/libfio_ceph_objectstore.so
>> invalidate=0 # mandatory
>> rw=write
>> #bs=4k
>>
>> [filestore]
>> iodepth=1
>> # create a journaled filestore
>> objectstore=filestore
>> directory=./osd/
>> filestore_journal=./osd/journal
>>
>> On Thu, Jul 23, 2015 at 4:56 AM, Casey Bodley <cbod...@redhat.com> wrote:
>> > Hi Haomai,
>> >
>> > Sorry for the late response, I was out of the office. I'm afraid I
>> > haven't run into that segfault. The io_ops should be set at the very
>> > beginning when it calls get_ioengine(). All I can suggest is that
>> > you verify that your job file is pointing to the correct
>> > fio_ceph_objectstore.so. If you've made any other interesting
>> > changes to the job file, could you share it here?
>> >
>> > Casey
>> >
>> > - Original Message -
>> > From: "Haomai Wang" <haomaiw...@gmail.com>
>> > To: "Casey Bodley" <cbod...@gmail.com>
>> > Cc: "Matt W. Benjamin" <m...@cohortfs.com>, "James (Fei) Liu-SSI"
>> > <james@ssi.samsung.com>, ceph-devel@vger.kernel.org
>> > Sent: Tuesday, July 21, 2015 7:50:32 AM
>> > Subject: Re: About Fio backend with ObjectStore API
>> >
>> > Hi Casey,
>> >
>> > I check your commits and know what you fixed. I cherry-picked your
>> > new commits but I still met the same problem.
>> >
>> > """
>> > It's strange that it alwasys hit segment fault when entering
>> > "_fio_setup_ceph_filestore_data", gdb tells "td->io_ops" is NULL but
>> > when I up the stack, the "td->io_ops" is not null. Maybe it's
>> > related to dlopen?
>> > """
>> >
>> > Do you have any hint about thi

Re: wakeup( ) in async messenger‘ event

2015-08-28 Thread Haomai Wang
On Fri, Aug 28, 2015 at 2:35 PM, Jianhui Yuan zuiwany...@gmail.com wrote:
 Hi Haomai,

 when we use async messenger, the client(as: ceph -s) always stuck in
 WorkerPool::barrier for 30 seconds. It seems the wakeup don't work.

What's the Ceph version and OS version? It sounds like a bug we
already fixed before.
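
For reference, the usual shape of this wakeup mechanism is a non-blocking
pipe: wakeup() writes one byte, and the notify handler drains the read end
until there is nothing left, so queued wakeups cannot be lost. A generic
sketch, not the actual AsyncMessenger code:

// Sketch of the "wakeup pipe" pattern: the writer side writes one byte to
// wake the event loop, and the handler drains the non-blocking read end.
#include <cerrno>
#include <fcntl.h>
#include <unistd.h>

struct WakeupPipe {
  int fds[2] = {-1, -1};

  bool open() {
    if (::pipe(fds) != 0)
      return false;
    ::fcntl(fds[0], F_SETFL, ::fcntl(fds[0], F_GETFL, 0) | O_NONBLOCK);
    return true;
  }

  // Called from another thread to wake the event loop.
  void wakeup() {
    char c = 'x';
    (void)::write(fds[1], &c, 1);
  }

  // Called by the event loop (e.g. from a notify handler) when the read end
  // is readable: keep reading until EAGAIN so no queued wakeup is missed.
  void drain() {
    char buf[256];
    while (true) {
      ssize_t r = ::read(fds[0], buf, sizeof(buf));
      if (r <= 0) {
        if (r < 0 && errno == EINTR)
          continue;
        break;   // EAGAIN/EWOULDBLOCK or EOF: nothing left to read
      }
    }
  }
};
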


 Then, I remove already_wakeup in wakeup. It seems to be working well. So,
 can we just remove already_wakeup. And do read in C_handle_notify until it
 is not data to read.

 Jianhui Yuan



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: support of non-block connect in async messenger?

2015-08-27 Thread Haomai Wang
On Thu, Aug 27, 2015 at 3:47 PM, Jianhui Yuan zuiwany...@gmail.com wrote:
 Hi Haomai,

 In my environment, I suffer from long timeout when connect a breakdown node.
 So I write some code to support non-block connect in async . And It seems to
 be working well. So, I want to know if non-block connect in async may have
 some problem that can't be sloved now, or we may just support this feature?


Yep, the async messenger should avoid all potential blocking points. In
msg/async/net_handler.cc we already have a NetHandler::nonblock_connect
impl. Long ago I removed the caller code because non-blocking connect
made problems harder to debug. Now it looks like a good time to add
non-blocking connect support again.
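
For reference, the generic non-blocking connect flow looks roughly like the
sketch below: set O_NONBLOCK, let connect() return EINPROGRESS, then wait for
writability with a bounded timeout and check SO_ERROR. In the async messenger
the writability wait would of course go through the event center instead of an
inline poll(); this is just the POSIX-level idea.

// Generic sketch of a non-blocking connect with a bounded timeout; it only
// shows the EINPROGRESS/poll/SO_ERROR flow, not the real NetHandler code.
#include <cerrno>
#include <fcntl.h>
#include <poll.h>
#include <sys/socket.h>

static int nonblock_connect(int sd, const sockaddr *addr, socklen_t len,
                            int timeout_ms)
{
  ::fcntl(sd, F_SETFL, ::fcntl(sd, F_GETFL, 0) | O_NONBLOCK);

  int r = ::connect(sd, addr, len);
  if (r == 0)
    return 0;                          // connected immediately
  if (errno != EINPROGRESS)
    return -errno;                     // hard failure (e.g. ECONNREFUSED)

  struct pollfd pfd = { sd, POLLOUT, 0 };
  r = ::poll(&pfd, 1, timeout_ms);     // bounded wait instead of the long
  if (r <= 0)                          // default TCP connect timeout
    return r == 0 ? -ETIMEDOUT : -errno;

  int err = 0;
  socklen_t errlen = sizeof(err);
  ::getsockopt(sd, SOL_SOCKET, SO_ERROR, &err, &errlen);
  return err ? -err : 0;
}
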

 Jianhui Yuan



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: format 2TB rbd device is too slow

2015-08-26 Thread Haomai Wang
On Wed, Aug 26, 2015 at 11:16 PM, huang jun hjwsm1...@gmail.com wrote:
 hi,all
 we create a 2TB rbd image, after map it to local,
 then we format it to xfs with 'mkfs.xfs /dev/rbd0', it spent 318
 seconds to finish, but  local physical disk with the same size just
 need 6 seconds.


I think librbd has two PRs related to this.

 After debug, we found there are two steps in rbd module during formating:
 a) send  233093 DELETE requests to osds(number_of_requests = 2TB / 4MB),
this step spent almost 92 seconds.

I guess this (https://github.com/ceph/ceph/pull/4221/files) may help.

 b) send 4238 messages like this: [set-alloc-hint object_size 4194304
 write_size 4194304,write 0~512] to osds, that spent 227 seconds.

I think kernel rbd also needs to use
https://github.com/ceph/ceph/pull/4983/files


 is there any optimations can we do?
 --
 thanks
 huangjun
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: async messenger

2015-08-24 Thread Haomai Wang
On Tue, Aug 25, 2015 at 5:28 AM, Sage Weil sw...@redhat.com wrote:
 Hi Haomai,

 How did your most recent async messenger run go?

 If there aren't major issues, we'd like to start mixing it in with the
 regular rados suite by doing 'ms type = random'...

From the last run, we have no async-related failed jobs. I would like
to schedule a run with the random type by hand to see whether there is
any serious compatibility problem.


 sage



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph Hackathon: More Memory Allocator Testing

2015-08-20 Thread Haomai Wang
On Thu, Aug 20, 2015 at 2:35 PM, Dałek, Piotr
piotr.da...@ts.fujitsu.com wrote:
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
 ow...@vger.kernel.org] On Behalf Of Blinick, Stephen L
 Sent: Wednesday, August 19, 2015 6:58 PM

 [..
 Regarding the all-HDD or high density HDD nodes, is it certain these issues
 with tcmalloc don't apply, due to lower performance, or would it potentially
 be something that would manifest over a longer period of time
 (weeks/months) of running?   I know we've seen some weirdness attributed
 to tcmalloc on our 10-disk 20-node cluster with HDD's + SSD journals, but it
 took a few weeks.

 And it takes me just a few minutes with rados bench to reproduce this issue 
 on mixed-storage node (SSDs, SAS disks, high-capacity SATA disks, etc).
 See here: http://ceph.predictor.org.pl/cpu_usage_over_time.xlsx
 It gets even worse when rebalancing starts...

Cool, that matches my expectation. I guess the only way to ease the
memory problem is to address it for each heavy memory-allocation use
case.


 With best regards / Pozdrawiam
 Piotr Dałek



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Inline dedup/compression

2015-08-20 Thread Haomai Wang
I found a blog
(http://mysqlserverteam.com/innodb-transparent-pageio-compression/)
about MySQL InnoDB transparent compression. It's surprising that
InnoDB does this at a low level (just like the filestore in Ceph) and
relies on the filesystem's file-hole feature. I'm very suspicious
about the performance after storing lots of *small* hole files on the
fs. If it is reliable, it would be easy for filestore/newstore to
implement a similar feature.
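
For context, the file-hole mechanism that approach relies on is just
FALLOC_FL_PUNCH_HOLE: after compressing a block in place, the unused tail of
the block is punched out so the filesystem frees those blocks while the file
offsets stay unchanged. A rough Linux-only sketch, not filestore/newstore code:

// Sketch of the punch-hole call transparent page compression relies on: the
// compressed data stays at its original offset and the now-unused tail of the
// block becomes a hole. Requires a filesystem with hole punching (XFS, ext4).
#include <cerrno>
#include <fcntl.h>
#include <linux/falloc.h>

static int punch_tail(int fd, off_t block_off, size_t block_size,
                      size_t compressed_len)
{
  if (compressed_len >= block_size)
    return 0;                        // nothing saved, leave the block alone
  off_t hole_off = block_off + compressed_len;
  off_t hole_len = block_size - compressed_len;
  // Only whole filesystem blocks inside the range are actually freed, which
  // is exactly why lots of small holes can fragment the file, as noted above.
  if (::fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  hole_off, hole_len) < 0)
    return -errno;
  return 0;
}
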

On Fri, Jul 3, 2015 at 1:13 PM, Allen Samuels allen.samu...@sandisk.com wrote:
 For non-overwriting relatively large objects, this scheme works fine. 
 Unfortunately the real use-case for deduplication is block storage with 
 virtualized infrastructure (eliminating duplicate operating system files and 
 applications, etc.) and in order for this to provide good deduplication, 
 you'll need a block size that's equal or smaller than the cluster-size of the 
 file system mounted on the block device. Meaning that your storage is now 
 dominated by small chunks (probably 8K-ish) rather than the relatively large 
 4M stripes that is used today (this will also kill EC since small objects are 
 replicated rather than ECed). This will have a massive impact on backend 
 storage I/O as the basic data/metadata ratio is complete skewed (both for 
 static storage and dynamic I/O count).


 Allen Samuels
 Software Architect, Emerging Storage Solutions

 2880 Junction Avenue, Milpitas, CA 95134
 T: +1 408 801 7030| M: +1 408 780 6416
 allen.samu...@sandisk.com


 -Original Message-
 From: Chaitanya Huilgol
 Sent: Thursday, July 02, 2015 3:50 AM
 To: James (Fei) Liu-SSI; Allen Samuels; Haomai Wang
 Cc: ceph-devel
 Subject: RE: Inline dedup/compression

 Hi James et.al ,

 Here is an example for clarity,
 1. Client Writes object  object.abcd
 2. Based on the crush rules, say  OSD.a is the primary OSD which receives the 
 write 3. OSD.a  performs segmenting/fingerprinting which can be static or 
 dynamic and generates a list of segments, the object.abcd is now represented 
 by a manifest object with the list of segment hash and len  [Header]  
 [Seg1_sha, len]  [Seg2_sha, len]  ...
  [Seg3_sha, len]
 4. OSD.a writes each segment as a new object in the cluster with object name  
 <reserved_dedupe_perfix><sha> 5. The dedupe object write is treated 
 differently from regular object writes, If the object is present then an 
 object reference count is incremented and the object is not overwritten - 
 this forms the basis of the dedupe logic. Multiple objects with one or more 
 same constituent segments start sharing the segment objects.
 6. Once all the segments are successfully written the object 'object.abcd' is 
 now just a stub object with the segment manifest as described above and is 
 goes through a regular object write sequence

 Partial writes on objects will be complicated,
 - Partially affected segments will have to be read and segmentation logic has 
 to be run from first to last affected segment boundaries
 -  New segments will be written
 - Old overwritten segments have to be deleted
 - Write merged manifest of the object

 All this will need protection of the PG lock, Also additional journaling 
 mechanism will be needed to  recover from cases where the osd goes down 
 before writing all the segments.

 Since this is quite a lot of processing, a better use case for this dedupe 
 mechanism would be in the data tiering model with object redirects.
 The manifest object fits quiet well into object redirects scheme of things, 
 the idea is that, when an object is moved out of the base tier, you have an 
 option to create a dedupe stub object and write individual segments into the 
 cold backend tier with a rados plugin.

 Remaining responses inline.

 Regards,
 Chaitanya

 -Original Message-
 From: James (Fei) Liu-SSI [mailto:james@ssi.samsung.com]
 Sent: Wednesday, July 01, 2015 4:00 AM
 To: Chaitanya Huilgol; Allen Samuels; Haomai Wang
 Cc: ceph-devel
 Subject: RE: Inline dedup/compression

 Hi Chaitanya,
Very interesting thoughts. I am not sure whether I get all of them or now. 
 Here are several questions for the solution you provided, Might be a little 
 bit detailed.

 Regards,
 James

 - Dedupe is set as a pool property
 Write:
 - Write arrives at the primary OSD/pg
 [James] Does the OSD/PG mean PG Backend over here?
 [Chaitanya] I mean the Primary OSD and the PG which get selected by the crush 
 - not the specific OSD component

 - Data is segmented (rabin/static) and secure hash computed [James] Which 
 component in OSD are you going to do the data segment and hash computation?
 [Chaitanya] If partial writes are not supported then this could be down 
 before acquiring the PG lock, else we need the protection of the PG lock.  
 Probably in the do_request() path?

 - A manifest is created with the offset/len/hash for all the segments [James] 
 The manifest is going to be part of xattr of object? Where are you going to 
 save manifest?
 [Chaitanya] The manifest is a stub object

Re: Ceph Hackathon: More Memory Allocator Testing

2015-08-19 Thread Haomai Wang
On Wed, Aug 19, 2015 at 1:36 PM, Somnath Roy somnath@sandisk.com wrote:
 Mark,
 Thanks for verifying this. Nice report !
 Since there is a big difference in memory consumption with jemalloc, I would 
 say a recovery performance data or client performance data during recovery 
 would be helpful.


The RSS memory usage in the report is per OSD, I guess (really?). It
can't be ignored, since it's a really big difference in memory usage.

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Mark Nelson
 Sent: Tuesday, August 18, 2015 9:46 PM
 To: ceph-devel
 Subject: Ceph Hackathon: More Memory Allocator Testing

 Hi Everyone,

 One of the goals at the Ceph Hackathon last week was to examine how to 
 improve Ceph Small IO performance.  Jian Zhang presented findings showing a 
 dramatic improvement in small random IO performance when Ceph is used with 
 jemalloc.  His results build upon Sandisk's original findings that the 
 default thread cache values are a major bottleneck in TCMalloc 2.1.  To 
 further verify these results, we sat down at the Hackathon and configured the 
 new performance test cluster that Intel generously donated to the Ceph 
 community laboratory to run through a variety of tests with different memory 
 allocator configurations.  I've since written the results of those tests up 
 in pdf form for folks who are interested.

 The results are located here:

 http://nhm.ceph.com/hackathon/Ceph_Hackathon_Memory_Allocator_Testing.pdf

 I want to be clear that many other folks have done the heavy lifting here.  
 These results are simply a validation of the many tests that other folks have 
 already done.  Many thanks to Sandisk and others for figuring this out as 
 it's a pretty big deal!

 Side note:  Very little tuning other than swapping the memory allocator and a 
 couple of quick and dirty ceph tunables were set during these tests. It's 
 quite possible that higher IOPS will be achieved as we really start digging 
 into the cluster and learning what the bottlenecks are.

 Thanks,
 Mark
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
 body of a message to majord...@vger.kernel.org More majordomo info at  
 http://vger.kernel.org/majordomo-info.html

 

 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).




-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: infernalis feature freeze

2015-08-12 Thread Haomai Wang
I hope this PR can be pushed into infernalis (I) :-) It has been
waiting for so long:
https://github.com/ceph/ceph/pull/3595

On Thu, Aug 13, 2015 at 5:20 AM, Sage Weil sw...@redhat.com wrote:
 The infernalis feature freeze is coming up Real Soon Now.  I've marked
 some of the pull requests on github that I would like to see merged.
 Please take a look:

 
 https://github.com/ceph/ceph/pulls?q=is%3Aopen+is%3Apr+milestone%3Ainfernalis

 Ideally we should focus our testing efforts on whatever is on this list.

 I didn't look at the bug fix PRs carefully since generally these go in
 once tested regardless of any feature freeze, but I suggest we mark things
 that need to make it into infernalis anyway so that we focus our efforts.

 The big items on my list that are pending testing are:

 wip-newstore-sort (teuthology running now)
 wip-newstore (will be marked experimental)
 wip-osd-compat (enforces upgrades include hammer, needs qa)
 wip-user (run daemons as user 'ceph')
 proxy writes (Sam is testing this)
 bufferlist tuning (performance)
 MOSDOp staged decoding (performance)

 There's tons of other stuff pending that is not on this list.  As always,
 our ability to merge code is limited primarily on our ability to test
 it--we can't afford to destabilize core ceph by merging something that we
 aren't confident will work correctly and will not easily break down the
 line.  If you have something pending that you want to get in, the single
 biggest thing you can do to help that happen is to write more or better
 tests for it (e.g., things that run during 'make check') and to run the
 teuthology regression suite (see Loic's recent blogs about doing this with
 OpenStack[1]).

 Thanks!
 sage

 [1] http://dachary.org/?p=3852
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Async reads, sync writes, op thread model discussion

2015-08-12 Thread Haomai Wang
On Wed, Aug 12, 2015 at 2:55 PM, Somnath Roy somnath@sandisk.com wrote:
 Haomai,
 Yes, one of the goals is to make async read xattr..
 IMO, this scheme should benefit in the following scenario..

 Ops within a PG will not be serialized any more as long as it is not coming 
 on the same object and this could be a big win.


 In our workload at least we are not seeing the shard queue depth is going 
 high indicating no bottleneck from workQ end. I am doubtful of having async 
 completion in such case would be helpful.
 Overall, I agree that this is the way to go forward...

Yes, although it doesn't hit a high queue depth, that doesn't mean IO
latency isn't affected by the sync xattr reads :-). We could observe
at least 100ms for inode reads at deep queue depths. Like this
PR (https://github.com/ceph/ceph/pull/5497/files#diff-72747d40a424e7b5404366b557ff12a3),
it suffers from the high latency of the fd read.

So if we had async reads, we could start reading the object as early
as possible?

Second, if we use fiemap reads, we could issue multiple async read ops
and reap them. I guess it would benefit the rbd case, although I'm not
sure how many people enable fiemap.
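
As a sketch of the "issue multiple async read ops and reap them" idea at the
syscall level, the plain libaio flow would look something like this. It is a
standalone illustration, not the ObjectStore interface, and buffers/offsets
are assumed to satisfy any alignment requirements of the underlying file.

// Standalone sketch of issuing several reads at once with libaio and reaping
// the completions, instead of one blocking pread per extent.
#include <libaio.h>
#include <sys/types.h>
#include <utility>
#include <vector>

// extents: (offset, length) pairs; bufs[i] must hold extents[i].second bytes.
static int read_extents(int fd,
                        const std::vector<std::pair<off_t, size_t> > &extents,
                        std::vector<char *> &bufs)
{
  io_context_t ctx = 0;
  if (io_setup(extents.size(), &ctx) < 0)
    return -1;

  std::vector<iocb> cbs(extents.size());
  std::vector<iocb *> ptrs(extents.size());
  for (size_t i = 0; i < extents.size(); ++i) {
    io_prep_pread(&cbs[i], fd, bufs[i], extents[i].second, extents[i].first);
    ptrs[i] = &cbs[i];
  }

  // Submit the whole batch in one call so the device sees a deep queue...
  if (io_submit(ctx, ptrs.size(), ptrs.data()) < 0) {
    io_destroy(ctx);
    return -1;
  }

  // ...and reap the completions. A real caller could reap incrementally
  // (min_nr = 1) and start processing data while other reads are in flight.
  std::vector<io_event> events(extents.size());
  int n = io_getevents(ctx, extents.size(), extents.size(),
                       events.data(), nullptr);
  io_destroy(ctx);
  return n;
}
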


 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
 Sent: Tuesday, August 11, 2015 7:50 PM
 To: Yehuda Sadeh-Weinraub
 Cc: Samuel Just; Sage Weil; ceph-devel@vger.kernel.org
 Subject: Re: Async reads, sync writes, op thread model discussion

 On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub ysade...@redhat.com 
 wrote:
 Already mentioned it on irc, adding to ceph-devel for the sake of
 completeness. I did some infrastructure work for rgw and it seems (at
 least to me) that it could at least be partially useful here.
 Basically it's an async execution framework that utilizes coroutines.
 It's comprised of aio notification manager that can also be tied into
 coroutines execution. The coroutines themselves are stackless, they
 are implemented as state machines, but using some boost trickery to
 hide the details so they can be written very similar to blocking
 methods. Coroutines can also execute other coroutines and can be
 stacked, or can generate concurrent execution. It's still somewhat in
 flux, but I think it's mostly done and already useful at this point,
 so if there's anything you could use it might be a good idea to avoid
 effort duplication.


 coroutines like qemu is cool. The only thing I afraid is the complicate of 
 debug and it's really a big task :-(

 I agree with sage that this design is really a new implementation for 
 objectstore so that it's harmful to existing objectstore impl. I also suffer 
 the pain from sync read xattr, we may add a async read interface to solove 
 this?

 For context switch thing, now we have at least 3 cs for one op at osd side. 
 messenger - op queue - objectstore queue. I guess op queue - objectstore 
 is easier to kick off just as sam said. We can make write journal inline with 
 queue_transaction, so the caller could directly handle the transaction right 
 now.

 Anyway, I think we need to do some changes for this field.

 Yehuda

 On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just sj...@redhat.com wrote:
 Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
 to the actual interface I presented so much as the notion that the
 next thing to do is restructure the OpWQ users as async state
 machines.
 -Sam

 On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil s...@newdream.net wrote:
 On Tue, 11 Aug 2015, Samuel Just wrote:
 Currently, there are some deficiencies in how the OSD maps ops onto 
 threads:

 1. Reads are always syncronous limiting the queue depth seen from the 
 device
and therefore the possible parallelism.
 2. Writes are always asyncronous forcing even very fast writes to be 
 completed
in a seperate thread.
 3. do_op cannot surrender the thread/pg lock during an operation forcing 
 reads
required to continue the operation to be syncronous.

 For spinning disks, this is mostly ok since they don't benefit as
 much from large read queues, and writes (filestore with journal)
 are too slow for the thread switches to make a big difference.  For
 very fast flash, however, we want the flexibility to allow the
 backend to perform writes syncronously or asyncronously when it
 makes sense, and to maintain a larger number of outstanding reads
 than we have threads.  To that end, I suggest changing the ObjectStore 
 interface to be somewhat polling based:

 /// Create new token
 void *create_operation_token() = 0; bool is_operation_complete(void
 *token) = 0; bool is_operation_committed(void *token) = 0; bool
 is_operation_applied(void *token) = 0; void wait_for_committed(void
 *token) = 0; void wait_for_applied(void *token) = 0; void
 wait_for_complete(void *token) = 0; /// Get result of operation int
 get_result(void *token) = 0; /// Must only be called once
 is_opearation_complete(token) void

Re: Async reads, sync writes, op thread model discussion

2015-08-11 Thread Haomai Wang
On Wed, Aug 12, 2015 at 6:34 AM, Yehuda Sadeh-Weinraub
ysade...@redhat.com wrote:
 Already mentioned it on irc, adding to ceph-devel for the sake of
 completeness. I did some infrastructure work for rgw and it seems (at
 least to me) that it could at least be partially useful here.
 Basically it's an async execution framework that utilizes coroutines.
 It's comprised of aio notification manager that can also be tied into
 coroutines execution. The coroutines themselves are stackless, they
 are implemented as state machines, but using some boost trickery to
 hide the details so they can be written very similar to blocking
 methods. Coroutines can also execute other coroutines and can be
 stacked, or can generate concurrent execution. It's still somewhat in
 flux, but I think it's mostly done and already useful at this point,
 so if there's anything you could use it might be a good idea to avoid
 effort duplication.


Coroutines like QEMU's are cool. The only thing I'm afraid of is the
added difficulty of debugging, and it's really a big task :-(

I agree with Sage that this design is really a new implementation of
the objectstore, so it's disruptive for the existing objectstore
implementations. I also suffer the pain of sync xattr reads; maybe we
could add an async read interface to solve this?

As for context switches, we currently have at least 3 per op on the
OSD side: messenger -> op queue -> objectstore queue. I guess op
queue -> objectstore is the easier one to kick off, just as Sam said.
We could make the journal write inline with queue_transaction, so the
caller could handle the transaction directly right away.

Anyway, I think we need to make some changes in this area.

 Yehuda

 On Tue, Aug 11, 2015 at 3:19 PM, Samuel Just sj...@redhat.com wrote:
 Yeah, I'm perfectly happy to have wrappers.  I'm also not at all tied
 to the actual interface I presented so much as the notion that the
 next thing to do is restructure the OpWQ users as async state
 machines.
 -Sam

 On Tue, Aug 11, 2015 at 1:05 PM, Sage Weil s...@newdream.net wrote:
 On Tue, 11 Aug 2015, Samuel Just wrote:
 Currently, there are some deficiencies in how the OSD maps ops onto 
 threads:

 1. Reads are always synchronous, limiting the queue depth seen from the device
    and therefore the possible parallelism.
 2. Writes are always asynchronous, forcing even very fast writes to be completed
    in a separate thread.
 3. do_op cannot surrender the thread/pg lock during an operation, forcing reads
    required to continue the operation to be synchronous.

 For spinning disks, this is mostly ok since they don't benefit as much from
 large read queues, and writes (filestore with journal) are too slow for the
 thread switches to make a big difference.  For very fast flash, however, we
 want the flexibility to allow the backend to perform writes synchronously or
 asynchronously when it makes sense, and to maintain a larger number of
 outstanding reads than we have threads.  To that end, I suggest changing the
 ObjectStore interface to be somewhat polling based:

 /// Create new token
 void *create_operation_token() = 0;
 bool is_operation_complete(void *token) = 0;
 bool is_operation_committed(void *token) = 0;
 bool is_operation_applied(void *token) = 0;
 void wait_for_committed(void *token) = 0;
 void wait_for_applied(void *token) = 0;
 void wait_for_complete(void *token) = 0;
 /// Get result of operation
 int get_result(void *token) = 0;
 /// Must only be called once is_operation_complete(token)
 void reset_operation_token(void *token) = 0;
 /// Must only be called once is_operation_complete(token)
 void destroy_operation_token(void *token) = 0;

 /**
  * Queue a transaction
  *
  * token must be either fresh or reset since the last operation.
  * If the operation is completed synchronously, token can be reused
  * without calling reset_operation_token.
  *
  * @result 0 if completed synchronously, -EAGAIN if async
  */
 int queue_transaction(
   Transaction *t,
   OpSequencer *osr,
   void *token
   ) = 0;

 /**
  * Queue a transaction
  *
  * token must be either fresh or reset since the last operation.
  * If the operation is completed synchronously, token can be reused
  * without calling reset_operation_token.
  *
  * @result -EAGAIN if async, 0 or -error otherwise.
  */
 int read(..., void *token) = 0;
 ...

 The token concept here is opaque to allow the implementation some
 flexibility.  Ideally, it would be nice to be able to include libaio
 operation contexts directly.

 The main goal here is for the backend to have the freedom to complete
 writes and reads asynchronously or synchronously as the situation warrants.
 It also leaves the interface user in control of where the operation
 completion is handled.  Each op thread can therefore handle its own
 completions:

 struct InProgressOp {
   PGRef pg;
   ObjectStore::Token *token;
   OpContext *ctx;
 };
 vector<InProgressOp> in_progress(MAX_IN_PROGRESS);

 Probably a deque since we'll be pushing new requests and slurping off
 completed ones?  Or, we 
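
For illustration, here is a minimal sketch of how an op thread might drive the token
interface proposed above. This is hypothetical code against the *proposed* API (it does
not exist in the tree); next_op()/finish_op() and OpContext stand in for the surrounding
OSD machinery.

  #include <deque>

  // Pseudocode-style sketch against the proposed token API.
  void op_thread_loop(ObjectStore *store, OpSequencer *osr)
  {
    struct InFlight {
      void *token;      // from create_operation_token()
      OpContext *ctx;   // assumed per-op OSD state
    };
    std::deque<InFlight> in_flight;

    for (;;) {
      // Submit newly dequeued transactions; 0 means the backend completed the
      // write synchronously, -EAGAIN means we poll the token later.
      OpContext *ctx;
      while ((ctx = next_op()) != nullptr) {
        void *token = store->create_operation_token();
        int r = store->queue_transaction(ctx->t, osr, token);
        if (r == 0) {
          finish_op(ctx, store->get_result(token));
          store->destroy_operation_token(token);
        } else {
          in_flight.push_back(InFlight{token, ctx});
        }
      }

      // Reap async completions in submission order without blocking the
      // thread on any single one of them.
      while (!in_flight.empty() &&
             store->is_operation_complete(in_flight.front().token)) {
        InFlight done = in_flight.front();
        in_flight.pop_front();
        finish_op(done.ctx, store->get_result(done.token));
        store->destroy_operation_token(done.token);
      }
    }
  }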

Re: bufferlist allocation optimization ideas

2015-08-11 Thread Haomai Wang
On Wed, Aug 12, 2015 at 5:48 AM, Dałek, Piotr
piotr.da...@ts.fujitsu.com wrote:
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
 ow...@vger.kernel.org] On Behalf Of Sage Weil
 Sent: Tuesday, August 11, 2015 10:11 PM

 I went ahead and implemented both of these pieces.  See

   https://github.com/ceph/ceph/pull/5534

 My benchmark numbers are highly suspect, but the approximate takeaway is
 that it's 2x faster for the simple microbenchmarks and does 1/3rd the
 allocations.  But there is some weird interaction with the allocator going on
 for 16k allocations that I saw, so it needs some more careful benchmarking.

 16k allocations aren't that common, actually.
 Some time ago I took an alloc profile for raw_char and posix_aligned buffers, 
 and...

 [root@storage1 /]# sort buffer::raw_char-2143984.dat | uniq -c | sort -g
   1 12
   1 33
   1 393
   1 41
   2 473
   2 66447
   3 190
   3 20
   3 64
   4 16
  36 206
  88 174
  88 48
  89 272
  89 36
  90 34
 312 207
3238 208
   32403 209
  196300 210
  360164 45

Since the allocation sizes are concentrated around a few values, we could use a
fixed-size buffer pool to optimize this. The performance is outstanding in my
perf runs.
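
For illustration only (a sketch, not code in the tree): a minimal fixed-size buffer
pool for the hot allocation sizes above could look roughly like this; real use would
have to sit behind the buffer::create_*() helpers and still fall back to malloc for
odd-sized outliers.

  #include <cstdlib>
  #include <mutex>
  #include <vector>

  // Recycle fixed-size buffers instead of hitting the allocator for every
  // ~256-byte bufferlist append. Purely illustrative.
  class FixedBufferPool {
  public:
    FixedBufferPool(size_t buf_size, size_t count) : buf_size_(buf_size) {
      for (size_t i = 0; i < count; ++i)
        free_.push_back(::malloc(buf_size_));
    }
    ~FixedBufferPool() {
      for (void *p : free_) ::free(p);
    }
    void *alloc(size_t len) {
      if (len > buf_size_)
        return ::malloc(len);                 // outlier: fall back to malloc
      std::lock_guard<std::mutex> l(lock_);
      if (free_.empty())
        return ::malloc(buf_size_);           // pool exhausted, still correct
      void *p = free_.back();
      free_.pop_back();
      return p;
    }
    void release(void *p, size_t len) {
      if (len > buf_size_) { ::free(p); return; }
      std::lock_guard<std::mutex> l(lock_);
      free_.push_back(p);                     // recycle instead of freeing
    }
  private:
    size_t buf_size_;
    std::mutex lock_;
    std::vector<void*> free_;
  };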


 [root@storage1 /]# sort buffer::posix_aligned-2081990.dat | uniq -c | sort -g
  36 36864
  433635 4096

 So most common are very small (255 bytes) allocs and CEPH_PAGE_SIZE allocs.

 The other interesting thing is that either of these pieces in isolation 
 seems to
 have a pretty decent benefit, but when combined the benefits are fully
 additive.

 It seems to be reasonably stable, though!

 I'm going to test them in my environment, because allocations and 
 deallocations alone, when done in best-case pattern (series of allocations 
 followed by series of frees not interleaved by frees/allocs), aren't a good 
 benchmark for memory allocators (actually, most allocators are specially 
 optimized for this case).


 With best regards / Pozdrawiam
 Piotr Dałek

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
Could you print all of your thread backtraces via 'thread apply all bt'?

On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Hi,

 On a Ceph Firefly cluster (version [1]), OSDs are configured to use separate 
 data and journal disks (using the ceph-disk utility). It is observed that a 
 few OSDs start up fine (are in the 'up' and 'in' state); however, others are stuck 
 in the 'init creating/touching snapmapper object' phase. Below is an OSD 
 start-up log snippet:

 2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open 
 /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
 bytes, directio = 1, aio = 1
 2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open 
 /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size 4096 
 bytes, directio = 1, aio = 1
 2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
 2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock 
 sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0 
 a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
 2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init creating/touching 
 snapmapper object

 The log statement is inaccurate though, since it is actually doing init 
 operation for the 'infos' object (as can be observed from source [2]).

 Upon debugging further, the thread seems to be waiting to acquire the 
 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:

 (gdb) where
 #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from /lib/x86_64-linux-gnu/libpthread.so.0
 #1  0x7fd313132bf4 in ObjectStore::apply_transactions(ObjectStore::Sequencer*, std::list<ObjectStore::Transaction*, std::allocator<ObjectStore::Transaction*> >&, Context*) ()
 #2  0x7fd313097d08 in ObjectStore::apply_transaction(ObjectStore::Transaction&, Context*) ()
 #3  0x7fd313076790 in OSD::init() ()
 #4  0x7fd3130233a7 in main ()

 In a few cases, upon restarting the stuck OSD (service), it successfully 
 completes the 'init' phase and reaches the 'up' and 'in' state!

 Any help is greatly appreciated. Please let me know if any more details are 
 required for root causing.

 [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
 [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211

 Regards,
 Unmesh G.
 IRC: unmeshg
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
I don't see anything strange there.

Could you paste your ceph.conf? And restart this osd with
debug_osd=20/20, debug_filestore=20/20 :-)

On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Thanks for quick response Haomai! Please find the backtrace here [1].

 [1] - http://paste.openstack.org/show/411139/

 Regards,
 Unmesh G.
 IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 5:31 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase

 Could you print your all thread callback via thread apply all bt?

 On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Hi,
 
  On a Ceph Firefly cluster (version [1]), OSDs are configured to use 
  separate
 data and journal disks (using the ceph-disk utility). It is observed, that 
 few OSDs
 start-up fine (are 'up' and 'in' state); however, others are stuck in the 
 'init
 creating/touching snapmapper object' phase. Below is a OSD start-up log
 snippet:
 
  2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
  /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block size
  4096 bytes, directio = 1, aio = 1
  2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
  2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
  sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
  a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
  2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
  creating/touching snapmapper object
 
  The log statement is inaccurate though, since it is actually doing init
 operation for the 'infos' object (as can be observed from source [2]).
 
  Upon debugging further, the thread seems to be waiting to acquire the
 'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
 
  (gdb) where
  #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
  /lib/x86_64-linux-gnu/libpthread.so.0
  #1  0x7fd313132bf4 in
  ObjectStore::apply_transactions(ObjectStore::Sequencer*,
  std::listObjectStore::Transaction*,
  std::allocatorObjectStore::Transaction* , Context*) ()
  #2  0x7fd313097d08 in
  ObjectStore::apply_transaction(ObjectStore::Transaction, Context*) ()
  #3  0x7fd313076790 in OSD::init() ()
  #4  0x7fd3130233a7 in main ()
 
  In a few cases, upon restarting the stuck OSD (service), it successfully
 completes the 'init' phase and reaches the 'up' and 'in' state!
 
  Any help is greatly appreciated. Please let me know if any more details are
 required for root causing.
 
  [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
  [2] -  https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
 
  Regards,
  Unmesh G.
  IRC: unmeshg
  --
  To unsubscribe from this list: send the line unsubscribe ceph-devel
  in the body of a message to majord...@vger.kernel.org More majordomo
  info at  http://vger.kernel.org/majordomo-info.html



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: OSD sometimes stuck in init phase

2015-08-06 Thread Haomai Wang
It seems filestore isn't doing the transaction as expected. Sorry, you
need to add debug_journal=20/20 as well to help find the reason. :-)

BTW, what's your OS version? How many OSDs do you have in this
cluster, and how many OSDs failed to start like this?

On Thu, Aug 6, 2015 at 9:17 PM, Gurjar, Unmesh unmesh.gur...@hp.com wrote:
 Please find ceph.conf at [1] and the corresponding OSD log at [2].

 To clarify one thing I skipped earlier: while bringing up the OSDs, 
 'ceph-disk activate' was hanging (due to issue [3]). To get past this, I 
 had to temporarily disable 'journal dio' to get the disk activated (with 
 'mark-init' set to none) and then explicitly start the OSD service after 
 updating the conf to re-enable 'journal dio'. I am hopeful that this should not 
 cause the present issue (since a few OSDs start successfully on the first attempt 
 and others on subsequent service restarts)!

 [1] - http://paste.openstack.org/show/411161/
 [2] - http://paste.openstack.org/show/411162/
 [3] - http://tracker.ceph.com/issues/9768

 Regards,
 Unmesh G.
 IRC: unmeshg

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, August 06, 2015 6:22 PM
 To: Gurjar, Unmesh
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: OSD sometimes stuck in init phase

 Don't find something strange.

 Could you paste your ceph.conf? And restart this osd with debug_osd=20/20,
 debug_filestore=20/20 :-)

 On Thu, Aug 6, 2015 at 8:09 PM, Gurjar, Unmesh unmesh.gur...@hp.com
 wrote:
  Thanks for quick response Haomai! Please find the backtrace here [1].
 
  [1] - http://paste.openstack.org/show/411139/
 
  Regards,
  Unmesh G.
  IRC: unmeshg
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Thursday, August 06, 2015 5:31 PM
  To: Gurjar, Unmesh
  Cc: ceph-devel@vger.kernel.org
  Subject: Re: OSD sometimes stuck in init phase
 
  Could you print your all thread callback via thread apply all bt?
 
  On Thu, Aug 6, 2015 at 7:52 PM, Gurjar, Unmesh unmesh.gur...@hp.com
  wrote:
   Hi,
  
   On a Ceph Firefly cluster (version [1]), OSDs are configured to use
   separate
  data and journal disks (using the ceph-disk utility). It is observed,
  that few OSDs start-up fine (are 'up' and 'in' state); however,
  others are stuck in the 'init creating/touching snapmapper object'
  phase. Below is a OSD start-up log
  snippet:
  
   2015-08-06 08:58:02.491537 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498447 7fd312df97c0  1 journal _open
   /var/lib/ceph/osd/ceph-0/journal fd 21: 1073741824 bytes, block
   size
   4096 bytes, directio = 1, aio = 1
   2015-08-06 08:58:02.498720 7fd312df97c0  2 osd.0 0 boot
   2015-08-06 08:58:02.498865 7fd312df97c0 10 osd.0 0 read_superblock
   sb(2645bbf6-16d0-4c42-8835-8ba9f5c95a1d osd.0
   a821146f-0742-4724-b4ca-39ea4ccc298d e0 [0,0] lci=[0,0])
   2015-08-06 08:58:02.498937 7fd312df97c0 10 osd.0 0 init
   creating/touching snapmapper object
  
   The log statement is inaccurate though, since it is actually doing
   init
  operation for the 'infos' object (as can be observed from source [2]).
  
   Upon debugging further, the thread seems to be waiting to acquire
   the
  'ObjectStore::apply_transaction::my_lock' mutex. Below is the debug trace:
  
   (gdb) where
   #0  0x7fd3122b708f in pthread_cond_wait@@GLIBC_2.3.2 () from
   /lib/x86_64-linux-gnu/libpthread.so.0
   #1  0x7fd313132bf4 in
   ObjectStore::apply_transactions(ObjectStore::Sequencer*,
   std::listObjectStore::Transaction*,
   std::allocatorObjectStore::Transaction* , Context*) ()
   #2  0x7fd313097d08 in
   ObjectStore::apply_transaction(ObjectStore::Transaction, Context*)
   ()
   #3  0x7fd313076790 in OSD::init() ()
   #4  0x7fd3130233a7 in main ()
  
   In a few cases, upon restarting the stuck OSD (service), it
   successfully
  completes the 'init' phase and reaches the 'up' and 'in' state!
  
   Any help is greatly appreciated. Please let me know if any more
   details are
  required for root causing.
  
   [1] - 0.80.7 (6c0127fcb58008793d3c8b62d925bc91963672a3)
   [2] -
   https://github.com/ceph/ceph/blob/firefly/src/osd/OSD.cc#L1211
  
   Regards,
   Unmesh G.
   IRC: unmeshg
   --
   To unsubscribe from this list: send the line unsubscribe ceph-devel
   in the body of a message to majord...@vger.kernel.org More
   majordomo info at  http://vger.kernel.org/majordomo-info.html
 
 
 
  --
  Best Regards,
 
  Wheat



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: FileStore should not use syncfs(2)

2015-08-05 Thread Haomai Wang
Agree

On Thu, Aug 6, 2015 at 5:38 AM, Somnath Roy somnath@sandisk.com wrote:
 Thanks Sage for digging down..I was suspecting something similar.. As I 
 mentioned in today's call, in idle time also syncfs is taking ~60ms. I have 
 64 GB of RAM in the system.
 The workaround I was talking about today  is working pretty good so far. In 
 this implementation, I am not giving much work to syncfs as each worker 
 thread is writing with o_dsync mode. I am issuing syncfs before trimming the 
 journal and most of the time I saw it is taking < 100 ms.

Actually I'd prefer we don't use syncfs anymore. I'd rather use
aio+dio plus a filestore custom cache to deal with all the syncfs+pagecache
things. Then we can even make the cache smart enough to be aware of the upper
levels instead of relying on fadvise* calls. Second, we can use a checkpoint
method like mysql innodb: we know the bandwidth of the frontend (filejournal)
and can decide how much and how often we want to flush (using aio+dio).

Anyway, because it's a big project, we may prefer to do this work in newstore
instead of filestore.

 I have to wake up the sync_thread now after each worker thread finished 
 writing. I will benchmark both the approaches. As we discussed earlier, in 
 case of only fsync approach, we still need to do a db sync to make sure the 
 leveldb stuff persisted, right ?

 Thanks & Regards
 Somnath

 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Wednesday, August 05, 2015 2:27 PM
 To: Somnath Roy
 Cc: ceph-devel@vger.kernel.org; sj...@redhat.com
 Subject: FileStore should not use syncfs(2)

 Today I learned that syncfs(2) does an O(n) search of the superblock's inode 
 list searching for dirty items.  I've always assumed that it was only 
 traversing dirty inodes (e.g., a list of dirty inodes), but that appears not 
 to be the case, even on the latest kernels.

 That means that the more RAM in the box, the larger (generally) the inode 
 cache, the longer syncfs(2) will take, and the more CPU you'll waste doing 
 it.  The box I was looking at had 256GB of RAM, 36 OSDs, and a load of ~40 
 servicing a very light workload, and each syncfs(2) call was taking ~7 
 seconds (usually to write out a single inode).

 A possible workaround for such boxes is to turn 
 /proc/sys/vm/vfs_cache_pressure way up (so that the kernel favors caching 
 pages instead of inodes/dentries)...

 I think the take-away though is that we do need to bite the bullet and make 
 FileStore f[data]sync all the right things so that the syncfs call can be 
 avoided.  This is the path you were originally headed down, Somnath, and I 
 think it's the right one.

 The main thing to watch out for is that according to POSIX you really need to 
 fsync directories.  With XFS that isn't the case since all metadata 
 operations are going into the journal and that's fully ordered, but we don't 
 want to allow data loss on e.g. ext4 (we need to check what the metadata 
 ordering behavior is there) or other file systems.

I guess there are only a few directory-modifying operations, is that true?
Maybe we only need to do syncfs when modifying directories?
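
For reference, the fsync-the-right-things pattern being discussed looks roughly like
this (an illustrative sketch, not FileStore code): fdatasync() the file's data, then
fsync() the parent directory so the new directory entry itself is durable on
filesystems that don't journal metadata the way XFS does.

  #include <fcntl.h>
  #include <unistd.h>
  #include <cerrno>

  // Write a new file durably without calling syncfs(2).
  int write_file_durably(const char *dir, const char *name,
                         const void *buf, size_t len)
  {
    int dirfd = ::open(dir, O_RDONLY | O_DIRECTORY);
    if (dirfd < 0)
      return -errno;

    int fd = ::openat(dirfd, name, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0) { int r = -errno; ::close(dirfd); return r; }

    ssize_t wrote = ::write(fd, buf, len);
    int r = 0;
    if (wrote < 0)
      r = -errno;
    else if ((size_t)wrote != len)
      r = -EIO;

    if (r == 0 && ::fdatasync(fd) < 0)   // flush the file's data blocks
      r = -errno;
    if (r == 0 && ::fsync(dirfd) < 0)    // flush the new directory entry
      r = -errno;

    ::close(fd);
    ::close(dirfd);
    return r;
  }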


 :(

 sage

 

 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: More ondisk_finisher thread?

2015-08-04 Thread Haomai Wang
It's interesting to see ondisk_finisher taking 1ms. Could you
replay this workload and check via iostat whether there is any read IO? I
guess it may help to find the cause.

On Wed, Aug 5, 2015 at 12:13 AM, Somnath Roy somnath@sandisk.com wrote:
 Yes, it has to re-acquire pg_lock today..
 But, between the journal write and initiating the ondisk ack, there is one 
 context switch in the code path. So, I guess the pg_lock is not the only thing 
 that is causing this 1 ms delay...
 Not sure increasing the finisher threads will help in the pg_lock case as it 
 will be more or less serialized by this pg_lock..
 But, increasing finisher threads for the other context switches I was talking 
 about (see queue_completion_thru) may help...

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Ding Dinghua
 Sent: Tuesday, August 04, 2015 3:00 AM
 To: ceph-devel@vger.kernel.org
 Subject: More ondisk_finisher thread?

 Hi:
    Now we are doing some ceph performance tuning work; our setup has ten ceph 
 nodes, with SSD as journal and HDD for the filestore, and the ceph version is 0.80.9.
    We run fio in a virtual machine with a random 4KB write workload, and we find that 
 it takes about 1ms on average for ondisk_finisher, while the journal write only 
 takes 0.4ms, so I think that's unreasonable.
 Since the ondisk callback will be called with the pg lock held, if the pg lock has 
 been grabbed by another thread (for example, osd->op_wq), all ondisk callbacks 
 will be delayed, and then all write ops will be delayed.
  I found that op_commit must be called with the pg lock, so what about 
 increasing the ondisk_finisher thread count, so the ondisk callback is less 
 likely to be delayed?

 --
 Ding Dinghua
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in the 
 body of a message to majord...@vger.kernel.org More majordomo info at  
 http://vger.kernel.org/majordomo-info.html

 

 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).




-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Odd QA Test Running

2015-08-03 Thread Haomai Wang
I found that
https://github.com/ceph/ceph-qa-suite/blob/master/erasure-code/ec-rados-plugin%3Dshec-k%3D4-m%3D3-c%3D2.yaml
has an override section that overrides the user's "enable experimental
unrecoverable data corrupting features" config, so my jobs were
broken.

I made a PR (https://github.com/ceph/ceph-qa-suite/pull/518) and hope
it fixes this.

On Fri, Jul 31, 2015 at 5:50 PM, Haomai Wang haomaiw...@gmail.com wrote:
 Hi all,

 I  ran a test 
 suite(http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/)
 and found the failed jobs are failed by 2015-07-29 10:52:35.313197
 7f16ae655780 -1 unrecognized ms_type 'async'

 Then I found the failed jobs(like
 http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/991540/)
 lack of “enable experimental unrecoverable data corrupting features:
 ms-type-async”.

 Other successful jobs(like
 http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/991517/)
 can find enable experimental unrecoverable data corrupting features:
 ms-type-async in yaml.

 So that's mean the same schedule suite will generate the different
 yaml file? Is there something tricky?

 --

 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Odd QA Test Running

2015-07-31 Thread Haomai Wang
Hi all,

I  ran a test 
suite(http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/)
and found the failed jobs are failed by 2015-07-29 10:52:35.313197
7f16ae655780 -1 unrecognized ms_type 'async'

Then I found the failed jobs(like
http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/991540/)
lack of “enable experimental unrecoverable data corrupting features:
ms-type-async”.

Other successful jobs(like
http://pulpito.ceph.com/haomai-2015-07-29_11:40:40-rados-master-distro-basic-multi/991517/)
can find enable experimental unrecoverable data corrupting features:
ms-type-async in yaml.

So does that mean the same scheduled suite can generate different
yaml files? Is there something tricky going on?

-- 

Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Ceph write path optimization

2015-07-28 Thread Haomai Wang
On Wed, Jul 29, 2015 at 5:08 AM, Somnath Roy somnath@sandisk.com wrote:
 Hi,
 Eventually, I have a working prototype and able to gather some performance 
 comparison data with the changes I was talking about in the last performance 
 meeting. Mark's suggestion of a write up was long pending, so, trying to 
 summarize what I am trying to do.

 Objective:
 ---

 1. To saturate SSD write bandwidth with ceph + filestore.
  Most of the deployments of ceph + all flash so far (as far as I know) have 
 both data and journal on the same SSD. The SSDs are far from saturated and 
 the write performance of ceph is dismal (compared to the raw HW). Can we improve that?

 2. Ceph write performance is not stable in most cases; can we get 
 stable performance out most of the time?


 Findings/Optimization so far..
 

 1. I saw that in a flash environment you need to reduce 
 filestore_max_sync_interval a lot (from the default of 5min), and thus the benefit of 
 syncfs coalescing and writing goes away.

Default is 5s I think.


 2. We have some logic to determine the max sequence number it can commit. 
 That is adding some latency (1 ms or so).

 3. This delay is filling up journals quickly if I remove all throttles from 
 the filestore/journal.

 4. Existing throttle scheme is very difficult to tune.

 5. In case of write-ahead journaling the commit file is probably redundant as 
 we can get the last committed seq number from journal headers during next OSD 
 start. The fact that, the sync interval we need to reduce , this extra write 
 will only add more to WA (also one extra fsync).

 The existing scheme is well suited for HDD environment, but, probably not for 
 flash. So, I made the following changes.

 1. First, I removed the extra commit seq file write and changed the journal 
 replay stuff accordingly.

 2. Each filestore op thread is now doing an O_DSYNC write followed by 
 posix_fadvise(**fd, 0, 0, POSIX_FADV_DONTNEED);

Maybe we could use AIO+DIO here. BTW, always discarding the page cache isn't a
good idea for reads. If we want to give up the page cache, we need to
implement a filestore data buffer cache.
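
For reference, the write pattern point 2 above describes is roughly the following
(an illustrative sketch, not the actual patch): the write is durable on return because
of O_DSYNC, and the pages are then dropped so writeback/syncfs has nothing left to do
for this data.

  #include <fcntl.h>
  #include <unistd.h>
  #include <cerrno>

  int dsync_write(const char *path, const void *buf, size_t len, off_t off)
  {
    int fd = ::open(path, O_WRONLY | O_CREAT | O_DSYNC, 0644);
    if (fd < 0)
      return -errno;

    ssize_t wrote = ::pwrite(fd, buf, len, off);
    int r = 0;
    if (wrote < 0)
      r = -errno;
    else if ((size_t)wrote != len)
      r = -EIO;

    if (r == 0) {
      // Hint the kernel to drop the cached pages for this file (offset 0,
      // length 0 covers the whole file, matching the snippet above).
      (void)::posix_fadvise(fd, 0, 0, POSIX_FADV_DONTNEED);
    }

    ::close(fd);
    return r;
  }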


 3. I derived an algorithm that each worker thread is executing to determine 
 the max seq it can trim the journal to.

 4. Introduced a new throttle scheme that will throttle journal write based on 
 the % space left.

 5. I saw that this scheme is definitely emptying up the journal faster and 
 able to saturate the SSD more.


I think you mean the filestore worker needs to be aware of the capacity of the
journal and decide how often to flush.
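
Presumably something along these lines: a throttle keyed to the fraction of journal
space in use (purely illustrative; the watermarks and the used_bytes() callback are
made up, not the actual patch).

  #include <chrono>
  #include <cstdint>
  #include <functional>
  #include <thread>

  // Back off journal submitters as the journal fills: no delay below a low
  // watermark, a growing delay in between, and a hard wait above the high one.
  void throttle_journal_write(std::function<uint64_t()> used_bytes,
                              uint64_t journal_size)
  {
    const double low = 0.5, high = 0.9;        // made-up watermarks
    double used = double(used_bytes()) / double(journal_size);

    if (used < low)
      return;                                  // plenty of room, no delay
    if (used < high) {
      // Delay grows linearly as we move through the [low, high) band.
      auto delay = std::chrono::microseconds(
          int64_t(1000 * (used - low) / (high - low)));
      std::this_thread::sleep_for(delay);
      return;
    }
    // Above the high watermark: block until the trimmer frees space.
    while (double(used_bytes()) / double(journal_size) >= high)
      std::this_thread::sleep_for(std::chrono::milliseconds(1));
  }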

 6. But, even if we are not saturating any resources, if we are having both 
 data and journal on the same drive, both writes are suffering latencies.

 7. Separating out journal to different disk , the same code (and also stock)  
 is running faster. Not sure about the exact reason, but, something to do with 
 underlying layer. Still investigating.

 8. Now, if we want to separate out journal, SSD is *not an option*. The 
 reason is, after some point we will be limited by SSD BW and all writes for N 
 osds going to that SSD will wear out that SSD very fast. Also, this will be a 
 very expensive solution considering high end journal SSD.

 9. So, I started experimenting with small PCIe NVRAM partition (128 MB). So, 
 If we have ~4GB NVRAM we can put ~32 OSDs in that(considering NVRAM 
 durability is much higher).  The stock code as is (without throttle), the 
 performance is becoming very spiky for obvious reason.

 10. But, with the above mentioned changes, I am able to make a constant high 
 performance out most of the time.

 11. I am also trying the existing synfs codebase (without op_seq file) + the 
 throttle scheme I mentioned in this setup to see if we can get out a stable 
 improve performance out or not. This is still under investigation.

 12. Initial benchmark with single OSD (no replication) looks promising and 
 you can find the draft here.


 https://docs.google.com/document/d/1QYAWkBNVfSXhWbLW6nTLx81a9LpWACsGjWG8bkNwKY8/edit?usp=sharing

 13. I still need to try this out by increasing number of OSDs.

 14. Also, need to see how this scheme is helping both data/journal on the 
 same SSD.

 15. The main challenge I am facing in both the scheme is XFS metadata flush 
 process (xfsaild) is choking all the processes accessing the disk when it is 
 waking up. I can delay it till max 30 sec and if there are lot of dirty 
 metadata, there is a performance spike down for very brief amount of time. 
 Even if we are acknowledging writes from say NVRAM journal write, still the 
 opthreads are doing getattrs on the XFS and those threads are getting 
 blocked. I tried with ext4 and this problem is not there since it is writing 
 metadata synchronously by default, but, the overall performance of ext4 is 
 much less. I am not an expert on filesystem, so, any help on this is much 
 appreciated.

 Mark,
 If we have time, we can discuss 

Re: About Fio backend with ObjectStore API

2015-07-22 Thread Haomai Wang
Nothing special:

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=./ceph-int/src/.libs/libfio_ceph_objectstore.so
invalidate=0 # mandatory
rw=write
#bs=4k

[filestore]
iodepth=1
# create a journaled filestore
objectstore=filestore
directory=./osd/
filestore_journal=./osd/journal

On Thu, Jul 23, 2015 at 4:56 AM, Casey Bodley cbod...@redhat.com wrote:
 Hi Haomai,

 Sorry for the late response, I was out of the office. I'm afraid I haven't 
 run into that segfault. The io_ops should be set at the very beginning when 
 it calls get_ioengine(). All I can suggest is that you verify that your job 
 file is pointing to the correct fio_ceph_objectstore.so. If you've made any 
 other interesting changes to the job file, could you share it here?

 Casey

 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Casey Bodley cbod...@gmail.com
 Cc: Matt W. Benjamin m...@cohortfs.com, James (Fei) Liu-SSI 
 james@ssi.samsung.com, ceph-devel@vger.kernel.org
 Sent: Tuesday, July 21, 2015 7:50:32 AM
 Subject: Re: About Fio backend with ObjectStore API

 Hi Casey,

 I check your commits and know what you fixed. I cherry-picked your new
 commits but I still met the same problem.

 
 It's strange that it alwasys hit segment fault when entering
 _fio_setup_ceph_filestore_data, gdb tells td-io_ops is NULL but
 when I up the stack, the td-io_ops is not null. Maybe it's related
 to dlopen?
 

 Do you have any hint about this?

 On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote:
 Hi Haomai,

 I was able to run this after a couple changes to the filestore.fio job
 file. Two of the config options were using the wrong names. I pushed a
 fix for the job file, as well as a patch that renames everything from
 filestore to objectstore (thanks James), to
 https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore.

 I found that the read support doesn't appear to work anymore, so give
 rw=write a try. And because it does a mkfs(), make sure you're
 pointing it to an empty xfs directory with the directory= option.

 Casey

 On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote:
 Anyone who have successfully ran the fio with this external io engine
 ceph_objectstore?

 It's strange that it alwasys hit segment fault when entering
 _fio_setup_ceph_filestore_data, gdb tells td-io_ops is NULL but
 when I up the stack, the td-io_ops is not null. Maybe it's related
 to dlopen?

 On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch with master, and push it to ceph upstream
 repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Plz let me know if who is working on this. Otherwise, I would like to
 improve this to be merge ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com 
 wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, Since the
 testing case was used to exercise all of object stores.  Strongly
 recommend to change the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc . And the code in fio_ceph_filestore.cc should
 be refactored to reflect that the whole objectstore will be supported
 by fio_ceph_objectstore.cc. what you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Here is the code in the fio_ceph_filestore.cc. Basically, it
 creates a filestore as backend engine for IO exercises. If we got to
 send IO commands to KeyValue Store or Newstore, we got to change the
 code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create other two fio engines for
 keyvaluestore and newstore if we want to exercise these two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  209 {
  210 vectorconst char* args;
  211 struct ceph_filestore_data *ceph_filestore_data = (struct
 ceph_filestore_data *) td-io_ops-data;
  212 ObjectStore::Transaction ft;
  213
 
  214 global_init(NULL, args, CEPH_ENTITY_TYPE_OSD,
 CODE_ENVIRONMENT_UTILITY, 0);
  215 //g_conf-journal_dio = false;
  216 common_init_finish(g_ceph_context);
  217 //g_ceph_context-_conf

Re: About Fio backend with ObjectStore API

2015-07-21 Thread Haomai Wang
Hi Casey,

I checked your commits and understand what you fixed. I cherry-picked your new
commits but I still hit the same problem.


It's strange that it always hits a segment fault when entering
_fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but
when I go up the stack, td->io_ops is not null. Maybe it's related
to dlopen?


Do you have any hint about this?

On Thu, Jul 16, 2015 at 5:23 AM, Casey Bodley cbod...@gmail.com wrote:
 Hi Haomai,

 I was able to run this after a couple changes to the filestore.fio job
 file. Two of the config options were using the wrong names. I pushed a
 fix for the job file, as well as a patch that renames everything from
 filestore to objectstore (thanks James), to
 https://github.com/linuxbox2/linuxbox-ceph/commits/fio-objectstore.

 I found that the read support doesn't appear to work anymore, so give
 rw=write a try. And because it does a mkfs(), make sure you're
 pointing it to an empty xfs directory with the directory= option.

 Casey

 On Tue, Jul 14, 2015 at 2:45 AM, Haomai Wang haomaiw...@gmail.com wrote:
 Anyone who have successfully ran the fio with this external io engine
 ceph_objectstore?

 It's strange that it alwasys hit segment fault when entering
 _fio_setup_ceph_filestore_data, gdb tells td-io_ops is NULL but
 when I up the stack, the td-io_ops is not null. Maybe it's related
 to dlopen?

 On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch with master, and push it to ceph upstream
 repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Plz let me know if who is working on this. Otherwise, I would like to
 improve this to be merge ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, Since the
 testing case was used to exercise all of object stores.  Strongly
 recommend to change the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc . And the code in fio_ceph_filestore.cc should
 be refactored to reflect that the whole objectstore will be supported
 by fio_ceph_objectstore.cc. what you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Here is the code in the fio_ceph_filestore.cc. Basically, it
 creates a filestore as backend engine for IO exercises. If we got to
 send IO commands to KeyValue Store or Newstore, we got to change the
 code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create other two fio engines for
 keyvaluestore and newstore if we want to exercise these two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  209 {
  210 vectorconst char* args;
  211 struct ceph_filestore_data *ceph_filestore_data = (struct
 ceph_filestore_data *) td-io_ops-data;
  212 ObjectStore::Transaction ft;
  213
 
  214 global_init(NULL, args, CEPH_ENTITY_TYPE_OSD,
 CODE_ENVIRONMENT_UTILITY, 0);
  215 //g_conf-journal_dio = false;
  216 common_init_finish(g_ceph_context);
  217 //g_ceph_context-_conf-set_val(debug_filestore, 20);
  218 //g_ceph_context-_conf-set_val(debug_throttle, 20);
  219 g_ceph_context-_conf-apply_changes(NULL);
  220
 
  221 ceph_filestore_data-osd_path =
 strdup(/mnt/fio_ceph_filestore.XXX);
  222 ceph_filestore_data-journal_path =
 strdup(/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX);
  223
 
  224 if (!mkdtemp(ceph_filestore_data-osd_path)) {
  225 cout  mkdtemp failed:   strerror(errno) 
 std::endl;
  226 return 1;
  227 }
  228 //mktemp(ceph_filestore_data-journal_path); // NOSPC issue
  229
 
  230 ObjectStore *fs = new
 FileStore(ceph_filestore_data-osd_path,
 ceph_filestore_data-journal_path);
  231 ceph_filestore_data-fs = fs;
  232
 
  233 if (fs-mkfs()  0) {
  234 cout  mkfs failed  std::endl;
  235 goto failed;
  236 }
  237
  238 if (fs-mount()  0) {
  239 cout  mount failed  std::endl;
  240 goto failed;
  241 }
  242
 
  243 ft.create_collection(coll_t());
  244 fs-apply_transaction(ft);
  245
 
  246
 
  247 return 0;
  248
 
  249 failed

Re: About Fio backend with ObjectStore API

2015-07-14 Thread Haomai Wang
Has anyone successfully run fio with this external io engine,
ceph_objectstore?

It's strange that it always hits a segment fault when entering
_fio_setup_ceph_filestore_data; gdb says td->io_ops is NULL, but
when I go up the stack, td->io_ops is not null. Maybe it's related
to dlopen?

On Fri, Jul 10, 2015 at 3:51 PM, Haomai Wang haomaiw...@gmail.com wrote:
 I have rebased the branch with master, and push it to ceph upstream
 repo. https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

 Plz let me know if who is working on this. Otherwise, I would like to
 improve this to be merge ready.

 On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, Since the
 testing case was used to exercise all of object stores.  Strongly
 recommend to change the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc . And the code in fio_ceph_filestore.cc should
 be refactored to reflect that the whole objectstore will be supported
 by fio_ceph_objectstore.cc. what you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Here is the code in the fio_ceph_filestore.cc. Basically, it
 creates a filestore as backend engine for IO exercises. If we got to
 send IO commands to KeyValue Store or Newstore, we got to change the
 code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create other two fio engines for
 keyvaluestore and newstore if we want to exercise these two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  {
      vector<const char*> args;
      struct ceph_filestore_data *ceph_filestore_data = (struct ceph_filestore_data *) td->io_ops->data;
      ObjectStore::Transaction ft;

      global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
      //g_conf->journal_dio = false;
      common_init_finish(g_ceph_context);
      //g_ceph_context->_conf->set_val("debug_filestore", "20");
      //g_ceph_context->_conf->set_val("debug_throttle", "20");
      g_ceph_context->_conf->apply_changes(NULL);

      ceph_filestore_data->osd_path = strdup("/mnt/fio_ceph_filestore.XXX");
      ceph_filestore_data->journal_path = strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

      if (!mkdtemp(ceph_filestore_data->osd_path)) {
          cout << "mkdtemp failed: " << strerror(errno) << std::endl;
          return 1;
      }
      //mktemp(ceph_filestore_data->journal_path); // NOSPC issue

      ObjectStore *fs = new FileStore(ceph_filestore_data->osd_path, ceph_filestore_data->journal_path);
      ceph_filestore_data->fs = fs;

      if (fs->mkfs() < 0) {
          cout << "mkfs failed" << std::endl;
          goto failed;
      }

      if (fs->mount() < 0) {
          cout << "mount failed" << std::endl;
          goto failed;
      }

      ft.create_collection(coll_t());
      fs->apply_transaction(ft);

      return 0;

  failed:
      return 1;
  }
  -Original Message-
  From: Casey Bodley [mailto:cbod...@gmail.com]
  Sent: Thursday, July 09, 2015 9:19 AM
  To: James (Fei) Liu-SSI
  Cc: Haomai Wang; ceph-devel@vger.kernel.org
  Subject: Re: About Fio backend with ObjectStore API
 
  Hi James,
 
  In the job file src/test/filestore.fio, you can modify the line
  objectstore=filestore to use any objectstore type supported by
 the
  ObjectStore::create() factory.
 
  Casey
 
  On Wed, Jul 8, 2015 at 8:02 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Quick questions, The code in the trunk only cover the test for
 filestore. I was wondering do you have any plan to cover the test for
 kvstore and newstore?
 
Thanks,
James
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei)

  Liu-SSI
  Sent: Tuesday, June 30, 2015 2:19 PM
  To: Casey Bodley
  Cc: Haomai Wang; ceph-devel@vger.kernel.org
  Subject: RE: About Fio backend with ObjectStore

Re: About Adding eventfd support for LibRBD

2015-07-13 Thread Haomai Wang
On Mon, Jul 13, 2015 at 9:52 PM, Jason Dillaman dilla...@redhat.com wrote:
 I was originally thinking that you were just proposing to have librbd write 
 to the eventfd descriptor when your AIO op completed so that you could hook 
 librbd callbacks into an existing app poll loop.  If librbd is doing the 
 polling via poll_io_events, I guess I don't see why you would even need to 
 use eventfd.

Sorry, I'm not following. Even if we have poll_io_events, we need to know when
to call poll_io_events.

I guess you mean we could notify the user's fd in the rbd callback.
Yes, we could do that. But the extra rbd callback could be omitted if we
embed standard notification methods; we can get performance benefits
via inline notify and maybe we can reduce the internal completion
structures (maybe?).
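
The "notify the user's fd in the rbd callback" variant is roughly what a caller can
already do today with the existing librbd C API; the callback just bumps an eventfd
that the application's own poll loop is watching (illustrative sketch, nothing new
in librbd):

  #include <rbd/librbd.h>
  #include <unistd.h>
  #include <cstdint>

  struct IoRequest {
    int event_fd;            // eventfd(2) descriptor the app's loop polls
    rbd_completion_t comp;   // filled in at submit time
    // ... whatever per-request state the app needs on completion ...
  };

  // Keep the librbd callback tiny: just wake the application's event loop.
  // The loop later calls rbd_aio_get_return_value()/rbd_aio_release().
  static void rbd_done_cb(rbd_completion_t c, void *arg)
  {
    (void)c;
    IoRequest *req = static_cast<IoRequest *>(arg);
    uint64_t one = 1;
    (void)::write(req->event_fd, &one, sizeof(one));
  }

  int submit_write(rbd_image_t image, IoRequest *req,
                   const char *buf, size_t len, uint64_t off)
  {
    int r = rbd_aio_create_completion(req, rbd_done_cb, &req->comp);
    if (r < 0)
      return r;
    return rbd_aio_write(image, off, len, buf, req->comp);
  }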


 --

 Jason Dillaman
 Red Hat
 dilla...@redhat.com
 http://www.redhat.com


 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Josh Durgin jdur...@redhat.com
 Cc: ceph-devel@vger.kernel.org, Jason Dillaman dilla...@redhat.com
 Sent: Thursday, July 9, 2015 11:16:14 PM
 Subject: Re: About Adding eventfd support for LibRBD

 I made a simple draft about adding async event notification support for
 librbd:

 The initial idea is to try to avoid much change to the existing APIs. So we
 could add a new API like:

 struct {
   int result;
   void *userdata;
   ..
 } rbd_aio_event;

 int poll_io_events(ImageCtx *ictx, rbd_aio_event *events, int
 numevents, struct timespec *timeout);

 int set_image_notification(ImageCtx *ictx, void *handler, enum
 notification_type);

 It seems a little tricky: if the user calls set_image_notification
 successfully, the user can then call aio_write/read with the specified
 userdata (the original callback argument pointer). A librbd internal thread
 will post an async event using the specified notification
 method (notification_type) when the io finishes. For example, linux/bsd would
 use [eventfd](http://man7.org/linux/man-pages/man2/eventfd.2.html),
 solaris could use
 [port_send](http://docs.oracle.com/cd/E23823_01/html/816-5168/port-send-3c.html#scrolltoc),
 and windows could use the iocp method
 [PostQueuedCompletionStatus](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365458(v=vs.85).aspx).

 If the client uses rbd without calling set_image_notification, a call to
 poll_io_events will just return -EOPNOTSUPP.
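
A rough picture of how a caller might use the proposed API (hypothetical code: the
poll_io_events/set_image_notification calls, the NOTIFICATION_EVENTFD constant and
handle_completion() are all assumptions on top of the draft above, nothing that
exists in librbd today):

  #include <poll.h>
  #include <sys/eventfd.h>
  #include <unistd.h>
  #include <cstdint>

  void caller_event_loop(ImageCtx *ictx)
  {
    int efd = ::eventfd(0, EFD_NONBLOCK);
    set_image_notification(ictx, &efd, NOTIFICATION_EVENTFD);  // proposed call

    rbd_aio_event events[64];
    struct pollfd pfd = { efd, POLLIN, 0 };

    for (;;) {
      if (::poll(&pfd, 1, -1) <= 0)
        continue;
      uint64_t n;
      (void)::read(efd, &n, sizeof(n));        // drain the eventfd counter

      // Batch-reap everything librbd has completed since the last wakeup.
      int got = poll_io_events(ictx, events, 64, nullptr);
      for (int i = 0; i < got; ++i) {
        // userdata is whatever the caller passed at aio_write/read time.
        handle_completion(events[i].userdata, events[i].result);
      }
    }
  }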



 On Wed, Jul 8, 2015 at 11:46 AM, Haomai Wang haomaiw...@gmail.com wrote:
  On Wed, Jul 8, 2015 at 11:08 AM, Josh Durgin jdur...@redhat.com wrote:
  On 07/07/2015 08:18 AM, Haomai Wang wrote:
 
  Hi All,
 
  Currently librbd support aio_read/write with specified
  callback(AioCompletion). It would be nice for simple caller logic, but
  it also has some problems:
 
  1. Performance bottleneck: Create/Free AioCompletion and librbd
  internal finisher thread complete callback isn't a *very
  littleweight job, especially when callback need to update some
  status with lock hold
 
  2. Call logic: Usually like fio rbd engine, caller will maintain some
  status with io and rbd callback isn't enough to finish all the jobs
  related to io. For example, caller need to check each queued io
  stupidly again when rbd callback finished.
 
  So maybe we could add new api which support eventfd, so caller could
  add eventfd to its event loop and batch reap finished io event and
  update status or do more things.
 
  Any feedback is appreciated!
 
 
  It seems like a good idea to me. I'm not sure how much overhead it
  avoids, but letting the callers check status from their own threads
  is much nicer in general.
 
  I'd be curious how much overhead the callback + finisher add. If it's
  significant, it might make sense to add similar eventfd interfaces
  lower in the stack too.
 
  From intuition if we do high iodepth benchmark, noncallback way could
  reduce lots of extra callback latency because new way could batch
  them. Another performance benefit I think from caller side, new way
  could let complexity io finished job avoid callback lock and reduce
  extra logic. Finally, mostly callback need to wakeup caller thread to
  do next thing, it would be great that with new way we can do it in
  librbd via eventfd.
 
 
  Josh
 
 
 
  --
  Best Regards,
 
  Wheat



 --
 Best Regards,

 Wheat




-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: Patches for review on keyvaluestore

2015-07-10 Thread Haomai Wang
I suggest we split this PR into the plugindb implementation and the
ceph-disk/init-script things. That way I think it will be easier to get it
merge-ready.

On Thu, Jul 9, 2015 at 4:31 PM, Varada Kari varada.k...@sandisk.com wrote:
 Hi Sage/Sam/Haomai,

 Sent pull requests for two enhancement for key value store. Can you please 
 review the changes?

 https://github.com/ceph/ceph/pull/5136

 Contains changes for short circuiting op_wq and osr in key value store. We 
 have observed good performance gains with the above pull request. Not sure if 
 it breaks any assumptions or design constraints. Need your comments on the 
 approach.

 https://github.com/ceph/ceph/pull/5169

 Contains changes for adding a generic framework to add any keyvaluedb as a 
 backend.  The code follows the initial proposal on the infernalis 
 blueprint. Other proposed enhancements are being worked on as a loadable object 
 store using a factory approach.

 Gist of the implementation sent for review:

 1. A new class is derived from KeyValueDB (Derived from ObjectStore) called 
 PluggableDBStore, which honors the semantics of KeyvalueStore and KeyValueDB. 
 This class acts as mediator between CEPH and loadable shim (a shared 
 library).  This class transforms the ceph related bufferlist etc... to const 
 char pointers for the shim to understand.
 2. Shim layer is assumed to be a shared library.
 3. PluggableDBStore, loads (dlopen) the key value database wrapper/shim 
 needed for Ceph integration.  The loadable shim library location is specified 
 in ceph.conf. Not added any checks to validate the sanity or compatibility of 
 shared object as of now. We can impose certain checks to be honored by the 
 shim layer to be compatible with the ceph version.
 4. Interfaces that needs to be implemented in shim, are added in a new header 
 called PluggableDBInterfaces.h.  This header contains the signatures for the 
 necessary interfaces like init(), close(), submit_transaction(), get() and 
 get_iterator(). PluggableDBStore caches these handles in a table during the 
 initialization time of the backend db.
 5. Similarly for Iterator functionality, PluggableDBIterator.h, contains the 
 functionality to be implemented by the shim layer.
 6. ceph-disk is modified to make two partitions of the osd disk given, one 
 for osd metadata and other for the pluggable DB, similar to existing 
 functionality like journal and osd data partition. Two partitions are created 
 only when PluggableDBstore is selected as backed for OSD.  DB's can work off 
 a raw partition as well, having a partition can enable to them use it. May be 
 we can add additional parameters or conf options to have a file system also 
 on the newly created partition, which can be implemented by ceph-disk.
 7. Partition information and other information needed like osd id etc... are 
 passed to shim layer at the time of initializing the store.
 8. Additional script modification is get correct stats from the backend.

 Please share your comments on the pull requests.

 Thanks,
 Varada




 

 PLEASE NOTE: The information contained in this electronic mail message is 
 intended only for the use of the designated recipient(s) named above. If the 
 reader of this message is not the intended recipient, you are hereby notified 
 that you have received this message in error and that any review, 
 dissemination, distribution, or copying of this message is strictly 
 prohibited. If you have received this communication in error, please notify 
 the sender by telephone or e-mail (as shown above) immediately and destroy 
 any and all copies of this message in your possession (whether hard copies or 
 electronically stored copies).




-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About Fio backend with ObjectStore API

2015-07-10 Thread Haomai Wang
I have rebased the branch on master and pushed it to the ceph upstream
repo: https://github.com/ceph/ceph/compare/fio-objectstore?expand=1

Please let me know if someone is already working on this. Otherwise, I would like to
improve it until it is merge-ready.

On Fri, Jul 10, 2015 at 4:26 AM, Matt W. Benjamin m...@cohortfs.com wrote:
 That makes sense.

 Matt

 - James (Fei) Liu-SSI james@ssi.samsung.com wrote:

 Hi Casey,
   Got it. I was directed to the old code base. By the way, Since the
 testing case was used to exercise all of object stores.  Strongly
 recommend to change the name from fio_ceph_filestore.cc to
 fio_ceph_objectstore.cc . And the code in fio_ceph_filestore.cc should
 be refactored to reflect that the whole objectstore will be supported
 by fio_ceph_objectstore.cc. what you think?

 Let me know if you need any help from my side.


 Regards,
 James



 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Thursday, July 09, 2015 12:32 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi James,

 Are you looking at the code from
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore? It
 uses ObjectStore::create() instead of new FileStore(). This allows us
 to exercise all of the object stores with the same code.

 Casey

 On Thu, Jul 9, 2015 at 2:01 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
   Here is the code in fio_ceph_filestore.cc. Basically, it creates a
 filestore as the backend engine for the IO exercises. If we want to
 send IO commands to the KeyValue Store or Newstore, we have to change
 the code accordingly, right?  I did not see any other files like
 fio_ceph_keyvaluestore.cc or fio_ceph_newstore.cc. In my humble
 opinion, we might need to create two more fio engines for
 keyvaluestore and newstore if we want to exercise those two, right?
 
  Regards,
  James
 
  static int fio_ceph_filestore_init(struct thread_data *td)
  {
      vector<const char*> args;
      struct ceph_filestore_data *ceph_filestore_data =
          (struct ceph_filestore_data *) td->io_ops->data;
      ObjectStore::Transaction ft;

      global_init(NULL, args, CEPH_ENTITY_TYPE_OSD, CODE_ENVIRONMENT_UTILITY, 0);
      //g_conf->journal_dio = false;
      common_init_finish(g_ceph_context);
      //g_ceph_context->_conf->set_val("debug_filestore", "20");
      //g_ceph_context->_conf->set_val("debug_throttle", "20");
      g_ceph_context->_conf->apply_changes(NULL);

      ceph_filestore_data->osd_path = strdup("/mnt/fio_ceph_filestore.XXX");
      ceph_filestore_data->journal_path =
          strdup("/var/lib/ceph/osd/journal-ram/fio_ceph_filestore.XXX");

      if (!mkdtemp(ceph_filestore_data->osd_path)) {
          cout << "mkdtemp failed: " << strerror(errno) << std::endl;
          return 1;
      }
      //mktemp(ceph_filestore_data->journal_path); // NOSPC issue

      ObjectStore *fs = new FileStore(ceph_filestore_data->osd_path,
                                      ceph_filestore_data->journal_path);
      ceph_filestore_data->fs = fs;

      if (fs->mkfs() < 0) {
          cout << "mkfs failed" << std::endl;
          goto failed;
      }

      if (fs->mount() < 0) {
          cout << "mount failed" << std::endl;
          goto failed;
      }

      ft.create_collection(coll_t());
      fs->apply_transaction(ft);

      return 0;

  failed:
      return 1;
  }
  -Original Message-
  From: Casey Bodley [mailto:cbod...@gmail.com]
  Sent: Thursday, July 09, 2015 9:19 AM
  To: James (Fei) Liu-SSI
  Cc: Haomai Wang; ceph-devel@vger.kernel.org
  Subject: Re: About Fio backend with ObjectStore API
 
  Hi James,
 
  In the job file src/test/filestore.fio, you can modify the line
  objectstore=filestore to use any objectstore type supported by
 the
  ObjectStore::create() factory.
 
  Casey
 
  On Wed, Jul 8, 2015 at 8:02 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:
  Hi Casey,
Quick questions, The code in the trunk only cover the test for
 filestore. I was wondering do you have any plan to cover the test for
 kvstore and newstore?
 
Thanks,
James
 
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org
  [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of James (Fei)

  Liu-SSI
  Sent: Tuesday, June 30, 2015 2:19 PM
  To: Casey Bodley
  Cc: Haomai Wang; ceph-devel@vger.kernel.org
  Subject: RE: About Fio backend with ObjectStore API
 
  Hi Casey,
 
Thanks a lot.
 
Regards,
James
 
  -Original Message-
  From: Casey Bodley [mailto:cbod...@gmail.com]
  Sent: Tuesday, June 30, 2015 2:16 PM
  To: James (Fei) Liu-SSI
  Cc: Haomai Wang; ceph-devel@vger.kernel.org
  Subject: Re: About Fio backend with ObjectStore API
 
  Hi,
 
  When Danny Al-Gaaf & Daniel Gollub published Ceph

Re: About Adding eventfd support for LibRBD

2015-07-09 Thread Haomai Wang
I made a simple draft about adding async event notification support for librbd:

The initial idea is to avoid changing the existing APIs too much, so we
could add new APIs like:

struct {
  int result;
  void *userdata;
  ..
} rbd_aio_event;

int poll_io_events(ImageCtx *ictx, rbd_aio_event *events, int
numevents, struct timespec *timeout);

int set_image_notification(ImageCtx *ictx, void *handler, enum
notification_type);

It may seem a little tricky: if the user calls set_image_notification
successfully, the user can then call aio_write/read with a specified
userdata (the original callback argument pointer). A librbd internal
thread will post an async event through the specified
mechanism (notification_type) when the io finishes. For example, linux/bsd
would use [eventfd](http://man7.org/linux/man-pages/man2/eventfd.2.html),
solaris could use
[port_send](http://docs.oracle.com/cd/E23823_01/html/816-5168/port-send-3c.html#scrolltoc),
and windows could use the iocp method
[PostQueuedCompletionStatus](https://msdn.microsoft.com/en-us/library/windows/desktop/aa365458(v=vs.85).aspx).

If the client uses rbd without calling set_image_notification, calling
poll_io_events will return -EOPNOTSUPP.
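
To make the intended usage concrete, a caller loop could look roughly like
the sketch below. Everything librbd-side in it (set_image_notification,
poll_io_events, rbd_aio_event, plus the NOTIFY_EVENTFD constant and the
handle_one callback) is only the draft above or made up for the sketch;
only the eventfd/poll calls exist today.

#include <sys/eventfd.h>
#include <poll.h>
#include <unistd.h>
#include <stdint.h>

void reap_loop(ImageCtx *ictx)
{
  int efd = eventfd(0, EFD_CLOEXEC);
  set_image_notification(ictx, (void*)(intptr_t)efd, NOTIFY_EVENTFD);

  struct pollfd pfd = { efd, POLLIN, 0 };
  rbd_aio_event events[32];

  // the caller would add efd to its own event loop; poll() stands in for that
  while (poll(&pfd, 1, -1) > 0) {
    uint64_t n;
    read(efd, &n, sizeof(n));                          // drain the counter
    int got = poll_io_events(ictx, events, 32, NULL);  // batch-reap completions
    for (int i = 0; i < got; i++)
      handle_one(events[i].userdata, events[i].result); // userdata from aio_read/write
  }
}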



On Wed, Jul 8, 2015 at 11:46 AM, Haomai Wang haomaiw...@gmail.com wrote:
 On Wed, Jul 8, 2015 at 11:08 AM, Josh Durgin jdur...@redhat.com wrote:
 On 07/07/2015 08:18 AM, Haomai Wang wrote:

 Hi All,

 Currently librbd support aio_read/write with specified
 callback(AioCompletion). It would be nice for simple caller logic, but
 it also has some problems:

 1. Performance bottleneck: Create/Free AioCompletion and librbd
 internal finisher thread complete callback isn't a *very
 littleweight job, especially when callback need to update some
 status with lock hold

 2. Call logic: Usually like fio rbd engine, caller will maintain some
 status with io and rbd callback isn't enough to finish all the jobs
 related to io. For example, caller need to check each queued io
 stupidly again when rbd callback finished.

 So maybe we could add new api which support eventfd, so caller could
 add eventfd to its event loop and batch reap finished io event and
 update status or do more things.

 Any feedback is appreciated!


 It seems like a good idea to me. I'm not sure how much overhead it
 avoids, but letting the callers check status from their own threads
 is much nicer in general.

 I'd be curious how much overhead the callback + finisher add. If it's
 significant, it might make sense to add similar eventfd interfaces
 lower in the stack too.

 Intuitively, for a high-iodepth benchmark the non-callback way could cut a
 lot of the extra callback latency because completions can be batched.
 Another performance benefit on the caller side is that complex completion
 handling can avoid the callback lock and extra logic. Finally, most
 callbacks need to wake up the caller thread to do the next thing; with the
 new way we could do that inside librbd via the eventfd.


 Josh



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: About Fio backend with ObjectStore API

2015-07-07 Thread Haomai Wang
On Wed, Jul 1, 2015 at 1:58 PM, Daniel Gollub daniel.gol...@gmail.com wrote:
 Hi Jens,

 as Josh already mentioned the engine would use Ceph internal ObjectStore API
 ... which is not stable.
 So as Josh proposed, my idea was to build this C++ ObjectStorage engine as
 an external FIO engine inside Ceph (optionally).

 We just need to keep fio working with external C++ engines - so this
 external engine can exist.
 That was the intent when I pushed various build fixes for the C++ fio
 headers, to get the external ObjectStorage fio engine building, because it's
 written in C++.

 I am no longer with DT, so I don't have a Ceph cluster right now to test
 things. But I'm happy to help to get the ObjectStorage fio engine upstream
 into Ceph. Josh, Casey do you need any help on this? I guess
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore is  good
 base to continue. Casey, thank you for cleaning things up ;)


Cool, do you have plans to do this soon? If not, I'm willing to help :-)


 Best Regards,
 Daniel


 On Wed, Jul 1, 2015 at 12:57 AM, Jens Axboe ax...@kernel.dk wrote:

 I'd be more than happy to include it. Daniel has contributed to fio
 before.

 Daniel (CC'ed), was it your intent to get this upstream? How do we make
 this happen?


 On 06/30/2015 04:38 PM, Mark Nelson wrote:

 It would be fantastic if folks decided to work on this and got it pushed
 upstream into fio proper. :D

 Mark

 On 06/30/2015 04:19 PM, James (Fei) Liu-SSI wrote:

 Hi Casey,

Thanks a lot.

Regards,
James

 -Original Message-
 From: Casey Bodley [mailto:cbod...@gmail.com]
 Sent: Tuesday, June 30, 2015 2:16 PM
 To: James (Fei) Liu-SSI
 Cc: Haomai Wang; ceph-devel@vger.kernel.org
 Subject: Re: About Fio backend with ObjectStore API

 Hi,

  When Danny Al-Gaaf & Daniel Gollub published Ceph Performance
 Analysis: fio and RBD at

 https://telekomcloud.github.io/ceph/2014/02/26/ceph-performance-analysis_fio_rbd.html,

 they also mentioned a fio engine that linked directly into ceph's
 FileStore. I was able to find Daniel's branch on github at
 https://github.com/gollub/ceph/tree/fio_filestore_v2, and did some
 more work on it at the time.

 I just rebased that work onto the latest ceph master branch, and
 pushed to our github at
 https://github.com/linuxbox2/linuxbox-ceph/tree/fio-objectstore. You
 can find the source in src/test/fio_ceph_filestore.cc, and run fio
 with the provided example fio job file in src/test/filestore.fio.

 I didn't have a chance to confirm that it builds with automake, but
 the cmake version built for me. I'm happy to help if you run into
 problems, Casey

 On Tue, Jun 30, 2015 at 2:31 PM, James (Fei) Liu-SSI
 james@ssi.samsung.com wrote:

 Hi Haomai,
What are you trying to ask is to benchmark local objectstore(like
 kvstore/filestore/newstore) locally with FIO(ObjectStore engine)? You
 want to purely compare the performance locally for these
 objectstores, right?

Regards,
James

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Haomai Wang
 Sent: Tuesday, June 30, 2015 9:06 AM
 To: ceph-devel@vger.kernel.org
 Subject: About Fio backend with ObjectStore API

 Hi all,

 Long long ago, is there someone said about fio backend with Ceph
 ObjectStore API? So we could use the existing mature fio facility to
 benchmark ceph objectstore.

 --
 Best Regards,

 Wheat
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
 in the body of a message to majord...@vger.kernel.org More majordomo
 info at  http://vger.kernel.org/majordomo-info.html





 --
 Jens Axboe





-- 
Best Regards,

Wheat


Re: About Adding eventfd support for LibRBD

2015-07-07 Thread Haomai Wang
On Wed, Jul 8, 2015 at 11:08 AM, Josh Durgin jdur...@redhat.com wrote:
 On 07/07/2015 08:18 AM, Haomai Wang wrote:

 Hi All,

 Currently librbd support aio_read/write with specified
 callback(AioCompletion). It would be nice for simple caller logic, but
 it also has some problems:

 1. Performance bottleneck: Create/Free AioCompletion and librbd
 internal finisher thread complete callback isn't a *very
 littleweight job, especially when callback need to update some
 status with lock hold

 2. Call logic: Usually like fio rbd engine, caller will maintain some
 status with io and rbd callback isn't enough to finish all the jobs
 related to io. For example, caller need to check each queued io
 stupidly again when rbd callback finished.

 So maybe we could add new api which support eventfd, so caller could
 add eventfd to its event loop and batch reap finished io event and
 update status or do more things.

 Any feedback is appreciated!


 It seems like a good idea to me. I'm not sure how much overhead it
 avoids, but letting the callers check status from their own threads
 is much nicer in general.

 I'd be curious how much overhead the callback + finisher add. If it's
 significant, it might make sense to add similar eventfd interfaces
 lower in the stack too.

Intuitively, for a high-iodepth benchmark the non-callback way could cut a
lot of the extra callback latency because completions can be batched.
Another performance benefit on the caller side is that complex completion
handling can avoid the callback lock and extra logic. Finally, most
callbacks need to wake up the caller thread to do the next thing; with the
new way we could do that inside librbd via the eventfd.


 Josh



-- 
Best Regards,

Wheat


Re: Transaction struct Op

2015-07-02 Thread Haomai Wang
Yes, some fields are only used for special ops. But a union may increase
the complexity of the struct.

And the extra memory may not be a problem, because the number of Ops in
one transaction should be within ten.
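
For reference, a union-based layout along the lines you suggest might look
roughly like the following; this is only a sketch to show the trade-off,
not a patch. The packed size would drop from 72 to about 44 bytes, but
encode/decode and every place touching these fields would have to branch
on op, which is the extra complexity I mean:

struct Op {
  __le32 op;
  __le32 cid;
  __le32 oid;
  __le64 off;
  __le64 len;
  union {
    struct {                      // OP_CLONE, OP_CLONERANGE
      __le32 dest_cid;
      __le32 dest_oid;
      __le64 dest_off;            // OP_CLONERANGE only
    } clone;
    __le32 hint_type;             // OP_COLL_HINT
    struct {                      // OP_SETALLOCHINT
      __le64 expected_object_size;
      __le64 expected_write_size;
    } alloc_hint;
    struct {                      // OP_SPLIT_COLLECTION2
      __le32 split_bits;
      __le32 split_rem;
    } split;
  } u;
} __attribute__ ((packed));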

On Thu, Jul 2, 2015 at 10:05 PM, Dałek, Piotr
piotr.da...@ts.fujitsu.com wrote:
 Hello,

 In ObjectStore.h we have the following stuct:

 struct Op {
   __le32 op;
   __le32 cid;
   __le32 oid;
   __le64 off;
   __le64 len;
   __le32 dest_cid;
   __le32 dest_oid;  //OP_CLONE, OP_CLONERANGE
   __le64 dest_off;  //OP_CLONERANGE
   __le32 hint_type; //OP_COLL_HINT
   __le64 expected_object_size;  //OP_SETALLOCHINT
   __le64 expected_write_size;   //OP_SETALLOCHINT
   __le32 split_bits;//OP_SPLIT_COLLECTION2
   __le32 split_rem; //OP_SPLIT_COLLECTION2
 } __attribute__ ((packed)) ;

 Some of the fields (like hint_type and split_rem) are totally unused in 
 certain ops, and they eat up space anyway. Maybe we should use unions here, 
 and re-use some of the fields for other purposes? Right now the structure is 
 72 bytes in size, and for most ops, at least 32 bytes are always wasted.


 With best regards / Pozdrawiam
 Piotr Dałek





-- 
Best Regards,

Wheat


About Fio backend with ObjectStore API

2015-06-30 Thread Haomai Wang
Hi all,

Long long ago, is there someone said about fio backend with Ceph
ObjectStore API? So we could use the existing mature fio facility to
benchmark ceph objectstore.

-- 
Best Regards,

Wheat


Re: Inline dedup/compression

2015-06-30 Thread Haomai Wang
On Tue, Jun 30, 2015 at 4:55 AM, James (Fei) Liu-SSI
james@ssi.samsung.com wrote:
 Hi Haomai,
   Thanks for moving the idea forward. Regarding the compression: if we do
  compression at the client level, it is not global, and the compression is
  only applied to the local client, am I right?  I think there are pros and
  cons to both solutions and we can get into more detail for each.

Yes, I have thought a lot about compression in Ceph myself. First, we
could easily implement compression in the objectstore backend, like
filestore on zfs/btrfs or keyvaluestore on leveldb/rocksdb. The advantage
is that we can enjoy it now. The con is that we may give away too much of
the benefit of compression, especially on performance.

So we think about moving compression into the osd/pg layer. Maybe we can
get compressed data from the messenger module (a NIC or IB card may offer
a compression feature), then carry the compressed data directly and
process it as-is. The problem is that the pg/osd layer then becomes aware
of compression and has to manage compression state (this is important).
Maybe we could create a pool with a compression feature and specify the
compress unit (8-64KB) and the compression algorithm. The con is that the
pool level may be too coarse, and it actually increases the complexity of
the io path. For example, if the compress unit is 8k, all objects in this
pool need to do 8k-aligned io, otherwise we need read-before-write. And
since the pool is a high-level concept, it's difficult for users to match
it accurately to the client workload.
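
A tiny sketch of what I mean by read-before-write with a fixed compress
unit (purely illustrative, not code from any branch):

#include <stdint.h>

struct rmw_span { uint64_t off, len; };  // compressed range we must read back

rmw_span span_for_write(uint64_t off, uint64_t len, uint64_t unit /* e.g. 8192 */)
{
  uint64_t first = off / unit * unit;                     // round down to unit
  uint64_t last  = (off + len + unit - 1) / unit * unit;  // round up to unit
  // if the write exactly covers [first, last) nothing has to be read back;
  // otherwise the partially covered chunks must be read, decompressed,
  // patched and recompressed before the new data can be stored
  return rmw_span{ first, last - first };
}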

If we implement compression at the client side, such as in librbd, we get
a volume-level compression feature, which should be friendlier to users.
We can use a 64kb compress unit for a sequential-workload volume, 4kb/8kb
for a performance-tradeoff volume, and no compression for a pure
performance volume. The same applies at the cephfs directory level and the
radosgw bucket level. Librbd can split an object directly into compression
stripes and cephfs can compress a whole file, which may give a better
compression ratio than a compression-unaware data structure on the osd
side. Another benefit is that we enjoy the compression as early as
possible, which may offset part of the compression performance cost. The
con is that we need to implement more code in the client libraries.

   I really like your idea for dedupe on the OSD side, by the way. Let me
  think more about it.

  Regards,
  James

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Friday, June 26, 2015 8:55 PM
 To: James (Fei) Liu-SSI
 Cc: ceph-devel
 Subject: Re: Inline dedup/compression

 On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI 
 james@ssi.samsung.com wrote:
 Hi Haomai,
   Thanks for your response as always. I agree compression is comparable 
 easier task but still very challenge in terms of implementation no matter 
 where we should implement . Client side like RBD, or RDBGW or CephFS, or PG 
 should be a little bit better place to implementation in terms of efficiency 
 and cost reduction before the data were duplicated to other OSDs. It has  
 two reasons :
 1. Keep the data consistency among OSDs in one PG 2. Saving the
 computing resources

 IMHO , The compression should be accomplished before the replication come 
 into play in pool level. However, we can also have second level of 
 compression in the local objectstore.  In term of unit size of compression , 
 It really depends workload and in which layer we should implement.

 About inline deduplication, it will dramatically increase the complexities 
 if we bring in the replication and Erasure Coding for consideration.

 However, Before we talk about implementation, It would be great if we can 
 understand the pros and cons to implement inline dedupe/compression. We all 
 understand the benefits of dedupe/compression. However, the side effect is 
 performance hurt and need more computing resources. It would be great if we 
 can understand the problems from 30,000 feet high for the whole picture 
 about the Ceph. Please correct me if I were wrong.

 Actually we may have some tricks to reduce performance hurt like compression. 
 As Joe mentioned, we can compress slave pg data to avoid performance hurt, 
 but it may increase the complexity of recovery and pg remap things. Another 
 in-detail implement way if we begin to compress data from messenger, osd 
 thread and pg thread won't access data for normal client op, so maybe we can 
 make it parallel with pg process. Journal thread will get the compressed data 
 at last.

 The effect of compression also is a concern, we do compression in rados may 
 not get the best compression result. If we can do compression in libcephfs, 
 librbd and radosgw and make rados unknown to compression, it maybe simpler 
 and we can get file/block/object level compression. it should be better?

 About dedup, my current idea is we could setup a memory pool at osd side for 
 checksum store usage. Then we calculate

Re: Inline dedup/compression

2015-06-26 Thread Haomai Wang
On Sat, Jun 27, 2015 at 2:03 AM, James (Fei) Liu-SSI
james@ssi.samsung.com wrote:
 Hi Haomai,
   Thanks for your response as always. I agree compression is comparable 
 easier task but still very challenge in terms of implementation no matter 
 where we should implement . Client side like RBD, or RDBGW or CephFS, or PG 
 should be a little bit better place to implementation in terms of efficiency 
 and cost reduction before the data were duplicated to other OSDs. It has  two 
 reasons :
 1. Keep the data consistency among OSDs in one PG
 2. Saving the computing resources

 IMHO , The compression should be accomplished before the replication come 
 into play in pool level. However, we can also have second level of 
 compression in the local objectstore.  In term of unit size of compression , 
 It really depends workload and in which layer we should implement.

 About inline deduplication, it will dramatically increase the complexities if 
 we bring in the replication and Erasure Coding for consideration.

 However, Before we talk about implementation, It would be great if we can 
 understand the pros and cons to implement inline dedupe/compression. We all 
 understand the benefits of dedupe/compression. However, the side effect is 
 performance hurt and need more computing resources. It would be great if we 
 can understand the problems from 30,000 feet high for the whole picture about 
 the Ceph. Please correct me if I were wrong.

Actually we may have some tricks to reduce the performance hit of
compression. As Joe mentioned, we can compress only the replica (slave) pg
data to avoid hurting foreground performance, but that may increase the
complexity of recovery and pg remapping. Another implementation detail: if
we start compressing data in the messenger, the osd thread and pg thread
won't touch the data for a normal client op, so maybe we can run the
compression in parallel with pg processing, and the journal thread gets
the compressed data at the end.

The effectiveness of compression is also a concern: doing compression in
rados may not give the best compression result. If we do compression in
libcephfs, librbd and radosgw and keep rados unaware of compression, it
may be simpler and we get file/block/object level compression. Wouldn't
that be better?

About dedup, my current idea is that we could set up a memory pool on the
osd side to store checksums. Then, on the client side, we compute a
fingerprint of the object data and map that to a PG instead of mapping the
object name, so an object always lands on an osd that is also responsible
for its dedup storage. It could also be distributed at the pool level.
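
A sketch of the client-side part; the hash and the modulo are only
stand-ins, and real placement would still go through the normal
stable_mod/CRUSH path:

#include <stdint.h>
#include <stddef.h>

// today the PG is derived from a hash of the object *name*; for a dedup
// pool the idea is to derive it from a fingerprint of the object *data*,
// so identical data always lands on the same PG/OSD, which keeps the
// checksum table for that data in its local memory pool
uint32_t pg_from_data(const void *data, size_t len, uint32_t pg_num)
{
  const uint8_t *p = (const uint8_t *)data;
  uint32_t h = 2166136261u;           // FNV-1a, purely illustrative; a real
  for (size_t i = 0; i < len; i++) {  // version would use a stronger digest
    h ^= p[i];
    h *= 16777619u;
  }
  return h % pg_num;
}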



 By the way, Both of software defined storage solution startups like Hdevig 
 and Springpath provide inline dedupe/compression.  It is not apple to apple 
 comparison. But it is good reference. The datacenters need cost effective 
 solution.

 Regards,
 James



 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Thursday, June 25, 2015 8:08 PM
 To: James (Fei) Liu-SSI
 Cc: ceph-devel
 Subject: Re: Inline dedup/compression

 On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI 
 james@ssi.samsung.com wrote:
 Hi Cephers,
 It is not easy to ask when Ceph is going to support inline 
 dedup/compression across OSDs in RADOS because it is not easy task and 
 answered. Ceph is providing replication and EC for performance and failure 
 recovery. But we also lose the efficiency  of storage store and cost 
 associate with it. It is kind of contradicted with each other. But I am 
 curious how other Cephers think about this question.
Any plan for Cephers to do anything regarding to inline 
 dedupe/compression except the features brought by local node itself like 
 BRTFS?

 Compression is easier to implement in rados than dedup. The most important 
 thing about compression is where we begin to compress, client, pg or 
 objectstore. Then we need to decide how much the compress unit is. Of course, 
 compress and dedup both like to use keyvalue-alike storage api to use, but I 
 think it's not difficult to use existing objectstore api.

 Dedup is more possible to implement in local osd instead of the whole pool or 
 cluster, and if we want to do dedup for the pool level, we need to do dedup 
 from client.


   Regards,
   James
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
 in the body of a message to majord...@vger.kernel.org More majordomo
 info at  http://vger.kernel.org/majordomo-info.html



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: Inline dedup/compression

2015-06-25 Thread Haomai Wang
On Fri, Jun 26, 2015 at 6:01 AM, James (Fei) Liu-SSI
james@ssi.samsung.com wrote:
 Hi Cephers,
 It is not easy to ask when Ceph is going to support inline 
 dedup/compression across OSDs in RADOS because it is not easy task and 
 answered. Ceph is providing replication and EC for performance and failure 
 recovery. But we also lose the efficiency  of storage store and cost 
 associate with it. It is kind of contradicted with each other. But I am 
 curious how other Cephers think about this question.
Any plan for Cephers to do anything regarding to inline dedupe/compression 
 except the features brought by local node itself like BRTFS?

Compression is easier to implement in rados than dedup. The most
important question about compression is where we start to compress:
client, pg or objectstore. Then we need to decide how big the compress
unit is. Of course, compression and dedup would both like a
keyvalue-style storage api, but I think it's not difficult to work with
the existing objectstore api.

Dedup is more feasible to implement in the local osd than across the
whole pool or cluster; if we want dedup at the pool level, we need to do
the dedup from the client.


   Regards,
   James
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: [ceph-users] Blueprint Submission Open for CDS Jewel

2015-06-08 Thread Haomai Wang
Hi Patrick,

It looks confusing to use. Do we need to upload a txt file to describe a
blueprint instead of editing it directly online?

On Wed, May 27, 2015 at 5:05 AM, Patrick McGarry pmcga...@redhat.com wrote:
 It's that time again, time to gird up our loins and submit blueprints
 for all work slated for the Jewel release of Ceph.

 http://ceph.com/uncategorized/ceph-developer-summit-jewel/

 The one notable change for this CDS is that we'll be using the new
 wiki (on tracker.ceph.com) that is still undergoing migration from the
 old wiki. I have outlined the procedure in the announcement above, but
 please feel free to hit me with any questions or issues you may have.
 Thanks.


 --

 Best Regards,

 Patrick McGarry
 Director Ceph Community || Red Hat
 http://ceph.com  ||  http://community.redhat.com
 @scuttlemonkey || @ceph
 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com



-- 
Best Regards,

Wheat


Re: Looking to improve small I/O performance

2015-06-06 Thread Haomai Wang
On Sat, Jun 6, 2015 at 2:07 PM, Dałek, Piotr piotr.da...@ts.fujitsu.com wrote:
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-

 I'm digging into perf and the code to see here/how I might be able to
 improve performance for small I/O around 16K.

 I ran fio with rados and used perf to record. Looking through the report,
 there is a very substantial amount of time creating threads (or so it looks, 
 but
 I'm really new to perf). It seems to point to messenger, so I looked in the
 code. From perf if looks like thread pooling isn't happening, but from what I
 can gather from the code, it should.
 [..]

 This is so because you use SimpleMessenger, which can't handle small I/O well.
 Indeed, threads are problematic with it, as well as memory allocation. I did 
 some
 benchmarking some time ago and the gist of it is that you could try going for
 AsyncMessenger and see if it helps. You can also see my results here:
 http://stuff.predictor.org.pl/chunksize.xlsx
 From there you can see that most of the time of small I/Os in SimpleMessenger
 Is spent in tcmalloc code, and also there's a performance drop around 64k
 Blocksize in Async Messenger.


Thanks for your benchmark. I submitted a new performance-enhancing
patchset for AsyncMessenger; it should solve the performance degradation
seen in the original stress test :-)

 With best regards / Pozdrawiam
 Piotr Dałek




-- 
Best Regards,

Wheat


Re: Looking to improve small I/O performance

2015-06-06 Thread Haomai Wang
We could wait for the next benchmark until this
PR (https://github.com/ceph/ceph/pull/4775) is merged.

On Sat, Jun 6, 2015 at 11:06 PM, Robert LeBlanc rob...@leblancnet.us wrote:

 I found similar results in my testing as well. Ceph is certainly great
 at large I/O, but our workloads are in the small I/O range. I
 understand that latency plays a huge role when I/O is small. Since we
 have 10 and 40 Gb Ethernet, we can't get much lower in latency that
 way (Infiniband is not an option right now). So I was poking around to
 see if there was some code optimizations that might reduce latency
 (not that I'm smart enough to do the coding myself).

 I was surprised when enabling QEMU writeback cache and set the cache
 to the working set size, I really didn't get any additional
 performance. After the first run the QEMU process allocated almost all
 the memory. I believe after several runs, there was some improvement,
 but not what I expected.

 What is the status of the async messenger, it looks like it is
 experimental in the code? How do you enable it, I can't seem to find a
 config option, does Ceph have to be compiled with it? I would like to
 test it on my dev cluster.

Yes, it's experimental and you only need to enable it via a config value.
Nothing else needs to be done at compile time.
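
For reference it is something like the following in ceph.conf (option
names written from memory here, so double-check them against your build;
ms_async_op_threads is just optional tuning):

[global]
        enable experimental unrecoverable data corrupting features = ms-type-async
        ms_type = async
        ms_async_op_threads = 5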




 Thanks,
 
 Robert LeBlanc
 GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1


 On Sat, Jun 6, 2015 at 12:07 AM, Dałek, Piotr
 piotr.da...@ts.fujitsu.com wrote:
 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-

 I'm digging into perf and the code to see here/how I might be able to
 improve performance for small I/O around 16K.

 I ran fio with rados and used perf to record. Looking through the report,
 there is a very substantial amount of time creating threads (or so it 
 looks, but
 I'm really new to perf). It seems to point to messenger, so I looked in the
 code. From perf if looks like thread pooling isn't happening, but from what 
 I
 can gather from the code, it should.
 [..]

 This is so because you use SimpleMessenger, which can't handle small I/O 
 well.
 Indeed, threads are problematic with it, as well as memory allocation. I did 
 some
 benchmarking some time ago and the gist of it is that you could try going for
 AsyncMessenger and see if it helps. You can also see my results here:
 http://stuff.predictor.org.pl/chunksize.xlsx
 From there you can see that most of the time of small I/Os in SimpleMessenger
 Is spent in tcmalloc code, and also there's a performance drop around 64k
 Blocksize in Async Messenger.

 With best regards / Pozdrawiam
 Piotr Dałek

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: [RFC] Implement a new journal mode

2015-06-02 Thread Haomai Wang
On Tue, Jun 2, 2015 at 5:28 PM, Li Wang liw...@ubuntukylin.com wrote:
 I think for scrub, we have a relatively easy way to solve it,
 add a field to object metadata with the value being either UNSTABLE
 or STABLE, the algorithm is as below,
 1 Mark the object be UNSTABLE
 2 Perform object data write

I guess this write should be sync.

 3 Perform metadata write and MARK the object STABLE
 The order of the three steps are enforced, and the step 1 and 3 are
 written into journal, while step 2 is performed directly on the object.
 For scrub, it could now distinguish this situation, and one feasible
 policy could be to find the copy with the latest metadata, and
 synchronize the data of that copy to others.

UNSTABLE/STABLE markers also influence recovery, I think; in other words,
this approach is a little like a write-to-majority strategy.

Beyond the complexity, I'm more afraid of the macroscopic effects. A disk
controller is only a local device and usually carries a battery to ensure
at least atomic unit writes, but ceph is a distributed system, which
brings more unstable elements into the io stack (any software bug gets
amplified). If we weaken the constraints here, we rely more on a
stable/perfect guest filesystem implementation (legacy filesystems won't
be safe here, and we can't distinguish correctness across different OSes
and kernel versions). From a cloud provider's point of view, an
application's io path is already very long and expensive, and we may want
a strongly consistent block storage to ensure the reliability of the
data. We want to be able to say rbd is safe for almost any block usage;
otherwise, when we hit a data corruption problem we are in big trouble.
That is to say, we should be able to declare a given ceph release safe or
unsafe, rather than ceph being safe for some guest filesystems (use
cases) and not for others. Otherwise ceph may suffer more accidental
blame.


 For this metadata-only journal mode, I think it does not contradict
 with new store, since they address different scenarios. Metadata-only
 journal mode mainly focuses on the scenarios that data consistency
 does not need be ensured by RADOS itself. And it is especially appealing
 for the scenarios with many random small OVERWRITES, for example, RBD
 in cloud environment. While new store is great for CREATE and APPEND,
 for many random small OVERWRITES, new store is not
 very easy to optimize. It seems the only way is to introduce small size
 of fragments and turn those OVERWRITES into APPEND. However, in that
 case, many small OVERWRITES could cause many small files on the local
 file system, it will slow down the subsequent read/write performance of
 the object, so it seems not worthy. Of course, a small-file-merge
 process could be introduced, but that complicates the design.

 So basically, I think new store is great for some of the scenarios,
 while metadata-only is desirable for some others, they do not
 contradict with each other, what do you think?

 Cheers,
 Li Wang




 On 2015/6/1 8:39, Sage Weil wrote:

 On Fri, 29 May 2015, Li Wang wrote:

 An important usage of Ceph is to integrate with cloud computing platform
 to provide the storage for VM images and instances. In such scenario,
 qemu maps RBD as virtual block devices, i.e., disks to a VM, and
 the guest operating system will format the disks and create file
 systems on them. In this case, RBD mostly resembles a 'dumb' disk.  In
 other words, it is enough for RBD to implement exactly the semantics of
 a disk controller driver. Typically, the disk controller itself does
 not provide a transactional mechanism to ensure a write operation done
 atomically. Instead, it is up to the file system, who manages the disk,
 to adopt some techniques such as journaling to prevent inconsistency,
 if necessary. Consequently, RBD does not need to provide the
 atomic mechanism to ensure a data write operation done atomically,
 since the guest file system will guarantee that its write operations to
 RBD will remain consistent by using journaling if needed. Another
 scenario is for the cache tiering, while cache pool has already
 provided the durability, when dirty objects are written back, they
 theoretically need not go through the journaling process of base pool,
 since the flusher could replay the write operation. These motivate us
 to implement a new journal mode, metadata-only journal mode, which
 resembles the data=ordered journal mode in ext4. With such journal mode
 is on, object data are written directly to their ultimate location,
 when data written finished, metadata are written into the journal, then
 the write returns to caller. This will avoid the double-write penalty
 of object data due to the WRITE-AHEAD-LOGGING, potentially greatly
 improve the RBD and cache tiering performance.

 The algorithm is straightforward, as before, the master send
 transaction to slave, then they extract the object data write
 operations and apply them to objects directly, next they write the
 remaining part of the transaction into journal, then slave ack master,
 master 

Re: OSD-Based Object Stubs

2015-05-27 Thread Haomai Wang
I guess it should be something like what Sam designed at CDS
(https://wiki.ceph.com/Planning/Blueprints/Infernalis/osd%3A_Tiering_II_(Warm-%3ECold))

On Wed, May 27, 2015 at 4:39 PM, Marcel Lauhoff m...@irq0.org wrote:
 Hi,

 I wrote a prototype for an OSD-based object stub feature. An object stub
 being an object with it's data moved /elsewhere/. I hope to get some
 feedback, especially whether I'm on the right path here and if it
 is a feature you are interested in.



 Code is in my osd-stubs branch:
  https://github.com/ceph/ceph/compare/master...irq0:osd-stubs
  https://github.com/irq0/ceph/tree/osd-stubs

  Tools to toy around with osd-stubs + web server to send stubs to:
  https://github.com/irq0/ceph_osd-stub_tools



 Related:
 - 
 https://wiki.ceph.com/Planning/Blueprints/%3CSIDEBOARD%3E/osd:_tiering:_object_redirects



 Implementation:

 Adds two new OSD OPs:
 - STUB :: Move data away; Save location in xattr; Set
   object_info_t::stub_state to 'remote'
 - UNSTUB :: Get data back; Remove xattr; Set object_info_t::stub_state
 to 'local'


 STUB is meant to be called by an external archive agent. UNSTUB
 implicitly when OPs come in that need the object's data. Operations are
 classified as may_use_obj_data similar to how op-may_{write,read,cache} 
 work.

 The implicit UNSTUB is implemented by prepending an UNSTUB operation to the
 incoming OP list if may_use_obj_data() and stub_state == REMOTE.
 This sadly causes a warning in the client saying that
 the reply doesn't match the request. It is of course right, but I found
 it to be the simplest way to try the feature.
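
 Roughly, the prepend looks like the sketch below (the real code lives in
 the osd-stubs branch; CEPH_OSD_OP_UNSTUB and the stub_state /
 may_use_obj_data names are the prototype's own additions, so nothing here
 is upstream API):

 #include <vector>
 #include "osd/osd_types.h"  // OSDOp, object_info_t (prototype adds stub_state)

 void maybe_prepend_unstub(std::vector<OSDOp>& ops,
                           const object_info_t& oi,
                           bool may_use_obj_data)
 {
   if (may_use_obj_data && oi.stub_state == object_info_t::STUB_REMOTE) {
     OSDOp unstub;
     unstub.op.op = CEPH_OSD_OP_UNSTUB;  // new op added by the branch
     ops.insert(ops.begin(), unstub);    // data is pulled back before the
                                         // client's ops are executed
   }
 }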

 External storage in the prototype is just simple HTTP: PUT on STUB;
 GET+DELETE on UNSTUB.

 The Operations implement a kind of converter: STUB reads from the
 primary OSD, store the object on the remote and then issue TRUNCATE(0)
 and SETXATTR OPs. Similar UNSTUB retrieves the object, then does
 WRITEFULL and RMXATTR.


 ~marcel

 --
 Marcel Lauhoff
 Mail/XMPP: m...@irq0.org
 http://irq0.org
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: Some thoughts regarding the new store

2015-05-27 Thread Haomai Wang
On Wed, May 27, 2015 at 4:41 PM, Li Wang liw...@ubuntukylin.com wrote:
 I have just noticed the new store development, and had a
 look at the idea behind it (http://www.spinics.net/lists/ceph-
 devel/msg22712.html), so my understanding, we wanna avoid the
 double-write penalty of WRITE_AHEAD_LOGGING journal mechanism,
 the straightforward thought is to optimize CREATE, APPEND and
 FULL-OBJECT-OVERWRITE by writing into new files directly,
 then update the metadata in a transaction. Other changes include:
 move the object metadata from filesystem extend attrbutes into
 key value database; map an object into possibly multiple files.

 If my understanding is correct, then it seems there follows some issues,

 1 Garbage collection is needed to reclaim orphan files generated
 from crashing;

Yes, but we haven't dived into this problem yet, because currently
newstore only allows one file per object.

Anyway, I guess GC isn't a big problem; the journal keys should help. Is
there something I missed here?


 2 On spinning disks, it loses the advantages that journal makes random
  writes into sequential writes, then commits them in groups and
 leverages another disk to hide the committing delay.


We need to clarify something here: for a small random write workload,
newstore still needs a journal for durability and shorter latency.

Although the filejournal uses write-ahead to improve performance, the
journal is far away from the data location on disk (partition or
preallocated file); we always need to write the data to disk as well, and
the seek distance is long, I think. For newstore, in my ideal view the
journal and data could live in one allocation group, in the local
filesystem sense (it may be difficult though), just like an ideal fragment
implementation. In other words, a fragment should be something that
aggregates small writes, but we haven't made that work as expected yet.

Although newstore's random write performance is currently worse than
filestore's, I don't think that is inherent to the design. There are still
lots of improvements we can apply.

 3 OVERWRITE theoretically does not benefit from this design, and the
 introducing of fragment, increases the object metadata overhead. The
 possibly mapping of multiple files may also slow down the object
 read/write performance. OVERWRITE is the major scenario for RBD,
 consequently, for cloud environment.

Yes, we need to handle this. Actually, for mapping one object to multiple
files we don't have a design yet (@sage, yes? or did I miss it?). We could
think of a solution to make a tradeoff  :-)



 4 By mapping an object into multiple files, potentially we can optimize
 OVERWRITE by turning it also into APPEND by using small fragments,
 that, actually mimic Btrfs. However, for many small writes, it may
 leave many small files in the backend local file system, that may slow
 down the object read/write performance, especially on spinning
 disk. More importantly, I think it, to some extent, against the
 philosophy of object storage, which uses a big object to store data to
 reduce the metadata cost, and leaves the block management for local
 file system. For a local file system, big file performance is generally
 better than small file. If we introduce fragment, it looks like the
 object storage self cares about the object data allocation now.

 What is the community's option?

Anyway, I think the core goal is to make newstore better than filestore
for most workloads.


 Cheers,
 Li Wang
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat


Re: [ceph-users] Memory Allocators and Ceph

2015-05-27 Thread Haomai Wang
On Thu, May 28, 2015 at 1:40 AM, Robert LeBlanc rob...@leblancnet.us wrote:

 With all the talk of tcmalloc and jemalloc, I decided to do some
 testing og the different memory allocating technologies between KVM
 and Ceph. These tests were done a pre-production system so I've tried
 to remove some the variance with many runs and averages. The details
 are as follows:

 Ceph v0.94.1 (I backported a branch from master to get full jemalloc
 support for part of the tests)
 tcmalloc v2.4-3
 jemalloc v3.6.0-1
 QEMU v0.12.1.2-2 (I understand the latest version for RH6/CentOS6)
 OSDs are only spindles with SSD journals, no SSD tiering

 The 11 Ceph nodes are:
 CentOS 7.1
 Linux 3.18.9
 1 x Intel E5-2640
 64 GB RAM
 40 Gb Intel NIC bonded with LACP using jumbo frames
 10 x Toshiba MG03ACA400 4 TB 7200 RPM drives
 2 x Intel SSDSC2BB240G4 240GB SSD
 1 x 32 GB SATADOM for OS

 The KVM node is:
 CentOS 6.6
 Linux 3.12.39
 QEMU v0.12.1.2-2 cache mode none

 The VM is:
 CentOS 6.6
 Linux 2.6.32-504
 fio v2.1.10

 On average preloading Ceph with either tcmalloc or jemalloc showed an
 increase of performance of about 30% with most performance gains for
 smaller I/O. Although preloading QEMU with jemalloc provided about a
 6% increase on a lightly loaded server, it did not add or subtract a
 noticeable performance difference combined with Ceph using either
 tcmalloc or jemalloc.

 Compiling Ceph entirely with jemalloc overall had a negative
 performance impact. This may be due to dynamically linking to RocksDB
 instead of the default static linking.

 Preloading QEMU with tcmalloc in all cases overall showed very
 negative results, however it showed the most improvement of any tests
 in the 1MB tests up to almost 2.5x performance of the baseline. If
 your workload is guaranteed to be of 1MB I/O (and possibly larger),
 then this option may be useful.

 Based on the architecture of jemalloc, it is possible that with it
 loaded on the QEMU host may provide more benefit on servers that are
 closer to memory capacity, but I did not test this scenario.

 Any feedback regarding this exercise is welcome.

Really cool!!!

It's really an important piece of work that helps us see how much
difference the memory allocation library makes.

Recently I did some basic work and wanted to investigate ceph's memory
allocation characteristics under different workloads; I hesitated to do it
because of the unknowns about how much improvement there would be. Right
now the top cpu consumer is memory allocation/free, and I see that
different io size workloads (with high cpu usage) can result in terrible
performance for a ceph cluster. I hope we can lower ceph's cpu
requirements (for fast storage device backends) by solving this problem.

BTW, could I know the details about your workload?


 Data: 
 https://docs.google.com/a/leblancnet.us/spreadsheets/d/1n12IqAOuH2wH-A7Sq5boU8kSEYg_Pl20sPmM0idjj00/edit?usp=sharing
 Test script is multitest. The real world test is based off of the disk
 stats of about 100 of our servers which have uptimes of many months.

 - - 
 Robert LeBlanc
 GPG Fingerprint 79A2 9CA4 6CC4 45DD A904  C70E E654 3BB2 FA62 B9F1

 ___
 ceph-users mailing list
 ceph-us...@lists.ceph.com
 http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com




-- 
Best Regards,

Wheat


[no subject]

2015-05-16 Thread Haomai Wang
Even if the data comes from /dev/zero, the data crc shouldn't be 0.

I would guess the osd (arm) isn't doing the crc computation, but from the
code, crc on arm should be fine.

On Sat, May 16, 2015 at 6:21 PM, huang jun hjwsm1...@gmail.com wrote:
 That always happens; every test has such errors. And our cluster and
 client running on X86 work fine; we have never seen a bad crc error.


 2015-05-16 17:30 GMT+08:00 Haomai Wang haomaiw...@gmail.com:
 is this always happen or occasionally?

 On Sat, May 16, 2015 at 10:10 AM, huang jun hjwsm1...@gmail.com wrote:
 hi,steve

 2015-05-15 16:36 GMT+08:00 Steve Capper steve.cap...@linaro.org:
 On 15 May 2015 at 00:51, huang jun hjwsm1...@gmail.com wrote:
 hi,all

 Hi HuangJun,


 We run ceph cluster on ARM platform (arm64, linux kernel 3.14, OS
 ubuntu 14.10), and use dd if=/dev/zero of=/mnt/test bs=4M count=125
 to write data.  On the osd side, we got bad data CRC error.

 The kclient log: (tid=6)
 May 14 17:21:08 node103 kernel: [  180.194312] CPU[0] libceph:
 send_request ffc8d252f000 tid-6 to osd0 flags 36 pg 1.9aae829f req
 data size is 4194304
 May 14 17:21:08 node103 kernel: [  180.194316] CPU[0] libceph: tid-6
 - ffc0702f66c8 to osd0 42=osd_op len 197+0+4194304 -
 libceph: tid-6 front_crc is 388648745 middle_crc is 0 data_crc is 
 3036014994

 The OSD-0 log:
 2015-05-13 08:12:50.049345 7f378d8d8700  0 seq  3 tid 6 front_len 197
 mid_len 0 data_len 4194304
 2015-05-13 08:12:50.049348 7f378d8d8700  0 crc in front 388648745 exp 
 388648745
 2015-05-13 08:12:50.049395 7f378d8d8700  0 crc in middle 0 exp 0
 2015-05-13 08:12:50.049964 7f378d8d8700  0 crc in data 0 exp 3036014994
 2015-05-13 08:12:50.050234 7f378d8d8700  0 bad crc in data 0 != exp 
 3036014994

 some considerations:
 1) we use ceph 0.80.7 realse version and compile it on ARM, did this
 works? or  does ceph's code has ARM branch?

 We did run a Ceph version close to that for 64-bit ARM, I'm checking
 out 0.80.7 now to test.
 In v9.0.0, there is some code to use the ARM optional crc32c
 instructions, but this isn't in 0.80.7.


 2) as we have write 125 objects, only few of them report CRC error,
 and the right object's data_crc is 0 both on osd and kclient. the
 wrong object's data_crc is not 0 on kclient, but osd calculate result
 0. the object data came from /dev/zero, i think the data_crc should be
 0, am i right?


 If the initial CRC seed value is non-zero, then the CRC of a buffer
 full of zeros won't be zero.
 So ceph_crc32c(somethingnonzero, zerofilledbuffer, len), will be non-zero.
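
 A quick way to see this, assuming the usual ceph_crc32c(seed, data, len)
 declared in include/crc32c.h:

 #include "include/crc32c.h"
 #include <string.h>
 #include <stdint.h>
 #include <assert.h>

 int main()
 {
   unsigned char buf[4096];
   memset(buf, 0, sizeof(buf));                                        // all zeros
   uint32_t zero_seed    = ceph_crc32c(0, buf, sizeof(buf));           // stays 0
   uint32_t nonzero_seed = ceph_crc32c(0xffffffffu, buf, sizeof(buf)); // != 0
   assert(zero_seed == 0);
   assert(nonzero_seed != 0);
   return 0;
 }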

 I would like to reproduce this problem here.
 What steps did you take before this error occurred?
 Is this a cephfs filesystem or something on top of an RBD image?
 Which kernel are you running? Is it the one that comes with Ubuntu?
 (If so which package version is it?)

  We use linux kernel version 3.14, we tested it on Ubuntu, and the ceph
  version is v0.80.7. Both cephfs and RBD images have CRC problems.
  I'm not sure whether it's related to memory, since we tested many
  times and only a few runs reported a CRC error.
  As I mentioned, I suspect a memory fault changed the data, because we
  write 125 objects and all data_crc values are 0 except the bad-CRC
  object's data_crc. Any tips are welcome.

 Cheers,
 --
 Steve



 --
 thanks
 huangjun
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



 --
 Best Regards,

 Wheat



 --
 thanks
 huangjun



-- 
Best Regards,

Wheat


Re: newstore performance update

2015-04-30 Thread Haomai Wang
On Thu, Apr 30, 2015 at 12:38 AM, Sage Weil sw...@redhat.com wrote:
 On Wed, 29 Apr 2015, Chen, Xiaoxi wrote:
 Hi Mark,
   Really good test:) I only played a bit on SSD, the parallel WAL
 threads really helps but we still have a long way to go especially on
 all-ssd case. I tried this
 https://github.com/facebook/rocksdb/blob/master/util/env_posix.cc#L1515
 by hacking the rocksdb, but the performance difference is negligible.

 It gave me a 25% bump when rocksdb is on a spinning disk, so I went ahead
 and committed the change to the branch.  Probably not noticeable on the
 SSD, though it can't hurt.

 The rocksdb digest speed should be the problem, I believe, I was planned
 to prove this by skip all db transaction, but failed since hitting other
 deadlock bug in newstore.

 Will look at that next!


 Below are a bit more comments.
  Sage has been furiously working away at fixing bugs in newstore and
  improving performance.  Specifically we've been focused on write
  performance as newstore was lagging filestore but quite a bit previously.  
  A
  lot of work has gone into implementing libaio behind the scenes and as a
  result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
  has improved pretty dramatically. It's now often beating filestore:
 

 SSD DB is still better than SSD WAL with request size  128KB, this indicate 
 some WALs are actually written to Level0...Hmm, could we add 
 newstore_wal_max_ops/bytes to capping the total WAL size(how much data is in 
 WAL but not yet apply to backend FS) ?  I suspect this would improve 
 performance by prevent some IO with high WA cost and latency?

  http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
 
  On the other hand, sequential writes are slower than random writes when
  the OSD, DB, and WAL are all on the same device be it a spinning disk or 
  SSD.

 I think sequential writes being slower than random is by design in Newstore,
 because for every object we can only have one WAL; that means no
 concurrent IO if req_size * QD < 4MB. Not sure how many #QD you had in
 the test? I suspect 64, since there is a boost in seq write performance
 with req size >= 64KB (64KB * 64 = 4MB).

 In this case, the IO pattern will be: 1 write to DB WAL -> Sync -> 1 Write to
 FS -> Sync; we do everything synchronously, which is essentially
 expensive.

 The number of syncs is the same for appends vs wal... in both cases we
 fdatasync the file and the db commit, but with WAL the fs sync comes after
 the commit point instead of before (and we don't double-write the data).
 Appends should still be pipelined (many in flight for the same object)...
 and the db syncs will be batched in both cases (submit_transaction for
 each io, and a single thread doing the submit_transaction_sync in a loop).

 If that's not the case then it's an accident?

I hope I can clarify the current implementation (for an rbd 4k write,
warm object, aio, no overlay) from my point of view, compared to FileStore:

1. Because the buffer should be page aligned, we only need to consider aio
here. Prepare the aio write (why do we need to call ftruncate when doing
an append?), plus a mandatory open call (which may get much more expensive
if the directory has lots of files?).
2. setxattr will encode the whole onode, and omapsetkeys is the same as in
FileStore, but maybe with a larger onode buffer compared to the local fs
xattr set in FileStore?
3. Submit the aio: because we do aio+dio for the data file, the i_size
will be updated inline, as far as I can tell, in a lot of cases?
4. The aio completes and we do an aio fsync (a consequence of #2?; this
adds a thread wake/signal cost): we need a finisher thread here to run
_txc_state_proc so the aio thread doesn't stop waiting for new aio, so we
pay a thread switch cost again?
5. The keyvaluedb submits the transaction (I think we won't do a sync
submit because we can't block in _txc_state_proc, so another thread
wake/signal cost).
6. Complete the caller's context (respond to the client now!).

Am I missing anything, or wrong about this flow?

@sage, could you share your current thinking about the next steps? From my
current intuition, it looks like newstore still needs a lot of latency and
bandwidth optimization.


 sage



  
  Xiaoxi.
  -Original Message-
  From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-
  ow...@vger.kernel.org] On Behalf Of Mark Nelson
  Sent: Wednesday, April 29, 2015 7:25 AM
  To: ceph-devel
  Subject: newstore performance update
 
  Hi Guys,
 
  Sage has been furiously working away at fixing bugs in newstore and
  improving performance.  Specifically we've been focused on write
  performance as newstore was lagging filestore but quite a bit previously.  
  A
  lot of work has gone into implementing libaio behind the scenes and as a
  result performance on spinning disks with SSD WAL (and SSD backed rocksdb)
  has improved pretty dramatically. It's now often beating filestore:
 

  http://nhm.ceph.com/newstore/newstore-5d96fe6-no_overlay.pdf
 
  On the other hand, 

Re: async messenger small benchmark result

2015-04-29 Thread Haomai Wang
Not yet; we are currently focused only on bug fixes and stability. But
I think the performance work will be picked up soon (May?); the problem
is clear, I think.

On Wed, Apr 29, 2015 at 2:10 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
 Thanks!  So far we've gotten a report that asyncmessenger was a little
slower than simple messenger, but not this bad!  I imagine Greg will
have lots of questions. :)

 Note that this is with hammer, so maybe some improvements have already been done in
 master?


 - Original Mail -
 From: Mark Nelson mnel...@redhat.com
 To: aderumier aderum...@odiso.com, ceph-devel 
 ceph-devel@vger.kernel.org
 Sent: Tuesday, 28 April 2015 15:48:51
 Subject: Re: async messenger small benchmark result

 Hi Alex,

 Thanks! So far we've gotten a report that asyncmessenger was a little
 slower than simple messenger, but not this bad! I imagine Greg will
 have lots of questions. :)

 Mark

 On 04/28/2015 03:36 AM, Alexandre DERUMIER wrote:
 Hi,

 here a small bench 4k randread of simple messenger vs async messenger

 This is with 2 osd, and 15 fio jobs on a single rbd volume

 simple messager : 345kiops
 async messenger : 139kiops

 Regards,

 Alexandre




 simple messenger
 ---

 ^Cbs: 15 (f=15): [r(15)] [0.0% done] [1346MB/0KB/0KB /s] [345K/0/0 iops] 
 [eta 59d:13h:32m:43s]
 fio: terminating on signal 2

 rbd_iodepth32-test: (groupid=0, jobs=15): err= 0: pid=44713: Tue Apr 28 
 10:26:21 2015
 read : io=15794MB, bw=1321.4MB/s, iops=338255, runt= 11953msec
 slat (usec): min=5, max=17316, avg=33.81, stdev=63.77
 clat (usec): min=4, max=60848, avg=1011.22, stdev=1026.16
 lat (usec): min=110, max=60857, avg=1045.03, stdev=1031.56
 clat percentiles (usec):
 | 1.00th=[ 219], 5.00th=[ 298], 10.00th=[ 362], 20.00th=[ 466],
 | 30.00th=[ 572], 40.00th=[ 676], 50.00th=[ 796], 60.00th=[ 940],
 | 70.00th=[ 1112], 80.00th=[ 1336], 90.00th=[ 1784], 95.00th=[ 2288],
 | 99.00th=[ 4128], 99.50th=[ 5536], 99.90th=[13376], 99.95th=[17536],
 | 99.99th=[28544]
 bw (KB /s): min=31386, max=14, per=6.67%, avg=90244.35, stdev=17571.24
 lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=2.21%
 lat (usec) : 500=21.02%, 750=22.82%, 1000=17.99%
 lat (msec) : 2=28.62%, 4=6.26%, 10=0.88%, 20=0.15%, 50=0.03%
 lat (msec) : 100=0.01%
 cpu : usr=36.30%, sys=10.85%, ctx=2323657, majf=0, minf=5736
 IO depths : 1=0.2%, 2=0.8%, 4=3.4%, 8=16.3%, 16=72.0%, 32=7.3%, >=64=0.0%
 submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete : 0=0.0%, 4=94.8%, 8=1.0%, 16=1.5%, 32=2.6%, 64=0.0%, >=64=0.0%
 issued : total=r=4043164/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
 READ: io=15794MB, aggrb=1321.4MB/s, minb=1321.4MB/s, maxb=1321.4MB/s, 
 mint=11953msec, maxt=11953msec


 async messenger (ms_async_op_threads=10)
 -
 ^Cbs: 15 (f=15): [r(15)] [0.0% done] [544.6MB/0KB/0KB /s] [139K/0/0 iops] 
 [eta 301d:09h:10m:03s]
 fio: terminating on signal 2

 rbd_iodepth32-test: (groupid=0, jobs=15): err= 0: pid=42935: Tue Apr 28 
 10:24:29 2015
 read : io=6389.8MB, bw=547856KB/s, iops=136963, runt= 11943msec
 slat (usec): min=7, max=23454, avg=39.33, stdev=226.05
 clat (usec): min=58, max=107304, avg=3002.03, stdev=6270.44
 lat (usec): min=91, max=107327, avg=3041.36, stdev=6279.32
 clat percentiles (usec):
 | 1.00th=[ 129], 5.00th=[ 177], 10.00th=[ 229], 20.00th=[ 334],
 | 30.00th=[ 446], 40.00th=[ 564], 50.00th=[ 692], 60.00th=[ 836],
 | 70.00th=[ 1032], 80.00th=[ 1576], 90.00th=[10816], 95.00th=[17792],
 | 99.00th=[29824], 99.50th=[34048], 99.90th=[42240], 99.95th=[45824],
 | 99.99th=[50432]
 bw (KB /s): min=13359, max=128824, per=6.67%, avg=36544.92, stdev=37000.58
 lat (usec) : 100=0.04%, 250=12.05%, 500=22.51%, 750=19.70%, 1000=14.66%
 lat (msec) : 2=12.32%, 4=2.66%, 10=5.34%, 20=6.81%, 50=3.91%
 lat (msec) : 100=0.01%, 250=0.01%
 cpu : usr=19.03%, sys=6.33%, ctx=370760, majf=0, minf=11335
 IO depths : 1=0.4%, 2=0.9%, 4=5.3%, 8=20.2%, 16=66.0%, 32=7.3%, >=64=0.0%
 submit : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
 complete : 0=0.0%, 4=95.5%, 8=0.9%, 16=0.9%, 32=2.8%, 64=0.0%, >=64=0.0%
 issued : total=r=1635761/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
 latency : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
 READ: io=6389.8MB, aggrb=547855KB/s, minb=547855KB/s, maxb=547855KB/s, 
 mint=11943msec, maxt=11943msec







-- 
Best Regards,

Wheat


Re: async messenger small benchmark result

2015-04-28 Thread Haomai Wang
Thanks for your benchmark!

Yeah, the async messenger has a bottleneck when it meets high concurrency
and high IOPS, because there is an annoying lock around the CRC calculation
(a generic illustration of the fix is sketched below). Right now my main job
is focused on getting the async messenger through the QA tests. Once no
tests fail, I will tackle this problem.
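
Not the actual AsyncMessenger code, just a generic illustration of the kind of
fix hinted at above: do the CPU-heavy checksum before taking the shared send
lock so concurrent senders no longer serialize on it (crc32() from zlib stands
in for ceph's crc32c; the types are made up):

#include <cstdint>
#include <mutex>
#include <string>
#include <vector>
#include <zlib.h>  // crc32()

struct Frame {
  std::string payload;
  uint32_t crc;
};

class SendQueue {
 public:
  // Problematic pattern: checksumming inside the lock serializes all senders.
  void enqueue_crc_under_lock(std::string payload) {
    std::lock_guard<std::mutex> l(lock_);
    Frame f;
    f.payload = std::move(payload);
    f.crc = checksum(f.payload);   // expensive work while holding lock_
    out_.push_back(std::move(f));
  }

  // Better: compute the CRC first, hold the lock only long enough to append.
  void enqueue_crc_outside_lock(std::string payload) {
    Frame f;
    f.payload = std::move(payload);
    f.crc = checksum(f.payload);   // done without any shared lock held
    std::lock_guard<std::mutex> l(lock_);
    out_.push_back(std::move(f));
  }

 private:
  static uint32_t checksum(const std::string& data) {
    return crc32(0L, reinterpret_cast<const Bytef*>(data.data()),
                 static_cast<uInt>(data.size()));
  }
  std::mutex lock_;
  std::vector<Frame> out_;
};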

On Tue, Apr 28, 2015 at 4:36 PM, Alexandre DERUMIER aderum...@odiso.com wrote:
 Hi,

 here a small bench 4k randread of simple messenger vs async messenger

 This is with 2 osd, and 15 fio jobs on a single rbd volume

 simple messager : 345kiops
 async messenger : 139kiops

 Regards,

 Alexandre




 simple messenger
 ---

 ^Cbs: 15 (f=15): [r(15)] [0.0% done] [1346MB/0KB/0KB /s] [345K/0/0 iops] [eta 
 59d:13h:32m:43s]
 fio: terminating on signal 2

 rbd_iodepth32-test: (groupid=0, jobs=15): err= 0: pid=44713: Tue Apr 28 
 10:26:21 2015
   read : io=15794MB, bw=1321.4MB/s, iops=338255, runt= 11953msec
 slat (usec): min=5, max=17316, avg=33.81, stdev=63.77
 clat (usec): min=4, max=60848, avg=1011.22, stdev=1026.16
  lat (usec): min=110, max=60857, avg=1045.03, stdev=1031.56
 clat percentiles (usec):
  |  1.00th=[  219],  5.00th=[  298], 10.00th=[  362], 20.00th=[  466],
  | 30.00th=[  572], 40.00th=[  676], 50.00th=[  796], 60.00th=[  940],
  | 70.00th=[ 1112], 80.00th=[ 1336], 90.00th=[ 1784], 95.00th=[ 2288],
  | 99.00th=[ 4128], 99.50th=[ 5536], 99.90th=[13376], 99.95th=[17536],
  | 99.99th=[28544]
 bw (KB  /s): min=31386, max=14, per=6.67%, avg=90244.35, 
 stdev=17571.24
 lat (usec) : 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%, 250=2.21%
 lat (usec) : 500=21.02%, 750=22.82%, 1000=17.99%
 lat (msec) : 2=28.62%, 4=6.26%, 10=0.88%, 20=0.15%, 50=0.03%
 lat (msec) : 100=0.01%
   cpu  : usr=36.30%, sys=10.85%, ctx=2323657, majf=0, minf=5736
    IO depths: 1=0.2%, 2=0.8%, 4=3.4%, 8=16.3%, 16=72.0%, 32=7.3%, >=64=0.0%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=94.8%, 8=1.0%, 16=1.5%, 32=2.6%, 64=0.0%, >=64=0.0%
  issued: total=r=4043164/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
READ: io=15794MB, aggrb=1321.4MB/s, minb=1321.4MB/s, maxb=1321.4MB/s, 
 mint=11953msec, maxt=11953msec


 async messenger (ms_async_op_threads=10)
 -
 ^Cbs: 15 (f=15): [r(15)] [0.0% done] [544.6MB/0KB/0KB /s] [139K/0/0 iops] 
 [eta 301d:09h:10m:03s]
 fio: terminating on signal 2

 rbd_iodepth32-test: (groupid=0, jobs=15): err= 0: pid=42935: Tue Apr 28 
 10:24:29 2015
   read : io=6389.8MB, bw=547856KB/s, iops=136963, runt= 11943msec
 slat (usec): min=7, max=23454, avg=39.33, stdev=226.05
 clat (usec): min=58, max=107304, avg=3002.03, stdev=6270.44
  lat (usec): min=91, max=107327, avg=3041.36, stdev=6279.32
 clat percentiles (usec):
  |  1.00th=[  129],  5.00th=[  177], 10.00th=[  229], 20.00th=[  334],
  | 30.00th=[  446], 40.00th=[  564], 50.00th=[  692], 60.00th=[  836],
  | 70.00th=[ 1032], 80.00th=[ 1576], 90.00th=[10816], 95.00th=[17792],
  | 99.00th=[29824], 99.50th=[34048], 99.90th=[42240], 99.95th=[45824],
  | 99.99th=[50432]
 bw (KB  /s): min=13359, max=128824, per=6.67%, avg=36544.92, 
 stdev=37000.58
 lat (usec) : 100=0.04%, 250=12.05%, 500=22.51%, 750=19.70%, 1000=14.66%
 lat (msec) : 2=12.32%, 4=2.66%, 10=5.34%, 20=6.81%, 50=3.91%
 lat (msec) : 100=0.01%, 250=0.01%
   cpu  : usr=19.03%, sys=6.33%, ctx=370760, majf=0, minf=11335
    IO depths: 1=0.4%, 2=0.9%, 4=5.3%, 8=20.2%, 16=66.0%, 32=7.3%, >=64=0.0%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=95.5%, 8=0.9%, 16=0.9%, 32=2.8%, 64=0.0%, >=64=0.0%
  issued: total=r=1635761/w=0/d=0, short=r=0/w=0/d=0, drop=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=32

 Run status group 0 (all jobs):
READ: io=6389.8MB, aggrb=547855KB/s, minb=547855KB/s, maxb=547855KB/s, 
 mint=11943msec, maxt=11943msec






-- 
Best Regards,

Wheat


Re: Re: NewStore performance analysis

2015-04-21 Thread Haomai Wang
On Tue, Apr 21, 2015 at 2:43 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
 Hi Sage,
 Well, that's:
 submit_transaction -- submit a transaction; whether it blocks
 waiting for fdatasync depends on rocksdb-disable-sync.
 submit_transaction_sync -- queue a transaction and wait until
 it is stable on disk.
 So if we default rocksdb-disable-sync to false, the two APIs are the same.
 I haven't looked at LevelDB, but I suspect it's similar.

Eh, I don't think it's the same. By default WriteOption.disableWAL is
false on the ceph side, and submit_transaction will use
WriteOption.sync=false while submit_transaction_sync will use
WriteOption.sync=true.

If sync==false, rocksdb won't sync the log file; otherwise it will call
fsync/fdatasync to flush the log file (see the sketch below).

Please correct me if that's not the case. :-)
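
For reference, a minimal sketch of that distinction against the plain RocksDB
C++ API (not the ceph KeyValueDB wrapper; the path and keys are made up):
WriteOptions::sync decides whether the WAL write is followed by
fsync/fdatasync, and WriteOptions::disableWAL would skip the log entirely.

#include <cassert>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, "/tmp/kv-sync-demo", &db);
  assert(s.ok());

  rocksdb::WriteBatch batch;
  batch.Put("onode.foo", "...");
  batch.Put("wal.seq.1", "...");

  // "submit_transaction": goes to the memtable and is appended to the WAL,
  // but the WAL file is not fsync'ed, so data can be lost on power failure.
  rocksdb::WriteOptions async_wo;
  async_wo.sync = false;        // the default
  async_wo.disableWAL = false;  // the default: still appended to the log
  s = db->Write(async_wo, &batch);
  assert(s.ok());

  // "submit_transaction_sync": same path, but fsync/fdatasync the WAL before
  // returning, so the transaction is stable on disk.
  rocksdb::WriteOptions sync_wo;
  sync_wo.sync = true;
  rocksdb::WriteBatch sync_point;  // e.g. an empty batch used as a sync point
  s = db->Write(sync_wo, &sync_point);
  assert(s.ok());

  delete db;
  return 0;
}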


 I just re-read the NewStore code, and it seems the workflow is not what we
 want. We issue a bunch of submit_transaction calls, and in _kv_sync_thread we
 try to have a checkpoint that ensures the previous transactions are
 persistent, by using submit_transaction_sync to submit an empty transaction.
 But actually:
 1. submit_transaction is already a synchronized call, so the empty
 transaction in _kv_sync_thread is kind of a waste.
 2. A sync transaction cannot ensure that previous transactions are also
 synced. The API doesn't guarantee this, and in the implementation the two
 transactions may go to different WAL files.

 Yes, if we want, we can have a queue and a thread that collect the
 transactions and merge them into one big transaction; some ::fdatasync calls
 would be saved here. But this approach looks complex.

 Some optimizations in my mind are:
 1. Batch the cleanup operations in _apply_wal_transaction: we don't need to
 remove the WAL items synchronously, we can just put them into kv_sync_thread_Q
 and let kv_sync_thread form a batch transaction that deletes a bunch of keys
 (a rough sketch of this follows below).
 2. We don't need the empty transaction in kv_sync_thread; we could call
 _txc_kv_finish_kv directly from _txc_submit_kv, since the KV write is
 synchronized.
 3. Then we can rename _kv_sync_thread to _kv_cleanup_thread to better
 describe its work.

 What do you think?

   
   Xiaoxi
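
A rough sketch of optimization 1 above, with hypothetical names (this is not a
patch against NewStore): completed WAL keys are queued and one cleanup thread
folds each pass into a single batched deletion.

#include <condition_variable>
#include <mutex>
#include <string>
#include <thread>
#include <vector>
#include <rocksdb/db.h>
#include <rocksdb/options.h>
#include <rocksdb/write_batch.h>

// Hypothetical cleanup thread: deleting completed WAL keys in one batch per
// pass costs one Write() instead of one synchronous delete per WAL item.
class WalCleanup {
 public:
  explicit WalCleanup(rocksdb::DB* db) : db_(db), thread_([this] { run(); }) {}
  ~WalCleanup() {
    { std::lock_guard<std::mutex> l(m_); stop_ = true; }
    cv_.notify_one();
    thread_.join();
  }

  // Called from the apply path once a WAL item has been applied to the FS.
  void queue_cleanup(std::string wal_key) {
    std::lock_guard<std::mutex> l(m_);
    pending_.push_back(std::move(wal_key));
    cv_.notify_one();
  }

 private:
  void run() {
    std::unique_lock<std::mutex> l(m_);
    while (true) {
      cv_.wait(l, [this] { return stop_ || !pending_.empty(); });
      if (pending_.empty()) return;  // only reachable once stop_ is set
      std::vector<std::string> keys;
      keys.swap(pending_);
      l.unlock();
      rocksdb::WriteBatch batch;
      for (const std::string& k : keys)
        batch.Delete(k);           // one batch for the whole pass
      rocksdb::WriteOptions wo;    // no sync here; a later sync point covers it
      db_->Write(wo, &batch);
      l.lock();
    }
  }

  rocksdb::DB* db_;
  std::mutex m_;
  std::condition_variable cv_;
  std::vector<std::string> pending_;
  bool stop_ = false;
  std::thread thread_;
};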
 -Original Message-
 From: Sage Weil [mailto:sw...@redhat.com]
 Sent: Tuesday, April 21, 2015 12:48 AM
 To: Chen, Xiaoxi
 Cc: Mark Nelson; Somnath Roy; Duan, Jiangang; Zhang, Jian; ceph-devel
 Subject: Re: Re: NewStore performance analysis

 On Mon, 20 Apr 2015, Chen, Xiaoxi wrote:
   An easy way to measure might be to comment out
   db->submit_transaction(txc->t); in NewStore::_txc_submit_kv, to see if
   we can get more QD in the fragment part without issuing to the DB.
 
  I'm not sure I totally understand the interface.. my assumption is
  that queue_transaction will give rocksdb the txn to commit whenever
  it finds it convenient (no idea what policy is used there) and
  queue_transaction_sync will trigger a commit now.  If we did have
  multiple threads doing queue_transaction_sync (by, say, calling it
  directly in _txc_submit_kv) would qd go up?
 
 I think you might be missing something: currently the two interfaces are
 exactly the SAME unless you set rocksdb-disable-sync=true (which is
 false by default).

 On commit, rocksdb will write the content to both the memtable (write
 buffer) and the WAL. If the transaction does not go with sync, it will still
 commit now, but the write to the WAL will NOT be synced (by calling
 fdatasync). That means we may lose data on a power failure/kernel panic.
 This is why I changed the default rocksdb-disable-sync from true to false in
 a previous patch.

 Yeah, I'm confused.  :)

 So now 'rocksdb disable sync = false', which seems to be obviously what we 
 want for newstore.  It's different for filestore, which is doing a syncfs 
 checkpoint.  Perhaps we should have newstore set that explicitly instead of 
 passing through a config option.

 In any case, though, I'm confused by

 if the transaction doesnt go with sync, it will also commit now,but
 the write to WAL will NOT be sync(by calling fdatasync).

 What does it mean to 'commit' but not call fdatasync?  What does commit mean 
 in this case?

 And, am I correct in understanding that we have

  queue_transaction -- queue a transaction but don't block waiting for fdatasync
  queue_transaction_sync -- queue a transaction and wait until it is stable on disk

 to work with?

 Thanks!
 sage



-- 
Best Regards,

Wheat


Re: Regarding newstore performance

2015-04-17 Thread Haomai Wang
On Fri, Apr 17, 2015 at 10:08 PM, Chen, Xiaoxi xiaoxi.c...@intel.com wrote:
 I tried to split the DB/data/WAL onto 3 different SSDs; the iostat output is
 below:

 sdb is the data device, sdc is the DB, and sdd is the WAL of RocksDB.
 The IO pattern is 4KB random write (QD=8) on top of a pre-filled RBD, using
 fio with the librbd engine.

 The result looks strange:
 1. On sdb (the data part) we are expecting 4KB IOs but actually only get
 2KB (4 sectors).
 2. There is not that much data written to Level 0+, only 0.53MB/s.
 3. Note that avgqu-sz is very low compared to QD=8 in fio; it seems the
 problem is that we cannot commit the WAL fast enough.

Are you using the default IO scheduler for these SSDs? I'm not sure whether
the Linux CFQ scheduler will put fsync/fdatasync behind all in-progress
write ops. So if we always issue fsync in the rocksdb layer, will it try to
merge more fsync requests? Maybe you could move to deadline or noop?



 My code base is 6e9b2fce30cf297e60454689c6fb406b6e786889,

 avg-cpu:  %user   %nice %system %iowait  %steal   %idle
   15.770.008.872.060.00   73.30

 Device: rrqm/s   wrqm/s r/s w/srMB/swMB/s avgrq-sz 
 avgqu-sz   await r_await w_await  svctm  %util
 sda   0.0010.600.00   49.60 0.0021.56   890.39
  6.68  134.760.00  134.76   1.16   5.76
 sdb   0.00 0.000.00 1627.30 0.00 3.22 4.05
  0.110.070.000.07   0.06  10.52
 sdc   0.00 0.000.204.30 0.00 0.53   239.33
  0.001.072.001.02   0.71   0.32
 sdd   0.00   612.000.00 1829.50 0.00 9.4110.53
  0.850.460.000.46   0.46  84.68


 /dev/sdc1  156172796  2740620 153432176   2% /root/ceph-0-db
 /dev/sdd1  19526457241940 195222632   1% /root/ceph-0-db-wal
 /dev/sdb1  156172796 10519532 145653264   7% /var/lib/ceph/osd/ceph-0

 -Original Message-
 From: Mark Nelson [mailto:mnel...@redhat.com]
 Sent: Friday, April 17, 2015 8:11 PM
 To: Sage Weil
 Cc: Somnath Roy; Chen, Xiaoxi; Haomai Wang; ceph-devel
 Subject: Re: Regarding newstore performance



 On 04/16/2015 07:38 PM, Sage Weil wrote:
 On Thu, 16 Apr 2015, Mark Nelson wrote:
 On 04/16/2015 01:17 AM, Somnath Roy wrote:
 Here is the data with omap separated to another SSD and after 1000GB
 of fio writes (same profile)..

 omap writes:
 -

 Total host writes in this period = 551020111 -- ~2101 GB

 Total flash writes in this period = 1150679336

 data writes:
 ---

 Total host writes in this period = 302550388 --- ~1154 GB

 Total flash writes in this period = 600238328

 So, actual data write WA is ~1.1 but omap overhead is ~2.1 and
 adding those getting ~3.2 WA overall.

 This all suggests that getting rocksdb to not rewrite the wal entries
 at all will be the big win.  I think Xiaoxi had tunable suggestions
 for that?  I didn't grok the rocksdb terms immediately so they didn't
 make a lot of sense at the time.. this is probably a good place to
 focus, though.  The rocksdb compaction stats should help out there.

 But... today I ignored this entirely and put rocksdb in tmpfs and
 focused just on the actual wal IOs done to the fragments files after the 
 fact.
 For simplicity I focused just on 128k random writes into 4mb objects.

 fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly,
 setting
 iodepth=16 makes no different *until* I also set thinktime=10 (us, or
 almost any value really) and thinktime_blocks=16, at which point it
 goes up with the iodepth.  I'm not quite sure what is going on there
 but it seems to be preventing the elevator and/or disk from reordering
 writes and make more efficient sweeps across the disk.  In any case,
 though, with that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec 
 with qd 64.
 Similarly, with qa 1 and thinktime of 250us, it drops to like
 15mb/sec, which is basically what I was getting from newstore.  Here's
 my fio
 config:

   http://fpaste.org/212110/42923089/


 Yikes!  That is a great observation Sage!


 Conclusion: we need multiple threads (or libaio) to get lots of IOs in
 flight so that the block layer and/or disk can reorder and be efficient.
 I added a threadpool for doing wal work (newstore wal threads = 8 by
 default) and it makes a big difference.  Now I am getting more like
 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going
 up much from there as I scale threads or qd, strangely; not sure why yet.

 But... that's a big improvement over a few days ago (~8mb/sec).  And
 on this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
 winning, yay!

 I tabled the libaio patch for now since it was getting spurious EINVAL
 and would consistently SIGBUS from io_getevents() when ceph-osd did
 dlopen() on the rados plugins (weird!).

 Mark, at this point it is probably worth checking that you can
 reproduce these results?  If so, we can redo the io size

Re: Regarding newstore performance

2015-04-16 Thread Haomai Wang
On Fri, Apr 17, 2015 at 8:38 AM, Sage Weil s...@newdream.net wrote:
 On Thu, 16 Apr 2015, Mark Nelson wrote:
 On 04/16/2015 01:17 AM, Somnath Roy wrote:
  Here is the data with omap separated to another SSD and after 1000GB of fio
  writes (same profile)..
 
  omap writes:
  -
 
  Total host writes in this period = 551020111 -- ~2101 GB
 
  Total flash writes in this period = 1150679336
 
  data writes:
  ---
 
  Total host writes in this period = 302550388 --- ~1154 GB
 
  Total flash writes in this period = 600238328
 
  So, actual data write WA is ~1.1 but omap overhead is ~2.1 and adding those
  getting ~3.2 WA overall.

 This all suggests that getting rocksdb to not rewrite the wal
 entries at all will be the big win.  I think Xiaoxi had tunable
 suggestions for that?  I didn't grok the rocksdb terms immediately so
 they didn't make a lot of sense at the time.. this is probably a good
 place to focus, though.  The rocksdb compaction stats should help out
 there.

 But... today I ignored this entirely and put rocksdb in tmpfs and focused
 just on the actual wal IOs done to the fragments files after the fact.
 For simplicity I focused just on 128k random writes into 4mb objects.

 fio can get ~18 mb/sec on my disk with iodepth=1.  Interestingly, setting
 iodepth=16 makes no different *until* I also set thinktime=10 (us, or
 almost any value really) and thinktime_blocks=16, at which point it goes
 up with the iodepth.  I'm not quite sure what is going on there but it
 seems to be preventing the elevator and/or disk from reordering writes and
 make more efficient sweeps across the disk.  In any case, though, with
 that tweaked I can get up to ~30mb/sec with qd 16, ~40mb/sec with qd 64.
 Similarly, with qa 1 and thinktime of 250us, it drops to like 15mb/sec,
 which is basically what I was getting from newstore.  Here's my fio
 config:

 http://fpaste.org/212110/42923089/

 Conclusion: we need multiple threads (or libaio) to get lots of IOs in
 flight so that the block layer and/or disk can reorder and be efficient.
 I added a threadpool for doing wal work (newstore wal threads = 8 by
 default) and it makes a big difference.  Now I am getting more like
 19mb/sec w/ 4 threads and client (smalliobench) qd 16.  It's not going up
 much from there as I scale threads or qd, strangely; not sure why yet.

Do you mean this PR (https://github.com/ceph/ceph/pull/4318)? I have a
simple benchmark in the comments of that PR.


 But... that's a big improvement over a few days ago (~8mb/sec).  And on
 this drive filestore with journal on ssd gets ~8.5mb/sec.  So we're
 winning, yay!

 I tabled the libaio patch for now since it was getting spurious EINVAL and
 would consistently SIGBUS from io_getevents() when ceph-osd did dlopen()
 on the rados plugins (weird!).

 Mark, at this point it is probably worth checking that you can reproduce
 these results?  If so, we can redo the io size sweep.  I picked 8 wal
 threads since that was enough to help and going higher didn't seem to make
 much difference, but at some point we'll want to be more careful about
 picking that number.  We could also use libaio here, but I'm not sure it's
 worth it.  And this approach is somewhat orthogonal to the idea of
 efficiently passing the kernel things to fdatasync.
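
For reference, a minimal libaio sketch of the "lots of IOs in flight" idea
(file name and sizes are arbitrary; build with -laio): a batch of writes is
submitted with io_submit() and reaped with io_getevents(), so the elevator and
disk see a deep queue without needing a thread per IO.

#include <fcntl.h>
#include <libaio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <vector>

int main() {
  const int depth = 16;
  const size_t io_size = 128 * 1024;  // 128k, matching the test above

  int fd = open("/tmp/aio-demo.dat", O_WRONLY | O_CREAT | O_DIRECT, 0644);
  if (fd < 0) return 1;

  io_context_t ctx = 0;
  if (io_setup(depth, &ctx) != 0) return 1;

  void* buf = nullptr;
  if (posix_memalign(&buf, 4096, io_size) != 0) return 1;  // O_DIRECT alignment
  memset(buf, 0xab, io_size);

  // Queue 'depth' writes at once so the block layer can reorder/merge them.
  std::vector<iocb> cbs(depth);
  std::vector<iocb*> ptrs(depth);
  for (int i = 0; i < depth; ++i) {
    io_prep_pwrite(&cbs[i], fd, buf, io_size, (long long)i * io_size);
    ptrs[i] = &cbs[i];
  }
  if (io_submit(ctx, depth, ptrs.data()) != depth) return 1;

  // Reap completions; a real store would queue its WAL cleanup from here.
  std::vector<io_event> events(depth);
  int done = 0;
  while (done < depth) {
    int r = io_getevents(ctx, 1, depth - done, events.data(), nullptr);
    if (r < 0) return 1;
    done += r;
  }

  io_destroy(ctx);
  close(fd);
  free(buf);
  return 0;
}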

Agreed; this time I think we need to focus on the data store only. Maybe I'm
missing it, but what's your overlay config value in this test?


 Anyway, next up is probably wrangling rocksdb's log!

 sage



-- 
Best Regards,

Wheat


Re: Regarding newstore performance

2015-04-15 Thread Haomai Wang
On Wed, Apr 15, 2015 at 2:01 PM, Somnath Roy somnath@sandisk.com wrote:
 Hi Sage/Mark,
 I did some WA experiment with newstore with the similar settings I mentioned 
 yesterday.

 Test:
 ---

 64K Random write with 64 QD and writing total of 1 TB of data.


 Newstore:
 

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
 iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/49600KB/0KB /s] [0/775/0 iops] [eta 
 00m:00s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=42907: Tue Apr 14 00:34:23 
 2015
   write: io=1000.0GB, bw=48950KB/s, iops=764, runt=21421419msec
 slat (usec): min=43, max=9480, avg=116.45, stdev=10.99
 clat (msec): min=13, max=1331, avg=83.55, stdev=52.97
  lat (msec): min=14, max=1331, avg=83.67, stdev=52.97
 clat percentiles (msec):
  |  1.00th=[   60],  5.00th=[   68], 10.00th=[   71], 20.00th=[   74],
  | 30.00th=[   76], 40.00th=[   78], 50.00th=[   81], 60.00th=[   83],
  | 70.00th=[   86], 80.00th=[   90], 90.00th=[   94], 95.00th=[   98],
  | 99.00th=[  109], 99.50th=[  114], 99.90th=[ 1188], 99.95th=[ 1221],
  | 99.99th=[ 1270]
 bw (KB  /s): min=   62, max=101888, per=100.00%, avg=49760.84, 
 stdev=7320.03
 lat (msec) : 20=0.01%, 50=0.38%, 100=96.00%, 250=3.34%, 500=0.03%
 lat (msec) : 750=0.02%, 1000=0.03%, 2000=0.20%
   cpu  : usr=10.85%, sys=1.76%, ctx=123504191, majf=0, minf=603964
    IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=2.1%, >=64=97.9%
   submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
   complete  : 0=0.0%, 4=100.0%, 8=0.1%, 16=0.1%, 32=0.0%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=48949KB/s, minb=48949KB/s, maxb=48949KB/s, 
 mint=21421419msec, maxt=21421419msec


 So, iops getting is ~764.
 99th percentile latency should be 100ms.

 Write amplification at disk level:
 --

 SanDisk SSDs have some disk level counters that can measure number of host 
 writes with flash logical page size and number of actual flash writes with 
 the same flash logical page size. The difference between these two is the 
 actual WA causing to disk.

 Please find the data in the following xls.

 https://docs.google.com/spreadsheets/d/1vIT6PHRbLdk_IqFDc2iM_teMolDUlxcX5TLMRzdXyJE/edit?usp=sharing

 Total host writes in this period = 923896266

 Total flash writes in this period = 1465339040


 FileStore:
 -

 Fio output at the end of 1 TB write.
 ---

 rbd_iodepth32: (g=0): rw=randwrite, bs=64K-64K/64K-64K/64K-64K, ioengine=rbd, 
 iodepth=64
 fio-2.1.11-20-g9a44
 Starting 1 process
 rbd engine: RBD version: 0.1.9
 Jobs: 1 (f=1): [w(1)] [100.0% done] [0KB/164.7MB/0KB /s] [0/2625/0 iops] [eta 
 00m:01s]
 rbd_iodepth32: (groupid=0, jobs=1): err= 0: pid=44703: Tue Apr 14 18:44:00 
 2015
   write: io=1000.0GB, bw=98586KB/s, iops=1540, runt=10636117msec
 slat (usec): min=42, max=7144, avg=120.45, stdev=45.80
 clat (usec): min=942, max=3954.6K, avg=40776.42, stdev=81231.25
  lat (msec): min=1, max=3954, avg=40.90, stdev=81.23
 clat percentiles (msec):
  |  1.00th=[7],  5.00th=[   11], 10.00th=[   13], 20.00th=[   16],
  | 30.00th=[   18], 40.00th=[   20], 50.00th=[   22], 60.00th=[   25],
  | 70.00th=[   30], 80.00th=[   40], 90.00th=[   67], 95.00th=[  114],
  | 99.00th=[  433], 99.50th=[  570], 99.90th=[  914], 99.95th=[ 1090],
  | 99.99th=[ 1647]
 bw (KB  /s): min=   32, max=243072, per=100.00%, avg=103148.37, 
 stdev=63090.00
 lat (usec) : 1000=0.01%
 lat (msec) : 2=0.01%, 4=0.14%, 10=4.50%, 20=38.34%, 50=42.42%
 lat (msec) : 100=8.81%, 250=3.48%, 500=1.58%, 750=0.50%, 1000=0.14%
 lat (msec) : 2000=0.06%, >=2000=0.01%
   cpu  : usr=18.46%, sys=2.64%, ctx=63096633, majf=0, minf=923370
   IO depths: 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.6%, 32=80.4%, >=64=19.1%
  submit: 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
  complete  : 0=0.0%, 4=94.3%, 8=2.9%, 16=1.9%, 32=0.9%, 64=0.1%, >=64=0.0%
  issued: total=r=0/w=16384000/d=0, short=r=0/w=0/d=0
  latency   : target=0, window=0, percentile=100.00%, depth=64

 Run status group 0 (all jobs):
   WRITE: io=1000.0GB, aggrb=98586KB/s, minb=98586KB/s, maxb=98586KB/s, 
 mint=10636117msec, maxt=10636117msec

 Disk stats (read/write):
   sda: ios=0/251, merge=0/149, ticks=0/0, in_queue=0, util=0.00%

 So, iops here is ~1500.
 99th percentile latency should be within 50ms


 Write amplification at disk level:
 --

 Total host writes in this period = 

Re: Initial newstore vs filestore results

2015-04-08 Thread Haomai Wang
On Wed, Apr 8, 2015 at 10:58 AM, Sage Weil s...@newdream.net wrote:
 On Tue, 7 Apr 2015, Mark Nelson wrote:
 On 04/07/2015 02:16 PM, Mark Nelson wrote:
  On 04/07/2015 09:57 AM, Mark Nelson wrote:
   Hi Guys,
  
   I ran some quick tests on Sage's newstore branch.  So far given that
   this is a prototype, things are looking pretty good imho.  The 4MB
   object rados bench read/write and small read performance looks
   especially good.  Keep in mind that this is not using the SSD journals
   in any way, so 640MB/s sequential writes is actually really good
   compared to filestore without SSD journals.
  
   small write performance appears to be fairly bad, especially in the RBD
   case where it's small writes to larger objects.  I'm going to sit down
   and see if I can figure out what's going on.  It's bad enough that I
   suspect there's just something odd going on.
  
   Mark
 
  Seekwatcher/blktrace graphs of a 4 OSD cluster using newstore for those
  interested:
 
  http://nhm.ceph.com/newstore/
 
  Interestingly small object write/read performance with 4 OSDs was about
  1/3-1/4 the speed of the same cluster with 36 OSDs.
 
  Note: Thanks Dan for fixing the directory column width!
 
  Mark

 New fio/librbd results using Sage's latest code that attempts to keep small
 overwrite extents in the db.  This is 4 OSD so not directly comparable to the
 36 OSD tests above, but does include seekwatcher graphs.  Results in MB/s:

   write   readrandw   randr
 4MB   57.9319.6   55.2285.9
 128KB 2.5 230.6   2.4 125.4
 4KB   0.4655.65   1.113.56

 What would be very interesting would be to see the 4KB performance
 with the defaults (newstore overlay max = 32) vs overlays disabled
 (newstore overlay max = 0) and see if/how much it is helping.

 The latest branch also has open-by-handle.  It's on by default (newstore
 open by handle = true).  I think for most workloads it won't be very
 noticeable... I think there are two questions we need to answer though:

 1) Does it have any impact on a creation workload (say, 4kb objects).  It
 shouldn't, but we should confirm.

 2) Does it impact small object random reads with a cold cache.  I think to
 see the effect we'll probably need to pile a ton of objects into the
 store, drop caches, and then do random reads.  In the best case the
 effect will be small, but hopefully noticeable: we should go from
 a directory lookup (1+ seeks) + inode lookup (1+ seek) + data
 read, to inode lookup (1+ seek) + data read.  So, 3 - 2 seeks best case?
 I'm not really sure what XFS is doing under the covers here..

WOW, it's really a cool implementation, beyond what I originally had in mind
from the blueprint. Handler, overlay_map and data_map look very flexible and
make small IO cheaper in theory. Right now we only have one element in
data_map and I'm not sure what your goal is for its future usage, although I
have a vague idea that it could enhance the role of NewStore and turn the
local filesystem into little more than a block space allocator. If NewStore
owned a kind of FTL (File Translation Layer), many cool features could be
added. What's your idea about data_map?

My current concern is still the WAL after fsync and kv committing. Maybe the
fsync path is fine because we mostly won't hit that case with rbd, but
submitting a sync kv transaction isn't a low-latency job, I think; maybe we
could run the WAL in parallel with the kv commit? (Yes, I really do care
about the latency of a single op :-) )

Then, for the actual rados write op, it will add setattr and omap_setkeys
ops. Current NewStore looks like it plays badly for setattr: it always
encodes the whole set of xattrs (and other not-so-tiny fields) and writes it
again (is this true?), although it can batch the onode writes of multiple
transactions over a short window.

NewStore also puts much more load on the KeyValueDB compared to FileStore, so
maybe we need to reconsider how rich the workload is. FileStore uses leveldb
mainly for a write workload, which leveldb fits well, but now the overlay
keys (read) and onode (read) will become a main latency source in normal IO,
I think. The default kvdbs like leveldb and rocksdb both perform poorly for
random read workloads, which may be a problem; looking for another kv db may
be an option.

And it still doesn't add journal code for the WAL?

Anyway, NewStore should cover more workloads compared to FileStore. Good job!


 sage



-- 
Best Regards,

Wheat


Re: Regarding ceph rbd write path

2015-04-07 Thread Haomai Wang
On Sat, Apr 4, 2015 at 1:20 PM, Somnath Roy somnath@sandisk.com wrote:
 Haomai,
 Yeah, I thought so, but I didn't know much about that implementation. Good
 to know that it takes care of that.
 But the krbd path will still be suboptimal then.
 If we can do something in the OSD layer, we may be able to additionally
 coalesce multiple writes within a PG into a single transaction (of course we
 need to maintain order). The benefit could be a single omap attribute update
 for multiple object writes within a PG.
 Maybe I should come up with a prototype if you guys don't foresee any
 problem.

I'm not sure, but I don't think there is a simple way to implement effective
coalescing of multiple transactions this way.

As for the extra metadata, we already have an in-progress PR to reduce it as
far as possible.

Anyway, maybe some smart ideas can be applied to this problem; I look forward to it.


 Thanks & Regards
 Somnath

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Friday, April 03, 2015 9:47 PM
 To: Somnath Roy
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: Regarding ceph rbd write path

 On Sat, Apr 4, 2015 at 8:30 AM, Somnath Roy somnath@sandisk.com wrote:
 In fact, we can probably do it from the OSD side like this.

 1. A thread in the sharded opWq takes the ops within a PG by acquiring the
 lock in the pg_for_processing data structure.

 2. Now, before taking the job, it can do a bit of processing to look for
 transactions on the same object queued so far and coalesce them into a
 single job (a rough sketch of this follows below).

 Let me know if I am missing anything.
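
A hypothetical sketch of what step 2 could look like (none of these are the
real OSD types; it only shows the merging rule): while draining the per-PG
queue, fold consecutive ops that target the same object into one combined
transaction, which preserves ordering because only adjacent ops are merged.

#include <deque>
#include <string>
#include <vector>

// Stand-ins for the real types; only the coalescing rule matters here.
struct PendingOp {
  std::string object_id;
  std::vector<std::string> tx_ops;  // opaque transaction pieces
};

// Merge runs of consecutive ops on the same object into one op, keeping the
// original order. Non-adjacent ops are never merged, so ordering guarantees
// across objects in the PG are preserved.
std::deque<PendingOp> coalesce_same_object(std::deque<PendingOp> queue) {
  std::deque<PendingOp> out;
  for (PendingOp& op : queue) {
    if (!out.empty() && out.back().object_id == op.object_id) {
      std::vector<std::string>& dst = out.back().tx_ops;
      dst.insert(dst.end(), op.tx_ops.begin(), op.tx_ops.end());
    } else {
      out.push_back(std::move(op));
    }
  }
  return out;
}

For a purely random workload adjacent ops rarely hit the same object, so the
win from this kind of merging would mostly show up for sequential writes.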

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
 Sent: Friday, April 03, 2015 5:17 PM
 To: ceph-devel@vger.kernel.org
 Subject: Regarding ceph rbd write path


 Hi Sage/Sam,
 Here is my understanding on ceph rbd write path.

 1. Based on the image order, rbd will decide the rados object size, say 4MB.

 2. Now, from the application, say 64K chunks are being written to the rbd image.

 3. rbd will calculate the object ids (one of the 4MB objects) and start
 populating the 4MB objects with 64K chunks (a rough sketch of this mapping
 follows below).

 4. Now, for each of these 64K chunks the OSD will write 2 setattrs and the
 OMAP attrs.

 If the above flow is correct, it is updating the same metadata for every 64K
 chunk written to the same object (and same pg). So, my question is, is there
 any way to optimize (coalesce) that in either the rbd or osd layer?
 I couldn't find any way in the osd layer as it holds the pg lock until a
 transaction completes. But is there any way on the rbd side so that it can
 intelligently stage/coalesce the writes for the same object and do a batch
 commit?
 This should definitely improve WA/performance for sequential writes, maybe not
 so much for random though.
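
To make steps 1-3 concrete, here is a small sketch of the mapping for a plain
(non-fancy-striped) image; the names are illustrative, not librbd code. With
order=22 the object size is 4MB, so an image offset lands in object
offset >> order at offset & (object_size - 1):

#include <cstdint>
#include <cstdio>

struct ObjectExtent {
  uint64_t object_no;   // which 4MB rados object
  uint64_t offset;      // offset inside that object
  uint64_t length;
};

// Map a write at image offset 'off' of length 'len' (assumed not to cross an
// object boundary here) onto a rados object, for object size 1 << order.
ObjectExtent map_to_object(uint64_t off, uint64_t len, unsigned order = 22) {
  const uint64_t object_size = 1ULL << order;       // 4MB for order 22
  return ObjectExtent{off >> order, off & (object_size - 1), len};
}

int main() {
  // 64K sequential writes: the first 64 chunks all hit object 0, so the same
  // object metadata (xattrs/omap) is rewritten 64 times before moving on.
  for (uint64_t off = 0; off < (8ULL << 20); off += 64 * 1024) {
    ObjectExtent e = map_to_object(off, 64 * 1024);
    std::printf("image off %8llu -> object %llu + %7llu\n",
                (unsigned long long)off,
                (unsigned long long)e.object_no,
                (unsigned long long)e.offset);
  }
  return 0;
}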

Have you considered using RBDCache? It could cover most of the cases you
mention, I think.


 Let me know your opinion on this.

 Thanks & Regards
 Somnath

 





 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: Regarding ceph rbd write path

2015-04-03 Thread Haomai Wang
On Sat, Apr 4, 2015 at 8:30 AM, Somnath Roy somnath@sandisk.com wrote:
 In fact, we can probably do it from the OSD side like this.

 1. A thread in the sharded opWq is taking the ops within a pg by acquiring 
 lock in the pg_for_processing data structure.

 2. Now, before taking the job, it can do a bit processing to look for the 
 same object transaction in the map till that time and coalesce that to a 
 single job.

 Let me know if I am missing anything.

 Thanks & Regards
 Somnath

 -Original Message-
 From: ceph-devel-ow...@vger.kernel.org 
 [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Somnath Roy
 Sent: Friday, April 03, 2015 5:17 PM
 To: ceph-devel@vger.kernel.org
 Subject: Regarding ceph rbd write path


 Hi Sage/Sam,
 Here is my understanding on ceph rbd write path.

 1. Based on the image order rbd will decide the rados object size, say 4MB.

 2. Now, from application say 64K chunks are being written to the rbd image.

 3. rbd will calculate the objectids (one of the 4MB objects) and start 
 populating the 4MB objects with 64K chunks.

 4. Now, for each of this 64K chunk OSD will write 2 setattrs and the OMAP 
 attrs.

 If the above flow is correct, it is updating the same metadata for every 64K 
 chunk write to the same object (and same pg). So, my question is, is there 
 any way to optimize (coalesce) that in either rbd/osd layer ?
 I couldn't find any way in the osd layer as it is holding pg-lock till a 
 transaction complete. But, is there any way in the rbd side so that it can 
 intelligently stage/coalesce the writes for the same object and do a batch 
 commit?
 This should definitely improve WA/performance for seq writes, may not be much 
 for random though.

Have you considered using RBDCache? It could cover most of the cases you mention, I think.


 Let me know your opinion on this.

 Thanks & Regards
 Somnath

 





-- 
Best Regards,

Wheat


Re: keyvaluestore speed up?

2015-03-19 Thread Haomai Wang
On Thu, Mar 19, 2015 at 5:22 PM, Xinze Chi xmdx...@gmail.com wrote:
 hi, all:

 Currently in keyvaluestore, the osd sends a sync
 request (submit_transaction_sync) to the filestore when it finishes a
 transaction. But a SATA disk is not suitable for sync requests; an SSD
 is more suitable.

I think here you mean the disk, not the filestore.


 So I wonder whether we could separate the leveldb *.log file from the *.sst
 files and move the *.log to an SSD,
 which is similar to the journal file in ceph.

 But right now, the original leveldb does not support separating the log file
 from the sst files (see the sketch below for how RocksDB handles this).

 Wait for your comment.

 Thanks.
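
As far as I know plain leveldb indeed has no such option, but RocksDB already
allows putting the write-ahead log on a separate device via Options::wal_dir;
a minimal sketch (the paths are just examples):

#include <cassert>
#include <string>
#include <rocksdb/db.h>
#include <rocksdb/options.h>

int main() {
  rocksdb::Options opts;
  opts.create_if_missing = true;
  // SST files stay on the (SATA) data device...
  const std::string db_path = "/var/lib/ceph/osd/ceph-0/kvstore";
  // ...while the WAL (*.log) goes to an SSD partition, similar to the journal.
  opts.wal_dir = "/ssd/ceph-0-wal";

  rocksdb::DB* db = nullptr;
  rocksdb::Status s = rocksdb::DB::Open(opts, db_path, &db);
  assert(s.ok());
  delete db;
  return 0;
}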



-- 
Best Regards,

Wheat


Re: crc error when decode_message?

2015-03-16 Thread Haomai Wang
On Mon, Mar 16, 2015 at 10:04 PM, Xinze Chi xmdx...@gmail.com wrote:
 How to process the write request in primary?

 Thanks.

 2015-03-16 22:01 GMT+08:00 Haomai Wang haomaiw...@gmail.com:
 AFAR Pipe and AsyncConnection both will mark self fault and shutdown
 socket and peer will detect this reset. So each side has chance to
 rebuild the session.

 On Mon, Mar 16, 2015 at 9:19 PM, Xinze Chi xmdx...@gmail.com wrote:
 Such as, Client send write request to osd.0 (primary), osd.0 send
 MOSDSubOp to osd.1 and osd.2

 osd.1 send reply to osd.0 (primary), but accident happened:

 1. decode_message crc error when decode reply msg
 or
 2. the reply msg is lost when send to osd.0, so osd.0 do not receive replay 
 msg

 Could anyone tell me what is the behavior if osd.0 (primary)?


osd.0 and osd.1 will both try to reconnect to the peer side, and the lost
message will be resent to osd.0 from osd.1.

 Thanks

 2015-03-16 20:02 GMT+08:00 Xinze Chi xmdx...@gmail.com:
 hi, all:

   I want to know what is the behavior of primary when
 decode_message crc error , such as read

 ack response message from remote peer?

   Thanks.



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat


Re: crc error when decode_message?

2015-03-16 Thread Haomai Wang
AFAIR, Pipe and AsyncConnection will both mark themselves faulted and shut
down the socket, and the peer will detect this reset. So each side has a
chance to rebuild the session (a generic illustration is sketched below).
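
A generic illustration of the failure path (not the actual Pipe/AsyncConnection
code; crc32() from zlib stands in for ceph's crc32c): the receiver recomputes
the checksum over the received payload and, on mismatch, treats the connection
as faulted so the session is rebuilt and the un-acked message can be resent.

#include <cstdint>
#include <stdexcept>
#include <string>
#include <zlib.h>  // crc32()

struct WireMessage {
  std::string payload;
  uint32_t crc;  // checksum the sender computed
};

// Returns the payload if the checksum matches; otherwise signals a fault so
// the caller can shut down the socket and let the session be rebuilt.
std::string decode_message(const WireMessage& m) {
  uint32_t actual = crc32(0L, reinterpret_cast<const Bytef*>(m.payload.data()),
                          static_cast<uInt>(m.payload.size()));
  if (actual != m.crc)
    throw std::runtime_error("bad crc in data, marking connection faulted");
  return m.payload;
}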

On Mon, Mar 16, 2015 at 9:19 PM, Xinze Chi xmdx...@gmail.com wrote:
 Such as, Client send write request to osd.0 (primary), osd.0 send
 MOSDSubOp to osd.1 and osd.2

 osd.1 send reply to osd.0 (primary), but accident happened:

 1. decode_message crc error when decode reply msg
 or
 2. the reply msg is lost when send to osd.0, so osd.0 do not receive replay 
 msg

 Could anyone tell me what is the behavior if osd.0 (primary)?

 Thanks

 2015-03-16 20:02 GMT+08:00 Xinze Chi xmdx...@gmail.com:
 hi, all:

   I want to know what is the behavior of primary when
 decode_message crc error , such as read

 ack response message from remote peer?

   Thanks.



-- 
Best Regards,

Wheat


Re: About _setattr() optimazation and recovery accelerate

2015-03-08 Thread Haomai Wang
On Mon, Mar 9, 2015 at 1:26 PM, Nicheal zay11...@gmail.com wrote:
 2015-03-07 16:43 GMT+08:00 Haomai Wang haomaiw...@gmail.com:
 On Sat, Mar 7, 2015 at 12:03 AM, Sage Weil sw...@redhat.com wrote:
 Hi!

 [copying ceph-devel]

 On Fri, 6 Mar 2015, Nicheal wrote:
 Hi Sage,

 Cool for issue #3878, Duplicated pg_log write, which is post early in
 my issue #3244 and Single omap_setkeys transaction improve the
 performance in FileStore as in my previous testing (most of time cost
 in FileStore is in the transaction omap_setkeys).

 I can't find #3244?

 I think it's https://github.com/ceph/ceph/pull/3244

 Yeah, exactly it is.

 Well, I think another performance issue is to the strategy of setattrs.
 Here is some kernel log achieve from xfs behavious.
 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 1
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
 di_anextents=0, di_forkoff=239

 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 2
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=1,
 di_anextents=1, di_forkoff=239

 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 2
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
 di_anextents=1, di_forkoff=239

 typedef enum xfs_dinode_fmt {
 XFS_DINODE_FMT_DEV, /* xfs_dev_t */
 XFS_DINODE_FMT_LOCAL, /* bulk data */
 XFS_DINODE_FMT_EXTENTS, /* struct xfs_bmbt_rec */
 XFS_DINODE_FMT_BTREE, /* struct xfs_bmdr_block */
 XFS_DINODE_FMT_UUID /* uuid_t */
 } xfs_dinode_fmt_t;

 while attr forks data = 2 means XFS_DINODE_FMT_EXTENTS (xattr is
 stored in extent format), while attr forks data =1 means
 XFS_DINODE_FMT_LOCAL(xattr is stored as inline attribute).

 However, in most cases, xattr attribute is stored in extent, not
 inline. Please note that, I have already formatted the partition with
 -i size=2048.  when the number of xattrs is larger than 10, it uses
 XFS_DINODE_FMT_BTREE to accelerate key searching.

 Did you by chance look at what size the typical xattrs are?  I expected
 that the usual _ and snapset attrs would be small enough to fit inline..
 but if they're not then we should at a minimum adjust our recommendation
 on xfs inode size.

 So, in _setattr(), we may just get xattr_key by using chain_flistxattr
 instead of  _fgetattrs, which retrieve (key, value) pair, as value is
 exactly no use here. and furthermore, we may consider the strategies
 that we need move spill_out xattr to omap, while xfs only restricts
 that each xattr value  64K and each xattr key  255byte.  And
 duplicated read for XATTR_SPILL_OUT_NAME also occurs in:
 r = chain_fgetxattr(**fd, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
 r = _fgetattrs(**fd, inline_set);
 And I try to ignore the _fgetattrs() logic and just update xattr
 update in _setattr(), my ssd cluster will be improved about 2% - 3%
 performance.

 I'm not quite following... do you have a patch we can look at?

 I think he means that we can use a minimal set of xattr attrs and avoid
 xattr chains by using omap instead.

 Yes, make a basic assumption: for example, we only allow user.ceph._
 and user.ceph.snapset as xattr attrs. Then we may simplify the logic a
 lot. Actually, the purpose of implementing an automatic decision to
 redirect xattrs into omap is to serve cephfs, which may store
 user-defined xattrs. For the rbd case there is no such problem, since it is
 just two xattr attrs (user.ceph._ and user.ceph.snapset), and for an ec pool
 one more for the hash, which is predictable. Furthermore, I would prefer to
 stop recording user.ceph.snapset when there is too much fragmentation; there
 is a huge performance penalty when user.ceph.snapset is large.
 Since both the EXTENTS and BTREE layouts are remote xattrs, not inline
 xattrs, in xfs, I think using omap will not cause much of a performance
 penalty, especially for an HDD-based FileStore.

Yeah, I agree with this. So this needs a little dive into XFS internals if
we want to handle these xattrs better. If xfs could export this, or the
boundary of the xattr type (btree, inline or list), that would be great.

So do you have any details about typical client xattr sizes and how to
optimize FileStore's xattr decision? In other words, does it make sense for
FileStore to be aware of the XFS xattr layout online, or at least when
initializing FileStore, so we can decide the right way to store it? (A
minimal sketch of listing only the xattr names follows below.)
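
A minimal sketch of the "names only" idea mentioned earlier in the thread,
using the plain Linux xattr syscalls rather than ceph's chain_* wrappers:
flistxattr() returns just the key names, so no values have to be read to
decide what exists.

#include <string>
#include <sys/xattr.h>
#include <vector>

// List only the xattr *names* on an open file descriptor; no values are read.
std::vector<std::string> list_xattr_names(int fd) {
  std::vector<std::string> names;
  ssize_t len = flistxattr(fd, nullptr, 0);      // ask for the required size
  if (len <= 0)
    return names;
  std::vector<char> buf(len);
  len = flistxattr(fd, buf.data(), buf.size());  // NUL-separated name list
  for (ssize_t off = 0; off < len; ) {
    std::string name(&buf[off]);
    off += static_cast<ssize_t>(name.size()) + 1;
    names.push_back(std::move(name));
  }
  return names;
}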


 Another issue about an idea of recovery is showed in
 https://github.com/ceph/ceph/pull/3837
 Can you give some suggestion about that?

 I think this direction has a lot of potential, although it will add a fair
 bit of complexity.

 I think you can avoid the truncate field and infer that from the dirtied
 interval and the new object size.  Need to look at the patch more closely
 still, though...
 Uh, Yeah

Re: About _setattr() optimazation and recovery accelerate

2015-03-07 Thread Haomai Wang
On Sat, Mar 7, 2015 at 12:03 AM, Sage Weil sw...@redhat.com wrote:
 Hi!

 [copying ceph-devel]

 On Fri, 6 Mar 2015, Nicheal wrote:
 Hi Sage,

 Cool for issue #3878, Duplicated pg_log write, which is post early in
 my issue #3244 and Single omap_setkeys transaction improve the
 performance in FileStore as in my previous testing (most of time cost
 in FileStore is in the transaction omap_setkeys).

 I can't find #3244?

I think it's https://github.com/ceph/ceph/pull/3244


 Well, I think another performance issue is to the strategy of setattrs.
 Here is some kernel log achieve from xfs behavious.
 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 1
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
 di_anextents=0, di_forkoff=239

 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 2
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=1,
 di_anextents=1, di_forkoff=239

 Mar  6 17:19:37 ceph2 kernel: start_xfs_attr_set_int: name =
 ceph._(6), value =.259)
 Mar  6 17:19:37 ceph2 kernel: format of di_c data: 2, format of attr
 forks data: 2
 Mar  6 17:19:37 ceph2 kernel: di_extsize=0, di_nextents=0,
 di_anextents=1, di_forkoff=239

 typedef enum xfs_dinode_fmt {
 XFS_DINODE_FMT_DEV, /* xfs_dev_t */
 XFS_DINODE_FMT_LOCAL, /* bulk data */
 XFS_DINODE_FMT_EXTENTS, /* struct xfs_bmbt_rec */
 XFS_DINODE_FMT_BTREE, /* struct xfs_bmdr_block */
 XFS_DINODE_FMT_UUID /* uuid_t */
 } xfs_dinode_fmt_t;

 while attr forks data = 2 means XFS_DINODE_FMT_EXTENTS (xattr is
 stored in extent format), while attr forks data =1 means
 XFS_DINODE_FMT_LOCAL(xattr is stored as inline attribute).

 However, in most cases, xattr attribute is stored in extent, not
 inline. Please note that, I have already formatted the partition with
 -i size=2048.  when the number of xattrs is larger than 10, it uses
 XFS_DINODE_FMT_BTREE to accelerate key searching.

 Did you by chance look at what size the typical xattrs are?  I expected
 that the usual _ and snapset attrs would be small enough to fit inline..
 but if they're not then we should at a minimum adjust our recommendation
 on xfs inode size.

 So, in _setattr(), we may just get xattr_key by using chain_flistxattr
 instead of  _fgetattrs, which retrieve (key, value) pair, as value is
 exactly no use here. and furthermore, we may consider the strategies
 that we need move spill_out xattr to omap, while xfs only restricts
 that each xattr value  64K and each xattr key  255byte.  And
 duplicated read for XATTR_SPILL_OUT_NAME also occurs in:
 r = chain_fgetxattr(**fd, XATTR_SPILL_OUT_NAME, buf, sizeof(buf));
 r = _fgetattrs(**fd, inline_set);
 And I try to ignore the _fgetattrs() logic and just update xattr
 update in _setattr(), my ssd cluster will be improved about 2% - 3%
 performance.

 I'm not quite following... do you have a patch we can look at?

I think he means that we can use a minimal set of xattr attrs and avoid
xattr chains by using omap instead.


 Another issue about an idea of recovery is showed in
 https://github.com/ceph/ceph/pull/3837
 Can you give some suggestion about that?

 I think this direction has a lot of potential, although it will add a fair
 bit of complexity.

 I think you can avoid the truncate field and infer that from the dirtied
 interval and the new object size.  Need to look at the patch more closely
 still, though...

For xattr and omap optimization I am mostly counting on this PR:
https://github.com/ceph/ceph/pull/2972



 sage




-- 
Best Regards,

Wheat


Re: [Manila] Ceph native driver for manila

2015-02-26 Thread Haomai Wang
On Fri, Feb 27, 2015 at 1:19 PM, Sage Weil sw...@redhat.com wrote:
 On Fri, 27 Feb 2015, Haomai Wang wrote:
  Anyway, this leads to a few questions:
 
   - Who is interested in using Manila to attach CephFS to guest VMs?

 Yeah, actually we are doing this
 (https://www.openstack.org/vote-vancouver/Presentation/best-practice-of-ceph-in-public-cloud-cinder-manila-and-trove-all-with-one-ceph-storage).

 Hmm, the link seems to redirect and is useless :-(

   - What use cases are you interested?

 We uses Manila + OpenStack for our NAS service

   - How important is security in your environment?

 Very important; we need to provide QoS and network isolation (private
 network support).

 Now we use the default Manila driver: attach an rbd image to a service VM and
 have this service VM export an NFS endpoint.

 Next, as we showed in the presentation, we will use the qemu driver to
 directly pass through filesystem commands instead of block commands. So the
 host can directly access cephfs safely and network isolation can be
 ensured. It keeps a clear separation between the internal network (or storage
 network) and the virtual network.

 Is this using the qemu virtfs/9p server and 9p in the guest?  With a
 cephfs kernel mount on the host?  How reliable have you found it to be?

 That brings us to 4 options:

 1) default driver: map rbd to manila VM, export NFS
 2) ganesha driver: reexport cephfs as NFS
 3) native ceph driver: let guest mount cephfs directly
 4) mount cephfs on host, guest access via virtfs

 I think in all but #3 you get decent security isolation between tenants as
 long as you trust KVM and/or ganesha to enforce permissions.  In #3 we
 need to enforce that in CephFS (and have some work to do).

 I like #3 because it promises the best performance and shines the light on
 the multitenancy gaps we have now, and I have this feeling that
 multitenant security isn't a huge issue for a lot of users, but.. that's
 why I'm asking!

For us, the main problem with #3 is that the guest VM needs to access the ceph
cluster via the *virtual network*. With private networks supported, in theory
the guest VM can't access the internal network, especially the storage
network. Maybe we can make the manila service VM special and let it pass
through via NAT or other means, but that still has a network impact which may
hurt IO performance.

Actually, #4 may get better performance because it passes IO through the qemu
queue ring, while #3 needs to translate guest VM network bandwidth from the
virtual network to the internal network.

Of course, maybe other users don't need to consider this case :-)


 sage



-- 
Best Regards,

Wheat


Re: FileStore performance: coalescing operations

2015-02-26 Thread Haomai Wang
Hmm, we have already observed these duplicate omap key sets from pg_log operations.

And I think we need to resolve it in the upper layer; of course,
coalescing omap operations in FileStore is also useful (a rough sketch below).

@Somnath, do you already do this dedup work in KeyValueStore?
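
A rough sketch of what the coalescing could look like at the transaction
level, with hypothetical stand-in types (this is not the prototype attached to
Andreas's mail): adjacent OP_OMAP_SETKEYS on the same object collapse into one
key/value map, with the later value for a duplicate key winning; the net
effect is the same but with fewer kv submissions, which is exactly what the
duplicated pg_log keys would benefit from.

#include <map>
#include <string>
#include <vector>

// Stand-in for one OP_OMAP_SETKEYS entry in a transaction.
struct OmapSetKeys {
  std::string object_id;
  std::map<std::string, std::string> keys;
};

// Collapse consecutive OP_OMAP_SETKEYS on the same object into one op.
// Later values overwrite earlier ones for the same key (same net effect,
// fewer leveldb/rocksdb submissions).
std::vector<OmapSetKeys> coalesce_omap_setkeys(const std::vector<OmapSetKeys>& ops) {
  std::vector<OmapSetKeys> out;
  for (const OmapSetKeys& op : ops) {
    if (!out.empty() && out.back().object_id == op.object_id) {
      for (const auto& kv : op.keys)
        out.back().keys[kv.first] = kv.second;  // last write wins
    } else {
      out.push_back(op);
    }
  }
  return out;
}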

On Thu, Feb 26, 2015 at 10:28 PM, Andreas Bluemle
andreas.blue...@itxperts.de wrote:
 Hi,

 during the performance weekly meeting, I had mentioned
 my experiences concerning the transaction structure
 for write requests at the level of the FileStore.
 Such a transaction not only contains the OP_WRITE
 operation to the object in the file system, but also
 a series of OP_OMAP_SETKEYS and OP_SETATTR operations.

 Find attached a README and source code patch, which
 describe a prototype for coalescing the OP_OMAP_SETKEYS
 operations and the performance impact of this change.

 Regards

 Andreas Bluemle

 --
 Andreas Bluemle mailto:andreas.blue...@itxperts.de
 ITXperts GmbH   http://www.itxperts.de
 Balanstrasse 73, Geb. 08Phone: (+49) 89 89044917
 D-81541 Muenchen (Germany)  Fax:   (+49) 89 89044910

 Company details: http://www.itxperts.de/imprint.htm



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [Manila] Ceph native driver for manila

2015-02-26 Thread Haomai Wang
On Fri, Feb 27, 2015 at 8:01 AM, Sage Weil sw...@redhat.com wrote:
 Hi everyone,

 The online Ceph Developer Summit is next week[1] and among other things
 we'll be talking about how to support CephFS in Manila.  At a high level,
 there are basically two paths:

 1) Ganesha + the CephFS FSAL driver

  - This will just use the existing ganesha driver without modifications.
 Ganesha will need to be configured with the CephFS FSAL instead of
 GlusterFS or whatever else you might use.
  - All traffic will pass through the NFS VM, providing network isolation

 No real work needed here aside from testing and QA.

 2) Native CephFS driver

 As I currently understand it,

  - The driver will set up CephFS auth credentials so that the guest VM can
 mount CephFS directly
  - The guest VM will need access to the Ceph network.  That makes this
 mainly interesting for private clouds and trusted environments.
  - The guest is responsible for running 'mount -t ceph ...'.
  - I'm not sure how we provide the auth credential to the user/guest...

 This would perform better than an NFS gateway, but there are several gaps
 on the security side that make this unusable currently in an untrusted
 environment:

  - The CephFS MDS auth credentials currently are _very_ basic.  As in,
 binary: can this host mount or it cannot.  We have the auth cap string
 parsing in place to restrict to a subdirectory (e.g., this tenant can only
 mount /tenants/foo), but the MDS does not enforce this yet.  [medium
 project to add that]

  - The same credential could be used directly via librados to access the
 data pool directly, regardless of what the MDS has to say about the
 namespace.  There are two ways around this:

1- Give each tenant a separate rados pool.  This works today.  You'd
 set a directory policy that puts all files created in that subdirectory in
 that tenant's pool, then only let the client access those rados pools.

  1a- We currently lack an MDS auth capability that restricts which
 clients get to change that policy.  [small project]

2- Extend the MDS file layouts to use the rados namespaces so that
 users can be separated within the same rados pool.  [Medium project]

3- Something fancy with MDS-generated capabilities specifying which
 rados objects clients get to read.  This probably falls in the category of
 research, although there are some papers we've seen that look promising.
 [big project]

 Anyway, this leads to a few questions:

  - Who is interested in using Manila to attach CephFS to guest VMs?

Yeah, actually we are doing this
(https://www.openstack.org/vote-vancouver/Presentation/best-practice-of-ceph-in-public-cloud-cinder-manila-and-trove-all-with-one-ceph-storage).

Hmm, the link seems to redirect somewhere and isn't useful :-(

  - What use cases are you interested?

We use Manila + OpenStack for our NAS service.

  - How important is security in your environment?

Very important; we need to provide QoS and network isolation (private
network support).

Now we use the default Manila driver: we attach an RBD image to a service VM
and this service VM exports an NFS endpoint.

Next, as we showed in the presentation, we will use a qemu driver to pass
filesystem commands through directly instead of block commands, so the
host can access CephFS directly and safely and network isolation can be
ensured. It keeps a clear separation between the internal network (or
storage network) and the virtual network.


 Thanks!
 sage


 [1] http://ceph.com/community/ceph-developer-summit-infernalis/
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About in_seq, out_seq in Messenger

2015-02-25 Thread Haomai Wang
https://github.com/ceph/ceph/pull/3797

The assert failure is hard to reproduce with the existing test suites
because it only happens with the lossless_peer_reuse policy, and it's hard
to simulate the reproduction steps from the upper layer.

But we can easily reproduce this when stress-testing the Messenger: I
added an error-injecting stress test for the lossless_peer_reuse policy,
and it reproduces the issue easily.

On Wed, Feb 25, 2015 at 2:27 AM, Gregory Farnum gfar...@redhat.com wrote:

 On Feb 24, 2015, at 7:18 AM, Haomai Wang haomaiw...@gmail.com wrote:

 On Tue, Feb 24, 2015 at 12:04 AM, Greg Farnum gfar...@redhat.com wrote:
 On Feb 12, 2015, at 9:17 PM, Haomai Wang haomaiw...@gmail.com wrote:

 On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum gfar...@redhat.com wrote:
 Sorry for the delayed response.

 On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:

 Hmm, I got it.

 There exists another problem I'm not sure whether captured by upper 
 layer:

 two monitor node(A,B) connected with lossless_peer_reuse policy,
 1. lots of messages has been transmitted
 2. markdown A

 I don’t think monitors ever mark each other down?

 3. restart A and call send_message(message will be in out_q)

 oh, maybe you just mean rebooting it, not an interface thing, okay...

 4. network error injected and A failed to build a *session* with B
 5. because of policy and in_queue() == true, we will reconnect in 
 writer()
 6. connect_seq++ and try to reconnect

 I think you’re wrong about this step. The messenger won’t increment 
 connect_seq directly in writer() because it will be in STATE_CONNECTING, 
 so it just calls connect() directly.
 connect() doesn’t increment the connect_seq unless it successfully 
 finishes a session negotiation.

 Hmm, sorry. I checked the log again. Actually A doesn't have any message
 in the queue, so it will enter the standby state and increase connect_seq. It
 will not be in *STATE_CONNECTING*.


 Urgh, that case does seem broken, yes. I take it this is something you’ve 
 actually run across?

 It looks like that connect_seq++ was added by 
 https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25.
  Which makes me think we might just be able to increment the connect_seq 
 appropriately in the connect() function if we need to do so (on 
 replacement, I assume). Would you like to look at that and how this change 
 might impact the peer with regards to the referenced assert failure?

 -A very slow-to-reply and apologetic Greg

 Thanks to Greg!

 I looked at 
 commit(https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25#diff-500cebf3665775f2e77db1ff88255b9bR1553)
 and think it's useless now.

 When accepting and connect.connect_seq == existing->connect_seq,
 existing->state may be STATE_OPEN, STATE_STANDBY or STATE_CONNECTING.
 This commit only fixes part of the problem and wants to assert
 (existing->state == STATE_CONNECTING). So later we added code to
 catch (existing->state == STATE_OPEN || existing->state ==
 STATE_STANDBY) before asserting.

 So from my point of view, this commit is unnecessary now and we can drop it.

 @sage, what do you think? Any other corner case this commit considered?

 Yeah, that looks right to me. Are you seeing this error frequently enough to 
 validate it in testing? We can run it through our suite as well of course, 
 but the race is very narrow and I don’t think our current tests capture it 
 anyway. :/
 Maybe even a PR to enable testing on this kind of connection network failure 
 race? :)
 -Greg




-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About in_seq, out_seq in Messenger

2015-02-24 Thread Haomai Wang
On Tue, Feb 24, 2015 at 12:04 AM, Greg Farnum gfar...@redhat.com wrote:
 On Feb 12, 2015, at 9:17 PM, Haomai Wang haomaiw...@gmail.com wrote:

 On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum gfar...@redhat.com wrote:
 Sorry for the delayed response.

 On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:

 Hmm, I got it.

 There exists another problem I'm not sure whether captured by upper layer:

 two monitor node(A,B) connected with lossless_peer_reuse policy,
 1. lots of messages has been transmitted
 2. markdown A

 I don’t think monitors ever mark each other down?

 3. restart A and call send_message(message will be in out_q)

 oh, maybe you just mean rebooting it, not an interface thing, okay...

 4. network error injected and A failed to build a *session* with B
 5. because of policy and in_queue() == true, we will reconnect in 
 writer()
 6. connect_seq++ and try to reconnect

 I think you’re wrong about this step. The messenger won’t increment 
 connect_seq directly in writer() because it will be in STATE_CONNECTING, so 
 it just calls connect() directly.
 connect() doesn’t increment the connect_seq unless it successfully finishes 
 a session negotiation.

 Hmm, sorry. I checked the log again. Actually A doesn't have any message
 in the queue, so it will enter the standby state and increase connect_seq. It
 will not be in *STATE_CONNECTING*.


 Urgh, that case does seem broken, yes. I take it this is something you’ve 
 actually run across?

 It looks like that connect_seq++ was added by 
 https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25. 
 Which makes me think we might just be able to increment the connect_seq 
 appropriately in the connect() function if we need to do so (on replacement, 
 I assume). Would you like to look at that and how this change might impact 
 the peer with regards to the referenced assert failure?

 -A very slow-to-reply and apologetic Greg

Thanks to Greg!

I looked at 
commit(https://github.com/ceph/ceph/commit/0fc47e267b6f8dcd4511d887d5ad37d460374c25#diff-500cebf3665775f2e77db1ff88255b9bR1553)
and think it's useless now.

When accepting and connect.connect_seq == existing->connect_seq,
existing->state may be STATE_OPEN, STATE_STANDBY or STATE_CONNECTING.
This commit only fixes part of the problem and wants to assert
(existing->state == STATE_CONNECTING). So later we added code to
catch (existing->state == STATE_OPEN || existing->state ==
STATE_STANDBY) before asserting.
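As a rough illustration of the accept-side check being discussed (a
simplified sketch, not the actual Pipe::accept() code):

  // Simplified sketch only: handle the OPEN/STANDBY cases explicitly before
  // assuming the existing pipe is still mid-handshake.
  enum State { STATE_OPEN, STATE_STANDBY, STATE_CONNECTING };

  struct ExistingPipe {
    State state;
    unsigned connect_seq;
  };

  void on_accept_equal_cseq(ExistingPipe* existing, unsigned incoming_cseq) {
    if (incoming_cseq != existing->connect_seq)
      return;  // mismatched seqs are handled by other branches
    if (existing->state == STATE_OPEN || existing->state == STATE_STANDBY) {
      // Connection race or peer retry: decide whether to replace or reuse
      // the existing pipe (policy elided here).
      return;
    }
    // Only now is it reasonable to expect we were still connecting.
    // assert(existing->state == STATE_CONNECTING);
  }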

So from my point of view, this commit is unnecessary now and we can drop it.

@sage, what do you think? Any other corner case this commit considered?



 2015-02-13 06:19:22.240788 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).writer: state = connecting policy.server=0
 2015-02-13 06:19:22.240801 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect 0
 2015-02-13 06:19:22.240821 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :0 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connecting to 127.0.0.1:16800/22032
 2015-02-13 06:19:22.398009 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect read peer addr 127.0.0.1:16800/22032 on socket 91
 2015-02-13 06:19:22.398026 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect peer addr for me is 127.0.0.1:36265/0
 2015-02-13 06:19:22.398066 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect sent my addr 127.0.0.1:16813/22045
 2015-02-13 06:19:22.398089 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect sending gseq=8 cseq=0 proto=24
 2015-02-13 06:19:22.398115 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect wrote (self +) cseq, waiting for reply
 2015-02-13 06:19:22.398137 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).connect read reply (0) Success
 2015-02-13 06:19:22.398155 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060). sleep for 0.1
 2015-02-13 06:19:22.498243 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).fault (0) Success
 2015-02-13 06:19:22.498275 7fdd147c7700  0 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
 c=0x3ed2060).fault with nothing to send, going to standby
 2015-02-13 06:19:22.498290 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
 127.0.0.1:16800/22032 pipe(0x3f82020

Re: [ceph-users] Ceph Dumpling/Firefly/Hammer SSD/Memstore performance comparison

2015-02-22 Thread Haomai Wang
I don't have detailed perf numbers for sync IO latency right now.

But a few days ago I did a single-OSD, single-io-depth benchmark. In
short, Firefly  Dumpling  Hammer per op latency.

It's great to see Mark's benchmark results! As for PCIe SSDs, I think
Ceph can't make full use of them currently with one OSD. We may need to
mainly focus on SATA SSD improvements.

On Mon, Feb 23, 2015 at 1:09 PM, Gregory Farnum g...@gregs42.com wrote:
 On Tue, Feb 17, 2015 at 9:37 AM, Mark Nelson mnel...@redhat.com wrote:
 Hi All,

 I wrote up a short document describing some tests I ran recently to look at
 how SSD backed OSD performance has changed across our LTS releases. This is
 just looking at RADOS performance and not RBD or RGW.  It also doesn't offer
 any real explanations regarding the results.  It's just a first high level
 step toward understanding some of the behaviors folks on the mailing list
 have reported over the last couple of releases.  I hope you find it useful.

 Do you have any work scheduled to examine the synchronous IO latency
 changes across versions? I suspect those are involved with the loss of
 performance some users have reported, and I've not heard any
 believable theories as to the cause. Since this is the first set of
 results pointing that way on hardware available for detailed tests I
 hope we can dig into it. And those per-op latencies are the next thing
 we'll need to cut down on, since they correspond pretty directly with
 CPU costs that we want to scale down! :)
 -Greg
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: K/V interface buffer transaction

2015-02-20 Thread Haomai Wang
Yeah, this reminds me that we have pending work on the ObjectStore
refactor bp [1]. We need to change the KeyValueDB interface to adopt the new
improvements.


[1]: http://pad.ceph.com/p/hammer-osd_transaction_encoding

On Fri, Feb 20, 2015 at 7:00 AM, Somnath Roy somnath@sandisk.com wrote:
 Thanks Sage !
 Let me understand the K/V code base more in-depth and will come back to you 
 on this.

 Regards
 Somnath

 -Original Message-
 From: Sage Weil [mailto:s...@newdream.net]
 Sent: Thursday, February 19, 2015 11:46 AM
 To: Somnath Roy
 Cc: Haomai Wang; sj...@redhat.com; Gregory Farnum; Ceph Development
 Subject: RE: K/V interface buffer transaction

 On Thu, 19 Feb 2015, Somnath Roy wrote:
 Sage/Haomai,
 Some more questions.

 1. I am not able to figure out why the KeyValueDB interface is so
 dependent on an iterator-based approach? If a db supports range queries,
 can't we get rid of these iterator interfaces?

 2. Also, functions like ::_generic_read() call
 StripObjectMap::get_values_with_header -> GenericObjectMap::scan().
 scan() is just looping over the keys and still calling
 iter->lower_bound(); why not make a direct get call? If the
 db supports range queries, we can hand the db these keys and it
 will return an array of key/value pairs itself. Why bother about that
 in the generic keyvaluestore interface? If a db doesn't support range
 queries, we can implement similar logic in the shim layer like
 leveldbstore/rocksdbstore, can't we?

 The KeyValueDB is the interface that seemed necessary when Sam was 
 implementing the original DBObjectMap a couple years ago.  It's based on what 
 leveldb was providing and what was needed at the time.  We are more than 
 happy to change it!

 A few things:

 1. Adding a call that returns multiple k/v pairs sounds fine as long as there 
 is a limit so we don't get an unbounded result size.

 2. I'm concerned (in general) about the efficiency of this interface.
 Right now pretty much everything is fetched and returned in the form of an 
 STL structure and I'm worried that there will be a bunch of data copying on 
 the implementation to conform to that.  On the flip side, lots of callers are 
 currently rejiggering their requests into those maps too.  I'd be very 
 interested in hearing about how you think we can make this fit more 
 efficiently to whatever backend you're currently working with.
 Leveldb and rocksdb will I think be the most common backends, but we want to 
 perform well with others too.

 3. One simple example of this: there are several places where we have an
 encoded bufferlist of map<string,bufferlist> that we are doing a set on (or
 are pulling out).  Currently we end up decoding into an STL map and feeding
 that to the interface, but I suspect lots of callers could benefit from a set of
 calls that go direct to/from such a buffer and skip the map.

 4. There's a trivial patch in my newstore wip branch that adds a get(prefix,
 key, *value) so that you don't have to pass in a set<string> for a single
 fetch.  It's somewhere in the pile at

 https://github.com/liewegas/ceph/commits/wip-newstore
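As a sketch of what those additions might look like on a KeyValueDB-like
interface (hypothetical signatures, with std::string standing in for
bufferlist; this is not the actual Ceph KeyValueDB API):

  #include <cstddef>
  #include <map>
  #include <string>

  struct KeyValueDBExt {
    // Point 4: fetch a single key without building a set<string>.
    // Returns 0 on success, nonzero if the key is missing.
    virtual int get(const std::string& prefix,
                    const std::string& key,
                    std::string* value) = 0;

    // Point 1: fetch up to max_entries keys starting at 'start' (inclusive),
    // so the result size stays bounded.
    virtual int get_range(const std::string& prefix,
                          const std::string& start,
                          std::size_t max_entries,
                          std::map<std::string, std::string>* out) = 0;

    virtual ~KeyValueDBExt() = default;
  };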

 sage


 Let me know if I am missing anything here.

 Thanks  Regards
 Somnath

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Wednesday, February 11, 2015 11:35 PM
 To: Somnath Roy
 Cc: sj...@redhat.com; Sage Weil; Gregory Farnum; Ceph Development
 Subject: Re: K/V interface buffer transaction

 On Thu, Feb 12, 2015 at 3:26 PM, Somnath Roy somnath@sandisk.com wrote:
  Haomai,
 
   KeyValueStore will only write one for duplicate entry in ordering
 
  I saw K/v store (keyvaluestore.cc) itself is not removing the duplicates , 
  are you saying the shim layer like leveldbstore/rocksdbstore is removing 
  the duplicates or the leveldb/rocksdb ?

 Oh no, sorry. That's just what I want to do; I forgot that I haven't implemented it yet.

 Each ObjectStore::Transaction in KeyValueStore has a corresponding
 BufferTransaction which stores all the kvs that need to be written. We could let
 submit_transaction do it at the end instead of calling the backend for each op.

 Yeah, we could resolve it in KeyValueStore clearly.
 
  Thanks  Regards
  Somnath
 
  -Original Message-
  From: Haomai Wang [mailto:haomaiw...@gmail.com]
  Sent: Wednesday, February 11, 2015 7:36 PM
  To: Somnath Roy
  Cc: sj...@redhat.com; Sage Weil; Gregory Farnum; Ceph Development
  Subject: Re: K/V interface buffer transaction
 
  On Thu, Feb 12, 2015 at 6:53 AM, Somnath Roy somnath@sandisk.com 
  wrote:
  Yeah, thanks!
  Not sure if level-db is handling duplicate entries within a transaction 
  properly or not, if not, in case of filestore (and also for K/V stores) 
  we are having an extra (redundant) OMAP write in the Write-Path.
 
  KeyValueStore will only write one for duplicate entry in ordering.
 
  But FileStore will write redundant omap.
 
  And from dump log, the duplicate entry looks like from pglog
 
 
  Regards
  Somnath
 
  -Original Message

Re: Memstore performance improvements v0.90 vs v0.87

2015-02-20 Thread Haomai Wang
Actually, I'm concerned about the validity of benchmarks using
MemStore. As far as I remember, it may cause lots of memory fragmentation
and degrade performance hugely. Maybe setting filestore_blackhole=true
would be more precise?


On Fri, Feb 20, 2015 at 5:49 PM, Blair Bethwaite
blair.bethwa...@gmail.com wrote:
 Hi James,

 Interesting results, but did you do any tests with a NUMA system? IIUC
 the original report was from a dual socket setup, and that'd
 presumably be the standard setup for most folks (both OSD server and
 client side).

 Cheers,

 On 20 February 2015 at 20:07, James Page james.p...@ubuntu.com wrote:

 Hi All

 The Ubuntu Kernel team have spent the last few weeks investigating the
 apparent performance disparity between RHEL 7 and Ubuntu 14.04; we've
 focussed efforts in a few ways (see below).

 All testing has been done using the latest Firefly release.

 1) Base network latency

 Jay Vosburgh looked at the base network latencies between RHEL 7 and
 Ubuntu 14.04; under default install, RHEL actually had slightly worse
 latency than Ubuntu due to the default enablement of a firewall;
 disabling this brought latency back inline between the two distributions:

 OS  rtt min/avg/max/mdev
 Ubuntu 14.04 (3.13) 0.013/0.016/0.018/0.005 ms
 RHEL7 (3.10)0.010/0.018/0.025/0.005 ms

 ...base network latency is pretty much the same.

 This testing was performed on a matched pair of Dell Poweredge R610's,
 configured with a single 4 core CPU and 8G of RAM.

 2) Latency and performance in Ceph using Rados bench

 Colin King spent a number of days testing and analysing results using
 rados bench against a single node ceph deployment, configured with a
 single memory backed OSD, to see if we could reproduce the disparities
 reported.

 He ran 120 second OSD benchmarks on RHEL 7 as well as Ubuntu 14.04 LTS
 with a selection of kernels including 3.10 vanilla, 3.13.0-44 (release
 kernel), 3.16.0-30 (utopic HWE kernel), 3.18.0-12 (vivid HWE kernel)
 and 3.19-rc6 with 1, 16 and 128 client threads.  The data collected is
 available at [0].

 Each round of tests consisted of 15 runs, from which we computed
 average latency, latency deviation and latency distribution:

 120 second x 1 thread

 Results all seem to cluster around 0.04-0.05ms, with RHEL 7 averaging
 at 0.044 and recent Ubuntu kernels at 0.036-0.037ms.  The older 3.10
 kernel in RHEL 7 does have some slightly higher average latency.

 120 second x 16 threads

 Results all seem to cluster around 0.6-0.7ms.  3.19.0-rc6 had a couple
 of 1.4ms outliers which pushed it out to be worse than RHEL 7. On the
 whole Ubuntu 3.10-3.18 kernels are better than RHEL 7 by ~0.1ms.  RHEL
 shows a far higher standard deviation, due to the bimodal latency
 distribution, which from the casual observer may appear to be more
 jittery.

 120 second x 128 threads

 Later kernels show up to have less standard deviation than RHEL 7, so
 that shows perhaps less jitter in the stats than RHEL 7's 3.10 kernel.
 With this many threads pounding the test, we get a wider spread of
 latencies and it is hard to tell any kind of latency distribution
 patterns with just 15 rounds because of the large amount of latency
 jitter.  All systems show a latency of ~ 5ms.  Taking into
 consideration the amount of jitter, we think these results do not make
 much sense unless we repeat these tests with say 100 samples.

 3) Conclusion

 We have not been able to show any major anomalies in Ceph on Ubuntu
 compared to RHEL 7 when using memstore.  Our current hypothesis is that
 one needs to run the OSD bench stressor many times to get a fair capture
 of system latency stats.  The reason for this is:

 * Latencies are very low with memstore, so any small jitter in
 scheduling etc will show up as a large distortion (as shown by the large
 standard deviations in the samples).

 * When memstore is heavily utilized, memory pressure causes the system
 to page heavily and so we are subject to the nature of perhaps delays on
 paging that cause some latency jitters.  Latency differences may be just
 down to where a random page is in memory or in swap, and with memstore
 these may cause the large perturbations we see when running just a
 single test.

 * We needed to make *many* tens of measurements to get a typical idea of
 average latency and the latency distributions. Don't trust the results
 from just one test

 * We ran the tests with a pool configured to 100 pgs and 100 pgps [1].
 One can get different results with different placement group configs.

 I've CC'ed both Colin and Jay on this mail - so if anyone has any
 specific questions about the testing they can chime in with responses.

 Regards

 James

 [0] http://kernel.ubuntu.com/~cking/.ceph/ceph-benchmarks.ods
 [1] http://ceph.com/docs/master/rados/configuration/pool-pg-config-ref/

 --
 James Page
 Ubuntu and Debian Developer
 james.p...@ubuntu.com
 jamesp...@debian.org
 

Re: NewStore update

2015-02-20 Thread Haomai Wang
So cool!

A few small notes:

1. What about a sync thread in NewStore?
2. Could we consider skipping the WAL for large overwrites (backfill, RGW)?
3. Sorry, what does [aio_]fsync mean?


On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil sw...@redhat.com wrote:
 Hi everyone,

 We talked a bit about the proposed KeyFile backend a couple months back.
 I've started putting together a basic implementation and wanted to give
 people and update about what things are currently looking like.  We're
 calling it NewStore for now unless/until someone comes up with a better
 name (KeyFileStore is way too confusing). (*)

 You can peruse the incomplete code at

 https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore

 This is a bit of a brain dump.  Please ask questions if anything isn't
 clear.  Also keep in mind I'm still at the stage where I'm trying to get
 it into a semi-working state as quickly as possible so the implementation
 is pretty rough.

 Basic design:

 We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
 Object data is stored in files with simple names (%d) in a simple
 directory structure (one level deep, default 1M files per dir).  The main
 piece of metadata we store is a mapping from object name (ghobject_t) to
 onode_t, which looks like this:

  struct onode_t {
    uint64_t size;                        /// object size
    map<string, bufferptr> attrs;         /// attrs
    map<uint64_t, fragment_t> data_map;   /// data (offset to fragment mapping)

 i.e., it's what we used to rely on xattrs on the inode for.  Here, we'll
 only lean on the file system for file data and it's block management.

 fragment_t looks like

  struct fragment_t {
uint32_t offset;   /// offset in file to first byte of this fragment
uint32_t length;   /// length of fragment/extent
fid_t fid; /// file backing this fragment

 and fid_t is

  struct fid_t {
uint32_t fset, fno;   // identify the file name: fragments/%d/%d

 To start we'll keep the mapping pretty simple (just one fragment_t) but
 later we can go for varying degrees of complexity.

 We lean on the kvdb for our transactions.

 If we are creating new objects, we write data into a new file/fid,
 [aio_]fsync, and then commit the transaction.

 If we are doing an overwrite, we include a write-ahead log (wal)
 item in our transaction, and then apply it afterwards.  For example, a 4k
 overwrite would make whatever metadata changes are included, and a wal
 item that says then overwrite this 4k in this fid with this data.  i.e.,
 the worst case is more or less what FileStore is doing now with its
 journal, except here we're using the kvdb (and its journal) for that.  On
 restart we can queue up and apply any unapplied wal items.

 An alternative approach here that we discussed a bit yesterday would be to
 write the small overwrites into the kvdb adjacent to the onode.  Actually
 writing them back to the file could be deferred until later, maybe when
 there are many small writes to be done together.

 But right now the write behavior is very simple, and handles just 3 cases:

 
 https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339

 1. New object: create a new file and write there.

 2. Append: append to an existing fid.  We store the size in the onode so
 we can be a bit sloppy and in the failure case (where we write some
 extra data to the file but don't commit the onode) just ignore any
 trailing file data.

 3. Anything else: generate a WAL item.

 4. Maybe later, for some small [over]writes, we instead put the new data
 next to the onode.
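A condensed sketch of the dispatch implied by the cases above (placeholder
names, not the actual NewStore write path):

  #include <cstdint>

  enum class WriteMode { NEW_FILE, APPEND, WAL_OVERWRITE };

  WriteMode choose_write_mode(bool object_exists, uint64_t onode_size,
                              uint64_t write_off) {
    if (!object_exists)
      return WriteMode::NEW_FILE;       // case 1: new object, new file/fid
    if (write_off >= onode_size)
      return WriteMode::APPEND;         // case 2: past EOF, sloppy append is safe
    return WriteMode::WAL_OVERWRITE;    // case 3: overwrite goes through the WAL
  }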

 There is no omap yet.  I think we should do basically what DBObjectMap did
 (with a layer of indirection to allow clone etc), but we need to rejigger
 it so that the initial pointer into that structure is embedded in the
 onode.  We may want to do some other optimization to avoid extra
 indirection in the common case.  Leaving this for later, though...

 We are designing for the case where the workload is already sharded across
 collections.  Each collection gets an in-memory Collection, which has its
 own RWLock and its own onode_map (SharedLRU cache).  A split will
 basically amount to registering the new collection in the kvdb and
 clearing the in-memory onode cache.
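A minimal sketch of that per-collection state, with stand-in types for the
RWLock and the SharedLRU cache (not the real code):

  #include <map>
  #include <memory>
  #include <shared_mutex>
  #include <string>

  struct Onode { /* stands in for NewStore's in-memory onode */ };
  using OnodeRef = std::shared_ptr<Onode>;

  struct Collection {
    std::shared_mutex lock;                     // per-collection RWLock
    std::map<std::string, OnodeRef> onode_map;  // stands in for the SharedLRU cache
    // On split: register the new collection in the kvdb, then clear this cache.
  };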

 There is a TransContext structure that is used to track the progress of a
 transaction.  It'll list which fd's need to get synced pre-commit, which
 onodes need to get written back in the transaction, and any WAL items to
 include and queue up after the transaction commits.  Right now the
 queue_transaction path does most of the work synchronously just to get
 things working.  Looking ahead I think what it needs to do is:

  - assemble the transaction
  - start any aio writes (we could use O_DIRECT here if the new hints
 include WONTNEED?)
  - start any aio fsync's
  - queue kvdb transaction
  - fire onreadable[_sync] notifications (I suspect we'll want to do this
 unconditionally; maybe we 

Re: NewStore update

2015-02-20 Thread Haomai Wang
OK, I just viewed part of the code and realized it.

It looks like we sync metadata every time we use the WAL, and we do the
do_transaction work ahead of the WAL items. Might that cause larger latency
than before? The latency of do_transactions can't simply be ignored in some
latency-sensitive cases, and it may trigger lookup operations (get_onode).

On Fri, Feb 20, 2015 at 11:00 PM, Sage Weil sw...@redhat.com wrote:
 On Fri, 20 Feb 2015, Haomai Wang wrote:
 So cool!

 A little notes:

 1. What about sync thread in NewStore?

 My thought right now is that there will be a WAL thread and (maybe) a
 transaction commit completion thread.  What do you mean by sync thread?

 One thing I want to avoid is the current 'op' thread in FileStore.
 Instead of queueing a transaction we will start all of the aio operations
 synchronously.  This has the nice (?) side-effect that if there is memory
 blackpressure it will block at submit time so we don't need to do our own
 throttling.  (...though we may want to do it ourselves later anyway.)

 2. Could we consider skipping WAL for large overwrite(backfill, RGW)?

 We do (or will)... if there is a truncate to 0 it doesn't need to do WAL
 at all.  The onode stores the size so we'll ignore any stray bytes after
 that in the file; that let's us do the truncate async after the txn
 commits.  (Slightly sloppy but the space leakage window is so small I
 don't think it's worth worrying about.)

 3. Sorry, what means [aio_]fsync?

 aio_fsync is just an fsync that's submitted as an aio operation.  It'll
 make fsync fit into the same bucket as the aio writes we queue up, and it
 also means that if/when the experimental batched fsync stuff goes into XFS
 we'll take advantage of it (lots of fsyncs will be merged into a single
 XFS transaction and be much more efficient).
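For reference, a minimal illustration of the POSIX aio_fsync call being
referred to (error handling abbreviated; this is not NewStore code):

  #include <aio.h>
  #include <cerrno>
  #include <cstring>
  #include <fcntl.h>

  // Queue an fsync for 'fd' as an asynchronous operation and wait for it,
  // so it can sit in the same completion path as queued aio writes.
  int async_fsync(int fd) {
    struct aiocb cb;
    std::memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    if (aio_fsync(O_SYNC, &cb) < 0)
      return -1;
    const struct aiocb* list[1] = { &cb };
    while (aio_error(&cb) == EINPROGRESS)
      aio_suspend(list, 1, nullptr);      // wait for the fsync to complete
    return aio_return(&cb);               // 0 on success
  }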

 sage




 On Fri, Feb 20, 2015 at 7:50 AM, Sage Weil sw...@redhat.com wrote:
  Hi everyone,
 
  We talked a bit about the proposed KeyFile backend a couple months back.
  I've started putting together a basic implementation and wanted to give
  people and update about what things are currently looking like.  We're
  calling it NewStore for now unless/until someone comes up with a better
  name (KeyFileStore is way too confusing). (*)
 
  You can peruse the incomplete code at
 
  https://github.com/liewegas/ceph/tree/wip-newstore/src/os/newstore
 
  This is a bit of a brain dump.  Please ask questions if anything isn't
  clear.  Also keep in mind I'm still at the stage where I'm trying to get
  it into a semi-working state as quickly as possible so the implementation
  is pretty rough.
 
  Basic design:
 
  We use a KeyValueDB (leveldb, rocksdb, ...) for all of our metadata.
  Object data is stored in files with simple names (%d) in a simple
  directory structure (one level deep, default 1M files per dir).  The main
  piece of metadata we store is a mapping from object name (ghobject_t) to
  onode_t, which looks like this:
 
   struct onode_t {
     uint64_t size;                        /// object size
     map<string, bufferptr> attrs;         /// attrs
     map<uint64_t, fragment_t> data_map;   /// data (offset to fragment mapping)
 
  i.e., it's what we used to rely on xattrs on the inode for.  Here, we'll
  only lean on the file system for file data and it's block management.
 
  fragment_t looks like
 
   struct fragment_t {
 uint32_t offset;   /// offset in file to first byte of this fragment
 uint32_t length;   /// length of fragment/extent
 fid_t fid; /// file backing this fragment
 
  and fid_t is
 
   struct fid_t {
 uint32_t fset, fno;   // identify the file name: fragments/%d/%d
 
  To start we'll keep the mapping pretty simple (just one fragment_t) but
  later we can go for varying degrees of complexity.
 
  We lean on the kvdb for our transactions.
 
  If we are creating new objects, we write data into a new file/fid,
  [aio_]fsync, and then commit the transaction.
 
  If we are doing an overwrite, we include a write-ahead log (wal)
  item in our transaction, and then apply it afterwards.  For example, a 4k
  overwrite would make whatever metadata changes are included, and a wal
  item that says then overwrite this 4k in this fid with this data.  i.e.,
  the worst case is more or less what FileStore is doing now with its
  journal, except here we're using the kvdb (and its journal) for that.  On
  restart we can queue up and apply any unapplied wal items.
 
  An alternative approach here that we discussed a bit yesterday would be to
  write the small overwrites into the kvdb adjacent to the onode.  Actually
  writing them back to the file could be deferred until later, maybe when
  there are many small writes to be done together.
 
  But right now the write behavior is very simple, and handles just 3 cases:
 
  
  https://github.com/liewegas/ceph/blob/wip-newstore/src/os/newstore/NewStore.cc#L1339
 
  1. New object: create a new file and write there.
 
  2. Append

Re: CDS process

2015-02-17 Thread Haomai Wang
It seems the wiki is better for recording blueprints, and other users (not
just devs?) can get a nice view of them. If we put everything in a pad, it
may become a mess across different blueprints?

We have some facts:
1. a blueprint needs to be formatted and give viewers a unified view
2. heavy changes will be applied mainly during CDS
3. later (after CDS) changes to a *blueprint* need to be notified

So maybe we can register a blueprint on the wiki first, then heavy changes
happen during CDS and are written in the pad. After each session, we
rewrite (copy?) the result back to the wiki. That could be a reasonable
tradeoff between pad and wiki?

On Wed, Feb 18, 2015 at 12:24 AM, Patrick McGarry pmcga...@redhat.com wrote:
 Mostly I just want to do small incremental changes to the process,
 especially since it's happening so close to the summit.

 The only thing that I'll miss with an etherpad-only workflow is the
 notification on creations/edits, but I'll survive. I think it's just a
 matter of enforcing the use of blueprints, regardless of where they
 live.



 On Mon, Feb 16, 2015 at 8:42 PM, Sage Weil s...@newdream.net wrote:
 On Mon, 16 Feb 2015, Patrick McGarry wrote:
 I think I'm going to take this forward in baby steps. I'm going to collect
 blueprints via the normal pathway and then just manually capture the data in
 ether pads when I populate the schedule. For J I'll just direct people
 directly to ether pads (assuming there is no major objection).

 Are you worried about the documented workflow and tooling in the wiki, or
 just want to start with small changes to the process?  It's also the
 copying part and most-empty blueprints that I suspect we can avoid without
 loss of value.  I'm curious if we go super-light on the tooling if we'll
 find that there are parts we miss or not.

 Any other thoughts?
 sage



 On Thu, Feb 5, 2015 at 10:36 AM, Josh Durgin josh.dur...@inktank.com
 wrote:
   On 02/05/2015 02:50 PM, Sage Weil wrote:
 I wonder if we should simplify the cds workflow a bit to go straight to an
 etherpad outline of the blueprint instead of the wiki blueprint doc.  I
 find it a bit disorienting to be flipping between the two, and after the
 fact find it frustrating that there isn't a single reference to go back to
 for the outcome of the session (you have to look at both the pad and the
 bp).

 Perhaps just using the pad from the get-go will streamline things a bit
 and make it a little more lightweight?  What does everyone think?


 Sounds good to me. I've also wished there were a single location to
 capture a session; searching through the wiki for etherpads that
 aren't linked from the blueprint is a pain.

 Josh




 --

 Best Regards,

 Patrick McGarry
 Director Ceph Community || Red Hat
 http://ceph.com  ||  http://community.redhat.com
 @scuttlemonkey || @ceph





 --

 Best Regards,

 Patrick McGarry
 Director Ceph Community || Red Hat
 http://ceph.com  ||  http://community.redhat.com
 @scuttlemonkey || @ceph
 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel in
 the body of a message to majord...@vger.kernel.org
 More majordomo info at  http://vger.kernel.org/majordomo-info.html



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About in_seq, out_seq in Messenger

2015-02-12 Thread Haomai Wang
On Fri, Feb 13, 2015 at 1:26 AM, Greg Farnum gfar...@redhat.com wrote:
 Sorry for the delayed response.

 On Feb 11, 2015, at 3:48 AM, Haomai Wang haomaiw...@gmail.com wrote:

 Hmm, I got it.

 There exists another problem I'm not sure whether captured by upper layer:

 two monitor node(A,B) connected with lossless_peer_reuse policy,
 1. lots of messages has been transmitted
 2. markdown A

 I don’t think monitors ever mark each other down?

 3. restart A and call send_message(message will be in out_q)

 oh, maybe you just mean rebooting it, not an interface thing, okay...

 4. network error injected and A failed to build a *session* with B
 5. because of policy and in_queue() == true, we will reconnect in writer()
 6. connect_seq++ and try to reconnect

 I think you’re wrong about this step. The messenger won’t increment 
 connect_seq directly in writer() because it will be in STATE_CONNECTING, so 
 it just calls connect() directly.
 connect() doesn’t increment the connect_seq unless it successfully finishes a 
 session negotiation.

Hmm, sorry. I checked the log again. Actually A doesn't have any message
in the queue, so it will enter the standby state and increase connect_seq. It
will not be in *STATE_CONNECTING*.


2015-02-13 06:19:22.240788 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).writer: state = connecting policy.server=0
2015-02-13 06:19:22.240801 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=-1 :0 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect 0
2015-02-13 06:19:22.240821 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :0 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connecting to 127.0.0.1:16800/22032
2015-02-13 06:19:22.398009 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect read peer addr 127.0.0.1:16800/22032 on socket 91
2015-02-13 06:19:22.398026 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect peer addr for me is 127.0.0.1:36265/0
2015-02-13 06:19:22.398066 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect sent my addr 127.0.0.1:16813/22045
2015-02-13 06:19:22.398089 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect sending gseq=8 cseq=0 proto=24
2015-02-13 06:19:22.398115 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect wrote (self +) cseq, waiting for reply
2015-02-13 06:19:22.398137 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).connect read reply (0) Success
2015-02-13 06:19:22.398155 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060). sleep for 0.1
2015-02-13 06:19:22.498243 7fdd147c7700  2 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).fault (0) Success
2015-02-13 06:19:22.498275 7fdd147c7700  0 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=0 l=0
c=0x3ed2060).fault with nothing to send, going to standby
2015-02-13 06:19:22.498290 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
c=0x3ed2060).writer: state = standby policy.server=0
2015-02-13 06:19:22.498301 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
c=0x3ed2060).writer sleeping
2015-02-13 06:19:22.526116 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=3 pgs=0 cs=0 l=0
c=0x3ed2060).writer: state = standby policy.server=0
2015-02-13 06:19:22.526132 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=91 :36265 s=1 pgs=0 cs=1 l=0
c=0x3ed2060).connect 1
2015-02-13 06:19:22.526158 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36265 s=1 pgs=0 cs=1 l=0
c=0x3ed2060).connecting to 127.0.0.1:16800/22032
2015-02-13 06:19:22.526318 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36296 s=1 pgs=0 cs=1 l=0
c=0x3ed2060).connect read peer addr 127.0.0.1:16800/22032 on socket 47
2015-02-13 06:19:22.526334 7fdd147c7700 20 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36296 s=1 pgs=0 cs=1 l=0
c=0x3ed2060).connect peer addr for me is 127.0.0.1:36296/0
2015-02-13 06:19:22.526372 7fdd147c7700 10 -- 127.0.0.1:16813/22045 
127.0.0.1:16800/22032 pipe(0x3f82020 sd=47 :36296 s=1 pgs=0 cs=1 l=0
c=0x3ed2060).connect sent my addr 127.0.0.1:16813/22045
2015-02-13 06:19:22.526388 7fdd147c7700 10

Re: maximum key size

2015-02-11 Thread Haomai Wang
Whether it's RBD or the object gateway, omap keys are always used
internally, so the typical key length should be within 100 bytes.
On Wed, Feb 11, 2015 at 10:46 PM, Cook, Nigel nigel.c...@intel.com wrote:
 In RBD and Object gateway use cases, can you tell me what the typical key 
 length is?

 What is the object name used for vs the omap key?

 Regards,
 Nigel Cook
 Intel Fellow  Cloud Chief Architect
 Cloud Platforms Group
 Intel Corporation
 +1 720 319 7508

 -Original Message-
 From: Haomai Wang [mailto:haomaiw...@gmail.com]
 Sent: Tuesday, February 10, 2015 10:17 PM
 To: Cook, Nigel
 Cc: ceph-devel@vger.kernel.org
 Subject: Re: maximum key size

 Hmm, we have an object name length limit (osd_max_object_name_len) and an attr
 name length limit (osd_max_attr_name_len) as config and ObjectStore implementation
 settings, but we don't have any limit on omap key length.


 On Wed, Feb 11, 2015 at 12:47 PM, Cook, Nigel nigel.c...@intel.com wrote:
 Folks,

 A quick question...

 In a CEPH OSD ObjectStore implementation, what is the maximum length of the 
 key in the key/value store?

 Regards,
 Nigel Cook
 Intel Fellow  Cloud Chief Architect
 Cloud Platforms Group
 Intel Corporation
 +1 720 319 7508

 --
 To unsubscribe from this list: send the line unsubscribe ceph-devel
 in the body of a message to majord...@vger.kernel.org More majordomo
 info at  http://vger.kernel.org/majordomo-info.html



 --
 Best Regards,

 Wheat



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: About in_seq, out_seq in Messenger

2015-02-11 Thread Haomai Wang
Hmm, I got it.

There is another problem that I'm not sure is captured by the upper layer:

Two monitor nodes (A, B) are connected with the lossless_peer_reuse policy:
1. lots of messages have been transmitted
2. mark down A
3. restart A and call send_message (the message will be in out_q)
4. a network error is injected and A fails to build a *session* with B
5. because of the policy and in_queue() == true, we will reconnect in writer()
6. connect_seq++ and try to reconnect
7. because connect_seq != 0, B can't detect the remote reset, and in_seq
(a large value) will be exchanged and cause A to crash (Pipe.cc:1154)

So I guess we can't increase connect_seq when reconnecting? We need to
let the peer side detect the remote reset via connect_seq == 0.



On Tue, Feb 10, 2015 at 12:00 AM, Gregory Farnum gfar...@redhat.com wrote:
 - Original Message -
 From: Haomai Wang haomaiw...@gmail.com
 To: Gregory Farnum gfar...@redhat.com
 Cc: Sage Weil sw...@redhat.com, ceph-devel@vger.kernel.org
 Sent: Friday, February 6, 2015 8:16:42 AM
 Subject: Re: About in_seq, out_seq in Messenger

 On Fri, Feb 6, 2015 at 10:47 PM, Gregory Farnum gfar...@redhat.com wrote:
  - Original Message -
  From: Haomai Wang haomaiw...@gmail.com
  To: Sage Weil sw...@redhat.com, Gregory Farnum g...@inktank.com
  Cc: ceph-devel@vger.kernel.org
  Sent: Friday, February 6, 2015 12:26:18 AM
  Subject: About in_seq, out_seq in Messenger
 
  Hi all,
 
  Recently we enable a async messenger test job in test
  lab(http://pulpito.ceph.com/sage-2015-02-03_01:15:10-rados-master-distro-basic-multi/#).
  We hit many failed assert mostly are:
assert(0 == old msgs despite reconnect_seq feature);
 
  And assert connection all are cluster messenger which mean it's OSD
  internal connection. The policy associated this connection is
  Messenger::Policy::lossless_peer.
 
  So when I dive into this problem, I find something confusing about
  this. Suppose these steps:
  1. lossless_peer policy is used by both two side connections.
  2. markdown one side(anyway), peer connection will try to reconnect
  3. then we restart failed side, a new connection is built but
  initiator will think it's a old connection so sending in_seq(10)
  4. new started connection has no message in queue and it will receive
  peer connection's in_seq(10) and call discard_requeued_up_to(10). But
  because no message in queue, it won't modify anything
  5. now any side issue a message, it will trigger assert(0 == old
  msgs despite reconnect_seq feature);
 
  I can replay these steps in unittest and actually it's hit in test lab
  for async messenger which follows simple messenger's design.
 
  Besides, if we enable reset_check here, was_session_reset will be
  called and it will random out_seq, so it will certainly hit assert(0
  == skipped incoming seq).
 
  Anything wrong above?
 
  Sage covered most of this. I'll just add that the last time I checked it, I
  came to the conclusion that the code to use a random out_seq on initial
  connect was non-functional. So there definitely may be issues there.
 
  In fact, we've fixed a couple (several?) bugs in this area since Firefly
  was initially released, so if you go over the point release
  SimpleMessenger patches you might gain some insight. :)
  -Greg

 If we want to make random out_seq functional, I think we need to
 exchange out_seq when handshaking too. Otherwise, we need to give it
 up.

 Possibly. Or maybe we just need to weaken our asserts to infer it on initial 
 messages?


 Another question: do you think reset_check=true is always good for
 OSD internal connections?

 Huh? resetcheck is false for lossless peer connections.


 Letting the Messenger rely on the upper layer may not be a good idea, so maybe
 we can enhance the in_seq exchange process (ensure on each side that
 in_seq + sent.size() == out_seq). From the current handshake implementation,
 it's not easy to insert more actions into the in_seq exchange, because the
 session has already been built regardless of the result of the in_seq exchange.
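A tiny sketch of the suggested invariant, with simplified types (not the
actual Pipe handshake code):

  #include <cstdint>
  #include <deque>

  struct MessageStub { uint64_t seq; };

  // After exchanging seqs: the peer's acked in_seq plus the messages we still
  // hold as unacked ("sent") should account for everything we ever queued.
  bool seqs_consistent(uint64_t peer_in_seq,
                       const std::deque<MessageStub>& sent,
                       uint64_t out_seq) {
    // If this does not hold, one side lost state (e.g. after a restart) and
    // the session should be treated as reset instead of resumed.
    return peer_in_seq + sent.size() == out_seq;
  }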

 If we enable reset_check=true, it looks like we can solve most of the
 incorrect-seq out-of-sync problems?

 Oh, I see what you mean.
 Yeah, the problem here is a bit of a mismatch in the interfaces. OSDs are 
 lossless peers with each other, they should not miss any messages, and they 
 don't ever go away. Except of course sometimes they do go away, if one of 
 them dies. This is supposed to be handled by marking it down, but it turns 
 out the race conditions around that are a little larger than we'd realized. 
 Changing that abstraction in the other direction by enabling reset is also 
 difficult, as witnessed by our vacillating around how to handle resets in the 
 messenger code base. :/

 Anyway, you may not have seen http://tracker.ceph.com/issues/9555, which 
 fixes the bug you're seeing here. It will be in the next Firefly point 
 release. :)
 -Greg



-- 
Best Regards,

Wheat
--
To unsubscribe from this list: send the line unsubscribe ceph-devel in
the body of a message to majord
