RE: ObjectStore collections
Made some changes: BufferTransaction::buffers now uses a map keyed by (coll_t, ghobject_t), whose value maps a (prefix, key) pair to a bufferlist, to record the IOs in one transaction. The code is under test: 700s of 64k sequential writes and 700s of 4k random writes, with no OSD crash.

Best Regards,
-Chendi

-----Original Message-----
From: ceph-devel-ow...@vger.kernel.org [mailto:ceph-devel-ow...@vger.kernel.org] On Behalf Of Xue, Chendi
Sent: Sunday, November 9, 2014 10:32 PM
To: Haomai Wang
Cc: ceph-devel@vger.kernel.org; Sage Weil
Subject: RE: ObjectStore collections

Hi, Haomai

As discussed this afternoon, I moved buffers from strip_header to BufferTransaction. The code has been tested, and there seems to be a slight performance improvement. After a fairly long sequential/random write test, I observed no segmentation faults, no growing memory usage, and no other OSD crash problems.

Hope you can help review the code, thanks:

https://github.com/ceph/ceph/pull/2889

===

For others, a brief explanation: 'strip_header' in KeyValueStore is used to cache all headers, and 'buffers' is used to ensure that all IO in one transaction can read the latest update. Moving 'buffers' from strip_header to BufferTransaction ensures that 'buffers' is destructed once the transaction is submitted, which prevents a growing strip_header->buffers from causing an OOM problem.

Also, since the 'meta collection' is cached in strip_header, and there is no locking for parallel IO accessing the 'meta collection', multi-threaded access to the 'meta collection' strip_header->buffers causes a segmentation fault. Moving 'buffers' from strip_header to BufferTransaction prevents that scenario as well.
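The angle brackets of the map type were eaten by the list archive, so as a rough sketch only: the per-transaction buffer cache described above could look like the following. The stand-in types and the accessor names here are hypothetical simplifications, not the real Ceph coll_t/ghobject_t/bufferlist or the actual KeyValueStore code.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>

// Hypothetical, simplified stand-ins for Ceph's coll_t, ghobject_t and
// bufferlist -- plain strings here, just to show the shape of the map.
using coll_t = std::string;
using ghobject_t = std::string;
using bufferlist = std::string;

// Sketch of the reworked BufferTransaction: buffers are keyed first by
// (collection, object), then by (prefix, key), so every IO in one
// transaction can read back its own latest update.
struct BufferTransaction {
  std::map<std::pair<coll_t, ghobject_t>,
           std::map<std::pair<std::string, std::string>, bufferlist>> buffers;

  void set_buffer_key(const coll_t &c, const ghobject_t &o,
                      const std::string &prefix, const std::string &key,
                      const bufferlist &bl) {
    buffers[{c, o}][{prefix, key}] = bl;
  }

  // Returns true and fills *out_bl if this transaction already wrote the key.
  bool get_buffer_key(const coll_t &c, const ghobject_t &o,
                      const std::string &prefix, const std::string &key,
                      bufferlist *out_bl) const {
    auto obj = buffers.find({c, o});
    if (obj == buffers.end())
      return false;
    auto it = obj->second.find({prefix, key});
    if (it == obj->second.end())
      return false;
    *out_bl = it->second;
    return true;
  }
};
// Because the map lives inside the transaction object, it is destroyed when
// the transaction is submitted, so the cache cannot grow across transactions
// the way a long-lived strip_header->buffers could.
```

The point of the design is the lifetime: reads within a transaction hit this map first, and nothing survives submit.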
Best Regards,
-Chendi

-----Original Message-----
From: Haomai Wang [mailto:haomaiw...@gmail.com]
Sent: Sunday, November 9, 2014 5:21 PM
To: Sage Weil
Cc: Xue, Chendi; ceph-devel@vger.kernel.org
Subject: Re: ObjectStore collections

On Sun, Nov 9, 2014 at 5:59 AM, Sage Weil s...@newdream.net wrote:
> On Sat, 8 Nov 2014, Haomai Wang wrote:
>> As for OOM, I think the root cause is the mistaken commit above, too. Because the meta collection is updated on each transaction, and StripObjectHeader::buffers is always kept in memory by the cache strategy, this object's buffers keep growing over time. So I think if we avoid caching the meta collection's object, we will be fine. Although we didn't observe OOM in any previous release apart from this mistaken commit, I would prefer to add code that discards buffers at each transaction submit, to avoid potentially unpredictable memory growth.
>
> Do you have a clearer impl for it? I'm just looking for a better way to solve the performance bottleneck for meta collections.
>
> I would really like to see if we can eliminate collections from the API entirely. Or, perhaps more importantly, whether that would be helpful. For the most part, hobject_t's already sort themselves into collections based on the hash. The exceptions are:
>
> - The 'meta' collection. Mostly this includes the pg log and pg info objects (which are per-pg and would otherwise need no locking) and the osdmap objects.
>
> - collection_move and collection_move_rename. I think if we move everything to collection_move_rename and use temporary objects with unique names for everything (I think in-progress recovery objects are the main user of collection_move), then this really just turns into a rename operation.
>
> - Object listing is currently in terms of pg, but could just as easily specify a hash range.
>
> - Collection attributes can be moved to the pginfo objects.
>
> It sounds like the problem in KeyValueStore is that the pglog/pginfo objects are written by transactions in all PGs, but the per-collection index/cache structures weren't being locked. If we can find a way to fit these into the sorted hash in the right position, that is conceptually simpler. But I'm not sure that simplicity actually helps with the implementation, where the data structure locking is the important part. Perhaps we need to keep a collection concept simply for that purpose, and the only real problem is 'meta'?

Originally, KeyValueStore got this right simply by not caching the meta collection, for concurrency reasons. I'm just wondering whether there is a way to make meta collection ops more parallel. BTW, FileStore could benefit as well, for the same reason.

> sage

--
Best Regards,
Wheat
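Sage's point that objects already sort themselves by hash, so per-pg listing could become a hash-range scan, can be sketched roughly as follows. The ObjectIndex type and its methods are hypothetical illustrations, not Ceph code, and real hobject_t ordering involves more than the bare hash.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

// Hypothetical sketch: if objects are kept sorted by hash, "list the objects
// in a PG" becomes a plain range scan over [start, end) hashes, with no
// collection concept needed for listing.
struct hobject {
  uint32_t hash;
  std::string name;
};

struct ObjectIndex {
  // hash -> object name (a sketch; a real index would also break ties
  // between distinct objects that share a hash).
  std::map<uint32_t, std::string> by_hash;

  void insert(const hobject &o) { by_hash[o.hash] = o.name; }

  // List every object whose hash falls in [start, end).
  std::vector<std::string> list_range(uint32_t start, uint32_t end) const {
    std::vector<std::string> out;
    for (auto it = by_hash.lower_bound(start);
         it != by_hash.end() && it->first < end; ++it)
      out.push_back(it->second);
    return out;
  }
};
```

Under this view, a PG is just a hash prefix, so "list pg N" and "list hashes in [lo, hi)" are the same query.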