On 12/4/15 2:12 PM, Ric Wheeler wrote:
> On 12/01/2015 05:02 PM, Sage Weil wrote:
>> Hi David,
>>
>> On Tue, 1 Dec 2015, David Casier wrote:
>>> Hi Sage,
>>> With a standard disk (4 to 6 TB), and a small flash drive, it's easy
>>> to create an ext4 FS with metadata on flash
>>>
>>> Example with sdg1 on flash and sdb on HDD:
>>>
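>>> # size in 512-byte sectors, the unit device-mapper tables use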
>>> size_of() {
>>>    blockdev --getsize $1
>>> }
>>>
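>>> # concatenate the flash partition (first) and the HDD (second) into a
>>> # single linear dm device, so the low block numbers land on flash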
>>> mkdmsetup() {
>>>    _ssd=/dev/$1
>>>    _hdd=/dev/$2
>>>    _size_of_ssd=$(size_of $_ssd)
>>>    echo """0 $_size_of_ssd linear $_ssd 0
>>>    $_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create 
>>> dm-${1}-${2}
>>> }
>>>
>>> mkdmsetup sdg1 sdb
>>>
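>>> # pack all the ext4 metadata at the start of the device (i.e. on the flash
>>> # part of the concat), drop the journal and resize inode, and keep the
>>> # inode count low so the metadata fits on the small flash partition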
>>> mkfs.ext4 \
>>>    -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
>>>    -E packed_meta_blocks=1,lazy_itable_init=0 \
>>>    -G 32768 -I 128 -i $((1024*512)) \
>>>    /dev/mapper/dm-sdg1-sdb
>>>
>>> With that, all meta_blocks are on the SSD
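>>>
>>> A quick sanity check (just a sketch, reusing the device names above, and
>>> assuming the default 4 KiB block size): dumpe2fs should show every group's
>>> bitmaps and inode tables sitting below the flash boundary of the concat.
>>>
>>> ssd_fs_blocks=$(( $(blockdev --getsize64 /dev/sdg1) / 4096 ))
>>> echo "metadata should stay below fs block $ssd_fs_blocks"
>>>
>>> # per-group locations of the block/inode bitmaps and inode tables
>>> dumpe2fs /dev/mapper/dm-sdg1-sdb 2>/dev/null | grep -E 'bitmap at|Inode table at'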
>>>
>>> If the omaps are also on SSD, there is almost no metadata left on the HDD.
>>>
>>> Consequence: Ceph performance (with a hack on filestore, without the journal
>>> and with direct I/O) is almost the same as the raw performance of the HDD.
>>>
>>> With a cache tier, it's very cool!
>> Cool!  I know XFS lets you do that with the journal, but I'm not sure if
>> you can push the fs metadata onto a different device too.. I'm guessing
>> not?
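>>
>> (For reference, the journal part on XFS is just an external log device;
>> roughly, with the device names only as an example:
>>
>>     mkfs.xfs -l logdev=/dev/sdg2,size=128m /dev/sdb
>>     mount -o logdev=/dev/sdg2 /dev/sdb /mnt/osd
>>
>> but that only moves the log, not the rest of the metadata.)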
>>
>>> That is why we are working on a hybrid HDD/flash approach, on ARM or Intel.
>>>
>>> With newstore, it's much more difficult to control the I/O profile,
>>> because RocksDB embeds its own intelligence.
>> This is coincidentally what I've been working on today.  So far I've just
>> added the ability to put the rocksdb WAL on a second device, but it's
>> super easy to push rocksdb data there as well (and have it spill over onto
>> the larger, slower device if it fills up).  Or to put the rocksdb WAL on a
>> third device (e.g., expensive NVMe or NVRAM).
>>
>> See this ticket for the ceph-disk tooling that's needed:
>>
>>     http://tracker.ceph.com/issues/13942
>>
>> I expect this will be more flexible and perform better than the ext4
>> metadata option, but we'll need to test on your hardware to confirm!
>>
>> sage
> 
> I think that XFS "realtime" subvolumes are the thing that does this -  the 
> second volume contains only the data (no metadata).
> 
> Seem to recall that it was historically popular with video appliances, etc.,
> but it is not commonly used.
> 
> Some of the XFS crew cc'ed above would have more information on this.

The realtime subvolume puts all data on a separate volume, and uses a different
allocator; it is more for streaming type applications, in general.  And it's
not enabled in RHEL - and not heavily tested at this point, I think.
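
For the record, the setup looks roughly like this (device names are only an
example; data goes to the realtime device only for files that have the
realtime flag set, typically inherited from the directory):

    mkfs.xfs -r rtdev=/dev/sdc /dev/sdb
    mount -o rtdev=/dev/sdc /dev/sdb /mnt

All the metadata (and any non-realtime files) stay on the main device.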

-Eric