[Resend in plain text]

Hi,
       I have been playing with some RocksDB tunables these days, trying to
optimize the performance of Newstore. From the data so far, it seems the write
amplification (WA) of RocksDB is not what is blocking the performance, and
neither is the fragment part (aio/dio, etc). The issue might be how many OPS
RocksDB can offer under a 1-write-per-sync workload. I cannot find that number
online so I will measure it myself; if it is low, maybe we need to hold
multiple RocksDB instances in one OSD and do some sharding.
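
To make the sharding idea concrete, here is a minimal sketch of routing keys to
one of several KV instances by hash. Everything in it is hypothetical: the key
names, the shard count and the shard_for() helper are made up for illustration
and are not Newstore or RocksDB code.

```python
import zlib

def shard_for(key: bytes, num_shards: int) -> int:
    """Map a key to a shard index in [0, num_shards) using a stable hash."""
    return zlib.crc32(key) % num_shards

# Hypothetical object keys and shard count, only for illustration.
NUM_SHARDS = 4
keys = [b"obj_%d" % i for i in range(8)]
placement = {k: shard_for(k, NUM_SHARDS) for k in keys}

for key, shard in placement.items():
    print(key.decode(), "-> shard", shard)
```

Each shard would then have its own WAL and its own sync stream, which is the
only reason this could help a 1-write-per-sync workload.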

The RocksDB WAL, the RocksDB data files and the Newstore directory were backed
by 3 separate SSDs:
/dev/sdc1      156172796    32928 156139868   1% /root/ceph-0-db
/dev/sdd1      195264572    32928 195231644   1% /root/ceph-0-db-wal
/dev/sdb1      156172796 10589552 145583244   7% /var/lib/ceph/osd/ceph-0

Some interesting findings here:

1.  Avg_reqsz on sdb (the Newstore FS part) is 2KB, which is half of the
request block size (4KB). The IOPS in iostat (~2K) is about 2X the number
reported by FIO, while the bandwidth matches.

Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sda               0.00     0.00    0.00   58.33     0.00    28.98  1017.51     6.33  108.55    0.00  108.55   1.30   7.60
sdb               0.00     0.00    0.00 2038.00     0.00     3.98     4.00     0.13    0.07    0.00    0.07   0.07  13.33
sdc               0.00     0.00    0.00    0.00     0.00     0.00     0.00     0.00    0.00    0.00    0.00   0.00   0.00
sdd               0.00   747.67    0.00 2099.67     0.00    11.28    11.00     0.76    0.36    0.00    0.36   0.36  75.73

I believe Newstore will not split the request, so there must be some very
small IO (~0KB) going along with each data write (4KB). Where does the small
IO come from?
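
The 2KB figure can be cross-checked from the sdb numbers in the iostat output.
This is only a sanity-check sketch: the values are copied from the sdb row, and
it relies on iostat reporting avgrq-sz in 512-byte sectors.

```python
SECTOR_BYTES = 512  # iostat reports avgrq-sz in 512-byte sectors

def avg_request_kb(mb_per_s: float, ios_per_s: float) -> float:
    """Average request size in KB derived from bandwidth and IOPS."""
    return mb_per_s * 1024.0 / ios_per_s

# Values copied from the sdb row of the iostat output.
sdb_w_per_s = 2038.00
sdb_wmb_per_s = 3.98
sdb_avgrq_sectors = 4.00

from_sectors_kb = sdb_avgrq_sectors * SECTOR_BYTES / 1024.0
from_bandwidth_kb = avg_request_kb(sdb_wmb_per_s, sdb_w_per_s)

# If every IO were a full 4KB data write, the same bandwidth would need
# only about half as many IOs per second.
iops_if_all_4k = sdb_wmb_per_s * 1024.0 / 4.0

print("avg request from avgrq-sz : %.2f KB" % from_sectors_kb)
print("avg request from wMB/s/w/s: %.2f KB" % from_bandwidth_kb)
print("IOPS if all writes were 4K: %.0f" % iops_if_all_4k)
```

Both ways give ~2KB, and the 4KB-only IOPS is roughly half of w/s, which fits
one near-empty IO accompanying each 4KB data write.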

I also checked the Filestore data: this behavior is not present in Filestore,
and changing the WBThrottle affects the number. So it seems this behavior is
related to the flushing mechanism? In Newstore we are doing fdatasync more
aggressively.



2. Note that by tuning write_buffer_size, write_buffer_num and
min_write_buffer_number_to_merge, we can bring the DB writes down to ZERO.

Look at the iostat of sdc: there is almost no IO happening there, because most
of the WAL entries were merged before being flushed to Level0.

The other RocksDB tunings were originally meant to optimize the compaction
behavior, but since very little data is written to Level0, compaction is
almost unmeasurable here.
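
A toy model of why merging memtables before flush can leave almost nothing to
write. This is not RocksDB code; the key names are made up, and it assumes a
delete that shadows a put in the merged set cancels it completely (real
RocksDB may still have to keep a tombstone in some cases).

```python
def merge_memtables(memtables):
    """Merge memtables (oldest first) and return what would be flushed.

    Each memtable is a dict of key -> value, where a value of None stands
    for a delete. Only the newest entry per key survives the merge, and in
    this toy model keys whose newest entry is a delete are dropped.
    """
    latest = {}
    for memtable in memtables:
        latest.update(memtable)
    return {k: v for k, v in latest.items() if v is not None}

# Hypothetical contents: WAL entries are written, applied, then deleted.
older = {"wal_1": b"data1", "wal_2": b"data2", "onode_a": b"meta"}
newer = {"wal_1": None, "wal_2": None}

flushed = merge_memtables([older, newer])
print(sorted(flushed))  # only the long-lived key is left to flush
```

With large write buffers and min_write_buffer_number_to_merge = 2, the
short-lived WAL keys have a good chance of being cancelled this way before
anything reaches Level0.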


3. Disabling the RocksDB WAL can 3X the performance (although this is
definitely the WRONG WAY).

I was just curious what the performance would look like if no extra IO
happened on the DB side. With the RocksDB WAL turned off, the performance is
3x (799 -> 2464, lat from 10 -> 3.2).
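
As a quick consistency check on these two data points, Little's law (in-flight
requests = throughput * latency) can be applied. This sketch assumes the
799/2464 numbers are IOPS and the 10/3.2 latencies are in milliseconds; the
mail itself does not state the units.

```python
def implied_queue_depth(iops: float, latency_ms: float) -> float:
    """Little's law: average in-flight requests = throughput * latency."""
    return iops * latency_ms / 1000.0

# Numbers from the WAL on/off comparison; units are an assumption (IOPS, ms).
with_wal = implied_queue_depth(799, 10.0)
without_wal = implied_queue_depth(2464, 3.2)
speedup = 2464 / 799

print("implied QD with WAL   : %.2f" % with_wal)
print("implied QD without WAL: %.2f" % without_wal)
print("speedup               : %.2fx" % speedup)
```

Under those assumptions both cases come out close to the fio iodepth of 8, so
the client keeps its queue full and the gain comes purely from lower
per-request latency.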

4. The avg queue size is <1 in all cases, for both the DB WAL part and the
fragment part.

I guess there is some lock in rocksdb::WriteBatch() that prevents multiple
OSD_OP_THREADs from working concurrently; I have not analyzed this carefully.

An easy way to check might be to comment out db->submit_transaction(txc->t);
in NewStore::_txc_submit_kv, to see if we can get more QD in the fragment part
without issuing anything to the DB.


----------------------------------------------------------Configurations---------------------------------------------------------------------------------------------------------
       My setup is SSD based: 1 OSD, one pool with 100 PGs and size = 1. The
pattern I am working on is 4KB random write (QD=8) on top of RBD (using
fio-librbd). The FIO configuration is:
    bs=4k
    iodepth=8
    size=10g
    iodepth_batch_submit=1
    iodepth_batch_complete=1

       The tunings I am using are listed here; they might not be the best but
they already show something.
    rocksdb_stats_dump_period_sec = 5
    rocksdb_max_background_compactions = 4
    rocksdb_compaction_threads = 4
    rocksdb_write_buffer_size = 536870912  //512MB
    rocksdb_write_buffer_num = 4
    rocksdb_min_write_buffer_number_to_merge = 2
    rocksdb_level0_file_num_compaction_trigger = 4
    rocksdb_max_bytes_for_level_base = 104857600 //100MB
    rocksdb_target_file_size_base = 10485760      //10MB
    rocksdb_num_levels = 3 // So the MAX_DB_SIZE would be ~10GB (100MB * 10^2),
                           // fair enough.
    rocksdb_compression = none


                                                                                
                                                                                
                                                                                
                                                                                
                                                Xiaoxi

 
        
