FYI ---------- Forwarded message ---------- From: David Casier <david.cas...@aevoo.fr> Date: 2015-12-01 21:32 GMT+01:00 Subject: Re: Fwd: [newstore (again)] how disable double write WAL To: Sage Weil <s...@newdream.net> Cc: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>, Vish Maram-SSI <vishwanat...@ssi.samsung.com>, Ceph Development <ceph-devel@vger.kernel.org>, Benoît LORIOT <benoit.lor...@aevoo.fr>, pascal.billery-schnei...@laposte.net
Hi Sage,
With a standard disk (4 to 6 TB) and a small flash drive, it is easy to
create an ext4 filesystem with its metadata on flash.

Example with sdg1 on flash and sdb on HDD:

size_of() {
    blockdev --getsize $1
}

mkdmsetup() {
    _ssd=/dev/$1
    _hdd=/dev/$2
    _size_of_ssd=$(size_of $_ssd)
    echo "0 $_size_of_ssd linear $_ssd 0
$_size_of_ssd $(size_of $_hdd) linear $_hdd 0" | dmsetup create dm-${1}-${2}
}

mkdmsetup sdg1 sdb

mkfs.ext4 -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
    -E packed_meta_blocks=1,lazy_itable_init=0 -G 32768 -I 128 -i $((1024*512)) /dev/mapper/dm-sdg1-sdb

With that, all metadata blocks are on the SSD.
If the omap is on the SSD as well, there is almost no metadata left on the HDD.
The consequence is performance: Ceph (with a hack on the filestore, without
journal and with direct I/O) performs almost the same as the raw HDD.
Combined with cache tiering, it works very well! That is why we are working
on a hybrid HDD/flash approach, on ARM or Intel.

With newstore, it is much more difficult to control the I/O profile,
because RocksDB embeds its own intelligence.

In the (near) future, we will open a portal presenting our hardware design
under the CERN OHL license.

(My non-fluency in English explains the latency of my answers.)

2015-11-24 21:42 GMT+01:00 Sage Weil <s...@newdream.net>:
>
> On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> > Hello Vish,
> >
> > Please accept my apologies for the delay in my answer.
> > Following the conversation you had with my colleague David, here are
> > some more details about our work:
> >
> > We are working on Filestore / Newstore optimizations by studying how we
> > could set ourselves free from using the journal.
> >
> > It is very important to work with SSDs, but it is also mandatory to
> > combine them with regular magnetic platter disks. This is why we are
> > combining metadata storage on flash with data storage on disk.
>
> This is pretty common, and something we will support natively with
> newstore.
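The two-extent device-mapper layout that mkdmsetup builds can be sketched in isolation; build_dm_table and the sector counts below are illustrative stand-ins, not part of the original script:

```shell
# Emit a device-mapper "linear" table: the flash extent covers sectors
# [0, ssd_sectors) of the combined device, and the HDD extent starts
# immediately after it.  Because ext4 with packed_meta_blocks=1 packs
# its metadata at the front of the device, that metadata lands on flash.
build_dm_table() {
    _ssd_sectors=$1; _hdd_sectors=$2; _ssd_dev=$3; _hdd_dev=$4
    printf '0 %s linear %s 0\n' "$_ssd_sectors" "$_ssd_dev"
    printf '%s %s linear %s 0\n' "$_ssd_sectors" "$_hdd_sectors" "$_hdd_dev"
}

build_dm_table 1024 4096 /dev/sdg1 /dev/sdb
# 0 1024 linear /dev/sdg1 0
# 1024 4096 linear /dev/sdb 0
```

Piping that output to `dmsetup create` (as mkdmsetup does) produces a single block device whose first 1024 sectors are served by the flash partition.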
> > Our main goal is to have control over performance, which is quite
> > difficult with NewStore, and needs fundamental hacks with FileStore.
>
> Can you clarify what you mean by "quite difficult with NewStore"?
>
> FWIW, the latest bleeding edge code is currently at
> github.com/liewegas/wip-bluestore.
>
> sage
>
> > Is Samsung working on ARM boards with embedded flash and a SATA port, in
> > order to allow us to work on a hybrid approach? What is your line of
> > work with Ceph?
> >
> > How can we work together?
> >
> > Regards,
> > Sébastien
> >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 12 October 2015 20:52:26 UTC+2
> > > To: Sage Weil <s...@newdream.net>, Ceph Development
> > > <ceph-devel@vger.kernel.org>
> > > Cc: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>,
> > > benoit.lor...@aevoo.fr, Denis Saget <geo...@gmail.com>, "luc.petetin"
> > > <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Ok,
> > > Great.
> > >
> > > With these settings :
> > > //
> > > newstore_max_dir_size = 4096
> > > newstore_sync_io = true
> > > newstore_sync_transaction = true
> > > newstore_sync_submit_transaction = true
> > > newstore_sync_wal_apply = true
> > > newstore_overlay_max = 0
> > > //
> > >
> > > and direct I/O in the benchmark tool (fio),
> > >
> > > I see that the HDD is 100% busy and there is no transfer from /db to
> > > /fragments after stopping the benchmark: great!
> > >
> > > But when I launch a bench with random 256k blocks, I see random blocks
> > > between 32k and 256k on the HDD. Any idea?
> > >
> > > Throughput to the HDD is about 8 MBps when it could be higher with
> > > larger blocks (~30 MBps),
> > > and 70 MBps without fsync (hard drive cache disabled).
> > >
> > > Other questions :
> > > newstore_sync_io -> true = fsync immediately, false = fsync later
> > > (thread fsync_wq) ?
> > > newstore_sync_transaction -> true = sync in DB ?
> > > newstore_sync_submit_transaction -> if false then kv_queue (only if
> > > newstore_sync_transaction=false) ?
> > > newstore_sync_wal_apply = true -> if false then WAL later (thread
> > > wal_wq) ?
> > >
> > > Is that right?
> > >
> > > Is there a way to use a battery-backed cache (sync the DB and not the
> > > data)?
> > >
> > > Thanks for everything !
> > >
> > > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >>> Hello everybody,
> > >>> Is a fragment stored in rocksdb before being written to "/fragments"?
> > >>> I separated "/db" and "/fragments", but during the bench everything
> > >>> is written to "/db".
> > >>> I changed the "newstore_sync_*" options without success.
> > >>>
> > >>> Is there any way to write all metadata to "/db" and all data to
> > >>> "/fragments"?
> > >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >> But if you are overwriting an existing object, doing write-ahead logging
> > >> is usually unavoidable because we need to make the update atomic (and the
> > >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >> mitigates this somewhat for larger writes by limiting fragment size, but
> > >> for small IOs this is pretty much always going to be the case.  For small
> > >> IOs, though, putting things in db/ is generally better since we can
> > >> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >> often leave them there (via the 'overlay' behavior).
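Sage's point about write-ahead logging for overwrites can be illustrated with a toy sketch (wal_write, wal.log and the object file are made-up names for illustration; the real store logs into RocksDB, not a flat file):

```shell
# Toy write-ahead logging: record the intended update in a log first,
# apply it in place, then trim the log.  A crash between steps 1 and 3
# can be recovered by replaying wal.log, which is what makes the
# in-place overwrite atomic from the caller's point of view.
wal_write() {
    obj=$1; off=$2; data=$3
    printf '%s %s %s\n' "$obj" "$off" "$data" >> wal.log   # 1. log the intent
    printf '%s' "$data" |
        dd of="$obj" bs=1 seek="$off" conv=notrunc 2>/dev/null  # 2. apply in place
    : > wal.log                                            # 3. trim once applied
}

printf 'AAAAAAAA' > obj     # an existing 8-byte object
wal_write obj 2 BBB         # overwrite 3 bytes at offset 2
cat obj                     # AABBBAAA
```

A plain overwrite without the log would leave a torn object if the writer died mid-`dd`; with the log, the update can be replayed.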
> > >>
> > >> sage
> > >>
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > > DCConsulting SARL
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 2 November 2015 20:02:37 UTC+1
> > > To: "Vish (Vishwanath) Maram-SSI" <vishwanat...@ssi.samsung.com>
> > > Cc: benoit LORIOT <benoit.lor...@aevoo.fr>, Sébastien VALSEMEY
> > > <sebastien.valse...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > In FileStore, data and metadata are stored in files, with filesystem
> > > xattrs and omap.
> > > NewStore works with RocksDB.
> > > RocksDB has a lot of configuration options, but not all of them are
> > > exposed.
> > >
> > > The best way, in my view, is not to use the WAL, relying instead on a
> > > power-safe cache (for example an 845DC SSD).
> > > With good metadata optimisation, I don't think it is necessary to
> > > defer I/O.
> > >
> > > The problem with RocksDB is that it is not possible to control the I/O
> > > block size.
> > >
> > > We will resume work on NewStore soon.
> > >
> > > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Thanks David for the reply.
> > >>
> > >> We just wanted to know how different it is from Filestore and how we
> > >> can contribute to this. My motive is to first understand the design
> > >> of Newstore and find the performance loopholes so that we can try
> > >> looking into them.
> > >>
> > >> It would be helpful if you could share your ideas for using Newstore,
> > >> and your configuration. What plans do you have for contributions?
> > >> This would help us understand and see if we can work together.
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >> From: David Casier [mailto:david.cas...@aevoo.fr]
> > >> Sent: Thursday, October 29, 2015 4:41 AM
> > >> To: Vish (Vishwanath) Maram-SSI
> > >> Cc: benoit LORIOT; Sébastien VALSEMEY
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Hi Vish,
> > >> It's OK.
> > >>
> > >> We have tested a lot of different configurations with newstore.
> > >>
> > >> What is your goal with it?
> > >>
> > >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung; I have started to play around
> > >> with Newstore and am observing some issues with running FIO.
> > >>
> > >> Can you please share the Ceph configuration file which you used to
> > >> run the I/Os using FIO?
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >> [...]
> > >>
> > >> sage
> > >>
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.cas...@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: Sage Weil <s...@newdream.net>
> > > Date: 12 October 2015 21:33:52 UTC+2
> > > To: David Casier <david.cas...@aevoo.fr>
> > > Cc: Ceph Development <ceph-devel@vger.kernel.org>, Sébastien VALSEMEY
> > > <sebastien.valse...@aevoo.fr>, benoit.lor...@aevoo.fr, Denis Saget
> > > <geo...@gmail.com>, "luc.petetin" <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi David-
> > >
> > > On Mon, 12 Oct 2015, David Casier wrote:
> > >> Ok,
> > >> Great.
> > >>
> > >> With these settings :
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >
> > > Is this a hard disk?  Those settings probably don't make sense since it
> > > does every IO synchronously, blocking the submitting IO path...
> > >
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> And direct I/O in the benchmark tool (fio).
> > >>
> > >> I see that the HDD is 100% busy and there is no transfer from /db to
> > >> /fragments after stopping the benchmark: great!
> > >>
> > >> But when I launch a bench with random 256k blocks, I see random blocks
> > >> between 32k and 256k on the HDD. Any idea?
> > >
> > > Random IOs have to be write ahead logged in rocksdb, which has its own IO
> > > pattern.  Since you made everything sync above I think it'll depend on
> > > how many osd threads get batched together at a time.. maybe.  Those
> > > settings aren't something I've really tested, and probably only make
> > > sense with very fast NVMe devices.
> > >
> > >> Throughput to the HDD is about 8 MBps when it could be higher with
> > >> larger blocks (~30 MBps),
> > >> and 70 MBps without fsync (hard drive cache disabled).
> > >>
> > >> Other questions :
> > >> newstore_sync_io -> true = fsync immediately, false = fsync later
> > >> (thread fsync_wq) ?
> > >
> > > yes
> > >
> > >> newstore_sync_transaction -> true = sync in DB ?
> > >
> > > synchronously do the rocksdb commit too
> > >
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false) ?
> > >
> > > yeah.. there is an annoying rocksdb behavior that makes an async
> > > transaction submit block if a sync one is in progress, so this queues them
> > > up and explicitly batches them.
> > >
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread
> > >> wal_wq) ?
> > >
> > > the txn commit completion threads can do the wal work synchronously.. this
> > > is only a good idea if it's doing aio (which it generally is).
> > >
> > >> Is that right?
> > >>
> > >> Is there a way to use a battery-backed cache (sync the DB and not the
> > >> data)?
> > >
> > > ?
> > > s
> > >
> > >> Thanks for everything !
> > >>
> > >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >>> [...]
> > >>>
> > >>> sage
> > >>>
> > >>
> > >> --
> > >> ________________________________________________________
> > >>
> > >> Regards,
> > >>
> > >> David CASIER
> > >> DCConsulting SARL
> > >>
> > >> 4 Trait d'Union
> > >> 77127 LIEUSAINT
> > >>
> > >> Direct line: 01 75 98 53 85
> > >> Email: david.casier@aevoo.fr
> > >> ________________________________________________________
> > >> --
> > >> To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
> > >> the body of a message to majord...@vger.kernel.org
> > >> More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 29 October 2015 12:41:22 UTC+1
> > > To: "Vish (Vishwanath) Maram-SSI" <vishwanat...@ssi.samsung.com>
> > > Cc: benoit LORIOT <benoit.lor...@aevoo.fr>, Sébastien VALSEMEY
> > > <sebastien.valse...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > [...]
> > >>>
> > >>> sage
> > >>>
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.cas...@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: "Vish (Vishwanath) Maram-SSI" <vishwanat...@ssi.samsung.com>
> > > Date: 29 October 2015 17:30:56 UTC+1
> > > To: David Casier <david.cas...@aevoo.fr>
> > > Cc: benoit LORIOT <benoit.lor...@aevoo.fr>, Sébastien VALSEMEY
> > > <sebastien.valse...@aevoo.fr>
> > > Subject: RE: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > [...]
> > >
> > > sage
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.cas...@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 14 October 2015 22:03:38 UTC+2
> > > To: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>,
> > > benoit.lor...@aevoo.fr
> > > Cc: Denis Saget <geo...@gmail.com>, "luc.petetin" <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Good evening gentlemen,
> > > I have just been through the first real Ceph fire.
> > > Loic Dachary backed me up well on it.
> > >
> > > I can tell you one thing: however well you think you master the product,
> > > it is during an incident that you realize how many factors you need to
> > > know by heart.
> > > So, no panic: I really take tonight's experience as a success and as an
> > > excellent boost.
> > >
> > > What happened:
> > > - LI played a bit too much with the crushmap (I will go into the fine
> > > technical details another day)
> > > - Update and restart of the OSDs
> > > - The OSDs no longer knew where the data was
> > > - Rebuilt the crushmap by hand, and off it went.
> > >
> > > Nothing really serious in itself, and a big plus (++++) for our image at
> > > LI (I would have lost 1 to 2 more hours without Loic, who stayed focused).
> > >
> > > Conclusion:
> > > We are going to work together on stress tests, a bit like RedHat
> > > validations: one platform, I break it, you repair it.
> > > You will have as much time as you need to find the problem (I have
> > > sometimes spent a few days on some of them).
> > >
> > > Goals:
> > > - Master a checklist of things to verify
> > > - Replay it every week if there are many mistakes
> > > - Every month if there are a few mistakes
> > > - Every 3 months once well mastered
> > > - ...
> > >
> > > We need to be at the top of our game, and some things must become
> > > reflexes (checking the crushmap, knowing how to find the data without
> > > the processes, ...).
> > > Above all, the customer must be reassured in case of an incident (or not).
> > >
> > > And honestly, Ceph is really fascinating!
> > >
> > > On 10/12/2015 09:33 PM, Sage Weil wrote:
> > >> [...]
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Regards,
> > >
> > > David CASIER
> > > DCConsulting SARL
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Direct line: 01 75 98 53 85
> > > Email: david.cas...@aevoo.fr
> > > ________________________________________________________
> >

--
________________________________________________________

Regards,

David CASIER
3B
Rue Taylor, CS20004
75481 PARIS Cedex 10

Direct line: 01 75 98 53 85
Email: david.cas...@aevoo.fr
________________________________________________________
--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html