FYI
---------- Forwarded message ----------
From: David Casier <david.cas...@aevoo.fr>
Date: 2015-12-01 21:32 GMT+01:00
Subject: Re: Fwd: [newstore (again)] how disable double write WAL
To: Sage Weil <s...@newdream.net>
Cc: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>, Vish Maram-SSI
<vishwanat...@ssi.samsung.com>, Ceph Development
<ceph-devel@vger.kernel.org>, Benoît LORIOT <benoit.lor...@aevoo.fr>,
pascal.billery-schnei...@laposte.net


Hi Sage,
With a standard hard disk (4 to 6 TB) and a small flash device, it is easy
to create an ext4 filesystem with its metadata on flash.

Example with sdg1 on flash and sdb on HDD:

size_of() {
  # Device size in 512-byte sectors, the unit dmsetup expects
  blockdev --getsize "$1"
}

mkdmsetup() {
  _ssd=/dev/$1
  _hdd=/dev/$2
  _size_of_ssd=$(size_of "$_ssd")
  # Linear concatenation: the SSD sectors come first (metadata lands there),
  # the HDD is appended after them
  echo "0 $_size_of_ssd linear $_ssd 0
$_size_of_ssd $(size_of "$_hdd") linear $_hdd 0" | dmsetup create dm-${1}-${2}
}

mkdmsetup sdg1 sdb
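The two-line table that mkdmsetup pipes into dmsetup can be sketched in Python to check the arithmetic; the helper name and the device sizes below are made up for the example (sizes are in 512-byte sectors, as dmsetup expects):

```python
def dm_concat_table(ssd_sectors, hdd_sectors, ssd_dev, hdd_dev):
    """Build a dmsetup 'linear' concatenation table: the SSD occupies
    sectors [0, ssd_sectors) of the new device, the HDD follows it."""
    return (
        f"0 {ssd_sectors} linear {ssd_dev} 0\n"
        f"{ssd_sectors} {hdd_sectors} linear {hdd_dev} 0"
    )

# Hypothetical sizes: a 32 GiB flash partition and a 4 TB disk,
# converted to 512-byte sectors.
ssd = 32 * 1024**3 // 512
hdd = 4 * 10**12 // 512
print(dm_concat_table(ssd, hdd, "/dev/sdg1", "/dev/sdb"))
```

Because ext4 allocates its fixed metadata from the start of the device when asked to pack it, placing the flash segment first is what puts the metadata on flash.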

mkfs.ext4 \
  -O ^has_journal,flex_bg,^uninit_bg,^sparse_super,sparse_super2,^extra_isize,^dir_nlink,^resize_inode \
  -E packed_meta_blocks=1,lazy_itable_init=0 \
  -G 32768 -I 128 -i $((1024*512)) /dev/mapper/dm-sdg1-sdb

With that, all ext4 metadata blocks are on the SSD.

If the omaps are also on the SSD, almost no metadata remains on the HDD.
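A rough way to see why a small flash partition suffices: with -I 128 and -i $((1024*512)), ext4 allocates one 128-byte inode per 512 KiB of filesystem space, so the inode tables stay tiny relative to the data. A back-of-the-envelope sketch (the 6 TB figure is a made-up example; this ignores bitmaps, group descriptors and sparse_super2 backups, so it is only a lower bound):

```python
def inode_table_bytes(fs_bytes, bytes_per_inode=1024 * 512, inode_size=128):
    """Approximate total size of the ext4 inode tables: one inode of
    inode_size bytes per bytes_per_inode of filesystem space (mkfs -i / -I)."""
    inode_count = fs_bytes // bytes_per_inode
    return inode_count * inode_size

# Hypothetical 6 TB HDD: the inode tables come to well under 2 GiB,
# which fits comfortably at the front of the concatenated dm device.
tb6 = 6 * 10**12
print(inode_table_bytes(tb6) / 1024**3)  # in GiB
```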

Consequence: Ceph performance (with a hack on FileStore to drop the journal
and use direct I/O) is almost the same as the raw performance of the HDD.

With a cache tier, it's very cool!

That is why we are working on a hybrid HDD/flash approach, on ARM or Intel.

With newstore, it is much more difficult to control the I/O profile,
because RocksDB embeds its own placement logic.

In the (near) future, we will open a portal presenting our hardware
solution under the CERN OHL license.

(My lack of fluency in English explains the latency of my answers.)

2015-11-24 21:42 GMT+01:00 Sage Weil <s...@newdream.net>:
>
> On Tue, 24 Nov 2015, Sébastien VALSEMEY wrote:
> > Hello Vish,
> >
> > Please apologize for the delay in my answer.
> > Following the conversation you had with my colleague David, here are
> > some more details about our work :
> >
> > We are working on Filestore / Newstore optimizations by studying how we
> > could set ourselves free from using the journal.
> >
> > It is very important to work with SSD, but it is also mandatory to
> > combine it with regular magnetic platter disks. This is why we are
> > combining metadata storing on flash with data storing on disk.
>
> This is pretty common, and something we will support natively with
> newstore.
>
> > Our main goal is to have the control on performance. Which is quite
> > difficult with NewStore, and needs fundamental hacks with FileStore.
>
> Can you clarify what you mean by "quite difficult with NewStore"?
>
> FWIW, the latest bleeding edge code is currently at
> github.com/liewegas/wip-bluestore.
>
> sage
>
>
> > Is Samsung working on ARM boards with embedded flash and a SATA port, in
> > order to allow us to work on a hybrid approach? What is your line of
> > work with Ceph?
> >
> > How can we work together ?
> >
> > Regards,
> > Sébastien
> >
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 12 October 2015 20:52:26 UTC+2
> > > To: Sage Weil <s...@newdream.net>, Ceph Development 
> > > <ceph-devel@vger.kernel.org>
> > > Cc: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>, 
> > > benoit.lor...@aevoo.fr, Denis Saget <geo...@gmail.com>, "luc.petetin" 
> > > <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Ok,
> > > Great.
> > >
> > > With these settings:
> > > //
> > > newstore_max_dir_size = 4096
> > > newstore_sync_io = true
> > > newstore_sync_transaction = true
> > > newstore_sync_submit_transaction = true
> > > newstore_sync_wal_apply = true
> > > newstore_overlay_max = 0
> > > //
> > >
> > > And direct I/O in the benchmark tool (fio).
> > >
> > > I see that the HDD is 100% loaded and there is no transfer from /db to 
> > > /fragments after stopping the benchmark: great!
> > >
> > > But when I launch a bench with random 256k blocks, I see random blocks 
> > > between 32k and 256k on the HDD. Any idea?
> > >
> > > Throughput to the HDD is about 8 MB/s when it could be higher with 
> > > larger blocks (~30 MB/s),
> > > and 70 MB/s without fsync (hard drive cache disabled).
> > >
> > > Other questions:
> > > newstore_sync_io -> true = fsync immediately, false = fsync later (thread 
> > > fsync_wq)?
> > > newstore_sync_transaction -> true = sync in DB?
> > > newstore_sync_submit_transaction -> if false then kv_queue (only if 
> > > newstore_sync_transaction=false)?
> > > newstore_sync_wal_apply = true -> if false then WAL later (thread 
> > > wal_wq)?
> > >
> > > Is that right?
> > >
> > > Is there a way to use a battery-backed cache (sync DB and no sync of data)?
> > >
> > > Thanks for everything !
> > >
> > > On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >> On Mon, 12 Oct 2015, David Casier wrote:
> > >>> Hello everybody,
> > >>> Is a fragment stored in rocksdb before being written to "/fragments"?
> > >>> I separated "/db" and "/fragments", but during the bench everything is 
> > >>> written
> > >>> to "/db".
> > >>> I changed the "newstore_sync_*" options without success.
> > >>>
> > >>> Is there any way to write all metadata to "/db" and all data to 
> > >>> "/fragments"?
> > >> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >> But if you are overwriting an existing object, doing write-ahead logging
> > >> is usually unavoidable because we need to make the update atomic (and the
> > >> underlying posix fs doesn't provide that).  The wip-newstore-frags branch
> > >> mitigates this somewhat for larger writes by limiting fragment size, but
> > >> for small IOs this is pretty much always going to be the case.  For small
> > >> IOs, though, putting things in db/ is generally better since we can
> > >> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >> often leave them there (via the 'overlay' behavior).
> > >>
> > >> sage
> > >>
> > >
> > >
> > > --
> > > ________________________________________________________
> > >
> > > Cordialement,
> > >
> > > David CASIER
> > > DCConsulting SARL
> > >
> > >
> > > 4 Trait d'Union
> > > 77127 LIEUSAINT
> > >
> > > Ligne directe: 01 75 98 53 85
> > > Email: david.casier@aevoo.fr
> > > ________________________________________________________
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 2 November 2015 20:02:37 UTC+1
> > > To: "Vish (Vishwanath) Maram-SSI" <vishwanat...@ssi.samsung.com>
> > > Cc: benoit LORIOT <benoit.lor...@aevoo.fr>, Sébastien VALSEMEY 
> > > <sebastien.valse...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi Vish,
> > > In FileStore, data and metadata are stored in files, with xattrs and 
> > > omap.
> > > NewStore works with RocksDB.
> > > RocksDB has a lot of configuration options, but not all of them are 
> > > exposed.
> > >
> > > The best way, in my view, is not to use the logs, together with a 
> > > secure cache (for example an SSD 845DC).
> > > I don't think it is necessary to defer I/O if the metadata is well 
> > > optimised.
> > >
> > > The problem with RocksDB is that it is not possible to control the I/O 
> > > block size.
> > >
> > > We will resume work on NewStore soon.
> > >
> > > On 10/29/2015 05:30 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Thanks David for the reply.
> > >>
> > >> Yeah, we just wanted to know how different it is from Filestore and how 
> > >> we can contribute to it. My motive is to first understand the design 
> > >> of Newstore and find the performance loopholes so that we can try 
> > >> looking into them.
> > >>
> > >> It would be helpful if you could share your ideas for using Newstore 
> > >> and its configuration, and what contribution plans you have, to help us 
> > >> understand and see if we can work together.
> > >>
> > >> Thanks,
> > >> -Vish
> > >> From: David Casier [mailto:david.cas...@aevoo.fr 
> > >> <mailto:david.cas...@aevoo.fr>]
> > >> Sent: Thursday, October 29, 2015 4:41 AM
> > >> To: Vish (Vishwanath) Maram-SSI
> > >> Cc: benoit LORIOT; Sébastien VALSEMEY
> > >> Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >>
> > >> Hi Vish,
> > >> It's OK.
> > >>
> > >> We have a lot of different configurations in our newstore tests.
> > >>
> > >> What is your goal with it?
> > >>
> > >> On 10/28/2015 11:02 PM, Vish (Vishwanath) Maram-SSI wrote:
> > >> Hi David,
> > >>
> > >> Sorry for sending you the mail directly.
> > >>
> > >> This is Vishwanath Maram from Samsung; I have started to play around 
> > >> with Newstore and I am observing some issues when running FIO.
> > >>
> > >> Can you please share the Ceph configuration file which you used to 
> > >> run the I/Os with FIO?
> > >>
> > >> Thanks,
> > >> -Vish
> > >>
> > >
> > >
> > > Begin forwarded message:
> > >
> > > From: Sage Weil <s...@newdream.net>
> > > Date: 12 October 2015 21:33:52 UTC+2
> > > To: David Casier <david.cas...@aevoo.fr>
> > > Cc: Ceph Development <ceph-devel@vger.kernel.org>, Sébastien VALSEMEY 
> > > <sebastien.valse...@aevoo.fr>, benoit.lor...@aevoo.fr, Denis Saget 
> > > <geo...@gmail.com>, "luc.petetin" <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Hi David-
> > >
> > > On Mon, 12 Oct 2015, David Casier wrote:
> > >> Ok,
> > >> Great.
> > >>
> > >> With these settings:
> > >> //
> > >> newstore_max_dir_size = 4096
> > >> newstore_sync_io = true
> > >> newstore_sync_transaction = true
> > >> newstore_sync_submit_transaction = true
> > >
> > > Is this a hard disk?  Those settings probably don't make sense since it
> > > does every IO synchronously, blocking the submitting IO path...
> > >
> > >> newstore_sync_wal_apply = true
> > >> newstore_overlay_max = 0
> > >> //
> > >>
> > >> And direct I/O in the benchmark tool (fio).
> > >>
> > >> I see that the HDD is 100% loaded and there is no transfer from /db to
> > >> /fragments after stopping the benchmark: great!
> > >>
> > >> But when I launch a bench with random 256k blocks, I see random blocks
> > >> between 32k and 256k on the HDD. Any idea?
> > >
> > > Random IOs have to be write ahead logged in rocksdb, which has its own IO
> > > pattern.  Since you made everything sync above I think it'll depend on
> > > how many osd threads get batched together at a time.. maybe.  Those
> > > settings aren't something I've really tested, and probably only make
> > > sense with very fast NVMe devices.
> > >
> > >> Throughput to the HDD is about 8 MB/s when it could be higher with 
> > >> larger blocks (~30 MB/s),
> > >> and 70 MB/s without fsync (hard drive cache disabled).
> > >>
> > >> Other questions:
> > >> newstore_sync_io -> true = fsync immediately, false = fsync later (thread
> > >> fsync_wq)?
> > >
> > > yes
> > >
> > >> newstore_sync_transaction -> true = sync in DB ?
> > >
> > > synchronously do the rocksdb commit too
> > >
> > >> newstore_sync_submit_transaction -> if false then kv_queue (only if
> > >> newstore_sync_transaction=false) ?
> > >
> > > yeah.. there is an annoying rocksdb behavior that makes an async
> > > transaction submit block if a sync one is in progress, so this queues them
> > > up and explicitly batches them.
> > >
> > >> newstore_sync_wal_apply = true -> if false then WAL later (thread 
> > >> wal_wq) ?
> > >
> > > the txn commit completion threads can do the wal work synchronously.. this
> > > is only a good idea if it's doing aio (which it generally is).
> > >
> > >> Is that right?
> > >>
> > >> Is there a way to use a battery-backed cache (sync DB and no sync of data)?
> > >
> > > ?
> > > s
> > >
> > >>
> > >> Thanks for everything !
> > >>
> > >> On 10/12/2015 03:01 PM, Sage Weil wrote:
> > >>> On Mon, 12 Oct 2015, David Casier wrote:
> > >>>> Hello everybody,
> > >>>> Is a fragment stored in rocksdb before being written to "/fragments"?
> > >>>> I separated "/db" and "/fragments", but during the bench everything is
> > >>>> written
> > >>>> to "/db".
> > >>>> I changed the "newstore_sync_*" options without success.
> > >>>>
> > >>>> Is there any way to write all metadata to "/db" and all data to
> > >>>> "/fragments"?
> > >>> You can set newstore_overlay_max = 0 to avoid most data landing in db/.
> > >>> But if you are overwriting an existing object, doing write-ahead logging
> > >>> is usually unavoidable because we need to make the update atomic (and 
> > >>> the
> > >>> underlying posix fs doesn't provide that).  The wip-newstore-frags 
> > >>> branch
> > >>> mitigates this somewhat for larger writes by limiting fragment size, but
> > >>> for small IOs this is pretty much always going to be the case.  For 
> > >>> small
> > >>> IOs, though, putting things in db/ is generally better since we can
> > >>> combine many small ios into a single (rocksdb) journal/wal write.  And
> > >>> often leave them there (via the 'overlay' behavior).
> > >>>
> > >>> sage
> > >>>
> > >>
> > >>
> > >>
> > > Begin forwarded message:
> > >
> > > From: David Casier <david.cas...@aevoo.fr>
> > > Date: 14 October 2015 22:03:38 UTC+2
> > > To: Sébastien VALSEMEY <sebastien.valse...@aevoo.fr>, 
> > > benoit.lor...@aevoo.fr
> > > Cc: Denis Saget <geo...@gmail.com>, "luc.petetin" <luc.pete...@aevoo.fr>
> > > Subject: Re: Fwd: [newstore (again)] how disable double write WAL
> > >
> > > Good evening gentlemen,
> > > I have just been through the first real Ceph fire.
> > > Loic Dachary backed me up well on this one.
> > >
> > > I can tell you one thing: we may well think we have mastered the 
> > > product, but it is during an incident that you realise how many factors 
> > > you have to know by heart.
> > > So, no panic; I genuinely take tonight's experience as a success and as 
> > > an excellent boost.
> > >
> > > What happened:
> > >  - LI played a little too much with the crushmap (I will go into the 
> > > fine technical details another day)
> > >  - The OSDs were upgraded and restarted
> > >  - The OSDs no longer knew where the data was
> > >  - The crushmap was rebuilt by hand, and off it went again.
> > >
> > > Nothing really serious in itself, and a big plus (++++) for our image 
> > > with LI (I would have lost one or two more hours without Loic, who 
> > > stayed focused)
> > >
> > > Conclusion:
> > > We are going to work together on stress tests, a bit like the RedHat 
> > > validations: one platform, I break it, you repair it.
> > > You will have as much time as you need to find the problem (I have 
> > > occasionally spent several days on some of these).
> > >
> > > Goals:
> > >  - Master a checklist of things to verify
> > >  - Replay it every week if there are many mistakes
> > >  - Every month if there are a few mistakes
> > >  - Every 3 months once it is well mastered
> > >  - ...
> > >
> > > We have to be at the top, and some things must become reflexes 
> > > (checking the crushmap, knowing how to find the data without the 
> > > processes, ...).
> > > Above all, the customer must feel reassured in case of incident (or 
> > > not).
> > >
> > > And frankly, Ceph really is fascinating!
> >
> >




-- 

________________________________________________________

Cordialement,

David CASIER


3B Rue Taylor, CS20004
75481 PARIS Cedex 10 Paris

Ligne directe: 01 75 98 53 85
Email: david.cas...@aevoo.fr
________________________________________________________



--
To unsubscribe from this list: send the line "unsubscribe ceph-devel" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
