On Thu, Sep 27, 2012 at 3:21 AM, Sijie Guo <[email protected]> wrote:

> > I took a look at the leveldb homepage and it says - "Only a single process
> > (possibly multi-threaded) can access a particular database at a time."
> > This is bad because it means we can't run the console or any recovery
> > related operations while the bookies are running.
>
> Yes, leveldb is single-process. It prevents misuse by acquiring a lock from
> the filesystem. We indeed can't run the console to look at its data while
> the bookie is running.
>
> I am confused about the 'recovery' operations you mentioned. What kind of
> recovery?
>
I was talking about BookKeeperAdmin and how it's used from BookKeeperTools.
I believe that we currently only support replicating a bookie onto another
bookie. We might want to support more operations of this nature in the future.
These might need simultaneous access to the log files of a live bookie.

>
> > When we flush to the log file, we
> > should simply flush entries sorted by the ledger id as key.
>
> If we want to sort when flushing, we need to buffer all edits in memory,
> then flush. I assume this approach would act the same as an LSM tree (which
> is what leveldb does).
>
Yes, quite similar. I have only glanced over LSM trees; I'll take a more
detailed look soon.
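
To make the sort-on-flush idea concrete, here is a minimal sketch. It is not
BookKeeper's actual entry logger: the class SortedFlushBuffer and its methods
are hypothetical names made up for illustration. It assumes all pending
entries fit in memory, and it only returns the order in which entries would
be written instead of touching a log file.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory write buffer; not part of the BookKeeper code base. */
final class SortedFlushBuffer {
    // ledgerId -> (entryId -> payload). TreeMap keeps both levels sorted by key.
    private final TreeMap<Long, TreeMap<Long, byte[]>> pending = new TreeMap<>();

    /** Buffers an entry until the next flush. */
    void add(long ledgerId, long entryId, byte[] data) {
        pending.computeIfAbsent(ledgerId, id -> new TreeMap<>()).put(entryId, data);
    }

    /**
     * Returns "ledgerId:entryId" keys in the order they would be written,
     * grouped by ledger and sorted by entry id, then clears the buffer.
     */
    List<String> flushOrder() {
        List<String> order = new ArrayList<>();
        for (Map.Entry<Long, TreeMap<Long, byte[]>> ledger : pending.entrySet()) {
            for (Long entryId : ledger.getValue().keySet()) {
                order.add(ledger.getKey() + ":" + entryId);
            }
        }
        pending.clear();
        return order;
    }

    public static void main(String[] args) {
        SortedFlushBuffer buffer = new SortedFlushBuffer();
        // Entries arrive interleaved across ledgers 7 and 3.
        buffer.add(7L, 0L, new byte[0]);
        buffer.add(3L, 1L, new byte[0]);
        buffer.add(7L, 1L, new byte[0]);
        buffer.add(3L, 0L, new byte[0]);
        System.out.println(buffer.flushOrder());
    }
}
```

Entries of the same ledger end up adjacent in the flush order, which is the
locality we are after. The cost is the one Sijie points out: everything has to
stay buffered in memory until the flush.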

>
> On Thu, Sep 27, 2012 at 9:39 AM, Aniruddha Laud <[email protected]
> >wrote:
>
> > On Wed, Sep 26, 2012 at 5:06 PM, Sijie Guo <[email protected]> wrote:
> >
> > > It sounds good to have a thread pool (with a configurable number of
> > > threads) for reading from entry log files.
> > >
> > > > We can have a write threadpool (and we should always keep this
> > > > lower than the number of processors) to process the add requests.
> > >
> > > One more point: since we have only one active entry log file accepting
> > > writes, I don't think multiple write threads would help now, because
> > > adding an entry to the entry log file is a synchronized method.
> > >
> > I was talking about writing to the journal file. From what I understand,
> > log file entries are flushed periodically by one thread and that is okay.
> > The publish latencies on hedwig are dependent on the journal writes, though.
> >
> > >
> > > In order to utilize the disk bandwidth more efficiently, we might need
> > > to have one active entry log file accepting writes per ledger disk. But
> > > that might require some directory layout changes (like logId; currently
> > > we use an incrementing log id for the whole bookie) and logic changes.
> > > It would be a separate task if we did that.
> > >
> > > > Another thing we could possibly look at is reordering our writes to
> > > > the log file to try and maintain locality for ledger entries.
> > >
> > > We have an internal prototype that uses leveldb to store 1) small entry
> > > data (on the order of hundreds of bytes or less) and 2) the ledger index
> > > for large entries (acting as the ledger cache for index entries). With
> > > leveldb, 1) we could have a more efficient cache when there is a large
> > > number of ledgers and size skew between ledgers, and 2) we could have
> > > data belonging to the same ledger clustered when writing to disk, which
> > > to some extent achieves the 'reordering writes' you mentioned.
> > >
> > I took a look at the leveldb homepage and it says - "Only a single process
> > (possibly multi-threaded) can access a particular database at a time."
> > This is bad because it means we can't run the console or any recovery
> > related operations while the bookies are running. I may be wrong, though.
> > What I had in mind was pretty simple. When we flush to the log file, we
> > should simply flush entries sorted by the ledger id as key. Some changes
> > might be needed to the ledger index cache, but I'm not very sure what the
> > changes would be. What do you think?
> >
> > >
> > > -Sijie
> > >
> > > On Thu, Sep 27, 2012 at 12:52 AM, Aniruddha Laud
> > > <[email protected]>wrote:
> > >
> > > > Hi all,
> > > >
> > > > > Those stats I pasted might be a little misleading as they show the
> > > > > average over a couple of minutes. Whenever there are reads to the
> > > > > ledger disks, the queue size on them is sometimes as high as 100.
> > > > > Also, the CPU utilization has been lower than 10% throughout and the
> > > > > process will continue to remain I/O bound even if we introduce more
> > > > > threads (as the CPU remains idle while doing I/O).
> > > >
> > > > > A couple of observations about the write path. We currently have a
> > > > > hard-coded buffer size for journal writes (I believe it's 512KB) and
> > > > > we flush to the disk when this fills up or if there is no entry to
> > > > > process (which is highly unlikely in case of a high throughput
> > > > > application running on top). We should make this buffer size
> > > > > configurable. Now, with more threads, we can process more packets in
> > > > > parallel and this buffer can be filled up faster. We can have a write
> > > > > threadpool (and we should always keep this lower than the number of
> > > > > processors) to process the add requests.
> > > >
> > > > > For read requests, a configurable number of worker threads would be
> > > > > ideal and we could let the user tune it depending on the kind of read
> > > > > patterns they expect. Given that ledgers are interleaved ATM, I would
> > > > > expect the performance to increase linearly with the number of
> > > > > threads till a certain point and then level out.
> > > >
> > > > > Another thing we could possibly look at is reordering our writes to
> > > > > the log file to try and maintain locality for ledger entries. This
> > > > > might reduce the number of random seeks we do in case only a small
> > > > > number of ledgers are lagging.
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Aniruddha.
> > > >
> > > > On Wed, Sep 26, 2012 at 2:55 AM, Rakesh R <[email protected]>
> wrote:
> > > >
> > > > > >>>One question: what is multi-ledgers?
> > > > > > multiple ledger directories (multiple disks)
> > > > >
> > > > > >>>CPU utilization might not be largely affected if the threads are
> > > > > sitting there waiting on IO
> > > > > Ok, seems I got it.
> > > > > > If one thread spends most of its time waiting for I/O completion
> > > > > > instead of using the CPU, but that does not mean "we've hit the
> > > > > > system I/O bandwidth limit", then IMHO having multiple threads (or
> > > > > > asynchronous I/O) might improve performance (by enabling more than
> > > > > > one concurrent I/O operation).
> > > > >
> > > > > -Rakesh
> > > > > ________________________________________
> > > > > From: Flavio Junqueira [[email protected]]
> > > > > Sent: Wednesday, September 26, 2012 2:17 PM
> > > > > To: [email protected]
> > > > > Subject: Re: High latencies observed at the bookkeeper client while
> > > > > reading entries
> > > > >
> > > > > CPU utilization might not be largely affected if the threads are
> > > sitting
> > > > > there waiting on IO. In my understanding of the proposal so far,
> the
> > > idea
> > > > > is to have multiple threads only to perform IO.
> > > > >
> > > > > One question: what is multi-ledgers?
> > > > >
> > > > > -Flavio
> > > > >
> > > > >
> > > > > On Sep 26, 2012, at 7:52 AM, Rakesh R wrote:
> > > > >
> > > > > > > I'm just adding one more point:
> > > > > > >
> > > > > > > Increasing the number of threads can affect CPU utilization too.
> > > > > > > We should consider this, and it would be good to observe whether
> > > > > > > the workload is more I/O bound than CPU bound. However, it
> > > > > > > depends in great detail on the disks and how much CPU work other
> > > > > > > threads are doing before they, too, end up waiting on those disks.
> > > > > > >
> > > > > > > I'm also thinking in line with Flavio's suggestion to have one
> > > > > > > thread per ledger/journal device. Multithreading can help us with
> > > > > > > I/O bound problems if the I/O is performed against different disks.
> > > > > > >
> > > > > > > From the iostat report: waiting time of ledger directories. It
> > > > > > > shows we have options to fully utilize the disk bandwidth.
> > > > > >
> > > > > > multi-ledgers disk usage:
> > > > > > avgqu-sz
> > > > > > 1.10
> > > > > > 0.12
> > > > > > 0.54
> > > > > > 0.13
> > > > > >
> > > > > > -Rakesh
> > > > > > ________________________________________
> > > > > > From: Sijie Guo [[email protected]]
> > > > > > Sent: Wednesday, September 26, 2012 5:58 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: High latencies observed at the bookkeeper client
> while
> > > > > reading entries
> > > > > >
> > > > > > > One more point is that each write/read request to entry log files
> > > > > > > is converted to a write/read of an 8K blob of data, since you
> > > > > > > use BufferedChannel. For write requests, a larger write size is
> > > > > > > OK. Read requests, however, are almost random. Even if you read
> > > > > > > a larger blob, the blob might be useless when the next read goes
> > > > > > > to another place. Furthermore, I don't think we need to maintain
> > > > > > > another fixed-length readBuffer in BufferedChannel; it hardly
> > > > > > > helps for random reads, and we could leverage the OS cache instead.
> > > > > >
> > > > > > On Wed, Sep 26, 2012 at 8:06 AM, Sijie Guo <[email protected]>
> > > wrote:
> > > > > >
> > > > > > >> For serving requests, we can either queue the requests in the
> > > > > > >> bookie server per channel (write/read are blocking operations),
> > > > > > >> or queue them in the OS kernel and let the block device queue
> > > > > > >> and schedule those I/O requests. I think Stu's point is to
> > > > > > >> leverage the block device's scheduling algorithm by issuing I/O
> > > > > > >> requests from multiple threads to fully utilize the disk
> > > > > > >> bandwidth.
> > > > > > >>
> > > > > > >> From the iostat reports provided by Aniruddha, the average queue
> > > > > > >> length and utilization percentage are not high, which means the
> > > > > > >> disks are idle most of the time. It makes sense to use multiple
> > > > > > >> threads to issue read requests. One write thread and several
> > > > > > >> read threads might work for each device.
> > > > > >> On Wed, Sep 26, 2012 at 5:06 AM, Flavio Junqueira <
> > > [email protected]
> > > > > >wrote:
> > > > > >>
> > > > > > >>> Hi Stu, I'm not sure I understand your point. If with one
> > > > > > >>> thread we are getting pretty high latency (case Aniruddha
> > > > > > >>> described), doesn't it mean we have a number of requests queued
> > > > > > >>> up? Adding more threads might only make the problem worse by
> > > > > > >>> queueing up even more requests. I'm possibly missing your
> > > > > > >>> point...
> > > > > >>>
> > > > > >>> -Flavio
> > > > > >>>
> > > > > >>> On Sep 25, 2012, at 9:37 PM, Stu Hood wrote:
> > > > > >>>
> > > > > > >>>> Separating by device would help, but will not allow the
> > > > > > >>>> devices to be fully utilized: in order to buffer enough io
> > > > > > >>>> commands into a disk's queue for the elevator algorithms to
> > > > > > >>>> kick in, you either need to use multiple threads per disk, or
> > > > > > >>>> native async IO (not trivially available within the JVM.)
> > > > > >>>>
> > > > > >>>> On Tue, Sep 25, 2012 at 2:23 AM, Flavio Junqueira <
> > > > [email protected]>
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> On Sep 25, 2012, at 10:55 AM, Aniruddha Laud wrote:
> > > > > >>>>>
> > > > > >>>>>> On Tue, Sep 25, 2012 at 1:35 AM, Flavio Junqueira <
> > > > > [email protected]>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > > >>>>>>> Just to add a couple of comments to the discussion,
> > > > > > >>>>>>> separating reads and writes into different threads should
> > > > > > >>>>>>> only help with queuing latency. It wouldn't help with IO
> > > > > > >>>>>>> latency.
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > > >>>>>> Yes, but with the current implementation, publish latencies
> > > > > > >>>>>> in hedwig suffer because of lagging subscribers. By
> > > > > > >>>>>> separating read and write queues, we can at least guarantee
> > > > > > >>>>>> that the write SLA is maintained (separate journal disk +
> > > > > > >>>>>> separate thread would ensure that writes are not affected by
> > > > > > >>>>>> read related seeks)
> > > > > >>>>>>
> > > > > >>>>>
> > > > > > >>>>> Agreed and based on my comment below, I was wondering if it
> > > > > > >>>>> wouldn't be best to separate traffic across threads by device
> > > > > > >>>>> instead of by operation type.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>>>
> > > > > > >>>>>>> Also, it sounds like a good idea to have at least one
> > > > > > >>>>>>> thread per ledger device. In the case of multiple ledger
> > > > > > >>>>>>> devices, if we use one single thread, then the performance
> > > > > > >>>>>>> of the bookie will be driven by the slowest disk, no?
> > > > > >>>>>>>
> > > > > >>>>>> yup, makes sense.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> -Flavio
> > > > > >>>>>>>
> > > > > >>>>>>> On Sep 25, 2012, at 10:24 AM, Ivan Kelly wrote:
> > > > > >>>>>>>
> > > > > > >>>>>>>>> Could you give some information on what those
> > > > > > >>>>>>>>> shortcomings are? Also, do let me know if you need any
> > > > > > >>>>>>>>> more information from our end.
> > > > > >>>>>>>> Off the top of my head:
> > > > > > >>>>>>>> - reads and writes are handled in the same thread (as you
> > > > > > >>>>>>>>   have observed)
> > > > > >>>>>>>> - each entry read requires a single RPC.
> > > > > >>>>>>>> - entries are read in parallel
> > > > > >>>>>>>
> > > > > > >>>>>> By parallel, you mean the BufferedChannel wrapper on top of
> > > > > > >>>>>> FileChannel, right?
> > > > > >>>>>>
> > > > > >>>>>>>>
> > > > > > >>>>>>>> Not all of these could result in the high latency you see,
> > > > > > >>>>>>>> but if each entry is being read separately, a sync on the
> > > > > > >>>>>>>> ledger disk in between will make a mess of the disk head
> > > > > > >>>>>>>> scheduling.
> > > > > >>>>>>>
> > > > > > >>>>>> Increasing the time interval between flushing log files
> > > > > > >>>>>> might possibly help in this case then?
> > > > > >>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> -Ivan
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>> Thanks for the help :)
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>>
> > > > >
> > > >
> > >
> >
>
