Hi all,

Those stats I pasted might be a little misleading, as they show the average
over a couple of minutes. Whenever there are reads to the ledger disks, the
queue size on them is sometimes as high as 100. Also, CPU utilization has
stayed below 10% throughout, and the process will remain I/O bound even if
we introduce more threads (since the CPU stays idle while doing I/O).

A couple of observations about the write path. We currently have a
hard-coded buffer size for journal writes (I believe it's 512KB), and we
flush to disk when this fills up or when there is no entry to process
(which is highly unlikely with a high-throughput application running on
top). We should make this buffer size configurable. With more threads, we
can process more packets in parallel and fill this buffer faster. We could
have a write thread pool (which we should always keep smaller than the
number of processors) to process the add requests.
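To make the buffer idea concrete, here is a minimal sketch of a journal buffer whose size comes from configuration rather than being hard-coded, flushing when it fills up or when the bookie goes idle. All names here (JournalBuffer, maxBufferSizeBytes) are illustrative, not actual BookKeeper classes or settings:

```java
import java.nio.ByteBuffer;

// Sketch only: a journal write buffer sized from configuration.
class JournalBuffer {
    private final ByteBuffer buffer;
    private int flushCount = 0;

    JournalBuffer(int maxBufferSizeBytes) {  // e.g. read from the bookie config
        this.buffer = ByteBuffer.allocate(maxBufferSizeBytes);
    }

    // Append an entry; flush first if the buffer cannot hold it.
    // (Assumes a single entry is never larger than the whole buffer.)
    void append(byte[] entry) {
        if (entry.length > buffer.remaining()) {
            flush();
        }
        buffer.put(entry);
    }

    // Also called when the request queue is empty (the idle flush).
    void flush() {
        if (buffer.position() == 0) {
            return;
        }
        // Real code would write the buffer contents to the journal
        // FileChannel here; this sketch only resets the buffer.
        buffer.clear();
        flushCount++;
    }

    int flushCount() { return flushCount; }
    int buffered()   { return buffer.position(); }
}
```

With a write thread pool feeding this, a larger configured buffer simply means more entries accumulate between flushes.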

For read requests, a configurable number of worker threads would be ideal;
we could let the user tune it depending on the kind of read patterns they
expect. Given that ledgers are interleaved at the moment, I would expect
performance to increase linearly with the number of threads up to a certain
point and then level off.
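As a sketch of what the configurable read workers could look like (the names ReadWorkerPool and numReadWorkerThreads are hypothetical, not actual BookKeeper settings):

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Sketch only: a pool of read worker threads whose size the operator tunes.
class ReadWorkerPool {
    private final ExecutorService workers;

    ReadWorkerPool(int numReadWorkerThreads) {
        this.workers = Executors.newFixedThreadPool(numReadWorkerThreads);
    }

    // Each read request becomes an independent task, so several disk reads
    // can be outstanding at once and the block device can reorder them.
    Future<byte[]> readEntry(long ledgerId, long entryId) {
        return workers.submit(() -> {
            // Real code would seek into the entry log and read the entry;
            // this sketch just returns a placeholder.
            return new byte[0];
        });
    }

    void shutdown() {
        workers.shutdown();
    }
}
```

The point of the pool is exactly the one raised earlier in the thread: with several reads in flight per disk, the kernel's elevator algorithm gets enough queued commands to work with.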

Another thing we could look at is reordering our writes to the log file to
try to maintain locality for ledger entries. This might reduce the number
of random seeks we do when only a small number of ledgers are lagging.
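The reordering idea could be as simple as sorting each pending batch by ledger id before appending it to the entry log, so entries of the same ledger land next to each other on disk. A minimal sketch, with hypothetical names (PendingWrite, WriteReorderer):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Sketch only: one pending write destined for the entry log.
class PendingWrite {
    final long ledgerId;
    final long entryId;

    PendingWrite(long ledgerId, long entryId) {
        this.ledgerId = ledgerId;
        this.entryId = entryId;
    }
}

class WriteReorderer {
    // Group a batch by ledger id before writing it out. The sort is stable,
    // so the per-ledger entry order within the batch is preserved.
    static List<PendingWrite> reorder(List<PendingWrite> batch) {
        List<PendingWrite> sorted = new ArrayList<>(batch);
        sorted.sort(Comparator.comparingLong((PendingWrite w) -> w.ledgerId));
        return sorted;
    }
}
```

A lagging subscriber reading one ledger sequentially would then hit longer contiguous runs instead of entries interleaved with every other ledger's.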

Thoughts?

Regards,
Aniruddha.

On Wed, Sep 26, 2012 at 2:55 AM, Rakesh R <[email protected]> wrote:

> >>>One question: what is multi-ledgers?
> multiple ledger directories (multiple disks)
>
> >>>CPU utilization might not be largely affected if the threads are
> sitting there waiting on IO
> Ok, I think I see it now.
> If one thread spends most of its time waiting for I/O completion instead
> of using the CPU, that does not by itself mean that "we've hit the system
> I/O bandwidth limit"; IMHO having multiple threads (or asynchronous I/O)
> might improve performance (by enabling more than one concurrent I/O
> operation).
>
> -Rakesh
> ________________________________________
> From: Flavio Junqueira [[email protected]]
> Sent: Wednesday, September 26, 2012 2:17 PM
> To: [email protected]
> Subject: Re: High latencies observed at the bookkeeper client while
> reading entries
>
> CPU utilization might not be largely affected if the threads are sitting
> there waiting on IO. In my understanding of the proposal so far, the idea
> is to have multiple threads only to perform IO.
>
> One question: what is multi-ledgers?
>
> -Flavio
>
>
> On Sep 26, 2012, at 7:52 AM, Rakesh R wrote:
>
> > Just adding one more point:
> >
> > Increasing the number of threads can raise CPU utilization too. We
> should keep this in mind and observe whether the process is more I/O
> bound than CPU bound. However, it depends in great detail on the disks
> and how much CPU work other threads are doing before they, too, end up
> waiting on those disks.
> >
> > I'm also thinking in line with Flavio's suggestion to have one thread per
> ledger/journal device. Multithreading can help with I/O-bound problems
> if the I/O is performed against different disks.
> >
> > From the iostat report, the waiting time on the ledger directories shows
> we still have room to fully utilize the disk bandwidth.
> >
> > multi-ledgers disk usage:
> > avgqu-sz
> > 1.10
> > 0.12
> > 0.54
> > 0.13
> >
> > -Rakesh
> > ________________________________________
> > From: Sijie Guo [[email protected]]
> > Sent: Wednesday, September 26, 2012 5:58 AM
> > To: [email protected]
> > Subject: Re: High latencies observed at the bookkeeper client while
> reading entries
> >
> > One more point: each write/read request to the entry log files is
> converted into writing/reading an 8KB blob, since BufferedChannel is used.
> For write requests, a larger write size is fine. Read requests, however,
> are almost random: even if you read a larger blob, it might be useless
> when the next read goes somewhere else. Moreover, I don't think we need to
> maintain another fixed-length readBuffer in BufferedChannel; it hardly
> helps for random reads, and we could rely on the OS cache instead.
> >
> > On Wed, Sep 26, 2012 at 8:06 AM, Sijie Guo <[email protected]> wrote:
> >
> >> For serving requests, we either queue them in the bookie server per
> >> channel (write/read are blocking operations), or queue them in the OS
> >> kernel and let the block device queue and schedule those I/O requests.
> >> I think Stu's point is to leverage the block device's scheduling
> >> algorithm by issuing I/O requests from multiple threads, to fully
> >> utilize the disk bandwidth.
> >>
> >> From the iostat reports provided by Aniruddha, the average queue length
> >> and utilization percentage are not high, which means the disks are idle
> >> most of the time. It makes sense to use multiple threads to issue read
> >> requests; one write thread and several read threads per device might
> >> work.
> >>
> >> On Wed, Sep 26, 2012 at 5:06 AM, Flavio Junqueira <[email protected]
> >wrote:
> >>
> >>> Hi Stu, I'm not sure I understand your point. If with one thread we are
> >>> getting pretty high latency (case Aniruddha described), doesn't it
> mean we
> >>> have a number of requests queued up? Adding more threads might only
> make
> >>> the problem worse by queueing up even more requests. I'm possibly
> missing
> >>> your point...
> >>>
> >>> -Flavio
> >>>
> >>> On Sep 25, 2012, at 9:37 PM, Stu Hood wrote:
> >>>
> >>>> Separating by device would help, but will not allow the devices to be
> >>> fully
> >>>> utilized: in order to buffer enough io commands into a disk's queue
> for
> >>> the
> >>>> elevator algorithms to kick in, you either need to use multiple
> threads
> >>> per
> >>>> disk, or native async IO (not trivially available within the JVM.)
> >>>>
> >>>> On Tue, Sep 25, 2012 at 2:23 AM, Flavio Junqueira <[email protected]>
> >>> wrote:
> >>>>
> >>>>>
> >>>>> On Sep 25, 2012, at 10:55 AM, Aniruddha Laud wrote:
> >>>>>
> >>>>>> On Tue, Sep 25, 2012 at 1:35 AM, Flavio Junqueira <
> [email protected]>
> >>>>> wrote:
> >>>>>>
> >>>>>>> Just to add a couple of comments to the discussion, separating
> reads
> >>> and
> >>>>>>> writes into different threads should only help with queuing
> latency.
> >>> It
> >>>>>>> wouldn't help with IO latency.
> >>>>>>>
> >>>>>>
> >>>>>> Yes, but with the current implementation, publish latencies in
> >>>>>> Hedwig suffer because of lagging subscribers. By separating read and write
> >>>>> queues,
> >>>>>> we can at least guarantee that the write SLA is maintained (separate
> >>>>>> journal disk + separate thread would ensure that writes are not
> >>> affected
> >>>>> by
> >>>>>> read related seeks)
> >>>>>>
> >>>>>
> >>>>> Agreed and based on my comment below, I was wondering if it wouldn't
> be
> >>>>> best to separate traffic across threads by device instead of by
> >>> operation
> >>>>> type.
> >>>>>
> >>>>>
> >>>>>>>
> >>>>>>> Also, it sounds like a good idea to have at least one thread per
> >>> ledger
> >>>>>>> device. In the case of multiple ledger devices, if we use one
> single
> >>>>>>> thread, then the performance of the bookie will be driven by the
> >>> slowest
> >>>>>>> disk, no?
> >>>>>>>
> >>>>>> yup, makes sense.
> >>>>>>
> >>>>>>>
> >>>>>>> -Flavio
> >>>>>>>
> >>>>>>> On Sep 25, 2012, at 10:24 AM, Ivan Kelly wrote:
> >>>>>>>
> >>>>>>>>> Could you give some information on what those shortcomings are?
> >>> Also,
> >>>>> do
> >>>>>>>>> let me know if you need any more information from our end.
> >>>>>>>> Off the top of my head:
> >>>>>>>> - reads and writes are handled in the same thread (as you have
> >>>>> observed)
> >>>>>>>> - each entry read requires a single RPC.
> >>>>>>>> - entries are read in parallel
> >>>>>>>
> >>>>>> By parallel, you mean the BufferedChannel wrapper on top of
> >>> FileChannel,
> >>>>>> right?
> >>>>>>
> >>>>>>>>
> >>>>>>>> Not all of these could result in the high latency you see, but if
> >>> each
> >>>>>>>> entry is being read separately, a sync on the ledger disk in
> between
> >>>>>>>> will make a mess of the disk head scheduling.
> >>>>>>>
> >>>>>> Increasing the time interval between flushes of the log files might
> >>>>>> possibly help in this case then?
> >>>>>>
> >>>>>>>>
> >>>>>>>> -Ivan
> >>>>>>>
> >>>>>>>
> >>>>>> Thanks for the help :)
> >>>>>
> >>>>>
> >>>
> >>>
>
