On Thu, Sep 27, 2012 at 2:53 AM, Sijie Guo <[email protected]> wrote:

> > I was talking about writing to the journal file. From what I understand,
> log file entries
> > are flushed periodically by one thread and that is okay. The publish
> > latencies on hedwig are
> > dependent on the journal writes, though.
>
> I'm confused now. The journal file is flushed periodically by the
> 'BookieJournal' thread, not the entry log file.
>
> The write code path would be:
>
> 1) adding entry into entry log file
> 2) putting index to ledger cache
> 3) logAddEntry to journal.
>
> 1) and 2) are synchronized calls. If 1) and 2) take a long time, they will
> affect other requests in PerChannelBookieClient, since
> PerChannelBookieClient processes requests one by one.
>
After 1) and 2), we don't actually call flush(), right? So they're not
exactly written to disk the moment they're invoked. The entry in the
journal, on the other hand, is flushed every 512KB, and we call the
WriteCallback on the entry when this is done. This write callback sends the
response back. We might as well cache 1) and 2) ourselves and then reorder
the writes (to maintain locality) while writing them to the log file. The
thread I was referring to is Bookie#SyncThread.
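The flow described above (buffered steps 1 and 2, journal flush at ~512KB, callback fired only after the flush) could be sketched roughly like this. All class and method names here are made up for illustration, not the actual bookie code:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;

// Hypothetical sketch of the bookie add-entry path discussed in this thread.
public class AddEntryPathSketch {
    static final int FLUSH_THRESHOLD = 512 * 1024; // journal write buffer (bytes)

    interface WriteCallback {
        void writeComplete(long ledgerId, long entryId);
    }

    static class PendingAdd {
        final long ledgerId, entryId;
        final int size;
        final WriteCallback cb;
        PendingAdd(long ledgerId, long entryId, int size, WriteCallback cb) {
            this.ledgerId = ledgerId; this.entryId = entryId;
            this.size = size; this.cb = cb;
        }
    }

    final Queue<PendingAdd> journalQueue = new ConcurrentLinkedQueue<>();

    // Steps 1) and 2): synchronized, but nothing is fsync'ed here -- the
    // entry log append and the ledger-cache index update stay in buffers.
    synchronized void addToEntryLogAndIndex(long ledgerId, long entryId, byte[] data) {
        // append to the active entry log; record (ledgerId, entryId) -> offset
    }

    // Step 3): asynchronous -- just enqueue the add for the journal thread.
    void logAddEntry(long ledgerId, long entryId, int size, WriteCallback cb) {
        journalQueue.add(new PendingAdd(ledgerId, entryId, size, cb));
    }

    // Journal thread body: once ~512KB accumulates (or the queue drains),
    // force the journal file, then fire callbacks, which send the responses.
    int journalLoopOnce() {
        List<PendingAdd> pending = new ArrayList<>();
        int buffered = 0, acked = 0;
        PendingAdd p;
        while ((p = journalQueue.poll()) != null) {
            pending.add(p);
            buffered += p.size;
            if (buffered >= FLUSH_THRESHOLD || journalQueue.isEmpty()) {
                // a real journal would call journalChannel.force(false) here
                for (PendingAdd q : pending) {
                    q.cb.writeComplete(q.ledgerId, q.entryId);
                    acked++;
                }
                pending.clear();
                buffered = 0;
            }
        }
        return acked; // number of responses sent back to clients
    }
}
```

The key point the sketch captures is that the client only hears back after the journal flush, so journal latency bounds publish latency.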

>
> 3) is an asynchronous call, just putting the add entry into the journal
> queue. 3) would not affect other requests in PerChannelBookieClient.
>
It does, because you send a response to the bookie client only when things
are flushed to the journal.

>
> but 3) would affect the response latency for adding an entry. We already
> took care of that by moving journal flushing onto a separate journal disk,
> so 3) would not be the cause of the high latencies.
>
Correct, journal additions are executed by the same thread as the reads, and
that is what is causing this problem. In my view, because we don't rely on
the log files being up-to-date, we can flush them from a different thread
every few minutes or so. Does that seem right? We can reorder all the writes
received in the flush interval to maintain locality.
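Flushing the entry log files from their own thread on a long, configurable interval might look like this minimal sketch (class and thread names are illustrative only):

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: run entry log flushes periodically on a dedicated
// thread so they never share a thread with reads or journal writes.
public class EntryLogFlusher {
    private final ScheduledExecutorService scheduler =
            Executors.newSingleThreadScheduledExecutor(r -> {
                Thread t = new Thread(r, "EntryLogFlusher");
                t.setDaemon(true);
                return t;
            });

    // flushTask would reorder the buffered writes (for locality) and then
    // write them out to the entry log.
    public void start(Runnable flushTask, long intervalMs) {
        scheduler.scheduleAtFixedRate(flushTask, intervalMs, intervalMs,
                TimeUnit.MILLISECONDS);
    }

    public void stop() {
        scheduler.shutdownNow();
    }
}
```

A long interval is safe here precisely because correctness comes from the journal, not from the entry log being current.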

>
> so my understanding is that the previous email discussions were talking
> about the entry log file, which is in the code path for reads and writes.
> So the point is that we need to separate read/write requests into
> different threads to avoid interference between them, isn't it? If I am
> wrong about this, please correct me.

Yes, I think the idea was to not have any read requests affect write
latencies, because reads might be expensive.

>
>
> On Thu, Sep 27, 2012 at 9:39 AM, Aniruddha Laud <[email protected]
> >wrote:
>
> > On Wed, Sep 26, 2012 at 5:06 PM, Sijie Guo <[email protected]> wrote:
> >
> > > Sounds good to have thread pool (and make them configurable) for
> reading
> > > from entry log files.
> > >
> > > > We can have a write threadpool (and we should always keep this
> > > lower than the number of processors) to process the add requests.
> > >
> > > one more point: since we have only one active entry log file accepting
> > > entries written to it, I don't think multiple write threads would help
> > > now, since adding an entry to the entry log file is a synchronized
> > > method.
> > >
> > I was talking about writing to the journal file. From what I understand,
> > log file entries
> > are flushed periodically by one thread and that is okay. The publish
> > latencies on hedwig are
> > dependent on the journal writes, though.
> >
> > >
> > > In order to utilize the disk bandwidth more efficiently, we might need
> > > to have one active entry log file accepting entries per ledger disk.
> > > But it might need to introduce some directory layout changes (like
> > > logId; currently we use an incrementing log id for the whole bookie)
> > > and logic changes. It would be a separate task if we did that.
> > >
> > > > Another thing we could possibly look at is re ordering our writes to
> > the
> > > log file to try and maintain locality for ledger entries.
> > >
> > > We have an internal prototype that uses leveldb to store 1) small
> > > entry data (sizes less than a few hundred bytes) and 2) the ledger
> > > index for large entries (acting as a ledger cache for index entries).
> > > Benefiting from leveldb, 1) we could have a more efficient cache when
> > > there is a large number of ledgers with size skew between them, and 2)
> > > we could have data belonging to the same ledger clustered when writing
> > > to disks, which achieves the 'reordering writes' you mentioned, to
> > > some extent.
> > >
> > I took a look at the leveldb homepage and it says: "Only a single
> > process (possibly multi-threaded) can access a particular database at a
> > time." This is bad because it means we can't run the console or any
> > recovery-related operations while the bookies are running. I may be
> > wrong, though. What I had in mind was pretty simple: when we flush to
> > the log file, we should simply flush entries sorted by ledger id as the
> > key. Some changes might be needed to the ledger index cache, but I'm not
> > very sure what those changes would be. What do you think?
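The "flush entries sorted by ledger id" idea might look something like this minimal sketch (BufferedEntry is a hypothetical stand-in for whatever the flush buffer actually holds):

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical sketch of reordering buffered writes before flushing them
// to the entry log, so entries of the same ledger become contiguous on
// disk and later sequential reads of a ledger need fewer seeks.
public class FlushReorderSketch {
    static class BufferedEntry {
        final long ledgerId, entryId;
        BufferedEntry(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
    }

    // Sort by (ledgerId, entryId); the flush loop would then write the
    // sorted list out in one pass.
    static List<BufferedEntry> reorderForFlush(List<BufferedEntry> buffered) {
        List<BufferedEntry> sorted = new ArrayList<>(buffered);
        sorted.sort(Comparator.comparingLong((BufferedEntry e) -> e.ledgerId)
                .thenComparingLong(e -> e.entryId));
        return sorted;
    }
}
```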
> >
> > >
> > > -Sijie
> > >
> > > On Thu, Sep 27, 2012 at 12:52 AM, Aniruddha Laud
> > > <[email protected]>wrote:
> > >
> > > > Hi all,
> > > >
> > > > Those stats I pasted might be a little misleading, as they show the
> > > > average over a couple of minutes. Whenever there are reads on the
> > > > ledger disks, the queue size on them is sometimes as high as 100.
> > > > Also, the CPU utilization has been lower than 10% throughout, and
> > > > the process will continue to remain I/O bound even if we introduce
> > > > more threads (as the CPU remains idle while doing I/O).
> > > >
> > > > A couple of observations about the write path. We currently have a
> > > > hard-coded buffer size for journal writes (I believe it's 512KB) and
> > > > we flush to disk when this fills up or when there is no entry to
> > > > process (which is highly unlikely with a high-throughput application
> > > > running on top). We should make this buffer size configurable. Now,
> > > > with more threads, we can process more packets in parallel and this
> > > > buffer can be filled faster. We can have a write threadpool (and we
> > > > should always keep it smaller than the number of processors) to
> > > > process the add requests.
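A rough sketch of the two knobs suggested here, with made-up property and method names (the real bookie would read these from its configuration, not system properties):

```java
// Hypothetical sketch: a configurable journal write buffer (replacing the
// hard-coded 512KB) and a write pool capped below the processor count.
public class WritePathConfig {
    static final int DEFAULT_JOURNAL_BUFFER_KB = 512;

    // e.g. -DjournalWriteBufferKB=1024 on the bookie command line
    static int journalBufferBytes() {
        return Integer.getInteger("journalWriteBufferKB",
                DEFAULT_JOURNAL_BUFFER_KB) * 1024;
    }

    // Keep the add-request pool below the number of processors so the
    // journal thread and read threads still get CPU time.
    static int writeThreads(int requested) {
        int cores = Runtime.getRuntime().availableProcessors();
        return Math.max(1, Math.min(requested, cores - 1));
    }
}
```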
> > > >
> > > > For read requests, a configurable number of worker threads would be
> > > > ideal, and we could let the user tune it depending on the kind of
> > > > read patterns they expect. Given that ledgers are interleaved at the
> > > > moment, I would expect performance to increase linearly with the
> > > > number of threads up to a certain point and then level out.
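A minimal sketch of such a configurable read worker pool (class names illustrative, not the actual bookie code):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical sketch: a pool of read workers sized by configuration,
// kept entirely separate from the write path so slow reads cannot delay
// add requests.
public class ReadWorkerPool {
    private final ExecutorService readers;

    public ReadWorkerPool(int numReadThreads) {
        this.readers = Executors.newFixedThreadPool(numReadThreads);
    }

    // Issue reads in parallel; each readOp stands in for one entry-log read.
    public <T> List<T> readAll(List<Callable<T>> readOps) throws Exception {
        List<Future<T>> futures = new ArrayList<>();
        for (Callable<T> op : readOps) futures.add(readers.submit(op));
        List<T> results = new ArrayList<>();
        for (Future<T> f : futures) results.add(f.get());
        return results;
    }

    public void shutdown() {
        readers.shutdown();
    }
}
```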
> > > >
> > > > Another thing we could possibly look at is reordering our writes to
> > > > the log file to try to maintain locality for ledger entries. This
> > > > might reduce the number of random seeks we do when only a small
> > > > number of ledgers are lagging.
> > > >
> > > > Thoughts?
> > > >
> > > > Regards,
> > > > Aniruddha.
> > > >
> > > > On Wed, Sep 26, 2012 at 2:55 AM, Rakesh R <[email protected]>
> wrote:
> > > >
> > > > > >>>One question: what is multi-ledgers?
> > > > > Multiple ledger directories (multiple disks).
> > > > >
> > > > > >>>CPU utilization might not be largely affected if the threads are
> > > > > sitting there waiting on IO
> > > > > OK, I think I got it.
> > > > > If one thread spends most of its time waiting for I/O completion
> > > > > instead of using the CPU, that does not by itself mean we've hit
> > > > > the system I/O bandwidth limit; so IMHO having multiple threads
> > > > > (or asynchronous I/O) might improve performance (by enabling more
> > > > > than one concurrent I/O operation).
> > > > >
> > > > > -Rakesh
> > > > > ________________________________________
> > > > > From: Flavio Junqueira [[email protected]]
> > > > > Sent: Wednesday, September 26, 2012 2:17 PM
> > > > > To: [email protected]
> > > > > Subject: Re: High latencies observed at the bookkeeper client while
> > > > > reading entries
> > > > >
> > > > > CPU utilization might not be largely affected if the threads are
> > > sitting
> > > > > there waiting on IO. In my understanding of the proposal so far,
> the
> > > idea
> > > > > is to have multiple threads only to perform IO.
> > > > >
> > > > > One question: what is multi-ledgers?
> > > > >
> > > > > -Flavio
> > > > >
> > > > >
> > > > > On Sep 26, 2012, at 7:52 AM, Rakesh R wrote:
> > > > >
> > > > > > I'll just add one more point:
> > > > > >
> > > > > > Increasing the number of threads can drive up CPU utilization
> > > > > too. We should consider this, and it would be good to observe
> > > > > whether the process is more I/O bound than CPU bound. However, it
> > > > > depends in great detail on the disks and how much CPU work other
> > > > > threads are doing before they, too, end up waiting on those disks.
> > > > > >
> > > > > > I'm also thinking along the lines of Flavio's suggestion to have
> > > > > one thread per ledger/journal device. Multithreading can help with
> > > > > I/O-bound problems if the I/O is performed against different disks.
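The one-thread-per-device suggestion could be sketched as below; the directory strings and class names are hypothetical, standing in for real ledger directories mounted on separate disks:

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical sketch: each ledger directory (device) gets its own
// single-threaded executor, so a slow disk only backs up its own queue
// and the bookie is no longer driven by the slowest disk.
public class PerDeviceIo {
    private final Map<String, ExecutorService> byDevice = new HashMap<>();

    public PerDeviceIo(List<String> ledgerDirs) {
        for (String dir : ledgerDirs) {
            byDevice.put(dir, Executors.newSingleThreadExecutor());
        }
    }

    // Route an I/O task to the thread owning that entry's device.
    public Future<?> submit(String dir, Runnable ioTask) {
        return byDevice.get(dir).submit(ioTask);
    }

    public void shutdown() {
        for (ExecutorService e : byDevice.values()) e.shutdown();
    }
}
```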
> > > > > >
> > > > > > From the iostat report (waiting times of the ledger
> > > > > directories), it shows we still have headroom to utilize the disk
> > > > > bandwidth more fully.
> > > > > >
> > > > > > multi-ledgers disk usage:
> > > > > > avgqu-sz
> > > > > > 1.10
> > > > > > 0.12
> > > > > > 0.54
> > > > > > 0.13
> > > > > >
> > > > > > -Rakesh
> > > > > > ________________________________________
> > > > > > From: Sijie Guo [[email protected]]
> > > > > > Sent: Wednesday, September 26, 2012 5:58 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: High latencies observed at the bookkeeper client
> while
> > > > > reading entries
> > > > > >
> > > > > > One more point is that each write/read request to the entry log
> > > > > files is converted into writing/reading an 8K blob of data, since
> > > > > BufferedChannel is used. For write requests, a larger write size
> > > > > is OK. For read requests, though, access is almost random: even if
> > > > > you read a larger blob, the blob might be useless when the next
> > > > > read goes somewhere else. Moreover, I don't think we need to
> > > > > maintain another fixed-length readBuffer in BufferedChannel; it
> > > > > hardly helps for random reads, and we could leverage the OS cache
> > > > > instead.
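A sketch of what "drop the read buffer and lean on the OS cache" might look like, using a plain positional read sized to the request (illustrative only, not the actual BufferedChannel code):

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: read exactly the requested bytes at the requested
// offset, with no internal 8K read-ahead buffer. Repeated reads of hot
// regions are served from the OS page cache anyway.
public class DirectEntryRead {
    static byte[] readAt(FileChannel ch, long offset, int len) throws IOException {
        ByteBuffer buf = ByteBuffer.allocate(len);
        while (buf.hasRemaining()) {
            int n = ch.read(buf, offset + buf.position());
            if (n < 0) break; // hit end of file
        }
        buf.flip();
        byte[] out = new byte[buf.remaining()];
        buf.get(out);
        return out;
    }
}
```

Positional reads also avoid moving the channel's file position, so concurrent readers on the same channel don't interfere.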
> > > > > >
> > > > > > On Wed, Sep 26, 2012 at 8:06 AM, Sijie Guo <[email protected]>
> > > wrote:
> > > > > >
> > > > > >> For serving requests, we either queue the requests in the
> > > > > >> bookie server per channel (writes/reads are blocking
> > > > > >> operations), or queue in the OS kernel and let the block device
> > > > > >> queue and schedule those I/O requests. I think Stu's point is
> > > > > >> to leverage the block device's scheduling algorithm by issuing
> > > > > >> I/O requests from multiple threads to fully utilize the disk
> > > > > >> bandwidth.
> > > > > >>
> > > > > >> From the iostat reports provided by Aniruddha, the average
> > > > > >> queue length and utilization percentage are not high, which
> > > > > >> means the disks are idle most of the time. It makes sense to
> > > > > >> use multiple threads to issue read requests. One write thread
> > > > > >> and several read threads per device might work.
> > > > > >>
> > > > > >> On Wed, Sep 26, 2012 at 5:06 AM, Flavio Junqueira <
> > > [email protected]
> > > > > >wrote:
> > > > > >>
> > > > > >>> Hi Stu, I'm not sure I understand your point. If with one
> thread
> > we
> > > > are
> > > > > >>> getting pretty high latency (case Aniruddha described), doesn't
> > it
> > > > > mean we
> > > > > >>> have a number of requests queued up? Adding more threads might
> > only
> > > > > make
> > > > > >>> the problem worse by queueing up even more requests. I'm
> possibly
> > > > > missing
> > > > > >>> your point...
> > > > > >>>
> > > > > >>> -Flavio
> > > > > >>>
> > > > > >>> On Sep 25, 2012, at 9:37 PM, Stu Hood wrote:
> > > > > >>>
> > > > > >>>> Separating by device would help, but will not allow the
> devices
> > to
> > > > be
> > > > > >>> fully
> > > > > >>>> utilized: in order to buffer enough io commands into a disk's
> > > queue
> > > > > for
> > > > > >>> the
> > > > > >>>> elevator algorithms to kick in, you either need to use
> multiple
> > > > > threads
> > > > > >>> per
> > > > > >>>> disk, or native async IO (not trivially available within the
> > JVM.)
> > > > > >>>>
> > > > > >>>> On Tue, Sep 25, 2012 at 2:23 AM, Flavio Junqueira <
> > > > [email protected]>
> > > > > >>> wrote:
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> On Sep 25, 2012, at 10:55 AM, Aniruddha Laud wrote:
> > > > > >>>>>
> > > > > >>>>>> On Tue, Sep 25, 2012 at 1:35 AM, Flavio Junqueira <
> > > > > [email protected]>
> > > > > >>>>> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Just to add a couple of comments to the discussion,
> > separating
> > > > > reads
> > > > > >>> and
> > > > > >>>>>>> writes into different threads should only help with queuing
> > > > > latency.
> > > > > >>> It
> > > > > >>>>>>> wouldn't help with IO latency.
> > > > > >>>>>>>
> > > > > >>>>>>
> > > > > >>>>>> Yes, but with the current implementation, publish latencies
> > > > > >>>>>> in hedwig suffer because of lagging subscribers. By
> > > > > >>>>>> separating read and write queues, we can at least guarantee
> > > > > >>>>>> that the write SLA is maintained (a separate journal disk
> > > > > >>>>>> plus a separate thread would ensure that writes are not
> > > > > >>>>>> affected by read-related seeks).
> > > > > >>>>>>
> > > > > >>>>>
> > > > > >>>>> Agreed and based on my comment below, I was wondering if it
> > > > wouldn't
> > > > > be
> > > > > >>>>> best to separate traffic across threads by device instead of
> by
> > > > > >>> operation
> > > > > >>>>> type.
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> Also, it sounds like a good idea to have at least one
> thread
> > > per
> > > > > >>> ledger
> > > > > >>>>>>> device. In the case of multiple ledger devices, if we use
> one
> > > > > single
> > > > > >>>>>>> thread, then the performance of the bookie will be driven
> by
> > > the
> > > > > >>> slowest
> > > > > >>>>>>> disk, no?
> > > > > >>>>>>>
> > > > > >>>>>> yup, makes sense.
> > > > > >>>>>>
> > > > > >>>>>>>
> > > > > >>>>>>> -Flavio
> > > > > >>>>>>>
> > > > > >>>>>>> On Sep 25, 2012, at 10:24 AM, Ivan Kelly wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>>> Could you give some information on what those
> shortcomings
> > > are?
> > > > > >>> Also,
> > > > > >>>>> do
> > > > > >>>>>>>>> let me know if you need any more information from our
> end.
> > > > > >>>>>>>> Off the top of my head:
> > > > > >>>>>>>> - reads and writes are handled in the same thread (as you
> > have
> > > > > >>>>> observed)
> > > > > >>>>>>>> - each entry read requires a single RPC.
> > > > > >>>>>>>> - entries are read in parallel
> > > > > >>>>>>>
> > > > > >>>>>> By parallel, you mean the BufferedChannel wrapper on top of
> > > > > >>> FileChannel,
> > > > > >>>>>> right?
> > > > > >>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> Not all of these could result in the high latency you see,
> > but
> > > > if
> > > > > >>> each
> > > > > >>>>>>>> entry is being read separately, a sync on the ledger disk
> in
> > > > > between
> > > > > >>>>>>>> will make a mess of the disk head scheduling.
> > > > > >>>>>>>
> > > > > >>>>>> Increasing the time interval between  flushing log files
> might
> > > > > >>> possibly
> > > > > >>>>>> help in this case then?
> > > > > >>>>>>
> > > > > >>>>>>>>
> > > > > >>>>>>>> -Ivan
> > > > > >>>>>>>
> > > > > >>>>>>>
> > > > > >>>>>> Thanks for the help :)
> > > > > >>>>>
> > > > > >>>>>
> > > > > >>>
> > > > > >>>
> > > > >
> > > >
> > >
> >
>
