On Thu, Sep 27, 2012 at 3:21 AM, Sijie Guo <[email protected]> wrote:
> > I took a look at the leveldb homepage and it says - "Only a single
> > process (possibly multi-threaded) can access a particular database at
> > a time." This is bad because it means we can't run the console or any
> > recovery related operations while the bookies are running.
>
> Yes, leveldb is single-process; it prevents misuse by acquiring a lock
> from the filesystem. We indeed can't run the console to look at its data
> while the bookie is running.
>
> I am confused about the 'recovery' operations you mentioned. What kind
> of recovery?

I was talking about BookkeeperAdmin and how it's used from
BookkeeperTools. I believe that we currently only support replicating a
bookie onto another bookie. We might want to support more operations of
this nature in the future. These might need simultaneous access to the
log files of a live bookie.

> > When we flush to the log file, we should simply flush entries sorted
> > by the ledger id as key.
>
> If we want to sort when flushing, we need to buffer all edits in memory,
> then flush. I assume this approach would act the same as an LSM tree
> (which is what leveldb does).

Yes, quite similar. I just glanced over LSM trees, I'll take a more
detailed look soon.

> On Thu, Sep 27, 2012 at 9:39 AM, Aniruddha Laud <[email protected]> wrote:
>
> > On Wed, Sep 26, 2012 at 5:06 PM, Sijie Guo <[email protected]> wrote:
> >
> > > Sounds good to have a thread pool (and make it configurable) for
> > > reading from entry log files.
> > >
> > > > We can have a write threadpool (and we should always keep this
> > > > lower than the number of processors) to process the add requests.
> > >
> > > One more point: since we have only one entry log file actively
> > > accepting entries written to it, I don't think multiple write
> > > threads would help now, since adding an entry to the entry log file
> > > is a synchronized method.
> >
> > I was talking about writing to the journal file.
> > From what I understand, log file entries are flushed periodically by
> > one thread and that is okay. The publish latencies on hedwig are
> > dependent on the journal writes, though.
> >
> > > In order to utilize the disk bandwidth more efficiently, we might
> > > need to have one active entry log file accepting entries written per
> > > ledger disk. But it might need to introduce some directory layout
> > > changes (like logId; currently we use an incrementing log id for the
> > > whole bookie) and logic changes. It would be a separate task if we
> > > did that.
> > >
> > > > Another thing we could possibly look at is reordering our writes
> > > > to the log file to try and maintain locality for ledger entries.
> > >
> > > We had a prototype working internally using leveldb to store 1)
> > > small entry data (size less than hundreds of bytes), and 2) the
> > > ledger index for large entries (acting as the ledger cache for
> > > index entries). Benefiting from leveldb, 1) we could have a more
> > > efficient cache when there are a larger number of ledgers and size
> > > skew between ledgers, and 2) we could have data belonging to the
> > > same ledger clustered when writing to disks, which in a way
> > > achieves the 'reordering writes' you mentioned.
> >
> > I took a look at the leveldb homepage and it says - "Only a single
> > process (possibly multi-threaded) can access a particular database at
> > a time." This is bad because it means we can't run the console or any
> > recovery related operations while the bookies are running. I may be
> > wrong, though. What I had in mind was pretty simple. When we flush to
> > the log file, we should simply flush entries sorted by the ledger id
> > as key. Some changes might be needed to the ledger index cache, but
> > I'm not very sure what the changes would be. What do you think?
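To make the sorted-flush idea above concrete: buffering entries in a map ordered by (ledgerId, entryId) and writing them out in key order is essentially the memtable half of an LSM tree. The following is a rough sketch under that assumption; `EntryKey` and `SortedFlushSketch` are illustrative names, not actual bookie classes:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch of an LSM-style sorted flush: entries are buffered in a TreeMap
// ordered by (ledgerId, entryId) and flushed in that order, so entries of
// the same ledger land contiguously in the entry log.
public class SortedFlushSketch {
    // Composite key ordered first by ledger id, then by entry id.
    static final class EntryKey implements Comparable<EntryKey> {
        final long ledgerId;
        final long entryId;
        EntryKey(long ledgerId, long entryId) {
            this.ledgerId = ledgerId;
            this.entryId = entryId;
        }
        @Override public int compareTo(EntryKey o) {
            int c = Long.compare(ledgerId, o.ledgerId);
            return c != 0 ? c : Long.compare(entryId, o.entryId);
        }
    }

    private final TreeMap<EntryKey, byte[]> memTable = new TreeMap<>();

    public void addEntry(long ledgerId, long entryId, byte[] data) {
        memTable.put(new EntryKey(ledgerId, entryId), data);
    }

    // Returns the ledger ids in the order their entries would be written
    // to the log file, then clears the buffer (as a real flush would).
    public List<Long> flushOrder() {
        List<Long> order = new ArrayList<>();
        for (Map.Entry<EntryKey, byte[]> e : memTable.entrySet()) {
            order.add(e.getKey().ledgerId);
        }
        memTable.clear();
        return order;
    }
}
```

With this layout, a lagging reader of one ledger scans one contiguous region per flush instead of seeking across interleaved entries.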
> > >
> > > -Sijie
> > >
> > > On Thu, Sep 27, 2012 at 12:52 AM, Aniruddha Laud
> > > <[email protected]> wrote:
> > >
> > > > Hi all,
> > > >
> > > > Those stats I pasted might be a little misleading as they show the
> > > > average over a couple of minutes. Whenever there are reads to the
> > > > ledger disks, the queue size on them is sometimes as high as 100.
> > > > Also, the CPU utilization has been lower than 10% throughout and
> > > > the process will continue to remain I/O bound even if we introduce
> > > > more threads (as the CPU remains idle while doing I/O).
> > > >
> > > > A couple of observations about the write path. We currently have a
> > > > hard coded buffer size for journal writes (I believe it's 512KB)
> > > > and we flush to the disk when this fills up or if there is no
> > > > entry to process (which is highly unlikely in the case of a high
> > > > throughput application running on top). We should make this buffer
> > > > size configurable. Now, with more threads, we can process more
> > > > packets in parallel and this buffer can be filled up faster. We
> > > > can have a write threadpool (and we should always keep this lower
> > > > than the number of processors) to process the add requests.
> > > >
> > > > For read requests, a configurable number of worker threads would
> > > > be ideal and we could let the user tune it depending on the kind
> > > > of read patterns they expect. Given that ledgers are interleaved
> > > > ATM, I would expect the performance to increase linearly with the
> > > > number of threads till a certain point and then level out.
> > > >
> > > > Another thing we could possibly look at is reordering our writes
> > > > to the log file to try and maintain locality for ledger entries.
> > > > This might reduce the number of random seeks we do in case only a
> > > > small number of ledgers are lagging.
> > > >
> > > > Thoughts?
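The two write-path suggestions quoted above (a configurable journal buffer size instead of the hard-coded 512KB, and a write pool kept below the processor count) might look roughly like this sketch; the names and the default value just mirror the discussion and are not actual bookie configuration keys:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical sketch: a write pool capped below the processor count,
// plus a configurable journal write buffer size.
public class WritePoolSketch {
    // Placeholder for the currently hard-coded journal buffer size.
    static final int DEFAULT_JOURNAL_BUFFER_BYTES = 512 * 1024;

    // Cap the requested pool size at (processors - 1), with a floor of 1,
    // so write threads never occupy every core.
    public static int boundedWriteThreads(int requested) {
        int max = Math.max(1, Runtime.getRuntime().availableProcessors() - 1);
        return Math.max(1, Math.min(requested, max));
    }

    public static ExecutorService newWritePool(int requested) {
        return Executors.newFixedThreadPool(boundedWriteThreads(requested));
    }

    public static void main(String[] args) throws InterruptedException {
        ExecutorService pool = newWritePool(8);
        pool.submit(() -> System.out.println("add request processed"));
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
    }
}
```

The cap matters because a write pool that saturates every core would steal CPU from the netty threads and the journal flush thread.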
> > > >
> > > > Regards,
> > > > Aniruddha.
> > > >
> > > > On Wed, Sep 26, 2012 at 2:55 AM, Rakesh R <[email protected]> wrote:
> > > >
> > > > > >>> One question: what is multi-ledgers?
> > > > > Multiple ledger directories (multiple disks).
> > > > >
> > > > > >>> CPU utilization might not be largely affected if the threads
> > > > > >>> are sitting there waiting on IO
> > > > > Ok, seems I got it.
> > > > > If one thread spends most of its time waiting for I/O completion
> > > > > instead of using the CPU, but that does not mean "we've hit the
> > > > > system I/O bandwidth limit", then IMHO having multiple threads
> > > > > (or asynchronous I/O) might improve performance (by enabling
> > > > > more than one concurrent I/O operation).
> > > > >
> > > > > -Rakesh
> > > > > ________________________________________
> > > > > From: Flavio Junqueira [[email protected]]
> > > > > Sent: Wednesday, September 26, 2012 2:17 PM
> > > > > To: [email protected]
> > > > > Subject: Re: High latencies observed at the bookkeeper client
> > > > > while reading entries
> > > > >
> > > > > CPU utilization might not be largely affected if the threads are
> > > > > sitting there waiting on IO. In my understanding of the proposal
> > > > > so far, the idea is to have multiple threads only to perform IO.
> > > > >
> > > > > One question: what is multi-ledgers?
> > > > >
> > > > > -Flavio
> > > > >
> > > > >
> > > > > On Sep 26, 2012, at 7:52 AM, Rakesh R wrote:
> > > > >
> > > > > > I'm just adding one more point:
> > > > > >
> > > > > > Increasing the number of threads can hit the CPU utilization
> > > > > > too. We should consider this, and it would be good to observe
> > > > > > whether it's more I/O bound than CPU bound. However, it
> > > > > > depends in great detail on the disks and how much CPU work
> > > > > > other threads are doing before they, too, end up waiting on
> > > > > > those disks.
> > > > > >
> > > > > > I'm also thinking in line with Flavio's suggestion to have one
> > > > > > thread per ledger/journal device. Multithreading can help us
> > > > > > with I/O bound problems if the I/O is performed against
> > > > > > different disks.
> > > > > >
> > > > > > From the iostat report (waiting time of the ledger
> > > > > > directories), it shows we have room to more fully utilize the
> > > > > > disk bandwidth.
> > > > > >
> > > > > > multi-ledgers disk usage:
> > > > > > avgqu-sz
> > > > > > 1.10
> > > > > > 0.12
> > > > > > 0.54
> > > > > > 0.13
> > > > > >
> > > > > > -Rakesh
> > > > > > ________________________________________
> > > > > > From: Sijie Guo [[email protected]]
> > > > > > Sent: Wednesday, September 26, 2012 5:58 AM
> > > > > > To: [email protected]
> > > > > > Subject: Re: High latencies observed at the bookkeeper client
> > > > > > while reading entries
> > > > > >
> > > > > > One more point is that each write/read request to the entry
> > > > > > log files would be converted to writing/reading an 8K blob of
> > > > > > data, since you use BufferedChannel. For write requests, a
> > > > > > larger write size is OK. For read requests, they are almost
> > > > > > random. Even if you read a larger blob, the blob might be
> > > > > > useless when the next read goes to another place. Moreover, I
> > > > > > don't think we need to maintain another fixed-length
> > > > > > readBuffer in BufferedChannel; it almost doesn't help for
> > > > > > random reads, and we could leverage the OS cache for it.
> > > > > >
> > > > > > On Wed, Sep 26, 2012 at 8:06 AM, Sijie Guo <[email protected]>
> > > > > > wrote:
> > > > > >
> > > > > >> For serving requests, we either queue the requests in the
> > > > > >> bookie server per channel (write/read are blocking
> > > > > >> operations), or queue in the OS kernel to let the block
> > > > > >> device queue and schedule those IO requests.
> > > > > >> I think Stu's point is to leverage the block device's
> > > > > >> scheduling algorithm by issuing IO requests from multiple
> > > > > >> threads to fully utilize the disk bandwidth.
> > > > > >>
> > > > > >> From the iostat reports provided by Aniruddha, the average
> > > > > >> queue length and utilization percentage are not high, which
> > > > > >> means most of the time the disks are idle. It makes sense to
> > > > > >> use multiple threads to issue read requests. One write thread
> > > > > >> and several read threads might work for each device.
> > > > > >>
> > > > > >> On Wed, Sep 26, 2012 at 5:06 AM, Flavio Junqueira
> > > > > >> <[email protected]> wrote:
> > > > > >>
> > > > > >>> Hi Stu, I'm not sure I understand your point. If with one
> > > > > >>> thread we are getting pretty high latency (the case
> > > > > >>> Aniruddha described), doesn't it mean we have a number of
> > > > > >>> requests queued up? Adding more threads might only make the
> > > > > >>> problem worse by queueing up even more requests. I'm
> > > > > >>> possibly missing your point...
> > > > > >>>
> > > > > >>> -Flavio
> > > > > >>>
> > > > > >>> On Sep 25, 2012, at 9:37 PM, Stu Hood wrote:
> > > > > >>>
> > > > > >>>> Separating by device would help, but will not allow the
> > > > > >>>> devices to be fully utilized: in order to buffer enough IO
> > > > > >>>> commands into a disk's queue for the elevator algorithms to
> > > > > >>>> kick in, you either need to use multiple threads per disk,
> > > > > >>>> or native async IO (not trivially available within the
> > > > > >>>> JVM.)
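The "one write thread and several read threads per device" idea from the exchange above could be sketched as a map from ledger directory to executors, so each disk keeps several IOs outstanding for the kernel's elevator scheduler to reorder. Everything here (class name, pool sizing) is an illustrative assumption, not existing bookie code:

```java
import java.io.File;
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Hypothetical sketch: per ledger device, one write thread and a small
// pool of read threads, so multiple IO requests are outstanding on each
// disk and the block device's elevator algorithm can reorder them.
public class PerDeviceIoSketch {
    private final Map<File, ExecutorService> writePools = new HashMap<>();
    private final Map<File, ExecutorService> readPools = new HashMap<>();

    public PerDeviceIoSketch(File[] ledgerDirs, int readersPerDevice) {
        for (File dir : ledgerDirs) {
            // Writes stay ordered on a device: single writer thread.
            writePools.put(dir, Executors.newSingleThreadExecutor());
            // Reads can be reordered by the disk: several reader threads.
            readPools.put(dir, Executors.newFixedThreadPool(readersPerDevice));
        }
    }

    public void submitRead(File dir, Runnable readTask) {
        readPools.get(dir).execute(readTask);
    }

    public void submitWrite(File dir, Runnable writeTask) {
        writePools.get(dir).execute(writeTask);
    }

    public int deviceCount() {
        return readPools.size();
    }

    public void shutdown() {
        writePools.values().forEach(ExecutorService::shutdown);
        readPools.values().forEach(ExecutorService::shutdown);
    }
}
```

Keying the pools by device rather than by operation type is exactly the separation Flavio raises later in the thread: a slow disk then only delays its own tasks instead of the whole bookie.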
> > > > > >>>>
> > > > > >>>> On Tue, Sep 25, 2012 at 2:23 AM, Flavio Junqueira
> > > > > >>>> <[email protected]> wrote:
> > > > > >>>>
> > > > > >>>>>
> > > > > >>>>> On Sep 25, 2012, at 10:55 AM, Aniruddha Laud wrote:
> > > > > >>>>>
> > > > > >>>>>> On Tue, Sep 25, 2012 at 1:35 AM, Flavio Junqueira
> > > > > >>>>>> <[email protected]> wrote:
> > > > > >>>>>>
> > > > > >>>>>>> Just to add a couple of comments to the discussion,
> > > > > >>>>>>> separating reads and writes into different threads
> > > > > >>>>>>> should only help with queuing latency. It wouldn't help
> > > > > >>>>>>> with IO latency.
> > > > > >>>>>>
> > > > > >>>>>> Yes, but with the current implementation, publish
> > > > > >>>>>> latencies in hedwig suffer because of lagging
> > > > > >>>>>> subscribers. By separating read and write queues, we can
> > > > > >>>>>> at least guarantee that the write SLA is maintained (a
> > > > > >>>>>> separate journal disk + a separate thread would ensure
> > > > > >>>>>> that writes are not affected by read related seeks).
> > > > > >>>>>
> > > > > >>>>> Agreed, and based on my comment below, I was wondering if
> > > > > >>>>> it wouldn't be best to separate traffic across threads by
> > > > > >>>>> device instead of by operation type.
> > > > > >>>>>
> > > > > >>>>>>> Also, it sounds like a good idea to have at least one
> > > > > >>>>>>> thread per ledger device. In the case of multiple ledger
> > > > > >>>>>>> devices, if we use one single thread, then the
> > > > > >>>>>>> performance of the bookie will be driven by the slowest
> > > > > >>>>>>> disk, no?
> > > > > >>>>>>
> > > > > >>>>>> Yup, makes sense.
> > > > > >>>>>>
> > > > > >>>>>>> -Flavio
> > > > > >>>>>>>
> > > > > >>>>>>> On Sep 25, 2012, at 10:24 AM, Ivan Kelly wrote:
> > > > > >>>>>>>
> > > > > >>>>>>>>> Could you give some information on what those
> > > > > >>>>>>>>> shortcomings are? Also, do let me know if you need any
> > > > > >>>>>>>>> more information from our end.
> > > > > >>>>>>>> Off the top of my head:
> > > > > >>>>>>>> - reads and writes are handled in the same thread (as
> > > > > >>>>>>>>   you have observed)
> > > > > >>>>>>>> - each entry read requires a single RPC.
> > > > > >>>>>>>> - entries are read in parallel
> > > > > >>>>>>
> > > > > >>>>>> By parallel, you mean the BufferedChannel wrapper on top
> > > > > >>>>>> of FileChannel, right?
> > > > > >>>>>>
> > > > > >>>>>>>> Not all of these could result in the high latency you
> > > > > >>>>>>>> see, but if each entry is being read separately, a sync
> > > > > >>>>>>>> on the ledger disk in between will make a mess of the
> > > > > >>>>>>>> disk head scheduling.
> > > > > >>>>>>
> > > > > >>>>>> Increasing the time interval between flushing log files
> > > > > >>>>>> might possibly help in this case then?
> > > > > >>>>>>
> > > > > >>>>>>>> -Ivan
> > > > > >>>>>>
> > > > > >>>>>> Thanks for the help :)
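As a closing footnote to Sijie's BufferedChannel remark earlier in the thread: for random reads, one alternative to a fixed shared read buffer is a positional FileChannel.read, which fetches exactly the requested range at an exact offset and leaves caching to the OS page cache. This is only a minimal sketch; `readAt` is a hypothetical helper and the entry-log record format is abstracted away:

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch: random reads via positional FileChannel.read(),
// which does not move the channel position and needs no shared read
// buffer, so concurrent readers don't interfere and the OS page cache
// handles any reuse across nearby reads.
public class PositionalReadSketch {
    public static byte[] readAt(Path file, long offset, int length)
            throws IOException {
        try (FileChannel ch = FileChannel.open(file, StandardOpenOption.READ)) {
            ByteBuffer buf = ByteBuffer.allocate(length);
            while (buf.hasRemaining()) {
                int n = ch.read(buf, offset + buf.position());
                if (n < 0) break; // hit end of file before filling buffer
            }
            buf.flip();
            byte[] out = new byte[buf.remaining()];
            buf.get(out);
            return out;
        }
    }
}
```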
