[ 
https://issues.apache.org/jira/browse/HBASE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Todd Lipcon updated HBASE-3323:
-------------------------------

    Attachment: hbase-3323.txt

Here's a patch which basically redoes the way log splitting happens. It needs 
to be commented up and I want to rename some things, but the basic architecture 
is this:

- Main thread reads the logs in order and writes into a structure called EntrySink 
(I want to rename this to EntryBuffer or something)
- EntrySink maintains some kind of approximate heap size (I don't think I 
calculated it quite right, but c'est la vie) and also takes care of managing a 
RegionEntryBuffer for each region key.
-- The RegionEntryBuffer just has a LinkedList of Entries right now, but it 
does size accounting, and I think we could change these to a fancier data 
structure for more efficient memory usage (eg a linked list of 10000-entry 
arrays)
- If the main thread tries to append into the EntrySink but the heap usage has 
hit the max threshold, it waits (a rough sketch of this side follows the list).
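
To make the shape of this concrete, here's roughly what that side looks like. This is 
just a sketch: the names, the placeholder Entry type, and the method signatures are 
illustrative only, not necessarily what's in the attached patch, and the real code works 
on the actual HLog.Entry/HLogKey/WALEdit types rather than the stand-in below. The 
poll/done methods at the bottom are what the writer threads described next would use.

{code:java}
import java.util.HashMap;
import java.util.HashSet;
import java.util.LinkedList;
import java.util.Map;
import java.util.Set;

// Stand-in for an HLog.Entry (HLogKey + WALEdit) so the sketch is self-contained.
class Entry {
  final String regionKey;      // encoded region name in the real thing
  final long approxHeapSize;   // approximate heap usage of this edit
  Entry(String regionKey, long approxHeapSize) {
    this.regionKey = regionKey;
    this.approxHeapSize = approxHeapSize;
  }
}

// Per-region buffer: just a LinkedList of entries plus size accounting for now.
class RegionEntryBuffer {
  final String regionKey;
  final LinkedList<Entry> entries = new LinkedList<Entry>();
  long heapSize = 0;

  RegionEntryBuffer(String regionKey) {
    this.regionKey = regionKey;
  }

  long appendEntry(Entry e) {
    entries.add(e);
    heapSize += e.approxHeapSize;
    return e.approxHeapSize;
  }
}

// What the patch calls EntrySink: one RegionEntryBuffer per region key, plus
// approximate accounting of the total heap buffered across all regions.
class EntryBuffer {
  private final Map<String, RegionEntryBuffer> buffers =
      new HashMap<String, RegionEntryBuffer>();
  private final Set<String> currentlyWriting = new HashSet<String>();
  private long totalBuffered = 0;
  private final long maxHeapUsage;

  EntryBuffer(long maxHeapUsage) {
    this.maxHeapUsage = maxHeapUsage;
  }

  // Called by the single log-reading thread. Blocks once the buffered edits
  // hit the heap threshold, until a writer thread drains something.
  synchronized void appendEntry(Entry e) throws InterruptedException {
    while (totalBuffered > maxHeapUsage) {
      wait();
    }
    RegionEntryBuffer buf = buffers.get(e.regionKey);
    if (buf == null) {
      buf = new RegionEntryBuffer(e.regionKey);
      buffers.put(e.regionKey, buf);
    }
    totalBuffered += buf.appendEntry(e);
  }

  // Called by a writer thread: hand back the unclaimed region buffer with the
  // most outstanding edits, mark that region as in progress so no other writer
  // grabs it (keeps per-region appends in order), and swap in a fresh buffer.
  synchronized RegionEntryBuffer pollLargestUnclaimed() {
    RegionEntryBuffer best = null;
    for (RegionEntryBuffer buf : buffers.values()) {
      if (currentlyWriting.contains(buf.regionKey) || buf.heapSize == 0) {
        continue;
      }
      if (best == null || buf.heapSize > best.heapSize) {
        best = buf;
      }
    }
    if (best != null) {
      currentlyWriting.add(best.regionKey);
      buffers.put(best.regionKey, new RegionEntryBuffer(best.regionKey));
    }
    return best;
  }

  // Called by a writer thread after draining a buffer: release the region and
  // the heap it held, unblocking the reader thread if it was waiting.
  synchronized void doneWriting(String regionKey, long heapReleased) {
    currentlyWriting.remove(regionKey);
    totalBuffered -= heapReleased;
    notifyAll();
  }
}
{code}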

Meanwhile, there are N threads called WriterThread-n which do the following in 
a loop:
- poll the EntrySink to grab a RegionEntryBuffer
-- The EntrySink returns the one with the most outstanding edits (the hope is to 
write larger sequential chunks where possible)
-- The EntrySink also keeps track of which regions already have some thread 
working on them, so we don't end up with out-of-order appends
- The writer thread then drains the RegionEntryBuffer into the "OutputSink", which 
maintains the map from region key to WriterAndPath (bug in the uploaded patch: this 
map needs to be a synchronizedMap)
- Once the buffer is drained, it notifies the EntrySink that the memory is no 
longer in use, which unblocks the producer thread (see the sketch below)
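
Continuing the sketch above, the consumer side (the writer thread loop) looks roughly 
like this; OutputSink here is just a stand-in interface for the piece that maps region 
key to WriterAndPath:

{code:java}
import java.io.IOException;

// Stand-in for the region key -> WriterAndPath piece; append() is assumed to
// look up (or lazily create) the recovered-edits writer for the entry's region.
interface OutputSink {
  void append(Entry e) throws IOException;
}

// One of the N WriterThread-n consumers, using the EntryBuffer sketched above.
class WriterThread extends Thread {
  private final EntryBuffer entryBuffer;
  private final OutputSink outputSink;
  volatile boolean readerDone = false;  // set once the main thread has read all logs

  WriterThread(int n, EntryBuffer entryBuffer, OutputSink outputSink) {
    super("WriterThread-" + n);
    this.entryBuffer = entryBuffer;
    this.outputSink = outputSink;
  }

  @Override
  public void run() {
    try {
      while (true) {
        // Grab the unclaimed region buffer with the most outstanding edits.
        RegionEntryBuffer buf = entryBuffer.pollLargestUnclaimed();
        if (buf == null) {
          if (readerDone) {
            return;            // nothing buffered and no more logs coming
          }
          Thread.sleep(100);   // nothing ready yet, poll again shortly
          continue;
        }
        // Drain the buffer sequentially into this region's output writer.
        for (Entry e : buf.entries) {
          outputSink.append(e);
        }
        // Release the region and its heap, unblocking the reader if it blocked.
        entryBuffer.doneWriting(buf.regionKey, buf.heapSize);
      }
    } catch (InterruptedException ie) {
      Thread.currentThread().interrupt();  // the real code would surface this
    } catch (IOException ioe) {
      // the real code would abort the whole split on a write failure
    }
  }
}
{code}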

In summary, it's a fairly standard producer-consumer pattern with some trickery 
to make a separate queue per region so as not to reorder edits.

As a non-scientific test I patched this into my cluster, which was hitting the OOME 
on master startup. Not only did it start up fine, but the log splits also ran about 
50% faster than they did before!

Known bug: the "log N of M" progress message always says "log 1 of M".

Thoughts?

> OOME in master splitting logs
> -----------------------------
>
>                 Key: HBASE-3323
>                 URL: https://issues.apache.org/jira/browse/HBASE-3323
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: hbase-3323.txt, sizes.png
>
>
> In testing a RS failure under heavy increment workload I ran into an OOME 
> when the master was splitting the logs.
> In this test case, I have exactly 136 bytes per log entry in all the logs, 
> and the logs are all around 66-74MB. With a batch size of 3 logs, this means 
> the master is loading about 500K-600K edits per log file. Each edit ends up 
> creating 3 byte[] objects, the references for which are each 8 bytes of RAM, 
> so we have 160 (136+8*3) bytes per edit used by the byte[]. For each edit we 
> also allocate a bunch of other objects: one HLog$Entry, one WALEdit, one 
> ArrayList, one LinkedList$Entry, one HLogKey, and one KeyValue. Overall this 
> works out to 400 bytes of overhead per edit. So, with the default settings on 
> this fairly average workload, the 1.5M log entries take about 770MB of RAM. 
> Since I had a few log files that were a bit larger (around 90MB) it exceeded 
> 1GB of RAM and I got an OOME.
> For one, the 400 bytes per edit overhead is pretty bad, and we could probably 
> be a lot more efficient. For two, we should actually account for this memory rather than 
> simply having a configurable "batch size" in the master.
> I think this is a blocker because I'm running with fairly default configs 
> here and just killing one RS made the cluster fall over due to master OOME.
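
For reference, the back-of-the-envelope accounting in the description above works out 
roughly as follows (a sketch using the approximate figures quoted there):

{code:java}
public class SplitMemoryEstimate {
  public static void main(String[] args) {
    long payloadBytes   = 136;      // bytes of key/value data per edit in this workload
    long byteArrayRefs  = 3 * 8;    // three byte[] references per edit
    long objectOverhead = 400;      // HLog$Entry, WALEdit, ArrayList, LinkedList$Entry,
                                    //   HLogKey, KeyValue, ... (approximate)
    long perEdit = payloadBytes + byteArrayRefs + objectOverhead;  // ~560 bytes per edit
    long edits = 1500000L;          // ~3 logs per batch x ~500-600K edits per log
    long totalMB = edits * perEdit / (1024 * 1024);
    // Prints roughly 800MB -- the same ballpark as the ~770MB figure quoted above.
    System.out.println(totalMB + "MB buffered during a single split batch");
  }
}
{code}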

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
