[ https://issues.apache.org/jira/browse/HBASE-3323?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12969895#action_12969895 ]

Cosmin Lehene commented on HBASE-3323:
--------------------------------------


Here's the object distribution tlipcon mentioned:

{code}
The values of this map contain the 1.5M+ edits (in Entry objects) tlipcon
mentioned:

Map<byte[], LinkedList<Entry>> editsByRegion
      |                  |
(encodedRegionName)      |
      .                  +--- Entry
      .                         |
      .                         +--- WALEdit edit
      .                         |       |
      .                         |       +--- ArrayList<KeyValue> kvs
      .                         |                |
      .                         |                +--- byte[] bytes
      .                         |
      .                         +--- HLogKey key
      .                                 |
      . . . . . . . . . . . . . . . . . +--- byte[] encodedRegionName
                                        |
                                        +--- byte[] tableName

      (the HLogKey part is useless as we could have this info in the map key)

{code}

The splitLog workflow loads all the edits into a map indexed by region, and 
then uses a thread pool to write them out to the per-region directories.


As you can see from this diagram, each edit duplicates the tableName and the 
encodedRegionName (hence the 2 extra byte[]). 
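
To get a rough sense of what that duplication costs, here's a 
back-of-the-envelope calculation; the array header size and the name lengths 
are just assumptions for illustration, not measured values:

{code}
public class DuplicationEstimate {
  public static void main(String[] args) {
    long edits = 1500000L;           // ~1.5M edits, from the issue description
    int arrayHeader = 16;            // assumed byte[] object header on a 64-bit JVM
    int reference = 8;               // reference size, as in the issue description
    int encodedRegionNameLen = 32;   // assumed length, for illustration only
    int tableNameLen = 10;           // assumed length, for illustration only

    long perEdit = 2 * (arrayHeader + reference)   // the two extra byte[] + their refs
                 + encodedRegionNameLen + tableNameLen;
    System.out.println(edits * perEdit / (1024 * 1024) + " MB duplicated");
    // with these assumptions, roughly 128 MB of heap holds nothing but
    // repeated region/table names
  }
}
{code}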

*One simple, partial solution:*
We can reduce the memory footprint by putting the tableName in the map key 
alongside the encodedRegionName (it's essentially free). This would leave us 
with a LinkedList of WALEdit objects (ArrayList + KeyValue + the actual info: 
byte[]). Of course this could be compressed further, but it might not be worth 
it (WALEdit has a replication scope as well, IIUC). 
This is a partial solution, since it still doesn't address the case where we 
have too much data in the HLogs.
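
A minimal sketch of what I mean by the composite key; RegionKey is a 
hypothetical class just for illustration, not something in the codebase:

{code}
import java.util.Arrays;

// Hypothetical composite map key: encodedRegionName and tableName are stored
// once per region instead of once per edit. The map values could then hold
// only the per-edit payload (WALEdit plus its sequence number) rather than a
// full HLogKey per entry.
final class RegionKey {
  final byte[] encodedRegionName;
  final byte[] tableName;

  RegionKey(byte[] encodedRegionName, byte[] tableName) {
    this.encodedRegionName = encodedRegionName;
    this.tableName = tableName;
  }

  @Override
  public boolean equals(Object o) {
    if (!(o instanceof RegionKey)) return false;
    RegionKey other = (RegionKey) o;
    return Arrays.equals(encodedRegionName, other.encodedRegionName)
        && Arrays.equals(tableName, other.tableName);
  }

  @Override
  public int hashCode() {
    return 31 * Arrays.hashCode(encodedRegionName) + Arrays.hashCode(tableName);
  }
}
{code}

The split map would then look something like 
Map<RegionKey, LinkedList<WALEdit>> instead of keeping an HLogKey inside 
every entry.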


*A second solution/suggestion:*

We can change the split process a bit. Let me explain how HLogs are organized 
and how we split (please correct me if I'm wrong):

*Context:*
* Each region server has one HLog directory in HDFS (under /hbase/.logs).
* In each region server's directory there is a set of HLog files. 
* The HLog files within a region server's dir have a strict order, and the 
edits inside each file are ordered as well. 
* We currently read all the files into memory first, because we need all the 
edits for a particular region together and we have to respect their order. 
* Only after everything is read do we use a thread pool to distribute the log 
entries per region (see the sketch after this list). 
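
To make the current behaviour concrete, here is a minimal sketch of the 
read-everything-first phase as I understand it (class and method names are 
from memory and only illustrative; this is not the actual HLogSplitter code):

{code}
import java.io.IOException;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.regionserver.wal.HLog;
import org.apache.hadoop.hbase.util.Bytes;

public class BufferedSplitSketch {
  // Phase 1 (simplified): every entry of every log in the batch is buffered
  // in editsByRegion before any writer thread runs -- this is where the heap
  // goes.
  static Map<byte[], LinkedList<HLog.Entry>> readBatch(FileSystem fs,
      Configuration conf, List<Path> logFilesInBatch) throws IOException {
    Map<byte[], LinkedList<HLog.Entry>> editsByRegion =
        new TreeMap<byte[], LinkedList<HLog.Entry>>(Bytes.BYTES_COMPARATOR);
    for (Path logFile : logFilesInBatch) {
      HLog.Reader reader = HLog.getReader(fs, logFile, conf);
      try {
        HLog.Entry entry;
        while ((entry = reader.next()) != null) {
          byte[] region = entry.getKey().getEncodedRegionName();
          LinkedList<HLog.Entry> edits = editsByRegion.get(region);
          if (edits == null) {
            edits = new LinkedList<HLog.Entry>();
            editsByRegion.put(region, edits);
          }
          edits.add(entry);   // all 1.5M+ entries stay on the heap
        }
      } finally {
        reader.close();
      }
    }
    // Phase 2 (not shown): a thread pool drains each region's list into a
    // recovered-edits file under that region's directory.
    return editsByRegion;
  }
}
{code}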

*Suggestion:*
We could read the files in parallel and, instead of writing a single file into 
each region's directory, write one file per source HLog. This would keep all 
the edits in strict order. The HRegionServer could then safely load the files 
in the same order and apply the edits. 

While reading the files in parallel we don't have to hold the entire content 
in memory: we can simply read each entry and write it straight to the 
corresponding destination file. This should solve the memory footprint 
problem. 
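
Here is a minimal sketch of the streaming variant, assuming one split task per 
HLog file; the reader/writer calls and the output naming are only illustrative:

{code}
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.hbase.regionserver.wal.HLog;
import org.apache.hadoop.hbase.util.Bytes;

public class StreamingSplitSketch {
  // Split ONE HLog file by streaming: each entry is appended directly to a
  // per-region output file named after the source log, so nothing accumulates
  // in memory and the per-log ordering is preserved. The path layout and the
  // "recovered." naming are made up for the example.
  static void splitOneLog(FileSystem fs, Configuration conf, Path hlogFile,
      Path baseDir) throws IOException {
    Map<String, HLog.Writer> writersByRegion = new HashMap<String, HLog.Writer>();
    HLog.Reader reader = HLog.getReader(fs, hlogFile, conf);
    try {
      HLog.Entry entry;
      while ((entry = reader.next()) != null) {
        String region = Bytes.toString(entry.getKey().getEncodedRegionName());
        HLog.Writer writer = writersByRegion.get(region);
        if (writer == null) {
          // one output per (region, source HLog) pair
          Path out = new Path(new Path(baseDir, region),
              "recovered." + hlogFile.getName());
          writer = HLog.createWriter(fs, out, conf);
          writersByRegion.put(region, writer);
        }
        writer.append(entry);   // write-through: no per-region list in memory
      }
    } finally {
      reader.close();
      for (HLog.Writer w : writersByRegion.values()) {
        w.close();
      }
    }
  }
}
{code}

Running one such task per HLog file in a thread pool would give us the 
parallel read as well.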


I haven't spent too much time analyzing the second option; it might have been 
discussed in the past, so if I'm missing something let me know.


Cosmin


> OOME in master splitting logs
> -----------------------------
>
>                 Key: HBASE-3323
>                 URL: https://issues.apache.org/jira/browse/HBASE-3323
>             Project: HBase
>          Issue Type: Bug
>          Components: master
>    Affects Versions: 0.90.0
>            Reporter: Todd Lipcon
>            Priority: Blocker
>             Fix For: 0.90.0
>
>         Attachments: sizes.png
>
>
> In testing a RS failure under heavy increment workload I ran into an OOME 
> when the master was splitting the logs.
> In this test case, I have exactly 136 bytes per log entry in all the logs, 
> and the logs are all around 66-74MB. With a batch size of 3 logs, this means 
> the master is loading about 500K-600K edits per log file. Each edit ends up 
> creating 3 byte[] objects, the references for which are each 8 bytes of RAM, 
> so we have 160 (136+8*3) bytes per edit used by the byte[]. For each edit we 
> also allocate a bunch of other objects: one HLog$Entry, one WALEdit, one 
> ArrayList, one LinkedList$Entry, one HLogKey, and one KeyValue. Overall this 
> works out to 400 bytes of overhead per edit. So, with the default settings on 
> this fairly average workload, the 1.5M log entries take about 770MB of RAM. 
> Since I had a few log files that were a bit larger (around 90MB) it exceeded 
> 1GB of RAM and I got an OOME.
> For one, the 400 bytes per edit overhead is pretty bad, and we could probably 
> be a lot more efficient. For two, we should actually account for this rather than 
> simply having a configurable "batch size" in the master.
> I think this is a blocker because I'm running with fairly default configs 
> here and just killing one RS made the cluster fall over due to master OOME.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
