[ https://issues.apache.org/jira/browse/HBASE-14921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387050#comment-15387050 ]

Anoop Sam John commented on HBASE-14921:
----------------------------------------

I got your argument about dynamic decision making for compaction vs. flatten only. 
My worry was how costly it would be to do another round of scan. It involves the 
ScanQueryMatcher (SQM) and a heap with many compares, so it is not cheap. As of 
now you are not adding the CellChunkMap based flattening; things will be much 
worse when we have that. In that flattened form we get rid of Cell objects as 
such, but this scan needs the Cells to be created again, which means lots of 
garbage.
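Just to illustrate the kind of cost I mean (a toy sketch only, not HBase code; the 
segment contents and class names are made up): the extra decision scan is 
essentially a k-way merge over the pipeline segments through a heap, so every cell 
visited pays heap compares on top of the SQM work, and with CellChunkMap it also 
pays Cell re-creation.
{code}
// Toy sketch only, not HBase code: a k-way merge over "segments" via a heap.
// Each cell visited costs O(log k) comparator calls; the real scan adds
// ScanQueryMatcher work and, for CellChunkMap, fresh Cell object creation.
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.PriorityQueue;

public class MergeScanCostSketch {
  public static void main(String[] args) {
    // Stand-ins for already-sorted pipeline segments.
    List<List<String>> segments = Arrays.asList(
        Arrays.asList("a", "d", "g"),
        Arrays.asList("b", "d", "h"),
        Arrays.asList("c", "e", "f"));

    // Heap entries are {segmentIndex, position}, ordered by the current key.
    PriorityQueue<int[]> heap = new PriorityQueue<int[]>(
        (x, y) -> segments.get(x[0]).get(x[1]).compareTo(segments.get(y[0]).get(y[1])));
    for (int i = 0; i < segments.size(); i++) {
      heap.add(new int[] { i, 0 });
    }

    // Every poll/offer pair re-heapifies; with millions of memstore cells this
    // is the comparator cost the extra scan pays.
    List<String> merged = new ArrayList<>();
    while (!heap.isEmpty()) {
      int[] top = heap.poll();
      merged.add(segments.get(top[0]).get(top[1]));
      if (top[1] + 1 < segments.get(top[0]).size()) {
        heap.add(new int[] { top[0], top[1] + 1 });
      }
    }
    System.out.println(merged);  // [a, b, c, d, d, e, f, g, h]
  }
}
{code}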
Maybe in a use case where duplicates are possible, there is a chance that there 
are not enough duplicate records to get real benefit out of compaction, so 
flattening alone would be enough; an extra scan may make sense there. But when the 
use case is something like time series data, where we really don't expect 
duplicates/updates, it might be better to turn off compaction and do only the 
flatten.
Again, flattening to CellChunkMap would be ideal, as that releases and reduces the 
heap memory footprint of this memstore considerably; CellArrayMap reduces it too, 
but not by much. In your use case the maximum advantage comes from the compaction, 
since many cells get removed. A rough sketch of such a decision is below.
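Purely hypothetical, not from any patch (the threshold and the duplicate-ratio 
input are made up for the sketch), the decision could look like:
{code}
// Hypothetical heuristic, not from any patch: pick compaction vs flatten-only
// from an estimated duplicate ratio. The threshold and the unique-count input
// are made up for the sketch.
public class CompactOrFlattenSketch {
  static final double DUPLICATE_THRESHOLD = 0.3;  // assumed tuning knob

  static String decide(long totalCells, long estimatedUniqueCells) {
    double duplicateRatio = 1.0 - (double) estimatedUniqueCells / totalCells;
    // Many duplicates: the extra scan pays off because cells are eliminated.
    // Few duplicates (time-series-like): just swap to a flat index.
    return duplicateRatio > DUPLICATE_THRESHOLD ? "compact" : "flatten";
  }

  public static void main(String[] args) {
    System.out.println(decide(1_000_000, 950_000));  // flatten
    System.out.println(decide(1_000_000, 500_000));  // compact
  }
}
{code}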

My other concern is that in this memstore only the tail of the pipeline gets 
flushed to disk when a flush request comes. In the first version the compaction 
always happened, so in all likelihood the tail of the pipeline was much bigger and 
that much data got flushed. Now, when compaction is not happening at all and we 
have many small segments in the pipeline, it would be better to flush all the 
segments to disk rather than make small flushes; see the sketch below. I raised 
this concern at the first step also, but the counter then was that the compaction 
always happens, and now that is not the case.
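A rough sketch of what I mean (hypothetical, not HBase code; the segment sizes are 
just illustrative numbers):
{code}
// Hypothetical sketch of the concern, not HBase code: how much data a flush
// moves, depending on whether only the tail segment or the whole pipeline is
// flushed. The sizes below are just illustrative numbers.
import java.util.Arrays;
import java.util.List;

public class PipelineFlushSketch {
  static long bytesToFlush(List<Long> segmentSizes, boolean compactionEnabled) {
    if (segmentSizes.isEmpty()) {
      return 0L;
    }
    if (compactionEnabled) {
      // With in-memory compaction the tail tends to be one big merged segment.
      return segmentSizes.get(segmentSizes.size() - 1);
    }
    // Without compaction, flushing only the small tail gives tiny HFiles;
    // flushing the whole pipeline would be preferable.
    return segmentSizes.stream().mapToLong(Long::longValue).sum();
  }

  public static void main(String[] args) {
    List<Long> sizes = Arrays.asList(8L << 20, 6L << 20, 7L << 20);  // ~MB segments
    System.out.println(bytesToFlush(sizes, true));   // tail only
    System.out.println(bytesToFlush(sizes, false));  // whole pipeline
  }
}
{code}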

Ya, Ram will come up with a perf analysis.

bq.We are now holding more in the memory and thus having more possibility to 
let a cell "die" in memory. 
JFYI, there is a periodic memstore flush check. If we accumulate more than 30 
million edits in the memstore, we will flush:
{code}
if (this.maxFlushedSeqId > 0
    && (this.maxFlushedSeqId + this.flushPerChanges < this.mvcc.getReadPoint())) {
  whyFlush.append("more than max edits, " + this.flushPerChanges + ", since last flush");
  return true;
}
{code}
This flushPerChanges is configurable, btw.
The second check is time based: if we have not flushed the memstore for quite some 
time, we will make a flush; this interval defaults to 1 hour.
Just saying, for your consideration.
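If I remember the keys correctly (please double check against the code, the names 
here are from memory), both checks can be tuned through the normal Configuration, 
e.g.:
{code}
// Assumed configuration keys (from memory, please verify for your version):
// tune the edit-count based and the time based periodic flush checks.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.hbase.HBaseConfiguration;

public class FlushConfigSketch {
  public static void main(String[] args) {
    Configuration conf = HBaseConfiguration.create();
    // Flush once this many edits have accumulated since the last flush
    // (the 30 million default mentioned above).
    conf.setLong("hbase.regionserver.flush.per.changes", 30_000_000L);
    // Flush if the memstore has not been flushed for this long (default 1 hour).
    conf.setLong("hbase.regionserver.optionalcacheflushinterval", 3_600_000L);
    System.out.println(conf.get("hbase.regionserver.flush.per.changes"));
  }
}
{code}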


> Memory optimizations
> --------------------
>
>                 Key: HBASE-14921
>                 URL: https://issues.apache.org/jira/browse/HBASE-14921
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0
>            Reporter: Eshcar Hillel
>            Assignee: Anastasia Braginsky
>         Attachments: CellBlocksSegmentInMemStore.pdf, 
> CellBlocksSegmentinthecontextofMemStore(1).pdf, HBASE-14921-V01.patch, 
> HBASE-14921-V02.patch, HBASE-14921-V03.patch, HBASE-14921-V04-CA-V02.patch, 
> HBASE-14921-V04-CA.patch, HBASE-14921-V05-CAO.patch, 
> HBASE-14921-V06-CAO.patch, InitialCellArrayMapEvaluation.pdf, 
> IntroductiontoNewFlatandCompactMemStore.pdf
>
>
> Memory optimizations including compressed format representation and offheap 
> allocations


