[ https://issues.apache.org/jira/browse/HBASE-14921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387050#comment-15387050 ]
Anoop Sam John commented on HBASE-14921:
----------------------------------------

I got your argument about dynamic decision making for compaction vs. flatten-only. My worry was how costly another round of scanning would be. It involves an SQM and a heap with many compares; it is not cheap. As of now you are not adding the CellChunkMap-based flattening, and things will be much worse once we have it: in that flattened form we get rid of Cell objects as such, but this scan would need the Cells to be created again, which means lots of garbage.

Maybe in a use case where duplicates are possible, there is a chance there are not enough duplicate records for compaction to give a real benefit, so flattening alone would be enough; there, an extra scan may make sense. But when the use case is something like time-series data, where we really don't expect duplicates/updates, it might be better to turn off compaction and do only the flatten. Again, flattening to CellChunkMap would be ideal, as that will release and reduce the heap memory footprint of this memstore considerably. CellArrayMap, yes, it reduces it too, but not by much. In your use case, the maximum advantage comes from the compaction, as many cells get removed.

My other concern is the fact that in this memstore only the tail of the pipeline gets flushed to disk when a flush request comes. In the first version the compaction always happened, so all chances were that the tail of the pipeline was much bigger in size and that much data got flushed. Now, when compaction is not happening at all and we have many small-sized segments in the pipeline, it would be better to flush all the segments to disk rather than making small-sized flushes. I raised this concern at the first step as well, but the counter-argument then was that the compaction always happens; now that is not the case. Ya, Ram will come up with a full perf analysis.

bq. We are now holding more in the memory and thus having more possibility to let a cell "die" in memory.

JFYI:
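The flatten-vs-compact trade-off above could be decided dynamically with a cheap heuristic instead of a full extra scan. A minimal sketch (all names here are hypothetical, not HBase API): compare an estimated duplicate fraction in the segment against a threshold, and only pay the compaction scan when enough cells would actually be removed.

{code}
// Hypothetical sketch: decide in-memory compaction vs. flatten-only based on
// an estimated fraction of duplicate/updated cells in the segment.
// DUPLICATE_THRESHOLD and shouldCompact are illustrative names, not HBase API.
public final class CompactionPolicySketch {

  // Below this duplicate fraction, compaction is unlikely to reclaim enough
  // heap to justify the extra scan (SQM + heap with many compares).
  static final double DUPLICATE_THRESHOLD = 0.1;

  static boolean shouldCompact(long uniqueKeys, long totalCells) {
    if (totalCells == 0) {
      return false; // empty segment: nothing to compact
    }
    double duplicateFraction = 1.0 - (double) uniqueKeys / totalCells;
    return duplicateFraction >= DUPLICATE_THRESHOLD;
  }

  public static void main(String[] args) {
    // Time-series-like workload: almost no updates, so flatten only.
    System.out.println(shouldCompact(99, 100));
    // Update-heavy workload: many versions per key, so compact.
    System.out.println(shouldCompact(50, 100));
  }
}
{code}

For a time-series workload the duplicate fraction stays near zero, so such a check would route the segment to flatten-only (ideally into CellChunkMap) and skip the compaction scan entirely.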
There is a periodic memstore flush check. If we accumulate more than 30 million edits (the default) in the memstore since the last flush, we will flush:

{code}
if (this.maxFlushedSeqId > 0
    && (this.maxFlushedSeqId + this.flushPerChanges < this.mvcc.getReadPoint())) {
  whyFlush.append("more than max edits, " + this.flushPerChanges + ", since last flush");
  return true;
}
{code}

This flushPerChanges is configurable, btw. The second check there is time based: if we have not flushed the memstore for quite some time, we will make a flush. That interval defaults to 1 hour. Just saying for your consideration.

> Memory optimizations
> --------------------
>
>                 Key: HBASE-14921
>                 URL: https://issues.apache.org/jira/browse/HBASE-14921
>             Project: HBase
>          Issue Type: Sub-task
>    Affects Versions: 2.0.0
>            Reporter: Eshcar Hillel
>            Assignee: Anastasia Braginsky
>      Attachments: CellBlocksSegmentInMemStore.pdf, CellBlocksSegmentinthecontextofMemStore(1).pdf, HBASE-14921-V01.patch, HBASE-14921-V02.patch, HBASE-14921-V03.patch, HBASE-14921-V04-CA-V02.patch, HBASE-14921-V04-CA.patch, HBASE-14921-V05-CAO.patch, HBASE-14921-V06-CAO.patch, InitialCellArrayMapEvaluation.pdf, IntroductiontoNewFlatandCompactMemStore.pdf
>
>
> Memory optimizations including compressed format representation and offheap allocations

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)