[ https://issues.apache.org/jira/browse/HBASE-14921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387470#comment-15387470 ]
Anastasia Braginsky commented on HBASE-14921:
---------------------------------------------
Thank you [~anoop.hbase] for your very reasonable comments!
bq. But when the use case is something like time-series data, where we
really don't expect duplicates/updates, it might be better to turn off
compaction and do only flattening.
Do you suggest making an externally configurable flag for turning compaction
on and off? If so, what should the default value of this flag be? Didn't we
want sysadmins to work less with all those flags and settings (that we already
have)? If the pre-check appears to decrease performance, we can run this
compaction pre-check scan only on every second (or every Xth) flush to the
pipeline.
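To make the idea concrete, below is a minimal sketch (hypothetical, not actual HBase code; the class and field names are made up for illustration) of gating the compaction pre-check behind an on/off flag and a flush cadence:

{code:java}
import java.util.concurrent.atomic.AtomicLong;

// Minimal sketch, assuming a hypothetical per-store switch and cadence;
// neither name exists in HBase, they only illustrate the proposal above.
public class CompactionGate {
  private final boolean compactionEnabled;  // the externally configurable on/off flag
  private final int preCheckEveryNthFlush;  // run the pre-check scan only every Nth flush

  private final AtomicLong flushesToPipeline = new AtomicLong();

  public CompactionGate(boolean compactionEnabled, int preCheckEveryNthFlush) {
    this.compactionEnabled = compactionEnabled;
    this.preCheckEveryNthFlush = Math.max(1, preCheckEveryNthFlush);
  }

  // Called on each in-memory flush to the pipeline; true means
  // "run the compaction pre-check scan now".
  public boolean shouldRunPreCheck() {
    if (!compactionEnabled) {
      return false;
    }
    return flushesToPipeline.incrementAndGet() % preCheckEveryNthFlush == 0;
  }
}
{code}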
bq. Again, flattening to CellChunkMap would be ideal, as that will release
and reduce the heap memory footprint for this memstore considerably.
CellArrayMap, yes, it reduces it, but not by much.
CellChunkMap is valuable because it can be taken off-heap, but it does not
significantly reduce memory usage compared to CellArrayMap. All that
CellChunkMap saves memory-wise is that the Cell object is now "embedded" as
part of the array, so you do not need the reference and the object overhead.
The difference between CellArrayMap and CellChunkMap is therefore about 24
bytes per Cell.
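For reference, the back-of-envelope accounting behind the ~24 bytes, assuming a 64-bit JVM without compressed oops (exact sizes are JVM-dependent):

{code:java}
// Back-of-envelope per-Cell accounting (assumes a 64-bit JVM without
// compressed oops; the exact sizes are JVM-dependent).
public class PerCellOverhead {
  public static void main(String[] args) {
    final int reference = 8;      // slot in CellArrayMap referencing the Cell object
    final int objectHeader = 16;  // header of the standalone Cell object itself
    // CellChunkMap embeds the cell data in the chunk, avoiding both.
    System.out.println("Saved per Cell: " + (reference + objectHeader) + " bytes");
  }
}
{code}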
bq. In your use case, the max advantage you get is because of the compaction,
as many cells will get removed.
I do not agree. In our experiments we use (on purpose) a uniform distribution
with small data sizes, so we have few duplicates. Even so, we see that the
compaction has little impact on the performance.
bq. Another concern of mine is regarding the fact that in this memstore only
the tail of the pipeline gets flushed to disk when a flush request comes. In
the 1st version the compaction always happened, so there was every chance that
the tail of the pipeline was much bigger and that much data got flushed. Now,
when compaction is not happening at all and we have many small segments in the
pipeline, it would be better to flush all the segments to disk rather than
making small flushes. I raised this concern at the first step also, but the
counter then was that the compaction always happens; now that is not the case.
I remember this concern of yours from the code review. It is a valid concern
and we are thinking about it. Apparently, this is one more reason to do
compactions (at least merges) once in a while. We can do it when we have,
e.g., 10 segments in the pipeline. If we simply flush it all to disk, we will
create many small files, and their compaction will then have to run on
disk...
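A toy sketch of that merge-on-threshold idea (hypothetical, not HBase code; segments are modeled as sorted String maps just to keep the example self-contained):

{code:java}
import java.util.ArrayList;
import java.util.List;
import java.util.TreeMap;

// Toy sketch of merging pipeline segments once a threshold is reached,
// so a later disk flush writes one reasonably sized file instead of many
// small ones. Segments are sorted String maps only for illustration.
public class PipelineMergePolicy {
  static final int MERGE_THRESHOLD = 10; // e.g. merge once 10 segments accumulate

  private final List<TreeMap<String, String>> pipeline = new ArrayList<>();

  void flushToPipeline(TreeMap<String, String> segment) {
    pipeline.add(segment);
    if (pipeline.size() >= MERGE_THRESHOLD) {
      mergeAll();
    }
  }

  // Merge all segments into one flat segment.
  private void mergeAll() {
    TreeMap<String, String> merged = new TreeMap<>();
    for (TreeMap<String, String> segment : pipeline) {
      merged.putAll(segment); // later (newer) segments overwrite older duplicates
    }
    pipeline.clear();
    pipeline.add(merged);
  }
}
{code}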
bq. JFYI... there is a periodic memstore flush check. If we accumulate more
than 30 million edits in the memstore, we will flush.
We know there is a flush to disk about once every hour. The main reason for
that is the WAL, right? Otherwise, why would we care how many cells are in
memory? Actually, maybe in this case we do not want to flush absolutely
everything to disk; perhaps flushing just the oldest part, so the WAL can be
truncated a bit, is enough?
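A sketch of that partial-flush idea (hypothetical, not HBase code; the sequence-id bookkeeping is reduced to a single long per segment):

{code:java}
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of the partial-flush idea above. Each segment carries the highest
// WAL sequence id it contains, reduced here to a single long.
public class PartialFlushForWal {
  static final class Segment {
    final long maxSeqId;
    Segment(long maxSeqId) { this.maxSeqId = maxSeqId; }
  }

  // Oldest segment at the head, newest at the tail.
  private final Deque<Segment> pipeline = new ArrayDeque<>();

  void addSegment(Segment s) { pipeline.addLast(s); }

  // On the periodic flush request, flush only the oldest segment and
  // return its max sequence id: the WAL can then be truncated up to it,
  // without writing the whole memstore to disk.
  long flushOldest() {
    Segment oldest = pipeline.pollFirst();
    if (oldest == null) {
      return -1; // nothing to flush
    }
    writeToDisk(oldest);
    return oldest.maxSeqId;
  }

  private void writeToDisk(Segment s) {
    // placeholder for the actual flush-to-HFile path
  }
}
{code}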
> Memory optimizations
> --------------------
>
> Key: HBASE-14921
> URL: https://issues.apache.org/jira/browse/HBASE-14921
> Project: HBase
> Issue Type: Sub-task
> Affects Versions: 2.0.0
> Reporter: Eshcar Hillel
> Assignee: Anastasia Braginsky
> Attachments: CellBlocksSegmentInMemStore.pdf,
> CellBlocksSegmentinthecontextofMemStore(1).pdf, HBASE-14921-V01.patch,
> HBASE-14921-V02.patch, HBASE-14921-V03.patch, HBASE-14921-V04-CA-V02.patch,
> HBASE-14921-V04-CA.patch, HBASE-14921-V05-CAO.patch,
> HBASE-14921-V06-CAO.patch, InitialCellArrayMapEvaluation.pdf,
> IntroductiontoNewFlatandCompactMemStore.pdf
>
>
> Memory optimizations including compressed format representation and offheap
> allocations