[
https://issues.apache.org/jira/browse/HBASE-13408?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14988099#comment-14988099
]
stack commented on HBASE-13408:
-------------------------------
Thanks for the update to the design doc. It is great.
"This may result in underutilization of the memory due to duplicate entries per
row, for example, when hot data is continuously updated."
Other reasons for compactions in memory would be to move data from an
expensive, fat-if-convenient data structure -- the skip list -- to a data
structure that is more compact in memory and more efficient to access (it can
exploit the fact that it is read-only, and we can trigger the in-memory flush
when the memstore hits some 'small' size -- should improve perf). The latter is
especially pertinent in the case where a flush is encoded and/or compressed,
such that the in-memory representation may be, say, 128MB, but the hfile on
disk is much smaller than this. IIRC we have seen cases where it takes longer
to pull an entry from a large skip list than it does from cache. We could also
remove 'duplicates' by coalescing increments and by applying deletes to the
data they delete if present in memory.
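To make that last point concrete, here is roughly what I have in mind -- a
sketch only, with made-up names, nothing from the patch (delete handling and
increment coalescing elided):
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ConcurrentSkipListMap;

import org.apache.hadoop.hbase.Cell;
import org.apache.hadoop.hbase.CellComparator;
import org.apache.hadoop.hbase.CellUtil;

// Sketch: compact a read-only segment out of the skip list into a flat sorted
// array. Because the result is immutable we can drop the skip-list overhead;
// this version only drops surplus versions of a cell. Illustration only.
public class FlatSegmentSketch {
  static Cell[] compactToFlatSegment(ConcurrentSkipListMap<Cell, Cell> segment,
      CellComparator comparator, int maxVersions) {
    List<Cell> out = new ArrayList<>();
    Cell prev = null;
    int versions = 0;
    for (Cell cell : segment.keySet()) { // already sorted by the skip list
      if (prev != null && comparator.compareRows(prev, cell) == 0
          && CellUtil.matchingColumn(prev, cell)) {
        versions++;
        if (versions > maxVersions) {
          continue; // drop surplus versions of the same row/column
        }
      } else {
        versions = 1;
      }
      out.add(cell);
      prev = cell;
    }
    return out.toArray(new Cell[0]); // compact, read-only representation
  }
}
{code}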
A downside to in-memory compaction is that we carry a larger backlog of WALs,
making MTTR take longer.
bq. Data is kept in memory for as long as possible, by periodically compacting
the memstore content.
I think the principle should be more that you afford yourself the luxury of
being able to make smarter decisions on when to flush (keeping data in memory
as long as possible could actually be detrimental over the long run).
bq. A flush call may interrupt an in-progress compaction and flush part of the
data to disk.
The part that will be flushed is the 'compacted' part? Just trying to
understand better.
On the name of the config, I think it should be IN_MEMORY_COMPACTION rather
than COMPACTED. We already have a compaction process; add the IN_MEMORY prefix
to show where it applies. Also, this feature should be on by default. You might
want to state in the doc that this is your intent. Let's prove it makes life
better so we can indeed enable it by default.
Can the in-memory flush use the same code as the flush-to-disk flush? Ditto on
compaction?
bq. An additional flushtotalsize is defined as the average of flushsize and
blockingflushsize.
Pardon me, what is the above for?
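If I read the doc right, the definition is simply the midpoint between the two
existing thresholds; a sketch with illustrative names and values, not your code:
{code}
// Sketch of the quoted definition; variable names and values are illustrative.
long flushSize = 128L * 1024 * 1024;        // per-store flush threshold
long blockingFlushSize = 4 * flushSize;     // size at which updates get blocked
long flushTotalSize = (flushSize + blockingFlushSize) / 2;
{code}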
bq. In case of FlushLargeStoresPolicy, the decision is based on the size of
the active segment.
I'm not clear on when a flush to disk will happen. I get that you need headroom
for the in-memory flush and compaction to run, but can you be more clear on
where the threshold for flush to disk is? Thanks.
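For instance, is the decision roughly the below? Just my reading, sketched for
discussion; the names are guesses, not your code:
{code}
// Sketch of how I read the flush-to-disk decision; illustration only, the
// names are guesses. Per the quote, FlushLargeStoresPolicy looks at the active
// segment alone, so the compacted pipeline stays in memory until this trips.
boolean shouldFlushToDisk(long activeSegmentSize, long flushSizeLowerBound) {
  return activeSegmentSize > flushSizeLowerBound;
}
{code}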
What is a snapshot in this scheme? Is it just the last segment in the pipeline?
Does each pipeline segment become an hfile? Or do we have to do a merge sort on
flush to make the hfile?
{code}
Upon in-memory flush, we take advantage of the existing blocking mechanism: the
active segment is pushed to the pipeline while holding the region updatesLock
in exclusive mode. Then, a compaction is applied in the background to all
pipeline segments, resulting in a single immutable segment. This procedure is
executed at the scope of the store, and specifically in the context of the
compacted memstore.
{code}
Do we hold the region lock while we compact the in-memory segments on a column
family? Every time a compaction runs, it compacts all segments in the pipeline?
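To check my reading of the quote, I picture the in-memory flush step something
like the below; sketch only, none of these names are the patch's:
{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.locks.ReentrantReadWriteLock;

// Sketch of my reading of the in-memory flush path; illustration only
// (synchronization of the pipeline itself elided).
class InMemoryFlushSketch {
  private final ReentrantReadWriteLock updatesLock = new ReentrantReadWriteLock();
  private final List<Segment> pipeline = new ArrayList<>();
  private final ExecutorService compactionExecutor = Executors.newSingleThreadExecutor();
  private Segment activeSegment = new Segment();

  void inMemoryFlush() {
    updatesLock.writeLock().lock();  // same blocking mechanism as a disk flush
    try {
      pipeline.add(activeSegment);   // push the active segment into the pipeline
      activeSegment = new Segment(); // start a fresh, empty active segment
    } finally {
      updatesLock.writeLock().unlock();
    }
    // Compaction of all pipeline segments runs in the background and
    // produces a single immutable segment.
    compactionExecutor.submit(this::compactPipeline);
  }

  private void compactPipeline() {
    // merge all pipeline segments into one immutable segment (elided)
  }

  static class Segment {
  }
}
{code}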
I'm not sure I follow the approximation of oldest sequence id. Why does it have
to be an approximation? Can't we keep a running oldest sequenceid seen?
(Removing it too if compactions remove the cell that is oldest -- I suppose
this could complicate accounting... how do you find the next oldest?) So, you
have a map of timestamp to oldest sequenceid in the segment? And when a segment
is flushed to disk, the next oldest becomes the oldest edit in memory?
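For example, I would have thought each segment could keep an exact minimum, and
the store-level value would just be the min over the active segment and the
pipeline. A sketch with made-up names:
{code}
// Sketch of exact (non-approximate) oldest-sequenceid tracking per segment;
// my illustration, not the patch. The store-level value would be the minimum
// over the active segment and every segment in the pipeline.
class SegmentSeqIdTracker {
  private long minSeqId = Long.MAX_VALUE;

  void onCellAdded(long seqId) {
    minSeqId = Math.min(minSeqId, seqId);
  }

  // Recomputed after a compaction drops cells, so the value stays exact.
  void onCompactionDone(long newMinSeqId) {
    minSeqId = newMinSeqId;
  }

  long getMinSeqId() {
    return minSeqId;
  }
}
{code}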
Do you have a rig where you can try out your implementation apart from running
it inside a regionserver?
Should be CompactingMemStore rather than CompactedMemStore.
The class diagram looks great.
So, on design, we're talking about adding one more thread -- a compacting
thread -- per Store? Do we have to do this? HBase has too many threads already.
Minimally, these threads could be run by an executor so they don't live forever
while the process is up? The compaction could be triggered by an in-memory
flush.
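Something like the below is what I have in mind; just a sketch, the pool and
method names are mine, and the CompactingMemStore stub is a stand-in for your
proposed class:
{code}
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Sketch: one shared pool per regionserver instead of a long-lived thread per
// Store; illustration only, names are made up.
class InMemoryCompactionPoolSketch {
  // Minimal stand-in for the proposed class; startCompaction is a guessed method.
  interface CompactingMemStore {
    void startCompaction();
  }

  private final ExecutorService pool = Executors.newFixedThreadPool(2, r -> {
    Thread t = new Thread(r, "in-memory-compaction");
    t.setDaemon(true);
    return t;
  });

  // The in-memory flush, not a dedicated per-store thread, kicks off the work.
  void onInMemoryFlush(CompactingMemStore memstore) {
    pool.submit(memstore::startCompaction);
  }
}
{code}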
On CompactionPipeline, add rather than push-in and remove rather than pull-out?
On MemstoreScanner, are we keeping the fact that the implementation crosses
Segments an internal implementation detail? +1 on making the MemStoreScanner
first-class, detached from DefaultMemStore.
MemStoreCompactor should not be first class, right? It should be internal to
the CompactingMemStore implementation?
Yeah, it's fine if internally the kv sets go into a StoreSegment... but clients
don't have to know this, right?
How will you implement low priority compacting memstore? (Not important, just
interested)
We should add your seek and sequenceid to KeyValueScanner so you don't have to
have them implement something else (we have too many tiers of classes already)?
Why not have a set sequenceid rather than passing it to the scanner on
construction? Same with the seek.
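i.e., something along these lines; a sketch only, the method names are guesses
and not the actual KeyValueScanner API:
{code}
import java.io.IOException;

import org.apache.hadoop.hbase.Cell;

// Sketch of the suggestion: hang the new hooks off the existing scanner
// interface rather than adding another class tier. Illustration only.
public interface KeyValueScanner {
  // ... existing methods elided ...

  // Set the read point after construction instead of via the constructor.
  void setSequenceID(long sequenceId);

  // Likewise, position the scanner after construction rather than at create time.
  boolean seek(Cell key) throws IOException;
}
{code}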
Do you have a harness so you can run this little engine outside of a
regionserver? It might make testing, proofing, and debugging easier.
I suppose you'll deliver a skiplist version first and then move on to work on
in-memory storefile, a more compact in-memory representation?
Design writeup is great. Thanks. Let me review your last rb too... might help.
> HBase In-Memory Memstore Compaction
> -----------------------------------
>
> Key: HBASE-13408
> URL: https://issues.apache.org/jira/browse/HBASE-13408
> Project: HBase
> Issue Type: New Feature
> Reporter: Eshcar Hillel
> Assignee: Eshcar Hillel
> Fix For: 2.0.0
>
> Attachments: HBASE-13408-trunk-v01.patch,
> HBASE-13408-trunk-v02.patch, HBASE-13408-trunk-v03.patch,
> HBASE-13408-trunk-v04.patch, HBASE-13408-trunk-v05.patch,
> HBASE-13408-trunk-v06.patch, HBASE-13408-trunk-v07.patch,
> HBASE-13408-trunk-v08.patch,
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver02.pdf,
> HBaseIn-MemoryMemstoreCompactionDesignDocument-ver03.pdf,
> HBaseIn-MemoryMemstoreCompactionDesignDocument.pdf,
> InMemoryMemstoreCompactionEvaluationResults.pdf,
> InMemoryMemstoreCompactionMasterEvaluationResults.pdf,
> InMemoryMemstoreCompactionScansEvaluationResults.pdf,
> StoreSegmentandStoreSegmentScannerClassHierarchies.pdf
>
>
> A store unit holds a column family in a region, where the memstore is its
> in-memory component. The memstore absorbs all updates to the store; from time
> to time these updates are flushed to a file on disk, where they are
> compacted. Unlike disk components, the memstore is not compacted until it is
> written to the filesystem and optionally to block-cache. This may result in
> underutilization of the memory due to duplicate entries per row, for example,
> when hot data is continuously updated.
> Generally, the faster the data is accumulated in memory, the more flushes are
> triggered, and the data sinks to disk more frequently, slowing down retrieval
> of data, even if it is very recent.
> In high-churn workloads, compacting the memstore can help maintain the data
> in memory, and thereby speed up data retrieval.
> We suggest a new compacted memstore with the following principles:
> 1. The data is kept in memory for as long as possible
> 2. Memstore data is either compacted or in process of being compacted
> 3. Allow a panic mode, which may interrupt an in-progress compaction and
> force a flush of part of the memstore.
> We suggest applying this optimization only to in-memory column families.
> A design document is attached.
> This feature was previously discussed in HBASE-5311.