[
https://issues.apache.org/jira/browse/PHOENIX-5494?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16975309#comment-16975309
]
Kadir OZDEMIR commented on PHOENIX-5494:
----------------------------------------
[~comnetwork], yes, you can filter out cells based on timestamps if we
continue to do raw scans with all versions, something that I would like to get
rid of eventually, at least for the new design. You would also need to find the
max timestamp for a batch of replay writes and set the scan time range accordingly.
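
To illustrate the timestamp handling described above, here is a minimal sketch in plain Java. It is not Phoenix code: VersionedCell, scanTimeRange() and cellsBefore() are hypothetical names, and the rule that a replayed mutation should only see cells strictly older than itself is an assumption made for the example. The range is half-open, as HBase time ranges are, so the upper bound is the max timestamp plus one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch only: none of these types exist in Phoenix or HBase.
final class ReplayTimeRangeSketch {
    /** Hypothetical stand-in for a cell: a row key plus a timestamp. */
    static final class VersionedCell {
        final String row;
        final long timestamp;

        VersionedCell(String row, long timestamp) {
            this.row = row;
            this.timestamp = timestamp;
        }
    }

    /** Largest timestamp carried by any cell in the replayed batch. */
    static long maxTimestamp(List<VersionedCell> batch) {
        if (batch.isEmpty()) {
            throw new IllegalArgumentException("empty batch");
        }
        long max = Long.MIN_VALUE;
        for (VersionedCell cell : batch) {
            max = Math.max(max, cell.timestamp);
        }
        return max;
    }

    /** Half-open range [0, maxTs + 1), so cells written at maxTs are still scanned. */
    static long[] scanTimeRange(List<VersionedCell> batch) {
        return new long[] {0L, maxTimestamp(batch) + 1};
    }

    /** Assumed filter: keep only scanned cells strictly older than the replayed mutation. */
    static List<VersionedCell> cellsBefore(List<VersionedCell> scanned, long mutationTs) {
        List<VersionedCell> kept = new ArrayList<>();
        for (VersionedCell cell : scanned) {
            if (cell.timestamp < mutationTs) {
                kept.add(cell);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<VersionedCell> batch = Arrays.asList(
                new VersionedCell("r1", 100L),
                new VersionedCell("r2", 250L),
                new VersionedCell("r1", 175L));
        long[] range = scanTimeRange(batch);
        System.out.println("scan range: [" + range[0] + ", " + range[1] + ")");
        System.out.println("cells visible to ts=175: " + cellsBefore(batch, 175L).size());
    }
}
```

With a real scan, the two bounds would presumably be passed to the scan's time range, and the per-mutation filter would run over the raw, all-versions results.
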
For regular writes, we lock the data table rows during index updates (see
IndexRegionObserver.lockRows()). This guarantees that only one update can be
prepared for a given row. Now given that there can be only one thread working
on a given row, we only need to ensure that we can add and remove entries for
different rows to/from the map of mutations. And for this, I used
ConcurrentHashMap (see LocalTable.scanCurrentRowStates() where results = new
ConcurrentHashMap<>()).
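
The locking argument above can be sketched roughly as follows. This is a simplified illustration, not the actual IndexRegionObserver or LocalTable code: RowStateMapSketch, prepare() and complete() are hypothetical names, and the per-row ReentrantLock only stands in for the row locks taken in lockRows(). The point is that each row is touched by one thread at a time, so the shared map only has to tolerate concurrent puts and removes on different keys, which ConcurrentHashMap provides.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.locks.ReentrantLock;

// Sketch only: hypothetical names, not Phoenix classes.
final class RowStateMapSketch {
    private final Map<String, ReentrantLock> rowLocks = new ConcurrentHashMap<>();
    private final Map<String, String> rowStates = new ConcurrentHashMap<>();

    /** Records the state of one row while holding that row's lock. */
    void prepare(String row, String state) {
        ReentrantLock lock = rowLocks.computeIfAbsent(row, r -> new ReentrantLock());
        lock.lock();
        try {
            rowStates.put(row, state);
        } finally {
            lock.unlock();
        }
    }

    /** Drops the entry for a row once its update is done. */
    void complete(String row) {
        rowStates.remove(row);
    }

    int pending() {
        return rowStates.size();
    }

    /** Runs several writers on disjoint rows and returns the number of map entries. */
    static int runConcurrently(int threads, int rowsPerThread) {
        RowStateMapSketch sketch = new RowStateMapSketch();
        Thread[] workers = new Thread[threads];
        for (int t = 0; t < threads; t++) {
            final int id = t;
            workers[t] = new Thread(() -> {
                for (int i = 0; i < rowsPerThread; i++) {
                    sketch.prepare("row-" + id + "-" + i, "state");
                }
            });
            workers[t].start();
        }
        try {
            for (Thread worker : workers) {
                worker.join();
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return sketch.pending();
    }

    public static void main(String[] args) {
        System.out.println("entries: " + runConcurrently(4, 1000));
    }
}
```
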
For regular writes, we group the mutations on multiple rows into one (see
IndexRegionObserver.groupMutations()); however, we do not do this for replay
writes, at least for the new design. Actually, we break individual data
mutations into multiple mutations for the index mutation preparation to make
sure that the cells in each mutation have the same timestamp (see
flattenMutationsByTimestamp() called in IndexRegionObserver.groupMutations()).
This was the other reason I did not want to use this optimization for replay
writes. I will look into your patch more and try to merge it.
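
For reference, splitting a mutation so that every resulting piece carries a single timestamp could look roughly like the sketch below. This is not the actual flattenMutationsByTimestamp() implementation: TimedCell and flatten() are hypothetical names, and returning the groups in ascending timestamp order is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Sketch only: hypothetical names, not Phoenix classes.
final class FlattenByTimestampSketch {
    /** Hypothetical stand-in for a cell: a column name plus a timestamp. */
    static final class TimedCell {
        final String column;
        final long timestamp;

        TimedCell(String column, long timestamp) {
            this.column = column;
            this.timestamp = timestamp;
        }
    }

    /**
     * Splits the cells of one mutation into groups so that all cells in a
     * group share the same timestamp. Groups come back in ascending
     * timestamp order because a TreeMap is used (an assumption of this sketch).
     */
    static List<List<TimedCell>> flatten(List<TimedCell> cells) {
        Map<Long, List<TimedCell>> byTimestamp = new TreeMap<>();
        for (TimedCell cell : cells) {
            byTimestamp.computeIfAbsent(cell.timestamp, ts -> new ArrayList<>()).add(cell);
        }
        return new ArrayList<>(byTimestamp.values());
    }

    public static void main(String[] args) {
        List<TimedCell> cells = Arrays.asList(
                new TimedCell("a", 10L),
                new TimedCell("b", 20L),
                new TimedCell("c", 10L));
        for (List<TimedCell> group : flatten(cells)) {
            System.out.println("ts=" + group.get(0).timestamp + " cells=" + group.size());
        }
    }
}
```
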
> Batched, mutable Index updates are unnecessarily run one-by-one
> ---------------------------------------------------------------
>
> Key: PHOENIX-5494
> URL: https://issues.apache.org/jira/browse/PHOENIX-5494
> Project: Phoenix
> Issue Type: Improvement
> Reporter: Lars Hofhansl
> Assignee: Kadir OZDEMIR
> Priority: Major
> Labels: performance
> Attachments: 5494-4.x-HBase-1.5.txt,
> PHOENIX-5494-4.x-HBase-1.4.patch, PHOENIX-5494.master.001.patch,
> PHOENIX-5494.master.002.patch, PHOENIX-5494.master.003.patch,
> Screenshot_20191110_160243.png, Screenshot_20191110_160351.png,
> Screenshot_20191110_161453.png
>
> Time Spent: 1h 20m
> Remaining Estimate: 0h
>
> I just noticed that index updates on mutable tables retrieve their deletes
> (to invalidate the old index entry) one-by-one.
> For batches, this can be *the* major time spent during an index update. The
> cost is mostly incurred by the repeated setup (and seeking) of the new region
> scanner (for each row).
> We can instead do a skip scan and get all updates in a single scan per region.
> (Logically that is simple, but it will require some refactoring)
> I won't be getting to this, but recording it here in case someone feels
> inclined.