[
https://issues.apache.org/jira/browse/HBASE-2496?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Jean-Daniel Cryans updated HBASE-2496:
--------------------------------------
Attachment: HBASE-2496.patch
Patch that fixes 2 issues:
- HRS (HRegionServer): in next() we instantiate the AL (ArrayList) with the
default size, which is inefficient for big caching values. We are also
re-creating ALs in a loop for _every_ scanned row; we could just reuse the same
instance and clear it.
- ExplicitColumnTracker: reset() calls buildColumnList(), which creates a new
AL with new ColumnCounts after every row. This is unnecessary and possibly
slows down long scans. I added a way to reset the count in ColumnCount, and I
keep a reference to all of them that I then reuse in buildColumnList(). I also
added a couple of finals.
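To illustrate the two changes above, here is a minimal sketch of the general
technique, not the code in the attached patch. All class and method names
(CountSketch, TrackerSketch, ScanBufferSketch, RowSource, nextBatch) are
hypothetical stand-ins and do not match the real HBase classes. It assumes the
caching value is known before the batch loop starts.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for a per-column counter; not the HBase ColumnCount class.
final class CountSketch {
    private final String column;
    private int count;

    CountSketch(String column) { this.column = column; }

    String getColumn() { return column; }
    int getCount() { return count; }
    int increment() { return ++count; }

    // Lets a tracker reuse this object for the next row instead of allocating a new one.
    void reset() { count = 0; }
}

// Hypothetical tracker that builds its counters once and reuses them for every row.
final class TrackerSketch {
    private final List<CountSketch> allCounts; // created once, never reallocated
    private final List<CountSketch> pending;   // columns still wanted in the current row

    TrackerSketch(List<String> columns) {
        allCounts = new ArrayList<>(columns.size());
        for (String column : columns) {
            allCounts.add(new CountSketch(column));
        }
        pending = new ArrayList<>(columns.size());
        reset();
    }

    // Called after every row: clear and refill the same list, zeroing each counter.
    void reset() {
        pending.clear();
        for (CountSketch c : allCounts) {
            c.reset();
            pending.add(c);
        }
    }

    List<CountSketch> getPending() { return pending; }
}

// Hypothetical batch loop showing a presized result list and a reused row buffer.
final class ScanBufferSketch {
    // Fills 'out' with one row's values; returns false once no rows remain after it.
    interface RowSource {
        boolean nextRow(List<String> out);
    }

    static List<String[]> nextBatch(RowSource source, int caching) {
        // Presized to the caching value so the backing array never has to grow.
        final List<String[]> results = new ArrayList<>(caching);
        // One scratch list for all rows instead of a new ArrayList per row.
        final List<String> scratch = new ArrayList<>();
        for (int i = 0; i < caching; i++) {
            scratch.clear();
            final boolean more = source.nextRow(scratch);
            if (!scratch.isEmpty()) {
                // Copy the values out before the buffer is reused.
                results.add(scratch.toArray(new String[0]));
            }
            if (!more) {
                break;
            }
        }
        return results;
    }
}
```

The trade-off in both cases is the same: the reused list must be cleared (or
its contents copied out) before the next row, otherwise values leak from one
row into the next.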
I tested this patch on 166M rows with a modified RowCounter that doesn't use
block caching and that caches 10k rows. I tested each version 3 times (without
and with the patch), major compacting the table before counting and restarting
between the 2 runs (but not restarting HDFS).
With the patch:
3mins, 34sec
3mins, 19sec
3mins, 22sec
Without the patch:
4mins, 36sec
3mins, 56sec
3mins, 55sec
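
For a rough comparison, the six timings above can be averaged. The sketch
below only does arithmetic on the numbers reported in this comment; the class
and method names are made up for illustration. Three runs per version is a
small sample, and the first unpatched run (4mins, 36sec) is noticeably slower
than the other two, so the result should be read as indicative only.

```java
// Hypothetical helper that averages the run times reported above.
final class ScanTimingSketch {
    // Converts a "Xmins, Ysec" timing to seconds.
    static int seconds(int minutes, int secs) {
        return minutes * 60 + secs;
    }

    // Arithmetic mean of the given run times, in seconds.
    static double mean(int... runs) {
        long sum = 0;
        for (int run : runs) {
            sum += run;
        }
        return (double) sum / runs.length;
    }

    // Percentage by which the patched mean is lower than the unpatched mean.
    static double reductionPercent(double patched, double unpatched) {
        return 100.0 * (unpatched - patched) / unpatched;
    }

    public static void main(String[] args) {
        double patched = mean(seconds(3, 34), seconds(3, 19), seconds(3, 22));
        double unpatched = mean(seconds(4, 36), seconds(3, 56), seconds(3, 55));
        System.out.println("patched mean (s):   " + patched);
        System.out.println("unpatched mean (s): " + unpatched);
        System.out.println("reduction (%):      " + reductionPercent(patched, unpatched));
    }
}
```

The means come out to 205 seconds with the patch and 249 seconds without it,
which is a reduction of a little under 18%.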
> Less ArrayList churn on the scan path
> -------------------------------------
>
> Key: HBASE-2496
> URL: https://issues.apache.org/jira/browse/HBASE-2496
> Project: Hadoop HBase
> Issue Type: Improvement
> Affects Versions: 0.20.3
> Reporter: Jean-Daniel Cryans
> Assignee: Jean-Daniel Cryans
> Fix For: 0.20.5, 0.21.0
>
> Attachments: HBASE-2496.patch
>
>
> Doing some profiling when testing the scanning speed of 0.20.4, I saw that we
> are spending a lot of time instantiating ArrayLists when scanning and that we
> could sometimes set the right size of the arrays. I don't expect big
> improvements for short scans, but people like us who are scanning in batches
> of 10k could get some nice speedups.