[
https://issues.apache.org/jira/browse/ACCUMULO-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14028114#comment-14028114
]
Josh Elser edited comment on ACCUMULO-2827 at 6/11/14 5:59 PM:
---------------------------------------------------------------
Results of Accumulo continuous ingest (against 1.5.1 on Hadoop 2.2.0). Tests
were run on a single-node machine with 12 physical cores, 64GB RAM, and 8 drives:
Test:
- Ingest roughly 1 billion entries (set NUM=1000000000)
- Pre-split into 8 tablets
- table.split.threshold=100G (Avoid splits so we can have more entries per
tablet)
- table.compaction.major.ratio=4
- table.file.max=10
- tserver.compaction.major.concurrent.max=9 (enough to have all compactions
running concurrently)
- tserver.compaction.major.thread.files.open.max=20 (all files open at once
during majc)
- tserver.memory.maps.max=4G
We used only 1 ingester instance (so a single BatchWriter thread).
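For reference, these settings can be applied from the Accumulo shell. A sketch, assuming the continuous-ingest table is named ci (as in the compact command below); the split points are illustrative placeholders, not the actual ones used:
{noformat}
accumulo shell -u root -p <secret>
root@instance> config -s tserver.compaction.major.concurrent.max=9
root@instance> config -s tserver.compaction.major.thread.files.open.max=20
root@instance> config -s tserver.memory.maps.max=4G
root@instance> config -t ci -s table.split.threshold=100G
root@instance> config -t ci -s table.compaction.major.ratio=4
root@instance> config -t ci -s table.file.max=10
root@instance> addsplits -t ci 1 2 3 4 5 6 7
{noformat}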
Results:
After ingest completed, we triggered a full majc and timed how long it took to
complete.
{noformat}
time accumulo shell -u root -p <secret> -e 'compact -t ci -w'
{noformat}
1.5.1 old heap iterator
{noformat}
real 21m48.785s
user 0m6.014s
sys 0m0.475s
{noformat}
1.5.1 new heap iterator
{noformat}
real 20m45.002s
user 0m5.693s
sys 0m0.456s
{noformat}
> HeapIterator optimization
> -------------------------
>
> Key: ACCUMULO-2827
> URL: https://issues.apache.org/jira/browse/ACCUMULO-2827
> Project: Accumulo
> Issue Type: Improvement
> Affects Versions: 1.5.1, 1.6.0
> Reporter: Jonathan Park
> Assignee: Jonathan Park
> Priority: Minor
> Fix For: 1.5.2, 1.6.1, 1.7.0
>
> Attachments: ACCUMULO-2827.0.patch.txt, accumulo-2827.raw_data,
> new_heapiter.png, old_heapiter.png, together.png
>
>
> We've been running a few performance tests of our iterator stack and noticed
> a decent amount of time spent in the HeapIterator, specifically in additions
> to and removals from the heap.
> This may not be a general enough optimization but we thought we'd see what
> people thought. Our assumption is that it's more probable that the current
> "top iterator" will supply the next value in the iteration than not. The
> current implementation takes the opposite assumption by always removing and
> re-inserting the minimum iterator into the heap. With the binary heap
> implementation we're using, this can get costly if our assumption is wrong,
> because we pay the logarithmic penalty of percolating the iterator up the
> heap on insertion and again percolating it down on removal.
> We believe our assumption is a fair one to hold given that major compactions
> create a log distribution of file sizes, so it's likely that we will see long
> runs of consecutive entries coming from a single iterator.
> Understandably, taking this assumption comes at an additional cost in the
> case that we're wrong. Therefore, we've run a few benchmarking tests to see
> how much of a cost we pay as well as what kind of benefit we see. I've
> attached a potential patch (which includes a test harness) + image that
> captures the results of our tests. The x-axis represents # of repeated keys
> before switching to another iterator. The y-axis represents iteration time.
> The sets of blue and red lines vary in the number of iterators present in the heap.
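The idea in the description can be sketched in plain Java; this is a simplified illustration over integer sources, not Accumulo's actual HeapIterator code. The current minimum source is cached outside the PriorityQueue, so a long run of consecutive entries from one source costs no heap operations, and the O(log n) remove/insert is paid only when another source's head becomes smaller:

```java
import java.util.*;

public class CachedTopMerge {
    /** One source in the merge: an iterator plus its buffered head element. */
    static final class Source implements Comparable<Source> {
        final Iterator<Integer> it;
        Integer head;
        Source(Iterator<Integer> it) { this.it = it; head = it.hasNext() ? it.next() : null; }
        void advance() { head = it.hasNext() ? it.next() : null; }
        public int compareTo(Source o) { return Integer.compare(head, o.head); }
    }

    public static List<Integer> merge(List<List<Integer>> inputs) {
        PriorityQueue<Source> heap = new PriorityQueue<>();
        for (List<Integer> in : inputs) {
            Source s = new Source(in.iterator());
            if (s.head != null) heap.add(s);
        }
        List<Integer> out = new ArrayList<>();
        Source top = heap.poll();            // cache the minimum source outside the heap
        while (top != null) {
            out.add(top.head);
            top.advance();
            if (top.head == null) {
                top = heap.poll();           // cached source exhausted: pull a new minimum
            } else if (!heap.isEmpty() && heap.peek().head < top.head) {
                heap.add(top);               // another source is now smaller: pay the
                top = heap.poll();           // O(log n) re-insert and removal
            }
            // else: the cached source still supplies the next value, so a long
            // run of entries from one source skips the heap entirely.
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(merge(List.of(
                List.of(1, 2, 3, 7), List.of(4, 5, 6), List.of(0, 8))));
        // prints [0, 1, 2, 3, 4, 5, 6, 7, 8]
    }
}
```

The baseline behavior being compared against would unconditionally re-insert `top` after every `advance()`, paying two heap operations per emitted entry even during a long run from one source.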
--
This message was sent by Atlassian JIRA
(v6.2#6252)