[jira] [Commented] (ACCUMULO-2827) HeapIterator optimization

Keith Turner (JIRA) Thu, 19 Jun 2014 17:35:25 -0700

    [ https://issues.apache.org/jira/browse/ACCUMULO-2827?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14038166#comment-14038166 ]


Keith Turner commented on ACCUMULO-2827:
----------------------------------------

I ran some tests using the script in 
{{ACCUMULO-2827-compaction-performance-test.patch}}. The table below shows 
the average compaction rates I observed. I threw away the numbers from the 
first few compactions after the tserver started because the JIT was busy 
doing its thing.

||files||rows per file||cols per row||rate w/o patch||rate w/ patch||speedup||
|10|1000000|1|433,148|451,797|1.04|
|10|100000|10|557,307|684,984|1.23|
|10|10000|100|593,710|760,557|1.28|

These numbers show a nice speedup for the case where rows are interleaved but 
have many columns. The rows in the files were randomly generated. It's nice 
to see a significant difference in compaction speeds. These tests were reading 
and writing data. [~parkjsung]'s numbers show an even more dramatic difference 
for only reading data.

> HeapIterator optimization
> -------------------------
>
>                 Key: ACCUMULO-2827
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-2827
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.5.1, 1.6.0
>            Reporter: Jonathan Park
>            Assignee: Jonathan Park
>            Priority: Minor
>             Fix For: 1.5.2, 1.6.1, 1.7.0
>
>         Attachments: ACCUMULO-2827.0.patch.txt, accumulo-2827.raw_data, 
> new_heapiter.png, old_heapiter.png, together.png
>
>
> We've been running a few performance tests of our iterator stack and noticed 
> a decent amount of time spent in the HeapIterator specifically related to 
> add/removal into the heap.
> This may not be a general enough optimization but we thought we'd see what 
> people thought. Our assumption is that it's more probable that the current 
> "top iterator" will supply the next value in the iteration than not. The 
> current implementation takes the other assumption by always removing + 
> inserting the minimum iterator back into the heap. With the implementation of 
> a binary heap that we're using, this can get costly if our assumption is 
> wrong because we pay the log penalty of percolating up the iterator in the 
> heap upon insertion and again when percolating down upon removal.
> We believe our assumption is a fair one to hold given that as major 
> compactions create a log distribution of file sizes, it's likely that we may 
> see a long chain of consecutive entries coming from 1 iterator. 
> Understandably, taking this assumption comes at an additional cost in the 
> case that we're wrong. Therefore, we've run a few benchmarking tests to see 
> how much of a cost we pay as well as what kind of benefit we see. I've 
> attached a potential patch (which includes a test harness) + image that 
> captures the results of our tests. The x-axis represents # of repeated keys 
> before switching to another iterator. The y-axis represents iteration time. 
> The sets of blue + red lines vary in the # of iterators present in the heap.
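
To make the idea in the description above concrete, here is a minimal sketch 
of a k-way merge that holds the current "top iterator" outside the heap and 
only touches the heap when another source has a smaller key. This is not the 
attached patch and not Accumulo's HeapIterator: the names {{LazyTopMerge}}, 
{{Source}} and {{merge}} are hypothetical, keys are plain integers, and 
{{java.util.PriorityQueue}} stands in for the binary heap mentioned in the 
description.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch; not Accumulo's HeapIterator or the attached patch. */
final class LazyTopMerge {

    /** A sorted input with one element of lookahead. */
    static final class Source implements Comparable<Source> {
        private final Iterator<Integer> it;
        private Integer head; // null once the input is exhausted

        Source(Iterator<Integer> it) {
            this.it = it;
            advance();
        }

        boolean hasTop() { return head != null; }
        int top() { return head; }
        void advance() { head = it.hasNext() ? it.next() : null; }

        // Only called on sources that still have a top element.
        @Override
        public int compareTo(Source other) {
            return Integer.compare(head, other.head);
        }
    }

    /** Merges already-sorted inputs into one sorted list. */
    static List<Integer> merge(List<List<Integer>> sortedInputs) {
        PriorityQueue<Source> heap = new PriorityQueue<>();
        for (List<Integer> input : sortedInputs) {
            Source s = new Source(input.iterator());
            if (s.hasTop()) {
                heap.add(s);
            }
        }

        List<Integer> out = new ArrayList<>();
        // The current minimum source is held OUTSIDE the heap.
        Source current = heap.poll();
        while (current != null) {
            out.add(current.top());
            current.advance();

            Source next = heap.peek();
            if (!current.hasTop()) {
                // Current source is exhausted: take the next smallest, if any.
                current = heap.poll();
            } else if (next != null && next.compareTo(current) < 0) {
                // Another source now has the smaller key: swap it with current.
                heap.poll();
                heap.add(current);
                current = next;
            }
            // Otherwise current still holds the minimum, so no heap
            // insertion or removal is needed for this step.
        }
        return out;
    }
}
```

For example, {{LazyTopMerge.merge(Arrays.asList(Arrays.asList(1, 2, 3, 10), 
Arrays.asList(4, 5, 6), Arrays.asList(7, 8, 9, 11)))}} returns the values 1 
through 11 in order. During a run of consecutive keys from one input, each 
step costs a single comparison against {{heap.peek()}} instead of a removal 
plus an insertion; the cost of being wrong is that extra comparison before 
the usual heap operations.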



--
This message was sent by Atlassian JIRA
(v6.2#6252)
