[jira] [Issue Comment Edited] (ACCUMULO-348) Adding splits to table via the shell with addsplits is very slow when adding a lot of split points

Keith Turner (Issue Comment Edited) (JIRA) Fri, 30 Mar 2012 13:27:53 -0700

    [ 
https://issues.apache.org/jira/browse/ACCUMULO-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242631#comment-13242631
 ]


Keith Turner edited comment on ACCUMULO-348 at 3/30/12 8:27 PM:
----------------------------------------------------------------

I put together a workaround for 1.3.5 and 1.4.0 and posted it on github.  This 
adds lots of splits to a table much faster.

  https://github.com/keith-turner/Accumulo-Parallel-Splitter

While testing this I discovered more about why adding lots of splits is slow 
and another workaround.  While trying to add 99,999 splits to a table using the 
addsplits command in the shell, I noticed on the monitor page that the rate 
seemed to be slowing down.  I used jstack to look at the process adding split 
points and noticed the stack traces were always doing metadata lookups.  After 
a split the client has to refresh its tablet location cache by looking in the 
metadata table.  I went to the tablet server and saw that metadata lookups were 
taking more than a quarter of a second.

{noformat}
30 17:36:09,458 [tabletserver.TabletServer] DEBUG: MultiScanSess 
xxx.xxx.xxx.3:42412 4 entries in 0.29 secs (lookup_time:0.29 secs tablets:1 
ranges:1)
{noformat}

I thought about why this was going on and it occurred to me that the code was 
always splitting the last tablet.  This meant that columns in the metadata 
table were always getting updated and therefore had lots of versions.  These 
versions were all kept in memory and suppressed by the versioning iterator.  
About 60k tablets had been added.  I knew that if I flushed the metadata table, 
it would get rid of all of these versions.  Below is the minor compaction caused 
by flushing the metadata table.  It read 1,394,754 entries and wrote 724,252, so 
it dropped about 670K key/values.  Some of the dropped data may have been from 
deleted tables left over from previous experiments, and some of it was old 
versions of key/values for the last tablet.

{noformat}
30 17:36:09,698 [tabletserver.Compactor] DEBUG: Compaction !0;~;p\\;3c7 
1,394,754 read | 724,252 written | 581,874 entries/sec |  2.397 secs
{noformat}
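
To illustrate the effect, here is a toy model in Python.  It is not Accumulo 
code: ToyTablet, its methods, and the key name are all made up for this sketch, 
and it assumes only the newest version of a key is returned to the reader.  It 
shows how repeated updates to one key make each lookup read more entries until 
a flush drops the old versions.

```python
class ToyTablet:
    """Hypothetical stand-in for an in-memory map with version suppression."""

    def __init__(self):
        # Every version is kept here until flush() is called.
        self.in_memory = []  # list of (key, timestamp, value)
        self.clock = 0

    def update(self, key, value):
        """Write a new version of key; older versions stay in memory."""
        self.clock += 1
        self.in_memory.append((key, self.clock, value))

    def lookup(self, key):
        """Return (newest value, number of versions read to find it)."""
        entries_read = 0
        newest = None
        for k, ts, v in self.in_memory:
            if k == key:
                entries_read += 1
                if newest is None or ts > newest[0]:
                    newest = (ts, v)
        return (newest[1] if newest else None), entries_read

    def flush(self):
        """Keep only the newest version per key and return how many were dropped."""
        latest = {}
        for k, ts, v in self.in_memory:
            if k not in latest or ts > latest[k][0]:
                latest[k] = (ts, v)
        dropped = len(self.in_memory) - len(latest)
        self.in_memory = [(k, ts, v) for k, (ts, v) in latest.items()]
        return dropped


def demo(n_splits):
    """Update one key n_splits times, then compare lookups around a flush."""
    tablet = ToyTablet()
    for i in range(n_splits):
        # Splitting the last tablet rewrites the same metadata entry each time.
        tablet.update("last-tablet", "split-%06d" % i)
    _, read_before = tablet.lookup("last-tablet")
    dropped = tablet.flush()
    _, read_after = tablet.lookup("last-tablet")
    return read_before, dropped, read_after


if __name__ == "__main__":
    print(demo(60000))
```

In this model the lookup cost grows linearly with the number of splits since 
the last flush, which matches the slowdown I saw on the monitor page.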

After the flush, metadata lookups by the client doing the split were much 
faster and the rate of adding splits shot up.

{noformat}
30 17:36:09,773 [tabletserver.TabletServer] DEBUG: MultiScanSess 
xxx.xxx.xxx.3:42412 4 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 
ranges:1)
{noformat}

So another workaround is to periodically flush the metadata table when adding 
lots of splits.
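
A minimal sketch of that workaround in Python, under stated assumptions: 
add_splits and flush_metadata are hypothetical caller-supplied callables (for 
example, thin wrappers around whatever client or shell calls you use to add 
splits and flush the metadata table), and the batch size of 1000 is an 
arbitrary starting point, not a measured value.

```python
def add_splits_with_flush(split_points, add_splits, flush_metadata, batch_size=1000):
    """Add split points in batches, flushing the metadata table after each batch.

    add_splits(list_of_points) and flush_metadata() are hypothetical hooks
    supplied by the caller; this function does not talk to Accumulo itself.
    Returns the number of flushes performed.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")

    # De-duplicate and sort so each batch covers a contiguous range of splits.
    points = sorted(set(split_points))

    flushes = 0
    for start in range(0, len(points), batch_size):
        add_splits(points[start:start + batch_size])
        # Flushing here drops the old metadata versions built up by this batch.
        flush_metadata()
        flushes += 1
    return flushes


if __name__ == "__main__":
    # Dry run with stub hooks that only print what would happen.
    splits = ["%05d" % i for i in range(1, 100000)]
    count = add_splits_with_flush(
        splits,
        add_splits=lambda batch: print("add %d splits" % len(batch)),
        flush_metadata=lambda: print("flush metadata table"),
        batch_size=10000,
    )
    print("flushes:", count)
```

A smaller batch size keeps lookups fast at the cost of more flushes, so the 
right value probably depends on the cluster.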

                
                  
> Adding splits to table via the shell with addsplits is very slow when adding 
> a lot of split points
> --------------------------------------------------------------------------------------------------
>
>                 Key: ACCUMULO-348
>                 URL: https://issues.apache.org/jira/browse/ACCUMULO-348
>             Project: Accumulo
>          Issue Type: Improvement
>    Affects Versions: 1.3.5
>            Reporter: Dave Marion
>            Priority: Minor
>             Fix For: 1.5.0
>
>


--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: 
https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
