[
https://issues.apache.org/jira/browse/ACCUMULO-348?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242631#comment-13242631
]
Keith Turner edited comment on ACCUMULO-348 at 3/30/12 8:27 PM:
----------------------------------------------------------------
I put together a workaround for 1.3.5 and 1.4.0 and posted it on github. This
adds lots of splits to a table much faster.
https://github.com/keith-turner/Accumulo-Parallel-Splitter
While testing this I discovered more about why adding lots of splits is slow
and another workaround. While trying to add 99,999 splits to a table using the
addsplits command in the shell, I noticed on the monitor page that the rate
seemed to be slowing down. I used jstack to look at the process adding split
points and noticed the stack traces were always doing metadata lookups. After
a split the client has to refresh its tablet location cache by looking in the
metadata table. I went to the tablet server and saw that metadata lookups were
taking more than a quarter second.
{noformat}
30 17:36:09,458 [tabletserver.TabletServer] DEBUG: MultiScanSess xxx.xxx.xxx.3:42412 4 entries in 0.29 secs (lookup_time:0.29 secs tablets:1 ranges:1)
{noformat}
I thought about why this was going on and it occurred to me that the code was
always splitting the last tablet. This meant that columns in the metadata
table were always getting updated and therefore had lots of versions. These
versions were all kept in memory and suppressed by the versioning iterator.
About 60k tablets had been added. I knew if I flushed the metadata table, it
would get rid of all of these versions. Below is the minor compaction caused by
flushing the metadata table. It read 1.4M and wrote 724K, so it dropped
about 670K key/values. Some of the dropped data may have been deleted tables
from previous experiments, some of it was old versions of key/values for the
last tablet.
{noformat}
30 17:36:09,698 [tabletserver.Compactor] DEBUG: Compaction !0;~;p\\;3c7 1,394,754 read | 724,252 written | 581,874 entries/sec | 2.397 secs
{noformat}
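To make the version buildup concrete, here is a toy Java model of the behavior described above. It is not Accumulo code: the class and method names (VersionBuildupSketch, update, flush and so on) are made up for illustration, and it assumes only the newest version of each key is visible to a scan. It shows how repeatedly updating the same key leaves many suppressed versions that a scan still has to step over, and how a flush drops them.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

public class VersionBuildupSketch {
    // key -> versions, newest first (toy stand-in for an in-memory map)
    private final Map<String, Deque<String>> memory = new TreeMap<>();

    /** Writes a new version of a key; older versions stay in memory. */
    public void update(String key, String value) {
        memory.computeIfAbsent(key, k -> new ArrayDeque<>()).addFirst(value);
    }

    /** Entries a scan has to step over: every stored version of every key. */
    public int storedEntries() {
        int total = 0;
        for (Deque<String> versions : memory.values()) {
            total += versions.size();
        }
        return total;
    }

    /** Entries a scan returns when only the newest version is visible. */
    public int visibleEntries() {
        return memory.size();
    }

    /** Newest value for a key, or null if the key is absent. */
    public String lookup(String key) {
        Deque<String> versions = memory.get(key);
        return versions == null ? null : versions.peekFirst();
    }

    /** Models a flush: keeps the newest version per key, returns how many were dropped. */
    public int flush() {
        int dropped = 0;
        for (Deque<String> versions : memory.values()) {
            while (versions.size() > 1) {
                versions.removeLast();
                dropped++;
            }
        }
        return dropped;
    }
}
```

In this model, updating one key many times makes storedEntries() grow while visibleEntries() stays flat, which is the gap between what a lookup has to read and what it returns.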
After the flush, metadata lookups by the client doing the split were much faster
and the rate of adding splits shot up.
{noformat}
30 17:36:09,773 [tabletserver.TabletServer] DEBUG: MultiScanSess xxx.xxx.xxx.3:42412 4 entries in 0.00 secs (lookup_time:0.00 secs tablets:1 ranges:1)
{noformat}
So another workaround is to periodically flush the metadata table when adding
lots of splits.
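A minimal sketch of that workaround in Java, assuming the splits are added in batches with a metadata flush after each one. The TableOps interface here is a hypothetical stand-in, not the Accumulo client API; in a real client its two methods would map onto the table operations calls for adding splits and flushing a table, whose exact signatures depend on the version. The batch size and the metadata table name are parameters the caller has to supply.

```java
import java.util.SortedSet;
import java.util.TreeSet;

public class BatchedSplitter {

    /** Hypothetical stand-in for the table operations a real client would provide. */
    public interface TableOps {
        void addSplits(String table, SortedSet<String> splits);
        void flush(String table);
    }

    /**
     * Adds the splits in batches of batchSize and flushes the metadata table
     * after each batch. Returns the number of flushes issued.
     */
    public static int addSplitsWithFlush(TableOps ops, String table, String metadataTable,
                                         SortedSet<String> splits, int batchSize) {
        if (batchSize <= 0) {
            throw new IllegalArgumentException("batchSize must be positive");
        }
        int flushes = 0;
        SortedSet<String> batch = new TreeSet<>();
        for (String split : splits) {
            batch.add(split);
            if (batch.size() == batchSize) {
                ops.addSplits(table, batch);
                ops.flush(metadataTable); // drop old metadata versions before continuing
                flushes++;
                batch = new TreeSet<>();
            }
        }
        if (!batch.isEmpty()) {
            // Remaining splits that did not fill a whole batch.
            ops.addSplits(table, batch);
            ops.flush(metadataTable);
            flushes++;
        }
        return flushes;
    }
}
```

A good batch size is something to find by experiment; the numbers above only show that lookups had become slow by the time about 60k tablets had been added without a flush.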
> Adding splits to table via the shell with addsplits is very slow when adding
> a lot of split points
> --------------------------------------------------------------------------------------------------
>
> Key: ACCUMULO-348
> URL: https://issues.apache.org/jira/browse/ACCUMULO-348
> Project: Accumulo
> Issue Type: Improvement
> Affects Versions: 1.3.5
> Reporter: Dave Marion
> Priority: Minor
> Fix For: 1.5.0
>
>