Re: better presplitting

David Medinets Sat, 21 Jun 2014 13:46:03 -0700

Given a list of x split points in an empty tablet, would it make
sense to split at the x/2 split point, wait for the tablet to migrate
and then recursively split on either side of that x/2 split?
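
A minimal sketch of that ordering, in Python, in case it helps the discussion. It only computes which split points would go in each round of bisection; `add_splits` and `wait_for_balance` are hypothetical callbacks standing in for whatever client call and balance check one would actually use, not Accumulo API names.

```python
def bisection_waves(split_points):
    """Group split points into waves by recursive bisection.

    Wave 0 holds the median split point, wave 1 the medians of the two
    halves on either side of it, and so on until every point is used.
    """
    points = sorted(set(split_points))
    waves = []
    ranges = [(0, len(points))]  # half-open index ranges not yet split
    while ranges:
        wave, next_ranges = [], []
        for lo, hi in ranges:
            if lo >= hi:
                continue  # nothing left to split in this range
            mid = (lo + hi) // 2
            wave.append(points[mid])
            next_ranges.append((lo, mid))
            next_ranges.append((mid + 1, hi))
        if wave:
            waves.append(wave)
        ranges = next_ranges
    return waves


def presplit_in_waves(split_points, add_splits, wait_for_balance):
    """Apply each wave, then pause until the tablets have migrated.

    Both callbacks are hypothetical placeholders supplied by the caller.
    """
    for wave in bisection_waves(split_points):
        add_splits(wave)
        wait_for_balance()
```

With seven split points a..g, the waves come out as d, then b and f, then a, c, e and g, so each round doubles the number of tablets that can be split independently.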


On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote:
> I would encourage the community to figure this out for the following reason.
> As other databases adopt Accumulo's security features, Accumulo's
> primary feature is performance.
> Other NoSQL databases have let performance slide in favor of adding more 
> features.
> The gap between Accumulo performance and other NoSQL databases is growing.
> There are many applications where Accumulo can do on one node what it would
> take 20 or more nodes to do using another technology.
> That said, the SQL and NewSQL communities have not been idle and
> there are some fairly high performance competitors out there.
> In the future, I believe Accumulo's primary performance competition
> will come from the SQL and NewSQL communities.
>
> The key to performance is optimization.  The key to optimization
> is how quickly you can do a performance measurement.  The IEEE HPEC
> paper was able to get its results because we are able to collect
> an accurate performance number at scale in a few minutes.
> However, for the largest results, pre-splitting took almost an hour.
> If we are able to remove the pre-splitting bottleneck we will
> be able to very quickly test performance at scale which will
> allow us to maintain Accumulo's impressive performance.
>
> My $0.02
>
> P.S. I should add that the next biggest issue was the WAL, which
> we had to turn off because it made things unstable at extreme
> insert rate.  I think if we solve the pre-splitting issue
> it will be a lot easier to attack the WAL issue.
>
>
> On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote:
>> On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> wrote:
>>
>> > Right...pre splitting more gradually might be worthwhile...
>> >
>>
>> Yeah, if balancing is a problem, adding 128 splits that are evenly
>> distributed and letting those spread would probably help a lot.  After the
>> 128 spread, add the rest.
>>
>> I did the following in 1.4.0 and was able to add 100,000 splits in ~4mins
>> using 16 threads.  I think I merged this code into 1.4.0 with a default of
>> 16 threads.  I wonder what has changed.  This is an example of another
>> targeted performance test we need to check for regressions.
>>
>> https://github.com/keith-turner/Accumulo-Parallel-Splitter
>>
>> In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may be
>> contributing to some of the slowness.  Each split does 2 synchronous writes
>> to the metadata table, which results in an hsync.  If hsync takes 50 ms and
>> there are 16 threads adding splits, then 2 * 50ms * 100,000 / 16 = 625 seconds.
>>  However w/ group commit not working properly, these numbers may be worse
>> as all of the parallel writes to metadata from tservers splitting would
>> have to wait on each other.
>>
>>
>>
>> >
>> > -------- Original message --------
>> > From: dlmarion <[email protected]>
>> > Date: 06/20/2014 7:26 PM (GMT-05:00)
>> > To: [email protected]
>> > Subject: Re: better presplitting
>> >
>> > We have always had issues with splitting taking a long time. It's a
>> > serial process that has to compete with the balancer for a lock on the
>> > metadata table. At least in 1.4, anyway; my information may be outdated.
>> > Trying to add threads to create splits in parallel was never faster. It
>> > would be nice if you could manually acquire a lock on the metadata table in
>> > the shell, add all your split points, then release the lock and let the
>> > tservers figure it out. In this case you could parallelize the splitting by
>> > avoiding splitting the last tablet, but split at the midpoint of the last
>> > tablet and last split.
>> >
>> >
>> >
>> > -------- Original message --------
>> > From: Josh Elser <[email protected]>
>> > Date: 06/20/2014 6:33 PM (GMT-05:00)
>> > To: [email protected]
>> > Subject: Re: better presplitting
>> >
>> > On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> wrote:
>> > >
>> > > When you add splits, they definitely start out on the server that is
>> > > hosting the tablet that has to split apart.  They have to, since the
>> > tablet
>> > > that hosted the previous key extent is the only one that can properly
>> > > handle requests for the new key extents.
>> > >
>> > > We've run into this consistently when doing any testing that requires
>> > > pre-splitting for perf reasons.
>> >
>> > I'd have to pull up the split code, but it seems like a simple fix could be
>> > to let all but one result of the split of a tablet remain local. That way
>> > the current server doesn't get bogged down, and the master would just use
>> > the regular assignment path instead of waiting for the balancer to kick in.
>> >
>> > Maybe there's a reason this doesn't work though :)
>> >
>> > > In the case of YCSB tests, Mike scripted some nice manual pre-splitting
>> > in
>> > > waves:
>> > >
>> > > * split table into X parts
>> > > * wait for balancing
>> > > * split each X part into Y parts
>> > > * wait for balancing
>> > >
>> > > presuming the goal is to end up with X*Y presplits, this was way faster
>> > > than just asking for the total right off the bat.
>> > >
>> > > We could generally look at improving the migration code to handle these
>> > > reassignments faster, but how often does this situation come up for
>> > people
>> > > who aren't making a new table? If the "do this offline" feature speeds up
>> > > the new table use case enough, I'm not sure optimizing the migration path
>> > > is worth the time investment right now.
>> > >
>> > >
>> > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]>
>> > wrote:
>> > >
>> > > > bq. They all started out on one server
>> > > >
>> > > > This seems.. weird. Would be good to start addressing this by
>> > identifying
>> > > > what the actual balancer code does so we can immediately start to test
>> > the
>> > > > assertions. We can then use the results to identify the deficiencies
>> > that
>> > > > exist.
>> > > >
>> > > > I think the 200 splits per server was an Eric quote from some time ago
>> > > > (1.4-ish, maybe 1.5). I think this is relative to a bunch of things,
>> > > > workload and memory available most notably, and would be good to
>> > quantify
>> > > > too.
>> > > >
>> > > >
>> > > > On 6/20/14, 11:58 AM, Sean Busbey wrote:
>> > > >
>> > > >> One thing that jumped out from the most recent D4M paper was this
>> > quote:
>> > > >>
>> > > >>    One issue that was encountered is that after creating the
>> > pre-splits,
>> > > >> they all started out on one server. Accumulo load balanced the splits
>> > > >> across its servers at rate of ~50 splits/second, which is more than
>> > > >> adequate for normal operation, but can take ~20 minutes for 50,000
>> > pre-
>> > > >> splits.[1]
>> > > >>
>> > > >> Do we already have an open ticket that would help this? I think maybe
>> > > >> there's one about being able to presplit a table that is offline?
>> > > >>
>> > > >> I believe our recommended sweet spot is like 100-200 tablets per
>> > server
>> > > >> (though I can't find the reference for *why* I believe this ATM),
>> > which
>> > > >> means for clusters in the ~100s of nodes this would be in the ballpark
>> > for
>> > > >> an expected number of pre-splits.
>> > > >>
>> > > >>
>> > > >> [1]:  arXiv:1406.4923v1 [cs.DB]
>> > > >>
>> > > >>
>> > >
>> > >
>> > > --
>> > > Sean
>> >
