On Sat, Jun 21, 2014 at 4:45 PM, David Medinets <[email protected]> wrote:
> Given a list of x split points in an empty tablet, would it make
> sense to split at the x/2 split point, wait for the tablet to migrate
> and then recursively split on either side of that x/2 split?
>

That's kinda how the current code works, except it does not wait as you
suggested. The current code splits at the 1/2 point; when that finishes it
queues the 1/4 and 3/4 points, which are done in parallel. After those
finish, the 1/8, 3/8, 5/8, and 7/8 points are queued to run in parallel,
each queuing its two children when done. Waiting for migration in the
earlier part of this process may be helpful.
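Roughly, that queuing logic looks like the sketch below (a simplified
illustration, not the actual tserver code; addSplit() and waitForMigration()
are placeholders, and the split points are assumed to be in a sorted list):

    // Simplified sketch of the midpoint-first split queuing described
    // above. addSplit() and waitForMigration() are placeholders.
    void splitRange(final List<Text> points, final int lo, final int hi,
        final ExecutorService pool) {
      if (lo > hi)
        return;
      final int mid = (lo + hi) / 2;
      addSplit(points.get(mid));      // split at the midpoint first
      // waitForMigration();          // the wait suggested above would go here
      pool.submit(new Runnable() {
        public void run() {
          splitRange(points, lo, mid - 1, pool);   // left half, in parallel
        }
      });
      pool.submit(new Runnable() {
        public void run() {
          splitRange(points, mid + 1, hi, pool);   // right half, in parallel
        }
      });
    }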
>
> On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote:
> > I would encourage the community to figure this out for the following
> > reason.
> > As other databases adopt Accumulo's security features, Accumulo's
> > primary differentiator becomes performance.
> > Other NoSQL databases have let performance slide in favor of adding
> > more features.
> > The gap between Accumulo performance and other NoSQL databases is
> > growing.
> > There are many applications where Accumulo can do on one node what it
> > would take 20 or more nodes to do using another technology.
> > That said, the SQL and NewSQL communities have not been idle and
> > there are some fairly high-performance competitors out there.
> > In the future, I believe Accumulo's primary performance competition
> > will come from the SQL and NewSQL communities.
> >
> > The key to performance is optimization. The key to optimization
> > is how quickly you can do a performance measurement. The IEEE HPEC
> > paper was able to get its results because we were able to collect
> > an accurate performance number at scale in a few minutes.
> > However, for the largest results, pre-splitting took almost an hour.
> > If we are able to remove the pre-splitting bottleneck, we will
> > be able to very quickly test performance at scale, which will
> > allow us to maintain Accumulo's impressive performance.
> >
> > My $0.02
> >
> > P.S. I should add that the next biggest issue was the WAL, which
> > we had to turn off because it made things unstable at extreme
> > insert rates. I think if we solve the pre-splitting issue
> > it will be a lot easier to attack the WAL issue.
> >
> >
> > On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote:
> >> On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> wrote:
> >>
> >> > Right...pre-splitting more gradually might be worthwhile...
> >> >
> >>
> >> Yeah, if balancing is a problem, adding 128 splits that are evenly
> >> distributed and letting those spread would probably help a lot. After
> >> the 128 spread, then add the rest.
> >>
> >> I did the following in 1.4.0 and was able to add 100,000 splits in
> >> ~4 mins using 16 threads. I think I merged this code into 1.4.0 with a
> >> default of 16 threads. I wonder what has changed. This is an example
> >> of another targeted performance test we need to check for regressions.
> >>
> >> https://github.com/keith-turner/Accumulo-Parallel-Splitter
> >>
> >> In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may
> >> be contributing to some of the slowness. Each split does 2 synchronous
> >> writes to the metadata table, each of which results in an hsync. If an
> >> hsync takes 50 ms and there are 16 threads adding splits, then
> >> 2 * 50 ms * 100,000 / 16 = 625 seconds.
> >> However w/ group commit not working properly, these numbers may be
> >> worse, as all of the parallel writes to the metadata table from
> >> tservers splitting would have to wait on each other.
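As a point of reference, adding splits from multiple threads with the
public client API can be done along the lines of the sketch below. This is
not the code from the repo above; the instance, credentials, table name,
thread count, and split points are all placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.io.Text;

    public class ParallelSplitSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder instance, credentials, and table name.
        final Connector conn = new ZooKeeperInstance("inst", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));
        final String table = "perf_test";

        // Generate evenly spaced split points (3 hex digits: 001..fff).
        List<Text> points = new ArrayList<Text>();
        for (int i = 1; i < 4096; i++)
          points.add(new Text(String.format("%03x", i)));

        int threads = 16;
        int chunk = (points.size() + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < points.size(); i += chunk) {
          final SortedSet<Text> batch = new TreeSet<Text>(
              points.subList(i, Math.min(i + chunk, points.size())));
          pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              // Each thread adds its own batch of split points.
              conn.tableOperations().addSplits(table, batch);
              return null;
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }

Each thread gets a contiguous batch of the split points; how well this
scales will depend on the hsync and group-commit behavior discussed above.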
> >> >
> >> > -------- Original message --------
> >> > From: dlmarion <[email protected]>
> >> > Date: 06/20/2014 7:26 PM (GMT-05:00)
> >> > To: [email protected]
> >> > Subject: Re: better presplitting
> >> >
> >> > We have always had issues with splitting taking a long time. It's a
> >> > serial process that has to compete with the balancer for a lock on
> >> > the metadata table. At least in 1.4 anyway; my information may be
> >> > outdated. Trying to add threads to create splits in parallel was
> >> > never faster. It would be nice if you could manually acquire a lock
> >> > on the metadata table in the shell, add all your split points, then
> >> > release the lock and let the tservers figure it out. In that case you
> >> > could parallelize the splitting by not always splitting the last
> >> > tablet, but instead splitting at the midpoint of the last tablet and
> >> > the last split.
> >> >
> >> >
> >> > -------- Original message --------
> >> > From: Josh Elser <[email protected]>
> >> > Date: 06/20/2014 6:33 PM (GMT-05:00)
> >> > To: [email protected]
> >> > Subject: Re: better presplitting
> >> >
> >> > On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> wrote:
> >> > >
> >> > > When you add splits, they definitely start out on the server that
> >> > > is hosting the tablet that has to split apart. They have to, since
> >> > > the tablet that hosted the previous key extent is the only one that
> >> > > can properly handle requests for the new key extents.
> >> > >
> >> > > We've run into this consistently when doing any testing that
> >> > > requires pre-splitting for perf reasons.
> >> >
> >> > I'd have to pull up the split code, but it seems like a simple fix
> >> > could be to let all but one result of the split of a tablet remain
> >> > local. That way the current server doesn't get bogged down, and the
> >> > master would just use the regular assignment path instead of waiting
> >> > for the balancer to kick in.
> >> >
> >> > Maybe there's a reason this doesn't work though :)
> >> >
> >> > > In the case of YCSB tests, Mike scripted some nice manual
> >> > > pre-splitting in waves:
> >> > >
> >> > > * split table into X parts
> >> > > * wait for balancing
> >> > > * split each X part into Y parts
> >> > > * wait for balancing
> >> > >
> >> > > Presuming the goal is to end up with X*Y pre-splits, this was way
> >> > > faster than just asking for the total right off the bat.
> >> > >
> >> > > We could generally look at improving the migration code to handle
> >> > > these reassignments faster, but how often does this situation come
> >> > > up for people who aren't making a new table? If the "do this
> >> > > offline" feature speeds up the new-table use case enough, I'm not
> >> > > sure optimizing the migration path is worth the time investment
> >> > > right now.
> >> > >
> >> > >
> >> > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]> wrote:
> >> > >
> >> > > > bq. They all started out on one server
> >> > > >
> >> > > > This seems... weird. Would be good to start addressing this by
> >> > > > identifying what the actual balancer code does so we can
> >> > > > immediately start to test the assertions. We can then use the
> >> > > > results to identify the deficiencies that exist.
> >> > > >
> >> > > > I think the 200 splits per server was an Eric quote from some
> >> > > > time ago (1.4-ish, maybe 1.5). I think this is relative to a
> >> > > > bunch of things, workload and memory available most notably, and
> >> > > > would be good to quantify too.
> >> > > >
> >> > > >
> >> > > > On 6/20/14, 11:58 AM, Sean Busbey wrote:
> >> > > >
> >> > > >> One thing that jumped out from the most recent D4M paper was
> >> > > >> this quote:
> >> > > >>
> >> > > >> One issue that was encountered is that after creating the
> >> > > >> pre-splits, they all started out on one server. Accumulo load
> >> > > >> balanced the splits across its servers at a rate of ~50
> >> > > >> splits/second, which is more than adequate for normal operation,
> >> > > >> but can take ~20 minutes for 50,000 pre-splits.[1]
> >> > > >>
> >> > > >> Do we already have an open ticket that would help this? I think
> >> > > >> maybe there's one about being able to pre-split a table that is
> >> > > >> offline?
> >> > > >>
> >> > > >> I believe our recommended sweet spot is like 100-200 tablets per
> >> > > >> server (though I can't find the reference for *why* I believe
> >> > > >> this ATM), which means for clusters in the ~100s of nodes this
> >> > > >> would be in the ballpark for an expected number of pre-splits.
> >> > > >>
> >> > > >>
> >> > > >> [1]: arXiv:1406.4923v1 [cs.DB]
> >> > > >>
> >> > >
> >> > > --
> >> > > Sean
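And for completeness, the wave-based pre-splitting Mike scripted (quoted
above) can be approximated with the client API roughly as follows. The
table name, split counts, and the sleep standing in for "wait for
balancing" are all placeholders, and conn is the same Connector as in the
earlier sketch:

    // Wave 1: a coarse set of evenly spaced splits, so the balancer has
    // relatively few tablets to spread around.
    TableOperations ops = conn.tableOperations();
    String table = "perf_test";

    SortedSet<Text> wave1 = new TreeSet<Text>();
    for (int i = 32; i < 4096; i += 32)
      wave1.add(new Text(String.format("%03x", i)));
    ops.addSplits(table, wave1);

    // Crude stand-in for "wait for balancing"; a real script would watch
    // the tablet distribution (e.g. on the monitor) instead of sleeping.
    Thread.sleep(TimeUnit.MINUTES.toMillis(5));

    // Wave 2: the remaining finer splits, skipping the points already
    // added in wave 1.
    SortedSet<Text> wave2 = new TreeSet<Text>();
    for (int i = 1; i < 4096; i++)
      if (i % 32 != 0)
        wave2.add(new Text(String.format("%03x", i)));
    ops.addSplits(table, wave2);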
