Re: better presplitting

Keith Turner Sat, 21 Jun 2014 09:15:06 -0700

On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote:


> I would encourage the community to figure this out for the following
> reason.
> As other databases adopt Accumulo's security features, Accumulo's
> primary feature is performance.
>

I am going to look into this issue next week and run some experiments.  I
plan on trying to add 100k splits using uniform random split points on a 20
node EC2 cluster and trying to understand what the issues are.  Initially
I plan on running these experiments using 1.6.1-SNAPSHOT.
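
A minimal sketch of how uniform random split points for such an experiment
might be generated. It assumes row keys are fixed-width lowercase hex strings;
the function name, key width, and seed are illustrative assumptions, not part
of the actual test setup, and the sketch does not talk to Accumulo.

```python
import random


def random_split_points(count, width=8, seed=None):
    """Return `count` distinct, sorted hex split points drawn uniformly at random.

    Assumption: keys are `width`-character lowercase hex strings, so sorting
    the strings lexicographically matches sorting the underlying integers.
    """
    space = 16 ** width
    if count > space:
        raise ValueError("cannot draw more distinct points than the key space holds")
    rng = random.Random(seed)
    # sample() draws without replacement, so no duplicate split points.
    points = rng.sample(range(space), count)
    return sorted(format(p, "0{}x".format(width)) for p in points)


if __name__ == "__main__":
    splits = random_split_points(100_000, seed=42)
    print(len(splits), splits[0], splits[-1])
```

The resulting list could then be handed, in one batch or several, to whatever
client call adds the splits.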


> Other NoSQL databases have let performance slide in favor of adding more
> features.
> The gap between Accumulo performance and other NoSQL databases is growing.
> There are many applications where Accumulo can do on one node what it would
> take 20 or more nodes to do using another technology.
> That said, the SQL and NewSQL communities have not been idle and
> there are some fairly high performance competitors out there.
> In the future, I believe Accumulo's primary performance competition
> will come from the SQL and NewSQL communities.
>
> The key to performance is optimization.  The key to optimization
> is how quickly you can do a performance measurement.  The IEEE HPEC
> paper was able to get its results because we are able to collect
> an accurate performance number at scale in a few minutes.
> However, for the largest results, pre-splitting took almost an hour.
> If we are able to remove the pre-splitting bottleneck we will
> be able to very quickly test performance at scale which will
> allow us to maintain Accumulo's impressive performance.
>
> My $0.02
>
> P.S. I should add that the next biggest issue was the WAL, which
> we had to turn off because it made things unstable at extreme
> insert rate.  I think if we solve the pre-splitting issue
> it will be a lot easier to attack the WAL issue.
>
>
> On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote:
> > On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> wrote:
> >
> > > Right...pre splitting more gradually might be worthwhile...
> > >
> >
> > Yeah, if balancing is a problem, adding 128 evenly distributed splits and
> > letting those spread would probably help a lot.  After the 128 spread,
> > add the rest.
> >
> > I did the following in 1.4.0 and was able to add 100,000 splits in ~4 mins
> > using 16 threads.  I think I merged this code into 1.4.0 with a default of
> > 16 threads.  I wonder what has changed.  This is an example of another
> > targeted performance test we need to check for regressions.
> >
> > https://github.com/keith-turner/Accumulo-Parallel-Splitter
> >
> > In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may be
> > contributing to some of the slowness.  Each split does 2 synchronous writes
> > to the metadata table, each of which results in an hsync.  If hsync takes
> > 50 ms and there are 16 threads adding splits, then
> > 2 * 50 ms * 100,000 / 16 = 625 seconds.  However, with group commit not
> > working properly, these numbers may be worse, as all of the parallel writes
> > to metadata from tservers splitting would have to wait on each other.
> >
> >
> >
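
The estimate above can be reproduced with a small calculation. This is only a
sketch of the arithmetic, assuming every split performs two synchronous
metadata writes, each costing one hsync, and that the threads overlap
perfectly; the function and parameter names are made up for illustration.

```python
def presplit_seconds(splits, hsync_ms, threads, writes_per_split=2):
    """Estimate wall-clock seconds spent waiting on hsync while adding splits.

    Assumptions: each split issues `writes_per_split` synchronous metadata
    writes, each write costs one hsync of `hsync_ms` milliseconds, and the
    work divides evenly across `threads` with no group-commit contention.
    """
    total_ms = splits * writes_per_split * hsync_ms
    return total_ms / threads / 1000.0


if __name__ == "__main__":
    # 100,000 splits, 50 ms per hsync, 16 threads
    print(presplit_seconds(100_000, 50, 16))  # 625.0
```

If group commit is not working and the writes serialize, the effective thread
count in this model drops toward 1 and the estimate grows accordingly.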
> > >
> > > > -------- Original message --------
> > > > From: dlmarion <[email protected]>
> > > > Date: 06/20/2014 7:26 PM (GMT-05:00)
> > > > To: [email protected]
> > > > Subject: Re: better presplitting
> > > >
> > > > We have always had issues with splitting taking a long time. It's a
> > > > serial process that has to compete with the balancer for a lock on the
> > > > metadata table. At least in 1.4 anyway; my information may be outdated.
> > > > Trying to add threads to create splits in parallel was never faster. It
> > > > would be nice if you could manually acquire a lock on the metadata table
> > > > in the shell, add all your split points, then release the lock and let
> > > > the tservers figure it out. In this case you could parallelize the
> > > > splitting by avoiding splitting the last tablet, but split at the
> > > > midpoint of the last tablet and last split.
> > >
> > >
> > >
> > > > -------- Original message --------
> > > > From: Josh Elser <[email protected]>
> > > > Date: 06/20/2014 6:33 PM (GMT-05:00)
> > > > To: [email protected]
> > > > Subject: Re: better presplitting
> > > >
> > > > On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> wrote:
> > > >
> > > > > When you add splits, they definitely start out on the server that is
> > > > > hosting the tablet that has to split apart.  They have to, since the
> > > > > tablet that hosted the previous key extent is the only one that can
> > > > > properly handle requests for the new key extents.
> > > >
> > > > We've run into this consistently when doing any testing that requires
> > > > pre-splitting for perf reasons.
> > >
> > > > I'd have to pull up the split code, but it seems like a simple fix could
> > > > be to let all but one result of the split of a tablet remain local. That
> > > > way the current server doesn't get bogged down, and the master would just
> > > > use the regular assignment path instead of waiting for the balancer to
> > > > kick in.
> > >
> > > Maybe there's a reason this doesn't work though :)
> > >
> > > > > In the case of YCSB tests, Mike scripted some nice manual
> > > > > pre-splitting in waves:
> > > >
> > > > * split table into X parts
> > > > * wait for balancing
> > > > * split each X part into Y parts
> > > > * wait for balancing
> > > >
> > > > > Presuming the goal is to end up with X*Y presplits, this was way
> > > > > faster than just asking for the total right off the bat.
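
The wave approach above can be sketched as a planning step: pick a small,
evenly spread subset of the final split points to add first, and hold the rest
back until the balancer has spread the resulting tablets. The helper name and
the way the first wave is chosen are assumptions for illustration; actually
adding the splits and waiting for balancing are left out.

```python
def plan_waves(splits, first_wave):
    """Divide the final split points into (wave1, wave2).

    wave1 holds `first_wave` points spaced evenly through the sorted list;
    wave2 holds everything else, to be added once wave1 has been balanced.
    """
    ordered = sorted(set(splits))
    if first_wave <= 0 or first_wave >= len(ordered):
        # Nothing to stage: add everything in one go.
        return ordered, []
    step = len(ordered) / first_wave
    # Take the middle element of each of `first_wave` equal-sized slices.
    picked = {int(i * step + step / 2) for i in range(first_wave)}
    wave1 = [s for i, s in enumerate(ordered) if i in picked]
    wave2 = [s for i, s in enumerate(ordered) if i not in picked]
    return wave1, wave2


if __name__ == "__main__":
    final_splits = ["%04d" % i for i in range(1000)]
    wave1, wave2 = plan_waves(final_splits, 10)
    print(len(wave1), len(wave2))  # 10 990
```

Taking the middle of each slice keeps the first wave roughly uniform over the
key space, so the tablets it creates are similar in size before the second
wave subdivides them.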
> > > >
> > > > > We could generally look at improving the migration code to handle these
> > > > > reassignments faster, but how often does this situation come up for
> > > > > people who aren't making a new table? If the "do this offline" feature
> > > > > speeds up the new table use case enough, I'm not sure optimizing the
> > > > > migration path is worth the time investment right now.
> > > >
> > > >
> > > > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]> wrote:
> > > >
> > > > > bq. They all started out on one server
> > > > >
> > > > > > This seems.. weird. Would be good to start addressing this by
> > > > > > identifying what the actual balancer code does so we can immediately
> > > > > > start to test the assertions. We can then use the results to identify
> > > > > > the deficiencies that exist.
> > > > >
> > > > > > I think the 200 splits per server was an Eric quote from some time ago
> > > > > > (1.4-ish, maybe 1.5). I think this is relative to a bunch of things,
> > > > > > workload and memory available most notably, and would be good to
> > > > > > quantify too.
> > > > >
> > > > >
> > > > > On 6/20/14, 11:58 AM, Sean Busbey wrote:
> > > > >
> > > > > >> One thing that jumped out from the most recent D4M paper was this quote:
> > > > > >>
> > > > > >>    One issue that was encountered is that after creating the pre-splits,
> > > > > >> they all started out on one server. Accumulo load balanced the splits
> > > > > >> across its servers at rate of ~50 splits/second, which is more than
> > > > > >> adequate for normal operation, but can take ~20 minutes for 50,000
> > > > > >> pre-splits.[1]
> > > > >>
> > > > > >> Do we already have an open ticket that would help this? I think maybe
> > > > > >> there's one about being able to presplit a table that is offline?
> > > > > >>
> > > > > >> I believe our recommended sweet spot is like 100-200 tablets per server
> > > > > >> (though I can't find the reference for *why* I believe this ATM), which
> > > > > >> means for clusters in the ~100s of nodes this would be in the ballpark
> > > > > >> for an expected number of pre-splits.
> > > > >>
> > > > >>
> > > > >> [1]:  arXiv:1406.4923v1 [cs.DB]
> > > > >>
> > > > >>
> > > >
> > > >
> > > > --
> > > > Sean
> > >
>
