On Sat, Jun 21, 2014 at 4:45 PM, David Medinets <[email protected]> wrote:
> Given a list of x split points in an empty tablet, would it make
> sense to split at the x/2 split point, wait for the tablet to migrate
> and then recursively split on either side of that x/2 split?
>

That's kinda how the current code works, except it does not wait as you
suggested. The current code splits at the 1/2 point; when that finishes it
queues the 1/4 and 3/4 points, which are done in parallel. After those
finish, the 1/8, 3/8, 5/8, and 7/8 points are queued to run in parallel,
each queuing its two children when done. Waiting for migration in the
earlier part of this process may be helpful.
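Roughly, that queuing logic looks like the sketch below (a simplified
illustration, not the actual tserver code; addSplit() and waitForMigration()
are placeholders, and the split points are assumed to be in a sorted list):

    // Simplified sketch of the midpoint-first split queuing described
    // above. addSplit() and waitForMigration() are placeholders.
    void splitRange(final List<Text> points, final int lo, final int hi,
        final ExecutorService pool) {
      if (lo > hi)
        return;
      final int mid = (lo + hi) / 2;
      addSplit(points.get(mid));      // split at the midpoint first
      // waitForMigration();          // the wait suggested above would go here
      pool.submit(new Runnable() {
        public void run() {
          splitRange(points, lo, mid - 1, pool);   // left half, in parallel
        }
      });
      pool.submit(new Runnable() {
        public void run() {
          splitRange(points, mid + 1, hi, pool);   // right half, in parallel
        }
      });
    }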
>
> On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote:
> > I would encourage the community to figure this out for the following
> > reason.
> > As other databases adopt Accumulo's security features, Accumulo's
> > primary differentiator becomes performance.
> > Other NoSQL databases have let performance slide in favor of adding
> > more features.
> > The gap between Accumulo performance and other NoSQL databases is
> > growing.
> > There are many applications where Accumulo can do on one node what it
> > would take 20 or more nodes to do using another technology.
> > That said, the SQL and NewSQL communities have not been idle and
> > there are some fairly high-performance competitors out there.
> > In the future, I believe Accumulo's primary performance competition
> > will come from the SQL and NewSQL communities.
> >
> > The key to performance is optimization. The key to optimization
> > is how quickly you can do a performance measurement. The IEEE HPEC
> > paper was able to get its results because we were able to collect
> > an accurate performance number at scale in a few minutes.
> > However, for the largest results, pre-splitting took almost an hour.
> > If we are able to remove the pre-splitting bottleneck, we will
> > be able to very quickly test performance at scale, which will
> > allow us to maintain Accumulo's impressive performance.
> >
> > My $0.02
> >
> > P.S. I should add that the next biggest issue was the WAL, which
> > we had to turn off because it made things unstable at extreme
> > insert rates. I think if we solve the pre-splitting issue
> > it will be a lot easier to attack the WAL issue.
> >
> >
> > On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote:
> >> On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> wrote:
> >>
> >> > Right...pre-splitting more gradually might be worthwhile...
> >> >
> >>
> >> Yeah, if balancing is a problem, adding 128 splits that are evenly
> >> distributed and letting those spread would probably help a lot. After
> >> the 128 spread, then add the rest.
> >>
> >> I did the following in 1.4.0 and was able to add 100,000 splits in
> >> ~4 mins using 16 threads. I think I merged this code into 1.4.0 with a
> >> default of 16 threads. I wonder what has changed. This is an example
> >> of another targeted performance test we need to check for regressions.
> >>
> >> https://github.com/keith-turner/Accumulo-Parallel-Splitter
> >>
> >> In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may
> >> be contributing to some of the slowness. Each split does 2 synchronous
> >> writes to the metadata table, each of which results in an hsync. If an
> >> hsync takes 50 ms and there are 16 threads adding splits, then
> >> 2 * 50 ms * 100,000 / 16 = 625 seconds.
> >> However w/ group commit not working properly, these numbers may be
> >> worse, as all of the parallel writes to the metadata table from
> >> tservers splitting would have to wait on each other.
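As a point of reference, adding splits from multiple threads with the
public client API can be done along the lines of the sketch below. This is
not the code from the repo above; the instance, credentials, table name,
thread count, and split points are all placeholders:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.SortedSet;
    import java.util.TreeSet;
    import java.util.concurrent.Callable;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.TimeUnit;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.ZooKeeperInstance;
    import org.apache.accumulo.core.client.security.tokens.PasswordToken;
    import org.apache.hadoop.io.Text;

    public class ParallelSplitSketch {
      public static void main(String[] args) throws Exception {
        // Placeholder instance, credentials, and table name.
        final Connector conn = new ZooKeeperInstance("inst", "zk1:2181")
            .getConnector("root", new PasswordToken("secret"));
        final String table = "perf_test";

        // Generate evenly spaced split points (3 hex digits: 001..fff).
        List<Text> points = new ArrayList<Text>();
        for (int i = 1; i < 4096; i++)
          points.add(new Text(String.format("%03x", i)));

        int threads = 16;
        int chunk = (points.size() + threads - 1) / threads;
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        for (int i = 0; i < points.size(); i += chunk) {
          final SortedSet<Text> batch = new TreeSet<Text>(
              points.subList(i, Math.min(i + chunk, points.size())));
          pool.submit(new Callable<Void>() {
            public Void call() throws Exception {
              // Each thread adds its own batch of split points.
              conn.tableOperations().addSplits(table, batch);
              return null;
            }
          });
        }
        pool.shutdown();
        pool.awaitTermination(1, TimeUnit.HOURS);
      }
    }

Each thread gets a contiguous batch of the split points; how well this
scales will depend on the hsync and group-commit behavior discussed above.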
> >> >
> >> > -------- Original message --------
> >> > From: dlmarion <[email protected]>
> >> > Date: 06/20/2014 7:26 PM (GMT-05:00)
> >> > To: [email protected]
> >> > Subject: Re: better presplitting
> >> >
> >> > We have always had issues with splitting taking a long time. It's a
> >> > serial process that has to compete with the balancer for a lock on
> >> > the metadata table. At least in 1.4 anyway; my information may be
> >> > outdated. Trying to add threads to create splits in parallel was
> >> > never faster. It would be nice if you could manually acquire a lock
> >> > on the metadata table in the shell, add all your split points, then
> >> > release the lock and let the tservers figure it out. In that case you
> >> > could parallelize the splitting by not always splitting the last
> >> > tablet, but instead splitting at the midpoint of the last tablet and
> >> > the last split.
> >> >
> >> >
> >> > -------- Original message --------
> >> > From: Josh Elser <[email protected]>
> >> > Date: 06/20/2014 6:33 PM (GMT-05:00)
> >> > To: [email protected]
> >> > Subject: Re: better presplitting
> >> >
> >> > On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> wrote:
> >> > >
> >> > > When you add splits, they definitely start out on the server that
> >> > > is hosting the tablet that has to split apart. They have to, since
> >> > > the tablet that hosted the previous key extent is the only one that
> >> > > can properly handle requests for the new key extents.
> >> > >
> >> > > We've run into this consistently when doing any testing that
> >> > > requires pre-splitting for perf reasons.
> >> >
> >> > I'd have to pull up the split code, but it seems like a simple fix
> >> > could be to let all but one result of the split of a tablet remain
> >> > local. That way the current server doesn't get bogged down, and the
> >> > master would just use the regular assignment path instead of waiting
> >> > for the balancer to kick in.
> >> >
> >> > Maybe there's a reason this doesn't work though :)
> >> >
> >> > > In the case of YCSB tests, Mike scripted some nice manual
> >> > > pre-splitting in waves:
> >> > >
> >> > > * split table into X parts
> >> > > * wait for balancing
> >> > > * split each X part into Y parts
> >> > > * wait for balancing
> >> > >
> >> > > Presuming the goal is to end up with X*Y pre-splits, this was way
> >> > > faster than just asking for the total right off the bat.
> >> > >
> >> > > We could generally look at improving the migration code to handle
> >> > > these reassignments faster, but how often does this situation come
> >> > > up for people who aren't making a new table? If the "do this
> >> > > offline" feature speeds up the new-table use case enough, I'm not
> >> > > sure optimizing the migration path is worth the time investment
> >> > > right now.
> >> > >
> >> > >
> >> > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]> wrote:
> >> > >
> >> > > > bq. They all started out on one server
> >> > > >
> >> > > > This seems... weird. Would be good to start addressing this by
> >> > > > identifying what the actual balancer code does so we can
> >> > > > immediately start to test the assertions. We can then use the
> >> > > > results to identify the deficiencies that exist.
> >> > > >
> >> > > > I think the 200 splits per server was an Eric quote from some
> >> > > > time ago (1.4-ish, maybe 1.5). I think this is relative to a
> >> > > > bunch of things, workload and memory available most notably, and
> >> > > > would be good to quantify too.
> >> > > >
> >> > > >
> >> > > > On 6/20/14, 11:58 AM, Sean Busbey wrote:
> >> > > >
> >> > > >> One thing that jumped out from the most recent D4M paper was
> >> > > >> this quote:
> >> > > >>
> >> > > >> One issue that was encountered is that after creating the
> >> > > >> pre-splits, they all started out on one server. Accumulo load
> >> > > >> balanced the splits across its servers at a rate of ~50
> >> > > >> splits/second, which is more than adequate for normal operation,
> >> > > >> but can take ~20 minutes for 50,000 pre-splits.[1]
> >> > > >>
> >> > > >> Do we already have an open ticket that would help this? I think
> >> > > >> maybe there's one about being able to pre-split a table that is
> >> > > >> offline?
> >> > > >>
> >> > > >> I believe our recommended sweet spot is like 100-200 tablets per
> >> > > >> server (though I can't find the reference for *why* I believe
> >> > > >> this ATM), which means for clusters in the ~100s of nodes this
> >> > > >> would be in the ballpark for an expected number of pre-splits.
> >> > > >>
> >> > > >>
> >> > > >> [1]: arXiv:1406.4923v1 [cs.DB]
> >> > > >>
> >> > >
> >> > > --
> >> > > Sean
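And for completeness, the wave-based pre-splitting Mike scripted (quoted
above) can be approximated with the client API roughly as follows. The
table name, split counts, and the sleep standing in for "wait for
balancing" are all placeholders, and conn is the same Connector as in the
earlier sketch:

    // Wave 1: a coarse set of evenly spaced splits, so the balancer has
    // relatively few tablets to spread around.
    TableOperations ops = conn.tableOperations();
    String table = "perf_test";

    SortedSet<Text> wave1 = new TreeSet<Text>();
    for (int i = 32; i < 4096; i += 32)
      wave1.add(new Text(String.format("%03x", i)));
    ops.addSplits(table, wave1);

    // Crude stand-in for "wait for balancing"; a real script would watch
    // the tablet distribution (e.g. on the monitor) instead of sleeping.
    Thread.sleep(TimeUnit.MINUTES.toMillis(5));

    // Wave 2: the remaining finer splits, skipping the points already
    // added in wave 1.
    SortedSet<Text> wave2 = new TreeSet<Text>();
    for (int i = 1; i < 4096; i++)
      if (i % 32 != 0)
        wave2.add(new Text(String.format("%03x", i)));
    ops.addSplits(table, wave2);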
