Given a list of x split points in a an empty tablet, would it make sense to split at the x/2 split point, wait for the tablet to migrate and then recursively split on either side of that x/2 split?
On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote: > I would encourage the community to figure this our for the following reason. > As other databases adopt Accumulo's security features, Accumulo's > primary feature is performance. > Other NoSQL databases have let performance slide in favor of adding more > features. > The gap between Accumulo performance and other NoSQL databases is growing. > There are many applications where Accumulo can do on one node what it would > take 20 or more nodes to do using another technology. > That said, the SQL and NewSQL communities have not been idle and > their are some fairly high performance competitors out there. > In the future, I believe Accumulo's primary performance competition > will come from the SQL and NewSQL communities. > > The key to performance is optimization. The key to optimization > is how quickly you can do a performance measurement. The IEEE HPEC > paper was able to get its results because we are able to collect > an accurate performance number at scale in a few minutes. > However, for the largest results, pre-splitting took almost an hour. > If we are able to remove the pre-splitting bottleneck we will > be able to very quickly test performance at scale which will > allow us to maintain Accumulo's impressive performance. > > My $0.02 > > P.S. I should add that the next biggest issue was the WAL, which > we had to turn off because it made things unstable at extreme > insert rate. I think if we solve the pre-splitting issue > it will be a lot easier to attack the WAL issue. > > > On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote: >> On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> wrote: >> >> > Right...pre splitting more gradually might be worthwhile... >> > >> >> Yeah, If balancing is a problem adding 128 splits that are evenly >> distributed and letting those spread would probably help alot. After the >> 128 spread then add the rest. >> >> I did the following in 1.4.0 and was able to add 100,000 splits in ~4mins >> using 16 threads. I think i merged this code into 1.4.0 with a default of >> 16 threads. I wonder what has changed. This is an example of another >> targeted performance test we need to check for regressions. >> >> https://github.com/keith-turner/Accumulo-Parallel-Splitter >> >> In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may be >> contributing to some of the slowness. Each split does 2 synchronous writes >> to the metadata table, which results in an hsync. If hsync takes 50 ms and >> there are 16 threads adding splits, then 50ms * 100,000 / 16 = 624 seconds. >> However w/ group commit not working properly, these numbers may be worse >> as all of the parallel writes to metadata from tservers splitting would >> have to wait on each other. >> >> >> >> > >> > <div>-------- Original message --------</div><div>From: dlmarion < >> > [email protected]> </div><div>Date:06/20/2014 7:26 PM (GMT-05:00) >> > </div><div>To: [email protected] </div><div>Subject: Re: better >> > presplitting </div><div> >> > </div>We have always had issues with splitting taking a long time. Its a >> > serial process that has to compete with the balancer for a lock on the >> > metadata table. At least in 1.4 anyway, my information may be outdated. >> > Trying to add threads to create splits in parallel was never faster. It >> > would be nice if you could manually acquire a lock on the metadata table in >> > the shell, add all your split points, then release the lock and let the >> > tservers figure it out. In this case you could parallelize the splitting by >> > avoiding splitting the last tablet, but split at the midpoint of the last >> > tablet and last split. >> > >> > >> > >> > <div>-------- Original message --------</div><div>From: Josh Elser < >> > [email protected]> </div><div>Date:06/20/2014 6:33 PM (GMT-05:00) >> > </div><div>To: [email protected] </div><div>Subject: Re: better >> > presplitting </div><div> >> > </div>On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> wrote: >> > > >> > > When you add splits, they definitely start out on the server that is >> > > hosting the tablet that has to split apart. They have to, since the >> > tablet >> > > that hosted the previous key extent is the only one that can properly >> > > handle requests for the new key extents. >> > > >> > > We've run into this consistently when doing any testing that requires >> > > pre-splitting for perf reasons. >> > >> > I'd have to pull up the split code, but it seems like a simple fix could be >> > to let all but one result of the split of a tablet remain local. That way >> > the current server doesn't get bogged down, and the master would just use >> > the regular assignment path instead of waiting for the balancer to kick in. >> > >> > Maybe there's a reason this doesn't work though :) >> > >> > > In the case of YCSB tests, Mike scripted some nice manual pre-splitting >> > in >> > > waves: >> > > >> > > * split table into X parts >> > > * wait for balancing >> > > * split each X part into Y parts >> > > * wait for balancing >> > > >> > > presuming the goal is to end up with X*Y presplits, this was way faster >> > > than just asking for the total right off the bat. >> > > >> > > We could generally look at improving the migration code to handle these >> > > reassignments faster, but how often does this situation come up for >> > people >> > > who aren't making a new table? If the "do this offline" feature speeds up >> > > the new table use case enough, I'm not sure optimizing the migration path >> > > is worth the time investment right now. >> > > >> > > >> > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]> >> > wrote: >> > > >> > > > bq. They all started out on one server >> > > > >> > > > This seems.. weird. Would be good to start addressing this by >> > identifying >> > > > what the actual balancer code does so we can immediately start to test >> > the >> > > > assertions. We can then use the results to identify the deficiencies >> > that >> > > > exist. >> > > > >> > > > I think the 200splits per server was an Eric quote from some time ago >> > > > (1.4-ish, maybe 1.5). I think this is relative to a bunch of things, >> > > > workload and memory available most notably, and would be good to >> > quantify >> > > > too. >> > > > >> > > > >> > > > On 6/20/14, 11:58 AM, Sean Busbey wrote: >> > > > >> > > >> One thing that jumped out from the most recent D4M paper was this >> > quote: >> > > >> >> > > >> One issue that was encountered is that after creating the >> > pre-splits, >> > > >> they all started out on one server. Accumulo load balanced the splits >> > > >> across its servers at rate of ~50 splits/second, which is more than >> > > >> adequate for normal operation, but can take ~20 minutes for 50,000 >> > pre- >> > > >> splits.[1] >> > > >> >> > > >> Do we already have an open ticket that would help this? I think maybe >> > > >> there's one about being able to presplit a table that is offline? >> > > >> >> > > >> I believe our recommended sweet spot is like 100-200 tablets per >> > server >> > > >> (though I can't find the reference for *why* I believe this ATM), >> > which >> > > >> means for clusters in the ~100s of nodes this would be in the ballpark >> > for >> > > >> an expected number of pre-splits. >> > > >> >> > > >> >> > > >> [1]: arXiv:1406.4923v1 [cs.DB] >> > > >> >> > > >> >> > > >> > > >> > > -- >> > > Sean >> >
