On Sat, Jun 21, 2014 at 11:59 AM, Jeremy Kepner <[email protected]> wrote:
> I would encourage the community to figure this our for the following > reason. > As other databases adopt Accumulo's security features, Accumulo's > primary feature is performance. > I am going to look into this issue next week and run some experiments. I plan on trying to add 100k splits using uniform random split points on a 20 node EC2 cluster and trying to understand what the issues are. Initially i plan on running these experiments using 1.6.1-SNAPSHOT. > Other NoSQL databases have let performance slide in favor of adding more > features. > The gap between Accumulo performance and other NoSQL databases is growing. > There are many applications where Accumulo can do on one node what it would > take 20 or more nodes to do using another technology. > That said, the SQL and NewSQL communities have not been idle and > their are some fairly high performance competitors out there. > In the future, I believe Accumulo's primary performance competition > will come from the SQL and NewSQL communities. > > The key to performance is optimization. The key to optimization > is how quickly you can do a performance measurement. The IEEE HPEC > paper was able to get its results because we are able to collect > an accurate performance number at scale in a few minutes. > However, for the largest results, pre-splitting took almost an hour. > If we are able to remove the pre-splitting bottleneck we will > be able to very quickly test performance at scale which will > allow us to maintain Accumulo's impressive performance. > > My $0.02 > > P.S. I should add that the next biggest issue was the WAL, which > we had to turn off because it made things unstable at extreme > insert rate. I think if we solve the pre-splitting issue > it will be a lot easier to attack the WAL issue. > > > On Sat, Jun 21, 2014 at 11:46:14AM -0400, Keith Turner wrote: > > On Fri, Jun 20, 2014 at 11:52 PM, ivan.bella <[email protected]> > wrote: > > > > > Right...pre splitting more gradually might be worthwhile... > > > > > > > Yeah, If balancing is a problem adding 128 splits that are evenly > > distributed and letting those spread would probably help alot. After the > > 128 spread then add the rest. > > > > I did the following in 1.4.0 and was able to add 100,000 splits in ~4mins > > using 16 threads. I think i merged this code into 1.4.0 with a default > of > > 16 threads. I wonder what has changed. This is an example of another > > targeted performance test we need to check for regressions. > > > > https://github.com/keith-turner/Accumulo-Parallel-Splitter > > > > In addition to balancing, for 1.5 and 1.6 hsync and ACCUMULO-2766 may be > > contributing to some of the slowness. Each split does 2 synchronous > writes > > to the metadata table, which results in an hsync. If hsync takes 50 ms > and > > there are 16 threads adding splits, then 50ms * 100,000 / 16 = 624 > seconds. > > However w/ group commit not working properly, these numbers may be worse > > as all of the parallel writes to metadata from tservers splitting would > > have to wait on each other. > > > > > > > > > > > > <div>-------- Original message --------</div><div>From: dlmarion < > > > [email protected]> </div><div>Date:06/20/2014 7:26 PM (GMT-05:00) > > > </div><div>To: [email protected] </div><div>Subject: Re: better > > > presplitting </div><div> > > > </div>We have always had issues with splitting taking a long time. Its > a > > > serial process that has to compete with the balancer for a lock on the > > > metadata table. At least in 1.4 anyway, my information may be outdated. > > > Trying to add threads to create splits in parallel was never faster. It > > > would be nice if you could manually acquire a lock on the metadata > table in > > > the shell, add all your split points, then release the lock and let the > > > tservers figure it out. In this case you could parallelize the > splitting by > > > avoiding splitting the last tablet, but split at the midpoint of the > last > > > tablet and last split. > > > > > > > > > > > > <div>-------- Original message --------</div><div>From: Josh Elser < > > > [email protected]> </div><div>Date:06/20/2014 6:33 PM (GMT-05:00) > > > </div><div>To: [email protected] </div><div>Subject: Re: better > > > presplitting </div><div> > > > </div>On Jun 20, 2014 12:41 PM, "Sean Busbey" <[email protected]> > wrote: > > > > > > > > When you add splits, they definitely start out on the server that is > > > > hosting the tablet that has to split apart. They have to, since the > > > tablet > > > > that hosted the previous key extent is the only one that can properly > > > > handle requests for the new key extents. > > > > > > > > We've run into this consistently when doing any testing that requires > > > > pre-splitting for perf reasons. > > > > > > I'd have to pull up the split code, but it seems like a simple fix > could be > > > to let all but one result of the split of a tablet remain local. That > way > > > the current server doesn't get bogged down, and the master would just > use > > > the regular assignment path instead of waiting for the balancer to > kick in. > > > > > > Maybe there's a reason this doesn't work though :) > > > > > > > In the case of YCSB tests, Mike scripted some nice manual > pre-splitting > > > in > > > > waves: > > > > > > > > * split table into X parts > > > > * wait for balancing > > > > * split each X part into Y parts > > > > * wait for balancing > > > > > > > > presuming the goal is to end up with X*Y presplits, this was way > faster > > > > than just asking for the total right off the bat. > > > > > > > > We could generally look at improving the migration code to handle > these > > > > reassignments faster, but how often does this situation come up for > > > people > > > > who aren't making a new table? If the "do this offline" feature > speeds up > > > > the new table use case enough, I'm not sure optimizing the migration > path > > > > is worth the time investment right now. > > > > > > > > > > > > On Fri, Jun 20, 2014 at 3:09 PM, Josh Elser <[email protected]> > > > wrote: > > > > > > > > > bq. They all started out on one server > > > > > > > > > > This seems.. weird. Would be good to start addressing this by > > > identifying > > > > > what the actual balancer code does so we can immediately start to > test > > > the > > > > > assertions. We can then use the results to identify the > deficiencies > > > that > > > > > exist. > > > > > > > > > > I think the 200splits per server was an Eric quote from some time > ago > > > > > (1.4-ish, maybe 1.5). I think this is relative to a bunch of > things, > > > > > workload and memory available most notably, and would be good to > > > quantify > > > > > too. > > > > > > > > > > > > > > > On 6/20/14, 11:58 AM, Sean Busbey wrote: > > > > > > > > > >> One thing that jumped out from the most recent D4M paper was this > > > quote: > > > > >> > > > > >> One issue that was encountered is that after creating the > > > pre-splits, > > > > >> they all started out on one server. Accumulo load balanced the > splits > > > > >> across its servers at rate of ~50 splits/second, which is more > than > > > > >> adequate for normal operation, but can take ~20 minutes for 50,000 > > > pre- > > > > >> splits.[1] > > > > >> > > > > >> Do we already have an open ticket that would help this? I think > maybe > > > > >> there's one about being able to presplit a table that is offline? > > > > >> > > > > >> I believe our recommended sweet spot is like 100-200 tablets per > > > server > > > > >> (though I can't find the reference for *why* I believe this ATM), > > > which > > > > >> means for clusters in the ~100s of nodes this would be in the > ballpark > > > for > > > > >> an expected number of pre-splits. > > > > >> > > > > >> > > > > >> [1]: arXiv:1406.4923v1 [cs.DB] > > > > >> > > > > >> > > > > > > > > > > > > -- > > > > Sean > > > >
