On Mon, Nov 16, 2015 at 10:35 AM, z11373 <[email protected]> wrote:

> Last week on separate thread I was suggested to use
> tableOperations.deleteRows for deleting rows that matched with specific
> ranges. So I was curious to try it out to see if it's better than my
> current implementation which is iterating all rows, and call putDelete
> for each.
> While researching, I also found Accumulo already provides BatchDeleter,
> which also does the same thing.
> I tried all of three, and below is my test results against three different
> tables (numbers are in milliseconds):
>
> Test 1 (using iterator and call putDelete for each):
> Table 1: 5,702
> Table 2: 6,912
> Table 3: 4,694
>
> Test 2 (using BatchDeleter class):
> Table 1: 8,089
> Table 2: 10,405
> Table 3: 7,818
>
> Test 3 (using tableOperations.deleteRows, note that I first iterate all
> rows, just to get the last row id, which then being passed as argument to
> the function):
> Table 1: 196,597
> Table 2: 226,496
> Table 3: 8,442
>
>
> I ran the tests few times, and pretty much got the consistent results
> above.
> I didn't look at the code what deleteRows really doing, but looking at my
> test results, I can say it sucks!
>
An advantage of deleteRows is that it can drop entire tablets that fall
completely within a range. However, the tablet at the end of the range may
need to be compacted in order to extend its range. Using deleteRows for a
"small" range that falls completely within a table may be suboptimal. Is
that your case? How many key/value pairs are you deleting? If it's not the
compaction that is causing the delay, then there may be a bug.

Not sure if it will help, but there is a utility function for finding a max
row. It does a binary search within the key space.

http://accumulo.apache.org/1.6/apidocs/org/apache/accumulo/core/client/admin/TableOperations.html#getMaxRow%28java.lang.String,%20org.apache.accumulo.core.security.Authorizations,%20org.apache.hadoop.io.Text,%20boolean,%20org.apache.hadoop.io.Text,%20boolean%29

> Note that for that test, I did scan and iterate just to get the last row
> id, but even I subtract the time for doing that, it's still way too slow.
> Therefore, I'd recommend anyone to avoid using deleteRows for this
> scenario.
> YMMV, but I'd stick with my original approach, which is doing the same like
> Test 1 above.
>
>
> Thanks,
> Z
>
> --
> View this message in context:
> http://apache-accumulo.1065345.n5.nabble.com/delete-rows-test-result-tp15569.html
> Sent from the Developers mailing list archive at Nabble.com.
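
To make the tablet point above a bit more concrete, here is a rough toy model in plain Java. It is not Accumulo code and does not reflect how deleteRows is actually implemented; the class name, the enum, and the split points are all made up for illustration. It only assumes that tablets cover (previous split, split] and that the delete range is (start, end], i.e. exclusive start and inclusive end as in the deleteRows Javadoc. It classifies each tablet as fully inside the range (can simply be dropped), partially overlapping (needs extra work), or untouched.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Toy model only: not Accumulo code. All names here are hypothetical.
class TabletRangeSketch {

    enum Fate { UNTOUCHED, PARTIAL, DROPPED }

    // Tablet i holds rows in (previous split, split i]; the first tablet has
    // no lower bound and the last has no upper bound (both modelled as null).
    // The delete range is (start, end]: exclusive start, inclusive end.
    static List<Fate> classify(List<String> sortedSplits, String start, String end) {
        List<Fate> fates = new ArrayList<>();
        String lower = null;
        for (int i = 0; i <= sortedSplits.size(); i++) {
            String upper = i < sortedSplits.size() ? sortedSplits.get(i) : null;
            fates.add(classifyOne(lower, upper, start, end));
            lower = upper;
        }
        return fates;
    }

    private static Fate classifyOne(String lower, String upper, String start, String end) {
        // No overlap: the tablet ends at or before start, or begins at or after end.
        boolean endsBeforeRange = upper != null && upper.compareTo(start) <= 0;
        boolean startsAfterRange = lower != null && lower.compareTo(end) >= 0;
        if (endsBeforeRange || startsAfterRange) {
            return Fate.UNTOUCHED;
        }
        // Fully contained: both tablet bounds lie inside the delete range.
        boolean contained = lower != null && lower.compareTo(start) >= 0
                && upper != null && upper.compareTo(end) <= 0;
        return contained ? Fate.DROPPED : Fate.PARTIAL;
    }

    public static void main(String[] args) {
        List<String> splits = Arrays.asList("d", "h", "m", "r");
        // Wide range: the middle tablet is dropped whole, the two edge tablets are partial.
        System.out.println(classify(splits, "e", "p"));
        // Small range inside one tablet: nothing can be dropped whole.
        System.out.println(classify(splits, "i", "j"));
    }
}
```

In this model a wide range such as (e, p] lets the middle tablet be dropped outright, while a narrow range such as (i, j] only ever touches one tablet partially, which is the "small range" case where deleteRows has no whole tablets to drop.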
