The immediate intent is to run some memory stress tests as part of a deliverable of ours for Accumulo. So, right now I'm just trying to get the tests to pass. A greater goal is indeed to understand what's needed to support really large keys or values. I don't think we're looking to generate a formula yet, but maybe just advice on what configuration settings and such to look out for in general.
I did discover that the test client was a) requesting one split at a time and b) not setting a low scanner batch size (the default is 1000). Setting the batch size down to 1 seems to have helped a lot, so things are slowly improving. :)

On Wed, May 28, 2014 at 10:24 AM, Josh Elser <[email protected]> wrote:

> On 5/28/14, 9:39 AM, Bill Havanki wrote:

>> Thanks Josh!

>> - This is indeed under CDH 4.6.0. If there is a particular line number you want to see code for, just name it and I'll look it up.

> I was generally curious to see what kind of batching the DfsOutputStream does (it looked like it was checksumming small chunks of data), but I can look into that some more to satisfy my curiosity.

>> - Re #2, the test client is sending mutations of only one cell each, so a mutation should be 100 MB + a little, due to the large value. It's inefficient, but it seems to be a good idea just for getting this test to survive. Maybe the logger code is hanging on to mutations in memory before writing them out? (That would surprise me, but I dunno.)

> Well, I think you're going to have to be able to keep "about" two copies in memory (what I was trying to get at before). The tserver is going to get the Mutation objects from the client. So, that's one instance of, say, 100MB. Before that write finishes, you'll also need to write those out to the WAL, which means that you'll be serializing each Mutation using the Writable methods. While that isn't quite the same as having a discrete object of that size on the heap, you're still writing those bytes to the DataOutput, and they are going to be buffered through the JVM heap.

>> Another fact I didn't mention is that I am running 2 writers and 2 readers for the test. Perhaps 612 and 613 are the write threads, and then 615 is one scan, which might leave 614 as the remains of the other scan, which has already failed and is logging an OOME (which is what the monitor shows)?

> Perhaps! That might make sense.

>> My thought from looking at this again is that Thrift is running out of space forming the scan result message as it fills up a ByteArrayOutputStream. Maybe there is some way to force Thrift to break things up?

> I don't know of anything inside of thrift that we could use to do that.

> Overall, though, what's your intent by testing this? Is it to have a better understanding of server-side memory usage? Generally speaking, if you have clients getting back 100MB values and the server is writing 100MB values, that would intuitively use up a bit of heap space.

> I could see merit in constructing a general formula for memory consumption based on avg key-value size, number of threads available to read, number of threads available to write, and number of MinC/MajC threads. It probably wouldn't be much more valuable than a starting point due to variance, but it would be a starting point!

>> Thanks for burning cycles on this.
>>
>> Bill

>> On Tue, May 27, 2014 at 7:11 PM, Josh Elser <[email protected]> wrote:

>>> Well, for this one, it looks to me that you have two threads writing data (ClientPool 612 and 613), with 612 being blocked by 613. There are two threads reading data, but they both appear to be in nativemap code, so I don't expect too much memory usage from them. ClientPool 615 is the thrift call for one of those scans. I'm not quite sure what ClientPool 614 is doing.
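As a minimal sketch of the scanner-side change described above - dropping the batch size from the default of 1000 down to 1 - something like the following works, assuming a 1.5/1.6-era Java client API (the class, helper method, and "stress_test" table name are made up for illustration). With 100 MB values, capping the batch at one keeps each Thrift scan response to roughly a single key-value pair.

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class LargeValueScan {
      // Hypothetical helper: reads a table of ~100 MB values one key-value per batch.
      // Assumes the caller already has a Connector; "stress_test" is a made-up name.
      static long scanAll(Connector conn) throws TableNotFoundException {
        Scanner scanner = conn.createScanner("stress_test", Authorizations.EMPTY);
        // The client-side default is 1000 key-values per batch; with 100 MB values,
        // even a few key-values per response adds up fast on the tserver heap,
        // so cap each batch at one key-value.
        scanner.setBatchSize(1);
        long bytes = 0;
        for (Entry<Key,Value> e : scanner) {
          bytes += e.getValue().getSize();
        }
        return bytes;
      }
    }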
>>> My hunch is that 613 is what actually pushed you into the OOME. I can't really say much more because I assume you're running on CDH, as the line numbers don't match up to the Hadoop sources I have locally.

>>> I don't think there's much inside the logger code that will hold onto duplicate mutations, so the two things I'm curious about are:

>>> 1. Any chunking/buffering done inside of the DFSOutputStream (and whether we should be using/configuring something differently). I see some signs of this from the method names in the stack trace.

>>> 2. Figuring out a formula for the sizes of Mutations that are held directly (via (Server)Mutation objects on heap) or indirectly (being written out to some OutputStream, like the DfsOutputStream previously mentioned), relative to the Accumulo configuration.

>>> I imagine #2 is where we could gain the most value.

>>> Hopefully that brain dump is helpful :)

>>> On 5/27/14, 6:19 PM, Bill Havanki wrote:

>>>> Stack traces are here:
>>>>
>>>> https://gist.github.com/kbzod/e6e21ea15cf5670ba534

>>>> This time something showed up in the monitor; often there is no stack trace there. The thread dump is from setting ACCUMULO_KILL_CMD to "kill -3 %p".

>>>> Thanks again
>>>> Bill

>>>> On Tue, May 27, 2014 at 5:09 PM, Bill Havanki <[email protected]> wrote:

>>>>> I left the default key size constraint in place. I had set the tserver message size up from 1 GB to 1.5 GB, but it didn't help. (I forgot that config item.)

>>>>> Stack trace(s) coming up! I got tired of failures all day, so I'm running a different test that will hopefully work. I'll re-break it shortly :D

>>>>> On Tue, May 27, 2014 at 5:04 PM, Josh Elser <[email protected]> wrote:

>>>>>> Stack traces would definitely be helpful, IMO.
>>>>>>
>>>>>> (or interesting if nothing else :D)

>>>>>> On 5/27/14, 4:55 PM, Bill Havanki wrote:

>>>>>>> No sir. I am seeing general out of heap space messages, nothing about direct buffers. One specific example would be while Thrift is writing to a ByteArrayOutputStream to send off scan results. (I can get an exact stack trace - easily :} - if it would be helpful.) It seems as if there just isn't enough heap left, after controlling for what I have so far.

>>>>>>> As a clarification of my original email: each row has 100 cells, and each cell has a 100 MB value. So, one row would occupy just over 10 GB.

>>>>>>> On Tue, May 27, 2014 at 4:49 PM, <[email protected]> wrote:

>>>>>>>> Are you seeing something similar to the error in https://issues.apache.org/jira/browse/ACCUMULO-2495?

>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Bill Havanki" <[email protected]>
>>>>>>>> To: "Accumulo Dev List" <[email protected]>
>>>>>>>> Sent: Tuesday, May 27, 2014 4:30:59 PM
>>>>>>>> Subject: Supporting large values

>>>>>>>> I'm trying to run a stress test where each row in a table has 100 cells, each with a value of 100 MB of random data. (This is using Bill Slacum's memory stress test tool.) Despite fiddling with the cluster configuration, I always run out of tablet server heap space before too long.
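To make #2 above a bit more concrete, here is a rough sketch of the write side as described in this thread (one cell per mutation, 100 MB of random data per value, 100 cells per row). It assumes a 1.5/1.6-era client API; the class, method, and column layout are made up for illustration. Each mutation is roughly 100 MB on the client heap, and per the earlier note the tserver then holds about a second copy's worth of bytes while serializing it out to the WAL.

    import java.util.Random;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Mutation;

    public class LargeValueWriter {
      static final int VALUE_SIZE = 100 * 1024 * 1024; // 100 MB per cell, as in the test

      // Hypothetical illustration of the one-cell-per-mutation pattern described above.
      static void writeRow(Connector conn, String table, byte[] row, int cells)
          throws TableNotFoundException, MutationsRejectedException {
        Random random = new Random();
        BatchWriterConfig cfg = new BatchWriterConfig();
        // Keep the client-side buffer small so only about one mutation is held
        // before a flush; each mutation is already ~100 MB + a little.
        cfg.setMaxMemory(VALUE_SIZE);
        BatchWriter writer = conn.createBatchWriter(table, cfg);
        try {
          for (int i = 0; i < cells; i++) {
            byte[] value = new byte[VALUE_SIZE];
            random.nextBytes(value);
            Mutation m = new Mutation(row);
            // Made-up column layout: family = cell index, empty qualifier.
            m.put(Integer.toString(i).getBytes(), new byte[0], value);
            writer.addMutation(m); // serialized again server-side when written to the WAL
          }
        } finally {
          writer.close();
        }
      }
    }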
>>>>>>>> Here are the configurations I've tried so far, with valuable guidance from Busbey and madrob:

>>>>>>>> - native maps are enabled, tserver.memory.maps.max = 8G
>>>>>>>> - table.compaction.minor.logs.threshold = 8
>>>>>>>> - tserver.walog.max.size = 1G
>>>>>>>> - Tablet server has 4G heap (-Xmx4g)
>>>>>>>> - table is pre-split into 8 tablets (split points 0x20, 0x40, 0x60, ...), 5 tablet servers are available
>>>>>>>> - tserver.cache.data.size = 256M
>>>>>>>> - tserver.cache.index.size = 40M (keys are small - 4 bytes - in this test)
>>>>>>>> - table.scan.max.memory = 256M
>>>>>>>> - tserver.readahead.concurrent.max = 4 (default is 16)

>>>>>>>> It's often hard to tell where the OOM error comes from, but I have seen it frequently coming from Thrift as it is writing out scan results.

>>>>>>>> Does anyone have any good conventions for supporting large values? (Warning: I'll want to work on large keys (and tiny values) next! :) )

>>>>>>>> Thanks very much
>>>>>>>> Bill

>>>>>>>> --
>>>>>>>> // Bill Havanki
>>>>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>>>>> // 443.686.9283

>>>>> --
>>>>> // Bill Havanki
>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>> // 443.686.9283

--
// Bill Havanki
// Solutions Architect, Cloudera Govt Solutions
// 443.686.9283
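To round this out, here is a rough sketch of how the table-level pieces of the configuration list above could be applied through the Java API - the pre-split into 8 tablets and the table.* properties. The class and method names are made up for illustration; the tserver.* settings (maps.max, cache sizes, walog size, readahead) normally live in accumulo-site.xml and are not shown.

    import java.util.TreeSet;

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.hadoop.io.Text;

    public class StressTableSetup {
      // Hypothetical helper mirroring the table-level settings listed above.
      static void configure(Connector conn, String table)
          throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
        // Pre-split into 8 tablets using single-byte split points 0x20, 0x40, ..., 0xE0.
        TreeSet<Text> splits = new TreeSet<Text>();
        for (int b = 0x20; b < 0x100; b += 0x20) {
          splits.add(new Text(new byte[] {(byte) b}));
        }
        conn.tableOperations().addSplits(table, splits);

        // Table-scoped properties from the list above.
        conn.tableOperations().setProperty(table, "table.scan.max.memory", "256M");
        conn.tableOperations().setProperty(table, "table.compaction.minor.logs.threshold", "8");
      }
    }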
