The immediate intent is to run some memory stress tests as part of a deliverable of ours for Accumulo. So, right now I'm just trying to get the tests to pass. A greater goal is indeed to understand what's needed to support really large keys or values. I don't think we're looking to generate a formula yet, but maybe just advice on what configuration settings and such to look out for in general.
I did discover that the test client was a) requesting one split at a time and b) not setting a low scanner batch size (the default is 1000). Setting the batch size down to 1 seems to have helped a lot, so things are slowly improving. :)

On Wed, May 28, 2014 at 10:24 AM, Josh Elser <[email protected]> wrote:

> On 5/28/14, 9:39 AM, Bill Havanki wrote:

>> Thanks Josh!

>> - This is indeed under CDH 4.6.0. If there is a particular line number you want to see code for, just name it and I'll look it up.

> I was generally curious to see what kind of batching the DfsOutputStream does (it looked like it was checksumming small chunks of data), but I can look into that some more to satisfy my curiosity.

>> - Re #2, the test client is sending mutations of only one cell each, so a mutation should be 100 MB + a little, due to the large value. It's inefficient, but it seems to be a good idea just for getting this test to survive. Maybe the logger code is hanging on to mutations in memory before writing them out? (That would surprise me, but I dunno.)

> Well, I think you're going to have to be able to keep "about" two copies in memory (what I was trying to get at before). The tserver is going to get the Mutation objects from the client. So, that's one instance of, say, 100MB. Before that write finishes, you'll also need to write those out to the WAL, which means that you'll be serializing each Mutation using the Writable methods. While that isn't quite the same as having a discrete object of that size on the heap, you're still writing those bytes to the DataOutput, and they are going to be buffered through the JVM heap.

>> Another fact I didn't mention is that I am running 2 writers and 2 readers for the test. Perhaps 612 and 613 are the write threads, and then 615 is one scan, which might leave 614 as the remains of the other scan, which has already failed and is logging an OOME (which is what the monitor shows)?

> Perhaps! That might make sense.

>> My thought from looking at this again is that Thrift is running out of space forming the scan result message as it fills up a ByteArrayOutputStream. Maybe there is some way to force Thrift to break things up?

> I don't know of anything inside of thrift that we could use to do that.

> Overall, though, what's your intent by testing this? Is it to have a better understanding of server-side memory usage? Generally speaking, if you have clients getting back 100MB values and the server is writing 100MB values, that would intuitively use up a bit of heap space.

> I could see merit in constructing a general formula for memory consumption based on avg key-value size, number of threads available to read, number of threads available to write, and number of MinC/MajC threads. It probably wouldn't be much more valuable than a starting point due to variance, but it would be a starting point!

>> Thanks for burning cycles on this.
>>
>> Bill

>> On Tue, May 27, 2014 at 7:11 PM, Josh Elser <[email protected]> wrote:

>>> Well, for this one, it looks to me that you have two threads writing data (ClientPool 612 and 613), with 612 being blocked by 613. There are two threads reading data, but they both appear to be in nativemap code, so I don't expect too much memory usage from them. ClientPool 615 is the thrift call for one of those scans. I'm not quite sure what ClientPool 614 is doing.
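As a minimal sketch of the scanner-side change described above - dropping the batch size from the default of 1000 down to 1 - something like the following works, assuming a 1.5/1.6-era Java client API (the class, helper method, and "stress_test" table name are made up for illustration). With 100 MB values, capping the batch at one keeps each Thrift scan response to roughly a single key-value pair.

    import java.util.Map.Entry;

    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.Scanner;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Key;
    import org.apache.accumulo.core.data.Value;
    import org.apache.accumulo.core.security.Authorizations;

    public class LargeValueScan {
      // Hypothetical helper: reads a table of ~100 MB values one key-value per batch.
      // Assumes the caller already has a Connector; "stress_test" is a made-up name.
      static long scanAll(Connector conn) throws TableNotFoundException {
        Scanner scanner = conn.createScanner("stress_test", Authorizations.EMPTY);
        // The client-side default is 1000 key-values per batch; with 100 MB values,
        // even a few key-values per response adds up fast on the tserver heap,
        // so cap each batch at one key-value.
        scanner.setBatchSize(1);
        long bytes = 0;
        for (Entry<Key,Value> e : scanner) {
          bytes += e.getValue().getSize();
        }
        return bytes;
      }
    }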
>>> My hunch is that 613 is what actually pushed you into the OOME. I can't really say much more because I assume you're running on CDH, as the line numbers don't match up to the Hadoop sources I have locally.

>>> I don't think there's much inside the logger code that will hold onto duplicate mutations, so the two things I'm curious about are:

>>> 1. Any chunking/buffering done inside of the DFSOutputStream (and whether we should be using/configuring something differently). I see some signs of this from the method names in the stack trace.

>>> 2. Figuring out a formula for the sizes of Mutations that are held directly (via (Server)Mutation objects on heap) or indirectly (being written out to some OutputStream, like the DfsOutputStream previously mentioned), relative to the Accumulo configuration.

>>> I imagine #2 is where we could gain the most value.

>>> Hopefully that brain dump is helpful :)

>>> On 5/27/14, 6:19 PM, Bill Havanki wrote:

>>>> Stack traces are here:
>>>>
>>>> https://gist.github.com/kbzod/e6e21ea15cf5670ba534

>>>> This time something showed up in the monitor; often there is no stack trace there. The thread dump is from setting ACCUMULO_KILL_CMD to "kill -3 %p".

>>>> Thanks again
>>>> Bill

>>>> On Tue, May 27, 2014 at 5:09 PM, Bill Havanki <[email protected]> wrote:

>>>>> I left the default key size constraint in place. I had set the tserver message size up from 1 GB to 1.5 GB, but it didn't help. (I forgot that config item.)

>>>>> Stack trace(s) coming up! I got tired of failures all day, so I'm running a different test that will hopefully work. I'll re-break it shortly :D

>>>>> On Tue, May 27, 2014 at 5:04 PM, Josh Elser <[email protected]> wrote:

>>>>>> Stack traces would definitely be helpful, IMO.
>>>>>>
>>>>>> (or interesting if nothing else :D)

>>>>>> On 5/27/14, 4:55 PM, Bill Havanki wrote:

>>>>>>> No sir. I am seeing general out of heap space messages, nothing about direct buffers. One specific example would be while Thrift is writing to a ByteArrayOutputStream to send off scan results. (I can get an exact stack trace - easily :} - if it would be helpful.) It seems as if there just isn't enough heap left, after controlling for what I have so far.

>>>>>>> As a clarification of my original email: each row has 100 cells, and each cell has a 100 MB value. So, one row would occupy just over 10 GB.

>>>>>>> On Tue, May 27, 2014 at 4:49 PM, <[email protected]> wrote:

>>>>>>>> Are you seeing something similar to the error in https://issues.apache.org/jira/browse/ACCUMULO-2495?

>>>>>>>> ----- Original Message -----
>>>>>>>> From: "Bill Havanki" <[email protected]>
>>>>>>>> To: "Accumulo Dev List" <[email protected]>
>>>>>>>> Sent: Tuesday, May 27, 2014 4:30:59 PM
>>>>>>>> Subject: Supporting large values

>>>>>>>> I'm trying to run a stress test where each row in a table has 100 cells, each with a value of 100 MB of random data. (This is using Bill Slacum's memory stress test tool.) Despite fiddling with the cluster configuration, I always run out of tablet server heap space before too long.
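To make #2 above a bit more concrete, here is a rough sketch of the write side as described in this thread (one cell per mutation, 100 MB of random data per value, 100 cells per row). It assumes a 1.5/1.6-era client API; the class, method, and column layout are made up for illustration. Each mutation is roughly 100 MB on the client heap, and per the earlier note the tserver then holds about a second copy's worth of bytes while serializing it out to the WAL.

    import java.util.Random;

    import org.apache.accumulo.core.client.BatchWriter;
    import org.apache.accumulo.core.client.BatchWriterConfig;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.MutationsRejectedException;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.accumulo.core.data.Mutation;

    public class LargeValueWriter {
      static final int VALUE_SIZE = 100 * 1024 * 1024; // 100 MB per cell, as in the test

      // Hypothetical illustration of the one-cell-per-mutation pattern described above.
      static void writeRow(Connector conn, String table, byte[] row, int cells)
          throws TableNotFoundException, MutationsRejectedException {
        Random random = new Random();
        BatchWriterConfig cfg = new BatchWriterConfig();
        // Keep the client-side buffer small so only about one mutation is held
        // before a flush; each mutation is already ~100 MB + a little.
        cfg.setMaxMemory(VALUE_SIZE);
        BatchWriter writer = conn.createBatchWriter(table, cfg);
        try {
          for (int i = 0; i < cells; i++) {
            byte[] value = new byte[VALUE_SIZE];
            random.nextBytes(value);
            Mutation m = new Mutation(row);
            // Made-up column layout: family = cell index, empty qualifier.
            m.put(Integer.toString(i).getBytes(), new byte[0], value);
            writer.addMutation(m); // serialized again server-side when written to the WAL
          }
        } finally {
          writer.close();
        }
      }
    }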
>>>>>>>> Here are the configurations I've tried so far, with valuable guidance from Busbey and madrob:

>>>>>>>> - native maps are enabled, tserver.memory.maps.max = 8G
>>>>>>>> - table.compaction.minor.logs.threshold = 8
>>>>>>>> - tserver.walog.max.size = 1G
>>>>>>>> - Tablet server has 4G heap (-Xmx4g)
>>>>>>>> - table is pre-split into 8 tablets (split points 0x20, 0x40, 0x60, ...), 5 tablet servers are available
>>>>>>>> - tserver.cache.data.size = 256M
>>>>>>>> - tserver.cache.index.size = 40M (keys are small - 4 bytes - in this test)
>>>>>>>> - table.scan.max.memory = 256M
>>>>>>>> - tserver.readahead.concurrent.max = 4 (default is 16)

>>>>>>>> It's often hard to tell where the OOM error comes from, but I have seen it frequently coming from Thrift as it is writing out scan results.

>>>>>>>> Does anyone have any good conventions for supporting large values? (Warning: I'll want to work on large keys (and tiny values) next! :) )

>>>>>>>> Thanks very much
>>>>>>>> Bill

>>>>>>>> --
>>>>>>>> // Bill Havanki
>>>>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>>>>> // 443.686.9283

>>>>> --
>>>>> // Bill Havanki
>>>>> // Solutions Architect, Cloudera Govt Solutions
>>>>> // 443.686.9283

--
// Bill Havanki
// Solutions Architect, Cloudera Govt Solutions
// 443.686.9283
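To round this out, here is a rough sketch of how the table-level pieces of the configuration list above could be applied through the Java API - the pre-split into 8 tablets and the table.* properties. The class and method names are made up for illustration; the tserver.* settings (maps.max, cache sizes, walog size, readahead) normally live in accumulo-site.xml and are not shown.

    import java.util.TreeSet;

    import org.apache.accumulo.core.client.AccumuloException;
    import org.apache.accumulo.core.client.AccumuloSecurityException;
    import org.apache.accumulo.core.client.Connector;
    import org.apache.accumulo.core.client.TableNotFoundException;
    import org.apache.hadoop.io.Text;

    public class StressTableSetup {
      // Hypothetical helper mirroring the table-level settings listed above.
      static void configure(Connector conn, String table)
          throws AccumuloException, AccumuloSecurityException, TableNotFoundException {
        // Pre-split into 8 tablets using single-byte split points 0x20, 0x40, ..., 0xE0.
        TreeSet<Text> splits = new TreeSet<Text>();
        for (int b = 0x20; b < 0x100; b += 0x20) {
          splits.add(new Text(new byte[] {(byte) b}));
        }
        conn.tableOperations().addSplits(table, splits);

        // Table-scoped properties from the list above.
        conn.tableOperations().setProperty(table, "table.scan.max.memory", "256M");
        conn.tableOperations().setProperty(table, "table.compaction.minor.logs.threshold", "8");
      }
    }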
