On 5/28/14, 9:39 AM, Bill Havanki wrote:
Thanks Josh!

- This is indeed under CDH 4.6.0. If there is a particular line number you
want to see code for, just name it and I'll look it up.

I was generally curious to see what kind of batching the DFSOutputStream does (it looked like it was checksumming small chunks of data), but I can look into that some more to satisfy my curiosity.

- Re #2, the test client is sending mutations of only one cell each, so a
mutation should be 100 MB + a little, due to the large value. It's
inefficient, but it seems to be a good idea just for getting this test to
survive. Maybe the logger code is hanging on to mutations in memory before
writing them out? (That would surprise me, but I dunno.)

Well, I think you're going to have to be able to keep "about" two copies in memory (which is what I was trying to get at before). The tserver gets the Mutation objects from the client, so that's one instance of, say, 100MB. Before that write finishes, you'll also need to write those mutations out to the WAL, which means serializing each Mutation using its Writable methods. That isn't quite the same as having a second discrete 100MB object on the heap, but the bytes you write to the DataOutput are still buffered through JVM heap.
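To make that concrete, here is a minimal sketch of the two places those bytes show up. This is not the tserver or logger code; it just uses the public Mutation/Writable API, with a ByteArrayOutputStream standing in for the stream that actually feeds the WAL:

    import java.io.ByteArrayOutputStream;
    import java.io.DataOutputStream;
    import java.io.IOException;

    import org.apache.accumulo.core.data.Mutation;
    import org.apache.accumulo.core.data.Value;
    import org.apache.hadoop.io.Text;

    public class TwoCopiesSketch {
      public static void main(String[] args) throws IOException {
        // Copy #1: a Mutation carrying ~100MB of value bytes, like what the
        // tserver receives from the client.
        byte[] bigValue = new byte[100 * 1024 * 1024];
        Mutation m = new Mutation(new Text("row1"));
        m.put(new Text("cf"), new Text("cq"), new Value(bigValue));

        // Copy #2: serializing the Mutation via its Writable write() pushes the
        // same bytes through another buffer on their way out to the WAL's
        // OutputStream (here a ByteArrayOutputStream as a stand-in).
        ByteArrayOutputStream walStandIn = new ByteArrayOutputStream();
        m.write(new DataOutputStream(walStandIn));

        System.out.println("serialized mutation: " + walStandIn.size() + " bytes");
      }
    }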

Another fact I didn't mention is that I am running 2 writers and 2 readers
for the test. Perhaps 612 and 613 are the write threads, and then 615 is
one scan, which might leave 614 as the remains of the other scan, which has
already failed and is logging an OOME (which is what the monitor shows)?

Perhaps! That might make sense.

My thought from looking at this again is that Thrift is running out of
space forming the scan result message as it fills up a
ByteArrayOutputStream. Maybe there is some way to force Thrift to break
things up?

I don't know of anything inside of thrift that we could use to do that.

Overall, though, what's your intent in testing this? Is it to get a better understanding of server-side memory usage? Generally speaking, if you have clients reading back 100MB values and the server is writing 100MB values, that is intuitively going to use up a fair bit of heap space.

I could see merit in constructing a general formula for memory consumption based on avg key-value size, number of threads available to read, number of threads available to write, and number of MinC/MajC threads. It probably wouldn't be much more valuable than a starting point due to variance, but it would be a starting point!
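As a strawman for what such a formula might look like (every term and coefficient below is a guess, purely to make the idea concrete, not something I've validated):

    public class HeapEstimateSketch {
      // Back-of-the-envelope only: real usage varies with batching, caches,
      // Thrift buffers, and GC behavior.
      static long roughHeapEstimate(long avgKvBytes, int writeThreads, int readThreads,
                                    int compactionThreads, long scanMaxMemoryBytes) {
        long writeSide = (long) writeThreads * avgKvBytes * 2;       // Mutation on heap + bytes buffered toward the WAL
        long readSide = (long) readThreads
            * Math.max(avgKvBytes, scanMaxMemoryBytes);              // at least one KV per scan batch
        long compactionSide = (long) compactionThreads * avgKvBytes; // KVs in flight per MinC/MajC thread
        return writeSide + readSide + compactionSide;                // plus caches and general overhead
      }

      public static void main(String[] args) {
        // Plugging in this test: ~100MB values, 2 writers, 2 readers, a couple of
        // compaction threads, table.scan.max.memory = 256M.
        long mb = 1024L * 1024L;
        long estimate = roughHeapEstimate(100 * mb, 2, 2, 2, 256 * mb);
        System.out.println(estimate / mb + " MB before caches and overhead");
      }
    }

Even a crude number like that is already a sizable chunk of a 4G heap once you add the 256M data cache and the rest of the overhead, which at least lines up with what you're seeing.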

Thanks for burning cycles on this.

Bill


On Tue, May 27, 2014 at 7:11 PM, Josh Elser <[email protected]> wrote:

Well, for this one, it looks to me that you have two threads writing data
(ClientPool 612 and 613), with 612 being blocked by 613. There are two
threads reading data, but they both appear to be in nativemap code, so I
don't expect too much memory usage from them. ClientPool 615 is the thrift
call for one of those scans. I'm not quite sure what ClientPool 614 is
doing.

My hunch is that 613 is what actually pushed you into the OOME. I can't
really say much more because I assume you're running on CDH, since the line
numbers don't match up to the Hadoop sources I have locally.

I don't think there's much inside the logger code that will hold onto
duplicate mutations, so the two things I'm curious about are:

1. Any chunking/buffering done inside of the DFSOutputStream (and whether we
should be using/configuring something differently). I see some signs of
this from the method names in the stack trace.

2. Figuring out a formula for the size of the Mutations held directly (as
(Server)Mutation objects on the heap) or indirectly (as bytes being written
out to some OutputStream, like the DFSOutputStream mentioned previously),
relative to the Accumulo configuration.

I imagine #2 is where we'd get the most value.
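On #1, a quick way to see what the HDFS client is actually doing is to dump the write-path settings. A sketch, assuming the Hadoop 2.x property names (worth double-checking against the CDH build): the client checksums data in small chunks and batches those chunks into packets before they hit the pipeline, which would line up with the checksumming methods in your stack trace.

    import org.apache.hadoop.conf.Configuration;

    public class DfsWriteBufferSizes {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // Size of each chunk a checksum is computed over (default 512 bytes).
        System.out.println("dfs.bytes-per-checksum       = "
            + conf.getInt("dfs.bytes-per-checksum", 512));
        // Chunks are batched into packets of roughly this size before being
        // sent to the datanode pipeline (default 64KB).
        System.out.println("dfs.client-write-packet-size = "
            + conf.getInt("dfs.client-write-packet-size", 64 * 1024));
      }
    }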

Hopefully that brain dump is helpful :)


On 5/27/14, 6:19 PM, Bill Havanki wrote:

Stack traces are here:

https://gist.github.com/kbzod/e6e21ea15cf5670ba534

This time something showed up in the monitor; often there is no stack trace
there. The thread dump is from setting ACCUMULO_KILL_CMD to "kill -3 %p".

Thanks again
Bill


On Tue, May 27, 2014 at 5:09 PM, Bill Havanki <[email protected]>
wrote:

I left the default key size constraint in place. I had set the tserver
message size up from 1 GB to 1.5 GB, but it didn't help. (I forgot that
config item.)

Stack trace(s) coming up! I got tired of failures all day, so I'm running a
different test that will hopefully work. I'll re-break it shortly :D


On Tue, May 27, 2014 at 5:04 PM, Josh Elser <[email protected]>
wrote:

Stack traces would definitely be helpful, IMO.

(or interesting if nothing else :D)


On 5/27/14, 4:55 PM, Bill Havanki wrote:

No sir. I am seeing general out of heap space messages, nothing about
direct buffers. One specific example would be while Thrift is writing to a
ByteArrayOutputStream to send off scan results. (I can get an exact stack
trace - easily :} - if it would be helpful.) It seems as if there just
isn't enough heap left, after controlling for what I have so far.

As a clarification of my original email: each row has 100 cells, and each
cell has a 100 MB value. So, one row would occupy just over 10 GB.


On Tue, May 27, 2014 at 4:49 PM, <[email protected]> wrote:

Are you seeing something similar to the error in
https://issues.apache.org/jira/browse/ACCUMULO-2495?

----- Original Message -----

From: "Bill Havanki" <[email protected]>
To: "Accumulo Dev List" <[email protected]>
Sent: Tuesday, May 27, 2014 4:30:59 PM
Subject: Supporting large values

I'm trying to run a stress test where each row in a table has 100 cells,
each with a value of 100 MB of random data. (This is using Bill Slacum's
memory stress test tool.) Despite fiddling with the cluster configuration,
I always run out of tablet server heap space before too long.

Here are the configurations I've tried so far, with valuable guidance from
Busbey and madrob:

- native maps are enabled, tserver.memory.maps.max = 8G
- table.compaction.minor.logs.threshold = 8
- tserver.walog.max.size = 1G
- Tablet server has a 4G heap (-Xmx4g)
- table is pre-split into 8 tablets (split points 0x20, 0x40, 0x60, ...), 5 tablet servers are available
- tserver.cache.data.size = 256M
- tserver.cache.index.size = 40M (keys are small - 4 bytes - in this test)
- table.scan.max.memory = 256M
- tserver.readahead.concurrent.max = 4 (default is 16)
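For reference, here is roughly how the tserver-level settings above look in accumulo-site.xml (just a sketch using the values from this test; the table.* properties can either go here as site-wide defaults or be set per-table with the shell's config command):

    <!-- accumulo-site.xml excerpt (sketch, values from this test) -->
    <property>
      <name>tserver.memory.maps.max</name>
      <value>8G</value>
    </property>
    <property>
      <name>tserver.walog.max.size</name>
      <value>1G</value>
    </property>
    <property>
      <name>tserver.cache.data.size</name>
      <value>256M</value>
    </property>
    <property>
      <name>tserver.readahead.concurrent.max</name>
      <value>4</value>
    </property>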

It's often hard to tell where the OOM error comes from, but I have seen it
frequently coming from Thrift as it is writing out scan results.

Does anyone have any good conventions for supporting large values?
(Warning: I'll want to work on large keys (and tiny values) next! :) )

Thanks very much
Bill

--
// Bill Havanki
// Solutions Architect, Cloudera Govt Solutions
// 443.686.9283


--
// Bill Havanki
// Solutions Architect, Cloudera Govt Solutions
// 443.686.9283
