gianm commented on issue #6743: IncrementalIndex generally overestimates theta sketch size
URL: https://github.com/apache/incubator-druid/issues/6743#issuecomment-449146550

> I don't have a deep understanding of the inner workings of the memory allocation strategy in Druid, but I should point out that the current model of allocating equal sized slots in a Buffer where each slot is the maximum possible size of a sketch is very likely a horrible waste of memory space.

Yeah, it's not ideal, especially not for some of the newer sketches. (The granddaddy of Druid sketches, hyperUnique, doesn't have a very high max memory footprint, so it was less of an issue back then.)

> What I would recommend is that if we could work together, we could come up with a much more efficient memory management model for sketches in Druid that would allow you to recapture most if not all of that wasted space. This will likely require some changes in Druid as well as a change in how sketches use and allocate memory.

That would be awesome. Rather than the Druid memory manager allowing for off-heap space that it doesn't control, what if Druid _did_ control it, but basically delegated that control to the sketch?

Druid's query engine works by allocating a fixed-size "processing buffer" to each compute thread. When a thread processes a segment, it allocates all the memory it needs out of that buffer. After the segment is done being processed, the results are transferred elsewhere, and the processing buffer is reused for the next segment to be processed. The processing buffers are typically 500MB to 2GB in size, and they are preallocated at server startup to avoid "surprises" (one buffer per compute thread, which is a fixed-size pool generally set to the number of processors).

Right now, as I'm guessing you know, the protocol for aggregators getting space in that buffer is something like:

1. Druid calls AggregatorFactory's getMaxIntermediateSize method to figure out how much memory to allocate per aggregator.
2. Druid allocates that much memory per grouping tuple.
3. Druid calls BufferAggregator's "init", "aggregate", and "get" methods to interact with the memory it has allocated.

Riffing off your idea, what I'm thinking is carving out a chunk of the processing buffer to be managed by the BufferAggregator impl:

1. Druid calls a new AggregatorFactory "getTypicalIntermediateSize" method to figure out a "typical" size per aggregator, and getMaxIntermediateSize to figure out the max.
2. Druid computes how many grouping tuples it could store in the buffer if each one had aggregators of the "typical" size. Call it N.
3. Druid carves out an arena of size N * getTypicalIntermediateSize from the processing buffer, and passes it to the AggregatorFactory's "factorizeBuffered" method, which creates a BufferAggregator that is free to use that arena.
4. For each grouping tuple, Druid stores not the aggregated value, but just the information needed to find the actual data in the arena.

We could also tweak this protocol to avoid the arena if "typical" equals "max", so primitive aggregators don't need to become more complex, and so we save the extra overhead for the pointer into the arena.
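To make the current fixed-slot protocol concrete, here is a toy, self-contained sketch of a long-sum aggregator driven through the three steps above. The method names mirror Druid's `BufferAggregator` / `AggregatorFactory` contract, but everything else (the interface shape, the demo class, the slot arithmetic) is a simplified illustration, not Druid's actual code:

```java
import java.nio.ByteBuffer;

// Toy sketch of Druid's current fixed-slot protocol: every grouping tuple
// gets an equal-sized slot of getMaxIntermediateSize() bytes, which is why
// sketches with a large max footprint waste so much space.
public class FixedSlotDemo {
    // Simplified stand-in for Druid's BufferAggregator interface (assumed shape).
    interface BufferAggregator {
        void init(ByteBuffer buf, int position);                  // step 3: initialize slot
        void aggregate(ByteBuffer buf, int position, long value); // step 3: fold in a value
        long get(ByteBuffer buf, int position);                   // step 3: read result
    }

    // A primitive long-sum aggregator: its "max" intermediate size is exact,
    // so the fixed-slot model wastes nothing here (unlike a sketch).
    static class LongSumBufferAggregator implements BufferAggregator {
        static int getMaxIntermediateSize() { return Long.BYTES; } // step 1
        public void init(ByteBuffer buf, int position) { buf.putLong(position, 0L); }
        public void aggregate(ByteBuffer buf, int position, long value) {
            buf.putLong(position, buf.getLong(position) + value);
        }
        public long get(ByteBuffer buf, int position) { return buf.getLong(position); }
    }

    public static void main(String[] args) {
        int slotSize = LongSumBufferAggregator.getMaxIntermediateSize();
        int numTuples = 4;                                          // grouping tuples
        ByteBuffer processing = ByteBuffer.allocate(slotSize * numTuples); // step 2

        BufferAggregator agg = new LongSumBufferAggregator();
        for (int tuple = 0; tuple < numTuples; tuple++) {
            int position = tuple * slotSize;                        // equal-sized slots
            agg.init(processing, position);
            agg.aggregate(processing, position, tuple + 1);         // adds 1, 2, 3, 4
            agg.aggregate(processing, position, 10);
        }
        // Slot for tuple 2 holds 3 + 10 = 13.
        System.out.println(agg.get(processing, 2 * slotSize));
    }
}
```

Note that the caller, not the aggregator, decides where each slot lives; the aggregator only ever sees a `(buffer, position)` pair, which is the property the arena proposal below preserves.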
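And here is a back-of-the-envelope sketch of the sizing arithmetic in the proposed arena protocol (steps 2-3). The "typical" and "max" sizes and the bump-pointer allocation are illustrative assumptions on my part; only the N / arena computation and the per-tuple-pointer idea come from the proposal itself:

```java
import java.nio.ByteBuffer;

// Illustrative arithmetic for the proposed arena-based protocol.
// All concrete sizes are made up; the bump allocator is just one possible
// strategy a BufferAggregator impl could use inside its delegated arena.
public class ArenaDemo {
    public static void main(String[] args) {
        int processingBufferBytes = 1_000_000;  // pretend processing buffer
        int typicalSize = 1_024;                // hypothetical getTypicalIntermediateSize
        int pointerSize = Integer.BYTES;        // per-tuple offset into the arena
        int maxSize     = 65_536;               // hypothetical getMaxIntermediateSize

        // Step 2: how many grouping tuples fit if each aggregator is "typical"
        // sized? Each tuple costs a pointer slot plus a typical-size arena share.
        int n = processingBufferBytes / (typicalSize + pointerSize);

        // Step 3: carve an arena of N * typicalSize off the processing buffer
        // and hand it to the aggregator (here, the tail of the buffer).
        int arenaBytes = n * typicalSize;
        ByteBuffer processing = ByteBuffer.allocate(processingBufferBytes);
        processing.position(processingBufferBytes - arenaBytes);
        ByteBuffer arena = processing.slice();  // region delegated to the aggregator

        // Step 4: per-tuple slots store offsets into the arena, not values.
        // A sketch that stays small consumes little arena space; an outlier
        // can take up to maxSize without forcing every slot to be that big.
        int bumpPointer = 0;
        int tuple0Offset = bumpPointer; bumpPointer += typicalSize; // small sketch
        int tuple1Offset = bumpPointer; bumpPointer += maxSize;     // one big sketch

        System.out.println("tuples that fit (N) = " + n);
        System.out.println("arena bytes = " + arenaBytes);
        System.out.println("tuple1 offset = " + tuple1Offset);
    }
}
```

With these assumed numbers, the fixed-slot model would fit only 1,000,000 / 65,536 ≈ 15 tuples, versus 972 under the arena model, which is the kind of recapture the proposal is after.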
