There are various reasons - mainly we want more control over memory usage and accounting than shared_ptr allows.
Generally we avoid shared_ptr in Impala since it makes it harder to reason about when resources are released; we typically want to know and control exactly when memory is freed. shared_ptr also doesn't help with accounting memory accurately against different plan nodes. E.g. if you have multiple join nodes in the same pipeline and each of them is processing a batch that references the same disk I/O buffer, how do you attribute the memory?

The most sensible approach is to have the bottom-most node be the "owner" of the resource, then transfer that ownership up the pipeline by attaching it to the last batch that references it. To do that we need to know explicitly which batch is the last one, so we have to track that anyway, which means shared_ptr doesn't really help us manage memory lifetime.

I can see some advantages to tracking all the resources each batch references (e.g. having both owning and non-owning references) - it would make memory transfer issues easier to debug - but I don't think shared_ptr helps with that accounting. There may also be some advantage to explicitly reference counting resources for debugging memory issues.

On Wed, Aug 31, 2016 at 5:06 AM, Amos Bird <[email protected]> wrote:

> Hi there,
>
> I'm reading
> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Row+Batches.
> It says "If an operator is accumulating batches, this means that it must
> be careful not to destroy or reset a batch if previous batches are still
> in use, because this could release memory resources that are used by the
> previous batches."
>
> This seems to be a good place to use shared_ptr. I'm curious why impala
> handles this problem using some sort of coding conventions. Is it
> because we use MemPools?
>
> I may be very ignorant. Any explanation is highly appreciated!
>
> Regards,
> Amos
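To make the ownership-transfer idea above concrete, here is a rough, self-contained sketch. It is not Impala's actual RowBatch implementation; the Buffer type and the AddBuffer/TransferResourceOwnership/Reset methods are just illustrative names for the pattern of "bottom-most node owns the resource, ownership moves up with the last batch that references it" using unique ownership instead of shared_ptr:

```cpp
// Hypothetical sketch of explicit resource ownership transfer between row
// batches (not Impala's real code). Memory is owned by exactly one batch at a
// time, so it is easy to account it against one plan node and to know the
// exact point where it is freed.
#include <cstddef>
#include <iostream>
#include <memory>
#include <utility>
#include <vector>

// Stand-in for a disk I/O buffer whose memory we want to account explicitly.
struct Buffer {
  std::size_t len;
  explicit Buffer(std::size_t l) : len(l) {}
};

class RowBatch {
 public:
  // The bottom-most node (e.g. a scan) attaches the buffer to the *last*
  // batch that references it; earlier batches only reference the memory
  // without owning it.
  void AddBuffer(std::unique_ptr<Buffer> buf) {
    owned_buffers_.push_back(std::move(buf));
  }

  // An operator that is done with a batch moves its resources into the batch
  // it hands to its parent, so the memory stays alive as long as anything
  // upstream still references it.
  void TransferResourceOwnership(RowBatch* dest) {
    for (auto& buf : owned_buffers_) {
      dest->owned_buffers_.push_back(std::move(buf));
    }
    owned_buffers_.clear();
  }

  // Releasing the batch frees whatever it owns; per the wiki page quoted
  // below, callers must not do this while earlier batches are still in use.
  void Reset() { owned_buffers_.clear(); }

  std::size_t owned_bytes() const {
    std::size_t total = 0;
    for (const auto& buf : owned_buffers_) total += buf->len;
    return total;
  }

 private:
  std::vector<std::unique_ptr<Buffer>> owned_buffers_;
};

int main() {
  RowBatch scan_batch, join_output;

  // The scan node is the single owner, so the memory is charged to it rather
  // than split ambiguously across every operator touching the buffer.
  scan_batch.AddBuffer(std::make_unique<Buffer>(8 * 1024 * 1024));
  std::cout << "scan batch owns " << scan_batch.owned_bytes() << " bytes\n";

  // Once the scan knows this is the last batch referencing the buffer, it
  // transfers ownership upward instead of sharing it.
  scan_batch.TransferResourceOwnership(&join_output);
  std::cout << "after transfer, join output owns "
            << join_output.owned_bytes() << " bytes\n";

  // The memory is freed at a single explicit point: when the top-most batch
  // that owns it is reset or destroyed.
  return 0;
}
```

The key point of the sketch is that knowing "which batch is the last one to reference the buffer" is required either way, and once you track that explicitly, unique ownership plus an explicit transfer step gives the same lifetime guarantees as shared_ptr while keeping memory attributable to a single node.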
