There are various reasons - mainly we want more control over memory usage
and accounting than shared_ptr allows.

Generally we avoid shared_ptr in Impala since it makes it harder to reason
about when resources are released. E.g. we typically want to know/control
exactly when memory is freed up.

Using shared_ptr doesn't help with accounting for memory accurately against
different plan nodes. E.g. if you have multiple join nodes in the same
pipeline, and each of them is processing a batch that references the same
disk I/O buffer, how do you attribute the memory? The most sensible approach
is to have the bottom-most node be the "owner" of the resource, then
transfer that ownership up by attaching it to the last batch that
references it. To do that we need to know explicitly which batch is the
last one, so we have to track that ourselves anyway, which means that
shared_ptr doesn't really help us manage memory lifetime.

I can see some advantages to tracking all the resources each batch
references (e.g. distinguishing owning and non-owning references) - it
would make memory-transfer issues easier to debug, but I don't think
shared_ptr helps with that accounting.


I think there may be some advantages to explicitly reference counting
resources for debugging memory issues.
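
If we ever wanted that kind of debugging aid, a minimal sketch might look
like the following (purely hypothetical names, not anything that exists in
Impala today):

// Hypothetical debug-only tracking: each batch records the buffers it merely
// borrows in addition to the ones it owns, so we can assert that an owner is
// not reset while some other live batch still points into its buffers.
#include <cassert>
#include <memory>
#include <unordered_set>
#include <vector>

class IoBuffer {};

class TrackedBatch {
 public:
  void TakeOwnership(std::unique_ptr<IoBuffer> buf) {
    owned_.push_back(std::move(buf));
  }
  void NoteBorrowed(const IoBuffer* buf) { borrowed_.insert(buf); }

  // Before releasing owned buffers, check that no other live batch borrows them.
  void CheckSafeToReset(const std::vector<const TrackedBatch*>& live_batches) const {
    for (const auto& buf : owned_) {
      for (const TrackedBatch* other : live_batches) {
        assert(other == this || other->borrowed_.count(buf.get()) == 0);
      }
    }
  }

 private:
  std::vector<std::unique_ptr<IoBuffer>> owned_;
  std::unordered_set<const IoBuffer*> borrowed_;
};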

On Wed, Aug 31, 2016 at 5:06 AM, Amos Bird <[email protected]> wrote:

>
> Hi there,
>
> I'm reading
> https://cwiki.apache.org/confluence/display/IMPALA/Impala+Row+Batches.
> It says "If an operator is accumulating batches, this means that it must
> be careful not to destroy or reset a batch if previous batches are still
> in use, because this could release memory resources that are used by the
> previous batches."
>
> This seems to be a good place to use shared_ptr. I'm curious why impala
> handles this problem using some sort of coding conventions. Is it
> because we use MemPools?
>
> I may be very ignorant. Any explanation is highly appreciated!
>
> Regards,
> Amos
>
