Hello,

I'm writing an HTTP server in Java that provides Arrow data to users. For
performance, I keep the most-recently-used Arrow batches in an in-memory
cache. A batch is wrapped in a "DataBatch" Java object containing the
schema and field vectors.

I'm looking for a good memory management strategy here, given that:
- batches can be evicted from the in-memory cache, and the underlying
memory should be freed as quickly as possible, *if nothing else is using
them*,
- data retrieved from the cache goes through a zero-copy path (filters
etc. that are views on the underlying data) before being sent out, so a
batch can still be in use by one of several concurrent threads at the
moment it is evicted.
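For concreteness, the hazard is roughly the following (all names are
hypothetical; a plain AutoCloseable stands in for Arrow's FieldVector so
the sketch is self-contained):

```java
import java.util.List;

// Hypothetical stand-in for an Arrow FieldVector backed by off-heap memory.
class NativeVector implements AutoCloseable {
    boolean closed = false;
    @Override public void close() { closed = true; } // frees off-heap memory
}

// Hypothetical cache entry: schema + vectors, as described above.
class DataBatch implements AutoCloseable {
    final List<NativeVector> vectors;
    DataBatch(List<NativeVector> vectors) { this.vectors = vectors; }
    @Override public void close() { vectors.forEach(NativeVector::close); }
}

public class Sketch {
    public static void main(String[] args) {
        DataBatch batch = new DataBatch(List.of(new NativeVector()));
        // Thread A: takes a zero-copy view of batch.vectors, starts sending.
        // Thread B: evicts the batch and closes it -- if nothing coordinates
        // the two, thread A is now reading freed memory.
        batch.close();
        System.out.println(batch.vectors.get(0).closed);
    }
}
```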

I'm used to C++, where this scenario would seem relatively unchallenging,
as we'd keep std::shared_ptr's and just clean up everything in the
destructor.

In Java, however, it seems that:
- Object#finalize is deprecated, and not very reliable anyway,
- GC might only run when there is pressure on the Java heap, but the
Arrow data lives off-heap, in Netty-allocated buffers.

I wonder if people have encountered this scenario before, and what
approach was favoured. Some ideas:
- Manually maintain a ref-count and free when it goes to zero. This
seems brittle in the face of errors etc. that could lead to leaks,
- Use the PhantomReference mechanism. This would appear to suffer from
the same potential delay in GC, though: my Java object is just a little
holder for the underlying FieldVectors. Perhaps there's a way of saying
that these DataBatch objects should be GC'd promptly?
- Make a copy of the data when it is retrieved from the cache, so that
an eviction means the memory can always be safely freed. Seems very
wasteful, and not very scalable if there are other reuse paths.
- Allocate the buffers in a way that counts towards heap memory pressure.
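On the first idea: one way to make a manual ref-count less brittle is to
only ever hand it out through a scoped, AutoCloseable "lease", so that
try-with-resources guarantees the release even when the send path
throws. A minimal sketch under that assumption (all names hypothetical;
note that Arrow's own ArrowBuf is already reference-counted via its
ReferenceManager, so this may be reusable rather than hand-rolled):

```java
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical refcounted wrapper: the cache holds one reference, every
// reader takes a scoped lease, and the underlying memory is freed exactly
// once, by whoever drops the last reference.
class RefCountedBatch {
    private final AtomicInteger refs = new AtomicInteger(1); // the cache's ref
    private final Runnable freeMemory; // e.g. close the FieldVectors

    RefCountedBatch(Runnable freeMemory) { this.freeMemory = freeMemory; }

    /** Take a lease for reading; try-with-resources guarantees release. */
    Lease acquire() {
        int previous = refs.getAndIncrement();
        if (previous <= 0) { // already freed; undo and fail loudly
            refs.getAndDecrement();
            throw new IllegalStateException("batch already freed");
        }
        return new Lease();
    }

    /** Called by the cache on eviction: drops the cache's own reference. */
    void evict() { release(); }

    private void release() {
        if (refs.decrementAndGet() == 0) freeMemory.run();
    }

    class Lease implements AutoCloseable {
        private boolean open = true;
        @Override public void close() { if (open) { open = false; release(); } }
    }
}
```

A reader would then do:

    try (RefCountedBatch.Lease lease = batch.acquire()) {
        // zero-copy filter + send; the memory cannot be freed under us
    } // released even if sending throws

so an eviction while a send is in flight merely defers the free to the
last lease's close(). (A production version would need to handle the
classic acquire-after-zero race more carefully than this sketch does.)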
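On the PhantomReference idea: since Java 9 the usual packaging of that
mechanism is java.lang.ref.Cleaner, which runs a cleanup action some
time after the holder becomes unreachable. It has exactly the GC-timing
caveat you mention, so it tends to be used as a leak backstop behind an
explicit close(), not as the primary release path. A sketch, with
hypothetical names:

```java
import java.lang.ref.Cleaner;

// Hypothetical holder: explicit close() is the prompt release path; the
// Cleaner is only a backstop that frees the off-heap memory if a batch
// was dropped without being closed.
class CleanedBatch implements AutoCloseable {
    private static final Cleaner CLEANER = Cleaner.create();

    // The cleanup state must NOT reference the CleanedBatch itself,
    // or the batch could never become unreachable.
    private static final class State implements Runnable {
        volatile boolean freed = false;
        @Override public void run() {
            // In the real thing: close the FieldVectors / release the buffers.
            freed = true;
        }
    }

    private final State state = new State();
    private final Cleaner.Cleanable cleanable = CLEANER.register(this, state);

    boolean isFreed() { return state.freed; }

    @Override public void close() { cleanable.clean(); } // prompt, explicit
}
```

Cleanable.clean() runs the action at most once, so an explicit close()
followed by the object later becoming unreachable is safe. But because
the backstop only fires on a GC cycle, it doesn't by itself solve the
"free off-heap memory promptly when the Java heap is idle" problem.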

Any thoughts are appreciated! I'm not a Java expert at all, so may be
missing obvious things, or thinking about it non-idiomatically.

Best,
-J
