On Tue, Nov 5, 2024 at 7:31 PM Patrick Zhai <zhai7...@gmail.com> wrote:

> I wouldn't call this a good way, but as the last resort you can parse the
> metadata files yourself, as it is not so hard to parse (yet)


Yeah ... the Lucene codec itself knows precisely how much disk is used for
each field, and indeed stores it simply in its metadata.  And it's
incredibly fast to peek into that metadata to get the per-field metrics.

We will likely take this approach (on top of Lucene), but it is clearly an
abstraction violation: it is brittle to future Codec changes, not officially
supported, etc.
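
For anyone who wants a starting point that stays on public APIs, here is a
minimal sketch (the class name DiskUsageByExtension is made up for
illustration).  It only buckets per-file sizes by extension per segment,
which is coarser than per-field; true per-field numbers require parsing the
codec's metadata files, i.e. exactly the brittle part above:

  import java.nio.file.Paths;
  import java.util.HashMap;
  import java.util.Map;
  import org.apache.lucene.index.DirectoryReader;
  import org.apache.lucene.index.LeafReaderContext;
  import org.apache.lucene.index.SegmentReader;
  import org.apache.lucene.store.Directory;
  import org.apache.lucene.store.FSDirectory;

  public class DiskUsageByExtension {
    public static void main(String[] args) throws Exception {
      try (Directory dir = FSDirectory.open(Paths.get(args[0]));
           DirectoryReader reader = DirectoryReader.open(dir)) {
        Map<String, Long> bytesByExtension = new HashMap<>();
        for (LeafReaderContext ctx : reader.leaves()) {
          // Each leaf of a DirectoryReader is a SegmentReader.
          SegmentReader sr = (SegmentReader) ctx.reader();
          for (String file : sr.getSegmentInfo().files()) {
            int dot = file.lastIndexOf('.');
            String ext = dot < 0 ? "<none>" : file.substring(dot);
            // Sum on-disk bytes per file extension (e.g. .vec, .tim, .doc).
            bytesByExtension.merge(ext, dir.fileLength(file), Long::sum);
          }
        }
        bytesByExtension.forEach(
            (ext, bytes) -> System.out.println(ext + ": " + bytes + " bytes"));
      }
    }
  }

Note that segments written as compound files (.cfs) will hide most of the
per-extension detail inside a single file.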

We've taken the same brittle approach (at Amazon product search) to track
per-field disk usage of terms dictionary and postings (inverted lexical
index) for similar reasons (so many teams indexing so many fields with so
many words!).

I would think many multi-tenant users of Lucene would want some resource
tracking along these lines... but "officially" supporting this in Lucene's
Codec APIs would be an added dev burden.

> As for RAM usage, OnHeapHnswGraph actually implements the Accountable API
> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L279C15-L279C27>,
> and HnswGraphBuilder also has an InfoStream passed in
> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L199>,
> so I think it's OK and reasonable to report the RAM usage at the end of
> graph build, maybe?  Though this won't include the off-heap vector sizes,
> but that one can be estimated easily, I think?
>

I think the OnHeapHnswGraph is used only during indexing?  But +1 to have
the infoStream print the RAM size of that graph during indexing if it
doesn't already ... Tanmay, maybe open a spinoff issue for this small
improvement?
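
The on-heap graph's size can be read straight off Accountable.ramBytesUsed()
at the end of the build.  For the off-heap side, the estimate Patrick
mentions really is easy; here is a minimal sketch (VectorRamEstimator is a
made-up name, and it assumes float32 vectors, counting only the raw vector
data, not the graph files, codec metadata, or any quantized copies):

  import java.io.IOException;
  import org.apache.lucene.index.FloatVectorValues;
  import org.apache.lucene.index.IndexReader;
  import org.apache.lucene.index.LeafReaderContext;

  public class VectorRamEstimator {
    // Rough lower bound on off-heap vector storage for one field:
    // number of vectors * dimension * 4 bytes (float32).
    public static long estimateRawVectorBytes(IndexReader reader, String field)
        throws IOException {
      long total = 0;
      for (LeafReaderContext ctx : reader.leaves()) {
        FloatVectorValues values = ctx.reader().getFloatVectorValues(field);
        if (values != null) {
          total += (long) values.size() * values.dimension() * Float.BYTES;
        }
      }
      return total;
    }
  }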

Thanks, Patrick.

Mike McCandless

http://blog.mikemccandless.com
