On Tue, Nov 5, 2024 at 7:31 PM Patrick Zhai <zhai7...@gmail.com> wrote:
> I wouldn't call this a good way, but as a last resort you can parse the
> metadata files yourself, as it is not so hard to parse (yet)

Yeah ... the Lucene codec itself knows precisely how much disk is used
for each field, and indeed stores it simply in its metadata. And it's
incredibly fast to peek into that metadata to get the per-field metrics.
We will likely take this approach (on top of Lucene), but it is clearly
an abstraction violation: brittle to future Codec changes, not
supported, etc.

We've taken the same brittle approach (at Amazon product search) to
track per-field disk usage of the terms dictionary and postings (the
inverted lexical index) for similar reasons (so many teams indexing so
many fields with so many words!). I would think many multi-tenant users
of Lucene would want some resource tracking along these lines ... but
"officially" supporting this in Lucene's Codec APIs would be an added
dev burden.

> As for RAM usage, OnHeapHnswGraph actually implements the Accountable
> API
> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L279C15-L279C27>,
> and HnswGraphBuilder also has an InfoStream passed in
> <https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L199>,
> so I think it's ok and reasonable to report the RAM usage at the end
> of graph build maybe? Tho this won't include the off-heap vector
> sizes, but that one can be estimated easily I think?

I think the OnHeapHnswGraph is used only during indexing? But +1 to
have the InfoStream print the RAM size of that graph during indexing if
it doesn't already ... Tanmay, maybe open a spinoff issue for this
small improvement?

Thanks Patrick.

Mike McCandless

http://blog.mikemccandless.com
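P.S. For anyone who wants to experiment, here is a minimal sketch of
the "peek at disk usage" idea, written against only the public
Directory API. It breaks bytes down per file extension, not per field
(the .vec / .vem / .vex extensions are my assumption, matching the
current 9.x HNSW vector formats); true per-field numbers would still
mean decoding the .vem metadata yourself, which is exactly the
unsupported, brittle part discussed above:

    import java.nio.file.Paths;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.lucene.index.IndexFileNames;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.FSDirectory;

    // Sums on-disk bytes per index file extension.  For the 9.x HNSW
    // formats, ".vec" holds the raw vectors, ".vem" the per-field
    // vector metadata, and ".vex" the graph itself.
    public class VectorDiskUsage {
      public static void main(String[] args) throws Exception {
        try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
          Map<String, Long> bytesByExt = new HashMap<>();
          for (String file : dir.listAll()) {
            String ext = IndexFileNames.getExtension(file);
            if (ext != null) {
              bytesByExt.merge(ext, dir.fileLength(file), Long::sum);
            }
          }
          bytesByExt.forEach((ext, bytes) ->
              System.out.println("." + ext + ": " + bytes + " bytes"));
        }
      }
    }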
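And the small spinoff improvement might look something like this at the
end of HnswGraphBuilder's build (a hypothetical snippet: I'm assuming
the builder's existing hnsw field and HNSW_COMPONENT constant on main,
so the real patch may well differ):

    // Hypothetical: report the finished graph's self-reported heap
    // usage (OnHeapHnswGraph implements Accountable) through the
    // InfoStream the builder already holds.
    if (infoStream.isEnabled(HNSW_COMPONENT)) {
      infoStream.message(
          HNSW_COMPONENT,
          "completed graph build, ramBytesUsed=" + hnsw.ramBytesUsed());
    }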
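Finally, since the off-heap side isn't covered by Accountable, the easy
estimate Patrick mentions really is just arithmetic for float32
vectors. A rough sketch that ignores scalar quantization and any
per-codec framing:

    // ~4 bytes per dimension per vector for float32 storage in .vec.
    static long estimatedOffHeapVectorBytes(long numVectors, int numDims) {
      return numVectors * (long) numDims * Float.BYTES;
    }

E.g. one million 768-dim float32 vectors is about 3 GB of off-heap
vector data.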