I wouldn't call this a good way, but as a last resort you can parse the
metadata files yourself; the format is not (yet) hard to parse. The
relevant logic is in:
Lucene99HnswVectorsFormat.java
Lucene99FlatVectorsFormat.java
The risk, of course, is that whenever the format changes, the parsing logic
will need to change accordingly. I don't think we will (or can) guarantee
any stability of the format, nor do we want to maintain such a disk-usage
estimation tool in the Lucene codebase.
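
That said, if an approximation is acceptable, you may not even need to
parse anything for the per-field breakdown: the raw vector data for a field
is roughly numVectors * dimension * bytesPerElement, all of which is
available from the public index APIs. A rough, untested sketch of mine
(method names are from recent 9.x and may differ in your version; this only
approximates the flat .vec data and ignores the HNSW graph files and
metadata overhead):

import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;
import org.apache.lucene.index.ByteVectorValues;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.FloatVectorValues;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.VectorEncoding;
import org.apache.lucene.store.FSDirectory;

public class VectorDiskEstimate {
  public static void main(String[] args) throws Exception {
    Map<String, Long> bytesPerField = new HashMap<>();
    try (DirectoryReader reader =
        DirectoryReader.open(FSDirectory.open(Paths.get(args[0])))) {
      for (LeafReaderContext ctx : reader.leaves()) {
        for (FieldInfo fi : ctx.reader().getFieldInfos()) {
          int dims = fi.getVectorDimension();
          if (dims == 0) continue; // not a vector field
          long count;
          int bytesPerDim;
          if (fi.getVectorEncoding() == VectorEncoding.FLOAT32) {
            FloatVectorValues vals = ctx.reader().getFloatVectorValues(fi.name);
            count = vals == null ? 0 : vals.size();
            bytesPerDim = Float.BYTES;
          } else { // VectorEncoding.BYTE
            ByteVectorValues vals = ctx.reader().getByteVectorValues(fi.name);
            count = vals == null ? 0 : vals.size();
            bytesPerDim = Byte.BYTES;
          }
          // approximate on-disk bytes of the raw vectors in this segment
          bytesPerField.merge(fi.name, count * dims * bytesPerDim, Long::sum);
        }
      }
    }
    bytesPerField.forEach((f, b) -> System.out.println(f + " -> ~" + b + " bytes"));
  }
}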

As for RAM usage, OnHeapHnswGraph actually implements the Accountable API
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/OnHeapHnswGraph.java#L279C15-L279C27>,
and HnswGraphBuilder also has an InfoStream passed in
<https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java#L199>,
so I think it would be reasonable to report the RAM usage at the end of the
graph build. This won't include the off-heap vector sizes, but those can be
estimated easily; a sketch of both follows.
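
Something like this is what I have in mind (untested, and note that
HnswGraphBuilder/OnHeapHnswGraph are internal APIs whose signatures move
between versions, so this mostly applies if you drive the hnsw package
directly; the off-heap part simply assumes float32 vectors):

import org.apache.lucene.util.RamUsageEstimator;
import org.apache.lucene.util.hnsw.HnswGraphBuilder;
import org.apache.lucene.util.hnsw.OnHeapHnswGraph;

public class HnswMemoryReport {
  // Call once the graph build has finished. numVectors and dimension are
  // whatever you indexed; float32 (4 bytes per dimension) is assumed.
  public static void report(HnswGraphBuilder builder, int numVectors, int dimension) {
    OnHeapHnswGraph graph = builder.getGraph();
    long graphHeapBytes = graph.ramBytesUsed(); // via Accountable
    long offHeapVectorBytes = (long) numVectors * dimension * Float.BYTES;
    System.out.println("HNSW graph (on-heap): "
        + RamUsageEstimator.humanReadableUnits(graphHeapBytes));
    System.out.println("raw vectors (off-heap, estimated): "
        + RamUsageEstimator.humanReadableUnits(offHeapVectorBytes));
  }
}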

On Tue, Nov 5, 2024 at 2:16 PM Adrien Grand <jpou...@gmail.com> wrote:

> I cannot think of good ways to do this. Why is it important to break down
> per field as opposed to scaling based on the total volume of vector data?
>
> On Tue, Nov 5, 2024 at 10:58 PM Tanmay Goel <goeltan...@gmail.com> wrote:
>
> > Hi Rui
> >
> > Thanks for your response. The snippet you shared is great, but it's not
> > exactly what I was looking for. With it we can find the total size of
> > the .vec files, but I want to look inside the .vec files and compute a
> > map from vector field name to the number of bytes on disk.
> >
> > Thanks
> > Tanmay
> >
> > On Wed, 30 Oct 2024 at 13:30, Rui Wu <rui...@mongodb.com.invalid> wrote:
> >
> > > Hi Tanmay,
> > >
> > > Are you bothered by the .vec files hidden within the compound files? If
> > > yes, I have a snippet that can sum up the .vec files inside and outside
> > > compound files.
> > > https://gist.github.com/wurui90/28de20d46079108d7ae5ed181ba939d4
> > >
> > > On Tue, Oct 29, 2024 at 12:08 PM Tanmay Goel <goeltan...@gmail.com> wrote:
> > >
> > > > Hi all
> > > >
> > > > I recently joined the Lucene team at Amazon, and this is my first
> > > > time working with Lucene, so any help will be appreciated.
> > > >
> > > > One of my first tasks is to *add a metric in production to track the
> > > > RAM / disk usage of vector fields*. We want to use this metric to
> > > > decide when to scale our deployments.
> > > >
> > > > One of the ideas to get this data was to split the index files such
> > > > that we have separate files for each field and prefix filenames with
> > > > the field name. We could then analyze the index files and figure out
> > > > how many bytes are used for each field. However, this idea is called
> > > > out as a bad practice in the Lucene docs (
> > > > https://github.com/apache/lucene/blob/main/dev-docs/file-formats.md#dont-use-too-many-files
> > > > ).
> > > >
> > > > Is there any other way to find out how many bytes are being used by
> > > > vector fields?
> > > >
> > > > Thanks!
> > > >
> > > > Tanmay
> > > >
> > >
> >
>
>
> --
> Adrien
>
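
P.S. For anyone finding this thread later: as I understand it, Rui's
compound-file snippet boils down to something like the following (my own
untested reconstruction against Lucene 9.x APIs; getCompoundReader's
signature may differ in other versions):

import java.nio.file.Paths;
import org.apache.lucene.codecs.CompoundDirectory;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.IOContext;

public class VecFileSizes {
  public static void main(String[] args) throws Exception {
    long totalVecBytes = 0;
    try (Directory dir = FSDirectory.open(Paths.get(args[0]))) {
      for (SegmentCommitInfo sci : SegmentInfos.readLatestCommit(dir)) {
        if (sci.info.getUseCompoundFile()) {
          // the .vec file is hidden inside the .cfs; open it as a directory
          try (CompoundDirectory cfs = sci.info.getCodec().compoundFormat()
              .getCompoundReader(dir, sci.info, IOContext.READ)) {
            for (String file : cfs.listAll()) {
              if (file.endsWith(".vec")) totalVecBytes += cfs.fileLength(file);
            }
          }
        } else {
          for (String file : sci.files()) {
            if (file.endsWith(".vec")) totalVecBytes += dir.fileLength(file);
          }
        }
      }
    }
    System.out.println("total .vec bytes: " + totalVecBytes);
  }
}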
