On Tue, Nov 5, 2024 at 5:17 PM Adrien Grand <jpou...@gmail.com> wrote:

> Why is it important to break down per field as opposed to scaling based on
> the total volume of vector data?

It's really for internal planning purposes / service telemetry ... on the Amazon product search team (where I also work w/ Tanmay -- hi Tanmay!) we have a number of teams using our Lucene search service to experiment with KNN search, varying the number of dimensions, whether quantization is in use, which ML model, etc. These fields come and go, sometimes without our (low-level infrastructure) service knowing ahead of time how they are changing. So we would ideally like an efficient way to break out per-field KNN disk usage and "ideal hot RAM" online (in our production service), instead of offline / inefficiently, e.g. by rewriting the whole index into separate files (Robert's cool DiskUsage tool).

It's tricky with KNN and features like scalar quantization (https://www.elastic.co/search-labs/blog/scalar-quantization-in-lucene), and soon RaBitQ (https://github.com/apache/lucene/pull/13651), because the on-disk form (which retains the full float32-precision vectors) is different from what searching really uses (the quantized byte-per-dimension form). So the disk consumed by each field is larger than the amount of effective "hot RAM" you might need.

Mike McCandless

http://blog.mikemccandless.com
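PS: a rough back-of-envelope sketch of that float32-vs-quantized gap, assuming int8 scalar quantization and made-up counts (the real per-field files also hold the HNSW graph and quantization metadata, which are ignored here):

// Sizing sketch for one hypothetical KNN field; numbers are illustrative only.
public class KnnFieldSizingSketch {
  public static void main(String[] args) {
    long numVectors = 10_000_000L;  // vectors indexed in this field (hypothetical)
    int dims = 768;                 // vector dimension (hypothetical)

    long float32Bytes   = numVectors * (long) dims * Float.BYTES; // full-precision copy kept on disk
    long quantizedBytes = numVectors * (long) dims;               // ~1 byte per dimension after quantization

    System.out.printf("per-field disk (float32 + int8): ~%.1f GB%n",
        (float32Bytes + quantizedBytes) / 1e9);
    System.out.printf("per-field \"hot RAM\" (int8 only): ~%.1f GB%n",
        quantizedBytes / 1e9);
  }
}

With these made-up numbers the field occupies ~38 GB on disk but only ~7.7 GB of the quantized form needs to be "hot" for searching, which is why we'd like both numbers broken out per field.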