Hi Michael,

We developed similar functionality in Elasticsearch. The DiskUsage API
<https://github.com/elastic/elasticsearch/pull/74051> estimates the storage
of each field by iterating over its data structures (inverted index,
doc values, stored fields, etc.) and tracking the number of bytes read. It
is fast and the estimates are quite accurate.
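
In case a concrete invocation helps, here is a quick sketch using the
low-level Java REST client. The index name and host are placeholders, and
the run_expensive_tasks flag is required because the analysis has to read
through the whole index:

import org.apache.http.HttpHost;
import org.apache.http.util.EntityUtils;
import org.elasticsearch.client.Request;
import org.elasticsearch.client.Response;
import org.elasticsearch.client.RestClient;

public class DiskUsageExample {
  public static void main(String[] args) throws Exception {
    // Placeholder host and index name; adjust for a real cluster.
    try (RestClient client =
        RestClient.builder(new HttpHost("localhost", 9200, "http")).build()) {
      Request request = new Request("POST", "/my-index/_disk_usage");
      // Required flag: the analysis reads through every field's data.
      request.addParameter("run_expensive_tasks", "true");
      Response response = client.performRequest(request);
      System.out.println(EntityUtils.toString(response.getEntity()));
    }
  }
}

The response breaks the total down per field and per structure, roughly
mirroring the list above.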

I am +1 to the proposal.

Thanks,
Nhat

On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov <[email protected]> wrote:

> At Amazon, we have a need to produce regular metrics on how much disk
> storage is consumed by each field. We manage an index with data
> contributed by many teams and business units and we are often asked to
> produce reports attributing index storage usage to these customers.
> The best tool we have for this today is based on a custom Codec that
> separates storage by field; to get the statistics we read an existing
> index and write it out with addIndexes and a force merge under the
> custom codec. This is time-consuming and inefficient and tends not to
> get done.
>
> I wonder if it would make sense to add methods to *some* API that
> would expose a per-field disk space metric? If we don't want to add to
> IndexReader, which would imply lots of intermediate methods and API
> additions, maybe we could have it computed by CheckIndex?
>
> (implementation note: For the current formats, the on-disk data is
> always segregated by field, I think. I suppose that in
> theory we might want to have some shared data structure across fields
> some day, but it seems like an edge case that we could handle in some
> exceptional way.)
>
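
For reference, the codec-rewrite workaround described above boils down to
roughly the following sketch. The custom codec itself isn't shown; the
commented setCodec line only marks where a hypothetical per-field codec
would plug in.

import java.nio.file.Paths;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteWithCustomCodec {
  public static void main(String[] args) throws Exception {
    Directory source = FSDirectory.open(Paths.get(args[0])); // existing index
    Directory target = FSDirectory.open(Paths.get(args[1])); // rewritten copy

    IndexWriterConfig config = new IndexWriterConfig(new StandardAnalyzer());
    // Hypothetical: plug in a codec that writes each field's data to its own
    // files, so the resulting file sizes can be attributed per field.
    // config.setCodec(new PerFieldSeparatingCodec());

    try (IndexWriter writer = new IndexWriter(target, config)) {
      writer.addIndexes(source); // copy every segment of the source index as-is
      writer.forceMerge(1);      // re-encode everything through the configured codec
    }
    // The per-field files under `target` can then be sized with ordinary
    // file-system tools to produce the storage report.
  }
}

This works, but it rewrites the entire index just to measure it, which is
why a read-only estimate (whether on IndexReader or via CheckIndex) would
be much cheaper.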
