Hi Michael,

We developed similar functionality in Elasticsearch. The DiskUsage API <https://github.com/elastic/elasticsearch/pull/74051> estimates the storage of each field by iterating over its data structures (inverted index, doc values, stored fields, etc.) and tracking the number of bytes read. The result is pretty fast and accurate.
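Roughly, the byte-tracking idea looks like the sketch below. This is not the actual Elasticsearch code, just a minimal, self-contained illustration: a counting Directory/IndexInput wrapper (the CountingDirectory and CountingInput names are made up here) measures how many bytes are read while the postings of each field are walked, and the delta is attributed to that field. The real API does the same for doc values, stored fields, points, norms, and term vectors.

import java.io.IOException;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.FieldInfo;
import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;
import org.apache.lucene.store.FilterDirectory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

public class PerFieldDiskUsageSketch {

  /** Directory wrapper that counts every byte read through it. */
  static class CountingDirectory extends FilterDirectory {
    long bytesRead;

    CountingDirectory(Directory in) { super(in); }

    @Override
    public IndexInput openInput(String name, IOContext context) throws IOException {
      return new CountingInput(super.openInput(name, context), this);
    }
  }

  /** IndexInput wrapper that bumps the shared counter on every read. */
  static class CountingInput extends IndexInput {
    private final IndexInput in;
    private final CountingDirectory counter;

    CountingInput(IndexInput in, CountingDirectory counter) {
      super("counting(" + in + ")");
      this.in = in;
      this.counter = counter;
    }

    @Override public byte readByte() throws IOException { counter.bytesRead++; return in.readByte(); }
    @Override public void readBytes(byte[] b, int off, int len) throws IOException {
      counter.bytesRead += len;
      in.readBytes(b, off, len);
    }
    @Override public long getFilePointer() { return in.getFilePointer(); }
    @Override public void seek(long pos) throws IOException { in.seek(pos); }
    @Override public long length() { return in.length(); }
    @Override public void close() throws IOException { in.close(); }
    @Override public IndexInput slice(String desc, long offset, long length) throws IOException {
      return new CountingInput(in.slice(desc, offset, length), counter);
    }
    @Override public IndexInput clone() { return new CountingInput(in.clone(), counter); }
  }

  public static void main(String[] args) throws IOException {
    CountingDirectory dir = new CountingDirectory(FSDirectory.open(Paths.get(args[0])));
    Map<String, Long> postingsBytes = new HashMap<>();
    try (DirectoryReader reader = DirectoryReader.open(dir)) {
      for (LeafReaderContext leaf : reader.leaves()) {
        for (FieldInfo fi : leaf.reader().getFieldInfos()) {
          long before = dir.bytesRead;
          Terms terms = leaf.reader().terms(fi.name);
          if (terms != null) {
            TermsEnum termsEnum = terms.iterator();
            PostingsEnum postings = null;
            while (termsEnum.next() != null) { // visit every term of this field
              postings = termsEnum.postings(postings, PostingsEnum.FREQS);
              while (postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
                // just consume; the bytes read are what we care about
              }
            }
          }
          // attribute whatever was read while walking this field to the field
          postingsBytes.merge(fi.name, dir.bytesRead - before, Long::sum);
        }
      }
    }
    postingsBytes.forEach((field, bytes) ->
        System.out.printf("%s: ~%d bytes of postings%n", field, bytes));
  }
}

The numbers are approximations, since some shared structures get read lazily while the first field is visited, but in practice the per-field attribution comes out close to the on-disk reality.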
I am +1 to the proposal.

Thanks,
Nhat

On Mon, Jun 13, 2022 at 1:22 PM Michael Sokolov <[email protected]> wrote:

> At Amazon, we have a need to produce regular metrics on how much disk
> storage is consumed by each field. We manage an index with data
> contributed by many teams and business units and we are often asked to
> produce reports attributing index storage usage to these customers.
> The best tool we have for this today is based on a custom Codec that
> separates storage by field; to get the statistics we read an existing
> index and write it out using AddIndexes and force-merging, using the
> custom codec. This is time-consuming and inefficient and tends not to
> get done.
>
> I wonder if it would make sense to add methods to *some* API that
> would expose a per-field disk space metric? If we don't want to add to
> IndexReader, which would imply lots of intermediate methods and API
> additions, maybe we could make it be computed by CheckIndex?
>
> (implementation note: For the current formats, the information for
> each field is always segregated by field, I think. I suppose that in
> theory we might want to have some shared data structure across fields
> some day, but it seems like an edge case that we could handle in some
> exceptional way.)
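For reference, the rewrite-based approach described above boils down to something like the sketch below. "PerFieldSplittingCodec" is only a stand-in for the internal codec (it would have to exist and be registered via SPI); Lucene does not ship such a codec, which is part of why a built-in per-field disk metric would be nicer.

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.codecs.Codec;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class RewriteThroughCustomCodec {
  public static void main(String[] args) throws IOException {
    try (Directory source = FSDirectory.open(Paths.get(args[0]));
         Directory target = FSDirectory.open(Paths.get(args[1]))) {
      IndexWriterConfig config = new IndexWriterConfig();
      // placeholder name: the custom codec must be registered via SPI
      config.setCodec(Codec.forName("PerFieldSplittingCodec"));
      try (IndexWriter writer = new IndexWriter(target, config)) {
        writer.addIndexes(source); // copy the existing segments over
        writer.forceMerge(1);      // rewrite everything through the codec
      }
      // File sizes in `target` can now be attributed per field, since the
      // codec keeps each field's data in separate files.
    }
  }
}

Rewriting the whole index just to measure it is exactly the cost the proposal would avoid.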
