[ https://issues.apache.org/jira/browse/OAK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Davide Giannella updated OAK-7193:
----------------------------------
Fix Version/s: (was: 1.12.0)
> DataStore: API to retrieve statistic (file headers, size estimation)
> --------------------------------------------------------------------
>
> Key: OAK-7193
> URL: https://issues.apache.org/jira/browse/OAK-7193
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: blob
> Reporter: Thomas Mueller
> Priority: Major
> Fix For: 1.14.0
>
>
> Extension of OAK-6254: in addition to retrieving the size, it would be good
> to retrieve the estimated number and total size per file type. A simple (and
> in my view sufficient) solution is to use the first few bytes ("magic
> numbers"; 2 bytes should be enough) to determine the file type. That would
> make it possible to estimate, for example, the number and total size of PDF
> files, JPEG files, Lucene index files, and so on. A histogram would be nice
> as well, but I don't think it is needed.
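
A minimal sketch of the two-byte classification described above. The class name, method names, and lookup table are hypothetical and not part of Oak; the table lists only a few well-known signatures and would need extending for other types such as Lucene index files.

```java
import java.util.Map;

class MagicTypes {
    // Hypothetical table: first two content bytes (4 hex digits) -> type label.
    private static final Map<String, String> TYPES = Map.of(
            "2550", "PDF",   // "%P", start of "%PDF"
            "ffd8", "JPEG",
            "8950", "PNG",
            "504b", "ZIP",   // "PK"
            "1f8b", "GZIP");

    /** Returns the first two bytes as lowercase hex, or "" if fewer than two bytes. */
    static String magicHex(byte[] head) {
        if (head.length < 2) {
            return "";
        }
        return String.format("%02x%02x", head[0] & 0xff, head[1] & 0xff);
    }

    /** Maps the leading bytes to a type label; unknown signatures become "other". */
    static String classify(byte[] head) {
        return TYPES.getOrDefault(magicHex(head), "other");
    }

    public static void main(String[] args) {
        byte[] jpegHead = {(byte) 0xff, (byte) 0xd8, (byte) 0xff};
        System.out.println(classify(jpegHead));
    }
}
```

Two bytes cannot separate every format (ZIP-based formats all start with "PK", for instance), which seems acceptable for a rough estimate.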
> To speed up the calculation, the blob ID could be extended with the first 2
> bytes of the file content, that is: <hash>#<length>@<magic>, where magic is
> the first two bytes, in hex. That would make it possible to get the data
> quickly from the blob IDs alone (no need to actually read the content).
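
As an illustration of reading statistics straight from IDs of the proposed form <hash>#<length>@<magic>, here is a rough sketch. The class `BlobIdStats`, its `Stat` holder, and the sample IDs are invented for this example; it assumes the length is a decimal number and simply skips IDs that carry no "@<magic>" suffix.

```java
import java.util.Map;
import java.util.TreeMap;

class BlobIdStats {
    /** Count and total size accumulated for one magic value. */
    static final class Stat {
        long count;
        long totalBytes;
    }

    /**
     * Adds one blob ID of the assumed form "<hash>#<length>@<magic>" to the stats.
     * IDs without a magic suffix are ignored; a non-numeric length would throw
     * NumberFormatException.
     */
    static void add(Map<String, Stat> stats, String blobId) {
        int hashPos = blobId.lastIndexOf('#');
        int atPos = blobId.lastIndexOf('@');
        if (hashPos < 0 || atPos < hashPos) {
            return; // no length or no magic: nothing to record
        }
        long length = Long.parseLong(blobId.substring(hashPos + 1, atPos));
        String magic = blobId.substring(atPos + 1);
        Stat stat = stats.computeIfAbsent(magic, k -> new Stat());
        stat.count++;
        stat.totalBytes += length;
    }

    public static void main(String[] args) {
        Map<String, Stat> stats = new TreeMap<>();
        // Made-up IDs, shortened hashes for readability.
        for (String id : new String[] {"ab12#1024@2550", "cd34#2048@2550", "ef56#500@ffd8"}) {
            add(stats, id);
        }
        stats.forEach((magic, s) ->
                System.out.println(magic + ": " + s.count + " blobs, " + s.totalBytes + " bytes"));
    }
}
```

Only string parsing is involved, so a pass over the ID list never touches blob content.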
> Sampling should be enough. The longer the sampling runs, the more accurate
> the data. We could store the data while doing datastore GC, in which case the
> returned data would be somewhat stale; that's OK.
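
One way the sampling estimate could work, assuming a uniform random sample of blob IDs and a known total number of blobs: scale each per-type figure from the sample by the ratio of total blobs to sampled blobs. The class and method names are hypothetical.

```java
class SampleEstimate {
    /**
     * Extrapolates a value observed in a uniform sample (a per-type blob count
     * or a per-type byte total) to the whole data store.
     */
    static long scale(long sampledValue, long sampleSize, long totalBlobs) {
        if (sampleSize <= 0) {
            return 0; // nothing sampled yet, so no estimate
        }
        return Math.round((double) sampledValue * totalBlobs / sampleSize);
    }

    public static void main(String[] args) {
        long sampleSize = 1_000;       // blobs inspected so far
        long totalBlobs = 2_000_000;   // blobs in the data store
        long pdfCount = 30;            // PDFs seen in the sample
        long pdfBytes = 5_000_000;     // their combined size in the sample
        System.out.println("estimated PDF count: " + scale(pdfCount, sampleSize, totalBlobs));
        System.out.println("estimated PDF bytes: " + scale(pdfBytes, sampleSize, totalBlobs));
    }
}
```

A larger sample narrows the error of these estimates, which matches the point that a longer run gives more accurate data.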