[ 
https://issues.apache.org/jira/browse/OAK-7193?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Andrei Dulceanu updated OAK-7193:
---------------------------------
    Fix Version/s:     (was: 1.36.0)

> DataStore: API to retrieve statistic (file headers, size estimation)
> --------------------------------------------------------------------
>
>                 Key: OAK-7193
>                 URL: https://issues.apache.org/jira/browse/OAK-7193
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: blob
>            Reporter: Thomas Mueller
>            Priority: Major
>             Fix For: 1.38.0
>
>
> Extension of OAK-6254: in addition to retrieving the size, it would be good 
> to retrieve the estimated number and total size per file type. A simple (and 
> in my view sufficient) solution is to use the first few bytes ("magic 
> numbers", 2 bytes should be enough) to get the file type. That would allow to 
> estimate, for example, the number of, and total size, of PDF files, JPEG, 
> Lucene index and so on. A histogram would be nice as well, but I think is not 
> needed.
> To speed up calculation, the blob ID could be extended with the first 2 bytes 
> of the file content, that is: <hash>#<length>@<magic> where magic is the 
> first two bytes, in hex. That would allow to quickly get the data from the 
> blob ids (no need to actually read content).
> Sampling should be enough. The longer it takes, the more accurate the data. 
> We could store the data while doing datastore GC, in which case the returned 
> data would be somewhat stale; that's OK.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to