[ 
https://issues.apache.org/jira/browse/OAK-4471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331518#comment-15331518
 ] 

Chetan Mehrotra commented on OAK-4471:
--------------------------------------

*Use dictionary for Property Names*

Under this approach we can use a dictionary for commonly occurring property 
names. Below are some stats from a repository with:

* 14.5 M documents in nodes collection
* 26 M property names
* 8475 unique property names
* 290M - Total size of property names (assuming 8 bits per char)
* 8755M - Total repo size

Top property name stats

{noformat}
+-----------------------------------------------------+
|Count  |Name                    |% by count|% by size|
+-----------------------------------------------------+
|3033972|jcr:lastModified        |11.65     |15.94    |
|2573208|jcr:data                |9.88      |6.76     |
|2505308|uniqueKey               |9.62      |7.40     |
|2286350|blobSize                |8.78      |6.00     |
|1706460|match                   |6.55      |2.80     |
|1484283|jcr:primaryType         |5.70      |7.31     |
|969596 |jcr:created             |3.72      |3.50     |
|933960 |jcr:createdBy           |3.59      |3.99     |
|921199 |sling:resourceType      |3.54      |5.44     |
|702208 |:childOrder             |2.70      |2.54     |
|601959 |entry                   |2.31      |0.99     |
|600299 |jcr:uuid                |2.31      |1.58     |
|481036 |jcr:lastModifiedBy      |1.85      |2.84     |
|477625 |jcr:frozenPrimaryType   |1.83      |3.29     |
|477625 |jcr:frozenUuid          |1.83      |2.20     |
|357201 |text                    |1.37      |0.47     |
|351712 |textIsRich              |1.35      |1.15     |
|228623 |event\djob\dqueued\dtime|0.88      |1.80     |
+-----------------------------------------------------+
{noformat}
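
As a rough sketch of how the two percentage columns above could be derived: the "% by size" figures are consistent with weighting each name by its length at one byte per character. The function name and the three-row sample below are illustrative assumptions, not the script actually used to produce the stats.

```python
def name_stats(counts):
    """Return (name, % by count, % by size) rows, largest count first."""
    total_count = sum(counts.values())
    # Assumption: one byte per character, so a name's size is count * len(name).
    sizes = {name: count * len(name) for name, count in counts.items()}
    total_size = sum(sizes.values())
    rows = []
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        rows.append((name,
                     round(100 * count / total_count, 2),
                     round(100 * sizes[name] / total_size, 2)))
    return rows


# Only three rows from the table, so the percentages here are relative to
# this sample and will not match the full-repository figures.
sample = {"jcr:lastModified": 3033972, "jcr:data": 2573208, "uniqueKey": 2505308}
for row in name_stats(sample):
    print(row)
```

Long names such as jcr:lastModified weigh more by size than by count, which is why the two columns diverge.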

Based on the above, property names account for only about 3.3% of the total 
repo size (290M of 8755M), so for now using a dictionary for property names 
would not provide much benefit!
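
A back-of-the-envelope estimate of the best case, using the stats above. It assumes every property name is replaced by a fixed 2-byte id (enough for 8475 unique names), ignores the cost of storing the dictionary itself, and reads "M" loosely as the same unit for counts and sizes. The function name and the 2-byte id width are hypothetical.

```python
def dictionary_savings(total_names, names_size, repo_size, id_bytes):
    """Upper bound on space saved by replacing each name with a fixed-size id.

    total_names, names_size and repo_size are all in millions (M).
    """
    encoded_size = total_names * id_bytes
    saved = names_size - encoded_size
    return saved, saved / repo_size


# Figures taken from the stats above; id_bytes=2 is an assumption.
saved, fraction = dictionary_savings(total_names=26, names_size=290,
                                     repo_size=8755, id_bytes=2)
print(f"names share of repo: {290 / 8755:.1%}")
print(f"saved: {saved}M ({fraction:.1%} of repo)")
```

Even under these optimistic assumptions the saving stays below 3% of the repository size, which supports the conclusion above.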

> More compact storage format for Documents
> -----------------------------------------
>
>                 Key: OAK-4471
>                 URL: https://issues.apache.org/jira/browse/OAK-4471
>             Project: Jackrabbit Oak
>          Issue Type: Improvement
>          Components: documentmk
>            Reporter: Chetan Mehrotra
>            Assignee: Chetan Mehrotra
>              Labels: performance
>             Fix For: 1.6
>
>
> The aim of this task is to evaluate the storage cost of the current approach 
> for various Documents in DocumentNodeStore, and then evaluate possible 
> alternatives to see if we can get a significant reduction in storage size.
> Possible areas of improvement
> # NodeDocument
> ## Use binary encoding for property values - Currently property values are 
> stored in JSON encoding i.e. arrays and single values are encoded in JSON 
> along with their type
> ## Use binary encoding for Revision values - In a given document Revision 
> instances are a major part of storage size. A binary encoding might provide 
> more compact storage
> # Journal - The journal entries can be stored in compressed form
> Any new approach should support working with existing setups i.e. provide 
> a gradual change in storage format.
> *Possible Benefits*
> More compact storage would help in the following ways
> # Low memory footprint of Document in Mongo and RDB
> # Low memory footprint for in memory NodeDocument instances - e.g. 
> property values when stored in binary format would consume less memory
> # Reduction in IO over wire - That should reduce the latency in, say, 
> distributed deployments where Oak has to talk to a remote primary
> Note that before doing any such change we must analyze the gains. Any change 
> in encoding would make interpreting stored data harder and also represents 
> a significant change in stored data where we need to be careful not to 
> introduce any bugs!



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
