[
https://issues.apache.org/jira/browse/OAK-4471?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15331518#comment-15331518
]
Chetan Mehrotra commented on OAK-4471:
--------------------------------------
*Use dictionary for Property Names*
The idea here is to use a dictionary for commonly occurring property names.
Below are some stats from a repository with
* 14.5 M documents in nodes collection
* 26 M property names
* 8475 unique property names
* 290M - Total size of property names (assuming 8 bits per char)
* 8755M - Total repo size
Top property name stats
{noformat}
+-----------------------------------------------------+
|Count  |Name                    |% by count|% by size|
+-----------------------------------------------------+
|3033972|jcr:lastModified        |11.65     |15.94    |
|2573208|jcr:data                |9.88      |6.76     |
|2505308|uniqueKey               |9.62      |7.40     |
|2286350|blobSize                |8.78      |6.00     |
|1706460|match                   |6.55      |2.80     |
|1484283|jcr:primaryType         |5.70      |7.31     |
|969596 |jcr:created             |3.72      |3.50     |
|933960 |jcr:createdBy           |3.59      |3.99     |
|921199 |sling:resourceType      |3.54      |5.44     |
|702208 |:childOrder             |2.70      |2.54     |
|601959 |entry                   |2.31      |0.99     |
|600299 |jcr:uuid                |2.31      |1.58     |
|481036 |jcr:lastModifiedBy      |1.85      |2.84     |
|477625 |jcr:frozenPrimaryType   |1.83      |3.29     |
|477625 |jcr:frozenUuid          |1.83      |2.20     |
|357201 |text                    |1.37      |0.47     |
|351712 |textIsRich              |1.35      |1.15     |
|228623 |event\djob\dqueued\dtime|0.88      |1.80     |
+-----------------------------------------------------+
{noformat}
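For reference, the two percentage columns can be approximately reproduced from the count and the name length. This is only a sketch, not the script that produced the stats: it assumes one byte per character (as stated above), uses the rounded total of 26 M names, and reads "290M" as 290 * 2^20 bytes, which is an inference from the table values rather than something stated in the stats.

```python
# A few rows copied from the table above: (occurrence count, property name).
ROWS = [
    (3033972, "jcr:lastModified"),
    (2573208, "jcr:data"),
    (1706460, "match"),
]

TOTAL_NAMES = 26_000_000        # "26 M property names" (rounded figure)
TOTAL_NAME_BYTES = 290 * 2**20  # "290M", read as 290 MiB (assumption)


def share_by_count(count: int) -> float:
    """Percentage of all property name occurrences taken by this name."""
    return 100.0 * count / TOTAL_NAMES


def share_by_size(count: int, name: str) -> float:
    """Percentage of all property name bytes taken by this name.

    Assumes 8 bits per char, so the byte size equals the name length.
    """
    return 100.0 * count * len(name) / TOTAL_NAME_BYTES


if __name__ == "__main__":
    for count, name in ROWS:
        print(f"{name:20} {share_by_count(count):6.2f} "
              f"{share_by_size(count, name):6.2f}")
```

Because the totals are rounded, the results differ from the table in the second decimal place.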
Based on the above, property names account for only about 3.3% of the total
repo size (290M of 8755M), so for now a dictionary for property names would
not provide much benefit!
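A rough back-of-the-envelope estimate of the possible saving, using only the totals quoted above. The fixed-width dictionary id, the function name and the MiB reading of "M" are assumptions made for illustration; the size of the dictionary itself and any per-document overhead are ignored.

```python
import math

MIB = 2**20
TOTAL_NAMES = 26_000_000   # property name occurrences (rounded)
UNIQUE_NAMES = 8475        # distinct property names
NAME_BYTES = 290 * MIB     # total size of property names
REPO_BYTES = 8755 * MIB    # total repo size


def dictionary_saving_pct(id_bytes=None):
    """Percent of repo size saved if every name became a fixed-width id.

    With id_bytes=None the smallest whole number of bytes that can address
    all unique names is used. id_bytes=0 gives the theoretical ceiling
    (names removed entirely).
    """
    if id_bytes is None:
        id_bytes = math.ceil(math.log2(UNIQUE_NAMES) / 8)
    saved = NAME_BYTES - TOTAL_NAMES * id_bytes
    return 100.0 * saved / REPO_BYTES


if __name__ == "__main__":
    print(f"ceiling:     {dictionary_saving_pct(0):.2f}% of repo size")
    print(f"2-byte ids:  {dictionary_saving_pct():.2f}% of repo size")
```

Under these assumptions 8475 names fit in a 2-byte id, and the saving stays below 3% of the repo size.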
> More compact storage format for Documents
> -----------------------------------------
>
> Key: OAK-4471
> URL: https://issues.apache.org/jira/browse/OAK-4471
> Project: Jackrabbit Oak
> Issue Type: Improvement
> Components: documentmk
> Reporter: Chetan Mehrotra
> Assignee: Chetan Mehrotra
> Labels: performance
> Fix For: 1.6
>
>
> The aim of this task is to evaluate the storage cost of the current approach
> for various Documents in DocumentNodeStore, and then evaluate possible
> alternatives to see if we can get a significant reduction in storage size.
> Possible areas of improvement
> # NodeDocument
> ## Use binary encoding for property values - Currently property values are
> stored in JSON encoding i.e. arrays and single values are encoded in JSON
> along with their type
> ## Use binary encoding for Revision values - In a given document Revision
> instances are a major part of storage size. A binary encoding might provide
> more compact storage
> # Journal - The journal entries can be stored in compressed form
> Any new approach should work with existing setups i.e. allow a gradual
> change in storage format.
> *Possible Benefits*
> More compact storage would help in the following ways
> # Lower memory footprint of Documents in Mongo and RDB
> # Lower memory footprint for in-memory NodeDocument instances - e.g.
> property values stored in binary format would consume less memory
> # Reduction in IO over the wire - This should reduce latency in, say,
> distributed deployments where Oak has to talk to a remote primary
> Note that before making any such change we must analyze the gains. Any change
> in encoding would make interpreting stored data harder, and it also
> represents a significant change in stored data, so we need to be careful not
> to introduce any bugs!