[Neo4j] storage footprint ratio to raw source data

Bo Ferri Mon, 27 Apr 2015 04:34:53 -0700

Hi all,

we are trying to do some MDM in our open-source project [1,2]. Currently, we
do this with Neo4j as the data hub for all content [3]. I am wondering about
the storage footprint that results when we load the content into the
database. Right now, we can process arbitrary XML and CSV data. We transform
the raw content into a neutral data model [4] that is built on top of RDF
(without losing any information from the original source). The footprint
grows dramatically once the data is in the database. Here are some figures:


raw = 30MB
db = 2261MB
index* = 353MB

nodes = ~1.4M
relationships = ~1.9M
properties = ~7.3M

db/raw = 75x
index/raw = 11.76x
index/db = 15.6%
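
For reference, the ratios above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: the variable names are illustrative and not part of our code base, the "bytes per record" figure is a rough average I derive here (not a Neo4j metric), and it assumes 1 MB = 1024 * 1024 bytes and that the index is included in the db size (see the footnote).

```python
# Sizes in MB, as reported above.
raw_mb = 30
db_mb = 2261       # includes the index (see footnote)
index_mb = 353

# Approximate entity counts, as reported above.
nodes = 1.4e6
relationships = 1.9e6
properties = 7.3e6

# Footprint ratios.
db_to_raw = db_mb / raw_mb
index_to_raw = index_mb / raw_mb
index_share = index_mb / db_mb

# Rough average store size per node/relationship/property, index excluded.
# Assumption: 1 MB = 1024 * 1024 bytes.
records = nodes + relationships + properties
bytes_per_record = (db_mb - index_mb) * 1024 * 1024 / records

print(f"db/raw           = {db_to_raw:.1f}x")
print(f"index/raw        = {index_to_raw:.1f}x")
print(f"index/db         = {index_share:.1%}")
print(f"bytes per record = {bytes_per_record:.0f}")
```

The last figure lumps nodes, relationships and properties together, so it only gives an order of magnitude for how much space each stored element takes on average.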

One reason is that we add various metadata that originates in the neutral
data model (where we make use of URIs etc.), plus some metadata for
versioning information (two int values on each relationship). Even so, the
ratio looks really high overall. Can you explain this somehow?

Thanks a lot in advance.

Cheers,


Bo


*) index size is also included in the db size
 

[1] http://dswarm.org
[2] https://github.com/dswarm
[3] https://github.com/dswarm/dswarm-graph-neo4j
[4] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model

