Okay,

I think most of the increase comes from transforming the original data 
(CSV or XML) into our neutral data format (GDM). I did further testing 
(also with more data) and got a GDM/raw ratio of ~11x, whereas the DB/GDM 
ratio is 5.4x. We will try to implement URI pre- and post-processing to 
reduce the size in the graph DB.
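
To illustrate the URI idea, here is a minimal sketch in Python (not our actual implementation). It assumes a small, hypothetical prefix table and that no prefix name collides with a URI scheme such as "http". The helper names compact_uri and expand_uri are made up for this example.

```python
# Hypothetical prefix table; the real namespaces would come from the data.
PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://purl.org/dc/terms/": "dct",
}


def compact_uri(uri, prefixes=PREFIXES):
    """Pre-processing: replace the longest matching namespace with 'prefix:'."""
    for namespace in sorted(prefixes, key=len, reverse=True):
        if uri.startswith(namespace):
            return prefixes[namespace] + ":" + uri[len(namespace):]
    # Unknown namespace: store the URI unchanged.
    return uri


def expand_uri(value, prefixes=PREFIXES):
    """Post-processing: restore the full URI from 'prefix:local'."""
    short, sep, local = value.partition(":")
    if sep:
        for namespace, prefix in prefixes.items():
            if prefix == short:
                return namespace + local
    # No known prefix: the value was stored as a full URI.
    return value


full = "http://purl.org/dc/terms/title"
stored = compact_uri(full)      # short form written to the graph DB
restored = expand_uri(stored)   # full URI handed back to the client
```

The saving per value is the namespace length minus the prefix length, so it only pays off if the same namespaces recur across many nodes and properties.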

Cheers,


Bo 


On Monday, April 27, 2015 at 1:34:30 PM UTC+2, Bo Ferri wrote:
>
> Hi all,
>
> we are trying to do some MDM in our open-source project [1,2]. Currently, 
> we do this with the help of Neo4j as a data hub for all the content [3]. 
> I'm wondering about the storage footprint that is created when we load 
> the content into the database. Right now, we can process arbitrary XML 
> and CSV data. We transform the raw content into a neutral data model [4] 
> that is based on RDF (without losing any information from the original 
> source). The footprint explodes once the data is in the database. 
> Here are some figures:
>
> raw = 30MB
> db = 2261MB
> index* = 353MB
>
> nodes = ~1.4M
> relationships = ~1.9M
> properties = ~7.3M
>
> db/raw = 75x
> index/raw = 11.76x
> index/db = 15.6%
>
> Some reasons are that we add various metadata that originates in the 
> neutral data model (where we make use of URIs etc.) and some metadata for 
> versioning information (two int values on each relationship). However, 
> the ratio generally looks really high. Can you explain this?
>
> Thanks a lot in advance.
>
> Cheers,
>
>
> Bo
>
>
> *) index size is also included in db size
>  
>
> [1] http://dswarm.org
> [2] https://github.com/dswarm
> [3] https://github.com/dswarm/dswarm-graph-neo4j
> [4] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model
>
