Okay, I think the main increase comes from the transformation of the original data (CSV or XML) into our neutral data format (GDM). I did further testing (also with more data) and got a GDM/raw ratio of ~11x, whereas the DB/GDM ratio is 5.4x. We will try to implement URI pre- and post-processing to reduce the size in the graph DB.
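
To illustrate what such URI pre- and post-processing could look like, here is a minimal sketch. It assumes the idea is to replace long namespace URIs with short prefixes before writing to the graph and to expand them again on read. The prefix table and the function names `compact_uri` and `expand_uri` are hypothetical and not taken from the d:swarm code base.

```python
# Hypothetical prefix table; the namespaces actually used in the
# project are not given in this thread.
PREFIXES = {
    "http://www.w3.org/1999/02/22-rdf-syntax-ns#": "rdf",
    "http://example.org/schema/": "ex",
}


def compact_uri(uri, prefixes=PREFIXES):
    """Pre-processing: swap the longest matching namespace for its short prefix."""
    for namespace in sorted(prefixes, key=len, reverse=True):
        if uri.startswith(namespace):
            return prefixes[namespace] + ":" + uri[len(namespace):]
    return uri  # unknown namespace: store the URI unchanged


def expand_uri(value, prefixes=PREFIXES):
    """Post-processing: restore the full URI when reading from the store."""
    prefix, sep, local = value.partition(":")
    if sep:
        for namespace, short in prefixes.items():
            if short == prefix:
                return namespace + local
    return value  # not a compacted value: return as-is


full = "http://example.org/schema/title"
stored = compact_uri(full)           # short form written to the graph
assert expand_uri(stored) == full    # round trip loses no information
print(len(full), "->", len(stored), "characters")
```

How much this saves depends on how often the same namespaces repeat across node and relationship properties, so it would need to be measured on the real data.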
Cheers,

Bo

On Monday, April 27, 2015 at 1:34:30 PM UTC+2, Bo Ferri wrote:
>
> Hi all,
>
> we are trying to do some MDM in our open-source project [1,2]. Currently, we
> do this with the help of Neo4j as data hub for all the content [3]. I'm really
> wondering a bit about the storage footprint that is created when we
> load the content into the database. Right now, we can process arbitrary XML
> and CSV data. We transform the raw content into a neutral data model [4]
> that is based on top of RDF (without losing any information from the
> original source). The footprint ratio is somehow exploding in the database.
> Here are some figures:
>
> raw = 30MB
> db = 2261MB
> index* = 353MB
>
> nodes = ~1.4M
> relationships = ~1.9M
> properties = ~7.3M
>
> db/raw = 75x
> index/raw = 11.76x
> index/db = 15.6%
>
> Some reasons are that we add various metadata that has its origin in the
> neutral data model (where we make use of URIs etc.) and some metadata for
> versioning information (two int values at each relationship). However,
> generally the ratio looks really high. Can you explain this somehow?
>
> Thanks a lot in advance.
>
> Cheers,
>
>
> Bo
>
>
> *) index size is also included in the db size
>
>
> [1] http://dswarm.org
> [2] https://github.com/dswarm
> [3] https://github.com/dswarm/dswarm-graph-neo4j
> [4] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model
>
