Hi all,

we are working on MDM in our open-source project [1,2]. Currently, we use Neo4j as the data hub for all content [3]. I'm wondering about the storage footprint that results when we load content into the database.

Right now, we can process arbitrary XML and CSV data. We transform the raw content into a neutral data model [4] that is built on top of RDF (without losing any information from the original source). The footprint grows dramatically once the data is in the database. Here are some figures:

raw = 30 MB
db = 2261 MB
index* = 353 MB
nodes = ~1.4M
relationships = ~1.9M
properties = ~7.3M
db/raw = 75x
index/raw = 11.76x
index/db = 15.6%

Some of the overhead comes from metadata that originates in the neutral data model (where we make use of URIs etc.) and from versioning metadata (two int values on each relationship). Even so, the ratio looks really high in general. Can you explain this somehow?

Thanks a lot in advance.

Cheers,
Bo

*) the index size is also included in the db size

[1] http://dswarm.org
[2] https://github.com/dswarm
[3] https://github.com/dswarm/dswarm-graph-neo4j
[4] https://github.com/dswarm/dswarm-documentation/wiki/Graph-Data-Model
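
P.S. For anyone who wants to re-check the ratios, here is a minimal Python sketch that recomputes them from the figures above. The function names are made up for illustration, and the "bytes per record" value is only a rough average of my own: it assumes 1 MB = 1024 * 1024 bytes and simply divides the non-index part of the store by the total number of nodes, relationships and properties. It is not something Neo4j reports.

```python
# Figures copied from the post: sizes in MB, counts are approximate.
RAW_MB = 30
DB_MB = 2261        # total store size, index included
INDEX_MB = 353
NODES = 1_400_000
RELATIONSHIPS = 1_900_000
PROPERTIES = 7_300_000


def footprint_ratios(raw_mb, db_mb, index_mb):
    """Return the three ratios quoted in the post."""
    return {
        "db/raw": db_mb / raw_mb,
        "index/raw": index_mb / raw_mb,
        "index/db": index_mb / db_mb,
    }


def avg_bytes_per_record(db_mb, index_mb, nodes, relationships, properties):
    """Rough average store bytes per node/relationship/property.

    Assumption: 1 MB = 1024 * 1024 bytes, and everything that is not
    index is spread evenly over all records.
    """
    store_bytes = (db_mb - index_mb) * 1024 * 1024
    return store_bytes / (nodes + relationships + properties)


ratios = footprint_ratios(RAW_MB, DB_MB, INDEX_MB)
per_record = avg_bytes_per_record(DB_MB, INDEX_MB, NODES, RELATIONSHIPS, PROPERTIES)

for name, value in ratios.items():
    print(f"{name}: {value:.2f}")
print(f"avg bytes per record (excluding index): {per_record:.0f}")
```

The per-record average is only meant as a coarse sanity check on where the space might be going.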
