Hi, as I mentioned at the DBpedia meetup yesterday, I'd like to discuss the motivation for using bz2 as the compression algorithm for the dump files.
bz2 might have the advantage of being well known, but apart from that it's outdated: other compression algorithms (for example xz) compress and decompress faster and produce smaller files. So if the main concern is file size and bandwidth for the dump files, xz would be a better choice.

If bandwidth is less of a concern, I'd love to see the dumps provided as gz files. The reason is that gzip is also well known, but for stream processing it is much closer to the sweet spot of spending a little CPU to make total I/O throughput a lot faster. On typical hardware, working on gzipped files is actually faster than working on uncompressed ones, because the disk has to read far fewer bytes per unit of useful data.

Cheers,
Jörn

------------------------------------------------------------------------------
_______________________________________________
DBpedia-developers mailing list
DBpedia-developers@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/dbpedia-developers
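(Editor's note: the tradeoff Jörn describes can be measured with Python's standard library, which ships all three codecs under discussion. This is a rough sketch on synthetic, highly repetitive N-Triples-like data, not a benchmark of real DBpedia dumps; actual ratios and timings depend heavily on the input and hardware.)

```python
# Compare gzip, bz2, and xz (lzma) on compressed size and codec time.
# The sample data is synthetic and hypothetical, chosen only to resemble
# the repetitive structure of an RDF dump file.
import bz2
import gzip
import lzma
import time

data = b"<http://dbpedia.org/resource/Example> a <http://www.w3.org/2002/07/owl#Thing> .\n" * 50_000

results = {}
for name, codec in [("gzip", gzip), ("bz2", bz2), ("xz", lzma)]:
    t0 = time.perf_counter()
    blob = codec.compress(data)
    t_comp = time.perf_counter() - t0

    t0 = time.perf_counter()
    restored = codec.decompress(blob)
    t_dec = time.perf_counter() - t0

    assert restored == data  # round trip must be lossless
    results[name] = len(blob)
    print(f"{name:4s} size={len(blob):>9,d} B  "
          f"compress={t_comp:.3f}s  decompress={t_dec:.3f}s")
```

Running this typically shows xz producing the smallest output and gzip being the fastest to decompress, which is the crux of the argument above: xz wins on bandwidth, gzip on stream-processing throughput.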