As I mentioned at the DBpedia meetup yesterday, I'd like to discuss the 
motivation for using bz2 as the compression algorithm for the dump files.

bz2 might have the advantage that it's well known, but apart from that it's 
Other compression algorithms (for example xz) compress and decompress faster 
and create smaller files.
So if the main concern is file size and bandwidth for the dump files, then xz 
might be a better choice.

If bandwidth is not so much of a concern, I'd love to see the dumps 
provided as gz files.
The reason is that gzip is also well known, but for stream processing it is 
much closer to the sweet spot of spending a bit of CPU to make total IO 
throughput a lot faster.
With typical hardware, working on gzipped files is actually faster than working 
on uncompressed ones.
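
To illustrate what I mean by stream processing, here is a minimal Python 
sketch that reads a compressed dump line by line without unpacking it to disk 
first. The helper names (open_dump, count_triples) and the file name in the 
usage line are made up for the example, and it assumes a line-based format 
such as N-Triples; only the standard library modules gzip, bz2 and lzma are 
used.

```python
import bz2
import gzip
import lzma

# Standard-library openers keyed by file extension.
OPENERS = {".gz": gzip.open, ".bz2": bz2.open, ".xz": lzma.open}


def open_dump(path):
    """Open a dump file as a text stream, decompressing on the fly.

    Hypothetical helper: picks the decompressor from the file extension
    and falls back to a plain open() for uncompressed files.
    """
    for suffix, opener in OPENERS.items():
        if path.endswith(suffix):
            return opener(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def count_triples(path):
    """Count non-empty, non-comment lines (assumed one triple per line)."""
    count = 0
    with open_dump(path) as lines:
        for line in lines:
            if line.strip() and not line.startswith("#"):
                count += 1
    return count
```

Something like count_triples("some_dump.nt.gz") then works the same way for 
.gz, .bz2 and .xz files, so timing the three formats on the same dump should 
be an easy way to compare them on your own hardware.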


DBpedia-developers mailing list
