Last week:
- profiling banach showed some interesting things in the sesame2 memory store. started a discussion on the sesame list on possible ways to improve performance
- created "memento", a short-term fork of the sesame2 memory store and harmony's implementation of HashMap that I'll use for instrumentation and testing. Memento is unlikely to be a long-term effort, as any eventual performance improvements will be donated back to the sesame project.
- profiling shows that I was able to load 2.5M triples in the sesame memory store with 400MB of RAM using Java 6 on WinXP. That is around 160 bytes/triple (including all the indices and JVM typing overhead). This result seems a little poor in terms of pure information theory (since gzip uses 8 bytes/triple on the same dataset) but it's good enough to fit barton in memory on a 64-bit machine with 16GB of RAM (which is a totally affordable hardware solution these days).
- wrote a 'sampler' application that is capable of extracting a subset of a big RDF graph while maintaining the in-degree and out-degree distributions. Basically, instead of sampling random statements, we sample subjects at random and then we get all the statements that belong to those subjects. The implementation is based on a two-pass scan over a GZipped version of the RDF model encoded in NTriples (which is very verbose but for that reason compresses very well with GZip). It scales well and needs little memory (only the list of subjects is kept in memory). Extracting 10000 random subjects from 2.5M triples takes 30 seconds on my laptop.

This week:
- continue the work on profiling memento and report my findings to the sesame project

--
Stefano Mazzocchi
Research Scientist, Digital Libraries Research Group
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave
Cambridge, MA 02139-4307, USA
skype: stefanomazzocchi
email: stefanom at mit . edu
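
The subject-based sampling described in the report can be sketched roughly as follows. This is only an illustration of the two-pass idea, not the actual sampler: it is written in Python for brevity, the names `subject_of` and `sample_by_subject` are hypothetical, and it assumes one statement per line with the subject as the first whitespace-delimited token (true for NTriples, where subjects are IRIs or blank nodes).

```python
import gzip
import random


def subject_of(line):
    """Return the subject term of an NTriples line, or None for blank/comment lines."""
    stripped = line.strip()
    if not stripped or stripped.startswith("#"):
        return None
    # Subjects are IRIs or blank nodes and contain no whitespace,
    # so the subject is the first whitespace-delimited token.
    return stripped.split(None, 1)[0]


def sample_by_subject(src_path, dst_path, n_subjects, seed=None):
    """Copy every statement of n_subjects randomly chosen subjects to dst_path.

    Both files are gzipped NTriples. Returns (chosen_subjects, statements_kept).
    """
    # Pass 1: collect the distinct subjects. This set is the only
    # data structure that grows with the size of the input.
    subjects = set()
    with gzip.open(src_path, "rt", encoding="utf-8") as src:
        for line in src:
            s = subject_of(line)
            if s is not None:
                subjects.add(s)

    # Draw the sample; sorting first makes the draw reproducible for a given seed.
    rng = random.Random(seed)
    k = min(n_subjects, len(subjects))
    chosen = set(rng.sample(sorted(subjects), k))

    # Pass 2: stream the input again and keep the statements
    # whose subject was chosen.
    kept = 0
    with gzip.open(src_path, "rt", encoding="utf-8") as src, \
            gzip.open(dst_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if subject_of(line) in chosen:
                dst.write(line)
                kept += 1
    return chosen, kept
```

A call such as `sample_by_subject("model.nt.gz", "sample.nt.gz", 10000)` (file names are placeholders) would write all statements for 10000 randomly chosen subjects. Because every statement of a chosen subject is kept, each sampled subject retains its full set of outgoing statements, which is the point of sampling subjects rather than individual statements.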
