Last week:

 - profiling banach showed some interesting things in the sesame2 memory
store. started a discussion on the sesame list on possible ways to
improve performance

 - created "memento", a short-term fork of sesame2 memory store and
harmony's implementation of HashMap that I'll use for instrumentation
and testing. Memento is unlikely to be a long-term effort as eventual
performance improvements will be donated back to the sesame project.

 - profiling shows that I was able to load 2.5Mt (million triples) in
the sesame memory store with 400MB of RAM using Java6 on winxp. That is
around 160bytes/triple (including all the indices and JVM typing
overhead). This result seems a little poor in terms of pure information
theory (since gzip uses 8bytes/triple on the same dataset) but it's good
enough to fit barton in memory with a 64bit machine and 16GB of RAM
(which is a totally affordable hardware solution these days).

 - wrote a 'sampler' application that extracts a subset of a big RDF
graph while maintaining the in-degree and out-degree distributions.
Basically, instead of sampling random statements, we sample subjects at
random and then we get all the statements that belong to those
subjects. The implementation makes two passes over a GZipped version of
the RDF model encoded in NTriples (which is very verbose but for that
reason compresses very well with GZip). It scales well and needs little
memory (only the list of subjects is kept in memory). In practice,
extracting 10000 random subjects from 2.5Mt takes 30 seconds on my
laptop.
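
To make the two-pass idea behind the sampler concrete, here is a minimal
sketch. It is not the sampler's actual code: it is written in Python for
brevity, and the names (sample_by_subject, subject_of, is_statement) and
the file paths are hypothetical. It assumes one statement per line, as
in NTriples, with the subject as the first whitespace-delimited token.

```python
import gzip
import random


def is_statement(line):
    # Skip blank lines and N-Triples comment lines.
    stripped = line.strip()
    return bool(stripped) and not stripped.startswith("#")


def subject_of(line):
    # Assumption: the subject is the first whitespace-delimited token.
    return line.split(None, 1)[0]


def sample_by_subject(in_path, out_path, k, seed=None):
    # Pass 1: collect the distinct subjects (the only state kept in memory).
    subjects = set()
    with gzip.open(in_path, "rt", encoding="utf-8") as src:
        for line in src:
            if is_statement(line):
                subjects.add(subject_of(line))

    # Pick k subjects at random (sorted first so a seed is reproducible).
    rng = random.Random(seed)
    chosen = set(rng.sample(sorted(subjects), min(k, len(subjects))))

    # Pass 2: copy every statement whose subject was chosen.
    written = 0
    with gzip.open(in_path, "rt", encoding="utf-8") as src, \
            gzip.open(out_path, "wt", encoding="utf-8") as dst:
        for line in src:
            if is_statement(line) and subject_of(line) in chosen:
                dst.write(line)
                written += 1
    return chosen, written
```

A call would look like sample_by_subject("model.nt.gz", "sample.nt.gz",
10000), where both file names are placeholders. Reading the input twice
costs some I/O, but it means no statements are ever held in memory.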

This week:

 - continue the work on profiling memento and report my findings to the
sesame project

-- 
Stefano Mazzocchi
Digital Libraries Research Group                 Research Scientist
Massachusetts Institute of Technology
E25-131, 77 Massachusetts Ave               skype: stefanomazzocchi
Cambridge, MA  02139-4307, USA         email: stefanom at mit . edu
-------------------------------------------------------------------
