On 28/01/11 15:36, Andy Seaborne wrote:
On 28/01/11 13:03, Mikhail Sogrin wrote:
I had a similar issue: not running out of memory, but starting very fast
and then slowing significantly. I was using the same Ubuntu 10.10 32-bit
and loading DBpedia data into a TDB store (using the command-line loader
and the Java API).
Loading the instance_types_en.nt file, which contains just rdf:type
triples (~800 MB), was rather fast (I suppose there was enough memory for
file caches), but labels_en.nt, with only rdfs:label triples (~900 MB),
was extremely slow: it took about 8 hours to complete, averaging ~300
triples/sec. The slow loading was caused by a lot of disk activity. For
~1.7 GB of input files and a resulting TDB store of ~3 GB, there were
about 80 GB of disk writes. Is it normal or expected to have so much
disk writing?
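To put the figures above in proportion, here is a rough back-of-the-envelope sketch in Python. The function name and its parameters are made up for illustration. It uses only the approximate numbers quoted in this message, so the results are estimates rather than measurements.

```python
def load_summary(hours, triples_per_sec, written_gb, store_gb):
    """Derive rough totals from the approximate figures reported above."""
    seconds = hours * 3600
    return {
        # Triples loaded at the reported average rate.
        "triples": seconds * triples_per_sec,
        # Bytes written to disk per byte of the final TDB store.
        "write_amplification": written_gb / store_gb,
    }


if __name__ == "__main__":
    # ~8 hours at ~300 triples/sec; ~80 GB written for a ~3 GB store.
    summary = load_summary(hours=8, triples_per_sec=300,
                           written_gb=80, store_gb=3)
    print("triples loaded: about %d" % summary["triples"])
    print("write amplification: about %.1fx" % summary["write_amplification"])
```

The 8-hour figure applies to labels_en.nt only, while the 80 GB of writes was reported for both input files together, so the two numbers should not be combined into a single write rate.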
tdbloader2 does do a lot of I/O, but it does it in a streaming fashion,
which is usually cheap: a lot cheaper than random I/O. On a portable,
that may be a bad tradeoff, so try the native tdbloader.
For example, it's slightly faster (a few percent) to parse from
uncompressed N-Triples files than from .nt.gz files. The reduction in
I/O is not enough to compensate for the increased cost of decompressing.
The I/O is typically going to be from a file laid out nicely in disk
order if it was written all at once, so when reading starts, the disk is
pumping bytes through at high speed.
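The compressed-versus-uncompressed comparison can be tried with a small Python sketch like the one below. It is not part of Jena or TDB. The helper names and the synthetic data are invented for illustration, and it only counts lines rather than parsing triples. A freshly written file will normally be served from the OS file cache, so this mostly shows the decompression cost and not the disk I/O side of the tradeoff.

```python
import gzip
import os
import tempfile
import time


def count_lines(path):
    """Count lines, decompressing on the fly if the name ends in .gz."""
    opener = gzip.open if path.endswith(".gz") else open
    start = time.perf_counter()
    with opener(path, "rb") as f:
        n = sum(1 for _ in f)
    return n, time.perf_counter() - start


def make_sample(directory, n_triples=200_000):
    """Write a synthetic N-Triples file and a gzipped copy (made-up data)."""
    plain = os.path.join(directory, "sample.nt")
    packed = plain + ".gz"
    with open(plain, "wb") as out, gzip.open(packed, "wb") as gz:
        for i in range(n_triples):
            line = ('<http://example.org/s%d> '
                    '<http://www.w3.org/2000/01/rdf-schema#label> '
                    '"label %d" .\n' % (i, i)).encode("ascii")
            out.write(line)
            gz.write(line)
    return plain, packed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        plain, packed = make_sample(d)
        for path in (plain, packed):
            n, secs = count_lines(path)
            print("%s: %d lines in %.3f s" % (os.path.basename(path), n, secs))
```

Timings will vary with the machine and the data, so treat any difference as indicative only.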
Andy