Re: tdbloader2 OutOfMemoryException with large files

Andy Seaborne Fri, 28 Jan 2011 08:00:06 -0800


On 28/01/11 15:36, Andy Seaborne wrote:



On 28/01/11 13:03, Mikhail Sogrin wrote:

I had a similar issue: not running out of memory, but starting very fast
and then slowing significantly. I was using the same Ubuntu 10.10 32-bit
and loading DBpedia data into a TDB store (using the command-line loader
and the Java API).

Loading the instance_types_en.nt file, containing just rdf:type triples
(~800 MB), was rather fast (I suppose it had enough memory for file
caches), but labels_en.nt, with only rdfs:label triples (~900 MB), was
extremely slow: it took about 8 hours to complete, at an average of
~300 triples/sec. That slow loading was caused by a lot of disk activity.
For input files of ~1.7 GB and a resulting TDB store of ~3 GB there were
about 80 GB of disk writes. Is it normal or expected to have so much
disk writing?

tdbloader2 does do a lot of I/O. It does it in a streaming fashion, and
usually this is cheap, a lot cheaper than random I/O. On a portable,
that may be a bad tradeoff: try the native tdbloader.

For example, it's slightly faster (a few percent) to parse from
uncompressed N-Triples files than from .nt.gz files. The reduced I/O is
not enough to compensate for the increased cost of decompressing: the
I/O is typically going to be from a file laid out nicely in disk order,
if it was written all at once. So when reading starts, the disk is
pumping bytes through at high speed.
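
If you want to check the compressed-versus-uncompressed tradeoff on your
own machine, something like the rough Python sketch below may help. It is
only an illustration, not how TDB or the loaders read files: it writes
synthetic N-Triples-style lines (the example.org subjects, file names and
function names are all made up for this sketch) and times a plain
sequential read against a gzip read. It measures line reading only, not
RDF parsing, and a freshly written file will probably still be in the OS
file cache, so treat the numbers as indicative at best.

```python
import gzip
import os
import tempfile
import time


def count_lines(path):
    """Stream a file line by line, gunzipping on the fly for *.gz input."""
    opener = gzip.open if path.endswith(".gz") else open
    n = 0
    with opener(path, "rb") as f:
        for _ in f:
            n += 1
    return n


def timed_count(path):
    """Return (line count, elapsed seconds) for one sequential pass."""
    start = time.perf_counter()
    n = count_lines(path)
    return n, time.perf_counter() - start


def make_sample(directory, triples=200_000):
    """Write the same synthetic data as a plain and a gzipped file.

    The subjects and labels are invented; only the line shape resembles
    an N-Triples labels file.
    """
    plain = os.path.join(directory, "sample.nt")
    packed = plain + ".gz"
    with open(plain, "wb") as out, gzip.open(packed, "wb") as gz:
        for i in range(triples):
            line = (
                f"<http://example.org/s{i}> "
                f"<http://www.w3.org/2000/01/rdf-schema#label> "
                f'"label {i}"@en .\n'
            ).encode("utf-8")
            out.write(line)
            gz.write(line)
    return plain, packed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        plain, packed = make_sample(d)
        for path in (plain, packed):
            n, secs = timed_count(path)
            size_mb = os.path.getsize(path) / 1e6
            name = os.path.basename(path)
            print(f"{name}: {n} lines, {size_mb:.1f} MB, {secs:.3f} s")
```

The gzipped file is smaller on disk, so any extra time on that pass is
the decompression cost described above.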


        Andy
