On 28/01/11 18:17, Mikhail Sogrin wrote:
On Fri, Jan 28, 2011 at 3:36 PM, Andy Seaborne <[email protected]> wrote:
On 28/01/11 13:03, Mikhail Sogrin wrote:
I had a similar issue: not running out of memory, but starting very fast
and then slowing significantly. I was using the same Ubuntu 10.10 32-bit
and loading DBpedia data into a TDB store (using the command-line loader
and the Java API).
Loading the instance_types_en.nt file, which contains just rdf:type triples
(~800 MB), was rather fast (I suppose it had enough memory for file caches),
but labels_en.nt, with only rdfs:label triples (~900 MB), was extremely
slow: it took about 8 hours to complete, averaging ~300 triples/sec. The
slow loading was caused by a lot of disk activity. For input files of
~1.7 GB and a resulting TDB store of ~3 GB there were about 80 GB of disk
writes. Is it normal or expected to have so much disk writing?
Kind regards,
Mikhail
Mikhail,
TDB loading, with tdbloader and tdbloader2, is rather better on 64 bit than
32 bit. On a 32 bit JVM, TDB has to do its own disk caching, and it can
only access 1.5G of RAM (Java limitation). The caching isn't going to be as
sophisticated as the OS can manage; an advantage of 64 bit is that cache
work is devolved to the OS. That said, 300 TPS is unexpectedly slow.
Additionally, if it's a portable, a portable's disk is noticeably slower
than a desktop machine's.
If you can load on a 64 bit machine somewhere, you can just copy the
database onto the 32 bit machine. The file format is portable.
I speak in triple counts: labels_en.nt is about 8M IIRC. I don't know if
the unusual data pattern of all one property has an effect.
TDB databases are relatively uncompressed; tdbloader2 creates smaller ones
than the general purpose loader.
I tried labels_en.nt and it didn't really go very fast to start with so
maybe there is something in the shape of the data. I'll try to find time to
profile it (no promises when) - maybe there is a hotspot I'm not aware of.
Loading it here I got:
291 seconds which is 27K TPS.
(Ubuntu 10.10, 64 bit, desktop)
java version "1.6.0_20"
OpenJDK Runtime Environment (IcedTea6 1.9.4) (6b20-1.9.4-0ubuntu1)
OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
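As a rough cross-check of the two load rates quoted in this thread, the arithmetic can be sketched in Python. The 8M triple count is only the approximate figure given above ("about 8M IIRC"), so the results are estimates; the function name is purely illustrative.

```python
def triples_per_second(triples, seconds):
    """Average load rate in triples per second."""
    return triples / seconds

# Assumption: labels_en.nt holds about 8M triples (approximate, from the thread).
approx_triples = 8_000_000

# 64-bit desktop run: 291 seconds for the whole file, roughly 27.5K TPS.
desktop_tps = triples_per_second(approx_triples, 291)

# 32-bit run: about 8 hours at an average of ~300 TPS implies this many triples.
implied_triples = 300 * 8 * 3600

print(f"desktop rate: {desktop_tps:,.0f} TPS")
print(f"triples implied by 8 h at 300 TPS: {implied_triples:,}")
```

The second figure comes out at about 8.6M triples, which is in line with the "about 8M" estimate, so the two reports describe the same workload at rates that differ by a factor of roughly 90.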
Andy
The disk is an internal laptop 7200 rpm hard disk, and there's 4GB of
memory.
instance_types_en.nt loaded with average 26k TPS.
labels_en.nt started at 4.5k TPS. After 3 minutes of work and approximately
800k triples loaded, the amount of write I/O (as reported by 'htop')
approached 2 GB, which was approximately the amount of memory available for
the OS cache. The size of the created files was ~170 MB at this time. At
this moment there was a sharp drop in loading speed to 2k TPS, and it
continued to decrease after that; CPU usage decreased and hard disk usage
jumped to maximum, so it's clearly a hard disk throughput limitation. The
hard disk cannot perform so many (random?) disk writes. While the cache was
available, the application generated more than 11 MB/s of disk write
requests.
From my rough calculations for this test, it needed to write 2 GB after
loading 800k triples, so that's 2.5 KB of requested disk write I/O per
triple. The file size on disk is about 200 bytes/triple, 12.5 times less
than the requested disk writes (writing and rewriting the same files a dozen
times does not seem right). The first 800k triples in the original
N-Triples file take 95 MB, which is about 120 bytes/triple.
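The rough calculation above can be reproduced with a few lines of Python. This is only a sketch of the arithmetic: it assumes decimal units (1 GB = 1e9 bytes) and treats "3 minutes" as exactly 180 seconds, and all variable names are illustrative.

```python
# Figures quoted in the message (decimal units assumed).
triples_loaded = 800_000
write_io_bytes = 2e9     # write I/O reported by htop after ~3 minutes
store_bytes = 170e6      # size of the TDB files at that point
input_bytes = 95e6       # first 800k triples of the N-Triples file
elapsed_seconds = 180    # assumption: "3 minutes" taken as exactly 180 s

written_per_triple = write_io_bytes / triples_loaded   # requested writes per triple
stored_per_triple = store_bytes / triples_loaded       # on-disk bytes per triple
input_per_triple = input_bytes / triples_loaded        # source bytes per triple

# Ratio of requested writes to bytes actually kept on disk.
amplification_exact = written_per_triple / stored_per_triple
amplification_rounded = written_per_triple / 200       # using the rounded 200 B/triple

write_rate_mb_s = write_io_bytes / 1e6 / elapsed_seconds

print(f"written per triple: {written_per_triple:.0f} bytes")
print(f"stored per triple:  {stored_per_triple:.1f} bytes")
print(f"input per triple:   {input_per_triple:.2f} bytes")
print(f"write amplification: {amplification_exact:.1f}x "
      f"({amplification_rounded:.1f}x with 200 bytes/triple)")
print(f"write request rate: {write_rate_mb_s:.1f} MB/s")
```

With the unrounded 170 MB store size the ratio is a little under 12, and with the rounded 200 bytes/triple it is the 12.5 quoted above; either way each stored byte was written roughly a dozen times.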
tdbloader and tdbloader2 run at the same speed and hit the speed drop at
the same point.
--Mikhail
Mikhail,
I've run tdbloader2 on a small machine (Samsung netbook, Ubuntu 10.10,
32 bit). Looking at 'top' I see that the resident size is only about
0.5G even if I try increasing the internal caches (which requires source
code tweaks currently). It does drop in speed as you saw.
If the resident size is fixed to that sort of size, then maybe it's
paging the code and caches, which is pointlessly bad. I might try
reducing the cache sizes and see what happens.
The script sets the heap to 1200M - that might be a bad idea on a small
memory machine.
Andy