On Fri, Jan 28, 2011 at 3:36 PM, Andy Seaborne <
[email protected]> wrote:

>
>
> On 28/01/11 13:03, Mikhail Sogrin wrote:
>
>> I had a similar issue: not running out of memory, but starting very fast
>> and then slowing significantly. I was using the same Ubuntu 10.10 32-bit and
>> loading DBpedia data into a TDB store (using the command-line loader and the
>> Java API).
>>
>> Loading the instance_types_en.nt file, containing just rdf:type triples
>> (~800 MB), was rather fast (I suppose it had enough memory for file caches),
>> but labels_en.nt, with only rdfs:label triples (~900 MB), was extremely slow:
>> it took about 8 hours to complete, averaging ~300 triples/sec. The slow
>> loading was caused by a lot of disk activity. For input files of ~1.7 GB and
>> a resulting TDB store of ~3 GB there were about 80 GB of disk writes. Is it
>> normal or expected to have so much disk writing?
>>
>> Kind regards,
>> Mikhail
>>
>
> Mikhail,
>
> TDB loading, with tdbloader and tdbloader2, is rather better on 64 bit than
> 32 bit.  On a 32 bit JVM, TDB has to do its own disk caching, and it can
> only access 1.5G of RAM (Java limitation).  The caching isn't going to be as
> sophisticated as the OS can manage; an advantage of 64 bit is that cache work
> is devolved to the OS.  That said, 300 TPS is unexpectedly slow.
>
> Additionally, if it's a laptop, laptop disks are noticeably slower than
> those in a desktop machine.
>
> If you can load on a 64 bit machine somewhere, you can just copy the
> database onto the 32 bit machine.  The file format is portable.
>
> I speak in triple counts: labels_en.nt is about 8M IIRC.  I don't know if
> the unusual data pattern of all one property has an effect.
>
> TDB databases are relatively uncompressed - tdbloader2 creates smaller ones
> than the general purpose loader.
>
> I tried labels_en.nt and it didn't really go very fast to start with so
> maybe there is something in the shape of the data.  I'll try to find time to
> profile it (no promises when) - maybe there is a hotspot I'm not aware of.
>
> Loading it here I got:
> 291 seconds which is 27K TPS.
>
> (Ubuntu 10.10, 64 bit, desktop)
>
> java version "1.6.0_20"
> OpenJDK Runtime Environment (IcedTea6 1.9.4) (6b20-1.9.4-0ubuntu1)
> OpenJDK 64-Bit Server VM (build 19.0-b09, mixed mode)
>
>        Andy
>

The disk is an internal 7200 rpm laptop hard disk, and there's 4 GB of
memory.

instance_types_en.nt loaded at an average of 26k TPS.
labels_en.nt started at 4.5k TPS. After running for 3 minutes and loading
approximately 800k triples, the amount of write I/O (as reported by 'htop')
approached 2 GB, which was approximately the amount of memory available for
the OS cache. The size of the created files was ~170 MB at this time. At that
point there was a sharp drop in loading speed to 2k TPS, and it continued to
decrease after that: CPU usage decreased and hard disk usage jumped to
maximum, so it's clearly a hard disk throughput limitation. The hard disk
cannot perform so many (random?) disk writes. While the cache was available,
the application generated more than 11 MB/s of disk write requests.

From my rough calculations for this test, it needed to write 2 GB after
loading 800k triples, so that's 2.5 KB of requested disk write I/O per
triple. The file size on disk is about 200 bytes/triple, 12.5 times less
than the requested disk writes (writing and rewriting the same files a dozen
times does not seem right). The first 800k triples in the original N-Triples
file take 95 MB, which is about 120 bytes/triple.
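
The rough arithmetic can be reproduced with a short Python sketch. It uses
only the figures quoted in this message and assumes decimal units (1 GB =
10^9 bytes, 1 MB = 10^6 bytes); the variable names are illustrative only.
Using the measured ~170 MB store size rather than the rounded 200
bytes/triple gives a ratio of roughly 11.8 instead of 12.5.

```python
# Figures as reported in this message; decimal units are an assumption.
TRIPLES_LOADED = 800_000      # triples loaded when the slowdown began
WRITE_IO_BYTES = 2e9          # write I/O reported by htop at that point
STORE_BYTES = 170e6           # size of the TDB files at that point
INPUT_BYTES = 95e6            # first 800k triples of the N-Triples input
ELAPSED_SECONDS = 3 * 60      # about 3 minutes of loading


def per_triple(total_bytes, triples=TRIPLES_LOADED):
    """Average number of bytes attributable to one triple."""
    return total_bytes / triples


written_per_triple = per_triple(WRITE_IO_BYTES)   # requested writes per triple
stored_per_triple = per_triple(STORE_BYTES)       # bytes kept on disk per triple
input_per_triple = per_triple(INPUT_BYTES)        # bytes of input per triple

# Ratio of bytes requested to be written versus bytes that ended up on disk.
write_amplification = WRITE_IO_BYTES / STORE_BYTES

# Average write request rate while the OS cache could still absorb writes.
write_rate_mb_s = WRITE_IO_BYTES / ELAPSED_SECONDS / 1e6

print(f"written per triple:  {written_per_triple:.0f} bytes")
print(f"stored per triple:   {stored_per_triple:.1f} bytes")
print(f"input per triple:    {input_per_triple:.2f} bytes")
print(f"write amplification: {write_amplification:.1f}x")
print(f"write request rate:  {write_rate_mb_s:.1f} MB/s")
```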

tdbloader and tdbloader2 run at the same speed and hit the speed drop at
the same moment.

--Mikhail
