On 06/01/11 21:22, [email protected] wrote:
I've been taking the new tdbloader2 out for a spin with some fairly large 
datasets. In total, I have about 3 billion triples I am trying to load. I have 
87 Turtle files that average around 1-2GB each. I am running the job under 
Ubuntu 10.10 on a quad-core system with 6GB of RAM. The load process runs very 
fast up until about 26M triples, then performance drops sharply from about 100k 
down to about 400, and it eventually runs out of memory.

In which step of the process does it fail? I guess the data phase (phase 1). A trace would be helpful.

I am using TDB 0.8.9.

which was released only a few hours ago :-)

I tried to tweak the memory settings, but that only prolongs the problem. I am 
assuming that 1-2GB files are a likely culprit, but I wanted to be sure.

Not per se - the 3e9 triples total might be the issue. Maybe the Java heap size needs tweaking more carefully.
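
To put the 3e9 total in perspective, here is a rough back-of-the-envelope sketch. It assumes the quoted rates (100k and 400) are triples per second, which the original message does not actually state, and that each rate holds constant for the whole load. The function and variable names are made up for illustration.

```python
# Back-of-the-envelope load-time estimate.
# Assumption: the quoted rates (100k and 400) are triples per second.

TOTAL_TRIPLES = 3_000_000_000  # about 3e9 triples, from the thread


def load_time_hours(total_triples: int, triples_per_second: float) -> float:
    """Return the hours needed to load total_triples at a constant rate."""
    return total_triples / triples_per_second / 3600


fast_hours = load_time_hours(TOTAL_TRIPLES, 100_000)
slow_hours = load_time_hours(TOTAL_TRIPLES, 400)

print(f"at 100k/s: {fast_hours:.1f} hours")      # at 100k/s: 8.3 hours
print(f"at 400/s:  {slow_hours / 24:.1f} days")  # at 400/s:  86.8 days
```

Under those assumptions, the load is an overnight job at the initial rate and impractical once the rate collapses, even if it never ran out of memory.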

Are you loading them as named graphs in some way, or are you loading 87 files into the default graph in one single load operation? If it's the latter, then the parser ends up streaming them together, so one file containing all the triples and 87 separate files end up much the same.
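
The streaming point can be illustrated with a small sketch. This is not how the TDB parser is implemented; it only shows the general idea that a consumer reading files back to back sees the same sequence of statements as one reading a single concatenated file. The helper `stream_statements` and the sample data are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import Iterable, Iterator


def stream_statements(paths: Iterable[Path]) -> Iterator[str]:
    """Yield non-empty lines from each file in turn, as one continuous stream."""
    for path in paths:
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                line = line.strip()
                if line:
                    yield line


# Hypothetical sample data: three N-Triples-style statements.
SAMPLE = [
    "<http://example.org/a> <http://example.org/p> <http://example.org/b> .",
    "<http://example.org/b> <http://example.org/p> <http://example.org/c> .",
    "<http://example.org/c> <http://example.org/p> <http://example.org/a> .",
]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # One file holding every statement.
    single = tmp_dir / "all.nt"
    single.write_text("\n".join(SAMPLE) + "\n", encoding="utf-8")

    # The same statements split across several files.
    parts = []
    for i, statement in enumerate(SAMPLE):
        part = tmp_dir / f"part{i}.nt"
        part.write_text(statement + "\n", encoding="utf-8")
        parts.append(part)

    from_single = list(stream_statements([single]))
    from_parts = list(stream_statements(parts))

# The consumer cannot tell the two layouts apart.
print(from_single == from_parts)  # True
```

The memory used by the reader depends on one line at a time, not on how the input is split into files, which is why file size alone is unlikely to be the culprit.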

Also, does tdbloader2 have a preference for N-Triples over Turtle?

N-Triples is faster: the parser runs faster on N-Triples despite the extra overall number of bytes, and the read I/O is nicely streamed. I'm not exactly sure why it's faster - either the N-Triples loop simply fits into L2 cache better, or it's superclass method overhead, because the Turtle parser shares much of its logic with the TriG parser as a superclass with two concrete classes.

        Andy

Ryan-

Ryan J. McDonough
Architect
Service Platforms
NOKIA INC.

