Re: tdbloader2 OutOfMemoryException with large files

Andy Seaborne Thu, 06 Jan 2011 13:43:03 -0800


On 06/01/11 21:22, [email protected] wrote:

I've been taking the new tdbloader2 out for a spin with some fairly large 
datasets. In total, I have about 3 billion triples I am trying to load. I have 
87 Turtle files that average around 1-2GB each. I am running the job under 
Ubuntu 10.10 on a quad-core system with 6GB of RAM. The load process runs very 
fast up until about 26M triples, then performance drops sharply from about 100k 
down to about 400, and then it eventually runs out of memory.

In which step of the process does it fail? I guess the data phase (phase 1). A trace would be helpful.

I am using TDB 0.8.9.


which was released only a few hours ago :-)

I tried to tweak the memory settings, but that only prolongs the problem. I am 
assuming that 1-2GB files are a likely culprit, but I wanted to be sure.

Not per se - the 3e9 triples total might be the issue. Maybe the Java heap size needs tweaking more carefully.
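
To put the reported numbers in perspective, here is a rough back-of-the-envelope calculation in Python. It assumes the "100k" and "400" figures are triples per second (the report does not state the unit) and that each rate stays constant; `load_time_seconds` is just an illustrative helper, not part of TDB.

```python
def load_time_seconds(total_triples, triples_per_second):
    """Time to load total_triples at a constant rate, in seconds."""
    return total_triples / triples_per_second

TOTAL = 3_000_000_000   # about 3 billion triples, from the report
FAST_RATE = 100_000     # assumed unit: triples per second
SLOW_RATE = 400         # assumed unit: triples per second

fast_hours = load_time_seconds(TOTAL, FAST_RATE) / 3600
slow_days = load_time_seconds(TOTAL, SLOW_RATE) / 86_400

print(f"at {FAST_RATE}/s: {fast_hours:.1f} hours")
print(f"at {SLOW_RATE}/s: {slow_days:.1f} days")
```

Under those assumptions the full load takes roughly 8 hours at the initial rate and roughly 87 days at the degraded one, so the slowdown is effectively a stall even before memory runs out.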

Are you loading them as named graphs in some way, or are you loading 87 files into the default graph in one single load operation? If it's the latter, then the parser ends up streaming them together, so 1 file of all the triples or 87 files ends up much the same.
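
The point about streaming files together can be illustrated with a toy sketch in Python. This is not TDB's code: `stream_triples` is a hypothetical helper, and it assumes one triple per line (as in N-Triples) purely to keep the example short.

```python
import io
from itertools import chain

def stream_triples(sources):
    """Yield non-empty, non-comment lines from each source in turn, as one stream."""
    for line in chain.from_iterable(sources):
        line = line.strip()
        if line and not line.startswith("#"):
            yield line

# The same three triples, once as a single input and once split across two inputs.
one_file = [io.StringIO("<a> <p> <b> .\n<c> <p> <d> .\n<e> <p> <f> .\n")]
many_files = [
    io.StringIO("<a> <p> <b> .\n"),
    io.StringIO("<c> <p> <d> .\n<e> <p> <f> .\n"),
]

from_one = list(stream_triples(one_file))
from_many = list(stream_triples(many_files))
print(from_one == from_many)
```

The consumer of the stream sees the same sequence either way, which is why splitting the input into more or fewer files should not by itself change how a single load into the default graph behaves.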

Also, does tdbloader2 have a preference for N-Triples over Turtle?

N-Triples is faster because the parser runs faster on N-Triples despite the extra overall number of bytes, although the read I/O is nicely streamed. I'm not exactly sure why it's faster - either it's simply that the N-Triples loop fits into L2 cache better, or it's because the Turtle parser shares much of its logic with the TriG parser as a superclass with two concrete classes, and it's the superclass method overhead.


        Andy

Ryan-

Ryan J. McDonough
Architect
Service Platforms
NOKIA INC.
