Re: Bad Jena TDB performance

Brian McBride Fri, 25 Mar 2011 09:16:05 -0700

Hi Andy, Frank,

On 25/03/2011 14:13, Andy Seaborne wrote:

I ran today:
https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java


[[

I would very much appreciate it if others on this mailing list could also
give it a quick try. I'd like to know if it's just me (and my colleagues),
or is there some kind of pattern to explain this huge difference.
]]

Having a windows 7 box and a couple of spare minutes I tried runningAndy's version of the test code. I got performance akin to what Frankreports.


[[
Max mem: 1,820M
DIRECT mode
> Starting test: Fri Mar 25 16:06:36 GMT 2011
 WARN [main] (SetupTDB.java:755) - No BGP optimizer
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 16:07:17 GMT 2011
200 at: Fri Mar 25 16:08:05 GMT 2011
300 at: Fri Mar 25 16:08:56 GMT 2011
...
]]

However, it runs a lot faster, if instead of reading the input streamdirectly into TDB, I read it into a memory model first and then add thatmodel to a TDB model. Then I get performance similar to that which Andyreports. i.e.


[[
Max mem: 1,820M
DIRECT mode
> Starting test: Fri Mar 25 15:55:48 GMT 2011
 WARN [main] (SetupTDB.java:755) - No BGP optimizer
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 15:55:51 GMT 2011
200 at: Fri Mar 25 15:55:51 GMT 2011
300 at: Fri Mar 25 15:55:52 GMT 2011
...

100000 at: Fri Mar 25 16:04:47 GMT 2011
> Done at: Fri Mar 25 16:04:55 GMT 2011
100,000 graphs in 547.00 sec
]]

Frank, you might try replacing

[[
               InputStream instream = getGraph(i); // the RDF graph to load
                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {

Model model =dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/";+ i);

                    model.read(instream, null);
                    //model.close();
                } finally { dataset.getLock().leaveCriticalSection() ; }

if (++i % 100 == 0) System.out.println(i/100 + "00 at:" + new java.util.Date());

                instream.close();
            }
]]

with
[[
               InputStream instream = getGraph(i); // the RDF graph to load
                Model m = ModelFactory.createDefaultModel();
                m.read(instream, null);

                dataset.getLock().enterCriticalSection(Lock.WRITE);
                try {

Model model =dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/";+ i);

//                    model.read(instream, null);
                    model.add(m);
                    //model.close();
                } finally { dataset.getLock().leaveCriticalSection() ; }

if (++i % 100 == 0) System.out.println(i/100 + "00 at:" + new java.util.Date());

                instream.close();
            }
]]

I am running I think the latest Maven versions of Jena etc on Windows 7on an Intel i7 with 8GB. Tests were run in Eclipse.


Windows 7 SP1
Jena 2.6.4
ARQ 2.8.7
TDB 0.8.9

Hoping this might be helpful,
Brian


in Eclispe
in direct mode.

It has some configuration choices you might like to try.

Max mem: 910M
DIRECT mode
> Starting test: Fri Mar 25 13:57:02 GMT 2011
> Initial number of indexed graphs: 0
100 at: Fri Mar 25 13:57:04 GMT 2011
200 at: Fri Mar 25 13:57:04 GMT 2011
300 at: Fri Mar 25 13:57:05 GMT 2011
400 at: Fri Mar 25 13:57:06 GMT 2011
500 at: Fri Mar 25 13:57:06 GMT 2011
600 at: Fri Mar 25 13:57:07 GMT 2011
700 at: Fri Mar 25 13:57:07 GMT 2011
800 at: Fri Mar 25 13:57:08 GMT 2011
900 at: Fri Mar 25 13:57:08 GMT 2011
1000 at: Fri Mar 25 13:57:09 GMT 2011
....
98000 at: Fri Mar 25 14:06:47 GMT 2011
98100 at: Fri Mar 25 14:06:47 GMT 2011
98200 at: Fri Mar 25 14:06:48 GMT 2011
98300 at: Fri Mar 25 14:06:48 GMT 2011
98400 at: Fri Mar 25 14:06:49 GMT 2011
98500 at: Fri Mar 25 14:06:50 GMT 2011
98600 at: Fri Mar 25 14:06:50 GMT 2011
98700 at: Fri Mar 25 14:06:52 GMT 2011
98800 at: Fri Mar 25 14:06:52 GMT 2011
98900 at: Fri Mar 25 14:06:53 GMT 2011
99000 at: Fri Mar 25 14:06:53 GMT 2011
99100 at: Fri Mar 25 14:06:54 GMT 2011
99200 at: Fri Mar 25 14:06:55 GMT 2011
99300 at: Fri Mar 25 14:06:55 GMT 2011
99400 at: Fri Mar 25 14:06:56 GMT 2011
99500 at: Fri Mar 25 14:06:56 GMT 2011
99600 at: Fri Mar 25 14:06:57 GMT 2011
99700 at: Fri Mar 25 14:06:58 GMT 2011
99800 at: Fri Mar 25 14:06:59 GMT 2011
99900 at: Fri Mar 25 14:07:00 GMT 2011
100000 at: Fri Mar 25 14:07:00 GMT 2011
> Done at: Fri Mar 25 14:07:04 GMT 2011
100,000 graphs in 601.98 sec



On 25/03/11 13:50, Frank Budinsky wrote:


Hi Andy and all,

I finally managed to get a relatively powerful machine set up


Details?

and I reran
the test program I sent you, but unfortunately, it still runs orders of
magnitude slower than the numbers you got when you tried it. Thehardware I
used this time is similar to what you are using, the only significant
difference is that it's running Window's 7 instead of Ubuntu. I knowLinuxis somewhat faster than Windows, but I don't expect we can blameMicrosoft
for a change from 573.87 seconds to about 4 hours :-)
Are you sure that your numbers are correct? How big was the TDBdatabase on
disk at the end of the run?


3.9G    DB1

Do you have any other ideas what may be wrong
with my configuration?


Windows server or desktop? (server is better at I/O)
64 bit?
Is there a virus checker?

I would very much appreciate it if others on this mailing list couldalsogive it a quick try. I'd like to know if it's just me (and mycolleagues),
or is there some kind of pattern to explain this huge difference.

Here is the simple test program again (inlined this time, since Apache
seems to throw away attachments). To run it, just change the TDB_DIR
constant to some empty directory. The test program loads100,000datagraphs(about 100 triples each -> 10M triples total). It prints a messageon the
console at every 100, so if it's taking seconds for each println, you'll
know very quickly that it will take hours to run, instead of a fewminutes,
like Andy has seen.

Thanks,
Frank.

Re: Bad Jena TDB performance

Reply via email to