Re: Bad Jena TDB performance

Brian McBride Mon, 28 Mar 2011 10:03:45 -0700

Hi Frank,

Separating out commentary from data, I've put this in a different message.


Here is what I think is happening:

1. You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDBto go faster.

2. The current released version of Jena/ARQ/TDB does a SYNC at the endof every model.read into a TDB model, so with model.read() you are doing100,000 SYNC's plus one at the end. model.add does not do a SYNC so ifyou use that you are only doing one SYNC and that is why things gofaster that way.

3. If you do both DIRECT mode and model.add you get the benefit of bothfaster TDB performance on 32 bit OS's and avoid 100k SYNCs.

4. You reported loading 100k graphs in 2080 secs in direct mode. Andyreported getting the same in 601 secs. If Andy was running the test onthe current development version (plausible as he is a developer), thatis not doing the 100k extra SYNCs and that is why he got a figure muchcloser to your DIRECT mode + model.add().

4. I'm loading 100k graphs in about 540s. You are taking 812. Couldthat difference be plausibly accounted for by hardware difference plusthe fact I am running a 64 bit OS? Or do you think there is more to it?


Brian


On 28/03/2011 16:04, Frank Budinsky wrote:

Brian, Frank, which versions are you running?

I'm running the same version as Brian.

tbd: 0.8.9
arq: 2.8.7
jena: 2.6.4

I tried disabling anti-virus - no difference.

I tried bumping up the heap to 3G - no difference.

I tried running without setting direct mode, but using Brian's change,
i.e., model.add() instead of model.read() - this runs slower than with both
Brian's change and direct mode, but faster than if I run with direct mode
without Brian's change:

Only direct mode:
100,000 graphs in 2,080.58 sec

Only model.add():
100,000 graphs in 1,246.25 sec

Both direct mode and model.add():
100,000 graphs in 812.30 sec

Brian, have you tried running it without setting direct mode? I'm wondering
if there's a problem with memory mapped I/O on Windows?

Thanks,
Frank.

Andy Seaborne<[email protected]>  wrote on 03/28/2011 04:17:49
AM:

[image removed]

Re: Bad Jena TDB performance

Andy Seaborne

to:

jena-users

03/28/2011 04:18 AM

Please respond to jena-users



On 27/03/11 18:56, Brian McBride wrote:

On 25/03/2011 18:25, Frank Budinsky wrote:
[...]

One interesting difference between my slow running case and Brian's,

is

that mine always starts out pretty fast, but gradually slows down.

Notice

that it was taking about 2 seconds per 100 at the start, but it was up

to

15 seconds at 39800. I didn't let it run to completion, this time, but

remember from a previous run that it was taking about 2 min per 100 at
the
end. Any idea why it might slow down over time, like this, when not

using

direct mode?

Anyway, the next thing I tried, was to change the code has Brian
suggested.
That also had a big impact for me:

I have a hypothesis for why.

When ARP, the RDF/XML parser, finishes reading a file it issues a
finishedRead event to the event handler in the graph it is reading

into.

If the TDB default event handler then did something (like sync to

disk?)

that the memory model does not, this could explain the difference in
performance. I have put a profiler on the test program and it reports
that the test program is spending a lot more time in
BlockManagerFile.force() when it is reading directly in to TDB than

when

it is going via a memory model. So there is some evidence that this is
what is happening.

I haven't been able to track down the block manager code actually in

use

as I'm having trouble checking ARQ out of SVN, but Andy likely knows

off

the top of his head whether this is plausible.

  >  s/block manager/event manager/

Could be - the model.add(model) will go via the BulkUpdateHandler (I
think).  TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler
for insertion.

Could you try putting a break point in dataset.sync and see what the
call stack is when it gets hit?  That'll tell you who is causing the

sync.

There used to be (up to v 0.8.9? not int he last snapshot build) a sync
wrapper that sync() every n'000 triples added.  It's not in the
development codebase.  All hidden implicit syncs should now be removed
   They were causing problems for a user who was tracking whether the DB
on disk was dirty or not.

Brian, Frank, which versions are you running?

  >  as I'm having trouble checking ARQ out of SVN,

What sort of trouble?

    Andy

Brian

Max mem: 1,820M
DIRECT mode
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.sparql.mgt.ARQMgt).
log4j:WARN Please initialize the log4j system properly.

Starting test: Fri Mar 25 09:56:03 PDT 2011
Initial number of indexed graphs: 0

100 at: Fri Mar 25 09:56:06 PDT 2011
200 at: Fri Mar 25 09:56:07 PDT 2011
300 at: Fri Mar 25 09:56:09 PDT 2011
400 at: Fri Mar 25 09:56:10 PDT 2011
500 at: Fri Mar 25 09:56:11 PDT 2011
...
99700 at: Fri Mar 25 10:09:27 PDT 2011
99800 at: Fri Mar 25 10:09:27 PDT 2011
99900 at: Fri Mar 25 10:09:28 PDT 2011
100000 at: Fri Mar 25 10:09:29 PDT 2011

Done at: Fri Mar 25 10:09:35 PDT 2011

100,000 graphs in 812.30 sec

With that change, I'm now only about 30% slower than Andy's number.

Maybe

that's attributable to Windows vs Linux or the hardware differences.

I'm

running it on:

Intel Xeon E5345 233 GHZ (2 processors)
8 GB RAM
300 GB HDD
Windows 7 Enterprise SP1

Does anybody know how a Xeon 5345 should compare to i5 or i7
processors, or
how much difference there might be between Linux vs Windows 7.

Thanks again for your help.

Frank

Brian McBride<[email protected]>  wrote on 03/25/2011 12:15:30 PM:

[image removed]

Re: Bad Jena TDB performance

Brian McBride

to:

jena-users

03/25/2011 12:16 PM

Please respond to jena-users

Hi Andy, Frank,

On 25/03/2011 14:13, Andy Seaborne wrote:

I ran today:

https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/

reports/ReportOutOfMemoryManyGraphsTDB.java
[[

I would very much appreciate it if others on this mailing list could
also
give it a quick try. I'd like to know if it's just me (and my

colleagues),

or is there some kind of pattern to explain this huge difference.
]]


Having a windows 7 box and a couple of spare minutes I tried running
Andy's version of the test code. I got performance akin to what Frank
reports.

[[
Max mem: 1,820M
DIRECT mode

Starting test: Fri Mar 25 16:06:36 GMT 2011

WARN [main] (SetupTDB.java:755) - No BGP optimizer

Initial number of indexed graphs: 0

100 at: Fri Mar 25 16:07:17 GMT 2011
200 at: Fri Mar 25 16:08:05 GMT 2011
300 at: Fri Mar 25 16:08:56 GMT 2011
...
]]

However, it runs a lot faster, if instead of reading the input stream
directly into TDB, I read it into a memory model first and then add

that

model to a TDB model. Then I get performance similar to that which

Andy

reports. i.e.

[[
Max mem: 1,820M
DIRECT mode

Starting test: Fri Mar 25 15:55:48 GMT 2011

WARN [main] (SetupTDB.java:755) - No BGP optimizer

Initial number of indexed graphs: 0

100 at: Fri Mar 25 15:55:51 GMT 2011
200 at: Fri Mar 25 15:55:51 GMT 2011
300 at: Fri Mar 25 15:55:52 GMT 2011
...

100000 at: Fri Mar 25 16:04:47 GMT 2011

Done at: Fri Mar 25 16:04:55 GMT 2011

100,000 graphs in 547.00 sec
]]

Frank, you might try replacing

[[
InputStream instream = getGraph(i); // the RDF graph to

load

dataset.getLock().enterCriticalSection(Lock.WRITE);
try {
Model model =
dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
com.ibm.team.workitem.WorkItem/"
+ i);
model.read(instream, null);
//model.close();
} finally { dataset.getLock().leaveCriticalSection() ; }
if (++i % 100 == 0) System.out.println(i/100 + "00 at:
" + new java.util.Date());
instream.close();
}
]]

with
[[
InputStream instream = getGraph(i); // the RDF graph to

load

Model m = ModelFactory.createDefaultModel();
m.read(instream, null);

dataset.getLock().enterCriticalSection(Lock.WRITE);
try {
Model model =
dataset.getNamedModel("https://jazz.net/jazz/resource/itemName/
com.ibm.team.workitem.WorkItem/"
+ i);
// model.read(instream, null);
model.add(m);
//model.close();
} finally { dataset.getLock().leaveCriticalSection() ; }
if (++i % 100 == 0) System.out.println(i/100 + "00 at:
" + new java.util.Date());
instream.close();
}
]]

I am running I think the latest Maven versions of Jena etc on Windows

on an Intel i7 with 8GB. Tests were run in Eclipse.

Windows 7 SP1
Jena 2.6.4
ARQ 2.8.7
TDB 0.8.9

Hoping this might be helpful,
Brian

in Eclispe
in direct mode.

It has some configuration choices you might like to try.

Max mem: 910M
DIRECT mode

Starting test: Fri Mar 25 13:57:02 GMT 2011
Initial number of indexed graphs: 0

100 at: Fri Mar 25 13:57:04 GMT 2011
200 at: Fri Mar 25 13:57:04 GMT 2011
300 at: Fri Mar 25 13:57:05 GMT 2011
400 at: Fri Mar 25 13:57:06 GMT 2011
500 at: Fri Mar 25 13:57:06 GMT 2011
600 at: Fri Mar 25 13:57:07 GMT 2011
700 at: Fri Mar 25 13:57:07 GMT 2011
800 at: Fri Mar 25 13:57:08 GMT 2011
900 at: Fri Mar 25 13:57:08 GMT 2011
1000 at: Fri Mar 25 13:57:09 GMT 2011
....
98000 at: Fri Mar 25 14:06:47 GMT 2011
98100 at: Fri Mar 25 14:06:47 GMT 2011
98200 at: Fri Mar 25 14:06:48 GMT 2011
98300 at: Fri Mar 25 14:06:48 GMT 2011
98400 at: Fri Mar 25 14:06:49 GMT 2011
98500 at: Fri Mar 25 14:06:50 GMT 2011
98600 at: Fri Mar 25 14:06:50 GMT 2011
98700 at: Fri Mar 25 14:06:52 GMT 2011
98800 at: Fri Mar 25 14:06:52 GMT 2011
98900 at: Fri Mar 25 14:06:53 GMT 2011
99000 at: Fri Mar 25 14:06:53 GMT 2011
99100 at: Fri Mar 25 14:06:54 GMT 2011
99200 at: Fri Mar 25 14:06:55 GMT 2011
99300 at: Fri Mar 25 14:06:55 GMT 2011
99400 at: Fri Mar 25 14:06:56 GMT 2011
99500 at: Fri Mar 25 14:06:56 GMT 2011
99600 at: Fri Mar 25 14:06:57 GMT 2011
99700 at: Fri Mar 25 14:06:58 GMT 2011
99800 at: Fri Mar 25 14:06:59 GMT 2011
99900 at: Fri Mar 25 14:07:00 GMT 2011
100000 at: Fri Mar 25 14:07:00 GMT 2011

Done at: Fri Mar 25 14:07:04 GMT 2011

100,000 graphs in 601.98 sec



On 25/03/11 13:50, Frank Budinsky wrote:

Hi Andy and all,

I finally managed to get a relatively powerful machine set up

Details?

and I reran
the test program I sent you, but unfortunately, it still runs

orders

of

magnitude slower than the numbers you got when you tried it. The
hardware I
used this time is similar to what you are using, the only

significant

difference is that it's running Window's 7 instead of Ubuntu. I

know

Linux
is somewhat faster than Windows, but I don't expect we can blame
Microsoft
for a change from 573.87 seconds to about 4 hours :-)

Are you sure that your numbers are correct? How big was the TDB
database on
disk at the end of the run?

3.9G DB1

Do you have any other ideas what may be wrong
with my configuration?

Windows server or desktop? (server is better at I/O)
64 bit?
Is there a virus checker?

I would very much appreciate it if others on this mailing list

could

also
give it a quick try. I'd like to know if it's just me (and my
colleagues),
or is there some kind of pattern to explain this huge difference.

Here is the simple test program again (inlined this time, since

Apache

seems to throw away attachments). To run it, just change the

TDB_DIR

constant to some empty directory. The test program loads100,000
datagraphs
(about 100 triples each ->  10M triples total). It prints a message
on the
console at every 100, so if it's taking seconds for each println,

you'll

know very quickly that it will take hours to run, instead of a few
minutes,
like Andy has seen.

Thanks,
Frank.

Re: Bad Jena TDB performance

Reply via email to