On 25/03/11 18:25, Frank Budinsky wrote:

Hi Andy and Brian,

Thank you both for your speedy replies! Your input, combined, has enabled
me to get my performance numbers into the right ballpark.

When running Andy's version of the test, I noticed significantly better
performance than I had previously been getting:

Max mem: 1,820M
DIRECT mode
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.sparql.mgt.ARQMgt).
log4j:WARN Please initialize the log4j system properly.
Starting test: Fri Mar 25 08:21:07 PDT 2011
Initial number of indexed graphs: 0
100 at: Fri Mar 25 08:21:11 PDT 2011
200 at: Fri Mar 25 08:21:13 PDT 2011
300 at: Fri Mar 25 08:21:15 PDT 2011
400 at: Fri Mar 25 08:21:17 PDT 2011
1200 at: Fri Mar 25 08:21:34 PDT 2011
...
99700 at: Fri Mar 25 08:55:41 PDT 2011
99800 at: Fri Mar 25 08:55:43 PDT 2011
99900 at: Fri Mar 25 08:55:45 PDT 2011
100000 at: Fri Mar 25 08:55:47 PDT 2011
Done at: Fri Mar 25 08:55:47 PDT 2011
100,000 graphs in 2,080.58 sec

Although much better than the 4 hours I was seeing previously, it's still
3.5 times slower than Andy's numbers. Is there any chance the warning
messages about log4j might affect performance?

No - I can't see any reason why it would make any difference.

Drop a log4j.properties file in the current directory and the warning should go away. TDB, and all my Jena software, plays a game of hunt-the-config-file for "log4j.properties" that includes the current directory.
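
For example, a minimal log4j.properties dropped into the working
directory could look like this (the levels and pattern are only a
suggestion):

    log4j.rootLogger=INFO, stdout
    log4j.appender.stdout=org.apache.log4j.ConsoleAppender
    log4j.appender.stdout.layout=org.apache.log4j.PatternLayout
    log4j.appender.stdout.layout.ConversionPattern=%d{HH:mm:ss} %-5p %c{1} :: %m%n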

Looking at the difference between Andy's and my original test, I see this
call is the only difference:

             SystemTDB.setFileMode(FileMode.direct) ;

What exactly does that do, and should I always be configuring TDB this way?


It forces the file handling to be in the style of the 32-bit version, as
per your original setup. The 64-bit build uses memory-mapped I/O by
default (32-bit mmap I/O is limited to about 1.5G of total file size,
across all files).

If it makes a difference, then the memory mapped I/O is slow on your
machine. This might be explained by the fact it isn't Windows Server. I
guess you are running some anti-virus as well - I have no idea how that
interacts with memory mapped file I/O but it may be bad.
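
For reference, a minimal sketch of forcing direct mode (the database
location is just an example; the call has to happen before the TDB
location is first opened):

    import com.hp.hpl.jena.query.Dataset ;
    import com.hp.hpl.jena.tdb.TDBFactory ;
    import com.hp.hpl.jena.tdb.base.file.FileMode ;
    import com.hp.hpl.jena.tdb.sys.SystemTDB ;

    public class DirectModeExample {
        public static void main(String[] args) {
            // Use in-heap block caching (32-bit style) instead of memory
            // mapped files; set this before any dataset is created.
            SystemTDB.setFileMode(FileMode.direct) ;

            Dataset dataset = TDBFactory.createDataset("DB") ;  // example path
            // ... load and query as usual ...
            dataset.close() ;
        }
    }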

I tried running without this call and, as expected, it seems to be back to
the 4 hour performance:

Max mem: 1,820M

Make the heap size larger for FileMode.direct - maybe 2-3G. You have an
8G machine. It might help GC, though as you can see I was using a 1G
heap. The cache sizes are currently fixed.
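
For example, something like (GraphLoadTest and the classpath are
placeholders for the actual test setup):

    java -Xmx2500M -cp "lib/*;." GraphLoadTest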

Starting test: Fri Mar 25 09:00:50 PDT 2011
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.tdb.info).
log4j:WARN Please initialize the log4j system properly.
Initial number of indexed graphs: 0
100 at: Fri Mar 25 09:00:55 PDT 2011
200 at: Fri Mar 25 09:00:57 PDT 2011
300 at: Fri Mar 25 09:00:59 PDT 2011
400 at: Fri Mar 25 09:01:01 PDT 2011
500 at: Fri Mar 25 09:01:03 PDT 2011
...
39400 at: Fri Mar 25 09:51:23 PDT 2011
39500 at: Fri Mar 25 09:51:38 PDT 2011
39600 at: Fri Mar 25 09:51:52 PDT 2011
39700 at: Fri Mar 25 09:52:06 PDT 2011
39800 at: Fri Mar 25 09:52:21 PDT 2011
...
Done at: About 4 hours from start time

One interesting difference between my slow-running case and Brian's is
that mine always starts out pretty fast but gradually slows down. Notice
that it was taking about 2 seconds per 100 at the start, but was up to
15 seconds per 100 by 39800. I didn't let it run to completion this time,
but I remember from a previous run that it was taking about 2 minutes per
100 at the end. Any idea why it might slow down over time like this when
not using direct mode?

None - memory mapped I/O should be better, not worse. (I haven't used it much on Windows but the server/desktop versions do make a difference as does AV.)

It's as if the machine is OS-swapping as the caches grow, which, given your configuration, really should not be happening.


Anyway, the next thing I tried was to change the code as Brian suggested.
That also had a big impact for me:

Max mem: 1,820M
DIRECT mode
log4j:WARN No appenders could be found for logger
(com.hp.hpl.jena.sparql.mgt.ARQMgt).
log4j:WARN Please initialize the log4j system properly.
Starting test: Fri Mar 25 09:56:03 PDT 2011
Initial number of indexed graphs: 0
100 at: Fri Mar 25 09:56:06 PDT 2011
200 at: Fri Mar 25 09:56:07 PDT 2011
300 at: Fri Mar 25 09:56:09 PDT 2011
400 at: Fri Mar 25 09:56:10 PDT 2011
500 at: Fri Mar 25 09:56:11 PDT 2011
...
99700 at: Fri Mar 25 10:09:27 PDT 2011
99800 at: Fri Mar 25 10:09:27 PDT 2011
99900 at: Fri Mar 25 10:09:28 PDT 2011
100000 at: Fri Mar 25 10:09:29 PDT 2011
Done at: Fri Mar 25 10:09:35 PDT 2011
100,000 graphs in 812.30 sec

Brian's results are rather strange. TDB does nothing special for model.add(Model), so my guess is that it is about timing/processing effects, or merely that the triples come out of the model in a "better" order than they do when streamed from the parser.
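
For anyone following along, the two loading styles being compared look
roughly like this (file name, syntax, and graph URI are placeholders):

    import java.io.FileInputStream ;
    import java.io.InputStream ;

    import com.hp.hpl.jena.query.Dataset ;
    import com.hp.hpl.jena.rdf.model.Model ;
    import com.hp.hpl.jena.rdf.model.ModelFactory ;
    import com.hp.hpl.jena.tdb.TDBFactory ;

    public class LoadStyles {
        public static void main(String[] args) throws Exception {
            Dataset dataset = TDBFactory.createDataset("DB") ;
            Model tdbModel = dataset.getNamedModel("http://example.org/graph1") ;

            // Style 1: stream the parser output straight into the
            // TDB-backed model.
            InputStream in1 = new FileInputStream("data.rdf") ;
            tdbModel.read(in1, null, "RDF/XML") ;
            in1.close() ;

            // Style 2 (Brian's suggestion, as I read it): parse into an
            // in-memory model first, then add the whole model in one call.
            Model mem = ModelFactory.createDefaultModel() ;
            InputStream in2 = new FileInputStream("data.rdf") ;
            mem.read(in2, null, "RDF/XML") ;
            in2.close() ;
            tdbModel.add(mem) ;

            dataset.close() ;
        }
    }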

        Andy


With that change, I'm now only about 30% slower than Andy's number. Maybe
that's attributable to Windows vs Linux or the hardware differences. I'm
running it on:

Intel Xeon E5345 2.33 GHz (2 processors)
8 GB RAM
300 GB HDD
Windows 7 Enterprise SP1

Does anybody know how a Xeon E5345 should compare to i5 or i7 processors,
or how much difference there might be between Linux and Windows 7?

Thanks again for your help.

Frank
