Hi Brian,

Thanks for summarizing the situation so clearly. It all adds up, except for one thing (see below).
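For anyone trying to reproduce the DIRECT mode runs quoted below: the usual switch is to force TDB's direct (non-memory-mapped) file mode before the dataset is opened. I'm writing this from memory, so treat the class and package names as approximate rather than as the exact TDB 0.8.9 API:

[[
// Force TDB's direct (non-memory-mapped) file mode before any dataset
// is created. Names recalled from memory -- approximate, not verified.
import com.hp.hpl.jena.tdb.base.file.FileMode;
import com.hp.hpl.jena.tdb.sys.SystemTDB;

public class ForceDirectMode {
    public static void main(String[] args) {
        SystemTDB.setFileMode(FileMode.direct);  // must run before TDBFactory is used
        // ... create the TDB dataset and run the load as usual from here.
    }
}
]]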
Brian McBride <[email protected]> wrote on 03/28/2011 01:03:13 PM: > [image removed] > > Re: Bad Jena TDB performance > > Brian McBride > > to: > > jena-users > > 03/28/2011 01:03 PM > > Please respond to jena-users > > Hi Frank, > > Separating out commentary from data, I've put this in a different message. > > Here is what I think is happening: > > 1. You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDB > to go faster. It is really behaving like this is the case, only it's not. I'm running on a 64 bit OS. Since you are also running on Windows 7 64 bit and are not seeing the same slow behavior in non-DIRECT mode, it seems like the issue must be something specific with my configuration and not a more general Windows 7 problem. > > 2. The current released version of Jena/ARQ/TDB does a SYNC at the end > of every model.read into a TDB model, so with model.read() you are doing > 100,000 SYNC's plus one at the end. model.add does not do a SYNC so if > you use that you are only doing one SYNC and that is why things go > faster that way. > > 3. If you do both DIRECT mode and model.add you get the benefit of both > faster TDB performance on 32 bit OS's and avoid 100k SYNCs. > > 4. You reported loading 100k graphs in 2080 secs in direct mode. Andy > reported getting the same in 601 secs. If Andy was running the test on > the current development version (plausible as he is a developer), that > is not doing the 100k extra SYNCs and that is why he got a figure much > closer to your DIRECT mode + model.add(). > > 4. I'm loading 100k graphs in about 540s. You are taking 812. Could > that difference be plausibly accounted for by hardware difference plus > the fact I am running a 64 bit OS? Or do you think there is more to it? I think the difference (540 vs 812) probably can be explained by the hardware difference. Thanks again for all your help with this. Frank. > > Brian > > > On 28/03/2011 16:04, Frank Budinsky wrote: > >> Brian, Frank, which versions are you running? > > I'm running the same version as Brian. > > > > tbd: 0.8.9 > > arq: 2.8.7 > > jena: 2.6.4 > > > > I tried disabling anti-virus - no difference. > > > > I tried bumping up the heap to 3G - no difference. > > > > I tried running without setting direct mode, but using Brian's change, > > i.e., model.add() instead of model.read() - this runs slower than with both > > Brian's change and direct mode, but faster than if I run with direct mode > > without Brian's change: > > > > Only direct mode: > > 100,000 graphs in 2,080.58 sec > > > > Only model.add(): > > 100,000 graphs in 1,246.25 sec > > > > Both direct mode and model.add(): > > 100,000 graphs in 812.30 sec > > > > Brian, have you tried running it without setting direct mode? I'm wondering > > if there's a problem with memory mapped I/O on Windows? > > > > Thanks, > > Frank. > > > > Andy Seaborne<[email protected]> wrote on 03/28/2011 04:17:49 > > AM: > > > >> [image removed] > >> > >> Re: Bad Jena TDB performance > >> > >> Andy Seaborne > >> > >> to: > >> > >> jena-users > >> > >> 03/28/2011 04:18 AM > >> > >> Please respond to jena-users > >> > >> > >> > >> On 27/03/11 18:56, Brian McBride wrote: > >>> On 25/03/2011 18:25, Frank Budinsky wrote: > >>> [...] > >>>> One interesting difference between my slow running case and Brian's, > > is > >>>> that mine always starts out pretty fast, but gradually slows down. > > Notice > >>>> that it was taking about 2 seconds per 100 at the start, but it was up > > to > >>>> 15 seconds at 39800. 
> Brian
>
> On 28/03/2011 16:04, Frank Budinsky wrote:
> >> Brian, Frank, which versions are you running?
> >
> > I'm running the same version as Brian.
> >
> > tdb: 0.8.9
> > arq: 2.8.7
> > jena: 2.6.4
> >
> > I tried disabling anti-virus - no difference.
> >
> > I tried bumping up the heap to 3G - no difference.
> >
> > I tried running without setting direct mode, but using Brian's change,
> > i.e., model.add() instead of model.read() - this runs slower than with
> > both Brian's change and direct mode, but faster than if I run with
> > direct mode without Brian's change:
> >
> > Only direct mode:
> > 100,000 graphs in 2,080.58 sec
> >
> > Only model.add():
> > 100,000 graphs in 1,246.25 sec
> >
> > Both direct mode and model.add():
> > 100,000 graphs in 812.30 sec
> >
> > Brian, have you tried running it without setting direct mode? I'm
> > wondering if there's a problem with memory mapped I/O on Windows?
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne <[email protected]> wrote on 03/28/2011 04:17:49 AM:
> >
> >> On 27/03/11 18:56, Brian McBride wrote:
> >>> On 25/03/2011 18:25, Frank Budinsky wrote:
> >>> [...]
> >>>> One interesting difference between my slow running case and Brian's is
> >>>> that mine always starts out pretty fast, but gradually slows down.
> >>>> Notice that it was taking about 2 seconds per 100 at the start, but it
> >>>> was up to 15 seconds at 39800. I didn't let it run to completion this
> >>>> time, but I remember from a previous run that it was taking about 2 min
> >>>> per 100 at the end. Any idea why it might slow down over time, like
> >>>> this, when not using direct mode?
> >>>>
> >>>> Anyway, the next thing I tried was to change the code as Brian
> >>>> suggested. That also had a big impact for me:
> >>>
> >>> I have a hypothesis for why.
> >>>
> >>> When ARP, the RDF/XML parser, finishes reading a file it issues a
> >>> finishedRead event to the event handler in the graph it is reading into.
> >>> If the TDB default event handler then did something (like sync to disk?)
> >>> that the memory model does not, this could explain the difference in
> >>> performance. I have put a profiler on the test program and it reports
> >>> that the test program is spending a lot more time in
> >>> BlockManagerFile.force() when it is reading directly into TDB than when
> >>> it is going via a memory model. So there is some evidence that this is
> >>> what is happening.
> >>>
> >>> I haven't been able to track down the block manager code actually in use
> >>> as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
> >>> the top of his head whether this is plausible.
> >>
> >> s/block manager/event manager/
> >>
> >> Could be - the model.add(model) will go via the BulkUpdateHandler (I
> >> think). TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler
> >> for insertion.
> >>
> >> Could you try putting a break point in dataset.sync and see what the
> >> call stack is when it gets hit? That'll tell you who is causing the sync.
> >>
> >> There used to be (up to v 0.8.9? not in the last snapshot build) a sync
> >> wrapper that called sync() every n'000 triples added. It's not in the
> >> development codebase. All hidden implicit syncs should now be removed.
> >> They were causing problems for a user who was tracking whether the DB
> >> on disk was dirty or not.
> >>
> >> Brian, Frank, which versions are you running?
> >>
> >> > as I'm having trouble checking ARQ out of SVN,
> >>
> >> What sort of trouble?
> >>
> >> Andy
> >>
> >>> Brian
> >>>
> >>>> Max mem: 1,820M
> >>>> DIRECT mode
> >>>> log4j:WARN No appenders could be found for logger
> >>>> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> >>>> log4j:WARN Please initialize the log4j system properly.
> >>>> > Starting test: Fri Mar 25 09:56:03 PDT 2011
> >>>> > Initial number of indexed graphs: 0
> >>>> 100 at: Fri Mar 25 09:56:06 PDT 2011
> >>>> 200 at: Fri Mar 25 09:56:07 PDT 2011
> >>>> 300 at: Fri Mar 25 09:56:09 PDT 2011
> >>>> 400 at: Fri Mar 25 09:56:10 PDT 2011
> >>>> 500 at: Fri Mar 25 09:56:11 PDT 2011
> >>>> ...
> >>>> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> >>>> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> >>>> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> >>>> 100000 at: Fri Mar 25 10:09:29 PDT 2011
> >>>> > Done at: Fri Mar 25 10:09:35 PDT 2011
> >>>> 100,000 graphs in 812.30 sec
> >>>>
> >>>> With that change, I'm now only about 30% slower than Andy's number.
> >>>> Maybe that's attributable to Windows vs Linux or the hardware
> >>>> differences. I'm running it on:
> >>>>
> >>>> Intel Xeon E5345 2.33 GHz (2 processors)
> >>>> 8 GB RAM
> >>>> 300 GB HDD
> >>>> Windows 7 Enterprise SP1
> >>>>
> >>>> Does anybody know how a Xeon E5345 should compare to i5 or i7
> >>>> processors, or how much difference there might be between Linux and
> >>>> Windows 7?
> >>>>
> >>>> Thanks again for your help.
> >>>> Frank
> >>>>
> >>>> Brian McBride <[email protected]> wrote on 03/25/2011 12:15:30 PM:
> >>>>
> >>>>> Hi Andy, Frank,
> >>>>>
> >>>>> On 25/03/2011 14:13, Andy Seaborne wrote:
> >>>>>> I ran today:
> >>>>>>
> >>>>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java
> >>>>>
> >>>>> [[
> >>>>> I would very much appreciate it if others on this mailing list could
> >>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>> difference.
> >>>>> ]]
> >>>>>
> >>>>> Having a Windows 7 box and a couple of spare minutes, I tried running
> >>>>> Andy's version of the test code. I got performance akin to what Frank
> >>>>> reports.
> >>>>>
> >>>>> [[
> >>>>> Max mem: 1,820M
> >>>>> DIRECT mode
> >>>>> > Starting test: Fri Mar 25 16:06:36 GMT 2011
> >>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
> >>>>> > Initial number of indexed graphs: 0
> >>>>> 100 at: Fri Mar 25 16:07:17 GMT 2011
> >>>>> 200 at: Fri Mar 25 16:08:05 GMT 2011
> >>>>> 300 at: Fri Mar 25 16:08:56 GMT 2011
> >>>>> ...
> >>>>> ]]
> >>>>>
> >>>>> However, it runs a lot faster if, instead of reading the input stream
> >>>>> directly into TDB, I read it into a memory model first and then add
> >>>>> that model to a TDB model. Then I get performance similar to what Andy
> >>>>> reports, i.e.
> >>>>>
> >>>>> [[
> >>>>> Max mem: 1,820M
> >>>>> DIRECT mode
> >>>>> > Starting test: Fri Mar 25 15:55:48 GMT 2011
> >>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
> >>>>> > Initial number of indexed graphs: 0
> >>>>> 100 at: Fri Mar 25 15:55:51 GMT 2011
> >>>>> 200 at: Fri Mar 25 15:55:51 GMT 2011
> >>>>> 300 at: Fri Mar 25 15:55:52 GMT 2011
> >>>>> ...
> >>>>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
> >>>>> > Done at: Fri Mar 25 16:04:55 GMT 2011
> >>>>> 100,000 graphs in 547.00 sec
> >>>>> ]]
> >>>>>
> >>>>> Frank, you might try replacing
> >>>>>
> >>>>> [[
> >>>>> InputStream instream = getGraph(i); // the RDF graph to load
> >>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>>> try {
> >>>>>     Model model = dataset.getNamedModel(
> >>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>>>     model.read(instream, null);
> >>>>>     //model.close();
> >>>>> } finally { dataset.getLock().leaveCriticalSection(); }
> >>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>>> instream.close();
> >>>>> }
> >>>>> ]]
> >>>>>
> >>>>> with
> >>>>>
> >>>>> [[
> >>>>> InputStream instream = getGraph(i); // the RDF graph to load
> >>>>> Model m = ModelFactory.createDefaultModel();
> >>>>> m.read(instream, null);
> >>>>>
> >>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>>> try {
> >>>>>     Model model = dataset.getNamedModel(
> >>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>>>     // model.read(instream, null);
> >>>>>     model.add(m);
> >>>>>     //model.close();
> >>>>> } finally { dataset.getLock().leaveCriticalSection(); }
> >>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>>> instream.close();
> >>>>> }
> >>>>> ]]
> >>>>>
> >>>>> I am running, I think, the latest Maven versions of Jena etc. on
> >>>>> Windows 7 on an Intel i7 with 8GB. Tests were run in Eclipse.
> >>>>>
> >>>>> Windows 7 SP1
> >>>>> Jena 2.6.4
> >>>>> ARQ 2.8.7
> >>>>> TDB 0.8.9
> >>>>>
> >>>>> Hoping this might be helpful,
> >>>>> Brian
> >>>>>
> >>>>>> in Eclipse
> >>>>>> in direct mode.
> >>>>>>
> >>>>>> It has some configuration choices you might like to try.
> >>>>>>
> >>>>>> Max mem: 910M
> >>>>>> DIRECT mode
> >>>>>> > Starting test: Fri Mar 25 13:57:02 GMT 2011
> >>>>>> > Initial number of indexed graphs: 0
> >>>>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
> >>>>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
> >>>>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
> >>>>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
> >>>>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
> >>>>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
> >>>>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
> >>>>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
> >>>>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
> >>>>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
> >>>>>> ....
> >>>>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
> >>>>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
> >>>>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
> >>>>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
> >>>>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
> >>>>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
> >>>>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
> >>>>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
> >>>>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
> >>>>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
> >>>>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
> >>>>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
> >>>>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
> >>>>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
> >>>>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
> >>>>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
> >>>>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
> >>>>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
> >>>>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
> >>>>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>>> > Done at: Fri Mar 25 14:07:04 GMT 2011
> >>>>>> 100,000 graphs in 601.98 sec
> >>>>>>
> >>>>>> On 25/03/11 13:50, Frank Budinsky wrote:
> >>>>>>> Hi Andy and all,
> >>>>>>>
> >>>>>>> I finally managed to get a relatively powerful machine set up
> >>>>>> Details?
> >>>>>>
> >>>>>>> and I reran the test program I sent you, but unfortunately, it still
> >>>>>>> runs orders of magnitude slower than the numbers you got when you
> >>>>>>> tried it. The hardware I used this time is similar to what you are
> >>>>>>> using; the only significant difference is that it's running Windows 7
> >>>>>>> instead of Ubuntu. I know Linux is somewhat faster than Windows, but I
> >>>>>>> don't expect we can blame Microsoft for a change from 573.87 seconds
> >>>>>>> to about 4 hours :-)
> >>>>>>>
> >>>>>>> Are you sure that your numbers are correct? How big was the TDB
> >>>>>>> database on disk at the end of the run?
> >>>>>> 3.9G DB1
> >>>>>>
> >>>>>>> Do you have any other ideas what may be wrong with my configuration?
> >>>>>> Windows server or desktop? (server is better at I/O)
> >>>>>> 64 bit?
> >>>>>> Is there a virus checker?
> >>>>>>
> >>>>>>> I would very much appreciate it if others on this mailing list could
> >>>>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>>>> difference.
> >>>>>>>
> >>>>>>> Here is the simple test program again (inlined this time, since Apache
> >>>>>>> seems to throw away attachments). To run it, just change the TDB_DIR
> >>>>>>> constant to some empty directory. The test program loads 100,000
> >>>>>>> datagraphs (about 100 triples each -> 10M triples total). It prints a
> >>>>>>> message on the console at every 100, so if it's taking seconds for each
> >>>>>>> println, you'll know very quickly that it will take hours to run,
> >>>>>>> instead of a few minutes, like Andy has seen.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Frank.
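P.S. For anyone who wants to try the test without our data: getGraph(i) in the code above just returns an InputStream holding one RDF/XML graph of roughly 100 triples. A stand-in along the following lines (entirely made up - it is not the real work item data) produces input of about the right shape:

[[
// Hypothetical stand-in for getGraph(i): builds an in-memory model of
// about 100 triples and returns it serialized as RDF/XML, which is what
// model.read(instream, null) expects. Not the real test data.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class FakeGraphSource {
    static final String NS = "http://example.org/fake#";   // made-up namespace

    public static InputStream getGraph(int i) {
        Model m = ModelFactory.createDefaultModel();
        Resource item = m.createResource(NS + "item" + i);
        for (int p = 0; p < 100; p++) {                     // ~100 triples per graph
            item.addProperty(m.createProperty(NS, "p" + p), "value-" + i + "-" + p);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        m.write(out, "RDF/XML");
        return new ByteArrayInputStream(out.toByteArray());
    }
}
]]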
