Hi Brian,

Thanks for summarizing the situation so clearly. It all adds up, except for one thing (see below).
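For anyone trying to reproduce the DIRECT mode runs quoted below: the usual switch is to force TDB's direct (non-memory-mapped) file mode before the dataset is opened. I'm writing this from memory, so treat the class and package names as approximate rather than as the exact TDB 0.8.9 API:

[[
// Force TDB's direct (non-memory-mapped) file mode before any dataset
// is created. Names recalled from memory -- approximate, not verified.
import com.hp.hpl.jena.tdb.base.file.FileMode;
import com.hp.hpl.jena.tdb.sys.SystemTDB;

public class ForceDirectMode {
    public static void main(String[] args) {
        SystemTDB.setFileMode(FileMode.direct);  // must run before TDBFactory is used
        // ... create the TDB dataset and run the load as usual from here.
    }
}
]]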
Brian McBride <[email protected]> wrote on 03/28/2011 01:03:13 PM: > [image removed] > > Re: Bad Jena TDB performance > > Brian McBride > > to: > > jena-users > > 03/28/2011 01:03 PM > > Please respond to jena-users > > Hi Frank, > > Separating out commentary from data, I've put this in a different message. > > Here is what I think is happening: > > 1. You are on a 32 bit OS, and on a 32 bit OS DIRECT mode enables TDB > to go faster. It is really behaving like this is the case, only it's not. I'm running on a 64 bit OS. Since you are also running on Windows 7 64 bit and are not seeing the same slow behavior in non-DIRECT mode, it seems like the issue must be something specific with my configuration and not a more general Windows 7 problem. > > 2. The current released version of Jena/ARQ/TDB does a SYNC at the end > of every model.read into a TDB model, so with model.read() you are doing > 100,000 SYNC's plus one at the end. model.add does not do a SYNC so if > you use that you are only doing one SYNC and that is why things go > faster that way. > > 3. If you do both DIRECT mode and model.add you get the benefit of both > faster TDB performance on 32 bit OS's and avoid 100k SYNCs. > > 4. You reported loading 100k graphs in 2080 secs in direct mode. Andy > reported getting the same in 601 secs. If Andy was running the test on > the current development version (plausible as he is a developer), that > is not doing the 100k extra SYNCs and that is why he got a figure much > closer to your DIRECT mode + model.add(). > > 4. I'm loading 100k graphs in about 540s. You are taking 812. Could > that difference be plausibly accounted for by hardware difference plus > the fact I am running a 64 bit OS? Or do you think there is more to it? I think the difference (540 vs 812) probably can be explained by the hardware difference. Thanks again for all your help with this. Frank. > > Brian > > > On 28/03/2011 16:04, Frank Budinsky wrote: > >> Brian, Frank, which versions are you running? > > I'm running the same version as Brian. > > > > tbd: 0.8.9 > > arq: 2.8.7 > > jena: 2.6.4 > > > > I tried disabling anti-virus - no difference. > > > > I tried bumping up the heap to 3G - no difference. > > > > I tried running without setting direct mode, but using Brian's change, > > i.e., model.add() instead of model.read() - this runs slower than with both > > Brian's change and direct mode, but faster than if I run with direct mode > > without Brian's change: > > > > Only direct mode: > > 100,000 graphs in 2,080.58 sec > > > > Only model.add(): > > 100,000 graphs in 1,246.25 sec > > > > Both direct mode and model.add(): > > 100,000 graphs in 812.30 sec > > > > Brian, have you tried running it without setting direct mode? I'm wondering > > if there's a problem with memory mapped I/O on Windows? > > > > Thanks, > > Frank. > > > > Andy Seaborne<[email protected]> wrote on 03/28/2011 04:17:49 > > AM: > > > >> [image removed] > >> > >> Re: Bad Jena TDB performance > >> > >> Andy Seaborne > >> > >> to: > >> > >> jena-users > >> > >> 03/28/2011 04:18 AM > >> > >> Please respond to jena-users > >> > >> > >> > >> On 27/03/11 18:56, Brian McBride wrote: > >>> On 25/03/2011 18:25, Frank Budinsky wrote: > >>> [...] > >>>> One interesting difference between my slow running case and Brian's, > > is > >>>> that mine always starts out pretty fast, but gradually slows down. > > Notice > >>>> that it was taking about 2 seconds per 100 at the start, but it was up > > to > >>>> 15 seconds at 39800. 
> Brian
>
> On 28/03/2011 16:04, Frank Budinsky wrote:
> >> Brian, Frank, which versions are you running?
> >
> > I'm running the same version as Brian.
> >
> > tdb: 0.8.9
> > arq: 2.8.7
> > jena: 2.6.4
> >
> > I tried disabling anti-virus - no difference.
> >
> > I tried bumping up the heap to 3G - no difference.
> >
> > I tried running without setting direct mode, but using Brian's change,
> > i.e., model.add() instead of model.read() - this runs slower than with
> > both Brian's change and direct mode, but faster than if I run with
> > direct mode without Brian's change:
> >
> > Only direct mode:
> > 100,000 graphs in 2,080.58 sec
> >
> > Only model.add():
> > 100,000 graphs in 1,246.25 sec
> >
> > Both direct mode and model.add():
> > 100,000 graphs in 812.30 sec
> >
> > Brian, have you tried running it without setting direct mode? I'm
> > wondering if there's a problem with memory mapped I/O on Windows?
> >
> > Thanks,
> > Frank.
> >
> > Andy Seaborne <[email protected]> wrote on 03/28/2011 04:17:49 AM:
> >
> >> On 27/03/11 18:56, Brian McBride wrote:
> >>> On 25/03/2011 18:25, Frank Budinsky wrote:
> >>> [...]
> >>>> One interesting difference between my slow running case and Brian's is
> >>>> that mine always starts out pretty fast, but gradually slows down.
> >>>> Notice that it was taking about 2 seconds per 100 at the start, but it
> >>>> was up to 15 seconds at 39800. I didn't let it run to completion this
> >>>> time, but I remember from a previous run that it was taking about 2 min
> >>>> per 100 at the end. Any idea why it might slow down over time, like
> >>>> this, when not using direct mode?
> >>>>
> >>>> Anyway, the next thing I tried was to change the code as Brian
> >>>> suggested. That also had a big impact for me:
> >>>
> >>> I have a hypothesis for why.
> >>>
> >>> When ARP, the RDF/XML parser, finishes reading a file it issues a
> >>> finishedRead event to the event handler in the graph it is reading into.
> >>> If the TDB default event handler then did something (like sync to disk?)
> >>> that the memory model does not, this could explain the difference in
> >>> performance. I have put a profiler on the test program and it reports
> >>> that the test program is spending a lot more time in
> >>> BlockManagerFile.force() when it is reading directly into TDB than when
> >>> it is going via a memory model. So there is some evidence that this is
> >>> what is happening.
> >>>
> >>> I haven't been able to track down the block manager code actually in use
> >>> as I'm having trouble checking ARQ out of SVN, but Andy likely knows off
> >>> the top of his head whether this is plausible.
> >>
> >> s/block manager/event manager/
> >>
> >> Could be - the model.add(model) will go via the BulkUpdateHandler (I
> >> think). TDB's BulkUpdateHandler inherits from SimpleBulkUpdateHandler
> >> for insertion.
> >>
> >> Could you try putting a break point in dataset.sync and see what the
> >> call stack is when it gets hit? That'll tell you who is causing the sync.
> >>
> >> There used to be (up to v 0.8.9? not in the last snapshot build) a sync
> >> wrapper that called sync() every n'000 triples added. It's not in the
> >> development codebase. All hidden implicit syncs should now be removed.
> >> They were causing problems for a user who was tracking whether the DB
> >> on disk was dirty or not.
> >>
> >> Brian, Frank, which versions are you running?
> >>
> >> > as I'm having trouble checking ARQ out of SVN,
> >>
> >> What sort of trouble?
> >>
> >> Andy
> >>
> >>> Brian
> >>>
> >>>> Max mem: 1,820M
> >>>> DIRECT mode
> >>>> log4j:WARN No appenders could be found for logger
> >>>> (com.hp.hpl.jena.sparql.mgt.ARQMgt).
> >>>> log4j:WARN Please initialize the log4j system properly.
> >>>> > Starting test: Fri Mar 25 09:56:03 PDT 2011
> >>>> > Initial number of indexed graphs: 0
> >>>> 100 at: Fri Mar 25 09:56:06 PDT 2011
> >>>> 200 at: Fri Mar 25 09:56:07 PDT 2011
> >>>> 300 at: Fri Mar 25 09:56:09 PDT 2011
> >>>> 400 at: Fri Mar 25 09:56:10 PDT 2011
> >>>> 500 at: Fri Mar 25 09:56:11 PDT 2011
> >>>> ...
> >>>> 99700 at: Fri Mar 25 10:09:27 PDT 2011
> >>>> 99800 at: Fri Mar 25 10:09:27 PDT 2011
> >>>> 99900 at: Fri Mar 25 10:09:28 PDT 2011
> >>>> 100000 at: Fri Mar 25 10:09:29 PDT 2011
> >>>> > Done at: Fri Mar 25 10:09:35 PDT 2011
> >>>> 100,000 graphs in 812.30 sec
> >>>>
> >>>> With that change, I'm now only about 30% slower than Andy's number.
> >>>> Maybe that's attributable to Windows vs Linux or the hardware
> >>>> differences. I'm running it on:
> >>>>
> >>>> Intel Xeon E5345 2.33 GHz (2 processors)
> >>>> 8 GB RAM
> >>>> 300 GB HDD
> >>>> Windows 7 Enterprise SP1
> >>>>
> >>>> Does anybody know how a Xeon E5345 should compare to i5 or i7
> >>>> processors, or how much difference there might be between Linux and
> >>>> Windows 7?
> >>>>
> >>>> Thanks again for your help.
> >>>> Frank
> >>>>
> >>>> Brian McBride <[email protected]> wrote on 03/25/2011 12:15:30 PM:
> >>>>
> >>>>> Hi Andy, Frank,
> >>>>>
> >>>>> On 25/03/2011 14:13, Andy Seaborne wrote:
> >>>>>> I ran today:
> >>>>>>
> >>>>>> https://jena.svn.sourceforge.net/svnroot/jena/TDB/trunk/src-dev/reports/ReportOutOfMemoryManyGraphsTDB.java
> >>>>>
> >>>>> [[
> >>>>> I would very much appreciate it if others on this mailing list could
> >>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>> difference.
> >>>>> ]]
> >>>>>
> >>>>> Having a Windows 7 box and a couple of spare minutes, I tried running
> >>>>> Andy's version of the test code. I got performance akin to what Frank
> >>>>> reports.
> >>>>>
> >>>>> [[
> >>>>> Max mem: 1,820M
> >>>>> DIRECT mode
> >>>>> > Starting test: Fri Mar 25 16:06:36 GMT 2011
> >>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
> >>>>> > Initial number of indexed graphs: 0
> >>>>> 100 at: Fri Mar 25 16:07:17 GMT 2011
> >>>>> 200 at: Fri Mar 25 16:08:05 GMT 2011
> >>>>> 300 at: Fri Mar 25 16:08:56 GMT 2011
> >>>>> ...
> >>>>> ]]
> >>>>>
> >>>>> However, it runs a lot faster if, instead of reading the input stream
> >>>>> directly into TDB, I read it into a memory model first and then add
> >>>>> that model to a TDB model. Then I get performance similar to what Andy
> >>>>> reports, i.e.
> >>>>>
> >>>>> [[
> >>>>> Max mem: 1,820M
> >>>>> DIRECT mode
> >>>>> > Starting test: Fri Mar 25 15:55:48 GMT 2011
> >>>>> WARN [main] (SetupTDB.java:755) - No BGP optimizer
> >>>>> > Initial number of indexed graphs: 0
> >>>>> 100 at: Fri Mar 25 15:55:51 GMT 2011
> >>>>> 200 at: Fri Mar 25 15:55:51 GMT 2011
> >>>>> 300 at: Fri Mar 25 15:55:52 GMT 2011
> >>>>> ...
> >>>>> 100000 at: Fri Mar 25 16:04:47 GMT 2011
> >>>>> > Done at: Fri Mar 25 16:04:55 GMT 2011
> >>>>> 100,000 graphs in 547.00 sec
> >>>>> ]]
> >>>>>
> >>>>> Frank, you might try replacing
> >>>>>
> >>>>> [[
> >>>>> InputStream instream = getGraph(i); // the RDF graph to load
> >>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>>> try {
> >>>>>     Model model = dataset.getNamedModel(
> >>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>>>     model.read(instream, null);
> >>>>>     //model.close();
> >>>>> } finally { dataset.getLock().leaveCriticalSection(); }
> >>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>>> instream.close();
> >>>>> }
> >>>>> ]]
> >>>>>
> >>>>> with
> >>>>>
> >>>>> [[
> >>>>> InputStream instream = getGraph(i); // the RDF graph to load
> >>>>> Model m = ModelFactory.createDefaultModel();
> >>>>> m.read(instream, null);
> >>>>>
> >>>>> dataset.getLock().enterCriticalSection(Lock.WRITE);
> >>>>> try {
> >>>>>     Model model = dataset.getNamedModel(
> >>>>>         "https://jazz.net/jazz/resource/itemName/com.ibm.team.workitem.WorkItem/" + i);
> >>>>>     // model.read(instream, null);
> >>>>>     model.add(m);
> >>>>>     //model.close();
> >>>>> } finally { dataset.getLock().leaveCriticalSection(); }
> >>>>> if (++i % 100 == 0) System.out.println(i/100 + "00 at: " + new java.util.Date());
> >>>>> instream.close();
> >>>>> }
> >>>>> ]]
> >>>>>
> >>>>> I am running, I think, the latest Maven versions of Jena etc. on
> >>>>> Windows 7 on an Intel i7 with 8GB. Tests were run in Eclipse.
> >>>>>
> >>>>> Windows 7 SP1
> >>>>> Jena 2.6.4
> >>>>> ARQ 2.8.7
> >>>>> TDB 0.8.9
> >>>>>
> >>>>> Hoping this might be helpful,
> >>>>> Brian
> >>>>>
> >>>>>> in Eclipse
> >>>>>> in direct mode.
> >>>>>>
> >>>>>> It has some configuration choices you might like to try.
> >>>>>>
> >>>>>> Max mem: 910M
> >>>>>> DIRECT mode
> >>>>>> > Starting test: Fri Mar 25 13:57:02 GMT 2011
> >>>>>> > Initial number of indexed graphs: 0
> >>>>>> 100 at: Fri Mar 25 13:57:04 GMT 2011
> >>>>>> 200 at: Fri Mar 25 13:57:04 GMT 2011
> >>>>>> 300 at: Fri Mar 25 13:57:05 GMT 2011
> >>>>>> 400 at: Fri Mar 25 13:57:06 GMT 2011
> >>>>>> 500 at: Fri Mar 25 13:57:06 GMT 2011
> >>>>>> 600 at: Fri Mar 25 13:57:07 GMT 2011
> >>>>>> 700 at: Fri Mar 25 13:57:07 GMT 2011
> >>>>>> 800 at: Fri Mar 25 13:57:08 GMT 2011
> >>>>>> 900 at: Fri Mar 25 13:57:08 GMT 2011
> >>>>>> 1000 at: Fri Mar 25 13:57:09 GMT 2011
> >>>>>> ....
> >>>>>> 98000 at: Fri Mar 25 14:06:47 GMT 2011
> >>>>>> 98100 at: Fri Mar 25 14:06:47 GMT 2011
> >>>>>> 98200 at: Fri Mar 25 14:06:48 GMT 2011
> >>>>>> 98300 at: Fri Mar 25 14:06:48 GMT 2011
> >>>>>> 98400 at: Fri Mar 25 14:06:49 GMT 2011
> >>>>>> 98500 at: Fri Mar 25 14:06:50 GMT 2011
> >>>>>> 98600 at: Fri Mar 25 14:06:50 GMT 2011
> >>>>>> 98700 at: Fri Mar 25 14:06:52 GMT 2011
> >>>>>> 98800 at: Fri Mar 25 14:06:52 GMT 2011
> >>>>>> 98900 at: Fri Mar 25 14:06:53 GMT 2011
> >>>>>> 99000 at: Fri Mar 25 14:06:53 GMT 2011
> >>>>>> 99100 at: Fri Mar 25 14:06:54 GMT 2011
> >>>>>> 99200 at: Fri Mar 25 14:06:55 GMT 2011
> >>>>>> 99300 at: Fri Mar 25 14:06:55 GMT 2011
> >>>>>> 99400 at: Fri Mar 25 14:06:56 GMT 2011
> >>>>>> 99500 at: Fri Mar 25 14:06:56 GMT 2011
> >>>>>> 99600 at: Fri Mar 25 14:06:57 GMT 2011
> >>>>>> 99700 at: Fri Mar 25 14:06:58 GMT 2011
> >>>>>> 99800 at: Fri Mar 25 14:06:59 GMT 2011
> >>>>>> 99900 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>>> 100000 at: Fri Mar 25 14:07:00 GMT 2011
> >>>>>> > Done at: Fri Mar 25 14:07:04 GMT 2011
> >>>>>> 100,000 graphs in 601.98 sec
> >>>>>>
> >>>>>> On 25/03/11 13:50, Frank Budinsky wrote:
> >>>>>>> Hi Andy and all,
> >>>>>>>
> >>>>>>> I finally managed to get a relatively powerful machine set up
> >>>>>> Details?
> >>>>>>
> >>>>>>> and I reran the test program I sent you, but unfortunately, it still
> >>>>>>> runs orders of magnitude slower than the numbers you got when you
> >>>>>>> tried it. The hardware I used this time is similar to what you are
> >>>>>>> using; the only significant difference is that it's running Windows 7
> >>>>>>> instead of Ubuntu. I know Linux is somewhat faster than Windows, but I
> >>>>>>> don't expect we can blame Microsoft for a change from 573.87 seconds
> >>>>>>> to about 4 hours :-)
> >>>>>>>
> >>>>>>> Are you sure that your numbers are correct? How big was the TDB
> >>>>>>> database on disk at the end of the run?
> >>>>>> 3.9G DB1
> >>>>>>
> >>>>>>> Do you have any other ideas what may be wrong with my configuration?
> >>>>>> Windows server or desktop? (server is better at I/O)
> >>>>>> 64 bit?
> >>>>>> Is there a virus checker?
> >>>>>>
> >>>>>>> I would very much appreciate it if others on this mailing list could
> >>>>>>> also give it a quick try. I'd like to know if it's just me (and my
> >>>>>>> colleagues), or is there some kind of pattern to explain this huge
> >>>>>>> difference.
> >>>>>>>
> >>>>>>> Here is the simple test program again (inlined this time, since Apache
> >>>>>>> seems to throw away attachments). To run it, just change the TDB_DIR
> >>>>>>> constant to some empty directory. The test program loads 100,000
> >>>>>>> datagraphs (about 100 triples each -> 10M triples total). It prints a
> >>>>>>> message on the console at every 100, so if it's taking seconds for each
> >>>>>>> println, you'll know very quickly that it will take hours to run,
> >>>>>>> instead of a few minutes, like Andy has seen.
> >>>>>>>
> >>>>>>> Thanks,
> >>>>>>> Frank.
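P.S. For anyone who wants to try the test without our data: getGraph(i) in the code above just returns an InputStream holding one RDF/XML graph of roughly 100 triples. A stand-in along the following lines (entirely made up - it is not the real work item data) produces input of about the right shape:

[[
// Hypothetical stand-in for getGraph(i): builds an in-memory model of
// about 100 triples and returns it serialized as RDF/XML, which is what
// model.read(instream, null) expects. Not the real test data.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;

import com.hp.hpl.jena.rdf.model.Model;
import com.hp.hpl.jena.rdf.model.ModelFactory;
import com.hp.hpl.jena.rdf.model.Resource;

public class FakeGraphSource {
    static final String NS = "http://example.org/fake#";   // made-up namespace

    public static InputStream getGraph(int i) {
        Model m = ModelFactory.createDefaultModel();
        Resource item = m.createResource(NS + "item" + i);
        for (int p = 0; p < 100; p++) {                     // ~100 triples per graph
            item.addProperty(m.createProperty(NS, "p" + p), "value-" + i + "-" + p);
        }
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        m.write(out, "RDF/XML");
        return new ByteArrayInputStream(out.toByteArray());
    }
}
]]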
