Re: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)

David E Jones Fri, 28 Aug 2009 21:53:29 -0700


On Aug 27, 2009, at 7:57 PM, Scott Gray wrote:

I just tried using a cursor with MySQL and the following caveat from the docs caught me out:
There are some caveats with this approach. You will have to read all of the rows in the result set (or close it) before you can issue any other queries on the connection, or an exception will be thrown.
Attempting the count query causes the exception to be thrown because the same connection is used. Is it feasible to try and detect a streaming ResultSet on the connection and if present get another from the connection factory? I'll skip the cursor for MySQL for now I think but I'll need to check whether Postgres exhibits the same behavior although their docs make no mention of it.

I'm shooting from the hip here... but I don't think the connection pool returns a connection to the pool (for reuse by other threads) while it is still being used, i.e. not until the connection is "closed", so I don't think this would happen with Entity Engine connections.


-David

On 28/08/2009, at 12:39 PM, Scott Gray wrote:
Hi David,
Always doing the separate query would certainly be easier, I think I'll go with that.
You mentioned: "regardless of the database you are using it is ALWAYS faster to retrieve a resultset limited." By that do you mean including just Derby and MySQL, or have you tried other databases? Also, was the JDBC driver setup to have the ResultSet backed by a cursor in the database? As for MySQL and Derby, I'm not sure how good a job with this sort of thing I'd expect. Things are a little different with Postgres and I'd expect better results there, and a lot different with Oracle and I'd expect way better results there.
Thanks for calling that out, saying ALWAYS was probably getting a bit carried away :-)
I've been trying things out with Derby, Postgres and MySQL. By default both Postgres[1] and MySQL[2] load the entire ResultSet into memory (Derby does not but calling last() is expensive). The only way to get hold of a cursor is to specify TYPE_FORWARD_ONLY and CONCUR_READ_ONLY but I've read of potential problems with this approach in MySQL[3]. And of course it removes the ability to call last() to get the size of the result but that can be negated by using the separate count query. It also causes getPartialList to fail because you can't jump forward in the resultset (MySQL and Derby, I haven't tried Postgres on that yet), but we could also negate that by detecting a forward only result set and using next() only to get to where we want to be.
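For reference, a minimal JDBC sketch of the cursor setup described above. The class and method names (StreamingStatements, forwardOnly) are hypothetical and not part of OFBiz. The only driver-specific assumption is the Connector/J convention that a fetch size of Integer.MIN_VALUE on a forward-only, read-only statement streams rows instead of buffering them; other drivers may reject a negative fetch size, and the Postgres driver additionally needs autocommit off and a positive fetch size before it uses a cursor.

```java
import java.sql.Connection;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.sql.Statement;

// Hypothetical helper, not an OFBiz class.
final class StreamingStatements {
    private StreamingStatements() {}

    // Fetch size that MySQL Connector/J treats as "stream the rows".
    // Assumption: only meaningful for MySQL; other drivers may throw on it.
    static final int MYSQL_STREAMING_FETCH_SIZE = Integer.MIN_VALUE;

    // Creates a forward-only, read-only statement and applies the given
    // fetch size without clamping it, so negative values pass through.
    static Statement forwardOnly(Connection conn, int fetchSize) throws SQLException {
        Statement stmt = conn.createStatement(
                ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY);
        stmt.setFetchSize(fetchSize);
        return stmt;
    }
}
```

While such a streaming ResultSet is open, the same connection cannot run another query, which is the caveat quoted at the top of the thread; a count query would need its own connection or would have to run before the streaming query.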
So with all that said here is my revised solution:
1. Always use a separate count query to get the resultset size in the ELI
2. Switch pagination queries to use performFindList rather than performFind and set maxRows, also switch the resultset type to FORWARD_ONLY for both services, I think it is rare that we would want to jump around a resultset except to get the size. Setting maxRows will cause no harm if using a cursor but if not it'll speed up queries quite a bit for large resultsets and reduce the memory consumed (unless the viewIndex is significantly high)
3. Detect forward only result sets in the ELI and change getPartialList to iterate to the desired position rather than using absolute().
4. MySQL requires you to set the fetchSize to Integer.MIN_VALUE but SQLProcessor overrides this if the setting is less than zero so change that to allow any int value to pass through.
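A minimal sketch of the idea in step 3, reaching the requested position with repeated next() calls instead of an absolute jump. It uses a plain java.util.Iterator as a stand-in for a forward-only ResultSet, so it is not the EntityListIterator code; the class name ForwardOnlyPaging and the 1-based start index are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Hypothetical helper; an Iterator stands in for a forward-only ResultSet.
final class ForwardOnlyPaging {
    private ForwardOnlyPaging() {}

    // Returns up to 'number' rows beginning at the 1-based position 'start'.
    // Rows before 'start' are consumed one at a time and discarded, which is
    // the only way to move through a forward-only result.
    static <T> List<T> partialList(Iterator<T> rows, int start, int number) {
        if (start < 1 || number < 0) {
            throw new IllegalArgumentException("start must be >= 1 and number >= 0");
        }
        for (int skipped = 0; skipped < start - 1 && rows.hasNext(); skipped++) {
            rows.next();
        }
        List<T> page = new ArrayList<>();
        while (page.size() < number && rows.hasNext()) {
            page.add(rows.next());
        }
        return page;
    }
}
```

The cost of skipping grows with the start position, which matches the observation that a high viewIndex reduces the benefit of limiting the result.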
I'll keep testing with MySQL to see if I can reproduce the problem mentioned in [3]. Even if it is a problem and I have to go without the cursor, using maxRows should still result in some improvements.
Thanks David, your feedback is always appreciated and often helps me to look at things in a different light.
Regards
Scott

[1] http://jdbc.postgresql.org/documentation/83/query.html#query-with-cursor
[2] http://dev.mysql.com/doc/refman/5.0/en/connector-j-reference-implementation-notes.html (Skip down to the ResultSet paragraph)
[3] http://javaquirks.blogspot.com/2007/12/mysql-streaming-result-set.html (2nd to last sentence)

On 28/08/2009, at 3:41 AM, David E Jones wrote:
Scott,
The steps you mentioned sound good. One possible simplification is that you could always do a separate query in the getResultsSizeAfterPartialList method. IMO it is safe to assume that the data may be large if the EntityListIterator is being used explicitly. In other words, if the programmer knows there won't be much data they'll just get a List back instead of an ELI.
You mentioned: "regardless of the database you are using it is ALWAYS faster to retrieve a resultset limited." By that do you mean including just Derby and MySQL, or have you tried other databases? Also, was the JDBC driver setup to have the ResultSet backed by a cursor in the database? As for MySQL and Derby, I'm not sure how good a job with this sort of thing I'd expect. Things are a little different with Postgres and I'd expect better results there, and a lot different with Oracle and I'd expect way better results there.
-David


On Aug 27, 2009, at 2:57 AM, Scott Gray wrote:
I can't do that at the moment because I've been modifying the same script repeatedly to test different situations and it's gotten pretty messy. But as soon as I've found a temporary solution to the problem we're having I'll come back to it, clean it up and post it so the discussion can continue with more people running tests.
What I have been able to determine so far is that regardless of the database you are using it is ALWAYS faster to retrieve a resultset limited to the records you are actually going to use. The problem is that for pagination we "need" (I'm not sure how badly) to know the full size of the resultset. It turns out there is a magic number of records where it becomes faster to do a separate count query but only if you are using EntityFindOptions.setMaxRows(int) on your ELI along with it. On my machine it is somewhere between 25,000-50,000 records.
Keep in mind also that this is only really a problem for screens where we paginate through resultsets, in most other cases we always use the entire resultset.
It was taking too long to test different table sizes so I ended up just filling a table in MySQL with 1,000,000 records and simulated paginating through it (each result is the same query run 5 times and the times are in milliseconds):
Here's the result for the way we currently do it with an ELI, there is only one result because it takes too long to test and the result doesn't really vary regardless of the viewIndex being 50 or 500,000:
0 viewIndex: min 13610, max 37307, avg 27386.2, total 136931
Here's the result using an ELI with maxRows and a separate count query:
50 viewIndex: min 363, max 377, avg 371.8, total 1859
100 viewIndex: min 372, max 462, avg 413.4, total 2067
200 viewIndex: min 385, max 412, avg 394.4, total 1972
400 viewIndex: min 378, max 412, avg 392.2, total 1961
800 viewIndex: min 373, max 1044, avg 510, total 2550
1600 viewIndex: min 390, max 405, avg 397.4, total 1987
3200 viewIndex: min 402, max 433, avg 418.6, total 2093
6400 viewIndex: min 425, max 504, avg 449.8, total 2249
12800 viewIndex: min 459, max 648, avg 536, total 2680
25600 viewIndex: min 570, max 1173, avg 705.8, total 3529
51200 viewIndex: min 756, max 1144, avg 958.8, total 4794
102400 viewIndex: min 1252, max 2810, avg 1576, total 7880
204800 viewIndex: min 2123, max 10120, avg 5253.6, total 26268
409600 viewIndex: min 4212, max 10837, avg 7011, total 35055
That's for 1,000,000 records but the ELI by itself is much faster for me on small resultsets but gets progressively slower as the resultset's size increases.
Another issue I encountered is that if I run the first portion of this test twice in two separate browser windows at the same time then an out of memory error occurs and the instance locks up until I restart it. Should we be able to recover from an out of memory error or should we just concentrate on avoiding them?
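To put the figures above side by side, here is a small calculation using only the totals reported in the table (5 runs each). The class name PaginationSpeedup is hypothetical. On these numbers the maxRows plus count-query approach comes out roughly 74 times faster than the current ELI at viewIndex 50 and just under 4 times faster at viewIndex 409600.

```java
// Hypothetical helper that recomputes averages and ratios from the reported totals.
final class PaginationSpeedup {
    private PaginationSpeedup() {}

    // Average time per run in milliseconds.
    static double average(long totalMillis, int runs) {
        return (double) totalMillis / runs;
    }

    // How many times faster the candidate is than the baseline.
    static double speedup(double baselineAvg, double candidateAvg) {
        return baselineAvg / candidateAvg;
    }

    public static void main(String[] args) {
        double currentEli = average(136931, 5);   // ELI as currently used
        double page50 = average(1859, 5);         // maxRows + count, viewIndex 50
        double page409600 = average(35055, 5);    // maxRows + count, viewIndex 409600
        System.out.printf("viewIndex 50: %.1fx faster%n", speedup(currentEli, page50));
        System.out.printf("viewIndex 409600: %.1fx faster%n", speedup(currentEli, page409600));
    }
}
```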
The only solution I can think of so far is to:
1. Add the ability for OFBiz to learn when a query becomes high volume i.e. its resultsize begins crossing the configurable magic number threshold
2. Add a new method for pagination to the delegator that can decide whether or not to set maxRows based on #1 for the EntityListIterator that it will return
3. Provide the EntityListIterator with the information required to be able to perform a separate count query (it needs the delegator or dao + the where and having conditions or we could just give it a SQLProcessor ready to go)
4. Change the ELI's getResultsSizeAfterPartialList to perform a count query if maxRows was set and the info from #3 was provided
5. For forms that use the generic performFind service for paginated results, switch them over to using the performFindList service and change its implementation to use the new delegator method from #2
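One way steps 1 and 2 could be sketched: remember the largest result size seen per query and compare it with the configurable threshold. Everything here is hypothetical (the class HighVolumeQueryTracker, its methods, and the use of a string key to identify a query); it only illustrates the decision, not how the delegator would be wired up.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch of "learning" which queries are high volume.
final class HighVolumeQueryTracker {
    private final long threshold;
    // Largest result size observed so far for each query key.
    private final Map<String, Long> largestSeen = new ConcurrentHashMap<>();

    HighVolumeQueryTracker(long threshold) {
        if (threshold < 1) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
    }

    // Called after a count (or full fetch) so later calls can decide up front.
    void recordResultSize(String queryKey, long size) {
        largestSeen.merge(queryKey, size, Long::max);
    }

    // True once the query has been seen to reach the threshold, meaning the
    // caller should set maxRows and use a separate count query.
    boolean shouldLimitRows(String queryKey) {
        return largestSeen.getOrDefault(queryKey, 0L) >= threshold;
    }
}
```

A query never seen before is treated as low volume here; defaulting the other way (always limiting until proven small) would be the safer choice if memory exhaustion is the main concern.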
Any thoughts?

Thanks
Scott


On 27/08/2009, at 3:22 AM, Adrian Crum wrote:
Scott,
It would be helpful if you could post your script in a Jira issue so we can run it against various databases. I would like to try it on ours.
-Adrian
--- On Tue, 8/25/09, Scott Gray <[email protected]> wrote:
From: Scott Gray <[email protected]>
Subject: EntityListIterator.getResultsSizeAfterPartialList() vs. delegator.getCountByCondition(...)
To: [email protected]
Date: Tuesday, August 25, 2009, 9:04 PM
Hi all,

We've had a few slow query problems lately and I've
narrowed it down to the ResultSet.last() method call in
EntityListIterator when performed on large result
sets.  I switched the FindGeneric.groovy script in
webtools to use findCountByCondition instead of
getResultsSizeAfterPartialList and the page load time went
from 40-50s down to 2-3s for a result containing 700,000
records.

Based on that I assumed there was probably some magic
number depending on your system where it becomes more
efficient to do a separate count query rather than use the
ResultSet so I put together a quick test to find out.
I threw together a script that repeatedly adds 500 rows to
the jobsandbox and then outputs the average time taken of 3
attempts to get the list size for each method.  Here's
a graph of the results using embedded Derby: http://imgur.com/ieR7m

So unless the magic number lies somewhere in the first 500
records it looks to me like it is always more efficient to do a
separate count query.

It makes me wonder if we should be taking a different
approach to pagination in the form widget and in
general.  Any thoughts?

Thanks
Scott
