Actually, it sounds like the problem of finding top 1000 rows out of 166333 rows is different than sorting 166333 rows and maybe it could be optimized. There is no need to sort all 166333 but the information that we are only looking 1000 rows would have to be passed all the way down to the point where Derby decides to sort. I have not thought through the details of an algorithm but when nRows we want is substantially smaller than TotalRows then i just feel there should be a better way to pick those nRows. For example, if nRows were 1, then all we had to do would be 1 single pass on 166333 rows to find the max. That is quite different than sorting all and this idea should be possible to generalize on 1<=nRows<TotalRows.
Ali
Craig Russell <[EMAIL PROTECTED]> wrote:
Hi Scott,From the query plan it appears that your filter selects 166,333 rows, of which you want the first 1000 according to the ordering of the order_id column. You can see that this is an effective strategy because Number of rows qualified=166333 Number of rows visited=166333. There's no time lost visiting rows that don't qualify.The database has to sort the 166,333 rows because the results are ordered according to the index scan column "time" not according to the order_id column. All of the rows need to be sorted even though you only want the first 1000 rows. I'd guess that the sorting of the 166,333 rows is what accounts for the 15 second delay you are experiencing.The index on order_id doesn't do you any good because you have a result that isn't indexed on order_id. If this isn't obvious, try to think of an algorithm that would use the order_id index on the result set.Craig
On Sep 19, 2005, at 9:29 AM, scotto wrote:
So for the second query:select * from orderswhere time > '10/01/2002' and time < '11/30/2002'order by order_id;the query plan shows that the index IX_ORDERS_TIME is used to filter theresult set by time. The order by step does not use the primary key index tosort the results after the filter step. My questions:--Is it correct that the sort step not use the primary key index in thiscase?--Why is it not possible to use the index on order_id to sort after thefilter has happened?Here is the query plan:Statement Name:nullStatement Text:select * from orderswhere time > '10/01/2002' and time < '11/30/2002'order by order_idParse Time: 0Bind Time: 0Optimize Time: 0Generate Time: 0Compile Time: 0Execute Time: 14329Begin Compilation Timestamp : nullEnd Compilation Timestamp : nullBegin Execution Timestamp : 2005-09-19 09:20:06.171End Execution Timestamp : 2005-09-19 09:20:20.5Statement Execution Plan Text:Sort ResultSet:Number of opens = 1Rows input = 166333Rows returned = 1000Eliminate duplicates = falseIn sorted order = falseSort information:Number of merge runs=1Number of rows input=166333Number of rows output=166333Size of merge runs=[93695]Sort type=externalconstructor time (milliseconds) = 0open time (milliseconds) = 14297next time (milliseconds) = 32close time (milliseconds) = 0optimizer estimated row count: 78377.51optimizer estimated cost: 166745.12Source result set:Index Row to Base Row ResultSet for ORDERS:Number of opens = 1Rows seen = 166333Columns accessed from heap = {0, 1, 2, 3, 4, 5, 6}constructor time (milliseconds) = 0open time (milliseconds) = 0next time (milliseconds) = 10488close time (milliseconds) = 0optimizer estimated row count: 78377.51optimizer estimated cost: 166745.12Index Scan ResultSet for ORDERS using index IX_ORDERS_TIMEat read committed isolation level using instantaneous share row lockingchosen by the optimizerNumber of opens = 1Rows seen = 166333Rows filtered = 0Fetch Size = 16constructor time (milliseconds) = 0open time (milliseconds) = 0next time (milliseconds) = 3438close time (milliseconds) = 0next time in milliseconds/row = 0scan information:Bit set of columns fetched=AllNumber of columns fetched=2Number of deleted rows visited=0Number of pages visited=887Number of rows qualified=166333Number of rows visited=166333Scan type=btreeTree height=3start position:> on first 1 column(s).Ordered null semantics on the following columns:stop position:>= on first 1 column(s).Ordered null semantics on the following columns:qualifiers:Noneoptimizer estimated row count: 78377.51optimizer estimated cost: 166745.12--scott-----Original Message-----From: Sunitha Kambhampati [mailto:[EMAIL PROTECTED]]Sent: Friday, September 16, 2005 5:55 PMTo: Derby DiscussionSubject: Re: derby performance and 'order by'Scott Ogden wrote:
I have observed some interesting query performance behavior and amhoping someone here can explain.In my scenario, it appears that an existing index is not being usedfor the 'order by' part of the operation and as a result theperformance of certain queries is suffering.Can someone explain if this is supposed to be what is happening andwhy? Please see below for the specific queries and their performancecharacteristics.Here are the particulars:---------------------------------create table orders(order_id varchar(50) NOT NULLCONSTRAINT ORDERS_PK PRIMARY KEY,amount numeric(31,2),time date,inv_num varchar(50),line_num varchar(50),phone varchar(50),prod_num varchar(50));--Load a large amount of data (720,000 records) into the 'orders' table--Create an index on the time column as that will be used in the'where' clause.create index IX_ORDERS_TIME on orders(time);--When I run a query against this table returning top 1,000 records,this query returns very quickly, consistently less than .010 seconds.select * from orderswhere time > '10/01/2002' and time < '11/30/2002'order by time;--Now run a similarly query against same table, returning the top1,000 records.--The difference is that the results are now sorted by the primary key('order_id') rather than 'time'.--This query returns slowly, approximately 15 seconds. Why??select * from orderswhere time > '10/01/2002' and time < '11/30/2002'order by order_id;--Now run a third query against the same 'orders' table, removing thewhere clause--This query returns quickly, around .010 seconds.select * from ordersorder by order_id;---------------------------------------------If you run with derby.language.logQueryPlan=true, the actual query plansused for the following queries will be written to derby.log. This willshow what indexes was used by the optimizer. Also seeQuery with 'order by' will require sorting. Usually, sorting requires anextra step to put the data into the right order. This extra step can beavoided for data that are already in the right order. For example, if asingle-table query has an ORDER BY on a single column, and there is anindex on that column, sorting can be avoided if Derby uses the index asthe access path.I think in case of your first and third query the optimizer will pickthe available index thus probably avoiding requiring the sort step.Your second query involves more work than the first query, since it hasa search condition on time, and an order by order_id. Thus if theoptimizer picks the index on time, that will involve a sort step onorder_id.____________Thanks,Sunitha.
Craig Russell
Architect, Sun Java Enterprise System http://java.sun.com/products/jdo
408 276-5638 mailto:[EMAIL PROTECTED]
P.S. A good JDO? O, Gasp!
