On Tue, 2009-01-13 at 15:19 +0100, Michael Segel wrote: > > > -----Original Message----- > > From: [email protected] [mailto:[email protected]] > > Sent: Tuesday, January 13, 2009 4:59 AM > > To: Derby Discussion > > Subject: Re: Bad performance with UPDATE on a nested SELECT > [SNIP] > > Yes, it looks like the same query plan. I see the following problems > > with this query plan: > > > > 1) The IN clause is rewritten to a join with SUMMA_RECORDS as the outer > > table, and we have no information to limit the scan of the outer table, > > hence a full table scan is performed. It would probably have been better > > if SUMMA_RECORDS were the inner table. Then the C index could be used to > > enforce the restriction (childId='...') on the outer table, > > SUMMA_RELATIONS, and index I could be used to perform lookups in the > > inner table. > > > > 2) The index scan on SUMMA_RELATIONS uses the index PC. Since that index > > is on the columns (parentId, childId), it is primarily sorted on > > parentId, so the scan needs to go through the entire index, for each row > > in the outer table, in order to match childId with the criterion in the > > WHERE clause of the nested select. It would probably be better to use > > the index on childId (index C) instead of PC (or reverse the order of > > the columns in the PC index). > > > > Derby allows you to override some of the optimizer's decisions, but it's > > not as easy for UPDATE statements as it is for SELECT statements, so I > > don't know how to tell it not to pick the table scan. Using the C index > > instead of the PC index should be easy enough, though: > > > > UPDATE summa_records SET base='my_base' WHERE id IN > > (SELECT parentId FROM summa_relations --DERBY-PROPERTIES index=C > > WHERE childId='horizon_2332668') > > > > -- > > Knut Anders > > I think that you're missing something. > > First without knowing the whole application, why is there an index PC > instead of an Index P and then an index C? The relationship match table has > just two columns. So why do you need a compound index? It doesn't make > sense.
The idea was to have a unique index on the 'relations' table, but maybe you are right... > Will a parent have over 10,000 children? Then maybe it might make sense. > (Well not really. If that was the case, you'd probably want to use a temp > table. Assuming that Derby support the dynamic creation of temp tables like > Informix, but not like Oracle or DB2... ) A typical case is that parents might have 0-10 children (and reverse - children might have 0-10 parents). > I suggested that you redesign your tables so that instead of working with > varchars and indexing varchars all over the place, you instead create a > table so you can do a lookup of the varchar, matching to an integer ID. > > Indexing varchars in a main table is a *bad* idea. You end up with a lot of > fat indexes. Fat indexs don't perform as well as skinny indexes, especially > when you're talking about tables with millions plus rows. It will not be trivial for us to "just" use integer ids. The problem is that either end of a relation will not necessarily exist in the base. This means that if we use proxy integer ids we'd have to insert a dummy-id in the relations table that me later must pair up to the real id if a record with linked id shows up. To be concrete; I am working at a library and the "records" are books linking each other. A case could be that we are instructed to put the following metadata in the base: id: book_1 children: book_2, book_3 parents: 7 days later we receive metadata for book_2: id: book_2 children: parents: Note that book_2 does not mention its relation to book_1. This is (sadly) legal in our world. I can see that your strategy can work out if we store integer_id->varchar_id in a separate table. Ids for non-existing records can just be added here as well. I am just concerned that this will bring even more JOIN overhead than we already have... > Also do not use query optimization statements except as a last resort. > > Again I *highly* recommend creating a table where you assign a unique id to > a given 'parentId'/'id'/'childId' and then use that id in these tables > instead of the varchar. 1.5 million rows of an index on a varchar column > means a lot of wasted space and overhead. (Just to re-iterate the point made > above.) > > Look, no disrespect... But a lot of times, improving your table design will > improve your performance and at the same time keeping your solution database > agnostic. Your input is much appreciated and provides good food for thought :-) The problem is just that we have a PostgresQL doing this exact query (on the same data amounts) in < 1ms, but we would really, _really_ like to use Derby instead. Cheers, Mikkel
