Hi,
 
I have researched quite a bit on Derby-655 : getImportedKeys returning duplicate rows in some cases. Here are my findings and 3 possible solutions to the problem. After reading the mail, if someone can think of any other way of solving this problem, then please share it.
 
First some background information
Derby internally creates backing indexes to enforce unique, primary and foreign key constraints. Unique and primary key constraints generate unique indexes whereas foreign key constraint generates a non-unique index. The information about these indexes are stored in following system tables, namely, SYS.SYSCONSTRAINTS, SYS.SYSKEYS, SYS.SYSFOREIGNKEYS, and SYS.SYSCONGLOMERATES.
 
SYS.SYSCONSTRAINTS keep information that is common among all kinds of constraints. CONGLOMERATEID is the primary key of SYS.SYSCONSTRAINTS.
 
Derby tables and indexes are stored in storage units called conglomerates. And information about them is stored in SYS.SYSCONGLOMERATES. Each conglomerate has an id and that is kept in column CONGLOMERATEID in SYS.SYSCONGLOMERATES. So, each of the constraints in SYSCONSTRAINTS will have a corresponding conglomerate and information about that conglomerate will be saved in SYSCONGLOMERATES.
 
The information about constraints in the 2 different tables, SYSCOSTRAINTS and SYSCONGLOMERATES need to be tied together and that is done using SYSKEYS or SYSFOREIGNKEYS depending on the kind of constraint.
 
If the constraint type is primary key or unique key, then the information specific to primary key or unique key constraint is saved in SYS.SYSKEYS. SYS.SYSKEYS has only 2 columns, CONSTRAINTID and CONGLOMERATEID. This table is used to tie SYSCONSTRAINTS and SYSCONGLOMERATES tables together for primary key and unique key constraints.
 
If the constraint type is foreign key, then the information specific to foreign key constraints is kept in SYS.SYSFOREIGNKEYS. Among other columns, this table has 2 columns, CONSTRAINTID and CONGLOMERATEID. This table is used to tie SYSCONSTRAINTS and SYSCONGLOMERATES tables together for foreign key constraints.
 
When a table with no constraint is created, Derby creates a conglomerate for that table's heap and an entry is made into SYSCONGLOMERATES for that heap .
 
When the first constraint is created on the table, say primary key, Derby creates a conglomerate for the table's primary key backing index and an entry is made into SYSCONGLOMERATES for that index.
 
For every conglomerate that gets created for a table (heap or index), Derby has to do the job of maintaining them as the table data changes. So, higher the number of conglomerates per table, slower would be Derby's performance.
 
In order to improve Derby's performance in conglomerate maintenance area, in Derby 10.0, few changes were made in org.apache.derby.impl.sql.execute.CreateIndexConstantAction.executeConstantAction, so that if a new (backing)index was detected to be a duplicate of an existing index, then rather than creating a new conglomerate for it, the new index would share the conglomerate that was already created for the existing duplicate index. So, an entry will be made into SYSCONGLOMERATES for the new constraint and an entry will be made into SYSCONGLOMERATES for the new constraint but the CONGLOMERATEID of the new constraint will be same as the CONGLOMERATEID of the existing duplicate index. Following is what qualifies an index to be a duplicate index(Note that a user defined index will throw an error if it is going to cause a duplicate index. Only backing indexes for constraints are allowed to be duplicates, eg. a user can define a primary key and a foreign key constraint on the same column of the table and that will succeed.)
   /* For an index to be considered a duplicate of already existing index, the
    * following conditions have to be satisfied:
    * 1. the set of columns and their order in the index is the same as that of an existing index AND
    * 2. the ordering attributes are the same AND
    * 3. both the previously existing index and the one being created are non-unique OR the previously existing index is unique
    */
These rules for duplicate indexes for constraints mean that for a given CONGLOMERATEID, there can be duplicate rows in SYSCONGLOMERATES. So, when a query tries to join SYSCONSTRAINTS, SYSKEYS/SYSFOREIGNKEYS, and SYSCONGLOMERATES, it should be aware of the duplicate rows in SYSCONGLOMERATES and should have mechanisms to filter out duplicate rows in SYSCONGLOMERATES. The metadata query for getImportedKeys in org.apache.derby.impl.jdbc.metadata.properties in the outermost select makes a join on SYSCONSTRAINTS, SYSKEYS AND SYSCONGLOMERATES and it was not fixed to handle the duplicate rows in SYSCONGLOMERATES when the duplicate index work went into org.apache.derby.impl.sql.execute.CreateIndexConstantAction.executeConstantAction and that is causing Derby-655 to retunr duplicate rows in some cases.
 
To make this more clear with an example, I am going to run through subset of the sql included in Derby-655
CREATE TABLE t2 (c21_ID BIGINT NOT NULL primary key);
CREATE TABLE t3(c31_ID BIGINT NOT NULL primary key);
ALTER TABLE t2 ADD CONSTRAINT F_443 Foreign Key (c21_ID)
   REFERENCES t3(c31_ID) ON DELETE CASCADE ON UPDATE NO ACTION;
 
In this example, when primary key constraint is defined on t2(c21_ID), a conglomerate is created for backing index for primary key on column c21_ID. A row is inserted into SYSCONSTRAINTS, SYSKEYS, and SYSCONGLOMERATES for this constraint and it's newly created conglomerate. So far, so good.
 
When foreign key constraint is defined of t2(c21_ID), notice that it is on the same column as the existing primary key constraint. Hence, org.apache.derby.impl.sql.execute.CreateIndexConstantAction.executeConstantAction decides that it is a duplicate index. But since it is a backing index, no error is thrown. For duplicate backing indexes, we decide not to create a new conglomerate since foreign key backing index in this case can share the conglomerate for primary key backing index on t2. So, a row is inserted into SYSCONSTRAINTS, SYSKEYS, and SYSCONGLOMERATES but the row in SYSCONGLOMERATES has the same CONGLOMERATEID as the row for primary key backing index.
 
The query for getImportedKeys has following where clause for the outermost select statement
   ) AS FKINFO(FK_ID, \
      FKTABLE_SCHEM, \
      FKTABLE_NAME, \
      FKCOLUMN_NAME, \
      KEY_SEQ, \
      FK_NAME, \
      FK_UPDATERULE, \
      FK_DELETERULE), \
   SYS.SYSCONSTRAINTS c PROPERTIES joinStrategy=NESTEDLOOP, index = 'SYSCONSTRAINTS_INDEX1', \
   SYS.SYSTABLES T PROPERTIES joinStrategy=NESTEDLOOP, index = 'SYSTABLES_INDEX2', \
   SYS.SYSSCHEMAS S PROPERTIES joinStrategy=NESTEDLOOP, index = 'SYSSCHEMAS_INDEX2', \
   SYS.SYSKEYS K PROPERTIES joinStrategy=NESTEDLOOP, index = 'SYSKEYS_INDEX1', \
   SYS.SYSCONGLOMERATES CONGLOMS PROPERTIES joinStrategy = NESTEDLOOP, index = 'SYSCONGLOMERATES_INDEX1', \
   SYS.SYSCOLUMNS COLS PROPERTIES joinStrategy=NESTEDLOOP, index = 'SYSCOLUMNS_INDEX1' \
 WHERE T.TABLEID = C.TABLEID  \
 AND C.CONSTRAINTID = FKINFO.FK_ID  \
 AND FKINFO.KEY_SEQ = CONGLOMS.DESCRIPTOR.getKeyColumnPosition(  \
               COLS.COLUMNNUMBER) \
 AND S.SCHEMAID = T.SCHEMAID \
 AND K.CONSTRAINTID = C.CONSTRAINTID \
 AND (CASE WHEN CONGLOMS.DESCRIPTOR IS NOT NULL THEN \
  CONGLOMS.DESCRIPTOR.getKeyColumnPosition(COLS.COLUMNNUMBER) ELSE 0 END) <> 0  \
 AND K.CONGLOMERATEID = CONGLOMS.CONGLOMERATEID   \
 AND C.TABLEID = COLS.REFERENCEID  \
 
In the above sql snippet, when the join is made on SYSCONSTRAINTS, SYSKEYS, and SYSCONGLOMERATES for primary key on table t2, the query finds 2 rows in SYSCONGLOMERATES with the conglomerateid and it picks both those rows. This is wrong and we need to somehow return only one row from SYSCONGLOMERATES, eventhough there might be duplicate rows in there for a given conglomerateid.
 
I have thought of following 3 solutions
1)Even if a foreign key's backing index qualifies as a duplicate index, have foreign key backing index create its own conglomerate rather than share an existing conglomerate. But this would mean that system would have to maintain 2 identical conglomerates. This can impact Derby's performance negatively, although I don't know how badly because it might be rare that a user creates a primary key and foreign key on identical columns.
2)Write a system function which will return just one row for all the duplicate conglomerateids for a given constraintid and use that row in the outermost select statement of metadata query for getImportedKeys. Not sure how easy this would be to implement but I am leaning towards this solution.
3)Do not create duplicate rows in sysconglomerates for duplicate indexes. Instead have just one row in sysconglomerate to represent all the duplicate indexes in sysconstraints. But that seems like more of an enhancement rather than a bug fix to me. Current Derby code heavily relies on having a row in SYSCONGLOMERATE for each of the indexes.
 
Thanks if you have made it this far :) I hope this mail is clear in explaining the problem and the possible solutions. If anything is not clear, then let me know and I can provide more information.
 
If anyone has any ideas about some other possible fix for the problem, please let me know that too.
 
thanks,
Mamta

Reply via email to