More efficient handling of duplicate keys in indices 
-----------------------------------------------------

                 Key: DIRSERVER-733
                 URL: http://issues.apache.org/jira/browse/DIRSERVER-733
             Project: Directory ApacheDS
          Issue Type: Improvement
          Components: core
    Affects Versions: pre-1.0, 1.0-RC1, 1.0-RC2, 1.0-RC3, 1.1.0, 1.0
            Reporter: Alex Karasulu
             Fix For: 1.1.0, 1.0


Here's the description of the problem and the solution that I posted to the 
mailing list:

Hi again,

I've completed my optimization on indices under high partition capacities.  The 
results are in line and side by side to the old results.  For those not 
interested in the exact optimization that took place skip down to the results.


The problem:
------------

Up to now duplicate keys (keys with many values) were stored using a TreeSet 
within the B+Trees of indices.  Indices usually map some attribute value to an 
ID into the master table for an entry.  The ID is a sequentially unique value 
assigned to the entry.

So if 100 people have initials AA then there would be 100 values in the TreeSet 
for the key 'AA' in the intials index.  This would be serialized and 
deserialized to and from disk.  At these low numbers there's really very little 
impact.  However certain indices were seriously impacted like the objectClass 
index or the hierarchy based system index.  Just imagine having 1 million 
entries of inetOrgPersons.  The objectClass index would then have 1 million 
values in the TreeSet for the key 'inetOrgPerson'.  This would seriously hurt 
performance from both a space and time perspective.  It would essentially 
obfuscate the benefits of using BTree's in the first place with large numbers 
of entries.


Solution:
---------

What I did was add a configurable threshold parameter when instead of using a 
TreeSet to store duplicates another B+Tree was created and used within the same 
index database file.  Right now the default value for this which these tests 
were conducted with is 1000.  So after having 1000 duplicate values the server 
switches from using in memory TreeSets to on disk B+Trees for only storing 
values for that key.



-- 
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the administrators: 
http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Reply via email to