Query optimization

Bryan Atsatt Mon, 11 Jun 2007 15:24:08 -0700

Given a list of ModuleDefinition instances, it is obvious how a
Repository implementation can do a *linear* search using Query:


private List<ModuleDefinition> definitions;

public List<ModuleDefinition> findModuleDefinitions(Query constraint) {
    List<ModuleDefinition> result = new ArrayList<ModuleDefinition>();
    for(ModuleDefinition def : definitions) {
        if (constraint.match(def)) {
            result.add(def);
        }
    }
}


We ought to be able to do much better than this, however, and make it
easy to use one or more indices to narrow the search. And I think we
should. Sure, small collections won't benefit hugely from it. But large
ones will, and, even more importantly, *remote* ones can by using a
local index. URLRepository is a perfect candidate, as the contents of
the repository-metadata.xml file can be turned into an index.

And consider a large, remote, *centralized* repository (perhaps even
hosted by Google ;^)


Given the current import-by-name syntax, the least we can do is to
leverage that name, if present in the Query, to use an index:

private Map<String, List<ModuleDefinition>> byNameIndex;

public List<ModuleDefinition> findModuleDefinitions(Query constraint) {

    List<ModuleDefinition> result = EMPTY_LIST;

    // Extract the module name from the query somehow (TBD)
    String moduleName = constraint.getModuleName();

    if (moduleName == null) {

        // Fall back on linear search.
        result = find(constraint, definitions);

    } else {

        // Use index to narrow search.
        List<ModuleDefinition> list = definitions.get(moduleName);
        if (list != null) {
            result = find(constraint, list);
        }
    }
    return result;
}

private List<ModuleDefinition> find(Query constraint,
                                    List<ModuleDefinition> context) {
    List<ModuleDefinition> result = new ArrayList<ModuleDefinition>();
    for(ModuleDefinition def : context) {
        if (constraint.match(def)) {
            result.add(def);
        }
    }
    return result;
}


Used this way, module-name becomes a primary key. It should also be
possible to have package-name as a primary key.

We could hard-wire support for common primary keys. Better, though,
would be a general mechanism for this, so that Repository
implementations can index on whatever makes sense. I think this can
actually be accomplished reasonably easily...

First, note that the index above only works if the node performs the
*same* value comparison operation, in this case String.equals(). This is
important, since if the node actually does a !String.equals(), our use
of the index is just plain wrong.

Here are some elements of the problem as things currently stand:

1. Iteration. How do we access the contents of a Query?

2. Value types. How do we determine the type of a Query value?

3. Comparison operations. How do we determine the comparison operation?

4. Extraction. How do we extract a value from a primary key node?

5. Identifier(s). How do we specify what can be used as a primary key?

The visitor pattern is an easy solution to #1, but is awkward to use.
Far better would be if we can make Query implement Iterable<Query> so
that for-each loops can be used. This requires a slightly more
complicated implementation, but I think the usability is probably worth it:

public abstract class Query implements Iterable<Query> {
    public Iterator<Query> iterator() {...}
    public boolean hasChildren() {return false;}
    public List<Query> getChildren() {return EMPTY_LIST};

    ...
}

The boolean node types just override the has/getChildren() methods.

Ok, so now that we can walk query nodes, how do we tackle the other
issues? It seems to me that there is one relatively simple path that
resolves all of the remaining issues at the same time: explicit, public
Query subclasses. For example:

public class ModuleNameEquals extends Query {
    public ModuleNameEquals (String name) {...}
    public String getName() {...}
    public boolean match(ModuleDefinition def) {
        return getName().equals(def.getName());
    }
}

public class AttributeEquals extends Query {
    public AttributeEquals (String key, String value) {...}
    public String getKey() {...}
    public String getValue() {...}
    public boolean match(ModuleDefinition def) {
        String v = def.getAttribute(getKey());
        return getValue().equals(v);
    }
}

public class ExportsPackage extends Query {
    public ExportsPackage (String name) {...}
    public String getName() {...}
    public boolean match(ModuleDefinition def) {
        Set<String> pkgs = def.getExportedPackages();
        return pkgs.contains(getName());
    }
}

public class VersionEquals extends Query {
    public VersionEquals (Version version) {...}
    public Version getVersion() {...}
    public boolean match(ModuleDefinition def) {
        return def.getVersion().equals(getVersion());
    }
}

public class VersionInRange extends Query {
    public VersionInRange (VersionRange range) {...}
    public VersionRange getRange() {...}
    public boolean match(ModuleDefinition def) {
        return getRange().contains(def.getVersion());
    }
}

...

Now we can iterate across any query looking for types that match any
index we've created, and extract values to use against them. And the
model is easily extensible if we want to support other types in the
future (e.g. ModuleSystemNameEquals, AttributeGreaterThan, etc.).


But, we're not done yet; knowing that a Query contains a
ModuleNameEquals instance, for example, isn't enough. What if the
instance is a child of a Not instance (i.e.
Query.not(Query.name("foo")))? Our use of the name against our
byNameIndex example above would then fail utterly.

In the spirit of making "simple things easy, and hard things possible",
I think we should provide convenience methods for the common cases.
(Which will also help guard against repository implementors missing the
Not case.)

A simple way to expose this is to add a method to each of the explicit
types for which we'd like to support indexing. For example:

public class ModuleNameEquals {
   public static List<String> getIndexableNames(Query q) {...}
   ...
}

This method would iterate the query looking for instances of its own
class, and collect up the names. More importantly, it would ignore an
instance if a Not node is found as a *parent*. (It could even be made
smart enough to deal with double negatives.)

We can add similar methods for the other types.

(We could also get fancy and add support for multiple indices and do the
 intersection or union, but I think that is better left to repository
implementors :^)


Thoughts?

// Bryan

Query optimization

Reply via email to