The DocumentMK (formerly "MongoMK") uses the DocumentStore API
(org.apache.jackrabbit.oak.plugins.document) for persistence. We
currently have three implementations of this API:
1) MemoryDocumentStore (mainly for testing),
2) MongoDocumentStore, and
3) RDBDocumentStore (only in trunk for now).
In theory, the DocumentMK code should be persistence-agnostic; in
practice it has a few hardwired optimizations for Mongo. These are used
for recovery and maintenance tasks.
Mongo-specific optimizations are mainly there because of the way the
DocumentStore API handles queries:
/**
 * Get a list of documents where the key is greater than a start value and
 * less than an end value <em>and</em> the given "indexed property" is
 * greater or equals the specified value.
 * <p>
 * The indexed property can either be a {@link Long} value, in which case
 * numeric comparison applies, or a {@link Boolean} value, in which case
 * "false" is mapped to "0" and "true" is mapped to "1".
 * <p>
 * The returned documents are sorted by key and are immutable.
 *
 * @param <T> the document type
 * @param collection the collection
 * @param fromKey the start value (excluding)
 * @param toKey the end value (excluding)
 * @param indexedProperty the name of the indexed property (optional)
 * @param startValue the minimum value of the indexed property
 * @param limit the maximum number of entries to return
 * @return the list (possibly empty)
 */
@Nonnull
<T extends Document> List<T> query(Collection<T> collection,
                                   String fromKey,
                                   String toKey,
                                   String indexedProperty,
                                   long startValue,
                                   int limit);
So the following criteria can be used to constrain a query:
a) range of IDs
b) a single greater-or-equals condition on one indexed property
In the maintenance tasks however we need additional constraints, such as:
- a condition other than greater-or-equals
- a conjunction of multiple constraints
Also, for big result sets the response type (a list) is sub-optimal:
the complete result has to be held in memory at once, and a store might
contain large NodeDocuments. Finally, there are filter criteria that are
hard or impossible to express declaratively.
Marcel and I chatted about this, and here are two API improvements we
could make. They are independent of each other, and each adds some
complexity; in the best case we'll find that doing just one of them is
sufficient.
Proposal #1: improve declarative constraints
Add a variant of query() such as:
<T extends Document> List<T> query(Collection<T> collection,
                                   List<Constraint> constraints,
                                   int limit);
This would return all documents where all of the listed constraints are
true (we currently do not seem to have a use case for a disjunction). A
constraint would apply to an indexed property (such as "_id") and would
allow the common comparisons, plus an "in" clause.
This would be straightforward to support in both MongoDocumentStore and
RDBDocumentStore.
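
To make Proposal #1 a bit more concrete, here is a rough sketch of what a
Constraint might look like, together with an in-memory evaluation of the
kind a MemoryDocumentStore could use. Everything in it is an assumption
for illustration: the Constraint class, its Operator enum, the
ConstraintQuery helper, and the use of a plain Map as a stand-in for a
Document are hypothetical and not part of the existing API.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical constraint on one indexed property; not an existing Oak type. */
final class Constraint {
    enum Operator { EQUALS, GREATER, GREATER_OR_EQUALS, LESS, LESS_OR_EQUALS, IN }

    private final String property;
    private final Operator operator;
    private final List<Object> operands;

    Constraint(String property, Operator operator, Object... operands) {
        this.property = property;
        this.operator = operator;
        this.operands = Arrays.asList(operands);
    }

    /** Evaluates the constraint against a document given as a property map. */
    boolean matches(Map<String, Object> document) {
        Object value = document.get(property);
        if (value == null) {
            return false; // assumption: a missing property never matches
        }
        if (operator == Operator.IN) {
            return operands.contains(value);
        }
        int c = compare(value, operands.get(0));
        switch (operator) {
            case EQUALS: return c == 0;
            case GREATER: return c > 0;
            case GREATER_OR_EQUALS: return c >= 0;
            case LESS: return c < 0;
            case LESS_OR_EQUALS: return c <= 0;
            default: throw new IllegalStateException("unexpected operator " + operator);
        }
    }

    // Assumes both values are Comparable and of the same type (Long or String).
    @SuppressWarnings({"unchecked", "rawtypes"})
    private static int compare(Object a, Object b) {
        return ((Comparable) a).compareTo(b);
    }
}

/** Hypothetical in-memory evaluation: a conjunction of all constraints. */
final class ConstraintQuery {
    static List<Map<String, Object>> query(List<Map<String, Object>> documents,
                                           List<Constraint> constraints,
                                           int limit) {
        List<Map<String, Object>> result = new ArrayList<>();
        for (Map<String, Object> doc : documents) {
            if (result.size() >= limit) {
                break;
            }
            boolean allMatch = true;
            for (Constraint c : constraints) {
                if (!c.matches(doc)) {
                    allMatch = false;
                    break;
                }
            }
            if (allMatch) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

A persistence-backed store would presumably not evaluate matches() in
Java at all, but translate each constraint into its native query
language; the sketch only pins down the intended semantics.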
Proposal #2: add Java-based filtering and "sparse" documents
This would add a "QueryFilter" parameter to queries. A filter would have
- an optional way of selecting certain properties, and
- an accept(Document) method
Advantages:
- if the filter only selects certain properties (say "_id",
"_deletedOnce", and "_modified"), the persistence may not need to fetch
the complete document representation from storage (in RDB, this would be
true for any system property that has its own column)
- the accept method could have "arbitrary" complexity and would be
responsible for generating the result set; for instance, it might only
build a list of Strings containing the identifiers of matching documents
(which would be sufficient for a subsequent delete operation).
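
As a rough illustration of Proposal #2, the sketch below shows one
possible shape for such a filter and a collector that only gathers
identifiers. All names are assumptions: QueryFilter's two methods, the
DeletedOnceIdCollector class, the SparseScan helper that stands in for
the store, and the use of a plain Map in place of a Document are
hypothetical. In particular, having accept() return nothing and collect
its own results is just one reading of the idea above.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical filter interface; names and signatures are assumptions. */
interface QueryFilter {
    /** Properties the filter needs, or null if it needs the full document. */
    Set<String> selectedProperties();

    /** Called once per candidate; the filter builds its own result set. */
    void accept(Map<String, Object> sparseDocument);
}

/** Collects only the IDs of documents deleted once and not modified since a cutoff. */
final class DeletedOnceIdCollector implements QueryFilter {
    private final long modifiedBefore;
    private final List<String> ids = new ArrayList<>();

    DeletedOnceIdCollector(long modifiedBefore) {
        this.modifiedBefore = modifiedBefore;
    }

    @Override
    public Set<String> selectedProperties() {
        return new HashSet<>(Arrays.asList("_id", "_deletedOnce", "_modified"));
    }

    @Override
    public void accept(Map<String, Object> doc) {
        Object modified = doc.get("_modified");
        if (Boolean.TRUE.equals(doc.get("_deletedOnce"))
                && modified instanceof Long
                && (Long) modified < modifiedBefore) {
            ids.add((String) doc.get("_id"));
        }
    }

    List<String> ids() {
        return ids;
    }
}

/** Stand-in for a store: hands each filter only the properties it selected. */
final class SparseScan {
    static void scan(List<Map<String, Object>> documents, QueryFilter filter) {
        Set<String> selected = filter.selectedProperties();
        for (Map<String, Object> doc : documents) {
            Map<String, Object> sparse = doc;
            if (selected != null) {
                sparse = new HashMap<>();
                for (String name : selected) {
                    if (doc.containsKey(name)) {
                        sparse.put(name, doc.get(name));
                    }
                }
            }
            filter.accept(sparse);
        }
    }
}
```

The list returned by ids() could then be handed to a subsequent delete
operation, and nothing beyond the three selected properties ever has to
be materialized for the filter.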
Note: Proposal #2 is more flexible, but as it's only partly declarative,
the conditions implemented in accept() cannot be passed down to the
persistence layer.
Feedback appreciated...