The DocumentMK (formerly "MongoMK") uses the DocumentStore API (org.apache.jackrabbit.oak.plugins.document) for persistence. We currently have three implementations of this API:

1) MemoryDocumentStore (mainly for testing),
2) MongoDocumentStore, and
3) RDBDocumentStore (only in trunk for now).

In theory, the DocumentMK code should be persistence-agnostic; in practice it has a few hardwired optimizations for Mongo. These are used for recovery and maintenance tasks.

Mongo-specific optimizations are mainly there because of the way the DocumentStore API handles queries:

  /**
   * Get a list of documents where the key is greater than a start value and
   * less than an end value <em>and</em> the given "indexed property" is greater
   * or equals the specified value.
   * <p>
   * The indexed property can either be a {@link Long} value, in which case numeric
   * comparison applies, or a {@link Boolean} value, in which case "false" is mapped
   * to "0" and "true" is mapped to "1".
   * <p>
   * The returned documents are sorted by key and are immutable.
   *
   * @param <T> the document type
   * @param collection the collection
   * @param fromKey the start value (excluding)
   * @param toKey the end value (excluding)
   * @param indexedProperty the name of the indexed property (optional)
   * @param startValue the minimum value of the indexed property
   * @param limit the maximum number of entries to return
   * @return the list (possibly empty)
   */
  @Nonnull
  <T extends Document> List<T> query(Collection<T> collection,
                                     String fromKey,
                                     String toKey,
                                     String indexedProperty,
                                     long startValue,
                                     int limit);

So the following criteria can be used to constrain a query:

a) range of IDs
b) a single greater-or-equals condition
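
To make the two criteria concrete, here is a minimal in-memory sketch of the query() semantics described in the Javadoc above. It is not the Oak implementation: the class name RangeQuerySketch is made up, documents are modeled as plain maps from property name to long value, and the method returns only the matching keys.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;

/** Hypothetical stand-in for a DocumentStore; documents are maps of long properties. */
final class RangeQuerySketch {

    /** Returns the keys in (fromKey, toKey), both exclusive, whose indexed property is >= startValue. */
    static List<String> query(NavigableMap<String, Map<String, Long>> store,
                              String fromKey,
                              String toKey,
                              String indexedProperty,
                              long startValue,
                              int limit) {
        List<String> keys = new ArrayList<>();
        // subMap iterates in key order, so the result is sorted by key
        for (Map.Entry<String, Map<String, Long>> e
                : store.subMap(fromKey, false, toKey, false).entrySet()) {
            if (keys.size() >= limit) {
                break;
            }
            if (indexedProperty != null) {
                Long value = e.getValue().get(indexedProperty);
                // the only condition available: greater-or-equals on one property
                if (value == null || value < startValue) {
                    continue;
                }
            }
            keys.add(e.getKey());
        }
        return keys;
    }
}
```

Anything beyond the key range and the single greater-or-equals test has to be done by the caller on the returned list.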

In the maintenance tasks however we need additional constraints, such as:

- a condition other than greater-or-equals
- a conjunction of multiple constraints

Also, for big result sets the response type (a list) is sub-optimal: the complete result has to be held in memory, and a store might contain large NodeDocuments. Finally, there are filter criteria that are hard or impossible to express declaratively.

Marcel and I chatted about this, and here are two API improvements we could make. They are independent of each other, and each adds some complexity; ideally we'll find that doing one of the two is sufficient.


Proposal #1: improve declarative constraints

Add a variant of query() such as:

  <T extends Document> List<T> query(Collection<T> collection,
                                     List<Constraint> constraints,
                                     int limit);

This would return all documents where all of the listed constraints are true (we currently do not seem to have a use case for a disjunction). A constraint would apply to an indexed property (such as "_id") and would allow the common comparisons, plus an "in" clause.

This would be straightforward to support in both the MongoDocumentStore and the RDBDocumentStore.
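
As a rough sketch of what such a Constraint could look like (nothing below exists in Oak; the Operator enum, the field names and the ConstraintQuery helper are all assumptions, and documents are again modeled as maps with long-valued properties only):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

/** Hypothetical comparison operators; IN models the "in" clause. */
enum Operator { EQ, GT, GE, LT, LE, IN }

/** Hypothetical constraint on one indexed property (long values only in this sketch). */
final class Constraint {
    final String property;
    final Operator op;
    final List<Long> values; // one value for comparisons, one or more for IN

    Constraint(String property, Operator op, Long... values) {
        this.property = property;
        this.op = op;
        this.values = Arrays.asList(values);
    }

    /** Evaluates the constraint against a document modeled as a property map. */
    boolean matches(Map<String, Long> doc) {
        Long actual = doc.get(property);
        if (actual == null) {
            return false; // a missing property never matches
        }
        long first = values.get(0);
        switch (op) {
            case EQ: return actual == first;
            case GT: return actual > first;
            case GE: return actual >= first;
            case LT: return actual < first;
            case LE: return actual <= first;
            case IN: return values.contains(actual);
            default: throw new IllegalStateException("unknown operator " + op);
        }
    }
}

/** In-memory stand-in for the proposed query(collection, constraints, limit). */
final class ConstraintQuery {
    static List<Map<String, Long>> query(List<Map<String, Long>> docs,
                                         List<Constraint> constraints,
                                         int limit) {
        List<Map<String, Long>> result = new ArrayList<>();
        for (Map<String, Long> doc : docs) {
            if (result.size() >= limit) {
                break;
            }
            boolean all = true;
            for (Constraint c : constraints) { // conjunction: every constraint must hold
                if (!c.matches(doc)) {
                    all = false;
                    break;
                }
            }
            if (all) {
                result.add(doc);
            }
        }
        return result;
    }
}
```

Because each constraint is just a (property, operator, values) triple, a store could presumably translate the list into a Mongo query document or a SQL WHERE clause instead of evaluating it in Java as done here.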


Proposal #2: add Java-based filtering and "sparse" documents

This would add a "QueryFilter" parameter to queries. A filter would have

- an optional way of selecting certain properties, and
- an accept(Document) method

Advantages:

- if the filter only selects certain properties (say "_id", "_deletedOnce", and "_modified"), the persistence may not need to fetch the complete document representation from storage (in RDB, this would be true for any system property that has its own column)

- the accept method could have "arbitrary" complexity and would be responsible for generating the result set; for instance, it might only build a list of Strings containing the identifiers of matching documents (which would be sufficient for a subsequent delete operation).
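
A possible shape for this, as a sketch only: the name QueryFilter and the accept method come from the proposal, while selectedProperties(), the void return type, the map-based document model and the two helper classes are assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical filter interface; method signatures are assumed. */
interface QueryFilter {
    /** Properties the filter needs, or null to request complete documents. */
    Set<String> selectedProperties();

    /** Called once per candidate document; the filter builds its own result. */
    void accept(Map<String, Object> document);
}

/** Example filter: collects only the ids of old documents that were deleted once. */
final class DeletedOnceIdCollector implements QueryFilter {
    final List<String> ids = new ArrayList<>();
    private final long modifiedBefore;

    DeletedOnceIdCollector(long modifiedBefore) {
        this.modifiedBefore = modifiedBefore;
    }

    @Override
    public Set<String> selectedProperties() {
        return new HashSet<>(Arrays.asList("_id", "_deletedOnce", "_modified"));
    }

    @Override
    public void accept(Map<String, Object> document) {
        Object deleted = document.get("_deletedOnce");
        Object modified = document.get("_modified");
        if (Boolean.TRUE.equals(deleted)
                && modified instanceof Long
                && (Long) modified < modifiedBefore) {
            ids.add((String) document.get("_id"));
        }
    }
}

/** In-memory stand-in for a store that hands sparse documents to the filter. */
final class FilteringStore {
    static void query(List<Map<String, Object>> stored, QueryFilter filter) {
        Set<String> selected = filter.selectedProperties();
        for (Map<String, Object> full : stored) {
            Map<String, Object> sparse = new HashMap<>(full);
            if (selected != null) {
                // a real store could skip loading the other properties entirely
                sparse.keySet().retainAll(selected);
            }
            filter.accept(sparse);
        }
    }
}
```

After FilteringStore.query(docs, collector) returns, collector.ids holds just the identifiers, which is all a subsequent delete would need. Note that the condition inside accept() is opaque to the store, so it is evaluated for every candidate document.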


Note: Proposal #2 is more flexible, but as it's only partly declarative it makes it impossible to pass the selection constraints down to the persistence.

Feedback appreciated...
