CALL FOR: HugeNodeListIterator
Called by: Michiel Meeuwissen Total tally on this call : -1
START: 2004-04-21 22:30 START OF VOTING: 2004-04-26 22:30 END OF CALL: 2004-05-11 22:30
YEA (0) :
ABSTAIN (2) : Pierre van Rooden, Eduard Witteveen
NAY (1) : Kees Jongenburger
VETO (0) :
No votes, assumed abstained (12): Jaco de Groot, Marcel Maatkamp, Andre van Toly,
Johannes Verelst, Daniel Ockeloen, Rob Vermeulen, Rico Jansen,
Nico Klasens, Rob van Maris, Gerard van Enk, Mark Huijser, Ernst Bunders
Call result:
Since there were only three clear votes, I am extending this hack call to Tuesday, 11th May.
Michiel Meeuwissen wrote:
When you want to do exceptionally large iterations over MMBase nodes, using the bridge, you might not be content with the current way to do that:
  NodeList list = nodeManager.getList(query);
  NodeIterator iterator = list.nodeIterator();
  while (iterator.hasNext()) {
      ...
  }
Because you then need all the nodes in memory ('list' in the above example), and the query result is also cached, so the memory is not even released immediately after completion.
I have created a job that wants to iterate over all nodes of a certain type every night. There are quite a lot of such nodes, so I felt that I needed something better here.
I created a 'HugeNodeListIterator', a NodeListIterator implementation that one can use as follows:
  NodeIterator iterator = new HugeNodeListIterator(query);
  while (iterator.hasNext()) {
      ...
  }
The implementation of this NodeListIterator ensures that each query result is not too big, by applying a 'limit' to the query and executing it in batches (the query is 'uniquely sorted', and the next batch is found by applying a constraint on the sort field).
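The batching idea can be sketched independently of the MMBase bridge. The following self-contained Java example is only an illustration of the technique: the class and method names (KeysetBatchDemo, fetchBatch, iterateAll) are invented, and a plain sorted list of integers stands in for the storage layer. It shows how 'keyset' batching with a greater-or-equal constraint on the unique sort key walks through all rows while never holding more than one batch in memory:

```java
import java.util.*;

// Sketch of the batching idea used by HugeNodeListIterator: instead of an
// OFFSET, each batch is fetched with a constraint on the unique sort key of
// the last row seen. All names here are hypothetical; a sorted List<Integer>
// stands in for the database table.
public class KeysetBatchDemo {
    static final int BATCH_SIZE = 3;

    // Simulates "SELECT key FROM table WHERE key >= after ORDER BY key LIMIT 3"
    static List<Integer> fetchBatch(List<Integer> table, int after) {
        List<Integer> batch = new ArrayList<>();
        for (int key : table) {
            if (key >= after && batch.size() < BATCH_SIZE) {
                batch.add(key);
            }
        }
        return batch;
    }

    static List<Integer> iterateAll(List<Integer> table) {
        List<Integer> result = new ArrayList<>();
        int after = Integer.MIN_VALUE;
        while (true) {
            List<Integer> batch = fetchBatch(table, after);
            for (int key : batch) {
                // A GREATER_EQUAL constraint is used (not GREATER), so the last
                // key of the previous batch reappears first and must be skipped.
                if (result.isEmpty() || key > result.get(result.size() - 1)) {
                    result.add(key);
                }
            }
            if (batch.size() < BATCH_SIZE) {
                break; // a partial batch means the table is exhausted
            }
            after = batch.get(batch.size() - 1);
        }
        return result;
    }

    public static void main(String[] args) {
        List<Integer> table = Arrays.asList(1, 2, 3, 5, 8, 13, 21);
        System.out.println(iterateAll(table)); // prints [1, 2, 3, 5, 8, 13, 21]
    }
}
```

The GREATER_EQUAL-plus-skip dance mirrors the while loop in nextNode() below: a greater-or-equal constraint is needed because with a non-unique sort order a plain 'greater than' could skip rows, at the price of re-reading (and discarding) the boundary row of each batch.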
Of course this is not perfect, because you would need no batching at all if you had direct access to the database cursor, but you cannot have that through the bridge. When and if this becomes possible, the HugeNodeListIterator implementation can be changed in a backwards-compatible way to use it. But I don't anticipate such a new feature in the bridge anytime soon.
The other current problem is that it accesses the cache framework to remove the query results from the Cache, which were put there by the bridge. Therefore RMMCI cannot currently use this. This indicates that there should probably be an RMMCI (or even MMCI) way to access the Cache administration. The other option would perhaps be to add HugeNodeListIterator itself somehow to RMMCI. And finally, if and when the bridge has options to disable caching of a certain query result, this implementation can of course be changed to use that, losing the need to access the Caches itself.
I want to offer this HugeNodeListIterator as a hack, because I think it may be useful for others as well. I currently don't need it for an MMBase project, so I can't add it like that.
It is for CVS HEAD, in the package org.mmbase.bridge.util. It can also be made to work in 1.7 environments. It is attached to this mail. Note that before committing I may apply implementation improvements if any are suggested in reaction.
START OF CALL: 2004-04-21 22:30 END OF CALL: 2004-04-26 22:30
[_] +1 (YES) [_] +0 (ABSTAIN ) [_] -1 (NO), because : [_] VETO, because:
Michiel
--
Michiel Meeuwissen |
Mediacentrum 140 H'sum | +31 (0)35 6772979 | I hate computers
nl_NL eo_XX en_US |
mihxil' |
[] () |
------------------------------------------------------------------------
/*
This software is OSI Certified Open Source Software. OSI Certified is a certification mark of the Open Source Initiative.
The license (Mozilla version 1.0) can be read at the MMBase site. See http://www.MMBase.org/license
*/
package org.mmbase.bridge.util;
import org.mmbase.bridge.*;
import org.mmbase.storage.search.*;
import org.mmbase.cache.*;
import java.util.*;
/**
 * Iterates the big result of a query. It avoids using a lot of memory (which you would need if you
 * get the complete NodeList first), and pollution of the (node) cache. In this current
 * implementation the Query is 'batched' to avoid reading all nodes into memory, and the batches
 * are removed from the query caches.
 *
 * @author Michiel Meeuwissen
 * @version $Id: HugeNodeListIterator.java,v 1.2 2004/04/05 19:04:12 michiel Exp $
 * @since MMBase-1.8
 */
public class HugeNodeListIterator implements NodeIterator {
// will not work through RMMCI, because caches are accessed.
    protected static MultilevelCache multilevelCache = MultilevelCache.getCache();
    protected static NodeListCache   nodeListCache   = NodeListCache.getCache();
public static final int DEFAULT_BATCH_SIZE = 10000;
    protected NodeIterator nodeIterator;
    protected Node nextNode;
    protected Node previousNode;

    protected Query originalQuery;
    protected int batchSize = DEFAULT_BATCH_SIZE;
    protected int nextIndex = 0;
    /**
     * Constructor for this Iterator.
     *
     * @param query     The query which is used as a base for the queries to be executed.
     * @param batchSize The (approximate) size of the sub-queries; should be a reasonably large
     *                  number, like 10000 or so.
     */
    public HugeNodeListIterator(Query query, int batchSize) {
        this.batchSize = batchSize;
        init(query);
    }
    /**
     * Constructor for this Iterator. The 'batchSize' is taken from the query's 'maxnumber'
     * property, or, if that is not set, it defaults to 10000.
     *
     * @param query The query which is used as a base for the queries to be executed.
     */
    public HugeNodeListIterator(Query query) {
        if (query.getMaxNumber() != SearchQuery.DEFAULT_MAX_NUMBER) {
            batchSize = query.getMaxNumber();
        } // else leave on default
        init(query);
    }
    /**
     * Called by constructors only.
     */
    private void init(Query query) {
        if (query.getOffset() > 0) {
            throw new UnsupportedOperationException("Not implemented for queries with offset");
        }
        Queries.sortUniquely(query);
        originalQuery = query;
        executeNextQuery((Query) originalQuery.clone());
    }
/**
 * Executes the given query, taking into account whether it is a NodeQuery or not, and
 * applying the 'batchSize'. The result is available in the 'nodeIterator' member.
*/
protected void executeQuery(Query currentQuery) {
currentQuery.setMaxNumber(batchSize);
if (originalQuery instanceof NodeQuery) {
NodeQuery nq = (NodeQuery) currentQuery;
nodeIterator = nq.getNodeManager().getList(nq).nodeIterator();
nodeListCache.remove(nq);
} else {
nodeIterator = currentQuery.getCloud().getList(currentQuery).nodeIterator();
multilevelCache.remove(currentQuery);
}
}
    /**
     * Executes the given query, and prepares to do 'next', setting 'nextNode' and 'previousNode'.
*/
protected void executeNextQuery(Query q) {
executeQuery(q);
previousNode = nextNode;
if (nodeIterator.hasNext()) {
            nextNode = nodeIterator.nextNode();
        } else {
nextNode = null;
}
}
    /**
     * Executes the given query, and prepares to do 'previous', setting 'nextNode' and
* 'previousNode', and winds the new nodeIterator to the end.
*/
protected void executePreviousQuery(Query q) {
executeQuery(q);
nextNode = previousNode;
while (nodeIterator.hasNext()) {
nodeIterator.next();
}
        if (nodeIterator.hasPrevious()) {
            previousNode = nodeIterator.previousNode();
        } else {
previousNode = null;
}
}
    /**
     * {@inheritDoc}
     */
    public boolean hasNext() {
        return nextNode != null;
    }

    /**
     * {@inheritDoc}
     */
    public boolean hasPrevious() {
        return previousNode != null;
    }

    /**
     * {@inheritDoc}
     */
    public Object next() {
        return nextNode();
    }

    /**
     * {@inheritDoc}
     */
    public Object previous() {
        return previousNode();
    }

    /**
     * {@inheritDoc}
     */
    public int previousIndex() {
        return nextIndex - 1;
    }

    /**
     * {@inheritDoc}
     */
    public int nextIndex() {
        return nextIndex;
    }
/**
* Used by nextNode and previousNode. Does a field-by-field compare of two Node objects to check
 * if they are equal. One would expect the equals method of Node to be usable for
 * this, but that seems not to be the case.
*/
protected boolean equals(Node node1, Node node2) {
Iterator i = node1.getNodeManager().getFields().iterator();
while (i.hasNext()) {
Field f = (Field) i.next();
String name = f.getName();
            if (! node1.getValue(name).equals(node2.getValue(name))) {
                return false;
            }
        }
return true;
}
    /**
     * {@inheritDoc}
     *
     * The implementation also calculates the next 'next' Node, and returns the 'old'
     * next Node, from now on known as 'previousNode'.
     */
    public Node nextNode() {
        if (nextNode != null) {
            nextIndex++;
            previousNode = nextNode;
            if (nodeIterator.hasNext()) {
                nextNode = nodeIterator.nextNode();
            } else {
                Query currentQuery = (Query) originalQuery.clone();
                // We don't use an offset to determine the 'next' batch of query results,
                // because there could have been deletions/insertions in the meantime.
                // Instead we use the sort order to apply a constraint.
                SortOrder order = (SortOrder) originalQuery.getSortOrders().get(0);
                Constraint cons;
                if (order.getDirection() == SortOrder.ORDER_ASCENDING) {
                    cons = currentQuery.createConstraint(order.getField(), FieldCompareConstraint.GREATER_EQUAL, previousNode.getValue(order.getField().getFieldName()));
                } else {
                    cons = currentQuery.createConstraint(order.getField(), FieldCompareConstraint.LESS_EQUAL, previousNode.getValue(order.getField().getFieldName()));
                }
                Queries.addConstraint(currentQuery, cons);
                executeNextQuery(currentQuery);
                // Perhaps the sort order did not produce a unique ordering; skip some nodes in that case.
                // XXX This is wrong if (unlikely though it is) more than 'batchSize' equal nodes follow.
                while (nextNode != null && equals(nextNode, previousNode)) {
                    if (nodeIterator.hasNext()) {
                        nextNode = nodeIterator.nextNode();
                    } else {
                        nextNode = null;
                    }
                }
            }
            return previousNode; // looks odd, but really is what is meant
        } else {
            throw new NoSuchElementException("No next element");
        }
    }
/**
     * {@inheritDoc}
*
* Implementation is analogous to nextNode.
*/
public Node previousNode() {
if (previousNode != null) {
nextNode = previousNode;
nextIndex --;
if (nodeIterator.hasPrevious()) {
previousNode = nodeIterator.previousNode();
} else {
Query currentQuery = (Query) originalQuery.clone();
SortOrder order = (SortOrder) originalQuery.getSortOrders().get(0);
Constraint cons;
if (order.getDirection() == SortOrder.ORDER_ASCENDING) {
cons = currentQuery.createConstraint(order.getField(), FieldCompareConstraint.LESS_EQUAL, nextNode.getValue(order.getField().getFieldName()));
} else {
cons = currentQuery.createConstraint(order.getField(), FieldCompareConstraint.GREATER_EQUAL, nextNode.getValue(order.getField().getFieldName()));
}
Queries.addConstraint(currentQuery, cons);
executePreviousQuery(currentQuery);
while(previousNode != null && equals(nextNode, previousNode)) {
if (nodeIterator.hasPrevious()) {
previousNode = nodeIterator.previousNode();
} else {
previousNode = null;
}
}
}
return nextNode;
} else {
throw new NoSuchElementException("No previous element");
}
}
    /**
     * @throws UnsupportedOperationException
     */
public void remove() {
throw new UnsupportedOperationException("Optional operation 'remove' not implemented");
}
    /**
     * @throws UnsupportedOperationException
     */
public void add(Object o) {
throw new UnsupportedOperationException("Optional operation 'add' not implemented");
}
    /**
     * @throws UnsupportedOperationException
     */
public void set(Object o) {
throw new UnsupportedOperationException("Optional operation 'set' not implemented");
}
    /**
     * Main only for testing.
     */
    public static void main(String[] args) {
        Cloud cloud = ContextProvider.getDefaultCloudContext().getCloud("mmbase");
        NodeQuery q = cloud.getNodeManager("object").createQuery();
        HugeNodeListIterator nodeIterator = new HugeNodeListIterator(q);
        while (nodeIterator.hasNext()) {
            Node node = nodeIterator.nextNode();
            System.out.println(node.getFunctionValue("gui", null).toString());
        }
}
}
