Currently, Jackrabbit uses random node ids (randomly generated UUIDs). For 
Jackrabbit 3, and possibly Jackrabbit 2.x, I propose to use sequential node ids.

The main advantages of random node ids (the current default) are that they are 
simple to implement, and that merging data from multiple repositories is easy.

The disadvantages of random node ids are performance and space. By 
performance, I don't mean the time to generate the ids (randomly generated 
UUIDs are in fact slower to generate); I mean the time to read and write data 
on disk. The main performance problem is reading and writing (indexing) random 
keys, which is slow because nodes that are adjacent (nearby) in the tree are 
not adjacent in the index. This slows down index access (read and write) a 
lot, especially once the data doesn't fit in the buffer cache (file system 
cache).
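To illustrate the idea, here is a minimal sketch of a sequential id generator. 
The class name and the exact bit layout (fixed timestamp in the high 64 bits, 
incrementing counter in the low 64 bits) are made up for this example and are 
not taken from the actual Jackrabbit patch:

```java
import java.util.UUID;
import java.util.concurrent.atomic.AtomicLong;

// Illustrative sketch only; not the actual Jackrabbit implementation.
public class SequentialId {

    // High 64 bits are fixed per JVM start, low 64 bits are a counter,
    // so ids created one after the other are neighbors in sorted order
    // and land on the same (or the next) b-tree page.
    private static final long HIGH = System.currentTimeMillis();
    private static final AtomicLong COUNTER = new AtomicLong();

    static UUID next() {
        return new UUID(HIGH, COUNTER.incrementAndGet());
    }

    public static void main(String[] args) {
        UUID a = next();
        UUID b = next();
        // Consecutive ids sort next to each other, unlike random UUIDs.
        System.out.println(a.compareTo(b) < 0);
    }
}
```

A real implementation would also have to handle restarts and clustering (for 
example, by putting a per-instance prefix in the high bits) so that two 
repository instances never generate the same id.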

I wrote a simple test case for Jackrabbit trunk. I also patched Jackrabbit to 
allow sequential UUIDs. Creating 1 million nodes is about 80% faster when using 
sequential UUIDs instead of random UUIDs. I don't have the data yet, but I'm 
sure the difference is much, much bigger when reading from larger repositories 
once the data doesn't fit in RAM (10 million nodes or so). I will expand the 
test case to support reading.

The following diagram shows the operations per second (bigger is better) after 
inserting x nodes (0 to 1 million). With sequential UUIDs, performance first 
improves and then stays constant, which is expected. With random UUIDs (the 
current default), performance first goes up and then down. The resulting 
database (I used H2 for performance) was about 145 MB for sequential UUIDs and 
169 MB for random UUIDs; the reason for this difference is that the b-tree 
fill factor is better for sequential UUIDs. All the data still fits in the 
buffer cache of the file system without problem.

[Attached diagram: operations per second versus number of nodes inserted, for 
sequential and random UUIDs]


Something else I noticed when running the test case is that the Jackrabbit 
internals are relatively slow. When profiling the test case, I saw that 
database access is not an important factor. Creating the nodes with the JCR 
API took 285 / 526 seconds, while re-creating the database from a SQL script 
took 55 seconds (this includes reading the 200 MB SQL script file and parsing 
the SQL statements, one insert statement per row; this is for the random 
UUIDs). I would expect Jackrabbit to be faster than 55 seconds, because it 
doesn't have to read the SQL script from disk and parse the statements.

The test case creates 1 million nodes, saves every 10'000 nodes, and uses a 
fan-out of 20 child nodes per node. I disabled the search index in the 
repository configuration. Source code:



import java.io.File;
import javax.jcr.Node;
import javax.jcr.Session;
import javax.jcr.SimpleCredentials;
import org.apache.commons.io.FileUtils;
import org.apache.jackrabbit.core.TransientRepository;
import org.apache.jackrabbit.core.id.NodeId;

public class TestLoop {

    long start;
    // Profiling helper (not included here); commented out together with
    // the profiling code below so the test compiles without it.
    // Profiler prof;

    public static void main(String... args) throws Exception {
        new TestLoop().run();
    }

    void startProf() {
//        if (prof != null) {
//            prof.stopCollecting();
//            System.out.println(prof.getTop(3));
//        }
//        prof = new Profiler();
//        prof.interval = 1;
//        prof.depth = 32;
//        prof.startCollecting();
    }

    void run() throws Exception {
        System.setProperty("jackrabbit.sequentialUUIDs", "false");
        startProf();
        FileUtils.deleteDirectory(new File("repository"));
        FileUtils.copyFile(new File("repository copy.xml"),
                new File("repository.xml"));
        TransientRepository rep = new TransientRepository();
        Session s = rep.login(
                new SimpleCredentials("admin", "admin".toCharArray()));
        int nodes = 1000000;
        int nodesPerFolder = 20;
        int saveEach = 10000;
        System.out.println(new NodeId());
        // Tree depth needed to hold the given node count at this fan-out.
        int levels = (int) Math.ceil(Math.log(nodes)
                / Math.log(nodesPerFolder));
        start = System.currentTimeMillis();
        startProf();
        createNodes(s.getRootNode(), 0, nodes,
                saveEach, levels, nodesPerFolder);
        System.out.println("took: " + (System.currentTimeMillis() - start));
        s.logout();
    }

    private int createNodes(Node parent, int x, int nodes,
            int saveEach, int depth, int nodesPerFolder)
            throws Exception {
        for (int i = 0; x < nodes && i < nodesPerFolder; i++) {
            if (depth > 1) {
                Node n = parent.addNode("f" + x, "nt:unstructured");
                x = createNodes(n, x, nodes,
                        saveEach, depth - 1, nodesPerFolder);
            } else {
                parent.addNode("n" + i, "nt:unstructured");
                x++;
                if ((x % saveEach) == 0) {
                    parent.getSession().save();
                    long t = System.currentTimeMillis() - start;
                    startProf();
                    System.out.println((x * 1000L / t) + " op/s at " + x);
                }
            }
        }
        return x;
    }
}
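As a quick sanity check of the tree shape (this is just the levels arithmetic 
from run(), extracted into a standalone snippet):

```java
public class Levels {
    public static void main(String[] args) {
        int nodes = 1000000;
        int nodesPerFolder = 20;
        // log_20(1'000'000) = ln(1e6) / ln(20) ≈ 13.8155 / 2.9957 ≈ 4.61,
        // rounded up to 5: a tree 5 levels deep with fan-out 20 can hold
        // up to 20^5 = 3.2 million leaf nodes.
        int levels = (int) Math.ceil(Math.log(nodes)
                / Math.log(nodesPerFolder));
        System.out.println(levels);
    }
}
```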




