OK I have some code and benchmarks for this solution up on a Google Code
project here: http://code.google.com/p/graphdb-load-tester/
The project exists to address the performance challenges I have encountered
when dealing with large graphs. It uses all of the Wikipedia links as a test
dataset and a choice of graph databases (most of which use Lucene BTW).
The test data is essentially 130 million edges representing links between pages
e.g. Communism->Russia.
To load the data all of the graph databases have to translate user-defined keys
like "Russia" into an internally-generated node ID using a service that looks
like this:
interface KeyService
{
//Returns existing nodeid or -1 if is not already in store
public long getGraphNodeId(String udk);
//Adds a new record - assumption is client has checked user
defined key (udk) is not stored already using getGraphNodeId
public void put(String udk, long graphNodeId);
}
This is a challenge on a dataset of this size. I tried using a Lucene-based
implementation for this service with the following optimisations:
1) a Bloomfilter to quickly "know what we don't know"
2) an LRUCache to hold on to commonly referenced vertices e.g the Wikipdedia
article for "United States"
3) a hashmap representing the unflushed state of Lucene's IndexWriter to avoid
the need for excessive flushing with NRT reader etc
The search/write performance showed the familiar saw-toothing as the Lucene
index grew in size and merge operations kicked in.
The KVStore implementation I wrote attempts to tackle this problem using a
fundamentally different form of index. The results from the KVStore runs show
it was twice as fast as this Lucene solution and maintains constant
performance without the saw toothing effect.
Benchmark figures are here: http://goo.gl/VQ027
The KVStore source code is here: http://goo.gl/ovkop and the Lucene
implementation I compare against is also in the project.
Cheers
Mark
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]