Hi Mark, I'm interested in what you have done in somewhat peculiar way:
Currently, we use fields and terms in Lucene as the basis for the inverted index. However, as you can read in this paper for indexing Boolean expressions : http://theory.stanford.edu/~sergei/papers/vldb09-indexing.pdf, they create posting lists for all possible attribute name and value pairs (also called keys) among the conjunctions. A posting list head contains the key (A, v). The keys of the posting lists are stored in a hash table, which will be used to search posting lists given keys of an assignment. So perhaps the ability to mix the two different forms of indexes by building a posting list for each entry in your KVStore may help me design a Lucene-based solution for this problem. -- J On Sat, Mar 24, 2012 at 4:50 PM, Lance Norskog <[email protected]> wrote: > Cool! > > On Sat, Mar 24, 2012 at 4:17 PM, Mark Harwood <[email protected]> > wrote: > > OK I have some code and benchmarks for this solution up on a Google Code > project here: http://code.google.com/p/graphdb-load-tester/ > > > > The project exists to address the performance challenges I have > encountered when dealing with large graphs. It uses all of the Wikipedia > links as a test dataset and a choice of graph databases (most of which use > Lucene BTW). > > The test data is essentially 130 million edges representing links > between pages e.g. Communism->Russia. > > To load the data all of the graph databases have to translate > user-defined keys like "Russia" into an internally-generated node ID using > a service that looks like this: > > interface KeyService > > { > > //Returns existing nodeid or -1 if is not already in store > > public long getGraphNodeId(String udk); > > > > //Adds a new record - assumption is client has checked > user defined key (udk) is not stored already using getGraphNodeId > > public void put(String udk, long graphNodeId); > > } > > > > This is a challenge on a dataset of this size. I tried using a > Lucene-based implementation for this service with the following > optimisations: > > 1) a Bloomfilter to quickly "know what we don't know" > > 2) an LRUCache to hold on to commonly referenced vertices e.g the > Wikipdedia article for "United States" > > 3) a hashmap representing the unflushed state of Lucene's IndexWriter to > avoid the need for excessive flushing with NRT reader etc > > > > The search/write performance showed the familiar saw-toothing as the > Lucene index grew in size and merge operations kicked in. > > > > The KVStore implementation I wrote attempts to tackle this problem using > a fundamentally different form of index. The results from the KVStore runs > show it was twice as fast as this Lucene solution and maintains constant > performance without the saw toothing effect. > > > > Benchmark figures are here: http://goo.gl/VQ027 > > The KVStore source code is here: http://goo.gl/ovkop and the Lucene > implementation I compare against is also in the project. > > > > Cheers > > Mark > > > > > > > > > > > > --------------------------------------------------------------------- > > To unsubscribe, e-mail: [email protected] > > For additional commands, e-mail: [email protected] > > > > > > -- > Lance Norskog > [email protected] > > --------------------------------------------------------------------- > To unsubscribe, e-mail: [email protected] > For additional commands, e-mail: [email protected] > >
