Hi everybody,

Lately I've been doing some research into further integrating Lucene (1.2) with the product of the company I work for (an XML repository). I was particularly interested in the following things:

- Storing the indexes in our database instead of in a file system, so that operations on them are part of a transaction.
- Integrating full-text search in our XQuery implementation.
- Storing direct references to node objects in the index instead of document ids, to improve performance and flexibility in our situation.

I thought it might be interesting to share my experiences with the Lucene community, and since I have benefited so much from using Lucene I have taken some time to write them down. But first I want to give my compliments on the clean source code and the good use of abstractions. I think the current design makes it very easy to integrate Lucene in other products.

The first approach I took was replacing the FSDirectory with a custom directory that stores the indexes in an OODBMS. This was about half a day's work. I encountered some minor issues while doing this:

* The Directory class is currently an abstract base class with no method implementations. It would have been easier if this were an interface, because then my directory implementation class could directly extend a persistence-capable class in the object database system we use. Now another indirection was necessary.
* I found the names of the InputStream and OutputStream classes a bit misleading, since they do not represent "real" streams (at least not in the sense of the streams in the java.io package) but instead offer random access to the underlying store.
* The InputStream class has a seekInternal method. This method is never used; instead the underlying implementations use the file pointer that is kept by the InputStream class and seek automatically in their read operations. This was a bit misleading. Wouldn't it be better if InputStream called the seekInternal method?
* Maybe it is a good idea to change the InputStream class to an interface, and to add a BufferedInputStream class that wraps an InputStream implementation (separation of concerns) and also implements the InputStream interface. All this makes the contract for implementers of a store simpler.
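
The last suggestion can be sketched roughly as follows in Java. This is only an illustration under my own assumptions: the names StoreInput, ByteArrayStoreInput and BufferedStoreInput are made up for the sketch and are not Lucene classes, and the in-memory store merely stands in for a database-backed one. The point is that a store implementer only has to write the small interface, while buffering and the explicit seek call live in one wrapper.

```java
import java.io.IOException;

// All type names below are hypothetical illustrations, not Lucene classes.

/** The whole contract a store implementer would need to fulfil. */
interface StoreInput {
    /** Reads up to len bytes at the current position; returns the count, or -1 at end of data. */
    int read(byte[] dest, int offset, int len) throws IOException;

    /** Moves the current position to an absolute offset. */
    void seek(long position) throws IOException;

    long length() throws IOException;
}

/** Trivial in-memory store, standing in for a database-backed one. */
final class ByteArrayStoreInput implements StoreInput {
    private final byte[] data;
    private int position = 0;

    ByteArrayStoreInput(byte[] data) { this.data = data; }

    public int read(byte[] dest, int offset, int len) {
        if (position >= data.length) return -1;
        int n = Math.min(len, data.length - position);
        System.arraycopy(data, position, dest, offset, n);
        position += n;
        return n;
    }

    public void seek(long newPosition) { position = (int) newPosition; }

    public long length() { return data.length; }
}

/** Buffering decorator: wraps any StoreInput and is itself a StoreInput. */
final class BufferedStoreInput implements StoreInput {
    private final StoreInput in;
    private final byte[] buffer;
    private long bufferStart = 0; // absolute offset of buffer[0]
    private int bufferLength = 0; // number of valid bytes in the buffer
    private int bufferPos = 0;    // index of the next unread byte in the buffer

    BufferedStoreInput(StoreInput in, int bufferSize) {
        this.in = in;
        this.buffer = new byte[bufferSize];
    }

    public int read(byte[] dest, int offset, int len) throws IOException {
        if (len == 0) return 0;
        if (bufferPos >= bufferLength && !refill()) return -1;
        int n = Math.min(len, bufferLength - bufferPos);
        System.arraycopy(buffer, bufferPos, dest, offset, n);
        bufferPos += n;
        return n;
    }

    public void seek(long position) throws IOException {
        if (position >= bufferStart && position < bufferStart + bufferLength) {
            bufferPos = (int) (position - bufferStart); // target is already buffered
        } else {
            in.seek(position); // the wrapper seeks the store explicitly
            bufferStart = position;
            bufferLength = 0;
            bufferPos = 0;
        }
    }

    public long length() throws IOException { return in.length(); }

    /** Loads the next chunk from the wrapped store; returns false at end of data. */
    private boolean refill() throws IOException {
        bufferStart += bufferLength;
        bufferPos = 0;
        int n = in.read(buffer, 0, buffer.length);
        bufferLength = Math.max(n, 0);
        return n > 0;
    }
}
```

With this shape, wrapping a store is a one-liner such as new BufferedStoreInput(new ByteArrayStoreInput(bytes), 1024), and the store class remains free to extend whatever base class the database requires.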

I'm aware of the performance implications (late binding versus methods that can be inlined), but I'm not sure this would be a real issue, since what really matters in this product is I/O.

While this approach quickly gave me a working prototype, I wasn't really happy with the performance (although this could be improved by improving the random access methods of the internal blob storage I used). Another thing was that Lucene does not allow existing indexes to be updated, and instead always creates new sub-indexes. Since all the other indexes in our product are live, I also wanted the full-text indexes to have the same "live" behavior (by live I mean that all changes in an XML document are reflected directly in the indexes).

So I took another, more drastic approach: I replaced the index and store packages with code of my own that uses our own B-tree indexes:

- Replacing the index package was not completely trivial. Although the abstractions are very clear, they are not always implemented in a way that makes it easy to replace one component with another, since in a lot of places abstract classes are used instead of interfaces.
- We do not need the multi-field functionality (this is solved in XQuery), so I removed that code.
- A number of interfaces have iterator-like behavior with a next method, but in some cases it isn't necessary to call next to see the first item and in other cases it is. This was a bit confusing.
- We use long identifiers instead of integers, and our identifiers are not incremental (within a range they are, but the start point of a range depends on the physical location of an object). It is also not easy to find the highest identifier. This was a big issue, because I had to make a lot of modifications to make this work, especially for the Boolean scorer: the algorithm used by this class is not at all suitable for "random" long ids, since it iterates from 0 to the highest int id.
  And as you know, there is a big difference between the highest possible int id and the highest possible long id. So I replaced this algorithm with another one that uses priority queues. If anyone is interested in the algorithm, just let me know.
- I changed the scorer interfaces to include only a next method that returns a ScoreDoc or null. It is no longer necessary to have a max doc id; the scorer just returns null when it is done (this is only possible because of the changes I made to the Boolean scorer). Scorers that do not produce any more results are no longer part of the priority queue used by the Boolean scorer (the original Boolean scorer keeps calling sub-scorers even if they will not produce any more results).
- I encountered some dead code along the way.

Of course there were more issues, but I forgot to keep an exact list of them. I think these were the most important.

The second approach is the one that will be included in the new version of our product. I'm quite happy with the end result, and the performance is also good. The one thing I regret is that I had to change quite a lot of Lucene code, which makes it hard to upgrade in the future.

Again, I want to say I was impressed by the quality of the code and the design. Very good! I hope this information can be of some use for further improvement of Lucene.

Kind regards,

Arno de Quaasteniet
X-Hive Corporation
+31 (0)10 710 86 24
http://www.x-hive.com

P.S. While testing performance I noticed that the StandardAnalyzer spends a significant amount of time in a fillInStackTrace method. I think it uses an exception internally to signal something; replacing this with a test on a return value will probably speed up the process a lot.
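
As a rough illustration of the priority-queue idea mentioned above for the Boolean scorer, here is a generic Java sketch. It is not the code that went into the product and not Lucene's scorer API: ScoredId, IdScorer, ArrayScorer and UnionScorer are hypothetical names, only the OR case is shown, and summing the scores of matching sub-scorers is an assumption made to keep the example short. What it does show is a scorer whose only method is next, which returns null when it is done, and a queue from which exhausted sub-scorers drop out, so no max doc id and no iteration over the id range is needed.

```java
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical types for illustration; these are not Lucene's scorer classes.

/** A hit: a (possibly sparse, non-sequential) long id with a score. */
final class ScoredId {
    final long id;
    final float score;

    ScoredId(long id, float score) {
        this.id = id;
        this.score = score;
    }
}

/** Yields hits in ascending id order, then null once exhausted. */
interface IdScorer {
    ScoredId next();
}

/** Scorer over pre-sorted arrays, standing in for the postings of one term. */
final class ArrayScorer implements IdScorer {
    private final long[] ids;
    private final float[] scores;
    private int i = 0;

    ArrayScorer(long[] ids, float[] scores) {
        this.ids = ids;
        this.scores = scores;
    }

    public ScoredId next() {
        return i < ids.length ? new ScoredId(ids[i], scores[i++]) : null;
    }
}

/** OR-combination: merges sub-scorers by id and sums the scores for equal ids. */
final class UnionScorer implements IdScorer {

    /** Pairs a sub-scorer with the hit it is currently positioned on. */
    private static final class Entry {
        final IdScorer scorer;
        ScoredId current;

        Entry(IdScorer scorer, ScoredId current) {
            this.scorer = scorer;
            this.current = current;
        }
    }

    // Ordered by the id each sub-scorer is currently positioned on.
    private final PriorityQueue<Entry> queue =
            new PriorityQueue<>((a, b) -> Long.compare(a.current.id, b.current.id));

    UnionScorer(List<IdScorer> subScorers) {
        for (IdScorer s : subScorers) {
            ScoredId first = s.next();
            if (first != null) queue.add(new Entry(s, first)); // empty scorers never enter
        }
    }

    public ScoredId next() {
        if (queue.isEmpty()) return null; // every sub-scorer is exhausted
        long id = queue.peek().current.id;
        float sum = 0f;
        // Collect every sub-scorer positioned on the smallest id, then advance it.
        while (!queue.isEmpty() && queue.peek().current.id == id) {
            Entry e = queue.poll();
            sum += e.current.score;
            e.current = e.scorer.next();
            if (e.current != null) queue.add(e); // exhausted scorers drop out here
        }
        return new ScoredId(id, sum);
    }
}
```

The cost per returned hit depends on the number of sub-scorers (a logarithmic queue operation for each one positioned on that id), not on the size of the id space, which is what makes this kind of merge workable for sparse long ids.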
