Re: [Performance] Streaming main memory indexing of single strings

Wolfgang Hoschek Thu, 14 Apr 2005 14:03:50 -0700

This seems to be a promising avenue worth exploring. My gut feeling is that this could easily be 10-100 times faster.

The drawback is that it requires a fair amount of understanding of intricate Lucene internals, pulling those pieces together and adapting them as required for the seemingly simple "float match(String text, Query query)".

I might give it a shot but I'm not sure I'll be able to pull this off! Is there any similar code I could look at as a starting point?

Wolfgang.

On Apr 14, 2005, at 1:13 PM, Robert Engels wrote:

I think you are not approaching this the correct way.

Pseudo code:

Subclass IndexReader.

Get tokens from String 'document' using Lucene analyzers.

Build simple hash-map based data structures using tokens for terms, and term positions.

reimplement termDocs() and termPositions() to use the structures from above.

run searches.

start again with next document.
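
The hash-map part of the pseudo code above can be sketched in plain Java as follows. This is a minimal illustration under stated assumptions, not the Lucene integration itself: the class name SingleStringIndex and its methods are hypothetical, the regex tokenizer merely stands in for a real Lucene Analyzer, and the returned "score" is just the raw term frequency rather than Lucene's relevance score. Wiring such structures into an IndexReader subclass (termDocs(), termPositions()) is left out.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical throwaway index for ONE string: term -> positions of that term. */
final class SingleStringIndex {

    private final Map<String, List<Integer>> positions = new HashMap<>();

    /** Lower-cases the text and splits on non-letter/non-digit runs (stand-in for an Analyzer). */
    SingleStringIndex(String text) {
        int pos = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (token.isEmpty()) continue; // split() yields "" when the text starts with a delimiter
            positions.computeIfAbsent(token, t -> new ArrayList<>()).add(pos++);
        }
    }

    /** Number of occurrences of the (already lower-cased) term in the single document. */
    int termFreq(String term) {
        return termPositions(term).size();
    }

    /** Token positions of the term, or an empty list if the term is absent. */
    List<Integer> termPositions(String term) {
        List<Integer> p = positions.get(term);
        return p == null ? Collections.<Integer>emptyList() : Collections.unmodifiableList(p);
    }

    /** Naive single-term match: returns the term frequency as a float, 0.0f if absent. */
    float match(String term) {
        return termFreq(term.toLowerCase(Locale.ROOT));
    }

    public static void main(String[] args) {
        SingleStringIndex index = new SingleStringIndex("James is out in the woods");
        System.out.println("positions(woods)=" + index.termPositions("woods"));
        System.out.println("match(James)=" + index.match("James"));
    }
}
```

One such object would be built per incoming string and discarded after the query, so no locking, directory, or file abstraction is involved.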

-----Original Message-----
From: Wolfgang Hoschek [mailto:[EMAIL PROTECTED]
Sent: Thursday, April 14, 2005 2:56 PM
To: [email protected]
Subject: Re: [Performance] Streaming main memory indexing of single
strings


Otis, this might be a misunderstanding.

- I'm not calling optimize(). That piece is commented out if you look
again at the code.
- The *streaming* use case requires that for each query I add one (and
only one) document (aka string) to an empty index:

repeat N times (where N is millions or billions):
        add a single string (aka document) to an empty index
        query the index
        drop index (or delete its document)

with the following API being called N times: float match(String text,
Query query)

So there's no possibility of adding many documents and thereafter
running the query. This in turn seems to mean that the IndexWriter
can't be kept open - unless I manually delete each document after each
query to repeatedly reuse the RAMDirectory, which I've also tried
before without any significant performance gain - deletion seems to
have substantial overhead in itself. Perhaps it would be better if
there were a Directory.deleteAllDocuments() or similar. Did you have
some other approach in mind?

As I said, Lucene's design doesn't seem to fit this streaming use case
pattern well. In *this* scenario one could easily do without any
locking, and without byte level organization in RAMDirectory and
RAMFile, etc because a single small string isn't a large persistent
multi-document index.

For some background, here's a small example for the kind of XQuery
functionality Nux/Lucene integration enables:

(: An XQuery that finds all books authored by James that have something
to do with "fish", sorted by relevance :)
declare namespace lucene = "java:nux.xom.xquery.XQueryUtil";
declare variable $query := "fish*~";

for $book in /books/book[author="James" and lucene:match(string(.),
$query) > 0.0]
let $score := lucene:match(string($book), $query)
order by $score descending
return (<score>{$score}</score>, $book)

More interestingly one can use this for classifying and routing XML
messages based on rules (i.e. queries) inspecting their content...

Any other clues about potential improvements would be greatly
appreciated.

Wolfgang.

On Apr 13, 2005, at 10:09 PM, Otis Gospodnetic wrote:

It looks like you are calling that IndexWriter code in some loops, opening it and closing it in every iteration of the loop and also calling optimize. All of those things could be improved. Keep your IndexWriter open, don't close it, and optimize the index only once you are done adding documents to it.

See the highlights and the snippets in the first hit:
  http://www.lucenebook.com/search?query=when+to+optimize

Otis


--- Wolfgang Hoschek <[EMAIL PROTECTED]> wrote:

Hi,

I'm wondering if anyone could let me know how to improve Lucene
performance for "streaming main memory indexing of single strings".
This would help to effectively integrate Lucene with the Nux XQuery
engine.

Below is a small microbenchmark simulating STREAMING XQuery fulltext
search as typical for XML network routers, message queuing systems, P2P
networks, etc. In this on-the-fly main memory indexing scenario, each
individual string is immediately matched as soon as it becomes
available without any persistence involved. This usage scenario and
corresponding performance profile is quite different in comparison to
fulltext search over persistent (read-mostly) indexes.

The benchmark runs at some 3000 lucene queries/sec (lucene-1.4.3), which
is unfortunate news considering the XQuery engine can easily walk
hundreds of thousands of XML nodes per second. Ideally I'd like to run
at some 100000 queries/sec. Running this through the JDK 1.5 profiler,
it seems that most time is spent in and below the following calls:

writer = new IndexWriter(dir, analyzer, true);
writer.addDocument(...);
writer.close();

I tried quite a few variants of the benchmark with various options,
unfortunately with little or no effect.
Lucene just does not seem to be designed to do this sort of "transient
single string index" thing. All code paths related to opening, closing,
reading, writing, querying and object creation seem to be designed for
large persistent indexes.

Any advice on what I'm missing or what could be done about it would be
greatly appreciated.

Wolfgang.

P.S. the benchmark code is attached as a file below:

package nux.xom.pool;


import java.io.IOException;
//import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
//import org.apache.lucene.analysis.LowerCaseTokenizer;
//import org.apache.lucene.analysis.PorterStemFilter;
//import org.apache.lucene.analysis.SimpleAnalyzer;
//import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
//import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.queryParser.ParseException;
import org.apache.lucene.queryParser.QueryParser;
import org.apache.lucene.search.Hits;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.Searcher;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;

public final class LuceneMatcher { // TODO: make non-public

        private final Analyzer analyzer;
//      private final Directory dir = new RAMDirectory();

        public LuceneMatcher() {
                this(new StandardAnalyzer());
//              this(new SimpleAnalyzer());
//              this(new StopAnalyzer());
//              this(new Analyzer() {
//                      public final TokenStream tokenStream(String fieldName, Reader reader) {
//                              return new PorterStemFilter(new LowerCaseTokenizer(reader));
//                      }
//              });
        }

        public LuceneMatcher(Analyzer analyzer) {
                if (analyzer == null)
                        throw new IllegalArgumentException("analyzer must not be null");
                this.analyzer = analyzer;
        }

        public Query parseQuery(String expression) throws ParseException {
                QueryParser parser = new QueryParser("content", analyzer);
//              parser.setPhraseSlop(0);
                return parser.parse(expression);
        }

        /**
         * Returns the relevance score by matching the given index against the given
         * Lucene query expression. The index must not contain more than one Lucene
         * "document" (aka string to be searched).
         */
        public float match(Directory index, Query query) {
                Searcher searcher = null;
                try {
                        searcher = new IndexSearcher(index);
                        Hits hits = searcher.search(query);
                        float score = hits.length() > 0 ? hits.score(0) : 0.0f;
                        return score;
                } catch (IOException e) { // should never happen (RAMDirectory)
                        throw new RuntimeException(e);
                } finally {
                        try {
                                if (searcher != null) searcher.close();
                        } catch (IOException e) { // should never happen (RAMDirectory)
                                throw new RuntimeException(e);
                        }
                }
        }

//      public float match(String text, Query query) {
//              return match(createIndex(text), query);
//      }

        public Directory createIndex(String text) {
                Directory dir = new RAMDirectory();
                IndexWriter writer = null;
                try {
                        writer = new IndexWriter(dir, analyzer, true);
//                      writer.setUseCompoundFile(false);
//                      writer.mergeFactor = 2;
//                      writer.minMergeDocs = 1;
//                      writer.maxMergeDocs = 1;

                        writer.addDocument(createDocument(text));
//                      writer.optimize();
                        return dir;
                } catch (IOException e) { // should never happen (RAMDirectory)
                        throw new RuntimeException(e);
                } finally {
                        try {
                                if (writer != null) writer.close();
                        } catch (IOException e) { // should never happen (RAMDirectory)
                                throw new RuntimeException(e);
                        }
                }
        }

        private Document createDocument(String content) {
                Document doc = new Document();
                doc.add(Field.UnStored("content", content));
//              doc.add(Field.Text("x", content));
                return doc;
        }

        /**
         * Lucene microbenchmark simulating STREAMING XQuery fulltext search as
         * typical for XML network routers, message queuing systems, P2P networks,
         * etc. In this on-the-fly main memory indexing scenario, each individual
         * string is immediately matched as soon as it becomes available without any
         * persistence involved. This usage scenario and corresponding performance
         * profile is quite different in comparison to fulltext search over
         * persistent (read-mostly) indexes.
         *
         * Example XPath: count(/table/row[lucene:match(string(./firstname),
         * "James") > 0.0])
         */
        public static void main(String[] args) throws Exception {
                int k = -1;
                int runs = 5;
                if (args.length > ++k) runs = Integer.parseInt(args[k]);

                int nodes = 10000;
                if (args.length > ++k) nodes = Integer.parseInt(args[k]);

                String content = "James is out in the woods";
                if (args.length > ++k) content = args[k];

                String expression = "James";
                if (args.length > ++k) expression = args[k];

                LuceneMatcher matcher = new LuceneMatcher();
                Query query = matcher.parseQuery(expression); // to be reused N times

                for (int r = 0; r < runs; r++) {
                        long start = System.currentTimeMillis();
                        int matches = 0;

                        for (int i = 0; i < nodes; i++) {
//                              if (LuceneUtil.match(content + i, expression) > 0.0f) {
                                if (matcher.match(matcher.createIndex(content + i), query) > 0.0f) {
                                        matches++;
                                }
                        }

                        long end = System.currentTimeMillis();
                        System.out.println("matches=" + matches);
                        System.out.println("secs=" + ((end-start) / 1000.0f));
                        System.out.println("queries/sec=" + (nodes / ((end-start) / 1000.0f)));
                        System.out.println();
                }
        }
}

---------------------------------------------------------------------

To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]
