> > > I decided to run a little Lucene app that does some indexing under a
> > > profiler. (I used JMP, http://www.khelekore.org/jmp/, a rather simple
> > > one).
> > >
> > > The app uses StandardAnalyzer. I've noticed that a lot of time is spent
> > > in StandardTokenizer and various JavaCC-generated methods. I am
> > > wondering if anyone tried replacing StandardTokenizer.jj with something
> > > more efficient?
> > >
> > > Also, StopFilter is using a Hashtable to store the list of stop words.
> > > Has anyone tried using HashMap instead?

HashMap is certainly a higher-performance choice, so long as the map is built in the constructor and stays static for the rest of its lifetime. Otherwise, you could run afoul of thread-safety issues. And HashSet uses less memory.

But the bigger point is one that Doug convinced me of only after I went on a mad micro-optimization tear earlier in the project (sorry, Doug, you were right) -- and that is that for the most part, tokenization is a very small part of the total work done by the system. Tokenization gets done once for each document, whereas the document gets merged, searched, and queried many times. Time spent tweaking tokenizers for performance is likely wasted effort; that time could probably be much better spent improving the code in more useful ways.

Sure, StandardTokenizer is slow. But that tokenization effort gets spread over the many times the document is searched. Even if a slower tokenizer does only a 1% better job at tokenizing, that might be worth a 100x increase in tokenizing time.

I think any effort you want to spend tweaking tokenizers would be much better spent doing a better job of tokenization and preprocessing (stemming, dealing intelligently with non-letters and word breaks, format stripping) than on performance tweaks.
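
To make the thread-safety condition above concrete, here is a minimal sketch of a stop-word set that is filled in the constructor and never modified afterwards. StopWordSet and isStopWord are hypothetical names used only for illustration; this is not Lucene's StopFilter, and the generics syntax assumes a Java 5 or later compiler.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Lucene.
final class StopWordSet {
    // Final field, fully populated before the constructor returns,
    // so readers on other threads see a complete, unchanging set.
    private final Set<String> words;

    StopWordSet(String[] stopWords) {
        // Unsynchronized HashSet: safe here because it is only written
        // inside the constructor and only read afterwards.
        Set<String> s = new HashSet<String>(stopWords.length * 2);
        for (String w : stopWords) {
            s.add(w);
        }
        // Wrap the set so later code cannot mutate it by accident.
        this.words = Collections.unmodifiableSet(s);
    }

    // Membership test with no locking, unlike Hashtable.get().
    boolean isStopWord(String token) {
        return words.contains(token);
    }
}
```

Something like new StopWordSet(new String[] {"the", "a", "and"}).isStopWord("the") would then return true without taking a lock on each lookup. If the set ever needed to change after construction, this approach would no longer be safe without synchronization.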
