I realized soon after I sent the message that this was the case, and I knew somebody would quickly point it out :) Still, if improving a piece of code costs nothing, why not do it :)
I changed my code locally to use HashMap. I actually started with HashSet, but with Sets one can't do set.get(object) :(

Anyhow, yes, there are bigger things to fix.

Otis

--- Brian Goetz <[EMAIL PROTECTED]> wrote:
> > > > I decided to run a little Lucene app that does some indexing
> > > > under a profiler. (I used JMP, http://www.khelekore.org/jmp/,
> > > > a rather simple one).
> > > >
> > > > The app uses StandardAnalyzer.
> > > > I've noticed that a lot of time is spent in StandardTokenizer
> > > > and various JavaCC-generated methods.
> > > > I am wondering if anyone tried replacing StandardTokenizer.jj
> > > > with something more efficient?
> > > >
> > > > Also, StopFilter is using a Hashtable to store the list of
> > > > stop words.
> > > > Has anyone tried using HashMap instead?
>
> HashMap is certainly a higher-performance choice, so long as the map
> is static for the duration of its lifetime and built in the
> constructor. Otherwise, you could run afoul of thread-safety issues.
> And HashSet uses less memory.
>
> But the bigger point is one that Doug convinced me of only after I
> went on a mad micro-optimization tear earlier in the project (Sorry,
> Doug, you were right) -- and that is that for the most part,
> tokenization is a very very small part of the total work done by the
> system. Tokenization gets done once for each document, whereas the
> document gets merged, searched, and queried many times. Time spent
> tweaking tokenizers for performance is likely wasted effort; that
> time could probably be much better spent improving the code in much
> more useful ways.
>
> Sure, StandardTokenizer is slow. But that tokenization effort gets
> spread over the many times the document is searched. Even if it does
> a 1% better job at tokenizing, that might be worth a 100x increase in
> tokenizing time. I think any effort you want to spend tweaking
> tokenizers would be much better spent doing a better job of
> tokenization and preprocessing (stemming, dealing intelligently with
> non-letters and word breaks, format stripping) than on performance
> tweaks.
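
For anyone following the Hashtable vs. HashMap vs. HashSet part of this thread, here is a minimal sketch of the idea, not the actual StopFilter code. The class StopWordTable and its methods isStopWord and canonicalForm are hypothetical names made up for illustration, and the sketch assumes a Java version with generics. Both collections are filled in the constructor and never modified afterwards, which is the condition Brian gives for dropping Hashtable's synchronization. It also shows that contains() covers a plain membership test on a Set, while a Map is only needed if you want the stored object back via get().

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical helper for illustration only; this is NOT Lucene's StopFilter.
final class StopWordTable {
    // Both collections are built once in the constructor and never changed,
    // so they can be read from several threads without synchronization.
    private final Set<String> words;
    private final Map<String, String> canonical;

    StopWordTable(String[] stopWords) {
        Set<String> set = new HashSet<>(Arrays.asList(stopWords));
        Map<String, String> map = new HashMap<>();
        for (String word : stopWords) {
            map.put(word, word); // key and value are the same stored instance
        }
        // Wrap in unmodifiable views so later code cannot mutate them by accident.
        this.words = Collections.unmodifiableSet(set);
        this.canonical = Collections.unmodifiableMap(map);
    }

    // Membership test: a Set has no get(), but contains() is enough here.
    boolean isStopWord(String term) {
        return words.contains(term);
    }

    // Returns the stored instance for the term, or null if it is not a stop word.
    // This is the one thing that needs a Map rather than a Set.
    String canonicalForm(String term) {
        return canonical.get(term);
    }
}
```

If the filter only ever asks "is this term a stop word?", the Set alone should be enough and the Map can be dropped.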
