I profiled Lucene indexing using the Java profiling option. While indexing, it appears to spend around 80% of its time in StandardTokenizerManager.java, and it uses practically all the CPU on my 1 GHz machine. It takes around 17 minutes to index 140 MB of plain text, which is around 8 MB/minute. I think that is too slow.
That is especially true when we just want to tokenize on whitespace and some standard delimiters, so I would like to speed it up by changing the grammar. Has anybody done any tests in this area, or does anyone have other clues as to how to speed it up? I am also wondering whether we should use YACC instead of JavaCC and call it through JNI. It shouldn't matter because of HotSpot, but you never know.

-Manish
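
For illustration only, tokenizing on whitespace and a fixed set of delimiters can be done with a single hand-written pass over the text, with no generated grammar involved. The sketch below is plain Java and does not use any Lucene API. The class name `SimpleDelimiterTokenizer` and the delimiter set are assumptions made up for this example, and whether something like this would actually index faster is untested here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical example class, not part of Lucene.
public class SimpleDelimiterTokenizer {

    // Assumed delimiter set; adjust to whatever "standard delimiters" means for your data.
    private static final String DELIMITERS = ".,;:!?()[]{}\"'<>/\\-";

    private static boolean isDelimiter(char c) {
        return Character.isWhitespace(c) || DELIMITERS.indexOf(c) >= 0;
    }

    // Splits text into tokens in one pass, cutting at whitespace and delimiters.
    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        int start = -1; // index where the current token began, or -1 if not inside a token
        for (int i = 0; i < text.length(); i++) {
            if (isDelimiter(text.charAt(i))) {
                if (start >= 0) {
                    tokens.add(text.substring(start, i));
                    start = -1;
                }
            } else if (start < 0) {
                start = i;
            }
        }
        if (start >= 0) {
            tokens.add(text.substring(start)); // trailing token with no delimiter after it
        }
        return tokens;
    }

    public static void main(String[] args) {
        // prints [Lucene, indexes, plain, text, files, quickly]
        System.out.println(tokenize("Lucene indexes plain-text files, quickly!"));
    }
}
```

To try this inside Lucene, the same loop would have to be wrapped in a tokenizer class for the Lucene version in use. That wiring is left out because it depends on the analysis API of that version.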