Might be nice to add a line of documentation to the highlighter on the possible performance hit if one uses StandardAnalyzer, which is probably a common case. Thanks for the speedy response.
-M

On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really limited by JavaCC speed. You cannot shave much more performance out of the grammar, as it is already about as simple as it gets.

You should first see if you can get away without it and use a different Analyzer, or if you can re-implement just the functionality you need in a custom Analyzer. Do you really need the support for abbreviations, companies, email addresses, etc.?

If so: you can use the TokenSources class in the highlighter package to rebuild a TokenStream without re-analyzing if you store term offsets and positions in the index. I have not found this to be super beneficial, even when using the StandardAnalyzer to re-analyze, but it certainly could be faster if you have large enough documents.

Your best bet is probably to use https://issues.apache.org/jira/browse/LUCENE-644, which is a non-positional Highlighter that finds offsets to highlight by looking up query term offset information in the index. For larger documents this can be much faster than using the standard contrib Highlighter, even if you're using TokenSources. LUCENE-644 has a much flatter curve than the contrib Highlighter as document size goes up.

- Mark

Michael Stoppelman wrote:
> Hi all,
>
> I was tracking down slowness in the contrib highlighter code and it seems
> the seemingly simple tokenStream.next() is the culprit.
> I've seen multiple posts about this being a possible cause. Has anyone
> looked into how to speed up StandardTokenizer? For my
> documents it's taking about 70ms per document that's a big ugh! I was
> thinking I might just cache the TermVectors in memory if
> that will be faster. Anyone have another approach to solving this
> problem?
>
> -M
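
For readers of the archive, here is a minimal sketch of the idea behind highlighting from stored offsets, as Mark describes for LUCENE-644: tokenize once at index time, keep each term's character offsets, and at search time look up only the query terms instead of running the text through the analyzer again. This is plain Java and does not use the Lucene API or the LUCENE-644 patch. The class OffsetHighlightSketch, the Span type, the methods indexOffsets and highlight, and the in-memory map standing in for stored term vectors are all hypothetical names chosen for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only; not Lucene code.
final class OffsetHighlightSketch {

    /** Character span [start, end) of one term occurrence in the text. */
    static final class Span implements Comparable<Span> {
        final int start;
        final int end;

        Span(int start, int end) {
            this.start = start;
            this.end = end;
        }

        @Override
        public int compareTo(Span other) {
            return Integer.compare(start, other.start);
        }
    }

    /**
     * Index-time step: tokenize once (letters and digits, lowercased) and
     * record the offsets of every term. The returned map is a stand-in for
     * term offsets stored in an index.
     */
    static Map<String, List<Span>> indexOffsets(String text) {
        Map<String, List<Span>> offsets = new HashMap<>();
        int i = 0;
        int n = text.length();
        while (i < n) {
            // Skip separators.
            while (i < n && !Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            int start = i;
            // Consume one token.
            while (i < n && Character.isLetterOrDigit(text.charAt(i))) {
                i++;
            }
            if (i > start) {
                String term = text.substring(start, i).toLowerCase(Locale.ROOT);
                offsets.computeIfAbsent(term, k -> new ArrayList<>()).add(new Span(start, i));
            }
        }
        return offsets;
    }

    /**
     * Search-time step: look up the stored offsets of the query terms and
     * wrap those spans in <b> tags. The text is never re-tokenized here.
     */
    static String highlight(String text, Map<String, List<Span>> offsets,
                            Collection<String> queryTerms) {
        List<Span> hits = new ArrayList<>();
        for (String term : queryTerms) {
            List<Span> spans = offsets.get(term.toLowerCase(Locale.ROOT));
            if (spans != null) {
                hits.addAll(spans);
            }
        }
        Collections.sort(hits);

        StringBuilder out = new StringBuilder();
        int pos = 0;
        for (Span s : hits) {
            if (s.start < pos) {
                continue; // skip duplicate or overlapping spans
            }
            out.append(text, pos, s.start)
               .append("<b>").append(text, s.start, s.end).append("</b>");
            pos = s.end;
        }
        out.append(text, pos, text.length());
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Fast highlighting needs fast offsets.";
        Map<String, List<Span>> stored = indexOffsets(text);
        System.out.println(highlight(text, stored, Collections.singletonList("fast")));
    }
}
```

The search-time cost here depends on the number of query-term occurrences rather than on the length of the document, which is one plausible reading of why an offset-lookup approach would scale more gently with document size than re-analysis does.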