I think it goes without saying that a semi-complex NFA or DFA is going
to be quite a bit slower than, say, breaking on whitespace. Not that I am
against such a warning.

To support my point about writing a custom solution that more exactly
fits your needs: if you just remove the <NUM> recognizer in
StandardTokenizer.jj, you gain 20-25% speed in my tests on both small and
large documents. Limiting what is considered a letter to just the
languages/encodings you need might also get some good returns.
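For illustration only, here is a minimal sketch of that kind of pared-down
analyzer (the class name is just a placeholder, and it assumes plain ASCII
letters are all you care about; no <NUM>, <ACRONYM>, <EMAIL>, etc.):

import java.io.Reader;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.CharTokenizer;
import org.apache.lucene.analysis.TokenStream;

/** Hypothetical analyzer: tokens are runs of ASCII letters, lowercased. */
public class AsciiLetterAnalyzer extends Analyzer {
    public TokenStream tokenStream(String fieldName, Reader reader) {
        return new CharTokenizer(reader) {
            // Limit what counts as a letter to plain ASCII.
            protected boolean isTokenChar(char c) {
                return (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            }
            // Lowercase in the tokenizer so no extra filter pass is needed.
            protected char normalize(char c) {
                return Character.toLowerCase(c);
            }
        };
    }
}

Something like this skips the whole JavaCC state machine, at the cost of
losing the number/acronym/email handling StandardTokenizer gives you.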
- Mark
Michael Stoppelman wrote:
Might be nice to add a line of documentation to the highlighter about the
possible performance hit when one uses StandardAnalyzer, which is probably
a common case.
Thanks for the speedy response.
-M
On 7/18/07, Mark Miller <[EMAIL PROTECTED]> wrote:
Unfortunately, StandardAnalyzer is slow. StandardAnalyzer is really
limited by JavaCC speed. You cannot shave much more performance out of
the grammar as it is already about as simple as it gets. You should
first see if you can get away without it and use a different Analyzer,
or if you can re-implement just the functionality you need in a custom
Analyzer. Do you really need the support for abbreviations, companies,
email addresses, etc.?
If so:
You can use the TokenSources class in the highlighter package to rebuild
a TokenStream without re-analyzing if you store term offsets and
positions in the index. I have not found this to be super beneficial,
even compared to re-analyzing with the StandardAnalyzer, but it certainly
could be faster if you have large enough documents.
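For what it's worth, that path looks roughly like the sketch below against
the 2.x contrib API (the "body" field name and the helper class are made
up for this example):

import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.TokenSources;

public class VectorHighlightSketch {

    /** Index time: store the text plus term vectors with positions/offsets. */
    public static void addBodyField(Document doc, String text) {
        doc.add(new Field("body", text, Field.Store.YES, Field.Index.TOKENIZED,
                Field.TermVector.WITH_POSITIONS_OFFSETS));
    }

    /** Highlight time: rebuild the TokenStream from the stored vectors
     *  rather than re-analyzing the stored text with StandardAnalyzer. */
    public static String highlight(IndexReader reader, int docId, Query query,
                                   Analyzer fallbackAnalyzer) throws IOException {
        Document doc = reader.document(docId);
        String text = doc.get("body");
        TokenStream ts =
                TokenSources.getAnyTokenStream(reader, docId, "body", fallbackAnalyzer);
        Highlighter highlighter = new Highlighter(new QueryScorer(query));
        return highlighter.getBestFragment(ts, text);
    }
}

The key bit is WITH_POSITIONS_OFFSETS at index time; without the stored
vectors, getAnyTokenStream just falls back to re-analyzing with whatever
analyzer you hand it.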
Your best bet is probably to use
https://issues.apache.org/jira/browse/LUCENE-644, which is a
non-positional Highlighter that finds offsets to highlight by looking up
query term offset information in the index. For larger documents this
can be much faster than using the standard contrib Highlighter, even if
you're using TokenSources. LUCENE-644 has a much flatter curve than the
contrib Highlighter as document size goes up.
- Mark
Michael Stoppelman wrote:
> Hi all,
>
> I was tracking down slowness in the contrib highlighter code, and the
> seemingly simple tokenStream.next() call is the culprit.
> I've seen multiple posts about this being a possible cause. Has anyone
> looked into how to speed up StandardTokenizer? For my documents it's
> taking about 70ms per document, which is a big ugh! I was thinking I might
> just cache the TermVectors in memory if that will be faster. Anyone have
> another approach to solving this problem?
>
> -M
>
---------------------------------------------------------------------
To unsubscribe, e-mail: [EMAIL PROTECTED]
For additional commands, e-mail: [EMAIL PROTECTED]