[
https://issues.apache.org/jira/browse/LUCENE-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429203#comment-13429203
]
Robert Muir commented on LUCENE-4291:
-------------------------------------
{quote}
For tokenizers, the buffer needs to be able to hold a token (and its trailing
context, if lookahead is used), but nothing more. 16k tokens are likely
extremely rare. 4k seems reasonable to me - it's still way bigger than most
people are likely to hit over normal text input.
{quote}
Yes, I think it's reasonable too, especially since maxTokenLength is 255 by
default.
{quote}
HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather
than tokens. In the face of malformed input (e.g. an opening angle bracket '<'
with no closing angle bracket '>'), the scanner might buffer the entire
remaining input. In contrast, LegacyHTMLStripCharFilter, the pre-JFlex
implementation, limits this kind of buffering, to 8k max chars IIRC.
{quote}
OK, I can leave this one alone. We can revisit if we can make CharFilters
reusable (not simple to do cleanly today). It's not as much of an issue since
nothing is hanging on to it.
I'll work up a patch.
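
For a rough sense of scale, here is a back-of-the-envelope sketch of what the
scan buffers alone retain when one tokenizer is kept per thread per field. The
class name, method name, and the thread and field counts are made up for
illustration; only the char[16384] and char[4096] sizes come from the issue.
It counts 2 bytes per char and ignores array headers and everything else a
tokenizer holds.

```java
public class BufferFootprint {
    // A Java char is 2 bytes; array header overhead is ignored in this estimate.
    static final int BYTES_PER_CHAR = Character.BYTES;

    /** Approximate bytes held by scan buffers with one tokenizer per thread per field. */
    public static long retainedBytes(int bufferChars, int threads, int fields) {
        return (long) bufferChars * BYTES_PER_CHAR * threads * fields;
    }

    public static void main(String[] args) {
        int threads = 100; // hypothetical number of indexing/query threads
        int fields = 50;   // hypothetical number of analyzed fields

        long jflexDefault = retainedBytes(16384, threads, fields); // current jflex default
        long reduced = retainedBytes(4096, threads, fields);       // proposed size

        System.out.printf("char[16384]: %d MiB%n", jflexDefault / (1024 * 1024));
        System.out.printf("char[4096]:  %d MiB%n", reduced / (1024 * 1024));
    }
}
```

With those made-up counts the buffers go from roughly 156 MiB to roughly 39
MiB, a factor of four, which is just the ratio of the two buffer sizes.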
> consider reducing jflex buffer sizes
> ------------------------------------
>
> Key: LUCENE-4291
> URL: https://issues.apache.org/jira/browse/LUCENE-4291
> Project: Lucene - Core
> Issue Type: Task
> Components: modules/analysis
> Reporter: Robert Muir
>
> Spinoff from SOLR-3684.
> Most Lucene tokenizers have some buffer size, e.g. in
> CharTokenizer/ICUTokenizer it's char[4096].
> But the jflex tokenizers use char[16384] by default, which seems overkill.
> I'm not sure we really see any performance bonus by having such a huge buffer
> size as a default.
> There is a jflex parameter to set this: I think we should consider reducing
> it.
> In a configuration like Solr, tokenizers are reused per-thread-per-field,
> so these can easily stack up in RAM.
> Additionally, CharFilters are not reused, so the configuration in e.g.
> HTMLStripCharFilter might not be great since it's per-document garbage.