[ 
https://issues.apache.org/jira/browse/LUCENE-4291?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13429203#comment-13429203
 ] 

Robert Muir commented on LUCENE-4291:
-------------------------------------

{quote}
For tokenizers, the buffer needs to be able to hold a token (and its trailing 
context, if lookahead is used), but nothing more. 16k tokens are likely 
extremely rare. 4k seems reasonable to me - it's still way bigger than most 
people are likely to hit over normal text input.
{quote}

Yes, I think it's reasonable too, especially since maxTokenLength is 255 by 
default.
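
To put rough numbers on the difference, here is a minimal sketch of the arithmetic, assuming one scanner buffer per thread per field (as the issue description says of Solr) and 2 bytes per Java char. The class name, method name, and the thread and field counts are hypothetical and chosen only for illustration; array header overhead is ignored.

```java
public class BufferFootprint {
    // Approximate bytes held by scanner buffers when one char[bufferChars]
    // is kept alive per thread per field. A Java char is 2 bytes;
    // array header overhead is ignored.
    static long bufferBytes(int threads, int fields, int bufferChars) {
        return (long) threads * fields * bufferChars * Character.BYTES;
    }

    public static void main(String[] args) {
        int threads = 100; // hypothetical thread count
        int fields = 50;   // hypothetical number of analyzed fields

        long current = bufferBytes(threads, fields, 16384); // current jflex default
        long proposed = bufferBytes(threads, fields, 4096); // proposed size

        System.out.println("char[16384]: " + current + " bytes");
        System.out.println("char[4096]:  " + proposed + " bytes");
    }
}
```

Under these assumed counts the buffers shrink by a factor of four, while a 4096-char buffer is still far larger than the default 255-char maxTokenLength.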

{quote}
HTMLStripCharFilter is a bit different, since it buffers HTML constructs rather 
than tokens. In the face of malformed input (e.g. an opening angle bracket '<' 
with no closing angle bracket '>'), the scanner might buffer the entire 
remaining input. In contrast, LegacyHTMLStripCharFilter, the pre-JFlex 
implementation, limits this kind of buffering, to 8k max chars IIRC.
{quote}

OK, I can leave this one alone. We can revisit it if we can make CharFilters 
reusable (not simple to do cleanly today). It's not as much of an issue, since 
nothing is hanging on to it.

I'll work up a patch.
                
> consider reducing jflex buffer sizes
> ------------------------------------
>
>                 Key: LUCENE-4291
>                 URL: https://issues.apache.org/jira/browse/LUCENE-4291
>             Project: Lucene - Core
>          Issue Type: Task
>          Components: modules/analysis
>            Reporter: Robert Muir
>
> Spinoff from SOLR-3684.
> Most Lucene tokenizers have some buffer size, e.g. in 
> CharTokenizer/ICUTokenizer it's char[4096].
> But the jflex tokenizers use char[16384] by default, which seems overkill. 
> I'm not sure we really see any performance bonus by having such a huge buffer 
> size as a default.
> There is a jflex parameter to set this: I think we should consider reducing 
> it.
> In a configuration like Solr, tokenizers are reused per-thread-per-field,
> so these can easily stack up in RAM.
> Additionally, CharFilters are not reused, so the configuration in e.g.
> HTMLStripCharFilter might not be great, since it's per-document garbage.
