[
https://issues.apache.org/jira/browse/LUCENE-6993?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15172324#comment-15172324
]
Mike Drob commented on LUCENE-6993:
-----------------------------------
I think I am getting to a good place here, with just a few more issues where I
need some additional direction --
{code}
/**
 * Sets the scanner buffer size in chars
 */
public final void setBufferSize(int numChars) {
  ZZ_BUFFERSIZE = numChars;
  char[] newZzBuffer = new char[ZZ_BUFFERSIZE];
  System.arraycopy(zzBuffer, 0, newZzBuffer, 0,
      Math.min(zzBuffer.length, ZZ_BUFFERSIZE));
  zzBuffer = newZzBuffer;
}
{code}
This is code that we inject directly from our jflex templates, not generated
code. True to their promises, ZZ-prefixed items in jflex are subject to change,
and ZZ_BUFFERSIZE has become final between the old and new versions, so the
assignment above no longer compiles. We could fix this with an additional
post-processing step that strips the final modifier, or by changing jflex to
add a new constructor or something similar. It looks like all non-test callers
set the size immediately after construction.
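A third possibility, sketched below, is to leave the constant alone and resize only the instance buffer. This is a minimal illustration, not the generated scanner: SketchScanner and bufferSize() are hypothetical names, the 16384 default is made up for the example, and the sketch assumes the generated code consults zzBuffer.length rather than ZZ_BUFFERSIZE after construction, which I have not verified against the new jflex output.

```java
import java.util.Arrays;

/** Hypothetical stand-in for a JFlex-generated scanner; not real generated code. */
final class SketchScanner {
  /** Mirrors the now-final generated constant; the value is illustrative only. */
  private static final int ZZ_BUFFERSIZE = 16384;

  /** Instance buffer, initially sized from the constant. */
  private char[] zzBuffer = new char[ZZ_BUFFERSIZE];

  /**
   * Sets the scanner buffer size in chars without touching ZZ_BUFFERSIZE.
   * Existing content is kept up to the new length, as in the template code.
   */
  public void setBufferSize(int numChars) {
    if (numChars <= 0) {
      throw new IllegalArgumentException("numChars must be positive: " + numChars);
    }
    // Arrays.copyOf truncates or zero-pads to exactly numChars elements.
    zzBuffer = Arrays.copyOf(zzBuffer, numChars);
  }

  /** Hypothetical accessor, added only so the sketch can be checked. */
  int bufferSize() {
    return zzBuffer.length;
  }
}
```

If the generated code does read ZZ_BUFFERSIZE elsewhere (for example when refilling), this approach would not be enough on its own and the post-processing step would still be needed.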
bq. Also, there are several effectively-pinned character sets (for CJK and
Thai) that are hard-coded in the grammar, and don't include any supplementary
characters at all. If the Unicode version changes, these will need to be moved
to use the appropriate Unicode properties instead.
Currently ClassicTokenizer has {{THAI = \[\u0E00-\u0E59]; ALPHANUM =
(\{LETTER}\|\{THAI}|\[:digit:])+;}}. If I understand the Unicode spec correctly,
with Unicode 8.0 we can remove the THAI declaration and those characters would
still be included in LETTER, but I have near-zero confidence in this.
Alternatively, leaving it as is should be fine, because the assigned THAI
characters have not gone outside of that range.
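One rough way to test the first option is to list the code points in the hard-coded range that are neither letters nor digits. The sketch below does this with the JDK's own tables, so it reflects whatever Unicode version the running JDK ships, not jflex's Unicode 8.0 data. ThaiRangeCheck and its method are hypothetical names, and treating Character.isLetter and Character.isDigit as stand-ins for the grammar's letter and digit classes is an assumption.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; uses the JDK's Unicode tables, not jflex's. */
final class ThaiRangeCheck {
  /**
   * Returns the defined code points in [first, last] that the JDK classifies
   * as neither a letter nor a digit. Anything returned here would stop
   * matching ALPHANUM if the explicit THAI range were dropped (under the
   * assumption stated in the text above).
   */
  static List<Integer> definedNonLetterNonDigit(int first, int last) {
    List<Integer> result = new ArrayList<>();
    for (int cp = first; cp <= last; cp++) {
      if (Character.isDefined(cp)
          && !Character.isLetter(cp)
          && !Character.isDigit(cp)) {
        result.add(cp);
      }
    }
    return result;
  }
}
```

Calling ThaiRangeCheck.definedNonLetterNonDigit(0x0E00, 0x0E59) and printing each entry with Character.getName would show what is at stake. Combining vowel signs such as U+0E31 have general category Mn rather than a letter category, so they would appear in that list.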
For CJK, we have a special call-out for CJ, but K was apparently already
included in LETTER? I don't understand the relationship between ALPHANUM,
{{\p\{Letter\}}}, and CJK.
> Update UAX29URLEmailTokenizer TLDs to latest list, and upgrade all
> JFlex-based tokenizers to support Unicode 8.0
> ----------------------------------------------------------------------------------------------------------------
>
> Key: LUCENE-6993
> URL: https://issues.apache.org/jira/browse/LUCENE-6993
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/analysis
> Reporter: Mike Drob
> Assignee: Robert Muir
> Fix For: 6.0
>
> Attachments: LUCENE-6993.patch, LUCENE-6993.patch, LUCENE-6993.patch,
> LUCENE-6993.patch, LUCENE-6993.patch
>
>
> We did this once before in LUCENE-5357, but it might be time to update the
> list of TLDs again. Comparing our old list with a new list indicates 800+ new
> domains, so it would be nice to include them.
> Also the JFlex tokenizer grammars should be upgraded to support Unicode 8.0.