rmuir commented on code in PR #850:
URL: https://github.com/apache/lucene/pull/850#discussion_r860631960
##########
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
##########
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
* Limit the default token length to a size that doesn't cause random analyzer failures on
* unpredictable data like the enwiki data set.
*
- * <p>This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+ * <p>This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
*
* @see "https://issues.apache.org/jira/browse/LUCENE-10541"
*/
- public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;
Review Comment:
> We can make it safe - IndexWriter.MAX_TERM_LENGTH/2?
I just checked: it does currently count UTF-16 code units. This tokenizer is quite
slow and processes a single codepoint at a time, but the length check looks at
String.length(). Whether that is strictly correct or not, it is a check that
should work.
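For context on why halving might be proposed as a safety margin (a standalone sketch, not code from this PR): Java's String.length() counts UTF-16 code units, so a single supplementary-plane codepoint already counts as 2 toward a length check, while it encodes to 4 bytes in UTF-8. The class name below is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
  public static void main(String[] args) {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP:
    // one codepoint, but a surrogate pair (two chars) in a Java String.
    String clef = new String(Character.toChars(0x1D11E));

    // String.length() counts UTF-16 code units, not codepoints.
    System.out.println(clef.length());                               // 2
    System.out.println(clef.codePointCount(0, clef.length()));       // 1
    // The same codepoint occupies 4 bytes when encoded as UTF-8.
    System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4
  }
}
```

So a String.length()-based limit and a UTF-8 byte-based limit like IndexWriter.MAX_TERM_LENGTH can diverge by up to a factor of two for supplementary characters, which is where a MAX_TERM_LENGTH/2 bound would be conservative.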
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]