rmuir commented on code in PR #850:
URL: https://github.com/apache/lucene/pull/850#discussion_r860631960
##########
lucene/test-framework/src/java/org/apache/lucene/tests/analysis/MockTokenizer.java:
##########
@@ -66,11 +67,11 @@ public class MockTokenizer extends Tokenizer {
* Limit the default token length to a size that doesn't cause random analyzer failures on
* unpredictable data like the enwiki data set.
*
- * <p>This value defaults to {@code CharTokenizer.DEFAULT_MAX_WORD_LEN}.
+ * <p>This value defaults to {@link IndexWriter#MAX_TERM_LENGTH}.
*
* @see "https://issues.apache.org/jira/browse/LUCENE-10541"
*/
- public static final int DEFAULT_MAX_TOKEN_LENGTH = 255;
+  public static final int DEFAULT_MAX_TOKEN_LENGTH = IndexWriter.MAX_TERM_LENGTH;
Review Comment:
> We can make it safe - IndexWriter.MAX_TERM_LENGTH/2?
I just checked: it does currently count UTF-16 code units. This tokenizer is quite
slow and processes a single codepoint at a time, but the length check looks at
String.length(). Whether that is strictly correct or not, it is a check that
should work.
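For context on why halving might be proposed as a safety margin (a standalone sketch, not code from this PR): Java's String.length() counts UTF-16 code units, so a single supplementary-plane codepoint already counts as 2 toward a length check, while it encodes to 4 bytes in UTF-8. The class name below is made up for illustration.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitDemo {
  public static void main(String[] args) {
    // U+1D11E (MUSICAL SYMBOL G CLEF) lies outside the BMP:
    // one codepoint, but a surrogate pair (two chars) in a Java String.
    String clef = new String(Character.toChars(0x1D11E));

    // String.length() counts UTF-16 code units, not codepoints.
    System.out.println(clef.length());                               // 2
    System.out.println(clef.codePointCount(0, clef.length()));       // 1
    // The same codepoint occupies 4 bytes when encoded as UTF-8.
    System.out.println(clef.getBytes(StandardCharsets.UTF_8).length); // 4
  }
}
```

So a String.length()-based limit and a UTF-8 byte-based limit like IndexWriter.MAX_TERM_LENGTH can diverge by up to a factor of two for supplementary characters, which is where a MAX_TERM_LENGTH/2 bound would be conservative.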
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]