Adrien Grand created LUCENE-8779:
------------------------------------

             Summary: MinHashFilter generates invalid terms
                 Key: LUCENE-8779
                 URL: https://issues.apache.org/jira/browse/LUCENE-8779
             Project: Lucene - Core
          Issue Type: Improvement
            Reporter: Adrien Grand

This problem was reported at https://github.com/elastic/elasticsearch/issues/41556: MinHashFilter computes a hash and then folds its bits into the chars of the term. However, this can generate invalid terms, e.g. terms that end with a high surrogate that has no matching low surrogate. This doesn't trigger exceptions at index time because we are lenient with unmatched surrogates when converting to a binary term: they are silently replaced with the substitution character (U+FFFD).

{code}
} else {
  // surrogate pair
  // confirm valid high surrogate
  if (code < 0xDC00 && (i < end-1)) {
    int utf32 = (int) s.charAt(i+1);
    // confirm valid low surrogate and write pair
    if (utf32 >= 0xDC00 && utf32 <= 0xDFFF) {
      utf32 = (code << 10) + utf32 + SURROGATE_OFFSET;
      i++;
      out[upto++] = (byte)(0xF0 | (utf32 >> 18));
      out[upto++] = (byte)(0x80 | ((utf32 >> 12) & 0x3F));
      out[upto++] = (byte)(0x80 | ((utf32 >> 6) & 0x3F));
      out[upto++] = (byte)(0x80 | (utf32 & 0x3F));
      continue;
    }
  }
  // replace unpaired surrogate or out-of-order low surrogate
  // with substitution character
  out[upto++] = (byte) 0xEF;
  out[upto++] = (byte) 0xBF;
  out[upto++] = (byte) 0xBD;
}
{code}
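
To make the failure mode concrete, here is a minimal JDK-only sketch of how folding hash bits into chars can end a term with an unpaired high surrogate. It is not the MinHashFilter code: the class name, the foldToChars helper and its layout of 16 bits per char are assumptions made for illustration, and hasUnpairedSurrogate is a hypothetical check, not an existing Lucene API.

```java
import java.nio.charset.StandardCharsets;

final class SurrogateFoldingSketch {

  /** Hypothetical folding: spreads the 64 bits of a hash over four 16-bit chars. */
  static String foldToChars(long hash) {
    char[] chars = new char[4];
    for (int i = 0; i < 4; i++) {
      // take 16 bits at a time, most significant first
      chars[i] = (char) (hash >>> (48 - 16 * i));
    }
    return new String(chars);
  }

  /** Returns true if s contains a surrogate that is not part of a valid high/low pair. */
  static boolean hasUnpairedSurrogate(CharSequence s) {
    for (int i = 0; i < s.length(); i++) {
      char c = s.charAt(i);
      if (Character.isHighSurrogate(c)) {
        if (i + 1 < s.length() && Character.isLowSurrogate(s.charAt(i + 1))) {
          i++; // skip the low half of a valid pair
        } else {
          return true; // high surrogate at end of input or followed by a non-low char
        }
      } else if (Character.isLowSurrogate(c)) {
        return true; // low surrogate with no preceding high surrogate
      }
    }
    return false;
  }

  public static void main(String[] args) {
    // The low 16 bits of this hash happen to fall in the high-surrogate range.
    long hash = 0x0041_0042_0043_D83DL;
    String term = foldToChars(hash);
    System.out.println(hasUnpairedSurrogate(term));
    // A strict JDK encoder refuses the term rather than substituting U+FFFD.
    System.out.println(StandardCharsets.UTF_8.newEncoder().canEncode(term));
  }
}
```

Any 16-bit slice of the hash that lands in 0xD800..0xDFFF is a candidate for producing such a term, so a check like this could serve as a test assertion on the filter's output while a fix is worked out.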