Mikael Brännström created DRILL-8535:
----------------------------------------

             Summary: tdigest_merge cannot parse correct tdigest data
                 Key: DRILL-8535
                 URL: https://issues.apache.org/jira/browse/DRILL-8535
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.22.0
            Reporter: Mikael Brännström


The tdigest_merge SQL function round-trips its binary input through a UTF-8 string (bytes -> UTF-8 String -> bytes), which corrupts the data. Any byte value >= 0x80 that does not form part of a valid UTF-8 sequence is replaced by the Unicode replacement character U+FFFD, which is re-encoded as the three bytes [-17, -65, -67]. As a result, the call to MergingDigest.fromBytes throws parse exceptions, such as BufferUnderflowException, because the data is corrupt.

To reproduce, create a tdigest from, e.g., the single integer value 1082 with the default compression of 100. The resulting data is:
{code:java}
[0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0, 66, -56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0]
{code}
After the UTF-8 String corruption, the data becomes:
{code:java}
[0, 0, 0, 2, 64, -17, -65, -67, -17, -65, -67, 0, 0, 0, 0, 0, 64, -17, -65, -67, -17, -65, -67, 0, 0, 0, 0, 0, 66, -17, -65, -67, 0, 0, 0, -17, -65, -67, 4, 26, 0, 1, 63, -17, -65, -67, 0, 0, 68, -17, -65, -67, 64, 0]
{code}
The fix is trivial and relates to the class {{TDigestFunctions.TDigestMergeFunction}}. Line 1109 is incorrect:
{code:java}
byte[] buf = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers.toStringFromUTF8(in.start, in.end, in.buffer).getBytes(java.nio.charset.StandardCharsets.UTF_8);
{code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
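
The corruption described in this issue can be reproduced outside Drill with plain JDK classes. The following is a minimal sketch, not Drill code: the class name {{Utf8RoundTripDemo}} and the method {{roundTrip}} are hypothetical, and it assumes that {{StringFunctionHelpers.toStringFromUTF8}} decodes the buffer in a way equivalent to {{new String(bytes, StandardCharsets.UTF_8)}}.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical stand-alone demo; not part of Apache Drill.
public class Utf8RoundTripDemo {

    // Serialized digest bytes quoted in the issue (value 1082, compression 100).
    static final byte[] ORIGINAL = {
        0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0,
        66, -56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0
    };

    // Mimics the lossy pattern from the issue: bytes -> UTF-8 String -> bytes.
    // Assumption: Drill's helper decodes like new String(raw, UTF_8).
    static byte[] roundTrip(byte[] raw) {
        String decoded = new String(raw, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] corrupted = roundTrip(ORIGINAL);
        System.out.println("original  (" + ORIGINAL.length + " bytes): "
                + Arrays.toString(ORIGINAL));
        System.out.println("corrupted (" + corrupted.length + " bytes): "
                + Arrays.toString(corrupted));
        System.out.println("identical: " + Arrays.equals(ORIGINAL, corrupted));
    }
}
```

Under that assumption, each of the eight bytes >= 0x80 in the input is malformed as UTF-8 and is replaced by U+FFFD, so the 38-byte payload grows to 54 bytes and should match the corrupted array quoted in the issue. Pure 7-bit input survives the round trip unchanged, which may be why the problem does not show up with every digest.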