Mikael Brännström created DRILL-8535:
----------------------------------------

             Summary: tdigest_merge cannot parse correct tdigest data
                 Key: DRILL-8535
                 URL: https://issues.apache.org/jira/browse/DRILL-8535
             Project: Apache Drill
          Issue Type: Bug
    Affects Versions: 1.22.0
            Reporter: Mikael Brännström


The tdigest_merge SQL function parses binary data via a UTF-8 string (bytes -> 
UTF-8 String -> bytes), which corrupts the data. Any byte >= 0x80 that is not 
part of a valid UTF-8 sequence is replaced by the replacement character U+FFFD, 
which encodes as the three bytes -17, -65, -67. 

The effect is that the call to MergingDigest.fromBytes throws parse exceptions, 
such as BufferUnderflowException, due to the corrupt data.

To reproduce, create a tdigest with e.g. the single integer value 1082 with the 
default compression 100. The resulting data is:

 
{code:java}
[0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0, 66, 
-56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0]
{code}
After UTF-8 String corruption, the data becomes:

 
{code:java}
[0, 0, 0, 2, 64, -17, -65, -67, -17, -65, -67, 0, 0, 0, 0, 0, 64, -17, -65, 
-67, -17, -65, -67, 0, 0, 0, 0, 0, 66, -17, -65, -67, 0, 0, 0, -17, -65, -67, 
4, 26, 0, 1, 63, -17, -65, -67, 0, 0, 68, -17, -65, -67, 64, 0]
{code}
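
The corruption can be reproduced outside Drill with the JDK alone. The following is a minimal sketch, not Drill code: the class name Utf8RoundTripDemo and the method roundTripThroughUtf8 are hypothetical, and the method only imitates the bytes -> UTF-8 String -> bytes path described above, using the 38 bytes from this report as input.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Utf8RoundTripDemo {
    // Serialized t-digest from this report (single value 1082, compression 100).
    static final byte[] ORIGINAL = {
        0, 0, 0, 2, 64, -112, -24, 0, 0, 0, 0, 0, 64, -112, -24, 0, 0, 0, 0, 0, 66,
        -56, 0, 0, 0, -46, 4, 26, 0, 1, 63, -128, 0, 0, 68, -121, 64, 0
    };

    // Imitates the faulty path (assumption: equivalent to what the report
    // describes): decode the raw bytes as UTF-8, then encode them again.
    static byte[] roundTripThroughUtf8(byte[] data) {
        String decoded = new String(data, StandardCharsets.UTF_8);
        return decoded.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        byte[] corrupted = roundTripThroughUtf8(ORIGINAL);
        System.out.println("original length:  " + ORIGINAL.length);
        System.out.println("corrupted length: " + corrupted.length);
        System.out.println("identical:        " + Arrays.equals(ORIGINAL, corrupted));
        System.out.println(Arrays.toString(corrupted));
    }
}
```

The input contains eight bytes >= 0x80, none of which forms a valid UTF-8 sequence, so each one grows from one byte to three. That accounts for the growth from 38 to 54 bytes seen in the corrupted data.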
 

The fix is trivial and relates to the class 
{{TDigestFunctions.TDigestMergeFunction}}. 

Line 1109 is incorrect:
{code:java}
byte[] buf = org.apache.drill.exec.expr.fn.impl.StringFunctionHelpers
    .toStringFromUTF8(in.start, in.end, in.buffer)
    .getBytes(java.nio.charset.StandardCharsets.UTF_8);
{code}
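
One way to avoid the corruption is to copy the bytes in the range [start, end) directly, with no charset conversion. The sketch below is an assumption about the shape of such a fix, not the actual patch: it uses java.nio.ByteBuffer as a stand-in for Drill's buffer type so that it runs with the JDK alone, and the names DirectCopyDemo and copyRange are hypothetical.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

class DirectCopyDemo {
    // Copies bytes [start, end) out of the buffer without any charset conversion.
    // Assumption: in Drill the holder's buffer is a Netty-based DrillBuf, where an
    // equivalent call would presumably be in.buffer.getBytes(in.start, buf).
    static byte[] copyRange(ByteBuffer buffer, int start, int end) {
        byte[] buf = new byte[end - start];
        ByteBuffer view = buffer.duplicate(); // leave the caller's position untouched
        view.position(start);
        view.get(buf);
        return buf;
    }

    public static void main(String[] args) {
        // First bytes of the t-digest from this report, including values >= 0x80.
        byte[] payload = {0, 0, 0, 2, 64, -112, -24, 0};

        // Place the payload at offset 4 to mimic a value inside a larger buffer.
        ByteBuffer buffer = ByteBuffer.allocate(16);
        buffer.position(4);
        buffer.put(payload);

        byte[] copy = copyRange(buffer, 4, 4 + payload.length);
        System.out.println("identical: " + Arrays.equals(payload, copy));
    }
}
```

Because the bytes are never interpreted as text, values >= 0x80 pass through unchanged and the result has the same length as the input.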
 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
