koke opened a new issue #50:
URL: https://github.com/apache/incubator-datasketches-hive/issues/50
I've been scratching my head over this one for a while. While writing some
unit tests, I created a theta sketch with a single item, and the estimate
function returned an estimate of roughly -880 million.
This seems easily reproducible for me (using Hive version 1.1.0-cdh5.16.1):
```sql
add jar /path/to/datasketches-memory-1.2.0-incubating.jar;
add jar /path/to/datasketches-java-1.2.0-incubating.jar;
add jar /path/to/datasketches-hive-1.0.0-incubating.jar;
create temporary function data2sketch as
  'org.apache.datasketches.hive.theta.DataToSketchUDAF';
create temporary function estimate as
  'org.apache.datasketches.hive.theta.EstimateSketchUDF';
create temporary table theta_input as select 1 as id;
create temporary table sketch_intermediate as
  select data2sketch(id) as sketch from theta_input;
select estimate(sketch) as estimate_from_table from sketch_intermediate;
-- Output:
-- +----------------------+--+
-- | estimate_from_table |
-- +----------------------+--+
-- | -8.80936683E8 |
-- +----------------------+--+
with intermediate as (
  select data2sketch(id) as sketch from theta_input
)
select estimate(sketch) as estimate_from_table from intermediate;
-- Output:
-- +----------------------+--+
-- | estimate_from_table |
-- +----------------------+--+
-- | 1.0 |
-- +----------------------+--+
```
For some reason there were extra bytes in the `BytesWritable` storage, which
broke the calculation. What was supposed to be a 16-byte `SingleItemSketch`
arrived with 8 extra zero-filled bytes (24 in total), so datasketches
interpreted it as a completely different kind of sketch.
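To illustrate how a byte container can report 16 valid bytes while its backing array holds 24, here is a minimal, self-contained sketch. `GrowableBytes` is a hypothetical stand-in, not the Hadoop class, and its 1.5x growth policy is an assumption. It matches my understanding of how `BytesWritable` grows its buffer, which would turn 16 bytes into 24, but I have not confirmed that this is what happens in this Hive version.

```java
import java.util.Arrays;

public class PaddedBufferDemo {

    /** Hypothetical stand-in for a growable byte container (not the Hadoop class). */
    static final class GrowableBytes {
        private byte[] buffer = new byte[0];
        private int length = 0;

        /** Copies data in, over-allocating by 1.5x whenever the buffer must grow (assumed policy). */
        void set(byte[] data) {
            if (data.length > buffer.length) {
                buffer = new byte[data.length * 3 / 2];
            }
            System.arraycopy(data, 0, buffer, 0, data.length);
            length = data.length;
        }

        /** Returns the whole backing array, padding included. */
        byte[] getBytes() {
            return buffer;
        }

        /** Returns the number of valid bytes at the start of the backing array. */
        int getLength() {
            return length;
        }
    }

    /** Returns a copy of only the valid prefix of the backing array. */
    static byte[] validBytes(GrowableBytes w) {
        return Arrays.copyOf(w.getBytes(), w.getLength());
    }

    public static void main(String[] args) {
        GrowableBytes w = new GrowableBytes();
        w.set(new byte[16]); // same size as the serialized single-item sketch
        System.out.println("valid length:   " + w.getLength());
        System.out.println("backing length: " + w.getBytes().length);
        System.out.println("trimmed length: " + validBytes(w).length);
    }
}
```

Any consumer that reads the full backing array instead of the valid prefix sees the padding as part of the payload, which is consistent with the wrong estimate above.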
A unit test that reproduces what I was seeing from Hive:
```java
@Test
public void evaluateRespectsByteLength() {
  // 24-byte backing array: a 16-byte single-item sketch plus 8 zero bytes.
  byte[] inputBytes = new byte[]{
    (byte) 0x01, (byte) 0x03, (byte) 0x03, (byte) 0x00,
    (byte) 0x00, (byte) 0x3a, (byte) 0xcc, (byte) 0x93,
    (byte) 0x15, (byte) 0xf9, (byte) 0x7d, (byte) 0xcb,
    (byte) 0xbd, (byte) 0x86, (byte) 0xa1, (byte) 0x05,
    (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00,
    (byte) 0x00, (byte) 0x00, (byte) 0x00, (byte) 0x00
  };
  // Only the first 16 bytes are declared valid.
  BytesWritable input = new BytesWritable(inputBytes, 16);
  EstimateSketchUDF estimate = new EstimateSketchUDF();
  Double testResult = estimate.evaluate(input);
  assertEquals(1.0, testResult, 0.0);
}
```
Adding this wrapper around `EstimateSketchUDF` fixes the problem:
```java
public class EstimateSketchUDF extends
    org.apache.datasketches.hive.theta.EstimateSketchUDF {
  @Override
  public Double evaluate(BytesWritable binarySketch) {
    if (binarySketch == null) {
      return 0.0;
    }
    // getBytes() returns the whole backing array, which may be longer than
    // getLength(), so copy only the valid bytes into an exact-size array.
    byte[] bytes = new byte[binarySketch.getLength()];
    System.arraycopy(binarySketch.getBytes(), 0, bytes, 0,
        binarySketch.getLength());
    BytesWritable fixedSketch = new BytesWritable(bytes);
    return super.evaluate(fixedSketch);
  }
}
```