Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16026 )


Change subject: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
......................................................................

IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

There is a bug in the DataSketches HLL MurmurHash implementation where
long strings are over-read, resulting in a cardinality estimate that is
more than 15% off from the correct cardinality. A recent upstream fix in
Apache DataSketches addresses this issue, and this patch pulls it into
Impala.
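
For readers unfamiliar with the term, the sketch below shows in general
terms what avoiding an over-read looks like when a hash consumes its
input in fixed-size words. It is NOT the upstream DataSketches change and
does not reproduce MurmurHash3.h; the helper name read_word_bounded and
its signature are hypothetical, chosen only for illustration.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical helper (not part of DataSketches or Impala).
// Loads up to 8 bytes starting at `offset` into a zero-initialized word,
// never touching memory at or beyond `data + len`.
inline uint64_t read_word_bounded(const void* data, size_t len, size_t offset) {
  uint64_t word = 0;
  if (offset >= len) return word;  // nothing left to read
  const size_t avail = len - offset;
  const size_t n = avail < sizeof(word) ? avail : sizeof(word);
  // memcpy with an explicit byte count keeps the read inside the buffer.
  std::memcpy(&word, static_cast<const unsigned char*>(data) + offset, n);
  return word;
}
```

An unbounded 8-byte load at the same offset would pick up whatever bytes
happen to follow the input, so two equal strings could hash differently
depending on their surroundings.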

https://issues.apache.org/jira/browse/DATASKETCHES-5

Testing:
  - I used ds_hll_sketch() and ds_hll_estimate() functions from
    IMPALA-9632 to trigger DataSketches HLL functionality.
  - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to
    reproduce the issue. The symptom was that the estimate was around
    15% off from the correct cardinality (~69M vs 79M).
  - After applying this fix, re-running the query gives much closer
    results, usually within a 3% error range.
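
The error figures above can be checked with a small relative-error
calculation. This is a minimal sketch; the function name relative_error
is an assumption made for illustration and is not part of Impala or
DataSketches.

```cpp
#include <cmath>

// Hypothetical helper: relative error of an estimate against the exact
// distinct count, expressed as a fraction (0.03 means 3%).
inline double relative_error(double estimate, double exact) {
  return std::fabs(estimate - exact) / exact;
}
```

With the rounded figures quoted above, relative_error(69e6, 79e6) comes
out at roughly 0.127. The inputs are themselves approximate, so this is
only a ballpark check of the reported error.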

Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
---
M be/src/thirdparty/datasketches/MurmurHash3.h
1 file changed, 3 insertions(+), 8 deletions(-)



  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/16026/1
--
To view, visit http://gerrit.cloudera.org:8080/16026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
Gerrit-Change-Number: 16026
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <[email protected]>
