Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/16026 )
Change subject: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
......................................................................

IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

There is a bug in DataSketches HLL MurmurHash where long strings are
over-read, resulting in a cardinality estimate that is more than 15% off
from the correct cardinality. A recent upstream fix in Apache
DataSketches addresses this issue, and this patch pulls it into Impala.

https://issues.apache.org/jira/browse/DATASKETCHES-5

Testing:
  - I used the ds_hll_sketch() and ds_hll_estimate() functions from
    IMPALA-9632 to trigger DataSketches HLL functionality.
  - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to
    reproduce the issue. The symptom was that the actual result was
    around 15% off from the correct cardinality result (~69M vs 79M).
  - After applying this fix, re-running the query gives much closer
    results, usually under 3% error range.

Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
Reviewed-on: http://gerrit.cloudera.org:8080/16026
Reviewed-by: Impala Public Jenkins <[email protected]>
Tested-by: Impala Public Jenkins <[email protected]>
---
M be/src/thirdparty/datasketches/MurmurHash3.h
1 file changed, 3 insertions(+), 8 deletions(-)

Approvals:
  Impala Public Jenkins: Looks good to me, approved; Verified

--
To view, visit http://gerrit.cloudera.org:8080/16026

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: merged
Gerrit-Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
Gerrit-Change-Number: 16026
Gerrit-PatchSet: 3
Gerrit-Owner: Gabor Kaszab <[email protected]>
Gerrit-Reviewer: Csaba Ringhofer <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
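
Illustration of the over-read described in the commit message. This is a minimal sketch and not the DataSketches code or the actual upstream fix: the function name bounded_hash and its mixing constants are hypothetical, and the hash is a toy rather than MurmurHash3. It only shows the general pattern of a block-based hash that reads complete blocks first and then handles the trailing bytes one at a time, so that no byte at or beyond key + len influences the result.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Toy 64-bit hash (NOT MurmurHash3, hypothetical name and constants).
// It demonstrates bounded reads only: exactly `len` bytes are consumed.
inline uint64_t bounded_hash(const void* key, size_t len, uint64_t seed = 0) {
  const uint8_t* data = static_cast<const uint8_t*>(key);
  const size_t nblocks = len / 8;  // number of complete 8-byte blocks
  uint64_t h = seed ^ (static_cast<uint64_t>(len) * 0x9E3779B97F4A7C15ULL);

  // Body: consume only complete blocks. memcpy avoids unaligned loads.
  for (size_t i = 0; i < nblocks; ++i) {
    uint64_t k;
    std::memcpy(&k, data + i * 8, sizeof(k));
    h ^= k;
    h *= 0xFF51AFD7ED558CCDULL;
    h ^= h >> 33;
  }

  // Tail: the remaining (len % 8) bytes, read one at a time so that
  // data[len] and anything after it are never touched.
  const uint8_t* tail = data + nblocks * 8;
  uint64_t k = 0;
  for (size_t i = 0; i < (len & 7); ++i) {
    k |= static_cast<uint64_t>(tail[i]) << (8 * i);
  }
  h ^= k;
  h *= 0xC4CEB9FE1A85EC53ULL;
  h ^= h >> 33;
  return h;
}
```

With a bounded implementation, two buffers that agree on their first len bytes hash identically regardless of what follows in memory. An over-reading hash can break that property, so equal strings may receive different hash values and skew the distinct-count estimate.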
