Gabor Kaszab has uploaded this change for review. ( http://gerrit.cloudera.org:8080/16026 )
Change subject: IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix
......................................................................

IMPALA-9820: Pull Datasketches-5 HLL MurmurHash fix

There is a bug in DataSketches HLL MurmurHash where long strings are
over-read, resulting in a cardinality estimate that is more than 15% off
from the correct cardinality. A recent upstream fix in Apache
DataSketches addresses this issue, and this patch pulls it into Impala.

https://issues.apache.org/jira/browse/DATASKETCHES-5

Testing:
  - I used the ds_hll_sketch() and ds_hll_estimate() functions from
    IMPALA-9632 to trigger DataSketches HLL functionality.
  - Ran DataSketches HLL on lineitem.l_comment in TPCH25_parquet to
    reproduce the issue. The symptom was that the actual result was
    around 15% off from the correct cardinality result (~69M vs 79M).
  - After applying this fix, re-running the query gives much closer
    results, usually within a 3% error range.

Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
---
M be/src/thirdparty/datasketches/MurmurHash3.h
1 file changed, 3 insertions(+), 8 deletions(-)

  git pull ssh://gerrit.cloudera.org:29418/Impala-ASF refs/changes/26/16026/1
--
To view, visit http://gerrit.cloudera.org:8080/16026
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: newchange
Gerrit-Change-Id: I84d73fce1e7a197c1f8fb49404b58ed9bb0b843d
Gerrit-Change-Number: 16026
Gerrit-PatchSet: 1
Gerrit-Owner: Gabor Kaszab <[email protected]>
