Impala Public Jenkins has submitted this change and it was merged. ( http://gerrit.cloudera.org:8080/20793 )
Change subject: IMPALA-12606: Sporadic failures around query_test.test_queries.TestQueries.test_intersect ...................................................................... IMPALA-12606: Sporadic failures around query_test.test_queries.TestQueries.test_intersect test_intersect failed when ASYNC_CODEGEN was enabled. This happened because we were using codegened 'ProcessProbeBatch' in the HASH JOIN operator with non-codegened InsertBatch/ProcessBuildBatch at the Builder side, or vice versa. Only the NULL StringValues were hit by the bug, turned out NULLs are handled differently in the hash table. We are using the HashUtil::FNV_SEED number to represent NULL values. This number was chosen arbitrarily, we just wanted to use some random value. In the codegened path we invoke StringValue::Assign(ptr, len) with both params being HashUtil::FNV_SEED. HashUtil::FNV_SEED is a negative value in int32_t, so StringValue::Assign(ptr, len) stored 0 as len actually, and not HashUtil::FNV_SEED. This is needed to be resilient against invalid values in corrupt files. In non-codegened path we are creating a StringValue object by reinterpret casting a [HashUtil::FNV_SEED, HashUtil::FNV_SEED, ...] array to StringValue. Then in RawValue::WriteNonNullPrimitive() we invoke StringValue::Assign(StringValue&) that just memcopies the parameter to this. It cannot check for negative values, because the parameter StringValue might be a valid small string. To sum up, this is how a NULL string was represented in the HashTable: * Codegen path: ptr = HashUtil::FNV_SEED, len = 0 * Non-codegen path: ptr = HashUtil::FNV_SEED, len = HashUtil::FNV_SEED This is why the hash join operator was working incorrectly on NULL String values when some parts of it used Codegen'ed path while other parts were using the non-codegened path. To fix the issue, I introduced UnsafeAssign(ptr, len) which doesn't do any checks for 'ptr' or 'len', so we have the same StringValue for objects for NULLs at both the codegened and non-codegened paths. Testing: * Executed TestQueries.test_intersect multiple times Change-Id: I6b855c59808db80fd7ac596ce338fc4c3c9c7667 Reviewed-on: http://gerrit.cloudera.org:8080/20793 Reviewed-by: Impala Public Jenkins <[email protected]> Tested-by: Impala Public Jenkins <[email protected]> --- M be/src/codegen/gen_ir_descriptions.py M be/src/exec/hash-table.cc M be/src/runtime/smallable-string.h M be/src/runtime/string-value-ir.cc M be/src/runtime/string-value.h 5 files changed, 21 insertions(+), 4 deletions(-) Approvals: Impala Public Jenkins: Looks good to me, approved; Verified -- To view, visit http://gerrit.cloudera.org:8080/20793 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-MessageType: merged Gerrit-Change-Id: I6b855c59808db80fd7ac596ce338fc4c3c9c7667 Gerrit-Change-Number: 20793 Gerrit-PatchSet: 8 Gerrit-Owner: Zoltan Borok-Nagy <[email protected]> Gerrit-Reviewer: Csaba Ringhofer <[email protected]> Gerrit-Reviewer: Impala Public Jenkins <[email protected]> Gerrit-Reviewer: Michael Smith <[email protected]> Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
