Amogh Margoor has posted comments on this change. ( 
http://gerrit.cloudera.org:8080/17592 )

Change subject: IMPALA-7635: Reducing HashTable size by packing it's buckets 
efficiently.
......................................................................


Patch Set 8:

(11 comments)

> (1 comment)
 >
 > I grabbed this change and was doing some local testing, and I ran
 > into a scenario that crashes Impala.
 >
 > Here's what I did:
 > 1. I loaded a larger TPC-H dataset in parquet (takes quite a while
 > and will use up some disk space):
 > bin/load-data.py -w tpch -s 42 
 > --table_formats=text/none/none,parquet/none/none
 > 2. Ran the following multiple times (it is intermittent):
 > use tpch42_parquet;
 > SELECT l_orderkey,
 > count(*) AS cnt
 > FROM lineitem
 > GROUP BY l_orderkey
 > HAVING count(*) > 9999999999999;
 >
 > That intermittently crashes with this stack:
 > C  [impalad+0x27e8546]  long impala::HashTable::Probe<true,
 > false>(impala::HashTable::Bucket*, long, impala::HashTableCtx*,
 > unsigned int, bool*, impala::HashTable::BucketData*)+0x28c
 > C  [impalad+0x27e2141]  impala::HashTable::ResizeBuckets(long,
 > impala::HashTableCtx*, bool*)+0x7d1
 > C  [impalad+0x27e194d]  impala::HashTable::CheckAndResize(unsigned
 > long, impala::HashTableCtx*, bool*)+0xcb
 > C  [impalad+0x27c8a48]  
 > impala::GroupingAggregator::CheckAndResizeHashPartitions(bool,
 > int, impala::HashTableCtx*)+0x176
 >
 > This may reproduce on smaller datasets.

This has been fixed now.

http://gerrit.cloudera.org:8080/#/c/17592/5/be/src/exec/hash-table-test.cc
File be/src/exec/hash-table-test.cc:

http://gerrit.cloudera.org:8080/#/c/17592/5/be/src/exec/hash-table-test.cc@744
PS5, Line 744: // Test to ensure the bucket size doesn't change accidentally.
> Nice! It could be also a static_assert at the class scope, e.g.:
Done.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/exec/hash-table.h@653
PS7, Line 653: n
> nit. May define and use constant for MATCHED here.
This logic has been changed to reduce the runtime overhead to read these flags.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/exec/hash-table.h@697
PS7, Line 697:
> nit. same as above. May use constants for FILLED, MATCHED and DUPLICATED, i
This logic has been changed to reduce the runtime overhead to read these flags.


http://gerrit.cloudera.org:8080/#/c/17592/5/be/src/exec/hash-table.h
File be/src/exec/hash-table.h:

http://gerrit.cloudera.org:8080/#/c/17592/5/be/src/exec/hash-table.h@694
PS5, Line 694:
> Why it isn't BucketData?
So it stores the pointer that union BucketData stores and not the pointer to 
the BucketData itself.


http://gerrit.cloudera.org:8080/#/c/17592/5/be/src/exec/hash-table.h@712
PS5, Line 712:     ALWAYS_INLINE void UnsetMatched() { UnSetTagBit1(); }
             :     ALWAYS_INLINE void UnsetHasDuplicates() { UnSetTagBit2(); }
             :     ALWAYS_INLINE vo
> (&ptr) is the address of the local variable 'ptr'.
Have changed this logic. '*reinterpret_cast<BucketData*>(getPtr());' would not 
work as it getPtr() returns the pointer to tuple/duplicate node instead of 
returning pointer to BucketData.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h
File be/src/util/tagged-ptr.h:

http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h@71
PS7, Line 71: late<class T, bool
> nit. UNLIKELY()?
This conditional has been removed altogether.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h@78
PS7, Line 78: return TaggedPtr(ptr)
> same as above
This conditional has been removed altogether.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h@91
PS7, Line 91:  if (OWNS) {
> nit. delete?
Done.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h@101
PS7, Line 101: //   ALWAYS_INLINE void SetTagBit0() {
             :   //     data_ = (data_ | DATA_MASK_0);
             :   //   }
> nit. I found the implementation of these two operators counter intuitive as
These are comparison of the tagged pointers which includes tag. For comparison 
of just pointers getPtr() on both can be used.


http://gerrit.cloudera.org:8080/#/c/17592/7/be/src/util/tagged-ptr.h@115
PS7, Line 115:
> nit. remove?
Done.


http://gerrit.cloudera.org:8080/#/c/17592/7/fe/src/main/java/org/apache/impala/planner/PlannerContext.java
File fe/src/main/java/org/apache/impala/planner/PlannerContext.java:

http://gerrit.cloudera.org:8080/#/c/17592/7/fe/src/main/java/org/apache/impala/planner/PlannerContext.java@47
PS7, Line 47: public final static double SIZE_OF_HASH = 4;
> nit. I wonder if this value can be folded back into SIZE_OF_BUCKET and SIZE
Makes sense. Have made the change.



--
To view, visit http://gerrit.cloudera.org:8080/17592
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-MessageType: comment
Gerrit-Change-Id: I72912ae9353b0d567a976ca712d2d193e035df9b
Gerrit-Change-Number: 17592
Gerrit-PatchSet: 8
Gerrit-Owner: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Amogh Margoor <[email protected]>
Gerrit-Reviewer: Impala Public Jenkins <[email protected]>
Gerrit-Reviewer: Joe McDonnell <[email protected]>
Gerrit-Reviewer: Qifan Chen <[email protected]>
Gerrit-Reviewer: Zoltan Borok-Nagy <[email protected]>
Gerrit-Comment-Date: Wed, 04 Aug 2021 20:34:08 +0000
Gerrit-HasComments: Yes

Reply via email to