Taras Bobrovytsky has posted comments on this change.

Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage
......................................................................


Patch Set 2:

(17 comments)

Marcel, I will run the targeted perf query later today.

http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc
File be/src/exprs/aggregate-functions-ir.cc:

PS2, Line 963: max_num_samples
> confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity
Done


PS2, Line 973: { return begin() + idx; }
> it'd be good to add bounds checks w/ dchecks
Done


PS2, Line 984: max_num_samples
> capacity
Done


Line 996:   bool IsFull() const {
> dcheck that num_samples <= capacity, i.e. that it is NOT over
Done


PS2, Line 990:   size_t byte_size() const {
             :     return byte_size(max_num_samples);
             :   }
             : 
             :   // Returns true if the array of ReservoirSamples that follows 
this struct in memory is
             :   // full, and no more elements can be pushed back without 
resizing.
             :   bool IsFull() const {
             :     return num_samples == max_num_samples;
             :   }
> i think both of these can be one line
One lined one of them, added a DCHECK to the other.


PS2, Line 1001: // make this const
> ?
Done


PS2, Line 1018: resizeReservoirSampleState
> nit: inconsistent casing in fn names
Make this one start with a capital letter. But other ones like begin() should 
probably start with a lowercase.


PS2, Line 1021: N_
> NUM
Done


PS2, Line 1024: new_size 
> new_capacity to avoid conflating with memory sizes
Done


PS2, Line 1024: new_size 
> Queries with a large number buckets will run out of memory much quicker tha
Done


PS2, Line 1024:   int new_size = state->max_num_samples * 10;
              :   DCHECK_EQ(state->max_num_samples, state->num_samples);
              :   DCHECK_LE(new_size, MAX_NUM_SAMPLES);
> Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't th
In this implementation it couldn't DCHECK, because we would start with capacity 
20, then go to 200, 2000, and finally 20000. This is now changed because 
Mostafa suggested doubling instead.


PS2, Line 1036:   state->max_num_samples = new_size;
> why don't any other fields need to be initialized? e.g. the rng will be gar
The previous state gets copied in ReallocBuffer. The only thing we need to 
change is is the capacity. Added comment.


PS2, Line 1048: size
> again I'd vote for capacity
I got rid of this variable.


PS2, Line 1058:   *state = ReservoirSampleState<T>();
> this looks like it'll initialize the memory properly. I think we'd probably
We don't need to do that on line 1035 because everything gets copied over in 
ReallocBuffer()


PS2, Line 1084:   ++state->source_size;
> I think this gets lost when you allocate a new State.
As mentioned above, everything is copied in realloc


PS2, Line 1152: ReservoirSampleMerge
> I think this algorithm will require some changes to handle varying length S
The capacity shouldn't matter in this algorithm. This algorithm only cares 
about num samples in src and dst states.


PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES
> I don't think you can rely on this anymore, dst may not have capacity MAX_N
If DST does not have enough capacity for MAX_NUM_SAMPLES, it will be resized on 
line 1170


-- 
To view, visit http://gerrit.cloudera.org:8080/6025
To unsubscribe, visit http://gerrit.cloudera.org:8080/settings

Gerrit-MessageType: comment
Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968
Gerrit-PatchSet: 2
Gerrit-Project: Impala-ASF
Gerrit-Branch: master
Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com>
Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com>
Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com>
Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com>
Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com>
Gerrit-HasComments: Yes

Reply via email to