Taras Bobrovytsky has posted comments on this change. Change subject: IMPALA-4787: Optimize APPX_MEDIAN() memory usage ......................................................................
Patch Set 2: (17 comments) Marcel, I will run the targeted perf query later today. http://gerrit.cloudera.org:8080/#/c/6025/2/be/src/exprs/aggregate-functions-ir.cc File be/src/exprs/aggregate-functions-ir.cc: PS2, Line 963: max_num_samples > confusing to have MAX_NUM_SAMPLES as well. maybe this should be capacity Done PS2, Line 973: { return begin() + idx; } > it'd be good to add bounds checks w/ dchecks Done PS2, Line 984: max_num_samples > capacity Done Line 996: bool IsFull() const { > dcheck that num_samples <= capacity, i.e. that it is NOT over Done PS2, Line 990: size_t byte_size() const { : return byte_size(max_num_samples); : } : : // Returns true if the array of ReservoirSamples that follows this struct in memory is : // full, and no more elements can be pushed back without resizing. : bool IsFull() const { : return num_samples == max_num_samples; : } > i think both of these can be one line One lined one of them, added a DCHECK to the other. PS2, Line 1001: // make this const > ? Done PS2, Line 1018: resizeReservoirSampleState > nit: inconsistent casing in fn names Make this one start with a capital letter. But other ones like begin() should probably start with a lowercase. PS2, Line 1021: N_ > NUM Done PS2, Line 1024: new_size > new_capacity to avoid conflating with memory sizes Done PS2, Line 1024: new_size > Queries with a large number buckets will run out of memory much quicker tha Done PS2, Line 1024: int new_size = state->max_num_samples * 10; : DCHECK_EQ(state->max_num_samples, state->num_samples); : DCHECK_LE(new_size, MAX_NUM_SAMPLES); > Looks like it'll dcheck when capacity reaches MAX_NUM_SAMPLES. Shouldn't th In this implementation it couldn't DCHECK, because we would start with capacity 20, then go to 200, 2000, and finally 20000. This is now changed because Mostafa suggested doubling instead. PS2, Line 1036: state->max_num_samples = new_size; > why don't any other fields need to be initialized? e.g. the rng will be gar The previous state gets copied in ReallocBuffer. The only thing we need to change is is the capacity. Added comment. PS2, Line 1048: size > again I'd vote for capacity I got rid of this variable. PS2, Line 1058: *state = ReservoirSampleState<T>(); > this looks like it'll initialize the memory properly. I think we'd probably We don't need to do that on line 1035 because everything gets copied over in ReallocBuffer() PS2, Line 1084: ++state->source_size; > I think this gets lost when you allocate a new State. As mentioned above, everything is copied in realloc PS2, Line 1152: ReservoirSampleMerge > I think this algorithm will require some changes to handle varying length S The capacity shouldn't matter in this algorithm. This algorithm only cares about num samples in src and dst states. PS2, Line 1167: (dst->num_samples < MAX_NUM_SAMPLES > I don't think you can rely on this anymore, dst may not have capacity MAX_N If DST does not have enough capacity for MAX_NUM_SAMPLES, it will be resized on line 1170 -- To view, visit http://gerrit.cloudera.org:8080/6025 To unsubscribe, visit http://gerrit.cloudera.org:8080/settings Gerrit-MessageType: comment Gerrit-Change-Id: I99adaad574d4fb0a3cf38c6cbad8b2a23df12968 Gerrit-PatchSet: 2 Gerrit-Project: Impala-ASF Gerrit-Branch: master Gerrit-Owner: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-Reviewer: Alex Behm <alex.b...@cloudera.com> Gerrit-Reviewer: Marcel Kornacker <mar...@cloudera.com> Gerrit-Reviewer: Matthew Jacobs <m...@cloudera.com> Gerrit-Reviewer: Mostafa Mokhtar <mmokh...@cloudera.com> Gerrit-Reviewer: Taras Bobrovytsky <tbobrovyt...@cloudera.com> Gerrit-HasComments: Yes