Yingyi Bu has posted comments on this change. Change subject: RangeGenerator aggfunc for the numeric/asciiString datatype based on parallel streaming histogram. ......................................................................
Patch Set 35: (18 comments) Here are some high-level comments: 1. Let partition number be the second argument of the range generation function; 2. Use JDK library as much as possible, e.g., binary search and priority queue; 3. Move histogram generation code from Hyracks to Asterix and remove type enums in the histogram processing code. Detailed comments are inlined. https://asterix-gerrit.ics.uci.edu/#/c/806/35/asterixdb/asterix-app/src/test/resources/runtimets/queries/aggregate/rg_double/rg_double.3.query.aql File asterixdb/asterix-app/src/test/resources/runtimets/queries/aggregate/rg_double/rg_double.3.query.aql: Line 20: set partitions '2' can "partitions" be an argument of the function? https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IHeapList.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IHeapList.java: Line 28: public void maintainHeap() throws HyracksDataException; revisit the possibility of using priority queue and not exposing xxxHeap as public methods, since xxxHeap seems to be implementation details rather than abstractions. https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IHistogram.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IHistogram.java: Line 38: SHORT, Those types are not AsterixDB types. Can we move all histogram code into asterixdb-runtime codebase and only handle asterixdb types. Line 52: public void appendPair(E item, int count, boolean soft) throws HyracksDataException; What's the difference between addPair and appendPair? Do we need both at abstraction level? What does soft mean? Is it sth. necessary at the abstraction level? Line 54: public List<Entry<E, Integer>> generate(boolean isGlobal) throws HyracksDataException; Must there be an final step to call generate? Can generate be called multiple times? If it can be called multiple times, can we use "get" instead of "generate"? https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IQuantileList.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/IQuantileList.java: Line 29: public GenericQuantile<K, V> get(int index); Can you add some high-level annotation to describe what each method means? Are all of those methods needed at the abstraction level? https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/AbstractQuantileList.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/AbstractQuantileList.java: Line 34: private List<GenericQuantile<Double, Integer>> bins; Use a fixed size array? Line 38: protected DominantQuantile<Number> heapHead; Use Java's PriorityQueue? Line 40: protected Random prng; prng is not used in this abstract class? Move this to subclasses? It might end up with duplications across different subclasses, but readability seems better to me. Line 46: prng = new Random(31183); Why is the random seed a fixed number? Usually it's sth. like System.nanoTime(). Line 60: } else { Use JDK library binarySearch? Line 102: public void add(Double key, Integer value) { Double->double? Line 120: public void add(Double key) throws HyracksDataException { Double->double? Line 125: else if (bins.size() < cardinality) { is the code formatted correctly? "else if" should be appended to the previous line. Line 175: public void setKey(int index, Double key) { Double->double? https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/DominantQuantile.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/DominantQuantile.java: Line 70: return super.hashCode(); This doesn't seem right. hashCode implementation should be consistent with equals. E.g.: a.equals(b) --> a.hashCode() == b.hashCode() https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/GenericQuantile.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/GenericQuantile.java: Line 72: public int hashCode() { hashCode implementation should be consistent with equals(). https://asterix-gerrit.ics.uci.edu/#/c/806/35/hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/MinSumCountQuantileList.java File hyracks-fullstack/hyracks/hyracks-dataflow-std/src/main/java/org/apache/hyracks/dataflow/std/range/structures/MinSumCountQuantileList.java: Line 78: public void increase(int index, Double key) throws HyracksDataException { Double -> double? Why do you always need boxed "Double key"? -- To view, visit https://asterix-gerrit.ics.uci.edu/806 To unsubscribe, visit https://asterix-gerrit.ics.uci.edu/settings Gerrit-MessageType: comment Gerrit-Change-Id: I450d0962fbeacfb2b6ab9fae0750f025ef17ba01 Gerrit-PatchSet: 35 Gerrit-Project: asterixdb Gerrit-Branch: master Gerrit-Owner: Wenhai Li <[email protected]> Gerrit-Reviewer: Jenkins <[email protected]> Gerrit-Reviewer: Jianfeng Jia <[email protected]> Gerrit-Reviewer: Preston Carman <[email protected]> Gerrit-Reviewer: Till Westmann <[email protected]> Gerrit-Reviewer: Wenhai Li <[email protected]> Gerrit-Reviewer: Yingyi Bu <[email protected]> Gerrit-Reviewer: Yingyi Bu <[email protected]> Gerrit-HasComments: Yes
