----------------------------------------------------------- This is an automatically generated e-mail. To reply, visit: https://reviews.apache.org/r/18347/#review36717 -----------------------------------------------------------
exec/java-exec/src/main/codegen/data/AggrTypes1.tdd <https://reviews.apache.org/r/18347/#comment67800> Is it possible that we would want to make the output type of a sum of BigInts (or even Ints with the dataset sizes we are expecting) a type other than BigInt? If we move to something like a double, or a BigDecimal we could store a more accurate guess of the sum, rather than just overflowing the output value, which I assume would be reasonably common. There is the problem that while doubles can store large values, they lose precision, so as we calculate the sum (assuming it is going to be a large number of small values) once we reach a large partial sum, adding a single value would not increment the sum. We would have to store intermediate sums in some kind of temporary storage until they are large enough to be added to the whole sum and not be lost to rounding. exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java <https://reviews.apache.org/r/18347/#comment67805> While this strategy will produce the most accurate results, if we are storing the running sum in a BigInt for all of the types, we will almost certainly hit an overflow for inputs like BigInt, and will likely also be an issue for integer columns in full scale datasets as well. We should look at keeping an approximate average as the running state with a count of values factored in so far. New values should be factored by adding distance(runningAvg, currVal) * 1/valCount to the runningAvg. The downside is values that should push the average up or down just a little bit, once a lot of other values have been factored in will be lost to rounding errors. This can be fixed the same as above, by batching subsets of values that will have a collected sub-average applied to the total average all at once. As it is approximate, this could still produce inaccurate results if the sub-groups are not large enough, but this could be adjusted on a per query or per table basis. - Jason Altekruse On March 6, 2014, 2:10 a.m., Aman Sinha wrote: > > ----------------------------------------------------------- > This is an automatically generated e-mail. To reply, visit: > https://reviews.apache.org/r/18347/ > ----------------------------------------------------------- > > (Updated March 6, 2014, 2:10 a.m.) > > > Review request for drill, Jacques Nadeau, Jason Altekruse, Mehant Baid, and > Steven Phillips. > > > Bugs: Drill-318, Drill-335 and and > https://issues.apache.org/jira/browse/Drill-318 > https://issues.apache.org/jira/browse/Drill-335 > https://issues.apache.org/jira/browse/and > > > Repository: drill-git > > > Description > ------- > > This patch contains the following main components: > 1. Implementation of the hash aggregation execution operator - this has two > main parts: the HashAggTemplate and the HashAggBatch. > 2. Implementation of a hash table which is used by the hash aggregation. The > hash table hash two main parts: the HashTableTemplate and the > ChainedHashTable. The hash table internally uses the notion of 'BatchHolder' > to keep track of all keys that can fit within one batch of 64K values. New > BatcHolder objects are created as needed. Each BatchHolder has its own > vector container. The HashAggregate also has a similar structure and it > keeps track of the workspace variables. > (NOTE: An initial design document for the hash aggregation and hash table > was already attached with Drill-335. The document has not yet been updated > with the latest implementation ... will try to do that in the near future). > 3. Jinfeng's changes to use workspace vectors in the generated code for > aggregate functions (previously, for streaming aggregate we only needed to > maintain workspace variable for 1 running group; however for hash aggregate > we need to maintain it for all groups). > 4. Fix for Drill-318: because of #3 above, the previous fix for Drill-318 is > not valid anymore. I modified the template generation code for the aggregate > functions such that they conform to the new infrastructure. > 5. The original AggTemplate, AggBatch and Aggregator classes have been moved > to corresponding StreamingAggTemplate, StreamingAggBatch and > StreamingAggregator in order to differentiate it from hash aggregation. > These appear as new files but the code there has not changed. > > > Diffs > ----- > > > common/src/main/java/org/apache/drill/common/expression/fn/AggregationFunctions.java > 5b46b78 > exec/java-exec/src/main/codegen/config.fmpp 8f1060a > exec/java-exec/src/main/codegen/data/AggrTypes1.tdd PRE-CREATION > exec/java-exec/src/main/codegen/data/AggrTypes2.tdd PRE-CREATION > exec/java-exec/src/main/codegen/templates/AggrTypeFunctions1.java > PRE-CREATION > exec/java-exec/src/main/codegen/templates/AggrTypeFunctions2.java > PRE-CREATION > exec/java-exec/src/main/codegen/templates/TypeHelper.java 8c72cf7 > > exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/GeneratorMapping.java > f99150e > > exec/java-exec/src/main/java/org/apache/drill/exec/compile/sig/MappingSet.java > 1cb2380 > exec/java-exec/src/main/java/org/apache/drill/exec/expr/ClassGenerator.java > e674eab > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillAggFuncHolder.java > 7622865 > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/DrillFuncHolder.java > dcdf1ea > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/FunctionConverter.java > befa9bf > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/CountFunctions.java > b0939f1 > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/agg/impl/SumFunctions.java > 57905fe > > exec/java-exec/src/main/java/org/apache/drill/exec/expr/fn/impl/BitFunctions.java > 41bb7c6 > exec/java-exec/src/main/java/org/apache/drill/exec/ops/FragmentContext.java > 8462622 > exec/java-exec/src/main/java/org/apache/drill/exec/opt/BasicOptimizer.java > aed4802 > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/OperatorCost.java > 0bade41 > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/AbstractPhysicalVisitor.java > ec7244a > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/base/PhysicalVisitor.java > 2e2d7fd > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/config/HashAggregate.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatch.java > 86fea4e > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggBatchCreator.java > 36344f9 > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/AggTemplate.java > a8f39e3 > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/Aggregator.java > 2683045 > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatch.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggBatchCreator.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggTemplate.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/HashAggregator.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatch.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggBatchCreator.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggTemplate.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/aggregate/StreamingAggregator.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/ChainedHashTable.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTable.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableConfig.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/common/HashTableTemplate.java > PRE-CREATION > > exec/java-exec/src/main/java/org/apache/drill/exec/physical/impl/project/ProjectRecordBatch.java > 509d13b > > exec/java-exec/src/main/java/org/apache/drill/exec/record/VectorContainer.java > 6e764a8 > exec/java-exec/src/test/java/org/apache/drill/exec/expr/ExpressionTest.java > 0de64b4 > > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/SimpleRootExec.java > ccf7758 > > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/TestSimpleFunctions.java > f6a8096 > > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/agg/TestHashAggr.java > PRE-CREATION > > exec/java-exec/src/test/java/org/apache/drill/exec/physical/impl/common/TestHashTable.java > PRE-CREATION > exec/java-exec/src/test/resources/common/test_hashtable1.json PRE-CREATION > > Diff: https://reviews.apache.org/r/18347/diff/ > > > Testing > ------- > > I have run several tests manually as part of TestHashAggr...these tests use > TPC-H data and in particular a relatively large 'Orders' table. However, I > have not yet packaged the tests to run as part of JUnit since the location > and size of the parquet files needs to be figured out. I will continue to > work on that. > > > Thanks, > > Aman Sinha > >
