Hi Jignesh, Thanks for sending the list. I want to share an update on point 1.
At present I am working on partitioned aggregation, which builds on the work in the QUICKSTEP-28 and QUICKSTEP-29 JIRA issues. As the first step toward this goal, I have created the QUICKSTEP-43 JIRA issue (and a corresponding GitHub PR), which adds a new operator to destroy the aggregation state (similar to the destroy hash table operator). This operator is needed once the finalize step in aggregation runs in parallel: the aggregation state is then shared by all the finalize work, so it can only be destroyed after the entire finalize phase is complete.
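To illustrate the ordering constraint, here is a minimal standalone sketch; it is not the code in the PR. The names AggregationState, FinalizePartition and FinalizeThenDestroy are hypothetical, a vector of per-partition maps stands in for the real aggregation state, and plain std::thread workers stand in for Quickstep's work orders. The only point is that the destroy step runs after every parallel finalize task has finished.

```cpp
#include <cassert>
#include <cstddef>
#include <memory>
#include <thread>
#include <unordered_map>
#include <vector>

// Hypothetical stand-in for the shared aggregation state (not Quickstep's class).
struct AggregationState {
  // One partial hash table per partition: group key -> running sum.
  std::vector<std::unordered_map<int, long>> partitions;
};

// Hypothetical finalize for one partition: reads the shared state only.
long FinalizePartition(const AggregationState &state, std::size_t partition_id) {
  long total = 0;
  for (const auto &entry : state.partitions[partition_id]) {
    total += entry.second;
  }
  return total;
}

// Finalizes all partitions in parallel, then destroys the shared state.
// The reset() call plays the role of the separate "destroy" operator.
std::vector<long> FinalizeThenDestroy(std::unique_ptr<AggregationState> *state) {
  assert(state != nullptr && *state != nullptr);
  const AggregationState *shared = state->get();
  const std::size_t num_partitions = shared->partitions.size();

  std::vector<long> results(num_partitions, 0);
  std::vector<std::thread> workers;
  workers.reserve(num_partitions);
  for (std::size_t i = 0; i < num_partitions; ++i) {
    // Each worker writes to its own slot, so no locking is needed here.
    workers.emplace_back([&results, shared, i] {
      results[i] = FinalizePartition(*shared, i);
    });
  }

  // Wait for the whole finalize phase before touching the state's lifetime.
  for (std::thread &worker : workers) {
    worker.join();
  }

  // Only now is it safe to release the shared state.
  state->reset();
  return results;
}
```

If the state were instead freed by whichever finalize task happened to finish first, the remaining tasks would read freed memory, which is why the destroy step is kept separate and scheduled last.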
On 08/22/2016 08:23 PM, J Patel wrote:
Hi folks,

Here is a list of features that would be good for the community to work on. Feel free to add or comment on this list.

1: Improve handling of aggregation: Aggregate handling in Quickstep is slow as a separate hash table is being built for each aggregate. PR https://github.com/apache/incubator-quickstep/pull/90 is a step in fixing this, but there is more to be done, including increasing the space efficiency of the hash table, improving the finalize operation (which is single-threaded), and considering partitioning (so that finalize can be parallelized).

2: The use of ColumnVectors is very expensive as it involves a full extra read and write of data, and results in a bad memory access pattern. That design needs to be rethought/refactored. Nav has suggested using an iterator model v/s accessors and that is a good idea. We can probably go beyond that and think of defining patterns for taking an input, applying a predicate, and applying a projection (copy). Any ideas here are welcome.

3: We have bloomfilters and that needs to be optimized to work with joins. Jianqiao is working on this.

4: Error handling in the system can be improved. Here we need to consider if we want to use error return codes or C++ throw/catch mechanism. Right now we use a mix of both. I am starting to turn in favor of throw/catch as that way we at least have a way of catching the error at the top (rather than crashing). We can then refactor the code to add entire throw/catch chains. Right now the most serious error handling that is lacking, IMHO, is when we are loading a large file and there is a corrupted tuple near the end. The system crashes after making the user wait, and there is no cleanup.

5: Our type system also needs a major surgery to make it easier to add new types. Clean UDFs support is also missing.

Other thoughts?

Cheers,
Jignesh
-- Thanks, Harshad