Hi folks,

Here is a list of features that would be good for the community to work on. Feel free to add to or comment on this list.

1: Improve handling of aggregation: Aggregate handling in Quickstep is slow because a separate hash table is built for each aggregate. PR https://github.com/apache/incubator-quickstep/pull/90 is a step toward fixing this, but there is more to be done, including increasing the space efficiency of the hash table, improving the finalize operation (which is single-threaded), and considering partitioning (so that finalize can be parallelized).

2: The use of ColumnVectors is very expensive, as it involves a full extra read and write of the data and results in a bad memory access pattern. That design needs to be rethought/refactored. Nav has suggested using an iterator model rather than accessors, and that is a good idea. We can probably go beyond that and think about defining patterns for taking an input, applying a predicate, and applying a projection (copy). Any ideas here are welcome.

3: We have Bloom filters, and they need to be optimized to work with joins. Jianqiao is working on this.

4: Error handling in the system can be improved. Here we need to decide whether we want to use error return codes or the C++ throw/catch mechanism. Right now we use a mix of both. I am starting to lean toward throw/catch, as that way we at least have a way of catching the error at the top (rather than crashing). We can then refactor the code to add entire throw/catch chains. Right now the most serious gap in error handling, IMHO, is when we are loading a large file and there is a corrupted tuple near the end. The system crashes after making the user wait, and there is no cleanup.

5: Our type system also needs major surgery to make it easier to add new types. Clean UDF support is also missing.

Other thoughts?

Cheers,
Jignesh
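
To make item 4 a bit more concrete, here is a minimal sketch of what the throw/catch approach could look like for the bulk-load case. It is only an illustration under simplifying assumptions: tuples are comma-separated integers, the "table" is an in-memory vector, and every name in it (MalformedTupleError, ParseTuple, LoadAllOrNothing, LoadResult) is hypothetical and not part of the Quickstep code base.

```cpp
#include <cstddef>
#include <istream>
#include <iterator>
#include <sstream>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical exception type for a corrupted input tuple (not a Quickstep class).
class MalformedTupleError : public std::runtime_error {
 public:
  MalformedTupleError(std::size_t line_number, const std::string &detail)
      : std::runtime_error("line " + std::to_string(line_number) + ": " + detail) {}
};

// Parses one comma-separated line of integers.
// Throws MalformedTupleError instead of returning an error code.
std::vector<int> ParseTuple(const std::string &line,
                            std::size_t line_number,
                            std::size_t num_columns) {
  std::vector<int> tuple;
  std::istringstream fields(line);
  std::string field;
  while (std::getline(fields, field, ',')) {
    try {
      std::size_t consumed = 0;
      const int value = std::stoi(field, &consumed);
      if (consumed != field.size()) {
        throw std::invalid_argument("trailing characters");
      }
      tuple.push_back(value);
    } catch (const std::logic_error &) {
      // std::stoi reports failures via std::invalid_argument or
      // std::out_of_range, both of which derive from std::logic_error.
      throw MalformedTupleError(line_number, "bad field '" + field + "'");
    }
  }
  if (tuple.size() != num_columns) {
    throw MalformedTupleError(line_number, "wrong number of columns");
  }
  return tuple;
}

struct LoadResult {
  bool ok;
  std::string error;
};

// Top-level entry point: the single place where load errors are caught.
// Tuples are staged locally and appended to *table only if every line parses,
// so a corrupted tuple near the end leaves the table untouched.
LoadResult LoadAllOrNothing(std::istream &input,
                            std::size_t num_columns,
                            std::vector<std::vector<int>> *table) {
  std::vector<std::vector<int>> staged;
  try {
    std::string line;
    std::size_t line_number = 0;
    while (std::getline(input, line)) {
      ++line_number;
      staged.push_back(ParseTuple(line, line_number, num_columns));
    }
  } catch (const MalformedTupleError &e) {
    // 'staged' is destroyed on return, which is the cleanup in this sketch.
    return {false, e.what()};
  }
  table->insert(table->end(),
                std::make_move_iterator(staged.begin()),
                std::make_move_iterator(staged.end()));
  return {true, ""};
}
```

The point of the sketch is the shape rather than the details: the parsing code throws, one handler at the top turns the exception into a message for the user, and cleanup falls out of scope-based destruction. In the real system, the staged data would presumably be the storage blocks created during the load, which would need to be dropped on failure.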