Hi folks,

Here is a list of features that would be good for the community to work on.
Feel free to add or comment on this list.

1: Improve handling of aggregation: Aggregation in Quickstep is slow
because a separate hash table is built for each aggregate. PR
https://github.com/apache/incubator-quickstep/pull/90 is a step toward
fixing this, but there is more to be done, including increasing the space
efficiency of the hash table, improving the finalize operation (which is
single-threaded), and considering partitioning (so that finalize can be
parallelized).

2: The use of ColumnVectors is very expensive, as it involves a full extra
read and write of the data and results in a bad memory access pattern. That
design needs to be rethought and refactored. Nav has suggested using an
iterator model instead of accessors, and that is a good idea. We can
probably go beyond that and define patterns for taking an input, applying
a predicate, and applying a projection (copy). Any ideas here are welcome.

3: We have Bloom filters, and they need to be optimized to work with joins.
Jianqiao is working on this.

4: Error handling in the system can be improved. Here we need to decide
whether to use error return codes or the C++ throw/catch mechanism. Right
now we use a mix of both. I am starting to lean toward throw/catch, as
that way we at least have a way of catching the error at the top (rather
than crashing). We can then refactor the code to add complete throw/catch
chains. Right now the most serious gap in error handling, IMHO, is
when we are loading a large file and there is a corrupted tuple near the
end. The system crashes after making the user wait, and there is no
cleanup.

5: Our type system also needs major surgery to make it easier to add new
types. Clean UDF support is also missing.
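
As a rough illustration of the pattern idea in item 2, here is a minimal
sketch of a fused "input, predicate, projection" loop. It reads each input
row once and writes the projected value straight to the output, so no
intermediate column is materialized. The names Row and SelectProject are
made up for this example and are not existing Quickstep types or APIs.

```cpp
#include <vector>

// Hypothetical row layout, for illustration only (not a Quickstep type).
struct Row {
  int key;
  double value;
};

// Fused pattern: one pass over the input that applies the predicate and
// copies the projected value directly into the output vector.
template <typename Out, typename Predicate, typename Projection>
std::vector<Out> SelectProject(const std::vector<Row> &input,
                               Predicate pred,
                               Projection proj) {
  std::vector<Out> output;
  output.reserve(input.size());  // Upper bound on the number of matches.
  for (const Row &row : input) {
    if (pred(row)) {
      output.push_back(proj(row));
    }
  }
  return output;
}
```

A caller would pass the predicate and projection as lambdas, e.g.
SelectProject<double>(rows, pred, proj). Because both are template
parameters, the compiler can inline them into the loop.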
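
For the bulk-load case in item 4, the throw/catch approach could look
roughly like the sketch below. It assumes a toy "key,value" text format,
and every name in it (MalformedTuple, ParseTuple, LoadTable, TryLoadTable)
is hypothetical rather than existing Quickstep code. Rows are staged and
only appended once the whole input has parsed, so a bad tuple near the end
leaves the table untouched, and the error is caught at the top instead of
crashing.

```cpp
#include <cstddef>
#include <sstream>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical exception type for a malformed input line.
class MalformedTuple : public std::runtime_error {
 public:
  MalformedTuple(std::size_t line_number, const std::string &text)
      : std::runtime_error("malformed tuple at line " +
                           std::to_string(line_number) + ": " + text) {}
};

// Parses one "key,value" line; throws instead of aborting on bad input.
std::pair<int, int> ParseTuple(std::size_t line_number,
                               const std::string &line) {
  std::istringstream in(line);
  int key = 0;
  int value = 0;
  char comma = '\0';
  if (!(in >> key >> comma >> value) || comma != ',') {
    throw MalformedTuple(line_number, line);
  }
  return {key, value};
}

// Stages all rows first. If any line fails to parse, the staging buffer is
// destroyed during stack unwinding and `table` is left unchanged.
void LoadTable(const std::vector<std::string> &lines,
               std::vector<std::pair<int, int>> *table) {
  std::vector<std::pair<int, int>> staged;
  staged.reserve(lines.size());
  for (std::size_t i = 0; i < lines.size(); ++i) {
    staged.push_back(ParseTuple(i + 1, lines[i]));
  }
  table->insert(table->end(), staged.begin(), staged.end());
}

// Top-level entry point: reports the error to the caller rather than
// letting it terminate the process.
bool TryLoadTable(const std::vector<std::string> &lines,
                  std::vector<std::pair<int, int>> *table,
                  std::string *error) {
  try {
    LoadTable(lines, table);
    return true;
  } catch (const std::exception &e) {
    *error = e.what();
    return false;
  }
}
```

Staging everything in memory is only reasonable for a sketch. A real
loader would more likely write to temporary blocks and drop them on
failure.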

Other thoughts?

Cheers,
Jignesh
