gianm edited a comment on issue #12262: URL: https://github.com/apache/druid/issues/12262#issuecomment-1055125017
> Gian, Are there any algorithms that would be made possible by this architecture? (By algorithms, I mean distributed implementations of relational operators, e.g. shuffle-sort, shuffle-join, broadcast-join, hybrid-hash-join, use of bloom filters.) If so, can you list the algorithms that you have in mind? @julianhyde I had in mind that a "partition-sort-shuffle-merge" operation would be the basic building block. The "execution coordinator" I mentioned should understand those, & for each stage should accept parameters like: - set of input stages (this is what makes it a DAG) - which input stages should be broadcast and which should be partitioned - shuffle key for output - whether the shuffled output needs to be sorted (if not, we can stream through the stage; if so, we need a buffer between stages) With that building block I think the most useful algorithms to implement at first would be shuffle sort (trivial given the building block), sort-based aggregation, sort-merge join, broadcast hash join, and shuffle hash join. Btw: I think certain of these algorithms would be hybrid in some way if you consider the flow end to end. For example: by "sort-based aggregation" I just mean the distributed part. I had imagined the part that happens local to the leaf server being hash-based, followed by a sort-based distributed step. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: [email protected] For queries about this service, please contact Infrastructure at: [email protected] --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
