gianm opened a new pull request #9111: Add HashJoinSegment, a virtual segment for joins.
URL: https://github.com/apache/druid/pull/9111

An initial step towards #8728. This patch adds enough functionality to implement a joining cursor on top of a normal datasource. It does not include enough to actually do a query. For that, future patches will need to wire this low-level functionality into the query language.

The main files in this patch:

- HashJoinSegment: The virtual join Segment described in #8728.
- HashJoinSegmentStorageAdapter: Storage adapter for that segment; "makeCursors" is the interesting part.
- HashJoinEngine: Contains JoinColumnSelectorFactory, JoinCursor, which together implement the row-by-row logic of a join.
- LookupJoinable: Allows joining onto lookups.
- IndexedTableJoinable: A more flexible Joinable that can have multiple columns in general, including multiple key columns. I expect this will be used for joining onto subquery results in the future. It may even be used as a sort of super-lookup.

Some supporting elements:

- Added a "withDimension" method to DimensionSpec so prefixed dimensions can be rewritten to remove their prefixes.
- Added "canIterate" and "iterable" to LookupExtractor, necessary for right and full joins on lookups. It will also be useful for direct queries on lookups in the future.
- Removed "getSegmentIdentifier" method from StorageAdapter. It was not being used.
- Moved RowBasedColumnSelectorFactory out of the groupBy engine, reflecting the fact that it has been used by other, non-groupBy things. Also, split out the RowAdapter interface, which is now used by RowBasedIndexedTable as well.
- Renamed VectorColumnStrategizer to VectorColumnProcessorFactory (see below).
- Added a "ColumnProcessors" utility class and "ColumnProcessorFactory" interface that is currently only used to make join condition matchers in IndexedTableJoinMatcher. It wasn't strictly necessary, but I think it's designed better than ColumnSelectorStrategyFactory, and could replace it in the future. It's similar in design to VectorColumnProcessorFactory.

Next steps:

- Implement the rest of "data server behavior" from https://github.com/apache/druid/issues/8728 (this patch is number 3, the virtual join Segment).
- Implement "broker behavior" from https://github.com/apache/druid/issues/8728.
- Implement SQL planning.
- Various performance optimizations: filter push-down, deferred lookupName during condition matching / row retrieval, vectorized joins.
- Fix handling of right-joins; see comment in HashJoinEngine: "Warning! The way this engine handles 'righty' joins is flawed: it generates the 'remainder' rows per-segment, but this should really be done globally. This should be improved in the future."
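
To make the right-join caveat above concrete, here is a minimal, self-contained sketch of why per-segment "remainder" rows are a problem. It is not the HashJoinEngine code: the class and method names are hypothetical, rows are reduced to a single string key, and the right-hand side is a plain in-memory map standing in for a lookup.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration only; none of these names exist in Druid.
final class RightJoinRemainderSketch {

    /**
     * Right-joins one batch of left-side keys against a key -> value table.
     * Matched rows are emitted as "key=value". Right-side entries that no
     * left row matched are emitted afterwards as "null=value" (the remainder).
     */
    static List<String> rightJoinSegment(List<String> leftKeys, Map<String, String> rightTable) {
        List<String> out = new ArrayList<>();
        Set<String> matchedRightKeys = new HashSet<>();

        // Row-by-row probe of the hash table with each left-side key.
        for (String key : leftKeys) {
            String value = rightTable.get(key);
            if (value != null) {
                matchedRightKeys.add(key);
                out.add(key + "=" + value);
            }
            // Unmatched left rows are dropped in a right join.
        }

        // Remainder: right-side entries never matched by this batch.
        for (Map.Entry<String, String> entry : rightTable.entrySet()) {
            if (!matchedRightKeys.contains(entry.getKey())) {
                out.add("null=" + entry.getValue());
            }
        }
        return out;
    }

    /** Runs the join independently on each segment and concatenates the results. */
    static List<String> joinPerSegment(List<List<String>> segments, Map<String, String> rightTable) {
        List<String> out = new ArrayList<>();
        for (List<String> segment : segments) {
            out.addAll(rightJoinSegment(segment, rightTable));
        }
        return out;
    }

    public static void main(String[] args) {
        Map<String, String> lookup = new LinkedHashMap<>();
        lookup.put("US", "United States");
        lookup.put("FR", "France");

        // The same left-side data, split across two segments.
        List<List<String>> segments = Arrays.asList(Arrays.asList("US"), Arrays.asList("FR"));

        // Per-segment remainders: each segment reports the other's key as unmatched.
        System.out.println(joinPerSegment(segments, lookup));

        // Global view: every right-side entry is matched, so there is no remainder.
        System.out.println(rightJoinSegment(Arrays.asList("US", "FR"), lookup));
    }
}
```

In this sketch the per-segment run emits four rows, two of them spurious "null=" remainder rows, while the global run emits only the two matched rows. Tracking matched right-side keys across all segments before emitting the remainder is one way the flaw could be addressed; the PR leaves that for future work.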
