yurmix opened a new issue #7009: Support for Multi-Values in Lookups
URL: https://github.com/apache/incubator-druid/issues/7009

# Motivation

The motivation is to combine two Druid features: [Multi-value dimensions](http://druid.io/docs/latest/querying/multi-value-dimensions.html) and [Lookups](http://druid.io/docs/latest/querying/lookups.html). This allows multi-value dimensions that are dynamic and resolve to the correct value at query time instead of at data ingestion time, thus avoiding frequent and expensive data reindexing.

_Note: This is considered a specific use case for now, so the proposed implementation tries to minimize breaking changes accordingly._

# Proposed Changes

## Overview

Lookups are a type of [Extraction function](http://druid.io/docs/latest/querying/dimensionspecs.html#extraction-functions). Extraction functions transform a string (that is, a dimension/column value) into a new value. The Druid query engine calls the extraction function's `apply()` to perform the transformation on the dimension's original values, e.g.:
https://github.com/apache/incubator-druid/blob/3ae563263a23000560749071d262727d47296856/processing/src/main/java/org/apache/druid/segment/column/StringDictionaryEncodedColumn.java#L128

Each extraction function has a DimensionSpec, which defines the extraction API and its part in the query JSON.

## Required changes

In the case of lookups, the transformation is a key-value lookup. So, for Lookups to provide multi-values, and for queries to use these values, the following high-level changes need to be implemented:

- Changing `apply()` to return `List<String>`.
- Changing the Lookups cache to store `List<String>` values.
- Changing the query engine to accept `List<String>` and to apply this list properly (as it does with Multi-Value Dimensions).

## Principles

- Minimize disruption (regressions, performance) to the existing Lookups and Extraction functions codebase and users.
- Don’t make breaking API changes.
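To make the first two items under Required changes more concrete, here is a minimal Java sketch contrasting a single-valued lookup with a list-valued one. It is only an illustration: the class `LookupShapes` and the method `applyMulti` are hypothetical names, not part of Druid, and returning an empty list for an unmapped key is an assumption rather than something the proposal specifies.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; this is not Druid's LookupExtractor API.
final class LookupShapes {
    // Simplified single-valued lookup: one key maps to at most one value.
    static String apply(Map<String, String> lookup, String key) {
        return lookup.get(key); // null when the key is unmapped
    }

    // List-valued variant: one key maps to zero or more values.
    static List<String> applyMulti(Map<String, List<String>> lookup, String key) {
        // Assumption: an unmapped key yields an empty list rather than null.
        return lookup.getOrDefault(key, Collections.emptyList());
    }

    public static void main(String[] args) {
        Map<String, String> single = new HashMap<>();
        single.put("u1", "admin");

        Map<String, List<String>> multi = new HashMap<>();
        multi.put("u1", Arrays.asList("admin", "billing"));

        System.out.println(apply(single, "u1"));     // prints admin
        System.out.println(applyMulti(multi, "u1")); // prints [admin, billing]
        System.out.println(applyMulti(multi, "u2")); // prints []
    }
}
```

The point of the contrast is that the cache's value type and the return type change together; callers that expect a single `String` cannot consume the list-valued form without changes on the query side.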
## Implementation

High-level implementation steps:

- A new `LookupExtractor` implementation, named `MultiValueLookupExtractor`, is added, with the following additions:
  - New method: `List<String> applyList(String key)`.
  - New map: `Map<String, List<String>> map`.
- A new `DimensionSpec` called `MultiValueLookupDimensionSpec` is added. It exposes `MultiValueLookupExtractor`, bypassing `LookupExtractionFn`/`ExtractionFn`.
  _Note: I might break this into an abstract class and an implementation class, which would allow people to provide distinct implementations of the concept._
- The query engine (groupBy, topN, Filters, DimensionSelector...) will be changed to call `applyList()` when a dimension is of `MultiValueLookupDimensionSpec` type. This could be implemented in parts.

# Changed Interfaces

No breaking API changes. Adding a new DimensionSpec.

# Migration / Operational impact

No migration needed.

# Rejected Alternatives

Originally there was a plan to extend extraction functions along with lookups, but several caveats were identified:

* It is a much bigger change, both to extraction functions and to existing lookups. It doesn’t involve just adding an `applyList()` method but also things like storing the `extractionFn` in a separate field and returning it separately, and so on.
* It isn’t clear whether extraction functions would benefit from this feature. It might be relevant only for lookups.

The main issue with the chosen proposal is that it creates two separate code paths. To minimize that, we could decide that if this feature attracts enough demand, we refactor the code in the future to merge the Multi-Value lookups code into the main lookups code.

# Future work

* This feature can be implemented gradually (across multiple PRs) in the query engine code and officially released/documented when fully ready. This approach might be required, as each query-engine part needs its own deeper research.
* At some point, if this becomes widely used, we can consider folding it into the mainstream Lookups instead of keeping two separate code paths.
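
As a rough picture of the query-engine step described under Implementation, the following Java sketch counts rows per looked-up value, where a row whose key maps to several values contributes to each of them (the way grouping on a multi-value dimension behaves). It is a sketch under stated assumptions, not Druid code: `MultiValueGroupBySketch` and `countByLookedUpValue` are hypothetical names, the plain `Map` stands in for the proposed `applyList(key)`, and skipping unmapped keys is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration only; Druid's real query engines are far more involved.
final class MultiValueGroupBySketch {
    // Counts rows per looked-up value. A row whose key maps to several values
    // is counted once under each of those values.
    static Map<String, Long> countByLookedUpValue(
            List<String> dimensionValues,
            Map<String, List<String>> lookup) {
        Map<String, Long> counts = new TreeMap<>();
        for (String key : dimensionValues) {
            // Stand-in for the proposed applyList(key).
            // Assumption: rows with unmapped keys are skipped.
            List<String> values = lookup.getOrDefault(key, Collections.emptyList());
            for (String value : values) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> lookup = new TreeMap<>();
        lookup.put("u1", Arrays.asList("admin", "billing"));
        lookup.put("u2", Collections.singletonList("billing"));

        List<String> rows = Arrays.asList("u1", "u2", "u1", "u3");
        System.out.println(countByLookedUpValue(rows, lookup)); // prints {admin=2, billing=3}
    }
}
```

Because the lookup map is consulted at query time, changing the entry for `u1` changes the grouping result without touching the stored rows, which is the reindexing saving the Motivation section describes.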
