yurmix opened a new issue #7009: Support for Multi-Values in Lookups
URL: https://github.com/apache/incubator-druid/issues/7009
 
 
   # Motivation
   The motivation is to combine two Druid features: [Multi-value 
dimensions](http://druid.io/docs/latest/querying/multi-value-dimensions.html) 
and [Lookups](http://druid.io/docs/latest/querying/lookups.html). This will 
allow multi-value dimensions that are dynamic and resolve to the correct value 
at query time, instead of at data ingestion time, thus avoiding frequent and 
expensive data reindexing.
   
   _Note: This is considered a specific use case for now, so the proposed 
implementation tries to minimize breaking changes._
   
   # Proposed Changes
   ## Overview
   Lookups are a type of [Extraction 
function](http://druid.io/docs/latest/querying/dimensionspecs.html#extraction-functions).
 Extraction functions transform a string (that is, a dimension/column value) 
into a new value. The Druid query engine calls the extraction function's 
`apply()` method to perform the transformation on the dimension's original 
values. 
   For example:
   
https://github.com/apache/incubator-druid/blob/3ae563263a23000560749071d262727d47296856/processing/src/main/java/org/apache/druid/segment/column/StringDictionaryEncodedColumn.java#L128
   Each extraction function has a DimensionSpec, which defines the extraction 
API and its part in the query JSON.
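
   To make the current single-valued behavior concrete, here is a minimal sketch of a lookup acting as an extraction function. It is not Druid's actual code: `SimpleExtractionFn` and `MapLookupFn` are hypothetical stand-ins for illustration, and Druid's real `ExtractionFn` interface has more methods than the single `apply()` shown here.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified stand-in for an extraction function.
// Druid's real ExtractionFn interface is larger than this.
interface SimpleExtractionFn {
    String apply(String value);
}

// A lookup is an extraction function backed by a key-value map.
class MapLookupFn implements SimpleExtractionFn {
    private final Map<String, String> map;

    MapLookupFn(Map<String, String> map) {
        // Defensive copy so later changes to the caller's map are not visible here.
        this.map = new HashMap<>(map);
    }

    @Override
    public String apply(String value) {
        // One input value maps to at most one output value.
        // Assumption for this sketch: unknown keys yield null.
        return map.get(value);
    }
}
```

   The point to note is the return type: `apply()` yields a single `String`, so a lookup in this form cannot return several values for one key.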
   
   ## Required changes
   In the case of lookups, the transformation is a key-value lookup. So, in 
order to change Lookups to provide multiple values, and in order for queries 
to use these values, the following high-level changes need to be implemented:
   
   - Changing `apply()` to return `List<String>`.
   - Changing the Lookups cache to store `List<String>` values.
   - Changing the query engine to accept `List<String>` and to apply this list 
properly (as it does with multi-value dimensions).
   
   ## Principles
   - Minimize disruption (regressions, performance) to existing Lookups and 
Extraction functions codebase and users.
   - Don’t make breaking API changes.
   
   ## Implementation
   High level implementation steps:
   
   - A new `LookupExtractor` implementation, named `MultiValueLookupExtractor`, 
is added, with the following additions:
     - New method: `List<String> applyList(String key)`
     - New map: `Map<String, List<String>> map`.
   
   - A new `DimensionSpec` called `MultiValueLookupDimensionSpec` is added. It 
exposes `MultiValueLookupExtractor`, bypassing 
`LookupExtractionFn`/`ExtractionFn`.
   _Note: I might break this into an abstract class and an implementation 
class, so that others can provide their own implementations of the concept._
   
   - The query engine (groupBy, topN, Filters, DimensionSelector...) will be 
changed to call `applyList()` when a dimension is of 
`MultiValueLookupDimensionSpec` type.
   This could be implemented in parts.
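
   The steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the proposed implementation: the class below is standalone rather than a real subclass of Druid's `LookupExtractor`, `MultiValueGroupBySketch` and `countByLookedUpValue` are hypothetical names, and returning an empty list for unknown keys is an assumption the proposal does not specify. The counting loop mimics how a grouping on a multi-value dimension lets one row contribute to several groups.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Standalone sketch; the proposal would make this a LookupExtractor implementation.
class MultiValueLookupExtractor {
    private final Map<String, List<String>> map;

    MultiValueLookupExtractor(Map<String, List<String>> map) {
        this.map = new HashMap<>(map);
    }

    // Assumption: unknown keys yield an empty list (not specified in the proposal).
    List<String> applyList(String key) {
        List<String> values = map.get(key);
        return values == null
            ? Collections.emptyList()
            : Collections.unmodifiableList(values);
    }
}

// Hypothetical helper showing how a query engine might consume applyList().
class MultiValueGroupBySketch {
    // Counts rows per looked-up value; a row whose key maps to several
    // values is counted once in each of the corresponding groups.
    static Map<String, Long> countByLookedUpValue(
            List<String> column, MultiValueLookupExtractor extractor) {
        Map<String, Long> counts = new TreeMap<>();
        for (String raw : column) {
            for (String value : extractor.applyList(raw)) {
                counts.merge(value, 1L, Long::sum);
            }
        }
        return counts;
    }
}
```

   For instance, with a lookup mapping `u1` to `[sports, news]` and `u2` to `[news]`, the column `u1, u2, u1, u3` would produce a count of 2 for `sports` and 3 for `news`, while `u3` contributes to no group under the empty-list assumption.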
   
   # Changed Interfaces
   No breaking API changes. 
   Adding a new DimensionSpec.
   
   # Migration / Operational impact
   No migration needed.
   
   # Rejected Alternatives
   Originally there was a plan to extend extraction functions along with 
lookups, but there were several caveats identified:
   * It is a much bigger change, both to extractions and to existing 
lookups. It doesn’t involve just adding an `applyList()` method but also 
things like storing the `extractionFn` in a separate field and returning it 
separately, and so on.
   * It isn’t clear whether extractions would benefit from this feature. It 
might be relevant only for lookups.

   The main issue with the chosen proposal is that it creates two separate code 
paths. To minimize that, we could decide that if this feature attracts 
enough demand, we will refactor the code in the future to merge the Multi-Value 
lookups code into the main lookups code.
   
   # Future work
   * This feature can be implemented gradually (across multiple PRs) in the 
query engine code and officially released/documented when fully ready. This 
approach might be required, as each query-engine part needs its own deeper 
research.
   * At some point, if this becomes widely used, we can consider folding it 
into the mainstream Lookups instead of keeping two separate code paths.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services
