[GitHub] yurmix opened a new issue #7009: Support for Multi-Values in Lookups

GitBox Tue, 05 Feb 2019 04:13:56 -0800

yurmix opened a new issue #7009: Support for Multi-Values in Lookups
URL: https://github.com/apache/incubator-druid/issues/7009

# Motivation
The motivation is to combine two Druid features: [Multi-value
dimensions](http://druid.io/docs/latest/querying/multi-value-dimensions.html)
and [Lookups](http://druid.io/docs/latest/querying/lookups.html). This will
allow multi-value dimensions that are dynamic and resolve to the correct value
at query time instead of at data ingestion time, thus avoiding frequent and
expensive data reindexing.

_Note: This is considered a specific use case for now, so the proposed
implementation tries to minimize breaking changes accordingly._

# Proposed Changes
## Overview
Lookups are a type of [Extraction
function](http://druid.io/docs/latest/querying/dimensionspecs.html#extraction-functions).
Extraction functions transform a string (that is, a dimension/column value)
into a new value. The Druid query engine calls the extraction function's
`apply()` method to perform the transformation on the dimension's original
values.
For example:

https://github.com/apache/incubator-druid/blob/3ae563263a23000560749071d262727d47296856/processing/src/main/java/org/apache/druid/segment/column/StringDictionaryEncodedColumn.java#L128

Each extraction function has a DimensionSpec which defines the extraction
API and its part in the query JSON.
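
As a rough illustration of the single-valued behaviour described above, here is a minimal sketch of a map-backed lookup used as an extraction function. It does not use Druid's real classes: `java.util.function.Function` stands in for the extraction function (it happens to expose an `apply()` method), and `SingleValueLookupSketch` and `mapLookup` are hypothetical names invented for this example.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

public class SingleValueLookupSketch {

    // Hypothetical helper: wraps a key-value map as a one-string-in,
    // one-string-out function. Unknown keys yield null in this sketch.
    public static Function<String, String> mapLookup(Map<String, String> map) {
        return map::get;
    }

    public static void main(String[] args) {
        Map<String, String> countryNames = new HashMap<>();
        countryNames.put("US", "United States");
        countryNames.put("FR", "France");

        Function<String, String> fn = mapLookup(countryNames);

        // Each original dimension value is transformed to exactly one value.
        System.out.println(fn.apply("US")); // prints "United States"
    }
}
```

The point of the sketch is the shape of the contract: one input string produces at most one output string, which is what the rest of this proposal aims to relax for lookups.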

## Required changes
In the case of lookups, the transformation is a key-value lookup. So, for
Lookups to provide multiple values, and for queries to use these values, the
following high-level changes need to be implemented:

- Changing `apply()` to return `List<String>`.
- Changing the Lookups cache to store `List<String>` values.
- Changing the query engine to accept `List<String>` and to apply this list
properly (as it does with Multi-Value Dimensions).
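
To make the intended query-time semantics concrete, the following is a minimal sketch of a grouping step in which every row contributes to each value its key maps to, in the same way a row of a multi-value dimension contributes to each of its values. It is not Druid code: `MultiValueGroupingSketch` and `countByLookedUpValue` are hypothetical names, and dropping rows whose key has no mapping is an assumption of this sketch, not something the proposal specifies.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class MultiValueGroupingSketch {

    // Counts rows per looked-up value. A row whose key maps to N values
    // is counted once in each of the N groups.
    public static Map<String, Integer> countByLookedUpValue(
            List<String> rowKeys, Map<String, List<String>> lookup) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String key : rowKeys) {
            // Assumption: keys without a mapping contribute to no group.
            List<String> values = lookup.getOrDefault(key, Collections.emptyList());
            for (String value : values) {
                counts.merge(value, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        Map<String, List<String>> userToRoles = new HashMap<>();
        userToRoles.put("u1", Arrays.asList("admin", "dev"));
        userToRoles.put("u2", Collections.singletonList("dev"));

        List<String> rows = new ArrayList<>(Arrays.asList("u1", "u2", "u1"));

        System.out.println(countByLookedUpValue(rows, userToRoles)); // prints {admin=2, dev=3}
    }
}
```

Because the mapping lives in the lookup rather than in the segment, changing `userToRoles` changes the groups at query time without reindexing, which is the motivation stated at the top of this issue.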

## Principles
- Minimize disruption (regressions, performance) to the existing Lookups and
Extraction functions codebase and to its users.
- Don’t make breaking API changes.

## Implementation
High-level implementation steps:

- A new `LookupExtractor` implementation, named `MultiValueLookupExtractor`,
is added, with the following additions:
  - New method: `List<String> applyList(String key)`
  - New map: `Map<String, List<String>> map`

- A new `DimensionSpec` called `MultiValueLookupDimensionSpec` is added. It
exposes `MultiValueLookupExtractor`, bypassing
`LookupExtractionFn`/`ExtractionFn`.
_Note: I might break this into an abstract class and an implementation class,
which would allow others to provide distinct implementations of the concept._

- The query engine (groupBy, topN, Filters, DimensionSelector...) will be
changed to call `applyList()` when a dimension is of
`MultiValueLookupDimensionSpec` type.
This could be implemented in parts.
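
The steps above could be sketched roughly as follows. This is a simplified, self-contained model rather than the proposed Druid implementation: `DimensionSpecSketch`, `SingleValueSpec`, and `resolve` are hypothetical stand-ins invented for this example, the real `DimensionSpec` and `LookupExtractor` types have many more responsibilities, and returning an empty list for unknown keys is an assumption.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.function.Function;

public class MultiValueLookupSketch {

    // Simplified model of the proposed extractor: a map of key -> values
    // plus an applyList() method, as listed in the steps above.
    public static final class MultiValueLookupExtractor {
        private final Map<String, List<String>> map;

        public MultiValueLookupExtractor(Map<String, List<String>> map) {
            this.map = map;
        }

        public List<String> applyList(String key) {
            // Assumption: unknown keys resolve to an empty list.
            return map.getOrDefault(key, Collections.emptyList());
        }
    }

    // Hypothetical stand-in for Druid's DimensionSpec.
    public interface DimensionSpecSketch {}

    // Existing path: a single-valued extraction function.
    public static final class SingleValueSpec implements DimensionSpecSketch {
        final Function<String, String> extractionFn;

        public SingleValueSpec(Function<String, String> extractionFn) {
            this.extractionFn = extractionFn;
        }
    }

    // New path: exposes the multi-value extractor directly.
    public static final class MultiValueLookupDimensionSpec implements DimensionSpecSketch {
        final MultiValueLookupExtractor extractor;

        public MultiValueLookupDimensionSpec(MultiValueLookupExtractor extractor) {
            this.extractor = extractor;
        }
    }

    // Query-engine side: call applyList() only for the new spec type and
    // leave the single-valued path unchanged.
    public static List<String> resolve(DimensionSpecSketch spec, String rawValue) {
        if (spec instanceof MultiValueLookupDimensionSpec) {
            return ((MultiValueLookupDimensionSpec) spec).extractor.applyList(rawValue);
        }
        String single = ((SingleValueSpec) spec).extractionFn.apply(rawValue);
        return single == null ? Collections.<String>emptyList() : Collections.singletonList(single);
    }

    public static void main(String[] args) {
        Map<String, List<String>> map = new HashMap<>();
        map.put("u1", Arrays.asList("admin", "dev"));

        DimensionSpecSketch multi =
                new MultiValueLookupDimensionSpec(new MultiValueLookupExtractor(map));
        DimensionSpecSketch single = new SingleValueSpec(String::toUpperCase);

        System.out.println(resolve(multi, "u1"));  // prints [admin, dev]
        System.out.println(resolve(single, "u1")); // prints [U1]
    }
}
```

Dispatching on the spec type keeps existing extraction functions untouched, at the cost of a second code path in each part of the query engine that has to perform this check.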

# Changed Interfaces
No breaking API changes.
Adding a new DimensionSpec.

# Migration / Operational impact
No migration needed.

# Rejected Alternatives
Originally there was a plan to extend extraction functions along with
lookups, but several caveats were identified:
* It is a much bigger change, both to extractions and to existing
lookups. It doesn’t involve just adding an `applyList()` method but also things
like storing the `extractionFn` in a separate field and returning it
separately, and so on.
* It isn’t clear whether extractions would benefit from this feature. It might
be relevant only for lookups.

The main issue with the chosen proposal is that it creates two separate code
paths. To minimize that, we could decide that if this feature attracts
enough demand, we can refactor the code in the future to merge the Multi-Value
lookups code into the main lookups code.

# Future work
* This feature can be implemented gradually (across multiple PRs) in the
query engine code and officially released/documented when fully ready. This
approach might be required, as each part of the query engine needs its own
deeper research.
* At some point, if this becomes widely used, we can consider folding it into
the mainstream Lookups instead of keeping two separate code paths.


