How about modeling them as a collection (ARRAY or MULTISET)? This document 
describes them pretty well: 
http://farrago.sourceforge.net/design/CollectionTypes.html 
<http://farrago.sourceforge.net/design/CollectionTypes.html>


> On Feb 14, 2017, at 5:08 PM, Gian Merlino <[email protected]> wrote:
> 
> Hey Calcites,
> 
> I'm hoping for some feedback on how to best handle Druid multi-value
> columns in SQL.
> 
> Background: Druid has a "multi-value column" feature that lets you have
> string columns with more than one value per row. Details and examples of
> how this works are here:
> http://druid.io/docs/latest/querying/multi-value-dimensions.html. The short
> version is that filtering on values works under the rule that "rows match a
> filter if any value in a multi-value dimension matches your predicate".
> Grouping works by sort of causing an explosion into multiple result rows,
> similar to what Pig does when you flatten a bag. Selecting without grouping
> doesn't do the exploding thing; instead it gives you the array of values.
> 
> These behaviors are intended to make multi-value columns work well to hold
> data like "tags" or "keywords" where you might want to ask questions like:
> "how many rows have the tag 't1'" or "count the number of distinct users
> for each tag".
> 
> The current Calcite-based Druid SQL stuff doesn't handle this in any way
> that really makes sense. The biggest issue is the expression simplifier,
> which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false".
> But, it's possible for a row to match that if "tags" is multi-value.
> Another issue is that the type is reported as a simple "varchar" and there
> is no indication that multiple values are possible.
> 
> I'm wondering what _would_ make the most sense in the SQL framework. The
> simplest thing is to keep reporting it as "varchar", adjust the expression
> simplifying rules to be aware of the fact that some optimizations shouldn't
> be applied to multi-value columns, and leave it at that. The behavior
> wouldn't be quite what you would expect for a varchar type but it should
> "work" in a sense.
> 
> Or we could report a different type than "varchar" and maybe do some other
> things differently too?
> 
> Gian

Reply via email to