How about modeling them as a collection (ARRAY or MULTISET)? This document describes them pretty well: http://farrago.sourceforge.net/design/CollectionTypes.html <http://farrago.sourceforge.net/design/CollectionTypes.html>
> On Feb 14, 2017, at 5:08 PM, Gian Merlino <[email protected]> wrote: > > Hey Calcites, > > I'm hoping for some feedback on how to best handle Druid multi-value > columns in SQL. > > Background: Druid has a "multi-value column" feature that lets you have > string columns with more than one value per row. Details and examples of > how this works are here: > http://druid.io/docs/latest/querying/multi-value-dimensions.html. The short > version is that filtering on values works under the rule that "rows match a > filter if any value in a multi-value dimension matches your predicate". > Grouping works by sort of causing an explosion into multiple result rows, > similar to what Pig does when you flatten a bag. Selecting without grouping > doesn't do the exploding thing; instead it gives you the array of values. > > These behaviors are intended to make multi-value columns work well to hold > data like "tags" or "keywords" where you might want to ask questions like: > "how many rows have the tag 't1'" or "count the number of distinct users > for each tag". > > The current Calcite-based Druid SQL stuff doesn't handle this in any way > that really makes sense. The biggest issue is the expression simplifier, > which would incorrect simplify "tags = 't1' AND tags = 't2'" to "false". > But, it's possible for a row to match that if "tags" is multi-value. > Another issue is that the type is reported as a simple "varchar" and there > is no indication that multiple values are possible. > > I'm wondering what _would_ make the most sense in the SQL framework. The > simplest thing is to keep reporting it as "varchar", adjust the expression > simplifying rules to be aware of the fact that some optimizations shouldn't > be applied to multi-value columns, and leave it at that. The behavior > wouldn't be quite what you would expect for a varchar type but it should > "work" in a sense. > > Or we could report a different type than "varchar" and maybe do some other > things differently too? > > Gian
