clintropolis commented on issue #7525: Druid Multi-Value String Columns, Expressions, and SQL URL: https://github.com/apache/incubator-druid/issues/7525#issuecomment-486458368 These are very good questions, and missing from the proposal because for cases like these I am not quite certain how to handle it yet, so thanks for bringing it up. :metal: > What happens if an expression has array functions, but a multi-valued input x is sometimes used by array functions and sometimes used by non-array functions? (We could disallow this, I suppose, on the grounds that the writer of the expression should be able to decide if they want to treat x as an array or not.) My current prototype doesn't have anything that explicitly is checking for this, so the behavior is probably from the user perspective unexpected and strange. What it does right now is notice the outer `x` that was used in a non-array function without a `map`, and then wrap the whole expression in a `map`, which would mean the array functions would be applied to auto cast single element arrays of the mapped value instead of the source array. This behavior is definitely not probably what a user would expect, so I think we probably should check for these mixed usages and consider it a malformed expression and some sort of error. And while we certainly _could_ probably handle this by rewriting the expression tree appropriately to only map the sub expressions that makes sense, or in such a way that it doesn't overload the array identifier, I'm not sure that we _should_ allow this since it seems a bit too magical for my taste. I think it's fair that if the user explicitly expresses an input as having multiple values then it must be treated as multi-valued for all usages in the same expression, and it's invalid otherwise. > What happens if an expression has no array functions, but there are two multi-valued inputs x and y? (Do we map Expr over all possible pairings of values from x and y? Do we return… null? Do we return… some N-dimensional matrix?) At the time I opened the proposal my prototype would throw an exception with explicit language in it because I wasn't sure what to do yet. I think the first 2 suggestions you have are the only ones which make sense to me. I haven't implemented expression arrays to support any sort of nesting or anything, only flat constant arrays so the matrix idea is out of the picture for me at least, and I see nothing obvious to do with such a matrix even if we could compute it if it made it out as the final evaluated result. After sleeping on it and doing some experiments, I think the most useful thing would be to do the cartesian product of all multi-value inputs which are not part of an array function, done by either extending the `map` function to allow it to just implicitly do this if given more than one argument, or maybe better an explicit `cartesian_map` function. > What happens if an expression uses array functions for a multi-valued input x but doesn't for multi-valued inputs y and z? I think we could handle this case in the same manner we handle more than one multi-value inputs without any array function use (question 2), at least the way I have things setup currently I would have to explicitly check to stop this from working as you might expect, so maybe would be nice to leave it that way? I feel less strongly about asserting that since a user acknowledged one column as an array that they must be aware of all other columns that are array typed in the expression. >I think what will really happen is Calcite will 'optimize' this filter into FALSE and plan the query into a no-op. Yes, you're correct, updated the proposal to reflect this behavior. >Will there be issues if we decide to introduce real SQL array types in the future? Presumably, these expression array functions will be written such that they apply to VARCHAR (and numeric types as well?). Will they conflict with honest-to-goodness SQL array functions or is there a path to supporting both? I do not believe there needs to be a conflict, I _think_ it should be possible to write the SQL functions such that they accept either `VARCHAR` or `ARRAY` and then validate that any `VARCHAR` input are marked as multi value in the information schema? At the expression layer the array functions are intended to handle both multi-value strings and true array typed columns as arrays internally, so that shouldn't be an issue. Ideally, at the SQL level I'd definitely like these array functions to support both our multi-value string dimensions and real SQL arrays. Admittedly I need to do a bit more experimentation with SQL layer stuff to be certain that this is possible though and there are no like validation or syntax gotchas from us treating them as `VARCHAR` that will make working with `ARRAY` with the same functions painful, so we might need to reconsider if there happens to be a conflict. In that case, I would expect to rename the functions for multi-val `VARCHAR` at the SQL layer, to reserve these suggested names for true SQL array types since some of them are standard function names, and then plumb them both to the same set of Druid expression array functions to handle internally.
---------------------------------------------------------------- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: [email protected] With regards, Apache Git Services --------------------------------------------------------------------- To unsubscribe, e-mail: [email protected] For additional commands, e-mail: [email protected]
