clintropolis commented on issue #7525: Druid Multi-Value String Columns, 
Expressions, and SQL
URL: 
https://github.com/apache/incubator-druid/issues/7525#issuecomment-486458368
 
 
   These are very good questions, and missing from the proposal because for 
cases like these I am not quite certain how to handle it yet, so thanks for 
bringing it up. :metal:
   
   > What happens if an expression has array functions, but a multi-valued 
input x is sometimes used by array functions and sometimes used by non-array 
functions? (We could disallow this, I suppose, on the grounds that the writer 
of the expression should be able to decide if they want to treat x as an array 
or not.)
   
   My current prototype doesn't have anything that explicitly is checking for 
this, so the behavior is probably from the user perspective unexpected and 
strange. What it does right now is notice the outer `x` that was used in a 
non-array function without a `map`, and then wrap the whole expression in a 
`map`, which would mean the array functions would be applied to auto cast 
single element arrays of the mapped value instead of the source array. This 
behavior is definitely not probably what a user would expect, so I think we 
probably should check for these mixed usages and consider it a malformed 
expression and some sort of error. 
   
   And while we certainly _could_ probably handle this by rewriting the 
expression tree appropriately to only map the sub expressions that makes sense, 
or in such a way that it doesn't overload the array identifier, I'm not sure 
that we _should_ allow this since it seems a bit too magical for my taste. I 
think it's fair that if the user explicitly expresses an input as having 
multiple values then it must be treated as multi-valued for all usages in the 
same expression, and it's invalid otherwise.
   
   
   > What happens if an expression has no array functions, but there are two 
multi-valued inputs x and y? (Do we map Expr over all possible pairings of 
values from x and y? Do we return… null? Do we return… some N-dimensional 
matrix?)
   
   At the time I opened the proposal my prototype would throw an exception with 
explicit language in it because I wasn't sure what to do yet. I think the first 
2 suggestions you have are the only ones which make sense to me. I haven't 
implemented expression arrays to support any sort of nesting or anything, only 
flat constant arrays so the matrix idea is out of the picture for me at least, 
and I see nothing obvious to do with such a matrix even if we could compute it 
if it made it out as the final evaluated result.
   
   After sleeping on it and doing some experiments, I think the most useful 
thing would be to do the cartesian product of all multi-value inputs which are 
not part of an array function, done by either extending the `map` function to 
allow it to just implicitly do this if given more than one argument, or maybe 
better an explicit `cartesian_map` function.
   
   > What happens if an expression uses array functions for a multi-valued 
input x but doesn't for multi-valued inputs y and z?
   
   I think we could handle this case in the same manner we handle more than one 
multi-value inputs without any array function use (question 2), at least the 
way I have things setup currently I would have to explicitly check to stop this 
from working as you might expect, so maybe would be nice to leave it that way? 
I feel less strongly about asserting that since a user acknowledged one column 
as an array that they must be aware of all other columns that are array typed 
in the expression.
   
   
   >I think what will really happen is Calcite will 'optimize' this filter into 
FALSE and plan the query into a no-op.
   
   Yes, you're correct, updated the proposal to reflect this behavior.
   
   >Will there be issues if we decide to introduce real SQL array types in the 
future? Presumably, these expression array functions will be written such that 
they apply to VARCHAR (and numeric types as well?). Will they conflict with 
honest-to-goodness SQL array functions or is there a path to supporting both?
   
   I do not believe there needs to be a conflict, I _think_ it should be 
possible to write the SQL functions such that they accept either `VARCHAR` or 
`ARRAY` and then validate that any `VARCHAR` input are marked as multi value in 
the information schema? At the expression layer the array functions are 
intended to handle both multi-value strings and true array typed columns as 
arrays internally, so that shouldn't be an issue. Ideally, at the SQL level I'd 
definitely like these array functions to support both our multi-value string 
dimensions and real SQL arrays. 
   
   Admittedly I need to do a bit more experimentation with SQL layer stuff to 
be certain that this is possible though and there are no like validation or 
syntax gotchas from us treating them as `VARCHAR` that will make working with 
`ARRAY` with the same functions painful, so we might need to reconsider if 
there happens to be a conflict. In that case, I would expect to rename the 
functions for multi-val `VARCHAR` at the SQL layer, to reserve these suggested 
names for true SQL array types since some of them are standard function names, 
and then plumb them both to the same set of Druid expression array functions to 
handle internally.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
[email protected]


With regards,
Apache Git Services

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to