Hello All, I have created a JIRA to track this approach. https://issues.apache.org/jira/browse/DRILL-6312
Thanks, -Hanu On Fri, Apr 6, 2018 at 7:38 PM, Paul Rogers <par0...@yahoo.com.invalid> wrote: > Hi Aman, > > As we get into details, I suggested to Hanu that we move the discussion > into a JIRA ticket. > > >On the subject of CAST pushdown to Scans, there are potential drawbacks > > > - In general, the planner will see a Scan-Project where the Project > has CAST functions. But the Project can have arbitrary expressions, e.g > CAST(a as INT) * 5 > > Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the > Project operator. > > > or a combination of 2 CAST functions > > If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then > one simple rule is to push only the innermost cast downwards. > > > or non-CAST functions etc. > > Just keep it in Project. > > > It would be quite expensive to examine each expression (there could > be hundreds) to determine whether it is eligible to be pushed to the Scan. > > Just push CAST(<column_ref> AS <type>). Even that would be a huge win. > Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns" > is special for CSV. > > > - Expressing Nullability is not possible with CAST. If a column > should be tagged as (not)nullable, CAST syntax does not allow that. > > Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL) > ? > > > - Drill currently supports CASTing to a SQL data type, but not to > the complex types such as arrays and maps. We would have to add support > for that from a language perspective as well as the run-time. This would > be non-trivial effort. > > The term "complex type" is always confusing. Consider a map. The rules > would apply recursively to the members of the map. (Problem: today, if I > reference a map member, Drill pulls it to the top level: SELECT m.a creates > a new top-level field, it does not select "a" within "m". We need to fix > that anyway. So, CAST(m.a AS INT) should imply the type of column "a" > within map "m". > > For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS > INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a > repeated list (2D array). > > Unions don't need a solution as they are their own solution (they can hold > multiple types.) Same for (non-repeated) lists. > > To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that > "m" is a Map from the expression CAST(m.a AS INT). For arrays, the > previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be > a non-null scalar, then we have no good answer. > > CAST cannot solve the nasty cases of JSON in which some fields are > complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value: > "foo"}}. I suppose no solution is perfect... > > I'm sure that, if someone gets a chance to desig this feature, they'll > find lots more issues. Maybe cast push-down is only a partial solution. > But, it seems to solve so many of the JSON and CSV cases that I've seen > that it seems too good to pass up. > > Thanks, > > > - Paul