Hi Aman, As we get into details, I suggested to Hanu that we move the discussion into a JIRA ticket.
>On the subject of CAST pushdown to Scans, there are potential drawbacks > - In general, the planner will see a Scan-Project where the Project has CAST functions. But the Project can have arbitrary expressions, e.g CAST(a as INT) * 5 Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the Project operator. > or a combination of 2 CAST functions If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then one simple rule is to push only the innermost cast downwards. > or non-CAST functions etc. Just keep it in Project. > It would be quite expensive to examine each expression (there could be hundreds) to determine whether it is eligible to be pushed to the Scan. Just push CAST(<column_ref> AS <type>). Even that would be a huge win. Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns" is special for CSV. > - Expressing Nullability is not possible with CAST. If a column should be > tagged as (not)nullable, CAST syntax does not allow that. Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL) ? > - Drill currently supports CASTing to a SQL data type, but not to the complex types such as arrays and maps. We would have to add support for that from a language perspective as well as the run-time. This would be non-trivial effort. The term "complex type" is always confusing. Consider a map. The rules would apply recursively to the members of the map. (Problem: today, if I reference a map member, Drill pulls it to the top level: SELECT m.a creates a new top-level field, it does not select "a" within "m". We need to fix that anyway. So, CAST(m.a AS INT) should imply the type of column "a" within map "m". For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a repeated list (2D array). Unions don't need a solution as they are their own solution (they can hold multiple types.) Same for (non-repeated) lists. To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that "m" is a Map from the expression CAST(m.a AS INT). For arrays, the previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be a non-null scalar, then we have no good answer. CAST cannot solve the nasty cases of JSON in which some fields are complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value: "foo"}}. I suppose no solution is perfect... I'm sure that, if someone gets a chance to desig this feature, they'll find lots more issues. Maybe cast push-down is only a partial solution. But, it seems to solve so many of the JSON and CSV cases that I've seen that it seems too good to pass up. Thanks, - Paul