Hello All,

I have created a JIRA to track this approach.
https://issues.apache.org/jira/browse/DRILL-6312

Thanks,
-Hanu

On Fri, Apr 6, 2018 at 7:38 PM, Paul Rogers <par0...@yahoo.com.invalid>
wrote:

> Hi Aman,
>
> As we get into details, I suggested to Hanu that we move the discussion
> into a JIRA ticket.
>
>  >On the subject of CAST pushdown to Scans, there are potential drawbacks
>
>  >  - In general, the planner will see a Scan-Project where the Project
> has  CAST functions.  But the Project can have arbitrary expressions,  e.g
> CAST(a as INT) * 5
>
> Suggestion: push the CAST(a AS INT) down to the scan, do the a * 5 in the
> Project operator.
>
> >  or a combination of 2 CAST functions
>
> If the user does a two-stage cast, CAST(CAST(a AS INT) AS BIGINT), then
> one simple rule is to push only the innermost cast downwards.
>
> > or non-CAST functions etc.
>
> Just keep it in Project.
>
>  >    It would be quite expensive to examine each expression (there could
> be hundreds) to determine whether it is eligible to be pushed to the Scan.
>
> Just push CAST(<column_ref> AS <type>). Even that would be a huge win.
> Note, for CSV, it might have to be CAST(columns[2] AS INT), since "columns"
> is special for CSV.
>
> >   - Expressing Nullability is not possible with CAST.  If a column
> should be tagged as  (not)nullable, CAST syntax does not allow that.
>
> Can we just add keywords: CAST(a AS INT NULL), CAST(b AS VARCHAR NOT NULL)
> ?
>
>  >  - Drill currently supports CASTing to a SQL data type, but not to
> the complex types such as arrays and maps.  We would have to add support
> for that from a language perspective as well as the run-time.  This would
> be non-trivial effort.
>
> The term "complex type" is always confusing. Consider a map. The rules
> would apply recursively to the members of the map. (Problem: today, if I
> reference a map member, Drill pulls it to the top level: SELECT m.a creates
> a new top-level field, it does not select "a" within "m". We need to fix
> that anyway.  So, CAST(m.a AS INT) should imply the type of column "a"
> within map "m".
>
> For arrays, the problem is more complex. Perhaps more syntax: CAST(a[] AS
> INT) to force array elements to INT. Maybe use CAST(a[][] AS INT) for a
> repeated list (2D array).
>
> Unions don't need a solution as they are their own solution (they can hold
> multiple types.) Same for (non-repeated) lists.
>
> To resolve runs of nulls, maybe allow CAST(m AS MAP). Or we can imply that
> "m" is a Map from the expression CAST(m.a AS INT). For arrays, the
> previously suggested CAST(a[] AS INT). If columns "a" or "m" turn out to be
> a non-null scalar, then we have no good answer.
>
> CAST cannot solve the nasty cases of JSON in which some fields are
> complex, some scalar. E.g. {a: 10} {a: [20]} or {m: "foo"} {m: {value:
> "foo"}}. I suppose no solution is perfect...
>
> I'm sure that, if someone gets a chance to desig this feature, they'll
> find lots more issues. Maybe cast push-down is only a partial solution.
> But, it seems to solve so many of the JSON and CSV cases that I've seen
> that it seems too good to pass up.
>
> Thanks,
>
>
> - Paul

Reply via email to