[ https://issues.apache.org/jira/browse/ARROW-9809?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17181616#comment-17181616 ]

Jorge edited comment on ARROW-9809 at 8/21/20, 5:15 AM:
--------------------------------------------------------

In general, I think that we could consider the following:

* Expressions in the logical plan have a type and nullability (a function that 
maps the inputs' metadata to the output's metadata, such as {{Expr::get_type}})
* Expressions in the physical plan have a type and nullability (functions that 
map the inputs' metadata to the output's metadata, currently {{PhysicalExpr::data_type}})
* {{PhysicalExpr::data_type}} must match the actual calculation's return type 
(i.e. the builder that is actually used, depending on the input type)
* The physical planner must {{assert}} that both type and nullability are 
preserved during conversion from logical to physical.
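
A minimal sketch of this contract (hypothetical types and function names, not DataFusion's actual API):

```rust
// Hypothetical sketch of the contract above; the names are illustrative only.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum DataType {
    UInt32,
    Int64,
}

// Logical side: an Expr::get_type-style function mapping the inputs'
// metadata (type, nullability) to the output's metadata.
pub fn logical_meta(lhs: (DataType, bool), rhs: (DataType, bool)) -> (DataType, bool) {
    let ty = if lhs.0 == DataType::Int64 || rhs.0 == DataType::Int64 {
        DataType::Int64
    } else {
        DataType::UInt32
    };
    // The result is nullable if either input is nullable.
    (ty, lhs.1 || rhs.1)
}

// Physical side: a PhysicalExpr::data_type-style function. Here it simply
// delegates, but a real kernel derives this independently from whichever
// builder it actually uses -- which is exactly why the assert is needed.
pub fn physical_meta(lhs: (DataType, bool), rhs: (DataType, bool)) -> (DataType, bool) {
    logical_meta(lhs, rhs)
}

// Planner-level invariant: type and nullability must be preserved when
// converting logical to physical, or the planner errors.
pub fn check_conversion(lhs: (DataType, bool), rhs: (DataType, bool)) -> Result<(), String> {
    let l = logical_meta(lhs, rhs);
    let p = physical_meta(lhs, rhs);
    if l == p {
        Ok(())
    } else {
        Err(format!("meta mismatch: logical {:?} vs physical {:?}", l, p))
    }
}
```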

It is our responsibility to ensure that, within datafusion, the schema from our 
logical plan matches the schema from the physical plan (or the planner errors). 
If anyone decides to use different physical plans (e.g. GPU), it is their 
responsibility to ensure that:

* {{PhysicalExpr::data_type}} must match the actual calculation's return type 
(i.e. the builder that is actually used)
* The physical planner must {{assert}} that both type and nullability are 
preserved during conversion from logical to physical.

From this perspective, our logical plan is the user's expectation of the 
output schema, while the physical plan is the developer's representation of 
the computation, and users can switch planners to derive different physical 
plans from logical plans, both at the compute level (e.g. CPU vs GPU, via 
{{create_physical_expr}}) and at the distribution level (local vs distributed, 
via {{create_physical_plan}} and {{create_aggregate_expr}}).

IMO the type coercer should not be a logical optimizer, but a physical one: 
before type coercion, all logical types are already set by the scanned types, 
and thus the whole logical plan can be derived. We use a type coercer because 
our physical expressions only support a subset of all operations (e.g. we 
support u32 * u32, but not u32 * u16). The output type is still the same 
(u32), and the logical plan does not care. IMO this is a physical (compute) -- 
not logical (types and nullability) -- issue. If someone else had a way to 
compute u32 * u16 -> u32, IMO our logical plan should not have to know about it 
and our coercion rule would not need to be applied.
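
To illustrate, coercion for the u32 * u16 case amounts to inserting a cast so the kernel only ever sees u32 * u32, while the output type is unchanged. A hypothetical sketch (a toy expression enum, not DataFusion's actual {{Expr}}):

```rust
// Hypothetical sketch: coercion as a physical (compute) rewrite that
// leaves the logical output type untouched.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    UInt16,
    UInt32,
}

#[derive(Debug, Clone, PartialEq)]
pub enum Expr {
    Column(&'static str, DataType),
    Cast(Box<Expr>, DataType),
    Multiply(Box<Expr>, Box<Expr>),
}

impl Expr {
    pub fn data_type(&self) -> DataType {
        match self {
            Expr::Column(_, t) => t.clone(),
            Expr::Cast(_, t) => t.clone(),
            // After coercion both sides agree, so the left side's type suffices.
            Expr::Multiply(l, _) => l.data_type(),
        }
    }
}

// The kernel only supports u32 * u32, so widen any u16 operand to u32.
// The result type (u32) is the same either way, so the logical schema
// never needs to know this rewrite happened.
pub fn coerce_multiply(l: Expr, r: Expr) -> Expr {
    let widen = |e: Expr| match e.data_type() {
        DataType::UInt16 => Expr::Cast(Box::new(e), DataType::UInt32),
        DataType::UInt32 => e,
    };
    Expr::Multiply(Box::new(widen(l)), Box::new(widen(r)))
}
```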

With this said:

> I would think that the schema should match between the optimized logical plan 
> and the physical plan though?

If we want to enforce that physical expressions always carry the same names as 
their logical expressions, then yes, we should enforce the same schema. 
Strictly speaking, we only need to enforce type and nullability.

> It seems that we should determine data types and nullability only in the 
> logical plan and then pass that information to the physical plan rather than 
> re-compute them there.

We can do that, but it would make it harder to find errors when someone else 
writes a new physical expression: they would not have to write a {{data_type}} 
for it, but would have to match the {{data_type}} (via whichever builder they 
use) that the logical plan passes to them.

Since all operations are dynamically typed, I think that we should continue to 
have both {{PhysicalExpr::data_type}} and {{Expr::get_type}}, and perform the 
assert at the planner level, not only in tests.

Currently, we do not do this: in the planner we match statements of the form 
{{LogicalPlan::Projection { input, expr, .. }}}, which make no use of the 
logical schema and instead derive a new physical schema that may or may not 
match the logical one.
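
The planner-level check could be as simple as comparing fields pairwise. A hypothetical sketch (a toy {{Field}} struct, not Arrow's actual schema types):

```rust
// Hypothetical sketch: instead of silently deriving a new physical schema,
// the planner checks it against the logical schema and errors on mismatch.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: String,
    pub nullable: bool,
}

pub fn assert_schemas_match(logical: &[Field], physical: &[Field]) -> Result<(), String> {
    if logical.len() != physical.len() {
        return Err(format!(
            "field count mismatch: logical {} vs physical {}",
            logical.len(),
            physical.len()
        ));
    }
    for (l, p) in logical.iter().zip(physical.iter()) {
        // Strictly speaking, only type and nullability must match;
        // enforcing equal names is a separate design choice.
        if l.data_type != p.data_type || l.nullable != p.nullable {
            return Err(format!("schema mismatch on field '{}'", l.name));
        }
    }
    Ok(())
}
```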



> [Rust] [DataFusion] logical schema = physical schema is not true
> ----------------------------------------------------------------
>
>                 Key: ARROW-9809
>                 URL: https://issues.apache.org/jira/browse/ARROW-9809
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: Rust, Rust - DataFusion
>            Reporter: Jorge
>            Priority: Major
>
> In tests/sql.rs, we test that the physical and the optimized schema must 
> match. However, this is not necessarily true for all our queries. An example:
> {code:rust}
> #[test]
> fn csv_query_sum_cast() {
>     let mut ctx = ExecutionContext::new();
>     register_aggregate_csv_by_sql(&mut ctx);
>     // c8 = i32; c9 = i64
>     let sql = "SELECT c8 + c9 FROM aggregate_test_100";
>     // check that the physical and logical schemas are equal
>     execute(&mut ctx, sql);
> }
> {code}
> The physical expression (and schema) of this operation, after optimization, 
> is {{CAST(c8 as Int64) Plus c9}} (this test fails).
> AFAIK, the invariant of the optimizer is that the output types and 
> nullability are the same.
> Also, note that the reason the optimized logical schema equals the logical 
> schema is that our type coercer does not change the output names of the 
> schema, even though it re-writes logical expressions. I.e. after the 
> optimization, {{to_field()}} of an expression may no longer match the field 
> name or the type in the Plan's schema. IMO this is currently by (implicit?) 
> design, as we do not want our logical schema's column names to change during 
> optimizations, or all column references may point to non-existent columns. 
> This is something that was brought up on the mailing list about polymorphism.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
