[jira] [Updated] (ARROW-17915) [C++] Error when using Substrait ProjectRel

Dewey Dunnington (Jira) Mon, 03 Oct 2022 06:52:10 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-17915?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


Dewey Dunnington updated ARROW-17915:
-------------------------------------
    Description: 
After ARROW-16989 and ARROW-15584, there is new behaviour with ProjectRel. I 
implemented a solution that worked with DuckDB's consumer in 
https://github.com/voltrondata/substrait-r/pull/181, but when I try with 
Arrow's compiler I get an error:

``` r
library(arrow, warn.conflicts = FALSE)
#> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.

plan_as_json <- '{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
    }
  ],
  "relations": [
    {
      "rel": {
        "project": {
          "common": {"emit": {"outputMapping": [2, 3]}},
          "input": {
            "read": {
              "baseSchema": {
                "names": ["int", "dbl"],
                "struct": {"types": [{"i32": {}}, {"fp64": {}}]}
              },
              "localFiles": {
                "items": [
                  {
                    "uriFile": "file://THIS_IS_THE_TEMP_FILE",
                    "parquet": {}
                  }
                ]
              }
            }
          },
          "expressions": [
            {"selection": {"directReference": {"structField": {"field": 1}}}},
            {"selection": {"directReference": {"structField": {"field": 0}}}}
          ]
        }
      }
    }
  ]
}'

temp_parquet <- tempfile()
write_parquet(data.frame(int = integer(), dbl = double()), temp_parquet)
plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
arrow:::do_exec_plan_substrait(plan_as_json)
#> Error: Invalid: Invalid column index to add field.
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338
#>   project_schema->AddField( num_columns + static_cast<int>(project.expressions().size()) - 1, std::move(project_field))
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
#>   FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
```

<sup>Created on 2022-10-03 by the [reprex 
package](https://reprex.tidyverse.org) (v2.0.1)</sup>

It's admittedly a goofy thing to do: computing a new column that is an 
identical copy of an existing column and then discarding the original. I can and 
should simplify the Substrait that I'm generating, but maybe this is also valid 
Substrait that should be accepted?
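
A minimal sketch of the emit rule as the Substrait specification describes it for a ProjectRel, assuming the direct output is the input fields followed by one field per expression, and that `outputMapping` holds 0-based indices into that combined list. The function names are hypothetical, expressions are simplified to plain field references, and this is an illustration only, not Arrow's or DuckDB's implementation:

```python
def project_direct_output(input_names, expression_refs):
    """Direct output of a ProjectRel: input fields, then one field per expression.

    Assumption: every expression is a direct reference to an input field,
    given here as its 0-based index (as in the plan above).
    """
    return list(input_names) + [input_names[i] for i in expression_refs]


def apply_emit(direct_output, output_mapping):
    """Select and reorder columns of the direct output using 0-based indices."""
    for i in output_mapping:
        if not 0 <= i < len(direct_output):
            raise IndexError(
                f"outputMapping index {i} out of range for "
                f"{len(direct_output)} columns"
            )
    return [direct_output[i] for i in output_mapping]


# The plan above: read (int, dbl), project field 1 and then field 0.
direct = project_direct_output(["int", "dbl"], [1, 0])
print(direct)                      # ['int', 'dbl', 'dbl', 'int']
print(apply_emit(direct, [2, 3]))  # ['dbl', 'int']
```

Under this reading, `[2, 3]` is in range and keeps only the two computed columns, whereas the earlier `[3, 4]` would point one past the end of the four available columns.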

  was:
After ARROW-16989 and ARROW-15584, there is new behaviour with ProjectRel. I 
implemented a solution that worked with DuckDB's consumer in 
https://github.com/voltrondata/substrait-r/pull/181, but when I try with 
Arrow's compiler I get an error:

{code:R}
library(arrow, warn.conflicts = FALSE)

plan_as_json <- '{
  "extensionUris": [
    {
      "extensionUriAnchor": 1,
      "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
    }
  ],
  "relations": [
    {
      "rel": {
        "project": {
          "common": {"emit": {"outputMapping": [3, 4]}},
          "input": {
            "read": {
              "baseSchema": {
                "names": ["int", "dbl"],
                "struct": {"types": [{"i32": {}}, {"fp64": {}}]}
              },
              "localFiles": {
                "items": [
                  {
                    "uriFile": "file://THIS_IS_THE_TEMP_FILE",
                    "parquet": {}
                  }
                ]
              }
            }
          },
          "expressions": [
            {"selection": {"directReference": {"structField": {"field": 1}}}},
            {"selection": {"directReference": {"structField": {"field": 0}}}}
          ]
        }
      }
    }
  ]
}'

temp_parquet <- tempfile()
write_parquet(data.frame(int = integer(), dbl = double()), temp_parquet)
plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
arrow:::do_exec_plan_substrait(plan_as_json)
#> Error: Invalid: Invalid column index to add field.
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338
#>   project_schema->AddField( num_columns + static_cast<int>(project.expressions().size()) - 1, std::move(project_field))
#> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
#>   FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
{code}

It's admittedly a goofy thing to do: to compute a new column that is an 
identical copy of an existing column and then discard the original. I can and 
should simplify the substrait that I'm generating, but maybe this is also valid 
substrait that should be accepted?


> [C++] Error when using Substrait ProjectRel
> -------------------------------------------
>
>                 Key: ARROW-17915
>                 URL: https://issues.apache.org/jira/browse/ARROW-17915
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Dewey Dunnington
>            Priority: Major
>
> After ARROW-16989 and ARROW-15584, there is new behaviour with ProjectRel. I 
> implemented a solution that worked with DuckDB's consumer in 
> https://github.com/voltrondata/substrait-r/pull/181, but when I try with 
> Arrow's compiler I get an error:
> ``` r
> library(arrow, warn.conflicts = FALSE)
> #> Some features are not enabled in this build of Arrow. Run `arrow_info()` for more information.
> plan_as_json <- '{
>   "extensionUris": [
>     {
>       "extensionUriAnchor": 1,
>       "uri": "https://github.com/apache/arrow/blob/master/format/substrait/extension_types.yaml"
>     }
>   ],
>   "relations": [
>     {
>       "rel": {
>         "project": {
>           "common": {"emit": {"outputMapping": [2, 3]}},
>           "input": {
>             "read": {
>               "baseSchema": {
>                 "names": ["int", "dbl"],
>                 "struct": {"types": [{"i32": {}}, {"fp64": {}}]}
>               },
>               "localFiles": {
>                 "items": [
>                   {
>                     "uriFile": "file://THIS_IS_THE_TEMP_FILE",
>                     "parquet": {}
>                   }
>                 ]
>               }
>             }
>           },
>           "expressions": [
>             {"selection": {"directReference": {"structField": {"field": 1}}}},
>             {"selection": {"directReference": {"structField": {"field": 0}}}}
>           ]
>         }
>       }
>     }
>   ]
> }'
> temp_parquet <- tempfile()
> write_parquet(data.frame(int = integer(), dbl = double()), temp_parquet)
> plan_as_json <- gsub("THIS_IS_THE_TEMP_FILE", temp_parquet, plan_as_json)
> arrow:::do_exec_plan_substrait(plan_as_json)
> #> Error: Invalid: Invalid column index to add field.
> #> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/relation_internal.cc:338
> #>   project_schema->AddField( num_columns + static_cast<int>(project.expressions().size()) - 1, std::move(project_field))
> #> /Users/dewey/Desktop/rscratch/arrow/cpp/src/arrow/engine/substrait/serde.cc:156
> #>   FromProto(plan_rel.has_root() ? plan_rel.root().input() : plan_rel.rel(), ext_set, conversion_options)
> ```
> <sup>Created on 2022-10-03 by the [reprex 
> package](https://reprex.tidyverse.org) (v2.0.1)</sup>
> It's admittedly a goofy thing to do: computing a new column that is an 
> identical copy of an existing column and then discarding the original. I can and 
> should simplify the Substrait that I'm generating, but maybe this is also 
> valid Substrait that should be accepted?



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
