[ https://issues.apache.org/jira/browse/ARROW-13480?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Jonathan Keane updated ARROW-13480:
-----------------------------------
Description:
While working on the DuckDB integration, we found that errors do not appear to
propagate correctly through record batch readers that use the C interface. The
DuckDB issue where this came up is
https://github.com/duckdb/duckdb/issues/2055
In the example below I pass a dataset with either one or two files from R to
Python. I have deliberately mis-specified the schema to trigger an error. The
one-file version works as I expect, propagating the error up:
{code:r}
> library("arrow")
>
> venv <- try(reticulate::virtualenv_create("arrow-test"))
virtualenv: arrow-test
> install_pyarrow("arrow-test", nightly = TRUE)
[output from installing pyarrow ...]
> reticulate::use_virtualenv("arrow-test")
>
> file <- "arrow/r/inst/v0.7.1.parquet"
> arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
>
> scan <- Scanner$create(arrow_table)
> reader <- scan$ToRecordBatchReader()
> pyreader <- reticulate::r_to_py(reader)
> pytab <- pyreader$read_all()
Error in py_call_impl(callable, dots$args, dots$keywords) :
OSError: NotImplemented: Unsupported cast from double to null using function
cast_null
Detailed traceback:
File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
{code}
But with two (or more) files, the process hangs while reading the batches:
{code:r}
> library("arrow")
>
> venv <- try(reticulate::virtualenv_create("arrow-test"))
virtualenv: arrow-test
> install_pyarrow("arrow-test", nightly = TRUE)
[output from installing pyarrow ...]
> reticulate::use_virtualenv("arrow-test")
>
> file <- "arrow/r/inst/v0.7.1.parquet"
> arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
>
> scan <- Scanner$create(arrow_table)
> reader <- scan$ToRecordBatchReader()
> pyreader <- reticulate::r_to_py(reader)
> pytab <- pyreader$read_all()
{hangs forever here}
{code}
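Because the two-file case blocks indefinitely, a reproduction script can be
easier to automate if the blocking read is wrapped in a timeout. The following
is a minimal sketch in plain Python, not part of Arrow: `call_with_timeout` is
a hypothetical helper name, and it only helps if the blocked call releases the
GIL (otherwise a separate process would be needed). The abandoned thread is not
killed, so this is a diagnostic aid rather than a fix.

```python
import threading


def call_with_timeout(func, timeout_s):
    """Run func() in a daemon thread.

    Returns func's result, re-raises any exception it raised, or raises
    TimeoutError if it has not finished after timeout_s seconds.
    """
    outcome = {}

    def runner():
        try:
            outcome["value"] = func()
        except BaseException as exc:  # captured so the caller can re-raise it
            outcome["error"] = exc

    worker = threading.Thread(target=runner, daemon=True)
    worker.start()
    worker.join(timeout_s)
    if worker.is_alive():
        # Still blocked: the daemon thread is abandoned, not terminated.
        raise TimeoutError(f"call did not finish within {timeout_s} s")
    if "error" in outcome:
        raise outcome["error"]
    return outcome["value"]


# Hypothetical usage, assuming `reader` is a pyarrow RecordBatchReader:
#     table = call_with_timeout(reader.read_all, 30)
```

With such a wrapper, the one-file case would still surface the OSError shown
above, while the two-file case would be reported as a TimeoutError instead of
blocking the session.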
was:
While working on the DuckDB integration, we found that errors do not appear to
propagate correctly through record batch readers that use the C interface. The
DuckDB issue where this came up is
https://github.com/duckdb/duckdb/issues/2055
In the example below I pass a dataset with either one or two files from R to
Python. I have deliberately mis-specified the schema to trigger an error. The
one-file version works as I expect, propagating the error up:
{code:r}
> library("arrow")
>
> venv <- try(reticulate::virtualenv_create("arrow-test"))
virtualenv: arrow-test
> install_pyarrow("arrow-test", nightly = TRUE)
Using virtual environment 'arrow-test' ...
Looking in indexes: https://pypi.org/simple,
https://repo.fury.io/arrow-nightlies/
Requirement already satisfied: pyarrow in
/Users/jkeane/.virtualenvs/arrow-test/lib/python3.9/site-packages (5.0.0.dev524)
Requirement already satisfied: numpy>=1.16.6 in
/Users/jkeane/.virtualenvs/arrow-test/lib/python3.9/site-packages (from
pyarrow) (1.20.3)
WARNING: You are using pip version 21.1.2; however, version 21.2.1 is available.
You should consider upgrading via the
'/Users/jkeane/.virtualenvs/arrow-test/bin/python -m pip install --upgrade pip'
command.
> reticulate::use_virtualenv("arrow-test")
>
> file <- "arrow/r/inst/v0.7.1.parquet"
> arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
>
> scan <- Scanner$create(arrow_table)
> reader <- scan$ToRecordBatchReader()
> pyreader <- reticulate::r_to_py(reader)
> pytab <- pyreader$read_all()
Error in py_call_impl(callable, dots$args, dots$keywords) :
OSError: NotImplemented: Unsupported cast from double to null using function
cast_null
Detailed traceback:
File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
{code}
But with two (or more) files, the process hangs while reading the batches:
{code:r}
> library("arrow")
>
> venv <- try(reticulate::virtualenv_create("arrow-test"))
virtualenv: arrow-test
> install_pyarrow("arrow-test", nightly = TRUE)
Using virtual environment 'arrow-test' ...
Looking in indexes: https://pypi.org/simple,
https://repo.fury.io/arrow-nightlies/
Requirement already satisfied: pyarrow in
/Users/jkeane/.virtualenvs/arrow-test/lib/python3.9/site-packages (5.0.0.dev524)
Requirement already satisfied: numpy>=1.16.6 in
/Users/jkeane/.virtualenvs/arrow-test/lib/python3.9/site-packages (from
pyarrow) (1.20.3)
WARNING: You are using pip version 21.1.2; however, version 21.2.1 is available.
You should consider upgrading via the
'/Users/jkeane/.virtualenvs/arrow-test/bin/python -m pip install --upgrade pip'
command.
> reticulate::use_virtualenv("arrow-test")
>
> file <- "arrow/r/inst/v0.7.1.parquet"
> arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
>
> scan <- Scanner$create(arrow_table)
> reader <- scan$ToRecordBatchReader()
> pyreader <- reticulate::r_to_py(reader)
> pytab <- pyreader$read_all()
{hangs forever here}
{code}
> [C++] [R] [Python] C-interface error propagation
> -------------------------------------------------
>
> Key: ARROW-13480
> URL: https://issues.apache.org/jira/browse/ARROW-13480
> Project: Apache Arrow
> Issue Type: Improvement
> Components: C++, Python, R
> Reporter: Jonathan Keane
> Priority: Major
>
> While working on the DuckDB integration, we found that errors do not appear
> to propagate correctly through record batch readers that use the C interface.
> The DuckDB issue where this came up is
> https://github.com/duckdb/duckdb/issues/2055
> In the example below I pass a dataset with either one or two files from R to
> Python. I have deliberately mis-specified the schema to trigger an error. The
> one-file version works as I expect, propagating the error up:
> {code:r}
> > library("arrow")
> >
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> >
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 1), schema(x=arrow::null()))
> >
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> Error in py_call_impl(callable, dots$args, dots$keywords) :
> OSError: NotImplemented: Unsupported cast from double to null using
> function cast_null
> Detailed traceback:
> File "pyarrow/ipc.pxi", line 563, in pyarrow.lib.RecordBatchReader.read_all
> File "pyarrow/error.pxi", line 114, in pyarrow.lib.check_status
> {code}
> But with two (or more) files, the process hangs while reading the batches:
> {code:r}
> > library("arrow")
> >
> > venv <- try(reticulate::virtualenv_create("arrow-test"))
> virtualenv: arrow-test
> > install_pyarrow("arrow-test", nightly = TRUE)
> [output from installing pyarrow ...]
> > reticulate::use_virtualenv("arrow-test")
> >
> > file <- "arrow/r/inst/v0.7.1.parquet"
> > arrow_table <- arrow::open_dataset(rep(file, 2), schema(x=arrow::null()))
> >
> > scan <- Scanner$create(arrow_table)
> > reader <- scan$ToRecordBatchReader()
> > pyreader <- reticulate::r_to_py(reader)
> > pytab <- pyreader$read_all()
> {hangs forever here}
> {code}
--
This message was sent by Atlassian Jira
(v8.3.4#803005)