[
https://issues.apache.org/jira/browse/ARROW-14344?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17431118#comment-17431118
]
Reinier van Linschoten commented on ARROW-14344:
------------------------------------------------
I have done some more diagnostics, and I think the problem lies in empty
pd.DataFrame objects whose columns have dtype "category".
See the code below:
{code:python}
import pandas as pd

columns = ['record_id', 'institute', 'survey_name', 'survey_instance_id',
           'created_on', 'sent_on', 'progress', 'completed_on', 'package_id',
           'archived']

# Simple example, works
empty_df = pd.DataFrame(columns=columns)
empty_df.reset_index(drop=True).to_feather(
    "empty_df.feather",
    compression="uncompressed",
)

# Category dtypes, don't work
cat_df = pd.DataFrame(columns=columns).astype("category")
cat_df.reset_index(drop=True).to_feather(
    "cat_df.feather",
    compression="uncompressed",
)

# Int32 dtypes, work
int_df = pd.DataFrame(columns=columns).astype("int32")
int_df.reset_index(drop=True).to_feather(
    "int_df.feather",
    compression="uncompressed",
)
{code}
Then we can try to import it in R:
{code:r}
empty_df <- arrow::read_feather("empty_df.feather") # Works
int_df <- arrow::read_feather("int_df.feather") # Works
cat_df <- arrow::read_feather("cat_df.feather") # Crashes
{code}
> [R][Python] Crash when reading empty .feather file
> --------------------------------------------------
>
> Key: ARROW-14344
> URL: https://issues.apache.org/jira/browse/ARROW-14344
> Project: Apache Arrow
> Issue Type: Bug
> Components: Python, R
> Environment: Ubuntu Server 20.04.3, arrow (R) 5.0.02, pyarrow 3.0.0
> (Python), RStudio 1.4.1717, R 4.1.0
> Reporter: Reinier van Linschoten
> Priority: Major
> Labels: R, arrow, bug, error, pandas, python
>
> I get an R Session Error in RStudio Server when I try to read an empty
> .feather file.
> Error: The previous R session was abnormally terminated due to an unexpected
> crash. You may have lost workspace data as a result of this crash.
> Reproduce:
> * Create empty pandas dataframe in Python
> * Write to .feather file with .reset_index(drop=True) and
> compression="uncompressed"
> * Try to read data in RStudio with arrow::read_feather(path)
> * Error
> I can read dataframes with one or more rows in RStudio.
> I can read the empty dataframe with pandas.read_feather(). This returns an
> empty pandas dataframe.