[ https://issues.apache.org/jira/browse/ARROW-7639?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17022128#comment-17022128 ]
Etienne Racine commented on ARROW-7639:
---------------------------------------
Thanks. I believe coercing to factors is the right choice. The pandas
[doc|https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html]
says:
{quote}The categorical data type is useful in the following cases:
* A string variable consisting of only a few different values. Converting such
a string variable to a categorical variable will save some memory, see
[here|https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-memory]
* The lexical order of a variable is not the same as the logical order (“one”,
“two”, “three”). By converting to a categorical and specifying an order on the
categories, sorting and min/max will use the logical order instead of the
lexical order, see
[here|https://pandas.pydata.org/pandas-docs/stable/user_guide/categorical.html#categorical-sort]
* As a signal to other Python libraries that this column should be treated as
a categorical variable (e.g. to use suitable statistical methods or plot
types).{quote}
Especially this last point.
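
To make the suggestion concrete, here is a minimal pure-Python sketch of what "coercing to factors" could mean for a dictionary array whose values are not strings. The helper name {{dictionary_to_factor}} and the hand-rolled encoding are hypothetical and only for illustration; this is not the code used by the arrow R package. It assumes the dictionary values are stringified to become factor levels and the 0-based indices are shifted to R's 1-based codes.

```python
def dictionary_to_factor(indices, values):
    """Coerce a dictionary-encoded array into an R-style factor.

    Hypothetical helper for illustration only; not the arrow R package's code.
    """
    # Factor levels must be strings, so stringify each dictionary value.
    levels = [str(v) for v in values]
    # R factor codes are 1-based; keep missing entries (None) as missing.
    codes = [None if i is None else i + 1 for i in indices]
    return codes, levels


# The float column from the report, dictionary-encoded by hand:
# the sorted unique values form the dictionary, the indices point into it.
column = [0.1, 0.2, 0.5, 0.001]
values = sorted(set(column))
indices = [values.index(x) for x in column]

codes, levels = dictionary_to_factor(indices, values)
print(levels)
print(codes)
```

The numeric values survive only as level labels, which matches how R factors behave; whether that is acceptable for a given column is a judgment call.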
> [R] Cannot convert Dictionary Array to R when values aren't strings
> -------------------------------------------------------------------
>
> Key: ARROW-7639
> URL: https://issues.apache.org/jira/browse/ARROW-7639
> Project: Apache Arrow
> Issue Type: Bug
> Components: R
> Affects Versions: 0.15.1
> Environment: Ubuntu 16.04.5 LTS
> Reporter: Etienne Racine
> Assignee: Neal Richardson
> Priority: Major
> Labels: pull-request-available
> Fix For: 0.16.0
>
> Time Spent: 20m
> Remaining Estimate: 0h
>
> I got an error in R when reading a feather file using arrow::read_feather()
> prepared in python.
> {code:r}
> #' Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
> #' Cannot convert Dictionary Array of type `dictionary<values=double, indices=int8, ordered=0>` to R{code}
> I could reproduce the issue with a minimal example:
> In python:
> {code:python}
> import pandas as pd
> import pyarrow as pa
> df = pd.DataFrame({"float": [0.1, .2, 0.5, .001]})
> df["category"] = df["float"].astype('category')
> df.dtypes
> #' float float64
> #' A object
> #' category category
> #' dtype: object
> df.to_feather("series.feather")
> pa.__version__
> #' '0.15.1'
> {code}
> From R:
> {code:r}
> arrow::read_feather("series.feather")
> #' Error in Table__to_dataframe(x, use_threads = option_use_threads()) :
> #' Cannot convert Dictionary Array of type `dictionary<values=double, indices=int8, ordered=0>` to R
> #' Backtrace:
> #' █
> #' 1. └─arrow::read_feather("series.feather")
> #' 2. ├─[ base::as.data.frame(...) ]
> #' 3. └─arrow:::as.data.frame.Table(out)
> #' 4. └─arrow:::Table__to_dataframe(x, use_threads = option_use_threads())
> {code}
> The feather file is read correctly back in python
> {code:python}
> ft = pd.read_feather("series.feather")
> ft.dtypes
> #' float float64
> #' A object
> #' category category
> #' dtype: object
> {code}
> {code:r}
> sessionInfo()
> #' R version 3.5.1 (2018-07-02)
> #' Platform: x86_64-conda_cos6-linux-gnu (64-bit)
> #' Running under: Ubuntu 16.04.5 LTS
> #'
> #' Matrix products: default
> #' BLAS/LAPACK: /misc/DLshare/home/etbellem/miniconda3/lib/R/lib/libRblas.so
> #'
> #' locale:
> #' [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> #' [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> #' [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> #' [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> #' [9] LC_ADDRESS=C LC_TELEPHONE=C
> #' [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
> #'
> #' attached base packages:
> #' [1] stats graphics grDevices utils datasets methods base
> #'
> #' loaded via a namespace (and not attached):
> #' [1] Rcpp_1.0.3 arrow_0.15.1 crayon_1.3.4 assertthat_0.2.1
> #' [5] R6_2.4.1 magrittr_1.5 rlang_0.4.2 rstudioapi_0.10
> #' [9] bit64_0.9-7 glue_1.3.1 purrr_0.3.3 bit_1.1-15.1
> #' [13] compiler_3.5.1 tidyselect_0.2.5{code}