[
https://issues.apache.org/jira/browse/ARROW-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Rok Mihevc updated ARROW-5190:
------------------------------
External issue URL: https://github.com/apache/arrow/issues/16714
> [R] Discussion: tibble dependency in R package
> ----------------------------------------------
>
> Key: ARROW-5190
> URL: https://issues.apache.org/jira/browse/ARROW-5190
> Project: Apache Arrow
> Issue Type: Wish
> Components: R
> Reporter: James Lamb
> Assignee: Romain Francois
> Priority: Minor
> Labels: pull-request-available
> Fix For: 0.14.0
>
> Time Spent: 50m
> Remaining Estimate: 0h
>
> Hello,
>
> I would like to have a discussion on the use of *tibble* in the Apache Arrow
> R package. I looked at the [project contributor
> guidelines|[https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst]]
> and could not tell where the best place might be to start a public
> discussion on this topic, so I decided on JIRA. I apologize if this is not
> the right place.
>
> *TL;DR*
> I would like to propose moving the *tibble* dependency in the *arrow* R
> package to "Suggests", removing the _as_tibble()_ in _read_arrow()_, and
> having the core R code implementing the Arrow API only return data.frames or
> other base-R data structures wherever possible.
>
> *Reasoning*
> [As far as I can
> tell|[https://github.com/apache/arrow/search?p=1&q=tibble&unscoped_q=tibble]],
> outside of tests and examples *tibble* is only used in three places in the
> package:
> * S3 methods to convert Arrow objects to tibbles
> (_as_tibble.arrow::RecordBatch()_, _as.tibble.arrow::Table()_)
> * optional "convert to tibble on the way out" behavior controlled by a flag
> in interfaces to file types (parquet and feather)
> * [_read_arrow()_|[https://github.com/apache/arrow/blob/0536ef8174982a7a13a251174cc38701e8663b68/r/R/read_table.R#L88]]
>
> In my opinion, all three of these uses of *tibble* are valuable for
> developers who use that package (or other packages in its ecosystem), but I
> am not convinced that the Arrow R package should be tightly coupled to them.
> In the Python community, *pandas* is a broadly agreed-upon standard for
> representing data frames. Even with that ubiquity, *pyarrow* does not depend
> on *pandas* (*pandas* is not required to use *pyarrow*) and all "compatibility with
> *pandas*" code is isolated in a place explicitly intended for that purpose:
> [https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py]
> I think that is the ideal way to handle integration between Arrow and the
> other software it might be used with. It means users who care about only
> one of the integrations (e.g. feather, parquet, HDFS, Apache Spark, tibble,
> data.table, etc.) only have to build the things they are already using.
>
> *Other background information*
> I took the time to write this tonight after talking a colleague through the
> issues *feather* (R package) users experienced after the *tibble 2.0*
> release. See for example
> [wesm/feather#374|[https://github.com/wesm/feather/issues/374]] and
> [wesm/feather#372|[https://github.com/wesm/feather/issues/372]].
> When *tibble 2.0* came out it broke *feather 0.3.1* and the maintainers
> there promptly released to CRAN a *feather 0.3.2* which was compatible with
> *tibble 2.0+*. Unfortunately, this still caused disruptions for many people
> using *feather* (who inadvertently had *tibble* upgraded as part of
> installing other packages which depended on it). Nothing about *tibble* was
> necessary to the implementation of _read_feather()_, as far as I can tell,
> but this design choice made installing and upgrading *tibble* non-optional
> for developers who just wanted to use the feather file format and all its
> awesome features.
>
> If the proposal here is accepted, I hope it will mean we can prevent
> repeating the same experience with the R *arrow* package and set a strong
> precedent for developers who want to add compatibility in this package for
> other members of the ecosystem like parquet or Apache Spark.
>
>
> Thank you for hearing me out!
>
>
>