[jira] [Updated] (ARROW-5190) [R] Discussion: tibble dependency in R package

James Lamb (JIRA) Sat, 20 Apr 2019 23:13:56 -0700


     [ 
https://issues.apache.org/jira/browse/ARROW-5190?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]


James Lamb updated ARROW-5190:
------------------------------
    Description: 
Hello,

 

I would like to have a discussion on the use of *tibble* in the Apache Arrow R 
package. I looked at the [project contributor 
guidelines|https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst]
 and could not tell where the best place might be to start a public discussion 
on this topic, so I decided on JIRA. I apologize if this is not the right place.

 

*TL;DR*

I would like to propose moving the *tibble* dependency in the *arrow* R package 
to "Suggests", removing the _as_tibble()_ in _read_arrow()_, and having the 
core R code implementing the Arrow API only return data.frames or other base-R 
data structures wherever possible.
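
As a rough illustration of what "Suggests" could look like in practice, here is a minimal sketch. The helper name _maybe_as_tibble()_ and its _as_tibble_ argument are hypothetical and are not part of the *arrow* package; the only real calls assumed are base R's _requireNamespace()_ and _tibble::as_tibble()_.

```r
# Hypothetical helper (not from the arrow package): returns a plain data.frame
# by default and only touches tibble when the caller explicitly asks for it.
maybe_as_tibble <- function(df, as_tibble = FALSE) {
  stopifnot(is.data.frame(df))
  if (!as_tibble) {
    return(df)
  }
  # With tibble in "Suggests", it may not be installed, so check at call time.
  if (!requireNamespace("tibble", quietly = TRUE)) {
    stop("Package 'tibble' is required when as_tibble = TRUE; please install it.",
         call. = FALSE)
  }
  tibble::as_tibble(df)
}

# Default path: works whether or not tibble is installed.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)
out <- maybe_as_tibble(df)
```

With this shape, users who never ask for a tibble never need *tibble* installed, and a breaking *tibble* release cannot affect the default code path.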

 

*Reasoning*

[As far as I can 
tell|https://github.com/apache/arrow/search?p=1&q=tibble&unscoped_q=tibble], 
outside of tests and examples *tibble* is only used in three places in the 
package:
 * S3 methods to convert Arrow objects to tibbles 
(_as_tibble.arrow::RecordBatch()_, _as_tibble.arrow::Table()_)
 * optional "convert to tibble on the way out" behavior controlled by a flag in 
interfaces to file types (parquet and feather)
 * [_read_arrow()_|https://github.com/apache/arrow/blob/0536ef8174982a7a13a251174cc38701e8663b68/r/R/read_table.R#L88]

 

In my opinion, all three of these uses of *tibble* are valuable for developers 
who use that package (or other packages in its ecosystem), but I am not 
convinced that the Arrow R package should be tightly coupled to them.

In the Python community, *pandas* is a broadly agreed-upon standard for 
representing data frames. Even with that ubiquity, *pyarrow* does not depend on 
*pandas* (*pandas* is not required in order to use it) and all "compatibility with 
*pandas*" code is isolated in a place explicitly intended for that purpose: 
[https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py]

I think that is the ideal handling for integration of Arrow extensions with 
other software it might be used with. This allows users who care about only one 
of the integrations (e.g. feather, parquet, HDFS, Apache Spark, tibble, 
data.table, etc.) to only have to build things they're already using. 

 

*Other background information*

I took the time to write this tonight after talking a colleague through the 
issues *feather* (R package) users experienced after the *tibble 2.0* release. 
See for example [wesm/feather#374|https://github.com/wesm/feather/issues/374] 
and 
[wesm/feather#372|https://github.com/wesm/feather/issues/372].
 When *tibble 2.0* came out it broke *feather 0.3.1* and the maintainers there 
promptly released to CRAN a *feather 0.3.2* which was compatible with *tibble 
2.0+*. Unfortunately, this still caused disruptions for many people using 
*feather* (who inadvertently had *tibble* upgraded as part of installing other 
packages which depended on it). Nothing about *tibble* was necessary to the 
implementation of _read_feather()_, as far as I can tell, but this design 
choice made installing and upgrading *tibble* non-optional for developers who 
just wanted to use the feather file format and all its awesome features.

 

If the proposal here is accepted, I hope it will mean we can prevent repeating 
the same experience with the R *arrow* package and set a strong precedent for 
developers who want to add compatibility in this package for other members of 
the ecosystem like parquet or Apache Spark.

 

 

Thank you for hearing me out!

 

 

 

  was:
Hello,

 

I would like to have a discussion on the use of *tibble* in the Apache Arrow R 
package. I looked at the [project contributor 
guidelines|https://github.com/apache/arrow/blob/master/docs/source/developers/contributing.rst]
 and could not tell where the best place might be to start a public discussion 
on this topic, so I decided on JIRA. I apologize if this is not the right place.

 

*TL;DR*

I would like to propose moving the *tibble* dependency in the *arrow* R package 
to "Suggests", removing the _as_tibble()_ in _read_arrow()_, and having the 
core R code implementing the Arrow API only return data.frames or other base-R 
data structures wherever possible.

 

*Reasoning*

[As far as I can 
tell|https://github.com/apache/arrow/search?p=1&q=tibble&unscoped_q=tibble], 
outside of tests and examples *tibble* is only used in three places in the 
package:
 * S3 methods to convert Arrow objects to tibbles 
(_as_tibble.arrow::RecordBatch()_, _as_tibble.arrow::Table()_)
 * optional "convert to tibble on the way out" behavior controlled by a flag in 
interfaces to file types (parquet and feather)
 * [_read_arrow()_|https://github.com/apache/arrow/blob/0536ef8174982a7a13a251174cc38701e8663b68/r/R/read_table.R#L88]

 

In my opinion, all three of these uses of *tibble* are valuable for developers 
who use that package (or other packages in its ecosystem), but I am not 
convinced that the Arrow R package should be tightly coupled to them or that 
Arrow maintainers should have to maintain them.

In the Python community, *pandas* is a broadly agreed-upon standard for 
representing data frames. Even with that ubiquity, *pyarrow* does not depend on 
*pandas* (*pandas* is not required in order to use it) and all "compatibility with 
*pandas*" code is isolated in a place explicitly intended for that purpose: 
[https://github.com/apache/arrow/blob/master/python/pyarrow/pandas_compat.py]

I think that is the ideal handling for integration of Arrow extensions with 
other software it might be used with. This allows users who care about only one 
of the integrations (e.g. feather, parquet, HDFS, Apache Spark, tibble, 
data.table, etc.) to only have to build things they're already using. 

 

*Other background information*

I took the time to write this tonight after talking a colleague through the 
issues *feather* (R package) users experienced after the *tibble 2.0* release. 
See for example [wesm/feather#374|https://github.com/wesm/feather/issues/374] 
and 
[wesm/feather#372|https://github.com/wesm/feather/issues/372].
 When *tibble 2.0* came out it broke *feather 0.3.1* and the maintainers there 
promptly released to CRAN a *feather 0.3.2* which was compatible with *tibble 
2.0+*. Unfortunately, this still caused disruptions for many people using 
*feather* (who inadvertently had *tibble* upgraded as part of installing other 
packages which depended on it). Nothing about *tibble* was necessary to the 
implementation of _read_feather()_, as far as I can tell, but this design 
choice made installing and upgrading *tibble* non-optional for developers who 
just wanted to use the feather file format and all its awesome features.

 

If the proposal here is accepted, I hope it will mean we can prevent repeating 
the same experience with the R *arrow* package and set a strong precedent for 
developers who want to add compatibility in this package for other members of 
the ecosystem like parquet or Apache Spark.

 

 

Thank you for hearing me out!

 

 

 


> [R] Discussion: tibble dependency in R package
> ----------------------------------------------
>
>                 Key: ARROW-5190
>                 URL: https://issues.apache.org/jira/browse/ARROW-5190
>             Project: Apache Arrow
>          Issue Type: Wish
>          Components: R
>            Reporter: James Lamb
>            Priority: Minor
>



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
