[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Wed, 26 Oct 2022 01:37:03 -0700


thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1005378186



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) 
methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides 
[reticulate](https://rstudio.github.io/reticulate/) methods for passing data 
between R and Python within the same process. This vignette provides a brief 
overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for 
example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R 
Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of 
functionality that is better supported in Python than in R at the current state 
of development. For example, at one point in time the R `arrow` package didn't 
support `concat_arrays()` but PyArrow did, so this would have been a good use 
case at that time. At the time of current writing PyArrow has more 
comprehensive support for [Arrow 
Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- 
but see `vignette("flight", package = "arrow")` -- so that would be another 
instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass 
data objects between R and Python. With large data sets, it can be quite costly 
-- in terms of time and CPU cycles -- to perform the copy and covert operations 
required to translate a native data structure in R (e.g., a data frame) to an 
analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. 
Because Arrow data objects such as Tables have the same in-memory format in R 
and Python, it is possible to perform "zero-copy" data transfers, in which only 
the metadata needs to be passed between languages. As illustrated later, this 
drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For 
example, you may wish to create a Python [virtual 
environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` 
library. A virtual environment is a specific Python installation created for 
one project or purpose. It is a good practice to use specific environments in 
Python so that updating a package doesn't impact packages in other projects.

Review Comment:
   nit: would change "with" to "containing" or similar



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) 
methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides 
[reticulate](https://rstudio.github.io/reticulate/) methods for passing data 
between R and Python within the same process. This vignette provides a brief 
overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for 
example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R 
Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of 
functionality that is better supported in Python than in R at the current state 
of development. For example, at one point in time the R `arrow` package didn't 
support `concat_arrays()` but PyArrow did, so this would have been a good use 
case at that time. At the time of current writing PyArrow has more 
comprehensive support for [Arrow 
Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- 
but see `vignette("flight", package = "arrow")` -- so that would be another 
instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass 
data objects between R and Python. With large data sets, it can be quite costly 
-- in terms of time and CPU cycles -- to perform the copy and covert operations 
required to translate a native data structure in R (e.g., a data frame) to an 
analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. 
Because Arrow data objects such as Tables have the same in-memory format in R 
and Python, it is possible to perform "zero-copy" data transfers, in which only 
the metadata needs to be passed between languages. As illustrated later, this 
drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For 
example, you may wish to create a Python [virtual 
environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` 
library. A virtual environment is a specific Python installation created for 
one project or purpose. It is a good practice to use specific environments in 
Python so that updating a package doesn't impact packages in other projects.
+
+You can perform the set up from within R. Let's suppose you want to call your 
virtual environment something like `my-pyarrow-env`. Your setup code would look 
like this: 
 
 ```r
-library(reticulate)
-virtualenv_create("arrow-env")
-install_pyarrow("arrow-env")
+virtualenv_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
 ```
 
-If you want to install a development version of `pyarrow`,
-add `nightly = TRUE`:
+If you want to install a development version of `pyarrow` to the virtual 
environment, add `nightly = TRUE` to the `install_pyarrow()` command:
 
 ```r
-install_pyarrow("arrow-env", nightly = TRUE)
+install_pyarrow("my-pyarrow-env", nightly = TRUE)
 ```
 
-A virtualenv or a virtual environment is a specific Python installation
-created for one project or purpose. It is a good practice to use
-specific environments in Python so that updating a package doesn't
-impact packages in other projects.
+Note that you don't have to use virtual environments. If you prefer [conda 
environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html),
 you can use this setup code:
 
-`install_pyarrow()` also works with `conda` environments
-(`conda_create()` instead of `virtualenv_create()`).
+```r
+conda_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
+```
 
-For more on installing and configuring Python,
-see the [reticulate 
docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+To learn more about installing and configuring Python from R,
+see the [reticulate 
documentation](https://rstudio.github.io/reticulate/articles/python_packages.html),
 which discusses the topic in more detail.
 
-## Using
+## Importing PyArrow
 
-To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+Assuming that `arrow` and `reticulate` are both loaded in R, your first step 
is to make sure that the correct Python environment is being used. To do that, 
use a command like this:
+
+```r
+use_virtualenv("my-pyarrow-env") # virtualenv users
+use_condaenv("my-pyarrow-env")   # conda users
+```

Review Comment:
   Would it be worth splitting these out into separate code chunks so that it's 
clear for people who are skim-reading or blindly copying-and-pasting (I know 
I'm guilty of that a lot) that they only actually need to run one or the other 
based on how they've done their setup?
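
   One way the split could look, as a sketch only: the chunk boundaries and the comment wording are illustrative assumptions, not text from the PR. The environment name `my-pyarrow-env` and the two `reticulate` calls are the ones already in the diff above.

   ```r
   # virtualenv users: run this chunk only
   use_virtualenv("my-pyarrow-env")
   ```

   ```r
   # conda users: run this chunk instead of the one above
   use_condaenv("my-pyarrow-env")
   ```

   Keeping each call in its own chunk means a reader who copies a single chunk activates exactly one environment.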



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) 
methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides 
[reticulate](https://rstudio.github.io/reticulate/) methods for passing data 
between R and Python within the same process. This vignette provides a brief 
overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for 
example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R 
Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of 
functionality that is better supported in Python than in R at the current state 
of development. For example, at one point in time the R `arrow` package didn't 
support `concat_arrays()` but PyArrow did, so this would have been a good use 
case at that time. At the time of current writing PyArrow has more 
comprehensive support for [Arrow 
Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- 
but see `vignette("flight", package = "arrow")` -- so that would be another 
instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass 
data objects between R and Python. With large data sets, it can be quite costly 
-- in terms of time and CPU cycles -- to perform the copy and covert operations 
required to translate a native data structure in R (e.g., a data frame) to an 
analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. 
Because Arrow data objects such as Tables have the same in-memory format in R 
and Python, it is possible to perform "zero-copy" data transfers, in which only 
the metadata needs to be passed between languages. As illustrated later, this 
drastically improves performance. 
 
-## Installing
+## Installing PyArrow
 
-To use `arrow` in Python, at a minimum you'll need the `pyarrow` library.
-To install it in a virtualenv,
+To use Arrow in Python, the `pyarrow` library needs to be installed. For 
example, you may wish to create a Python [virtual 
environment](https://docs.python.org/3/library/venv.html) with the `pyarrow` 
library. A virtual environment is a specific Python installation created for 
one project or purpose. It is a good practice to use specific environments in 
Python so that updating a package doesn't impact packages in other projects.
+
+You can perform the set up from within R. Let's suppose you want to call your 
virtual environment something like `my-pyarrow-env`. Your setup code would look 
like this: 
 
 ```r
-library(reticulate)
-virtualenv_create("arrow-env")
-install_pyarrow("arrow-env")
+virtualenv_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
 ```
 
-If you want to install a development version of `pyarrow`,
-add `nightly = TRUE`:
+If you want to install a development version of `pyarrow` to the virtual 
environment, add `nightly = TRUE` to the `install_pyarrow()` command:
 
 ```r
-install_pyarrow("arrow-env", nightly = TRUE)
+install_pyarrow("my-pyarrow-env", nightly = TRUE)
 ```
 
-A virtualenv or a virtual environment is a specific Python installation
-created for one project or purpose. It is a good practice to use
-specific environments in Python so that updating a package doesn't
-impact packages in other projects.
+Note that you don't have to use virtual environments. If you prefer [conda 
environments](https://docs.conda.io/projects/conda/en/latest/user-guide/concepts/environments.html),
 you can use this setup code:
 
-`install_pyarrow()` also works with `conda` environments
-(`conda_create()` instead of `virtualenv_create()`).
+```r
+conda_create("my-pyarrow-env")
+install_pyarrow("my-pyarrow-env")
+```
 
-For more on installing and configuring Python,
-see the [reticulate 
docs](https://rstudio.github.io/reticulate/articles/python_packages.html).
+To learn more about installing and configuring Python from R,
+see the [reticulate 
documentation](https://rstudio.github.io/reticulate/articles/python_packages.html),
 which discusses the topic in more detail.
 
-## Using
+## Importing PyArrow
 
-To start, load `arrow` and `reticulate`, and then import `pyarrow`.
+Assuming that `arrow` and `reticulate` are both loaded in R, your first step 
is to make sure that the correct Python environment is being used. To do that, 
use a command like this:
+
+```r
+use_virtualenv("my-pyarrow-env") # virtualenv users
+use_condaenv("my-pyarrow-env")   # conda users
+```
+
+Once you have done this, the next step is to import `pyarrow` into the Python 
session as shown below:
 
 ```r
-library(arrow)
-library(reticulate)
-use_virtualenv("arrow-env")
 pa <- import("pyarrow")
 ```
 
-The arrow R package include support for sharing Arrow `Array` and `RecordBatch`
-objects in-process between R and Python. For example, let's create an `Array`
-in pyarrow.
+Executing this command in R is the equivalent of the following import in 
Python:
+
+```python
+import pyarrow as pa
+```
+
+It may be a good idea to check your `pyarrow` version too, as shown below:
+
+```r
+pa$`__version__`
+```
+
+```
+## [1] "8.0.0"
+```
+
+Support for passing data to and from R is included in `pyarrow` versions 0.17 
and greater.
+
+## Using PyArrow
+
+You can use the `reticulate` function `r_to_py()` to pass objects from R to 
Python, and similarly you can use `py_to_r()` to pull objects from the Python 
session into R. To illustrate this, let's create two objects in R: `df_random` 
is an R data frame containing 100 million rows of random data, and `tb_random` 
is the same data stored as an Arrow Table: 
+
+```r
+set.seed(1234)
+nrows <- 10^8
+df_random <- data.frame(
+  x = rnorm(nrows), 
+  y = rnorm(nrows),
+  subset = sample(10, nrows, replace = TRUE)
+)
+tb_random <- arrow_table(df_random)
+```
+
+Transferring the data from R to Python without Arrow is a time-consuming 
process because the underlying object has to be copied and converted to a 
Python data structure:
+
+```r
+system.time({
+  df_py <- r_to_py(df_random)
+})
+```
+
+```
+##   user  system elapsed 
+##  0.307   5.172   5.529 
+```
+
+In contrast, sending the Arrow Table across happens almost instantaneously:
+
+```r
+system.time({
+  tb_py <- r_to_py(tb_random)
+})
+```
+
+```
+##   user  system elapsed 
+##  0.004   0.000   0.003 
+```
+

Review Comment:
   This is brilliant, I love how you really draw out the motivation here.



##########
r/vignettes/python.Rmd:
##########
@@ -113,40 +185,12 @@ a_and_b
 
 Now you have a single Array in R.
 
-## How this works
+## Futher reading

Review Comment:
   ```suggestion
   ## Further reading
   ```



##########
r/vignettes/python.Rmd:
##########
@@ -1,68 +1,141 @@
 ---
-title: "Apache Arrow in Python and R with reticulate"
+title: "Integrating Arrow, Python, and R"
+description: > 
+  Learn how to use `arrow` and `reticulate` to efficiently transfer data 
+  between R and Python without making unnecessary copies
 output: rmarkdown::html_vignette
 vignette: >
-  %\VignetteIndexEntry{Apache Arrow in Python and R with reticulate}
+  %\VignetteIndexEntry{Integrating Arrow, Python, and R}
   %\VignetteEngine{knitr::rmarkdown}
   %\VignetteEncoding{UTF-8}
 ---
 
-The arrow package provides [reticulate](https://rstudio.github.io/reticulate/) 
methods for passing data between
-R and Python in the same process. This document provides a brief overview.
+The `arrow` package provides 
[reticulate](https://rstudio.github.io/reticulate/) methods for passing data 
between R and Python within the same process. This vignette provides a brief 
overview.
 
-Why you might want to use `pyarrow`?
+Code in this vignette assumes `arrow` and `reticulate` are both loaded:
 
-* To use some Python functionality that is not yet implemented in R, for 
example, the `concat_arrays` function.
-* To transfer Python objects into R, for example, a Pandas dataframe into an R 
Arrow Array. 
+```r
+library(arrow, warn.conflicts = FALSE)
+library(reticulate, warn.conflicts = FALSE)
+```
+
+## Motivation
+
+One reason you might want to use PyArrow in R is to take advantage of 
functionality that is better supported in Python than in R at the current state 
of development. For example, at one point in time the R `arrow` package didn't 
support `concat_arrays()` but PyArrow did, so this would have been a good use 
case at that time. At the time of current writing PyArrow has more 
comprehensive support for [Arrow 
Flight](https://arrow.apache.org/docs/format/Flight.html) than the R package -- 
but see `vignette("flight", package = "arrow")` -- so that would be another 
instance in which PyArrow would be of benefit to R users.
+
+A second reason that R users may want to use PyArrow is to efficiently pass 
data objects between R and Python. With large data sets, it can be quite costly 
-- in terms of time and CPU cycles -- to perform the copy and covert operations 
required to translate a native data structure in R (e.g., a data frame) to an 
analogous structure in Python (e.g., a Pandas DataFrame) and vice versa. 
Because Arrow data objects such as Tables have the same in-memory format in R 
and Python, it is possible to perform "zero-copy" data transfers, in which only 
the metadata needs to be passed between languages. As illustrated later, this 
drastically improves performance. 

Review Comment:
   This section is a huge improvement to the vignette IMO



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
