jonkeane commented on a change in pull request #11915:
URL: https://github.com/apache/arrow/pull/11915#discussion_r771540190
##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,227 @@
+---
+title: "Writing Bindings"
+---
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via existing R functions. The syntax and
+functionality should match that of the existing R functions
+(though with some exceptions) so that users are able to use existing tidyverse
Review comment:
```suggestion
(though there are some exceptions) so that users are able to use existing tidyverse
```
##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,227 @@
+---
+title: "Writing Bindings"
+---
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via existing R functions. The syntax and
Review comment:
This is super pedantic, but it's slightly more accurate to say "via the
same interface as existing R functions", since we are actually writing new R
functions (in Arrow), with the same call + args as the existing functions,
which then call into C++.
Then again, being this pedantic might be too much for an intro and could be
more of a distraction than a help here.
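
To make the distinction concrete, here is a minimal sketch, not the package's actual binding code. It assumes the exported `arrow::call_function()` accepts `options = list(pattern = ...)` for the `starts_with` kernel, and `arrow_starts_with()` is a hypothetical name used only for illustration. The wrapper is a new R function with the same call + args as `startsWith()`, and it is the wrapper that calls into C++:

```r
library(arrow, warn.conflicts = FALSE)

# Hypothetical helper (not part of arrow): same signature as base::startsWith(),
# but the work is done by the Arrow C++ compute function "starts_with".
arrow_starts_with <- function(x, prefix) {
  call_function("starts_with", x, options = list(pattern = prefix))
}

x <- c("Foo", "bar", "baz", "qux")
arr <- Array$create(x)

# Same interface, different implementation: compare against base R
arrow_result <- as.vector(arrow_starts_with(arr, "b"))
base_result <- startsWith(x, "b")
identical(arrow_result, base_result)
```

The real bindings build lazy expressions inside dplyr verbs rather than evaluating eagerly like this, so treat the sketch as an illustration of "same interface, new function" only.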
##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,238 @@
+---
+title: "Writing Bindings"
+---
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via existing R functions. The syntax and
+functionality should (usually) exactly match that of the existing R functions
+(though with some exceptions) so that users are able to use existing tidyverse
+or base R syntax, or call existing S3 methods on objects, whilst taking
+advantage of the speed and functionality of the underlying arrow package.
+
+# Implementing bindings for S3 generics
+
+If a function is an S3 generic method, you may be able to define a version of it for
+Arrow objects. There are two base classes which have been defined in the
+R package so that S3 methods don't have to be defined repeatedly for objects with
+similar behaviour:
+
+* ArrowTabular - for RecordBatch and Table objects
+* ArrowDatum - for Scalar, Array, and ChunkedArray objects
+
+What this means is that any function defined for the base class will work with
+the child class. For example, the function `dim()` may be defined as:
+
+```{r, eval = FALSE}
+dim.ArrowTabular <- function(x) c(x$num_rows, x$num_columns)
+```
+
+This implements `dim()` for both RecordBatch and Table objects.
+
+```{r}
+arrow_table(x = c(1, 2, 3), y = c(4, 5, 6)) %>%
+ dim()
+```
+
+# Implementing bindings to work within dplyr pipelines
+
+One of the main ways in which users interact with arrow is via dplyr syntax called
+on Arrow objects. For example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
+Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
+used and under the hood, translates the dplyr code into Arrow C++ code.
+
+When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
+from other packages. The example below uses `stringr::str_detect()`.
+
+```{r}
+library(dplyr)
+library(stringr)
+starwars %>%
+ filter(str_detect(name, "Darth"))
+```
+This functionality has also been implemented in Arrow, e.g.:
+
+```{r}
+library(arrow)
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth")) %>%
+ collect()
+```
+
+This is possible as a **binding** has been created between the stringr function
+`str_detect()` and the Arrow C++ function `match_substring_regex`. You can see
+this for yourself by inspecting the arrow data query object without retrieving the
+results via `collect()`.
+
+```{r}
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth"))
+```
+
+In the following sections, we'll walk through how to create a binding between an
+R function and an Arrow C++ function.
+
+## Walkthrough
+
+Imagine you are writing the bindings for the C++ function
+[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and want to bind it to the (base) R function `startsWith()`.
+
+First, take a look at the docs for both of those functions.
+
+### Examining the R function
+
+Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html)
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./startswithdocs.png")
+```
+
+It takes 2 parameters; `x` - the input, and `prefix` - the characters to check
+if `x` starts with.
+
+### Examining the C++ function
+
+Now, go to
+[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and look for the Arrow C++ library's `starts_with()` function:
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./starts_with_docs.png")
+```
+
+The docs show that `starts_with()` is a unary function, which means that it takes a
+single data input. The data input must be a string-like class, and the returned
+value is boolean, both of which match up to R's `startsWith()`.
+
+There is an options class associated with `starts_with()` - called [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE)
+- so let's take a look at that.
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./matchsubstringoptions.png")
+```
+
+Options classes allow the user to control the behaviour of the function. In
+this case, there are two possible options which can be supplied - `pattern` and
+`ignore_case`, which are described in the docs shown above.
+
+### Comparing the R and C++ functions
+
+What conclusions can be drawn from what you've seen so far?
+
+Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
+types, return equivalent data types, and as there are no options implemented in
+R that Arrow doesn't have, this should be fairly simple to map without a great
+deal of extra work.
+
+As `starts_with()` has an options class associated with it, we'll need to make
+sure that it's linked up with this in the R code.
+
+In case you're wondering about the difference between arguments in R and options
+in Arrow, in R, arguments to functions can include the actual data to be
+analysed as well as options governing how the function works, whereas in the
+C++ compute functions, the arguments are the data to be analysed and the
+options are for specifying how exactly the function works.
+
+So let's get started.
+
+### Step 1 - add unit tests
+
+Look up the R function that you want to bind the compute kernel to, and write a
+set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
+perhaps even `compare_dplyr_error()` if necessary). These functions compare the
+output of the original function with the dplyr bindings and make sure they match.
+
+Make sure you're testing all parameters of the R function.
+
+Below is a possible example test for `startsWith()`.
Review comment:
We also might want to mention in step 4 / a new step 5 that it's _very_
common to add more tests at the end, because by then you know more of the edge
cases and behaviours you need to pin down (as well as adding tests for edge
cases as you iterate).
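
As an illustration of the kind of edge cases that tend to surface late, the sketch below uses only base R's `startsWith()`; the input vector is made up for the example. Each behaviour it exercises (case sensitivity, `NA` propagation, empty strings, empty prefixes) is a candidate for an extra test once the binding exists:

```r
# Made-up input covering a mixed-case string, a missing value and an empty string
x <- c("Foo", "bar", NA, "")

# Ordinary prefix: NA propagates, and "" does not start with "b"
with_b <- startsWith(x, "b")

# Matching is case-sensitive, so "Foo" does not start with "f"
with_lower_f <- startsWith(x, "f")

# An empty prefix matches every non-missing string, including ""
with_empty <- startsWith(x, "")

print(with_b)
print(with_lower_f)
print(with_empty)
```

Whether the Arrow binding should match base R on every one of these is exactly what the added tests would establish.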
##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,227 @@
+---
+title: "Writing Bindings"
+---
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via existing R functions. The syntax and
+functionality should match that of the existing R functions
+(though with some exceptions) so that users are able to use existing tidyverse
+or base R syntax, whilst taking advantage of the speed and functionality of the
+underlying arrow package.
+
+# Implementing bindings to work within dplyr pipelines
+
+One of the main ways in which users interact with arrow is via
+[dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects. For
+example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
+Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
+used and under the hood, translates the dplyr code into Arrow C++ code.
+
+When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
+from other packages. The example below uses `stringr::str_detect()`.
+
+```{r}
+library(dplyr)
+library(stringr)
+starwars %>%
+ filter(str_detect(name, "Darth"))
+```
+This functionality has also been implemented in Arrow, e.g.:
+
+```{r}
+library(arrow)
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth")) %>%
+ collect()
+```
+
+This is possible as a **binding** has been created between the call to the
+stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping
+to `match_substring_regex`. You can see this for yourself by inspecting the
+arrow data query object without retrieving the results via `collect()`.
+
+
+```{r}
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth"))
+```
+
+In the following sections, we'll walk through how to create a binding between an
+R function and an Arrow C++ function.
+
+## Walkthrough
+
+Imagine you are writing the bindings for the C++ function
+[`starts_with()`](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and want to bind it to the (base) R function `startsWith()`.
+
+First, take a look at the docs for both of those functions.
+
+### Examining the R function
+
+Here are the docs for R's `startsWith()` (also available at https://stat.ethz.ch/R-manual/R-devel/library/base/html/startsWith.html)
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./startswithdocs.png")
+```
+
+It takes 2 parameters; `x` - the input, and `prefix` - the characters to check
+if `x` starts with.
+
+### Examining the C++ function
+
+Now, go to
+[the compute function documentation](https://arrow.apache.org/docs/cpp/compute.html#containment-tests)
+and look for the Arrow C++ library's `starts_with()` function:
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./starts_with_docs.png")
+```
+
+The docs show that `starts_with()` is a unary function, which means that it takes a
+single data input. The data input must be a string-like class, and the returned
+value is boolean, both of which match up to R's `startsWith()`.
+
+There is an options class associated with `starts_with()` - called [`MatchSubstringOptions`](https://arrow.apache.org/docs/cpp/api/compute.html#_CPPv4N5arrow7compute21MatchSubstringOptionsE)
+- so let's take a look at that.
+
+```{r, echo=FALSE, out.width="50%"}
+knitr::include_graphics("./matchsubstringoptions.png")
+```
+
+Options classes allow the user to control the behaviour of the function. In
+this case, there are two possible options which can be supplied - `pattern` and
+`ignore_case`, which are described in the docs shown above.
+
+### Comparing the R and C++ functions
+
+What conclusions can be drawn from what you've seen so far?
+
+Base R's `startsWith()` and Arrow's `starts_with()` operate on equivalent data
+types, return equivalent data types, and as there are no options implemented in
+R that Arrow doesn't have, this should be fairly simple to map without a great
+deal of extra work.
+
+As `starts_with()` has an options class associated with it, we'll need to make
+sure that it's linked up with this in the R code.
+
+In case you're wondering about the difference between arguments in R and options
+in Arrow, in R, arguments to functions can include the actual data to be
+analysed as well as options governing how the function works, whereas in the
+C++ compute functions, the arguments are the data to be analysed and the
+options are for specifying how exactly the function works.
+
+So let's get started.
+
+### Step 1 - add unit tests
+
+We recommend a test-driven-development approach - write failing tests first,
+then check that they fail, and then write the code needed to make them pass.
+Thinking up-front about the behavior which needs testing can make it easier to
+reason about the code which needs writing later.
+
+Look up the R function that you want to bind the compute kernel to, and write a
+set of unit tests that use a dplyr pipeline and `compare_dplyr_binding()` (and
+perhaps even `compare_dplyr_error()` if necessary). These functions compare the
+output of the original function with the dplyr bindings and make sure they match.
+We recommend looking at the documentation next to the source code for these
+functions to get a better understanding of how they work.
+
+You should make sure you're testing all parameters of the R function in your
+tests.
+
+Below is a possible example test for `startsWith()`.
+
+```{r, eval = FALSE}
+test_that("startsWith behaves identically in dplyr and Arrow", {
+ df <- tibble(x = c("Foo", "bar", "baz", "qux"))
+
Review comment:
```suggestion
```
##########
File path: r/vignettes/developers/bindings.Rmd
##########
@@ -0,0 +1,227 @@
+---
+title: "Writing Bindings"
+---
+
+```{r, include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+library(dplyr, warn.conflicts = FALSE)
+```
+
+When writing bindings between C++ compute functions and R functions, the aim is
+to expose the C++ functionality via existing R functions. The syntax and
+functionality should match that of the existing R functions
+(though with some exceptions) so that users are able to use existing tidyverse
+or base R syntax, whilst taking advantage of the speed and functionality of the
+underlying arrow package.
+
+# Implementing bindings to work within dplyr pipelines
+
+One of the main ways in which users interact with arrow is via
+[dplyr](https://dplyr.tidyverse.org/) syntax called on Arrow objects. For
+example, when a user calls `dplyr::mutate()` on an Arrow Tabular,
+Dataset, or arrow data query object, the Arrow implementation of `mutate()` is
+used and under the hood, translates the dplyr code into Arrow C++ code.
+
+When using `dplyr::mutate()` or `dplyr::filter()`, you may want to use functions
+from other packages. The example below uses `stringr::str_detect()`.
+
+```{r}
+library(dplyr)
+library(stringr)
+starwars %>%
+ filter(str_detect(name, "Darth"))
+```
+This functionality has also been implemented in Arrow, e.g.:
+
+```{r}
+library(arrow)
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth")) %>%
+ collect()
+```
+
+This is possible as a **binding** has been created between the call to the
+stringr function `str_detect()` and the Arrow C++ code, here as a direct mapping
+to `match_substring_regex`. You can see this for yourself by inspecting the
+arrow data query object without retrieving the results via `collect()`.
+
+
+```{r}
+arrow_table(starwars) %>%
+ filter(str_detect(name, "Darth"))
Review comment:
```suggestion
filter(str_detect(name, "Darth"))
```
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.