[GitHub] [arrow] thisisnic commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Wed, 02 Nov 2022 08:45:15 -0700


thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1011863788



##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and 
documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent 
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional 
rectangular data structures used to store tabular data. For columnar, 
one-dimensional data, the `Array` and `ChunkedArray` classes are provided. 
Finally, `Scalar` objects represent individual values. The table below 
summarizes these objects and shows how you can create new instances using the 
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions 
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | 
Convenience function                          |
+| --- | -------------- | ----------------------------------------------| 
--------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |       
                                        |
+| 1   | `Array`        | `Array$create(vector, type)`                  |       
                                        |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | 
`chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | 
`record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | 
`arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | 
`open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of 
the same name in the underlying Arrow C++ library. It is also worth mentioning 
that the `arrow` package defines classes that do not exist in the C++ 
library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects

Review Comment:
   Is there benefit to mentioning these classes to Arrow users who aren't 
package developers? 
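
   For context, these parent classes do surface to ordinary users through 
`class()` and `inherits()`. A minimal sketch of where they show up (the 
object names `arr` and `tbl` are only illustrative):

```r
library(arrow, warn.conflicts = FALSE)

# Illustrative objects; the names are not taken from the vignette
arr <- Array$create(c(1L, 2L, 3L))
tbl <- arrow_table(x = c(1L, 2L, 3L))

# The R-only parent classes appear in the class vector of everyday objects
class(arr)
class(tbl)

# They can be tested for with inherits()
inherits(arr, "ArrowDatum")
inherits(tbl, "ArrowTabular")
```

   Whether that is enough reason to list them in a user-facing article is a 
separate question.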



##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and 
documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent 
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional 
rectangular data structures used to store tabular data. For columnar, 
one-dimensional data, the `Array` and `ChunkedArray` classes are provided. 
Finally, `Scalar` objects represent individual values. The table below 
summarizes these objects and shows how you can create new instances using the 
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions 
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | 
Convenience function                          |
+| --- | -------------- | ----------------------------------------------| 
--------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |       
                                        |
+| 1   | `Array`        | `Array$create(vector, type)`                  |       
                                        |

Review Comment:
   We can use the convenience function `as_arrow_array()` to create Arrays from 
R vectors.
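
   A minimal sketch of what that could look like, reusing the integer vector 
from the Arrays section further down (the names `int_array` and `same_array` 
are only illustrative):

```r
library(arrow, warn.conflicts = FALSE)

# Convenience function: convert an R vector to an Arrow Array
int_array <- as_arrow_array(c(1L, NA, 2L, 4L, 8L))

# The R6 method already shown in the table, for comparison
same_array <- Array$create(c(1L, NA, 2L, 4L, 8L))

int_array
int_array$type
```

   So the "Convenience function" cell for `Array` could probably read 
`as_arrow_array(vector, type)`.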



##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: > 
+  Learn about Scalar, Array, Table, and Dataset objects in `arrow` 
+  (among others), how they relate to each other, as well as their 
+  relationships to familiar R objects like data frames and vectors 
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and 
documents how these objects are structured. 
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent 
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional 
rectangular data structures used to store tabular data. For columnar, 
one-dimensional data, the `Array` and `ChunkedArray` classes are provided. 
Finally, `Scalar` objects represent individual values. The table below 
summarizes these objects and shows how you can create new instances using the 
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions 
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance                     | 
Convenience function                          |
+| --- | -------------- | ----------------------------------------------| 
--------------------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`                  |       
                                        |
+| 1   | `Array`        | `Array$create(vector, type)`                  |       
                                        |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`              | 
`chunked_array(..., type)`                    |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`                     | 
`record_batch(...)`                           |
+| 2   | `Table`        | `Table$create(...)`                           | 
`arrow_table(...)`                            |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)`             | 
`open_dataset(sources, schema)`               |
+  
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of 
the same name in the underlying Arrow C++ library. It is also worth mentioning 
that the `arrow` package defines classes that do not exist in the C++ 
library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
+
+In addition to these data objects, `arrow` defines the following classes for 
representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a 
tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+To learn more about the metadata classes, see the [metadata 
article](./metadata.html).
+
+## Scalars
+
+A Scalar object is simply a single value that can be of any type. It might be 
an integer, a string, a timestamp, or any of the different `DataType` objects 
that Arrow supports. Most users of the `arrow` R package are unlikely to create 
Scalars directly, but should there be a need you can do this by calling the 
`Scalar$create()` method:
+
+```{r}
+Scalar$create("hello")
+```
+
+
+## Arrays
+
+Array objects are ordered sets of Scalar values. As with Scalars most users 
will not need to create Arrays directly, but if the need arises there is an 
`Array$create()` method that allows you to create new Arrays:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
+string_array
+```
+
+An Array can be subset using square brackets as shown below:
+
+```{r}
+string_array[4:5]
+```
+
+Arrays are immutable objects: once an Array has been created it cannot be 
modified or extended. 
+
+## Chunked Arrays
+
+In practice, most users of the `arrow` R package are likely to use Chunked 
Arrays rather than simple Arrays. Under the hood, a Chunked Array is a 
collection of one or more Arrays that can be indexed _as if_ they were a single 
Array. The reasons that Arrow provides this functionality are described in the 
[data object layout article](./developers/data_object_layout.html) but for the 
present purposes it is sufficient to notice that Chunked Arrays behave like 
Arrays in regular data analysis.
+
+To illustrate, let's use the `chunked_array()` function:
+
+```{r}
+chunked_string_array <- chunked_array(
+  string_array,
+  c("I", "love", "you")
+)
+```
+
+The `chunked_array()` function is just a wrapper around the functionality that 
`ChunkedArray$create()` provides. Let's print the object:
+
+```{r}
+chunked_string_array
+```
+
+The double bracketing in this output is intended to highlight the fact that 
Chunked Arrays are wrappers around one or more Arrays. However, although 
comprised of multiple distinct Arrays, a Chunked Array can be indexed as if 
they were laid end-to-end in a single "vector-like" object. This is illustrated 
below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_indexing.png")
+```
+
+We can use `chunked_string_array` to illustrate this: 
+
+```{r}
+chunked_string_array[4:7]
+```
+
+An important thing to note is that "chunking" is not semantically meaningful. 
It is an implementation detail only: users should never treat the chunk as a 
meaningful unit. Writing the data to disk, for example, often results in the 
data being organized into different chunks. Similarly, two Chunked Arrays that 
contain the same values assigned to different chunks are deemed equivalent. To 
illustrate this we can create a Chunked Array that contains the same four 
values as `chunked_string_array[4:7]`, but organized into one chunk rather 
than split into two:
+
+```{r}
+cruel_world <- chunked_array(c("cruel", "world", "I", "love"))
+cruel_world
+```
+
+Testing for equality using `==` produces an element-wise comparison, and the 
result is a new Chunked Array of four (boolean type) `true` values:
+
+```{r}
+cruel_world == chunked_string_array[4:7]
+```
+
+In short, the intention is that users interact with Chunked Arrays as if they 
are ordinary one-dimensional data structures without ever having to think much 
about the underlying chunking arrangement. 
+
+Chunked Arrays are mutable, in a specific sense: Arrays can be added and 
removed from a Chunked Array.
+
+## Record Batches
+
+A Record Batch is a tabular data structure comprised of named Arrays. Record 
Batches are a fundamental unit for data interchange in Arrow, but are not 
typically used for data analysis. Tables and Datasets are usually more 
convenient in analytic contexts.
+
+These Arrays can be of different types but must all be the same length. Each 
Array is referred to as one of the "fields" or "columns" of the Record Batch. 
You can create a Record Batch using the `record_batch()` function or by using 
the `RecordBatch$create()` method. These functions are flexible and can accept 
inputs in several formats: you can pass a data frame, one or more named 
vectors, an input stream, or even a raw vector containing appropriate binary 
data. For example:
+
+```{r}
+rb <- record_batch(
+  strs = string_array, 
+  ints = integer_array,
+  dbls = c(1.1, 3.2, 0.2, NA, 11)
+)
+rb
+```
+
+This is a Record Batch containing 5 rows and 3 columns, and its conceptual 
structure is shown below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./record_batch.png")
+```
+
+The `arrow` package supplies a `$` method for Record Batch objects, used to 
extract a single column by name:
+
+```{r}
+rb$strs
+```
+
+You can use double brackets `[[` to refer to columns by position. The 
`rb$ints` array is the second column in our Record Batch so we can extract it 
with this:
+
+```{r}
+rb[[2]]
+```
+
+There is also a `[` method that allows you to extract subsets of a Record Batch 
in the same way you would for a data frame. The command `rb[1:3, 1:2]` extracts 
the first three rows and the first two columns:
+
+```{r}
+rb[1:3, 1:2]
+```
+
+Record Batches cannot be concatenated: because they are comprised of Arrays, 
and Arrays are immutable objects, new rows cannot be added to a Record Batch once 
created.
+
+## Tables
+
+A Table is comprised of named Chunked Arrays, in the same way that a Record 
Batch is comprised of named Arrays. You can subset Tables with `$`, `[[`, and 
`[` the same way you can for Record Batches. Unlike Record Batches, Tables can 
be concatenated (because they are comprised of Chunked Arrays). Suppose a 
second Record Batch arrives:
+
+```{r}
+new_rb <- record_batch(
+  strs = c("I", "love", "you"), 
+  ints = c(5L, 0L, 0L),
+  dbls = c(7.1, -0.1, 2)
+)
+```
+
+It is not possible to create a Record Batch that appends the data from 
`new_rb` to the data in `rb`, not without creating entirely new objects in 
memory. With Tables, however, we can:
+
+```{r}
+df <- arrow_table(rb)
+new_df <- arrow_table(new_rb)
+```
+
+We now have the two fragments of the data set represented as Tables. The 
difference between the Table and the Record Batch is that the columns are all 
represented as Chunked Arrays. Each Array from the original Record Batch is one 
chunk in the corresponding Chunked Array in the Table:
+
+```{r}
+rb$strs
+df$strs
+```
+
+It's the same underlying data -- and indeed the same immutable Array is 
referenced by both -- just enclosed by a new, flexible Chunked Array wrapper. 
However, it is this wrapper that allows us to concatenate Tables:
+
+```{r}
+concat_tables(df, new_df)
+```
+
+The resulting object is shown schematically below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./table.png")
+```
+

Review Comment:
   Do we perhaps also want a section on Datasets here as well?
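
   As a rough idea of what such a section might demonstrate, here is a minimal 
sketch. It assumes an `arrow` build with Parquet and dataset support; 
`example_data` and `dataset_dir` are made-up names used only for illustration:

```r
library(arrow, warn.conflicts = FALSE)

# Hypothetical example data, not taken from the vignette
example_data <- data.frame(
  grp = c("a", "a", "b", "b"),
  val = c(1.5, 2.5, 3.5, 4.5)
)

# Write the data to a temporary directory, one subdirectory per value of `grp`
dataset_dir <- tempfile("example_dataset")
write_dataset(example_data, dataset_dir, partitioning = "grp")

# Open the directory as a single Dataset backed by the files on disk
ds <- open_dataset(dataset_dir)
ds

# Basic metadata is available from the Dataset object
names(ds)
nrow(ds)
```

   The point would be to contrast this with Tables: a Dataset refers to files 
on disk rather than holding Chunked Arrays in memory.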



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
