thisisnic commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1011863788
##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: >
+ Learn about Scalar, Array, Table, and Dataset objects in `arrow`
+ (among others), how they relate to each other, as well as their
+ relationships to familiar R objects like data frames and vectors
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and
documents how these objects are structured.
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional
rectangular data structures used to store tabular data. For columnar,
one-dimensional data, the `Array` and `ChunkedArray` classes are provided.
Finally, `Scalar` objects represent individual values. The table below
summarizes these objects and shows how you can create new instances using the
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance         | Convenience function            |
+| --- | -------------- | --------------------------------- | ------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`      |                                 |
+| 1   | `Array`        | `Array$create(vector, type)`      |                                 |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`  | `chunked_array(..., type)`      |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`         | `record_batch(...)`             |
+| 2   | `Table`        | `Table$create(...)`               | `arrow_table(...)`              |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)` | `open_dataset(sources, schema)` |
+
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of
the same name in the underlying Arrow C++ library. It is also worth mentioning
that the `arrow` package defines classes that do not exist in the C++
library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
Review Comment:
Is there benefit to mentioning these classes to Arrow users who aren't
package developers?
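
If we do keep them, one user-facing angle might be type checking in helper functions. A minimal sketch, assuming only the inheritance relationships listed in the vignette (the object names `arr` and `tbl` are just for illustration):

```r
library(arrow, warn.conflicts = FALSE)

arr <- Array$create(1:3)
tbl <- arrow_table(x = 1:3)

# Check for the shared parent classes rather than each concrete class
inherits(arr, "ArrowDatum")    # Scalar, Array, and ChunkedArray share this parent
inherits(tbl, "ArrowTabular")  # RecordBatch and Table share this parent
inherits(arr, "ArrowObject")   # common to all Arrow objects
```

If that is not a use case we want to encourage, the list probably belongs in the developer docs instead.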
##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: >
+ Learn about Scalar, Array, Table, and Dataset objects in `arrow`
+ (among others), how they relate to each other, as well as their
+ relationships to familiar R objects like data frames and vectors
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and
documents how these objects are structured.
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional
rectangular data structures used to store tabular data. For columnar,
one-dimensional data, the `Array` and `ChunkedArray` classes are provided.
Finally, `Scalar` objects represent individual values. The table below
summarizes these objects and shows how you can create new instances using the
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance         | Convenience function            |
+| --- | -------------- | --------------------------------- | ------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`      |                                 |
+| 1   | `Array`        | `Array$create(vector, type)`      |                                 |
Review Comment:
We can use the convenience function `as_arrow_array()` to create Arrays from
R vectors.
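
For example, something along these lines could go in the convenience function column or in the Arrays section. This is only a sketch; the variable names are illustrative, and the `type` argument is shown on the assumption that we want to demonstrate casting at creation time:

```r
library(arrow, warn.conflicts = FALSE)

# Convert an R integer vector (with a missing value) to an Arrow Array
int_arr <- as_arrow_array(c(1L, NA, 2L))
int_arr

# Optionally request a specific Arrow type when converting
dbl_arr <- as_arrow_array(c(1L, NA, 2L), type = float64())
dbl_arr$type
```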
##########
r/vignettes/data_objects.Rmd:
##########
@@ -0,0 +1,206 @@
+---
+title: "Data objects"
+description: >
+ Learn about Scalar, Array, Table, and Dataset objects in `arrow`
+ (among others), how they relate to each other, as well as their
+ relationships to familiar R objects like data frames and vectors
+output: rmarkdown::html_vignette
+---
+
+This article describes the various data object types supplied by `arrow`, and
documents how these objects are structured.
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+The `arrow` package supplies several object classes that are used to represent
data. `RecordBatch`, `Table`, and `Dataset` objects are two-dimensional
rectangular data structures used to store tabular data. For columnar,
one-dimensional data, the `Array` and `ChunkedArray` classes are provided.
Finally, `Scalar` objects represent individual values. The table below
summarizes these objects and shows how you can create new instances using the
[`R6`](https://r6.r-lib.org/) class object, as well as convenience functions
that provide the same functionality in a more traditional R-like fashion:
+
+| Dim | Class          | How to create an instance         | Convenience function            |
+| --- | -------------- | --------------------------------- | ------------------------------- |
+| 0   | `Scalar`       | `Scalar$create(value, type)`      |                                 |
+| 1   | `Array`        | `Array$create(vector, type)`      |                                 |
+| 1   | `ChunkedArray` | `ChunkedArray$create(..., type)`  | `chunked_array(..., type)`      |
+| 2   | `RecordBatch`  | `RecordBatch$create(...)`         | `record_batch(...)`             |
+| 2   | `Table`        | `Table$create(...)`               | `arrow_table(...)`              |
+| 2   | `Dataset`      | `Dataset$create(sources, schema)` | `open_dataset(sources, schema)` |
+
+Later in the article we'll look at each of these in more detail.
+
+For now we note that each of these object classes corresponds to a class of
the same name in the underlying Arrow C++ library. It is also worth mentioning
that the `arrow` package defines classes that do not exist in the C++
library including:
+
+* `ArrowDatum`: inherited by `Scalar`, `Array`, and `ChunkedArray`
+* `ArrowTabular`: inherited by `RecordBatch` and `Table`
+* `ArrowObject`: inherited by all Arrow objects
+
+In addition to these data objects, `arrow` defines the following classes for
representing metadata:
+
+- A `Schema` is a list of `Field` objects used to describe the structure of a
tabular data object; where
+- A `Field` specifies a character string name and a `DataType`; and
+- A `DataType` is an attribute controlling how values are represented
+
+To learn more about the metadata classes, see the [metadata
article](./metadata.html).
+
+## Scalars
+
+A Scalar object is simply a single value that can be of any type. It might be
an integer, a string, a timestamp, or any of the different `DataType` objects
that Arrow supports. Most users of the `arrow` R package are unlikely to create
Scalars directly, but should there be a need you can do this by calling the
`Scalar$create()` method:
+
+```{r}
+Scalar$create("hello")
+```
+
+
+## Arrays
+
+Array objects are ordered sets of Scalar values. As with Scalars most users
will not need to create Arrays directly, but if the need arises there is an
`Array$create()` method that allows you to create new Arrays:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+```{r}
+string_array <- Array$create(c("hello", "amazing", "and", "cruel", "world"))
+string_array
+```
+
+An Array can be subset using square brackets as shown below:
+
+```{r}
+string_array[4:5]
+```
+
+Arrays are immutable objects: once an Array has been created it cannot be
modified or extended.
+
+## Chunked Arrays
+
+In practice, most users of the `arrow` R package are likely to use Chunked
Arrays rather than simple Arrays. Under the hood, a Chunked Array is a
collection of one or more Arrays that can be indexed _as if_ they were a single
Array. The reasons that Arrow provides this functionality are described in the
[data object layout article](./developers/data_object_layout.html) but for the
present purposes it is sufficient to notice that Chunked Arrays behave like
Arrays in regular data analysis.
+
+To illustrate, let's use the `chunked_array()` function:
+
+```{r}
+chunked_string_array <- chunked_array(
+ string_array,
+ c("I", "love", "you")
+)
+```
+
+The `chunked_array()` function is just a wrapper around the functionality that
`ChunkedArray$create()` provides. Let's print the object:
+
+```{r}
+chunked_string_array
+```
+
+The double bracketing in this output is intended to highlight the fact that
Chunked Arrays are wrappers around one or more Arrays. However, although
comprised of multiple distinct Arrays, a Chunked Array can be indexed as if
they were laid end-to-end in a single "vector-like" object. This is illustrated
below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_indexing.png")
+```
+
+We can use `chunked_string_array` to illustrate this:
+
+```{r}
+chunked_string_array[4:7]
+```
+
+An important thing to note is that "chunking" is not semantically meaningful.
It is an implementation detail only: users should never treat the chunk as a
meaningful unit. Writing the data to disk, for example, often results in the
data being organized into different chunks. Similarly, two Chunked Arrays that
contain the same values assigned to different chunks are deemed equivalent. To
illustrate this we can create a Chunked Array that contains the same
four values as `chunked_string_array[4:7]`, but organized into one chunk rather
than split into two:
+
+```{r}
+cruel_world <- chunked_array(c("cruel", "world", "I", "love"))
+cruel_world
+```
+
+Testing for equality using `==` produces an element-wise comparison, and the
result is a new Chunked Array of four (boolean type) `true` values:
+
+```{r}
+cruel_world == chunked_string_array[4:7]
+```
+
+In short, the intention is that users interact with Chunked Arrays as if they
are ordinary one-dimensional data structures without ever having to think much
about the underlying chunking arrangement.
+
+Chunked Arrays are mutable, in a specific sense: Arrays can be added and
removed from a Chunked Array.
+
+## Record Batches
+
+A Record Batch is a tabular data structure comprised of named Arrays. Record
Batches are a fundamental unit for data interchange in Arrow, but are not
typically used for data analysis. Tables and Datasets are usually more
convenient in analytic contexts.
+
+These Arrays can be of different types but must all be the same length. Each
Array is referred to as one of the "fields" or "columns" of the Record Batch.
You can create a Record Batch using the `record_batch()` function or by using
the `RecordBatch$create()` method. These functions are flexible and can accept
inputs in several formats: you can pass a data frame, one or more named
vectors, an input stream, or even a raw vector containing appropriate binary
data. For example:
+
+```{r}
+rb <- record_batch(
+ strs = string_array,
+ ints = integer_array,
+ dbls = c(1.1, 3.2, 0.2, NA, 11)
+)
+rb
+```
+
+This is a Record Batch containing 5 rows and 3 columns, and its conceptual
structure is shown below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./record_batch.png")
+```
+
+The `arrow` package supplies a `$` method for Record Batch objects, used to
extract a single column by name:
+
+```{r}
+rb$strs
+```
+
+You can use double brackets `[[` to refer to columns by position. The
`rb$ints` array is the second column in our Record Batch so we can extract it
with this:
+
+```{r}
+rb[[2]]
+```
+
+There is also a `[` method that allows you to extract subsets of a Record Batch
in the same way you would for a data frame. The command `rb[1:3, 1:2]` extracts
the first three rows and the first two columns:
+
+```{r}
+rb[1:3, 1:2]
+```
+
+Record Batches cannot be concatenated: because they are comprised of Arrays,
and Arrays are immutable objects, new rows cannot be added to a Record Batch once
created.
+
+## Tables
+
+A Table is comprised of named Chunked Arrays, in the same way that a Record
Batch is comprised of named Arrays. You can subset Tables with `$`, `[[`, and
`[` the same way you can for Record Batches. Unlike Record Batches, Tables can
be concatenated (because they are comprised of Chunked Arrays). Suppose a
second Record Batch arrives:
+
+```{r}
+new_rb <- record_batch(
+ strs = c("I", "love", "you"),
+ ints = c(5L, 0L, 0L),
+ dbls = c(7.1, -0.1, 2)
+)
+```
+
+It is not possible to create a Record Batch that appends the data from
`new_rb` to the data in `rb`, not without creating entirely new objects in
memory. With Tables, however, we can:
+
+```{r}
+df <- arrow_table(rb)
+new_df <- arrow_table(new_rb)
+```
+
+We now have the two fragments of the data set represented as Tables. The
difference between the Table and the Record Batch is that the columns are all
represented as Chunked Arrays. Each Array from the original Record Batch is one
chunk in the corresponding Chunked Array in the Table:
+
+```{r}
+rb$strs
+df$strs
+```
+
+It's the same underlying data -- and indeed the same immutable Array is
referenced by both -- just enclosed by a new, flexible Chunked Array wrapper.
However, it is this wrapper that allows us to concatenate Tables:
+
+```{r}
+concat_tables(df, new_df)
+```
+
+The resulting object is shown schematically below:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./table.png")
+```
+
Review Comment:
Do we perhaps also want a section on Datasets here as well?
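
As a rough idea of what such a section could show (a sketch only; the temporary directory, the use of `mtcars`, and partitioning by `cyl` are placeholders, and it assumes `arrow` was built with dataset support):

```r
library(arrow, warn.conflicts = FALSE)

# Write a small data frame to disk as a multi-file, partitioned dataset
dataset_dir <- file.path(tempdir(), "mtcars_dataset")
write_dataset(mtcars, dataset_dir, partitioning = "cyl")

# open_dataset() discovers the files and schema without loading the data
ds <- open_dataset(dataset_dir)
ds

# Basic metadata is available directly from the Dataset object
names(ds)
ds$num_rows
```

That would round out the table at the top, where `Dataset` is listed but never revisited.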