djnavarro commented on code in PR #14514: URL: https://github.com/apache/arrow/pull/14514#discussion_r1006134236
##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: >
+  Learn about the internal structure of Arrow data objects.
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of the `arrow` R package will not generally need to understand the internal structure of Arrow data objects. We include it here to help orient those R users and Arrow developers who wish to understand the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html). This vignette provides a deeper dive into some of the topics described in `vignette("data_objects", package = "arrow")`, and is intended mostly for developers. It is not necessary knowledge for using the `arrow` package.
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a sequential virtual address space (i.e., block of memory) with a given length. Given a pointer specifying the memory address where the buffer starts, you can reach any byte in the buffer with an "offset" value that specifies a location relative to the start of the buffer.
+- The **physical layout** of an array is a term used to describe how data in an array is laid out in memory, without taking into account how that information is interpreted. As an example: a 32-bit signed integer and 32-bit floating point number have the same layout: they are both 32 bits, represented as 4 contiguous bytes in memory. The meaning is different, but the layout is the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in the Array are stored as signed 32 bit integers. When laid out in memory by the Arrow C++ library, an integer array consists of two pieces of metadata and two buffers that store the data. The metadata specify the length of the array and a count of the number of null values, both stored as 64-bit integers. These metadata can be viewed from R using `integer_array$length()` and `integer_array$null_count` respectively. The number of buffers associated with an array depends on the exact type of data being stored. For an integer array there are two: a "validity bitmap buffer" and a "data value buffer". Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for the metadata and the other for the buffers. Underneath the rectangle we've unpacked the contents of the buffers for you, showing the contents of the two buffers in the area enclosed in a dotted line. At the very bottom of the figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the corresponding slot in the array contains a valid, non-null value. At an abstract level we can assume this contains the following five bits:
+
+```
+10111
+```
+
+However this is a slight over-simplification for three reasons. First, because memory is allocated in byte-size units there are three trailing bits at the end (assumed to be zero), giving us the bitmap `10111000`. Second, while we have written this from left-to-right, this written format is typically presumed to represent [big endian format](https://en.wikipedia.org/wiki/Endianness) whereas Arrow is little-endian. To reflect this we write the bits in reversed order: `00011101`. Finally, Arrow encourages [naturally aligned data structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which allocated memory addresses are a multiple of the data block sizes. Arrow uses *64 byte alignment*, so each data structure must be a multiple of 64 bytes in size. This design feature exists to allow efficient use of modern hardware, as discussed in the [Arrow specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding). This is what the buffer looks like in memory:

Review Comment:
Yeah, let me think about it. On the one hand I feel like it would be weird to do a full digression into "memory addresses can be rewritten from hex notation to a decimal number..." and then talk about the reasons why we want all data blocks to start (and stop) at an address that is a multiple of 64 bytes. That seems like a long and unhelpful tangent (especially since I'm right at the edge of my own knowledge trying to understand why it actually matters!) but at the same time the Arrow spec page does go into this detail and treats it as if it's assumed knowledge. So I feel almost obligated to try to unpack it in the R docs just so that readers of this vignette will be able to read the Arrow spec and not get completely confused.

ugh. it's a mess. this vignette is the one I'm least certain about -- I feel like we do need it to bridge the yawning chasm between the R docs and the Arrow spec, but I'm not confident I'm doing it well.
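For anyone following the bitmap paragraph in the quoted hunk, here is a minimal base R sketch of the arithmetic it describes. It does not read Arrow's actual buffers. It only recomputes the validity bits for the example values, packs them least-significant-bit first, and pads the result to 64 bytes as the quoted text states. The names `valid`, `pack_byte`, `bitmap_bytes`, `bit_string`, and `buffer` are hypothetical and used only for illustration.

```r
# Example values from the quoted vignette hunk
x <- c(1L, NA, 2L, 4L, 8L)

# One validity bit per slot: 1 = valid, 0 = null
valid <- as.integer(!is.na(x))

# Pad the bit vector with zeros up to a whole number of bytes
n_bytes <- ceiling(length(valid) / 8)
bits <- c(valid, rep(0L, n_bytes * 8 - length(valid)))

# Pack each group of 8 bits least-significant-bit first:
# the first slot in a group becomes bit 0 of the byte (assumed bit numbering)
pack_byte <- function(b) sum(b * 2^(0:7))
bitmap_bytes <- vapply(
  split(bits, rep(seq_len(n_bytes), each = 8)),
  pack_byte,
  numeric(1),
  USE.NAMES = FALSE
)

# Write the first byte most-significant-bit first, as one would on paper
bit_string <- paste(
  rev(as.integer(intToBits(as.integer(bitmap_bytes[1]))[1:8])),
  collapse = ""
)

# Pad the byte vector with zero bytes up to a multiple of 64 bytes
padded_len <- ceiling(n_bytes / 64) * 64
buffer <- as.raw(c(bitmap_bytes, rep(0, padded_len - n_bytes)))
```

Printing `bit_string` shows the first byte in the reversed written form the paragraph refers to, and `length(buffer)` shows the padded size of this illustrative bitmap.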