[GitHub] [arrow] djnavarro commented on a diff in pull request #14514: ARROW-17887: [R][Doc][WIP] Improve readability of the Get Started and README pages

GitBox Wed, 26 Oct 2022 13:03:08 -0700


djnavarro commented on code in PR #14514:
URL: https://github.com/apache/arrow/pull/14514#discussion_r1006134236



##########
r/vignettes/data_object_layout.Rmd:
##########
@@ -0,0 +1,183 @@
+---
+title: "Internal structure of Arrow objects"
+description: > 
+  Learn about the internal structure of Arrow data objects. 
+output: rmarkdown::html_vignette
+vignette: >
+  %\VignetteIndexEntry{Internal structure of Arrow objects}
+  %\VignetteEngine{knitr::rmarkdown}
+  %\VignetteEncoding{UTF-8}
+---
+
+This vignette describes the internal structure of Arrow data objects. Users of 
the `arrow` R package will not generally need to understand the internal 
structure of Arrow data objects. We include it here to help orient those R 
users and Arrow developers who wish to understand the [Arrow 
specification](https://arrow.apache.org/docs/format/Columnar.html). This 
vignette provides a deeper dive into some of the topics described in 
`vignette("data_objects", package = "arrow")`, and is intended mostly for 
developers. It is not necessary knowledge for using the `arrow` package. 
+
+
+```{r include=FALSE}
+library(arrow, warn.conflicts = FALSE)
+```
+
+
+We begin by describing two key concepts:
+
+- Values in an array are stored in one or more **buffers**. A buffer is a 
sequential virtual address space (i.e., block of memory) with a given length. 
Given a pointer specifying the memory address where the buffer starts, you can 
reach any byte in the buffer with an "offset" value that specifies a location 
relative to the start of the buffer. 
+- The **physical layout** of an array is a term used to describe how data in 
an array is laid out in memory, without taking into account how that 
information is interpreted. As an example: a 32-bit signed integer and 32-bit 
floating point number have the same layout: they are both 32 bits, represented 
as 4 contiguous bytes in memory. The meaning is different, but the layout is 
the same.
+
+We can unpack these ideas using a simple array of integer values:
+
+```{r}
+integer_array <- Array$create(c(1L, NA, 2L, 4L, 8L))
+integer_array
+```
+
+We can inspect the `integer_array$type` attribute to see that the values in 
the Array are stored as signed 32-bit integers. When laid out in memory by the 
Arrow C++ library, an integer array consists of two pieces of metadata and two 
buffers that store the data. The metadata specify the length of the array and a 
count of the number of null values, both stored as 64-bit integers. These 
metadata can be viewed from R using `integer_array$length()` and 
`integer_array$null_count` respectively. The number of buffers associated with 
an array depends on the exact type of data being stored. For an integer array 
there are two: a "validity bitmap buffer" and a "data value buffer". 
Schematically we could depict the array as follows:
+
+```{r, echo=FALSE, out.width="100%"}
+knitr::include_graphics("./array_layout_integer.png")
+```
+
+This image shows the array as a rectangle subdivided into two parts, one for 
the metadata and the other for the buffers. Underneath the rectangle we've 
unpacked the contents of the buffers for you, showing the contents of the two 
buffers in the area enclosed in a dotted line. At the very bottom of the 
figure, you can see the contents of specific bytes.
+
+## Validity bitmap buffer
+
+The validity bitmap is binary-valued, and contains a 1 whenever the 
corresponding slot in the array contains a valid, non-null value. At an 
abstract level we can assume this contains the following five bits: 
+
+```
+10111
+```
+
+However, this is a slight over-simplification for three reasons. First, because 
memory is allocated in byte-size units there are three trailing bits at the end 
(assumed to be zero), giving us the bitmap `10111000`. Second, while we have 
written this from left to right, that written format is typically presumed to 
represent [big endian format](https://en.wikipedia.org/wiki/Endianness), with 
the most significant bit first, whereas Arrow bitmaps use least-significant bit 
(LSB) numbering, a little-endian convention at the bit level. To reflect this 
we write the bits in reversed order: 
`00011101`. Finally, Arrow encourages [naturally aligned data 
structures](https://en.wikipedia.org/wiki/Data_structure_alignment) in which 
allocated memory addresses are a multiple of the data block sizes. Arrow uses 
*64 byte alignment*, so each data structure must be a multiple of 64 bytes in 
size. This design feature exists to allow efficient use of modern hardware, as 
discussed in the [Arrow 
specification](https://arrow.apache.org/docs/format/Columnar.html#buffer-alignment-and-padding).
 This is what the buffer looks like in memory:

Review Comment:
   Yeah, let me think about it. On the one hand I feel like it would be weird 
to do a full digression into "memory addresses can be rewritten from hex 
notation to a decimal number..." and then talk about the reasons why we want 
all data blocks to start (and stop) at an address that is a multiple of 64 
bytes. That seems like a long and unhelpful tangent (especially since I'm right 
at the edge of my own knowledge trying to understand why it actually matters!) 
but at the same time the Arrow spec page does go into this detail and treats it 
as if it's assumed knowledge. So I feel almost obligated to try to unpack it in 
the R docs just so that readers of this vignette will be able to read the Arrow 
spec and not get completely confused. 
   
   ugh. it's a mess. this vignette is the one I'm least certain about -- I feel 
like we do need it to bridge the yawning chasm between the R docs and the Arrow 
spec, but I'm not confident I'm doing it well 
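
   One way to show the byte-level picture without the full digression is a 
small base R sketch. It packs the validity flags for the five-element example 
into a byte, least-significant bit first, and then pads the result with zero 
bytes. The helper names `pack_validity()` and `pad_to()` are hypothetical and 
not part of `arrow`, and the 64-byte figure is simply the one quoted in the 
draft above; this is an illustration, not how the C++ library allocates 
buffers.

```r
# Hypothetical helpers for illustration only; neither is part of the arrow package.

# Pack a logical validity vector into bytes, least-significant bit first.
pack_validity <- function(valid) {
  n_bytes <- ceiling(length(valid) / 8)
  bits <- logical(n_bytes * 8)      # unused trailing bits stay FALSE (zero)
  bits[seq_along(valid)] <- valid
  packBits(bits, type = "raw")      # base R packs least-significant bit first
}

# Pad a raw vector with zero bytes up to the next multiple of `alignment` bytes.
pad_to <- function(bytes, alignment = 64L) {
  padded_length <- ceiling(length(bytes) / alignment) * alignment
  c(bytes, raw(padded_length - length(bytes)))
}

values <- c(1L, NA, 2L, 4L, 8L)
bitmap <- pack_validity(!is.na(values))

# Print the first byte with the most significant bit on the left.
bitmap_string <- paste(rev(as.integer(rawToBits(bitmap[1]))), collapse = "")
bitmap_string          # "00011101"

padded_bitmap <- pad_to(bitmap)
length(padded_bitmap)  # 64
```

   On the example array this reproduces the `00011101` byte from the draft, 
followed by 63 zero bytes of padding.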



